# R Markdown

**DO NOT CHANGE THE RUNTIME BECAUSE YOU WON'T BE ABLE TO CHANGE IT BACK!**

In [None]:
#@title Run this segment first { display-mode: "form" }
install.packages("gcookbook")
install.packages("doParallel")
install.packages("plyr")
library(plyr)
library(datasets)
library(lubridate)
library(ggplot2)
library(gcookbook)

# Graphics

There are three plotting systems in R:
* base
  * Convenient, but hard to adjust after the plot is created
* lattice
  * Good for creating conditioning plots
* ggplot2
  * Powerful and flexible, with many tunable features, but may take some time to master


Each has its pros and cons, so the choice is up to the user.
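
As a minimal sketch of what a conditioning plot looks like, the example below uses `xyplot()` from lattice to draw fuel economy against weight in the built-in `mtcars` dataset, with one panel per cylinder count. The dataset and variables are chosen here purely for illustration.

```r
library(lattice)

# One scatter panel per number of cylinders (the conditioning variable).
p <- xyplot(mpg ~ wt | factor(cyl), data = mtcars,
            layout = c(3, 1),
            xlab = "Weight (1000 lbs)",
            ylab = "Miles per gallon",
            main = "Fuel economy by weight, conditioned on cylinders")
print(p)  # lattice plots are objects; print() draws them
```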

## base

A few functions are available in the base plot system:
* `plot()`: line and scatter plots
* `boxplot()`: box plots
* `hist()`: histograms

A quick example with the base plot system, drawing a line plot with the data points overlaid:

In [None]:
# Create the plot with title and axis labels.
plot(pressure,type="l",
     main="Vapor Pressure of Mercury",
     xlab="Temperature", 
     ylab="Vapor Pressure")
# Add points
points(pressure,col='red') 
# Add annotation
text(150,700,"Source: Weast, R. C., ed. (1973) Handbook\nof Chemistry and Physics. CRC Press.")
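
The other two base functions listed above work in much the same way. The sketch below applies `boxplot()` and `hist()` to the built-in `mtcars` and `faithful` datasets, which are chosen here only for illustration.

```r
# Box plot: distribution of miles per gallon for each cylinder count.
bp <- boxplot(mpg ~ cyl, data = mtcars,
              main = "Fuel Economy by Cylinder Count",
              xlab = "Cylinders", ylab = "Miles per gallon")

# Histogram: distribution of Old Faithful eruption durations.
h <- hist(faithful$eruptions, breaks = 20,
          main = "Old Faithful Eruptions",
          xlab = "Duration (minutes)")
```

Both functions also return the computed statistics invisibly, so assigning the result (as done here with `bp` and `h`) lets you inspect the numbers behind the plot.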

## ggplot2

The `qplot()` function is the ggplot2 counterpart of `plot()`. Note that `qplot()` is deprecated as of ggplot2 3.4.0, so prefer `ggplot()` in new code.

In [None]:
qplot(weightLb, heightIn, data=heightweight, geom="point")

The `ggplot()` function is the main function in the ggplot2 package.

Here is an example:

In [None]:
ggplot(heightweight, aes(x=weightLb, y=heightIn, color=sex, shape=sex)) + 
  geom_point(size=3.5) +
  ggtitle("School Children\nHeight ~ Weight") +
  labs(y="Height (inch)", x="Weight (lbs)") +
  stat_smooth(method="loess", se=TRUE, color="black", fullrange=TRUE) +
  annotate("text",x=145,y=75,label="Locally weighted polynomial fit with 95% CI",color="Green",size=6) +
  scale_color_brewer(palette = "Set1", labels=c("Female", "Male")) +
  guides(shape="none") +
  theme_bw() +
  theme(plot.title = element_text(size=20, hjust=0.5), 
        legend.position = c(0.9,0.2),
        axis.title.x = element_text(size=20), axis.title.y = element_text(size=20),
        legend.title = element_text(size=15),legend.text = element_text(size=15))

If you are interested in learning more, please visit the [Data Visualization in R](http://hpc.loni.org/training/weekly-materials/2018-Spring/Slides.html#(1)) tutorial from LONI HPC.

# Parallel Processing

Modern computers are equipped with more than one CPU core and are capable of processing workloads in parallel, but base R is single‐threaded, i.e. not parallel.

In other words, regardless of how many cores are available, R uses only one of them by default.

There are two options to run R in parallel: **implicit** and **explicit**.


## Implicit parallel processing

Some functions in R can call parallel numerical libraries.

For instance, on the LONI QB-2 and QB-3 clusters most linear algebra and related functions (e.g. linear regression, matrix decomposition, computing the inverse and determinant of a matrix) leverage the multi‐threaded Intel MKL library.

In this case, no extra coding is needed to take advantage of the multiple CPU cores: those functions automatically use multiple cores when called.
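
Whether this applies on a given machine depends on the BLAS/LAPACK libraries R is linked against. As a rough sketch, the code below reports those libraries using `extSoftVersion()` and `La_library()` (available in recent versions of R) and times a matrix product, which R delegates to BLAS. The matrix size is an arbitrary choice, and the BLAS entry may be empty on some builds.

```r
# Report which BLAS/LAPACK libraries this R session is linked against.
# The reference BLAS is single-threaded; MKL or OpenBLAS are typically multi-threaded.
blas <- extSoftVersion()["BLAS"]
lapack <- La_library()
cat("BLAS:  ", blas, "\nLAPACK:", lapack, "\n")

# Time a matrix cross-product t(a) %*% a, which is delegated to the BLAS library.
n <- 1000
set.seed(1)
a <- matrix(rnorm(n * n), n, n)
elapsed <- system.time(b <- crossprod(a))
print(elapsed)
```

If the "user" time reported by `system.time()` is noticeably larger than the "elapsed" time, several threads were most likely working at once.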

## Explicit parallel processing

If the implicit option is not available for what you'd like to do, you need to write some code yourself.

Here is an example that uses the `%dopar%` operator from the `foreach` package, with the `doParallel` package providing the parallel backend.

The workload is to generate 100 random samples, each with
1,000,000 observations from a standard normal distribution, then take a summary for each sample.

In [None]:
iters <- 100

Below is the sequential version with a for loop. The `system.time()` function is used to measure how long it takes to process the workload.

In [None]:
# This code segment shows us how long it takes to run on one core.
system.time(
for (i in 1:iters) {
  to.ls <- rnorm(1e6)
  to.ls <- summary(to.ls)
}
)

This is the parallel example with the `doParallel` package.

In [None]:
# This code segment shows us how long it takes to run on all available cores.
library(doParallel)

# Obtain the number of cores available.
ncpu <- detectCores()
ncpu

system.time({
  cl <- makeCluster(ncpu)
  registerDoParallel(cl)
  results <- foreach(icount(iters)) %dopar% {
    to.ls<-rnorm(1e6)
    to.ls<-summary(to.ls)
  }
  stopCluster(cl)
})
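
For comparison, the same kind of workload can be sketched with the `parallel` package that ships with R, using `parLapply()` instead of `%dopar%`. This is an alternative technique, not the method used above; the worker count, the reduced workload, and the seed are assumptions chosen to keep the example quick.

```r
library(parallel)

# Assumption: 2 workers and a reduced workload, to keep the example quick.
n_workers <- 2
n_samples <- 8

cl <- makeCluster(n_workers)
# Give each worker its own reproducible random number stream.
clusterSetRNGStream(cl, iseed = 123)

# Each task draws 100,000 standard normal values and summarizes them.
res <- parLapply(cl, seq_len(n_samples), function(i) {
  summary(rnorm(1e5))
})
stopCluster(cl)

# Combine the per-sample summaries into one matrix, one row per sample.
res_mat <- do.call(rbind, res)
print(dim(res_mat))
```

Starting and stopping a cluster has a fixed cost, so for small workloads like this one the parallel version may not be faster than a plain loop.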

If you are interested in learning more, please visit the [Parallel Computing in R](http://hpc.loni.org/training/weekly-materials/2017-Fall/HPC_Parallel_R_Fall2017.pdf) tutorial from LONI HPC.

# Exercise 5



## Introduction

The World Happiness Report is a landmark survey of the state of global happiness. The report continues to gain global recognition as governments, organizations and civil society increasingly use happiness indicators to inform their policy-making decisions. Leading experts across fields – economics, psychology, survey analysis, national statistics, health, public policy and more – describe how measurements of well-being can be used effectively to assess the progress of nations. The reports review the state of happiness in the world today and show how the new science of happiness explains personal and national variations in happiness.

Each year, six metrics are reported for each country:
* Economic production
* Social support
* Life expectancy
* Freedom
* Absence of corruption
* Generosity



[Data source](https://www.kaggle.com/ajaypalsinghlo/world-happiness-report-2021)

[World happiness report](https://worldhappiness.report)

## Datasets

Each dataset contains variables such as country name, year, and the scores for the six metrics.

2008-2020:
http://hpc.loni.org/training/weekly-materials/Downloads/world-happiness-report.csv

2021:
http://hpc.loni.org/training/weekly-materials/Downloads/world-happiness-report-2021.csv

## Tasks

1. Download both datasets and read them into R; 
2. Inspect the datasets (the data structure, what the columns are, etc.);
3. Merge the datasets so the data covers 2008 to 2021;
4. Using the merged dataset, answer the following questions:
  * In 2011, which five countries rank highest, and which five rank lowest, in freedom to make life choices?
  * Among the 50 countries with the highest life expectancy in 2021, how many are in Western Europe?
  * How has the average generosity over all countries changed from 2008 to 2021?
  * From 2011 to 2021, which country's rank in perceptions of corruption rises the most? Which drops the most?

Note:

You can use the `read.csv()` function to read data from a CSV file into R. For instance, to read the data in the file "mydata.csv" into a data frame named "mydataframe":

```
mydataframe <- read.csv("mydata.csv") 
```

To download a file, use the `download.file(<uri>,<file_name>)` function. For instance, to download the file "mydata.csv" to the current working directory:

```
download.file("http://url/to/mydata.csv","mydata.csv")
```
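
For Task 2, a few base R functions cover most of the inspection. The sketch below writes a tiny CSV with made-up values (the country "Atlantis" and its numbers are hypothetical, not taken from the real datasets), reads it back, and inspects it. It also shows that `read.csv()` converts spaces in column names to dots by default, which is why a header such as "Country name" becomes `Country.name`.

```r
# Write a tiny CSV whose header contains spaces (hypothetical values, for illustration only).
tmp <- tempfile(fileext = ".csv")
writeLines(c('"Country name","year","Life Ladder"',
             '"Atlantis",2020,5.1',
             '"Atlantis",2021,5.3'), tmp)

demo <- read.csv(tmp)

# read.csv() converts spaces in column names to dots by default.
names(demo)

# Common ways to inspect a data frame.
str(demo)      # structure: column types and first values
head(demo)     # first rows
summary(demo)  # per-column summary statistics
```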

In [None]:
#@title Solution

###### TASK 1 ######

# Download the files.
download.file("http://hpc.loni.org/training/weekly-materials/Downloads/world-happiness-report.csv","world-happiness-report.csv")
download.file("http://hpc.loni.org/training/weekly-materials/Downloads/world-happiness-report-2021.csv","world-happiness-2021.csv")

# Read the data into two dataframes.
rawdf2020 <- read.csv("world-happiness-report.csv")
rawdf2021 <- read.csv("world-happiness-2021.csv")

###### TASK 2 ######

# After this step, we need to inspect the data.
#str(rawdf2020)
#str(rawdf2021)

###### TASK 3 ######

# This is actually the toughest step, 
# as the columns in the two dataframes
# are not aligned. 

# In the merged data frame, we will need these columns:
# Country, region, year
# The six happiness metrics

df2020 <- rawdf2020[,1:9]
df2021 <- rawdf2021[,c(1:3,7:12)]

# Need to add the "year" column for 2021.
df2021$year <- 2021

# Match the regional indicator in the 2021
# dataset to the country names in the 2020 
# dataset.
df2020$Regional.indicator <- 
  df2021[match(df2020$Country.name,df2021[,1]),"Regional.indicator"]

# Reorder the columns
df2020r <- df2020[,c(1,10,2:9)]
df2021r <- df2021[,c(1:2,10,3:9)]
colnames(df2021r) <- colnames(df2020r)

# Rowbind the dataframes.
dataFinal <- rbind.data.frame(df2020r,df2021r)

# Drop the rows with missing values.
dataClean <- dataFinal[complete.cases(dataFinal),]
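
The `match()` lookup used in Task 3 may be easier to follow on a toy example. The sketch below uses two small hypothetical data frames (`lookup` and `obs` are illustrative names, not part of the exercise data) to show how `match()` finds row positions and what happens to keys that have no match.

```r
# Hypothetical lookup table: one row per key.
lookup <- data.frame(country = c("A", "B", "C"),
                     region  = c("North", "South", "East"))

# Table to enrich; "D" has no entry in the lookup table.
obs <- data.frame(country = c("B", "A", "D", "B"),
                  year    = c(2019, 2019, 2020, 2020))

# match() returns, for each obs$country, its row position in lookup$country (NA if absent).
idx <- match(obs$country, lookup$country)
idx

# Indexing with those positions copies the region over; unmatched rows get NA.
obs$region <- lookup$region[idx]
obs
```

In the solution above, a country that does not appear in the 2021 file ends up with a missing region in the same way, and such rows are then removed by the `complete.cases()` step.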

In [None]:
#@title Solution 4.1
# Question 1
happy2011 <- subset(dataClean, year == 2011)
cat("The top 5 countries are:\n")
happy2011[order(-happy2011$Freedom.to.make.life.choices),"Country.name"][1:5]
cat("The bottom 5 countries are:\n")
happy2011[order(happy2011$Freedom.to.make.life.choices),"Country.name"][1:5]

In [None]:
#@title Solution 4.2
# Question 2
happy2021 <- subset(dataClean, year == 2021)
table(happy2021[order(-happy2021$Healthy.life.expectancy.at.birth)[1:50],"Regional.indicator"])

In [None]:
#@title Solution 4.3
# Question 3
library(plyr)
plot(ddply(dataClean,"year",summarize,average=mean(Generosity)))
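
The same per-year average can also be computed without plyr. The sketch below uses base R's `aggregate()` on a small hypothetical data frame (`toy` and its values are made up for illustration); this is an alternative to the `ddply()` call above, not a replacement for it.

```r
# Hypothetical toy data standing in for the merged dataset.
toy <- data.frame(year       = c(2019, 2019, 2020, 2020, 2021),
                  Generosity = c(0.10, 0.30, -0.05, 0.15, 0.02))

# Mean generosity per year, using the formula interface of aggregate().
avg <- aggregate(Generosity ~ year, data = toy, FUN = mean)
avg

# Draw the yearly averages as a connected line with points.
plot(avg$year, avg$Generosity, type = "b",
     xlab = "Year", ylab = "Average generosity")
```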

In [None]:
#@title Solution 4.4

# Store the ranking in a new variable in the 2011 and 2021 data frames.
happy2011$rank11 <- rank(happy2011$Perceptions.of.corruption, ties.method = "min")
happy2021$rank21 <- rank(happy2021$Perceptions.of.corruption, ties.method = "min")

# Merge the 2011 and 2021 data frames and keep only the country names and rankings.
happyRank <- merge(subset(happy2011,select=c(Country.name,rank11)),
  subset(happy2021,select=c(Country.name,rank21)),
  by = "Country.name")

# Calculate the ranking change and find the top 10 and bottom 10.
happyRank$diff <- happyRank$rank21 - happyRank$rank11
happyRank[order(happyRank$diff),][1:10,]
happyRank[order(-happyRank$diff),][1:10,]
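
To interpret the result, it helps to see how `rank()` orders values and handles ties. The sketch below uses a small hypothetical score vector to show both the default direction and the effect of `ties.method = "min"`.

```r
# Hypothetical scores; two values are tied.
score <- c(a = 0.9, b = 0.5, c = 0.5, d = 0.1)

# Rank 1 goes to the smallest value; tied values share the smallest rank of their group.
r <- rank(score, ties.method = "min")
r

# To make rank 1 the largest value instead, rank the negated scores.
r_desc <- rank(-score, ties.method = "min")
r_desc
```

In the solution above, rank 1 therefore corresponds to the lowest `Perceptions.of.corruption` score, so a negative `diff` means the country moved toward the low-corruption end of the ranking between 2011 and 2021.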