<a href="https://colab.research.google.com/github/lsuhpchelp/lbrnloniworkshop2021/blob/main/day1/Introduction_to_R.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction to R

**DO NOT CHANGE THE RUNTIME BECAUSE YOU WON'T BE ABLE TO CHANGE IT BACK!**

In [None]:
#@title Run this segment first { display-mode: "form" }
install.packages("gcookbook")
install.packages("doParallel")
install.packages("plyr")
library(gcookbook)
library(datasets)
library(ggplot2)
library(lubridate)

# Outline

* Introduction
* How to run R code
* Basic syntax
* Data classes and objects
* Data management
* Flow control structures
* Useful functions
* Managing R packages
* File operations
* Graphics
* Parallel processing





# Introduction

## What is R

* R is a programming language for statistical computing
  * Importing, storing, exporting and manipulating data
  * Conducting statistical analyses
  * Displaying the results by tables, graphs, etc.
* R is also a software environment for the development and implementation of new algorithms.
  * Many graphical user interfaces to R are available, both free and commercial (e.g. Rstudio and Microsoft R Open).
* R is being used by many disciplines
  * [A collection of repositories of R code categorized by discipline](https://github.com/lsuhpchelp/r_collection)


## History of R

* R is a dialect of the S language
  * S was created in 1976 at Bell Labs as an internal statistical analysis environment.
  * The goal of S was "to turn ideas into software, quickly and faithfully".
  * The best-known implementation is S-PLUS (most recent stable release was in 2010), which integrates S with a GUI and full customer support.
* R was created by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand.
* The R core group, which controls the source code of R (written in C), was formed in 1997.
* The first stable version R 1.0.0 was released in 2000
* The latest stable version (as of this workshop) is 4.1.0, released on May 18, 2021

In [None]:
version

## Features of R

* Designed for statistical analysis, with rich data analysis functionalities and sophisticated graphical capabilities
* Available on most platforms and operating systems
* Active development and (very) active community
  * [CRAN](https://cran.r-project.org/): The Comprehensive R Archive Network
  * Source code and binaries, user contributed packages and documentation
  * More than 17,000 packages available on CRAN (as of May 2021)
    * Compared to 6,000 four years ago
* Free to use

# How to Run R Code

There are (at least) three options for running R code:
* Google Colaboratory
* Rstudio (or Microsoft R Open or any other IDE)
* HPC clusters

## Google Colaboratory

This is what we are using right now. It is the most convenient option: it is browser-based and requires no setup. The only thing you need is a Google account. On the other hand, we don't have much control over the environment. For instance, we can't choose which version of R to use, and we have very limited control over where the data is stored.

## Rstudio (or Microsoft R Open or any other IDE that supports R)

![Rstudio screen shot](https://support.rstudio.com/hc/article_attachments/115020357707/Screen_Shot_2017-08-24_at_1.22.05_PM.png)

Rstudio is an integrated development environment (IDE) for R.

* Free to use: [Rstudio website](https://www.rstudio.com/)
* Its user interface is similar to that of other IDEs, dividing the screen into panes
  * Source code editor
  * Console
  * Workspace
  * Others (help message, plot etc.)
* Rstudio in a desktop environment is better suited for code development and/or a limited number of small jobs.

Rstudio also provides a collection of very useful cheat sheets about how to use R [here](https://www.rstudio.com/resources/cheatsheets/).

## HPC Clusters

HPC clusters are well suited to resource-demanding workloads, e.g. resource-intensive tasks or large numbers of small tasks.

# Basic Syntax

## Assignment

For new users, the biggest difference from other languages is perhaps the assignment operator: **R uses "<-" instead of "="**. 

In [None]:
x <- 2*4

The content of the object "x" can be viewed by typing its name at the R prompt.

In [None]:
x

In [None]:
# What happens if we run this?
y

Actually, "=" works too:

In [None]:
y = 2 * 4
y

There are some subtle differences between ```<-``` and ```=```, which are explained [here](https://renkun.me/2014/01/28/difference-between-assignment-operators-in-r/) if you are interested. 

The R convention is to use ```<-``` for assignment.
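
One of those subtle differences shows up inside function calls. A minimal sketch (the variable name `tmp_vals` is made up for illustration):

```r
# Inside a call, "=" only names an argument; no variable is created.
m1 <- median(x = c(3, 1, 2))

# Inside a call, "<-" first creates tmp_vals, then passes its value on.
m2 <- median(tmp_vals <- c(3, 1, 2))

m1
m2
tmp_vals
```

Both calls compute the same median, but only the second one leaves a new object behind in the workspace.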

## Comment

In R, everything from "#" to the end of a line is interpreted as a comment.

In [None]:
# z <- 2*4
# Nothing will happen.

## Legal R Names

Names for R objects can be any combination of letters, numbers, underscores (_) and periods (.), but must not start with a number, an underscore, or a period followed by a number.

These are legal names:

In [None]:
# This is legal.
num.Cats.2 <- 4
num.Cats.2

# This is legal too.
num_Cats <- 5
num_Cats

These are not:

In [None]:
# Illegal to start with "_"
_num.cats <- 5

In [None]:
# "-" is not allowed.
num-cats <- 5

In [None]:
# Start with a number.
2cats <- 3

R is case sensitive, e.g. X and x are different in R.

In [None]:
# The cat function prints to screen.
x <- 4
cat("The value of x is:",x)
cat("The value of X is:",X)

## Arithmetic operations

Basic arithmetic operators: +, -, *, /, ^

In [None]:
1 + 2*4^(3/5)

Scientific notation: 1e-2

In [None]:
1e2 + 1e-2

## Comparisons and logical operations

Comparisons that will return a logical value:
* Less than: ```<```
* Less than or equal to: ```<=```
* Greater than: ```>```
* Greater than or equal to: ```>=```
* Equal to: ```==```
* Not equal to: ```!=```

In [None]:
# This statement is false.
1 > 2

In [None]:
# This will return a TRUE value.
1 != 2

Other logical operations:
* NOT: ```!```
* AND (elementwise): ```&```
* OR (elementwise): ```|```

In [None]:
# The negation of 1 < 2
! 1 < 2

In [None]:
# Assign initial values for a and b.
a <- 4
b <- 5
# Logical expressions.
cat("a < 10 is", a < 10, "\n")
cat("b < 3 is", b < 3, "\n")


In [None]:
# Are both expressions TRUE?
cat("a < 10 & b < 3 is", a < 10 & b < 3, "\n") 


In [None]:
# Is at least one of the expressions TRUE?
cat("a < 10 | b < 3 is", a < 10 | b < 3)
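
Two details of these operators are easy to miss: `!` has lower precedence than the comparison operators, and `&` and `|` work element by element on vectors. A small sketch (the names `r1`, `r2`, `p` and `q` are only for illustration):

```r
# "!" binds more loosely than "<", so this negates the whole comparison.
r1 <- !1 < 2        # same as !(1 < 2)
# Parentheses change the meaning: (!1) is FALSE, and FALSE < 2 is TRUE.
r2 <- (!1) < 2
r1
r2

# "&" and "|" work element by element on logical vectors.
p <- c(TRUE, TRUE, FALSE)
q <- c(TRUE, FALSE, FALSE)
p & q
p | q
```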

## Getting help

Getting help is straightforward in R. 

For information about specific functions or objects, use **?\<name of function>**. 

In [None]:
?class

For a keyword search, use **??\<keyword>**

In [None]:
??assignment

# Data Classes And Objects

## Atomic Data Types

R has five atomic data classes:



* Numeric (double)
  * Numbers in R are treated as numeric unless specified otherwise.

In [None]:
# The class() function reveals the class of an R object.
class(9.3)
class(3)

* Integer

In [None]:
# The as.integer() function "casts" an object into integer.
class(as.integer(3))

* Complex

In [None]:
class(3+2i)

* Character

In [None]:
# Both are categorized as characters.
class("a")
class("a cat")

In [None]:
# What about this?
class(a)

* Logical (T, TRUE, F, FALSE)
  * Note that they must be upper case



In [None]:
class(TRUE)
class(T)
class(True)

`NA` is a logical constant R uses to denote a missing value.

In [None]:
class(NA)

The **is.\<type>()** functions, which return logical values, can be used to check for the data classes too. 

In [None]:
a <- 3
# Is "a" numeric?
is.numeric(a)
# Is "a" an integer?
is.integer(a)
# Is "a" logical?
is.logical(a)
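
As a small illustration of the numeric/integer distinction, a literal can be made an integer directly with the `L` suffix. The variable names below are only for this sketch:

```r
n_dbl <- 3     # stored as numeric (double)
n_int <- 3L    # the L suffix requests an integer

class(n_dbl)
class(n_int)
is.integer(n_dbl)
is.integer(n_int)
# is.numeric() is TRUE for both doubles and integers.
is.numeric(n_int)
```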

## Derivative Data Types

There are many derivative data types, which are built on the atomic ones. For example, the "Date" type.

In [None]:
# The today() function (from the lubridate package) returns today's date.
mydate <- today()
mydate
class(mydate)
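
To see how a derivative type sits on top of an atomic one, the sketch below uses base R's `as.Date()` with arbitrary example dates. Removing the class attribute exposes the number that a Date stores internally:

```r
d <- as.Date("1970-01-10")   # base R; no extra package needed
class(d)
# Removing the class attribute reveals the underlying number
# of days since 1970-01-01.
unclass(d)
# Date arithmetic therefore behaves like ordinary numeric arithmetic.
d + 30
span <- as.Date("2021-05-18") - as.Date("2021-01-01")
span
```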

## Data Objects

Now let's look at the data objects in R. They are:

* Vectors: elements of same class, one dimension
* Matrices: elements of same class, two dimensions
* Arrays: elements of same class, 2+ dimensions
* Lists: elements can be any objects
* Data frames: “datasets” where columns are variables and rows are observations

### Vectors



Vectors are one-dimensional arrays that contain elements of the **same** data type.

Vectors can be constructed with:
* The **c()** function (concatenate):

In [None]:
myvec <- c(1,2,3)
myvec

In [None]:
# myvec is a numeric vector, so class() will report "numeric".
class(myvec)
is.vector(myvec)

In [None]:
# Of course you can have a character vector, a logical vector, etc.
myvec <- c("a","b","c")
myvec
class(myvec)
is.vector(myvec)

* The **vector()** function
  * The vector will be initialized to the default value of its type (e.g. FALSE for logical, 0 for numeric).

In [None]:
myvec <- vector("logical", length = 10)
myvec

* The **seq()** and **rep()** functions, or the ":" operator

In [None]:
myvec <- seq(from=2,to=10,by=2)
myvec

In [None]:
# When calling a function in R, the argument names are optional, so the code below does exactly the same thing.
myvec <- seq(2,10,2)
myvec

In [None]:
# Request 5 evenly spaced values (the full argument name is length.out).
myvec <- seq(from=2,to=10,length.out=5)
myvec

In [None]:
# Without the argument names, the order of the arguments is very important!
# In the example below, the "5" will be treated as the "by" argument instead of "length.out".
myvec <- seq(2,10,5)
myvec

In [None]:
# Repeat the element "5" six times.
myvec <- rep(5,6)
myvec

In [None]:
# An integer sequence from 1 to 15.
myvec <- 1:15
myvec
class(myvec)

* Or a combination of all of them

In [None]:
# How many elements would myvec end up with?
myvec <- c(1,8:10,rep(2:4,3))
myvec

You can convert an object to a different type using the **as.\<TYPE>()** functions.

Note: the output will be an object of the specified type, while the input remains untouched.

In [None]:
# myvec is an integer vector.
myvec <- 1:3
myvec

In [None]:
# The output of as.character(myvec) is a character vector.
as.character(myvec)

In [None]:
# myvec is still an integer vector.
myvec

When converting to logical values, a numeric/integer "0" will be FALSE while all non-zeroes will be TRUE.

In [None]:
as.logical(0:3)

Coercion will occur when objects of mixed types are passed to the **c()** function, as if an **as.\<Type>()** function were called explicitly.

In [None]:
# Which data type would myvec be?
myvec <- c(1e3,"a")
myvec
class(myvec)

How about this one?

In [None]:
c(1.7,"a")

And this one?

In [None]:
c(T,2)

Coercion hierarchy in R:

logical < integer < numeric < character

Caution: type coercion may happen **without you being aware of it and may have unintended results**.
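
The hierarchy above can be checked directly. The sketch below also shows one way an unintended coercion might creep in; the `readings` vector is made-up data:

```r
class(c(TRUE, 2L))     # logical + integer
class(c(2L, 3.5))      # integer + numeric
class(c(3.5, "a"))     # numeric + character

# A single stray string silently turns the whole vector into characters.
readings <- c(10, "20", 30)
class(readings)
# Convert back explicitly before doing arithmetic.
total <- sum(as.numeric(readings))
total
```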

One can use the [\<index>] operator to access individual elements of a vector. Note that in R the indices start from 1.

In [None]:
myvec <- 1:10
# myvec[4] points to the 4th element in the vector.
myvec[4]

For multiple elements with multiple indices, use the **c()** function (the indices need to be passed as a vector themselves).

In [None]:
# This will not work:
myvec[1,4]

In [None]:
# But this will:
myvec[c(1,4)]

Negative indices will drop the corresponding elements from the vector.

In [None]:
# Drop elements 2 through 6.
myvec[-6:-2]

Logical values can be used to access individual elements too.

In [None]:
# What output do you expect to see?
myvec[c(T,F)]
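
A logical index that is shorter than the vector is recycled. The sketch below uses a separate vector `v`, chosen only for illustration, to show that a two-element logical index picks every other element:

```r
v <- 11:20
# The two-element logical index is recycled: TRUE FALSE TRUE FALSE ...
odd_pos <- v[c(TRUE, FALSE)]
# The same selection written with explicit positions.
same <- v[seq(1, length(v), by = 2)]
odd_pos
identical(odd_pos, same)
```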

**Important:** Lots of R operations process objects in a vectorized way
* More efficient, concise, and easier to read.

In [None]:
# vec1: 1 2 3 4
# vec2: 4 3 2 1
vec1 <- 1:4
vec2 <- 4:1

In [None]:
# Element-wise addition
vec1 + vec2

In [None]:
# Element-wise comparison
# The result will be a logical vector with four elements.
vec1 > 2

In [None]:
# Element-wise comparison
vec1 > vec2

In [None]:
vec1[3] > 2

In [None]:
# Show all vec1 elements that are greater than their counterparts in vec2.
vec1[vec1 > vec2]

Sometimes R takes it to the extreme.

In [None]:
# vec1: 1 2 3 4 5 6 7 8
# vec2: 4 3 2 1
vec1 <- 1:8
vec2 <- 4:1
# Now vec1 and vec2 are of different length. Would this end up with an error? 
vec1+vec2
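
What happens here is usually called recycling: the shorter vector is repeated until it is as long as the longer one. A minimal sketch with illustrative names:

```r
long  <- 1:8
short <- 4:1
# The shorter vector is repeated until it matches the longer one.
recycled <- long + short
explicit <- long + rep(short, times = 2)
identical(recycled, explicit)

# If the longer length is not a multiple of the shorter one, R still
# recycles but issues a warning; here the warning message is captured
# instead of being printed.
res <- tryCatch(1:5 + 1:2, warning = function(w) conditionMessage(w))
res
```

Because recycling is silent whenever the lengths divide evenly, it is worth checking vector lengths when a result looks suspicious.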

### Matrices


Matrices are two-dimensional arrays that contain elements of the same type. 

Assigning a "dim" attribute to a vector turns it into a matrix.

In [None]:
# What kind of object is "mymat"?
mymat <- 1:12
mymat
dim(mymat)

In [None]:
# Assigning a "dim" attribute to mymat turns it into a matrix.
dim(mymat) <- c(3,4)
mymat

In [None]:
# Note the dual purposes of the dim() function.
# It displays the dimensions of a data object.
dim(mymat)
# It also provides a handle to assign values to dimensions.
dim(mymat) <- c(2,6)
mymat

Actually, matrices are merely vectors with a "dimension" attribute. 

R matrices can be constructed by using the **matrix()** function as well.

In [None]:
matrix(1:12,nrow=3,ncol=4)

The **matrix()** function constructs matrices column-wise by default. You can use the "byrow=T" argument to switch to row-wise.


In [None]:
matrix(1:12,nrow=3,ncol=4,byrow=T)

Or by using the **cbind()** or **rbind()** functions.

In [None]:
# Treat each argument as a column and bind them side-by-side into a matrix.
cbind(1:3,4:6,7:9,10:12)

In [None]:
# Treat each argument as a row and bind them row-by-row into a matrix.
rbind(1:3,4:6,7:9,10:12)

One can use [\<index>,\<index>] to access individual elements.

In [None]:
# mymat was last reshaped to 2 x 6 above, so restore the 3 x 4 shape first.
dim(mymat) <- c(3,4)
# What is the output?
mymat[3,4]
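
Leaving an index blank selects a whole row or column. The sketch below uses a fresh matrix `m` (an illustrative name) so that it does not depend on the current shape of `mymat`:

```r
m <- matrix(1:12, nrow = 3, ncol = 4)
m[2, ]                  # the second row, returned as a plain vector
m[, 3]                  # the third column, returned as a plain vector
m[2, , drop = FALSE]    # the second row, kept as a 1 x 4 matrix
m[1:2, c(1, 4)]         # rows 1-2 of columns 1 and 4
```

The `drop = FALSE` argument is useful when later code expects a matrix rather than a vector.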

### Arrays

Arrays consist of elements of the same class arranged in any number of dimensions. Vectors and matrices can be viewed as arrays of 1 and 2 dimensions, respectively.

In [None]:
# myarray will be a three-dimensional array
myarray <- array(data = 1:12,dim = c(2,2,3))
myarray

In [None]:
dim(myarray)

In [None]:
# Access an element using the indices.
myarray[1,1,2]

In [None]:
# This fails: myarray has three dimensions, so it needs three indices.
myarray[3,4]

In [None]:
# Actually, it's fine to treat it as a one-dimensional vector.
myarray[5]

In [None]:
dim(myarray)
# Mold myarray into a matrix.
dim(myarray) <- c(3,4)
# Then you won't be able to access its element using three indices.
myarray[1,1,2]

### Lists

Lists are an ordered collection of objects, which can be of **different types or classes**.

Lists can be constructed by using the **list()** function.


In [None]:
# Mixing numeric, logical and character.
list(1,F,"a")

Members of a list do not have to be of atomic types, i.e. they can be vectors, matrices and even lists.

In [None]:
# Mixing numeric, vector (integer), matrix (integer), list
mylist <- list(1,1:5,matrix(1:6,2,3),list(1,F,"a"))
mylist

Lists can be indexed using the `[[ ]]` operator.

In [None]:
# The second element in mylist.
mylist[[2]]

The indices can be nested.

In [None]:
# The second element of the fourth element (a list) in my list.
mylist[[4]][[2]]

In [None]:
# The [1,3] element of the third element (a matrix) in my list.
mylist[[3]][1,3]

Elements of R objects can have names.

Names can be specified when an object is created.

In [None]:
list(inst="LSU",location="Baton Rouge",state="LA")

Or they can be specified later with the **names()** function.

In [None]:
names(mylist) <- c("num","vec","mat","lst")
mylist

Names can be used to access elements in a data object using the `$` operator. When there are many elements, this could be more convenient than using the indices.

In [None]:
# This is equivalent to mylist[[4]]
mylist$lst

Indexing operations by names and indices can be nested and mixed.

In [None]:
names(mylist$lst) <- c("c1","c2","c3")
mylist

In [None]:
# The "c2" element of the "lst" element in the list "mylist".
mylist$lst$c2
# The same thing.
mylist$lst[[2]]
# Again, the same thing.
mylist[[4]][[2]]
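
Lists also accept the single-bracket `[ ]` operator, which behaves differently from `[[ ]]`. A short sketch with a made-up list `demo`:

```r
demo <- list(num = 1, vec = 1:5, chr = "a")   # hypothetical list
sub_list <- demo["vec"]      # a list of length 1
element  <- demo[["vec"]]    # the integer vector itself
class(sub_list)
class(element)
# Single brackets can select several members at once.
names(demo[c("num", "chr")])
```

In short, `[ ]` always returns a list, while `[[ ]]` and `$` return the member itself.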

### Data frames

Data frames are used to store tabular data.
* They are a special type of **list**, where each element is an R vector ("column" or "variable") and all elements must have the same length.
* The elements (columns) can be of different classes.
* Data frames have special attributes such as row.names.
* Data frames can be created by reading data files, using functions such as **read.table()** or **read.csv()**.


Data frames can be created directly by calling the **data.frame()** function.

In [None]:
# Create a dataframe with three rows and two columns (variables).
mydf <- data.frame(c(31,40,50), c("M","F","M"))
mydf
is.vector(mydf[[2]])

We usually name the columns so that they are more meaningful.

In [None]:
# Name the columns
colnames(mydf) <- c("age","sex")
mydf

Row names can be specified as well.

In [None]:
# Name the rows
rownames(mydf) <- c("obs1","obs2","obs3")
mydf

To access individual elements in a data frame, there are a few options:

* Numeric indices

In [None]:
# First row, second column
mydf[1,2]

* Row and column names

In [None]:
# Sex of the observation #1 (same element as above)
mydf["obs1","sex"]

* Or a mix of indices and names

In [None]:
mydf["obs1",2]

Since data frames are lists, both `[[ ]]` and `$` operators work.

In [None]:
mydf[[2]]

In [None]:
mydf$sex

You could select rows by leaving the other index blank:

In [None]:
# The first two rows
mydf[1:2,]

And vice versa:

In [None]:
# The "sex" column
mydf[,"sex"]

We can subset a data frame like this:

In [None]:
# Both columns for obs 1 and 3
mydf[c(1,3),]

Or using a vector of logical values:

In [None]:
# This gives us a logical vector.
mydf$sex == "M"

In [None]:
# Which can be used to obtain a subset of mydf.
mydf[mydf$sex == "M",]

## Querying Object Attributes

There are a few functions in R that help us obtain information about an object.

We will work with the "mtcars" data frame in this section.

In [None]:
# Get some help information on the data frame.
?mtcars

In [None]:
# Print the data frame to screen.
mtcars

We have already seen the **class()** function, which reveals the type of an object.

In [None]:
class(mtcars)

The **length()** function shows the length of an object.

In [None]:
length(mtcars)

The **nrow()** function counts the number of rows in a data frame.

In [None]:
nrow(mtcars)

The **dim()** function reveals the dimension of an object.

In [None]:
dim(mtcars)

The **attributes()** function reveals attributes of an object.

In [None]:
attributes(mtcars)

The **str()** function shows the internal structure of an R object.


In [None]:
str(mtcars)

## Exercise 1

1. Learn about the **airquality** data frame and answer the following questions:
  * What is the source of the data?
  * How many rows and columns are there in the data frame? 
  * What does each column represent?
2. Find the percentage of days when the high temperature measured at La Guardia Airport exceeded 70 degrees Fahrenheit.
3. Find the number of days when the wind speed is between 10 and 20 miles per hour.

In [None]:
#@title Hint
# Use a logical vector as indices to subset a data frame.

In [None]:
#@title Solution { display-mode: "form" }
# question 1
?airquality

attach(airquality)

# question 2
nrow(airquality[Temp > 70,])/nrow(airquality)*100

# question 3
nrow(airquality[Wind > 10 & Wind < 20,])

detach(airquality)
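
An alternative sketch that avoids `attach()`: the mean of a logical vector is the proportion of TRUE values, and its sum is the number of TRUE values. The names `pct_hot` and `n_windy` are only for illustration:

```r
library(datasets)
# question 2: proportion of days with Temp above 70, as a percentage
pct_hot <- mean(airquality$Temp > 70) * 100
# question 3: number of days with wind speed between 10 and 20 mph
n_windy <- sum(airquality$Wind > 10 & airquality$Wind < 20)
pct_hot
n_windy
```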

# Data Management

In this section, we will cover some of the most frequently used operations during data cleaning and preprocessing.

## Adding and removing variables

To add columns/variables to a dataframe, we can simply use the assignment operation.

In [None]:
mydf

In [None]:
# Add a logical variable "active" to the dataframe.
mydf$active <- c(T,F,T)
mydf

In [None]:
# Add a numeric variable "ageFirstDose".
mydf$ageFirstDose <- 21
mydf

In [None]:
# Add a numeric variable whose values are the differences between "age" and "ageFirstDose".
mydf$ageDiff <- mydf$age - mydf$ageFirstDose
mydf

To drop columns, we can either include the ones to keep or exclude the ones to drop.

In [None]:
# Drop all columns other than "age" and "sex".
mydf[,c("age","sex")]

In [None]:
# Drop the third and fourth columns
mydf[,c(-3,-4)]

In [None]:
# We can rearrange the order of columns too.
mydf[,c(3:5,1)]

## Subsetting/slicing

We have seen that we can include/exclude certain columns and rows by using either indices or names.

In [None]:
# Select row 1,3,5 and column "mpg" and "cyl" from the mtcars dataframe.
mtcars[c(1,3,5),c("mpg","cyl")]

We can use a logical vector as the row indices to include/exclude rows. 

In [None]:
mtcars$mpg > 20
# Keep only the rows with mpg > 20.
mtcars[mtcars$mpg > 20,]

In [None]:
# Keep only the rows with mpg > 20 AND hp > 100.
mtcars[mtcars$mpg > 20 & mtcars$hp > 100,]

Another way of generating the logical vector is to use the ```%in%``` operator. It takes two vectors as arguments, and, for each element in the first vector, performs a match operation in the second vector. If a match is made, it will return a logical TRUE; if not, a FALSE. 

In [None]:
carmodels <- data.frame(model=c('Toyota Corolla','Toyota Corona','Toyota Camry'),
  country=rep('Japan',3))
carmodels

# The output is a logical vector with one element per row of mtcars, 
# each of which indicates whether a match is found in the second vector.
row.names(mtcars) %in% carmodels$model

In [None]:
# The output can be used to slice the original data frame.
mtcars[row.names(mtcars) %in% carmodels$model,]

The ```match()``` function performs a similar operation, but instead of a logical vector, it returns the indices of the matched values in the second vector (or NA where there is no match).

In [None]:
match(row.names(mtcars), carmodels$model)

In [None]:
# We can use the match() function to perform a "dictionary lookup".
countryLookup <- carmodels[match(row.names(mtcars), carmodels$model),"country"]
mtcarsWithCountry <- cbind.data.frame(mtcars,country=countryLookup)
mtcarsWithCountry

Or we can use the ```subset``` function, which makes the code easier to read.

In [None]:
# Keep only the rows with mpg > 20 AND hp > 100 AND disp > 100,
# and drop all columns except mpg, hp, cyl, and disp.
subset(mtcars, mpg > 20 & hp > 100 & disp > 100, select = c(mpg,hp,cyl,disp))

## Merging

We can use the ```merge``` function to merge two dataframes that have common columns, i.e. merge horizontally.

In [None]:
mydf1 <- data.frame(state=c("Louisiana","Texas","Alabama"),short=c("LA","TX","AL"))
mydf1

In [None]:
mydf2 <- data.frame(state=c("Louisiana","Texas","Alabama"),pop=c(5,29,5))
mydf2

In [None]:
# Merge mydf1 and mydf2 with the common column being "state".
mydf3 <- merge(mydf1,mydf2,by="state")
mydf3
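
In the example above every state appears in both data frames. When the keys do not fully match, the `all`, `all.x` and `all.y` arguments of `merge()` control which rows are kept. A sketch with made-up data frames `left` and `right`:

```r
left  <- data.frame(state = c("Louisiana", "Texas", "Alabama"),
                    short = c("LA", "TX", "AL"))
right <- data.frame(state = c("Louisiana", "Texas", "Florida"),
                    pop   = c(5, 29, 22))

# Default: keep only the states found in both data frames.
inner <- merge(left, right, by = "state")
# Keep every row of the first data frame; unmatched values become NA.
keep_left <- merge(left, right, by = "state", all.x = TRUE)
# Keep every row of both data frames.
outer <- merge(left, right, by = "state", all = TRUE)
inner
keep_left
outer
```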

When merging dataframes vertically, i.e., adding rows, use the ```rbind.data.frame``` function. 

In [None]:
mydf4 <- data.frame(state=c("Arkansas","Florida"),pop=c(3,22),short=c("AR","FL"))
mydf4

In [None]:
rbind.data.frame(mydf3,mydf4)

## Sorting

Sort and order elements: `sort()`, `rank()` and `order()`.

By default, the **sort()** function sorts the values in a vector into ascending order.

In [None]:
# This is the original order.
mtcars$mpg
# This is the sorted order.
sort(mtcars$mpg)

The ```decreasing=T``` option will sort a vector into descending order.

In [None]:
sort(mtcars$mpg, decreasing=T)

In contrast, the **order()** function returns the indices that would put the values in order.


In [None]:
#order(mtcars$mpg)
order(mtcars$mpg, decreasing=T)

Users can use the indices returned by **order()** to change the order of rows in a data frame.

In [None]:
# The data frame in the original order.
mtcars

In [None]:
# The data frame reordered according to the values of the mpg variable.
mtcars[order(mtcars$mpg),]
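
`order()` also accepts several keys, using later ones to break ties in earlier ones. A minimal sketch (the name `sorted` is illustrative), where negating a numeric key reverses its direction:

```r
library(datasets)
# Sort by cyl (ascending); within each cyl group, by mpg (descending).
sorted <- mtcars[order(mtcars$cyl, -mtcars$mpg), c("cyl", "mpg")]
head(sorted)
```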

The **rank()** function returns the ranks of the values in a vector.

In [None]:
rank(mtcars$mpg, ties.method = "min")

In [None]:
# The "ties.method" option decides how equal values are handled.
data.frame(mpg=mtcars$mpg,rank=rank(mtcars$mpg, ties.method = "min"))
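
The effect of `ties.method` is easier to see on a tiny vector. The `scores` values below are made up and contain one tie:

```r
scores <- c(10, 20, 20, 30)   # hypothetical values with one tie
r_avg   <- rank(scores)                          # default: "average"
r_min   <- rank(scores, ties.method = "min")
r_max   <- rank(scores, ties.method = "max")
r_first <- rank(scores, ties.method = "first")
rbind(r_avg, r_min, r_max, r_first)
```

The tied values share rank 2.5 under "average", both get 2 under "min", both get 3 under "max", and are numbered in order of appearance under "first".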

## Missing values

Missing values (missing data) are a common problem in data science. They can have a significant effect on the conclusions that can be drawn from the data. Therefore, R has many functions and packages that deal with the missing value problem.

The ```complete.cases``` function will scan a dataframe and return a logical vector where the rows without missing values are TRUE and those with missing values FALSE.

In [None]:
# In the airquality dataset we have used for Exercise 1, there are some missing values.
complete.cases(airquality)

The simplest way of dealing with missing values is to drop all rows with missing values (not necessarily the best way!).

In [None]:
airquality[complete.cases(airquality),]
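
A few other base R tools for missing values, shown as a short sketch. It uses only `is.na()`, the `na.rm` argument and `na.omit()`:

```r
library(datasets)
# Count the missing values in each column.
colSums(is.na(airquality))

# Most summary functions return NA if any input is NA ...
mean(airquality$Ozone)
# ... unless the missing values are removed first.
mean(airquality$Ozone, na.rm = TRUE)

# na.omit() drops incomplete rows, like the complete.cases() subset above.
nrow(na.omit(airquality))
```

Using `na.rm = TRUE` keeps the rows that are missing values only in other columns, which dropping incomplete rows would discard.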

## Exercise 2

Using the **airquality** data, find the average wind speed for the 10 hottest days on record.



In [None]:
#@title Hint
# Use the order() function 
# to find the indices of the 10 hottest days.

In [None]:
#@title Solution { display-mode: "form" }
mean(airquality[order(airquality$Temp,decreasing=T),"Wind"][1:10])

# Flow Control Structures

Flow control structures in R, which allow one to control the flow of execution, are similar to those in other languages.

## Condition testing

Test a condition with the **if...else if...else** structure:

```
if (condition 1 is true) {
  do something
} else if (condition 2 is true) {
  do something else
} else {
  do something more
}
```



In [None]:
# mtcars has one row per car model, so nrow() counts the models.
if (nrow(mtcars) > 3) {
  print("We have more than 3 car models!")
}
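
The full **if...else if...else** template can be sketched as follows. The thresholds 25 and 18 and the variable names are arbitrary choices for illustration:

```r
library(datasets)
avg_mpg <- mean(mtcars$mpg)

# Exactly one of the three branches runs.
if (avg_mpg > 25) {
  label <- "high"
} else if (avg_mpg > 18) {
  label <- "medium"
} else {
  label <- "low"
}
cat("Average mpg is", avg_mpg, "which is", label, "\n")
```

Note that `else` must sit on the same line as the closing brace of the preceding block; otherwise R treats the `if` statement as already complete.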

## Loops

Loops with ```for```:

```
for (variable in sequence) {
  statements
}
```

In [None]:
for (i in 1:10) {
  print(i^3)
}

Compared to other languages, loops are not as frequently used in R because many operations and functions are inherently vectorized. 

For instance, this line of code computes the same values as the code block above:

In [None]:
(1:10)^3
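
When a loop is needed and its results must be kept, a common pattern is to preallocate the result vector. The sketch below (names are illustrative) confirms that the loop and the vectorized form agree:

```r
n <- 10
cubes_loop <- numeric(n)          # preallocate the result vector
for (i in seq_len(n)) {
  cubes_loop[i] <- i^3
}
cubes_vec <- (1:n)^3              # vectorized equivalent
all.equal(cubes_loop, cubes_vec)
```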

In addition, the **apply()** family of functions is very useful for performing an operation over all elements of a vector/list (more on this later).

# Useful Functions (1)

## Simple Statistical Functions

* **min()**: Minimum value
* **max()**: Maximum value
* **which.min()**: Location of minimum value
* **which.max()**: Location of maximum value
* **sum()**: Sum of the elements of a vector
* **mean()**: Mean of the elements of a vector
* **sd()**: Standard deviation of the elements of a vector
* **quantile()**: Show quantiles of a vector
* **summary()**: Display descriptive statistics

In [None]:
mean(mtcars$mpg)

In [None]:
# The summary function will report a few statistics for each variable in the data frame.
summary(mtcars)

The **table()** function tabulates factors or finds the frequency of values in an object.

For instance, in the **mtcars** data frame, we can get the frequency table by the combination of the numbers of cylinders and gears: 


In [None]:
table(mtcars[,c("cyl","gear")])

## Distributions and Random Variables

For each statistical distribution below, R provides four functions: density (d), cumulative distribution (p), quantile (q), and random generation (r). 


| Distribution | Name in R |
|---|---|
| Uniform | ```unif``` |
| Binomial | ```binom``` |
| Poisson | ```pois``` |
| Geometric | ```geom``` |
| Gamma | ```gamma``` |
| Normal | ```norm``` |
| Log Normal | ```lnorm``` |
| Exponential | ```exp``` |
| Student's t | ```t``` |

The function name is of the form **[d|p|q|r]\<name of distribution\>**. For example, **qbinom()** gives the quantile of a binomial distribution.

To generate a random sample of 10 from the standard normal distribution:

In [None]:
# Each time it is run, the sample would be different.
rnorm(10,mean=0,sd=1)

The cumulative probability at 1.96 and its inverse (the quantile function) for the standard normal distribution:

In [None]:
pnorm(1.96)
qnorm(pnorm(1.96))
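
The d/p/q/r naming scheme described above works the same way across distributions. A few calls as a sketch, with parameter values chosen only for illustration:

```r
dnorm(0)                              # density of the standard normal at 0
pnorm(0)                              # P(X <= 0) for the standard normal
qnorm(0.975)                          # the 97.5% quantile, close to 1.96
punif(0.3, min = 0, max = 1)          # P(U <= 0.3) for Uniform(0, 1)
dbinom(2, size = 4, prob = 0.5)       # P(exactly 2 successes in 4 fair trials)
qbinom(0.5, size = 10, prob = 0.5)    # median of Binomial(10, 0.5)
```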

When generating random samples, setting the seed to the same value will generate the same sample.

In [None]:
# A "true" random sample (varies every time)
rnorm(10,mean=0,sd=1)

In [None]:
# A sample with seed "15" (mainly for debugging and reproducibility)
set.seed(15)
rnorm(10,mean=0,sd=1)
set.seed(15)
rnorm(10,mean=0,sd=1)
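
The effect of the seed can be verified programmatically rather than by eye. A small sketch with illustrative variable names:

```r
set.seed(15)
first <- rnorm(10)
set.seed(15)
second <- rnorm(10)
# Same seed, same sample.
identical(first, second)

# Without resetting the seed, the generator has moved on.
third <- rnorm(10)
identical(first, third)
```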

## Exercise 3

Using the **airquality** data, draw 1000 random samples of 10 from the temperature measurement, and plot the sample means (to show the central tendency).

Use the *hist(dataset)* function to create a histogram.


In [None]:
#@title Hint
# Use a for loop
# Use the runif function to create a random sample of 10

In [None]:
#@title Solution { display-mode: "form" }

# Create a vector to store the results.
nSamples <- 1000
sampleMean <- vector("numeric",nSamples)

for (i in 1:nSamples) {

  # In each iteration, use the runif function to 
  # generate 10 random numbers between 0 and 1.
  # Then use them to get 10 measurements of temperature.
  sampleMean[i] <- mean(airquality[ceiling(runif(10)*nrow(airquality)),"Temp"])
  
  # R also provides a sample() function that can 
  # perform a similar operation more directly (here, without replacement).
  #sampleMean[i] <- mean(airquality[sample(1:nrow(airquality),10,replace=F),"Temp"])

}

# Plot the data.
hist(sampleMean)

# Useful Functions (2)

## The Apply family of functions

The **apply()** function evaluates a function over the margins of an array.
* More concise than ```for``` loops (not necessarily faster)

Syntax:

```
 apply(data,dimension,function,function parameters)
 ```

For example, if we want to calculate the mean of each variable in **mtcars**:

In [None]:
# Apply the mean() function to the columns (variables) of mtcars.
apply(mtcars,2,mean)


Which is (almost) equivalent to:

In [None]:
for (i in 1:ncol(mtcars)) {
  print(mean(mtcars[,i]))
}

It can perform multiple calculations in one function call:

In [None]:
# Find the quartiles (25%, 50% and 75% quantiles) of each variable in mtcars.
apply(mtcars, 2, quantile, probs = c(0.25, 0.5, 0.75))

Other members of the **apply()** family include:
* **lapply** - Loop over a list (data frame) and evaluate a function on each element
* **sapply** - Same as **lapply** but simplifies the result to a vector or array when possible 
* **tapply** - Apply a function over subsets of a vector
* **mapply** - Multivariate version of **sapply**
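
The four functions listed above can be sketched on **mtcars** as follows (the result names are illustrative):

```r
library(datasets)
# lapply: a list with one element per column.
col_means_list <- lapply(mtcars, mean)
# sapply: the same values, simplified to a named numeric vector.
col_means_vec <- sapply(mtcars, mean)
# tapply: mean mpg within each group defined by cyl.
mpg_by_cyl <- tapply(mtcars$mpg, mtcars$cyl, mean)
# mapply: calls rep(1, 3), rep(2, 2) and rep(3, 1).
repeated <- mapply(rep, 1:3, 3:1)

col_means_vec
mpg_by_cyl
repeated
```

Because the three `rep()` results have different lengths, `mapply()` cannot simplify them and returns a list.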

## The ```plyr``` package

Suppose that, with the **mtcars** data frame, we want to know the average mileage-per-gallon for cars with 4, 6 and 8 cylinders. How do we do that?

We will need to 
* **split** the data into subsets according to the value of the "cyl" column (one for cyl==4, one for cyl==6 and one for cyl==8)
* **apply** the mean function to the "mpg" column
* **combine** the results from each subset

In [None]:
# Subset where cyl==4
mean(mtcars[mtcars$cyl==4,"mpg"])
# Subset where cyl==6
mean(mtcars[mtcars$cyl==6,"mpg"])
# Subset where cyl==8
mean(mtcars[mtcars$cyl==8,"mpg"])

The "split-apply-combine" pattern is very common in data analysis, where you solve a complex problem by breaking it down into small pieces, doing something to each piece and then combining the results back together again.

The ```plyr``` package provides a group of functions that implement this split-apply-combine pattern.

For example, the **ddply()** function takes a data frame, splits it according to the variable(s) you supply, applies a function to each piece, then combines the results into a new data frame.

In [None]:
library(plyr)
ddply(mtcars,"cyl", summarize, AverageMPG=mean(mpg))

The "split" step can be done according to more than one variable.

The command below will tell us the average MPG for each unique combination of "gear" and "cyl":

In [None]:
# Find the average mpg for each unique combination of "gear" and "cyl".
ddply(mtcars,c("gear","cyl"), summarize, AverageMPG=mean(mpg))
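
For comparison, base R can express the same split-apply-combine steps with `aggregate()` and a formula. This is an alternative sketch rather than part of ```plyr```, and the result names are illustrative:

```r
library(datasets)
# Average mpg for each value of cyl.
by_cyl <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)
by_cyl
# Average mpg for each combination of gear and cyl that occurs in the data.
by_gear_cyl <- aggregate(mpg ~ gear + cyl, data = mtcars, FUN = mean)
by_gear_cyl
```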

## User-defined functions

* Users can define their own functions in R by using the ```function``` keyword. 
* The return value is the value of the last expression evaluated in the function body.
* Functions can be nested.
* Functions are R objects and can be passed as arguments to other functions.

Syntax to define a function:


```
function_name <- function (arguments) {
  statements
}
```



In [None]:
# Create a function pow(), which takes two arguments.
pow <- function(x, y) {
  result <- x^y
  # End with the value itself: an assignment as the last expression is returned invisibly.
  result
}

Then it can be called like any other function:

In [None]:
# The result will be 4^2.
pow(4,2)

Functions can be used as an argument for other functions.

In [None]:
# Define a new function, which takes a function as one of the arguments.
myfunc <- function(func,a,b) {
  result <- func(a,b) - 1
  # Return the value visibly.
  result
}

c <- myfunc(pow,4,2)
c

c <- myfunc(rep,1:2,5)
c
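
Two more features that often come up are default argument values and early exit with `return()`. A sketch using a hypothetical function `power_of()`:

```r
# Hypothetical function with a default value for its second argument.
power_of <- function(base, exponent = 2) {
  if (!is.numeric(base)) {
    # return() exits the function immediately.
    return(NA)
  }
  base^exponent
}

power_of(4)                 # exponent falls back to the default, 2
power_of(4, exponent = 3)   # the default is overridden
power_of("a")               # non-numeric input takes the early exit
```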

## Exercise 4

Using the **airquality** data, find the average high temperature of each month.

In [None]:
#@title Hint
# Use the ddply() function

In [None]:
#@title Solution { display-mode: "form" }
ddply(airquality,"Month",summarize,AvgTemp=mean(Temp))

# Managing R Packages

To load an R package so you can use the functions included in it, use the **library()** or **require()** function:

In [None]:
library(lubridate)
require(devtools)

The main difference is that, if a package is not installed, **library()** will throw an error and the execution will stop, while **require()** issues a warning, returns FALSE, and the execution continues.

In [None]:
library(reshape)
print("End of code segment")

In [None]:
require(reshape)
print("End of code segment")

If a package is not available, the **install.packages()** function can be used to install it.

In [None]:
install.packages("reshape")
library(reshape)
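
A common pattern is to install a package only when it is missing. The sketch below is one way to do that; `load_or_install()` is a hypothetical helper name, not a function provided by R, and it is tried here on **stats**, which ships with R, so nothing is downloaded.

```r
# Hypothetical helper: attach a package, installing it first only if it is missing.
load_or_install <- function(pkg) {
  # requireNamespace() returns TRUE/FALSE without attaching the package.
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  # character.only = TRUE tells library() that pkg holds the name as a string.
  library(pkg, character.only = TRUE)
}

load_or_install("stats")   # ships with R, so nothing is downloaded here
```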

Multiple packages can be installed with one call of the **install.packages()** function.

In [None]:
require(datarium)
require(BiocManager)
install.packages(c("datarium","BiocManager"))
library(datarium)
library(BiocManager)

Note that quotation marks around the package name are **NOT** required when loading a package, but they are required when installing one.

Use the **remove.packages** function to remove installed packages.

In [None]:
remove.packages("datarium")
library(datarium)

The **update.packages** function updates installed packages.

In [None]:
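# The first positional argument of update.packages() is a library path,
# so the package name is passed through oldPkgs instead.
update.packages(oldPkgs = "lubridate", ask = FALSE)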

List **installed** packages.

In [None]:
installed.packages()
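
The full output of **installed.packages()** is a wide matrix and can be hard to read. As a small sketch, the cell below keeps only the name and version columns and checks for one package; **stats** is used because it is always present, and `ip` is just an illustrative variable name.

```r
ip <- installed.packages()
# One row per installed package; keep just the name and version columns.
head(ip[, c("Package", "Version")])
# Check for a single package by name and query its version.
"stats" %in% rownames(ip)
packageVersion("stats")
```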

List all **attached** packages (and other attached objects), i.e. the search path.

In [None]:
search()

# File And Directory Operations

## Query working directory

Each R session has a working directory. The **getwd()** function shows the current working directory.

In [None]:
getwd()

The **list.files()** and **list.dirs()** functions list the files and subdirectories.

In [None]:
list.files()
list.dirs()

To change the working directory, use the **setwd()** function.

In [None]:
setwd("/content/sample_data")
getwd()
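
Because **setwd()** changes the directory for the rest of the session, it can be handy to remember where you started. A minimal sketch, assuming you want to work briefly in the session's temporary directory (`old_wd` is an illustrative name):

```r
# setwd() invisibly returns the previous working directory.
old_wd <- setwd(tempdir())   # tempdir() is a per-session scratch directory
getwd()
# ... work with files here ...
setwd(old_wd)                # restore the original directory
getwd()
```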

## Handling files

R has a set of file.\<operation> functions.

Check if a file exists:

In [None]:
file.exists("testfile")

Create an empty file:

In [None]:
file.create("testfile")
file.exists("testfile")

Copy and delete files:

In [None]:
list.files()

In [None]:
file.copy("testfile","anotherfile")
list.files()

In [None]:
file.remove("testfile","anotherfile")
list.files()
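
A few related functions follow the same pattern. The sketch below works inside a scratch directory under **tempdir()** so that no real files are touched; the directory and file names are made up for the example.

```r
# Work inside a scratch directory so no real files are touched.
scratch <- file.path(tempdir(), "file_demo")   # hypothetical directory name
dir.create(scratch, showWarnings = FALSE)

src <- file.path(scratch, "notes.txt")
writeLines(c("first line", "second line"), src)   # create a file with content

file.rename(src, file.path(scratch, "renamed.txt"))
list.files(scratch)
file.info(file.path(scratch, "renamed.txt"))$size  # size in bytes

unlink(scratch, recursive = TRUE)   # remove the directory and its contents
dir.exists(scratch)
```

Building paths with **file.path()** avoids hard-coding the path separator.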

# Graphics

There are three plotting systems in R:
* base
  * Convenient, but hard to adjust after the plot is created
* lattice
  * Good for creating conditioning plots
* ggplot2
  * Powerful and flexible, with many tunable features, but may require some time to master


Each has its pros and cons, so it is up to the users which one to choose.

## base

A few of the functions available in the base plotting system:
* **plot()**: line and scatter plots
* **boxplot()**: box plots
* **hist()**: histograms

A quick example with the base plotting system: a line plot with the data points overlaid.

In [None]:
# Create the plot with title and axis labels.
plot(pressure,type="l",
     main="Vapor Pressure of Mercury",
     xlab="Temperature", 
     ylab="Vapor Pressure")
# Add points
points(pressure,col='red') 
# Add annotation
# Keep the string on one line: "\n" already inserts the line break.
text(150,700,"Source: Weast, R. C., ed. (1973) Handbook\nof Chemistry and Physics. CRC Press.")
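
The other two functions listed above work in much the same way. Here is a short sketch using the built-in **airquality** data; the titles and the number of breaks are arbitrary choices for the illustration.

```r
# Box plot of daily temperature by month (built-in airquality data).
boxplot(Temp ~ Month, data = airquality,
        main = "Daily Temperature by Month",
        xlab = "Month", ylab = "Temperature (F)")

# Histogram of the same temperatures; hist() also returns the bin counts.
h <- hist(airquality$Temp, breaks = 10,
          main = "Distribution of Daily Temperature",
          xlab = "Temperature (F)")
h$counts
```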

## ggplot2

The **qplot()** function is the ggplot2 version of **plot()**.

In [None]:
qplot(weightLb, heightIn, data=heightweight, geom="point")

The **ggplot()** function is the main function in the ggplot2 package.

Here is an example:

In [None]:
ggplot(heightweight, aes(x=weightLb, y=heightIn, color=sex, shape=sex)) + 
  geom_point(size=3.5) +
  ggtitle("School Children\nHeight ~ Weight") +
  labs(y="Height (inch)", x="Weight (lbs)") +
  stat_smooth(method="loess", se=TRUE, color="black", fullrange=TRUE) +
  annotate("text",x=145,y=75,label="Locally weighted polynomial fit with 95% CI",color="Green",size=6) +
  scale_color_brewer(palette = "Set1", labels=c("Female", "Male")) +
  guides(shape="none") +
  theme_bw() +
  theme(plot.title = element_text(size=20, hjust=0.5), 
        legend.position = c(0.9,0.2),
        axis.title.x = element_text(size=20), axis.title.y = element_text(size=20),
        legend.title = element_text(size=15),legend.text = element_text(size=15))
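
The example above stacks many layers at once. A smaller sketch may make the layering idea easier to follow; it uses the built-in **mtcars** data rather than **heightweight**, and the labels are illustrative choices.

```r
library(ggplot2)

# Start with the data and aesthetic mapping, then add layers one at a time.
p <- ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point(size = 3) +                       # layer: scatter points
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon",
       color = "Cylinders") +                  # axis and legend titles
  theme_bw()                                   # overall look
p
```

Storing the plot in a variable such as `p` lets you add further layers later with `p + ...`.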

If you are interested in learning more, please visit the [Data Visualization in R](http://www.hpc.lsu.edu/training/weekly-materials/2018-Spring/Slides.html#(1)) tutorial from LSU HPC.

# Parallel Processing

Modern computers are equipped with more than one CPU core and are capable of processing workloads in parallel, but base R is single‐threaded, i.e. not parallel.

In other words, regardless of how many cores are available, base R uses only one of them by default.

There are two options to run R in parallel: **implicit** and **explicit**.


## Implicit parallel processing

Some functions in R can call parallel numerical libraries.

For instance, on the LONI QB2 cluster most linear algebraic and related functions (e.g. linear regression, matrix decomposition, computing inverse and determinant of a matrix) leverage the multi‐threaded Intel MKL library.

In this case, no extra coding is needed to take advantage of the multiple CPU cores - those functions will automatically use multiple cores when being called.
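
As a rough sketch, you can check which numerical libraries your R session is linked against and time a linear algebra call. Whether the call actually runs on several cores depends on that library, so the timing will differ from system to system; the matrix size is an arbitrary choice.

```r
# Which BLAS/LAPACK libraries is this R session linked against?
extSoftVersion()["BLAS"]
La_library()

# A linear algebra workload that a multi-threaded BLAS can spread over cores.
set.seed(1)
n <- 500
m <- matrix(rnorm(n * n), nrow = n)
system.time(mm <- crossprod(m))   # computes t(m) %*% m
dim(mm)
```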

## Explicit parallel processing

If the implicit option is not available for what you'd like to do, you need to write some parallel code yourself.

Here is an example that uses the **%dopar%** operator (from the **foreach** package, which **doParallel** loads) with the **doParallel** package.

The workload is to generate 100 random samples, each with
1,000,000 observations from a standard normal distribution, then take a summary for each sample.

In [None]:
iters <- 100

Below is the sequential version with a for loop. The **system.time()** function is used to measure how long it takes to process the workload.

In [None]:
# This code segment shows us how long it takes to run on one core.
system.time(
for (i in 1:iters) {
  to.ls <- rnorm(1e6)
  to.ls <- summary(to.ls)
}
)

This is the parallel example with the **doParallel** package.

In [None]:
# This code segment shows us how long it takes to run on all available cores.
library(doParallel)

# Obtain the number of cores available.
ncpu <- detectCores()
ncpu

system.time({
  cl <- makeCluster(ncpu)
  registerDoParallel(cl)
  ls<-foreach(icount(iters)) %dopar% {
    to.ls<-rnorm(1e6)
    to.ls<-summary(to.ls)
  }
  stopCluster(cl)
})
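
The same kind of workload can also be written with the **parallel** package that ships with R. This is a minimal sketch, not a replacement for the example above: it assumes two workers and a smaller sample size so that it finishes quickly, and the seed value is arbitrary.

```r
library(parallel)

cl <- makeCluster(2)                 # two workers; an arbitrary choice for this sketch
clusterSetRNGStream(cl, iseed = 42)  # reproducible random numbers on the workers

# Each task draws 1e5 normal samples and summarizes them.
res <- parLapply(cl, 1:8, function(i) summary(rnorm(1e5)))
stopCluster(cl)

length(res)
res[[1]]
```

**parLapply()** returns a list with one element per task, much like **lapply()**.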

If you are interested in learning more, please visit the [Parallel Computing in R](http://www.hpc.lsu.edu/training/weekly-materials/2017-Fall/HPC_Parallel_R_Fall2017.pdf) tutorial from LSU HPC.

# Exercise 5



## Introduction

The World Happiness Report is a landmark survey of the state of global happiness. The report continues to gain global recognition as governments, organizations and civil society increasingly use happiness indicators to inform their policy-making decisions. Leading experts across fields – economics, psychology, survey analysis, national statistics, health, public policy and more – describe how measurements of well-being can be used effectively to assess the progress of nations. The reports review the state of happiness in the world today and show how the new science of happiness explains personal and national variations in happiness.

In each year, six metrics are generated for each country:
* Economic production
* Social support
* Life expectancy
* Freedom
* Absence of corruption
* Generosity



[Data source](https://www.kaggle.com/ajaypalsinghlo/world-happiness-report-2021)

[World happiness report](https://worldhappiness.report)

## Datasets

Each dataset contains variables such as country name, year, and the scores for the six metrics.

2008-2020:
http://www.hpc.lsu.edu/training/weekly-materials/Downloads/world-happiness-report.csv

2021:
http://www.hpc.lsu.edu/training/weekly-materials/Downloads/world-happiness-report-2021.csv

## Tasks

1. Download both datasets and read them into R; 
2. Inspect the datasets (the data structure, what the columns are, etc.);
3. Merge the datasets so the data covers 2008 to 2021;
4. Using the merged dataset, answer the following questions:
  * In 2011, which five countries rank highest and which five rank lowest in freedom to make life choices?
  * Among the 50 countries with the highest life expectancy in 2021, how many are in western Europe?
  * How has the average generosity over all countries changed from 2008 to 2021? 
  * From 2011 to 2021, which country's rank of perceptions of corruption rises the most? Which drops the most?

Note:

You can use the `read.csv()` function to read data in a csv file into R. For instance, to read the data in the file "mydata.csv" into a dataframe "mydataframe":

```
mydataframe <- read.csv("mydata.csv") 
```

To download a file, use the `download.file(<uri>,<file_name>)` function. For instance, to download the file "mydata.csv" to the current working directory:

```
download.file("http://url/to/mydata.csv","mydata.csv")
```
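
One detail worth knowing before inspecting the data: by default `read.csv()` turns column names into valid R names, so spaces in the header become dots. The sketch below shows this on a tiny made-up file written to a temporary location; the country names and values are invented for illustration.

```r
# Write a tiny CSV to a temporary file (made-up values, for illustration only).
csv_path <- tempfile(fileext = ".csv")
writeLines(c("Country name,year,Life Ladder",
             "Alpha,2020,5.1",
             "Beta,2020,6.3"), csv_path)

toy <- read.csv(csv_path)
colnames(toy)   # spaces in the header are replaced by dots
str(toy)
```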

In [None]:
#@title Solution

###### TASK 1 ######

# Download the files.
download.file("http://www.hpc.lsu.edu/training/weekly-materials/Downloads/world-happiness-report.csv","world-happiness-report.csv")
download.file("http://www.hpc.lsu.edu/training/weekly-materials/Downloads/world-happiness-report-2021.csv","world-happiness-2021.csv")

# Read the data into two dataframes.
rawdf2020 <- read.csv("world-happiness-report.csv")
rawdf2021 <- read.csv("world-happiness-2021.csv")

###### TASK 2 ######

# After this step, we need to inspect the data.
#str(rawdf2020)
#str(rawdf2021)

###### TASK 3 ######

# This is actually the toughest step, 
# as the columns in the two dataframes
# are not aligned. 

# In the merged data frame, we will need these columns:
# Country, region, year
# The six happiness metrics

df2020 <- rawdf2020[,1:9]
df2021 <- rawdf2021[,c(1:3,7:12)]

# Need to add the "year" column for 2021.
df2021$year <- 2021

# Match the regional indicator in the 2021
# dataset to the country names in the 2020 
# dataset.
df2020$Regional.indicator <- 
  df2021[match(df2020$Country.name,df2021[,1]),"Regional.indicator"]

# Reorder the columns
df2020r <- df2020[,c(1,10,2:9)]
df2021r <- df2021[,c(1:2,10,3:9)]
colnames(df2021r) <- colnames(df2020r)

# Rowbind the dataframes.
dataFinal <- rbind.data.frame(df2020r,df2021r)

# Drop the rows with missing values.
dataClean <- dataFinal[complete.cases(dataFinal),]

In [None]:
#@title Solution 4.1
# Question 1
happy2011 <- subset(dataClean, year == 2011)
cat("The top 5 countries are:\n")
happy2011[order(-happy2011$Freedom.to.make.life.choices),"Country.name"][1:5]
cat("The bottom 5 countries are:\n")
happy2011[order(happy2011$Freedom.to.make.life.choices),"Country.name"][1:5]

In [None]:
#@title Solution 4.2
# Question 2
happy2021 <- subset(dataClean, year == 2021)
table(happy2021[order(-happy2021$Healthy.life.expectancy.at.birth)[1:50],"Regional.indicator"])

In [None]:
#@title Solution 4.3
# Question 3
library(plyr)
plot(ddply(dataClean,"year",summarize,average=mean(Generosity)))

In [None]:
#@title Solution 4.4

# Store the ranking in a new variable in the 2011 and 2021 data frames.
happy2011$rank11 <- rank(happy2011$Perceptions.of.corruption, ties.method = "min")
happy2021$rank21 <- rank(happy2021$Perceptions.of.corruption, ties.method = "min")

# Merge the 2011 and 2021 data frames and keep only the country names and rankings.
happyRank <- merge(subset(happy2011,select=c(Country.name,rank11)),
  subset(happy2021,select=c(Country.name,rank21)),
  by = "Country.name")

# Calculate the ranking change and find the top 10 and bottom 10.
happyRank$diff <- happyRank$rank21 - happyRank$rank11
happyRank[order(happyRank$diff),][1:10,]
happyRank[order(-happyRank$diff),][1:10,]
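
To see how the ranking and merging steps behave, here is a toy sketch with three hypothetical countries and made-up scores. Note that `rank()` assigns 1 to the smallest value, which matters when interpreting whether a rank "rises" or "drops".

```r
# Made-up scores for three hypothetical countries in two years.
y1 <- data.frame(name = c("A", "B", "C"), score = c(0.9, 0.5, 0.7))
y2 <- data.frame(name = c("A", "B", "C"), score = c(0.4, 0.8, 0.6))

# rank() gives 1 to the smallest value.
y1$rank1 <- rank(y1$score, ties.method = "min")
y2$rank2 <- rank(y2$score, ties.method = "min")

# Keep only the name and rank columns, then join on the name.
both <- merge(y1[, c("name", "rank1")], y2[, c("name", "rank2")], by = "name")
both$diff <- both$rank2 - both$rank1
both[order(both$diff), ]
```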