<a href="https://colab.research.google.com/github/lsuhpchelp/lbrnloniworkshop2020/blob/master/day3/Introduction_to_R.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction to R

**DO NOT CHANGE THE RUNTIME BECAUSE YOU WON'T BE ABLE TO CHANGE IT BACK!**

In [0]:
#@title Run this segment first { display-mode: "form" }
install.packages("gcookbook")
install.packages("doParallel")
install.packages("plyr")
library(gcookbook)
library(datasets)
library(ggplot2)
library(lubridate)

# Outline

* Introduction
* How to run R code
* Basic syntax
* Data classes and objects in R
* Flow control structures
* Useful functions
* Managing R packages
* File operations
* Graphics
* Parallel processing



# Introduction

## What is R

* R is a programming language for statistical computing
  * Importing, storing, exporting and manipulating data
  * Conducting statistical analyses
  * Displaying the results by tables, graphs, etc.
* R is also a software environment for the development and implementation of new algorithms.
  * Many graphical user interfaces to R are available, both free and commercial (e.g. Rstudio and Revolution R (now Microsoft R)).
* R is being used by many disciplines
  * [A collection of repositories of R codes categorized by discipline](https://github.com/lsuhpchelp/r_collection)


## History of R

* R is a dialect of the S language
  * S was created in 1976 at the Bell Labs as an internal statistical analysis environment
  * The goal of S was "to turn ideas into software, quickly and faithfully".
  * The best-known implementation is S-PLUS (most recent stable release was in 2010). S-PLUS integrates S with a GUI and full customer support.
* R was created by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand.
* The R core group, which controls the source code of R (written in C), was formed in 1997
* The first stable version R 1.0.0 was released in 2000
* Latest stable version is 4.0.0 released on Apr 24, 2020

In [0]:
version

## Features of R

* Designed for statistical analysis, with rich data analysis functionalities and sophisticated graphical capabilities
* Available on most platforms and operating systems
* Active development and (very) active community
  * [CRAN](https://cran.r-project.org/): The Comprehensive R Archive Network
  * Source code and binaries, user contributed packages and documentation
  * More than 15,000 packages available on CRAN (as of May 2020)
    * Compared to 6,000 three years ago
* Free to use

# How to Run R Code

There are (at least) three options to run R code:
* Google Colaboratory
* Rstudio
* HPC clusters

## Colab

This is what we are using right now. It is the most convenient option: it is browser-based and needs no setup. The only thing you need is a Google account. On the other hand, we don't have much control over the environment. For instance, we can't choose which version of R to use and have very limited control over where the data is stored.

## Rstudio

![Rstudio screen shot](https://support.rstudio.com/hc/article_attachments/115020357707/Screen_Shot_2017-08-24_at_1.22.05_PM.png)

Rstudio is an integrated development environment (IDE) for R.

* Free to use: [Rstudio website](https://www.rstudio.com/)
* Its user interface is similar to that of other IDEs, dividing the screen into panes
  * Source code editor
  * Console
  * Workspace
  * Others (help message, plot etc.)
* Rstudio in a desktop environment is better suited for code development and/or a limited number of small jobs.

Rstudio also provides a collection of very useful cheat sheets about how to use R [here](https://www.rstudio.com/resources/cheatsheets/).

## HPC Clusters

The HPC clusters are good for resource‐demanding workloads, e.g. a few resource-intensive tasks or many small tasks.

# Basic Syntax

## Assignment

For new users, the biggest difference from other languages is perhaps the assignment operator: **R uses "<-" instead of "="**. 

In [0]:
x <- 2*4

The contents of the object "x" can be viewed by typing its name at the R prompt.

In [0]:
x

In [0]:
# What happens if we run this?
y

Actually, "=" works too, but there are some subtle differences, which are explained [here](https://renkun.me/2014/01/28/difference-between-assignment-operators-in-r/). 

In [0]:
y = 2 * 4
y

Again, "<-" is the recommended way of assigning a value to a variable.
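
As a minimal sketch of one of those subtle differences: inside a function call, "=" only names an argument, while "<-" performs an assignment in the calling workspace. The helper `show_total()` and the name `demo_arg` are made up for this illustration, and the sketch assumes no object called `demo_arg` exists beforehand.

```r
# show_total() is a hypothetical helper used only for this illustration.
show_total <- function(demo_arg) sum(demo_arg)

# "=" inside a call only matches the argument name; no variable is created.
total_eq <- show_total(demo_arg = 1:3)
exists_after_eq <- exists("demo_arg")

# "<-" inside a call creates demo_arg in the workspace, then passes its value on.
total_arrow <- show_total(demo_arg <- 1:3)
exists_after_arrow <- exists("demo_arg")

c(exists_after_eq, exists_after_arrow)
```

Both calls return the same total; only the second one leaves a new object behind.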

## Comment

In R, everything from "#" to the end of a line is interpreted as a comment.

In [0]:
# z <- 2*4
# Nothing will happen.

## Legal R Names

Names for R objects can be any combination of letters, numbers, underscores (_) and periods (.), but must not start with a number or an underscore (or with a period followed by a number).

These are legal names:

In [0]:
num.Cats.2 <- 4
num.Cats.2
num_Cats <- 5
num_Cats

These are not:

In [0]:
# Start with "_"
_num.cats <- 5

In [0]:
# "-" is not allowed.
num-cats <- 5

In [0]:
# Start with a number.
2cats <- 3
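
One way to check a candidate name without triggering a syntax error is the base function **make.names()**, which rewrites a string into a syntactically valid name; a string that comes back unchanged was already legal. The sketch below reuses the names from the cells above (the vector name `candidates` is just for illustration).

```r
# Candidate names taken from the examples above.
candidates <- c("num.Cats.2", "num_Cats", "_num.cats", "num-cats", "2cats")

# make.names() repairs illegal names; legal names are returned unchanged.
is_legal <- make.names(candidates) == candidates

data.frame(name = candidates, legal = is_legal)
```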

R is case sensitive, e.g. X and x are different in R.

In [0]:
x <- 4
cat("The value of x is:",x)
cat("The value of X is:",X)

## Arithmetic operations

Basic arithmetic operators: +, -, *, /, ^

In [0]:
1 + 2*4^(3/5)

Scientific notation: 1e-2

In [0]:
1e2 + 1e-2

Special values: Inf and -Inf (positive and negative infinity), NaN (not a number)

In [0]:
1 / 0
-1 / 0
1/0 - 1/0
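
Special values can be detected with the base functions **is.finite()**, **is.nan()** and **is.na()**. A short sketch (the vector `vals` is made up for illustration):

```r
# A vector mixing infinite, undefined, missing and ordinary values.
vals <- c(1/0, -1/0, 0/0, NA, 1)

finite_flags <- is.finite(vals)  # TRUE only for ordinary numbers
nan_flags <- is.nan(vals)        # TRUE only for NaN
na_flags <- is.na(vals)          # TRUE for both NaN and NA

rbind(finite_flags, nan_flags, na_flags)
```

Note that `is.na()` treats NaN as missing as well, while `is.nan()` does not treat NA as NaN.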

## Comparisons and logical operations

Comparisons that will return a logical value:
* Less than: ```<```
* Less than or equal to: <=
* Greater than: >
* Greater than or equal to: >=
* Equal to: ==
* Not equal to: ```!=```

In [0]:
1 > 2

In [0]:
1 != 2
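
A word of caution about `==` with decimal numbers: values are stored in binary floating point, so two results that look equal may differ in the last bits. A minimal sketch of the usual workaround, comparing with a tolerance through the base function **all.equal()**:

```r
lhs <- 0.1 + 0.2

# Exact comparison fails because of floating-point rounding.
exact_match <- lhs == 0.3

# all.equal() compares with a small tolerance; wrap it in isTRUE() to get a plain logical.
close_match <- isTRUE(all.equal(lhs, 0.3))

c(exact_match, close_match)
```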

Logical operations:
* NOT: !
* AND (elementwise): &
* OR (elementwise): |

In [0]:
# The negation of 1 < 2
! 1 < 2

In [0]:
# Assign initial values for a and b.
a <- 4
b <- 5
# Logical expressions.
cat("a < 10 is", a < 10, "\n")
cat("b < 3 is", b < 3, "\n")
# Are both expressions TRUE?
cat("a < 10 & b < 3 is", a < 10 & b < 3, "\n") 
# Is at least one of the expressions TRUE?
cat("a < 10 | b < 3 is", a < 10 | b < 3)

## Getting help

Getting help is straightforward in R. 

For information about specific functions or objects, use **?\<name of function>**. 

In [0]:
?class

For a keyword search, use **??\<keyword>**

In [0]:
??assignment

# Data Classes And Objects

## Atomic Data Types

R has five atomic classes:



* Numeric (double)
  * Numbers in R are treated as numeric unless specified otherwise.

In [0]:
# The class() function reveals the class of an R object.
class(9.3)
class(3)

* Integer

In [0]:
# The as.integer() function "casts" a variable into integer.
class(as.integer(3))

* Complex

In [0]:
class(3+2i)

* Character

In [0]:
# Both are categorized as characters.
class("a")
class("a cat")
# What about this?
class(a)

* Logical (T, TRUE, F, FALSE)
  * Note that they must be upper case



In [0]:
class(TRUE)
class(T)
class(True)

`NA` is a logical constant R uses to denote a missing value.

In [0]:
class(NA)

The **is.\<type>()** functions, which return logical values, can be used to check for the data classes too. 

In [0]:
a <- 3
# Is "a" numeric?
is.numeric(a)
# Is "a" an integer?
is.integer(a)

## Derivative Data Types

There are many derivative data types which are built using the atomic ones. For example, the "Date" type.

In [0]:
# The function today() (from the lubridate package) returns today's date.
mydate <- today()
mydate
class(mydate)

## Data Objects

Now let's look at the data objects in R. They are:

* Vector: elements of same class, one dimension
* Matrix: elements of same class, two dimensions
* Array: elements of same class, 2+ dimensions
* Lists: elements can be any objects
* Data frames: “datasets” where columns are variables and rows are observations

### Vectors



Vectors contain elements of the **same** data type.

Vectors can be constructed by 
* The **c()** function (concatenate):

In [0]:
myvec <- c(1,2,3)
myvec
# myvec is a numeric vector, so class() will report "numeric".
class(myvec)
is.vector(myvec)

In [0]:
myvec <- c("a","b","c")
myvec
class(myvec)
is.vector(myvec)

* The **vector()** function
  * The vector will be initialized to the default value (0 for numeric vectors).

In [0]:
myvec <- vector("numeric", length = 10)
myvec

* The **seq()** and **rep()** functions, or the ":" operator

In [0]:
myvec <- seq(from=2,to=10,by=2)
myvec
# When calling a function in R, the argument names are optional, so the code below does exactly the same thing.
myvec <- seq(2,10,2)
myvec

In [0]:
myvec <- seq(from=2,to=10,length=5)
myvec
# When calling a function in R, the argument names are optional; without the argument names, the order of arguments is very important!!!
myvec <- seq(2,10,5)
myvec

In [0]:
myvec <- rep(5,6)
myvec

In [0]:
myvec <- 1:15
myvec
class(myvec)

* Or a combination of all of them

In [0]:
# How many elements would myvec end up with?
myvec <- c(1,8:10,rep(2:4,3))
myvec

You can convert an object to a different type using the **as.\<TYPE>()** functions.

Note: the output will be an object of the specified type, while the input remains untouched.

In [0]:
# myvec is an integer vector.
myvec <- 1:3
myvec
# The output of as.character(myvec) is a character vector.
as.character(myvec)
# myvec is still an integer vector.
myvec

When converting to logical values, a numeric "0" will be FALSE while all non-zeroes will be TRUE.

In [0]:
as.logical(0:3)

Coercion will occur when objects of mixed types are passed to the **c()** function, as if an **as.\<Type>()** function were explicitly called.

In [0]:
# Which data type would myvec be?
myvec <- c(1e3,"a")
class(myvec)

How about this one?

In [0]:
c(1.7,"a")

And this one?

In [0]:
c(T,2)

Caution: type coercion may happen **without you being aware of it and may have unintended results**.
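
A small sketch of how this can bite in practice. The `readings` vector is hypothetical: a single non-numeric entry forces the whole vector to character, and converting it back yields an `NA` that silently propagates through later calculations.

```r
# Hypothetical measurements where one entry was recorded as text.
readings <- c("12.5", "13.1", "n/a")

# as.numeric() turns "n/a" into NA (with a warning, suppressed here).
values <- suppressWarnings(as.numeric(readings))
values

# The NA propagates unless it is handled explicitly.
mean_with_na <- mean(values)
mean_without_na <- mean(values, na.rm = TRUE)
c(mean_with_na, mean_without_na)
```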

One can use the [\<index>] operator to access individual elements in a vector. Note that in R the indices start from 1.

In [0]:
myvec <- 1:10
# myvec[4] points to the 4th element in the vector.
myvec[4]

For multiple elements with multiple indices, use the **c()** function (the indices need to be passed as a vector themselves).

In [0]:
# This will not work:
myvec[1,4]

In [0]:
# But this will:
myvec[c(1,4)]

Logical values can be used to access individual elements too.

In [0]:
myvec[c(T,T,F)]

Negative indices will drop the corresponding elements from the vector.

In [0]:
myvec[-6:-2]

**Important:** Lots of R operations process objects in a vectorized way
* More efficient, concise, and easier to read.

In [0]:
# vec1: 1 2 3 4
# vec2: 4 3 2 1
vec1 <- 1:4
vec2 <- 4:1

In [0]:
# Element-wise addition
vec1 + vec2

In [0]:
# Element-wise comparison
vec1 > 2

In [0]:
# Element-wise comparison
vec1 > vec2

In [0]:
# Equivalent to "show all vec1 elements that are greater than their counterparts in vec2".
vec1[vec1 > vec2]

Sometimes R takes it to the extreme.

In [0]:
# vec1: 1 2 3 4 5 
# vec2: 4 3 2 1
vec1 <- 1:5
vec2 <- 4:1
# Now vec1 and vec2 are of different length. Would this end up with an error? 
vec1+vec2
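
This behavior is called recycling: R reuses the elements of the shorter vector until it matches the length of the longer one. When the longer length is not a multiple of the shorter one, as in the cell above, R still returns a result but issues a warning. A minimal sketch with made-up vectors, using the base function **rep_len()** to spell out what happens implicitly:

```r
long <- 1:6
short <- c(10, 20)

# Implicit recycling: short is reused three times.
implicit <- long + short

# The same computation with the recycling written out.
explicit <- long + rep_len(short, length(long))

implicit
identical(implicit, explicit)
```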

### Matrices


In R, elements in matrices must be of the same type as well. Actually, matrices are merely vectors with a "dimension" attribute. 


Assigning a "dim" attribute to a vector turns it into a matrix.

In [0]:
# What is "mymat" at this point?
mymat <- 1:12
mymat
dim(mymat)

In [0]:
# Assigning a "dim" attribute to mymat turns it into a matrix.
dim(mymat) <- c(3,4)
mymat

R matrices can be constructed by using the **matrix()** function as well.

In [0]:
matrix(1:12,nrow=3,ncol=4)

The **matrix()** function constructs matrices column‐wise by default. You can use the "byrow=T" option to switch to row-wise.


In [0]:
matrix(1:12,nrow=3,ncol=4,byrow=T)

Or by using the **cbind()** or **rbind()** functions.

In [0]:
# Treat each argument as a column and bind them side by side into a matrix.
cbind(1:3,4:6,7:9,10:12)

In [0]:
# Treat each argument as a row and bind them row-by-row into a matrix.
rbind(1:3,4:6,7:9,10:12)

One can use [\<index>,\<index>] to access individual elements.

In [0]:
mymat[3,4]
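
Leaving one index blank selects a whole row or column. By default the result is simplified to a plain vector; the `drop = FALSE` argument keeps it as a matrix. A short sketch with a fresh matrix `m` (the name is illustrative):

```r
m <- matrix(1:12, nrow = 3, ncol = 4)

second_row <- m[2, ]                    # plain vector
second_col <- m[, 2]                    # plain vector as well
col_as_matrix <- m[, 2, drop = FALSE]   # stays a 3 x 1 matrix

second_row
dim(second_col)      # NULL, because the dim attribute was dropped
dim(col_as_matrix)
```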

### Arrays

Arrays consist of elements of same class with a number of dimensions. Vectors and matrices are arrays of 1 and 2 dimensions.

In [0]:
# myarray will be a three-dimensional array
myarray <- array(data = 1:12,dim = c(2,2,3))
myarray
# Access an element using the indices.
myarray[1,1,2]
# Actually, it's fine to treat it as a one-dimensional vector.
myarray[5]
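
The reason `[1,1,2]` and `[5]` point to the same element is that R stores arrays in column-major order: the first index varies fastest. As a sketch, the position can be computed by hand and checked with the base function **arrayInd()**, which does the reverse mapping (the names `dims`, `arr` and `linear` are illustrative):

```r
dims <- c(2, 2, 3)
arr <- array(1:12, dim = dims)

# Position of element [i, j, k] in the underlying vector.
i <- 1; j <- 1; k <- 2
linear <- i + (j - 1) * dims[1] + (k - 1) * dims[1] * dims[2]
linear

# Both ways of indexing reach the same element.
arr[i, j, k] == arr[linear]

# arrayInd() converts a linear position back into (i, j, k).
back <- arrayInd(linear, dims)
back
```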

In [0]:
# Mold myarray into a matrix.
dim(myarray) <- c(3,4)
# Then you won't be able to access its element using three indices.
myarray[1,1,2]

### Lists

Lists are an ordered collection of objects, which can be of **different types or classes**.

Lists can be constructed by using the **list()** function.


In [0]:
# Mixing numeric, logical and character.
list(1,F,"a")

Members of a list do not have to be of atomic types, i.e. they can be vectors, matrices and even lists.

In [0]:
# Mixing numeric, vector (numeric), matrix (numeric), list
mylist <- list(1,1:5,matrix(1:6,2,3),list(1,F,"a"))
mylist

Lists can be indexed using the `[[ ]]` operator.

In [0]:
# The second element in mylist.
mylist[[2]]
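
Single brackets also work on lists, but they return a sub-list rather than the element itself. The sketch below uses a small made-up list to show the difference:

```r
demo_list <- list(1, 1:5, "a")

as_sublist <- demo_list[2]     # a list of length 1 that contains the vector
as_element <- demo_list[[2]]   # the integer vector itself

class(as_sublist)
class(as_element)

# Functions that expect a vector need the element, not the sub-list.
total <- sum(as_element)
total
```

Calling `sum(demo_list[2])` instead would fail, because a list is not a valid argument to `sum()`.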

The indices can be nested.

In [0]:
# The second element of the fourth element (a list) in my list.
mylist[[4]][[2]]
# The [1,3] element of the third element (a matrix) in my list.
mylist[[3]][1,3]

Elements of R objects can have names.

Names can be specified when an object is created.

In [0]:
list(inst="LSU",location="Baton Rouge",state="LA")

Or they can be specified later with the **names()** function.

In [0]:
names(mylist) <- c("num","vec","mat","lst")
mylist

Names can be used to access elements in a data object using the `$` operator. When there are many elements, this could be more convenient than using the index.

In [0]:
# This is equivalent to mylist[[4]]
mylist$lst

Indexing operations by names and indices can be nested and mixed.

In [0]:
names(mylist$lst) <- c("c1","c2","c3")
# The "c2" element of the "lst" element in the list "mylist".
mylist$lst$c2
# The same thing.
mylist$lst[[2]]
# Again, the same thing.
mylist[[4]][[2]]

### Data frames

Data frames are used to store tabular data.
* They are a special type of **list**, where each element is an R vector (a "column" or "variable") and all elements must have the same length.
* The elements (columns) can be of different classes.
* Data frames have special attributes such as row.names.
* Data frames can be created by reading data files, using functions such as **read.table()** or **read.csv()** (more on this later).


Data frames can be created directly by calling the **data.frame()** function.

In [0]:
mydf <- data.frame(c(31,40,50), c("M","F","M"), stringsAsFactors = F)
mydf
is.vector(mydf[[2]])

We usually name the columns so that they are more meaningful.

In [0]:
names(mydf) <- c("age","sex")
mydf

Row names can be specified as well.

In [0]:
row.names(mydf) <- c("obs1","obs2","obs3")
mydf

To access individual elements in a data frame, there are a few options:

* Numeric indices

In [0]:
# First row, second column
mydf[1,2]

* Row and column names

In [0]:
# Sex of the observation #1 (same element as above)
mydf["obs1","sex"]

* Or a mix of indices and names

In [0]:
mydf["obs1",2]

Since data frames are lists, both `[[ ]]` and `$` operators work.

In [0]:
mydf[[2]]

In [0]:
mydf$sex

You could select rows by leaving the other index blank:

In [0]:
# The first two rows
mydf[1:2,]

And vice versa:

In [0]:
# The "sex" column
mydf[,"sex"]

We can subset a data frame like this:

In [0]:
# Both columns for obs 1 and 3
mydf[c(1,3),c("age","sex")]

Or using a vector of logical values:

In [0]:
# This gives us a logical vector.
mydf$sex == "M"
# Which can be used to obtain a subset of mydf.
mydf[mydf$sex == "M",]
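
Conditions can be combined with `&` and `|`, and the base function **subset()** offers a shorter way to write the same filter because column names can be used directly. A self-contained sketch with a made-up data frame `people` (the age threshold of 35 is arbitrary):

```r
people <- data.frame(age = c(31, 40, 50), sex = c("M", "F", "M"))

# Logical indexing with two conditions.
older_men <- people[people$sex == "M" & people$age > 35, ]

# The same filter written with subset().
older_men_subset <- subset(people, sex == "M" & age > 35)

older_men
older_men_subset
```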

## Querying Object Attributes

There are a few functions in R that help us obtain information about an object.

We will work with the "mtcars" data frame in this section.

In [0]:
?mtcars

We have already seen the **class()** function, which reveals the type of an object.

In [0]:
class(mtcars)

The **length()** function shows the length of an object.

In [0]:
length(mtcars)

The **nrow()** function counts the number of rows in a data frame.

In [0]:
nrow(mtcars)

The **dim()** function reveals the dimension of an object.

In [0]:
dim(mtcars)

The **attributes()** function reveals attributes of an object.

In [0]:
attributes(mtcars)

The **str()** function shows the internal structure of an R object.


In [0]:
str(mtcars)
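
A few more base functions are handy for a first look at a data frame: **names()**, **ncol()** and **head()**. Because a data frame is a list of columns, `length()` reports the number of columns, not the number of rows. A brief sketch:

```r
# Column (variable) names.
col_names <- names(mtcars)
col_names

# For a data frame, length() and ncol() agree.
n_cols <- ncol(mtcars)
length(mtcars) == n_cols

# The first three rows.
first_rows <- head(mtcars, 3)
first_rows
```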

## Exercise 1

1. Learn about the **airquality** data frame and answer the following questions:
  * What is the source of the data?
  * How many rows and columns are there in the data frame? 
  * What does each column represent?
2. Find the percentage of days when the high temperature measured at La Guardia Airport exceeded 70 degrees Fahrenheit.
3. Find the number of days when the wind speed is between 10 and 20 miles per hour.

In [0]:
#@title Solution { display-mode: "form" }
# question 1
?airquality

attach(airquality)

# question 2
nrow(airquality[Temp > 70,])/nrow(airquality)*100

# question 3
nrow(airquality[Wind > 10 & Wind < 20,])

detach(airquality)

# Flow Control Structures

Flow control structures in R, which allow one to control the flow of execution, are similar to other languages.

## Condition testing

Test a condition with the **if...else if...else** structure:

```
if (condition 1 is true) {
  do something
} else if (condition 2 is true) {
  do something else
} else {
  do something more
}
```



In [0]:
# Each row of mtcars is one car model, so count rows (length() would count columns).
if (nrow(mtcars) > 3) {
  print("We have more than 3 car models!")
}
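
A sketch of the full **if...else if...else** chain. The function `classify_mpg()` and its thresholds (25 and 18) are made up for illustration. Since `if` expects a single logical value, the vectorized base function **ifelse()** is used to label a whole column at once.

```r
# Hypothetical helper: label a single mpg value.
classify_mpg <- function(mpg) {
  if (mpg >= 25) {
    "high"
  } else if (mpg >= 18) {
    "medium"
  } else {
    "low"
  }
}

classify_mpg(30)
classify_mpg(21)
classify_mpg(12)

# Vectorized version for a whole column.
mpg_labels <- ifelse(mtcars$mpg >= 25, "high",
                     ifelse(mtcars$mpg >= 18, "medium", "low"))
table(mpg_labels)
```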

## Loops

Loops with ```for```:

```
for (variable in sequence) {
  statements
}
```

In [0]:
for (i in 1:10) {
  print(i^3)
}
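
When looping over the positions of a vector, `seq_along(x)` is a safer choice than `1:length(x)`. If the vector happens to be empty, `1:0` counts down and produces the two values 1 and 0, so the loop body runs twice instead of not at all. A minimal sketch (variable names are illustrative):

```r
empty <- numeric(0)

# 1:length(empty) is 1:0, i.e. c(1, 0), so the body runs twice.
runs_colon <- 0
for (i in 1:length(empty)) {
  runs_colon <- runs_colon + 1
}

# seq_along(empty) is empty, so the body never runs.
runs_seq <- 0
for (i in seq_along(empty)) {
  runs_seq <- runs_seq + 1
}

c(runs_colon, runs_seq)
```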

Compared to other languages, loops are not as frequently used in R because many operations and functions are inherently vectorized. 

For instance, this line of code computes the same values as the code block above:

In [0]:
(1:10)^3

In addition, the **apply()** family of functions is very useful for performing an operation over all elements of a vector or list (see next section).

# Useful Functions

## Simple Statistic Functions

* **min()**: Minimum value
* **max()**: Maximum value
* **which.min()**: Location of minimum value
* **which.max()**: Location of maximum value
* **sum()**: Sum of the elements of a vector
* **mean()**: Mean of the elements of a vector
* **sd()**: Standard deviation of the elements of a vector
* **quantile()**: Show quantiles of a vector
* **summary()**: Display descriptive statistics

In [0]:
mean(mtcars$mpg)

In [0]:
# The summary function will report a few statistics for each variable in the data frame.
summary(mtcars)

## Distributions and Random Variables

For each statistical distribution below, R provides four functions: density (d), cumulative distribution (p), quantile (q), and random generation (r). 


| Distribution | Name in R |
|---|---|
| Uniform | `unif` |
| Binomial | `binom` |
| Poisson | `pois` |
| Geometric | `geom` |
| Gamma | `gamma` |
| Normal | `norm` |
| Log Normal | `lnorm` |
| Exponential | `exp` |
| Student's t | `t` |

The function name is of the form **[d|p|q|r]\<name of
distribution\>**. For example,  **qbinom()** gives the quantile of a binomial distribution.
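
As a sketch of the naming scheme, here are the d, p and q functions for a binomial distribution with 10 trials and success probability 0.5 (parameters chosen only for illustration):

```r
# d: probability of exactly 3 successes.
p_exactly_3 <- dbinom(3, size = 10, prob = 0.5)

# p: probability of at most 3 successes (cumulative).
p_at_most_3 <- pbinom(3, size = 10, prob = 0.5)

# q: the smallest number of successes whose cumulative probability reaches 0.5.
median_successes <- qbinom(0.5, size = 10, prob = 0.5)

c(p_exactly_3, p_at_most_3, median_successes)
```

The cumulative value is simply the sum of the densities up to that point: `sum(dbinom(0:3, 10, 0.5))` gives the same number as `pbinom(3, 10, 0.5)`.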

Generate a random sample of 10 from the standard normal distribution:

In [0]:
# Each time it is run, the sample would be different.
rnorm(10,mean=0,sd=1)

The cumulative probability at 1.96 and its inverse function (standard normal distribution):

In [0]:
pnorm(1.96)
qnorm(pnorm(1.96))
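
To turn a z-score into a two-sided p-value, take the upper-tail probability and double it. A minimal sketch for z = 1.96, the familiar 5% cut-off:

```r
z <- 1.96

# Upper-tail probability; pnorm(z, lower.tail = FALSE) gives the same value.
upper_tail <- 1 - pnorm(z)

# Two-sided p-value under the standard normal distribution.
two_sided_p <- 2 * upper_tail
two_sided_p
```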

When generating random samples, setting the seed to the same value will generate the same sample.

In [0]:
# A "true" random sample (varies every time)
rnorm(10,mean=0,sd=1)
# A sample with seed "15" (mainly for debugging purpose)
set.seed(15)
rnorm(10,mean=0,sd=1)
set.seed(15)
rnorm(10,mean=0,sd=1)

## Sorting

Sort and order elements: `sort()`, `rank()` and `order()`.

By default, the **sort()** function sorts the values in a vector into ascending order.

In [0]:
# This is the original order.
mtcars$mpg
# This is the sorted order.
sort(mtcars$mpg)

The ```decreasing=T``` option will sort a vector into descending order.

In [0]:
sort(mtcars$mpg, decreasing=T)

In contrast, the **order()** function returns the indices that would put the vector in sorted order.


In [0]:
order(mtcars$mpg)
order(mtcars$mpg, decreasing=T)
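
The three functions are easy to mix up, so here is a tiny sketch on a made-up vector: `sort()` returns the values, `order()` returns the positions to pick them from, and `rank()` returns the place each original element takes in the sorted result.

```r
x <- c(30, 10, 20)

sorted <- sort(x)      # 10 20 30
positions <- order(x)  # 2 3 1: the smallest value sits at position 2
places <- rank(x)      # 3 1 2: the first element is the third smallest

# Indexing with order() reproduces sort().
identical(x[positions], sorted)
```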

Users can use the indices returned by **order()** to change the order of rows in a data frame.

In [0]:
# The data frame in the original order.
mtcars

In [0]:
# The data frame reordered according to the values of the mpg variable.
mtcars[order(mtcars$mpg),]

## Table

The **table()** function tabulates factors, i.e. it counts how often each value (or combination of values) occurs.


For instance, in the **mtcars** data frame, we can get the frequency table by the combination of the numbers of cylinders and gears: 

In [0]:
table(mtcars[,c("cyl","gear")])
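
The same function works on a single variable, and the base function **prop.table()** converts the counts into proportions. A short sketch:

```r
# Number of cars for each cylinder count.
cyl_counts <- table(mtcars$cyl)
cyl_counts

# Share of each group, shown as percentages.
cyl_share <- prop.table(cyl_counts)
round(100 * cyl_share, 1)
```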

## The Apply family of functions

The **apply()** function evaluates a function over the margins of an array
* More concise than for loops (but not necessarily faster)

Syntax:

```
 apply(data,dimension,function,function parameters)
 ```

For example, if we want to calculate the mean of each variable in **mtcars**:

In [0]:
# Apply the mean() function to the columns (variables) of mtcars.
apply(mtcars,2,mean)


It can perform multiple calculations in one function call:

In [0]:
# Find the 1st and 3rd quartiles of each variable in mtcars.
apply(mtcars, 2, quantile, probs = c(0.25, 0.75))

Other members of the **apply()** family include:
* **lapply** - Loop over a list (data frame) and evaluate a function on each element
* **sapply** - Same as **lapply** but simplifies the result to a vector or array when possible
* **tapply** - Apply a function over subsets of a vector
* **mapply** - Multivariate version of **sapply**
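
A brief sketch of three of these functions on **mtcars** (the result names are illustrative):

```r
# lapply() returns a list with one element per column.
means_list <- lapply(mtcars, mean)
class(means_list)

# sapply() simplifies the same result to a named numeric vector.
means_vec <- sapply(mtcars, mean)
means_vec

# tapply() applies a function to subsets of a vector, here mpg grouped by cyl.
mpg_by_cyl <- tapply(mtcars$mpg, mtcars$cyl, mean)
mpg_by_cyl
```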

## The ```plyr``` package

Suppose that, with the **mtcars** data frame, we want to know the average mileage-per-gallon for cars with 4, 6 and 8 cylinders. How do we do that?

We will need to 
* **split** the data into subsets according to the value of the "cyl" column (one for cyl==4, one for cyl==6 and one for cyl==8)
* **apply** the mean function to the "mpg" column
* **combine** the results from each subset

In [0]:
# Subset where cyl==4
mean(mtcars[mtcars$cyl==4,"mpg"])
# Subset where cyl==6
mean(mtcars[mtcars$cyl==6,"mpg"])
# Subset where cyl==8
mean(mtcars[mtcars$cyl==8,"mpg"])

The "split-apply-combine" pattern is very common in data analysis, where you solve a complex problem by breaking it down into small pieces, doing something to each piece and then combining the results back together again.

The ```plyr``` package provides a group of functions that implement this split-apply-combine pattern.

For example, the **ddply()** function takes a data frame, splits it according to the condition you supply, applies a function to each piece, then combines the results into a new data frame.

In [0]:
library(plyr)
ddply(mtcars,"cyl", summarize, AverageMPG=mean(mpg))
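
For comparison, base R can express the same split-apply-combine step without an extra package. The sketch below uses the formula interface of the base function **aggregate()**; it is an alternative to, not a replacement for, the plyr approach shown above.

```r
# Average mpg for each value of cyl, using only base R.
avg_mpg <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)
avg_mpg
```

The result is a data frame with one row per cylinder count, with the averages stored in the `mpg` column.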

## User-defined functions

* Users can define their own functions in R by using the ```function()``` construct. 
* The return value is the last expression evaluated in the function body (or the argument of an explicit ```return()``` call).
* Functions can be nested.
* Functions are R objects and can be passed as arguments to other functions.

Syntax to define a function:


```
function_name <- function (arguments) {
  statements
}
```



In [0]:
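# Create a function pow(), which takes two arguments and returns x to the power of y.
pow <- function(x, y) {
  result <- x^y
  # End on the value itself; a trailing assignment would return it invisibly (nothing printed).
  result
}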

Then it can be called like any other function:

In [0]:
pow(4,2)

Functions can be used as an argument for other functions.

In [0]:
# Define a new function, which takes a function as one of the arguments.
myfunc <- function(func,a,b) {
  result <- func(a,b) - 1
  result
}

# Store the result under a name other than "c" to avoid masking the c() function.
res <- myfunc(pow,4,2)
res

res <- myfunc(rep,1:2,5)
res
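
Two more features that come up often are default argument values and anonymous functions. A minimal sketch (the function `scale_by()` is made up for illustration):

```r
# "factor" has a default value, so it can be omitted when calling.
scale_by <- function(x, factor = 2) {
  x * factor
}

doubled <- scale_by(1:3)               # uses the default factor of 2
tenfold <- scale_by(1:3, factor = 10)  # overrides the default

# An anonymous function, defined inline and passed straight to sapply().
squares <- sapply(1:3, function(v) v^2)

doubled
tenfold
squares
```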

## Exercise 2

Using the **airquality** data, find the average wind speed for the 10 hottest days.



In [0]:
#@title Hint
# Use the order() function

In [0]:
#@title Solution { display-mode: "form" }
mean(airquality[order(airquality$Temp,decreasing=T),"Wind"][1:10])

## Exercise 3

Using the **airquality** data, find the average high temperature of each month.

In [0]:
#@title Hint
# Use the ddply() function

In [0]:
#@title Solution { display-mode: "form" }
ddply(airquality,"Month",summarize,AvgTemp=mean(Temp))

# Managing R Packages

To load an R package so you can use the functions included in it, use the **library()** or **require()** function:

In [0]:
library(lubridate)
require(devtools)

The main difference is that, if a package is not installed, **library()** throws an error and execution stops, while **require()** issues a warning, returns FALSE, and execution continues.

In [0]:
library(reshape)
print("End of code segment")

In [0]:
require(reshape)
print("End of code segment")

If a package is not available, the **install.packages()** function can be used to install it.

In [0]:
install.packages("reshape")
library(reshape)
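
A common pattern is to install a package only when it is missing. The sketch below relies on the base function **requireNamespace()**, which returns TRUE or FALSE without attaching the package; the helper `is_installed()` and the nonexistent package name are made up for illustration.

```r
# Hypothetical helper: TRUE if the package can be loaded, FALSE otherwise.
is_installed <- function(pkg) {
  requireNamespace(pkg, quietly = TRUE)
}

has_stats <- is_installed("stats")                 # ships with R
has_fake <- is_installed("notARealPackage12345")   # assumed not to exist

c(has_stats, has_fake)

# Typical use (commented out so that nothing is installed by this sketch):
# if (!is_installed("reshape")) install.packages("reshape")
```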

Multiple packages can be installed with one call of the **install.packages()** function.

In [0]:
require(datarium)
require(BiocManager)
install.packages(c("datarium","BiocManager"))
library(datarium)
library(BiocManager)

Note that quotation marks are **NOT** needed when loading packages, but are necessary when installing them.

Use the **remove.packages** function to remove installed packages.

In [0]:
remove.packages("datarium")
library(datarium)

The **update.packages** function updates installed packages.

In [0]:
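# The first positional argument of update.packages() is a library path, not a package name,
# so name the package through oldPkgs. ask = FALSE skips the interactive prompt.
update.packages(oldPkgs = "lubridate", ask = FALSE)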

List **installed** packages.

In [0]:
installed.packages()

List all **loaded** packages (and attached objects).

In [0]:
search()

# File And Directory Operations

## Query working directory

Each R session has a working directory. The **getwd()** function shows the current working directory.

In [0]:
getwd()

The **list.files()** and **list.dirs()** functions list the files and subdirectories.

In [0]:
list.files()
list.dirs()

To change the working directory, use the **setwd()** function.

In [0]:
setwd("/content")
getwd()

## Handling files

R has a set of file.\<operation> functions.

Check if a file exists:

In [0]:
file.exists("testfile")

Create an empty file:

In [0]:
file.create("testfile")
file.exists("testfile")

Copy and delete files:

In [0]:
list.files()
file.copy("testfile","anotherfile")
list.files()
file.remove("testfile","anotherfile")
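
These functions pair naturally with **write.csv()** and **read.csv()**, which were mentioned in the data frame section. A minimal round-trip sketch: the data frame `people` is made up, and a temporary file from **tempfile()** is used so that nothing is left behind in the working directory.

```r
people <- data.frame(age = c(31, 40, 50), sex = c("M", "F", "M"))

# Write the data frame to a temporary CSV file (without row names).
csv_path <- tempfile(fileext = ".csv")
write.csv(people, csv_path, row.names = FALSE)
written <- file.exists(csv_path)

# Read it back into a new data frame.
people_back <- read.csv(csv_path, stringsAsFactors = FALSE)
str(people_back)

# Clean up.
removed <- file.remove(csv_path)
```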

# Graphics

There are three plotting systems in R
* base
  * Convenient, but hard to adjust after the plot is created
* lattice
  * Good for creating conditioning plots
* ggplot2
  * Powerful and flexible, with many tunable features; may require some time to master


Each has its pros and cons, so it is up to the users which one to choose.

## base

A few functions are available in the base plotting system
* **plot()**: line and scatter plots
* **boxplot()**: box plots
* **hist()**: histograms

A quick scatter plot example with the base plot system.

In [0]:
# Create the plot with title and axis labels.
plot(pressure,type="l",
     main="Vapor Pressure of Mercury",
     xlab="Temperature", 
     ylab="Vapor Pressure")
# Add points
points(pressure,col='red') 
# Add annotation
text(150,700,"Source: Weast, R. C., ed. (1973) Handbook\nof Chemistry and Physics. CRC Press.")
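
The other two base functions listed above follow the same style. A short sketch with **mtcars** (titles and labels are chosen for illustration); both functions also return, invisibly, the numbers behind the plot, which are captured here in `h` and `b`.

```r
# Histogram of fuel efficiency.
h <- hist(mtcars$mpg,
          main = "Miles per Gallon",
          xlab = "mpg")

# Box plots of mpg for each cylinder count, using the formula interface.
b <- boxplot(mpg ~ cyl, data = mtcars,
             xlab = "Number of cylinders",
             ylab = "mpg")

# The bin counts add up to the number of cars.
sum(h$counts)
```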

## ggplot2

The **qplot()** function is the ggplot2 version of **plot()**.

In [0]:
qplot(weightLb, heightIn, data=heightweight, geom="point")

The **ggplot()** function is the main function in the ggplot2 package.

Here is an example:

In [0]:
ggplot(heightweight, aes(x=weightLb, y=heightIn, color=sex, shape=sex)) + 
  geom_point(size=3.5) +
  ggtitle("School Children\nHeight ~ Weight") +
  labs(y="Height (inch)", x="Weight (lbs)") +
  stat_smooth(method=loess, se=T, color="black", fullrange=T) +
  annotate("text",x=145,y=75,label="Locally weighted polynomial fit with 95% CI",color="Green",size=6) +
  scale_color_brewer(palette = "Set1", labels=c("Female", "Male")) +
  guides(shape="none") +
  theme_bw() +
  theme(plot.title = element_text(size=20, hjust=0.5), 
        legend.position = c(0.9,0.2),
        axis.title.x = element_text(size=20), axis.title.y = element_text(size=20),
        legend.title = element_text(size=15),legend.text = element_text(size=15))

If you are interested in learning more, please visit the [Data Visualization in R](http://www.hpc.lsu.edu/training/weekly-materials/2018-Spring/Slides.html#(1)) tutorial from LSU HPC.

# Parallel Processing

Modern computers are equipped with more than one CPU core and are capable of processing workloads in parallel, but base R is single‐threaded, i.e. not parallel.

In other words, regardless of how many cores are available, base R uses only one of them by default.

There are two options to run R in parallel: **implicit** and **explicit**.


## Implicit parallel processing

Some functions in R can call parallel numerical libraries.

For instance, on the LONI QB2 cluster most linear algebraic and related functions (e.g. linear regression, matrix decomposition, computing inverse and determinant of a matrix) leverage the multi‐threaded Intel MKL library.

In this case, no extra coding is needed to take advantage of the multiple CPU cores - those functions will automatically use multiple cores when called.

## Explicit parallel processing

If the implicit option is not available for what you'd like to do, some code needs to be written.

Here is an example of using the **%dopar%** operator with the **doParallel** package.

The workload is to generate 100 random samples, each with
1,000,000 observations from a standard normal distribution, then take a summary for each sample.

In [0]:
iters <- 100

Below is the sequential version with a for loop. The **system.time()** function is used to measure how long it takes to process the workload.

In [0]:
# This code segment shows us how long it takes to run on one core.
system.time(
for (i in 1:iters) {
  to.ls <- rnorm(1e6)
  to.ls <- summary(to.ls)
}
)

This is the parallel example with the **doParallel** package.

In [0]:
# This code segment shows us how long it takes to run on all available cores.
library(doParallel)

# Obtain the number of cores available.
ncpu <- detectCores()
ncpu

system.time({
  cl <- makeCluster(ncpu)
  registerDoParallel(cl)
  ls<-foreach(icount(iters)) %dopar% {
    to.ls<-rnorm(1e6)
    to.ls<-summary(to.ls)
  }
  stopCluster(cl)
})
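
For comparison, the **parallel** package that ships with R offers parallel versions of the apply functions. The sketch below is a deliberately tiny example, not a benchmark: it assumes that two local worker processes can be started, and the workload (squaring eight numbers) is made up for illustration.

```r
library(parallel)

# Start two worker processes (an arbitrary choice for this sketch).
cl <- makeCluster(2)

# parSapply() is the parallel counterpart of sapply().
squares <- parSapply(cl, 1:8, function(i) i^2)

# Always release the workers when done.
stopCluster(cl)

squares
```

For a workload this small, the cost of starting the workers outweighs any gain; the pattern pays off only when each task takes a noticeable amount of time.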

If you are interested in learning more, please visit the [Parallel Computing in R](http://www.hpc.lsu.edu/training/weekly-materials/2017-Fall/HPC_Parallel_R_Fall2017.pdf) tutorial from LSU HPC.