# Before We Start

* **DO NOT CHANGE THE RUNTIME BECAUSE YOU WON'T BE ABLE TO CHANGE IT BACK!**

In [0]:
require(datasets)
require(ggplot2)
require(gcookbook)

# Outline

* How to run R code
* Data classes and objects in R
* Flow control structures
* Functions
* How to install and load R packages



# Introduction

## What is R

* R is an integrated suite of software facilities for
  * importing, storing, exporting and manipulating data;
  * scientific computation;
  * conducting statistical analyses;
  * displaying the results by tables, graphs, etc.
* Highly customizable via thousands of freely available packages.
  * R is also a platform for the development and implementation of new algorithms.
  * Many graphical user interfaces to R are available, both free and commercial (e.g. RStudio and Revolution R (now Microsoft R)).

## History of R

* R is a dialect of the S language
  * S was created in 1976 at Bell Labs as an internal statistical analysis environment.
  * The goal of S was “to turn ideas into software, quickly and faithfully”.
  * The best-known implementation is S-PLUS (most recent stable release was in 2010). S-PLUS integrates S with a GUI and full customer support.
* R was created by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand.
* The R core group, which controls the source code of R (written largely in C), was formed in 1997.
* The first stable version, R 1.0.0, was released in 2000.
* The latest stable version at the time of writing was 3.5.2, released on Dec 20, 2018.

## Features of R

* R is a language designed for statistical analysis
  * Available on most platforms and operating systems
  * Rich data analysis functionalities and sophisticated graphical capabilities
* Active development and very active community
  * CRAN: The Comprehensive R Archive Network
  * Source code and binaries, user contributed packages and documentation
  * More than 13,000 packages available on CRAN (as of March 2018), up from 6,000 three years earlier
* Free to use!

# How to Run R Code

## Three options

* RStudio
* Colab
* HPC systems

## Colab

This is what we are using right now. No setup is needed, but we don't have much control either. For instance, we have very limited control over where the data is stored.

## RStudio

* Integrated Development Environment for R
* Free to use
* User interface similar to other IDEs, dividing the screen into panes
  * Source code
  * Console
  * Workspace
  * Others (help message, plot etc.)
* RStudio in a desktop environment is better suited for development and/or a limited number of small jobs

## HPC Systems

* Clusters are better for resource‐demanding jobs
  * Many small tasks, or
  * Resource-intensive tasks

# Basic Syntax

## Getting help

Getting help is easy in R. 

For specific functions, use **?\<name of function>**.

In [0]:
?class

For a keyword search, use **??\<keyword>**.

In [0]:
??assignment

## Assignment

For new users, the biggest difference from other languages is perhaps the assignment operator: **R uses "<-" instead of "="**. 

In [0]:
x <- 2*4

The contents of the object "x" can be viewed by typing its name at the R prompt.

In [0]:
x

Actually, "=" works too, but there are some subtle differences, which are explained [here](https://renkun.me/2014/01/28/difference-between-assignment-operators-in-r/).

In [0]:
y = 2 * 4
y

Again, "<-" is the recommended way of assigning a value to a variable.

## Comment

In R, any line starting with "#" will be interpreted as a comment. More generally, everything after a "#" on a line is ignored.

In [0]:
# 2*4
# Nothing will happen.

## Legal R Names

Names for R objects can be any combination of letters, numbers, periods (.) and underscores (_), but must start with a letter, or with a period that is not followed by a number.



In [0]:
num.Cats.2 <- 4
.num.Cats <- 5

R is case sensitive, e.g. X and x are different in R.

In [0]:
x <- 4
print("The value of x is:")
x
print("The value of X is:")
X  # Error: object 'X' not found, because only lowercase x has been defined

## Arithmetic operations

Basic arithmetic operators: +, -, *, /, ^

In [0]:
1 + 2*4^(3/5)

Scientific notation: 1e-2

In [0]:
1e2 + 1e-2

Special values: Inf (infinity, positive or negative), NaN (not a number)

In [0]:
1 / 0
-1 / 0
1/0 - 1/0

# Data Classes and Objects

## Atomic Data Types

R has five commonly used atomic classes:
* Numeric (double)
  * Numbers in R are treated as numeric unless specified otherwise.

Note: the function **class()** reveals the class of a data object.


In [0]:
class(9.3)
class(3)

* Integer

In [0]:
class(as.integer(3))

* Complex

In [0]:
class(3+2i)

* Character

In [0]:
class("a")
class("a cat")
class(a)  # error unless an object named a exists: without quotes, a is an object name

* Logical (T, TRUE, F, FALSE)

In [0]:
class(TRUE)
class(T)
class(True)  # error: True is not a logical constant; use TRUE or T

Missing values in R are coded as NA. The is.\<type>() functions can be used to test whether an object belongs to a given data class.
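As a small illustration of the is.\<type>() functions and of NA, the sketch below tests a few literal values. The vector name v is only an example.

```r
# Test the class of a value with the is.<type>() functions
is.numeric(9.3)        # TRUE
is.character("a cat")  # TRUE
is.integer(3)          # FALSE: 3 is stored as a double unless written as 3L

# NA marks a missing value; is.na() tests each element
v <- c(1, NA, 3)
is.na(v)               # FALSE TRUE FALSE
```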

## Derivative Data Types

There are many derivative data types. For example, the "Date" type.

In [0]:
require(lubridate)
today()
class(today())

## Data Objects

* R Data objects
  * Vector: elements of same class, one dimension
  * Matrix: elements of same class, two dimensions
  * Array: elements of same class, 2+ dimensions
  * Lists: elements can be any objects
  * Data frames: “datasets” where columns are variables and rows are observations

### Vectors



Vectors can only contain elements of the same data type.

Vectors can be constructed by 
* The **c()** function (concatenate):

In [0]:
d <- c(1,2,3)
d

In [0]:
d <- c("a","b","c")
d

* The **vector()** function

In [0]:
d <- vector("numeric", length = 10)
d

* The **seq()** and **rep()** functions, or the ":" operator

In [0]:
d <- 1:15
d
d <- seq(from=2,to=10,by=2)
d
d <- seq(from=2,to=10,length.out=5)
d
d <- rep(5,6)
d

* Or a combination of all of them

In [0]:
d <- c(1,2,3:5,rep(6,3))
d

You can convert an object to a different type using the **as.\<TYPE>()** functions

In [0]:
d <- 1:3
d
as.character(d)

Coercion will occur when objects of mixed types are passed to the c() function, as if the as.\<TYPE>() function were explicitly called.

In [0]:
d <- c(1e3,"a")
d
class(d)

In [0]:
c(1.7,"a")

In [0]:
c(T,2)

In [0]:
as.logical(0:6)

One can use [\<index>] to access individual elements in a vector.
  * In R, indices start from 1

In [0]:
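d <- 1:10
d[4]          # fourth element
# d[1,4] fails: a vector has only one dimension
d[c(1,4)]     # first and fourth elements
d[-6:-2]      # negative indices drop elements 2 through 6
d[c(T,T,F)]   # a logical index is recycled: keep, keep, drop, ...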

Lots of R operations process objects in a vectorized way
* More efficient, concise, and easier to read.

In [0]:
x <- 1:4
y <- 6:9

In [0]:
x + y

In [0]:
x > 2

In [0]:
x > y

In [0]:
x[x > 2]

### Matrices


In R, matrices are merely vectors with a "dimension" attribute. Therefore, elements in matrices must be of the same type as well.


R matrices can be constructed by using the **matrix()** function.

In [0]:
matrix(1:12,nrow=3,ncol=4)

Or by assigning a "dim" attribute to a vector.

In [0]:
m <- 1:12
m
dim(m) <- c(3,4)
m

Or by using the cbind() or rbind() functions.

In [0]:
cbind(1:3,4:6,7:9,10:12)

By default, R matrices are filled column-wise; set byrow=T to fill them row-wise.


In [0]:
matrix(1:12,nrow=3,ncol=4)
matrix(1:12,nrow=3,ncol=4,byrow=T)

One can use [\<index>,\<index>] to access individual elements.

In [0]:
m[3,4]

### Arrays

Arrays consist of elements of the same class arranged in any number of dimensions.
* Vectors and matrices can be viewed as arrays of 1 and 2 dimensions.

In [0]:
a <- array(data = 1:12,dim = c(2,2,3))
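An array can be indexed with one index per dimension, in the same way as a matrix. A short sketch (the array is recreated here so that the cell runs on its own):

```r
a <- array(data = 1:12, dim = c(2, 2, 3))
dim(a)       # 2 2 3
a[1, 2, 3]   # row 1, column 2, layer 3
a[, , 1]     # leaving indices blank returns the whole first 2 x 2 layer
```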

### Lists

Lists are an ordered collection of objects, which can be of different types or classes.

Lists can be constructed by using the **list()** function.


In [0]:
list(1,F,"a")

Members of a list do not have to be of atomic types, i.e. they can be vectors, matrices and even lists.

In [0]:
mylist <- list(1,1:5,matrix(1:6,2,3),list(1,F,"a"))
mylist

Lists can be indexed using the [[ ]] operator.

In [0]:
mylist[[2]]
mylist[[4]][[2]]

Elements of R objects can have names.

In [0]:
names(mylist)

Names can be specified when an object is created.

In [0]:
list(inst="LSU",location="Baton Rouge",state="LA")

Or they can be specified later with the **names()** function.

In [0]:
names(mylist) <- c("num","vec","mat","lst")
names(mylist)

Names can be used to access elements in a data object using the $ operator.

In [0]:
mylist$lst

Indexing operations by names and indices can be nested and mixed.

In [0]:
names(mylist$lst) <- c("c1","c2","c3")
mylist$lst$c2
mylist$lst[[2]]
mylist[[4]][[2]]

### Data frames

Data frames are used to store tabular data
* They are a special type of list in which every element (i.e. column) must have the same length
* Different columns can store different classes of objects
* Data frames can have special attributes such as row.names
* Data frames can be created by reading data files, using functions such as **read.table()** or **read.csv()** (more on this later)
* Can be converted to a matrix using the function **data.matrix()**

In [0]:
mtcars

Data frames can be created directly by calling the data.frame() function.

In [0]:
mydf <- data.frame(c(31,40,50), c("M","F","M"))
mydf

We usually name the columns so that the data frame is more meaningful.

In [0]:
names(mydf) <- c("age","sex")
mydf

Row names can be specified as well.

In [0]:
row.names(mydf) <- c("obs1","obs2","obs3")
mydf

To access individual elements in a data frame, there are a few options:

* Numeric indices

In [0]:
mydf[1,2]

* Row and column names

In [0]:
mydf["obs1","sex"]

* Or a mix of indices and names

In [0]:
mydf["obs1",2]

Since data frames are lists, the [[ ]] and $ operators work as well.

In [0]:
mydf[[2]]

In [0]:
mydf$sex

You can select entire rows by leaving the column index blank:

In [0]:
mydf[1:2,]

And vice versa:

In [0]:
mydf[,"sex"]

Subsetting a data frame:

In [0]:
mydf[c(1,3),c("age","sex")]
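Rows can also be selected with a logical condition, which combines vectorized comparisons with the [\<index>,\<index>] notation. A minimal sketch, which recreates mydf with named columns so that it runs on its own; the names older and males_age are only illustrative:

```r
mydf <- data.frame(age = c(31, 40, 50), sex = c("M", "F", "M"))

# Keep the rows where age is greater than 35 (all columns)
older <- mydf[mydf$age > 35, ]
older

# Keep only the age column for the rows where sex is "M"
males_age <- mydf[mydf$sex == "M", "age"]
males_age
```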

## Querying Object Attributes

* The **length()** function
* The **class()** function
* The **dim()** function
* The **str()** function
* The **attributes()** function reveals attributes of an object
  * Class
  * Names
  * Dimensions
  * Length
  * User defined attributes
* They work on all objects (including functions)
* More examples later

In [0]:
length(mtcars)
class(mtcars)
dim(mtcars)
str(mtcars)
attributes(mtcars)
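Attributes can also be set by the user. The sketch below attaches a user-defined attribute to a plain vector with **attr()**; the attribute name "units" and its value are arbitrary choices for illustration.

```r
v <- 1:6
attributes(v)              # NULL: a plain vector has no attributes

attr(v, "units") <- "cm"   # "units" is an arbitrary, user-defined attribute
attributes(v)

dim(v) <- c(2, 3)          # adding a "dim" attribute turns v into a matrix
attributes(v)
```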

## Exercise 1

* Find the airquality data frame in the datasets package
* Find the average temperature of June and July.

In [0]:
summary(airquality)

# File and Directory Operations

## Query working directory

In [0]:
getwd()
list.files()
list.dirs()
setwd("/content/sample_data")
getwd()

## Downloading files

If we need files from the internet, we can either
* Manually download the file to the working directory
* Or use the R function **download.file()**

In [0]:
download.file("http://www.hpc.lsu.edu/training/weekly-materials/Downloads/Forbes2000.csv.zip", "Forbes2000.csv.zip")

## Handling files

R has a set of file.\<operation> functions.

In [0]:
file.exists("testfile")

In [0]:
file.create("testfile")
file.exists("testfile")

In [0]:
list.files()

In [0]:
file.copy("testfile","anotherfile")
file.remove("testfile","anotherfile")

## Reading data from and Writing data to Files

R can recognize a lot of different file formats.
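For the common case of CSV files, a round trip might look like the sketch below. It writes the built-in mtcars data frame to a temporary file and reads it back; the names csv_path and cars are only illustrative.

```r
# Write a data frame to a temporary CSV file
csv_path <- tempfile(fileext = ".csv")
write.csv(mtcars, csv_path)

# Read it back; the first column holds the row names written by write.csv()
cars <- read.csv(csv_path, row.names = 1)
dim(cars)

# Clean up the temporary file
file.remove(csv_path)
```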

# Flow Control Structures

Flow control structures in R, which allow one to control the flow of execution, are similar to those in other languages.

## Condition testing

Test a condition with the if...else structure:

```
if (condition 1 is true) {
  do something
} else if (condition 2 is true) {
  do something else
} else {
  do something more
}
```



In [0]:
# nrow() counts the rows (cars); length() of a data frame counts its columns
if (nrow(mtcars) > 3) {
  print("We have more than 3 cars!")
}
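The full if...else if...else form can be sketched in the same way. The thresholds and the variable names n and size below are arbitrary choices for illustration.

```r
n <- nrow(mtcars)   # number of cars in the built-in mtcars data frame

if (n > 100) {
  size <- "large"
} else if (n > 10) {
  size <- "medium"
} else {
  size <- "small"
}
print(size)
```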

Comparisons:
* Less than: ```<```
* Less than or equal to: ```<=```
* Greater than: ```>```
* Greater than or equal to: ```>=```
* Equal to: ```==```
* Not equal to: ```!=```


Logical operations:
* NOT: !
* AND (elementwise): &
* AND (single values only, short-circuiting; recent versions of R raise an error for longer vectors): &&
* OR (elementwise): |
* OR (single values only, short-circuiting): ||
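A brief sketch of how these operators behave: the elementwise forms return one result per element, while the double forms are meant for single values, as in an if condition. The vector x and the result names are only examples.

```r
x <- c(1, 5, 10)

both   <- x > 2 & x < 8   # elementwise AND: FALSE TRUE FALSE
either <- x < 2 | x > 8   # elementwise OR:  TRUE FALSE TRUE
negate <- !(x > 2)        # NOT:             TRUE FALSE FALSE

# && combines two single logical values; the right-hand side is
# evaluated only if the left-hand side is TRUE
if (length(x) > 0 && x[1] == 1) {
  print("x is not empty and starts with 1")
}
```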

## Loops

Loops with ```for```:

```
for (variable in sequence) {
  statements
}
```

In [0]:
for (i in 1:10) {
  print(i^3)
}

Loops are not used very frequently in R because many operations are inherently vectorized. The **apply()** family of functions is also very useful for performing an operation on every element of a vector or list (see next section).

In [0]:
(1:10)^3
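As a rough illustration of the apply family, the sketch below uses **sapply()** to call a function on each element of a vector or each column of a data frame, and **apply()** to work along the rows of a matrix. The variable names are only examples.

```r
# sapply(): call a function on every element and simplify the result to a vector
cubes <- sapply(1:10, function(i) i^3)
cubes

# A data frame is a list of columns, so this gives the mean of each column
col_means <- sapply(mtcars, mean)
col_means

# apply(): the second argument selects the margin (1 = rows, 2 = columns)
m <- matrix(1:6, nrow = 2)
row_sums <- apply(m, 1, sum)
row_sums
```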

## Exercise 2

Create a random sample of 100 numbers and count how many of them are greater than 5.

## Exercise 3

Using the airquality data frame, find
* the average temperature of June and July
* the number of days when temperature is between 50 and 70

### Solution to Exercise 2

In [0]:
x <- runif(100,0,10)
length(x[x>5])

# Functions

# Graphics in R

There are three plotting systems in R
* base
  * Convenient, but hard to adjust after the plot is created
* lattice
  * Good for creating conditioning plots
* ggplot2
  * Powerful and flexible, with many tunable features; may require some time to master


Each has its pros and cons, so it is up to the users which one to choose.

## base

A few of the functions available in the base plotting system:
* plot(): line and scatter plots
* boxplot(): box plots
* hist(): histograms

In [0]:
# Create the plot with title and axis labels.
plot(pressure,type="l",
     main="Vapor Pressure of Mercury",
     xlab="Temperature", 
     ylab="Vapor Pressure")
# Add points
points(pressure,col='red') 
# Add annotation
text(150,700,paste0("Source: Weast, R. C., ed. (1973) Handbook\n",
     "of Chemistry and Physics. CRC Press."))

## ggplot2

In [0]:
install.packages("gcookbook")
require(gcookbook)
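# qplot() is deprecated in recent ggplot2 releases; ggplot() + geom_point() draws the same scatter plot
ggplot(heightweight, aes(x=weightLb, y=heightIn)) + geom_point()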

In [0]:
ggplot(heightweight, aes(x=weightLb, y=heightIn, color=sex, shape=sex)) + 
  geom_point(size=3.5) +
  ggtitle("School Children\nHeight ~ Weight") +
  labs(y="Height (inch)", x="Weight (lbs)") +
  stat_smooth(method=loess, se=T, color="black", fullrange=T) +
  annotate("text",x=145,y=75,label="Locally weighted polynomial fit with 95% CI",color="Green",size=6) +
  scale_color_brewer(palette = "Set1", labels=c("Female", "Male")) +
  guides(shape="none") +
  theme_bw() +
  theme(plot.title = element_text(size=20, hjust=0.5), 
        legend.position = c(0.9,0.2),
        axis.title.x = element_text(size=20), axis.title.y = element_text(size=20),
        legend.title = element_text(size=15),legend.text = element_text(size=15))

[Data Visualization in R](http://www.hpc.lsu.edu/training/weekly-materials/2018-Spring/Slides.html#(1)) tutorial from LSU HPC

# Parallel Processing

Modern computers are equipped with more than one CPU core and are capable of processing workloads in parallel, but base R is single‐threaded, i.e. not parallel.

In other words, regardless of how many cores are available, base R uses only one of them by default.

There are two options to run R in parallel
* Implicit
* Explicit

## Implicit parallel processing

Some functions in R can call parallel numerical libraries.
* On LONI and LSU HPC clusters this is the multi‐threaded Intel MKL library
* Mostly linear algebraic and related functions (e.g. linear regression, matrix decomposition, computing inverse and determinant of a matrix)

## Explicit parallel processing

If the implicit option is not available for what you'd like to do, some code needs to be written.

In [0]:
install.packages("doParallel")
require(doParallel)
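As a rough sketch of what explicit parallel code can look like, the example below uses the parallel package that ships with R rather than doParallel, so nothing needs to be installed. The choice of 2 worker processes and the names cl and res are assumptions for illustration.

```r
library(parallel)

# Start a cluster of 2 worker processes
cl <- makeCluster(2)

# Distribute the calls over the workers; the results come back in order
res <- parSapply(cl, 1:10, function(i) i^3)

# Always shut the workers down when finished
stopCluster(cl)

res
```

For a task this small, the overhead of starting the workers outweighs any gain; the pattern pays off when each call does a substantial amount of work.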

[Parallel Computing in R](http://www.hpc.lsu.edu/training/weekly-materials/2017-Fall/HPC_Parallel_R_Fall2017.pdf) tutorial from LSU HPC

# How to Install and Load R Packages

To load an R package, use the library() or require() function:

In [0]:
library(lubridate)

In [0]:
require(dplyr)

If a package is not available, the **install.packages()** function can be used to install it.

In [0]:
library(reshape)

In [0]:
install.packages("reshape")
require(reshape)

Multiple packages can be installed with one call of the **install.packages()** function.

In [0]:
install.packages(c("reshape","xlsx"))

Note that quotation marks are not needed when loading packages, but are necessary when installing packages.

Remove packages.

In [0]:
remove.packages("reshape")

Update packages.

In [0]:
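# The first argument of update.packages() is a library path, so name the package with oldPkgs
update.packages(oldPkgs = "lubridate", ask = FALSE)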

List installed packages.

In [0]:
installed.packages()

# Case Study

## Steps of Data Analysis

* Data acquisition
* Data inspection
* Data cleanup
  * Remove missing and dubious values, discard columns not needed etc.
* Data analysis
* Report generation

## Data acquisition

In [0]:
download.file("http://www.hpc.lsu.edu/training/weekly-materials/Downloads/Forbes2000.csv.zip", "Forbes2000.csv.zip")
unzip("Forbes2000.csv.zip","Forbes2000.csv")
file.remove("Forbes2000.csv.zip")
list.files()

In [0]:
forbes <- read.csv("Forbes2000.csv")

## Data inspection

The head() and tail() functions do much the same as their counterparts in Linux, revealing the first and last few rows of a data frame.

In [0]:
head(forbes)

The following functions are useful for examining data:

* **min()** - Minimum value
* **max()** - Maximum value
* **which.min()** - Location of minimum value
* **which.max()** - Location of maximum value
* **sum()** - Sum of the elements of a vector
* **mean()** - Mean of the elements of a vector
* **sd()** - Standard deviation of the elements of a vector
* **quantile()** - Show quantiles of a vector
* **summary()** - Display descriptive statistics

In [0]:
summary(forbes)
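To see these functions on data that is always available, the sketch below applies them to the mpg column of the built-in mtcars data frame. The variable names are only illustrative.

```r
mpg <- mtcars$mpg

min(mpg)
max(mpg)

# which.max() returns a position, which can be used to look up the row name
best <- which.max(mpg)
rownames(mtcars)[best]

mean(mpg)
sd(mpg)
quantile(mpg, c(0.25, 0.5, 0.75))
```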

The str() and attributes() functions can be useful as well.

In [0]:
str(forbes)

A quick plot or two won't hurt either.

In [0]:
attach(forbes) # attach the data frame
boxplot(sales) # boxplot
plot(sales,assets) # scatterplot

## Data cleanup

### Dealing with missing values

Missing values are denoted in R by ```NA```; ```NaN``` denotes the result of an undefined mathematical operation.

The **is.na()** function tests whether the elements of an object are ```NA```.

When reading data, make sure R can recognize the missing values.

Many R functions also have a logical “na.rm” option.
* na.rm=TRUE means the ```NA``` values are discarded before the computation, e.g. ```mean(weight,na.rm=T)```

Note: Not all missing values are marked with “```NA```” in raw data!
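A small sketch of both points follows. It assumes a hypothetical data set in which missing weights were recorded either as -999 or as an empty field; the na.strings argument of **read.csv()** tells R which strings to treat as ```NA```.

```r
# Hypothetical raw data: -999 and the empty field both stand for "missing"
raw <- "id,weight\n1,60.5\n2,-999\n3,72.0\n4,\n"

df <- read.csv(text = raw, na.strings = c("NA", "-999", ""))
is.na(df$weight)                 # FALSE TRUE FALSE TRUE

mean(df$weight)                  # NA, because missing values propagate
mean(df$weight, na.rm = TRUE)    # mean of the two observed values
```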

There are many statistical techniques that can deal with missing values, but the simplest way is to remove them.
* If a row (observation) has a missing value, remove the row with na.omit().
* If a column (variable) has a high percentage of missing values, remove the whole column or just don't use it for the analysis.

In [0]:
# Show the rows that contain missing values before removing them
forbes[! complete.cases(forbes),]

In [0]:
forbes <- na.omit(forbes)
dim(forbes)

### Subsetting the data

Use the **subset()** function.
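A minimal sketch of **subset()**, shown on the built-in mtcars data frame so that it runs without any downloaded file. The condition and the name efficient are arbitrary choices for illustration.

```r
# Rows are chosen by a logical condition, columns by the select argument;
# column names can be used directly, without the mtcars$ prefix
efficient <- subset(mtcars, mpg > 25 & cyl == 4, select = c(mpg, cyl, wt))
efficient
```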

## Data analysis

In [0]:
housing <- read.csv("/content/sample_data/california_housing_train.csv")

In [0]:
str(housing)

In [0]:
complete.cases(housing)

### Exercise 4

Find the average sales, profit and market value of all US-based banks.

## Report generation