# Introduction to R
### Justin Strate
### El Paso Data Scientists
### 10/25/2017

## Background of R

* R is an implementation of the S programming language
* R was developed by Ross Ihaka and Robert Gentleman
* The name comes partly from the fact that its precursor is S and the fact that both of the first names of the creators begin with an "R".

## Editors
* RStudio is the most widely used editor for R
* R can also be used in Jupyter and in several other editors

# Why Use R?

### Pros of R
* Free, open source
* Tools and applications: Twitter, LinkedIn, geospatial tools
* Cutting edge tools
* Tailored specifically for data analysis
* Large community with domain knowledge around data science
* Easily accessible help documentation

### Cons of R
* Inconsistency among developers
* Slow
* Steep learning curve
* Memory usage

# Getting Help with R
* You can use the help function or the "?" to pull up help documentation    
For example,     
        help(mean)
or      
        ?mean
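
Beyond help and "?", base R ships a few other lookup helpers. The sketch below is a minimal illustration using only base functions; the variable name mean_like is made up for this example.

```r
# List objects on the search path whose names contain "mean"
mean_like <- apropos("mean")
print(head(mean_like))

# Show the arguments a function accepts without opening its help page
print(args(sd))

# Keyword search across help pages (the same as ??"standard deviation").
# Left commented out because it opens a results viewer in interactive sessions.
# help.search("standard deviation")
```

apropos is handy when you remember part of a function name, while help.search is better when you only know the topic.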

# Operators
* Logical
* Mathematical
* Assignment

## Logical Operators
* \>, >=, <, <=, and == are straightforward
* ! denotes NOT
* & denotes AND
* | denotes OR

## Logical Operators

In [1]:
2 > 1 
2 >= 1
2 < 1
2 <= 1

2 == 2
! 2==2

TRUE & FALSE

TRUE | FALSE
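
The same operators also work element by element on vectors, which is how they are mostly used in practice. A small sketch, with an arbitrary vector a chosen for illustration:

```r
a <- c(1, 5, 10)

# Comparisons return one logical value per element
above_two <- a > 2

# & and | combine logical vectors element by element
in_range <- a > 2 & a < 10
either   <- a < 2 | a > 8

# ! negates every element
not_in_range <- !in_range

print(above_two)
print(in_range)
print(either)
print(not_in_range)
```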

## Mathematical Operators
* All the mathematical operations in R follow standard notation

In [2]:
2 + 2
2 - 2
2 * 2
2 / 2
2 ^ 2
log(2) #natural log of 2
log(4, base = 2)


## Assignment Operators
* There are several assignment operators in R.
* The two most common are = and  <-
* The assignment operators <<-, ->, and ->> are less common

## Assignment Operators
The following all assign the value 3 to the object x

In [3]:
#Two most common
x = 3 
x <- 3

#Less common
x <<- 3
3 -> x
3 ->> x
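
At the top level these operators behave the same, but inside a function <- and <<- differ. The sketch below illustrates the difference; the names counter, bump_local, and bump_outer are invented for this example.

```r
counter <- 0

# <- inside a function creates a local variable; the outer counter is untouched
bump_local <- function() {
  counter <- counter + 1
  counter
}

# <<- assigns to the variable of that name found in an enclosing environment
bump_outer <- function() {
  counter <<- counter + 1
  counter
}

local_result <- bump_local()
after_local  <- counter       # still the original value
outer_result <- bump_outer()
after_outer  <- counter       # changed by <<-

print(c(after_local = after_local, after_outer = after_outer))
```

This is one reason <- is usually preferred for everyday assignment: it never changes variables outside the current function.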


# Data Types
* Integer
* Numeric
* Character
* Factor
* Logical

The c function can be used to combine arguments into a vector of one of the above data types 
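
To check which type c produced, you can call class on the result. The sketch below is illustrative only; the list examples and the vectors in it are made up.

```r
examples <- list(
  integer   = c(1L, 2L, 3L),        # the L suffix marks integer literals
  numeric   = c(1.5, 2, 3),
  character = c("a", "b"),
  factor    = factor(c("a", "b")),
  logical   = c(TRUE, FALSE)
)

# Report the class of each vector
types <- sapply(examples, class)
print(types)

# Mixing types in c() converts everything to the most general type
mixed <- c(1, "a", TRUE)
print(class(mixed))
```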

## Numeric Data Types
* integer- 1,2,3,4,...
* numeric- any real number
* date- stored as the number of days since January 1, 1970, which is assigned the value 0.

### Integer

In [4]:
# Integer
x_int <- c(-1L, 0L, 1L) #The L suffix stores each value as an integer; without it, c() creates a numeric vector
x_int2 <- -1:1 #Assigns a vector of the integers -1 to 1 to x_int2

print(x_int)

x_int2 # Also prints

[1] -1  0  1


### Numeric

In [5]:
# Numeric
x_num <- c(1.618, 2.718, 3.14)
print(x_num)

x_num #also prints x_num

# R uses 1-based indexing: the first element is at position 1
x_num[1] #prints the 1st element of x_num

[1] 1.618 2.718 3.140


### Dates

In [6]:
#Dates
date1 <- as.Date('01/01/1970', format = '%m/%d/%Y')
print(date1 + 1)
print(as.numeric(date1))

[1] "1970-01-02"
[1] 0
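
Because dates are stored as day counts, ordinary arithmetic works on them. A minimal sketch with arbitrarily chosen dates:

```r
start <- as.Date("2017-01-01")                       # ISO format needs no format argument
end   <- as.Date("10/25/2017", format = "%m/%d/%Y")

# Subtracting two dates gives the number of days between them
days_between <- as.numeric(end - start)
print(days_between)

# Converting a day count back into a date requires an origin
epoch_plus_ten <- as.Date(10, origin = "1970-01-01")
print(epoch_plus_ten)

# format() turns a date back into a string with the layout you choose
end_text <- format(end, "%Y/%m/%d")
print(end_text)
```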


## String Data Types

### Character
* Character vectors can hold any string, and any string can be appended to them later.

### Factor
* Factor variables can be built from strings or numbers, but every value must be one of a fixed set of levels
* Factors are appropriate for categorical data such as a survey response or gender.

### Example of Factor vs Character

In [7]:
# Character- can be any string and appended with any string
(x_char <- c('F', 'F'))

# Factors- can be any string that is one of the existing levels
x_fac <- factor(c('F', 'F'), levels=c('M', 'F'))
print(x_fac)


[1] F F
Levels: M F
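
The difference shows up when you try to add a new value. The sketch below assumes the same vectors as above and uses a made-up value, "Unknown", that is not one of the factor's levels.

```r
x_char <- c("F", "F")
x_char <- c(x_char, "Unknown")     # any string can be appended to a character vector

x_fac <- factor(c("F", "F"), levels = c("M", "F"))
print(levels(x_fac))               # the allowed values
print(as.integer(x_fac))           # internally, each value is the position of its level

# Assigning a value that is not a level produces NA (R also issues a warning,
# which is suppressed here to keep the output short)
x_fac2 <- x_fac
suppressWarnings(x_fac2[3] <- "Unknown")
print(x_fac2)
```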


## Logical Data Type
* Logical vectors hold the values TRUE and FALSE; T and F are shorthand for the same values.
* The functions sum, mean, sd, and var all operate on logical vectors, treating TRUE (or T) as 1 and FALSE (or F) as 0.
* If you append a 0 or 1 to a logical vector, the result is converted to a numeric vector in which TRUE becomes 1 and FALSE becomes 0.

### Example of logical data type

In [8]:
x_bin <- c(TRUE, FALSE, T, F)

x_bin
sum(x_bin)
mean(x_bin)

### Example of the which function
* The which function returns the indices where a logical vector reads TRUE

In [9]:
which(x_bin)
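
The points above can be checked directly. This is a small sketch; the ages vector is invented for illustration.

```r
x_bin <- c(TRUE, FALSE, T, F)

n_true     <- sum(x_bin)    # counts the TRUE values
share_true <- mean(x_bin)   # proportion of TRUE values

# Appending a number converts the whole vector to numeric
x_mixed <- c(x_bin, 1)
print(class(x_mixed))
print(x_mixed)

# which() is often applied to a comparison rather than a stored logical vector
ages <- c(22, 42, 31)
older_idx <- which(ages > 30)
print(older_idx)
```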

# Data Structures
* Vectors
* Data Frames
* Matrices
* Lists

## Vectors
* Can be created using the c() function as we have seen in past examples
* All elements must be of the same data type
* Very similar to NumPy arrays
* Elements can be accessed by indexing numerically or logically

### Example of vector operations

In [10]:
# Example of a numeric vector in R
x1 <- c(1, 2, 3) # same values as 1:3, although 1:3 is stored as integer

print(x1 + 1)

print(x1 - c(0, 1, 2))

print(x1[1])

print(x1[c(TRUE, FALSE, TRUE)])

[1] 2 3 4
[1] 1 1 1
[1] 1
[1] 1 3
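
Logical indexing is usually written with a condition instead of a hand-typed TRUE/FALSE vector. A minimal sketch of a few more common vector operations:

```r
x1 <- c(1, 2, 3)

# Keep only the elements that satisfy a condition
greater_than_one <- x1[x1 > 1]

# A negative index drops that position
without_first <- x1[-1]

# Arithmetic between two vectors of equal length is element by element
scaled <- x1 * c(10, 100, 1000)

# Elements can be replaced by position
x1[2] <- 20

print(greater_than_one)
print(without_first)
print(scaled)
print(x1)
```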


## Data Frames
* Data Frames are very similar to pandas data frames
* The columns can be of different data types
* You can index numerically or logically both the columns and the observations
* You can use the "$" operator to extract a particular variable from a data frame
* Pre-loaded data sets are commonly structured in data frames

### Example of Data Frames

In [11]:
# Example of Data Frame Manipulation
dat <- data.frame(id = 1:3, name = c('Joe', 'Bob', 'Amy'), age = c(22, 42, 31))

# Index the first observation numerically
dat[1,] #don't forget that trailing comma

# Index column of dat logically
dat[,c(F, T, F)]

# Extract the name column using the "$" operator
dat$name #Or you could use dat[['name']]

# Index the rows numerically and the columns logically
dat[c(1,3), c(F, T, T)]

| id | name | age |
|---|---|---|
| 1 | Joe | 22 |


| | name | age |
|---|---|---|
| 1 | Joe | 22 |
| 3 | Amy | 31 |
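
Rows are most often selected with a condition on a column, and new columns are added with "$". The sketch below rebuilds the same dat data frame; the column age_next_year is made up for illustration, and stringsAsFactors = FALSE is set explicitly so that name is a character column in every R version.

```r
dat <- data.frame(id = 1:3,
                  name = c("Joe", "Bob", "Amy"),
                  age = c(22, 42, 31),
                  stringsAsFactors = FALSE)

# Keep the rows where age is above 30 (note the trailing comma)
over_30 <- dat[dat$age > 30, ]
print(over_30)

# Add a new column computed from an existing one
dat$age_next_year <- dat$age + 1

# str() gives a compact overview of columns and their types
str(dat)
```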


### Example using the iris data set

In [12]:
data(iris) #Load the data set

names(iris) #print the variables in the data set

dim(iris) #prints the dimension of the data set

head(iris) #view first 6 observations

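| Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
|---|---|---|---|---|
| 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 5.0 | 3.6 | 1.4 | 0.2 | setosa |
| 5.4 | 3.9 | 1.7 | 0.4 | setosa |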
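
Once a data set is loaded, a few base functions give a quick summary. A minimal sketch using iris; the variable names species_counts and mean_sepal_length are chosen for this example.

```r
data(iris)

# Column types and a preview of the values
str(iris)

# Number of observations per species
species_counts <- table(iris$Species)
print(species_counts)

# Mean sepal length within each species
mean_sepal_length <- tapply(iris$Sepal.Length, iris$Species, mean)
print(mean_sepal_length)
```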


## Matrices
* In contrast to data frames, matrices must consist of the same data type
* You can create a matrix using the matrix function
* Matrices are filled in column-wise by default
* You can access the elements of a matrix by indexing logically or numerically

### Example of Matrices

In [13]:
matrix(1:4, nrow = 2, ncol = 2)

matrix(1:4, nrow = 2, ncol = 2, byrow = TRUE)

A <- matrix(c(100, 1, 10, 1), nrow = 2, ncol = 2)

rownames(A) <- c("Predicted 1", "Predicted 0")
colnames(A) <- c("Observed 1", "Observed 0")
dim(A)
A

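| | [,1] | [,2] |
|---|---|---|
| [1,] | 1 | 3 |
| [2,] | 2 | 4 |


| | [,1] | [,2] |
|---|---|---|
| [1,] | 1 | 2 |
| [2,] | 3 | 4 |


| | Observed 1 | Observed 0 |
|---|---|---|
| Predicted 1 | 100 | 10 |
| Predicted 0 | 1 | 1 |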
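
Elements of a matrix can be read by position or by row and column name. The sketch below rebuilds the matrix A from above and, treating it as a table of predicted versus observed classes (an interpretation based only on its labels), computes the share of matching predictions.

```r
A <- matrix(c(100, 1, 10, 1), nrow = 2, ncol = 2,
            dimnames = list(c("Predicted 1", "Predicted 0"),
                            c("Observed 1", "Observed 0")))

# Index by position: row 1, column 2
by_position <- A[1, 2]

# Index by name
by_name <- A["Predicted 0", "Observed 1"]

# Share of cases on the diagonal (prediction matches observation)
accuracy <- sum(diag(A)) / sum(A)

# Row totals and the transpose
row_totals <- rowSums(A)
A_t <- t(A)

print(by_position)
print(by_name)
print(accuracy)
print(row_totals)
```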


## Lists
* Lists can be composed of several data structures including other lists
* Many R objects are built on lists because of their versatility
* You can construct a list with the list function
* You can access the elements of a list with the $ operator

### Example of Lists

In [14]:
my_list <- list(x1 = "Hello", x2 = c(1,3), dat = data.frame(x=c(1,3),y=c(2,4)),
               dum_list = list(y1 = 2, y2 = 3))

my_list$x1
my_list$dat
my_list$dum_list
my_list$dum_list$y1

| x | y |
|---|---|
| 1 | 2 |
| 3 | 4 |
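
Besides "$", list elements can be reached with double or single square brackets, which behave differently. A minimal sketch on a smaller version of my_list; the element x3 is made up for illustration.

```r
my_list <- list(x1 = "Hello", x2 = c(1, 3), dum_list = list(y1 = 2, y2 = 3))

# [[ ]] returns the element itself
element <- my_list[["x2"]]

# [ ] returns a shorter list that still wraps the element
sub_list <- my_list["x2"]

# Elements can be added or removed by name
my_list$x3 <- TRUE
my_list$x1 <- NULL

print(class(element))
print(class(sub_list))
print(names(my_list))
```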


# Functions
* All functions are defined as follows
        f <- function(parameter1, parameter2,...)
* Shorter functions may be defined on a single line as the one below
        g <- function(x)x^2
* Longer functions use curly braces, "{}", to encapsulate their code
        h <- function(){
            ...
        }
* You can use the return function to explicitly return the output
* If there are no explicit returns, the last evaluated expression is returned

#### Example of three ways to define the same function

In [15]:
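# One-line definition: the value of the expression is returned
f <- function(x)(x+1)^2
# Explicit return
g <- function(x){
    return((x+1)^2)
}
# No explicit return: the value of the last expression (the assignment to tmp2) is returned
h <- function(x){
    tmp <- x + 1
    tmp2 <- tmp^2
}

# All three definitions give the same result
f(-1) == g(-1) & g(-1)==h(-1)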
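
One subtlety of ending a function with an assignment, as h does, is that the value is returned invisibly: it is available to the caller but is not printed automatically at the console. A small sketch, redefining f and h so the block runs on its own:

```r
f <- function(x) (x + 1)^2
h <- function(x) {
  tmp <- x + 1
  tmp2 <- tmp^2    # the value of this assignment becomes the return value
}

# The value is still returned and can be stored or printed explicitly
result <- h(2)
print(result)

# withVisible() reports whether a value would be auto-printed
h_visible <- withVisible(h(2))$visible
f_visible <- withVisible(f(2))$visible
print(c(h = h_visible, f = f_visible))

# Functions built from vectorized operators accept whole vectors
squares <- f(c(-1, 0, 1))
print(squares)
```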

# Packages
* R packages are extensions to R that provide additional functionality
* A core set of R packages is installed and loaded by default
* Other packages come installed with R but are not loaded by default
* You can load an installed package using the library function or the require function
* To reference a particular package use the :: operator. For example, stats::filter references the filter function in the stats package
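
The loading options above can be sketched as follows. The package name hypotheticalMissingPkg is deliberately fictitious and is used only to show what require does when a package is not installed.

```r
# library() loads an installed package and stops with an error if it is missing
library(stats)

# require() returns TRUE or FALSE instead of stopping
# (the warning about the missing package is suppressed here)
has_pkg <- suppressWarnings(
  require("hypotheticalMissingPkg", character.only = TRUE, quietly = TRUE)
)
print(has_pkg)

# :: calls a function from a specific package, which avoids ambiguity when
# two loaded packages export the same name (for example filter)
moving_avg <- stats::filter(1:5, rep(1 / 3, 3))
print(as.numeric(moving_avg))
```

Because require does not stop on failure, library is generally the safer choice at the top of a script.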

## CRAN
* The Comprehensive R Archive Network, more commonly known as "CRAN", is the source for most R packages
* You can install packages from CRAN using the install.packages function
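
A common pattern is to install a package only when it is not already present. The helper install_if_missing below is a hypothetical convenience function written for this example, not part of R; the repos value is the CRAN cloud mirror and can be replaced with any other mirror.

```r
# Hypothetical helper: install a package from CRAN only if it is not installed
install_if_missing <- function(pkg, repos = "https://cloud.r-project.org") {
  if (requireNamespace(pkg, quietly = TRUE)) {
    return(invisible(FALSE))   # already installed, nothing to do
  }
  install.packages(pkg, repos = repos)
  invisible(TRUE)
}

# stats ships with R, so this call does not download anything
installed_now <- install_if_missing("stats")
print(installed_now)

# Typical use (commented out because it downloads from the network):
# install_if_missing("tidyverse")
```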

## tidyverse
* The tidyverse is a collection of packages that "share an underlying philosophy and common APIs"
* The tidyverse's main author is Hadley Wickham
* As of version 1.1.1 the core of the tidyverse includes the packages ggplot2, dplyr, tidyr, readr, purrr, and tibble
* You can install all these packages by installing the tidyverse via install.packages
* Here, we will briefly introduce ggplot2 and dplyr

### ggplot2
* ggplot2 is a data visualization package for R based on the grammar of graphics
* ggplot2 uses the "+" operator to add layers to a plot
* It aims to simplify the addition of more complex features to a plot

### Example using ggplot2
First, we'll load the packages and the mtcars data set. Then, we'll print the first 5 observations.

In [16]:
library(ggplot2)
library(ggthemes)
data(mtcars)

head(mtcars, 5)

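| | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Mazda RX4 | 21.0 | 6 | 160 | 110 | 3.9 | 2.62 | 16.46 | 0 | 1 | 4 | 4 |
| Mazda RX4 Wag | 21.0 | 6 | 160 | 110 | 3.9 | 2.875 | 17.02 | 0 | 1 | 4 | 4 |
| Datsun 710 | 22.8 | 4 | 108 | 93 | 3.85 | 2.32 | 18.61 | 1 | 1 | 4 | 1 |
| Hornet 4 Drive | 21.4 | 6 | 258 | 110 | 3.08 | 3.215 | 19.44 | 1 | 0 | 3 | 1 |
| Hornet Sportabout | 18.7 | 8 | 360 | 175 | 3.15 | 3.44 | 17.02 | 0 | 0 | 3 | 2 |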


### Example using ggplot2
Next, we'll plot mpg against wt as a scatterplot with the name of each car beside its point. The size of each point reflects qsec (larger points mean a larger qsec), the shape reflects vs, and the color reflects disp. We'll apply a theme to this plot and assign it to the variable plt.

In [17]:
plt <- ggplot(mtcars, aes(wt, mpg, size = qsec, shape = factor(vs), col = disp))+
    geom_point()+
    geom_text(aes(label=rownames(mtcars)))+
    labs(x='Weight', y = 'Miles per Gallon', title='Miles per Gallon vs Weight', 
         shape = 'V/S', size = '1/4 mile time', col='Displacement')+
    theme_bw()


ggsave(filename = 'img/mpg_vs_wt.png', plot = plt)

Saving 6.67 x 6.67 in image


<img src="img/mpg_vs_wt.png" width=450 height=450>
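
To see how "+" builds a plot up one layer at a time, here is a smaller sketch on the same data. It assumes ggplot2 is installed; the intermediate variable names are made up, and the linear trend line is an addition for illustration that does not appear in the plot above.

```r
library(ggplot2)

# Start with the data and the aesthetic mapping only: no layers yet
base_plot <- ggplot(mtcars, aes(x = wt, y = mpg))

# Add a layer of points
scatter <- base_plot + geom_point()

# Add a second layer: a fitted straight line without a confidence band
with_trend <- scatter +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# Labels and themes are added the same way but are not layers
labelled <- with_trend +
  labs(x = "Weight", y = "Miles per Gallon") +
  theme_bw()

# Each geom_* call contributed one layer
print(length(scatter$layers))
print(length(labelled$layers))
```

Because every step returns a plot object, you can keep a base plot and branch off several variations from it.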

### dplyr
* Provides tools based on the "grammar of data manipulation"
* Makes extensive use of the "%>%" operator from the magrittr package
* The most common functions are select, arrange, mutate, filter, group_by and summarise

#### The magrittr pipe operator (%>%)

* The magrittr pipe, %>%, is used heavily in the tidyverse
* The %>% passes an object into the first argument of a subsequent function
* For example g(f(0)) can be expressed as 

In [19]:
library(dplyr)
0 %>%
        f() %>%
        g()
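
The equivalence can be checked with two small functions. The definitions of f and g below are made up for this sketch, and dplyr is assumed to be installed (it makes %>% available).

```r
suppressPackageStartupMessages(library(dplyr))

f <- function(x) (x + 1)^2
g <- function(x) sqrt(x)

# Nested call, read from the inside out
nested <- g(f(3))

# Piped call, read from left to right
piped <- 3 %>%
  f() %>%
  g()

print(nested)
print(piped)

# The piped value fills the first argument; other arguments follow as usual
rounded <- 3.14159 %>% round(2)
print(rounded)
```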

### Example 
Using the mtcars data set, create an additional variable, Name, which is the name of the car using the rownames function. Print out the first 5 observations of Name and mpg in descending order of mpg for cars with vs equal to 1.

In [20]:
mtcars %>%
    mutate(Name = rownames(mtcars)) %>%
    filter(vs == 1) %>%
    select(Name, mpg) %>%
    arrange(desc(mpg)) %>%
    head(5)


| Name | mpg |
|---|---|
| Toyota Corolla | 33.9 |
| Fiat 128 | 32.4 |
| Honda Civic | 30.4 |
| Lotus Europa | 30.4 |
| Fiat X1-9 | 27.3 |
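
The example above does not use group_by or summarise, so here is a minimal sketch of the two together on the same data. It assumes dplyr is installed; the column names n_cars and mean_mpg are chosen for this example.

```r
suppressPackageStartupMessages(library(dplyr))

# Average mpg and number of cars for each cylinder count
mpg_by_cyl <- mtcars %>%
    group_by(cyl) %>%
    summarise(n_cars = n(), mean_mpg = mean(mpg)) %>%
    arrange(desc(mean_mpg))

print(mpg_by_cyl)
```

group_by marks the groups and summarise collapses each group to a single row, so the result has one row per distinct value of cyl.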


# Resources
* r-bloggers (https://www.r-bloggers.com/)
* R for Data Science by Hadley Wickham and Garrett Grolemund (http://r4ds.had.co.nz/)
* Advanced R by Hadley Wickham (http://Adv-r.had.co.nz)
* The Art of R Programming by Norman Matloff (https://www.nostarch.com/artofr.htm)
