# Intro to R Programming

- R is a software environment for data analysis, statistical computing, and graphics. 
- It is also a programming language. 
- Like Python, R is free and open-source. 
- Many analysts find R's syntax convenient for statistics, and common data analyses often take only a few lines of code.
- It comes with the <b>base</b> package; additional packages must be installed separately.
- More than 10,000 packages are available on CRAN.

<b>Author</b>: Geoffrey Kee, modified by Nadav Rindler

# Installation

- conda install -c r r-essentials


# Basics



In [None]:
# To get Working Directory -- which folder are you working from?
getwd()

In [None]:
# To change the working directory
setwd('Your/FilePath/Here')
    # Note for Windows users: use forward slashes ('/') or doubled backslashes ('\\') in file paths, because a single backslash is an escape character in R strings
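
As a hedged illustration of the path note above, the sketch below builds a portable path with `file.path()` and converts a Windows-style path to forward slashes. The folder names and the Windows path are hypothetical placeholders, not paths used elsewhere in this notebook.

```r
# Build a path from components; file.path() joins them with "/"
# ("projects", "r-intro" and "data" are hypothetical folder names)
data_dir <- file.path("projects", "r-intro", "data")
data_dir

# A Windows-style path needs doubled backslashes inside an R string
win_path <- "C:\\Users\\me\\data"

# Replace each backslash with a forward slash
fwd_path <- gsub("\\\\", "/", win_path)
fwd_path
```

Either form can then be passed to `setwd()` or to a file-reading function.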

In [None]:
# To assign values to an object
num1 <- 5  # Using the assignment arrow "<-"
name = 'John' #Using a single equals sign "="
x = 1:20

In [None]:
# To return the values of R objects, just type the object name
num1
name
x

In [None]:
# NOTE: R is case sensitive:
NUM1   # Error: object 'NUM1' not found (the object is named num1)

In [None]:
# NOTE: A single equals sign "=" is for variable assignment. A double equals sign "==" tests for equivalence
num1 == 4

In [None]:
# Your turn: Create an object called "z" and set it equal to a sequence of integers from 5 to 15


In [None]:
# Concatenate two or more strings
greeting = paste('Hello', name, '!!', sep=" ")
greeting
    #What does the "sep=" argument do?

In [None]:
# List all defined objects
ls()

In [None]:
# Remove an object (rm() warns if the named object does not exist)
rm(num1)

# I. Data Types

Elements of R objects most commonly have one of these 5 data types:
1. **character**: "a", "swc"
2. **numeric**: 2, 15.5
3. **integer**: 2L (the L tells R to store this as an integer)
4. **logical**: TRUE, FALSE
5. **complex**: 1+4i (complex numbers with real and imaginary parts)
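
A minimal sketch for checking these types yourself with `typeof()` and `class()`. Note that R reports the storage type of a numeric value as "double".

```r
# typeof() reports how each value is stored
typeof("swc")    # "character"
typeof(15.5)     # "double" (the storage type behind class "numeric")
typeof(2L)       # "integer"
typeof(TRUE)     # "logical"
typeof(1+4i)     # "complex"

# class() reports "numeric" for doubles
class(15.5)

# Collect the types of several values at once
types <- sapply(list("swc", 15.5, 2L, TRUE, 1+4i), typeof)
types
```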

# II. Data Structures

- All data in R is stored in objects
- Objects can be various **data structures**, such as:
    - vector
    - list
    - matrix
    - data frame
    - factors


- Base data structures can be organized by their dimensionality and 
  whether they are homogeneous or heterogeneous
  - <font color=blue><b>Homogeneous</b></font>: all elements must be of the same type
  - <font color=blue><b>Heterogeneous</b></font>: the elements can be of different types
  
## 1. Vector

- A vector is a series of elements, stored together as a single object. 
- Vectors can be created in R using the <b>c()</b> function, which stands for combine.
- Vectors are **ordered**, so you can select elements from a vector by their position.

**Note: All elements of a vector must be of the same data type. If you mix types, R silently converts them to a common type.**
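
A small sketch of what happens when types are mixed inside `c()`. The variable names `mixed` and `nums` are only for illustration.

```r
# Mixing a number, a string and a logical: everything becomes character
mixed <- c(1, "two", TRUE)
mixed
typeof(mixed)   # "character"

# Mixing numbers and logicals: TRUE becomes 1 and FALSE becomes 0
nums <- c(1.5, TRUE, FALSE)
nums
typeof(nums)    # "double"
```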

In [None]:
# Create  a vector
weights = c(4, 2, 0, 8.65, 93, 4, 9, 3)
countries = c("Malaysia", "Japan", "Iran", "Singapore", "Germany")

countries
weights

In [None]:
# Select the 3rd element from the countries vector
countries[3]

In [None]:
# Select the 2nd, 3rd, and 4th elements from the countries vector
countries[2:4]
countries[4:2]

In [None]:
# To show 1st and 4th elements of the weights vector
weights[c(1, 4)]

In [None]:
#To show all elements of the countries vector EXCEPT the 1st and 4th
countries[-c(1, 4)]

In [None]:
# Show information about an object

# What Data Type?
typeof(weights)
typeof(countries)

class(weights)
class(countries)

print ("=================")

# How many elements
length(countries)

In [None]:
# To name elements of a vector
names(weights) = c("Apple", "Orange", "Kiwi", "Watermelon", "Strawberry", "Blueberry", "Banana", "Durian")

weights
weights["Watermelon"]

In [None]:
animals = c(Lion = 3, Horse = 12, Fish = 53, Eagle = 6)
is.vector(animals)
animals

In [None]:
animals[!animals < 10]
    # What does the "!" mean?

In [None]:
# Your turn: Change the names of the animals vector -- pick 4 new animal names


In [None]:
# Computations on vectors are performed element-wise
a = seq(20, 30)
b = seq(1, 6, by = 0.5)
c = seq(1, 10, by = 2)

a
b
c

In [None]:
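# Computations on vectors are performed element-wise (continued)
a - b   # a and b both have 11 elements
"============"
a + c   # c has only 5 elements, so R recycles it and warns that 11 is not a multiple of 5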

In [None]:
# Computations on vectors are performed element-wise - Continue
d = c * 3
c
d
sum(c)
sum(d)
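
When two vectors have different lengths, R recycles the shorter one. The sketch below, with illustrative names `long` and `short`, shows a clean case and one that triggers a warning.

```r
long  <- 1:6
short <- c(10, 20)

# short is reused three times: 10 20 10 20 10 20
long + short

# Lengths 6 and 4 do not divide evenly, so R still computes a result but warns;
# tryCatch() captures the warning message instead of printing it
res <- tryCatch(1:6 + 1:4, warning = function(w) conditionMessage(w))
res
```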

### 1.1. Add elements to a vector

To add additional elements to a vector, use the **c()** function to combine the elements.

In [None]:
e  <- c(a,b)
e

e  <- c(e,1)

In [None]:
# Your turn: Select the 4th and 6th elements of vector 'e'

## 2. List

- A list can be considered a vector whose **elements can be of any data type**, including lists.
- Lists are created with the <b>list()</b> function.
- Lists are sometimes called “recursive” vectors, because **lists can contain other lists**.
- List elements often have names.
- <b>Use [[ ]] and $ to subset and extend lists.</b>

In [None]:
mylist = list(attr1 = 3, attr2 = TRUE, attr3 = NA, attr4 = "Kuala Lumpur")

names(mylist)
'============'
mylist$attr1
mylist[[1]]
mylist[1]
'============'
mylist$attr4
mylist[[4]]
mylist[4]

In [None]:
mylist2 = list(attr1 = 3, attr2 = TRUE, attr3 = NA, attr4 = "Kuala Lumpur", city=list(city1="KL", city2="PJ", city3="JB"))
mylist2

In [None]:
mylist2$city$city2
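
The bullets above mention extending lists with `[[` and `$`. A minimal sketch, using a hypothetical list named `person`:

```r
person <- list(name = "Sara", age = 28)

# Add new elements with $ or [[ ]]
person$city <- "KL"
person[["graduated"]] <- TRUE
names(person)

# [ ] returns a list, while [[ ]] returns the element itself
class(person["age"])     # "list"
class(person[["age"]])   # "numeric"

# Assigning NULL removes an element
person$city <- NULL
length(person)
```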

## 3. Matrix

A matrix is a collection of data elements of the <font color=red>same data type</font> arranged in a <b>two-dimensional rectangular layout</b>.<br>
In mathematics, a matrix is written as its name followed by its dimensions, Nrows × Ncols (e.g. A<sub>2×3</sub>).<br>
R creates a matrix with the <b>matrix()</b> function.

In [None]:
#Explore R's documentation / help page for the matrix() function
?matrix

In [None]:
# Create a matrix (values fill column by column by default)
m = matrix(c(4, 7, 12, -5, 6, 0, 5, 7, 1, 103, 21, -9), 
           nrow = 3, 
           ncol = 4)
m

In [None]:
# Your turn: Copy the code from the cell above, and add a fourth argument "byrow=TRUE". What does the "byrow" argument do?


In [None]:
# To name dimensions (rows and columns)
dimnames(m) = list(
                    c('r1', 'r2', 'r3'),
                    c('col1', 'col2', 'col3', 'col4'))
m

In [None]:
# Select elements by their row and column indexes: matrix_name[row_number, col_number]
m[2, 3]

# Select all elements in 2nd row
m[2,]

# Select all elements in 3rd column
m[, 3]

# Select multiple rows and columns from matrix
m[c(2, 3), c(1, 4)]

In [None]:
# Transpose matrix
t(m)

In [None]:
# Matrix multiplication
m %*% t(m)
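
A short sketch contrasting element-wise multiplication (`*`) with matrix multiplication (`%*%`). The matrices `A` and `B` are illustrative and separate from `m` above.

```r
A <- matrix(1:4, nrow = 2)   # columns are filled first: rows are (1, 3) and (2, 4)
B <- matrix(5:8, nrow = 2)   # rows are (5, 7) and (6, 8)

# Element-wise product: each cell of A times the matching cell of B
A * B

# Matrix product: rows of A combined with columns of B
A %*% B

# For %*%, the inner dimensions must match: (3 x 4) times (4 x 3) gives 3 x 3
dim(matrix(0, 3, 4) %*% matrix(0, 4, 3))
```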

## 4. Factors

Factors are used to represent <b>categorical variables</b>.<br>
Factors are stored as integer vectors with a set of labels (levels).
<br><b><font color=red>By default, R sorts levels alphabetically</font></b>
<br>Use the <b>factor()</b> function to create a factor.

In [None]:
# Create a factor from a vector of strings
day <- c('Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun') 
day_factor = factor(day) 
day_factor
str(day_factor)

In [None]:
# To put the factor levels in the order we want, pass them in the levels argument
day <- c('Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun') 
day_factor = factor(day, levels = day) 
day_factor
str(day_factor)
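
To see the integer codes behind a factor, a small sketch using `as.integer()` and `levels()`. The names `alpha` and `ordered_days` are illustrative.

```r
day <- c('Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun')

# Default: levels are sorted alphabetically, so 'Fri' gets code 1
alpha <- factor(day)
levels(alpha)
as.integer(alpha)

# With explicit levels, the codes follow the order we supplied
ordered_days <- factor(day, levels = day)
as.integer(ordered_days)
```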

## 5. DataFrame

- To hold data about observations (entities), a richer data structure is needed.
- <b>A DataFrame is used for storing data tables.</b>
- <b>It is a list of vectors of equal length.</b>
- In fact, we can create a DataFrame by combining vectors.
- DataFrames are well suited to working with datasets.

In DataFrames:
- every row represents an observation (entity)
- every column represents the values of one attribute of the observations (e.g. name, age, gender)
- columns can be of different types, but **elements in the same column must have the same type**
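
A minimal sketch confirming that a data frame behaves like a list of equal-length vectors. The data frame `df` here is a throwaway example.

```r
df <- data.frame(Name = c("Amin", "Nur"), Age = c(35, 18), stringsAsFactors = FALSE)

is.list(df)          # a data frame is a list under the hood
length(df)           # number of columns
sapply(df, class)    # each column keeps its own type

# Columns of unequal length are rejected; tryCatch() turns the error into a string
res <- tryCatch(data.frame(a = 1:3, b = 1:2), error = function(e) "error")
res
```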

In [None]:
# To make a DataFrame
# 1. Combine multiple vectors using data.frame() function
Name = c("Amin", "Nur", "John", "Sara")
Age = c(35, 18, 53, 28)
Graduated = c(TRUE, FALSE, FALSE, TRUE)
students = data.frame(Name, Age, Graduated) # In R < 4.0 strings are stored as factors by default; since R 4.0 they stay character

students

In [None]:
# View structure of data frame
str(students)
    # What are the data types of each column in the data frame?

In [None]:
# You can choose whether to store character variables as data type character or factor
students2 = data.frame(Name, Age, Graduated,  stringsAsFactors = FALSE) # Strings will NOT be stored as factors
str(students2)

In [None]:
# Select specific rows and columns from the data frame: df_name[row_number, col_number]
students[2, 1]
students[2, c(1,3)]

In [None]:
# There are many ways to select columns from a data frame 

students[,1]
students[,'Name']
students$Name
students[1]
students[[1]]
students['Name']
students[['Name']]

In [None]:
# Select multiple data frame columns by *combining* their names with c() (column names are case sensitive)
students[,c('Name','Age')]

In [None]:
# Select data frame rows where condition is TRUE
students[students$Age>30,]

In [None]:
# Change values in a row of a data frame
students2[1,] = list(Name = "Amin_2", Age = 36, Graduated = TRUE) 
students2

In [None]:
# Change values in a column of a data frame. Remember, columns are just vectors!
students2[, 2] = c(37, 29, 67, 19)
students2

In [None]:
# Add column to a data frame using assignment
height = c(177, 162, 168, 170)
students$Height = height
students

In [None]:
# Add column to a data frame using cbind()
Weight = c(83, 60, 75, 80)
students = cbind(students, Weight)
students

In [None]:
# Add row to a data frame using rbind()
tmp1 = data.frame(Name = "Ali" , Age = 21, Graduated = FALSE, Height = 169, Weight = 68)
students = rbind(students, tmp1)
students

In [None]:
# Number of rows?
nrow(students)

# Number of columns?
ncol(students)

# Dimensions?
dim(students)

- How to sort a DataFrame based on one of its columns?
  <br><b>order()</b> function helps

In [None]:
# To sort a data frame

# We want to sort students by age, ascending
# order() returns the row indexes that put the values in sorted order

rank = order(students$Age)
rank

students[rank,]

In [None]:
# R's default is to sort ascending. To sort descending, add the "decreasing=TRUE" argument
students[order(students$Age, decreasing = TRUE),]

In [None]:
# Read data from a file (e.g. a CSV file): open the help page for read.csv()
?read.csv
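
A hedged round-trip sketch: it writes a tiny data frame to a temporary CSV file and reads it back, so it runs without any external file. The data frame `demo` and the temporary path are illustrative only.

```r
# Write a small data frame to a temporary CSV file
tmp <- tempfile(fileext = ".csv")
demo <- data.frame(Name = c("Amin", "Nur"), Age = c(35, 18))
write.csv(demo, tmp, row.names = FALSE)

# Read it back; the first line of the file becomes the column names
loaded <- read.csv(tmp, stringsAsFactors = FALSE)
str(loaded)

# Clean up the temporary file
unlink(tmp)
```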

R comes with several built-in data frames that are useful for practicing analysis.
<br>Try looking into the following data frames:
- women
- mtcars
- USArrests
- chickwts
- airquality

In [None]:
head(mtcars, 5) # Print first 5 rows of mtcars

In [None]:
tail(USArrests, 4) # Print last 4 rows of USArrests

In [None]:
colnames(USArrests) # column names

In [None]:
rownames(tail(mtcars, 7)) # row names

## 7. Missing values

- Missing values in R are denoted by either <font color=red><b>NA</b></font> or <font color=red><b>NaN</b></font>. 
- <b>NaN</b> is used for undefined mathematical operations.
    - <font color=green>is.nan()</font> can be used to check for it.
- <b>NA</b> is usually used for everything else
    - <font color=green>is.na()</font> can be used to check for it.


- <font color=red>A NaN value also counts as NA (is.na() returns TRUE for it), but the converse is not true.</font>

In [None]:
testvals = c(4, 0 , -7, NaN, 23, NA, -60)
print (testvals)
"========================="
is.nan(testvals)
"========================="
is.na(testvals)

### 7.1. How to remove NA values?

#### 7.1.1. using is.na() function:

In [None]:
bads = is.na(testvals)
bads
!bads
testvals[!bads]

#### 7.1.2. using complete.cases() function:

In [None]:
goods = complete.cases(testvals)
print (goods)
"================================"
testvals[goods]

In [None]:
testvals1 = c( 4 , 0 ,  -7,  NaN, 23,  NA, -60)

testvals2 = c("a", NA , "b", "c", "d", NA, "e")

goods = complete.cases(testvals1, testvals2)
goods
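
Two further base R options for handling missing values, shown as a sketch: the `na.rm` argument of summary functions and `na.omit()`. The vector `vals` repeats the values of `testvals` so the block runs on its own.

```r
vals <- c(4, 0, -7, NaN, 23, NA, -60)

# Without na.rm the result is itself missing
is.na(mean(vals))

# na.rm = TRUE drops both NA and NaN before computing
mean(vals, na.rm = TRUE)

# na.omit() removes missing values; as.numeric() strips the bookkeeping attributes
clean <- as.numeric(na.omit(vals))
clean
```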

## <font color=red><b>Quiz:</b></font>

Write code to show the following results:
1. Using the USArrests dataset, the 3 states with the highest Murder rate.
2. Using the mtcars dataset, the most powerful car (highest hp) that has 3 gears.
3. Using the mtcars dataset, the average horsepower (hp) of the cars with 4 gears and 6 cylinders (cyl).

# II. Control Structures

A control structure is a block of programming that <font color=red>analyzes variables and chooses a direction</font> in which to go based on given parameters.<br>
Control Structures can be divided into two main categories:
- Conditional statements
- Repeat Loops

## 1. Conditional statements

In R, as in other programming languages, the <b>if-else statement</b> evaluates a condition (simple or compound) to decide which part of the code should be executed.<br>

if (condition) {<br>
&nbsp;&nbsp;&nbsp;&nbsp;statement1<br>
} else {<br>
&nbsp;&nbsp;&nbsp;&nbsp;statement2<br>
}

In [None]:
head(students)

In [None]:
students[1,]$Age
students[1,]$Graduated
"========================"
if(students[1,]$Age > 40 || students[1,]$Graduated == FALSE){
        print ("Is not eligible!!")
}else{
        print ("Is eligible!!")
    }

In [None]:
# Nested if

x = -5
y = 0
z = -3

if(x > 0){
    if(y == 10){
        print ("M1_1")
    }else{
        print ("M1_2")
    }
}else if(x == 0){
    if(y <= 0){
        print ("M2_1")
    }
    else{
        print ("M2_2")
    }
}else{
    print("M3")
}

In [None]:
# ifelse

x = 5
z = ifelse(x == 5, 1, -1)
print (z)
"================"
z = ifelse(x != 5, 1, -1)
print (z)
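
`ifelse()` is vectorized, so it can classify a whole vector in one call. A minimal sketch with illustrative names `ages` and `group`:

```r
ages <- c(35, 18, 53, 28)

# The test is evaluated for every element, and the matching label is picked
group <- ifelse(ages >= 30, "30+", "under 30")
group
```

This differs from `if`/`else`, which expects a single TRUE or FALSE condition.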

## 2. Repeat Loops

These structures repeat the given statements a <b>specified</b> or <b>unspecified number</b> of times.

### 2.1. for

- implements a loop with a specified number of repetitions.<br>

for (<i>variable</i> in <i>sequence</i>){<br>
&nbsp;&nbsp;&nbsp;&nbsp;statements<br>
}

In [None]:
for(i in 1:5){
    print (paste(i, ") Hello World !"))
}

In [None]:
for(name in names(students)){
    print (paste("Column ", name, ": "))
    print (students[[name]])
}

In [None]:
for(i in 1:nrow(students)){
    # Build the sentence in msg (this avoids confusion with the built-in str() function)
    msg = paste(students[i,]$Name, "is", students[i, ]$Age, "years old", sep = ' ')
    if(students[i,]$Graduated == TRUE)
        msg = paste(msg, "and has graduated.")
    else
        msg = paste(msg, "but has not graduated.")
    print(msg)
}

### 2.2. while

- implements loops with an unspecified number of repetitions
- It repeats the statement(s) while the given condition is true.

while(condition){<br>
statements<br>
}

In [None]:
cnt = 1
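# Stop at the last row; past it the condition would be NA and the loop would fail
while(cnt <= nrow(students) && students[cnt,]$Graduated == TRUE){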
    print (students[cnt,])
    cnt = cnt + 1
    }

In [None]:
cnt = 1
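# Stop at the last row; past it the condition would be NA and the loop would fail
while(cnt <= nrow(students) && (students[cnt,]$Graduated == TRUE | students[cnt,]$Age < 20)){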
    print (students[cnt,])
    cnt = cnt + 1
    }

### 2.3. Repeating Functions

- <font color=red><b>Will be taught after functions</b></font>

# 3. Functions

- R has two kinds of functions:
1. Built-in functions (e.g. c(), data.frame(), print())
2. User-defined functions

The structure of a function is as below:

<b>
myfunction <- function(arg1, arg2, ... ){<br>
statements<br>
return (object)<br>
}</b>

In [None]:
# A simple function to multiply the elements of a numeric vector
# (for non-empty vectors, the built-in prod() gives the same result)

multiply <- function (x){
    if(length(x) == 0 || is.vector(x) == FALSE){
        NULL
    }
    else{
        m = 1
        for ( num in x){
            m = m * num
        }
        return(m)
    }
}

numbers = c(2, 4, 6, 10)
multiply(numbers)
multiply(c())

## <font color=red><b>Quiz:</b></font>

- Write a function to find the second-largest number in a vector of numbers.

In [None]:
second_max <- function (x){
    if(length(x) <= 1){
        NULL
    }else{
        fmax = -Inf  # largest value seen so far
        smax = -Inf  # second-largest value seen so far
        
        for(num in x){
            if(num > fmax){
                smax = fmax
                fmax = num
            }else if(num > smax){
                smax = num
            }
        }
        smax
    }
}

second_max(c(1,2,4,7,5,9,3))
second_max(c(2))
second_max(c(5,5))

- <font color=red>Example:</font> Write a function to return x^y (x raised to the power y).

In [None]:
powerxy <- function(base=2, pow=0){
    if(base == 0){
        0
    }else if(pow == 0){
        1
    }else{
        res = 1
        for(i in 1:pow){
            res = res * base
        }
        return (res)
    }
}
    
powerxy(2, 5)
powerxy(5, 2)
powerxy(pow=5, base=2)
powerxy(pow=10)
powerxy(base=10)
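
Section 2.3 defers "repeating functions" until after functions. Assuming that refers to R's apply family (an interpretation; the source does not say), a minimal sketch with a hypothetical helper `square()`:

```r
# A small user-defined function to apply repeatedly
square <- function(x) x^2

# sapply() calls the function on each element and simplifies the result to a vector
squares <- sapply(1:5, square)
squares

# lapply() does the same but always returns a list
totals <- lapply(list(a = 1:3, b = 4:6), sum)
totals
```

Both replace an explicit `for` loop when the same function is applied to every element.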

# 4. Statistics

## 4.1. How to generate random numbers

- R provides functions for generating random numbers from particular distributions.


In [None]:
# ● BINOMIAL Distribution

# rbinom(n, size, prob)
# n: number of observations
# size: number of trials (zero or more)
# prob: probability of success on each trial

rbinom(10, 4, 0.5)
rbinom(10, 4, 0.5)
rbinom(10, 4, 0.5)

In [None]:
# ● Geometric Distribution

# rgeom(n, prob)
# n: number of observations
# prob: probability of success on each trial

rgeom(10, 0.5)
rgeom(10, 0.5)
rgeom(10, 0.5)

In [None]:
# ● Poisson Distribution

# rpois(n, lambda)
# n: number of observations
# lambda: vector of (non-negative) means

rpois(10, 0.5)
rpois(10, 0.5)
rpois(10, c(0.75, 0.5, 0.25))

In [None]:
# ● Exponential Distribution

# rexp(n, rate)
# n: number of observations
# rate: vector of rates

rexp(10, rate = 1)
rexp(10, rate = 5)
round(rexp(10, rate = c(1, 3, 2)), 3)

In [None]:
# ● Normal Distribution

# rnorm(n, mean, sd)
# n: number of observations
# mean: vector of means
# sd: vector of Standard Deviations

rnorm(10, mean = 0, sd = 1)
rnorm(10, 5, 2)
rnorm(10, 5, 0.2)

In [None]:
# ● Uniform Distribution

# runif(n, min, max)
# n: number of observations
# min: lower limit of the distribution. Must be finite.
# max: upper limit of the distribution. Must be finite.

runif(10, min = 0, max = 1)
runif(10, min = 0, max = 5)
round(runif(10, min = 4, max = 6), 4)
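
Each call above returns different numbers. To make random draws reproducible, set a seed first. A short sketch (the seed value 42 is arbitrary):

```r
# Setting the same seed before each draw gives the same numbers
set.seed(42)
first <- rnorm(5)

set.seed(42)
second <- rnorm(5)

identical(first, second)
```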

## <font color=red>Exercise</font>

- Make a list of 3 sets of 25 random values drawn from normal, uniform, and Poisson distributions, respectively.

## 4.2. How to guess which distribution fits a set of values?

- <font color=green><b>fitdistrplus package</b></font> is a good tool to do so.

In [None]:
data = rnorm(1000, 8, 2)

#install.packages('fitdistrplus', repos = "https://cloud.r-project.org")
library('fitdistrplus')

descdist(data, discrete = FALSE) # If discrete is TRUE, the distribution is considered as discrete

In [None]:
fit.weibull = fitdist(data, "weibull")
fit.norm = fitdist(data, "norm")

In [None]:
plot(fit.norm)

In [None]:
plot(fit.weibull)

In [None]:
data = runif(1000, 8, 20)

descdist(data, discrete = FALSE) 

In [None]:
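# Assumes a data frame named whodata with a Population column is already loaded; it is not created in this notebook
descdist(whodata$Population, discrete = FALSE) 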

In [None]:
classHeight = c(168, 173, 176, 171, 168, 176, 175, 176, 168, 162, 175, 174, 180, 172, 175, 174)
descdist(classHeight, discrete = FALSE) 

In [None]:
wh = read.csv("Weight_Height.csv")
head(wh)
descdist(wh$Height, discrete = FALSE) 

## 4.3. Statistical Tests

- Suppose that there are two sets of data. Statistical tests let us judge whether the two sets differ significantly.
- A statistical test produces a p-value.
- If <b>p-value < 0.05</b>, the difference is conventionally considered statistically significant (at the 5% level).

I. t.test<br>
Performs one- and two-sample t-tests (Student's t-test) on vector(s) of <b>normally distributed</b> data.

In [None]:
x = rnorm(1000, 10, 2)
y = rnorm(1000, 10, 3)

# x and y are independent samples, so use an unpaired (Welch) two-sample test
t.test(x, y, paired=FALSE)

In [None]:
ttest.res = t.test(x, y, paired=FALSE)
names(ttest.res)

ttest.res$p.value
round(ttest.res$p.value, 4)

if(ttest.res$p.value < 0.05){
    print ("Two sets are significantly different.")
}else{
    print ("Two sets are NOT significantly different.")
}
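
For contrast, a sketch in which the two samples really do come from distributions with different means, so the test should report a significant difference. The sample sizes, means, and seed are arbitrary illustrative choices.

```r
set.seed(1)

# Two independent samples whose true means differ by 2
a <- rnorm(200, mean = 10, sd = 2)
b <- rnorm(200, mean = 12, sd = 2)

# With two vectors and default arguments, t.test() runs Welch's two-sample test
res <- t.test(a, b)
res$method
res$p.value < 0.05
```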

# 5. Graphing

- Data visualization plays an important role in making data and analysis results easier to understand.
- R offers powerful packages with many types of graphs for different purposes.

<font color=green>** We favor the <b>ggplot2</b> package. Here's why:
- More elegant & compact code
- More aesthetically pleasing
- More powerful and flexible</font>

In ggplot2, the focus is not on drawing lines and points, but on creating data visualisations, and it shines in rapid data exploration. In fact, it has become the go-to tool for flexible and professional plots in R.

<font color=green>** A basic ggplot2 plot consists of:<br>
<b>- data:</b> Must be a data.frame<br>
<b>- aesthetics:</b> In ggplot2 jargon, aesthetics means "something you can see", and it's used to describe visual characteristics that represent data, such as: color, size, shape, and position of x, y.<br>
<b>- geometry:</b> Geometries of plotted objects (points, lines, polygons, etc.)</font>

In [None]:
#install.packages('ggplot2')
library('ggplot2')
x = seq(1, 100, 0.5)
y= x^3
# qplot() is a quick-plot shortcut; it is deprecated since ggplot2 3.4.0 in favor of ggplot()
qplot(x, y, color=x, main="Main Title", xlab="X axis label", ylab="Y axis label")

In [None]:
qplot(mpg, wt, data = mtcars, geom = "auto")

In [None]:
qplot(factor(cyl), hp, data = mtcars, geom = "boxplot")

In [None]:
# For more advanced plots, use ggplot
ggplot(data=mtcars) +
  geom_point(aes(x=mpg, y=wt, color=cyl)) +
  labs(title="Car Data",
       subtitle="MPG vs. Weight",
       x="MPG", y="Weight")

# 6. Other Built-in R Functions

In [None]:
# Constants
letters
month.name
month.abb
pi

# Built-in functions
sqrt(64)
round(3.141593, digits=2)
strsplit("dengue", "")
toupper("hello world")
paste("Today is", date())

# User input
my.name <- readline(prompt="Enter your name: ")