# Introduction to R Programming

R is a language and environment for statistical computing and graphics. It provides a wide variety of statistical (linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering, etc.) and graphical techniques.

In addition to its extensive analytical packages, R has strong plotting capabilities that many analysts prefer to those of Python's plotting libraries. Charts made with ggplot2 (a widely used R graphics package) appear in many of the publications and papers you may read.

R programming is very different from Python because it is an analytical language, not one designed for software engineering. Thus, many of the constructs were built with analysis in mind.

In this lecture, we will go over basic syntax for R, the parallels between Python and R, and some of the key differences between the two languages.

### Basics

#### Data Types and Variable Assignment

R reaches many of the same results as Python through different syntax. Let's go through the basics. First, variables are conventionally assigned with "arrows" (`<-`) instead of equal signs. The equal sign *can* work, but there are situations where it does something different (inside a function call, for example, it names an argument rather than assigning a variable), so defaulting to arrows should be your standard practice.
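To make the difference between the arrow and the equal sign concrete, here is a minimal sketch. The variable names `x`, `y`, `m`, and `m2` are chosen only for illustration.

```r
# `<-` assigns a variable in the current environment.
x <- c(1, 2, 3)

# Inside a function call, `=` only names an argument; it does not
# create or change a variable called x.
m <- mean(x = c(10, 20, 30))

# `<-` inside a call does assign, as a side effect of evaluating the argument.
m2 <- mean(y <- c(10, 20, 30))

print(x)  # still 1 2 3
print(y)  # 10 20 30
print(m)  # 20
```

Assigning inside a call is rarely good style, but it shows why the two operators are not interchangeable.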

##### Numerics
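By default, numeric values in R share a single class, `numeric`, and are stored as double-precision floats. R does have a separate integer type (written with an `L` suffix, such as `1L`), but unlike in Python, a plain literal like `1` is not an integer.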

In [1]:
#Assign numerics
a <- 1
b <- 1.1

In [2]:
print(class(a))
print(class(b))

[1] "numeric"
[1] "numeric"
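For comparison, the following sketch shows what happens when an integer is requested explicitly with the `L` suffix. The name `n` is used only for this illustration.

```r
a <- 1    # a plain literal is stored as a double
n <- 1L   # the L suffix requests an integer

print(class(a))       # "numeric"
print(class(n))       # "integer"
print(typeof(a))      # "double"
print(is.integer(a))  # FALSE
print(a == n)         # TRUE: the values compare equal despite the storage types
```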


##### Characters
Character values in R are like strings in Python.

In [3]:
#Assign characters
c <- 'My name is Jason'
d <- "Quote types don't matter"

In [4]:
print(c)
print(d)

[1] "My name is Jason"
[1] "Quote types don't matter"


In [5]:
print(class(c))
print(class(d))

[1] "character"
[1] "character"


#### Vectors
Vectors in R are the closest counterpart to lists in Python.

In [6]:
e <- c(1,2,4,7,9,1)
length(e)
print(e)
class(e)

[1] 1 2 4 7 9 1


Accessing an element of a vector works much like indexing a Python list. R, however, differs in two ways. First, indexing starts at 1 instead of 0. Second, a range is *inclusive* of its end point, so `e[1:3]` returns three elements.

In [7]:
#First element
e[1]

In [8]:
#Fifth element
e[5]

In [9]:
#Trying to access beyond what is available (returns NA rather than an error)
e[7]

In [10]:
#Accessing first through third
e[1:3]

In [11]:
#Accessing third through end
e[3:] #Nope!

ERROR: Error in parse(text = x, srcfile = src): <text>:2:5: unexpected ']'
1: #Accessing third through end
2: e[3:]
       ^


In [12]:
e[3:length(e)]
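A few other ways to take "everything from the third element on" are sketched below. Note that a negative index in R *drops* elements; it does not count from the end as it does in Python.

```r
e <- c(1, 2, 4, 7, 9, 1)

print(e[3:length(e)])  # 4 7 9 1
print(e[-(1:2)])       # drop the first two elements: 4 7 9 1
print(tail(e, -2))     # all but the first two: 4 7 9 1
print(e[length(e)])    # the last element: 1
```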

Note that the class of a vector is not a separate vector type, but rather the type of the data contained in the vector. Let's see what happens when we try other combinations of values.

In [13]:
f <- c('a', 'b', 'hello')
length(f)
print(f)
class(f)

[1] "a"     "b"     "hello"


Can we mix types like we can in Python?

In [14]:
g <- c('a', 1, FALSE)
length(g)
print(g)
class(g)

[1] "a"     "1"     "FALSE"


Uh oh! Much like Pandas and NumPy will sometimes do to get all of the data into the same data type, R is auto-converting things for you: it coerces every element to the most flexible type present (here, character). You should be very careful when creating vectors to ensure that you are not mixing data types.
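The coercion follows a fixed order: logical, then integer, then double, then character. A short sketch of each step, plus converting a value back explicitly:

```r
print(class(c(TRUE, 2L)))   # "integer":   TRUE becomes 1L
print(class(c(2L, 3.5)))    # "numeric":   2L becomes 2.0
print(class(c(3.5, 'a')))   # "character": 3.5 becomes "3.5"

# Converting back has to be done explicitly.
g <- c('a', 1, FALSE)
print(as.numeric(g[2]) + 1) # 2
```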

#### Lists
Lists in R are a hybrid of lists and dictionaries in Python. Unlike vectors, they are not restricted to holding a single data type. Elements are indexed by position by default, but those positions can also be named.

In [15]:
h <- list('a', 1, c(1,2,3), FALSE)
length(h)
print(h)
class(h)

[[1]]
[1] "a"

[[2]]
[1] 1

[[3]]
[1] 1 2 3

[[4]]
[1] FALSE



In [16]:
#Access the first element of the list by position (h[1] returns a one-element list)
h[1]

In [17]:
#Access first and third elements of the list
print(h[c(1,3)])

[[1]]
[1] "a"

[[2]]
[1] 1 2 3



Now let's give these list indices some names.

In [18]:
names(h) <- c('first', 'second', 'third', 'fourth')

In [19]:
print(h)

$first
[1] "a"

$second
[1] 1

$third
[1] 1 2 3

$fourth
[1] FALSE



In [20]:
h['first']

In [21]:
h$first

In [22]:
h[1]

In [23]:
h[2] + 2

ERROR: Error in h[2] + 2: non-numeric argument to binary operator


In [24]:
h$second + 2
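The error from `h[2] + 2` comes from the difference between single and double brackets: single brackets return a sub-list, while double brackets (and `$`) return the element itself. A minimal sketch that rebuilds the same list with its names:

```r
h <- list(first = 'a', second = 1, third = c(1, 2, 3), fourth = FALSE)

print(class(h[2]))        # "list":    a one-element list, so arithmetic fails
print(class(h[[2]]))      # "numeric": the element itself
print(h[[2]] + 2)         # 3
print(h[['second']] + 2)  # 3, same element looked up by name
print(h$third[2])         # 2, indexing into the vector stored in the list
```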

##### Variable Naming
Unlike in Python, variable names in R often contain periods, either instead of or alongside underscores. For example:

In [25]:
my_number <- 7

In [26]:
my_number.plus7 <- my_number + 7
my_number.plus7

In [27]:
my_number.plus7.another7 <- my_number.plus7 + 7
my_number.plus7.another7

While this would never work in Python, where the period is the attribute-access operator, it is common practice for people coding in R.

### Operators

Many of the operators in R and Python are the same while others are different.

In [28]:
#addition
9 + 4

In [29]:
#multiplication
9 * 4

In [30]:
#division
9 / 4

In [31]:
#exponentiation
9 ^ 4
9 ** 4

In [32]:
#integer division
9 %/% 4

In [33]:
#modulo
9 %% 4
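One difference worth seeing early is that R's arithmetic operators work element-wise on vectors, whereas `[1, 2, 3] * 2` on a Python list repeats the list. A brief sketch, with variable names chosen only for illustration:

```r
v <- c(1, 2, 3)

doubled <- v * 2             # each element multiplied: 2 4 6
summed  <- v + c(10, 20, 30) # element-wise sum: 11 22 33
print(doubled)
print(summed)

# Integer division and modulo round toward negative infinity,
# matching Python's // and % operators.
print((-7) %/% 2)  # -4
print((-7) %% 2)   # 1
```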

### Conditionals
Comparison operators are nearly identical to Python's. Note that Boolean values in R are written in upper case (TRUE, FALSE) rather than title case (True, False).

In [34]:
3 == 3.0

In [35]:
3 < 3

In [36]:
3 <= 4

In [37]:
3 != 4

In [38]:
(4+3) >= (14/2)

In [39]:
1 %in% c(1,4,5)

In [40]:
is.element(1, c(2,3,4))

In [41]:
match(1, c(1,4,5)) #returns index of first match; else NA

In [42]:
match(1, c(2,3,4))
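These comparisons are also vectorized: given a vector on the left, they return one result per element. The sketch below uses a made-up vector `x`.

```r
x <- c(1, 2, 7)

print(x %in% c(1, 4, 5))    # TRUE FALSE FALSE
print(match(x, c(1, 4, 5))) # 1 NA NA
print(x > 1 & x < 5)        # FALSE TRUE FALSE (& combines element-wise)
print(any(x > 5))           # TRUE:  at least one element is greater than 5
print(all(x > 0))           # TRUE:  every element is positive
```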

### Control Flow in R

Control flow in R is quite similar to Python. The big difference is that indentation no longer determines code blocks; R uses curly braces to delimit sections of code. Let's try it below.

In [44]:
i <- 7

#print() is used because return() is only meaningful inside a function
if (i %% 2 == 1) {
    print('odd')
} else {
    print('even')
}

In [45]:
i <- 7.1

if (i %% 2 == 1) {
    print('odd')
} else if (i %% 2 == 0) {
    print('even')
} else {
    print('not a whole number')
}
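The same branching can be wrapped in a function, where `return()` is well-defined. This is only a sketch; the function name `classify_parity` is hypothetical.

```r
classify_parity <- function(i) {
    if (i %% 2 == 1) {
        return('odd')
    } else if (i %% 2 == 0) {
        return('even')
    }
    # Reached only when neither branch returned; the last value is the result.
    'not a whole number'
}

print(classify_parity(7))    # "odd"
print(classify_parity(8))    # "even"
print(classify_parity(7.1))  # "not a whole number"
```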

### For-Loops
For-loops in R use the same parenthesized header and curly-brace body as if statements.

In [46]:
for (i in c(1:10)) {
    print(i)
}

[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
[1] 6
[1] 7
[1] 8
[1] 9
[1] 10


In [53]:
for (i in c(1:10)) { #NOTE: c(1:10) is equivalent to range(1,11) in Python
    if (i %% 2 == 0) {
        print(c(i, 'even'))
    } else {
        print(c(i, 'odd'))
    }
}

[1] "1"   "odd"
[1] "2"    "even"
[1] "3"   "odd"
[1] "4"    "even"
[1] "5"   "odd"
[1] "6"    "even"
[1] "7"   "odd"
[1] "8"    "even"
[1] "9"   "odd"
[1] "10"   "even"
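The quoted numbers in this output appear because `c(i, 'even')` mixes a number with a character value, so the number is coerced to character. As a possible alternative, the same labels can be computed without a loop using the vectorized `ifelse()`. The name `parity` is chosen only for this sketch.

```r
i <- 1:10   # 1:10 is already a vector, so wrapping it in c() is optional

# ifelse() evaluates the condition once per element.
parity <- ifelse(i %% 2 == 0, 'even', 'odd')

# paste() joins the two vectors element by element into strings.
print(paste(i, parity))
```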


### Functions
Functions work much as they do in Python. They allow you to store procedures that are accessible by passing arguments (if any). Let's write a simple one here.

In [61]:
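#y and z are never used in the body; R evaluates arguments lazily,
#so leaving them out of a call does not raise an error
determine_if_even <- function(x,y,z) {
    if(x %% 2 == 0) {
        return(TRUE)
    } else {
        return(FALSE)
    }
}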

In [62]:
j <- 7
determine_if_even(j)    #y and z omitted
k <- 5
m <- 4
determine_if_even(k, m) #m is passed as y but never used

In [50]:
for (i in c(1:10)) {
    print(c(i, determine_if_even(i)))
}

[1] 1 0
[1] 2 1
[1] 3 0
[1] 4 1
[1] 5 0
[1] 6 1
[1] 7 0
[1] 8 1
[1] 9 0
[1] 10  1
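Because `%%` and `==` are vectorized, a function like this can often be applied to a whole vector at once. The sketch below uses a hypothetical one-argument helper, `is_even`, and relies on R returning the value of the last expression in a function body.

```r
is_even <- function(x) {
    x %% 2 == 0   # the value of the last expression is returned
}

print(is_even(1:10))          # one TRUE/FALSE per element
print(sapply(1:10, is_even))  # same result, applying the function element by element
print(sum(is_even(1:10)))     # 5: TRUE counts as 1 when summed
```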


### DataFrames

Pandas DataFrames were inspired by R's data frames. Much of the functionality is similar, but the syntax is very different. Let's go over a few of the basic commands.

In [63]:
df <- read.csv('students.csv')

In [64]:
head(df, 5)

student_id,first,last,gender,class,major,gpa
5a397209-3782-4764-a285-10fae807ee71,Janis,Brown,Female,Junior,Economics,3.12
e26c3d69-3c74-49b6-81d7-47232787fad9,Timothy,Bishop,Male,Sophomore,Economics,3.48
975c1581-5ba2-430c-a3d1-01ce03bd83f9,Elizabeth,Owens,Female,Freshman,Finance,3.4
6081f91d-365c-46ce-ad1b-38af120781d9,Edward,Pearson,Male,Freshman,Math,3.84
84cec8f4-0b64-44ce-a628-c0eb73f6ca6f,Lisa,Gonzalez,Female,Junior,Finance,4.0


In [65]:
tail(df, 10)

Unnamed: 0,student_id,first,last,gender,class,major,gpa
9991,3927810e-476d-45db-b88c-660e00385ae5,Mable,Thompson,Female,Senior,Economics,3.74
9992,6b8c4f7a-546d-41d5-978f-d461c287fba1,Joseph,Robinson,Male,Senior,Engineering,2.89
9993,c32e6f19-817d-4e60-bf20-ed7ca573a7f7,Gladys,Paul,Female,Freshman,Chemistry,2.96
9994,32606dc2-862b-45cc-b0ac-f2b24253abdf,Douglas,Haas,Male,Junior,Economics,3.73
9995,8dc612f4-8150-4045-9e2d-cf160fb71da4,Brandy,Alford,Female,Senior,Economics,3.02
9996,3f1f6525-3ec0-4184-b435-c829419bf582,Kendra,Bayer,Female,Sophomore,Finance,3.86
9997,bc551659-ba48-447e-aa6a-0c2f49aaa9c1,Tonya,Burnett,Female,Senior,Economics,3.94
9998,4884e643-4a94-4362-a422-604763401487,Deborah,Conley,Female,Senior,Engineering,3.28
9999,034754f5-50dd-42e5-a916-cc6c9d9d0131,David,Seay,Male,Senior,Math,3.14
10000,75c02f31-566f-439e-875e-5af9fe412977,Cheryl,Edwards,Female,Junior,Engineering,2.93


In [66]:
df[1,]

student_id,first,last,gender,class,major,gpa
5a397209-3782-4764-a285-10fae807ee71,Janis,Brown,Female,Junior,Economics,3.12


In [67]:
df[c(1,3,4),c('student_id', 'first', 'last')]

Unnamed: 0,student_id,first,last
1,5a397209-3782-4764-a285-10fae807ee71,Janis,Brown
3,975c1581-5ba2-430c-a3d1-01ce03bd83f9,Elizabeth,Owens
4,6081f91d-365c-46ce-ad1b-38af120781d9,Edward,Pearson


In [69]:
df[c(1,3,4), c(1,3,2)]

Unnamed: 0,student_id,last,first
1,5a397209-3782-4764-a285-10fae807ee71,Brown,Janis
3,975c1581-5ba2-430c-a3d1-01ce03bd83f9,Owens,Elizabeth
4,6081f91d-365c-46ce-ad1b-38af120781d9,Pearson,Edward
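The indexing pattern is always `df[rows, columns]`. Since `students.csv` may not be at hand, here is a self-contained sketch on a small hand-built data frame; the name `toy` is hypothetical and the values are copied from the first rows shown earlier.

```r
toy <- data.frame(
    first  = c('Janis', 'Timothy', 'Elizabeth', 'Edward'),
    gender = c('Female', 'Male', 'Female', 'Male'),
    gpa    = c(3.12, 3.48, 3.40, 3.84),
    stringsAsFactors = FALSE
)

print(toy[2, ])        # row 2, all columns (still a data frame)
print(toy[, 'gpa'])    # one column, returned as a plain vector
print(toy[['gpa']])    # the same column via list-style access

# A logical vector in the row position keeps only the rows where it is TRUE.
high <- toy[toy$gpa > 3.4, c('first', 'gpa')]
print(high)
```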


In [72]:
student.genders <- df$gender
class(student.genders)

In [73]:
male_df <- subset(df, gender=='Male')
head(male_df)

Unnamed: 0,student_id,first,last,gender,class,major,gpa
2,e26c3d69-3c74-49b6-81d7-47232787fad9,Timothy,Bishop,Male,Sophomore,Economics,3.48
4,6081f91d-365c-46ce-ad1b-38af120781d9,Edward,Pearson,Male,Freshman,Math,3.84
6,6c849c3e-e640-4bba-a86a-4323fd513b90,Alphonse,Allen,Male,Freshman,Engineering,2.99
7,a5c87c39-447c-4c29-92af-fa702a8d5595,Linwood,Coleman,Male,Freshman,Engineering,3.58
8,f6b177e8-e00a-480e-b62e-906c2ad80f85,Arthur,Mccolpin,Male,Junior,Math,3.04
9,8387594f-c9b2-4daa-ae93-c3e40f58cb26,Daniel,Carter,Male,Junior,Chemistry,2.41


In [84]:
female_econ_df <- subset(df, gender=='Female' & major=='Economics')
head(female_econ_df)

Unnamed: 0,student_id,first,last,gender,class,major,gpa
1,5a397209-3782-4764-a285-10fae807ee71,Janis,Brown,Female,Junior,Economics,3.12
13,1846c044-9a87-49e4-ad4f-d1bfadb4e41b,Lisa,Walden,Female,Senior,Economics,3.4
16,35de9214-505e-4d55-80e6-0302098a44b6,Joan,Decoteau,Female,Senior,Economics,3.81
44,667a8999-3765-4a83-bc64-612f64a2011e,Tosha,Flanagan,Female,Junior,Economics,3.92
50,ee1ce005-5671-431b-b832-61452d15ad76,Yvonne,Delo,Female,Sophomore,Economics,2.87
55,ed1f11d4-cf19-4792-83dd-0f9fea5a4692,Karla,Cota,Female,Junior,Economics,3.12


In [82]:
male_engineering_honors <- subset(df, gender=='Male' & major=='Engineering' & gpa >= 3.7)
head(male_engineering_honors)
nrow(male_engineering_honors) #number of matching rows; length() would give the number of columns

Unnamed: 0,student_id,first,last,gender,class,major,gpa
48,4260cfbb-cf43-4049-b03c-75143f64c52d,Omar,Gaston,Male,Junior,Engineering,3.86
488,d53370d0-30fa-4f46-be05-7d0edd6f0322,Michael,Greco,Male,Senior,Engineering,3.78
853,3e366b15-cb12-47e0-93f1-b9d4af23209f,Patrick,Exline,Male,Junior,Engineering,3.79
1105,458a5ae3-4069-4f0d-8169-777bbd558e79,Gary,Mojica,Male,Senior,Engineering,4.0
1224,08b8617d-6206-4636-826d-7ab66e91b0b4,Luis,Pool,Male,Senior,Engineering,3.88
1295,bfa95934-e80f-4db7-8150-23bd42e9d838,Mark,Fuller,Male,Sophomore,Engineering,3.72
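`subset()` is a convenience wrapper around logical row indexing, and counting the result needs `nrow()` rather than `length()`. A minimal sketch on made-up data follows; the `toy` data frame and its values are hypothetical.

```r
toy <- data.frame(
    gender = c('Male', 'Male', 'Female', 'Male'),
    major  = c('Engineering', 'Engineering', 'Engineering', 'Math'),
    gpa    = c(3.86, 3.10, 3.90, 3.95),
    stringsAsFactors = FALSE
)

honors  <- subset(toy, gender == 'Male' & major == 'Engineering' & gpa >= 3.7)
bracket <- toy[toy$gender == 'Male' & toy$major == 'Engineering' & toy$gpa >= 3.7, ]

print(nrow(honors))    # 1: number of matching rows
print(nrow(bracket))   # 1: the bracket form selects the same row
print(length(honors))  # 3: length() of a data frame counts columns
```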


In [76]:
mean(female_econ_df$gpa)
sd(female_econ_df$gpa)

In [77]:
female_econ_df$mean_gpa <- mean(female_econ_df$gpa)
head(female_econ_df)

Unnamed: 0,student_id,first,last,gender,class,major,gpa,mean_gpa
1,5a397209-3782-4764-a285-10fae807ee71,Janis,Brown,Female,Junior,Economics,3.12,3.495707
13,1846c044-9a87-49e4-ad4f-d1bfadb4e41b,Lisa,Walden,Female,Senior,Economics,3.4,3.495707
16,35de9214-505e-4d55-80e6-0302098a44b6,Joan,Decoteau,Female,Senior,Economics,3.81,3.495707
44,667a8999-3765-4a83-bc64-612f64a2011e,Tosha,Flanagan,Female,Junior,Economics,3.92,3.495707
50,ee1ce005-5671-431b-b832-61452d15ad76,Yvonne,Delo,Female,Sophomore,Economics,2.87,3.495707
55,ed1f11d4-cf19-4792-83dd-0f9fea5a4692,Karla,Cota,Female,Junior,Economics,3.12,3.495707


In [80]:
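#gpa_diff avoids shadowing the base function diff()
gpa_diff <- female_econ_df$gpa - mean(female_econ_df$gpa)
stds <- gpa_diff / sd(female_econ_df$gpa)
#TRUE when a GPA is more than one standard deviation below the mean
female_econ_df$less_than_1_std <- stds < -1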
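The standardization in this cell is a z-score, which base R's `scale()` also computes. The sketch below uses only the six GPA values visible in the `head()` output above, so its mean and standard deviation differ from those of the full subset.

```r
gpa <- c(3.12, 3.40, 3.81, 3.92, 2.87, 3.12)

# Manual z-scores: distance from the mean in units of standard deviation.
z_manual <- (gpa - mean(gpa)) / sd(gpa)

# scale() centers and divides by the standard deviation; it returns a matrix.
z_scale <- as.vector(scale(gpa))

print(all.equal(z_manual, z_scale))  # TRUE
print(which(z_manual < -1))          # positions more than 1 SD below the mean
```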

In [85]:
head(female_econ_df)

Unnamed: 0,student_id,first,last,gender,class,major,gpa
1,5a397209-3782-4764-a285-10fae807ee71,Janis,Brown,Female,Junior,Economics,3.12
13,1846c044-9a87-49e4-ad4f-d1bfadb4e41b,Lisa,Walden,Female,Senior,Economics,3.4
16,35de9214-505e-4d55-80e6-0302098a44b6,Joan,Decoteau,Female,Senior,Economics,3.81
44,667a8999-3765-4a83-bc64-612f64a2011e,Tosha,Flanagan,Female,Junior,Economics,3.92
50,ee1ce005-5671-431b-b832-61452d15ad76,Yvonne,Delo,Female,Sophomore,Economics,2.87
55,ed1f11d4-cf19-4792-83dd-0f9fea5a4692,Karla,Cota,Female,Junior,Economics,3.12


# In-Class Exercise

In [86]:
data(ChickWeight)

In [87]:
head(ChickWeight)

weight,Time,Chick,Diet
42,0,1,1
51,2,1,1
59,4,1,1
64,6,1,1
76,8,1,1
93,10,1,1


In [88]:
subset(ChickWeight, Chick==1)

weight,Time,Chick,Diet
42,0,1,1
51,2,1,1
59,4,1,1
64,6,1,1
76,8,1,1
93,10,1,1
106,12,1,1
125,14,1,1
149,16,1,1
171,18,1,1


In [89]:
aggregate(x=ChickWeight$weight, by=list(ChickWeight$Time), FUN=mean)

Group.1,x
0,41.06
2,49.22
4,59.95918
6,74.30612
8,91.2449
10,107.83673
12,129.2449
14,143.8125
16,168.08511
18,190.19149
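`aggregate()` also accepts a formula, which keeps the original column names in the result instead of `Group.1` and `x`. A brief sketch on the same built-in dataset, with `tapply()` shown as another base R option:

```r
data(ChickWeight)

# Mean weight at each time point; columns come back named Time and weight.
by_time <- aggregate(weight ~ Time, data = ChickWeight, FUN = mean)
print(head(by_time))

# The same idea grouped by diet.
by_diet <- aggregate(weight ~ Diet, data = ChickWeight, FUN = mean)
print(by_diet)

# tapply() returns a named vector rather than a data frame.
print(tapply(ChickWeight$weight, ChickWeight$Diet, mean))
```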


In [90]:
dim(ChickWeight)
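`dim()` returns the number of rows followed by the number of columns. A few related base R helpers for a first look at a data frame are sketched here:

```r
data(ChickWeight)

print(dim(ChickWeight))    # rows, then columns
print(nrow(ChickWeight))   # rows only
print(ncol(ChickWeight))   # columns only
print(names(ChickWeight))  # column names
str(ChickWeight)           # compact summary of each column's type
```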