# Lecture 2: An overview of R: part I
This lecture gives an overview of R and introduces some of its basic characteristics, including:
1. Basic computations in R
2. How to create an object
3. Data types
4. How to generate data
5. Operators

## 2.1 Basic computations in R

In [1]:
2+3

In [2]:
2*3

In [3]:
log(4)    # Natural log

In [4]:
exp(2)

In [5]:
2e3

In [6]:
2^3; 8^(1/3)

In [7]:
sqrt(4)

## 2.2 Create an R object
We cannot keep copying and pasting numbers. Instead, we store values in named R objects, which is how R holds data.

Besides numbers, R can also store text, dates, and other kinds of data.

In [8]:
x <- 5; x

In [9]:
no.students <- count.students <- 21
# both no.students and count.students are set to 21
course = "EPIB 613"
"Yi" -> me           #  <- , = and -> are equivalent
cat(c(me, "teaches", no.students, "students in", course))
# Don't worry about cat(). If you do, run ?cat in R.

Yi teaches 21 students in EPIB 613

### Calculations with stored R objects
#### Example: calculate the number of students left in Yi's class.

In [10]:
no.students <- 21 # Is this line redundant? why?
hated.Yi.and.left <- 3
no.stay <- no.students - hated.Yi.and.left
print(no.stay)

[1] 18


In [11]:
cat(c(me, "teaches the", no.stay, "smartest students in", course))
# All students in EPIB 613 are smart!!!!!

Yi teaches the 18 smartest students in EPIB 613

###### Advantages:
- Most importantly, you can re-use the values by simply calling the object. Reproducibility!
- You can use variable names that make sense to you, which keeps the code clear and easy to edit later.

###### Note:
- Whether or not to store data in a named R object is totally up to you.

### R is case sensitive.

In [12]:
Course <- "EPIB 601"
course

### The old value will be replaced by the new one.

In [13]:
print(no.students)

[1] 21


In [14]:
no.students <- no.stay
print(no.students)

[1] 18


###### Rule for creating an object:
- Names can be alphabetic or alphanumeric, but not purely numeric: a name must start with a letter (or with a dot that is not followed by a digit), and may then contain letters, digits, dots and underscores.
- There is no practical restriction on the length of a name.
- Do NOT assign to the single-letter names c, q, t, C, D, F, I and T, as R already uses them. For instance, T and F are abbreviations for TRUE and FALSE in logical operations. In general, avoid names that are already used by the system.
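
The naming rules can be checked directly in R. The sketch below is illustrative only: the candidate names are made up, and it relies on make.names(), which returns its input unchanged only when the input is already a syntactically valid name. It also shows why overwriting T is risky.

```r
# make.names() turns arbitrary strings into syntactically valid names,
# so a name is valid exactly when make.names() leaves it unchanged.
candidates <- c("no.students", "x2", "2x", "my var", ".hidden")
valid <- candidates == make.names(candidates)
print(data.frame(candidates, valid))

# T is an ordinary variable that can be overwritten; TRUE is a reserved word.
T <- 0
print(isTRUE(T))   # FALSE: T no longer means TRUE
rm(T)              # remove our copy so that T refers to the built-in value again
print(isTRUE(T))   # TRUE
```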

## 2.3 Data types and structures
### 2.3.0 Data types

In [15]:
number <- c(1, 2, 3)
class(number)

In [6]:
# As in most programming languages, there are integers and floating-point numbers in R
class(5L)

In [8]:
# Double precision floating-point numbers in R
# is.double() checks whether an object is a double precision floating-point number
is.double(5); is.double(5L)

In [19]:
# How precise is double precision?
options(digits = 22) # print up to 22 significant digits
print(1/3)
options(digits = 7) # reset to default

[1] 0.3333333333333333148296
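
The trailing digits above are not noise in the display: doubles are stored in binary, so most decimal fractions are only approximated. A minimal sketch of what this means in practice when comparing numbers, and of how integers and doubles differ:

```r
# Integers carry an L suffix; plain numbers are doubles.
print(typeof(5L))       # "integer"
print(typeof(5))        # "double"
print(typeof(5L / 2L))  # "double": the / operator returns a double

# Exact comparison of doubles can fail because of rounding error.
print(0.1 + 0.2 == 0.3)                   # FALSE
print(isTRUE(all.equal(0.1 + 0.2, 0.3)))  # TRUE: equal up to a small tolerance
print(.Machine$double.eps)                # gap between 1 and the next larger double
```

When testing doubles for equality, all.equal() (wrapped in isTRUE()) is usually the safer choice.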


In [16]:
letters <- letters[1:3]; print(letters)
class(letters)

[1] "a" "b" "c"


In [17]:
logical <- c(TRUE, FALSE)
class(logical)

In [18]:
factor <- as.factor(letters[1:3]); print(factor)
class(factor)

[1] a b c
Levels: a b c


### 2.3.1* Scalar
A scalar is not a stand-alone data type in R: it is simply a vector of length 1.

In [19]:
x <- 5; x

### 2.3.2 Vector
In R, we work with vectors.

In [20]:
# As a big fan of winter sports, I hope that...
snow.days.per.week.mtl <- c(7, 7, 7, 7)
print(snow.days.per.week.mtl)

[1] 7 7 7 7


In [21]:
# We can add names to the vector for each entry
names(snow.days.per.week.mtl) <- rep("Jan 2018", 4)
print(snow.days.per.week.mtl)

Jan 2018 Jan 2018 Jan 2018 Jan 2018 
       7        7        7        7 


In [22]:
# But the names will not affect calculations.
sum(snow.days.per.week.mtl)
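
Names become more useful when they are distinct, because entries can then be looked up by name. The sketch below uses hypothetical labels (week1 to week4) rather than the repeated "Jan 2018" above.

```r
# Hypothetical labels: one distinct name per week.
snow.days <- c(week1 = 7, week2 = 7, week3 = 7, week4 = 7)

print(snow.days["week2"])              # look up one entry by name
print(snow.days[c("week1", "week4")])  # or several at once
print(sum(snow.days))                  # 28: names are ignored by sum()
print(unname(snow.days))               # the same values with the names dropped
```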

### 2.3.3 Matrix

In [23]:
mymatrix1 <- matrix(c(3:14), nrow = 4, byrow = TRUE)
print(mymatrix1)

     [,1] [,2] [,3]
[1,]    3    4    5
[2,]    6    7    8
[3,]    9   10   11
[4,]   12   13   14


In [24]:
mymatrix2 <- matrix(c(3:14), nrow = 4, byrow = FALSE)
print(mymatrix2)

     [,1] [,2] [,3]
[1,]    3    7   11
[2,]    4    8   12
[3,]    5    9   13
[4,]    6   10   14


In [25]:
rownames <- c("row1", "row2", "row3", "row4")
colnames <- c("col1", "col2", "col3")
rownames(mymatrix1) <- rownames
colnames(mymatrix1) <- colnames
print(mymatrix1)

     col1 col2 col3
row1    3    4    5
row2    6    7    8
row3    9   10   11
row4   12   13   14


### 2.3.4 Array

Mathematically, scalars, vectors and matrices are all arrays of different dimensions
- Scalar: 1 x 1 array
- Vector of length k: 1 x k array
- Matrix of dimension m x n: m x n array

R gives arrays of fewer than 3 dimensions their own treatment (as vectors and matrices), although conceptually they are all arrays. Python, by contrast, treats them all in the same way.
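
In R, the practical difference comes down to the dim attribute. A minimal sketch (the object names v and m are only for illustration): a plain vector has no dim attribute, while a matrix is a vector with a dim attribute of length 2.

```r
v <- 1:6
print(dim(v))        # NULL: a plain vector has no dim attribute
print(is.array(v))   # FALSE

m <- matrix(v, nrow = 2)
print(dim(m))        # 2 3
print(is.array(m))   # TRUE: a matrix is a 2-dimensional array

# Adding a dim attribute is all it takes to turn the vector into a matrix.
dim(v) <- c(2, 3)
print(identical(v, m))
```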

Now let's look at a 3-dimensional array.

In [26]:
myarray <- array(c(mymatrix1, mymatrix2), dim = c(4,3,2))
print(myarray)

, , 1

     [,1] [,2] [,3]
[1,]    3    4    5
[2,]    6    7    8
[3,]    9   10   11
[4,]   12   13   14

, , 2

     [,1] [,2] [,3]
[1,]    3    7   11
[2,]    4    8   12
[3,]    5    9   13
[4,]    6   10   14



###### A demonstration of high dimensional arrays
Fake data:
- Disease: 1=Yes, 0=No
- Drug: 1=Exposed, 0=Unexposed
- BMI category: 1,2,3
- Age category: 1,2,3,4

In [27]:
# Don't worry about the data generating process.
set.seed(613) # Make random numbers generated from sample() reproducible.
# Randomly assign ~80% of patients to have disease (prob gives P(0) = 0.2, P(1) = 0.8).
disease <- sample(c(0,1), size = 100, replace = TRUE, prob = c(0.2, 0.8))
# Randomly assign ~60% of patients to take drug (P(0) = 0.4, P(1) = 0.6).
drug <- sample(c(0,1), size = 100, replace = TRUE, prob = c(0.4, 0.6))
bmi.cat <- sample(1:3, size = 100, replace = TRUE) # Randomly assign BMI categories
age.cat <- sample(1:4, size = 100, replace = TRUE) # Randomly assign age categories
data <- data.frame(drug, disease, bmi.cat, age.cat) # Make our data frame
head(data)

# The table below shows the first 6 rows of the fake dataset.
# This is a typical dataset you will see in Epidemiology.
# Each row is a patient, with their own information.
# Goal is to assess the association between disease and drug (drug safety).

  drug disease bmi.cat age.cat
1    1       0       1       1
2    1       0       3       3
3    1       1       1       2
4    0       0       3       4
5    0       1       2       4
6    1       0       2       3


In [28]:
# By tabulating the data, we can assess the association (EPIB 601 material).
# If we only tabulate drug and disease, we get a 2x2 table, which is a matrix or a 2-dimensional array.
# 1st dimension: drug, 2nd dimension: disease
table(data[c("drug","disease")])

    disease
drug  0  1
   0 11 31
   1 19 39
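
As one way to quantify the association, the odds ratio can be computed from the four counts printed above. This is only a sketch: the counts are typed in by hand from that output, and the object names (tab, odds.exposed, and so on) are made up for illustration.

```r
# Counts copied from the 2x2 table printed above (rows: drug, columns: disease).
tab <- matrix(c(11, 19, 31, 39), nrow = 2,
              dimnames = list(drug = c("0", "1"), disease = c("0", "1")))
print(tab)

# Odds of disease among exposed and among unexposed patients.
odds.exposed   <- tab["1", "1"] / tab["1", "0"]   # 39 / 19
odds.unexposed <- tab["0", "1"] / tab["0", "0"]   # 31 / 11
odds.ratio <- odds.exposed / odds.unexposed
print(odds.ratio)   # about 0.73
```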

In [29]:
# This may not be enough: we also want to see how people with different BMI may differ (confounder, also 601 material).
# We now need a 2x2x3 table, which is a 3-dimensional array.
# 1st dimension: drug, 2nd dimension: disease, 3rd dimension: bmi.cat
table(data[c("drug", "disease", "bmi.cat")])

, , bmi.cat = 1

    disease
drug  0  1
   0  3 15
   1  7 14

, , bmi.cat = 2

    disease
drug  0  1
   0  5  7
   1  6  5

, , bmi.cat = 3

    disease
drug  0  1
   0  3  9
   1  6 20


In [30]:
# Further include age to see how age category comes into the association
# We now need a 2x2x3x4 table, which is a 4-dimensional array.
# 1st dimension: drug, 2nd dimension: disease, 3rd dimension: bmi.cat, 4th dimension: age.cat
table(data)

, , bmi.cat = 1, age.cat = 1

    disease
drug 0 1
   0 0 5
   1 5 1

, , bmi.cat = 2, age.cat = 1

    disease
drug 0 1
   0 1 3
   1 0 0

, , bmi.cat = 3, age.cat = 1

    disease
drug 0 1
   0 1 3
   1 1 2

, , bmi.cat = 1, age.cat = 2

    disease
drug 0 1
   0 2 5
   1 0 8

, , bmi.cat = 2, age.cat = 2

    disease
drug 0 1
   0 1 0
   1 2 0

, , bmi.cat = 3, age.cat = 2

    disease
drug 0 1
   0 0 1
   1 0 6

, , bmi.cat = 1, age.cat = 3

    disease
drug 0 1
   0 1 3
   1 0 1

, , bmi.cat = 2, age.cat = 3

    disease
drug 0 1
   0 1 0
   1 2 5

, , bmi.cat = 3, age.cat = 3

    disease
drug 0 1
   0 1 3
   1 2 6

, , bmi.cat = 1, age.cat = 4

    disease
drug 0 1
   0 0 2
   1 2 4

, , bmi.cat = 2, age.cat = 4

    disease
drug 0 1
   0 2 4
   1 2 0

, , bmi.cat = 3, age.cat = 4

    disease
drug 0 1
   0 1 2
   1 3 6
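
A high-dimensional table can be collapsed back to a lower-dimensional one by summing over the dimensions that are not of interest. The sketch below regenerates fake data of the same shape (the counts will not match the tables above, since the sampling probabilities and R version may differ) and uses margin.table() to recover the 2x2 table from the 4-dimensional one.

```r
set.seed(613)
n <- 100
# factor() with explicit levels keeps every category in the table, even if unobserved.
drug    <- factor(sample(0:1, n, replace = TRUE), levels = 0:1)
disease <- factor(sample(0:1, n, replace = TRUE), levels = 0:1)
bmi.cat <- factor(sample(1:3, n, replace = TRUE), levels = 1:3)
age.cat <- factor(sample(1:4, n, replace = TRUE), levels = 1:4)

tab4 <- table(drug, disease, bmi.cat, age.cat)
print(dim(tab4))   # 2 2 3 4

# Summing over BMI and age recovers the 2x2 drug-by-disease table.
tab2 <- margin.table(tab4, margin = c(1, 2))
print(tab2)
print(all(tab2 == table(drug, disease)))  # TRUE
```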


### 2.3.5 Data frames
The data frame is the most commonly used data structure in R. It generalizes a matrix in that different columns may have different modes, but all elements within a column must have the same mode (all numeric, all factor, or all character).

In [31]:
names <- c("Lucy", "John", "Mark", "Candy")
score = c(67, 56, 87, 91)
df <- data.frame(names, score); print(df)
curved.score <- sqrt(score)*10
# cbind() and data.frame() work the same when combining a data frame with a new column.
new <- data.frame(df, curved.score)
new1 <- cbind(df, curved.score)
str(new); str(new1)
# Order of columns only matters sometimes (display, tabulation, etc.).
df.order <- data.frame(score, names); print(df.order)
df$names; df.order$names

  names score
1  Lucy    67
2  John    56
3  Mark    87
4 Candy    91
'data.frame':	4 obs. of  3 variables:
 $ names       : Factor w/ 4 levels "Candy","John",..: 3 2 4 1
 $ score       : num  67 56 87 91
 $ curved.score: num  81.9 74.8 93.3 95.4
'data.frame':	4 obs. of  3 variables:
 $ names       : Factor w/ 4 levels "Candy","John",..: 3 2 4 1
 $ score       : num  67 56 87 91
 $ curved.score: num  81.9 74.8 93.3 95.4
  score names
1    67  Lucy
2    56  John
3    87  Mark
4    91 Candy


In [32]:
str(df) # checking the structure of an object

'data.frame':	4 obs. of  2 variables:
 $ names: Factor w/ 4 levels "Candy","John",..: 3 2 4 1
 $ score: num  67 56 87 91


In [33]:
mt <- (cbind(names, score)); print(mt) # cbind() means column bind

     names   score
[1,] "Lucy"  "67" 
[2,] "John"  "56" 
[3,] "Mark"  "87" 
[4,] "Candy" "91" 


In [34]:
is.matrix(mt) # verifying whether 'mt' is a matrix

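In the data frame, the first column 'names' holds text and the second column 'score' is numeric. In the output above 'names' is stored as a factor, which is the default of data.frame() in R versions before 4.0.0; from 4.0.0 onward it stays a character variable.
In the matrix, all entries must have the same mode, so 'score' is coerced into a character variable (shown with "").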
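
Whether text columns become factors depends on the stringsAsFactors argument of data.frame(), whose default changed in R 4.0.0. A minimal sketch that sets it explicitly, so the result does not depend on the R version (df.chr and df.fac are illustrative names):

```r
names <- c("Lucy", "John", "Mark", "Candy")
score <- c(67, 56, 87, 91)

# Being explicit about stringsAsFactors gives the same result in every R version.
df.chr <- data.frame(names, score, stringsAsFactors = FALSE)
df.fac <- data.frame(names, score, stringsAsFactors = TRUE)
print(sapply(df.chr, class))   # names: "character", score: "numeric"
print(sapply(df.fac, class))   # names: "factor",    score: "numeric"

# A matrix has a single mode, so the numbers are coerced to character.
mt <- cbind(names, score)
print(class(mt[, "score"]))    # "character"
```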

###### Conversion between a matrix and a data frame
- as.matrix(dataframe)    # data.frame -> matrix
- as.data.frame(matrix)   # matrix -> data.frame

###### Checking whether an object is a data frame or a matrix
- is.matrix()
- is.data.frame()

In [35]:
my.df <- as.data.frame(mymatrix1)
print(my.df)

     col1 col2 col3
row1    3    4    5
row2    6    7    8
row3    9   10   11
row4   12   13   14


In [36]:
str(my.df); is.data.frame(my.df)

'data.frame':	4 obs. of  3 variables:
 $ col1: int  3 6 9 12
 $ col2: int  4 7 10 13
 $ col3: int  5 8 11 14


In [37]:
# Converting a data frame into a matrix
my.mt <- as.matrix(df)
print(my.mt)

     names   score
[1,] "Lucy"  "67" 
[2,] "John"  "56" 
[3,] "Mark"  "87" 
[4,] "Candy" "91" 


In [38]:
str(my.mt)

 chr [1:4, 1:2] "Lucy" "John" "Mark" "Candy" "67" "56" "87" "91"
 - attr(*, "dimnames")=List of 2
  ..$ : NULL
  ..$ : chr [1:2] "names" "score"


In [39]:
is.matrix(my.mt); is.data.frame(my.mt)

### 2.3.6 List
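In the data structures above, the pieces have to match: all entries share one mode (vector, matrix, array) or all columns have the same length (data frame). A list has no such restriction; its elements can be of any type and any length.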

In [40]:
mylist <- list("Red", factor(c("a","b")), c(21,32,11), TRUE)
print(mylist)

[[1]]
[1] "Red"

[[2]]
[1] a b
Levels: a b

[[3]]
[1] 21 32 11

[[4]]
[1] TRUE



In [41]:
str(mylist)

List of 4
 $ : chr "Red"
 $ : Factor w/ 2 levels "a","b": 1 2
 $ : num [1:3] 21 32 11
 $ : logi TRUE
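
List elements are extracted with double brackets or, if they are named, with $. The sketch below rebuilds the same list with hypothetical element names (colour, grp, counts, flag) to show both.

```r
# Same contents as mylist above, with hypothetical element names added.
mylist <- list(colour = "Red", grp = factor(c("a", "b")),
               counts = c(21, 32, 11), flag = TRUE)

print(mylist[[3]])             # double brackets extract the element itself
print(mylist$counts[2])        # 32: $ extracts by name, then index as usual
print(class(mylist[3]))        # "list": single brackets return a sub-list
print(sapply(mylist, length))  # elements may have different lengths
```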


### 2.3.7* Factors
Factors are often listed as a data type, alongside vectors, matrices, etc.

In my opinion, a factor is better described as an object class.

In [42]:
ch.letter <- letters[1:3]
print(ch.letter)

[1] "a" "b" "c"


In [43]:
class(ch.letter)

In [44]:
fac.letter <- as.factor(letters[1:3])
print(fac.letter)
# Note the additional 'Levels: a b c' in the output

[1] a b c
Levels: a b c


In [45]:
class(fac.letter)
# Should factor be considered as an object class or a data type?
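
One way to look at the question is to compare class() with typeof(). The sketch below (with a made-up factor f) suggests that a factor is an integer vector of codes carrying a "factor" class and a levels attribute, which also explains a common pitfall when converting factors to numbers.

```r
f <- factor(c("10", "20", "10", "30"))

print(class(f))       # "factor"
print(typeof(f))      # "integer": stored as integer codes
print(levels(f))      # "10" "20" "30"
print(as.integer(f))  # 1 2 1 3: the codes, not the original values

# To recover the numbers, convert to character first.
print(as.numeric(as.character(f)))  # 10 20 10 30
```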

## 2.4 How to generate data
Data can be generated with combinations of the following functions:
- c( )
- seq( )
- rep( )
- sequence( )

In [46]:
c(-1, 5.44, 100, 34123)


In [47]:
-1:10 # By increments of 1.


In [48]:
seq(from = 0.33, to = 9.33, by = 3)


In [49]:
seq(from = 0, to = 1, length.out = 5) # length.out is the full argument name


In [50]:
rep(1.2, times = 5)


In [51]:
rep(c("six", "one", "three"), times = 2)


In [52]:
c(6, 1, 3, rep(seq(from = 3, to = 5, by = 0.5), times = 2))


In [53]:
sequence(5)

In [54]:
sequence(c(6, 1, 3))

###### Open question: can you think of any real-life use of this function?
Maybe useful for making indices.
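
As one possible answer, sequence() can number repeated measurements within each subject. The sketch below is an assumption about how it might be used, with hypothetical patient IDs; it also checks that sequence() simply concatenates 1:n for each n.

```r
lens <- c(6, 1, 3)
print(sequence(lens))   # 1 2 3 4 5 6 1 1 2 3
print(all(sequence(lens) == unlist(lapply(lens, seq_len))))  # TRUE

# Hypothetical patient IDs, sorted so that each patient's visits are adjacent.
id <- c("A", "A", "A", "B", "C", "C")
visit <- sequence(rle(id)$lengths)   # run lengths are 3, 1, 2
print(data.frame(id, visit))         # visits 1, 2, 3 for A; 1 for B; 1, 2 for C
```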

## 2.5 Operators
### 2.5.1 Arithmetic operators
###### Vector operations

In [55]:
a <- c(1, 8, 8)
b <- c(2, 8, 4)

In [56]:
a+1 # here 1 is considered as a vector (1, 1, 1)

In [57]:
a+b

In [58]:
a*b

In [59]:
a^2

These operations are element-wise, i.e. they are carried out between corresponding entries.
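
When the two vectors differ in length, R recycles the shorter one, which is what happens in a+1. A minimal sketch of the recycling rule, including the warning R gives when the lengths do not divide evenly:

```r
a <- c(1, 8, 8)
print(a + 1)             # 2 9 9: the 1 is recycled to (1, 1, 1)
print(1:6 + c(10, 20))   # 11 22 13 24 15 26: (10, 20) is recycled three times

# When the longer length is not a multiple of the shorter one,
# R still recycles but issues a warning; here we capture its message.
res <- tryCatch(1:3 + 1:2, warning = function(w) conditionMessage(w))
print(res)
```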

###### Matrix operations

In [60]:
c <- matrix(c(1,2,3,4), nrow = 2, byrow = TRUE)   # spell out TRUE/FALSE rather than T/F
d <- matrix(c(5,6,7,8), nrow = 2, byrow = FALSE)
print(c); print(d)

     [,1] [,2]
[1,]    1    2
[2,]    3    4
     [,1] [,2]
[1,]    5    7
[2,]    6    8


In [61]:
c+1

     [,1] [,2]
[1,]    2    3
[2,]    4    5


In [62]:
c+d

     [,1] [,2]
[1,]    6    9
[2,]    9   12


In [63]:
c*d

     [,1] [,2]
[1,]    5   14
[2,]   18   32


In [64]:
c^2

     [,1] [,2]
[1,]    1    4
[2,]    9   16


Again, operations between corresponding entries.

###### (Optional) If you know linear algebra - dot product, outer product, matrix multiplication, transpose, diagonal, determinant, etc.

In [65]:
a %*% b

     [,1]
[1,]   98


In [66]:
a %o% b

     [,1] [,2] [,3]
[1,]    2    8    4
[2,]   16   64   32
[3,]   16   64   32


In [67]:
c %*% d

     [,1] [,2]
[1,]   17   23
[2,]   39   53


In [68]:
c; t(c)

     [,1] [,2]
[1,]    1    2
[2,]    3    4
     [,1] [,2]
[1,]    1    3
[2,]    2    4


In [69]:
diag(c)

In [70]:
det(c)
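
The difference between the element-wise product and the matrix product is easy to mix up, so the sketch below puts them side by side and adds the matrix inverse via solve(). The names A, B and A.inv are chosen for this illustration only, to avoid reusing the name of the function c().

```r
# Same values as c and d above, under different names.
A <- matrix(c(1, 2, 3, 4), nrow = 2, byrow = TRUE)
B <- matrix(c(5, 6, 7, 8), nrow = 2, byrow = FALSE)

print(A * B)     # element-wise product
print(A %*% B)   # matrix product: rows of A times columns of B
print(det(A))    # -2: non-zero, so A is invertible

A.inv <- solve(A)                        # matrix inverse
print(all.equal(A.inv %*% A, diag(2)))   # TRUE up to floating-point error
```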

### 2.5.2 Logical operators

In [71]:
# Recall vector a and b from above.
print(a); print(b)

[1] 1 8 8
[1] 2 8 4


In [72]:
a == b # Equal or not?

In [73]:
a != b # Not equal?

In [74]:
a > b

In [75]:
a <= b

In [76]:
# And
a; b
a>5 & b>5

In [77]:
# Or
a>=5 | b>=5

In [78]:
"ABC" == "ABC"

In [79]:
"ABC" == "abc"

In [80]:
TRUE + TRUE + FALSE # True = 1, False = 0.
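
Because TRUE counts as 1 and FALSE as 0, logical vectors can be summed or averaged to count and summarise conditions. A short sketch using the vectors a and b from above:

```r
a <- c(1, 8, 8)
b <- c(2, 8, 4)

print(sum(a > 5))        # 2: number of entries of a greater than 5
print(mean(a > 5))       # proportion of such entries
print(which(a == b))     # 2: position where a and b agree
print(any(a > b))        # TRUE: at least one entry of a exceeds b
print(all(a >= b))       # FALSE: the first entry of a is smaller
print(a[a > 5 & b > 5])  # 8: logical vectors can also select entries
```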