# Introduction - Data Structures

## Overview
### Plan

At the end of this session, you will have learned the most important data structures in base R. More specifically, you will be able to:

* Create vectors, lists, matrices, factors and data frames.
* Understand the differences between these data structures and how they are interrelated.

#### R Primitives ####

1. Use the `typeof` function to explore the type of various objects

In [28]:
random_norms <- rnorm(10)
random_norms
typeof(random_norms)

some_letters <- letters[1:10]
some_letters
typeof(some_letters)

int_vector <- c(1L, 2L, 3L)
int_vector
typeof(int_vector)

booleans <- int_vector == 1
booleans
typeof(booleans)

2. Can you mix types in a vector?

In [30]:
combine_char_num <- c(random_norms, some_letters)
combine_char_num
typeof(combine_char_num)
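
Mixing types in an atomic vector does not raise an error: `c()` silently coerces every element to the most flexible type present (logical, then integer, then double, then character). A minimal sketch of that rule, using illustrative variable names that are not part of the session above:

```r
# c() coerces its arguments to a single common type.
mixed_lgl_int <- c(TRUE, 2L)    # logical + integer
mixed_int_dbl <- c(1L, 2.5)     # integer + double
mixed_dbl_chr <- c(3.14, "a")   # double + character

typeof(mixed_lgl_int)  # "integer"
typeof(mixed_int_dbl)  # "double"
typeof(mixed_dbl_chr)  # "character"
```

This is why `combine_char_num` above ends up as a character vector: the numbers are stored as strings.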

3. To combine types, use list:

In [33]:
list_char_num <- list(random_norms, some_letters)
typeof(list_char_num)
str(list_char_num)

List of 2
 $ : num [1:10] -1.83 0.809 1.317 0.677 0.729 ...
 $ : chr [1:10] "a" "b" "c" "d" ...
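
Because each list element keeps its own type, it helps to know how to get elements back out. The sketch below assumes a small named list (`named_list`, `numbers`, and `chars` are hypothetical names; the list above is unnamed) and contrasts single and double brackets:

```r
# Hypothetical named list; each element keeps its own type.
named_list <- list(numbers = c(1.5, 2.5, 3.5), chars = c("a", "b", "c"))

sub_list <- named_list["numbers"]    # single brackets return a list
element  <- named_list[["numbers"]]  # double brackets return the element itself

typeof(sub_list)   # "list"
typeof(element)    # "double"
named_list$chars   # same as named_list[["chars"]]
```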


4. A matrix is a two-dimensional structure: all columns have the same length and all elements share a single type. Unlike a list, a matrix cannot hold different types.

In [34]:
matrix_num <- matrix(rnorm(10), nrow = 5, ncol = 2)
matrix_num

0,1
0.6468122,1.9302397
-1.2689713,0.6527831
0.180622,0.1208698
0.2340501,-2.1468009
0.3227169,-0.415305
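
To see the single-type rule in action, the following sketch (with illustrative names `num_matrix` and `mixed_matrix` that do not appear above) builds one integer matrix and one matrix from a numeric and a character column:

```r
# A matrix is filled column by column and has one type for all cells.
num_matrix <- matrix(1:6, nrow = 3, ncol = 2)
dim(num_matrix)       # 3 2
typeof(num_matrix)    # "integer"

# Binding a numeric and a character column coerces the whole matrix.
mixed_matrix <- cbind(1:3, c("a", "b", "c"))
typeof(mixed_matrix)  # "character"
```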


#### FACTORS ####
##### What's a factor and why would you use it? #####

* A statistical data type used to store categorical variables.
* Factors can be ordered or unordered and are an important class for statistical analysis.
* Categories are 'factor levels' in R.


1. Create a gender vector of 5 individuals (Male, Female) and generate a factor from it.

In [8]:
gender_vector <- c("Male","Female","Female","Male","Male")
factor_gender_vector <- factor(gender_vector)
print(factor_gender_vector)

[1] Male   Female Female Male   Male  
Levels: Female Male
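
A few base R helpers make the levels of a factor easier to inspect. The sketch below rebuilds the same gender factor under a separate, illustrative name (`gender_example`) so it can run on its own:

```r
gender_example <- factor(c("Male", "Female", "Female", "Male", "Male"))

levels(gender_example)   # "Female" "Male" (sorted alphabetically by default)
nlevels(gender_example)  # 2
table(gender_example)    # counts per level: Female 2, Male 3
```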


2. Create a factor for ordered categories (Example: Temperature - low, med, high)

In [13]:
temperature_vector <- c("High", "Low", "High","Low", "Medium")
factor_temperature_vector <- factor(temperature_vector, ordered = TRUE, levels = c("Low", "Medium", "High"))
summary(factor_temperature_vector)

##### Change the levels' names #####

* Data sets often contain factors with predefined levels. You may want to rename these levels for clarity, consistency, or other reasons.

3. Change the factor levels from Male/Female to M/F:

In [12]:
levels(factor_gender_vector) <- c("F", "M")
summary(factor_gender_vector)
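
Assigning a whole new vector to `levels()` is positional, so it only gives the intended result if you remember that the existing levels are ordered `Female`, `Male`. One possible alternative, sketched here on a separate illustrative factor (`gender_example`), is to rename each level by its old name:

```r
gender_example <- factor(c("Male", "Female", "Female", "Male", "Male"))

# Select each level by its current name, so the level order does not matter.
levels(gender_example)[levels(gender_example) == "Male"] <- "M"
levels(gender_example)[levels(gender_example) == "Female"] <- "F"

levels(gender_example)        # "F" "M"
as.character(gender_example)  # "M" "F" "F" "M" "M"
```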

4. Identify the distribution of class attendees' backgrounds using factors.

##### Comparing factors #####

In [25]:
factor_temperature_vector[1]==factor_temperature_vector[3]
factor_gender_vector[1]<factor_gender_vector[2]

In Ops.factor(factor_gender_vector[1], factor_gender_vector[2]): < not meaningful for factors

[1] NA
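
The `<` comparison only returns `NA` for unordered factors. For an ordered factor, the comparison follows the order of the levels. A minimal sketch, rebuilding the temperature factor under an illustrative name (`temperature_example`):

```r
temperature_example <- factor(
  c("High", "Low", "High", "Low", "Medium"),
  ordered = TRUE,
  levels = c("Low", "Medium", "High")
)

temperature_example[2] < temperature_example[5]  # TRUE: Low < Medium
temperature_example[1] > temperature_example[5]  # TRUE: High > Medium
max(temperature_example)                         # High
```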

#### DATA FRAMES ####

1. Create a data frame using the following vectors:

In [5]:
Died.At <- c(22,40,72,41)
Writer.At <- c(16, 18, 36, 36)
First.Name <- c("John", "Edgar", "Walt", "Jane")
Second.Name <- c("Doe", "Poe", "Whitman", "Austen")
Sex <- c("MALE", "MALE", "MALE", "FEMALE")
Date.Of.Death <- c("2015-05-10", "1849-10-07", "1892-03-26","1817-07-18")

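# stringsAsFactors = TRUE reproduces the factor columns shown below; since R 4.0.0 the default is FALSE.
writers_df <- data.frame(Died.At, Writer.At, First.Name, Second.Name, Sex, Date.Of.Death, stringsAsFactors = TRUE)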

2. Get to know more about your data frame:

In [6]:
str(writers_df)

'data.frame':	4 obs. of  6 variables:
 $ Died.At      : num  22 40 72 41
 $ Writer.At    : num  16 18 36 36
 $ First.Name   : Factor w/ 4 levels "Edgar","Jane",..: 3 1 4 2
 $ Second.Name  : Factor w/ 4 levels "Austen","Doe",..: 2 3 4 1
 $ Sex          : Factor w/ 2 levels "FEMALE","MALE": 2 2 2 1
 $ Date.Of.Death: Factor w/ 4 levels "1817-07-18","1849-10-07",..: 4 2 3 1


In [51]:
head(writers_df)

Died.At,Writer.At,First.Name,Second.Name,Sex,Date.Of.Death
22,16,John,Doe,MALE,2015-05-10
40,18,Edgar,Poe,MALE,1849-10-07
72,36,Walt,Whitman,MALE,1892-03-26
41,36,Jane,Austen,FEMALE,1817-07-18


3. Pay attention to the indices after the colon in the factor columns. What does 3 1 4 2 indicate?

Example: $ First.Name   : Factor w/ 4 levels "Edgar","Jane",..: 3 1 4 2
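
One way to explore this question is to look at a factor's levels and its underlying integer codes side by side. The sketch below rebuilds the first-name column as a standalone factor (`first_name` is an illustrative name):

```r
first_name <- factor(c("John", "Edgar", "Walt", "Jane"))

levels(first_name)      # "Edgar" "Jane" "John" "Walt"
as.integer(first_name)  # 3 1 4 2

# Each code is a position in levels(); indexing recovers the original strings.
levels(first_name)[as.integer(first_name)]  # "John" "Edgar" "Walt" "Jane"
```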

4. To get a data frame's column names and change them:

In [None]:
names(writers_df)
names(writers_df) <- c("Age.At.Death", "Age.As.Writer", "Name", "Surname", "Gender", "Death")
names(writers_df)

5. Change all the values of the first column to 3 and obtain the means of the first and second columns.

In [39]:
writers_df[, 1] <- c(3, 3, 3, 3)
apply(writers_df[, 1:2], 2, mean)

6. Can you now obtain the mean values of each row?
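
One possible answer, sketched on a small standalone data frame (`ages_example` is an illustrative name holding the two numeric columns from the step above): pass `1` instead of `2` as the margin to `apply`, or use the dedicated `rowMeans` function.

```r
ages_example <- data.frame(
  Age.At.Death  = c(3, 3, 3, 3),
  Age.As.Writer = c(16, 18, 36, 36)
)

# Margin 1 applies the function over rows; margin 2 applies it over columns.
row_means_apply <- apply(ages_example, 1, mean)
row_means_base  <- rowMeans(ages_example)

row_means_apply  # 9.5 10.5 19.5 19.5
row_means_base   # same values
```

Both approaches only make sense on numeric columns, which is why the character and factor columns are left out here.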

#### MODIFYING A DATA FRAME ####

7. Add a location vector to the data frame:

In [40]:
writers_df$Location <- c("Belgium", "United Kingdom", "United States", "United Kingdom")

In [41]:
writers_df

Age.At.Death,Age.As.Writer,Name,Surname,Gender,Death,Location
3,16,John,Doe,MALE,2015-05-10,Belgium
3,18,Edgar,Poe,MALE,1849-10-07,United Kingdom
3,36,Walt,Whitman,MALE,1892-03-26,United States
3,36,Jane,Austen,FEMALE,1817-07-18,United Kingdom


8. Add a new row to the data frame and notice how the factors affect the binding. What is the right way to do the binding?

In [46]:
new_row <- c(50, 22, "Roberto", "Bolano", "MALE", "2003-07-15", "India")
rbind(writers_df, new_row)

In `[<-.factor`(`*tmp*`, ri, value = "2003-07-15"): invalid factor level, NA generated

Age.At.Death,Age.As.Writer,Name,Surname,Gender,Death,Location
50,22,,,MALE,,India
3,16,John,Doe,MALE,2015-05-10,Belgium
3,18,Edgar,Poe,MALE,1849-10-07,United Kingdom
3,36,Walt,Whitman,MALE,1892-03-26,United States
3,36,Jane,Austen,FEMALE,1817-07-18,United Kingdom


In [50]:
new_row <- data.frame(Age.At.Death=50, Age.As.Writer=22, Name="Roberto", Surname="Bolano", Gender="MALE", Death="2003-07-15", Location="India")
writers_df_with_new_row <- rbind(writers_df, new_row)
writers_df_with_new_row

Age.At.Death,Age.As.Writer,Name,Surname,Gender,Death,Location
3,16,John,Doe,MALE,2015-05-10,Belgium
3,18,Edgar,Poe,MALE,1849-10-07,United Kingdom
3,36,Walt,Whitman,MALE,1892-03-26,United States
3,36,Jane,Austen,FEMALE,1817-07-18,United Kingdom
50,22,Roberto,Bolano,MALE,2003-07-15,India
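
The difference between the two attempts comes down to types. A row built with `c()` is an atomic vector, so every value is coerced to character, whereas a one-row data frame keeps a separate type per column. A minimal sketch with shortened, illustrative rows (`row_as_vector` and `row_as_df` are not part of the session above):

```r
# c() coerces everything to character, including the ages.
row_as_vector <- c(50, 22, "Roberto", "MALE")
typeof(row_as_vector)  # "character"

# A one-row data frame keeps one type per column.
row_as_df <- data.frame(Age = 50, Name = "Roberto", stringsAsFactors = TRUE)
column_classes <- sapply(row_as_df, class)
column_classes         # Age "numeric", Name "factor"
```

When such a data frame is bound with `rbind`, new factor levels such as "Roberto" are added to the matching columns instead of being replaced by `NA`.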
