# Lecture 5 Data management: Part II - Reshaping data
This section describes how to prepare data for further analysis. There are situations when we need the <b>data frame</b> in a format that is different from the format in which we received it.
- Subsetting data
- Merging data
- The reshape package*
- aggregate( )

<b style="color:red;"> Attention!!! </b>

Even though it explicitly mentioned data frames, [---] showed some operations using matrices.

Many operations that work for matrices may not work properly for data frames. Problems arise mainly because data frames allow columns of different data types, while matrices do not.

I included a section on matrix operations at the very end of the lecture notes.
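
The type issue above can be sketched with a small example. The data frame `toy` and its columns are made up for illustration and are not part of the lecture data.

```r
# Hypothetical toy data: one numeric and one character column
toy <- data.frame(score = c(67, 56), grade = c("C", "F"),
                  stringsAsFactors = FALSE)

# A data frame keeps one type per column
sapply(toy, class)

# A matrix has a single type, so the conversion coerces everything to character
m <- as.matrix(toy)
typeof(m)

# Arithmetic works on the data frame column ...
mean(toy$score)
# ... but mean(m[, "score"]) would return NA with a warning,
# because the scores are now stored as text
```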

## 5.1 Subset data
##### Common tasks
- Select/delete columns
- Select/delete rows with or without conditions
- Select columns and rows with or without conditions

##### Using
- $
- [ , ]
- subset( ) Very powerful!
- dplyr package

##### Pick your favorite - one is enough.

In [1]:
df <- data.frame(names = c("Lucy", "John", "Mark", "Candy"),
                score = c(67, 56, 87, 91))
# paste() and >= are vectorized, so no loop is needed
df$student.no <- paste("student", 1:4)
df$pass <- df$score >= 60
df
str(df)

names,score,student.no,pass
Lucy,67,student 1,True
John,56,student 2,False
Mark,87,student 3,True
Candy,91,student 4,True


'data.frame':	4 obs. of  4 variables:
 $ names     : Factor w/ 4 levels "Candy","John",..: 3 2 4 1
 $ score     : num  67 56 87 91
 $ student.no: chr  "student 1" "student 2" "student 3" "student 4"
 $ pass      : logi  TRUE FALSE TRUE TRUE


### 5.1.0 $
##### Can only pick one variable.

In [2]:
names(df)

In [3]:
# Recall the indexing system in R
df$names   # Select one variable

In [4]:
# Delete one variable
df.copy <- df
df.copy$names <- NULL
df.copy

score,student.no,pass
67,student 1,True
56,student 2,False
87,student 3,True
91,student 4,True


### 5.1.1 [ , ]
##### Not shown here, but remember that we can use indices, e.g. df[ , 1]

In [5]:
df[ , "score"]

In [6]:
str(df[ , "score"])   # 1D vector

 num [1:4] 67 56 87 91


In [7]:
df[ , "score", drop = FALSE]
str(df[ , "score", drop = FALSE])   # 4 x 1 data frame
# The argument "drop = FALSE" keeps the result as a data frame
# By default, the result is dropped to a vector when only one column is left

score
67
56
87
91


'data.frame':	4 obs. of  1 variable:
 $ score: num  67 56 87 91


In [8]:
df[1, ]
str(df[1, ])   # 1 x 4 data frame
# Can we drop a dimension here? Why?

names,score,student.no,pass
Lucy,67,student 1,True


'data.frame':	1 obs. of  4 variables:
 $ names     : Factor w/ 4 levels "Candy","John",..: 3
 $ score     : num 67
 $ student.no: chr "student 1"
 $ pass      : logi TRUE


In [49]:
df[1, , drop = TRUE]

##### Any advantage of an n x 1 data frame over a vector of length n? <==> Is the drop argument useful?
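
One possible answer, as a minimal sketch on made-up data (`toy` and `keep` are hypothetical names): keeping the data frame preserves the column name and the dimensions, which matters when later code expects a data frame.

```r
# Hypothetical toy data for illustration
toy <- data.frame(score = c(67, 56, 87), pass = c(TRUE, FALSE, TRUE))

v <- toy[ , "score"]                 # numeric vector: the column name is lost
d <- toy[ , "score", drop = FALSE]   # 3 x 1 data frame: the column name is kept

names(d)   # "score"
dim(v)     # NULL, a vector has no dimensions
dim(d)     # 3 1

# Code that expects a data frame keeps working when only one column is selected
keep <- "score"                      # imagine this was chosen at run time
nrow(toy[ , keep, drop = FALSE])     # 3
nrow(toy[ , keep])                   # NULL

# Dropping a single row gives a list, not a vector,
# because the columns can have different types
one.row <- toy[1, , drop = TRUE]
is.list(one.row)
is.data.frame(one.row)
```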

In [9]:
df[ , c("student.no", "score", "pass")]
# Delete variable "names" + reorder columns

student.no,score,pass
student 1,67,True
student 2,56,False
student 3,87,True
student 4,91,True


In [10]:
# Select rows that passed
df[df$pass == TRUE, ]

,names,score,student.no,pass
1,Lucy,67,student 1,True
3,Mark,87,student 3,True
4,Candy,91,student 4,True


In [11]:
# Show the name and score of those who passed except Lucy(s).
df[df$pass == TRUE & df$names != "Lucy", c("names", "score")]

,names,score
3,Mark,87
4,Candy,91


In [12]:
df[df$pass == TRUE & df$names != "Lucy", ]$names

In [13]:
# Delete variable
df[ , -c(1, 2)]   # Delete the 1st and 2nd

student.no,pass
student 1,True
student 2,False
student 3,True
student 4,True


In [14]:
# A minus sign only works with numeric indices, so negating names fails
# with "invalid argument to unary operator":
# df[ , -c("names", "score")]

# Use a logical index instead
drop <- c("names", "score")
df[ , !(names(df) %in% drop)]

student.no,pass
student 1,True
student 2,False
student 3,True
student 4,True


In [15]:
select <- c("student.no", "pass")
df[ , names(df) %in% select]

student.no,pass
student 1,True
student 2,False
student 3,True
student 4,True


In [16]:
# How does this work?
1 %in% c(1, 3, 5)
"b" %in% c("a", "c", "e")
1:10 %in% c(1, 3, 5)

##### a %in% b checks whether $a\in b$ for every single entry in a.
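
A short sketch of how this differs from `==`, using made-up vectors `a` and `b`:

```r
a <- c(1, 3, 7)
b <- c(1, 3, 5)

a %in% b            # TRUE TRUE FALSE: one answer per element of a
!(a %in% b)         # FALSE FALSE TRUE: elements of a that are not in b
a[a %in% b]         # 1 3: keep only the elements of a found in b

# == compares position by position, so it answers a different question
c(3, 1, 5) == b     # FALSE FALSE TRUE
c(3, 1, 5) %in% b   # TRUE TRUE TRUE
```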

### 5.1.2 subset( )

In [50]:
# "select" argument selects columns
subset(df, select = c(student.no, pass))

student.no,pass
student 1,True
student 2,False
student 3,True
student 4,True


In [18]:
# Can also delete unwanted columns
subset(df, select = -c(names, score))

student.no,pass
student 1,True
student 2,False
student 3,True
student 4,True


In [19]:
# "subset" argument selects rows
# Can apply conditions
subset(df, subset = (score > 80))

,names,score,student.no,pass
3,Mark,87,student 3,True
4,Candy,91,student 4,True


In [20]:
# Now use both select and subset arguments to apply conditions
# Select the names of those who passed
subset(df, select = names, subset = (pass == TRUE))

,names
1,Lucy
3,Mark
4,Candy


In [52]:
# Show the name and score of those who passed except Lucy(s).
subset(df, select = c(names, score),
       subset = (pass == TRUE & names != "Lucy"))
# Recall logical operators &, | and !

,names,score
3,Mark,87
4,Candy,91


##### Note that all subsets are still data frames.
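
A quick check of this note on made-up data (`toy` is a hypothetical data frame): even a single selected column or an empty selection comes back as a data frame.

```r
# Hypothetical toy data for illustration
toy <- data.frame(score = c(67, 56, 87), pass = c(TRUE, FALSE, TRUE))

one.col <- subset(toy, select = score)
class(one.col)     # "data.frame", not a numeric vector
class(toy$score)   # "numeric"

# Pull the vector out when that is what you need
one.col$score

# A condition that matches nothing still returns a data frame, with 0 rows
none <- subset(toy, subset = score > 100)
class(none)
nrow(none)
```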

### 5.1.3 dplyr package
- I do not use this package.
- [---] tells me that it is really powerful.
    - It only taught one function though.
- So I have to teach it.
- How?
- This will be a tutorial on how to study new packages and functions.

#### Resources
- Any course material that you have access to.
- Package documentation written by the package developers (required)
    - Overview of the package.
    - Help files of all the functions inside the package.
- Google!

#### Steps
- Read the documents
- Run the package's example code on its demo datasets.
- Test the functions on your own (simulated or real) data.
- Google!
- Go back and forth between the previous steps.
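
The steps above can be sketched with base R tools. This example uses `aggregate()` from the stats package rather than dplyr so that it runs without installing anything; the toy data is made up.

```r
# 1. Read the documents (interactive; uncomment to open the help pages)
# help(package = "stats")   # package overview and function index
# ?aggregate                # help file for a single function

# 2. Run the examples shipped with a help page
# example(aggregate)

# 3. Test the function on small data whose answer you can check by hand
toy <- data.frame(group = c("a", "a", "b"), value = c(1, 3, 10))
res <- aggregate(value ~ group, data = toy, FUN = mean)
res   # group a should have mean 2, group b mean 10

# Handy while exploring: list a function's argument names
names(formals(sample))
```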

In [53]:
# Show the name and score of those who passed except Lucy(s).
library(dplyr)


Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union



In [61]:
# Show the name and score of those who passed except Lucy(s).
df.col <- filter(df, names != "Lucy" & pass == TRUE)
df.col
df.final <- select(df.col, c(names, score))
df.final

names,score,student.no,pass
Mark,87,student 3,True
Candy,91,student 4,True


names,score
Mark,87
Candy,91


##### dplyr cheatsheet
https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf

### 5.1.4 Keep/drop variables by name pattern
- dplyr package
- grep( ) family of functions that deal with text in R
    - See [---] if interested.
    - run ?grep
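
A minimal sketch of the grep( ) approach on made-up data. The data frame `toy` and the pattern `"^score"` are assumptions chosen for illustration.

```r
# Hypothetical toy data with two columns that share a name prefix
toy <- data.frame(id = 1:3,
                  score.math = c(67, 56, 87),
                  score.art = c(70, 80, 90),
                  pass = c(TRUE, FALSE, TRUE))

# grepl() returns TRUE/FALSE for each column name; "^score" means "starts with score"
keep <- grepl("^score", names(toy))

toy[ , keep, drop = FALSE]    # keep the matching columns
toy[ , !keep, drop = FALSE]   # drop the matching columns

# grep() returns the matching positions instead
grep("^score", names(toy))    # 2 3

# With dplyr attached, the same selection can be written as
# select(toy, starts_with("score"))
```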

## 5.2 Merge data
### 5.2.1 Add cases/observations to a data frame
This is basically adding rows.

In [23]:
df

names,score,student.no,pass
Lucy,67,student 1,True
John,56,student 2,False
Mark,87,student 3,True
Candy,91,student 4,True


In [24]:
new.student <- data.frame(names = "[---]",
                          score = 5,
                          student.no = "student 0",
                          pass = TRUE)
new.student

names,score,student.no,pass
[---],5,student 0,True


In [25]:
df.new <- rbind(df, new.student); df.new

names,score,student.no,pass
Lucy,67,student 1,True
John,56,student 2,False
Mark,87,student 3,True
Candy,91,student 4,True
[---],5,student 0,True


In [26]:
new.students <- data.frame(names = c("[---]", "[----]"),
                          score = c(5, 6),
                          student.no = c("student 0", "student 00"),
                          pass = c(TRUE, TRUE))
new.students
df.newnew <- rbind(df, new.students); df.newnew

names,score,student.no,pass
[---],5,student 0,True
[----],6,student 00,True


names,score,student.no,pass
Lucy,67,student 1,True
John,56,student 2,False
Mark,87,student 3,True
Candy,91,student 4,True
[---],5,student 0,True
[----],6,student 00,True


### 5.2.2 Add variables to a dataset
This is adding columns.

In [27]:
# Option 1
df.copy$id1 <- 1:4
df.copy

score,student.no,pass,id1
67,student 1,True,1
56,student 2,False,2
87,student 3,True,3
91,student 4,True,4


In [28]:
# Option 2
df.copy <- data.frame(df.copy, id2 = 1:4)
df.copy

score,student.no,pass,id1,id2
67,student 1,True,1,1
56,student 2,False,2,2
87,student 3,True,3,3
91,student 4,True,4,4


In [29]:
# Option 3
id3 <- 1:4
cbind(df.copy, id3)

score,student.no,pass,id1,id2,id3
67,student 1,True,1,1,1
56,student 2,False,2,2,2
87,student 3,True,3,3,3
91,student 4,True,4,4,4


##### Easily extend to adding multiple columns.
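
For instance, a sketch on made-up data (`toy` and `extra` are hypothetical names) of adding two columns in one step:

```r
# Hypothetical toy data for illustration
toy <- data.frame(score = c(67, 56, 87))

# Bind a second data frame with the same number of rows
extra <- data.frame(id = 1:3, pass = toy$score >= 60)
both <- cbind(toy, extra)
both

# Or assign several new columns at once with a list
toy[c("id", "pass")] <- list(1:3, toy$score >= 60)
toy
```
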
### 5.2.3 Merge data frames

In [63]:
df.a <- df[, -4]
df.b <- df[-2, ]
df.b[2, 3] <- "Student 5"
df.b[2, 2] <- 56
df.b[2, 4] <- FALSE
df.a; df.b

names,score,student.no
Lucy,67,student 1
John,56,student 2
Mark,87,student 3
Candy,91,student 4


,names,score,student.no,pass
1,Lucy,67,student 1,True
3,Mark,56,Student 5,False
4,Candy,91,student 4,True


In [31]:
# Inner join: keep only the rows that match on all common columns
merge(x = df.a, y = df.b, all = FALSE)
# all = FALSE is the default

names,score,student.no,pass
Candy,91,student 4,True
Lucy,67,student 1,True


In [32]:
merge(x = df.a, y = df.b, all = TRUE)   # All rows

names,score,student.no,pass
Candy,91,student 4,True
John,56,student 2,
Lucy,67,student 1,True
Mark,56,Student 5,False
Mark,87,student 3,


In [33]:
# Keep all rows, matching on score only; other shared columns get .x/.y suffixes
merge(df.a, df.b, all = TRUE, by = "score")

score,names.x,student.no.x,names.y,student.no.y,pass
56,John,student 2,Mark,Student 5,False
67,Lucy,student 1,Lucy,student 1,True
87,Mark,student 3,,,
91,Candy,student 4,Candy,student 4,True
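
Between `all = FALSE` and `all = TRUE`, `merge()` also accepts `all.x` and `all.y` to keep the unmatched rows of only one side. A minimal sketch on made-up data (`left` and `right` are hypothetical names):

```r
# Hypothetical toy data: "name" is the only common column
left  <- data.frame(name = c("Lucy", "John", "Mark"), score = c(67, 56, 87))
right <- data.frame(name = c("Lucy", "Mark", "Candy"), pass = c(TRUE, TRUE, TRUE))

inner <- merge(left, right, by = "name")                 # matched rows only
l.join <- merge(left, right, by = "name", all.x = TRUE)  # every row of left
r.join <- merge(left, right, by = "name", all.y = TRUE)  # every row of right
full  <- merge(left, right, by = "name", all = TRUE)     # every row of both

nrow(inner); nrow(l.join); nrow(r.join); nrow(full)      # 2 3 3 4
l.join   # John has no match in right, so his pass is NA
```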


## 5.3* The reshape package
<b style="color:red;"> What the [---]!!! Skip!!! </b>

## 5.3 aggregate( )
- [---] put this in "5.3 The reshape package".
- While aggregate( ) is not in the package.
- And it does not reshape data.
- However, very very very useful function!

##### I need a big and complex dataset.

In [34]:
# Some simple simulation
# People who take the drug, are obese, or are older are more likely to get the disease.
# Setting a seed makes random number generation reproducible.
set.seed(613)
n <- 100
drug <- sample(c(0, 1), size = n, replace = TRUE, prob = c(0.8, 0.2))
obesity <- sample(c(0, 1), size = n, replace = TRUE, prob = c(0.5, 0.5))
age <- round(rnorm(n, mean = 60, sd = 10))
logit.p <- log(1.8)*drug + log(1.05)*(age - 60) + log(1.2)*obesity + log(0.2)
p <- exp(logit.p)/(1 + exp(logit.p))
disease <- rbinom(n, size = 1, prob = p)
sim <- data.frame(drug, obesity, age, disease)
head(sim)

drug,obesity,age,disease
1,1,53,0
1,1,44,0
0,1,61,1
1,0,41,0
0,0,49,1
1,1,54,0


In [35]:
# Tabulate exposure and outcome
table(sim[, c("drug", "disease")])
# 20% of those unexposed to the drug (14/70)
# and 30% of those exposed (9/30) had the disease.

    disease
drug  0  1
   0 56 14
   1 21  9
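
The percentages quoted in the comments can be checked with `prop.table()`. This sketch retypes the four counts from the table above into a matrix so that it runs on its own; `tab` is a name introduced here.

```r
# Counts copied from the table above (rows: drug, columns: disease)
tab <- matrix(c(56, 21, 14, 9), nrow = 2,
              dimnames = list(drug = c("0", "1"), disease = c("0", "1")))

# margin = 1 divides each row by its row total
props <- prop.table(tab, margin = 1)
props
# disease = 1 column: 14/70 = 0.2 for drug 0 and 9/30 = 0.3 for drug 1
```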

In [65]:
# Replicate the tabulation
aggregate(sim$disease,
          by = list(drug = sim$drug, disease = sim$disease),
          FUN = length)

drug,disease,x
0,0,56
1,0,21
0,1,14
1,1,9


In [37]:
# Demo of length()
length(1:10)

In [38]:
aggregate(sim$disease,
          by = list(drug = sim$drug,
                    obesity = sim$obesity,
                    disease = sim$disease),
          FUN = length)

drug,obesity,disease,x
0,0,0,33
1,0,0,12
0,1,0,23
1,1,0,9
0,0,1,7
1,0,1,3
0,1,1,7
1,1,1,6


In [39]:
# Recall that table() in this case will return a 3D array
table(sim[, c("drug", "obesity", "disease")])

, , disease = 0

    obesity
drug  0  1
   0 33 23
   1 12  9

, , disease = 1

    obesity
drug  0  1
   0  7  7
   1  3  6


In [69]:
# More tabulation
# Count the obese and the diseased within each drug group at the same time
# sum gives the count because both variables are coded 1 = yes, 0 = no
aggregate(cbind(obesity, disease)~drug, data = sim, sum)

drug,obesity,disease
0,30,14
1,15,9
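
Why `sum` counts and `mean` gives a proportion for 0/1 variables, as a tiny sketch on a made-up indicator vector:

```r
# Hypothetical 0/1 indicator: 1 = obese, 0 = not obese
obese <- c(1, 0, 1, 1, 0)

sum(obese)      # 3: the number of ones
length(obese)   # 5: the number of observations
mean(obese)     # 0.6: the share of ones, i.e. 3/5
```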


In [41]:
# Mean age in different disease groups
aggregate(age~disease, data = sim, FUN = mean)

disease,age
0,58.76623
1,64.21739


In [42]:
aggregate(age~., data = sim, FUN = mean)
# . refers to all other variables

drug,obesity,disease,age
0,0,0,58.9697
1,0,0,60.08333
0,1,0,58.78261
1,1,0,56.22222
0,0,1,60.14286
1,0,1,67.33333
0,1,1,64.57143
1,1,1,67.0


In [43]:
aggregate(cbind(age, disease)~drug + obesity, data = sim, FUN = mean)

drug,obesity,age,disease
0,0,59.175,0.175
1,0,61.53333,0.2
0,1,60.13333,0.2333333
1,1,60.53333,0.4


##### With aggregate( ), we are already doing analysis.

## *Adding rows/columns to matrices
##### Make sure that dimensions match.

In [44]:
matrix1 <- matrix(1:6, byrow = TRUE, nrow = 2)
matrix2 <- matrix(7:12, byrow = TRUE, nrow = 2)
vector1 <- 13:15
matrix1
matrix2
vector1

1,2,3
4,5,6


7,8,9
10,11,12


In [45]:
merged.matrix <- rbind(matrix1, matrix2, vector1)
merged.matrix

,1,2,3
,4,5,6
,7,8,9
,10,11,12
vector1,13,14,15


##### Note the unwanted row name.
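
Two ways to get rid of that row name, sketched with small stand-in objects (`m1` and `v` are names introduced here, not the lecture's objects):

```r
m1 <- matrix(1:6, byrow = TRUE, nrow = 2)
v  <- 7:9

# By default rbind() labels the new row with the name of the vector
with.name <- rbind(m1, v)
rownames(with.name)              # "" "" "v"

# Option 1: remove the row names afterwards
rownames(with.name) <- NULL

# Option 2: ask rbind() not to create labels in the first place
no.name <- rbind(m1, v, deparse.level = 0)
rownames(no.name)                # NULL
```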

In [46]:
matrix3 <- matrix(letters[1:6], byrow = TRUE, nrow = 2)
vector2 <- letters[7:8]
matrix3
vector2

a,b,c
d,e,f


In [47]:
cbind(matrix3, vector2)

,,,vector2
a,b,c,g
d,e,f,h


In [76]:
class(matrix3)

In [77]:
df

names,score,student.no,pass
Lucy,67,student 1,True
John,56,student 2,False
Mark,87,student 3,True
Candy,91,student 4,True


In [93]:
set.seed(192)
row.index <- sample(1:4, 1)
row.index

In [82]:
df[row.index, ]

,names,score,student.no,pass
4,Candy,91,student 4,True
