In [1]:
options(jupyter.rich_display = FALSE)

_Data frames_ are used for representing tabular data where each column may have a different type, such as

|Name | Height| Weight | Gym member? | City|
|-----|----|----|----|----|
|Cem | 1.75 | 66 |T | Istanbul|
|Can | 1.70 | 65 | F | Ankara|
|Hande | 1.62 | 61| T | Izmir|

* Lists: heterogeneous analogs of vectors.
* Data frames: heterogeneous analogs of matrices.

A data frame is actually a _list_ of equal-length _vectors_.
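
The claim that a data frame is a list of equal-length vectors can be checked directly. The following is a minimal sketch; the data frame `df` and its values are made up for illustration only.

```r
# Illustrative data frame (values invented for this sketch)
df <- data.frame(
    Height = c(1.70, 1.75, 1.62),
    Member = c(FALSE, TRUE, TRUE),
    City   = c("Istanbul", "Ankara", "Izmir"),
    stringsAsFactors = FALSE
)

is.list(df)          # a data frame is a list underneath
typeof(df)           # the underlying storage type is "list"
length(df)           # number of components, i.e. columns
sapply(df, class)    # each column keeps its own type
sapply(df, length)   # all columns have the same length
```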

Why not use a matrix?
===
Earlier we saw how to store data in vectors, for example:

In [2]:
heights <- c(Can=1.70, Cem=1.75, Hande=1.62)
weights <- c(Can=65, Cem=66, Hande=61)

If we want to have this data combined in a table, we can generate a matrix out of it:

In [3]:
height_weight <- cbind(
    c(1.70, 1.75,1.62),
    c(65, 66, 61)
)
rownames(height_weight) <- c("Can","Cem","Hande")
colnames(height_weight) <- c("Height","Weight")
height_weight

      Height Weight
Can   1.70   65    
Cem   1.75   66    
Hande 1.62   61    

Or more directly, if the data is already in a vector:

In [4]:
height_weight <- cbind(heights, weights)
height_weight

      heights weights
Can   1.70    65     
Cem   1.75    66     
Hande 1.62    61     

In [5]:
colnames(height_weight) <- c("Height","Weight")
print(height_weight)

      Height Weight
Can     1.70     65
Cem     1.75     66
Hande   1.62     61


In [6]:
class(height_weight)

[1] "matrix"

In [7]:
height_weight["Can","Weight"]/height_weight["Can","Height"]^2

[1] 22.49135

Now suppose we want to add the Boolean gym membership data into this matrix as well.

In [8]:
member <- c(Can=FALSE, Cem=TRUE, Hande=TRUE)  # names in the same order as heights; cbind() combines by position
height_weight <- cbind(heights, weights,member)
colnames(height_weight) <- c("Height","Weight","Gym member")
print(height_weight)

      Height Weight Gym member
Can     1.70     65          0
Cem     1.75     66          1
Hande   1.62     61          1


* The last column has numeric values 0 or 1, instead of `TRUE` or `FALSE`.
* Reason: All elements in a matrix must have _the same mode_ (numeric here).
* If values of a different mode are added (Boolean here), all elements are _coerced_ to a common type (numeric here).
* `TRUE` becomes 1, `FALSE` becomes 0.

This could be tolerable. However, now we add the city data.

In [9]:
city <- c(Can="Istanbul",Cem="Ankara",Hande="Izmir")  # names in the same order as heights; cbind() combines by position

height_weight <- cbind(heights, weights,member,city)
colnames(height_weight) <- c("Height","Weight","Gym member","City")

print(height_weight)

      Height Weight Gym member City      
Can   "1.7"  "65"   "FALSE"    "Istanbul"
Cem   "1.75" "66"   "TRUE"     "Ankara"  
Hande "1.62" "61"   "TRUE"     "Izmir"   


All entries are now coerced to strings. The data is still there, but we cannot perform computations anymore.

In [10]:
height_weight["Can","Weight"]/height_weight["Can","Height"]^2

ERROR: Error in height_weight["Can", "Height"]^2: non-numeric argument to binary operator
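
The coercion seen in the last two matrices follows a fixed order: logical values turn into numbers, and everything turns into character strings once a single string is present. A minimal sketch with plain vectors (the values are illustrative) reproduces both steps, and shows that `as.numeric()` can convert individual string entries back when a computation is needed.

```r
x <- c(1.70, 65, TRUE)             # logical mixed with numeric -> numeric
x

y <- c(1.70, 65, TRUE, "Ankara")   # a string is present -> everything becomes character
y

# Converting the strings back by hand makes arithmetic possible again
bmi <- as.numeric(y[2]) / as.numeric(y[1])^2
bmi
```

This works, but converting entries one by one is tedious, which is part of the motivation for data frames.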


How about keeping the data in separate vectors? There would be no coercion, but data manipulation would be difficult. Selecting subsets, adding/removing entries, would require several operations and great care.

A _data frame_ solves this problem: it combines several vectors as columns of a single table, while each column keeps its own type.

Creating data frames
====
If we already have data in the form of one-dimensional vectors, we can combine them into a data frame using the `data.frame()` function.

In [11]:
people <- data.frame(Height=heights, Weight=weights, Member=member, City=city, stringsAsFactors = FALSE)
people

      Height Weight Member City    
Can   1.70   65     FALSE  Istanbul
Cem   1.75   66      TRUE  Ankara  
Hande 1.62   61      TRUE  Izmir   

Data recycling works for data frames as well. Suppose we add the "City" data and make it "Istanbul" for all.

In [12]:
data.frame(Height=heights, Weight=weights, City="Istanbul")

      Height Weight City    
Can   1.70   65     Istanbul
Cem   1.75   66     Istanbul
Hande 1.62   61     Istanbul

Here, the element `"Istanbul"` is repeated until it matches the length of other vectors.
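
Recycling in `data.frame()` only works when the shorter vector fits a whole number of times into the longer one. The sketch below, which wraps the failing call in `tryCatch()` so that it runs to completion, illustrates both cases; the variable names `df` and `result` are chosen for this example.

```r
heights <- c(Can = 1.70, Cem = 1.75, Hande = 1.62)

# A length-1 value is recycled to fill all three rows
df <- data.frame(Height = heights, City = "Istanbul")
nrow(df)

# A length-2 vector does not fit evenly into 3 rows, so data.frame() stops
result <- tryCatch(
    data.frame(Height = heights, City = c("Istanbul", "Ankara")),
    error = function(e) conditionMessage(e)   # keep the error message as a string
)
result
```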

What if row names are not given during the construction, or we want to change them? The functions `rownames()` and `colnames()` can be used.

In [13]:
tempdf <- data.frame(
    h = c(1.70, 1.75,1.62),
    w = c(65, 66, 61)
)
tempdf

  h    w 
1 1.70 65
2 1.75 66
3 1.62 61

In [14]:
rownames(tempdf) <- c("Can","Cem","Hande")
colnames(tempdf) <- c("Height","Weight")
tempdf

      Height Weight
Can   1.70   65    
Cem   1.75   66    
Hande 1.62   61    

Accessing data frames
====

Accessing via column numbers or column names
----

In [15]:
people

      Height Weight Member City    
Can   1.70   65     FALSE  Istanbul
Cem   1.75   66      TRUE  Ankara  
Hande 1.62   61      TRUE  Izmir   

A data frame is a list, so we can access its components using the list notation we saw last week.

In [16]:
people[[1]]  # indexing with component number

[1] 1.70 1.75 1.62

In [17]:
people$Height  # component name

[1] 1.70 1.75 1.62

In [18]:
people[["Height"]]  # indexing with component name

[1] 1.70 1.75 1.62

Accessing via matrix-like indexing
-----
A data frame can be indexed as if it were a matrix, using the `[row, col]` notation.

In [19]:
people

      Height Weight Member City    
Can   1.70   65     FALSE  Istanbul
Cem   1.75   66      TRUE  Ankara  
Hande 1.62   61      TRUE  Izmir   

In [20]:
people[,1]  # column 1

[1] 1.70 1.75 1.62

In [21]:
people[2,1] # row 2, column 1

[1] 1.75

In [22]:
people["Cem","Height"]

[1] 1.75
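
Note that selecting a single column with matrix-like indexing returns a plain vector, not a data frame. If a one-column data frame is wanted, `drop = FALSE` or the list-style single bracket can be used. A small sketch, using a reduced copy of `people` built here so that the example is self-contained:

```r
people <- data.frame(
    Height = c(1.70, 1.75, 1.62),
    Weight = c(65, 66, 61),
    row.names = c("Can", "Cem", "Hande")
)

v <- people[, 1]                  # single column: simplified to a vector
d <- people[, 1, drop = FALSE]    # drop = FALSE keeps a one-column data frame
l <- people[1]                    # list-style single bracket also keeps a data frame

class(v)
class(d)
class(l)
```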

Selecting rows using indices
===

In [23]:
people

      Height Weight Member City    
Can   1.70   65     FALSE  Istanbul
Cem   1.75   66      TRUE  Ankara  
Hande 1.62   61      TRUE  Izmir   

We can specify a vector of indices to select rows.

In [24]:
people[c(1,3),]

      Height Weight Member City    
Can   1.70   65     FALSE  Istanbul
Hande 1.62   61      TRUE  Izmir   

We can also select using a vector of row names.

In [25]:
people[c("Can","Hande"),]

      Height Weight Member City    
Can   1.70   65     FALSE  Istanbul
Hande 1.62   61      TRUE  Izmir   

A negative index, again, indicates a row that is to be omitted.

In [26]:
people[-2,]

      Height Weight Member City    
Can   1.70   65     FALSE  Istanbul
Hande 1.62   61      TRUE  Izmir   

Selecting some columns
====

We can provide a vector of column names to get a subframe.

In [27]:
people[, c("Member","City")]

      Member City    
Can   FALSE  Istanbul
Cem    TRUE  Ankara  
Hande  TRUE  Izmir   

Numeric indices can also be used.

In [28]:
people[, 3:4]

      Member City    
Can   FALSE  Istanbul
Cem    TRUE  Ankara  
Hande  TRUE  Izmir   

A subset of rows and a subset of columns:

In [29]:
people[c("Can","Cem"), 1:2]

    Height Weight
Can 1.70   65    
Cem 1.75   66    

Filtering data frames
==
The Boolean (logical) indexing we used to select vector elements applies to data frames as well: a logical vector in the row position keeps the rows where it is `TRUE`.

In [30]:
people

      Height Weight Member City    
Can   1.70   65     FALSE  Istanbul
Cem   1.75   66      TRUE  Ankara  
Hande 1.62   61      TRUE  Izmir   

In [31]:
people$Height >= 1.70

[1]  TRUE  TRUE FALSE

In [32]:
people[ people$Height>= 1.70, ]

    Height Weight Member City    
Can 1.70   65     FALSE  Istanbul
Cem 1.75   66      TRUE  Ankara  

In [33]:
people[ people$Member, ]

      Height Weight Member City  
Cem   1.75   66     TRUE   Ankara
Hande 1.62   61     TRUE   Izmir 

In [34]:
people[ people$Member, c("Height","City")]

      Height City  
Cem   1.75   Ankara
Hande 1.62   Izmir 
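
Several conditions can be combined with `&` and `|`, as with vectors. The sketch below also shows base R's `subset()` function as an alternative that lets column names be written without the `people$` prefix. It uses a reduced copy of `people` so that it can be run on its own.

```r
people <- data.frame(
    Height = c(1.70, 1.75, 1.62),
    Weight = c(65, 66, 61),
    Member = c(FALSE, TRUE, TRUE),
    row.names = c("Can", "Cem", "Hande")
)

# Gym members who are at least 1.70 tall: combine conditions with &
tall_members <- people[people$Member & people$Height >= 1.70, ]
tall_members

# The same filter with subset(), which looks up column names inside the data frame
tall_members2 <- subset(people, Member & Height >= 1.70)
tall_members2
```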

Adding new rows
===
As with matrices, we can use `rbind()` to add a new row to an existing data frame. The new row is usually in the form of a list.

In [35]:
people

      Height Weight Member City    
Can   1.70   65     FALSE  Istanbul
Cem   1.75   66      TRUE  Ankara  
Hande 1.62   61      TRUE  Izmir   

In [36]:
rbind(people, Lale=list(1.71, 64, FALSE, "Bursa"))

      Height Weight Member City    
Can   1.70   65     FALSE  Istanbul
Cem   1.75   66      TRUE  Ankara  
Hande 1.62   61      TRUE  Izmir   
Lale  1.71   64     FALSE  Bursa   

In [37]:
newpeople <- data.frame(
    Weight=c(64, 50),
    Member=c(FALSE,TRUE),
    City=c("Bursa","Istanbul"),
    Height=c(Lale=1.71, Ziya=1.45)
)
newpeople

     Weight Member City     Height
Lale 64     FALSE  Bursa    1.71  
Ziya 50      TRUE  Istanbul 1.45  

In [38]:
rbind(people, newpeople)

      Height Weight Member City    
Can   1.70   65     FALSE  Istanbul
Cem   1.75   66      TRUE  Ankara  
Hande 1.62   61      TRUE  Izmir   
Lale  1.71   64     FALSE  Bursa   
Ziya  1.45   50      TRUE  Istanbul
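
In the last example the columns of `newpeople` were in a different order than those of `people`, yet the result is correct: when both arguments are data frames, `rbind()` matches columns by name. An unnamed list, in contrast, is matched by position. A minimal sketch with two small frames (`a` and `b` are names made up for this example):

```r
a <- data.frame(Height = 1.70, Weight = 65, row.names = "Can")
b <- data.frame(Weight = 64, Height = 1.71, row.names = "Lale")  # columns in reverse order

combined <- rbind(a, b)
combined    # Lale's height lands in the Height column, not in Weight

# A list without component names is matched by position instead
combined2 <- rbind(a, Ziya = list(1.45, 50))
combined2
```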

Adding new columns
===

In [39]:
people

      Height Weight Member City    
Can   1.70   65     FALSE  Istanbul
Cem   1.75   66      TRUE  Ankara  
Hande 1.62   61      TRUE  Izmir   

Suppose we want to add a column for BMI, which we calculate using the existing columns. We can do this using `cbind()` as follows.

In [40]:
people$Weight/people$Height^2

[1] 22.49135 21.55102 23.24341

In [41]:
people_bmi <- cbind(people, people$Weight/people$Height^2)
people_bmi

      Height Weight Member City     people$Weight/people$Height^2
Can   1.70   65     FALSE  Istanbul 22.49135                     
Cem   1.75   66      TRUE  Ankara   21.55102                     
Hande 1.62   61      TRUE  Izmir    23.24341                     

Note that the name of the new column is automatically set. We can change this using the `names()` function. (`colnames()` can also be used.)

In [42]:
names(people_bmi)[5] <- "BMI"

In [43]:
people_bmi

      Height Weight Member City     BMI     
Can   1.70   65     FALSE  Istanbul 22.49135
Cem   1.75   66      TRUE  Ankara   21.55102
Hande 1.62   61      TRUE  Izmir    23.24341

A more direct way is to use the fact that lists can be extended dynamically: assigning to a component name that does not exist yet creates a new column.

In [44]:
people2 <- people
people2

      Height Weight Member City    
Can   1.70   65     FALSE  Istanbul
Cem   1.75   66      TRUE  Ankara  
Hande 1.62   61      TRUE  Izmir   

In [45]:
people2$BMI <- people2$Weight/people2$Height^2
people2

      Height Weight Member City     BMI     
Can   1.70   65     FALSE  Istanbul 22.49135
Cem   1.75   66      TRUE  Ankara   21.55102
Hande 1.62   61      TRUE  Izmir    23.24341

We can create a new column without using existing columns. If we specify a single value, the rest is set by vector recycling.

In [46]:
people2$obese <- NA
people2

      Height Weight Member City     BMI      obese
Can   1.70   65     FALSE  Istanbul 22.49135 NA   
Cem   1.75   66      TRUE  Ankara   21.55102 NA   
Hande 1.62   61      TRUE  Izmir    23.24341 NA   

In [47]:
people2$obese <- people2$BMI > 30  # the comparison already yields TRUE/FALSE

In [48]:
people2

      Height Weight Member City     BMI      obese
Can   1.70   65     FALSE  Istanbul 22.49135 FALSE
Cem   1.75   66      TRUE  Ankara   21.55102 FALSE
Hande 1.62   61      TRUE  Izmir    23.24341 FALSE

To remove a column, we assign `NULL` to it.

In [49]:
people2$obese <- NULL
people2

      Height Weight Member City     BMI     
Can   1.70   65     FALSE  Istanbul 22.49135
Cem   1.75   66      TRUE  Ankara   21.55102
Hande 1.62   61      TRUE  Izmir    23.24341

Merging data frames
===
The `merge(x,y)` function is used to create a new data frame from existing frames `x` and `y`, by combining them along a common column.

To illustrate this, consider the following simple data frames.

In [50]:
df1 <- data.frame(Name=c("Can","Cem","Hande"), Phone=c(1234,4345,8492))
df1

  Name  Phone
1 Can   1234 
2 Cem   4345 
3 Hande 8492 

In [51]:
df2 <- data.frame(Age=c(25,27,26), Name=c("Cem","Hande","Can"))
df2

  Age Name 
1 25  Cem  
2 27  Hande
3 26  Can  

In [52]:
merge(df1,df2)

  Name  Phone Age
1 Can   1234  26 
2 Cem   4345  25 
3 Hande 8492  27 

The `merge()` function uses the column `Name`, which is common to both frames, to merge them. Note that the entries are correctly matched even though the order of the names is different in the two frames.

The columns we want to merge over may have different names in the two frames. In that case we use the `by.x` and `by.y` arguments to `merge()`.

In [53]:
df1

  Name  Phone
1 Can   1234 
2 Cem   4345 
3 Hande 8492 

In [54]:
df2 <- data.frame(Age=c(25,27,26), first_name=c("Cem","Hande","Can"))
df2

  Age first_name
1 25  Cem       
2 27  Hande     
3 26  Can       

In [55]:
merge(df1, df2, by.x="Name", by.y="first_name")

  Name  Phone Age
1 Can   1234  26 
2 Cem   4345  25 
3 Hande 8492  27 

What if we have named rows, and we want to merge over these row names? For example, recall the `people` data frame we created above:

In [56]:
people

      Height Weight Member City    
Can   1.70   65     FALSE  Istanbul
Cem   1.75   66      TRUE  Ankara  
Hande 1.62   61      TRUE  Izmir   

And suppose we have a phone book with named indices:

In [57]:
phonebook <- data.frame(phone=c(Can=1234, Cem=4345, Lale=8492))
phonebook

     phone
Can  1234 
Cem  4345 
Lale 8492 

Note that `phonebook` does not contain Hande, and `people` does not contain Lale.

To merge by row names, we pass `"row.names"` as the value of both the `by.x` and `by.y` parameters.

In [58]:
merge(people,phonebook,by.x="row.names", by.y="row.names")

  Row.names Height Weight Member City     phone
1 Can       1.70   65     FALSE  Istanbul 1234 
2 Cem       1.75   66      TRUE  Ankara   4345 

Note that the resulting data frame does not include Hande or Lale, because they are missing in one or the other data frame. This is called an _inner join_ operation.

If we want to keep all the rows, including those with missing data, we use the `all=TRUE` parameter. Missing entries are filled with `NA`. This is called an _outer join_ operation.

In [59]:
merged_df <- merge(people,phonebook,by.x="row.names", by.y="row.names", all=TRUE)

In [60]:
merged_df

  Row.names Height Weight Member City     phone
1 Can       1.70   65     FALSE  Istanbul 1234 
2 Cem       1.75   66      TRUE  Ankara   4345 
3 Hande     1.62   61      TRUE  Izmir      NA 
4 Lale        NA   NA        NA  NA       8492 

If we want to use the names of people as row names, we can assign them using the `rownames()` function, and then remove the `"Row.names"` column.

In [61]:
rownames(merged_df) <- merged_df$Row.names
merged_df$Row.names <- NULL
merged_df

      Height Weight Member City     phone
Can   1.70   65     FALSE  Istanbul 1234 
Cem   1.75   66      TRUE  Ankara   4345 
Hande 1.62   61      TRUE  Izmir      NA 
Lale    NA   NA        NA  NA       8492 
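
Between the inner join and the full outer join shown above, `merge()` also accepts `all.x=TRUE` and `all.y=TRUE`, which keep all rows of only the first or only the second frame (left and right joins). The sketch below uses simplified versions of the two frames, with the names stored in an ordinary `Name` column rather than as row names.

```r
people <- data.frame(Name = c("Can", "Cem", "Hande"), Height = c(1.70, 1.75, 1.62))
phonebook <- data.frame(Name = c("Can", "Cem", "Lale"), phone = c(1234, 4345, 8492))

# Left join: keep every row of the first frame (Hande gets NA for phone)
left <- merge(people, phonebook, by = "Name", all.x = TRUE)
left

# Right join: keep every row of the second frame (Lale gets NA for Height)
right <- merge(people, phonebook, by = "Name", all.y = TRUE)
right
```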

Applications
===
Analyze the grades in a class
---

In [62]:
grades <- data.frame(
    student = c("Can","Cem","Hande","Lale","Ziya"),
    midterm1 = c(45, 74, 67, 52, 31),
    midterm2 = c(68, 83, 56, 22, 50),
    final = c(59, 91, 62, 49, 65),
    stringsAsFactors = FALSE)
grades

  student midterm1 midterm2 final
1 Can     45       68       59   
2 Cem     74       83       91   
3 Hande   67       56       62   
4 Lale    52       22       49   
5 Ziya    31       50       65   

In [63]:
grades$score <- grades$midterm1*0.3 + grades$midterm2*0.3 + grades$final*0.4
grades

  student midterm1 midterm2 final score
1 Can     45       68       59    57.5 
2 Cem     74       83       91    83.5 
3 Hande   67       56       62    61.7 
4 Lale    52       22       49    41.8 
5 Ziya    31       50       65    50.3 

In [64]:
grades[-1]

  midterm1 midterm2 final score
1 45       68       59    57.5 
2 74       83       91    83.5 
3 67       56       62    61.7 
4 52       22       49    41.8 
5 31       50       65    50.3 

In [65]:
apply(grades[-1],2,mean)

midterm1 midterm2    final    score 
   53.80    55.80    65.20    58.96 

In [66]:
sapply(grades[-1],mean)

midterm1 midterm2    final    score 
   53.80    55.80    65.20    58.96 

In [67]:
lapply(grades[-1],mean)

$midterm1
[1] 53.8

$midterm2
[1] 55.8

$final
[1] 65.2

$score
[1] 58.96


In [68]:
lettergrade <- function(score){
    if (score > 80) "A" else if (score > 70) "B" else if (score>60) "C" else if (score>50) "D" else "F"
}

In [69]:
sapply(grades$score,lettergrade)

[1] "D" "A" "C" "F" "D"

In [70]:
grades$letter <- sapply(grades$score, lettergrade)
grades

  student midterm1 midterm2 final score letter
1 Can     45       68       59    57.5  D     
2 Cem     74       83       91    83.5  A     
3 Hande   67       56       62    61.7  C     
4 Lale    52       22       49    41.8  F     
5 Ziya    31       50       65    50.3  D     
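
The `lettergrade()` function relies on `if`, which handles one score at a time; that is why `sapply()` is needed. As a possible alternative, base R's `cut()` bins a whole vector at once. The sketch below copies the scores computed above into a plain vector and uses the same thresholds; by default the intervals are open on the left and closed on the right, so a score of exactly 80 is a "B", as in `lettergrade()`.

```r
score <- c(57.5, 83.5, 61.7, 41.8, 50.3)   # the scores computed above

letter <- cut(
    score,
    breaks = c(-Inf, 50, 60, 70, 80, Inf),  # (-Inf,50], (50,60], (60,70], (70,80], (80,Inf]
    labels = c("F", "D", "C", "B", "A")
)
as.character(letter)   # cut() returns a factor; convert to strings
```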

Grading multiple-choice exams
---
Our students have taken a multiple-choice exam. Their answers, as well as the answer key, are recorded as vectors.

In [71]:
key <- c("A","B","C","D","A")
answers <- rbind(
    c("A", "B", "D", "A", "B"),
    c("A", "D", "C", "D", "A"),
    c("B", "B", "C", "D", "B"),
    c("A", "B", "C", "D", "D"),
    c("C", "C", "C", "D", "A")
)

We initialize a separate data frame with the student information:

In [72]:
exam <- data.frame(
    student = c("Can","Cem","Hande","Lale","Ziya"),
    stringsAsFactors = FALSE
)
exam

  student
1 Can    
2 Cem    
3 Hande  
4 Lale   
5 Ziya   

And add the exam data to that data frame with `cbind()`.

In [73]:
exam <- cbind(exam, answers)
exam

  student 1 2 3 4 5
1 Can     A B D A B
2 Cem     A D C D A
3 Hande   B B C D B
4 Lale    A B C D D
5 Ziya    C C C D A

Now we can process this data frame to get the number of correct answers for each student. For that, we can use the `sum(x==y)` operation, which gives us the number of equal elements.

In [74]:
exam[1,2:6]   # answers only; column 1 holds the student name
key
exam[1,2:6]==key
sum(exam[1,2:6]==key)

  1 2 3 4 5
1 A B D A B

[1] "A" "B" "C" "D" "A"

  1    2    3     4     5    
1 TRUE TRUE FALSE FALSE FALSE

[1] 2

To repeat this for each row, we create a function that returns the number of matching answers.

In [75]:
ncorrect <- function(x){
    sum(x==key)
}

In [76]:
ncorrect(exam[1,2:6])

[1] 2

And we use `apply()` to apply it to every row.

In [77]:
apply(exam[,2:6],1,ncorrect)

[1] 2 4 3 4 3

We can store this result by creating a new column in the data frame.

In [78]:
exam$correct <- apply(exam[,2:6],1,ncorrect)

In [79]:
exam

  student 1 2 3 4 5 correct
1 Can     A B D A B 2      
2 Cem     A D C D A 4      
3 Hande   B B C D B 3      
4 Lale    A B C D D 4      
5 Ziya    C C C D A 3      
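
The row-by-row `apply()` call can also be replaced by a single vectorized comparison on the original `answers` matrix. This is a sketch of one possible approach, not the only one: transposing the matrix puts each student's answers in a column, so the key is recycled down each column and lines up question by question.

```r
key <- c("A","B","C","D","A")
answers <- rbind(
    c("A", "B", "D", "A", "B"),
    c("A", "D", "C", "D", "A"),
    c("B", "B", "C", "D", "B"),
    c("A", "B", "C", "D", "D"),
    c("C", "C", "C", "D", "A")
)

# Columns of t(answers) are students; each has the same length as key
correct <- colSums(t(answers) == key)
correct
```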

Item database
---
Suppose you run a retail store and you keep a database of your items, their unit prices, and the VAT rate for each item, such as the following.

In [80]:
items <- data.frame(
    itemname = c("Milk","Meat","Toothpaste","Pencil","Detergent"),
    vat = c(0.05, 0.04, 0.05, 0.06, 0.03),
    unitprice = c(10, 20, 5, 1, 4))
items

  itemname   vat  unitprice
1 Milk       0.05 10       
2 Meat       0.04 20       
3 Toothpaste 0.05  5       
4 Pencil     0.06  1       
5 Detergent  0.03  4       

You get some orders for some items, which your automated system stores with an order ID:

In [81]:
orders <- data.frame(
    orderid = c("1234","5761","1832"),
    item = c("Milk","Meat","Toothpaste"),
    amount = c(3,1,2))
orders

  orderid item       amount
1 1234    Milk       3     
2 5761    Meat       1     
3 1832    Toothpaste 2     

Our task is to add a new column to the `orders` data frame that holds the total payment for each order, including the VAT.

We begin by merging the `orders` and `items` data frames. Not every item has been ordered, so we perform an inner join, which keeps only the items that appear in both frames. We store the result in a new data frame.

In [82]:
orders2 <- merge(orders,items,by.x="item",by.y="itemname")
orders2

  item       orderid amount vat  unitprice
1 Meat       5761    1      0.04 20       
2 Milk       1234    3      0.05 10       
3 Toothpaste 1832    2      0.05  5       

Now that we have the unit price and the VAT information on the same data frame, we can calculate the total to pay and store it in a new column.

In [83]:
orders2$total <- (orders2$amount*orders2$unitprice)*(1+orders2$vat)
orders2

  item       orderid amount vat  unitprice total
1 Meat       5761    1      0.04 20        20.8 
2 Milk       1234    3      0.05 10        31.5 
3 Toothpaste 1832    2      0.05  5        10.5 
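
Note that `merge()` reordered the rows by item name and stored the result in a separate frame. If the totals should be added to `orders` itself, in its original row order, one possible alternative is to look up each ordered item with `match()`. The sketch below rebuilds both frames so that it is self-contained; `idx` is a helper name introduced here.

```r
items <- data.frame(
    itemname = c("Milk","Meat","Toothpaste","Pencil","Detergent"),
    vat = c(0.05, 0.04, 0.05, 0.06, 0.03),
    unitprice = c(10, 20, 5, 1, 4),
    stringsAsFactors = FALSE)
orders <- data.frame(
    orderid = c("1234","5761","1832"),
    item = c("Milk","Meat","Toothpaste"),
    amount = c(3,1,2),
    stringsAsFactors = FALSE)

# Position of each ordered item in the items table
idx <- match(orders$item, items$itemname)

# Price and VAT are picked from items in the order of the orders table
orders$total <- orders$amount * items$unitprice[idx] * (1 + items$vat[idx])
orders
```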