In [1]:
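options(jupyter.rich_display = FALSE)  # show plain-text output in the notebook
options(stringsAsFactors = FALSE)      # keep strings as character vectors (the default since R 4.0)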

# Data Frames

**Data frames** are used for representing tabular data where each column can have a different type, such as


| Name | Height | Weight | Gym member? | City |
|------|--------|--------|-------------|------|
| Cem | 1.75 | 66 | F | Istanbul |
| Can | 1.70 | 65 | T | Ankara |
| Hande | 1.62 | 61 | T | Izmir |

* _Lists_ are heterogeneous analogs of _vectors_.
* _Data frames_ are heterogeneous analogs of _matrices_.
* Internally, a data frame is a _list_ of equal-length _vectors_.

# Why not use a matrix?

Earlier we saw how to store data in named vectors, for example:

In [2]:
heights <- c(Can=1.70, Cem=1.75, Hande=1.62)
weights <- c(Can=65, Cem=66, Hande=61)

If we want to have this data combined in a table, we can generate a matrix out of it:

In [3]:
height_weight <- cbind(
    c(1.70, 1.75,1.62),
    c(65, 66, 61)
)
rownames(height_weight) <- c("Can","Cem","Hande")
colnames(height_weight) <- c("Height","Weight")
height_weight

      Height Weight
Can     1.70     65
Cem     1.75     66
Hande   1.62     61

Alternatively, if the data are already stored in named vectors, we can bind those directly:

In [4]:
height_weight <- cbind(heights, weights)
colnames(height_weight) <- c("Height","Weight")
height_weight

      Height Weight
Can     1.70     65
Cem     1.75     66
Hande   1.62     61

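Because every entry is numeric, we can compute with the matrix elements. For example, the BMI of "Can" is his weight divided by the square of his height: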

In [5]:
height_weight["Can","Weight"]/height_weight["Can","Height"]^2

[1] 22.49135
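Since matrix columns are ordinary numeric vectors, the same formula can be applied to all rows at once. A minimal sketch that rebuilds the matrix from the values above (the variable `bmi` is introduced here only for illustration):

```r
# Rebuild the height/weight matrix with named columns and rows
height_weight <- cbind(Height = c(1.70, 1.75, 1.62), Weight = c(65, 66, 61))
rownames(height_weight) <- c("Can", "Cem", "Hande")

# Column-wise arithmetic is vectorized: one BMI value per row
bmi <- height_weight[, "Weight"] / height_weight[, "Height"]^2
round(bmi, 2)
```

The result is a named vector, because the row names of the matrix carry over to the extracted columns.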

Trouble arises when we want to store the Boolean gym membership data in this matrix as well.

In [7]:
heights <- c(Can=1.70, Cem=1.75, Hande=1.62)
weights <- c(Can=65, Cem=66, Hande=61)
member <- c(Can=TRUE, Cem=FALSE, Hande=TRUE)
height_weight <- cbind(heights, weights,member)
colnames(height_weight) <- c("Height","Weight","Gym member")
height_weight

      Height Weight Gym member
Can     1.70     65          1
Cem     1.75     66          0
Hande   1.62     61          1

* The last column has numeric values 0 or 1, instead of `TRUE` or `FALSE`.
* Reason: all elements of a matrix must have _the same mode_ (numeric here).
* When values of another mode are added (Boolean here), all elements are _coerced_ to a common type (numeric here).
* `TRUE` becomes 1, `FALSE` becomes 0.

Suppose we also want to add the city data.

In [8]:
city <- c(Can="Ankara",Cem="Istanbul",Hande="Izmir")

height_weight <- cbind(heights, weights,member,city)
colnames(height_weight) <- c("Height","Weight","Gym member","City")

print(height_weight)

      Height Weight Gym member City      
Can   "1.7"  "65"   "TRUE"     "Ankara"  
Cem   "1.75" "66"   "FALSE"    "Istanbul"
Hande "1.62" "61"   "TRUE"     "Izmir"   


All entries are now coerced to strings. The values are still there, but arithmetic on them no longer works:

In [9]:
height_weight["Can","Weight"]/height_weight["Can","Height"]^2

ERROR: Error in height_weight["Can", "Height"]^2: non-numeric argument to binary operator
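The coercion seen in the matrix is the same one that `c()` applies to plain vectors, so it can be explored in isolation. A small sketch (the names `x` and `y` are illustrative only):

```r
x <- c(1.70, TRUE)             # logical meets numeric: TRUE becomes 1
y <- c(1.70, TRUE, "Ankara")   # one string forces every element to character

typeof(x)
typeof(y)

# The number can be recovered, but only by converting it back by hand
as.numeric(y[1])^2
```

Converting back and forth like this for every computation is error-prone, which is one more reason to avoid mixing types in a matrix.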


Should we keep the data in separate vectors instead?
* There would be no coercion, but data manipulation would be difficult.
* Selecting subsets, adding/removing entries, would require several operations and great care.
* A _data frame_ that combines several vectors as data columns provides convenience.

Creating data frames
====
Several vectors can be combined into a data frame using the `data.frame()` function.

In [12]:
help(data.frame)

data.frame                package:base                 R Documentation

_D_a_t_a _F_r_a_m_e_s

_D_e_s_c_r_i_p_t_i_o_n:

     This function creates data frames, tightly coupled collections of
     variables which share many of the properties of matrices and of
     lists, used as the fundamental data structure by most of R's
     modeling software.

_U_s_a_g_e:

     data.frame(..., row.names = NULL, check.rows = FALSE,
                check.names = TRUE,
                stringsAsFactors = default.stringsAsFactors())
     
     default.stringsAsFactors()
     
_A_r_g_u_m_e_n_t_s:

     ...: these arguments are of either the form ‘value’ or ‘tag =
          value’.  Component names are created based on the tag (if
          present) or the deparsed argument itself.

row.names: ‘NULL’ or a single integer or character string specifying a
          column to be used as row names, or a character or integer
          vector giving the row names for the data 

In [13]:
people <- data.frame(Height=heights, 
                     Weight=weights, 
                     Member=member, 
                     City=city)
people

      Height Weight Member     City
Can     1.70     65   TRUE   Ankara
Cem     1.75     66  FALSE Istanbul
Hande   1.62     61   TRUE    Izmir

**Recycling** applies to data frames as well. Suppose we add the `"City"` data and make it `"Istanbul"` for all:

In [14]:
data.frame(Height=heights, Weight=weights, City="Istanbul")

      Height Weight     City
Can     1.70     65 Istanbul
Cem     1.75     66 Istanbul
Hande   1.62     61 Istanbul

Here, the element `"Istanbul"` is repeated until it matches the length of other vectors.
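Recycling only works when the shorter vector fits a whole number of times into the longest one. The following sketch illustrates both cases; the names `df`, `id`, `group` and `result` are made up for this example:

```r
# A length-2 column is recycled twice to fill 4 rows
df <- data.frame(id = 1:4, group = c("a", "b"), stringsAsFactors = FALSE)
df$group

# Lengths 3 and 2 do not fit together, so data.frame() signals an error;
# tryCatch() captures the message instead of stopping the script
result <- tryCatch(
    data.frame(x = 1:3, y = 1:2),
    error = function(e) conditionMessage(e)
)
result
```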

The functions `rownames()` and `colnames()` can be used to change labels of rows and columns.

In [15]:
tempdf <- data.frame(c(1.70, 1.75,1.62),c(65, 66, 61))
tempdf

  c.1.7..1.75..1.62. c.65..66..61.
1               1.70            65
2               1.75            66
3               1.62            61

In [16]:
rownames(tempdf) <- c("Can","Cem","Hande")
colnames(tempdf) <- c("Height","Weight")
tempdf

      Height Weight
Can     1.70     65
Cem     1.75     66
Hande   1.62     61
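The labels can also be supplied when the data frame is created, which avoids the automatically generated column names. A sketch using named arguments and the `row.names` argument of `data.frame()` (the name `tempdf2` is illustrative):

```r
tempdf2 <- data.frame(
    Height = c(1.70, 1.75, 1.62),          # argument names become column names
    Weight = c(65, 66, 61),
    row.names = c("Can", "Cem", "Hande")   # row labels set at creation
)
tempdf2
```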

Accessing columns of data frames
====

A data frame is a **list of columns**; so we can access a column using the list notation we've seen before.

In [17]:
people

      Height Weight Member     City
Can     1.70     65   TRUE   Ankara
Cem     1.75     66  FALSE Istanbul
Hande   1.62     61   TRUE    Izmir

In [18]:
people[[1]]  # indexing with component number
people$Weight  # component name
people[["City"]]

[1] 1.70 1.75 1.62

[1] 65 66 61

[1] "Ankara"   "Istanbul" "Izmir"   
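The list nature of a data frame can be checked directly. The sketch below rebuilds `people` so that it runs on its own, and contrasts double brackets, which extract the column itself, with single brackets, which keep a data frame:

```r
people <- data.frame(
    Height = c(1.70, 1.75, 1.62),
    Weight = c(65, 66, 61),
    Member = c(TRUE, FALSE, TRUE),
    City   = c("Ankara", "Istanbul", "Izmir"),
    row.names = c("Can", "Cem", "Hande"),
    stringsAsFactors = FALSE
)

is.list(people)           # a data frame is a list
length(people)            # number of columns, not rows
class(people[["City"]])   # [[ ]] returns the column vector
class(people["City"])     # [ ] returns a one-column data frame
```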

# Accessing elements via matrix-like indexing
A data frame can be indexed as if it were a matrix, using the `[row, col]` notation.

In [20]:
people

      Height Weight Member     City
Can     1.70     65   TRUE   Ankara
Cem     1.75     66  FALSE Istanbul
Hande   1.62     61   TRUE    Izmir

In [21]:
people[,1]  # column 1
people[2,1] # row 2, column 1
people["Cem","Height"]

[1] 1.70 1.75 1.62

[1] 1.75

[1] 1.75
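Note that selecting a single column with `[row, col]` simplifies the result to a plain vector. When a data frame is needed instead, the `drop = FALSE` argument keeps the structure. A minimal sketch (variable names other than `people` are illustrative):

```r
people <- data.frame(
    Height = c(1.70, 1.75, 1.62),
    Weight = c(65, 66, 61),
    Member = c(TRUE, FALSE, TRUE),
    City   = c("Ankara", "Istanbul", "Izmir"),
    row.names = c("Can", "Cem", "Hande"),
    stringsAsFactors = FALSE
)

col_vec <- people[, 1]                # simplified to a numeric vector
col_df  <- people[, 1, drop = FALSE]  # stays a one-column data frame
row_df  <- people["Cem", ]            # a single row is still a data frame

class(col_vec)
class(col_df)
class(row_df)
```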

# Selecting rows using indices

We can specify a vector of indices to select rows.

In [23]:
people

      Height Weight Member     City
Can     1.70     65   TRUE   Ankara
Cem     1.75     66  FALSE Istanbul
Hande   1.62     61   TRUE    Izmir

In [24]:
people[c(1,3),]
people[c("Can","Hande"),]

      Height Weight Member   City
Can     1.70     65   TRUE Ankara
Hande   1.62     61   TRUE  Izmir

      Height Weight Member   City
Can     1.70     65   TRUE Ankara
Hande   1.62     61   TRUE  Izmir

As with vectors, a negative index indicates a row to be omitted.

In [25]:
people[-2,]

      Height Weight Member   City
Can     1.70     65   TRUE Ankara
Hande   1.62     61   TRUE  Izmir

Selecting some columns
====

We can provide a vector of column names or numeric indices to get a subframe.

In [26]:
people[, c("Member","City")]
people[, 3:4]

      Member     City
Can     TRUE   Ankara
Cem    FALSE Istanbul
Hande   TRUE    Izmir

      Member     City
Can     TRUE   Ankara
Cem    FALSE Istanbul
Hande   TRUE    Izmir

A subset of rows and a subset of columns:

In [27]:
people[c("Can","Cem"), 1:2]

    Height Weight
Can   1.70     65
Cem   1.75     66

Filtering data frames
==
The Boolean (logical) indexing used to select vector elements applies to data frames as well: a logical vector in the row position keeps the rows where it is `TRUE`.

In [28]:
people

      Height Weight Member     City
Can     1.70     65   TRUE   Ankara
Cem     1.75     66  FALSE Istanbul
Hande   1.62     61   TRUE    Izmir

In [29]:
people$Height >= 1.70

[1]  TRUE  TRUE FALSE

In [30]:
people[ people$Height>= 1.70, ]

    Height Weight Member     City
Can   1.70     65   TRUE   Ankara
Cem   1.75     66  FALSE Istanbul

In [31]:
people[ people$Member, ]

      Height Weight Member   City
Can     1.70     65   TRUE Ankara
Hande   1.62     61   TRUE  Izmir

In [32]:
people[ people$Member, c("Height","City")]

      Height   City
Can     1.70 Ankara
Hande   1.62  Izmir
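Conditions can be combined with `&` (and) and `|` (or). Base R also offers `subset()`, which lets the column names be written without the `people$` prefix. A sketch under the same data as above (the result names are illustrative):

```r
people <- data.frame(
    Height = c(1.70, 1.75, 1.62),
    Weight = c(65, 66, 61),
    Member = c(TRUE, FALSE, TRUE),
    City   = c("Ankara", "Istanbul", "Izmir"),
    row.names = c("Can", "Cem", "Hande"),
    stringsAsFactors = FALSE
)

# Gym members who are at least 1.70 m tall
tall_members <- people[people$Member & people$Height >= 1.70, ]

# The same filter with subset(); columns are referred to by bare name
same_rows <- subset(people, Member & Height >= 1.70)

# Filter rows and pick columns in one call
member_cities <- subset(people, Member, select = c(Height, City))
member_cities
```

`subset()` is convenient for interactive work; the bracket form is more explicit and is often preferred inside functions.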

Adding new rows
===
As with matrices, we can use `rbind()` to add a new row to an existing data frame. The new row is usually given as a list, because its entries have different types.

In [33]:
people

      Height Weight Member     City
Can     1.70     65   TRUE   Ankara
Cem     1.75     66  FALSE Istanbul
Hande   1.62     61   TRUE    Izmir

In [34]:
rbind(people, Lale=list(1.71, 64, FALSE, "Bursa"))

      Height Weight Member     City
Can     1.70     65   TRUE   Ankara
Cem     1.75     66  FALSE Istanbul
Hande   1.62     61   TRUE    Izmir
Lale    1.71     64  FALSE    Bursa

# Concatenate two data frames

In [35]:
newpeople <- data.frame(
    Weight=c(64, 50),
    Member=c(FALSE,TRUE),
    City=c("Bursa","Istanbul"),
    Height=c(Lale=1.71, Ziya=1.45)
)
newpeople

     Weight Member     City Height
Lale     64  FALSE    Bursa   1.71
Ziya     50   TRUE Istanbul   1.45

In [36]:
rbind(people, newpeople)

      Height Weight Member     City
Can     1.70     65   TRUE   Ankara
Cem     1.75     66  FALSE Istanbul
Hande   1.62     61   TRUE    Izmir
Lale    1.71     64  FALSE    Bursa
Ziya    1.45     50   TRUE Istanbul
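Two details of the example above are worth spelling out: `rbind()` matches the columns of data frames by name rather than by position, and it returns a new data frame instead of modifying `people`. The sketch below rebuilds both frames so that it runs on its own; the name `everyone` is introduced only for illustration:

```r
people <- data.frame(
    Height = c(1.70, 1.75, 1.62),
    Weight = c(65, 66, 61),
    Member = c(TRUE, FALSE, TRUE),
    City   = c("Ankara", "Istanbul", "Izmir"),
    row.names = c("Can", "Cem", "Hande"),
    stringsAsFactors = FALSE
)
# Same columns as people, but in a different order
newpeople <- data.frame(
    Weight = c(64, 50),
    Member = c(FALSE, TRUE),
    City   = c("Bursa", "Istanbul"),
    Height = c(1.71, 1.45),
    row.names = c("Lale", "Ziya"),
    stringsAsFactors = FALSE
)

everyone <- rbind(people, newpeople)  # store the result to keep it
names(everyone)                       # column order follows the first frame
everyone["Ziya", "Height"]            # values landed in the right column
nrow(people)                          # the original is unchanged
```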

Adding new columns
===

Suppose we want to add a column for BMI, which we calculate using the existing columns. We can do this using `cbind()` as follows.

In [37]:
people_bmi <- cbind(people, people$Weight/people$Height^2)
people_bmi

      Height Weight Member     City people$Weight/people$Height^2
Can     1.70     65   TRUE   Ankara                      22.49135
Cem     1.75     66  FALSE Istanbul                      21.55102
Hande   1.62     61   TRUE    Izmir                      23.24341

Note that the name of the new column is generated automatically from the expression. We can change this using the `names()` or `colnames()` functions.

In [38]:
names(people_bmi)[5] <- "BMI"
people_bmi

      Height Weight Member     City      BMI
Can     1.70     65   TRUE   Ankara 22.49135
Cem     1.75     66  FALSE Istanbul 21.55102
Hande   1.62     61   TRUE    Izmir 23.24341

A more direct way:

In [39]:
people2 <- people
people2

      Height Weight Member     City
Can     1.70     65   TRUE   Ankara
Cem     1.75     66  FALSE Istanbul
Hande   1.62     61   TRUE    Izmir

In [40]:
people2$BMI <- people2$Weight/people2$Height^2
people2

      Height Weight Member     City      BMI
Can     1.70     65   TRUE   Ankara 22.49135
Cem     1.75     66  FALSE Istanbul 21.55102
Hande   1.62     61   TRUE    Izmir 23.24341
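Another option in base R is `transform()`, which returns a copy of the data frame with the new column added and lets the existing columns be referenced by bare name. A minimal sketch (the name `people3` is illustrative):

```r
people <- data.frame(
    Height = c(1.70, 1.75, 1.62),
    Weight = c(65, 66, 61),
    Member = c(TRUE, FALSE, TRUE),
    City   = c("Ankara", "Istanbul", "Izmir"),
    row.names = c("Can", "Cem", "Hande"),
    stringsAsFactors = FALSE
)

# Weight and Height are looked up inside people
people3 <- transform(people, BMI = Weight / Height^2)
people3
```

Like `cbind()`, this leaves `people` untouched, so the result has to be assigned to keep it.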

We can create a new column as we please. For example, assigning a single `NA` fills the whole column with `NA`, because the value is recycled.

In [43]:
people2$obese <- NA
people2

      Height Weight Member     City      BMI obese
Can     1.70     65   TRUE   Ankara 22.49135    NA
Cem     1.75     66  FALSE Istanbul 21.55102    NA
Hande   1.62     61   TRUE    Izmir 23.24341    NA

In [44]:
people2$obese <- ifelse(people2$BMI>30, TRUE, FALSE)
people2

      Height Weight Member     City      BMI obese
Can     1.70     65   TRUE   Ankara 22.49135 FALSE
Cem     1.75     66  FALSE Istanbul 21.55102 FALSE
Hande   1.62     61   TRUE    Izmir 23.24341 FALSE

Remove a column by setting it to `NULL`.

In [45]:
people2$obese <- NULL
people2

      Height Weight Member     City      BMI
Can     1.70     65   TRUE   Ankara 22.49135
Cem     1.75     66  FALSE Istanbul 21.55102
Hande   1.62     61   TRUE    Izmir 23.24341

Merging data frames
===
The `merge(x,y)` function is used to create a new data frame from existing frames `x` and `y`, by combining them along a common column.

In [46]:
df1 <- data.frame(Name=c("Can","Cem","Hande"), Phone=c(1234,4345,8492))
df2 <- data.frame(Age=c(25,27,26), Name=c("Cem","Hande","Can"))

In [47]:
df1
df2

   Name Phone
1   Can  1234
2   Cem  4345
3 Hande  8492

  Age  Name
1  25   Cem
2  27 Hande
3  26   Can

In [48]:
merge(df1,df2)

   Name Phone Age
1   Can  1234  26
2   Cem  4345  25
3 Hande  8492  27

* The `merge()` function automatically detects that the `Name` column is common in both, and merges the data on it. 
* The order of the names is different in the two frames, which is accounted for.
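By default, `merge()` joins on the column names that the two frames share. Naming the key explicitly with the `by` argument gives the same result here and makes the intent clearer. A sketch using the two frames above (`auto` and `explicit` are illustrative names):

```r
df1 <- data.frame(Name = c("Can", "Cem", "Hande"),
                  Phone = c(1234, 4345, 8492),
                  stringsAsFactors = FALSE)
df2 <- data.frame(Age = c(25, 27, 26),
                  Name = c("Cem", "Hande", "Can"),
                  stringsAsFactors = FALSE)

intersect(names(df1), names(df2))         # the shared column used by default

auto     <- merge(df1, df2)               # key detected automatically
explicit <- merge(df1, df2, by = "Name")  # key stated explicitly
identical(auto, explicit)
```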

The columns we want to merge over may have different names in the two frames. In that case we use the `by.x` and `by.y` arguments to `merge()`.

In [49]:
df2 <- data.frame(Age=c(25,27,26), first_name=c("Cem","Hande","Can"))
df1
df2

   Name Phone
1   Can  1234
2   Cem  4345
3 Hande  8492

  Age first_name
1  25        Cem
2  27      Hande
3  26        Can

In [50]:
merge(df1, df2, by.x="Name", by.y="first_name")

   Name Phone Age
1   Can  1234  26
2   Cem  4345  25
3 Hande  8492  27

Suppose we want to merge on row names.

In [51]:
people

      Height Weight Member     City
Can     1.70     65   TRUE   Ankara
Cem     1.75     66  FALSE Istanbul
Hande   1.62     61   TRUE    Izmir

In [52]:
phonebook <- data.frame(phone=c(Can=1234, Cem=4345, Lale=8492))
phonebook

     phone
Can   1234
Cem   4345
Lale  8492

Note that `phonebook` does not contain Hande, and `people` does not contain Lale.

To merge by row names, specify `"row.names"` for the `by.x` and `by.y` parameters.

In [53]:
merge(people, phonebook, by.x="row.names", by.y="row.names")

  Row.names Height Weight Member     City phone
1       Can   1.70     65   TRUE   Ankara  1234
2       Cem   1.75     66  FALSE Istanbul  4345

# Inner and outer joins

* The merged data frame does not include Hande or Lale, because each of them is missing from one of the two data frames.
* This is called an **inner join** operation.
* To get all the rows, with `NA` for the missing data, set `all=TRUE` (**outer join** operation).

In [54]:
merged_df <- merge(people,phonebook,by.x="row.names", by.y="row.names", all=TRUE)
merged_df

  Row.names Height Weight Member     City phone
1       Can   1.70     65   TRUE   Ankara  1234
2       Cem   1.75     66  FALSE Istanbul  4345
3     Hande   1.62     61   TRUE    Izmir    NA
4      Lale     NA     NA     NA     <NA>  8492
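Between these two extremes, `merge()` also accepts `all.x=TRUE` or `all.y=TRUE` to keep every row of only the first or only the second frame (often called left and right joins). A sketch with the same data, rebuilt here so it runs on its own (`left` and `right` are illustrative names):

```r
people <- data.frame(
    Height = c(1.70, 1.75, 1.62),
    Weight = c(65, 66, 61),
    Member = c(TRUE, FALSE, TRUE),
    City   = c("Ankara", "Istanbul", "Izmir"),
    row.names = c("Can", "Cem", "Hande"),
    stringsAsFactors = FALSE
)
phonebook <- data.frame(phone = c(1234, 4345, 8492),
                        row.names = c("Can", "Cem", "Lale"))

# Keep everyone in people; Hande gets NA for phone
left <- merge(people, phonebook, by = "row.names", all.x = TRUE)
# Keep everyone in phonebook; Lale gets NA for the people columns
right <- merge(people, phonebook, by = "row.names", all.y = TRUE)

as.character(left$Row.names)
as.character(right$Row.names)
```

Here `by = "row.names"` is shorthand for setting both `by.x` and `by.y` to `"row.names"`.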

To set the people's names as row names, assign them using the `rownames()` function, and remove the `"Row.names"` column afterwards.

In [55]:
rownames(merged_df) <- merged_df$Row.names
merged_df

      Row.names Height Weight Member     City phone
Can         Can   1.70     65   TRUE   Ankara  1234
Cem         Cem   1.75     66  FALSE Istanbul  4345
Hande     Hande   1.62     61   TRUE    Izmir    NA
Lale       Lale     NA     NA     NA     <NA>  8492

In [56]:
merged_df$Row.names <- NULL
merged_df

      Height Weight Member     City phone
Can     1.70     65   TRUE   Ankara  1234
Cem     1.75     66  FALSE Istanbul  4345
Hande   1.62     61   TRUE    Izmir    NA
Lale      NA     NA     NA     <NA>  8492

Applications
===

# Analyze the grades in a class

In [57]:
grades <- data.frame(
    student = c("Can","Cem","Hande","Lale","Ziya"),
    midterm1 = c(45, 74, 67, 52, 31),
    midterm2 = c(68, 83, 56, 22, 50),
    final = c(59, 91, 62, 49, 65))
grades

  student midterm1 midterm2 final
1     Can       45       68    59
2     Cem       74       83    91
3   Hande       67       56    62
4    Lale       52       22    49
5    Ziya       31       50    65

Compute the weighted average, with 30% for each midterm and 40% for the final:

In [58]:
grades$score <- grades$midterm1*0.3 + grades$midterm2*0.3 + grades$final*0.4
grades

  student midterm1 midterm2 final score
1     Can       45       68    59  57.5
2     Cem       74       83    91  83.5
3   Hande       67       56    62  61.7
4    Lale       52       22    49  41.8
5    Ziya       31       50    65  50.3

Compute the average of each column. The student names are not numeric, so we first drop column 1 with `grades[-1]`:

In [64]:
grades[-1]

  midterm1 midterm2 final score
1       45       68    59  57.5
2       74       83    91  83.5
3       67       56    62  61.7
4       52       22    49  41.8
5       31       50    65  50.3

In [68]:
apply(grades[-1],2,mean)

midterm1 midterm2    final    score 
   53.80    55.80    65.20    58.96 

In [66]:
sapply(grades[-1],mean)

midterm1 midterm2    final    score 
   53.80    55.80    65.20    58.96 

In [69]:
lapply(grades[-1],mean)

$midterm1
[1] 53.8

$midterm2
[1] 55.8

$final
[1] 65.2

$score
[1] 58.96
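The three calls above give the same numbers but differ in what they return: `sapply()` simplifies to a named vector, `lapply()` always returns a list, and `apply()` first converts the data frame to a matrix. For column means specifically, base R also provides `colMeans()`. A short sketch on the exam columns only (the names `m_s`, `m_l` and `m_c` are illustrative):

```r
grades <- data.frame(
    midterm1 = c(45, 74, 67, 52, 31),
    midterm2 = c(68, 83, 56, 22, 50),
    final    = c(59, 91, 62, 49, 65)
)

m_s <- sapply(grades, mean)   # named numeric vector
m_l <- lapply(grades, mean)   # list with one element per column
m_c <- colMeans(grades)       # dedicated function for column means

class(m_s)
class(m_l)
all.equal(m_s, m_c)
```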


Assign letter grades

In [71]:
lettergrade <- function(score){
    # score must be a single number: if/else does not work elementwise on vectors
    if (score > 80) "A" else if (score > 70) "B" else if (score > 60) "C" else if (score > 50) "D" else "F"
}

In [72]:
grades$score
sapply(grades$score,lettergrade)

[1] 57.5 83.5 61.7 41.8 50.3

[1] "D" "A" "C" "F" "D"

In [73]:
grades$letter <- sapply(grades$score, lettergrade)
grades

  student midterm1 midterm2 final score letter
1     Can       45       68    59  57.5      D
2     Cem       74       83    91  83.5      A
3   Hande       67       56    62  61.7      C
4    Lale       52       22    49  41.8      F
5    Ziya       31       50    65  50.3      D
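The `sapply()` call is needed because `if`/`else` handles one value at a time. A vectorized alternative, shown here only as a sketch, is `cut()`, which bins all scores in one call. With its default right-closed intervals, the breaks below reproduce the same thresholds as `lettergrade()` (the names `score` and `letter` are illustrative):

```r
score <- c(57.5, 83.5, 61.7, 41.8, 50.3)

# Intervals: (-Inf,50] F, (50,60] D, (60,70] C, (70,80] B, (80,Inf) A
letter <- cut(score,
              breaks = c(-Inf, 50, 60, 70, 80, Inf),
              labels = c("F", "D", "C", "B", "A"))

as.character(letter)  # cut() returns a factor; convert to plain strings
```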

# Grading multiple-choice exams
Our students have taken a multiple-choice exam. The answer key is recorded as a vector, and each student's answers form one row of a matrix.

In [75]:
key <- c("A","B","C","D","A")
answers <- rbind(
    c("A", "B", "D", "A", "B"),
    c("A", "D", "C", "D", "A"),
    c("B", "B", "C", "D", "B"),
    c("A", "B", "C", "D", "D"),
    c("C", "C", "C", "D", "A")
)

We convert the answers to a data frame, using the student names as row names:

In [76]:
exam <- data.frame(answers,row.names = c("Can","Cem","Hande","Lale","Ziya"))
exam

      X1 X2 X3 X4 X5
Can    A  B  D  A  B
Cem    A  D  C  D  A
Hande  B  B  C  D  B
Lale   A  B  C  D  D
Ziya   C  C  C  D  A

Now we can process this data frame to get the number of correct answers for each student. For that, we can use the `sum(x==y)` operation, which gives us the number of equal elements.

In [77]:
key
exam[1,]
exam[1,]==key
sum(exam[1,]==key)

[1] "A" "B" "C" "D" "A"

    X1 X2 X3 X4 X5
Can  A  B  D  A  B

      X1   X2    X3    X4    X5
Can TRUE TRUE FALSE FALSE FALSE

[1] 2

To repeat this for each row, we create a function that returns the number of matching answers.

In [78]:
ncorrect <- function(x){
    sum(x==key)
}

In [79]:
ncorrect(exam[1,])

[1] 2

And we use `apply()` to apply it to every row.

In [80]:
apply(exam,1,ncorrect)

  Can   Cem Hande  Lale  Ziya 
    2     4     3     4     3 

We can store this result by creating a new column in the data frame.

In [81]:
exam$correct <- apply(exam,1,ncorrect)
exam

      X1 X2 X3 X4 X5 correct
Can    A  B  D  A  B       2
Cem    A  D  C  D  A       4
Hande  B  B  C  D  B       3
Lale   A  B  C  D  D       4
Ziya   C  C  C  D  A       3
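The same counts can be obtained without `apply()` by working on the answer matrix directly. One pitfall: `answers == key` would recycle `key` down the columns, that is, across students instead of across questions. Transposing first lines the key up with the questions. A sketch, assuming the same key and answers as above (the name `correct` is illustrative):

```r
key <- c("A", "B", "C", "D", "A")
answers <- rbind(
    Can   = c("A", "B", "D", "A", "B"),
    Cem   = c("A", "D", "C", "D", "A"),
    Hande = c("B", "B", "C", "D", "B"),
    Lale  = c("A", "B", "C", "D", "D"),
    Ziya  = c("C", "C", "C", "D", "A")
)

# t(answers) has one column per student, so key is compared
# question by question within each column
correct <- colSums(t(answers) == key)
correct
```

Computing the counts from the matrix also avoids a side effect of the data frame version: once the `correct` column has been added to `exam`, running `apply(exam,1,ncorrect)` again would compare that column against the key as well.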

# Item database
Suppose you run a retail store and you keep a database of your items, with the unit price and the VAT rate for each item, such as the following.

In [82]:
items <- data.frame(
    row.names = c("Milk","Meat","Toothpaste","Pencil","Detergent"),
    vat = c(0.05, 0.04, 0.05, 0.06, 0.03),
    unitprice = c(10, 20, 5, 1, 4)
)
items

            vat unitprice
Milk       0.05        10
Meat       0.04        20
Toothpaste 0.05         5
Pencil     0.06         1
Detergent  0.03         4

You get some orders for some items, which your automated system stores with an order ID:

In [83]:
orders <- data.frame(
    row.names = c("1234","5761","1832"),
    item = c("Milk","Meat","Toothpaste"),
    amount = c(3,1,2))
orders

           item amount
1234       Milk      3
5761       Meat      1
1832 Toothpaste      2

Our task is to add a new column that holds the total payment for each order, including the VAT. The result should look like this:

      item       amount vat  unitprice total
    1 Meat       1      0.04 20        20.8 
    2 Milk       3      0.05 10        31.5 
    3 Toothpaste 2      0.05  5        10.5

* Merge the orders and items with an inner join. 
* Store the result in a new data frame.

In [84]:
orders2 <- merge(orders,items,by.x="item",by.y="row.names")
orders2

        item amount  vat unitprice
1       Meat      1 0.04        20
2       Milk      3 0.05        10
3 Toothpaste      2 0.05         5

Now that we have the unit price and the VAT information in the same data frame, we can calculate the total to pay and store it in a new column.

In [85]:
orders2$total <- (orders2$amount*orders2$unitprice)*(1+orders2$vat)
orders2

        item amount  vat unitprice total
1       Meat      1 0.04        20  20.8
2       Milk      3 0.05        10  31.5
3 Toothpaste      2 0.05         5  10.5
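Note that `merge()` sorted the result by item and dropped the order IDs that were stored as row names. When those need to be kept, an alternative is to look up the price and VAT by indexing `items` with the item names. This is a sketch of that approach, not a replacement for the merge shown above (`price` and `vat` are illustrative names):

```r
items <- data.frame(
    vat = c(0.05, 0.04, 0.05, 0.06, 0.03),
    unitprice = c(10, 20, 5, 1, 4),
    row.names = c("Milk", "Meat", "Toothpaste", "Pencil", "Detergent")
)
orders <- data.frame(
    item = c("Milk", "Meat", "Toothpaste"),
    amount = c(3, 1, 2),
    row.names = c("1234", "5761", "1832"),
    stringsAsFactors = FALSE
)

# Row names of items act as lookup keys, one lookup per order
price <- items[orders$item, "unitprice"]
vat   <- items[orders$item, "vat"]

orders$total <- orders$amount * price * (1 + vat)
orders
```

The totals match the merged version, and the rows stay in their original order with the order IDs intact.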