In [None]:
options(jupyter.rich_display = FALSE)

# Data Frames

**Data frames** are used for representing tabular data where each column can have a different type, such as


|Name | Height| Weight | Gym member? | City|
|-----|----|----|----|---|
|Cem | 1.75 | 66 |T | Istanbul|
|Can | 1.70 | 65 | F | Ankara|
|Hande | 1.62 | 61| T | Izmir|

* _Lists_ are heterogeneous analogs of _vectors_.
* _Data frames_ are heterogeneous analogs of _matrices_.
* Internally, a data frame is a _list_ of equal-length _vectors_.
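
The last point can be checked directly. The following is a small sketch; the data frame `demo` and its columns are made up for this illustration only.

```r
# A hypothetical two-column data frame, used only for this illustration
demo <- data.frame(x = c(1.5, 2.5, 3.5), flag = c(TRUE, FALSE, TRUE))

is.list(demo)         # a data frame is a list ...
length(demo)          # ... whose components are the columns
sapply(demo, length)  # every column has the same length
sapply(demo, class)   # but each column keeps its own type
```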

# Why not use a matrix?

Earlier we saw how to store data in vectors, for example:

In [None]:
heights <- c(Can=1.70, Cem=1.75, Hande=1.62)
weights <- c(Can=65, Cem=66, Hande=61)

If we want to have this data combined in a table, we can generate a matrix out of it:

In [None]:
height_weight <- cbind(
    c(1.70, 1.75,1.62),
    c(65, 66, 61)
)
rownames(height_weight) <- c("Can","Cem","Hande")
colnames(height_weight) <- c("Height","Weight")
height_weight

Alternatively, if the data are already stored in vectors:

In [None]:
height_weight <- cbind(heights, weights)
colnames(height_weight) <- c("Height","Weight")
height_weight

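Since all entries are numeric, we can compute with them. For example, get the body mass index (BMI, weight divided by height squared) of "Can".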

In [None]:
height_weight["Can","Weight"]/height_weight["Can","Height"]^2

Trouble arises when we want to store the Boolean gym membership data in this matrix as well.

In [None]:
member <- c(Can=FALSE, Cem=TRUE, Hande=TRUE)  # names in the same order as heights and weights
height_weight <- cbind(heights, weights,member)
colnames(height_weight) <- c("Height","Weight","Gym member")
height_weight

* The last column has numeric values 0 or 1, instead of `TRUE` or `FALSE`.
* Reason: All elements in a matrix must have _the same mode_ (numeric here).
* If elements of a different mode are added (Boolean here), all elements are _coerced_ to a common type (numeric here).
* `TRUE` becomes 1, `FALSE` becomes 0.
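
The coercion can be seen in isolation with a tiny matrix. This is only a sketch; the name `m` is introduced here for illustration.

```r
# Combine a numeric column with a logical column
m <- cbind(c(1.70, 1.75), c(TRUE, FALSE))

mode(m)            # the whole matrix has a single mode
m[, 2]             # the logical values are now stored as numbers
as.numeric(TRUE)   # the same conversion, done explicitly
as.numeric(FALSE)
```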

Suppose we also want to add the city data.

In [None]:
city <- c(Can="Ankara", Cem="Istanbul", Hande="Izmir")  # same name order as heights, so each city lands in the right row

height_weight <- cbind(heights, weights,member,city)
colnames(height_weight) <- c("Height","Weight","Gym member","City")

print(height_weight)

All entries are now coerced to strings. The data is still there, but we cannot perform computations anymore.

In [None]:
height_weight["Can","Weight"]/height_weight["Can","Height"]^2
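
The division above fails because both operands are character strings. A possible workaround, sketched below on a small stand-alone matrix (the name `hw` is an assumption for this example), is to convert each entry back with `as.numeric()` before computing. It works, but it has to be repeated for every computation.

```r
# A small character matrix built the same way as above
hw <- cbind(Height = c(Can = 1.70, Cem = 1.75),
            Weight = c(Can = 65, Cem = 66),
            City   = c(Can = "Ankara", Cem = "Istanbul"))
mode(hw)   # everything was coerced to character

# Convert the strings back to numbers before doing arithmetic
bmi_can <- as.numeric(hw["Can", "Weight"]) / as.numeric(hw["Can", "Height"])^2
bmi_can
```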

* Keep the data in separate vectors?
* There would be no coercion, but data manipulation would be difficult.
* Selecting subsets, adding/removing entries, would require several operations and great care.
* A _data frame_ that combines several vectors as data columns provides convenience.

Creating data frames
====
Several vectors can be combined into a data frame using the `data.frame()` function.

In [None]:
people <- data.frame(Height=heights, Weight=weights, Member=member, City=city, stringsAsFactors = FALSE)
people
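
Unlike the matrix, the data frame keeps the type of each column. The sketch below restates the vectors, with the values from the table at the top, so that it runs on its own.

```r
heights <- c(Can = 1.70, Cem = 1.75, Hande = 1.62)
weights <- c(Can = 65, Cem = 66, Hande = 61)
member  <- c(Can = FALSE, Cem = TRUE, Hande = TRUE)
city    <- c(Can = "Ankara", Cem = "Istanbul", Hande = "Izmir")

people <- data.frame(Height = heights, Weight = weights, Member = member,
                     City = city, stringsAsFactors = FALSE)

sapply(people, class)  # one class per column, no coercion across columns
str(people)            # compact overview of the structure

# Arithmetic on the numeric columns still works
people["Can", "Weight"] / people["Can", "Height"]^2
```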

**Recycling** applies to data frames as well. Suppose we add the `"City"` data and make it `"Istanbul"` for all:

In [None]:
data.frame(Height=heights, Weight=weights, City="Istanbul")

Here, the element `"Istanbul"` is repeated until it matches the length of other vectors.
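
Recycling in `data.frame()` only works when the shorter vector fits a whole number of times into the longer one. A minimal sketch with made-up columns `id` and `group`:

```r
# Length 2 fits twice into length 4: the shorter column is recycled
ok <- data.frame(id = 1:4, group = c("a", "b"), stringsAsFactors = FALSE)
ok$group

# Length 2 does not fit into length 3: data.frame() signals an error,
# which we catch here so that the example keeps running
bad <- tryCatch(
    data.frame(id = 1:3, group = c("a", "b")),
    error = function(e) e
)
inherits(bad, "error")
```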

The functions `rownames()` and `colnames()` can be used to change labels of rows and columns.

In [None]:
tempdf <- data.frame(h=c(1.70, 1.75,1.62),w=c(65, 66, 61))
tempdf

In [None]:
rownames(tempdf) <- c("Can","Cem","Hande")
colnames(tempdf) <- c("Height","Weight")
tempdf

Accessing columns of data frames
====

A data frame is a **list of columns**; so we can access a column using the list notation we've seen before.

In [None]:
people

In [None]:
people[[1]]  # indexing with component number
people$Weight  # component name
people[["City"]]
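
As with lists, single and double brackets behave differently. The sketch below uses a small stand-alone data frame (`small` is a name chosen for this example) to show the difference.

```r
small <- data.frame(Height = c(1.70, 1.75), Weight = c(65, 66))

class(small["Weight"])    # single brackets: a one-column data frame
class(small[["Weight"]])  # double brackets: the column vector itself
identical(small[["Weight"]], small$Weight)  # $ and [[ ]] return the same vector
```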

# Accessing elements via matrix-like indexing
A data frame can be indexed as if it is a matrix, using the `[row, col]` notation.

In [None]:
people

In [None]:
people[,1]  # column 1
people[2,1] # row 2, column 1
people["Cem","Height"]
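
Selecting a single column with matrix-like indexing returns a plain vector. If a data frame is needed instead, the `drop = FALSE` argument can be used. A small sketch on a stand-alone data frame (the name `small` is an assumption for this example):

```r
small <- data.frame(Height = c(1.70, 1.75), Weight = c(65, 66),
                    row.names = c("Can", "Cem"))

small[, 1]                # a numeric vector
small[, 1, drop = FALSE]  # a data frame with one column
small[1, ]                # a single row is still a data frame
```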

# Selecting rows using indices

We can specify a vector of indices to select rows.

In [None]:
people

In [None]:
people[c(1,3),]
people[c("Can","Hande"),]

A negative index, again, indicates a row that is to be omitted.

In [None]:
people[-2,]

Selecting some columns
====

We can provide a vector of column names or numeric indices to get a subframe.

In [None]:
people[, c("Member","City")]
people[, 3:4]

A subset of rows and a subset of columns:

In [None]:
people[c("Can","Cem"), 1:2]

Filtering data frames
==
The logical conditions we used to select vector elements are applicable to data frames as well.

In [None]:
people

In [None]:
people$Height >= 1.70

In [None]:
people[ people$Height>= 1.70, ]

In [None]:
people[ people$Member, ]

In [None]:
people[ people$Member, c("Height","City")]
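
Conditions can be combined with `&` and `|`. Base R also provides `subset()`, which lets us refer to the columns without the `people$` prefix. The sketch below rebuilds a reduced version of `people` so that it runs on its own; the names `tall_members` and `same` are chosen for this example.

```r
people <- data.frame(Height = c(1.70, 1.75, 1.62),
                     Member = c(FALSE, TRUE, TRUE),
                     City   = c("Ankara", "Istanbul", "Izmir"),
                     row.names = c("Can", "Cem", "Hande"),
                     stringsAsFactors = FALSE)

# Gym members who are at least 1.70 tall, with logical indexing
tall_members <- people[people$Height >= 1.70 & people$Member, ]
tall_members

# The same selection written with subset()
same <- subset(people, Height >= 1.70 & Member)
same
```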

Adding new rows
===
As with matrices, we can use `rbind()` to add a new row to an existing data frame. The new row is usually in the form of a list.

In [None]:
people
rbind(people, Lale=list(1.71, 64, FALSE, "Bursa"))

# Concatenate two data frames

In [None]:
newpeople <- data.frame(
    Weight=c(64, 50),
    Member=c(FALSE,TRUE),
    City=c("Bursa","Istanbul"),
    Height=c(Lale=1.71, Ziya=1.45)
)
newpeople

In [None]:
rbind(people, newpeople)
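
Note that the columns of `newpeople` are in a different order than those of `people`. When combining data frames, `rbind()` matches columns by name, not by position. A minimal sketch with two made-up frames `a` and `b`:

```r
a <- data.frame(x = 1, y = "a", stringsAsFactors = FALSE)
b <- data.frame(y = "b", x = 2, stringsAsFactors = FALSE)  # columns swapped

ab <- rbind(a, b)
ab   # columns follow the order of the first frame

# If the column names differ, rbind() signals an error (caught here)
mismatch <- tryCatch(
    rbind(a, data.frame(x = 3, z = "c", stringsAsFactors = FALSE)),
    error = function(e) e
)
inherits(mismatch, "error")
```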

Adding new columns
===

Suppose we want to add a column for BMI, which we calculate using the existing columns. We can do this using `cbind()` as follows.

In [None]:
people_bmi <- cbind(people, people$Weight/people$Height^2)
people_bmi

Note that the name of the new column is automatically set. We can change this using the `names()` or `colnames()` functions.

In [None]:
names(people_bmi)[5] <- "BMI"
people_bmi

A more direct way:

In [None]:
people2 <- people
people2

In [None]:
people2$BMI <- people2$Weight/people2$Height^2
people2

We can create new columns as we please. For example, a column filled with `NA` (the single value is recycled to every row).

In [None]:
people2$obese <- NA
people2

In [None]:
people2$obese <- people2$BMI > 30  # the comparison already yields TRUE/FALSE
people2

Remove a column by setting it to `NULL`.

In [None]:
people2$obese <- NULL
people2
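
Another option from base R is `transform()`, which returns a modified copy and leaves the original data frame untouched. The sketch below uses a reduced, stand-alone version of `people`; the name `with_bmi` is chosen for this example.

```r
people <- data.frame(Height = c(1.70, 1.75, 1.62),
                     Weight = c(65, 66, 61),
                     row.names = c("Can", "Cem", "Hande"))

# Columns can be referred to by name inside transform()
with_bmi <- transform(people, BMI = Weight / Height^2)
with_bmi

names(people)  # the original frame has no BMI column
```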

Merging data frames
===
The `merge(x,y)` function is used to create a new data frame from existing frames `x` and `y`, by combining them along a common column.

In [None]:
df1 <- data.frame(Name=c("Can","Cem","Hande"), Phone=c(1234,4345,8492))
df2 <- data.frame(Age=c(25,27,26), Name=c("Cem","Hande","Can"))

In [None]:
df1
df2
merge(df1,df2)

* The `merge()` function automatically detects that the `Name` column is common in both, and merges the data on it. 
* The order of names is different in the two frames, which is accounted for.

The columns we want to merge over may have different names in the two frames. In that case we use the `by.x` and `by.y` arguments to `merge()`.

In [None]:
df2 <- data.frame(Age=c(25,27,26), first_name=c("Cem","Hande","Can"))
df1
df2

In [None]:
merge(df1, df2, by.x="Name", by.y="first_name")

Suppose we want to merge on row names; e.g. gym membership and phone number data.

In [None]:
people
phonebook <- data.frame(phone=c(Can=1234, Cem=4345, Lale=8492))
phonebook

Note that `phonebook` does not contain Hande, and `people` does not contain Lale.

To merge by row names, specify `"row.names"` for the `by.x` and `by.y` parameters.

In [None]:
merge(people,phonebook,by.x="row.names", by.y="row.names")

# Inner and outer joins

* The merged dataframe does not include Hande or Lale, because they are missing in one or the other data frame.
* This is called an **inner join** operation.
* To get all the rows, with some data missing, set `all=TRUE` (**outer join** operation).

In [None]:
merged_df <- merge(people,phonebook,by.x="row.names", by.y="row.names", all=TRUE)
merged_df

To set the people's names as row names, assign them using the `rownames()` function, and remove the `"Row.names"` column afterwards.

In [None]:
rownames(merged_df) <- merged_df$Row.names
merged_df$Row.names <- NULL
merged_df
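
Between the inner and the full outer join, `merge()` also supports keeping all rows of only one of the frames, using the `all.x` and `all.y` arguments (left and right joins). A sketch on reduced, stand-alone versions of `people` and `phonebook`; the names `left` and `right` are chosen for this example.

```r
people <- data.frame(Member = c(FALSE, TRUE, TRUE),
                     row.names = c("Can", "Cem", "Hande"))
phonebook <- data.frame(phone = c(1234, 4345, 8492),
                        row.names = c("Can", "Cem", "Lale"))

# Keep every row of the first frame; Hande gets NA for phone
left <- merge(people, phonebook, by = "row.names", all.x = TRUE)
left

# Keep every row of the second frame; Lale gets NA for Member
right <- merge(people, phonebook, by = "row.names", all.y = TRUE)
right
```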

Applications
===

# Analyze the grades in a class

In [None]:
grades <- data.frame(
    student = c("Can","Cem","Hande","Lale","Ziya"),
    midterm1 = c(45, 74, 67, 52, 31),
    midterm2 = c(68, 83, 56, 22, 50),
    final = c(59, 91, 62, 49, 65),
    stringsAsFactors = FALSE)
grades

Get weighted average

In [None]:
grades$score <- grades$midterm1*0.3 + grades$midterm2*0.3 + grades$final*0.4
grades

Get averages of columns

In [None]:
apply(grades[-1],2,mean)

In [None]:
sapply(grades[-1],mean)

In [None]:
lapply(grades[-1],mean)
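
The three calls above compute the same numbers but return them in different shapes: `apply()` and `sapply()` give a named numeric vector, while `lapply()` gives a list. For column means specifically, base R also has `colMeans()`. A stand-alone sketch (the `means_*` names are chosen for this example):

```r
grades <- data.frame(
    student = c("Can", "Cem", "Hande", "Lale", "Ziya"),
    midterm1 = c(45, 74, 67, 52, 31),
    midterm2 = c(68, 83, 56, 22, 50),
    final = c(59, 91, 62, 49, 65),
    stringsAsFactors = FALSE)

means_vec  <- sapply(grades[-1], mean)  # named numeric vector
means_list <- lapply(grades[-1], mean)  # list with one element per column
means_col  <- colMeans(grades[-1])      # named numeric vector

is.list(means_list)
all.equal(means_vec, means_col)
```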

Assign letter grades

In [None]:
lettergrade <- function(score){
    if (score > 80) "A"
    else if (score > 70) "B"
    else if (score > 60) "C"
    else if (score > 50) "D"
    else "F"
}
sapply(grades$score,lettergrade)

In [None]:
grades$letter <- sapply(grades$score, lettergrade)
grades
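
Because `lettergrade()` handles one score at a time, we need `sapply()` to apply it to a whole column. A vectorized alternative, sketched below, is the base R function `cut()`, which assigns each value to an interval. The break points are chosen to reproduce the thresholds of `lettergrade()`; the vector `scores` and the name `letters_cut` are made up for this example.

```r
lettergrade <- function(score){
    if (score > 80) "A"
    else if (score > 70) "B"
    else if (score > 60) "C"
    else if (score > 50) "D"
    else "F"
}

# Hypothetical scores, including values exactly on the thresholds
scores <- c(45, 50, 50.5, 60, 70, 80, 80.1, 95)

# Intervals are closed on the right by default: (50,60] maps to "D", etc.
letters_cut <- as.character(cut(
    scores,
    breaks = c(-Inf, 50, 60, 70, 80, Inf),
    labels = c("F", "D", "C", "B", "A")
))
letters_cut

# Same result as applying lettergrade() element by element
identical(letters_cut, sapply(scores, lettergrade))
```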

# Grading multiple-choice exams
Our students have taken a multiple-choice exam. All their answers, as well as the answer key, are recorded as vectors.

In [None]:
key <- c("A","B","C","D","A")
answers <- rbind(
    c("A", "B", "D", "A", "B"),
    c("A", "D", "C", "D", "A"),
    c("B", "B", "C", "D", "B"),
    c("A", "B", "C", "D", "D"),
    c("C", "C", "C", "D", "A")
)

We initialize a separate data frame with the student information:

In [None]:
exam <- data.frame(answers,row.names = c("Can","Cem","Hande","Lale","Ziya"))
exam

Now we can process this data frame to get the number of correct answers for each student. For that, we can use the `sum(x==y)` operation, which gives us the number of equal elements.

In [None]:
key
exam[1,]
exam[1,]==key
sum(exam[1,]==key)

To repeat this for each row, we create a function that returns the number of matching answers.

In [None]:
ncorrect <- function(x){
    sum(x==key)
}
ncorrect(exam[1,])

And we use `apply()` to apply it to every row.

In [None]:
apply(exam,1,ncorrect)

We can store this result by creating a new column in the data frame.

In [None]:
exam$correct <- apply(exam,1,ncorrect)
exam
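
The same counts can also be obtained without `apply()`, by comparing the whole answer matrix with a matrix that repeats the key in every row. This is only a sketch of an alternative; `key_matrix` and `ncorrect_all` are names introduced here. It works on the answers alone, so it does not depend on which extra columns `exam` carries.

```r
key <- c("A", "B", "C", "D", "A")
answers <- rbind(
    c("A", "B", "D", "A", "B"),
    c("A", "D", "C", "D", "A"),
    c("B", "B", "C", "D", "B"),
    c("A", "B", "C", "D", "D"),
    c("C", "C", "C", "D", "A")
)
rownames(answers) <- c("Can", "Cem", "Hande", "Lale", "Ziya")

# Repeat the key in every row so it lines up with the answers
key_matrix <- matrix(key, nrow = nrow(answers), ncol = length(key), byrow = TRUE)

# Element-wise comparison, then count the TRUE values in each row
ncorrect_all <- rowSums(answers == key_matrix)
ncorrect_all
```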

# Item database
Suppose you run a retail store and you keep a database of your items, with the unit price and the VAT rate for each item, such as the following.

In [None]:
items <- data.frame(
    row.names = c("Milk","Meat","Toothpaste","Pencil","Detergent"),
    vat = c(0.05, 0.04, 0.05, 0.06, 0.03),
    unitprice = c(10, 20, 5, 1, 4)
)
items

You get some orders for some items, which your automated system stores with an order ID:

In [None]:
orders <- data.frame(
    row.names = c("1234","5761","1832"),
    item = c("Milk","Meat","Toothpaste"),
    amount = c(3,1,2))
orders

Our task is to add a new column to the `orders` data frame that holds the total payment for each order, including the VAT.

      item       amount vat  unitprice total
    1 Meat       1      0.04 20        20.8 
    2 Milk       3      0.05 10        31.5 
    3 Toothpaste 2      0.05  5        10.5

* Merge the orders and items. 
* Make an inner join. 
* Store the result in a new data frame.

In [None]:
orders2 <- merge(orders,items,by.x="item",by.y="row.names")
orders2

Now that we have the unit price and the VAT information on the same data frame, we can calculate the total to pay and store it in a new column.

In [None]:
orders2$total <- (orders2$amount*orders2$unitprice)*(1+orders2$vat)
orders2
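
Note that `merge()` sorts the result by item and drops the order IDs that were stored as row names. If the IDs and the original order should be kept, one possible alternative is to look up each ordered item in `items` by row name. The sketch below is stand-alone; `looked_up` is a name introduced for this example.

```r
items <- data.frame(
    row.names = c("Milk", "Meat", "Toothpaste", "Pencil", "Detergent"),
    vat = c(0.05, 0.04, 0.05, 0.06, 0.03),
    unitprice = c(10, 20, 5, 1, 4)
)
orders <- data.frame(
    row.names = c("1234", "5761", "1832"),
    item = c("Milk", "Meat", "Toothpaste"),
    amount = c(3, 1, 2),
    stringsAsFactors = FALSE)

# Pick the matching row of items for every order, in order of the orders
looked_up <- items[as.character(orders$item), ]

# Total payment including VAT, stored directly in orders
orders$total <- orders$amount * looked_up$unitprice * (1 + looked_up$vat)
orders
```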