In [None]:
options(jupyter.rich_display = FALSE)

_Data frames_ represent tabular data in which each column can have a different type, such as

|Name | Height| Weight | Gym member? | City|
|-----|----|----|----|----|
|Cem | 1.75 | 66 |T | Istanbul|
|Can | 1.70 | 65 | F | Ankara|
|Hande | 1.62 | 61| T | Izmir|

* Lists: heterogeneous analogs of vectors.
* Data frames: heterogeneous analogs of matrices.

A data frame is actually a _list_ of equal-length _vectors_.
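The list nature of a data frame can be checked directly. The following is a minimal sketch; the data frame `df` and its column names are made up for illustration.

```r
# Hypothetical three-row data frame with columns of different types
df <- data.frame(name   = c("Cem", "Can", "Hande"),
                 height = c(1.75, 1.70, 1.62),
                 member = c(TRUE, FALSE, TRUE))

is.list(df)         # a data frame is a list
length(df)          # one list component per column
sapply(df, length)  # every component has the same length
```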

Why use data frames?
===
Earlier we saw how to store data in vectors, for example:

In [None]:
heights <- c(Can=1.70, Cem=1.75, Hande=1.62)
weights <- c(Can=65, Cem=66, Hande=61)

If we want to have this data combined in a table, we can generate a matrix out of it:

In [None]:
height_weight <- cbind(
    c(1.70, 1.75,1.62),
    c(65, 66, 61)
)
rownames(height_weight) <- c("Can","Cem","Hande")
colnames(height_weight) <- c("Height","Weight")
height_weight

Or more directly, if the data is already stored in vectors:

In [None]:
height_weight <- cbind(heights, weights)
colnames(height_weight) <- c("Height","Weight")
print(height_weight)

In [None]:
height_weight["Can","Weight"]/height_weight["Can","Height"]^2

Now suppose we want to add the Boolean gym membership data into this matrix as well.

In [None]:
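# cbind() combines vectors by position, not by name, so the names follow
# the same order as in heights and weights (Can, Cem, Hande)
member <- c(Can=FALSE, Cem=TRUE, Hande=TRUE)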
height_weight <- cbind(heights, weights,member)
colnames(height_weight) <- c("Height","Weight","Gym member")
print(height_weight)

* The last column has numeric values 0 or 1, instead of `TRUE` or `FALSE`.
* Reason: All elements in a matrix must have _the same mode_ (numeric here).
* If values of a different mode are added (Boolean here), all elements are _coerced_ to a common type (numeric here).
* `TRUE` becomes 1, `FALSE` becomes 0.
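The same coercion rule can be seen on plain vectors. A small sketch, with made-up values:

```r
# Mixing numeric and logical values: the logical value becomes a number
x <- c(1.70, TRUE)
mode(x)   # "numeric"
x

# Adding a string forces every element to become a string
y <- c(1.70, TRUE, "Ankara")
mode(y)   # "character"
y
```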

This could be tolerable. However, now we add the city data.

In [None]:
# Same order as heights and weights, because cbind() does not match by name
city <- c(Can="Ankara", Cem="Istanbul", Hande="Izmir")

height_weight <- cbind(heights, weights,member,city)
colnames(height_weight) <- c("Height","Weight","Gym member","City")

print(height_weight)

All entries are now coerced to strings. The data is still there, but we cannot perform computations anymore.

In [None]:
height_weight["Can","Weight"]/height_weight["Can","Height"]^2
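The division above fails because the matrix entries are strings. If the data had to stay in a character matrix, each value would need an explicit conversion before any arithmetic. A sketch, using a small made-up matrix `m`:

```r
# Hypothetical character matrix, similar to the coerced height_weight
m <- cbind(Height = c(Can = "1.7", Cem = "1.75"),
           Weight = c("65", "66"))

# Convert each entry back to a number before computing
bmi_can <- as.numeric(m["Can", "Weight"]) / as.numeric(m["Can", "Height"])^2
bmi_can
```

This works, but it has to be repeated for every computation, which is error-prone.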

How about keeping the data in separate vectors? There would be no coercion, but data manipulation would be difficult. Selecting subsets or adding and removing entries would require several operations and great care.

A _data frame_ solves this problem: it combines several vectors as columns of one table, and each column keeps its own type.

Creating data frames
====
If we already have data in the form of one-dimensional vectors, we can combine them into a data frame using the `data.frame()` function.

In [None]:
people <- data.frame(Height=heights, Weight=weights, Member=member, City=city, stringsAsFactors = F)
people

Data recycling works for data frames as well. Suppose we add the "City" data and make it "Istanbul" for all.

In [None]:
data.frame(Height=heights, Weight=weights, City="Istanbul")

Here, the element `"Istanbul"` is repeated until it matches the length of other vectors.
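Recycling in `data.frame()` only works when the shorter vector fits a whole number of times into the longer one. A minimal sketch, with made-up column names `id` and `group`:

```r
# Length 2 fits twice into length 4, so "a", "b" is repeated
ok <- data.frame(id = 1:4, group = c("a", "b"))
ok

# Length 2 does not fit into length 3, so data.frame() raises an error;
# tryCatch() captures the error message instead of stopping
bad <- tryCatch(
    data.frame(id = 1:3, group = c("a", "b")),
    error = function(e) conditionMessage(e)
)
bad
```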

What if row names are not given during the construction, or we want to change them? The functions `rownames()` and `colnames()` can be used just like with matrices.

In [None]:
tempdf = data.frame(
    h = c(1.70, 1.75,1.62),
    w = c(65, 66, 61)
)
tempdf

In [None]:
rownames(tempdf) <- c("Can","Cem","Hande")
colnames(tempdf) <- c("Height","Weight")
tempdf

Accessing data frames
====

Accessing via column numbers or column names
----

In [None]:
people

A data frame is a list, so we can access its components using the list notation we saw last week.

In [None]:
people[[1]]  # indexing with component number

In [None]:
people$Height  # component name

In [None]:
people[["Height"]]  # indexing with component name
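As with any list, single brackets and double brackets behave differently. A short sketch on a made-up data frame `df`:

```r
df <- data.frame(Height = c(1.70, 1.75), Weight = c(65, 66))

# Single brackets return a sub-list, which is still a data frame
class(df["Height"])    # "data.frame"

# Double brackets (and $) return the column itself, a plain vector
class(df[["Height"]])  # "numeric"
```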

Accessing via matrix-like indexing
-----
A data frame can be indexed as if it were a matrix, using the `[row, col]` notation.

In [None]:
people

In [None]:
people[,1]  # column 1

In [None]:
people[2,1] # row 2, column 1

In [None]:
people["Cem","Height"]
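Selecting a single column with matrix-like indexing returns a plain vector and drops the row names. If a one-column data frame is wanted instead, the `drop = FALSE` argument can be used. A sketch with a made-up data frame `df`:

```r
df <- data.frame(Height = c(1.70, 1.75, 1.62),
                 Weight = c(65, 66, 61),
                 row.names = c("Can", "Cem", "Hande"))

v <- df[, 1]                # plain numeric vector
d <- df[, 1, drop = FALSE]  # one-column data frame, row names kept

v
d
```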

Selecting rows using indices
===

In [None]:
people

We can specify a vector of indices to select rows.

In [None]:
people[c(1,3),]

We can also select using a vector of row names.

In [None]:
people[c("Can","Hande"),]

A negative index, again, indicates a row that is to be omitted.

In [None]:
people[-2,]

Selecting some columns
====

We can provide a vector of column names to get a subframe.

In [None]:
people[, c("Member","City")]

Numeric indices can also be used.

In [None]:
people[, 3:4]

A subset of rows and a subset of columns:

In [None]:
people[c("Can","Cem"), 1:2]

Filtering data frames
==
The Boolean indexing used to select vector elements applies to data frames as well: a Boolean vector in the row position keeps the rows where it is `TRUE`.

In [None]:
people

In [None]:
people$Height >= 1.70

In [None]:
people[ people$Height>= 1.70, ]

In [None]:
people[ people$Member, ]

In [None]:
people[ people$Member, c("Height","City")]
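Several conditions can be combined with `&` and `|`. Base R also offers `subset()`, which lets the condition refer to column names directly. A sketch on a made-up data frame `df`:

```r
df <- data.frame(Height = c(1.70, 1.75, 1.62),
                 Member = c(FALSE, TRUE, TRUE),
                 row.names = c("Can", "Cem", "Hande"))

# Rows that satisfy both conditions
tall_members <- df[df$Height >= 1.70 & df$Member, ]
tall_members

# The same selection with subset(); no df$ prefix needed
same <- subset(df, Height >= 1.70 & Member)
same
```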

Adding new rows
===
As with matrices, we can use `rbind()` to add a new row to an existing data frame. The new row is usually in the form of a list.

In [None]:
people

In [None]:
rbind(people, Lale=list(1.71, 64, FALSE, "Bursa"))

In [None]:
newpeople = data.frame(
    Height=c(Lale=1.71, Ziya=1.45),
    Weight=c(64, 50),
    Member=c(F,T),
    City=c("Bursa","Istanbul")
)
newpeople

In [None]:
rbind(people, newpeople)
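When both arguments are data frames, `rbind()` matches columns by name rather than by position, and it stops with an error if the names differ. A sketch with made-up one-row data frames:

```r
a <- data.frame(Height = 1.70, Weight = 65)
b <- data.frame(Weight = 64, Height = 1.71)  # columns in a different order

# Columns are matched by name, so the values land in the right place
ab <- rbind(a, b)
ab

# Different column names cannot be matched; capture the error message
err <- tryCatch(
    rbind(a, data.frame(H = 1.60, W = 55)),
    error = function(e) conditionMessage(e)
)
err
```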

Adding new columns
===

In [None]:
people

Suppose we want to add a column for BMI, which we calculate using the existing columns. We can do this using `cbind()` as follows.

In [None]:
people_bmi <- cbind(people, people$Weight/people$Height^2)
people_bmi

Note that the name of the new column is automatically set. We can change this using the `names()` function. 

In [None]:
names(people_bmi)[5] <- "BMI"

In [None]:
people_bmi

A more direct way is to assign to a new component name. Since a data frame is a list, it can be extended in this way.

In [None]:
people2 <- people
people2$BMI <- people2$Weight/people2$Height^2
people2
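Another option in base R is `transform()`, which returns a copy of the data frame with the new column and lets the expression use column names directly. A sketch on a made-up data frame `df`:

```r
df <- data.frame(Height = c(1.70, 1.75), Weight = c(65, 66))

# df itself is not modified; the result is stored in df2
df2 <- transform(df, BMI = Weight / Height^2)
df2
```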

We can create a new column without using existing columns. If we specify a single value, the rest is set by vector recycling.

In [None]:
people2$obese <- NA
people2

To remove a column, assign `NULL` to it.

In [None]:
people2$obese <- NULL
people2

Merging data frames
===
The `merge(x,y)` function is used to create a new data frame from existing frames `x` and `y`, by combining them along a common column.

To illustrate this, consider the following simple data frames.

In [None]:
df1 <- data.frame(Name=c("Can","Cem","Hande"), Phone=c(1234,4345,8492))
df1

In [None]:
df2 <- data.frame(Age=c(25,27,26), Name=c("Cem","Hande","Can"))
df2

In [None]:
merge(df1,df2)

The `merge()` function uses the column `Name` to merge the two data frames. Note that the entries are correctly identified even though the order of names is different in the two frames.

The columns we want to merge over may have different names in the two frames. In that case we use the `by.x` and `by.y` arguments to `merge()`.

In [None]:
df1

In [None]:
df2 <- data.frame(Age=c(25,27,26), first_name=c("Cem","Hande","Can"))
df2

In [None]:
merge(df1, df2, by.x="Name", by.y="first_name")

What if we have named rows, and we want to merge over these indices? For example, recall the `people` dataframe we created above:

In [None]:
people

And suppose we have a phone book with named indices:

In [None]:
phonebook <- data.frame(phone=c(Can=1234, Cem=4345, Lale=8492))
phonebook

Note that the phone book does not contain Hande, and has an extra entry, Lale.

Then we merge by the row names by specifying `"row.names"` as `by.x` and `by.y` parameter values. 

In [None]:
merge(people,phonebook,by.x="row.names", by.y="row.names")

Note that the resulting data frame does not include Hande or Lale, because they are missing in one or the other data frame. This is called an _inner join_ operation.

If we want to keep all the rows, even those with missing data, we use the `all=TRUE` parameter. This is called an _outer join_ operation.

In [None]:
merged_df <- merge(people,phonebook,by.x="row.names", by.y="row.names", all=TRUE)

In [None]:
merged_df

If we want to use the names of people as row names, we can assign them using the `rownames()` function, and then remove the `"Row.names"` column.

In [None]:
rownames(merged_df) <- merged_df$Row.names
merged_df$Row.names <- NULL
merged_df
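Between the inner join and the full outer join, `merge()` also supports one-sided joins through its `all.x` and `all.y` arguments. A sketch with made-up data frames `left` and `right`:

```r
left  <- data.frame(Name = c("Can", "Cem", "Hande"),
                    Height = c(1.70, 1.75, 1.62))
right <- data.frame(Name = c("Can", "Cem", "Lale"),
                    Phone = c(1234, 4345, 8492))

# Left join: keep every row of the first frame (Hande gets Phone = NA)
lj <- merge(left, right, by = "Name", all.x = TRUE)
lj

# Right join: keep every row of the second frame (Lale gets Height = NA)
rj <- merge(left, right, by = "Name", all.y = TRUE)
rj
```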

Applications
===
Analyze the grades in a class
---

In [None]:
grades <- data.frame(
    student = c("Can","Cem","Hande","Lale","Ziya"),
    midterm1 = c(45, 74, 67, 52, 31),
    midterm2 = c(68, 83, 56, 22, 50),
    final = c(59, 91, 62, 49, 65),
    stringsAsFactors = F)
grades

In [None]:
grades$score <- grades$midterm1*0.3 + grades$midterm2*0.3 + grades$final*0.4
grades

In [None]:
apply(grades[-1],2,mean)

In [None]:
lettergrade <- function(score){
    if (score > 80) "A" else if (score > 70) "B" else if (score>60) "C" else if (score>50) "D" else "F"
}

In [None]:
sapply(grades$score,lettergrade)

In [None]:
grades$letter <- sapply(grades$score, lettergrade)
grades
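The `lettergrade()` function handles one score at a time, which is why `sapply()` is needed. As an alternative sketch, `cut()` can bin a whole vector at once. The scores below are made up; the breaks are chosen to reproduce the same thresholds.

```r
score <- c(85, 80, 65, 50, 30)  # hypothetical scores

# Intervals are open on the left and closed on the right by default,
# e.g. (70, 80] maps to "B", which matches "score > 70" but not "> 80"
letter_cut <- cut(score,
                  breaks = c(-Inf, 50, 60, 70, 80, Inf),
                  labels = c("F", "D", "C", "B", "A"))
as.character(letter_cut)
```

Note that `cut()` returns a factor, so `as.character()` is used here to get plain strings.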

Grading multiple-choice exams
---
Our students have taken a multiple-choice exam. The answer key is recorded as a vector, and the students' answers as a matrix with one row per student.

In [None]:
key <- c("A","B","C","D","A")
answers <- rbind(
    c("A", "B", "D", "A", "B"),
    c("A", "D", "C", "D", "A"),
    c("B", "B", "C", "D", "B"),
    c("A", "B", "C", "D", "D"),
    c("C", "C", "C", "D", "A")
)

We initialize a separate data frame with the student information:

In [None]:
exam <- data.frame(
    student = c("Can","Cem","Hande","Lale","Ziya"),
    stringsAsFactors = F
)
exam

And add the exam data to that data frame with `cbind()`.

In [None]:
# answers is the matrix defined above, one row per student
exam <- cbind(exam, answers)
exam

Now we can process this data frame to get the number of correct answers for each student. For that, we can use the `sum(x==y)` operation, which gives us the number of equal elements.

In [None]:
v1 <- c("A","B","C","D")
v2 <- c("A","C","B","D")
sum(v1 == v2)

In [None]:
exam[1,]
key
sum(exam[1,2:6]==key)  # skip column 1, which holds the student name

To repeat this for each row, we create a function that returns the number of matching answers.

In [None]:
ncorrect <- function(x){
    sum(x==key)
}

In [None]:
ncorrect(exam[1,2:6])

And we use `apply()` to apply it to every row.

In [None]:
apply(exam[2:6],1,ncorrect)

We can store this result by creating a new column in the data frame.

In [None]:
exam$correct <- apply(exam[,2:6],1,ncorrect)

In [None]:
exam
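The counts can also be computed directly on the answer matrix, without `apply()`. The following sketch repeats the `key` and `answers` definitions so that it runs on its own.

```r
key <- c("A", "B", "C", "D", "A")
answers <- rbind(
    c("A", "B", "D", "A", "B"),
    c("A", "D", "C", "D", "A"),
    c("B", "B", "C", "D", "B"),
    c("A", "B", "C", "D", "D"),
    c("C", "C", "C", "D", "A")
)

# After transposing, each column holds one student's answers, so the key
# lines up with the questions when it is recycled down the columns
correct <- colSums(t(answers) == key)
correct   # 2 4 3 4 3
```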

Item database
---
Suppose you run a retail store and you keep a database of your items, their unit price, and the VAT rate for each item, such as the following.

In [None]:
items <- data.frame(
    itemname = c("Milk","Meat","Toothpaste","Pencil","Detergent"),
    vat = c(0.05, 0.04, 0.05, 0.06, 0.03),
    unitprice = c(10, 20, 5, 1, 4))
items

You get some orders for some items, which your automated system stores with an order ID:

In [None]:
orders <- data.frame(
    orderid = c("1234","5761","1832"),
    item = c("Milk","Meat","Toothpaste"),
    amount = c(3,1,2))
orders

Our task is to add a new column to the `orders` data frame that holds the total payment for each order, including the VAT.

We begin by merging the orders and items data frames. We only need the items that were ordered, so we perform an inner join. We store the result in a new data frame.

In [None]:
orders2 <- merge(orders,items,by.x="item",by.y="itemname")
orders2

Now that we have the unit price and the VAT information on the same data frame, we can calculate the total to pay and store it in a new column.

In [None]:
orders2$total <- (orders2$amount*orders2$unitprice)*(1+orders2$vat)
orders2
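Since `merge()` sorts the result by the merge column, the rows come out ordered by item name. As a sketch of a possible final step, we can keep only the order ID and the total, sort by order ID, and compute the grand total. The data is repeated so that the block runs on its own; the name `receipt` is made up.

```r
items <- data.frame(
    itemname = c("Milk", "Meat", "Toothpaste", "Pencil", "Detergent"),
    vat = c(0.05, 0.04, 0.05, 0.06, 0.03),
    unitprice = c(10, 20, 5, 1, 4))
orders <- data.frame(
    orderid = c("1234", "5761", "1832"),
    item = c("Milk", "Meat", "Toothpaste"),
    amount = c(3, 1, 2))

orders2 <- merge(orders, items, by.x = "item", by.y = "itemname")
orders2$total <- (orders2$amount * orders2$unitprice) * (1 + orders2$vat)

# Keep two columns and sort the rows by order ID
receipt <- orders2[order(orders2$orderid), c("orderid", "total")]
receipt

# Grand total over all orders
sum(receipt$total)
```

For example, the milk order costs 3 × 10 × 1.05 = 31.5.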