**DATA FRAMES**

A "data frame" is a special type of list whose items are vectors of equal length, which may be of different types.
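
A quick way to see the list nature of a data frame is to inspect it with `is.list()` and `typeof()`. The sketch below uses a small made-up data frame; `toy` and its column names are illustrative assumptions, not part of this notebook's data.

```r
# 'toy' is a hypothetical example object, not part of the notebook's data
toy <- data.frame(id = 1:3,
                  label = c("a", "b", "c"),
                  flag = c(TRUE, FALSE, TRUE))

is.list(toy)         # a data frame is built on a list
typeof(toy)          # the underlying storage type
class(toy)           # the class that makes it print as a table
sapply(toy, length)  # every column has the same length
```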

Now let's create some vectors of equal length and of different types

In [1]:
names <- c("Ian G.", "Ritchie", "Ian P.", "Roger", "Jon", "Steve", "Don", "David")

homework <- c(95, 60, 40, 40, 100, 25, 50, 94)

midterm <- c(80, 62, 38, 20, 74, 56, 18, 67)

final <- c(87, 50, 62, 10, 72, 61, 27, 60)

attendance <- c(TRUE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE)

names
class(names)

homework
class(homework)

midterm
class(midterm)

final
class(final)

attendance
class(attendance)


Let's create a data frame with the given information:

In [2]:
results_1 <- data.frame(names, homework, midterm, final, attendance)
results_1

names,homework,midterm,final,attendance
Ian G.,95,80,87,True
Ritchie,60,62,50,False
Ian P.,40,38,62,False
Roger,40,20,10,False
Jon,100,74,72,True
Steve,25,56,61,True
Don,50,18,27,True
David,94,67,60,True


Now, let's change the length of a vector

In [3]:
midterm[-1] # excludes 80
midterm_2 <- midterm[-1]
results_2 <- data.frame(names, homework, midterm_2, final, attendance)
results_2

ERROR: Error in data.frame(names, homework, midterm_2, final, attendance): arguments imply differing number of rows: 8, 7


Hmm... R did not recycle here, because the vector lengths are 8 and 7, and 7 does not divide 8.

Now let's try a similar thing

In [4]:
midterm_2 <- midterm[1:4]
midterm_2
results_2 <- data.frame(names, homework, midterm_2, final, attendance)
results_2

names,homework,midterm_2,final,attendance
Ian G.,95,80,87,True
Ritchie,60,62,50,False
Ian P.,40,38,62,False
Roger,40,20,10,False
Jon,100,80,72,True
Steve,25,62,61,True
Don,50,38,27,True
David,94,20,60,True


**See, when the longest vector's length is a whole multiple of a shorter vector's length, R recycles the shorter vector to build the data.frame.**

**However, when the lengths are not multiples of one another, R throws an error and refuses to recycle!**
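
The recycling rule can be checked on a smaller scale. This is a minimal sketch with made-up vectors (`long_vec`, `ok` and `res` are hypothetical names); `tryCatch()` is used only so that the error message is captured instead of stopping the cell.

```r
# Illustrative vectors; these names are hypothetical, not from the notebook
long_vec <- 1:8

# Length 2 divides 8, so the short vector is recycled four times
ok <- data.frame(x = long_vec, y = c("a", "b"))
nrow(ok)
ok$y

# Length 3 does not divide 8, so data.frame() raises an error;
# tryCatch() captures the message as a character string
res <- tryCatch(
  data.frame(x = long_vec, y = 1:3),
  error = function(e) conditionMessage(e)
)
res
```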

Now let's first create a list and then convert it into a data.frame

In [5]:
results_2 <- list(names = names, homework = homework, midterm = midterm, final = final, attendance = attendance)
results_2
attributes(results_2)
class(results_2)
length(results_2)

Does it have a dimension attribute? (Note that lists don't have one.)

In [6]:
dim(results_2)

NULL

Before taking a look at the attributes of a data frame, let's create a sample matrix - to see the differences and similarities, and get its length:

In [7]:
a <- matrix(1:40, nrow = 8)
a
length(a)
dim(a)

0,1,2,3,4
1,9,17,25,33
2,10,18,26,34
3,11,19,27,35
4,12,20,28,36
5,13,21,29,37
6,14,22,30,38
7,15,23,31,39
8,16,24,32,40


See, the length of a matrix is the count of all its cells, so it is (number of rows) * (number of columns).

Now create a data frame with the "as.data.frame" function and see its attributes and length.

Note that "stringsAsFactors = F" is supplied so that character values are not automatically converted to factors (categorical variables).

In [8]:
results_3 <- as.data.frame(results_2, stringsAsFactors = F)
results_3
attributes(results_3)
class(results_3)
length(results_3)

names,homework,midterm,final,attendance
Ian G.,95,80,87,True
Ritchie,60,62,50,False
Ian P.,40,38,62,False
Roger,40,20,10,False
Jon,100,74,72,True
Steve,25,56,61,True
Don,50,18,27,True
David,94,67,60,True


In [9]:
names(results_3)
colnames(results_3)
rownames(results_3)

For the time being, we pass stringsAsFactors = F so that character columns are not automatically converted to factors. (Automatic conversion was the default before R 4.0.0; since then the default is FALSE.) One thing at a time!
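
To see what the argument changes, the same input can be built both ways. This is a small sketch on made-up data; `chars` and `facts` are hypothetical names.

```r
# Hypothetical two-column example, built once with each setting
chars <- data.frame(name = c("x", "y", "x"), n = 1:3, stringsAsFactors = FALSE)
facts <- data.frame(name = c("x", "y", "x"), n = 1:3, stringsAsFactors = TRUE)

class(chars$name)   # stays a character vector
class(facts$name)   # converted to a factor
levels(facts$name)  # the distinct values become the factor levels
```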

It looks like a matrix, does it behave like a matrix?

length() returns the number of columns for a data.frame, whereas for a matrix it gives rows * columns.

So technically a data.frame is closer to a list than it is to a matrix, although its appearance resembles a matrix

What else? For example, does it have a dimension attribute?

In [10]:
dim(results_3)

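So dim() works on a data frame, even though it is a list. Strictly speaking, the dimensions are computed from the row names and the number of columns rather than stored in a "dim" attribute, as attributes(results_3) shows. And it still has a length, just like a list...

Similarly, a matrix is a vector with a dimension attribute, and it still has a length, just like a vector!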
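
The comparison can be made side by side. The sketch below converts a matrix to a data frame and inspects both; `m` and `d` are hypothetical names used only for this illustration.

```r
# Hypothetical objects for comparison (not the notebook's data)
m <- matrix(1:40, nrow = 8)
d <- as.data.frame(m)

length(m)   # number of cells: rows * columns
length(d)   # number of columns (list elements)
dim(m)
dim(d)      # same shape as the matrix

"dim" %in% names(attributes(m))   # the matrix stores its shape as an attribute
"dim" %in% names(attributes(d))   # the data frame does not; dim() computes it
```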

Let's convert it back to a list

In [11]:
as.list(results_3)

Now, let's subset a data frame in different ways, and see the class of outputs

In [12]:
results_3[1]

names
Ian G.
Ritchie
Ian P.
Roger
Jon
Steve
Don
David


In [13]:
class(results_3[1])

Since a data frame is a list of equal-length vectors, we can subset it with single brackets and a single index, just like a list

And the output is a data.frame!

Now let's subset it with 2 indices, just like a matrix

In [14]:
results_3[1,1]

In [15]:
class(results_3[1,1])

So when a data frame is subset with two indices and the result comes from a single column, we get inside the object: the result is a plain vector taken from that column, not a data frame

How about this:

In [16]:
results_3[1,1:2]

names,homework
Ian G.,95


In [17]:
class(results_3[1,1:2])

When we subset two columns with two indices, the result is still a data.frame!
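
Whether a two-index subset is simplified to a vector is controlled by the `drop` argument of `[`. A minimal sketch on made-up data (`scores` is a hypothetical name):

```r
# Hypothetical example data
scores <- data.frame(name = c("A", "B"), hw = c(90, 70),
                     stringsAsFactors = FALSE)

class(scores[, 2])                # a single column is simplified to a vector
class(scores[, 2, drop = FALSE])  # simplification switched off: data.frame
class(scores[1, ])                # a row spanning several columns: data.frame
```

Passing `drop = FALSE` is a common way to keep code predictable when the number of selected columns is not known in advance.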

How about this:

In [18]:
results_3[1][1][1]

names
Ian G.
Ritchie
Ian P.
Roger
Jon
Steve
Don
David


You just subset the first column of a data frame that is itself a single column!

In [19]:
class(results_3[1][1])

Now let's subset with double brackets, just like a list!

In [20]:
results_3[[1]]
class(results_3[[1]])

So we get inside the object, just like list indexing with double brackets

And we can chain subsetting operators to get into the items, just like in lists:

In [21]:
results_3[[1]][1]
class(results_3[[1]][1])

How about subsetting with names

In [22]:
results_3$names[1]
class(results_3$names[1])

In [23]:
results_3[["names"]][1]
class(results_3[["names"]][1])
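
The different ways of reaching into a column return the same object, which `identical()` can confirm. This is a sketch on made-up data (`scores` is a hypothetical name):

```r
# Hypothetical example data
scores <- data.frame(name = c("A", "B"), hw = c(90, 70),
                     stringsAsFactors = FALSE)

identical(scores$hw, scores[["hw"]])    # $ and [[ by name agree
identical(scores[["hw"]], scores[[2]])  # [[ by name and by position agree
identical(scores[, "hw"], scores$hw)    # matrix-style indexing agrees too
class(scores["hw"])                     # single brackets keep the data.frame wrapper
```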

What if we cbind a vector?

In [24]:
instruments <- c("vocals", "guitar", "drums", "bass", "hammond organ", "guitar", "hammond organ", "vocals")

In [25]:
results_4 <- cbind(results_3, instruments)
results_4
class(results_4)

names,homework,midterm,final,attendance,instruments
Ian G.,95,80,87,True,vocals
Ritchie,60,62,50,False,guitar
Ian P.,40,38,62,False,drums
Roger,40,20,10,False,bass
Jon,100,74,72,True,hammond organ
Steve,25,56,61,True,guitar
Don,50,18,27,True,hammond organ
David,94,67,60,True,vocals


See, it retains its data.frame status

Now add a column by assignment

In [26]:
results_4 <- results_3
results_4
results_4$instruments <- instruments
results_4
class(results_4)

names,homework,midterm,final,attendance
Ian G.,95,80,87,True
Ritchie,60,62,50,False
Ian P.,40,38,62,False
Roger,40,20,10,False
Jon,100,74,72,True
Steve,25,56,61,True
Don,50,18,27,True
David,94,67,60,True


names,homework,midterm,final,attendance,instruments
Ian G.,95,80,87,True,vocals
Ritchie,60,62,50,False,guitar
Ian P.,40,38,62,False,drums
Roger,40,20,10,False,bass
Jon,100,74,72,True,hammond organ
Steve,25,56,61,True,guitar
Don,50,18,27,True,hammond organ
David,94,67,60,True,vocals


Add a column by indexing

In [27]:
results_4 <- results_3
results_4
results_4[6] <- instruments
results_4
class(results_4)
names(results_4)


names,homework,midterm,final,attendance
Ian G.,95,80,87,True
Ritchie,60,62,50,False
Ian P.,40,38,62,False
Roger,40,20,10,False
Jon,100,74,72,True
Steve,25,56,61,True
Don,50,18,27,True
David,94,67,60,True


names,homework,midterm,final,attendance,V6
Ian G.,95,80,87,True,vocals
Ritchie,60,62,50,False,guitar
Ian P.,40,38,62,False,drums
Roger,40,20,10,False,bass
Jon,100,74,72,True,hammond organ
Steve,25,56,61,True,guitar
Don,50,18,27,True,hammond organ
David,94,67,60,True,vocals


In [28]:
# you should set the name of the last column manually!!!
names(results_4) <- c("names", "homework", "midterm", "final", "attendance", "instruments")
results_4
# alternatively, rename only the sixth column
# (names(results_4[6]) <- "instruments" would act on a temporary copy and rename nothing)
names(results_4)[6] <- "instruments"
results_4
# or,
colnames(results_4)[6] <- "instruments"
results_4

names,homework,midterm,final,attendance,instruments
Ian G.,95,80,87,True,vocals
Ritchie,60,62,50,False,guitar
Ian P.,40,38,62,False,drums
Roger,40,20,10,False,bass
Jon,100,74,72,True,hammond organ
Steve,25,56,61,True,guitar
Don,50,18,27,True,hammond organ
David,94,67,60,True,vocals


names,homework,midterm,final,attendance,instruments
Ian G.,95,80,87,True,vocals
Ritchie,60,62,50,False,guitar
Ian P.,40,38,62,False,drums
Roger,40,20,10,False,bass
Jon,100,74,72,True,hammond organ
Steve,25,56,61,True,guitar
Don,50,18,27,True,hammond organ
David,94,67,60,True,vocals


names,homework,midterm,final,attendance,instruments
Ian G.,95,80,87,True,vocals
Ritchie,60,62,50,False,guitar
Ian P.,40,38,62,False,drums
Roger,40,20,10,False,bass
Jon,100,74,72,True,hammond organ
Steve,25,56,61,True,guitar
Don,50,18,27,True,hammond organ
David,94,67,60,True,vocals
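
A note on renaming a single column: the position of the index matters. In the sketch below (made-up data; `demo`, `after_wrong` and `after_right` are hypothetical names), putting the index inside `names()` leaves the column name untouched, while indexing the result of `names()` renames it.

```r
# Hypothetical example with an auto-style column name "V2"
demo <- data.frame(a = 1:2, V2 = c("x", "y"))

names(demo[2]) <- "b"        # renames a temporary one-column copy only
after_wrong <- names(demo)
after_wrong                  # the name of the second column is unchanged

names(demo)[2] <- "b"        # replaces the second element of the names vector
after_right <- names(demo)
after_right
```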


In [29]:
results_4 <- results_3
results_4
results_4[[6]] <- instruments
results_4
class(results_4)

names,homework,midterm,final,attendance
Ian G.,95,80,87,True
Ritchie,60,62,50,False
Ian P.,40,38,62,False
Roger,40,20,10,False
Jon,100,74,72,True
Steve,25,56,61,True
Don,50,18,27,True
David,94,67,60,True


names,homework,midterm,final,attendance,V6
Ian G.,95,80,87,True,vocals
Ritchie,60,62,50,False,guitar
Ian P.,40,38,62,False,drums
Roger,40,20,10,False,bass
Jon,100,74,72,True,hammond organ
Steve,25,56,61,True,guitar
Don,50,18,27,True,hammond organ
David,94,67,60,True,vocals


All of these approaches add the column
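
When the new column is addressed by name instead of by position, no automatic name such as V6 is generated, so no renaming step is needed afterwards. A minimal sketch on made-up data (`band` and its columns are hypothetical):

```r
# Hypothetical example data
band <- data.frame(name = c("A", "B"), stringsAsFactors = FALSE)

band["role"] <- c("vocals", "bass")   # single brackets with a name
band[["year"]] <- c(1969, 1970)       # double brackets with a name

names(band)
```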

And to delete a column:

In [30]:
results_5 <- results_4
results_5
results_5[6] <- NULL
results_5

names,homework,midterm,final,attendance,V6
Ian G.,95,80,87,True,vocals
Ritchie,60,62,50,False,guitar
Ian P.,40,38,62,False,drums
Roger,40,20,10,False,bass
Jon,100,74,72,True,hammond organ
Steve,25,56,61,True,guitar
Don,50,18,27,True,hammond organ
David,94,67,60,True,vocals


names,homework,midterm,final,attendance
Ian G.,95,80,87,True
Ritchie,60,62,50,False
Ian P.,40,38,62,False
Roger,40,20,10,False
Jon,100,74,72,True
Steve,25,56,61,True
Don,50,18,27,True
David,94,67,60,True


In [31]:
results_5 <- results_4
results_5
results_5[[6]] <- NULL
results_5

names,homework,midterm,final,attendance,V6
Ian G.,95,80,87,True,vocals
Ritchie,60,62,50,False,guitar
Ian P.,40,38,62,False,drums
Roger,40,20,10,False,bass
Jon,100,74,72,True,hammond organ
Steve,25,56,61,True,guitar
Don,50,18,27,True,hammond organ
David,94,67,60,True,vocals


names,homework,midterm,final,attendance
Ian G.,95,80,87,True
Ritchie,60,62,50,False
Ian P.,40,38,62,False
Roger,40,20,10,False
Jon,100,74,72,True
Steve,25,56,61,True
Don,50,18,27,True
David,94,67,60,True
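
Besides assigning NULL, a column can also be left out when subsetting. The sketch below shows a few equivalent options on made-up data; all object names are hypothetical.

```r
# Hypothetical example data
demo <- data.frame(a = 1:2, b = c("x", "y"), c = c(TRUE, FALSE))

by_null <- demo
by_null$c <- NULL                            # delete in place by name

by_neg  <- demo[-3]                          # drop by negative position
by_name <- demo[setdiff(names(demo), "c")]   # drop by name, keeping the rest

names(by_null)
names(by_neg)
names(by_name)
```

The subsetting variants return a new data frame and leave `demo` itself unchanged.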


Now let's subset only vocals from the data frame

In [32]:
results_4 <- cbind(results_3, instruments)
results_4[results_4$instruments == "vocals",c(1,2,3)]

,names,homework,midterm
1,Ian G.,95,80
8,David,94,67


And get only attendance values

In [33]:
results_4[results_4$instruments == "vocals","attendance"]

In [34]:
results_4[results_4$instruments == "vocals",]$attendance

In [35]:
results_4[results_4$instruments == "vocals",5]

In [36]:
class(results_4[results_4$instruments == "vocals",][5])

The last one returns a data.frame; the other ones return logical vectors

And we can subset with explicit logical vectors, instead of logical tests, of course

In [37]:
results_4[c(T,F,F,F,F,F,F,T),]

,names,homework,midterm,final,attendance,instruments
1,Ian G.,95,80,87,True,vocals
8,David,94,67,60,True,vocals


Let's add a row

In [38]:
glenn <- c("Glenn", 93, 74, 85, TRUE, "bass")
glenn
class(glenn)

In [39]:
results_6 <- rbind(results_4, glenn)
results_6

names,homework,midterm,final,attendance,instruments
Ian G.,95,80,87,True,vocals
Ritchie,60,62,50,False,guitar
Ian P.,40,38,62,False,drums
Roger,40,20,10,False,bass
Jon,100,74,72,True,hammond organ
Steve,25,56,61,True,guitar
Don,50,18,27,True,hammond organ
David,94,67,60,True,vocals
Glenn,93,74,85,True,bass


In [40]:
class(results_6$homework)

See, since c() coerced all the values in the vector to character, rbind() coerced the corresponding columns to character as well
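
The coercion happens inside `c()`, because an atomic vector can hold only one type. A short sketch with made-up values (all names are hypothetical):

```r
# An atomic vector has a single type, so mixed inputs are coerced
num_and_lgl <- c(93, TRUE)        # logical is coerced to numeric
num_and_chr <- c(93, "bass")      # numeric is coerced to character
all_three   <- c("Glenn", 93, TRUE)

class(num_and_lgl)
class(num_and_chr)
all_three                         # every element is now a character string
```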

So what should we do, to add a row, without messing up the column types?

In [41]:
glenn_2 <- list("Glenn", 93, 74, 85, TRUE, "bass")
glenn_2

In [42]:
results_6 <- rbind(results_4, glenn_2)
results_6

names,homework,midterm,final,attendance,instruments
Ian G.,95,80,87,True,vocals
Ritchie,60,62,50,False,guitar
Ian P.,40,38,62,False,drums
Roger,40,20,10,False,bass
Jon,100,74,72,True,hammond organ
Steve,25,56,61,True,guitar
Don,50,18,27,True,hammond organ
David,94,67,60,True,vocals
Glenn,93,74,85,True,bass


In [43]:
class(results_6$homework)
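
Another option that keeps the column types is to bind a one-row data frame whose column names match. This is a sketch on made-up data; `roster`, `new_row` and `roster2` are hypothetical names.

```r
# Hypothetical example data
roster <- data.frame(name = c("A", "B"), score = c(90, 70),
                     stringsAsFactors = FALSE)

# The new row is itself a data frame, so each value keeps its own type
new_row <- data.frame(name = "C", score = 85, stringsAsFactors = FALSE)

roster2 <- rbind(roster, new_row)
roster2
sapply(roster2, class)   # column types are preserved
```

Because the columns are matched by name, this form also guards against values being placed in the wrong column.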

Since a data frame is a list, can we lapply or sapply through it?

In [44]:
lapply(results_6, class)

In [45]:
sapply(results_6, class)

Yes, surely we can!
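
One practical use of this is to pick columns by type before computing a summary. The sketch below uses made-up data (`roster` and `num_cols` are hypothetical names):

```r
# Hypothetical example data
roster <- data.frame(name = c("A", "B"), hw = c(90, 70), final = c(80, 60),
                     stringsAsFactors = FALSE)

# sapply() visits each column and returns one logical value per column
num_cols <- sapply(roster, is.numeric)
num_cols

# The logical vector then selects only the numeric columns
col_means <- colMeans(roster[num_cols])
col_means
```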

Let's create a simple data frame from 3 vectors.
Then, order the entire data frame by the first column.

In [46]:
v <- c(45:41, 30:33)
b <- LETTERS[rep(1:3, 3)]
n <- c(1,6,3,5,2,4,6,9,2)

df <- data.frame(Age = v, Class = b, Grade = n)
df

Age,Class,Grade
45,A,1
44,B,6
43,C,3
42,A,5
41,B,2
30,C,4
31,A,6
32,B,9
33,C,2


In [47]:
order(df$Age)
df[order(df$Age), ]  

,Age,Class,Grade
6,30,C,4
7,31,A,6
8,32,B,9
9,33,C,2
5,41,B,2
4,42,A,5
3,43,C,3
2,44,B,6
1,45,A,1


For another exercise, use the (built-in) dataset VADeaths.
First, make sure the object is a data frame; if not, change it to a data frame.

In [48]:
class(VADeaths)
df <- as.data.frame(VADeaths)
class(df)

Now, create a new variable, named Total, which is the sum of each row.

In [49]:
df$Total <- df[, 1] + df[, 2] + df[, 3] + df[, 4]
df

,Rural Male,Rural Female,Urban Male,Urban Female,Total
50-54,11.7,8.7,15.4,8.4,44.2
55-59,18.1,11.7,24.3,13.6,67.7
60-64,26.9,20.3,37.0,19.3,103.5
65-69,41.0,30.9,54.6,35.1,161.6
70-74,66.0,54.3,71.1,50.0,241.4


Finally, change the order of the columns so total is the first variable.

In [50]:
df <- df[, c(5, 1:4)]
df

,Total,Rural Male,Rural Female,Urban Male,Urban Female
50-54,44.2,11.7,8.7,15.4,8.4
55-59,67.7,18.1,11.7,24.3,13.6
60-64,103.5,26.9,20.3,37.0,19.3
65-69,161.6,41.0,30.9,54.6,35.1
70-74,241.4,66.0,54.3,71.1,50.0
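
Reordering by position depends on knowing that Total is column 5. A name-based variant, sketched below under the hypothetical name `va`, avoids hard-coding positions.

```r
# 'va' is a hypothetical name for a fresh copy of the built-in dataset
va <- as.data.frame(VADeaths)
va$Total <- rowSums(va)

# Put "Total" first, followed by every other column in its current order
va <- va[c("Total", setdiff(names(va), "Total"))]
names(va)
```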


For our last exercise, let's use the (built-in) dataset state.x77.
Again, make sure the object is a data frame and, if not, change it to a data frame.

In [51]:
state.x77
class(state.x77)
df <- as.data.frame(state.x77)

,Population,Income,Illiteracy,Life Exp,Murder,HS Grad,Frost,Area
Alabama,3615,3624,2.1,69.05,15.1,41.3,20,50708
Alaska,365,6315,1.5,69.31,11.3,66.7,152,566432
Arizona,2212,4530,1.8,70.55,7.8,58.1,15,113417
Arkansas,2110,3378,1.9,70.66,10.1,39.9,65,51945
California,21198,5114,1.1,71.71,10.3,62.6,20,156361
Colorado,2541,4884,0.7,72.06,6.8,63.9,166,103766
Connecticut,3100,5348,1.1,72.48,3.1,56.0,139,4862
Delaware,579,4809,0.9,70.06,6.2,54.6,103,1982
Florida,8277,4815,1.3,70.66,10.7,52.6,11,54090
Georgia,4931,4091,2.0,68.54,13.9,40.6,60,58073


Now, find out how many states have an income of less than 4300.

In [52]:
nrow(subset(df, Income < 4300))

# to list the matching states instead of counting them,

df[df$Income < 4300, ]

,Population,Income,Illiteracy,Life Exp,Murder,HS Grad,Frost,Area
Alabama,3615,3624,2.1,69.05,15.1,41.3,20,50708
Arkansas,2110,3378,1.9,70.66,10.1,39.9,65,51945
Georgia,4931,4091,2.0,68.54,13.9,40.6,60,58073
Idaho,813,4119,0.6,71.87,5.3,59.5,126,82677
Kentucky,3387,3712,1.6,70.1,10.6,38.5,95,39650
Louisiana,3806,3545,2.8,68.76,13.2,42.2,12,44930
Maine,1058,3694,0.7,70.39,2.7,54.7,161,30920
Mississippi,2341,3098,2.4,68.09,12.5,41.0,50,47296
Missouri,4767,4254,0.8,70.69,9.3,48.8,108,68995
New Hampshire,812,4281,0.7,71.23,3.3,57.6,174,9027


The last step is to find out which is the state with the highest income.

In [53]:
df[max(df$Income)==df$Income, ]

,Population,Income,Illiteracy,Life Exp,Murder,HS Grad,Frost,Area
Alaska,365,6315,1.5,69.31,11.3,66.7,152,566432


If you need more practice, you may find more exercises here :)
https://www.r-exercises.com/start-here-to-learn-r/