![](logo.png)

# <font color='red'>R Dataframes</font>

> ### R Dataframe Basics
> ### R Dataframe Indexing and Selection
> ### Merging and Binding R Dataframes
> ### R List Basics

We've learned about R vectors and two-dimensional R matrices. Now we're going to learn about R Dataframes, which are one of the main tools for data analysis with R. Matrices are limited because all of the data inside a matrix has to be of the same data type (numeric, character, logical, etc.). R Dataframes can hold a different data type in each column, which makes them a very powerful data structure.
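As a small illustration of that difference, the sketch below builds a matrix and a Dataframe from the same two vectors. The names `char_matrix` and `mixed_df` are made up for this example, and `stringsAsFactors = FALSE` is set explicitly so the result does not depend on the R version.

```r
# Hypothetical example data: one numeric vector and one character vector
ids    <- c(1, 2)
labels <- c("a", "b")

# A matrix can hold only one type, so the numbers are coerced to character
char_matrix <- cbind(ids, labels)
typeof(char_matrix)        # "character"

# A Dataframe keeps each column's own type
mixed_df <- data.frame(id = ids, label = labels, stringsAsFactors = FALSE)
sapply(mixed_df, class)    # id is "numeric", label is "character"
```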

# <font color='red'>R Dataframe Basics</font>

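R ships with built-in datasets that serve as a quick reference for learning about these data structures. Some are true Dataframes (such as women); others (such as state.x77) are stored as matrices but print in the same tabular form.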

## Built-in R Dataframes

In [141]:
# Data about US states (stored as a matrix)
state.x77

Unnamed: 0,Population,Income,Illiteracy,Life Exp,Murder,HS Grad,Frost,Area
Alabama,3615,3624,2.1,69.05,15.1,41.3,20,50708
Alaska,365,6315,1.5,69.31,11.3,66.7,152,566432
Arizona,2212,4530,1.8,70.55,7.8,58.1,15,113417
Arkansas,2110,3378,1.9,70.66,10.1,39.9,65,51945
California,21198,5114,1.1,71.71,10.3,62.6,20,156361
Colorado,2541,4884,0.7,72.06,6.8,63.9,166,103766
Connecticut,3100,5348,1.1,72.48,3.1,56.0,139,4862
Delaware,579,4809,0.9,70.06,6.2,54.6,103,1982
Florida,8277,4815,1.3,70.66,10.7,52.6,11,54090
Georgia,4931,4091,2.0,68.54,13.9,40.6,60,58073


In [142]:
# US personal expense
USPersonalExpenditure

Unnamed: 0,1940,1945,1950,1955,1960
Food and Tobacco,22.2,44.5,59.6,73.2,86.8
Household Operation,10.5,15.5,29.0,36.5,46.2
Medical and Health,3.53,5.76,9.71,14.0,21.1
Personal Care,1.04,1.98,2.45,3.4,5.4
Private Education,0.341,0.974,1.8,2.6,3.64


In [143]:
# Women
women

height,weight
<dbl>,<dbl>
58,115
59,117
60,120
61,123
62,126
63,129
64,132
65,135
66,139
67,142


Use the built-in R function **data()** to get a list of all available built-in datasets.

In [144]:
# List of built-in R Dataframes
data()

## Working with Dataframes

As you may have noticed, the states data is large (50 rows). Use the built-in R functions **head()** and **tail()** to view the first and last six rows, respectively.

In [145]:
# Quick variable assignment for shortcut
states <- state.x77

In [146]:
head(states)

Unnamed: 0,Population,Income,Illiteracy,Life Exp,Murder,HS Grad,Frost,Area
Alabama,3615,3624,2.1,69.05,15.1,41.3,20,50708
Alaska,365,6315,1.5,69.31,11.3,66.7,152,566432
Arizona,2212,4530,1.8,70.55,7.8,58.1,15,113417
Arkansas,2110,3378,1.9,70.66,10.1,39.9,65,51945
California,21198,5114,1.1,71.71,10.3,62.6,20,156361
Colorado,2541,4884,0.7,72.06,6.8,63.9,166,103766


In [147]:
tail(states)

Unnamed: 0,Population,Income,Illiteracy,Life Exp,Murder,HS Grad,Frost,Area
Vermont,472,3907,0.6,71.64,5.5,57.1,168,9267
Virginia,4981,4701,1.4,70.08,9.5,47.8,85,39780
Washington,3559,4864,0.6,71.72,4.3,63.5,32,66570
West Virginia,1799,3617,1.4,69.48,6.7,41.6,100,24070
Wisconsin,4589,4468,0.7,72.48,3.0,54.5,149,54464
Wyoming,376,4566,0.6,70.29,6.9,62.9,173,97203
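Six rows is only the default. As a brief sketch, both functions accept an `n` argument to control how many rows come back; the variable names `first_three` and `last_two` are chosen just for this example.

```r
# Same shortcut assignment as in the cells above
states <- state.x77

# First three rows instead of the default six
first_three <- head(states, n = 3)
first_three

# Last two rows
last_two <- tail(states, n = 2)
rownames(last_two)    # "Wisconsin" "Wyoming"
```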


## Dataframes - Overview of Information

Use the built-in R function **str()** to get the structure of a Dataframe, including its variable names and data types. Use the built-in R function **summary()** to get a quick statistical summary of every column. Depending on the data, this can be very useful.

In [148]:
# Statistical summary of data
summary(states)

   Population        Income       Illiteracy       Life Exp    
 Min.   :  365   Min.   :3098   Min.   :0.500   Min.   :67.96  
 1st Qu.: 1080   1st Qu.:3993   1st Qu.:0.625   1st Qu.:70.12  
 Median : 2838   Median :4519   Median :0.950   Median :70.67  
 Mean   : 4246   Mean   :4436   Mean   :1.170   Mean   :70.88  
 3rd Qu.: 4968   3rd Qu.:4814   3rd Qu.:1.575   3rd Qu.:71.89  
 Max.   :21198   Max.   :6315   Max.   :2.800   Max.   :73.60  
     Murder          HS Grad          Frost             Area       
 Min.   : 1.400   Min.   :37.80   Min.   :  0.00   Min.   :  1049  
 1st Qu.: 4.350   1st Qu.:48.05   1st Qu.: 66.25   1st Qu.: 36985  
 Median : 6.850   Median :53.25   Median :114.50   Median : 54277  
 Mean   : 7.378   Mean   :53.11   Mean   :104.46   Mean   : 70736  
 3rd Qu.:10.675   3rd Qu.:59.15   3rd Qu.:139.75   3rd Qu.: 81163  
 Max.   :15.100   Max.   :67.30   Max.   :188.00   Max.   :566432  

In [149]:
# Structure of Data
str(states)

 num [1:50, 1:8] 3615 365 2212 2110 21198 ...
 - attr(*, "dimnames")=List of 2
  ..$ : chr [1:50] "Alabama" "Alaska" "Arizona" "Arkansas" ...
  ..$ : chr [1:8] "Population" "Income" "Illiteracy" "Life Exp" ...
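The `str()` output above reports a numeric matrix with dimnames rather than a `'data.frame'`. A minimal sketch of how one might check this and convert the matrix to a Dataframe follows; `states_df` is a name introduced only for this example.

```r
# state.x77 is a matrix, while women is a Dataframe
is.matrix(state.x77)       # TRUE
is.data.frame(women)       # TRUE

# Convert the matrix so each column becomes a Dataframe column
states_df <- as.data.frame(state.x77)
is.data.frame(states_df)   # TRUE

# The $ notation now works on the converted object
states_df$Income[1:3]      # 3624 6315 4530
```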


## Creating Dataframes

Use the built-in R function **data.frame()** to create R Dataframes by passing vectors as arguments. Each vector is converted into a column of the R Dataframe.

In [150]:
# Creating a Dataframe from weather data
days <- c('Mon','Tue','Wed','Thu','Fri')
temp <- c(76.1, 82.3, 85.4, 78.9, 76.2)
rain <- c(TRUE, FALSE, FALSE, TRUE, TRUE)

In [151]:
df <- data.frame(days, temp, rain)

In [152]:
df

days,temp,rain
<fct>,<dbl>,<lgl>
Mon,76.1,True
Tue,82.3,False
Wed,85.4,False
Thu,78.9,True
Fri,76.2,True


In [153]:
# Check structure
str(df)

'data.frame':	5 obs. of  3 variables:
 $ days: Factor w/ 5 levels "Fri","Mon","Thu",..: 2 4 5 3 1
 $ temp: num  76.1 82.3 85.4 78.9 76.2
 $ rain: logi  TRUE FALSE FALSE TRUE TRUE


In [154]:
summary(df)

  days        temp          rain        
 Fri:1   Min.   :76.10   Mode :logical  
 Mon:1   1st Qu.:76.20   FALSE:2        
 Thu:1   Median :78.90   TRUE :3        
 Tue:1   Mean   :79.78                  
 Wed:1   3rd Qu.:82.30                  
         Max.   :85.40                  
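Whether a character vector becomes a factor column is controlled by the `stringsAsFactors` argument of `data.frame()`. The sketch below sets it both ways so the behaviour is explicit; `df_chr` and `df_fct` are illustrative names, not part of the notebook.

```r
days <- c('Mon', 'Tue', 'Wed', 'Thu', 'Fri')
temp <- c(76.1, 82.3, 85.4, 78.9, 76.2)

# Keep days as plain character strings
df_chr <- data.frame(days, temp, stringsAsFactors = FALSE)
class(df_chr$days)     # "character"

# Convert days to a factor, as in the output above
df_fct <- data.frame(days, temp, stringsAsFactors = TRUE)
class(df_fct$days)     # "factor"
levels(df_fct$days)    # levels are sorted alphabetically, not by weekday
```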

# <font color='red'>R Dataframe Indexing and Selection</font>

Similar to vectors and matrices, R Dataframes can be indexed and selected using bracket notation. 

In [155]:
df

days,temp,rain
<fct>,<dbl>,<lgl>
Mon,76.1,True
Tue,82.3,False
Wed,85.4,False
Thu,78.9,True
Fri,76.2,True


In [156]:
# Get row and columns counts
nrow(df)
ncol(df)

In [157]:
# Get column names
colnames(df)

In [158]:
# Get row names (may just return index)
rownames(df)

In [159]:
# Use the bracket notation df[rows,columns]

# Everything from the first row
df[1,]

days,temp,rain
<fct>,<dbl>,<lgl>
Mon,76.1,True


In [160]:
# Everything from the first column

df[,1]

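NOTE: The printed R Dataframe 'df' shows the data type of each column under its name. The 'days' column has been converted to a factor, which is the default behaviour of data.frame() in R versions before 4.0. From R 4.0 onward, character vectors stay as characters unless stringsAsFactors = TRUE is set.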

In [161]:
# Grab Friday data
df[5,]

Unnamed: 0_level_0,days,temp,rain
Unnamed: 0_level_1,<fct>,<dbl>,<lgl>
5,Fri,76.2,True


## Select using column names

This is where R Dataframes become very powerful. We can select columns by name instead of having to remember their numeric positions.

In [162]:
# All of the rain values
df[,'rain']

In [163]:
# First 5 rows for days and temps
df[1:5, c('days','temp')]

days,temp
<fct>,<dbl>
Mon,76.1
Tue,82.3
Wed,85.4
Thu,78.9
Fri,76.2


Use the '$' notation if you want all of the values of a particular column as a vector. This references the column by name without brackets or quotes.

In [164]:
# Use $ to grab the rain data df.name$column.name
df$rain

In [165]:
df$days

Single-bracket notation without a row selector grabs the same values as the '$' notation, but it returns them as a one-column Dataframe rather than as a vector.

In [166]:
# Grab the rain information
df['rain']

rain
<lgl>
True
False
False
True
True


In [167]:
# Grab the days information
df['days']

days
<fct>
Mon
Tue
Wed
Thu
Fri
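The selection styles shown so far differ in what they return. The following sketch compares them on a small Dataframe named `weather` (a stand-in for `df`, rebuilt here so the block runs on its own).

```r
weather <- data.frame(
    days = c('Mon', 'Tue', 'Wed', 'Thu', 'Fri'),
    rain = c(TRUE, FALSE, FALSE, TRUE, TRUE),
    stringsAsFactors = FALSE)

class(weather['rain'])                   # "data.frame" (one column)
class(weather[['rain']])                 # "logical" (the vector itself)
class(weather$rain)                      # "logical"
class(weather[, 'rain'])                 # "logical" (dropped to a vector)
class(weather[, 'rain', drop = FALSE])   # "data.frame" (drop = FALSE keeps the Dataframe)
```

Using `drop = FALSE` is a common way to keep a result as a Dataframe when later code expects one.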


## Reference Cells

Individual cells can be referenced with double-bracket notation, and a block of cells can be extracted into a new Dataframe with single-bracket notation.

In [168]:
# Creating a vector from a referenced cell
vec <- df[[5,2]]
vec

In [169]:
# Making a new Dataframe from df
newdf <- df[1:3, 1:2]
newdf

days,temp
<fct>,<dbl>
Mon,76.1
Tue,82.3
Wed,85.4


In [170]:
# Changing the values of cells
df[[2, 'rain']] <- TRUE
df

days,temp,rain
<fct>,<dbl>,<lgl>
Mon,76.1,True
Tue,82.3,True
Wed,85.4,False
Thu,78.9,True
Fri,76.2,True


## Creating new columns

Using brackets or the '$' notation, new columns can easily be added from vectors or from combinations of other columns within the Dataframe.

In [171]:
df

days,temp,rain
<fct>,<dbl>,<lgl>
Mon,76.1,True
Tue,82.3,True
Wed,85.4,False
Thu,78.9,True
Fri,76.2,True


In [172]:
df$celcius <- (df$temp - 32) * 5/9
df

days,temp,rain,celcius
<fct>,<dbl>,<lgl>,<dbl>
Mon,76.1,True,24.5
Tue,82.3,True,27.94444
Wed,85.4,False,29.66667
Thu,78.9,True,26.05556
Fri,76.2,True,24.55556


In [173]:
df['over.80'] <- df['temp'] > 80
df

days,temp,rain,celcius,over.80
<fct>,<dbl>,<lgl>,<dbl>,"<lgl[,1]>"
Mon,76.1,True,24.5,False
Tue,82.3,True,27.94444,True
Wed,85.4,False,29.66667,True
Thu,78.9,True,26.05556,False
Fri,76.2,True,24.55556,False
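The `<lgl[,1]>` type reported for over.80 above appears to indicate a one-column matrix rather than a plain logical vector. That would follow from comparing the one-column Dataframe `df['temp']` instead of the vector `df$temp`. The sketch below shows the difference between the two comparisons on a small stand-in Dataframe; `weather`, `cmp_df` and `cmp_vec` are names introduced for this example.

```r
weather <- data.frame(temp = c(76.1, 82.3, 85.4))

# Comparing a one-column Dataframe yields a logical matrix
cmp_df <- weather['temp'] > 80
is.matrix(cmp_df)      # TRUE

# Comparing the column vector yields a plain logical vector
cmp_vec <- weather$temp > 80
is.matrix(cmp_vec)     # FALSE

# Assigning the vector gives an ordinary logical column
weather$over.80 <- cmp_vec
str(weather)
```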


## Filtering using subsetting conditions

Use the built-in R function **subset()** to grab a subset of rows from an R Dataframe based on a set of conditions. For example, imagine grabbing the days where it rained (rain == TRUE).

In [174]:
# Grabbing a subset of data where rain==TRUE
subset(df, subset= rain==TRUE)

Unnamed: 0_level_0,days,temp,rain,celcius,over.80
Unnamed: 0_level_1,<fct>,<dbl>,<lgl>,<dbl>,"<lgl[,1]>"
1,Mon,76.1,True,24.5,False
2,Tue,82.3,True,27.94444,True
4,Thu,78.9,True,26.05556,False
5,Fri,76.2,True,24.55556,False


NOTE: The condition is a logical expression built with a comparison operator. In the above example, the operator '==' keeps the rows where the column value is equal to TRUE.

In [175]:
# Grab a subset where temperature was greater than 80 degrees
subset(df, temp > 80)

Unnamed: 0_level_0,days,temp,rain,celcius,over.80
Unnamed: 0_level_1,<fct>,<dbl>,<lgl>,<dbl>,"<lgl[,1]>"
2,Tue,82.3,True,27.94444,True
3,Wed,85.4,False,29.66667,True


NOTE: When using **subset()**, the R Dataframe is passed as the first argument, and the condition is evaluated inside that Dataframe. That means that the column names DO NOT need to be quoted as strings or prefixed with the Dataframe name.
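`subset()` can also restrict the columns it returns through its `select` argument, again with unquoted column names. A minimal sketch, using a rebuilt copy of the weather data under the illustrative name `weather`:

```r
weather <- data.frame(
    days = c('Mon', 'Tue', 'Wed', 'Thu', 'Fri'),
    temp = c(76.1, 82.3, 85.4, 78.9, 76.2),
    rain = c(TRUE, FALSE, FALSE, TRUE, TRUE),
    stringsAsFactors = FALSE)

# Rows where temp is above 80, keeping only the days and temp columns
hot_days <- subset(weather, temp > 80, select = c(days, temp))
hot_days
```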

## Conditional Subsetting

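Multiple conditions can be combined to subset a Dataframe using the operators & (and) and | (or). The same filter can be written with **subset()** or with bracket notation, as the paired cells below show.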

In [176]:
df

days,temp,rain,celcius,over.80
<fct>,<dbl>,<lgl>,<dbl>,"<lgl[,1]>"
Mon,76.1,True,24.5,False
Tue,82.3,True,27.94444,True
Wed,85.4,False,29.66667,True
Thu,78.9,True,26.05556,False
Fri,76.2,True,24.55556,False


In [177]:
subset(df, days != 'Wed' & temp < 80)

Unnamed: 0_level_0,days,temp,rain,celcius,over.80
Unnamed: 0_level_1,<fct>,<dbl>,<lgl>,<dbl>,"<lgl[,1]>"
1,Mon,76.1,True,24.5,False
4,Thu,78.9,True,26.05556,False
5,Fri,76.2,True,24.55556,False


In [178]:
df[(df$days != 'Wed' & df$temp < 80),]

Unnamed: 0_level_0,days,temp,rain,celcius,over.80
Unnamed: 0_level_1,<fct>,<dbl>,<lgl>,<dbl>,"<lgl[,1]>"
1,Mon,76.1,True,24.5,False
4,Thu,78.9,True,26.05556,False
5,Fri,76.2,True,24.55556,False


In [179]:
subset(df, days == 'Wed' | temp > 80)

Unnamed: 0_level_0,days,temp,rain,celcius,over.80
Unnamed: 0_level_1,<fct>,<dbl>,<lgl>,<dbl>,"<lgl[,1]>"
2,Tue,82.3,True,27.94444,True
3,Wed,85.4,False,29.66667,True


In [180]:
df[(df$days == 'Wed' | df$temp > 80),]

Unnamed: 0_level_0,days,temp,rain,celcius,over.80
Unnamed: 0_level_1,<fct>,<dbl>,<lgl>,<dbl>,"<lgl[,1]>"
2,Tue,82.3,True,27.94444,True
3,Wed,85.4,False,29.66667,True
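The two styles give the same rows here, but they treat missing values differently. The sketch below uses a made-up Dataframe `readings` with one NA temperature to show this; wrapping the condition in `which()` is one common way to make bracket notation skip NA conditions.

```r
# Hypothetical data with a missing temperature on Tuesday
readings <- data.frame(
    day  = c('Mon', 'Tue', 'Wed'),
    temp = c(76.1, NA, 85.4),
    stringsAsFactors = FALSE)

# subset() treats an NA condition as FALSE and drops the row
by_subset <- subset(readings, temp > 80)
nrow(by_subset)     # 1

# Bracket notation keeps an all-NA row for the NA condition
by_bracket <- readings[readings$temp > 80, ]
nrow(by_bracket)    # 2

# which() returns only the positions that are TRUE
by_which <- readings[which(readings$temp > 80), ]
nrow(by_which)      # 1
```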


## Ordering R Dataframes

An R Dataframe can be sorted by a specific column. Use the built-in R function **order()** to create a vector of row indices in sorted order, then pass that vector as the row selector in bracket notation.

In [181]:
sorted.temp <- order(df$temp)

In [182]:
df[sorted.temp,]

Unnamed: 0_level_0,days,temp,rain,celcius,over.80
Unnamed: 0_level_1,<fct>,<dbl>,<lgl>,<dbl>,"<lgl[,1]>"
1,Mon,76.1,True,24.5,False
5,Fri,76.2,True,24.55556,False
4,Thu,78.9,True,26.05556,False
2,Tue,82.3,True,27.94444,True
3,Wed,85.4,False,29.66667,True


Let's look at what **order()** returns to see how the R Dataframe is sorted.

In [183]:
sorted.temp

So **order()** returns the row indices that would put the temperatures in ascending order (the default). Passing that index vector as the row selector of df returns the rows in sorted order. Negate the values to sort in descending order.

In [184]:
# Use double brackets to extract the column as a vector,
# then negate it to sort in descending order
desc.temp <- order(-df[['temp']])

In [185]:
df[desc.temp,]

Unnamed: 0_level_0,days,temp,rain,celcius,over.80
Unnamed: 0_level_1,<fct>,<dbl>,<lgl>,<dbl>,"<lgl[,1]>"
3,Wed,85.4,False,29.66667,True
2,Tue,82.3,True,27.94444,True
4,Thu,78.9,True,26.05556,False
5,Fri,76.2,True,24.55556,False
1,Mon,76.1,True,24.5,False
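Negation only works for numeric columns. As a sketch of two alternatives, `order()` also takes a `decreasing` argument and accepts several keys to break ties. The Dataframe `weather` and the index names below are introduced for this example.

```r
weather <- data.frame(
    days = c('Mon', 'Tue', 'Wed', 'Thu', 'Fri'),
    temp = c(76.1, 82.3, 85.4, 78.9, 76.2),
    rain = c(TRUE, FALSE, FALSE, TRUE, TRUE),
    stringsAsFactors = FALSE)

# Descending sort without negating the column
desc_idx <- order(weather$temp, decreasing = TRUE)
desc_idx                 # 3 2 4 5 1

# Sort by rain first (FALSE before TRUE), then by temp within each group
multi_idx <- order(weather$rain, weather$temp)
multi_idx                # 2 3 1 5 4
weather[multi_idx, ]
```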


# <font color='red'>Merging and Binding R Dataframes</font>

As we learned with matrices, we can add rows and columns using the built-in R functions **rbind()** and **cbind()**, respectively.

In [186]:
# Vector of Integers
c1 <- 1:10

In [187]:
# Vector of Characters - Use letters to pull from the alphabet
c2 <- letters[1:10]

In [188]:
df <- data.frame(col.name.1 = c1, col.name.2 = c2)

In [189]:
df

col.name.1,col.name.2
<int>,<fct>
1,a
2,b
3,c
4,d
5,e
6,f
7,g
8,h
9,i
10,j


In [190]:
# Adding rows using rbind()
df2 <- data.frame(col.name.1=2000, col.name.2='new')
rbind(df, df2)

col.name.1,col.name.2
<dbl>,<fct>
1,a
2,b
3,c
4,d
5,e
6,f
7,g
8,h
9,i
10,j


In [191]:
# Adding columns
times.2 <- df$col.name.1 * 2
cbind(df, times.2)

col.name.1,col.name.2,times.2
<int>,<fct>,<dbl>
1,a,2
2,b,4
3,c,6
4,d,8
5,e,10
6,f,12
7,g,14
8,h,16
9,i,18
10,j,20
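The tables above show at most ten rows, so the row added by `rbind()` is not visible in them. A short sketch for confirming the result, using rebuilt copies of the two Dataframes under the illustrative names `base_df`, `extra_row` and `combined`:

```r
base_df <- data.frame(col.name.1 = 1:10, col.name.2 = letters[1:10],
                      stringsAsFactors = FALSE)
extra_row <- data.frame(col.name.1 = 2000, col.name.2 = 'new',
                        stringsAsFactors = FALSE)

# rbind() matches columns by name, so both Dataframes need the same column names
combined <- rbind(base_df, extra_row)
nrow(combined)        # 11
tail(combined, n = 2) # the last row holds 2000 and 'new'
```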


## Merging Dataframes

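Using the built-in R function **merge()**, we'll join two Dataframes on a key column that they have in common. By default only keys found in both Dataframes are kept; setting all = TRUE keeps every key and fills the gaps with NA.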

In [192]:
peanut.weight <- data.frame(
    name = I(c('Bailey','Bailey II','Wynne','Emery','Sullivan', 'Gregory')),
    weight = c(89.4, 92.3, 105.2, 103.1, 99.6, 89.1))
peanut.skins <- data.frame(
    name = I(c('Bailey','Bailey II','Wynne','Emery','Sullivan','Brantley')),
    skin = c('Tan', 'Tan', 'Pink', 'Pink', 'Tan', 'Red'))

In [193]:
peanut.weight

name,weight
<I<chr>>,<dbl>
Bailey,89.4
Bailey II,92.3
Wynne,105.2
Emery,103.1
Sullivan,99.6
Gregory,89.1


In [194]:
peanut.skins

name,skin
<I<chr>>,<fct>
Bailey,Tan
Bailey II,Tan
Wynne,Pink
Emery,Pink
Sullivan,Tan
Brantley,Red


In [195]:
merge(peanut.weight, peanut.skins, by.x = 'name', by.y = 'name')

name,weight,skin
<I<chr>>,<dbl>,<fct>
Bailey,89.4,Tan
Bailey II,92.3,Tan
Emery,103.1,Pink
Sullivan,99.6,Tan
Wynne,105.2,Pink


In [196]:
peanuts <- merge(peanut.weight, peanut.skins, by.x = 'name', by.y = 'name', all = TRUE)
peanuts

name,weight,skin
<I<chr>>,<dbl>,<fct>
Bailey,89.4,Tan
Bailey II,92.3,Tan
Brantley,,Red
Emery,103.1,Pink
Gregory,89.1,
Sullivan,99.6,Tan
Wynne,105.2,Pink
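Between those two extremes, `merge()` also has `all.x` and `all.y` arguments that keep every row from only the first or only the second Dataframe. A minimal sketch with the same peanut data (rebuilt with plain character names so the block is self-contained; `left_join` and `right_join` are illustrative names):

```r
peanut.weight <- data.frame(
    name = c('Bailey', 'Bailey II', 'Wynne', 'Emery', 'Sullivan', 'Gregory'),
    weight = c(89.4, 92.3, 105.2, 103.1, 99.6, 89.1),
    stringsAsFactors = FALSE)
peanut.skins <- data.frame(
    name = c('Bailey', 'Bailey II', 'Wynne', 'Emery', 'Sullivan', 'Brantley'),
    skin = c('Tan', 'Tan', 'Pink', 'Pink', 'Tan', 'Red'),
    stringsAsFactors = FALSE)

# Keep every row of peanut.weight; Gregory has no skin value
left_join <- merge(peanut.weight, peanut.skins, by = 'name', all.x = TRUE)
left_join

# Keep every row of peanut.skins; Brantley has no weight value
right_join <- merge(peanut.weight, peanut.skins, by = 'name', all.y = TRUE)
right_join
```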


## Dealing with Missing Data

Handling missing data is a very important skill when working with Dataframes. Use the built-in R functions **any()** and **is.na()** in tandem to check whether a Dataframe or a column contains missing data points.

In [197]:
# Determine if there are missing data anywhere in peanuts
any(is.na(peanuts))

In [198]:
# Determine missing data points within columns
any(is.na(peanuts$weight))

In [199]:
any(is.na(peanuts$name))

In [200]:
# Keep only the rows where weight is not missing, using '!' meaning 'NOT'
peanuts[!is.na(peanuts$weight),]

Unnamed: 0_level_0,name,weight,skin
Unnamed: 0_level_1,<I<chr>>,<dbl>,<fct>
1,Bailey,89.4,Tan
2,Bailey II,92.3,Tan
4,Emery,103.1,Pink
5,Gregory,89.1,
6,Sullivan,99.6,Tan
7,Wynne,105.2,Pink


In [201]:
# Replace the NA values with information
peanuts[is.na(peanuts)] <- 90.0

"invalid factor level, NA generated"

NOTE: This replaces the NA value in the numeric 'weight' column with 90.0. Notice that the NA for Gregory under the column 'skin' is not replaced: that column is a factor and 90 is not one of its levels, which is what triggers the warning above.

In [202]:
peanuts

name,weight,skin
<I<chr>>,<dbl>,<fct>
Bailey,89.4,Tan
Bailey II,92.3,Tan
Brantley,90.0,Red
Emery,103.1,Pink
Gregory,89.1,
Sullivan,99.6,Tan
Wynne,105.2,Pink


In [203]:
# For selected columns
peanuts$skin[is.na(peanuts$skin)] <- 'Tan'

In [204]:
peanuts

name,weight,skin
<I<chr>>,<dbl>,<fct>
Bailey,89.4,Tan
Bailey II,92.3,Tan
Brantley,90.0,Red
Emery,103.1,Pink
Gregory,89.1,Tan
Sullivan,99.6,Tan
Wynne,105.2,Pink
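A few other base R tools are often used for the same job. The sketch below, on a made-up three-row Dataframe called `nuts`, counts NA values per column, drops incomplete rows with `na.omit()`, and fills a numeric gap with the column mean. Using the mean is only one possible choice of fill value.

```r
# Hypothetical data with one missing weight and one missing skin
nuts <- data.frame(
    name   = c('Bailey', 'Brantley', 'Gregory'),
    weight = c(89.4, NA, 103.1),
    skin   = c('Tan', 'Red', NA),
    stringsAsFactors = FALSE)

# Number of missing values in each column
na_counts <- colSums(is.na(nuts))
na_counts

# Drop every row that has at least one NA
complete_rows <- na.omit(nuts)
nrow(complete_rows)    # 1

# Fill the missing weight with the mean of the observed weights
nuts$weight[is.na(nuts$weight)] <- mean(nuts$weight, na.rm = TRUE)
nuts$weight            # 89.40 96.25 103.10
```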


# <font color='red'>R List Basics</font>

We've covered vectors, matrices and Dataframes. This is the perfect opportunity to discuss the basics of the R list, a data structure that can hold objects of different types and sizes and is created with the built-in function **list()**.

In [205]:
# Create a vector
v <- c(1,2,3,4,5)

# Create a matrix
m <- matrix(1:10, nrow = 2)

# Create a Dataframe
df <- women

In [206]:
v

In [207]:
m

0,1,2,3,4
1,3,5,7,9
2,4,6,8,10


In [208]:
df

height,weight
<dbl>,<dbl>
58,115
59,117
60,120
61,123
62,126
63,129
64,132
65,135
66,139
67,142


## Creating Lists

Use the built-in R function **list()** to create a list data structure containing v, m and df.

In [209]:
li <- list(v, m, df)

In [210]:
li

0,1,2,3,4
1,3,5,7,9
2,4,6,8,10

height,weight
<dbl>,<dbl>
58,115
59,117
60,120
61,123
62,126
63,129
64,132
65,135
66,139
67,142


NOTE: The printed list includes index values showing which element to reference to pull out v (1), m (2) or df (3). The elements can also be given names, similar to keys in a Python dictionary.

In [211]:
li <- list(sample_vec = v, sample_mat = m, sample_df = df)

In [212]:
li

0,1,2,3,4
1,3,5,7,9
2,4,6,8,10

height,weight
<dbl>,<dbl>
58,115
59,117
60,120
61,123
62,126
63,129
64,132
65,135
66,139
67,142


In [213]:
# Use brackets to select an object from the list using an index
li[1]

In [214]:
# Use brackets to select an object by name
li['sample_vec']

In [215]:
# Use double brackets to extract the element itself rather than a one-element list
li[['sample_vec']]

In [216]:
# Can use the '$' notation
li$sample_vec

In [217]:
# Use indexing to select elements within objects
li[['sample_vec']][1]

In [218]:
li[['sample_mat']]

0,1,2,3,4
1,3,5,7,9
2,4,6,8,10


In [219]:
li[['sample_mat']][1,]

In [220]:
li[['sample_mat']][1:2,1:2]

0,1
1,3
2,4


In [221]:
li[['sample_df']]['height']

height
<dbl>
58
59
60
61
62
63
64
65
66
67
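The difference between single and double brackets is easiest to see from the class of what comes back. A compact sketch, with a smaller list rebuilt under the illustrative name `sample_list`:

```r
sample_list <- list(sample_vec = c(1, 2, 3, 4, 5),
                    sample_mat = matrix(1:10, nrow = 2))

# Single brackets return a list containing the selected element
class(sample_list['sample_vec'])      # "list"

# Double brackets return the element itself
class(sample_list[['sample_vec']])    # "numeric"

# Once extracted, the element can be indexed as usual
sample_list[['sample_mat']][2, 3]     # 6
```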


## Combining Lists

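Lists can be combined. Use the built-in R function combine **c()** to join lists end to end: the result is a single flat list (six elements below), not a list nested inside a list. To nest lists, wrap them with **list()** instead, for example list(li, li).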

In [222]:
# Doubling list
double_list <- c(li, li)

In [223]:
double_list

0,1,2,3,4
1,3,5,7,9
2,4,6,8,10

height,weight
<dbl>,<dbl>
58,115
59,117
60,120
61,123
62,126
63,129
64,132
65,135
66,139
67,142

0,1,2,3,4
1,3,5,7,9
2,4,6,8,10

height,weight
<dbl>,<dbl>
58,115
59,117
60,120
61,123
62,126
63,129
64,132
65,135
66,139
67,142


In [224]:
str(double_list)

List of 6
 $ sample_vec: num [1:5] 1 2 3 4 5
 $ sample_mat: int [1:2, 1:5] 1 2 3 4 5 6 7 8 9 10
 $ sample_df :'data.frame':	15 obs. of  2 variables:
  ..$ height: num [1:15] 58 59 60 61 62 63 64 65 66 67 ...
  ..$ weight: num [1:15] 115 117 120 123 126 129 132 135 139 142 ...
 $ sample_vec: num [1:5] 1 2 3 4 5
 $ sample_mat: int [1:2, 1:5] 1 2 3 4 5 6 7 8 9 10
 $ sample_df :'data.frame':	15 obs. of  2 variables:
  ..$ height: num [1:15] 58 59 60 61 62 63 64 65 66 67 ...
  ..$ weight: num [1:15] 115 117 120 123 126 129 132 135 139 142 ...
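To contrast combining with nesting, the sketch below builds both from a small list. The names `small_list`, `flat` and `nested` are made up for this example.

```r
small_list <- list(sample_vec = c(1, 2, 3), sample_chr = 'a')

# c() joins the lists into one flat list
flat <- c(small_list, small_list)
length(flat)                      # 4

# list() keeps each input as a single element, giving a nested list
nested <- list(first = small_list, second = small_list)
length(nested)                    # 2

# Reach into the nested structure one level at a time
nested$second$sample_vec[2]       # 2
```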


# End of Dataframes