# Introduction to R-DataFrames

This notebook covers the following topics:

- The R-DataFrame Basics
- R-DataFrames Indexing and Slicing
- Overview of DataFrame Operations

So sit tight and let's enjoy the ride 🚘

# Table of Contents <a name="table"/>

1. [DataFrame Basics](#dataframe_basics)
    - [Head and Tail](#head_tail)
    - [Structure and Summary](#structure_summary)
    - [Creating a DataFrame](#dataframe_creation)
2. [Indexing and Selection](#indexing_selection)
    - [Indexing](#indexing)
    - [Selection](#selection)
3. [Overview of DataFrame Operations](#overview)
    - [Creating a DataFrame](#create_dataframe)
    - [Reading and Writing a DataFrame](#read_write)
    - [Getting the Information about the DataFrame](#info)
    - [Referencing Cells](#cell)
    - [Adding Rows](#add_rows)
    - [Adding Columns](#add_col)
    - [Dealing with Missing Values](#miss_val)

# DataFrame Basics <a name="dataframe_basics"/>

[Jump to the Table of Contents](#table)

In [3]:
state.x77

,Population,Income,Illiteracy,Life Exp,Murder,HS Grad,Frost,Area
Alabama,3615,3624,2.1,69.05,15.1,41.3,20,50708
Alaska,365,6315,1.5,69.31,11.3,66.7,152,566432
Arizona,2212,4530,1.8,70.55,7.8,58.1,15,113417
Arkansas,2110,3378,1.9,70.66,10.1,39.9,65,51945
California,21198,5114,1.1,71.71,10.3,62.6,20,156361
Colorado,2541,4884,0.7,72.06,6.8,63.9,166,103766
Connecticut,3100,5348,1.1,72.48,3.1,56.0,139,4862
Delaware,579,4809,0.9,70.06,6.2,54.6,103,1982
Florida,8277,4815,1.3,70.66,10.7,52.6,11,54090
Georgia,4931,4091,2.0,68.54,13.9,40.6,60,58073


In [4]:
women

height,weight
<dbl>,<dbl>
58,115
59,117
60,120
61,123
62,126
63,129
64,132
65,135
66,139
67,142


In [5]:
USPersonalExpenditure

,1940,1945,1950,1955,1960
Food and Tobacco,22.2,44.5,59.6,73.2,86.8
Household Operation,10.5,15.5,29.0,36.5,46.2
Medical and Health,3.53,5.76,9.71,14.0,21.1
Personal Care,1.04,1.98,2.45,3.4,5.4
Private Education,0.341,0.974,1.8,2.6,3.64


In [6]:
data()

Package,Item,Title
<chr>,<chr>,<chr>
datasets,AirPassengers,Monthly Airline Passenger Numbers 1949-1960
datasets,BJsales,Sales Data with Leading Indicator
datasets,BJsales.lead (BJsales),Sales Data with Leading Indicator
datasets,BOD,Biochemical Oxygen Demand
datasets,CO2,Carbon Dioxide Uptake in Grass Plants
datasets,ChickWeight,Weight versus age of chicks on different diets
datasets,DNase,Elisa assay of DNase
datasets,EuStockMarkets,"Daily Closing Prices of Major European Stock Indices, 1991-1998"
datasets,Formaldehyde,Determination of Formaldehyde
datasets,HairEyeColor,Hair and Eye Color of Statistics Students


## Head and Tail <a name="head_tail"/>
[Jump to the Main Heading](#dataframe_basics)

In [7]:
head(state.x77, 10)

,Population,Income,Illiteracy,Life Exp,Murder,HS Grad,Frost,Area
Alabama,3615,3624,2.1,69.05,15.1,41.3,20,50708
Alaska,365,6315,1.5,69.31,11.3,66.7,152,566432
Arizona,2212,4530,1.8,70.55,7.8,58.1,15,113417
Arkansas,2110,3378,1.9,70.66,10.1,39.9,65,51945
California,21198,5114,1.1,71.71,10.3,62.6,20,156361
Colorado,2541,4884,0.7,72.06,6.8,63.9,166,103766
Connecticut,3100,5348,1.1,72.48,3.1,56.0,139,4862
Delaware,579,4809,0.9,70.06,6.2,54.6,103,1982
Florida,8277,4815,1.3,70.66,10.7,52.6,11,54090
Georgia,4931,4091,2.0,68.54,13.9,40.6,60,58073


In [10]:
tail(state.x77, 10)

,Population,Income,Illiteracy,Life Exp,Murder,HS Grad,Frost,Area
South Dakota,681,4167,0.5,72.08,1.7,53.3,172,75955
Tennessee,4173,3821,1.7,70.11,11.0,41.8,70,41328
Texas,12237,4188,2.2,70.9,12.2,47.4,35,262134
Utah,1203,4022,0.6,72.9,4.5,67.3,137,82096
Vermont,472,3907,0.6,71.64,5.5,57.1,168,9267
Virginia,4981,4701,1.4,70.08,9.5,47.8,85,39780
Washington,3559,4864,0.6,71.72,4.3,63.5,32,66570
West Virginia,1799,3617,1.4,69.48,6.7,41.6,100,24070
Wisconsin,4589,4468,0.7,72.48,3.0,54.5,149,54464
Wyoming,376,4566,0.6,70.29,6.9,62.9,173,97203
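
The calls above pass the row count explicitly. As a small sketch of the other common forms, the snippet below uses the built-in `women` data frame (15 rows) to show the default count and a negative count; it only reports row counts, so it does not depend on any values.

```r
# women is a built-in data frame with 15 rows
print(nrow(head(women)))        # 6: the default n
print(nrow(head(women, 3)))     # 3
print(nrow(head(women, -10)))   # 5: a negative n drops the last 10 rows
print(nrow(tail(women, 2)))     # 2
```

A negative `n` is handy when you want "everything except the last few rows" without computing the row count yourself.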


## Structure and Summary <a name="structure_summary"/>
[Jump to the Main Heading](#dataframe_basics)

In [11]:
str(state.x77)

 num [1:50, 1:8] 3615 365 2212 2110 21198 ...
 - attr(*, "dimnames")=List of 2
  ..$ : chr [1:50] "Alabama" "Alaska" "Arizona" "Arkansas" ...
  ..$ : chr [1:8] "Population" "Income" "Illiteracy" "Life Exp" ...
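
Note that `str()` reports `state.x77` as a numeric matrix with dimnames, not as a `'data.frame'`. The sketch below shows one way to convert it with `as.data.frame()`; the name `states.df` is an assumption introduced only for this example.

```r
# state.x77 ships with R's built-in datasets package as a numeric matrix
print(class(state.x77))
print(is.data.frame(state.x77))

# Convert it to a data frame; the state names are kept as row names
# `states.df` is a hypothetical name used only in this sketch
states.df = as.data.frame(state.x77)
print(is.data.frame(states.df))

# Some column names contain spaces, so use [[ ]] or backticks to access them
print(head(states.df[["Life Exp"]], 3))
print(head(states.df$`Life Exp`, 3))
```

After the conversion, data-frame-only tools such as `$` column access and `subset()` work on the state data.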


In [9]:
print(summary(state.x77))

   Population        Income       Illiteracy       Life Exp    
 Min.   :  365   Min.   :3098   Min.   :0.500   Min.   :67.96  
 1st Qu.: 1080   1st Qu.:3993   1st Qu.:0.625   1st Qu.:70.12  
 Median : 2838   Median :4519   Median :0.950   Median :70.67  
 Mean   : 4246   Mean   :4436   Mean   :1.170   Mean   :70.88  
 3rd Qu.: 4968   3rd Qu.:4814   3rd Qu.:1.575   3rd Qu.:71.89  
 Max.   :21198   Max.   :6315   Max.   :2.800   Max.   :73.60  
     Murder          HS Grad          Frost             Area       
 Min.   : 1.400   Min.   :37.80   Min.   :  0.00   Min.   :  1049  
 1st Qu.: 4.350   1st Qu.:48.05   1st Qu.: 66.25   1st Qu.: 36985  
 Median : 6.850   Median :53.25   Median :114.50   Median : 54277  
 Mean   : 7.378   Mean   :53.11   Mean   :104.46   Mean   : 70736  
 3rd Qu.:10.675   3rd Qu.:59.15   3rd Qu.:139.75   3rd Qu.: 81162  
 Max.   :15.100   Max.   :67.30   Max.   :188.00   Max.   :566432  


## Creating a DataFrame <a name="dataframe_creation"/>

[Jump to the Main Heading](#dataframe_basics)

In [14]:
# Seven day labels, random temperatures between 22 and 50, and random rain flags
days = c("Mon", "Tue", "Wed", "Thr", "Fri", "Sat", "Sun")
temps = runif(7, min=22, max=50)
rain = sample(c(TRUE, FALSE), 7, replace=TRUE)

# Create a sample dataframe
sample.df = data.frame(days, temps, rain)
sample.df

days,temps,rain
<chr>,<dbl>,<lgl>
Mon,42.65094,False
Tue,43.09041,False
Wed,30.6826,False
Thr,43.0591,True
Fri,46.36578,True
Sat,28.84725,False
Sun,47.0194,True


In [15]:
str(sample.df)

'data.frame':	7 obs. of  3 variables:
 $ days : chr  "Mon" "Tue" "Wed" "Thr" ...
 $ temps: num  42.7 43.1 30.7 43.1 46.4 ...
 $ rain : logi  FALSE FALSE FALSE TRUE TRUE FALSE ...


In [16]:
summary(sample.df)

     days               temps          rain        
 Length:7           Min.   :28.85   Mode :logical  
 Class :character   1st Qu.:36.67   FALSE:4        
 Mode  :character   Median :43.06   TRUE :3        
                    Mean   :40.25                  
                    3rd Qu.:44.73                  
                    Max.   :47.02                  

# Indexing and Selection <a name="indexing_selection"/>

[Jump to the Table of Contents](#table)

## Indexing <a name="indexing"/>

[Jump to the Main Heading](#indexing_selection)

In [17]:
sample.df[1, ]

,days,temps,rain
,<chr>,<dbl>,<lgl>
1,Mon,42.65094,False


In [19]:
print(sample.df[, 1])

[1] "Mon" "Tue" "Wed" "Thr" "Fri" "Sat" "Sun"


In [20]:
print(sample.df[, 'rain'])

[1] FALSE FALSE FALSE  TRUE  TRUE FALSE  TRUE
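
Selecting a single column with `[, j]` returns a plain vector, as the two outputs above show. The sketch below illustrates how `drop = FALSE` keeps the result as a data frame; `demo.df` and its values are made up for this example and stand in for `sample.df`.

```r
# A small stand-in for sample.df with fixed values (hypothetical data)
demo.df = data.frame(
  days = c("Mon", "Tue", "Wed"),
  temps = c(30.5, 41.2, 25.0),
  rain = c(FALSE, TRUE, FALSE)
)

# Selecting one column with [, j] drops to a plain vector
col.vec = demo.df[, "temps"]
print(class(col.vec))            # "numeric"

# drop = FALSE keeps the result as a one-column data frame
col.df = demo.df[, "temps", drop = FALSE]
print(class(col.df))             # "data.frame"

# Single-bracket indexing without a comma also returns a data frame
print(class(demo.df["temps"]))   # "data.frame"
```

The distinction matters when a later step expects a data frame (for example `cbind()` with named columns) or a vector (for example `order()` or `mean()`).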


## Selection <a name="selection"/>
[Jump to the Main Heading](#indexing_selection)

In [23]:
sample.df[1:3, c("days", "temps")]

,days,temps
,<chr>,<dbl>
1,Mon,42.65094
2,Tue,43.09041
3,Wed,30.6826


In [26]:
print(sample.df$days)

[1] "Mon" "Tue" "Wed" "Thr" "Fri" "Sat" "Sun"


In [27]:
print(sample.df$rain)

[1] FALSE FALSE FALSE  TRUE  TRUE FALSE  TRUE


In [29]:
subset(sample.df, subset=rain == TRUE)

,days,temps,rain
,<chr>,<dbl>,<lgl>
4,Thr,43.0591,True
5,Fri,46.36578,True
7,Sun,47.0194,True


In [31]:
subset(sample.df, subset=temps>45)

,days,temps,rain
,<chr>,<dbl>,<lgl>
5,Fri,46.36578,True
7,Sun,47.0194,True
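
`subset()` is a convenience wrapper around logical row indexing. The sketch below, using a made-up `demo.df`, compares it with the bracket form and shows how to combine conditions and keep only some columns.

```r
# A small stand-in data frame with fixed values (hypothetical data)
demo.df = data.frame(
  days = c("Mon", "Tue", "Wed", "Thu"),
  temps = c(30.5, 46.2, 25.0, 47.9),
  rain = c(FALSE, TRUE, FALSE, FALSE)
)

# subset() evaluates the condition using the data frame's own columns
hot.subset = subset(demo.df, subset = temps > 45)

# Equivalent bracket form: a logical vector picks the rows
hot.bracket = demo.df[demo.df$temps > 45, ]
print(all.equal(hot.subset, hot.bracket))

# Conditions combine with & (and) and | (or); select= keeps chosen columns
hot.dry = subset(demo.df, subset = temps > 45 & !rain, select = c(days, temps))
print(hot.dry)
```

One difference worth knowing: `subset()` silently drops rows where the condition is `NA`, whereas the bracket form returns a row of `NA`s for them.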


In [35]:
# order() expects a vector, so extract the column with $ instead of ['temps']
sorted.idx = order(sample.df$temps)
sample.df[sorted.idx, ]

,days,temps,rain
,<chr>,<dbl>,<lgl>
6,Sat,28.84725,False
3,Wed,30.6826,False
1,Mon,42.65094,False
4,Thr,43.0591,True
2,Tue,43.09041,False
5,Fri,46.36578,True
7,Sun,47.0194,True


In [37]:
# Sort in descending order; again pass the column as a vector
desc.idx = order(sample.df$temps, decreasing=TRUE)
sample.df[desc.idx, ]

,days,temps,rain
,<chr>,<dbl>,<lgl>
7,Sun,47.0194,True
5,Fri,46.36578,True
2,Tue,43.09041,False
4,Thr,43.0591,True
1,Mon,42.65094,False
3,Wed,30.6826,False
6,Sat,28.84725,False


# Overview of DataFrame Operations <a name="overview"/>

[Jump to the Table of Contents](#table)

## Creating a DataFrame <a name="create_dataframe"/>

[Jump to the Main Heading](#overview)

In [40]:
# Create an empty dataframe
empty = data.frame()
print(empty)

# Create a dataframe from vectors (runif already returns a vector, so no c() is needed)
c1 = floor(runif(10, min=20, max=100))
c2 = floor(runif(10, min=-34, max=0))
c3 = floor(runif(10, min=100, max=1000))
c4 = sample(letters, 10, replace=TRUE)

c.df = data.frame(c1, c2, c3, c4)
print(c.df)




data frame with 0 columns and 0 rows
   c1  c2  c3 c4
1  42 -27 485  g
2  20 -31 494  u
3  81 -23 644  c
4  52  -7 708  x
5  40  -2 773  t
6  97  -3 963  h
7  94 -14 285  y
8  57 -16 307  a
9  27 -23 913  t
10 89 -13 417  v


## Reading and Writing a DataFrame <a name="read_write"/>

[Jump to the Main Heading](#overview)

In [43]:
# Read a csv
df = read.csv("../data/sample_files.csv")

# Write a csv
write.csv(sample.df, file="../data/dummy.csv")
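
By default `write.csv()` also writes the row names as an unnamed first column, which comes back as an extra column when the file is read again. The sketch below is a self-contained round trip under the assumption that a temporary file is acceptable; `demo.df`, `csv.path`, and `round.trip` are names introduced only for this example.

```r
# A small stand-in data frame with fixed values (hypothetical data)
demo.df = data.frame(
  days = c("Mon", "Tue", "Wed"),
  temps = c(30.5, 41.2, 25.0),
  rain = c(FALSE, TRUE, FALSE)
)

# Write to a temporary file so the sketch does not depend on ../data/
csv.path = tempfile(fileext = ".csv")

# row.names = FALSE stops write.csv from adding a column of row labels
write.csv(demo.df, file = csv.path, row.names = FALSE)

# Read the file back; column types are inferred from the text
round.trip = read.csv(csv.path)
str(round.trip)

# Clean up the temporary file
unlink(csv.path)
```

With `row.names = FALSE` the file read back has the same three columns as the data frame that was written.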

## Getting the Information about the DataFrame <a name="info"/>

[Jump to the Main Heading](#overview)

In [44]:
sample.df

days,temps,rain
<chr>,<dbl>,<lgl>
Mon,42.65094,False
Tue,43.09041,False
Wed,30.6826,False
Thr,43.0591,True
Fri,46.36578,True
Sat,28.84725,False
Sun,47.0194,True


In [45]:
print(nrow(sample.df)) # Outputs the number of rows in the dataframe

[1] 7


In [46]:
print(ncol(sample.df)) # Outputs the number of columns in the dataframe

[1] 3


In [47]:
print(colnames(sample.df)) # Outputs the column names of the dataframe

[1] "days"  "temps" "rain" 


In [48]:
print(rownames(sample.df)) # Outputs the row names of the dataframe

[1] "1" "2" "3" "4" "5" "6" "7"


## Referencing Cells <a name="cell"/>

[Jump to the Main Heading](#overview)

In [51]:
print(sample.df[[2, 1]])

[1] "Tue"


In [52]:
print(sample.df[[2, 'days']])

[1] "Tue"


In [53]:
sample.df[[2, 'temps']] = 50
print(sample.df[[2, 'temps']])

[1] 50


## Adding Rows <a name="add_rows"/>

[Jump to the Main Heading](#overview)

In [56]:
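# Append the new row as a one-row data frame so each column keeps its type;
# a mixed c(100, -24, 19, "k") vector would coerce every value to character
c.df_new = rbind(c.df, data.frame(c1=100, c2=-24, c3=19, c4="k"))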
print(c.df_new)

    c1  c2  c3 c4
1   42 -27 485  g
2   20 -31 494  u
3   81 -23 644  c
4   52  -7 708  x
5   40  -2 773  t
6   97  -3 963  h
7   94 -14 285  y
8   57 -16 307  a
9   27 -23 913  t
10  89 -13 417  v
11 100 -24  19  k
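
When appending rows with `rbind()`, the type of the new row matters even though the printed table looks the same either way. The sketch below uses a made-up two-column `base.df` to show how a mixed `c()` vector changes a numeric column to character, while a one-row data frame preserves it.

```r
# A tiny data frame with one numeric and one character column (hypothetical data)
base.df = data.frame(num = c(1, 2), chr = c("a", "b"), stringsAsFactors = FALSE)

# c() coerces mixed values to a single type, here character
print(class(c(3, "c")))      # "character"

# Binding that vector turns the numeric column into character
coerced = rbind(base.df, c(3, "c"))
print(class(coerced$num))    # "character"

# A one-row data frame keeps each column's type
typed = rbind(base.df, data.frame(num = 3, chr = "c", stringsAsFactors = FALSE))
print(class(typed$num))      # "numeric"
```

Checking `str()` after an `rbind()` is a quick way to catch this kind of silent type change.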


## Adding Columns <a name="add_col"/>

[Jump to the Main Heading](#overview)

In [57]:
c.df_new = cbind(c.df_new, c5=sample(c("H", "T"), 11, replace=TRUE))
print(c.df_new)

    c1  c2  c3 c4 c5
1   42 -27 485  g  H
2   20 -31 494  u  T
3   81 -23 644  c  T
4   52  -7 708  x  H
5   40  -2 773  t  H
6   97  -3 963  h  T
7   94 -14 285  y  T
8   57 -16 307  a  T
9   27 -23 913  t  H
10  89 -13 417  v  H
11 100 -24  19  k  H


## Dealing with Missing Values <a name="miss_val"/>

[Jump to the Main Heading](#overview)

In [58]:
any(is.na(sample.df))

In [59]:
any(is.na(sample.df$temps))
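
The two checks above only detect missing values. As a minimal sketch of what might come next, the example below builds a made-up `na.df` that actually contains `NA`s, counts them per column, drops incomplete rows, and fills the gaps with the column mean. Mean imputation is just one simple choice used for illustration, not a recommendation.

```r
# A small data frame with deliberate gaps (hypothetical data)
na.df = data.frame(
  days = c("Mon", "Tue", "Wed", "Thu"),
  temps = c(30.5, NA, 25.0, 47.9),
  rain = c(FALSE, TRUE, NA, FALSE)
)

# Count missing values per column
print(colSums(is.na(na.df)))

# Keep only the rows that have no missing values
complete.df = na.omit(na.df)
print(nrow(complete.df))                 # 2

# Many summaries return NA unless told to skip missing values
print(mean(na.df$temps))                 # NA
print(mean(na.df$temps, na.rm = TRUE))

# Replace missing temperatures with the mean of the observed ones
filled.df = na.df
filled.df$temps[is.na(filled.df$temps)] = mean(na.df$temps, na.rm = TRUE)
print(any(is.na(filled.df$temps)))       # FALSE
```

`complete.cases(na.df)` returns the logical vector behind `na.omit()`, which is useful when you want to inspect the incomplete rows instead of dropping them.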