# Dataframes are a commonly used way to store and work with data, and it's pretty easy to convert between a .csv file and a dataframe

In [1]:
# each column of a dataframe holds a single type of data

# Here I'm making and populating a dataframe using the data.frame() function
# (the single TRUE given for columnTwo is recycled to fill all 5 rows)
exampleDF <- data.frame(columnOne = 1:5, columnTwo = TRUE, columnThree = c("one", "two", "three", "four", "five"))

# Look at the structure of the dataframe
print("structure")
str(exampleDF)

# Look at a summary of the dataframe
print("summary")
summary(exampleDF)

# Look at the dataframe itself
print("dataframe")
exampleDF

[1] "structure"
'data.frame':	5 obs. of  3 variables:
 $ columnOne  : int  1 2 3 4 5
 $ columnTwo  : logi  TRUE TRUE TRUE TRUE TRUE
 $ columnThree: chr  "one" "two" "three" "four" ...
[1] "summary"


   columnOne columnTwo      columnThree       
 Min.   :1   Mode:logical   Length:5          
 1st Qu.:2   TRUE:5         Class :character  
 Median :3                  Mode  :character  
 Mean   :3                                    
 3rd Qu.:4                                    
 Max.   :5                                    

[1] "dataframe"


columnOne,columnTwo,columnThree
<int>,<lgl>,<chr>
1,True,one
2,True,two
3,True,three
4,True,four
5,True,five
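
Example: a minimal sketch of the .csv round trip mentioned in the heading, using base R's write.csv() and read.csv(). The dataframe csvDF and the temporary file path csvPath are made up for illustration and are not part of the original notebook.

```r
# hypothetical example data; csvPath is a temporary file chosen for illustration
csvDF <- data.frame(id = 1:3, label = c("a", "b", "c"))
csvPath <- tempfile(fileext = ".csv")

# dataframe -> .csv (row.names = FALSE avoids writing an extra column of row names)
write.csv(csvDF, csvPath, row.names = FALSE)

# .csv -> dataframe
csvBack <- read.csv(csvPath)
str(csvBack)
```

Without row.names = FALSE, the file gains an unnamed first column holding the row names, which then shows up as an extra column when the file is read back.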


# Ways to pull specific rows, columns, and values from a dataframe

In [2]:
# changing the dataframe to contain a different variety of values
exampleDF<-data.frame(one=c(6,NA,8,NA,9),two=c(TRUE,FALSE,TRUE,FALSE,TRUE),three=1:5)
exampleDF

one,two,three
<dbl>,<lgl>,<int>
6.0,True,1
,False,2
8.0,True,3
,False,4
9.0,True,5


In [3]:
# access one row by index
print("one row:")
exampleDF[3,]

# access one value in a specific row and column
print("one value:")
exampleDF[3,2]

[1] "one row:"


,one,two,three
,<dbl>,<lgl>,<int>
3,8,True,3


[1] "one value:"


In [4]:
# access one column by index
# vector of values in a column
exampleDF[,3]
exampleDF[[3]]

# a single index with no comma also selects a column, but returns a one-column data frame
# (easy to confuse with the df[row, column] form, where the first index is the row)
exampleDF[3]

# the distinction to note: [,3] and [[3]] return a vector, while [3] returns a data frame

three
<int>
1
2
3
4
5
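
Example: one way to see the vector vs dataframe distinction is to check class() on each form. This is a sketch on a small made-up dataframe (demoDF is an assumed name, not from the notebook); drop = FALSE is shown as one option for keeping the dataframe structure with the [row, column] form.

```r
# hypothetical small dataframe for illustration
demoDF <- data.frame(one = c(6, NA, 8), three = 1:3)

vecComma   <- demoDF[, 2]                 # vector of values
vecDouble  <- demoDF[[2]]                 # vector of values
dfSingle   <- demoDF[2]                   # one-column dataframe
dfNoDrop   <- demoDF[, 2, drop = FALSE]   # one-column dataframe, via [row, column]

class(vecComma)
class(vecDouble)
class(dfSingle)
class(dfNoDrop)
```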


In [5]:
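# find the max value of a data frame and the column it's in
val <- -Inf
final <- NA
for (col in names(exampleDF)) {
    # [[col]] gives the column as a vector; na.rm = TRUE keeps NA from poisoning the max
    temp <- max(exampleDF[[col]], na.rm = TRUE)
    if (temp > val) {
        val <- temp
        final <- col
    }
}
# report the running maximum (val), not the last column's max (temp)
sprintf("Max value is %g in column %s.", val, final)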

In [6]:
# show me rows of 'exampleDF' where the values in exampleDF column 'one' are greater than 3
# note: NA > 3 evaluates to NA, so rows where 'one' is NA come back as all-NA rows
exampleDF[ exampleDF$one > 3 ,]

,one,two,three
,<dbl>,<lgl>,<int>
1,6.0,True,1
NA,,,
3,8.0,True,3
NA.1,,,
5,9.0,True,5
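
Example: the extra rows in the output above appear because a logical index containing NA returns an all-NA row for each NA. A sketch of two base R ways to avoid that, on a made-up copy of the data (demoDF, cleanRows and viaSubset are assumed names):

```r
# hypothetical copy of the notebook's data
demoDF <- data.frame(one = c(6, NA, 8, NA, 9),
                     two = c(TRUE, FALSE, TRUE, FALSE, TRUE),
                     three = 1:5)

# explicitly exclude NA before testing the value (FALSE & NA is FALSE)
cleanRows <- demoDF[!is.na(demoDF$one) & demoDF$one > 3, ]

# subset() treats NA in the condition as FALSE
viaSubset <- subset(demoDF, one > 3)

cleanRows
viaSubset
```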


In [7]:
# show rows where values in exampleDF column 'two' are TRUE
exampleDF[ exampleDF$two == TRUE ,]

,one,two,three
,<dbl>,<lgl>,<int>
1,6,True,1
3,8,True,3
5,9,True,5
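
Example: because column 'two' is already logical, the comparison with TRUE is optional; the column can be used directly as the row selector, and ! flips it. A sketch on a made-up copy of the data (demoDF is an assumed name):

```r
# hypothetical copy of the notebook's data
demoDF <- data.frame(one = c(6, NA, 8, NA, 9),
                     two = c(TRUE, FALSE, TRUE, FALSE, TRUE),
                     three = 1:5)

trueRows  <- demoDF[demoDF$two, ]    # same rows as demoDF[demoDF$two == TRUE, ]
falseRows <- demoDF[!demoDF$two, ]   # rows where 'two' is FALSE

trueRows
falseRows
```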


In [8]:
# function is.na() recognizes NA and NaN values
is.na(NA)
is.na(NaN)
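
Example: is.na() and is.nan() are not interchangeable. As a small sketch (naChecks and nanChecks are assumed names), is.na() is TRUE for both NA and NaN, while is.nan() is TRUE only for NaN:

```r
naChecks  <- c(is.na(NA),  is.na(NaN))    # TRUE TRUE
nanChecks <- c(is.nan(NA), is.nan(NaN))   # FALSE TRUE

naChecks
nanChecks
```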

In [9]:
# rows where column 'three' is NA (there are none, so the result is empty)
exampleDF[ is.na(exampleDF$three) ,]

one,two,three
<dbl>,<lgl>,<int>


In [10]:
# rows where column 'three' is not NA (all of them here)
exampleDF[ !is.na(exampleDF$three) ,]

,one,two,three
,<dbl>,<lgl>,<int>
1,6.0,True,1
2,,False,2
3,8.0,True,3
4,,False,4
5,9.0,True,5


# These can be used to subset data while retaining the original DF

In [11]:
exDFnoNA<-exampleDF[ !is.na(exampleDF$three) ,]
exDFnoNA # has only rows where column three is not NA

exampleDF # stays the same

,one,two,three
,<dbl>,<lgl>,<int>
1,6.0,True,1
2,,False,2
3,8.0,True,3
4,,False,4
5,9.0,True,5


one,two,three
<dbl>,<lgl>,<int>
6.0,True,1
,False,2
8.0,True,3
,False,4
9.0,True,5
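
Example: column 'three' has no NA, so the subset above keeps every row. A sketch of the same idea applied to column 'one', which does contain NA, plus base R's complete.cases() for dropping rows with NA in any column. The names demoDF, noNAinOne and completeRows are assumptions for illustration.

```r
# hypothetical copy of the notebook's data
demoDF <- data.frame(one = c(6, NA, 8, NA, 9),
                     two = c(TRUE, FALSE, TRUE, FALSE, TRUE),
                     three = 1:5)

# keep rows where column 'one' is not NA
noNAinOne <- demoDF[!is.na(demoDF$one), ]

# keep rows with no NA in any column
completeRows <- demoDF[complete.cases(demoDF), ]

nrow(noNAinOne)
nrow(demoDF)   # the original still has all of its rows
```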


## Another way is to use which()

In [12]:
# returns the row position(s) where the logical test evaluated to TRUE
which(exampleDF$one > 3)

# this can be used to access those specific rows in the data frame
# (which() skips NA results, so no all-NA rows show up this time)
exampleDF[ which(exampleDF$one > 3) ,]

,one,two,three
,<dbl>,<lgl>,<int>
1,6,True,1
3,8,True,3
5,9,True,5


# This can also be used to overwrite values, though it does so for the whole row (and assigning a string turns every column into character)

In [13]:
test<-exampleDF
test[ which(test$one > 3) ,] <- 'test'

test

one,two,three
<chr>,<chr>,<chr>
test,test,test
,FALSE,2
test,test,test
,FALSE,4
test,test,test
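
Example: to overwrite only one column in the matching rows, a column can be named after the comma. This is a sketch on a made-up copy of the data (patched is an assumed name); using a numeric replacement leaves the column types as they were.

```r
# hypothetical copy of the notebook's data
patched <- data.frame(one = c(6, NA, 8, NA, 9),
                      two = c(TRUE, FALSE, TRUE, FALSE, TRUE),
                      three = 1:5)

# overwrite column 'one' only, in the rows where 'one' is greater than 3
patched[which(patched$one > 3), "one"] <- 0

str(patched)
```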


## Counting things

In [14]:
# number of rows
nrow(exampleDF)

# number of columns
ncol(exampleDF)
length(exampleDF)

# logical test on an entire column of data frame
exampleDF$two == TRUE

# is.na() returns TRUE or FALSE for each value in the column
is.na( exampleDF$three )

# sum() counts how many times the logical test evaluates to TRUE (TRUE counts as 1, FALSE as 0)
sum( is.na(exampleDF$three) )
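
Example: the same counting idea on a column that does contain NA, and across all columns at once with colSums(). A sketch on a made-up copy of the data; demoDF, naInOne, naPerColumn, trueInTwo and naShare are assumed names.

```r
# hypothetical copy of the notebook's data
demoDF <- data.frame(one = c(6, NA, 8, NA, 9),
                     two = c(TRUE, FALSE, TRUE, FALSE, TRUE),
                     three = 1:5)

naInOne     <- sum(is.na(demoDF$one))    # number of NA in column 'one'
naPerColumn <- colSums(is.na(demoDF))    # number of NA in every column
trueInTwo   <- sum(demoDF$two)           # number of TRUE in column 'two'
naShare     <- mean(is.na(demoDF$one))   # proportion of NA in column 'one'

naPerColumn
```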

## For loops to modify a dataframe

In [15]:
exampleDF

one,two,three
<dbl>,<lgl>,<int>
6.0,True,1
,False,2
8.0,True,3
,False,4
9.0,True,5


### Building a for loop to modify one column in a dataframe step by step

1. Isolate a column

In [16]:
exampleDF['three']

three
<int>
1
2
3
4
5


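2. Make a for loop over the column. Note that exampleDF['three'] is a one-column dataframe, so the loop body runs once with i holding the whole column as a vector (hence the single line of output); loop over exampleDF[['three']] to visit each item individually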

In [17]:
for(i in exampleDF['three']){
    print(i)
}

[1] 1 2 3 4 5
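
Example: a sketch that counts how many times each loop body runs, to show the difference between looping over a one-column dataframe and looping over the column vector. The names demoDF, runsSingle and runsDouble are assumptions for illustration.

```r
# hypothetical one-column dataframe
demoDF <- data.frame(three = 1:5)

# looping over df['three'] visits the columns of a one-column dataframe
runsSingle <- 0
for (i in demoDF["three"]) {
    runsSingle <- runsSingle + 1
}

# looping over df[['three']] visits each value in the column
runsDouble <- 0
for (i in demoDF[["three"]]) {
    runsDouble <- runsDouble + 1
}

runsSingle
runsDouble
```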


Example: use the items to calculate something new

In [20]:
for(i in exampleDF['three']){
    print( i*2 )
}

[1]  2  4  6  8 10


Example: create a new column with modified values (note: you could also replace the items in the original dataframe by assigning to the same column)

In [21]:
for(i in exampleDF['three']){
    exampleDF['doubled']<-i*2 
}

exampleDF

one,two,three,doubled
<dbl>,<lgl>,<int>,<dbl>
6.0,True,1,2
,False,2,4
8.0,True,3,6
,False,4,8
9.0,True,5,10
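
Example: because arithmetic in R is vectorized, the new column can also be created without a loop. A sketch comparing that with a row-by-row loop over seq_len(nrow()), on a made-up dataframe (demoDF and the column name doubledLoop are assumptions):

```r
# hypothetical one-column dataframe
demoDF <- data.frame(three = 1:5)

# vectorized: the whole column is doubled in one step
demoDF$doubled <- demoDF$three * 2

# row-by-row loop that fills a pre-created column one value at a time
demoDF$doubledLoop <- NA_real_
for (r in seq_len(nrow(demoDF))) {
    demoDF[r, "doubledLoop"] <- demoDF[r, "three"] * 2
}

demoDF
```

Both columns end up with the same values; the vectorized form is usually shorter and easier to read.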


### The benefit of accessing a column with dataframe['name of column'] rather than dataframe$nameOfColumn is that [ ] can take a variable as input

In [26]:
colName<-'two'
# can be used to access the specific column (as a one-column dataframe)
exampleDF[colName]

# or a vector of the data in the column
exampleDF[[colName]]

two
<lgl>
True
False
True
False
True
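
Example: a sketch of why the variable form matters. With $, R looks for a column literally named after the text following the $, so a variable holding the name is not used; [[ ]] uses the variable's value, which also makes it possible to loop over column names. demoDF and naCounts are assumed names for illustration.

```r
# hypothetical small dataframe
demoDF <- data.frame(one = c(6, NA, 8), two = c(TRUE, FALSE, TRUE))
colName <- "two"

viaDollar   <- demoDF$colName      # NULL: there is no column called "colName"
viaBrackets <- demoDF[[colName]]   # the values of column 'two'

# count the NA values in each column by looping over the column names
naCounts <- c()
for (col in names(demoDF)) {
    naCounts[col] <- sum(is.na(demoDF[[col]]))
}
naCounts
```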
