# Dataframe Basics
We've learned about vectors and their two-dimensional counterpart, matrices. Now we will learn about data frames, one of the main tools for data analysis with R! Matrices were limited because all the data inside a matrix had to be of the same data type (numerics, logicals, etc.). With data frames we can mix data types across columns, which makes them a very powerful data structure!

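R ships with built-in datasets that you can play around with for quick reference! Check out the following examples. Note that some of them (state.x77 and USPersonalExpenditure) are actually stored as matrices, while others (such as women) are true data frames.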

In [31]:
# Data about US states (stored as a matrix)
state.x77

Unnamed: 0,Population,Income,Illiteracy,Life Exp,Murder,HS Grad,Frost,Area
Alabama,3615,3624,2.1,69.05,15.1,41.3,20,50708
Alaska,365,6315,1.5,69.31,11.3,66.7,152,566432
Arizona,2212,4530,1.8,70.55,7.8,58.1,15,113417
Arkansas,2110,3378,1.9,70.66,10.1,39.9,65,51945
California,21198,5114,1.1,71.71,10.3,62.6,20,156361
Colorado,2541,4884,0.7,72.06,6.8,63.9,166,103766
Connecticut,3100,5348,1.1,72.48,3.1,56.0,139,4862
Delaware,579,4809,0.9,70.06,6.2,54.6,103,1982
Florida,8277,4815,1.3,70.66,10.7,52.6,11,54090
Georgia,4931,4091,2.0,68.54,13.9,40.6,60,58073


In [32]:
# US personal expenditure (stored as a matrix)
USPersonalExpenditure

Unnamed: 0,1940,1945,1950,1955,1960
Food and Tobacco,22.2,44.5,59.6,73.2,86.8
Household Operation,10.5,15.5,29.0,36.5,46.2
Medical and Health,3.53,5.76,9.71,14.0,21.1
Personal Care,1.04,1.98,2.45,3.4,5.4
Private Education,0.341,0.974,1.8,2.6,3.64


In [33]:
# Average heights and weights of American women
women

height,weight
<dbl>,<dbl>
58,115
59,117
60,120
61,123
62,126
63,129
64,132
65,135
66,139
67,142


In [34]:
# To get a list of all available built-in datasets, use:
data()

## Working with DataFrames
You'll notice the states data was really big. We can use the head() and tail() functions to view the first and last 6 rows respectively. Let's take a look:

In [35]:
# Quick variable assignment to save typing
states <- state.x77

In [36]:
head(states)

Unnamed: 0,Population,Income,Illiteracy,Life Exp,Murder,HS Grad,Frost,Area
Alabama,3615,3624,2.1,69.05,15.1,41.3,20,50708
Alaska,365,6315,1.5,69.31,11.3,66.7,152,566432
Arizona,2212,4530,1.8,70.55,7.8,58.1,15,113417
Arkansas,2110,3378,1.9,70.66,10.1,39.9,65,51945
California,21198,5114,1.1,71.71,10.3,62.6,20,156361
Colorado,2541,4884,0.7,72.06,6.8,63.9,166,103766


In [37]:
tail(states)

Unnamed: 0,Population,Income,Illiteracy,Life Exp,Murder,HS Grad,Frost,Area
Vermont,472,3907,0.6,71.64,5.5,57.1,168,9267
Virginia,4981,4701,1.4,70.08,9.5,47.8,85,39780
Washington,3559,4864,0.6,71.72,4.3,63.5,32,66570
West Virginia,1799,3617,1.4,69.48,6.7,41.6,100,24070
Wisconsin,4589,4468,0.7,72.48,3.0,54.5,149,54464
Wyoming,376,4566,0.6,70.29,6.9,62.9,173,97203
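
Both functions also accept a row count if six rows is not what you want. Here is a minimal sketch using the n argument; the variable names first.three and last.two are made up for this illustration:

```r
# state.x77 ships with base R, so no import is needed
states <- state.x77

# Ask for a specific number of rows with the n argument
first.three <- head(states, n = 3)   # first 3 rows
last.two    <- tail(states, n = 2)   # last 2 rows

rownames(first.three)
rownames(last.two)
```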


### DataFrames - Overview of information
We can use str() to get the structure of a data frame, such as its variable names and data types. We can use summary() to get a quick statistical summary of all the columns. Depending on the data, this may or may not be useful!

In [38]:
# Statistical summary of data
summary(states)

   Population        Income       Illiteracy       Life Exp    
 Min.   :  365   Min.   :3098   Min.   :0.500   Min.   :67.96  
 1st Qu.: 1080   1st Qu.:3993   1st Qu.:0.625   1st Qu.:70.12  
 Median : 2838   Median :4519   Median :0.950   Median :70.67  
 Mean   : 4246   Mean   :4436   Mean   :1.170   Mean   :70.88  
 3rd Qu.: 4968   3rd Qu.:4814   3rd Qu.:1.575   3rd Qu.:71.89  
 Max.   :21198   Max.   :6315   Max.   :2.800   Max.   :73.60  
     Murder          HS Grad          Frost             Area       
 Min.   : 1.400   Min.   :37.80   Min.   :  0.00   Min.   :  1049  
 1st Qu.: 4.350   1st Qu.:48.05   1st Qu.: 66.25   1st Qu.: 36985  
 Median : 6.850   Median :53.25   Median :114.50   Median : 54277  
 Mean   : 7.378   Mean   :53.11   Mean   :104.46   Mean   : 70736  
 3rd Qu.:10.675   3rd Qu.:59.15   3rd Qu.:139.75   3rd Qu.: 81163  
 Max.   :15.100   Max.   :67.30   Max.   :188.00   Max.   :566432  

In [39]:
# Structure of Data
str(states)

 num [1:50, 1:8] 3615 365 2212 2110 21198 ...
 - attr(*, "dimnames")=List of 2
  ..$ : chr [1:50] "Alabama" "Alaska" "Arizona" "Arkansas" ...
  ..$ : chr [1:8] "Population" "Income" "Illiteracy" "Life Exp" ...
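
The str() output above starts with num [1:50, 1:8], which tells us that state.x77 is really a numeric matrix rather than a data frame. A minimal sketch of converting it with as.data.frame() follows; the name states.df is just an illustrative choice:

```r
# state.x77 is a matrix, so $ column access would not work on it directly
is.matrix(state.x77)

# Convert the matrix to a data frame; row and column names carry over
states.df <- as.data.frame(state.x77)
is.data.frame(states.df)

# str() now reports a data frame with one line per column
str(states.df)

# Column access with $ works after the conversion
states.df$Income[1:3]
```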


### Creating Data frames
A quick note: some people write Dataframe as one word, but in R it's more commonly written as two words: data frame. It's not a big deal either way, but if someone writes DataFrame they may be referring to a Python/pandas DataFrame, so keep that in mind!

We can create data frames using the data.frame() function and pass vectors as arguments, which will then convert the vectors into columns of the data frame. Let's see a simple example:

In [40]:
# Some made up weather data
days <- c('mon','tue','wed','thu','fri')
temp <- c(22.2,21,23,24.3,25)
rain <- c(TRUE, TRUE, FALSE, FALSE, TRUE)

In [41]:
# Pass in the vectors:
df <- data.frame(days,temp,rain)

In [42]:
df

days,temp,rain
<chr>,<dbl>,<lgl>
mon,22.2,True
tue,21.0,True
wed,23.0,False
thu,24.3,False
fri,25.0,True


In [43]:
# Check structure
str(df)

'data.frame':	5 obs. of  3 variables:
 $ days: chr  "mon" "tue" "wed" "thu" ...
 $ temp: num  22.2 21 23 24.3 25
 $ rain: logi  TRUE TRUE FALSE FALSE TRUE


In [44]:
summary(df)

     days                temp         rain        
 Length:5           Min.   :21.0   Mode :logical  
 Class :character   1st Qu.:22.2   FALSE:2        
 Mode  :character   Median :23.0   TRUE :3        
                    Mean   :23.1                  
                    3rd Qu.:24.3                  
                    Max.   :25.0                  
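
If you would rather not reuse the vector names as column names, data.frame() also accepts name = value pairs. This is a small sketch with made-up column names (day, temperature, rained):

```r
# Name the columns explicitly instead of reusing the vector names
weather <- data.frame(
    day         = c('mon', 'tue', 'wed'),
    temperature = c(22.2, 21, 23),
    rained      = c(TRUE, TRUE, FALSE)
)

names(weather)   # the column names we chose
dim(weather)     # rows, then columns
```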

## Data Frame Selection and Indexing
We've seen how to call built-in data frames and how to create them using data.frame() along with vectors. Let's revisit our weather data frame and learn how to select elements from within the dataframe using bracket notation:

In [45]:
# Some made up weather data
days <- c('mon','tue','wed','thu','fri')
temp <- c(22.2,21,23,24.3,25)
rain <- c(TRUE, TRUE, FALSE, FALSE, TRUE)

# Pass in the vectors:
df <- data.frame(days,temp,rain)

In [46]:
df

days,temp,rain
<chr>,<dbl>,<lgl>
mon,22.2,True
tue,21.0,True
wed,23.0,False
thu,24.3,False
fri,25.0,True


We can use the same bracket notation we used for matrices:

df[rows,columns]

In [47]:
# Everything from first row
df[1,]

Unnamed: 0_level_0,days,temp,rain
Unnamed: 0_level_1,<chr>,<dbl>,<lgl>
1,mon,22.2,True


In [48]:
# Everything from first column (returned as a vector)
df[,1]

In [49]:
# Grab Friday data
df[5,]

Unnamed: 0_level_0,days,temp,rain
Unnamed: 0_level_1,<chr>,<dbl>,<lgl>
5,fri,25,True


### Selecting using column names
Here is where data frames become very powerful: we can use column names to select data instead of having to remember column numbers. For example:

In [50]:
# All rain values
df[,'rain']

In [51]:
# First 5 rows for days and temps
df[1:5,c('days','temp')]

Unnamed: 0_level_0,days,temp
Unnamed: 0_level_1,<chr>,<dbl>
1,mon,22.2
2,tue,21.0
3,wed,23.0
4,thu,24.3
5,fri,25.0


If you want all the values of a particular column you can use the dollar sign directly after the dataframe as follows:

df.name$column.name

In [52]:
df$rain

In [136]:
df$days

In [137]:
# The dollar sign only works with column names, not positions, so this fails to parse
df$1

ERROR: Error in parse(text = x, srcfile = src): <text>:1:4: unexpected numeric constant
1: df$1
       ^


You can also use bracket notation to return a data frame format of the same information:

In [54]:
df['rain']

rain
<lgl>
True
True
False
False
True


In [55]:
df['days']

days
<chr>
mon
tue
wed
thu
fri
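
To make the difference between these selection styles concrete, here is a short sketch that checks what each one returns. The data frame df.demo is a cut-down copy of the weather data, created only for this illustration:

```r
df.demo <- data.frame(days = c('mon', 'tue', 'wed'),
                      temp = c(22.2, 21, 23))

class(df.demo['temp'])     # single brackets with a name: a data frame
class(df.demo[['temp']])   # double brackets: a plain vector
class(df.demo$temp)        # dollar sign: a plain vector
class(df.demo[, 'temp'])   # [rows, column] with one column: a plain vector

# drop = FALSE keeps the data frame format even for a single column
class(df.demo[, 'temp', drop = FALSE])
```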


### Filtering with a subset condition
We can use the subset() function to grab a subset of values from our data frame based on some condition. For example, imagine we wanted to grab the days where it rained (rain == TRUE). We can use the subset() function as follows:

In [56]:
subset(df,subset=rain==TRUE)

Unnamed: 0_level_0,days,temp,rain
Unnamed: 0_level_1,<chr>,<dbl>,<lgl>
1,mon,22.2,True
2,tue,21.0,True
5,fri,25.0,True


Notice how the condition uses some sort of comparison operator, in the above case ==. Let's grab days where the temperature was greater than 23:

In [57]:
subset(df,subset= temp>23)

Unnamed: 0_level_0,days,temp,rain
Unnamed: 0_level_1,<chr>,<dbl>,<lgl>
4,thu,24.3,False
5,fri,25.0,True


Another thing to note is that we didn't pass in the column name as a character string. subset() knows that you are referring to a column in that data frame.
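
Conditions can be combined with & and |, and subset() also has a select argument for choosing columns. A minimal sketch, rebuilding the weather data under the hypothetical name weather so the block runs on its own:

```r
weather <- data.frame(days = c('mon', 'tue', 'wed', 'thu', 'fri'),
                      temp = c(22.2, 21, 23, 24.3, 25),
                      rain = c(TRUE, TRUE, FALSE, FALSE, TRUE))

# Rainy days warmer than 22, keeping only the days and temp columns
warm.and.wet <- subset(weather, subset = rain & temp > 22,
                       select = c(days, temp))
warm.and.wet

# The same selection with bracket notation, for comparison
same.rows <- weather[weather$rain & weather$temp > 22, c('days', 'temp')]
```

With brackets you have to spell out weather$ for each column, which is the main convenience subset() saves you.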

## Ordering a Data Frame
We can sort the order of our data frame by using the order function. You pass in the column you want to sort by into the order() function, then you use that vector to select from the dataframe. Let's see an example of sorting by the temperature:

In [66]:
sorted.temp <- order(df[["temp"]])

In [67]:
sorted.temp

In [68]:
df[sorted.temp,]

Unnamed: 0_level_0,days,temp,rain
Unnamed: 0_level_1,<chr>,<dbl>,<lgl>
2,tue,21.0,True
1,mon,22.2,True
3,wed,23.0,False
4,thu,24.3,False
5,fri,25.0,True


So we are just asking for the rows at those index positions, in that order. The default is ascending; for a numeric column we can add a negative sign to get descending order:

In [70]:
desc.temp <- order(-df[['temp']])

In [71]:
df[desc.temp,]

Unnamed: 0_level_0,days,temp,rain
Unnamed: 0_level_1,<chr>,<dbl>,<lgl>
5,fri,25.0,True
4,thu,24.3,False
3,wed,23.0,False
1,mon,22.2,True
2,tue,21.0,True


We could have also used the other column selection methods we learned:

In [72]:
sort.temp <- order(df$temp)
df[sort.temp,]

Unnamed: 0_level_0,days,temp,rain
Unnamed: 0_level_1,<chr>,<dbl>,<lgl>
2,tue,21.0,True
1,mon,22.2,True
3,wed,23.0,False
4,thu,24.3,False
5,fri,25.0,True
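
The negative sign trick only works for numeric columns. As a sketch of two other options, order() accepts several sort keys and also has a decreasing argument that works for character columns. The names weather, by.rain.then.temp, and days.desc are made up for this example:

```r
weather <- data.frame(days = c('mon', 'tue', 'wed', 'thu', 'fri'),
                      temp = c(22.2, 21, 23, 24.3, 25),
                      rain = c(TRUE, TRUE, FALSE, FALSE, TRUE))

# Sort by rain first (FALSE before TRUE), then by temp from high to low
by.rain.then.temp <- order(weather$rain, -weather$temp)
weather[by.rain.then.temp, ]

# decreasing = TRUE reverses the order, including for character columns
days.desc <- order(weather$days, decreasing = TRUE)
weather[days.desc, ]
```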


## Merging Data Frames
Let's learn how to merge Data Frames together (you'll use this in your Final Data Frame Project!)

In [73]:
## use character columns of names to get sensible sort order
authors <- data.frame(
    surname = I(c("Tukey", "Venables", "Tierney", "Ripley", "McNeil")),
    nationality = c("US", "Australia", "US", "UK", "Australia"),
    deceased = c("yes", rep("no", 4)))

In [74]:
books <- data.frame(
    name = I(c("Tukey", "Venables", "Tierney",
             "Ripley", "Ripley", "McNeil", "R Core")),
    title = c("Exploratory Data Analysis",
              "Modern Applied Statistics ...",
              "LISP-STAT",
              "Spatial Statistics", "Stochastic Simulation",
              "Interactive Data Analysis",
              "An Introduction to R"),
    other.author = c(NA, "Ripley", NA, NA, NA, NA,
                     "Venables & Smith"))

In [138]:
authors

surname,nationality,deceased
<I<chr>>,<chr>,<chr>
Tukey,US,yes
Venables,Australia,no
Tierney,US,no
Ripley,UK,no
McNeil,Australia,no


In [139]:
books

name,title,other.author
<I<chr>>,<chr>,<chr>
Tukey,Exploratory Data Analysis,
Venables,Modern Applied Statistics ...,Ripley
Tierney,LISP-STAT,
Ripley,Spatial Statistics,
Ripley,Stochastic Simulation,
McNeil,Interactive Data Analysis,
R Core,An Introduction to R,Venables & Smith


In [76]:
(m1 <- merge(authors, books, by.x = "surname", by.y = "name"))

surname,nationality,deceased,title,other.author
<I<chr>>,<chr>,<chr>,<chr>,<chr>
McNeil,Australia,no,Interactive Data Analysis,
Ripley,UK,no,Spatial Statistics,
Ripley,UK,no,Stochastic Simulation,
Tierney,US,no,LISP-STAT,
Tukey,US,yes,Exploratory Data Analysis,
Venables,Australia,no,Modern Applied Statistics ...,Ripley


In [98]:
(m2 <- merge(books, authors, by.x = "name", by.y = "surname"))
stopifnot(as.character(m1[, 1]) == as.character(m2[, 1]),
          all.equal(m1[, -1], m2[, -1][ names(m1)[-1] ]),
          dim(merge(m1, m2, by = integer(0))) == c(36, 10))

## "R Core" is missing from authors and appears only here :
merge(authors, books, by.x = "surname", by.y = "name", all = TRUE)

## example of using 'incomparables'
x <- data.frame(k1 = c(NA,NA,3,4,5), k2 = c(1,NA,NA,4,5), data = 1:5)
y <- data.frame(k1 = c(NA,2,NA,4,5), k2 = c(NA,NA,3,4,5), data = 1:5)
merge(x, y, by = c("k1","k2")) # NA's match
merge(x, y, by = "k1") # NA's match, so 6 rows
merge(x, y, by = "k2", incomparables = NA) # 2 rows

name,title,other.author,nationality,deceased
<I<chr>>,<chr>,<chr>,<chr>,<chr>
McNeil,Interactive Data Analysis,,Australia,no
Ripley,Spatial Statistics,,UK,no
Ripley,Stochastic Simulation,,UK,no
Tierney,LISP-STAT,,US,no
Tukey,Exploratory Data Analysis,,US,yes
Venables,Modern Applied Statistics ...,Ripley,Australia,no


surname,nationality,deceased,title,other.author
<I<chr>>,<chr>,<chr>,<chr>,<chr>
McNeil,Australia,no,Interactive Data Analysis,
R Core,,,An Introduction to R,Venables & Smith
Ripley,UK,no,Spatial Statistics,
Ripley,UK,no,Stochastic Simulation,
Tierney,US,no,LISP-STAT,
Tukey,US,yes,Exploratory Data Analysis,
Venables,Australia,no,Modern Applied Statistics ...,Ripley


k1,k2,data.x,data.y
<dbl>,<dbl>,<int>,<int>
4.0,4.0,4,4
5.0,5.0,5,5
,,2,1


k1,k2.x,data.x,k2.y,data.y
<dbl>,<dbl>,<int>,<dbl>,<int>
4.0,4.0,4,4.0,4
5.0,5.0,5,5.0,5
,1.0,1,,1
,1.0,1,3.0,3
,,2,,1
,,2,3.0,3


k2,k1.x,data.x,k1.y,data.y
<dbl>,<dbl>,<int>,<dbl>,<int>
4,4,4,4,4
5,5,5,5,5
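
The all argument used above keeps unmatched rows from both sides. merge() also has all.x and all.y if you only want to keep unmatched rows from one side. Here is a small sketch on trimmed-down copies of the two tables; authors.demo and books.demo are names invented for this example:

```r
authors.demo <- data.frame(surname = c("Tukey", "Ripley", "McNeil"),
                           nationality = c("US", "UK", "Australia"))
books.demo <- data.frame(name = c("Tukey", "Ripley", "Ripley", "R Core"),
                         title = c("Exploratory Data Analysis",
                                   "Spatial Statistics",
                                   "Stochastic Simulation",
                                   "An Introduction to R"))

# Default: only rows with a match on both sides
inner <- merge(authors.demo, books.demo, by.x = "surname", by.y = "name")

# all.x = TRUE: keep every author, even one without a book here (McNeil)
left <- merge(authors.demo, books.demo, by.x = "surname", by.y = "name",
              all.x = TRUE)

# all.y = TRUE: keep every book, even one without a listed author (R Core)
right <- merge(authors.demo, books.demo, by.x = "surname", by.y = "name",
               all.y = TRUE)

# all = TRUE: keep unmatched rows from both sides
full <- merge(authors.demo, books.demo, by.x = "surname", by.y = "name",
              all = TRUE)

c(nrow(inner), nrow(left), nrow(right), nrow(full))
```

Unmatched rows are filled with NA in the columns that come from the other table.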


In [142]:
empty <- data.frame() # empty data frame

c1 <- 1:10 # vector of integers

c2 <- letters[1:10] # vector of strings

df <- data.frame(col.name.1=c1,col.name.2=c2)

## Importing and Exporting Data

In [143]:
# File names below are placeholders; replace them with paths to your own files
d2 <- read.csv('some.file.name.csv')

# For Excel files
# Load the readxl package (it must be installed first)
library(readxl)
# Read a sheet using read_excel()
df <- read_excel('Sample-Sales-Data.xlsx',sheet='Sheet1')

# Output to csv
write.csv(df, file='some.file.csv')

"cannot open file 'some.file.name.csv': No such file or directory"


ERROR: Error in file(file, "rt"): cannot open the connection
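
The error above simply means the placeholder file does not exist. For a version you can actually run, here is a sketch that writes a small data frame to a temporary file and reads it back. The names out.df, in.df, and csv.path are assumptions made for this example:

```r
out.df <- data.frame(id = 1:3, label = c('a', 'b', 'c'),
                     stringsAsFactors = FALSE)

# tempfile() gives a path that is safe to write to
csv.path <- tempfile(fileext = '.csv')

# row.names = FALSE stops R from writing the row numbers as an extra column
write.csv(out.df, file = csv.path, row.names = FALSE)

# Read the file back in
in.df <- read.csv(csv.path, stringsAsFactors = FALSE)
in.df

# Remove the temporary file when done
unlink(csv.path)
```

Without row.names = FALSE, reading the file back would give you an extra first column holding the old row numbers.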


### Getting Information about Data Frame

In [144]:
# Row and columns counts
nrow(df)
ncol(df)

In [145]:
# Column Names
colnames(df)

In [146]:
# Row names (may just return index)
rownames(df)

### Referencing Cells
You can think of the basics as using two sets of brackets for a single cell, and a single set of brackets for multiple cells. For example:

In [147]:
vec <- df[[5, 2]] # get cell by [[row,col]] num

newdf <- df[1:5, 1:2] # get multiple cells in new df

df[[2, 'col.name.1']] <- 99999 # reassign a single cell

In [148]:
df

col.name.1,col.name.2
<dbl>,<chr>
1,a
99999,b
3,c
4,d
5,e
6,f
7,g
8,h
9,i
10,j


### Referencing Rows
Usually you'll use the [row, ] format:

In [149]:
# returns a data frame (and not a vector!)
rowdf <- df[1, ]

In [150]:
rowdf

Unnamed: 0_level_0,col.name.1,col.name.2
Unnamed: 0_level_1,<dbl>,<chr>
1,1,a


In [151]:
# to get a row as a numeric vector, use the following
# (non-numeric cells such as 'a' cannot be converted and become NA)
vrow <- as.numeric(as.vector(df[1,]))

"NAs introduced by coercion"


In [152]:
vrow
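
The coercion warning above appears because the row mixes numbers and text. As an alternative sketch, unlist() flattens a row without forcing it to numeric. The data frames nums and mixed are made up for this example:

```r
# All-numeric row: unlist() gives a named numeric vector
nums <- data.frame(x = c(1, 2), y = c(10, 20))
num.row <- unlist(nums[1, ])
num.row

# Mixed row: everything is converted to character instead of NA
mixed <- data.frame(num = c(1, 2), chr = c('a', 'b'),
                    stringsAsFactors = FALSE)
mixed.row <- unlist(mixed[1, ])
mixed.row
```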

### Referencing Columns
Most column references return a vector:

In [153]:
cars <- mtcars
head(cars)

Unnamed: 0_level_0,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
Unnamed: 0_level_1,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
Mazda RX4,21.0,6,160,110,3.9,2.62,16.46,0,1,4,4
Mazda RX4 Wag,21.0,6,160,110,3.9,2.875,17.02,0,1,4,4
Datsun 710,22.8,4,108,93,3.85,2.32,18.61,1,1,4,1
Hornet 4 Drive,21.4,6,258,110,3.08,3.215,19.44,1,0,3,1
Hornet Sportabout,18.7,8,360,175,3.15,3.44,17.02,0,0,3,2
Valiant,18.1,6,225,105,2.76,3.46,20.22,1,0,3,1


In [154]:
colv1 <- cars$mpg # returns a vector
colv1

colv2 <- cars[, 'mpg'] # returns vector
colv2

colv3 <- cars[, 1] # column can be given as a position or a name
colv3

colv4 <- cars[['mpg']] # returns a vector
colv4

In [157]:
# Ways of Returning Data Frames
mpgdf <- cars['mpg'] # returns 1 col df
head(mpgdf)

mpgdf2 <- cars[1] # returns 1 col df
head(mpgdf2)

Unnamed: 0_level_0,mpg
Unnamed: 0_level_1,<dbl>
Mazda RX4,21.0
Mazda RX4 Wag,21.0
Datsun 710,22.8
Hornet 4 Drive,21.4
Hornet Sportabout,18.7
Valiant,18.1


Unnamed: 0_level_0,mpg
Unnamed: 0_level_1,<dbl>
Mazda RX4,21.0
Mazda RX4 Wag,21.0
Datsun 710,22.8
Hornet 4 Drive,21.4
Hornet Sportabout,18.7
Valiant,18.1


### Adding Rows

In [116]:
# Both arguments to rbind() must be data frames
df2 <- data.frame(col.name.1=2000,col.name.2='new' )
df2

# use rbind to bind a new row!
dfnew <- rbind(df,df2)

col.name.1,col.name.2
<dbl>,<chr>
2000,new


In [117]:
dfnew

col.name.1,col.name.2
<dbl>,<chr>
1,a
99999,b
3,c
4,d
5,e
6,f
7,g
8,h
9,i
10,j
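
One thing to watch for: rbind() on data frames expects the column names to match. The sketch below checks that the new row lands at the end, and shows what happens when the names differ. The names base.df, new.row, and bad.row are illustrative only:

```r
base.df <- data.frame(col.name.1 = c(1, 2, 3),
                      col.name.2 = c('a', 'b', 'c'),
                      stringsAsFactors = FALSE)
new.row <- data.frame(col.name.1 = 2000, col.name.2 = 'new',
                      stringsAsFactors = FALSE)

# Matching column names: the new row is appended at the bottom
combined <- rbind(base.df, new.row)
tail(combined, 1)

# Different column names: rbind() stops with an error, caught here
bad.row <- data.frame(x = 1, y = 'z', stringsAsFactors = FALSE)
result <- tryCatch(rbind(base.df, bad.row),
                   error = function(e) 'names do not match')
result
```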


### Adding Columns

In [118]:
df$newcol <- rep(NA, nrow(df)) # NA column
df

col.name.1,col.name.2,newcol
<dbl>,<chr>,<lgl>
1,a,
99999,b,
3,c,
4,d,
5,e,
6,f,
7,g,
8,h,
9,i,
10,j,


In [119]:
df[, 'copy.of.col2'] <- df$col.name.2 # copy a col
df

col.name.1,col.name.2,newcol,copy.of.col2
<dbl>,<chr>,<lgl>,<chr>
1,a,,a
99999,b,,b
3,c,,c
4,d,,d
5,e,,e
6,f,,f
7,g,,g
8,h,,h
9,i,,i
10,j,,j


In [120]:
# Can also use equations!
df[['col1.times.2']] <- df$col.name.1 * 2
df

col.name.1,col.name.2,newcol,copy.of.col2,col1.times.2
<dbl>,<chr>,<lgl>,<chr>,<dbl>
1,a,,a,2
99999,b,,b,199998
3,c,,c,6
4,d,,d,8
5,e,,e,10
6,f,,f,12
7,g,,g,14
8,h,,h,16
9,i,,i,18
10,j,,j,20


In [121]:
df3 <- cbind(df, df$col.name.1)
df3

col.name.1,col.name.2,newcol,copy.of.col2,col1.times.2,df$col.name.1
<dbl>,<chr>,<lgl>,<chr>,<dbl>,<dbl>
1,a,,a,2,1
99999,b,,b,199998,99999
3,c,,c,6,3
4,d,,d,8,4
5,e,,e,10,5
6,f,,f,12,6
7,g,,g,14,7
8,h,,h,16,8
9,i,,i,18,9
10,j,,j,20,10


### Setting Column Names

In [122]:
# Rename second column
colnames(df)[2] <- 'SECOND COLUMN NEW NAME'
df

# Rename all at once with a vector
colnames(df) <- c('col.name.1', 'col.name.2', 'newcol', 'copy.of.col2' ,'col1.times.2')
df

col.name.1,SECOND COLUMN NEW NAME,newcol,copy.of.col2,col1.times.2
<dbl>,<chr>,<lgl>,<chr>,<dbl>
1,a,,a,2
99999,b,,b,199998
3,c,,c,6
4,d,,d,8
5,e,,e,10
6,f,,f,12
7,g,,g,14
8,h,,h,16
9,i,,i,18
10,j,,j,20


col.name.1,col.name.2,newcol,copy.of.col2,col1.times.2
<dbl>,<chr>,<lgl>,<chr>,<dbl>
1,a,,a,2
99999,b,,b,199998
3,c,,c,6
4,d,,d,8
5,e,,e,10
6,f,,f,12
7,g,,g,14
8,h,,h,16
9,i,,i,18
10,j,,j,20


### Selecting Multiple Rows

In [123]:
first.ten.rows <- df[1:10, ] # Same as head(df, 10)
first.ten.rows

Unnamed: 0_level_0,col.name.1,col.name.2,newcol,copy.of.col2,col1.times.2
Unnamed: 0_level_1,<dbl>,<chr>,<lgl>,<chr>,<dbl>
1,1,a,,a,2
2,99999,b,,b,199998
3,3,c,,c,6
4,4,d,,d,8
5,5,e,,e,10
6,6,f,,f,12
7,7,g,,g,14
8,8,h,,h,16
9,9,i,,i,18
10,10,j,,j,20


In [124]:
everything.but.row.two <- df[-2, ]
everything.but.row.two

Unnamed: 0_level_0,col.name.1,col.name.2,newcol,copy.of.col2,col1.times.2
Unnamed: 0_level_1,<dbl>,<chr>,<lgl>,<chr>,<dbl>
1,1,a,,a,2
3,3,c,,c,6
4,4,d,,d,8
5,5,e,,e,10
6,6,f,,f,12
7,7,g,,g,14
8,8,h,,h,16
9,9,i,,i,18
10,10,j,,j,20


In [125]:
# Conditional Selection
sub1 <- df[ (df$col.name.1 > 8 & df$col1.times.2 > 10), ]
sub1

sub2 <- subset(df, col.name.1 > 8 & col1.times.2 > 10)
sub2

Unnamed: 0_level_0,col.name.1,col.name.2,newcol,copy.of.col2,col1.times.2
Unnamed: 0_level_1,<dbl>,<chr>,<lgl>,<chr>,<dbl>
2,99999,b,,b,199998
9,9,i,,i,18
10,10,j,,j,20


Unnamed: 0_level_0,col.name.1,col.name.2,newcol,copy.of.col2,col1.times.2
Unnamed: 0_level_1,<dbl>,<chr>,<lgl>,<chr>,<dbl>
2,99999,b,,b,199998
9,9,i,,i,18
10,10,j,,j,20


### Selecting Multiple Columns

In [127]:
df[, c(1, 2, 3)] # Grab cols 1, 2 and 3

col.name.1,col.name.2,newcol
<dbl>,<chr>,<lgl>
1,a,
99999,b,
3,c,
4,d,
5,e,
6,f,
7,g,
8,h,
9,i,
10,j,


In [128]:
df[, c('col.name.1', 'col1.times.2')] # by name

col.name.1,col1.times.2
<dbl>,<dbl>
1,2
99999,199998
3,6
4,8
5,10
6,12
7,14
8,16
9,18
10,20


In [129]:
df[, -1] # keep all but first column

col.name.2,newcol,copy.of.col2,col1.times.2
<chr>,<lgl>,<chr>,<dbl>
a,,a,2
b,,b,199998
c,,c,6
d,,d,8
e,,e,10
f,,f,12
g,,g,14
h,,h,16
i,,i,18
j,,j,20


In [130]:
df[, -c(1, 3)] # drop cols 1 and 3

col.name.2,copy.of.col2,col1.times.2
<chr>,<chr>,<dbl>
a,a,2
b,b,199998
c,c,6
d,d,8
e,e,10
f,f,12
g,g,14
h,h,16
i,i,18
j,j,20
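
Negative indexing only works with column positions, not with column names. To drop columns by name, one common approach is a logical test with %in%. A minimal sketch on a made-up data frame called wide:

```r
wide <- data.frame(a = 1:2, b = 3:4, c = 5:6)

# Names of the columns we want to remove
to.drop <- c('a', 'c')

# Keep the columns whose names are NOT in to.drop
# drop = FALSE keeps the result a data frame even if one column is left
kept <- wide[, !(names(wide) %in% to.drop), drop = FALSE]
names(kept)
```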


## Dealing with Missing Data
Dealing with missing data is a very important skill to know when working with data frames!

In [131]:
any(is.na(df)) # detect anywhere in df

In [132]:
any(is.na(df$col.name.1)) # anywhere in col

In [133]:
# delete rows where a selected column is missing (here col.name.1)
df <- df[!is.na(df$col.name.1), ]

In [134]:
# replace NAs with something else
df[is.na(df)] <- 0 # works on whole df

In [135]:
df$col.name.1[is.na(df$col.name.1)] <- 999 # For a selected column
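
A few other base R helpers are handy here. The sketch below counts missing values per column, finds the complete rows, and drops the incomplete ones. The data frame na.demo is made up for this example:

```r
na.demo <- data.frame(a = c(1, NA, 3),
                      b = c('x', 'y', NA),
                      stringsAsFactors = FALSE)

# Count the missing values in each column
na.counts <- colSums(is.na(na.demo))
na.counts

# TRUE for rows that have no missing values at all
complete.rows <- complete.cases(na.demo)
complete.rows

# Drop every row that contains at least one NA
clean <- na.omit(na.demo)
clean

# Many summary functions can skip NAs with na.rm = TRUE
mean.a <- mean(na.demo$a, na.rm = TRUE)
mean.a
```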