# Importing Data and Basic Operations with Data Frames

Content:

- An introduction to R Data Frames
- Functions and packages used to import data
- Introduction to **dplyr**
- Merging, joining DataFrames

## An introduction to R Data Frames

- R's data frames store data in rows and columns.
- They share many of the properties of [matrices](https://www.tutorialspoint.com/r/r_matrices.htm) and of [lists](https://data-flair.training/blogs/r-list-tutorial/).
- They are used as the fundamental data structure by most of R's modeling software.
- They can store columns of different types: numeric, character, etc.
    - This differs from matrices, in which all elements must be of the same type.
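
To make the contrast with matrices concrete, here is a minimal sketch using made-up values (the names `m` and `df` are only for illustration). It builds the same two columns first as a matrix and then as a data frame, and checks the resulting types:

```r
# A matrix holds a single type, so mixing text and numbers coerces everything to character
m <- cbind(name = c("a", "b"), score = c(1.5, 2.5))
class(m[, "score"])   # "character"

# A data frame keeps a separate type for each column
df <- data.frame(name = c("a", "b"), score = c(1.5, 2.5),
                 stringsAsFactors = FALSE)
sapply(df, class)     # name is "character", score is "numeric"
```

The `stringsAsFactors = FALSE` argument is only there so that the example behaves the same on R versions before 4.0, where text columns were converted to factors by default.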


In practice, data frames are usually loaded from other sources such as CSV/Excel files or databases, but we can create a small example to become familiar with them. We use the `data.frame` function:

In [27]:
x <- data.frame(Videogame = c("FIFA", "Fall Guys", "LOL", "Mario Bros"), Age = c(28, 22, 18, 30))
x

Videogame,Age
<chr>,<dbl>
FIFA,28
Fall Guys,22
LOL,18
Mario Bros,30


In [28]:
class(x)

We can get the dimensions of the dataframe with `dim`, `nrow` and `ncol`:

In [29]:
dim(x)  # returns the number of rows (observations) and columns (variables)

In [30]:
nrow(x)

In [31]:
ncol(x)

In [32]:
length(x)  # returns the number of columns (variables)

### Selecting data

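We can select data with `$` or `[]`. Notice that the data is returned in different types: `$` returns the column as a vector, whereas `[]` with a single index returns a data frame with one column: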

In [36]:
x$Videogame

In [38]:
class(x$Videogame)

In [42]:
x$Videogame[1:2]

In [40]:
x[1]

Videogame
<chr>
FIFA
Fall Guys
LOL
Mario Bros
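
The differences between the selection operators can be summarised with a small sketch. The `games` data frame and the variable names below are made up for illustration, and the behaviour shown is that of base data frames (tibbles, such as those returned by `readr`, keep a one-column result as a tibble when indexed with `[, ]`):

```r
games <- data.frame(Videogame = c("FIFA", "LOL"), Age = c(28, 18),
                    stringsAsFactors = FALSE)

by_dollar  <- games$Videogame        # a character vector
by_double  <- games[["Videogame"]]   # the same vector, selected by name
by_single  <- games["Videogame"]     # a one-column data frame
by_matrix  <- games[, "Videogame"]   # a vector again: a single column is dropped to a vector
keep_frame <- games[, "Videogame", drop = FALSE]  # drop = FALSE keeps the data frame

class(by_dollar)   # "character"
class(by_single)   # "data.frame"
```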


We can add columns to our data frame:

In [44]:
x$id <- 1:4
x

Videogame,Age,id
<chr>,<dbl>,<int>
FIFA,28,1
Fall Guys,22,2
LOL,18,3
Mario Bros,30,4


We can use the `cbind` function to append columns to our data:

In [47]:
samples <- 1:4
x <- cbind(x, samples)
x

Videogame,Age,id,samples
<chr>,<dbl>,<int>,<int>
FIFA,28,1,1
Fall Guys,22,2,2
LOL,18,3,3
Mario Bros,30,4,4


#### A note about tidy data

In [tidy data](https://r4ds.had.co.nz/tidy-data.html):

- Every column is a variable.
- Every row is an observation.
- Every cell is a single value.
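
As an illustration of these three rules, the sketch below reshapes a hypothetical "wide" table into a tidy one using only base R. The data, the column names and the use of `stack()` are assumptions chosen for the example; packages such as `tidyr` offer more convenient tools for the same task.

```r
# Hypothetical wide table: one column per year, so the variable "year" hides in the column names
wide <- data.frame(game  = c("FIFA", "LOL"),
                   y2019 = c(10, 7),
                   y2020 = c(12, 9),
                   stringsAsFactors = FALSE)

# stack() puts the year columns on top of each other: one row per game-year observation
long <- stack(wide[, c("y2019", "y2020")])
names(long) <- c("sales", "year")
long$year <- as.character(long$year)
long$game <- rep(wide$game, times = 2)   # game names repeat once per stacked column
long <- long[, c("game", "year", "sales")]
long
```

In `long`, every column is a variable (`game`, `year`, `sales`) and every row is a single observation.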

## Functions and packages used to import data

There are different ways to import data into our workspace. For more information you can go [here](https://www.uv.es/pjperez/curso_R/tt_3_cargar_datos_v4.html).

As a first case, R already comes with some pre-loaded datasets. To get a list of these we use the `data()` function:

In [54]:
data()

Package,Item,Title
<chr>,<chr>,<chr>
datasets,AirPassengers,Monthly Airline Passenger Numbers 1949-1960
datasets,BJsales,Sales Data with Leading Indicator
datasets,BJsales.lead (BJsales),Sales Data with Leading Indicator
datasets,BOD,Biochemical Oxygen Demand
datasets,CO2,Carbon Dioxide Uptake in Grass Plants
datasets,ChickWeight,Weight versus age of chicks on different diets
datasets,DNase,Elisa assay of DNase
datasets,EuStockMarkets,"Daily Closing Prices of Major European Stock Indices, 1991-1998"
datasets,Formaldehyde,Determination of Formaldehyde
datasets,HairEyeColor,Hair and Eye Color of Statistics Students


For example, to load the Air Passengers data set we would do the following:

In [55]:
data(AirPassengers)

In [56]:
AirPassengers

Unnamed: 0,Jan,Feb,Mar,Apr,May,Jun,Jul,Aug,Sep,Oct,Nov,Dec
1949,112,118,132,129,121,135,148,148,136,119,104,118
1950,115,126,141,135,125,149,170,170,158,133,114,140
1951,145,150,178,163,172,178,199,199,184,162,146,166
1952,171,180,193,181,183,218,230,242,209,191,172,194
1953,196,196,236,235,229,243,264,272,237,211,180,201
1954,204,188,235,227,234,264,302,293,259,229,203,229
1955,242,233,267,269,270,315,364,347,312,274,237,278
1956,284,277,317,313,318,374,413,405,355,306,271,306
1957,315,301,356,348,355,422,465,467,404,347,305,336
1958,340,318,362,348,363,435,491,505,404,359,310,337


### Reading tabular data

We can import data in CSV format:

In [60]:
library(readr)

nsfg_df <- read_csv("../data/nsfg/nsfg_2002_2019.csv")
nsfg_df
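
If the NSFG file is not at hand, the same idea can be tried on a tiny, made-up CSV written to a temporary file. This sketch uses base R's `read.csv()` as a stand-in for `read_csv()`: it returns a plain data frame rather than a tibble. The file contents and the name `toy_df` are assumptions for illustration only.

```r
# Write a tiny, made-up CSV to a temporary file so the example is self-contained
csv_path <- tempfile(fileext = ".csv")
writeLines(c("caseid,age_a,reldlife",
             "1,31,",
             "2,17,2"), csv_path)

# Base R reader; empty fields in numeric columns are read as NA
toy_df <- read.csv(csv_path)
str(toy_df)
```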

We can see the first or last rows of the dataset:

In [4]:
head(nsfg_df)

Unnamed: 0_level_0,caseid,age_a,marstat,reldlife,religion,samesex,intvwyear,lifprtnr,timesmar,attnd14,fmarit,gayadopt,lifeprt,sxok18,staytog,prvntdiv,achieve
Unnamed: 0_level_1,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1,80717,31,2,,1,1,2018,15,,,5,,,,,,
2,80721,17,6,2.0,3,2,2018,1,,6.0,5,,,,,,
3,80722,16,6,,1,2,2019,2,,2.0,5,,,,,,
4,80724,49,4,2.0,4,3,2019,50,2.0,,3,,,,,,
5,80732,39,6,,1,1,2019,0,,,5,,,,,,
6,80734,37,1,2.0,2,2,2018,3,1.0,,1,,,,,,


In [6]:
head(nsfg_df, 10)

Unnamed: 0_level_0,caseid,age_a,marstat,reldlife,religion,samesex,intvwyear,lifprtnr,timesmar,attnd14,fmarit,gayadopt,lifeprt,sxok18,staytog,prvntdiv,achieve
Unnamed: 0_level_1,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1,80717,31,2,,1,1,2018,15,,,5,,,,,,
2,80721,17,6,2.0,3,2,2018,1,,6.0,5,,,,,,
3,80722,16,6,,1,2,2019,2,,2.0,5,,,,,,
4,80724,49,4,2.0,4,3,2019,50,2.0,,3,,,,,,
5,80732,39,6,,1,1,2019,0,,,5,,,,,,
6,80734,37,1,2.0,2,2,2018,3,1.0,,1,,,,,,
7,80735,17,6,1.0,3,4,2019,0,,2.0,5,,,,,,
8,80736,40,2,,1,1,2019,22,,,5,,,,,,
9,80738,46,4,2.0,3,2,2018,30,1.0,,3,,,,,,
10,80739,40,1,,1,1,2019,5,1.0,,1,,,,,,


In [5]:
tail(nsfg_df)

Unnamed: 0_level_0,caseid,age_a,marstat,reldlife,religion,samesex,intvwyear,lifprtnr,timesmar,attnd14,fmarit,gayadopt,lifeprt,sxok18,staytog,prvntdiv,achieve
Unnamed: 0_level_1,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
72459,2417,21,6,1.0,3.0,3.0,2002,1.0,,3.0,5,2.0,1.0,4.0,2.0,,3.0
72460,1074,15,6,2.0,3.0,2.0,2002,0.0,,4.0,5,3.0,,3.0,3.0,,2.0
72461,8877,37,4,1.0,3.0,1.0,2002,5.0,1.0,,3,4.0,5.0,3.0,1.0,,3.0
72462,11658,28,1,,1.0,4.0,2002,7.0,1.0,,1,2.0,7.0,2.0,3.0,,3.0
72463,2780,24,4,,1.0,2.0,2002,5.0,1.0,5.0,3,4.0,5.0,2.0,2.0,,1.0
72464,5758,28,1,,,,2002,,1.0,,1,,1.0,,,,


#### More on Selecting Data

We can identify missing values with the `is.na()` function:

In [25]:
nsfg_df[is.na(nsfg_df$reldlife), ]

Unnamed: 0_level_0,caseid,age_a,marstat,reldlife,religion,samesex,intvwyear,lifprtnr,timesmar,attnd14,fmarit,gayadopt,lifeprt,sxok18,staytog,prvntdiv,achieve
Unnamed: 0_level_1,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1,80717,31,2,,1,1,2018,15,,,5,,,,,,
3,80722,16,6,,1,2,2019,2,,2,5,,,,,,
5,80732,39,6,,1,1,2019,0,,,5,,,,,,
8,80736,40,2,,1,1,2019,22,,,5,,,,,,
10,80739,40,1,,1,1,2019,5,1,,1,,,,,,
11,80740,31,6,,1,1,2017,0,,,5,,,,,,
14,80746,17,6,,1,1,2019,6,,6,5,,,,,,
19,80757,28,6,,1,4,2019,6,,,5,,,,,,
20,80758,18,2,,1,3,2018,10,,3,5,,,,,,
21,80759,40,2,,1,3,2018,26,,,5,,,,,,


We can also use the `complete.cases` function, which returns `TRUE` for each row that has no missing values in any of its columns:

In [41]:
x$rating <- c(4, 5, NA, NA)
x

Videogame,Age,rating
<chr>,<dbl>,<dbl>
FIFA,28,4.0
Fall Guys,22,5.0
LOL,18,
Mario Bros,30,


In [43]:
completed <- complete.cases(x)
completed

In [44]:
x_complete <- x[completed,]
x_complete

Unnamed: 0_level_0,Videogame,Age,rating
Unnamed: 0_level_1,<chr>,<dbl>,<dbl>
1,FIFA,28,4
2,Fall Guys,22,5
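
A few related base R helpers are often used alongside `complete.cases`. The sketch below, on a made-up `toy` data frame, counts missing values per column and drops incomplete rows with `na.omit()`:

```r
toy <- data.frame(game   = c("FIFA", "Fall Guys", "LOL", "Mario Bros"),
                  age    = c(28, 22, 18, NA),
                  rating = c(4, 5, NA, NA))

colSums(is.na(toy))        # missing values per column: game 0, age 1, rating 2
sum(!complete.cases(toy))  # rows with at least one NA: 2
clean <- na.omit(toy)      # keeps the same rows as toy[complete.cases(toy), ]
nrow(clean)                # 2
```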


Selecting specific variables (columns):

In [45]:
nsfg_df[, 1:4]

caseid,age_a,marstat,reldlife
<dbl>,<dbl>,<dbl>,<dbl>
80717,31,2,
80721,17,6,2
80722,16,6,
80724,49,4,2
80732,39,6,
80734,37,1,2
80735,17,6,1
80736,40,2,
80738,46,4,2
80739,40,1,


In [46]:
nsfg_df[, c(1, 3, 6)]

caseid,marstat,samesex
<dbl>,<dbl>,<dbl>
80717,2,1
80721,6,2
80722,6,2
80724,4,3
80732,6,1
80734,1,2
80735,6,4
80736,2,1
80738,4,2
80739,1,1


Selecting specific observations (rows):

In [47]:
nsfg_df[1:5,]

Unnamed: 0_level_0,caseid,age_a,marstat,reldlife,religion,samesex,intvwyear,lifprtnr,timesmar,attnd14,fmarit,gayadopt,lifeprt,sxok18,staytog,prvntdiv,achieve
Unnamed: 0_level_1,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1,80717,31,2,,1,1,2018,15,,,5,,,,,,
2,80721,17,6,2.0,3,2,2018,1,,6.0,5,,,,,,
3,80722,16,6,,1,2,2019,2,,2.0,5,,,,,,
4,80724,49,4,2.0,4,3,2019,50,2.0,,3,,,,,,
5,80732,39,6,,1,1,2019,0,,,5,,,,,,


Using the `seq()` function to generate a sequence to select rows:

In [54]:
nsfg_df[seq(1, nrow(nsfg_df), 100), ]

Unnamed: 0_level_0,caseid,age_a,marstat,reldlife,religion,samesex,intvwyear,lifprtnr,timesmar,attnd14,fmarit,gayadopt,lifeprt,sxok18,staytog,prvntdiv,achieve
Unnamed: 0_level_1,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1,80717,31,2,,1,1,2018,15,,,5,,,,,,
101,80936,40,1,1,3,4,2019,1,1,,1,,,,,,
201,81138,36,1,3,3,1,2018,1,1,,1,,,,,,
301,81356,32,2,,1,2,2018,44,1,,4,,,,,,
401,81576,20,6,,1,2,2019,0,,6,5,,,,,,
501,81846,42,2,2,2,1,2019,12,,,5,,,,,,
601,82083,22,1,1,2,4,2018,1,1,5,1,,,,,,
701,82302,36,2,2,3,2,2017,10,,,5,,,,,,
801,82502,46,1,,1,2,2019,15,1,,1,,,,,,
901,82712,22,6,,1,2,2018,2,,7,5,,,,,,


Boolean or conditional selection:

In [63]:
df1 <- nsfg_df[nsfg_df$age_a >= 18, c(1, 2, 3)]
head(df1, 10)

Unnamed: 0_level_0,caseid,age_a,marstat
Unnamed: 0_level_1,<dbl>,<dbl>,<dbl>
1,80717,31,2
4,80724,49,4
5,80732,39,6
6,80734,37,1
8,80736,40,2
9,80738,46,4
10,80739,40,1
11,80740,31,6
12,80741,34,6
13,80745,32,1


In [64]:
df2 <- nsfg_df[nsfg_df$age_a >= 18, c("caseid","age_a", "religion")]
head(df2, 10)

Unnamed: 0_level_0,caseid,age_a,religion
Unnamed: 0_level_1,<dbl>,<dbl>,<dbl>
1,80717,31,1
4,80724,49,4
5,80732,39,1
6,80734,37,2
8,80736,40,1
9,80738,46,3
10,80739,40,1
11,80740,31,1
12,80741,34,4
13,80745,32,3


In [77]:
df3 <- nsfg_df[(nsfg_df$religion==1) & (nsfg_df$timesmar>=2),] 
df3

Unnamed: 0_level_0,caseid,age_a,marstat,reldlife,religion,samesex,intvwyear,lifprtnr,timesmar,attnd14,fmarit,gayadopt,lifeprt,sxok18,staytog,prvntdiv,achieve
Unnamed: 0_level_1,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
,,,,,,,,,,,,,,,,,
NA.1,,,,,,,,,,,,,,,,,
NA.2,,,,,,,,,,,,,,,,,
NA.3,,,,,,,,,,,,,,,,,
NA.4,,,,,,,,,,,,,,,,,
NA.5,,,,,,,,,,,,,,,,,
NA.6,,,,,,,,,,,,,,,,,
NA.7,,,,,,,,,,,,,,,,,
NA.8,,,,,,,,,,,,,,,,,
22,80762,45,1,,1,5,2019,50,2,,1,,,,,,
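
The rows filled with `NA` in the output above appear because the condition itself evaluates to `NA` whenever `timesmar` is missing, and indexing with `NA` returns an empty row. The following sketch reproduces the effect on a small, made-up data frame and shows one possible workaround with `which()`:

```r
toy <- data.frame(id       = 1:4,
                  religion = c(1, 1, 2, 1),
                  timesmar = c(2, NA, 3, 1))

cond <- toy$religion == 1 & toy$timesmar >= 2
cond                     # TRUE NA FALSE FALSE

with_na    <- toy[cond, ]          # the NA in cond yields a row filled with NA
without_na <- toy[which(cond), ]   # which() keeps only the positions that are TRUE
nrow(with_na)            # 2
nrow(without_na)         # 1
```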


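The `subset()` function also comes in handy when subsetting data. Unlike `[]`, it drops the rows for which the condition evaluates to `NA`, so the rows filled with `NA` from the previous output do not appear: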

In [78]:
subset(nsfg_df, religion==1 & timesmar>=2)

Unnamed: 0_level_0,caseid,age_a,marstat,reldlife,religion,samesex,intvwyear,lifprtnr,timesmar,attnd14,fmarit,gayadopt,lifeprt,sxok18,staytog,prvntdiv,achieve
Unnamed: 0_level_1,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
22,80762,45,1,,1,5,2019,50,2,,1,,,,,,
63,80850,33,1,,1,2,2018,20,2,,1,,,,,,
151,81034,43,5,,1,2,2018,7,2,,4,,,,,,
309,81377,32,1,,1,2,2017,48,3,,1,,,,,,
354,81477,44,4,,1,5,2018,50,2,,3,,,,,,
410,81589,46,1,,1,2,2019,5,2,,1,,,,,,
420,81622,43,1,,1,2,2017,5,2,,1,,,,,,
440,81672,35,4,,1,1,2018,30,2,,3,,,,,,
452,81712,46,5,,1,1,2017,9,2,,4,,,,,,
499,81829,44,1,,1,3,2017,8,2,,1,,,,,,


In [79]:
subset(nsfg_df, religion==1 & timesmar>=2, select=c("caseid", "religion", "age_a", "timesmar"))

Unnamed: 0_level_0,caseid,religion,age_a,timesmar
Unnamed: 0_level_1,<dbl>,<dbl>,<dbl>,<dbl>
22,80762,1,45,2
63,80850,1,33,2
151,81034,1,43,2
309,81377,1,32,3
354,81477,1,44,2
410,81589,1,46,2
420,81622,1,43,2
440,81672,1,35,2
452,81712,1,46,2
499,81829,1,44,2


### RData

RData (.Rda) is a binary file format specific to R.

- More efficient than plain text: `save()` compresses the file by default, and column types are preserved.
- More than one object can be stored in a single file.

We can first export our NSFG data set with the `save()` function:

In [61]:
save(nsfg_df, file = "../data/nsfg/nsfg.RData")

We can save more than one object in a single RData file:

In [62]:
save(nsfg_df, AirPassengers, file = "../data/nsfg_and_airpass.RData")

Then we can load the files with `load()` (try restarting the notebook's kernel before running the code below to check that it actually works):

In [1]:
load("../data/nsfg/nsfg.RData")
nsfg_df

caseid,age_a,marstat,reldlife,religion,samesex,intvwyear,lifprtnr,timesmar,attnd14,fmarit,gayadopt,lifeprt,sxok18,staytog,prvntdiv,achieve
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
80717,31,2,,1,1,2018,15,,,5,,,,,,
80721,17,6,2,3,2,2018,1,,6,5,,,,,,
80722,16,6,,1,2,2019,2,,2,5,,,,,,
80724,49,4,2,4,3,2019,50,2,,3,,,,,,
80732,39,6,,1,1,2019,0,,,5,,,,,,
80734,37,1,2,2,2,2018,3,1,,1,,,,,,
80735,17,6,1,3,4,2019,0,,2,5,,,,,,
80736,40,2,,1,1,2019,22,,,5,,,,,,
80738,46,4,2,3,2,2018,30,1,,3,,,,,,
80739,40,1,,1,1,2019,5,1,,1,,,,,,
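
The round trip can also be checked without touching the project's data folder. The sketch below uses temporary files and made-up objects (`a`, `b`); it also shows `saveRDS()`/`readRDS()`, a related base R pair that stores a single object and lets us choose its name when reading it back:

```r
a <- 1:3
b <- data.frame(x = c("u", "v"), stringsAsFactors = FALSE)

rdata_path <- tempfile(fileext = ".RData")
save(a, b, file = rdata_path)

rm(a, b)                       # simulate a fresh session
restored <- load(rdata_path)   # load() returns the names of the objects it recreated
restored

# saveRDS()/readRDS() handle one object at a time, without fixing its name
rds_path <- tempfile(fileext = ".rds")
saveRDS(b, rds_path)
b_copy <- readRDS(rds_path)
identical(b, b_copy)           # TRUE
```

Note that `load()` silently overwrites any existing objects with the same names, which is one reason some people prefer `readRDS()` for single objects.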


## Introduction to dplyr

[dplyr](https://dplyr.tidyverse.org/) is a grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges:

- `filter()` picks cases based on their values.
- `arrange()` changes the ordering of the rows.
- `select()` picks variables based on their names.
- `mutate()` adds new variables that are functions of existing variables.
- `summarise()` reduces multiple values down to a single summary.

These all combine naturally with `group_by()` which allows you to perform any operation “by group”. You can learn more [here](https://dplyr.tidyverse.org/articles/dplyr.html).
Each of these functions does only one thing, so to perform more complex operations you have to chain one function after another. This is done with the [pipe operator](https://stat545.com/dplyr-intro.html#meet-the-new-pipe-operator) `%>%`.
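
To see what the pipe does, the sketch below writes the same operation twice on a made-up `toy` data frame: once as nested calls and once as a pipeline. The pipe simply passes the result on its left as the first argument of the function on its right.

```r
library(dplyr)

toy <- data.frame(id = 1:5, age = c(31, 17, 16, 49, 39))

# Nested calls read inside-out
nested <- head(arrange(filter(toy, age >= 18), age), 2)

# The pipeline reads top to bottom: filter, then sort, then keep two rows
piped <- toy %>%
  filter(age >= 18) %>%
  arrange(age) %>%
  head(2)

identical(nested, piped)   # TRUE
```

Since R 4.1 there is also a native pipe, `|>`, which can be used in the same way for simple chains like this one.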

In [94]:
library(dplyr) 

nsfg_df %>% filter(age_a >= 18, religion == 1) 

caseid,age_a,marstat,reldlife,religion,samesex,intvwyear,lifprtnr,timesmar,attnd14,fmarit,gayadopt,lifeprt,sxok18,staytog,prvntdiv,achieve
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
80717,31,2,,1,1,2018,15,,,5,,,,,,
80732,39,6,,1,1,2019,0,,,5,,,,,,
80736,40,2,,1,1,2019,22,,,5,,,,,,
80739,40,1,,1,1,2019,5,1,,1,,,,,,
80740,31,6,,1,1,2017,0,,,5,,,,,,
80757,28,6,,1,4,2019,6,,,5,,,,,,
80758,18,2,,1,3,2018,10,,3,5,,,,,,
80759,40,2,,1,3,2018,26,,,5,,,,,,
80762,45,1,,1,5,2019,50,2,,1,,,,,,
80767,31,6,,1,1,2017,12,,,5,,,,,,


In [95]:
nsfg_df %>% filter(age_a >= 18, religion == 1) %>% head()

caseid,age_a,marstat,reldlife,religion,samesex,intvwyear,lifprtnr,timesmar,attnd14,fmarit,gayadopt,lifeprt,sxok18,staytog,prvntdiv,achieve
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
80717,31,2,,1,1,2018,15,,,5,,,,,,
80732,39,6,,1,1,2019,0,,,5,,,,,,
80736,40,2,,1,1,2019,22,,,5,,,,,,
80739,40,1,,1,1,2019,5,1.0,,1,,,,,,
80740,31,6,,1,1,2017,0,,,5,,,,,,
80757,28,6,,1,4,2019,6,,,5,,,,,,


In [97]:
nsfg_df %>% arrange(caseid)

caseid,age_a,marstat,reldlife,religion,samesex,intvwyear,lifprtnr,timesmar,attnd14,fmarit,gayadopt,lifeprt,sxok18,staytog,prvntdiv,achieve
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1,44,1,2,2,3,2002,5,1,,1,3,5,3,3,,2
2,20,6,1,3,4,2002,9,,1,5,2,9,2,4,,1
4,36,1,,1,2,2002,17,1,,1,2,17,3,2,,4
6,40,1,1,3,4,2002,1,1,,1,4,1,4,4,,3
7,39,1,1,3,3,2002,10,4,,1,3,998,3,2,,2
8,23,2,2,3,3,2002,3,,2,5,3,3,2,2,,3
9,15,6,,1,2,2002,1,,4,5,2,1,2,1,,3
12,40,4,1,2,4,2002,3,1,,3,4,3,3,2,,4
13,20,6,1,3,1,2002,0,,1,5,3,,4,1,,4
14,36,4,1,3,3,2002,6,1,,3,5,6,3,2,,2


In [112]:
nsfg_df %>% arrange(desc(caseid))

caseid,age_a,marstat,reldlife,religion,samesex,intvwyear,lifprtnr,timesmar,attnd14,fmarit,gayadopt,lifeprt,sxok18,staytog,prvntdiv,achieve
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
92062,41,1,,1,2,2018,9,1,,1,,,,,,
92061,24,1,1,3,2,2018,5,1,1,1,,,,,,
92060,37,2,3,2,1,2019,1,,,5,,,,,,
92059,39,6,3,3,2,2019,39,,,5,,,,,,
92058,32,6,,1,2,2019,6,,,5,,,,,,
92057,37,6,2,2,2,2018,2,,,5,,,,,,
92056,49,2,,1,3,2018,10,2,,3,,,,,,
92055,20,6,2,3,1,2019,1,,1,5,,,,,,
92054,16,6,,1,2,2019,0,,5,5,,,,,,
92053,41,2,,1,1,2018,6,,,5,,,,,,


In [98]:
nsfg_df %>% arrange(caseid) %>% filter(age_a >= 18)

caseid,age_a,marstat,reldlife,religion,samesex,intvwyear,lifprtnr,timesmar,attnd14,fmarit,gayadopt,lifeprt,sxok18,staytog,prvntdiv,achieve
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1,44,1,2,2,3,2002,5,1,,1,3,5,3,3,,2
2,20,6,1,3,4,2002,9,,1,5,2,9,2,4,,1
4,36,1,,1,2,2002,17,1,,1,2,17,3,2,,4
6,40,1,1,3,4,2002,1,1,,1,4,1,4,4,,3
7,39,1,1,3,3,2002,10,4,,1,3,998,3,2,,2
8,23,2,2,3,3,2002,3,,2,5,3,3,2,2,,3
12,40,4,1,2,4,2002,3,1,,3,4,3,3,2,,4
13,20,6,1,3,1,2002,0,,1,5,3,,4,1,,4
14,36,4,1,3,3,2002,6,1,,3,5,6,3,2,,2
15,32,1,,1,4,2002,3,1,,1,3,3,4,3,,3


Select columns with `select()`:

In [99]:
nsfg_df %>% select(caseid, timesmar)

caseid,timesmar
<dbl>,<dbl>
80717,
80721,
80722,
80724,2
80732,
80734,1
80735,
80736,
80738,1
80739,1


We can rename the selected columns:

In [102]:
nsfg_df %>% select(id = caseid, times_married = timesmar)

id,times_married
<dbl>,<dbl>
80717,
80721,
80722,
80724,2
80732,
80734,1
80735,
80736,
80738,1
80739,1


In [105]:
nsfg_df %>% select(starts_with("age") | ends_with("year"))

age_a,intvwyear
<dbl>,<dbl>
31,2018
17,2018
16,2019
49,2019
39,2019
37,2018
17,2019
40,2019
46,2018
40,2019


In [106]:
nsfg_df %>% select(contains("age"))

age_a
<dbl>
31
17
16
49
39
37
17
40
46
40


Add new columns with `mutate()`:

In [111]:
nsfg_df %>% mutate(age_days = age_a * 365) %>% head()

caseid,age_a,marstat,reldlife,religion,samesex,intvwyear,lifprtnr,timesmar,attnd14,fmarit,gayadopt,lifeprt,sxok18,staytog,prvntdiv,achieve,age_days
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
80717,31,2,,1,1,2018,15,,,5,,,,,,,11315
80721,17,6,2.0,3,2,2018,1,,6.0,5,,,,,,,6205
80722,16,6,,1,2,2019,2,,2.0,5,,,,,,,5840
80724,49,4,2.0,4,3,2019,50,2.0,,3,,,,,,,17885
80732,39,6,,1,1,2019,0,,,5,,,,,,,14235
80734,37,1,2.0,2,2,2018,3,1.0,,1,,,,,,,13505


Summarise values with `summarise()`:

In [110]:
nsfg_df %>% summarise(age_mean = mean(age_a, na.rm = TRUE))

age_mean
<dbl>
29.25337
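
`summarise()` becomes more useful when combined with `group_by()`, mentioned earlier. The sketch below is a minimal example on made-up data (the column names mirror the NSFG ones, but the values are invented); it computes the number of rows and the mean age per interview year:

```r
library(dplyr)

toy <- data.frame(intvwyear = c(2018, 2018, 2019, 2019, 2019),
                  age_a     = c(31, 17, 16, 49, NA))

by_year <- toy %>%
  group_by(intvwyear) %>%
  summarise(n = n(),                               # rows per group
            age_mean = mean(age_a, na.rm = TRUE))  # mean age, ignoring NA
by_year
```

The result has one row per group: 2018 with `n = 2` and a mean of 24, and 2019 with `n = 3` and a mean of 32.5.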


## Merging/Joining DataFrames 

We can [join dataframes](https://dplyr.tidyverse.org/reference/mutate-joins.html) with the `dplyr` library too. We can do this with:

- `inner_join`
- `left_join`
- `right_join`
- `full_join`

To put this into practice, we create left and right subsets from our original NSFG data. We also introduce the `slice` function to get rows by position:

In [122]:
df_left <- nsfg_df %>% select(1:4) %>% slice(1:5)
df_left

caseid,age_a,marstat,reldlife
<dbl>,<dbl>,<dbl>,<dbl>
80717,31,2,
80721,17,6,2.0
80722,16,6,
80724,49,4,2.0
80732,39,6,


In [129]:
df_right <- nsfg_df %>% select("caseid", "age_a", 5:7) %>% slice(1:3, 8:10)
df_right

caseid,age_a,religion,samesex,intvwyear
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
80717,31,1,1,2018
80721,17,3,2,2018
80722,16,1,2,2019
80736,40,1,1,2019
80738,46,3,2,2018
80739,40,1,1,2019


In [135]:
df_inner <- inner_join(df_left, df_right, by = "caseid")
df_inner

caseid,age_a.x,marstat,reldlife,age_a.y,religion,samesex,intvwyear
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
80717,31,2,,31,1,1,2018
80721,17,6,2.0,17,3,2,2018
80722,16,6,,16,1,2,2019


In [130]:
inner_join(df_left, df_right, by = c("caseid", "age_a"))

caseid,age_a,marstat,reldlife,religion,samesex,intvwyear
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
80717,31,2,,1,1,2018
80721,17,6,2.0,3,2,2018
80722,16,6,,1,2,2019


In [131]:
left_join(df_left, df_right, by = "caseid")

caseid,age_a.x,marstat,reldlife,age_a.y,religion,samesex,intvwyear
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
80717,31,2,,31.0,1.0,1.0,2018.0
80721,17,6,2.0,17.0,3.0,2.0,2018.0
80722,16,6,,16.0,1.0,2.0,2019.0
80724,49,4,2.0,,,,
80732,39,6,,,,,


In [132]:
right_join(df_left, df_right, by = "caseid")

caseid,age_a.x,marstat,reldlife,age_a.y,religion,samesex,intvwyear
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
80717,31.0,2.0,,31,1,1,2018
80721,17.0,6.0,2.0,17,3,2,2018
80722,16.0,6.0,,16,1,2,2019
80736,,,,40,1,1,2019
80738,,,,46,3,2,2018
80739,,,,40,1,1,2019


In [133]:
full_join(df_left, df_right, by = "caseid")

caseid,age_a.x,marstat,reldlife,age_a.y,religion,samesex,intvwyear
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
80717,31.0,2.0,,31.0,1.0,1.0,2018.0
80721,17.0,6.0,2.0,17.0,3.0,2.0,2018.0
80722,16.0,6.0,,16.0,1.0,2.0,2019.0
80724,49.0,4.0,2.0,,,,
80732,39.0,6.0,,,,,
80736,,,,40.0,1.0,1.0,2019.0
80738,,,,46.0,3.0,2.0,2018.0
80739,,,,40.0,1.0,1.0,2019.0
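
For comparison, the same four joins can be expressed with base R's `merge()` through its `all`, `all.x` and `all.y` arguments. This is a sketch on two small, made-up data frames, not a replacement for the `dplyr` functions above:

```r
left  <- data.frame(caseid = c(1, 2, 3), marstat  = c(2, 6, 6))
right <- data.frame(caseid = c(1, 2, 4), religion = c(1, 3, 1))

j_inner <- merge(left, right, by = "caseid")                # like inner_join
j_left  <- merge(left, right, by = "caseid", all.x = TRUE)  # like left_join
j_right <- merge(left, right, by = "caseid", all.y = TRUE)  # like right_join
j_full  <- merge(left, right, by = "caseid", all = TRUE)    # like full_join

c(nrow(j_inner), nrow(j_left), nrow(j_right), nrow(j_full))  # 2 3 3 4
```

One difference to keep in mind is that `merge()` sorts the result by the key columns by default, whereas the `dplyr` joins keep the row order of the left data frame.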


There are two additional ways of joining data: `semi_join` and `anti_join`. These functions are known as filtering joins: they filter the rows of the left data frame according to whether they have a match in the right one, without adding any columns from it. Unlike `inner_join`, `semi_join` never duplicates rows of the left data frame. Filtering joins are helpful to identify mismatches between data frames:

In [134]:
df_semi_join <-  semi_join(df_left, df_right, by = "caseid")
df_semi_join

caseid,age_a,marstat,reldlife
<dbl>,<dbl>,<dbl>,<dbl>
80717,31,2,
80721,17,6,2.0
80722,16,6,


Compare this with the `df_inner` variable, obtained from an `inner_join`:

In [137]:
df_inner

caseid,age_a.x,marstat,reldlife,age_a.y,religion,samesex,intvwyear
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
80717,31,2,,31,1,1,2018
80721,17,6,2.0,17,3,2,2018
80722,16,6,,16,1,2,2019


With `semi_join` we only get the rows of `df_left` that matched the join condition, with the columns of `df_left` only, whereas `inner_join` returns the matched rows plus the columns from both datasets.

The `anti_join` function behaves similarly, but it returns the unmatched rows:

In [138]:
anti_join(df_left, df_right, by="caseid")

caseid,age_a,marstat,reldlife
<dbl>,<dbl>,<dbl>,<dbl>
80724,49,4,2.0
80732,39,6,
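
The statement that filtering joins do not duplicate rows is easier to see when a key appears more than once in the right-hand table. The sketch below uses two made-up data frames (`players` and `scores`) in which player 1 has two scores:

```r
library(dplyr)

players <- data.frame(id = c(1, 2, 3), name = c("Ana", "Ben", "Cal"),
                      stringsAsFactors = FALSE)
# Player 1 appears twice on the right-hand side
scores  <- data.frame(id = c(1, 1, 3), score = c(10, 20, 30))

matched_all  <- inner_join(players, scores, by = "id")  # 3 rows: player 1 repeated per match
matched_once <- semi_join(players, scores, by = "id")   # 2 rows: players 1 and 3, once each
unmatched    <- anti_join(players, scores, by = "id")   # 1 row: player 2 has no score
```

Here `matched_once` and `unmatched` contain only the columns of `players`, while `matched_all` also carries the `score` column.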
