# Importing Data and Basic Operations with DataFrames

**Content**:
- Functions and packages used to import data
- Getting Basic information about the data 
- Selecting and Filtering Data
- Joining DataFrames
- Split-Apply-Combine

**About DataFrames.jl**

[DataFrames.jl](https://dataframes.juliadata.org/stable/) provides a set of tools for working with tabular data in Julia. Its design and functionality are similar to those of pandas (in Python) and data.frame, data.table and dplyr (in R).

DataFrames.jl plays a central role in the Julia data ecosystem and integrates with a range of libraries, such as:

- [StatsKit.jl](https://github.com/JuliaStats/StatsKit.jl): A convenience meta-package which loads a set of essential packages for statistics, including those mentioned below in this section and DataFrames.jl itself.
- [Statistics](https://docs.julialang.org/en/v1/stdlib/Statistics/): The Julia standard library module for basic statistics (for example `mean`, `median`, `std` and `cor`). To gain access to these functions you must call `using Statistics`.
- [MLJ.jl](https://github.com/alan-turing-institute/MLJ.jl): Machine Learning in Julia.
- [Plots.jl](https://docs.juliaplots.org/latest/): Powerful, modern plotting library with a syntax akin to that of matplotlib (in Python) or plot (in R).
- [Gadfly.jl](http://gadflyjl.org/stable/): High-level plotting library with a "grammar of graphics" syntax akin to that of ggplot (in R).

And more.

---

A Julia [DataFrame](https://dataframes.juliadata.org/stable/man/basics/#Constructors-and-Basic-Utility-Functions) is similar to a two-dimensional [array](https://docs.julialang.org/en/v1/base/arrays/), except that each column has a name and its own element type. When a CSV file is loaded into a DataFrame with CSV.jl:

- By default, the first line is interpreted as the list of column names.
- The default separator is the comma. To specify another character, use the `delim` argument.
- The columns may have missing values. The arguments `missingstrings` and `allowmissing` can be used to customize how empty values are interpreted. Their availability depends on the CSV.jl version (recent releases use `missingstring`).
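The reading options listed above can be sketched on a small in-memory table. This is a minimal sketch and not the NSFG file: the column names, the `;` separator and the `NA` marker are made up for illustration, and `missingstring` is the keyword used by recent CSV.jl releases.

```julia
using CSV, DataFrames

# Hypothetical data: semicolon-separated, with "NA" standing for an empty value.
raw = """
id;age;score
1;31;NA
2;NA;4.5
"""

# `delim` sets the separator; `missingstring` names the text to read as `missing`.
df = DataFrame(CSV.File(IOBuffer(raw); delim=';', missingstring="NA"))

names(df)        # column names taken from the first line
eltype(df.age)   # the column element type now allows `missing`
```

Because the `NA` entries are parsed as `missing`, the remaining values keep a numeric type instead of being read as strings.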

---

### About the data

In this notebook, we will work with the National Survey of Family Growth ([NSFG](https://www.cdc.gov/nchs/nsfg/index.htm)) data set. You can find the `nsfg_2002_2019.csv` file in this repository. It is a curated version covering 2002 to 2019, obtained from [Kaggle](https://www.kaggle.com/datasets/nikodemlewandowski/nsfg-choosen-variables-20022019?resource=download).

**Columns**

- **caseid**: Respondent ID number.
- **age_a**: Respondent age.
- **marstat**: Marital status.
- **reldlife**: How important is religion in the respondent's daily life.
- **religion**: Current religious affiliation.
- **samesex**: Sexual relations between two adults of the same sex are all right. Do you strongly agree, agree, disagree, or strongly disagree? 
- **intvwyear**: Interview year.
- **lifprtnr**: Number of opposite-sex partners in lifetime.
- **timesmar**: Times respondent has been married.
- **attnd14**: When you were 14, about how often did you usually attend religious services?
- **fmarit**: Respondent formal marital status at time of interview.
- **gayadopt**: Gay or lesbian adults should have the right to adopt children?
- **sxok18**: It is all right for unmarried 18 year olds to have sexual intercourse if they have strong affection for each other?
- **staytog**: Divorce is usually the best solution when a couple can't seem to work out their marriage problems?
- **achieve**: It is much better for everyone if the man earns the main living and the woman takes care of the home and family?

Most of the columns above are mappings to categorical labels. You can see the details in the `img/labels` directory.

For example, the `reldlife` variable is encoded as follows:

<img src="img/labels/reldlife_labels.png" alt="dataframe" width="200"/>

---

To begin importing data we need to install and load the corresponding `CSV` and `DataFrames` packages:


In [1]:
import Pkg

Pkg.add("CSV")
Pkg.add("DataFrames")

[32m[1m    Updating[22m[39m registry at `/opt/julia/registries/General.toml`
[32m[1m   Resolving[22m[39m package versions...
[32m[1m  No Changes[22m[39m to `/opt/julia/environments/v1.7/Project.toml`
[32m[1m  No Changes[22m[39m to `/opt/julia/environments/v1.7/Manifest.toml`
[32m[1m   Resolving[22m[39m package versions...
[32m[1m  No Changes[22m[39m to `/opt/julia/environments/v1.7/Project.toml`
[32m[1m  No Changes[22m[39m to `/opt/julia/environments/v1.7/Manifest.toml`


In [2]:
using CSV, DataFrames

In [3]:
dataframe = DataFrame(CSV.File("../data/nsfg/nsfg_2002_2019.csv"))
dataframe

Unnamed: 0_level_0,caseid,age_a,marstat,reldlife,religion,samesex,intvwyear,lifprtnr,timesmar
Unnamed: 0_level_1,Int64,Int64,Int64,String3,String3,String3,Int64,String3,String3
1,80717,31,2,,1,1,2018,15,
2,80721,17,6,2,3,2,2018,1,
3,80722,16,6,,1,2,2019,2,
4,80724,49,4,2,4,3,2019,50,2
5,80732,39,6,,1,1,2019,0,
6,80734,37,1,2,2,2,2018,3,1
7,80735,17,6,1,3,4,2019,0,
8,80736,40,2,,1,1,2019,22,
9,80738,46,4,2,3,2,2018,30,1
10,80739,40,1,,1,1,2019,5,1


Notice that the displayed data is truncated in both the rows and the columns. The data type of each column is displayed below its name.

We can access columns of the data in different ways:

In [4]:
dataframe.caseid

72464-element Vector{Int64}:
 80717
 80721
 80722
 80724
 80732
 80734
 80735
 80736
 80738
 80739
 80740
 80741
 80745
     ⋮
  7879
 11427
 11640
  5515
  1551
  9358
  2417
  1074
  8877
 11658
  2780
  5758

In [5]:
dataframe."caseid"

72464-element Vector{Int64}:
 80717
 80721
 80722
 80724
 80732
 80734
 80735
 80736
 80738
 80739
 80740
 80741
 80745
     ⋮
  7879
 11427
 11640
  5515
  1551
  9358
  2417
  1074
  8877
 11658
  2780
  5758

In [6]:
dataframe[!, :caseid]

72464-element Vector{Int64}:
 80717
 80721
 80722
 80724
 80732
 80734
 80735
 80736
 80738
 80739
 80740
 80741
 80745
     ⋮
  7879
 11427
 11640
  5515
  1551
  9358
  2417
  1074
  8877
 11658
  2780
  5758

In [7]:
dataframe[!, "caseid"]

72464-element Vector{Int64}:
 80717
 80721
 80722
 80724
 80732
 80734
 80735
 80736
 80738
 80739
 80740
 80741
 80745
     ⋮
  7879
 11427
 11640
  5515
  1551
  9358
  2417
  1074
  8877
 11658
  2780
  5758

---
**NOTE**

The code examples above do not make a copy of the column vector, so changing its elements will **modify** the original data. To get a copy of the column, use the following notation:


In [8]:
dataframe[:, "caseid"]

72464-element Vector{Int64}:
 80717
 80721
 80722
 80724
 80732
 80734
 80735
 80736
 80738
 80739
 80740
 80741
 80745
     ⋮
  7879
 11427
 11640
  5515
  1551
  9358
  2417
  1074
  8877
 11658
  2780
  5758

We can further check if the obtained objects are the same or not:

In [9]:
dataframe.caseid === dataframe[!, "caseid"]

true

In [10]:
dataframe.caseid === dataframe[:, "caseid"]

false
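The practical consequence of the `!` versus `:` distinction can be sketched on a toy data frame (the `id` column below is made up and is not the NSFG data):

```julia
using DataFrames

df = DataFrame(id = [1, 2, 3])   # toy data for illustration

col_ref  = df[!, :id]   # the stored column itself, no copy
col_copy = df[:, :id]   # a freshly allocated copy

col_copy[1] = 100       # changes only the copy
col_ref[2]  = 200       # writes through to df

df.id                   # [1, 200, 3]
```

Only the assignment through `col_ref` is visible in `df`.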

We can get the column names as a vector of strings:

In [11]:
names(dataframe)

17-element Vector{String}:
 "caseid"
 "age_a"
 "marstat"
 "reldlife"
 "religion"
 "samesex"
 "intvwyear"
 "lifprtnr"
 "timesmar"
 "attnd14"
 "fmarit"
 "gayadopt"
 "lifeprt"
 "sxok18"
 "staytog"
 "prvntdiv"
 "achieve"

### Recall the `!` notation

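In function names, a trailing `!` is a Julia convention marking functions that modify their argument in place, similar to the `inplace` argument of some methods in the Python pandas library. (This is a separate use from the `!` row selector in `dataframe[!, :caseid]`.) The `empty` function illustrates the difference: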

In [12]:
empty(dataframe)

Unnamed: 0_level_0,caseid,age_a,marstat,reldlife,religion,samesex,intvwyear,lifprtnr,timesmar
Unnamed: 0_level_1,Int64,Int64,Int64,String3,String3,String3,Int64,String3,String3


In [13]:
dataframe

Unnamed: 0_level_0,caseid,age_a,marstat,reldlife,religion,samesex,intvwyear,lifprtnr,timesmar
Unnamed: 0_level_1,Int64,Int64,Int64,String3,String3,String3,Int64,String3,String3
1,80717,31,2,,1,1,2018,15,
2,80721,17,6,2,3,2,2018,1,
3,80722,16,6,,1,2,2019,2,
4,80724,49,4,2,4,3,2019,50,2
5,80732,39,6,,1,1,2019,0,
6,80734,37,1,2,2,2,2018,3,1
7,80735,17,6,1,3,4,2019,0,
8,80736,40,2,,1,1,2019,22,
9,80738,46,4,2,3,2,2018,30,1
10,80739,40,1,,1,1,2019,5,1


At this point, after running the `empty` function, the `dataframe` variable has not been modified. Let's now run the `empty!` function:

In [14]:
empty!(dataframe)

Unnamed: 0_level_0,caseid,age_a,marstat,reldlife,religion,samesex,intvwyear,lifprtnr,timesmar
Unnamed: 0_level_1,Int64,Int64,Int64,String3,String3,String3,Int64,String3,String3


In [15]:
dataframe

Unnamed: 0_level_0,caseid,age_a,marstat,reldlife,religion,samesex,intvwyear,lifprtnr,timesmar
Unnamed: 0_level_1,Int64,Int64,Int64,String3,String3,String3,Int64,String3,String3


Notice that `empty!` has modified the original `dataframe`.
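The same pairing of a non-mutating function with a mutating `!` variant appears throughout Julia. A minimal sketch with `sort` and `sort!` from Base on a plain vector (the values are arbitrary):

```julia
v = [3, 1, 2]

sorted = sort(v)           # returns a new sorted vector; v keeps its order
v_after_sort = copy(v)     # snapshot: still [3, 1, 2]

sort!(v)                   # sorts v in place; v is now [1, 2, 3]
```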

## Get basic information about the data

We can use the `size` function to get the dimensions of the data frame. But first let's reload our data frame, since we just emptied it:

In [16]:
dataframe = DataFrame(CSV.File("../data/nsfg/nsfg_2002_2019.csv"))

Unnamed: 0_level_0,caseid,age_a,marstat,reldlife,religion,samesex,intvwyear,lifprtnr,timesmar
Unnamed: 0_level_1,Int64,Int64,Int64,String3,String3,String3,Int64,String3,String3
1,80717,31,2,,1,1,2018,15,
2,80721,17,6,2,3,2,2018,1,
3,80722,16,6,,1,2,2019,2,
4,80724,49,4,2,4,3,2019,50,2
5,80732,39,6,,1,1,2019,0,
6,80734,37,1,2,2,2,2018,3,1
7,80735,17,6,1,3,4,2019,0,
8,80736,40,2,,1,1,2019,22,
9,80738,46,4,2,3,2,2018,30,1
10,80739,40,1,,1,1,2019,5,1


In [17]:
size(dataframe)

(72464, 17)

We can get the number of rows and columns:

In [18]:
nrow(dataframe)

72464

In [19]:
ncol(dataframe)

17

The `describe` method is also available to get summary statistics:

In [20]:
describe(dataframe)

Unnamed: 0_level_0,variable,mean,min,median,max,nmissing,eltype
Unnamed: 0_level_1,Symbol,Union…,Any,Union…,Any,Int64,DataType
1,caseid,51709.2,1,55830.5,92062.0,0,Int64
2,age_a,29.2534,15,29.0,99.0,0,Int64
3,marstat,3.85194,1,5.0,9.0,0,Int64
4,reldlife,,1,,,0,String3
5,religion,,1,,,0,String3
6,samesex,,1,,,0,String3
7,intvwyear,2011.61,2002,2012.0,2019.0,0,Int64
8,lifprtnr,,0,,,0,String3
9,timesmar,,1,,,0,String3
10,attnd14,,1,,,0,String3


We can get the first and last rows of the dataframe with:

In [21]:
first(dataframe, 5)

Unnamed: 0_level_0,caseid,age_a,marstat,reldlife,religion,samesex,intvwyear,lifprtnr,timesmar
Unnamed: 0_level_1,Int64,Int64,Int64,String3,String3,String3,Int64,String3,String3
1,80717,31,2,,1,1,2018,15,
2,80721,17,6,2.0,3,2,2018,1,
3,80722,16,6,,1,2,2019,2,
4,80724,49,4,2.0,4,3,2019,50,2.0
5,80732,39,6,,1,1,2019,0,


In [22]:
last(dataframe, 3)

Unnamed: 0_level_0,caseid,age_a,marstat,reldlife,religion,samesex,intvwyear,lifprtnr,timesmar
Unnamed: 0_level_1,Int64,Int64,Int64,String3,String3,String3,Int64,String3,String3
1,11658,28,1,,1.0,4.0,2002,7.0,1
2,2780,24,4,,1.0,2.0,2002,5.0,1
3,5758,28,1,,,,2002,,1


You can display all columns or all rows with the `show` function and the `allcols` or `allrows` arguments respectively:

In [23]:
show(dataframe, allcols=true)

72464×17 DataFrame
   Row │ caseid  age_a  marstat  reldlife  religion  samesex  intvwyear  lifprtnr  timesmar  attnd14  fmarit  gayadopt  lifeprt  sxok18   staytog  prvntdiv  achieve
       │ Int64   Int64  Int64    String3   String3   String3  Int64      String3   String3   String3  Int64   String3   String3  String3  String3  String3   String3
───────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
     1 │  80717     31        2  NA        1         1             2018  15        NA        NA            5  NA        NA       NA       NA       NA        NA
     2 │  

## Selecting Data

### Indexing Syntax

In [24]:
dataframe[1:3, [:caseid, :age_a]]

Unnamed: 0_level_0,caseid,age_a
Unnamed: 0_level_1,Int64,Int64
1,80717,31
2,80721,17
3,80722,16


In [25]:
dataframe[1:6, :]

Unnamed: 0_level_0,caseid,age_a,marstat,reldlife,religion,samesex,intvwyear,lifprtnr,timesmar
Unnamed: 0_level_1,Int64,Int64,Int64,String3,String3,String3,Int64,String3,String3
1,80717,31,2,,1,1,2018,15,
2,80721,17,6,2.0,3,2,2018,1,
3,80722,16,6,,1,2,2019,2,
4,80724,49,4,2.0,4,3,2019,50,2.0
5,80732,39,6,,1,1,2019,0,
6,80734,37,1,2.0,2,2,2018,3,1.0


In [26]:
dataframe[:, [:caseid, :age_a]]

Unnamed: 0_level_0,caseid,age_a
Unnamed: 0_level_1,Int64,Int64
1,80717,31
2,80721,17
3,80722,16
4,80724,49
5,80732,39
6,80734,37
7,80735,17
8,80736,40
9,80738,46
10,80739,40


In [27]:
dataframe[!, [:caseid, :age_a]]

Unnamed: 0_level_0,caseid,age_a
Unnamed: 0_level_1,Int64,Int64
1,80717,31
2,80721,17
3,80722,16
4,80724,49
5,80732,39
6,80734,37
7,80735,17
8,80736,40
9,80738,46
10,80739,40


As explained earlier in this tutorial, the difference between using `!` and `:` as the row index is that `!` does not copy the columns, while `:` does.

 - The `!` selector should normally be avoided, as it can lead to hard-to-catch bugs.
 - However, when working with very large data frames it can be useful to save memory and improve performance.
 
 
To select a single cell or value from the data frame we do:

In [44]:
dataframe[2,10]

"6"

The indexing syntax can also be used to select rows based on conditions on variables:

In [133]:
dataframe[dataframe.age_a .>= 18, :]

Unnamed: 0_level_0,caseid,age_a,marstat,reldlife,religion,samesex,intvwyear,lifprtnr,timesmar
Unnamed: 0_level_1,Int64,Int64,Int64,String3,String3,String3,Int64,String3,String3
1,80717,31,1,,1,1,2018,15,
2,80724,49,1,2,4,3,2019,50,2
3,80732,39,1,,1,1,2019,0,
4,80734,37,1,2,2,2,2018,3,1
5,80736,40,1,,1,1,2019,22,
6,80738,46,1,2,3,2,2018,30,1
7,80739,40,1,,1,1,2019,5,1
8,80740,31,1,,1,1,2017,0,
9,80741,34,1,2,4,1,2018,1,
10,80745,32,1,1,3,2,2019,4,1
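Row filtering by a Boolean mask can be sketched on a toy table, together with the `subset` function from DataFrames.jl as an alternative way to express the same conditions. The data below is made up and only borrows two NSFG column names:

```julia
using DataFrames

# Toy data standing in for two NSFG columns
df = DataFrame(age_a = [17, 31, 40, 16], marstat = [6, 1, 1, 1])

# Boolean-mask indexing: each condition is a vector of Bool, combined with `.&`
adults_married = df[(df.age_a .>= 18) .& (df.marstat .== 1), :]

# The same selection with `subset`; `ByRow` applies each predicate row by row
adults_married2 = subset(df, :age_a => ByRow(>=(18)), :marstat => ByRow(==(1)))
```

Both calls return a new data frame and leave `df` untouched.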


Several conditions can be combined with the element-wise `.&` operator. Each condition must be a Boolean vector, so select the column itself (`dataframe.age_a` or `dataframe[:, :age_a]`) rather than a one-column data frame (`dataframe[:, [:age_a]]`), which raises a `MethodError` when used as a row index:

In [142]:
dataframe[(dataframe.age_a .>= 18) .& (dataframe.marstat .== 1), :]

#### Sorting

We can sort a data frame by calling the `sort` function:

In [122]:
sort(dataframe)

Unnamed: 0_level_0,caseid,age_a,marstat,reldlife,religion,samesex,intvwyear,lifprtnr,timesmar
Unnamed: 0_level_1,Int64,Int64,Int64,String3,String3,String3,Int64,String3,String3
1,1,44,1,2,2,3,2002,5,1
2,2,20,1,1,3,4,2002,9,
3,4,36,1,,1,2,2002,17,1
4,6,40,1,1,3,4,2002,1,1
5,7,39,1,1,3,3,2002,10,4
6,8,23,1,2,3,3,2002,3,
7,9,15,1,,1,2,2002,1,
8,12,40,1,1,2,4,2002,3,1
9,13,20,1,1,3,1,2002,0,
10,14,36,1,1,3,3,2002,6,1


The code above sorted the data based on all the columns, taken from left to right. We can instead specify the columns to sort by. Sorting is ascending by default; pass `rev=true` to sort in descending order:

In [127]:
sort(dataframe, ["age_a", "religion"], rev=true)

Unnamed: 0_level_0,caseid,age_a,marstat,reldlife,religion,samesex,intvwyear,lifprtnr,timesmar
Unnamed: 0_level_1,Int64,Int64,Int64,String3,String3,String3,Int64,String3,String3
1,60995,99,1,1,3,2,2013,5,
2,73545,98,1,1,4,4,2017,1,1
3,85921,98,1,1,2,5,2019,15,1
4,90945,50,1,2,4,2,2018,8,1
5,78000,50,1,2,4,2,2016,1,1
6,79622,50,1,1,4,4,2017,3,3
7,73600,50,1,1,3,2,2016,17,1
8,71793,50,1,1,3,2,2015,6,2
9,87325,50,1,1,3,4,2019,1,1
10,87989,50,1,1,2,4,2019,1,1


More information on sorting [here](https://dataframes.juliadata.org/stable/man/sorting/).
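When the columns need different directions, DataFrames.jl provides `order` to set options per column. A minimal sketch on made-up data that reuses two NSFG column names:

```julia
using DataFrames

df = DataFrame(age_a = [40, 40, 25, 31], religion = ["2", "1", "3", "1"])

# age_a descending, ties broken by religion ascending
sorted = sort(df, [order(:age_a, rev=true), :religion])
```

Like `sort` on a vector, this returns a new data frame; `sort!` would reorder `df` itself.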

### Views

We can also create a `view` of a data frame. A view is often useful because it refers to the parent's data instead of copying it, which makes it more memory efficient than a materialized selection. You can create one using the `view` function:

In [28]:
view(dataframe, :, 1:3)

Unnamed: 0_level_0,caseid,age_a,marstat
Unnamed: 0_level_1,Int64,Int64,Int64
1,80717,31,2
2,80721,17,6
3,80722,16,6
4,80724,49,4
5,80732,39,6
6,80734,37,1
7,80735,17,6
8,80736,40,2
9,80738,46,4
10,80739,40,1


or using the `@view` macro:

In [32]:
@view dataframe[1:3, 7:end]

Unnamed: 0_level_0,intvwyear,lifprtnr,timesmar,attnd14,fmarit,gayadopt,lifeprt,sxok18,staytog
Unnamed: 0_level_1,Int64,String3,String3,String3,Int64,String3,String3,String3,String3
1,2018,15,,,5,,,,
2,2018,1,,6.0,5,,,,
3,2019,2,,2.0,5,,,,


### Not, Between, Cols, and All Column Selectors


Drop one column:

In [34]:
dataframe[:, Not(:caseid)]

Unnamed: 0_level_0,age_a,marstat,reldlife,religion,samesex,intvwyear,lifprtnr,timesmar,attnd14
Unnamed: 0_level_1,Int64,Int64,String3,String3,String3,Int64,String3,String3,String3
1,31,2,,1,1,2018,15,,
2,17,6,2,3,2,2018,1,,6
3,16,6,,1,2,2019,2,,2
4,49,4,2,4,3,2019,50,2,
5,39,6,,1,1,2019,0,,
6,37,1,2,2,2,2018,3,1,
7,17,6,1,3,4,2019,0,,2
8,40,2,,1,1,2019,22,,
9,46,4,2,3,2,2018,30,1,
10,40,1,,1,1,2019,5,1,


Select columns starting from a given column and ending at another:

In [36]:
dataframe[:, Between(:caseid, :intvwyear)]

Unnamed: 0_level_0,caseid,age_a,marstat,reldlife,religion,samesex,intvwyear
Unnamed: 0_level_1,Int64,Int64,Int64,String3,String3,String3,Int64
1,80717,31,2,,1,1,2018
2,80721,17,6,2,3,2,2018
3,80722,16,6,,1,2,2019
4,80724,49,4,2,4,3,2019
5,80732,39,6,,1,1,2019
6,80734,37,1,2,2,2,2018
7,80735,17,6,1,3,4,2019
8,80736,40,2,,1,1,2019
9,80738,46,4,2,3,2,2018
10,80739,40,1,,1,1,2019


Select unions of given columns with `Cols` and `Between` selectors:

In [39]:
dataframe[:, Cols("caseid", Between("intvwyear", "lifeprt"))]

Unnamed: 0_level_0,caseid,intvwyear,lifprtnr,timesmar,attnd14,fmarit,gayadopt,lifeprt
Unnamed: 0_level_1,Int64,Int64,String3,String3,String3,Int64,String3,String3
1,80717,2018,15,,,5,,
2,80721,2018,1,,6,5,,
3,80722,2019,2,,2,5,,
4,80724,2019,50,2,,3,,
5,80732,2019,0,,,5,,
6,80734,2018,3,1,,1,,
7,80735,2019,0,,2,5,,
8,80736,2019,22,,,5,,
9,80738,2018,30,1,,3,,
10,80739,2019,5,1,,1,,


## Joining Dataframes

To practice joining data frames, we can split our original NSFG data into two smaller data frames that both keep the `caseid` column:

In [54]:
df_right = dataframe[[1,2,3,5,10], Cols("caseid", Between("religion", "intvwyear"))]
df_left = dataframe[[1,2,4,6,10], Between("caseid", "marstat")]

Unnamed: 0_level_0,caseid,age_a,marstat
Unnamed: 0_level_1,Int64,Int64,Int64
1,80717,31,2
2,80721,17,6
3,80724,49,4
4,80734,37,1
5,80739,40,1


The following functions are provided to perform seven kinds of joins:

- `innerjoin`: the output contains rows for values of the key that exist in all passed data frames.
- `leftjoin`: the output contains rows for values of the key that exist in the first (left) argument, whether or not that value exists in the second (right) argument.
- `rightjoin`: the output contains rows for values of the key that exist in the second (right) argument, whether or not that value exists in the first (left) argument.
- `outerjoin`: the output contains rows for values of the key that exist in any of the passed data frames.
- `semijoin`: like an inner join, but the output is restricted to columns from the first (left) argument.
- `antijoin`: the output contains rows for values of the key that exist in the first (left) but not the second (right) argument. As with `semijoin`, the output is restricted to columns from the first (left) argument.
- `crossjoin`: the output is the Cartesian product of rows from all passed data frames.

In [55]:
innerjoin(df_left, df_right, on="caseid")

Unnamed: 0_level_0,caseid,age_a,marstat,religion,samesex,intvwyear
Unnamed: 0_level_1,Int64,Int64,Int64,String3,String3,Int64
1,80717,31,2,1,1,2018
2,80721,17,6,3,2,2018
3,80739,40,1,1,1,2019


The `source` argument adds an extra column indicating whether a row appeared only in the left, only in the right, or in both data frames:

In [143]:
leftjoin(df_left, df_right, on="caseid", source="source_custom")

Unnamed: 0_level_0,caseid,age_a,marstat,religion,samesex,intvwyear,source_custom
Unnamed: 0_level_1,Int64,Int64,Int64,String3?,String3?,Int64?,String
1,80717,31,2,1,1,2018,both
2,80721,17,6,3,2,2018,both
3,80739,40,1,1,1,2019,both
4,80724,49,4,missing,missing,missing,left_only
5,80734,37,1,missing,missing,missing,left_only


In [57]:
rightjoin(df_left, df_right, on="caseid", source="source")

Unnamed: 0_level_0,caseid,age_a,marstat,religion,samesex,intvwyear,source
Unnamed: 0_level_1,Int64,Int64?,Int64?,String3,String3,Int64,String
1,80717,31,2,1,1,2018,both
2,80721,17,6,3,2,2018,both
3,80739,40,1,1,1,2019,both
4,80722,missing,missing,1,2,2019,right_only
5,80732,missing,missing,1,1,2019,right_only

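DataFrames.jl also provides `semijoin`, `antijoin` and `outerjoin`, which are not run in the cells above. A minimal sketch on two made-up tables that share a `caseid` key:

```julia
using DataFrames

left  = DataFrame(caseid = [1, 2, 3], age_a = [31, 17, 49])
right = DataFrame(caseid = [2, 3, 4], religion = ["3", "4", "1"])

matched   = semijoin(left, right, on = :caseid)    # left rows that have a match; left columns only
unmatched = antijoin(left, right, on = :caseid)    # left rows without a match
everyone  = outerjoin(left, right, on = :caseid)   # all keys from both sides; gaps become missing
```

`semijoin` and `antijoin` act as filters on the left table, so they never add columns from the right one.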

## Split-Apply-Combine

The DataFrames package supports the split-apply-combine strategy through the `groupby` function that creates a `GroupedDataFrame`, followed by `combine`, `select`/`select!` or `transform`/`transform!`.

Let's group our data set by marital status:

In [58]:
df_gp = groupby(dataframe, "marstat")

Unnamed: 0_level_0,caseid,age_a,marstat,reldlife,religion,samesex,intvwyear,lifprtnr,timesmar
Unnamed: 0_level_1,Int64,Int64,Int64,String3,String3,String3,Int64,String3,String3
1,80734,37,1,2,2,2,2018,3,1
2,80739,40,1,,1,1,2019,5,1
3,80745,32,1,1,3,2,2019,4,1
4,80762,45,1,,1,5,2019,50,2
5,80769,39,1,2,3,3,2018,39,2
6,80770,48,1,1,4,4,2018,3,1
7,80775,26,1,3,4,2,2019,11,1
8,80777,29,1,1,3,4,2018,13,1
9,80794,35,1,,1,1,2018,2,1
10,80810,35,1,2,2,3,2018,11,1

Unnamed: 0_level_0,caseid,age_a,marstat,reldlife,religion,samesex,intvwyear,lifprtnr,timesmar
Unnamed: 0_level_1,Int64,Int64,Int64,String3,String3,String3,Int64,String3,String3
1,85548,48,9,3.0,3,1,2019,2,
2,71119,28,9,,1,1,2015,20,
3,73136,34,9,1.0,3,1,2017,6,
4,79517,42,9,,1,1,2017,28,
5,84405,19,9,1.0,4,3,2017,1,
6,87057,40,9,1.0,2,1,2017,0,
7,91485,28,9,2.0,3,1,2018,2,
8,91892,33,9,1.0,1,1,2019,5,
9,54251,30,9,8.0,3,5,2011,0,
10,60527,40,9,1.0,4,1,2014,3,


Once we have our grouped data set we can use the `combine` function to compute statistics such as the mean or the number of rows for each group. Notice the `=>` operator, which maps a column to the function we want to apply. Remember to load the `Statistics` package:

In [60]:
using Statistics

combine(df_gp, "age_a" => mean)

Unnamed: 0_level_0,marstat,age_a_mean
Unnamed: 0_level_1,Int64,Float64
1,1,34.7219
2,2,29.8544
3,3,39.4524
4,4,37.3915
5,5,34.7032
6,6,24.2707
7,8,33.7647
8,9,33.2857


In [61]:
combine(df_gp, nrow)

Unnamed: 0_level_0,marstat,nrow
Unnamed: 0_level_1,Int64,Int64
1,1,22277
2,2,8227
3,3,252
4,4,4296
5,5,2092
6,6,35289
7,8,17
8,9,14


We can apply different functions at the same time:

In [65]:
combine(df_gp, nrow, "age_a" => mean)

Unnamed: 0_level_0,marstat,nrow,age_a_mean
Unnamed: 0_level_1,Int64,Int64,Float64
1,1,22277,34.7219
2,2,8227,29.8544
3,3,252,39.4524
4,4,4296,37.3915
5,5,2092,34.7032
6,6,35289,24.2707
7,8,17,33.7647
8,9,14,33.2857


You can customize the output column names using the `=>` operator:

In [87]:
combine(df_gp, nrow, "age_a" => mean => "age_mean_custom_name")

Unnamed: 0_level_0,marstat,nrow,age_mean_custom_name
Unnamed: 0_level_1,Int64,Int64,Float64
1,1,22277,34.7219
2,2,8227,29.8544
3,3,252,39.4524
4,4,4296,37.3915
5,5,2092,34.7032
6,6,35289,24.2707
7,8,17,33.7647
8,9,14,33.2857


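We can perform more advanced operations. For example, we can pass multiple columns as arguments and apply a custom function. (The two outputs below were produced while `df_gp` was grouped by `religion`, so their first column is `religion` rather than `marstat`.)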

In [145]:
combine(df_gp, ["age_a", "intvwyear"] => ((a, i) -> (x=mean(a)/mean(i))))

Unnamed: 0_level_0,religion,age_a_intvwyear_function
Unnamed: 0_level_1,String3,Float64
1,1.0,0.0143374
2,3.0,0.0146733
3,4.0,0.0145775
4,2.0,0.0144825
5,,0.013986


Similarly, a single custom function can return several values at once as a named tuple; passing `AsTable` as the destination spreads its fields into separate columns. You can find more information about the different `DataFrames` **Types** [here](https://docs.juliahub.com/DataFrames/AR9oZ/0.21.5/lib/types/).

In [147]:
combine(df_gp, ["age_a", "intvwyear"] => ((a, i) -> (x=mean(a)/mean(i), y=sum(i))) => AsTable)

Unnamed: 0_level_0,religion,x,y
Unnamed: 0_level_1,String3,Float64,Int64
1,1.0,0.0143374,34069135
2,3.0,0.0146733,65764429
3,4.0,0.0145775,11667106
4,2.0,0.0144825,34266608
5,,0.013986,2002


We can also select the columns by their integer positions (here columns 2 and 7, `age_a` and `intvwyear`):

In [88]:
combine(df_gp, [2,7] => cor)

Unnamed: 0_level_0,marstat,age_a_intvwyear_cor
Unnamed: 0_level_1,Int64,Float64
1,1,0.165746
2,2,0.160271
3,3,0.23207
4,4,0.215881
5,5,0.183481
6,6,0.0650576
7,8,0.500346
8,9,0.260659


In [94]:
combine(df_gp, "age_a" => (x -> [extrema(x)]) => ["min_age", "max_age"])

Unnamed: 0_level_0,marstat,min_age,max_age
Unnamed: 0_level_1,Int64,Int64,Int64
1,1,17,98
2,2,15,50
3,3,22,49
4,4,19,50
5,5,18,49
6,6,15,99
7,8,21,49
8,9,19,48


Unlike `combine`, the `select` and `transform` functions always return a data frame with the same number and order of rows as the source. In the `select` and `transform` examples below, the value computed for each group is broadcast to every row of that group:

In [97]:
combine(df_gp, "age_a" => mean)

Unnamed: 0_level_0,marstat,age_a_mean
Unnamed: 0_level_1,Int64,Float64
1,1,34.7219
2,2,29.8544
3,3,39.4524
4,4,37.3915
5,5,34.7032
6,6,24.2707
7,8,33.7647
8,9,33.2857


In [96]:
select(df_gp, "age_a" => mean)

Unnamed: 0_level_0,marstat,age_a_mean
Unnamed: 0_level_1,Int64,Float64
1,2,29.8544
2,6,24.2707
3,6,24.2707
4,4,37.3915
5,6,24.2707
6,1,34.7219
7,6,24.2707
8,2,29.8544
9,4,37.3915
10,1,34.7219


In [102]:
show(transform(df_gp, "age_a" => mean), allcols=true)

72464×18 DataFrame
   Row │ caseid  age_a  marstat  reldlife  religion  samesex  intvwyear  lifprtnr  timesmar  attnd14  fmarit  gayadopt  lifeprt  sxok18   staytog  prvntdiv  achieve  age_a_mean
       │ Int64   Int64  Int64    String3   String3   String3  Int64      String3   String3   String3  Int64   String3   String3  String3  String3  String3   String3  Float64
───────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
     1 │  80717     31        2  NA        1         1             2018  15        NA        NA            5  NA     
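The difference in output shape between `combine` and `transform` can be checked on a three-row toy table (the columns `g` and `x` are made up):

```julia
using DataFrames, Statistics

df  = DataFrame(g = ["a", "b", "a"], x = [1.0, 10.0, 3.0])
gdf = groupby(df, :g)

per_group = combine(gdf, :x => mean)     # one row per group
per_row   = transform(gdf, :x => mean)   # one row per source row, in the original order
```

`per_group` has two rows, while `per_row` keeps all three rows and repeats the group mean for both rows of group `"a"`.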

More information on split-apply-combine can be found [here](https://dataframes.juliadata.org/stable/man/split_apply_combine/).


--- 

## Exercises

For the set of exercises below, we will continue using the NSFG data set. Remember the labels are stored in the `img/labels` directory.

1. Filter data to get all respondents older than 40 years old who have been married twice. Sort the obtained data by interview year.

In [121]:
sort(dataframe[(dataframe.age_a .> 40) .& (dataframe.timesmar .== "2"), :], "intvwyear")

Unnamed: 0_level_0,caseid,age_a,marstat,reldlife,religion,samesex,intvwyear,lifprtnr,timesmar
Unnamed: 0_level_1,Int64,Int64,Int64,String3,String3,String3,Int64,String3,String3
1,5012,42,1,,1,2,2002,12,2
2,1511,42,1,1,3,3,2002,10,2
3,11058,42,1,,1,4,2002,2,2
4,12031,42,1,,1,2,2002,5,2
5,11470,41,1,3,3,2,2002,30,2
6,11333,41,1,1,2,4,2002,2,2
7,3570,41,1,2,2,2,2002,8,2
8,5378,43,1,1,2,2,2002,6,2
9,1611,42,1,1,3,2,2002,3,2
10,657,43,1,1,2,5,2002,4,2


2. Get the number of respondents grouped by religion.

In [128]:
df_gp = groupby(dataframe, "religion")

Unnamed: 0_level_0,caseid,age_a,marstat,reldlife,religion,samesex,intvwyear,lifprtnr,timesmar
Unnamed: 0_level_1,Int64,Int64,Int64,String3,String3,String3,Int64,String3,String3
1,80717,31,1,,1,1,2018,15,
2,80722,16,1,,1,2,2019,2,
3,80732,39,1,,1,1,2019,0,
4,80736,40,1,,1,1,2019,22,
5,80739,40,1,,1,1,2019,5,1
6,80740,31,1,,1,1,2017,0,
7,80746,17,1,,1,1,2019,6,
8,80757,28,1,,1,4,2019,6,
9,80758,18,1,,1,3,2018,10,
10,80759,40,1,,1,3,2018,26,

Unnamed: 0_level_0,caseid,age_a,marstat,reldlife,religion,samesex,intvwyear,lifprtnr,timesmar
Unnamed: 0_level_1,Int64,Int64,Int64,String3,String3,String3,Int64,String3,String3
1,5758,28,1,,,,2002,,1


In [129]:
combine(df_gp, nrow)

Unnamed: 0_level_0,religion,nrow
Unnamed: 0_level_1,String3,Int64
1,1.0,16929
2,3.0,32695
3,4.0,5799
4,2.0,17040
5,,1
