# `pandas` Part III - Merging and joining datasets

This document introduces how to merge and join multiple datasets in `pandas`.

## Basic idea of relational databases

On many occasions, data do not reside in a single table.  Instead, they are spread across multiple tables connected by identifiers, for several reasons:

- storage efficiency: the same data do not have to be stored multiple times. 
- data consistency: multiple copies of data are prone to inconsistency.
- standard access: all *relational databases* offer a similar way of accessing the data.
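
As a small illustration of the first two points, the sketch below stores each patient's details once and lets every admission refer to its patient through an identifier. The table names, column names (`patient_id`, `admission_id`, `diagnosis`), and all values are made up for illustration; they are not taken from MIMIC-III.

```python
import pandas as pd

# Hypothetical table: each patient is stored exactly once.
patients = pd.DataFrame({
    'patient_id': [1, 2],
    'gender': ['F', 'M'],
})

# Hypothetical table: each admission refers to its patient by identifier only.
admissions = pd.DataFrame({
    'admission_id': [10, 11, 12],
    'patient_id': [1, 1, 2],
    'diagnosis': ['sepsis', 'pneumonia', 'fracture'],
})

# Patient 1 has two admissions, yet that patient's gender is stored only once.
admissions_per_patient = admissions['patient_id'].value_counts().sort_index()
print(admissions_per_patient)
```

If a patient's record needs correcting, only one row in `patients` changes, which is the consistency argument in miniature.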

In CS 217, you will dive into the design of relational databases and the language to access them, Structured Query Language (SQL).

## A truncated example with healthcare data ([MIMIC-III](https://mimic.mit.edu/docs/iii/))

```{figure} ../img/table-join-mimic.png
---
width: 80%
name: mimic-erd
---
Example of a relational database (MIMIC-III)
```

## Coded example with baby names

{cite:t}`tzioumis2018demographic` has studied how often various first names are used by people of certain racial and Hispanic origin groups.  
The data contain the following information (columns):

In [52]:
import pandas as pd

baby = pd.read_csv('../data/ssa-names.csv.zip')
names_demo_meta = pd.read_excel('../data/firstnames.xlsx')
names_demo = pd.read_excel('../data/firstnames.xlsx', sheet_name='Data')

In [53]:
pd.set_option('display.max_colwidth', 80)
names_demo_meta

| | Field | Description |
|---|---|---|
| 0 | firstname | First name |
| 1 | obs | Number of occurrences in the combined mortgage datasets |
| 2 | pcthispanic | Percent Hispanic or Latino |
| 3 | pctwhite | Percent Non-Hispanic White |
| 4 | pctblack | Percent Non-Hispanic Black or African American |
| 5 | pctapi | Percent Non-Hispanic Asian or Native Hawaiian or Other Pacific Islander |
| 6 | pctaian | Percent Non-Hispanic American Indian or Alaska Native |
| 7 | pct2prace | Percent Non-Hispanic Two or More Races |


**Are the most popular baby names used by people of a variety of ethnic groups?**

In [56]:
names_demo.head()

| | firstname | obs | pcthispanic | pctwhite | pctblack | pctapi | pctaian | pct2prace |
|---|---|---|---|---|---|---|---|---|
| 0 | AARON | 3646 | 2.88 | 91.607 | 3.264 | 2.057 | 0.055 | 0.137 |
| 1 | ABBAS | 59 | 0.0 | 71.186 | 3.39 | 25.424 | 0.0 | 0.0 |
| 2 | ABBEY | 57 | 0.0 | 96.491 | 3.509 | 0.0 | 0.0 | 0.0 |
| 3 | ABBIE | 74 | 1.351 | 95.946 | 2.703 | 0.0 | 0.0 | 0.0 |
| 4 | ABBY | 262 | 1.527 | 94.656 | 1.527 | 2.29 | 0.0 | 0.0 |


In [57]:
# data cleanup for baby names
baby.columns = baby.columns.str.lower()
baby['name'] = baby['name'].str.lower()

In [58]:
# data cleanup for demographics info
names_demo['firstname'] = names_demo['firstname'].str.lower()

In [59]:
baby.head()

| | state | sex | year | name | count |
|---|---|---|---|---|---|
| 0 | VA | F | 1910 | mary | 848 |
| 1 | VA | F | 1910 | virginia | 270 |
| 2 | VA | F | 1910 | elizabeth | 254 |
| 3 | VA | F | 1910 | ruth | 218 |
| 4 | VA | F | 1910 | margaret | 209 |


In [60]:
# for each (year, sex) group, keep the row with the largest count
most_occurrence_names = baby.loc[
    baby.groupby(['year', 'sex'])['count'].idxmax(),
    ['year', 'name', 'sex', 'count']
].reset_index(drop=True)  # discard the old row labels
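
The cell above relies on `groupby(...).idxmax()` returning, for each group, the row label at which `count` is largest; `.loc` then pulls out exactly those rows. A minimal sketch of the same pattern on a tiny table follows. The `toy` DataFrame and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical miniature version of the baby-names table.
toy = pd.DataFrame({
    'year':  [1910, 1910, 1910, 1910],
    'sex':   ['F', 'F', 'M', 'M'],
    'name':  ['mary', 'ruth', 'john', 'james'],
    'count': [848, 218, 600, 550],
})

# For each (year, sex) group, find the row label where 'count' is largest.
top_labels = toy.groupby(['year', 'sex'])['count'].idxmax()

# Select those rows, keep the columns of interest, and discard the old row labels.
top_names = toy.loc[top_labels, ['year', 'name', 'sex', 'count']].reset_index(drop=True)
print(top_names)
```

Note that when several rows tie for the maximum, `idxmax` returns only the first of them.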

In [61]:
most_occurrence_names

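| | year | name | sex | count |
|---|---|---|---|---|
| 0 | 1910 | mary | F | 2913 |
| 1 | 1910 | john | M | 1326 |
| 2 | 1911 | mary | F | 3188 |
| 3 | 1911 | john | M | 1672 |
| 4 | 1912 | mary | F | 4106 |
| ... | ... | ... | ... | ... |
| 219 | 2019 | noah | M | 2677 |
| 220 | 2020 | olivia | F | 2350 |
| 221 | 2020 | noah | M | 2625 |
| 222 | 2021 | olivia | F | 2395 |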


In [None]:
most_occurrence_names.merge(names_demo,           # which two datasets to join
                            how='inner',          # method of join
                            left_on='name',       # which column (key) to connect with in the first dataset
                            right_on='firstname') # which column (key) to connect with in the second dataset

**Comparison to a query using SQL (CS 217)**

```sql
SELECT * 
FROM most_occurrence_names 
INNER JOIN names_demo
ON most_occurrence_names.name = names_demo.firstname
```
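
To see what an inner join keeps and drops, it can help to try it on a few rows first. The sketch below uses two hypothetical stand-in tables, `top` and `demo`, with made-up values; only the column names `name` and `firstname` mirror the real data.

```python
import pandas as pd

# Hypothetical stand-ins for the two tables; all values are made up.
top = pd.DataFrame({
    'year': [1910, 1910, 2020],
    'name': ['mary', 'john', 'zyx'],          # 'zyx' has no match on the right
})
demo = pd.DataFrame({
    'firstname': ['mary', 'john', 'aaron'],   # 'aaron' has no match on the left
    'pctwhite': [90.0, 85.0, 91.6],
})

# Inner join: keep only rows whose key appears in both tables.
joined = top.merge(demo, how='inner', left_on='name', right_on='firstname')
print(joined)
```

Because the key columns have different names, both `name` and `firstname` appear in the result; one of them can be dropped afterwards if it is redundant.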

**Many types of "joining" two tables**

```{figure} ../img/table-join-sql.png
---
width: 80%
name: sql-join
---
Types of JOIN (merge) statements (source: Taylor Brownlow)
```
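
In `pandas`, the join type is chosen with the `how` argument of `merge`. The sketch below runs the four common types on two made-up tables (`left` and `right` are hypothetical) and uses `indicator=True`, which adds a `_merge` column recording whether each row came from the left table only, the right table only, or both.

```python
import pandas as pd

# Hypothetical tables that share keys 'b' and 'c' only.
left = pd.DataFrame({'key': ['a', 'b', 'c'], 'left_val': [1, 2, 3]})
right = pd.DataFrame({'key': ['b', 'c', 'd'], 'right_val': [20, 30, 40]})

row_counts = {}
for how in ['inner', 'left', 'right', 'outer']:
    # indicator=True adds a '_merge' column: 'left_only', 'right_only', or 'both'
    merged = left.merge(right, how=how, on='key', indicator=True)
    row_counts[how] = len(merged)
    print(f'--- how={how!r} ---')
    print(merged)
```

Comparing the row counts and the `_merge` column across the four results is a quick way to check that a join behaves as intended before applying it to a large dataset.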

## Practice 4

Using the MIMIC-III demo dataset (`PATIENTS.csv` and `ADMISSIONS.csv`), 

1. Create a table that includes all admissions with the corresponding patient information.  (*Think about which way to join/merge.*)
2. Report the number of admissions for each patient.
3. Report the number of admissions grouped by gender and ethnicity.