# `pandas` Part III

This document introduces how to join multiple datasets in `pandas`.

## Basic idea of relational databases

Data often do not reside in a single table. Instead, they are spread across multiple tables connected by identifiers, for the following reasons:

- storage efficiency: the same data do not have to be stored multiple times. 
- data consistency: multiple copies of data are prone to inconsistency.
- standard access: all *relational databases* offer a similar way of accessing the data.

In CS 217, you will dive into the design of relational databases and the language to access them, Structured Query Language (SQL).
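The storage and consistency points above can be sketched with two small tables. The tables and column names here (`patients`, `admissions`, `patient_id`, and so on) are hypothetical toy data made up for illustration, not taken from any real database.

```python
import pandas as pd

# Hypothetical toy tables: each patient is stored exactly once,
# and admissions point to a patient through the identifier `patient_id`.
patients = pd.DataFrame({
    'patient_id': [1, 2],
    'gender': ['F', 'M'],
})
admissions = pd.DataFrame({
    'admission_id': [10, 11, 12],
    'patient_id': [1, 1, 2],
    'admit_year': [2101, 2102, 2101],
})

# Joining on the identifier rebuilds one wide table when it is needed.
# The gender of patient 1 is repeated only in this result, not in storage.
wide = admissions.merge(patients, on='patient_id', how='left')
print(wide)
```

If the gender of a patient had to be corrected, only one row of `patients` would change, and every later join would pick up the corrected value.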

## A truncated example with healthcare data ([MIMIC-III](https://mimic.mit.edu/docs/iii/))

```{figure} ../img/table-join-mimic.png
---
width: 80%
name: mimic-erd
---
Example of a relational database (MIMIC-III)
```

## Coded example with baby names

{cite:t}`tzioumis2018demographic` studied how often various first names are used by people of different racial and Hispanic origin groups.
The data contain the following information (columns):

In [None]:
import pandas as pd

baby = pd.read_csv('../data/ssa-names.csv.zip')
names_demo_meta = pd.read_excel('../data/firstnames.xlsx')
names_demo = pd.read_excel('../data/firstnames.xlsx', sheet_name='Data')

In [None]:
pd.set_option('display.max_colwidth', 80)
names_demo_meta

**Are the most popular baby names used by people of a variety of ethnic groups?**

In [None]:
# data cleanup for baby names
baby.columns = baby.columns.str.lower()
baby['name'] = baby['name'].str.lower()

In [None]:
# data cleanup for demographics info
names_demo['firstname'] = names_demo['firstname'].str.lower()

In [None]:
# for each (year, sex) group, keep the row with the largest count
most_occurrence_names = baby.loc[
    baby.groupby(['year', 'sex'])['count'].idxmax(),
    ['year', 'name', 'sex', 'count']
].reset_index(drop=True)
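The `groupby(...).idxmax()` step above can be easier to follow on a tiny table. The sketch below uses made-up rows (the `toy` frame and its values are assumptions for illustration only) with the same column names as the cleaned `baby` data.

```python
import pandas as pd

# Made-up rows with the same columns as the cleaned baby-names data
toy = pd.DataFrame({
    'year': [2000, 2000, 2000, 2001, 2001],
    'sex': ['F', 'F', 'M', 'F', 'F'],
    'name': ['emily', 'hannah', 'jacob', 'emily', 'madison'],
    'count': [30, 20, 40, 25, 35],
})

# For each (year, sex) group, idxmax returns the row label of the largest count
top_labels = toy.groupby(['year', 'sex'])['count'].idxmax()

# .loc then pulls exactly those rows out of the original table
top_rows = toy.loc[top_labels, ['year', 'name', 'sex', 'count']].reset_index(drop=True)
print(top_rows)
```

Note that `idxmax` returns a single label per group, so if two names tie for the largest count only the first one encountered is kept.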

In [None]:
most_occurrence_names.merge(names_demo,           # which two datasets to join
                            how='inner',          # method of join
                            left_on='name',       # which column (key) to connect with in the first dataset
                            right_on='firstname') # which column (key) to connect with in the second dataset

**Comparison to a query using SQL (CS 217)**

```sql
SELECT * 
FROM most_occurrence_names 
INNER JOIN names_demo
ON most_occurrence_names.name = names_demo.firstname
```

**Many types of "joining" two tables**

```{figure} ../img/table-join-sql.png
---
width: 80%
name: sql-join
---
Types of JOIN (merge) statements (source: Taylor Brownlow)
```
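To make the join types concrete, here is a minimal sketch that runs the same merge with each value of `how` and counts the resulting rows. The two small frames and the `obs` column are hypothetical; only the key names `name` and `firstname` mirror the example above.

```python
import pandas as pd

# Hypothetical toy tables: 'liam' appears only on the left, 'olivia' only on the right
left = pd.DataFrame({'name': ['emma', 'liam', 'noah'], 'count': [3, 5, 7]})
right = pd.DataFrame({'firstname': ['emma', 'noah', 'olivia'], 'obs': [100, 200, 300]})

# Run the same merge with each join type and record how many rows come back
row_counts = {}
for how in ['inner', 'left', 'right', 'outer']:
    joined = left.merge(right, how=how, left_on='name', right_on='firstname')
    row_counts[how] = len(joined)
print(row_counts)

# indicator=True adds a `_merge` column saying where each row came from
outer = left.merge(right, how='outer', left_on='name', right_on='firstname',
                   indicator=True)
print(outer[['name', 'firstname', '_merge']])
```

The inner join keeps only the names present in both tables, the left and right joins keep every row of one side, and the outer join keeps everything, filling the unmatched side with missing values.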

## Practice 4

Using the MIMIC-III demo dataset (`PATIENTS.csv` and `ADMISSIONS.csv`), 

1. Create a table that includes all admissions with the corresponding patient information.  (*Think about which way to join/merge.*)
2. Report the number of admissions for each patient.
3. Report the number of admissions grouped by gender and ethnicity.