# DSCI 511: Data acquisition and pre-processing<br>Chapter 6: Data Integration and Enrichment

In this notebook, we'll primarily discuss interacting with tabular or spreadsheet-like data with the `pandas` module (`pip3 install pandas`). `pandas` is a high-level library designed to simplify complex operations; as such, it is a powerful tool well worth learning.

## Exercises
Note: numberings refer to the main notes.

#### 6.1.1.4 Exercise: Boolean masks
Create a Boolean mask for the rows that correspond only to the month of August and print these rows below.

In [8]:
# code goes here
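
As a hint for the exercise above, here is a minimal sketch of a month-based mask. It assumes a datetime column named `Date`; the toy values and the names `toy`, `august_mask`, and `august_rows` are made up for illustration and do not come from the main notes.

```python
import pandas as pd

# Hypothetical toy data; the real exercise uses the dataset from the main notes.
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2019-07-30", "2019-08-01", "2019-08-15", "2019-09-02"]),
    "Close": [10.0, 11.5, 12.0, 9.75],
})

# .dt.month extracts the month number (1-12) from a datetime column,
# so comparing it to 8 yields one True/False value per row.
august_mask = toy["Date"].dt.month == 8

# Indexing with the mask keeps only the rows where the mask is True.
august_rows = toy[august_mask]
print(august_rows)
```

If the date column was read in as plain strings, it would first need converting, for example with `parse_dates` in `pd.read_csv()`.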

#### 6.1.2.1 Exercise:
Download stock history for Tesla Inc. (TSLA) from Yahoo! Finance, read it into a dataframe, and save the "Date" and "Close" columns to a CSV file.

#### Discussion: Column selection and file i/o made easy!
After doing these sorts of things with the `csv` module, often line by line, it's pretty nice to just use the `.to_csv()` dataframe method. Just don't forget to tell pandas to drop the index by passing `index = False`!

In [1]:
import pandas as pd

TSLA = pd.read_csv("./data/TSLA.csv", sep = ",", header = 0, parse_dates = [0])

TSLA[["Date", "Close"]].to_csv("./data/TSLA_first_two_columns.csv", index = False) 
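
To see why dropping the index matters, the sketch below writes a small made-up dataframe both ways into in-memory buffers (so no files are touched) and compares the header lines. The values and variable names are illustrative only.

```python
import io
import pandas as pd

# Made-up values, just to have something to write out.
toy = pd.DataFrame({"Date": ["2020-01-02", "2020-01-03"], "Close": [1.0, 2.0]})

# Default behaviour (index=True): the row labels become a first, unnamed column.
with_index = io.StringIO()
toy.to_csv(with_index)

# index=False writes only the actual columns.
without_index = io.StringIO()
toy.to_csv(without_index, index=False)

header_with = with_index.getvalue().splitlines()[0]
header_without = without_index.getvalue().splitlines()[0]
print(header_with)     # ,Date,Close
print(header_without)  # Date,Close

# Reading the indexed file back in shows the stray column under a generated name.
reread = pd.read_csv(io.StringIO(with_index.getvalue()))
print(reread.columns[0])  # Unnamed: 0
```

That stray `Unnamed: 0` column is what you end up carrying around if the index is saved by accident.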

#### 6.1.3.4 Exercise: Dropping missing values

Load `patient-stats.csv` into a `pandas` dataframe. Some patient records are missing values for weight. Drop these records and display the final dataframe.

#### Discussion: Convenience for common tasks
Tasks that once took a lot of looping (dropping) and counting (sampling) can now each be accomplished with a single method call!

In [11]:
import pandas as pd

patient_stats = pd.read_csv("./data/patient-stats.csv", 
                            sep = ",", header = 0)
## drop only the records that are missing a weight
patients_with_weights = patient_stats.dropna(subset = ["weight"])
patients_with_weights.head()

|  | patient_id | age | height | weight |
|---|---|---|---|---|
| 1 | 75608 | 22 | 62 | 101.0 |
| 2 | 19926 | 47 | 77 | 216.0 |
| 3 | 48540 | 18 | 64 | 168.0 |
| 4 | 77350 | 48 | 83 | 160.0 |
| 6 | 35269 | 87 | 63 | 122.0 |
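
One detail worth knowing: a bare `.dropna()` removes a row if *any* column is missing, while the `subset` argument restricts the check to the columns you name. Here is a minimal sketch of the difference on made-up records (the values are hypothetical and not taken from `patient-stats.csv`):

```python
import numpy as np
import pandas as pd

# Hypothetical records: index 1 is missing a height, index 2 is missing a weight.
toy = pd.DataFrame({
    "patient_id": [1, 2, 3],
    "height": [62.0, np.nan, 64.0],
    "weight": [101.0, 216.0, np.nan],
})

# No arguments: any missing value disqualifies the row, so index 1 and 2 are dropped.
any_missing_dropped = toy.dropna()

# subset=["weight"]: only a missing weight disqualifies the row, so only index 2 is dropped.
weight_missing_dropped = toy.dropna(subset=["weight"])

print(len(any_missing_dropped), len(weight_missing_dropped))
```

When only one column can be missing, both calls return the same rows; `subset` just makes the intent explicit.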


#### 6.2.1.1 Exercise: More Boolean masks
Create a mask for filtering data for the players whose height is exactly 72 inches. Use this filter to put this data into a new dataframe.

#### Discussion: Filtering, with concise syntax
These mask-generating conditional expressions, here comparing a column to an integer, really make it easy to drill down into a dataset.

In [12]:
baseball_data = pd.read_csv("./data/baseball_heightweight.csv", header = 0)

height_mask = baseball_data["Height"] == 72

height72 = baseball_data[height_mask]
height72.head()

|  | Name | Team | Position | Height | Weight | Age |
|---|---|---|---|---|---|---|
| 2 | Ramon_Hernandez | BAL | Catcher | 72 | 210.0 | 30.78 |
| 3 | Kevin_Millar | BAL | First_Baseman | 72 | 210.0 | 35.43 |
| 16 | Jay_Gibbons | BAL | Designated_Hitter | 72 | 197.0 | 30.0 |
| 24 | Scott_Williamson | BAL | Relief_Pitcher | 72 | 180.0 | 31.03 |
| 47 | Scott_Podsednik | CWS | Outfielder | 72 | 188.0 | 30.95 |


Once you get the hang of the mask syntax, it becomes convenient to write a complete filter on a single line. Here we flip the condition and keep every player whose height is *not* 72 inches.

In [17]:
## now in one line
baseball_data[baseball_data["Height"] != 72].head()

|  | Name | Team | Position | Height | Weight | Age |
|---|---|---|---|---|---|---|
| 0 | Adam_Donachie | BAL | Catcher | 74 | 180.0 | 22.99 |
| 1 | Paul_Bako | BAL | Catcher | 74 | 215.0 | 34.69 |
| 4 | Chris_Gomez | BAL | First_Baseman | 73 | 188.0 | 35.71 |
| 5 | Brian_Roberts | BAL | Second_Baseman | 69 | 176.0 | 29.39 |
| 6 | Miguel_Tejada | BAL | Shortstop | 69 | 209.0 | 30.77 |


We can even build compound logic statements, combining separate masks through element-wise Boolean operations. But be careful: you have to wrap each mask in parentheses when constructing a compound one, and you must use the element-wise operators `&`, `|`, and `~` instead of `and`, `or`, and `not`.

In [30]:
## compound mask: height is not 72 and position is not Catcher
baseball_data[(baseball_data["Height"] != 72) & 
              (baseball_data["Position"] != "Catcher")].head()

|  | Name | Team | Position | Height | Weight | Age |
|---|---|---|---|---|---|---|
| 4 | Chris_Gomez | BAL | First_Baseman | 73 | 188.0 | 35.71 |
| 5 | Brian_Roberts | BAL | Second_Baseman | 69 | 176.0 | 29.39 |
| 6 | Miguel_Tejada | BAL | Shortstop | 69 | 209.0 | 30.77 |
| 7 | Melvin_Mora | BAL | Third_Baseman | 71 | 200.0 | 35.07 |
| 8 | Aubrey_Huff | BAL | Third_Baseman | 76 | 231.0 | 30.19 |
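
The two cautions about compound masks can be checked on a tiny made-up table. The sketch below (toy values and variable names are illustrative) shows that `and` fails on whole columns, while `&`, `|`, and `~` work row by row.

```python
import pandas as pd

# Toy stand-in for the baseball data.
toy = pd.DataFrame({
    "Height": [74, 72, 73, 69],
    "Position": ["Catcher", "Catcher", "First_Baseman", "Shortstop"],
})

not_72 = toy["Height"] != 72
not_catcher = toy["Position"] != "Catcher"

# `and` asks for a single truth value for a whole Series,
# which pandas refuses to guess, so it raises a ValueError.
try:
    not_72 and not_catcher
    and_failed = False
except ValueError:
    and_failed = True

both = toy[not_72 & not_catcher]    # element-wise AND
either = toy[not_72 | not_catcher]  # element-wise OR
exactly_72 = toy[~not_72]           # element-wise NOT

print(and_failed, len(both), len(either), len(exactly_72))
```

The parentheses are needed when the comparisons are written inline because `&` binds more tightly than `!=` or `==` in Python, so without them the `&` would be evaluated before the comparisons.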


#### 6.2.3.4 Exercise: Joins
Use a boolean mask to put only the data for the under-18 population into a new dataframe from the `state-population.csv` file. Load the data from `state-abbrevs.csv` and perform the appropriate join operation to create a dataframe that has both the under-18 population data and the state names.

#### Discussion: Filter and merge
The only real difference here from the main examples is that we're filtering before a merge. This can be especially helpful if we wish to calculate new data, as in the next exercise.

However, it is worth noting that because we performed an inner join, we not only avoided introducing any NAs but also made ourselves blissfully ignorant of the missing population data for `'PR'`: the merge left its rows (and those for `'USA'`) out of the result because they have no match in the right table.

In [28]:
pop = pd.read_csv("./data/state-population.csv")
abbrev = pd.read_csv("./data/state-abbrevs.csv")

pop = pop[pop['ages'] == "under18"]

joined = pd.merge(pop, abbrev, how = "inner", 
                  left_on = "state/region", right_on = "abbreviation")

joined.head()

|  | state/region | ages | year | population | state | abbreviation |
|---|---|---|---|---|---|---|
| 0 | AL | under18 | 2012 | 1117489.0 | Alabama | AL |
| 1 | AL | under18 | 2010 | 1130966.0 | Alabama | AL |
| 2 | AL | under18 | 2011 | 1125763.0 | Alabama | AL |
| 3 | AL | under18 | 2009 | 1134192.0 | Alabama | AL |
| 4 | AL | under18 | 2013 | 1111481.0 | Alabama | AL |


In [29]:
joined.isnull().any()

state/region    False
ages            False
year            False
population      False
state           False
abbreviation    False
dtype: bool
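
The all-`False` result above is exactly the "blissful ignorance" of an inner join: rows with no match never make it into the result. One way to see what was left out is a left join with `indicator = True`, which adds a `_merge` column recording where each row came from. The sketch below uses tiny made-up tables (the population numbers are placeholders, not real data):

```python
import pandas as pd

# Toy stand-ins for the population and abbreviation tables.
pop = pd.DataFrame({
    "state/region": ["AL", "PR", "USA"],
    "population": [100.0, float("nan"), 300.0],
})
abbrev = pd.DataFrame({"state": ["Alabama"], "abbreviation": ["AL"]})

# A left join keeps every row of `pop`; indicator=True labels each row
# as "both", "left_only", or "right_only" in a new "_merge" column.
left_joined = pd.merge(pop, abbrev, how="left",
                       left_on="state/region", right_on="abbreviation",
                       indicator=True)

unmatched = left_joined[left_joined["_merge"] == "left_only"]
print(unmatched["state/region"].tolist())      # ['PR', 'USA']
print(left_joined["population"].isnull().any())  # True
```

With the unmatched rows kept, the missing population value shows up again in the null check.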

#### 6.2.2.5 Challenge Exercise: Joins
Use boolean masks again, but this time to enrich the population data. Calculate values for the over-18 population and insert them as additional rows tagged with `'over18'` in the `'ages'` column. Be sure that your calculations correctly align the right years and states!

In [None]:
# code goes here
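
As a starting point, here is one possible sketch of the alignment idea on made-up numbers. It assumes the `'ages'` column also contains `'total'` rows, so that over-18 can be computed as total minus under-18; that assumption, the toy values, and the variable names are all illustrative and should be checked against the real file.

```python
import pandas as pd

# Toy data with rows deliberately out of order.
toy = pd.DataFrame({
    "state/region": ["AL", "AK", "AL", "AK"],
    "ages": ["total", "under18", "under18", "total"],
    "year": [2012, 2012, 2012, 2012],
    "population": [100.0, 10.0, 25.0, 40.0],
})

keys = ["state/region", "year"]

# Mask out each age group and index it by (state, year).
total = toy[toy["ages"] == "total"].set_index(keys)["population"]
under18 = toy[toy["ages"] == "under18"].set_index(keys)["population"]

# Subtraction between Series aligns on the (state, year) index, not on row order.
over18 = (total - under18).rename("population").reset_index()
over18["ages"] = "over18"

# Append the new rows, matching the original column order.
enriched = pd.concat([toy, over18[toy.columns]], ignore_index=True)
print(enriched)
```

Setting the index before subtracting is the design choice that matters here: it lets pandas match states and years for you instead of relying on the two masks happening to return rows in the same order.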