# Join Features
This notebook loads and joins:

- The food inspections features created in the `01_food_inspections_data_prep` notebook and
- The census features created in the `02_census_data_prep` notebook.

See the `01_food_inspections_data_prep` notebook for information about the Chicago Food Inspections data, its license, and its attributes. See the `02_census_data_prep` notebook for the US Census API terms of use.

### Imports

In [1]:
import pandas as pd

### Read Chicago Food Inspection Features Dataset

In [2]:
inspections_df = pd.read_csv('../data/Food_Inspection_Features.gz', compression='gzip')

In [3]:
inspections_df.shape

(169305, 96)

### Read Census Features Dataset
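Cast `zip` to a string so that it has the same type as the zip key built from the inspections data; the merge needs both key columns to share a dtype.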

In [4]:
census_df = pd.read_csv('../data/Census_Features.csv')
census_df['zip'] = census_df['zip'].astype(str)

In [5]:
census_df.shape

(59, 2)
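
Casting after the read works when every zip code parses cleanly as an integer, which should hold for Chicago zips because they start with 6. As a hedged alternative, the sketch below declares the dtype at read time so the text is never converted to a number in the first place. The inline CSV and its values are made up for illustration and are not taken from `Census_Features.csv`.

```python
import io
import pandas as pd

# Hypothetical stand-in for the census file; the values are illustrative only.
csv_text = "zip,median_household_income\n60601,100000\n02134,50000\n"

# Cast after reading: the column is parsed as an integer first,
# so a leading zero is already lost when astype(str) runs.
cast_after = pd.read_csv(io.StringIO(csv_text))
cast_after['zip'] = cast_after['zip'].astype(str)

# Declare the dtype at read time: the text is kept exactly as written.
read_as_str = pd.read_csv(io.StringIO(csv_text), dtype={'zip': str})

print(cast_after['zip'].tolist())
print(read_as_str['zip'].tolist())
```

For this notebook's data the two approaches should give the same result; the `dtype` form is simply more robust if zips from other regions are ever added.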

### Create Categorical Column of Zips by "Un-One-Hot-Encoding" via the "idxmax" Method

In [6]:
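# Select only the one-hot zip indicator columns (assumed to be named 'zip_<code>'),
# so that re-running this cell does not pick up the derived 'zip' column itself.
zip_cols = [col for col in inspections_df.columns if col.startswith('zip_')]
# idxmax returns the name of the column holding the row's maximum (the 1); keep the code after the underscore.
inspections_df['zip'] = inspections_df[zip_cols].idxmax(axis=1).apply(lambda z: z.split('_')[1])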
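
To make the `idxmax` trick concrete, here is a minimal sketch on a toy frame. The column names follow the `zip_<code>` pattern that the cell above assumes, and the rows are invented. One caveat worth knowing: for a row where every indicator is 0, `idxmax` returns the first column, so the sketch masks such rows explicitly. Whether any such rows exist in the real feature set is not stated here.

```python
import pandas as pd

# Toy one-hot frame; column names and values are illustrative only.
toy = pd.DataFrame({
    'risk': [1, 2, 3],
    'zip_60601': [1, 0, 0],
    'zip_60614': [0, 1, 0],
})

zip_cols = [col for col in toy.columns if col.startswith('zip_')]

# Name of the column holding each row's maximum, then the part after 'zip_'.
toy['zip'] = toy[zip_cols].idxmax(axis=1).str.split('_').str[1]

# Rows with no indicator set would otherwise silently get the first zip column.
has_zip = toy[zip_cols].sum(axis=1) > 0
toy.loc[~has_zip, 'zip'] = None

print(toy[['risk', 'zip']])
```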

### Left Join to Get Median Household Income from Census Data for each Zip

In [7]:
joined_df = pd.merge(inspections_df, census_df, on=['zip'], how='left')
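
A left join keeps every inspection row, but it can duplicate rows if a zip appears more than once in the census table. The sketch below shows two optional `pd.merge` arguments that guard against this and make unmatched keys easy to find: `validate='many_to_one'` raises an error if the right-hand keys are not unique, and `indicator=True` adds a `_merge` column. The frames and income values are invented for illustration.

```python
import pandas as pd

# Toy stand-ins for inspections_df and census_df; values are illustrative only.
inspections = pd.DataFrame({
    'inspection_id': [1, 2, 3],
    'zip': ['60601', '60614', '60666'],
})
census = pd.DataFrame({
    'zip': ['60601', '60614'],
    'median_household_income': [100000, 80000],
})

joined = pd.merge(
    inspections, census, on='zip', how='left',
    validate='many_to_one',  # fail loudly if a zip repeats in the census table
    indicator=True,          # adds '_merge': 'both' or 'left_only' per row
)

# Zips present in the inspections data but missing from the census data.
unmatched_zips = joined.loc[joined['_merge'] == 'left_only', 'zip'].unique().tolist()
print(unmatched_zips)
```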

### Collect the List of Zip Codes for which We Could Not Retrieve the Median Household Income

In [8]:
sorted(joined_df.loc[joined_df['median_household_income'].isnull(), 'zip'].unique())

['60666']

### Drop the Zip Column to Maintain only One-Hot-Encoded Zips

In [9]:
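# Pass the column by keyword; the positional axis argument (drop('zip', 1)) was removed in pandas 2.0.
joined_df = joined_df.drop(columns='zip')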

### Drop Null Records from Our Analysis

In [10]:
feature_df = joined_df[~joined_df['median_household_income'].isnull()]

In [11]:
feature_df.shape

(166871, 97)
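
The shapes reported in this notebook imply that 169305 - 166871 = 2434 rows were dropped, assuming the left join did not duplicate any inspection rows, and that the column count went from 96 to 97 through the added income column. As a small sketch of the filtering step itself, the null filter can equivalently be written with `dropna`, and the number of dropped rows can be computed directly. The toy frame below is illustrative only.

```python
import numpy as np
import pandas as pd

# Toy stand-in for joined_df; values are illustrative only.
joined = pd.DataFrame({
    'inspection_id': [1, 2, 3],
    'median_household_income': [100000.0, np.nan, 80000.0],
})

# Boolean-mask form and dropna form select the same rows.
by_mask = joined[~joined['median_household_income'].isnull()]
by_dropna = joined.dropna(subset=['median_household_income'])

# Row accounting: how many records the filter removed.
n_dropped = len(joined) - len(by_dropna)
print(n_dropped)
```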

### Write the Final Feature Set to Compressed CSV

In [12]:
feature_df.to_csv('../data/Final_Features.gz', compression='gzip', index=False)
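
A quick way to confirm the export is to read the file back and compare it with the frame that was written. The sketch below does this round trip in a temporary directory with a toy frame, so it does not touch `../data/`; the column names and values are illustrative only.

```python
import os
import tempfile
import pandas as pd

# Toy stand-in for feature_df; values are illustrative only.
features = pd.DataFrame({
    'zip_60601': [1, 0],
    'median_household_income': [100000, 80000],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'Final_Features.gz')
    # index=False keeps the row index out of the file, so no extra column appears on read.
    features.to_csv(path, compression='gzip', index=False)
    restored = pd.read_csv(path, compression='gzip')

print(restored.equals(features))
```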