# Statistical Analysis and Modeling of the Chicago Food Inspections Data
This notebook reads the data splits produced by the `04_create_data_splits` notebook and fits two models: a Logistic Regression model in Python with scikit-learn, and a from-scratch Logistic Regression model in Spark that uses a homemade implementation of Gradient Descent.

See the `01_food_inspections_data_prep` notebook for information about the Chicago Food Inspections Data, the license, and the various data attributes.  See the `02_census_data_prep` notebook for the US Census API terms of use.

### Analysis and Models in this Notebook

- Simple Logistic Regression model using scikit-learn
- From-scratch Logistic Regression model using homemade implementation of Gradient Descent
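
For orientation, the gradient-descent approach named in the list above can be sketched in plain NumPy. This is only a minimal illustration on made-up data, not the notebook's Spark implementation: the function names, the learning rate, the iteration count, and the toy clusters are all assumptions chosen for the example.

```python
import numpy as np


def sigmoid(z):
    """Logistic function, mapping real-valued scores to probabilities in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


def fit_logistic_gd(X, y, learning_rate=0.1, n_iters=5000):
    """Fit logistic regression by batch gradient descent on the mean log loss.

    learning_rate and n_iters are illustrative defaults, not tuned values.
    """
    n_samples, n_features = X.shape
    weights = np.zeros(n_features)
    bias = 0.0
    for _ in range(n_iters):
        probs = sigmoid(X @ weights + bias)
        error = probs - y  # derivative of the log loss w.r.t. the linear score
        weights -= learning_rate * (X.T @ error) / n_samples
        bias -= learning_rate * error.mean()
    return weights, bias


# Hypothetical toy data: two overlapping Gaussian clusters in two dimensions.
rng = np.random.default_rng(0)
X_toy = np.vstack([rng.normal(-1.0, 1.0, size=(100, 2)),
                   rng.normal(1.0, 1.0, size=(100, 2))])
y_toy = np.concatenate([np.zeros(100), np.ones(100)])

weights, bias = fit_logistic_gd(X_toy, y_toy)
toy_accuracy = np.mean((sigmoid(X_toy @ weights + bias) > 0.5) == y_toy)
print(weights, bias, toy_accuracy)
```

Each iteration uses the full data set to compute the gradient, which is the simplest variant; a distributed version would presumably compute the same sums per partition and combine them.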

### Set Global Seed

In [1]:
SEED = 666

### Imports

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression

### Read Train and Test Splits

In [3]:
X_train = pd.read_csv('../data/X_train.gz', compression='gzip')
X_test = pd.read_csv('../data/X_test.gz', compression='gzip')
y_train = pd.read_csv('../data/y_train.gz', compression='gzip').values.flatten()
y_test = pd.read_csv('../data/y_test.gz', compression='gzip').values.flatten()

### Scale Train and Test Sets

In [4]:
minmax_scaler = MinMaxScaler(feature_range=(0, 1))
X_train_scaled = minmax_scaler.fit_transform(X_train.values)
X_test_scaled = minmax_scaler.transform(X_test.values)
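
The scaler is fit on the training set only and then reused on the test set, so no test-set statistics leak into training. A rough NumPy sketch of the same arithmetic for `feature_range=(0, 1)` is shown below; the toy arrays are hypothetical, and unlike `MinMaxScaler` this sketch does not guard against constant columns.

```python
import numpy as np

# Hypothetical toy features: rows are samples, columns are features.
X_train_toy = np.array([[1.0, 10.0],
                        [2.0, 20.0],
                        [3.0, 40.0]])
X_test_toy = np.array([[2.5, 50.0]])

# The minimum and range are computed from the training set only.
train_min = X_train_toy.min(axis=0)
train_range = X_train_toy.max(axis=0) - train_min

# The same training statistics are applied to both splits.
X_train_toy_scaled = (X_train_toy - train_min) / train_range
X_test_toy_scaled = (X_test_toy - train_min) / train_range

print(X_train_toy_scaled)
print(X_test_toy_scaled)
```

Because the test set is scaled with training statistics, a test value above the training maximum (50.0 against a maximum of 40.0 here) maps to a value greater than 1.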

### Check Performance of scikit-learn Logistic Regression with No Regularization
In scikit-learn, `C` is the inverse of the regularization strength, so setting `C=1e8` gives effectively no regularization, matching the unpenalized `Logit` model in the statsmodels API.  See this issue for details: https://github.com/scikit-learn/scikit-learn/issues/6738
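
To see why a very large `C` switches the penalty off, it helps to write out the objective. As described in the scikit-learn user guide, L2-penalized logistic regression minimizes an expression of the following form, where the labels are coded as $y_i \in \{-1, +1\}$, $w$ is the weight vector, and $c$ is the intercept (the exact scaling of the two terms has varied between releases, so treat this as a sketch):

```latex
\min_{w,\,c} \; \frac{1}{2} w^{\top} w
  + C \sum_{i=1}^{n} \log\!\left(1 + \exp\!\left(-y_i \left(x_i^{\top} w + c\right)\right)\right)
```

The penalty term $\frac{1}{2} w^{\top} w$ is fixed while the data term is multiplied by $C$, so at `C=1e8` the penalty contributes almost nothing and the solution approaches the unpenalized maximum-likelihood fit. With the `liblinear` solver the intercept is treated as a synthetic constant feature and is penalized as well, but at this value of `C` that effect is equally negligible.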

In [5]:
sklearn_clf = LogisticRegression(C=1e8, # https://github.com/scikit-learn/scikit-learn/issues/6738
                                 penalty='l2',
                                 solver='liblinear',
                                 fit_intercept=True,
                                 max_iter=1000)

In [6]:
%%time
sklearn_clf.fit(X_train_scaled, y_train)

CPU times: user 1.88 s, sys: 57.9 ms, total: 1.94 s
Wall time: 2.33 s


LogisticRegression(C=100000000.0, class_weight=None, dual=False,
                   fit_intercept=True, intercept_scaling=1, l1_ratio=None,
                   max_iter=1000, multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='liblinear', tol=0.0001, verbose=0,
                   warm_start=False)

### Compute Accuracy at 0.5 Threshold

In [7]:
y_prob = sklearn_clf.predict_proba(X_test_scaled)[:, 1]

In [8]:
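# Label an inspection as positive when its predicted probability exceeds 0.5
y_pred = (y_prob > 0.5).astype(int)
# Accuracy is the fraction of test labels that match the predictions
np.mean(y_test == y_pred)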

0.7761797752808989
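
An accuracy figure is easier to interpret next to the accuracy of always predicting the most frequent class. The sketch below shows both computations on hypothetical labels and probabilities; the arrays are made up for illustration and say nothing about the inspection data itself.

```python
import numpy as np

# Hypothetical labels and predicted probabilities, for illustration only.
y_true_toy = np.array([0, 0, 0, 1, 1, 0, 1, 0])
y_prob_toy = np.array([0.1, 0.4, 0.6, 0.8, 0.3, 0.2, 0.9, 0.5])

# Threshold at 0.5; a probability of exactly 0.5 is labeled 0.
y_pred_toy = (y_prob_toy > 0.5).astype(int)
accuracy = np.mean(y_true_toy == y_pred_toy)

# Accuracy obtained by always predicting the most frequent class.
positive_rate = np.mean(y_true_toy)
majority_baseline = max(positive_rate, 1 - positive_rate)

print(accuracy, majority_baseline)  # 0.75 0.625
```

Applying the same baseline calculation to `y_test` would indicate how much of the reported accuracy comes from class imbalance rather than from the model.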

In [9]:
X_train.head()

Unnamed: 0,risk,latitude,longitude,facility_type_Bakery,facility_type_Catering,facility_type_Children's Services Facility,facility_type_Daycare (2 - 6 Years),facility_type_Daycare (Under 2 Years),facility_type_Daycare Above and Under 2 Years,facility_type_Daycare Combo 1586,...,inspection_type_License,inspection_type_License Re-Inspection,inspection_type_License-Task Force,inspection_type_Out of Business,inspection_type_Recent Inspection,inspection_type_Short Form Complaint,inspection_type_Suspected Food Poisoning,inspection_type_Tag Removal,inspection_type_Task Force Liquor 1475,median_household_income
0,1,41.857714,-87.664542,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,41226.0
1,1,41.721144,-87.675565,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,59488.0
2,1,41.894448,-87.726203,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,22467.0
3,1,41.893216,-87.624812,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,96040.0
4,1,41.844203,-87.720006,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,31445.0


In [None]:
# Consider an end-to-end pipeline implementation for working with this data in Spark
# Run performance comparisons against local processing
# Potentially compare the performance of logistic regression on multiple platforms