
# Practice Grid Searching and Multinomial Models with San Francisco Crime Data

_Authors: Joseph Nelson (DC), Sam Stack (DC)_

## SQL

You must be connected to WiFi on a whitelisted IP address (i.e., your GA campus) in order to complete the SQL tasks.

In [87]:
%load_ext sql
%sql postgresql://dsi_student:yellowpencil@35.196.107.77/postgres

The sql extension is already loaded. To reload it, use:
  %reload_ext sql


'Connected: dsi_student@postgres'

For reference, you can see the tables in this database with this code:

In [88]:
%%sql

SELECT table_name FROM information_schema.tables
WHERE "table_type" = 'BASE TABLE' AND "table_schema" = 'public'

3 rows affected.


table_name
titanic
default_data
sf_crime


### 1.

Return the five Police Department Districts in San Francisco that have the highest number of incidents as well as a count of the incidents in each of those districts.

In [89]:
%%sql
SELECT * FROM sf_crime LIMIT 5;

5 rows affected.


index,dates,category,descript,dayofweek,pddistrict,resolution,address,x,y
0,2015-05-13 23:53:00,WARRANTS,WARRANT ARREST,Wednesday,NORTHERN,"ARREST, BOOKED",OAK ST / LAGUNA ST,-122.4258917,37.7745986
1,2015-05-13 23:53:00,OTHER OFFENSES,TRAFFIC VIOLATION ARREST,Wednesday,NORTHERN,"ARREST, BOOKED",OAK ST / LAGUNA ST,-122.4258917,37.7745986
2,2015-05-13 23:33:00,OTHER OFFENSES,TRAFFIC VIOLATION ARREST,Wednesday,NORTHERN,"ARREST, BOOKED",VANNESS AV / GREENWICH ST,-122.424363,37.80041432
3,2015-05-13 23:30:00,LARCENY/THEFT,GRAND THEFT FROM LOCKED AUTO,Wednesday,NORTHERN,NONE,1500 Block of LOMBARD ST,-122.4269953,37.80087263
4,2015-05-13 23:30:00,LARCENY/THEFT,GRAND THEFT FROM LOCKED AUTO,Wednesday,PARK,NONE,100 Block of BRODERICK ST,-122.4387376,37.77154117


In [90]:
%%sql
SELECT pddistrict, COUNT(category) FROM sf_crime GROUP BY pddistrict ORDER BY count DESC LIMIT 5; 

5 rows affected.


pddistrict,count
SOUTHERN,3287
NORTHERN,2250
CENTRAL,2206
MISSION,2118
BAYVIEW,1678


Expected output:


|count|pddistrict|
|---|---|
|3287|SOUTHERN|
|2250|NORTHERN|
|2206|CENTRAL|
|2118|MISSION|
|1678|BAYVIEW|
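
For comparison, the same aggregation can be sketched in pandas. The frame below is a made-up toy stand-in for `sf_crime` (its rows and the name `toy` are assumptions for illustration, not values from the database); only the shape of the `groupby` chain is the point.

```python
import pandas as pd

# Toy stand-in for the sf_crime table (values are made up for illustration).
toy = pd.DataFrame({
    'pddistrict': ['SOUTHERN', 'SOUTHERN', 'NORTHERN', 'SOUTHERN', 'PARK', 'NORTHERN'],
    'category': ['WARRANTS', 'ASSAULT', 'FRAUD', 'ROBBERY', 'FRAUD', 'ASSAULT'],
})

# Rough equivalent of:
#   SELECT pddistrict, COUNT(category) FROM sf_crime
#   GROUP BY pddistrict ORDER BY count DESC LIMIT 5
top_districts = (
    toy.groupby('pddistrict')['category']
       .count()                        # like COUNT(category), skips missing values
       .sort_values(ascending=False)   # ORDER BY count DESC
       .head(5)                        # LIMIT 5
)
print(top_districts)
```

On the real data, the same chain applied to the full table should reproduce the counts in the expected output above.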

### 2.

How many cases of `'MISSING PERSON'` had no resolution?

In [91]:
%%sql

SELECT count(*) FROM sf_crime WHERE category = 'MISSING PERSON' AND resolution = 'NONE';

1 rows affected.


count
446


Expected output:
    
|count|
|---|
|446|

## Multinomial Logistic Regression Models

So far, we've been using logistic regression for binary problems where there are only two class labels. Logistic regression can also be extended to target variables with multiple classes.

There are two ways scikit-learn solves multiple class problems with logistic regression: a multinomial loss or a "one-versus-rest" (OvR) process in which a model is fit for each target class versus all of the other classes. You can choose between multinomial and OvR by setting the `multi_class` option in the `LogisticRegression` class (see the docs [here](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)).

**Multinomial vs. OvR**
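- (both) `k` classes.
- (Multinomial) One joint model; in the classical formulation, `k-1` sets of coefficients relative to one reference category.
- (OvR) `k` binary models, one per class versus all of the others. (`k*(k-1)/2` is the model count for one-versus-one, which `LogisticRegression` does not use.)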
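
To make the difference concrete, here is a minimal sketch of how class probabilities can be formed under each scheme for a single observation. The function names and the `scores` values are hypothetical; the OvR version assumes each class's sigmoid output is renormalized so the probabilities sum to 1.

```python
import math

def softmax(scores):
    """Multinomial: one joint model; probabilities sum to 1 by construction."""
    shifted = [s - max(scores) for s in scores]   # subtract max for numerical stability
    exps = [math.exp(s) for s in shifted]
    total = sum(exps)
    return [e / total for e in exps]

def ovr_probabilities(scores):
    """OvR: one sigmoid per class (k binary models), then renormalize across classes."""
    sigmoids = [1.0 / (1.0 + math.exp(-s)) for s in scores]
    total = sum(sigmoids)
    return [p / total for p in sigmoids]

# Hypothetical linear scores for k = 3 classes on one observation.
scores = [2.0, 0.5, -1.0]
multinomial_probs = softmax(scores)
ovr_probs = ovr_probabilities(scores)
print(multinomial_probs)
print(ovr_probs)
```

Both schemes rank the classes the same way here, but the probabilities differ because OvR scores each class independently before normalizing.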

In this lab you'll use grid search in conjunction with **multinomial logistic regression** to optimize a model that predicts the category (type) of crime based on various features captured by San Francisco police departments.

**Necessary Lab Imports**

In [92]:
import numpy as np
import pandas as pd
import patsy

from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV


import seaborn as sns

import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

### 1) Read in the data.

In [93]:
df = %sql SELECT * FROM sf_crime
df = df.DataFrame()
df.drop('index', axis=1, inplace=True)

# crime_csv = './datasets/sf_crime_train.csv'

18000 rows affected.


### 2) Create columns for hour, month, and year from the `dates` column.

> *Hint: `pd.to_datetime` may or may not be helpful.*


In [94]:
df['month'] = pd.to_datetime(df['dates'].values).month

In [95]:
df['hour'] = pd.to_datetime(df['dates'].values).hour

In [96]:
df['year'] = pd.to_datetime(df['dates'].values).year
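
An alternative sketch parses the column once and reads each component through the `.dt` accessor, rather than calling `pd.to_datetime` three times. The `toy` frame below is an assumption for illustration; its timestamps follow the format shown in the table preview.

```python
import pandas as pd

# Toy frame; the 'dates' strings mimic the timestamp format in sf_crime.
toy = pd.DataFrame({'dates': ['2015-05-13 23:53:00', '2015-04-02 08:15:00']})

# Parse once, then pull out each component with the .dt accessor.
parsed = pd.to_datetime(toy['dates'])
toy['hour'] = parsed.dt.hour
toy['month'] = parsed.dt.month
toy['year'] = parsed.dt.year
print(toy)
```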

### 3) Validate and clean the data.

In [97]:
df.resolution.unique()

array(['ARREST, BOOKED', 'NONE', 'ARREST, CITED', 'PSYCHOPATHIC CASE',
       'JUVENILE BOOKED', 'UNFOUNDED', 'EXCEPTIONAL CLEARANCE', 'LOCATED',
       'CLEARED-CONTACT JUVENILE FOR MORE INFO', 'NOT PROSECUTED'], dtype=object)

In [98]:
df['Solved'] = (df['resolution'] != 'NONE').astype(int)

In [99]:
df['weekend'] = df['dayofweek'].isin(['Friday', 'Saturday', 'Sunday']).astype(int)

### 4) Set up a target and predictor matrix for predicting violent, non-violent, and non-crimes.

**Non-Violent Crimes**
- Bad checks.
- Bribery.
- Drug/narcotic.
- Drunkenness.
- Embezzlement.
- Forgery/counterfeiting.
- Fraud.
- Gambling.
- Liquor.
- Loitering.
- Trespass.

**Non-Crimes**
- Non-criminal.
- Runaway.
- Secondary codes.
- Suspicious OCC.
- Warrants.

**Violent Crimes**
- Everything else.

**What type of model do you need here? What should your "baseline" category be?**

In [108]:
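# Category names, uppercased to match the values stored in the `category` column.
NON_CRIMES = {c.upper() for c in ['Non-criminal', 'Runaway', 'Secondary codes',
                                  'Suspicious OCC', 'Warrants']}
NON_VIOLENT = {c.upper() for c in ['Bad checks', 'Bribery', 'Drug/narcotic', 'Drunkenness',
                                   'Embezzlement', 'Forgery/counterfeiting', 'Fraud',
                                   'Gambling', 'Liquor', 'Loitering', 'Trespass']}

def encoder(p):
    """Map a crime category to 0 (non-crime), 1 (non-violent), or 2 (violent)."""
    if p in NON_CRIMES:
        return 0
    elif p in NON_VIOLENT:
        return 1
    return 2

df['category'] = df['category'].apply(encoder)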

In [112]:
y = df['category'].values
X = df[['x', 'y', 'month', 'hour', 'year', 'Solved', 'weekend']]
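
One way to inform the baseline question is to look at the class frequencies in `y`. The sketch below uses a made-up label list (`toy_y` is an assumption, not the real distribution) to show how the majority class and the accuracy of always predicting it could be computed.

```python
from collections import Counter

# Hypothetical encoded labels: 0 = non-crime, 1 = non-violent, 2 = violent.
toy_y = [2, 2, 0, 2, 1, 2, 0, 2, 2, 1]

counts = Counter(toy_y)
majority_class, majority_count = counts.most_common(1)[0]

# Accuracy of a trivial model that always predicts the majority class.
baseline_accuracy = majority_count / len(toy_y)

print(counts)
print(majority_class, baseline_accuracy)
```

Any fitted model should beat this baseline accuracy to be considered useful.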

### 5) Standardize the predictor matrix.

In [113]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=48, stratify=y)

In [114]:
ss = StandardScaler()
Xs = ss.fit_transform(X_train)
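
When the holdout set is needed later, the usual pattern is to reuse the scaler fitted on the training rows rather than refitting it. A minimal sketch on synthetic data (the `*_toy` arrays and their scales are assumptions for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic features on very different scales (longitude-like and hour-like).
X_train_toy = rng.normal(loc=[-122.4, 12.0], scale=[0.02, 6.0], size=(200, 2))
X_test_toy = rng.normal(loc=[-122.4, 12.0], scale=[0.02, 6.0], size=(100, 2))

scaler = StandardScaler()
Xs_train_toy = scaler.fit_transform(X_train_toy)  # learn mean/std from training rows only
Xs_test_toy = scaler.transform(X_test_toy)        # reuse those statistics on the holdout rows

print(Xs_train_toy.mean(axis=0).round(3), Xs_train_toy.std(axis=0).round(3))
```

Fitting the scaler on the training split only keeps information about the test rows out of the preprocessing step.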

### 6) Find the optimal hyperparameters (optimal regularization) to predict your crime categories.

> **Note:** In this case you will just be finding the optimal regularization parameter, so you could use `GridSearchCV` or `LogisticRegressionCV`. The grid search object is more general and can be applied to any model and any number of hyperparameters. `LogisticRegressionCV` only searches for the best regularization strength. `LogisticRegressionCV` is recommended, but the downside is that you have to manually compare the lasso and ridge penalty options.

**References for logistic regression regularization hyperparameters:**
- `solver`: Algorithm used for optimization (relevant for multiclass).
    - `newton-cg`: Handles multinomial loss and L2 only.
    - `sag`: Handles multinomial loss, large data sets, and L2 only; works best on scaled data.
    - `saga`: Variant of `sag` that handles multinomial loss with both L1 and L2; works best on scaled data.
    - `lbfgs`: Handles multinomial loss and L2 only.
    - `liblinear`: Small data sets; no warm starts.
- `Cs`: Regularization strengths (smaller values are stronger penalties).
- `cv`: Cross-validation generator or number of folds.
- `penalty`: `'l1'` = lasso, `'l2'` = ridge.

For example, to search for the best regularization hyperparameter, $C$, using 5-fold cross-validation and lasso regularization, you could use:

```python
logreg_cv = LogisticRegressionCV(solver='saga', 
                                 multi_class='multinomial',
                                 Cs=15, 
                                 cv=5, penalty='l1')
```

where `Cs=15` searches a grid of 15 values of $C$ spaced on a log scale between 1e-4 and 1e4. (Remember: `Cs` describes the inverse of regularization strength.)
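
The sketch below fits `LogisticRegressionCV` on synthetic three-class data and inspects the grid it searched. The data, the variable names, and the smaller settings (`Cs=5`, `cv=3`, default solver and L2 penalty) are assumptions chosen to keep the example fast; `multi_class` is left at its default, and depending on your scikit-learn version some of these options may emit deprecation warnings.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV

# Synthetic three-class data standing in for the scaled crime features.
X_toy, y_toy = make_classification(n_samples=300, n_features=6, n_informative=4,
                                   n_classes=3, random_state=0)

# An integer Cs asks for that many values on a log scale from 1e-4 to 1e4.
clf = LogisticRegressionCV(Cs=5, cv=3, max_iter=1000)
clf.fit(X_toy, y_toy)

print(clf.Cs_)   # the grid of C values that was searched
print(clf.C_)    # the selected value(s) of C
```

Comparing `clf.Cs_` with `np.logspace(-4, 4, 5)` is a quick way to confirm what an integer `Cs` expands to.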


**Split data into training and testing sets with 50 percent in testing.**

In [None]:
# A:

**Grid search hyperparameters for the training data.**

> You may get a warning saying that the optimization did not converge for one case. You could increase `max_iter` to run more optimization iterations, but that is not necessary for this lab.

In [None]:
# A:

**Find the best parameters for each target class.**

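> Remember we are searching for the "best" value of the hyperparameter $C$. In _binary_ classification, `LogisticRegressionCV` will give us just one best value. In _multiclass_ classification, the fitted `C_` attribute holds a value of `C` _for each class_. With OvR these values can differ by class; with a multinomial loss a single model is fit, so expect the same value repeated. Think about why per-class values may be useful!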

In [None]:
# A:

**Build one or more logistic regression models using the best parameters for each target class.**

> If each class yielded the same "best" $C$ value then just build one model. If each class had a different "best" $C$ value then build three models, one per class.

In [None]:
# A:

### 7) Build confusion matrices for the model(s) above.
- Use the holdout test data from the train/test split.
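
As a reminder of how the matrix is laid out, here is a small sketch on made-up labels (the `*_toy` lists are assumptions, not model output). In scikit-learn's `confusion_matrix`, rows are true classes and columns are predicted classes.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels (0 = non-crime, 1 = non-violent, 2 = violent).
y_true_toy = [0, 0, 1, 1, 2, 2, 2, 2]
y_pred_toy = [0, 2, 1, 2, 2, 2, 2, 0]

# Rows are true classes, columns are predicted classes, in the order given by `labels`.
cm = confusion_matrix(y_true_toy, y_pred_toy, labels=[0, 1, 2])
print(cm)
```

The diagonal holds the correctly classified counts; off-diagonal cells show which classes get confused with which.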

In [None]:
# A:

### 8) Print classification reports for your model(s).

In [None]:
# A:

**Describe the metrics in the classification report.**

In [None]:
# A:
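
For reference, the per-class numbers in a classification report can be recomputed by hand. The sketch below uses a hypothetical helper, `per_class_metrics`, and made-up labels; it treats one class at a time as the positive label.

```python
def per_class_metrics(y_true, y_pred, label):
    """Precision, recall, and F1 for one class, treating it as the positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)  # correctly predicted as label
    fp = sum(t != label and p == label for t, p in pairs)  # wrongly predicted as label
    fn = sum(t == label and p != label for t, p in pairs)  # label missed by the model
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical labels (0 = non-crime, 1 = non-violent, 2 = violent).
y_true_toy = [0, 0, 1, 1, 2, 2, 2, 2]
y_pred_toy = [0, 2, 1, 2, 2, 2, 2, 0]

for label in (0, 1, 2):
    print(label, per_class_metrics(y_true_toy, y_pred_toy, label))
```

Support, the remaining column in the report, is simply the number of true instances of each class.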

## All done?  Want more?

Here are some ideas to take your learning to the next level:

- Try using OvR multiclass classification instead of multinomial.
- Try using `GridSearchCV` to also look at the effect of the `penalty` option (l1 vs l2).
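
A possible starting point for the second idea is sketched below on synthetic data. The grid values, `cv=3`, and `max_iter=5000` are arbitrary assumptions, and `saga` is chosen only because it supports both penalties with a multiclass target; depending on your scikit-learn version, the `penalty` option may emit a deprecation warning.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler

# Synthetic three-class data standing in for the crime features.
X_toy, y_toy = make_classification(n_samples=300, n_features=6, n_informative=4,
                                   n_classes=3, random_state=0)
X_toy = StandardScaler().fit_transform(X_toy)   # saga converges faster on scaled data

# Search over both the penalty type and the inverse regularization strength.
param_grid = {
    'penalty': ['l1', 'l2'],
    'C': [0.01, 0.1, 1.0, 10.0],
}
search = GridSearchCV(LogisticRegression(solver='saga', max_iter=5000),
                      param_grid, cv=3)
search.fit(X_toy, y_toy)

print(search.best_params_)
print(round(search.best_score_, 3))
```

`search.cv_results_` holds the mean score for every penalty and `C` combination if you want to compare them side by side.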