<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Practice Gridsearch and Multinomial Models with SF Crime Data


---



Predict the category (type) of crime based on various features captured by San Francisco police departments.

**Necessary lab imports**

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(font_scale=1.5)

%config InlineBackend.figure_format = 'retina'
%matplotlib inline

In [2]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV

### 1. Read in the data

In [3]:
# read in the data using pandas
sf_crime = pd.read_csv(
    '../../../../resource-datasets/sf_crime/sf_crime_sample.csv')
sf_crime.drop('DayOfWeek', axis=1, inplace=True)
sf_crime.head()

Unnamed: 0,Dates,Category,Descript,PdDistrict,Resolution,Address,X,Y
0,2003-03-23 23:27:00,ARSON,ARSON OF A VEHICLE,BAYVIEW,NONE,0 Block of HUNTERS PT EXPWY EX,-122.376945,37.733018
1,2006-03-07 06:45:00,LARCENY/THEFT,PETTY THEFT FROM LOCKED AUTO,NORTHERN,NONE,0 Block of MARINA BL,-122.432952,37.805052
2,2004-03-06 03:00:00,NON-CRIMINAL,LOST PROPERTY,SOUTHERN,NONE,800 Block of BRYANT ST,-122.403405,37.775421
3,2011-12-03 12:10:00,BURGLARY,"BURGLARY OF STORE, UNLAWFUL ENTRY",TARAVAL,"ARREST, BOOKED",3200 Block of 20TH AV,-122.475647,37.728528
4,2003-01-10 00:15:00,LARCENY/THEFT,PETTY THEFT OF PROPERTY,NORTHERN,NONE,POLK ST / BROADWAY ST,-122.421772,37.795946


In [4]:
# check the shape of your dataframe
sf_crime.shape

(25000, 8)

In [6]:
#check whether there are any missing values
#do we need to fix anything here?
sf_crime.isnull().sum()

Dates         0
Category      0
Descript      0
PdDistrict    0
Resolution    0
Address       0
X             0
Y             0
dtype: int64

In [7]:
#check what your datatypes are
#do we need to fix anything here?
sf_crime.dtypes

Dates          object
Category       object
Descript       object
PdDistrict     object
Resolution     object
Address        object
X             float64
Y             float64
dtype: object

### 2. Create columns for year, month, day of week, hour, time, and date from the 'Dates' column.

> *`pd.to_datetime` and `Series.dt` may be helpful here!*


In [8]:
# convert the 'Dates' column to a datetime object
sf_crime['Dates'] = pd.to_datetime(sf_crime['Dates'])
sf_crime.head(2)

Unnamed: 0,Dates,Category,Descript,PdDistrict,Resolution,Address,X,Y
0,2003-03-23 23:27:00,ARSON,ARSON OF A VEHICLE,BAYVIEW,NONE,0 Block of HUNTERS PT EXPWY EX,-122.376945,37.733018
1,2006-03-07 06:45:00,LARCENY/THEFT,PETTY THEFT FROM LOCKED AUTO,NORTHERN,NONE,0 Block of MARINA BL,-122.432952,37.805052


In [9]:
# create new columns for 'Year', 'Month', and 'Day_of_Week'
sf_crime['Year'] = sf_crime['Dates'].dt.year
sf_crime['Month'] = sf_crime['Dates'].dt.month
# .dt.weekday_name was removed in pandas 1.0; .dt.day_name() returns the same names
sf_crime['Day_of_Week'] = sf_crime['Dates'].dt.day_name()
# check the first couple rows to make sure it's what you want
sf_crime.head(2)

Unnamed: 0,Dates,Category,Descript,PdDistrict,Resolution,Address,X,Y,Year,Month,Day_of_Week
0,2003-03-23 23:27:00,ARSON,ARSON OF A VEHICLE,BAYVIEW,NONE,0 Block of HUNTERS PT EXPWY EX,-122.376945,37.733018,2003,3,Sunday
1,2006-03-07 06:45:00,LARCENY/THEFT,PETTY THEFT FROM LOCKED AUTO,NORTHERN,NONE,0 Block of MARINA BL,-122.432952,37.805052,2006,3,Tuesday


In [10]:
# create a column for the 'Hour','Time', and 'Date'
sf_crime['Hour'] = sf_crime['Dates'].dt.hour
sf_crime['Time'] = sf_crime['Dates'].dt.time
sf_crime['Date'] = sf_crime['Dates'].dt.date
sf_crime.head(2)

Unnamed: 0,Dates,Category,Descript,PdDistrict,Resolution,Address,X,Y,Year,Month,Day_of_Week,Hour,Time,Date
0,2003-03-23 23:27:00,ARSON,ARSON OF A VEHICLE,BAYVIEW,NONE,0 Block of HUNTERS PT EXPWY EX,-122.376945,37.733018,2003,3,Sunday,23,23:27:00,2003-03-23
1,2006-03-07 06:45:00,LARCENY/THEFT,PETTY THEFT FROM LOCKED AUTO,NORTHERN,NONE,0 Block of MARINA BL,-122.432952,37.805052,2006,3,Tuesday,6,06:45:00,2006-03-07


In [11]:
# Drop the 'Dates' column
dates = sf_crime.pop("Dates")

In [12]:
sf_crime.dtypes

Category        object
Descript        object
PdDistrict      object
Resolution      object
Address         object
X              float64
Y              float64
Year             int64
Month            int64
Day_of_Week     object
Hour             int64
Time            object
Date            object
dtype: object

In [13]:
sf_crime.head()

Unnamed: 0,Category,Descript,PdDistrict,Resolution,Address,X,Y,Year,Month,Day_of_Week,Hour,Time,Date
0,ARSON,ARSON OF A VEHICLE,BAYVIEW,NONE,0 Block of HUNTERS PT EXPWY EX,-122.376945,37.733018,2003,3,Sunday,23,23:27:00,2003-03-23
1,LARCENY/THEFT,PETTY THEFT FROM LOCKED AUTO,NORTHERN,NONE,0 Block of MARINA BL,-122.432952,37.805052,2006,3,Tuesday,6,06:45:00,2006-03-07
2,NON-CRIMINAL,LOST PROPERTY,SOUTHERN,NONE,800 Block of BRYANT ST,-122.403405,37.775421,2004,3,Saturday,3,03:00:00,2004-03-06
3,BURGLARY,"BURGLARY OF STORE, UNLAWFUL ENTRY",TARAVAL,"ARREST, BOOKED",3200 Block of 20TH AV,-122.475647,37.728528,2011,12,Saturday,12,12:10:00,2011-12-03
4,LARCENY/THEFT,PETTY THEFT OF PROPERTY,NORTHERN,NONE,POLK ST / BROADWAY ST,-122.421772,37.795946,2003,1,Friday,0,00:15:00,2003-01-10


In [14]:
sf_crime.Category.value_counts()

LARCENY/THEFT                  4934
OTHER OFFENSES                 3656
NON-CRIMINAL                   2601
ASSAULT                        2164
DRUG/NARCOTIC                  1533
VEHICLE THEFT                  1506
VANDALISM                      1280
WARRANTS                       1239
BURGLARY                       1023
SUSPICIOUS OCC                  891
MISSING PERSON                  771
ROBBERY                         630
FRAUD                           537
SECONDARY CODES                 283
FORGERY/COUNTERFEITING          281
WEAPON LAWS                     255
PROSTITUTION                    223
TRESPASS                        209
STOLEN PROPERTY                 137
SEX OFFENSES FORCIBLE           120
DISORDERLY CONDUCT              105
DRUNKENNESS                     105
RECOVERED VEHICLE                80
DRIVING UNDER THE INFLUENCE      75
KIDNAPPING                       71
RUNAWAY                          58
ARSON                            52
LIQUOR LAWS                 

### 3. Validate and clean the data.

In [None]:
# check the 'Category' value counts to see what sort of categories there are
# and to see if anything might require cleaning (particularly the ones with fewer values)
#sf_crime.Category.value_counts()

In [17]:
# have a look to see whether you have all the days of the week in your data:
# one-hot encoding 'Day_of_Week' creates one Day_of_Week_* column per day present
sf_crime_temp = pd.get_dummies(sf_crime, columns=["Day_of_Week"])

In [20]:
sf_crime=sf_crime_temp

In [25]:
# have a look at the value counts for 'Descript', 'PdDistrict', and 'Resolution' to make sure it all checks out
sf_crime.Resolution.unique()

array(['NONE', 'ARREST, BOOKED', 'ARREST, CITED', 'LOCATED',
       'JUVENILE DIVERTED', 'DISTRICT ATTORNEY REFUSES TO PROSECUTE',
       'UNFOUNDED', 'PSYCHOPATHIC CASE', 'JUVENILE BOOKED',
       'NOT PROSECUTED', 'COMPLAINANT REFUSES TO PROSECUTE',
       'JUVENILE CITED', 'PROSECUTED BY OUTSIDE AGENCY',
       'EXCEPTIONAL CLEARANCE', 'JUVENILE ADMONISHED',
       'CLEARED-CONTACT JUVENILE FOR MORE INFO'], dtype=object)

In [26]:
# use .describe() to see whether the location coordinates seem appropriate
sf_crime.describe()

Unnamed: 0,X,Y,Year,Month,Hour,Day_of_Week_Friday,Day_of_Week_Monday,Day_of_Week_Saturday,Day_of_Week_Sunday,Day_of_Week_Thursday,Day_of_Week_Tuesday,Day_of_Week_Wednesday
count,25000.0,25000.0,25000.0,25000.0,25000.0,25000.0,25000.0,25000.0,25000.0,25000.0,25000.0,25000.0
mean,-122.422454,37.773486,2008.68808,6.40736,13.3848,0.15532,0.14096,0.13984,0.13252,0.14316,0.14192,0.14628
std,0.032753,0.572667,3.625646,3.418299,6.590859,0.362217,0.347987,0.346828,0.339062,0.350243,0.348975,0.353394
min,-122.513642,37.708003,2003.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,-122.432797,37.752874,2005.0,3.0,9.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,-122.416469,37.775421,2009.0,6.0,14.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,-122.406953,37.784401,2012.0,9.0,19.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,-120.5,90.0,2015.0,12.0,23.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
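
In the summary above, the maximum `X` (-120.5) and `Y` (90.0) sit far from the other quartiles, which suggests placeholder coordinates. Below is a minimal sketch of one way to flag such rows. It runs on a toy frame rather than `sf_crime`, and the bounding-box values are hypothetical, chosen loosely around the quartiles shown above.

```python
import pandas as pd

# Toy stand-in for sf_crime; the last row mimics the suspicious maximum
# values (X = -120.5, Y = 90.0) seen in the describe() output.
toy = pd.DataFrame({
    'X': [-122.376945, -122.432952, -120.5],
    'Y': [37.733018, 37.805052, 90.0],
})

# Hypothetical bounding box (an assumption, not taken from the lab).
x_min, x_max = -122.6, -122.3
y_min, y_max = 37.6, 37.9

# Boolean mask: True where both coordinates fall inside the box.
in_box = toy['X'].between(x_min, x_max) & toy['Y'].between(y_min, y_max)
n_outside = int((~in_box).sum())
print(f"rows outside the box: {n_outside}")

# Keep only the plausible rows.
toy_clean = toy[in_box]
```

Whether to drop, impute, or keep such rows is a judgment call; the lab's predictors do not use `X` and `Y` directly.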


### 4. Set up a target and predictor matrix for predicting violent crime vs. non-violent crime vs. non-crimes.

**Non-Violent Crimes:**
- bad checks
- bribery
- drug/narcotic
- drunkenness
- embezzlement
- forgery/counterfeiting
- fraud
- gambling
- liquor laws
- loitering
- trespass

**Non-Crimes:**
- non-criminal
- runaway
- secondary codes
- suspicious occ
- warrants

**Violent Crimes:**
- everything else



**What type of model do you need here? What is your baseline accuracy?**

In [34]:
all_col = sf_crime.Category.unique()

In [35]:
all_col

array(['ARSON', 'LARCENY/THEFT', 'NON-CRIMINAL', 'BURGLARY',
       'SUSPICIOUS OCC', 'VEHICLE THEFT', 'ASSAULT', 'FRAUD',
       'DRUG/NARCOTIC', 'SECONDARY CODES', 'OTHER OFFENSES',
       'MISSING PERSON', 'VANDALISM', 'ROBBERY', 'FORGERY/COUNTERFEITING',
       'PROSTITUTION', 'WARRANTS', 'TRESPASS', 'DISORDERLY CONDUCT',
       'SEX OFFENSES FORCIBLE', 'STOLEN PROPERTY',
       'DRIVING UNDER THE INFLUENCE', 'KIDNAPPING', 'WEAPON LAWS',
       'LOITERING', 'RECOVERED VEHICLE', 'RUNAWAY', 'DRUNKENNESS',
       'LIQUOR LAWS', 'EXTORTION', 'FAMILY OFFENSES', 'EMBEZZLEMENT',
       'SUICIDE', 'GAMBLING', 'BAD CHECKS', 'BRIBERY',
       'SEX OFFENSES NON FORCIBLE', 'TREA'], dtype=object)

In [46]:
NVC = ['BAD CHECKS','BRIBERY','DRUG/NARCOTIC','DRUNKENNESS',
     'EMBEZZLEMENT','FORGERY/COUNTERFEITING','FRAUD',
     'GAMBLING','LIQUOR','LOITERING','TRESPASS','OTHER OFFENSES']
# note: 'LIQUOR' does not match the dataset label 'LIQUOR LAWS', so those rows
# fall through to VC; 'OTHER OFFENSES' is an addition to the list above

NOT_C = ['NON-CRIMINAL','RUNAWAY','SECONDARY CODES','SUSPICIOUS OCC','WARRANTS']

#use a list comprehension to get all the categories in sf_crime['Category'].unique() that are NOT in the lists above
VC = [x for x in all_col if x not in NVC and x not in NOT_C]

print(VC)

['ARSON', 'LARCENY/THEFT', 'BURGLARY', 'VEHICLE THEFT', 'ASSAULT', 'MISSING PERSON', 'VANDALISM', 'ROBBERY', 'PROSTITUTION', 'DISORDERLY CONDUCT', 'SEX OFFENSES FORCIBLE', 'STOLEN PROPERTY', 'DRIVING UNDER THE INFLUENCE', 'KIDNAPPING', 'WEAPON LAWS', 'RECOVERED VEHICLE', 'LIQUOR LAWS', 'EXTORTION', 'FAMILY OFFENSES', 'SUICIDE', 'SEX OFFENSES NON FORCIBLE', 'TREA']


In [48]:
#add a column called 'Type' into your dataframe that stores whether the observation was:
#Non-Violent, Violent, or Non-Crime
#use .map()!
def typecrime(x):
    if x in NOT_C: return 'NOT_CRIMINAL'
    if x in NVC: return 'NON-VIOLENT'
    if x in VC: return 'VIOLENT_CRIME'

sf_crime['Type']=sf_crime["Category"].map(typecrime)

In [50]:
sf_crime.head()

Unnamed: 0,Category,Descript,PdDistrict,Resolution,Address,X,Y,Year,Month,Hour,Time,Date,Day_of_Week_Friday,Day_of_Week_Monday,Day_of_Week_Saturday,Day_of_Week_Sunday,Day_of_Week_Thursday,Day_of_Week_Tuesday,Day_of_Week_Wednesday,Type
0,ARSON,ARSON OF A VEHICLE,BAYVIEW,NONE,0 Block of HUNTERS PT EXPWY EX,-122.376945,37.733018,2003,3,23,23:27:00,2003-03-23,0,0,0,1,0,0,0,VIOLENT_CRIME
1,LARCENY/THEFT,PETTY THEFT FROM LOCKED AUTO,NORTHERN,NONE,0 Block of MARINA BL,-122.432952,37.805052,2006,3,6,06:45:00,2006-03-07,0,0,0,0,0,1,0,VIOLENT_CRIME
2,NON-CRIMINAL,LOST PROPERTY,SOUTHERN,NONE,800 Block of BRYANT ST,-122.403405,37.775421,2004,3,3,03:00:00,2004-03-06,0,0,1,0,0,0,0,NOT_CRIMINAL
3,BURGLARY,"BURGLARY OF STORE, UNLAWFUL ENTRY",TARAVAL,"ARREST, BOOKED",3200 Block of 20TH AV,-122.475647,37.728528,2011,12,12,12:10:00,2011-12-03,0,0,1,0,0,0,0,VIOLENT_CRIME
4,LARCENY/THEFT,PETTY THEFT OF PROPERTY,NORTHERN,NONE,POLK ST / BROADWAY ST,-122.421772,37.795946,2003,1,0,00:15:00,2003-01-10,1,0,0,0,0,0,0,VIOLENT_CRIME


In [56]:
#find the baseline accuracy (the share of the most frequent class):
sf_crime.Type.value_counts(normalize=True)

VIOLENT_CRIME    0.54060
NON-VIOLENT      0.25652
NOT_CRIMINAL     0.20288
Name: Type, dtype: float64
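
The baseline accuracy is the score you would get by always predicting the most frequent class. In the output above that is `VIOLENT_CRIME` at about 0.54. A minimal sketch of the calculation on a toy series (the four labels are made up for illustration):

```python
import pandas as pd

# Toy target column standing in for sf_crime['Type'].
toy_type = pd.Series(
    ['VIOLENT_CRIME', 'VIOLENT_CRIME', 'NON-VIOLENT', 'NOT_CRIMINAL'],
    name='Type')

# Class shares; the largest share is the baseline accuracy.
shares = toy_type.value_counts(normalize=True)
majority_class = shares.idxmax()
baseline = shares.max()
print(majority_class, baseline)
```

Any model worth keeping should beat this number on held-out data.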

In [57]:
#create a target array with 'Type'
#y = sf_crime.pop("Type")

In [None]:
#create a predictor matrix with 'Day_of_Week','Month','Year','PdDistrict','Hour', and 'Resolution'
# X = 

In [None]:
#use pd.get_dummies() to dummify your categorical variables
#remember to drop a column!
# X = 
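
One possible way to fill in the three cells above is sketched below on a toy frame with made-up rows. It assumes `Day_of_Week` is still a single text column; in the notebook above it was already expanded into `Day_of_Week_*` columns, so those would be selected instead.

```python
import pandas as pd

# Toy stand-in for sf_crime with only the columns the exercise names.
toy = pd.DataFrame({
    'Day_of_Week': ['Sunday', 'Tuesday', 'Saturday', 'Friday'],
    'Month': [3, 3, 12, 1],
    'Year': [2003, 2006, 2011, 2003],
    'PdDistrict': ['BAYVIEW', 'NORTHERN', 'TARAVAL', 'NORTHERN'],
    'Hour': [23, 6, 12, 0],
    'Resolution': ['NONE', 'NONE', 'ARREST, BOOKED', 'NONE'],
    'Type': ['VIOLENT_CRIME', 'NON-VIOLENT', 'VIOLENT_CRIME', 'NOT_CRIMINAL'],
})

# Target array.
y = toy['Type']

# Predictor matrix with the requested columns.
X = toy[['Day_of_Week', 'Month', 'Year', 'PdDistrict', 'Hour', 'Resolution']]

# Dummify the categorical columns; drop_first=True drops one level per
# variable so the dummies are not perfectly collinear.
X = pd.get_dummies(X, columns=['Day_of_Week', 'PdDistrict', 'Resolution'],
                   drop_first=True)
print(X.shape)
```

Here `Month`, `Year`, and `Hour` stay numeric; treating them as categorical instead is a modelling choice the lab leaves open.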

### 5. Create a train/test split and standardize the predictor matrices

In [None]:
#create a 50/50 train test split; 
#stratify based on your target variable
#use a random state of 2018


In [None]:
#standardise your predictor matrices
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
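
The two steps in this section can be sketched as follows. This is a minimal example on synthetic data (the feature matrix and labels are generated, not taken from `sf_crime`). The scaler is fitted on the training half only and then applied to the test half.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for the predictor matrix and target.
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(200, 4))
y = np.repeat(['NON-VIOLENT', 'NOT_CRIMINAL',
               'VIOLENT_CRIME', 'VIOLENT_CRIME'], 50)

# 50/50 split, stratified on the target, random state 2018.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=2018)

# Fit the scaler on the training data only, then reuse its mean and
# standard deviation to transform the test data.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

print(X_train.mean(axis=0).round(3), X_train.std(axis=0).round(3))
```

Fitting the scaler on the full data before splitting would leak information from the test set into training.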


### 6. Create a basic Logistic Regression model and use cross_val_score to assess its performance on your training data

In [None]:
#create a default Logistic Regression model and find its mean cross-validated accuracy with your training data
#use 5 cross-validation folds


In [None]:
#create a confusion matrix 
#predictions = 
#confusion = confusion_matrix()
#pd.DataFrame(confusion,
#             columns=sorted(y_train.unique()),
#             index=sorted(y_train.unique()))
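
A minimal sketch of the two cells above, using synthetic three-class data from `make_classification` in place of the crime data. The confusion matrix here is computed on the training predictions, which is one reading of the commented template.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_score

# Synthetic three-class training data (an assumption for illustration).
X_train, y_train = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_redundant=0,
    n_classes=3, random_state=0)

# Default logistic regression; max_iter is raised only to avoid
# convergence warnings.
logreg = LogisticRegression(max_iter=1000)

# Mean accuracy over 5 cross-validation folds.
scores = cross_val_score(logreg, X_train, y_train, cv=5)
print(scores.mean())

# Confusion matrix: rows are true classes, columns are predicted classes.
logreg.fit(X_train, y_train)
predictions = logreg.predict(X_train)
confusion = confusion_matrix(y_train, predictions)
labels = np.unique(y_train)
confusion_df = pd.DataFrame(confusion, columns=labels, index=labels)
print(confusion_df)
```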

### 7. Find the optimal hyperparameters (optimal regularization) to predict your crime categories using GridSearchCV.

> **Note:** Grid searching can be done with `GridSearchCV` or `LogisticRegressionCV`. They operate differently: `GridSearchCV` is general and can be applied to any model, while `LogisticRegressionCV` is specific to tuning logistic regression hyperparameters. I recommend `LogisticRegressionCV`, but its downside is that lasso and ridge must be searched separately. To start with, use `GridSearchCV`.

**Reference for logistic regression regularization hyperparameters:**
- `solver`: algorithm used for optimization (relevant for multiclass)
    - `newton-cg` - handles multinomial loss, L2 only
    - `sag` - handles multinomial loss, suited to large datasets, L2 only, works best on scaled data
    - `lbfgs` - handles multinomial loss, L2 only
    - `liblinear` - suited to small datasets, L1 or L2, one-vs-rest only for multiclass, no warm starts
- `C`: Regularization strengths (smaller values are stronger penalties)
- `penalty`: `'l1'` - Lasso, `'l2'` - Ridge 
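
The cells that follow can be sketched as below on synthetic data. This is a sketch under assumptions: the grid values are arbitrary, and the grid varies only `C` and two L2-capable solvers. Adding `penalty` would require pairing it with a solver that supports L1.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic three-class data standing in for the crime features.
X, y = make_classification(
    n_samples=400, n_features=6, n_informative=4, n_redundant=0,
    n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=2018)

# Hyperparameter dictionary: keys are LogisticRegression arguments.
params = {
    'C': np.logspace(-3, 2, 6),        # smaller C = stronger penalty
    'solver': ['lbfgs', 'newton-cg'],  # both handle multinomial loss with L2
}

# 5-fold grid search over every combination in params.
gs = GridSearchCV(LogisticRegression(max_iter=1000), params, cv=5)
gs.fit(X_train, y_train)

print(gs.best_params_)   # best combination found
print(gs.best_score_)    # its mean cross-validated accuracy

# The refitted best model, scored on the held-out half.
best_logreg = gs.best_estimator_
test_score = best_logreg.score(X_test, y_test)
print(test_score)
```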

In [None]:
#create a hyperparameter dictionary for a logistic regression


In [None]:
#create a gridsearch object using LogisticRegression() and the dictionary you created above


In [None]:
#fit the gridsearch object on your training data


In [None]:
#print out the best parameters


In [None]:
#print out the best mean cross-validated score


In [None]:
#assign your best estimator to the variable 'best_logreg'


In [None]:
#score your model on your testing data


### 8. Print out a classification report for your best_logreg model

In [None]:
#use your test data to create your classification report
#predictions = 
#print(classification_report())
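
A minimal sketch of the report cell, again on synthetic data. A plain `LogisticRegression` stands in here for `best_logreg`, which in the lab would come from the grid search.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic three-class data (an assumption for illustration).
X, y = make_classification(
    n_samples=400, n_features=6, n_informative=4, n_redundant=0,
    n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=2018)

# Stand-in for the tuned model.
best_logreg = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Per-class precision, recall, and F1 on the test data.
predictions = best_logreg.predict(X_test)
report = classification_report(y_test, predictions)
print(report)
```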

### 9. Explore LogisticRegressionCV.  

With `LogisticRegressionCV`, you can access the best regularization strength for predicting each class! Read the documentation and see if you can implement a model with `LogisticRegressionCV`.

In [None]:
# A:
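
One possible starting point is sketched below on synthetic data. `Cs=5` and `cv=5` are arbitrary choices. The fitted `C_` attribute holds the selected regularization strength or strengths; its exact shape depends on the scikit-learn version and on whether the model is fitted one-vs-rest or multinomial.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV

# Synthetic three-class data (an assumption for illustration).
X_train, y_train = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_redundant=0,
    n_classes=3, random_state=0)

# Cs=5 searches a grid of 5 values of C; cv=5 uses 5 folds.
logreg_cv = LogisticRegressionCV(Cs=5, cv=5, max_iter=1000)
logreg_cv.fit(X_train, y_train)

# Selected regularization strength(s).
best_C = np.asarray(logreg_cv.C_)
print(best_C)

train_score = logreg_cv.score(X_train, y_train)
print(train_score)
```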