<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Practice Gridsearch and Multinomial Models with SF Crime Data


---


Predict the category (type) of crime based on various features captured by San Francisco police departments.

**Necessary lab imports**

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(font_scale=1.5)

%config InlineBackend.figure_format = 'retina'
%matplotlib inline

In [2]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV

### 1. Read in the data

In [3]:
# read in the data using pandas
sf_crime = pd.read_csv(
    '../../../../resource-datasets/sf_crime/sf_crime_sample.csv')
sf_crime.drop('DayOfWeek', axis=1, inplace=True)
sf_crime.head()

Unnamed: 0,Dates,Category,Descript,PdDistrict,Resolution,Address,X,Y
0,2003-03-23 23:27:00,ARSON,ARSON OF A VEHICLE,BAYVIEW,NONE,0 Block of HUNTERS PT EXPWY EX,-122.376945,37.733018
1,2006-03-07 06:45:00,LARCENY/THEFT,PETTY THEFT FROM LOCKED AUTO,NORTHERN,NONE,0 Block of MARINA BL,-122.432952,37.805052
2,2004-03-06 03:00:00,NON-CRIMINAL,LOST PROPERTY,SOUTHERN,NONE,800 Block of BRYANT ST,-122.403405,37.775421
3,2011-12-03 12:10:00,BURGLARY,"BURGLARY OF STORE, UNLAWFUL ENTRY",TARAVAL,"ARREST, BOOKED",3200 Block of 20TH AV,-122.475647,37.728528
4,2003-01-10 00:15:00,LARCENY/THEFT,PETTY THEFT OF PROPERTY,NORTHERN,NONE,POLK ST / BROADWAY ST,-122.421772,37.795946


In [4]:
# check the shape of your dataframe
sf_crime.shape

(25000, 8)

In [5]:
# check whether there are any missing values
# do we need to fix anything here?
sf_crime.isnull().sum()

Dates         0
Category      0
Descript      0
PdDistrict    0
Resolution    0
Address       0
X             0
Y             0
dtype: int64

In [6]:
# check what your datatypes are
# do we need to fix anything here?
sf_crime.dtypes

Dates          object
Category       object
Descript       object
PdDistrict     object
Resolution     object
Address        object
X             float64
Y             float64
dtype: object

The 'Dates' column is stored as `object` (plain strings); ideally it should be a datetime object!

### 2. Create columns for year, month, day of week, hour, time and date from the `Dates` column.

> *`pd.to_datetime` and `Series.dt` may be helpful here!*


In [7]:
# convert the 'Dates' column to a datetime object
sf_crime['Dates'] = pd.to_datetime(sf_crime['Dates'])
sf_crime.head(2)

Unnamed: 0,Dates,Category,Descript,PdDistrict,Resolution,Address,X,Y
0,2003-03-23 23:27:00,ARSON,ARSON OF A VEHICLE,BAYVIEW,NONE,0 Block of HUNTERS PT EXPWY EX,-122.376945,37.733018
1,2006-03-07 06:45:00,LARCENY/THEFT,PETTY THEFT FROM LOCKED AUTO,NORTHERN,NONE,0 Block of MARINA BL,-122.432952,37.805052


In [8]:
# create new columns for 'Year', 'Month', and 'Day_of_Week'
sf_crime['Year'] = sf_crime['Dates'].dt.year
sf_crime['Month'] = sf_crime['Dates'].dt.month
# .dt.weekday_name was removed in pandas 1.0; .dt.day_name() returns the same names
sf_crime['Day_of_Week'] = sf_crime['Dates'].dt.day_name()
# check the first couple rows to make sure it's what you want
sf_crime.head(2)

Unnamed: 0,Dates,Category,Descript,PdDistrict,Resolution,Address,X,Y,Year,Month,Day_of_Week
0,2003-03-23 23:27:00,ARSON,ARSON OF A VEHICLE,BAYVIEW,NONE,0 Block of HUNTERS PT EXPWY EX,-122.376945,37.733018,2003,3,Sunday
1,2006-03-07 06:45:00,LARCENY/THEFT,PETTY THEFT FROM LOCKED AUTO,NORTHERN,NONE,0 Block of MARINA BL,-122.432952,37.805052,2006,3,Tuesday


In [9]:
# create a column for the 'Hour','Time', and 'Date'
sf_crime['Hour'] = sf_crime['Dates'].dt.hour
sf_crime['Time'] = sf_crime['Dates'].dt.time
sf_crime['Date'] = sf_crime['Dates'].dt.date
sf_crime.head(2)

Unnamed: 0,Dates,Category,Descript,PdDistrict,Resolution,Address,X,Y,Year,Month,Day_of_Week,Hour,Time,Date
0,2003-03-23 23:27:00,ARSON,ARSON OF A VEHICLE,BAYVIEW,NONE,0 Block of HUNTERS PT EXPWY EX,-122.376945,37.733018,2003,3,Sunday,23,23:27:00,2003-03-23
1,2006-03-07 06:45:00,LARCENY/THEFT,PETTY THEFT FROM LOCKED AUTO,NORTHERN,NONE,0 Block of MARINA BL,-122.432952,37.805052,2006,3,Tuesday,6,06:45:00,2006-03-07


In [10]:
# Drop the 'Dates' column
sf_crime.drop(['Dates'], axis=1, inplace=True)
sf_crime.head(2)

Unnamed: 0,Category,Descript,PdDistrict,Resolution,Address,X,Y,Year,Month,Day_of_Week,Hour,Time,Date
0,ARSON,ARSON OF A VEHICLE,BAYVIEW,NONE,0 Block of HUNTERS PT EXPWY EX,-122.376945,37.733018,2003,3,Sunday,23,23:27:00,2003-03-23
1,LARCENY/THEFT,PETTY THEFT FROM LOCKED AUTO,NORTHERN,NONE,0 Block of MARINA BL,-122.432952,37.805052,2006,3,Tuesday,6,06:45:00,2006-03-07


### 3. Validate and clean the data.

In [11]:
# check the 'Category' value counts to see what sort of categories there are
# and to see if anything might require cleaning (particularly the ones with fewer values)
sf_crime['Category'].value_counts()

LARCENY/THEFT                  4934
OTHER OFFENSES                 3656
NON-CRIMINAL                   2601
ASSAULT                        2164
DRUG/NARCOTIC                  1533
VEHICLE THEFT                  1506
VANDALISM                      1280
WARRANTS                       1239
BURGLARY                       1023
SUSPICIOUS OCC                  891
MISSING PERSON                  771
ROBBERY                         630
FRAUD                           537
SECONDARY CODES                 283
FORGERY/COUNTERFEITING          281
WEAPON LAWS                     255
PROSTITUTION                    223
TRESPASS                        209
STOLEN PROPERTY                 137
SEX OFFENSES FORCIBLE           120
DRUNKENNESS                     105
DISORDERLY CONDUCT              105
RECOVERED VEHICLE                80
DRIVING UNDER THE INFLUENCE      75
KIDNAPPING                       71
RUNAWAY                          58
ARSON                            52
LIQUOR LAWS                 

In [12]:
# have a look to see whether you have all the days of the week in your data
sf_crime['Day_of_Week'].value_counts()

Friday       3883
Wednesday    3657
Thursday     3579
Tuesday      3548
Monday       3524
Saturday     3496
Sunday       3313
Name: Day_of_Week, dtype: int64

In [13]:
# have a look at the value counts for 'Descript', 'PdDistrict', and 'Resolution' to make sure it all checks out
sf_crime['Descript'].value_counts()

GRAND THEFT FROM LOCKED AUTO                                      1656
LOST PROPERTY                                                      883
DRIVERS LICENSE, SUSPENDED OR REVOKED                              768
BATTERY                                                            751
STOLEN AUTOMOBILE                                                  732
WARRANT ARREST                                                     698
AIDED CASE, MENTAL DISTURBED                                       644
SUSPICIOUS OCCURRENCE                                              617
PETTY THEFT FROM LOCKED AUTO                                       603
TRAFFIC VIOLATION                                                  491
MALICIOUS MISCHIEF, VANDALISM                                      481
MALICIOUS MISCHIEF, VANDALISM OF VEHICLES                          477
PETTY THEFT OF PROPERTY                                            473
THREATS AGAINST LIFE                                               408
ENROUT

In [14]:
sf_crime['PdDistrict'].value_counts()

SOUTHERN      4413
MISSION       3416
NORTHERN      3076
BAYVIEW       2555
CENTRAL       2424
TENDERLOIN    2336
INGLESIDE     2256
TARAVAL       1804
PARK          1438
RICHMOND      1282
Name: PdDistrict, dtype: int64

In [15]:
sf_crime['Resolution'].value_counts()

NONE                                      14880
ARREST, BOOKED                             6019
ARREST, CITED                              2181
LOCATED                                     496
PSYCHOPATHIC CASE                           419
UNFOUNDED                                   227
JUVENILE BOOKED                             174
COMPLAINANT REFUSES TO PROSECUTE            125
DISTRICT ATTORNEY REFUSES TO PROSECUTE      123
NOT PROSECUTED                               97
JUVENILE CITED                               94
PROSECUTED BY OUTSIDE AGENCY                 67
JUVENILE ADMONISHED                          45
EXCEPTIONAL CLEARANCE                        35
JUVENILE DIVERTED                            12
CLEARED-CONTACT JUVENILE FOR MORE INFO        6
Name: Resolution, dtype: int64

In [16]:
# use .describe() to see whether the location coordinates seem appropriate
sf_crime.describe()

Unnamed: 0,X,Y,Year,Month,Hour
count,25000.0,25000.0,25000.0,25000.0,25000.0
mean,-122.422454,37.773486,2008.68808,6.40736,13.3848
std,0.032753,0.572667,3.625646,3.418299,6.590859
min,-122.513642,37.708003,2003.0,1.0,0.0
25%,-122.432797,37.752874,2005.0,3.0,9.0
50%,-122.416469,37.775421,2009.0,6.0,14.0
75%,-122.406953,37.784401,2012.0,9.0,19.0
max,-120.5,90.0,2015.0,12.0,23.0
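
The summary above shows a maximum `Y` of 90.0 and a maximum `X` of -120.5, both far from the rest of the distribution (the 75th percentiles are 37.784401 and -122.406953). A minimal sketch of how such rows could be flagged follows. The bounding box is a rough assumption chosen for illustration, not an official boundary, and the toy frame stands in for `sf_crime`.

```python
import pandas as pd

# toy stand-in for sf_crime: two plausible points and one placeholder-like outlier
toy = pd.DataFrame({
    'X': [-122.376945, -122.432952, -120.5],
    'Y': [37.733018, 37.805052, 90.0],
})

# assumed, approximate bounding box for San Francisco (illustrative values only)
LON_MIN, LON_MAX = -122.6, -122.3
LAT_MIN, LAT_MAX = 37.6, 37.9

in_bounds = toy['X'].between(LON_MIN, LON_MAX) & toy['Y'].between(LAT_MIN, LAT_MAX)

# number of rows falling outside the box
n_outliers = int((~in_bounds).sum())

# keep only the rows with plausible coordinates
toy_clean = toy[in_bounds]
print(n_outliers, toy_clean.shape)
```

`X` and `Y` are not part of the predictor matrix built later in this lab, so these rows do not affect the models below.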


### 4. Set up a target and predictor matrix for predicting violent crime vs. non-violent crime vs. non-crimes.

**Non-Violent Crimes:**
- bad checks
- bribery
- drug/narcotic
- drunkenness
- embezzlement
- forgery/counterfeiting
- fraud
- gambling
- liquor laws
- loitering
- trespass

**Non-Crimes:**
- non-criminal
- runaway
- secondary codes
- suspicious occ
- warrants

**Violent Crimes:**
- everything else



**What type of model do you need here? What is your baseline accuracy?**

In [17]:
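# note: 'OTHER OFFENSES' is grouped with the non-violent crimes here,
# in addition to the categories named in the markdown list
NVC = ['BAD CHECKS', 'BRIBERY', 'DRUG/NARCOTIC', 'DRUNKENNESS',
       'EMBEZZLEMENT', 'FORGERY/COUNTERFEITING', 'FRAUD',
       'GAMBLING', 'LIQUOR LAWS', 'LOITERING', 'TRESPASS', 'OTHER OFFENSES']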

NOT_C = ['NON-CRIMINAL', 'RUNAWAY',
         'SECONDARY CODES', 'SUSPICIOUS OCC', 'WARRANTS']

# use a list comprehension to get all the categories in sf_crime['Category'].unique() that are NOT in the lists above

VC = [cat for cat in sf_crime['Category'].unique() if cat not in NVC+NOT_C]

In [18]:
# add a column called 'Type' into your dataframe that stores whether the observation was:
# Non-Violent, Violent, or Non-Crime


def typecrime(x):
    if x in NOT_C:
        return 'NOT_CRIMINAL'
    if x in NVC:
        return 'NON-VIOLENT'
    if x in VC:
        return 'VIOLENT_CRIME'


sf_crime['Type'] = sf_crime['Category'].map(typecrime)

In [19]:
sf_crime.head()

Unnamed: 0,Category,Descript,PdDistrict,Resolution,Address,X,Y,Year,Month,Day_of_Week,Hour,Time,Date,Type
0,ARSON,ARSON OF A VEHICLE,BAYVIEW,NONE,0 Block of HUNTERS PT EXPWY EX,-122.376945,37.733018,2003,3,Sunday,23,23:27:00,2003-03-23,VIOLENT_CRIME
1,LARCENY/THEFT,PETTY THEFT FROM LOCKED AUTO,NORTHERN,NONE,0 Block of MARINA BL,-122.432952,37.805052,2006,3,Tuesday,6,06:45:00,2006-03-07,VIOLENT_CRIME
2,NON-CRIMINAL,LOST PROPERTY,SOUTHERN,NONE,800 Block of BRYANT ST,-122.403405,37.775421,2004,3,Saturday,3,03:00:00,2004-03-06,NOT_CRIMINAL
3,BURGLARY,"BURGLARY OF STORE, UNLAWFUL ENTRY",TARAVAL,"ARREST, BOOKED",3200 Block of 20TH AV,-122.475647,37.728528,2011,12,Saturday,12,12:10:00,2011-12-03,VIOLENT_CRIME
4,LARCENY/THEFT,PETTY THEFT OF PROPERTY,NORTHERN,NONE,POLK ST / BROADWAY ST,-122.421772,37.795946,2003,1,Friday,0,00:15:00,2003-01-10,VIOLENT_CRIME


In [20]:
# find the baseline accuracy:
sf_crime['Type'].value_counts(normalize=True).max()

0.53856

In [21]:
# create a target array with 'Type'

y = sf_crime['Type']

In [22]:
# create a predictor matrix with 'Day_of_Week','Month','Year','PdDistrict','Hour', and 'Resolution'
X = sf_crime[['Day_of_Week', 'Month', 'Year',
              'PdDistrict', 'Hour', 'Resolution']]

In [23]:
X.head()

Unnamed: 0,Day_of_Week,Month,Year,PdDistrict,Hour,Resolution
0,Sunday,3,2003,BAYVIEW,23,NONE
1,Tuesday,3,2006,NORTHERN,6,NONE
2,Saturday,3,2004,SOUTHERN,3,NONE
3,Saturday,12,2011,TARAVAL,12,"ARREST, BOOKED"
4,Friday,1,2003,NORTHERN,0,NONE


In [24]:
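# use pd.get_dummies() to dummify your categorical variables
# every column except 'Year' is treated as categorical,
# including the integer-coded 'Month' and 'Hour'
# remember to drop a column!
dummify = [col for col in X.columns if col != 'Year']
X = pd.get_dummies(X, columns=dummify, drop_first=True)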

In [25]:
X.head()

Unnamed: 0,Year,Day_of_Week_Monday,Day_of_Week_Saturday,Day_of_Week_Sunday,Day_of_Week_Thursday,Day_of_Week_Tuesday,Day_of_Week_Wednesday,Month_2,Month_3,Month_4,...,Resolution_JUVENILE ADMONISHED,Resolution_JUVENILE BOOKED,Resolution_JUVENILE CITED,Resolution_JUVENILE DIVERTED,Resolution_LOCATED,Resolution_NONE,Resolution_NOT PROSECUTED,Resolution_PROSECUTED BY OUTSIDE AGENCY,Resolution_PSYCHOPATHIC CASE,Resolution_UNFOUNDED
0,2003,0,0,1,0,0,0,0,1,0,...,0,0,0,0,0,1,0,0,0,0
1,2006,0,0,0,0,1,0,0,1,0,...,0,0,0,0,0,1,0,0,0,0
2,2004,0,1,0,0,0,0,0,1,0,...,0,0,0,0,0,1,0,0,0,0
3,2011,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,2003,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0


In [26]:
X.shape

(25000, 65)
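
The 65 columns can be accounted for by hand. `Year` stays numeric (1 column), and with `drop_first=True` each dummified variable contributes one column fewer than its number of levels: 6 for the 7 days of the week, 11 for the 12 months, 9 for the 10 districts, 23 for the 24 hours, and 15 for the 16 resolutions. The sketch below shows the same bookkeeping on a toy frame whose rows are made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    'Year': [2003, 2006, 2004, 2011],
    'Day_of_Week': ['Sunday', 'Tuesday', 'Saturday', 'Saturday'],
    'PdDistrict': ['BAYVIEW', 'NORTHERN', 'SOUTHERN', 'TARAVAL'],
})

dummify = [col for col in toy.columns if col != 'Year']
toy_dummies = pd.get_dummies(toy, columns=dummify, drop_first=True)

# expected width: 1 numeric column + (levels - 1) per dummified column
expected = 1 + sum(toy[col].nunique() - 1 for col in dummify)
print(toy_dummies.shape[1], expected)
print(list(toy_dummies.columns))
```

The dropped level of each variable (the first in sorted order) becomes the reference category that the remaining coefficients are compared against.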

### 5. Create a train/test split and standardize the predictor matrices

In [27]:
# create a 50/50 train test split;
# stratify based on your target variable
# use a random state of 2018
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=2018)

In [28]:
# alternative approach: split on years

#mask = X.Year<=2012

#X_train = X[mask]
#X_test = X[~mask]
#y_train = y[mask]
#y_test = y[~mask]

#X_train.shape, X_test.shape

In [29]:
# standardize your predictor matrices
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
# fit the scaler on the training data only, then apply the same transform to the test data
X_train_std = scaler.fit_transform(X_train)
X_test_std = scaler.transform(X_test)
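
Fitting the scaler on the training set only, then reusing its statistics on the test set, keeps information about the test data out of the preprocessing step. A minimal NumPy sketch of the same idea follows. It mirrors what `StandardScaler` computes (per-column mean and population standard deviation), and the numbers are made up for illustration.

```python
import numpy as np

# made-up training and test matrices (rows = observations, columns = features)
train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
test = np.array([[2.0, 40.0]])

# statistics come from the training data only
mu = train.mean(axis=0)
sigma = train.std(axis=0)  # population standard deviation (ddof=0)

train_std = (train - mu) / sigma
test_std = (test - mu) / sigma  # reuse the training mean and standard deviation

print(train_std.mean(axis=0), train_std.std(axis=0))
print(test_std)
```

The transformed test set is not guaranteed to have mean 0 and standard deviation 1, which is expected: it is measured on the training set's scale.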



### 6. Create a basic Logistic Regression model and use cross_val_score to assess its performance on your training data

In [30]:
# create a Logistic Regression model (lbfgs solver, one-vs-rest) and find its mean cross-validated accuracy with your training data
# use 5 cross-validation folds
lr = LogisticRegression(solver='lbfgs', multi_class='ovr')
cross_val_score(lr, X_train_std, y_train, cv=5).mean()

0.6292796495384609

In [31]:
# create a confusion matrix
lr.fit(X_train_std, y_train)
predictions = lr.predict(X_test_std)
confusion = confusion_matrix(y_test, predictions)
pd.DataFrame(confusion, columns=sorted(y_train.unique()),
             index=sorted(y_train.unique()))

Unnamed: 0,NON-VIOLENT,NOT_CRIMINAL,VIOLENT_CRIME
NON-VIOLENT,2028,3,1201
NOT_CRIMINAL,635,208,1693
VIOLENT_CRIME,1028,10,5694


In [32]:
y_test.value_counts()

VIOLENT_CRIME    6732
NON-VIOLENT      3232
NOT_CRIMINAL     2536
Name: Type, dtype: int64
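
To put the confusion matrix in context, the test-set class counts give the accuracy of always predicting the majority class, and the diagonal of the matrix gives the model's accuracy. The sketch below recomputes both from the numbers printed above.

```python
import numpy as np

# rows = true class, columns = predicted class
# order: NON-VIOLENT, NOT_CRIMINAL, VIOLENT_CRIME
confusion = np.array([[2028, 3, 1201],
                      [635, 208, 1693],
                      [1028, 10, 5694]])

class_counts = confusion.sum(axis=1)              # true counts per class
baseline = class_counts.max() / confusion.sum()   # always predict the majority class
accuracy = np.trace(confusion) / confusion.sum()  # correct predictions / all predictions

print(round(baseline, 5), round(accuracy, 5))
```

The test accuracy sits roughly ten percentage points above the baseline, but the NOT_CRIMINAL row shows that only 208 of the 2536 non-crimes are identified, with most of the rest predicted as violent crime.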

### 7. Find the optimal hyperparameters (optimal regularization) to predict your crime categories using GridSearchCV.

> **Note:** Grid searching can be done with `GridSearchCV` or `LogisticRegressionCV`. They operate differently: `GridSearchCV` is more general and can be applied to any model, while `LogisticRegressionCV` is specific to tuning the logistic regression hyperparameters. I recommend `LogisticRegressionCV`, but its downside is that lasso and ridge must be searched separately. To start with, use `GridSearchCV`.

**Reference for logistic regression regularization hyperparameters:**
- `solver`: algorithm used for optimization (relevant for multiclass)
    - `newton-cg`: handles multinomial loss, L2 only
    - `sag`: handles multinomial loss, suited to large datasets, L2 only, works best on scaled data
    - `lbfgs`: handles multinomial loss, L2 only
    - `liblinear`: suited to small datasets, supports L1 and L2, one-vs-rest only, no warm starts
- `C`: inverse of the regularization strength (smaller values are stronger penalties)
- `penalty`: `'l1'` (lasso) or `'l2'` (ridge)
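
As a rough illustration of how the reference above translates into a search grid, the sketch below tunes `C` and `penalty` with `GridSearchCV` on synthetic data. The `saga` solver is swapped in here as an assumption, since it accepts both penalties with a multiclass target. The dataset, grid values, and `max_iter` are likewise illustrative and are not taken from the lab.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# synthetic 3-class problem standing in for the crime data
X_demo, y_demo = make_classification(n_samples=300, n_features=8, n_informative=4,
                                     n_classes=3, random_state=0)

# every combination of penalty and C is scored with 3-fold cross-validation
param_grid = {'penalty': ['l1', 'l2'],
              'C': np.logspace(-3, 0, 4)}

demo_gs = GridSearchCV(LogisticRegression(solver='saga', max_iter=5000),
                       param_grid, cv=3)
demo_gs.fit(X_demo, y_demo)

print(demo_gs.best_params_)
print(demo_gs.best_score_)
```

With 2 penalties and 4 values of `C`, the search evaluates 8 candidates, each fitted once per fold.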

In [33]:
model = LogisticRegressionCV(solver='lbfgs', multi_class='ovr', cv=5)

In [34]:
list(model.get_params().keys())

['Cs',
 'class_weight',
 'cv',
 'dual',
 'fit_intercept',
 'intercept_scaling',
 'max_iter',
 'multi_class',
 'n_jobs',
 'penalty',
 'random_state',
 'refit',
 'scoring',
 'solver',
 'tol',
 'verbose']

In [35]:
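# create a hyperparameter dictionary for the LogisticRegressionCV model
# 'Cs' is wrapped in a list so the whole array is passed as a single setting;
# LogisticRegressionCV then searches over those values internally
crime_gs_params = {'penalty': ['l1', 'l2'],
                   'solver': ['liblinear'],
                   'Cs': [np.logspace(-3, 0, 10)]}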

In [36]:
# create a gridsearch object using the LogisticRegressionCV model and the dictionary you created above
crime_gs = GridSearchCV(model, crime_gs_params, cv=5, n_jobs=2, verbose=1)

In [37]:
# fit the gridsearch object on your training data
crime_gs.fit(X_train_std, y_train)

Fitting 5 folds for each of 2 candidates, totalling 10 fits


[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done  10 out of  10 | elapsed:   49.0s finished


GridSearchCV(cv=5, error_score='raise-deprecating',
       estimator=LogisticRegressionCV(Cs=10, class_weight=None, cv=5, dual=False,
           fit_intercept=True, intercept_scaling=1.0, max_iter=100,
           multi_class='ovr', n_jobs=None, penalty='l2', random_state=None,
           refit=True, scoring=None, solver='lbfgs', tol=0.0001, verbose=0),
       fit_params=None, iid='warn', n_jobs=2,
       param_grid={'penalty': ['l1', 'l2'], 'solver': ['liblinear'], 'Cs': [array([0.001  , 0.00215, 0.00464, 0.01   , 0.02154, 0.04642, 0.1    ,
       0.21544, 0.46416, 1.     ])]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=1)

In [38]:
# print out the best parameters
crime_gs.best_params_

{'Cs': array([0.001     , 0.00215443, 0.00464159, 0.01      , 0.02154435,
        0.04641589, 0.1       , 0.21544347, 0.46415888, 1.        ]),
 'penalty': 'l1',
 'solver': 'liblinear'}

In [39]:
# print out the best mean cross-validated score
crime_gs.best_score_

0.6324

In [40]:
# assign your best estimator to the variable 'best_logreg'
best_logreg = crime_gs.best_estimator_

In [41]:
best_logreg.penalty

'l1'

In [42]:
best_logreg.C_

array([0.02154435, 0.00215443, 0.1       ])

In [43]:
best_logreg.score(X_train_std, y_train)

0.63448

In [44]:
# score your model on your testing data
best_logreg.score(X_test_std, y_test)

0.63472

### 8. Print out a classification report for your best_logreg model

In [45]:
# use your test data to create your classification report
predictions = best_logreg.predict(X_test_std)
print(classification_report(y_test, predictions))

               precision    recall  f1-score   support

  NON-VIOLENT       0.54      0.65      0.59      3232
 NOT_CRIMINAL       0.95      0.08      0.15      2536
VIOLENT_CRIME       0.67      0.84      0.74      6732

    micro avg       0.63      0.63      0.63     12500
    macro avg       0.72      0.52      0.49     12500
 weighted avg       0.69      0.63      0.58     12500



In [46]:
confusion = confusion_matrix(y_test, predictions)
pd.DataFrame(confusion, columns=sorted(y_train.unique()),
             index=sorted(y_train.unique()))

Unnamed: 0,NON-VIOLENT,NOT_CRIMINAL,VIOLENT_CRIME
NON-VIOLENT,2091,3,1138
NOT_CRIMINAL,673,204,1659
VIOLENT_CRIME,1086,7,5639
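
Every per-class figure in the classification report can be recovered from the confusion matrix: precision divides each diagonal entry by its column total, and recall divides it by its row total. The sketch below redoes that arithmetic with the matrix printed above.

```python
import numpy as np

labels = ['NON-VIOLENT', 'NOT_CRIMINAL', 'VIOLENT_CRIME']
# rows = true class, columns = predicted class
confusion = np.array([[2091, 3, 1138],
                      [673, 204, 1659],
                      [1086, 7, 5639]])

true_positives = np.diag(confusion)
precision = true_positives / confusion.sum(axis=0)  # correct / predicted as that class
recall = true_positives / confusion.sum(axis=1)     # correct / actually in that class
f1 = 2 * precision * recall / (precision + recall)

for name, p, r, f in zip(labels, precision, recall, f1):
    print(f'{name:>13}  precision={p:.2f}  recall={r:.2f}  f1={f:.2f}')
```

The high precision but very low recall for NOT_CRIMINAL means the model rarely predicts that class and is usually right when it does.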


### 9. Explore LogisticRegressionCV 

With LogisticRegressionCV, you can access the best regularization strength for predicting each class! Read the documentation and see if you can implement a model with LogisticRegressionCV.

In [47]:
# A:

In [48]:
# see above
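
One possible way to see per-class regularization strengths is sketched below. Because recent scikit-learn releases deprecate the `multi_class` argument, this sketch wraps `LogisticRegressionCV` in `OneVsRestClassifier`. That wrapper is an assumption about how to reproduce the one-vs-rest behaviour, not the lab's own code, and the synthetic data and `Cs` grid are also illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV
from sklearn.multiclass import OneVsRestClassifier

# synthetic 3-class problem standing in for the crime data
X_demo, y_demo = make_classification(n_samples=300, n_features=8, n_informative=4,
                                     n_classes=3, random_state=0)

Cs = np.logspace(-3, 0, 5)

# one binary LogisticRegressionCV per class, each choosing its own C by 3-fold CV
ovr = OneVsRestClassifier(LogisticRegressionCV(Cs=Cs, cv=3, max_iter=1000))
ovr.fit(X_demo, y_demo)

# np.ravel handles C_ being stored as an array or as a single number
best_C = {int(label): float(np.ravel(est.C_)[0])
          for label, est in zip(ovr.classes_, ovr.estimators_)}
print(best_C)
```

Each entry of `best_C` plays the same role as one element of `best_logreg.C_` above: the regularization strength selected for separating that class from the rest.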