<a href="https://colab.research.google.com/github/imene-swaan/machine-learning/blob/master/Gridsearch_multinomial_sf_crime_lab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Practice Gridsearch and Multinomial Models with SF Crime Data


---

### Multinomial logistic regression models

So far, we have been using logistic regression for binary problems where there are only two class labels. Logistic regression can be extended to dependent variables with multiple classes.

There are two ways sklearn solves multiple-class problems with logistic regression: a multinomial loss or a "one vs. rest" (OvR) process where a model is fit for each target class vs. all the other classes. 

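**Multinomial vs. OvR** (for `k` classes)
- (M) one joint model trained with a softmax loss. The textbook parameterization uses `k-1` coefficient sets measured against 1 reference category; sklearn fits `k` regularized coefficient sets instead.
- (OvR) `k` binary models, one per class against all the others. (`k*(k-1)/2` is the count for one-vs-one, which fits a model for every pair of classes.)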
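
To make the contrast concrete, here is a minimal pure-Python sketch (not sklearn's implementation) of how the two schemes turn per-class scores into probabilities. The scores are made-up numbers for three hypothetical classes. The multinomial model applies a single softmax across all scores, while OvR applies an independent sigmoid to each binary model's score and then renormalizes so the probabilities sum to 1.

```python
import math

def softmax(scores):
    """Multinomial: one joint model, probabilities come from a single softmax."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def ovr_probabilities(scores):
    """OvR: one sigmoid per binary model, then renormalize so the row sums to 1."""
    sigmoids = [1.0 / (1.0 + math.exp(-s)) for s in scores]
    total = sum(sigmoids)
    return [p / total for p in sigmoids]

# Hypothetical linear scores for k = 3 classes (illustrative values only)
scores = [2.0, 0.5, -1.0]

multinomial_probs = softmax(scores)
ovr_probs = ovr_probabilities(scores)

print([round(p, 3) for p in multinomial_probs])
print([round(p, 3) for p in ovr_probs])
```

Both schemes rank the classes the same way for these scores, but the probabilities differ because OvR never models the classes jointly.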

You will use grid search together with multiclass logistic regression to tune a model that predicts the category (type) of crime from features recorded by San Francisco police.

**Necessary lab imports**

In [None]:
import numpy as np
import pandas as pd
import patsy

from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score, cross_val_predict
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV


import seaborn as sns

import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

### 1. Read in the data

In [None]:
crime_csv = '/content/sf_crime_train.csv'

In [None]:
#read in the data using pandas
sf_crime = pd.read_csv(crime_csv)
sf_crime.drop('DayOfWeek',axis=1,inplace=True)
sf_crime.head()

Unnamed: 0,Dates,Category,Descript,PdDistrict,Resolution,Address,X,Y
0,5/13/15 23:53,WARRANTS,WARRANT ARREST,NORTHERN,"ARREST, BOOKED",OAK ST / LAGUNA ST,-122.425892,37.774599
1,5/13/15 23:53,OTHER OFFENSES,TRAFFIC VIOLATION ARREST,NORTHERN,"ARREST, BOOKED",OAK ST / LAGUNA ST,-122.425892,37.774599
2,5/13/15 23:33,OTHER OFFENSES,TRAFFIC VIOLATION ARREST,NORTHERN,"ARREST, BOOKED",VANNESS AV / GREENWICH ST,-122.424363,37.800414
3,5/13/15 23:30,LARCENY/THEFT,GRAND THEFT FROM LOCKED AUTO,NORTHERN,NONE,1500 Block of LOMBARD ST,-122.426995,37.800873
4,5/13/15 23:30,LARCENY/THEFT,GRAND THEFT FROM LOCKED AUTO,PARK,NONE,100 Block of BRODERICK ST,-122.438738,37.771541


In [None]:
# check the shape of your dataframe
print(f'the shape of the dataset is given by {sf_crime.shape[0]} rows and {sf_crime.shape[1]} columns')

the shape of the dataset is given by 18000 rows and 8 columns


In [None]:
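#check whether there are any missing values
#there are no NaNs, but unresolved incidents are stored as the string 'NONE';
#count them explicitly rather than relying on 'NONE' being the most frequent value
missing = (sf_crime['Resolution'] == 'NONE').sum()
print(f'the {sf_crime.columns[4]} column has {missing} missing values denoted as NONE')
#do we need to fix anything here?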


the Resolution column has 12862 missing values denoted as NONE


In [None]:
#check what your datatypes are
sf_crime.info()
#do we need to fix anything here?


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18000 entries, 0 to 17999
Data columns (total 8 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Dates       18000 non-null  object 
 1   Category    18000 non-null  object 
 2   Descript    18000 non-null  object 
 3   PdDistrict  18000 non-null  object 
 4   Resolution  18000 non-null  object 
 5   Address     18000 non-null  object 
 6   X           18000 non-null  float64
 7   Y           18000 non-null  float64
dtypes: float64(2), object(6)
memory usage: 1.1+ MB


### 2. Create columns for year, month, day, hour, time, and date from the 'Dates' column.

> *`pd.to_datetime` and `Series.dt` may be helpful here!*


In [None]:
# convert the 'Dates' column to a datetime object
sf_crime['Dates'] = pd.to_datetime(sf_crime['Dates'])
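
As a small illustration of the hint above, here is a sketch on a toy series. The explicit `format` string is an assumption based on the sample rows shown earlier (`5/13/15 23:53`); passing it avoids ambiguous day/month guessing. `Series.dt.day_name()` is one way to get weekday names without a manual mapping.

```python
import pandas as pd

# Toy series in the same 'M/D/YY HH:MM' layout as the sample rows (assumed format)
raw = pd.Series(['5/13/15 23:53', '5/13/15 23:33'])

# An explicit format avoids ambiguous day/month parsing
dates = pd.to_datetime(raw, format='%m/%d/%y %H:%M')

parts = pd.DataFrame({
    'Year': dates.dt.year,
    'Month': dates.dt.month,
    'Day_of_Week': dates.dt.day_name(),  # weekday as a string
    'Hour': dates.dt.hour,
})
print(parts)
```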

In [None]:
# create a new column for 'Year','Month',and 'Day_of_Week'
sf_crime['Year'] = sf_crime['Dates'].dt.year
sf_crime['Month'] = sf_crime['Dates'].dt.month
sf_crime['Day_of_Week'] = sf_crime['Dates'].dt.dayofweek
map_days= {0:'Monday',
          1:'Tuesday',
          2:'Wednesday',
          3:'Thursday',
          4:'Friday',
          5:'Saturday',
          6:'Sunday'}
sf_crime['Day_of_Week'] = sf_crime['Day_of_Week'].map(map_days)
#check the first couple rows to make sure it's what you want
sf_crime.head(2)

Unnamed: 0,Dates,Category,Descript,PdDistrict,Resolution,Address,X,Y,Year,Month,Day_of_Week
0,2015-05-13 23:53:00,WARRANTS,WARRANT ARREST,NORTHERN,"ARREST, BOOKED",OAK ST / LAGUNA ST,-122.425892,37.774599,2015,5,Wednesday
1,2015-05-13 23:53:00,OTHER OFFENSES,TRAFFIC VIOLATION ARREST,NORTHERN,"ARREST, BOOKED",OAK ST / LAGUNA ST,-122.425892,37.774599,2015,5,Wednesday


In [None]:
# create a column for the 'Hour','Time', and 'Date'
sf_crime['Hour'] = sf_crime['Dates'].dt.hour
sf_crime['Time'] = sf_crime['Dates'].dt.time
sf_crime['Date'] = sf_crime['Dates'].dt.date

In [None]:
# Drop the 'Dates' column
sf_crime.drop(['Dates'], axis = 1, inplace = True)

### 3. Validate and clean the data.

In [None]:
# check the 'Category' value counts to see what sort of categories there are
# and to see if anything might require cleaning (particularly the ones with fewer values)
sf_crime['Category'].value_counts()

LARCENY/THEFT                  4885
OTHER OFFENSES                 2291
NON-CRIMINAL                   2255
ASSAULT                        1536
VEHICLE THEFT                   967
VANDALISM                       877
BURGLARY                        732
WARRANTS                        728
SUSPICIOUS OCC                  592
MISSING PERSON                  535
DRUG/NARCOTIC                   496
ROBBERY                         465
FRAUD                           363
SECONDARY CODES                 261
WEAPON LAWS                     212
TRESPASS                        130
STOLEN PROPERTY                 111
SEX OFFENSES FORCIBLE           103
FORGERY/COUNTERFEITING           85
DRUNKENNESS                      74
KIDNAPPING                       50
PROSTITUTION                     44
DRIVING UNDER THE INFLUENCE      42
DISORDERLY CONDUCT               37
ARSON                            35
LIQUOR LAWS                      25
RUNAWAY                          16
BRIBERY                     

In [None]:
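# What's going on with 'TRESPASS' and 'TRESPASSING'?
# What's going on with 'ASSAULT' and 'ASSUALT'?
# fix these by mapping the variant labels onto the standard ones
rep = {'ASSUALT': 'ASSAULT',
       'TRESPASSING': 'TRESPASS'}
# assign the result back: replace(inplace=True) on a selected column may not update sf_crime
sf_crime['Category'] = sf_crime['Category'].replace(rep)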

In [None]:
# have a look to see whether you have all the days of the week in your data
sf_crime['Day_of_Week'].value_counts()

Wednesday    2930
Friday       2733
Saturday     2556
Thursday     2479
Sunday       2456
Monday       2447
Tuesday      2399
Name: Day_of_Week, dtype: int64

In [None]:
# have a look at the value counts for 'Descript', 'PdDistrict', and 'Resolution' to make sure it all checks out
print(sf_crime['Descript'].value_counts())
print(sf_crime['PdDistrict'].value_counts())
print(sf_crime['Resolution'].value_counts())

GRAND THEFT FROM LOCKED AUTO                        2127
STOLEN AUTOMOBILE                                    625
AIDED CASE, MENTAL DISTURBED                         591
DRIVERS LICENSE, SUSPENDED OR REVOKED                589
BATTERY                                              520
                                                    ... 
MINOR PURCHASING OR RECEIVING TOBACCO PRODUCT          1
FIREARM, POSSESSION OF WHILE WEARING MASK              1
ELECTRICAL  OR GAS LINES, INTERFERING WITH             1
PHONE CALLS, OBSCENE                                   1
ROBBERY, VEHICLE FOR HIRE, ATT., W/ OTHER WEAPON       1
Name: Descript, Length: 510, dtype: int64
SOUTHERN      3287
NORTHERN      2250
CENTRAL       2206
MISSION       2118
BAYVIEW       1678
INGLESIDE     1628
TARAVAL       1426
TENDERLOIN    1327
RICHMOND      1101
PARK           979
Name: PdDistrict, dtype: int64
NONE                                      12862
ARREST, BOOKED                             4455
UNFOUNDED     

In [None]:
# use .describe() to see whether the location coordinates seem appropriate
sf_crime[['X', 'Y']].describe()

Unnamed: 0,X,Y
count,18000.0,18000.0
mean,-122.423639,37.768466
std,0.026532,0.024391
min,-122.513642,37.708154
25%,-122.434199,37.753838
50%,-122.416949,37.775608
75%,-122.406539,37.78539
max,-122.365565,37.819923


### 4. Set up a target and predictor matrix for predicting violent crime vs. non-violent crime vs. non-crimes.

**Non-Violent Crimes:**
- bad checks
- bribery
- drug/narcotic
- drunkenness
- embezzlement
- forgery/counterfeiting
- fraud
- gambling
- liquor
- loitering 
- trespass

**Non-Crimes:**
- non-criminal
- runaway
- secondary codes
- suspicious occ
- warrants

**Violent Crimes:**
- everything else



**What type of model do you need here? What is your baseline accuracy?**

In [None]:
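# 'OTHER OFFENSES' is grouped with the non-violent crimes here, in addition to the list above.
# Caution: the data uses the label 'LIQUOR LAWS', so 'LIQUOR' matches no rows and
# those incidents end up in VC below.
NVC = ['BAD CHECKS','BRIBERY','DRUG/NARCOTIC','DRUNKENNESS',
     'EMBEZZLEMENT','FORGERY/COUNTERFEITING','FRAUD',
     'GAMBLING','LIQUOR','LOITERING','TRESPASS','OTHER OFFENSES']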

NOT_C = ['NON-CRIMINAL','RUNAWAY','SECONDARY CODES','SUSPICIOUS OCC','WARRANTS']

#use a list comprehension to get all the categories in sf_crime['Category'].unique() that are NOT in the lists above

VC = [i for i in sf_crime['Category'].value_counts().index if (i not in NVC) and (i not in NOT_C)]

In [None]:
#add a column called 'Type' into your dataframe that stores whether the observation was:
#Non-Violent, Violent, or Non-Crime
#use .map()!
def typecrime(x):
    if x in NOT_C: return 'NOT_CRIMINAL'
    if x in NVC: return 'NON-VIOLENT'
    if x in VC: return 'VIOLENT_CRIME'

sf_crime['Type']= sf_crime['Category'].map(typecrime)

In [None]:
#find the baseline accuracy:
sf_crime['Type'].value_counts().max() / len(sf_crime['Type'])

0.5931111111111111
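
The baseline is the accuracy of a model that always predicts the most frequent class. A minimal sketch with made-up labels standing in for `sf_crime['Type']` (the counts are hypothetical):

```python
from collections import Counter

# Hypothetical labels standing in for sf_crime['Type']
labels = ['VIOLENT_CRIME'] * 6 + ['NOT_CRIMINAL'] * 2 + ['NON-VIOLENT'] * 2

counts = Counter(labels)
majority_class, majority_count = counts.most_common(1)[0]

# Accuracy of a model that always predicts the most frequent class
baseline = majority_count / len(labels)
print(majority_class, baseline)
```

Any fitted model should beat this number before its accuracy is worth interpreting.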

In [None]:
#create a target array with 'Type'
#create a predictor matrix with 'Day_of_Week','Month','Year','PdDistrict','Hour', and 'Resolution'
y = sf_crime['Type']
X = sf_crime[['Day_of_Week', 'Month', 'Year', 'PdDistrict', 'Hour', 'Resolution']]

In [None]:
#use pd.get_dummies() to dummify your categorical variables
#remember to drop a column!
X = pd.get_dummies(X, drop_first = True)
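
Why drop a column? With one dummy per level, the dummies for a variable always sum to 1, which duplicates the intercept. Dropping the first level makes it the reference category. A small sketch on a toy frame (column names mirror the lab, values are made up):

```python
import pandas as pd

# Toy predictors; column names mirror the lab, values are made up
toy = pd.DataFrame({
    'Hour': [23, 8, 14],
    'PdDistrict': ['PARK', 'NORTHERN', 'CENTRAL'],
})

full = pd.get_dummies(toy)                      # one column per district
reduced = pd.get_dummies(toy, drop_first=True)  # first level becomes the reference

print(list(full.columns))
print(list(reduced.columns))
```

Numeric columns such as `Hour` pass through unchanged; only the string column is expanded.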

### 5. Create a train/test split and standardize the predictor matrices

In [None]:
#create a 50/50 train test split; 
#stratify based on your target variable
#use a random state of 2018
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 2018, test_size = 0.5, stratify = y)
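
Stratifying keeps the class proportions the same in both halves, which matters here because the target is imbalanced. The following is a simplified pure-Python sketch of the idea, not sklearn's algorithm; `stratified_split` is a hypothetical helper and the labels are made up.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size=0.5, seed=2018):
    """Return (train_idx, test_idx), splitting each class separately."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Hypothetical imbalanced target: 60% / 20% / 20%
labels = ['VIOLENT_CRIME'] * 60 + ['NOT_CRIMINAL'] * 20 + ['NON-VIOLENT'] * 20
train_idx, test_idx = stratified_split(labels)

train_counts = Counter(labels[i] for i in train_idx)
test_counts = Counter(labels[i] for i in test_idx)
print(train_counts)
print(test_counts)
```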


In [None]:
#standardise your predictor matrices
from sklearn.preprocessing import StandardScaler
ss=StandardScaler()
ss.fit(X_train)
X_train_ss = ss.transform(X_train)
X_test_ss = ss.transform(X_test)
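
The scaler is fit on the training data only, and the same mean and standard deviation are then applied to the test data. A minimal pure-Python sketch of that pattern for a single column (the values are hypothetical; `fit_scaler` and `transform` are illustrative helpers, not sklearn functions):

```python
from statistics import mean, pstdev

def fit_scaler(column):
    """Learn the mean and population std (ddof=0, as StandardScaler uses) from training values."""
    return mean(column), pstdev(column)

def transform(column, mu, sigma):
    """Apply previously learned statistics; nothing is re-fit here."""
    return [(x - mu) / sigma for x in column]

# Hypothetical 'Hour' values
hour_train = [0, 6, 12, 18]
hour_test = [3, 23]

mu, sigma = fit_scaler(hour_train)
train_scaled = transform(hour_train, mu, sigma)
test_scaled = transform(hour_test, mu, sigma)  # uses the training mean and std

print(round(mean(train_scaled), 6), round(pstdev(train_scaled), 6))
print([round(v, 3) for v in test_scaled])
```

The scaled training column has mean 0 and standard deviation 1; the test column generally does not, and that is expected.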

### 6. Create a basic Logistic Regression model and use cross_val_score to assess its performance on your training data

In [None]:
#create a default Logistic Regression model and find its mean cross-validated accuracy with your training data
#use 5 cross-validation folds
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
cross_val_score(model, X_train_ss, y_train, cv=5).mean()

0.6271111111111112
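
The number above is the mean of five validation scores, one per fold. The sketch below shows how a plain k-fold split assigns every sample to exactly one validation fold. It is a simplified illustration (`kfold_indices` is a hypothetical helper); for classifiers, sklearn uses stratified folds when `cv` is an integer.

```python
def kfold_indices(n_samples, n_splits=5):
    """Yield (train_idx, val_idx) pairs; each sample is validated exactly once."""
    fold_sizes = [n_samples // n_splits + (1 if i < n_samples % n_splits else 0)
                  for i in range(n_splits)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size

folds = list(kfold_indices(n_samples=12, n_splits=5))
for train_idx, val_idx in folds:
    print(len(train_idx), val_idx)
```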

In [None]:
#create a confusion matrix with cross_val_predict
#the out-of-fold predictions are for the training rows, so compare them with y_train
predictions = cross_val_predict(model, X_train_ss, y_train, cv=5)
confusion = confusion_matrix(y_train, predictions)
pd.DataFrame(confusion,
             columns=sorted(y_train.unique()),
             index=sorted(y_train.unique()))

Unnamed: 0,NON-VIOLENT,NOT_CRIMINAL,VIOLENT_CRIME
NON-VIOLENT,340,45,1351
NOT_CRIMINAL,364,40,1522
VIOLENT_CRIME,1028,127,4183


### 7. Find the optimal hyperparameters (optimal regularization) to predict your crime categories using GridSearchCV.

> **Note:** Grid searching can be done with `GridSearchCV` or `LogisticRegressionCV`. They operate differently: `GridSearchCV` is general and can be applied to any estimator, while `LogisticRegressionCV` is specific to tuning the regularization of a logistic regression. I recommend `LogisticRegressionCV`, but the downside is that lasso and ridge must be searched separately. To start with, use `GridSearchCV`.

**Reference for logistic regression regularization hyperparameters:**
- `solver`: algorithm used for optimization (relevant for multiclass)
    - `newton-cg` - handles multinomial loss, L2 only
    - `sag` - handles multinomial loss, suited to large datasets, L2 only, works best on scaled data
    - `lbfgs` - handles multinomial loss, L2 only
    - `liblinear` - suited to small datasets, supports L1 and L2, no warm starts; it does not support the multinomial loss, so multiclass problems are fit one-vs-rest
- `C`: inverse of regularization strength (smaller values are stronger penalties)
- `penalty`: `'l1'` - Lasso, `'l2'` - Ridge
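
Because `C` acts multiplicatively, candidate values are usually spaced evenly on a log scale rather than a linear one. The sketch below rebuilds the grid used in this lab, `np.logspace(-3, 0, 50)`, in plain Python so the spacing is easy to inspect. The index 27 is picked only because it sits near the middle of the range.

```python
# Pure-Python equivalent of np.logspace(-3, 0, 50):
# 50 exponents evenly spaced between -3 and 0, each used as a power of 10
n_values = 50
c_grid = [10 ** (-3 + 3 * i / (n_values - 1)) for i in range(n_values)]

print(c_grid[0], c_grid[-1])   # strongest and weakest penalty in the grid
print(round(c_grid[27], 8))    # a value from the middle of the grid

# Neighbouring values differ by a constant ratio, not a constant step
ratio = c_grid[1] / c_grid[0]
print(round(ratio, 4))
```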

In [None]:
#create a hyperparameter dictionary for a logistic regression
crime_gs_params={'penalty':['l1','l2'],
                 'solver':['liblinear'],
                 'C':np.logspace(-3,0,50)}

In [None]:
#create a gridsearch object using LogisticRegression() and the dictionary you created above
crime_gs=GridSearchCV(LogisticRegression(),
                      crime_gs_params,
                      n_jobs=-1,cv=5)
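
It helps to know how much work a grid search will do before fitting it. This sketch enumerates the same grid shape as `crime_gs_params` with `itertools.product`, which mirrors how the candidates are formed: every combination of the listed values.

```python
from itertools import product

# Same shape as crime_gs_params: 2 penalties x 1 solver x 50 values of C
param_grid = {
    'penalty': ['l1', 'l2'],
    'solver': ['liblinear'],
    'C': [10 ** (-3 + 3 * i / 49) for i in range(50)],
}
cv_folds = 5

names = list(param_grid)
candidates = [dict(zip(names, values)) for values in product(*param_grid.values())]

# Each candidate is fit once per fold
n_fits = len(candidates) * cv_folds
print(len(candidates), n_fits)
```

That is 100 candidates and 500 cross-validation fits, plus one final refit of the best candidate on the full training set when `refit=True` (the default).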

In [None]:
#fit the gridsearch object on your training data
crime_gs.fit(X_train_ss,y_train)

GridSearchCV(cv=5, error_score=nan,
             estimator=LogisticRegression(C=1.0, class_weight=None, dual=False,
                                          fit_intercept=True,
                                          intercept_scaling=1, l1_ratio=None,
                                          max_iter=100, multi_class='auto',
                                          n_jobs=None, penalty='l2',
                                          random_state=None, solver='lbfgs',
                                          tol=0.0001, verbose=0,
                                          warm_start=False),
             iid='deprecated', n_jobs=-1,
             param_grid={'C': array([0.001     , 0.0011514 , 0.00...
       0.03393222, 0.0390694 , 0.04498433, 0.05179475, 0.05963623,
       0.06866488, 0.07906043, 0.09102982, 0.10481131, 0.12067926,
       0.13894955, 0.15998587, 0.184207  , 0.21209509, 0.24420531,
       0.28117687, 0.32374575, 0.37275937, 0.42919343, 0.49417134,
       0.5689866 

In [None]:
#print out the best parameters
crime_gs.best_params_

{'C': 0.044984326689694466, 'penalty': 'l2', 'solver': 'liblinear'}

In [None]:
#print out the best mean cross-validated score
crime_gs.best_score_

0.6288888888888888

In [None]:
#assign your best estimator to the variable 'best_logreg'
best_logreg=crime_gs.best_estimator_

In [None]:
#score your model on your testing data
best_logreg.score(X_test_ss,y_test)

0.6347777777777778

### 8. Print out a classification report for your best_logreg model

In [None]:
#use your test data to create your classification report
predictions = best_logreg.predict(X_test_ss)
print(classification_report(y_test, predictions))

               precision    recall  f1-score   support

  NON-VIOLENT       0.48      0.47      0.47      1736
 NOT_CRIMINAL       0.66      0.07      0.13      1926
VIOLENT_CRIME       0.67      0.89      0.77      5338

     accuracy                           0.63      9000
    macro avg       0.60      0.48      0.46      9000
 weighted avg       0.63      0.63      0.57      9000



### 9. Explore LogisticRegressionCV.  

With LogisticRegressionCV, you can access the best regularization strength for predicting each class! Read the documentation and see if you can implement a model with LogisticRegressionCV.

In [None]:
# create a LogisticRegressionCV object
lr_cv = LogisticRegressionCV(Cs = np.logspace(-3,0,50), fit_intercept = True, cv = 5, penalty = 'l1',
                             scoring = 'f1', solver = 'liblinear', n_jobs = -1,
                             random_state = 1)

In [None]:
# training the model
lr_cv.fit(X_train, y_train)

LogisticRegressionCV(Cs=array([0.001     , 0.0011514 , 0.00132571, 0.00152642, 0.00175751,
       0.00202359, 0.00232995, 0.0026827 , 0.00308884, 0.00355648,
       0.00409492, 0.00471487, 0.00542868, 0.00625055, 0.00719686,
       0.00828643, 0.00954095, 0.01098541, 0.01264855, 0.01456348,
       0.01676833, 0.01930698, 0.02222996, 0.02559548, 0.02947052,
       0.03393222, 0.0390694 , 0.04498433, 0.05179475, 0.059636...
       0.13894955, 0.15998587, 0.184207  , 0.21209509, 0.24420531,
       0.28117687, 0.32374575, 0.37275937, 0.42919343, 0.49417134,
       0.5689866 , 0.65512856, 0.75431201, 0.86851137, 1.        ]),
                     class_weight=None, cv=5, dual=False, fit_intercept=True,
                     intercept_scaling=1.0, l1_ratios=None, max_iter=100,
                     multi_class='auto', n_jobs=-1, penalty='l1',
                     random_state=1, refit=True, scoring='f1',
                     solver='liblinear', tol=0.0001, verbose=0)

In [None]:
# checking the model parameters and intercept
print(lr_cv.classes_, '\n')
print(lr_cv.coef_, '\n')
print(lr_cv.intercept_, '\n')

['NON-VIOLENT' 'NOT_CRIMINAL' 'VIOLENT_CRIME'] 

[[ 1.29742314e-02  1.05259344e-04 -2.03918510e-02  2.78748434e-02
  -1.09466350e-01 -2.50094664e-01 -1.32944490e-01  0.00000000e+00
   7.56330128e-02 -3.59681251e-01 -1.02208862e-01 -1.70337109e-01
  -3.69280335e-01 -6.39512452e-02 -2.96036330e-01 -4.39225304e-01
   2.75227448e-02  0.00000000e+00  2.70589218e+00 -9.33120767e-01
  -3.21953761e+00 -2.21478150e-01 -2.20310707e+00 -1.89758090e+00
   0.00000000e+00 -1.95496358e+00 -1.19155401e+00]
 [ 6.97819774e-03 -5.29097351e-04 -2.22773260e-02 -3.45242157e-02
  -2.02607437e-01 -8.61451362e-03  0.00000000e+00  0.00000000e+00
   5.43242496e-02 -5.49566509e-02  1.42314074e-02  1.92934614e-01
  -2.89024849e-01  4.32480082e-02  0.00000000e+00  0.00000000e+00
   1.37543406e-01  4.10766092e-01 -1.17612701e+00  0.00000000e+00
   1.08361459e+00 -2.64580923e-01 -3.13077748e-01 -3.30117432e-02
   0.00000000e+00  3.49975144e+00  1.77857906e+00]
 [-1.47145941e-02 -5.61711003e-04  3.03111648e-02  4.1184

In [None]:
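# checking the micro-averaged f1 score of each fold, cross-validated on the test data
# (for single-label multiclass problems, micro-averaged f1 equals accuracy)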
cross_val_score(lr_cv, X_test, y_test, scoring = 'f1_micro')

array([0.63111111, 0.63111111, 0.63722222, 0.63333333, 0.63944444])

In [None]:
y_pred = lr_cv.predict(X_test)

In [None]:
#overall model evaluation
print(accuracy_score(y_test, y_pred), '\n')
print(confusion_matrix(y_test, y_pred),'\n')
print(classification_report(y_test, y_pred))

0.6343333333333333 

[[ 810   34  892]
 [ 347  139 1440]
 [ 539   39 4760]] 

               precision    recall  f1-score   support

  NON-VIOLENT       0.48      0.47      0.47      1736
 NOT_CRIMINAL       0.66      0.07      0.13      1926
VIOLENT_CRIME       0.67      0.89      0.77      5338

     accuracy                           0.63      9000
    macro avg       0.60      0.48      0.46      9000
 weighted avg       0.63      0.63      0.57      9000
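
Every number in the classification report can be derived from the confusion matrix printed above it. The sketch below recomputes accuracy and the per-class precision, recall and F1 by hand, using that matrix (rows are true classes, columns are predicted classes).

```python
# Confusion matrix printed above: rows = true class, columns = predicted class
labels = ['NON-VIOLENT', 'NOT_CRIMINAL', 'VIOLENT_CRIME']
cm = [
    [810, 34, 892],
    [347, 139, 1440],
    [539, 39, 4760],
]

total = sum(sum(row) for row in cm)
accuracy = sum(cm[i][i] for i in range(len(labels))) / total

report = {}
for i, label in enumerate(labels):
    true_pos = cm[i][i]
    predicted = sum(row[i] for row in cm)  # column sum: everything predicted as this class
    actual = sum(cm[i])                    # row sum: the class support
    precision = true_pos / predicted
    recall = true_pos / actual
    f1 = 2 * precision * recall / (precision + recall)
    report[label] = (round(precision, 2), round(recall, 2), round(f1, 2), actual)

print(round(accuracy, 4))
for label, row in report.items():
    print(label, row)
```

The low recall for `NOT_CRIMINAL` comes straight from its row: only 139 of 1926 true cases land on the diagonal, while 1440 are predicted as `VIOLENT_CRIME`.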

