# San Francisco Crime Logistic Regression

This notebook is the third most important in the project, since this model is weaker than our Random Forests. We still used it to try to get a good model and, through feature analysis, to learn which features might lead to a stronger Random Forest model. The steps were as follows. First, we optimized C using the L2 penalty on our initial 58-feature set. Using this C value, we then fit a Logistic Regression with the L1 penalty to analyze which features are strong and to get insight into which features represent the data as a whole. (This actually informed our decisions on which features to test in Random Forests.) Finally, we trained several Logistic Regression models with the L2 penalty on different combinations of features to see whether we would find any interesting results.

### Data Setup

In [1]:

%matplotlib inline
import numpy as np
import pandas as pd
import zipfile
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
import matplotlib.pyplot as plt
import joblib  # sklearn.externals.joblib was removed in newer scikit-learn releases
from collections import Counter

The cell below is optional and supports our zipfile workflow. If you already have the csvs from the Data Set Up notebook, you can skip it.

In [None]:
# Unzip data files into the "csv" subdirectory 
# (unless you have already done this since running the Data Set Up notebook)

# **IMPORTANT**  This will overwrite existing files in the "csv" folder in your local repo
# with the most recent data files from the data.zip file

# Unzip 80% training data
unzip_training_data = zipfile.ZipFile("data_subset.zip", "r")
unzip_training_data.extractall()
unzip_training_data.close()

# Unzip development and training data
unzip_test_data = zipfile.ZipFile("testing.zip", "r")
unzip_test_data.extractall()
unzip_test_data.close()

# Unzip full set of training data for creating predictions to submit to Kaggle
unzip_all_data = zipfile.ZipFile("data.zip", "r")
unzip_all_data.extractall()
unzip_all_data.close()

In [None]:
# Load these csv files into numpy arrays for testing on development data
train_data = np.loadtxt('csv/train_data.csv', delimiter=",")
train_labels = np.loadtxt('csv/train_labels.csv', dtype=str, delimiter=",")
dev_data = np.loadtxt('csv/dev_data.csv', delimiter=",")
dev_labels = np.loadtxt('csv/dev_labels.csv', dtype=str, delimiter=",")

In [2]:
# Load these csv files into numpy arrays for creating predictions to submit to Kaggle
train_data_all = np.loadtxt('csv/train_data_all.csv', delimiter=",")
train_labels_all = np.loadtxt('csv/train_labels_all.csv', dtype=str, delimiter=",")
test_data_all = np.loadtxt('csv/test_data_all.csv', delimiter=",")

In [None]:
# print shapes to compare before and after csv conversion
print("train_data shape is", train_data.shape)
print("train_labels shape is", train_labels.shape)
print("dev_data shape is", dev_data.shape)
print("dev_labels shape is", dev_labels.shape)

In [None]:
print("train_data_all shape is", train_data_all.shape)
print("train_labels_all shape is", train_labels_all.shape)
print("test_data_all shape is", test_data_all.shape)

## Feature Analysis

In [3]:
#70 feature set. Our 58 feature set lacked seasons, first_day, month_year, d_police and rotational data.
get_feature_names = ['X', 'Y', 'hour', 'holidays', 'first_day', 'month_year', 'spring',
       'summer', 'fall', 'winter', 'PRCP', 'TMAX', 'TMIN', 'd_police',
       'rot_45_X', 'rot_45_Y', 'rot_30_X', 'rot_30_Y', 'rot_60_X', 'rot_60_Y',
       'radial_r', 'DayOfWeek_Friday', 'DayOfWeek_Monday',
       'DayOfWeek_Saturday', 'DayOfWeek_Sunday', 'DayOfWeek_Thursday',
       'DayOfWeek_Tuesday', 'DayOfWeek_Wednesday', 'PdDistrict_BAYVIEW',
       'PdDistrict_CENTRAL', 'PdDistrict_INGLESIDE', 'PdDistrict_MISSION',
       'PdDistrict_NORTHERN', 'PdDistrict_PARK', 'PdDistrict_RICHMOND',
       'PdDistrict_SOUTHERN', 'PdDistrict_TARAVAL', 'PdDistrict_TENDERLOIN',
       'month_1', 'month_2', 'month_3', 'month_4', 'month_5', 'month_6',
       'month_7', 'month_8', 'month_9', 'month_10', 'month_11', 'month_12',
       'year_2003', 'year_2004', 'year_2005', 'year_2006', 'year_2007',
       'year_2008', 'year_2009', 'year_2010', 'year_2011', 'year_2012',
       'year_2013', 'year_2014', 'year_2015', 'dayparts_early_afternoon',
       'dayparts_early_evening', 'dayparts_early_morning',
       'dayparts_late_afternoon', 'dayparts_late_evening',
       'dayparts_late_morning', 'dayparts_late_night']

#Classes
headers = ["ARSON","ASSAULT","BAD CHECKS","BRIBERY","BURGLARY","DISORDERLY CONDUCT","DRIVING UNDER THE INFLUENCE",
           "DRUG/NARCOTIC","DRUNKENNESS","EMBEZZLEMENT","EXTORTION","FAMILY OFFENSES","FORGERY/COUNTERFEITING",
           "FRAUD","GAMBLING","KIDNAPPING","LARCENY/THEFT","LIQUOR LAWS","LOITERING","MISSING PERSON","NON-CRIMINAL",
           "OTHER OFFENSES","PORNOGRAPHY/OBSCENE MAT","PROSTITUTION","RECOVERED VEHICLE","ROBBERY","RUNAWAY",
           "SECONDARY CODES","SEX OFFENSES FORCIBLE","SEX OFFENSES NON FORCIBLE","STOLEN PROPERTY","SUICIDE",
           "SUSPICIOUS OCC","TREA","TRESPASS","VANDALISM","VEHICLE THEFT","WARRANTS","WEAPON LAWS"]

## Training L1 Models to Analyze Feature Importance.

In [None]:
#Feature extraction with L1 Logistic regression model.
C_value = 2.0
#LR = LogisticRegression(penalty = 'l1', C = C_value, n_jobs = -1)
#LR.fit(train_data_all, train_labels_all)
LR = joblib.load('LR.pkl')

In [None]:
#Finds the top 5 features for each class (as a tuple)
topfeatures = [0]*len(headers)
for i in range(len(headers)):
    topfeatures[i] = sorted(enumerate(LR.coef_[i]), key=lambda tup: tup[1], reverse = True)[0:5]

#Extracts the indices of the top features
feature_index = [] 
for lst in topfeatures:
    for j in lst:
        feature_index.append(j[0])
feature_index.sort()

#Sets up the data for the table
feature_names = []
table_text = []
for i in feature_index:
    feature_names.append(get_feature_names[i])
    table_text.append([LR.coef_[j][i] for j in range(len(headers))])
print(Counter(feature_names)) #Now we count how many times features showed up in the top 5.

In [None]:
#Sort features by the mean of the absolute value of coefficients
coefficients = pd.DataFrame(LR.coef_)
d = sorted(dict(coefficients.abs().mean().rank()).items(), key=lambda x:x[1])
print("Rank")
for i in range(len(d)):
    print(71-d[i][1], get_feature_names[d[i][0]]) #They casted as tuples so I had to strongarm them with a full reassignment.

Based on this list, we'd like to test whether dropping seasons, months, or weekdays will improve our modeling. We confirmed this with our random forest optimization so that we knew we were getting something coherent.
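The mean-absolute-coefficient ranking used above can be illustrated on a toy example. This is a minimal sketch using only the standard library; the coefficient values and the names `toy_coef`, `toy_feature_names`, and `rank_by_mean_abs` are invented for illustration and are not taken from our fitted model.

```python
# Toy coefficient matrix: 3 classes (rows) x 4 features (columns).
# The values and feature names are invented for illustration only.
toy_feature_names = ["feat_a", "feat_b", "feat_c", "feat_d"]
toy_coef = [
    [0.9, 0.0, -0.2, 0.1],
    [-1.1, 0.0, 0.3, 0.0],
    [0.4, 0.0, -0.1, 0.2],
]

def rank_by_mean_abs(coef, names):
    """Return (name, mean |coefficient|) pairs, strongest feature first."""
    n_classes = len(coef)
    scores = []
    for j, name in enumerate(names):
        mean_abs = sum(abs(row[j]) for row in coef) / n_classes
        scores.append((name, mean_abs))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

toy_ranking = rank_by_mean_abs(toy_coef, toy_feature_names)
for rank, (name, score) in enumerate(toy_ranking, start=1):
    print(rank, name, round(score, 3))
```

A feature whose coefficient is zero for every class (here `feat_b`) lands at the bottom of the ranking. With an L1 penalty, that is the signal that the model did not need the feature.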

In [4]:
#In this cell we prepare the data for analysis with both dropping the seasons from our complete feature set and additionally
#dropping months.


train_data_all = pd.DataFrame(train_data_all, columns = get_feature_names)
train_labels_all = pd.DataFrame(train_labels_all)
test_data_all = pd.DataFrame(test_data_all, columns = get_feature_names)

train_data_no_seasons = train_data_all.drop(['winter', 'spring', 'summer', 'fall'], axis = 1) 
test_data_no_seasons = test_data_all.drop(['winter', 'spring', 'summer', 'fall'], axis = 1)

train_data_no_months = train_data_no_seasons.drop(['month_1', 'month_2','month_3','month_4','month_5'
                                                 ,'month_6','month_7','month_8','month_9','month_10','month_11','month_12']
                                                , axis = 1)

test_data_no_months = test_data_no_seasons.drop(['month_1', 'month_2','month_3','month_4','month_5'
                                                 ,'month_6','month_7','month_8','month_9','month_10','month_11','month_12']
                                                , axis = 1)

#We ran out of time to test dropping weekdays before the submission deadline.
train_data_no_weekdays = train_data_no_months.drop(['DayOfWeek_Monday', 'DayOfWeek_Friday','DayOfWeek_Wednesday','DayOfWeek_Tuesday'
                                                 ,'DayOfWeek_Thursday','DayOfWeek_Saturday','DayOfWeek_Sunday']
                                                , axis = 1)

test_data_no_weekdays = test_data_no_months.drop(['DayOfWeek_Monday', 'DayOfWeek_Friday','DayOfWeek_Wednesday','DayOfWeek_Tuesday'
                                                 ,'DayOfWeek_Thursday','DayOfWeek_Saturday','DayOfWeek_Sunday']
                                                , axis = 1)
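The column lists above are typed out by hand. As a hedged alternative, the names could be selected by pattern. The sketch below uses a shortened, hypothetical subset of the feature names and a hypothetical helper `columns_matching`. Note that a plain prefix check on `month_` would also catch `month_year`, so a full regular-expression match is used instead.

```python
import re

# A shortened, hypothetical subset of the feature names listed earlier.
sample_feature_names = ["X", "Y", "month_year", "winter", "month_1", "month_12",
                        "DayOfWeek_Friday", "DayOfWeek_Monday", "year_2003"]

def columns_matching(names, pattern):
    """Return the names that fully match the regular expression `pattern`."""
    compiled = re.compile(pattern)
    return [name for name in names if compiled.fullmatch(name)]

# `month_\d+` matches month_1 ... month_12 but not month_year,
# which a plain startswith("month_") check would also pick up.
month_columns = columns_matching(sample_feature_names, r"month_\d+")
weekday_columns = columns_matching(sample_feature_names, r"DayOfWeek_.+")

print(month_columns)
print(weekday_columns)
```

The resulting lists could then be passed to `DataFrame.drop(..., axis = 1)` in the same way as the hand-written lists.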

#### Optimization for C

In [None]:
# Set up functions for training logistic regression model and finding 'optimal' value of C.

def TrainLR(data, labels, test_data, C_value=1.0):
    """This function takes in training data and labels, testing data,
    and can accept different values of C (the inverse of regularization strength).
    It trains a logistic regression model and returns the model and predicted probabilities.
    """
    LR = LogisticRegression(C=C_value, n_jobs = -1)
    LR.fit(data, labels)
    pp = LR.predict_proba(test_data)
    return LR, pp

def find_C(data, labels, dev_data, dev_labels, C_values):
    """Find optimal value of C in a logistic regression model.  
    
    Note that this cannot be used on test data from Kaggle 
    because we do not have labels for that data.  This function is intended to only be used
    in the development stage with the development data.
    """
    for C in C_values:      
        LR, pp = TrainLR(data, labels, dev_data, C)
        predictions = LR.predict(dev_data)
        f1 = metrics.f1_score(dev_labels, predictions, average = "weighted")
        logloss = metrics.log_loss(dev_labels, pp)
        
        # Print F1 score and log loss for each value of C
        print("For C =", C, "the F1 score is", round(f1, 6), "and the Log Loss score is", round(logloss, 6))
    print("\n")

In [None]:
# Find the optimal value of C using the 80% training data and the development data
C_values = [0.0001, 0.001, 0.01, 0.1, 0.5, 1.0, 2.0, 10.0, 100.00, 1000.0]
find_C(train_data, train_labels, dev_data, dev_labels, C_values)
#We found the optimal C to be = 2.0 on our 58 feature set, so we used that.
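For reference, C is the inverse of the regularization strength. As a sketch, the binary L2-penalized objective that scikit-learn documents for logistic regression can be written as follows. We assume here that each class was fit this way under a one-vs-rest scheme; the symbols $w$, $b$, $x_i$, and $y_i$ are notation introduced for this note.

```latex
\min_{w,\,b} \; \frac{1}{2} w^{\top} w \;+\; C \sum_{i=1}^{n} \log\!\left(1 + \exp\!\left(-y_i \left(x_i^{\top} w + b\right)\right)\right),
\qquad y_i \in \{-1, +1\}
```

A small C lets the penalty term $\frac{1}{2} w^{\top} w$ dominate (strong regularization), while a large C puts more weight on fitting the training data.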

In [None]:
# Train model with a single value of C with 80% training data and development data
C_value = 2.0
LR, pp = TrainLR(train_data, train_labels, dev_data, C_value)
logloss = metrics.log_loss(dev_labels, pp)
print(logloss)
#LogLoss of 2.56
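To put the log loss figure in context, here is a minimal sketch of how multiclass log loss is computed. It is written with the standard library only and is not a reproduction of every detail of `metrics.log_loss`. The function name `multiclass_log_loss` and the toy labels and probabilities are hypothetical. For comparison, a uniform guess over our 39 classes would score ln(39), roughly 3.66.

```python
import math

def multiclass_log_loss(true_labels, predicted_probs, class_names, eps=1e-15):
    """Average negative log of the probability assigned to the true class.

    predicted_probs[i][k] is the probability of class_names[k] for sample i.
    Probabilities are clipped to [eps, 1 - eps] to avoid log(0).
    """
    index_of = {name: k for k, name in enumerate(class_names)}
    total = 0.0
    for label, probs in zip(true_labels, predicted_probs):
        p = probs[index_of[label]]
        p = min(max(p, eps), 1.0 - eps)
        total -= math.log(p)
    return total / len(true_labels)

# Hypothetical three-class example (the real problem has 39 classes).
toy_classes = ["ASSAULT", "LARCENY/THEFT", "VANDALISM"]
toy_labels = ["ASSAULT", "VANDALISM"]
toy_probs = [[0.5, 0.3, 0.2],
             [0.1, 0.1, 0.8]]
toy_loss = multiclass_log_loss(toy_labels, toy_probs, toy_classes)
uniform_loss = math.log(39)  # loss of a uniform guess over 39 classes
print(round(toy_loss, 4), round(uniform_loss, 4))
```

Because the loss is the negative log of the probability given to the true class, a confident wrong prediction is penalized very heavily. Scores far above ln(39) therefore usually point to a problem in the pipeline and not to a merely weak model.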


#### Preparing final Kaggle models and data

These are run separately and then the final cells at the end export them to a Kaggle friendly format.

In [None]:
C_value = 2.0
#LR = LogisticRegression(penalty = 'l2', C = C_value, n_jobs = -1)
#LR.fit(train_data_no_seasons, train_labels_all)
LR = joblib.load('LRno_seasons.pkl') 
pp = LR.predict_proba(test_data_no_seasons)
#pp is used to export a file to kaggle for testing.

No-seasons Kaggle score: 2.55255

In [None]:
C_value = 2.0
#LR = LogisticRegression(penalty = 'l2', C = C_value, n_jobs = -1)
#LR.fit(train_data_all, train_labels_all)
LR = joblib.load('LRwith_L2.pkl')
pp = LR.predict_proba(test_data_all)
#pp is used to export a file to kaggle for testing.

Kaggle score of 2.55263

In [None]:
C_value = 2.0
#LR = LogisticRegression(penalty = 'l2', C = C_value, n_jobs = -1)
#LR.fit(train_data_no_months, train_labels_all)
LR = joblib.load('LRno_months.pkl')
pp = LR.predict_proba(test_data_no_months)
#pp is used to export a file to kaggle for testing.

Kaggle score of 2.55247

In [None]:
# Set up predictions for submission to Kaggle

data = pd.DataFrame(data=pp, 
                    index=[x for x in range(len(test_data_all))], 
                    columns=headers)
data.columns.name ="Id"
print(data.shape)
print(data)

In [None]:
joblib.dump(LR, 'LRno_weekdays.pkl')

Create zipped csv file for Kaggle
#### Update the filename first in all lines of the following code
Add something unique after our names to avoid overwriting other submission files

In [None]:
data.to_csv('Williams_Gascoigne_Vignola_Regression.csv', index_label = "Id")

In [None]:
zip_probs = zipfile.ZipFile("Williams_Gascoigne_Vignola_Regression_no_weekdays", "w")
zip_probs.write("Williams_Gascoigne_Vignola_Regression.csv", compress_type=zipfile.ZIP_DEFLATED)
zip_probs.close()
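The export steps above can be sketched end to end without touching the file system. This is an illustrative example using only the standard library, with a shortened class list, made-up probabilities, and a hypothetical file name `toy_submission.csv`. It also checks that each row of probabilities sums to 1, which is a cheap sanity test before uploading.

```python
import csv
import io
import zipfile

toy_headers = ["ARSON", "ASSAULT", "BAD CHECKS"]   # shortened class list
toy_probs = [[0.2, 0.5, 0.3], [0.1, 0.1, 0.8]]      # hypothetical predictions

# Each row of predicted probabilities should sum to (approximately) 1.
for row in toy_probs:
    assert abs(sum(row) - 1.0) < 1e-9

# Write the CSV text: an "Id" column followed by one column per class.
csv_buffer = io.StringIO()
writer = csv.writer(csv_buffer, lineterminator="\n")
writer.writerow(["Id"] + toy_headers)
for row_id, row in enumerate(toy_probs):
    writer.writerow([row_id] + row)
csv_text = csv_buffer.getvalue()

# Compress the CSV into an in-memory zip archive.
zip_buffer = io.BytesIO()
with zipfile.ZipFile(zip_buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("toy_submission.csv", csv_text)

# Read the archive back to confirm its contents.
with zipfile.ZipFile(zip_buffer) as archive:
    archive_names = archive.namelist()
    round_trip = archive.read("toy_submission.csv").decode("utf-8")
print(archive_names)
```

Using `with` blocks closes the archive automatically, so no explicit `close()` call is needed.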

### Results from previous datasets and/or model parameters

**First Submission**   
Results on development data from dataset as of Saturday 11/18, with weather added, latitude outliers removed, binarized and normalized features:

For C = 0.0001 the F1 score is 0.147075 and the Log Loss score is 3.014795  
For C = 0.001 the F1 score is 0.150284 and the Log Loss score is 2.638404  
For C = 0.01 the F1 score is 0.151366 and the Log Loss score is 2.551881  
For C = 0.1 the F1 score is 0.151589 and the Log Loss score is 2.543797  
For C = 0.5 the F1 score is 0.151615 and the Log Loss score is 2.543427  
For C = 1.0 the F1 score is 0.151579 and the Log Loss score is 2.543396  
**For C = 2.0 the F1 score is 0.151605 and the Log Loss score is 2.543383**  
For C = 10.0 the F1 score is 0.151657 and the Log Loss score is 2.543385  
For C = 100.0 the F1 score is 0.151619 and the Log Loss score is 2.543447  
For C = 1000.0 the F1 score is 0.151616 and the Log Loss score is 2.543544  

Predictions on test data from training on full data set are in zip file ending with Regression1

The Kaggle score from that zip file, which we expected to track the dev-data scores above, was 18.20988 (!?)  


**Second Submission**  
We changed our workflow along the way, so our first step was to re-run this notebook to confirm that we are unzipping and using the latest version of the data, in particular the full set of training and test data. 

Log loss on dev data after this step, with C=2.0 is 2.54338  
Predictions from this step are in zip file ending with Regression2  
Kaggle score is the same: 18.20989   
The problem turned out to be that our test data was not normalized.
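As an illustration of the fix, the test data has to be scaled with the same statistics that were computed on the training data. The sketch below assumes z-score standardization, which may not be the exact normalization used in our Data Set Up notebook. The helper names `fit_standardizer` and `apply_standardizer` and the toy rows are hypothetical.

```python
def fit_standardizer(train_rows):
    """Compute per-column mean and standard deviation from training rows."""
    n_rows = len(train_rows)
    n_cols = len(train_rows[0])
    means = [sum(row[j] for row in train_rows) / n_rows for j in range(n_cols)]
    stds = []
    for j in range(n_cols):
        variance = sum((row[j] - means[j]) ** 2 for row in train_rows) / n_rows
        # Guard against constant columns so we never divide by zero.
        stds.append(variance ** 0.5 or 1.0)
    return means, stds

def apply_standardizer(rows, means, stds):
    """Scale rows using statistics that were fitted on the training data."""
    return [[(value - m) / s for value, m, s in zip(row, means, stds)]
            for row in rows]

toy_train = [[1.0, 10.0], [3.0, 30.0]]
toy_test = [[2.0, 40.0]]

means, stds = fit_standardizer(toy_train)
train_scaled = apply_standardizer(toy_train, means, stds)
test_scaled = apply_standardizer(toy_test, means, stds)  # same means and stds
print(train_scaled, test_scaled)
```

If the test rows are left on their raw scale, the learned coefficients are applied to inputs of the wrong magnitude, and the predicted probabilities can become extreme, which log loss punishes severely.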


**Third Submission**  
On Cyprian's suggestion, we added the multi_class = 'multinomial' argument to the logistic regression model, because we don't have a binary output variable.  

Log loss on dev data after this step, with C=2.0 is 2.54267  
Predictions from this step are in zip file ending with Regression3  
Kaggle score on this set of predictions is:  33.37633   
This is even worse! Removed the multi_class argument from the model code.
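For context on what the `multi_class` argument changes, the sketch below contrasts the two ways of turning per-class scores into probabilities. As we understand it, the one-vs-rest option applies a sigmoid to each class score and rescales the results to sum to 1, while the multinomial option applies a softmax. The scores here are hypothetical, and in practice the two options would also learn different coefficients. This sketch does not explain the Kaggle score above.

```python
import math

def ovr_probabilities(scores):
    """Per-class sigmoid, then rescale so the probabilities sum to 1."""
    sigmoids = [1.0 / (1.0 + math.exp(-s)) for s in scores]
    total = sum(sigmoids)
    return [p / total for p in sigmoids]

def softmax_probabilities(scores):
    """Softmax over all class scores (subtract the max for stability)."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

toy_scores = [2.0, 0.0, -1.0]  # hypothetical decision scores for 3 classes
ovr = ovr_probabilities(toy_scores)
soft = softmax_probabilities(toy_scores)
print([round(p, 3) for p in ovr])
print([round(p, 3) for p in soft])
```

For the same scores, the softmax version is more peaked on the top class than the rescaled sigmoids, so the two options can give noticeably different log loss even when they agree on the predicted class.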

**Fourth Submission**

Fixed a critical bug in submissions 2 and 3 that basically made them nonsense: our test data was not normalized, and now it is. Scored *2.54* on Kaggle, not bad. This is our best score with Logistic Regression.

**Fifth to Seventh Submissions**

More bug fixing.

**Eighth Submission**

L2 Log Regression with no seasons. Kaggle score of 2.55255. This is worse than with our 58-feature set (no rotational data, d_police, or seasons). We tried dropping seasons first because we knew from our L1 regression analysis that they were weak features.

**Ninth Submission**

L2 Log Regression with the full 70 features. Kaggle score of 2.55263. This is worse than without seasons (2.55255).

**Tenth Submission**

L2 Log Regression without seasons and months. Kaggle score of 2.55247. This is the best with our full feature set, but still not better than our reduced 58 feature set.


