Note: The lines of code that apply the models to the test data and generate the submission file are commented out.  This is because the files created are >500MB each, and we didn't want somebody to accidentally create a bunch of large files if they ran all the cells.

# Imports

In [1]:
import crime
import pandas as pd
import numpy as np
from sklearn.cross_validation import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn import tree
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier

# Load Data

In [2]:
reload(crime)
train = crime.load_cleaned_train()
test = crime.load_cleaned_test()

# print train.info()
# print test.info()

The data is cleaned as described in `crime.py`.  In short, Year, Month, Day, Hour, and Minute columns are created, DayOfWeek, PdDistrict, and Category are encoded as integers, and invalid X and Y values are set to the median for that crime's PdDistrict.
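
Since `crime.py` is not shown here, the following is only a rough sketch of the last cleaning step. It assumes that "invalid" means "outside a bounding box" and that the medians are taken over the valid rows of each district. The names `fill_invalid_coords` and `is_valid`, the bounding box, and the toy rows are all hypothetical and not taken from `crime.py`.

```python
def median(values):
    """Median of a non-empty list of numbers."""
    ordered = sorted(values)
    mid = len(ordered) // 2
    if len(ordered) % 2:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2.0


def fill_invalid_coords(rows, is_valid):
    """Replace invalid X/Y values with the median of the valid values
    from the same PdDistrict.

    rows: list of dicts with 'X', 'Y', and 'PdDistrict' keys (modified in place).
    is_valid: hypothetical predicate deciding whether an (x, y) pair is usable.
    """
    # Collect the valid coordinates per district.
    valid = {}
    for row in rows:
        if is_valid(row['X'], row['Y']):
            xs, ys = valid.setdefault(row['PdDistrict'], ([], []))
            xs.append(row['X'])
            ys.append(row['Y'])

    # Overwrite invalid coordinates with the district medians.
    for row in rows:
        if not is_valid(row['X'], row['Y']):
            xs, ys = valid[row['PdDistrict']]
            row['X'], row['Y'] = median(xs), median(ys)
    return rows


# Toy data; the bounding box below is an illustrative assumption.
rows = [
    {'PdDistrict': 0, 'X': -122.42, 'Y': 37.77},
    {'PdDistrict': 0, 'X': -122.40, 'Y': 37.79},
    {'PdDistrict': 0, 'X': -122.41, 'Y': 37.78},
    {'PdDistrict': 0, 'X': -120.50, 'Y': 90.00},  # far outside the box
]
in_box = lambda x, y: -123.0 < x < -122.0 and 37.0 < y < 38.0
fill_invalid_coords(rows, in_box)
```

After the call, the last row carries the district medians of the three valid rows. This sketch raises a `KeyError` if a district has no valid rows at all, a case a real implementation would need to handle.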

# Split Train Data for Cross Validation

In [3]:
predictors = ['X','Y','Year','Hour','Minute','DoW','PdD','CornerCrime','ST_0','BogusReport','NBogusReport']
X = train[predictors]
y = train.CategoryNumber
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, stratify=np.array(y))

The `stratify` parameter of `train_test_split` requires scikit-learn 0.17 or later, but it ensures that the proportion of categories is maintained in the split.  The most important consequence is that we always get at least one crime from each category in the training set.  Our models can only predict what they have seen before, so it is crucial that we train them on all possible categories.
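
To make the idea concrete, here is a minimal plain-Python sketch of what a stratified split does. It is not scikit-learn's implementation; `stratified_split` and the toy labels are hypothetical and only illustrate that each category is divided separately, so a rare category still appears on both sides.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.5, seed=0):
    """Return (train_idx, test_idx), splitting each label in roughly the given ratio."""
    rng = random.Random(seed)

    # Group row indices by their label.
    by_label = defaultdict(list)
    for i, label in enumerate(labels):
        by_label[label].append(i)

    # Shuffle and split each group on its own.
    train_idx, test_idx = [], []
    for label in sorted(by_label):
        idx = by_label[label]
        rng.shuffle(idx)
        n_test = int(round(len(idx) * test_size))
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return train_idx, test_idx


# A rare class (2) with only 4 examples among 1004 rows.
labels = [0] * 600 + [1] * 400 + [2] * 4
train_idx, test_idx = stratified_split(labels, test_size=0.5)
train_labels = [labels[i] for i in train_idx]
test_labels = [labels[i] for i in test_idx]
```

With these toy labels the rare class contributes two rows to each half, whereas a purely random split could put all four on one side. Note that in this sketch a class with a single row can only land in one of the two halves.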

In [14]:
def cross_validate(alg, X_train, X_test, y_train, y_test):
    """Fit alg on each active predictor set and print its log loss on the held-out split."""
    predictor_sets = (
#         ['X', 'Y', 'CornerCrime', 'ST_0', 'PdD', 'Hour', 'DoW'],
#         ['X', 'Y', 'CornerCrime', 'ST_0', 'PdD', 'Hour'],
#         ['X', 'Y', 'CornerCrime', 'ST_0', 'PdD'],
#         ['X', 'Y', 'CornerCrime', 'PdD', 'Hour', 'DoW'],
#         ['X', 'Y', 'CornerCrime', 'Hour', 'DoW'],
#         ['X', 'Y', 'CornerCrime'],
#         ['X', 'Y', 'CornerCrime','BogusReport'],
#         ['X', 'Y', 'CornerCrime', 'PdD', 'Hour'],
#         ['X', 'Y', 'CornerCrime', 'PdD', 'BogusReport'],
#         ['X', 'Y', 'DoW', 'Hour', 'Year', 'CornerCrime'],
        ['X', 'Y', 'DoW', 'Year', 'CornerCrime','BogusReport','PdD','ST_0','Minute'],
        ['X', 'Y', 'DoW', 'Year', 'CornerCrime','BogusReport','PdD','ST_0','Minute','NBogusReport'],
        ['X', 'Y', 'DoW', 'Year', 'CornerCrime','BogusReport','PdD','ST_0','Minute','NBogusReport','Hour'],
#         ['X', 'Y', 'CornerCrime', 'ST_0', 'PdD','Afternoon','Night','Morning','Evening'],
#         ['DoW', 'Hour', 'Year', 'CornerCrime', 'ST_0', 'ST_1'],
#         ['X', 'Y', 'Hour', 'CornerCrime', 'ST_0','PdD_0','PdD_1','PdD_2','PdD_3','PdD_4','PdD_5','PdD_6','PdD_7','PdD_8','PdD_9'],
#         ['X', 'Y', 'Hour', 'DoW', 'CornerCrime'],
#         ['X', 'Y', 'ST_0', 'CornerCrime','PdD_0','PdD_1','PdD_2','PdD_3','PdD_4','PdD_5','PdD_6','PdD_7','PdD_8','PdD_9'],
#         ['X', 'Y', 'ST_0', 'ST_1', 'CornerCrime','PdD_0','PdD_1','PdD_2','PdD_3','PdD_4','PdD_5','PdD_6','PdD_7','PdD_8','PdD_9','Afternoon','Night','Morning','Evening']
    )

    for predictors in predictor_sets:
        alg.fit(X_train[predictors], y_train)
        p = alg.predict_proba(X_test[predictors])
        print crime.logloss(y_test, p), predictors

# Baseline Model

To have something to compare against, we've created a baseline model that guesses based on the crime rates in each district.

In [None]:
class baseline(object):
    def __init__(self):
        self.has_fit = False
        
    def fit(self, X_train, y_train):
        X_train = X_train.copy()
        X_train['CategoryNumber'] = y_train
        groups = X_train.groupby(['PdD', 'CategoryNumber'])

        # Tally up the counts of each Category in each PdDistrict
        num_districts = len(X_train.PdD.unique())
        num_categories = len(y_train.unique())
        self.district_rates = np.zeros((num_districts, num_categories))
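        for ind,data in groups:
            # ind is a (PdD, CategoryNumber) tuple, so it indexes one cell of the table;
            # this assumes both columns are encoded as integers 0..n-1
            self.district_rates[ind] = len(data)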

        # Normalize values
        self.district_rates /= self.district_rates.sum(axis=1, keepdims=True)

        self.has_fit = True

    def predict_proba(self, X_test):
        if self.has_fit:
            predictions = X_test.PdD.apply(lambda x: self.district_rates[x,:])
            return pd.DataFrame(predictions.tolist()).values  # to get a numpy array of the correct shape
        return None

alg = baseline()
predictors = ['PdD']
alg.fit(X_train[predictors], y_train)
p = alg.predict_proba(X_test[predictors])
print crime.logloss(y_test, p)

# crime.create_submission(alg, X, y, test, predictors, 'baseline_submission.csv')

This scored 2.61645 on the test data, very close to the cross validation score.  This isn't too surprising: the train and test data are split by alternating weeks, so our cross validation train-test split is fairly representative of the data as a whole.
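
The scores in this notebook are multi-class log loss values, so lower is better. The body of `crime.logloss` is not shown here; the sketch below assumes the standard definition with probabilities clipped away from 0 and 1. The name `multiclass_logloss` and the clipping value `eps` are assumptions made for illustration.

```python
import math


def multiclass_logloss(y_true, probs, eps=1e-15):
    """Mean negative log of the probability assigned to the true class.

    y_true: list of integer class labels.
    probs: list of per-row probability lists, indexed by class label.
    eps: clipping bound so that a probability of 0 does not yield infinity.
    """
    total = 0.0
    for label, row in zip(y_true, probs):
        p = min(max(row[label], eps), 1.0 - eps)
        total -= math.log(p)
    return total / len(y_true)


# Guessing uniformly over K classes always scores ln(K).
K = 4
uniform = [[1.0 / K] * K for _ in range(3)]
score = multiclass_logloss([0, 2, 3], uniform)
```

Under this definition a uniform guess over K categories scores ln(K), which gives a useful reference point: a model scoring above that is doing worse than knowing nothing at all.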

# k-Nearest Neighbors Model

The first model we've chosen to try is k-Nearest Neighbors, partly because you can quite literally look at which crimes occurred near each other using the X and Y columns.

In [35]:
alg = KNeighborsClassifier(n_neighbors=50)
cross_validate(alg, X_train, X_test, y_train, y_test)

5.11685209372 ['X', 'Y', 'Hour', 'PdD', 'CornerCrime', 'ST_0']
5.2368918436 ['X', 'Y', 'PdD', 'Hour', 'ST_0', 'ST_1']


It looks like this model did best when it used only the spatial data (the X and Y columns).  It also performs worse than our baseline model, but let's see how it does on the test data.

In [None]:
predictors = ['X', 'Y']
alg = KNeighborsClassifier(n_neighbors=50)
# crime.create_submission(alg, X, y, test, predictors, 'k-nn_submission.csv')

This scored 5.32130 on the test data, which is a little worse than the cross validation score.
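
One possible contributor to such a high score, offered here as a hypothesis rather than a diagnosis: with uniform weights, a k-NN classifier reports each class probability as the fraction of the k neighbors carrying that label, so any category that is absent among the 50 neighbors gets a probability of exactly 0. Log loss punishes that very heavily when the absent category turns out to be the true one. The sketch below uses hypothetical helper names and made-up neighbor labels.

```python
import math
from collections import Counter


def neighbor_vote_proba(neighbor_labels, n_classes):
    """Class probabilities as the fraction of neighbors carrying each label."""
    counts = Counter(neighbor_labels)
    k = float(len(neighbor_labels))
    return [counts.get(c, 0) / k for c in range(n_classes)]


def row_logloss(p_true, eps=1e-15):
    """Log loss contribution of one row, with the probability clipped away from 0 and 1."""
    return -math.log(min(max(p_true, eps), 1.0 - eps))


# 50 hypothetical neighbors: 30 of class 0, 20 of class 1, none of class 2.
proba = neighbor_vote_proba([0] * 30 + [1] * 20, n_classes=3)

loss_if_class_0 = row_logloss(proba[0])  # true class was well represented
loss_if_class_2 = row_logloss(proba[2])  # true class never appeared among the neighbors
```

With the clipping value assumed here, the second row costs roughly 34.5 while the first costs about 0.5, so even a small share of such rows can dominate the mean.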

# Logistic Regression Model

Next, we decided to see how our old friend the Logistic Regression would do.

In [49]:
alg = LogisticRegression()
cross_validate(alg, X_train, X_test, y_train, y_test)

2.56665986522 ['X', 'Y', 'CornerCrime', 'ST_0', 'PdD_0', 'PdD_1', 'PdD_2', 'PdD_3', 'PdD_4', 'PdD_5', 'PdD_6', 'PdD_7', 'PdD_8', 'PdD_9']
2.54508352077 ['X', 'Y', 'CornerCrime', 'ST_0', 'PdD_0', 'PdD_1', 'PdD_2', 'PdD_3', 'PdD_4', 'PdD_5', 'PdD_6', 'PdD_7', 'PdD_8', 'PdD_9', 'Afternoon', 'Night', 'Morning', 'Evening']
2.56204104465 ['X', 'Y', 'ST_0', 'ST_1', 'CornerCrime', 'PdD_0', 'PdD_1', 'PdD_2', 'PdD_3', 'PdD_4', 'PdD_5', 'PdD_6', 'PdD_7', 'PdD_8', 'PdD_9']
2.54155475454 ['X', 'Y', 'ST_0', 'ST_1', 'CornerCrime', 'PdD_0', 'PdD_1', 'PdD_2', 'PdD_3', 'PdD_4', 'PdD_5', 'PdD_6', 'PdD_7', 'PdD_8', 'PdD_9', 'Afternoon', 'Night', 'Morning', 'Evening']


The Logistic Regression model seems to do slightly better when it's given all of the predictors, but is still not as good as our baseline model.

In [None]:
predictors = ['X', 'Y', 'Year', 'Month', 'Hour', 'DoW', 'PdD']
alg = LogisticRegression()
# crime.create_submission(alg, X, y, test, predictors, 'lr_submission.csv')

This scored 2.65839 on the test data, quite close to the cross validation score but still not as good as the baseline model.

# Decision Tree Model

In [15]:
alg = tree.DecisionTreeClassifier(max_depth=3)
cross_validate(alg, X_train, X_test, y_train, y_test)

2.58949857035 ['X', 'Y', 'CornerCrime', 'PdD', 'Hour']
2.59203895398 ['X', 'Y', 'DoW', 'Hour', 'Year', 'CornerCrime']


The scores are very close whether the model uses just two predictors or all of them, so let's try giving it all the predictors.

In [None]:
predictors = ['X', 'Y', 'Year', 'Month', 'Hour', 'DoW', 'PdD']
alg = tree.DecisionTreeClassifier(max_depth=3, min_samples_leaf=5)
# crime.create_submission(alg, X, y, test, predictors, 'dt_submission.csv')

This scored a 2.62696 on the test data.  Same story here:  very close to the cross validation score, but worse than our baseline model.

# Gradient Boosting Model

In [42]:
alg = GradientBoostingClassifier(random_state=1, n_estimators=10, max_depth=3)
cross_validate(alg, X_train, X_test, y_train, y_test)

2.67518828608 ['X', 'Y', 'CornerCrime', 'PdD', 'Hour']


In [None]:
predictors = ['X', 'Y', 'Year', 'Month', 'Hour', 'DoW', 'PdD']
alg = GradientBoostingClassifier(random_state=1, n_estimators=10, max_depth=3)
# crime.create_submission(alg, X, y, test, predictors, 'gb_submission.csv')

This scored 2.69673 on the test data.  Again, no surprises here.  Additionally, this model is computationally expensive and takes a very long time to run.

# Random Forest Model

In [15]:
alg = RandomForestClassifier(n_estimators=25, max_depth=15)
cross_validate(alg, X_train, X_test, y_train, y_test)

2.37321471025 ['X', 'Y', 'DoW', 'Year', 'CornerCrime', 'BogusReport', 'PdD', 'ST_0', 'Minute']
2.3694449714 ['X', 'Y', 'DoW', 'Year', 'CornerCrime', 'BogusReport', 'PdD', 'ST_0', 'Minute', 'NBogusReport']
2.35486923645 ['X', 'Y', 'DoW', 'Year', 'CornerCrime', 'BogusReport', 'PdD', 'ST_0', 'Minute', 'NBogusReport', 'Hour']


Using more trees (`n_estimators`) seems to improve things, but makes a senior laptop with 8GB of RAM a bit sad.  If your computer can handle it, try increasing this parameter!

In [10]:
predictors = ['X', 'Y']
alg = RandomForestClassifier(n_estimators=20, max_depth=10)
# crime.create_submission(alg, X, y, test, predictors, 'rf_xy_submission.csv')

This scored a 2.45770 on the test data, better than the baseline!  It's interesting how well this model performed with so few predictors being used.

In [None]:
predictors = ['X', 'Y', 'DoW', 'Hour', 'Year']
alg = RandomForestClassifier(n_estimators=20, max_depth=10)
# crime.create_submission(alg, X, y, test, predictors, 'rf_more_submission.csv')

This scored a 2.46047 on the test data, slightly worse than the Random Forest using only the X and Y columns.

In [None]:
predictors = ['X', 'Y', 'PdD', 'Hour']
alg = RandomForestClassifier(n_estimators=25, max_depth=10)
# crime.create_submission(alg, X, y, test, predictors, 'rf_axph_submission.csv')

With only the X, Y, PdD, and Hour predictors, the score can be improved to 2.44754.

In [None]:
predictors = ['X', 'Y', 'DoW', 'Hour', 'Year']
alg = RandomForestClassifier(n_estimators=25, max_depth=15)
# crime.create_submission(alg, X, y, test, predictors, 'rf_xydhy_submission.csv')

This scored 2.43554 on Kaggle, but we're getting to the point where better computers would let us perform better.  This is fine, but not awesome for our learning.

Adding the corner crime indicator gave an impressive score of 2.41737.

In [5]:
predictors = ['X', 'Y', 'DoW', 'Year', 'CornerCrime', 'BogusReport', 'PdD', 'ST_0', 'Minute']
alg = RandomForestClassifier(n_estimators=20, max_depth=10)
# crime.create_submission(alg, X, y, test, predictors, 'v1_rfc_all.csv')

This got a score of 2.38158.

In [4]:
predictors = ['X', 'Y', 'DoW', 'Year', 'CornerCrime', 'BogusReport', 'PdD', 'ST_0', 'Minute', 'NBogusReport', 'Hour']
alg = RandomForestClassifier(n_estimators=25, max_depth=15)
# crime.create_submission(alg, X, y, test, predictors, 'v3_rfc_all.csv')

This got a score of 2.33845.

# Bayes

In [40]:
from sklearn.naive_bayes import BernoulliNB
alg = BernoulliNB(fit_prior=True, binarize=0.0, alpha=0.25)
cross_validate(alg, X_train, X_test, y_train, y_test)

2.62482822734 ['X', 'Y', 'CornerCrime', 'PdD', 'Hour']
2.62766723051 ['X', 'Y', 'DoW', 'Hour', 'Year', 'CornerCrime']
2.60877973281 ['X', 'Y', 'CornerCrime', 'ST_0', 'PdD', 'Afternoon', 'Night', 'Morning', 'Evening']
2.56666551916 ['X', 'Y', 'Hour', 'CornerCrime', 'ST_0', 'PdD_0', 'PdD_1', 'PdD_2', 'PdD_3', 'PdD_4', 'PdD_5', 'PdD_6', 'PdD_7', 'PdD_8', 'PdD_9']
2.55127520511 ['X', 'Y', 'ST_0', 'ST_1', 'CornerCrime', 'PdD_0', 'PdD_1', 'PdD_2', 'PdD_3', 'PdD_4', 'PdD_5', 'PdD_6', 'PdD_7', 'PdD_8', 'PdD_9', 'Afternoon', 'Night', 'Morning', 'Evening']


# Adaboost

In [47]:
from sklearn.ensemble import AdaBoostClassifier
alg = AdaBoostClassifier()
cross_validate(alg, X_train, X_test, y_train, y_test)

3.58774824065 ['X', 'Y', 'CornerCrime']
3.58199885915 ['X', 'Y', 'CornerCrime', 'PdD', 'Hour']
3.57840619939 ['X', 'Y', 'DoW', 'Hour', 'Year', 'CornerCrime']
3.58100998049 ['X', 'Y', 'Hour', 'CornerCrime', 'ST_0', 'PdD_0', 'PdD_1', 'PdD_2', 'PdD_3', 'PdD_4', 'PdD_5', 'PdD_6', 'PdD_7', 'PdD_8', 'PdD_9']
3.58784805018 ['X', 'Y', 'ST_0', 'ST_1', 'CornerCrime', 'PdD_0', 'PdD_1', 'PdD_2', 'PdD_3', 'PdD_4', 'PdD_5', 'PdD_6', 'PdD_7', 'PdD_8', 'PdD_9']


# Bagging

In [49]:
from sklearn.ensemble import BaggingClassifier
alg = BaggingClassifier(n_estimators=20)
cross_validate(alg, X_train, X_test, y_train, y_test)

7.12364876848 ['X', 'Y', 'CornerCrime']
12.6827841013 ['X', 'Y', 'CornerCrime', 'PdD', 'Hour']
11.9044417276 ['X', 'Y', 'DoW', 'Hour', 'Year', 'CornerCrime']
12.6980398955 ['X', 'Y', 'Hour', 'CornerCrime', 'ST_0', 'PdD_0', 'PdD_1', 'PdD_2', 'PdD_3', 'PdD_4', 'PdD_5', 'PdD_6', 'PdD_7', 'PdD_8', 'PdD_9']
7.15624551676 ['X', 'Y', 'ST_0', 'ST_1', 'CornerCrime', 'PdD_0', 'PdD_1', 'PdD_2', 'PdD_3', 'PdD_4', 'PdD_5', 'PdD_6', 'PdD_7', 'PdD_8', 'PdD_9']


# Extra Trees

In [8]:
from sklearn.ensemble import ExtraTreesClassifier
alg = ExtraTreesClassifier(n_estimators = 20)
cross_validate(alg, X_train, X_test, y_train, y_test)

7.61062486967 ['X', 'Y', 'CornerCrime']
15.4302319074 ['X', 'Y', 'CornerCrime', 'PdD', 'Hour']
14.1251773967 ['X', 'Y', 'DoW', 'Hour', 'Year', 'CornerCrime']
15.4325900837 ['X', 'Y', 'Hour', 'CornerCrime', 'ST_0', 'PdD_0', 'PdD_1', 'PdD_2', 'PdD_3', 'PdD_4', 'PdD_5', 'PdD_6', 'PdD_7', 'PdD_8', 'PdD_9']
7.648138786 ['X', 'Y', 'ST_0', 'CornerCrime', 'PdD_0', 'PdD_1', 'PdD_2', 'PdD_3', 'PdD_4', 'PdD_5', 'PdD_6', 'PdD_7', 'PdD_8', 'PdD_9']


# Telling a Story

In this modelling exploration, we find that the methods that exploit the categorical nature of the data (logistic regression, decision trees, and random forest classifiers) perform well.  Location and time of day may be the most telling predictors, followed by year and day of week.  These relationships were also evident in our exploration phase.  At a high level, what we're seeing is that crime reporting in San Francisco over the past 10+ years has followed a similar pattern: the same types of crime are being committed with relatively similar frequency, in the same general locations, and are being reported by the expected reporting entity.

It may be interesting to explore whether the type of crime, different slices of the time of year, or the connection between reporting time and type of crime yields more exact predictions.

# Questions

Why is such a simplistic, hard-coded model performing better than most of these scikit-learn models?  What is it about this dataset that makes these models perform as they do?  Would it be worth tweaking the parameters of one of these models to try to improve the score, or are they just the wrong models to be using for this problem?  How can we answer these questions?