Note: The lines of code that apply the models to the test data and generate the submission files are commented out.  The files created are over 500MB each, and we didn't want anyone to accidentally create a bunch of large files by running all the cells.

# Imports

In [3]:
import crime
import pandas as pd
import numpy as np
from sklearn.cross_validation import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn import tree
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier

# Load Data

In [4]:
train = crime.load_cleaned_train()
test = crime.load_cleaned_test()

print train.info()
print test.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 878049 entries, 0 to 878048
Data columns (total 28 columns):
X                 878049 non-null float64
Y                 878049 non-null float64
Year              878049 non-null int64
Month             878049 non-null int64
Day               878049 non-null int64
Hour              878049 non-null int64
Minute            878049 non-null int64
BogusReport       878049 non-null bool
NBogusReport      878049 non-null bool
DoW               878049 non-null int64
Morning           878049 non-null int64
Afternoon         878049 non-null int64
Evening           878049 non-null int64
Night             878049 non-null int64
PdD               878049 non-null int64
PdD_0             878049 non-null float64
PdD_1             878049 non-null float64
PdD_2             878049 non-null float64
PdD_3             878049 non-null float64
PdD_4             878049 non-null float64
PdD_5             878049 non-null float64
PdD_6             878049 non-null f

The data is cleaned as described in `crime.py`.  In short, Year, Month, Day, Hour, and Minute columns are created, DayOfWeek, PdDistrict, and Category are encoded as integers, and invalid X and Y values are set to the median for that crime's PdDistrict.  In addition, PdDistrict information is one-hot encoded, a flag for crimes on a corner is added, and unnecessary columns are dropped.
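
`crime.py` is not reproduced in this notebook, so the following is only a minimal sketch of the steps described above, run on a four-row toy frame.  The `Dates` and `Address` column names, the rule for spotting invalid coordinates (a latitude of 90), and the use of `/` to detect corners are all assumptions made for illustration, not the actual implementation.  Category encoding is left out because the toy frame has no labels.

```python
import pandas as pd

# Toy stand-in for the raw data. The `Dates` and `Address` column names are
# assumptions; X, Y, PdDistrict, and DayOfWeek are named in the text above.
raw = pd.DataFrame({
    'Dates': pd.to_datetime(['2015-05-13 23:53:00', '2015-05-13 08:30:00',
                             '2014-01-02 14:05:00', '2014-01-03 02:15:00']),
    'DayOfWeek': ['Wednesday', 'Wednesday', 'Thursday', 'Friday'],
    'PdDistrict': ['NORTHERN', 'NORTHERN', 'PARK', 'PARK'],
    'Address': ['OAK ST / LAGUNA ST', '1500 Block of LOMBARD ST',
                '100 Block of BRODERICK ST', 'HAIGHT ST / ASHBURY ST'],
    'X': [-122.4259, -120.5, -122.4388, -122.4469],
    'Y': [37.7745, 90.0, 37.7715, 37.7700],
})

def clean(df):
    df = df.copy()

    # Split the timestamp into separate numeric columns (Year, Month, ...).
    for part in ['year', 'month', 'day', 'hour', 'minute']:
        df[part.capitalize()] = getattr(df['Dates'].dt, part)

    # Encode the categorical columns as integers.
    df['DoW'] = df['DayOfWeek'].astype('category').cat.codes
    df['PdD'] = df['PdDistrict'].astype('category').cat.codes

    # Assumed rule: a latitude of 90 marks an invalid location. Replace such
    # coordinates with the median of the valid rows in the same district.
    invalid = df['Y'] >= 90.0
    for col in ['X', 'Y']:
        medians = df.loc[~invalid].groupby('PdD')[col].median()
        df.loc[invalid, col] = df.loc[invalid, 'PdD'].map(medians)

    # One-hot encode the district and flag addresses that look like corners.
    one_hot = pd.get_dummies(df['PdD'], prefix='PdD').astype(float)
    df['CornerCrime'] = df['Address'].str.contains('/').astype(int)

    df = pd.concat([df, one_hot], axis=1)
    return df.drop(['Dates', 'DayOfWeek', 'PdDistrict', 'Address'], axis=1)

cleaned = clean(raw)
print(cleaned[['X', 'Y', 'Hour', 'PdD', 'PdD_0', 'PdD_1', 'CornerCrime']])
```

In the toy frame, the second row's out-of-range coordinates are replaced by the median of the valid NORTHERN rows.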

# Split Train Data for Cross Validation

In [5]:
X = train
y = X.pop('CategoryNumber')
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, stratify=np.array(y))

The `stratify` parameter of `train_test_split` requires scikit-learn 0.17 or later.  It ensures that the proportion of each category is maintained in the split.  Most importantly, it means that the training set always contains at least one crime from each category.  Our models can only predict categories that they have seen before, so it is crucial that we train them on all possible categories.
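
As a small illustration of what `stratify` buys us, the sketch below splits made-up labels with two rare classes.  It imports `train_test_split` from `sklearn.model_selection`, which is where the function lives in current scikit-learn releases; the toy labels and variable names are assumptions for the example only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels: one common class and two rare ones (2 rows each).
y_toy = np.array([0] * 96 + [1] * 2 + [2] * 2)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

Xa, Xb, ya, yb = train_test_split(
    X_toy, y_toy, test_size=0.5, stratify=y_toy, random_state=0)

# Each half keeps the original class proportions, so both rare classes
# show up on each side of the split.
train_counts = np.bincount(ya, minlength=3)
test_counts = np.bincount(yb, minlength=3)
print(train_counts)
print(test_counts)
```

Without `stratify`, a random 50/50 split could easily put both rows of a rare class on the same side.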

In [None]:
def cross_validate(alg, X_train, X_test, y_train, y_test):
    """Fit `alg` on each predictor set and print its log loss on the held-out half."""
    # Candidate feature subsets to compare
    predictor_sets = (
        ['X', 'Y', 'CornerCrime', 'BogusReport', 'NBogusReport', 'ST_0'],
        ['X', 'Y', 'Year', 'Month', 'Hour', 'DoW', 'PdD', 'CornerCrime'],
        ['Minute', 'Y', 'X', 'CornerCrime', 'Hour', 'PdD', 'Year', 'NBogusReport', 'Month'],
        ['Minute', 'Y', 'X', 'CornerCrime', 'Hour', 'PdD', 'Year', 'NBogusReport', 'Month', 'BogusReport']
    )

    for predictors in predictor_sets:
        alg.fit(X_train[predictors], y_train)
        # Log loss is computed from class probabilities, not hard predictions
        p = alg.predict_proba(X_test[predictors])
        print crime.logloss(y_test, p), predictors
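
`crime.logloss` is defined in `crime.py` and is not shown here.  A minimal sketch of multiclass log loss, assuming integer labels in `0..K-1` and one probability column per class, might look like the following; the clipping constant `eps` and the renormalization step are assumptions, not necessarily what `crime.py` does.

```python
import numpy as np

def logloss(y_true, proba, eps=1e-15):
    """Mean negative log-probability assigned to the true class.

    y_true: integer class labels in 0..K-1, shape (n,)
    proba:  predicted probabilities, shape (n, K)
    """
    y_true = np.asarray(y_true)
    # Clip so that a predicted probability of exactly 0 does not give log(0).
    proba = np.clip(np.asarray(proba, dtype=float), eps, 1 - eps)
    proba = proba / proba.sum(axis=1, keepdims=True)  # renormalize after clipping
    picked = proba[np.arange(len(y_true)), y_true]
    return -np.mean(np.log(picked))

# A uniform guess over K classes always scores ln(K).
y_demo = np.array([0, 1, 2, 3])
uniform = np.full((4, 4), 0.25)
print(logloss(y_demo, uniform))

# A confident wrong answer is punished much harder than a hedged one.
hedged = np.array([[0.4, 0.6]])
confident = np.array([[0.01, 0.99]])
print(logloss([0], hedged))
print(logloss([0], confident))
```

Lower is better, and the score depends on the whole probability vector, which is why every model here is evaluated through `predict_proba`.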

# Baseline Model

In order to have something to compare against, we've created a baseline model that guesses based on the crime rates in each district.

In [None]:
class baseline(object):
    def __init__(self):
        self.has_fit = False
        
    def fit(self, X_train, y_train):
        X_train = X_train.copy()
        X_train['CategoryNumber'] = y_train
        groups = X_train.groupby(['PdD', 'CategoryNumber'])

        # Tally up the counts of each Category in each PdDistrict
        num_districts = len(X_train.PdD.unique())
        num_categories = len(y_train.unique())
        self.district_rates = np.zeros((num_districts, num_categories))
        for ind,data in groups:
            self.district_rates[ind] = len(data)

        # Normalize values
        self.district_rates /= self.district_rates.sum(axis=1, keepdims=True)

        self.has_fit = True

    def predict_proba(self, X_test):
        if self.has_fit:
            predictions = X_test.PdD.apply(lambda x: self.district_rates[x,:])
            return pd.DataFrame(predictions.tolist()).values  # to get a numpy array of the correct shape
        return None

alg = baseline()
predictors = ['PdD']
alg.fit(X_train[predictors], y_train)
p = alg.predict_proba(X_test[predictors])
print crime.logloss(y_test, p)

# crime.create_submission(alg, X, y, test, predictors, 'baseline_submission.csv')

This scored a 2.61645 on the test data, very close to the cross validation score.  This isn't too surprising: the train and test data are split by alternating weeks, so our cross validation train-test split is fairly representative of the data as a whole.
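
To make the baseline concrete, the per-district rates that `fit` tallies can be reproduced on toy data with `pd.crosstab`.  This is a sketch rather than a replacement for the class above, and the toy values are made up.

```python
import pandas as pd

# Toy data: integer-encoded district (PdD) and crime category.
toy = pd.DataFrame({
    'PdD':            [0, 0, 0, 0, 1, 1, 1, 1],
    'CategoryNumber': [0, 0, 0, 1, 1, 1, 2, 2],
})

# Rows are districts, columns are categories, and each row sums to 1.
district_rates = pd.crosstab(toy['PdD'], toy['CategoryNumber'], normalize='index')
print(district_rates)

# Predicting is a row lookup: every crime in a district gets the same guess.
new_crimes = pd.Series([0, 1, 0], name='PdD')
proba = district_rates.loc[new_crimes].values
print(proba)
```

Note that a category never seen in a district gets a probability of exactly zero, which log loss penalizes heavily if that category does turn up there.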

# k-Nearest Neighbors Model

The first model we've chosen to try is k-Nearest Neighbors, partly because you can quite literally look at which crimes occurred near each other using the X and Y columns.

In [17]:
alg = KNeighborsClassifier(n_neighbors=250)
cross_validate(alg, X_train, X_test, y_train, y_test)

2.99210834292 ['X', 'Y']
3.20051772488 ['DoW', 'Hour', 'Year']
2.94271635483 ['X', 'Y', 'CornerCrime']
3.00298633715 ['X', 'Y', 'Hour', 'PdD', 'CornerCrime']
3.06264945443 ['X', 'Y', 'Year', 'Month', 'Hour', 'DoW', 'PdD', 'CornerCrime']


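It looks like this model did best when it used only spatial data: the X, Y, and CornerCrime columns.  It also improves as `n_neighbors` is increased, but 250 is the largest value that I was able to run on my laptop.  I'll probably have to use fewer neighbors with the full data.  Note that the predictor sets printed here differ from the list currently defined in `cross_validate`; this output appears to come from an earlier run.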
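
One likely reason that a larger `n_neighbors` helps is the way k-NN builds its probabilities: with the default uniform weights, each class probability is simply the fraction of the k neighbors that belong to that class.  The sketch below shows this on synthetic points (the data and variable names are made up for illustration).

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Synthetic points on a unit square with three made-up classes.
rng = np.random.RandomState(0)
points = rng.rand(300, 2)
labels = rng.randint(0, 3, size=300)
query = np.array([[0.5, 0.5]])

results = {}
for k in (1, 5, 50):
    knn = KNeighborsClassifier(n_neighbors=k).fit(points, labels)
    # Each probability is (neighbors in that class) / k, so with k = 1 the
    # model puts all of its weight on a single class.
    results[k] = knn.predict_proba(query)[0]
    print(k)
    print(results[k])
```

With few neighbors most categories receive a probability of zero, and log loss punishes every true category that was given zero.  More neighbors give smoother, less extreme estimates.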

In [6]:
predictors = ['X', 'Y', 'CornerCrime']
alg = KNeighborsClassifier(n_neighbors=150)
# crime.create_submission(alg, X, y, test, predictors, 'k-nn_submission.csv')

This scored a 3.40014 on the test data, but could probably have done better if run with a higher `n_neighbors` value.

# Logistic Regression Model

Next, we decided to see how our old friend Logistic Regression would do.  First, we should be able to replicate our baseline model by using a one-hot encoding of PdDistrict.  `C` is the inverse of the regularization strength, so smaller values penalize large coefficients more heavily.  Since our baseline model doesn't do anything like this, we can set `C` to a very large number so that the penalty has little effect.

In [None]:
predictors = [col for col in X_train.columns if 'PdD_' in col]
alg = LogisticRegression(C=1e30)
alg.fit(X_train[predictors], y_train)
p = alg.predict_proba(X_test[predictors])
print crime.logloss(y_test, p), predictors

This is pretty close to what our baseline model got, as we expected.
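
A short derivation shows why the two should agree.  As a sketch, assume a softmax model whose only inputs are the one-hot district indicators, so that each district $d$ has its own free score $w_{d,k}$ for each category $k$ (the symbols $n_{d,k}$, $w_{d,k}$, and $p_{d,k}$ are introduced here for illustration).  With $n_{d,k}$ training crimes of category $k$ in district $d$, the unregularized log-likelihood is

$$
\ell(w) = \sum_{d}\sum_{k} n_{d,k}\,\log p_{d,k},
\qquad
p_{d,k} = \frac{e^{w_{d,k}}}{\sum_{j} e^{w_{d,j}}}.
$$

Setting the derivative with respect to $w_{d,k}$ to zero gives

$$
\frac{\partial \ell}{\partial w_{d,k}} = n_{d,k} - p_{d,k}\sum_{j} n_{d,j} = 0
\quad\Longrightarrow\quad
p_{d,k} = \frac{n_{d,k}}{\sum_{j} n_{d,j}},
$$

which is the normalized count that the baseline stores in `district_rates`.  With a very large `C` the penalty term is negligible, so the fitted probabilities should approach these rates.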

In [None]:
alg = LogisticRegression()
cross_validate(alg, X_train, X_test, y_train, y_test)

The Logistic Regression model seems to do slightly better when it's given all of the predictors, but is still not as good as our baseline model.

In [None]:
predictors = ['X', 'Y', 'Year', 'Month', 'Hour', 'DoW', 'PdD']
alg = LogisticRegression()
# crime.create_submission(alg, X, y, test, predictors, 'lr_submission.csv')

This scored a 2.65839 on the test data, quite close to the cross validation score but still not as good as the baseline model.

# Decision Tree Model

In [None]:
alg = tree.DecisionTreeClassifier(max_depth=3, min_samples_leaf=5)
cross_validate(alg, X_train, X_test, y_train, y_test)

Let's try giving the model all the predictors, since its scores with just two predictors and with all of them are so close.

In [None]:
predictors = ['X', 'Y', 'Year', 'Month', 'Hour', 'DoW', 'PdD']
alg = tree.DecisionTreeClassifier(max_depth=3, min_samples_leaf=5)
# crime.create_submission(alg, X, y, test, predictors, 'dt_submission.csv')

This scored a 2.62696 on the test data.  Same story here:  very close to the cross validation score, but worse than our baseline model.

# Gradient Boosting Model

In [None]:
alg = GradientBoostingClassifier(random_state=1, n_estimators=10, max_depth=3)
cross_validate(alg, X_train, X_test, y_train, y_test)

In [None]:
predictors = ['X', 'Y', 'Year', 'Month', 'Hour', 'DoW', 'PdD']
alg = GradientBoostingClassifier(random_state=1, n_estimators=10, max_depth=3)
# crime.create_submission(alg, X, y, test, predictors, 'gb_submission.csv')

This scored a 2.69673 on the test data.  Again, no surprises here.  Additionally, this model is computationally expensive and takes a very long time to run.

# Random Forest Model

In [7]:
alg = RandomForestClassifier(n_estimators=20, max_depth=10)
cross_validate(alg, X_train, X_test, y_train, y_test)

2.46507713387 ['X', 'Y']
2.65195846086 ['DoW', 'Hour', 'Year']
2.43301324231 ['X', 'Y', 'CornerCrime']
2.42163389311 ['X', 'Y', 'Hour', 'PdD', 'CornerCrime']
2.44154444196 ['X', 'Y', 'Year', 'Month', 'Hour', 'DoW', 'PdD', 'CornerCrime']


Using more trees (`n_estimators`) seems to improve things, but makes a senior laptop with 8GB of RAM a bit sad.  If your computer can handle it, try increasing this parameter!

In [8]:
predictors = ['X', 'Y', 'Hour', 'PdD', 'CornerCrime']
alg = RandomForestClassifier(n_estimators=20, max_depth=10)
# crime.create_submission(alg, X, y, test, predictors, 'rf_submission.csv')

This scored a 2.41986 on the test data, better than the baseline!

In another notebook (`parameter_sweeps.ipynb`), we ran a variety of sweeps to try to tune our parameters.  The outcome was to use the features below as predictors, to use as many estimators as our computers can handle (30 works for cross validation, but only 25 for the full dataset), and to set the max depth to 14.

In [7]:
alg = RandomForestClassifier(n_estimators=30, max_depth=14, n_jobs=8)
cross_validate(alg, X_train, X_test, y_train, y_test)

2.41816255874 ['X', 'Y', 'CornerCrime', 'BogusReport', 'NBogusReport', 'ST_0']
2.39390399362 ['X', 'Y', 'Year', 'Month', 'Hour', 'DoW', 'PdD', 'CornerCrime']
2.33125201384 ['Minute', 'Y', 'X', 'CornerCrime', 'Hour', 'PdD', 'Year', 'NBogusReport', 'Month']
2.33390254332 ['Minute', 'Y', 'X', 'CornerCrime', 'Hour', 'PdD', 'Year', 'NBogusReport', 'Month', 'BogusReport']


In [None]:
predictors = ['Minute', 'Y', 'X', 'CornerCrime', 'Hour', 'PdD', 'Year', 'NBogusReport', 'Month']
alg = RandomForestClassifier(n_estimators=25, max_depth=14, n_jobs=4)
# crime.create_submission(alg, X, y, test, predictors, 'rf_tuned_submission.csv')

This scored a 2.32886 on the test data, our best yet.
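
The sweeps themselves live in `parameter_sweeps.ipynb` and are not reproduced here.  As a rough sketch of what a sweep over `max_depth` could look like, the example below uses synthetic data and scikit-learn's own `log_loss` in place of `crime.logloss`; the data, the depth grid, and the variable names are all assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: the class depends on which quadrant a point is in.
rng = np.random.RandomState(1)
features = rng.rand(2000, 2)
target = (features[:, 0] > 0.5).astype(int) + 2 * (features[:, 1] > 0.5).astype(int)
flip = rng.rand(2000) < 0.2                      # relabel about 20% of rows at random
target[flip] = rng.randint(0, 4, size=flip.sum())

F_train, F_test, t_train, t_test = train_test_split(
    features, target, test_size=0.5, stratify=target, random_state=1)

# Fit one forest per candidate depth and record its held-out log loss.
scores = {}
for max_depth in (2, 6, 10, 14):
    forest = RandomForestClassifier(n_estimators=30, max_depth=max_depth,
                                    random_state=1)
    forest.fit(F_train, t_train)
    scores[max_depth] = log_loss(t_test, forest.predict_proba(F_test),
                                 labels=forest.classes_)

for depth in sorted(scores):
    print(depth)
    print(round(scores[depth], 4))
best_depth = min(scores, key=scores.get)
```

The best depth on synthetic data says nothing about the crime data; the point is only the shape of the loop, which can be repeated for `n_estimators` or for different predictor lists.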

# Telling a Story

In this modelling exploration, we find that the methods that exploit the categorical nature of the data (logistic regression, decision trees, and random forest classifiers) score best among the scikit-learn models we tried, although only the random forest beat the baseline.  Location and time of day appear to be the most telling predictors, followed by year and day of week.  These relationships were also evident in our exploration phase.  What we're seeing, at a high level, is that crime reporting (or crime report filing) in San Francisco over the past 10+ years has followed a consistent pattern: the same types of crime are being committed with relatively similar frequency, in the same general locations, and are being reported by the expected reporting entity.

It may be interesting to explore whether looking at the type of crime, at different slices of the time of year, or at the connection between reporting time and type of crime would yield more accurate predictions.

# Questions

Why is such a simplistic, hard-coded model performing better than most of these scikit-learn models?  What is it about this dataset that makes these models perform as they do?  Would it be worth tweaking the parameters of one of these models to try to improve the score, or are they just the wrong models to be using for this problem?  How can we answer these questions?