Note: The lines of code that apply the models to the test data and generate the submission files are commented out. Each of those files is larger than 500MB, and we didn't want anyone to accidentally create a bunch of large files by running all the cells.

# Imports

In [1]:
import crime
import pandas as pd
import numpy as np
from sklearn.cross_validation import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn import tree
from sklearn.ensemble import GradientBoostingClassifier

# Load Data

In [2]:
train = crime.load_cleaned_train()
test = crime.load_cleaned_test()

print train.info()
print test.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 878049 entries, 0 to 878048
Data columns (total 17 columns):
Dates             878049 non-null object
Category          878049 non-null object
Descript          878049 non-null object
DayOfWeek         878049 non-null object
PdDistrict        878049 non-null object
Resolution        878049 non-null object
Address           878049 non-null object
X                 878049 non-null float64
Y                 878049 non-null float64
Year              878049 non-null int64
Month             878049 non-null int64
Day               878049 non-null int64
Hour              878049 non-null int64
Minute            878049 non-null int64
DoW               878049 non-null int64
PdD               878049 non-null int64
CategoryNumber    878049 non-null int64
dtypes: float64(2), int64(8), object(7)
memory usage: 120.6+ MB
None
<class 'pandas.core.frame.DataFrame'>
Int64Index: 884262 entries, 0 to 884261
Data columns (total 14 columns):
Id            8842

The data is cleaned as described in `crime.py`.  In short, Year, Month, Day, Hour, and Minute columns are created, DayOfWeek, PdDistrict, and Category are encoded as integers, and invalid X and Y values are set to the median for that crime's PdDistrict.
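`crime.py` itself is not reproduced in this notebook, so the following is only a minimal sketch of the cleaning steps just described, not the actual implementation. The function name `clean`, the made-up sample rows, the bounding box used to decide which coordinates are "invalid", and the choice to encode categories in sorted-label order are all assumptions made for illustration.

```python
import pandas as pd

# Made-up sample rows that reuse the column names of the real data.
raw = pd.DataFrame({
    "Dates": ["2015-05-13 23:53:00", "2015-05-13 23:33:00", "2014-01-02 08:05:00"],
    "DayOfWeek": ["Wednesday", "Wednesday", "Thursday"],
    "PdDistrict": ["NORTHERN", "NORTHERN", "PARK"],
    "X": [-122.42, -120.5, -122.45],
    "Y": [37.77, 90.0, 37.76],
})

def clean(df):
    """Hypothetical stand-in for the cleaning done in crime.py."""
    df = df.copy()

    # Split the timestamp into separate integer columns.
    dates = pd.to_datetime(df["Dates"])
    df["Year"] = dates.dt.year
    df["Month"] = dates.dt.month
    df["Day"] = dates.dt.day
    df["Hour"] = dates.dt.hour
    df["Minute"] = dates.dt.minute

    # Encode the string columns as integers (assumed: sorted-label order).
    df["DoW"] = df["DayOfWeek"].astype("category").cat.codes
    df["PdD"] = df["PdDistrict"].astype("category").cat.codes

    # Assumption: anything outside a rough box around the city is invalid.
    valid = df["X"].between(-123.0, -122.0) & df["Y"].between(37.0, 38.0)
    for col in ("X", "Y"):
        # Median of the valid coordinates in each district.
        medians = df.loc[valid].groupby("PdDistrict")[col].median()
        df.loc[~valid, col] = df.loc[~valid, "PdDistrict"].map(medians)
    return df

cleaned = clean(raw)
print(cleaned[["X", "Y", "Year", "Month", "Day", "Hour", "Minute", "DoW", "PdD"]])
```

In this sketch the second row's out-of-range coordinates are replaced by the median of the valid NORTHERN rows.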

# Split Train Data for Cross Validation

In [3]:
predictors = ['X', 'Y', 'Year', 'Month', 'Hour', 'DoW', 'PdD']
X = train[predictors]
y = train.CategoryNumber
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, stratify=np.array(y))

The `stratify` parameter of `train_test_split` requires scikit-learn 0.17 or later. It ensures that the proportion of each category is maintained in both halves of the split. Most importantly, it means that the training set always contains at least one crime from each category. Our models can only predict categories they have seen before, so it is crucial that we train them on all possible categories.
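To make the effect of `stratify` concrete, here is a small sketch on made-up, heavily imbalanced labels. It assumes a newer scikit-learn (0.18 or later), where `train_test_split` lives in `sklearn.model_selection` rather than `sklearn.cross_validation`; the label counts and the `random_state` are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up labels: 90 of class 0, 8 of class 1, and only 2 of class 2.
y = np.array([0] * 90 + [1] * 8 + [2] * 2)
X = np.arange(len(y)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=0)

# Count how many samples of each class landed in each half.
train_counts = np.bincount(y_tr, minlength=3)
test_counts = np.bincount(y_te, minlength=3)
print(train_counts, test_counts)
```

Both halves keep the 90:8:2 ratio, so even the rarest class shows up in the training half. Without `stratify`, a random split could put both class-2 samples in the same half.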

In [8]:
def cross_validate(alg, X_train, X_test, y_train, y_test):
    predictor_sets = (
        ['X', 'Y'],
        ['DoW', 'Hour', 'Year'],
        ['X', 'Y', 'DoW', 'Hour', 'Year'],
        ['X', 'Y', 'Year', 'Month', 'Hour', 'DoW', 'PdD']
    )

    for predictors in predictor_sets:
        alg.fit(X_train[predictors], y_train)
        p = alg.predict_proba(X_test[predictors])
        print crime.logloss(y_test, p), predictors
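`crime.logloss` is defined in `crime.py` and not shown here. As a rough guide to what the printed scores mean, this is a minimal sketch of a multiclass log loss. The function name `multiclass_logloss`, the clipping constant `eps`, and the assumption that column `j` of the probability array corresponds to category number `j` are ours, not taken from `crime.py`.

```python
import numpy as np

def multiclass_logloss(y_true, proba, eps=1e-15):
    """Mean negative log of the probability given to the true class."""
    y_true = np.asarray(y_true)
    # Clip so that a predicted probability of exactly 0 does not give log(0).
    proba = np.clip(np.asarray(proba, dtype=float), eps, 1 - eps)
    # Renormalize each row so it sums to 1 again after clipping.
    proba = proba / proba.sum(axis=1, keepdims=True)
    # Pick out the probability assigned to each sample's true class.
    true_class_proba = proba[np.arange(len(y_true)), y_true]
    return -np.mean(np.log(true_class_proba))

# A uniform guess over 4 classes scores log(4), whatever the true labels are.
uniform_score = multiclass_logloss([0, 1], [[0.25] * 4, [0.25] * 4])
# A confident, correct prediction scores close to 0.
perfect_score = multiclass_logloss([0, 1], [[1.0, 0.0], [0.0, 1.0]])
print(uniform_score, perfect_score)
```

Lower is better, and a model that puts near-zero probability on the true class is penalized very heavily.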

# Baseline Model

To have something to compare against, we've created a baseline model that predicts each category's probability from the observed crime rates in that crime's police district.
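Written out, the estimate the baseline uses can be sketched as follows. The symbols are our own notation: $n_{d,c}$ is the number of training crimes of category $c$ in district $d$, and $C$ is the number of categories.

$$
\hat{P}(c \mid d) = \frac{n_{d,c}}{\sum_{c'=1}^{C} n_{d,c'}}
$$

Every crime in district $d$ therefore receives the same vector of predicted probabilities, one entry per category.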

In [4]:
class baseline(object):
    def __init__(self):
        self.has_fit = False
        
    def fit(self, X_train, y_train):
        X_train = X_train.copy()
        X_train['CategoryNumber'] = y_train
        groups = X_train.groupby(['PdD', 'CategoryNumber'])

        # Tally up the counts of each Category in each PdDistrict
        num_districts = len(X_train.PdD.unique())
        num_categories = len(y_train.unique())
        self.district_rates = np.zeros((num_districts, num_categories))
        for ind,data in groups:
            self.district_rates[ind] = len(data)

        # Normalize values
        self.district_rates /= self.district_rates.sum(axis=1, keepdims=True)

        self.has_fit = True

    def predict_proba(self, X_test):
        if self.has_fit:
            predictions = X_test.PdD.apply(lambda x: self.district_rates[x,:])
            return pd.DataFrame(predictions.tolist()).values  # to get a numpy array of the correct shape
        return None

alg = baseline()
predictors = ['PdD']
alg.fit(X_train[predictors], y_train)
p = alg.predict_proba(X_test[predictors])
print crime.logloss(y_test, p)

# crime.create_submission(alg, X, y, test, predictors, 'baseline_submission.csv')

2.61558736009


This scored a 2.61645 on the test data, very close to the cross validation score.  This isn't too surprising: the train and test data are split by alternating weeks, so our cross validation train-test split is pretty representative of the data as a whole.

# k-Nearest Neighbors Model

The first model we've chosen to try is the k-Nearest Neighbors model, partly because you can quite literally look at which crimes occurred near each other using the X and Y columns.
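To illustrate the idea, here is a minimal NumPy sketch of how a k-nearest-neighbors classifier can turn neighbor votes into class probabilities. It is not scikit-learn's implementation: the function `knn_proba` and the sample points are hypothetical, and it assumes Euclidean distance with every neighbor's vote weighted equally.

```python
import numpy as np

def knn_proba(train_xy, train_labels, query_xy, k, n_classes):
    """Fraction of the k nearest training points that fall in each class."""
    # Euclidean distance from the query point to every training point.
    dists = np.linalg.norm(train_xy - query_xy, axis=1)
    # Indices of the k closest training points.
    nearest = np.argsort(dists)[:k]
    # Count the votes per class and convert them to fractions of k.
    votes = np.bincount(train_labels[nearest], minlength=n_classes)
    return votes / float(k)

# Made-up coordinates: three crimes close together, two far away.
train_xy = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [5.0, 5.0], [5.0, 6.0]])
train_labels = np.array([0, 0, 1, 2, 2])

proba = knn_proba(train_xy, train_labels, np.array([0.2, 0.2]), k=3, n_classes=3)
print(proba)
```

Note that a category with no votes among the k neighbors gets a probability of exactly zero, which a log-loss metric penalizes heavily whenever that category turns out to be the true one.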

In [5]:
alg = KNeighborsClassifier(n_neighbors=50)
cross_validate(alg, X_train, X_test, y_train, y_test)

5.15551118816 ['X', 'Y']
5.69044667321 ['DoW', 'Hour', 'Year']
5.56268327075 ['X', 'Y', 'DoW', 'Hour', 'Year']
5.35789194314 ['X', 'Y', 'Year', 'Month', 'Hour', 'DoW', 'PdD']


It looks like this model did best when it used only the spatial data, the X and Y columns.  Even then it performs far worse than our baseline model (5.16 versus 2.62), but let's see how it does with the test data.

In [6]:
predictors = ['X', 'Y']
alg = KNeighborsClassifier(n_neighbors=50)
# crime.create_submission(alg, X, y, test, predictors, 'k-nn_submission.csv')

This scored a 5.32130 on the test data, which is a little worse than the cross validation score.

# Logistic Regression Model

Next, we decided to see how our old friend, logistic regression, would do.

In [7]:
alg = LogisticRegression()
cross_validate(alg, X_train, X_test, y_train, y_test)

2.6721756429 ['X', 'Y']
2.67226861354 ['DoW', 'Hour', 'Year']
2.67200816957 ['X', 'Y', 'DoW', 'Hour', 'Year']
2.66178121371 ['X', 'Y', 'Year', 'Month', 'Hour', 'DoW', 'PdD']


The Logistic Regression model seems to do slightly better when it's given all of the predictors, but is still not as good as our baseline model.

In [None]:
predictors = ['X', 'Y', 'Year', 'Month', 'Hour', 'DoW', 'PdD']
alg = LogisticRegression()
# crime.create_submission(alg, X, y, test, predictors, 'lr_submission.csv')

This scored a [] on the test data

# Decision Tree Model

In [9]:
alg = tree.DecisionTreeClassifier(max_depth=3, min_samples_leaf=5)
cross_validate(alg, X_train, X_test, y_train, y_test)

2.62620494929 ['X', 'Y']
2.64580637088 ['DoW', 'Hour', 'Year']
2.62780730822 ['X', 'Y', 'DoW', 'Hour', 'Year']
2.62780730822 ['X', 'Y', 'Year', 'Month', 'Hour', 'DoW', 'PdD']


The scores for X and Y alone and for the full set of predictors are very close, so let's try giving the model all the predictors.

In [None]:
predictors = ['X', 'Y', 'Year', 'Month', 'Hour', 'DoW', 'PdD']
alg = tree.DecisionTreeClassifier(max_depth=3, min_samples_leaf=5)
# crime.create_submission(alg, X, y, test, predictors, 'dt_submission.csv')

# Gradient Boosting Model

In [None]:
alg = GradientBoostingClassifier(random_state=1, n_estimators=10, max_depth=3)
cross_validate(alg, X_train, X_test, y_train, y_test)