# Imports

In [17]:
import crime
import numpy as np
from sklearn.cross_validation import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn import tree
from sklearn.ensemble import GradientBoostingClassifier

# Load Data

In [2]:
train = crime.load_cleaned_train()
test = crime.load_cleaned_test()

train.columns
print train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 878049 entries, 0 to 878048
Data columns (total 17 columns):
Dates             878049 non-null object
Category          878049 non-null object
Descript          878049 non-null object
DayOfWeek         878049 non-null object
PdDistrict        878049 non-null object
Resolution        878049 non-null object
Address           878049 non-null object
X                 878049 non-null float64
Y                 878049 non-null float64
Year              878049 non-null int64
Month             878049 non-null int64
Day               878049 non-null int64
Hour              878049 non-null int64
Minute            878049 non-null int64
DoW               878049 non-null int64
PdD               878049 non-null int64
CategoryNumber    878049 non-null int64
dtypes: float64(2), int64(8), object(7)
memory usage: 120.6+ MB
None


The data is cleaned as described in `crime.py`. In short: Year, Month, Day, Hour, and Minute columns are derived from Dates; DayOfWeek, PdDistrict, and Category are encoded as integers (DoW, PdD, and CategoryNumber); and invalid X and Y values are replaced with the median for that crime's PdDistrict.
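The code in `crime.py` is not reproduced here, so the following is only a rough sketch of what those cleaning steps might look like in pandas. The function name `clean`, the `max_valid_y` cutoff used to flag invalid coordinates, and the toy rows are all assumptions made for illustration; the real module may define "invalid" and assign integer codes differently.

```python
import numpy as np
import pandas as pd

def clean(df, max_valid_y=50.0):
    """Hypothetical sketch of the cleaning steps, not the code in crime.py."""
    out = df.copy()

    # Split the timestamp into separate numeric columns.
    dates = pd.to_datetime(out['Dates'])
    out['Year'] = dates.dt.year
    out['Month'] = dates.dt.month
    out['Day'] = dates.dt.day
    out['Hour'] = dates.dt.hour
    out['Minute'] = dates.dt.minute

    # Encode the string columns as integer codes (alphabetical order here).
    out['DoW'] = pd.factorize(out['DayOfWeek'], sort=True)[0]
    out['PdD'] = pd.factorize(out['PdDistrict'], sort=True)[0]
    out['CategoryNumber'] = pd.factorize(out['Category'], sort=True)[0]

    # Assumption: a latitude above max_valid_y marks a bad coordinate pair.
    invalid = out['Y'] > max_valid_y
    out.loc[invalid, ['X', 'Y']] = np.nan

    # Replace the blanked coordinates with the median of the same district.
    for col in ['X', 'Y']:
        district_median = out.groupby('PdDistrict')[col].transform('median')
        out[col] = out[col].fillna(district_median)
    return out

# Toy rows (made up) with one obviously bad coordinate pair in the last row.
toy = pd.DataFrame({
    'Dates': ['2015-05-13 23:53:00', '2015-05-13 22:10:00', '2014-01-02 03:04:00'],
    'Category': ['WARRANTS', 'ASSAULT', 'WARRANTS'],
    'DayOfWeek': ['Wednesday', 'Wednesday', 'Thursday'],
    'PdDistrict': ['NORTHERN', 'NORTHERN', 'NORTHERN'],
    'X': [-122.42, -122.44, -120.5],
    'Y': [37.77, 37.79, 90.0],
})
cleaned = clean(toy)
```

In this sketch the median is computed after the invalid values are blanked out, so the bad coordinates do not influence the value used to replace them.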

# Split Train Data for Cross Validation

In [10]:
predictors = ['X', 'Y', 'Year', 'Month', 'Hour', 'DoW', 'PdD']
X = train[predictors]
y = train.CategoryNumber
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, stratify=np.array(y))

The `stratify` parameter of `train_test_split` requires scikit-learn 0.17 or later. It ensures that the proportion of each category is maintained in both halves of the split. Most importantly, it guarantees that the training set contains at least one crime from each category. k-NN can only predict categories it has seen during training, so it is crucial that we train the model with all possible categories.
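To see the effect of stratification on a rare category, here is a small sketch on made-up labels (`X_toy` and `y_toy` are hypothetical, not the crime data). The import falls back to `sklearn.cross_validation` for scikit-learn 0.17, since newer releases moved `train_test_split` to `sklearn.model_selection`.

```python
import numpy as np

try:
    from sklearn.model_selection import train_test_split  # scikit-learn >= 0.18
except ImportError:
    from sklearn.cross_validation import train_test_split  # scikit-learn 0.17

# Made-up, heavily imbalanced labels: class 2 has only two samples.
y_toy = np.array([0] * 90 + [1] * 8 + [2] * 2)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

X_a, X_b, y_a, y_b = train_test_split(
    X_toy, y_toy, test_size=0.5, stratify=y_toy, random_state=0)

# Count how many samples of each class landed in each half.
counts_a = np.bincount(y_a, minlength=3)
counts_b = np.bincount(y_b, minlength=3)
print(counts_a, counts_b)
```

Note that stratification needs at least two samples per class; scikit-learn raises a `ValueError` if a class has only one member, so this guarantee only holds when no category is a singleton.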

# Train and Test Model

The model we've chosen is the k-Nearest Neighbors model. At first it appeared to have trouble with duplicate data points, but the problem was actually caused by null Y values.

In [11]:
alg = KNeighborsClassifier(n_neighbors=50, n_jobs=-1)
alg.fit(X_train, y_train)

p = alg.predict_proba(X_test)
crime.logloss(y_test, p)

5.3167679703291171

In [12]:
alg = LogisticRegression()
alg.fit(X_train,y_train)

p = alg.predict_proba(X_test)
crime.logloss(y_test,p)

2.6596192092540711

In [16]:
alg = tree.DecisionTreeClassifier(max_depth=3, min_samples_leaf=5)
alg.fit(X_train,y_train)

p = alg.predict_proba(X_test)
crime.logloss(y_test,p)

2.6276539353166206

In [21]:
alg = GradientBoostingClassifier(random_state=1, n_estimators=10, max_depth=3)
alg.fit(X_train,y_train)

p = alg.predict_proba(X_test)
crime.logloss(y_test,p)

2.6956344784492186
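The scores above come from `crime.logloss`, whose implementation is not shown in this notebook. Assuming it computes the standard multiclass log loss (lower is better), and assuming the probability columns are ordered by integer class label starting at 0, a minimal NumPy sketch could look like this. The name `multiclass_logloss` and the `eps` clipping value are illustrative choices, not taken from `crime.py`.

```python
import numpy as np

def multiclass_logloss(y_true, proba, eps=1e-15):
    """Mean negative log probability assigned to the true class (sketch)."""
    y_true = np.asarray(y_true)
    # Clip so that a predicted probability of exactly 0 does not give log(0).
    proba = np.clip(np.asarray(proba, dtype=float), eps, 1 - eps)
    # Renormalize each row so it still sums to 1 after clipping.
    proba = proba / proba.sum(axis=1, keepdims=True)
    # Pick out the probability each row assigned to its true class.
    picked = proba[np.arange(len(y_true)), y_true]
    return -np.mean(np.log(picked))

# A uniform guess over 4 classes scores log(4).
uniform_score = multiclass_logloss([0, 1, 2, 3], np.full((4, 4), 0.25))

# A confident but wrong prediction on one of two rows is penalized heavily.
overconfident_score = multiclass_logloss([0, 1], [[1.0, 0.0], [1.0, 0.0]])
```

Under this definition, a model that assigns zero probability to the true category is punished severely. That can happen with k-NN whenever none of the 50 neighbors belongs to the true category, which may partly explain why its score is so much higher than the others.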