# Imports

In [2]:
import crime
import numpy as np
from sklearn.cross_validation import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn import tree
from sklearn.ensemble import GradientBoostingClassifier

# Load Data

In [12]:
train = crime.load_cleaned_train()
test = crime.load_cleaned_test()

train.columns
# print train.info()
# print test.info()

print train.Address.unique()

at_corner = []
for address in train.Address:
    at_corner.append('/' in address)
train['CornerCrime'] = at_corner

at_corner = []
for address in test.Address:
    at_corner.append('/' in address)
test['CornerCrime'] = at_corner

['OAK ST / LAGUNA ST' 'VANNESS AV / GREENWICH ST'
 '1500 Block of LOMBARD ST' ..., '300 Block of JOHN F KENNEDY DR'
 'FOLSOM ST / ZENO PL' '1000 Block of 22ND AV']


The data is cleaned as described in `crime.py`.  In short, Year, Month, Day, Hour, and Minute columns are created, DayOfWeek, PdDistrict, and Category are encoded as integers, and invalid X and Y values are set to the median for that crime's PdDistrict.

# Split Train Data for Cross Validation

In [10]:
# predictors = ['X', 'Y', 'Year', 'Month', 'Hour', 'DoW', 'PdD']
predictors = ['X', 'Y', 'Hour','CornerCrime', 'PdD']
# predictors = ['X', 'Y']
X = train[predictors]
y = train.CategoryNumber
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, stratify=np.array(y))

The `stratify` parameter of `train_test_split` requires scikit-learn-0.17, but ensures that the proportion of categories is maintained in the split.  The biggest thing that this does is make it so that we always get at least one crime from each category in the training set.  k-NN can only predict based on what it has seen before, so it is crucial that we train the model with all possible categories.

# Train and Test Model

This is just a quick run through of various models. Those models which we submit to kaggle are noted below. To see a more comprehensive outline of models and how they worked, please see the models ipython notebook.

In [13]:
alg = KNeighborsClassifier(n_neighbors=50, n_jobs=-1)
alg.fit(X_train, y_train)

p = alg.predict_proba(X_test)
print crime.logloss(y_test, p)

#let's make a submission
crime.create_submission(alg, X, y, test, predictors, 'v1_Knn.csv')

5.21681892914


In [45]:
alg = LogisticRegression()
alg.fit(X_train,y_train)

p = alg.predict_proba(X_test)
crime.logloss(y_test,p)

2.6668366232023684

In [75]:
alg = tree.DecisionTreeClassifier(max_depth=6)
alg.fit(X_train,y_train)

p = alg.predict_proba(X_test)
crime.logloss(y_test,p)

2.5539968194182818

In [21]:
alg = GradientBoostingClassifier(random_state=1, n_estimators=10, max_depth=3)
alg.fit(X_train,y_train)

p = alg.predict_proba(X_test)
crime.logloss(y_test,p)

2.6956344784492186

In [15]:
from sklearn.naive_bayes import BernoulliNB
gnb = BernoulliNB()
y_pred = gnb.fit(X_test, y_test).predict_proba(X_train)
# print("Number of mislabeled points out of a total %d points : %d"
#        % (X_test.shape[0],(y_train != y_pred).sum()))

crime.logloss(y_test,y_pred)

2.68898640645883

In [11]:
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(n_estimators = 25, max_depth = 10)
clf.fit(X_train, y_train)

p = clf.predict_proba(X_test)
print crime.logloss(y_test,p)

#let's make a submission
crime.create_submission(clf, X, y, test, predictors, 'v1_RFC_CC.csv')

2.41934871946


In [43]:
from sklearn.linear_model import SGDClassifier
alg = SGDClassifier(loss='log', penalty='l2', n_iter=10)
alg.fit(X_train, y_train)

p = alg.predict_proba(X_test)
print crime.logloss(y_test,p)

34.1832521143


# Submission Results

The random forest classifier performed very well on our split testing data, so we put this into kaggle. With simple parameters, we were able to achieve a score on our test data of 2.57091. With tuning, we achieved a score of 2.54702, and then simply using location information a score of 2.45693 with a number of iterations of 25. With predictors of location, police district, and hour this got to 2.44754 (the split test only had a score of 2.44835, so clearly you can't completely judge the "goodness" of the model until you run it on the test data).