# Model Iteration 3
I want to implement a new method of XY coordinate cleaning here. Let's see if this works!

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import preprocessing 
%matplotlib inline




I'm going to keep using the dummy variables from iteration 2. Most of the changes to the XY coordinates happen in the cleanup function. I'm basing my changes on this script: https://www.kaggle.com/c/sf-crime/forums/t/18853/feature-engineering-of-lat-long-x-y-helps
Basically, it uses feature engineering to make the coordinates more useful to the model. sklearn.preprocessing.StandardScaler rescales X and Y to zero mean and unit variance, so their raw magnitudes can't dwarf the other features. With scaling, the XY coords are less likely to dominate the objective function.

I also created new Cartesian coordinate systems rotated by 30, 45, and 60 degrees, plus a radial distance (the r of a polar coordinate system; no angle is computed). Because the coordinates are scaled first, all of these are centered on the mean location of the data in the SF area, and they should convey more relevant spatial information.
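Here is a minimal sketch of the scaling, rotation, and radial-distance steps described above, using NumPy only. The function name `add_rotated_and_radial` and the sample points are made up for illustration. It uses exact `cos`/`sin` values where the cleanup function in this notebook uses rounded constants (0.707, 1.732/2), so the numbers will differ slightly.

```python
import numpy as np

def add_rotated_and_radial(xy, angles_deg=(30, 45, 60)):
    """Standardize XY, then return rotated copies and the radial distance."""
    xy = np.asarray(xy, dtype=float)
    # Standardize each column to zero mean and unit variance, which is
    # what StandardScaler does with its default settings.
    scaled = (xy - xy.mean(axis=0)) / xy.std(axis=0)
    x, y = scaled[:, 0], scaled[:, 1]

    features = {"X": x, "Y": y}
    for deg in angles_deg:
        theta = np.radians(deg)
        # Express each point in axes rotated counterclockwise by theta.
        features["rot%d_X" % deg] = np.cos(theta) * x + np.sin(theta) * y
        features["rot%d_Y" % deg] = np.cos(theta) * y - np.sin(theta) * x
    # Distance from the center of the scaled data (the mean location).
    features["radial_r"] = np.sqrt(x ** 2 + y ** 2)
    return features

# Made-up longitude/latitude pairs, roughly in the SF range.
sample = [[-122.42, 37.77], [-122.40, 37.79],
          [-122.45, 37.75], [-122.41, 37.80]]
feats = add_rotated_and_radial(sample)
print(sorted(feats.keys()))
```

A rotation never changes a point's distance from the origin, so every rotated pair has the same `radial_r` as the scaled X and Y. The rotated axes only add information for models that look at one coordinate at a time.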

In [2]:
def parse_date(Dates):
    """ Convert a date in YYYY-MM-DD HH:MM:SS to a tuple
        containing year, month, day, and hours each expressed
        as an integer. Used from Paul Ruvolo's example in bikeshare kaggle dataset
    """
    return int(Dates[0:4]), int(Dates[5:7]), int(Dates[8:10]), int(Dates[11:13])

def cleanupDummiesOld(data):
    data.X.replace(-120.5, data["X"].median(), inplace=True)
    data.Y.replace(90, data["Y"].median(), inplace=True)
    data["Year"] = data.Dates.apply(lambda x: parse_date(x)[0])
    data["Month"] = data.Dates.apply(lambda x: parse_date(x)[1])
    data["Hour"] = data.Dates.apply(lambda x: parse_date(x)[3])
    data = pd.concat((data, pd.get_dummies(data.DayOfWeek, prefix="dow")), axis=1)
    data = pd.concat((data, pd.get_dummies(data.PdDistrict, prefix="pds")), axis=1)
    return data

oldData = pd.read_csv('train.csv')
oldData = cleanupDummiesOld(oldData)

In [3]:
def parse_date(Dates):
    """ Convert a date in YYYY-MM-DD HH:MM:SS to a tuple
        containing year, month, day, and hours each expressed
        as an integer. Used from Paul Ruvolo's example in bikeshare kaggle dataset
    """
    return int(Dates[0:4]), int(Dates[5:7]), int(Dates[8:10]), int(Dates[11:13])

def cleanupDummies(data):
    # Replace the out-of-range coordinate values (X = -120.5, Y = 90) with the median
    data.X.replace(-120.5, data["X"].median(), inplace=True)
    data.Y.replace(90, data["Y"].median(), inplace=True)

    # Scale X and Y to zero mean and unit variance (overwrites the raw values)
    xy_scaler = preprocessing.StandardScaler()
    xy_scaler.fit(data[["X", "Y"]])
    data[["X", "Y"]] = xy_scaler.transform(data[["X", "Y"]])

    # Coordinates in axes rotated by 45, 30 and 60 degrees
    # (0.707 ~ cos 45 = sin 45, 1.732/2 ~ cos 30 = sin 60, 1/2 = sin 30 = cos 60)
    data["rot45_X"] = 0.707*data["Y"] + 0.707*data["X"]
    data["rot45_Y"] = 0.707*data["Y"] - 0.707*data["X"]
    data["rot30_X"] = (1.732/2)*data["X"] + (1./2)*data["Y"]
    data["rot30_Y"] = (1.732/2)*data["Y"] - (1./2)*data["X"]
    data["rot60_X"] = (1./2)*data["X"] + (1.732/2)*data["Y"]
    data["rot60_Y"] = (1./2)*data["Y"] - (1.732/2)*data["X"]

    # Radial distance from the center of the scaled coordinates
    data["radial_r"] = np.sqrt(np.power(data["Y"], 2) + np.power(data["X"], 2))

    # Date parts and one-hot (dummy) columns, same as iteration 2
    data["Year"] = data.Dates.apply(lambda x: parse_date(x)[0])
    data["Month"] = data.Dates.apply(lambda x: parse_date(x)[1])
    data["Hour"] = data.Dates.apply(lambda x: parse_date(x)[3])
    data = pd.concat((data, pd.get_dummies(data.DayOfWeek, prefix="dow")), axis=1)
    data = pd.concat((data, pd.get_dummies(data.PdDistrict, prefix="pds")), axis=1)
    return data

data_dummies=pd.read_csv('train.csv')
data_dummies= cleanupDummies(data_dummies)

data_dummies.columns

Index([u'Dates', u'Category', u'Descript', u'DayOfWeek', u'PdDistrict',
       u'Resolution', u'Address', u'X', u'Y', u'rot45_X', u'rot45_Y',
       u'rot30_X', u'rot30_Y', u'rot60_X', u'rot60_Y', u'radial_r', u'Year',
       u'Month', u'Hour', u'dow_Friday', u'dow_Monday', u'dow_Saturday',
       u'dow_Sunday', u'dow_Thursday', u'dow_Tuesday', u'dow_Wednesday',
       u'pds_BAYVIEW', u'pds_CENTRAL', u'pds_INGLESIDE', u'pds_MISSION',
       u'pds_NORTHERN', u'pds_PARK', u'pds_RICHMOND', u'pds_SOUTHERN',
       u'pds_TARAVAL', u'pds_TENDERLOIN'],
      dtype='object')

Let's run cross-validation on the old features and on several combinations of the new coordinate features to see how much they may have helped our score. We'll also be uploading a test csv to Kaggle. We're sticking with BernoulliNB for now because it predicts more accurately than the sample submission. Our next iteration will focus on algorithms.

In [11]:
from sklearn.naive_bayes import BernoulliNB
from sklearn import cross_validation


model = BernoulliNB()
cats = data_dummies.Category.values
origDrops = oldData.drop(["Address","Category","Dates","Descript","Resolution", "DayOfWeek", "PdDistrict"], axis=1)

dummyDataDrops = data_dummies.drop(["Address","Category","Dates","Descript","Resolution", "DayOfWeek", "PdDistrict"], axis=1)

xyCordsDrops = data_dummies.drop(["Address","Category","Dates","Descript","Resolution", "DayOfWeek", "PdDistrict", "X","Y"], axis=1)

newCordsDrops = data_dummies.drop(["Address","Category","Dates","Descript","Resolution", "DayOfWeek", "PdDistrict", "rot45_X","rot45_Y","rot30_X","rot30_Y","rot60_X","rot60_Y","radial_r"], axis=1)

polarCordsOnly = data_dummies.drop(["Address","Category","Dates","Descript","Resolution", "DayOfWeek", "PdDistrict", "rot45_X","rot45_Y","rot30_X","rot30_Y","rot60_X","rot60_Y","X", "Y"], axis=1)

model.fit(origDrops.dropna(),cats)
orig_score = cross_validation.cross_val_score(model, origDrops, data_dummies["Category"], cv=3)

model.fit(dummyDataDrops.dropna(), cats)
with_dummies_scores= cross_validation.cross_val_score(model, dummyDataDrops, data_dummies["Category"], cv=3)

model.fit(xyCordsDrops.dropna(), cats)
without_xy_scores= cross_validation.cross_val_score(model, xyCordsDrops, data_dummies["Category"], cv=3)

model.fit(newCordsDrops.dropna(), cats)
without_newcords_scores= cross_validation.cross_val_score(model, newCordsDrops, data_dummies["Category"], cv=3)

model.fit(polarCordsOnly.dropna(), cats)
polarcords_scores= cross_validation.cross_val_score(model, polarCordsOnly, data_dummies["Category"], cv=3)

print "original score:", orig_score.mean()
print "score with all additions:",with_dummies_scores.mean()
print "with only new coordinate systems:",without_xy_scores.mean()
print "scaled xy only score:",without_newcords_scores.mean()
print "with only the polar coordinates:", polarcords_scores.mean()


original score: 0.220478593273
score with all additions: 0.195751107956
with only new coordinate systems: 0.197247620754
scaled xy only score: 0.210547522574
with only the xy coordinates: 0.220478593273


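cross_val_score reports mean accuracy for a classifier by default, so higher is better. Read that way, none of the new coordinate features beat the original model (0.220): using all of the additions gives the lowest accuracy (0.196), the new coordinate systems without X and Y are about the same (0.197), and the scaled XY coordinates on their own land at 0.211. The last row ties the original score exactly, although its label in the output doesn't match the print statement in the cell, so it isn't clear which feature set produced it. I'm still submitting a test version with all the new coordinate parameters to see how it does on Kaggle.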
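Since the Kaggle scores for this competition are log losses (lower is better) and the cross-validation numbers here are accuracies (higher is better), it can help to compute both side by side. Below is a rough sketch on synthetic data, assuming a current scikit-learn where `cross_val_score` lives in `sklearn.model_selection` rather than the `sklearn.cross_validation` module used in this notebook. The feature sets and data are invented for illustration.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import BernoulliNB

rng = np.random.RandomState(0)
n = 600
# Synthetic stand-in for the crime data: three classes, six binary features
# whose probability of being 1 grows with the class label.
y = rng.randint(0, 3, size=n)
X = (rng.rand(n, 6) < (0.2 + 0.25 * y[:, None])).astype(int)

feature_sets = {
    "all six features": X,
    "first three features": X[:, :3],
}

results = {}
for name, cols in feature_sets.items():
    model = BernoulliNB()
    # Default scoring for a classifier is accuracy: higher is better.
    acc = cross_val_score(model, cols, y, cv=3).mean()
    # "neg_log_loss" is the negated log loss, so flip the sign back.
    # For the log loss itself, lower is better.
    ll = -cross_val_score(model, cols, y, cv=3, scoring="neg_log_loss").mean()
    results[name] = (acc, ll)
    print("%s: accuracy=%.3f, log loss=%.3f" % (name, acc, ll))
```

Ranking feature sets by the same metric the leaderboard uses avoids picking a combination that looks good on one scale and bad on the other.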


In [13]:
import gzip, csv
testDummies = pd.read_csv('test.csv')
testDummies = cleanupDummies(testDummies)

idx = testDummies.Id.values
cats = data_dummies.Category.values

droppedTestDummies = testDummies.drop(["Id","Address","Dates", "DayOfWeek", "PdDistrict"], axis=1)

model = BernoulliNB()
model.fit(dummyDataDrops.dropna(), cats)
predicted = model.predict_proba(droppedTestDummies)
labels =['Id']
for i in model.classes_:
    labels.append(i)
with gzip.open('newcoords.csv.gz', 'wb') as outf:
    fo =csv.writer(outf, lineterminator = '\n' )
    fo.writerow(labels)
    
    for i, pred in enumerate(predicted):
        fo.writerow([i] + list(pred))

Unfortunately, the test file generated with all the new coordinates scores significantly worse on Kaggle: a 2.9 against the current score of 2.6 (lower is better). I'm going to try the next combination from the cross-validation run, dropping the XY coordinates.

In [15]:
import gzip, csv
testDummies = pd.read_csv('test.csv')
testDummies = cleanupDummies(testDummies)

idx = testDummies.Id.values
cats = data_dummies.Category.values

droppedTestDummies = testDummies.drop(["Id","Address","Dates", "DayOfWeek", "PdDistrict", "X","Y"], axis=1)

model = BernoulliNB()
model.fit(xyCordsDrops.dropna(), cats)
predicted = model.predict_proba(droppedTestDummies)
labels =['Id']
for i in model.classes_:
    labels.append(i)
with gzip.open('newcoords.csv.gz', 'wb') as outf:
    fo =csv.writer(outf, lineterminator = '\n' )
    fo.writerow(labels)
    
    for i, pred in enumerate(predicted):
        fo.writerow([i] + list(pred))

Dropping the X and Y columns gets a 2.7 on Kaggle, which is better than before but still not better than Model 2. Next I'm going to try just the scaled XY coordinates, without the rotated or radial features.

In [17]:
import gzip, csv
testDummies = pd.read_csv('test.csv')
testDummies = cleanupDummies(testDummies)

idx = testDummies.Id.values
cats = data_dummies.Category.values

droppedTestDummies = testDummies.drop(["Id","Address","Dates", "DayOfWeek", "PdDistrict", "rot45_X","rot45_Y","rot30_X","rot30_Y","rot60_X","rot60_Y","radial_r"], axis=1)

model = BernoulliNB()
model.fit(newCordsDrops.dropna(), cats)
predicted = model.predict_proba(droppedTestDummies)
labels =['Id']
for i in model.classes_:
    labels.append(i)
with gzip.open('newcoords.csv.gz', 'wb') as outf:
    fo =csv.writer(outf, lineterminator = '\n' )
    fo.writerow(labels)
    
    for i, pred in enumerate(predicted):
        fo.writerow([i] + list(pred))

This got a 2.638, whereas the Model 2 score was 2.611. I'm trying just the polar coordinates next.

In [18]:
import gzip, csv
testDummies = pd.read_csv('test.csv')
testDummies = cleanupDummies(testDummies)

idx = testDummies.Id.values
cats = data_dummies.Category.values

droppedTestDummies = testDummies.drop(["Id", "Address", "Dates", "DayOfWeek", "PdDistrict", "rot45_X","rot45_Y","rot30_X","rot30_Y","rot60_X","rot60_Y","X", "Y"], axis=1)

model = BernoulliNB()
model.fit(polarCordsOnly.dropna(), cats)
predicted = model.predict_proba(droppedTestDummies)

labels =['Id']
for i in model.classes_:
    labels.append(i)
with gzip.open('newcoords.csv.gz', 'wb') as outf:
    fo =csv.writer(outf, lineterminator = '\n' )
    fo.writerow(labels)
    
    for i, pred in enumerate(predicted):
        fo.writerow([i] + list(pred))

It's literally the same score as Model 2. I'm wondering if there's a different algorithm I can use that will actually take advantage of the new coordinate features.
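One possible explanation for the identical score, which I haven't verified against this data: BernoulliNB thresholds every feature at its `binarize` parameter (0.0 by default) before fitting, so each continuous column is reduced to "above zero or not". The sketch below mimics that step with NumPy on made-up values. The helper name `binarize_like_bernoulli_nb` is hypothetical.

```python
import numpy as np

def binarize_like_bernoulli_nb(values, threshold=0.0):
    """Mimic BernoulliNB's default preprocessing: 1 where value > threshold."""
    return (np.asarray(values, dtype=float) > threshold).astype(int)

# Made-up values in the ranges these features take.
raw_x = np.array([-122.42, -122.40, -122.45])  # raw longitude: always negative
raw_y = np.array([37.77, 37.79, 37.75])        # raw latitude: always positive
scaled_x = np.array([-0.3, 1.2, -0.9])         # standardized: mixed signs
radial_r = np.array([0.4, 1.5, 1.1])           # a distance: never negative

for name, col in [("raw X", raw_x), ("raw Y", raw_y),
                  ("scaled X", scaled_x), ("radial_r", radial_r)]:
    print(name, binarize_like_bernoulli_nb(col))
```

Under this assumption, raw X, raw Y, and radial_r each collapse to a single constant value, so they give the model almost nothing to work with, while the scaled and rotated coordinates only contribute a sign. A model that handles continuous inputs should be able to use more of the coordinate information.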

The forum entry I found on cleaning coordinates used random forests, so let's implement that next instead of BernoulliNB.

In [8]:
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn import cross_validation 
from sklearn.cross_validation import KFold

dummyDataDrops = data_dummies.drop(["Address","Category","Dates","Descript","Resolution", "DayOfWeek", "PdDistrict"], axis=1)

alg = RandomForestClassifier(random_state=1, n_estimators=10, max_depth =18)

scores = cross_validation.cross_val_score(alg, dummyDataDrops, data_dummies["Category"], cv=6)
print scores

[ 0.21487575  0.20834985  0.20009293  0.21128195  0.16919176  0.09571713]


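The first four folds (0.20 to 0.21) look about as promising as the BernoulliNB models, though the last two folds drop to 0.17 and 0.10, so the score varies a lot from fold to fold. I'll generate a test file with this. Hopefully the random forest is less overfit to the training data than the previous attempts with BernoulliNB and the customized location features.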
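The spread across folds might come from how the rows are ordered in the file, though that's a guess. One way to check is to compare unshuffled and shuffled stratified folds. Here is a minimal sketch on synthetic data, assuming a current scikit-learn (`sklearn.model_selection`). The data and the comparison are illustrative only and say nothing about what the crime data will show.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.RandomState(1)
n = 600
# Synthetic stand-in: four continuous features, label depends on the first two.
X = rng.rand(n, 4)
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)

alg = RandomForestClassifier(random_state=1, n_estimators=10, max_depth=18)

# Folds taken in file order versus folds drawn after shuffling the rows.
unshuffled = StratifiedKFold(n_splits=6)
shuffled = StratifiedKFold(n_splits=6, shuffle=True, random_state=1)

scores_unshuffled = cross_val_score(alg, X, y, cv=unshuffled)
scores_shuffled = cross_val_score(alg, X, y, cv=shuffled)

print("unshuffled folds:", scores_unshuffled)
print("shuffled folds:  ", scores_shuffled)
```

If the shuffled folds turn out much more even than the unshuffled ones on the real data, the low folds are probably an ordering effect rather than a property of the model.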