# Model Iteration 3
I want to implement a new method of XY coordinate cleaning here. Let's see if this works!

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import preprocessing 
%matplotlib inline




I'm going to keep using the dummy variables from iteration 2. Most of the changes to the XY coordinates happen in the cleanup function. I'm basing my changes on this script: https://www.kaggle.com/c/sf-crime/forums/t/18853/feature-engineering-of-lat-long-x-y-helps
Basically, it uses feature engineering to make the coordinates more useful to the model. sklearn.preprocessing.StandardScaler rescales the X and Y coordinates to zero mean and unit variance so that their magnitude and variance can't dwarf the other features. With scaling, the XY coordinates are less likely to dominate the objective function.
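
To make the scaling step concrete, here is a minimal sketch of what standardization does to raw longitude/latitude values. It uses plain NumPy instead of StandardScaler so the arithmetic is visible, and the sample coordinates are made up for illustration; they are not rows from train.csv.

```python
import numpy as np

# Hypothetical longitude (X) / latitude (Y) pairs near San Francisco.
coords = np.array([
    [-122.42, 37.77],
    [-122.41, 37.78],
    [-122.45, 37.74],
    [-122.39, 37.80],
])

# Standardize each column: subtract its mean, divide by its standard deviation.
# np.std defaults to the population standard deviation (ddof=0),
# which is also what StandardScaler uses.
mean = coords.mean(axis=0)
std = coords.std(axis=0)
scaled = (coords - mean) / std

print(scaled.mean(axis=0))  # both columns are now centered at (approximately) 0
print(scaled.std(axis=0))   # both columns now have unit spread
```

After this step a value of 0 means "at the average location" and a value of 1 means "one standard deviation east (or north) of it", regardless of the original units.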

I also created new Cartesian coordinate systems rotated by 30, 45, and 60 degrees, plus the radial distance from a polar coordinate system. Because X and Y are standardized first, all of these are centered on the SF area, and they are meant to convey more relevant spatial information.
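
The rotated and polar features can be sketched with a general helper, shown here on a few made-up points. The function names `rotate` and `to_polar` are hypothetical, and the angle returned by `to_polar` is an addition for illustration: the cleanup function in this notebook only computes the radius.

```python
import numpy as np

def rotate(x, y, degrees):
    """Express the point (x, y) in axes rotated counterclockwise by `degrees`."""
    theta = np.radians(degrees)
    x_rot = np.cos(theta) * x + np.sin(theta) * y
    y_rot = np.cos(theta) * y - np.sin(theta) * x
    return x_rot, y_rot

def to_polar(x, y):
    """Return the radius and the angle (in radians) of each point."""
    r = np.sqrt(x ** 2 + y ** 2)
    angle = np.arctan2(y, x)
    return r, angle

# Made-up standardized coordinates.
x = np.array([1.0, 0.0, -1.0])
y = np.array([0.0, 1.0, 1.0])

rot45_x, rot45_y = rotate(x, y, 45)
r, angle = to_polar(x, y)
print(rot45_x, rot45_y)
print(r, angle)
```

With `degrees=45` the coefficients are about 0.707, and with 30 or 60 they are about 0.866 (1.732/2) and 0.5, which matches the constants hard-coded in the cleanup function. A rotation never changes the radius, so the rotated columns and the radius describe the same points from different viewpoints.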

In [4]:
def parse_date(Dates):
    """ Convert a date in YYYY-MM-DD HH:MM:SS to a tuple
        containing year, month, day, and hours each expressed
        as an integer. Used from Paul Ruvolo's example in bikeshare kaggle dataset
    """
    return int(Dates[0:4]), int(Dates[5:7]), int(Dates[8:10]), int(Dates[11:13])

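def cleanupDummies(data):
    # Replace the outlier coordinates (X = -120.5, Y = 90) with the column medians.
    data["X"] = data["X"].replace(-120.5, data["X"].median())
    data["Y"] = data["Y"].replace(90, data["Y"].median())

    # Standardize X and Y to zero mean and unit variance.
    xy_scaler = preprocessing.StandardScaler()
    xy_scaler.fit(data[["X", "Y"]])
    data[["X", "Y"]] = xy_scaler.transform(data[["X", "Y"]])

    # Coordinates in axes rotated by 45, 30, and 60 degrees.
    # 0.707 ~ cos(45) = sin(45); 1.732/2 ~ cos(30) = sin(60); 1/2 = sin(30) = cos(60).
    data["rot45_X"] = 0.707 * data["Y"] + 0.707 * data["X"]
    data["rot45_Y"] = 0.707 * data["Y"] - 0.707 * data["X"]
    data["rot30_X"] = (1.732 / 2) * data["X"] + (1. / 2) * data["Y"]
    data["rot30_Y"] = (1.732 / 2) * data["Y"] - (1. / 2) * data["X"]
    data["rot60_X"] = (1. / 2) * data["X"] + (1.732 / 2) * data["Y"]
    data["rot60_Y"] = (1. / 2) * data["Y"] - (1.732 / 2) * data["X"]

    # Radial distance from the center of the standardized coordinates.
    data["radial_r"] = np.sqrt(np.power(data["Y"], 2) + np.power(data["X"], 2))

    # Date features pulled out of the timestamp string.
    data["Year"] = data.Dates.apply(lambda x: parse_date(x)[0])
    data["Month"] = data.Dates.apply(lambda x: parse_date(x)[1])
    data["Hour"] = data.Dates.apply(lambda x: parse_date(x)[3])

    # One-hot encode the day of week and the police district.
    data = pd.concat((data, pd.get_dummies(data.DayOfWeek, prefix="dow")), axis=1)
    data = pd.concat((data, pd.get_dummies(data.PdDistrict, prefix="pds")), axis=1)
    return data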

data_dummies=pd.read_csv('train.csv')
data_dummies= cleanupDummies(data_dummies)

data_dummies.columns

Index([u'Dates', u'Category', u'Descript', u'DayOfWeek', u'PdDistrict',
       u'Resolution', u'Address', u'X', u'Y', u'rot45_X', u'rot45_Y',
       u'rot30_X', u'rot30_Y', u'rot60_X', u'rot60_Y', u'radial_r', u'Year',
       u'Month', u'Hour', u'dow_Friday', u'dow_Monday', u'dow_Saturday',
       u'dow_Sunday', u'dow_Thursday', u'dow_Tuesday', u'dow_Wednesday',
       u'pds_BAYVIEW', u'pds_CENTRAL', u'pds_INGLESIDE', u'pds_MISSION',
       u'pds_NORTHERN', u'pds_PARK', u'pds_RICHMOND', u'pds_SOUTHERN',
       u'pds_TARAVAL', u'pds_TENDERLOIN'],
      dtype='object')
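
As a side note, the Year, Month, and Hour columns listed above can also be extracted without calling a parser once per row. The sketch below uses pandas' datetime accessor on two made-up timestamps in the same `YYYY-MM-DD HH:MM:SS` format; it is an alternative for illustration, not what this notebook ran.

```python
import pandas as pd

# Hypothetical rows in the same format as the Dates column.
sample = pd.DataFrame({"Dates": ["2015-05-13 23:53:00", "2003-01-06 00:01:00"]})

# Parse the whole column once, then read the fields off the .dt accessor.
parsed = pd.to_datetime(sample["Dates"], format="%Y-%m-%d %H:%M:%S")
sample["Year"] = parsed.dt.year
sample["Month"] = parsed.dt.month
sample["Hour"] = parsed.dt.hour

print(sample)
```

On a large training file this tends to be noticeably faster than three separate `apply` passes, since the string is parsed only once per row.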

Let's run cross-validation trials on several feature sets (everything, only the new coordinate systems, only the scaled XY, and only the polar radius) to see how much each one helps our score. We'll also upload test predictions to Kaggle. We're sticking with BernoulliNB for now because it scores better than the sample submission. Our next iteration will focus on algorithms.
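
One way to organize such a comparison is a loop over named column lists. The sketch below is an assumption-laden illustration, not the code used in this notebook: it runs on synthetic data, the helper name `compare_feature_sets` is hypothetical, and it imports `cross_val_score` from `sklearn.model_selection`, which replaced the older `sklearn.cross_validation` module in newer scikit-learn releases.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import BernoulliNB

def compare_feature_sets(frame, target, feature_sets, cv=3):
    """Return the mean cross-validated accuracy for each named list of columns."""
    results = {}
    for name, columns in feature_sets.items():
        scores = cross_val_score(BernoulliNB(), frame[columns], target, cv=cv)
        results[name] = scores.mean()
    return results

# Synthetic stand-in for the training frame (not the real data).
rng = np.random.RandomState(0)
toy = pd.DataFrame({"X": rng.randn(300), "Y": rng.randn(300)})
toy["radial_r"] = np.sqrt(toy["X"] ** 2 + toy["Y"] ** 2)
toy_target = np.where(toy["X"] > 0, "east", "west")

results = compare_feature_sets(
    toy,
    toy_target,
    {"scaled_xy": ["X", "Y"], "radius_only": ["radial_r"]},
)
for name, score in sorted(results.items()):
    print(name, round(score, 3))
```

BernoulliNB binarizes every feature at its `binarize` threshold, which defaults to 0.0. A radius is never negative, so in this toy example `radial_r` becomes a constant 1 and carries no information; that may be worth keeping in mind when reading scores for continuous features with this model.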

In [9]:
from sklearn.naive_bayes import BernoulliNB
from sklearn import cross_validation


model = BernoulliNB()
cats = data_dummies.Category.values
dummyDataDrops = data_dummies.drop(["Address","Category","Dates","Descript","Resolution", "DayOfWeek", "PdDistrict"], axis=1)

xyCordsDrops = data_dummies.drop(["Address","Category","Dates","Descript","Resolution", "DayOfWeek", "PdDistrict", "X","Y"], axis=1)

newCordsDrops = data_dummies.drop(["Address","Category","Dates","Descript","Resolution", "DayOfWeek", "PdDistrict", "rot45_X","rot45_Y","rot30_X","rot30_Y","rot60_X","rot60_Y","radial_r"], axis=1)

polarCordsOnly = data_dummies.drop(["Address","Category","Dates","Descript","Resolution", "DayOfWeek", "PdDistrict", "rot45_X","rot45_Y","rot30_X","rot30_Y","rot60_X","rot60_Y","X", "Y"], axis=1)

model.fit(dummyDataDrops.dropna(), cats)
with_dummies_scores= cross_validation.cross_val_score(model, dummyDataDrops, data_dummies["Category"], cv=3)

model.fit(xyCordsDrops.dropna(), cats)
without_xy_scores= cross_validation.cross_val_score(model, xyCordsDrops, data_dummies["Category"], cv=3)

model.fit(newCordsDrops.dropna(), cats)
without_newcords_scores= cross_validation.cross_val_score(model, newCordsDrops, data_dummies["Category"], cv=3)

model.fit(polarCordsOnly.dropna(), cats)
polarcords_scores= cross_validation.cross_val_score(model, polarCordsOnly, data_dummies["Category"], cv=3)

print "score with all additions:",with_dummies_scores.mean()
print "with only new coordinate systems:",without_xy_scores.mean()
print "scaled xy only score:",without_newcords_scores.mean()
print "with only the polar radius:", polarcords_scores.mean()


score with all additions: 0.195751107956
with only new coordinate systems: 0.197247620754
scaled xy only score: 0.210547522574
with only the polar radius: 0.220478593273


These scores look promising! Again, they could be a little inflated by overfitting, but I'm interested to see what Kaggle score the model with all the bells and whistles (cross-validation score 0.19575110795636985) yields.


In [8]:
import gzip, csv
testDummies = pd.read_csv('test.csv')
testDummies = cleanupDummies(testDummies)

idx = testDummies.Id.values
cats = data_dummies.Category.values

droppedTestDummies = testDummies.drop(["Id","Address","Dates", "DayOfWeek", "PdDistrict"], axis=1)

model = BernoulliNB()
model.fit(dummyDataDrops.dropna(), cats)
predicted = model.predict_proba(droppedTestDummies)
labels =['Id']
for i in model.classes_:
    labels.append(i)
with gzip.open('bernoulinb.csv.gz', 'wb') as outf:
    fo =csv.writer(outf, lineterminator = '\n' )
    fo.writerow(labels)
    
    # Write each row under its actual test Id rather than its positional index.
    for row_id, pred in zip(idx, predicted):
        fo.writerow([row_id] + list(pred))
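
The cell above is written for Python 2, where `csv.writer` accepts a binary file handle. On Python 3 the csv module expects text, so a sketch of the same submission step might look like the following. The helper name `write_submission`, the demo path, and the two-class example values are all hypothetical.

```python
import csv
import gzip
import os
import tempfile

def write_submission(path, ids, class_labels, probabilities):
    """Write one row per Id: the Id followed by one probability per class."""
    # Text mode ("wt") with newline="" is what the csv module expects on Python 3.
    with gzip.open(path, "wt", newline="") as outf:
        writer = csv.writer(outf, lineterminator="\n")
        writer.writerow(["Id"] + list(class_labels))
        for row_id, probs in zip(ids, probabilities):
            writer.writerow([row_id] + list(probs))

# Tiny made-up example; a real submission has one column per crime category.
demo_path = os.path.join(tempfile.mkdtemp(), "demo_submission.csv.gz")
write_submission(demo_path, [0, 1], ["ARSON", "ASSAULT"], [[0.25, 0.75], [0.5, 0.5]])

# Read the file back to check the layout.
with gzip.open(demo_path, "rt", newline="") as inf:
    rows = list(csv.reader(inf))
print(rows)
```

Passing the ids explicitly keeps the file correct even if the test rows were filtered or reordered before prediction.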

This is an improvement! We are now scoring 2.61102 on Kaggle and are ranked 673rd.
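
For context on why the Kaggle score is on such a different scale from the cross-validation numbers: as far as I know, this competition is scored with multi-class log loss (lower is better), while `cross_val_score` reports accuracy by default. The sketch below is my own minimal version of that metric, not Kaggle's code, and the figure of 39 categories is an assumption about the training set.

```python
import numpy as np

def multiclass_log_loss(true_index, probabilities, eps=1e-15):
    """Mean negative log of the probability assigned to each row's true class."""
    probs = np.clip(np.asarray(probabilities, dtype=float), eps, 1 - eps)
    probs = probs / probs.sum(axis=1, keepdims=True)  # renormalize each row
    picked = probs[np.arange(len(true_index)), true_index]
    return -np.mean(np.log(picked))

# Assumed number of crime categories.
n_classes = 39

# A predictor that spreads its probability evenly over every class.
uniform = np.full((5, n_classes), 1.0 / n_classes)
labels = np.array([0, 3, 7, 7, 38])

baseline = multiclass_log_loss(labels, uniform)
print(baseline)  # equals ln(n_classes) for uniform predictions
```

Under these assumptions, a completely uninformed uniform prediction scores ln(39), so any score below that reflects probabilities that are better than guessing evenly.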