# Model Iteration 2 
Next, we're implementing a cleanup function and playing with dummy variables instead of our catgory replacement

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline




First things first, we just put together all our cleaning code into one function. This is mostly for better readability and repeatability There's no difference in methods yet. We're also calling the cleanup on our data here.

In [8]:
def parse_date(Dates):
    """ Convert a date in YYYY-MM-DD HH:MM:SS to a tuple
        containing year, month, day, and hours each expressed
        as an integer. Used from Paul Ruvolo's example in bikeshare kaggle dataset
    """
    return int(Dates[0:4]), int(Dates[5:7]), int(Dates[8:10]), int(Dates[11:13])

def cleanup(data):
    dow = {
        "Monday" : 0,
        "Tuesday" : 1,
        "Wednesday" : 2,
        "Thursday" : 3,
        "Friday" : 4,
        "Saturday" : 5,
        "Sunday" : 6
    }
    data["DOW"] = data.DayOfWeek.map(dow)
    pds = {
        "SOUTHERN" : 0,
        "MISSION" : 1,
        "NORTHERN" : 2,
        "BAYVIEW" : 3,
        "CENTRAL" : 4,
        "TERNDERLOIN" : 5,
        "INGLESIDE" : 6,
        "TARAVAL" : 7,
        "PARK" : 8,
        "RICHMOND" : 9
    }
    data["pds"] = data.PdDistrict.map(pds)
    # for crimes without PD, use "Other" : 10
    data["pds"] = data["pds"].fillna(10)
    data.X.replace(-120.5, data["X"].median(), inplace = True)
    data.Y.replace(90, data["Y"].median(), inplace = True)
    data["Year"] = data.Dates.apply(lambda x: parse_date(x)[0])
    data["Month"] = data.Dates.apply(lambda x: parse_date(x)[1])
    data["Hour"] = data.Dates.apply(lambda x: parse_date(x)[3])
    return data
    

data = pd.read_csv('train.csv')
data = cleanup(data)
                                


Dummies we are going to attempt to use dummy variables to define new columns for more of our data. Using dummies instead of reassignment could increase precision (maybe?).

In [6]:
def cleanupDummies(data):
    data.X.replace(-120.5, data["X"].median(), inplace = True)
    data.Y.replace(90, data["Y"].median(), inplace = True)
    data["Year"] = data.Dates.apply(lambda x: parse_date(x)[0])
    data["Month"] = data.Dates.apply(lambda x: parse_date(x)[1])
    data["Hour"] = data.Dates.apply(lambda x: parse_date(x)[3])
    data =pd.concat((data, pd.get_dummies(data.DayOfWeek, prefix="dow")), axis=1)
    data = pd.concat((data, pd.get_dummies(data.PdDistrict, prefix="pds")), axis=1)
    return data

data_dummies=pd.read_csv('train.csv')
data_dummies= cleanupDummies(data_dummies)

data_dummies.columns

Index([u'Dates', u'Category', u'Descript', u'DayOfWeek', u'PdDistrict',
       u'Resolution', u'Address', u'X', u'Y', u'Year', u'Month', u'Hour',
       u'dow_Friday', u'dow_Monday', u'dow_Saturday', u'dow_Sunday',
       u'dow_Thursday', u'dow_Tuesday', u'dow_Wednesday', u'pds_BAYVIEW',
       u'pds_CENTRAL', u'pds_INGLESIDE', u'pds_MISSION', u'pds_NORTHERN',
       u'pds_PARK', u'pds_RICHMOND', u'pds_SOUTHERN', u'pds_TARAVAL',
       u'pds_TENDERLOIN'],
      dtype='object')

Let's try these cross validation trials with dummies and without dummies to look at how much it may have helped our score. We'll also be uploading a test csv to kaggle

In [10]:
from sklearn.naive_bayes import BernoulliNB, GaussianNB
from sklearn import cross_validation


model = GaussianNB()
cats = data.Category.values
dataDrops = data.drop(["Address","Category","Dates","Descript","Resolution", "DayOfWeek", "PdDistrict"], axis=1)
dummyDataDrops = data_dummies.drop(["Address","Category","Dates","Descript","Resolution", "DayOfWeek", "PdDistrict"], axis=1)

model.fit(dataDrops.dropna(), cats)
without_dummies_scores = cross_validation.cross_val_score(model, dataDrops, data["Category"], cv = 3)

model.fit(dummyDataDrops.dropna(), cats)
with_dummies_scores= cross_validation.cross_val_score(model, dummyDataDrops, data["Category"], cv=3)

with_dummies_scores.mean() - without_dummies_scores.mean()


-0.1369992809902677

So it looks like using dummies drastically lowers our score, which means this could be a really good choice! We're going to go ahead and generate a test submission file for kaggle. Hopefully this isn't grossly overfit to the training data.

In [13]:
import gzip, csv
testDummies = pd.read_csv('test.csv')
testDummies = cleanupDummies(testDummies)

idx = testDummies.Id.values
cats = data.Category.values

droppedTestDummies = testDummies.drop(["Id","Address","Dates", "DayOfWeek", "PdDistrict"], axis=1)

model = GaussianNB()
model.fit(dummyDataDrops.dropna(), cats)
predicted = model.predict_proba(droppedTestDummies)
labels =['Id']
for i in model.classes_:
    labels.append(i)
with gzip.open('bernoulinb.csv.gz', 'wb') as outf:
    fo =csv.writer(outf, lineterminator = '\n' )
    fo.writerow(labels)
    
    for i, pred in enumerate(predicted):
        fo.writerow([i] + list(pred))

This is not an improvement after submitting on kaggle! I'm going to keep experimenting with strat