## Model iteration 1 (Sophia and Mac-I)
This is an early model iteration that uses only the data we already have and that can be recoded easily. The notebook is mostly about getting a working modeling pipeline in place rather than getting a good score, so we do very little test/train splitting here.

In [1]:
%matplotlib inline
import shapefile
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
from matplotlib.patches import Polygon
from matplotlib.collections import PatchCollection
from matplotlib import cm
from datetime import datetime
from ipywidgets import widgets  
from IPython.display import display

from sklearn.ensemble import RandomForestClassifier
from sklearn import cross_validation
from sklearn import metrics

import seaborn as sns



In [2]:
df_train = pd.read_csv('./../train.csv')
df_train = df_train.dropna()
df_test = pd.read_csv('./../test.csv')

df_train.head(5)

| | Dates | Category | Descript | DayOfWeek | PdDistrict | Resolution | Address | X | Y |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2015-05-13 23:53:00 | WARRANTS | WARRANT ARREST | Wednesday | NORTHERN | ARREST, BOOKED | OAK ST / LAGUNA ST | -122.425892 | 37.774599 |
| 1 | 2015-05-13 23:53:00 | OTHER OFFENSES | TRAFFIC VIOLATION ARREST | Wednesday | NORTHERN | ARREST, BOOKED | OAK ST / LAGUNA ST | -122.425892 | 37.774599 |
| 2 | 2015-05-13 23:33:00 | OTHER OFFENSES | TRAFFIC VIOLATION ARREST | Wednesday | NORTHERN | ARREST, BOOKED | VANNESS AV / GREENWICH ST | -122.424363 | 37.800414 |
| 3 | 2015-05-13 23:30:00 | LARCENY/THEFT | GRAND THEFT FROM LOCKED AUTO | Wednesday | NORTHERN | NONE | 1500 Block of LOMBARD ST | -122.426995 | 37.800873 |
| 4 | 2015-05-13 23:30:00 | LARCENY/THEFT | GRAND THEFT FROM LOCKED AUTO | Wednesday | PARK | NONE | 100 Block of BRODERICK ST | -122.438738 | 37.771541 |


Here, we convert the `Dates` column, which is stored as a string, into a usable datetime and pull out the year, month, day, and hour.

In [3]:
def preprocessData(df_raw):
    '''Parse the Dates column and add Year, Month, Day, and Hour columns.

    Note: this modifies df_raw in place and also returns it.
    '''
    # Dates is stored as a string such as '2015-05-13 23:53:00'
    df_raw['DateTime'] = df_raw['Dates'].apply(
        lambda x: datetime.strptime(x, '%Y-%m-%d %H:%M:%S'))

    # Pull out the individual date parts to use as model features
    df_raw['Year'] = df_raw['DateTime'].apply(lambda x: x.year)
    df_raw['Month'] = df_raw['DateTime'].apply(lambda x: x.month)
    df_raw['Day'] = df_raw['DateTime'].apply(lambda x: x.day)
    df_raw['Hour'] = df_raw['DateTime'].apply(lambda x: x.hour)

    return df_raw
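
The same features can also be derived without a row-by-row `apply`. The sketch below is one possible alternative, assuming a pandas version that provides `pd.to_datetime` and the `.dt` accessor. The function name `add_date_parts` and the two-row sample frame are made up for illustration and are not part of the original notebook.

```python
import pandas as pd


def add_date_parts(df_raw):
    """Return a copy of df_raw with DateTime, Year, Month, Day, and Hour columns.

    Hypothetical helper for illustration; unlike preprocessData it does not
    modify its input.
    """
    df = df_raw.copy()
    # Parse the whole column at once instead of calling strptime per row
    df['DateTime'] = pd.to_datetime(df['Dates'], format='%Y-%m-%d %H:%M:%S')
    df['Year'] = df['DateTime'].dt.year
    df['Month'] = df['DateTime'].dt.month
    df['Day'] = df['DateTime'].dt.day
    df['Hour'] = df['DateTime'].dt.hour
    return df


# Tiny made-up sample in the same format as the Dates column
sample = pd.DataFrame({'Dates': ['2015-05-13 23:53:00', '2003-01-06 00:01:00']})
sample_parts = add_date_parts(sample)
```

Working on a copy keeps the raw frame untouched, which can make it easier to rerun cells in any order.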

Preprocess the data

In [4]:
df_trainPros = preprocessData(df_train)
df_testPros = preprocessData(df_test)

Train the model on all of the training data

In [5]:
X_tr = df_trainPros.drop("Category", axis=1)
y_tr = df_trainPros["Category"]

factors = ['Year','Month','Day','Hour', 'X', 'Y']

X_tr = X_tr[factors]

alg = RandomForestClassifier(n_estimators=10)
alg.fit(X_tr, y_tr)
print "done"

done


Get cross-validated predictions (3 folds)

In [6]:
predicted = cross_validation.cross_val_predict(alg, X_tr, y_tr, cv=3)
print "done"

done


Get the accuracy of the cross-validated predictions. Since `cross_val_predict` predicts each row using a model fit on the other folds, this is an out-of-fold estimate rather than accuracy on data the model has already seen. We still haven't set aside a separate test set, so treat it as a rough number.

In [7]:
print metrics.accuracy_score(y_tr, predicted)

0.0588611797291
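
A held-out split of the kind mentioned above might look like the following sketch. It assumes a recent scikit-learn, where the splitting helpers live in `sklearn.model_selection` (this notebook imports the older `sklearn.cross_validation` module). The data is random and purely illustrative, and all `demo`-style names are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 300 rows, 4 numeric features, 3 classes
rng = np.random.RandomState(0)
X_demo = rng.rand(300, 4)
y_demo = rng.randint(0, 3, size=300)

# Hold out 25% of the rows; the model never sees them during fitting
X_fit, X_held, y_fit, y_held = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0)

demo_alg = RandomForestClassifier(n_estimators=10, random_state=0)
demo_alg.fit(X_fit, y_fit)

# Accuracy on the rows used for fitting is optimistic;
# accuracy on the held-out rows is the more honest number
fit_accuracy = accuracy_score(y_fit, demo_alg.predict(X_fit))
held_accuracy = accuracy_score(y_held, demo_alg.predict(X_held))
```

Because the labels here are random, the gap between `fit_accuracy` and `held_accuracy` comes entirely from the forest memorizing the rows it was fit on.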


Generate a submission file with one column of predicted probabilities per category

In [8]:
y_test = pd.DataFrame(alg.predict_proba(df_testPros[factors]), index=df_testPros.Id, columns=alg.classes_)
y_test.to_csv("results.csv")
print "saved"

saved
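
To make the shape of that file concrete, here is a small sketch of the same construction on made-up numbers. The ids, category names, and probabilities are invented for illustration, and whether this layout matches exactly what Kaggle expects is an assumption.

```python
import pandas as pd

# Made-up probabilities for two test rows and two categories
demo_probs = [[0.9, 0.1],
              [0.25, 0.75]]
demo_ids = pd.Index([0, 1], name='Id')
demo_classes = ['ASSAULT', 'LARCENY/THEFT']

# One row per test Id, one probability column per category
demo_submission = pd.DataFrame(demo_probs, index=demo_ids, columns=demo_classes)

# With no path argument, to_csv returns the CSV text instead of writing a file
demo_csv = demo_submission.to_csv()
```

Each row holds the probabilities for one `Id`, and the header row is the `Id` label followed by the category names.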


Here, we've found a metric called `log_loss`, which appears to be the scoring metric that Kaggle uses. From now on, we'll use it instead of the accuracy score. Note that the value below is computed on the same rows the model was fit on, so it is optimistic.

In [9]:
predicted = alg.predict_proba(X_tr)
print metrics.log_loss(y_tr, predicted, eps=1e-15)

0.518627789908
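
For reference, multiclass log loss averages the negative log of the probability assigned to each row's true class. The sketch below computes it by hand in plain Python, assuming probabilities are clipped away from 0 and 1 by a small `eps`. The function name `manual_log_loss` and the toy numbers are illustrative, and this is not a claim about how scikit-learn or Kaggle implement the metric internally.

```python
import math


def manual_log_loss(true_labels, probabilities, class_order, eps=1e-15):
    """Average negative log-probability assigned to the true class of each row."""
    total = 0.0
    for label, row in zip(true_labels, probabilities):
        # Look up the probability the model gave to the true class
        p = row[class_order.index(label)]
        # Clip so that log(0) can never occur
        p = min(max(p, eps), 1 - eps)
        total -= math.log(p)
    return total / len(true_labels)


toy_classes = ['ASSAULT', 'LARCENY/THEFT', 'WARRANTS']
toy_labels = ['ASSAULT', 'WARRANTS']
toy_probs = [[0.7, 0.2, 0.1],   # true class ASSAULT gets 0.7
             [0.1, 0.1, 0.8]]   # true class WARRANTS gets 0.8

toy_loss = manual_log_loss(toy_labels, toy_probs, toy_classes)
```

A confident wrong prediction is penalized heavily under this metric, which is why the submission uses predicted probabilities rather than hard labels.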


In this notebook, the actual scores don't really tell us much about the model, but we have started to get a feel for the data.