# Table of Contents
 <p><div class="lev1"><a href="#Data-Preparation"><span class="toc-item-num">1&nbsp;&nbsp;</span>Data Preparation</a></div><div class="lev1"><a href="#Random-Forest"><span class="toc-item-num">2&nbsp;&nbsp;</span>Random Forest</a></div><div class="lev1"><a href="#Encode-Labels"><span class="toc-item-num">3&nbsp;&nbsp;</span>Encode Labels</a></div><div class="lev1"><a href="#Features-selection"><span class="toc-item-num">4&nbsp;&nbsp;</span>Features selection</a></div><div class="lev1"><a href="#Initialization"><span class="toc-item-num">5&nbsp;&nbsp;</span>Initialization</a></div><div class="lev1"><a href="#Fitting-the-model"><span class="toc-item-num">6&nbsp;&nbsp;</span>Fitting the model</a></div><div class="lev1"><a href="#Predict-on-test-data"><span class="toc-item-num">7&nbsp;&nbsp;</span>Predict on test data</a></div><div class="lev1"><a href="#Prepare-submission"><span class="toc-item-num">8&nbsp;&nbsp;</span>Prepare submission</a></div><div class="lev1"><a href="#Submission"><span class="toc-item-num">9&nbsp;&nbsp;</span>Submission</a></div>

In [39]:
import pandas as pd
import numpy as np
from sklearn import ensemble
from sklearn import preprocessing
from sklearn.cross_validation import train_test_split

## Data Preparation

In [146]:
train_csv = r'data/train.csv'
%time train = pd.read_csv(train_csv, parse_dates = ['Dates'])
train.drop('Resolution', axis=1, inplace=True)

test_csv = r'data/test.csv'
%time test = pd.read_csv(test_csv, parse_dates = ['Dates'])

CPU times: user 3.02 s, sys: 408 ms, total: 3.43 s
Wall time: 4.75 s
CPU times: user 2.63 s, sys: 231 ms, total: 2.86 s
Wall time: 2.95 s


## Random Forest

A random forest is an ensemble of decision trees which will output a prediction value. Each decision tree is constructed by using a random subset of the training data. After you have trained your forest, you can then pass each test row through it, in order to output a prediction.

This particular python function requires floats for the input variables, so all strings need to be converted, and any missing data needs to be filled.

## Encode Labels

In [147]:
def encodeLabels(df):
    df['DayOfWeek'] = df.Dates.dt.dayofweek
    encodedDistrict = pd.get_dummies(df.PdDistrict)
    encodedDataFrame = pd.concat([df.DayOfWeek,encodedDistrict], axis=1)

    encodedDataFrame['Y']=df['Y']
    encodedDataFrame['X']=df['X']
    
    if 'Category' in df.columns.values:
        labelEncoder = preprocessing.LabelEncoder()
        encodedCategory = labelEncoder.fit_transform(df.Category)
        encodedDataFrame['Category']=df.Category #encodedCategory
        
    return encodedDataFrame

In [148]:
%time trainDf = encodeLabels(train)
%time testDf = encodeLabels(test)

CPU times: user 1.13 s, sys: 157 ms, total: 1.29 s
Wall time: 1.32 s
CPU times: user 724 ms, sys: 135 ms, total: 859 ms
Wall time: 910 ms


In [85]:
trainDf.head(3)

Unnamed: 0,DayOfWeek,BAYVIEW,CENTRAL,INGLESIDE,MISSION,NORTHERN,PARK,RICHMOND,SOUTHERN,TARAVAL,TENDERLOIN,Y,X,Category
0,2,0,0,0,0,1,0,0,0,0,0,37.774599,-122.425892,37
1,2,0,0,0,0,1,0,0,0,0,0,37.774599,-122.425892,21
2,2,0,0,0,0,1,0,0,0,0,0,37.800414,-122.424363,21


## Features selection

In [86]:
features = ['Y', 'X','DayOfWeek']
features = np.append(features, train.PdDistrict.unique())

In [87]:
features

array(['Y', 'X', 'DayOfWeek', 'NORTHERN', 'PARK', 'INGLESIDE', 'BAYVIEW',
       'RICHMOND', 'CENTRAL', 'TARAVAL', 'TENDERLOIN', 'MISSION',
       'SOUTHERN'], dtype=object)

## Initialization

In [15]:
??ensemble.RandomForestClassifier

In [22]:
model = ensemble.RandomForestClassifier(n_jobs=-1,n_estimators=50)

## Fitting the model

In [116]:
%time model.fit(trainDf[features], trainDf.Category)

CPU times: user 3min 23s, sys: 7.81 s, total: 3min 31s
Wall time: 1min 2s


RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=50, n_jobs=-1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

## Predict on test data

In [117]:
%time predictedCategory = model.predict(testDf)

CPU times: user 28.6 s, sys: 27.7 s, total: 56.4 s
Wall time: 42 s


In [118]:
predictedCategory

array(['OTHER OFFENSES', 'OTHER OFFENSES', 'OTHER OFFENSES', ...,
       'OTHER OFFENSES', 'OTHER OFFENSES', 'OTHER OFFENSES'], dtype=object)

## Prepare submission

In [161]:
%time sampleSubmission = pd.read_csv(r'data/sampleSubmission.csv')

CPU times: user 4.37 s, sys: 682 ms, total: 5.05 s
Wall time: 5.71 s


In [162]:
def prepareSubmission(df, predictedCategory, sampleDf, category):
    tosubmit = pd.DataFrame({'Id' : df.Id.values, '' : predictedCategory})
    tosubmit = pd.get_dummies(tosubmit,prefix_sep='')
    for cat in categories:
        if cat not in tosubmit:
            tosubmit[cat]=0

    tosubmit = tosubmit[sampleDf.columns]
    if len(tosubmit.columns) != (len(categories)+1):
        print('submit data is inconsistent with categories passed')

    return tosubmit

In [157]:
np.unique(predictedCategory)

array(['ARSON', 'NON-CRIMINAL', 'OTHER OFFENSES'], dtype=object)

In [151]:
categories = sorted(train.Category.unique())
%time tosubmit = prepareSubmission(test,predictedCategory,sampleSubmission,categories)

## Submission

In [160]:
%time tosubmit.to_csv(r'data/tosubmit.csv',index=False)

CPU times: user 28.2 s, sys: 1.29 s, total: 29.5 s
Wall time: 32.3 s


In [127]:
tosubmit.head()

Unnamed: 0,Id,ARSON,ASSAULT,BAD CHECKS,BRIBERY,BURGLARY,DISORDERLY CONDUCT,DRIVING UNDER THE INFLUENCE,DRUG/NARCOTIC,DRUNKENNESS,...,SEX OFFENSES NON FORCIBLE,STOLEN PROPERTY,SUICIDE,SUSPICIOUS OCC,TREA,TRESPASS,VANDALISM,VEHICLE THEFT,WARRANTS,WEAPON LAWS
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
