## Model Iteration
Date Created: 14 February 2016

This model iteration makes crime category predictions for the test data of the San Francisco Crime Classification Kaggle competition:
https://www.kaggle.com/c/sf-crime

as of 14.02.16
Rank: /
Score: %

### Importing Modules and Data

In [1]:
import pandas as pd
import numpy as np
import zipfile

import sklearn as sk
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn import cross_validation
from sklearn.cross_validation import KFold
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import export_graphviz
from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
import scipy as sp

In [2]:
#importing train dataset
z_train = zipfile.ZipFile('train.csv.zip')
train = pd.read_csv(z_train.open('train.csv'), parse_dates=['Dates'], index_col=False)

In [3]:
#importing test dataset
z_test = zipfile.ZipFile('test.csv.zip')
test = pd.read_csv(z_test.open('test.csv'), parse_dates=['Dates'], index_col=False)

### Modifying and Trimming Data

Here we inspect the data and modify it accordingly. Comparing the columns of the training and test data shows that the Resolution column is not needed, since it is absent from the test data. Some columns, such as PdDistrict and Address, carry overlapping information, so we may use only one of them, or some altered version of each. The Descript column of the training data is also left out, as it has a great number of unique values and is not present in the test data.

In [4]:
print train.info()
print "----------------------------------"
print test.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 878049 entries, 0 to 878048
Data columns (total 9 columns):
Dates         878049 non-null datetime64[ns]
Category      878049 non-null object
Descript      878049 non-null object
DayOfWeek     878049 non-null object
PdDistrict    878049 non-null object
Resolution    878049 non-null object
Address       878049 non-null object
X             878049 non-null float64
Y             878049 non-null float64
dtypes: datetime64[ns](1), float64(2), object(6)
memory usage: 67.0+ MB
None
----------------------------------
<class 'pandas.core.frame.DataFrame'>
Int64Index: 884262 entries, 0 to 884261
Data columns (total 7 columns):
Id            884262 non-null int64
Dates         884262 non-null datetime64[ns]
DayOfWeek     884262 non-null object
PdDistrict    884262 non-null object
Address       884262 non-null object
X             884262 non-null float64
Y             884262 non-null float64
dtypes: datetime64[ns](1), float64(2), int64(1), object(3)

### Tools

The information in some of the columns is extracted and separated into new columns so that the models can use it more easily.

The time_trim function splits the Dates column into its components (day, month, year, hour, minute, week of year) for convenience.

The make_binary_fields step creates dummy variables from existing columns; in this notebook it is done with pd.get_dummies. Dummy variables work well with random forests, although they increase the number of columns in the data set. They will probably be used for the random forest or gradient boosting method.

The make_season function converts the month into a season, which we can include as a predictor. However, the season overlaps greatly with the date, so we may not fully use it.

### Data Modifying Functions

Adapted from a script shared on Kaggle, the functions below add new time-based columns to the data set, derived from the 'Dates' column.

In [5]:
def time_trim(df):
    df['Day'] = df['Dates'].dt.day
    df['Month'] = df['Dates'].dt.month
    df['Year'] = df['Dates'].dt.year
    df['Hour'] = df['Dates'].dt.hour
    df['Minute'] = df['Dates'].dt.minute
    df['WeekOfYear'] = df['Dates'].dt.weekofyear
    return

def make_season(df):
    """
    Make a new field named Season.
    Has to happen after making the 'Month' field.
    spring: month 2, 3, 4
    summer: month 5, 6, 7
    autumn: month 8, 9, 10
    winter: month 11, 12, 1
    """
    # Build the masks from 'Month' so strings are never compared with numbers
    month = df['Month']
    df['Season'] = 'Winter'  # months 11, 12 and 1
    df.loc[(month > 1) & (month <= 4), 'Season'] = 'Spring'
    df.loc[(month > 4) & (month <= 7), 'Season'] = 'Summer'
    df.loc[(month > 7) & (month <= 10), 'Season'] = 'Autumn'
    return
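
The same month-to-season grouping can also be written as a lookup table. The sketch below is an alternative for illustration, not the notebook's own code: `SEASON_BY_MONTH` and `add_season` are hypothetical names, and the toy frame stands in for `train`.

```python
import pandas as pd

# Hypothetical lookup table mirroring the docstring of make_season:
# spring 2-4, summer 5-7, autumn 8-10, winter 11, 12, 1.
SEASON_BY_MONTH = {
    1: 'Winter', 2: 'Spring', 3: 'Spring', 4: 'Spring',
    5: 'Summer', 6: 'Summer', 7: 'Summer',
    8: 'Autumn', 9: 'Autumn', 10: 'Autumn',
    11: 'Winter', 12: 'Winter',
}

def add_season(df):
    """Add a 'Season' column derived from an existing 'Month' column (in place)."""
    df['Season'] = df['Month'].map(SEASON_BY_MONTH)
    return df

# Toy data standing in for the real training frame.
toy = pd.DataFrame({'Month': [1, 2, 4, 5, 7, 8, 10, 11, 12]})
add_season(toy)
season_list = toy['Season'].tolist()
```

A dictionary keeps the month boundaries in one visible place, which may make them easier to adjust than a chain of comparisons.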

### Formatting data

Here we alter the columns in our data set to suit the models. Because many of the sklearn estimators cannot handle string data, we must convert string columns either to dummy variables or to numeric codes. After running time_trim and make_season, we create dummy variables for DayOfWeek, Season, and PdDistrict with pd.get_dummies, encode the same columns as numbers with LabelEncoder(), and encode Category with LabelEncoder() as well.
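
To make the difference between the two encodings concrete, here is a minimal sketch on a toy column. The four district values are made up for illustration; only `pd.get_dummies` and `LabelEncoder` are taken from the notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy stand-in for the PdDistrict column (hypothetical rows).
toy = pd.DataFrame({'PdDistrict': ['MISSION', 'PARK', 'MISSION', 'CENTRAL']})

# One-hot ("dummy") encoding: one indicator column per distinct value.
dummies = pd.get_dummies(toy['PdDistrict'], prefix='pdd')

# Label encoding: a single integer column; classes are numbered in sorted order.
encoder = LabelEncoder()
toy['PdDistrict_enc'] = encoder.fit_transform(toy['PdDistrict'])

dummy_columns = list(dummies.columns)
label_codes = toy['PdDistrict_enc'].tolist()
```

Label encoding keeps the frame narrow but implies an ordering between districts that has no real meaning, whereas dummy columns avoid that at the cost of more columns.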

In [6]:
time_trim(train)
make_season(train)

time_trim(test)
make_season(test)

In [7]:
#Making into binary
train = pd.concat((train, pd.get_dummies(train['DayOfWeek'], prefix = 'dow')), axis=1)
train = pd.concat((train, pd.get_dummies(train['Season'], prefix = 'season')), axis=1)
train = pd.concat((train, pd.get_dummies(train['PdDistrict'], prefix = 'pdd')), axis=1)

test = pd.concat((test, pd.get_dummies(test['DayOfWeek'], prefix = 'dow')), axis=1)
test = pd.concat((test, pd.get_dummies(test['Season'], prefix = 'season')), axis=1)
test = pd.concat((test, pd.get_dummies(test['PdDistrict'], prefix = 'pdd')), axis=1)

In [8]:
#Encoding into numbers
#Named encode_features so it does not clash with the `enc` LabelEncoder created in the next cell
def encode_features(df):
    enc_pdd = LabelEncoder()
    df['PdDistrict_enc'] = enc_pdd.fit_transform(df['PdDistrict'])

    enc_seas = LabelEncoder()
    df['Season_enc'] = enc_seas.fit_transform(df['Season'])

    enc_dow = LabelEncoder()
    df['DayOfWeek_enc'] = enc_dow.fit_transform(df['DayOfWeek'])
    return

encode_features(train)
encode_features(test)

In [34]:
enc = LabelEncoder()
enc.fit(train['Category'])
train['CategoryEncoded'] = enc.transform(train['Category'])

In [10]:
train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 878049 entries, 0 to 878048
Data columns (total 41 columns):
Dates              878049 non-null datetime64[ns]
Category           878049 non-null object
Descript           878049 non-null object
DayOfWeek          878049 non-null object
PdDistrict         878049 non-null object
Resolution         878049 non-null object
Address            878049 non-null object
X                  878049 non-null float64
Y                  878049 non-null float64
Day                878049 non-null int64
Month              878049 non-null int64
Year               878049 non-null int64
Hour               878049 non-null int64
Minute             878049 non-null int64
WeekOfYear         878049 non-null int64
Season             878049 non-null object
dow_Friday         878049 non-null float64
dow_Monday         878049 non-null float64
dow_Saturday       878049 non-null float64
dow_Sunday         878049 non-null float64
dow_Thursday       878049 non-null float64

### Setting predictors

The predictors are the data columns we want to use for fitting our models. We include the time components, followed by PdDistrict, season, and day of week, either as dummy variables (predictors_b) or as encoded numbers (predictors_enc).

In [11]:
predictors = ['Day','Month','Year','Hour','Minute','WeekOfYear']
predictors_b = ['pdd_TENDERLOIN', 'pdd_TARAVAL', 'pdd_SOUTHERN', 'pdd_RICHMOND', 'pdd_PARK', 
                'pdd_NORTHERN', 'pdd_MISSION', 'pdd_INGLESIDE', 'pdd_CENTRAL', 'pdd_BAYVIEW',
                'season_Winter', 'season_Summer', 'season_Spring', 'season_Autumn', 
                'dow_Friday', 'dow_Monday', 'dow_Tuesday', 'dow_Wednesday', 'dow_Thursday', 
                'dow_Saturday', 'dow_Sunday']
predictors_enc = ['PdDistrict_enc', 'Season_enc', 'DayOfWeek_enc']
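
Because the dummy columns are created separately for the training and test data, a value that never occurs in one of them would leave a column missing there. A quick check along these lines can catch that before fitting; `missing_columns` is a hypothetical helper and the toy frame only stands in for `test`.

```python
import pandas as pd

def missing_columns(df, columns):
    """Return the names in `columns` that are absent from `df`, in order."""
    return [c for c in columns if c not in df.columns]

# Toy stand-in for the test frame (hypothetical columns and values).
toy_test = pd.DataFrame({'Day': [1], 'Hour': [13], 'pdd_PARK': [1]})
wanted = ['Day', 'Hour', 'pdd_PARK', 'pdd_MISSION']
absent = missing_columns(toy_test, wanted)
```

In the notebook this could be called as `missing_columns(test, predictors + predictors_b)`, where an empty list would mean every predictor is available.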

### Kaggle SF Crime Classification Scoring System

The leaderboard of the SF Crime Kaggle competition is scored by log loss. We therefore wrote our own log loss function, based on an existing implementation. It first normalizes each row of probabilities and clips extreme values (using a small epsilon), then returns the log loss as our score.

The log loss system works differently from a plain accuracy percentage. For each test row, the model assigns a probability to every category. These probabilities are then compared with the true category: the loss of a row is the negative logarithm of the probability assigned to its true category, and the score is the average of these losses over all rows. Overall, a lower log loss means a more successful model.

In [12]:
def logloss(y,p):
    """
    information derived from following sources
    https://www.kaggle.com/wiki/LogarithmicLoss
    https://www.kaggle.com/c/sf-crime/details/evaluation
    """
    eps = 1e-15
    p = p/p.sum(axis=1)[:,np.newaxis]
    p = np.maximum(eps,p)
    p = np.minimum(1-eps,p)
    

    # Calculate logloss
    ll = 0
    for i in range(len(p)):
        ll += np.log(p[i, y.iloc[i]])
    ll /= float(-len(p))

    return ll
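
The Python loop above can be slow on several hundred thousand rows. A vectorized variant might look like the sketch below; `logloss_vectorized` is a hypothetical name, and it assumes `y` holds integer class labels that index the columns of `p`, as in the function above.

```python
import numpy as np

def logloss_vectorized(y, p, eps=1e-15):
    """Multiclass log loss; y holds integer labels, p is (n_samples, n_classes)."""
    y = np.asarray(y)
    p = np.asarray(p, dtype=float)
    # Same preprocessing as logloss: normalize each row, then clip away 0 and 1.
    p = p / p.sum(axis=1, keepdims=True)
    p = np.clip(p, eps, 1 - eps)
    # Pick the predicted probability of the true class for every row at once.
    true_class_p = p[np.arange(len(p)), y]
    return -np.mean(np.log(true_class_p))

# Toy example: three rows, two classes.
y_toy = np.array([0, 1, 1])
p_toy = np.array([[0.5, 0.5],
                  [0.25, 0.75],
                  [0.5, 0.5]])
score = logloss_vectorized(y_toy, p_toy)
```

For the toy input the score is the average of -log(0.5), -log(0.75), and -log(0.5), which shows that only the probability given to the true class enters the loss.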

### Separating train data

Here we split our training data into a train set and a held-out test set, half of the rows each. With this, we can test within our train data to see how accurate our model is before submission.

In [13]:
# Create x and y from train data, x will be train, y will be target
x_b = train[predictors + predictors_b]
x_e = train[predictors + predictors_enc]
y = train['CategoryEncoded']

In [13]:
xtr_b, xtest_b, ytr_b, ytest_b = cross_validation.train_test_split(x_b, y, train_size = 0.5)

In [14]:
xtr_e, xtest_e, ytr_e, ytest_e = cross_validation.train_test_split(x_e, y, train_size = 0.5)
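
The two calls above each draw their own random split, so the dummy-variable and encoded feature sets are scored on different rows. In scikit-learn 0.20 and later, `train_test_split` is imported from `sklearn.model_selection`, as the `cross_validation` module is no longer available. The sketch below assumes such a version and uses toy frames in place of `x_b`, `x_e`, and `y`; the `random_state` value is arbitrary.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-ins for x_b, x_e and y (hypothetical values).
x_b_toy = pd.DataFrame({'dow_Monday': [1, 0, 0, 1, 0, 1, 0, 0]})
x_e_toy = pd.DataFrame({'DayOfWeek_enc': [1, 0, 2, 1, 3, 1, 0, 2]})
y_toy = pd.Series([0, 1, 0, 1, 0, 1, 0, 1])

# Split the row positions once, then reuse them for both feature sets.
rows = np.arange(len(y_toy))
tr_rows, te_rows = train_test_split(rows, train_size=0.5, random_state=0)

xtr_b, xtest_b = x_b_toy.iloc[tr_rows], x_b_toy.iloc[te_rows]
xtr_e, xtest_e = x_e_toy.iloc[tr_rows], x_e_toy.iloc[te_rows]
ytr, ytest = y_toy.iloc[tr_rows], y_toy.iloc[te_rows]
```

Sharing one split means a difference in log loss between the two feature sets reflects the encoding rather than the luck of the draw.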

### Logistic Regression

Our first attempt is logistic regression. We are familiar with it from our warm-up project, and it is a good starting point. However, I believe that since the target (Category) has 38 classes, logistic regression may be less apt here than in the warm-up project, where the outcome was binary (alive or dead).

In [38]:
alg = LogisticRegression()
alg.fit(xtr_b, ytr_b)
prediction = alg.predict_proba(xtest_b)
logloss(ytest_b,prediction)

2.579205451592236

In [28]:
alg = LogisticRegression()
alg.fit(xtr_e, ytr_e)
prediction = alg.predict_proba(xtest_e)
logloss(ytest_e,prediction)

2.6269253551422671
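
scikit-learn's `LogisticRegression` accepts targets with more than two classes, and `predict_proba` then returns one column per class. The sketch below shows this on made-up data with three classes; the feature values are purely illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data with three classes (an assumption for illustration only).
X = np.array([[0.0], [0.2], [1.0], [1.2], [2.0], [2.2]])
y = np.array([0, 0, 1, 1, 2, 2])

clf = LogisticRegression()
clf.fit(X, y)
proba = clf.predict_proba(X)

# One column per class, ordered like clf.classes_; each row sums to 1.
n_rows, n_cols = proba.shape
row_sums = proba.sum(axis=1)
```

The column order follows `clf.classes_`, which matters when the probabilities are later matched to category names.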

### Gradient Boosting

In [None]:
alg = GradientBoostingClassifier()
alg.fit(xtr_b, ytr_b)
prediction = alg.predict_proba(xtest_b)
logloss(ytest_b,prediction)

In [None]:
alg = GradientBoostingClassifier()
alg.fit(xtr_e, ytr_e)
prediction = alg.predict_proba(xtest_e)
logloss(ytest_e,prediction)

### Random Forest

In [22]:
alg = RandomForestClassifier(max_depth=13)
alg.fit(xtr_b, ytr_b)
prediction = alg.predict_proba(xtest_b)
logloss(ytest_b,prediction)

2.5062227345094228

In [26]:
alg = RandomForestClassifier(max_depth=12, min_samples_split=4, min_samples_leaf=5)
alg.fit(xtr_b, ytr_b)
prediction = alg.predict_proba(xtest_b)
logloss(ytest_b,prediction)

2.4998721955608434

In [42]:
alg = RandomForestClassifier(max_depth=4)
alg.fit(xtr_e, ytr_e)
prediction = alg.predict_proba(xtest_e)
logloss(ytest_e,prediction)

2.5603905655630932

In [41]:
alg = RandomForestClassifier(max_depth=4, min_samples_split=2, min_samples_leaf=2)
alg.fit(xtr_e, ytr_e)
prediction = alg.predict_proba(xtest_e)
logloss(ytest_e,prediction)

2.5603264644083668

### Decision Tree

A decision tree is built from a series of conditional splits. Because we converted many of our categorical columns to dummy variables (each either true or false), we believe that the decision tree model will be very suitable.

In [28]:
alg = sk.tree.DecisionTreeClassifier(max_depth = 6)
alg.fit(xtr_b, ytr_b)
prediction = alg.predict_proba(xtest_b)
logloss(ytest_b,prediction)

2.5430288925085018

Adjusting min_samples_split and min_samples_leaf did not change the result much.

In [47]:
export_graphviz(alg, feature_names=xtr_b.columns, out_file='tree_b.dot')

%load_ext gvmagic
with open('tree_b.dot') as f:
    tree_model_visualization = f.read()
%dotstr tree_model_visualization
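
If graphviz or the gvmagic extension is not at hand, the fitted rules can also be printed as plain text. This sketch assumes scikit-learn 0.21 or later, where `export_text` is available, and uses a toy 0/1 feature rather than the real predictors.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy 0/1 feature standing in for a dummy column (hypothetical data).
X = np.array([[0], [0], [1], [1]])
y = np.array([0, 0, 1, 1])

tree = DecisionTreeClassifier(max_depth=1, random_state=0)
tree.fit(X, y)

# export_text returns the fitted rules as an indented string.
rules = export_text(tree, feature_names=['dow_Monday'])
print(rules)
```

With dummy variables every split has the form "column <= 0.5", which is the true-or-false condition described above.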

In [45]:
alg = sk.tree.DecisionTreeClassifier(max_depth=4)
alg.fit(xtr_e, ytr_e)
prediction = alg.predict_proba(xtest_e)
logloss(ytest_e,prediction)

2.5567671980169173

In [None]:
export_graphviz(alg, feature_names=xtr_e.columns, out_file='tree_e.dot')

%reload_ext gvmagic
with open('tree_e.dot') as f:
    tree_model_visualization = f.read()
%dotstr tree_model_visualization

#### Baseline model

In [23]:
all_crimes = set(train.Category)
crime_probabilities = {}
for c in all_crimes:
    crime_probabilities[c] = (train.Category == c).sum() / float(len(train.Category))
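
The baseline assigns every row the overall frequency of each category. Its log loss can be estimated locally before submitting, as in the sketch below; the toy series stands in for `train.Category` and its values are made up.

```python
import numpy as np
import pandas as pd

# Toy stand-in for train.Category (hypothetical values).
categories = pd.Series(['ASSAULT', 'LARCENY/THEFT', 'LARCENY/THEFT', 'ASSAULT',
                        'LARCENY/THEFT', 'VANDALISM', 'LARCENY/THEFT', 'ASSAULT'])

# Relative frequency of each category, i.e. the baseline prediction.
priors = categories.value_counts(normalize=True)

# Log loss when every row is assigned these same priors:
# the average of -log(prior of the row's true category).
baseline_logloss = -np.log(categories.map(priors)).mean()
```

Any model worth submitting should score below this value on held-out data.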

### Creating Submission File

#### Baseline Model

In [46]:
submission = pd.DataFrame(crime_probabilities, index = range(len(test)))
submission['Id'] = test['Id']
submission.to_csv('sfcc_baseline.csv', index = False)

Kaggle Submission Result: 2.68016

#### Random Forest Model

Predicting for the test data using RandomForestClassifier.

In [31]:
alg = RandomForestClassifier(max_depth=12, min_samples_split=4, min_samples_leaf=5)
alg.fit(x_b, y)
prediction = alg.predict_proba(test[predictors + predictors_b])

In [43]:
submission = pd.DataFrame({'Id': test.Id})
for i in range(prediction.shape[1]):
    category = enc.inverse_transform([i])
    submission[category[0]] = prediction[:,i]

submission.to_csv('sfcc_rfc.csv', index=False)

Kaggle Submission Result: 2.49867 (improvement of 0.18149 from baseline model)
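
The submission columns can also be named in one step from the encoder's `classes_` attribute instead of decoding each index in a loop. The sketch below uses toy stand-ins (`label_enc`, `toy_ids`, `toy_prediction` are hypothetical) and assumes every encoded label occurred in the data the model was fit on, so that column i of the probabilities belongs to label i.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy stand-ins (hypothetical): three categories, two test rows.
label_enc = LabelEncoder().fit(['ASSAULT', 'VANDALISM', 'ROBBERY'])
toy_ids = pd.Series([0, 1], name='Id')
toy_prediction = np.array([[0.2, 0.5, 0.3],
                           [0.6, 0.1, 0.3]])

# classes_ is sorted, so its order matches the integer labels 0, 1, 2, ...
submission = pd.DataFrame(toy_prediction, columns=label_enc.classes_)
submission.insert(0, 'Id', toy_ids)
```

Building the frame in one call avoids adding columns one at a time, which can be noticeably slower on a test set of this size.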