## Model Iteration
Date Created: 14 February 2016

This model iteration generates crime category predictions for the test data of the San Francisco Crime Classification Kaggle competition:
https://www.kaggle.com/c/sf-crime

as of 14.02.16
Rank: /
Score: %

### Importing Modules and Data

In [11]:
import pandas as pd
import numpy as np
import zipfile

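import sklearn as sk
import sklearn.tree  # makes sure sk.tree.DecisionTreeClassifier resolves
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
# sklearn.cross_validation was removed in scikit-learn 0.20; model_selection replaces it.
# The alias keeps calls written as cross_validation.train_test_split working.
from sklearn import model_selection as cross_validation
from sklearn.model_selection import KFold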
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import LabelEncoder
import scipy as sp

In [3]:
#importing train dataset
z_train = zipfile.ZipFile('train.csv.zip')
train = pd.read_csv(z_train.open('train.csv'), parse_dates=['Dates'], index_col=False)

In [4]:
#importing test dataset
z_test = zipfile.ZipFile('test.csv.zip')
test = pd.read_csv(z_test.open('test.csv'), parse_dates=['Dates'], index_col=False)

### Modifying and Trimming Data

Here, we inspect the data and modify it accordingly. Looking at the columns of the training and testing data, we see that the Resolution column is not really needed. Moreover, some columns, such as PdDistrict and Address, seem to overlap, so we may use only one of them, or some altered version of each. The Descript column will also be dropped from the train data, as it has a great number of unique values and is not present in the test data.
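
The trimming described here could look like the following minimal sketch. It runs on a tiny made-up frame rather than the real train data; the name `toy_train` and its values are hypothetical, and only the column names come from the data set.

```python
import pandas as pd

# Toy stand-in for the real train data (values are made up for illustration)
toy_train = pd.DataFrame({
    'Category': ['ASSAULT', 'VANDALISM'],
    'Descript': ['BATTERY', 'MALICIOUS MISCHIEF'],
    'Resolution': ['NONE', 'ARREST, BOOKED'],
    'PdDistrict': ['MISSION', 'NORTHERN'],
})

# Descript and Resolution do not appear in the test data, so drop them from train
trimmed = toy_train.drop(['Descript', 'Resolution'], axis=1)
print(list(trimmed.columns))
```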

In [5]:
# info() prints its summary and returns None, hence the "None" lines in the output
print(train.info())
print("----------------------------------")
print(test.info())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 878049 entries, 0 to 878048
Data columns (total 9 columns):
Dates         878049 non-null datetime64[ns]
Category      878049 non-null object
Descript      878049 non-null object
DayOfWeek     878049 non-null object
PdDistrict    878049 non-null object
Resolution    878049 non-null object
Address       878049 non-null object
X             878049 non-null float64
Y             878049 non-null float64
dtypes: datetime64[ns](1), float64(2), object(6)
memory usage: 67.0+ MB
None
----------------------------------
<class 'pandas.core.frame.DataFrame'>
Int64Index: 884262 entries, 0 to 884261
Data columns (total 7 columns):
Id            884262 non-null int64
Dates         884262 non-null datetime64[ns]
DayOfWeek     884262 non-null object
PdDistrict    884262 non-null object
Address       884262 non-null object
X             884262 non-null float64
Y             884262 non-null float64
dtypes: datetime64[ns](1), float64(2), int64(1), object(3)

### Tools

Information in some of the columns is extracted and separated into different columns for better evaluation.

The extract_date function alters the Dates column so that it can be used more conveniently. Previously, one column held the year, month, day, and time of day together; now each part gets its own column. This lets us look at crime trends across different days, months, and years.

The extract_time function divides the Time column into Hour, Minute and Second. We will mainly be using the time column.
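
A minimal sketch of this kind of split, using the pandas `.dt` accessor on one made-up timestamp. The frame `toy` is hypothetical, and this illustrates the idea rather than reproducing the extract_date or extract_time functions themselves.

```python
import pandas as pd

# One made-up timestamp in the same format as the Dates column
toy = pd.DataFrame({'Dates': pd.to_datetime(['2015-05-13 23:53:00'])})

# Each component of the timestamp becomes its own column
toy['Year'] = toy['Dates'].dt.year
toy['Month'] = toy['Dates'].dt.month
toy['Day'] = toy['Dates'].dt.day
toy['Hour'] = toy['Dates'].dt.hour
toy['Minute'] = toy['Dates'].dt.minute
print(toy.iloc[0].to_dict())
```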

The make_binary_features function creates dummy variables from the values in pre-existing columns. Dummy variables work well with Random Forest models, although they increase the number of columns in the data set. These features will probably be used with the random forest or gradient boosting methods.
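
For comparison, pandas ships a built-in helper, `pd.get_dummies`, that produces the same kind of 0/1 dummy columns. The sketch below is an alternative to a hand-written function, not the method used in this notebook; `toy` is a hypothetical example frame.

```python
import pandas as pd

toy = pd.DataFrame({'DayOfWeek': ['Monday', 'Tuesday', 'Monday']})

# One column per distinct value of DayOfWeek; astype(int) gives 0/1
# regardless of whether this pandas version returns bool or uint8
dummies = pd.get_dummies(toy['DayOfWeek']).astype(int)
toy = pd.concat([toy, dummies], axis=1)
print(toy)
```

The trade-off is the same as with any dummy encoding: a column with many distinct values, such as Address, would add one new column per value.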

### Additional Functions

Based on a script shared on Kaggle, we create a streamlined function that adds new time fields to the data set, derived from the 'Dates' column.

In [12]:
def time_trim(df):
    df['Day'] = df['Dates'].dt.day
    df['Month'] = df['Dates'].dt.month
    df['Year'] = df['Dates'].dt.year
    df['Hour'] = df['Dates'].dt.hour
    df['Minute'] = df['Dates'].dt.minute
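    # .dt.weekofyear was removed in pandas 2.0; isocalendar().week replaces it (pandas >= 1.1)
    df['WeekOfYear'] = df['Dates'].dt.isocalendar().week.astype(int)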
    return

def make_binary_fields(df, field):
    """
    creates one new field per unique value of df[field], named after that value
    if the original data match the new field name, the data will be 1
    if the original data does not match the new field name, the data will be 0
    returns the list of new field names

    ex
    make_binary_fields(df, 'DayOfWeek')
    will create new fields
    Monday, Tuesday, Wednesday, Thursday, Friday, Saturday and Sunday
    where
    df['Monday'] will have value 1 for all Mondays and 0 for the rest
    """
    new_fields = []
    for new_field in df[field].unique():
        # 1 where the row's value equals this category, 0 otherwise
        df[new_field] = (df[field] == new_field).astype(int)
        new_fields.append(new_field)
    # return the names so callers can extend the predictor list with them
    return new_fields

def make_season(df):
    """
    Make new field name Season
    and binary fields for each season
    Has to happen after making 'Month' field
    spring: month 2, 3, 4
    summer: month 5, 6, 7
    autumn: month 8, 9, 10
    winter: month 11, 12, 1
    returns the list of season field names
    """
    month_to_season = {11: 'Winter', 12: 'Winter', 1: 'Winter',
                       2: 'Spring', 3: 'Spring', 4: 'Spring',
                       5: 'Summer', 6: 'Summer', 7: 'Summer',
                       8: 'Autumn', 9: 'Autumn', 10: 'Autumn'}
    # map() avoids writing strings into an integer column and then comparing them with numbers
    df['Season'] = df['Month'].map(month_to_season)
    seasons = make_binary_fields(df, 'Season')
    return seasons

### Formatting data

In [None]:
time_trim(train)
seasons = make_season(train)
dow = make_binary_fields(train, 'DayOfWeek')
pdd = make_binary_fields(train, 'PdDistrict')
categories = make_binary_fields(train, 'Category')

In [None]:
enc = LabelEncoder()
enc.fit(train['Category'])
train['CategoryEncoded'] = enc.transform(train['Category'])
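
A minimal sketch of what the encoder does, on a few made-up labels: `LabelEncoder` assigns integer codes in sorted label order, and `inverse_transform` recovers the original names. That mapping is what ties each column of a later `predict_proba` result back to a category name. The `labels` list and the name `enc_demo` are hypothetical.

```python
from sklearn.preprocessing import LabelEncoder

labels = ['VANDALISM', 'ASSAULT', 'VANDALISM', 'ROBBERY']

enc_demo = LabelEncoder()
codes = enc_demo.fit_transform(labels)

# classes_[k] is the category name that integer code k stands for
print(list(enc_demo.classes_))
print(list(codes))
# inverse_transform maps the codes back to the original names
print(list(enc_demo.inverse_transform(codes)))
```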

### Setting predictors

In [None]:
predictors = ['Day','Month','Year','Hour','Minute','WeekOfYear']
predictors.extend(pdd)
predictors.extend(seasons)
predictors.extend(dow)

### Kaggle SF Crime Classification Scoring System

In [None]:
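def logloss(y, p):
    """
    Multi-class logarithmic loss.
    y: true class indices, p: array of predicted probabilities (n_samples x n_classes)

    information derived from following sources
    https://www.kaggle.com/wiki/LogarithmicLoss
    https://www.kaggle.com/c/sf-crime/details/evaluation
    """
    eps = 1e-15
    # rescale each row to sum to 1, then clip away exact 0 and 1 so log() stays finite
    p = p / p.sum(axis=1)[:, np.newaxis]
    p = np.clip(p, eps, 1 - eps)

    # probability assigned to the true class of every row
    true_class_p = p[np.arange(len(p)), np.asarray(y)]

    # Calculate logloss
    return -np.mean(np.log(true_class_p))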
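
As a sanity check on the metric, the following sketch computes the same quantity by hand for two made-up predictions and also evaluates a uniform guess. The arrays are hypothetical; the value 39 is the number of crime categories in the competition's training data. A model is only useful if it scores below the uniform baseline.

```python
import numpy as np

# Two samples, three classes; the true classes are 0 and 1 (made-up numbers)
p = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1]])
y_true = np.array([0, 1])

# Mean negative log of the probability assigned to the true class:
# -(log(0.7) + log(0.8)) / 2
hand_ll = -np.mean(np.log(p[np.arange(len(p)), y_true]))
print(round(hand_ll, 4))

# A uniform guess over K classes always scores log(K)
K = 39
uniform_ll = -np.log(1.0 / K)
print(round(uniform_ll, 4))
```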

### Separating train data

In [None]:
x = train[predictors]
y = train['CategoryEncoded']
xtr, xtest, ytr, ytest = cross_validation.train_test_split(x, y, test_size = 0.5, stratify = np.array(y) )
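
The effect of the `stratify` argument can be seen on a small made-up example. The sketch imports `train_test_split` from `sklearn.model_selection`, the successor of `sklearn.cross_validation` in newer scikit-learn releases; `x_demo`, `y_demo`, and the fixed `random_state` are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up imbalanced labels: 8 samples of class 0, 4 of class 1
y_demo = np.array([0] * 8 + [1] * 4)
x_demo = np.arange(len(y_demo)).reshape(-1, 1)

x_a, x_b, y_a, y_b = train_test_split(
    x_demo, y_demo, test_size=0.5, stratify=y_demo, random_state=0)

# Both halves keep the 2:1 class ratio of the full data
print(np.bincount(y_a), np.bincount(y_b))
```

Stratifying matters here because rare categories could otherwise be missing from one half, which would leave `predict_proba` without a column for them.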

### Logistic Regression

In [None]:
alg = LogisticRegression()
alg.fit(xtr, ytr)
prediction = alg.predict_proba(xtest)
logloss(ytest,prediction)

### Decision Tree

In [None]:
alg = sk.tree.DecisionTreeClassifier(max_depth=4, min_samples_leaf=4)
alg.fit(xtr, ytr)
prediction = alg.predict_proba(xtest)
logloss(ytest,prediction)

### Gradient Boosting

In [None]:
alg = GradientBoostingClassifier(random_state=1, n_estimators=10, max_depth=3)
alg.fit(xtr, ytr)
prediction = alg.predict_proba(xtest)
logloss(ytest,prediction)
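
The model cells repeat the same fit, predict, and score pattern, so candidates can also be compared side by side in a small loop. The sketch below runs on synthetic data from `make_classification` and uses scikit-learn's own `log_loss`; the data, the `models` dictionary, and the hyperparameters are illustrative assumptions, not results from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic 3-class problem standing in for the crime data
X, y = make_classification(n_samples=300, n_features=8, n_informative=5,
                           n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=0)

models = {
    'logistic': LogisticRegression(max_iter=1000),
    'tree': DecisionTreeClassifier(max_depth=4, min_samples_leaf=4, random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    # predict_proba columns follow model.classes_, i.e. sorted label order
    scores[name] = log_loss(y_te, model.predict_proba(X_te), labels=model.classes_)

# Lower log loss is better, so list the best model first
for name in sorted(scores, key=scores.get):
    print(name, round(scores[name], 4))
```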