## Model Iteration
Date Created: 14 February 2016

This model iteration makes crime category predictions on the test data for the San Francisco Crime Classification Kaggle competition:
https://www.kaggle.com/c/sf-crime

as of 14.02.16
Rank: /
Score: %

In [2]:
### Importing Modules and Data

In [22]:
import pandas as pd
import numpy as np
import zipfile

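# train_test_split lives in sklearn.model_selection; sklearn.cross_validation was removed in later releases
from sklearn.model_selection import train_test_split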
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn import tree
from sklearn.ensemble import GradientBoostingClassifier

In [23]:
#importing train dataset
z_train = zipfile.ZipFile('train.csv.zip')
train = pd.read_csv(z_train.open('train.csv'), parse_dates=['Dates'], index_col=False)

In [24]:
#importing test dataset
z_test = zipfile.ZipFile('test.csv.zip')
test = pd.read_csv(z_test.open('test.csv'), parse_dates=['Dates'], index_col=False)
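
As a side note on the import step, pandas can usually read a zipped CSV directly when the archive holds a single file, inferring the compression from the `.zip` extension. The sketch below builds a tiny throwaway archive (the file name `demo.csv.zip` and its contents are made up for illustration) and checks that both approaches load the same frame.

```python
import os
import tempfile
import zipfile

import pandas as pd

# Build a small throwaway archive so the example is self-contained (hypothetical data).
tmp_dir = tempfile.mkdtemp()
zip_path = os.path.join(tmp_dir, 'demo.csv.zip')
with zipfile.ZipFile(zip_path, 'w') as zf:
    zf.writestr('demo.csv', 'Dates,X,Y\n2015-05-10 23:59:00,-122.4,37.7\n')

# Approach used in this notebook: open the archive member explicitly.
with zipfile.ZipFile(zip_path) as zf:
    with zf.open('demo.csv') as fh:
        via_zipfile = pd.read_csv(fh, parse_dates=['Dates'])

# Alternative: let pandas infer zip compression from the file extension.
direct = pd.read_csv(zip_path, parse_dates=['Dates'])

print(via_zipfile.equals(direct))
```

Either form should work for the competition files; the explicit `zipfile` route is mainly useful when an archive contains more than one member.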

### Modifying and Trimming Data

Here, we analyze the data and modify it accordingly. Looking at the columns of the training and testing data, we see that the Resolution column is not really needed, and it is not present in the test data. Moreover, some columns such as PdDistrict and Address seem to overlap, so we may pick one of them, or some altered version of each; in this iteration we keep PdDistrict and drop Address. Also, the Descript column is dropped from the train data because it has a great number of unique values and is not present in the test data.

In [25]:
print(train.info())
print("----------------------------------")
print(test.info())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 878049 entries, 0 to 878048
Data columns (total 9 columns):
Dates         878049 non-null datetime64[ns]
Category      878049 non-null object
Descript      878049 non-null object
DayOfWeek     878049 non-null object
PdDistrict    878049 non-null object
Resolution    878049 non-null object
Address       878049 non-null object
X             878049 non-null float64
Y             878049 non-null float64
dtypes: datetime64[ns](1), float64(2), object(6)
memory usage: 67.0+ MB
None
----------------------------------
<class 'pandas.core.frame.DataFrame'>
Int64Index: 884262 entries, 0 to 884261
Data columns (total 7 columns):
Id            884262 non-null int64
Dates         884262 non-null datetime64[ns]
DayOfWeek     884262 non-null object
PdDistrict    884262 non-null object
Address       884262 non-null object
X             884262 non-null float64
Y             884262 non-null float64
dtypes: datetime64[ns](1), float64(2), int64(1), object(3)

In [26]:
train = train.drop(['Descript', 'Resolution', 'Address'], axis = 1)
test = test.drop(['Address'], axis = 1)
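
The reasoning behind these drops can be checked programmatically. The following is a minimal sketch on toy stand-in frames (`train_demo` and `test_demo` are hypothetical, with made-up values) that lists the columns present only in the training data and counts distinct values per column.

```python
import pandas as pd

# Toy stand-ins for the real train/test frames (hypothetical values).
train_demo = pd.DataFrame({
    'Dates': ['2015-01-01 10:00:00', '2015-01-01 11:00:00', '2015-01-02 09:30:00'],
    'Category': ['ASSAULT', 'ARSON', 'ASSAULT'],
    'Descript': ['description a', 'description b', 'description c'],
    'Resolution': ['NONE', 'NONE', 'NONE'],
    'PdDistrict': ['NORTHERN', 'PARK', 'NORTHERN'],
})
test_demo = pd.DataFrame({
    'Id': [0, 1],
    'Dates': ['2015-01-03 12:00:00', '2015-01-03 13:00:00'],
    'PdDistrict': ['BAYVIEW', 'PARK'],
})

# Columns that exist only in the training data cannot be used as features at test time.
train_only = [c for c in train_demo.columns if c not in test_demo.columns]
print(train_only)

# Distinct values per train-only column, a rough guide to cardinality.
print(train_demo[train_only].nunique())
```

On the real data, `Category` would show up in this list as well; it is the prediction target, so it stays in the training frame.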

### Tools

Information in some of the columns is extracted and separated into different columns for easier evaluation.

The extract_date function splits the Dates column into pieces that are more convenient to use. Previously, one column held the year, month, day, and time of day together; now each gets its own column. This lets us look at trends in crime across different days, months, and years.

The extract_time function divides the Time column into Hour, Minute, and Second. We will mainly be using the Time column.

The make_binary_fields function creates dummy variables from the values in a pre-existing column. Dummy variables tend to work well with tree-based models such as random forests, although they increase the number of columns in the data set. This will probably be used for the random forest or gradient boosting method.

In [27]:
def extract_date(df):
    """
    function specifically for Dates field only
    creates new fields (modifies df in place)
    Year  YYYY
    Month MM
    Date  DD
    Time  HH:MM:SS
    """
    # Dates is parsed as datetime64 on import, so format it as text before slicing
    dates = df['Dates'].dt.strftime('%Y-%m-%d %H:%M:%S')
    df['Year'] = dates.str[:4]
    df['Month'] = dates.str[5:7]
    df['Date'] = dates.str[8:10]
    df['Time'] = dates.str[-8:]
    return

def extract_time(df):
    """
    function specifically for Time field only
    creates new fields (modifies df in place)
    Hour   HH
    Minute MM
    Second SS
    """
    # Time has the form HH:MM:SS, so skip the colons at positions 2 and 5
    df['Hour'] = df['Time'].str[:2]
    df['Minute'] = df['Time'].str[3:5]
    df['Second'] = df['Time'].str[6:8]
    return

def make_binary_fields(df, field):
    """
    creates one new field per unique value of df[field] (modifies df in place)
    if the original data match the new field name, the data will be 1
    if the original data does not match the new field name, the data will be 0

    
    ex 
    make_binary_fields(df, 'DayOfWeek')
    will create new fields
    Monday, Tuesday, Wednesday, Thursday, Friday, Saturday and Sunday
    where
    df['Monday'] will have value 1 for all Mondays and 0 for the rest
    """
    for new_field in df[field].unique():
        # a boolean comparison cast to int yields the 0/1 indicator directly
        df[new_field] = (df[field] == new_field).astype(int)
    return
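
The same kind of indicator columns can also be produced with pandas' built-in `get_dummies`. This is only a sketch on a toy frame (the `demo` data is made up), not a replacement the notebook depends on.

```python
import pandas as pd

# Toy column of day names (hypothetical values).
demo = pd.DataFrame({'DayOfWeek': ['Monday', 'Tuesday', 'Monday', 'Sunday']})

# One 0/1 column per distinct value; astype(int) guards against boolean output
# in newer pandas versions.
dummies = pd.get_dummies(demo['DayOfWeek']).astype(int)

# Attach the indicator columns next to the original column.
demo = pd.concat([demo, dummies], axis=1)
print(demo)
```

Unlike the in-place helper above, `get_dummies` returns a new frame, so the result has to be joined back explicitly.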

### Additional Functions

Based on a script from the Kaggle scripts section, we create a streamlined function that adds new time columns to the data set, all derived from the 'Dates' column.

In [28]:
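def time_trim(data):
    # split the Dates timestamp into separate numeric columns
    data['Day'] = data['Dates'].dt.day
    data['Month'] = data['Dates'].dt.month
    data['Year'] = data['Dates'].dt.year
    data['Hour'] = data['Dates'].dt.hour
    data['Minute'] = data['Dates'].dt.minute
    # overwrites the original day names with integers (Monday=0 ... Sunday=6)
    data['DayOfWeek'] = data['Dates'].dt.dayofweek
    # ISO week number; .dt.weekofyear is no longer available in newer pandas releases
    data['WeekOfYear'] = data['Dates'].dt.isocalendar().week.astype(int)
    return data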

In [29]:
train = time_trim(train)
test = time_trim(test)

# drop returns a new DataFrame, so assign the result back
train = train.drop(['Dates','Minute'], axis = 1)
test = test.drop(['Dates','Minute'], axis = 1)
test.head(10)

Unnamed: 0,Id,DayOfWeek,PdDistrict,X,Y,Day,Month,Year,Hour,WeekOfYear
0,0,6,BAYVIEW,-122.399588,37.735051,10,5,2015,23,19
1,1,6,BAYVIEW,-122.391523,37.732432,10,5,2015,23,19
2,2,6,NORTHERN,-122.426002,37.792212,10,5,2015,23,19
3,3,6,INGLESIDE,-122.437394,37.721412,10,5,2015,23,19
4,4,6,INGLESIDE,-122.437394,37.721412,10,5,2015,23,19
5,5,6,TARAVAL,-122.459024,37.713172,10,5,2015,23,19
6,6,6,INGLESIDE,-122.425616,37.739351,10,5,2015,23,19
7,7,6,INGLESIDE,-122.412652,37.739750,10,5,2015,23,19
8,8,6,MISSION,-122.418700,37.765165,10,5,2015,23,19
9,9,6,CENTRAL,-122.413935,37.798886,10,5,2015,23,19
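
To make the derived columns concrete, here is a small sketch that applies the same `.dt` accessors to two toy timestamps (the `demo` frame is made up). It shows why `DayOfWeek` appears as an integer in the table above: pandas numbers the days from Monday=0 to Sunday=6.

```python
import pandas as pd

# Two toy timestamps (hypothetical rows, not taken from the data set).
demo = pd.DataFrame({
    'Dates': pd.to_datetime(['2015-05-10 23:59:00', '2003-01-06 00:01:00'])
})

# Same accessors as time_trim, applied to the toy frame.
demo['DayOfWeek'] = demo['Dates'].dt.dayofweek      # Monday=0 ... Sunday=6
demo['Hour'] = demo['Dates'].dt.hour
# ISO week number, cast to a plain integer column.
demo['WeekOfYear'] = demo['Dates'].dt.isocalendar().week.astype(int)

print(demo)
```

10 May 2015 was a Sunday in ISO week 19, which matches the `6` and `19` shown in the test rows above.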


In [34]:
# unique is a method and must be called to return the distinct district names
train['PdDistrict'].unique()

In [36]:
from sklearn.preprocessing import LabelEncoder
enc_pdd = LabelEncoder()

# fit on the training data only, then reuse the same mapping for the test data
train['PdDistrict'] = enc_pdd.fit_transform(train['PdDistrict'])
test['PdDistrict'] = enc_pdd.transform(test['PdDistrict'])

enc_cat = LabelEncoder()

enc_cat.fit(train['Category'])
train['CategoryEncoded'] = enc_cat.transform(train['Category'])

# DayOfWeek already holds integers 0-6 from time_trim, so this encoding leaves it unchanged
enc_day = LabelEncoder()
train['DayOfWeek'] = enc_day.fit_transform(train['DayOfWeek'])

print(train['DayOfWeek'])

0         2
1         2
2         2
3         2
4         2
5         2
6         2
7         2
8         2
9         2
10        2
11        2
12        2
13        2
14        2
15        2
16        2
17        2
18        2
19        2
20        2
21        2
22        2
23        2
24        2
25        2
26        2
27        2
28        2
29        2
         ..
878019    0
878020    0
878021    0
878022    0
878023    0
878024    0
878025    0
878026    0
878027    0
878028    0
878029    0
878030    0
878031    0
878032    0
878033    0
878034    0
878035    0
878036    0
878037    0
878038    0
878039    0
878040    0
878041    0
878042    0
878043    0
878044    0
878045    0
878046    0
878047    0
878048    0
Name: DayOfWeek, dtype: int64
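
When encoding labels for both data sets, the usual pattern is to fit the encoder once on the training values and only transform the test values, so both share one mapping. A minimal sketch on toy lists (`train_districts` and `test_districts` are made-up samples):

```python
from sklearn.preprocessing import LabelEncoder

# Toy district samples (hypothetical values).
train_districts = ['NORTHERN', 'PARK', 'BAYVIEW', 'PARK']
test_districts = ['BAYVIEW', 'NORTHERN']

enc = LabelEncoder()
# Learn the mapping from the training values; classes are stored in sorted order.
train_codes = enc.fit_transform(train_districts)
# Reuse the learned mapping; transform raises an error on labels unseen during fit.
test_codes = enc.transform(test_districts)

print(list(enc.classes_))
print(train_codes.tolist())
print(test_codes.tolist())

# inverse_transform maps integer codes back to the original labels.
print(list(enc.inverse_transform(test_codes)))
```

Calling `fit_transform` separately on each data set only gives matching codes when both contain exactly the same set of labels, which is why a single fit is the safer choice.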


In [20]:
print(enc_cat.classes_)

['ARSON' 'ASSAULT' 'BAD CHECKS' 'BRIBERY' 'BURGLARY' 'DISORDERLY CONDUCT'
 'DRIVING UNDER THE INFLUENCE' 'DRUG/NARCOTIC' 'DRUNKENNESS' 'EMBEZZLEMENT'
 'EXTORTION' 'FAMILY OFFENSES' 'FORGERY/COUNTERFEITING' 'FRAUD' 'GAMBLING'
 'KIDNAPPING' 'LARCENY/THEFT' 'LIQUOR LAWS' 'LOITERING' 'MISSING PERSON'
 'NON-CRIMINAL' 'OTHER OFFENSES' 'PORNOGRAPHY/OBSCENE MAT' 'PROSTITUTION'
 'RECOVERED VEHICLE' 'ROBBERY' 'RUNAWAY' 'SECONDARY CODES'
 'SEX OFFENSES FORCIBLE' 'SEX OFFENSES NON FORCIBLE' 'STOLEN PROPERTY'
 'SUICIDE' 'SUSPICIOUS OCC' 'TREA' 'TRESPASS' 'VANDALISM' 'VEHICLE THEFT'
 'WARRANTS' 'WEAPON LAWS']
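
The ordering of `classes_` matters once a classifier is trained on the encoded labels, because `predict_proba` returns one column per class in that same order. The sketch below assumes the goal is a table of per-category probabilities; it uses random toy features, three made-up labels, and a default `LogisticRegression`, none of which reflect the model actually chosen in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder

# Toy features and labels (random values, for illustration only).
rng = np.random.RandomState(0)
X_demo = rng.rand(60, 2)
y_demo = np.array(['ARSON', 'ASSAULT', 'BURGLARY'] * 20)

# Encode the string labels as integers, as done for Category above.
enc = LabelEncoder()
y_encoded = enc.fit_transform(y_demo)

# Any classifier with predict_proba would do; LogisticRegression is an assumption.
clf = LogisticRegression()
clf.fit(X_demo, y_encoded)

# Columns of predict_proba follow clf.classes_ (0, 1, 2), which lines up with enc.classes_.
proba = clf.predict_proba(X_demo[:5])
proba_df = pd.DataFrame(proba, columns=enc.classes_)
print(proba_df.round(3))
```

Each row of `proba_df` sums to 1, and the column headers are the original category names recovered from the encoder.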


In [13]:
#from sklearn.ensemble import RandomForestClassifier
#clf = RandomForestClassifier(n_estimators=10)
#clf.fit(train, train['CategoryEncoded'])
#test = 