## Model Iteration
Date Created: 14 February 2016

This model iteration makes crime category predictions on the test data for the San Francisco Crime Classification Kaggle competition:
https://www.kaggle.com/c/sf-crime

as of 14.02.16
Rank: /
Score: %

In [None]:
### Importing Modules and Data

In [2]:
import pandas as pd
import numpy as np
import zipfile

In [14]:
#importing train dataset
z_train = zipfile.ZipFile('train.csv.zip')
train = pd.read_csv(z_train.open('train.csv'), parse_dates=['Dates'], index_col=False)

In [15]:
#importing test dataset
z_test = zipfile.ZipFile('test.csv.zip')
test = pd.read_csv(z_test.open('test.csv'), parse_dates=['Dates'], index_col=False)
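
As a side note, pandas can also read a single-file zip archive directly, which avoids the explicit `zipfile` step. The sketch below builds a tiny stand-in archive (the file name `toy.csv.zip` and its contents are made up for illustration, not the competition data) and reads it back with `compression='zip'`.

```python
import os
import tempfile
import zipfile

import pandas as pd

# Build a tiny stand-in archive (hypothetical data, not the competition file).
tmp_dir = tempfile.mkdtemp()
zip_path = os.path.join(tmp_dir, 'toy.csv.zip')
with zipfile.ZipFile(zip_path, 'w') as zf:
    zf.writestr('toy.csv', 'Dates,X\n2015-05-13 23:53:00,-122.4\n')

# compression='zip' expects the archive to contain exactly one file.
toy = pd.read_csv(zip_path, compression='zip', parse_dates=['Dates'])
print(toy.dtypes)
```

This only works when the archive holds exactly one CSV; for multi-file archives the `zipfile` approach used above is still needed.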

### Modifying and Trimming Data

Here, we analyze the data and modify it accordingly. Looking at the columns of the training and testing data, we see that the Resolution column is not really needed. Moreover, some columns such as PdDistrict and Address seem to overlap, so we may choose to use one of them, or some altered version of each. We also drop the Descript column from the train data, as it has a great number of unique values and is not present in the test data.
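
The column comparison described above can be sketched on toy data. The frames `toy_train` and `toy_test` and their values are hypothetical stand-ins for the real data sets; the idea is simply to list the columns that exist only in the training data and to count distinct values per column.

```python
import pandas as pd

# Toy frames standing in for train and test (hypothetical values).
toy_train = pd.DataFrame({
    'Category': ['WARRANTS', 'LARCENY/THEFT'],
    'Descript': ['WARRANT ARREST', 'GRAND THEFT FROM LOCKED AUTO'],
    'Resolution': ['ARREST, BOOKED', 'NONE'],
    'PdDistrict': ['NORTHERN', 'PARK'],
})
toy_test = pd.DataFrame({'Id': [0, 1], 'PdDistrict': ['BAYVIEW', 'NORTHERN']})

# Columns present in train but absent from test cannot serve as features.
train_only = sorted(set(toy_train.columns) - set(toy_test.columns))
print(train_only)

# Number of distinct values per column; very high counts hint at columns
# that are expensive to encode as dummy variables.
cardinality = toy_train.nunique()
print(cardinality)
```

Note that Category also shows up as train-only, but it is the prediction target, so it stays.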

In [16]:
print train.info()
print "----------------------------------"
print test.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 878049 entries, 0 to 878048
Data columns (total 9 columns):
Dates         878049 non-null datetime64[ns]
Category      878049 non-null object
Descript      878049 non-null object
DayOfWeek     878049 non-null object
PdDistrict    878049 non-null object
Resolution    878049 non-null object
Address       878049 non-null object
X             878049 non-null float64
Y             878049 non-null float64
dtypes: datetime64[ns](1), float64(2), object(6)
memory usage: 67.0+ MB
None
----------------------------------
<class 'pandas.core.frame.DataFrame'>
Int64Index: 884262 entries, 0 to 884261
Data columns (total 7 columns):
Id            884262 non-null int64
Dates         884262 non-null datetime64[ns]
DayOfWeek     884262 non-null object
PdDistrict    884262 non-null object
Address       884262 non-null object
X             884262 non-null float64
Y             884262 non-null float64
dtypes: datetime64[ns](1), float64(2), int64(1), object(3)

In [None]:
train = train.drop(['Descript', 'Resolution'], axis = 1)

### Tools

Information in some of the columns is extracted and separated into new columns for easier evaluation.

The extract_date function splits the Dates column so that it can be used more conveniently. Previously, one column held the year, month, day, and time of day together; now each gets its own column. This lets us look at crime trends by day, month, or year, and gives us more flexibility when building features.

The extract_time function divides the Time column into Hour, Minute and Second. We will mainly be using the Time column.

The make_binary_fields function creates dummy variables from the values in an existing column. Dummy variables work well with random forest models, although they increase the number of columns in the data set. We will probably use them with the random forest or gradient boosting method.

In [5]:
def extract_date(df):
    """
    function specifically for Dates field only
    creates new fields (as strings)
    Year  YYYY
    Month MM
    Date  DD
    Time  HH:MM:SS
    """
    # Dates is parsed as datetime64 on import, so format it instead of slicing it
    dates = pd.to_datetime(df['Dates'])
    df['Year'] = dates.dt.strftime('%Y')
    df['Month'] = dates.dt.strftime('%m')
    df['Date'] = dates.dt.strftime('%d')
    df['Time'] = dates.dt.strftime('%H:%M:%S')
    return

def extract_time(df):
    """
    function specifically for Time field only
    creates new fields (as strings)
    Hour   HH
    Minute MM
    Second SS
    """
    # Time is an 'HH:MM:SS' string: slice out each two-character piece
    df['Hour'] = df['Time'].str[:2]
    df['Minute'] = df['Time'].str[3:5]
    df['Second'] = df['Time'].str[6:8]
    return

def make_binary_fields(df, field):
    """
    creates one new field per unique value in df[field], named after that value
    if the original data match the new field name, the data will be 1
    if the original data does not match the new field name, the data will be 0

    ex
    make_binary_fields(df, 'DayOfWeek')
    will create new fields
    Monday, Tuesday, Wednesday, Thursday, Friday, Saturday and Sunday
    where
    df['Monday'] will have value 1 for all Mondays and 0 for the rest
    """
    for new_field in df[field].unique():
        # boolean comparison cast to int gives the 0/1 dummy column directly
        df[new_field] = (df[field] == new_field).astype(int)
    return
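
As an alternative to the hand-written loop, pandas ships `pd.get_dummies`, which builds the same kind of 0/1 columns in one call. The following is a minimal sketch on a toy frame (the data is made up for illustration), not the approach used in this notebook.

```python
import pandas as pd

# Toy data standing in for the DayOfWeek column (hypothetical values).
toy = pd.DataFrame({'DayOfWeek': ['Monday', 'Tuesday', 'Monday']})

# One 0/1 column per distinct value; dtype=int keeps the result numeric.
dummies = pd.get_dummies(toy['DayOfWeek'], dtype=int)

# Attach the dummy columns next to the original column.
toy = pd.concat([toy, dummies], axis=1)
print(toy)
```

Unlike make_binary_fields, this returns a new frame instead of modifying the input in place, so the result has to be assigned back.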

### Additional Functions

Adapting a script shared on Kaggle, we create a streamlined function that adds new time features to the data set based on the 'Dates' column.

In [7]:
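def time_trim(data):
    """
    adds Day, Month, Year, Hour, Minute and WeekOfYear fields derived from Dates
    note: DayOfWeek is overwritten with integers (Monday=0 ... Sunday=6)
    """
    data['Day'] = data['Dates'].dt.day
    data['Month'] = data['Dates'].dt.month
    data['Year'] = data['Dates'].dt.year
    data['Hour'] = data['Dates'].dt.hour
    data['Minute'] = data['Dates'].dt.minute
    data['DayOfWeek'] = data['Dates'].dt.dayofweek
    # .dt.weekofyear was removed in pandas 2.0; there, use .dt.isocalendar().week
    data['WeekOfYear'] = data['Dates'].dt.weekofyear
    return data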

In [11]:
train = time_trim(train)
test = time_trim(test)
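
To make the effect of these time features concrete, here is a small sketch on two made-up timestamps. It assumes pandas 1.1 or newer, where the week number comes from `.dt.isocalendar().week` rather than the older `.dt.weekofyear`; the frame `toy` is hypothetical.

```python
import pandas as pd

# Two made-up timestamps standing in for the Dates column.
toy = pd.DataFrame({
    'Dates': pd.to_datetime(['2015-05-13 23:53:00', '2003-01-06 00:01:00'])
})

toy['Hour'] = toy['Dates'].dt.hour
# dayofweek is an integer with Monday=0 and Sunday=6.
toy['DayOfWeek'] = toy['Dates'].dt.dayofweek
# ISO week number; assumes pandas >= 1.1.
toy['WeekOfYear'] = toy['Dates'].dt.isocalendar().week.astype(int)
print(toy)
```

Because the day of the week becomes an integer here, the original string values such as 'Wednesday' are no longer available afterwards; any dummy variables based on the day names would need to be created first.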