# Data Science: Choose Your Own Adventure
**By Mac-I and Sophia**

For this project, we decided to use the [San Francisco Crime Dataset](https://www.kaggle.com/c/sf-crime), with the goal of predicting the category of a crime based on the date/time and location of the report. This notebook serves as a writeup of the work that we have done on this project, in both data exploration and building a model.

## Introduction
Before we get started, let's talk about what we're trying to do and what information we actually have!

In the dataset, the information we have is:
* **Dates**
* Category
* Description
* **Day of Week**
* **Police District**
* Resolution
* **Address**
* **X (Longitude)**
* **Y (Latitude)**

The bolded items are the ones that occur in both the test and training datasets. In other words, the bolded items are the ones that we will be using to predict the category of the crime.


## Importing Everything
To keep our code neat, let's import all the helper libraries we need up here!

In [None]:
%matplotlib inline
import shapefile
import pandas as pd
import numpy as np
import itertools
import re
from time import time

#data exploration imports
import matplotlib.pyplot as plt
from matplotlib.patches import Polygon
from matplotlib.collections import PatchCollection
from matplotlib import cm
from datetime import datetime
from ipywidgets import widgets  
from IPython.display import display

import seaborn as sns

#Building and testing model inputs
from sklearn.ensemble import RandomForestClassifier
from sklearn.cross_validation import StratifiedShuffleSplit
from sklearn.metrics import log_loss
import xgboost as xgb

isPowerful = False


## Data Exploration
Before we get into building a model, we're going to start by exploring the dataset to see what kinds of relationships exist in it.

First, let's start by reading in the dataset.


In [None]:
crimeData = pd.read_csv('train.csv')
crimeData

Now that we've read in the data, we can see that we have a timestamp column. This, however, is a string, so let's decompose it into year, month, day, and hour values.

In [None]:
crimeData['DateTime'] = crimeData['Dates'].apply(
    lambda x: datetime.strptime(x, '%Y-%m-%d %H:%M:%S'))

crimeData['Year'] = crimeData['DateTime'].apply(lambda x: x.year)
crimeData['Month'] = crimeData['DateTime'].apply(lambda x: x.month)
crimeData['Day'] = crimeData['DateTime'].apply(lambda x: x.day)
crimeData['Hour'] = crimeData['DateTime'].apply(lambda x: x.hour)
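
As a side note, the same decomposition can probably be done without a per-row `lambda`. The sketch below is only an illustration on a two-row stand-in for the `Dates` column (the values are made up); it parses the strings once with `pd.to_datetime` and then reads the parts through the `.dt` accessor.

```python
import pandas as pd

# Tiny stand-in for the 'Dates' column of train.csv (values are made up)
sample = pd.DataFrame({'Dates': ['2015-05-13 23:53:00', '2003-01-06 00:01:00']})

# Parse the strings once, then pull the parts out with the .dt accessor
sample['DateTime'] = pd.to_datetime(sample['Dates'], format='%Y-%m-%d %H:%M:%S')
sample['Year'] = sample['DateTime'].dt.year
sample['Month'] = sample['DateTime'].dt.month
sample['Day'] = sample['DateTime'].dt.day
sample['Hour'] = sample['DateTime'].dt.hour
```

On the full training set this should give the same columns as the cell above, just computed in a vectorized way.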

In the dataset, we have two different kinds of data: location data and time data that we can use to predict the type of crime.

First, we'll start by creating plots of the crimes for each location. 

### The Relationship between Location and Crime Category
In the graphs below, we'll show some of the work we did to explore how location and the category of crime are related. 

To do this, we're going to use ipython notebook widgets, to allow a user to choose a category and then display all the crimes of that category that occurred in our training data. 

First, we will create the dataframe that we need to plot the information. 

In [None]:
#Only get the non lat-long-outlier crime reports
displayCrimeData = crimeData[(crimeData.X<-121) & (crimeData.Y<40)]

Next, we will create a function that given a crime, plots the crimes of that category. (Thanks to our [classmates](https://github.com/BrennaManning/DataScience16CYOA/blob/master/data_exploration.ipynb) for providing the idea to do a hexbin plot!)

In [None]:
def crime_map_display(crime):
    #Load in the map data and set the appropriate lat long variables
    mapdata = np.loadtxt("sf_map_copyright_openstreetmap_contributors.txt")
    asp = mapdata.shape[0] * 1.0 / mapdata.shape[1]
    clipsize = [[-122.5247, -122.3366],[ 37.699, 37.8299]]
    lon_lat_box=[-122.52469, -122.33663, 37.69862, 37.82986]
    
    #get only the crimes of the category we are showing.
    crimeDataS = displayCrimeData[displayCrimeData.Category == crime]
    plt.figure()
    plt.grid(False)
    #ax = sns.kdeplot(crimeDataS.Xok, crimeDataS.Yok, clip=clipsize, aspect=1/asp)

    plt.imshow(mapdata, cmap=plt.get_cmap('gray'), 
                  extent=lon_lat_box, 
                  aspect=asp)
    
    plt.hexbin(crimeDataS.X, crimeDataS.Y, gridsize=100,
           extent=lon_lat_box, alpha=0.5, cmap=plt.get_cmap('Blues'), bins='log')
    
#     g = sns.jointplot(crimeDataS['Latitude'], crimeDataS['Longitude'], kind="hex")
#     g = sns.regplot(x="Latitude", y="Longitude", data=crimeDataS, fit_reg=False, scatter_kws={'alpha':0.3})
    plt.title(crime)
    cb = plt.colorbar()
    cb.set_label('log10(Number of Crimes)')
    

Now, we will want to actually call this function when we update a widget. 

In [None]:
categories = list(zip(crimeData.Category.unique(), crimeData.Category.unique()))
crime = widgets.Select(options=categories, description='Select one of the categories:')
widgets.interact(crime_map_display, crime = crime)

Here, as opposed to using just the number of crimes that occurred, we are using the log<sub>10</sub> of the number of crimes that occurred. If we do not do this, we see most of the cells as clear and one or two as dark blue. Plotting the log<sub>10</sub> of the number of crimes allows us to more clearly see how the number of crimes changes. 
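
To make the effect of the log scale concrete, here is a small sketch with made-up cell counts (three quiet cells and one hotspot). It is not the exact scaling that `plt.hexbin` applies internally, only an illustration of why a linear color scale washes out everything but the hotspot.

```python
import numpy as np

# Hypothetical per-cell crime counts: three quiet cells and one hotspot
counts = np.array([1, 10, 100, 10000])

# Shade on a 0-1 scale, first linearly and then after taking log10
linear_shade = counts / counts.max()
log_shade = np.log10(counts) / np.log10(counts.max())

print(linear_shade)  # the three quiet cells are all at or below 0.01
print(log_shade)     # the cells are evenly spread between 0 and 1
```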

In these graphs (which load a little slowly), there is notably one patch of land where crimes rarely occur. According to our research, this is a relatively nice park. Additionally, crimes, for the most part, tend to be concentrated in the downtown area.

Notably, it appears that the "other" category, which consists mostly of traffic violations, is, unsurprisingly, concentrated on major roads.

One contrast to crimes being more and more concentrated downtown is the heatmap of prostitution crimes. These tend to just have two centers. 

To give us a little more insight into what is happening, and to break up the data a little more, we decided to split this plot up by hour, too, and explore the data per category, per hour. Here we follow the same structure as above (creating a function that plots the data based on our filtered variables, and then calling that function on change of ipython notebook widgets). 

In [None]:
def image_display(crime, time):
    
    #Load in the Map Data
    mapdata = np.loadtxt("sf_map_copyright_openstreetmap_contributors.txt")
    asp = mapdata.shape[0] * 1.0 / mapdata.shape[1]
    clipsize = [[-122.5247, -122.3366],[ 37.699, 37.8299]]
    lon_lat_box=[-122.52469, -122.33663, 37.69862, 37.82986]
    
    #Combine both conditions into one mask instead of chaining two boolean filters
    crimeDataS = crimeData[(crimeData.Category == crime) & (crimeData.Hour == time)]
    plt.figure()
    plt.grid(False)

    plt.imshow(mapdata, cmap=plt.get_cmap('gray'), 
                  extent=lon_lat_box, 
                  aspect=asp)
    
    plt.hexbin(crimeDataS.X, crimeDataS.Y, gridsize=100,
           extent=lon_lat_box, alpha=0.5, cmap=plt.get_cmap('Blues'), bins='log')
    
    plt.title(crime + " at time :" + str(time))
    cb = plt.colorbar()
    cb.set_label('log10(Number of Crimes)')

In [None]:
vals = list(zip(crimeData.Category.unique(), crimeData.Category.unique()))
crime = widgets.Select(options=vals, description='Select one of the values:')
#Name the slider hour_slider so it does not shadow the time() function imported above
hour_slider = widgets.IntSlider(min=0, max=23, value=0)
widgets.interact(image_display, crime = crime, time=hour_slider)

Some notable takeaways here are that:
* For most categories of crime, there is a lull around 3 am
* Most crime tends to be concentrated downtown. We suspect that this is because many more people work/live in this area than anywhere else, not because this is actually a more dangerous place to live

Although it is helpful to see the map for different categories, for different hours, this view of the data does not provide the most intuitive way to visualize time-based crime patterns. 

### Crime Distribution by Day of Week and Hour
To dig into this, let's look first at the distribution of crime patterns by day of week and hour of the day. Using the same pattern described above, we'll implement a function that creates a plot and declare the widgets to control it.

First, let's group the crimes by day of week. 

In [None]:
groupedByDayOfWeek = crimeData.groupby(['DayOfWeek', 'Category']).count().reset_index()

Now we can declare our function and create the widget.

In [None]:
def image_display_day_of_week(i):
    # Get the string for the day of week
    dayOfWeek = ['Sunday','Monday','Tuesday','Wednesday','Thursday','Friday','Saturday']
    day = dayOfWeek[i]
    
    #Get the number of crimes that occurred on this day of week
    #(For normalization purposes)
    totalCrimes = sum(groupedByDayOfWeek[groupedByDayOfWeek.DayOfWeek == day]['Dates'].tolist())
    
    #Get a list of the different crime types
    crimeTypes = sorted(crimeData.Category.unique().tolist())
    #Calculate the percentage of crimes that occurred that were of a given category
    crimeCountsPercent = []
    for crime in crimeTypes:
        countList = groupedByDayOfWeek[(groupedByDayOfWeek.DayOfWeek == day) & (groupedByDayOfWeek.Category == crime)]['Dates'].tolist()
        #Here, we're doing some error handling 
        #If no crimes of a certain type occurred on a given day, 
        #then append zero
        if (len(countList) > 0):
            count = countList[0]
        else:
            count = 0

        crimeCountsPercent.append(1.0*count/totalCrimes)

    #Create the figure
    plt.figure(figsize=(10,8))
    plt.bar([x + 0.1 for x in range(len(crimeCountsPercent))], crimeCountsPercent, width = 0.8)
    plt.xticks([x + 0.5 for x in range(len(crimeCountsPercent))], crimeTypes, rotation='vertical')
    plt.axis([-0.5, 39.5, 0 ,0.3])
    plt.title('Crime Breakdown where day of week is ' + str(day))
    plt.xlabel('Type of Crime')
    plt.ylabel('Fraction of Crimes')

#Outside of the function create the widgets
step_slider = widgets.IntSlider(min=0, max=6, value=0)
widgets.interact(image_display_day_of_week, i=step_slider)

This is a slightly easier way of seeing that for all days of the week, larceny/theft and other offenses are the most common crimes. 

In general, it appears that crimes like larceny/theft, assault, and drug/narcotic-related offenses increase over the weekend and decrease during the week.
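
The per-day normalization used in the plot above can also be sketched in one step with `pd.crosstab`. The toy frame below is made up and only shares the `DayOfWeek` and `Category` column names with the real data; `normalize='index'` divides each row by its total, so every day sums to 1.

```python
import pandas as pd

# Made-up reports; only the column names match the real dataset
toy = pd.DataFrame({
    'DayOfWeek': ['Monday', 'Monday', 'Monday', 'Saturday', 'Saturday'],
    'Category': ['ASSAULT', 'LARCENY/THEFT', 'LARCENY/THEFT', 'ASSAULT', 'ASSAULT'],
})

# One row per day, one column per category, each row normalized to sum to 1
shares = pd.crosstab(toy['DayOfWeek'], toy['Category'], normalize='index')
print(shares)
```

A table like this also handles the zero-occurrence case automatically, since missing day/category pairs simply show up as 0.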

Additionally, to explore whether certain crimes happen more often at certain times of day, let's also make a similar plot for hour of the day. First, we group by hour:

In [None]:
groupedByHour = crimeData.groupby(['Hour', 'Category']).count().reset_index()

In [None]:
def image_display_hour(i):
    #get the hour of the day
    hour = i
    #Count the number of crimes that occurred in that hour
    #(for normalization)
    totalCrimes = sum(groupedByHour[groupedByHour.Hour == hour]['Dates'].tolist())
    
    #Get the number of each type of crime that occurred in that hour
    crimeTypes = sorted(crimeData.Category.unique().tolist())
    crimeCountsPercent = []
    for crime in crimeTypes:
        countList = groupedByHour[(groupedByHour.Hour == hour) & (groupedByHour.Category == crime)]['Dates'].tolist()
        #Handle the zero-occurrence case
        if (len(countList) > 0):
            count = countList[0]
        else:
            count = 0

        crimeCountsPercent.append(1.0*count/totalCrimes)

    #create the plot
    plt.figure(figsize=(10,8))
    plt.bar([x + 0.1 for x in range(len(crimeCountsPercent))], crimeCountsPercent, width = 0.8)
    plt.xticks([x + 0.5 for x in range(len(crimeCountsPercent))], crimeTypes, rotation='vertical')
    plt.axis([-0.5, 39.5, 0 ,0.3])
    plt.title('Crime Breakdown where hour = ' + str(hour))
    plt.xlabel('Type of Crime')
    plt.ylabel('Fraction of Crimes')


#Create the widget
step_slider = widgets.IntSlider(min=0, max=23, value=0)
widgets.interact(image_display_hour, i=step_slider)

Here, we can see that the distribution of crimes does change over the course of the day.

There is a huge shift in the distribution toward mostly larceny/theft crimes around 6pm. We hypothesize that this is because most larceny/theft is reported when people get home or to their cars after work.

Additionally, at around 2 am, it appears that the distribution shifts so that assault is the most common crime.

### Visualizing Patterns of Occurrence for Crimes
Although these visualizations help us get a general sense of what the distribution of crimes looks like at various times, it would also be helpful to visualize the pattern of one crime over time.

To do this, let's again use ipython widgets. Because this is a little computationally intensive, though, let's create the dataframe outside of the function where we update the plot. 


In [None]:
groupedByTime = crimeData.groupby(['DayOfWeek', 'Hour', 'Category']).count().reset_index()

Now, we can write a function to plot the time heatmap for a given category.

In [None]:
def show_time_heatmap(crime):
    daysOfWeek = ['Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday']
    daysOfWeekDisp = ['Sun', 'Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat']
    hours = range(24)
    
    numCrimes = np.zeros((len(hours), len(daysOfWeek)))
    for i,hour in enumerate(hours):
        for j,dayOfWeek in enumerate(daysOfWeek):
            try:
                crimeCount = groupedByTime[(groupedByTime.DayOfWeek == dayOfWeek) & 
                                           (groupedByTime.Hour == hour) & 
                                           (groupedByTime.Category == crime)]['Dates'].tolist()[0]
            except IndexError: #no crimes matched this day, hour, and category
                crimeCount = 0


            numCrimes[23-hour][j] = int(crimeCount)

    g = sns.heatmap(numCrimes, annot=True, fmt='.0f')
    g.set_title(crime)
    g.set(xticklabels = daysOfWeekDisp)
    g.set(yticklabels = hours)
    
    
crimeCategories = groupedByTime.Category.unique().tolist()
vals = list(zip(crimeData.Category.unique(), crimeData.Category.unique()))
crime = widgets.Select(options=vals, description='Select one of the values:')
widgets.interact(show_time_heatmap, crime = crime)

In exploring these plots, it appears that there tends to be a similar pattern of higher levels of crime in the evening and lower levels of crime in the morning. Many crimes also occur more often on the weekends. 

Notable exceptions to this pattern appear to be drug/narcotic-related offenses (which are most common on Wednesdays around lunch time) and missing person cases, which appear to be reported most often in the mornings, when people don't show up for work.
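
For reference, the hour-by-day count matrix behind these heatmaps could also be built without the nested loop. The sketch below is an assumption-laden illustration: `time_matrix` is a hypothetical helper, the toy rows are made up, and the day order simply follows the Sunday-first list used above.

```python
import pandas as pd

days = ['Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday']

# Made-up reports with the same column names as the real data
toy = pd.DataFrame({
    'DayOfWeek': ['Monday', 'Monday', 'Friday', 'Friday', 'Friday'],
    'Hour': [9, 9, 22, 22, 23],
    'Category': ['ASSAULT', 'ASSAULT', 'ASSAULT', 'ASSAULT', 'ARSON'],
})

def time_matrix(df, crime):
    '''Hypothetical helper: count reports of one category per (hour, day)'''
    subset = df[df['Category'] == crime]
    # Count rows per (Hour, DayOfWeek), then turn the days into columns
    counts = subset.groupby(['Hour', 'DayOfWeek']).size().unstack(fill_value=0)
    # Make sure every hour and every day is present, filling gaps with 0
    return counts.reindex(index=range(24), columns=days, fill_value=0)

matrix = time_matrix(toy, 'ASSAULT')
```

Here `matrix` has hour 0 in its first row, so the row order would still need to be flipped to match the layout used in `show_time_heatmap`.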

## Making a Model
Now that we've explored the data, we're going to start building a model. To make our lives easier, we're going to follow a pretty simple workflow. We will:
1. Read in the Data
2. Clean the training data
3. Create a Model using the cleaned data
4. Score the model using the crime categories in the training data
5. If the model performs well (better than previous attempts) we will: 
  1. Repeat steps 1-3 with the test data. 
  2. Generate a submission file to upload to Kaggle. 
  
Rather than talking through each successive iteration of our model, the following code will instead talk through everything we developed in each of these steps. 

### Loading the data and constants
Here we load the data into a pandas dataframe and set some initial constants. Specifically, we are hard-coding a variable with all of the different categories. This will come in handy later when we generate a submission file.

In [None]:
crimeData = pd.read_csv('train.csv')
#The categories found in the dataset
Categories = ['ARSON', 'ASSAULT', 'BAD CHECKS', 'BRIBERY', 'BURGLARY',
              'DISORDERLY CONDUCT', 'DRIVING UNDER THE INFLUENCE',
              'DRUG/NARCOTIC', 'DRUNKENNESS', 'EMBEZZLEMENT', 'EXTORTION',
              'FAMILY OFFENSES', 'FORGERY/COUNTERFEITING', 'FRAUD', 'GAMBLING',
              'KIDNAPPING', 'LARCENY/THEFT', 'LIQUOR LAWS', 'LOITERING',
              'MISSING PERSON', 'NON-CRIMINAL', 'OTHER OFFENSES',
              'PORNOGRAPHY/OBSCENE MAT', 'PROSTITUTION', 'RECOVERED VEHICLE',
              'ROBBERY', 'RUNAWAY', 'SECONDARY CODES', 'SEX OFFENSES FORCIBLE',
              'SEX OFFENSES NON FORCIBLE', 'STOLEN PROPERTY', 'SUICIDE',
              'SUSPICIOUS OCC', 'TREA', 'TRESPASS', 'VANDALISM', 'VEHICLE THEFT',
              'WARRANTS', 'WEAPON LAWS']
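
To show why a fixed category list is handy for the submission file, here is a minimal sketch. It assumes the submission is a table with an `Id` column followed by one probability column per category, and it uses a shortened stand-in list and made-up probabilities in place of the real `Categories` list and the output of `predict_proba`.

```python
import numpy as np
import pandas as pd

# Shortened stand-in for the full Categories list
categories = ['ARSON', 'ASSAULT', 'BAD CHECKS']

# Made-up class probabilities, one row per test report
# (a stand-in for what clf.predict_proba would return)
probs = np.array([[0.1, 0.7, 0.2],
                  [0.3, 0.3, 0.4]])

# The column order follows the category list, so index i maps to categories[i]
submission = pd.DataFrame(probs, columns=categories)
submission.index.name = 'Id'
csv_text = submission.to_csv()
print(csv_text)
```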

### Preprocessing the data
This section will cover how we cleaned the data as well as how we added features. In this section, we first define a wrapper function that will take in a dataframe that we read in from a csv and return our pre-processed dataframe. This function will call the functions that we write to clean/recode/generate new features for the data. 

In [None]:
def recodeData(df, isTrain = False):
    '''This function takes in the dataframe that we get from loading in the 
    SF crime data and returns a re-coded dataframe that has all the 
    additional features we want to add and the categorical features recoded 
    and cleaned.
    '''

    #All of these functions return both the new dataframe and the list of columns that we added. 
    df, newLatLon = removeOutlierLatLon(df)
    df, newDate = recodeDates(df)
    df, newDistrict = recodePoliceDistricts(df)
    df, newAddress, streetColumns = recodeAddresses(df)
    df, newWeather = addWeather(df)

    #Add the new columns to our list of added columns
    addedColumns = [] 
    addedColumns += newDate
    addedColumns += newDistrict 
    addedColumns += newLatLon
    addedColumns += newAddress
    addedColumns += newWeather
   

    #If this is the training data, we want to recode the crime category into integers.
    #We also want to remove the columns that we will not have access to in the test set.
    if (isTrain):
        #recodeCategories returns both the dataframe and the list of new columns
        df, newCategory = recodeCategories(df)
        addedColumns += newCategory
        try: #prevents error if the columns have already been removed
            columnsToDrop = ['Descript', 'Resolution']
            df.drop(columnsToDrop, axis=1, inplace=True)
        except (KeyError, ValueError):
            print("already recoded")
         

    return df, addedColumns, streetColumns

#### Recode the Categories
Here we turn the category names into integers to ease classification. Not all of the models that we used can handle text-based category data, so we need to convert the categories to a number. We do this by mapping each category to its respective index in the category list. 

In [None]:
def recodeCategories(df):
    '''This function will recode the Categories from strings into integers'''
    df['CategoryRecode'] = df.Category.apply(lambda x: Categories.index(x))
        
    return df, ['CategoryRecode']
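
As a small illustration of the index mapping, the sketch below builds a lookup dictionary once instead of calling `list.index` for every row. The three-item list is a shortened stand-in for `Categories`, and `category_to_index` is a name we made up for this example.

```python
# Shortened stand-in for the full Categories list
categories = ['ARSON', 'ASSAULT', 'BAD CHECKS']

# Build the name -> integer lookup once
category_to_index = {name: i for i, name in enumerate(categories)}

labels = ['ASSAULT', 'ARSON', 'ASSAULT']
codes = [category_to_index[name] for name in labels]

# The same list turns the integers back into names
decoded = [categories[i] for i in codes]
```

With pandas, the equivalent step would be something like `df.Category.map(category_to_index)`.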

#### Fixing the Latitudes and Longitudes that do not fall in San Francisco
During our data exploration we noticed that some of the latitudes and longitudes listed were not anywhere close to San Francisco. In order to fix this we calculated the median latitude and longitude for each police district. We then assigned the appropriate median latitude and longitude to those data points with invalid latitudes and longitudes.

In [None]:
def removeOutlierLatLon(df):
    '''This function will replace outlier Latitudes and Longitudes with the median for their police district'''
    df.loc[df.X > -121, 'X'] = df.loc[(df.X > -121)].apply(lambda row: df.X[df["PdDistrict"] == row['PdDistrict']].median(), axis=1)
    df.loc[df.Y > 38, 'Y'] = df.loc[(df.Y > 38)].apply(lambda row: df.Y[df["PdDistrict"] == row['PdDistrict']].median(), axis=1)

    return df, ['X', 'Y']
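
The same median replacement can be sketched with a `groupby` and `transform`, which computes each district's median once rather than once per outlier row. The coordinates below are made up for illustration; the thresholds are the ones used in `removeOutlierLatLon`.

```python
import pandas as pd

# Made-up reports: row 2 has coordinates far outside San Francisco
toy = pd.DataFrame({
    'PdDistrict': ['MISSION', 'MISSION', 'MISSION', 'PARK', 'PARK'],
    'X': [-122.42, -122.41, -120.5, -122.45, -122.44],
    'Y': [37.76, 37.75, 90.0, 37.77, 37.78],
})

# Median X and Y of each row's district, aligned with the rows of toy
medians = toy.groupby('PdDistrict')[['X', 'Y']].transform('median')

# Overwrite only the out-of-range values with their district median
bad_x = toy['X'] > -121
bad_y = toy['Y'] > 38
toy.loc[bad_x, 'X'] = medians.loc[bad_x, 'X']
toy.loc[bad_y, 'Y'] = medians.loc[bad_y, 'Y']
```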

#### Recoding Dates
In order to make the "Dates" column useful we needed to recode it into columns such as "Year", "Month", "Day", "Hour", and "Minute". We also needed to recode the "DayOfWeek" into a numerical format since some of the models can't handle string categorical data. It also makes sense from the perspective that a model may want to group data by weekday vs weekend, which is much easier with numerical data.

In [None]:
def recodeDates(df):
    '''This function takes in a dataframe and recodes the date field into 
    useable values. Here, we also recode the day of week.'''
    #Recode the dates column to year, month, day and hour columns
    df['DateTime'] = pd.to_datetime(df['Dates'], format ='%Y-%m-%d %H:%M:%S')

    df['Year'] = df['DateTime'].apply(lambda x: x.year)
    df['Month'] = df['DateTime'].apply(lambda x: x.month)
    df['Day'] = df['DateTime'].apply(lambda x: x.day)
    df['Hour'] = df['DateTime'].apply(lambda x: x.hour)
    df['Minute'] = df['DateTime'].apply(lambda x: x.minute)
    df['DayOfWeekRecode'] = df['DateTime'].apply(lambda x: x.weekday())

    return df, ['Year', 'Month', 'Day', 'Hour', 'Minute', 'DayOfWeekRecode']
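
One detail worth keeping in mind is the numbering that `weekday()` produces: Monday is 0 and Sunday is 6, which differs from the Sunday-first lists we used in the exploration plots. The sketch below shows the numbering and a weekend flag; `is_weekend` is a hypothetical helper and not a feature we added to the model.

```python
from datetime import datetime

saturday = datetime(2015, 5, 16, 12, 0)
monday = datetime(2015, 5, 18, 12, 0)

def is_weekend(stamp):
    '''Hypothetical helper: True for Saturday (5) and Sunday (6)'''
    return stamp.weekday() >= 5

print(monday.weekday(), saturday.weekday())
print(is_weekend(monday), is_weekend(saturday))
```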

#### Recoding the Police districts
Similarly to the "DayOfWeek" column, the "PdDistrict" column needed to be recoded in order to be useful. We did this with one-hot encoding since there is no inherent order to the districts, unlike day of week, where there is an order.

In [None]:
def recodePoliceDistricts(df):
    '''This function recodes the police district to a one-hot encoding scheme.'''
    districts = df['PdDistrict'].unique().tolist()
    
    dummies = pd.get_dummies(df['PdDistrict'], prefix="PdDistrict")
    
    newColumns = dummies.columns.tolist()
    print(newColumns)
    
    df = pd.concat([df, dummies], axis=1)

    return df, newColumns
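
To show what the one-hot encoding looks like, here is a small sketch on a made-up three-row frame. The column names come from the `prefix` argument plus the district name, which is where names like `PdDistrict_MISSION` come from.

```python
import pandas as pd

# Made-up reports with only the district column
toy = pd.DataFrame({'PdDistrict': ['MISSION', 'PARK', 'MISSION']})

# One indicator column per distinct district
dummies = pd.get_dummies(toy['PdDistrict'], prefix='PdDistrict')
print(dummies.columns.tolist())
print(dummies.astype(int))
```

Because the columns are derived from the values present, the training and test data need to contain the same set of districts for the resulting columns to line up.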

#### Recoding the Address field into useful features
The original address field is simply a string that looks like "2000 Block of THOMAS AV" or "JEFFERSON ST / HYDE ST". In order to make this field useful there are a couple of different methods we used. The first thing we did was create a flag indicating whether the address was an intersection of 2 streets or simply a block. We also pulled out the block number (if applicable) as well as the name(s) of the street(s).

In [None]:
def recodeAddresses(df):
    '''This function will attempt to create some features related to the address field in 
    the database. To do this, first, we need to split up the address field into two different
    street fields, a block number, and a boolean specifying whether it's a street corner '''
    
        
    #Also add the "did the crime occur on a street corner field?"
    df['StreetCornerFlag'] = df['Address'].apply(lambda x: len(x.split(" / ")) > 1)
    
    #If there are two streets, split fields. Also extract the block number
    df['street1'] = df['Address'].apply(lambda x: re.sub(r'^\d+ Block of ','',x.split(" / ")[0]))
    df['street2'] = df['Address'].apply(lambda x: (x.split(" / ")[1]) if (len(x.split(" / ")) > 1) else '')

    df['BlockNumber'] = df['Address'].apply(lambda x: int(re.findall(r'^\d+',x)[0]) if (len(re.findall(r'^\d+',x)) > 0) else None )
    df['BlockNumber'] = df['BlockNumber'].fillna(-1)

    
    streetColumns = []
    
    #one-hot encoding the streets requires more RAM than the standard computer has
    if isPowerful: 
        print("starting street dummy creation")

        #create a one-hot encoding of the Streets
        street1Dummy = pd.get_dummies(df['street1'])
        print("completed street 1 dummy creation")
        
        street2Dummy = pd.get_dummies(df['street2'])
        print("completed street 2 dummy creation")

        #turn the 0s into NaNs so that 'combine_first' can merge them
        street1Dummy = street1Dummy.replace(0, np.nan)
        street2Dummy = street2Dummy.replace(0, np.nan)
        
        #merge the 2 one-hot address frames into 1
        mergedStreetDummy = street1Dummy.combine_first(street2Dummy)
        print("completed address dummy DataFrames merge")
        
        #turn the NaNs back into 0s
        mergedStreetDummy = mergedStreetDummy.fillna(0)
        print("completed fillna on mergedStreetDummy")
        
        #extract all the new street columns
        streetColumns = list(mergedStreetDummy.columns.values)

        #merge the street data and the original dataframe
        df = pd.concat([df, mergedStreetDummy], axis=1)
        print("completed merge of original df and new dummy variable df")
    
    return df, ['StreetCornerFlag', 'BlockNumber'], streetColumns
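
The parsing rules can be tried out on the two example addresses with a plain-Python sketch. `parse_address` is a hypothetical helper written for this illustration; note that it only reads a block number when the "Block of" text is present, which is stricter than the leading-digits pattern used in `recodeAddresses`.

```python
import re

def parse_address(address):
    '''Hypothetical helper: returns (is_corner, block_number, street1, street2)'''
    # Intersections are written as "STREET A / STREET B"
    parts = address.split(' / ')
    is_corner = len(parts) > 1

    # Block addresses are written as "<number> Block of <street>"
    block_match = re.match(r'^(\d+) Block of ', address)
    block_number = int(block_match.group(1)) if block_match else -1

    # Strip the block prefix (if any) to leave just the street name
    street1 = re.sub(r'^\d+ Block of ', '', parts[0])
    street2 = parts[1] if is_corner else ''
    return is_corner, block_number, street1, street2

print(parse_address('2000 Block of THOMAS AV'))
print(parse_address('JEFFERSON ST / HYDE ST'))
```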

#### Add Daily Weather Data
We were interested to know if the weather played a role in the types of crimes that occurred, so we added daily information about the max/min temperature as well as precipitation. The data came from NOAA (National Oceanic and Atmospheric Administration) and was pulled from https://www.ncdc.noaa.gov/cdo-web/search. Specifically it came from the [downtown station](http://www.ncdc.noaa.gov/cdo-web/datasets/GHCND/stations/GHCND:USW00023272/detail).

In [None]:
def addWeather(df):
    '''add 'PRCP' (precipitation),'TMAX' (Max Temperature),'TMIN' (Min Temperature) to each data point'''
    
    # create column to merge the weather data on (eg. 01-29-2016 becomes 20160129)
    df['DATE'] = df['DateTime'].apply(lambda x: int(x.strftime('%Y%m%d')))
    
    weatherData = pd.read_csv('weather1.csv')
    
    #replace how NaNs are encoded (the sentinel may be parsed as a string or as a number)
    weatherData = weatherData.replace('-9999', np.nan).replace(-9999, np.nan)
    
    #get subset of full dataframe
    weatherData = weatherData[['DATE','PRCP','TMAX','TMIN']]
    
    #merge the data frames based on the integer column "DATE"
    df = pd.merge(df, weatherData, on='DATE')
    
    return df, ['PRCP','TMAX','TMIN']
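
One thing to watch with this kind of merge is what happens to reports on dates that have no weather row. The sketch below uses made-up frames to compare the default inner join with `how='left'`; it is an illustration of the pandas behavior, not a change to the function above.

```python
import numpy as np
import pandas as pd

# Made-up crime reports: two on a date with weather data, one without
crimes = pd.DataFrame({'DATE': [20150513, 20150513, 20150514],
                       'Hour': [23, 1, 8]})

# Made-up weather row, with -9999 standing in for a missing reading
weather = pd.DataFrame({'DATE': [20150513], 'PRCP': [-9999],
                        'TMAX': [200], 'TMIN': [120]})
weather = weather.replace(-9999, np.nan)

# The default (inner) join drops the report from the date with no weather row
inner = pd.merge(crimes, weather, on='DATE')

# A left join keeps every report and fills the missing weather with NaN
left = pd.merge(crimes, weather, on='DATE', how='left')
```

Keeping every row matters most for the test data, where each report needs a prediction, but the NaN values then have to be handled before they reach a model that cannot accept them.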

#### Recode the Training Data
Now that we have created functions that do all of our pre-processing for us, we recode the training data.

In [None]:
crimeData, addedColumns, streetColumns = recodeData(
    crimeData, isTrain = True)
crimeData.describe()

### Create and Test our Models

#### Prepare and select the data for training
Here we specify which predictors we intend to use to train our model ('columnsToUse'). In the code block below we list some of the most interesting sets of predictors. In order to specify which one to use simply leave that line uncommented. 

In [None]:
#use all of the columns added from our preprocessing of the data
#columnsToUse = addedColumns

#Use just the basic columns (date, time, and location)
columnsToUse = ['X','Y', 'Year', 'Month', 'Day','Hour', 'Minute',
        'PdDistrict_BAYVIEW', 'PdDistrict_CENTRAL', 'PdDistrict_INGLESIDE', 
        'PdDistrict_MISSION', 'PdDistrict_NORTHERN', 'PdDistrict_PARK', 'PdDistrict_RICHMOND', 
        'PdDistrict_SOUTHERN', 'PdDistrict_TARAVAL', 'PdDistrict_TENDERLOIN']

# Use just the basic columns and whether the crimes were reported to be on a street corner
# columnsToUse = ['X','Y', 'Year', 'Month', 'Day','Hour', 'Minute',
#        'DayOfWeekRecode', 'PdDistrict_BAYVIEW', 'PdDistrict_CENTRAL', 'PdDistrict_INGLESIDE', 
#         'PdDistrict_MISSION', 'PdDistrict_NORTHERN', 'PdDistrict_PARK', 'PdDistrict_RICHMOND', 
#         'PdDistrict_SOUTHERN', 'PdDistrict_TARAVAL', 'PdDistrict_TENDERLOIN', 'StreetCornerFlag']

#use the basic columns, whether the crime was on a corner, and the one-hot encoding of the most common streets
#columnsToUse = ['X','Y', 'Year', 'Month', 'Day','Hour', 'Minute',
#        'DayOfWeekRecode', 'PdDistrict_BAYVIEW', 'PdDistrict_CENTRAL', 'PdDistrict_INGLESIDE', 
#         'PdDistrict_MISSION', 'PdDistrict_NORTHERN', 'PdDistrict_PARK', 'PdDistrict_RICHMOND', 
#         'PdDistrict_SOUTHERN', 'PdDistrict_TARAVAL', 'PdDistrict_TENDERLOIN', 'StreetCornerFlag'] + commonStreets

#use the basic columns, whether the crime was on a corner, and the daily weather
#columnsToUse = ['X','Y', 'Year', 'Month', 'Day','Hour', 'Minute',
#        'DayOfWeekRecode', 'PdDistrict_BAYVIEW', 'PdDistrict_CENTRAL', 'PdDistrict_INGLESIDE', 
#         'PdDistrict_MISSION', 'PdDistrict_NORTHERN', 'PdDistrict_PARK', 'PdDistrict_RICHMOND', 
#         'PdDistrict_SOUTHERN', 'PdDistrict_TARAVAL', 'PdDistrict_TENDERLOIN', 'StreetCornerFlag', 'PRCP','TMAX','TMIN']

X = crimeData[columnsToUse]
y = crimeData['CategoryRecode']

print("prepping for training complete")

#### Create function to compute the logloss cross-validation score


In [None]:
def runOnTrainData(clf, X, y, numSplits=3):
    '''This function takes in a classifier and the number of folds to compute the cross-validation 
    score. It then splits the data into multiple training and test sets. For each split the model 
    is trained on the training data and then the logloss score is calculated based on the predictions 
    generated by the model.'''
    
    #split the data into training and test sets while ensuring that every category appears in both sets
    k_folds = StratifiedShuffleSplit(y, numSplits, test_size=0.5, random_state=0)

    #create list to store logloss scores in
    scores = []
    
    print("starting kfold testing")
    #enumerate through all the folds
    for k, (train, test) in enumerate(k_folds):
        print("")
        print("starting fit: " + str(k) + " of " + str(numSplits))
        start = time()
        clf.fit(X.iloc[train], y.iloc[train])
        print("fit complete, time: " + str((time() - start)))
        startPredictTime = time()
        probs = clf.predict_proba(X.iloc[test])
        print("predict complete, time: " + str((time() - startPredictTime)))
        score = log_loss(y.iloc[test].values, probs)
        print("Logloss score: " + str(score))
        print("total time: " + str((time() - start)))
        scores.append(score)

    print("")
    print(scores)
    print("Average score: " + str(np.average(scores)))

#### Our First Model: Random Forest
We chose a random forest classifier as our first model because we wanted to learn more about them. Additionally, random forests seemed to be the most common method on Kaggle. We chose the initial model parameters based on the maximum values our computers could handle and included all of the features we had generated thus far.

In [None]:
columnsToUse = ['X','Y', 'Year', 'Month', 'Day','Hour', 'Minute',
       'DayOfWeekRecode', 'PdDistrict_BAYVIEW', 'PdDistrict_CENTRAL', 'PdDistrict_INGLESIDE', 
        'PdDistrict_MISSION', 'PdDistrict_NORTHERN', 'PdDistrict_PARK', 'PdDistrict_RICHMOND', 
        'PdDistrict_SOUTHERN', 'PdDistrict_TARAVAL', 'PdDistrict_TENDERLOIN']
X = crimeData[columnsToUse]
y = crimeData['CategoryRecode']

clf = RandomForestClassifier(n_estimators=30, max_depth = 7, random_state=1, n_jobs = -1)
runOnTrainData(clf, X, y, numSplits=3)

##### Our reaction
We were happy with this score (2.491) as our first attempt because it was a significant improvement over guessing every category with equal probability, which would give a log loss of ln(39) ≈ 3.66 for the 39 categories.
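
For context on what "guessing equally" would score, here is a minimal sketch of the multiclass log loss in plain Python. `multiclass_log_loss` is a simplified stand-in written for this example, and the four true labels are arbitrary; with equal probability on all 39 categories, every report contributes the same penalty regardless of its true category.

```python
import math

def multiclass_log_loss(true_labels, probs):
    '''Simplified stand-in: mean negative log of the probability
    assigned to the true class of each sample'''
    total = 0.0
    for label, row in zip(true_labels, probs):
        total -= math.log(row[label])
    return total / len(true_labels)

n_classes = 39

# Every sample gets probability 1/39 for every category
uniform = [[1.0 / n_classes] * n_classes for _ in range(4)]
baseline = multiclass_log_loss([0, 5, 17, 38], uniform)
print(round(baseline, 3))
```

The baseline works out to ln(39), so a score of 2.491 sits well below it (lower is better).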

#### Our Second Iteration
Our second iteration involved adding a new feature, a flag indicating whether the crime's reported address was an intersection.

In [None]:
columnsToUse = ['X','Y', 'Year', 'Month', 'Day','Hour', 'Minute',
       'DayOfWeekRecode', 'PdDistrict_BAYVIEW', 'PdDistrict_CENTRAL', 'PdDistrict_INGLESIDE', 
        'PdDistrict_MISSION', 'PdDistrict_NORTHERN', 'PdDistrict_PARK', 'PdDistrict_RICHMOND', 
        'PdDistrict_SOUTHERN', 'PdDistrict_TARAVAL', 'PdDistrict_TENDERLOIN', 'StreetCornerFlag']
X = crimeData[columnsToUse]
y = crimeData['CategoryRecode']

clf = RandomForestClassifier(n_estimators=30, max_depth = 7, random_state=1, n_jobs = -1)
runOnTrainData(clf, X, y, numSplits=3)

##### Our Reaction
Adding the street intersection indicator improved the model's score from 2.491 to 2.453. We were pleased that adding this new feature improved our model and decided to make our first Kaggle submission (see code at bottom of script). On the Kaggle testing data we scored 2.44156. This was great! It was reassuring to know that our dataset was large enough that our cross-validation scores were likely to be very similar to the scores on Kaggle.

#### Our Third Iteration
For our third iteration we decided to investigate our on-campus resource, "deepthought". Deepthought is a mini "super computer" with 1TB of RAM and 48 cores. We adjusted our parameters to include more trees and depth.

In [None]:
if isPowerful:
    columnsToUse = ['X','Y', 'Year', 'Month', 'Day','Hour', 'Minute',
       'DayOfWeekRecode', 'PdDistrict_BAYVIEW', 'PdDistrict_CENTRAL', 'PdDistrict_INGLESIDE', 
        'PdDistrict_MISSION', 'PdDistrict_NORTHERN', 'PdDistrict_PARK', 'PdDistrict_RICHMOND', 
        'PdDistrict_SOUTHERN', 'PdDistrict_TARAVAL', 'PdDistrict_TENDERLOIN', 'StreetCornerFlag']
    X = crimeData[columnsToUse]
    y = crimeData['CategoryRecode']

    clf = RandomForestClassifier(n_estimators=300, max_depth = 20, random_state=1, n_jobs = 1)
    runOnTrainData(clf, X, y, numSplits=3)

##### Our Reaction
Increasing the number of trees and the maximum depth dramatically improved our score on Kaggle to 2.31827. This was quite exciting and good enough to put us in the top 100 spots on the leaderboard. While we were obviously excited, we were a bit saddened at the idea that simply throwing more computation at the problem could produce much better results than the features we had engineered thus far.

#### Our 4th iteration
For our 4th iteration we wanted to go back and see if we could engineer new features that would improve the model. We decided to focus on the address column provided. For each crime we identified the street name(s) where the crime occurred and one-hot encoded them. In order to save ourselves some RAM and avoid overfitting we only looked at the most common streets for each crime category.
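
As a rough illustration of how a list of common streets per category could be derived, the sketch below picks the single most frequent `street1` value for each category in a made-up frame. This is an assumption about the approach and not necessarily how the hard-coded list in the next cell was produced (for example, it ignores `street2` and keeps only one street per category).

```python
import pandas as pd

# Made-up reports using the street1 column created in recodeAddresses
toy = pd.DataFrame({
    'Category': ['ASSAULT', 'ASSAULT', 'ASSAULT', 'FRAUD', 'FRAUD'],
    'street1': ['EDDY ST', 'EDDY ST', 'TURK ST', 'MARKET ST', 'MARKET ST'],
})

# For each category, find the street that shows up most often
top_street = toy.groupby('Category')['street1'].agg(
    lambda s: s.value_counts().idxmax())

# Collapse to a de-duplicated list of street names
common = sorted(set(top_street))
```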

In [None]:
#The most common streets for each crime in the dataset
commonStreets = ['FOLSOM ST','16TH ST','JONES ST','TAYLOR ST',
                 'ARMSTRONG AV','EDDY ST','LARKIN ST','CASTRO ST',
                 '10TH AV','5TH ST','HAIGHT ST','OFARRELL ST',
                 '11TH AV','PAGE ST','FITCH ST','CAPP ST','13TH ST',
                 '24TH AV','17TH ST','18TH ST','19TH ST','GENEVA AV',
                 'GEARY BL','BRYANT ST','HYDE ST','4TH ST','FULTON ST',
                 'LEAVENWORTH ST','COLE ST','ALEMANY BL','PHELPS ST',
                 'MISSION ST','6TH ST','12TH AV','SHOTWELL ST',
                 'TREAT AV','7TH ST','JEFFERSON ST','QUESADA AV',
                 'TURK ST','2ND ST','MARKET ST','GGBRIDGE HY',
                 '24TH ST','CAPITOL AV','KEARNY ST','HARRISON ST',
                 'LYON ST','BUSH ST','POLK ST','3RD ST','ELLIS ST',
                 'SOUTH VAN NESS AV','POTRERO AV','20TH ST','POWELL ST']
if isPowerful:
    columnsToUse = ['X','Y', 'Year', 'Month', 'Day','Hour', 'Minute',
       'DayOfWeekRecode', 'PdDistrict_BAYVIEW', 'PdDistrict_CENTRAL', 'PdDistrict_INGLESIDE', 
        'PdDistrict_MISSION', 'PdDistrict_NORTHERN', 'PdDistrict_PARK', 'PdDistrict_RICHMOND', 
        'PdDistrict_SOUTHERN', 'PdDistrict_TARAVAL', 'PdDistrict_TENDERLOIN', 'StreetCornerFlag'] + commonStreets
    X = crimeData[columnsToUse]
    y = crimeData['CategoryRecode']

    clf = RandomForestClassifier(n_estimators=300, max_depth = 20, random_state=1, n_jobs = 1)
    runOnTrainData(clf, X, y, numSplits=3)

##### Our Reaction

#### Our 5th Iteration

In [None]:
clf = xgb.XGBClassifier(max_depth=7, n_estimators=30, objective='multi:softprob', max_delta_step = 1, learning_rate = 1, nthread = -1)
runOnTrainData(clf, X, y, numSplits=2)