# Data Science: Choose Your Own Adventure
**By Mac-I and Sophia**

For this project, we decided to use the [San Francisco Crime Dataset](https://www.kaggle.com/c/sf-crime), with the goal of predicting the category of a crime from the date and time of the report and its location. This notebook is a writeup of our work on the project, covering both data exploration and model building.

## Introduction
Before we get started, let's talk about what we're trying to do and what information we actually have!

In the dataset, the information we have is:
* **Dates**
* Category
* Description
* **Day of Week**
* **Police District**
* Resolution
* **Address**
* **X (Longitude)**
* **Y (Latitude)**

The bolded items are the ones that occur in both the test and training datasets. In other words, the bolded items are the ones that we will be using to predict the category of the crime.


## Importing Everything
To keep our code neat, let's import all the helper libraries we need up here!

We will also read in the data here. 

In [None]:
%matplotlib inline
import shapefile
import pandas as pd
import numpy as np
import itertools
import re

#data exploration imports
import matplotlib.pyplot as plt
from matplotlib.patches import Polygon
from matplotlib.collections import PatchCollection
from matplotlib import cm
from datetime import datetime
from ipywidgets import widgets  
from IPython.display import display

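#Building and testing model imports
from sklearn.ensemble import RandomForestClassifier
#note: sklearn.cross_validation was removed in scikit-learn 0.20; newer versions
#provide StratifiedShuffleSplit in sklearn.model_selection with a different signature
from sklearn.cross_validation import StratifiedShuffleSplit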
from sklearn.metrics import log_loss
import xgboost as xgb

#set to True only on a machine with enough RAM to one-hot encode the street names
isPowerful = False

crimeData = pd.read_csv('train.csv')

## Data Exploration
Before we get into building a model, we're going to start by exploring the dataset to see what kinds of relationships exist in it.

In the dataset, we have two different kinds of data: location data and time data that we can use to predict the type of crime. 

First, we'll start by exploring how location plays a role in the type of crime. 

### The Relationship between Location and Crime Category
In the graphs below, we'll show some of the work we did to explore how location and the category of crime are related. 

## Making a Model
Now that we've explored the data, we're going to start building a model. To make our lives easier, we're going to follow a pretty simple workflow. We will:
1. Read in the Data
2. Clean the training data
3. Create a Model using the cleaned data
4. Score the model using the crime categories in the training data
5. If the model performs well (better than previous attempts) we will: 
  1. Repeat steps 1-3 with the test data. 
  2. Generate a submission file to upload to Kaggle. 
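
Step 4 needs a scoring function. The `log_loss` import above hints that the score is the multi-class log loss, but treat that as an assumption here. Under that assumption, the calculation can be sketched by hand as follows; `multiclass_log_loss` is a hypothetical helper, and the probabilities are made up for illustration.

```python
import math

def multiclass_log_loss(true_labels, predicted_probs, eps=1e-15):
    """Average negative log-probability assigned to the true class."""
    total = 0.0
    for label, probs in zip(true_labels, predicted_probs):
        # Clip so that a predicted probability of exactly 0 does not give log(0).
        p = min(max(probs[label], eps), 1 - eps)
        total += math.log(p)
    return -total / len(true_labels)

# Two made-up predictions over three classes; the true classes are 0 and 1.
example_loss = multiclass_log_loss([0, 1], [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]])
print(round(example_loss, 4))
```

Lower is better. A model that spreads its probability evenly over all 39 categories would score ln(39), roughly 3.66, which gives a rough baseline to compare attempts against.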

### Loading the data and constants
Here we load the data into a pandas DataFrame and set some initial constants.

In [None]:
crimeData = pd.read_csv('train.csv')
Categories = ['ARSON', 'ASSAULT', 'BAD CHECKS', 'BRIBERY', 'BURGLARY',
              'DISORDERLY CONDUCT', 'DRIVING UNDER THE INFLUENCE',
              'DRUG/NARCOTIC', 'DRUNKENNESS', 'EMBEZZLEMENT', 'EXTORTION',
              'FAMILY OFFENSES', 'FORGERY/COUNTERFEITING', 'FRAUD', 'GAMBLING',
              'KIDNAPPING', 'LARCENY/THEFT', 'LIQUOR LAWS', 'LOITERING',
              'MISSING PERSON', 'NON-CRIMINAL', 'OTHER OFFENSES',
              'PORNOGRAPHY/OBSCENE MAT', 'PROSTITUTION', 'RECOVERED VEHICLE',
              'ROBBERY', 'RUNAWAY', 'SECONDARY CODES', 'SEX OFFENSES FORCIBLE',
              'SEX OFFENSES NON FORCIBLE', 'STOLEN PROPERTY', 'SUICIDE',
              'SUSPICIOUS OCC', 'TREA', 'TRESPASS', 'VANDALISM', 'VEHICLE THEFT',
              'WARRANTS', 'WEAPON LAWS']

### Preprocessing the data
This section will cover how we cleaned the data as well as how we added features. 

In [None]:
def recodeData(df, isTrain = False):
    '''This function takes in the dataframe that we get from loading in the 
    SF crime data and returns a re-coded dataframe that has all the 
    additional features we want to add and the categorical features recoded 
    and cleaned.
    '''

    #each helper returns the (possibly modified) dataframe along with
    #the list of all the columns it added.
    df, newLatLon = removeOutlierLatLon(df)
    df, newDate = recodeDates(df)
    df, newDistrict = recodePoliceDistricts(df)
    print("recoding addresses")
    df, newAddress, streetColumns = recodeAddresses(df)
    df, newWeather = addWeather(df)

    
    addedColumns = [] 
    addedColumns += newDate
    addedColumns += newDistrict 
    addedColumns += newLatLon
    addedColumns += newAddress
    addedColumns += newWeather
   

    if (isTrain):
        #recodeCategories returns both the dataframe and the list of new columns
        df, newCategory = recodeCategories(df)
        addedColumns += newCategory
        try: #prevents error if the columns have already been removed or we are processing test data
            columnsToDrop = ['Descript', 'Resolution']
            df.drop(columnsToDrop, axis=1, inplace=True)
        except (KeyError, ValueError):
            print("already recoded or using test data")
         

    return df, addedColumns, streetColumns

#### Recode the Categories
Here we turn the category names into integers to ease classification. 

In [None]:
def recodeCategories(df):
    '''This function will recode the Categories from strings into integers'''
    #if 'Category' in df.columns:
    df['CategoryRecode'] = df.Category.apply(lambda x: Categories.index(x))
        
    return df, ['CategoryRecode']
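
As a minimal sketch of the same string-to-integer recoding, a dictionary built once from the category list gives constant-time lookups, whereas `list.index` scans the list for every row. The short list and the helper names below are hypothetical stand-ins, not part of the notebook.

```python
# Hypothetical short list standing in for the full Categories list.
example_categories = ['ARSON', 'ASSAULT', 'BAD CHECKS']

# Build the name -> integer lookup once; the integer is the position in the list.
category_to_code = {name: code for code, name in enumerate(example_categories)}

def encode_category(name):
    """Return the integer code for a category name."""
    return category_to_code[name]

def decode_category(code):
    """Return the category name for an integer code."""
    return example_categories[code]

codes = [encode_category(c) for c in ['ASSAULT', 'ARSON', 'ASSAULT']]
print(codes)
```

Keeping a way to decode is handy later, since predicted integers have to be turned back into category names before they can be read or submitted.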

#### Fixing the Latitudes and Longitudes that do not fall in San Francisco
During our data exploration we noticed that some of the latitudes and longitudes listed were nowhere near San Francisco. To fix this, we calculated the median latitude and longitude for each police district. We then assigned the appropriate median latitude and longitude to the data points with invalid coordinates.

In [None]:
def removeOutlierLatLon(df):
    '''This function replaces outlier Latitudes and Longitudes with the median
    value for the same police district'''
    df.loc[df.X > -121, 'X'] = df.loc[(df.X > -121)].apply(lambda row: df.X[df["PdDistrict"] == row['PdDistrict']].median(), axis=1)
    df.loc[df.Y > 38, 'Y'] = df.loc[(df.Y > 38)].apply(lambda row: df.Y[df["PdDistrict"] == row['PdDistrict']].median(), axis=1)

    return df, ['X', 'Y']
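
The same idea can be sketched with `groupby` and `transform`, which avoids a row-by-row `apply`. This is a variation rather than the function above: it assumes the same thresholds, and it leaves the invalid rows out of the median. The coordinates and the name `replace_outlier_lat_lon` are made up for illustration.

```python
import pandas as pd

# Made-up coordinates; the last NORTHERN row is a placeholder far outside the city.
toy = pd.DataFrame({
    'PdDistrict': ['NORTHERN', 'NORTHERN', 'NORTHERN', 'MISSION', 'MISSION'],
    'X': [-122.42, -122.44, -120.50, -122.41, -122.43],
    'Y': [37.80, 37.79, 90.00, 37.76, 37.75],
})

def replace_outlier_lat_lon(df, max_lon=-121, max_lat=38):
    """Return a copy with out-of-range coordinates set to the district median."""
    df = df.copy()
    bad_x = df['X'] > max_lon
    bad_y = df['Y'] > max_lat
    # Mask the invalid values, then broadcast each district's median to its rows.
    median_x = df['X'].where(~bad_x).groupby(df['PdDistrict']).transform('median')
    median_y = df['Y'].where(~bad_y).groupby(df['PdDistrict']).transform('median')
    df.loc[bad_x, 'X'] = median_x[bad_x]
    df.loc[bad_y, 'Y'] = median_y[bad_y]
    return df

fixed = replace_outlier_lat_lon(toy)
print(fixed)
```

Because the invalid values are masked before the median is taken, a district with many bad rows cannot pull its own replacement value away from the city.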

#### Recoding Dates
In order to make the "Dates" column useful we needed to recode it into columns such as "Year", "Month", "Day", "Hour", and "Minute". We also needed to recode "DayOfWeek" into a numerical format, since some of the models can't handle string categorical data. A numerical day of week also makes it much easier for a model to group data by weekday versus weekend.

In [None]:
def recodeDates(df):
    '''This function takes in a dataframe and recodes the date field into 
    useable values. Here, we also recode the day of week.'''
    #Recode the dates column to year, month, day and hour columns
    df['DateTime'] = df['Dates'].apply(
        lambda x: datetime.strptime(x, '%Y-%m-%d %H:%M:%S'))

    df['Year'] = df['DateTime'].apply(lambda x: x.year)
    df['Month'] = df['DateTime'].apply(lambda x: x.month)
    df['Day'] = df['DateTime'].apply(lambda x: x.day)
    df['Hour'] = df['DateTime'].apply(lambda x: x.hour)
    df['Minute'] = df['DateTime'].apply(lambda x: x.minute)
    df['DayOfWeekRecode'] = df['DateTime'].apply(lambda x: x.weekday())

    return df, ['Year', 'Month', 'Day', 'Hour', 'Minute', 'DayOfWeekRecode']
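
To make the recoding concrete, here is a small standard-library sketch that parses one illustrative timestamp in the same format. The `date_features` helper and the `IsWeekend` flag are hypothetical additions, shown only to illustrate the weekday versus weekend grouping mentioned above.

```python
from datetime import datetime

def date_features(timestamp):
    """Split a 'YYYY-MM-DD HH:MM:SS' string into numeric features."""
    dt = datetime.strptime(timestamp, '%Y-%m-%d %H:%M:%S')
    return {
        'Year': dt.year,
        'Month': dt.month,
        'Day': dt.day,
        'Hour': dt.hour,
        'Minute': dt.minute,
        'DayOfWeekRecode': dt.weekday(),  # Monday is 0, Sunday is 6
        'IsWeekend': dt.weekday() >= 5,   # hypothetical extra feature
    }

# Illustrative timestamp; 13 May 2015 was a Wednesday.
features = date_features('2015-05-13 23:53:00')
print(features['DayOfWeekRecode'], features['IsWeekend'])
```

With Monday encoded as 0, Saturday and Sunday are 5 and 6, so a single threshold separates weekdays from the weekend.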

#### Recoding the Police districts
Like the "DayOfWeek" column, the "PdDistrict" column needed to be recoded in order to be useful. We did this with one-hot encoding since there is no inherent order to the districts.

In [None]:
def recodePoliceDistricts(df):
    '''This function recodes the police district to a one-hot encoding scheme.'''
    districts = df['PdDistrict'].unique().tolist()
    newColumns = []
    for district in districts:
        newColumns.append('District' + district)
        df['District' + district] = df['PdDistrict'].apply(
            lambda x: int(x == district))

    return df, newColumns
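
For comparison, pandas can build the same kind of one-hot columns in one call. The sketch below assumes a pandas version whose `get_dummies` accepts `dtype`, and it uses a tiny made-up frame in place of the real data.

```python
import pandas as pd

# Tiny made-up frame standing in for the crime data.
toy = pd.DataFrame({'PdDistrict': ['NORTHERN', 'MISSION', 'NORTHERN', 'PARK']})

# prefix and prefix_sep reproduce the 'District' + name column naming;
# dtype=int yields 0/1 integers instead of booleans.
district_dummies = pd.get_dummies(toy['PdDistrict'], prefix='District',
                                  prefix_sep='', dtype=int)
toy = pd.concat([toy, district_dummies], axis=1)
new_columns = list(district_dummies.columns)
print(new_columns)
```

Each row has exactly one district column set to 1. One thing to watch with either approach is that the columns are derived from the values present in the frame, so the training and test frames only get matching columns if both contain every district.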

#### Recoding the Address field into useful features
The original address field is simply a string that looks like "2000 Block of THOMAS AV" or "JEFFERSON ST / HYDE ST". We used a few different methods to make this field useful. First, we created a flag indicating whether the address was an intersection of two streets or simply a block. We also pulled out the block number (if applicable) as well as the name(s) of the street(s).

In [None]:
def recodeAddresses(df):
    '''This function will attempt to create some features related to the address field in 
    the database. To do this, first, we need to split up the address field into two different
    street fields, a block number, and a boolean specifying whether it's a street corner '''
    
        
    #Also add the "did the crime occur on a street corner field?"
    df['StreetCornerFlag'] = df['Address'].apply(lambda x: len(x.split(" / ")) > 1)
    
    #If there are two streets, split fields. Also extract the block number
    df['street1'] = df['Address'].apply(lambda x: re.sub(r'^\d+ Block of ','',x.split(" / ")[0]))
    df['street2'] = df['Address'].apply(lambda x: (x.split(" / ")[1]) if (len(x.split(" / ")) > 1) else '')

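    #only treat leading digits as a block number when they are followed by "Block of",
    #so numbered street names at intersections are not mistaken for block numbers
    df['BlockNumber'] = df['Address'].apply(lambda x: int(re.findall(r'^(\d+) Block of ',x)[0]) if (len(re.findall(r'^(\d+) Block of ',x)) > 0) else None )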
    df['BlockNumber'] = df['BlockNumber'].fillna(-1)

    
    streetColumns = []
    
    #one-hot encoding the streets requires more RAM than the standard computer has
    if isPowerful: 
        print("starting street dummy creation")

        #create a one-hot encoding of the Streets
        street1Dummy = pd.get_dummies(df['street1'])
        print("completed street 1 dummy creation")
        
        street2Dummy = pd.get_dummies(df['street2'])
        print("completed street 2 dummy creation")

        #turn the 0s into NaNs so that 'combine_first' can merge them
        street1Dummy = street1Dummy.replace(0, np.nan)
        street2Dummy = street2Dummy.replace(0, np.nan)
        
        #merge the 2 one-hot address frames into 1
        mergedStreetDummy = street1Dummy.combine_first(street2Dummy)
        print("completed address dummy DataFrames merge")
        
        #turn the NaNs back into 0s
        mergedStreetDummy = mergedStreetDummy.fillna(0)
        print("completed fillna on mergedStreetDummy")
        
        #extract all the new street columns
        streetColumns = list(mergedStreetDummy.columns.values)

        #merge the street data and the original dataframe
        df = pd.concat([df, mergedStreetDummy], axis=1)
        print("completed merge of original df and new dummy variable df")
    
    return df, ['StreetCornerFlag', 'BlockNumber'], streetColumns
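
The parsing rules can be checked on the two example addresses with a standard-library sketch. `parse_address` is a hypothetical single-string helper, not the notebook's function. It assumes block numbers always appear as "<number> Block of", which keeps a numbered street at an intersection from being read as a block number.

```python
import re

# Leading digits count as a block number only when followed by "Block of".
BLOCK_PATTERN = re.compile(r'^(\d+) Block of ')

def parse_address(address):
    """Split one address string into the features described above."""
    streets = address.split(' / ')
    block_match = BLOCK_PATTERN.match(address)
    return {
        'StreetCornerFlag': len(streets) > 1,
        'street1': BLOCK_PATTERN.sub('', streets[0]),
        'street2': streets[1] if len(streets) > 1 else '',
        'BlockNumber': int(block_match.group(1)) if block_match else -1,
    }

block = parse_address('2000 Block of THOMAS AV')
corner = parse_address('JEFFERSON ST / HYDE ST')
print(block)
print(corner)
```

The -1 used for a missing block number mirrors the `fillna(-1)` above; any value that cannot be a real block number would serve the same purpose.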

#### Add Daily Weather Data
We were interested to know whether the weather played a role in the types of crimes that occurred, so we added daily information about the max/min temperature as well as precipitation. The data came from NOAA (National Oceanic and Atmospheric Administration) and was pulled from https://www.ncdc.noaa.gov/cdo-web/search. Specifically, it came from the [downtown station](http://www.ncdc.noaa.gov/cdo-web/datasets/GHCND/stations/GHCND:USW00023272/detail).

In [None]:
def addWeather(df):
    '''add 'PRCP' (precipitation),'TMAX' (Max Temperature),'TMIN' (Min Temperature) to each data point'''
    
    # create column to merge the weather data on (eg. 01-29-2016 becomes 20160129)
    df['DATE'] = df['DateTime'].apply(lambda x: int(x.strftime('%Y%m%d')))
    
    weatherData = pd.read_csv('weather1.csv')
    
    #replace how NaNs are encoded (covers the value being read as a number or as a string)
    weatherData = weatherData.replace([-9999, '-9999'], np.nan)
    
    #get subset of full dataframe
    weatherData = weatherData[['DATE','PRCP','TMAX','TMIN']]
    
    #merge the data frames based on the integer column "DATE"
    #(pd.merge defaults to an inner join, so rows without a matching weather record are dropped)
    df = pd.merge(df, weatherData, on='DATE')
    
    return df, ['PRCP','TMAX','TMIN']
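
The date-keyed merge can be tried on a toy example. The sketch below builds the integer `DATE` key with the `.dt` accessor and uses `how='left'`, which is an assumption on our part rather than what `addWeather` does. A left join keeps every crime row even when a day has no weather record. All weather values here are made up.

```python
import pandas as pd

# Three made-up crime timestamps spanning two days.
crimes = pd.DataFrame({'DateTime': pd.to_datetime(
    ['2016-01-29 08:00:00', '2016-01-29 21:30:00', '2016-01-30 02:15:00'])})

# Made-up weather table that only covers the first day.
weather = pd.DataFrame({'DATE': [20160129], 'PRCP': [5], 'TMAX': [150], 'TMIN': [90]})

# Build the integer merge key, e.g. 2016-01-29 becomes 20160129.
crimes['DATE'] = crimes['DateTime'].dt.strftime('%Y%m%d').astype(int)

# A left join keeps all crime rows; missing weather shows up as NaN.
merged = pd.merge(crimes, weather, on='DATE', how='left')
print(merged)
```

With the default inner join the third row would disappear instead. That matters most for the test data, where every row needs a prediction.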

#### Recode the training data

In [None]:
crimeData, addedColumns, streetColumns = recodeData(
    crimeData, isTrain = True)
crimeData.describe()

### Create and Test our Models