# Data Science: Choose Your Own Adventure
**By Mac-I and Sophia**

For this project, we decided to use the [San Francisco Crime Dataset](https://www.kaggle.com/c/sf-crime), with the goal of predicting the category of a crime based on the date and time of the report and the location of the report. This notebook serves as a writeup of the work we have done on this project, in both data exploration and model building.

## Introduction
Before we get started, let's talk about what we're trying to do and what information we actually have!

In the dataset, the information we have is:
* **Dates**
* Category
* Description
* **Day of Week**
* **Police District**
* Resolution
* **Address**
* **X (Longitude)**
* **Y (Latitude)**

The bolded items are the ones that occur in both the test and training datasets. In other words, the bolded items are the ones that we will be using to predict the category of the crime.
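
As a rough sketch of how the usable columns could be checked in code rather than by eye, the snippet below compares the columns of two frames. The helper name `shared_feature_columns` and the tiny frames are made up for illustration; they only mimic the layout and are not rows from the real files.

```python
import pandas as pd

def shared_feature_columns(train, test, target="Category"):
    """Return the columns present in both frames (in train's order), excluding the target."""
    return [col for col in train.columns if col in test.columns and col != target]

# Made-up frames standing in for train.csv and test.csv (assumed layout, not real data)
toy_train = pd.DataFrame({"Dates": ["2015-01-01 00:00:00"], "Category": ["ASSAULT"],
                          "Resolution": ["NONE"], "X": [-122.42], "Y": [37.77]})
toy_test = pd.DataFrame({"Id": [0], "Dates": ["2015-01-02 00:00:00"],
                         "X": [-122.41], "Y": [37.78]})

features = shared_feature_columns(toy_train, toy_test)
print(features)  # ['Dates', 'X', 'Y']
```

With the real data, the same call on the two loaded frames would list the columns available for prediction.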


## Importing Everything
To keep our code neat, let's import all the helper libraries we need up here!

In [None]:
%matplotlib inline
import shapefile
import pandas as pd
import numpy as np
import itertools

import matplotlib.pyplot as plt
from matplotlib.patches import Polygon
from matplotlib.collections import PatchCollection
from matplotlib import cm
from datetime import datetime
from ipywidgets import widgets  
from IPython.display import display

import seaborn as sns

## Data Exploration
Before we get into building a model, we're going to start by exploring the dataset to see what kinds of relationships exist in it.

First, let's start by reading in the dataset.


In [None]:
crimeData = pd.read_csv('train.csv')
crimeData

Now that we've read in the data, we can see that we have a timestamp column. This, however, is a string, so let's decompose it into year, month, day, and hour values.

In [None]:
# Parse the timestamp strings once, then pull out each component with the .dt accessor
crimeData['DateTime'] = pd.to_datetime(crimeData['Dates'], format='%Y-%m-%d %H:%M:%S')

crimeData['Year'] = crimeData['DateTime'].dt.year
crimeData['Month'] = crimeData['DateTime'].dt.month
crimeData['Day'] = crimeData['DateTime'].dt.day
crimeData['Hour'] = crimeData['DateTime'].dt.hour

In the dataset, we have two different kinds of data: location data and time data that we can use to predict the type of crime.

First, we'll start by creating plots of the crimes for each location. 

### The Relationship between Location and Crime Category
In the graphs below, we'll show some of the work we did to explore how location and the category of crime are related. 

To do this, we're going to use IPython notebook widgets to let a user choose a category and then display all the crimes of that category that occurred in our training data.

First, we will create a function that, given a crime category, plots the crimes of that category.

In [None]:
def crime_map_display(crime):
    #Load in the map data and set the appropriate lat long variables
    mapdata = np.loadtxt("sf_map_copyright_openstreetmap_contributors.txt")
    asp = mapdata.shape[0] * 1.0 / mapdata.shape[1]
    clipsize = [[-122.5247, -122.3366],[ 37.699, 37.8299]]
    lon_lat_box=[-122.52469, -122.33663, 37.69862, 37.82986]
    
    #Get only the crimes of the category we are showing.
    crimeDataS = crimeData[crimeData.Category == crime].copy()
    
    #Filter out any lat/long outliers (X is the longitude, Y is the latitude)
    crimeDataS['Longitude'] = crimeDataS[crimeDataS.X<-121].X
    crimeDataS['Latitude'] = crimeDataS[crimeDataS.Y<40].Y
    
    plt.figure()
    plt.grid(False)

    #Scatter the crimes (longitude on x, latitude on y) on top of the map image
    g = sns.regplot(x="Longitude", y="Latitude", data=crimeDataS, fit_reg=False, scatter_kws={'alpha':0.3})
    g.set_title(crime)
    g.imshow(mapdata, cmap=plt.get_cmap('gray'), 
                  extent=lon_lat_box, 
                  aspect=asp)
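
The outlier filter in the function above uses two loose thresholds. As an alternative sketch (not what we ran), the same idea can be written as a single mask that keeps only the points inside the map's bounding box. The helper name `within_map` and the toy coordinates are made up for illustration.

```python
import pandas as pd

# Bounding box of the map image: [lon_min, lon_max, lat_min, lat_max]
LON_LAT_BOX = [-122.52469, -122.33663, 37.69862, 37.82986]

def within_map(df, box=LON_LAT_BOX):
    """Keep only rows whose X (longitude) and Y (latitude) fall inside the box."""
    lon_min, lon_max, lat_min, lat_max = box
    mask = df["X"].between(lon_min, lon_max) & df["Y"].between(lat_min, lat_max)
    return df[mask]

# Made-up points: the middle one is far outside the map
toy = pd.DataFrame({"X": [-122.42, -120.5, -122.40], "Y": [37.77, 90.0, 37.80]})
inside = within_map(toy)
print(len(inside))  # 2
```

Because the mask drops whole rows, the plotted frame never contains a point with only one valid coordinate.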

Now, we will want to actually call this function when we update a widget. 

In [None]:
categories = list(zip(crimeData.Category.unique(), crimeData.Category.unique()))
crime = widgets.Select(options=categories, description='Select one of the categories:')
widgets.interact(crime_map_display, crime = crime)

In these graphs (which load a little slowly), the plots for crime categories with a lot of data, larceny/theft for example, appear to be almost entirely covered in blue. The one exception is a rectangle that is not covered, which is a fairly nice park.

For the less common crimes, however, we can see that crimes, for the most part, tend to be concentrated in the downtown area.

Notably, it appears that the "other" category, which is mostly traffic violations, is, unsurprisingly, concentrated on major roads.

Additionally, prostitution appears to have two major centers. 
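
Observations like "concentrated downtown" or "two major centers" come from eyeballing the scatter plots. One hedged way to back them up with numbers is to count points on a coarse grid and look at the fullest cell. The sketch below uses `np.histogram2d`; the helper name `densest_cell`, the grid size, and the toy coordinates are assumptions for illustration only.

```python
import numpy as np

def densest_cell(lon, lat, box, bins=4):
    """Count points on a bins x bins grid over box and return the fullest cell."""
    lon_min, lon_max, lat_min, lat_max = box
    # Points outside the given range are simply not counted
    counts, lon_edges, lat_edges = np.histogram2d(
        lon, lat, bins=bins, range=[[lon_min, lon_max], [lat_min, lat_max]])
    i, j = np.unravel_index(np.argmax(counts), counts.shape)
    lon_range = (lon_edges[i], lon_edges[i + 1])
    lat_range = (lat_edges[j], lat_edges[j + 1])
    return counts[i, j], lon_range, lat_range

# Made-up coordinates on a 4 x 4 toy grid (not San Francisco coordinates)
toy_box = [0.0, 4.0, 0.0, 4.0]
toy_lon = [0.5, 2.5, 2.6, 2.7, 3.5]
toy_lat = [0.5, 1.5, 1.2, 1.9, 3.5]

n, lon_range, lat_range = densest_cell(toy_lon, toy_lat, toy_box)
print(n, lon_range, lat_range)
```

In the notebook, the inputs would be the `X` and `Y` columns of one category together with `lon_lat_box`, probably with a finer grid.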

To give us a little more insight into what is happening, and to break up the data a little more, we decided to split this plot up by hour, too, and explore the data per category, per hour. Here we follow the same structure as above: we create a function that plots the data based on our filtered variables, and then call that function whenever the IPython notebook widgets change.

In [None]:
def image_display(crime, time):
    
    #Load in the Map Data
    mapdata = np.loadtxt("sf_map_copyright_openstreetmap_contributors.txt")
    asp = mapdata.shape[0] * 1.0 / mapdata.shape[1]
    clipsize = [[-122.5247, -122.3366],[ 37.699, 37.8299]]
    lon_lat_box=[-122.52469, -122.33663, 37.69862, 37.82986]
    
    #Keep only the chosen category and hour (one combined mask, on a copy)
    crimeDataS = crimeData[(crimeData.Category == crime) & (crimeData.Hour == time)].copy()

    #Filter out lat/long outliers (X is the longitude, Y is the latitude)
    crimeDataS['Longitude'] = crimeDataS[crimeDataS.X<-121].X
    crimeDataS['Latitude'] = crimeDataS[crimeDataS.Y<40].Y
    plt.figure()
    plt.grid(False)
    #ax = sns.kdeplot(crimeDataS.Xok, crimeDataS.Yok, clip=clipsize, aspect=1/asp)

    cmap = plt.get_cmap('gray')

    g = sns.regplot(x="Longitude", y="Latitude", data=crimeDataS, fit_reg=False, scatter_kws={'alpha':0.3})
    g.set_title(crime)
    g.imshow(mapdata, cmap=plt.get_cmap('gray'), 
                  extent=lon_lat_box, 
                  aspect=asp)

In [None]:
vals = list(zip(crimeData.Category.unique(), crimeData.Category.unique()))
crime = widgets.Select(options=vals, description='Select one of the values:')
time = widgets.IntSlider(min=0, max=23, value=0)
widgets.interact(image_display, crime = crime, time=time)

This makes these plots a little more manageable.

Some notable takeaways here are that:
* For most categories of crime, there is a lull around 3 am
* Most crime tends to be concentrated downtown. We suspect that this is because many more people work/live in this area than anywhere else, not because this is actually a more dangerous place to live

Although it is helpful to see the map for different categories, for different hours, this view of the data does not provide the most intuitive way to visualize time-based crime patterns. 
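
A takeaway like the 3 am lull can also be checked without any maps by counting reports per hour. The snippet below is a minimal sketch on made-up data (two reports in every hour, fewer at hour 3, more at hour 18); only the `Hour` column name comes from the notebook.

```python
import pandas as pd

# Made-up report counts per hour, expanded into one row per report
reports_per_hour = {h: 2 for h in range(24)}
reports_per_hour[3] = 1
reports_per_hour[18] = 4
toy = pd.DataFrame({"Hour": [h for h, n in reports_per_hour.items() for _ in range(n)]})

# reindex makes sure every hour 0..23 appears, even one with no reports at all
per_hour = toy.groupby("Hour").size().reindex(range(24), fill_value=0)

quietest = per_hour.idxmin()
busiest = per_hour.idxmax()
print(quietest, busiest)  # 3 18
```

Running the same lines on `crimeData`, optionally filtered to one category first, would show where the real lull falls.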

### Crime Distribution by Day of Week and Hour
To dig into this, let's look first at the distribution of crime patterns by day of week and hour of the day. Using the same pattern described above, we'll implement a function that creates a plot and declare the widgets to control it.

First, let's group the crimes by day of week. 

In [None]:
groupedByDayOfWeek = crimeData.groupby(['DayOfWeek', 'Category']).count().reset_index()
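
To make it clearer what this grouped frame holds: `count()` stores, for every remaining column, the number of non-null values in each (day, category) group, which is why the later cells read the count out of the `Dates` column. The sketch below shows this on a made-up frame and, as an alternative we did not use, `size()`, which returns one count per group directly.

```python
import pandas as pd

# Made-up rows; only the column names follow the dataset
toy = pd.DataFrame({
    "Dates": ["d1", "d2", "d3", "d4"],
    "DayOfWeek": ["Monday", "Monday", "Monday", "Friday"],
    "Category": ["ASSAULT", "ASSAULT", "FRAUD", "ASSAULT"],
})

# Same pattern as the notebook: the count ends up in every non-key column
grouped = toy.groupby(["DayOfWeek", "Category"]).count().reset_index()
monday_assault = grouped[(grouped.DayOfWeek == "Monday") &
                         (grouped.Category == "ASSAULT")]["Dates"].tolist()[0]

# Alternative: size() gives a single count per group, indexed by (day, category)
sizes = toy.groupby(["DayOfWeek", "Category"]).size()
print(monday_assault, sizes.loc[("Monday", "ASSAULT")])  # 2 2
```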

Now we can declare our function and create the widget

In [None]:
def image_display_day_of_week(i):
    # Get the string for the day of week
    dayOfWeek = ['Sunday','Monday','Tuesday','Wednesday','Thursday','Friday','Saturday']
    day = dayOfWeek[i]
    
    #Get the number of crimes that occurred on this day of week
    #(For normalization purposes)
    totalCrimes = sum(groupedByDayOfWeek[groupedByDayOfWeek.DayOfWeek == day]['Dates'].tolist())
    
    #Get a list of the different crime types
    crimeTypes = sorted(crimeData.Category.unique().tolist())
    #Calculate the percentage of crimes that occurred that were of a given category
    crimeCountsPercent = []
    for crime in crimeTypes:
        countList = groupedByDayOfWeek[(groupedByDayOfWeek.DayOfWeek == day) & (groupedByDayOfWeek.Category == crime)]['Dates'].tolist()
        #Here, we're doing some error handling 
        #If no crimes of a certain type occurred on a given day, 
        #then append zero
        if (len(countList) > 0):
            count = countList[0]
        else:
            count = 0

        crimeCountsPercent.append(1.0*count/totalCrimes)

    #Create the figure
    plt.figure(figsize=(10,8))
    plt.bar([x + 0.1 for x in range(len(crimeCountsPercent))], crimeCountsPercent, width = 0.8, align='edge')
    plt.xticks([x + 0.5 for x in range(len(crimeCountsPercent))], crimeTypes, rotation='vertical')
    plt.axis([-0.5, 39.5, 0 ,0.3])
    plt.title('Crime Breakdown where day of week is ' + str(day))
    plt.xlabel('Type of Crime')
    plt.ylabel('Percentage of Crimes')

#Outside of the function create the widgets
step_slider = widgets.IntSlider(min=0, max=6, value=0)
widgets.interact(image_display_day_of_week, i=step_slider)

This is a slightly easier way of seeing that, for all days of the week, larceny/theft and other offenses are the most common crimes.

In general, it appears that crimes like larceny/theft, assault, and drug/narcotic-related offenses increase over the weekend and decrease during the week.
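
The per-day normalization done in the loop above (each category's count divided by that day's total) can also be expressed in one call. This is a sketch of an alternative, assuming `pd.crosstab` with `normalize="index"`; the toy rows are made up.

```python
import pandas as pd

# Made-up rows; only the column names follow the dataset
toy = pd.DataFrame({
    "DayOfWeek": ["Monday", "Monday", "Monday", "Monday", "Saturday", "Saturday"],
    "Category": ["LARCENY/THEFT", "LARCENY/THEFT", "ASSAULT", "OTHER OFFENSES",
                 "LARCENY/THEFT", "ASSAULT"],
})

# One row per day, one column per category; each row sums to 1
shares = pd.crosstab(toy["DayOfWeek"], toy["Category"], normalize="index")
print(shares.loc["Monday", "LARCENY/THEFT"])  # 0.5
```

Categories that never occur on a given day show up as 0 automatically, so the zero-occurrence check is not needed with this approach.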

Additionally, to explore whether certain crimes happen more often at certain times of day, let's also make a similar plot for hour of the day. First, we group by hour:

In [None]:
groupedByHour = crimeData.groupby(['Hour', 'Category']).count().reset_index()

In [None]:
def image_display_hour(i):
    #get the hour of the day
    hour = i
    #Count the number of crimes that occurred in that hour
    #(for normalization)
    totalCrimes = sum(groupedByHour[groupedByHour.Hour == hour]['Dates'].tolist())
    
    #Get the number of each type of crime that occurred in that hour
    crimeTypes = sorted(crimeData.Category.unique().tolist())
    crimeCountsPercent = []
    for crime in crimeTypes:
        countList = groupedByHour[(groupedByHour.Hour == hour) & (groupedByHour.Category == crime)]['Dates'].tolist()
        #Handle the zero-occurrence case
        if (len(countList) > 0):
            count = countList[0]
        else:
            count = 0

        crimeCountsPercent.append(1.0*count/totalCrimes)

    #create the plot
    plt.figure(figsize=(10,8))
    plt.bar([x + 0.1 for x in range(len(crimeCountsPercent))], crimeCountsPercent, width = 0.8, align='edge')
    plt.xticks([x + 0.5 for x in range(len(crimeCountsPercent))], crimeTypes, rotation='vertical')
    plt.axis([-0.5, 39.5, 0 ,0.3])
    plt.title('Crime Breakdown where hour = ' + str(hour))
    plt.xlabel('Type of Crime')
    plt.ylabel('Percentage of Crimes')


#Create the widget
step_slider = widgets.IntSlider(min=0, max=23, value=0)
widgets.interact(image_display_hour, i=step_slider)

Here, we can see that the distribution of crimes does change over the course of the day.

There is a huge shift in the distribution toward mostly larceny/theft crimes around 6pm. We hypothesize that this is because most larceny/theft is reported when people get home or to their cars after work.

Additionally, at around 2 am, it appears that the distribution shifts so that assault is the most common crime.
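
Statements such as "assault is the most common crime at 2 am" can be read off a table instead of a slider. The sketch below finds the most frequent category for each hour; the toy rows are made up to show the mechanics and are not counts from the dataset.

```python
import pandas as pd

# Made-up rows; only the column names follow the dataset
toy = pd.DataFrame({
    "Hour": [2, 2, 2, 18, 18, 18, 18],
    "Category": ["ASSAULT", "ASSAULT", "LARCENY/THEFT",
                 "LARCENY/THEFT", "LARCENY/THEFT", "LARCENY/THEFT", "ASSAULT"],
})

# Rows are hours, columns are categories, cells are report counts
counts = toy.groupby(["Hour", "Category"]).size().unstack(fill_value=0)

# For each hour, the column (category) holding the largest count
top_per_hour = counts.idxmax(axis=1)
print(top_per_hour.to_dict())  # {2: 'ASSAULT', 18: 'LARCENY/THEFT'}
```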

### Visualizing Patterns of Occurrence for Crimes
Although these visualizations help us get a general sense of what the distribution of crimes looks like at various times, it would also be helpful to visualize the pattern of one crime over time.

To do this, let's again use ipython widgets. Because this is a little computationally intensive, though, let's create the dataframe outside of the function where we update the plot. 


In [None]:
groupedByTime = crimeData.groupby(['DayOfWeek', 'Hour', 'Category']).count().reset_index()

Now, we can write a function to plot the time heatmap for a given category.

In [None]:
def show_time_heatmap(crime):
    daysOfWeek = ['Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday']
    daysOfWeekDisp = ['Sun', 'Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat']
    hours = range(24)
    
    numCrimes = np.zeros((len(hours), len(daysOfWeek)))
    for i,hour in enumerate(hours):
        for j,dayOfWeek in enumerate(daysOfWeek):
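            #Look up the count for this day/hour/category; if there is no
            #matching row, no crimes of this type were reported then
            matches = groupedByTime[(groupedByTime.DayOfWeek == dayOfWeek) & 
                                    (groupedByTime.Hour == hour) & 
                                    (groupedByTime.Category == crime)]['Dates'].tolist()
            crimeCount = matches[0] if matches else 0

            #Row 0 holds hour 23, so the hours run from 23 in the first row to 0 in the last
            numCrimes[23-hour][j] = int(crimeCount)

    #Pass the labels to heatmap directly so they stay matched to the rows and columns
    g = sns.heatmap(numCrimes, annot=True, fmt='.0f', 
                    xticklabels=daysOfWeekDisp, yticklabels=list(reversed(hours)))
    g.set_title(crime)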
    
    
crimeCategories = groupedByTime.Category.unique().tolist()
vals = list(zip(crimeData.Category.unique(), crimeData.Category.unique()))
crime = widgets.Select(options=vals, description='Select one of the values:')
widgets.interact(show_time_heatmap, crime = crime)

In exploring these plots, it appears that there tends to be a similar pattern of higher levels of crime in the evening and lower levels of crime in the morning. Many crimes also occur more often on the weekends. 

Notable exceptions to this pattern appear to be drug/narcotic-related offenses (which are most common on Wednesdays around lunchtime) and missing person cases, which appear to be reported most often in the mornings, when people don't show up for work.
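
The hour-by-day matrix behind these heatmaps is filled in with a nested loop above. As an alternative sketch, the same table can be built with `pd.crosstab` and `reindex`. The helper name `hour_by_day_counts` and the toy rows are made up for illustration.

```python
import pandas as pd

DAYS = ['Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday']

def hour_by_day_counts(df, crime):
    """Count reports of one category for every (hour, day of week) pair."""
    subset = df[df["Category"] == crime]
    table = pd.crosstab(subset["Hour"], subset["DayOfWeek"])
    # Add any hours or days with no reports as rows/columns of zeros
    return table.reindex(index=range(24), columns=DAYS, fill_value=0)

# Made-up rows; only the column names follow the dataset
toy = pd.DataFrame({
    "Category": ["FRAUD", "FRAUD", "FRAUD", "ASSAULT"],
    "DayOfWeek": ["Wednesday", "Wednesday", "Sunday", "Wednesday"],
    "Hour": [12, 12, 9, 12],
})

table = hour_by_day_counts(toy, "FRAUD")
print(table.loc[12, "Wednesday"])  # 2
```

Since the result is a labeled DataFrame, passing it to `sns.heatmap` should pick up the hour and day labels from its index and columns.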

## Making a Model
Now that we've explored the data, we're going to start building a model. To make our lives easier, we're going to follow a pretty simple workflow. We will:
1. Read in the data
2. Clean the training data
3. Create a model using the cleaned data
4. Score the model using the crime categories in the training data
5. If the model performs well (better than previous attempts), we will:
   1. Repeat steps 1-3 with the test data.
   2. Generate a submission file to upload to Kaggle.
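
To make the workflow concrete, here is a minimal sketch of steps 2 to 4 on made-up data. It is not the model we built: the baseline just remembers the most frequent category per hour, and the function names (`clean`, `fit_baseline`, `predict`, `accuracy`) and the choice of accuracy as the score are assumptions for illustration.

```python
import pandas as pd

def clean(df):
    """Step 2: parse the timestamp and derive the hour of the report."""
    out = df.copy()
    out["Hour"] = pd.to_datetime(out["Dates"]).dt.hour
    return out

def fit_baseline(train):
    """Step 3: a deliberately naive model, the most frequent category for each hour."""
    per_hour = train.groupby("Hour")["Category"].agg(lambda s: s.value_counts().idxmax())
    overall = train["Category"].value_counts().idxmax()
    return per_hour, overall

def predict(model, df):
    """Look up each row's hour; fall back to the overall top category for unseen hours."""
    per_hour, overall = model
    return df["Hour"].map(per_hour).fillna(overall)

def accuracy(predicted, actual):
    """Step 4: fraction of rows where the predicted category matches the true one."""
    return float((predicted.values == actual.values).mean())

# Step 1 would be pd.read_csv('train.csv'); these rows are made up instead
toy_train = pd.DataFrame({
    "Dates": ["2015-01-01 02:00:00", "2015-01-02 02:30:00", "2015-01-03 02:15:00",
              "2015-01-01 18:00:00", "2015-01-02 18:45:00"],
    "Category": ["ASSAULT", "ASSAULT", "LARCENY/THEFT", "LARCENY/THEFT", "LARCENY/THEFT"],
})

cleaned = clean(toy_train)
model = fit_baseline(cleaned)
score = accuracy(predict(model, cleaned), cleaned["Category"])
print(score)  # 0.8
```

Scoring on the same rows used for fitting, as here, is optimistic; holding out part of the training data would give a fairer comparison between attempts.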