# Introduction

The United States has quite a few cities notorious for high rates of violent crime. According to [MSN](https://www.msn.com/en-us/news/crime/25-most-dangerous-cities-in-america/ss-AAsxtw1#image=26), Detroit ranks as the most dangerous city in the country as of 2017, with 303 murders in 2016 alone and a staggering 2,047 violent crimes per 100,000 people. In this tutorial, we will step through the entire data science pipeline while analyzing crime in the city of Detroit, Michigan.

#### Quick Reference:
1. [Getting Started](#Getting-Started)
1. [Data Curation, Parsing, and Management](#Data-Curation,-Parsing,-and-Management)
1. [Exploratory Data Analysis](#Exploratory-Data-Analysis)
1. [Machine Learning](#Machine-Learning)
1. [Hypothesis Testing](#Hypothesis-Testing)
1. [Conclusion](#Conclusion)
1. [Resources](#Resources)

# Getting Started

To reproduce the steps in this notebook, you should be running Python 3.6. 

You will also need the following packages:
1. [Pandas](https://pandas.pydata.org/) (```pip install pandas```)
1. [NumPy](http://www.numpy.org/) (```pip install numpy```)
1. [Matplotlib](https://matplotlib.org/) (```pip install matplotlib```)
1. [Folium](http://folium.readthedocs.io/en/latest/index.html) (```pip install folium```)
1. [SciPy](https://www.scipy.org/) (```pip install scipy```)
1. [Seaborn](https://seaborn.pydata.org/) (```pip install seaborn```)
1. [scikit-learn](http://scikit-learn.org/stable/index.html) (```pip install scikit-learn```)




The links above will navigate you to the pages associated with the packages, in case you run into issues.

##### Imports:

In [1]:
import pandas as pd
import numpy as np
from scipy import stats
import seaborn as sb
import matplotlib.pyplot as plt
import folium
from datetime import datetime
from sklearn import model_selection  # replaces the removed sklearn.cross_validation module
import sklearn.metrics




# Data Curation, Parsing, and Management

First, we need to download the [dataset](https://data.detroitmi.gov/api/views/invm-th67/rows.csv?accessType=DOWNLOAD) from the City of Detroit website. This dataset includes crimes recorded from January 1, 2009 to December 6, 2016. The [documentation](https://data.detroitmi.gov/Public-Safety/DPD-All-Crime-Incidents-January-1-2009-December-6-/invm-th67) for this dataset lists column names, column data types, descriptions, etc. You can save the file locally (I saved mine as 'DetroitCrimeData.csv') and pass that path to Pandas, or read it directly from the URL as we do below:

In [14]:
# Create a Pandas dataframe to hold the crime data. We will use this throughout this tutorial.
crime_df = pd.read_csv('https://data.detroitmi.gov/api/views/invm-th67/rows.csv?accessType=DOWNLOAD', dtype=str, parse_dates=False)

# Remove the ROWNUM column as it is unnecessary since we have row indexing built into our dataframe.
del crime_df['ROWNUM']

crime_df.head()
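
Because the download is large, it can be convenient to cache it on disk. The following is a minimal sketch of that idea, not part of the original notebook: `load_crime_data` is a hypothetical helper that reads the local copy when it exists and otherwise downloads the CSV and saves it under the file name mentioned above.

```python
import os
import pandas as pd

# URL and local file name are the ones used in this tutorial
DATA_URL = 'https://data.detroitmi.gov/api/views/invm-th67/rows.csv?accessType=DOWNLOAD'
LOCAL_PATH = 'DetroitCrimeData.csv'

def load_crime_data(local_path=LOCAL_PATH, url=DATA_URL):
    """Hypothetical helper: read the local copy if present, else download and cache it."""
    if os.path.exists(local_path):
        return pd.read_csv(local_path, dtype=str)
    df = pd.read_csv(url, dtype=str)
    df.to_csv(local_path, index=False)
    return df
```

Calling `load_crime_data()` a second time would then skip the network entirely.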


### What is a violent crime?
In order to analyze violent crime, we need to first define what it is. Below you can see a unique list of all the different categories of crimes that this data set covers:

In [None]:
# Print all categories of crimes
crimes = sorted(crime_df.CATEGORY.unique())
for i in range(0, len(crimes), 3):
    three = crimes[i:(min(i + 3, len(crimes)))]
    string = "{:40} {:40} {:40}".format(three[0] if len(three) >= 1 else "", \
                                        three[1] if len(three) >= 2 else "", \
                                        three[2] if len(three) == 3 else "")
    print(string)

We will define a "violent crime" to be any crime of arson, aggravated assault, assault, homicide, kidnapping, negligent homicide (manslaughter), robbery, or any sex offense. The code below will give us a new data frame that only deals with these offenses. Note that kidnapping is listed twice, as it is spelled 'KIDNAPING' in some places.

In [None]:
# We are interested in violent crimes, so we first define a list of them:
VIOLENT_CRIMES = ['ARSON', 'AGGRAVATED ASSAULT', 'ASSAULT', 'HOMICIDE', 'KIDNAPING', 'KIDNAPPING', 'NEGLIGENT HOMICIDE', \
                  'ROBBERY', 'SEX OFFENSES']

# Now, we create a new dataframe with only the above crimes
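# .copy() gives us an independent dataframe, so later column assignments do not trigger SettingWithCopyWarning
crime_df = crime_df.loc[crime_df['CATEGORY'].isin(VIOLENT_CRIMES)].copy()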

crime_df.head()
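
To make the filtering step concrete, here is a small sketch on toy data. The category values are illustrative stand-ins, and merging the two kidnapping spellings is an optional extra that the tutorial itself does not perform.

```python
import pandas as pd

# Toy data standing in for the real CATEGORY column (values are illustrative)
toy = pd.DataFrame({'CATEGORY': ['ARSON', 'KIDNAPING', 'FRAUD', 'KIDNAPPING', 'ROBBERY']})

VIOLENT = ['ARSON', 'KIDNAPING', 'KIDNAPPING', 'ROBBERY']

# Keep only rows whose category is in the list
violent = toy.loc[toy['CATEGORY'].isin(VIOLENT)].copy()

# Optional: fold the alternate spelling into a single label
violent['CATEGORY'] = violent['CATEGORY'].replace({'KIDNAPING': 'KIDNAPPING'})

print(violent['CATEGORY'].value_counts())
```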

#### Some tidying

For the purposes of this tutorial, we only care about the calendar date on which a crime occurred. We will not worry about the time of day, so we will reformat the INCIDENTDATE column to hold just the date. Also, we need to extract the latitude/longitude from the LOCATION column and transform them into something usable for later. See below!

In [None]:
# Transform INCIDENTDATE column into a datetime that holds only the date (time of day is dropped)
crime_df['INCIDENTDATE'] = crime_df['INCIDENTDATE'].map(lambda x: datetime.strptime(str(x).split(' ')[0], '%m/%d/%Y'))

# Creates a tuple of (lat, long) from the string currently in LOCATION
def lat_long_tuple(x):
    x = str(x)  # missing values arrive as float NaN, so normalize to a string first
    if '\n' not in x:
        return "Unknown"
    string = x.split('\n')[1]
    split = string.split(',')
    return (float(split[0][1:]), float(split[1][1:-1]))

# Use the function above to transform the LOCATION column into (lat, long), and remove the old LOCATION column
crime_df['LAT/LONG'] = crime_df['LOCATION'].map(lat_long_tuple)
del crime_df['LOCATION']

crime_df
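
The string slicing above depends on the exact position of the parentheses. As an alternative sketch, the same parsing can be done with a regular expression. This assumes LOCATION looks like `"ADDRESS\n(lat, long)"`, which is inferred from the slicing code rather than from the dataset documentation. The helper names and the sample address are made up for illustration.

```python
import re
from datetime import datetime

# Assumed LOCATION layout, inferred from the parsing code above: "ADDRESS\n(lat, long)"
COORD_RE = re.compile(r'\(\s*(-?\d+(?:\.\d+)?)\s*,\s*(-?\d+(?:\.\d+)?)\s*\)')

def parse_lat_long(location):
    """Hypothetical helper: return (lat, long) as floats, or None if no coordinates are found."""
    match = COORD_RE.search(str(location))
    if match is None:
        return None
    return (float(match.group(1)), float(match.group(2)))

def parse_date(value):
    """Keep only the calendar date from a string that starts with MM/DD/YYYY."""
    return datetime.strptime(str(value).split(' ')[0], '%m/%d/%Y')

print(parse_lat_long('00 WOODWARD AVE\n(42.3314, -83.0458)'))
print(parse_lat_long('no coordinates here'))
print(parse_date('12/16/2012 12:00:00 AM'))
```

Returning `None` instead of a placeholder string makes rows without coordinates easy to drop later with `dropna`.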

# Exploratory Data Analysis

Now we will use visualization and other statistical techniques to analyze our dataset, find interesting patterns and trends, and form some testable hypotheses. First, let's start with a simple scatterplot showing the number of violent crimes for each day from 01/01/2009 to 12/06/2016.

In [None]:
# Group the data by date 
groupby_date = crime_df.groupby('INCIDENTDATE').size().reset_index(name="COUNT")
groupby_date.head(8)

In [None]:
# Create the scatterplot
fig, ax = plt.subplots(figsize=(16,8))
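# plt.plot with an 'o' marker draws the same points as plot_date, which is deprecated in recent Matplotlib
plt.plot(groupby_date["INCIDENTDATE"], groupby_date["COUNT"], 'o', color="#4542f4")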
plt.xlabel('Date')
plt.ylabel('Number of Violent Crimes')
plt.title('Number of Violent Crimes by Date')
plt.show()

At first glance, one may notice that there appears to be a downwards trend in violent crime over the years 2009 - 2016 in Detroit. Let's try to visually confirm these results by fitting a linear regression with Seaborn!

In [None]:
# Plot with linear regression
fig, ax = plt.subplots(figsize=(16,8))
# Dates are converted to seconds since epoch so they can be used as a numeric x-axis
g = sb.regplot(x=groupby_date["INCIDENTDATE"].apply(lambda x: x.timestamp()), y=groupby_date["COUNT"],
               line_kws={'color': 'red'}, color="#4256f4", ax=ax)
plt.xlabel("Date (Represented as seconds since epoch)")
plt.ylabel("Number of Violent Crimes")
plt.title('Number of Violent Crimes by Date w/ Regression')
plt.show()

The linear regression shows a downward trend in violent crimes from 2009 to 2016. Upon further inspection, it seems as though each year has roughly the same distribution. We will later attempt to use machine learning to predict how many violent crimes will occur on any given day throughout the year!
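
A plot alone does not say how steep the trend is or whether it could be noise. One way to put a number on it is `scipy.stats.linregress`, sketched below on synthetic data (not the Detroit counts) that has a downward trend built in.

```python
import numpy as np
from scipy import stats

# Synthetic daily counts with a built-in downward trend (not the Detroit data)
rng = np.random.default_rng(0)
days = np.arange(365 * 3)
counts = 120 - 0.02 * days + rng.normal(0, 10, size=days.size)

# Ordinary least-squares fit of counts against the day index
result = stats.linregress(days, counts)
print("slope per day: {:.4f}".format(result.slope))
print("p-value: {:.3g}".format(result.pvalue))
```

A negative slope with a small p-value would support the visual impression of a decline. On the real data, the x values would be the timestamps used in the regression plot.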

#### Crime by location in Detroit

For fun, let's take a look at a map of the different locations where this violent crime is happening. Hopefully we might get some insight as to which locations in Detroit are the safest (relative to the rest of Detroit, that is). After all, we are talking about the most dangerous city in America!

In [None]:
# Create map, centered on Detroit
crime_map = folium.Map(location=[42.337431, -83.048331], zoom_start = 11)

# Create a dictionary for referencing different colors for different crimes
color_lookup = {'ARSON': 'yellow', 'AGGRAVATED ASSAULT': 'blue', 'ASSAULT': 'orange', 'HOMICIDE': 'red', \
                'KIDNAPING': 'green', 'KIDNAPPING': 'green', 'NEGLIGENT HOMICIDE': 'purple', \
                  'ROBBERY': 'black', 'SEX OFFENSES': '#00FFFF'}

# Add a random sample of 1000 data points to the map, skipping rows whose location could not be parsed
located = crime_df[crime_df['LAT/LONG'].map(lambda v: isinstance(v, tuple))]
for index, row in located.sample(1000).iterrows():
    lat, long = row['LAT/LONG']
    category = row['CATEGORY']
    folium.CircleMarker([lat, long], color=color_lookup[category], fill=True, fill_color=color_lookup[category],
                        radius=2 if category != 'HOMICIDE' else 5, popup=category).add_to(crime_map)

crime_map

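Based on the random sample of violent crimes, the Hamtramck and Highland Park areas show the fewest incidents. Keep in mind that both are separate municipalities surrounded by Detroit, so the gap may reflect incidents that are simply not recorded in this Detroit dataset rather than lower crime.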
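
Reading density off a map of sampled points is subjective. A rough numeric complement is to bin coordinates into a grid and count incidents per cell, sketched here on made-up coordinates near Detroit. The column names `lat` and `long` are illustrative, since the tutorial stores both values in one LAT/LONG tuple.

```python
import pandas as pd

# Made-up coordinates near Detroit, standing in for the LAT/LONG tuples
points = [(42.331, -83.046), (42.334, -83.049), (42.352, -83.061), (42.329, -83.051)]
toy = pd.DataFrame(points, columns=['lat', 'long'])

# Round to 2 decimal places (roughly 1 km cells at this latitude) and count incidents per cell
cells = (toy.round({'lat': 2, 'long': 2})
            .groupby(['lat', 'long'])
            .size()
            .sort_values(ascending=False))
print(cells)
```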

# Machine Learning

The hypothesis we are going to test in this section is that we may be able to predict the number of crimes that will happen on any given day in Detroit. To do this, we will use a function called [Polyfit](https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.polyfit.html), which resides in the NumPy package. We give Polyfit a polynomial degree, say 8, and a random sample (a training set) from our data, and then we should be able to query for a date and get a prediction of how many crimes will occur on that day.
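
Before applying Polyfit to crime counts, it may help to see it on data where the answer is known. This small sketch fits a degree-2 polynomial to points that lie exactly on a quadratic and then queries the fitted polynomial at a new x value.

```python
import numpy as np

# Points that lie exactly on y = 2x^2 - 3x + 1
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2 * x ** 2 - 3 * x + 1

coeffs = np.polyfit(x, y, 2)   # coefficients are returned highest power first
poly = np.poly1d(coeffs)       # wrap them in a callable polynomial

print(np.round(coeffs, 6))
print(poly(5.0))
```

The recovered coefficients match the quadratic that generated the data. With noisy data such as daily crime counts, the fit is a least-squares approximation instead of an exact match.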

###### Just a little more tidying

Because our data spans the years 2009 - 2016, we need to account for leap years when getting the day number. Years 2012 and 2016 both have the day Feb. 29th, which shifts every later date in those years forward by one day number. For instance, 12/16/2009 is day 350 of the year, while 12/16/2012 is day 351 of the year. To account for this, we will set the year (since we are not worried about it in this part of the tutorial) to 2012 for every date, so that every date is numbered on the same 366-day calendar.

In [None]:
# Replace all years with 2012 so each year has an even 366 days
group_copy = groupby_date.copy()
group_copy['INCIDENTDATE'] = groupby_date['INCIDENTDATE'].apply(lambda x: x.replace(year=2012))

# Transform each date into the day of the year
group_copy['INCIDENTDATE'] = group_copy['INCIDENTDATE'].apply(lambda x: x.timetuple().tm_yday)

group_copy.head()
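
The leap-year shift can be checked directly with the standard library. This short sketch reproduces the example dates from the text.

```python
from datetime import datetime

d_2009 = datetime(2009, 12, 16)
d_2012 = datetime(2012, 12, 16)

# Raw day-of-year differs because 2012 has a Feb. 29th
print(d_2009.timetuple().tm_yday)  # 350
print(d_2012.timetuple().tm_yday)  # 351

# After mapping both to leap year 2012, the same calendar date gets the same day number
same_year = d_2009.replace(year=2012)
print(same_year.timetuple().tm_yday)  # 351
```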

###### Polyfit example
Now that we have the data we want, let's do an example that will give a general overview of what is going on. Below, we will take a random sample of 200 days from our dataset and use that to fit a Polyfit polynomial. 

In [None]:
# Create random sample of 200 days from our 2009 - 2016 period, and fit polynomial (degree 8)
crime_sample = group_copy.sample(200)
fit = np.polyfit(crime_sample['INCIDENTDATE'], crime_sample['COUNT'], 8)
poly = np.poly1d(fit)
x = np.linspace(1, 366, 100)
y = poly(x)

In [None]:
# Plot the data along with our polyfit polynomial
fig, ax = plt.subplots(figsize=(16,8))
plt.plot(x, y)
plt.plot(crime_sample['INCIDENTDATE'], crime_sample['COUNT'], '.', color = "orange")
plt.xlabel('Day Number of Year')
plt.ylabel('Number of Violent Crimes')
plt.title('Example: Number of Violent Crimes vs Day of the Year')
plt.show()

###### Create training and testing data sets
In order to properly use Polyfit, we need to break up our data into a training and testing set. We will give Polyfit our training data set and it will learn how to predict the number of crimes based on that data. Then, we will test the newly trained Polyfit on our testing set, and record the error compared to the actual number of crimes for that day. We want our training set to be large, but we also want enough test data to accurately measure the error of the fit. We will allow our training set to use 3/4 of the entire dataset, while leaving our test set with the remaining 1/4.

In [None]:
# We take a random sample of the entire set - this ensures random order within the set
random_sample = group_copy.sample(frac=1)

# Get our training and testing sets
split = int(len(random_sample) * (3/4))
train = random_sample[0:split]
test = random_sample[split::]

print("Training set now has: {} Samples".format(len(train)))
print("Testing  set now has: {}  Samples".format(len(test)))
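
The manual shuffle-and-slice above can also be written with scikit-learn's `train_test_split`. This is an alternative sketch on toy data, not the method the tutorial uses. The names `toy`, `train_toy`, and `test_toy` are illustrative.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for group_copy: one row per day with a made-up count
toy = pd.DataFrame({'INCIDENTDATE': range(1, 101), 'COUNT': range(100, 200)})

# test_size=0.25 mirrors the 3/4 vs 1/4 split; random_state makes the shuffle repeatable
train_toy, test_toy = train_test_split(toy, test_size=0.25, random_state=0)

print(len(train_toy), len(test_toy))
```

Fixing `random_state` means the same rows land in each set on every run, which makes results easier to compare.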

##### Choosing the best polynomial degree with Cross Validation
Generally, we would now start our [cross-validation](http://scikit-learn.org/stable/modules/cross_validation.html). This would ensure we have the best possible polynomial degree so that our training is as accurate as possible. We do this by iterating through a few possible choices of the polynomial degree and computing the error given by that degree. Once we are done, we would choose to use the degree which minimized the error in our tests. We will leave that out of this tutorial and just assume a polynomial degree of 8. We are ready to fully train Polyfit with our data!

In [None]:
# Fit the polynomial using our random training set
fit = np.polyfit(train["INCIDENTDATE"], train["COUNT"], 8)
poly = np.poly1d(fit)
x = np.linspace(1, 366, 100)
y = poly(x)

In [None]:
# Plot the test data along with the Polyfit prediction!
fig, ax = plt.subplots(figsize=(16,8))
plt.plot(x, y)
plt.plot(test['INCIDENTDATE'], test['COUNT'], '.', color = "purple")
plt.xlabel('Day Number of Year')
plt.ylabel('Number of Violent Crimes')
plt.title('Number of Violent Crimes vs Day of the Year w/ Prediction')
plt.show()
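
The tutorial fixes the degree at 8 and skips cross-validation. For readers who want to try the degree selection described in this section, here is one possible sketch using K-fold cross-validation on synthetic data. `cv_error_for_degree` is a hypothetical helper, and the candidate degrees and fold count are arbitrary choices.

```python
import numpy as np
from sklearn.model_selection import KFold

def cv_error_for_degree(x, y, degree, n_splits=5, seed=0):
    """Hypothetical helper: mean squared validation error of np.polyfit across K folds."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    errors = []
    for train_idx, val_idx in folds.split(x):
        # Fit on the training folds, score on the held-out fold
        poly = np.poly1d(np.polyfit(x[train_idx], y[train_idx], degree))
        errors.append(np.mean((poly(x[val_idx]) - y[val_idx]) ** 2))
    return float(np.mean(errors))

# Synthetic data: a quadratic plus noise (not the Detroit data)
rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 200)
y = 3 * x ** 2 - x + rng.normal(0, 0.1, size=x.size)

# Score each candidate degree and keep the one with the lowest validation error
scores = {d: cv_error_for_degree(x, y, d) for d in range(1, 6)}
best_degree = min(scores, key=scores.get)
print(scores)
print("best degree:", best_degree)
```

On the crime data, `x` and `y` would be the INCIDENTDATE and COUNT columns of the training set only, so the test set stays untouched until the final evaluation.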

# Hypothesis Testing
Now it is time to test how well our hypothesis holds up. We wanted to show that there is a relationship between the day of the year and the number of violent crimes that occur. For our testing, let's use the R-squared score. To understand how R-squared values work, feel free to check out [this](http://statisticsbyjim.com/regression/interpret-r-squared-regression/) blog post. An R-squared value close to 1 would mean that our polynomial explains most of the variation in the daily number of violent crimes, while a value near 0 would mean it explains almost none.

In [None]:
# Compute R-squared using scikit-learn
r_squared = sklearn.metrics.r2_score(test['COUNT'], poly(test['INCIDENTDATE']))
print(r_squared)

Although this R-squared value is low, it might be enough to support our hypothesis. Read directly, the value means that our prediction leaves roughly 70% of the variance in daily crime counts unexplained. However, this does not rule out a relationship between the day of the year and the number of violent crimes.
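
To see where the score comes from, R-squared can be computed by hand as one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below uses made-up numbers and compares the manual result with scikit-learn's `r2_score`. The `r_squared` function is illustrative.

```python
import numpy as np
from sklearn.metrics import r2_score

def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # error left over after prediction
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # error of always predicting the mean
    return 1.0 - ss_res / ss_tot

# Made-up observed and predicted counts
y_true = [10, 12, 9, 15, 14]
y_pred = [11, 11, 10, 14, 13]
print(r_squared(y_true, y_pred))
print(r2_score(y_true, y_pred))
```

Both lines print the same value. A model that always predicts the mean scores 0, and a model worse than that scores below 0.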

# Conclusion
Detroit is a dangerous city, and local governments are trying everything they can to keep the city safe. Being able to analyze the crime and make decisions based on it can be critical to increasing safety. In this tutorial, we showed how crime actually seems to be decreasing year-over-year. We have visualized the locations of these violent crimes on a map, which could help governments to ensure that areas with dense violent crime rates have more police presence. And we have also seen that violent crime may be correlated with the day of the year. This could potentially help local police schedule more officers on some days, and fewer officers on others.

# Resources

1. MSN: https://www.msn.com/en-us/news/crime/25-most-dangerous-cities-in-america/ss-AAsxtw1#image=26
1. R-Squared Blog: http://statisticsbyjim.com/regression/interpret-r-squared-regression/
1. Cross Validation: http://scikit-learn.org/stable/modules/cross_validation.html
1. Pandas: https://pandas.pydata.org/
1. NumPy: http://www.numpy.org/
1. Matplotlib: https://matplotlib.org/
1. Folium: http://folium.readthedocs.io/en/latest/index.html
1. SciPy: https://www.scipy.org/
1. Seaborn: https://seaborn.pydata.org/
1. scikit-learn: http://scikit-learn.org/stable/index.html