# CSC 578D / Data Mining / Fall 2018 / University of Victoria

## Python Notebook for Final Project

### The datasets for this project are the following:
1. Crime statistics for the city of Chicago from 2001-2017. [This](https://catalog.data.gov/dataset/crimes-2001-to-present-398a4) is the link to the original dataset.
1. The boundaries of the city of Chicago. [This](https://data.cityofchicago.org/Public-Safety/Boundaries-Police-Beats-current-/aerh-rz74) is the link to the original dataset.

### Goals of this analysis:
1. Visualise crime in a concise and efficient manner.
1. Segment crime by year, month.
1. Look for patterns in the data.
1. Superpose the data onto a map.
1. Build a **k-Nearest Neighbor** classifier to estimate how likely someone is to become a victim of a crime at a given time and location.
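
To make the last goal more concrete, here is a minimal sketch of the k-Nearest Neighbor idea using plain NumPy. It is not the classifier built for this project: the feature layout (scaled hour, x, y), the labels, and the helper name `knn_crime_probability` are all assumptions made for illustration, and the data points are synthetic.

```python
import numpy as np

def knn_crime_probability(train_X, train_y, query, k=3):
    """Fraction of the k nearest training points labelled 1 (crime)."""
    distances = np.linalg.norm(train_X - query, axis=1)  # Euclidean distance to each point
    nearest = np.argsort(distances)[:k]                  # indices of the k closest points
    return train_y[nearest].mean()

# Synthetic, made-up observations: columns are (hour of day / 24, x, y),
# all scaled to [0, 1] so that no single feature dominates the distance.
train_X = np.array([
    [0.90, 0.10, 0.10],
    [0.95, 0.15, 0.10],
    [0.85, 0.10, 0.20],
    [0.40, 0.80, 0.90],
    [0.50, 0.90, 0.80],
    [0.45, 0.85, 0.85],
])
train_y = np.array([1, 1, 1, 0, 0, 0])  # 1 = crime reported, 0 = no crime

late_night = knn_crime_probability(train_X, train_y, np.array([0.92, 0.12, 0.12]))
midday = knn_crime_probability(train_X, train_y, np.array([0.45, 0.85, 0.88]))
print(late_night, midday)
```

Treating the share of positive neighbours as a probability is one simple choice among several. Note that the hour of day is cyclic, which a plain Euclidean distance on the scaled hour does not capture.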

**Author:** Andreas P. Koenzen <akoenzen@uvic.ca>

**Version:** 0.1

In [None]:
%matplotlib inline

import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import json
import os.path
import IPython

# from mapboxgl.utils import *
# from mapboxgl.viz import *

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

from IPython.display import Javascript, IFrame
from IPython.core.display import HTML

In [None]:
# Miscellaneous Configuration:
HTML(
    '<style>'
    'ol           {counter-reset:item}'
    'ol li        {display:block}'
    'ol li:before {content:counter(item) ". ";counter-increment:item;font-weight:bold}'
    '</style>'
)

plt.style.use(['default', 'ggplot'])
plt.rcParams.update({'font.size': 8})

## Load the entire dataset into memory:
1. Index the dataset by the **Date** column.
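
As a small illustration of this step, the sketch below parses timestamps in the `%m/%d/%Y %I:%M:%S %p` layout and indexes a frame by them. The two rows are made up and the variable names are hypothetical; only the timestamp format is taken from this notebook.

```python
import io
import pandas as pd

# Two made-up rows using the same timestamp layout as the crime CSV.
sample_csv = io.StringIO(
    "ID,Date,Beat\n"
    "2,01/15/2017 11:30:00 PM,1122\n"
    "1,01/15/2017 09:05:00 AM,0421\n"
)

# Reading Beat as a string keeps any leading zeros intact.
sample = pd.read_csv(sample_csv, dtype={"Beat": str})

# An explicit format avoids format inference and day/month ambiguity.
sample["Date"] = pd.to_datetime(sample["Date"], format="%m/%d/%Y %I:%M:%S %p")
sample = sample.set_index("Date").sort_index()

print(sample.index.hour.tolist())
```

Once the index is a `DatetimeIndex`, attributes such as `.month` and `.hour` become available for grouping.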

In [None]:
data_set_home = %env DATA_SETS_HOME

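raw_data = pd.read_csv(
    "{0}/CSC_578D/Project/Chicago_Crimes_2017.csv".format(data_set_home),
    index_col=['Date']
)

# Parse the index with an explicit format; `pd.datetime` is no longer available
# in recent pandas releases, so the conversion is done with `pd.to_datetime`.
raw_data.index = pd.to_datetime(raw_data.index, format="%m/%d/%Y %I:%M:%S %p")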

raw_data.sort_index(inplace=True)

# print basic information about the data
# raw_data.head()
raw_data.shape

## Pre-processing:
1. Since the dataset is about 1.5GB in size, I need to remove unused or redundant columns. The following columns will be removed:
    1. ID
    1. Case Number
    1. Block: Not needed, because other fields such as Beat will be used to pinpoint locations.
    1. Primary Type: I will use the IUCR to categorize danger.
    1. Description: I will use the IUCR to categorize danger.
    1. Location Description: I will use the "Domestic" column to filter out crimes committed indoors.
    1. Arrest
    1. Community Area
    1. Ward
    1. FBI Code
    1. X Coordinate and Y Coordinate
    1. Updated On
    1. Latitude, Longitude and Location: These can be omitted as well, since I will use the Chicago Boundaries dataset.
1. The Year column is kept, because it is used later to label the crime counts.
1. Index the entire dataset by the Date column.
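
The effect of dropping columns on memory can be checked directly. The following is a sketch on a toy frame, assuming a handful of the column names from the list above; the values are made up and the variable names are hypothetical.

```python
import pandas as pd

# Toy frame with a few of the column names used in this notebook; values are made up.
toy = pd.DataFrame({
    "ID": [1, 2, 3],
    "Case Number": ["A1", "A2", "A3"],
    "IUCR": ["0486", "0820", "0486"],
    "Beat": [1122, 421, 1122],
    "Domestic": [True, False, False],
})

# deep=True also counts the memory held by Python strings in object columns.
before = toy.memory_usage(deep=True).sum()
slim = toy.drop(columns=["ID", "Case Number"])
after = slim.memory_usage(deep=True).sum()

print(list(slim.columns), before, after)
```

An alternative would be to pass `usecols` to `pd.read_csv`, so that the unwanted columns are never loaded in the first place.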

In [None]:
# filter out redundant columns
data = raw_data.drop([
    'ID', 
    'Case Number',
    'Block',
    'Primary Type',
    'Description',
    'Location Description',
    'Arrest',
    'Community Area',
    'Ward',
    'FBI Code',
    'X Coordinate',
    'Y Coordinate',
    'Updated On',
    'Latitude',
    'Longitude',
    'Location'
], axis=1)

data.head()

## Describe the dataset:
Gather metrics for:
1. Top 10 Most Violent Beats: In police terminology, a beat is the territory and time that a police officer patrols.
1. Describe the dataset using Time Series analysis.
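
The first metric boils down to counting rows per beat and keeping the largest counts. A minimal sketch on made-up beat numbers (the frame and variable names are hypothetical):

```python
import pandas as pd

# Made-up beat numbers: beat 111 appears three times, 222 twice, 333 once.
toy = pd.DataFrame({"Beat": [111, 111, 111, 222, 222, 333]})

# Count rows per beat, then keep the two largest groups.
counts = toy.groupby("Beat").size().sort_values(ascending=False)
top_2 = counts.head(2)

# value_counts() produces the same descending ranking in a single call.
same = toy["Beat"].value_counts().head(2)

print(top_2)
```

Both approaches count every recorded crime in a beat, regardless of its type.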

### Top 10 Most Violent Beats:

In [None]:
# group data by:
# - beat
# plot top 10 most violent beats
tmp = data.groupby('Beat').size().sort_values(ascending=False)[0:10]
top_10_mv_beats = pd.DataFrame(index=range(1, 11), data={
    'Beat #': tmp.index,
    'Crime Count ' + str(data['Year'].unique()[0]): tmp.values
})
top_10_mv_beats

### Time Series analysis:
The dataset needs to be indexed by date.
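
With a date index in place, counting by calendar month is a one-line `groupby`. The sketch below uses made-up timestamps and also reindexes to all twelve months, which is an optional extra step assumed here so that months without records appear as zero.

```python
import pandas as pd

# Made-up timestamps covering only January and July.
idx = pd.to_datetime(
    ["2017-01-03", "2017-01-20", "2017-07-04", "2017-07-15", "2017-07-30"]
)
toy = pd.DataFrame({"Beat": [1, 2, 3, 4, 5]}, index=idx)

# Group by the month number of the index (1 = January ... 12 = December).
by_month = toy.groupby(toy.index.month).size()

# Reindex to all 12 months so months with no records show up as 0.
by_month = by_month.reindex(range(1, 13), fill_value=0)

print(by_month.tolist())
```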

#### Describe and plot data by month:

In [None]:
month = data.groupby(data.index.month).size()
month = pd.DataFrame(index=np.arange(1, 13, 1), data={
    'Crimes': month
})

fig, ax = plt.subplots(figsize=(6, 4))
ax = month.plot(
    ax=ax,
    title='Crimes by Month',
    xticks=np.arange(1, 13, 1),
    style=['--']
)
_ = ax.set_ylabel('Count')
_ = ax.set_xlabel('Month')

#### Result:
The chart above shows a pattern: crimes are committed more often during the summer months. I will therefore formulate a hypothesis and check whether it holds.

#### Hypothesis 1:
Crime counts are higher during summer months because more people are outdoors than indoors.

In [None]:
month_outdoor = data.loc[(data['Domestic'] == False)]
month_outdoor = month_outdoor.groupby(month_outdoor.index.month).size()
month_indoor = data.loc[(data['Domestic'] == True)]
month_indoor = month_indoor.groupby(month_indoor.index.month).size()

fig, (ax_left, ax_right) = plt.subplots(ncols=2, figsize=(12, 4))
_ = ax_left.set_ylabel('Count')
_ = ax_left.set_xlabel('Month')
_ = ax_right.set_ylabel('Count')
_ = ax_right.set_xlabel('Month')

_ = month_outdoor.plot(
    ax=ax_left,
    title='Non-Domestic Crimes by Month',
    xticks=np.arange(1, 13, 1),
    style=['--']
)
_ = ax_left.legend(["{} Crimes Total".format(month_outdoor.sum())])

_ = month_indoor.plot(
    ax=ax_right,
    title='Domestic Crimes by Month',
    xticks=np.arange(1, 13, 1),
    style=['--']
)
_ = ax_right.legend(["{} Crimes Total".format(month_indoor.sum())])

#### Result:
After separating the crimes into two sets (non-domestic and domestic, used here as a proxy for outdoor and indoor), both charts follow the same pattern: the right chart shows that domestic crimes also spike during the summer months. Hypothesis 1 is therefore rejected.
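
Judging "the same pattern" by eye can be complemented with a number. One possible approach, sketched here on made-up monthly counts that are NOT taken from the Chicago data, is to convert each series to monthly shares and compute their Pearson correlation. All variable names are hypothetical.

```python
import pandas as pd

months = range(1, 13)

# Made-up monthly counts for illustration only.
non_domestic = pd.Series(
    [80, 70, 85, 90, 100, 110, 115, 112, 100, 95, 85, 80], index=months
)
domestic = pd.Series(
    [16, 14, 17, 18, 20, 22, 23, 22, 20, 19, 17, 16], index=months
)

# Convert counts to each month's share of the yearly total, so the two
# series can be compared even though their magnitudes differ.
non_domestic_share = non_domestic / non_domestic.sum()
domestic_share = domestic / domestic.sum()

# A Pearson correlation close to 1 means the two seasonal shapes move together.
similarity = non_domestic_share.corr(domestic_share)
print(round(similarity, 3))
```

A high correlation would support the visual impression that both series share the same seasonality, although it says nothing about the cause.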

## Visualise:
1. Plot crimes by Beat, District and Ward.
1. Build a Choropleth using the Chicago Boundaries dataset.
1. Plot crime by month, day, hour.
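
For the last item, the date index can be grouped at finer granularities than the month. A minimal sketch with three made-up timestamps (the names are hypothetical):

```python
import pandas as pd

idx = pd.to_datetime([
    "2017-03-06 08:15",  # a Monday
    "2017-03-06 22:40",  # the same Monday
    "2017-03-11 22:05",  # a Saturday
])
toy = pd.DataFrame({"Beat": [1, 2, 3]}, index=idx)

# Count records per hour of the day (0-23).
by_hour = toy.groupby(toy.index.hour).size()

# Count records per day of the week (0 = Monday ... 6 = Sunday).
by_weekday = toy.groupby(toy.index.dayofweek).size()

print(by_hour.to_dict(), by_weekday.to_dict())
```

The resulting series can be passed to `.plot()` in the same way as the monthly counts.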

**See dataset: https://data.cityofchicago.org/Public-Safety/Boundaries-Police-Beats-current-/aerh-rz74**

In [None]:
ax = top_10_mv_beats.plot.bar(
    x=0,
    y=1,
    figsize=(6, 4),
    title='Top 10 Most Violent Police Beats',
    color='#B0C4DE'
);
ax.set_ylabel('Count');
ax.set_xlabel('Beat #');
ax.set_axisbelow(True);
ax.yaxis.grid(color='gray', linestyle='dashed');
ax.xaxis.grid(color='gray', linestyle='dashed');

## Superpose data onto a map:
1. First separate the map into beats covering the entire city of Chicago.
1. Plot the data using a choropleth chart.
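
A choropleth needs a value attached to every polygon. One way to do this, sketched below with the standard library only, is to write the per-beat crime count into each GeoJSON feature's properties. The GeoJSON here is hand-written with two fake beats, and the property name `beat_num` is an assumption about the boundaries file rather than something verified in this notebook.

```python
import json

# Minimal hand-written GeoJSON with two fake beats.
# "beat_num" is an assumed property name for the beat identifier.
geojson = {
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature", "properties": {"beat_num": "0111"},
         "geometry": {"type": "Polygon",
                      "coordinates": [[[0, 0], [0, 1], [1, 1], [0, 0]]]}},
        {"type": "Feature", "properties": {"beat_num": "0222"},
         "geometry": {"type": "Polygon",
                      "coordinates": [[[1, 1], [1, 2], [2, 2], [1, 1]]]}},
    ],
}

crime_counts = {"0111": 42}  # made-up counts keyed by beat identifier

for feature in geojson["features"]:
    beat = feature["properties"]["beat_num"]
    # Beats with no recorded crimes get 0 so every polygon can still be coloured.
    feature["properties"]["crime_count"] = crime_counts.get(beat, 0)

# Serialised form that could be handed to the map as a data source.
geojson_text = json.dumps(geojson)
print(len(geojson_text) > 0)
```

The keys on both sides need to have the same type and formatting; an integer beat in the crime data will not match a zero-padded string in the boundaries file without conversion.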

In [None]:
mapbox_token = %env MAPBOX_TOKEN
data_set_home = %env DATA_SETS_HOME

replacements = {
    'mapbox_token': mapbox_token
}

html = """
    <!DOCTYPE html>
    <html>
    <head>
        <meta charset='utf-8'/>
        <title>Project</title>
        <meta name='viewport' content='initial-scale=1,maximum-scale=1,user-scalable=no'/>
        <script src='https://api.tiles.mapbox.com/mapbox-gl-js/v0.51.0/mapbox-gl.js'></script>
        <link href='https://api.tiles.mapbox.com/mapbox-gl-js/v0.51.0/mapbox-gl.css' rel='stylesheet'/>
        <style>
            body { margin:0; padding:0; }
            #map { position:absolute; top:0; bottom:0; width:100%; }
        </style>
    </head>
    <body>
        <div id='map'></div>
        <script>
        mapboxgl.accessToken = '(mapbox_token)';
        var map = new mapboxgl.Map({
            container: 'map',
            style: 'mapbox://styles/mapbox/streets-v9',
            center: [-87.6298, 41.8781], // [longitude, latitude] of downtown Chicago
            zoom: 9 // starting zoom
        });
        </script>
    </body>
    </html>
    """

for key, value in replacements.items():
    html = html.replace("({})".format(key), value)

display(HTML('<iframe srcdoc="{srcdoc}" style="width:110%;height:500px;"></iframe>'.format(srcdoc=html)))