## Objective:

The objective of this task was to use crime data from Chicago's Data Portal (Crimes - 2001 to present) to train a supervised learning model that could predict the occurrence of a crime in Chicago.

In [2]:
from sklearn import linear_model
import pandas as pd
import random

# The data to load
f = "crime_data/Crimes_-_2001_to_present.csv"

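# Load a random sample of the .csv by choosing which rows to skip
with open(f) as fh:
    num_lines = sum(1 for _ in fh)
print(num_lines)
# Sample size: about 20% of the rows. Any more than this and I run into memory issues
size = int(num_lines / 5)
# Row 0 is the header, so only rows 1..num_lines-1 are candidates for skipping
skip_idx = random.sample(range(1, num_lines), num_lines - size)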

# Read the data
# According to the Chicago Data Portal, 'Community Areas' is the current column, so 'Community Area' is dropped.
drops = ['Case Number', 'Block', 'Description', 'Location Description', 'Updated On', 'Community Area']
data = pd.read_csv(f, skiprows=skip_idx).drop(columns=drops)
data.columns = ['ID','Date','IUCR',
                     'Primary_Type','Arrest','Domestic', 'Beat', 'District',
                     'Ward', 'FBI_Code', 'X_Coordinate', 'Y_Coordinate',
                     'Year', 'Latitude', 'Longitude', 'Location', 'Historical_Wards',
                     'Zip Codes', 'Community_Areas', 'Census_Tracts', 'Wards', 'Boundaries_ZIP',
                     'Police_Dist', 'Police_Beats']
print(data['Primary_Type'].value_counts())



6856243
THEFT                                288892
BATTERY                              249885
CRIMINAL DAMAGE                      156209
NARCOTICS                            143466
OTHER OFFENSE                         85551
ASSAULT                               85414
BURGLARY                              78371
MOTOR VEHICLE THEFT                   63526
DECEPTIVE PRACTICE                    54219
ROBBERY                               51918
CRIMINAL TRESPASS                     39177
WEAPONS VIOLATION                     14575
PROSTITUTION                          13734
PUBLIC PEACE VIOLATION                 9795
OFFENSE INVOLVING CHILDREN             9351
CRIM SEXUAL ASSAULT                    5734
SEX OFFENSE                            5262
INTERFERENCE WITH PUBLIC OFFICER       3218
GAMBLING                               2930
LIQUOR LAW VIOLATION                   2745
ARSON                                  2189
HOMICIDE                               1935
KIDNAPPING              

The code above is part of the **eda.py** program, which limits the crime data (*Crimes_-_2001_to_present.csv*) to a random 20% sample of the original rows. Working with a larger sample caused memory issues.

The next step was to get a sense of the types of crimes in these data. The "Primary Type" column gives a standardized description of the type of crime each observation represents, and the data contain a wide array of offenses. A model that could predict every type of crime would require a large number of different features, so I limited this exercise to a single type of crime. I chose burglary because it is a common offense in Chicago and because I felt I could develop an appropriate set of model features for predicting it.

The sample data were then limited to burglaries and written to a CSV file.
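
The filtering step described above can be sketched as follows. This is a minimal illustration rather than the code in **eda.py**: the helper name `keep_burglaries`, the tiny example frame, and the output path in the final comment are all assumptions made for the example.

```python
import pandas as pd


def keep_burglaries(data: pd.DataFrame) -> pd.DataFrame:
    """Return only the rows whose Primary_Type is BURGLARY."""
    mask = data["Primary_Type"] == "BURGLARY"
    return data.loc[mask].reset_index(drop=True)


# Tiny made-up frame standing in for the 20% sample loaded earlier.
sample = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "Primary_Type": ["THEFT", "BURGLARY", "BATTERY", "BURGLARY"],
})
burglaries = keep_burglaries(sample)

# Hypothetical output path; the actual file name is not given in the text.
# burglaries.to_csv("crime_data/burglaries_sample.csv", index=False)
```

Resetting the index keeps the burglary subset numbered from zero, which makes later positional operations less error-prone.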

## Features to Predict Burglary:

The program **burglaries_predictor_variables.py** takes the dataset described in the last step and adds some columns that will be used as features for our model:

### 1. Time-Related Features

* For each observation, how much time has elapsed since the last burglary in the same community area (in days)
* Dummy variable to indicate working hours of the day (8am - 7pm)
* Dummy variable to indicate colder (winter) months (October to March)
* Dummy variable to indicate work days of the week (M - F)
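
One way the four time-related features might be derived with pandas is sketched below. It is not the code from **burglaries_predictor_variables.py**: the function name and the new column names (`days_since_last`, `working_hours`, `winter`, `weekday`) are hypothetical, and it assumes that 7pm is an exclusive upper bound for working hours and that the first burglary in each community area has no elapsed time (left as NaN).

```python
import pandas as pd


def add_time_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add time-related feature columns (hypothetical names) to a burglary frame."""
    out = df.copy()
    out["Date"] = pd.to_datetime(out["Date"])
    out = out.sort_values(["Community_Areas", "Date"]).reset_index(drop=True)

    # Days since the previous burglary in the same community area.
    # The first burglary in each area has no predecessor and stays NaN.
    gap = out.groupby("Community_Areas")["Date"].diff()
    out["days_since_last"] = gap.dt.total_seconds() / 86400.0

    # Working hours: 8am up to, but not including, 7pm (assumed boundary).
    hour = out["Date"].dt.hour
    out["working_hours"] = ((hour >= 8) & (hour < 19)).astype(int)

    # Colder months: October through March.
    out["winter"] = out["Date"].dt.month.isin([10, 11, 12, 1, 2, 3]).astype(int)

    # Work days: Monday (0) through Friday (4).
    out["weekday"] = (out["Date"].dt.dayofweek < 5).astype(int)
    return out


# Made-up rows for illustration only.
demo = pd.DataFrame({
    "Date": ["2015-01-05 09:30:00", "2015-01-07 21:30:00", "2015-07-11 12:00:00"],
    "Community_Areas": [10, 10, 22],
})
featured = add_time_features(demo)
```

Sorting by community area and date before calling `diff` is what makes each gap refer to the immediately preceding burglary in the same area.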

### 2. Geographical Features (Community Area Features)

* Average amount of time that elapses between burglaries within a community area (in days)
* Dummy variables to indicate each community area
* Total number of burglaries in that community area
* Number of affordable housing units within that community area
    * The data on affordable housing units by community area were obtained from here: https://data.cityofchicago.org/Community-Economic-Development/Affordable-Housing-Units-by-Community-Area/yvj4-y3fb
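
The community-area features above could be assembled roughly as follows. This is a sketch under stated assumptions, not the original program: the housing column names `community_area` and `units` are placeholders for whatever the affordable housing file actually uses, the identifiers in both tables are assumed to refer to the same community areas, `days_since_last` is a hypothetical column holding the elapsed-time feature, and areas without housing records are assumed to have zero units.

```python
import pandas as pd


def add_area_features(df: pd.DataFrame, housing: pd.DataFrame) -> pd.DataFrame:
    """Add community-area features (hypothetical column names)."""
    out = df.copy()

    # Total burglaries and mean gap between burglaries, per community area.
    out["area_burglaries"] = out.groupby("Community_Areas")["ID"].transform("count")
    out["area_mean_gap_days"] = (
        out.groupby("Community_Areas")["days_since_last"].transform("mean")
    )

    # Sum affordable housing units per area, then attach them to each burglary.
    units = (
        housing.groupby("community_area", as_index=False)["units"].sum()
        .rename(columns={"community_area": "Community_Areas",
                         "units": "affordable_units"})
    )
    out = out.merge(units, on="Community_Areas", how="left")
    out["affordable_units"] = out["affordable_units"].fillna(0)

    # One dummy column per community area.
    dummies = pd.get_dummies(out["Community_Areas"], prefix="area", dtype=int)
    return pd.concat([out, dummies], axis=1)


# Made-up inputs for illustration only.
burglary_rows = pd.DataFrame({
    "ID": [1, 2, 3],
    "Community_Areas": [10, 10, 22],
    "days_since_last": [float("nan"), 2.5, float("nan")],
})
housing_rows = pd.DataFrame({"community_area": [10, 10], "units": [40, 60]})
area_features = add_area_features(burglary_rows, housing_rows)
```

Using `transform` keeps one row per burglary, so the area-level values are simply repeated on every observation from that area.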

### 3. Law Enforcement Features

* Number of police beats (crimes dataset)
* Haversine distance to nearest police station (in kilometers)
    * The latitude and longitude for police station locations in Chicago were obtained from: https://data.cityofchicago.org/Public-Safety/Police-Stations/z8bn-74gv
    * Haversine distances were calculated from each burglary to each police station. Then, for each burglary, the minimum of the calculated distances was used as the distance to the nearest police station.
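
The nearest-station calculation can be illustrated with NumPy broadcasting. This is a minimal sketch, not the original implementation: the function names are hypothetical, the coordinates are made up, and a mean Earth radius of 6371.0088 km is assumed.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0088  # assumed mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


def nearest_station_km(crime_lat, crime_lon, station_lat, station_lon):
    """For each crime, return the distance to the closest station."""
    # Shape crimes as a column and stations as a row so that broadcasting
    # yields a (n_crimes, n_stations) matrix of distances.
    crime_lat = np.asarray(crime_lat, dtype=float)[:, None]
    crime_lon = np.asarray(crime_lon, dtype=float)[:, None]
    station_lat = np.asarray(station_lat, dtype=float)[None, :]
    station_lon = np.asarray(station_lon, dtype=float)[None, :]
    distances = haversine_km(crime_lat, crime_lon, station_lat, station_lon)
    return distances.min(axis=1)


# Made-up coordinates for illustration only.
nearest = nearest_station_km(
    crime_lat=[41.88, 41.75], crime_lon=[-87.63, -87.60],
    station_lat=[41.88, 41.70], station_lon=[-87.63, -87.60],
)
```

Computing the full distance matrix is simple and fast for a few dozen stations. With very many burglaries it may be worth processing the rows in chunks to limit memory use.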

## Random Forest Model: