# Client Project: The Lab @ DC

## Project Title: Predicting Shots

### Authors: Kihoon Sohn, Brian Collins, Harsha Goonawardana, Priya Kakkar
- Cohorts of the Data Science Immersive, General Assembly @ Washington DC campus

In this notebook, we pull the raw data from Open Data DC and the Metropolitan Police Department (MPD) and transform and clean them for our analysis. **This is notebook 1 of 3.**

### Import Libraries

In [1]:
# import basic libraries

import pandas as pd
import os, os.path

In [2]:
# set the path for the datasets.

csr_path = "assets/csr/"
mpd_path = "assets/mpd/"
os.makedirs(csr_path, exist_ok=True)
os.makedirs(mpd_path, exist_ok=True)

### Read CSVs
##### 1) Source: Open Data DC - City Service Requests datasets
- Each dataset from Open Data DC is over 100 MB, so it will not fit in your remote GitHub repo because of the file size limit.
- The following code downloads each file from its URL and saves it to your local machine, so that you don't need to fetch it from the web every time you run the notebook.
- **Make sure you have a `.gitignore` file in your local repo** so that the large downloaded CSVs are not pushed to your remote repo. **Please check the instructions in `README.md`.**

In [3]:
# If a dataset is already on your local machine, it is loaded directly.
# If not, it is downloaded from City Service Requests, Open Data DC (http://opendata.dc.gov/)

def load_or_download(local_file, url):
    """Read the CSV from disk if it is cached; otherwise download it and cache it."""
    if os.path.isfile(local_file):
        return pd.read_csv(local_file, low_memory=False)
    df = pd.read_csv(url, low_memory=False)
    df.to_csv(local_file, index=False)
    return df

# City Service Requests 2014 datasets
csr_2014 = load_or_download('./assets/csr/CSR_2014.csv',
                            'https://opendata.arcgis.com/datasets/17cafb3ffab347409def7e85e14c56bd_5.csv')

# City Service Requests 2015 datasets
csr_2015 = load_or_download('./assets/csr/CSR_2015.csv',
                            'https://opendata.arcgis.com/datasets/b93ec7fc97734265a2da7da341f1bba2_6.csv')

# City Service Requests 2016 datasets
csr_2016 = load_or_download('./assets/csr/CSR_2016.csv',
                            'https://opendata.arcgis.com/datasets/0e4b7d3a83b94a178b3d1f015db901ee_7.csv')

# City Service Requests 2017 datasets
csr_2017 = load_or_download('./assets/csr/CSR_2017.csv',
                            'https://opendata.arcgis.com/datasets/19905e2b0e1140ec9ce8437776feb595_8.csv')

# City Service Requests 2018 Q1 datasets
csr_2018_q1 = load_or_download('./assets/csr/CSR_2018_q1.csv',
                               'https://opendata.arcgis.com/datasets/2a46f1f1aad04940b83e75e744eb3b09_9.csv')

##### 2) Source: Metropolitan Police Department - ShotSpotter datasets
- https://mpdc.dc.gov/publication/shotspotter-data-disclaimer-and-dictionary
- The datasets are in the `./assets/mpd/` folder of the git repo, so there is no need to download them from the web.

In [4]:
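# uncomment and run the lines below if you have problems with `.read_excel`
# (xlrd 2.0+ no longer reads .xlsx files; recent pandas versions use openpyxl for them)

# !pip install xlrd
# !pip install openpyxl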

In [5]:
# ShotSpotter datasets from the Metropolitan Police Department

# Train set: ShotSpotter datasets for 2014 - 2017 (one sheet per year)
shots_2014 = pd.read_excel('./assets/mpd/ShotSpotter Data 14-17 180213_0.xlsx', sheet_name=0)
shots_2015 = pd.read_excel('./assets/mpd/ShotSpotter Data 14-17 180213_0.xlsx', sheet_name=1)
shots_2016 = pd.read_excel('./assets/mpd/ShotSpotter Data 14-17 180213_0.xlsx', sheet_name=2)
shots_2017 = pd.read_excel('./assets/mpd/ShotSpotter Data 14-17 180213_0.xlsx', sheet_name=3)

# Test set: ShotSpotter dataset for 2018 Q1
shots_2018_q1 = pd.read_excel('./assets/mpd/ShotSpotter Public Data Q1 2018.xlsx')

### Basic settings with the datasets

Name the DataFrames

In [6]:
csr_2014.name      = 'City Service Requests 2014 data'
csr_2015.name      = 'City Service Requests 2015 data'
csr_2016.name      = 'City Service Requests 2016 data'
csr_2017.name      = 'City Service Requests 2017 data'
csr_2018_q1.name   = 'City Service Requests 2018 Q1 data'
shots_2014.name    = 'Shot Spotters 2014 data'
shots_2015.name    = 'Shot Spotters 2015 data'
shots_2016.name    = 'Shot Spotters 2016 data'
shots_2017.name    = 'Shot Spotters 2017 data'
shots_2018_q1.name = 'Shot Spotters 2018 Q1 data'
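Setting `.name` directly on a DataFrame stores a plain Python attribute, which pandas does not carry over to new DataFrames derived from it (for example, the result of selecting columns). One possible alternative is to keep the labels in a dictionary next to the data. The sketch below uses toy frames; the `labeled` dictionary, the column names, and the data are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd

# Toy stand-ins for the real datasets (hypothetical data).
toy_csr = pd.DataFrame({"request_id": ["A", "B"]})
toy_shots = pd.DataFrame({"ID": [1, 2, 3]})

# Keep the human-readable label next to each DataFrame instead of on it.
labeled = {
    "City Service Requests (toy)": toy_csr,
    "Shot Spotters (toy)": toy_shots,
}

# The labels remain usable no matter how the DataFrames are later transformed.
summary = {label: df.shape for label, df in labeled.items()}
for label, shape in summary.items():
    print(label, shape)
```

Iterating over `labeled.items()` gives the label and the DataFrame together, which is all the loop in the next cell needs.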

Make sure that no dataset has an `Unnamed: 0` column.

In [7]:
for df_ in [csr_2014, csr_2015, csr_2016, csr_2017, csr_2018_q1, shots_2014, shots_2015, shots_2016, shots_2017, shots_2018_q1]:
    if df_.columns[0] == "Unnamed: 0":
        df_.drop(['Unnamed: 0'], axis=1, inplace=True)
        print("Dropped unnamed column in", df_.name)
    else:
        print("No columns dropped in", df_.name)

Dropped unnamed column in City Service Requests 2014 data
Dropped unnamed column in City Service Requests 2015 data
Dropped unnamed column in City Service Requests 2016 data
Dropped unnamed column in City Service Requests 2017 data
Dropped unnamed column in City Service Requests 2018 Q1 data
No columns dropped in Shot Spotters 2014 data
No columns dropped in Shot Spotters 2015 data
No columns dropped in Shot Spotters 2016 data
No columns dropped in Shot Spotters 2017 data
No columns dropped in Shot Spotters 2018 Q1 data
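An `Unnamed: 0` column usually appears when a CSV was written together with its index and then read back without `index_col`. The loop above only inspects the first column. A minimal sketch of a more general filter follows, using an in-memory toy CSV; the data and the `drop_unnamed` helper are assumptions for illustration.

```python
import io
import pandas as pd

def drop_unnamed(df):
    """Return df without any column whose name starts with 'Unnamed'."""
    keep = ~df.columns.astype(str).str.startswith("Unnamed")
    return df.loc[:, keep]

# A CSV that was written with its index: the first header field is empty.
raw = io.StringIO(",a,b\n0,1,2\n1,3,4\n")
toy = pd.read_csv(raw)
print(list(toy.columns))      # ['Unnamed: 0', 'a', 'b']

cleaned = drop_unnamed(toy)
print(list(cleaned.columns))  # ['a', 'b']
```

Unlike `inplace=True`, this returns a new DataFrame, so the result has to be assigned back to a variable.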


Check the shapes of the datasets

In [8]:
print("Shape of City Service Requests 2014   : ", csr_2014.shape)
print("Shape of City Service Requests 2015   : ", csr_2015.shape)
print("Shape of City Service Requests 2016   : ", csr_2016.shape)
print("Shape of City Service Requests 2017   : ", csr_2017.shape)
print("Shape of City Service Requests 2018 Q1: ", csr_2018_q1.shape)
print("--------------------")
print("Shape of Shot Spotters 2014   : ", shots_2014.shape)
print("Shape of Shot Spotters 2015   : ", shots_2015.shape)
print("Shape of Shot Spotters 2016   : ", shots_2016.shape)
print("Shape of Shot Spotters 2017   : ", shots_2017.shape)
print("Shape of Shot Spotters 2018 Q1: ", shots_2018_q1.shape)

Shape of City Service Requests 2014   :  (322469, 30)
Shape of City Service Requests 2015   :  (295633, 30)
Shape of City Service Requests 2016   :  (302985, 30)
Shape of City Service Requests 2017   :  (310146, 30)
Shape of City Service Requests 2018 Q1:  (152796, 30)
--------------------
Shape of Shot Spotters 2014   :  (9637, 7)
Shape of Shot Spotters 2015   :  (7952, 7)
Shape of Shot Spotters 2016   :  (5872, 7)
Shape of Shot Spotters 2017   :  (4882, 7)
Shape of Shot Spotters 2018 Q1:  (1072, 7)


The City Service Requests datasets most likely share the same column names, but reorder each year's columns to match the previous year's to make sure.

In [9]:
csr_2015 = csr_2015[csr_2014.columns]
csr_2016 = csr_2016[csr_2015.columns]
csr_2017 = csr_2017[csr_2016.columns]
csr_2018_q1 = csr_2018_q1[csr_2017.columns]
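Selecting with another DataFrame's `.columns` raises a `KeyError` if a column is missing, but it silently drops any extra columns. A minimal sketch of an explicit comparison before reordering is shown below with toy frames; `compare_columns` is a hypothetical helper, not part of the original notebook.

```python
import pandas as pd

def compare_columns(reference, other):
    """Return (missing, extra) column names of `other` relative to `reference`."""
    ref_cols, other_cols = set(reference.columns), set(other.columns)
    return sorted(ref_cols - other_cols), sorted(other_cols - ref_cols)

ref = pd.DataFrame({"a": [1], "b": [2]})
same_but_shuffled = pd.DataFrame({"b": [3], "a": [4]})
with_extra = pd.DataFrame({"a": [5], "b": [6], "c": [7]})

print(compare_columns(ref, same_but_shuffled))  # ([], [])
print(compare_columns(ref, with_extra))         # ([], ['c'])

# Reorder only once the column sets are known to match.
aligned = same_but_shuffled[ref.columns]
print(list(aligned.columns))                    # ['a', 'b']
```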

In [10]:
# concat the datasets
csr = [csr_2014, csr_2015, csr_2016, csr_2017]

csr_train = pd.concat(csr)
print("CSR train set shape   : ", csr_train.shape)

csr_test = csr_2018_q1
print("CSR test set shape    : ", csr_test.shape)

shots = [shots_2014, shots_2015, shots_2016, shots_2017]
shots_train = pd.concat(shots)
print("Shots train set shape : ", shots_train.shape)

shots_test = shots_2018_q1
print("Shots test set shape  : ", shots_test.shape)

CSR train set shape   :  (1231233, 30)
CSR test set shape    :  (152796, 30)
Shots train set shape :  (28343, 7)
Shots test set shape  :  (1072, 7)
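As a quick consistency check, the row counts of the concatenated train sets should equal the sum of the yearly row counts printed earlier. The arithmetic below uses only the shapes reported in this notebook.

```python
# Row counts per year, copied from the shapes printed earlier.
csr_rows = {2014: 322469, 2015: 295633, 2016: 302985, 2017: 310146}
shots_rows = {2014: 9637, 2015: 7952, 2016: 5872, 2017: 4882}

csr_total = sum(csr_rows.values())
shots_total = sum(shots_rows.values())

print(csr_total)    # 1231233, matches the CSR train set shape
print(shots_total)  # 28343, matches the Shots train set shape
```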


In [11]:
# check the ShotSpotter dataset columns
print(shots_train.columns)
print(shots_test.columns)

Index(['ID', 'Type', 'Date', 'Time', 'Source', 'Lat (100)', 'Lon (100)'], dtype='object')
Index(['ID', 'Type', 'Date', 'Time', 'Source', 'Lat (100m)', 'Lon (100m)'], dtype='object')


In [12]:
# rename the coordinate columns so that the train and test sets use the same names
shots_train = shots_train.rename(columns={'Lat (100)': 'Latitude', 'Lon (100)': 'Longitude'})
shots_test = shots_test.rename(columns={'Lat (100m)': 'Latitude', 'Lon (100m)': 'Longitude'})
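Because `DataFrame.rename` ignores mapping keys that are not present (its default behavior), a single mapping can cover both header variants. A minimal sketch with toy frames follows; the `COORD_RENAMES` name and the coordinate values are assumptions for illustration.

```python
import pandas as pd

# One mapping that covers both spellings of the coordinate headers.
COORD_RENAMES = {
    "Lat (100)": "Latitude", "Lon (100)": "Longitude",
    "Lat (100m)": "Latitude", "Lon (100m)": "Longitude",
}

# Toy frames that mimic the two header variants (hypothetical values).
toy_train = pd.DataFrame({"ID": [1], "Lat (100)": [38.9], "Lon (100)": [-77.0]})
toy_test = pd.DataFrame({"ID": [2], "Lat (100m)": [38.8], "Lon (100m)": [-76.9]})

toy_train = toy_train.rename(columns=COORD_RENAMES)
toy_test = toy_test.rename(columns=COORD_RENAMES)

# Both frames now expose identical column names.
print(list(toy_train.columns) == list(toy_test.columns))
```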

### Save to CSVs (proceed to the 2nd notebook for the EDA)

In [13]:
csr_train.to_csv('./assets/csr/csr_train.csv', index=False)
csr_test.to_csv('./assets/csr/csr_test.csv', index=False)
shots_train.to_csv('./assets/mpd/shots_train.csv', index=False)
shots_test.to_csv('./assets/mpd/shots_test.csv', index=False)