# Client Project: The Lab @ DC

## Project Title: {here}

### Authors: {names}
- Cohorts of the Data Science Immersive, General Assembly @ Washington DC campus

In this notebook, we pull the raw data from Open Data DC and the Metropolitan Police Department (MPD) and transform and clean them for our analysis. **This is notebook 1 of 3.**

### Import Libraries

In [36]:
# import basic libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline

### Read CSVs
##### 1) Source: Open Data DC - City Service Requests datasets
- Every dataset from Open Data DC is larger than 100 MB, which exceeds GitHub's per-file size limit, so the files cannot be pushed to your remote GitHub repo.
- The code below downloads each file from its URL and saves it to your local machine, so you don't need to fetch it from the web every time you run the notebook.
- **Make sure you have a `.gitignore` file in your local repo** so that the large downloaded CSVs are not pushed to your remote repo. **Please follow the instructions in `README.md`.**

In [37]:
import os.path

# If a dataset already exists on your local machine, it is loaded directly from disk.
# If not, it is downloaded from City Service Requests, Open Data DC (http://opendata.dc.gov/) and saved locally.

# City Service Requests 2014 datasets
if os.path.isfile('./assets/CSR_2014.csv'):
    csr_2014 = pd.read_csv('./assets/CSR_2014.csv', low_memory=False)
else:
    csr_2014 = pd.read_csv('https://opendata.arcgis.com/datasets/17cafb3ffab347409def7e85e14c56bd_5.csv', low_memory=False)
    csr_2014.to_csv('./assets/CSR_2014.csv')

# City Service Requests 2015 datasets
if os.path.isfile('./assets/CSR_2015.csv'):
    csr_2015 = pd.read_csv('./assets/CSR_2015.csv', low_memory=False)
else:
    csr_2015 = pd.read_csv('https://opendata.arcgis.com/datasets/b93ec7fc97734265a2da7da341f1bba2_6.csv', low_memory=False)
    csr_2015.to_csv('./assets/CSR_2015.csv')

# City Service Requests 2016 datasets
if os.path.isfile('./assets/CSR_2016.csv'):
    csr_2016 = pd.read_csv('./assets/CSR_2016.csv', low_memory=False)
else:
    csr_2016 = pd.read_csv('https://opendata.arcgis.com/datasets/0e4b7d3a83b94a178b3d1f015db901ee_7.csv', low_memory=False)
    csr_2016.to_csv('./assets/CSR_2016.csv')

# City Service Requests 2017 datasets
if os.path.isfile('./assets/CSR_2017.csv'):
    csr_2017 = pd.read_csv('./assets/CSR_2017.csv', low_memory=False)
else:
    csr_2017 = pd.read_csv('https://opendata.arcgis.com/datasets/19905e2b0e1140ec9ce8437776feb595_8.csv', low_memory=False)
    csr_2017.to_csv('./assets/CSR_2017.csv')

# City Service Requests 2018 Q1 datasets
if os.path.isfile('./assets/CSR_2018_q1.csv'):
    csr_2018_q1 = pd.read_csv('./assets/CSR_2018_q1.csv', low_memory=False)
else:
    csr_2018_q1 = pd.read_csv('https://opendata.arcgis.com/datasets/2a46f1f1aad04940b83e75e744eb3b09_9.csv', low_memory=False)
    csr_2018_q1.to_csv('./assets/CSR_2018_q1.csv')
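
The five blocks above repeat the same load-or-download pattern, so it could be wrapped in one function. Below is a minimal sketch, assuming a hypothetical helper named `load_or_download` that is not part of the original notebook. It also differs from the cell above in one way: it saves with `index=False`, so the cached file does not gain an extra `Unnamed: 0` column when it is read back.

```python
import os

import pandas as pd


def load_or_download(url, local_path):
    """Return the CSV at local_path, downloading it from url on first use.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    if os.path.isfile(local_path):
        # A cached copy exists: read it from disk.
        return pd.read_csv(local_path, low_memory=False)

    # No cached copy yet: fetch from the source, then save it for next time.
    df = pd.read_csv(url, low_memory=False)
    folder = os.path.dirname(local_path)
    if folder:
        os.makedirs(folder, exist_ok=True)
    # index=False keeps the row index out of the saved file.
    df.to_csv(local_path, index=False)
    return df
```

With this helper, each dataset would load in one call, for example `csr_2014 = load_or_download('https://opendata.arcgis.com/datasets/17cafb3ffab347409def7e85e14c56bd_5.csv', './assets/CSR_2014.csv')`.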

##### 2) Source: Metropolitan Police Department - ShotSpotter datasets
- https://mpdc.dc.gov/publication/shotspotter-data-disclaimer-and-dictionary
- The datasets are already in the `./assets/` folder of the git repo, so there is no need to download them from the web.

In [38]:
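# Uncomment and run the line below if `pd.read_excel` fails because of a missing Excel reader.
# Note: recent pandas versions read `.xlsx` files with openpyxl, so `!pip install openpyxl` may be needed instead.

# !pip install xlrd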

In [39]:
# ShotSpotter datasets from the Metropolitan Police Department

# Train set: ShotSpotter dataset for 2014 - 2017
shots_train = pd.read_excel('./assets/ShotSpotter Data 14-17 180213_0.xlsx')

# Test set: ShotSpotter dataset for 2018 Q1
shots_test = pd.read_excel('./assets/ShotSpotter Public Data Q1 2018.xlsx')

### Basic settings with the datasets

Name the DataFrames

In [40]:
csr_2014.name    = 'City Service Requests 2014 data'
csr_2015.name    = 'City Service Requests 2015 data'
csr_2016.name    = 'City Service Requests 2016 data'
csr_2017.name    = 'City Service Requests 2017 data'
csr_2018_q1.name = 'City Service Requests 2018 Q1 data'
shots_train.name = 'Shot Spotters 2014-2017 data'
shots_test.name  = 'Shot Spotters 2018 Q1 data'
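
One caveat about the cell above: `.name` here is a plain Python attribute, so it is not carried over when a new DataFrame is derived from the old one (for example, by selecting columns). A more robust option is to keep the labels in a dictionary. The sketch below uses small stand-in frames with made-up values; the names `datasets`, `csr_demo`, and `shots_demo` are assumptions for illustration only.

```python
import pandas as pd

# Stand-in frames with made-up values; in the notebook these would be
# csr_2014, shots_train, and so on.
csr_demo = pd.DataFrame({'WARD': [1, 2, 3], 'ZIPCODE': [20001, 20002, 20003]})
shots_demo = pd.DataFrame({'ID': [10, 11], 'Type': ['a', 'b']})

# Map each human-readable label to its DataFrame.
datasets = {
    'City Service Requests (demo)': csr_demo,
    'ShotSpotter (demo)': shots_demo,
}

# The label travels with the dictionary key, not with the DataFrame object.
shapes = {label: df_.shape for label, df_ in datasets.items()}
for label, shape in shapes.items():
    print(f"Shape of {label}: {shape}")
```

Because the label lives in the dictionary key, it survives any reassignment of the underlying DataFrame.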

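Drop the `Unnamed: 0` column from any DataFrame that has one. This column appears when a CSV saved with `to_csv` (without `index=False`) is read back, because the saved row index is loaded as a regular column.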

In [41]:
for df_ in [csr_2014, csr_2015, csr_2016, csr_2017, csr_2018_q1, shots_train, shots_test]:
    if df_.columns[0] == "Unnamed: 0":
        df_.drop(['Unnamed: 0'], axis=1, inplace=True)
        print("Dropped unnamed column in", df_.name)
    else:
        print("No columns dropped in", df_.name)

Dropped unnamed column in City Service Requests 2014 data
Dropped unnamed column in City Service Requests 2015 data
Dropped unnamed column in City Service Requests 2016 data
Dropped unnamed column in City Service Requests 2017 data
Dropped unnamed column in City Service Requests 2018 Q1 data
No columns dropped in Shot Spotters 2014-2017 data
No columns dropped in Shot Spotters 2018 Q1 data


Check the shapes of the datasets

In [42]:
print("Shape of City Service Requests 2014:    ", csr_2014.shape)
print("Shape of City Service Requests 2015:    ", csr_2015.shape)
print("Shape of City Service Requests 2016:    ", csr_2016.shape)
print("Shape of City Service Requests 2017:    ", csr_2017.shape)
print("Shape of City Service Requests 2018 Q1: ", csr_2018_q1.shape)
print("--------------------")
print("Shape of Shot Spotters 2014-2017: ", shots_train.shape)
print("Shape of Shot Spotters 2018 Q1  : ", shots_test.shape)

Shape of City Service Requests 2014:     (322469, 30)
Shape of City Service Requests 2015:     (295633, 30)
Shape of City Service Requests 2016:     (302985, 30)
Shape of City Service Requests 2017:     (310146, 30)
Shape of City Service Requests 2018 Q1:  (151284, 30)
--------------------
Shape of Shot Spotters 2014-2017:  (9637, 7)
Shape of Shot Spotters 2018 Q1  :  (1072, 7)


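The City Service Requests datasets most likely share the same column names, but let's make sure. Selecting each dataset by the previous year's columns puts the columns in the same order and raises a `KeyError` if any column is missing.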

In [43]:
csr_2015 = csr_2015[csr_2014.columns]
csr_2016 = csr_2016[csr_2015.columns]
csr_2017 = csr_2017[csr_2016.columns]
csr_2018_q1 = csr_2018_q1[csr_2017.columns]
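
The selection above fails loudly when a column is missing, but it silently discards any extra column that exists only in the later dataset. To see both kinds of difference explicitly, a check along these lines could be run first. This is a sketch with a hypothetical helper, `column_differences`, and stand-in frames whose columns are chosen only for illustration.

```python
import pandas as pd


def column_differences(reference, other):
    """Return (missing, extra) column names of `other` relative to `reference`."""
    ref_cols = set(reference.columns)
    other_cols = set(other.columns)
    missing = sorted(ref_cols - other_cols)  # in reference, not in other
    extra = sorted(other_cols - ref_cols)    # in other, not in reference
    return missing, extra


# Stand-in frames, for illustration only.
df_a = pd.DataFrame(columns=['X', 'Y', 'WARD'])
df_b = pd.DataFrame(columns=['Y', 'X', 'ZIPCODE'])

missing, extra = column_differences(df_a, df_b)
print("Missing:", missing)
print("Extra:  ", extra)
```

Two empty lists would mean both datasets have exactly the same columns, possibly in a different order.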

In [44]:
# concat the datasets
csr = [csr_2014, csr_2015, csr_2016, csr_2017]

csr_train = pd.concat(csr)
print(csr_train.shape)

(1231233, 30)
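
Note that `pd.concat` keeps each DataFrame's original row labels by default, so the combined index restarts at 0 for every year and contains duplicates. If a unique index is wanted, `ignore_index=True` renumbers the rows. A small sketch with stand-in frames (`part_a` and `part_b` are made up for illustration):

```python
import pandas as pd

# Two stand-in frames, each with its own default 0-based index.
part_a = pd.DataFrame({'value': [1, 2]})
part_b = pd.DataFrame({'value': [3, 4, 5]})

# Default behaviour: original row labels are kept, so they repeat.
kept = pd.concat([part_a, part_b])

# ignore_index=True assigns a fresh 0..n-1 index to the result.
renumbered = pd.concat([part_a, part_b], ignore_index=True)

print(kept.index.tolist())        # [0, 1, 0, 1, 2]
print(renumbered.index.tolist())  # [0, 1, 2, 3, 4]
print(kept.index.is_unique)       # False
```

Whether to renumber is a judgment call: duplicate labels are harmless for column-wise summaries, but they can give surprising results with label-based lookups such as `.loc`.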


In [45]:
# check the column names of the ShotSpotter datasets
print(shots_train.columns)
print(shots_test.columns)

Index(['ID', 'Type', 'Date', 'Time', 'Source', 'Lat (100)', 'Lon (100)'], dtype='object')
Index(['ID', 'Type', 'Date', 'Time', 'Source', 'Lat (100m)', 'Lon (100m)'], dtype='object')


In [46]:
# rename the coordinate columns so that both datasets use the same names
shots_train = shots_train.rename(columns={'Lat (100)': 'Latitude', 'Lon (100)': 'Longitude'})
shots_test = shots_test.rename(columns={'Lat (100m)': 'Latitude', 'Lon (100m)': 'Longitude'})

##### Team: the two datasets record latitude and longitude with a different number of decimal places. Will this matter?

In [48]:
csr_train[['LATITUDE', 'LONGITUDE']].head()

| | LATITUDE | LONGITUDE |
|---|---|---|
| 0 | 38.89795 | -76.972732 |
| 1 | 38.922857 | -76.991905 |
| 2 | 38.874742 | -76.970889 |
| 3 | 38.942811 | -77.022676 |
| 4 | 38.898952 | -77.048838 |


In [49]:
shots_train[['Latitude', 'Longitude']].head()

| | Latitude | Longitude |
|---|---|---|
| 0 | 38.917 | -77.012 |
| 1 | 38.917 | -77.002 |
| 2 | 38.917 | -76.987 |
| 3 | 38.823 | -77.0 |
| 4 | 38.893 | -76.993 |
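
The ShotSpotter coordinates shown here have three decimal places, while the City Service Requests coordinates have up to six. One thousandth of a degree of latitude is roughly 111 m, which is consistent with the `(100m)` in the original ShotSpotter column names. If the two datasets need to be compared on the same grid, one option is to round the City Service Requests coordinates to three decimals. The sketch below assumes the ShotSpotter values were rounded rather than truncated, which the source does not state.

```python
import pandas as pd

# First three rows of the City Service Requests coordinates shown above.
csr_sample = pd.DataFrame({
    'LATITUDE': [38.89795, 38.922857, 38.874742],
    'LONGITUDE': [-76.972732, -76.991905, -76.970889],
})

# Round both coordinate columns to three decimals (assumption: this matches
# how the ShotSpotter coordinates were coarsened).
rounded = csr_sample[['LATITUDE', 'LONGITUDE']].round(3)
print(rounded)
```

Rounding loses location detail, so it is worth keeping the original columns and storing the rounded values under new names.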


##### Basic EDA
- Let's start with some exploratory data analysis (EDA).

In [50]:
csr_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1231233 entries, 0 to 310145
Data columns (total 30 columns):
X                             1231233 non-null float64
Y                             1231233 non-null float64
OBJECTID                      1231233 non-null int64
SERVICECODE                   1231233 non-null object
SERVICECODEDESCRIPTION        1231233 non-null object
SERVICETYPECODEDESCRIPTION    1230379 non-null object
ORGANIZATIONACRONYM           1231232 non-null object
SERVICECALLCOUNT              1231233 non-null int64
ADDDATE                       1231233 non-null object
RESOLUTIONDATE                1145187 non-null object
SERVICEDUEDATE                1218530 non-null object
SERVICEORDERDATE              1231233 non-null object
INSPECTIONFLAG                1231233 non-null object
INSPECTIONDATE                434130 non-null object
INSPECTORNAME                 40361 non-null object
SERVICEORDERSTATUS            1230380 non-null object
STATUS_CODE                

In [51]:
csr_train.isnull().sum().sort_values(ascending=False)

INSPECTORNAME                 1190872
INSPECTIONDATE                 797103
DETAILS                        444580
MARADDRESSREPOSITORYID         189162
STATUS_CODE                    151801
RESOLUTIONDATE                  86046
CITY                            50324
STATE                           50324
STREETADDRESS                   49730
SERVICEDUEDATE                  12703
WARD                             6221
PRIORITY                         2677
SERVICETYPECODEDESCRIPTION        854
SERVICEORDERSTATUS                853
ZIPCODE                            16
ORGANIZATIONACRONYM                 1
SERVICEREQUESTID                    0
XCOORD                              0
INSPECTIONFLAG                      0
SERVICEORDERDATE                    0
YCOORD                              0
LATITUDE                            0
ADDDATE                             0
SERVICECALLCOUNT                    0
LONGITUDE                           0
SERVICECODEDESCRIPTION              0
SERVICECODE 