# Client Project: The Lab @ DC

## Project Title: {here}

### Authors: {names}
- Cohorts of the Data Science Immersive, General Assembly @ Washington DC campus

In this notebook, we have Exploratory Data Analysis on the datasets. **This is notebook 2 of 3.**

### Import Libraries

In [77]:
# import basic libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline

### Read CSVs

In [78]:
csr_train   = pd.read_csv('./assets/csr/csr_train.csv', low_memory=False)
shots_train = pd.read_csv('./assets/mpd/shots_train.csv', low_memory=False)
shots_test  = pd.read_csv('./assets/mpd/shots_test.csv', low_memory=False)

### Basic EDAs and Data Cleaning

In [79]:
csr_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1231233 entries, 0 to 1231232
Data columns (total 31 columns):
Unnamed: 0                    1231233 non-null int64
X                             1231233 non-null float64
Y                             1231233 non-null float64
OBJECTID                      1231233 non-null int64
SERVICECODE                   1231233 non-null object
SERVICECODEDESCRIPTION        1231233 non-null object
SERVICETYPECODEDESCRIPTION    1230379 non-null object
ORGANIZATIONACRONYM           1231232 non-null object
SERVICECALLCOUNT              1231233 non-null int64
ADDDATE                       1231233 non-null object
RESOLUTIONDATE                1145201 non-null object
SERVICEDUEDATE                1218530 non-null object
SERVICEORDERDATE              1231233 non-null object
INSPECTIONFLAG                1231233 non-null object
INSPECTIONDATE                434130 non-null object
INSPECTORNAME                 40361 non-null object
SERVICEORDERSTATUS         

In [80]:
csr_train.isnull().sum().sort_values(ascending=False)

INSPECTORNAME                 1190872
INSPECTIONDATE                 797103
DETAILS                        444566
MARADDRESSREPOSITORYID         189162
STATUS_CODE                    151801
RESOLUTIONDATE                  86032
CITY                            50324
STATE                           50324
STREETADDRESS                   49730
SERVICEDUEDATE                  12703
WARD                             6221
PRIORITY                         2677
SERVICETYPECODEDESCRIPTION        854
SERVICEORDERSTATUS                853
ZIPCODE                            16
ORGANIZATIONACRONYM                 1
SERVICECALLCOUNT                    0
SERVICECODE                         0
OBJECTID                            0
Y                                   0
X                                   0
SERVICECODEDESCRIPTION              0
INSPECTIONFLAG                      0
ADDDATE                             0
SERVICEORDERDATE                    0
SERVICEREQUESTID                    0
XCOORD      

In [81]:
shots_train.head()

Unnamed: 0.1,Unnamed: 0,ID,Type,Date,Time,Source,Latitude,Longitude
0,0,5D39700,Multiple_Gunshots,2014-01-01,00:00:02,WashingtonDC5D,38.917,-77.012
1,1,5D39701,Multiple_Gunshots,2014-01-01,00:00:06,WashingtonDC5D,38.917,-77.002
2,2,5D39702,Multiple_Gunshots,2014-01-01,00:00:07,WashingtonDC5D,38.917,-76.987
3,3,7D119445,Multiple_Gunshots,2014-01-01,00:00:10,WashingtonDC7D,38.823,-77.0
4,4,1D55993,Multiple_Gunshots,2014-01-01,00:00:10,WashingtonDC1D,38.893,-76.993


In [82]:
shots_train.shape

(28343, 8)

In [83]:
shots_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 28343 entries, 0 to 28342
Data columns (total 8 columns):
Unnamed: 0    28343 non-null int64
ID            28339 non-null object
Type          28343 non-null object
Date          28343 non-null object
Time          28343 non-null object
Source        28343 non-null object
Latitude      28343 non-null float64
Longitude     28343 non-null float64
dtypes: float64(2), int64(1), object(5)
memory usage: 1.7+ MB


In [84]:
shots_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 28343 entries, 0 to 28342
Data columns (total 8 columns):
Unnamed: 0    28343 non-null int64
ID            28339 non-null object
Type          28343 non-null object
Date          28343 non-null object
Time          28343 non-null object
Source        28343 non-null object
Latitude      28343 non-null float64
Longitude     28343 non-null float64
dtypes: float64(2), int64(1), object(5)
memory usage: 1.7+ MB


In [85]:
def shot_spot_preprocess(df):
    shots_train.set_index(['ID'], inplace=True)
    shots_train.drop(['Unnamed: 0'], axis=1, inplace=True)
    shots_train.Source = shots_train.Source.apply(lambda DC: DC.replace('WashingtonDC', ''))
    shots_train.Date = pd.to_datetime(shots_train.Date)
    shots_train.Time = pd.to_datetime(shots_train.Time)
    return df

shots_spot = shot_spot_preprocess(shots_train)

In [None]:
def crimespot_preprocess(df):
    # Removing unused or redundent information
    csr_train.drop(['Unnamed: 0', 'INSPECTORNAME', 'CITY', 'STATE', 'X', 'Y'], axis=1, inplace=True)
    csr_train.columns = map(str.lower, csr_train.columns) # Easier to work with lowercase columns
    csr_train.serviceduedate.fillna('0', inplace=True)
    csr_train.resolutiondate.fillna('0', inplace=True)
    csr_train.inspectiondate.fillna('0', inplace=True)
    timestamp = ['adddate', 'resolutiondate', 'serviceduedate', 'serviceorderdate', 'inspectiondate']
    for x in timestamp:
        csr_train[x] = csr_train[x].map(lambda x: x.strip('Z').replace('T', ' '))
    csr_train.zipcode.dropna(inplace=True) # only 16 null values.
    csr_train.zipcode = csr_train.zipcode.astype(int)
    csr_train.ward.dropna(inplace=True) # missing 6221. Maybe we can put in the Ward based on Zipcode
    csr_train.ward = csr_train.ward.map(lambda x: x.strip('Ward'))
    return df

crimespot_preprocess(csr_train)

In [101]:
# Unnamed clearly the Original index column.  Delete that.
# X and Y seems to be closely related to 
# Is service code description the same as service type, just a little more information?
# Object ID can be set as the index.
csr_train.iloc[:5, :6]

Unnamed: 0,objectid,servicecode,servicecodedescription,servicetypecodedescription,organizationacronym,servicecallcount
0,463232,S0011,Alley Cleaning,Street Cleaning,DPW,1
1,463233,S0321,Recycling Collection - Missed,Recycling,DPW,1
2,463234,S0031,Bulk Collection,Bulk Collection,DPW,1
3,463235,S0311,Rat Abatement,DOH,DOH,1
4,463236,S0276,Parking Meter Repair,TOA,DDOT,1


In [102]:
# Service Type, Important to know the different values there.
# Organization to know who is on the task.
# Service call count to see how many times a call is needed.
# Everything just needs to be set to datetime.
csr_train.iloc[:5, 6:12]

Unnamed: 0,adddate,resolutiondate,serviceduedate,serviceorderdate,inspectionflag,inspectiondate
0,2014-01-02 13:27:40.000,2014-01-15 07:43:42.000,2014-02-18 13:27:40.000,2014-01-02 13:27:40.000,N,
1,2014-01-02 13:46:57.000,2014-01-06 12:39:39.000,2014-01-06 13:46:57.000,2014-01-02 13:46:57.000,N,2014-01-06T12:39:00.000Z
2,2014-01-02 13:57:46.000,2014-01-14 14:29:16.000,2014-01-23 13:57:46.000,2014-01-02 13:57:46.000,N,
3,2014-01-02 13:43:20.000,0,2014-02-24 13:43:20.000,2014-01-02 13:43:20.000,N,
4,2014-01-02 16:00:59.000,2014-01-07 16:33:48.000,2014-01-09 16:00:59.000,2014-01-02 16:00:59.000,N,


In [103]:
# Service order data good.  Needs to be broken down into Datetime.
# Inspection Flag, what does that mean?
# Remove Inspector Name
# What does Status code contain? A lot of NaN could be bad
# Service Order Status, Important maybe?
csr_train.iloc[:5, 12:18]

Unnamed: 0,serviceorderstatus,status_code,servicerequestid,priority,streetaddress,xcoord
0,CLOSED,,14-00000654,STANDARD,2301 BENNING ROAD NE,402365.36
1,CLOSED,,14-00000686,STANDARD,1004 RHODE ISLAND AVENUE NE,400701.99
2,CLOSED,,14-00000707,STANDARD,2333 FAIRLAWN AVENUE SE,402526.12
3,OPEN,,14-00000677,STANDARD,720 VARNUM STREET NW,398034.18
4,CLOSED,,14-00000877,STANDARD,700 - 799 BLOCK OF 22ND STREET NW,395763.56


In [174]:
csr_train.ward.isnull().sum()

6221

In [175]:
csr_train.ward.dropna(inplace=True)

In [176]:
csr_train.ward.astype('int32')

ValueError: invalid literal for int() with base 10: '4.0'

In [172]:
csr_train.ward = csr_train['ward'].str.strip() 

In [173]:
csr_train.ward.value_counts()

2.0    172013
6.0    154744
4.0    119513
5.0    116919
1.0     98617
7.0     93001
3.0     83809
2       82924
8.0     63931
6       60159
4       39432
5       36207
3       30929
1       29378
7       26437
8       16999
Name: ward, dtype: int64

In [135]:
# Service Request ID seems unimportant
# What is the XCOORD and YCOORD?
# Street Address can help us find our quandrants. Do we also want the address or is Latitude and Longitude
# Priority - how many unique values are in there?
csr_train.iloc[:5, 18:24]

Unnamed: 0,ycoord,latitude,longitude,zipcode,maraddressrepositoryid,ward
0,136678.02,38.89795,-76.972732,20002.0,48983.0,7
1,139442.62,38.922857,-76.991905,20018.0,76304.0,5
2,134101.82,38.874742,-76.970889,20020.0,286919.0,7
3,141657.91,38.942811,-77.022676,20011.0,249794.0,4
4,136790.11,38.898952,-77.048838,20052.0,,2


In [105]:
# Remove City and State
# clean up the Ward to just numbers
# Fix Zipcode to be int.
# Longitude is the same as the X column
# Details can be vectorized.
# What is MARADDRESSREPOSITORYID?
csr_train.iloc[:5, 24:31]

Unnamed: 0,details
0,There is some dumping in the rear of this addr...
1,Has not been collected the past 4 weeks.
2,"1 television, 2 vacuums, 1 boom box,"
3,requesting ratb abatement
4,Broken Parking Meter


##### Teams, two datasets have different digits in the Lat's and Long's. Will it be matters? 

In [5]:
csr_train[['LATITUDE', 'LONGITUDE']].head()

Unnamed: 0,LATITUDE,LONGITUDE
0,38.89795,-76.972732
1,38.922857,-76.991905
2,38.874742,-76.970889
3,38.942811,-77.022676
4,38.898952,-77.048838


In [6]:
shots_train[['Latitude', 'Longitude']].head()

Unnamed: 0,Latitude,Longitude
0,38.917,-77.012
1,38.917,-77.002
2,38.917,-76.987
3,38.823,-77.0
4,38.893,-76.993


In [47]:
shots_train.Type.value_counts()

Multiple_Gunshots         15858
Single_Gunshot            10034
Gunshot_or_Firecracker     2451
Name: Type, dtype: int64