## 1.1. Load and Display the Data

In [11]:
# Import libraries necessary for this project
import numpy as np
import pandas as pd
from time import time

# Import supplementary visualization code visuals.py
import visuals as vs

# Pretty display for notebooks
%matplotlib inline

In [12]:
# Load the 911 incident data
incident_data_raw = pd.read_csv("incident_data.csv")

incident_data_raw.shape

(1349619, 24)

## 1.2. Clean the Data

In [13]:
# Truncate the data to cover the incidents in the last three years from 2016-09-01 up to 2019-09-01
incident_data=incident_data_raw[(incident_data_raw['Response_Date'] >= '2016-09-01 00:00:00.000') & (incident_data_raw['Response_Date'] <= '2019-09-01 00:00:00.000')]
incident_data.reset_index(drop = True, inplace=True)
incident_data.shape

(1098114, 24)
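The filter above compares timestamp strings directly, which works because the `YYYY-MM-DD HH:MM:SS.000` format sorts lexicographically in chronological order. As a hedged alternative, the same window can be expressed on parsed timestamps. The sketch below uses a small toy frame whose rows are invented for illustration; only the column name `Response_Date` comes from the notebook.

```python
import pandas as pd

# Toy rows, invented for illustration; only the column name matches the notebook
toy = pd.DataFrame({
    "Response_Date": [
        "2016-08-31 23:59:59.000",
        "2016-09-01 00:00:00.000",
        "2018-03-15 12:30:00.000",
        "2019-09-01 00:00:01.000",
    ]
})

# Parse the strings once so that comparisons are made on timestamps, not text
response_time = pd.to_datetime(toy["Response_Date"], format="%Y-%m-%d %H:%M:%S.%f")

start = pd.Timestamp("2016-09-01")
end = pd.Timestamp("2019-09-01")

# Series.between is inclusive on both ends by default, like >= and <= above
in_window = toy[response_time.between(start, end)].reset_index(drop=True)
print(in_window)
```

Parsing first also makes later date arithmetic possible without re-reading the strings.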

In [14]:
# Keep only the medical incidents in the data set where the 'Call_Category' is Emergency Medical Response, Urgent Medical Response or Non-Emergency Medical Response
incident_data_medical=incident_data[incident_data['Call_Category'].str.contains("Emergency Medical Response|Urgent Medical Response|Non-Emergency Medical Response")]
incident_data_medical.reset_index(drop=True,inplace=True)
incident_data_medical.shape

(963040, 24)
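Note that `str.contains` treats its argument as a regular expression and matches substrings, so the pattern "Emergency Medical Response" on its own already matches "Non-Emergency Medical Response". If the category values are exactly the three strings listed in the comment (an assumption), an exact-match filter with `isin` is a possible alternative. The toy rows below, including the "Fire Response" value, are invented for illustration.

```python
import pandas as pd

MEDICAL_CATEGORIES = [
    "Emergency Medical Response",
    "Urgent Medical Response",
    "Non-Emergency Medical Response",
]

# Toy rows, invented for illustration; "Fire Response" is a hypothetical value
toy = pd.DataFrame({"Call_Category": [
    "Emergency Medical Response",
    "Non-Emergency Medical Response",
    "Fire Response",
    None,
]})

# isin matches whole values only and yields False for missing entries
medical = toy[toy["Call_Category"].isin(MEDICAL_CATEGORIES)].reset_index(drop=True)
print(medical)
```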

In [15]:
# Remove the data if the Master_Incident_Number does not start with FS. These incidents occurred in other cities, not in San Diego.
incident_data_sd=incident_data_medical[incident_data_medical['Master_Incident_Number'].str.startswith('FS')]
incident_data_sd.reset_index(drop=True, inplace=True)
incident_data_sd.shape

(961784, 24)

In [16]:
# Remove the 74 rows that contain no discernible coordinates for the incident location

incidents_with_locations=incident_data_sd[(incident_data_sd['Latitude_Decimal']>20) & (incident_data_sd['Latitude_Decimal']<40) 
                                          & (incident_data_sd['Longitude_Decimal']>-130) & (incident_data_sd['Longitude_Decimal']<-110)] 
                                          
incidents_with_locations.reset_index(drop=True, inplace=True)
incidents_with_locations.shape                                         

(961710, 24)

## 1.3. Feature Set Description

In [17]:
# Display the feature names in the data
incidents_with_locations.columns

Index(['ID', 'Master_Incident_Number', 'Response_Date', 'Jurisdiction',
       'Incident_Type', 'Problem', 'Priority_Description', 'Call_Category',
       'Transport_Mode', 'Location_Name', 'Address', 'Apartment',
       'Postal_Code', 'Longitude', 'Latitude', 'Longitude_Decimal',
       'Latitude_Decimal', 'Cross_Street', 'MethodOfCallRcvd',
       'Time_First_Unit_Assigned', 'Time_First_Unit_Enroute',
       'Time_First_Unit_Arrived', 'Call_Disposition', 'TimeFirstStaged'],
      dtype='object')

In [18]:
# Find the values and their counts for features ["Incident_Type","Problem","Priority_Description","Call_Category","Transport_Mode"]
features=["Incident_Type","Problem","Priority_Description","Call_Category","Transport_Mode"]
for feature in features:
    print(incidents_with_locations[feature].value_counts())

Medical Aid 1                    342229
1a Medical Aid 1a                318629
Traffic Accidents                 77255
Medical Aid 4                     55552
Medical Aid 3                     50052
4b Medical Aid 4b                 46432
3a Medical Aid 3a                 33501
Cardiac Arrest                    10538
Medical Alert Alarm                5374
1c Medical Aid 1c                  5197
2c Medical Aid 2c                  4105
Gaslamp                            4033
Medical Aid O                      3884
4c Medical Aid 4c                  1898
2a Medical Aid 2a                  1449
Traffic Accident Freeway (NC)       762
4a Medical Aid 4a                   444
2b Medical Aid 2b                   180
Vehicle Rescue                       72
3c Medical Aid 3c                    61
1b Medical Aid 1b                    26
3b Medical Aid 3b                    11
Vehicle vs. Structure                 8
Heavy Rescue                          7
Water Rescue 1 & 0                    5


As shown above, the incident data has 24 columns. The following features are available in the data set:

**1. ID:** Ordinal categorical. Integer. A 7-digit ID that increases in the chronological order of incidents. The ID is unique to an incident, not to a row: if more than one ambulance is sent to the scene, the same ID appears in multiple rows.

**2. Master_Incident_Number:** Ordinal categorical. Object. The format is "FS" + a 2-digit year + a 6-digit incident number that starts from 000001 at the start of each year and increases in the chronological order of incidents. If more than one ambulance is sent to the scene, the same incident number might appear in multiple rows.

**3. Response_Date:** Ordinal categorical. Object. Date and time at which the 911 call is received. The format is YYYY-MM-DD HH:MM:SS.000.

**4. Jurisdiction:** Nominal categorical. Object. This feature shows the jurisdiction of the incident. This feature only contains the string "San Diego" for this data set.  

**5. Incident_Type:** Nominal categorical. Object. This feature shows 28 different incident types as shown above. Medical Aid 1 denotes the most acute incidents and Medical Aid 4 the least acute.

**6. Problem:** Nominal categorical. Object. This feature shows 218 different problems associated with the incidents as shown above. The most frequent problem is described as "Sick Person (Specific Dx)(L1)". The suffix L1 indicates the most acute version of a problem, while L4 represents the least acute version.

**7. Priority_Description:** Nominal categorical. Object. This feature shows 27 priority descriptions for the incidents as shown above. The lower numbers in the descriptions represent more acute incidents.

**8. Call_Category:** Nominal categorical. Object. This feature shows 3 categories for the medical 911 calls. From most frequent to least frequent: Emergency Medical Response, Non-Emergency Medical Response, Urgent Medical Response.

**9. Transport_Mode:** Nominal categorical. Object. This feature shows categories for the transport mode of incidents such as: 50-Non Emergency, 40-BLS Status Transport, 30-IV/No Medication, 20-IV/Medication, 10-Acute/Medical Trauma, MUTUAL AID TRANSPORT. The lower numbers represent more acute incidents. 

**10. Location_Name:** Nominal categorical. Object. This feature shows the description of the location of the incident such as hotel, school, residence, etc.  

**11. Address:** Nominal categorical. Object. This feature shows the address of the incident.

**12. Apartment:** Nominal categorical. Object. This feature shows the apartment number of the incident.

**13. Postal_Code:** Nominal categorical. Object in the raw data, as reported by `info()`. This feature shows the ZIP code of the incident.

**14. Longitude:** Numeric. Float. This feature shows the longitude of the incident as a 9-digit number.

**15. Latitude:** Numeric. Float. This feature shows the latitude of the incident as an 8-digit number.

**16. Longitude_Decimal:** Numeric. Float. This feature shows the longitude of the incident as a number with 6 decimals.

**17. Latitude_Decimal:**  Numeric. Float. This feature shows the latitude of the incident as a number with 6 decimals.

**18. Cross_Street:** Nominal categorical. Object. This feature shows the cross street of where the incident occurred.

**19. MethodOfCallRcvd:** Nominal categorical. Object. This feature shows the method by which the 911 call was received, such as through a cell phone, alarm company, etc.

**20. Time_First_Unit_Assigned:** Ordinal categorical. Object. Date and time at which the first unit is assigned to the incident. The format is YYYY-MM-DD HH:MM:SS.000.

**21. Time_First_Unit_Enroute:** Ordinal categorical. Object. Date and time at which the first unit is en route to the incident scene. The format is YYYY-MM-DD HH:MM:SS.000.

**22. Time_First_Unit_Arrived:** Ordinal categorical. Object. Date and time at which the first unit arrives at the incident scene. The format is YYYY-MM-DD HH:MM:SS.000.

**23. Call_Disposition:** Nominal categorical. Object. The disposition method of the 911 call.

**24. TimeFirstStaged:** Ordinal categorical. Object. Date and time at which the ambulances are first staged while police secure a potentially dangerous incident scene. Null if not staged. The format is YYYY-MM-DD HH:MM:SS.000.
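The `Master_Incident_Number` format described in item 2 can be checked and split with a short helper. This is a minimal sketch, not part of the original notebook: the function name `parse_incident_number` and the sample values are hypothetical, and the code assumes that every 2-digit year belongs to the 2000s.

```python
import re

# "FS" + 2-digit year + 6-digit sequence number, per the feature description
INCIDENT_NUMBER = re.compile(r"FS(\d{2})(\d{6})")

def parse_incident_number(value):
    """Return (year, sequence) for an FS-prefixed incident number, else None."""
    match = INCIDENT_NUMBER.fullmatch(value)
    if match is None:
        return None
    year = 2000 + int(match.group(1))  # assumption: all years fall in 20xx
    sequence = int(match.group(2))
    return year, sequence

# Hypothetical sample values, invented for illustration
print(parse_incident_number("FS17000042"))
print(parse_incident_number("XX17000042"))
```

A helper like this could also flag rows whose incident number does not follow the documented pattern.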

In [1]:
# Explore the basic statistical characteristics of all features
#incidents_with_locations.describe(include=['object','int','float'])
# Commented out so that the location data of incidents is not displayed

In [20]:
# Explore the data types and the number of non-null observations in the columns of the dataset
incidents_with_locations.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 961710 entries, 0 to 961709
Data columns (total 24 columns):
ID                          961710 non-null int64
Master_Incident_Number      961710 non-null object
Response_Date               961710 non-null object
Jurisdiction                961710 non-null object
Incident_Type               961710 non-null object
Problem                     961710 non-null object
Priority_Description        961710 non-null object
Call_Category               961710 non-null object
Transport_Mode              296480 non-null object
Location_Name               573724 non-null object
Address                     961708 non-null object
Apartment                   280026 non-null object
Postal_Code                 960311 non-null object
Longitude                   961710 non-null float64
Latitude                    961710 non-null float64
Longitude_Decimal           961710 non-null float64
Latitude_Decimal            961710 non-null float64
Cross_Street       

## 1.4. Construct a Dataframe for Each Unique Incident ID

In [21]:
# Group the data frame in terms of incident ID and take the latitude, longitude, postal code, response date and problem type features for each incident 
unique_incident_location=incidents_with_locations.groupby('ID', as_index=False)[['Latitude_Decimal','Longitude_Decimal','Postal_Code','Response_Date','Problem']].first()
unique_incident_location.shape

(420031, 6)
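One detail of `groupby(...).first()` is worth keeping in mind here: by default it returns the first non-null value of each column within a group, not the values of the first row. The toy frame below, whose rows are invented for illustration, shows an incident whose first row has no postal code but whose second row does.

```python
import numpy as np
import pandas as pd

# Toy rows, invented for illustration; ID 1 appears twice, as with multiple ambulances
toy = pd.DataFrame({
    "ID": [1, 1, 2],
    "Postal_Code": [np.nan, 92101.0, 92037.0],
    "Problem": ["Fall", "Fall", "Chest Pain"],
})

# first() skips nulls by default, so ID 1 picks up the postal code from its second row
per_incident = toy.groupby("ID", as_index=False)[["Postal_Code", "Problem"]].first()
print(per_incident)
```

With `as_index=False`, the grouping key stays as a regular column, which is why the result has one more column than the list of selected features.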

While all 420031 data entries have latitude and longitude data available, not all of them have a postal code. We will use the Google API to reverse-geocode the missing postal codes from the latitude and longitude data.

In [22]:
# Define a function which determines if a number is nan
def isNaN(num):
    return num != num

In [27]:
# Using the Google API and the latitude and longitude of the incident, reverse-geocode the postal code for each incident with no postal code
import os
from pygeocoder import Geocoder
# Read the API key from the environment instead of hard-coding it in the notebook
# (the variable name GOOGLE_API_KEY is an assumption)
API_KEY=os.environ['GOOGLE_API_KEY']
sd_geocoder = Geocoder(api_key=API_KEY)

for i, zipcode in enumerate(unique_incident_location['Postal_Code'].values):
    if isNaN(zipcode):
        lat, lon=unique_incident_location[['Latitude_Decimal','Longitude_Decimal']].iloc[i].values
        # Assign with .loc to avoid chained indexing, which may write to a temporary copy
        unique_incident_location.loc[i, 'Postal_Code']=sd_geocoder.reverse_geocode(lat, lon).postal_code
    if i % 100000==0: print("{} rows are done!".format(i))
unique_incident_location.astype({'Postal_Code': 'float64'}).dtypes

0 rows are done!
100000 rows are done!
200000 rows are done!
300000 rows are done!
400000 rows are done!


ID                     int64
Latitude_Decimal     float64
Longitude_Decimal    float64
Postal_Code          float64
Response_Date         object
Problem               object
dtype: object

In [26]:
# Check if there are any incidents with null postal code
if unique_incident_location['Postal_Code'][unique_incident_location['Postal_Code'].isnull()].shape[0]==0:
    print("No null postal codes are left in the data!")
else:     
    print("There are still null postal codes in the data!")

No null postal codes are left in the data!
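The fill step above can be separated from the geocoding service itself. The following is a minimal sketch under stated assumptions: `fill_missing_postal_codes` and `fake_lookup` are hypothetical names that do not appear in the notebook, `fake_lookup` stands in for the real reverse-geocoding call, and the toy coordinates and postal codes are invented. The sketch caches results so that each distinct coordinate pair is looked up only once.

```python
import numpy as np
import pandas as pd

def fill_missing_postal_codes(frame, lookup):
    """Return a copy of frame with null Postal_Code values filled by lookup(lat, lon)."""
    filled = frame.copy()
    missing = filled["Postal_Code"].isna()
    cache = {}  # (lat, lon) -> postal code, to avoid repeated lookups
    for idx in filled.index[missing]:
        key = (filled.at[idx, "Latitude_Decimal"], filled.at[idx, "Longitude_Decimal"])
        if key not in cache:
            cache[key] = lookup(*key)
        filled.at[idx, "Postal_Code"] = cache[key]
    return filled

calls = []  # records every lookup so the caching can be observed

def fake_lookup(lat, lon):
    # Hypothetical stand-in for a reverse-geocoding request
    calls.append((lat, lon))
    return 92101.0

# Toy rows, invented for illustration
toy = pd.DataFrame({
    "Latitude_Decimal": [32.715, 32.715, 32.84],
    "Longitude_Decimal": [-117.161, -117.161, -117.27],
    "Postal_Code": [np.nan, np.nan, 92037.0],
})

result = fill_missing_postal_codes(toy, fake_lookup)
print(result)
print("lookups made:", len(calls))
```

Passing the lookup as an argument keeps the fill logic testable without network access or an API key.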


## 1.5. Assign A Weather Station to Each Incident

The National Weather Service provides historical daily temperature and precipitation data for 16 weather stations near San Diego: https://www.weather.gov/sgx/cliplot

The latitude and longitude of each weather station are collected in the following dataframe:

In [29]:
# Load the location data for all weather stations near San Diego
weather_station_location=pd.read_csv('weather_stations.csv', header=0)
weather_station_location

| Unnamed: 0 | stations | abbreviations | latitude | longitude |
|---|---|---|---|---|
| 0 | carlsbad | KCRQ | 33.1268 | -117.27583 |
| 1 | campo | KCZZ | 32.62611 | -116.46833 |
| 2 | chino_airport | KCNO | 33.97556 | -117.62361 |
| 3 | corona_airport | KAJO | 33.8977 | -117.6024 |
| 4 | fullerton | KFUL | 33.87194 | -117.98472 |
| 5 | john_wayne_airport | KSNA | 33.6798 | -117.8674 |
| 6 | oceanside_airport | KOKB | 33.21806 | -117.35139 |
| 7 | ontario | KONT | 34.05316 | -117.57685 |
| 8 | palm_springs | KPSP | 33.82219 | -116.50431 |
| 9 | riverside_municipal_airport | KRAL | 33.95299 | -117.43491 |


Given that we know the coordinates of each incident, we can assign to each incident the nearest weather station. 
In this way, we know which weather station's data will be relevant to each incident:

In [31]:
# Assign to each incident the weather station closest to it
import geopy.distance
distance=pd.DataFrame(pd.Series([0.0 for i in range(len(weather_station_location))]), columns=['distance'])
incident_station=pd.DataFrame(pd.Series([0 for i in range(len(unique_incident_location))]), columns=['Assigned_Station'])
for j, (lat1, lon1)  in enumerate(unique_incident_location[['Latitude_Decimal','Longitude_Decimal']].values):
    for i, (lat2, lon2)  in enumerate(weather_station_location[['latitude', 'longitude']].values):
        # Use .loc rather than chained indexing so the value is written to the frame itself
        distance.loc[i, 'distance']=geopy.distance.distance((lat1, lon1), (lat2,lon2)).miles
    incident_station.loc[j, 'Assigned_Station']=distance['distance'].idxmin()
    if j % 20000==0:
        print("{} points done".format(j))
incident_station['Assigned_Station'].value_counts()

0 points done
20000 points done
40000 points done
60000 points done
80000 points done
100000 points done
120000 points done
140000 points done
160000 points done
180000 points done
200000 points done
220000 points done
240000 points done
260000 points done
280000 points done
300000 points done
320000 points done
340000 points done
360000 points done
380000 points done
400000 points done
420000 points done


12    217975
14    142253
13     43462
11     14082
0       2247
1          9
15         3
Name: Assigned_Station, dtype: int64
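The nested loop above computes one geodesic distance at a time. As a hedged alternative, the nearest station can be found for all incidents at once with NumPy broadcasting. Note that this sketch swaps in the haversine (spherical) formula rather than the ellipsoidal geodesic used by `geopy.distance.distance`, so distances can differ slightly; for picking the closest of a few well-separated stations the choice rarely matters. The function name `nearest_station` is hypothetical, the two station coordinates are taken from the table above, and the incident points are invented.

```python
import numpy as np

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius; spherical approximation

def nearest_station(incident_coords, station_coords):
    """Return, for each incident, the row index of the closest station.

    Both arguments are sequences of (latitude, longitude) in decimal degrees.
    """
    inc = np.radians(np.asarray(incident_coords, dtype=float))[:, None, :]  # (n, 1, 2)
    sta = np.radians(np.asarray(station_coords, dtype=float))[None, :, :]   # (1, m, 2)
    dlat = sta[..., 0] - inc[..., 0]
    dlon = sta[..., 1] - inc[..., 1]
    # Haversine formula, evaluated for every incident/station pair at once
    a = np.sin(dlat / 2) ** 2 + np.cos(inc[..., 0]) * np.cos(sta[..., 0]) * np.sin(dlon / 2) ** 2
    distance = 2 * EARTH_RADIUS_MILES * np.arcsin(np.sqrt(a))  # shape (n, m), in miles
    return distance.argmin(axis=1)

# carlsbad and campo, from the weather station table
stations = [(33.1268, -117.27583), (32.62611, -116.46833)]
# Incident points invented for illustration
incidents = [(33.12, -117.30), (32.60, -116.50)]

print(nearest_station(incidents, stations))
```

The returned indices play the same role as the `Assigned_Station` values produced by `idxmin()` in the loop.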

In [34]:
# Add the assigned station for each incident to the dataframe 'unique_incident_location'
incidents=unique_incident_location.join(pd.DataFrame({'assigned_station':incident_station['Assigned_Station']})) 
incidents.columns

Index(['ID', 'Latitude_Decimal', 'Longitude_Decimal', 'Postal_Code',
       'Response_Date', 'Problem', 'assigned_station'],
      dtype='object')

In [33]:
# Save the dataframe into a csv file
incidents.to_csv('incidents.csv', index=False)