<a href="https://colab.research.google.com/github/kkrusere/Developing-a-Score-to-Measure-Riskiness-of-Residential-Properties-Insurance/blob/main/data_collection_prep_and_cleaning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [2]:
from google.colab import drive
drive.mount('/content/drive')

import warnings
warnings.filterwarnings("ignore")

Mounted at /content/drive


## <center> **Developing a Score to Measure Riskiness of Residential Properties Insurance** </center>

<center><em>Developing a score to measure the riskiness of residential buildings, homes, apartments, and condos as part of insurance policy underwriting. Underwriting is how an insurance company evaluates its risk. In this project, we identify and explore multiple data sources to collect variables that could be used to develop a score that measures the riskiness of residential buildings and aids the underwriting process.</em></center>

<center><img src="https://github.com/kkrusere/Developing-a-Score-to-Measure-Riskiness-of-Residential-Properties-Insurance/blob/main/assets/real-estate-risk.jpg?raw=1" width=600/></center>

***Project Contributors:*** Kuzi Rusere and Umair Shaikh<br>
**MVP streamlit App URL:** https://kkrusere-developing-a-score-to-measure-prototype-mvp-app-acxav4.streamlitapp.com





### **Data collection**

This notebook covers data collection, cleaning, and preparation. The first dataset we use comes from NYC OpenData.

<center><img src="https://github.com/kkrusere/Developing-a-Score-to-Measure-Riskiness-of-Residential-Properties-Insurance/blob/main/assets/nycOpenData.png?raw=1" width=600/></center>

NYC OpenData is a registry/repository of public data generated by New York City agencies and other City organizations. It is publicly available for anyone to use, whether to participate in and improve government or to conduct research and analysis that gives a better understanding of the services the City provides. The repository is an initiative to improve the accessibility, transparency, and accountability of City government.


The datasets are available in a variety of machine-readable formats, including through an API. We use the NYC 311 dataset. NYC 311 gives access to non-emergency City services and information about City government programs. The service launched in 2003 with phone calls as the only contact type, and it is now also accessible through text messages, chat, a mobile application, social media, and a website.
<br>
<br>
<br>
<center><img src="https://github.com/kkrusere/Developing-a-Score-to-Measure-Riskiness-of-Residential-Properties-Insurance/blob/main/assets/311_contact_type.png?raw=1" width=600/><figcaption><em>Image from: https://council.nyc.gov/data/311-services/</em></figcaption></center>

<br>
<br>
<br>

We use 311 requests related to incidents and complaints tied to residential areas. So, right from the start, the data is filtered to include only entries whose `Location Type` is residential (for example `Residential` or `RESIDENTIAL BUILDING`), as you will see below.
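
As a minimal sketch of that filter, assuming residential entries can be recognised by the word "residential" appearing anywhere in `Location Type`, a case-insensitive match could look like the following. The helper name `keep_residential` is hypothetical, and the sample rows are made up for illustration.

```python
import pandas as pd

def keep_residential(df: pd.DataFrame, column: str = "Location Type") -> pd.DataFrame:
    """Return only rows whose location type mentions 'residential' (any case)."""
    # na=False treats missing location types as non-matching
    mask = df[column].str.contains("residential", case=False, na=False)
    return df[mask].reset_index(drop=True)

# Made-up rows, only for demonstration
sample = pd.DataFrame({
    "Location Type": ["RESIDENTIAL BUILDING", "Residential", "Street/Sidewalk", None],
    "Complaint Type": ["PAINT - PLASTER", "Graffiti", "Noise", "Other"],
})
residential = keep_residential(sample)
```

Matching on a substring keeps both spellings seen in the data; an exact list of accepted values would be stricter if that is preferred.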

In [None]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
%matplotlib inline


The data that we are working with is huge, so the plan is to use PySpark to read the dataframe/table and then pandas to filter and clean the data.

## **Setting up pyspark:**
On Google Colab, PySpark can be installed with `pip install`.

In [None]:
# Install pyspark
!pip install pyspark

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pyspark
  Downloading pyspark-3.3.0.tar.gz (281.3 MB)
Collecting py4j==0.10.9.5
  Downloading py4j-0.10.9.5-py2.py3-none-any.whl (199 kB)
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.3.0-py2.py3-none-any.whl size=281764026 sha256=8812b925a148a620fffd0f3f6b64f2b0dd373291ed010f3574f0dd0cde3fa4c2
  Stored in directory: /root/.cache/pip/wheels/7a/8e/1b/f73a52650d2e5f337708d9f6a1750d451a7349a867f928b885
Successfully built pyspark
Installing collected packages: py4j, pyspark
Successfully installed py4j-0.10.9.5 pyspark-3.3.0


In [None]:
# # Import SparkSession
# from pyspark.sql import SparkSession
# # Create a Spark Session
# spark = SparkSession.builder.master("local[*]").getOrCreate()
# # Check Spark Session Information
# spark

The original plan was to read the CSV with Spark; that call is left commented out below, and the file is read with pandas instead.

In [None]:
#data_df = spark.read.csv("/content/drive/MyDrive/capstone/311_Service_Requests_from_2010_to_Present.csv", header=True, inferSchema=True)
data_df = pd.read_csv("/content/drive/MyDrive/capstone/311_Service_Requests_from_2010_to_Present.csv")
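
If loading the whole file at once strains memory, pandas can also read it in pieces through the `chunksize` argument of `pd.read_csv`. The sketch below filters each chunk before keeping it; the function name `read_filtered` is hypothetical, and a tiny in-memory CSV stands in for the real file so the example runs anywhere.

```python
import io
import pandas as pd

def read_filtered(source, chunksize=100_000):
    """Read a CSV in chunks, keeping only rows that mention 'residential'."""
    kept = []
    for chunk in pd.read_csv(source, chunksize=chunksize):
        # astype(str) guards against chunks where the column is entirely missing
        mask = chunk["Location Type"].astype(str).str.contains("residential", case=False)
        kept.append(chunk[mask])
    return pd.concat(kept, ignore_index=True)

# Tiny made-up stand-in for the real 311 export
csv_text = "Unique Key,Location Type\n1,RESIDENTIAL BUILDING\n2,Street\n3,Residential\n"
small_df = read_filtered(io.StringIO(csv_text), chunksize=2)
```

Only the filtered rows of each chunk stay in memory, so the peak footprint depends on the chunk size rather than on the full file.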

In [None]:
data_df.head()

Unnamed: 0,Unique Key,Created Date,Closed Date,Agency,Agency Name,Complaint Type,Descriptor,Location Type,Incident Zip,Incident Address,...,Vehicle Type,Taxi Company Borough,Taxi Pick Up Location,Bridge Highway Name,Bridge Highway Direction,Road Ramp,Bridge Highway Segment,Latitude,Longitude,Location
0,25595691,05/23/2013 12:00:00 AM,05/29/2013 12:00:00 AM,HPD,Department of Housing Preservation and Develop...,PAINT - PLASTER,WALLS,RESIDENTIAL BUILDING,11209.0,7207 3 AVENUE,...,,,,,,,,40.63338,-74.026993,"(40.63338019237986, -74.02699256583904)"
1,25595692,05/23/2013 12:00:00 AM,06/03/2013 12:00:00 AM,HPD,Department of Housing Preservation and Develop...,PAINT - PLASTER,CEILING,RESIDENTIAL BUILDING,10457.0,269 EAST BURNSIDE AVENUE,...,,,,,,,,40.851333,-73.902133,"(40.851332558936704, -73.90213313592302)"
2,25595877,05/23/2013 12:20:25 PM,09/06/2013 12:00:00 AM,DSNY,Department of Sanitation,Graffiti,Graffiti,Residential,10472.0,1963 HAVILAND AVENUE,...,,,,,,,,40.829475,-73.858298,"(40.829474814637784, -73.85829772136906)"
3,25595984,05/23/2013 12:00:00 AM,05/31/2013 12:00:00 AM,HPD,Department of Housing Preservation and Develop...,NONCONST,VERMIN,RESIDENTIAL BUILDING,11229.0,1820 AVENUE V,...,,,,,,,,40.597049,-73.952872,"(40.59704910449971, -73.95287153097844)"
4,25596010,05/23/2013 12:00:00 AM,06/08/2013 12:00:00 AM,HPD,Department of Housing Preservation and Develop...,PAINT - PLASTER,WALLS,RESIDENTIAL BUILDING,10467.0,3535 ROCHAMBEAU AVENUE,...,,,,,,,,40.882408,-73.879058,"(40.88240811038497, -73.87905847713522)"


In [None]:
#let's take a look at the columns that we have in our dataset
list(data_df.columns)

['Unique Key',
 'Created Date',
 'Closed Date',
 'Agency',
 'Agency Name',
 'Complaint Type',
 'Descriptor',
 'Location Type',
 'Incident Zip',
 'Incident Address',
 'Street Name',
 'Cross Street 1',
 'Cross Street 2',
 'Intersection Street 1',
 'Intersection Street 2',
 'Address Type',
 'City',
 'Landmark',
 'Facility Type',
 'Status',
 'Due Date',
 'Resolution Description',
 'Resolution Action Updated Date',
 'Community Board',
 'BBL',
 'Borough',
 'X Coordinate (State Plane)',
 'Y Coordinate (State Plane)',
 'Open Data Channel Type',
 'Park Facility Name',
 'Park Borough',
 'Vehicle Type',
 'Taxi Company Borough',
 'Taxi Pick Up Location',
 'Bridge Highway Name',
 'Bridge Highway Direction',
 'Road Ramp',
 'Bridge Highway Segment',
 'Latitude',
 'Longitude',
 'Location']

In [None]:
#we remove the columns/features of our dataset that are not going to be of any use for this project 
data_df = data_df[[
    'Unique Key',
    'Created Date',
    'Agency',
    'Agency Name',
    'Complaint Type',
    'Descriptor',
    'Location Type',
    'Incident Zip',
    'Incident Address',
    'Street Name',
    'Address Type',
    'City',
    'Resolution Description',
    'Borough',
    'Latitude',
    'Longitude',]]

In [None]:
list(data_df.columns)

['Unique Key',
 'Created Date',
 'Agency',
 'Agency Name',
 'Complaint Type',
 'Descriptor',
 'Location Type',
 'Incident Zip',
 'Incident Address',
 'Street Name',
 'Address Type',
 'City',
 'Resolution Description',
 'Borough',
 'Latitude',
 'Longitude']

### **Data cleaning and preparation**

In [None]:
data_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6289602 entries, 0 to 6289601
Data columns (total 16 columns):
 #   Column                  Dtype  
---  ------                  -----  
 0   Unique Key              int64  
 1   Created Date            object 
 2   Agency                  object 
 3   Agency Name             object 
 4   Complaint Type          object 
 5   Descriptor              object 
 6   Location Type           object 
 7   Incident Zip            float64
 8   Incident Address        object 
 9   Street Name             object 
 10  Address Type            object 
 11  City                    object 
 12  Resolution Description  object 
 13  Borough                 object 
 14  Latitude                float64
 15  Longitude               float64
dtypes: float64(3), int64(1), object(12)
memory usage: 767.8+ MB


In [None]:
#we drop every row that contains a NaN; the dataset is large enough that we can afford to lose those rows
data_df.dropna(axis = 0, how ='any', inplace=True)
data_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6048735 entries, 0 to 6289601
Data columns (total 16 columns):
 #   Column                  Dtype  
---  ------                  -----  
 0   Unique Key              int64  
 1   Created Date            object 
 2   Agency                  object 
 3   Agency Name             object 
 4   Complaint Type          object 
 5   Descriptor              object 
 6   Location Type           object 
 7   Incident Zip            float64
 8   Incident Address        object 
 9   Street Name             object 
 10  Address Type            object 
 11  City                    object 
 12  Resolution Description  object 
 13  Borough                 object 
 14  Latitude                float64
 15  Longitude               float64
dtypes: float64(3), int64(1), object(12)
memory usage: 784.5+ MB
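
To see which columns account for the dropped rows, the per-column missing counts can be inspected before calling `dropna`. A small sketch on made-up rows (the column names come from the dataset, the values do not):

```python
import numpy as np
import pandas as pd

# Made-up rows, only for demonstration
sample = pd.DataFrame({
    "Incident Zip": [11209.0, np.nan, 10472.0],
    "City": ["BROOKLYN", "BRONX", None],
    "Borough": ["BROOKLYN", "BRONX", "BRONX"],
})

# Count missing values per column, largest first
missing_per_column = sample.isna().sum().sort_values(ascending=False)

# Drop any row with a missing value, without mutating the original frame
cleaned = sample.dropna(axis=0, how="any")
rows_dropped = len(sample) - len(cleaned)
```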


In [None]:
#now we change the datatypes of `Unique Key`, `Created Date`, and `Incident Zip`
data_df['Unique Key'] = data_df['Unique Key'].astype(object)
#note: `Incident Zip` is a float, so the resulting strings keep a trailing ".0" (e.g. "11209.0")
data_df['Incident Zip'] = data_df['Incident Zip'].astype(str)
#we keep only the date part of `Created Date` (the text before the first space)
data_df['Date'] = data_df['Created Date'].str.split(" ").str[0]
#we drop the `Created Date` column
data_df.drop('Created Date', axis=1, inplace=True)
#now we change the datatype of `Date` to the datetime format
data_df['Date'] = pd.to_datetime(data_df['Date'], format='%m/%d/%Y')
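
Because `Incident Zip` is read as a float, `astype(str)` yields strings such as `11209.0`. If plain five-digit ZIP strings are wanted, one possible conversion is to go through a nullable integer first. This is a sketch of an alternative, not what the notebook does, and the helper name `zip_to_str` is hypothetical.

```python
import pandas as pd

def zip_to_str(zips: pd.Series) -> pd.Series:
    """Convert float ZIP codes (e.g. 11209.0) to zero-padded 5-character strings."""
    as_int = zips.astype("Int64")            # nullable integer dtype
    return as_int.astype("string").str.zfill(5)

# Made-up values; the last one shows the zero padding
sample = pd.Series([11209.0, 10457.0, 6390.0])
zip_strings = zip_to_str(sample)
```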


In [None]:
data_df = data_df[[
    'Unique Key',
    'Date',
    'Agency',
    'Agency Name',
    'Complaint Type',
    'Descriptor',
    'Location Type',
    'Incident Zip',
    'Incident Address',
    'Street Name',
    'Address Type',
    'City',
    'Resolution Description',
    'Borough',
    'Latitude',
    'Longitude',]]

In [None]:
data_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6048735 entries, 0 to 6289601
Data columns (total 16 columns):
 #   Column                  Dtype         
---  ------                  -----         
 0   Unique Key              object        
 1   Date                    datetime64[ns]
 2   Agency                  object        
 3   Agency Name             object        
 4   Complaint Type          object        
 5   Descriptor              object        
 6   Location Type           object        
 7   Incident Zip            object        
 8   Incident Address        object        
 9   Street Name             object        
 10  Address Type            object        
 11  City                    object        
 12  Resolution Description  object        
 13  Borough                 object        
 14  Latitude                float64       
 15  Longitude               float64       
dtypes: datetime64[ns](1), float64(2), object(13)
memory usage: 784.5+ MB


In [None]:
data_df.head()

Unnamed: 0,Unique Key,Date,Agency,Agency Name,Complaint Type,Descriptor,Location Type,Incident Zip,Incident Address,Street Name,Address Type,City,Resolution Description,Borough,Latitude,Longitude
0,25595691,2013-05-23,HPD,Department of Housing Preservation and Develop...,PAINT - PLASTER,WALLS,RESIDENTIAL BUILDING,11209.0,7207 3 AVENUE,3 AVENUE,ADDRESS,BROOKLYN,The Department of Housing Preservation and Dev...,BROOKLYN,40.63338,-74.026993
1,25595692,2013-05-23,HPD,Department of Housing Preservation and Develop...,PAINT - PLASTER,CEILING,RESIDENTIAL BUILDING,10457.0,269 EAST BURNSIDE AVENUE,EAST BURNSIDE AVENUE,ADDRESS,BRONX,The Department of Housing Preservation and Dev...,BRONX,40.851333,-73.902133
2,25595877,2013-05-23,DSNY,Department of Sanitation,Graffiti,Graffiti,Residential,10472.0,1963 HAVILAND AVENUE,HAVILAND AVENUE,ADDRESS,BRONX,The City has removed the graffiti from this pr...,BRONX,40.829475,-73.858298
3,25595984,2013-05-23,HPD,Department of Housing Preservation and Develop...,NONCONST,VERMIN,RESIDENTIAL BUILDING,11229.0,1820 AVENUE V,AVENUE V,ADDRESS,BROOKLYN,The Department of Housing Preservation and Dev...,BROOKLYN,40.597049,-73.952872
4,25596010,2013-05-23,HPD,Department of Housing Preservation and Develop...,PAINT - PLASTER,WALLS,RESIDENTIAL BUILDING,10467.0,3535 ROCHAMBEAU AVENUE,ROCHAMBEAU AVENUE,ADDRESS,BRONX,The Department of Housing Preservation and Dev...,BRONX,40.882408,-73.879058


In [None]:
#we install the Python SQL Toolkit and Object Relational Mapper and the python MySQL connector
!pip install SQLAlchemy
!pip install mysql-connector-python
!pip install PyMySQL

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting PyMySQL
  Downloading PyMySQL-1.0.2-py3-none-any.whl (43 kB)
Installing collected packages: PyMySQL
Successfully installed PyMySQL-1.0.2


In [None]:
#change directory so that we can access the config.py file 
%cd /content/drive/MyDrive/capstone

/content/drive/MyDrive/capstone


Storing the cleaned dataset in a MySQL database on AWS RDS

In [None]:
import mysql.connector as connection
from sqlalchemy import create_engine
import config #this holds our credentials for the database 

host= config.host
user= config.user
db_password = config.password
port = config.port

#create the connection to the AWS MySQL database
conn = connection.connect(
  host=host,
  user=user,
  password=db_password,
  port = port,
)
mycursor = conn.cursor()

In [None]:
#we create the database to store our 311 dataset
mycursor.execute("CREATE DATABASE IF NOT EXISTS NYC311_db")
database = "NYC311_db"


In [None]:
# create the SQLAlchemy engine (using the configured port) and write our pandas dataframe to an SQL table
engine = create_engine(f"mysql+pymysql://{user}:{db_password}@{host}:{port}/{database}")
data_df.to_sql('NYC311Open_Data', con = engine, if_exists = 'append', index=False)
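
With roughly six million rows, a single `to_sql` call sends a lot of data at once; pandas accepts a `chunksize` argument to write in batches instead. The sketch below uses an in-memory SQLite database purely as a stand-in for the MySQL instance, so that it runs without credentials. The table name matches the one above and the rows are a small made-up sample.

```python
import sqlite3
import pandas as pd

# Small made-up sample standing in for the cleaned 311 dataframe
sample = pd.DataFrame({
    "Unique Key": [25595691, 25595692, 25595877],
    "Borough": ["BROOKLYN", "BRONX", "BRONX"],
})

# In-memory SQLite connection (assumption: stand-in for the MySQL engine)
conn = sqlite3.connect(":memory:")

# Write two rows per batch; if_exists="append" adds to an existing table
sample.to_sql("NYC311Open_Data", con=conn, if_exists="append", index=False, chunksize=2)

row_count = conn.execute('SELECT COUNT(*) FROM "NYC311Open_Data"').fetchone()[0]
```

Note that with `if_exists="append"`, re-running the cell inserts the same rows again, so a row count check like the one above is a cheap way to spot accidental duplicates.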

In [None]:
#we will also save a csv
data_df.to_csv("NYC311Open_Data.csv", index=False)

# NYPD complaints Data

In [None]:
column_Name_Description = {  'RPT_DT': 'Date event was reported to police',
                                  'OFNS_DESC': 'Description of offense corresponding with key code',
                                  'CRM_ATPT_CPTD_CD': 'Indicator of whether crime was successfully completed or attempted, but failed or was interrupted prematurely',
                                  'LAW_CAT_CD': 'Level of offense: felony, misdemeanor, violation',
                                  'BORO_NM': 'The name of the borough in which the incident occurred',
                                  'LOC_OF_OCCUR_DESC': 'Specific location of occurrence in or around the premises; inside, opposite of, front of, rear of',
                                  'PREM_TYP_DESC': 'Specific description of premises; grocery store, residence, street, etc.',
                                  'Latitude': 'Midblock Latitude coordinate for Global Coordinate System, WGS 1984, decimal degrees (EPSG 4326)',
                                  'Longitude': 'Midblock Longitude coordinate for Global Coordinate System, WGS 1984, decimal degrees (EPSG 4326)'}

In [None]:
filter_list = list(column_Name_Description.keys())

In [None]:
data = pd.read_csv("/content/drive/MyDrive/Capstone Project/NYPD_Complaint_Data_Historic.csv")
data.head()

Unnamed: 0,CMPLNT_NUM,CMPLNT_FR_DT,CMPLNT_FR_TM,CMPLNT_TO_DT,CMPLNT_TO_TM,ADDR_PCT_CD,RPT_DT,KY_CD,OFNS_DESC,PD_CD,...,SUSP_SEX,TRANSIT_DISTRICT,Latitude,Longitude,Lat_Lon,PATROL_BORO,STATION_NAME,VIC_AGE_GROUP,VIC_RACE,VIC_SEX
0,280364018,06/09/2018,21:42:00,06/09/2018,21:43:00,10,06/10/2018,361,OFF. AGNST PUB ORD SENSBLTY &,639,...,M,,40.75931,-73.994706,"(40.759310399, -73.994706072)",PATROL BORO MAN SOUTH,,18-24,WHITE HISPANIC,F
1,377132404,08/04/2018,22:15:00,,,44,08/04/2018,344,ASSAULT 3 & RELATED OFFENSES,101,...,M,,40.82617,-73.916831,"(40.826169612, -73.916830709)",PATROL BORO BRONX,,25-44,WHITE HISPANIC,F
2,336011712,11/04/2018,11:15:00,,,103,11/04/2018,106,FELONY ASSAULT,109,...,M,,40.707858,-73.759307,"(40.707858236, -73.759306969)",PATROL BORO QUEENS SOUTH,,25-44,BLACK,M
3,599398393,05/23/2018,23:30:00,05/24/2018,02:00:00,47,05/24/2018,351,CRIMINAL MISCHIEF & RELATED OF,254,...,,,40.882615,-73.851948,"(40.882615325, -73.851947659)",PATROL BORO BRONX,,25-44,ASIAN / PACIFIC ISLANDER,F
4,310389190,11/18/2018,16:00:00,11/18/2018,16:10:00,48,11/18/2018,105,ROBBERY,388,...,M,,40.850357,-73.882989,"(40.85035684, -73.882989431)",PATROL BORO BRONX,,<18,WHITE HISPANIC,M


In [None]:
data = data[filter_list]
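
Selecting the columns after loading works, but `pd.read_csv` can also skip unwanted columns at read time through its `usecols` argument, which lowers memory use for wide files. A sketch with an in-memory CSV standing in for the real file (the row is made up):

```python
import io
import pandas as pd

# A subset of the column names used above
wanted = ["RPT_DT", "OFNS_DESC", "BORO_NM"]

# Made-up one-row CSV with extra columns that should be skipped
csv_text = (
    "CMPLNT_NUM,RPT_DT,OFNS_DESC,BORO_NM,VIC_SEX\n"
    "1,06/10/2018,ROBBERY,MANHATTAN,F\n"
)
subset = pd.read_csv(io.StringIO(csv_text), usecols=wanted)
```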

In [None]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3031883 entries, 0 to 3031882
Data columns (total 9 columns):
 #   Column             Dtype  
---  ------             -----  
 0   RPT_DT             object 
 1   OFNS_DESC          object 
 2   CRM_ATPT_CPTD_CD   object 
 3   LAW_CAT_CD         object 
 4   BORO_NM            object 
 5   LOC_OF_OCCUR_DESC  object 
 6   PREM_TYP_DESC      object 
 7   Latitude           float64
 8   Longitude          float64
dtypes: float64(2), object(7)
memory usage: 208.2+ MB


In [None]:
data.head()

Unnamed: 0,RPT_DT,OFNS_DESC,CRM_ATPT_CPTD_CD,LAW_CAT_CD,BORO_NM,LOC_OF_OCCUR_DESC,PREM_TYP_DESC,Latitude,Longitude
0,06/10/2018,OFF. AGNST PUB ORD SENSBLTY &,COMPLETED,MISDEMEANOR,MANHATTAN,INSIDE,RESIDENCE - APT. HOUSE,40.75931,-73.994706
1,08/04/2018,ASSAULT 3 & RELATED OFFENSES,COMPLETED,MISDEMEANOR,BRONX,INSIDE,RESIDENCE - APT. HOUSE,40.82617,-73.916831
2,11/04/2018,FELONY ASSAULT,COMPLETED,FELONY,QUEENS,INSIDE,RESIDENCE-HOUSE,40.707858,-73.759307
3,05/24/2018,CRIMINAL MISCHIEF & RELATED OF,COMPLETED,MISDEMEANOR,BRONX,INSIDE,RESIDENCE-HOUSE,40.882615,-73.851948
4,11/18/2018,ROBBERY,COMPLETED,FELONY,BRONX,INSIDE,RESIDENCE - APT. HOUSE,40.850357,-73.882989


In [None]:
data.to_csv("/content/drive/MyDrive/Capstone Project/NYPD_Complaint_Data_Historic.csv", index=False)

In [22]:
#converting the latitudes and longitudes to zipcodes
import geopy
import pandas as pd

geolocator = geopy.Nominatim(user_agent='my-application')

data = pd.read_csv("/content/drive/MyDrive/Capstone Project/NYPD_Complaint_Data_Historic.csv")

data['zipcode'] = [0]*len(data)

In [23]:
data.head()

Unnamed: 0,RPT_DT,OFNS_DESC,CRM_ATPT_CPTD_CD,LAW_CAT_CD,BORO_NM,LOC_OF_OCCUR_DESC,PREM_TYP_DESC,Latitude,Longitude,zipcode
0,06/10/2018,OFF. AGNST PUB ORD SENSBLTY &,COMPLETED,MISDEMEANOR,MANHATTAN,INSIDE,RESIDENCE - APT. HOUSE,40.75931,-73.994706,0
1,08/04/2018,ASSAULT 3 & RELATED OFFENSES,COMPLETED,MISDEMEANOR,BRONX,INSIDE,RESIDENCE - APT. HOUSE,40.82617,-73.916831,0
2,11/04/2018,FELONY ASSAULT,COMPLETED,FELONY,QUEENS,INSIDE,RESIDENCE-HOUSE,40.707858,-73.759307,0
3,05/24/2018,CRIMINAL MISCHIEF & RELATED OF,COMPLETED,MISDEMEANOR,BRONX,INSIDE,RESIDENCE-HOUSE,40.882615,-73.851948,0
4,11/18/2018,ROBBERY,COMPLETED,FELONY,BRONX,INSIDE,RESIDENCE - APT. HOUSE,40.850357,-73.882989,0


In [None]:
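for i in range(len(data)):
  try:
    #reverse-geocode the coordinates of row i and read the postcode from the result
    location = geolocator.reverse((data['Latitude'][i], data['Longitude'][i]))
    x = location.raw['address']['postcode']
    print(f"we're at {i} and zipcode {x}")
    #.loc avoids chained assignment, which may fail to update the dataframe
    data.loc[i, 'zipcode'] = x
  except Exception:
    #lookup failed or no postcode was returned: fall back to the placeholder '0'
    data.loc[i, 'zipcode'] = '0'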

we're at 0 and zipcode 10036
we're at 1 and zipcode 10451
we're at 2 and zipcode 11412
we're at 3 and zipcode 10466
we're at 4 and zipcode 10460
we're at 5 and zipcode 10303
we're at 6 and zipcode 10469
we're at 7 and zipcode 11434
we're at 8 and zipcode 10458
we're at 9 and zipcode 10029
we're at 10 and zipcode 11216
we're at 11 and zipcode 11203
we're at 12 and zipcode 11692
we're at 13 and zipcode 10459
we're at 14 and zipcode 10026
we're at 15 and zipcode 10030
we're at 16 and zipcode 11208
we're at 17 and zipcode 11226
we're at 18 and zipcode 11103
we're at 19 and zipcode 10458
we're at 20 and zipcode 11434
we're at 21 and zipcode 10462
we're at 22 and zipcode 11374
we're at 23 and zipcode 10461
we're at 24 and zipcode 10454
we're at 25 and zipcode 10305
we're at 26 and zipcode 11225
we're at 27 and zipcode 10031
we're at 28 and zipcode 10452
we're at 29 and zipcode 10456
we're at 30 and zipcode 11412
we're at 31 and zipcode 10451
we're at 32 and zipcode 10456
we're at 33 and zipc
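
Reverse geocoding millions of rows one request at a time is slow, and public Nominatim servers limit how fast requests may be sent. One way to reduce the number of lookups, sketched below, is to geocode each distinct coordinate pair once and map the result back onto the rows. The `reverse` argument is any callable that returns a ZIP string for a `(lat, lon)` pair; here a fake lookup stands in for the geocoder so the example runs offline. The names `add_zipcodes` and `fake_reverse` are hypothetical.

```python
import pandas as pd

def add_zipcodes(df, reverse, missing="0"):
    """Look up a ZIP code once per distinct (Latitude, Longitude) pair."""
    # Assumes rows with missing coordinates were dropped beforehand
    pairs = df[["Latitude", "Longitude"]].drop_duplicates()
    cache = {}
    for lat, lon in pairs.itertuples(index=False, name=None):
        try:
            cache[(lat, lon)] = reverse((lat, lon))
        except Exception:
            cache[(lat, lon)] = missing   # same placeholder as the loop above
    out = df.copy()
    out["zipcode"] = [cache[(lat, lon)] for lat, lon in zip(df["Latitude"], df["Longitude"])]
    return out

calls = []

def fake_reverse(coords):
    """Offline stand-in for a geocoder; knows only the first row shown above."""
    calls.append(coords)
    table = {(40.75931, -73.994706): "10036"}
    return table[coords]                  # raises KeyError for unknown points

sample = pd.DataFrame({
    "Latitude": [40.75931, 40.75931, 40.0],
    "Longitude": [-73.994706, -73.994706, -73.0],
})
result = add_zipcodes(sample, fake_reverse)
```

With the real geocoder, `reverse` could be something like `lambda c: geolocator.reverse(c).raw['address']['postcode']`, and geopy's `RateLimiter` helper in `geopy.extra.rate_limiter` can be wrapped around `geolocator.reverse` to space out requests.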

In [None]:
data.drop(['Latitude','Longitude'], axis=1, inplace=True)

In [None]:
data.head()

In [None]:
data.to_csv("/content/drive/MyDrive/Capstone Project/NYPD_Complaint_Data_Historic.csv", index=False)

# FDNY Fire Incident Dispatch Data

In [None]:
data = pd.read_csv("/content/drive/MyDrive/Capstone Project/Fire_Incident_Dispatch_Data.csv")

In [None]:
data.head()

Unnamed: 0,STARFIRE_INCIDENT_ID,INCIDENT_DATETIME,ALARM_BOX_BOROUGH,ALARM_BOX_NUMBER,ALARM_BOX_LOCATION,INCIDENT_BOROUGH,ZIPCODE,POLICEPRECINCT,CITYCOUNCILDISTRICT,COMMUNITYDISTRICT,...,FIRST_ACTIVATION_DATETIME,FIRST_ON_SCENE_DATETIME,INCIDENT_CLOSE_DATETIME,VALID_DISPATCH_RSPNS_TIME_INDC,VALID_INCIDENT_RSPNS_TIME_INDC,INCIDENT_RESPONSE_SECONDS_QY,INCIDENT_TRAVEL_TM_SECONDS_QY,ENGINES_ASSIGNED_QUANTITY,LADDERS_ASSIGNED_QUANTITY,OTHER_UNITS_ASSIGNED_QUANTITY
0,500192400000000.0,01/01/2005 12:07:32 AM,QUEENS,9237,N/SVC RD H. HARDING EXPY & 99 ST,QUEENS,11368.0,110.0,21.0,404.0,...,01/01/2005 12:09:31 AM,01/01/2005 12:13:10 AM,01/01/2005 12:33:42 AM,N,Y,338,236,3,2,2
1,500114900000000.0,01/01/2005 12:14:40 AM,MANHATTAN,1493,BWAY & W125 ST\M.L.KING JR BLVD,MANHATTAN,10027.0,26.0,7.0,109.0,...,01/01/2005 12:15:43 AM,01/01/2005 12:19:06 AM,01/01/2005 12:35:27 AM,N,Y,266,217,2,2,1
2,500106500000000.0,01/01/2005 12:24:58 AM,BROOKLYN,653,LAFAYETTE & CLASSON AVES,BROOKLYN,11238.0,79.0,35.0,303.0,...,01/01/2005 12:25:51 AM,01/01/2005 12:28:44 AM,01/01/2005 12:47:38 AM,N,Y,226,189,3,2,1
3,500116500000000.0,01/01/2005 12:27:19 AM,MANHATTAN,1649,RIVERSIDE DR & 150 ST,MANHATTAN,10031.0,30.0,7.0,109.0,...,01/01/2005 12:28:48 AM,01/01/2005 12:31:53 AM,01/01/2005 02:25:27 AM,N,Y,274,200,5,3,5
4,500116500000000.0,01/01/2005 12:27:19 AM,MANHATTAN,1649,RIVERSIDE DR & 150 ST,MANHATTAN,10031.0,30.0,7.0,109.0,...,01/01/2005 12:28:48 AM,01/01/2005 12:31:53 AM,01/01/2005 02:25:27 AM,N,Y,274,200,5,3,5


In [None]:
data = data[[
              'INCIDENT_DATETIME',
              'INCIDENT_BOROUGH',
              'ZIPCODE',
              'HIGHEST_ALARM_LEVEL',
              'INCIDENT_CLASSIFICATION',
              'INCIDENT_CLASSIFICATION_GROUP',

              ]]
data.head()

Unnamed: 0,INCIDENT_DATETIME,INCIDENT_BOROUGH,ZIPCODE,HIGHEST_ALARM_LEVEL,INCIDENT_CLASSIFICATION,INCIDENT_CLASSIFICATION_GROUP
0,01/01/2005 12:07:32 AM,QUEENS,11368.0,First Alarm,Multiple Dwelling 'A' - Other fire,Structural Fires
1,01/01/2005 12:14:40 AM,MANHATTAN,10027.0,First Alarm,Multiple Dwelling 'A' - Compactor fire,Structural Fires
2,01/01/2005 12:24:58 AM,BROOKLYN,11238.0,First Alarm,Multiple Dwelling 'A' - Compactor fire,Structural Fires
3,01/01/2005 12:27:19 AM,MANHATTAN,10031.0,Seventh Alarm,Multiple Dwelling 'A' - Other fire,Structural Fires
4,01/01/2005 12:27:19 AM,MANHATTAN,10031.0,All Hands Working,Multiple Dwelling 'A' - Other fire,Structural Fires


In [None]:
#parse the timestamps with an explicit format (month/day/year, 12-hour clock with AM/PM)
data['INCIDENT_DATETIME'] = pd.to_datetime(data['INCIDENT_DATETIME'], format='%m/%d/%Y %I:%M:%S %p')
print(f"the last date {data['INCIDENT_DATETIME'].max()}")
print(f"the beginning date {data['INCIDENT_DATETIME'].min()}")

the last date 2021-08-01 23:59:54
the beginning date 2005-01-01 00:07:32


In [None]:
data.head()

Unnamed: 0,INCIDENT_DATETIME,INCIDENT_BOROUGH,ZIPCODE,HIGHEST_ALARM_LEVEL,INCIDENT_CLASSIFICATION,INCIDENT_CLASSIFICATION_GROUP
0,2005-01-01 00:07:32,QUEENS,11368.0,First Alarm,Multiple Dwelling 'A' - Other fire,Structural Fires
1,2005-01-01 00:14:40,MANHATTAN,10027.0,First Alarm,Multiple Dwelling 'A' - Compactor fire,Structural Fires
2,2005-01-01 00:24:58,BROOKLYN,11238.0,First Alarm,Multiple Dwelling 'A' - Compactor fire,Structural Fires
3,2005-01-01 00:27:19,MANHATTAN,10031.0,Seventh Alarm,Multiple Dwelling 'A' - Other fire,Structural Fires
4,2005-01-01 00:27:19,MANHATTAN,10031.0,All Hands Working,Multiple Dwelling 'A' - Other fire,Structural Fires


In [None]:
data.dropna(inplace=True)
data.head()

Unnamed: 0,INCIDENT_DATETIME,INCIDENT_BOROUGH,ZIPCODE,HIGHEST_ALARM_LEVEL,INCIDENT_CLASSIFICATION,INCIDENT_CLASSIFICATION_GROUP
0,2005-01-01 00:07:32,QUEENS,11368.0,First Alarm,Multiple Dwelling 'A' - Other fire,Structural Fires
1,2005-01-01 00:14:40,MANHATTAN,10027.0,First Alarm,Multiple Dwelling 'A' - Compactor fire,Structural Fires
2,2005-01-01 00:24:58,BROOKLYN,11238.0,First Alarm,Multiple Dwelling 'A' - Compactor fire,Structural Fires
3,2005-01-01 00:27:19,MANHATTAN,10031.0,Seventh Alarm,Multiple Dwelling 'A' - Other fire,Structural Fires
4,2005-01-01 00:27:19,MANHATTAN,10031.0,All Hands Working,Multiple Dwelling 'A' - Other fire,Structural Fires


In [None]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 386662 entries, 0 to 388531
Data columns (total 6 columns):
 #   Column                         Non-Null Count   Dtype         
---  ------                         --------------   -----         
 0   INCIDENT_DATETIME              386662 non-null  datetime64[ns]
 1   INCIDENT_BOROUGH               386662 non-null  object        
 2   ZIPCODE                        386662 non-null  float64       
 3   HIGHEST_ALARM_LEVEL            386662 non-null  object        
 4   INCIDENT_CLASSIFICATION        386662 non-null  object        
 5   INCIDENT_CLASSIFICATION_GROUP  386662 non-null  object        
dtypes: datetime64[ns](1), float64(1), object(4)
memory usage: 20.6+ MB


In [None]:
data.to_csv("/content/drive/MyDrive/Capstone Project/Fire_Incident_Dispatch_Data.csv", index=False)