<a href="https://www.kaggle.com/code/lucamodica/911-calls-first-exploratory-data-analysis?scriptVersionId=112997111" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# 911 Calls Capstone Project

This is a capstone project that I did for a Data Science course. I will analyze some 911 call data from [Kaggle](https://www.kaggle.com/mchirico/montcoalert). The data contains the following fields:

* lat: String variable, Latitude
* lng: String variable, Longitude
* desc: String variable, Description of the Emergency Call
* zip: String variable, Zipcode
* title: String variable, Title
* timeStamp: String variable, YYYY-MM-DD HH:MM:SS
* twp: String variable, Township
* addr: String variable, Address
* e: String variable, Dummy variable (always 1)

## Data and Setup

Importing the libraries and setting initial information:

In [12]:
import numpy as np
import pandas as pd
import seaborn as sns
import sys

sns.set_theme()
%matplotlib inline

Loading the dataset to be analyzed:

In [19]:
try:
  df = pd.read_csv('/kaggle/input/montcoalert/911.csv')
except FileNotFoundError:
  print('File not found.')

File not found.
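
The cell above only prints a message when the file is missing, so `df` is never created and the next cells cannot run. A minimal sketch of a stricter loader is shown below. The helper name `load_calls` and the idea of trying several candidate paths are assumptions for illustration, not part of the original notebook.

```python
from pathlib import Path

import pandas as pd


def load_calls(candidates):
    """Read the first existing CSV among the candidate paths (hypothetical helper)."""
    for path in candidates:
        if Path(path).is_file():
            return pd.read_csv(path)
    # Fail loudly instead of continuing without a DataFrame
    raise FileNotFoundError(f"None of these files exist: {list(candidates)}")


# Possible usage (the second, local path is an assumption):
# df = load_calls(['/kaggle/input/montcoalert/911.csv', '911.csv'])
```

Raising the error at load time makes the failure visible where it happens, rather than as a later `NameError`.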


## Exploratory data analysis (EDA)

First, I check the info() of the dataset:

In [14]:
df.info()

NameError: name 'df' is not defined

After retrieving the general data info, I check the head of the dataset, to see how the information is structured:

In [None]:
df.head()

I can now start with some basic information, like the top 5 zip codes for 911 calls in Montgomery County.

In [None]:
df['zip'].value_counts().head()

Let's see the same leaderboard, but considering the top 5 townships ('twp') instead.

In [None]:
df['twp'].value_counts().head()

Now what we can ask ourselves is: how many codes do people call 911 for?
In other words: how many unique codes (values of 'title') are there in the dataset?

In [None]:
df['title'].nunique()

To also have the general reason info for a 911 call, I create a new feature called "Reason".

In [None]:
df['Reason'] = df['title'].apply(lambda t: t.split(':')[0])
df['Reason'].value_counts()
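
To make the extraction concrete, here is a small sketch on a few made-up titles. It assumes each title has the shape "Reason: detail", which is what split(':') relies on. It uses the vectorized `.str` accessor instead of `apply`, and the sample values are illustrative only.

```python
import pandas as pd

# Illustrative titles in the assumed "Reason: detail" shape
titles = pd.Series([
    'EMS: BACK PAINS/INJURY',
    'Fire: GAS-ODOR/LEAK',
    'Traffic: VEHICLE ACCIDENT -',
    'EMS: CARDIAC EMERGENCY',
])

# Keep only the text before the first colon
reasons = titles.str.split(':').str[0]
counts = reasons.value_counts()
print(counts)
```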

As seen above, there are 3 major reasons for calls: EMS (Emergency Medical Service), traffic (incidents, disabled vehicles) and fire.

Let's have a better visualization with a countplot:

In [None]:
sns.countplot(data=df, x='Reason')

In [None]:
sns.violinplot(data=df, x='Reason')

___
Now I'll concentrate more on the time information, first converting the 'timeStamp' feature from a generic object (string) column to a datetime column (using pd.to_datetime).

In [None]:
df['timeStamp'].dtype

In [None]:
df['timeStamp'] = pd.to_datetime(df['timeStamp'])
df['timeStamp'].dtype

Now it's possible to read time values from the 'timeStamp' feature. I take advantage of this to create 3 new columns, dedicated to the hour, month and day of the week of the call.

In [None]:
# Extract the time components with the vectorized .dt accessor
df['Hour'] = df['timeStamp'].dt.hour
df['Month'] = df['timeStamp'].dt.month

# 'Day of Week' conversion: weekday number (0 = Monday) to abbreviated name
dmap = {0: 'Mon', 1: 'Tue', 2: 'Wed', 3: 'Thu', 4: 'Fri', 5: 'Sat', 6: 'Sun'}
df['Day of Week'] = df['timeStamp'].dt.dayofweek.map(dmap)
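
The same fields can be checked on a tiny sample. The sketch below uses two made-up timestamps in the YYYY-MM-DD HH:MM:SS format and reads hour, month and weekday through the `.dt` accessor. Taking the first three letters of `day_name()` is just one possible way to get the abbreviated weekday.

```python
import pandas as pd

# Two made-up timestamps in the dataset's string format
stamps = pd.to_datetime(pd.Series(['2015-12-10 17:40:00', '2015-12-12 08:05:00']))

parts = pd.DataFrame({
    'Hour': stamps.dt.hour,
    'Month': stamps.dt.month,
    # Full English day name shortened to three letters, e.g. 'Thursday' -> 'Thu'
    'Day of Week': stamps.dt.day_name().str[:3],
})
print(parts)
```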

For each weekday, let's see the number of 911 calls, again divided by reason:

In [None]:
ax = sns.countplot(data=df, x='Day of Week', hue='Reason')
sns.move_legend(ax, "upper left", bbox_to_anchor=(1, 1))

Same thing for the month:

In [None]:
ax = sns.countplot(data=df, x='Month', hue='Reason')
sns.move_legend(ax, "upper left", bbox_to_anchor=(1, 1))

- - -
Based on the last countplot, what I'll try to do now is to create a linear fit on the number of calls per month. To do so, I first group the dataset by month:

In [None]:
byMonth = df.groupby('Month').count()
byMonth.head()

Let's start with a simple lineplot to see how the number of 911 calls changes through the months:

In [None]:
# the 'twp' column is chosen arbitrarily: after count(), every column holds its number of non-null entries per month
byMonth['twp'].plot.line()
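
One detail worth keeping in mind when picking a column after count(): it counts non-null values per column, so a column with missing entries can report fewer calls than there are rows. The toy frame below (made-up values) contrasts count() with size(), which counts rows regardless of missing data.

```python
import numpy as np
import pandas as pd

# Made-up data: one 'twp' value is missing in month 1
toy = pd.DataFrame({
    'Month': [1, 1, 2, 2, 2],
    'twp': ['A', np.nan, 'B', 'C', 'D'],
    'e': [1, 1, 1, 1, 1],
})

by_count = toy.groupby('Month').count()  # non-null entries per column
by_size = toy.groupby('Month').size()    # rows per group

print(by_count)
print(by_size)
```

If the goal is the number of calls per month, size() (or a column known to have no missing values) is the safer choice.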

Then I can move on to plotting an lmplot(), to create a linear fit on the number of calls per month.

In [None]:
byMonth['Month'] = byMonth.index
sns.lmplot(data=byMonth, x='Month', y='twp')
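
The line drawn by lmplot() is, by default, an ordinary least-squares fit. To read the slope and intercept as numbers, a first-degree np.polyfit can be used. The sketch below runs on made-up monthly counts, not on the real dataset.

```python
import numpy as np

# Made-up monthly call counts that decrease by 5 each month
months = np.array([1, 2, 3, 4, 5, 6])
calls = np.array([100, 95, 90, 85, 80, 75])

# Degree-1 polynomial fit: calls ~ slope * month + intercept
slope, intercept = np.polyfit(months, calls, deg=1)
print(slope, intercept)
```

On the real data, the same call could take byMonth.index and byMonth['twp'] as inputs.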

- - -
Let's move on to analyzing the number of 911 calls per day instead.

First, I create a new feature "Date", again taking the data from the 'timeStamp' column.

In [None]:
df['Date'] = df['timeStamp'].apply(lambda d: d.date())

Now I can plot a line chart to see the trend of the 911 calls:

In [None]:
byDate = df.groupby('Date').count()
byDate['twp'].plot.line(figsize=(18,5)).set(title='Number of 911 calls per day')

This is followed by a plot for each of the 3 call reasons.

In [None]:
df[df['Reason'] == 'EMS'].groupby('Date').count()['twp'].plot.line(figsize=(18,5)).set(title='Number of 911 calls per day, due to EMS reasons')

In [None]:
df[df['Reason'] == 'Traffic'].groupby('Date').count()['twp'].plot.line(figsize=(18,5)).set(title='Number of 911 calls per day, due to traffic reasons')

In [None]:
df[df['Reason'] == 'Fire'].groupby('Date').count()['twp'].plot.line(figsize=(18,5)).set(title='Number of 911 calls per day, due to fire reasons')
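
The three cells above repeat the same filter-and-group pattern. As a possible alternative, the daily counts for all reasons can be built in one table by grouping on both 'Date' and 'Reason'. The sketch below uses made-up rows, with dates written as plain strings for brevity.

```python
import pandas as pd

# Made-up calls: three on the first day, two on the second
toy = pd.DataFrame({
    'Date': ['2016-01-01', '2016-01-01', '2016-01-01', '2016-01-02', '2016-01-02'],
    'Reason': ['EMS', 'EMS', 'Fire', 'Traffic', 'EMS'],
})

# One row per date, one column per reason, 0 where a reason has no calls
daily = toy.groupby(['Date', 'Reason']).size().unstack(fill_value=0)
print(daily)

# Each column could then be plotted on its own, e.g. daily['EMS'].plot.line()
```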

____
In the last part of the EDA of this dataset, I concentrate on relating the time features to each other using a heatmap and a clustermap.

I start by creating a matrix that shows the number of 911 calls for each hour of each day of the week.

In [None]:
dayHour = df.groupby(by=['Day of Week', 'Hour']).count()['Reason'].unstack()
dayHour
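
Note that unstack() sorts the row labels alphabetically (Fri, Mon, Sat, ...), and leaves NaN for any day and hour combination without calls. A minimal sketch of one way to get calendar order and zeros instead, on made-up rows, is shown below. The variable names `order` and `day_hour` are chosen here for illustration.

```python
import pandas as pd

# Made-up calls covering only three weekdays
toy = pd.DataFrame({
    'Day of Week': ['Mon', 'Fri', 'Fri', 'Sun', 'Mon'],
    'Hour': [8, 17, 17, 23, 8],
    'Reason': ['EMS', 'Fire', 'EMS', 'Traffic', 'EMS'],
})

order = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']

day_hour = (
    toy.groupby(['Day of Week', 'Hour'])['Reason']
    .count()
    .unstack(fill_value=0)          # 0 instead of NaN for missing hours
    .reindex(order, fill_value=0)   # rows in calendar order, absent days filled with 0
)
print(day_hour)
```

Row order only matters for the heatmap; a clustermap reorders rows and columns on its own.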

Firstly, let's plot this matrix in a heatmap:

In [None]:
sns.heatmap(dayHour, cmap='viridis')

Then, to see similar values grouped together, we move on to a clustermap:

In [None]:
sns.clustermap(dayHour, cmap='viridis')

As we can expect and see from the 2 plots, most of the calls are concentrated in the most active hours of the afternoon (hours 15, 16 and 17).

In this case, we see a concentration especially on Friday.

Now I repeat the same procedure, replacing the hours with the months.

In [None]:
dayMonth = df.groupby(by=['Day of Week', 'Month']).count()['twp'].unstack()
dayMonth

In [None]:
sns.heatmap(dayMonth, cmap='viridis')

In [None]:
sns.clustermap(dayMonth, cmap='viridis')

Both the heatmap and the clustermap show a concentration of 911 calls on the Fridays of March.