# Introduction to Pandas

## Setup and preliminaries

In [None]:
# Render our plots inline
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns

# Make the graphs a bit bigger
# Note: matplotlib 3.6+ renamed these styles to 'seaborn-v0_8-talk', 'seaborn-v0_8-ticks', etc.
matplotlib.style.use(['seaborn-talk', 'seaborn-ticks', 'seaborn-whitegrid'])

## Exercise: NYPD Vehicle Collisions

* We have already worked with the NYC Restaurant Inspection data. Now, let's download another dataset and do some analysis. We will focus on the [NYPD Vehicle Collisions](https://data.cityofnewyork.us/Public-Safety/NYPD-Motor-Vehicle-Collisions/h9gi-nx95/data) dataset.


### Task 1

Download the dataset. Use the "Export" view, get the URL for the CSV file, and download it using curl. (See the top of the notebook for guidance.)


#### Solution

In [None]:
# It is a big file, ~350 MB. It will take 2-3 minutes to download.
# The URL is quoted so that the shell does not try to interpret the "?" character.
!curl 'https://data.cityofnewyork.us/api/views/h9gi-nx95/rows.csv?accessType=DOWNLOAD' -o accidents.csv

In [None]:
df = pd.read_csv("accidents.csv", low_memory=False)

In [None]:
df.dtypes

### Task 2

Find out the most common contributing factors to the collisions. 
 

#### Solution

In [None]:
# Task 2: Find out the most common contributing factors to the collisions.
df['CONTRIBUTING FACTOR VEHICLE 1'].value_counts().plot(kind='barh')

In [None]:
# Task 2: If we want to remove the "Unspecified", we select the elements starting
# from position 1 (i.e., the second element in the list, the first one is 0)
df['CONTRIBUTING FACTOR VEHICLE 1'].value_counts()[1:10].plot(kind='barh')
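
The positional slice above assumes that "Unspecified" is the most frequent value. A slightly more defensive variant drops the placeholder by label instead. The sketch below uses a tiny made-up Series as a stand-in for the real column; the names `factors`, `counts`, and `top_factors` are illustrative only.

```python
import pandas as pd

# Toy stand-in for df['CONTRIBUTING FACTOR VEHICLE 1'] (values are made up)
factors = pd.Series([
    "Unspecified", "Unspecified", "Unspecified",
    "Driver Inattention/Distraction", "Driver Inattention/Distraction",
    "Failure to Yield Right-of-Way",
])

counts = factors.value_counts()

# Drop the placeholder by label rather than by position;
# errors="ignore" avoids a KeyError if the label is absent
top_factors = counts.drop("Unspecified", errors="ignore").head(10)
print(top_factors)
```

On the real data, the same `.drop(...).head(10)` chain can be followed by `.plot(kind='barh')` as in the cell above.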

### Task 3

Break down the number of collisions by borough.





#### Solution

In [None]:
# Task 3: Break down the number of collisions by borough.
df['BOROUGH'].value_counts().plot(kind='barh', figsize=(10,5))

### Task 4

Find out how many collisions had 0 persons injured, 1 person injured, 2 persons injured, and so on. Use the `value_counts()` approach. You may also find the `.plot(logy=True)` option useful when you create the plot, to make the y-axis logarithmic.
 

#### Solution

In [None]:
# "Chain" style of writing data manipulation operations
plot = (
    df['NUMBER OF PERSONS INJURED'] # take the num of injuries column
    .value_counts() # compute the frequency of each value
    .sort_index() # sort the results based on the index value instead of the frequency, 
                  # which is the default for value_counts
    .plot( # and plot the results
        kind='line', # we use a line plot because the x-axis is numeric/continuous
        marker='o',  # we use a marker to mark where we have data points 
        logy=True # make the y-axis logarithmic
    )
)
plot.set_xlabel("Number of injuries")
plot.set_ylabel("Number of collisions")
plot.set_title("Analysis of number of injuries per collision")

### Task 5

(a) Compute the average number of injuries and deaths per accident, broken down by borough. Use the `pivot_table` functionality, putting `BOROUGH` as the index. You can answer this query by generating two separate tables, or you can create a single table by using the fact that you can pass a list of attributes/columns to the `values` parameter of the pivot table.

(b) Repeat the exercise above, but break down the average number of deaths and injuries by the contributing factor for the accident. Use the `sort_values` command to sort the results, putting on top the contributing factors with the highest average number of deaths per accident.

#### Solution

In [None]:
pd.pivot_table(
    data = df,
    index = 'BOROUGH',
    aggfunc = 'mean',
    values = ['NUMBER OF PERSONS INJURED', 'NUMBER OF PERSONS KILLED']
)

In [None]:
pd.pivot_table(
    data = df,
    index = 'CONTRIBUTING FACTOR VEHICLE 1',
    aggfunc = 'mean',
    values = ['NUMBER OF PERSONS INJURED', 'NUMBER OF PERSONS KILLED']
).sort_values('NUMBER OF PERSONS KILLED', ascending=False)
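
To see what passing a list to `values` does, it can help to run the same pivot on a handful of rows and compare it with a plain `groupby`. This is only a sketch on made-up numbers; the toy dataframe `toy` and the names `via_pivot` and `via_groupby` are not part of the exercise.

```python
import pandas as pd

# Four made-up accidents in two boroughs
toy = pd.DataFrame({
    "BOROUGH": ["BRONX", "BRONX", "QUEENS", "QUEENS"],
    "NUMBER OF PERSONS INJURED": [0, 2, 1, 1],
    "NUMBER OF PERSONS KILLED": [0, 0, 0, 1],
})
cols = ["NUMBER OF PERSONS INJURED", "NUMBER OF PERSONS KILLED"]

# One table with both averages, one row per borough
via_pivot = pd.pivot_table(data=toy, index="BOROUGH", values=cols, aggfunc="mean")

# The same averages computed with groupby
via_groupby = toy.groupby("BOROUGH")[cols].mean()

print(via_pivot)
```

Both approaches produce one row per borough and one column per measure, so which one to use is mostly a matter of taste.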

### Task 6

Break down the accidents by borough and contributing factor. Use the `pivot_table` function of Pandas.
 

#### Solution

In [None]:
pivot = pd.pivot_table(
    data = df, # we analyze the df (accidents) dataframe
    index = 'CONTRIBUTING FACTOR VEHICLE 1', 
    columns = 'BOROUGH', 
    values = 'COLLISION_ID', 
    aggfunc = 'count'
)

# Create an extra column showing the total number of collisions across boroughs (=columns)
pivot["Total"] = pivot.sum(axis="columns") 

# Sort the dataframe by descending order of the values in the column "Total"
pivot = pivot.sort_values("Total", ascending=False)

pivot
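
As a possible alternative, `pd.crosstab` builds the same kind of count table and can add the totals itself through `margins=True`. The sketch below runs on a made-up four-row dataframe (`toy` and `table` are illustrative names). Note that `crosstab` fills missing combinations with 0, whereas the `pivot_table` count above leaves them as `NaN`.

```python
import pandas as pd

# Four made-up accidents
toy = pd.DataFrame({
    "CONTRIBUTING FACTOR VEHICLE 1": [
        "Unsafe Speed", "Unsafe Speed", "Fatigued/Drowsy", "Unsafe Speed",
    ],
    "BOROUGH": ["BRONX", "QUEENS", "QUEENS", "QUEENS"],
})

# Count accidents per (factor, borough) pair and append row/column totals
table = pd.crosstab(
    index=toy["CONTRIBUTING FACTOR VEHICLE 1"],
    columns=toy["BOROUGH"],
    margins=True,
    margins_name="Total",
)
print(table)
```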

### Task 7

Find the dates with the most accidents. Can you figure out what happened on these days? 


#### Solution

In [None]:
df["CRASH DATE"].value_counts()

### Task 8

Plot the number of accidents per day. (Hint: Ensure that your date column is in the right datatype and that it is properly sorted, before plotting)


#### Solution 

In [None]:
df["CRASH DATE"] = pd.to_datetime(df["CRASH DATE"], format="%m/%d/%Y", errors="coerce")

In [None]:
(
    df["CRASH DATE"].value_counts() # count the number of accidents per day
    .sort_index() # sort the dates
    .resample('1M') # take periods of 1 month
    .sum() # sum the number of accidents per month
    .plot(figsize=(10,5)) # plot the result
)
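
The cell above aggregates by month, while the task asks for accidents per day. A daily version can follow the same chain with a one-day resample, which also fills days without any accident with zero. The sketch below uses four made-up dates as a stand-in for the converted `CRASH DATE` column; `crash_dates`, `daily`, and `smoothed` are illustrative names, and the rolling average is an optional extra for readability.

```python
import pandas as pd

# Toy stand-in for df["CRASH DATE"] after pd.to_datetime (dates are made up)
crash_dates = pd.Series(pd.to_datetime([
    "2020-01-01", "2020-01-01", "2020-01-02", "2020-01-04",
]))

daily = (
    crash_dates.value_counts()  # count the number of accidents per day
    .sort_index()               # sort the dates
    .resample("1D")             # one bucket per calendar day
    .sum()                      # days without accidents become 0
)

# Optional: a rolling mean smooths the day-to-day noise
smoothed = daily.rolling(window=2).mean()
print(daily)
```

On the real data, `daily.plot(figsize=(10,5))` would draw one point per day; a longer window (for example 7 days) is probably more useful than the 2-day window used for this toy series.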

### Task 9

We want to analyze the timing patterns of accidents that lead to death or injury.

We will do the analysis by creating histograms showing the frequency of deadly vs non-deadly accidents throughout the day. By comparing the two histograms we will be able to understand if time of day is correlated with deadly accidents or not.

Steps to follow:
* Ensure that the `CRASH TIME` column is converted to a datetime. The format is HH:MM, which can be written as `format="%H:%M"` in the `to_datetime` command of Pandas.
* Create a boolean column `DEATH` that is true when someone was killed in the accident (i.e., `NUMBER OF PERSONS KILLED > 0`). 
* Create a boolean column `INJURY` that is true when someone was injured in the accident (i.e., `NUMBER OF PERSONS INJURED > 0`). 
* Query the dataframe to get back the deadly accidents and create a histogram of deadly accidents over time. Do the same for non-deadly accidents.
* To allow a more direct visual comparison of the two histograms, we want to merge them in one plot. Since the number of accidents without deaths is *much* higher, we want the histograms to be normalized (i.e., `density=True`). 
* It is also a good idea to make the histograms partially transparent, to allow for easier comparison of the two histograms.


#### Solution

In [None]:
# Define the indicator variables
df['INJURY'] = (df['NUMBER OF PERSONS INJURED']>0)
df['DEATH'] = (df['NUMBER OF PERSONS KILLED']>0)

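# Convert the date/time columns to proper datetime formats.
# CRASH DATE was already converted to datetime in Task 8, so we format it back
# to text before concatenating it with the (still textual) CRASH TIME.
df['DATETIME'] = df['CRASH DATE'].dt.strftime('%m/%d/%Y') + ' ' + df['CRASH TIME']
df['DATETIME'] = pd.to_datetime(df['DATETIME'], format="%m/%d/%Y %H:%M")

df['CRASH TIME'] = pd.to_datetime(df['CRASH TIME'], format="%H:%M")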

In [None]:
# Define the two subsets ("noharm" here means that nobody was killed)
deadly_accidents = df[df['DEATH']]
noharm_accidents = df[~df['DEATH']]

In [None]:
deadly_accidents['CRASH TIME'].hist(
    bins=48, # one bar per half hour
    figsize=(20,10),  # make the figure bigger
    density=True, # normalize the counts
    alpha=0.5,  # make the histogram semi-transparent
    color='red' # color red the deadly accidents
)

noharm_accidents['CRASH TIME'].hist(
    bins=48,
    figsize=(20,10), 
    density=True,
    alpha=0.5, 
    color='green'
)
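
The histograms above compare the two groups visually. A tabular view of the same idea is the share of each group's accidents that falls in each hour of the day. The sketch below is one way to compute it, on six made-up rows; the `HOUR` and `OUTCOME` columns and the name `hourly_share` are introduced here for illustration and are not part of the exercise.

```python
import pandas as pd

# Six made-up accidents; times use the same H:MM text format as CRASH TIME
toy = pd.DataFrame({
    "CRASH TIME": ["0:15", "0:45", "8:30", "17:05", "17:40", "23:59"],
    "NUMBER OF PERSONS KILLED": [1, 0, 0, 0, 1, 0],
})

# Indicator variable, turned into readable labels
toy["DEATH"] = toy["NUMBER OF PERSONS KILLED"] > 0
toy["OUTCOME"] = toy["DEATH"].map({True: "deadly", False: "non-deadly"})

# Hour of day as a plain integer (0-23)
toy["HOUR"] = pd.to_datetime(toy["CRASH TIME"], format="%H:%M").dt.hour

# Share of each outcome's accidents per hour; each column sums to 1
hourly_share = pd.crosstab(toy["HOUR"], toy["OUTCOME"], normalize="columns")
print(hourly_share)
```

Normalizing per column plays a role similar to `density=True` in the histograms: it makes the two groups comparable even though non-deadly accidents are far more numerous.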

In [None]:
injuries = df[df['INJURY']]
no_injuries = df[~df['INJURY']]

In [None]:
injuries['CRASH TIME'].hist(bins=48,figsize=(20,10), density=True,alpha=0.5, color='red')
no_injuries['CRASH TIME'].hist(bins=48,figsize=(20,10), density=True,alpha=0.5, color='green')

In [None]:
deadly_accidents['DATETIME'].hist(bins=48,figsize=(20,10), density=True,alpha=0.5, color='red')
noharm_accidents['DATETIME'].hist(bins=48,figsize=(20,10), density=True,alpha=0.5, color='green')

In [None]:
injuries['DATETIME'].hist(bins=48,figsize=(20,10), density=True,alpha=0.5, color='red')
no_injuries['DATETIME'].hist(bins=48,figsize=(20,10), density=True,alpha=0.5, color='green')

In [None]:
sns.kdeplot(data = df, x ='CRASH TIME', hue='DEATH', common_norm=False, cut=0)

In [None]:
sns.kdeplot(data = df, x ='CRASH TIME', hue='INJURY', common_norm=False, cut=0)