# Introduction to Pandas

## Setup and preliminaries

In [2]:
# Render our plots inline
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns

# Make the graphs a bit bigger
# Note: on matplotlib 3.6 and later these styles are named 'seaborn-v0_8-talk', etc.
matplotlib.style.use(['seaborn-talk', 'seaborn-ticks', 'seaborn-whitegrid'])

## Exercise: NYPD Vehicle Collisions

* We have already worked with the NYC Restaurant Inspection data. Now let's download another dataset and analyze it. We will focus on the [NYPD Vehicle Collisions](https://data.cityofnewyork.us/Public-Safety/NYPD-Motor-Vehicle-Collisions/h9gi-nx95/data) dataset.


### Task 1: 

Download the dataset. Use the "Export" view, get the URL for the CSV file, and download it using curl. (See the top of the notebook for guidance.)


#### Solution

In [5]:
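# It is a big file (about 318 MB in the run below). It can take a couple of minutes to download
# The URL is quoted so that the shell does not try to interpret the "?" character
!curl "https://data.cityofnewyork.us/api/views/h9gi-nx95/rows.csv?accessType=DOWNLOAD" -o accidents.csv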

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  318M    0  318M    0     0  3181k      0 --:--:--  0:01:42 --:--:-- 3296k


In [6]:
df = pd.read_csv("accidents.csv", low_memory=False)

In [7]:
df.dtypes

DATE                              object
TIME                              object
BOROUGH                           object
ZIP CODE                          object
LATITUDE                         float64
LONGITUDE                        float64
LOCATION                          object
ON STREET NAME                    object
CROSS STREET NAME                 object
OFF STREET NAME                   object
NUMBER OF PERSONS INJURED        float64
NUMBER OF PERSONS KILLED         float64
NUMBER OF PEDESTRIANS INJURED      int64
NUMBER OF PEDESTRIANS KILLED       int64
NUMBER OF CYCLIST INJURED          int64
NUMBER OF CYCLIST KILLED           int64
NUMBER OF MOTORIST INJURED         int64
NUMBER OF MOTORIST KILLED          int64
CONTRIBUTING FACTOR VEHICLE 1     object
CONTRIBUTING FACTOR VEHICLE 2     object
CONTRIBUTING FACTOR VEHICLE 3     object
CONTRIBUTING FACTOR VEHICLE 4     object
CONTRIBUTING FACTOR VEHICLE 5     object
UNIQUE KEY                         int64
VEHICLE TYPE COD
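
In the listing above, `NUMBER OF PERSONS INJURED` and `NUMBER OF PERSONS KILLED` are `float64`, while the other count columns are `int64`. One plausible explanation (an assumption, not verified against the file) is that these two columns contain missing values, because pandas stores an integer column with gaps as floats by default. A minimal sketch on a made-up three-row CSV, with hypothetical column names `INJURED` and `KILLED`:

```python
import io
import pandas as pd

# Hypothetical toy data: the second row has no value for INJURED
csv_text = """INJURED,KILLED
1,0
,0
2,1
"""
toy = pd.read_csv(io.StringIO(csv_text))

# INJURED becomes float64 because of the gap; KILLED stays int64
print(toy.dtypes)

# Count the missing values in each column
missing_per_column = toy.isna().sum()
print(missing_per_column)
```

Running `df.isna().sum()` on the real dataframe would show whether this explanation holds for the collisions data.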

### Task 2: 

Find out the most common contributing factors to the collisions. 
 

#### Solution

In [None]:
# Task 2: Find out the most common contributing factors to the collisions.
# We skip the first entry (position 0) of the sorted counts and keep positions 1 through 9;
# .iloc makes the positional slicing explicit
df['CONTRIBUTING FACTOR VEHICLE 1'].value_counts().iloc[1:10].plot(kind='barh')
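
Skipping position 0 works only if the entry to be ignored really is the most frequent one. A possibly more robust variant, sketched below on made-up data, drops the placeholder by its label instead. The label `"Unspecified"` is an assumption about what the top entry is; check the output of `value_counts()` on the real data before relying on it.

```python
import pandas as pd

# Hypothetical toy column standing in for 'CONTRIBUTING FACTOR VEHICLE 1'
factors = pd.Series(
    ["Unspecified"] * 5
    + ["Driver Inattention/Distraction"] * 3
    + ["Unsafe Speed"] * 2
    + ["Alcohol Involvement"]
)

counts = factors.value_counts()  # sorted from most to least frequent

# Drop the placeholder by name (no error if it is absent), then keep the top 2
top = counts.drop("Unspecified", errors="ignore").head(2)
print(top)
```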

### Task 3: 

Break down the number of collisions by borough.





#### Solution

In [None]:
# Task 3: Break down the number of collisions by borough.
df['BOROUGH'].value_counts().plot(kind='barh')
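
By default, `value_counts()` leaves out rows where the value is missing, so the bars above may not add up to the total number of collisions. The sketch below, on a made-up series, shows the `dropna=False` and `normalize=True` options, which report the missing rows and the shares respectively.

```python
import pandas as pd

# Hypothetical toy column standing in for 'BOROUGH'; one value is missing
boroughs = pd.Series(["BROOKLYN", "QUEENS", "BROOKLYN", None, "BRONX", "BROOKLYN"])

by_borough = boroughs.value_counts()                # missing values are excluded
with_missing = boroughs.value_counts(dropna=False)  # missing values get their own row
shares = boroughs.value_counts(normalize=True)      # fractions of the non-missing rows

print(by_borough)
print(with_missing)
print(shares)
```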

### Task 4

Find out how many collisions had 0 persons injured, 1 person injured, and so on. Use the `value_counts()` approach. You may also find the `.plot(logy=True)` option useful when you create the plot, as it makes the y-axis logarithmic.
 

#### Solution

In [None]:
# "Chain" style of writing data manipulation operations
plot = (
    df['NUMBER OF PERSONS INJURED'] # take the number-of-injuries column
    .value_counts() # compute the frequency of each value
    .sort_index() # sort the results by index value instead of by frequency,
                  # which is the default for value_counts
    .plot( # and plot the results
        kind='line', # we use a line plot because the x-axis is numeric
        marker='o',  # we use a marker to show where we have data points
        logy=True # make the y-axis logarithmic
    )
)
plot.set_xlabel("Number of injuries")
plot.set_ylabel("Number of collisions")
plot.set_title("Analysis of number of injuries per collision")
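
The role of `.sort_index()` in the chain above is easier to see on a small example. The sketch below uses made-up injury counts and omits the plotting step: `value_counts()` orders by frequency, and `sort_index()` reorders the result by the number of injuries so that the x-axis runs 0, 1, 2, and so on.

```python
import pandas as pd

# Hypothetical toy column standing in for 'NUMBER OF PERSONS INJURED'
injuries = pd.Series([0, 0, 0, 1, 0, 2, 1, 0, 3, 0])

# Frequency of each value, ordered by the value itself rather than by frequency
freq = injuries.value_counts().sort_index()
print(freq)
```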

### Task 5

(a) Compute the average number of injuries and deaths per accident, broken down by borough. Use the `pivot_table` functionality, putting `BOROUGH` as the index. You can answer this query by generating two separate tables, or you can create a single table by using the fact that you can pass a list of attributes/columns to the `values` parameter of the pivot table.

(b) Repeat the exercise above, but break down the average number of deaths and injuries by the contributing factor for the accident. Use the `sort_values` command to sort the results, so that the contributing factors with the highest average number of deaths appear first.

#### Solution

In [9]:
pd.pivot_table(
    data = df,
    index = 'BOROUGH',
    aggfunc = 'mean',
    values = ['NUMBER OF PERSONS INJURED', 'NUMBER OF PERSONS KILLED']
)

| BOROUGH | NUMBER OF PERSONS INJURED | NUMBER OF PERSONS KILLED |
|---|---|---|
| BRONX | 0.2843 | 0.001045 |
| BROOKLYN | 0.289539 | 0.001135 |
| MANHATTAN | 0.172707 | 0.000828 |
| QUEENS | 0.260617 | 0.001221 |
| STATEN ISLAND | 0.24431 | 0.001356 |
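
To see what `pivot_table` computes when `values` is a list, it can help to compare it with a plain `groupby`. The sketch below uses a made-up four-row dataframe (the numbers are illustrative only) and shows that both approaches give the same per-borough averages.

```python
import pandas as pd

# Hypothetical toy data with the same column names as the collisions dataframe
toy = pd.DataFrame({
    "BOROUGH": ["BRONX", "BRONX", "QUEENS", "QUEENS"],
    "NUMBER OF PERSONS INJURED": [0, 2, 1, 1],
    "NUMBER OF PERSONS KILLED": [0, 0, 0, 1],
})
cols = ["NUMBER OF PERSONS INJURED", "NUMBER OF PERSONS KILLED"]

# One table with both averages, using a list for `values`
via_pivot = pd.pivot_table(data=toy, index="BOROUGH", values=cols, aggfunc="mean")

# The same computation written as a groupby
via_groupby = toy.groupby("BOROUGH")[cols].mean()

print(via_pivot)
print(via_groupby)
```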


In [10]:
pd.pivot_table(
    data = df,
    index = 'CONTRIBUTING FACTOR VEHICLE 1',
    aggfunc = 'mean',
    values = ['NUMBER OF PERSONS INJURED', 'NUMBER OF PERSONS KILLED']
).sort_values('NUMBER OF PERSONS KILLED', ascending=False)

| CONTRIBUTING FACTOR VEHICLE 1 | NUMBER OF PERSONS INJURED | NUMBER OF PERSONS KILLED |
|---|---|---|
| Illnes | 0.851485 | 0.037129 |
| Drugs (illegal) | 0.803783 | 0.014184 |
| Unsafe Speed | 0.612718 | 0.010269 |
| Pedestrian/Bicyclist/Other Pedestrian Error/Confusion | 0.746296 | 0.008727 |
| Passenger Distraction | 0.529045 | 0.007299 |
| Traffic Control Disregarded | 0.621926 | 0.007286 |
| Drugs (Illegal) | 0.482100 | 0.007160 |
| Tow Hitch Defective | 0.116883 | 0.006494 |
| Alcohol Involvement | 0.462344 | 0.004188 |
| Physical Disability | 0.614375 | 0.003282 |
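
Two things stand out in this output: `Drugs (illegal)` and `Drugs (Illegal)` appear as separate rows because the labels differ only in capitalization, and an average says nothing about how many collisions it is based on. The sketch below, on made-up data, shows one possible way to handle both: lowercase the labels before grouping, and report the group size next to the mean. The helper column name `FACTOR_NORMALIZED` is hypothetical.

```python
import pandas as pd

# Hypothetical toy data; the numbers are illustrative only
toy = pd.DataFrame({
    "CONTRIBUTING FACTOR VEHICLE 1": [
        "Drugs (illegal)", "Drugs (Illegal)", "Unsafe Speed", "Unsafe Speed",
    ],
    "NUMBER OF PERSONS KILLED": [1, 1, 0, 1],
})

# Lowercase the labels so that spelling variants fall into the same group
toy["FACTOR_NORMALIZED"] = toy["CONTRIBUTING FACTOR VEHICLE 1"].str.lower()

# Mean and number of rows per normalized factor, highest mean first
summary = (
    toy.groupby("FACTOR_NORMALIZED")["NUMBER OF PERSONS KILLED"]
    .agg(["mean", "count"])
    .sort_values("mean", ascending=False)
)
print(summary)
```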


### Task 6

Break down the accidents by borough and contributing factor. Use the `pivot_table` function of pandas.
 

#### Solution

In [None]:
pivot = pd.pivot_table(
    data = df, # we analyze the df (accidents) dataframe
    index = 'CONTRIBUTING FACTOR VEHICLE 1', 
    columns = 'BOROUGH', 
    values = 'UNIQUE KEY', 
    aggfunc = 'count'
)
pivot
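
When a contributing factor never occurs in a borough, the corresponding cell of the pivot table is `NaN`. The sketch below, on a made-up four-row dataframe, uses the `fill_value` parameter of `pivot_table` to show zeros instead; whether zeros or gaps are preferable depends on what you do with the table next.

```python
import pandas as pd

# Hypothetical toy data with the same column names as the collisions dataframe
toy = pd.DataFrame({
    "UNIQUE KEY": [1, 2, 3, 4],
    "BOROUGH": ["BRONX", "BRONX", "QUEENS", "QUEENS"],
    "CONTRIBUTING FACTOR VEHICLE 1": [
        "Unsafe Speed", "Unsafe Speed", "Unsafe Speed", "Alcohol Involvement",
    ],
})

toy_pivot = pd.pivot_table(
    data=toy,
    index="CONTRIBUTING FACTOR VEHICLE 1",
    columns="BOROUGH",
    values="UNIQUE KEY",
    aggfunc="count",
    fill_value=0,  # show 0 for factor/borough pairs that have no rows
)
print(toy_pivot)
```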

### Task 7

Find the dates with the most accidents. Can you figure out what happened on these days? 


#### Solution

In [None]:
df.DATE.value_counts()

### Task 8

Plot the number of accidents per day. (Hint: Ensure that your date column is in the right datatype and that it is properly sorted, before plotting)


#### Solution 

In [None]:
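# Convert DATE from strings to datetime values. Run this cell before the next one:
# resample() needs a DatetimeIndex
df['DATE'] = pd.to_datetime(df['DATE'], format="%m/%d/%Y")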

In [11]:
(
    df.DATE.value_counts() # count the number of accidents per day
    .sort_index() # sort the dates
    .resample('1M') # take periods of 1 month
    .sum() # sum the number of accidents per month
    .plot() # plot the result
)

TypeError: Only valid with DatetimeIndex, TimedeltaIndex or PeriodIndex, but got an instance of 'Index'
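
The `TypeError` above is what `resample` raises when the index holds plain strings rather than datetimes, which is what happens if the `pd.to_datetime` conversion cell has not been executed first. The sketch below runs the whole sequence on a made-up four-row dataframe, without the plotting step. It uses the `"MS"` (month start) frequency instead of the `'1M'` in the cell above, as an assumption about which alias is safest across pandas versions; the monthly totals are the same, only the date labels differ.

```python
import pandas as pd

# Hypothetical toy data in the same MM/DD/YYYY format as the DATE column
toy = pd.DataFrame({"DATE": ["01/01/2017", "01/01/2017", "01/15/2017", "02/03/2017"]})

# Step 1: convert the strings to datetime values
toy["DATE"] = pd.to_datetime(toy["DATE"], format="%m/%d/%Y")

# Step 2: count collisions per day; the result now has a DatetimeIndex
per_day = toy["DATE"].value_counts().sort_index()

# Step 3: aggregate the daily counts into monthly totals
per_month = per_day.resample("MS").sum()
print(per_month)
```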