# How can we control the increasing number of accidents in New York?

In [36]:
import json
import requests
from bs4 import BeautifulSoup
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

## Introduction

**Business Context.** The city of New York has seen a rise in the number of accidents on its roads and would like to know whether the number of accidents has increased in the last few weeks. The city has collected details for every reported accident and has maintained these records for the past year and a half (from January 2018 to August 2019).

The city has contracted you to build visualizations that help identify patterns in accidents, so that it can take preventive action to reduce the number of accidents in the future. The client cares about certain parameters, such as borough, time of day, and reason for the accident, and would like specific information on each of them.

**Business Problem.** Your task is to format the given data and provide visualizations that would answer the specific questions the client has, which are mentioned below.

**Analytical Context.** You are given a CSV file containing details about each accident like date, time, location of the accident, reason for the accident, types of vehicles involved, injury and death count, etc. The delimiter in the given CSV file is `;` instead of the default `,`. You will be performing the following tasks on the data:

1. Extract data from Wikipedia
2. Read, transform, and prepare data for visualization
3. Perform analytics and construct visualizations of the data to identify patterns in the dataset
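
As a minimal sketch of handling the `;` delimiter mentioned above, the snippet below reads a tiny in-memory sample with `pd.read_csv(..., sep=";")` and builds a combined `DATETIME` column. The sample rows, the reduced column set, and the `%m/%d/%Y %H:%M` date format are assumptions for illustration; check them against the real file before reusing this.

```python
import io

import pandas as pd

# Hypothetical two-row sample using the ';' delimiter described above
sample_csv = io.StringIO(
    "DATE;TIME;BOROUGH\n"
    "09/26/2018;12:12;BRONX\n"
    "09/25/2018;16:30;BROOKLYN\n"
)

# sep=';' overrides the default ',' delimiter
accidents = pd.read_csv(sample_csv, sep=";")

# Combine date and time into one timestamp column (the format is an assumption)
accidents["DATETIME"] = pd.to_datetime(
    accidents["DATE"] + " " + accidents["TIME"], format="%m/%d/%Y %H:%M"
)
print(accidents.dtypes)
```

With the real data, replace `sample_csv` with the path to the CSV file.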
        
The client has a specific set of questions they would like to get answers to. You will need to provide visualizations to accompany these:

1. How has the number of accidents fluctuated over the past year and a half? Has it increased over time?
2. For any particular day, during which hours are accidents most likely to occur?
3. Are there more accidents on weekdays than weekends?
4. What is the accident count-to-area ratio per borough? Which boroughs have disproportionately large numbers of accidents for their size?
5. For each borough, during which hours are accidents most likely to occur?
6. What are the top 5 causes of accidents in the city? 
7. What types of vehicles are most involved in accidents per borough?
8. What types of vehicles are most involved in deaths?

## Fetch borough data from Wikipedia

The client has requested analysis of the accidents-to-area ratio for boroughs. You will need to fetch the area of each borough from the Wikipedia page: https://en.wikipedia.org/wiki/Boroughs_of_New_York_City.

Since this resource lives on an external page, you should fetch the HTML document once and store the results locally in a JSON file, so that you can parse it later without requesting the page again. Create a folder named `data` and store the file inside it.

Insert **answer** below:

For later usage, let's store the borough data into a JSON file in the already created `data` folder:
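
One possible way to do this is sketched below. It parses a table with `BeautifulSoup` and writes the result with `json.dump`. To keep the example independent of the network, it uses a small made-up HTML string; in the notebook the HTML would come from something like `requests.get(url).text`. The table layout, the `parse_borough_table` helper, the `"area"` key, and the file name `borough_data.json` are all assumptions, and the real Wikipedia table has more columns and will need its own column indices.

```python
import json
import os

from bs4 import BeautifulSoup

# Hypothetical stand-in for the fetched page (not the real Wikipedia markup)
html = """
<table>
  <tr><th>Borough</th><th>Area (sq km)</th></tr>
  <tr><td>Borough A</td><td>10.5</td></tr>
  <tr><td>Borough B</td><td>20.0</td></tr>
</table>
"""


def parse_borough_table(html_text):
    """Return {lowercase borough name: {"area": float}} from the first table."""
    soup = BeautifulSoup(html_text, "html.parser")
    rows = soup.find("table").find_all("tr")
    boroughs = {}
    for row in rows[1:]:  # skip the header row
        cells = [cell.get_text(strip=True) for cell in row.find_all("td")]
        boroughs[cells[0].lower()] = {"area": float(cells[1])}
    return boroughs


borough_data = parse_borough_table(html)

# Store the parsed result so later cells do not depend on the network
os.makedirs("data", exist_ok=True)
out_path = os.path.join("data", "borough_data.json")
with open(out_path, "w") as f:
    json.dump(borough_data, f, indent=2)
```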

## Overview of the data

Now that we've stored the borough data in a JSON file, we can re-open it and use it whenever we wish. We can use the `read_json()` function in `pandas` to do that:

In [1]:
import pandas as pd  # needed here if the imports cell has not been run in this session

with open('data/data.json') as f:
    df = pd.read_json(f, orient='records')

Let's go through the columns present in the dataframe:

In [24]:
df.columns

Index(['BOROUGH', 'COLLISION_ID', 'CONTRIBUTING FACTOR VEHICLE 1',
       'CONTRIBUTING FACTOR VEHICLE 2', 'CONTRIBUTING FACTOR VEHICLE 3',
       'CONTRIBUTING FACTOR VEHICLE 4', 'CONTRIBUTING FACTOR VEHICLE 5',
       'DATE', 'DATETIME', 'LATITUDE', 'LONGITUDE',
       'NUMBER OF CYCLIST INJURED', 'NUMBER OF CYCLIST KILLED',
       'NUMBER OF MOTORIST INJURED', 'NUMBER OF MOTORIST KILLED',
       'NUMBER OF PEDESTRIANS INJURED', 'NUMBER OF PEDESTRIANS KILLED',
       'ON STREET NAME', 'TIME', 'TOTAL INJURED', 'TOTAL KILLED',
       'VEHICLE TYPE CODE 1', 'VEHICLE TYPE CODE 2', 'VEHICLE TYPE CODE 3',
       'VEHICLE TYPE CODE 4', 'VEHICLE TYPE CODE 5', 'ZIP CODE'],
      dtype='object')

We have the following columns:

1. **BOROUGH**: The borough in which the accident occurred
2. **COLLISION_ID**: A unique identifier for this collision
3. **CONTRIBUTING FACTOR VEHICLE (1, 2, 3, 4, 5)**: Reasons for the accident
4. **CROSS STREET NAME**: Nearest cross street to the place of accidents
5. **DATE**: Date of the accident
6. **TIME**: Time of accident
7. **DATETIME**: The column we previously created with the combination of date and time
8. **LATITUDE**: Latitude of the accident
9. **LONGITUDE**: Longitude of the accident
10. **NUMBER OF (CYCLIST, MOTORIST, PEDESTRIANS) INJURED**: Number of injuries by category
11. **NUMBER OF (CYCLIST, MOTORIST, PEDESTRIANS) KILLED**: Number of deaths by category
12. **ON STREET NAME**: Street where the accident occurred
13. **TOTAL INJURED**: Total number of injuries in the accident
14. **TOTAL KILLED**: Total number of deaths in the accident
15. **VEHICLE TYPE CODE (1, 2, 3, 4, 5)**: Types of vehicles involved in the accident
16. **ZIP CODE**: ZIP code of the accident location

Let's go ahead and answer each of the client's questions.

## Answering the client's questions

### Part 1: Accidents over time

Group the available data on a monthly basis and generate a line plot of accidents over time. Has the number of accidents increased over the past year and a half?

Insert **answer** below:
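
As a sketch of one possible approach, the snippet below groups by calendar month with `dt.to_period("M")` and draws a line plot. The `toy` dataframe is made up for illustration; in the notebook, the same steps would be applied to `df`, assuming its `DATETIME` column has a datetime dtype.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy data standing in for the accident dataframe
toy = pd.DataFrame({
    "DATETIME": pd.to_datetime([
        "2018-01-05 08:00", "2018-01-20 17:30",
        "2018-02-11 09:15",
        "2018-03-02 23:45", "2018-03-15 12:00", "2018-03-30 07:10",
    ])
})

# Count accidents per calendar month
monthly_counts = toy.groupby(toy["DATETIME"].dt.to_period("M")).size()

# Plot month labels (as strings) against the monthly counts
fig, ax = plt.subplots()
ax.plot(monthly_counts.index.astype(str), monthly_counts.values, marker="o")
ax.set_xlabel("Month")
ax.set_ylabel("Number of accidents")
ax.set_title("Accidents per month")
```

If the final month in the data is incomplete, its count will look artificially low, so it may be worth excluding it before judging the trend.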

### Part 2: Accident hotspots in a day

How does the number of accidents vary throughout a single day? Create a new column `HOUR` based on the data from the `DATETIME` column, then plot a bar graph of the distribution per hour throughout the day.

Insert **answer** below:
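
A minimal sketch of this step, assuming `DATETIME` is a datetime column: `dt.hour` extracts the hour, and `value_counts()` followed by `reindex(range(24))` keeps hours without accidents as zero-height bars. The `toy` dataframe is made up for illustration.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy data standing in for the accident dataframe
toy = pd.DataFrame({
    "DATETIME": pd.to_datetime([
        "2018-01-05 08:05", "2018-01-20 08:40",
        "2018-02-11 17:15", "2018-03-02 17:45", "2018-03-15 17:00",
        "2018-03-30 23:10",
    ])
})

# New column with the hour of day (0-23)
toy["HOUR"] = toy["DATETIME"].dt.hour

# Accidents per hour, including hours with no accidents
hourly_counts = toy["HOUR"].value_counts().reindex(range(24), fill_value=0)

fig, ax = plt.subplots()
ax.bar(hourly_counts.index, hourly_counts.values)
ax.set_xlabel("Hour of day")
ax.set_ylabel("Number of accidents")
```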

### Part 3: Accidents by weekday

How does the number of accidents vary throughout a single week? Plot a bar graph based on the accidents count by day of the week.

Insert **answer** below:
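
One way to sketch this is with `dt.day_name()` and an explicit day order, so that the bars run from Monday to Sunday rather than by frequency. The `toy` dataframe is made up for illustration.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy data standing in for the accident dataframe
toy = pd.DataFrame({
    "DATETIME": pd.to_datetime([
        "2018-01-01 08:00",  # Monday
        "2018-01-02 09:00",  # Tuesday
        "2018-01-08 10:00",  # Monday
        "2018-01-06 11:00",  # Saturday
        "2018-01-05 12:00",  # Friday
    ])
})

day_order = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

# Accidents per day of the week, in calendar order
weekday_counts = (
    toy["DATETIME"].dt.day_name().value_counts().reindex(day_order, fill_value=0)
)

fig, ax = plt.subplots()
ax.bar(weekday_counts.index, weekday_counts.values)
ax.set_ylabel("Number of accidents")
```

When comparing weekdays with weekends, note that there are five weekdays and only two weekend days, so per-day counts are a fairer comparison than the two group totals.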

### Part 4: Borough analysis

Plot a bar graph of the total number of accidents in each borough, as well as one of the accidents per square kilometer per borough. What can you conclude?

Insert **answer** below:

In [2]:
# Update keys in borough data


# Since there are differences in the text used in the data and Wikipedia data, let's update it


The keys in the dictionary now match the borough names in the dataframe, apart from letter case. The difference in case can be handled by making the mapping case-insensitive, either by converting the dictionary keys to uppercase or by converting the dataframe values to lowercase.

Let's do that and plot `accidents_per_sq_km`, which is the accidents-to-area ratio:
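
A minimal sketch of the uppercase-key variant is shown below. The borough names, the areas, and the `"area"` key of the dictionary are hypothetical placeholders, not the real Wikipedia figures.

```python
import pandas as pd

# Hypothetical borough dictionary with lowercase keys and areas in sq km
borough_data = {"borough a": {"area": 10.0}, "borough b": {"area": 50.0}}

# Hypothetical toy accident data with uppercase borough names
toy = pd.DataFrame({
    "BOROUGH": ["BOROUGH A", "BOROUGH A",
                "BOROUGH B", "BOROUGH B", "BOROUGH B", None]
})

# Convert the dictionary keys to uppercase so they match the dataframe values
areas = pd.Series({name.upper(): info["area"] for name, info in borough_data.items()})

# Accidents per borough (rows with a missing borough are dropped by default)
accident_counts = toy["BOROUGH"].value_counts()

# Division aligns on the borough labels shared by both Series
accidents_per_sq_km = (accident_counts / areas).sort_values(ascending=False)

ax = accidents_per_sq_km.plot(kind="bar")
ax.set_ylabel("Accidents per sq km")
```

In this toy example the second borough has more accidents in total, but the first has more accidents per square kilometer, which is the kind of difference the ratio is meant to expose.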

### Part 5: Borough hourly analysis

Which hours have the most accidents for each borough? Plot a bar graph for each borough showing the number of accidents for each hour of the day.

Insert **answer** below:
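
As one possible sketch, `pd.crosstab` gives an hour-by-borough table of counts, which can be drawn as one bar chart per borough with `subplots=True`. The `toy` dataframe is made up for illustration.

```python
import pandas as pd

# Hypothetical toy data standing in for the accident dataframe
toy = pd.DataFrame({
    "BOROUGH": ["BRONX", "BRONX", "BRONX", "QUEENS", "QUEENS", "QUEENS"],
    "DATETIME": pd.to_datetime([
        "2018-01-05 08:05", "2018-01-06 08:40", "2018-01-07 17:15",
        "2018-01-05 16:20", "2018-01-06 16:45", "2018-01-07 09:00",
    ]),
})
toy["HOUR"] = toy["DATETIME"].dt.hour

# Rows are hours, columns are boroughs, values are accident counts
hourly_by_borough = pd.crosstab(toy["HOUR"], toy["BOROUGH"])

# Hour with the most accidents in each borough
peak_hours = hourly_by_borough.idxmax()

# One bar chart per borough, sharing the hour axis
axes = hourly_by_borough.plot(kind="bar", subplots=True, sharex=True, legend=False)
```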

**Is the number of accidents higher at different times in different boroughs? Should we concentrate on different times for each borough?**

### Part 6: Cause of accidents

What factors cause the most accidents?

Insert **answer** below:
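
One way to approach this, sketched under the assumption that causes are spread across the five `CONTRIBUTING FACTOR VEHICLE` columns, is to melt those columns into a single column and count values. The factor names in `toy` are hypothetical placeholders.

```python
import pandas as pd

# Hypothetical toy data with two of the five contributing-factor columns
toy = pd.DataFrame({
    "CONTRIBUTING FACTOR VEHICLE 1": ["Factor A", "Factor B", "Factor A", "Factor C"],
    "CONTRIBUTING FACTOR VEHICLE 2": ["Factor B", None, "Factor A", None],
})

factor_cols = [c for c in toy.columns if c.startswith("CONTRIBUTING FACTOR VEHICLE")]

# Stack all factor columns into one column; value_counts() ignores missing values
factor_counts = toy[factor_cols].melt(value_name="FACTOR")["FACTOR"].value_counts()

# The five most frequent causes
top_factors = factor_counts.head(5)
print(top_factors)
```

This counts vehicle-level mentions, so an accident that lists the same factor for two vehicles contributes twice. If the data uses a placeholder value for unknown causes, it is worth filtering that out before taking the top 5.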

### Part 7: Boroughs and vehicle types

Which vehicle types are most involved in accidents per borough?

Insert **answer** below:
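
A possible sketch: melt the `VEHICLE TYPE CODE` columns into long form while keeping `BOROUGH`, then count per borough and vehicle type. The vehicle type labels in `toy` are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data with two of the five vehicle-type columns
toy = pd.DataFrame({
    "BOROUGH": ["BRONX", "BRONX", "QUEENS", "QUEENS"],
    "VEHICLE TYPE CODE 1": ["Sedan", "Taxi", "Sedan", "Bus"],
    "VEHICLE TYPE CODE 2": ["Sedan", None, "Bus", "Bus"],
})

vehicle_cols = [c for c in toy.columns if c.startswith("VEHICLE TYPE CODE")]

# One row per (accident, vehicle) pair; drop empty vehicle slots
long_form = toy.melt(
    id_vars="BOROUGH", value_vars=vehicle_cols, value_name="VEHICLE TYPE"
).dropna(subset=["VEHICLE TYPE"])

# Rows are boroughs, columns are vehicle types, values are counts
vehicle_counts = (
    long_form.groupby(["BOROUGH", "VEHICLE TYPE"]).size().unstack(fill_value=0)
)

# Most frequently involved vehicle type per borough
most_common = vehicle_counts.idxmax(axis=1)

ax = vehicle_counts.plot(kind="bar")
ax.set_ylabel("Number of vehicles involved")
```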

### Part 8: Death counts by vehicle type

Calculate the number of deaths by vehicle and plot a bar chart for the top 5 vehicles. Which vehicles are most often involved in deaths, and by how much more than the others?

Insert **answer** below:
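
As a sketch under the same melting approach, the snippet below sums `TOTAL KILLED` per vehicle type and keeps the top 5. The vehicle type labels and counts in `toy` are hypothetical.

```python
import pandas as pd

# Hypothetical toy data with two of the five vehicle-type columns
toy = pd.DataFrame({
    "TOTAL KILLED": [1, 0, 2, 0],
    "VEHICLE TYPE CODE 1": ["Sedan", "Taxi", "Bus", "Sedan"],
    "VEHICLE TYPE CODE 2": ["Bus", None, None, "Taxi"],
})

vehicle_cols = [c for c in toy.columns if c.startswith("VEHICLE TYPE CODE")]

# One row per (accident, vehicle) pair; drop empty vehicle slots
long_form = toy.melt(
    id_vars="TOTAL KILLED", value_vars=vehicle_cols, value_name="VEHICLE TYPE"
).dropna(subset=["VEHICLE TYPE"])

# Deaths in accidents involving each vehicle type, largest first
deaths_by_vehicle = (
    long_form.groupby("VEHICLE TYPE")["TOTAL KILLED"].sum().sort_values(ascending=False)
)
top_deaths = deaths_by_vehicle.head(5)

ax = top_deaths.plot(kind="bar")
ax.set_ylabel("Deaths in accidents involving this vehicle type")
```

Each death is attributed to every vehicle type involved in the accident, so the per-type totals can add up to more than the overall death count.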