# Final Project - Programming For Data Science
---

## Member Information
| Name              | ID       |
|-------------------|----------|
| Tran Nguyen Huan  | 21127050 |
| Nguyen Phat Dat   | 21127240 |

## Table of Contents:

1. [Acknowledgement](#acknowledgement)

2. [Introduction](#introduction)
    
    1. [About dataset](#about-dataset)


3. [Details](#details)

    1. [Collecting data](#1-collecting-data)

    2. [Exploring and Preprocessing data](#2-exploring-and-preprocessing-data)

    3. [Asking meaningful questions](#3-asking-meaningful-questions)

    4. [Reflection](#reflection)

    5. [References](#references)

## Overview
---

<center>
<h3>
    <b>
    US Accidents (2016 - 2023)
    </b>
</h3>
    <img style="padding:10px" src="https://hire.refactored.ai/upload-nct/portfolio_images/253/1626148168_GqMh6a2U.png" width="800"/>
</center>

"This is a countrywide car accident dataset that covers 49 states of the USA. The accident data were collected from February 2016 to March 2023, using multiple APIs that provide streaming traffic incident (or event) data. These APIs broadcast traffic data captured by various entities, including the US and state departments of transportation, law enforcement agencies, traffic cameras, and traffic sensors within the road networks. The dataset currently contains approximately **7.7 million** accident records."

The dataset can be accessed here: Sobhan Moosavi. (2023). <i>US Accidents (2016 - 2023)</i> [Data set]. Kaggle. https://doi.org/10.34740/KAGGLE/DS/199387


The primary goal of the project is to analyze traffic accidents that took place in the USA from February 2016 to March 2023 and to draw insights from them.

## Usage Policy and Legal Disclaimer
---
This dataset is distributed solely for research purposes under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International license (CC BY-NC-SA 4.0). By downloading the dataset, we agree to use it only for non-commercial, research, or academic applications.

## Code Environment
---

In [1]:
import sys
sys.executable

'd:\\Downloads\\anaconda\\python.exe'

## Import necessary libraries
---

In [11]:
import numpy as np
import pandas as pd
from pprint import pprint

## Dataset Import
---

First, we load the dataset with pandas.

In [None]:
df = pd.read_csv("Data/US_Accidents_March23.csv")
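
With roughly 7.7 million rows, a full load can be slow and memory-hungry. The following is a minimal sketch, not the loading step used in this project, of how the `usecols`, `dtype`, and `parse_dates` arguments of `pd.read_csv` could shrink the footprint. The in-memory `sample_csv` and the name `sample_df` are hypothetical stand-ins for `Data/US_Accidents_March23.csv`, and the choice of columns and dtypes is an assumption made for illustration.

```python
import io

import pandas as pd

# Hypothetical three-row stand-in for Data/US_Accidents_March23.csv
sample_csv = io.StringIO(
    "ID,Source,Severity,Start_Time,State,Country\n"
    "A-1,Source2,3,2016-02-08 05:46:00,OH,US\n"
    "A-2,Source2,2,2016-02-08 06:07:59,OH,US\n"
    "A-3,Source2,2,2016-02-08 06:49:27,OH,US\n"
)

# Read only the columns of interest and request compact dtypes up front
sample_df = pd.read_csv(
    sample_csv,
    usecols=["Severity", "Start_Time", "State"],
    dtype={"Severity": "int8", "State": "category"},
    parse_dates=["Start_Time"],
)

print(sample_df.dtypes)
print(sample_df.shape)
```

Skipping columns at read time means they never occupy memory, whereas dropping them afterwards requires the full frame to be loaded first.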

In [8]:
print("Size of our Dataset:", df.shape)

Size of our Dataset: (7728394, 46)


In [9]:
# Set the Pandas display options to show all columns
pd.set_option('display.max_columns', None)
df.head(3)

Unnamed: 0,ID,Source,Severity,Start_Time,End_Time,Start_Lat,Start_Lng,End_Lat,End_Lng,Distance(mi),Description,Street,City,County,State,Zipcode,Country,Timezone,Airport_Code,Weather_Timestamp,Temperature(F),Wind_Chill(F),Humidity(%),Pressure(in),Visibility(mi),Wind_Direction,Wind_Speed(mph),Precipitation(in),Weather_Condition,Amenity,Bump,Crossing,Give_Way,Junction,No_Exit,Railway,Roundabout,Station,Stop,Traffic_Calming,Traffic_Signal,Turning_Loop,Sunrise_Sunset,Civil_Twilight,Nautical_Twilight,Astronomical_Twilight
0,A-1,Source2,3,2016-02-08 05:46:00,2016-02-08 11:00:00,39.865147,-84.058723,,,0.01,Right lane blocked due to accident on I-70 Eas...,I-70 E,Dayton,Montgomery,OH,45424,US,US/Eastern,KFFO,2016-02-08 05:58:00,36.9,,91.0,29.68,10.0,Calm,,0.02,Light Rain,False,False,False,False,False,False,False,False,False,False,False,False,False,Night,Night,Night,Night
1,A-2,Source2,2,2016-02-08 06:07:59,2016-02-08 06:37:59,39.928059,-82.831184,,,0.01,Accident on Brice Rd at Tussing Rd. Expect del...,Brice Rd,Reynoldsburg,Franklin,OH,43068-3402,US,US/Eastern,KCMH,2016-02-08 05:51:00,37.9,,100.0,29.65,10.0,Calm,,0.0,Light Rain,False,False,False,False,False,False,False,False,False,False,False,False,False,Night,Night,Night,Day
2,A-3,Source2,2,2016-02-08 06:49:27,2016-02-08 07:19:27,39.063148,-84.032608,,,0.01,Accident on OH-32 State Route 32 Westbound at ...,State Route 32,Williamsburg,Clermont,OH,45176,US,US/Eastern,KI69,2016-02-08 06:56:00,36.0,33.3,100.0,29.67,10.0,SW,3.5,,Overcast,False,False,False,False,False,False,False,False,False,False,False,True,False,Night,Night,Day,Day


## Data Cleansing
---

This dataset contains a large amount of information. However, some fields are overly detailed or contribute little to our analysis. Before proceeding, we streamline the dataset by removing the following fields:

1. 'ID' and 'Source': These fields do not provide substantial information for our analysis.

2. 'End_Lat' and 'End_Lng': We already have the starting coordinates, making these fields redundant.

3. 'Airport_Code': Since all the data pertains to the USA, specifying the nearest airport code is unnecessary.

4. 'Country': As mentioned earlier, all the data is related to the USA, so this field does not add value.

5. 'Weather_Timestamp': We have other weather-related fields that are more relevant.

6. 'Civil_Twilight', 'Nautical_Twilight', and 'Astronomical_Twilight': These fields may not be directly relevant to our analysis.

7. 'Timezone': This information can be derived from other relevant fields.

By removing these fields, we aim to simplify the dataset, making it more focused and efficient for our analysis.

In [10]:
# Specify the names of the columns to be dropped
cols2drop = ['End_Lat', 'End_Lng', 'ID', 'Source', 'Airport_Code', 'Country', 'Weather_Timestamp', 'Civil_Twilight', 'Nautical_Twilight', 'Astronomical_Twilight', 'Timezone']

# Use the drop() method to remove the specified columns
df.drop(columns=cols2drop, inplace=True)
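
The claim that a field such as 'Country' adds no value can be checked before dropping it. Below is a small sketch on a hypothetical toy frame that lists the columns holding a single distinct value and then drops them. The names `toy`, `constant_cols`, and `toy_reduced` are illustrative and not part of the notebook.

```python
import pandas as pd

# Hypothetical toy frame standing in for the accidents data
toy = pd.DataFrame({
    "Severity": [3, 2, 2],
    "State": ["OH", "OH", "CA"],
    "Country": ["US", "US", "US"],
})

# A column with exactly one distinct value (counting NaN) carries no information
constant_cols = [c for c in toy.columns if toy[c].nunique(dropna=False) == 1]
print(constant_cols)

# drop() without inplace returns a new frame and leaves `toy` untouched
toy_reduced = toy.drop(columns=constant_cols)
print(toy_reduced.columns.tolist())
```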

**Quick overview of the Data:**

The reusable function below prints a detailed profile of the data: data type, missing values, unique values, and summary statistics for the numeric columns.

In [14]:
from pprint import pprint

def sanity_check(df):
    pprint('-' * 70)
    pprint('No. of Rows: {0[0]}        No. of Columns : {0[1]}'.format(df.shape))
    pprint('-' * 70)
    data_profile = pd.DataFrame(df.dtypes.reset_index()).rename(columns={'index': 'Attribute', 0: 'DataType'}).set_index('Attribute')

    data_profile = pd.concat([data_profile, df.isnull().sum()], axis=1).rename(columns={0: 'Missing Values'})
    data_profile = pd.concat([data_profile, (df.isnull().mean() * 100).round(2)], axis=1).rename(columns={0: 'Missing %'})
    data_profile = pd.concat([data_profile, df.nunique()], axis=1).rename(columns={0: 'Unique Values'})

    # Additional features for numeric columns
    numeric_cols = df.select_dtypes(include=['number']).columns
    numeric_stats = df[numeric_cols].describe().transpose()[['min', '25%', '50%', '75%', 'max']]
    data_profile = pd.concat([data_profile, numeric_stats], axis=1).rename(columns={'min': 'Min', '25%': 'Q1', '50%': 'Median', '75%': 'Q3', 'max': 'Max'})

    pprint(data_profile)
    pprint('-' * 70)

# Example usage:
sanity_check(df)


'----------------------------------------------------------------------'
'No. of Rows: 7728394        No. of Columns : 35'
'----------------------------------------------------------------------'
                  DataType  Missing Values  Missing %  Unique Values  \
Severity             int64               0       0.00              4   
Start_Time          object               0       0.00        6131796   
End_Time            object               0       0.00        6705355   
Start_Lat          float64               0       0.00        2428358   
Start_Lng          float64               0       0.00        2482533   
Distance(mi)       float64               0       0.00          22382   
Description         object               5       0.00        3761578   
Street              object           10869       0.14         336306   
City                object             253       0.00          13678   
County              object               0       0.00           1871   
State       

For the columns listed in the next cell, we drop every row that has a missing value in any of them. As the second sanity check further down shows, this reduces the dataset from 7,728,394 to 7,426,729 rows (about 3.9%).

In [15]:
df.dropna(subset=['Visibility(mi)', 'Wind_Direction', 'Description', 'Humidity(%)', 'Weather_Condition', 'Temperature(F)', 'Pressure(in)', 'Sunrise_Sunset', 'Street', 'Zipcode'], inplace=True)
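
To illustrate what `dropna(subset=...)` does, here is a minimal sketch on a hypothetical toy frame. A row is removed only when a value is missing in one of the listed columns; gaps in other columns (here `Precipitation(in)`) are kept. The values and the names `toy` and `cleaned` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with gaps in different columns
toy = pd.DataFrame({
    "Visibility(mi)": [10.0, np.nan, 10.0, 5.0],
    "Street": ["I-70 E", "Brice Rd", None, "State Route 32"],
    "Precipitation(in)": [0.02, 0.0, np.nan, np.nan],
})

rows_before = len(toy)

# Only the columns named in `subset` decide whether a row is dropped
cleaned = toy.dropna(subset=["Visibility(mi)", "Street"])
rows_dropped = rows_before - len(cleaned)

print(f"Dropped {rows_dropped} of {rows_before} rows")
print("Remaining gaps in Precipitation(in):", cleaned["Precipitation(in)"].isna().sum())
```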

For the columns **Precipitation(in)**, **Wind_Chill(F)**, and **Wind_Speed(mph)**, the share of missing values is high. Dropping the affected rows would cost us a lot of data (around 3 million records), so we impute the missing entries with the mean of each column instead.

In [16]:
columns = ['Precipitation(in)', 'Wind_Chill(F)', 'Wind_Speed(mph)']

for c in columns:
    # Assign the result back; calling fillna(inplace=True) on df[c] may act on a copy
    df[c] = df[c].fillna(df[c].mean())
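
A worked example of mean imputation on a hypothetical toy column is sketched below. The mean is computed over the non-missing values only and the filled result is assigned back to the column. The sketch also shows a known side effect: the column mean is preserved, but the spread shrinks because every imputed entry sits exactly at the mean. The values and the names `toy`, `col_mean`, `std_before`, and `std_after` are illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with two missing readings
toy = pd.DataFrame({"Wind_Speed(mph)": [3.5, np.nan, 4.5, np.nan, 7.0]})

# mean() skips NaN by default: (3.5 + 4.5 + 7.0) / 3
col_mean = toy["Wind_Speed(mph)"].mean()
std_before = toy["Wind_Speed(mph)"].std()

# Fill the gaps and assign the result back to the column
toy["Wind_Speed(mph)"] = toy["Wind_Speed(mph)"].fillna(col_mean)
std_after = toy["Wind_Speed(mph)"].std()

print("Imputed value:", col_mean)
print("Std before:", round(std_before, 3), "Std after:", round(std_after, 3))
```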

Let's run the sanity check on the modified data.

In [17]:
sanity_check(df)

'----------------------------------------------------------------------'
'No. of Rows: 7426729        No. of Columns : 35'
'----------------------------------------------------------------------'
                  DataType  Missing Values  Missing %  Unique Values  \
Severity             int64               0        0.0              4   
Start_Time          object               0        0.0        5926304   
End_Time            object               0        0.0        6470695   
Start_Lat          float64               0        0.0        2347656   
Start_Lng          float64               0        0.0        2396804   
Distance(mi)       float64               0        0.0          21834   
Description         object               0        0.0        3632562   
Street              object               0        0.0         327345   
City                object               0        0.0          12237   
County              object               0        0.0           1813   
State       

It's time to remove duplicate rows.

In [18]:
print("Number of rows:", len(df.index))
df.drop_duplicates(inplace=True)
print("Number of rows after dropping duplicates:", len(df.index))

Number of rows: 7426729
Number of rows after dropping duplicates: 7329850
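
The step above removed 96,879 rows (7,426,729 minus 7,329,850). One possible contributor, stated here as an assumption rather than a verified cause, is that 'ID' was dropped earlier: records that differed only by their identifier become exact duplicates. The sketch below shows this effect on a hypothetical toy frame and uses `duplicated()` to count duplicates before removing them.

```python
import pandas as pd

# Hypothetical toy frame: rows A-1 and A-2 differ only by their ID
toy = pd.DataFrame({
    "ID": ["A-1", "A-2", "A-3"],
    "Severity": [2, 2, 3],
    "Street": ["Brice Rd", "Brice Rd", "I-70 E"],
})

# With the ID column present, no row is an exact duplicate
dupes_with_id = int(toy.duplicated().sum())

# After dropping ID, the second row duplicates the first
without_id = toy.drop(columns=["ID"])
dupes_without_id = int(without_id.duplicated().sum())

# drop_duplicates keeps the first occurrence by default
deduped = without_id.drop_duplicates()

print(dupes_with_id, dupes_without_id, len(deduped))
```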


## Exploring Accidents: A Deep Dive into Data Insights (Raise questions that need answering)
---

### Question 1:

#### The purpose of the question

#### How to solve the question

#### Pre-processing

#### Exploratory Analysis and Visualization

#### Answer the question

### Question 2:

#### The purpose of the question

#### How to solve the question

#### Pre-processing

#### Exploratory Analysis and Visualization

#### Answer the question

### Question 3:

#### The purpose of the question

#### How to solve the question

#### Pre-processing

#### Exploratory Analysis and Visualization

#### Answer the question

### Question 4:

#### The purpose of the question

#### How to solve the question

#### Pre-processing

#### Exploratory Analysis and Visualization

#### Answer the question

## Reflection

## References