# Project Introduction

Through this dataset, I have identified the three following patterns:
1. IDEA 1 w/ LOCAL LINK
2. IDEA 2 w/ LOCAL LINK
3. IDEA 3 w/ LOCAL LINK

## Resources:

- __[Kaggle-Full](https://www.kaggle.com/datasets/sobhanmoosavi/us-accidents/data)__

- __[Kaggle-Sampled](https://drive.google.com/file/d/1U3u8QYzLjnEaSurtZfSAS_oh9AT2Mn8X/edit)__

- __['A Countrywide Traffic Accident Dataset'](https://arxiv.org/pdf/1906.05409)__

- __['Accident Risk Prediction based on Heterogenous Sparse Data: New Dataset & Insights](https://arxiv.org/pdf/1909.09638)__

# 1-Setup Environment

## Libraries

In [1]:
#Utilities
import warnings

# Data Basics
import pandas as pd
import numpy as np

#PySpark
from pyspark.sql import SparkSession
from pyspark.mllib.stat import Statistics
import pyspark.sql.functions as F

#Visuals
import matplotlib.pyplot as plt
import seaborn as sns


# Statistical Analysis
from scipy.stats import zscore

# Spatial Tools
import geopy.distance

# Text Tools
import re
from wordcloud import WordCloud

from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords, wordnet
from nltk.probability import FreqDist
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.util import ngrams

## Info on New Libraries
For improved analysis, these libraries were included, but not covered in the material:

-
-

## Settings

In [2]:
pd.set_option('display.max_columns', None)
pd.set_option('display.float_format', lambda x: '%.3f' % x)
pd.set_option('display.width', 500)
_LITE_SWITCH_ = False
_SPARK_ = False

## Custom Functions

In [3]:
# For High-level data exploration
def count_outliers(df_col,cap=3):
    zs = zscore(df_col)
    return df_col[zs > cap].shape[0]

In [4]:
# For Identify Unusual Weather Pattern(s)
def unusual_weather(observation):
    flag = False
    return flag

In [5]:
def dist(pt_0,pt_1):
    return geopy.distance.geodesic(pt_0,pt_1)

In [6]:
def same_road(eventA,eventB):
    indicator = False
    road_identifiers = ['Street','City','State','Zipcode']
    indicator = all(eventA[id]==eventB[id] for id in road_identifiers)
    return indicator

In [7]:
# For indicating whether two events are close in space and, optionally, time.
def proximity_indicator(eventA,eventB,space_band=15,time_band=0):
    flag = False
    distance = dist(eventA[['longitude','latitude']],eventB[['longitude','latitude']])
    if (time_band > 0):
        time_lapse = abs(eventA['time']-eventB['time'])
        flag = bool((space_band > distance) and (time_band > time_lapse))
    else:
        flag = bool(space_band > distance)
    return flag

## Load Dataset

In [8]:
if _SPARK_:
    spark = SparkSession.builder.appName("Accident Data Project").getOrCreate()
    data = spark.read.csv('US_Accidents_March23.csv',header=True,inferSchema=True)
else:
    if _LITE_SWITCH_:
        data = pd.read_csv('US_Accidents_March23_sampled_500k.csv')
    else:
        data = pd.read_csv('US_Accidents_March23.csv')
    

# 2-Initial EDA

## Schema & Feature Basics

In [9]:
if _SPARK_:
    data.printSchema()
else:
    data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7728394 entries, 0 to 7728393
Data columns (total 46 columns):
 #   Column                 Dtype  
---  ------                 -----  
 0   ID                     object 
 1   Source                 object 
 2   Severity               int64  
 3   Start_Time             object 
 4   End_Time               object 
 5   Start_Lat              float64
 6   Start_Lng              float64
 7   End_Lat                float64
 8   End_Lng                float64
 9   Distance(mi)           float64
 10  Description            object 
 11  Street                 object 
 12  City                   object 
 13  County                 object 
 14  State                  object 
 15  Zipcode                object 
 16  Country                object 
 17  Timezone               object 
 18  Airport_Code           object 
 19  Weather_Timestamp      object 
 20  Temperature(F)         float64
 21  Wind_Chill(F)          float64
 22  Humidity(%)       

In [10]:
if _SPARK_:
    data.show(5)
else:
    data.head(5)

In [11]:
if _SPARK_:
    data.describe().show()
else:
    pass
    

Very large dataset with over 7.7 million observations.  (I also chose to do some initial analysis ont he kaggle-provided sampled dataset which contains only 500k observations.)  So we will have to handle with care either by using Spark or through other tricks.


Notes on Features: Dimensionally much more managable with just 45 features.
- 'Severity' would appear to be the primary variable of interest; 'Description' might be interesting to experiment with.  
- 'Description' feature might be an interesting avenue to explore.  
- A number of location features
    * 'Start_Lat'/'Start_Lng' are easy enough to interpret but not sure what is meant by 'End_Lat' and 'End_Lng'
    * Similar question for 'Distance(mi)'
    * Street and City might not be useful features since they are not informative without further context.  
- Identifiers in ID and Source
- We have three time features ('Weather_Timestamp', 'Start_Time' and 'End_Time').
    * They need to be [converted to datetimes](#datetimes).  Right now they are just strings.
    * There may be an issue with inconsistent formatting.
    * It may be interesting to use these values to engineer some features like time of day, season, weekday/weekend, etc.
    * However, the information contained in 'Weather_Timestamp' is unclear.
- There are 9 (not including 'Weather_Timestamp') features on the weather.
    * Visability is probably the most relevant of these but 'Weather_Condition' might be a fine substitute/summary.
- Next are 10 binary features which seem to provide some information about the road infrastructure at the location of the accident.  


In [12]:
_TARGET_ = ['Severity']
_NUMERICS_ = ['Distance(mi)','Temperature(F)',
              'Wind_Chill(F)','Humidity(%)','Pressure(in)',
              'Visibility(mi)','Wind_Speed(mph)','Precipitation(in)']

## Missing Values

In [None]:
if _SPARK_:
    print(data.select(*[F.sum(F.isnull(F.col(c)).cast("int")).alias(c) for c in data.columns]).show())
else:
    print(data.isna().sum().sort_values(ascending=False))

There is a lot of missing data. 

- Most notably, the 'End_Lat'/'End_Lng'.  Will have to confirm but it would appear that observations missing one are also missing the other.  
- There aren't any missing values for 'Severity' as well as basic location data (gps + state/county).
- Only five observations are missing 'Description' which may prove convenient.


## Duplicates

There are no duplicates to deal with:

In [15]:
if _SPARK_:
    duplicates = int(data.count() - data.dropDuplicates().count())
    duplicates.show()
else:
    print(data.duplicated().sum())

0


## Outliers

In [16]:
if _SPARK_:
    pass
else:
    print(pd.DataFrame({c:{z:count_outliers(data[c],z) for z in [3,5,10,15,20]} for c in _NUMERICS_}).T)

                       3      5     10    15    20
Distance(mi)       109071  39983  9533  3722  1736
Temperature(F)          0      0     0     0     0
Wind_Chill(F)           0      0     0     0     0
Humidity(%)             0      0     0     0     0
Pressure(in)            0      0     0     0     0
Visibility(mi)          0      0     0     0     0
Wind_Speed(mph)         0      0     0     0     0
Precipitation(in)       0      0     0     0     0


Dealing with 'Distance' outliers requires additional research about the feature itself.

## Severity

In [17]:
data['Severity'].value_counts()

Severity
2    6156981
3    1299337
4     204710
1      67366
Name: count, dtype: int64

## High-Level Features

In [18]:
data['Country'].value_counts()

Country
US    7728394
Name: count, dtype: int64

We can cut this 'Country' feature entirely from the data going forward.

# 3-Data Processing

In [None]:
data_clean = pd.DataFrame()

In [None]:
data_clean[_NUMERICS_] = data[_NUMERICS_]

<a id='datetimes'> Converting dates </a>

In [None]:
if _SPARK_:
    pass
else:
    data_clean['Start'] = pd.to_datetime(data['Start_Time'],format='mixed')
    data_clean['End'] = pd.to_datetime(data['End_Time'],format='mixed')

## Engineer Features

In [None]:
timedelta_hrs = round((data_clean['End'] - data_clean['Start']).dt.seconds / 360)

In [None]:
sns.histplot(timedelta_hrs);

### Partition the time data

In [None]:
if _SPARK_:
    pass
else:
    data_clean['Month'] = data_clean['Start'].dt.month
    data_clean['Year'] = data_clean['Start'].dt.year
    data_clean['Day'] = data_clean['Start'].dt.day
    data_clean['DayofWeek'] = data_clean['Start'].dt.day_of_week
    data_clean['Quarter'] = data_clean['Start'].dt.quarter
    data_clean['Hour'] = data_clean['Start'].dt.hour

In [None]:
data_clean['Year'].value_counts(normalize=True).sort_index()

In [None]:
data_clean['Month'].value_counts(normalize=True).sort_index()

In [None]:
data_clean['Quarter'].value_counts(normalize=True)

In [None]:
data_clean['DayofWeek'].value_counts(normalize=True).sort_index()

In [None]:
data_clean['Hour'].value_counts(normalize=True).sort_index()

Spatial-Based Data

# 4-Full EDA

## Target Variable

In [None]:
if _LITE_SWITCH_:
    print(data_clean[_NUMERICS_+_TARGET_].describe())
else:
    df.select(num_cols).describe().show()

In [None]:
if _LITE_SWITCH_:
    sns.histplot(data_clean['Severity'])
else:
    pass

## Feature Variables

### Relationships

In [None]:
if _LITE_SWITCH_:
    corrs = data_clean[_NUMERICS_].corr()
    print(corrs)
else:
    corr_data = df.select(_NUMERICS_)
    col_names = corr_data.columns
    features = corr_data.rdd.map(lambda row: row[0:]) 
    corrs = Statistics.corr(features, method="pearson")

In [None]:
sns.heatmap(corrs,vmin=-1,vmax=1,cmap='coolwarm')

### Against Target Variable

# 5-Statistical Analysis

## Basic

## Advanced

# 6-Insights & Conclusions

## Annual Distinctions

## Temporal & Spatial Considerations

How do accident counts relate to different times of the day and for different region types (urban vs rural)?

## Severity Prediction

## Unusual Weather 

Does driving in unexpected weather--based on area, time of year and/or both--create a higher likelihood of an accident.

## Famous Highways

## Highway's Near Urban Areas

For cross-state

## Naturual Language

Examination of the description feature.

## Safety Infrastructure
Does certain road infrastructure projects help reduce the number of incidents?

## New Traffic Pattern

Does the existence of a new traffic pattern in the area increase the likelihood of an accident?

## Recent Accident Indicator

Does the presence of one accident, predict another.