**Capstone Project 1 - Predicting Automobile Accidents in Montgomery County**

Can accident frequency be predicted for automobiles based on particular factors?  In this capstone project, data from the traffic stop database of Montgomery County, Maryland, is used to examine variables that could help predict an increased likelihood of accidents.  

-Can accidents be predicted based on the month?
-Are automobiles of certain colors in more accidents because of their color, or because those colors are more popular and so more of them are on the road?
-Can a recommendation be made that certain colors of automobiles are safer?
-What caused the year 2017 to have more accidents?
-Did alcohol affect the number of accidents with any significance?


Future questions - Can accidents be predicted based on the day of the week (weekday, weekend)?
Can it be concluded that a car of a certain color is more likely than cars of other colors to be in an accident on a certain day of the week?


**Import Packages and Read Data**

In [1]:

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

/kaggle/input/capstone-1/Traffic_Violations.csv
/kaggle/input/traffic-violations-2020/Traffic_Violations_2020.csv


In [None]:
%%time
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt
from scipy import stats
from scipy.stats import norm
from scipy.stats import t
from numpy.random import seed

from datetime import datetime
from matplotlib.dates import WeekdayLocator
from matplotlib.dates import MO, TU, WE, TH, FR, SA, SU #prep for investigating dates

data = pd.read_csv("../input/traffic-violations-2020/Traffic_Violations_2020.csv", parse_dates = ['Date Of Stop', 'Time Of Stop'])
#This data comes from the dataset of traffic violations issued in Montgomery County, Maryland.

**Clean and Merge Data**
This data set was fairly clean, but this step is still crucial: some columns have missing values to be aware of.  

Question:  Can missing data be identified and can we see where there may be missing values?  **Identify included data**

In [None]:
data.info()
#This was done to identify categories with missing information.

Answer:  The data set has 1,632,871 rows in total.  Any column with a smaller non-null count is missing values.  
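
A per-column count of nulls can be easier to read than scanning the `data.info()` output. The sketch below uses a small made-up frame (`toy` and its values are hypothetical, not taken from the real data) to show one way to list only the columns that have gaps:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the traffic data.
toy = pd.DataFrame({
    "Color": ["BLACK", np.nan, "RED", "BLACK"],
    "Accident": ["No", "Yes", "No", "No"],
    "Geolocation": [np.nan, np.nan, "(39.0, -77.1)", np.nan],
})

# Count missing values per column and express them as a share of all rows.
missing_count = toy.isna().sum()
missing_pct = (missing_count / len(toy) * 100).round(1)

# Keep only the columns that actually have gaps, largest first.
missing_summary = (
    pd.DataFrame({"missing": missing_count, "percent": missing_pct})
    .query("missing > 0")
    .sort_values("missing", ascending=False)
)
print(missing_summary)
```

On the real data, the same three steps would be applied to `data` instead of `toy`.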

Question:  What vehicle types are listed in the data set?  What would be a good representation for a general population to sample?

In [None]:
data.VehicleType.value_counts()
##Step to identify all types of vehicles listed in the 'VehicleType' column
##From here it was determined to use only the '02 - Automobile' category

Answer:  There are over one million automobiles in this data set.  Automobiles are the primary focus for most consumers and for the clients identified for this study.  

Question:  How many of each color automobile exists in the data set?  **Filter only Automobiles**

In [None]:
only_automobiles_color= data.loc[(data['VehicleType'] == '02 - Automobile') , ['Color', 'VehicleType']]
#Create a new dataframe of only automobiles and their colors.
only_automobiles_color.Color.value_counts()
#Count total number of automobiles in each color.

In [None]:
only_automobiles_color_nonull= only_automobiles_color.dropna(axis = 0, subset = ['Color'])
#drop nulls from only autos,color that are nan.
only_automobiles_color_nonull.Color.value_counts()
#Final count on number of automobiles of each color.

In [None]:
only_automobiles_color_nonull.Color.unique()
##Identify all colors of autos that exist in the data that have a value

Question:  Are the color percentages for all vehicles similar enough to the color percentages for automobiles alone that the automobiles can be used as the sample dataset for this study?

In [None]:
(data['Color'].value_counts()/data['Color'].count())*100
##Find the percentage of each color out of all vehicles (all 'VehicleType' values).

In [None]:
(only_automobiles_color['Color'].value_counts()/only_automobiles_color['Color'].count())*100
##Find the percentages of each color of automobile out of the total automobiles.

Answer:  The color percentages for all vehicles and for automobiles only are similar enough that the automobiles alone can serve as the sample dataset for this study.
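
To put a number on "similar enough", the two color distributions can be aligned side by side and compared in percentage points. This is a minimal sketch on made-up rows (`toy`, its values, and the truck label are hypothetical):

```python
import pandas as pd

# Hypothetical toy data: all vehicles, with a vehicle-type column.
toy = pd.DataFrame({
    "VehicleType": ["02 - Automobile"] * 6 + ["05 - Light Duty Truck"] * 2,
    "Color": ["BLACK", "BLACK", "WHITE", "WHITE", "RED", "RED", "WHITE", "BLACK"],
})

# Share of each color among all vehicles and among automobiles only.
all_pct = toy["Color"].value_counts(normalize=True) * 100
auto_pct = (
    toy.loc[toy["VehicleType"] == "02 - Automobile", "Color"]
    .value_counts(normalize=True) * 100
)

# Align the two distributions on color and take the gap in percentage points.
comparison = pd.DataFrame({"all_vehicles": all_pct, "automobiles": auto_pct}).fillna(0)
comparison["abs_diff"] = (comparison["all_vehicles"] - comparison["automobiles"]).abs()
print(comparison.sort_values("abs_diff", ascending=False))
```

A small maximum in `abs_diff` would support using the automobile subset as the sample; what counts as "small" is a judgment call for the study.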

Question:  How does the number of automobiles in accidents compare with the number not in accidents?  

In [None]:
final_color_accident = data.loc[(data['VehicleType'] == '02 - Automobile') & (data['Accident'] == 'Yes'), ['Color', 'VehicleType', 'Accident']]
##final set of colors, automobiles, that were in an accident
final_color_accident_no_null = final_color_accident.dropna(axis = 0, subset = ['Color'])
#drop any rows with a nan value for color
final_color_accident_no_null.Color.value_counts()
##Final count on number of automobiles of each color in accidents.

In [None]:
final_color_no_accident = data.loc[(data['VehicleType'] == '02 - Automobile') & (data['Accident'] == 'No'), ['Color', 'VehicleType', 'Accident']]
#final set of colors, automobiles, that were NOT in an accident
final_color_no_accident_no_null = final_color_no_accident.dropna(axis = 0, subset = ['Color'])
#drop any rows with a nan value for color
final_color_no_accident_no_null.Color.value_counts()
##Final count on number of automobiles of each color not in accidents.

Answer:  **Looking now at only Automobiles in accidents and not in accidents.**
There is quite a difference between the number of automobiles in accidents and the number not in accidents.  It is also worth noting how many more vehicles of certain colors exist in this sample, and considering whether this affects which colors show higher accident counts.

Question:  Does the percentage of those more popular colors show a higher likelihood to be in an accident?

The chosen dataset contains only the automobiles from the database of vehicles that were stopped in Montgomery County.  It is not surprising that the most popular colors are black, silver, white, gray, red, and blue.  That popularity has to be taken into account before concluding, for example, that black cars get in more accidents than brown ones.  Therefore, the accidents for each color need to be expressed as a percentage of the total automobiles of that color.
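
The per-color rate described here can also be computed in a single grouped step. The sketch below is one way to do it, on made-up rows (`toy` and its values are hypothetical):

```python
import pandas as pd

# Hypothetical toy rows; the real notebook would use the automobile subset of `data`.
toy = pd.DataFrame({
    "Color": ["BLACK", "BLACK", "BLACK", "BLACK", "BROWN", "BROWN"],
    "Accident": ["Yes", "No", "No", "No", "Yes", "No"],
})

# The mean of a boolean flag within each group is the share of "Yes" rows.
accident_rate_pct = (
    toy["Accident"].eq("Yes")
    .groupby(toy["Color"])
    .mean()
    .mul(100)
    .sort_values(ascending=False)
)
print(accident_rate_pct)
```

In this toy example both colors have one accident each, but BROWN has the higher rate because fewer brown cars appear in the sample.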

In [None]:
final_color_accident_no_null['Color'].value_counts().divide(only_automobiles_color_nonull['Color'].value_counts())*100

In [None]:
autos_in_accidents = (final_color_accident_no_null['Color'].value_counts()/only_automobiles_color_nonull['Color'].value_counts())*100
autos_in_accidents
#Percent of autos of each color in an accident out of total automobiles of each color.  

In [None]:
autos_not_in_accidents = (final_color_no_accident_no_null['Color'].value_counts()/only_automobiles_color_nonull['Color'].value_counts())*100
autos_not_in_accidents
#Percent of autos of each color NOT in an accident out of total automobiles of each color.

Answer:  According to the results, COPPER is the least likely color to have been in an accident over the span of this sample, and MULTICOLOR is the most likely.  This raises questions:  What colors are in the category of MULTICOLOR?  Why might COPPER be such an outlier?

**Visualize Data**

Question:  When looking at automobile stops broken down by color, are more vehicles in accidents than not in accidents across the board, or vice versa?

In [None]:
bar_labels = ['BLACK', 'WHITE', 'SILVER', 'GRAY', 'RED', 'GREEN', 'BLUE',
        'GREEN, DK', 'BEIGE', 'YELLOW', 'MAROON', 'BROWN', 'BLUE, LIGHT',
        'TAN', 'GOLD', 'BLUE, DARK', 'BRONZE', 'ORANGE', 'GREEN, LGT',
        'COPPER', 'PURPLE', 'CREAM', 'MULTICOLOR', 'PINK', 'CHROME',
        'CAMOUFLAGE']
#identifying all colors that are to be represented

In [None]:
plt3 = autos_not_in_accidents.plot.barh(label ='Autos Not In Accidents', color='green')
plt3 = autos_in_accidents.plot.barh(label ='Autos In Accidents', color='blue')

plt3.set_xlabel('Percent')
plt3.set_ylabel('Color')
plt3.set_title('Automobile Accidents')
plt3.legend(loc='best')
#Visualizing the difference between Autos involved in accidents vs. those not.

In [None]:
plt1 = autos_in_accidents.plot.barh(color='blue')
plt1.set_xlabel('Percent')
plt1.set_ylabel('Color')
plt1.set_title('Automobiles in Accidents')
#visualizing more accurately the percentage of Automobiles in Accidents.  

In [None]:
plt2 = autos_not_in_accidents.plot.barh(color='green')
plt2.set_xlabel('Percent')
plt2.set_ylabel('Color')
plt2.set_title('Automobiles Not in Accidents')
#visualizing more accurately the percentage of Automobiles Not in Accidents.  

Question:  **Break down the data by the dates of the stop.**
This is where we begin to see whether factors other than color can help predict an automobile's chance of being in an accident.  Are certain months more prone to accidents?  Did certain years have more accidents?  Can a cause for such a spike be identified, and is it possible it will occur again?  Conversely, if certain years had fewer accidents, can the cause be identified so that accidents can perhaps be reduced annually?

In [None]:
data.groupby(by=data['Date Of Stop'].dt.date).count()
#Group data by day of the stop

In [None]:
auto_accidents_by_date = data[(data['VehicleType'] == '02 - Automobile')& (data['Accident'] == 'Yes')].groupby(by=data['Date Of Stop']).size()
#Count the automobile accident stops for each day.

In [None]:
data['year'] = data['Date Of Stop'].dt.year
#Creating new column, year for stops
data['month'] = data['Date Of Stop'].dt.month
#Creating new column, month for stops

In [None]:
data[data['year']== 2012]['month'].value_counts()
#View the monthly stop counts for a single year, if desired

**Now looking at periods of the year that may have more frequent accidents overall.  To begin with, I created visuals to look at years as a whole, then by month.**

In [None]:
df_accident = data[(data['Accident'] == 'Yes')]

In [None]:
#Dataframe of accident stops only (df_accident), counted per month for every year.

pd.set_option('display.max_rows',100)
year_month_count = df_accident['Date Of Stop'].groupby([df_accident['Date Of Stop'].rename('Year').dt.year, df_accident['Date Of Stop'].rename('Month').dt.month]).agg(['count'])
year_month_count

In [None]:
year_month_count.columns

In [None]:
g = sns.catplot(x="month",hue = 'month',
              col="year",
             data=data, kind="count",
           height=6, aspect=.7);
#View all years and month sums in one visual.

* https://seaborn.pydata.org/generated/seaborn.countplot.html
* https://seaborn.pydata.org/generated/seaborn.barplot.html

**Curiosity about particular months in each year that appear to have the highest numbers.  What happened during this time to cause this?  Possible factors: number of cars on the road, gas prices, weather.**

In [None]:
monthly_sum = auto_accidents_by_date.resample('M').sum()
yearly_sum = auto_accidents_by_date.resample('Y').sum()
#Create a sum variable to use for line graph

In [None]:
yearly_sum

In [None]:
plt4 = monthly_sum.plot(figsize=(30,10), label = 'Monthly Sum', color = 'red')
plt4.set_xlabel('Date Of Accident')
plt4.set_ylabel('Number of Automobile Accidents')
plt4.set_title('Automobile Accidents')

plt4 = yearly_sum.plot(figsize=(30,10), label = 'Yearly Sum', color = 'blue')
plt4.set_xlabel('Date Of Accident')
plt4.set_ylabel('Number of Automobile Accidents')
plt4.set_title('Automobile Accidents')

plt4 = auto_accidents_by_date.plot(figsize=(30,10), label = 'Daily Sum',  color = 'black')
plt4.set_xlabel('Date Of Accident')
plt4.set_ylabel('Number of Automobile Accidents')
plt4.set_title('Automobile Accidents')
plt4.legend(loc='best')
#Line graphs to see a visual for any spikes or drops.

**Interesting that the yearly sum appears to spike in 2017.  Curiosity as to what may have caused this.  Need to investigate the weather patterns potentially affecting it.**

In [None]:
year_month_count.unstack(level=0)
#Easy viewing comparison on total accidents each year.  Consider reuploading the csv file with dates through current, but only want to include through 2019, December.
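
The same month-by-year layout can also be built directly from the stop dates with `pd.crosstab`. This is a minimal sketch on made-up dates (`toy` and its values are hypothetical), not a replacement for the table above:

```python
import pandas as pd

# Hypothetical accident dates (one row per accident stop).
dates = pd.to_datetime([
    "2016-01-05", "2016-01-20", "2016-02-11",
    "2017-01-07", "2017-02-02", "2017-02-28", "2017-02-14",
])
toy = pd.DataFrame({"Date Of Stop": dates})

# Rows are months, columns are years, cells are accident counts.
month_by_year = pd.crosstab(
    toy["Date Of Stop"].dt.month.rename("Month"),
    toy["Date Of Stop"].dt.year.rename("Year"),
)
print(month_by_year)
```

Reading across a row compares the same month in different years; reading down a column shows the seasonal pattern within one year.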

Answer:  There is a spike in 2017 and a low in 2012.  Several months in 2012 are low compared to all months.  September, October, and December of 2016 are significantly higher than most months recorded.  Time to consider what causes this.  Are there fewer vehicles on the road due to weather, the economy, politics, etc.?  The next step is to identify whether these are events that could occur again and be predicted, potentially leading to a reduction in annual and monthly accidents.  


> Question:  What was the distribution for the number of automobile stops that were considered accidents each day over the entire span?

In [None]:

auto_accidents_by_date.describe()

Answer:  The output above shows that the maximum number of accidents in one day over the years was 56, and the minimum was 1.  The mean per day over all years was about 12 accidents.

Question:  What was the distribution for the number of automobile stops that were considered accidents each month over the entire span?

In [None]:
year_month_count.describe()

Answer:  The output above shows that the maximum number of accidents in one month over the years was 22,601, and the minimum was 7,372.  The mean per month over all years was about 16,932 accidents.

**Find Correlations**
Investigating relationships between the variables to understand whether and how they influence accidents.  There are interesting distributions related to the automobile accidents.  This section tests the two hypotheses that were in mind during this study.  

**Null Hypothesis:**
There is no statistical significance in the likelihood of an Automobile getting into an accident related to color.

**Alternative Hypothesis:**
Certain colors of Automobiles show a higher likelihood for getting in an accident.

In [None]:
contingency_table = pd.crosstab(data['Color'], data['Accident'])
print('contingency_table :-\n', contingency_table)

In [None]:
Observed_Values = contingency_table.values
print("Observed Values :-\n",Observed_Values)


In [None]:
b=stats.chi2_contingency(contingency_table)
Expected_Values = b[3]
print("Expected Values :-\n",Expected_Values)

In [None]:
no_of_rows, no_of_columns = contingency_table.shape
#Degrees of freedom use the full table: (rows - 1) * (columns - 1)
ddof=(no_of_rows-1)*(no_of_columns-1)
print("Degree of Freedom:-",ddof)
alpha = 0.05

In [None]:
from scipy.stats import chi2
chi_square=sum([(o-e)**2./e for o,e in zip(Observed_Values,Expected_Values)])
chi_square_statistic=chi_square[0]+chi_square[1]
print("chi-square statistic:-",chi_square_statistic)

In [None]:
critical_value=chi2.ppf(q=1-alpha,df=ddof)
print('critical_value:',critical_value)

In [None]:
p_value=1-chi2.cdf(x=chi_square_statistic,df=ddof)
print('p-value:',p_value)

In [None]:
print('Significance level: ',alpha)
print('Degree of Freedom: ',ddof)
print('chi-square statistic:',chi_square_statistic)
print('critical_value:',critical_value)
print('p-value:',p_value)

In [None]:
if chi_square_statistic>=critical_value:
    print("Reject H0, there is a relationship between the 2 categorical variables")
else:
    print("Retain H0, there is no relationship between the 2 categorical variables")

if p_value<=alpha:
    print("Reject H0, there is a relationship between the 2 categorical variables")
else:
    print("Retain H0, there is no relationship between the 2 categorical variables")

Based on the testing, the Null Hypothesis is rejected.  Therefore, the Alternative Hypothesis is accepted: certain colors of Automobiles show a higher likelihood of getting in an accident.
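
As a cross-check on the manual calculation, `scipy.stats.chi2_contingency` returns the statistic, p-value, degrees of freedom, and expected counts in one call. The sketch below uses made-up counts (the `observed` table is hypothetical), so only the mechanics carry over to the real contingency table:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical counts: rows are colors, columns are Accident = No / Yes.
observed = np.array([
    [900, 30],
    [850, 25],
    [400, 20],
])

# One call gives the statistic, p-value, degrees of freedom, and expected counts.
chi2_stat, p_value, dof, expected = chi2_contingency(observed)

# Degrees of freedom are (rows - 1) * (columns - 1) over the whole table.
rows, cols = observed.shape
print(dof == (rows - 1) * (cols - 1))
print(chi2_stat, p_value)
```

With many colors in the table the degrees of freedom are well above 1, which changes the critical value the statistic is compared against.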


**Null Hypothesis:**
There is no statistical significance in the likelihood of an Automobile getting into an accident related to the month.

**Alternative Hypothesis:**
Certain months of the year show a greater or reduced likelihood for an Automobile to get into an accident. 

In [None]:
hyp_test = year_month_count.reset_index()

In [None]:
hyp_test[['Month', 'count']].describe()
#Note: ttest_rel is a paired test, so this pairs each month number (1-12) with that month's accident count.
ttest,pval = stats.ttest_rel(hyp_test['Month'], hyp_test['count'])
print(pval)

if pval<0.05:
    print("reject null hypothesis")
else:
    print("fail to reject null hypothesis")

Based on this test, the null hypothesis is rejected.  Therefore, certain months of the year do show a greater or reduced likelihood for an Automobile to get in an accident.  
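
The paired t-test above compares the month number with the count rather than comparing counts between months. One possible alternative, which is an assumption here and not the method used in this study, is to treat each calendar month as a group with one count per year and run a Kruskal-Wallis test. The sketch uses randomly generated counts (`toy` is hypothetical), so its result says nothing about the real data:

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical monthly accident counts for 3 years (not the real data).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Year": np.repeat([2015, 2016, 2017], 12),
    "Month": np.tile(np.arange(1, 13), 3),
})
toy["count"] = rng.poisson(lam=350, size=len(toy))

# One group per calendar month, each holding that month's count from every year.
groups = [g["count"].to_numpy() for _, g in toy.groupby("Month")]

# Kruskal-Wallis H-test: do the monthly groups share the same distribution?
h_stat, p_val = stats.kruskal(*groups)
print(h_stat, p_val)
if p_val < 0.05:
    print("reject null hypothesis")
else:
    print("fail to reject null hypothesis")
```

On the real data, `hyp_test` already has the `Month` and `count` columns this grouping needs.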

**Below are studies that apply classification techniques common in machine learning.  It is interesting to see what a decision tree will show, in addition to a binary logistic regression and heat map.**

In [None]:
categorical_values= data.loc[(data['VehicleType'] == '02 - Automobile') , ['Color', 'Alcohol', 'Race', 'Gender', 'Accident']]

https://www.datacamp.com/community/tutorials/understanding-logistic-regression-python

In [None]:
#https://www.datacamp.com/community/tutorials/decision-tree-classification-python
from sklearn import datasets
from sklearn import svm
from sklearn.tree import DecisionTreeClassifier # Import Decision Tree Classifier
from sklearn.model_selection import train_test_split # Import train_test_split function
from sklearn import metrics #Import scikit-learn metrics module for accuracy calculation


In [None]:
#split dataset in features and target variable
feature_cols = ['Color', 'Alcohol', 'Race', 'Gender']
X = categorical_values[feature_cols]
y = categorical_values.Accident

In [None]:
categorical_values= data.loc[(data['VehicleType'] == '02 - Automobile') , ['Color', 'Alcohol', 'Race', 'Gender', 'Accident']]
categorical_values['Accident'].unique()

In [None]:
#https://www.datacamp.com/community/tutorials/understanding-logistic-regression-python


#Each key is the old value (string) and each value is the new value (integer).
Accident = {'Yes': 1, 'No': 0}
#Assign these key-value pairs from the dictionary above to the table
categorical_values.Accident = [Accident[item] for item in categorical_values.Accident]


##https://www.tutorialspoint.com/replacing-strings-with-numbers-in-python-for-data-analysis

In [None]:
categorical_values = categorical_values.dropna()
categorical_values.count()
categorical_values['Accident'].unique()

In [None]:
#https://www.geeksforgeeks.org/ml-label-encoding-of-datasets-in-python/
# Import label encoder
from sklearn import preprocessing

In [None]:
# label_encoder object knows how to understand word labels. 
label_encoder = preprocessing.LabelEncoder() 

# categorical_values['Color'].unique()
# categorical_values['Color'].isnull().sum()
# Encode labels in column 
categorical_values['Color']= label_encoder.fit_transform(categorical_values['Color']) 
categorical_values['Alcohol']= label_encoder.fit_transform(categorical_values['Alcohol']) 
categorical_values['Race']= label_encoder.fit_transform(categorical_values['Race']) 
categorical_values['Gender']= label_encoder.fit_transform(categorical_values['Gender']) 
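
`LabelEncoder` assigns each category an arbitrary integer, which a logistic regression will read as an ordered scale. One common alternative, sketched here on a made-up frame (`toy` and its values are hypothetical), is one-hot encoding with `pd.get_dummies`. This is an illustration, not the encoding used in this study:

```python
import pandas as pd

# Hypothetical toy rows with two of the categorical features.
toy = pd.DataFrame({
    "Color": ["BLACK", "RED", "WHITE", "RED"],
    "Gender": ["M", "F", "F", "M"],
})

# One indicator column per category, so no ordering is implied between colors.
encoded = pd.get_dummies(toy, columns=["Color", "Gender"], dtype=int)
print(encoded.columns.tolist())
print(encoded)
```

The trade-off is more columns: one per distinct value of each feature instead of one per feature.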

In [None]:
#split dataset in features and target variable
feature_cols = ['Color', 'Alcohol', 'Race', 'Gender']
X = categorical_values[feature_cols]
y = categorical_values.Accident

In [None]:
from sklearn import datasets
from sklearn import svm
# split X and y into training and testing sets
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.25,random_state=0)
#Dataset is broken into two parts in a ratio of 75:25. 
#75% data will be used for model training and 25% for model testing.

In [None]:
X_train.size

In [None]:
X_test.size

In [None]:
y_train.size

In [None]:
y_test.size

In [None]:
# import the class
from sklearn.linear_model import LogisticRegression

In [None]:
# instantiate the model (using the default parameters)
logreg = LogisticRegression()

In [None]:
# fit the model with data
logreg.fit(X_train,y_train)

In [None]:
y_pred=logreg.predict(X_test)

In [None]:
# import the metrics class
from sklearn import metrics
cnf_matrix = metrics.confusion_matrix(y_test, y_pred)
cnf_matrix
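
If accidents are rare among the stops, it can help to read the confusion matrix next to accuracy and recall. The sketch below uses made-up labels (`y_true` and `y_hat` are hypothetical) and a model that always predicts "no accident", to show how accuracy can look good while no accidents are caught:

```python
from sklearn import metrics

# Hypothetical labels: 1 = accident, 0 = no accident (accidents are rare).
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
# A model that always predicts "no accident".
y_hat = [0] * 10

# Rows of the matrix are actual classes, columns are predicted classes.
cm = metrics.confusion_matrix(y_true, y_hat)
accuracy = metrics.accuracy_score(y_true, y_hat)
recall = metrics.recall_score(y_true, y_hat, zero_division=0)
print(cm)
print(accuracy, recall)
```

Here the accuracy is 0.8 but the recall for accidents is 0, since both actual accidents fall in the "predicted no accident" column.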

In [None]:
#https://www.datacamp.com/community/tutorials/decision-tree-classification-python
# Create Decision Tree classifier object
clf = DecisionTreeClassifier()

# Train Decision Tree Classifier
clf = clf.fit(X_train,y_train)

#Predict the response for test dataset
y_pred = clf.predict(X_test)

In [None]:
#https://towardsdatascience.com/decision-tree-in-python-b433ae57fb93
dt = DecisionTreeClassifier()
dt.fit(X_train, y_train)

In [None]:
#https://chrisalbon.com/machine_learning/trees_and_forests/decision_tree_classifier/

# Create decision tree classifier object using gini
clf = DecisionTreeClassifier(criterion='gini', random_state=0)

# Train model
model = clf.fit(X, y)

# Make new observation
observation = [[ 5,  4,  3,  2]]

# Predict observation's class    
model.predict(observation)

In [None]:
#your old value(string) and value is your new value(integer).
###Accident = {'Yes': 1, 'No': 2}
#Assign these different key-value pair from above dictiionary to your table
###categorical_values.Accident = [Accident[item] for item in categorical_values.Accident]

##https://www.tutorialspoint.com/replacing-strings-with-numbers-in-python-for-data-analysis

Question:  Would grouping the colors 'CREAM', 'MULTICOLOR', 'COPPER', 'PINK', 'CHROME', and 'CAMOUFLAGE' into a category of 'Other' impact the viewer's interpretation?  This was done because those colors are outliers with low counts.  

In [None]:
final_color_no_accident_other = final_color_no_accident_no_null.copy()
final_color_no_accident_other['Color'] = final_color_no_accident_other['Color'].replace(['CREAM', 'MULTICOLOR', 'COPPER', 'PINK', 'CHROME', 'CAMOUFLAGE'],'OTHER')
final_color_no_accident_other.Color.value_counts()
#Create new dataframe grouping very low numbered colors as one group called 'Other'
#This was done to consider if it has any impact on displaying information more advantageously.
#It does not do much in my opinion.  

In [None]:
final_color_accident_other = final_color_accident_no_null.copy()
final_color_accident_other['Color'] = final_color_accident_other['Color'].replace(['CREAM', 'MULTICOLOR', 'COPPER', 'PINK', 'CHROME', 'CAMOUFLAGE'],'OTHER')
final_color_accident_other.Color.value_counts()
#Using the category of 'Other' for lower numbered color samples as a group
#Looking to see how that impacts the numbers

Answer:  Grouping the outliers into a category of 'Other' did not impact the viewer's interpretation of how much greater the numbers in the popular colors are.  

>Question: **Does alcohol appear to have any significance in traffic stops for automobiles in accidents?**

In [None]:
autos_in_accidents_alcohol = data.loc[(data['VehicleType'] == '02 - Automobile') & (data['Accident'] == 'Yes') & (data['Alcohol'] == 'Yes'), ['Color', 'VehicleType', 'Accident', 'Alcohol']]
##final set of colors, automobiles, that were in an accident involving alcohol
autos_in_accidents_alcohol_no_null = autos_in_accidents_alcohol.dropna(axis = 0, subset = ['Color'])
##drop any rows with a nan value for color
autos_in_accidents_alcohol = autos_in_accidents_alcohol_no_null 
autos_in_accidents_alcohol.Color.value_counts()
#Count the final set of colors for Automobiles in Accidents, that involved alcohol.

In [None]:
final_color_accident_no_null.Color.value_counts()
#Revisiting the count of the final set of colors for Automobiles in Accidents.

In [None]:
(autos_in_accidents_alcohol['Color'].value_counts()/final_color_accident_no_null['Color'].value_counts())*100

#Autos in accidents with alcohol / autos in accidents, for each color

Answer:  **The percentage of vehicles in accidents that involved alcohol, out of the vehicles in accidents in each color, is small.**  For example, 47 out of 6,788 black automobiles in accidents involved alcohol (about 0.7%).  There was a total of 302,920 black automobiles overall. 
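
A row-normalized cross-tabulation gives the alcohol share within each color in one step. This is a minimal sketch on made-up rows (`toy` and its values are hypothetical), assuming the input is already limited to automobiles in accidents:

```python
import pandas as pd

# Hypothetical toy rows for automobiles that were in accidents.
toy = pd.DataFrame({
    "Color": ["BLACK", "BLACK", "BLACK", "BLACK", "RED", "RED"],
    "Alcohol": ["Yes", "No", "No", "No", "No", "No"],
})

# normalize="index" turns each row (color) into shares that sum to 1.
alcohol_share_pct = pd.crosstab(toy["Color"], toy["Alcohol"], normalize="index") * 100
print(alcohol_share_pct)
```

The "Yes" column is then the percentage of accidents for that color that involved alcohol.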


Conclusion:
Certain colors of vehicles have a slightly higher risk of being in an accident when looking at the percentages for each individual color.  The risks overall vary by less than 1%, so this is not a strong enough finding for clients to base decisions on.  
Monthly accident counts could differ by more than 200 between months in the same year.  This merits further investigation by a client or company who wishes to use this information.  The weather may have been a factor, or perhaps the economy.  

