<h1>DATA ANALYTICS PROJECT</h1>
<h2>UNITED KINGDOM ROAD ACCIDENT DATA ANALYSIS</h2>
<h3>INCLUSIVE YEARS: 2019-2022</h3>
<h4>IMPORTING LIBRARIES</h4>

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
from scipy.stats import f_oneway
warnings.filterwarnings('ignore')

In [2]:
accident = pd.read_csv("accident_data.csv")

In [3]:
accident

Unnamed: 0,Index,Accident_Severity,Accident Date,Latitude,Light_Conditions,District Area,Longitude,Number_of_Casualties,Number_of_Vehicles,Road_Surface_Conditions,Road_Type,Urban_or_Rural_Area,Weather_Conditions,Vehicle_Type
0,200701BS64157,Serious,05/06/2019,51.506187,Darkness - lights lit,Kensington and Chelsea,-0.209082,1,2,Dry,Single carriageway,Urban,Fine no high winds,Car
1,200701BS65737,Serious,02/07/2019,51.495029,Daylight,Kensington and Chelsea,-0.173647,1,2,Wet or damp,Single carriageway,Urban,Raining no high winds,Car
2,200701BS66127,Serious,26/08/2019,51.517715,Darkness - lighting unknown,Kensington and Chelsea,-0.210215,1,3,Dry,,Urban,,Taxi/Private hire car
3,200701BS66128,Serious,16/08/2019,51.495478,Daylight,Kensington and Chelsea,-0.202731,1,4,Dry,Single carriageway,Urban,Fine no high winds,Bus or coach (17 or more pass seats)
4,200701BS66837,Slight,03/09/2019,51.488576,Darkness - lights lit,Kensington and Chelsea,-0.192487,1,2,Dry,,Urban,,Other vehicle
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
660674,201091NM01760,Slight,18/02/2022,57.374005,Daylight,Highland,-3.467828,2,1,Dry,Single carriageway,Rural,Fine no high winds,Car
660675,201091NM01881,Slight,21/02/2022,57.232273,Darkness - no lighting,Highland,-3.809281,1,1,Frost or ice,Single carriageway,Rural,Fine no high winds,Car
660676,201091NM01935,Slight,23/02/2022,57.585044,Daylight,Highland,-3.862727,1,3,Frost or ice,Single carriageway,Rural,Fine no high winds,Car
660677,201091NM01964,Serious,23/02/2022,57.214898,Darkness - no lighting,Highland,-3.823997,1,2,Wet or damp,Single carriageway,Rural,Fine no high winds,Motorcycle over 500cc


In [4]:
accident.dtypes

Index                       object
Accident_Severity           object
Accident Date               object
Latitude                   float64
Light_Conditions            object
District Area               object
Longitude                  float64
Number_of_Casualties         int64
Number_of_Vehicles           int64
Road_Surface_Conditions     object
Road_Type                   object
Urban_or_Rural_Area         object
Weather_Conditions          object
Vehicle_Type                object
dtype: object

In [5]:
accident.describe()

Unnamed: 0,Latitude,Longitude,Number_of_Casualties,Number_of_Vehicles
count,660654.0,660653.0,660679.0,660679.0
mean,52.553866,-1.43121,1.35704,1.831255
std,1.406922,1.38333,0.824847,0.715269
min,49.91443,-7.516225,1.0,1.0
25%,51.49069,-2.332291,1.0,1.0
50%,52.315641,-1.411667,1.0,2.0
75%,53.453452,-0.232869,1.0,2.0
max,60.757544,1.76201,68.0,32.0


In [6]:
accident['Accident_Severity'] = accident['Accident_Severity'].astype('category')
accident['Accident_Severity'].value_counts()

Accident_Severity
Slight     563801
Serious     88217
Fatal        8661
Name: count, dtype: int64

<h1>CONVERTING OBJECT TO DATETIME DATA TYPE</h1>

In [7]:
accident['Accident Date'] = pd.to_datetime(accident['Accident Date'], dayfirst = True, errors = 'coerce')
accident.dtypes

Index                              object
Accident_Severity                category
Accident Date              datetime64[ns]
Latitude                          float64
Light_Conditions                   object
District Area                      object
Longitude                         float64
Number_of_Casualties                int64
Number_of_Vehicles                  int64
Road_Surface_Conditions            object
Road_Type                          object
Urban_or_Rural_Area                object
Weather_Conditions                 object
Vehicle_Type                       object
dtype: object
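
The conversion above depends on `dayfirst=True`, because the dates are stored as DD/MM/YYYY. The sketch below uses a few illustrative strings (not rows taken from `accident_data.csv`) to show how day-first parsing and `errors='coerce'` behave. The variable names are hypothetical.

```python
import pandas as pd

# Toy day-first strings in the same DD/MM/YYYY layout as the dataset (illustrative values)
raw_dates = pd.Series(["05/06/2019", "26/08/2019", "not a date"])

# dayfirst=True reads 05/06/2019 as 5 June; errors='coerce' turns unparseable text into NaT
parsed = pd.to_datetime(raw_dates, dayfirst=True, errors="coerce")

# Count the values that could not be parsed
n_unparsed = int(parsed.isna().sum())
print(parsed)
print("unparseable rows:", n_unparsed)
```

Checking the NaT count after coercion is a cheap way to confirm that no dates were silently lost.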

In [8]:
accident.isnull().sum()

Index                          0
Accident_Severity              0
Accident Date                  0
Latitude                      25
Light_Conditions               0
District Area                  0
Longitude                     26
Number_of_Casualties           0
Number_of_Vehicles             0
Road_Surface_Conditions      726
Road_Type                   4520
Urban_or_Rural_Area           15
Weather_Conditions         14128
Vehicle_Type                   0
dtype: int64

<h1>HANDLING MISSING VALUES</h1>

In [9]:
accident['Latitude'] = accident['Latitude'].fillna(accident['Latitude'].mode()[0])
accident['Longitude'] = accident['Longitude'].fillna(accident['Longitude'].mode()[0])
accident['Road_Surface_Conditions'] = accident['Road_Surface_Conditions'].fillna('unknown surface condition')
accident['Road_Type'] = accident['Road_Type'].fillna('unaccounted')
accident['Weather_Conditions'] = accident['Weather_Conditions'].fillna('unaccounted weather conditions')
accident['Urban_or_Rural_Area'] = accident['Urban_or_Rural_Area'].fillna(accident['Urban_or_Rural_Area'].mode()[0])
accident.isnull().sum()

Index                      0
Accident_Severity          0
Accident Date              0
Latitude                   0
Light_Conditions           0
District Area              0
Longitude                  0
Number_of_Casualties       0
Number_of_Vehicles         0
Road_Surface_Conditions    0
Road_Type                  0
Urban_or_Rural_Area        0
Weather_Conditions         0
Vehicle_Type               0
dtype: int64
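
Two filling strategies appear in the cell above: a placeholder label that keeps the gap visible, and the most frequent value (the mode). A minimal sketch on a made-up frame, assuming the same column names, shows the difference. The rows are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy frame with one gap per column (values are illustrative)
toy = pd.DataFrame({
    "Road_Type": ["Single carriageway", np.nan, "Roundabout", "Single carriageway"],
    "Urban_or_Rural_Area": ["Urban", "Urban", np.nan, "Rural"],
})

# Placeholder label: the missing entry stays identifiable in later value_counts()
toy["Road_Type"] = toy["Road_Type"].fillna("unaccounted")

# Mode imputation: the gap takes the most frequent label and is no longer distinguishable
most_common_area = toy["Urban_or_Rural_Area"].mode()[0]
toy["Urban_or_Rural_Area"] = toy["Urban_or_Rural_Area"].fillna(most_common_area)

missing_after = toy.isnull().sum()
print(missing_after)
```

A placeholder is usually the safer choice when the number of gaps is large, since mode imputation inflates the dominant category.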

<h1>CATEGORICAL FIELDS</h1>

In [10]:
# Convert only the label-like text fields to 'category'.
# Dates, coordinates, and counts keep their datetime/numeric dtypes so that
# .dt accessors and .corr() in later cells keep working.
categorical_fields = [
    'Accident_Severity', 'Light_Conditions', 'District Area',
    'Road_Surface_Conditions', 'Road_Type', 'Urban_or_Rural_Area',
    'Weather_Conditions', 'Vehicle_Type',
]
accident[categorical_fields] = accident[categorical_fields].astype('category')
accident.dtypes

Index                              object
Accident_Severity                category
Accident Date              datetime64[ns]
Latitude                          float64
Light_Conditions                 category
District Area                    category
Longitude                         float64
Number_of_Casualties                int64
Number_of_Vehicles                  int64
Road_Surface_Conditions          category
Road_Type                        category
Urban_or_Rural_Area              category
Weather_Conditions               category
Vehicle_Type                     category
dtype: object
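
The `category` dtype stores each distinct label once and keeps small integer codes per row, which suits text columns with few distinct values. The sketch below compares memory use on a synthetic Series. The labels mirror `Accident_Severity`, but the data is made up.

```python
import pandas as pd

# Synthetic low-cardinality column: 10,000 rows, 3 distinct labels
labels = pd.Series(["Slight", "Serious", "Fatal", "Slight"] * 2500)
as_category = labels.astype("category")

# deep=True includes the memory held by the Python string objects
object_bytes = labels.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print("plain strings:", object_bytes, "bytes")
print("category:     ", category_bytes, "bytes")
print("categories:   ", list(as_category.cat.categories))
```

Columns with nearly unique values (identifiers, coordinates) gain little from this dtype, and numeric columns lose arithmetic and correlation support if converted.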

<h1>DATA ANALYTICS</h1>
<h2>ANALYZING EACH FIELD FROM THE DATA SET</h2>

<h1>UNIVARIATE ANALYSIS, 2019-2022</h1>
<h2>QUESTION: HOW MANY INCIDENTS HAPPENED UNDER EACH LIGHT CONDITION?</h2>

In [11]:
accident['Light_Conditions'].value_counts()

Light_Conditions
Daylight                       484880
Darkness - lights lit          129335
Darkness - no lighting          37437
Darkness - lighting unknown      6484
Darkness - lights unlit          2543
Name: count, dtype: int64

<h1>INSIGHT 1</h1>
<h2>QUESTION: HOW DO ACCIDENTS VARY UNDER DIFFERENT LIGHT CONDITIONS?</h2>
<h3>ANSWER: THE MAJORITY OF ACCIDENTS OCCUR IN DAYLIGHT</h3>

In [12]:
accident['Light_Conditions'].value_counts()

Light_Conditions
Daylight                       484880
Darkness - lights lit          129335
Darkness - no lighting          37437
Darkness - lighting unknown      6484
Darkness - lights unlit          2543
Name: count, dtype: int64

<h1>INSIGHT 2</h1>
<h2>QUESTION: IS THERE A CORRELATION BETWEEN NUMBER OF CASUALTIES AND NUMBER OF VEHICLES?</h2>
<h3>ANSWER: THERE IS A WEAK POSITIVE CORRELATION (r ≈ 0.23) BETWEEN NUMBER OF CASUALTIES AND NUMBER OF VEHICLES</h3>

In [13]:
number_casualties_vehicles = accident['Number_of_Casualties'].corr(accident['Number_of_Vehicles'])
number_casualties_vehicles

np.float64(0.2288888612692756)
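
A coefficient of about 0.23 indicates only a weak linear association. The sketch below is one way to read it: it computes Pearson's r on a small made-up frame, then squares the coefficient reported above to estimate the shared variance. The toy values are illustrative and not drawn from the dataset.

```python
import pandas as pd

# Toy counts (illustrative, not taken from accident_data.csv)
toy = pd.DataFrame({
    "Number_of_Casualties": [1, 1, 2, 1, 3, 1, 2, 4],
    "Number_of_Vehicles":   [1, 2, 2, 1, 3, 2, 1, 3],
})

# Series.corr defaults to Pearson's r and needs numeric dtypes
toy_r = toy["Number_of_Casualties"].corr(toy["Number_of_Vehicles"])

# Coefficient reported in the cell above; r squared is the share of variance
# in one variable that a straight-line fit on the other accounts for
reported_r = 0.2288888612692756
reported_r_squared = reported_r ** 2

print("toy r:", round(toy_r, 3))
print("reported r squared:", round(reported_r_squared, 3))
```

On this reading, the number of vehicles accounts for roughly 5% of the variation in casualties, so most of the variation comes from other factors.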

<h1>INSIGHT 3</h1>
<h2>QUESTION: WHICH DISTRICT AREAS HAVE THE HIGHEST NUMBER OF ACCIDENTS?</h2>
<h3>ANSWER: BIRMINGHAM HAS THE HIGHEST NUMBER OF ACCIDENTS (13,491), FOLLOWED BY LEEDS (8,898) AND MANCHESTER (6,720)</h3>

In [14]:
accident['District Area'].value_counts().head(10)

District Area
Birmingham          13491
Leeds                8898
Manchester           6720
Bradford             6212
Sheffield            5710
Westminster          5706
Liverpool            5587
Glasgow City         4942
Bristol, City of     4819
Kirklees             4690
Name: count, dtype: int64

<h1>INSIGHT 4</h1>
<h2>QUESTION: HOW MANY ACCIDENTS OCCURRED ON 2019-01-31?</h2>
<h3>ANSWER: 697 ACCIDENTS OCCURRED ON 2019-01-31</h3>

In [15]:
accident['Accident Date'].value_counts()

Accident Date
2019-11-30    704
2019-01-31    697
2021-11-13    692
2019-07-13    692
2019-08-14    688
             ... 
2022-12-30    171
2019-12-25    157
2022-12-25    145
2022-01-10    123
2020-12-25    118
Name: count, Length: 1461, dtype: int64

<h1>INSIGHT 5</h1>
<h2>QUESTION: WHAT IS THE DISTRIBUTION OF ACCIDENT SEVERITY LEVEL?</h2>
<h3>ANSWER: MOST ACCIDENTS ARE SLIGHT, INDICATING MINOR INJURIES OR DAMAGES</h3>

In [16]:
accident['Accident_Severity'].value_counts()

Accident_Severity
Slight     563801
Serious     88217
Fatal        8661
Name: count, dtype: int64

<h1>INSIGHT 6</h1>
<h2>QUESTION: WHICH ROAD SURFACE CONDITIONS ARE ASSOCIATED WITH THE MOST ACCIDENTS?</h2>
<h3>ANSWER: MOST ACCIDENTS OCCUR ON DRY ROADS (447,821)</h3>

In [17]:
accident['Road_Surface_Conditions'].value_counts()

Road_Surface_Conditions
Dry                          447821
Wet or damp                  186708
Frost or ice                  18517
Snow                           5890
Flood over 3cm. deep           1017
unknown surface condition       726
Name: count, dtype: int64

<h1>INSIGHT 7</h1>
<h2>QUESTION: WHICH VEHICLE TYPES ARE MOST COMMONLY INVOLVED IN ACCIDENTS?</h2>
<h3>ANSWER: CARS ARE THE MOST COMMONLY INVOLVED VEHICLE TYPE, WITH 497,992 ACCIDENTS</h3>

In [18]:
accident['Vehicle_Type'].value_counts()

Vehicle_Type
Car                                      497992
Van / Goods 3.5 tonnes mgw or under       34160
Bus or coach (17 or more pass seats)      25878
Motorcycle over 500cc                     25657
Goods 7.5 tonnes mgw and over             17307
Motorcycle 125cc and under                15269
Taxi/Private hire car                     13294
Motorcycle over 125cc and up to 500cc      7656
Motorcycle 50cc and under                  7603
Goods over 3.5t. and under 7.5t            6096
Other vehicle                              5637
Minibus (8 - 16 passenger seats)           1976
Agricultural vehicle                       1947
Pedal cycle                                 197
Data missing or out of range                  6
Ridden horse                                  4
Name: count, dtype: int64

<h1>INSIGHT 8</h1>
<h2>QUESTION: WHAT IS THE MOST COMMON ACCIDENT SEVERITY IN URBAN AND RURAL AREAS?</h2>
<h3>ANSWER: SLIGHT IS THE MOST COMMON SEVERITY IN BOTH URBAN (367,714) AND RURAL (196,077) AREAS</h3>

In [19]:
accident.groupby('Urban_or_Rural_Area')['Accident_Severity'].value_counts()

Urban_or_Rural_Area  Accident_Severity
Rural                Slight               196077
                     Serious               37312
                     Fatal                  5601
Unallocated          Slight                   10
                     Serious                   1
                     Fatal                     0
Urban                Slight               367714
                     Serious               50904
                     Fatal                  3060
Name: count, dtype: int64
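
Raw counts hide how the severity mix differs between areas, because urban areas simply have more accidents. The sketch below converts the Rural and Urban counts printed above into within-area percentages. The 11 Unallocated rows are left out, and the variable names are hypothetical.

```python
import pandas as pd

# Counts copied from the grouped output above (Unallocated omitted)
counts = pd.DataFrame(
    {"Slight": [196077, 367714], "Serious": [37312, 50904], "Fatal": [5601, 3060]},
    index=["Rural", "Urban"],
)

# Divide each row by its own total to get the share of each severity within an area
shares = counts.div(counts.sum(axis=1), axis=0) * 100
fatal_share = shares["Fatal"].round(2)

print(shares.round(2))
```

By this calculation, fatal accidents make up about 2.3% of rural accidents and about 0.7% of urban ones, even though Slight dominates in both areas.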

<h1>INSIGHT 9</h1>
<h2>QUESTION: WHICH MONTH RECORDS THE HIGHEST NUMBER OF ACCIDENTS?</h2>
<h3>ANSWER: NOVEMBER RECORDS THE MOST ACCIDENTS, WITH 60,424</h3>

In [20]:
accident['Accident Date'] = pd.to_datetime(accident['Accident Date'])
accident['Month'] = accident['Accident Date'].dt.month
accident['Month'].value_counts().sort_index()

Month
1     52872
2     49491
3     54086
4     51744
5     56352
6     56481
7     57445
8     53913
9     56455
10    59580
11    60424
12    51836
Name: count, dtype: int64
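
The month breakdown relies on the `.dt` accessor, which is available only on datetime columns. A minimal sketch with a handful of illustrative dates shows how the busiest month can be read off programmatically with `idxmax()`, and how the same accessor yields weekday names.

```python
import pandas as pd

# Illustrative dates only; the real column holds 660,679 values
dates = pd.Series(pd.to_datetime(["2019-11-30", "2019-11-02", "2020-01-31", "2021-11-13"]))

# Count rows per calendar month, ordered January to December
per_month = dates.dt.month.value_counts().sort_index()

# idxmax() returns the index label (here, the month number) with the largest count
busiest_month = per_month.idxmax()

# The same accessor exposes weekday names
weekday_names = dates.dt.day_name()

print(per_month)
print("busiest month:", busiest_month)
```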

<h1>INSIGHT 10</h1>
<h2>QUESTION: WHAT PERCENTAGE OF ACCIDENTS INVOLVED A CAR?</h2>
<h3>ANSWER: CARS ACCOUNT FOR 75.4% OF ACCIDENTS</h3>

In [21]:
np.round(accident['Vehicle_Type'].value_counts(normalize=True) * 100 , 1)

Vehicle_Type
Car                                      75.4
Van / Goods 3.5 tonnes mgw or under       5.2
Bus or coach (17 or more pass seats)      3.9
Motorcycle over 500cc                     3.9
Goods 7.5 tonnes mgw and over             2.6
Motorcycle 125cc and under                2.3
Taxi/Private hire car                     2.0
Motorcycle over 125cc and up to 500cc     1.2
Motorcycle 50cc and under                 1.2
Goods over 3.5t. and under 7.5t           0.9
Other vehicle                             0.9
Minibus (8 - 16 passenger seats)          0.3
Agricultural vehicle                      0.3
Pedal cycle                               0.0
Data missing or out of range              0.0
Ridden horse                              0.0
Name: proportion, dtype: float64

<h1>INSIGHT 11</h1>
<h2>QUESTION: IS THERE A CORRELATION BETWEEN LATITUDE AND LONGITUDE?</h2>
<h3>ANSWER: THERE IS A MODERATE NEGATIVE CORRELATION (r ≈ -0.40) BETWEEN LATITUDE AND LONGITUDE</h3>

In [22]:
latitude_longitude = accident['Latitude'].corr(accident['Longitude'])
latitude_longitude

np.float64(-0.3981137948101014)

<h1>INSIGHT 12</h1>
<h2>QUESTION: WHICH ROAD TYPE HAS THE MOST ACCIDENTS?</h2>
<h3>ANSWER: SINGLE CARRIAGEWAYS RECORD THE MOST ACCIDENTS (492,143)</h3>

In [23]:
accident['Road_Type'].value_counts()

Road_Type
Single carriageway    492143
Dual carriageway       99424
Roundabout             43992
One way street         13559
Slip road               7041
unaccounted             4520
Name: count, dtype: int64

<h1>INSIGHT 13</h1>
<h2>QUESTION: WHICH DISTRICT HAS THE HIGHEST NUMBER OF CASUALTIES?</h2>
<h3>ANSWER: BIRMINGHAM RANKS FIRST WITH 13,491 ACCIDENT RECORDS; THIS FIGURE COUNTS ACCIDENTS, NOT INDIVIDUAL CASUALTIES</h3>

In [24]:
# size() counts rows (accidents) per district; sum() on a numeric column would total the casualties instead
accident.groupby('District Area')['Number_of_Casualties'].size().sort_values(ascending=False).head(10)

District Area
Birmingham          13491
Leeds                8898
Manchester           6720
Bradford             6212
Sheffield            5710
Westminster          5706
Liverpool            5587
Glasgow City         4942
Bristol, City of     4819
Kirklees             4690
Name: Number_of_Casualties, dtype: int64
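
The figures above match the accident counts in Insight 3 because `size()` counts rows per group. The sketch below contrasts `size()` with `sum()` on a made-up frame where the two rankings differ. The district names are reused for readability, and the numbers are illustrative.

```python
import pandas as pd

# Toy data: Leeds has more accidents, Birmingham has more casualties
toy = pd.DataFrame({
    "District Area": ["Birmingham", "Birmingham", "Leeds", "Leeds", "Leeds"],
    "Number_of_Casualties": [1, 4, 1, 1, 1],
})

grouped = toy.groupby("District Area")["Number_of_Casualties"]

# size(): number of rows (accident records) in each group
accident_rows = grouped.size()

# sum(): total of the casualty counts in each group (needs a numeric column)
casualty_totals = grouped.sum()

print(accident_rows)
print(casualty_totals)
```

Answering a casualty question with `sum()` requires `Number_of_Casualties` to keep a numeric dtype rather than `category`.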

<h1>INSIGHT 14</h1>
<h2>QUESTION: HOW MANY INCIDENTS HAPPENED UNDER LIGHT CONDITIONS IN 2022?</h2>
<h3>ANSWER: 144,419 ACCIDENTS WERE RECORDED IN 2022 ACROSS ALL LIGHT CONDITIONS</h3>

In [25]:
# Compare the year component; testing the datetime column against the integer 2022 matches no rows
accident_2022 = accident[accident['Accident Date'].dt.year == 2022]
light_2022 = accident_2022['Light_Conditions'].value_counts()  # per-condition counts for 2022
print(f"total number of accidents in 2022 is {light_2022.sum()}")

total number of accidents in 2022 is 144419
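
Selecting one year from a datetime column is usually done through the year component, `.dt.year`. A minimal sketch on three illustrative rows, assuming the same column names:

```python
import pandas as pd

# Toy rows (illustrative values)
toy = pd.DataFrame({
    "Accident Date": pd.to_datetime(["2019-06-05", "2022-02-18", "2022-02-23"]),
    "Light_Conditions": ["Daylight", "Daylight", "Darkness - no lighting"],
})

# Keep only rows whose year component equals 2022
in_2022 = toy[toy["Accident Date"].dt.year == 2022]

# Break the selected rows down by light condition
light_counts_2022 = in_2022["Light_Conditions"].value_counts()

print(len(in_2022), "rows in 2022")
print(light_counts_2022)
```

A quick sanity check is to compare the filtered row count with the per-year totals from `dt.year.value_counts()`.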

<h1>INSIGHT 15</h1>
<h2>QUESTION: HOW MANY ACCIDENTS OCCUR IN "FINE NO HIGH WINDS"?</h2>
<h3>ANSWER: 520,885 ACCIDENTS OCCURRED IN "FINE NO HIGH WINDS" CONDITIONS</h3>

In [26]:
accident[accident['Weather_Conditions'].str.contains('Fine no high winds', na=False)].shape[0]

520885

<h1>INSIGHT 16</h1>
<h2>QUESTION: WHICH YEAR RECORDED THE HIGHEST NUMBER OF ACCIDENTS?</h2>
<h3>ANSWER: 2019 RECORDED THE HIGHEST NUMBER OF ACCIDENTS (182,115)</h3>

In [27]:
accident['Accident Date'].dt.year.value_counts()

Accident Date
2019    182115
2020    170591
2021    163554
2022    144419
Name: count, dtype: int64

<h1>INSIGHT 17</h1>
<h2>QUESTION: WHICH DAY OF THE WEEK RECORDED THE HIGHEST NUMBER OF ACCIDENTS?</h2>
<h3>ANSWER: SATURDAY RECORDED THE HIGHEST NUMBER OF ACCIDENTS (107,178)</h3>

In [28]:
accident['Accident Date'] = pd.to_datetime(accident['Accident Date'])
accident['Accident Date'].dt.day_name().value_counts()

Accident Date
Saturday     107178
Wednesday     99558
Thursday      99511
Friday        97900
Tuesday       94550
Sunday        89302
Monday        72680
Name: count, dtype: int64

<h1>INSIGHT 18</h1>
<h2>QUESTION: UNDER WHICH WEATHER CONDITION DO MOST FATAL ACCIDENTS OCCUR?</h2>
<h3>ANSWER: MOST FATAL ACCIDENTS (7,100 OF 8,661) OCCUR IN "FINE NO HIGH WINDS" CONDITIONS</h3>

In [29]:
accident[accident['Accident_Severity'] == 'Fatal']['Weather_Conditions'].value_counts()

Weather_Conditions
Fine no high winds                7100
Raining no high winds              848
Fine + high winds                  175
Other                              165
Raining + high winds               145
unaccounted weather conditions     107
Fog or mist                         82
Snowing no high winds               36
Snowing + high winds                 3
Name: count, dtype: int64
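
Because fine weather is also the most common condition overall, the fatal-accident share is easier to interpret next to that baseline. The short calculation below uses only counts printed in this notebook: the fatal total from Insight 5, the fine-weather total from Insight 15, and the overall row count from `describe()`.

```python
# Counts taken from the outputs in this notebook
fatal_fine = 7100      # fatal accidents in "Fine no high winds"
fatal_total = 8661     # all fatal accidents
all_fine = 520885      # all accidents in "Fine no high winds"
all_total = 660679     # all accidents

# Share of fatal accidents that happened in fine weather
share_of_fatal = round(fatal_fine / fatal_total * 100, 1)

# Share of all accidents that happened in fine weather (the baseline)
share_of_all = round(all_fine / all_total * 100, 1)

print(share_of_fatal, share_of_all)
```

The two shares come out at roughly 82% and 79%, so fine weather is only slightly over-represented among fatal accidents. The count alone does not show that weather causes them.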

<h1>INSIGHT 19</h1>
<h2>QUESTION: WHAT PERCENTAGE OF FATAL ACCIDENTS INVOLVE A VAN?</h2>
<h3>ANSWER: VANS (VAN / GOODS 3.5 TONNES MGW OR UNDER) ACCOUNT FOR 5.4% OF FATAL ACCIDENTS; CARS ACCOUNT FOR 75.9%</h3>

In [30]:
np.round(accident[accident['Accident_Severity'] == 'Fatal']['Vehicle_Type'].value_counts(normalize=True) * 100 , 1)

Vehicle_Type
Car                                      75.9
Van / Goods 3.5 tonnes mgw or under       5.4
Motorcycle over 500cc                     3.9
Bus or coach (17 or more pass seats)      3.8
Goods 7.5 tonnes mgw and over             2.5
Motorcycle 125cc and under                2.2
Taxi/Private hire car                     1.8
Motorcycle over 125cc and up to 500cc     1.2
Motorcycle 50cc and under                 1.1
Other vehicle                             0.8
Goods over 3.5t. and under 7.5t           0.8
Minibus (8 - 16 passenger seats)          0.3
Agricultural vehicle                      0.2
Pedal cycle                               0.1
Data missing or out of range              0.0
Ridden horse                              0.0
Name: proportion, dtype: float64

<h1>INSIGHT 20</h1>
<h2>QUESTION: WHAT PERCENTAGE OF ACCIDENTS OCCUR IN URBAN AREAS?</h2>
<h3>ANSWER: 63.8% OF ACCIDENTS OCCUR IN URBAN AREAS</h3>

In [31]:
np.round(accident['Urban_or_Rural_Area'].value_counts(normalize=True) * 100, 1)

Urban_or_Rural_Area
Urban          63.8
Rural          36.2
Unallocated     0.0
Name: proportion, dtype: float64