<h1>Activity 2: UK Road Accidents</h1>
<hr>
<h3>Analyst: Martin Ryan V. Garay</h3>

<H2>Import Libraries</H2>

In [1]:
import numpy as np
import pandas as pd
import warnings
from scipy.stats import f_oneway
warnings.filterwarnings("ignore")

<H2>Load Dataset into DataFrame</H2>

In [2]:
df = pd.read_csv('Datasets\\uk_road_accident.csv')

<h2>Check DataFrame Information</h2>

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 660679 entries, 0 to 660678
Data columns (total 14 columns):
 #   Column                   Non-Null Count   Dtype  
---  ------                   --------------   -----  
 0   Index                    660679 non-null  object 
 1   Accident_Severity        660679 non-null  object 
 2   Accident Date            660679 non-null  object 
 3   Latitude                 660654 non-null  float64
 4   Light_Conditions         660679 non-null  object 
 5   District Area            660679 non-null  object 
 6   Longitude                660653 non-null  float64
 7   Number_of_Casualties     660679 non-null  int64  
 8   Number_of_Vehicles       660679 non-null  int64  
 9   Road_Surface_Conditions  659953 non-null  object 
 10  Road_Type                656159 non-null  object 
 11  Urban_or_Rural_Area      660664 non-null  object 
 12  Weather_Conditions       646551 non-null  object 
 13  Vehicle_Type             660679 non-null  object 
dtypes: f

<h2>Basic Descriptive Statistics</h2>

In [4]:
df.describe()

       Latitude    Longitude   Number_of_Casualties  Number_of_Vehicles
count  660654.0    660653.0    660679.0              660679.0
mean   52.553866   -1.43121    1.35704               1.831255
std    1.406922    1.38333     0.824847              0.715269
min    49.91443    -7.516225   1.0                   1.0
25%    51.49069    -2.332291   1.0                   1.0
50%    52.315641   -1.411667   1.0                   2.0
75%    53.453452   -0.232869   1.0                   2.0
max    60.757544   1.76201     68.0                  32.0


<h1>Clearing any Inconsistencies with the Data Set</h1>

In [5]:
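# Normalise the raw date text into the helper column 'Accident Data' (the name used in later outputs):
# cast to string, trim whitespace, and unify the separators.
# The steps are chained so that each one builds on the previous result instead of overwriting it.
df['Accident Data'] = df['Accident Date'].astype('str').str.strip().str.replace('/', '-')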

In [6]:
df['Accident Date'] = pd.to_datetime(df['Accident Date'], dayfirst = True, errors = 'coerce')
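
The cleaning and parsing steps above can be tried on a few made-up date strings. This is only a minimal sketch: the sample values and the names `raw`, `cleaned` and `parsed` are hypothetical and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical sample values with stray whitespace, mixed separators and one bad entry
raw = pd.Series([" 03/01/2019", "15-06-2020 ", "not a date"])

# Chain the string steps so each one builds on the previous result
cleaned = raw.astype("str").str.strip().str.replace("/", "-")

# dayfirst=True reads 03-01-2019 as 3 January 2019;
# errors="coerce" turns text that cannot be parsed into NaT instead of raising
parsed = pd.to_datetime(cleaned, dayfirst=True, errors="coerce")
print(parsed)
```

Counting `parsed.isna().sum()` afterwards is a quick way to see how many values failed to parse.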

In [7]:
df.dtypes

Index                              object
Accident_Severity                  object
Accident Date              datetime64[ns]
Latitude                          float64
Light_Conditions                   object
District Area                      object
Longitude                         float64
Number_of_Casualties                int64
Number_of_Vehicles                  int64
Road_Surface_Conditions            object
Road_Type                          object
Urban_or_Rural_Area                object
Weather_Conditions                 object
Vehicle_Type                       object
Accident Data                      object
dtype: object

<h2>Check and Fill Null Values</h2>

In [8]:
df.isnull().sum()

Index                          0
Accident_Severity              0
Accident Date                  0
Latitude                      25
Light_Conditions               0
District Area                  0
Longitude                     26
Number_of_Casualties           0
Number_of_Vehicles             0
Road_Surface_Conditions      726
Road_Type                   4520
Urban_or_Rural_Area           15
Weather_Conditions         14128
Vehicle_Type                   0
Accident Data                  0
dtype: int64

In [9]:
df['Latitude'] = df['Latitude'].fillna(df['Latitude'].mean())
df['Longitude'] = df['Longitude'].fillna(df['Longitude'].mean())
df['Road_Surface_Conditions'] = df['Road_Surface_Conditions'].fillna(df['Road_Surface_Conditions'].mode()[0])
df['Road_Type'] = df['Road_Type'].fillna(df['Road_Type'].mode()[0])
df['Urban_or_Rural_Area'] = df['Urban_or_Rural_Area'].fillna(df['Urban_or_Rural_Area'].mode()[0])
df['Weather_Conditions'] = df['Weather_Conditions'].fillna(df['Weather_Conditions'].mode()[0])
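
The same fill strategy (mean for numeric columns, mode for categorical ones) can be checked on a tiny frame. This is a sketch with hypothetical toy values, not rows from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one gap in each column
toy = pd.DataFrame({
    "Latitude": [51.5, np.nan, 53.5],
    "Road_Type": ["Single carriageway", None, "Single carriageway"],
})

# Numeric column: fill with the mean of the observed values
toy["Latitude"] = toy["Latitude"].fillna(toy["Latitude"].mean())

# Categorical column: fill with the most frequent value
toy["Road_Type"] = toy["Road_Type"].fillna(toy["Road_Type"].mode()[0])

# Total number of missing values left in the frame
remaining = int(toy.isnull().sum().sum())
print(toy, remaining)
```

Note that filling with the mode adds the imputed rows to the most common category, so later counts for that category include them.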

<h1>Extracting Date information using Pandas Date Time</h1>

In [10]:
df['Year'] = df['Accident Date'].dt.year
df['Month'] = df['Accident Date'].dt.month
df['Day'] = df['Accident Date'].dt.day
df['DayOfWeek'] = df['Accident Date'].dt.dayofweek
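
A short sketch of what the `.dt` accessors return, using two illustrative dates. `day_name()` is added here only to make the `dayofweek` numbering (Monday = 0) easier to read; it is not part of the original notebook.

```python
import pandas as pd

# Two illustrative dates
dates = pd.Series(pd.to_datetime(["2019-01-03", "2020-06-15"]))

parts = pd.DataFrame({
    "Year": dates.dt.year,
    "Month": dates.dt.month,
    "Day": dates.dt.day,
    "DayOfWeek": dates.dt.dayofweek,  # Monday = 0 ... Sunday = 6
    "DayName": dates.dt.day_name(),   # readable label for the same information
})
print(parts)
```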

In [11]:
df.isnull().sum()

Index                      0
Accident_Severity          0
Accident Date              0
Latitude                   0
Light_Conditions           0
District Area              0
Longitude                  0
Number_of_Casualties       0
Number_of_Vehicles         0
Road_Surface_Conditions    0
Road_Type                  0
Urban_or_Rural_Area        0
Weather_Conditions         0
Vehicle_Type               0
Accident Data              0
Year                       0
Month                      0
Day                        0
DayOfWeek                  0
dtype: int64

<hr>
<H1>Exploratory Data Analysis</H1>

<h2>Question No.1</h2>
<h3>What type of vehicle is more prone to accidents according to the data?</h3>

In [12]:
df['Vehicle_Type'].value_counts()

Vehicle_Type
Car                                      497992
Van / Goods 3.5 tonnes mgw or under       34160
Bus or coach (17 or more pass seats)      25878
Motorcycle over 500cc                     25657
Goods 7.5 tonnes mgw and over             17307
Motorcycle 125cc and under                15269
Taxi/Private hire car                     13294
Motorcycle over 125cc and up to 500cc      7656
Motorcycle 50cc and under                  7603
Goods over 3.5t. and under 7.5t            6096
Other vehicle                              5637
Minibus (8 - 16 passenger seats)           1976
Agricultural vehicle                       1947
Pedal cycle                                 197
Data missing or out of range                  6
Ridden horse                                  4
Name: count, dtype: int64

<h3>Insights:</h3>
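<p>Cars are involved in by far the most accidents in the data (497,992 of 660,679 records, about 75%). This count likely reflects how many cars are on the road rather than a higher risk per vehicle.</p>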
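
Raw counts can be turned into shares with `value_counts(normalize=True)`. The sketch below uses a hypothetical five-row series rather than the real column.

```python
import pandas as pd

# Hypothetical toy column, not the real data
vehicles = pd.Series(["Car", "Car", "Car", "Van", "Bus"], name="Vehicle_Type")

# normalize=True returns each category's fraction of all rows instead of the raw count
shares = vehicles.value_counts(normalize=True)
print(shares)
```

On the notebook's DataFrame the equivalent call would be `df['Vehicle_Type'].value_counts(normalize=True)`.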
<h3></h3>
<hr>

<h2>Question No.2</h2>
<h3>What road condition has the most number of accidents?</h3>

In [13]:
df['Road_Surface_Conditions'].value_counts()

Road_Surface_Conditions
Dry                     448547
Wet or damp             186708
Frost or ice             18517
Snow                      5890
Flood over 3cm. deep      1017
Name: count, dtype: int64

<h3>Insights:</h3>
<p>According to the data, most accidents (448,547) happen on dry roads. This may indicate that driver behavior plays a more significant role in accidents than road surface conditions.</p>
<h3></h3>
<hr>

<h2>Question No.3</h2>
<h3>In which lighting situation do the most serious accidents occur?</h3>

In [14]:
df.groupby(['Accident_Severity', 'Light_Conditions']).size()

Accident_Severity  Light_Conditions           
Fatal              Darkness - lighting unknown        68
                   Darkness - lights lit            1860
                   Darkness - lights unlit            45
                   Darkness - no lighting           1612
                   Daylight                         5076
Serious            Darkness - lighting unknown       794
                   Darkness - lights lit           19130
                   Darkness - lights unlit           360
                   Darkness - no lighting           7174
                   Daylight                        60759
Slight             Darkness - lighting unknown      5622
                   Darkness - lights lit          108345
                   Darkness - lights unlit          2138
                   Darkness - no lighting          28651
                   Daylight                       419045
dtype: int64

<h3>Insights:</h3>
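<p>In absolute numbers, most fatal (5,076) and serious (60,759) accidents happen in daylight, not at night. Darkness does account for a larger share of fatal accidents (3,585 of 8,661, about 41%) than of slight accidents (144,756 of 563,801, about 26%), which suggests that accidents in the dark tend to be more severe.</p>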
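
The grouped counts above can also be read as shares within each severity level. The sketch below re-enters the counts from the output by hand; the names `counts`, `dark_rows` and `dark_share` are illustrative.

```python
import pandas as pd

# Counts copied from the groupby output above (rows: light condition, columns: severity)
counts = pd.DataFrame(
    {
        "Fatal": [68, 1860, 45, 1612, 5076],
        "Serious": [794, 19130, 360, 7174, 60759],
        "Slight": [5622, 108345, 2138, 28651, 419045],
    },
    index=[
        "Darkness - lighting unknown",
        "Darkness - lights lit",
        "Darkness - lights unlit",
        "Darkness - no lighting",
        "Daylight",
    ],
)

# Boolean mask selecting every kind of darkness
dark_rows = counts.index.str.startswith("Darkness")

# Share of each severity level that happened in darkness
dark_share = counts[dark_rows].sum() / counts.sum()
print(dark_share.round(3))
```

On the full DataFrame, `pd.crosstab(df['Light_Conditions'], df['Accident_Severity'], normalize='columns')` should give the same kind of per-severity breakdown directly.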
<h3></h3>
<hr>

<h2>Question No.4</h2>
<h3>Which weather condition has the highest number of accidents?</h3>

In [15]:
df['Weather_Conditions'].value_counts()

Weather_Conditions
Fine no high winds       535013
Raining no high winds     79696
Other                     17150
Raining + high winds       9615
Fine + high winds          8554
Snowing no high winds      6238
Fog or mist                3528
Snowing + high winds        885
Name: count, dtype: int64

<h3>Insights:</h3>
<p>The data shows that most accidents (535,013) happened in fine weather without high winds, suggesting that poor weather is not the leading factor in accidents.</p>
<h3></h3>
<hr>

<h2>Question No.5</h2>
<h3>Are nighttime accidents more common in urban areas or rural areas?</h3>

In [16]:
df.groupby(['Light_Conditions', 'Urban_or_Rural_Area']).size()


Light_Conditions             Urban_or_Rural_Area
Darkness - lighting unknown  Rural                    2467
                             Urban                    4017
Darkness - lights lit        Rural                   24695
                             Unallocated                 2
                             Urban                  104638
Darkness - lights unlit      Rural                     961
                             Urban                    1582
Darkness - no lighting       Rural                   35517
                             Urban                    1920
Daylight                     Rural                  175350
                             Unallocated                 9
                             Urban                  309521
dtype: int64

<h3>Insights:</h3>
<p>Nighttime accidents occur more often in urban areas (112,157) than in rural areas (63,640). This may reflect higher traffic density in cities at night. However, under ‘Darkness - no lighting’, rural areas show far more accidents (35,517 versus 1,920), likely due to limited street lighting and visibility.</p>
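
The nighttime totals can be checked by summing the darkness rows of the output above. This sketch re-enters those counts by hand and leaves out the two 'Unallocated' records; the variable names are illustrative.

```python
import pandas as pd

# Darkness rows copied from the groupby output above ('Unallocated' omitted)
dark = pd.DataFrame(
    {
        "Rural": [2467, 24695, 961, 35517],
        "Urban": [4017, 104638, 1582, 1920],
    },
    index=[
        "Darkness - lighting unknown",
        "Darkness - lights lit",
        "Darkness - lights unlit",
        "Darkness - no lighting",
    ],
)

# Total nighttime accidents per area type
totals = dark.sum()

# Urban share within each darkness condition
urban_share = dark["Urban"] / dark.sum(axis=1)
print(totals)
print(urban_share.round(3))
```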
<h3></h3>
<hr>

<h2>Question No.6</h2>
<h3>Which district has the largest number of accidents?</h3>

In [17]:
df['District Area'].value_counts()

District Area
Birmingham            13491
Leeds                  8898
Manchester             6720
Bradford               6212
Sheffield              5710
                      ...  
Berwick-upon-Tweed      153
Teesdale                142
Shetland Islands        133
Orkney Islands          117
Clackmannanshire         91
Name: count, Length: 422, dtype: int64

<h3>Insights:</h3>
<p>Based on the data, Birmingham has the highest number of accidents. This suggests that Birmingham is a hotspot for road accidents, likely due to its large population.</p>
<h3></h3>
<hr>

<h2>Question No.7</h2>
<h3>Which road type (single carriageway, dual carriageway, roundabout, etc.) sees the most accidents?</h3>

In [18]:
df['Road_Type'].value_counts()

Road_Type
Single carriageway    496663
Dual carriageway       99424
Roundabout             43992
One way street         13559
Slip road               7041
Name: count, dtype: int64

<h3>Insights:</h3>
<p>Most accidents occur on single carriageways, far more than on other road types, likely due to their widespread use and higher traffic exposure.</p>
<h3></h3>
<hr>

<h2>Question No.8</h2>
<h3>What is the average number of vehicles per accident?</h3>

In [19]:
df['Number_of_Vehicles'].mean()

np.float64(1.8312554205597575)

<h3>Insights:</h3>
<p>On average, about 1.83 vehicles are involved in each accident. The median is 2, so the typical accident involves two vehicles, although at least a quarter of accidents are single-vehicle incidents (the 25th percentile is 1).</p>
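
A mean on its own does not say what the typical accident looks like, so it can help to look at the median and the share of single-vehicle records as well. The sketch below uses hypothetical values, not the real column.

```python
import pandas as pd

# Hypothetical vehicle counts for six accidents
vehicles = pd.Series([1, 2, 2, 2, 1, 3], name="Number_of_Vehicles")

mean_vehicles = vehicles.mean()        # average vehicles per accident
median_vehicles = vehicles.median()    # the "typical" accident
share_single = (vehicles == 1).mean()  # fraction of single-vehicle accidents

print(mean_vehicles, median_vehicles, share_single)
```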
<h3></h3>
<hr>

<h2>Question No.9</h2>
<h3>What is the average number of casualties per accident?</h3>

In [20]:
df['Number_of_Casualties'].mean()

np.float64(1.357040257068864)

<h3>Insights:</h3>
<p>On average, each accident results in about 1.36 casualties. Every recorded accident has at least one casualty (the minimum is 1), and at least 75% of accidents have exactly one (the 75th percentile is 1).</p>
<h3></h3>
<hr>

<h2>Question No.10</h2>
<h3>Do accidents with more vehicles usually result in more casualties?</h3>

In [21]:
df['Number_of_Vehicles'].corr(df['Number_of_Casualties'])


np.float64(0.22888886126927557)

<h3>Insights:</h3>
<p>There is a weak positive correlation (≈0.23) between the number of vehicles and casualties. This means accidents involving more vehicles tend to have slightly more casualties, but the relationship is not very strong.</p>
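
A correlation coefficient can be complemented by the average number of casualties at each vehicle count. The sketch below uses a hypothetical six-row frame, so its correlation is much stronger than the one in the real data.

```python
import pandas as pd

# Hypothetical toy data, not the real records
toy = pd.DataFrame({
    "Number_of_Vehicles": [1, 1, 2, 2, 3, 3],
    "Number_of_Casualties": [1, 1, 1, 2, 2, 3],
})

# Pearson correlation, as computed by Series.corr by default
r = toy["Number_of_Vehicles"].corr(toy["Number_of_Casualties"])

# Average casualties for each number of vehicles
by_vehicles = toy.groupby("Number_of_Vehicles")["Number_of_Casualties"].mean()

print(round(r, 3))
print(by_vehicles)
```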
<h3></h3>
<hr>

<h2>Question No.11</h2>
<h3>Which weather condition is linked to the highest number of severe accidents?</h3>

In [22]:
df.groupby(['Weather_Conditions', 'Accident_Severity']).size()

Weather_Conditions     Accident_Severity
Fine + high winds      Fatal                   175
                       Serious                1245
                       Slight                 7134
Fine no high winds     Fatal                  7207
                       Serious               73285
                       Slight               454521
Fog or mist            Fatal                    82
                       Serious                 483
                       Slight                 2963
Other                  Fatal                   165
                       Serious                1801
                       Slight                15184
Raining + high winds   Fatal                   145
                       Serious                1261
                       Slight                 8209
Raining no high winds  Fatal                   848
                       Serious                9468
                       Slight                69380
Snowing + high winds   Fatal             

<h3>Insights:</h3>
<p>Most severe accidents (7,207 fatal and 73,285 serious) occur under fine weather without high winds. This suggests that good weather may create a false sense of safety.</p>
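
Because fine weather is also by far the most common condition, it can help to compare the share of severe accidents within each weather condition. The sketch below re-enters the counts that are visible in the output above; the two snowing categories are left out because their counts are cut off there. The variable names are illustrative.

```python
import pandas as pd

# Counts copied from the visible part of the groupby output above
counts = pd.DataFrame(
    {
        "Fatal": [175, 7207, 82, 165, 145, 848],
        "Serious": [1245, 73285, 483, 1801, 1261, 9468],
        "Slight": [7134, 454521, 2963, 15184, 8209, 69380],
    },
    index=[
        "Fine + high winds",
        "Fine no high winds",
        "Fog or mist",
        "Other",
        "Raining + high winds",
        "Raining no high winds",
    ],
)

# Fraction of accidents in each weather condition that were fatal or serious
severe_share = (counts["Fatal"] + counts["Serious"]) / counts.sum(axis=1)

# Highest share first
ranked = severe_share.sort_values(ascending=False)
print(ranked.round(3))
```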
<h3></h3>
<hr>

<h2>Question No.12</h2>
<h3>Which district area had the most accidents involving cars?</h3>

In [23]:
carses = df[df['Vehicle_Type'] == 'Car']['District Area'].value_counts().head(5)
carses

District Area
Birmingham    9600
Leeds         6875
Manchester    5248
Bradford      4749
Sheffield     4306
Name: count, dtype: int64

<h3>Insights:</h3>
<p>Birmingham has the highest number of accidents involving cars (9,600), possibly because a large number of people own cars in this district.</p>
<h3></h3>
<hr>

<h2>Question No.13</h2>
<h3>Which district area had the most accidents with slight severity?</h3>

In [24]:
sliacc = df[df['Accident_Severity'] == 'Slight']
sliacc['District Area'].describe()

count         563801
unique           422
top       Birmingham
freq           11912
Name: District Area, dtype: object

<h3>Insights:</h3>
<p>The data shows that Birmingham had the most accidents of slight severity (11,912).</p>
<h3></h3>
<hr>

<h2>Question No.14</h2>
<h3>What is the highest number of casualties recorded in a single accident?</h3>

In [25]:
df['Number_of_Casualties'].max()


np.int64(68)

In [26]:
df[df['Number_of_Casualties']== 68]

Unnamed: 0,Index,Accident_Severity,Accident Date,Latitude,Light_Conditions,District Area,Longitude,Number_of_Casualties,Number_of_Vehicles,Road_Surface_Conditions,Road_Type,Urban_or_Rural_Area,Weather_Conditions,Vehicle_Type,Accident Data,Year,Month,Day,DayOfWeek
117980,200743N002017,Fatal,2019-01-03,51.497547,Darkness - lights lit,South Bucks,-0.496697,68,1,Wet or damp,Slip road,Rural,Raining no high winds,Car,03-01-2019,2019,1,3,3


<h3>Insights:</h3>
<p>The highest number of casualties in a single accident was 68, occurring in South Bucks. This suggests rare but extreme incidents can cause unusually high casualty counts, even involving just one vehicle.</p>
<h3></h3>
<hr>

<h2>Question No.15</h2>
<h3>In urban areas, which road surface condition has the most accidents?</h3>

In [27]:
urbroad = df[df['Urban_or_Rural_Area'] == 'Urban']

urbroad['Road_Surface_Conditions'].value_counts()

Road_Surface_Conditions
Dry                     303397
Wet or damp             107698
Frost or ice              7564
Snow                      2788
Flood over 3cm. deep       231
Name: count, dtype: int64

<h3>Insights:</h3>
<p>In urban areas, most accidents happen on dry roads (303,397).</p>
<h3></h3>
<hr>

<h2>Question No.16</h2>
<h3>What's the average number of casualties by the severity of accident?</h3>

In [28]:
np.round(df.groupby('Accident_Severity')['Number_of_Casualties'].mean(),2)

Accident_Severity
Fatal      1.90
Serious    1.47
Slight     1.33
Name: Number_of_Casualties, dtype: float64

<h3>Insights:</h3>
<p>The average number of casualties stays below 2 at every severity level, rising from 1.33 for slight accidents to 1.47 for serious and 1.90 for fatal accidents.</p>
<h3></h3>
<hr>

<h2>Question No.17</h2>
<h3>Which district has the highest number of Frost or ice road accidents?</h3>

In [29]:
frost = df[df['Road_Surface_Conditions'] == "Frost or ice"].groupby('District Area').size()
mostfrost = frost.idxmax()
mostfrostNum = frost.max()
print(f'{mostfrost} - {mostfrostNum}')

Birmingham - 306


<h3>Insights:</h3>
<p>Birmingham has the highest number of accidents on frost or ice, with a total of 306.</p>
<h3></h3>
<hr>

<h2>Question No.18</h2>
<h3>Which weather condition has the highest average number of casualties?</h3>

In [30]:
df.groupby("Weather_Conditions")["Number_of_Casualties"].mean()

Weather_Conditions
Fine + high winds        1.386018
Fine no high winds       1.347397
Fog or mist              1.452948
Other                    1.354869
Raining + high winds     1.416641
Raining no high winds    1.408214
Snowing + high winds     1.418079
Snowing no high winds    1.341776
Name: Number_of_Casualties, dtype: float64

<h3>Insights:</h3>
<p>Fog or mist has the highest average number of casualties per accident (about 1.45), suggesting that low visibility is particularly dangerous. The differences between weather conditions are small, however.</p>
<h3></h3>
<hr>

<h2>Question No.19</h2>
<h3>Do Bus accidents increase casualty count more in rural or urban areas?</h3>

In [31]:
# Keep only bus/coach records, then count accidents per area type
df[df['Vehicle_Type'] == 'Bus or coach (17 or more pass seats)'].groupby('Urban_or_Rural_Area').size()

Urban_or_Rural_Area
Rural           9025
Unallocated        2
Urban          16851
dtype: int64

<h3>Insights:</h3>
<p>Bus or coach accidents are more frequent in urban areas, with 16,851 accidents compared to 9,025 in rural areas. Note that this counts accidents; it does not compare the number of casualties per accident.</p>
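
Since the question asks about casualty counts, one possible follow-up is to aggregate `Number_of_Casualties` per area type instead of counting rows. The sketch below runs on a hypothetical five-row frame, so its numbers say nothing about the real data.

```python
import pandas as pd

BUS = "Bus or coach (17 or more pass seats)"

# Hypothetical toy records, not taken from the dataset
toy = pd.DataFrame({
    "Vehicle_Type": [BUS, BUS, BUS, "Car", BUS],
    "Urban_or_Rural_Area": ["Urban", "Urban", "Rural", "Rural", "Rural"],
    "Number_of_Casualties": [1, 2, 4, 1, 2],
})

# Keep only bus/coach records
bus = toy[toy["Vehicle_Type"] == BUS]

# Accidents, mean casualties and total casualties per area type
summary = bus.groupby("Urban_or_Rural_Area")["Number_of_Casualties"].agg(["count", "mean", "sum"])
print(summary)
```

The `mean` column answers whether a bus accident tends to involve more casualties in one area type, while `count` reproduces the frequency comparison.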
<h3></h3>
<hr>

<h2>Question No.20</h2>
<h3>Which light condition (daylight, darkness, etc.) is most associated with Serious accidents?</h3>

In [33]:
light = df[df['Accident_Severity']== 'Serious'].groupby('Light_Conditions').size()
light

Light_Conditions
Darkness - lighting unknown      794
Darkness - lights lit          19130
Darkness - lights unlit          360
Darkness - no lighting          7174
Daylight                       60759
dtype: int64

<h3>Insights:</h3>
<p>Daylight has the highest number of Serious accidents (60,759), indicating that most accidents causing serious injuries happen during the day.</p>
<h3></h3>
<hr>

<h2>Question No.21</h2>
<h3>Which light condition (daylight, darkness, etc.) is most associated with Serious accidents?</h3>