# Exploratory Data Analysis: Malta Road Traffic Accidents

This notebook explores a dataset of road traffic accidents in Malta to surface patterns that could inform road safety policy. The findings will be used to refine the research questions below and to choose an appropriate modeling approach (classification, regression, or both).

## Key Research Questions
- What factors predict accident severity?
- Can we identify high-risk time periods or locations?
- What weather conditions correlate with increased accident rates?
- Which vehicle types or road conditions are associated with serious injuries?

## 1. Setup and Data Loading

### 1.1 Import Libraries

In [4]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
import warnings

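# Suppress all warnings to keep the notebook output clean
# (note: this also hides deprecation notices that may be worth reading)
warnings.filterwarnings('ignore')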

# Set visualization style
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (12, 6)
plt.rcParams['font.size'] = 10

print("✓ Libraries imported successfully")

✓ Libraries imported successfully


### 1.2 Load Prepared Dataset

In [5]:
# Load the final prepared dataset.
# NOTE: this absolute path is machine-specific; adjust it to your local checkout.
#data_folder = "../../data"
data_path = "C:\\Python\\ics5110-assignment\\data\\final\\data.csv"
df = pd.read_csv(data_path)

print(f"Dataset shape: {df.shape[0]} rows, {df.shape[1]} columns")
print(f"\nColumn names:\n{df.columns.tolist()}")

Dataset shape: 219 rows, 97 columns

Column names:
['id', 'title', 'content', 'date_published', 'accident_datetime', 'street', 'city', 'A_accident_severity', 'A_driver_age', 'A_driver_gender', 'A_number_injured', 'A_vehicle_type', 'B_accident_severity', 'B_driver_age', 'B_driver_gender', 'B_number_injured', 'B_vehicle_type', 'C_accident_severity', 'C_driver_age', 'C_driver_gender', 'C_number_injured', 'C_vehicle_type', 'D_accident_severity', 'D_driver_age', 'D_driver_gender', 'D_number_injured', 'D_vehicle_type', 'accident_date_id', 'accident_severity', 'total_injured', 'C_city', 'C_street', 'accident_time_category', 'accident_time_afternoon', 'accident_time_early_morning', 'accident_time_evening', 'accident_time_late_evening', 'accident_time_morning', 'accident_time_night', 'driver_under_18', 'driver_18_to_24', 'driver_25_to_49', 'driver_50_to_64', 'driver_65_plus', 'num_drivers_under_18', 'num_drivers_18_to_24', 'num_drivers_25_to_49', 'num_drivers_50_to_64', 'num_drivers_65_plus', '
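
The path in the loading cell is an absolute Windows path, so it only resolves on a machine with the same directory layout. A minimal sketch of a more portable alternative follows. It assumes that the commented-out `../../data` folder is the intended relative location and that the file sits under `final/data.csv` inside it; both are guesses from the cell above, and `resolve_data_path` is a hypothetical helper, not part of the project.

```python
from pathlib import Path


def resolve_data_path(data_folder: str = "../../data",
                      relative_file: str = "final/data.csv") -> Path:
    """Build the CSV path relative to the working directory (hypothetical helper)."""
    return Path(data_folder) / relative_file


candidate_path = resolve_data_path()

# Check for the file up front so a wrong folder gives a clear message
if candidate_path.is_file():
    print(f"Found dataset at {candidate_path.resolve()}")
else:
    print(f"No file at {candidate_path}; adjust data_folder for your checkout")
```

If the file is found, `candidate_path` can be passed to `pd.read_csv` in place of the hard-coded string.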

In [6]:
# Keep only the columns listed below; this drops the one-hot encoded columns
# (e.g. accident_time_*, driver_*)
columns_to_keep = [
    'id', 'title', 'content', 'date_published', 'accident_datetime',
    'A_accident_severity', 'A_driver_age', 'A_driver_gender', 'A_number_injured', 'A_vehicle_type',
    'B_accident_severity', 'B_driver_age', 'B_driver_gender', 'B_number_injured', 'B_vehicle_type',
    'C_accident_severity', 'C_driver_age', 'C_driver_gender', 'C_number_injured', 'C_vehicle_type',
    'D_accident_severity', 'D_driver_age', 'D_driver_gender', 'D_number_injured', 'D_vehicle_type',
    'accident_severity', 'total_injured', 'C_city', 'C_street', 'accident_time_category',
    'num_drivers_total', 'vehicle_pedestrian', 'vehicle_bicycle', 'vehicle_motorbike',
    'vehicle_car', 'vehicle_van', 'vehicle_bus', 'num_vehicle_unknown',
    'num_vehicle_pedestrian', 'num_vehicle_bicycle', 'num_vehicle_motorbike',
    'num_vehicle_car', 'num_vehicle_van', 'num_vehicle_bus',
    'is_weekend', 'is_public_holiday_mt', 'is_school_holiday_mt', 'is_school_day_mt',
    'street_type', 'region', 'temperature_max', 'temperature_min', 'temperature_mean',
    'precipitation_sum', 'windspeed_max', 'is_raining', 'traffic_ratio', 'traffic_level'
]

# .copy() makes the subset an independent DataFrame rather than a view of the original
df = df[columns_to_keep].copy()

print(f"✓ Dataset filtered to {len(columns_to_keep)} columns")
print(f"New shape: {df.shape[0]} rows, {df.shape[1]} columns")

✓ Dataset filtered to 58 columns
New shape: 219 rows, 58 columns
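
Selecting with `df[columns_to_keep]` raises a `KeyError` if any listed name is absent from the DataFrame, which is easy to trigger with a typo in a 58-item list. The sketch below shows one way to check the list first. `split_columns` is a hypothetical helper and the inputs are illustrative stand-ins (`not_a_column` is made up); in the notebook the arguments would be `columns_to_keep` and `df.columns`.

```python
from typing import Iterable, List, Tuple


def split_columns(requested: Iterable[str],
                  available: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Split `requested` into (present, absent) relative to `available`, keeping order."""
    available_set = set(available)
    requested_list = list(requested)
    present = [c for c in requested_list if c in available_set]
    absent = [c for c in requested_list if c not in available_set]
    return present, absent


# Illustrative inputs only; "not_a_column" is a deliberately wrong name
requested_cols = ["id", "accident_severity", "traffic_level", "not_a_column"]
available_cols = ["id", "title", "accident_severity", "traffic_level"]

present, absent = split_columns(requested_cols, available_cols)
print(f"Present: {present}")
print(f"Absent:  {absent}")
```

Reporting the absent names before filtering makes the failure explicit, rather than leaving it to the `KeyError` raised by the selection itself.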


In [8]:
# Check for missing values
print("=== Missing Values Analysis ===")
missing_data = df.isnull().sum()
missing_data = missing_data[missing_data > 0].sort_values(ascending=False)

if len(missing_data) > 0:
    missing_percent = (missing_data / len(df) * 100).round(2)
    missing_df = pd.DataFrame({
        'Missing Count': missing_data,
        'Percentage': missing_percent
    })
    display(missing_df)
else:
    print("No missing values found in the dataset!")

=== Missing Values Analysis ===


| Column | Missing Count | Percentage (%) |
| --- | --- | --- |
| D_accident_severity | 217 | 99.09 |
| D_driver_age | 217 | 99.09 |
| D_driver_gender | 217 | 99.09 |
| D_number_injured | 217 | 99.09 |
| D_vehicle_type | 217 | 99.09 |
| C_accident_severity | 208 | 94.98 |
| C_driver_age | 208 | 94.98 |
| C_driver_gender | 208 | 94.98 |
| C_number_injured | 208 | 94.98 |
| C_vehicle_type | 208 | 94.98 |
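
The `C_*` and `D_*` columns listed above are roughly 95% to 99% missing, presumably because few accidents involve a third or fourth party, though the notebook does not confirm this. A minimal sketch of flagging such sparse columns by a missingness threshold follows. `sparse_columns` is a hypothetical helper, the 0.7 threshold is arbitrary, and the toy frame contains made-up values rather than rows from the real dataset.

```python
import numpy as np
import pandas as pd


def sparse_columns(frame: pd.DataFrame, threshold: float = 0.9) -> list:
    """Return names of columns whose share of missing values exceeds `threshold`."""
    missing_share = frame.isnull().mean()  # fraction of NaN per column
    return missing_share[missing_share > threshold].index.tolist()


# Toy frame standing in for the real data (values are invented for illustration)
toy = pd.DataFrame({
    "A_driver_age": [34, 51, 27, 45],
    "C_driver_age": [np.nan, np.nan, np.nan, np.nan],
    "D_driver_age": [np.nan, np.nan, np.nan, 62],
})

to_drop = sparse_columns(toy, threshold=0.7)
toy_reduced = toy.drop(columns=to_drop)
print(f"Columns above threshold: {to_drop}")
```

Whether to drop these columns, or instead summarise them into a single indicator such as the number of parties involved, depends on the modeling question; the sketch only identifies the candidates.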
