# **<span style="color:black;">UK Dataset: Loading and Cleaning</span>**
## **Purpose of Notebook**
Preparing the raw data for EDA: merging datasets, cleaning, dropping columns, filling blanks with 'missing', fixing formats, etc.
## **Problem Statement and Sourcing the Dataset**
Datasets are from:  https://www.gov.uk/government/publications/road-accidents-and-safety-statistics-notes-and-definitions/road-casualty-statistics-overview-and-coverage
Two different datasets are used:
* Vehicle Incident Data
* Collision Incident Data
The collision and vehicle data are split across two datasets, so a merge on the accident index is needed to recover the fields that can also be found in the US dataset.
---

## **Contents of notebook**
### **Section 1: Loading and Merging Dataset**
```python 
- [x] Loading the datasets
- [x] Getting Basic Information
- [x] Checking How Much missing data before merging
- [x] Finding Primary Keys and Merging Dataset
- [x] Reduce the report versions to a single row per accident
```
### **Section 2: Dropping Unneeded Columns**
```python 
- [x] Combine connected columns
- [x] Drop unneeded columns
```
### **Section 3: Handling missingness**
```python 
- [x] Checking unknown vs missing data
- [x] Which columns have the most missing data
- [x] Unknown and missing combined
```
### **Section 4: Fixing Formats and Standardising**
```python 
- [x] Convert Datetime
- [x] Convert Numerical and String Columns
- [x] Renaming Columns and Values to match STATS19
- [x] Finding Primary Keys and Merging Dataset
```
### **Section 5: Save to File**
---

**Import Necessary Libraries and Modules**

In [20]:
import pandas as pd 
import itertools # For finding combinations for primary key
from tabulate import tabulate # For pretty printing tables
import time
from tqdm import tqdm # to see progress for how long it will take
import reverse_geocode
import matplotlib.pyplot as plt 
from datetime import timedelta # to convert time and date

## **<span style="color:black;">Section 1: Loading and Merging Datasets</span>**

**Loading the datasets**
- Change the path when reusing this notebook

In [21]:
df_collisions = pd.read_csv('/Users/mahnooriqbal/COMP702 Project/ML-AI-Risk-Analysis-AV-Data-/Datasets/UK/dft-road-casualty-statistics-collision-last-5-years.csv')
df_vehicles = pd.read_csv('/Users/mahnooriqbal/COMP702 Project/ML-AI-Risk-Analysis-AV-Data-/Datasets/UK/dft-road-casualty-statistics-vehicle-last-5-years.csv')

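Because the paths above are absolute and machine-specific, one possible way to make the cell reusable is to build the file names from a single base directory. The following is only a sketch: `DATA_DIR` and `dataset_path` are hypothetical names that do not appear in the original notebook, and `low_memory=False` is an optional choice that makes pandas infer dtypes from the whole file at once.

```python
from pathlib import Path

import pandas as pd

# Hypothetical location; point this at wherever the DfT CSVs are stored.
DATA_DIR = Path("Datasets") / "UK"


def dataset_path(kind: str, data_dir: Path = DATA_DIR) -> Path:
    """Build the path of the 'collision' or 'vehicle' five-year CSV."""
    return data_dir / f"dft-road-casualty-statistics-{kind}-last-5-years.csv"


collision_path = dataset_path("collision")
vehicle_path = dataset_path("vehicle")

# Only load when both files are present, so the cell fails softly elsewhere.
if collision_path.exists() and vehicle_path.exists():
    df_collisions = pd.read_csv(collision_path, low_memory=False)
    df_vehicles = pd.read_csv(vehicle_path, low_memory=False)
else:
    print(f"CSV files not found under {DATA_DIR.resolve()}")
```

With this layout, only `DATA_DIR` needs to change when the notebook is reused on another machine.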


**Finding the Primary Keys**
- Extremely important for merging, to ensure there are no duplicates
- Otherwise we would need to add a key

In [22]:
def find_primary_keys(df):
    """
       Purpose: Finding the primary keys in each dataframe 
       Methods:
         - Check each row for this column is unique (count of column vals = number of rows)
         - If not, then use itertools to experiment with every combination to find what combination is unique
       Input: Dataframe 
       Output: Keys within a string for easy printing
       References: 
       - finding combination https://docs.python.org/3/library/itertools.html
       - how to find primary key: https://medium.com/@tkprof.h/find-uniq-conbinations-of-fields-from-csv-file-32a460e775fb
    """
    # Check for single primary key
    for col in df.columns:
        if df[col].is_unique:
            return f"Primary key: ({col})"
    
    # Loop to check every combination
    for i in range(2, len(df.columns) + 1): # keep increasing if no primary key found for double, triple, etc.
        for combo in itertools.combinations(df.columns, i):
            if df[list(combo)].drop_duplicates().shape[0] == df.shape[0]: # check if count = total rows in df
                return f"Primary key: {combo}"
            
    return "No primary key found" # just in case there is no primary key

# Calling Function
print("COLLISIONS: ", find_primary_keys(df_collisions))
print("VEHICLES: ", find_primary_keys(df_vehicles))

COLLISIONS:  Primary key: (accident_index)
VEHICLES:  Primary key: ('accident_index', 'vehicle_reference')


**General Basic Information**

In [23]:
# Basic Information about the datasets
"""
References:
- Plot table: https://amrinarosyd.medium.com/prettytable-vs-tabulate-which-should-you-use-e9054755f170
"""
# Use the keys 
collision_keys = ['accident_index'] # works for all 
vehicle_keys = ['accident_index', 'vehicle_reference']
data = {
    "Dataset": ["COLLISIONS",
                "VEHICLES"],

    "Shape": [df_collisions.shape,
              df_vehicles.shape],

    "key": [ # ensure primary key is unique
    df_collisions[collision_keys].drop_duplicates().shape[0],
    df_vehicles[vehicle_keys].drop_duplicates().shape[0]],

    "Duplicates": [df_collisions.duplicated(subset=collision_keys).sum(),
                   df_vehicles.duplicated(subset=vehicle_keys).sum()]}
df_info = pd.DataFrame(data)
print(tabulate(df_info, headers='keys', tablefmt='pretty')) # simple table to easily see

+---+------------+--------------+--------+------------+
|   |  Dataset   |    Shape     |  key   | Duplicates |
+---+------------+--------------+--------+------------+
| 0 | COLLISIONS | (520084, 37) | 520084 |     0      |
| 1 |  VEHICLES  | (953559, 34) | 953559 |     0      |
+---+------------+--------------+--------+------------+


**Checking how much data is missing**

In [24]:
# Missing values percentage overall per dataset
def missing_percentage(df):
        """
        Purpose: Calculate the percentage of missing values in a DataFrame.
        Input: DataFrame
        Output: Percentage of missing values in the DataFrame
        Reference:
        - finding missing values in df https://medium.com/@denizgunay/missing-values-6742e535196b
        """
        total_cells = df.size
        total_missing = df.isnull().sum().sum()
        return (total_missing / total_cells) * 100

print("Missing Values Percentage")
print(f"Collision missing: {missing_percentage(df_collisions):.2f}%")
print(f"Vehicles missing: {missing_percentage(df_vehicles):.2f}%")

Missing Values Percentage
Collision missing: 0.00%
Vehicles missing: 10.85%


**How much missingness per column**
- Columns with an extremely high share of missing values must be dropped, as there is not enough raw information for imputation

In [25]:
"""
References: 
- Finding missing per col to drop: https://stackoverflow.com/questions/51070985/find-out-the-percentage-of-missing-values-in-each-column-in-the-given-dataset
"""


missing_by_column = (df_vehicles.isnull().mean() * 100).sort_values(ascending=False) # high to low; named so it does not overwrite the missing_percentage() function
print("\nMissing Values Percentage by Column:")
print(missing_by_column.to_string(float_format="%.2f%%")) # 2dp


Missing Values Percentage by Column:
dir_to_n                           92.25%
dir_to_e                           92.25%
dir_from_n                         92.14%
dir_from_e                         92.14%
accident_index                      0.00%
age_of_vehicle                      0.00%
age_of_driver                       0.00%
age_band_of_driver                  0.00%
engine_capacity_cc                  0.00%
propulsion_code                     0.00%
driver_home_area_type               0.00%
generic_make_model                  0.00%
driver_imd_decile                   0.00%
accident_year                       0.00%
lsoa_of_driver                      0.00%
escooter_flag                       0.00%
sex_of_driver                       0.00%
journey_purpose_of_driver           0.00%
vehicle_left_hand_drive             0.00%
first_point_of_impact               0.00%
hit_object_off_carriageway          0.00%
vehicle_leaving_carriageway         0.00%
hit_object_in_carriageway           0.
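The rule of dropping very sparse columns can be written as a small helper. This is a minimal sketch on a toy frame: the function name `drop_sparse_columns` and the 50% cut-off are assumptions for illustration, not values taken from the notebook.

```python
import pandas as pd


def drop_sparse_columns(df: pd.DataFrame, max_missing_pct: float = 50.0) -> pd.DataFrame:
    """Keep only the columns whose share of missing values is at or below the threshold."""
    missing_pct = df.isnull().mean() * 100
    keep = missing_pct[missing_pct <= max_missing_pct].index
    return df[keep]


# Toy data: 'dir_to_n' is 75% missing, 'age_of_vehicle' is 25% missing.
demo = pd.DataFrame({
    "accident_index": ["a", "b", "c", "d"],
    "dir_to_n": [None, None, None, 1.0],
    "age_of_vehicle": [3, 5, None, 7],
})
trimmed = drop_sparse_columns(demo, max_missing_pct=50.0)
print(list(trimmed.columns))
```

On the real vehicle data, a threshold like this would remove the four `dir_*` columns, which are each over 92% missing.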

**Merging the vehicle and collision datasets on the shared key**

In [26]:
# common key = 'accident_index' for merging
"""
References:
- merging dataset: https://jakevdp.github.io/PythonDataScienceHandbook/03.07-merge-and-join.html
"""
merged_df = pd.merge(df_collisions, df_vehicles, on='accident_index', how='inner',suffixes=('', '_drop'))
print(tabulate(merged_df.head(10), headers='keys', tablefmt='grid', showindex=True))

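When merging a one-row-per-collision table with a many-rows-per-collision table, `pd.merge` can check the relationship and report unmatched rows. The sketch below uses toy frames rather than the real data; `validate` and `indicator` are standard `pd.merge` parameters, while the frame contents are made up for illustration.

```python
import pandas as pd

collisions = pd.DataFrame({
    "accident_index": ["A1", "A2", "A3"],
    "speed_limit": [30, 40, 60],
})
vehicles = pd.DataFrame({
    "accident_index": ["A1", "A1", "A2", "A9"],
    "vehicle_reference": [1, 2, 1, 1],
})

checked = pd.merge(
    collisions,
    vehicles,
    on="accident_index",
    how="outer",             # keep unmatched rows so they can be counted
    validate="one_to_many",  # raises MergeError if accident_index repeats on the left
    indicator=True,          # adds a '_merge' column: both / left_only / right_only
)
match_counts = checked["_merge"].value_counts()
print(match_counts)
```

Rows marked `left_only` or `right_only` are the ones an inner merge silently discards, which may help explain why the merged frame has fewer rows than the vehicle table.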

**Checking Overall Shape after merge and reducing duplicates**

In [27]:
print("Shape of df_collisions:", df_collisions.shape)
print("Shape of df_vehicles:", df_vehicles.shape)
print("Shape of df_merged:", merged_df.shape)
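merged_df = merged_df.drop_duplicates(subset='accident_index') # keeps only the first vehicle row for each accident_index
print("Removed Duplicates:", merged_df.shape) # one row per collision remains (report versions reduced to a single one)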

Shape of df_collisions: (520084, 37)
Shape of df_vehicles: (953559, 34)
Shape of df_merged: (768171, 70)
Removed Duplicates: (419478, 70)
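To see what the de-duplication step keeps, it can help to count vehicle rows per collision first. The following sketch uses a toy frame with made-up values; `drop_duplicates` defaults to `keep='first'`, so only the first vehicle listed for each `accident_index` survives.

```python
import pandas as pd

merged = pd.DataFrame({
    "accident_index": ["A1", "A1", "A2", "A3", "A3", "A3"],
    "vehicle_reference": [1, 2, 1, 1, 2, 3],
    "vehicle_type": [9, 1, 9, 19, 9, 9],
})

# Number of vehicle rows attached to each collision before de-duplication.
vehicles_per_collision = merged.groupby("accident_index").size()

# Same call as in the notebook: the first row per accident_index is kept.
first_only = merged.drop_duplicates(subset="accident_index")

print(vehicles_per_collision.to_dict())
print(first_only[["accident_index", "vehicle_reference", "vehicle_type"]])
```

In the toy data, collision `A3` involves two cars, but the retained row is the goods vehicle (code 19), so a later car-only filter would drop that collision entirely. Whether that matters depends on the analysis.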


## **<span style="color:black;">Section 2: Decoding and Dropping Unneeded Columns, Filtering Vehicles</span>**
- Using the data glossary to decode each numeric code into its corresponding text value

**Removing Completely Unneeded Columns**

In [28]:
columns_to_keep = ['accident_index', 'accident_reference', 'latitude', 'longitude', 'location_easting_osgr', 'location_northing_osgr',
                    'accident_severity', 'date', 'time', 'road_type', 'speed_limit', 
                    'light_conditions', 'vehicle_type','age_of_vehicle', 
                    'weather_conditions', 'road_surface_conditions', 'vehicle_reference', 'vehicle_manoeuvre', 'hit_object_in_carriageway',
                        'hit_object_off_carriageway', 'first_point_of_impact', 'generic_make_model']

merged_df = merged_df[columns_to_keep]

**Mapping the corresponding values**

In [29]:
"""
Process: Mapping the names of each value given in glossary so EDA can be understood
Reference:
- how to rename and map new values: https://www.geeksforgeeks.org/python/python-mapping-key-values-to-dictionary/
"""

merged_df = merged_df.fillna(-1) # fill the empty with -1 and then change to missing
accident_severity_mapping = {
    1: "Fatal",
    2: "Serious",
    3: "Slight"
}

road_type_mapping = {
    1: "Roundabout",
    2: "One way street",
    3: "Dual carriageway",
    6: "Single carriageway",
    7: "Slip road",
    9: "Unknown",
    12: "One way street/Slip road",
    -1: "Missing"
}

light_conditions_mapping = {
    1: "Daylight",
    4: "Darkness - lights lit",
    5: "Darkness - lights unlit",
    6: "Darkness - no lighting",
    7: "Darkness - lighting unknown",
    -1: "Missing"
}

weather_conditions_mapping = {
    1: "Fine no high winds",
    2: "Raining no high winds",
    3: "Snowing no high winds",
    4: "Fine + high winds",
    5: "Raining + high winds",
    6: "Snowing + high winds",
    7: "Fog or mist",
    8: "Other",
    9: "Unknown",
    -1: "Missing"
}

road_surface_conditions_mapping = {
    1: "Dry",
    2: "Wet or damp",
    3: "Snow",
    4: "Frost or ice",
    5: "Flood over 3cm. deep",
    6: "Oil or diesel",
    7: "Mud",
    9: "Unknown",
    -1: "Missing"
}

vehicle_type_mapping = {
    1: "Pedal cycle",
    2: "Motorcycle 50cc and under",
    3: "Motorcycle 125cc and under",
    4: "Motorcycle over 125cc and up to 500cc",
    5: "Motorcycle over 500cc",
    8: "Taxi/Private hire car",
    9: "Car",
    10: "Minibus (8 - 16 passenger seats)",
    11: "Bus or coach (17 or more pass seats)",
    16: "Ridden horse",
    17: "Agricultural vehicle",
    18: "Tram",
    19: "Van / Goods 3.5 tonnes mgw or under",
    20: "Goods over 3.5t. and under 7.5t",
    21: "Goods 7.5 tonnes mgw and over",
    22: "Mobility scooter",
    23: "Electric motorcycle",
    90: "Other vehicle",
    97: "Motorcycle - unknown cc",
    98: "Goods vehicle - unknown weight",
    99: "Unknown",
    -1: "Missing"
}

vehicle_manoeuvre_mapping = {
    1: "Reversing",
    2: "Parked",
    3: "Waiting to go - held up",
    4: "Slowing or stopping",
    5: "Moving off",
    6: "U-turn",
    7: "Turning left",
    8: "Waiting to turn left",
    9: "Turning right",
    10: "Waiting to turn right",
    11: "Changing lane to left",
    12: "Changing lane to right",
    13: "Overtaking moving vehicle - offside",
    14: "Overtaking static vehicle - offside",
    15: "Overtaking - nearside",
    16: "Going ahead left-hand bend",
    17: "Going ahead right-hand bend",
    18: "Going ahead other",
    99: "Unknown",
    -1: "Missing"
}

hit_object_in_carriageway_mapping = {
    0: "None",
    1: "Previous accident",
    2: "Road works",
    4: "Parked vehicle",
    5: "Bridge (roof)",
    6: "Bridge (side)",
    7: "Bollard or refuge",
    8: "Open door of vehicle",
    9: "Central island of roundabout",
    10: "Kerb",
    11: "Other object",
    12: "Any animal (except ridden horse)",
    99: "Unknown",
    -1: "Missing"
}

hit_object_off_carriageway_mapping = {
    0: "None",
    1: "Road sign or traffic signal",
    2: "Lamp post",
    3: "Telegraph or electricity pole",
    4: "Tree",
    5: "Bus stop or bus shelter",
    6: "Central crash barrier",
    7: "Near/Offside crash barrier",
    8: "Submerged in water",
    9: "Entered ditch",
    10: "Other permanent object",
    11: "Wall or fence",
    99: "Unknown",
    -1: "Missing"
}

first_point_of_impact_mapping = {
    0: "Did not impact",
    1: "Front",
    2: "Back",
    3: "Offside",
    4: "Nearside",
    9: "Unknown",
    -1: "Missing"
}

# Actually apply the mappings 
merged_df["accident_severity"] = merged_df["accident_severity"].map(accident_severity_mapping)
merged_df["road_type"] = merged_df["road_type"].map(road_type_mapping)
merged_df["light_conditions"] = merged_df["light_conditions"].map(light_conditions_mapping)
merged_df["weather_conditions"] = merged_df["weather_conditions"].map(weather_conditions_mapping)
merged_df["road_surface_conditions"] = merged_df["road_surface_conditions"].map(road_surface_conditions_mapping)
merged_df["vehicle_type"] = merged_df["vehicle_type"].map(vehicle_type_mapping)
merged_df["vehicle_manoeuvre"] = merged_df["vehicle_manoeuvre"].map(vehicle_manoeuvre_mapping)
merged_df["hit_object_in_carriageway"] = merged_df["hit_object_in_carriageway"].map(hit_object_in_carriageway_mapping)
merged_df["hit_object_off_carriageway"] = merged_df["hit_object_off_carriageway"].map(hit_object_off_carriageway_mapping)
merged_df["first_point_of_impact"] = merged_df["first_point_of_impact"].map(first_point_of_impact_mapping)

# Checking if the mapping worked
print("First 10 Rows of df_merged after mapping:")
print(tabulate(merged_df.head(10), headers='keys', tablefmt='grid', showindex=True))

First 10 Rows of df_merged after mapping:
+----+------------------+----------------------+------------+-------------+-------------------------+--------------------------+---------------------+------------+--------+--------------------+---------------+-----------------------+-----------------------+------------------+----------------------+---------------------------+---------------------+-------------------------+-----------------------------+------------------------------+-------------------------+----------------------+
|    |   accident_index |   accident_reference |   latitude |   longitude |   location_easting_osgr |   location_northing_osgr | accident_severity   | date       | time   | road_type          |   speed_limit | light_conditions      | vehicle_type          |   age_of_vehicle | weather_conditions   | road_surface_conditions   |   vehicle_reference | vehicle_manoeuvre       | hit_object_in_carriageway   | hit_object_off_carriageway   | first_point_of_impact   | generic_m
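`Series.map` with a dictionary returns `NaN` for any value that has no key in the dictionary, so a code missing from a mapping disappears silently. A quick check before mapping can list such codes. This is a sketch on a toy column; `unmapped_codes` is a hypothetical helper name.

```python
import pandas as pd


def unmapped_codes(series: pd.Series, mapping: dict) -> list:
    """Return the sorted codes present in the column but absent from the mapping."""
    return sorted(set(series.dropna().unique()) - set(mapping))


# Toy column: -1 is the fill value for blanks, but this mapping has no -1 entry.
severity = pd.Series([1, 2, 3, 3, -1])
accident_severity_mapping = {1: "Fatal", 2: "Serious", 3: "Slight"}

missing_codes = unmapped_codes(severity, accident_severity_mapping)
decoded = severity.map(accident_severity_mapping)

print("Codes without a label:", missing_codes)
print("NaN values after mapping:", int(decoded.isna().sum()))
```

Running the helper against each column and its dictionary before the `.map` calls would show which codes, if any, need an extra entry.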

**Only Keeping Car Crashes**
- Roughly 297,000 rows remain after the filter (about 122,000 removed); only cars are kept because the US dataset only has car data

In [30]:
merged_df = merged_df[merged_df['vehicle_type'] == 'Car'] # only keep cars as only cars within the US dataset
print("\nFiltered DataFrame (Only Cars):")
print(tabulate(merged_df.head(10), headers='keys', tablefmt='grid', showindex=True))
print("Shape of car data:", merged_df.shape)


Filtered DataFrame (Only Cars):
+----+------------------+----------------------+------------+-------------+-------------------------+--------------------------+---------------------+------------+--------+--------------------+---------------+-----------------------+----------------+------------------+----------------------+---------------------------+---------------------+-------------------------+-----------------------------+------------------------------+-------------------------+----------------------+
|    |   accident_index |   accident_reference |   latitude |   longitude |   location_easting_osgr |   location_northing_osgr | accident_severity   | date       | time   | road_type          |   speed_limit | light_conditions      | vehicle_type   |   age_of_vehicle | weather_conditions   | road_surface_conditions   |   vehicle_reference | vehicle_manoeuvre       | hit_object_in_carriageway   | hit_object_off_carriageway   | first_point_of_impact   | generic_make_model   |
|  0 |   

## **<span style="color:black;">Section 3: Assessing Missing Values</span>**

**Checking out how much missing data**

In [31]:
def count_and_sum_missing_values(df):
    """
    Purpose: Count and sum missing, empty, NaN, and 'unknown' values for each column in the DataFrame.
    Input: DataFrame
    Output: Prints a summary table with the counts and percentages of missing values.
    References:
    - finding nan values count again|: https://www.geeksforgeeks.org/python/how-to-count-the-number-of-nan-values-in-pandas/
    - list and dictonary to plot table: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_values.html
    - check if the named unknown in each col: https://stackoverflow.com/questions/33788913/pythonic-efficient-way-to-strip-whitespace-from-every-pandas-data-frame-cell-tha

    """
    overall_summary = [] # list to store the missing values
    for col in df.columns:
        null_count = df[col].isnull().sum()  # Count missing (NaN/None) values
        empty_count = (df[col] == '').sum()  # Count empty string values
        nan_count = df[col].apply(lambda x: pd.isna(x)).sum()  # same check as null_count, kept as a separate table column
        unknown_count = df[col].apply(lambda x: str(x).strip().lower() == 'unknown').sum()  # Count values named 'unknown'
        total = null_count + empty_count + unknown_count  # nan_count is excluded so NaN values are not counted twice
        percentage = (total / len(df)) * 100  # Calculate percentage
        overall_summary.append((col, null_count, empty_count, nan_count, unknown_count, total, percentage))

    # dataframe to hold Column, Null, Empty, NaN and Unknown counts
    summary_df = pd.DataFrame(overall_summary, columns=['Column', 'Null', 'Empty', 'NaN', 'Unknown', 'Total', 'Percentage']) # keys
    summary_df = summary_df.sort_values(by='Percentage', ascending=False)
    print(tabulate(summary_df, headers='keys', tablefmt='grid')) # make table to clearly see the results with tabulate

count_and_sum_missing_values(merged_df) # Call the function on the dataframe find how many missing and drop those rows

+----+----------------------------+--------+---------+-------+-----------+---------+--------------+
|    | Column                     |   Null |   Empty |   NaN |   Unknown |   Total |   Percentage |
| 17 | vehicle_manoeuvre          |      0 |       0 |     0 |     12128 |   12128 |     4.08297  |
+----+----------------------------+--------+---------+-------+-----------+---------+--------------+
| 18 | hit_object_in_carriageway  |      0 |       0 |     0 |     12001 |   12001 |     4.04021  |
+----+----------------------------+--------+---------+-------+-----------+---------+--------------+
| 19 | hit_object_off_carriageway |      0 |       0 |     0 |     11762 |   11762 |     3.95975  |
+----+----------------------------+--------+---------+-------+-----------+---------+--------------+
| 14 | weather_conditions         |      0 |       0 |     0 |      7295 |    7295 |     2.45591  |
+----+----------------------------+--------+---------+-------+-----------+---------+--------------+
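The same summary can be computed without a Python loop over columns. The sketch below is one possible vectorised variant, not the notebook's function: `missing_summary` and its `placeholders` argument are hypothetical, and it also counts the label 'missing', which the function above does not.

```python
import pandas as pd


def missing_summary(df: pd.DataFrame, placeholders=("unknown", "missing")) -> pd.DataFrame:
    """Count nulls, empty strings and placeholder labels per column."""
    # Normalise every cell to stripped, lower-case text for the label checks.
    as_text = df.astype(str).apply(lambda col: col.str.strip().str.lower())
    summary = pd.DataFrame({
        "Null": df.isnull().sum(),
        "Empty": (as_text == "").sum(),
        "Placeholder": as_text.isin(list(placeholders)).sum(),
    })
    summary["Total"] = summary.sum(axis=1)
    summary["Percentage"] = summary["Total"] / len(df) * 100
    return summary.sort_values("Percentage", ascending=False)


demo = pd.DataFrame({
    "weather": ["Clear", "Unknown", " unknown ", None],
    "road": ["Dry", "Wet", "Missing", ""],
})
summary = missing_summary(demo)
print(summary)
```

Each cell is counted in at most one of the three categories, so the total cannot exceed the number of rows.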


**Drop all rows with unknown values, as each affected column has <5% unknown**

In [32]:
merged_df.replace(['Unknown', 'Missing','-1'], pd.NA, inplace=True) # make all missing values nan
merged_df.dropna(inplace=True) # then drop all the nan
merged_df.drop(['location_easting_osgr', 'location_northing_osgr'], axis=1, inplace=True) # dropping not useful for plotting map


# Display the updated DataFrame
print("Shape of DataFrame after dropping rows with missing values:", merged_df.shape)
count_and_sum_missing_values(merged_df)

Shape of DataFrame after dropping rows with missing values: (231021, 20)
+----+----------------------------+--------+---------+-------+-----------+---------+--------------+
|    | Column                     |   Null |   Empty |   NaN |   Unknown |   Total |   Percentage |
|  0 | accident_index             |      0 |       0 |     0 |         0 |       0 |            0 |
+----+----------------------------+--------+---------+-------+-----------+---------+--------------+
|  1 | accident_reference         |      0 |       0 |     0 |         0 |       0 |            0 |
+----+----------------------------+--------+---------+-------+-----------+---------+--------------+
| 18 | first_point_of_impact      |      0 |       0 |     0 |         0 |       0 |            0 |
+----+----------------------------+--------+---------+-------+-----------+---------+--------------+
| 17 | hit_object_off_carriageway |      0 |       0 |     0 |         0 |       0 |            0 |
+----+---------------------

## **<span style="color:black;">Section 4: Fixing Formats</span>**


**Convert Longitude and Latitude into City, State and Country**
- this is because city and state do not exist in this dataset, but longitude and latitude do
- additionally, the broader state is preferred over exact coordinates, to avoid breaching data ethics

In [33]:
def get_location_offline(row):
    """
    process: function to get city, state, country from latitude/longitude using offline reverse_geocode
    References (for reverse geocoding):
    - https://stackoverflow.com/questions/69409255/how-to-get-city-state-and-country-from-a-list-of-latitude-and-longitude-coordi
    - https://stackoverflow.com/questions/59736682/find-nearest-location-coordinates-in-land-using-python
    """
    coords = (row['latitude'], row['longitude'])
    location = reverse_geocode.search([coords])[0]  # returns a dict
    return pd.Series([location['city'], location['state'], location['country']])

# Apply function to whole DataFrame
merged_df[['city', 'state', 'country']] = merged_df.apply(get_location_offline, axis=1) # call function for columns
print(merged_df)

       accident_index accident_reference   latitude  longitude  \
15      2019010155206           10155206  51.572326   0.141957   
23      2019010155217           10155217  51.557075   0.070277   
105     2019010155425           10155425  51.561751  -0.078765   
620     2019010156538           10156538  51.545376   0.106077   
733     2019010156721           10156721  51.396245  -0.069305   
...               ...                ...        ...        ...   
768154  2023991335556          991335556  55.704920  -3.211116   
768160  2023991335643          991335643  55.870252  -4.158079   
768162  2023991335835          991335835  55.479619  -4.570310   
768164  2023991335837          991335837  55.792841  -4.069439   
768166  2023991335842          991335842  55.805611  -4.381678   

       accident_severity        date   time           road_type  speed_limit  \
15                Slight  01/01/2019  01:15  Single carriageway           30   
23               Serious  01/01/2019  04:10    
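Since `reverse_geocode.search` accepts a list of coordinate pairs, the lookup could also be done in one call rather than once per row. The sketch below assumes that behaviour and wraps it in a hypothetical helper, `lookup_places`, whose `search` argument can be swapped for a stand-in. The stand-in used here returns made-up place names purely so the example runs without the third-party package.

```python
def lookup_places(coords, search=None):
    """Return (city, state, country) tuples for an iterable of (lat, lon) pairs."""
    if search is None:
        import reverse_geocode  # third-party; imported lazily so the helper works without it
        search = reverse_geocode.search
    results = search(list(coords))
    # .get() tolerates result dicts that lack a 'state' key.
    return [(r.get("city"), r.get("state"), r.get("country")) for r in results]


def fake_search(coords):
    """Stand-in for reverse_geocode.search that returns placeholder names."""
    return [{"city": "Example City", "country": "Example Country"} for _ in coords]


places = lookup_places([(51.572326, 0.141957), (51.557075, 0.070277)], search=fake_search)
print(places)
```

In the notebook, the coordinates could be passed as `zip(merged_df['latitude'], merged_df['longitude'])`, and the result wrapped in a DataFrame that reuses `merged_df.index` before being assigned to the three new columns.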

**Converting string, float and date columns, splitting the make and model column, finding model year using age**

In [34]:
"""
References:
- converting to datetime https://stackoverflow.com/questions/38067704/how-to-change-the-datetime-format-in-pandas
- converting to datetime https://pandas.pydata.org/docs/reference/api/pandas.to_datetime.html
- reading all dtypes for each col: https://medium.com/@amit25173/understanding-pandas-dataframe-dtypes-0a2f75eb8a38#:~:text=In%20simple%20terms%2C%20dtypes%20tells,you%20immediately%20see%20what's%20inside.
- finding just the year https://stackoverflow.com/questions/16103238/pandas-timedelta-in-days
"""

# ALL to string first for ease
merged_df = merged_df.astype(str)

# Numerical 
merged_df['speed_limit'] = merged_df['speed_limit'].astype(float)
merged_df['age_of_vehicle'] = merged_df['age_of_vehicle'].astype(float)

# Convert 'date' and 'time' to datetime
merged_df['date'] = pd.to_datetime(merged_df['date'], format='%d/%m/%Y')
merged_df['time'] = pd.to_datetime(merged_df['time'], format='%H:%M').dt.time

# cleaning and splitting cols
merged_df = merged_df[~merged_df.isin([-1]).any(axis=1)] # remove any rows that still contain the -1 missing marker
merged_df[['make', 'model']] = merged_df['generic_make_model'].str.extract(r'^(\S+)\s*(.*)$') # Split 'generic_make_model' into 'make' and 'model'
merged_df['age_of_vehicle'] = pd.to_numeric(merged_df['age_of_vehicle'], errors='coerce') # Ensure 'age_of_vehicle' is numeric
merged_df['incident_date'] = pd.to_datetime(merged_df['date'], errors='coerce') # Convert 'date' to datetime
merged_df['model_date'] = merged_df['incident_date'] - merged_df['age_of_vehicle'].apply(lambda x: pd.Timedelta(days=365.25 * x) if pd.notnull(x) else pd.NaT) # Approximate model date: incident date minus vehicle age in years
merged_df['model_year'] = merged_df['model_date'].dt.year # Extract model year from model_date
merged_df['model_year'] = merged_df['model_year'].where(merged_df['model_year'].notna(), 'Unknown') # Replace invalid model years with 'Unknown'

print(merged_df.dtypes)

accident_index                        object
accident_reference                    object
latitude                              object
longitude                             object
accident_severity                     object
date                          datetime64[ns]
time                                  object
road_type                             object
speed_limit                          float64
light_conditions                      object
vehicle_type                          object
age_of_vehicle                       float64
weather_conditions                    object
road_surface_conditions               object
vehicle_reference                     object
vehicle_manoeuvre                     object
hit_object_in_carriageway             object
hit_object_off_carriageway            object
first_point_of_impact                 object
generic_make_model                    object
city                                  object
state                                 object
country   
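The model year is derived by subtracting the vehicle age, expressed as 365.25-day years, from the incident date. A short worked example shows how this compares with plain year subtraction. The values below are illustrative: an incident on 1 January 2019 and a five-year-old vehicle.

```python
import pandas as pd

incident = pd.Timestamp("2019-01-01")
age_years = 5

# Approach used in the cell above: subtract age * 365.25 days, then take the year.
model_date = incident - pd.Timedelta(days=365.25 * age_years)
via_timedelta = model_date.year

# Alternative: subtract whole years directly.
via_year_subtraction = incident.year - age_years

print(model_date)            # 2013-12-31 18:00:00
print(via_timedelta)         # 2013
print(via_year_subtraction)  # 2014
```

The two approaches agree for most dates but can differ by one year for incidents very early in January, because 5 × 365.25 days is a quarter of a day longer than the five calendar years between 2014 and 2019. Which convention is preferable depends on how `age_of_vehicle` was recorded.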

**Final Columns to keep**
- only keeping columns that match the US dataset for easy direct comparison

In [35]:
columns_to_keep = ['accident_index', 'accident_reference', 
           'accident_severity', 'date', 'time', 
           'road_type', 'speed_limit', 'light_conditions', 
            'weather_conditions', 'road_surface_conditions',
            'vehicle_manoeuvre', 
             'hit_object_in_carriageway',
             'first_point_of_impact',  
             'city', 'state', 'country', 'latitude', 'longitude', 
             'make', 'model', 'model_year']
merged_df = merged_df[columns_to_keep]
# final columns
print(tabulate(merged_df.head(10), headers='keys', tablefmt='grid', showindex=True))

+-----+------------------+----------------------+---------------------+---------------------+----------+--------------------+---------------+-----------------------+----------------------+---------------------------+-------------------------+-----------------------------+-------------------------+----------------+---------+----------------+------------+-------------+------------+----------+--------------+
|     |   accident_index |   accident_reference | accident_severity   | date                | time     | road_type          |   speed_limit | light_conditions      | weather_conditions   | road_surface_conditions   | vehicle_manoeuvre       | hit_object_in_carriageway   | first_point_of_impact   | city           | state   | country        |   latitude |   longitude | make       | model    |   model_year |
|  15 |    2019010155206 |             10155206 | Slight              | 2019-01-01 00:00:00 | 01:15:00 | Single carriageway |            30 | Darkness - lights lit | Fine no high win

## **<span style="color:black;">Section 4: Fixing Formats and Standardising Values to match with NHTSA</span>**


**STATS 19 MATCHING Columns**
* Report ID -> accident_index 
* Report Version -> accident_reference 
* Make -> make (from generic_make_model)
* Model -> model (from generic_make_model)
* Model Year -> model_year (from age_of_vehicle)
* Incident Date -> date
* Incident Time (24:00) -> time
* city -> city (from longitude and latitude)
* state -> county
* Roadway Type -> road_type
* Roadway Surface -> road_surface_conditions
* Posted Speed Limit (MPH) -> speed_limit
* Lighting -> light_conditions
* Crash With -> hit_object_in_carriageway and hit_object_off_carriageway 
* Highest Injury Severity Alleged -> accident_severity
* SV Precrash Movement -> vehicle_manoeuvre
* SV Precrash Speed (MPH) 
* Weather -> weather_conditions
* SV Contact Area -> first_point_of_impact

**UNMATCHED**
* SV Any Air Bags Deployed?  
* CP Contact Area 
* Property Damage 
* Mileage 
* ADS Equipped 
* Automatic System Engaged? 
* Roadway Description

**ADD**
* Source = conventional

**Renaming Columns to match US NHTSA**

In [36]:
column_rename_mapping = { # dictionary created to rename the columns and capitalise them
    'accident_index': 'Report ID',
    'accident_reference': 'Report Version',
    'make': 'Make',
    'model': 'Model',
    'model_year': 'Model Year',
    'date': 'Incident Date',
    'time': 'Incident Time (24:00)',
    'city': 'City',
    'state': 'State',
    'country': 'Country',
    'road_type': 'Roadway Type',
    'road_surface_conditions': 'Roadway Surface',
    'speed_limit': 'Posted Speed Limit (MPH)',
    'light_conditions': 'Lighting',
    'hit_object_in_carriageway': 'Crash With',
    'accident_severity': 'Highest Injury Severity Alleged',
    'vehicle_manoeuvre': 'SV Pre-Crash Movement',
    'weather_conditions': 'Weather',
    'first_point_of_impact': 'SV Contact Area'
}
merged_df.rename(columns=column_rename_mapping, inplace=True) 

# Add columns for ADS Equipped and Automation System Engaged which did not previously exist
merged_df['ADS Equipped?'] = 'Conventional'
merged_df['Automation System Engaged?'] = 'Conventional'
print(tabulate(merged_df.head(10), headers='keys', tablefmt='grid', showindex=True))

+-----+---------------+------------------+-----------------------------------+---------------------+-------------------------+--------------------+----------------------------+-----------------------+--------------------+-------------------+-------------------------+--------------+-------------------+----------------+---------+----------------+------------+-------------+------------+----------+--------------+-----------------+------------------------------+
|     |     Report ID |   Report Version | Highest Injury Severity Alleged   | Incident Date       | Incident Time (24:00)   | Roadway Type       |   Posted Speed Limit (MPH) | Lighting              | Weather            | Roadway Surface   | SV Pre-Crash Movement   | Crash With   | SV Contact Area   | City           | State   | Country        |   latitude |   longitude | Make       | Model    |   Model Year | ADS Equipped?   | Automation System Engaged?   |
|  15 | 2019010155206 |         10155206 | Slight                           

**Renaming Values to match US dataset**
- renaming the values to match the US dataset
- changing the order of the columns too, just for ease in EDA

In [37]:
merged_df['Highest Injury Severity Alleged'] = merged_df['Highest Injury Severity Alleged'].replace({
    'Fatal': 'Fatality',
    'Slight': 'Minor'
})

merged_df['SV Pre-Crash Movement'] = merged_df['SV Pre-Crash Movement'].replace({
    'Turning left': 'Making Left Turn',
    'Turning right': 'Making Right Turn',
    'Reversing': 'Backing',
    'Slowing or stopping': 'Stopping',
    'U-turn': 'Making U-Turn',
    'Changing lane to right': 'Changing Lanes',
    'Changing lane to left': 'Changing Lanes',
    'Going ahead left-hand bend': 'Travelling around Bend',
    'Going ahead right-hand bend': 'Travelling around Bend',
      
})

merged_df['Lighting'] = merged_df['Lighting'].replace({
    'Darkness - lights lit': 'Dark - Lighted',
    'Darkness - no lighting': 'Dark - Not Lighted',
    'Darkness - lighting unknown': 'Dark - Unknown Lighting',
})

merged_df['Crash With'] = merged_df['Crash With'].replace({
    'Other object': 'Other Fixed Object',
    'Any animal (except ridden horse)': 'Animal',
})

merged_df['Roadway Surface'] = merged_df['Roadway Surface'].replace({
    'Frost or ice': 'Snow / Slush / Ice',
    'Snow': 'Snow / Slush / Ice',
    'Wet or damp': 'Wet',
})

merged_df['Weather'] = merged_df['Weather'].replace({
    'Fine no high winds': 'Clear',
    'Fine + high winds': 'Clear',
})

merged_df['Crash With'] = merged_df['Crash With'].replace({
    'None': 'No Object', # pandas mapping none as nan
})
merged_df.head(10)

# change column order
desired_order = [
    'Report ID', 'Report Version', 'Make', 'Model', 'Model Year', 'ADS Equipped?',
    'Automation System Engaged?', 'Incident Date', 'Incident Time (24:00)', 'City',
    'State', 'Roadway Type', 'Roadway Surface', 'Posted Speed Limit (MPH)', 'Lighting',
    'Crash With', 'Highest Injury Severity Alleged', 'SV Pre-Crash Movement',
    'Weather', 'SV Contact Area', 'Country', 'latitude', 'longitude'
]
merged_df = merged_df[desired_order]
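`Series.replace` with a dictionary ignores keys that never occur in the column, so a misspelt key produces no error and no change. A small check can list such keys. This is a sketch on toy values; `unused_replacements` is a hypothetical helper name.

```python
import pandas as pd


def unused_replacements(series: pd.Series, replacements: dict) -> list:
    """Return the replacement keys that do not occur anywhere in the column."""
    present = set(series.dropna().unique())
    return sorted(key for key in replacements if key not in present)


movement = pd.Series(["Turning left", "Changing lane to left", "Parked"])
replacements = {
    "Turning left": "Making Left Turn",
    "Changing lane to left": "Changing Lanes",
    "Reversing": "Backing",  # not present in the toy column
}

unused = unused_replacements(movement, replacements)
print("Keys that matched nothing:", unused)
```

A key reported here is either misspelt or simply absent from the data, so each one is worth a manual look before the replacement is trusted.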

## **<span style="color:black;">Section 5: Save Cleaned Dataframe</span>**

In [38]:
print("Shape of merged_df:", merged_df.shape) # final shape, then save to CSV
merged_df.to_csv('/Users/mahnooriqbal/COMP702 Project/ML-AI-Risk-Analysis-AV-Data-/Datasets/UK-cleaned_data.csv', index=False)

Shape of merged_df: (229782, 23)
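After saving, reading the file back is a cheap way to confirm that the shape and key dtypes survive the round trip. The sketch below uses a one-row toy frame and a temporary directory instead of the real output path. The `dtype` and `parse_dates` arguments are standard `pd.read_csv` parameters; reading the report ID as text is an assumption about how it will be used later.

```python
import tempfile
from pathlib import Path

import pandas as pd

cleaned = pd.DataFrame({
    "Report ID": ["2019010155206"],
    "Incident Date": pd.to_datetime(["2019-01-01"]),
    "Posted Speed Limit (MPH)": [30.0],
})

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "UK-cleaned_data.csv"
    cleaned.to_csv(out_path, index=False)
    # Read IDs as text and dates as datetimes, since CSV stores neither type.
    reloaded = pd.read_csv(
        out_path,
        dtype={"Report ID": str},
        parse_dates=["Incident Date"],
    )

same_shape = reloaded.shape == cleaned.shape
print("Round trip preserved shape:", same_shape)
print(reloaded.dtypes)
```

Without the `dtype` and `parse_dates` hints, the ID column would be read back as an integer and the date column as plain text.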
