# **<span style="color:pink;">US Dataset: Loading and Cleaning</span>**

## **Purpose of Notebook**
Preparing the raw data for EDA: merging datasets, cleaning, dropping columns, filling blanks with 'Missing', fixing formats, etc.

## **Problem Statement and Sourcing the Dataset**
The datasets are from: https://www.nhtsa.gov/laws-regulations/standing-general-order-crash-reporting
This includes three different datasets:
* ADAS = Advanced Driver Assistance Systems
* ADS = Automated Driving Systems
* OTHER = Misclassified/No Available Information


These cover crashes involving fully autonomous and driver-assisted vehicles; OTHER holds any remaining crashes that are not linked to ADS or ADAS.

The goal is to combine these datasets while keeping only the unique rows. This requires a primary key that identifies the duplicated and overlapping rows, which the source acknowledges exist.

Additional steps are finding missing data, deciding which columns are important, fixing formats and renaming columns, so the data is ready for Exploratory Data Analysis.

---

## **<span style="color:lightblue;">Importing Necessary Libraries and Modules</span>**

In [2]:
# Import necessary libraries
import pandas as pd # For data manipulation
import numpy as np # For numerical operations
import itertools # For creating combinations when finding primary key
from tabulate import tabulate # For pretty printing tables

## **<span style="color:lightblue;">Loading Dataset and Getting Basic Information</span>**


**Loading Datasets into Dataframes**

In [3]:
# Loading the datasets (these are absolute paths; change them as needed)
df_adas = pd.read_csv('/Users/mahnooriqbal/COMP702 Project/ML-AI-Risk-Analysis-AV-Data-/US/SGO-2021-01_Incident_Reports_ADAS.csv')
df_ads = pd.read_csv('/Users/mahnooriqbal/COMP702 Project/ML-AI-Risk-Analysis-AV-Data-/US/SGO-2021-01_Incident_Reports_ADS.csv')
df_other = pd.read_csv('/Users/mahnooriqbal/COMP702 Project/ML-AI-Risk-Analysis-AV-Data-/US/SGO-2021-01_Incident_Reports_OTHER.csv')

**Finding the Primary Keys for Dataset**

In [4]:
def find_primary_keys(df):
    """
       Purpose: Finding the primary keys in each dataframe 
       Methods:
         - Check each row for this column is unique (count of column vals = number of rows)
         - If not, then use itertools to experiment with every combination to find what combination is unique
       Input: Dataframe 
       Output: Keys within a string for easy printing
    """
    # Check for single primary key
    for col in df.columns:
        if df[col].is_unique:
            return f"Primary key: ({col})"
    
    # Loop to check every combination
    for i in range(2, len(df.columns) + 1): # keep increasing if no primary key found for double, triple, etc.
        for combo in itertools.combinations(df.columns, i):
            if df[list(combo)].drop_duplicates().shape[0] == df.shape[0]: # check if count = total rows in df
                return f"Primary key: {combo}"
            
    return "No primary key found" # just in case there is no primary key

# Calling Function
print("ADAS: ", find_primary_keys(df_adas))
print("ADS: ", find_primary_keys(df_ads))
print("OTHER: ", find_primary_keys(df_other))

ADAS:  Primary key: ('Report ID', 'Report Version')
ADS:  Primary key: ('Report ID', 'Report Version')
OTHER:  Primary key: ('Report ID', 'Report Version')


**General Basic Information**

In [5]:
# Basic Information about the datasets
keys = ['Report ID','Report Version'] # works for all 
data = {
    "Dataset": ["ADAS",
                "ADS",
                "OTHER"],

    "Shape": [df_adas.shape,
              df_ads.shape,
              df_other.shape],

    "Same Incident ID": [ # ensure primary key is unique
    df_adas[keys].drop_duplicates().shape[0],
    df_ads[keys].drop_duplicates().shape[0],
    df_other[keys].drop_duplicates().shape[0]],

    "Duplicates": [df_adas.duplicated(subset=keys).sum(),
                   df_ads.duplicated(subset=keys).sum(),
                   df_other.duplicated(subset=keys).sum()]}
df_info = pd.DataFrame(data)
print(tabulate(df_info, headers='keys', tablefmt='pretty')) # simple table to easily see

+---+---------+-------------+------------------+------------+
|   | Dataset |    Shape    | Same Incident ID | Duplicates |
+---+---------+-------------+------------------+------------+
| 0 |  ADAS   | (3905, 137) |       3905       |     0      |
| 1 |   ADS   | (2173, 137) |       2173       |     0      |
| 2 |  OTHER  | (3547, 137) |       3547       |     0      |
+---+---------+-------------+------------------+------------+


**How Much Missing Data in Dataset**

#### **Summary of Section**
- Report ID was tried as a primary key, but it is not unique: there are multiple report versions for each report
- More data is added to a report over time, so the composite primary key (Report ID + Report Version) is used for now; later the older report versions can be dropped, keeping only the latest report
- All datasets have the same number of columns, which is convenient for merging, but it is still important to check whether any columns are unique to one dataset
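
Equal column counts do not by themselves prove that the column names match. A minimal sketch of a name-level check is shown here; `column_differences` and the toy frames are hypothetical stand-ins for `df_adas` and `df_ads`, with made-up values.

```python
import pandas as pd

def column_differences(left: pd.DataFrame, right: pd.DataFrame) -> dict:
    """Return the column names that appear in only one of the two DataFrames."""
    left_cols, right_cols = set(left.columns), set(right.columns)
    return {
        "only_left": sorted(left_cols - right_cols),
        "only_right": sorted(right_cols - left_cols),
    }

# Toy frames standing in for df_adas and df_ads (values are made up)
toy_adas = pd.DataFrame({"Report ID": ["1-1"], "Report Version": [1], "Make": ["A"]})
toy_ads = pd.DataFrame({"Report ID": ["2-1"], "Report Version": [1], "Model": ["B"]})

diff = column_differences(toy_adas, toy_ads)
print(diff)
```

If both lists come back empty, the datasets share exactly the same column names and can be stacked directly.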



---

## **<span style="color:lightblue;">Checking Missing Values in Datasets</span>**


**Checking Missingness in the datasets**

In [6]:
# Missing values percentage overall per dataset
def missing_percentage(df):
        total_cells = df.size
        total_missing = df.isnull().sum().sum()
        return (total_missing / total_cells) * 100

print("=== Missing Values Percentage ===")
print(f"Dataset 1 missing: {missing_percentage(df_adas):.2f}%")
print(f"Dataset 2 missing: {missing_percentage(df_ads):.2f}%")
print(f"Dataset 3 missing: {missing_percentage(df_other):.2f}%")

=== Missing Values Percentage ===
Dataset 1 missing: 13.04%
Dataset 2 missing: 9.84%
Dataset 3 missing: 41.06%


**Other Dataset Columns with missing values**

In [7]:
# Use a new name so the missing_percentage function defined above is not overwritten
missing_by_column = (df_other.isnull().mean() * 100).sort_values(ascending=False)
print("\nMissing Values Percentage by Column:")
print(missing_by_column.to_string(float_format="%.2f%%"))


Missing Values Percentage by Column:
State or Local Permit             100.00%
Other Federal Reg. Exemption      100.00%
Federal Regulatory Exemption?     100.00%
Weather - Other Text              100.00%
ADAS/ADS Hardware Version          99.94%
Investigating Officer Email        99.94%
Investigating Officer Phone        99.94%
State or Local Permit?             99.94%
Serial Number                      99.94%
ADAS/ADS Software Version          99.94%
Other Reporting Entities?          99.92%
Narrative - CBI?                   99.89%
Source - Other Text                99.69%
ADAS/ADS System Version            99.52%
Latitude                           99.46%
Longitude                          99.46%
Investigating Officer Name         99.41%
Investigating Agency               99.32%
ADS Equipped?                      98.51%
Zip Code                           98.25%
Posted Speed Limit (MPH)           98.20%
Automation System Engaged?         98.20%
Mileage                            98.

#### **Summary**
- The OTHER dataset is almost half empty (41.06% missing), which would be difficult to impute; it also lacks information about automation (e.g. ADS Equipped? is 98.51% missing), which is integral to the risk analysis
- It is better to drop the OTHER dataset than to reduce the reliability of the rest of the data
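
To make the drop decision easier to reproduce, the mostly empty columns can be listed against a cut-off. This is only a sketch: the helper name `columns_above_missing_threshold`, the threshold value and the toy data are assumptions, not part of the original analysis.

```python
import numpy as np
import pandas as pd

def columns_above_missing_threshold(df: pd.DataFrame, threshold: float = 0.9) -> pd.Series:
    """Return the share of NaN cells for columns whose share exceeds the threshold."""
    missing_share = df.isnull().mean()
    return missing_share[missing_share > threshold].sort_values(ascending=False)

# Toy frame standing in for df_other (values are made up)
toy_other = pd.DataFrame({
    "Report ID": ["a", "b", "c", "d"],
    "Latitude": [np.nan, np.nan, np.nan, np.nan],
    "Mileage": [np.nan, 10.0, np.nan, np.nan],
})

mostly_empty = columns_above_missing_threshold(toy_other, threshold=0.7)
print(mostly_empty)
```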

---

## **<span style="color:lightblue;">Merging Datasets</span>**


**Checking if Columns Overlap**

In [8]:
# Find overlapping columns between df_adas and df_ads
print(f"Shape of df_adas: {df_adas.shape}")
print(f"Shape of df_ads: {df_ads.shape}")

# Ensuring the columns are the same so merge is easy
overlapping_columns = set(df_adas.columns).intersection(set(df_ads.columns))
print(f"Overlapping columns ({len(overlapping_columns)}):")

Shape of df_adas: (3905, 137)
Shape of df_ads: (2173, 137)
Overlapping columns (137):


**Merging the ADAS and ADS Dataset**

In [9]:
# Add a source column
df_adas['Source'] = 'ADAS'
df_ads['Source'] = 'ADS'

# combining the two dataframes with concat (stacking rows) instead of merge, so the shared columns line up and the column count stays at 138 (137 + Source)
merged_df = pd.concat([df_adas, df_ads], ignore_index=True)

# Checking the result of the merge
print(f"Shape of merged dataframe: {merged_df.shape}")
print(tabulate(merged_df.head(10), headers='keys', tablefmt='grid', showindex=True))

Shape of merged dataframe: (6078, 138)

**Reducing the number of rows due to Report versions**

In [10]:
# For each report ID count how many versions there are
version_counts = merged_df.groupby('Report ID')['Report Version'].nunique()
multi_version = version_counts[version_counts > 1]
print(f"\nReport IDs with multiple versions (count: {len(multi_version)}):")
multi_version = multi_version.sort_values(ascending=False)
print(multi_version)


Report IDs with multiple versions (count: 1497):
Report ID
945-8258      9
28349-4648    8
855-1341      5
1243-1379     5
753-4134      5
             ..
13781-3716    2
13781-3715    2
13781-3714    2
13781-3713    2
988-3684      2
Name: Report Version, Length: 1497, dtype: int64


**Dropping the old versions hence no duplicates in the datasets**

In [11]:
# Dropping all the old versions of the reports
merged_df = merged_df.loc[merged_df.groupby('Report ID')['Report Version'].idxmax()].reset_index(drop=True)

# Final Merged DataFrame with latest versions
print(f"\nShape of merged dataframe after dropping old versions: {merged_df.shape}")
print("\nFirst 10 Rows of latest versions:")
print(tabulate(merged_df.head(10), headers='keys', tablefmt='grid', showindex=True))


Shape of merged dataframe after dropping old versions: (4372, 138)

First 10 Rows of latest versions:

#### **Summary**
- The columns are exactly the same in both datasets. However, each report can have multiple versions, so only the latest version is kept; otherwise the EDA would be skewed because the rows are not unique incidents
- After combining the datasets and dropping old versions, the data goes from 6078 rows to 4372 rows
- A new column called Source explicitly says whether a row came from ADAS or ADS
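
The "keep only the latest version" step can also be written as a sort followed by `drop_duplicates`. The sketch below uses made-up toy data and checks that this alternative gives the same rows as the `groupby(...).idxmax()` approach used in the notebook; it assumes Report Version is numeric.

```python
import pandas as pd

# Toy data standing in for merged_df (values are made up)
toy = pd.DataFrame({
    "Report ID": ["100-1", "100-1", "200-2", "100-1"],
    "Report Version": [1, 3, 1, 2],
    "Make": ["A", "A", "B", "A"],
})

# Sort so the highest version of each report comes last, then keep that last row
latest = (
    toy.sort_values(["Report ID", "Report Version"])
       .drop_duplicates(subset="Report ID", keep="last")
       .reset_index(drop=True)
)

# Same result using the idxmax approach from the notebook
via_idxmax = toy.loc[toy.groupby("Report ID")["Report Version"].idxmax()].reset_index(drop=True)

print(latest)
```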

---

## **<span style="color:lightblue;">Dropping Unnecessary Columns</span>**

**Exploring what columns exist and whether they are connected or can be reduced**

In [12]:
# Printing all the columns of the merged DataFrame
print(merged_df.columns.tolist())

['Report ID', 'Report Version', 'Reporting Entity', 'Report Type', 'Report Month', 'Report Year', 'Report Submission Date', 'VIN', 'VIN - Unknown', 'Serial Number', 'Make', 'Model', 'Model - Unknown', 'Model Year', 'Model Year - Unknown', 'Same Vehicle ID', 'Mileage', 'Mileage - Unknown', 'Driver / Operator Type', 'ADAS/ADS System Version', 'ADAS/ADS System Version - Unk', 'ADAS/ADS System Version CBI', 'ADAS/ADS Hardware Version', 'ADAS/ADS Hardware Version - Unk', 'ADAS/ADS Hardware Version CBI', 'ADAS/ADS Software Version', 'ADAS/ADS Software Version - Unk', 'ADAS/ADS Software Version CBI', 'Other Reporting Entities?', 'Other Reporting Entities? - Unk', 'Other Reporting Entities? - NA', 'Federal Regulatory Exemption?', 'Other Federal Reg. Exemption', 'Federal Reg. Exemption - Unk', 'Federal Reg. Exemption - No', 'State or Local Permit?', 'State or Local Permit', 'ADS Equipped?', 'Automation System Engaged?', 'Operating Entity', 'Operating Entity - Unknown', 'Source - Complaint/Claim

#### List of Connected Columns
**Double Columns** = If value for column is empty then unknown column has 'Y' value 
1. VIN, VIN - Unknown
2. Model, Model - Unknown
3. Model Year, Model Year - Unknown
4. Mileage, Mileage - Unknown
5. Incident Time (24:00), Incident Time - Unknown
6. Latitude, Latitude - Unknown
7. Longitude, Longitude - Unknown
8. Address, Address - Unknown
9. City, City - Unknown
10. Zip Code, Zip Code - Unknown
11. Operating Entity, Operating Entity - Unknown
12. Investigating Agency, Investigating Agency - Unknown
13. Investigating Officer Name, Inv. Officer Name - Unknown
14. Investigating Officer Phone, Inv. Officer Phone - Unknown
15. Investigating Officer Email, Inv. Officer Email - Unknown
16. Posted Speed Limit (MPH), Posted Speed Limit - Unknown
17. SV Precrash Speed (MPH), SV Pre-crash Speed - Unknown

**Multiple Value Columns** = Useful data split across multiple columns which can be combined
1. ADAS/ADS System Version, ADAS/ADS System Version - Unk, ADAS/ADS System Version CBI
2. ADAS/ADS Hardware Version, ADAS/ADS Hardware Version - Unk, ADAS/ADS Hardware Version CBI
3. ADAS/ADS Software Version, ADAS/ADS Software Version - Unk, ADAS/ADS Software Version CBI
4. Other Reporting Entities?, Other Reporting Entities? - Unk, Other Reporting Entities? - NA
5. Federal Regulatory Exemption?, Other Federal Reg. Exemption, Federal Reg. Exemption - Unk, Federal Reg. Exemption - No
6. State or Local Permit?, State or Local Permit
7. Source - Complaint/Claim, Source - Telematics, Source - Law Enforcement, Source - Field Report, Source - Testing, Source - Media, Source - Other, Source - Other Text
8. Weather - Clear, Weather - Snow, Weather - Cloudy, Weather - Fog/Smoke, Weather - Rain, Weather - Severe Wind, Weather - Other, Weather - Other Text
9. CP Contact Area - Rear Left, CP Contact Area - Left, CP Contact Area - Front Left, CP Contact Area - Rear, CP Contact Area - Top, CP Contact Area - Front, CP Contact Area - Rear Right, CP Contact Area - Right, CP Contact Area - Front Right, CP Contact Area - Bottom
10. SV Contact Area - Rear Left, SV Contact Area - Left, SV Contact Area - Front Left, SV Contact Area - Rear, SV Contact Area - Top, SV Contact Area - Front, SV Contact Area - Rear Right, SV Contact Area - Right, SV Contact Area - Front Right, SV Contact Area - Bottom
11. Data Availability - EDR, Data Availability - Police Rpt, Data Availability - Telematics, Data Availability - Complaints, Data Availability - Video, Data Availability - Other, Data Availability - No Data
12. Within ODD?, Within ODD? - CBI
13. Narrative, Narrative - CBI?

**Dropping the 'Unknown' double Columns**
- For the double columns, if a value is missing then the 'Unknown' version of the column holds a Y
- These columns can be dropped because the main column already has a NaN, which can be imputed or listed as 'Missing' later on; this reduces the number of columns significantly
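
Before dropping a flag column, it may be worth checking that it really mirrors the gaps in its main column. A minimal sketch on made-up data, using the Mileage pair as the example:

```python
import numpy as np
import pandas as pd

# Toy data standing in for one "double column" pair (values are made up)
toy = pd.DataFrame({
    "Mileage": [12000.0, np.nan, np.nan],
    "Mileage - Unknown": [np.nan, "Y", np.nan],
})

value_missing = toy["Mileage"].isna()
flagged_unknown = (
    toy["Mileage - Unknown"].fillna("").astype(str).str.strip().str.upper().eq("Y")
)

# Rows where the value is missing but the 'Unknown' flag was not set
unflagged_gaps = toy[value_missing & ~flagged_unknown]
print(len(unflagged_gaps))
```

Rows that show up in `unflagged_gaps` are gaps the flag column does not explain; either way the main column already carries the NaN, so no information is lost by dropping the flag.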

In [13]:
# drop all the unknown columns as they can be empty and imputed later (22 columns)
# .copy() gives an independent DataFrame so later in-place edits do not act on a slice
merged_df = merged_df.loc[:, ~merged_df.columns.str.contains('Unknown')].copy()
print(tabulate(merged_df.head(2), headers='keys', tablefmt='grid'))
print(f"Number of columns after dropping 'Unknown': {merged_df.shape[1]}")


**Combining the values in multivalue columns**
- List all the connected columns, then merge whichever values are set into one main column
- e.g. CP Contact Area, example row data:
  - 'CP Contact Area - Rear Left' 
  - 'CP Contact Area - Left'
  - 'CP Contact Area - Front Left'
  - 'CP Contact Area - Rear'
  - 'CP Contact Area - Top'  | **Y**
  - 'CP Contact Area - Front'
  - 'CP Contact Area - Rear Right'
  - 'CP Contact Area - Right'
  - 'CP Contact Area - Front Right'
  - 'CP Contact Area - Bottom'

Turns into: CP Contact Area -> Top (only the label after ' - ' is kept)
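
The transformation can be sketched on a toy frame as follows. `combine_flags` is a hypothetical helper, not the notebook's function: it selects flag columns by prefix and returns the combined text without modifying the input. Free-text columns such as 'Weather - Other Text' would need separate handling.

```python
import numpy as np
import pandas as pd

def combine_flags(df: pd.DataFrame, prefix: str) -> pd.Series:
    """Return one text column listing which '<prefix> - <label>' flags hold 'Y'."""
    flag_cols = [c for c in df.columns if c.startswith(prefix + " - ")]
    labels = [c.split(" - ", 1)[1] for c in flag_cols]
    # Normalise each flag column to True/False (case and whitespace insensitive)
    is_set = df[flag_cols].apply(
        lambda s: s.fillna("").astype(str).str.strip().str.upper().eq("Y")
    )
    # Join the labels of the flags that are set in each row
    return is_set.apply(
        lambda row: ", ".join(label for label, on in zip(labels, row) if on),
        axis=1,
    )

# Toy data (values are made up)
toy = pd.DataFrame({
    "CP Contact Area - Top": ["Y", np.nan, "Y"],
    "CP Contact Area - Front": [np.nan, np.nan, "y "],
    "City": ["Austin", "Miami", "Reno"],
})

combined = combine_flags(toy, "CP Contact Area")
print(combined.tolist())
```

Rows with no flag set produce an empty string, which is why the combined columns later show up under the Empty count rather than as NaN.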

In [14]:
# Combine columns with 'Y' values into a single column with their col name
def combine_and_drop(merged_df, new_col_name, cols_to_combine):
    '''
        Purpose: Combine flag columns into one; where a column holds 'Y', record that column's label
        Input:
            merged_df: DataFrame to modify
            new_col_name: Name of the new column to create
            cols_to_combine: List of columns to combine
        Output: None; merged_df is modified in place (new column added, combined columns dropped)
        Note: if new_col_name also appears in cols_to_combine, the new column is dropped with the rest
    '''
    def combine_values(row):
        combined_values = []
        for col in cols_to_combine: # go through all columns in list
            if str(row[col]).strip().upper() == 'Y':  # flag is set for this row
                combined_values.append(col.split(' - ')[-1]) # keep only the label after ' - '
        return ', '.join(combined_values)

    merged_df[new_col_name] = merged_df[cols_to_combine].apply(combine_values, axis=1)
    merged_df.drop(columns=cols_to_combine, inplace=True)


In [15]:
# The merging/combining columns
combine_and_drop(merged_df, 'CP Contact Area', [
    'CP Contact Area - Rear Left', 'CP Contact Area - Left', 'CP Contact Area - Front Left',
    'CP Contact Area - Rear', 'CP Contact Area - Top', 'CP Contact Area - Front',
    'CP Contact Area - Rear Right', 'CP Contact Area - Right', 'CP Contact Area - Front Right',
    'CP Contact Area - Bottom' 
])

combine_and_drop(merged_df, 'ADAS/ADS System Version', [
    'ADAS/ADS System Version', 'ADAS/ADS System Version - Unk', 'ADAS/ADS System Version CBI'
])

combine_and_drop(merged_df, 'ADAS/ADS Hardware Version', [
    'ADAS/ADS Hardware Version', 'ADAS/ADS Hardware Version - Unk', 'ADAS/ADS Hardware Version CBI'
])

combine_and_drop(merged_df, 'ADAS/ADS Software Version', [
    'ADAS/ADS Software Version', 'ADAS/ADS Software Version - Unk', 'ADAS/ADS Software Version CBI'
])

combine_and_drop(merged_df, 'Other Reporting Entities', [
    'Other Reporting Entities?', 'Other Reporting Entities? - Unk', 'Other Reporting Entities? - NA'
])

combine_and_drop(merged_df, 'Federal Regulatory Exemption', [
    'Federal Regulatory Exemption?', 'Other Federal Reg. Exemption',
    'Federal Reg. Exemption - Unk', 'Federal Reg. Exemption - No'
])

combine_and_drop(merged_df, 'State or Local Permit', [
    'State or Local Permit?', 'State or Local Permit'
])

combine_and_drop(merged_df, 'Source', [
    'Source - Complaint/Claim', 'Source - Telematics', 'Source - Law Enforcement',
    'Source - Field Report', 'Source - Testing', 'Source - Media',
    'Source - Other', 'Source - Other Text'
])

combine_and_drop(merged_df, 'Weather', [
    'Weather - Clear', 'Weather - Snow', 'Weather - Cloudy', 'Weather - Fog/Smoke',
    'Weather - Rain', 'Weather - Severe Wind', 'Weather - Other', 'Weather - Other Text'
])

combine_and_drop(merged_df, 'SV Contact Area', [
    'SV Contact Area - Rear Left', 'SV Contact Area - Left', 'SV Contact Area - Front Left',
    'SV Contact Area - Rear', 'SV Contact Area - Top', 'SV Contact Area - Front',
    'SV Contact Area - Rear Right', 'SV Contact Area - Right', 'SV Contact Area - Front Right',
    'SV Contact Area - Bottom'
])

combine_and_drop(merged_df, 'Data Availability', [
    'Data Availability - EDR', 'Data Availability - Police Rpt', 'Data Availability - Telematics',
    'Data Availability - Complaints', 'Data Availability - Video', 'Data Availability - Other',
    'Data Availability - No Data'
])

# drop any remaining 'Unknown' columns (already removed above, kept as a safeguard)
merged_df = merged_df.loc[:, ~merged_df.columns.str.contains('Unknown')]
print(tabulate(merged_df.head(), headers='keys', tablefmt='grid'))
print(f"Number of columns after combining connected columns: {merged_df.shape[1]}")


**Dropping any other unneeded columns**
- Columns that were not merged, combined or needed can be dropped
- These include columns that repeat information already available, e.g. city and state are available, so longitude, latitude, address and zip code are not needed
- Some information is mostly redacted or contains personal information, so those columns are dropped (address, zip code, VIN)

In [16]:
cols_to_drop = [
    'Report Type', 'Report Month', 'Report Year', 'Report Submission Date',
    # 'Driver / Operator Type' is kept for now; it is dropped after the missingness check below
    'Notice Received Date', 'Reporting Entity', 'Operating Entity',
    'Serial Number','Data Availability',
    'Latitude', 'Longitude', 'Address', 'Zip Code',
    'Investigating Agency', 'Rep Ent Or Mfr Investigating?', 'Investigating Officer Name',
    'Investigating Officer Phone', 'Investigating Officer Email',
    'Other Reporting Entities', 'Federal Regulatory Exemption',
    'Within ODD? - CBI','Within ODD',
    'Same Incident ID', 'Same Vehicle ID',
    'Narrative', 'Narrative - CBI?',
    'VIN',
    'Law Enforcement Investigating?',
    'Source'
]

# Drop the unwanted columns
cols_to_drop = [col for col in cols_to_drop if col in merged_df.columns]
merged_df.drop(columns=cols_to_drop, inplace=True)
print(tabulate(merged_df.head(8), headers='keys', tablefmt='grid'))
print(f"Number of columns after dropping unneeded columns: {merged_df.shape[1]}")

+----+-------------+------------------+---------------+-------------+--------------+-----------+--------------------------------+-----------------+------------------------------+-----------------+-------------------------+--------------------+---------+-------------------+-------------------+-----------------------+----------------------------+--------------------+-------------------------+-----------------------------------+--------------------+-------------------------+-----------------------------+-------------------------+-------------------------+-----------------------------+-------------------------+----------------------------------+---------------------------+-----------------------------------------------------------+-----------------------------------+-----------+--------------------------------------------------------------------+
|    | Report ID   |   Report Version | Make          | Model       |   Model Year |   Mileage | Driver / Operator Type         | ADS Equipped?  

#### **Summary**
- During this section more than 100 columns were removed
    - many columns combined into 1 to ensure the values are not split across the dataset
    - columns which would not be useful for analysis 
    - unknown columns
---

## **<span style="color:lightblue;">Assessing Missing Values + Dropping more cols</span>**
- A balancing act: reducing columns while seeing how much cell information can be saved

**Create a Table to see how much data is still missing**
- drop any columns that may not have a counterpart in the UK STATS19 columns
- drop any columns that are more than 60% missing
- see the number of missing/NaN values across the columns

In [17]:
# Function to count and print the overall sum of missing, empty, NaN, and "unknown" values for each column
def count_and_sum_missing_values(df):
    """
    Purpose: Count and sum missing, empty, NaN, and 'unknown' values for each column in the DataFrame.
    Input: DataFrame
    Output: Prints a summary table with the counts and percentages of missing values.
    """

    # Initialize a list to store the summary for each column
    overall_summary = []
    for col in df.columns:
        null_count = df[col].isnull().sum()  # Count missing (NaN) values
        empty_count = (df[col] == '').sum()  # Count empty string values
        # pd.isna flags the same cells as isnull(), so the 'NaN' count always equals the 'Null' count
        nan_count = df[col].apply(lambda x: pd.isna(x)).sum()  # Count NaN values explicitly
        unknown_count = df[col].apply(lambda x: str(x).strip().lower() == 'unknown').sum()  # Count 'unknown' values
        # Note: because Null and NaN overlap, Total counts each NaN cell twice
        total = null_count + empty_count + nan_count + unknown_count
        percentage = (total / len(df)) * 100  # Calculate percentage
        overall_summary.append((col, null_count, empty_count, nan_count, unknown_count, total, percentage))

    # Create a DataFrame for the summary and sort by the highest percentage
    summary_df = pd.DataFrame(overall_summary, columns=['Column', 'Null', 'Empty', 'NaN', 'Unknown', 'Total', 'Percentage'])
    summary_df = summary_df.sort_values(by='Percentage', ascending=False)
    print(tabulate(summary_df, headers='keys', tablefmt='grid')) # make table to clearly see the results

# Call the function on the dataframe
count_and_sum_missing_values(merged_df)

+----+---------------------------------+--------+---------+-------+-----------+---------+--------------+
|    | Column                          |   Null |   Empty |   NaN |   Unknown |   Total |   Percentage |
| 23 | CP Was Vehicle Towed?           |    832 |       0 |   832 |      1904 |    3568 |   81.6102    |
+----+---------------------------------+--------+---------+-------+-----------+---------+--------------+
| 22 | CP Any Air Bags Deployed?       |    832 |       0 |   832 |      1773 |    3437 |   78.6139    |
+----+---------------------------------+--------+---------+-------+-----------+---------+--------------+
| 21 | CP Pre-Crash Movement           |    678 |       0 |   678 |      1097 |    2453 |   56.107     |
+----+---------------------------------+--------+---------+-------+-----------+---------+--------------+
| 27 | SV Were All Passengers Belted?  |      0 |       0 |     0 |      2325 |    2325 |   53.1793    |
+----+---------------------------------+--------+------

- **CP Any Air Bags Deployed?**: 20.50% missing
- **CP Was Vehicle Towed?**: 20.50% missing
- **CP Pre-Crash Movement**: 17.14% 
- **Driver / Operator Type**: this column was useful, but it is categorical and imputing it with the mode would reduce data integrity, so it is dropped rather than imputed
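
In the table above, the Null and NaN columns report the same cells (for example 832 and 832), so Total includes those cells twice. A sketch of a summary that counts each cell only once is given here; `missing_share` is a hypothetical helper and the toy data is made up.

```python
import numpy as np
import pandas as pd

def missing_share(df: pd.DataFrame) -> pd.Series:
    """Percentage of cells per column that are NaN, empty, or 'unknown' (each counted once)."""
    as_text = df.astype(str).apply(lambda s: s.str.strip().str.lower())
    is_missing = df.isna() | as_text.isin(["", "unknown"])
    return (is_missing.mean() * 100).sort_values(ascending=False)

# Toy data (values are made up)
toy = pd.DataFrame({
    "Towed": [np.nan, "Unknown", "Yes", "No"],
    "City": ["", "Austin", "Miami", "Reno"],
})

share = missing_share(toy)
print(share)
```

Using a single boolean mask means a cell can never contribute more than once, so the percentage stays between 0 and 100.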

In [18]:
# Columns to drop with more than 45% missing values
columns_to_remove = ['CP Pre-Crash Movement', 'CP Any Air Bags Deployed?', 'CP Was Vehicle Towed?', 'SV Any Air Bags Deployed?',
                     'SV Was Vehicle Towed?', 'SV Were All Passengers Belted?', 'Driver / Operator Type']

# Drop the columns
merged_df.drop(columns=columns_to_remove, inplace=True)

# Display the updated dataframe shape
print(f"Shape of the dataframe after dropping specified columns: {merged_df.shape}")
count_and_sum_missing_values(merged_df)

Shape of the dataframe after dropping specified columns: (4372, 26)
+----+---------------------------------+--------+---------+-------+-----------+---------+--------------+
|    | Column                          |   Null |   Empty |   NaN |   Unknown |   Total |   Percentage |
| 18 | Highest Injury Severity Alleged |      0 |       0 |     0 |      2147 |    2147 |   49.108     |
+----+---------------------------------+--------+---------+-------+-----------+---------+--------------+
| 23 | CP Contact Area                 |      0 |    2135 |     0 |         0 |    2135 |   48.8335    |
+----+---------------------------------+--------+---------+-------+-----------+---------+--------------+
| 14 | Roadway Description             |      0 |       0 |     0 |      1086 |    1086 |   24.8399    |
+----+---------------------------------+--------+---------+-------+-----------+---------+--------------+
| 15 | Posted Speed Limit (MPH)        |    496 |       0 |   496 |         0 |     992 |   

In [19]:
# Replace all null values (NaN) with 'Missing'
merged_df.fillna('Missing', inplace=True)

# Replace empty cells with 'Missing' (if not already handled)
merged_df.replace('', 'Missing', inplace=True)

# Print the updated dataframe
print(tabulate(merged_df.head(10), headers='keys', tablefmt='grid'))
print("All null values and empty cells replaced with 'Missing'.")

+----+-------------+------------------+---------------+-------------+--------------+-----------+-----------------+------------------------------+-----------------+-------------------------+--------------------+---------+-------------------+----------------------+-----------------------+----------------------------+----------------------+-------------------------+-----------------------------------+--------------------+-------------------------+---------------------------+-----------------------------------------------------------+-----------------------------------+-----------+--------------------------------------------------------------------+
|    | Report ID   |   Report Version | Make          | Model       | Model Year   | Mileage   | ADS Equipped?   | Automation System Engaged?   | Incident Date   | Incident Time (24:00)   | City               | State   | Roadway Type      | Roadway Surface      | Roadway Description   | Posted Speed Limit (MPH)   | Lighting             | Crash 

**Confirming No Null Values Remain**

In [20]:
# Count remaining null values per column (all should now be filled)
missing_values_count = merged_df.isnull().sum()

# Print the missing values count
print("Missing values count for each column:")
print(missing_values_count)

Missing values count for each column:
Report ID                          0
Report Version                     0
Make                               0
Model                              0
Model Year                         0
Mileage                            0
ADS Equipped?                      0
Automation System Engaged?         0
Incident Date                      0
Incident Time (24:00)              0
City                               0
State                              0
Roadway Type                       0
Roadway Surface                    0
Roadway Description                0
Posted Speed Limit (MPH)           0
Lighting                           0
Crash With                         0
Highest Injury Severity Alleged    0
Property Damage?                   0
SV Pre-Crash Movement              0
SV Precrash Speed (MPH)            0
Within ODD?                        0
CP Contact Area                    0
Weather                            0
SV Contact Area                    0


#### **Summary**
- Further columns that were more than 60-80% missing were removed to preserve data integrity.
- Some values were removed to match the STATS19 columns.
- Remaining missing values can be imputed or easily calculated; until then they are labelled as 'Missing'.
---

## **<span style="color:lightblue;">Fixing Formats</span>**

**Converting to Strings, Datetime, and Numerical Datatypes**

In [21]:
from dateutil import parser

def clean_incident_date(val):
    try:
        val = str(val).strip()

        if val.lower() == 'missing' or val == '':
            return pd.NaT

        # Try parsing common formats
        return parser.parse(val, dayfirst=False, yearfirst=False)
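    except (ValueError, OverflowError, TypeError):
        # dateutil signals unparseable input with ValueError (or OverflowError)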
        return pd.NaT

merged_df['Incident Date'] = merged_df['Incident Date'].apply(clean_incident_date)
print("Remaining NaT values:", merged_df['Incident Date'].isna().sum())



Remaining NaT values: 19
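
The count alone does not show which raw values failed to parse. A minimal sketch of how the failing strings could be collected for review before the column is overwritten, assuming a hypothetical `raw_dates` series and a hypothetical helper `try_parse` that mirrors the cleaning function above:

```python
import pandas as pd
from dateutil import parser

def try_parse(val):
    """Hypothetical helper: parse a date string, returning pd.NaT on failure."""
    text = str(val).strip()
    if text == "" or text.lower() == "missing":
        return pd.NaT
    try:
        return parser.parse(text)
    except (ValueError, OverflowError):
        return pd.NaT

# Hypothetical raw values; the real column may contain other formats
raw_dates = pd.Series(["NOV-2021", "Missing", "03/15/2022", "unknown"])

parsed = raw_dates.apply(try_parse)

# Keep the raw strings that failed so they can be reviewed by hand
failed = raw_dates[parsed.isna()]
print(failed.value_counts())
```

Comparing the parsed result against a copy of the raw column separates the 'Missing' placeholders from genuinely malformed dates.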


In [22]:
# Convert numerical columns to numeric type
numerical_columns = ['Report Version', 'Mileage', 'Posted Speed Limit (MPH)', 'SV Precrash Speed (MPH)']
merged_df[numerical_columns] = merged_df[numerical_columns].apply(pd.to_numeric, errors='coerce')

# Convert to datetime (assumes format like 'NOV-2021')
merged_df['Incident Date'] = pd.to_datetime(
    merged_df['Incident Date'], format='%b-%Y', errors='coerce'
)

# First: Clean string representation and convert float-like years
merged_df['Model Year'] = merged_df['Model Year'].apply(
    lambda x: str(int(float(x))) if pd.notnull(x) and str(x).replace('.', '', 1).isdigit() else x
)

# Then convert to datetime (expects 4-digit year like '2021')
merged_df['Model Year'] = pd.to_datetime(
    merged_df['Model Year'], format='%Y', errors='coerce'
)

# Convert time strings to time objects
merged_df['Incident Time (24:00)'] = pd.to_datetime(
    merged_df['Incident Time (24:00)'], format='%H:%M', errors='coerce'  # NaT if parsing fails
).dt.time

print(tabulate(merged_df.head(10), headers='keys', tablefmt='grid'))

+----+-------------+------------------+---------------+-------------+---------------------+-----------+-----------------+------------------------------+---------------------+-------------------------+--------------------+---------+-------------------+----------------------+-----------------------+----------------------------+----------------------+-------------------------+-----------------------------------+--------------------+-------------------------+---------------------------+-----------------------------------------------------------+-----------------------------------+-----------+--------------------------------------------------------------------+
|    | Report ID   |   Report Version | Make          | Model       | Model Year          |   Mileage | ADS Equipped?   | Automation System Engaged?   | Incident Date       | Incident Time (24:00)   | City               | State   | Roadway Type      | Roadway Surface      | Roadway Description   |   Posted Speed Limit (MPH) | Lightin
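
Because `errors='coerce'` turns every value it cannot convert into `NaN`, the 'Missing' placeholder and any malformed entry become indistinguishable after conversion. A minimal sketch of how the two could be told apart beforehand, assuming a hypothetical `speed` series (the value `'35 mph'` is invented for illustration):

```python
import pandas as pd

# Hypothetical sample of a column after the 'Missing' fill
speed = pd.Series(
    ["25", "Missing", "45", "35 mph", "55"],
    name="Posted Speed Limit (MPH)",
)

converted = pd.to_numeric(speed, errors="coerce")

# Entries that were not the placeholder but still failed to convert
unexpected = speed[converted.isna() & (speed != "Missing")]

print("Values coerced to NaN:", converted.isna().sum())
print("Malformed entries:", unexpected.tolist())
```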

**Renaming Columns to Match UK STATS19**


**STATS19 Matching Columns**
* Report ID -> accident_index 
* Report Version -> accident_reference 
* Make -> make (from generic_make_model)
* Model -> model (from generic_make_model)
* Model Year -> Model Year (from age_of_vehicle)
* Incident Date -> date
* Incident Time (24:00) -> time
* City -> city (from longitude and latitude)
* State -> county
* Roadway Type -> road_type
* Roadway Surface -> road_surface_conditions
* Posted Speed Limit (MPH) -> speed_limit
* Lighting -> light_conditions
* Crash With -> hit_object_in_carriageway and hit_object_off_carriageway 
* Highest Injury Severity Alleged -> accident_severity
* SV Pre-Crash Movement -> vehicle_manoeuvre
* SV Precrash Speed (MPH) 
* Weather -> weather_conditions
* SV Contact Area -> first_point_of_impact

**UNMATCHED**
* SV Any Air Bags Deployed?  
* CP Contact Area 
* Property Damage?
* Mileage
* ADS Equipped?
* Automation System Engaged?
* Roadway Description

**ADD**
- Country (US)

In [23]:
# Add a new column 'Country' and fill it with the value 'US' for all rows
merged_df['Country'] = 'US'

# Display the updated dataframe
print(tabulate(merged_df.head(10), headers='keys', tablefmt='grid'))

+----+-------------+------------------+---------------+-------------+---------------------+-----------+-----------------+------------------------------+---------------------+-------------------------+--------------------+---------+-------------------+----------------------+-----------------------+----------------------------+----------------------+-------------------------+-----------------------------------+--------------------+-------------------------+---------------------------+-----------------------------------------------------------+-----------------------------------+-----------+--------------------------------------------------------------------+-----------+
|    | Report ID   |   Report Version | Make          | Model       | Model Year          |   Mileage | ADS Equipped?   | Automation System Engaged?   | Incident Date       | Incident Time (24:00)   | City               | State   | Roadway Type      | Roadway Surface      | Roadway Description   |   Posted Speed Limit (MP
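
The column mapping listed above could be applied with `DataFrame.rename`. A minimal sketch, assuming a hypothetical `demo_df` and a hypothetical dictionary name `STATS19_RENAMES`; only the one-to-one mappings from the list are included, and the keys follow the DataFrame's actual column headers:

```python
import pandas as pd

# One-to-one mappings taken from the list above; columns without a single
# STATS19 equivalent (e.g. 'Crash With') keep their original names
STATS19_RENAMES = {
    "Report ID": "accident_index",
    "Report Version": "accident_reference",
    "Make": "make",
    "Model": "model",
    "Incident Date": "date",
    "Incident Time (24:00)": "time",
    "City": "city",
    "State": "county",
    "Roadway Type": "road_type",
    "Roadway Surface": "road_surface_conditions",
    "Posted Speed Limit (MPH)": "speed_limit",
    "Lighting": "light_conditions",
    "Highest Injury Severity Alleged": "accident_severity",
    "SV Pre-Crash Movement": "vehicle_manoeuvre",
    "Weather": "weather_conditions",
    "SV Contact Area": "first_point_of_impact",
}

# Hypothetical sample standing in for merged_df
demo_df = pd.DataFrame(
    {"Report ID": ["A-1"], "Roadway Type": ["Street"], "Mileage": [1200]}
)

# rename() ignores keys that are not present, so the full mapping is safe to apply
renamed_df = demo_df.rename(columns=STATS19_RENAMES)
print(renamed_df.columns.tolist())
```

Keeping the mapping in one dictionary makes it easy to review against the STATS19 schema and to reuse when the two datasets are combined.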

**Renaming Values to Standardised Names**

In [24]:
# Display the unique values of the key categorical columns
key_values = {
    'Roadway Type',
    'Roadway Surface',
    'Lighting',
    'Crash With',
    'Highest Injury Severity Alleged',
    'SV Pre-Crash Movement',
    'Weather',
    'SV Contact Area',
}

for column in key_values:
    unique_values = merged_df[column].unique()
    print(f"Unique values for {column}:")
    print(unique_values)
    print("-" * 50)

Unique values for Lighting:
['Dark - Not Lighted' 'Daylight' 'Unknown' 'Other, see Narrative'
 'Dark - Unknown Lighting' 'Dawn / Dusk' 'Dark - Lighted']
--------------------------------------------------
Unique values for Roadway Surface:
['Dry' 'Unknown' 'Other, see Narrative' 'Wet' 'Snow / Slush / Ice']
--------------------------------------------------
Unique values for Roadway Type:
['Street' 'Highway / Freeway' 'Unknown' 'Rural Road' 'Intersection'
 'Parking Lot' 'Traffic Circle' 'Unpaved Road']
--------------------------------------------------
Unique values for SV Contact Area:
['Front' 'Front Left'
 'Rear Left, Left, Front Left, Front, Rear Right, Right, Front Right'
 'Front Left, Front, Front Right'
 'Rear Left, Front Left, Front, Front Right' 'Missing'
 'Front, Front Right' 'Rear Left, Left, Front Left, Front, Front Right'
 'Rear Left, Left, Front Left, Rear, Front, Front Right'
 'Rear Right, Right, Front Right' 'Rear Left, Front Left'
 'Rear Left, Left, Front Left' 'Bottom' 
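
Once the unique values are known, they could be standardised with a lookup dictionary. A minimal sketch, assuming a hypothetical `demo_df` built from values printed above; the target labels in `LIGHTING_MAP` and the new column name are assumptions, since the final labels depend on the STATS19 coding chosen later:

```python
import pandas as pd

# Hypothetical sample using values printed above
demo_df = pd.DataFrame({
    "Lighting": ["Daylight", "Unknown", "Dark - Lighted", "Other, see Narrative"],
    "SV Contact Area": ["Front", "Front Left, Front, Front Right", "Missing", "Bottom"],
})

# Assumed target labels, for illustration only
LIGHTING_MAP = {
    "Unknown": "Missing",
    "Other, see Narrative": "Other",
}

# replace() leaves any value that is not a key in the mapping unchanged
demo_df["Lighting"] = demo_df["Lighting"].replace(LIGHTING_MAP)

# Split the comma-separated contact areas into lists for easier counting
demo_df["SV Contact Area List"] = demo_df["SV Contact Area"].str.split(", ")

print(demo_df)
```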

**Save Cleaned Dataframe**

In [25]:
merged_df.to_csv('US-cleaned_data.csv', index=False)
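
CSV files store dates as text, so datetime columns need to be parsed again when the file is reloaded. A minimal sketch of the round trip, assuming a hypothetical `demo_df` and an in-memory buffer in place of the real file:

```python
import io

import pandas as pd

# Hypothetical sample with one parsed date and one missing date
demo_df = pd.DataFrame({
    "Incident Date": pd.to_datetime(["2021-11-01", None]),
    "Country": ["US", "US"],
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
demo_df.to_csv(buffer, index=False)
buffer.seek(0)

# Without parse_dates the column would come back as plain strings
reloaded = pd.read_csv(buffer, parse_dates=["Incident Date"])
print(reloaded.dtypes)
```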