DATA CLEANING SECTION 
- LOADING BOTH DATASETS 
- INITIAL CHECKS 
- DUPLICATES
- COMBINING / DROPPING ROWS
- MERGING
- REMOVING WHITESPACE

Problem statement and getting the data
* ADAS
* ADS
* OTHER

In [46]:
# Import necessary libraries
import pandas as pd
from tabulate import tabulate # For pretty printing tables

Checking the datasets before merging

In [47]:
def explore_datasets(df1, df2, df3):
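    """Print shapes, unique 'Report ID' counts, shared columns,
    overall missing-value share, and exact row overlaps for three DataFrames."""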

    print("=== Dataset Shapes ===")
    print(f"Dataset 1 shape: {df1.shape}")
    print(f"Dataset 2 shape: {df2.shape}")
    print(f"Dataset 3 shape: {df3.shape}")
    # Count unique 'Report ID' for each dataset
    unique_ids_df1 = df1['Report ID'].nunique()
    unique_ids_df2 = df2['Report ID'].nunique()
    unique_ids_df3 = df3['Report ID'].nunique()

    print("\n=== Unique 'Report ID' Counts ===")
    print(f"Dataset 1: {unique_ids_df1}")
    print(f"Dataset 2: {unique_ids_df2}")
    print(f"Dataset 3: {unique_ids_df3}")

    # Overlapping columns between each pair
    common_12 = set(df1.columns).intersection(df2.columns)
    common_13 = set(df1.columns).intersection(df3.columns)
    common_23 = set(df2.columns).intersection(df3.columns)

    print("\n=== Overlapping Columns ===")
    print(f"Between df1 & df2: {len(common_12)} columns")
    print(f"Columns: {sorted(common_12)}")

    print(f"\nBetween df1 & df3: {len(common_13)} columns")
    print(f"Columns: {sorted(common_13)}")

    print(f"\nBetween df2 & df3: {len(common_23)} columns")
    print(f"Columns: {sorted(common_23)}")

    # Missing values percentage overall per dataset
    def missing_percentage(df):
        total_cells = df.size
        total_missing = df.isnull().sum().sum()
        return (total_missing / total_cells) * 100

    print("\n=== Missing Values Percentage ===")
    print(f"Dataset 1 missing: {missing_percentage(df1):.2f}%")
    print(f"Dataset 2 missing: {missing_percentage(df2):.2f}%")
    print(f"Dataset 3 missing: {missing_percentage(df3):.2f}%")

    # Function to count exact overlapping rows on common columns (dtype-safe)
    def count_exact_row_overlap(dfA, dfB, common_cols):
        if not common_cols:
            return 0
        dfA_sub = dfA[list(common_cols)].copy()
        dfB_sub = dfB[list(common_cols)].copy()

        # Convert common columns to strings to avoid dtype mismatch
        for col in common_cols:
            dfA_sub[col] = dfA_sub[col].astype(str)
            dfB_sub[col] = dfB_sub[col].astype(str)

        merged = pd.merge(dfA_sub.drop_duplicates(), dfB_sub.drop_duplicates(), how='inner')
        return merged.shape[0]

    # Exact row overlaps
    print("\n=== Exact Row Overlaps ===")
    overlap_12 = count_exact_row_overlap(df1, df2, common_12)
    overlap_13 = count_exact_row_overlap(df1, df3, common_13)
    overlap_23 = count_exact_row_overlap(df2, df3, common_23)

    print(f"Exact overlapping rows between df1 & df2: {overlap_12}")
    print(f"Exact overlapping rows between df1 & df3: {overlap_13}")
    print(f"Exact overlapping rows between df2 & df3: {overlap_23}")


In [48]:
# Load the three incident report files and run the checks
df_adas = pd.read_csv('SGO-2021-01_Incident_Reports_ADAS.csv')
df_ads = pd.read_csv('SGO-2021-01_Incident_Reports_ADS.csv')
df_other = pd.read_csv('SGO-2021-01_Incident_Reports_OTHER.csv')
explore_datasets(df_adas, df_ads, df_other)

=== Dataset Shapes ===
Dataset 1 shape: (3787, 137)
Dataset 2 shape: (2063, 137)
Dataset 3 shape: (3489, 137)

=== Unique 'Report ID' Counts ===
Dataset 1: 2482
Dataset 2: 1685
Dataset 3: 3462

=== Overlapping Columns ===
Between df1 & df2: 137 columns
Columns: ['ADAS/ADS Hardware Version', 'ADAS/ADS Hardware Version - Unk', 'ADAS/ADS Hardware Version CBI', 'ADAS/ADS Software Version', 'ADAS/ADS Software Version - Unk', 'ADAS/ADS Software Version CBI', 'ADAS/ADS System Version', 'ADAS/ADS System Version - Unk', 'ADAS/ADS System Version CBI', 'ADS Equipped?', 'Address', 'Address - Unknown', 'Automation System Engaged?', 'CP Any Air Bags Deployed?', 'CP Contact Area - Bottom', 'CP Contact Area - Front', 'CP Contact Area - Front Left', 'CP Contact Area - Front Right', 'CP Contact Area - Left', 'CP Contact Area - Rear', 'CP Contact Area - Rear Left', 'CP Contact Area - Rear Right', 'CP Contact Area - Right', 'CP Contact Area - Top', 'CP Contact Area - Unknown', 'CP Pre-Crash Movement', 'CP

In [49]:
# Preview the OTHER dataset
print("\n🔍 First 10 Rows of df_other:")
print(tabulate(df_other.head(10), headers='keys', tablefmt='grid', showindex=True))


🔍 First 10 Rows of df_other:
+----+-------------+------------------+-----------------------------------+------------------------------------+----------------+---------------+--------------------------+-------+-----------------+-----------------+--------+---------+-------------------+--------------+------------------------+-------------------+-----------+---------------------+--------------------------+---------------------------+---------------------------------+-------------------------------+-----------------------------+-----------------------------------+---------------------------------+-----------------------------+-----------------------------------+---------------------------------+-----------------------------+-----------------------------------+----------------------------------+---------------------------------+--------------------------------+--------------------------------+-------------------------------+--------------------------+-------------------------+--------------

In [50]:
# Display missing values percentage for df_other (NOT SUITABLE FOR MERGING)
missing_percent = (df_other.isnull().mean() * 100).sort_values(ascending=False)
print("\n📊 Missing Values Percentage per Column (High to Low):")
print(missing_percent.to_string(float_format="%.2f%%"))



📊 Missing Values Percentage per Column (High to Low):
State or Local Permit             100.00%
Other Federal Reg. Exemption      100.00%
Federal Regulatory Exemption?     100.00%
Weather - Other Text              100.00%
ADAS/ADS Hardware Version          99.94%
Investigating Officer Email        99.94%
Investigating Officer Phone        99.94%
State or Local Permit?             99.94%
Serial Number                      99.94%
ADAS/ADS Software Version          99.94%
Other Reporting Entities?          99.91%
Narrative - CBI?                   99.89%
Source - Other Text                99.68%
ADAS/ADS System Version            99.51%
Latitude                           99.46%
Longitude                          99.46%
Investigating Officer Name         99.40%
Investigating Agency               99.31%
ADS Equipped?                      98.48%
Zip Code                           98.22%
Posted Speed Limit (MPH)           98.17%
Automation System Engaged?         98.17%
Mileage              
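
The per-column percentages above are the reason df_other is set aside. A minimal sketch of turning that listing into a list of sparse columns, assuming a hypothetical helper `sparse_columns` and a cutoff chosen only for illustration (the toy frame stands in for df_other and is not real data):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df_other (hypothetical values, not the real file)
toy = pd.DataFrame({
    "Report ID": ["a", "b", "c", "d"],
    "Latitude": [np.nan, np.nan, np.nan, 1.0],
    "Zip Code": [np.nan, np.nan, np.nan, np.nan],
})

def sparse_columns(df, threshold=0.5):
    """Return the names of columns whose share of missing values exceeds threshold."""
    missing_share = df.isnull().mean()  # fraction of NaN per column
    return missing_share[missing_share > threshold].index.tolist()

sparse = sparse_columns(toy, threshold=0.5)
print(sparse)
```

Calling the same helper on df_other with a stricter cutoff (for example 0.98) would list the columns that are almost entirely empty.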

In [51]:
# MERGE ADS AND ADAS DATASETS
"""
Notes on keys (DataFrame names: df_adas and df_ads):
- 'Report ID' is the common column. It is a unique report number generated by
  NHTSA to track initial and updated reports for a given incident.
- The datasets can hold several versions per 'Report ID', so 'Report ID' is a
  grouping key, not a unique row identifier.
- A row-level key therefore has to be composite: 'Report ID' + 'Report Version'.
- Still to do: add an ADAS/ADS source column so the result can be sorted and
  filtered by source.

Optional: drop duplicates on 'Report ID' if you want unique keys only:
# merged_df = merged_df.drop_duplicates(subset=['Report ID'])
"""
merged_df = pd.merge(
    df_adas,
    df_ads,
    on=['Report ID', 'Report Version'],
    how='outer'  # use 'inner' if you want only matching rows
)

print(f"Shape of merged dataframe: {merged_df.shape}")


print("\n🔍 First 10 Rows of merged data:")
print(tabulate(merged_df.head(10), headers='keys', tablefmt='grid', showindex=True))



Shape of merged dataframe: (5850, 272)

🔍 First 10 Rows of merged data:
+----+-------------+------------------+---------------------------+-----------------+------------------+-----------------+----------------------------+-------------+-------------------+---------------------------------------------------+----------+-------------+---------------------+----------------+--------------------------+---------------------+-------------+-----------------------+--------------------------------+-----------------------------------------------------------+-----------------------------------+---------------------------------+-----------------------------------------------------------+-------------------------------------+-----------------------------------+-----------------------------------------------------------+-------------------------------------+-----------------------------------+-------------------------------+-------------------------------------+------------------------------------+--
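
The merged shape is worth a second look: 272 columns is 2 key columns plus 2 x 135 non-key columns, because pd.merge keeps both copies of every shared non-key column and renames them with _x / _y suffixes. The 5850 rows equal 3787 + 2063, which suggests no (Report ID, Report Version) pair matched across the two files. A small sketch of the difference between joining and stacking, using made-up frames `adas` and `ads` rather than the real data:

```python
import pandas as pd

# Hypothetical miniature versions of the two files, with identical columns
adas = pd.DataFrame({"Report ID": ["1-1", "1-2"], "Report Version": [1, 1], "City": ["A", "B"]})
ads = pd.DataFrame({"Report ID": ["2-1"], "Report Version": [1], "City": ["C"]})

# Outer join on the composite key: shared non-key columns are duplicated
joined = pd.merge(adas, ads, on=["Report ID", "Report Version"], how="outer")

# Row-wise stacking: the column set stays the same
stacked = pd.concat([adas, ads], ignore_index=True)

print(joined.shape, joined.columns.tolist())
print(stacked.shape, stacked.columns.tolist())
```

When both frames share the same schema and describe different incidents, stacking with pd.concat is usually the better fit.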

In [52]:
# Assuming 'Report Version' is numeric or sortable
merged_df_sorted = merged_df.sort_values(by=['Report ID', 'Report Version'], ascending=[True, False])

# Keep the first row per Report ID (which will have the highest Report Version)
latest_reports = merged_df_sorted.drop_duplicates(subset=['Report ID'], keep='first')

print(f"\nShape of latest reports: {latest_reports.shape}")
print("\n🔍 First 10 Rows of latest reports:")
print(tabulate(latest_reports.head(10), headers='keys', tablefmt='grid', showindex=True))




Shape of latest reports: (4167, 272)

🔍 First 10 Rows of latest reports:
+------+-------------+------------------+---------------------------+-----------------+------------------+-----------------+----------------------------+-------------+-------------------+-------------------+---------------+-------------+---------------------+----------------+--------------------------+---------------------+-------------+-----------------------+--------------------------------+-----------------------------------------------------------+-----------------------------------+---------------------------------+----------------------------------------------------------------------+-------------------------------------+-----------------------------------+-----------------------------------------------------------+-------------------------------------+-----------------------------------+-------------------------------+-------------------------------------+------------------------------------+--------------

In [53]:
version_counts = merged_df.groupby('Report ID')['Report Version'].nunique()
multi_version = version_counts[version_counts > 1]
print(f"\nReport IDs with multiple versions (count: {len(multi_version)}):")
multi_version = multi_version.sort_values(ascending=False)
print(multi_version)




Report IDs with multiple versions (count: 1479):
Report ID
945-8258      9
28349-4648    8
753-6768      5
28349-4582    5
855-1341      5
             ..
13781-3716    2
13781-3715    2
13781-3714    2
13781-3713    2
988-3684      2
Name: Report Version, Length: 1479, dtype: int64
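
Since many Report IDs carry several versions, it can help to confirm that 'Report ID' + 'Report Version' really identifies a row before relying on it as a composite key. A minimal sketch on invented data (the `reports` frame and its values are assumptions for illustration):

```python
import pandas as pd

# Hypothetical rows: the pair ("945-1", 2) appears twice
reports = pd.DataFrame({
    "Report ID": ["945-1", "945-1", "945-1", "753-2"],
    "Report Version": [1, 2, 2, 1],
})

key = ["Report ID", "Report Version"]

# keep=False marks every row that shares its key with another row
dup_mask = reports.duplicated(subset=key, keep=False)
n_dup_rows = int(dup_mask.sum())

# The composite key is unique only if no row repeats an earlier key
is_unique_key = not reports.duplicated(subset=key).any()

print(n_dup_rows, is_unique_key)
```

Running the same check on the combined ADAS/ADS data would show whether any rows still need de-duplicating at the version level.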


In [54]:

# keep latest version for each Report ID
# Step 1: Concatenate both DataFrames
combined_df = pd.concat([df_adas,df_ads], ignore_index=True)

# Step 2: Keep only the latest version per Report ID
merged_df = combined_df.loc[combined_df.groupby('Report ID')['Report Version'].idxmax()].reset_index(drop=True)

# Shape of the final merged dataframe with latest versions
print(f"\nShape of latest versions: {merged_df.shape}")

# Unique Report IDs in the final merged dataframe
unique_report_ids = merged_df['Report ID'].nunique()
print(f"\nUnique Report IDs in the final merged dataframe: {unique_report_ids}")


# Display the first 10 rows of the latest-version dataframe
print("\n🔍 First 10 Rows of latest versions:")
print(tabulate(merged_df.head(10), headers='keys', tablefmt='grid', showindex=True))




Shape of latest versions: (4167, 137)

Unique Report IDs in the final merged dataframe: 4167

🔍 First 10 Rows of latest versions:
+----+-------------+------------------+---------------------------+---------------+----------------+---------------+--------------------------+-------------+-----------------+-----------------+---------------+-------------+-------------------+--------------+------------------------+-------------------+-----------+---------------------+--------------------------------+-----------------------------------------------------------+---------------------------------+-------------------------------+----------------------------------------------------------------------+-----------------------------------+---------------------------------+-----------------------------------------------------------+-----------------------------------+---------------------------------+-----------------------------+-----------------------------------+----------------------------------+-
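
Both approaches used so far (sort then drop_duplicates, and groupby with idxmax) returned 4167 rows. A small sketch comparing them on a toy frame, assuming 'Report Version' is numeric with no missing values (the frame and its values are made up):

```python
import pandas as pd

toy = pd.DataFrame({
    "Report ID": ["a", "a", "b", "c", "c", "c"],
    "Report Version": [1, 2, 1, 3, 1, 2],
    "City": ["x1", "x2", "y1", "z3", "z1", "z2"],
})

# Approach 1: sort newest-first within each ID, keep the first row per ID
by_sort = (
    toy.sort_values(["Report ID", "Report Version"], ascending=[True, False])
       .drop_duplicates(subset=["Report ID"], keep="first")
       .reset_index(drop=True)
)

# Approach 2: look up the row label of the highest version per ID
by_idxmax = toy.loc[
    toy.groupby("Report ID")["Report Version"].idxmax()
].reset_index(drop=True)

same = by_sort.equals(by_idxmax)
print(same)
```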

In [55]:
# Add a column to indicate the source dataset (ADAS or ADS), then write the
# merged dataframe to a CSV file for version control.
# isin() checks every Report ID in one vectorized pass instead of row by row.
merged_df['Source Dataset'] = (
    merged_df['Report ID'].isin(df_adas['Report ID']).map({True: 'ADAS', False: 'ADS'})
)

# Save the merged dataframe to a CSV file
merged_df.to_csv('merged_dataset.csv', index=False)
print("Merged dataset with source information has been saved to 'merged_dataset.csv'.")

Merged dataset with source information has been saved to 'merged_dataset.csv'.
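
An alternative worth considering is to tag each frame with its source before concatenating, so the label travels with every row and no membership lookup is needed afterwards. A sketch under that assumption, with invented frames `adas` and `ads`:

```python
import pandas as pd

# Hypothetical miniature inputs
adas = pd.DataFrame({"Report ID": ["1-1", "1-1"], "Report Version": [1, 2]})
ads = pd.DataFrame({"Report ID": ["2-1"], "Report Version": [1]})

# Label rows first, then stack them
combined = pd.concat(
    [
        adas.assign(**{"Source Dataset": "ADAS"}),
        ads.assign(**{"Source Dataset": "ADS"}),
    ],
    ignore_index=True,
)

# Keep the latest version per Report ID; the source label is preserved
latest = combined.loc[
    combined.groupby("Report ID")["Report Version"].idxmax()
].reset_index(drop=True)

print(latest)
```

This also stays correct if a Report ID ever appears in both files, since each row keeps the label of the file it came from.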


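Summary of this section: merging
- df_other is NOT SUITABLE FOR MERGING: many of its columns are more than 98% missing.
- df_adas and df_ads are concatenated and reduced to the latest 'Report Version' per 'Report ID' (4167 unique Report IDs).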