# Final Pre-Processing

Oviya Adhan

DATASCI 207 Machine Learning

Professor Cornelia Paulik

*Note: This step simply one-hot encodes the variables so that they are in a format TensorFlow accepts for ML models. The earlier pre-processing steps are in Christine's processing_2024.ipynb and processing_2025.ipynb notebooks, which produced the data files final_merged_2024.csv and final_merged_2025.csv. The later pre-processing steps that were informed by the EDA are in Oviya's EDA.ipynb notebook; those steps are also copied and marked in this notebook. This notebook ends with training, validation, and test datasets ready to be used in ML models.*

In [48]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import altair as alt
from sklearn.model_selection import train_test_split
#from itertools import chain
#import ast
#from itertools import combinations
#from collections import Counter, defaultdict
from sklearn.preprocessing import MultiLabelBinarizer
import re

In [49]:
# Connect to GitHub Repo
from getpass import getpass

# Step 1: Enter token securely
token = getpass('Enter your GitHub token: ')

# Step 2: Build the full URL
repo_owner = "christinesako-berk"
repo_name = "ds_207_final_project"
repo_url = f"https://{token}@github.com/{repo_owner}/{repo_name}.git"

# Clone repo
!git clone "{repo_url}"

Enter your GitHub token: ··········
Cloning into 'ds_207_final_project'...
remote: Enumerating objects: 116, done.
remote: Counting objects: 100% (116/116), done.
remote: Compressing objects: 100% (104/104), done.
remote: Total 116 (delta 52), reused 35 (delta 9), pack-reused 0 (from 0)
Receiving objects: 100% (116/116), 6.25 MiB | 4.51 MiB/s, done.
Resolving deltas: 100% (52/52), done.
Updating files: 100% (17/17), done.
Filtering content: 100% (6/6), 561.80 MiB | 34.20 MiB/s, done.


In [50]:
%cd /content/ds_207_final_project/data/processed

/content/ds_207_final_project/data/processed


In [51]:
!ls

ds_207_final_project  final_merged_2024.csv  final_merged_2025.csv


In [52]:
# Load data
initial_2024 = pd.read_csv('final_merged_2024.csv')
initial_2025 = pd.read_csv('final_merged_2025.csv')

### 1 - Filter NaN Values
*Originally from EDA.ipynb*

In [53]:
# Remove rows with NaN values in feature columns
# Check initial sums
print('BEFORE REMOVING NAN VALUES:')
print('Shape of data:')
print(f'Train/Val: {initial_2024.shape}')
print(f'Test: {initial_2025.shape}')
print('NaN count:')
print(f'{initial_2024.isna().sum()}')
print(f'{initial_2025.isna().sum()}')

df_2024 = initial_2024.copy() # Create copy for processed df
df_2025 = initial_2025.copy() # Create copy for processed df

# Remove rows with NaN values in the listed features
nan_features = ['MovementPrecCollDescription',
                'AirbagDescription',
                'SafetyEquipmentDescription',
                'SobrietyDrugPhysicalDescription1',
                'SpecialInformation',
                'SpeedLimit']
# dropna with a list of columns drops a row if ANY listed feature is missing
df_2024 = df_2024.dropna(subset=nan_features)
df_2025 = df_2025.dropna(subset=nan_features)

# Replace remaining NaN values in the outcome label with "No Injury"
df_2024['ExtentOfInjuryCode'] = df_2024['ExtentOfInjuryCode'].fillna('No Injury')
df_2025['ExtentOfInjuryCode'] = df_2025['ExtentOfInjuryCode'].fillna('No Injury')


# Check sums after removal
print('\nAFTER REMOVING NAN VALUES:')
print('Shape of data:')
print(f'Train/Val: {df_2024.shape}')
print(f'Test: {df_2025.shape}')
print('NaN count:')
print(f'{df_2024.isna().sum()}')
print(f'{df_2025.isna().sum()}')

BEFORE REMOVING NAN VALUES:
Shape of data:
Train/Val: (406874, 13)
Test: (139443, 13)
NaN count:
CollisionId                              0
CollisionTypeDescription                 0
IsHighwayRelated                         0
Weather1                                 0
RoadCondition1                           0
LightingDescription                      0
ExtentOfInjuryCode                  293729
MovementPrecCollDescription         133951
AirbagDescription                   133951
SafetyEquipmentDescription          133951
SobrietyDrugPhysicalDescription1    133951
SpecialInformation                  133951
SpeedLimit                          133951
dtype: int64
CollisionId                             0
CollisionTypeDescription                0
IsHighwayRelated                        0
Weather1                                0
RoadCondition1                          0
LightingDescription                     0
ExtentOfInjuryCode                  98942
MovementPrecCollDescription         3
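
The filtering logic above can be checked on a small frame. This is a minimal sketch with made-up rows (only the column names come from the notebook): rows missing any listed feature are dropped, and a missing outcome label is then filled with "No Injury".

```python
import numpy as np
import pandas as pd

# Toy data: the values are hypothetical, the column names follow the notebook
toy = pd.DataFrame({
    "SpeedLimit": [25.0, np.nan, 65.0, 45.0],
    "AirbagDescription": ["DEPLOYED", "NOT DEPLOYED", np.nan, "DEPLOYED"],
    "ExtentOfInjuryCode": ["Minor", np.nan, "Severe", np.nan],
})

# Drop rows where any of the listed feature columns is missing
toy_clean = toy.dropna(subset=["SpeedLimit", "AirbagDescription"]).copy()

# Treat a missing outcome label as "No Injury"
toy_clean["ExtentOfInjuryCode"] = toy_clean["ExtentOfInjuryCode"].fillna("No Injury")

print(toy_clean)
```

Only the first and last rows survive, and the last row's label becomes "No Injury".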

### 2 - Split Data
*Originally from EDA.ipynb*

In [54]:
# Split into training, validation, and test sets
train_df, val_df = train_test_split(df_2024, test_size=0.2, random_state=42) # 80% train, 20% validation
test_df = df_2025.copy()

In [55]:
# Define numeric and categorical variables
numeric_vars = ['SpeedLimit'] # Note: CollisionId is not normalized; it is dropped below
binary_vars = ['IsHighwayRelated']
categorical_vars = ['CollisionTypeDescription', 'Weather1', 'RoadCondition1',
       'LightingDescription', 'MovementPrecCollDescription',
       'AirbagDescription', 'SafetyEquipmentDescription',
       'SobrietyDrugPhysicalDescription1', 'SpecialInformation']
target_var = ['ExtentOfInjuryCode']

### 3 - Normalize Numeric Data
*Originally from EDA.ipynb*

Note: Collision ID is not normalized since this is an arbitrary unique identifier (UUID) and thus does not add meaningful information to our model. We will drop this feature.

In [56]:
# Drop collision ID
for df in [train_df, val_df, test_df]:
    df.drop(columns=['CollisionId'], inplace=True)

In [57]:
# Normalize numeric variables (speed limit) with training df statistics
speed_mean = train_df['SpeedLimit'].mean()
speed_std = train_df['SpeedLimit'].std()

for df in [train_df, val_df, test_df]:
    df['SpeedLimit'] = (df['SpeedLimit'] - speed_mean) / speed_std
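
As a small illustration of why the statistics come from the training split only, the sketch below standardizes a toy training column and then reuses the same mean and standard deviation on a toy validation column. The speed values are made up for the example.

```python
import pandas as pd

# Hypothetical speed limits, for illustration only
train_toy = pd.DataFrame({"SpeedLimit": [25.0, 35.0, 45.0, 55.0, 65.0]})
val_toy = pd.DataFrame({"SpeedLimit": [45.0, 65.0]})

# Statistics are computed on the training split only
mean = train_toy["SpeedLimit"].mean()
std = train_toy["SpeedLimit"].std()  # pandas uses the sample std (ddof=1)

# The same mean/std are applied to every split, so no validation or test
# information leaks into the transformation
train_z = (train_toy["SpeedLimit"] - mean) / std
val_z = (val_toy["SpeedLimit"] - mean) / std

print(train_z.mean(), train_z.std())
print(val_z.tolist())
```

After this step the training column has mean 0 and standard deviation 1, while the validation column is only approximately standardized, which is expected.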

### 4 - Change Binary Data to Numeric Binary

In [58]:
# Change IsHighwayRelated to integer
for df in [train_df, val_df, test_df]:
  df['IsHighwayRelated'] = df['IsHighwayRelated'].astype(int)

### 5 - Encode *categorical* variables with one-hot encoding



In [74]:
train_copy = train_df.copy()
val_copy = val_df.copy()
test_copy = test_df.copy()

In [75]:
# EXPLODE EACH CATEGORICAL COLUMN INTO ONE-HOT COLUMNS
for col in categorical_vars:
    print(f"Processing: {col}") # check progress

    # 1 - Extract raw values by dropping NaNs and converting to string
    raw_vals = train_df[col].dropna().astype(str).tolist()

    # 2 - Split values by ',' and strip whitespace
    split_vals = [[item.strip() for item in row.split(',')] for row in raw_vals]

    # 3 - Fit the sklearn multi label binarizer on the training data to extract unique values
    mlb = MultiLabelBinarizer()
    mlb.fit(split_vals)
    print(f"  → Found {len(mlb.classes_)} unique labels in '{col}'")

    # Helper function to convert a single cell into a list of labels
    def to_label_list(x):
        if pd.isna(x): return []
        return [i.strip() for i in x.split(',')]

    # Transform train/val/test
    for name, df in zip(['train', 'val', 'test'], [train_copy, val_copy, test_copy]):
        # encode column using fitted mlb
        encoded = mlb.transform(df[col].apply(to_label_list))
        encoded_df = pd.DataFrame(
            encoded,
            columns=[f"{col}_{c}" for c in mlb.classes_],
            index=df.index
        )

        # concatenate encoded columns to df
        df = pd.concat([df.drop(columns=[col]), encoded_df], axis=1)

        # persist changes made to the correct variable
        if name == 'train':
            train_copy = df
        elif name == 'val':
            val_copy = df
        else:
            test_copy = df

Processing: CollisionTypeDescription
  → Found 8 unique labels in 'CollisionTypeDescription'
Processing: Weather1
  → Found 40 unique labels in 'Weather1'
Processing: RoadCondition1
  → Found 333 unique labels in 'RoadCondition1'
Processing: LightingDescription
  → Found 5 unique labels in 'LightingDescription'
Processing: MovementPrecCollDescription
  → Found 76 unique labels in 'MovementPrecCollDescription'
Processing: AirbagDescription
  → Found 20 unique labels in 'AirbagDescription'
Processing: SafetyEquipmentDescription
  → Found 78 unique labels in 'SafetyEquipmentDescription'
Processing: SobrietyDrugPhysicalDescription1
  → Found 34 unique labels in 'SobrietyDrugPhysicalDescription1'
Processing: SpecialInformation
  → Found 44 unique labels in 'SpecialInformation'
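
To make the encoding step concrete, here is a minimal sketch on a toy column. The label values are hypothetical; the point is that a comma-separated cell becomes several 1s in the same row, and that a label seen only in validation is ignored because the binarizer was fit on training data.

```python
import warnings

import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical cells; some contain several comma-separated labels
train_col = pd.Series(["CLEAR", "RAINING, FOG/VISIBILITY", "CLEAR, WIND"])
val_col = pd.Series(["RAINING", "CLEAR, SNOWING"])  # SNOWING is unseen in train


def split_labels(cell):
    """Split one cell on commas and strip surrounding whitespace."""
    return [part.strip() for part in str(cell).split(",")]


# Fit on the training labels only
mlb = MultiLabelBinarizer()
mlb.fit(train_col.apply(split_labels).tolist())

# Transform validation; labels not seen during fit are dropped with a warning
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    val_matrix = mlb.transform(val_col.apply(split_labels).tolist())

encoded_val = pd.DataFrame(
    val_matrix,
    columns=[f"Weather1_{c}" for c in mlb.classes_],
    index=val_col.index,
)
print(encoded_val)
```

The second validation row keeps its `CLEAR` indicator, but nothing records `SNOWING`, which is why categories that appear only in the 2025 data do not create new columns.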


In [76]:
# CLEAN UP COLUMN NAMES
def clean_column_names(df):
    """
    Clean and standardize one-hot encoded column names.
    Removes brackets, quotes, and excessive spaces.
    """
    clean_cols = []
    for col in df.columns:
        # remove brackets and quotes
        col_clean = re.sub(r"[\[\]']", "", col)

        # remove 'OTHER* - ' if present (specifically for road condition var)
        col_clean = re.sub(r"OTHER\*\s*-\s*", "", col_clean)

        # collapse multiple spaces or underscores
        col_clean = re.sub(r"\s+", " ", col_clean)
        col_clean = re.sub(r"_+", "_", col_clean)

        # strip spaces around underscores
        col_clean = col_clean.replace(" _", "_").replace("_ ", "_").strip()
        clean_cols.append(col_clean)

    df.columns = clean_cols
    return df

train_copy = clean_column_names(train_copy)
val_copy = clean_column_names(val_copy)
test_copy = clean_column_names(test_copy)
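
The sketch below applies the same regular expressions to a few hypothetical column names (they are not taken from the data). It shows how cleaning can map two different raw names to the same cleaned name, which would explain why duplicate columns need to be consolidated in the next step.

```python
import re


def clean_name(name):
    """Apply the same cleaning rules as clean_column_names to one name."""
    name = re.sub(r"[\[\]']", "", name)         # remove brackets and quotes
    name = re.sub(r"OTHER\*\s*-\s*", "", name)  # remove the 'OTHER* - ' prefix
    name = re.sub(r"\s+", " ", name)            # collapse repeated whitespace
    name = re.sub(r"_+", "_", name)             # collapse repeated underscores
    return name.replace(" _", "_").replace("_ ", "_").strip()


# Hypothetical raw names, for illustration only
raw_names = [
    "RoadCondition1_OTHER* - WET",
    "RoadCondition1_WET",
    "Weather1_  DUST   CLOUD",
]
cleaned = [clean_name(n) for n in raw_names]
print(cleaned)
# ['RoadCondition1_WET', 'RoadCondition1_WET', 'Weather1_DUST CLOUD']
```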

In [81]:
# CONSOLIDATE DUPLICATE COLUMNS BY TAKING MAX (1)
def resolve_column_duplicates(df, verbose=True):
    """
    Resolves duplicate columns in a DataFrame by keeping one column per name
    and taking the row-wise max across duplicates.
    """
    dupes = df.columns[df.columns.duplicated()].unique()

    if len(dupes) == 0:
        if verbose:
            print("✅ No duplicate columns found.")
        return df

    if verbose:
        print(f"Found {len(dupes)} duplicate column(s). Resolving...")

    for col in dupes:
        cols = df.loc[:, df.columns == col]
        # compute max and drop all instances of col
        merged = cols.max(axis=1)
        df = df.drop(columns=cols.columns)
        # add back a single column with merged values
        df[col] = merged

    if verbose:
        print("✅ Duplicates resolved.")

    return df

# Apply function
train_copy = resolve_column_duplicates(train_copy)
val_copy = resolve_column_duplicates(val_copy)
test_copy = resolve_column_duplicates(test_copy)

Found 66 duplicate column(s). Resolving...
✅ Duplicates resolved.
Found 66 duplicate column(s). Resolving...
✅ Duplicates resolved.
Found 66 duplicate column(s). Resolving...
✅ Duplicates resolved.


In [82]:
train_copy.columns.duplicated().any()

np.False_
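
For comparison, the same consolidation can be written with a transpose and a groupby on the column labels. This is an alternative sketch on toy data, not the function used above; it assumes all duplicated columns hold 0/1 indicators.

```python
import pandas as pd

# Toy frame with a duplicated column name (hypothetical labels)
toy = pd.DataFrame(
    [[1, 0, 0], [0, 0, 1], [0, 1, 0]],
    columns=["A_WET", "A_DRY", "A_WET"],
)

# Transpose so column labels become the index, group equal labels,
# take the max within each group, then transpose back
merged = toy.T.groupby(level=0, sort=False).max().T

print(merged)
```

A row is 1 in the merged `A_WET` column if it was 1 in either of the original `A_WET` columns, which matches the row-wise max rule used by `resolve_column_duplicates`.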

In [84]:
# CONSOLIDATE SINGULAR/PLURAL COLUMNS BY TAKING MAX (1)
def merge_singular_plural(df):
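    '''
    Merge columns whose names differ only by a trailing 's'.
    For each such pair, keep the singular column and take the row-wise max.
    Note: the match is case-sensitive (lowercase 's' only), so an upper-case
    pair such as 'ROADWAY'/'ROADWAYS' is not merged.
    '''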
    cols = list(df.columns)
    merged_cols = set()

    for col in cols:
        if col in merged_cols:
            continue  # already handled

        if col.endswith('s'):
            # define singular instance of col
            singular = col[:-1]
            # if singular instance is a column
            if singular in df.columns:
                # merge the two columns
                df[singular] = df[[col, singular]].max(axis=1)
                df.drop(columns=[col], inplace=True)
                merged_cols.add(singular)
                merged_cols.add(col)
        else:
            # define plural instance of col
            plural = col + 's'
            # if plural instance is a column
            if plural in df.columns:
                # merge the two columns
                df[col] = df[[col, plural]].max(axis=1)
                df.drop(columns=[plural], inplace=True)
                merged_cols.add(col)
                merged_cols.add(plural)

    # Clean up any remaining exact duplicates (just in case)
    df = resolve_column_duplicates(df)
    return df

# Apply function to train, val, and test
train_copy = merge_singular_plural(train_copy)
val_copy = merge_singular_plural(val_copy)
test_copy = merge_singular_plural(test_copy)

✅ No duplicate columns found.
✅ No duplicate columns found.
✅ No duplicate columns found.
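
Because the encoded labels in this dataset are mostly upper case, a case-insensitive version of the singular/plural merge may be worth considering. The following is a hypothetical variant on toy columns, not the function that produced the files in this notebook; `merge_plural_pairs` and the toy labels are made up for the sketch.

```python
import pandas as pd


def merge_plural_pairs(df):
    """Merge 'X' with 'Xs' or 'XS' by taking the row-wise max (hypothetical helper)."""
    out = df.copy()
    for col in list(out.columns):
        # Skip columns already dropped or not ending in s/S
        if col not in out.columns or not col.lower().endswith("s"):
            continue
        singular = col[:-1]
        if singular in out.columns:
            out[singular] = out[[singular, col]].max(axis=1)
            out = out.drop(columns=[col])
    return out


# Hypothetical indicator columns
toy = pd.DataFrame({
    "Weather1_MIST": [1, 0, 0],
    "Weather1_MISTS": [0, 1, 0],
    "Weather1_CLEAR": [0, 0, 1],
})
merged = merge_plural_pairs(toy)
print(merged)
```

Applying a variant like this to the real data would change the final column count, so the shapes reported below would no longer match.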


In [85]:
# Align val/test columns to train
val_copy = val_copy.reindex(columns=train_copy.columns, fill_value=0)
test_copy = test_copy.reindex(columns=train_copy.columns, fill_value=0)
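
A quick toy example of what the alignment does: columns missing from validation are added and filled with 0, columns that train does not have are dropped, and the column order follows train. The column names here are placeholders.

```python
import pandas as pd

# Placeholder columns: val lacks B, has an extra D, and is in a different order
train_toy = pd.DataFrame({"A": [1], "B": [0], "C": [1]})
val_toy = pd.DataFrame({"C": [0, 1], "A": [1, 0], "D": [1, 1]})

# Reindex to the training columns; anything missing is filled with 0
val_aligned = val_toy.reindex(columns=train_toy.columns, fill_value=0)

print(val_aligned)
```

This guarantees that all three splits share the same feature layout, which a model with a fixed input shape requires.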

In [86]:
# Check final list of columns
for i in train_copy.columns:
  print(i)

IsHighwayRelated
ExtentOfInjuryCode
SpeedLimit
CollisionTypeDescription_BROADSIDE
CollisionTypeDescription_HEAD-ON
CollisionTypeDescription_HIT OBJECT
CollisionTypeDescription_OTHER
CollisionTypeDescription_OVERTURNED
CollisionTypeDescription_REAR END
CollisionTypeDescription_SIDE SWIPE
CollisionTypeDescription_VEHICLE/PEDESTRAIN
Weather1_CLEAR
Weather1_CLEAR- SMOKY
Weather1_CLOUDY
Weather1_DENSE SMOKE
Weather1_DIRT CLOUD
Weather1_DIRT PLUME
Weather1_DUST
Weather1_DUST CLOUD
Weather1_DUST STORM
Weather1_DUST VISIBILITY 1000 FT.
Weather1_DUST/POOR VISIBILITY
Weather1_DUSTY
Weather1_EXTREME HEAT
Weather1_FIRE IN AREA
Weather1_FOG/VISIBILITY
Weather1_HAIL
Weather1_HAILING
Weather1_HEAVY HAIL
Weather1_HEAVY RAIN/HAIL
Weather1_HEAVY/THICK MIST
Weather1_ICY
Weather1_ICY ROADWAYS
Weather1_LOW/FREEZING TEMPERATURES
Weather1_MIST
Weather1_MISTING
Weather1_MISTY
Weather1_OTHER
Weather1_OVERCAST
Weather1_PARTLY CLOUDY
Weather1_RAINING
Weather1_SANDSTORM
Weather1_SANDSTROM
Weather1_SMOKE
Weather1_

In [87]:
# Check final shapes
print(f'Train: {train_copy.shape}')
print(f'Val: {val_copy.shape}')
print(f'Test: {test_copy.shape}')

Train: (218338, 458)
Val: (54585, 458)
Test: (103669, 458)


### 6 - Store processed data into CSV files

In [89]:
train_copy.to_csv("train_final.csv", index=False)
val_copy.to_csv("val_final.csv", index=False)
test_copy.to_csv("test_final.csv", index=False)

##### Move files into new final data folder

In [93]:
!mkdir -p /content/ds_207_final_project/data/final
!mv /content/ds_207_final_project/data/processed/train_final.csv \
     /content/ds_207_final_project/data/processed/val_final.csv \
     /content/ds_207_final_project/data/processed/test_final.csv \
     /content/ds_207_final_project/data/final/

In [94]:
!ls

ds_207_final_project  final_merged_2024.csv  final_merged_2025.csv


##### Change to final data directory and ensure files made it

In [97]:
%cd /content/ds_207_final_project/data/final

/content/ds_207_final_project/data/final


In [98]:
!ls

test_final.csv	train_final.csv  val_final.csv


### 7 - Push new folder with files to GitHub

##### Switch to top level of repo and check status

In [100]:
%cd /content/ds_207_final_project

/content/ds_207_final_project


In [101]:
!git status

On branch main
Your branch is up to date with 'origin/main'.

Untracked files:
  (use "git add <file>..." to include in what will be committed)
	data/final/
	data/processed/ds_207_final_project/

nothing added to commit but untracked files present (use "git add" to track)


##### Stage, Commit, Push

In [102]:
!git add data/final

In [103]:
!git status

On branch main
Your branch is up to date with 'origin/main'.

Changes to be committed:
  (use "git restore --staged <file>..." to unstage)
	new file:   data/final/test_final.csv
	new file:   data/final/train_final.csv
	new file:   data/final/val_final.csv

Untracked files:
  (use "git add <file>..." to include in what will be committed)
	data/processed/ds_207_final_project/



In [105]:
!git config user.email "oviya.adhan@gmail.com"
!git config user.name "oadhan"

In [106]:
!git commit -m "Add final processed train, val, test data in CSVs"

[main 7d74229] Add final processed train, val, test data in CSVs
 3 files changed, 376595 insertions(+)
 create mode 100644 data/final/test_final.csv
 create mode 100644 data/final/train_final.csv
 create mode 100644 data/final/val_final.csv


In [107]:
!git push

Enumerating objects: 9, done.
Counting objects: 100% (9/9), done.
Delta compression using up to 2 threads
Compressing objects: 100% (7/7), done.
Writing objects: 100% (7/7), 4.97 MiB | 863.00 KiB/s, done.
Total 7 (delta 3), reused 0 (delta 0), pack-reused 0
remote: Resolving deltas: 100% (3/3), completed with 2 local objects.
remote: error: Trace: eabaa8bb11344ddd5243478f31914a63b5b0fadf1ac0f351a08c843354d87e4b
remote: error: See https://gh.io/lfs for more information.
remote: error: File data/final/train_final.csv is 195.74 MB; this exceeds GitHub's file size limit of 100.00 MB
remote: error: GH001: Large files detected. You may want to try Git Large File Storage - https://git-

See above: the push was rejected because train_final.csv (195.74 MB) exceeds GitHub's 100 MB file size limit.
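
The size limit reported in the push error can be checked before committing. The sketch below walks a folder and lists files above a threshold. The helper name `oversized_files` is made up, and the default of 100 MiB is an assumption based on the 100.00 MB figure in the error message; the demo uses a temporary folder and a tiny limit so it runs anywhere.

```python
import os
import tempfile

# Assumption: the "100.00 MB" in the error message corresponds to 100 MiB
GITHUB_LIMIT_BYTES = 100 * 1024 * 1024


def oversized_files(folder, limit=GITHUB_LIMIT_BYTES):
    """Return (path, size_in_bytes) for every file under folder larger than limit."""
    too_big = []
    for root, _dirs, files in os.walk(folder):
        for fname in files:
            path = os.path.join(root, fname)
            size = os.path.getsize(path)
            if size > limit:
                too_big.append((path, size))
    return too_big


# Demo on a temporary folder with a 1000-byte limit
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "small.csv"), "wb") as f:
        f.write(b"x" * 10)
    with open(os.path.join(tmp, "big.csv"), "wb") as f:
        f.write(b"x" * 2000)

    flagged = oversized_files(tmp, limit=1000)
    flagged_names = [os.path.basename(path) for path, _size in flagged]

print(flagged_names)  # ['big.csv']
```

In this notebook the call would be something like `oversized_files('/content/ds_207_final_project/data/final')`, run before `git add`.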

Detour: Save files to Google Drive

In [108]:
from google.colab import drive
drive.mount('/content/drive')


Mounted at /content/drive


In [109]:
!mkdir -p "/content/drive/My Drive/DATASCI 207 - Applied Machine Learning/Final Project/data/"

In [110]:
!cp -r /content/ds_207_final_project/data/final/* "/content/drive/My Drive/DATASCI 207 - Applied Machine Learning/Final Project/data/"

In [115]:
%cd /content/ds_207_final_project/data/final/

/content/ds_207_final_project/data/final


In [116]:
!ls

test_final.csv	train_final.csv  val_final.csv


Compress files into a tar.gz archive and push to GitHub

In [118]:
%cd /content/ds_207_final_project/data/

/content/ds_207_final_project/data


In [119]:
!ls

ccrs_template.docx  final  processed  raw


In [120]:
!tar -czvf final_data.tar.gz final/

final/
final/val_final.csv
final/test_final.csv
final/train_final.csv
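
For anyone who needs to unpack or rebuild the archive from Python rather than the shell, the standard-library `tarfile` module does the equivalent of `tar -czvf`. The sketch below is a self-contained round trip in a temporary folder with a tiny stand-in CSV; the file contents are made up.

```python
import os
import tarfile
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Build a stand-in for the final/ folder with one small CSV
    src = os.path.join(tmp, "final")
    os.makedirs(src)
    with open(os.path.join(src, "train_final.csv"), "w", newline="") as f:
        f.write("a,b\n1,2\n")

    # Equivalent of: tar -czvf final_data.tar.gz final/
    archive = os.path.join(tmp, "final_data.tar.gz")
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(src, arcname="final")

    # Read the archive back: list members and read one file without extracting
    with tarfile.open(archive, "r:gz") as tar:
        members = sorted(tar.getnames())
        extracted_text = tar.extractfile("final/train_final.csv").read().decode()

print(members)
print(extracted_text)
```

Teammates who clone the repo can run `tar -xzvf final_data.tar.gz` or use `tarfile.open(...).extractall()` to recover the three CSVs.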


In [122]:
!mv /content/ds_207_final_project/data/final_data.tar.gz \
     /content/ds_207_final_project/data/final/

In [125]:
%cd /content/ds_207_final_project/

/content/ds_207_final_project


In [126]:
!git status

On branch main
Your branch is ahead of 'origin/main' by 1 commit.
  (use "git push" to publish your local commits)

Untracked files:
  (use "git add <file>..." to include in what will be committed)
	data/final/final_data.tar.gz
	data/processed/ds_207_final_project/

nothing added to commit but untracked files present (use "git add" to track)


In [153]:
%cd /content/ds_207_final_project/data

/content/ds_207_final_project/data


In [155]:
!git add final/final_data.tar.gz

In [157]:
!git commit -m "Adding final processed train, val, test CSVs as a compressed tar file"

[main 33c3e8d] Adding final processed train, val, test CSVs as a compressed tar file
 1 file changed, 0 insertions(+), 0 deletions(-)
 create mode 100644 data/final/final_data.tar.gz


In [165]:
!git status

On branch main
Your branch is ahead of 'origin/main' by 1 commit.
  (use "git push" to publish your local commits)

Untracked files:
  (use "git add <file>..." to include in what will be committed)
	final/test_final.csv
	final/train_final.csv
	final/val_final.csv
	processed/ds_207_final_project/

nothing added to commit but untracked files present (use "git add" to track)


In [166]:
!git push

Enumerating objects: 7, done.
Counting objects: 100% (7/7), done.
Delta compression using up to 2 threads
Compressing objects: 100% (4/4), done.
Writing objects: 100% (5/5), 5.04 MiB | 8.00 MiB/s, done.
Total 5 (delta 2), reused 0 (delta 0), pack-reused 0
remote: Resolving deltas: 100% (2/2), completed with 2 local objects.
To https://github.com/christinesako-berk/ds_207_final_project.git
   deaa594..d0c6588  main -> main
