<h1><center>ASDS 5303 Final Project Assignment #2 Dataset 3: TBI-NDSC Data Preparation </center></h1>

## Group Members:
### Henry Berrios #1001392315
### LeMaur Kydd #1001767382

# **A. Introduction & Dataset Overview**

## <ins>Dataset Description:</ins>
The Traumatic Brain Injury National Data and Statistical Center (TBI-NDSC) dataset contains extensive information on individuals who have sustained traumatic brain injuries (TBIs). It includes a wide range of demographic, injury-related, rehabilitation, and outcome variables. These variables help clinicians track recovery and functional outcomes.

For this project, dataset 3 focuses on the TBIMS Form 1 dataset, which provides admission/intake data including:
- Demographics
- Pre-injury conditions
- Acute hospital information
- Cognitive and functional scores
- Glasgow Coma Scale (GCS) Scores
- FIM Cognitive & Motor Scores
- Insurance and employment details

Because the data collection forms have changed over time, many columns contain a high percentage of missing values. We therefore keep only the most relevant columns.

## <ins>Defining the ML Problem</ins>
- Supervised Learning Task: Regression
- Goal: Predict Disability Rating Scale (DRS) at discharge (DRSd) using only admission data.
- Potential Use: Early prognosis estimation to assist clinicians in treatment planning and rehabilitation resource allocation.
- Target variable: Disability Rating Scale at Discharge (DRSd) (continuous variable)

# **B. Data Loading & Cleaning**

In this section, we reuse code from the 1st assignment and revise the parts that needed changes after review.

### Importing Libraries

In [None]:
# Import Libraries (same as 1st assignment)
import pandas as pd
import numpy as np
from scipy.io import wavfile

import torch
import torch.nn as nn
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split

from google.colab import drive

### Load Dataset 3 (from 1st assignment)

In [None]:
# loading the dataset from Google Drive (1st assignment)
file_path1 = '/content/TBIMSForm1_Public_20240405.csv' # path to the dataset that has admission/intake information

# initial intake/admission dataset
tbi1 = pd.read_csv(file_path1, low_memory = False) # read csv file containing the admission/intake information into a dataframe

### Preprocessing for TBI1 (from 1st assignment)

In [None]:
# keeping only the important columns (code taken from a local copy because Colab had trouble saving the earlier version) (1st assignment)
tbi1_adm = tbi1[['SexF', 'Race', 'EthnicityF', 'Mar', 'EduYears', 'Emp1', 'PreconImpair', 'PreconPhys', 'PrelimLearn', 'PrelimDress', 'PrelimOuthm', 'PrelimWork', 'SCI', 'Cause', 'CTComp',
                 'CTIntracrain', 'CTPunctate', 'CTSubarachnioid', 'CTIntraventricular', 'CTFrag', 'GCSEye', 'GCSVer', 'GCSMot', 'PTAMethod', 'DRSEyeA', 'DRSVerA', 'DRSMotA', 'DRSFeedA',
                 'DRSToiletA', 'DRSGroomA', 'DRSFuncA', 'DRSEmpA', 'FIMCompA', 'FIMExpressA', 'FIMSocialA', 'FIMProbSlvA', 'FIMMemA', 'Craniotomy', 'AcutePay1', 'RehabPay1', 'Drugs', 'DAYStoACUTEadm',
                 'DAYStoACUTEdc', 'DAYStoREHABadm', 'DAYStoREHABdc', 'DRSa', 'DRSd', 'FIMMOTA', 'FIMCOGA', 'FIMTOTA', 'LOSACUTE', 'LOSRehab' ]].copy() # keeping important demographics and rehab info; .copy() avoids SettingWithCopyWarning when columns are reassigned later

### Removing Missing Values and Changing Data Types (from 1st assignment, as well as improvement)

In [None]:
# observing the data types of the dataset (1st assignment)
tbi1_adm.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19185 entries, 0 to 19184
Data columns (total 52 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   SexF                19185 non-null  object 
 1   Race                19185 non-null  object 
 2   EthnicityF          19185 non-null  object 
 3   Mar                 19185 non-null  object 
 4   EduYears            19185 non-null  object 
 5   Emp1                19185 non-null  object 
 6   PreconImpair        19185 non-null  object 
 7   PreconPhys          19185 non-null  object 
 8   PrelimLearn         19185 non-null  object 
 9   PrelimDress         19185 non-null  object 
 10  PrelimOuthm         19185 non-null  object 
 11  PrelimWork          19185 non-null  object 
 12  SCI                 19185 non-null  object 
 13  Cause               19185 non-null  object 
 14  CTComp              19185 non-null  object 
 15  CTIntracrain        19185 non-null  object 
 16  CTPu

In [None]:
# converting all columns to numeric before removing NaN values; entries that cannot be parsed become NaN (1st assignment)
tbi1_adm = tbi1_adm.apply(pd.to_numeric, errors = 'coerce')
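
As a minimal sketch of what `errors = 'coerce'` does, the toy frame below uses made-up values (not taken from the TBIMS data) that mix numeric strings with text entries. Anything that cannot be parsed becomes NaN, while numeric placeholder codes such as 999 stay in place as ordinary numbers, which is why they have to be handled separately.

```python
import pandas as pd

# toy frame with made-up values; the column names are borrowed only for illustration
toy = pd.DataFrame({'EduYears': ['12', '16', 'n/a', '999'],
                    'DRSd': ['4', 'unknown', '7', '0']})

# each column is parsed independently; strings that are not numbers become NaN
toy_num = toy.apply(pd.to_numeric, errors = 'coerce')

print(toy_num.dtypes)        # both columns are float64 after the conversion
print(toy_num.isna().sum())  # one NaN per column
print(toy_num)               # 999 is still present as a number
```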

In [None]:
# checking the range of columns (new)
tbi1_adm.describe().T

| Column | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| SexF | 19174.0 | 1.741629 | 0.829112 | 1.0 | 1.0 | 2.0 | 2.0 | 99.0 |
| Race | 19174.0 | 1.879159 | 3.408344 | 1.0 | 1.0 | 1.0 | 2.0 | 99.0 |
| EthnicityF | 14627.0 | 0.968415 | 8.086862 | 0.0 | 0.0 | 0.0 | 0.0 | 99.0 |
| Mar | 19167.0 | 2.052225 | 3.93382 | 1.0 | 1.0 | 2.0 | 2.0 | 99.0 |
| EduYears | 19101.0 | 89.187948 | 216.312888 | 1.0 | 11.0 | 12.0 | 16.0 | 999.0 |
| Emp1 | 19162.0 | 100.675712 | 232.670027 | 2.0 | 5.0 | 5.0 | 11.0 | 999.0 |
| PreconImpair | 18229.0 | 20.140929 | 30.575024 | 0.0 | 0.0 | 0.0 | 66.0 | 99.0 |
| PreconPhys | 19143.0 | 19.291699 | 30.220322 | 0.0 | 0.0 | 0.0 | 66.0 | 99.0 |
| PrelimLearn | 19141.0 | 19.301447 | 30.221126 | 0.0 | 0.0 | 0.0 | 66.0 | 99.0 |
| PrelimDress | 19141.0 | 19.215036 | 30.224445 | 0.0 | 0.0 | 0.0 | 66.0 | 99.0 |


Using the data dictionary, we listed the placeholder codes for each column individually. A code is then treated as missing only in the columns where it is actually a placeholder, so rows with valid data are not removed.

In [None]:
# placeholder codes are listed per column: codes such as 66, 77, 88 mark missing data in some columns but are actual values in others (such as DAYStoACUTEadm) (new)
placeholder_values = {
    "PreconImpair": [66, 77, 88, 99, 888, 999], "PreconPhys": [66, 77, 88, 99, 888, 999], "PrelimLearn": [66, 77, 88, 99, 888, 999], "PrelimDress": [66, 77, 88, 99, 888, 999],
    "PrelimOuthm": [66, 77, 88, 99, 888, 999], "PrelimWork": [66, 77, 88, 99, 888, 999], 'SCI': [66, 77, 88, 99, 888, 999], 'CTComp': [66, 77, 88, 99, 888, 999],
    'CTIntracrain': [66, 77, 88, 99, 888, 999], 'CTPunctate': [66, 77, 88, 99, 888, 999], 'CTSubarachnioid': [66, 77, 88, 99, 888, 999], 'CTIntraventricular': [66, 77, 88, 99, 888, 999],
    'CTFrag': [66, 77, 88, 99, 888, 999], 'GCSMot': [66, 77, 88, 99, 888, 999], 'GCSEye': [66, 77, 88, 99, 888, 999], 'GCSVer': [66, 77, 88, 99, 888, 999],
    'PTAMethod': [66, 77, 88, 99, 888, 999], 'AcutePay1': [55, 66, 77, 88, 99, 888, 999], 'RehabPay1': [55, 66, 77, 88, 99, 888, 999], 'Drugs': [66, 77, 88, 99, 888, 999],
    'EduYears': [66, 77, 88, 99, 888, 999], 'Emp1': [66, 77, 88, 99, 777, 888, 999], 'Cause': [66, 77, 88, 99, 888, 999], 'DRSEyeA': [66, 77, 88, 99, 888, 999],
    'DRSVerA': [66, 77, 88, 99, 888, 999], 'DRSMotA': [66, 77, 88, 99, 888, 999], 'DRSFeedA': [66, 77, 88, 99, 888, 999], 'DRSToiletA': [66, 77, 88, 99, 888, 999],
    'DRSGroomA': [66, 77, 88, 99, 888, 999], 'DRSFuncA': [66, 77, 88, 99, 888, 999], 'DRSEmpA': [66, 77, 88, 99, 888, 999], 'FIMCompA': [66, 77, 88, 99, 888, 999],
    'FIMExpressA': [66, 77, 88, 99, 888, 999],'FIMSocialA': [66, 77, 88, 99, 888, 999], 'FIMProbSlvA': [66, 77, 88, 99, 888, 999], 'FIMMemA': [66, 77, 88, 99, 888, 999],
    'Craniotomy': [66, 77, 88, 99, 888, 999], 'SexF': [66, 77, 88, 99, 888, 999], 'Race': [66, 77, 88, 99, 888, 999], 'EthnicityF': [66, 77, 88, 99, 888, 999],
    'Mar': [66, 77, 88, 99, 888, 999],
    'DAYStoACUTEadm': [9999], 'DAYStoACUTEdc': [9999], 'DAYStoREHABadm': [9999], 'DAYStoREHABdc': [9999],
    'DRSa': [999], 'DRSd': [999], 'FIMMOTA': [999], 'FIMCOGA': [999], 'FIMTOTA': [9999], 'LOSACUTE': [-25, -5], 'LOSRehab': [-10]
}

In [None]:
# replacing placeholders with NaN only in the specified columns (new)
for col, values in placeholder_values.items():
  if col in tbi1_adm.columns:
    tbi1_adm[col] = tbi1_adm[col].replace(values, np.nan)
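
The reason for replacing placeholders column by column can be illustrated with a small sketch. The values and the shortened dictionary below are made up for illustration: 99 is assumed to be a placeholder code in `Race` but a real number of days in `DAYStoACUTEdc`, so a global replacement of 99 would wrongly discard a valid row.

```python
import numpy as np
import pandas as pd

# made-up values: 99 is a placeholder in 'Race' but a real count in 'DAYStoACUTEdc'
toy = pd.DataFrame({'Race': [1.0, 2.0, 99.0],
                    'DAYStoACUTEdc': [99.0, 14.0, 9999.0]})

# hypothetical, shortened version of the placeholder dictionary
toy_placeholders = {'Race': [66, 77, 88, 99, 888, 999],
                    'DAYStoACUTEdc': [9999]}

# replace each column's own placeholder codes with NaN
for col, values in toy_placeholders.items():
    toy[col] = toy[col].replace(values, np.nan)

print(toy)  # the 99 in 'DAYStoACUTEdc' survives, the 99 in 'Race' becomes NaN
```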

In [None]:
# removing all NaN values (1st assignment)
tbi1_adm = tbi1_adm.dropna()

In [None]:
# checking the dataset a final time (1st assignment)
tbi1_adm.info()

<class 'pandas.core.frame.DataFrame'>
Index: 9188 entries, 3 to 18243
Data columns (total 52 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   SexF                9188 non-null   float64
 1   Race                9188 non-null   float64
 2   EthnicityF          9188 non-null   float64
 3   Mar                 9188 non-null   float64
 4   EduYears            9188 non-null   float64
 5   Emp1                9188 non-null   float64
 6   PreconImpair        9188 non-null   float64
 7   PreconPhys          9188 non-null   float64
 8   PrelimLearn         9188 non-null   float64
 9   PrelimDress         9188 non-null   float64
 10  PrelimOuthm         9188 non-null   float64
 11  PrelimWork          9188 non-null   float64
 12  SCI                 9188 non-null   float64
 13  Cause               9188 non-null   float64
 14  CTComp              9188 non-null   float64
 15  CTIntracrain        9188 non-null   float64
 16  CTPunctate

In [None]:
# checking the ranges one more time (new)
tbi1_adm.describe().T

| Column | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| SexF | 9188.0 | 1.738354 | 0.439555 | 1.0 | 1.0 | 2.0 | 2.0 | 2.0 |
| Race | 9188.0 | 1.796147 | 1.447728 | 1.0 | 1.0 | 1.0 | 2.0 | 7.0 |
| EthnicityF | 9188.0 | 0.142904 | 0.349994 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| Mar | 9188.0 | 1.874619 | 1.075447 | 1.0 | 1.0 | 2.0 | 2.0 | 7.0 |
| EduYears | 9188.0 | 12.802133 | 2.97243 | 1.0 | 11.0 | 12.0 | 15.0 | 21.0 |
| Emp1 | 9188.0 | 6.627775 | 3.821234 | 2.0 | 5.0 | 5.0 | 9.0 | 55.0 |
| PreconImpair | 9188.0 | 0.064541 | 0.245727 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| PreconPhys | 9188.0 | 0.082281 | 0.274808 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| PrelimLearn | 9188.0 | 0.092403 | 0.28961 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| PrelimDress | 9188.0 | 0.018067 | 0.133201 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |


### Handling Outliers

In [None]:
# defining continuous variables for outlier removal
cont_vars = ['EduYears', 'DAYStoACUTEadm', 'DAYStoACUTEdc', 'DAYStoREHABadm', 'DAYStoREHABdc',
              'DRSa', 'FIMMOTA', 'FIMCOGA', 'FIMTOTA', 'LOSACUTE', 'LOSRehab']

In [None]:
# using IQR for continuous variables
Q1 = tbi1_adm[cont_vars].quantile(0.25)
Q3 = tbi1_adm[cont_vars].quantile(0.75)
IQR = Q3 - Q1

# defining upper and lower bounds
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

In [None]:
# applying the outlier removal to continuous variables (a row is dropped if any continuous column falls outside its bounds)
tbi1_adm_filtered = tbi1_adm[~((tbi1_adm[cont_vars] < lower_bound) | (tbi1_adm[cont_vars] > upper_bound)).any(axis = 1)]

# resetting index
tbi1_adm_filtered = tbi1_adm_filtered.reset_index(drop = True)
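
To make the IQR rule concrete, here is a small worked sketch on made-up numbers (two columns, seven rows). For `LOSRehab` the quartiles are 13 and 19, so the bounds are 13 - 1.5 * 6 = 4 and 19 + 1.5 * 6 = 28. For `EduYears` the quartiles are 12 and 16, giving bounds of 6 and 22. A row is removed as soon as any one of its values falls outside the bounds for that column.

```python
import pandas as pd

# made-up values: 120 (LOSRehab) and 40 (EduYears) are the extreme entries
toy = pd.DataFrame({'LOSRehab': [10, 12, 14, 16, 18, 20, 120],
                    'EduYears': [12, 14, 16, 12, 40, 16, 12]})
cols = ['LOSRehab', 'EduYears']

# per-column quartiles and IQR bounds
q1 = toy[cols].quantile(0.25)
q3 = toy[cols].quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# flag values outside the bounds, then drop rows with at least one flag
outside = (toy[cols] < lower) | (toy[cols] > upper)
kept = toy[~outside.any(axis = 1)].reset_index(drop = True)

print(lower.to_dict(), upper.to_dict())
print(len(kept))  # 5 of the 7 rows remain
```

Because the filter works row-wise, every additional column in the list can only remove more rows, which is worth keeping in mind when the list of continuous variables is long.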

### Label Encoding, One-Hot Encoding

In [None]:
# list of nominal columns for one-hot encoding that are not already binary
nom_cols = ['SexF','Race','Mar', 'Emp1', 'Cause', 'PTAMethod', 'Craniotomy', 'AcutePay1', 'RehabPay1']

# list of ordinal columns for label encoding
ord_cols = ['CTComp', 'GCSEye', 'GCSVer', 'GCSMot', 'DRSEyeA', 'DRSVerA', 'DRSMotA', 'DRSFeedA', 'DRSToiletA', 'DRSGroomA', 'DRSFuncA', 'DRSEmpA', 'FIMCompA', 'FIMExpressA', 'FIMSocialA', 'FIMProbSlvA', 'FIMMemA']

In [None]:
# one-hot encoding for nominal columns
tbi1_adm_encoded = pd.get_dummies(tbi1_adm_filtered, columns = nom_cols, drop_first = True)

# label encoding for ordinal columns
le = LabelEncoder()
for col in ord_cols:
  tbi1_adm_encoded[col] = le.fit_transform(tbi1_adm_encoded[col])
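
The effect of the two encodings can be seen on a toy frame with made-up codes. In this sketch `Cause` stands in for a nominal column and `GCSEye` for an ordinal one. One-hot encoding with `drop_first = True` replaces the nominal column with one indicator per category except the first. `LabelEncoder` maps the sorted unique values of the ordinal column to consecutive integers, which preserves the order of the codes but not the spacing between them.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# made-up codes: 'Cause' is treated as nominal, 'GCSEye' as ordinal
toy = pd.DataFrame({'Cause': [1.0, 3.0, 5.0, 3.0],
                    'GCSEye': [1.0, 4.0, 2.0, 4.0]})

# one indicator column per category of 'Cause', with the first category dropped
toy_encoded = pd.get_dummies(toy, columns = ['Cause'], drop_first = True)

# sorted unique values 1, 2, 4 are mapped to 0, 1, 2
toy_encoded['GCSEye'] = LabelEncoder().fit_transform(toy_encoded['GCSEye'])

print(toy_encoded.columns.tolist())    # ['GCSEye', 'Cause_3.0', 'Cause_5.0']
print(toy_encoded['GCSEye'].tolist())  # [0, 2, 1, 2]
```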

# **C. Convert Dataset into Tensor Format**

In [None]:
# isolating X and y variables
X = tbi1_adm_encoded.drop(columns = ['DRSd'])
y = tbi1_adm_encoded['DRSd']

In [None]:
# splitting into train and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 1)

Before the data is converted to tensors, the features need to be standardized. We use StandardScaler from sklearn.preprocessing, which rescales each feature to zero mean and unit variance. The scaler is fit on the training set only and then applied to both sets, so that no information from the test set enters the preprocessing. Many machine learning models, including deep learning networks, perform better on standardized features (Scikit-learn Developers, 2023). Additionally, PyTorch models often converge faster and more stably when the variables are standardized (Raschka, Liu & Mirjalili, 2022).

In [None]:
# initializing the scaler
scaler = StandardScaler()

# fit on training data, and transform the train and test
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
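
A minimal sketch with made-up arrays shows what fitting on the training data only means in practice. The training columns end up with mean 0 and standard deviation 1, while the test row is rescaled with the training statistics and can therefore fall outside that range.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# made-up feature matrices: three training rows and one test row
X_tr = np.array([[10.0, 1.0], [20.0, 3.0], [30.0, 5.0]])
X_te = np.array([[40.0, 3.0]])

toy_scaler = StandardScaler()
X_tr_s = toy_scaler.fit_transform(X_tr)  # learns mean and std from the training rows
X_te_s = toy_scaler.transform(X_te)      # reuses the training mean and std

print(X_tr_s.mean(axis = 0), X_tr_s.std(axis = 0))  # approximately [0, 0] and [1, 1]
print(X_te_s)  # (40 - 20) / 8.165 is about 2.449, and (3 - 3) / 1.633 = 0
```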

Once the data is standardized, we convert it into PyTorch tensors. Tensors are the core data structure of PyTorch. They are similar to NumPy arrays, but are optimized for deep learning computations (Paszke et al., 2019). PyTorch tensors can also be moved to GPUs through CUDA, which significantly speeds up training (NVIDIA, 2020). Since PyTorch neural networks take tensors as inputs, we convert both the features and the target.

In our code, we use *.values* on *y_train* and *y_test* because they are Pandas Series; *.values* returns the underlying NumPy arrays, which *torch.tensor* converts directly (Scikit-learn Developers, 2023). We then use *.view(-1, 1)* so that the target variable y is a column vector of shape (N, 1) rather than a 1D array of shape (N,) (PyTorch Community, 2024). This prevents shape mismatches between predictions and targets during training. Finally, *dtype = torch.float32* is set explicitly. *float32* reduces memory usage while still maintaining sufficient precision (NVIDIA, 2020). Without this specification, the tensors would inherit *float64* from the NumPy arrays, which uses more memory, can slow down computations, and does not match the *float32* default of PyTorch layer weights (PyTorch Developers, 2023).

In [None]:
# convert train and test sets to tensors
# code adapted with assistance from OpenAI's ChatGPT (OpenAI, 2025) and from the PyTorch documentation (Paszke et al., 2019)
X_train_tensor = torch.tensor(X_train_scaled, dtype = torch.float32)
X_test_tensor = torch.tensor(X_test_scaled, dtype = torch.float32)
y_train_tensor = torch.tensor(y_train.values, dtype = torch.float32).view(-1, 1)
y_test_tensor = torch.tensor(y_test.values, dtype = torch.float32).view(-1,1)
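
The dtype and shape choices can be checked on a toy target with made-up values. The sketch below shows that a tensor built from a NumPy float64 array stays float64 unless a dtype is given, and that subtracting a flat (N,) target from an (N, 1) prediction silently broadcasts to (N, N), which is the mismatch that the column-vector shape avoids.

```python
import pandas as pd
import torch

# made-up target values
y_toy = pd.Series([3.0, 7.0, 0.0])

t_default = torch.tensor(y_toy.values)                    # dtype inferred from NumPy
t_flat = torch.tensor(y_toy.values, dtype = torch.float32)
t_col = t_flat.view(-1, 1)                                # reshaped to a column vector

print(t_default.dtype)            # torch.float64
print(t_flat.shape, t_col.shape)  # torch.Size([3]) torch.Size([3, 1])

# a model with one output unit returns predictions of shape (N, 1)
pred = torch.zeros(3, 1)
print((pred - t_flat).shape)  # torch.Size([3, 3]), broadcast by mistake
print((pred - t_col).shape)   # torch.Size([3, 1]), element-wise as intended
```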

# **D. Save Processed Data**

In [None]:
# mounting my drive
drive.mount('/content/drive')

# save to drive
torch.save(X_train_tensor, '/content/drive/MyDrive/X_train_tensor_d3.pt')
torch.save(X_test_tensor, '/content/drive/MyDrive/X_test_tensor_d3.pt')
torch.save(y_train_tensor, '/content/drive/MyDrive/y_train_tensor_d3.pt')
torch.save(y_test_tensor, '/content/drive/MyDrive/y_test_tensor_d3.pt')

Mounted at /content/drive
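
As a quick sanity check, a saved tensor can be read back with `torch.load` and compared with the original. The sketch below uses a made-up tensor and a temporary directory instead of Google Drive, so the path and file name are placeholders rather than the ones used above.

```python
import os
import tempfile
import torch

# made-up tensor standing in for one of the processed datasets
toy_tensor = torch.arange(6, dtype = torch.float32).view(3, 2)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_tensor.pt')  # placeholder file name
    torch.save(toy_tensor, path)
    restored = torch.load(path)

# values, dtype, and shape all survive the round trip
print(torch.equal(toy_tensor, restored), restored.dtype, restored.shape)
```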


# **References**
- NVIDIA. (2020). CUDA Programming Guide. Retrieved from https://developer.nvidia.com/cuda-toolkit.
- OpenAI. (2025). Response generated by ChatGPT [Large language model]. OpenAI. Retrieved from https://chat.openai.com
- Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., et al. (2019). PyTorch: An imperative style, high-performance deep learning library. Retrieved from https://proceedings.neurips.cc/paper_files/paper/2019/file/bdbca288fee7f92f2bfa9f7012727740-Paper.pdf.
- PyTorch Community. (2024). torch.Tensor.view. Retrieved from https://pytorch.org/docs/stable/generated/torch.Tensor.view.html/.
- PyTorch Developers. (2023). Data types in PyTorch. Retrieved from https://pytorch.org/docs/stable/tensors.html#torch-tensor.
- Raschka, S., Liu, Y., & Mirjalili, V. (2022). Machine Learning with PyTorch and Scikit-Learn. Packt Publishing.
- Scikit-learn Developers. (2023). Preprocessing data: StandardScaler. Retrieved from https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html.