# Most important variables
The purpose of this notebook is to create a new dataset that contains only the most important variables. This gives us a smaller set of inputs for a more condensed model that can make predictions on the spot.

## Data Loading
First we load all the data. Most of this code is taken from tqip_exploration.ipynb.

In [1]:
import pandas as pd
import numpy as np
import sklearn
import os

In [2]:
# directories for csv files
csv_dir = 'C:/Users/micha/OneDrive - UT Health San Antonio/UTHSCSA/Trauma/TransfusionPrediction/data/PUF AY 2022/CSV'
os.listdir(csv_dir)

['PUF Variable Formats.csv',
 'PUF_AIS15TO05_CROSSWALK.csv',
 'PUF_AISDIAGNOSIS.csv',
 'PUF_AISDIAGNOSIS_LOOKUP.csv',
 'PUF_Ecode_Lookup.csv',
 'PUF_HOSPITALEVENTS.csv',
 'PUF_ICDDIAGNOSIS.csv',
 'PUF_ICDDIAGNOSIS_LOOKUP.csv',
 'PUF_ICDPROCEDURE.csv',
 'PUF_ICDPROCEDURE_LOOKUP.csv',
 'PUF_PREEXISTINGCONDITIONS.csv',
 'PUF_TRAUMA.csv',
 'PUF_TRAUMA_LOOKUP.csv',
 'TQP_INCLUSION.csv']

In [3]:
# create a pandas dataframe for each file
directory = csv_dir
# Loop through each file in the directory
for filename in os.listdir(directory):
    if filename.endswith(".csv"):
        # Remove the first 4 characters, replace spaces with underscores, and create a dataframe name
        df_name = filename[4:].lower().replace('.csv', '').replace(' ', '_')
        
        # Try different encodings and handle spaces
        try:
            df = pd.read_csv(os.path.join(directory, filename), encoding='utf-8', skipinitialspace=True)
        except UnicodeDecodeError:
            df = pd.read_csv(os.path.join(directory, filename), encoding='latin1', skipinitialspace=True)
        
        # Assign the dataframe to a variable with the processed name
        globals()[df_name] = df

        print(f"Created dataframe: {df_name}")

Created dataframe: variable_formats
Created dataframe: ais15to05_crosswalk
Created dataframe: aisdiagnosis
Created dataframe: aisdiagnosis_lookup
Created dataframe: ecode_lookup
Created dataframe: hospitalevents
Created dataframe: icddiagnosis
Created dataframe: icddiagnosis_lookup
Created dataframe: icdprocedure
Created dataframe: icdprocedure_lookup
Created dataframe: preexistingconditions
Created dataframe: trauma
Created dataframe: trauma_lookup
Created dataframe: inclusion
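
The loading loop above writes each dataframe into `globals()`, which makes it hard to see which tables exist. A minimal sketch of an alternative is shown here, assuming the same naming rule; the helper name `load_csv_dir`, the demo file contents, and the use of `low_memory=False` (so pandas reads each file in one pass and infers one dtype per column) are assumptions for illustration, not part of the original notebook.

```python
import os
import tempfile

import pandas as pd


def load_csv_dir(directory):
    """Read every CSV in `directory` into a dict keyed by a cleaned-up name."""
    frames = {}
    for filename in sorted(os.listdir(directory)):
        if not filename.lower().endswith(".csv"):
            continue
        # Same naming rule as the loop above: drop the 4-character prefix,
        # lowercase, strip the extension, and replace spaces with underscores
        name = filename[4:].lower().replace(".csv", "").replace(" ", "_")
        path = os.path.join(directory, filename)
        try:
            frames[name] = pd.read_csv(
                path, encoding="utf-8", skipinitialspace=True, low_memory=False
            )
        except UnicodeDecodeError:
            frames[name] = pd.read_csv(
                path, encoding="latin1", skipinitialspace=True, low_memory=False
            )
    return frames


# Tiny demonstration on a throwaway directory (file contents are made up)
with tempfile.TemporaryDirectory() as tmp:
    pd.DataFrame({"inc_key": [1, 2]}).to_csv(
        os.path.join(tmp, "PUF_TRAUMA.csv"), index=False
    )
    pd.DataFrame({"a": [3]}).to_csv(
        os.path.join(tmp, "PUF Variable Formats.csv"), index=False
    )
    demo_frames = load_csv_dir(tmp)

print(sorted(demo_frames))
```

With a dict, a table is accessed as `demo_frames["trauma"]` rather than through a dynamically created global variable.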


We need a response variable, so we will create the variable transfusion. It records Yes or No for whether the patient received a transfusion of any kind.

In [4]:
# transfusion procedures: rows whose ICD procedure code starts with 302
# na=False treats missing codes as non-matches
transfusions = icdprocedure[icdprocedure['ICDPROCEDURECODE'].str.startswith('302', na=False)]

print(transfusions.shape)
# 200411 transfusion procedure rows in this dataset

(200411, 6)


Now creating a new dataframe with transfusions as the response:

In [11]:
# Create a new DataFrame with the dummy variable
trauma_transfusions = trauma.copy()
trauma_transfusions['transfusion'] = trauma['inc_key'].isin(transfusions['Inc_Key'])

# Display the new DataFrame
print(f'shape of df: {trauma_transfusions.shape}')
print(f"total number of transfusions in dataset: {trauma_transfusions['transfusion'].sum()}")


shape of df: (1232956, 232)
total number of transfusions in dataset: 109819
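
The patient-level count (109819) is smaller than the number of transfusion procedure rows (200411). One likely reason is that a patient can have several transfusion procedures, and `isin` only asks whether a key appears at all. The toy example below illustrates this; all values are invented, and the column names simply mirror the ones used above.

```python
import pandas as pd

# Toy stand-ins for the real tables (values are invented for illustration)
toy_trauma = pd.DataFrame({"inc_key": [101, 102, 103, 104]})
toy_procs = pd.DataFrame({
    "Inc_Key": [101, 101, 103, 104],
    "ICDPROCEDURECODE": ["30233N1", "30233K1", "0DTJ4ZZ", None],
})

# Rows whose code starts with 302; missing codes count as non-matches
is_transfusion = toy_procs["ICDPROCEDURECODE"].str.startswith("302", na=False)
transfused_keys = toy_procs.loc[is_transfusion, "Inc_Key"].unique()

# One flag per patient, no matter how many procedure rows matched
toy_trauma["transfusion"] = toy_trauma["inc_key"].isin(transfused_keys)

print(f"procedure rows matched: {is_transfusion.sum()}")
print(f"patients flagged: {toy_trauma['transfusion'].sum()}")
```

Here patient 101 has two matching procedure rows but is flagged only once.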


Now we will decide which columns to keep. We will remove all but these:

- ISS
- AgeYears
- SBP
- PULSERATE
- TEMPERATURE
- RESPIRATORYRATE
- PULSEOXIMETRY
- HIGHESTACTIVATION
- TRANSPORTMODE
- transfusion

In [16]:
# List of columns to keep
cols_to_keep = [
    "ISS",
    "AgeYears",
    "SBP",
    "PULSERATE",
    "TEMPERATURE",
    "RESPIRATORYRATE",
    "PULSEOXIMETRY",
    "HIGHESTACTIVATION",
    "TRANSPORTMODE",
    "transfusion"
]

# Keep only the specified columns (and preserve order)
trauma_transfusions = trauma_transfusions[cols_to_keep].copy()

## Preprocessing
We will impute missing values using KNN. First, we will check how much is missing in each of the important columns:

In [18]:
# --- Missing-value summary ---
missing_per_col = trauma_transfusions[cols_to_keep].isna().sum()       # count per column of important variables
total_missing    = missing_per_col.sum()                  # grand total

print("Missing values per column:")
print(missing_per_col)
print(f"\nTotal missing values in dataframe: {total_missing}")

percent_missing = (missing_per_col / len(trauma_transfusions) * 100).round(2)
missing_summary = pd.DataFrame({
    "missing_count": missing_per_col,
    "missing_percent": percent_missing
})

print(missing_summary)


Missing values per column:
ISS                    3241
AgeYears              83038
SBP                   54617
PULSERATE             41717
TEMPERATURE          120495
RESPIRATORYRATE       56133
PULSEOXIMETRY         56310
HIGHESTACTIVATION     15937
TRANSPORTMODE          3919
transfusion               0
dtype: int64

Total missing values in dataframe: 435407
                   missing_count  missing_percent
ISS                         3241             0.26
AgeYears                   83038             6.73
SBP                        54617             4.43
PULSERATE                  41717             3.38
TEMPERATURE               120495             9.77
RESPIRATORYRATE            56133             4.55
PULSEOXIMETRY              56310             4.57
HIGHESTACTIVATION          15937             1.29
TRANSPORTMODE               3919             0.32
transfusion                    0             0.00


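TEMPERATURE has the most missing values at 9.77%, followed by AgeYears at 6.73%; every other column is under 5%. These rates are low enough that imputation is reasonable, so all of these variables can be used.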

We will now continue to impute using KNNImputer from sklearn. 

In [None]:
from sklearn.impute import KNNImputer
from sklearn.preprocessing import OrdinalEncoder
import pandas as pd
import numpy as np

# ------------------------------------------
# 1. Separate predictors and target
# ------------------------------------------
target_col = "transfusion"
X = trauma_transfusions.drop(columns=[target_col]).copy()
y = trauma_transfusions[target_col]

# ------------------------------------------
# 2. Identify column types
# ------------------------------------------
num_cols = [
    "ISS", "AgeYears", "SBP", "PULSERATE",
    "TEMPERATURE", "RESPIRATORYRATE", "PULSEOXIMETRY"
]
cat_cols = ["HIGHESTACTIVATION", "TRANSPORTMODE"]

# ------------------------------------------
# 3. Ordinal-encode categoricals (so KNNImputer can work)
#    • unknown_value = -1 only applies to categories unseen at fit time
#    • NaNs are passed through as NaN by default in recent scikit-learn
#      versions, so KNNImputer still treats them as missing
encoder = OrdinalEncoder(
    handle_unknown="use_encoded_value",
    unknown_value=-1
)
X[cat_cols] = encoder.fit_transform(X[cat_cols])

# ------------------------------------------
# 4. Run KNN imputation (k = 5, distance-weighted)
# ------------------------------------------
imputer = KNNImputer(n_neighbors=5, weights="distance")
X_imputed = pd.DataFrame(
    imputer.fit_transform(X),
    columns=X.columns,
    index=X.index
)

# ------------------------------------------
# 5. Cast categorical columns back to integers → original labels
#    (imputed codes are weighted averages, so rounding picks the nearest code)
# ------------------------------------------
X_imputed[cat_cols] = (
    X_imputed[cat_cols].round().astype(int)
)
X_imputed[cat_cols] = encoder.inverse_transform(
    X_imputed[cat_cols]
)

# ------------------------------------------
# 6. Reassemble the full dataframe
# ------------------------------------------
trauma_imputed = pd.concat([X_imputed, y], axis=1)

# optional sanity check
print(trauma_imputed.isna().sum())
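
`KNNImputer` computes distances from every row with missing values to the rows it was fitted on, which can be slow and memory-hungry on more than a million rows, and unscaled columns with large ranges can dominate the distance. The sketch below shows one possible way to handle both: standardise first, fit on a random subsample, then transform in chunks. It runs on synthetic data; the sample size, chunk size, and column values are arbitrary assumptions, not settings from this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic numeric data standing in for the vitals (not real TQIP values)
demo = pd.DataFrame({
    "SBP": rng.normal(120, 20, 500),
    "PULSERATE": rng.normal(85, 15, 500),
    "TEMPERATURE": rng.normal(36.8, 0.5, 500),
})
mask = rng.random(demo.shape) < 0.05      # knock out roughly 5% of the cells
demo = demo.mask(mask)

# Standardise so that no single column dominates the distance;
# StandardScaler ignores NaNs when fitting and keeps them when transforming
scaler = StandardScaler()
scaled = scaler.fit_transform(demo)

# Fit on a random subsample to bound the cost of the neighbour search
fit_idx = rng.choice(len(demo), size=200, replace=False)
imputer = KNNImputer(n_neighbors=5, weights="distance")
imputer.fit(scaled[fit_idx])

# Transform the full data in chunks to keep memory use predictable
chunk_size = 100
chunks = [
    imputer.transform(scaled[start:start + chunk_size])
    for start in range(0, len(scaled), chunk_size)
]
imputed_scaled = np.vstack(chunks)

# Undo the scaling so the values are back in their original units
demo_imputed = pd.DataFrame(
    scaler.inverse_transform(imputed_scaled),
    columns=demo.columns,
    index=demo.index,
)
print(demo_imputed.isna().sum())
```

Fitting on a subsample trades some imputation quality for speed, so the subsample size would need to be checked against the real data before relying on it.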


In [None]:
trauma_imputed.head()

With imputation done, we can save the result.

In [None]:
# --- Save the full imputed dataset ---
trauma_imputed.to_csv("trauma_most_important.csv", index=False)


We will also save a small sample CSV. This cell drops incomplete rows from the un-imputed data, so it can be run before the imputation step if a sample is needed.

In [23]:
# --- Save a 15-row random sample with NO missing values ---
clean_df = trauma_transfusions.dropna()                   # remove any rows that still contain NaNs
assert len(clean_df) >= 15, "Not enough complete cases to sample 15 rows."

sample_df = clean_df.sample(n=15, random_state=42)   # reproducible sample
sample_df.to_csv("trauma_most_important_sample.csv", index=False)
