# Data creation for benchmarking
To ensure a fair comparison across models, we generate a single train/test split that every model in the benchmark uses.

The input file is the NHAMCS 2022 dataset, formatted and preprocessed as described in section 4. Data (EDIT). Here the data is further prepared by mapping the triage level from its text label to the corresponding number (0-4). The target is the triage level, and the dataset is split into training and test sets in a 98:2 ratio.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split  # used for the train/test split below


In [None]:
#Importing preprocessed NHAMCS file
file_path = '/content/formatted_ed2022.csv'

df = pd.read_csv(file_path)

In [None]:
#Mapping triage from free text to numbers
triage_mapping = {
    "Immediate": 0,
    "Emergent": 1,
    "Urgent": 2,
    "Semi-urgent": 3,
    "Nonurgent": 4
}

df_mapped = df.copy()
df_mapped["IMMEDR"] = df_mapped["IMMEDR"].map(triage_mapping) #IMMEDR is the column containing triage level
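
`Series.map` with a dictionary returns `NaN` for any value that is not one of the dictionary's keys, so a label outside the five listed above would silently become a missing target. The following is a minimal sketch of how one might check for that, run on a toy frame rather than the real file; the `"No triage"` label is a hypothetical example and is not taken from the NHAMCS data.

```python
import pandas as pd

# Toy stand-in for the IMMEDR column; "No triage" is a hypothetical unmapped label.
toy = pd.DataFrame({"IMMEDR": ["Immediate", "Urgent", "No triage", "Nonurgent"]})

triage_mapping = {
    "Immediate": 0,
    "Emergent": 1,
    "Urgent": 2,
    "Semi-urgent": 3,
    "Nonurgent": 4,
}

mapped = toy["IMMEDR"].map(triage_mapping)

# Series.map yields NaN for every value that is not a key of the dict.
unmapped_labels = sorted(toy.loc[mapped.isna(), "IMMEDR"].unique())
n_unmapped = int(mapped.isna().sum())
print(f"{n_unmapped} unmapped row(s): {unmapped_labels}")
```

If the same check on `df` reports zero unmapped rows, every triage label was covered by `triage_mapping`.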


In [None]:
#splitting dataset into features and target variables
X = df_mapped.drop("IMMEDR", axis=1)
y = df_mapped["IMMEDR"]

#splitting data into training and test sets
X, X_test, y, y_test = train_test_split(X, y, test_size=0.02, random_state=42) #test size as 2% of total set
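
With a test set of only 2%, a rare triage level could end up under-represented or absent in the test set. The sketch below shows one possible way to compare class shares between the two targets; `class_shares` is a hypothetical helper written for this illustration, and the toy targets are made up.

```python
import pandas as pd

def class_shares(y_train: pd.Series, y_test: pd.Series) -> pd.DataFrame:
    """Return the share of each class in the train and test targets, side by side."""
    shares = pd.concat(
        [y_train.value_counts(normalize=True), y_test.value_counts(normalize=True)],
        axis=1,
        keys=["train", "test"],
    )
    # A class missing from one side appears as NaN after the join; report it as 0.
    return shares.fillna(0.0).sort_index()

# Toy targets (made-up values, not NHAMCS data)
y_train_toy = pd.Series([0, 1, 1, 2, 2, 2, 3, 3, 4, 4])
y_test_toy = pd.Series([2, 3])

shares = class_shares(y_train_toy, y_test_toy)
print(shares)
```

On the real split this would be called as `class_shares(y, y_test)`. If the shares diverge noticeably, `train_test_split` accepts a `stratify` argument that preserves class proportions; whether to use it here is a choice for the benchmark design.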


In [None]:
#saving test and train set to use for all models
y_test.to_csv('y_test_final.csv', index=False)
X_test.to_csv('X_test_final.csv', index=False)
X.to_csv('X_final.csv', index=False)
y.to_csv('y_final.csv', index=False)
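
Because the files are written with `index=False`, the original row index is dropped and a Series is saved as a single named column. The following sketch illustrates how a model notebook might load such files back; it uses a temporary directory and toy data, and the feature columns `AGE` and `ARREMS` are hypothetical placeholders.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy stand-ins for the saved sets (hypothetical feature columns).
X_toy = pd.DataFrame({"AGE": [34, 71], "ARREMS": [1, 0]})
y_toy = pd.Series([2, 0], name="IMMEDR")

with tempfile.TemporaryDirectory() as tmp:
    x_path = Path(tmp) / "X_toy.csv"
    y_path = Path(tmp) / "y_toy.csv"
    X_toy.to_csv(x_path, index=False)
    y_toy.to_csv(y_path, index=False)

    X_loaded = pd.read_csv(x_path)
    # read_csv returns a one-column DataFrame for the saved Series;
    # selecting the column by name turns it back into a Series.
    y_loaded = pd.read_csv(y_path)["IMMEDR"]

print(X_loaded.equals(X_toy), y_loaded.tolist())
```

After reloading, features and targets are aligned by row position (0 to n-1) rather than by the original DataFrame index.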