### Splitting data

In [8]:
import joblib
import pandas as pd
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split

In [9]:
df = pd.read_csv('../data/Encoded_Cleaned_Top15Diseases.csv')

In [11]:
# Encode Disease column into numerical labels
le = LabelEncoder()
df["DiseaseEncoded"] = le.fit_transform(df["Disease"])

In [12]:
# Define features (X) and target (y)
X = df.drop(columns=["Disease", "DiseaseEncoded"])
y = df["DiseaseEncoded"]

In [13]:
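# Standardize features to zero mean and unit variance
# (note: the scaler is fitted on the full dataset, before the split)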
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
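Fitting the scaler on the full dataset lets the test rows influence the mean and variance used for scaling. A common alternative is to split first and fit the scaler on the training portion only. The sketch below illustrates that ordering on synthetic data; `X_demo`, `y_demo`, and `demo_scaler` are hypothetical names that are not part of this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix and encoded labels (hypothetical data).
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(200, 4))
y_demo = np.repeat(np.arange(4), 50)

# Split first, so the test rows stay unseen while the scaler is fitted.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=42
)

# Fit on the training rows only, then apply the same transform to both parts.
demo_scaler = StandardScaler()
X_tr_scaled = demo_scaler.fit_transform(X_tr)
X_te_scaled = demo_scaler.transform(X_te)

print(X_tr_scaled.shape, X_te_scaled.shape)
```

With this ordering the training features have exactly zero mean and unit variance, while the test features are only approximately standardized, which mirrors what happens with unseen data at prediction time.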


In [14]:
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.3, stratify=y, random_state=42
)

1. train_test_split()
- Function from sklearn.model_selection that shuffles the data and splits it into training and testing sets.
2. Parameters Used:
- X_scaled, y → The standardized feature matrix (X_scaled) and the encoded target labels (y).
- test_size=0.3 → 30% of the dataset is held out for testing, and the remaining 70% is used for training.
- stratify=y → Preserves the class proportions of y in both the training and testing sets.
    - This matters when y has imbalanced classes (e.g., some diseases appear more often than others); without it, a rare class could end up under-represented in the test set.
- random_state=42 → Fixes the random seed for reproducibility, ensuring the same split occurs each time.
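
The effect of stratify=y can be checked by comparing class proportions before and after the split. The following is a minimal sketch on made-up, imbalanced labels; `X_demo`, `y_demo`, and `summary` are hypothetical names, and the 70/20/10 class mix is an assumption chosen for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 70% class 0, 20% class 1, 10% class 2.
y_demo = pd.Series([0] * 700 + [1] * 200 + [2] * 100)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

_, _, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=42
)

# Share of each class in the full data, the training set, and the test set.
summary = pd.DataFrame({
    "full": y_demo.value_counts(normalize=True),
    "train": y_tr.value_counts(normalize=True),
    "test": y_te.value_counts(normalize=True),
}).sort_index()
print(summary)
```

The three columns should agree closely. Running the same comparison on `y`, `y_train`, and `y_test` from this notebook would confirm the split of the disease labels.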

In [15]:
# Save the label encoder, the fitted scaler, and the train/test splits
joblib.dump(le, "label_encoder.pkl")
joblib.dump(scaler, "scaler.pkl")
joblib.dump(X_train, "X_train.pkl")
joblib.dump(X_test, "X_test.pkl")
joblib.dump(y_train, "y_train.pkl")
joblib.dump(y_test, "y_test.pkl")

print("Data successfully split and saved.")

Data successfully split and saved.
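
A later notebook can restore these artifacts with `joblib.load`. The sketch below shows the round trip for a label encoder, using a temporary directory and made-up disease names so that it does not overwrite the files saved above; `demo_le` and `restored_le` are hypothetical names.

```python
import tempfile
from pathlib import Path

import joblib
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels standing in for the Disease column.
demo_le = LabelEncoder()
encoded = demo_le.fit_transform(["Flu", "Asthma", "Flu", "Migraine"])

# Save and reload inside a temporary directory to avoid touching real files.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "label_encoder.pkl"
    joblib.dump(demo_le, path)
    restored_le = joblib.load(path)

# The restored encoder maps integer codes back to the original names.
decoded = restored_le.inverse_transform(encoded)
print(list(decoded))
```

Loading the saved scaler works the same way, and calling its `transform` method on new feature rows applies the scaling learned in this notebook.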
