In [6]:
import pandas as pd
from sklearn.model_selection import train_test_split

kidney_disease_data = pd.read_csv("kidney_disease.csv")
kidney_disease_data = kidney_disease_data.replace('?', pd.NA).dropna()

X = kidney_disease_data.drop(columns=["classification"])
y = kidney_disease_data["classification"]

X = pd.get_dummies(X)

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.3,
    random_state=42
)
print("Number of training X samples: ", X_train.shape[0])
print("Number of test X samples: ", X_test.shape[0])
print()
print("Number of training y samples: ", y_train.shape[0])
print("Number of test y samples: ", y_test.shape[0])

# In this kidney disease task, a True Positive (TP) means the patient truly has CKD and the model predicts CKD correctly.

# A True Negative (TN) means the patient truly does not have CKD and the model predicts notckd correctly.

# A False Positive (FP) means the model predicts CKD but the patient actually does not have CKD, which could cause unnecessary stress or follow-up tests.

# A False Negative (FN) means the model predicts notckd but the patient actually has CKD, which is dangerous because the disease case is missed.

# Accuracy alone may be misleading because it can look high even when the model performs poorly on the more important class, especially if the classes are imbalanced.

# If missing a kidney disease case is very serious, **recall (sensitivity) for CKD** is the most important metric because it measures how many real CKD cases we successfully detect.

# Maximizing recall helps minimize false negatives, which reduces the chance of failing to identify a patient who actually has CKD.


# We should not train and test a model on the same data. Because the model has already seen that data during training, it can memorize its patterns, so testing on it makes the results look artificially good and tells us nothing about how the model will perform on new, unseen cases.

# The testing set is held out and never used for training. It simulates new patient data and gives an unbiased estimate of how well the trained model generalizes in the real world.

Number of training X samples:  110
Number of test X samples:  48

Number of training y samples:  110
Number of test y samples:  48
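
The TP/TN/FP/FN definitions in the comments above can be made concrete with a small worked example. This is only a sketch on made-up labels, not results from the kidney disease data: the helper `confusion_counts`, the lists `y_true_demo` and `y_pred_demo`, and the label strings "ckd" and "notckd" are assumptions chosen for illustration.

```python
def confusion_counts(y_true, y_pred, positive="ckd"):
    """Count TP, TN, FP, FN, treating `positive` as the disease class."""
    tp = tn = fp = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == positive and predicted == positive:
            tp += 1  # patient has CKD, model predicts CKD
        elif actual != positive and predicted != positive:
            tn += 1  # patient is healthy, model predicts notckd
        elif actual != positive and predicted == positive:
            fp += 1  # false alarm
        else:
            fn += 1  # missed CKD case
    return tp, tn, fp, fn


# Hypothetical, imbalanced labels: 2 CKD patients and 8 healthy patients.
y_true_demo = ["ckd", "ckd"] + ["notckd"] * 8
# Hypothetical predictions: the model misses one of the two CKD patients.
y_pred_demo = ["ckd", "notckd"] + ["notckd"] * 8

tp, tn, fp, fn = confusion_counts(y_true_demo, y_pred_demo)

accuracy = (tp + tn) / (tp + tn + fp + fn)
# Recall (sensitivity) for CKD: share of real CKD cases that were detected.
recall = tp / (tp + fn) if (tp + fn) else 0.0

print("TP, TN, FP, FN:", tp, tn, fp, fn)
print("Accuracy:", accuracy)
print("Recall for CKD:", recall)
```

With these made-up labels, accuracy comes out high while recall for CKD is only one half, which is the situation the comments warn about: a model can look accurate overall and still miss a large share of the real disease cases.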
