Let's load up our data and do a little feature engineering. We have a bunch of Yes/No variables, and we want to make them booleans so we can treat them like numeric data.

In [1]:
import pandas as pd

DATA = '/kaggle/input/leukemia-cancer-risk-prediction-dataset/biased_leukemia_dataset.csv'
df = pd.read_csv(filepath_or_buffer=DATA, index_col=['Patient_ID'])
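YES_NO_COLUMNS = ['Genetic_Mutation', 'Family_History', 'Smoking_Status', 'Alcohol_Consumption', 'Radiation_Exposure', 'Infection_History', 'Chronic_Illness', 'Immune_Disorders']
# 'Yes' becomes True; any other value, including a missing one, becomes False.
for column in YES_NO_COLUMNS:
    df[column] = df[column] == 'Yes'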
df.head()

Patient_ID,Age,Gender,Country,WBC_Count,RBC_Count,Platelet_Count,Hemoglobin_Level,Bone_Marrow_Blasts,Genetic_Mutation,Family_History,...,Alcohol_Consumption,Radiation_Exposure,Infection_History,BMI,Chronic_Illness,Immune_Disorders,Ethnicity,Socioeconomic_Status,Urban_Rural,Leukemia_Status
1,52,Male,China,2698,5.36,262493,12.2,72,True,False,...,False,False,False,24.0,False,False,Ethnic_Group_B,Low,Rural,Negative
2,15,Female,China,4857,4.81,277877,11.9,97,True,False,...,False,False,False,28.7,False,False,Ethnic_Group_A,Low,Urban,Positive
3,72,Male,France,9614,5.17,319600,13.4,94,False,True,...,True,False,False,27.7,False,False,Ethnic_Group_B,Low,Urban,Negative
4,61,Male,Brazil,6278,5.41,215200,11.6,50,False,False,...,False,False,False,31.6,False,False,Ethnic_Group_A,Medium,Rural,Negative
5,21,Male,Brazil,8342,4.78,309169,14.3,28,False,False,...,False,False,False,22.3,False,False,Ethnic_Group_B,Low,Rural,Negative
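Comparing against 'Yes' quietly turns anything that is not exactly 'Yes' into False, including missing values and stray spellings such as 'yes'. Here is a minimal sketch of a sanity check we could run before converting. The toy frame and the helper name `unexpected_values` are made up for illustration and are not part of the dataset or of pandas.

```python
import pandas as pd

# Hypothetical toy frame standing in for the real data.
toy_df = pd.DataFrame({
    'Family_History': ['Yes', 'No', 'Yes', None],
    'Smoking_Status': ['No', 'No', 'yes', 'Yes'],
})

def unexpected_values(frame, columns, allowed=('Yes', 'No')):
    """Report, per column, values outside `allowed` and the number of missing entries."""
    report = {}
    for column in columns:
        extras = set(frame[column].dropna().unique()) - set(allowed)
        n_missing = int(frame[column].isna().sum())
        if extras or n_missing:
            report[column] = {'extras': extras, 'missing': n_missing}
    return report

report = unexpected_values(toy_df, ['Family_History', 'Smoking_Status'])
print(report)
```

An empty report would mean the simple `== 'Yes'` comparison loses nothing.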


We're going to visualize before we model, and we want to use as much numeric data as we can, so let's pick out the numeric columns.

In [2]:
RANDOM_STATE = 2025
TARGET = 'Leukemia_Status'
# Boolean and numeric feature columns, excluding the target.
NUMERIC_DTYPES = {'bool', 'float64', 'int64'}
COLUMNS = [column for column, dtype in df.dtypes.items() if str(dtype) in NUMERIC_DTYPES and column != TARGET]

Fortunately, leukemia is rare, even among people who are tested for it, so we expect our classes to be imbalanced. Let's have a look.

In [3]:
df['Leukemia_Status'].value_counts().to_dict()

{'Negative': 121797, 'Positive': 21397}
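To put those counts in perspective, here is a small sketch that works out the positive rate and the accuracy we would get by always predicting the most common class. It uses only the two counts printed above; the variable names are my own.

```python
# Class counts copied from the value_counts() output above.
counts = {'Negative': 121797, 'Positive': 21397}

total = sum(counts.values())
positive_rate = counts['Positive'] / total
# Accuracy of a "model" that always predicts the most common class.
majority_baseline = max(counts.values()) / total

print(f'positive rate: {positive_rate:.3f}')
print(f'majority-class baseline accuracy: {majority_baseline:.3f}')
```

The baseline comes out at roughly 85%, so plain accuracy on the full data has to clear that bar before it means anything.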

We have way too much data to plot it all, so let's take a sample.

In [4]:
sample_df = df.sample(n=5000, random_state=RANDOM_STATE)

Now let's use t-SNE to project our sample down to two dimensions so we can visualize it as a scatter plot.

In [5]:
from sklearn.manifold import TSNE

reducer = TSNE(random_state=RANDOM_STATE)
plot_df = pd.DataFrame(columns=['x', 'y'], data=reducer.fit_transform(X=sample_df[COLUMNS]))
plot_df[TARGET] = sample_df[TARGET].tolist()

In [6]:
from plotly import express
from plotly.offline import init_notebook_mode

init_notebook_mode(connected=True)
express.scatter(data_frame=plot_df, x='x', y='y', color=TARGET).show(renderer='iframe_connected')

What do we see? t-SNE finds some natural clusters, but they do not appear to correspond to our target variable, so we should keep our expectations modest regarding how well a model will predict it. Let's build a model and find out.
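One caveat about the projection: t-SNE works from distances between rows, so a column with large values such as Platelet_Count can swamp columns like Hemoglobin_Level or the booleans. Below is a minimal sketch of standardizing the features first. It runs on made-up data rather than on `sample_df`, and choosing `StandardScaler` is my assumption, not something the notebook above does.

```python
import numpy as np
import pandas as pd
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2025)
# Made-up stand-in for sample_df[COLUMNS]: columns on very different scales.
toy_df = pd.DataFrame({
    'Platelet_Count': rng.normal(250_000, 50_000, size=200),
    'Hemoglobin_Level': rng.normal(13.5, 1.5, size=200),
    'Family_History': rng.random(200) < 0.3,
})

# Standardize each column to zero mean and unit variance so that no single
# column dominates the distances t-SNE works with.
scaled = StandardScaler().fit_transform(toy_df.astype(float))

embedding = TSNE(random_state=2025).fit_transform(scaled)
scaled_plot_df = pd.DataFrame(embedding, columns=['x', 'y'])
print(scaled_plot_df.shape)
```

If the clusters in the real plot change shape after scaling, they were probably driven by one or two large-valued columns.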

An initial attempt to model this data gave us a model that did no better than always predicting Negative, so let's try to remedy that by drawing a balanced sample, that is, a sample that has as many positive as negative instances.

In [7]:
model_positive_df = df[df[TARGET] == 'Positive']
model_negative_df = df[df[TARGET] == 'Negative']
model_df = pd.concat(axis='index', objs=[model_positive_df, model_negative_df.sample(n=len(model_positive_df), random_state=RANDOM_STATE)])
model_df[TARGET].value_counts().to_dict()

{'Positive': 21397, 'Negative': 21397}
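Balancing before splitting means our test set will be balanced too, which is not what the original data looks like. One possible variation, sketched here on made-up data, is to hold out a test set first and undersample only the training portion. The frame and the names `train_df`, `test_df` and `balanced_train_df` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2025)
# Made-up imbalanced data standing in for df (about 15% positive).
toy_df = pd.DataFrame({
    'WBC_Count': rng.normal(7000, 2000, size=1000),
    'Leukemia_Status': np.where(rng.random(1000) < 0.15, 'Positive', 'Negative'),
})

# Hold out a test set first, keeping the original class proportions in it.
train_df, test_df = train_test_split(
    toy_df, test_size=0.2, random_state=2025, stratify=toy_df['Leukemia_Status'])

# Undersample the majority class in the training portion only.
train_pos = train_df[train_df['Leukemia_Status'] == 'Positive']
train_neg = train_df[train_df['Leukemia_Status'] == 'Negative']
balanced_train_df = pd.concat(
    [train_pos, train_neg.sample(n=len(train_pos), random_state=2025)])

print(balanced_train_df['Leukemia_Status'].value_counts().to_dict())
print(test_df['Leukemia_Status'].value_counts(normalize=True).round(2).to_dict())
```

With this layout, precision and recall on the test set reflect the class mix a model would actually face.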

Now let's build a model. A tree model is a reasonable first choice for tabular data like this: it handles a mix of boolean and numeric features and does not need them scaled.

In [8]:
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_train, X_test, y_train, y_test = train_test_split(model_df[COLUMNS], model_df[TARGET], test_size=0.2, random_state=RANDOM_STATE, shuffle=True, stratify=model_df[TARGET])
tree = DecisionTreeClassifier(random_state=RANDOM_STATE)
tree.fit(X=X_train, y=y_train)
y_pred = tree.predict(X=X_test)

How did our model do?

In [9]:
from sklearn.metrics import classification_report

print(classification_report(y_true=y_test, y_pred=y_pred))

              precision    recall  f1-score   support

    Negative       0.50      0.50      0.50      4280
    Positive       0.50      0.50      0.50      4279

    accuracy                           0.50      8559
   macro avg       0.50      0.50      0.50      8559
weighted avg       0.50      0.50      0.50      8559
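A single train/test split can mislead, so it helps to compare against a trivial baseline across several folds. The sketch below does that on made-up data whose labels are independent of the features, which is the situation a 0.50 score hints at. `X` and `y` are stand-ins for `model_df[COLUMNS]` and `model_df[TARGET]`, not the real data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(2025)
# Made-up balanced data whose labels are independent of the features.
X = rng.normal(size=(2000, 5))
y = np.array(['Negative', 'Positive'] * 1000)

# Five-fold cross-validated accuracy for the tree and for a constant guess.
tree_scores = cross_val_score(DecisionTreeClassifier(random_state=2025), X, y, cv=5)
dummy_scores = cross_val_score(DummyClassifier(strategy='most_frequent'), X, y, cv=5)

print(f'tree:  {tree_scores.mean():.3f} +/- {tree_scores.std():.3f}')
print(f'dummy: {dummy_scores.mean():.3f}')
```

If the tree's cross-validated accuracy on the real data sits as close to the dummy's as it does here, the features are not telling us anything about the target.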



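Our model does exactly as well as flipping a coin on a balanced test set, which suggests the features carry no information about the target, as we would expect if the labels were randomly generated. Let's do a little more EDA just to check.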

In [10]:
express.histogram(data_frame=model_df, x='WBC_Count', facet_col=TARGET).show(renderer='iframe_connected')

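The two histograms look essentially the same. It is highly unlikely that patients with and without leukemia would have identically distributed white blood cell counts, since abnormal counts are a common sign of the disease, so it's very likely our data is synthetic.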
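Eyeballing histograms is a fine start, but we could also put a number on the comparison. Here is a minimal sketch using a two-sample Kolmogorov-Smirnov test from SciPy. The helper name `compare_by_class` is hypothetical, and the toy data is deliberately built so that the two classes differ, to show what a real difference looks like.

```python
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp

def compare_by_class(frame, column, target):
    """Two-sample KS test of `column` between the two classes found in `target`."""
    groups = [group[column].to_numpy() for _, group in frame.groupby(target)]
    return ks_2samp(groups[0], groups[1])

rng = np.random.default_rng(2025)
# Made-up data in which the positive class really does have higher counts.
toy_df = pd.DataFrame({
    'WBC_Count': np.concatenate([rng.normal(7000, 2000, 500), rng.normal(11000, 2000, 500)]),
    'Leukemia_Status': ['Negative'] * 500 + ['Positive'] * 500,
})

result = compare_by_class(toy_df, 'WBC_Count', 'Leukemia_Status')
print(f'KS statistic: {result.statistic:.3f}, p-value: {result.pvalue:.3g}')
```

On the notebook's data the call would be `compare_by_class(model_df, 'WBC_Count', TARGET)`. A KS statistic near zero would back up what the histograms suggest, though with this many rows the statistic itself is more informative than the p-value.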