# From Data to Diagnosis
### Building a Machine Learning Pipeline for Thalassemia Classification

**Authors:** Konstantinos Kalaitzidis and Evanthia Chatzigeorgiou
**Dataset:** [Mendeley Thalassemia Dataset](https://data.mendeley.com/datasets/p8rv84hrbs/1)

---

## 1. Introduction
This notebook builds a simple end-to-end machine learning pipeline that predicts, from clinical features, whether a patient is likely to have thalassemia.


In [17]:
# setup
import pandas as pd  # handling and analyzing structured data
import numpy as np  # numerical operations: fast array math and linear algebra
import matplotlib.pyplot as plt  # basic plotting library
import seaborn as sns  # statistical data visualization built on top of matplotlib
from sklearn.model_selection import train_test_split, GridSearchCV  # data splitting and hyperparameter tuning
from sklearn.preprocessing import StandardScaler  # feature scaling
from sklearn.linear_model import LogisticRegression  # linear model for binary or multiclass classification
from sklearn.ensemble import RandomForestClassifier  # ensemble learning method for classification
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score  # model evaluation
import joblib  # saving and loading machine learning models

print("Libraries imported successfully.")

Libraries imported successfully.


In [21]:
import os

# path to dataset
DATA_PATH = "../data/raw/thalassimia.xlsx"  

# Load dataset
try:
    df = pd.read_excel(DATA_PATH)
    print("✅ Loaded dataset successfully from:", DATA_PATH)
    print("Shape:", df.shape)
    display(df.head(10))
except FileNotFoundError:
    print("❌ File not found at:", DATA_PATH)
    print("Please download the dataset from:")
    print("https://data.mendeley.com/datasets/p8rv84hrbs/1")
    print("and place it at:", DATA_PATH)


✅ Loaded dataset successfully from: ../data/raw/thalassimia.xlsx
Shape: (1073, 18)


Unnamed: 0,ID,Gender,Age,MCV,HBG,MCH,RBC,S,HBA2,HBA,HBF,Iron,Unnamed: 12,Unnamed: 13,Unnamed: 14,Unnamed: 15,Unnamed: 16,Unnamed: 17
0,1,1,30.0,63.9,11.5,20.4,,,5.8,92.1,1.7,9.6,,,,,,
1,2,0,14.0,62.0,7.0,17.8,,,2.2,97.8,,,,,,,,
2,3,1,29.0,72.0,16.8,24.1,,,3.5,96.3,,,,,,,,
3,4,0,27.0,61.0,8.5,24.6,,,2.3,97.6,,0.1,,,,,,
4,5,1,2.0,53.6,8.9,17.0,,,2.5,97.5,,,,,,,,
5,6,0,8.0,59.2,9.7,18.8,,,5.5,92.1,2.0,,,,,,,
6,7,1,23.0,69.7,16.1,27.4,,,2.8,97.2,,,,,,,ذكر,1.0
7,8,0,17.0,77.9,10.4,26.1,,,2.7,97.1,,0.7,,,,,انثى,0.0
8,9,0,36.0,64.0,10.9,20.8,,,5.0,92.0,2.9,15.4,,,,,,
9,10,1,17.0,66.8,11.9,20.6,,,4.9,94.5,,,,,,,,
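
The preview shows six trailing `Unnamed:` columns that are empty apart from two stray cells: the Arabic labels ذكر ("male") and انثى ("female") in `Unnamed: 16`, with 1.0 and 0.0 next to them in `Unnamed: 17`. That looks like a legend for the `Gender` codes, but treat that reading as an assumption. A minimal sketch of one way to remove such columns, assuming they are spreadsheet residue rather than features; the helper `drop_unnamed_columns` and the small frame `demo` are hypothetical stand-ins, not part of the original notebook:

```python
import numpy as np
import pandas as pd


def drop_unnamed_columns(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy without columns whose header starts with 'Unnamed:'."""
    unnamed = [c for c in frame.columns if str(c).startswith("Unnamed:")]
    return frame.drop(columns=unnamed)


# Tiny stand-in for df with the same kind of residue columns (invented rows)
demo = pd.DataFrame({
    "ID": [1, 2, 3],
    "MCV": [63.9, 62.0, 72.0],
    "Unnamed: 12": [np.nan, np.nan, np.nan],
    "Unnamed: 16": [np.nan, "ذكر", "انثى"],
})

cleaned = drop_unnamed_columns(demo)
print(cleaned.columns.tolist())
```

Filtering by header name rather than with `dropna(axis=1, how="all")` also removes `Unnamed: 16` and `Unnamed: 17`, which are not completely empty.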


In [22]:
# Basic info: dtypes, missing values, descriptive stats
print("Columns and dtypes:")
display(df.dtypes)

print("\nMissing values per column:")
display(df.isna().sum())

print("\nDescriptive statistics (numeric):")
display(df.describe().T)


Columns and dtypes:


ID               int64
Gender           int64
Age            float64
MCV            float64
HBG            float64
MCH            float64
RBC            float64
S              float64
HBA2            object
HBA            float64
HBF            float64
Iron           float64
Unnamed: 12    float64
Unnamed: 13    float64
Unnamed: 14    float64
Unnamed: 15    float64
Unnamed: 16     object
Unnamed: 17    float64
dtype: object
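
`HBA2` is the only measurement column read as `object`, which usually means at least one cell holds text that pandas could not parse as a number. A hedged sketch of one way to locate such cells and coerce the column with `pd.to_numeric(..., errors="coerce")`; the sample values are invented for illustration and are not taken from the dataset:

```python
import pandas as pd

# Invented sample; the actual unparseable values in df["HBA2"] are not known here
hba2_raw = pd.Series([5.8, "2.2", "3,5", "n/a", None], name="HBA2")

# Unparseable entries become NaN instead of raising an error
hba2_num = pd.to_numeric(hba2_raw, errors="coerce")

# Cells that were present in the raw data but could not be parsed
bad_cells = hba2_raw[hba2_num.isna() & hba2_raw.notna()]

print(hba2_num.dtype)
print(bad_cells.tolist())
```

Inspecting `bad_cells` before overwriting the column shows whether the problem is a typo that can be repaired or a value that really should be treated as missing.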


Missing values per column:


ID                0
Gender            0
Age               0
MCV              19
HBG              19
MCH              41
RBC             926
S              1054
HBA2              5
HBA               4
HBF             585
Iron            827
Unnamed: 12    1073
Unnamed: 13    1073
Unnamed: 14    1073
Unnamed: 15    1073
Unnamed: 16    1071
Unnamed: 17    1071
dtype: int64
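
These counts are easier to act on as fractions of the 1,073 rows: `S` is missing in about 98% of rows, `RBC` in about 86%, `Iron` in about 77%, and `HBF` in about 55%. A small sketch that turns the counts above into fractions and flags sparse columns; the 50% cutoff `MAX_MISSING` is an assumed value chosen for illustration, not one used in the notebook:

```python
import pandas as pd

N_ROWS = 1073       # number of rows, from df.shape
MAX_MISSING = 0.5   # assumed cutoff for "too sparse", illustrative only

# Missing counts copied from the output above (named columns only)
missing = pd.Series({
    "ID": 0, "Gender": 0, "Age": 0, "MCV": 19, "HBG": 19, "MCH": 41,
    "RBC": 926, "S": 1054, "HBA2": 5, "HBA": 4, "HBF": 585, "Iron": 827,
})

# Fraction of rows missing per column, most sparse first
missing_frac = (missing / N_ROWS).sort_values(ascending=False)
sparse_cols = missing_frac[missing_frac > MAX_MISSING].index.tolist()

print(missing_frac.round(3))
print("Columns above cutoff:", sparse_cols)
```

On the loaded frame, `df.isna().mean()` yields the same fractions directly. Whether a sparse column is dropped or kept depends on why it is missing; a test that is ordered only for some patients can still carry signal.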


Descriptive statistics (numeric):


Unnamed: 0,count,mean,std,min,25%,50%,75%,max
ID,1073.0,537.0,309.892723,1.0,269.0,537.0,805.0,1073.0
Gender,1073.0,0.387698,0.487452,0.0,0.0,0.0,1.0,1.0
Age,1073.0,20.479031,15.110963,1.0,9.0,20.0,29.0,221.0
MCV,1054.0,68.759583,17.792284,2.0,60.0,67.6,77.0,488.0
HBG,1054.0,11.642808,8.199484,1.3,9.2,11.0,12.275,125.0
MCH,1032.0,22.971221,11.741189,3.9,19.3,22.0,26.0,319.0
RBC,147.0,5.434762,2.664376,1.8,4.4,5.0,5.8,23.9
S,19.0,39.926316,8.288603,30.6,37.2,38.0,39.6,70.3
HBA,1069.0,96.722357,59.938013,0.0,93.8,96.8,97.4,981.0
HBF,488.0,5.959795,16.750269,0.3,0.8,1.3,3.1,99.0
