#1. Inspecting transfusion.data file
Blood transfusion saves lives, from replacing lost blood during major surgery or a serious injury to treating various illnesses and blood disorders. Ensuring that enough blood is in supply whenever it is needed is a serious challenge for health professionals. According to WebMD, "about 5 million Americans need a blood transfusion every year".

Our dataset is from a mobile blood donation vehicle in Taiwan. The Blood Transfusion Service Center drives to different universities and collects blood as part of a blood drive. We want to predict whether or not a donor will give blood the next time the vehicle comes to campus.

The data is stored in datasets/transfusion.data and is structured according to the RFMTC marketing model (a variation of RFM). We'll explore what that means later in this notebook. First, let's inspect the data.

#2. Loading the blood donations data
We can load data directly from a link (URL) if the file is hosted publicly in a readable format such as CSV, Excel, JSON or HTML. Let's load the data into memory.

In [None]:
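import pandas as pd

# Note: this is a time-limited signed URL (see X-Goog-Expires), so it may no longer work;
# in that case, point `url` at your own copy of the file.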
url = "https://storage.googleapis.com/kagglesdsdata/datasets/2513835/4266123/transfusion.csv?X-Goog-Algorithm=GOOG4-RSA-SHA256&X-Goog-Credential=gcp-kaggle-com%40kaggle-161607.iam.gserviceaccount.com%2F20250927%2Fauto%2Fstorage%2Fgoog4_request&X-Goog-Date=20250927T124430Z&X-Goog-Expires=259200&X-Goog-SignedHeaders=host&X-Goog-Signature=55c57d32ea2fe1bf77f22af21e3e2d543d350f48aefd023428f7ac15dff52c866b92bdf6897d894bbd15b121fdfb6806656186d9472dc22003a73d05d758f76d3c925aae0e9a168f64aa1331026a720df2385edbff1f00fdf1a0fc141d330ef4452847ed8207b1226ccab84d5814fd0c789256800d9682f7d44ab800c547f88601221acd299e3ca74efb273573ee037145ba96d34177daa13f57978df5f759fc3b03a863797a568afd37701cdbe165293df6bb08cfb43f4e91fc979c88b1c7bedc2c7fd6afeea1ff96b75c58a5b1a4b90bc681c057d39d8a9b3292bc5daf85da04d14d859f0d2abe251db16a08519e79e718280ebd3590b393d1729582b501c7"

df = pd.read_csv(url)

In [None]:
df.head()

Unnamed: 0,Recency (months),Frequency (times),Monetary (c.c. blood),Time (months),whether he/she donated blood in March 2007
0,2,50,12500,98,1
1,0,13,3250,28,1
2,1,16,4000,35,1
3,2,20,5000,45,1
4,1,24,6000,77,0
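
Because a signed download link can expire, it may help to fall back on the local copy mentioned earlier (datasets/transfusion.data). A minimal sketch, assuming the local file is comma-separated with a header row; `load_transfusion` is a hypothetical helper, not part of the original notebook:

```python
from pathlib import Path

import pandas as pd


def load_transfusion(local_path="datasets/transfusion.data", url=None):
    """Hypothetical helper: prefer a local copy, fall back to a URL."""
    path = Path(local_path)
    if path.exists():
        # Assumes a comma-separated file with a header row
        return pd.read_csv(path)
    if url is None:
        raise FileNotFoundError(f"{local_path} not found and no URL given")
    return pd.read_csv(url)
```

With this in place, `df = load_transfusion(url=url)` would use the local file when it exists and only then try the network.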


#3. Inspecting transfusion DataFrame
Let's briefly return to our discussion of the RFM model. RFM stands for Recency, Frequency and Monetary Value, and it is commonly used in marketing to identify your best customers. In our case, our customers are blood donors.

RFMTC is a variation of the RFM model. Below is a description of what each column means in our dataset:

R (Recency - months since the last donation)
F (Frequency - total number of donations)
M (Monetary - total blood donated in c.c.)
T (Time - months since the first donation)
a binary variable representing whether he/she donated blood in March 2007 (1 stands for donating blood; 0 stands for not donating blood)

It looks like every column in our DataFrame has a numeric type, which is exactly what we want when building a machine learning model. Let's verify our hypothesis.

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 748 entries, 0 to 747
Data columns (total 5 columns):
 #   Column                                      Non-Null Count  Dtype
---  ------                                      --------------  -----
 0   Recency (months)                            748 non-null    int64
 1   Frequency (times)                           748 non-null    int64
 2   Monetary (c.c. blood)                       748 non-null    int64
 3   Time (months)                               748 non-null    int64
 4   whether he/she donated blood in March 2007  748 non-null    int64
dtypes: int64(5)
memory usage: 29.3 KB
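
The same check can be done programmatically rather than by reading the `info()` output. A minimal sketch on a small stand-in frame built from the five rows shown earlier (the names `sample`, `all_numeric` and `no_missing` are illustrative):

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Stand-in for df, built from the five rows displayed by df.head()
sample = pd.DataFrame({
    "Recency (months)": [2, 0, 1, 2, 1],
    "Frequency (times)": [50, 13, 16, 20, 24],
    "Monetary (c.c. blood)": [12500, 3250, 4000, 5000, 6000],
    "Time (months)": [98, 28, 35, 45, 77],
    "whether he/she donated blood in March 2007": [1, 1, 1, 1, 0],
})

# True only if every column has a numeric dtype
all_numeric = all(is_numeric_dtype(sample[col]) for col in sample.columns)
# True only if no column contains missing values
no_missing = not sample.isna().any().any()
print(all_numeric, no_missing)
```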


#4. Creating target column
We are aiming to predict the value in the whether he/she donated blood in March 2007 column. Let's rename it to target so that it's more convenient to work with.

In [None]:
# Rename target column as 'target' for brevity
df.rename(
    columns={'whether he/she donated blood in March 2007': 'target'},
    inplace=True
)

# Printing out the first 2 rows
df.head(2)

Unnamed: 0,Recency (months),Frequency (times),Monetary (c.c. blood),Time (months),target
0,2,50,12500,98,1
1,0,13,3250,28,1


#5. Checking target incidence
We want to predict whether or not the same donor will give blood the next time the vehicle comes to campus. The model for this is a binary classifier, meaning that there are only 2 possible outcomes:

0 - the donor will not give blood
1 - the donor will give blood

Target incidence is defined as the number of cases of each individual target value in a dataset. That is, how many 0s are there in the target column compared to how many 1s? Target incidence gives us an idea of how balanced (or imbalanced) our dataset is.

In [None]:
# Print target incidence proportions, rounding output to 3 decimal places

df.target.value_counts(normalize=True).round(3)

target,proportion
0,0.762
1,0.238
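
These proportions also set a baseline: a model that always predicts 0 would already be right about 76% of the time, which is worth keeping in mind when reading any accuracy figure. A minimal sketch, assuming 570 zeros and 178 ones (the only split of 748 rows that rounds to the proportions above); `y` is a stand-in for `df.target`:

```python
import pandas as pd

# Stand-in for df.target: 570 zeros and 178 ones (748 rows in total)
y = pd.Series([0] * 570 + [1] * 178, name="target")

incidence = y.value_counts(normalize=True).round(3)
# Accuracy of a trivial model that always predicts the majority class
majority_baseline = incidence.max()
print(incidence.to_dict(), majority_baseline)
```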


#6. Splitting transfusion into train and test datasets
We'll now use the train_test_split() function to split the transfusion DataFrame.

Target incidence informed us that in our dataset 0s appear 76% of the time. We want to keep the same structure in the train and test datasets, i.e., both should contain about 76% 0s. This is easy to do with the train_test_split() function from the scikit-learn library: all we need to do is specify the stratify parameter. In our case, we'll stratify on the target column.

In [None]:
from sklearn.model_selection import train_test_split

# Split transfusion DataFrame into
# X_train, X_test, y_train and y_test datasets,
# stratifying on the `target` column
X_train, X_test, y_train, y_test = train_test_split(
    df.drop(columns='target'),
    df.target,
    test_size=0.25,
    random_state=42,
    stratify=df.target
)

# Print out the first 2 rows of X_train

X_train.head(2)

Unnamed: 0,Recency (months),Frequency (times),Monetary (c.c. blood),Time (months)
334,16,2,500,16
99,5,7,1750,26
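
To confirm that stratification keeps the class balance, the share of 1s can be compared across the splits. A minimal sketch on synthetic data with the same size and class balance as the dataset (570 zeros and 178 ones is an assumption derived from the rounded proportions; `X` here is just a row index, not the real features):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 748 rows with the same class balance as the dataset
y = np.array([0] * 570 + [1] * 178)
X = np.arange(len(y)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Share of 1s in the full data, the train split and the test split
print(round(y.mean(), 3), round(y_tr.mean(), 3), round(y_te.mean(), 3))
```

The three printed shares should agree to within a fraction of a percentage point; without `stratify`, they can drift further apart on a dataset this small.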


In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score


# Logistic Regression
logreg = LogisticRegression(solver='liblinear', random_state=42)
logreg.fit(X_train, y_train)

# AUC score
logreg_auc_score = roc_auc_score(y_test, logreg.predict_proba(X_test)[:, 1])
print(f'\nLogistic Regression AUC score: {logreg_auc_score:.4f}')


Logistic Regression AUC score: 0.7851
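
AUC can be read as the probability that a randomly chosen donor from class 1 receives a higher predicted score than a randomly chosen donor from class 0. A minimal sketch on four made-up scores (not taken from the dataset) compares `roc_auc_score` with a direct pair count:

```python
from itertools import product

from sklearn.metrics import roc_auc_score

# Made-up labels and predicted probabilities, for illustration only
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

pos = [s for s, t in zip(y_score, y_true) if t == 1]
neg = [s for s, t in zip(y_score, y_true) if t == 0]

# Fraction of (positive, negative) pairs ranked correctly; ties count as half
pairs = list(product(pos, neg))
pairwise_auc = sum((p > n) + 0.5 * (p == n) for p, n in pairs) / len(pairs)

sk_auc = roc_auc_score(y_true, y_score)
print(pairwise_auc, sk_auc)
```

Three of the four positive/negative pairs are ordered correctly, so both values come out at 0.75.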


#7. Checking the Variance
One of the assumptions of linear models is that the data and the features we are giving them are related in a linear fashion, or can be measured with a linear distance metric. If a feature in our dataset has a variance that is orders of magnitude greater than that of the other features, this could impair the model's ability to learn from the other features in the dataset.

Correcting for high variance is called normalization. It is one of the possible transformations you can apply before training a model. Let's check the variance to see if such a transformation is needed.

In [None]:
X_train.var().round(3)

Unnamed: 0,0
Recency (months),66.929
Frequency (times),33.83
Monetary (c.c. blood),2114363.7
Time (months),611.147
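
In the rows shown earlier, Monetary is exactly 250 times Frequency (for example, 50 donations and 12500 c.c.). If that holds for the whole dataset, which is an assumption here, the large variance is purely a matter of units, since Var(cX) = c² Var(X): 250² × 33.83 ≈ 2,114,375, close to the 2,114,363.7 reported above. A minimal sketch on those five rows:

```python
import numpy as np
import pandas as pd

# The five rows displayed by df.head()
sample = pd.DataFrame({
    "Frequency (times)": [50, 13, 16, 20, 24],
    "Monetary (c.c. blood)": [12500, 3250, 4000, 5000, 6000],
})

# Monetary divided by Frequency is constant in these rows
ratio = sample["Monetary (c.c. blood)"] / sample["Frequency (times)"]

# Scaling a variable by c multiplies its variance by c**2
var_ratio = sample["Monetary (c.c. blood)"].var() / sample["Frequency (times)"].var()
print(ratio.unique(), var_ratio)
```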


#8. Log normalization
The variance of Monetary (c.c. blood) is very high in comparison to any other column in the dataset. This means that, unless accounted for, this feature may be given more weight by the model (i.e., be seen as more important) than any other feature.

One way to correct for high variance is to use log normalization.

In [None]:
# Import numpy
import numpy as np

# Copy X_train and X_test into X_train_normed and X_test_normed
X_train_normed, X_test_normed = X_train.copy(), X_test.copy()

# Specify which column to normalize
col_to_normalize = 'Monetary (c.c. blood)'

# Log normalization
for df_ in [X_train_normed, X_test_normed]:
    # Add log normalized column
    df_['monetary_log'] = np.log(df_[col_to_normalize])
    # Drop the original column
    df_.drop(columns=col_to_normalize, inplace=True)

# Check the variance for X_train_normed

X_train_normed.var().round(3)

Unnamed: 0,0
Recency (months),66.929
Frequency (times),33.83
Time (months),611.147
monetary_log,0.837
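
The effect of the transform can be seen on a handful of values. A minimal sketch using the Monetary values from the rows shown earlier; `np.log1p` is mentioned only as a common alternative for columns that may contain zeros, a situation that does not arise in the values used here:

```python
import numpy as np
import pandas as pd

# Monetary values from the five rows displayed by df.head()
monetary = pd.Series([12500, 3250, 4000, 5000, 6000], name="Monetary (c.c. blood)")
monetary_log = np.log(monetary)

# The log compresses large values, so the spread shrinks dramatically
print(monetary.var(), monetary_log.var())

# np.log(0) is -inf; np.log1p(x) = log(1 + x) maps 0 to 0 instead
print(np.log1p(0))
```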


In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Logistic Regression
logreg = LogisticRegression(solver='liblinear', random_state=42)
logreg.fit(X_train_normed, y_train)

# AUC score
logreg_auc_score = roc_auc_score(y_test, logreg.predict_proba(X_test_normed)[:, 1])
print(f'\nLogistic Regression AUC score: {logreg_auc_score:.4f}')


Logistic Regression AUC score: 0.7890


In [None]:
from sklearn.ensemble import RandomForestClassifier

# Random Forest
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train_normed, y_train)

# AUC score
rf_auc_score = roc_auc_score(y_test, rf.predict_proba(X_test_normed)[:, 1])
print(f'\nRandom Forest AUC score: {rf_auc_score:.4f}')


Random Forest AUC score: 0.7116


In [None]:
import xgboost as xgb
print(xgb.__version__)


3.0.5


In [None]:
import xgboost as xgb
from sklearn.metrics import roc_auc_score, accuracy_score

# Convert to DMatrix
dtrain = xgb.DMatrix(X_train_normed, label=y_train)
dtest = xgb.DMatrix(X_test_normed, label=y_test)

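# Best params from RandomizedSearchCV
# (`random_search` must already be fitted in an earlier cell, which is not shown here).
# Note: xgb.train() ignores `n_estimators`; the number of rounds is set by `num_boost_round`.
best_params = random_search.best_params_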
best_params.update({
    "objective": "binary:logistic",
    "eval_metric": "auc",
    "seed": 42
})

# Train with early stopping
evals = [(dtrain, "train"), (dtest, "eval")]
bst = xgb.train(
    params=best_params,
    dtrain=dtrain,
    num_boost_round=500,
    evals=evals,
    early_stopping_rounds=10
)

# Predictions (probabilities)
y_pred_proba = bst.predict(dtest)

# AUC score
xgb_auc_score = roc_auc_score(y_test, y_pred_proba)
print(f"\nOptimized XGBoost AUC score: {xgb_auc_score:.4f}")

# Accuracy score (need class labels, so threshold at 0.5)
y_pred = (y_pred_proba > 0.5).astype(int)
accuracy = accuracy_score(y_test, y_pred)
print(f"Optimized XGBoost accuracy: {accuracy:.4f}")


[0]	train-auc:0.73641	eval-auc:0.77026
[1]	train-auc:0.75301	eval-auc:0.75731
[2]	train-auc:0.75908	eval-auc:0.75318
[3]	train-auc:0.76145	eval-auc:0.75397
[4]	train-auc:0.76582	eval-auc:0.75882
[5]	train-auc:0.76934	eval-auc:0.75954
[6]	train-auc:0.77490	eval-auc:0.75540
[7]	train-auc:0.77719	eval-auc:0.75755
[8]	train-auc:0.78091	eval-auc:0.75882
[9]	train-auc:0.78388	eval-auc:0.75405

Optimized XGBoost AUC score: 0.7506
Optimized XGBoost accuracy: 0.7807


Parameters: { "n_estimators" } are not used.

  self.starting_round = model.num_boosted_rounds()
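
The warning about `n_estimators` arises because that name belongs to the scikit-learn wrapper (`XGBClassifier`), while `xgb.train` takes the number of rounds through `num_boost_round`. One way to handle this, sketched with plain dictionaries and made-up values (the real ones come from `random_search.best_params_`, which is not shown here), is to move that entry out of the parameter dict:

```python
# Made-up search result, for illustration only
sklearn_style_params = {"n_estimators": 200, "max_depth": 3, "learning_rate": 0.1}

# Copy so the original search result is left untouched
native_params = dict(sklearn_style_params)

# Use the tuned tree count as the round limit, defaulting to 500 if absent
num_boost_round = native_params.pop("n_estimators", 500)

native_params.update({
    "objective": "binary:logistic",
    "eval_metric": "auc",
    "seed": 42,
})
print(num_boost_round, native_params)
```

`native_params` and `num_boost_round` could then be passed to `xgb.train(params=native_params, dtrain=dtrain, num_boost_round=num_boost_round, ...)`.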
