## 1. Inspecting transfusion.data file
<p><img src="https://assets.datacamp.com/production/project_646/img/blood_donation.png" style="float: right;" alt="A pictogram of a blood bag with blood donation written in it" width="200"></p>
<p>Blood transfusion saves lives, from replacing lost blood during major surgery or after a serious injury to treating various illnesses and blood disorders. Ensuring that enough blood is in supply whenever it is needed is a serious challenge for health professionals. According to <a href="https://www.webmd.com/a-to-z-guides/blood-transfusion-what-to-know#1">WebMD</a>, "about 5 million Americans need a blood transfusion every year".</p>
<p>My dataset is from a mobile blood donation vehicle in Taiwan. The Blood Transfusion Service Center drives to different universities and collects blood as part of a blood drive. I want to predict whether or not a donor will give blood the next time the vehicle comes to campus.</p>
<p>The data is stored in <code>transfusion.data</code> and is structured according to the RFMTC marketing model (a variation of RFM).</p>

In [8]:
with open('transfusion.data') as file:
    # Print the first 5 lines of the raw file
    for _ in range(5):
        print(file.readline().strip())

Recency (months),Frequency (times),Monetary (c.c. blood),Time (months),"whether he/she donated blood in March 2007"
2 ,50,12500,98 ,1
0 ,13,3250,28 ,1
1 ,16,4000,35 ,1
2 ,20,5000,45 ,1


## 2. Loading the blood donations data
<p>Next, I load the data into memory as a pandas DataFrame.</p>

In [9]:
# Importing pandas to manage and manipulate dataset
import pandas as pd

# Read in dataset
transfusion = pd.read_csv('transfusion.data')

# Preview the first rows
transfusion.head()

,Recency (months),Frequency (times),Monetary (c.c. blood),Time (months),whether he/she donated blood in March 2007
0,2,50,12500,98,1
1,0,13,3250,28,1
2,1,16,4000,35,1
3,2,20,5000,45,1
4,1,24,6000,77,0


## 3. Inspecting transfusion DataFrame
<p>RFMTC is a variation of the RFM model. Below is a description of what each column means in the dataset <code>transfusion.data</code>:</p>
<ul>
<li>R (Recency - months since the last donation)</li>
<li>F (Frequency - total number of donations)</li>
<li>M (Monetary - total blood donated in c.c.)</li>
<li>T (Time - months since the first donation)</li>
<li>a binary variable representing whether he/she donated blood in March 2007 (1 stands for donating blood; 0 stands for not donating blood)</li>
</ul>
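<p>In the rows printed so far, M appears to be a fixed multiple of F. The sketch below checks that on the five sample rows shown above only (not the full file), so the result is an observation about the sample rather than a guarantee for the whole dataset. The last column is shortened to <code>target</code> here purely for brevity.</p>

```python
import io

import pandas as pd

# First rows of transfusion.data as printed above (illustrative sample only)
sample_csv = io.StringIO(
    "Recency (months),Frequency (times),Monetary (c.c. blood),Time (months),target\n"
    "2,50,12500,98,1\n"
    "0,13,3250,28,1\n"
    "1,16,4000,35,1\n"
    "2,20,5000,45,1\n"
    "1,24,6000,77,0\n"
)
sample = pd.read_csv(sample_csv)

# Blood volume per donation implied by M / F for each sample row
cc_per_donation = sample["Monetary (c.c. blood)"] / sample["Frequency (times)"]
print(cc_per_donation.unique())

# True when every sample row has the same M / F ratio
is_constant_ratio = cc_per_donation.nunique() == 1
print(is_constant_ratio)
```

<p>If the same ratio holds across the full file, <code>Monetary (c.c. blood)</code> and <code>Frequency (times)</code> carry the same information up to a scale factor, which is worth keeping in mind when interpreting a linear model.</p>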

In [10]:
# Print a concise summary of transfusion DataFrame
transfusion.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 748 entries, 0 to 747
Data columns (total 5 columns):
 #   Column                                      Non-Null Count  Dtype
---  ------                                      --------------  -----
 0   Recency (months)                            748 non-null    int64
 1   Frequency (times)                           748 non-null    int64
 2   Monetary (c.c. blood)                       748 non-null    int64
 3   Time (months)                               748 non-null    int64
 4   whether he/she donated blood in March 2007  748 non-null    int64
dtypes: int64(5)
memory usage: 29.3 KB


## 4. Creating target column
<p>The goal is to predict the value in the <code>whether he/she donated blood in March 2007</code> column. I'll rename it to <code>target</code> so that it's more convenient to work with.</p>

In [11]:
# Rename target column as 'target' for brevity
transfusion.rename(
    columns={'whether he/she donated blood in March 2007': 'target'},
    inplace=True
)

transfusion.head(2)

,Recency (months),Frequency (times),Monetary (c.c. blood),Time (months),target
0,2,50,12500,98,1
1,0,13,3250,28,1


## 5. Checking target incidence
<p>I want to predict whether or not the same donor will give blood the next time the vehicle comes to campus. Before modeling, I check the target incidence, i.e. the proportion of each outcome in the data. The model for this is a binary classifier, meaning that there are only 2 possible outcomes:</p>
<ul>
<li><code>0</code> - the donor will not give blood</li>
<li><code>1</code> - the donor will give blood</li>
</ul>


In [12]:
transfusion.target.value_counts(normalize=True).round(3)

target,proportion
0,0.762
1,0.238


## 6. Splitting transfusion into train and test datasets


In [20]:
from sklearn.model_selection import train_test_split

# Stratified train-test split to maintain target distribution
X_train, X_test, y_train, y_test = train_test_split(
    transfusion.drop(columns='target'),
    transfusion.target,
    test_size=0.25,
    random_state=42,
    stratify=transfusion.target
)

X_train.head(2)

,Recency (months),Frequency (times),Monetary (c.c. blood),Time (months)
334,16,2,500,16
99,5,7,1750,26
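
<p>To illustrate what <code>stratify</code> does in the split above, here is a small sketch on synthetic labels (an assumption: 1000 rows with 23.8% positives, not the transfusion data). It compares the share of positives in the train and test parts with and without stratification.</p>

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels (assumption: 1000 rows, 23.8% positives), not the transfusion data
y = np.array([1] * 238 + [0] * 762)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder feature

# Plain random split: class shares can drift between train and test
_, _, y_tr_plain, y_te_plain = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Stratified split: class shares are kept as close as possible to the original
_, _, y_tr_strat, y_te_strat = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

print(f"plain split:      train={y_tr_plain.mean():.3f}, test={y_te_plain.mean():.3f}")
print(f"stratified split: train={y_tr_strat.mean():.3f}, test={y_te_strat.mean():.3f}")
```

<p>With an imbalanced target like this one, the stratified version keeps both parts near the overall incidence, so the test AUC is computed on a class mix similar to the training data.</p>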


## 7. Selecting model using TPOT (installation first)


In [14]:
!pip install tpot

Collecting tpot
  Downloading TPOT-0.12.2-py3-none-any.whl.metadata (2.0 kB)
Collecting deap>=1.2 (from tpot)
  Downloading deap-1.4.1-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (13 kB)
Collecting update-checker>=0.16 (from tpot)
  Downloading update_checker-0.18.0-py3-none-any.whl.metadata (2.3 kB)
Collecting stopit>=1.1.1 (from tpot)
  Downloading stopit-1.1.2.tar.gz (18 kB)
  Preparing metadata (setup.py) ... done
Downloading TPOT-0.12.2-py3-none-any.whl (87 kB)
Downloading deap-1.4.1-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (135 kB)
Downloading update_checker-0.18.0-py3-none-any.whl (7.0 kB)
Building wheel

In [15]:
from tpot import TPOTClassifier
from sklearn.metrics import roc_auc_score

# TPOTClassifier configuration with ROC AUC as the evaluation metric
tpot = TPOTClassifier(
    generations=5,
    population_size=20,
    verbosity=2,
    scoring='roc_auc',
    random_state=42,
    disable_update_check=True,
    config_dict='TPOT light'
)
tpot.fit(X_train, y_train)

# Calculate AUC score
tpot_auc_score = roc_auc_score(y_test, tpot.predict_proba(X_test)[:, 1])
print(f'\nAUC score: {tpot_auc_score:.4f}')

# Output the steps in the best pipeline
print('\nBest pipeline steps:')
for idx, (_, transform) in enumerate(tpot.fitted_pipeline_.steps, start=1):
    print(f'{idx}. {transform}')

Optimization Progress:   0%|          | 0/120 [00:00<?, ?pipeline/s]


Generation 1 - Current best internal CV score: 0.7422459184429089

Generation 2 - Current best internal CV score: 0.7422459184429089

Generation 3 - Current best internal CV score: 0.7422459184429089

Generation 4 - Current best internal CV score: 0.7422459184429089

Generation 5 - Current best internal CV score: 0.7423330644124078

Best pipeline: LogisticRegression(RobustScaler(input_matrix), C=25.0, dual=False, penalty=l2)

AUC score: 0.7858

Best pipeline steps:
1. RobustScaler()
2. LogisticRegression(C=25.0, random_state=42)
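
<p>The two steps TPOT reports can also be written by hand as a scikit-learn pipeline. The sketch below does that on synthetic stand-in data (an assumption, generated with <code>make_classification</code>), so the AUC it prints says nothing about the transfusion data; the names <code>manual_pipeline</code> and <code>demo_auc</code> are illustrative.</p>

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler

# Synthetic stand-in data (assumption): 4 features, like the R, F, M, T columns
X_demo, y_demo = make_classification(
    n_samples=400, n_features=4, n_informative=3, n_redundant=1, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

# Same two steps TPOT reported: robust scaling followed by logistic regression
manual_pipeline = make_pipeline(
    RobustScaler(),
    LogisticRegression(C=25.0, random_state=42),
)
manual_pipeline.fit(X_tr, y_tr)

# Score with the probability of the positive class, as in the cell above
demo_auc = roc_auc_score(y_te, manual_pipeline.predict_proba(X_te)[:, 1])
print([name for name, _ in manual_pipeline.steps])
print(f"AUC on synthetic data: {demo_auc:.4f}")
```

<p>Writing the pipeline out like this makes it possible to refit or tune the selected model without rerunning the TPOT search.</p>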


## 8. Checking the variance


In [16]:
X_train.var().round(3)

,variance
Recency (months),66.929
Frequency (times),33.83
Monetary (c.c. blood),2114363.7
Time (months),611.147


## 9. Log normalization


In [17]:
import numpy as np

# Create copies of X_train and X_test for normalization
X_train_normed, X_test_normed = X_train.copy(), X_test.copy()

# Column to apply log normalization
col_to_normalize = 'Monetary (c.c. blood)'

# Apply log transformation and replace original column
for df_ in [X_train_normed, X_test_normed]:
    df_['monetary_log'] = np.log(df_[col_to_normalize])
    df_.drop(columns=col_to_normalize, inplace=True)

# Output variance after normalization
X_train_normed.var().round(3)

,variance
Recency (months),66.929
Frequency (times),33.83
Time (months),611.147
monetary_log,0.837
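
<p>To see why the log transform shrinks the variance so sharply, here is a minimal sketch on a handful of Monetary values that appear in the outputs above (an illustrative sample, not the full column, so the numbers will differ from the ones just printed). The logarithm compresses large values much more than small ones, which pulls the spread in.</p>

```python
import numpy as np

# Monetary values seen in the outputs above (illustrative sample, not the full column)
monetary = np.array([12500, 3250, 4000, 5000, 6000, 500, 1750], dtype=float)

# np.log is only defined for strictly positive inputs
assert (monetary > 0).all()
monetary_log = np.log(monetary)

# Sample variance (ddof=1), matching the default of pandas' .var()
var_raw = monetary.var(ddof=1)
var_log = monetary_log.var(ddof=1)
print(f"variance before log: {var_raw:.1f}")
print(f"variance after log:  {var_log:.3f}")
```

<p>If a column could contain zeros, <code>np.log1p</code> would be a common alternative, since <code>np.log(0)</code> is negative infinity.</p>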


## 10. Training the logistic regression model


In [18]:
from sklearn import linear_model

# Initialize Logistic Regression with liblinear solver
logreg = linear_model.LogisticRegression(
    solver='liblinear',
    random_state=42
)

# Fit the model on the normalized training data
logreg.fit(X_train_normed, y_train)

# Compute AUC score on the test set
logreg_auc_score = roc_auc_score(y_test, logreg.predict_proba(X_test_normed)[:, 1])
print(f'\nAUC score: {logreg_auc_score:.4f}')


AUC score: 0.7890


## 11. Conclusion
<p>The demand for blood fluctuates throughout the year. As one <a href="https://www.kjrh.com/news/local-news/red-cross-in-blood-donation-crisis">prominent</a> example, blood donations slow down during busy holiday seasons. An accurate forecast of the future blood supply allows appropriate action to be taken ahead of time and can therefore save more lives.</p>
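<p>In this notebook, I explored automatic model selection using TPOT; the best pipeline it found reached an AUC score of 0.7858 on the test set. For reference, simply choosing <code>0</code> all the time would be correct 76% of the time (per the target incidence), although accuracy and AUC are different metrics and a constant prediction has an AUC of only 0.5. I then log normalized the <code>Monetary (c.c. blood)</code> column and trained a logistic regression, which raised the AUC score to 0.7890, an improvement of about 0.003 (roughly 0.4% in relative terms).</p>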
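
<p>The "always choose <code>0</code>" baseline can be sketched in a few lines. The labels below are hypothetical: 570 zeros and 178 ones, chosen only to reproduce the 0.762 / 0.238 incidence on 748 rows. The helper <code>pairwise_auc</code> is an illustrative pure-Python AUC (share of positive/negative pairs ranked correctly, ties counted as half), not the scikit-learn implementation.</p>

```python
def pairwise_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels: 570 zeros and 178 ones, i.e. a 0.762 / 0.238 split on 748 rows
y_true = [0] * 570 + [1] * 178

# "Always choose 0": every donor gets the same score and the same predicted class
constant_scores = [0.0] * len(y_true)
constant_preds = [0] * len(y_true)

accuracy = sum(p == y for p, y in zip(constant_preds, y_true)) / len(y_true)
auc = pairwise_auc(y_true, constant_scores)
print(f"accuracy: {accuracy:.3f}")
print(f"AUC:      {auc:.3f}")
```

<p>The constant model looks good on accuracy because of the class imbalance, but it cannot rank donors at all, so its AUC sits at 0.5. That is the level the AUC scores in this notebook should be compared against.</p>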


In [22]:
from operator import itemgetter

# Rank models based on AUC scores (descending order)
sorted(
    [('tpot', tpot_auc_score), ('logreg', logreg_auc_score)],
    key=itemgetter(1),
    reverse=True
)

[('logreg', 0.7890178003814368), ('tpot', 0.7857596948506039)]