## 1. Introduction

Blood transfusion saves lives, from replacing lost blood during major surgery or after a serious injury to treating various illnesses and blood disorders. Ensuring that enough blood is in supply whenever it is needed is a serious challenge for health professionals.

The objective of this project is to predict whether or not a donor will give blood the next time the health organization comes to their local area. 

To achieve this, I start by inspecting the transfusion.data dataset and cleaning up a few aspects of the data (renaming the target column and, later, normalizing a high-variance feature) before evaluating a classification model.

## 2. Loading the dataset
The dataset file has a .data extension, but its contents are comma-separated, so it can be read as a CSV file.
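
As a minimal sketch of why the extension does not matter: `pd.read_csv` parses comma-separated text regardless of the file name. The in-memory text below is only a stand-in for the file; its column names and two rows are copied from the dataset preview in this section, and `sample` and `df` are hypothetical names.

```python
import io

import pandas as pd

# Stand-in for the contents of a .data file: plain comma-separated text.
sample = io.StringIO(
    "Recency (months),Frequency (times),Monetary (c.c. blood),Time (months),"
    "whether he/she donated blood in March 2007\n"
    "2,50,12500,98,1\n"
    "0,13,3250,28,1\n"
)

# read_csv parses the content; the file extension plays no role.
df = pd.read_csv(sample)
print(df.shape)
```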

In [1]:
import pandas as pd

# Read in dataset
transfusion = pd.read_csv("datasets/transfusion.data")
# Print out the first rows of our dataset
transfusion.head()

Unnamed: 0,Recency (months),Frequency (times),Monetary (c.c. blood),Time (months),whether he/she donated blood in March 2007
0,2,50,12500,98,1
1,0,13,3250,28,1
2,1,16,4000,35,1
3,2,20,5000,45,1
4,1,24,6000,77,0


## 3. Inspecting the dataset

The dataset follows a model called RFMTC, which is normally used for identifying "best customers" (in this context, our customers are blood donors).
Based on the available data, the variables can be understood as follows:

<ul>
<li>R (Recency - months since the last donation)</li>
<li>F (Frequency - total number of donations)</li>
<li>M (Monetary - total blood donated in c.c.)</li>
<li>T (Time - months since the first donation)</li>
<li>a binary variable representing whether he/she donated blood in March 2007 (1 stands for donating blood; 0 stands for not donating blood)</li>
</ul>



In [2]:
#Summary:
transfusion.describe()

Unnamed: 0,Recency (months),Frequency (times),Monetary (c.c. blood),Time (months),whether he/she donated blood in March 2007
count,748.0,748.0,748.0,748.0,748.0
mean,9.506684,5.514706,1378.676471,34.282086,0.237968
std,8.095396,5.839307,1459.826781,24.376714,0.426124
min,0.0,1.0,250.0,2.0,0.0
25%,2.75,2.0,500.0,16.0,0.0
50%,7.0,4.0,1000.0,28.0,0.0
75%,14.0,7.0,1750.0,50.0,0.0
max,74.0,50.0,12500.0,98.0,1.0


In [3]:
transfusion.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 748 entries, 0 to 747
Data columns (total 5 columns):
Recency (months)                              748 non-null int64
Frequency (times)                             748 non-null int64
Monetary (c.c. blood)                         748 non-null int64
Time (months)                                 748 non-null int64
whether he/she donated blood in March 2007    748 non-null int64
dtypes: int64(5)
memory usage: 29.3 KB


In [4]:
#Missing values per column
pd.DataFrame(transfusion.isna().sum(), columns=[ '#_missing_values'])

Unnamed: 0,#_missing_values
Recency (months),0
Frequency (times),0
Monetary (c.c. blood),0
Time (months),0
whether he/she donated blood in March 2007,0


In [5]:
# For convenience, rename the target column (y) to 'target'.
transfusion.rename(
    columns={'whether he/she donated blood in March 2007': 'target'},
    inplace=True  # modify transfusion in place instead of returning a new DataFrame
)

transfusion.head(2)

Unnamed: 0,Recency (months),Frequency (times),Monetary (c.c. blood),Time (months),target
0,2,50,12500,98,1
1,0,13,3250,28,1


In [24]:
# At this point, I know that the appropriate model for this problem is a binary classifier:
# 0 = the donor will not give blood
# 1 = the donor will give blood

# Now, let's get an idea of the target incidence (how balanced the target variable is).
# Target incidence proportions, rounding output to 3 decimal places
transfusion.target.value_counts(normalize=True).round(3)
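
To illustrate what `value_counts(normalize=True)` returns, here is a small sketch on a made-up series. `toy_target` and `incidence` are hypothetical names, and the values are not the project data.

```python
import pandas as pd

# Hypothetical target column: three non-donors (0) and one donor (1).
toy_target = pd.Series([0, 0, 0, 1], name="target")

# normalize=True returns class proportions instead of raw counts.
incidence = toy_target.value_counts(normalize=True).round(3)
print(incidence)
```

The result is indexed by class label, so `incidence.loc[0]` is the share of 0s.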

### 3.1 Observations

The summary shows that every column holds numeric values and has an appropriate data type (int64). We could reduce memory usage by downcasting each column according to its range of values, but that is outside the scope of this project.

It is also worth mentioning that none of the columns has missing values.
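
Although memory optimization is out of scope here, a minimal sketch of what it could look like is shown below, assuming `pd.to_numeric` with `downcast="integer"` is an acceptable approach. The frame `toy` is hypothetical and only reuses two column names from the dataset.

```python
import pandas as pd

# Hypothetical frame with the same integer dtype as the dataset (int64).
toy = pd.DataFrame(
    {
        "Recency (months)": [2, 0, 1],
        "Monetary (c.c. blood)": [12500, 3250, 4000],
    }
).astype("int64")

# Downcast each column to the smallest integer type that can hold its values.
downcast = toy.apply(pd.to_numeric, downcast="integer")
print(downcast.dtypes)
```

The values are unchanged; only the storage type of each column shrinks.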

## 4. Splitting transfusion into train and test datasets

We'll now use the train_test_split() function to split the transfusion DataFrame.

The target incidence tells us that 0s appear about 76% of the time in our dataset (the mean of the target column in the summary above is about 0.238). We want to keep the same structure in the train and test datasets, i.e., both should have a 0 target incidence of about 76%. This is easy to do with train_test_split() from the scikit-learn library: all we need to do is specify the stratify parameter. In our case, we'll stratify on the target column.
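
The effect of `stratify` can be checked on a small synthetic example. The data below is made up (76 zeros and 24 ones, mirroring the proportions mentioned above), and the names `X`, `y`, `X_tr`, `X_te`, `y_tr` and `y_te` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: 76 zeros and 24 ones.
y = pd.Series([0] * 76 + [1] * 24)
X = pd.DataFrame({"feature": np.arange(100)})

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# With stratification, both splits keep the 76/24 class ratio.
print(y_tr.mean(), y_te.mean())
```

Without `stratify`, the class ratio in each split would depend on the random shuffle.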

In [9]:
# Import the train_test_split function
from sklearn.model_selection import train_test_split

# Split transfusion DataFrame into
# X_train, X_test, y_train and y_test datasets,
# stratifying on the `target` column
X_train, X_test, y_train, y_test = train_test_split(
    transfusion.drop(columns='target'),
    transfusion.target,
    test_size=0.25,
    random_state=42,  # seed for reproducibility
    # stratify so that the train and test sets keep
    # roughly the same proportion of each class
    stratify=transfusion.target
)

# Print out the first 2 rows of X_train
X_train.head(2)

Unnamed: 0,Recency (months),Frequency (times),Monetary (c.c. blood),Time (months)
334,16,2,500,16
99,5,7,1750,26


## 5. Selecting model using TPOT
<p>TPOT is a Python Automated Machine Learning tool that optimizes machine learning pipelines using genetic programming.</p>
<p><img src="https://assets.datacamp.com/production/project_646/img/tpot-ml-pipeline.png" alt="TPOT Machine Learning Pipeline"></p>
TPOT will automatically explore hundreds of possible pipelines to find the best one for our dataset. Note, the outcome of this search will be a scikit-learn pipeline, meaning it will include any pre-processing steps as well as the model.

This means that I will obtain a pipeline that suggests a model together with the hyperparameter values that scored best during the search. In general, the more execution time we give TPOT, the more pipelines and hyperparameter combinations it can explore, which improves the chances of finding a better one.
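
As a rough illustration of what a scikit-learn pipeline is and how one candidate can be scored with cross-validated ROC AUC, here is a minimal sketch. It does not use TPOT. The synthetic data from `make_classification`, the choice of `StandardScaler` plus `LogisticRegression`, and the 5-fold setting are all assumptions made only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data (not the transfusion dataset).
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=42)

# A pipeline bundles pre-processing and the model into a single estimator.
candidate = make_pipeline(StandardScaler(), LogisticRegression())

# Score the candidate with 5-fold cross-validated ROC AUC.
cv_scores = cross_val_score(candidate, X_demo, y_demo, cv=5, scoring="roc_auc")
print(cv_scores.mean())
```

An automated search repeats this kind of evaluation over many candidate pipelines and keeps the best-scoring one.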

<b>NOTE: You will need to install TPOT, along with the libraries it depends on, before you can run the cell below.</b>

In [11]:
# Import TPOTClassifier and roc_auc_score
from tpot import TPOTClassifier
from sklearn.metrics import roc_auc_score

# Instantiate TPOTClassifier
tpot = TPOTClassifier(
    generations=5,
    population_size=20,
    verbosity=3,
    scoring='roc_auc',
    random_state=42,
    disable_update_check=True,
    config_dict='TPOT light',
)

tpot.fit(X_train, y_train)

# AUC score for tpot model
tpot_auc_score = roc_auc_score(y_test, tpot.predict_proba(X_test)[:, 1])
print(f'\nAUC score: {tpot_auc_score:.4f}')

# Print best pipeline steps
print('\nBest pipeline steps:', end='\n')
for idx, (name, transform) in enumerate(tpot.fitted_pipeline_.steps, start=1):
    # Print idx and transform
    print(f'{idx}. {transform}')


19 operators have been imported by TPOT.


_pre_test decorator: _random_mutation_operator: num_test=0 Input X must be non-negative.
_pre_test decorator: _random_mutation_operator: num_test=0 Found array with 0 feature(s) (shape=(50, 0)) while a minimum of 1 is required..
_pre_test decorator: _random_mutation_operator: num_test=0 Unsupported set of arguments: The combination of penalty='l1' and loss='logistic_regression' are not supported when dual=True, Parameters: penalty='l1', loss='logistic_regression', dual=True.
Generation 1 - Current Pareto front scores:
-1	0.7424354492343548	LogisticRegression(input_matrix, LogisticRegression__C=20.0, LogisticRegression__dual=False, LogisticRegression__penalty=l2)

_pre_test decorator: _random_mutation_operator: num_test=0 Input X must be non-negative.
_pre_test decorator: _random_mutation_operator: num_test=0 manhattan was provided as affinity. Ward can only work with euclidean distances..
Generation 2 - Current Pareto front scores:
-1	0.7424354492343548	LogisticRegression(input_matrix,

## 6. Variance Correction

If one or more variables have a much higher variance than the others (e.g. because their values are orders of magnitude larger), the model's performance could be significantly affected. To correct this, I will normalize the affected variables by applying a log transformation.

In [17]:
# X_train's variance, rounding the output to 3 decimal places
X_train.var().round(3)

Recency (months)              66.929
Frequency (times)             33.830
Monetary (c.c. blood)    2114363.700
Time (months)                611.147
dtype: float64

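We can now see that the variance of Monetary (c.c. blood) is orders of magnitude larger than that of the other variables. This is expected: the column records the total volume of blood donated in c.c., so its values are on a much larger scale than counts of donations or months (in the summary above it ranges from 250 to 12,500, while Frequency ranges from 1 to 50).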
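
To show why a log transformation helps with such a column, here is a small sketch on made-up volumes. The values in `volume` are hypothetical and chosen only to span a similar range.

```python
import numpy as np
import pandas as pd

# Hypothetical donated volumes in c.c. (not the project data).
volume = pd.Series([250, 500, 1000, 1750, 12500], dtype="float64")

# The natural log compresses large values far more than small ones.
volume_log = np.log(volume)

# Compare the variance before and after the transformation.
print(round(volume.var(), 3), round(volume_log.var(), 3))
```

The transformed values keep their order, but the spread between the smallest and largest value becomes much smaller.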

## 7. Log Normalization

In [18]:
import numpy as np

# Copy X_train and X_test into X_train_normed and X_test_normed
X_train_normed, X_test_normed = X_train.copy(), X_test.copy()

# Specify which column to normalize
col_to_normalize = "Monetary (c.c. blood)"

# Log normalization
for df_ in [X_train_normed, X_test_normed]:
    # Add log normalized column:
    df_['monetary_log'] = np.log(df_[col_to_normalize])
    # Drop the original column:
    df_.drop(columns= col_to_normalize , inplace=True)

# Check the variance for X_train_normed
X_train_normed.var().round(3)

Recency (months)      66.929
Frequency (times)     33.830
Time (months)        611.147
monetary_log           0.837
dtype: float64

In [19]:
# Now I will train a logistic regression model, based on the output produced by TPOT:

from sklearn import linear_model

# Instantiate LogisticRegression
logreg = linear_model.LogisticRegression()

# Train the model
logreg.fit(X_train_normed, y_train)

# AUC score for the logistic regression model
logreg_auc_score = roc_auc_score(y_test, logreg.predict_proba(X_test_normed)[:, 1])

print(f'\nAUC score: {logreg_auc_score:.4f}')


AUC score: 0.7891
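
For readers unfamiliar with the metric, the sketch below shows how `roc_auc_score` behaves on a tiny made-up example. The labels and scores are hypothetical and unrelated to the models above.

```python
from sklearn.metrics import roc_auc_score

# Hypothetical labels and predicted probabilities for the positive class.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# AUC is the share of (positive, negative) pairs ranked in the right order.
# Here 3 of the 4 pairs are ordered correctly: only (0.35, 0.4) is not.
auc = roc_auc_score(y_true, y_score)
print(auc)  # 0.75
```

A score of 0.5 corresponds to random ranking and 1.0 to a perfect ranking.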




## 8. Conclusion

In [26]:
# Importing itemgetter
from operator import itemgetter

# Sort models based on their AUC score from highest to lowest
sorted(
    [('tpot', tpot_auc_score), ('logreg', logreg_auc_score)],
    key=itemgetter(1),
    reverse=True
)


[('logreg', 0.7890972663699937), ('tpot', 0.7637476160203432)]

From the previous experiment I can conclude that TPOT provides a useful indication of which model may be most effective. That is a good starting point, because I can then improve the model further with appropriate transformations such as the log normalization performed earlier.
The logistic regression model trained on the log-normalized data reached a test AUC of 0.7891, compared with 0.7637 for the TPOT pipeline, which suggests that the normalization step slightly improved the model's performance on this test set.