### **Context of the research**

- In this project, we used TPOT to build a model that predicts whether a past donor is likely to donate again, which in turn indicates how much blood might be collected if a blood drive were held in a given region or city.


- We had a dataset with various attributes, including Recency (months since the last donation), Frequency, Monetary, and several others. We first used this data to test several machine learning algorithms, including logistic regression and random forest. We then used TPOT to see which method it selects automatically for this dataset.


- These predictions will help organisers choose the best location for a blood donation camp to collect the intended volume of blood.


- The project scope includes hospitals, healthcare networks, and other organisations.



### **Problem Statement**

Blood transfusions are a critical part of medical care, and the timely availability of blood is crucial for saving lives. However, managing the blood supply manually can be time-consuming and can cause delays with adverse consequences. Automating the blood transfusion supply chain through machine learning techniques can improve the efficiency of blood banks and health agencies in organizing blood donations.

The aim of this research is to predict the likelihood of individuals donating blood in future blood donation camps based on their blood donation history. To achieve this objective, we will use a publicly available dataset of blood donations and train multiple machine learning models to identify the factors that influence an individual's decision to donate blood. We will then optimize the parameters of the models and evaluate their performance using appropriate metrics.

#### **About the dataset**

- R (Recency): months since the last donation
- F (Frequency): total number of donations
- M (Monetary): total blood donated, in c.c.
- T (Time): months since the first donation
- Target: a binary variable representing whether the donor gave blood in March 2007 (1 = donated; 0 = did not donate)
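In the rows displayed in this notebook, Monetary always equals 250 times Frequency (for example, 50 donations and 12500 c.c.), which suggests the two columns carry the same information. The sketch below shows one way to check this. It uses a small hand-copied sample rather than the full CSV, and both the `sample` frame and the 250 c.c. unit are assumptions inferred from those rows.

```python
import pandas as pd

# Hand-copied sample of rows shown in this notebook (not the full dataset).
sample = pd.DataFrame({
    "Frequency": [50, 13, 16, 20, 24, 2, 3, 1],
    "Monetary": [12500, 3250, 4000, 5000, 6000, 500, 750, 250],
})

# Assumed unit: 250 c.c. per donation, inferred from the sample rows.
CC_PER_DONATION = 250

# True only if every row satisfies Monetary == 250 * Frequency.
is_redundant = bool((sample["Monetary"] == CC_PER_DONATION * sample["Frequency"]).all())

# Pearson correlation between the two columns.
correlation = sample["Frequency"].corr(sample["Monetary"])
print(is_redundant, round(correlation, 3))
```

If the same relationship holds on the full dataset, one of the two columns could be dropped without losing information.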

#### **Loading the dataset**

In [28]:
import pandas as pd

df = pd.read_csv('transfusion_data_with_name.csv',index_col='Sr.no')
df

| Sr.no | Name | Recency (months) | Frequency (times) | Monetary (c.c. blood) | Time (months) | Year | whether he/she donated blood in March 2007 | Location | Location_Name |
|---|---|---|---|---|---|---|---|---|---|
| 1 | Oluwatobiloba Goulding | 2 | 50 | 12500 | 98 | 2007 | 1 | 1 | Kitchner Hospital |
| 2 | Gianluca Herring | 0 | 13 | 3250 | 28 | 2007 | 1 | 1 | Kitchner Hospital |
| 3 | Antonia Almond | 1 | 16 | 4000 | 35 | 2007 | 1 | 1 | Kitchner Hospital |
| 4 | Sally Sloan | 2 | 20 | 5000 | 45 | 2007 | 1 | 3 | Victoria Hospital |
| 5 | Kelis Hirst | 1 | 24 | 6000 | 77 | 2007 | 0 | 2 | Civik Hospital |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 745 | Ioana Mccoy | 21 | 2 | 500 | 52 | 2007 | 0 | 3 | Victoria Hospital |
| 746 | Amirah Vang | 23 | 3 | 750 | 62 | 2007 | 0 | 1 | Kitchner Hospital |
| 747 | Brent Almond | 39 | 1 | 250 | 39 | 2007 | 0 | 1 | Kitchner Hospital |
| 748 | Elena Mccormack | 72 | 1 | 250 | 72 | 2007 | 0 | 1 | Kitchner Hospital |


The dataset has 749 rows and 9 columns.

In [29]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 749 entries, 1 to 749
Data columns (total 9 columns):
 #   Column                                      Non-Null Count  Dtype 
---  ------                                      --------------  ----- 
 0   Name                                        749 non-null    object
 1   Recency (months)                            749 non-null    int64 
 2   Frequency (times)                           749 non-null    int64 
 3   Monetary (c.c. blood)                       749 non-null    int64 
 4   Time (months)                               749 non-null    int64 
 5   Year                                        749 non-null    int64 
 6   whether he/she donated blood in March 2007  749 non-null    int64 
 7   Location                                    749 non-null    int64 
 8   Location_Name                               749 non-null    object
dtypes: int64(7), object(2)
memory usage: 58.5+ KB


From the `df.info()` output, we can see that there are no missing values in the dataset: every column has 749 non-null entries.
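The same check can be made explicitly instead of reading it off the non-null counts. The sketch below is a minimal example on a hypothetical `toy` frame with one deliberately missing value; on the real data the call would be made on `df`.

```python
import pandas as pd

# Hypothetical toy frame with one missing Recency value, for illustration only.
toy = pd.DataFrame({"Recency": [2, 0, None], "Frequency": [50, 13, 16]})

# Count missing values per column.
missing_per_column = toy.isna().sum()

# Single flag: does the frame contain any missing value at all?
has_missing = bool(toy.isna().any().any())

print(missing_per_column)
print(has_missing)
```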

In [30]:
df.describe()

|  | Recency (months) | Frequency (times) | Monetary (c.c. blood) | Time (months) | Year | whether he/she donated blood in March 2007 | Location |
|---|---|---|---|---|---|---|---|
| count | 749.0 | 749.0 | 749.0 | 749.0 | 749.0 | 749.0 | 749.0 |
| mean | 9.59012 | 5.508678 | 1377.169559 | 34.332443 | 2007.0 | 0.23765 | 2.045394 |
| std | 8.406069 | 5.837734 | 1459.433448 | 24.399368 | 0.0 | 0.425928 | 0.808095 |
| min | 0.0 | 1.0 | 250.0 | 2.0 | 2007.0 | 0.0 | 1.0 |
| 25% | 3.0 | 2.0 | 500.0 | 16.0 | 2007.0 | 0.0 | 1.0 |
| 50% | 7.0 | 4.0 | 1000.0 | 28.0 | 2007.0 | 0.0 | 2.0 |
| 75% | 14.0 | 7.0 | 1750.0 | 50.0 | 2007.0 | 0.0 | 3.0 |
| max | 74.0 | 50.0 | 12500.0 | 98.0 | 2007.0 | 1.0 | 3.0 |


In [31]:
# Rename the columns to shorter names.
# The 'Monetary (c.c. blood)' key must match the CSV header exactly, including the space.

df = df.rename(columns={'Recency (months)': 'Recency', 'Frequency (times)': 'Frequency', 'Monetary (c.c. blood)': 'Monetary', 'Time (months)': 'Time', 'whether he/she donated blood in March 2007': 'Donated'})
df

| Sr.no | Name | Recency | Frequency | Monetary | Time | Year | Donated | Location | Location_Name |
|---|---|---|---|---|---|---|---|---|---|
| 1 | Oluwatobiloba Goulding | 2 | 50 | 12500 | 98 | 2007 | 1 | 1 | Kitchner Hospital |
| 2 | Gianluca Herring | 0 | 13 | 3250 | 28 | 2007 | 1 | 1 | Kitchner Hospital |
| 3 | Antonia Almond | 1 | 16 | 4000 | 35 | 2007 | 1 | 1 | Kitchner Hospital |
| 4 | Sally Sloan | 2 | 20 | 5000 | 45 | 2007 | 1 | 3 | Victoria Hospital |
| 5 | Kelis Hirst | 1 | 24 | 6000 | 77 | 2007 | 0 | 2 | Civik Hospital |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 745 | Ioana Mccoy | 21 | 2 | 500 | 52 | 2007 | 0 | 3 | Victoria Hospital |
| 746 | Amirah Vang | 23 | 3 | 750 | 62 | 2007 | 0 | 1 | Kitchner Hospital |
| 747 | Brent Almond | 39 | 1 | 250 | 39 | 2007 | 0 | 1 | Kitchner Hospital |
| 748 | Elena Mccormack | 72 | 1 | 250 | 72 | 2007 | 0 | 1 | Kitchner Hospital |


In [33]:
# Check what proportion of donors donated in March 2007.

print('Donation proportions:\n')
print(round(df.Donated.value_counts(normalize = True) * 100,3))

Donation proportions:

0    76.235
1    23.765
Name: Donated, dtype: float64


From the target proportions, we can see that about 76.2% of donors did not donate in March 2007, while only about 23.8% did. The classes are therefore imbalanced.

To simplify the analysis, we keep only the numerical columns and remove the others.

In [34]:
new_df = df.copy()
new_df

| Sr.no | Name | Recency | Frequency | Monetary | Time | Year | Donated | Location | Location_Name |
|---|---|---|---|---|---|---|---|---|---|
| 1 | Oluwatobiloba Goulding | 2 | 50 | 12500 | 98 | 2007 | 1 | 1 | Kitchner Hospital |
| 2 | Gianluca Herring | 0 | 13 | 3250 | 28 | 2007 | 1 | 1 | Kitchner Hospital |
| 3 | Antonia Almond | 1 | 16 | 4000 | 35 | 2007 | 1 | 1 | Kitchner Hospital |
| 4 | Sally Sloan | 2 | 20 | 5000 | 45 | 2007 | 1 | 3 | Victoria Hospital |
| 5 | Kelis Hirst | 1 | 24 | 6000 | 77 | 2007 | 0 | 2 | Civik Hospital |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 745 | Ioana Mccoy | 21 | 2 | 500 | 52 | 2007 | 0 | 3 | Victoria Hospital |
| 746 | Amirah Vang | 23 | 3 | 750 | 62 | 2007 | 0 | 1 | Kitchner Hospital |
| 747 | Brent Almond | 39 | 1 | 250 | 39 | 2007 | 0 | 1 | Kitchner Hospital |
| 748 | Elena Mccormack | 72 | 1 | 250 | 72 | 2007 | 0 | 1 | Kitchner Hospital |


In [35]:
# Drop the non-numerical columns.
new_df=new_df.drop(['Name', 'Location_Name'], axis=1)
new_df

| Sr.no | Recency | Frequency | Monetary | Time | Year | Donated | Location |
|---|---|---|---|---|---|---|---|
| 1 | 2 | 50 | 12500 | 98 | 2007 | 1 | 1 |
| 2 | 0 | 13 | 3250 | 28 | 2007 | 1 | 1 |
| 3 | 1 | 16 | 4000 | 35 | 2007 | 1 | 1 |
| 4 | 2 | 20 | 5000 | 45 | 2007 | 1 | 3 |
| 5 | 1 | 24 | 6000 | 77 | 2007 | 0 | 2 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 745 | 21 | 2 | 500 | 52 | 2007 | 0 | 3 |
| 746 | 23 | 3 | 750 | 62 | 2007 | 0 | 1 |
| 747 | 39 | 1 | 250 | 39 | 2007 | 0 | 1 |
| 748 | 72 | 1 | 250 | 72 | 2007 | 0 | 1 |
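The `df.describe()` output earlier reports a standard deviation of 0 for `Year`, so that column is constant and cannot help any model separate donors from non-donors. The sketch below shows one possible way to detect and drop such columns. The `toy` frame is a hypothetical stand-in for `new_df`, and this step is a suggestion rather than part of the original analysis.

```python
import pandas as pd

# Hypothetical stand-in for new_df with a constant Year column.
toy = pd.DataFrame({
    "Recency": [2, 0, 1, 2],
    "Frequency": [50, 13, 16, 20],
    "Year": [2007, 2007, 2007, 2007],
    "Donated": [1, 1, 1, 0],
})

# A column with a single unique value carries no information.
constant_cols = [col for col in toy.columns if toy[col].nunique() == 1]

# Drop the constant columns before modelling.
reduced = toy.drop(columns=constant_cols)
print(constant_cols, list(reduced.columns))
```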


## **Problem Solution using AutoML tool (TPOT)**

### **Implementation of TPOT**

**What is TPOT?**

TPOT is an automated machine learning (AutoML) tool that uses genetic programming to search for the best machine learning pipeline for a given dataset. 

<p><img src="https://raw.githubusercontent.com/EpistasisLab/tpot/master/images/tpot-ml-pipeline.png" alt="TPOT Machine Learning Pipeline"></p>
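To give a feel for what an evolutionary search does, the toy sketch below evolves a single integer through selection and mutation. It is not TPOT's algorithm: TPOT evolves whole pipelines and scores them with cross-validation, whereas here the `fitness` function, the target value 18, and the `evolve` helper are all hypothetical stand-ins chosen for illustration.

```python
import random


def fitness(candidate):
    # Toy objective standing in for a cross-validated score; it peaks at 18.
    return -abs(candidate - 18)


def evolve(generations=5, population_size=20, seed=42):
    rng = random.Random(seed)
    # Start from a random population of candidate values.
    population = [rng.randint(1, 50) for _ in range(population_size)]
    for _ in range(generations):
        # Selection: keep the better half of the population.
        population.sort(key=fitness, reverse=True)
        parents = population[: population_size // 2]
        # Mutation: each parent produces a child nudged by a small random step.
        children = [max(1, p + rng.randint(-3, 3)) for p in parents]
        population = parents + children
    # Return the fittest candidate found.
    return max(population, key=fitness)


best = evolve()
print(best, fitness(best))
```

Because the best parents are always carried over, the best score can never get worse from one generation to the next, which is the same pattern visible in TPOT's "Current best internal CV score" log.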

In [36]:
#Installing TPOT

!pip install tpot





In [37]:
#Importing required libraries

import pandas as pd
from sklearn.model_selection import train_test_split
from tpot import TPOTClassifier
from sklearn.metrics import roc_auc_score

In [39]:
X_train, X_test, y_train, y_test = train_test_split(new_df.drop('Donated', axis=1), 
                                                    new_df['Donated'], 
                                                    test_size=0.2, 
                                                    random_state=42)
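Because only about a quarter of the donors are in the positive class, a plain random split can leave the test set with a different class mix from the training set. The sketch below shows the `stratify` option of `train_test_split` on synthetic labels. The 75/25 split of labels is an assumption chosen to approximate the imbalance reported above, and stratification was not used in the original run.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels approximating the observed imbalance (assumed 75% / 25%).
y = np.array([0] * 75 + [1] * 25)
X = np.arange(100).reshape(-1, 1)

# stratify=y keeps the class proportions the same in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print(y_tr.mean(), y_te.mean())
```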

In [80]:
tpot = TPOTClassifier(generations=5, 
                      population_size=20, 
                      cv=8,
                      verbosity=2, 
                      random_state=42)

tpot.fit(X_train, y_train)



Generation 1 - Current best internal CV score: 0.7930405405405405

Generation 2 - Current best internal CV score: 0.7930405405405405

Generation 3 - Current best internal CV score: 0.7930405405405405

Generation 4 - Current best internal CV score: 0.7930405405405405

Generation 5 - Current best internal CV score: 0.7946396396396396

Best pipeline: RandomForestClassifier(ExtraTreesClassifier(input_matrix, bootstrap=True, criterion=gini, max_features=0.7000000000000001, min_samples_leaf=18, min_samples_split=4, n_estimators=100), bootstrap=True, criterion=gini, max_features=0.5, min_samples_leaf=18, min_samples_split=19, n_estimators=100)


In [81]:
tpot_auc_score = roc_auc_score(y_test, tpot.predict_proba(X_test)[:, 1])
print(f'\nAUC score: {tpot_auc_score:.4f}')

print('\nBest pipeline steps:', end='\n')
for idx, (name, transform) in enumerate(tpot.fitted_pipeline_.steps, start=1):
    print(f'{idx}. {transform}')


AUC score: 0.6745

Best pipeline steps:
1. StackingEstimator(estimator=ExtraTreesClassifier(bootstrap=True,
                                                 max_features=0.7000000000000001,
                                                 min_samples_leaf=18,
                                                 min_samples_split=4,
                                                 random_state=42))
2. RandomForestClassifier(max_features=0.5, min_samples_leaf=18,
                       min_samples_split=19, random_state=42)




In [82]:
#Exporting the pipeline
tpot.export('tpot_transfusion_pipeline.py')

In [83]:
#Printing the accuracy of TPOT
accuracy = tpot.score(X_test, y_test)
print(f"Accuracy: {accuracy}")

Accuracy: 0.78




### **Machine Learning Predictions & Outcomes**

### **Reference Papers reviewed so far and comparison with outcomes we got using AutoML**

**Research Paper 1**
- **A survey on machine learning algorithms for the blood donation supply chain: https://iopscience.iop.org/article/10.1088/1742-6596/1362/1/012124/pdf**

    - **Overview:** This paper surveys machine learning techniques used in blood donation supply chain management. It presents the advantages and disadvantages of each technique to help stakeholders in the healthcare industry choose the approach best suited to their needs. The paper first describes the use of a logistic regression model, followed by artificial neural networks (ANN) and decision trees, which produced models with similar or better performance. A multilayer perceptron trained with backpropagation was also used to classify donor groups, and ANN was found to give an improved model. Finally, the paper compares SVM and ANN, highlighting the characteristics of least squares SVM. Overall, it concludes that machine learning techniques such as ANN, SVM, and decision trees are essential for creating a reliable and fair system for locating and contacting potential blood donors.
    
**Research Paper 2**
- **Knowledge discovery on RFM model using Bernoulli sequence: https://www.sciencedirect.com/science/article/abs/pii/S0957417408004508?via%3Dihub**

    - **Overview:** The research paper proposes and evaluates a novel approach for customer segmentation and churn prediction in the blood donation industry. The proposed approach combines the RFM model with a Bernoulli sequence-based method. The study shows that this approach is effective in identifying donors with distinct donation behavior patterns and outperforms traditional methods in predicting donor churn. The conclusion suggests that the proposed approach has the potential to improve donor behavior and churn prediction accuracy in the industry.
    
**Research Paper 3**
- **Predicting blood donations using machine learning techniques: http://matthewalanham.com/Students/2017_MWDSI_Final_Bahel.pdf**

    - **Overview:** This paper aims to identify the most effective machine learning technique for accurately predicting the number of donors who meet the criteria for blood donation, given the increasing number of transfusions resulting from accidents, illnesses, and surgeries. The researchers employed various classification techniques, including SVM, ANN, CART, C5.0, Logistic Regression, Logit (Bagging & Boosting), LDA, and Random Forest. The logistic regression model with 5-fold cross-validation demonstrated the highest performance, but sensitivity was prioritized over other metrics, leading to a recommendation for a clustered SVM model with 98.4% sensitivity. The paper thus provides insights into machine learning techniques that can be used to identify potential blood donors.
    


**What did the research tell you? What ended up being your goal/solution?**

TPOT suggests a two-step pipeline as the best model for this dataset: a StackingEstimator wrapping an ExtraTreesClassifier, followed by a RandomForestClassifier. On the test set, this pipeline reaches an accuracy of 78%. In our initial study (Assignment 1), we chose the Random Forest classifier as our preferred model.

This study indicates that a Random Forest based model can predict with about 78% accuracy whether a donor is likely to donate again. Based on location and donor history, organisers can target specific areas and times to hold blood donation camps, helping to automate and strengthen the blood transfusion supply chain. After updating and experimenting with different parameters, we expect to keep using Random Forest in future work.

**Does your prediction match with expectation using your designated AutoML, thus far? What were research results? Do you have results of your own yet?**

The model we chose in the initial study is also part of the pipeline suggested by TPOT; however, the model needs further improvement. We expected somewhat better results than those obtained with TPOT (an AUC score of 0.6745 and an accuracy of 78%).
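One way to put these numbers in context is to compare them with a trivial baseline. Since about 76.2% of donors are in the "did not donate" class, a model that always predicts 0 already scores roughly 76% accuracy, so 78% is only a small gain, and AUC is the more informative metric here. The sketch below illustrates this with scikit-learn's `DummyClassifier` on synthetic labels. The 76/24 label counts are an assumption that mirrors the proportions reported earlier, not the real test set.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, roc_auc_score

# Synthetic labels mirroring the reported class proportions (assumed 76 / 24).
y = np.array([0] * 76 + [1] * 24)
X = np.zeros((100, 1))  # Features are irrelevant for this baseline.

# Baseline that always predicts the most frequent class.
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)

baseline_accuracy = accuracy_score(y, baseline.predict(X))
baseline_auc = roc_auc_score(y, baseline.predict_proba(X)[:, 1])

print(f"Baseline accuracy: {baseline_accuracy:.2f}")
print(f"Baseline AUC: {baseline_auc:.2f}")
```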

**What are you trying to do to meet your end goal/solution?**

The following steps could help us reach our end goal:

- We can preprocess the data to remove any outliers that are found.
- We can balance the dataset so that both classes have similar counts.
- We can further fine-tune the hyperparameters to improve the model's performance.
- We can follow the pipeline suggested by TPOT.
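Two of the steps above, balancing the classes and tuning hyperparameters, can be sketched together. The example below uses `class_weight="balanced"` as one possible way to handle imbalance (resampling would be another) and a small `GridSearchCV` scored by AUC. The synthetic data from `make_classification` and the grid values are assumptions for illustration, loosely based on the values TPOT reported, and would need to be replaced with the real training data and a wider grid.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic, imbalanced stand-in for the training data (assumed ~76% / 24%).
X, y = make_classification(
    n_samples=300,
    n_features=4,
    n_informative=3,
    n_redundant=1,
    weights=[0.76, 0.24],
    random_state=42,
)

# Small illustrative grid; the values are assumptions, not tuned results.
param_grid = {
    "min_samples_leaf": [5, 18],
    "max_features": [0.5, 0.7],
}

# class_weight="balanced" reweights samples inversely to class frequency.
forest = RandomForestClassifier(
    n_estimators=50, class_weight="balanced", random_state=42
)

# Score by ROC AUC, which is less sensitive to class imbalance than accuracy.
search = GridSearchCV(forest, param_grid, scoring="roc_auc", cv=3)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 4))
```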

### **Data Visualization**

![Blood Transfusion](image.png)