<a href="https://colab.research.google.com/github/glgunderson/INFOB2DA-PA3/blob/main/pa3_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Classification Methods and Model Evaluation**
## Practical Assignment 3 - INFOB2DA
### *Tobias Buiten, Emil Flindt, Grace Gunderson*

# Task 0 - Setup Environment



In [2]:
# Import Relevant Libraries
import pandas as pd
import plotly.express as px
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import classification_report
from sklearn.metrics import fbeta_score
from sklearn.model_selection import KFold

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)  # Suppress all FutureWarning messages


In [3]:
url = "https://raw.githubusercontent.com/glgunderson/INFOB2DA-PA3/main/blood_transfusion.csv"
df = pd.read_csv(url)

# Display dataframe
df.head()

Unnamed: 0,months_since_last_donation,total_number_of_donations,total_blood_donated,months_since_first_donation,class
0,2.0,50.0,12500.0,98.0,1
1,0.0,13.0,3250.0,28.0,1
2,1.0,16.0,4000.0,35.0,1
3,2.0,20.0,5000.0,45.0,1
4,1.0,24.0,6000.0,77.0,0


# Task 1 - Get dataset on screen

## Task 1.1 - Explore dataset

### Basic Statistics


In [4]:
# Display Basic Statistics:
df.describe()

Unnamed: 0,months_since_last_donation,total_number_of_donations,total_blood_donated,months_since_first_donation,class
count,748.0,748.0,748.0,748.0,748.0
mean,9.506684,5.514706,1378.676471,34.282086,0.237968
std,8.095396,5.839307,1459.826781,24.376714,0.426124
min,0.0,1.0,250.0,2.0,0.0
25%,2.75,2.0,500.0,16.0,0.0
50%,7.0,4.0,1000.0,28.0,0.0
75%,14.0,7.0,1750.0,50.0,0.0
max,74.0,50.0,12500.0,98.0,1.0


### Dataset Features
- **’months_since_last_donation’** — Number of months since the donor’s last donation
- **’total_number_of_donations’** — Total number of donations the person has made
- **’total_blood_donated’** — Total volume of blood donated (in milliliters)
- **’months_since_first_donation’** — Number of months since the donor’s first donation
- **’class’** — categorical (binary) — 1 = returned to donate, 0 = did not return


###  Dataset Overview
The dataset contains **748 records**.
- All four input features are numeric, while ‘class’ is a binary categorical feature representing whether a donor returned to donate blood the following month (‘1’) or not (‘0’).
- On average, donors have donated **5-6 times**.
- The average **time since last donation** is approximately **9.5 months**, with values ranging from 0 to 74 months since last donation.
- The average **time since first donation** is around **34 months** (about 3 years).
- The **standard deviations** across features indicate noticeable variation between donors, meaning donation habits differ widely in timing and frequency.

No missing values or irregular entries are observed, so the dataset appears clean and ready for preprocessing.

### Exploring Feature Relationships

In [5]:
# Scatter Plot - Total Blood Donated vs. Months Since Last Donation
px.scatter(
    data_frame = df,
    y = df['total_blood_donated'],
    x = df['months_since_last_donation'],
    )

### Total Blood Donated vs. Months Since Last Donation
The scatterplot visualizes the relationship between **’months_since_last_donation’** (x-axis) and **‘total_blood_donated’** (y-axis).


**Observations:**
- Most data points are concentrated between **0 and 24 months** since last donation, showing that the majority of donors last donated within the past 2 years.
- A smaller number of records extend beyond **30 months**, representing donors with longer intervals between donations.
- A few donors also donated **very high total blood volumes** (above **8,000 ml**), appearing clearly above the main cluster of data points.
- These figures are valid data points but represent **rare outliers**: individuals who donate more frequently and therefore accumulate higher blood donation totals.
- Conversely, records with **one month or less since last donation** were also identified as atypical, since most donors do not return that frequently.
- Donors with **shorter intervals between donations** tend to have **higher total blood donated**, indicating a mild inverse relationship between months since last donation and total blood volume donated.
- Overall, there does not appear to be a strong linear trend, but a weak pattern suggests that more frequent donors generally accumulate higher donation totals.


This scatterplot revealed a small number of **extreme values** that appeared to skew the dataset.

Because these rare or infrequent cases differ from the main distribution, they will be addressed in **Task 2 – Preprocessing** to ensure that all subsequent analyses accurately reflect typical donor behavior.


## Task 1.2 - Visualizing Class Differences

### **Summary**

The comparison of averages across both classes shows consistent behavioral differences:
- Returning donors have **donated more recently** (‘months_since_last_donation’), **more frequently** (‘total_number_of_donations’), and **in greater total volume** (‘total_blood_donated’).
- The overall **donor history duration** (‘months_since_first_donation’) remains similar between groups.

*These class-level distinctions will be preserved in subsequent preprocessing and visualizations, while outliers identified will be addressed to ensure the dataset is accurately analyzed for meaningful insights on donor behavior.*


In [6]:
# Prepare dataframe for visualization:
df_mean = df.groupby(['class'],as_index=False).mean().copy()
df_mean['class'] = df_mean['class'].map({0: "Did not return", 1: "Returned"})

In [7]:
# classes vs. months_since_last_donation
fig1 = px.bar(
    data_frame = df_mean,
    x = df_mean['class'],
    y = df_mean['months_since_last_donation'],
    color = df_mean['class'],
    labels={'months_since_last_donation':'average (#months)'},
    title="Average number of months since last donation",
    color_discrete_sequence=['steelblue','darkred']
    )

fig1.update_xaxes(type='category')

fig1.show()

### Average Months Since Last Donation
- Donors who **did not return** (‘0’) have a higher average number of months since their last donation (around 10 months).
- Donors who **returned** (‘1’) donated more recently, averaging about 6 months since their last donation.
- Overall, donors with **longer intervals since their last donation** (‘months_since_last_donation’) appear to be **less likely to return**.


In [8]:
# classes vs. total_number_of_donations
fig2 = px.bar(
    data_frame = df_mean,
    x = df_mean['class'],
    y = df_mean['total_number_of_donations'],
    color = df_mean['class'],
    labels={'total_number_of_donations':'average (#Donations)',},
    title="Average number of donations",
    color_discrete_sequence=['steelblue','darkred']
    )

fig2.update_xaxes(type='category')

fig2.show()

### Average Total Number of Donations
- Returning donors have a slightly higher average number of total donations than donors who do not return.
- Donors who return are generally **more active donors** with greater overall participation in blood donation.
- While the difference is moderate, it remains consistent across both donor groups.


In [9]:
# classes vs. total_blood_donated
fig3 = px.bar(
    data_frame = df_mean,
    x = df_mean['class'],
    y = df_mean['total_blood_donated'],
    color = df_mean['class'],
    labels={'total_blood_donated':'average (#mL)'},
    title="Average Amount of Blood Donated (in milliliters)",
    color_discrete_sequence=['steelblue','darkred']
    )

fig3.update_xaxes(type='category')

fig3.show()

### Average Total Blood Donated (ml)
- Since each donation corresponds to **250 ml** of blood donated, this bar chart mirrors the trend observed in total number of donations.
- Returning donors have donated a **larger total volume of blood on average**, reflecting the higher frequency of donations over time.

This confirms that **donation frequency and total blood volume donated** are closely related indicators of overall donor engagement.


In [10]:
# classes vs. months_since_first_donation
fig4 = px.bar(
    data_frame = df_mean,
    x = df_mean['class'],
    y = df_mean['months_since_first_donation'],
    color = df_mean['class'],
    labels={'months_since_first_donation':'average (#Months)',},
    title="Average Months Since First Donation",
    color_discrete_sequence=['steelblue','darkred']
    )

fig4.update_xaxes(type='category')

fig4.show()

### Average Months Since First Donation
- The average months since first donation are **similar across both classes**, suggesting that donor tenure is not a major differentiator.
- Both groups appear to have begun donating within a comparable timeframe.


# Task 2 - Preprocessing

In [11]:
# Manual removal of extreme outliers. (This is only doable because dataset is small)
# We have removed less than 4% of the dataset
df = df[df['total_blood_donated'] <= 8000]
df = df[df['months_since_last_donation'] <= 30]
df = df[df['months_since_last_donation'] > 1]
df.describe()

Unnamed: 0,months_since_last_donation,total_number_of_donations,total_blood_donated,months_since_first_donation,class
count,719.0,719.0,719.0,719.0,719.0
mean,9.351878,5.029207,1257.301808,33.182197,0.233658
std,6.999998,4.34552,1086.38008,23.569288,0.423452
min,2.0,1.0,250.0,2.0,0.0
25%,3.0,2.0,500.0,15.5,0.0
50%,8.0,4.0,1000.0,28.0,0.0
75%,14.0,7.0,1750.0,48.0,0.0
max,26.0,24.0,6000.0,98.0,1.0


## Outlier Removal
- The scatterplot in Task 1 and the 3-D scatter visualization below revealed a small number of **extreme values** that distorted the data distribution and appeared to skew the dataset.

Before filtering, several data points extended far beyond the main cluster:
- Specifically, several donors had unusually high blood volumes donated with **total blood donated greater than 8,000 ml** or long intervals since their last donations with **months since last donation greater than 30 months**.
- A few records also showed **1 month or less** since the previous donation, which is not typical given the average interval between donations.

*While these records are valid, they represent **rare cases** compared to the general donor population and could disproportionately influence later visualizations and distance-based algorithms.*

In the 3-D scatter, these outliers stretched the plot and made overall patterns impossible to interpret, prompting a closer review of the original scatterplot to identify the cause.


## Filtering
**Filtering criteria applied:**
- **total_blood_donated ≤ 8,000 ml**
- **1 < months_since_last_donation ≤ 30**


After removal, **719 records** remain—less than 4% of the data were excluded from subsequent data analysis.
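The share of excluded records can be verified with a quick calculation. This is a minimal sketch that uses only the record counts reported by the two ‘df.describe()’ outputs; the variable names are illustrative and not part of the notebook.

```python
# Record counts taken from the df.describe() outputs before and after filtering
records_before = 748
records_after = 719

removed = records_before - records_after
removed_share = removed / records_before

print(f"Removed {removed} records ({removed_share:.2%} of the data)")
```

The result confirms that 29 records were dropped, which is below the 4% threshold stated above.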

**Observations after filtering:**
- The dataset now displays a more compact, interpretable distribution in 3-D space.
- Averages decreased slightly, and standard deviations narrowed across all features, confirming reduced variability.

These steps preserve the previously observed class-level distinctions while removing rare or atypical donation behaviors that could skew later visualizations and analysis.


## Normalization

In [12]:
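# Log normalization
def log_normalization(column: pd.Series) -> pd.Series:
    # Use distinct names so the built-ins min() and max() are not shadowed
    col_min = column.min()
    col_max = column.max()

    # Small offset so that log(0) is never evaluated
    offset = 0.1

    normalized = (np.log(column + offset) - np.log(col_min + offset)) / (np.log(col_max + offset) - np.log(col_min + offset))

    return normalized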

# Remove the target feature from preprocessing
df_norm = df.drop(columns=['class'])

for c in df_norm.columns:
  df_norm[c] = log_normalization(df_norm[c])

df_norm['class'] = df['class']

df_norm.describe()

Unnamed: 0,months_since_last_donation,total_number_of_donations,total_blood_donated,months_since_first_donation,class
count,719.0,719.0,719.0,719.0,719.0
mean,0.464694,0.388831,0.395259,0.629006,0.233658
std,0.342161,0.272803,0.274327,0.250187,0.423452
min,0.0,0.0,0.0,0.0,0.0
25%,0.15455,0.209474,0.218068,0.521538,0.0
50%,0.535686,0.426213,0.436167,0.674765,0.0
75%,0.75565,0.604096,0.612262,0.814595,0.0
max,1.0,1.0,1.0,1.0,1.0


### Log Normalization
To align the feature scales, **log normalization** was applied to each numerical column (i.e., excluding ‘class’).
- This transformation compresses large numeric ranges (e.g., 2 vs. 98 ‘months_since_first_donation’) and rescales all feature values between 0 and 1 while preserving their relative ordering.
- In this context, normalization reduces the influence of attributes measured on different scales (e.g., months vs. milliliters) and ensures that each contributes comparably to later distance-based algorithms.

**Observations after normalization:**
- All features now share a comparable scale (‘min = 0’, ‘max = 1’).
- Mean values between **0.39 and 0.63** indicate that feature distributions are evenly balanced, without clustering too heavily at either end of the range.
- Lower standard deviations show that the spread of values across features is more uniform, preventing any single feature from dominating the distance calculations used in subsequent algorithms (e.g., KNN and SVM).

Although normalization reduces the influence of features with large numeric ranges, **outlier removal remains necessary** as extremely large or rare values can still distort relative distances even on a normalized scale. Filtering these outliers improves the accuracy of subsequent analyses and classification models.

*The resulting dataframe, ‘df_norm’, retains the target variable ‘class’ alongside the normalized numerical features. It provides a consistent, balanced input for subsequent visualization and classification.*
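The transformation can be traced by hand for a single value. The sketch below applies the same formula as ‘log_normalization’ with the standard library only; the helper name ‘log_normalize_value’ is hypothetical, and the minimum, median, and maximum are read from the filtered ‘df.describe()’ output for ‘months_since_first_donation’.

```python
import math

def log_normalize_value(value, col_min, col_max, c=0.1):
    """Log-scale a single value to [0, 1] given the column's min and max."""
    low = math.log(col_min + c)
    high = math.log(col_max + c)
    return (math.log(value + c) - low) / (high - low)

# months_since_first_donation after filtering: min = 2, median = 28, max = 98
median_scaled = log_normalize_value(28, col_min=2, col_max=98)
print(round(median_scaled, 4))
```

The scaled median should agree with the 50% entry for this column in the normalized ‘describe()’ table, while the column minimum and maximum map to 0 and 1.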


## Dimensionality Reduction

In [13]:
# Plot highlighting that the two attributes are the same. (They are proportional, because donation sizes are constant)
fig5 = px.scatter(
    data_frame= df,
    y = df['total_blood_donated'],
    x = df['total_number_of_donations'],
    title="Total Donations vs. Amount of Blood Donated (mL)",
    color_discrete_sequence=['darkred']
)
fig5.show()

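# Ratio of volume to donation count; a standard deviation of 0 means the ratio is constant.
# A local variable is used so that no extra column is added to df.
ratio = df['total_blood_donated'] / df['total_number_of_donations']
print(ratio.std())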

0.0


### Feature Dependence and Dimensionality Reduction
A scatterplot comparing **‘total_number_of_donations’** and **‘total_blood_donated’** confirmed that these two features are **perfectly proportional**, since ‘total_blood_donated’ increases linearly by **250 ml** per donation.
- This exact linear relationship means the two features convey **identical information**: one is measured as a donation count, the other as the total volume of blood donated.
- Keeping both features would introduce **redundancy**, where two variables describe the same aspect of donor behavior without adding explanatory value.
- Because ’total_blood_donated’ adds no new information beyond what ’total_number_of_donations’ already provides, it was removed from the dataset before further analysis.

In data analysis, each feature should contribute **distinct, non-redundant information** about the dataset.
- Removing redundant variables reduces noise, prevents duplication of influence in algorithms, and improves computational efficiency.

This **dimensionality reduction** step ensures that the remaining features represent **meaningful, independent aspects of donor behavior**, making subsequent analyses more interpretable and reliable.
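The proportionality can also be checked on a handful of rows without pandas. This sketch uses the five records displayed by ‘df.head()’ and is only an illustration; the notebook itself performs the check on the full dataframe.

```python
# First five rows shown by df.head(): (total_number_of_donations, total_blood_donated)
rows = [(50, 12500), (13, 3250), (16, 4000), (20, 5000), (24, 6000)]

# Collect the distinct volume-per-donation ratios; one value means exact proportionality
ml_per_donation = {volume / count for count, volume in rows}
print(ml_per_donation)
```

A single distinct ratio of 250 ml per donation means one column can be reconstructed exactly from the other.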


## 3-D Scatter

In [14]:
df_prep = df_norm.drop(columns=['total_blood_donated'])
#df_prep['class'] = df['class'].astype(str)
px.scatter_3d(
    data_frame=df_prep,
    x = 'total_number_of_donations',
    y = 'months_since_first_donation',
    z = 'months_since_last_donation',
    color=df['class'].astype(str),
    color_discrete_sequence=['darkred','steelblue']
    )

#### 3-D Scatter AFTER dimensionality-reduction:

*With the removal of redundant features, the 3-D scatter visualization (using normalized ‘months_since_first_donation’, ‘months_since_last_donation’, and ‘total_number_of_donations’) displays a clearer, more balanced spatial distribution of data points.*

In [15]:
# This plot should perhaps be removed. Stacked points are very deceptive
px.scatter(
    data_frame=df_prep,
    y = 'months_since_first_donation',
    x = 'months_since_last_donation',
    color='total_number_of_donations',
    color_continuous_scale='RdBu'
)

# Task 3 - Creating a train and test set


In [16]:
# First split "X" has 20% of data as test size:
X_train, X_test = train_test_split(
    df_prep, test_size=0.20, random_state=42)

### Split 1 — 80 / 20 (Train / Test)
- 80% of the data are used for model training, and 20% are reserved for testing.
- This is a **standard configuration** that prioritizes more data for learning while keeping a sufficient portion of the dataset (20%) for evaluation.


In [17]:
# Second split "Y" has 40% of data as test size:
Y_train, Y_test = train_test_split(
    df_prep, test_size=0.40, random_state=42)

### Split 2 — 60 / 40 (Train / Test)
- A larger test set provides a broader view of **generalization performance** but leaves fewer samples for training.
- This second split allows observation of how reduced training data may affect subsequent classification performance.


### Implementation
- The test sizes are based on the assumption that it is best to prioritize a larger training set.
- The argument ‘random_state = 42’ ensures **reproducibility**, meaning the same samples are selected each time the code is run.
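The effect of a fixed seed can be illustrated with the standard library. This is a stand-in sketch only: it uses Python's ‘random’ module rather than scikit-learn's generator, so it will not select the same rows as ‘train_test_split’, and the helper ‘shuffled_indices’ is hypothetical.

```python
import random

def shuffled_indices(n_records, seed):
    """Return the indices 0..n_records-1 in a seed-determined random order."""
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)
    return indices

first_run = shuffled_indices(719, seed=42)
second_run = shuffled_indices(719, seed=42)
other_seed = shuffled_indices(719, seed=7)

print(first_run == second_run)   # same seed, same order
print(first_run == other_seed)   # different seed, different order
```

Rerunning with the same seed reproduces the ordering, which is the property that makes the splits comparable between runs.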


In [18]:
print(f"X_train size: {len(X_train)}")
print(f"X_test size: {len(X_test)}")
print(f"Y_train size: {len(Y_train)}")
print(f"Y_test size: {len(Y_test)}")


X_train size: 575
X_test size: 144
Y_train size: 431
Y_test size: 288


### Interpretation
- For the **80 / 20** configuration, 575 records (‘X_train = 575’) are used for training and 144 records (‘X_test = 144’) for testing.
- For the **60 / 40** configuration, 431 records (‘Y_train = 431’) are used for training and 288 records (‘Y_test = 288’) for testing.

*The printed outputs confirm that the resulting dataset sizes align with the intended 80 / 20 and 60 / 40 split ratios.*
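The printed sizes can be reproduced arithmetically. The sketch below assumes that the test set size is rounded up and the training set receives the remaining records; this matches the sizes printed above, but the helper ‘split_sizes’ is illustrative and not a scikit-learn function.

```python
import math

n_records = 719  # rows remaining after outlier removal

def split_sizes(n_samples, test_size):
    """Assumed rounding: test set rounded up, training set gets the rest."""
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test

print(split_sizes(n_records, 0.20))  # (train, test) for the X-split
print(split_sizes(n_records, 0.40))  # (train, test) for the Y-split
```

Because 719 is not divisible by 5, the realized proportions are approximately, not exactly, 80 / 20 and 60 / 40.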

### Conclusion
These splits ensure that model performance can be compared under two distinct experimental conditions:
- One emphasizing **more data for learning** (80 / 20) and the other emphasizing **more data for testing and generalization** (60 / 40).

*Both configurations will be used in the following classification tasks to evaluate how training set size impacts model accuracy and generalization.*


# Task 4 - Classification algorithms


## Task 4.1 - Manually implement KNN classifier

Before implementing a KNN classifier algorithm, we first have to define a metric for determining distances. Since our dataset has 3 distinct features after preprocessing, we define the Euclidean norm in 3 dimensions.

In [19]:
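# First we need a metric:
def custom_euclidean(x, y):
    # Compare only the first three entries (the features); a trailing 'class' entry is ignored.
    # Converting to arrays avoids positional [] lookups on label-indexed Series.
    x = np.asarray(x, dtype=float)[:3]
    y = np.asarray(y, dtype=float)[:3]
    return np.sqrt(np.sum((x - y) ** 2))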

The following implementation finds the K nearest neighbours of a given point based on Euclidean distance in 3 dimensions. The point is then assigned the class most represented amongst those K neighbours.


This is done by looping over all other points and storing the distance to each. We then find the labels associated with the K smallest distances.


The most represented label is chosen.
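The procedure described above can be sketched compactly in plain Python. This is an illustrative variant, not the notebook's ‘K_nearest’: it sorts the distances instead of repeatedly extracting the minimum, it does not skip rows identical to the query, and the function name ‘knn_predict’ and the toy data are made up.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k):
    """Classify `query` by majority vote among its k nearest training points."""
    # Pair every training point's distance to the query with its label
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Sort ascending so the k smallest distances come first
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    # Same tie rule as the notebook: class 1 needs a strict majority
    return 1 if votes[1] > votes[0] else 0

# Toy data (made up for illustration): three normalized features per point
train_points = [(0.1, 0.2, 0.1), (0.2, 0.1, 0.2), (0.8, 0.9, 0.8), (0.9, 0.8, 0.9)]
train_labels = [0, 0, 1, 1]

print(knn_predict(train_points, train_labels, query=(0.85, 0.85, 0.85), k=3))
```

With an even K, a tied vote falls back to class 0, which is worth keeping in mind when most training records already belong to class 0.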


In [20]:
# This function takes the following input
# - A dataframe
# - A point to predict a class for based on distances in the chosen dataframe
# - An integer K representing the amount of points in regarded neighborhoods

def K_nearest(dataframe, point, K):

  distance_to = [float('inf')] * len(dataframe)
  count_0 = 0
  count_1 = 0

  # Loop through all points
  for i in range(len(dataframe)):

    other_point = dataframe.iloc[i]

    if point.equals(other_point):
      continue

    distance_to[i] = custom_euclidean(point, other_point)

  # Collect K nearest:
  for n in range(K):
    minimum = min(distance_to)
    nearest_index = distance_to.index(minimum)

    nearest_point = dataframe.iloc[nearest_index]

    # Determine Class
    if nearest_point['class'] == 1:
      count_1 += 1
    if nearest_point['class'] == 0:
      count_0 += 1

    distance_to[nearest_index] = float('inf')

  # Find most common class amongst K neighbors (a tie is resolved as class 0):
  if count_1 > count_0:
    return 1
  else:
    return 0


## Task 4.2 - Sklearn's Naive Bayes Classifier

Since the same set of features and class target will be used for all classification models, we simply start out by defining them here.

In [21]:
# Training and test sets ( These will be used for all the next three classifiers )
attributes_X = X_train.drop(columns='class')
target_X = X_train['class']
test_X = X_test.drop(columns='class')

attributes_Y = Y_train.drop(columns='class')
target_Y = Y_train['class']
test_Y = Y_test.drop(columns='class')

In [22]:
# Naive Bayes
clf_naive_bayes_X = GaussianNB()
clf_naive_bayes_X.fit(attributes_X, target_X)

# Prediction
prediction_naive_bayes_X = clf_naive_bayes_X.predict(test_X)

# Evaluation
difference_naive_bayes_X = abs(prediction_naive_bayes_X - X_test['class'])
correct_naive_bayes_X = len(difference_naive_bayes_X.loc[lambda x : x == 0])

print(f"Accuracy : {correct_naive_bayes_X / len(test_X)}")

Accuracy : 0.7222222222222222


Roughly 72% of the points in the test set were correctly classified. This is not an outstanding accuracy, but most points are nonetheless correctly classified.
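To put this accuracy in context, it helps to compare it with a trivial baseline that always predicts class 0. The sketch below uses the Naive Bayes counts for the 80/20 split as printed by the confusion matrix in Task 5.1; the variable names are illustrative.

```python
# Naive Bayes counts on the 80/20 test set (from the confusion matrix in Task 5.1)
TP, FP, FN, TN = 4, 3, 37, 100
total = TP + FP + FN + TN

accuracy = (TP + TN) / total
# Baseline: always predict class 0, which is correct for every actual negative
majority_baseline = (FP + TN) / total

print(f"accuracy = {accuracy:.3f}, baseline = {majority_baseline:.3f}")
```

The two values are nearly identical, which suggests that accuracy alone says little about how well class 1 is recognized on this imbalanced dataset.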

## Task 4.3 - Sklearn's Support Vector Classifier

In [23]:
# SVC
clf_svc_X = make_pipeline(StandardScaler(), SVC(kernel='sigmoid')) # We might want to add a different kernel
clf_svc_X.fit(attributes_X, target_X)

# Prediction
prediction_svc_X = clf_svc_X.predict(test_X)

# Evaluation
difference_test_svc_X = abs(prediction_svc_X - X_test['class'])
correct_test_svc_X = len(difference_test_svc_X.loc[lambda x : x == 0])

print(f"Accuracy : {correct_test_svc_X / len(test_X)}")

Accuracy : 0.7569444444444444


We chose the sigmoid kernel since it gave us the highest accuracy score. With roughly 76% accuracy on the 80/20 split, this was the best-performing of the four models.

## Task 4.4 - Use Sklearn's Multilayer Perceptron Classifier

In [24]:
# MLP
clf_mlp_X = MLPClassifier(random_state=42, max_iter=300)
clf_mlp_X.fit(attributes_X, target_X)

# Prediction
prediction_mlp_X = clf_mlp_X.predict(test_X)

# Evaluation
difference_test_mlp_X = abs(prediction_mlp_X - X_test['class'])
correct_test_mlp_X = len(difference_test_mlp_X.loc[lambda x : x == 0])

print(f"Accuracy : {correct_test_mlp_X / len(test_X)}")

Accuracy : 0.7291666666666666


Adjusting the maximum number of iterations or the learning rate did not improve performance, so we kept the default arguments apart from a raised iteration cap (‘max_iter=300’).

# Task 5 - Evaluation of classification methods

## Task 5.1 - Manually implement a confusion matrix

Our implementation of a confusion matrix function compares the actual class of each data point with the result of classification. Counts of the four different combinations are then stored in a matrix-like structure with customized labels.

In [25]:
def ConfusionMatrix(test_data, prediction_labels):
  comparison_table = pd.DataFrame()
  comparison_table['class'] = test_data['class']
  comparison_table['prediction'] = prediction_labels

  TP = len(comparison_table[(comparison_table['class'] == 1) & (comparison_table['prediction'] == 1)])
  TN = len(comparison_table[(comparison_table['class'] == 0) & (comparison_table['prediction'] == 0)])
  FP = len(comparison_table[(comparison_table['class'] == 0) & (comparison_table['prediction'] == 1)])
  FN = len(comparison_table[(comparison_table['class'] == 1) & (comparison_table['prediction'] == 0)])

  # Columns hold the actual classes, rows the predicted classes.
  # The local name differs from the function name to avoid shadowing it.
  matrix = pd.DataFrame({
    "Actual Positive": [TP, FN],
    "Actual Negative": [FP, TN]
  })

  matrix = matrix.rename(index={0: "Predicted Positive", 1: "Predicted Negative"})

  return matrix
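The counting logic can be checked on a small example without pandas. This is a minimal sketch with made-up labels; ‘confusion_counts’ is a hypothetical helper that mirrors the four conditions used in the function above.

```python
def confusion_counts(actual, predicted):
    """Count TP, TN, FP, FN for binary labels where 1 is the positive class."""
    pairs = list(zip(actual, predicted))
    return {
        "TP": sum(a == 1 and p == 1 for a, p in pairs),
        "TN": sum(a == 0 and p == 0 for a, p in pairs),
        "FP": sum(a == 0 and p == 1 for a, p in pairs),
        "FN": sum(a == 1 and p == 0 for a, p in pairs),
    }

# Made-up labels for illustration only
actual    = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 0, 0, 0]

counts = confusion_counts(actual, predicted)
print(counts)
```

The four counts always sum to the number of test records, which is a quick sanity check for any confusion matrix.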

From the confusion matrix implementation, one can easily compute both recall and precision. Each function applies the relevant formula to the associated entries of the confusion matrix.

In [26]:
def recall(matrix):
  # Rows carry string labels, so select by position explicitly with .iloc
  TP = matrix['Actual Positive'].iloc[0]
  FN = matrix['Actual Positive'].iloc[1]
  return float(TP / (TP + FN))

In [27]:
def precision(matrix):
  # Rows carry string labels, so select by position explicitly with .iloc
  TP = matrix['Actual Positive'].iloc[0]
  FP = matrix['Actual Negative'].iloc[0]
  return float(TP / (TP + FP))

We want a general comparison of the models' performance on the two different splits. For this reason, we store the recall and precision results from each combination of split and model. These results are used for visualization.

In [28]:
# Set up a dataframe for overview of all results
recall_precision = pd.DataFrame({
    "Recall": [None, None, None, None, None, None, None, None],
    "Precision": [None, None, None, None, None, None, None, None]
  })

recall_precision = recall_precision.rename(index={
    0: "KNN_X", 1: "KNN_Y",
    2: "Naive_Bayes_X", 3: "Naive_Bayes_Y",
    4: "SVC_X", 5: "SVC_Y",
    6: "MLP_X", 7: "MLP_Y"
    })

### KNN Evaluation

In [29]:
# With the X-split (80/20)
test_knn_X = X_test.copy()
test_knn_X['result'] = test_knn_X.apply(lambda row: K_nearest(X_train, row, 10), axis=1)

# Get confusion matrix
CM_knn_X = ConfusionMatrix(test_knn_X,test_knn_X['result'])

# Add information to a container (We will use this for visualization later)
# A single .loc[row, column] call avoids chained assignment, which may write to a copy
recall_precision.loc['KNN_X', 'Recall'] = recall(CM_knn_X)
recall_precision.loc['KNN_X', 'Precision'] = precision(CM_knn_X)

# Print information
print(CM_knn_X)
print()
print(f"Recall : {recall(CM_knn_X)}")
print(f"Precision : {precision(CM_knn_X)}")

                    Actual Positive  Actual Negative
Predicted Positive                8               16
Predicted Negative               33               87

Recall : 0.1951219512195122
Precision : 0.3333333333333333


On split X, the KNN classifier does a poor job of successfully classifying class 1, but does significantly better when it comes to class 0. For class 0, the precision is 87 / 120 = 0.725 and the recall is 87 / 103 = 0.84.
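The class 0 figures follow from the same confusion matrix by swapping which class counts as positive. The sketch below works this out from the printed counts; ‘precision_recall’ is an illustrative helper, not one of the notebook's functions.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall) for whichever class is treated as positive."""
    return tp / (tp + fp), tp / (tp + fn)

# KNN on the X-split, counts read from the confusion matrix above
TP, FP, FN, TN = 8, 16, 33, 87

# Class 1 as the positive class
p1, r1 = precision_recall(tp=TP, fp=FP, fn=FN)
# Class 0 as the positive class: true negatives act as true positives,
# and false negatives and false positives swap roles
p0, r0 = precision_recall(tp=TN, fp=FN, fn=FP)

print(f"class 1: precision={p1:.3f}, recall={r1:.3f}")
print(f"class 0: precision={p0:.3f}, recall={r0:.3f}")
```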

In [30]:
# With the Y-split (60/40)
test_knn_Y = Y_test.copy()
test_knn_Y['result'] = test_knn_Y.apply(lambda row: K_nearest(Y_train, row, 10), axis=1)

# Get confusion matrix
CM_knn_Y = ConfusionMatrix(test_knn_Y,test_knn_Y['result'])

# Add information to a container (We will use this for visualization later)
recall_precision.loc['KNN_Y', 'Recall'] = recall(CM_knn_Y)
recall_precision.loc['KNN_Y', 'Precision'] = precision(CM_knn_Y)

# Print information
print(CM_knn_Y)
print()
print(f"Recall : {recall(CM_knn_Y)}")
print(f"Precision : {precision(CM_knn_Y)}")

                    Actual Positive  Actual Negative
Predicted Positive               10                8
Predicted Negative               76              194

Recall : 0.11627906976744186
Precision : 0.5555555555555556


On split Y, the precision is considerably better, but this comes at the cost of recall dropping to 0.116, so the model identifies very few of the returning donors. For class 0, the recall is a remarkable 0.96 and the precision is 0.72. If the goal is to identify donors who won't return, this model will likely not overlook many.

### Naive Bayes Evaluation

In [31]:
# With the X-split (80/20)

# Get confusion matrix
CM_Naive_Bayes_X = ConfusionMatrix(X_test, prediction_naive_bayes_X)

# Add information to a container (We will use this for visualization later)
recall_precision.loc['Naive_Bayes_X', 'Recall'] = recall(CM_Naive_Bayes_X)
recall_precision.loc['Naive_Bayes_X', 'Precision'] = precision(CM_Naive_Bayes_X)

# Print information
print(CM_Naive_Bayes_X)
print()
print(f"Recall : {recall(CM_Naive_Bayes_X)}")
print(f"Precision : {precision(CM_Naive_Bayes_X)}")

                    Actual Positive  Actual Negative
Predicted Positive                4                3
Predicted Negative               37              100

Recall : 0.0975609756097561
Precision : 0.5714285714285714


In [32]:
# With the Y-split (60/40)

clf_naive_bayes_Y = GaussianNB()
clf_naive_bayes_Y.fit(attributes_Y, target_Y)

# Prediction
prediction_naive_bayes_Y = clf_naive_bayes_Y.predict(test_Y)

# Get confusion matrix
CM_Naive_Bayes_Y = ConfusionMatrix(Y_test, prediction_naive_bayes_Y)

# Add information to a container (We will use this for visualization later)
recall_precision.loc['Naive_Bayes_Y', 'Recall'] = recall(CM_Naive_Bayes_Y)
recall_precision.loc['Naive_Bayes_Y', 'Precision'] = precision(CM_Naive_Bayes_Y)

# Print information
print(CM_Naive_Bayes_Y)
print()
print(f"Recall : {recall(CM_Naive_Bayes_Y)}")
print(f"Precision : {precision(CM_Naive_Bayes_Y)}")

                    Actual Positive  Actual Negative
Predicted Positive                4                4
Predicted Negative               82              198

Recall : 0.046511627906976744
Precision : 0.5


### SVC Evaluation

In [33]:
# With the X-split (80/20)

# Get confusion matrix
CM_svc_X = ConfusionMatrix(X_test, prediction_svc_X)

# Add information to a container (We will use this for visualization later)
recall_precision.loc['SVC_X', 'Recall'] = recall(CM_svc_X)
recall_precision.loc['SVC_X', 'Precision'] = precision(CM_svc_X)

# Print information
print(CM_svc_X)
print()
print(f"Recall : {recall(CM_svc_X)}")
print(f"Precision : {precision(CM_svc_X)}")

                    Actual Positive  Actual Negative
Predicted Positive               15                9
Predicted Negative               26               94

Recall : 0.36585365853658536
Precision : 0.625


In [34]:
# With the Y-split (60/40)

clf_svc_Y = make_pipeline(StandardScaler(), SVC(kernel ='sigmoid'))
clf_svc_Y.fit(attributes_Y, target_Y)

# Prediction
prediction_svc_Y = clf_svc_Y.predict(test_Y)

# Get confusion matrix
CM_svc_Y = ConfusionMatrix(Y_test, prediction_svc_Y)

# Add information to a container (We will use this for visualization later)
recall_precision.loc['SVC_Y', 'Recall'] = recall(CM_svc_Y)
recall_precision.loc['SVC_Y', 'Precision'] = precision(CM_svc_Y)


print(CM_svc_Y)
print()
print(f"Recall : {recall(CM_svc_Y)}")
print(f"Precision : {precision(CM_svc_Y)}")

                    Actual Positive  Actual Negative
Predicted Positive               16               24
Predicted Negative               70              178

Recall : 0.18604651162790697
Precision : 0.4


### MLP Evaluation

In [35]:
# With the X-split (80/20)

# Get confusion matrix
CM_mlp_X = ConfusionMatrix(X_test, prediction_mlp_X)

# Add information to a container (We will use this for visualization later)
recall_precision.loc['MLP_X', 'Recall'] = recall(CM_mlp_X)
recall_precision.loc['MLP_X', 'Precision'] = precision(CM_mlp_X)

print(CM_mlp_X)
print()
print(f"Recall : {recall(CM_mlp_X)}")
print(f"Precision : {precision(CM_mlp_X)}")

                    Actual Positive  Actual Negative
Predicted Positive                3                1
Predicted Negative               38              102

Recall : 0.07317073170731707
Precision : 0.75


In [36]:
# With the Y-split (60/40)

clf_mlp_Y = MLPClassifier(random_state=42, max_iter=300)
clf_mlp_Y.fit(attributes_Y, target_Y)

# Prediction
prediction_mlp_Y = clf_mlp_Y.predict(test_Y)

# Get confusion matrix
CM_mlp_Y = ConfusionMatrix(Y_test, prediction_mlp_Y)

# Add information to a container (We will use this for visualization later)
recall_precision.loc['MLP_Y', 'Recall'] = recall(CM_mlp_Y)
recall_precision.loc['MLP_Y', 'Precision'] = precision(CM_mlp_Y)

print(CM_mlp_Y)
print()
print(f"Recall : {recall(CM_mlp_Y)}")
print(f"Precision : {precision(CM_mlp_Y)}")

                    Actual Positive  Actual Negative
Predicted Positive                2                1
Predicted Negative               84              201

Recall : 0.023255813953488372
Precision : 0.6666666666666666


Across all model types, our classifiers are strongly biased toward predicting class 0, the majority class (about 76% of the records). As a result, most actual positives end up in the "Predicted Negative" row of the confusion matrices.
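
To put this bias in perspective, the following sketch compares the accuracy of the Y-split MLP with a trivial baseline that always predicts class 0. The counts are copied from the MLP Y-split confusion matrix printed above; the helper name `accuracy` and the always-predict-0 baseline are illustrative additions, not part of the assignment code.

```python
# Counts copied from the MLP Y-split (60/40) confusion matrix printed above.
tp, fp = 2, 1      # predicted positive: actual positive / actual negative
fn, tn = 84, 201   # predicted negative: actual positive / actual negative

total = tp + fp + fn + tn


def accuracy(correct, n):
    """Fraction of correctly classified test points."""
    return correct / n


# Accuracy of the trained MLP on the Y-split test set.
mlp_accuracy = accuracy(tp + tn, total)

# Hypothetical baseline (an assumption for illustration): always predict class 0.
# It gets every actual negative right and every actual positive wrong.
baseline_accuracy = accuracy(fp + tn, total)

print(f"MLP accuracy      : {mlp_accuracy:.3f}")
print(f"Baseline accuracy : {baseline_accuracy:.3f}")
```

The two values differ by less than one percentage point, which suggests that on this split the MLP gains very little over always predicting the majority class.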

Furthermore, models trained on the X-split (80/20) achieve higher precision and recall for class 1 than those trained on the Y-split (60/40), with the KNN classifier as the only exception. This supports our hypothesis that a larger training set is generally advantageous.

For the KNN classifier, precision on class 1 is higher on the Y-split (60/40) than on the X-split (80/20), although recall is lower. A possible explanation is that the higher density of points in the X-split training data does not align well with our chosen K-value. In contrast, the sparser Y-split training data may form artificially homogeneous clusters, which could improve KNN's local classification performance under certain conditions.

#### Overview of Results

In [37]:
overview = px.scatter(
    data_frame = recall_precision,
    x = recall_precision['Recall'],
    y = recall_precision['Precision'],
    color = recall_precision.index
    )

# Fix both axes to the full [0, 1] range so the models are directly comparable
overview.update_layout(xaxis_range=[0, 1], yaxis_range=[0, 1])

## Task 5.2 - Sklearn's classification report

In [38]:
class_names = ['class 0', 'class 1']

### KNN Evaluation

In [39]:
# X-Split (80/20)
print(classification_report(X_test['class'], test_knn_X['result'],target_names=class_names))

              precision    recall  f1-score   support

     class 0       0.72      0.84      0.78       103
     class 1       0.33      0.20      0.25        41

    accuracy                           0.66       144
   macro avg       0.53      0.52      0.51       144
weighted avg       0.61      0.66      0.63       144



In [40]:
# Y-Split (60/40)
print(classification_report(Y_test['class'], test_knn_Y['result'],target_names=class_names))

              precision    recall  f1-score   support

     class 0       0.72      0.96      0.82       202
     class 1       0.56      0.12      0.19        86

    accuracy                           0.71       288
   macro avg       0.64      0.54      0.51       288
weighted avg       0.67      0.71      0.63       288



### Naive Bayes Evaluation

In [41]:
# X-Split (80/20)
print(classification_report(X_test['class'], prediction_naive_bayes_X,target_names=class_names))

              precision    recall  f1-score   support

     class 0       0.73      0.97      0.83       103
     class 1       0.57      0.10      0.17        41

    accuracy                           0.72       144
   macro avg       0.65      0.53      0.50       144
weighted avg       0.68      0.72      0.64       144



In [42]:
# Y-Split (60/40)
print(classification_report(Y_test['class'], prediction_naive_bayes_Y,target_names=class_names))

              precision    recall  f1-score   support

     class 0       0.71      0.98      0.82       202
     class 1       0.50      0.05      0.09        86

    accuracy                           0.70       288
   macro avg       0.60      0.51      0.45       288
weighted avg       0.65      0.70      0.60       288



### SVC Evaluation

In [43]:
# X-Split (80/20)
print(classification_report(X_test['class'], prediction_svc_X,target_names=class_names))

              precision    recall  f1-score   support

     class 0       0.78      0.91      0.84       103
     class 1       0.62      0.37      0.46        41

    accuracy                           0.76       144
   macro avg       0.70      0.64      0.65       144
weighted avg       0.74      0.76      0.73       144



In [44]:
# Y-Split (60/40)
print(classification_report(Y_test['class'], prediction_svc_Y,target_names=class_names))

              precision    recall  f1-score   support

     class 0       0.72      0.88      0.79       202
     class 1       0.40      0.19      0.25        86

    accuracy                           0.67       288
   macro avg       0.56      0.53      0.52       288
weighted avg       0.62      0.67      0.63       288



### MLP Evaluation

In [45]:
# X-Split (80/20)
print(classification_report(X_test['class'], prediction_mlp_X,target_names=class_names))

              precision    recall  f1-score   support

     class 0       0.73      0.99      0.84       103
     class 1       0.75      0.07      0.13        41

    accuracy                           0.73       144
   macro avg       0.74      0.53      0.49       144
weighted avg       0.73      0.73      0.64       144



In [46]:
# Y-Split (60/40)
print(classification_report(Y_test['class'], prediction_mlp_Y,target_names=class_names))

              precision    recall  f1-score   support

     class 0       0.71      1.00      0.83       202
     class 1       0.67      0.02      0.04        86

    accuracy                           0.70       288
   macro avg       0.69      0.51      0.44       288
weighted avg       0.69      0.70      0.59       288



The classification report clearly shows that all of the models are much better at detecting class 0 than class 1: recall for class 0 ranges from 0.84 to 1.00, while recall for class 1 never exceeds 0.37.
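
The summary rows of the report can be reproduced by hand. The sketch below does this for the SVC Y-split, using the counts from its confusion matrix printed earlier. The variable names are illustrative, and the calculation assumes the usual definitions of the macro (unweighted) and weighted (support-weighted) averages.

```python
# Counts copied from the SVC Y-split (60/40) confusion matrix printed earlier.
tp, fp = 16, 24    # predicted positive: actual positive / actual negative
fn, tn = 70, 178   # predicted negative: actual positive / actual negative

support_0 = fp + tn   # number of actual class 0 test points
support_1 = tp + fn   # number of actual class 1 test points
total = support_0 + support_1

# Per-class precision: correct predictions of a class / all predictions of that class.
precision_0 = tn / (tn + fn)
precision_1 = tp / (tp + fp)

accuracy = (tp + tn) / total
macro_precision = (precision_0 + precision_1) / 2
weighted_precision = (precision_0 * support_0 + precision_1 * support_1) / total

print(f"accuracy           : {accuracy:.2f}")
print(f"macro precision    : {macro_precision:.2f}")
print(f"weighted precision : {weighted_precision:.2f}")
```

Rounded to two decimals, the results (0.67, 0.56 and 0.62) agree with the accuracy, macro avg and weighted avg precision in the SVC Y-split report.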

## Task 5.3 - Sklearn's fbeta score function

Since all our models have significantly better precision than recall with regard to class 1, we have chosen a beta value below 1 (beta = 0.5), which weights precision more heavily than recall.
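
As a sanity check on how beta enters the score, the sketch below recomputes the macro-averaged F-beta for the SVC Y-split directly from its confusion-matrix counts. It assumes the count form of the definition, F_beta = (1 + beta^2) TP / ((1 + beta^2) TP + beta^2 FN + FP); the helper name `fbeta_from_counts` is hypothetical and not part of sklearn.

```python
def fbeta_from_counts(tp, fp, fn, beta):
    """F-beta of one class from its true positives, false positives and false negatives."""
    b2 = beta ** 2
    return (1 + b2) * tp / ((1 + b2) * tp + b2 * fn + fp)


# Counts from the SVC Y-split confusion matrix, with class 1 as the positive class.
tp, fp, fn, tn = 16, 24, 70, 178
beta = 0.5

# Class 1: use the counts as they are.
f_class_1 = fbeta_from_counts(tp, fp, fn, beta)

# Class 0: the roles swap. Its true positives are tn, its false positives
# are fn (actual 1 predicted 0) and its false negatives are fp.
f_class_0 = fbeta_from_counts(tn, fn, fp, beta)

# Macro average: unweighted mean over both classes.
macro_fbeta = (f_class_0 + f_class_1) / 2

print(f"F0.5 class 0 : {f_class_0:.4f}")
print(f"F0.5 class 1 : {f_class_1:.4f}")
print(f"macro F0.5   : {macro_fbeta:.4f}")
```

The macro value agrees with the "SVC - split Y" score printed in the SVC Evaluation cell of this task (about 0.5353). Because beta^2 = 0.25 scales the false negatives in the denominator, a false negative counts only a quarter as much as a false positive.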

### KNN Evaluation



In [52]:
print(f"KNN - split X : {fbeta_score(X_test['class'], test_knn_X['result'], average='macro', beta=0.5)}")
print(f"KNN - split Y : {fbeta_score(Y_test['class'], test_knn_Y['result'], average='macro', beta=0.5)}")

KNN - split X : 0.5190557273603686
KNN - split Y : 0.53654298070657


### Naive Bayes Evaluation

In [53]:
print(f"Naive Bayes - split X : {fbeta_score(X_test['class'], prediction_naive_bayes_X, average='macro', beta=0.5)}")
print(f"Naive Bayes - split Y : {fbeta_score(Y_test['class'], prediction_naive_bayes_Y, average='macro', beta=0.5)}")

Naive Bayes - split X : 0.5289521138048487
Naive Bayes - split Y : 0.4591784404728326


### SVC Evaluation

In [54]:
print(f"SVC - split X : {fbeta_score(X_test['class'], prediction_svc_X, average='macro', beta=0.5)}")
print(f"SVC - split Y : {fbeta_score(Y_test['class'], prediction_svc_Y, average='macro', beta=0.5)}")

SVC - split X : 0.6768101062964029
SVC - split Y : 0.5352984434366956


### MLP Evaluation

In [55]:
print(f"MLP - split X : {fbeta_score(X_test['class'], prediction_mlp_X, average='macro', beta=0.5)}")
print(f"MLP - split Y : {fbeta_score(Y_test['class'], prediction_mlp_Y, average='macro', beta=0.5)}")

MLP - split X : 0.5161943319838057
MLP - split Y : 0.42546154080111925


# Task 6 - Cross-validation


I am not quite sure where 6.1 ends and 6.2 starts
