# Project 3: Blood Donor Classification
(by: Martin Marsal, Benedikt Allmendinger, Christian Diegmann; Heilbronn University, Germany, January 2025)

## 0. Preparation
First, get to know the dataset and deal with missing values.
- Perform an exploratory data analysis to get to know the dataset.
- Preprocess the data. If there are missing values, impute them.
- Estimate the accuracy of your imputation for each feature.

In [None]:
import pandas as pd
from sklearn.impute import KNNImputer

In [None]:
# Read the CSV file into a DataFrame
df = pd.read_csv('hemodat.csv')

# Strip any leading/trailing spaces from column names
df.columns = df.columns.str.strip()

# Calculate the number of missing values for each feature
missing_values = df.isnull().sum()

# Output the count of missing values for each feature
print("Missing values per feature:")
for feature, missing_count in missing_values.items():
    print(f"{feature}: {missing_count}")

In [None]:
# Define the KNN imputer
knn_imputer = KNNImputer(n_neighbors=5, weights='uniform')

# Apply KNN imputer to the DataFrame
numerical_columns = df.select_dtypes(include=['number']).columns
df[numerical_columns] = knn_imputer.fit_transform(df[numerical_columns])

# Check that no null values remain (non-numeric columns are not imputed above)
found_null = df.isnull().values.any()
if found_null:
    raise ValueError('Found null value in DataFrame after imputation.')

# Output the cleaned DataFrame
print("DataFrame after KNN imputation:")
print(df)
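
The task list above also asks for an estimate of the imputation accuracy per feature. One possible way to approach this, sketched here under assumptions, is a mask-and-recover check: hide a fraction of the known values in one column, impute them with the same `KNNImputer` settings, and compare against the hidden ground truth. The helper name `estimate_imputation_error`, the parameter `mask_fraction`, and the synthetic `demo` frame are hypothetical and not part of the original notebook; the check also assumes that values are missing at random.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer


def estimate_imputation_error(df_numeric, mask_fraction=0.1, n_neighbors=5, seed=0):
    """Hypothetical helper: per-feature mean absolute error of KNN imputation.

    For each column, a random fraction of known values is hidden, imputed,
    and compared against the true values.
    """
    rng = np.random.default_rng(seed)
    # Only rows without any missing value provide a ground truth
    complete = df_numeric.dropna()
    errors = {}
    for col in complete.columns:
        masked = complete.copy()
        n_hide = max(1, int(mask_fraction * len(masked)))
        hidden_idx = rng.choice(masked.index.to_numpy(), size=n_hide, replace=False)
        masked.loc[hidden_idx, col] = np.nan

        imputer = KNNImputer(n_neighbors=n_neighbors, weights='uniform')
        imputed = pd.DataFrame(
            imputer.fit_transform(masked), index=masked.index, columns=masked.columns
        )
        diff = imputed.loc[hidden_idx, col] - complete.loc[hidden_idx, col]
        errors[col] = float(np.mean(np.abs(diff)))
    return pd.Series(errors, name="mean_absolute_error")


# Synthetic stand-in data (assumption); replace with the numeric columns of the real dataset
rng = np.random.default_rng(42)
demo = pd.DataFrame(rng.normal(size=(200, 3)), columns=["a", "b", "c"])
demo["c"] = 2 * demo["a"] + rng.normal(scale=0.1, size=200)

mae = estimate_imputation_error(demo)
print(mae)
```

Because the features have different scales, it may help to relate each error to the standard deviation of its column before comparing features with each other.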

## 1. Anomaly Detection
Since medical conditions that lead to the rejection of a donor are (luckily) rare and can be very
diverse, it is nearly impossible to categorize every possible condition. Hence, it would be useful to have an anomaly
detection algorithm in place as a safety mechanism to detect suspicious blood samples for further testing.
- Train an anomaly detection model based only on valid blood donors without a medical condition.
- Evaluate the accuracy of your anomaly detection by testing it also on donors with a medical condition.
- Perform a PCA to visualize the true / false positive and true / false negative predictions as well as the decision
boundary of your anomaly detection. How much variance is explained by the first two principal components?
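
The steps above could be sketched roughly as follows. This is only an illustration under assumptions: `IsolationForest` is one of several possible detectors (the task does not prescribe one), the data is synthetic, and the names `X_healthy` and `X_sick` are placeholders for the real donor groups.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import IsolationForest
from sklearn.metrics import confusion_matrix
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): healthy donors vs. donors with a condition
rng = np.random.default_rng(0)
X_healthy = rng.normal(loc=0.0, scale=1.0, size=(300, 5))
X_sick = rng.normal(loc=4.0, scale=1.5, size=(30, 5))

# Train only on healthy donors; hold some healthy donors back for testing
X_train, X_healthy_test = X_healthy[:200], X_healthy[200:]
scaler = StandardScaler().fit(X_train)
detector = IsolationForest(contamination=0.05, random_state=0)
detector.fit(scaler.transform(X_train))

# Evaluate on held-out healthy donors plus donors with a condition
X_test = np.vstack([X_healthy_test, X_sick])
y_true = np.array([0] * len(X_healthy_test) + [1] * len(X_sick))  # 1 = anomaly
y_pred = (detector.predict(scaler.transform(X_test)) == -1).astype(int)

tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
print(f"TP={tp} FP={fp} TN={tn} FN={fn}")

# PCA on the scaled training data; the ratio answers the variance question
pca = PCA(n_components=2).fit(scaler.transform(X_train))
explained = pca.explained_variance_ratio_
print(f"Variance explained by PC1 + PC2: {explained.sum():.1%}")

# 2D coordinates that can be scatter-plotted and colored by TP / FP / TN / FN
X_test_2d = pca.transform(scaler.transform(X_test))
```

For the decision boundary, one option is to evaluate `detector.decision_function` on a grid in the 2D PCA plane that is mapped back with `pca.inverse_transform`. This is an approximation, since the detector works in the full feature space.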

## 2. Explainable Model
To be useful for decision support, your model should be explainable. Train a model with a focus on
explainability, with a structure that is as simple as possible while still maintaining its predictive power.
- Train a decision tree classifier on the imputed data. Evaluate your model’s accuracy and visualize the tree structure to
help the hospital personnel understand the decision process. Each inference should output not only the class, but also
the decision path taken. Make the tree as simple and understandable as possible.
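
A minimal sketch of how a prediction could be returned together with its decision path, assuming scikit-learn's `DecisionTreeClassifier`. The helper `explain_prediction`, the feature names, and the synthetic data are hypothetical; `max_depth` and `min_samples_leaf` are example values for keeping the tree small, not tuned settings.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data with hypothetical feature names (assumption)
rng = np.random.default_rng(1)
feature_names = ["marker_a", "marker_b", "marker_c"]
X = rng.normal(size=(300, 3))
y = (X[:, 0] + 0.5 * X[:, 1] > 1.0).astype(int)

# A shallow tree with a minimum leaf size stays readable
tree = DecisionTreeClassifier(max_depth=3, min_samples_leaf=10, random_state=0)
tree.fit(X, y)

# Text rendering of the whole tree structure
print(export_text(tree, feature_names=feature_names))


def explain_prediction(tree, x, feature_names):
    """Hypothetical helper: return the predicted class and the tests along the path."""
    x = np.asarray(x, dtype=float).reshape(1, -1)
    node_ids = tree.decision_path(x).indices  # nodes visited by this sample
    leaf_id = tree.apply(x)[0]
    steps = []
    for node in node_ids:
        if node == leaf_id:
            continue  # the leaf holds the prediction, not a test
        feat = tree.tree_.feature[node]
        thr = tree.tree_.threshold[node]
        op = "<=" if x[0, feat] <= thr else ">"
        steps.append(f"{feature_names[feat]} = {x[0, feat]:.2f} {op} {thr:.2f}")
    return tree.predict(x)[0], steps


pred, steps = explain_prediction(tree, X[0], feature_names)
print("Predicted class:", pred)
for step in steps:
    print("  ", step)
```

For a graphical view, `sklearn.tree.plot_tree` can render the same fitted tree with matplotlib.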

## 3. High Performance Model
This time the focus is on predictive power. Try to train a more accurate model. Is it worth
the effort?
- Train and optimize an XGBoost classifier on the imputed data.
- Use SHAP local explanation techniques on 5 selected data points and discuss the results.
- Use SHAP global explanation techniques to visualize and discuss the influence of different features.
- Evaluate the XGBoost classifier’s accuracy and compare it to the decision tree.

## 4. Combined Model
Put all components into a single model artifact for deployment such that clinic personnel have all important
information at hand to make an informed decision.
- Combine the XGBoost, Decision Tree and Anomaly Detection in a single model class including all necessary methods (fit,
predict…). The Decision Tree provides explainable assistance for the hospital personnel and the XGBoost (probably) a more
accurate classification. The Anomaly Detection increases the robustness of the model for conditions that have not been
explicitly trained or for human errors. Generate a few test anomalies to check your detection.
- Evaluate, discuss and plot the performance of your combined model.
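
One possible shape for such a combined artifact is sketched below. The class name `CombinedDonorModel` and its interface are hypothetical. To keep the sketch runnable with scikit-learn alone, `GradientBoostingClassifier` stands in for the XGBoost classifier; an `xgboost.XGBClassifier` could be passed as `strong_clf` instead. `IsolationForest` is again just one assumed choice of detector, and the data is synthetic.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, IsolationForest
from sklearn.tree import DecisionTreeClassifier


class CombinedDonorModel:
    """Hypothetical wrapper bundling an accurate classifier, an explainable
    tree, and an anomaly detector behind a single fit / predict interface."""

    def __init__(self, strong_clf, tree_clf, detector, healthy_label=0):
        self.strong_clf = strong_clf
        self.tree_clf = tree_clf
        self.detector = detector
        self.healthy_label = healthy_label

    def fit(self, X, y):
        X, y = np.asarray(X, dtype=float), np.asarray(y)
        self.strong_clf.fit(X, y)
        self.tree_clf.fit(X, y)
        # The detector only sees donors without a medical condition
        self.detector.fit(X[y == self.healthy_label])
        return self

    def predict(self, X):
        X = np.asarray(X, dtype=float)
        return {
            "prediction": self.strong_clf.predict(X),
            "tree_prediction": self.tree_clf.predict(X),
            "is_anomaly": self.detector.predict(X) == -1,
        }


# Synthetic stand-in data (assumption)
rng = np.random.default_rng(3)
X = rng.normal(size=(400, 4))
y = (X[:, 0] > 1.0).astype(int)

model = CombinedDonorModel(
    strong_clf=GradientBoostingClassifier(random_state=0),  # stand-in for XGBoost
    tree_clf=DecisionTreeClassifier(max_depth=3, random_state=0),
    detector=IsolationForest(random_state=0),
).fit(X, y)

# A few artificial test anomalies far outside the training distribution
anomalies = rng.normal(loc=8.0, scale=1.0, size=(5, 4))
result = model.predict(anomalies)
print("Flagged as anomaly:", result["is_anomaly"])
```

Returning a dictionary keeps the three signals separate, so that a disagreement between the two classifiers or an anomaly flag stays visible to the person making the decision.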