# Project 3: Blood Donor Classification
(by: Martin Marsal, Benedikt Allmendinger, Christian Diegmann; Heilbronn University, Germany, January 2025)

## 0. Preparation
First, get to know the dataset and deal with missing values.
- Perform an exploratory data analysis to get to know the dataset.
- Preprocess the data. If there are missing values, impute them.
- Estimate the accuracy of your imputation for each feature.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.impute import KNNImputer
from sklearn.metrics import mean_squared_error

In [None]:
# Read the CSV file into a DataFrame
df = pd.read_csv('hemodat.csv')

# 1. Basic Information
print("Basic Information:")
print(df.info())
print("\nShape of the dataset:", df.shape)

In [None]:
# 2. Summary Statistics
print("\nSummary Statistics (Numerical):")
print(df.describe())

print("\nSummary Statistics (Categorical):")
categorical_columns = df.select_dtypes(include=['object']).columns
print(df[categorical_columns].describe())

In [None]:
# 3. Distribution Analysis
numerical_columns = df.select_dtypes(include=[np.number]).columns

# Histograms for numerical data
df[numerical_columns].hist(bins=15, figsize=(15, 10), layout=(len(numerical_columns)//3 + 1, 3))
plt.tight_layout()
plt.show()

# Count plots for categorical data
for col in categorical_columns:
    sns.countplot(y=col, data=df)
    plt.show()

In [None]:
# 4. Correlation Matrix
# Select only numeric columns for correlation matrix
numeric_columns = df.select_dtypes(include=[np.number]).columns
correlation_matrix = df[numeric_columns].corr()

# Visualize the correlation matrix
sns.heatmap(correlation_matrix, annot=False, cmap="coolwarm")
plt.title("Correlation Matrix of Numeric Features")
plt.show()

In [None]:
# 5. Feature Engineering Insights
print("\nUnique Values per Column:")
print(df.nunique())

In [None]:
# Strip any leading/trailing spaces from column names
df.columns = df.columns.str.strip()

# Calculate the number of missing values for each feature
missing_values = df.isnull().sum()

# Output the count of missing values for each feature
print("Missing values per feature:")
for feature, missing_count in missing_values.items():
    print(f"{feature}: {missing_count}")

In [None]:
# Define the KNN imputer
knn_imputer = KNNImputer(n_neighbors=5, weights='uniform')

# Apply KNN imputer to the numerical columns of the DataFrame
numerical_columns = df.select_dtypes(include=['number']).columns
df[numerical_columns] = knn_imputer.fit_transform(df[numerical_columns])

# Check that no null values remain (non-numerical columns are not imputed above)
found_null = df.isnull().values.any()
if found_null:
    raise ValueError('Found null value in DataFrame.')

# Output the cleaned DataFrame
print("DataFrame after KNN imputation:")
print(df)

In [None]:
# Select numerical columns
numerical_columns = df.select_dtypes(include=['number']).columns
df_numerical = df[numerical_columns]

# Create a copy of the original data
original_data = df_numerical.copy()

# Introduce missing values artificially (10% missing)
np.random.seed(42)
mask = np.random.rand(*df_numerical.shape) < 0.1  # Mask for 10% missing
df_missing = df_numerical.copy()
df_missing[mask] = np.nan

# Apply KNN Imputer
knn_imputer = KNNImputer(n_neighbors=5, weights='uniform')
df_imputed = pd.DataFrame(knn_imputer.fit_transform(df_missing), columns=numerical_columns)

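# Compare imputed values with original values.
# Note: df was already imputed in the previous cell, so the "original" values
# include KNN-imputed entries wherever the raw data had gaps.
print("Mean Squared Error for each feature:")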
for col in numerical_columns:
    # Calculate error only for artificially missing values
    mask_col = mask[:, df_numerical.columns.get_loc(col)]
    mse = mean_squared_error(original_data[col][mask_col], df_imputed[col][mask_col])
    print(f"'{col}': {mse}")
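The per-feature MSE values are expressed in the squared units of each feature, so they cannot be compared directly across features with different ranges. One possible scale-free variant is a normalized RMSE (RMSE divided by the feature's standard deviation). The sketch below is only an illustration on hand-made toy data; the helper `nrmse_per_feature` and the column names `small_scale` and `large_scale` are hypothetical and not part of the notebook.

```python
import numpy as np
import pandas as pd


def nrmse_per_feature(original, imputed, mask):
    """RMSE on the masked cells, divided by each feature's standard deviation."""
    scores = {}
    for j, col in enumerate(original.columns):
        m = mask[:, j]
        if not m.any():
            continue  # nothing was masked in this column
        err = original[col].to_numpy()[m] - imputed[col].to_numpy()[m]
        rmse = np.sqrt(np.mean(err ** 2))
        scores[col] = rmse / original[col].std()
    return pd.Series(scores)


# Toy data: the second column is the first one multiplied by 100
original = pd.DataFrame({
    "small_scale": [1.0, 2.0, 3.0, 4.0],
    "large_scale": [100.0, 200.0, 300.0, 400.0],
})

# Pretend the first row was masked and imputed with the neighbouring value
mask = np.zeros(original.shape, dtype=bool)
mask[0, :] = True
imputed = original.copy()
imputed.iloc[0] = [2.0, 200.0]

scores = nrmse_per_feature(original, imputed, mask)
print(scores)
```

Both toy columns receive the same score although their MSE differs by a factor of 10,000, which is the property that makes the normalized value comparable across features.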

## 1. Anomaly Detection
Since medical conditions that lead to the rejection of a donor are (luckily) rare and can be very
diverse, it is nearly impossible to categorize every possible condition. Hence, it would be useful to have an anomaly
detection algorithm in place as a safety mechanism to detect suspicious blood samples for further testing.
- Train an anomaly detection model based only on valid blood donors without a medical condition.
- Evaluate the accuracy of your anomaly detection by also testing it on donors with a medical condition.
- Perform a PCA to visualize the true / false positive and true / false negative predictions as well as the decision
boundary of your anomaly detection. How much variance is explained by the first two principal components?

In [None]:
# df_blood_donors = df[df['Category'] == '0=Blood Donor']
# 
# from sklearn.preprocessing import StandardScaler
# 
# # Select numerical columns for standardization
# numerical_cols = df_blood_donors.select_dtypes(include=['float64', 'int64']).columns
# 
# scaler = StandardScaler()
# df_blood_donors.loc[:, numerical_cols] = scaler.fit_transform(df_blood_donors[numerical_cols])
# 
# print(df_blood_donors)

In [None]:
# from sklearn.ensemble import IsolationForest
# 
# # Initialize and train the anomaly detection model
# model = IsolationForest(contamination=0.05)  # Assuming 5% contamination (anomalies)
# model.fit(df_blood_donors[numerical_cols])
# 
# # Predict anomalies
# anomalies = model.predict(df_blood_donors[numerical_cols])
# 
# # Anomalies will be -1 (outliers) or 1 (normal)
# df_blood_donors = df[df['Category'] == '0=Blood Donor'].copy()  # Create a deep copy
# df_blood_donors['anomaly'] = anomalies  # No warning here
# 
# print(df_blood_donors)

In [None]:
# df_medical_conditions = df[df['Category'] != '0=Blood Donor']
# # test_data = pd.concat([df_blood_donors, df_medical_conditions])
# 
# test_data = df[df['Category'] != '0=Blood Donor'].copy()
# 
# # Preprocess the test data similarly to the training data
# test_data[numerical_cols] = scaler.transform(test_data[numerical_cols])
# 
# test_data['predictions'] = model.predict(test_data[numerical_cols])
# print(test_data)

In [None]:
from sklearn.ensemble import IsolationForest
from sklearn.metrics import classification_report
from sklearn.decomposition import PCA
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Filter data for valid blood donors
donors = df[df['Category'] == '0=Blood Donor']

# Filter data for non-donors (anomalous category)
non_donors = df[df['Category'] != '0=Blood Donor']

# Numerical columns for modeling
numerical_columns = ['ALB', 'ALP', 'ALT', 'AST', 'BIL', 'CHE', 'CHOL', 'CREA', 'GGT', 'PROT']

# Features for anomaly detection
X_train = donors[numerical_columns]  # Train on valid donors only
X_test = non_donors[numerical_columns]  # Use non-donors for testing

# Define and train Isolation Forest
iso_forest = IsolationForest(random_state=42, contamination=0.1)  # Assuming 10% contamination
iso_forest.fit(X_train)

# Predict anomalies on the combined dataset
combined_data = pd.concat([donors, non_donors], axis=0)
combined_features = combined_data[numerical_columns]
predictions = iso_forest.predict(combined_features)

# Map predictions and labels to binary format (1: valid donor, 0: anomaly).
# IsolationForest.predict returns 1 for inliers and -1 for outliers.
# "Positive" in the counts and plots below refers to the valid-donor class.
pred_binary = np.where(predictions == 1, 1, 0)
true_binary = np.where(combined_data['Category'] == '0=Blood Donor', 1, 0)

# Classification report
print("Classification Report:")
print(classification_report(true_binary, pred_binary, target_names=['Anomaly', 'Valid']))

# Calculate the counts of true/false positives and negatives
tp = np.sum((true_binary == 1) & (pred_binary == 1))  # True Positives
fp = np.sum((true_binary == 0) & (pred_binary == 1))  # False Positives
tn = np.sum((true_binary == 0) & (pred_binary == 0))  # True Negatives
fn = np.sum((true_binary == 1) & (pred_binary == 0))  # False Negatives

# Display the counts
print("Counts:")
print(f"True Positives (TP): {tp}")
print(f"False Positives (FP): {fp}")
print(f"True Negatives (TN): {tn}")
print(f"False Negatives (FN): {fn}")

# Perform PCA
pca = PCA(n_components=2)
pca_data = pca.fit_transform(combined_features)

# Variance explained by PCA components
variance_explained = pca.explained_variance_ratio_
print(f"Variance explained by the first two components: {variance_explained}")
print(f"Total variance explained by both components: {variance_explained.sum():.2%}")

# Scatter plot with True/False Positives and Negatives
plt.figure(figsize=(12, 8))
plt.scatter(pca_data[(true_binary == 1) & (pred_binary == 1), 0], pca_data[(true_binary == 1) & (pred_binary == 1), 1], 
            c='green', label='True Positives', alpha=0.6, edgecolor='k')
plt.scatter(pca_data[(true_binary == 0) & (pred_binary == 1), 0], pca_data[(true_binary == 0) & (pred_binary == 1), 1], 
            c='orange', label='False Positives', alpha=0.6, edgecolor='k')
plt.scatter(pca_data[(true_binary == 0) & (pred_binary == 0), 0], pca_data[(true_binary == 0) & (pred_binary == 0), 1], 
            c='blue', label='True Negatives', alpha=0.6, edgecolor='k')
plt.scatter(pca_data[(true_binary == 1) & (pred_binary == 0), 0], pca_data[(true_binary == 1) & (pred_binary == 0), 1], 
            c='red', label='False Negatives', alpha=0.6, edgecolor='k')
plt.title("PCA Visualization of Predictions")
plt.xlabel(f"Principal Component 1 ({variance_explained[0]:.2%} variance)")
plt.ylabel(f"Principal Component 2 ({variance_explained[1]:.2%} variance)")
plt.legend()
plt.grid(True)
plt.show()

# Decision boundary visualization using PCA components
xx, yy = np.meshgrid(
    np.linspace(pca_data[:, 0].min() - 1, pca_data[:, 0].max() + 1, 100),
    np.linspace(pca_data[:, 1].min() - 1, pca_data[:, 1].max() + 1, 100)
)
# Prepare the grid points for prediction
mesh_points = np.c_[xx.ravel(), yy.ravel()]
inverse_transformed = pca.inverse_transform(mesh_points)

# Ensure the mesh_points (inverse_transformed) have valid feature names
inverse_transformed_df = pd.DataFrame(inverse_transformed, columns=numerical_columns)

# Predict anomaly scores for the mesh points
mesh_scores = iso_forest.decision_function(inverse_transformed_df).reshape(xx.shape)

# Plot decision boundary
plt.figure(figsize=(12, 8))
plt.contourf(xx, yy, mesh_scores, levels=50, cmap='coolwarm', alpha=0.5)
plt.colorbar(label='Anomaly Score')
plt.scatter(pca_data[(true_binary == 1) & (pred_binary == 1), 0], pca_data[(true_binary == 1) & (pred_binary == 1), 1], 
            c='green', label='True Positives', alpha=0.6, edgecolor='k')
plt.scatter(pca_data[(true_binary == 0) & (pred_binary == 1), 0], pca_data[(true_binary == 0) & (pred_binary == 1), 1], 
            c='orange', label='False Positives', alpha=0.6, edgecolor='k')
plt.scatter(pca_data[(true_binary == 0) & (pred_binary == 0), 0], pca_data[(true_binary == 0) & (pred_binary == 0), 1], 
            c='blue', label='True Negatives', alpha=0.6, edgecolor='k')
plt.scatter(pca_data[(true_binary == 1) & (pred_binary == 0), 0], pca_data[(true_binary == 1) & (pred_binary == 0), 1], 
            c='red', label='False Negatives', alpha=0.6, edgecolor='k')
plt.title("PCA Visualization with Decision Boundary")
plt.xlabel(f"Principal Component 1 ({variance_explained[0]:.2%} variance)")
plt.ylabel(f"Principal Component 2 ({variance_explained[1]:.2%} variance)")
plt.legend()
plt.grid(True)
plt.show()
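The PCA above is fitted on the raw feature values. Because PCA is driven by variance, a feature with a large numeric range can dominate the first component and inflate the explained-variance figure. The following sketch, on synthetic stand-in data (not the blood donor data), compares the explained variance with and without standardization; whether standardization is appropriate for this project is a judgment call.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: four independent features, one on a much larger scale
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 4))
X[:, 0] *= 100.0

# PCA on the raw values: the large-scale feature dominates
ratio_raw = PCA(n_components=2).fit(X).explained_variance_ratio_

# PCA after standardizing every feature to zero mean and unit variance
X_scaled = StandardScaler().fit_transform(X)
ratio_scaled = PCA(n_components=2).fit(X_scaled).explained_variance_ratio_

print(f"Raw features:          {ratio_raw.sum():.2%} in two components")
print(f"Standardized features: {ratio_scaled.sum():.2%} in two components")
```

If standardization were adopted, the same fitted scaler would have to be applied before `inverse_transform` results are passed to the anomaly detector, so that the model sees features on the scale it was trained on.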

## 2. Explainable Model
For decision support, your model should be explainable. Train a model with a focus on
explainability, with a structure that is as simple as possible while still maintaining its predictive power.
- Train a decision tree classifier on the imputed data. Evaluate your model's accuracy and visualize the tree structure to
help the hospital personnel understand the decision process. Each inference should output not only the class but also
the decision path taken. Make the tree as simple and understandable as possible.
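One way to approach this task is sketched below, assuming scikit-learn's `DecisionTreeClassifier` with a small `max_depth` to keep the tree readable. The data is a synthetic stand-in produced by `make_classification`; the feature names `f0` to `f3` and the helper `predict_with_path` are hypothetical. In the notebook, the imputed `df` and its `Category` column would take their place.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in for the imputed blood donor data
feature_names = ["f0", "f1", "f2", "f3"]
X_arr, y = make_classification(n_samples=300, n_features=4, n_informative=3,
                               n_redundant=1, random_state=42)
X = pd.DataFrame(X_arr, columns=feature_names)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

# A shallow tree trades some accuracy for readability
tree = DecisionTreeClassifier(max_depth=3, random_state=42)
tree.fit(X_train, y_train)
print(f"Test accuracy: {tree.score(X_test, y_test):.3f}")
print(export_text(tree, feature_names=feature_names))


def predict_with_path(model, sample):
    """Return the predicted class and the split rules visited for one sample."""
    t = model.tree_
    path = list(model.decision_path(sample).indices)  # node ids from root to leaf
    rules = []
    for node_id, next_id in zip(path[:-1], path[1:]):
        feature_idx = t.feature[node_id]
        value = sample.iloc[0, feature_idx]
        # The branch taken is read from the path itself
        op = "<=" if next_id == t.children_left[node_id] else ">"
        rules.append(f"{feature_names[feature_idx]} = {value:.2f} "
                     f"{op} {t.threshold[node_id]:.2f}")
    return model.predict(sample)[0], rules


sample = X_test.iloc[[0]]  # one-row DataFrame
pred, rules = predict_with_path(tree, sample)
print(f"Predicted class: {pred}")
for rule in rules:
    print("  ", rule)
```

`sklearn.tree.plot_tree` could be used in addition to `export_text` for a graphical view of the same structure.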

## 3. High Performance Model
This time the focus is on predictive power. Try to train a more accurate model. Is it worth
the effort?
- Train and optimize an XGBoost classifier on the imputed data.
- Use SHAP local explanation techniques on 5 selected data points and discuss the results.
- Use SHAP global explanation techniques to visualize and discuss the influence of different features.
- Evaluate the XGBoost model's accuracy and compare it to the Decision Tree.
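To clarify what the local and global SHAP explanations report, the sketch below computes exact Shapley values by brute force for a tiny hand-written model. It is not the XGBoost workflow itself: for the actual task, `shap.TreeExplainer` applied to the trained XGBoost model would be the usual route. The function `toy_model`, the random background data, and the helper `shapley_values` are hypothetical and serve only to illustrate the idea.

```python
from itertools import combinations
from math import factorial

import numpy as np


def shapley_values(predict, x, background):
    """Exact Shapley values of one sample x, relative to a background dataset."""
    n = len(x)

    def value(coalition):
        # Features in the coalition take the sample's values; the others keep
        # their background values. Predictions are averaged over the background.
        data = background.copy()
        idx = np.array(coalition, dtype=int)
        data[:, idx] = x[idx]
        return predict(data).mean()

    phi = np.zeros(n)
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                phi[i] += weight * (value(subset + (i,)) - value(subset))
    return phi


def toy_model(X):
    # Hypothetical model: one linear term and one interaction term
    return 2.0 * X[:, 0] + X[:, 1] * X[:, 2]


rng = np.random.default_rng(0)
background = rng.normal(size=(50, 3))
samples = background[:5]  # five selected data points

# Local explanations: one row of contributions per sample
local = np.array([shapley_values(toy_model, s, background) for s in samples])

# Global importance: mean absolute contribution per feature
global_importance = np.abs(local).mean(axis=0)

baseline = toy_model(background).mean()
print("Local contributions:\n", local)
print("Global importance:", global_importance)
```

For every sample, the local contributions add up to the model output minus the baseline (the average prediction over the background data). This additivity is what makes a SHAP plot for a single donor readable as "how far each feature pushed the prediction away from the average".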

## 4. Combined Model
Put all components into a single model artifact for deployment so that clinic personnel have all important
information at hand to make an informed decision.
- Combine the XGBoost, Decision Tree and Anomaly Detection in a single model class including all necessary methods (fit,
predict…). The Decision Tree provides explainable assistance for the hospital personnel, and the XGBoost (probably) a more
accurate classification. The Anomaly Detection increases the robustness of the model for conditions that have not been
explicitly trained or for human errors. Generate a few test anomalies to check your detection.
- Evaluate, discuss and plot the performance of your combined model.
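A minimal sketch of such a combined class is shown below. The class name `CombinedDonorModel`, its parameters, and the returned dictionary keys are hypothetical. To keep the example dependent on scikit-learn only, `GradientBoostingClassifier` stands in for XGBoost; an `xgboost.XGBClassifier` could be passed instead, since it offers the same `fit`/`predict` interface. The data is synthetic, with class 0 playing the role of the valid blood donors.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, IsolationForest
from sklearn.tree import DecisionTreeClassifier


class CombinedDonorModel:
    """Bundles an explainable classifier, a stronger classifier and an anomaly detector."""

    def __init__(self, explainable, performant, detector, normal_label=0):
        self.explainable = explainable
        self.performant = performant
        self.detector = detector
        self.normal_label = normal_label

    def fit(self, X, y):
        X, y = np.asarray(X), np.asarray(y)
        self.explainable.fit(X, y)
        self.performant.fit(X, y)
        # The anomaly detector is trained on the normal class only
        self.detector.fit(X[y == self.normal_label])
        return self

    def predict(self, X):
        X = np.asarray(X)
        return {
            "tree_class": self.explainable.predict(X),
            "boosted_class": self.performant.predict(X),
            "is_anomaly": self.detector.predict(X) == -1,  # -1 marks outliers
        }


# Synthetic stand-in data
X, y = make_classification(n_samples=400, n_features=5, n_informative=3,
                           n_redundant=0, random_state=42)

model = CombinedDonorModel(
    explainable=DecisionTreeClassifier(max_depth=3, random_state=42),
    performant=GradientBoostingClassifier(random_state=42),
    detector=IsolationForest(contamination=0.1, random_state=42),
).fit(X, y)

# Hand-made test anomalies: every feature lies far above the training range
rng = np.random.default_rng(0)
X_anomalies = X.max(axis=0) + 5.0 * X.std(axis=0) * (1 + rng.random((5, X.shape[1])))

result = model.predict(X_anomalies)
print(result)
```

Returning all three outputs side by side leaves the final decision to the clinic personnel: a sample flagged by `is_anomaly` can be sent for further testing regardless of what the two classifiers predict.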