# Exercise 1

Consider the dataset mammography.mat that you can download from http://odds.cs.stonybrook.edu/mammography-dataset/ (http://odds.cs.stonybrook.edu/mammography- dataset/). Below you can find the code to load this file into X and Y , you just need to put the file in your data folder. During mammography the doctor would be looking for something called
calcification , which is calcium deposits in the breast tissue and is used as an indicator of cancer. In this dataset the X corresponds to some measurements, there are 6 features. The Y is a 0-1 label where 1 represents calcification and 0 does not.
1. Split the data into three parts, train/test/validation where train is 60% of the data, test is 15% and validation is 25% of the data. Do not do this randomly, I have prepared a shuffling with a fixed seed, this way I can make sure that we all did the same splits. That is [train,test,validation] is the splitting layout.
2. Train a machine learning model on the training data (you are free to choose it yourself). Hint: you could always try LogisticRegression , but for it to work well you would need
StandardScaler and put everything in a Pipeline .
3. Use the classification report from Utils and compute the intervals for precision-recall etc on
the test set with union_bound correction with 95% confidence.
4. You are interested in minimizing the average cost of your classifier, but first we must define it:
If someone is calcified but classified as not, we say it costs 30 (this is the worst scenario) If someone is not calcified but classified as calcified, we say it costs 5 (this probably only costs extra investigation)
If someone is calcified but classified as calcified, costs 0 (We did the right thing, no cost) If someone is not calcified but classified as not, costs 0 (We did the right thing, no cost).
complete filling the function cost to compute the cost of a prediction model under a certain prediction threshold (recall our precision recall lecture and the predict_proba function from trained models). What would be the cost of having a model that always classifies someone as not calcified on the test set?
1. Now, we wish to select the threshold of our classifier that minimizes the cost, we do that by checking say 100 evenly spaced proposal thresholds between 0 and 1, and use our testing data to compute the cost.
2. With your newly computed threshold value, compute the cost of putting this model in production by computing the cost using the validation data. Also provide a confidence interval of the cost using Hoeffdings inequality with a 95% confidence.
3. Let 𝑡 be the threshold you found and 𝑓 the model you fitted, if we define the random variable 𝐶 = 30(1 − 1𝑓(𝑋)≥𝑡 )𝑌 + 5(1 − 𝑌 )1𝑓(𝑋)≥𝑡
then 𝐶 denotes the cost of a randomly chosen patient. In the above we are estimating 𝔼[𝐶] using the empirical mean. However, since the number of calcifications in the population is fairly small and our classifier probably has a fairly high precision for the 0 class, then 𝐶 should have fairly small variance. Compute the empirical variance of 𝐶 on the validation set. What would be the confidence interval if we used Bennett's inequality instead of Hoeffding in point 6 but with the computed empirical variance as our guess for the variance?

In [3]:
import scipy.io as so
import numpy as np
data = so.loadmat('data/mammography.mat')
np.random.seed(0)
shuffle_index = np.arange(0,len(data['X']))
np.random.shuffle(shuffle_index)
X = data['X'][shuffle_index,:]
Y = data['y'][shuffle_index,:].flatten()

In [4]:
#Part1
# Split the X,Y into three parts, make sure that the sizes are
# (6709, 6), (1677, 6), (2797, 6), (6709,), (1677,), (2797,)
# Split the data: 
# 60% for training, 15% for testing, and 25% for validation
train_size = int(0.6 * len(X))
test_size = int(0.15 * len(X))
valid_size = len(X) - train_size - test_size

# Split the data into train, test, and validation sets
X_train = X[:train_size, :]
X_test = X[train_size:train_size + test_size, :]
X_valid = X[train_size + test_size:, :]

Y_train = Y[:train_size]
Y_test = Y[train_size:train_size + test_size]
Y_valid = Y[train_size + test_size:]

# The sizes of the datasets
print(X_train.shape, X_test.shape, X_valid.shape)
print(Y_train.shape, Y_test.shape, Y_valid.shape)
#X_train, X_test, X_valid, Y_train, Y_test, Y_valid = XXX

(6709, 6) (1677, 6) (2797, 6)
(6709,) (1677,) (2797,)


In [7]:
#Part2
# Use X_train,Y_train to train a model from sklearn. Make sure it h as the predict_proba function
# for instance LogisticRegression
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# Create a pipeline with StandardScaler and LogisticRegression
model = Pipeline([
    ('scaler', StandardScaler()),         # Normalize the data
    ('logreg', LogisticRegression())      # Logistic Regression Model
])

# Train the model using the training data
model.fit(X_train, Y_train)

# The model is now trained and ready to make predictions
# You can use model.predict_proba() for probabilistic predictions


Pipeline(steps=[('scaler', StandardScaler()), ('logreg', LogisticRegression())])

In [15]:
#Part3
# Compute precision and recall on the test set using

# Assuming classification_report_interval is already imported from Utils
from Utils import classification_report_interval

# Make predictions on the test set
y_pred = model.predict(X_test)
import re

# Function to extract precision and recall from the classification report string
def extract_metrics(report):
    # Define a regex pattern to extract precision, recall, and their confidence intervals
    pattern = r'(\d)\s+(\d\.\d{2})\s*:\s*\[(\d\.\d{2}),(\d\.\d{2})\]\s+(\d\.\d{2})\s*:\s*\[(\d\.\d{2}),(\d\.\d{2})\]'
    
    # Find all matches for the pattern
    matches = re.findall(pattern, report)
    
    # Parse the precision and recall for both classes (0 and 1)
    precision0 = (float(matches[0][1]), float(matches[0][2]))
    recall0 = (float(matches[0][4]), float(matches[0][5]))
    
    precision1 = (float(matches[1][1]), float(matches[1][2]))
    recall1 = (float(matches[1][4]), float(matches[1][5]))
    
    return precision0, recall0, precision1, recall1

# Get the report from the classification_report_interval function
report = classification_report_interval(Y_test, y_pred, alpha=0.05)

# Extract precision and recall values
precision0, recall0, precision1, recall1 = extract_metrics(report)

# Print the extracted precision and recall
print(f"Precision for class 0: {precision0}")
print(f"Recall for class 0: {recall0}")
print(f"Precision for class 1: {precision1}")
print(f"Recall for class 1: {recall1}")




Precision for class 0: (0.98, 0.94)
Recall for class 0: (1.0, 0.96)
Precision for class 1: (0.7, 0.33)
Recall for class 1: (0.36, 0.09)


In [16]:
 #Part4

def cost(model, threshold, X, Y):
    # Get predicted probabilities for the positive class (calcified)
    pred_proba = model.predict_proba(X)[:, 1]
    
    # Apply threshold to get the binary predictions
    predictions = (pred_proba >= threshold) * 1
    
    # Calculate True Positives, False Positives, True Negatives, False Negatives
    TP = ((predictions == 1) & (Y == 1)).sum()  # Predicted 1, True 1
    FP = ((predictions == 1) & (Y == 0)).sum()  # Predicted 1, True 0
    TN = ((predictions == 0) & (Y == 0)).sum()  # Predicted 0, True 0
    FN = ((predictions == 0) & (Y == 1)).sum()  # Predicted 0, True 1
    
    # Define costs
    cost_FN = 30  # Cost for False Negative (worst case)
    cost_FP = 10  # Cost for False Positive
    
    # Compute total cost
    total_cost = (FP * cost_FP) + (FN * cost_FN)
    
    # Compute the average cost per sample
    average_cost = total_cost / len(Y)
    
    return average_cost

# Example usage:
threshold = 0.5  # You can adjust the threshold as needed
average_cost = cost(model, threshold, X_test, Y_test)
print(f"Average cost: {average_cost}")


Average cost: 0.48300536672629696


In [17]:
#Part4
# Fill in the naive cost of the model that would classify all as non-calcified on the test set
def naive_cost(Y):
    # False Negatives: Y=1 (calcified), but predicted as 0 (non-calcified)
    FN = (Y == 1).sum()  # All calcified samples will be False Negatives in the naive model
    
    # Define cost for False Negative
    cost_FN = 30  # Cost for False Negative (worst case)
    
    # Compute total naive cost
    total_naive_cost = FN * cost_FN
    
    # Compute the naive average cost per sample
    naive_average_cost = total_naive_cost / len(Y)
    
    return naive_average_cost

# Example usage:
naive_cost_value = naive_cost(Y_test)  # Assuming Y_test contains the true labels
print(f"Naive cost: {naive_cost_value}")


Naive cost: 0.6976744186046512


In [20]:
#Part5

# Now, we need to find the optimal threshold that minimizes the cost.
# We'll check 100 evenly spaced thresholds between 0 and 1.

thresholds = np.linspace(0, 1, 100)
costs = [cost(model, threshold, X_test, Y_test) for threshold in thresholds]

# Find the threshold with the minimum cost
optimal_threshold = thresholds[np.argmin(costs)]
cost_at_optimal_threshold = min(costs)

print(f"Optimal threshold: {optimal_threshold}")
print(f"Cost at optimal threshold: {cost_at_optimal_threshold}")


Optimal threshold: 0.19191919191919193
Cost at optimal threshold: 0.3935599284436494


In [26]:
#Part6

# Compute the cost at optimal threshold on the validation set

# Find the optimal threshold (the one with the minimum cost)
optimal_threshold = thresholds[np.argmin(costs)]

# Get the cost at the optimal threshold on the validation set
cost_at_optimal_threshold_validation = min(costs)

# Report the cost interval as a tuple cost_interval = (a,b) cost_interval = XXX
# The code below will tell you if you filled in the intervals corre ctly
assert(type(cost_interval) == tuple)
assert(len(cost_interval) == 2)

ValueError: attempt to get argmin of an empty sequence

In [None]:
#Part7 

import numpy as np

# Define cost values for False Negatives (FN) and False Positives (FP)
cost_FN = 30  # Cost when a calcified sample is classified as non-calcified
cost_FP = 10  # Cost when a non-calcified sample is classified as calcified

# Ensure the model has been trained and is able to make predictions
if hasattr(model, 'predict_proba'):
    # Get predicted probabilities for the positive class (calcified)
    pred_proba = model.predict_proba(X_valid)[:, 1]  # Probabilities for class 1 (calcified)

    # Get binary predictions based on the optimal threshold (which is already computed)
    predictions = (pred_proba >= optimal_threshold) * 1

    # Calculate True Positives, False Positives, True Negatives, False Negatives
    TP = ((predictions == 1) & (Y_valid == 1)).sum()  # Predicted 1, True 1
    FP = ((predictions == 1) & (Y_valid == 0)).sum()  # Predicted 1, True 0
    TN = ((predictions == 0) & (Y_valid == 0)).sum()  # Predicted 0, True 0
    FN = ((predictions == 0) & (Y_valid == 1)).sum()  # Predicted 0, True 1
    
    # Compute total cost at the optimal threshold
    total_cost = (FP * cost_FP) + (FN * cost_FN)
    
    # Compute average cost per sample
    average_cost = total_cost / len(Y_valid)
    
    # Apply Hoeffding's inequality to compute the confidence interval
    n = len(Y_valid)  # Sample size
    alpha = 0.05  # 95% confidence
    l = np.sqrt(np.log(2 / alpha) / (2 * n))  # Hoeffding's bound for margin of error
    
    # Compute the confidence interval around the cost
    lower_bound = average_cost - l
    upper_bound = average_cost + l
    cost_interval = (lower_bound, upper_bound)

    # Print the results
    print(f"Cost at optimal threshold on validation set: {average_cost}")
    print(f"Confidence interval for the cost at 95% confidence: {cost_interval}")
else:
    raise ValueError("The model does not have the 'predict_proba' method. Ensure it is trained.")


In [27]:
import numpy as np

# Calculate the cost C for each sample in the validation set
def compute_cost(C_pred, Y, t):
    return 30 * (1 - C_pred) * Y + 5 * (1 - Y) * C_pred

# Ensure that the model has been trained and we can make predictions
if hasattr(model, 'predict_proba'):
    # Get predicted probabilities for the positive class (calcified)
    pred_proba = model.predict_proba(X_valid)[:, 1]  # Probabilities for class 1 (calcified)
    
    # Get binary predictions based on the optimal threshold (which is already computed)
    C_pred = (pred_proba >= optimal_threshold) * 1
    
    # Compute the cost C for each sample
    C = compute_cost(C_pred, Y_valid, optimal_threshold)

    # Compute empirical mean of C
    C_mean = np.mean(C)

    # Compute empirical variance of C
    variance_of_C = np.var(C)

    # Compute the margin of error for the 95% confidence interval using Bennett's inequality
    n = len(C)  # Sample size
    alpha = 0.05  # 95% confidence
    l = np.sqrt((2 * variance_of_C * np.log(2 / alpha)) / n)

    # Compute the confidence interval for the cost using Bennett's inequality
    interval_of_C = (C_mean - l, C_mean + l)

    # Print the results
    print(f"Empirical variance of C: {variance_of_C}")
    print(f"Confidence interval for the cost using Bennett's inequality: {interval_of_C}")
else:
    raise ValueError("The model does not have the 'predict_proba' method. Ensure it is trained.")


Empirical variance of C: 8.48486854946608
Confidence interval for the cost using Bennett's inequality: (0.1739585599378358, 0.47316335640109874)
