In [None]:
"""
Machine Learning Homework 1
Done by:
Mariana Santana 106992
Pedro Leal 106154
LEIC-A
"""

#### II. Programming
#### Consider the heart-disease.csv dataset available at the course webpage’s homework tab. Using sklearn, apply a 5-fold stratified cross-validation with shuffling (random_state=0) for the assessment of predictive models along this section.

In [28]:
"""
General imports and variables for all exercises; run this cell before any other
"""
import pandas as pd, matplotlib.pyplot as plt, numpy as np
from sklearn.model_selection import cross_val_score, StratifiedKFold, train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import accuracy_score
from scipy.stats import ttest_rel

data = pd.read_csv('heart-disease.csv')

X = data.drop(columns='target')
y = data['target']

#### 1) Compare the performance of a 𝑘𝑁𝑁 with k=5 and a naïve Bayes with Gaussian assumption (consider all remaining parameters as default):
#### a. [1.0v] Plot two boxplots with the fold accuracies for each classifier. Is there one more stable than the other regarding performance? Why do you think that is the case? Explain.

In [None]:
# 5-fold stratified cross-validation with shuffling, shared by all exercises
stratified_kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

classifiers = [KNeighborsClassifier(n_neighbors=5), GaussianNB()]
classifier_names = ['kNN (k=5)', 'Naive Bayes (Gaussian)']

# One array of 5 fold accuracies per classifier (default scoring is accuracy)
classifier_accuracies = []
for classifier in classifiers:
    accuracies = cross_val_score(classifier, X, y, cv=stratified_kfold)
    classifier_accuracies.append(accuracies)

plt.boxplot(classifier_accuracies, patch_artist=True)
plt.xticks(np.arange(1, len(classifier_names) + 1), classifier_names)
plt.xlabel('Classifier')
plt.ylabel('Accuracy')
plt.title('5-fold Stratified Cross-Validation (with Shuffling)')
plt.show()


The boxplot shows that the Naive Bayes classifier is more stable than the kNN classifier: its fold accuracies have a smaller spread (interquartile range) and no extreme outliers, whereas the kNN accuracies cover a wider range.

Why is Naive Bayes more stable?

Nature of the classifier: Gaussian Naive Bayes assumes conditionally independent features and summarizes each class with one mean and one variance per feature. These estimates change little when a fifth of the data is swapped between train and test, so performance stays consistent across folds.

Impact of parameter selection: kNN with a small fixed k (here k=5) decides each prediction from only five training points, so moving a few observations between folds can change the neighborhoods and the predicted labels. This causes more variation in accuracy across folds.

Effect of data distribution: kNN classifies directly from the distances between data points, so overlapping classes and noisy points affect it more. In this exercise the features are also unscaled, so the distances are dominated by the attributes with the largest numerical ranges. Naive Bayes relies on estimated distributions, making it less sensitive to such variations.

Conclusion: Naive Bayes is more stable because it is a parametric, probabilistic model that depends little on individual data points, whereas kNN's performance varies more because of its sensitivity to the neighborhood structure of the dataset.
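The visual comparison of spread can be backed by numbers. The following is a minimal sketch that computes the sample standard deviation and the interquartile range of two sets of fold accuracies. The accuracy values are hypothetical placeholders, not the results of this notebook; in practice they would be the arrays returned by `cross_val_score`.

```python
import statistics

# Hypothetical fold accuracies for illustration only; replace them with the
# arrays stored in classifier_accuracies to analyse the real results.
knn_folds = [0.62, 0.70, 0.66, 0.59, 0.72]
nb_folds = [0.82, 0.84, 0.80, 0.83, 0.85]


def spread(values):
    """Return (sample standard deviation, interquartile range)."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return statistics.stdev(values), q3 - q1


knn_std, knn_iqr = spread(knn_folds)
nb_std, nb_iqr = spread(nb_folds)
print(f"kNN: std={knn_std:.4f}, IQR={knn_iqr:.4f}")
print(f"NB:  std={nb_std:.4f}, IQR={nb_iqr:.4f}")
```

A smaller standard deviation and IQR correspond to the shorter box and whiskers seen in the boxplot.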

#### b. [1.0v] Report the accuracy of both models, this time scaling the data with a Min-Max scaler before training the models. Explain the impact that this preprocessing step has on the performance of each model, providing an explanation for the results.

In [None]:
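# Note: the scaler is fit on the full dataset, so the min/max of each test fold
# also influence the scaling; X_scaled is reused in the cells below
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)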

classifier_accuracies_scaled = []
for classifier in classifiers:
    accuracies = cross_val_score(classifier, X_scaled, y, cv=stratified_kfold)
    classifier_accuracies_scaled.append(accuracies)

for name, accuracies in zip(classifier_names, classifier_accuracies_scaled):
    print(f'{name} Mean Accuracy: {np.mean(accuracies):.4f}')

plt.boxplot(classifier_accuracies_scaled, patch_artist=True)
plt.xticks(np.arange(1, len(classifier_names) + 1), classifier_names)
plt.xlabel('Classifier')
plt.ylabel('Accuracy')
plt.title('5-fold Stratified Cross-Validation with Min-Max Scaling')
plt.show()

Impact of Min-Max scaling on each model

kNN classifier:
Impact: kNN with default parameters relies on the Euclidean distance between points. When the features are not scaled, attributes with larger numerical ranges dominate the distance calculations, which biases the classifier towards those features. Its performance is therefore expected to improve after scaling.
Result: Scaling maps every feature to the [0, 1] range, so no feature dominates the distance purely because of its units. This leads to more meaningful neighborhoods and potentially improved classification accuracy.

Naive Bayes (Gaussian):
Impact: The performance of Naive Bayes is not expected to change significantly after scaling. Gaussian Naive Bayes fits one normal distribution per feature and class, and Min-Max scaling is a linear shift and rescale of each feature: the fitted means and variances transform with the data, so the relative likelihoods of the classes stay the same.
Result: Scaling has at most a minimal effect on Naive Bayes' performance. Any small difference comes from numerical details such as scikit-learn's variance smoothing (var_smoothing), not from a change in the fitted model.
Explanation of the results

kNN performance increase: With Min-Max scaling, the kNN model's performance improves because distances are now measured on a comparable scale for all features. When features have varying scales, the ones with a larger range dominate the distance calculations, skewing the classifier's behavior.
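How a large-range feature dominates the Euclidean distance can be seen on three made-up points. This is a minimal sketch: the patients, the two features (age and cholesterol) and the assumed minimum and maximum of each feature are illustrative choices, not values read from heart-disease.csv.

```python
import math

# Hypothetical patients described by (age in years, cholesterol in mg/dl).
a = (40.0, 200.0)
b = (70.0, 205.0)  # very different age, similar cholesterol
c = (41.0, 260.0)  # similar age, different cholesterol

# Assumed (min, max) of each feature, used for Min-Max scaling.
ranges = [(29.0, 77.0), (126.0, 564.0)]


def min_max(point):
    """Scale each coordinate to [0, 1] with the assumed feature ranges."""
    return tuple((v - lo) / (hi - lo) for v, (lo, hi) in zip(point, ranges))


raw_ab, raw_ac = math.dist(a, b), math.dist(a, c)
scaled_ab = math.dist(min_max(a), min_max(b))
scaled_ac = math.dist(min_max(a), min_max(c))

print(f"raw:    d(a,b)={raw_ab:.2f}  d(a,c)={raw_ac:.2f}")
print(f"scaled: d(a,b)={scaled_ab:.3f}  d(a,c)={scaled_ac:.3f}")
```

On the raw values the cholesterol difference decides which point is closer to `a`; after scaling the ordering flips, because a 30-year age gap is large relative to the age range.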

Naive Bayes stability: Gaussian Naive Bayes is essentially insensitive to a linear rescaling of the data, since the class-conditional distributions it estimates are rescaled together with the features. Thus, scaling has little impact on its overall performance.
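The insensitivity of Gaussian Naive Bayes to Min-Max scaling can be checked on a one-feature toy problem. The sketch below is a hand-written Gaussian Naive Bayes without variance smoothing, so it is not scikit-learn's `GaussianNB`; the training values and the query point are hypothetical.

```python
import math
import statistics


def gaussian_log_pdf(x, mean, var):
    """Log-density of a normal distribution with the given mean and variance."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)


def posterior_class1(x, class0, class1):
    """P(class 1 | x) for a one-feature Gaussian naive Bayes."""
    n0, n1 = len(class0), len(class1)
    log0 = math.log(n0 / (n0 + n1)) + gaussian_log_pdf(
        x, statistics.fmean(class0), statistics.pvariance(class0))
    log1 = math.log(n1 / (n0 + n1)) + gaussian_log_pdf(
        x, statistics.fmean(class1), statistics.pvariance(class1))
    return 1.0 / (1.0 + math.exp(log0 - log1))


# Hypothetical training values of a single feature, per class.
class0 = [120.0, 130.0, 125.0, 140.0]
class1 = [150.0, 160.0, 145.0, 170.0]
query = 142.0

lo, hi = min(class0 + class1), max(class0 + class1)


def scale(v):
    """Min-Max scaling with the training minimum and maximum."""
    return (v - lo) / (hi - lo)


p_raw = posterior_class1(query, class0, class1)
p_scaled = posterior_class1(
    scale(query), [scale(v) for v in class0], [scale(v) for v in class1])
print(f"raw: {p_raw:.6f}  scaled: {p_scaled:.6f}")
```

Scaling changes both class log-likelihoods by the same constant, which cancels in the posterior, so the two probabilities agree up to floating-point error.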

#### c. [1.0v] Using scipy, test the hypothesis “the model kNN is statistically superior to Naïve Bayes regarding accuracy”, asserting whether it is true.

In [None]:
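# Fold accuracies on the scaled data; both models see the same folds, so the samples are paired
kNN_accuracies = cross_val_score(classifiers[0], X_scaled, y, cv=stratified_kfold)
NB_accuracies = cross_val_score(classifiers[1], X_scaled, y, cv=stratified_kfold)

# One-sided paired t-test: H0 "kNN accuracy <= NB accuracy", H1 "kNN accuracy > NB accuracy"
t_stat, p_value = ttest_rel(kNN_accuracies, NB_accuracies, alternative='greater')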

print(f"kNN Mean Accuracy: {np.mean(kNN_accuracies):.4f}")
print(f"Naive Bayes Mean Accuracy: {np.mean(NB_accuracies):.4f}")
print(f"t-statistic: {t_stat:.4f}, p-value: {p_value:.4f}")

alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis. kNN is statistically superior to Naive Bayes in terms of accuracy.")
else:
    print("Fail to reject the null hypothesis. kNN is not statistically superior to Naive Bayes.")
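To make the test less of a black box, the paired t statistic can be computed by hand from the per-fold differences. The sketch below uses hypothetical fold accuracies (not the values obtained in this notebook) and compares the statistic with the one-sided critical value of Student's t for 4 degrees of freedom at alpha = 0.05 instead of computing a p-value.

```python
import math
import statistics

# Hypothetical paired fold accuracies, for illustration only.
knn = [0.80, 0.84, 0.78, 0.82, 0.81]
nb = [0.83, 0.85, 0.80, 0.84, 0.86]

# The paired test works on the per-fold differences.
diffs = [a - b for a, b in zip(knn, nb)]
n = len(diffs)
t_stat = statistics.fmean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))

# One-sided critical value of Student's t with n - 1 = 4 degrees of freedom.
T_CRIT = 2.132
reject = t_stat > T_CRIT

print(f"t-statistic: {t_stat:.4f}, reject H0: {reject}")
```

With `alternative='greater'`, a negative statistic (kNN worse on average) always yields a p-value above 0.5, so the null hypothesis cannot be rejected in that case.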

#### 2) Using a 80-20 train-test split, vary the number of neighbors of a 𝑘𝑁𝑁 classifier using 𝑘 = {1, 5, 10, 20, 30}. Additionally, for each k, train one classifier using uniform weights and distance weights.
#### a. [1.0v] Plot the train and test accuracy for each model.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=0)

# Define k values to test
k_values = [1, 5, 10, 20, 30]

# Store accuracies for plotting
train_accuracies_uniform = []
test_accuracies_uniform = []
train_accuracies_distance = []
test_accuracies_distance = []

# Train and evaluate kNN classifiers for each k with different weights
for k in k_values:
    # kNN with uniform weights
    knn_uniform = KNeighborsClassifier(n_neighbors=k, weights='uniform')
    knn_uniform.fit(X_train, y_train)
    
    # kNN with distance weights
    knn_distance = KNeighborsClassifier(n_neighbors=k, weights='distance')
    knn_distance.fit(X_train, y_train)
    
    # Calculate train and test accuracy for uniform weights
    train_accuracy_uniform = accuracy_score(y_train, knn_uniform.predict(X_train))
    test_accuracy_uniform = accuracy_score(y_test, knn_uniform.predict(X_test))
    
    # Calculate train and test accuracy for distance weights
    train_accuracy_distance = accuracy_score(y_train, knn_distance.predict(X_train))
    test_accuracy_distance = accuracy_score(y_test, knn_distance.predict(X_test))
    
    # Append accuracies for plotting
    train_accuracies_uniform.append(train_accuracy_uniform)
    test_accuracies_uniform.append(test_accuracy_uniform)
    train_accuracies_distance.append(train_accuracy_distance)
    test_accuracies_distance.append(test_accuracy_distance)

# Plot the train and test accuracies for both weighting schemes
plt.figure(figsize=(12, 6))

# Plot for uniform weights
plt.subplot(1, 2, 1)
plt.plot(k_values, train_accuracies_uniform, marker='o', linestyle='-', color='blue', label='Train Accuracy')
plt.plot(k_values, test_accuracies_uniform, marker='o', linestyle='--', color='orange', label='Test Accuracy')
plt.title('kNN with Uniform Weights')
plt.xlabel('Number of Neighbors (k)')
plt.ylabel('Accuracy')
plt.legend()
plt.grid(True)

# Plot for distance weights
plt.subplot(1, 2, 2)
plt.plot(k_values, train_accuracies_distance, marker='o', linestyle='-', color='blue', label='Train Accuracy')
plt.plot(k_values, test_accuracies_distance, marker='o', linestyle='--', color='orange', label='Test Accuracy')
plt.title('kNN with Distance Weights')
plt.xlabel('Number of Neighbors (k)')
plt.ylabel('Accuracy')
plt.legend()
plt.grid(True)

# Show the plots
plt.suptitle('Train and Test Accuracy for kNN Models with Different k Values')
plt.show()

#### b. [1.5v] Explain the impact of increasing the neighbors on the generalization ability of the models.

Analysis

a) Explanation of the plot:
The plots show the training and test accuracies for both uniform and distance weighting strategies as a function of the number of neighbors (k). The solid lines represent training accuracies, and the dashed lines represent test accuracies.

b) Impact of increasing the number of neighbors (k):

General trends:
With uniform weights, the training accuracy decreases as the number of neighbors (k) increases. This is expected because a larger k means that the model considers a broader neighborhood, leading to a smoother decision boundary that is less overfitted to the training data.
With distance weights, the training accuracy is expected to stay at 100% for every k: each training point is its own nearest neighbor at distance zero, so it dominates the vote (unless identical feature vectors carry different labels). Training accuracy is therefore not a useful indicator of overfitting for this weighting.
Test accuracy typically increases at first and then plateaus or even decreases. With a small k, the model can overfit (high training accuracy but lower test accuracy). As k increases, the model generalizes better, but beyond a certain point too many neighbors cause underfitting, which reduces the test accuracy.

Comparison between uniform and distance weights:
With uniform weights, all k neighbors contribute equally, so for large k distant points count as much as the closest ones.
With distance weights, closer neighbors have a higher impact on the decision, which makes the prediction less sensitive to the exact choice of k.
As a result, distance-weighted models can retain better test accuracy at large k, even though they always fit the training set perfectly.

Conclusion:
Increasing k typically improves generalization up to a certain point, after which too many neighbors cause underfitting.
Using distance weights can mitigate the issues with high values of k, leading to better overall performance.
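The behavior of the training accuracy under the two weighting schemes can be reproduced with a small brute-force kNN. This is a sketch under stated assumptions: `knn_predict` is a hypothetical helper written for this illustration, the toy points are made up, and the rule that zero-distance neighbors take the whole vote is meant to mimic how distance weighting is commonly implemented.

```python
import math
from collections import defaultdict


def knn_predict(train_X, train_y, query, k, weights="uniform"):
    """Brute-force kNN vote; a sketch, not scikit-learn's implementation."""
    nearest = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )[:k]
    votes = defaultdict(float)
    exact = [label for dist, label in nearest if dist == 0.0]
    if weights == "distance" and exact:
        # Neighbors at distance zero take the whole vote.
        for label in exact:
            votes[label] += 1.0
    else:
        for dist, label in nearest:
            votes[label] += 1.0 / dist if weights == "distance" else 1.0
    return max(votes, key=votes.get)


def train_accuracy(train_X, train_y, k, weights):
    """Accuracy of the classifier on its own training points."""
    hits = sum(
        knn_predict(train_X, train_y, x, k, weights) == label
        for x, label in zip(train_X, train_y)
    )
    return hits / len(train_y)


# Hypothetical toy data: one class-1 point sits inside the class-0 cluster.
toy_X = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (0.05, 0.05),
         (1.0, 1.0), (1.1, 1.0), (1.0, 1.1)]
toy_y = [0, 0, 0, 0, 1, 1, 1, 1]

results = {
    (k, w): train_accuracy(toy_X, toy_y, k, w)
    for k in (1, 5) for w in ("uniform", "distance")
}
for (k, w), acc in results.items():
    print(f"k={k}, weights={w}: train accuracy = {acc:.3f}")
```

With uniform weights the isolated class-1 point is outvoted by its class-0 neighbors once k grows, while with distance weights every training point recovers its own label regardless of k.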

#### 3) [1.5v] Considering the unique properties of the heart-disease.csv dataset, identify two possible difficulties of the naïve Bayes model used in the previous exercises when learning from the given dataset.

The heart-disease dataset poses some specific challenges for a naïve Bayes model due to its unique properties. Two difficulties stand out:

1. Feature independence assumption:
The core assumption of a naïve Bayes model is that all features are independent of each other, given the class label. However, in medical datasets like heart-disease, many features are interdependent. For example, blood pressure (trestbps), cholesterol level (chol), and age (age) are often correlated. Similarly, features like thalach (maximum heart rate) and exang (exercise-induced angina) can be closely linked to each other and to the presence of heart disease.
When this assumption is violated, the model counts correlated evidence more than once, which can lead to overconfident probability estimates and suboptimal performance.

2. Gaussian assumption on every feature:
The dataset mixes continuous variables (e.g., age, trestbps, chol, thalach, and oldpeak) with binary or categorical ones (e.g., exang), yet Gaussian naïve Bayes models every feature with one normal distribution per class. A bell curve is a poor fit for a feature that only takes a few discrete values.
Even the continuous variables need not follow a normal distribution: outliers or skewed distributions in features like cholesterol or age can lead to inaccurate probability estimates, reducing the overall classification performance.

In summary, the naïve Bayes model's assumption of feature independence and its Gaussian modeling of every variable could limit its effectiveness on the heart-disease dataset.
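The mismatch between a normal distribution and a discrete feature can be made concrete. The sketch below fits a Gaussian to a hypothetical 0/1 feature (the values are made up, not taken from heart-disease.csv) and measures how much probability mass the fitted curve places on values that can never occur.

```python
import statistics

# Hypothetical binary feature for one class (e.g. a 0/1 indicator).
binary_feature = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]

# Fit the normal distribution that a Gaussian model would use for this feature.
mu = statistics.fmean(binary_feature)
sigma = statistics.pstdev(binary_feature)
fitted = statistics.NormalDist(mu, sigma)

# Probability mass on impossible values (below 0 or above 1).
mass_below_zero = fitted.cdf(0.0)
mass_above_one = 1.0 - fitted.cdf(1.0)

print(f"mean={mu:.2f}, std={sigma:.3f}")
print(f"mass below 0: {mass_below_zero:.3f}, mass above 1: {mass_above_one:.3f}")
```

A noticeable share of the fitted density falls outside {0, 1}, which illustrates why the per-feature likelihoods can be inaccurate for such attributes.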