"""
Machine Learning Homework 4
Done by:
Mariana Santana 106992
Pedro Leal 106154
LEIC-A
"""

#### Consider the breast_cancer dataset data = datasets.load_breast_cancer()  with binary target variable y=‘malignant’. Split it 70% for training and 30% for testing.

In [9]:
"""
General imports and variables for all exercises; run this cell before any other
"""
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, silhouette_score
from sklearn.mixture import GaussianMixture
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.linear_model import Ridge

data = load_breast_cancer()
X, y = data.data, data.target

#### 1) Perform logistic regression and indicate the accuracy. 

In [10]:
# 70/30 split; no random_state is set, so the split (and the numbers below) change between runs
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

log_reg = LogisticRegression(max_iter=10000)
log_reg.fit(X_train, y_train)

y_pred = log_reg.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)

print(f"Accuracy: {accuracy}")

Accuracy: 0.9532163742690059


#### 2) Perform EM clustering on the training data set with different number k of clusters. Evaluate the quality of the clusterings using Silhouette. Is the number of clusters correlated with the quality of clustering? Which is the optimal k? 

In [11]:
k_values = range(2, 10)

silhouette_scores = []

for k in k_values:
    gmm = GaussianMixture(n_components=k)
    gmm.fit(X_train)
    labels = gmm.predict(X_train)
    score = silhouette_score(X_train, labels)
    silhouette_scores.append(score)

print("Silhouette Scores:", silhouette_scores)
optimal_k = k_values[np.argmax(silhouette_scores)]
print("Optimal number of clusters:", optimal_k)

Silhouette Scores: [0.6953546812827253, 0.6460072951798395, 0.44240066859293997, 0.40975176230240173, 0.4342973431576532, 0.4192675100180935, 0.4273557130544327, 0.4203979188373632]
Optimal number of clusters: 2


Upon performing EM clustering, we obtained the following silhouette scores for k = 2 to 9 clusters, in order: 0.695, 0.646, 0.442, 0.410, 0.434, 0.419, 0.427, 0.420.

The overall silhouette score (the average of the silhouettes of the individual points of the dataset) is used to evaluate the quality of a clustering. An individual silhouette ranges between -1 and 1. Lower values mean that the point has probably been assigned to the wrong cluster, while higher values mean that the point is very close to its neighbors (points of the same cluster) and far from the points of other clusters, which indicates a better assignment. The overall silhouette of the model lies in the same range. As a general interpretation, low values suggest a poor organization of points into clusters, while values close to 1 correspond to a low intra-cluster distance (points in the same cluster are close to each other) and a high inter-cluster distance (clusters are far apart from each other).

Given this, when analysing the results, we concluded that the number of clusters affects the quality of the clustering, because it changes both cohesion and separation. With too few clusters, cohesion is low: each cluster groups points that are not very similar, and clusters may overlap. With too many clusters, the model overfits: clusters become too small and lose their generality. In our run the silhouette was highest for k=2 and dropped for larger k, although not strictly monotonically (for example, k=6 scored slightly higher than k=5).

After this analysis, we concluded that the model needs a balance between cohesion and separation, both of which are directly reflected in its silhouette. The highest silhouette value belongs to k=2 (0.6953546812827253), which is therefore the optimal number of clusters for this model and this data.
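To make the silhouette described above concrete, the following is a minimal sketch that computes it by hand on a tiny toy dataset and compares the result with scikit-learn. The data `X_toy`, the labels `labels_toy` and the helper `silhouette_by_hand` are hypothetical names introduced only for illustration; they are not part of the homework.

```python
import numpy as np
from sklearn.metrics import silhouette_samples, silhouette_score

# Hypothetical toy data: two tight, well-separated groups on a line.
X_toy = np.array([[0.0], [1.0], [10.0], [11.0]])
labels_toy = np.array([0, 0, 1, 1])


def silhouette_by_hand(X, labels):
    """s(i) = (b - a) / max(a, b), where a is the mean distance to the other
    points of the same cluster and b is the mean distance to the points of
    the nearest other cluster."""
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False  # exclude the point itself from its own cluster
        if not same.any():
            continue  # singleton cluster: the silhouette is taken as 0
        a = dist[i, same].mean()
        b = min(dist[i, labels == c].mean()
                for c in np.unique(labels) if c != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores


manual = silhouette_by_hand(X_toy, labels_toy)
print("per-point silhouettes (by hand):", manual)
print("per-point silhouettes (sklearn):", silhouette_samples(X_toy, labels_toy))
print("overall:", manual.mean(), silhouette_score(X_toy, labels_toy))
```

Because the two groups are compact and far apart, every point gets a silhouette close to 1, which is the situation the paragraph above associates with a good clustering.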

#### 3) Map the test set into probability values of the k-clusters. If you have a data point represented by a vector of dimension d, you will map it into a vector of dimension: prob=em_model.predict_proba(X)

In [12]:
em_model = GaussianMixture(n_components=optimal_k, random_state=42)
em_model.fit(X_train)

probabilities = em_model.predict_proba(X_test)
print(probabilities.shape, probabilities)

(171, 2) [[1.00000000e+000 1.02192945e-066]
 [9.99999992e-001 8.08797928e-009]
 [5.08112793e-004 9.99491887e-001]
 [1.00000000e+000 7.32600240e-034]
 [1.00000000e+000 2.85688613e-028]
 [1.00000000e+000 1.62876385e-017]
 [1.00000000e+000 5.34417998e-023]
 [1.00000000e+000 5.22037926e-020]
 [1.00000000e+000 5.18611708e-020]
 [1.00000000e+000 1.14222273e-010]
 [1.00000000e+000 5.54610072e-029]
 [1.00000000e+000 1.59916739e-067]
 [9.99993366e-001 6.63431066e-006]
 [9.99999336e-001 6.63514000e-007]
 [1.00000000e+000 1.56112641e-067]
 [9.99999983e-001 1.71323412e-008]
 [1.00000000e+000 1.79417817e-036]
 [1.00000000e+000 5.87774023e-039]
 [1.00000000e+000 2.27343543e-010]
 [1.00000000e+000 4.97578510e-054]
 [5.65397152e-016 1.00000000e+000]
 [1.00000000e+000 1.28628351e-072]
 [2.12388386e-080 1.00000000e+000]
 [1.00000000e+000 1.77690232e-044]
 [1.00000000e+000 1.04312028e-052]
 [1.00000000e+000 7.27519126e-127]
 [9.99987540e-001 1.24599426e-005]
 [2.28344948e-058 1.00000000e+000]
 [4.7217508
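
The mapping asked for in 3) takes a point of dimension d to a vector of dimension k, with one posterior probability per cluster. A minimal sketch on hypothetical toy data (the blobs in `X_toy`, the seeds and the value `k = 2` are assumptions chosen only for illustration) shows the change of shape and that each mapped row sums to 1:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Hypothetical toy data: two blobs in d = 5 dimensions.
X_toy = np.vstack([
    rng.normal(0.0, 1.0, size=(50, 5)),
    rng.normal(6.0, 1.0, size=(50, 5)),
])

k = 2
gmm = GaussianMixture(n_components=k, random_state=0).fit(X_toy)
proba = gmm.predict_proba(X_toy)  # one column per mixture component

print(X_toy.shape, "->", proba.shape)  # (100, 5) -> (100, 2)
print("rows sum to 1:", np.allclose(proba.sum(axis=1), 1.0))
```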

#### 4) Perform logistic regression on the mapped data set with the labels of the original test set. Indicate now the accuracy. Is there a relation between the number of clusters, the cluster evaluation and the accuracy of the logistic regression model?

In [13]:
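# Mapped representation: one column per cluster (posterior probabilities from 3)
X_mapped = probabilities

# As the exercise asks, fit on the mapped test set with the test labels
log_reg = LogisticRegression(random_state=42)
log_reg.fit(X_mapped, y_test)

# Predictions are made on the same data used for fitting,
# so this is a resubstitution accuracy rather than a held-out one
y_pred = log_reg.predict(X_mapped)

accuracy = accuracy_score(y_test, y_pred)
print(accuracy)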

0.8713450292397661


In general, increasing the number of clusters can improve clustering quality up to a certain point, because the data is divided into finer partitions that can capture more complex patterns.
Beyond this point, adding more clusters decreases clustering quality due to over-segmentation: clusters become too small and fragmented, often capturing noise rather than meaningful patterns.
In our experiment the silhouette was highest at k=2 and lower for every larger k, which indicates a relationship between the number of clusters and the cluster evaluation.

Higher-quality clusters indicate that data points within each cluster are more similar to one another, which can help the logistic regression model generalize better and lead to higher accuracy. This suggests a relationship between cluster quality (as measured by cluster evaluation metrics) and the accuracy of the logistic regression model. Note that we fitted the logistic regression only for k=2, so this relationship is an expectation rather than something we measured across several values of k.

To conclude, our results suggest a relationship between the number of clusters, the cluster evaluation and the accuracy of the logistic regression model.
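
One way to examine this relationship more directly is to repeat the mapping for several values of k and record both the silhouette and the accuracy. The sketch below is only an illustration under stated assumptions: it fixes `random_state=0`, restricts k to 2, 3 and 4, and, unlike the cell above, fits the classifier on the mapped training set and scores it on the mapped test set. The names `X_tr`, `X_te`, `y_tr`, `y_te` and `results` are introduced here and do not appear in the original cells.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, silhouette_score
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
# Assumption: a fixed seed, so that repeated runs give the same split.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

results = []  # one (k, silhouette, accuracy) tuple per value of k
for k in (2, 3, 4):
    gmm = GaussianMixture(n_components=k, random_state=0).fit(X_tr)
    labels = gmm.predict(X_tr)

    # The silhouette needs at least two non-empty clusters.
    if len(np.unique(labels)) > 1:
        sil = silhouette_score(X_tr, labels)
    else:
        sil = float("nan")

    # Logistic regression on the k-dimensional cluster probabilities.
    clf = LogisticRegression(max_iter=1000)
    clf.fit(gmm.predict_proba(X_tr), y_tr)
    acc = accuracy_score(y_te, clf.predict(gmm.predict_proba(X_te)))

    results.append((k, sil, acc))
    print(f"k={k}  silhouette={sil:.3f}  accuracy={acc:.3f}")
```

Comparing the two columns printed for each k shows whether a higher silhouette goes together with a higher accuracy on a given split; a single split is not enough to settle the question, so the numbers should be read as indicative only.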

#### 5) Train an RBF network using the clustering with optimal k  from 2).

In [14]:
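# Use the component means of the fitted mixture as RBF centres
centers = em_model.means_

# Hidden layer: one RBF activation per centre (gamma defaults to 1 / n_features)
X_rbf_transformed = rbf_kernel(X_test, centers)

# Linear output layer, fitted with ridge regression on the 0/1 labels
rbf_model = Ridge(alpha=1.0)
rbf_model.fit(X_rbf_transformed, y_test)

y_rbf_pred = rbf_model.predict(X_rbf_transformed)

# Threshold the regression output at 0.5 to obtain class labels
y_rbf_pred_binary = (y_rbf_pred >= 0.5).astype(int)

rbf_accuracy = accuracy_score(y_test, y_rbf_pred_binary)
print(rbf_accuracy)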

0.6374269005847953
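
The hidden layer used in 5) computes `exp(-gamma * ||x - c||^2)` for each point x and centre c. The sketch below checks this formula against `rbf_kernel` and compares the typical size of the activations with and without standardizing the features. It is only an illustration: the centres are the first two rows of the dataset, used as hypothetical stand-ins for `em_model.means_`, and the use of `StandardScaler` is an assumption that is not part of the original solution. The unscaled features of this dataset span very different ranges, which may push the activations towards zero and could be one reason for the low accuracy above.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.preprocessing import StandardScaler

X, _ = load_breast_cancer(return_X_y=True)
centers = X[:2]  # hypothetical centres, stand-ins for em_model.means_

# rbf_kernel uses gamma = 1 / n_features when gamma is not given.
gamma = 1.0 / X.shape[1]
sq_dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
manual = np.exp(-gamma * sq_dist)
print("matches rbf_kernel:", np.allclose(manual, rbf_kernel(X, centers)))

# Same activations after standardizing every feature (an assumption).
scaler = StandardScaler().fit(X)
scaled = rbf_kernel(scaler.transform(X), scaler.transform(centers))

print("median activation, raw features:   ", np.median(manual))
print("median activation, scaled features:", np.median(scaled))
```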


#### 6) Discuss your findings on a (up to) 5 page document.