"""
Machine Learning Homework 4
Done by:
Mariana Santana 106992
Pedro Leal 106154
LEIC-A
"""

#### Consider the breast_cancer dataset data = datasets.load_breast_cancer()  with binary target variable y=‘malignant’. Split it 70% for training and 30% for testing.

In [1]:
"""
General imports and variables for all exercises; run this cell before any other
"""
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, silhouette_score
from sklearn.mixture import GaussianMixture
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.linear_model import Ridge

data = load_breast_cancer()
X, y = data.data, data.target

#### 1) Perform logistic regression and indicate the accuracy. 

In [2]:
# Hold out 30% of the data for testing (the split is not seeded, so results vary between runs)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

log_reg = LogisticRegression(max_iter=10000)
log_reg.fit(X_train, y_train)

y_pred = log_reg.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)

print(f"Accuracy: {accuracy}")

Accuracy: 0.935672514619883


#### 2) Perform EM clustering on the training data set with different number k of clusters. Evaluate the quality of the clusterings using Silhouette. Is the number of clusters correlated with the quality of clustering? Which is the optimal k? 

In [3]:
k_values = range(2, 10)

silhouette_scores = []

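for k in k_values:
    # Fit a Gaussian mixture with EM (initialization is not seeded)
    gmm = GaussianMixture(n_components=k)
    gmm.fit(X_train)
    # Score the hard cluster assignments with the mean silhouette
    labels = gmm.predict(X_train)
    score = silhouette_score(X_train, labels)
    silhouette_scores.append(score)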

print("Silhouette Scores:", silhouette_scores)
optimal_k = k_values[np.argmax(silhouette_scores)]
print("Optimal number of clusters:", optimal_k)

Silhouette Scores: [0.6156295727768908, 0.51306032059586, 0.5008310013511474, 0.43666220370687076, 0.43722891298798, 0.4386005228373668, 0.4546850794327086, 0.43917931395798343]
Optimal number of clusters: 2


Upon performing EM clustering, we obtained the following silhouette scores for k = 2 to 9 clusters, in order: [0.616, 0.513, 0.501, 0.437, 0.437, 0.439, 0.455, 0.439]. These are the values of the run printed above, rounded to 3 decimals; because neither the train/test split nor the mixture initialization is seeded, the exact values change from run to run.

We use the overall silhouette score (the average of the silhouettes of the individual points of the dataset) to evaluate the quality of the clustering. An individual silhouette ranges from -1 to 1: values near -1 mean that the point has probably been assigned to the wrong cluster, values near 0 mean that it lies close to the boundary between two clusters, and values near 1 mean that it is close to the points of its own cluster and far from the points of the other clusters. The overall silhouette has the same range. In general, low values suggest a poor organization of points into clusters, while values close to 1 correspond to low intra-cluster distances (points in the same cluster are close to each other) and high inter-cluster distances (clusters are far apart from each other).

Given this, when analysing the varying results, we concluded that the number of clusters impacts the quality of the clustering, because it affects both cohesion and separation. With too few clusters, cohesion is low: dissimilar points are grouped together and distinct groups may be merged. With too many clusters, separation is low: natural groups are split into small neighbouring clusters that lose their generality, which resembles overfitting. In this run the silhouette drops from 0.616 at k=2 to about 0.44 at k=5 and then stays roughly flat, so the relationship is decreasing but not strictly monotonic.

After this analysis, we concluded that a good clustering balances cohesion and separation, both of which are reflected in the silhouette. The highest silhouette value belongs to k=2 (0.616), which is therefore the optimal number of clusters for this model and this data. This result is plausible for this exercise because there are 2 possible outcomes (benign and malignant), although the clustering itself never sees the labels.
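
To make the definition of the silhouette concrete, the following is a minimal sketch that computes per-point silhouettes with NumPy only. The function name `silhouette_values` and the toy data are hypothetical and chosen for illustration; it assumes Euclidean distance and at least two clusters, and it is not the library's implementation.

```python
import numpy as np

def silhouette_values(X, labels):
    """Per-point silhouettes s(i) = (b - a) / max(a, b) with Euclidean distance."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Pairwise Euclidean distances between all points
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    values = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False  # exclude the point itself
        if not same.any():
            continue  # singleton cluster: silhouette is taken as 0
        a = dists[i, same].mean()  # cohesion: mean distance to own cluster
        # separation: mean distance to the nearest other cluster
        b = min(dists[i, labels == other].mean()
                for other in np.unique(labels) if other != labels[i])
        values[i] = (b - a) / max(a, b)
    return values

# Toy data (hypothetical): two tight, well-separated groups on a line
X_toy = np.array([[0.0], [1.0], [10.0], [11.0]])
good = silhouette_values(X_toy, [0, 0, 1, 1])  # labels follow the groups
bad = silhouette_values(X_toy, [0, 1, 0, 1])   # labels cut across the groups
print("mean silhouette, good labels:", good.mean())
print("mean silhouette, bad labels: ", bad.mean())
```

The labelling that follows the two groups gives a mean silhouette close to 1, while the labelling that cuts across them gives a negative one, which matches the interpretation given above.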

#### 3) Map the test set into probability values of the k-clusters. If you have a data point represented by a vector of dimension d, you will map it into a vector of dimension k: prob=em_model.predict_proba(X)

In [4]:
em_model = GaussianMixture(n_components=optimal_k)
em_model.fit(X_train)

probabilities = em_model.predict_proba(X_test)
print(probabilities.shape, probabilities)

(171, 2) [[1.00000000e+000 1.14215983e-014]
 [6.65112734e-046 1.00000000e+000]
 [9.99999999e-001 1.11210288e-009]
 [9.99995236e-001 4.76366919e-006]
 [0.00000000e+000 1.00000000e+000]
 [9.98319350e-001 1.68064986e-003]
 [2.79583738e-005 9.99972042e-001]
 [9.99998709e-001 1.29100970e-006]
 [3.07980895e-126 1.00000000e+000]
 [0.00000000e+000 1.00000000e+000]
 [1.00000000e+000 6.53312135e-023]
 [9.99999785e-001 2.15142495e-007]
 [1.00000000e+000 8.03337744e-013]
 [1.00000000e+000 8.75465571e-013]
 [3.16230108e-035 1.00000000e+000]
 [9.99999861e-001 1.39349573e-007]
 [1.00000000e+000 3.28438603e-019]
 [1.00000000e+000 7.25384451e-014]
 [0.00000000e+000 1.00000000e+000]
 [9.99331452e-001 6.68547975e-004]
 [1.41108937e-053 1.00000000e+000]
 [1.41884563e-045 1.00000000e+000]
 [1.00000000e+000 2.82868584e-010]
 [1.00000000e+000 9.59806240e-028]
 [1.82316973e-001 8.17683027e-001]
 [1.01751604e-088 1.00000000e+000]
 [1.00000000e+000 1.66670106e-026]
 [8.20853921e-031 1.00000000e+000]
 [8.5351420
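
The mapping in this exercise replaces each d-dimensional point with a k-dimensional vector of posterior cluster probabilities (responsibilities). As an illustration of the idea only, here is a minimal sketch for a 1-D mixture of Gaussians. The function `responsibilities` and all parameter values are hypothetical, and the sketch is not the code that `predict_proba` runs internally.

```python
import numpy as np

def responsibilities(x, weights, means, stds):
    """Posterior probability of each 1-D Gaussian component for each value in x."""
    x = np.asarray(x, dtype=float)[:, None]  # shape (n, 1)
    # log(weight * Gaussian density), kept in log space for numerical stability
    log_joint = (np.log(weights)
                 - np.log(stds) - 0.5 * np.log(2 * np.pi)
                 - 0.5 * ((x - means) / stds) ** 2)  # shape (n, k)
    log_joint -= log_joint.max(axis=1, keepdims=True)  # guard against underflow
    joint = np.exp(log_joint)
    # Normalize each row so that the k probabilities sum to 1
    return joint / joint.sum(axis=1, keepdims=True)

# Hypothetical mixture parameters, chosen only for illustration
weights = np.array([0.5, 0.5])
means = np.array([0.0, 4.0])
stds = np.array([1.0, 1.0])

prob = responsibilities([0.0, 2.0, 4.0], weights, means, stds)
print(prob.shape)  # one row per point, one column per cluster
print(prob.round(3))
```

Points close to one mean get a probability near 1 for that component, and the point halfway between the two means is split evenly. This is the same pattern as the rows of near-0 and near-1 values printed above.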

#### 4) Perform logistic regression on the mapped data set with the labels of the original test set. Indicate now the accuracy. Is there a relation between the number of clusters, the cluster evaluation and the accuracy of the logistic regression model?

In [5]:
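# The cluster membership probabilities of the test set are the new features
X_mapped = probabilities

# As the exercise asks, fit on the mapped test set with the original test labels
log_reg = LogisticRegression()
log_reg.fit(X_mapped, y_test)

y_pred = log_reg.predict(X_mapped)

# Accuracy on the same data the model was fitted on
accuracy = accuracy_score(y_test, y_pred)
print(accuracy)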

0.8947368421052632


Logistic regression applies the sigmoid function to the output of a linear model to classify inputs into 1 of 2 categories. This works because the sigmoid returns a value between 0 and 1, which can be read as a class probability; with a well-defined threshold (typically 0.5), it is possible to predict which class an observation belongs to.
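
The decision rule described above can be sketched in a few lines of NumPy. The weights, bias and inputs below are hypothetical values chosen for illustration, not the coefficients learned in this notebook.

```python
import numpy as np

def sigmoid(z):
    """Map any real value to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_binary(X, w, b, threshold=0.5):
    """Return class probabilities and labels (1 when the probability reaches the threshold)."""
    proba = sigmoid(X @ w + b)
    return proba, (proba >= threshold).astype(int)

# Hypothetical weights, bias and inputs
w = np.array([2.0, -1.0])
b = 0.0
X_demo = np.array([[1.0, 0.0],   # linear output  2 -> probability above 0.5
                   [0.0, 0.0],   # linear output  0 -> probability exactly 0.5
                   [0.0, 3.0]])  # linear output -3 -> probability below 0.5

proba, labels = predict_binary(X_demo, w, b)
print(proba.round(3), labels)
```

Moving the threshold away from 0.5 trades one type of error for the other, which can matter in a medical setting where missing a malignant case is costlier than a false alarm.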

Generally (for datasets with a continuous target), clustering quality improves as the number of clusters increases up to a certain point, because the data is divided into finer partitions that capture more complex patterns. Beyond this point, adding more clusters decreases clustering quality due to over-segmentation: clusters become too small and fragmented, and often capture noise rather than meaningful patterns.
For this particular dataset (binary target), we found that the best number of clusters is the k that matches the number of possible outcomes (2). Higher numbers of clusters worsen the cluster evaluation, as we saw in exercise 2.
With this, we conclude that the number of clusters has an impact on the cluster evaluation.

--

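Also, higher-quality clustering indicates that data points within each cluster are more similar to each other. When the clusters also align with the classes, the cluster probabilities become informative features, which helps the logistic regression model classify observations more accurately.
This suggests a relationship between cluster quality (as measured by cluster evaluation metrics) and the accuracy of the logistic regression model. In our run, logistic regression on the 2 cluster probabilities reached an accuracy of 0.895, compared with 0.936 on the 30 original features, although the two values are not directly comparable because the mapped model was fitted and evaluated on the test set.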

--

Given this, it is very important to choose the number of clusters well (the optimal value depends heavily on the dataset's properties) before generating the model. A better choice gives a better cluster evaluation and more informative cluster probabilities, which supports a relationship between the number of clusters, the cluster evaluation and the accuracy of the logistic regression model. Since we only evaluated the mapped model for k=2, this relationship is suggested rather than proven by our results.
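
One way to examine this relationship more directly is to repeat the mapping for several values of k and record both the silhouette and the accuracy. The sketch below does that under assumptions that differ from the cells above: the split is stratified and seeded with `random_state=0`, the mixtures are seeded too, and the logistic regression is fitted on the mapped training set and evaluated on the mapped test set. No particular outcome is claimed here.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, silhouette_score
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
# Assumption: seeded, stratified 70/30 split so the run is reproducible
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

results = []
for k in range(2, 6):
    # Fit the mixture on the training data only
    gmm = GaussianMixture(n_components=k, random_state=0).fit(X_tr)
    sil = silhouette_score(X_tr, gmm.predict(X_tr))
    # Train on mapped training data, evaluate on mapped test data
    clf = LogisticRegression(max_iter=1000).fit(gmm.predict_proba(X_tr), y_tr)
    acc = accuracy_score(y_te, clf.predict(gmm.predict_proba(X_te)))
    results.append((k, sil, acc))

for k, sil, acc in results:
    print(f"k={k}  silhouette={sil:.3f}  accuracy={acc:.3f}")
```

Comparing the two columns across k shows whether a higher silhouette goes together with a higher accuracy on this dataset, instead of inferring it from a single value of k.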

#### 5) Train an RBF network using the clustering with optimal k  from 2).

In [6]:
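# Use the means of the fitted Gaussian mixture as the RBF centers
centers = em_model.means_

# Hidden layer: RBF activation of each test point with respect to each center
# (gamma is left at its default, 1 / n_features)
X_rbf_transformed = rbf_kernel(X_test, centers)

# Output layer: ridge regression on the RBF activations
rbf_model = Ridge(alpha=1.0)
rbf_model.fit(X_rbf_transformed, y_test)

y_rbf_pred = rbf_model.predict(X_rbf_transformed)

# Threshold the continuous output at 0.5 to obtain class labels
y_rbf_pred_binary = (y_rbf_pred >= 0.5).astype(int)

rbf_accuracy = accuracy_score(y_test, y_rbf_pred_binary)
print(rbf_accuracy)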

0.6198830409356725
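
An accuracy of about 0.62 is close to the share of benign samples in the full dataset (357 of 569, about 0.63), which would be consistent with a model that predicts a single class. One possible explanation, offered here as an assumption rather than a verified diagnosis, is that the features are not scaled: with squared distances in the thousands, `exp(-gamma * distance^2)` is numerically zero and the hidden layer carries no information. The sketch below illustrates the effect with NumPy on hypothetical data; `rbf_features` is an illustrative helper, not a library function.

```python
import numpy as np

def rbf_features(X, centers, gamma):
    """Activations exp(-gamma * ||x - c||^2) for every point/center pair."""
    sq_dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * sq_dists)

# Hypothetical points on a large scale, loosely mimicking unscaled features
X_demo = np.array([[450.0, 520.0], [530.0, 480.0],
                   [980.0, 1040.0], [1030.0, 970.0]])
centers = np.array([[500.0, 500.0], [1000.0, 1000.0]])
gamma = 1.0 / X_demo.shape[1]  # same rule as the rbf_kernel default

# Raw values: squared distances are in the thousands
raw = rbf_features(X_demo, centers, gamma)

# Standardize points and centers with the same statistics before the kernel
mu, sigma = X_demo.mean(axis=0), X_demo.std(axis=0)
scaled = rbf_features((X_demo - mu) / sigma, (centers - mu) / sigma, gamma)

print("largest raw activation:   ", raw.max())
print("largest scaled activation:", scaled.max())
```

With the raw values every activation is numerically zero, so a linear model on top of them can only learn its intercept; after standardization the activations spread over the interval (0, 1]. Whether this is what happened in the cell above would need to be checked on the actual data, for example by inspecting `X_rbf_transformed.max()`.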


#### 6) Discuss your findings on a (up to) 5 page document.