# Downstream Exploitation of Space Data
## Session 7: Unsupervised Machine Learning

### Learning Objectives

You will: 
* know the types of problems unsupervised machine learning solves and see some examples
* be able to perform dimensionality reduction on a dataset
* be able to perform unsupervised clustering on a dataset
* become familiar with the problem of clustering variable stars

### Dimensionality Reduction

Dimensionality reduction is an unsupervised learning task: it reduces the number of input features (variables) in a dataset while retaining as much of the important information as possible.

Importing the libraries:

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

Creating a dataset with two correlated features:

In [None]:
np.random.seed(42)
n_samples = 100

In [None]:
x = np.random.rand(n_samples)
y = 0.8 * x + 0.2 * np.random.rand(n_samples)

In [None]:
data = pd.DataFrame({'Feature 1': x, 'Feature 2': y})

Standardizing the data:

In [None]:
scaler = StandardScaler()
scaled_data = scaler.fit_transform(data)
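
To make the standardization step concrete, here is a minimal sketch that reproduces it by hand. It uses its own toy array (`toy`, a hypothetical stand-in built in the same spirit as the data above, not the notebook's `data`) and checks that subtracting the column mean and dividing by the column standard deviation gives the same result as `StandardScaler`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: two correlated features, similar in spirit to the cell above
rng = np.random.default_rng(42)
x_toy = rng.random(100)
y_toy = 0.8 * x_toy + 0.2 * rng.random(100)
toy = np.column_stack([x_toy, y_toy])

# Manual z-score: subtract each column's mean, divide by its standard deviation.
# np.std uses the population standard deviation (ddof=0) by default,
# which is also what StandardScaler uses.
manual = (toy - toy.mean(axis=0)) / toy.std(axis=0)

# The same operation with scikit-learn
scaled = StandardScaler().fit_transform(toy)

print('Manual and StandardScaler results match:', np.allclose(manual, scaled))
print('Column means after scaling:', scaled.mean(axis=0).round(6))
print('Column standard deviations after scaling:', scaled.std(axis=0).round(6))
```

After scaling, every feature has mean 0 and standard deviation 1, so no feature dominates PCA simply because of its units.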

Applying the PCA:

In [None]:
pca = PCA(n_components=2)
pca_result = pca.fit_transform(scaled_data)

Let's visualize both the original and the PCA-transformed data:

In [None]:
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1) # the original dataset
plt.scatter(scaled_data[:, 0], scaled_data[:, 1], alpha=0.7)
plt.title('Original Data')
plt.xlabel('Feature 1 (scaled)')
plt.ylabel('Feature 2 (scaled)')
plt.axhline(0, color='gray', linestyle='--', lw=1)
plt.axvline(0, color='gray', linestyle='--', lw=1)

plt.subplot(1, 2, 2) # the PCA-transformed dataset
plt.scatter(pca_result[:, 0], pca_result[:, 1], alpha=0.7)
plt.title('PCA Result')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.axhline(0, color='gray', linestyle='--', lw=1)
plt.axvline(0, color='gray', linestyle='--', lw=1)

plt.tight_layout()
plt.show()

Let's see how much of the variance in the data each component explains:

In [None]:
explained_variance = pca.explained_variance_ratio_
print(f'Explained variance by component 1: {explained_variance[0]:.2f}')
print(f'Explained variance by component 2: {explained_variance[1]:.2f}')

**Discuss with your neighbour:** Which component explains more variance in the data?

We can also see the contribution of each feature to each component:

In [None]:
components_df = pd.DataFrame(pca.components_, 
                             columns=['Feature 1', 'Feature 2'], 
                             index=['Component 1', 'Component 2'])

print('\nFeature contributions to each principal component:')
print(components_df)

**Discuss with your neighbour:** Why do you think the contributions in the second component are equal in magnitude but opposite in sign? (*Hint: think about how PCA works.*)
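
Once you have discussed the two questions above, you can check your reasoning with the sketch below. It assumes its own toy data (`Z`, hypothetical, not the notebook's `scaled_data`) and relies on the fact that PCA on standardized data amounts to an eigen-decomposition of the covariance matrix: the eigenvectors are the principal components and the normalized eigenvalues are the explained variance ratios.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: two correlated features, then standardized
rng = np.random.default_rng(0)
a = rng.random(200)
b = 0.8 * a + 0.2 * rng.random(200)
Z = StandardScaler().fit_transform(np.column_stack([a, b]))

# Covariance matrix of the standardized data (this is the correlation matrix)
cov = np.cov(Z, rowvar=False, ddof=0)

# eigh returns eigenvalues in ascending order, so re-sort them in descending order
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Share of the total variance carried by each eigenvector
ratio_manual = eigvals / eigvals.sum()

pca_check = PCA(n_components=2).fit(Z)

print('Manual explained variance ratio:', ratio_manual.round(3))
print('PCA explained variance ratio:   ', pca_check.explained_variance_ratio_.round(3))
print('Eigenvectors (one per row):\n', eigvecs.T.round(3))
print('PCA components (one per row):\n', pca_check.components_.round(3))
```

The eigenvectors may differ from `components_` by an overall sign, because the direction of a principal axis is only defined up to a flip.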

### Clustering

#### Toy Dataset

In [None]:
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score

Loading the iris dataset that we have seen before:

In [None]:
iris = load_iris()
X = iris.data
y = iris.target
feature_names = iris.feature_names

Standardizing the data:

In [None]:
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

Let's now apply k-means clustering to our data:

In [None]:
n_clusters = 3
kmeans = KMeans(n_clusters=n_clusters, random_state=42, n_init='auto')
labels = kmeans.fit_predict(X_scaled)

4D data is difficult to visualize, so we project it to 2D with a dimensionality reduction method (PCA):

In [None]:
pca = PCA(n_components=2)
X_2D = pca.fit_transform(X_scaled)

In [None]:
centers_2D = pca.transform(kmeans.cluster_centers_)
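
Before trusting a 2D picture, it is worth checking how much of the original variance survives the projection. The sketch below is self-contained and uses its own variable names (`X_demo`, `pca_demo`, both hypothetical) so that it does not overwrite anything in the notebook.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Standardize the four iris features, as in the cells above
X_demo = StandardScaler().fit_transform(load_iris().data)

# Fit a 2-component PCA and add up the variance shares of both components
pca_demo = PCA(n_components=2).fit(X_demo)
retained = pca_demo.explained_variance_ratio_.sum()

print(f'Variance retained by 2 components: {retained:.1%}')
```

If most of the variance is retained, the 2D plot is a reasonable summary of the 4D data. If not, clusters that overlap in the plot may still be separated in the full feature space.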

Let's plot the data:

In [None]:
plt.figure(figsize=(10, 6))

plt.scatter(X_2D[:, 0], X_2D[:, 1], c=labels, cmap='viridis', s=50, alpha=0.7, edgecolors='k')
plt.scatter(centers_2D[:, 0], centers_2D[:, 1], c='red', s=20, marker='X', label='Centroids')

plt.title('K-means Clustering on 4D Iris Data (Projected to 2D with PCA)')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.legend()
plt.grid()

plt.show()

**Discuss with your neighbour:** Do you think the clusters are well separated? Note that the colors here represent the clusters assigned by our clustering algorithm, NOT the true labels.

Since we have true labels here (NB: this is often not the case in unsupervised learning), we can compare our clustering with them:

In [None]:
true_colors = plt.cm.Set1(y)
cluster_colors = plt.cm.Paired(labels)

In [None]:
plt.figure(figsize=(10, 6))
for i in range(X_2D.shape[0]):
    plt.scatter(X_2D[i, 0], X_2D[i, 1], 
                color=cluster_colors[i], 
                edgecolors=true_colors[i],
                s=50, alpha=0.8, linewidth=1)

plt.scatter(centers_2D[:, 0], centers_2D[:, 1], 
            c='black', s=20, marker='X', label='Centroids')

plt.title('K-means Clustering on 4D Iris Data (vs Original Labels)')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.legend()
plt.grid()
plt.show()

**Discuss with your neighbour:** What do you think now?
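
The visual comparison can be backed up with numbers. The sketch below is one possible way to do it, not part of the original exercise: it builds a contingency table of true species against cluster ids with `pd.crosstab` and computes the adjusted Rand index, which is close to 1 when the clustering matches the labels and close to 0 when the agreement is no better than chance. It uses `n_init=10` instead of `'auto'` only so that it also runs on older scikit-learn versions.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

iris_demo = load_iris()
X_demo = StandardScaler().fit_transform(iris_demo.data)

# Same clustering setup as above, with an explicit number of initializations
labels_demo = KMeans(n_clusters=3, random_state=42, n_init=10).fit_predict(X_demo)

# Rows: true species, columns: cluster ids assigned by k-means
species = pd.Series(iris_demo.target_names[iris_demo.target], name='True species')
cluster_ids = pd.Series(labels_demo, name='Cluster')
table = pd.crosstab(species, cluster_ids)
print(table)

# Agreement between the clustering and the true labels, independent of cluster numbering
ari = adjusted_rand_score(iris_demo.target, labels_demo)
print(f'Adjusted Rand index: {ari:.2f}')
```

Cluster ids are arbitrary, so read the table row by row: a species that falls almost entirely into one column has been recovered cleanly.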

We can also evaluate clustering quality:

In [None]:
silhouette = silhouette_score(X_scaled, labels)
print(f'Silhouette Score: {silhouette:.2f}')

**Discuss with your neighbour:** How would you evaluate the performance of this clustering as a whole?
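
The silhouette score ranges from -1 to 1: values near 1 indicate compact, well-separated clusters, values near 0 indicate overlapping clusters, and negative values suggest that many points sit closer to a neighbouring cluster than to their own. A single number is easier to judge next to others, so the sketch below (self-contained, with hypothetical names `X_demo` and `scores`) compares the score for several numbers of clusters on the iris data.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

X_demo = StandardScaler().fit_transform(load_iris().data)

# Silhouette score for k = 2 ... 6 clusters
scores = {}
for k in range(2, 7):
    labels_k = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X_demo)
    scores[k] = silhouette_score(X_demo, labels_k)
    print(f'k = {k}: silhouette = {scores[k]:.2f}')

best_k = max(scores, key=scores.get)
print(f'Highest silhouette score at k = {best_k}')
```

The number of clusters with the highest silhouette score is not necessarily the number of true classes, which is worth keeping in mind when interpreting the score.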

#### Variable Star Dataset

In [None]:
%pip install umap-learn

In [None]:
import umap.umap_ as umap
import matplotlib.patches as mpatches
from matplotlib.colors import ListedColormap

Let's load the dataset from the previous session again:

In [None]:
df = pd.read_csv('session6_tda.csv')
df.head()

We drop all non-feature columns:

In [None]:
X = df.drop(columns=['Unnamed: 0', 'TIC', 'Sector', 'Class'])

Then we standardize the data and apply clustering:

In [None]:
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X.values.astype(np.float32))

In [None]:
n_clusters = 3
kmeans = KMeans(n_clusters=n_clusters, random_state=42, n_init='auto')
labels = kmeans.fit_predict(X_scaled)

The feature space has too many dimensions to visualise directly, so we use another dimensionality reduction technique, UMAP:

In [None]:
reducer = umap.UMAP(n_components=2, n_neighbors=15, min_dist=0.1, random_state=42)
X_umap = reducer.fit_transform(X_scaled)

Let's now plot the clusters: 

In [None]:
plt.figure(figsize=(10, 6))
plt.scatter(X_umap[:, 0], X_umap[:, 1], c=labels, cmap='viridis', s=20, alpha=0.8, edgecolors='k')
plt.title('K-means Clustering with UMAP Projection')
plt.xlabel('UMAP Component 1')
plt.ylabel('UMAP Component 2')

centers_umap = reducer.transform(kmeans.cluster_centers_)
plt.scatter(centers_umap[:, 0], centers_umap[:, 1], c='red', s=50, marker='X', label='Centroids')

plt.legend()
plt.grid()
plt.show()

Let's see how it corresponds to true labels:

In [None]:
y = df['Class']

In [None]:
cluster_cmap = ListedColormap(plt.cm.viridis(np.linspace(0, 1, n_clusters)))
true_cmap = ListedColormap(plt.cm.inferno(np.linspace(0, 1, len(np.unique(y)))))

In [None]:
cluster_colors = cluster_cmap(labels)
true_colors = true_cmap(pd.factorize(y)[0])

In [None]:
plt.figure(figsize=(10, 6))

for i in range(X_umap.shape[0]):
    plt.scatter(X_umap[i, 0], X_umap[i, 1], 
                color=cluster_colors[i], 
                edgecolors=true_colors[i],
                s=40, alpha=0.8, linewidth=1.2)

centers_umap = reducer.transform(kmeans.cluster_centers_)
plt.scatter(centers_umap[:, 0], centers_umap[:, 1], 
            c='red', s=50, marker='X', label='Centroids')

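# Index the colormaps with integers, exactly as done for the point colors above
# (this also avoids a division by zero when there is only one cluster or class)
cluster_handles = [mpatches.Patch(color=cluster_cmap(i), label=f'Cluster {i}')
                   for i in range(n_clusters)]

# pd.factorize returns integer codes and the class names in order of first appearance
codes, label_names = pd.factorize(y)
true_label_handles = [mpatches.Patch(edgecolor=true_cmap(i), 
                                     facecolor='white', label=label_names[i])
                      for i in range(len(label_names))]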

legend1 = plt.legend(handles=cluster_handles, title='Cluster Colors', loc='upper left', bbox_to_anchor=(1, 1))
legend2 = plt.legend(handles=true_label_handles, title='True Labels', loc='lower left', bbox_to_anchor=(1, 0))
plt.gca().add_artist(legend1)

plt.title('K-means Clustering with UMAP Projection (vs True Labels)')
plt.xlabel('UMAP Component 1')
plt.ylabel('UMAP Component 2')
plt.grid()

plt.show()

**Discuss with your neighbour:** Which class gets the cleanest cluster?

**To do:** Change the number of clusters and rerun the cells above. Observe what changes and consider whether it makes sense given your knowledge of the dataset.

In [None]:
silhouette = silhouette_score(X_scaled, labels)
print(f'Silhouette Score: {silhouette:.2f}')

**Discuss with your neighbour:** How would you evaluate the performance of this clustering as a whole?

Let's now plot the clustering results for different numbers of clusters side by side:

In [None]:
clusters = [2, 3, 4] # change these three values to the numbers of clusters you want to compare

In [None]:
fig, axes = plt.subplots(1, 3, figsize=(18, 5), sharex=True, sharey=True)

for i, n_clusters in enumerate(clusters):
    kmeans = KMeans(n_clusters=n_clusters, random_state=42, n_init='auto')
    labels = kmeans.fit_predict(X_scaled)
    
    axes[i].scatter(X_umap[:, 0], X_umap[:, 1], c=labels, cmap='viridis', s=30, alpha=0.8, edgecolors='k')
    
    centers_umap = reducer.transform(kmeans.cluster_centers_)
    axes[i].scatter(centers_umap[:, 0], centers_umap[:, 1], c='red', s=80, marker='X', label='Centroids')

    axes[i].set_title(f'K-means with {n_clusters} clusters')
    axes[i].set_xlabel('UMAP Component 1')
    if i == 0:
        axes[i].set_ylabel('UMAP Component 2')
    axes[i].grid()

plt.tight_layout()
plt.show()

**To do (for the report):** Change the three numbers in the `clusters` list to produce a plot with three clusterings (please **do not** use [2,3,4] for your report!) and discuss the different clustering results in your report. **This figure needs to be improved** before you put it in your report: you need to add a legend, fix label sizes, etc. (demonstrate what you can do to make it presentable and make sure the point you are trying to make is clearly visible).

*NB:* you do not need to show the real classes (and you would normally not be able to do that when working with clustering), but feel free to do so if you want to discuss them in your report. If you do, it might be helpful to read the description of the classes in the Audenaert et al. (2021) paper and to look at the statistical distributions of the features (as you did for session 6). These additional plots are not compulsory, but feel free to include them if they support your point.
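
As a starting point for improving the figure, the sketch below shows one way to add a cluster legend and larger labels to a scatter plot. It is only an illustration on hypothetical toy data from `make_blobs`, not the variable star dataset, and the names `points`, `km` and the axis labels are placeholders. It relies on `legend_elements()`, which builds one legend handle per distinct value passed to `c`.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Hypothetical toy data standing in for a 2D projection
points, _ = make_blobs(n_samples=300, centers=3, random_state=42)
km = KMeans(n_clusters=3, random_state=42, n_init=10).fit(points)

fig, ax = plt.subplots(figsize=(6, 5))
sc = ax.scatter(points[:, 0], points[:, 1], c=km.labels_, cmap='viridis',
                s=30, alpha=0.8, edgecolors='k')
centroids = ax.scatter(km.cluster_centers_[:, 0], km.cluster_centers_[:, 1],
                       c='red', s=80, marker='X')

# One handle per cluster color, plus the centroid marker
handles, _ = sc.legend_elements()
legend_labels = [f'Cluster {i}' for i in range(len(handles))]
ax.legend(handles + [centroids], legend_labels + ['Centroids'],
          title='K-means', fontsize=11, title_fontsize=12)

# Larger, readable axis labels and tick labels
ax.set_xlabel('Toy feature 1', fontsize=13)
ax.set_ylabel('Toy feature 2', fontsize=13)
ax.tick_params(labelsize=11)
ax.set_title('K-means with 3 clusters', fontsize=14)
ax.grid(alpha=0.3)

plt.tight_layout()
plt.show()
```

The same calls work on each entry of `axes` in a multi-panel figure. Whether to show one shared legend or one per panel is a design choice that depends on the point you want the figure to make.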