# COMP0189: Applied Artificial Intelligence
## Week 7 (Dimensionality reduction and clustering)

Part 1: MNIST dataset  
Part 2: OASIS dataset  

### Acknowledgements
- https://scikit-learn.org/stable/
- https://oasis-brains.org

In [None]:
%load_ext nb_black

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

## Part 1: MNIST dataset: Principal Component Analysis and clustering in a latent space

**Task 1: Load MNIST data and assemble it in two matrices X (images) and y (labels)**

In [None]:
MNIST = np.load(None)

**Task 2: Visualise the data for better understanding**

#### Pre-processing the data

One challenge with K-means is that finding the nearest cluster center for each data point can be slow, particularly for high-dimensional data or for large K.

An easy way to speed things up is to pre-process the data by reducing its dimensionality.
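
To see where the cost comes from, here is a minimal NumPy sketch of the K-means assignment step. It is only an illustration on random data, not the scikit-learn implementation; the function name `assign_to_nearest` and the toy arrays are hypothetical.

```python
import numpy as np


def assign_to_nearest(X, centers):
    """Return, for each row of X, the index of the closest row of centers."""
    # Pairwise differences have shape (n_samples, n_clusters, n_features),
    # so the work grows with n_samples * n_clusters * n_features.
    diffs = X[:, None, :] - centers[None, :, :]
    sq_dists = np.sum(diffs**2, axis=2)
    return np.argmin(sq_dists, axis=1)


rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 784))  # toy stand-in for flattened images
centers_toy = X_toy[:10]  # use the first 10 points as cluster centers
assignments = assign_to_nearest(X_toy, centers_toy)
print(assignments.shape)
```

Reducing the number of features (the last axis of `diffs`) shrinks this computation directly, which is the motivation for running PCA first.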

Here, we will run PCA as a pre-processing step. The next cells reshape the original MNIST data from

* `mnist_images.shape == (60000, 28, 28)`

into a 2D array (matrix)

* `X_mnist.shape == (60000, 784)`
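
The reshape described above can be sketched on a small synthetic array. The array `images_toy` is a hypothetical stand-in for the real images; only the shapes matter here.

```python
import numpy as np

# Five fake 28x28 "images" filled with consecutive integers
images_toy = np.arange(5 * 28 * 28).reshape(5, 28, 28)

# Keep the sample axis and flatten each image into one row of 784 values
X_toy = images_toy.reshape(len(images_toy), -1)
print(X_toy.shape)  # (5, 784)
```

Using `-1` lets NumPy infer the flattened length, so the same line works for any image size.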

In [None]:
# Reshaping
None

**Task 3: Apply PCA to the MNIST data**

In [None]:
from sklearn.decomposition import PCA

None

**Task 4: Plot the MNIST data projected onto the first two principal components, using the labels to colour each digit differently**

**Task 5: Plot the explained variance per component. Based on the plot, decide how many components to use for clustering**
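
One possible way to turn the explained-variance curve into a number of components is to look at the cumulative ratio. The sketch below uses synthetic data, and the 95% threshold is an assumption chosen for illustration, not a value prescribed by this exercise.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic data: 3 strong latent directions plus small noise in 20 dimensions
latent = rng.normal(size=(500, 3))
mixing = rng.normal(size=(3, 20))
X_toy = latent @ mixing + 0.1 * rng.normal(size=(500, 20))

pca_toy = PCA().fit(X_toy)
cumulative = np.cumsum(pca_toy.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches 95% (assumed threshold)
n_keep = int(np.searchsorted(cumulative, 0.95) + 1)
print(n_keep)
```

Plotting `cumulative` against the component index gives the curve the task asks for; the threshold only makes the choice explicit.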

**Task 6: Apply KMeans to PCA-reduced data**

In [None]:
# Apply PCA with 200 components
pca = PCA(n_components=None)
mnist_X_pca = pca.fit_transform(None)

In [None]:
# Perform KMeans
from sklearn.cluster import KMeans

In [None]:
%%time

kmeans = KMeans(None)
kmeans.fit(None)

**Task 7: Visualise examples from the clusters**

**Task 8: Evaluate**

In [None]:
from sklearn import metrics


# Define a function to calculate and print accuracy, precision and recall scores
def evaluate(labels_true, labels_pred):
    accuracy = metrics.accuracy_score(labels_true, labels_pred)
    precision = metrics.precision_score(
        labels_true, labels_pred, average="macro", zero_division=0
    )
    recall = metrics.recall_score(
        labels_true, labels_pred, average="macro", zero_division=0
    )
    print(f"Accuracy: {accuracy:.2f}")
    print(f"Precision: {precision:.2f}")
    print(f"Recall: {recall:.2f}\n")

### `silhouette_score` and Evaluation Benchmark for KMeans
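
The benchmark in this section calls a helper named `retrieve_info` that is not defined in the notebook as shown. A minimal sketch, assuming the helper should map each cluster id to the most frequent true label among its members, could look like the following. This is a guess at the intended behaviour, not the course's reference implementation.

```python
import numpy as np


def retrieve_info(cluster_labels, y_true):
    """Map each cluster id to the most frequent true label in that cluster.

    Hypothetical helper: the behaviour is assumed from how the benchmark uses it.
    """
    cluster_labels = np.asarray(cluster_labels)
    y_true = np.asarray(y_true)
    reference_labels = {}
    for cluster_id in np.unique(cluster_labels):
        members = y_true[cluster_labels == cluster_id]
        values, counts = np.unique(members, return_counts=True)
        reference_labels[int(cluster_id)] = values[np.argmax(counts)]
    return reference_labels


toy_clusters = np.array([0, 0, 0, 1, 1, 2, 2, 2])
toy_truth = np.array([7, 7, 3, 3, 3, 9, 9, 7])
mapping = retrieve_info(toy_clusters, toy_truth)
print(mapping)
```

With a majority-vote mapping like this, several clusters can map to the same digit, so accuracy, precision and recall computed afterwards should be read with that in mind.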

In [None]:
from time import time
from sklearn import metrics
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def bench_k_means(model, name, data, labels):
    """Benchmark a clustering model (e.g. KMeans) on labelled data.

    Parameters
    ----------
    model : estimator
        The clustering estimator to benchmark, e.g. KMeans or MiniBatchKMeans.
    name : str
        Name given to the strategy. It will be used to show the results in a
        table.
    data : ndarray of shape (n_samples, n_features)
        The data to cluster.
    labels : ndarray of shape (n_samples,)
        The labels used to compute the clustering metrics that require some
        supervision.
    """
    t0 = time()
    # can add your own pipeline, but we just use our model here
    estimator = make_pipeline(model).fit(data)
    fit_time = time() - t0

    # Map each cluster id to a reference label. `retrieve_info` is not defined
    # in this notebook as shown; it is assumed to return a mapping from cluster
    # id to label.
    cluster_labels = estimator[-1].labels_
    reference_labels = retrieve_info(cluster_labels, labels)
    number_labels = np.array([reference_labels[c] for c in cluster_labels])

    # inertia_: Sum of squared distances of samples to their closest cluster center,
    #           weighted by the sample weights if provided.
    results = [name, fit_time, estimator[-1].inertia_]

    # Define the metrics which require only the true labels and estimator
    clustering_metrics = [
        metrics.homogeneity_score,
        metrics.completeness_score,
        metrics.v_measure_score,
        metrics.adjusted_rand_score,
        metrics.adjusted_mutual_info_score,
    ]
    results += [m(labels, number_labels) for m in clustering_metrics]

    # The silhouette score requires the full dataset
    results += [
        metrics.silhouette_score(
            data,
            estimator[-1].labels_,
            metric="euclidean",
            sample_size=300,
        )
    ]

    # traditional metrics with true labels
    results += [
        metrics.accuracy_score(labels, number_labels),
        metrics.precision_score(
            labels, number_labels, average="macro", zero_division=0
        ),
        metrics.recall_score(labels, number_labels, average="macro", zero_division=0),
    ]

    # Show the results
    formatter_result = "{:9s}\t{:.3f}s\t{:.0f}\t{:.3f}\t{:.3f}\t{:.3f}\t{:.3f}\t{:.3f}\t{:.3f}\t{:.3f}\t{:.3f}\t{:.3f}"
    print(formatter_result.format(*results))

In [None]:
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.decomposition import PCA

# n_digits, mnist_X and mnist_y are assumed to be defined in the earlier tasks

print(100 * "_")
print("model\t\ttime\tinertia\thomo\tcompl\tv-meas\tARI\tAMI\tsilh\taccu\tprec\trecall")

model = KMeans(n_clusters=n_digits, n_init=10, random_state=0)
bench_k_means(model=model, name="kmeans", data=mnist_X, labels=mnist_y)

model = MiniBatchKMeans(n_clusters=n_digits, n_init=10, random_state=0)
bench_k_means(model=model, name="minibatch", data=mnist_X, labels=mnist_y)

# change n_components of pca
pca = PCA(n_components=200)
mnist_X_pca = pca.fit_transform(mnist_X)
model = KMeans(n_clusters=n_digits, n_init=10, random_state=0)
bench_k_means(model=model, name="PCA-kmeans", data=mnist_X_pca, labels=mnist_y)

pca = PCA(n_components=200)
mnist_X_pca = pca.fit_transform(mnist_X)
model = MiniBatchKMeans(n_clusters=n_digits, n_init=10, random_state=0)
bench_k_means(model=model, name="PCA-minibatch", data=mnist_X_pca, labels=mnist_y)

print(100 * "_")

## Part 2: OASIS dataset: Cross decomposition methods and clustering in the latent space

In this part, you will learn how to apply cross decomposition methods such as CCA and PLSSVD to find the fundamental relations between two matrices (X and Y) that represent different views of the same data. You will also learn how to use KMeans clustering to group the data points based on their latent representations in a lower-dimensional space.

We will use the OASIS dataset, which contains brain MRI images (view 1) and clinical assessments (view 2) of 416 subjects aged 18 to 96. The goal is to explore how these two views are related and how they can be used for clustering.

## Import libraries and load data

First, we need to import some libraries and load the data from CSV files.

In [None]:
from sklearn.cross_decomposition import CCA, PLSSVD
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans


labels = pd.read_csv("../data/OASIS_labels.csv")
brain_roi = pd.read_csv("../data/OASIS_view1_ROI.csv")
clinical = pd.read_csv("../data/OASIS_view2_clinical.csv")

## Data preprocessing

Next, we preprocess the data. We drop columns that are not relevant for our analysis, such as subject ID, gender, handedness, etc. We also normalize each view by subtracting its mean and dividing by its standard deviation.
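
The two preprocessing steps, mean imputation followed by standardisation, can be sketched on a tiny DataFrame. The column names `roi_a` and `roi_b` are made up for illustration and do not come from the OASIS files.

```python
import numpy as np
import pandas as pd

# Toy view with missing values (hypothetical column names)
view_toy = pd.DataFrame(
    {"roi_a": [1.0, 2.0, np.nan, 4.0], "roi_b": [10.0, np.nan, 30.0, 50.0]}
)

# Replace each NaN with the mean of its column
view_filled = view_toy.fillna(view_toy.mean())

# Standardise each column: subtract the mean, divide by the standard deviation
view_norm = (view_filled - view_filled.mean()) / view_filled.std()
print(view_norm.round(2))
```

Note that pandas uses the sample standard deviation (`ddof=1`) by default, whereas `StandardScaler` uses the population version; either is a reasonable choice here.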

In [None]:
# Drop irrelevant columns
brain_roi = brain_roi.drop(["Subject ID"], axis=1)
clinical = clinical.drop(["Subject ID"], axis=1)

# Fill nans with mean values
None

# Normalize each view
None

## Cross decomposition methods

Now we are ready to apply cross decomposition methods to find the relations between the two views. We will use two methods: CCA and PLSSVD.

CCA finds linear combinations of X and Y that have maximum correlation. It can be seen as a generalization of PCA for two sets of variables.

PLSSVD finds linear combinations of X and Y that have maximum covariance. It can be seen as a generalization of SVD for two sets of variables.

For both methods, we need to specify the number of components (n_components) that we want to extract from each view. This parameter controls the dimensionality of the latent space where we will cluster the data points later.

We will use n_components=2 for both methods. You can try different values later and see how they affect the results.

For further comparison, apply PCA to the brain_roi data with 2 components in order to see if combining the modalities improves the latent space.

### CCA Fitting

In [None]:
CCA_n_components = None
# Create CCA object with n_components=2
cca = CCA(n_components=CCA_n_components)

# Fit CCA model on X (brain_roi) and Y (clinical)
cca.fit(None)

# Transform X and Y into their latent representations using CCA
X_c, Y_c = cca.transform(None)

### CCA Plotting

In [None]:
# Convert labels to numbers
label_dict = {"Demented": 0, "Nondemented": 1}
label_nums = labels["Group"].map(label_dict)

# Plot the latent dimensions for CCA
None

### PLSSVD Fitting

In [None]:
PLSSVD_n_components = None
# Create PLSSVD object with n_components=2
plssvd = PLSSVD(n_components=PLSSVD_n_components)

# Fit PLSSVD model on X (brain_roi) and Y (clinical)
plssvd.fit(None)

# Transform X and Y into their latent representations using PLSSVD
X_p, Y_p = plssvd.transform(None)

### PLSSVD Plotting

In [None]:
# Plot the latent dimensions for PLSSVD
None

## PCA

In [None]:
# Perform PCA on brain_roi
pca = PCA(n_components=2)
X_pca = pca.fit_transform(None)

### PCA Plotting

In [None]:
# Plot brain principal components as a scatter plot
None

In [None]:
# Perform PCA on clinical variables
pca = PCA(n_components=2)
Y_pca = pca.fit_transform(None)

In [None]:
# Plot clinical principal components as a scatter plot
None

## Clustering in the latent space

Finally, we will use KMeans clustering to group the data points based on their latent representations obtained from CCA, PLSSVD, and PCA. For CCA we combine the two views by averaging (or adding) X_c and Y_c, and for PLSSVD we do the same with X_p and Y_p. For PCA we simply use X_pca. We will use n_clusters=2 for KMeans, which corresponds to the two groups: Non-Demented (ND) and Demented (D).
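
The averaging step described above can be sketched on synthetic latent scores. The arrays `X_lat` and `Y_lat` are hypothetical stand-ins for the outputs of a cross decomposition method, generated here from two well-separated groups.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
group = np.repeat([0, 1], 50)  # true group of each toy subject
centers = np.array([[-2.0, 0.0], [2.0, 0.0]])

# Two noisy 2D "views" of the same underlying groups
X_lat = centers[group] + rng.normal(scale=0.7, size=(100, 2))
Y_lat = centers[group] + rng.normal(scale=0.7, size=(100, 2))

# Average the two latent representations, then cluster the result
Z_avg = (X_lat + Y_lat) / 2
labels_toy = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(Z_avg)
print(np.bincount(labels_toy))
```

Averaging keeps the scale of a single view, while adding doubles it; KMeans with Euclidean distance gives the same partition in both cases because the two differ only by a constant factor.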

In [None]:
# Create KMeans object with n_clusters=2
kmeans = KMeans(n_clusters=2, n_init=10)

# Cluster the data points based on their latent representations from CCA
kmeans.fit(None)
labels_c = kmeans.labels_

# Cluster the data points based on their latent representations from PLSSVD
kmeans.fit(None)
labels_p = kmeans.labels_

# Cluster the data points based on their latent representations from PCA brain_roi
kmeans.fit(X_pca)
labels_brain = kmeans.labels_

## Quantify Performance of K-Means classifiers using the different latent spaces
NOTE: If accuracy is below 0.5 for any model, you may need to flip the k-means labels (swap 0 and 1), since k-means assigns cluster ids arbitrarily and does not know the order of the original labels.
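
A minimal sketch of this label flip for the two-cluster case is shown below. The helper name `align_binary_labels` and the toy arrays are hypothetical, and the approach assumes labels coded as 0 and 1.

```python
import numpy as np


def align_binary_labels(labels_true, labels_pred):
    """Swap 0 and 1 in labels_pred if that agrees better with labels_true."""
    labels_true = np.asarray(labels_true)
    labels_pred = np.asarray(labels_pred)
    accuracy = np.mean(labels_true == labels_pred)
    # Below 0.5 means the cluster ids are the wrong way round
    return 1 - labels_pred if accuracy < 0.5 else labels_pred


truth_toy = np.array([0, 0, 1, 1, 1])
pred_toy = np.array([1, 1, 0, 0, 1])  # agrees with the truth on 1 of 5 points
aligned = align_binary_labels(truth_toy, pred_toy)
print(aligned)  # [0 0 1 1 0]
```

For more than two clusters a simple flip is not enough, and a mapping from cluster ids to labels is needed instead.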


In [None]:
# Import metrics module from sklearn
from sklearn import metrics


# Define a function to calculate and print accuracy, precision and recall scores
def evaluate(labels_true, labels_pred):
    accuracy = metrics.accuracy_score(labels_true, labels_pred)
    precision = metrics.precision_score(labels_true, labels_pred)
    recall = metrics.recall_score(labels_true, labels_pred)
    print(f"Accuracy: {accuracy:.2f}")
    print(f"Precision: {precision:.2f}")
    print(f"Recall: {recall:.2f}\n")


# Compare the performance of clustering based on CCA, PLSSVD and PCA brain_roi
# label_nums stores the ground-truth labels (0 = Demented, 1 = Nondemented)
print("Performance of classifier based on CCA:")
evaluate(label_nums, labels_c)
print("Performance of classifier based on PLSSVD:")
evaluate(label_nums, labels_p)
print("Performance of classifier based on PCA brain_roi:")
evaluate(label_nums, labels_brain)

## Conclusion
This week you learned how to apply cross decomposition methods such as CCA and PLSSVD to find the relations between two views of the same data. You also learned how to use KMeans clustering to group the data points based on their latent representations in a lower-dimensional space.

You can experiment with different values of n_components and n_clusters and see how they affect the results. You can also try other cross decomposition methods such as PLSRegression or PLSCanonical or other clustering methods such as DBSCAN or SpectralClustering.