# COMP0189: Applied Artificial Intelligence
# Week 7 (Dimensionality reduction and matrix decomposition)

### 🎯 Objectives
1. To understand how dimensionality reduction techniques such as Principal Component Analysis (PCA), Independent Component Analysis (ICA), and Non-negative Matrix Factorization (NMF) differ when used to extract latent features from an image dataset
2. To apply cross decomposition methods such as Canonical Correlation Analysis (CCA) and Partial Least Squares (PLS) to find the fundamental relations between two matrices (X and Y) that represent different views of the same data

### Acknowledgements
- https://scikit-learn.org/stable/
- https://scikit-learn.org/stable/auto_examples/applications/plot_face_recognition.html#sphx-glr-auto-examples-applications-plot-face-recognition-py
- https://oasis-brains.org
- https://cca-zoo.readthedocs.io/en/latest/preface.html

In [None]:
%pip install scikit-learn==1.5.2 cca-zoo matplotlib seaborn pandas scipy

## 🧑‍💻 Part 1. Face recognition through eigenfaces and SVMs

This part of the lab focuses on different dimensionality reduction (or matrix decomposition) methods. We will use the [Labeled Faces in the Wild](https://scikit-learn.org/stable/datasets/real_world.html#labeled-faces-in-the-wild-dataset) (LFW) dataset, available through `scikit-learn`, to visualise the results of dimensionality reduction methods and assess their impact on model performance and training time.

The dataset provides the faces of several famous people labelled with their name. We will use it to train a model which predicts a person's name given a picture of their face.

### 📝 Task 1.1 Import Libraries and Load the Labeled Faces in the Wild (LFW) People Dataset (Classification)

First, we load in the LFW data.

In [None]:
from sklearn.datasets import fetch_lfw_people
from sklearn.model_selection import train_test_split

# Load data
lfw_people = fetch_lfw_people(min_faces_per_person=70, resize=0.4)
n_samples, h, w = lfw_people.images.shape
X = lfw_people.data
y = lfw_people.target
target_names = lfw_people.target_names
n_classes = target_names.shape[0]
n_features = X.shape[1]

# Split into a training and testing set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

### 📝 Task 1.2 Visualise input data

To get an idea of what we're working with, we can visualise the first few faces in the dataset and their associated names. We will later reuse this function to draw the main components from dimensionality reduction.

In [None]:
import numpy as np
import numpy.typing as npt
import matplotlib.pyplot as plt

def plot_gallery(images: npt.NDArray[np.floating], titles: list[str], h: int, w: int, n_row=3, n_col=4):
    """Helper function to plot a gallery of portraits."""
    plt.figure(figsize=(1.8 * n_col, 2.4 * n_row))
    plt.subplots_adjust(bottom=0, left=0.01, right=0.99, top=0.90, hspace=0.35)
    for i in range(n_row * n_col):
        plt.subplot(n_row, n_col, i + 1)
        plt.imshow(images[i].reshape((h, w)), cmap="gray")
        plt.title(titles[i], size=12)
        plt.xticks(())
        plt.yticks(())

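# plot_gallery draws n_row * n_col = 12 images by default, so take the first 12 faces
n_faces = 12
sample_faces = X_train[:n_faces].reshape((n_faces, h, w))
face_titles = [target_names[t] for t in y_train[:n_faces]]
plot_gallery(sample_faces, face_titles, h, w)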

### 📝 Task 1.3 Apply dimensionality reduction methods

Now, we can apply different dimensionality reduction methods to the input data. We will compare [Principal Component Analysis](https://scikit-learn.org/stable/modules/decomposition.html#pca) (PCA), [Non-Negative Matrix Factorisation](https://scikit-learn.org/stable/modules/decomposition.html#nmf) (NMF), and [Independent Component Analysis](https://scikit-learn.org/stable/modules/decomposition.html#ica) (ICA).

Keep track of how long it took to fit each method.
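As a rough illustration of the timing pattern, the sketch below fits the three methods in a loop and records how long each fit takes. It is only one possible approach: it uses the small digits dataset bundled with `scikit-learn` as a stand-in for the LFW faces so that it runs without a download, and the names `X_demo`, `reducers`, and `fit_times`, as well as all hyperparameter values, are illustrative assumptions rather than recommended settings.

```python
from time import time

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA, NMF, FastICA

# Stand-in data (assumption): 8x8 digit images bundled with scikit-learn.
# Pixel values are non-negative, which NMF requires.
X_demo = load_digits().data

n_components = 16  # illustrative choice, not a tuned value

reducers = {
    "PCA": PCA(n_components=n_components, whiten=True, random_state=0),
    "NMF": NMF(n_components=n_components, init="nndsvda", max_iter=1000, random_state=0),
    "ICA": FastICA(n_components=n_components, whiten="unit-variance", max_iter=1000, random_state=0),
}

# Fit each method on the same data and record the wall-clock fitting time
fit_times = {}
for name, reducer in reducers.items():
    start = time()
    reducer.fit(X_demo)
    fit_times[name] = time() - start
    print(f"{name}: fitted in {fit_times[name]:.3f}s")
```

In the lab itself the same loop would be fitted on `X_train` only, so that the test set plays no part in learning the components.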

In [None]:
from sklearn.decomposition import PCA, NMF, FastICA
from time import time


### 📝 Task 1.4 Visualise results

We can now visualise the main components learned by each of the dimensionality reduction methods we just trained. Each component has the same number of pixels as a face image, so feel free to reuse the `plot_gallery` function that we previously wrote.
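A minimal sketch of how fitted components can be turned back into images is shown here. It assumes a fitted decomposition exposes its components in a `components_` array of shape `(n_components, n_features)`, which is the case for `PCA`, `NMF`, and `FastICA` in `scikit-learn`. The digits dataset again stands in for the faces, and `h_demo`, `w_demo`, `component_images`, and `component_titles` are hypothetical names.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# Stand-in data (assumption): 8x8 digit images instead of the LFW faces
digits = load_digits()
X_demo = digits.data
h_demo, w_demo = digits.images.shape[1:]

n_components = 12  # plot_gallery draws 3 x 4 = 12 panels by default
pca_demo = PCA(n_components=n_components, random_state=0).fit(X_demo)

# Each row of components_ is a flattened image; reshape it back to (h, w)
component_images = pca_demo.components_.reshape((n_components, h_demo, w_demo))
component_titles = [f"PCA component {i + 1}" for i in range(n_components)]

# In the notebook, these could then be drawn with the helper defined earlier:
# plot_gallery(component_images, component_titles, h_demo, w_demo)
print(component_images.shape)
```

The same reshaping applies to the NMF and ICA components, which makes it possible to compare the three galleries side by side.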

### 📝 Task 1.5 Train SVM Classifier using the Components Extracted by the Different Methods and Evaluate

Finally, we train an SVM classifier to predict a person's name given a picture of their face. How does the classifier's performance change when using no dimensionality reduction compared to the methods trained above?
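One way to set up this comparison is to wrap each reduction method and the classifier in a pipeline, so the decomposition is fitted on the training split only. The sketch below compares an SVM on raw pixels with an SVM on PCA features. It uses the bundled digits dataset as a stand-in for LFW, and the number of components and the SVM settings are untuned assumptions chosen only for illustration.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Stand-in data (assumption): digits instead of the LFW faces
X_demo, y_demo = load_digits(return_X_y=True)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)

models = {
    "no reduction": SVC(kernel="rbf", class_weight="balanced"),
    "PCA + SVM": make_pipeline(
        PCA(n_components=30, whiten=True, random_state=0),
        SVC(kernel="rbf", class_weight="balanced"),
    ),
}

# Fit on the training split, then evaluate on the held-out split
scores = {}
for name, model in models.items():
    model.fit(Xd_train, yd_train)
    scores[name] = model.score(Xd_test, yd_test)
    print(name)
    print(classification_report(yd_test, model.predict(Xd_test)))
```

Swapping `PCA` for `NMF` or `FastICA` inside `make_pipeline` gives the other two variants, and wrapping `model.fit` in a timer shows the effect on training time.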

In [None]:
from sklearn.metrics import classification_report, ConfusionMatrixDisplay
from sklearn.svm import SVC



### 🗣 Discuss:
- What are the differences in how PCA, NMF, and ICA capture features?
- How do these differences affect the performance of the SVM classifier?

## 👩‍💻 Part 2: OASIS dataset: Cross decomposition methods

In this part, you will learn how to apply cross decomposition methods such as CCA and PLSSVD to find the fundamental relations (or latent dimensions) between two matrices (X and Y) that represent different views of the same data.

We will use the Open Access Series of Imaging Studies (OASIS) dataset, which contains brain Magnetic Resonance Images (MRI) (view 1) and clinical assessments (view 2) of 416 subjects aged 18 to 96. The brain images have been summarized into 116 Regions of Interest (ROIs) using the Automated Anatomical Labeling (AAL) atlas (https://www.gin.cnrs.fr/en/tools/aal/). The clinical data are tabular and include gender, age, education, and two clinical questionnaires: the Mini Mental State Examination (MMSE) and the Clinical Dementia Rating (CDR). The goal is to explore how these two views are related.

It is important to note that there are 2 groups in the OASIS data: healthy subjects, and those with dementia. The `OASIS_labels.csv` file contains the label for each subject.

### Import libraries and load data

First, we need to import some libraries and load the data from CSV files.

In [None]:
import pandas as pd

labels = pd.read_csv("./OASIS_labels.csv")
brain_roi = pd.read_csv("./OASIS_view1_ROI.csv")
clinical = pd.read_csv("./OASIS_view2_clinical.csv")
roi_names = pd.read_csv("./AAL_ROI_names.csv", header=None).squeeze()

### Data preprocessing

Next, we need to do some preprocessing on the data. We will drop some columns that are not relevant for our analysis, such as subject ID, handedness, etc. We will also standardise each feature in both views by subtracting its mean and dividing by its standard deviation.

In [None]:
from sklearn.preprocessing import StandardScaler

brain_roi = brain_roi.drop(["Subject ID"], axis=1)
clinical = clinical.drop(["Subject ID"], axis=1)

# One-hot encode the "Gender" column and drop the first column to avoid multicollinearity
clinical = pd.get_dummies(clinical, columns=["Gender"], drop_first=True)

# Fill nans with mean values
brain_roi = brain_roi.fillna(brain_roi.mean(numeric_only=True))
clinical = clinical.fillna(clinical.mean(numeric_only=True))

brain_roi.columns = roi_names

# Convert labels to numbers before the split
label_dict = {"Demented": 0, "Nondemented": 1}
labels["Group"] = labels["Group"].map(label_dict)

# Split the data and labels into training and testing sets
train_brain_roi, test_brain_roi, train_clinical, test_clinical, train_labels, test_labels = train_test_split(
    brain_roi, clinical, labels["Group"], test_size=0.3, random_state=42)

# Fit each StandardScaler on the training set only, then apply it to the test set, to avoid data leakage
scaler_brain = StandardScaler()
train_brain_roi = scaler_brain.fit_transform(train_brain_roi)
test_brain_roi = scaler_brain.transform(test_brain_roi)

scaler_clinical = StandardScaler()
train_clinical = scaler_clinical.fit_transform(train_clinical)
test_clinical = scaler_clinical.transform(test_clinical)

In [None]:
brain_roi

In [None]:
clinical

### Cross decomposition methods

Now we are ready to apply cross decomposition methods to find the relations between the two views. We will use three methods: CCA, regularised CCA and PLSSVD.

CCA finds linear combinations of X and Y that have maximum correlation. It can be seen as a generalization of PCA for two sets of variables.

Regularised CCA adds a regularisation term to prevent overfitting on the training set.
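To make the two definitions above concrete, here is a minimal NumPy sketch of CCA computed by whitening each view and taking the SVD of the whitened cross-covariance matrix. The `reg` argument adds a ridge term to the within-view covariances, which is one common form of regularisation and an assumption here; it is not necessarily the exact parameterisation used by `cca-zoo`. The function names and the toy data are hypothetical.

```python
import numpy as np


def inv_sqrt(C):
    """Inverse matrix square root of a symmetric positive-definite matrix."""
    eigvals, eigvecs = np.linalg.eigh(C)
    return eigvecs @ np.diag(1.0 / np.sqrt(eigvals)) @ eigvecs.T


def cca_sketch(X, Y, n_components=2, reg=0.0):
    """Return CCA weights for X and Y and the leading singular values."""
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    n = X.shape[0]
    # Within-view covariances, optionally shrunk towards the identity (ridge term)
    Cxx = Xc.T @ Xc / (n - 1) + reg * np.eye(X.shape[1])
    Cyy = Yc.T @ Yc / (n - 1) + reg * np.eye(Y.shape[1])
    Cxy = Xc.T @ Yc / (n - 1)
    Kx, Ky = inv_sqrt(Cxx), inv_sqrt(Cyy)
    # With reg=0 the singular values are the canonical correlations
    U, s, Vt = np.linalg.svd(Kx @ Cxy @ Ky)
    Wx = Kx @ U[:, :n_components]
    Wy = Ky @ Vt.T[:, :n_components]
    return Wx, Wy, s[:n_components]


# Toy data (assumption): two views driven by the same 2-D latent signal
rng = np.random.default_rng(0)
n = 300
z = rng.normal(size=(n, 2))
X_toy = z @ rng.normal(size=(2, 6)) + 0.5 * rng.normal(size=(n, 6))
Y_toy = z @ rng.normal(size=(2, 4)) + 0.5 * rng.normal(size=(n, 4))

Wx, Wy, corrs = cca_sketch(X_toy, Y_toy)
Wx_r, Wy_r, corrs_r = cca_sketch(X_toy, Y_toy, reg=1.0)
print("canonical correlations:", corrs)
print("with ridge term:", corrs_r)
```

The ridge term can only lower the leading value on the training data, which is the price paid for weights that are less sensitive to noise in the sample covariances.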

PLSSVD finds linear combinations of X and Y that have maximum covariance. It can be seen as a generalization of SVD for two sets of variables.
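The covariance-maximising idea can be sketched in a few lines of NumPy: the weights are the leading singular vectors of the cross-covariance matrix between the two centred views. This is a simplified illustration with hypothetical names and toy data, not the `cca-zoo` implementation, and it skips the per-feature scaling that library implementations may apply.

```python
import numpy as np


def pls_svd_sketch(X, Y, n_components=2):
    """Return unit-norm weights for X and Y and the covariances they achieve."""
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    # Cross-covariance between the features of the two views
    Cxy = Xc.T @ Yc / (X.shape[0] - 1)
    U, s, Vt = np.linalg.svd(Cxy, full_matrices=False)
    return U[:, :n_components], Vt.T[:, :n_components], s[:n_components]


# Toy data (assumption): two views driven by the same 2-D latent signal
rng = np.random.default_rng(0)
n = 300
z = rng.normal(size=(n, 2))
X_toy = z @ rng.normal(size=(2, 6)) + 0.5 * rng.normal(size=(n, 6))
Y_toy = z @ rng.normal(size=(2, 4)) + 0.5 * rng.normal(size=(n, 4))

Wx_pls, Wy_pls, covs = pls_svd_sketch(X_toy, Y_toy)

# Project each view onto its weights to obtain the latent scores
x_scores = (X_toy - X_toy.mean(axis=0)) @ Wx_pls
y_scores = (Y_toy - Y_toy.mean(axis=0)) @ Wy_pls
print("covariance of first score pair:", np.cov(x_scores[:, 0], y_scores[:, 0])[0, 1])
print("first singular value:", covs[0])
```

Because no within-view covariance is inverted, this approach remains well defined even when a view has more features than samples, which is one reason it is often compared with CCA.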

For all three methods, we need to specify the number of components that we want to extract from each view. This parameter controls the dimensionality of the latent space.

We will use n_components=2 for all three methods. You can try different values later and see how they affect the results.

For further comparison, apply PCA to the brain_roi data with 2 components in order to see if combining the modalities changes the latent space.

### Task 2.1 - Apply CCA to the brain and clinical dataset and plot the weights for the first two CCA components (brain and clinical)

We will use [`cca-zoo`](https://cca-zoo.readthedocs.io/en/latest/#) to carry out the calculations. It's a library which offers support for many different algorithms, including a regularised version of CCA which is not present in `sklearn`.

In [None]:
from cca_zoo.linear import CCA


### Task 2.2 - Plot the CCA latent space for the top 2 components

### Task 2.3 - Show the same plots, but obtained by applying regularised CCA to the brain and clinical dataset

In [None]:
from cca_zoo.linear import rCCA



### Task 2.4 - Show the same plots, but obtained by applying PLSSVD to the brain and clinical dataset

In [None]:
from cca_zoo.linear import PLS


### Task 2.5 - Now apply PCA to the brain and clinical data independently and plot both:
- The first two PCA components
- The PCA latent space for the first two components

#### Plotting PCA components (brain data)

#### Plotting PCA latent components (brain data)

#### Plotting PCA components (clinical data)

#### Plotting PCA latent components (clinical data)