# PCA
In this notebook, we run a PCA on our concatenated (concat) features. We want to check whether the explained variance of the principal components fits our hypothesis that the image embeddings do not add any significant information.

We will conduct a PCA for
- the training split
- all splits combined

## 0. Imports and Constants

In [192]:
# AUTORELOAD
%load_ext autoreload
%autoreload 2

# GENERAL IMPORTS
import numpy as np
import pandas as pd

# TASK-SPECIFIC IMPORTS
from src import utils
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# CONSTANTS
users = ["patriziopalmisano", "onurdenizguler", "jockl"]
TRAIN = "train"
DEV = "dev"
TEST = "test"

####################### SELECT ###########################
user = users[2] # SELECT USER
version = "v2" # SELECT DATASET VERSION
dataset_version = version
##########################################################

if user in users[:2]:
    data_dir = f"/Users/{user}/Library/CloudStorage/GoogleDrive-check.worthiness@gmail.com/My Drive/data/CT23_1A_checkworthy_multimodal_english_{version}"
    cw_dir = f"/Users/{user}/Library/CloudStorage/GoogleDrive-check.worthiness@gmail.com/My Drive/"

else:
    data_dir = f"/home/jockl/Insync/check.worthiness@gmail.com/Google Drive/data/CT23_1A_checkworthy_multimodal_english_{dataset_version}"
    cw_dir = "/home/jockl/Insync/check.worthiness@gmail.com/Google Drive"

features_dir = f"{data_dir}/features"
labels_dir = f"{data_dir}/labels"
models_dir = f"{cw_dir}/models/vanillann"

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


## 1. Train Split

### 1.1 Load Features

Let's first load all the features and compare their shapes and contents with the original embeddings. We want to make sure that the first 768 feature dimensions indeed belong to the text embeddings and the last 768 to the image embeddings.

In [193]:
train_txt_emb, train_img_emb = utils.get_embeddings_from_pickle_file(f"{data_dir}/embeddings_{TRAIN}_{dataset_version}.pickle")
train_concat_features = np.load(f"{features_dir}/concat/concat_{TRAIN}_{dataset_version}.pickle", allow_pickle=True)
print(f"Train txt embeddings: {train_txt_emb.shape}")
print(f"Train img embeddings: {train_img_emb.shape}")
print(f"Train concat features: {train_concat_features.shape}")

Spot check whether the first 768 feature dimensions indeed belong to the text embeddings and the last 768 to the image embeddings:

In [194]:
print(f"Train txt embd excerpt: {train_txt_emb[0][:5]}")
print(f"Train features excerpt: {train_concat_features[0][:5]}")
print(f"Train img embd excerpt: {train_img_emb[0][-5:]}")
print(f"Train features excerpt: {train_concat_features[0][-5:]}")
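
The spot check above compares only five values by eye. A minimal sketch of a full programmatic check follows, assuming the concat features are simply the text embeddings followed column-wise by the image embeddings. The helper `check_concat_layout` and the synthetic `demo_*` arrays are hypothetical and not part of `src.utils`.

```python
import numpy as np

def check_concat_layout(txt_emb, img_emb, concat_features):
    """Return True if concat_features equals [txt_emb | img_emb] column-wise."""
    n_txt = txt_emb.shape[1]
    txt_ok = np.array_equal(concat_features[:, :n_txt], txt_emb)
    img_ok = np.array_equal(concat_features[:, n_txt:], img_emb)
    return txt_ok and img_ok

# Synthetic stand-ins for the real embeddings (shapes are illustrative only)
rng = np.random.default_rng(0)
demo_txt = rng.normal(size=(4, 768))
demo_img = rng.normal(size=(4, 768))
demo_concat = np.hstack([demo_txt, demo_img])

print(check_concat_layout(demo_txt, demo_img, demo_concat))  # correct order
print(check_concat_layout(demo_img, demo_txt, demo_concat))  # swapped order
```

If the stored features were cast to a different float precision than the embeddings, `np.allclose` may be a more forgiving comparison than `np.array_equal`.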

### 1.2 Normalize the Features

To perform a PCA, we first need to normalize (standardize) the feature values. The normalized features should have a mean of 0 and a standard deviation of 1.

In [195]:
train_normalized_concat_features = StandardScaler().fit_transform(train_concat_features)
print(f"Train normalized concat features: {train_normalized_concat_features.shape}")
print(f"Mean: {np.mean(train_normalized_concat_features)}")
print(f"Standard Deviation: {np.std(train_normalized_concat_features)}")

The mean and standard deviation have the desired values, so the features are now normalized.
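
The check above aggregates over all entries of the matrix at once. A stricter variant, sketched here on synthetic data, verifies the mean and standard deviation of each feature column separately. The `demo_*` names and the chosen column scales are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic features whose columns have different means and scales
demo_features = rng.normal(
    loc=[0.0, 5.0, -3.0], scale=[1.0, 10.0, 0.1], size=(200, 3)
)

demo_scaled = StandardScaler().fit_transform(demo_features)

# Per-feature statistics (axis=0 aggregates over samples)
col_means = demo_scaled.mean(axis=0)
col_stds = demo_scaled.std(axis=0)

print(np.allclose(col_means, 0.0))
print(np.allclose(col_stds, 1.0))
```

`StandardScaler` and `np.std` both use the population standard deviation by default, so the two are directly comparable.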

### 1.3 PCA and Explained Variance

Now that we have normalized feature values, we can compute all principal components.

IMPORTANT NOTE: There is no direct "mapping" between the n-th PC and the n-th feature dimension. The PCs are strictly ordered by their explained variance: by definition, the first PC explains the largest share of the variance, which of course need not be true of the first feature.

In [196]:
train_pca = PCA()
train_principal_components = train_pca.fit_transform(train_normalized_concat_features)
train_principal_components_df = pd.DataFrame(train_principal_components)
train_principal_components_df.tail()
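
To illustrate the note on PCs versus feature dimensions, the sketch below inspects the loadings in `components_` on synthetic data. Each PC is a unit-length combination of all feature dimensions, so its squared loadings can be split into a share from the first half of the columns and a share from the second half. The data and the `demo_*` names are hypothetical stand-ins, not the real embeddings.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
n_samples, n_half = 300, 4

# Hypothetical stand-in: 4 "text" columns followed by 4 "image" columns
demo_features = rng.normal(size=(n_samples, 2 * n_half))

demo_pca = PCA().fit(demo_features)

# components_ has shape (n_components, n_features); each row is a unit vector
squared_loadings = demo_pca.components_ ** 2
first_half_share = squared_loadings[:, :n_half].sum(axis=1)
second_half_share = squared_loadings[:, n_half:].sum(axis=1)

for i, (a, b) in enumerate(zip(first_half_share, second_half_share)):
    print(f"PC {i}: first-half share={a:.2f}, second-half share={b:.2f}")
```

Applied to the real features, shares of this kind would indicate how strongly each PC draws on the text versus the image dimensions.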

Sanity check: Does the explained variance array have the right shape, and do the values add up to 1?

In [197]:
train_explained_variance = train_pca.explained_variance_ratio_
sum_of_train_explained_variance_values = np.sum(train_explained_variance)
print(f"Explained variance per principal component array: {train_explained_variance.shape}")
print(f"Sum of all explained variance values: {sum_of_train_explained_variance_values}")

We now have the explained variance ratio for all 1536 principal components. Let's sum over the first and last 768 values:

In [198]:
train_txt_features_explained_variance = np.sum(train_explained_variance[:train_txt_emb.shape[1]])
train_img_features_explained_variance = np.sum(train_explained_variance[-train_img_emb.shape[1]:])
print(f"Explained variance of the first 768 PCs within train split: {train_txt_features_explained_variance}")
print(f"Explained variance of the last 768 PCs within train split: {train_img_features_explained_variance}")
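
Because the ratios are sorted in descending order, the first half of the PCs accounts for at least half of the total variance by construction. A complementary view is therefore how many leading PCs are needed to reach a given threshold. The sketch below uses a hypothetical helper, `n_components_for`, and a made-up ratio array rather than the notebook's actual values.

```python
import numpy as np

def n_components_for(explained_variance_ratio, threshold):
    """Smallest number of leading PCs whose cumulative ratio reaches threshold."""
    cumulative = np.cumsum(explained_variance_ratio)
    # side="left" yields the first index where cumulative >= threshold
    idx = int(np.searchsorted(cumulative, threshold, side="left"))
    # Clip in case rounding keeps the cumulative sum just below the threshold
    return min(idx + 1, len(cumulative))

# Made-up ratios that sum to 1 (illustrative only)
demo_ratio = np.array([0.5, 0.25, 0.125, 0.0625, 0.0625])

print(n_components_for(demo_ratio, 0.75))  # 2
print(n_components_for(demo_ratio, 0.9))   # 4
```

In the notebook, the same function could be called with `train_explained_variance` and a threshold such as 0.95.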

## 2. All Splits

### 2.1 Load Features
Next, we load the dev and test concat features and stack them with the train features. As before, we want to make sure that the first 768 feature dimensions indeed belong to the text embeddings and the last 768 to the image embeddings.

In [199]:
dev_concat_features = np.load(f"{features_dir}/concat/concat_{DEV}_{dataset_version}.pickle", allow_pickle=True)
test_concat_features = np.load(f"{features_dir}/concat/concat_{TEST}_{dataset_version}.pickle", allow_pickle=True)
all_concat_features = np.concatenate((train_concat_features, dev_concat_features, test_concat_features))
print(f"All concat features: {all_concat_features.shape}")


Spot check whether the first 768 feature dimensions indeed belong to the text embeddings and the last 768 to the image embeddings:

In [200]:
# Load test embeddings
test_txt_emb, test_img_emb = utils.get_embeddings_from_pickle_file(f"{data_dir}/embeddings_{TEST}_{dataset_version}.pickle")

# Spot check
print(f"Test txt embd excerpt: {test_txt_emb[-1][:5]}")
print(f"Test features excerpt: {all_concat_features[-1][:5]}")
print(f"Test img embd excerpt: {test_img_emb[-1][-5:]}")
print(f"Test features excerpt: {all_concat_features[-1][-5:]}")

### 2.2 Normalize the Features

To perform a PCA, we first need to normalize (standardize) the feature values. The normalized features should have a mean of 0 and a standard deviation of 1.

In [201]:
all_normalized_concat_features = StandardScaler().fit_transform(all_concat_features)
print(f"All normalized concat features: {all_normalized_concat_features.shape}")
print(f"Mean: {np.mean(all_normalized_concat_features)}")
print(f"Standard Deviation: {np.std(all_normalized_concat_features)}")

The mean and standard deviation have the desired values, so the features are now normalized.

### 2.3 PCA and Explained Variance

Now that we have normalized feature values, we can compute all principal components.

IMPORTANT NOTE: There is no direct "mapping" between the n-th PC and the n-th feature dimension. The PCs are strictly ordered by their explained variance: by definition, the first PC explains the largest share of the variance, which of course need not be true of the first feature.

In [202]:
all_pca = PCA()
all_principal_components = all_pca.fit_transform(all_normalized_concat_features)
all_principal_components_df = pd.DataFrame(all_principal_components)
all_principal_components_df.tail()

Sanity check: Does the explained variance array have the right shape, and do the values add up to 1?

In [203]:
all_explained_variance = all_pca.explained_variance_ratio_
all_sum_of_explained_variance_values = np.sum(all_explained_variance)
print(f"Explained variance per principal component array: {all_explained_variance.shape}")
print(f"Sum of all explained variance values: {all_sum_of_explained_variance_values}")

We now have the explained variance ratio for all 1536 principal components. Let’s sum over the first and last 768 principal components:

In [204]:
all_txt_features_explained_variance = np.sum(all_explained_variance[:train_txt_emb.shape[1]])
all_img_features_explained_variance = np.sum(all_explained_variance[-train_img_emb.shape[1]:])
print(f"Explained variance of the first 768 PCs within all splits: {all_txt_features_explained_variance}")
print(f"Explained variance of the last 768 PCs within all splits: {all_img_features_explained_variance}")
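
The train-split and all-splits analyses follow the same steps: standardize, fit a full PCA, and sum the explained variance ratios. A minimal sketch of a helper that wraps these steps is shown below. The function `explained_variance_by_halves` is hypothetical, and it sums "the first `n_first` PCs" and "all remaining PCs", which matches the first and last 768 only when there are exactly 1536 components.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def explained_variance_by_halves(features, n_first):
    """Standardize, fit a full PCA, and sum the ratios of the first n_first PCs and the rest."""
    normalized = StandardScaler().fit_transform(features)
    ratios = PCA().fit(normalized).explained_variance_ratio_
    return float(ratios[:n_first].sum()), float(ratios[n_first:].sum())

# Synthetic data standing in for the concat features (illustrative only)
rng = np.random.default_rng(0)
demo_features = rng.normal(size=(100, 6))

first, rest = explained_variance_by_halves(demo_features, n_first=3)
print(f"first 3 PCs: {first:.3f}, remaining PCs: {rest:.3f}")
```

With such a helper, both sections could call the same code on `train_concat_features` and `all_concat_features`, which avoids copy-paste slips in variable names and print labels.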

## 3. Summary of Results and Conclusion

Results for train split and all data:

In [205]:
print(f"Explained variance of the first 768 PCs within train split: {train_txt_features_explained_variance}")
print(f"Explained variance of the last 768 PCs within train split: {train_img_features_explained_variance}\n")

print(f"Explained variance of the first 768 PCs within all splits: {all_txt_features_explained_variance}")
print(f"Explained variance of the last 768 PCs within all splits: {all_img_features_explained_variance}")

- The first 768 PCs capture around 95 % of the features' variance.
- Even though the first 768 PCs do not map mathematically onto the first 768 features (i.e. the text embeddings), this fits our hypothesis.
- Almost all of the variance can be explained with half the number of dimensions, and half of our dimensions are image embedding dimensions.
- This matches our previous findings: training an SVM/VanillaNN on text only yields hardly worse results than training on the concat features.