# PCA
In this notebook, we run a PCA on our mean features. We want to check whether the explained variance of the principal components fits our hypothesis that the image embeddings do not add any significant information.

We will conduct a PCA for
- the training split
- all splits combined

## 0. Imports and Constants

In [39]:
# AUTORELOAD
%load_ext autoreload
%autoreload 2

# GENERAL IMPORTS
import numpy as np
import pandas as pd

# TASK-SPECIFIC IMPORTS
from src import utils
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# CONSTANTS
users = ["patriziopalmisano", "onurdenizguler", "jockl"]
TRAIN = "train"
DEV = "dev"
TEST = "test"

####################### SELECT ###########################
user = users[2] # SELECT USER
version = "v2" # SELECT DATASET VERSION
dataset_version = version
##########################################################

if user in users[:2]:
    data_dir = f"/Users/{user}/Library/CloudStorage/GoogleDrive-check.worthiness@gmail.com/My Drive/data/CT23_1A_checkworthy_multimodal_english_{version}"
    cw_dir = f"/Users/{user}/Library/CloudStorage/GoogleDrive-check.worthiness@gmail.com/My Drive/"

else:
    data_dir = f"/home/jockl/Insync/check.worthiness@gmail.com/Google Drive/data/CT23_1A_checkworthy_multimodal_english_{dataset_version}"
    cw_dir = "/home/jockl/Insync/check.worthiness@gmail.com/Google Drive"

features_dir = f"{data_dir}/features"
labels_dir = f"{data_dir}/labels"
models_dir = f"{cw_dir}/models/vanillann"



## 1. Train Split

### 1.1 Load Features

Let's first load all the features and compare their shapes and contents with the original embeddings.

In [40]:
train_txt_emb, train_img_emb = utils.get_embeddings_from_pickle_file(f"{data_dir}/embeddings_{TRAIN}_{dataset_version}.pickle")
train_mean_features = np.load(f"{features_dir}/mean/mean_{TRAIN}_{dataset_version}.pickle", allow_pickle=True)
print(f"Train txt embeddings: {train_txt_emb.shape}")
print(f"Train img embeddings: {train_img_emb.shape}")
print(f"Train mean features: {train_mean_features.shape}")

Train txt embeddings: (2356, 768)
Train img embeddings: (2356, 768)
Train mean features: (2356, 768)


Spot check:

In [41]:
print(f"Train txt embd excerpt: {train_txt_emb[0][:5]}")
print(f"Train img embd excerpt: {train_img_emb[0][:5]}")
print(f"Train features excerpt: {train_mean_features[0][:5]}")

Train txt embd excerpt: [ 0.31258187  0.8622302  -0.19572662  0.41690043 -0.8305622 ]
Train img embd excerpt: [-0.24858032  0.6837659   0.81424457  0.59864545  0.49090493]
Train features excerpt: [ 0.03200077  0.77299803  0.30925897  0.5077729  -0.16982862]
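
The excerpts suggest that each mean feature is the element-wise average of the corresponding text and image embedding values; for the first value, (0.31258187 + (-0.24858032)) / 2 ≈ 0.03200077. The following is a minimal sketch of how this could be checked for whole arrays instead of a single excerpt. It uses small random arrays as stand-ins for the real embeddings, and the helper name `is_elementwise_mean` and its tolerance are assumptions, not part of `src.utils`.

```python
import numpy as np


def is_elementwise_mean(txt_emb, img_emb, mean_features, atol=1e-6):
    """Return True if mean_features equals (txt_emb + img_emb) / 2 element-wise."""
    expected = (np.asarray(txt_emb) + np.asarray(img_emb)) / 2.0
    return bool(np.allclose(mean_features, expected, atol=atol))


# Stand-in data (hypothetical); the real arrays have shape (2356, 768).
rng = np.random.default_rng(0)
txt = rng.normal(size=(10, 8)).astype(np.float32)
img = rng.normal(size=(10, 8)).astype(np.float32)
mean_feat = (txt + img) / 2

print(is_elementwise_mean(txt, img, mean_feat))

# The first value from the excerpts above, checked by hand.
print(np.isclose((0.31258187 + -0.24858032) / 2, 0.03200077, atol=1e-6))
```

In the notebook, the same helper could presumably be called with `train_txt_emb`, `train_img_emb`, and `train_mean_features`.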


### 1.2 Normalize the Features

To perform a PCA, we first need to standardize the feature values so that each feature dimension has a mean of 0 and a standard deviation of 1.

In [42]:
train_normalized_mean_features = StandardScaler().fit_transform(train_mean_features)
print(f"Train normalized mean features: {train_normalized_mean_features.shape}")
print(f"Mean: {np.mean(train_normalized_mean_features)}")
print(f"Standard Deviation: {np.std(train_normalized_mean_features)}")

Train normalized mean features: (2356, 768)
Mean: -1.0541285726251015e-11
Standard Deviation: 1.0000001192092896


The mean and standard deviation have the desired values (up to floating-point error), so the features are now normalized.
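
The check above looks at the mean and standard deviation of the whole array at once. Since `StandardScaler` standardizes each column separately, a stricter check would look at every feature dimension on its own. A minimal sketch, using a small random matrix as a stand-in for the real features (the column scales below are made up for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Stand-in features (hypothetical) with a different location and scale per column.
features = rng.normal(
    loc=[0.0, 5.0, -3.0, 10.0], scale=[1.0, 2.0, 0.5, 4.0], size=(200, 4)
)

normalized = StandardScaler().fit_transform(features)

# axis=0 computes one statistic per feature dimension (column).
col_means = normalized.mean(axis=0)
col_stds = normalized.std(axis=0)

print(f"Largest absolute column mean: {np.abs(col_means).max()}")
print(f"Largest deviation of a column std from 1: {np.abs(col_stds - 1).max()}")
```

Both printed values should be close to 0 if every column was standardized.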

### 1.3 PCA and Explained Variance

Now that we have normalized feature values, we can compute all principal components.

IMPORTANT NOTE: There is no direct mapping between the n-th PC and the n-th feature dimension. The PCs are strictly ordered by their explained variance: by definition, the first PC explains the largest share of the variance, which does not have to be the case for the first feature.
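
To illustrate the note, here is a small sketch on made-up data in which the last column, not the first, has by far the largest variance. The data is deliberately left unstandardized so that the effect is visible; this differs from the notebook, where the features are standardized first.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Stand-in data (hypothetical): the LAST column has the largest variance.
data = rng.normal(size=(500, 3)) * np.array([0.5, 1.0, 5.0])

pca = PCA().fit(data)
ratios = pca.explained_variance_ratio_

# The ratios are sorted in descending order, regardless of the column order.
print(ratios)
print(bool(np.all(np.diff(ratios) <= 0)))

# components_[0] holds the loadings of the first PC on the original features;
# the largest absolute loading shows which feature dominates this PC.
dominant_feature = int(np.abs(pca.components_[0]).argmax())
print(f"Feature with the largest loading on the first PC: {dominant_feature}")
```

Here the first PC is dominated by the feature with index 2, which shows that the PC index says nothing about the feature index.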

In [43]:
train_pca = PCA()
train_principal_components = train_pca.fit_transform(train_normalized_mean_features)
train_principal_components_df = pd.DataFrame(train_principal_components)
train_principal_components_df.tail()

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 758 | 759 | 760 | 761 | 762 | 763 | 764 | 765 | 766 | 767 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2351 | -5.618376 | 11.033623 | 6.053107 | -0.163067 | 0.666468 | -1.900198 | 3.436073 | -3.184094 | 1.393384 | -0.178092 | ... | 0.009635 | -0.015563 | 0.00222 | -0.006839 | 0.015748 | 0.008607 | -0.003422 | -0.009546 | 0.00813 | 0.001067 |
| 2352 | -4.682307 | -12.436971 | 11.926312 | -2.077713 | -0.727863 | -2.583416 | -4.341856 | -4.070567 | -1.320721 | -0.866155 | ... | 0.006398 | 0.001851 | -0.008635 | -0.011726 | 0.014429 | 0.008191 | -0.002672 | 0.007399 | -0.002513 | -0.001518 |
| 2353 | -6.491424 | 5.32714 | -1.383168 | 0.875179 | -2.787508 | 0.841726 | 0.396089 | 1.701471 | 2.94134 | -2.52194 | ... | 0.001634 | 0.008995 | -0.014275 | -0.000268 | 0.022956 | -0.00928 | -0.005194 | 0.00896 | 0.011898 | -0.000954 |
| 2354 | -4.693285 | -10.8221 | 12.317378 | 1.285923 | -9.69206 | -0.57025 | -1.636979 | 3.693459 | 4.632005 | -3.306872 | ... | 0.011457 | 0.007084 | -0.002748 | 0.012913 | 0.007749 | -0.023125 | 0.005615 | -0.00107 | -0.006236 | 0.000914 |
| 2355 | -3.795097 | -10.701345 | 7.03228 | 0.25565 | -2.192058 | -2.775885 | 0.135702 | -0.844129 | 1.474199 | 0.036058 | ... | -0.006232 | -0.015062 | -0.014292 | -0.004327 | -0.001183 | -0.024918 | 0.01823 | 0.001745 | -0.010218 | -0.000668 |


Sanity check: Does the explained variance array have the right shape, and do the values add up to 1?

In [44]:
train_explained_variance = train_pca.explained_variance_ratio_
sum_of_train_explained_variance_values = np.sum(train_explained_variance)
print(f"Explained variance array shape: {train_explained_variance.shape}")
print(f"Sum of all explained variance values: {sum_of_train_explained_variance_values}")

Explained variance array shape: (768,)
Sum of all explained variance values: 0.9999998807907104


Now, we have the explained variance for all the 768 principal components. Let’s now sum over the first and last 384 principal components:

In [45]:
train_expl_var_of_first_half_of_pcs = np.sum(train_explained_variance[:384])
train_expl_var_of_second_half_of_pcs = np.sum(train_explained_variance[-384:])
print(f"Explained variance of the first 384 PCs within train split: {train_expl_var_of_first_half_of_pcs}")
print(f"Explained variance of the last 384 PCs within train split: {train_expl_var_of_second_half_of_pcs}")

Explained variance of the first 384 PCs within train split: 0.913522481918335
Explained variance of the last 384 PCs within train split: 0.0864773839712143
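
Splitting the PCs into two halves gives one point on the cumulative explained variance curve. As a complementary view, one could ask how many leading PCs are needed to reach a given share of the variance. A minimal sketch, assuming a hypothetical helper `n_components_for` and made-up ratios in place of `train_explained_variance`:

```python
import numpy as np


def n_components_for(explained_variance_ratio, threshold):
    """Smallest number of leading PCs whose cumulative ratio reaches threshold."""
    cumulative = np.cumsum(explained_variance_ratio)
    # searchsorted returns the first index where cumulative >= threshold.
    n = int(np.searchsorted(cumulative, threshold)) + 1
    # Guard against thresholds that are never reached due to rounding.
    return min(n, len(cumulative))


# Stand-in ratios (hypothetical), already sorted in descending order.
ratios = np.array([0.5, 0.25, 0.125, 0.0625, 0.0625])

print(n_components_for(ratios, 0.9))
```

With these stand-in ratios, the cumulative sums are 0.5, 0.75, 0.875, 0.9375 and 1.0, so four PCs are needed to reach 90 %.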


## 2. All Splits

### 2.1 Load Features
Let's first load the features of all splits and compare their shapes and contents with the original embeddings. We want to make sure that each mean feature value is indeed the element-wise mean of the corresponding text and image embedding values.

In [46]:
dev_mean_features = np.load(f"{features_dir}/mean/mean_{DEV}_{dataset_version}.pickle", allow_pickle=True)
test_mean_features = np.load(f"{features_dir}/mean/mean_{TEST}_{dataset_version}.pickle", allow_pickle=True)
all_mean_features = np.concatenate((train_mean_features, dev_mean_features, test_mean_features))
print(f"All mean features: {all_mean_features.shape}")


All mean features: (3175, 768)


Spot check:

In [47]:
# Load test embeddings
test_txt_emb, test_img_emb = utils.get_embeddings_from_pickle_file(f"{data_dir}/embeddings_{TEST}_{dataset_version}.pickle")

# Spot check
print(f"Test txt embd excerpt: {test_txt_emb[-1][:5]}")
print(f"Test img embd excerpt: {test_img_emb[-1][:5]}")
print(f"Test features excerpt: {all_mean_features[-1][:5]}")

Test txt embd excerpt: [ 0.6375199  -0.53175956  0.20533158 -0.59946156  0.6934856 ]
Test img embd excerpt: [ 0.4965897  -0.18978584  0.3129852  -1.5566363  -0.03816223]
Test features excerpt: [ 0.5670548 -0.3607727  0.2591584 -1.078049   0.3276617]


### 2.2 Normalize the Features

To perform a PCA, we first need to standardize the feature values so that each feature dimension has a mean of 0 and a standard deviation of 1.

In [48]:
all_normalized_mean_features = StandardScaler().fit_transform(all_mean_features)
print(f"All normalized mean features: {all_normalized_mean_features.shape}")
print(f"Mean: {np.mean(all_normalized_mean_features)}")
print(f"Standard Deviation: {np.std(all_normalized_mean_features)}")

All normalized mean features: (3175, 768)
Mean: -1.8147346125818586e-10
Standard Deviation: 0.9999998807907104


The mean and standard deviation have the desired values (up to floating-point error), so the features are now normalized.

### 2.3 PCA and Explained Variance

Now that we have normalized feature values, we can compute all principal components.

IMPORTANT NOTE: There is no direct mapping between the n-th PC and the n-th feature dimension. The PCs are strictly ordered by their explained variance: by definition, the first PC explains the largest share of the variance, which does not have to be the case for the first feature.

In [49]:
all_pca = PCA()
all_principal_components = all_pca.fit_transform(all_normalized_mean_features)
all_principal_components_df = pd.DataFrame(all_principal_components)
all_principal_components_df.tail()

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 758 | 759 | 760 | 761 | 762 | 763 | 764 | 765 | 766 | 767 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3170 | 0.238008 | -4.663329 | 2.159748 | 3.730483 | 4.448696 | -1.656778 | -0.004506 | -0.1656 | -1.452704 | -1.266812 | ... | -0.008645 | 0.022458 | 0.010699 | 0.003758 | 0.00794 | -0.010041 | 0.012951 | 0.000659 | -0.018835 | -0.001551 |
| 3171 | -3.178096 | -12.321038 | 11.737628 | 3.931248 | -1.461085 | -4.500793 | -4.868284 | 2.704535 | 0.289328 | -0.964185 | ... | 0.007725 | 0.015567 | -0.000813 | 0.011691 | 0.011216 | -0.008845 | -0.008537 | 0.00235 | -0.007344 | -0.00048 |
| 3172 | -2.032893 | -2.270451 | -3.973131 | -0.06735 | -2.996502 | 1.322557 | 2.729575 | -0.248537 | 0.513258 | -2.773205 | ... | 0.003104 | -0.004439 | -0.006665 | -0.004283 | -0.00362 | -0.012886 | 0.000682 | -0.008256 | -0.006176 | 0.000802 |
| 3173 | -3.744528 | -4.23551 | -7.302965 | 0.158635 | -2.333645 | 2.705621 | -0.197544 | -1.512575 | -0.685376 | -1.014182 | ... | 0.014284 | -0.00086 | 0.018936 | -0.028383 | -0.018947 | -0.018295 | -0.008542 | -0.002742 | 0.020855 | -0.000874 |
| 3174 | -3.745104 | -2.916368 | -4.66056 | 1.345548 | 0.948931 | 0.764762 | -2.684519 | 2.324849 | 1.664622 | -1.585021 | ... | -0.002761 | -0.004715 | -0.001393 | -0.003119 | -0.00186 | 0.006264 | 0.01154 | 0.004378 | -0.00143 | 0.003327 |


Sanity check: Does the explained variance array have the right shape, and do the values add up to 1?

In [50]:
all_explained_variance = all_pca.explained_variance_ratio_
all_sum_of_explained_variance_values = np.sum(all_explained_variance)
print(f"Explained variance array shape: {all_explained_variance.shape}")
print(f"Sum of all explained variance values: {all_sum_of_explained_variance_values}")

Explained variance array shape: (768,)
Sum of all explained variance values: 1.0


Now, we have the explained variance for all the 768 principal components. Let’s now sum over the first and last 384 principal components:

In [51]:
expl_var_of_first_half_of_pcs = np.sum(all_explained_variance[:384])
expl_var_of_second_half_of_pcs = np.sum(all_explained_variance[-384:])
print(f"Explained variance of the first 384 PCs within all splits: {expl_var_of_first_half_of_pcs}")
print(f"Explained variance of the last 384 PCs within all splits: {expl_var_of_second_half_of_pcs}")

Explained variance of the first 384 PCs within all splits: 0.9037585258483887
Explained variance of the last 384 PCs within all splits: 0.0962415337562561
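
Sections 1 and 2 run the same three steps (standardize, fit a full PCA, sum the explained variance ratio per half) on different inputs. The steps could be wrapped in one function so that both splits go through identical code. This is only a sketch: the function name `explained_variance_by_half` is an assumption, and a small random matrix stands in for the real features.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


def explained_variance_by_half(features):
    """Standardize the features, fit a full PCA, and return the explained
    variance ratio summed over the first and the second half of the PCs."""
    normalized = StandardScaler().fit_transform(features)
    ratios = PCA().fit(normalized).explained_variance_ratio_
    half = len(ratios) // 2
    return float(ratios[:half].sum()), float(ratios[half:].sum())


# Stand-in data (hypothetical); the real inputs would be train_mean_features
# and all_mean_features.
rng = np.random.default_rng(0)
stand_in = rng.normal(size=(300, 8))

first, second = explained_variance_by_half(stand_in)
print(f"First half: {first}, second half: {second}")
```

Because the PCs are sorted by explained variance, the first half always explains at least as much as the second half, even for random data like the stand-in above.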


## 3. Summary of Results and Conclusion

Results for train split and all data:

In [52]:
print(f"Explained variance of the first 384 PCs within train split: {train_expl_var_of_first_half_of_pcs}")
print(f"Explained variance of the second 384 PCs within train split: {train_expl_var_of_second_half_of_pcs}\n")

print(f"Explained variance of the first 384 PCs within all splits: {expl_var_of_first_half_of_pcs}")
print(f"Explained variance of the second 384 PCs within all splits: {expl_var_of_second_half_of_pcs}")

Explained variance of the first 384 PCs within train split: 0.913522481918335
Explained variance of the second 384 PCs within train split: 0.0864773839712143

Explained variance of the first 384 PCs within all splits: 0.9037585258483887
Explained variance of the second 384 PCs within all splits: 0.0962415337562561


- The first half of the PCs captures around 90 % of the features' variance (about 91.4 % for the train split and 90.4 % for all splits).
- In comparison, for the concat features, the first half of the PCs captures around 95 % of the features' variance.
- This matches our previous findings: The text features are far more important than the image features.