# Feature Extraction
Given CLIP-embedded text-image pairs, there are multiple ways of feeding them to our model. For example, one could naively concatenate the text embedding and the image embedding of every pair and use the concatenations as input. We call the actual model inputs "features". This notebook showcases different feature extraction methods for CLIP-embedded text-image pairs. The resulting features of each method are stored in pickle files, one per split.

Additionally, in Section 1, we save the labels for every split in a separate pickle file.

For example, the files for the method "concat_cos" from Section 4.1 are stored as follows.

└── data/
&nbsp;&nbsp;&nbsp;&nbsp;├── CT23_1A_checkworthy_multimodal_english_v1
&nbsp;&nbsp;&nbsp;&nbsp;└── CT23_1A_checkworthy_multimodal_english_v2/
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;├── labels/
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;├── dev_labels_v2.pickle
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;├── test_labels_v2.pickle
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;└── train_labels_v2.pickle
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;└── features/
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;└── concat_cos/
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;├── concat_cos_dev_v2.pickle
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;├── concat_cos_test_v2.pickle
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;└── concat_cos_train_v2.pickle


## 0. Imports and Constants
- Change the path to your dataset directory
- Specify the dataset versions in the #CONSTANTS# part

In [125]:
############## AUTORELOAD MAGIC ###################
%load_ext autoreload
%autoreload 2

# FUNDAMENTAL MODULES
import numpy as np

# TASK-SPECIFIC MODULES
import feature_extraction as fe
import label_extraction as le
import utils

# CONSTANTS
dataset_version = "v2"
dataset_dir = f"/home/jockl/Insync/check.worthiness@gmail.com/Google Drive/data/CT23_1A_checkworthy_multimodal_english_{dataset_version}"
pickled_labels_dir = f"{dataset_dir}/labels"
train_embeddings_path = dataset_dir + f"/embeddings_train_{dataset_version}.pickle"
dev_embeddings_path = dataset_dir + f"/embeddings_dev_{dataset_version}.pickle"
test_embeddings_path = dataset_dir + f"/embeddings_test_{dataset_version}.pickle"

# KEYS USED IN ALL DICTS
from feature_extraction import TRAIN, DEV, TEST, TXT, IMG, SPLITS

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


## 1. Extract and Pickle the Labels

In [126]:
# Initial JSON files of every split
train_json = f"{dataset_dir}/CT23_1A_checkworthy_multimodal_english_train.jsonl"
dev_json = f"{dataset_dir}/CT23_1A_checkworthy_multimodal_english_dev.jsonl"
test_json = f"{dataset_dir}/CT23_1A_checkworthy_multimodal_english_dev_test.jsonl"
all_splits_json = {"train": train_json, "dev": dev_json, "test": test_json}

# Extract and pickle labels of every split
for key, split in all_splits_json.items():
      labels_array = le.get_labels_from_dataset(split)
      print(f"Shape label array: {labels_array.shape}")
      utils.pickle_features_or_labels(labels_array, f"{pickled_labels_dir}/{key}_labels_{dataset_version}.pickle")

Shape label array: (2356,)
Pickled: /home/jockl/Insync/check.worthiness@gmail.com/Google Drive/data/CT23_1A_checkworthy_multimodal_english_v2/labels/train_labels_v2.pickle
Shape label array: (271,)
Pickled: /home/jockl/Insync/check.worthiness@gmail.com/Google Drive/data/CT23_1A_checkworthy_multimodal_english_v2/labels/dev_labels_v2.pickle
Shape label array: (548,)
Pickled: /home/jockl/Insync/check.worthiness@gmail.com/Google Drive/data/CT23_1A_checkworthy_multimodal_english_v2/labels/test_labels_v2.pickle
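
For readers without access to `label_extraction`, the following is a minimal sketch of how labels might be read from a JSONL split file. It is not the implementation of `le.get_labels_from_dataset`. The helper name `labels_from_jsonl`, the field name `class_label`, and the values `"Yes"`/`"No"` are assumptions about the dataset schema; adjust them to the actual files.

```python
import io
import json

import numpy as np


def labels_from_jsonl(lines, label_field="class_label", positive="Yes"):
    """Return a 0/1 label array with one entry per non-empty JSONL line.

    `label_field` and `positive` are assumptions about the dataset schema.
    """
    labels = [
        1 if json.loads(line)[label_field] == positive else 0
        for line in lines
        if line.strip()
    ]
    return np.array(labels, dtype=np.int64)


# Toy stand-in for a split file; a real file would be opened with open(path)
toy_file = io.StringIO(
    '{"tweet_id": "1", "class_label": "Yes"}\n'
    '{"tweet_id": "2", "class_label": "No"}\n'
    '{"tweet_id": "3", "class_label": "Yes"}\n'
)
toy_labels = labels_from_jsonl(toy_file)
print(f"Shape label array: {toy_labels.shape}")
```

The result is a one-dimensional array with one label per example, which matches the label shapes printed above.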


## 2. Load Pickled Embeddings

For every split of the dataset, two embedding arrays (txt, img) are loaded. We will store those embeddings in a nested dictionary that is intuitive to reference.

In [127]:
# Two arrays (txt, img) for every split
train_txt_embeddings, train_img_embeddings = utils.get_embeddings_from_pickle_file(train_embeddings_path)
dev_txt_embeddings, dev_img_embeddings = utils.get_embeddings_from_pickle_file(dev_embeddings_path)
test_txt_embeddings, test_img_embeddings = utils.get_embeddings_from_pickle_file(test_embeddings_path)

# Hold embeddings in a nested dictionary for easy referencing
embeddings_dict = {
      TRAIN: {TXT: train_txt_embeddings, IMG: train_img_embeddings},
      DEV: {TXT: dev_txt_embeddings, IMG: dev_img_embeddings},
      TEST:  {TXT: test_txt_embeddings, IMG: test_img_embeddings}
}

# Dimensions of embedding per split
print(utils.table_embeddings_dims_per_split(embeddings_dict))

Split	txt			img
Tr		(2356, 768)	(2356, 768)
De		(271, 768)	(271, 768)
Te		(548, 768)	(548, 768)


## 3. Naive Feature Extraction
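
As a quick orientation, the four naive methods of this section can be sketched side by side on random toy arrays. The arrays `txt` and `img` below are made-up stand-ins for real CLIP embeddings; only the 768-dimensional width is taken from the table in Section 2. Note that `txt + img` is equivalent to the `np.sum((txt, img), axis=0)` form used in Section 3.2.

```python
import numpy as np

# Toy stand-ins for the txt and img embeddings of one split (4 examples)
rng = np.random.default_rng(0)
txt = rng.normal(size=(4, 768)).astype(np.float32)
img = rng.normal(size=(4, 768)).astype(np.float32)

fused = {
    "concat": np.concatenate((txt, img), axis=1),  # width doubles to 1536
    "sum": txt + img,                              # element-wise sum
    "mean": (txt + img) / 2,                       # element-wise mean
    "hadamard": txt * img,                         # element-wise product
}

for name, features in fused.items():
    print(f"{name}\t{features.shape}")
```

Only concatenation changes the feature width; the other three methods keep the embedding dimension.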

### 3.1 Concatenation

In [128]:
# Initialize to-be-pickled feature dictionary
concat_features = {split: None for split in SPLITS}

# Concatenate txt and img arrays of every split
for split, embeddings in embeddings_dict.items():
      features = np.concatenate((embeddings[TXT], embeddings[IMG]), axis=1)
      concat_features[split] = features

# Dimensions of input feature matrix per split
print(utils.table_feature_dims_per_split(concat_features))

Split	Shape
Tr		(2356, 1536)
De		(271, 1536)
Te		(548, 1536)


In [129]:
# Spot check
print(f"Txt emb: {embeddings_dict[TRAIN][TXT][0][:5]}")
print(f"Img emb: {embeddings_dict[TRAIN][IMG][0][-5:]}")
print(f"Feature: {concat_features[TRAIN][0][:5]}\n"
      f"\t\t {concat_features[TRAIN][0][-5:]}")

Txt emb: [ 0.31258187  0.8622302  -0.19572662  0.41690043 -0.8305622 ]
Img emb: [ 0.20370999  0.39563796 -0.41939157 -0.35091972  0.02099419]
Feature: [ 0.31258187  0.8622302  -0.19572662  0.41690043 -0.8305622 ]
		 [ 0.20370999  0.39563796 -0.41939157 -0.35091972  0.02099419]


In [130]:
# Pickle the feature matrix of every split in its own file
utils.pickle_all_splits(concat_features, dataset_dir, feature_method="concat", dataset_version=dataset_version)


Pickled: /home/jockl/Insync/check.worthiness@gmail.com/Google Drive/data/CT23_1A_checkworthy_multimodal_english_v2/features/concat/concat_train_v2.pickle
Pickled: /home/jockl/Insync/check.worthiness@gmail.com/Google Drive/data/CT23_1A_checkworthy_multimodal_english_v2/features/concat/concat_dev_v2.pickle
Pickled: /home/jockl/Insync/check.worthiness@gmail.com/Google Drive/data/CT23_1A_checkworthy_multimodal_english_v2/features/concat/concat_test_v2.pickle


### 3.2 Sum

In [131]:
# Initialize to-be-pickled feature dictionary
sum_features = {split: None for split in SPLITS}

# Sum txt and img matrices of every split
for split, embeddings in embeddings_dict.items():
      features = np.sum((embeddings[TXT], embeddings[IMG]), axis=0)
      sum_features[split] = features

# Shapes of input feature matrix per split
print(utils.table_feature_dims_per_split(sum_features))

Split	Shape
Tr		(2356, 768)
De		(271, 768)
Te		(548, 768)


In [132]:
# Spot check
print(f"Txt emb: {embeddings_dict[TRAIN][TXT][0][:5]}")
print(f"Img emb: {embeddings_dict[TRAIN][IMG][0][:5]}")
print(f"Feature: {sum_features[TRAIN][0][:5]}")

Txt emb: [ 0.31258187  0.8622302  -0.19572662  0.41690043 -0.8305622 ]
Img emb: [-0.24858032  0.6837659   0.81424457  0.59864545  0.49090493]
Feature: [ 0.06400155  1.5459961   0.61851794  1.0155458  -0.33965725]


In [133]:
# Pickle the feature matrix of every split in its own file
utils.pickle_all_splits(sum_features, dataset_dir, feature_method="sum", dataset_version=dataset_version)

Pickled: /home/jockl/Insync/check.worthiness@gmail.com/Google Drive/data/CT23_1A_checkworthy_multimodal_english_v2/features/sum/sum_train_v2.pickle
Pickled: /home/jockl/Insync/check.worthiness@gmail.com/Google Drive/data/CT23_1A_checkworthy_multimodal_english_v2/features/sum/sum_dev_v2.pickle
Pickled: /home/jockl/Insync/check.worthiness@gmail.com/Google Drive/data/CT23_1A_checkworthy_multimodal_english_v2/features/sum/sum_test_v2.pickle


### 3.3 Mean

In [134]:
# Initialize to-be-pickled feature dictionary
mean_features = {split: None for split in SPLITS}

# Compute the mean of txt and img matrices of every split
for split, embeddings in embeddings_dict.items():
      features = np.mean((embeddings[TXT], embeddings[IMG]), axis=0)
      mean_features[split] = features

# Shapes of input feature matrix per split
print(utils.table_feature_dims_per_split(mean_features))

Split	Shape
Tr		(2356, 768)
De		(271, 768)
Te		(548, 768)


In [135]:
# Spot check
print(f"Txt emb: {embeddings_dict[TRAIN][TXT][0][:5]}")
print(f"Img emb: {embeddings_dict[TRAIN][IMG][0][:5]}")
print(f"Feature: {mean_features[TRAIN][0][:5]}")

Txt emb: [ 0.31258187  0.8622302  -0.19572662  0.41690043 -0.8305622 ]
Img emb: [-0.24858032  0.6837659   0.81424457  0.59864545  0.49090493]
Feature: [ 0.03200077  0.77299803  0.30925897  0.5077729  -0.16982862]


In [136]:
# Pickle the feature matrix of every split in its own file
utils.pickle_all_splits(mean_features, dataset_dir, feature_method="mean", dataset_version=dataset_version)

Pickled: /home/jockl/Insync/check.worthiness@gmail.com/Google Drive/data/CT23_1A_checkworthy_multimodal_english_v2/features/mean/mean_train_v2.pickle
Pickled: /home/jockl/Insync/check.worthiness@gmail.com/Google Drive/data/CT23_1A_checkworthy_multimodal_english_v2/features/mean/mean_dev_v2.pickle
Pickled: /home/jockl/Insync/check.worthiness@gmail.com/Google Drive/data/CT23_1A_checkworthy_multimodal_english_v2/features/mean/mean_test_v2.pickle


### 3.4 Hadamard Product

In [137]:
# Initialize to-be-pickled feature dictionary
hadamard_features = {split: None for split in SPLITS}

# Multiply txt and img matrices element-wise for every split
for split, embeddings in embeddings_dict.items():
      features = np.multiply(embeddings[TXT], embeddings[IMG])
      hadamard_features[split] = features

# Shapes of input feature matrix per split
print(utils.table_feature_dims_per_split(hadamard_features))

Split	Shape
Tr		(2356, 768)
De		(271, 768)
Te		(548, 768)


In [138]:
# Spot check
print(f"Txt emb: {embeddings_dict[TRAIN][TXT][0][:5]}")
print(f"Img emb: {embeddings_dict[TRAIN][IMG][0][:5]}")
print(f"Feature: {hadamard_features[TRAIN][0][:5]}")

Txt emb: [ 0.31258187  0.8622302  -0.19572662  0.41690043 -0.8305622 ]
Img emb: [-0.24858032  0.6837659   0.81424457  0.59864545  0.49090493]
Feature: [-0.0777017   0.5895636  -0.15936933  0.24957554 -0.40772706]


In [139]:
# Pickle the feature matrix of every split in its own file
utils.pickle_all_splits(hadamard_features, dataset_dir, feature_method="hadamard", dataset_version=dataset_version)

Pickled: /home/jockl/Insync/check.worthiness@gmail.com/Google Drive/data/CT23_1A_checkworthy_multimodal_english_v2/features/hadamard/hadamard_train_v2.pickle
Pickled: /home/jockl/Insync/check.worthiness@gmail.com/Google Drive/data/CT23_1A_checkworthy_multimodal_english_v2/features/hadamard/hadamard_dev_v2.pickle
Pickled: /home/jockl/Insync/check.worthiness@gmail.com/Google Drive/data/CT23_1A_checkworthy_multimodal_english_v2/features/hadamard/hadamard_test_v2.pickle


## 4. Naive Approaches Combined with Text/Image Similarity
Each feature matrix from Section 3 can be supplemented with one additional feature dimension that holds the cosine similarity of every text-image embedding pair. This dimension depends only on the embeddings, not on the fusion method, so we compute it once and reuse it for all four methods.

In [140]:
# Sanity check: Cosine sim of a single embedding with itself
print(fe.cosine(embeddings_dict[TRAIN][TXT][0], embeddings_dict[TRAIN][TXT][0]))

1.0


In [141]:
# Sanity check: Compute cosine array for a split with itself
print(fe.compute_cosine_array(embeddings_dict[TRAIN][TXT], embeddings_dict[TRAIN][TXT]))

[1.         0.99999994 1.0000001  ... 1.         1.         0.99999994]


In [142]:
# Compute cosine similarity dimension for every split
train_cosine_array = fe.compute_cosine_array(embeddings_dict[TRAIN][TXT], embeddings_dict[TRAIN][IMG])
dev_cosine_array = fe.compute_cosine_array(embeddings_dict[DEV][TXT], embeddings_dict[DEV][IMG])
test_cosine_array = fe.compute_cosine_array(embeddings_dict[TEST][TXT], embeddings_dict[TEST][IMG])

# Save cosine sim dimension in a dict for easy referencing
cosine_dict = {
      TRAIN: train_cosine_array,
      DEV: dev_cosine_array,
      TEST:  test_cosine_array
}

# Check cosine sim shapes
print(utils.table_feature_dims_per_split(cosine_dict))

Split	Shape
Tr		(2356,)
De		(271,)
Te		(548,)
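
The cosine-similarity dimension can be sketched in plain NumPy as follows. This is an illustration under assumptions, not the code of `fe.compute_cosine_array` or `fe.add_feature_dim_to_all_splits`: the helper names `rowwise_cosine` and `append_column`, the `eps` guard, and the toy arrays are all hypothetical.

```python
import numpy as np


def rowwise_cosine(a, b, eps=1e-12):
    """Cosine similarity between row i of `a` and row i of `b`."""
    dots = np.sum(a * b, axis=1)
    norms = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    # eps guards against division by zero for all-zero rows
    return dots / np.maximum(norms, eps)


def append_column(features, column):
    """Append a 1-D array as one extra feature column."""
    return np.concatenate((features, column[:, np.newaxis]), axis=1)


# Toy stand-ins for the txt and img embeddings of one split (5 examples)
rng = np.random.default_rng(0)
txt = rng.normal(size=(5, 768))
img = rng.normal(size=(5, 768))

cos = rowwise_cosine(txt, img)
mean_cos = append_column((txt + img) / 2, cos)
print(cos.shape, mean_cos.shape)
```

Appending the similarity as a single column is why every feature matrix in the following subsections grows by exactly one dimension.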


### 4.1 Concatenation + Cosine Similarity

In [143]:
# Add cosine dim to every split
concat_cos_features = fe.add_feature_dim_to_all_splits(concat_features, cosine_dict)
print(utils.table_feature_dims_per_split(concat_cos_features))

Split	Shape
Tr		(2356, 1537)
De		(271, 1537)
Te		(548, 1537)


In [144]:
# Pickle the feature matrix of every split in its own file
utils.pickle_all_splits(concat_cos_features, dataset_dir, feature_method="concat_cos", dataset_version=dataset_version)

Pickled: /home/jockl/Insync/check.worthiness@gmail.com/Google Drive/data/CT23_1A_checkworthy_multimodal_english_v2/features/concat_cos/concat_cos_train_v2.pickle
Pickled: /home/jockl/Insync/check.worthiness@gmail.com/Google Drive/data/CT23_1A_checkworthy_multimodal_english_v2/features/concat_cos/concat_cos_dev_v2.pickle
Pickled: /home/jockl/Insync/check.worthiness@gmail.com/Google Drive/data/CT23_1A_checkworthy_multimodal_english_v2/features/concat_cos/concat_cos_test_v2.pickle


### 4.2 Sum + Cosine Similarity

In [145]:
# Add cosine dim to every split
sum_cos_features = fe.add_feature_dim_to_all_splits(sum_features, cosine_dict)
print(utils.table_feature_dims_per_split(sum_cos_features))

Split	Shape
Tr		(2356, 769)
De		(271, 769)
Te		(548, 769)


In [146]:
# Pickle the feature matrix of every split in its own file
utils.pickle_all_splits(sum_cos_features, dataset_dir, feature_method="sum_cos", dataset_version=dataset_version)

Pickled: /home/jockl/Insync/check.worthiness@gmail.com/Google Drive/data/CT23_1A_checkworthy_multimodal_english_v2/features/sum_cos/sum_cos_train_v2.pickle
Pickled: /home/jockl/Insync/check.worthiness@gmail.com/Google Drive/data/CT23_1A_checkworthy_multimodal_english_v2/features/sum_cos/sum_cos_dev_v2.pickle
Pickled: /home/jockl/Insync/check.worthiness@gmail.com/Google Drive/data/CT23_1A_checkworthy_multimodal_english_v2/features/sum_cos/sum_cos_test_v2.pickle


### 4.3 Mean + Cosine Similarity

In [147]:
# Add cosine dim to every split
mean_cos_features = fe.add_feature_dim_to_all_splits(mean_features, cosine_dict)
print(utils.table_feature_dims_per_split(mean_cos_features))

Split	Shape
Tr		(2356, 769)
De		(271, 769)
Te		(548, 769)


In [148]:
# Pickle the feature matrix of every split in its own file
utils.pickle_all_splits(mean_cos_features, dataset_dir, feature_method="mean_cos", dataset_version=dataset_version)

Pickled: /home/jockl/Insync/check.worthiness@gmail.com/Google Drive/data/CT23_1A_checkworthy_multimodal_english_v2/features/mean_cos/mean_cos_train_v2.pickle
Pickled: /home/jockl/Insync/check.worthiness@gmail.com/Google Drive/data/CT23_1A_checkworthy_multimodal_english_v2/features/mean_cos/mean_cos_dev_v2.pickle
Pickled: /home/jockl/Insync/check.worthiness@gmail.com/Google Drive/data/CT23_1A_checkworthy_multimodal_english_v2/features/mean_cos/mean_cos_test_v2.pickle


### 4.4 Hadamard Product + Cosine Similarity

In [149]:
# Add cosine dim to every split
hadamard_cos_features = fe.add_feature_dim_to_all_splits(hadamard_features, cosine_dict)
print(utils.table_feature_dims_per_split(hadamard_cos_features))

Split	Shape
Tr		(2356, 769)
De		(271, 769)
Te		(548, 769)


In [150]:
# Pickle the feature matrix of every split in its own file
utils.pickle_all_splits(hadamard_cos_features, dataset_dir, feature_method="hadamard_cos", dataset_version=dataset_version)

Pickled: /home/jockl/Insync/check.worthiness@gmail.com/Google Drive/data/CT23_1A_checkworthy_multimodal_english_v2/features/hadamard_cos/hadamard_cos_train_v2.pickle
Pickled: /home/jockl/Insync/check.worthiness@gmail.com/Google Drive/data/CT23_1A_checkworthy_multimodal_english_v2/features/hadamard_cos/hadamard_cos_dev_v2.pickle
Pickled: /home/jockl/Insync/check.worthiness@gmail.com/Google Drive/data/CT23_1A_checkworthy_multimodal_english_v2/features/hadamard_cos/hadamard_cos_test_v2.pickle


## 5. Storage
Every feature extraction method presented above yields three files, one per split. For example, the files for the method "concat_cos" are stored as follows.
.
└── data/
&nbsp;&nbsp;&nbsp;&nbsp;├── CT23_1A_checkworthy_multimodal_english_v1
&nbsp;&nbsp;&nbsp;&nbsp;└── CT23_1A_checkworthy_multimodal_english_v2/
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;├── labels/
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;├── dev_labels_v2.pickle
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;├── test_labels_v2.pickle
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;└── train_labels_v2.pickle
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;└── features/
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;└── concat_cos/
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;├── concat_cos_dev_v2.pickle
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;├── concat_cos_test_v2.pickle
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;└── concat_cos_train_v2.pickle
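
The naming scheme of this layout can be captured in two small path helpers. This is a sketch only: `feature_path` and `label_path` are hypothetical names that are not part of `utils`, and the relative `root` is a placeholder for your own dataset directory.

```python
from pathlib import Path


def feature_path(dataset_dir, method, split, version):
    """Path of the pickled feature matrix for one method and split."""
    return Path(dataset_dir) / "features" / method / f"{method}_{split}_{version}.pickle"


def label_path(dataset_dir, split, version):
    """Path of the pickled label array for one split."""
    return Path(dataset_dir) / "labels" / f"{split}_labels_{version}.pickle"


# Placeholder root; replace with your own dataset directory
root = "data/CT23_1A_checkworthy_multimodal_english_v2"
for split in ("train", "dev", "test"):
    print(feature_path(root, "concat_cos", split, "v2").as_posix())
    print(label_path(root, split, "v2").as_posix())
```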

## 6. Load Pickled Features and Labels
This section shows how to load a pickled feature matrix and a pickled label array.

In [151]:
# Example: Load the mean_cos feature matrix of the train split
file_path = f"{dataset_dir}/features/mean_cos/mean_cos_train_{dataset_version}.pickle"
train_mean_cos_features = np.load(file_path, allow_pickle=True)
print(type(train_mean_cos_features))
print(train_mean_cos_features.shape)

<class 'numpy.ndarray'>
(2356, 769)
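
As a self-contained illustration of this loading step, the sketch below writes a toy array with Python's `pickle` module and reads it back in two ways. It assumes that the files were written with the standard `pickle` module, which is not confirmed for `utils.pickle_features_or_labels`; the file name and the toy array are made up.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np

# Toy feature matrix standing in for a real pickled split
toy = np.arange(6, dtype=np.float32).reshape(2, 3)

with tempfile.TemporaryDirectory() as tmp_dir:
    file_path = Path(tmp_dir) / "toy_train_v2.pickle"
    with open(file_path, "wb") as f:
        pickle.dump(toy, f)

    # Option 1: the pickle module directly
    with open(file_path, "rb") as f:
        via_pickle = pickle.load(f)

    # Option 2: np.load falls back to unpickling when allow_pickle=True
    via_numpy = np.load(file_path, allow_pickle=True)

print(type(via_pickle), via_pickle.shape)
```

Both options execute arbitrary code stored in the file, so only load pickle files from sources you trust.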


In [152]:
# Example: Load the train labels
file_path = f"{dataset_dir}/labels/train_labels_{dataset_version}.pickle"
labels = np.load(file_path, allow_pickle=True)
print(type(labels))
print(labels.shape)

<class 'numpy.ndarray'>
(2356,)


## 7. Further Feature Engineering Methods
For future approaches beyond the baseline, collect more sophisticated feature engineering methods here:
- RpBERT: https://ar5iv.labs.arxiv.org/html/2102.02967