# Notebook Summary
- Load `sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2` from Hugging Face
- Use the model to
    - extract embeddings from Tweets
    - extract embeddings from concatenations of Tweet + OCR text

# 0. Imports and Constants
- Do not forget to select the user and the dataset version in the `SELECT` block of the cell below

In [37]:
############## AUTORELOAD MAGIC ###################
%load_ext autoreload
%autoreload 2
###################################################

############## FUNDAMENTAL MODULES ################
import json
import os
import sys
import copy
import numpy as np
import pickle
import re
##################################################

############## TASK-SPECIFIC MODULES #############
sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath("__file__"))))
from data import TweetNormalizer, utils, feature_extraction
###################################################

############## DATA SCIENCE & ML MODULES ##########
import torch
import pandas as pd
from transformers import AutoModel, AutoTokenizer
from sentence_transformers import SentenceTransformer
from scipy import stats
###################################################

############# CONSTANT DICT KEYS ###################
TRAIN = "train"
DEV = "dev"
TEST = "test"
GOLD = "gold"
TXT = "txt"
IMG = "img"
OCR = "ocr"
TXT_OCR = "txt_ocr"
SPLITS = [TRAIN, DEV, TEST, GOLD]
####################################################

####################### SELECT ###########################
users = ["patriziopalmisano", "onurdenizguler", "jockl"]
user = users[2] # SELECT USER
version = "v2" # SELECT DATASET VERSION
dataset_version = version
##########################################################

# Only the Google Drive root differs between users (macOS vs. Linux/Insync)
if user in users[:2]:
    cw_dir = f"/Users/{user}/Library/CloudStorage/GoogleDrive-check.worthiness@gmail.com/My Drive"
else:
    cw_dir = "/home/jockl/Insync/check.worthiness@gmail.com/Google Drive"

# Dataset paths are built the same way for every user
data_dir = f"{cw_dir}/data/CT23_1A_checkworthy_multimodal_english"
data_dir_with_version = f"{data_dir}_{dataset_version}"
gold_dir = f"{cw_dir}/data/CT23_1A_checkworthy_multimodal_english_test_gold"


The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


# 1. Load all Datasets
First, we extract all the raw texts from the JSON files. Note that the tweet and its OCR text are concatenated with a newline between them.

In [38]:
# Load the datasets
raw_dataset, tweet_texts, imgs, tweet_ids, ocr_texts, tweet_concat_ocr = utils.load_data_splits_with_gold_dataset(data_dir, version)

Sizes of txt, img, ocr, txt+ocr arrays in train, test, dev, gold:
2356 2356 2356 2356
271 271 2356 2356
548 548 2356 2356
736 736 2356 2356
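
In the printout above, the last two columns read 2356 for every split. This may only be a quirk of how the loader prints the sizes, but it is cheap to verify. The following is a minimal sketch of such a check; `check_split_sizes` is a hypothetical helper that is not part of `utils`, and it assumes every returned object is a dict keyed by split name.

```python
def check_split_sizes(arrays_by_name, splits):
    """Return {split: {name: length}} for every split whose arrays differ in length."""
    mismatches = {}
    for split in splits:
        # Length of each array (txt, img, ocr, ...) for this split
        sizes = {name: len(arr[split]) for name, arr in arrays_by_name.items()}
        if len(set(sizes.values())) > 1:
            mismatches[split] = sizes
    return mismatches


# Toy data standing in for the real loader output (hypothetical values)
toy_arrays = {
    "txt": {"train": ["a", "b"], "dev": ["c"]},
    "ocr": {"train": ["x", "y"], "dev": ["z", "w"]},
}
mismatches = check_split_sizes(toy_arrays, ["train", "dev"])
print(mismatches)
```

In the notebook, the call would look like `check_split_sizes({TXT: tweet_texts, IMG: imgs, OCR: ocr_texts, TXT_OCR: tweet_concat_ocr}, SPLITS)`; an empty dict means all arrays line up.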


In [39]:
# Inspect Tweet and OCR concatenation
print(f"Tweet:\n{tweet_texts[TRAIN][6]}")
print(f"\nOCR:\n{ocr_texts[TRAIN][6]}")
print(f"\nConcat:\n{tweet_concat_ocr[TRAIN][6]}")

Tweet:
Despite calls for calm, some local people are panicking over the deadly coronavirus outbreak.

"I really am. It's really something that I'm really frightened about right now."

"It might be actually worse than what they're telling us."

https://t.co/2aeFCPdg4T https://t.co/Stjz0XlcwA

OCR:
CORONAVIRUS
MGN


Concat:
Despite calls for calm, some local people are panicking over the deadly coronavirus outbreak.

"I really am. It's really something that I'm really frightened about right now."

"It might be actually worse than what they're telling us."

https://t.co/2aeFCPdg4T https://t.co/Stjz0XlcwA
CORONAVIRUS
MGN
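
The concatenation shown above is consistent with joining the tweet and the OCR text with a single newline. The actual logic lives in `utils.load_data_splits_with_gold_dataset`, which is not shown here, so the following is only a sketch under that assumption; `concat_tweet_and_ocr` is a hypothetical name and the example strings are shortened.

```python
def concat_tweet_and_ocr(tweet: str, ocr: str) -> str:
    """Join tweet text and OCR text with a single newline (assumed behavior)."""
    return f"{tweet}\n{ocr}"


# Shortened stand-ins for the sample inspected above
tweet = "Some local people are panicking.\n\nhttps://t.co/2aeFCPdg4T"
ocr = "CORONAVIRUS\nMGN\n"

concat = concat_tweet_and_ocr(tweet, ocr)
print(concat)
```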



# 2. Normalize Texts

In [40]:
# Normalize all tweets using TweetNormalizer()
normalized_tweets = {split: [TweetNormalizer.normalizeTweet(tweet) for tweet in tweet_texts[split]] for split in SPLITS}
normalized_tweet_concat_ocr = {split: [TweetNormalizer.normalizeTweet(concat) for concat in tweet_concat_ocr[split]] for split in SPLITS}
print(len(normalized_tweets[TRAIN]))
print(len(normalized_tweet_concat_ocr[TRAIN]))

2356
2356


In [41]:
# Inspect normalization of tweets
print(f"Tweet:\n{tweet_texts[TRAIN][6]}")
print(f"\nNormalized Tweet:\n{normalized_tweets[TRAIN][6]}")

Tweet:
Despite calls for calm, some local people are panicking over the deadly coronavirus outbreak.

"I really am. It's really something that I'm really frightened about right now."

"It might be actually worse than what they're telling us."

https://t.co/2aeFCPdg4T https://t.co/Stjz0XlcwA

Normalized Tweet:
Despite calls for calm , some local people are panicking over the deadly coronavirus outbreak . " I really am . It 's really something that I 'm really frightened about right now . " " It might be actually worse than what they 're telling us . " HTTPURL HTTPURL
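
Two effects are visible in the normalized output above: URLs are replaced by the placeholder `HTTPURL`, and line breaks are collapsed into single spaces. The sketch below reproduces only these two effects with the standard library. It is a simplified approximation, not the `TweetNormalizer` implementation, and it does not separate punctuation into its own tokens as the real normalizer does; `simple_normalize` is a hypothetical name.

```python
import re

# Matches http(s) links and bare www. links up to the next whitespace
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")


def simple_normalize(text: str) -> str:
    """Replace URLs with HTTPURL and collapse all whitespace runs to one space."""
    text = URL_PATTERN.sub("HTTPURL", text)
    return " ".join(text.split())


sample = "Despite calls for calm.\n\nhttps://t.co/2aeFCPdg4T https://t.co/Stjz0XlcwA"
normalized = simple_normalize(sample)
print(normalized)
```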


In [42]:
# Inspect normalization of tweet_ocr_concat
print(f"Concat:\n{tweet_concat_ocr[TRAIN][6]}")
print(f"Normalized Concat:\n{normalized_tweet_concat_ocr[TRAIN][6]}")

Concat:
Despite calls for calm, some local people are panicking over the deadly coronavirus outbreak.

"I really am. It's really something that I'm really frightened about right now."

"It might be actually worse than what they're telling us."

https://t.co/2aeFCPdg4T https://t.co/Stjz0XlcwA
CORONAVIRUS
MGN

Normalized Concat:
Despite calls for calm , some local people are panicking over the deadly coronavirus outbreak . " I really am . It 's really something that I 'm really frightened about right now . " " It might be actually worse than what they 're telling us . " HTTPURL HTTPURL CORONAVIRUS MGN


# 3. Set Up The Model and Embed A Minimal Example

In [43]:
# Set up device
device = "cuda" if torch.cuda.is_available() else \
         ("mps" if torch.backends.mps.is_available() else "cpu")

In [44]:
# Get the model
model = SentenceTransformer('sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2').to(device)

In [45]:
# Embed one example text
input_text = normalized_tweets[TRAIN][0]
embedding = model.encode(input_text)
print(embedding.shape)

(384,)


# 4. Embed the Samples

First, we set the batch size used to embed each data split and the short model name that appears in the pickle file names.

In [46]:
# Set batch size and model name here
batch_size = 8
model_name = "paraphrase"
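
With `batch_size = 8`, the 2356 training samples do not divide evenly into batches. The logs in this section report `Num batches: 294`, which matches floor division; how `utils.embed_and_pickle_split` handles the remaining samples is not shown in this notebook. The sketch below only works through the arithmetic, and `batch_bounds` is a hypothetical helper.

```python
def batch_bounds(num_samples: int, batch_size: int):
    """Return (start, end) index pairs covering all samples, last batch may be shorter."""
    return [
        (start, min(start + batch_size, num_samples))
        for start in range(0, num_samples, batch_size)
    ]


num_samples, batch_size = 2356, 8
bounds = batch_bounds(num_samples, batch_size)

full_batches = num_samples // batch_size  # batches with exactly 8 samples
remainder = num_samples % batch_size      # samples left for a final partial batch

print(full_batches)  # 294
print(remainder)     # 4
print(len(bounds))   # 295
print(bounds[-1])    # (2352, 2356)
```

So a loop that covers every sample runs over 294 full batches plus one partial batch of 4 samples.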

## 4.1 Embed the Tweets Only

In [47]:
# Embed every split
for split in SPLITS:
    utils.embed_and_pickle_split(model, model_name, data_dir_with_version, split, normalized_tweets[split], with_ocr=False, batch_size=batch_size)


Split: train
Num samples: 2356
Num batches: 294
train batch 0/294
train batch 1/294
train batch 2/294
train batch 3/294
train batch 4/294
train batch 5/294
train batch 6/294
train batch 7/294
train batch 8/294
train batch 9/294
train batch 10/294
train batch 11/294
train batch 12/294
train batch 13/294
train batch 14/294
train batch 15/294
train batch 16/294
train batch 17/294
train batch 18/294
train batch 19/294
train batch 20/294
train batch 21/294
train batch 22/294
train batch 23/294
train batch 24/294
train batch 25/294
train batch 26/294
train batch 27/294
train batch 28/294
train batch 29/294
train batch 30/294
train batch 31/294
train batch 32/294
train batch 33/294
train batch 34/294
train batch 35/294
train batch 36/294
train batch 37/294
train batch 38/294
train batch 39/294
train batch 40/294
train batch 41/294
train batch 42/294
train batch 43/294
train batch 44/294
train batch 45/294
train batch 46/294
train batch 47/294
train batch 48/294
train batch 49/294
...

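## 4.2 Embed the Concatenations of Tweet + OCR Text

In [48]:
# Embed every split, this time using the tweet + OCR concatenations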
for split in SPLITS:
    utils.embed_and_pickle_split(model, model_name, data_dir_with_version, split, normalized_tweet_concat_ocr[split], with_ocr=True, batch_size=batch_size)


Split: train
Num samples: 2356
Num batches: 294
train batch 0/294
train batch 1/294
train batch 2/294
train batch 3/294
train batch 4/294
train batch 5/294
train batch 6/294
train batch 7/294
train batch 8/294
train batch 9/294
train batch 10/294
train batch 11/294
train batch 12/294
train batch 13/294
train batch 14/294
train batch 15/294
train batch 16/294
train batch 17/294
train batch 18/294
train batch 19/294
train batch 20/294
train batch 21/294
train batch 22/294
train batch 23/294
train batch 24/294
train batch 25/294
train batch 26/294
train batch 27/294
train batch 28/294
train batch 29/294
train batch 30/294
train batch 31/294
train batch 32/294
train batch 33/294
train batch 34/294
train batch 35/294
train batch 36/294
train batch 37/294
train batch 38/294
train batch 39/294
train batch 40/294
train batch 41/294
train batch 42/294
train batch 43/294
train batch 44/294
train batch 45/294
train batch 46/294
train batch 47/294
train batch 48/294
train batch 49/294
...

# 6. Load the Pickled Embeddings
Example Usage:

In [50]:
# Load embedding tensor of one data split from pickle file
pickle_file = f"{data_dir_with_version}/{model_name}_{TRAIN}_{dataset_version}.pickle"
with open(pickle_file, 'rb') as handle:
    embeddings_tensor = pickle.load(handle)
    print(embeddings_tensor.shape)

(2356, 384)
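
The saving side of this round trip happens inside `utils.embed_and_pickle_split`, which is not shown here. The sketch below illustrates the round trip with a small random array standing in for real embeddings. It assumes the file name pattern used in the cell above; `pickle_path` is a hypothetical helper, and the naming for files written with `with_ocr=True` is not covered.

```python
import os
import pickle
import tempfile

import numpy as np


def pickle_path(data_dir: str, model_name: str, split: str, version: str) -> str:
    """Build the pickle file path following the pattern used in the cell above."""
    return os.path.join(data_dir, f"{model_name}_{split}_{version}.pickle")


# Random stand-in for real embeddings: 5 samples, 384 dimensions
rng = np.random.default_rng(0)
fake_embeddings = rng.standard_normal((5, 384)).astype(np.float32)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = pickle_path(tmp_dir, "paraphrase", "train", "v2")
    # Write the array to disk
    with open(path, "wb") as handle:
        pickle.dump(fake_embeddings, handle, protocol=pickle.HIGHEST_PROTOCOL)
    # Read it back, as in the loading cell above
    with open(path, "rb") as handle:
        restored = pickle.load(handle)

print(restored.shape)
```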
