# Stage 3: Triplet Embeddings

To embed the triplets we use Sentence-BERT (SBERT), which is designed and trained to derive dense vector representations of sentences and paragraphs. Unlike BERT, which outputs one vector per input token, SBERT outputs a single fixed-size vector for the entire input text, which makes it suitable for semantic similarity tasks and other scenarios that require sentence-level representations. We use the following template to obtain a joint representation of hero, villain, and victim:

```text
In this article [HERO] is portrayed as a hero, [VILLAIN] is portrayed as a villain, and [VICTIM] is portrayed as a victim.
```
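
To make the difference between per-token and per-sentence vectors concrete, here is a minimal sketch of mean pooling, the step many SBERT models use to collapse token vectors into one sentence vector. The function name `mean_pool` and the toy numbers are illustrative assumptions, not part of the pipeline.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask entry is 1 (padding is ignored)."""
    dim = len(token_vectors[0])
    total = [0.0] * dim
    count = 0
    for vector, keep in zip(token_vectors, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                total[i] += value
    if count == 0:
        raise ValueError("attention_mask must keep at least one token")
    return [value / count for value in total]


# Three toy "token" vectors; the last one is padding and is masked out.
tokens = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]
print(mean_pool(tokens, mask))  # [2.0, 3.0]
```

Whatever the number of tokens, the result always has the same dimensionality, which is what makes sentence vectors directly comparable.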

## Fetch Articles

In [1]:
from utils.preprocessing import *
from utils.accelerators import *
from utils.multithreading import *
from utils.database import *
from utils.files import *
from datasets import Dataset
import random



### Connect to Database

Credentials are sourced from the `.env` file.

In [3]:
_, db = getConnection(use_dotenv=True)

### Query Database

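Fetches a limited number of articles (50 here) that have a denoising result and a text length below 10,000,
returning the fields `url`, `title`, `parsing_result.text`, and `processing_result`. The filter that would skip articles that already have a `sentence_embedding_result` is commented out in this run.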

In [4]:
collection = "v2_sampled_articles"
fields = {"url": 1, "title": 1, "parsing_result.text": 1, "processing_result": 1}
query = {
    "denoising_result": {"$exists": True},
    "parsing_result.text_length": {"$lt": 10000},
    #"sentence_embedding_result": {"$exists": False},
    #"initial_subsample": False
}
articles = fetchArticleTexts(db, 50, 0, fields, query, collection)
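
A minimal sketch of how the filter could be toggled between "all eligible articles" and "only articles without an embedding yet". The helper `build_query` is hypothetical; only the field names and the MongoDB operators `$exists` and `$lt` are taken from the cell above.

```python
def build_query(only_unprocessed=False, max_text_length=10000):
    """Build a MongoDB filter document for the article collection."""
    query = {
        "denoising_result": {"$exists": True},
        "parsing_result.text_length": {"$lt": max_text_length},
    }
    if only_unprocessed:
        # Skip articles that already carry an embedding result.
        query["sentence_embedding_result"] = {"$exists": False}
    return query


print(build_query(only_unprocessed=True))
```

Keeping the toggle in one place avoids commenting filter lines in and out between runs.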

Example article:

In [6]:
example_article = random.choice(articles)
title = example_article.get("title")
text = example_article.get("parsing_result").get("text")
print(f"Title: {title}\nText: {text}")
print(f"Processing Result: {example_article.get('processing_result')}")


Title: UK’s ‘Sunday Times’ Pulls Transphobic Editorial by British TV Host Jeremy Clarkson After Outrage
Text: Towleroad Gay News Gay Blog Towleroad: More than gay news | gay men Michael Fitzgerald January 25, 2016 Leave a Comment   Former Top Gear presenter Jeremy Clarkson has come under fire for his offensive remarks aimed at transgender people. In a newspaper column titled “Transgender issues are driving me nuts” which has since been taken offline, Clarkson wrote that trans people are “only really to be found on the internet or in the seedier bits of Bangkok” and are “nothing more than the punchline in a stag night anecdote.” Writing in the Sunday Times, he added that parents of transgender children should not “indulge this whim” and that after applying “a bit of make-up,” incarcerated trans women “spend the rest of their lives being a lesbian” living “every man's dream.” jeremy clarkson is so disgusting. he's a cis man who knows nothing about trans struggles. pic.twitter.com/VCDkbs8


Processes the 'parsing_result' of each article to clean the text, and filters out articles 
that lack a 'title' or 'parsing_result'.


In [7]:
# Basic text cleaning, e.g. removing newlines, tabs, etc.
articles = cleanArticles(articles)

Cleaning articles: 100%|██████████| 50/50 [00:00<00:00, 1623.22it/s]


In [8]:
# Filter out articles with no title or no parsing result 
articles = [article for article in articles if article.get(
    "title", "") and article.get("parsing_result", "")]

print("Number of articles:", len(articles))

Number of articles: 50


### Export as JSON

Saves the given data to a JSON file for optional visual inspection.

In [9]:
exportAsJSON("../data/input/articles.json",  articles)

### Convert to HF Dataset

Convert article IDs to strings and transform a list of articles into a dataset with fields: id, title, url, and text extracted from parsing results. The HuggingFace `datasets` library provides several key advantages over plain JSON files:

- **Efficiency**: The datasets are memory-mapped, allowing you to work with data that's larger than your available RAM without loading the entire dataset into memory. 
- **Speed**: Datasets in the HuggingFace format (which is Arrow-based) can be loaded faster than large JSON files, facilitating quicker data operations.
- **Columnar Storage**: By using Apache Arrow for storage, HuggingFace datasets benefit from a columnar format that ensures more efficient serialization and deserialization compared to row-based storage, such as JSON.
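
As a rough illustration of the conversion, the sketch below turns a list of nested documents into a column-oriented dictionary. The project helper `convertListToDataset` is not shown in this notebook, so `get_path` and `rows_to_columns` are hypothetical stand-ins that only mimic the column naming seen in the output (the last segment of each dotted key).

```python
def get_path(document, dotted_key, default=None):
    """Follow a dotted key such as 'parsing_result.text' through nested dicts."""
    current = document
    for part in dotted_key.split("."):
        if not isinstance(current, dict) or part not in current:
            return default
        current = current[part]
    return current


def rows_to_columns(rows, dotted_keys):
    """Convert row-oriented documents into {column_name: [values, ...]}."""
    columns = {}
    for key in dotted_keys:
        name = key.split(".")[-1]  # 'parsing_result.text' -> 'text'
        values = [get_path(row, key) for row in rows]
        if key == "_id":
            values = [str(value) for value in values]  # IDs become strings
        columns[name] = values
    return columns


rows = [
    {"_id": 1, "title": "A", "parsing_result": {"text": "foo"},
     "processing_result": {"hero": ["X"]}},
    {"_id": 2, "title": "B", "parsing_result": {"text": "bar"},
     "processing_result": {"hero": []}},
]
print(rows_to_columns(rows, ["_id", "title", "parsing_result.text",
                             "processing_result.hero"]))
```

A dictionary in this shape is what `Dataset.from_dict` expects, so it can be handed to the `datasets` library directly.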


In [10]:
column_names = ["_id", "title", "url", "parsing_result.text", "processing_result.hero", "processing_result.villain", "processing_result.victim"]
articles = convertListToDataset(articles, column_names)
describeDataset(articles)

Number of rows: 50
Column names: ['_id', 'title', 'url', 'text', 'hero', 'villain', 'victim']
Features (schema): {'_id': Value(dtype='string', id=None), 'title': Value(dtype='string', id=None), 'url': Value(dtype='string', id=None), 'text': Value(dtype='string', id=None), 'hero': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None), 'villain': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None), 'victim': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None)}


In [11]:
print("Example Article Text:", articles[42]["text"][:100])
print("Example Article Hero:", articles[42]["hero"])
print("Example Article Villain:", articles[42]["villain"])
print("Example Article Victim:", articles[42]["victim"])

Example Article Text:  January 19, 2016 at 10:46 am EST By Taegan Goddard 25 Comments Former Defense Secretary Robert Gate
Example Article Hero: ['Robert Gates']
Example Article Villain: []
Example Article Victim: ['Barack Obama']


Save dataset to disk:

In [12]:
articles.save_to_disk('../data/input/articles')

Saving the dataset (1/1 shards): 100%|██████████| 50/50 [00:00<00:00, 3634.45 examples/s]


***

## Prepare Dataset 

In [13]:
from langchain.text_splitter import RecursiveCharacterTextSplitter, CharacterTextSplitter
from datasets import Dataset, load_from_disk
from transformers import AutoTokenizer
from multiprocessing import Pool
from utils.preprocessing import *
from utils.database import *
from utils.files import *
import transformers

The code `os.environ["TOKENIZERS_PARALLELISM"] = "false"` disables parallel tokenization in HuggingFace's libraries. It's a way to suppress warnings and prevent potential issues tied to multi-core tokenization.
See: https://stackoverflow.com/questions/62691279/how-to-disable-tokenizers-parallelism-true-false-warning

In [15]:
import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"
transformers.utils.logging.set_verbosity_error()

### Import Raw Dataset

In [16]:
articles = load_from_disk('../data/input/articles')
describeDataset(articles)

Number of rows: 50
Column names: ['_id', 'title', 'url', 'text', 'hero', 'villain', 'victim']
Features (schema): {'_id': Value(dtype='string', id=None), 'title': Value(dtype='string', id=None), 'url': Value(dtype='string', id=None), 'text': Value(dtype='string', id=None), 'hero': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None), 'villain': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None), 'victim': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None)}


### Prepare Dataset

#### Define Prompt Template:

In [17]:
PROMPT_TEMPLATE = "In this article {hero} is portrayed as a hero, {villain} is portrayed as a villain, and {victim} is portrayed as a victim."

# Test the template with a dummy text
print(PROMPT_TEMPLATE.format(hero='Batman', villain='Joker', victim='Robin'))

In this article Batman is portrayed as a hero, Joker is portrayed as a villain, and Robin is portrayed as a victim.


#### Expand Dataset

For each article, a single prompt is built from the template: the first entity listed for each role ('hero', 'villain', 'victim') is inserted, and "no one" is used when a role has no entity. Each article therefore yields exactly one prompt, so 50 articles produce 50 prompts.

In [20]:
articles[42]

{'_id': '64d8eb39516b2658722931a5',
 'title': 'Gates Says Obama Has Too Many &#8216;Yes Men&#8217;',
 'url': 'https://politicalwire.com/2016/01/19/gates-says-obama-has-too-many-yes-men/',
 'text': " January 19, 2016 at 10:46 am EST By Taegan Goddard 25 Comments Former Defense Secretary Robert Gates suggested on MSNBC that President Obama has “centralized power and operational activities of the government in the White House to a degree that I think is unparalleled.” He added: “I don’t see the kind of strong people around the President who will push back on him. I will give him credit. I pushed back on him a lot and he never shut me down. He never told me to be quiet or refused to see me or anything like that. But I don’t see people around him like that now. “ Filed Under: Governing, White House Tagged With: Barack Obama, Robert Gates Donald Trump said on Truth Social that his lawyers would prefer that he didn’t have a Monday new conference to unveil his “irrefutable” evidence of fraud i

In [23]:
def expandRow(row, template):
    """
    Generate prompts based on various roles and text chunks from the input row.
    """

    # Extract values from the row
    heros = row.get("hero", [])
    villains = row.get("villain", [])
    victims = row.get("victim", [])

    hero = heros[0] if len(heros) > 0 else "no one"
    villain = villains[0] if len(villains) > 0 else "no one"
    victim = victims[0] if len(victims) > 0 else "no one"

    # Construct the prompt
    prompt = template.format(hero=hero, villain=villain, victim=victim)

    new_row = {
        **row,
        'prompt': prompt,
    }

    return new_row

In [26]:
example_row = articles[42]
example_row_exp = expandRow(example_row, PROMPT_TEMPLATE)

print("Example Prompt:", example_row_exp.get("prompt"))
print("Expanded row:", example_row_exp)

Example Prompt: In this article Robert Gates is portrayed as a hero, no one is portrayed as a villain, and Barack Obama is portrayed as a victim.
Expanded row: {'_id': '64d8eb39516b2658722931a5', 'title': 'Gates Says Obama Has Too Many &#8216;Yes Men&#8217;', 'url': 'https://politicalwire.com/2016/01/19/gates-says-obama-has-too-many-yes-men/', 'text': " January 19, 2016 at 10:46 am EST By Taegan Goddard 25 Comments Former Defense Secretary Robert Gates suggested on MSNBC that President Obama has “centralized power and operational activities of the government in the White House to a degree that I think is unparalleled.” He added: “I don’t see the kind of strong people around the President who will push back on him. I will give him credit. I pushed back on him a lot and he never shut me down. He never told me to be quiet or refused to see me or anything like that. But I don’t see people around him like that now. “ Filed Under: Governing, White House Tagged With: Barack Obama, Robert Gate

Construct the prompts for the whole dataset (a simple sequential loop is sufficient here):

In [27]:
PROMPT_TEMPLATE

'In this article {hero} is portrayed as a hero, {villain} is portrayed as a villain, and {victim} is portrayed as a victim.'

In [35]:
# Initialize a progress bar with the total number of tasks
pbar = tqdm(total=len(articles), desc="Constructing Prompts",
            position=0, leave=True)

results = []
for row in articles:
    result = expandRow(row, PROMPT_TEMPLATE)
    results.append(result)
    pbar.update()

pbar.close()

Constructing Prompts: 100%|██████████| 50/50 [00:00<00:00, 2954.44it/s]


In [37]:
dataset_hvv = results

print("Number of prompts:", len(dataset_hvv))

# Convert the list of dictionaries into a Dataset
dataset_hvv = Dataset.from_dict(
    {key: [dic[key] for dic in dataset_hvv] for key in dataset_hvv[0]})

Number of prompts: 50


In [39]:
dataset_hvv.save_to_disk('../data/input/articles_chunkified')
dataset = dataset_hvv

Saving the dataset (1/1 shards): 100%|██████████| 50/50 [00:00<00:00, 3534.13 examples/s]


***

## Make Predictions

In [49]:
from torch.utils.data import DataLoader, TensorDataset, SequentialSampler
from datasets import Dataset, load_from_disk, concatenate_datasets
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
from sentence_transformers import SentenceTransformer
from tqdm import tqdm
import threading
import torch
import pickle
import time
import copy

In [50]:
dataset = load_from_disk('../data/input/articles_chunkified')
print("Dataset length:", len(dataset))

Dataset length: 50


### Split Dataset

List information about the available GPUs:

In [51]:
gpu_info_list = listAvailableGPUs()

GPU 0:
  Name: Tesla P100-PCIE-16GB
  Memory: 16276.00 MiB
  Compute Capability: 6.0

GPU 1:
  Name: Tesla P100-PCIE-16GB
  Memory: 16276.00 MiB
  Compute Capability: 6.0



Determine the number of available GPUs:

In [52]:
num_gpus = torch.cuda.device_count()
print(f'Number of available GPUs: {num_gpus}')


Number of available GPUs: 2


In [53]:
# Split the dataset into chunks (one for each GPU)
chunks = splitDataset(dataset, num_chunks=num_gpus)

# Print the length of each chunk
print("Number of chunks:", len(chunks))
for i, chunk in enumerate(chunks):
    print(f"Chunk {i} length:", len(chunk))


Number of chunks: 2
Chunk 0 length: 25
Chunk 1 length: 25
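
The implementation of `splitDataset` is not shown in this notebook. A minimal sketch of one way to split a sequence into near-equal contiguous chunks, using the hypothetical name `split_into_chunks` and a plain list instead of a `Dataset`, could look like this:

```python
def split_into_chunks(items, num_chunks):
    """Split items into num_chunks contiguous parts whose sizes differ by at most one."""
    if num_chunks < 1:
        raise ValueError("num_chunks must be at least 1")
    base, extra = divmod(len(items), num_chunks)
    chunks = []
    start = 0
    for i in range(num_chunks):
        # The first `extra` chunks receive one additional item.
        size = base + (1 if i < extra else 0)
        chunks.append(items[start:start + size])
        start += size
    return chunks


print([len(chunk) for chunk in split_into_chunks(list(range(50)), 2)])  # [25, 25]
print([len(chunk) for chunk in split_into_chunks(list(range(5)), 2)])   # [3, 2]
```

Contiguous chunks keep the original row order, so concatenating the per-GPU results restores the dataset order.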


### Process Articles

Check GPU utilization:

In [54]:
!nvidia-smi

Sun Oct 22 10:46:26 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.125.06   Driver Version: 525.125.06   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla P100-PCIE...  Off  | 00000000:17:00.0 Off |                    0 |
| N/A   46C    P0    34W / 250W |    316MiB / 16384MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla P100-PCIE...  Off  | 00000000:65:00.0 Off |                    0 |
| N/A   47C    P0    36W / 250W |    910MiB / 16384MiB |      0%      Default |
|       

#### Parameters for Text Generation

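Each parameter influences text generation in a specific way. The parameters are listed below with a brief explanation. They apply to sequence-to-sequence generation; the embedding function in this notebook calls `SentenceTransformer.encode` and does not use them. The defaults quoted below appear to be project-specific settings and may differ from the library defaults.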

**`max_length`**:
* Sets the maximum number of tokens in the generated text (default is 50).
* Generation stops if the maximum length is reached before the model produces an EOS token.
* A higher `max_length` allows for longer generated texts but may increase the time and computational resources required.

**`min_length`**:
* Sets the minimum number of tokens in the generated text (default is 10).
* The EOS token is suppressed until this minimum length is reached, so generation cannot end earlier.

**`num_beams`**:
* In beam search, sets the number of "beams" or hypotheses to keep at each step (default is 4).
* A higher number of beams increases the chances of finding a good output but also increases the computational cost.

**`num_return_sequences`**:
* Specifies the number of independently computed sequences to return (default is 3).
* When using sampling, multiple different sequences are generated independently from each other.

**`early_stopping`**:
* Controls when beam search stops: if enabled, generation ends as soon as `num_beams` complete candidate sequences have been found, even if the predefined maximum length is not reached (default is True).
* A candidate is complete once it produces the EOS (End Of Sequence) token, often represented as `</s>`.

**`do_sample`**:
* Tokens are selected probabilistically based on their likelihood scores (default is True).
* Introduces randomness into the generation process for diverse outputs.
* The level of randomness is controlled by the 'temperature' parameter.

**`temperature`**:
* Adjusts the probability distribution used for sampling the next token (default is 0.7).
* Higher values make the generation more random, while lower values make it more deterministic.

**`top_k`**:
* Limits the number of tokens considered for sampling at each step to the top K most likely tokens (default is 50).
* Can make the generation process faster and more focused.

**`top_p`**:
* Also known as nucleus sampling, sets a cumulative probability threshold (default is 0.95).
* Tokens are sampled only from the smallest set whose cumulative probability exceeds this threshold.

**`repetition_penalty`**:
* Discourages the model from repeating the same token by modifying the token's score (default is 1.5).
* Values greater than 1.0 penalize repetitions, and values less than 1.0 encourage repetitions.
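
To show how `temperature`, `top_k`, and `top_p` interact, here is a simplified, library-independent sketch that filters a toy next-token distribution. The function `next_token_distribution` is hypothetical; real implementations work on tensors and may differ in details.

```python
import math


def next_token_distribution(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Return {token_index: probability} after temperature, top-k and top-p filtering."""
    # Temperature rescales the logits before the softmax.
    scaled = [logit / temperature for logit in logits]
    peak = max(scaled)
    weights = [math.exp(value - peak) for value in scaled]
    total = sum(weights)
    probs = [weight / total for weight in weights]

    # Rank tokens from most to least likely.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)

    # top-k: keep only the k most likely tokens (0 disables the filter).
    if top_k > 0:
        ranked = ranked[:top_k]
    kept_mass = sum(probs[i] for i in ranked)

    # top-p: keep the smallest prefix whose renormalized mass reaches top_p.
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i] / kept_mass
        if cumulative >= top_p:
            break

    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


logits = [2.0, 1.0, 0.0]
print(next_token_distribution(logits, top_k=2))
print(next_token_distribution(logits, top_p=0.5))
print(next_token_distribution(logits, temperature=0.5))
```

Sampling then draws the next token from the returned dictionary; lowering the temperature concentrates probability on the top token, while `top_k` and `top_p` remove the unlikely tail.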


In [55]:
def generatePredictions(process_id, dataset, device):
    """Generates predictions for a given dataset."""

    # Print some information about the process
    print(f"--------- Process {process_id:02} ---------")
    print(f"Dataset length: {len(dataset)}")
    print(f"Device: {device}")
    print(f"------------------------------")

    model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')
    model.eval()
    model.to(device)

    # Report the name of the device assigned to this process
    print("Device:", torch.cuda.get_device_name(device))

    dataset_full = copy.copy(dataset)

    # Make predictions
    predictions = []

    with torch.no_grad():
        for row in tqdm(dataset_full, desc="Articles"):
            embedding = model.encode(row.get("prompt"))
            predictions.append(embedding)

    # results.extend(predictions)

    print("GPU finished:", len(predictions))

    #Ensure the new column has the same number of items as the dataset
    assert len(dataset_full) == len(
        predictions), "The length of new_column_values must match the dataset's length"

    #print("Dataset length:", len(dataset_full))

    # Add new column
    predictions_list = [tensor.tolist() for tensor in predictions]
    dataset_full = dataset_full.add_column('answer', predictions_list)
    dataset_full.save_to_disk('data/output/articles_processed_' + str(process_id))

    return dataset_full
    

Start one thread per GPU before collecting and merging the results:

In [56]:
# One dataset chunk and one device name per available GPU
datasets = chunks
devices = [f"cuda:{i}" for i in range(num_gpus)]

# Calls the function to start the threads
returned_datasets = startThreads(len(datasets), datasets, devices, generatePredictions)
print("Number of returned datasets:", len(returned_datasets))

# Concatenate the returned datasets
merged_dataset = concatenate_datasets(returned_datasets)
merged_dataset.save_to_disk('../data/output/articles_processed')

# Print the length of the merged dataset
print("Processing on both GPUs completed!")
print("Results:", len(merged_dataset))

--------- Process 00 ---------
Dataset length: 25
Device: cuda:0
------------------------------
--------- Process 01 ---------
Dataset length: 25
Device: cuda:1
------------------------------
Device: Tesla P100-PCIE-16GB


Articles:   4%|▍         | 1/25 [00:00<00:08,  2.91it/s]

Device: Tesla P100-PCIE-16GB


Articles: 100%|██████████| 25/25 [00:01<00:00, 23.64it/s]


GPU finished: 25


Saving the dataset (1/1 shards): 100%|██████████| 25/25 [00:00<00:00, 2511.26 examples/s]
Articles: 100%|██████████| 25/25 [00:00<00:00, 28.73it/s]


GPU finished: 25


Saving the dataset (1/1 shards): 100%|██████████| 25/25 [00:00<00:00, 2880.55 examples/s]


Number of returned datasets: 2


Saving the dataset (1/1 shards): 100%|██████████| 50/50 [00:00<00:00, 2406.04 examples/s]

Processing on both GPUs completed!
Results: 50
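
The helper `startThreads` comes from the project's `utils` package and is not shown here. A minimal sketch of the same idea, one thread per device with results collected in chunk order, might look as follows; `run_on_devices` and `fake_worker` are hypothetical names, and the worker only labels items so the example runs without a GPU.

```python
import threading


def run_on_devices(datasets, devices, worker):
    """Run worker(process_id, dataset, device) in one thread per dataset."""
    results = [None] * len(datasets)

    def target(index):
        # Each thread writes to its own slot, so the output order is stable.
        results[index] = worker(index, datasets[index], devices[index])

    threads = [threading.Thread(target=target, args=(i,))
               for i in range(len(datasets))]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


def fake_worker(process_id, dataset, device):
    """Stand-in for the embedding function: tags every item with its device."""
    return [f"{device}:{item}" for item in dataset]


outputs = run_on_devices([[1, 2], [3]], ["cuda:0", "cuda:1"], fake_worker)
print(outputs)  # [['cuda:0:1', 'cuda:0:2'], ['cuda:1:3']]
```

Storing each result at its chunk index means the merged output does not depend on which thread finishes first.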





In [57]:
dataset = merged_dataset
dataset

Dataset({
    features: ['_id', 'title', 'url', 'text', 'hero', 'villain', 'victim', 'prompt', 'answer'],
    num_rows: 50
})

***

## Upload Results

In [109]:
from utils.preprocessing import *
from utils.database import *
from datasets import load_from_disk
from tqdm import tqdm

In [110]:
del dataset
dataset = load_from_disk('../data/output/articles_processed')
describeDataset(dataset)

Number of rows: 112
Column names: ['_id', 'title', 'url', 'text', 'hero', 'villain', 'victim', 'prompt', 'role', 'entity', 'input_ids', 'attention_mask', 'entity_tokenized', 'answer']
Features (schema): {'_id': Value(dtype='string', id=None), 'title': Value(dtype='string', id=None), 'url': Value(dtype='string', id=None), 'text': Value(dtype='string', id=None), 'hero': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None), 'villain': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None), 'victim': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None), 'prompt': Value(dtype='string', id=None), 'role': Value(dtype='string', id=None), 'entity': Value(dtype='string', id=None), 'input_ids': Sequence(feature=Value(dtype='int32', id=None), length=-1, id=None), 'attention_mask': Sequence(feature=Value(dtype='int8', id=None), length=-1, id=None), 'entity_tokenized': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None), 'answer': Sequenc

In [58]:
# Example answer
answer = dataset[0].get("answer")
print("Answer:", answer[:10])   

Answer: [-0.006319770589470863, 0.09470830112695694, 0.024422844871878624, 0.028843551874160767, -0.05869416147470474, 0.05381316319108009, 0.050233401358127594, -0.05282343551516533, -0.04206103831529617, -0.030477477237582207]
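
Embeddings like the one above are typically compared with cosine similarity. The following sketch uses plain Python lists and toy vectors; the function name `cosine_similarity` is an illustrative choice, not a call from this pipeline.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # 0.0 (orthogonal)
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # -1.0 (opposite)
```

Applied to two rows of the dataset, the inputs would be their `answer` lists; values close to 1 indicate prompts with similar hero, villain, and victim assignments.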


### Connect to Database

In [59]:
_, db = getConnection(use_dotenv=True)

### Update Documents in Database

In [60]:
for row in tqdm(dataset, desc="Uploading results"):
    # Column names match the dataset schema: '_id' and 'answer'
    object_id = row.get("_id")
    result = row.get("answer")
    #updateProcessingResults(
    #    db, object_id, {"sentence_embedding_result": result})

Uploading results: 100%|██████████| 50/50 [00:00<00:00, 728.56it/s]
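
Before enabling the database write, it can help to assemble and inspect the update payloads first. The sketch below is an assumption about how that could be done: `build_updates` is a hypothetical helper, and the target field name `sentence_embedding_result` is taken from the commented-out call above.

```python
def build_updates(rows, id_key="_id", value_key="answer",
                  target_field="sentence_embedding_result"):
    """Return (object_id, payload) pairs, skipping rows with missing values."""
    updates = []
    for row in rows:
        object_id = row.get(id_key)
        embedding = row.get(value_key)
        if object_id is None or embedding is None:
            continue  # never write a None embedding to the database
        updates.append((object_id, {target_field: embedding}))
    return updates


rows = [
    {"_id": "a1", "answer": [0.1, 0.2]},
    {"_id": "a2"},  # no embedding: skipped
]
print(build_updates(rows))
```

Each pair could then be passed to the project's own update function once the output looks correct.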
