# Stage 3: Entity Embeddings

We obtain embeddings for the disambiguated entities using the encoder-based `roberta-base` model. We use the same prompt template as in the disambiguation phase, so that each embedding is conditioned on the article text in which the entity originally appeared, but without the special tokens marking the entity. The embeddings of the tokens corresponding to the entity are then averaged to produce the final embedding. We use the following prompt template:

```text
Discussing ([HERO], [VILLAIN] or [VICTIM]): [TEXT]. 
```

## Fetch Articles

In [32]:
from utils.preprocessing import *
from utils.accelerators import *
from utils.multithreading import *
from utils.database import *
from utils.files import *
from datasets import Dataset
import random

### Connect to Database

Credentials are sourced from the `.env` file.

In [33]:
_, db = getConnection(use_dotenv=True)

### Query Database

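Fetches a limited number of articles (50 in this run) that have a `denoising_result` and a `parsing_result.text_length` below 10000, returning the `url`, `title`, `parsing_result.text`, and `denoising_result` fields. The filter that would skip articles that already have an `embedding_result` is commented out in the query.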

In [34]:
collection = "v2_sampled_articles"
fields = {"url": 1, "title": 1, "parsing_result.text": 1, "denoising_result": 1}
query = {
    "denoising_result": {"$exists": True},
    "parsing_result.text_length": {"$lt": 10000},
    #"embedding_result": {"$exists": False},
    #"initial_subsample": False
}
articles = fetchArticleTexts(db, 50, 0, fields, query, collection)

Example article:

In [35]:
example_article = random.choice(articles)
title = example_article.get("title")
text = example_article.get("parsing_result").get("text")
print(f"Title: {title}\nText: {text}")
print(f"Denoising Result: {example_article.get('denoising_result')}")


Title: Sarah Palin: Obama Is “Bending Over” For Iran
Text: January 14, 2016
Crackpots, Teabagistan “Here we go again, coddling the enemy and implementing rules of engagement that prohibit our military from doing its job. Iran captures our warriors before forcing the U.S. to apologize for the ENEMY’S bullying and endangerment, all to send a (false) message that America is now so weak we’ve fallen insurmountably far under the enemy. As Sen. McCain just said on Fox News, Obama spokesman Josh Ernest is an ‘idiot’ and knows not of what he speaks as actual APPRECIATION is expressed by the Leftist White House for Iran’s combatant actions. “To praise the enemy for humiliating our military should make all reasonable, pro-‘Peace through Strength’ Americans agree with what Dean Cain said on Fox this morning: ‘This makes me sick to my stomach.’ Then went on to suggest reasons many of America’s finest are leaving the military. No question, they’re leaving due to the Left’s pussy-footing around per 


Processes the 'parsing_result' of each article to clean the text, and filters out articles 
that lack a 'title' or 'parsing_result'.


In [36]:
# Basic text cleaning, e.g. removing newlines, tabs, etc.
articles = cleanArticles(articles)

Cleaning articles: 100%|██████████| 50/50 [00:00<00:00, 1701.12it/s]


In [37]:
# Filter out articles with no title or no parsing result 
articles = [article for article in articles if article.get(
    "title", "") and article.get("parsing_result", "")]

print("Number of articles:", len(articles))

Number of articles: 50


### Export as JSON

Saves the given data to a JSON file for optional visual inspection.

In [38]:
exportAsJSON("../data/input/articles.json",  articles)

### Convert to HF Dataset

Convert article IDs to strings and transform a list of articles into a dataset with fields: id, title, url, and text extracted from parsing results. The HuggingFace `datasets` library provides several key advantages over plain JSON files:

- **Efficiency**: The datasets are memory-mapped, allowing you to work with data that's larger than your available RAM without loading the entire dataset into memory. 
- **Speed**: Datasets in the HuggingFace format (which is Arrow-based) can be loaded faster than large JSON files, facilitating quicker data operations.
- **Columnar Storage**: By using Apache Arrow for storage, HuggingFace datasets benefit from a columnar format that ensures more efficient serialization and deserialization compared to row-based storage, such as JSON.
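
The conversion itself is done by the project's `convertListToDataset` helper, whose implementation is not shown in this notebook. As a rough sketch of the idea, assuming dotted column names are resolved as paths into the nested documents and the last path segment becomes the column name (the functions `get_path` and `to_columns` are hypothetical):

```python
from typing import Any


def get_path(doc: dict, dotted: str, default: Any = None) -> Any:
    """Follow a dotted path such as 'parsing_result.text' into a nested dict."""
    current: Any = doc
    for key in dotted.split("."):
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


def to_columns(docs: list[dict], column_names: list[str]) -> dict[str, list]:
    """Build a column-oriented dict; each column is named after the last path segment."""
    columns: dict[str, list] = {}
    for name in column_names:
        short = name.split(".")[-1]
        columns[short] = [get_path(doc, name) for doc in docs]
    # IDs are stringified so that every value in the column has one type.
    if "_id" in columns:
        columns["_id"] = [str(value) for value in columns["_id"]]
    return columns


docs = [{
    "_id": 1,
    "title": "Example",
    "parsing_result": {"text": "Body"},
    "denoising_result": {"hero": ["A"], "villain": [], "victim": ["B"]},
}]
columns = to_columns(docs, ["_id", "title", "parsing_result.text", "denoising_result.hero"])
```

A dict of this shape can then be handed to `datasets.Dataset.from_dict` to obtain an Arrow-backed dataset.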


In [39]:
column_names = ["_id", "title", "url", "parsing_result.text", "denoising_result.hero", "denoising_result.villain", "denoising_result.victim"]
articles = convertListToDataset(articles, column_names)
describeDataset(articles)

Number of rows: 50
Column names: ['_id', 'title', 'url', 'text', 'hero', 'villain', 'victim']
Features (schema): {'_id': Value(dtype='string', id=None), 'title': Value(dtype='string', id=None), 'url': Value(dtype='string', id=None), 'text': Value(dtype='string', id=None), 'hero': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None), 'villain': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None), 'victim': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None)}


In [40]:
print("Example Article Text:", articles[42]["text"][:100])
print("Example Article Hero:", articles[42]["hero"])
print("Example Article Villain:", articles[42]["villain"])
print("Example Article Victim:", articles[42]["victim"])

Example Article Text:  January 19, 2016 at 10:46 am EST By Taegan Goddard 25 Comments Former Defense Secretary Robert Gate
Example Article Hero: ['Robert Gates']
Example Article Villain: []
Example Article Victim: ['Barack Obama']


Save dataset to disk:

In [41]:
articles.save_to_disk('../data/input/articles')

Saving the dataset (1/1 shards): 100%|██████████| 50/50 [00:00<00:00, 3649.57 examples/s]


***

## Prepare Dataset 

In [42]:
from langchain.text_splitter import RecursiveCharacterTextSplitter, CharacterTextSplitter
from datasets import Dataset, load_from_disk
from transformers import AutoTokenizer
from multiprocessing import Pool
from utils.preprocessing import *
from utils.database import *
from utils.files import *
import transformers

The code `os.environ["TOKENIZERS_PARALLELISM"] = "false"` disables parallel tokenization in HuggingFace's libraries. It's a way to suppress warnings and prevent potential issues tied to multi-core tokenization.
See: https://stackoverflow.com/questions/62691279/how-to-disable-tokenizers-parallelism-true-false-warning

In [43]:
import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"
transformers.utils.logging.set_verbosity_error()

### Import Raw Dataset

In [44]:
articles = load_from_disk('../data/input/articles')
describeDataset(articles)

Number of rows: 50
Column names: ['_id', 'title', 'url', 'text', 'hero', 'villain', 'victim']
Features (schema): {'_id': Value(dtype='string', id=None), 'title': Value(dtype='string', id=None), 'url': Value(dtype='string', id=None), 'text': Value(dtype='string', id=None), 'hero': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None), 'villain': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None), 'victim': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None)}


### Prepare Dataset

#### Define Prompt Template:

In [61]:
PROMPT_TEMPLATE = "Discussing {entity} . {article}"

# Test the template with a dummy text
print(PROMPT_TEMPLATE.format(entity='Donald Trump',
      article='Lorem ipsum dolor sit amet, consectetur adipiscing elit.'))


Discussing Donald Trump . Lorem ipsum dolor sit amet, consectetur adipiscing elit.


#### Expand Dataset

Load the tokenizer and measure how many tokens the prompt template itself consumes, which determines how much of the input window remains for the article text:

In [62]:
# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
print("Input window length:", tokenizer.model_max_length)

Input window length: 512


In [63]:
template_length = calcInputLength(tokenizer, PROMPT_TEMPLATE.format(entity='Donald Trump', article=' '))
print("Max length of empty prompt template:", template_length)

Max length of empty prompt template: 9


For each article, one prompt is generated per entity listed under `hero`, `villain`, and `victim`. `expandRow` below inserts the full article text into each prompt and does not split it; prompts that exceed the model's input window are truncated during tokenization (`truncation=True`). In this run, the 50 articles yield 112 prompts.
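
A minimal sketch of how an over-long article could be divided into chunks that fit the input window, should that be preferred over truncation. It assumes whitespace-separated words as a crude stand-in for real tokens (a BPE tokenizer usually produces more tokens than words), and it reuses the 9-token template length measured above, which was obtained with a two-word entity and grows for longer names. The function `chunk_words` is hypothetical.

```python
def chunk_words(text: str, budget: int) -> list[str]:
    """Split text into consecutive chunks of at most `budget` whitespace-separated words."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    words = text.split()
    return [" ".join(words[i:i + budget]) for i in range(0, len(words), budget)]


# 512-token window minus the 9 tokens measured for the filled-in empty template.
WINDOW, TEMPLATE_TOKENS = 512, 9
budget = WINDOW - TEMPLATE_TOKENS

# Synthetic article of 1200 words, used only to exercise the function.
article = " ".join(f"w{i}" for i in range(1200))
chunks = chunk_words(article, budget)

# One prompt per chunk for a given entity.
prompts = ["Discussing {entity} . {article}".format(entity="Robert Gates", article=c)
           for c in chunks]
```

For an exact fit, the word count would have to be replaced by the length of the tokenizer's output for each candidate chunk.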

In [64]:
articles[42]

{'_id': '64d8eb39516b2658722931a5',
 'title': 'Gates Says Obama Has Too Many &#8216;Yes Men&#8217;',
 'url': 'https://politicalwire.com/2016/01/19/gates-says-obama-has-too-many-yes-men/',
 'text': " January 19, 2016 at 10:46 am EST By Taegan Goddard 25 Comments Former Defense Secretary Robert Gates suggested on MSNBC that President Obama has “centralized power and operational activities of the government in the White House to a degree that I think is unparalleled.” He added: “I don’t see the kind of strong people around the President who will push back on him. I will give him credit. I pushed back on him a lot and he never shut me down. He never told me to be quiet or refused to see me or anything like that. But I don’t see people around him like that now. “ Filed Under: Governing, White House Tagged With: Barack Obama, Robert Gates Donald Trump said on Truth Social that his lawyers would prefer that he didn’t have a Monday new conference to unveil his “irrefutable” evidence of fraud i

In [65]:
def expandRow(row, template, col_name="text", roles=['hero', 'villain', 'victim']):
    """
    Generate prompts based on various roles and text chunks from the input row.
    """
    prompts = []

    # Generate prompts for each role and text chunk
    for role in roles:
        for entity in row.get(role, []):
            text = row.get(col_name, "")
            prompt = template.format(entity=entity, article=text)
            new_row = {
                **row,
                'prompt': prompt,
                'role': role,
                'entity': entity,
            }
            prompts.append(new_row)
            
    return prompts

In [66]:
roles=['hero', 'villain', 'victim']
col_name="text"

example_row = articles[42]
example_row_exp = expandRow(example_row, PROMPT_TEMPLATE, col_name, roles)

print("Example Prompt:", example_row_exp[0].get("prompt"))
print("Expanded row:", example_row_exp)
print("Expanded row length:", len(example_row_exp))

Example Prompt: Discussing Robert Gates .  January 19, 2016 at 10:46 am EST By Taegan Goddard 25 Comments Former Defense Secretary Robert Gates suggested on MSNBC that President Obama has “centralized power and operational activities of the government in the White House to a degree that I think is unparalleled.” He added: “I don’t see the kind of strong people around the President who will push back on him. I will give him credit. I pushed back on him a lot and he never shut me down. He never told me to be quiet or refused to see me or anything like that. But I don’t see people around him like that now. “ Filed Under: Governing, White House Tagged With: Barack Obama, Robert Gates Donald Trump said on Truth Social that his lawyers would prefer that he didn’t have a Monday new conference to unveil his “irrefutable” evidence of fraud in the 2020 Georgia… “The current thinking among Donald Trump’s campaign advisers and those close to the former president is that he is not planning on parti

Process the dataset using multiple processes:

In [67]:
PROMPT_TEMPLATE

'Discussing {entity} . {article}'

In [68]:
num_processes = 12
params = (PROMPT_TEMPLATE, col_name, roles,)
dataset_hvv = processDataset(articles, num_processes, expandRow, params)


In [69]:
dataset_hvv.save_to_disk('../data/input/articles_chunkified')

Saving the dataset (1/1 shards): 100%|██████████| 112/112 [00:00<00:00, 6025.45 examples/s]


***

## Tokenize Dataset

Tokenization refers to the process of converting input text into smaller units, such as words or subwords, which are then represented as tokens. These tokens are mapped to indices in a vocabulary that the model can understand. Hugging Face provides a variety of tokenizers, each suited for different types of models. For instance, the BertTokenizer is designed for BERT-like models and tokenizes text into wordpieces. Similarly, the GPT2Tokenizer is tailored for GPT-2-like models and tokenizes text into subwords using the Byte-Pair Encoding (BPE) algorithm. 

#### Parameters

Below are descriptions of key parameters helpful for using these tokenizers:

**`add_special_tokens`**:
* Whether to add the model's special tokens, such as `[CLS]` and `[SEP]` for BERT or `<s>` and `</s>` for RoBERTa (default is True).
* Special tokens are necessary for some models to function properly.

**`max_length`**:
* The maximum number of tokens for the output (default is None, in which case the tokenizer falls back to the model's maximum input length, 512 for `roberta-base`, whenever truncation or padding needs a length).
* Texts longer than this are truncated only if `truncation` is enabled.

**`padding`**:
* Whether and how to pad the output (default is False).
* Options include `'max_length'`, and `'longest'` or `True` to pad to the length of the longest sequence in the batch.

**`truncation`**:
* Whether to truncate sequences to `max_length` (default is False).

**`return_tensors`**:
* The framework to use for the returned tensors: `'pt'` for PyTorch, `'tf'` for TensorFlow, or `'np'` for NumPy (default is None, which returns plain lists).

**`return_token_type_ids`**:
* Whether to return token type IDs (default is None, which defers to the tokenizer's own default).
* Necessary for some models to understand the different segments of input (e.g., question vs answer).

**`return_attention_mask`**:
* Whether to return the attention mask (default is None, which defers to the tokenizer's own default).
* Attention masks tell the model which tokens to pay attention to and which to ignore.

**`verbose`**:
* Whether to log information during tokenization (default is True).

**`is_split_into_words`**:
* Whether the input is pre-tokenized into words (default is False).

These parameters allow for fine-grained control over the tokenization process, ensuring the text is prepared in a way that's suitable for your model and task.
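
To make the interplay of `truncation`, `padding="max_length"`, and the attention mask concrete, here is a simplified sketch in plain Python. It is not the tokenizer's actual implementation: the function `pad_and_mask` is hypothetical, the token IDs in the example are arbitrary, and a real tokenizer keeps the closing special token when truncating, which this sketch does not.

```python
PAD_ID = 1  # RoBERTa's <pad> token ID; other models use different IDs


def pad_and_mask(input_ids: list[int], max_length: int, pad_id: int = PAD_ID):
    """Truncate or right-pad `input_ids` to `max_length` and build the attention mask."""
    ids = input_ids[:max_length]          # truncation: cut off everything past max_length
    mask = [1] * len(ids)                 # 1 marks a real token
    padding = max_length - len(ids)       # number of pad positions to append
    return ids + [pad_id] * padding, mask + [0] * padding  # 0 marks padding


ids, mask = pad_and_mask([0, 9064, 2], max_length=6)
print(ids)   # [0, 9064, 2, 1, 1, 1]
print(mask)  # [1, 1, 1, 0, 0, 0]
```

Padding every prompt to the same length is what later allows the examples to be stacked into fixed-size batches.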

For more information, consider checking the [`encode` and `encode_plus` methods documentation](https://huggingface.co/transformers/main_classes/tokenizer.html).


In [70]:
# Parameters passed to the tokenizer
tokenizer_params = {"truncation": True, "is_split_into_words": False,
                    "add_special_tokens": True, "padding": "max_length"}

# Parameters passed to the tokenization function
params = {"tokenizer": tokenizer, "col_name": "prompt", "params": tokenizer_params}

# Tokenize the dataset
tokenized_dataset = dataset_hvv.map(tokenizeInputs, fn_kwargs=params)

Map:   0%|          | 0/112 [00:00<?, ? examples/s]

Map: 100%|██████████| 112/112 [00:00<00:00, 299.78 examples/s]


In [71]:
def tokenizeEntity(example):
    """Tokenize the entity name and store its token IDs alongside the example."""

    # The leading space makes the BPE tokens match the entity as it appears
    # mid-sentence in the prompt ("Discussing <entity> ...").
    tokenized_entity = tokenizer(
        " " + example["entity"], add_special_tokens=False, padding="max_length")

    # Combine the original data with the tokenized entity (padded to a fixed length)
    example.update({'entity_tokenized': tokenized_entity["input_ids"]})
    return example

In [72]:
tokenized_dataset = tokenized_dataset.map(tokenizeEntity, desc="Tokenizing Entity", num_proc=num_processes)

Tokenizing Entity (num_proc=12): 100%|██████████| 112/112 [00:00<00:00, 264.22 examples/s]


In [73]:
tokenized_dataset.save_to_disk('../data/input/articles_tokenized')

Saving the dataset (1/1 shards): 100%|██████████| 112/112 [00:00<00:00, 3677.57 examples/s]


In [76]:
dataset = tokenized_dataset

***

## Make Predictions

In [83]:
from torch.utils.data import DataLoader, TensorDataset, SequentialSampler
from datasets import Dataset, load_from_disk, concatenate_datasets
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
from transformers import RobertaModel, RobertaTokenizer
from tqdm import tqdm
import threading
import torch
import pickle
import time
import copy

In [78]:
dataset = load_from_disk('../data/input/articles_tokenized')
print("Dataset length:", len(dataset))

Dataset length: 112


### Split Dataset

List information about the available GPUs:

In [79]:
gpu_info_list = listAvailableGPUs()

GPU 0:
  Name: Tesla P100-PCIE-16GB
  Memory: 16276.00 MiB
  Compute Capability: 6.0

GPU 1:
  Name: Tesla P100-PCIE-16GB
  Memory: 16276.00 MiB
  Compute Capability: 6.0



Determine the number of available GPUs:

In [80]:
num_gpus = torch.cuda.device_count()
print(f'Number of available GPUs: {num_gpus}')


Number of available GPUs: 2


In [81]:
# Split the dataset into chunks (one for each GPU)
chunks = splitDataset(dataset, num_chunks=num_gpus)

# Print the length of each chunk
print("Number of chunks:", len(chunks))
for i, chunk in enumerate(chunks):
    print(f"Chunk {i} length:", len(chunk))


Number of chunks: 2
Chunk 0 length: 56
Chunk 1 length: 56
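
The `splitDataset` helper comes from the project's utilities and is not shown here. A minimal sketch of an even, order-preserving split, written for a plain list and assuming that contiguous chunks are what is wanted (the function `split_evenly` is hypothetical):

```python
def split_evenly(items: list, num_chunks: int) -> list[list]:
    """Split `items` into `num_chunks` contiguous parts whose sizes differ by at most one."""
    if num_chunks <= 0:
        raise ValueError("num_chunks must be positive")
    base, extra = divmod(len(items), num_chunks)
    chunks, start = [], 0
    for i in range(num_chunks):
        # The first `extra` chunks receive one additional item.
        size = base + (1 if i < extra else 0)
        chunks.append(items[start:start + size])
        start += size
    return chunks


parts = split_evenly(list(range(112)), 2)
print([len(p) for p in parts])  # [56, 56]
```

Keeping the chunks contiguous means that concatenating the per-GPU results restores the original row order.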


### Process Articles

Check GPU utilization:

In [82]:
!nvidia-smi

Sun Oct 22 10:20:28 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.125.06   Driver Version: 525.125.06   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla P100-PCIE...  Off  | 00000000:17:00.0 Off |                    0 |
| N/A   46C    P0    34W / 250W |  10758MiB / 16384MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla P100-PCIE...  Off  | 00000000:65:00.0 Off |                    0 |
| N/A   47C    P0    36W / 250W |  10596MiB / 16384MiB |      0%      Default |
|       

#### Parameters for Text Generation

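The following parameters control text generation with generative models. The embedding step in this notebook runs a plain forward pass through the `RobertaModel` encoder and passes none of them, so they are listed for reference only. Each parameter influences generation in a specific way: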

**`max_length`**:
* Sets the maximum number of tokens in the generated text (default is 50).
* Generation stops if the maximum length is reached before the model produces an EOS token.
* A higher `max_length` allows for longer generated texts but may increase the time and computational resources required.

**`min_length`**:
* Sets the minimum number of tokens in the generated text (default is 10).
* Generation continues until this minimum length is reached even if an EOS token is produced.

**`num_beams`**:
* In beam search, sets the number of "beams" or hypotheses to keep at each step (default is 4).
* A higher number of beams increases the chances of finding a good output but also increases the computational cost.

**`num_return_sequences`**:
* Specifies the number of independently computed sequences to return (default is 3).
* When using sampling, multiple different sequences are generated independently from each other.

**`early_stopping`**:
* In beam search, stops generation as soon as `num_beams` finished candidates are available instead of continuing to search for better ones (default is True).
* A candidate counts as finished once it produces the EOS (end-of-sequence) token, often represented as `</s>`.

**`do_sample`**:
* Tokens are selected probabilistically based on their likelihood scores (default is True).
* Introduces randomness into the generation process for diverse outputs.
* The level of randomness is controlled by the 'temperature' parameter.

**`temperature`**:
* Adjusts the probability distribution used for sampling the next token (default is 0.7).
* Higher values make the generation more random, while lower values make it more deterministic.

**`top_k`**:
* Limits the number of tokens considered for sampling at each step to the top K most likely tokens (default is 50).
* Can make the generation process faster and more focused.

**`top_p`**:
* Also known as nucleus sampling, sets a cumulative probability threshold (default is 0.95).
* Tokens are sampled only from the smallest set whose cumulative probability exceeds this threshold.

**`repetition_penalty`**:
* Discourages the model from repeating the same token by modifying the token's score (default is 1.5).
* Values greater than 1.0 penalize repetitions, and values less than 1.0 encourage repetitions.
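
As an illustration of how `temperature`, `top_k`, and `top_p` reshape the next-token distribution, here is a simplified sketch in plain Python. It is not the library's implementation: the function `filter_distribution` and the toy logits are hypothetical, and the filters are applied in the order temperature, top-k, top-p with a renormalization after each step.

```python
import math


def filter_distribution(logits: dict[str, float], temperature: float = 1.0,
                        top_k: int = 0, top_p: float = 1.0) -> dict[str, float]:
    """Apply temperature, then top-k, then top-p filtering; return renormalized probabilities."""
    # Temperature scaling followed by a numerically stable softmax.
    scaled = {tok: value / temperature for tok, value in logits.items()}
    peak = max(scaled.values())
    exps = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(exps.values())
    ranked = sorted(((tok, e / total) for tok, e in exps.items()),
                    key=lambda pair: pair[1], reverse=True)

    # top-k: keep only the k most likely tokens (0 disables the filter).
    if top_k > 0:
        ranked = ranked[:top_k]
        mass = sum(prob for _, prob in ranked)
        ranked = [(tok, prob / mass) for tok, prob in ranked]

    # top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for tok, prob in ranked:
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= top_p:
            break

    norm = sum(prob for _, prob in kept)
    return {tok: prob / norm for tok, prob in kept}


toy_logits = {"a": 2.0, "b": 1.0, "c": 0.0, "d": -1.0}
candidates = filter_distribution(toy_logits, top_k=3, top_p=0.9)
sharpened = filter_distribution(toy_logits, temperature=0.5)
```

With these toy values, `candidates` retains only the two most likely tokens, and `sharpened` assigns more probability to the top token than the unscaled distribution does.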


In [89]:
def generatePredictions(process_id, dataset, device):
    """Generates predictions for a given dataset."""

    # Print some information about the process
    print(f"--------- Process {process_id:02} ---------")
    print(f"Dataset length: {len(dataset)}")
    print(f"Device: {device}")
    print(f"------------------------------")

    tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
    model = RobertaModel.from_pretrained("roberta-base")
    model.eval()
    model.to(device)

    print("Device:", torch.cuda.get_device_name(device))

    dataset_full = copy.copy(dataset)
    dataset_full.set_format(type='torch', columns=[
        'input_ids', 'attention_mask', "entity_tokenized"])

    # Create dataloader without explicit sampler for sequential loading
    BATCH_SIZE = 256
    dataloader = DataLoader(
        dataset_full, batch_size=BATCH_SIZE, shuffle=False)

    # Make predictions
    predictions = []

    with torch.no_grad():
        for batch in tqdm(dataloader, desc="Batches"):
            batch = {k: v.to(device) for k, v in batch.items()}

            # Get embeddings
            outputs = model(
                input_ids=batch['input_ids'], attention_mask=batch['attention_mask'])
            embeddings = outputs.last_hidden_state

            # Optional: If you want average embeddings for each sequence
            # average_embeddings = embeddings.mean(dim=1)

            # The entity's token IDs were precomputed in `tokenizeEntity`; they are
            # used here only to determine how many tokens the entity spans.
            batch_name_embeddings = []
            for sequence_embeddings, input_ids_sequence, input_ids_entity in zip(embeddings, batch['input_ids'], batch['entity_tokenized']):

                # Remove padding (1 is RoBERTa's <pad> token ID)
                true_name_ids = input_ids_entity[input_ids_entity != 1]

                # The entity starts at a fixed offset because every prompt begins
                # with the same prefix: <s> followed by the tokens of "Discussing"
                start_idx = 3

                # Extract the embeddings corresponding to the person's name for the current sequence
                current_name_embeddings = sequence_embeddings[start_idx:start_idx+len(
                    true_name_ids), :]

                # Average the entity's token embeddings into a single vector
                average_name_embedding = current_name_embeddings.mean(dim=0)
                batch_name_embeddings.append(average_name_embedding)

            # Move the embeddings back to the CPU to free GPU memory
            tensor_name_embeddings = [embedding.cpu()
                                      for embedding in batch_name_embeddings]
            predictions.extend(tensor_name_embeddings)

            torch.cuda.empty_cache()

    print("GPU finished:", len(predictions))

    # Ensure there is exactly one embedding per dataset row
    assert len(dataset_full) == len(
        predictions), "The number of predictions must match the dataset's length"

    # Add the embeddings as a new column and save this process's share
    predictions_list = [tensor.tolist() for tensor in predictions]
    dataset_full = dataset_full.add_column('answer', predictions_list)
    dataset_full.save_to_disk('data/output/articles_processed_' + str(process_id))

    return dataset_full
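
The function above assumes that the entity tokens always begin at position 3. A more defensive variant, sketched here in plain Python on lists rather than tensors, locates the entity's token IDs inside the prompt's `input_ids` and averages the matching embedding rows. The helpers `find_span` and `mean_pool` are hypothetical and not part of the notebook's utilities, and the IDs and vectors are toy values.

```python
def find_span(sequence: list[int], pattern: list[int]) -> int:
    """Return the index of the first occurrence of `pattern` in `sequence`, or -1."""
    if not pattern:
        return -1
    last_start = len(sequence) - len(pattern)
    for start in range(last_start + 1):
        if sequence[start:start + len(pattern)] == pattern:
            return start
    return -1


def mean_pool(rows: list[list[float]]) -> list[float]:
    """Average a non-empty list of equal-length vectors dimension by dimension."""
    return [sum(column) / len(rows) for column in zip(*rows)]


# Toy example: 6 token positions with 2-dimensional "embeddings".
input_ids = [0, 11, 12, 21, 22, 2]
entity_ids = [21, 22]
hidden = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 5.0], [5.0, 7.0], [9.0, 9.0]]

start = find_span(input_ids, entity_ids)
entity_embedding = mean_pool(hidden[start:start + len(entity_ids)])
print(start, entity_embedding)  # 3 [4.0, 6.0]
```

A return value of -1 signals that the entity was not found, for example because the tokenization of the entity in isolation differs from its tokenization in context; such rows would need separate handling.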
    

Start one thread per GPU before collecting and merging the results:

In [90]:
# One dataset chunk per GPU
datasets = chunks
devices = [f"cuda:{i}" for i in range(num_gpus)]

# Calls the function to start the threads
returned_datasets = startThreads(len(datasets), datasets, devices, generatePredictions)
print("Number of returned datasets:", len(returned_datasets))

# Concatenate the returned datasets
merged_dataset = concatenate_datasets(returned_datasets)
merged_dataset.save_to_disk('../data/output/articles_processed')

# Print the length of the merged dataset
print("Processing on both GPUs completed!")
print("Results:", len(merged_dataset))

--------- Process 00 ---------
Dataset length: 56
Device: cuda:0
------------------------------
--------- Process 01 ---------
Dataset length: 56
Device: cuda:1
------------------------------
Device: Tesla P100-PCIE-16GB


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Device: Tesla P100-PCIE-16GB


Batches: 100%|██████████| 1/1 [00:01<00:00,  1.17s/it]


GPU finished: 56


Saving the dataset (1/1 shards): 100%|██████████| 56/56 [00:00<00:00, 6428.58 examples/s]
Batches: 100%|██████████| 1/1 [00:01<00:00,  1.16s/it]


GPU finished: 56


Saving the dataset (1/1 shards): 100%|██████████| 56/56 [00:00<00:00, 4727.21 examples/s]


Number of returned datasets: 2


Saving the dataset (1/1 shards): 100%|██████████| 112/112 [00:00<00:00, 3239.43 examples/s]

Processing on both GPUs completed!
Results: 112
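
`startThreads` is a project utility whose code is not part of this notebook. One way such a helper could be written, assuming one worker thread per device and results returned in input order, is sketched below with the standard library. The names `run_per_device` and `fake_worker` are hypothetical, and the worker is a stand-in that needs no GPU.

```python
from concurrent.futures import ThreadPoolExecutor


def run_per_device(datasets, devices, worker):
    """Run `worker(process_id, dataset, device)` in one thread per device and
    return the results in the same order as `datasets`."""
    if len(datasets) != len(devices):
        raise ValueError("need exactly one device per dataset")
    with ThreadPoolExecutor(max_workers=len(devices)) as pool:
        futures = [pool.submit(worker, i, ds, dev)
                   for i, (ds, dev) in enumerate(zip(datasets, devices))]
        # result() blocks until the thread finishes and re-raises its exceptions.
        return [future.result() for future in futures]


def fake_worker(process_id, dataset, device):
    # Stand-in for generatePredictions: tags each item with its device.
    return [(device, item) for item in dataset]


results = run_per_device([[1, 2], [3]], ["cuda:0", "cuda:1"], fake_worker)
merged = [row for part in results for row in part]
print(merged)  # [('cuda:0', 1), ('cuda:0', 2), ('cuda:1', 3)]
```

Threads are sufficient here under the assumption that the heavy work happens in GPU kernels, during which the Python interpreter lock is released.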





In [91]:
dataset = merged_dataset
dataset

Dataset({
    features: ['_id', 'title', 'url', 'text', 'hero', 'villain', 'victim', 'prompt', 'role', 'entity', 'input_ids', 'attention_mask', 'entity_tokenized', 'answer'],
    num_rows: 112
})

***

## Upload Results

In [109]:
from utils.preprocessing import *
from utils.database import *
from datasets import load_from_disk
from tqdm import tqdm

In [110]:
del dataset
dataset = load_from_disk('../data/output/articles_processed')
describeDataset(dataset)

Number of rows: 112
Column names: ['_id', 'title', 'url', 'text', 'hero', 'villain', 'victim', 'prompt', 'role', 'entity', 'input_ids', 'attention_mask', 'entity_tokenized', 'answer']
Features (schema): {'_id': Value(dtype='string', id=None), 'title': Value(dtype='string', id=None), 'url': Value(dtype='string', id=None), 'text': Value(dtype='string', id=None), 'hero': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None), 'villain': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None), 'victim': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None), 'prompt': Value(dtype='string', id=None), 'role': Value(dtype='string', id=None), 'entity': Value(dtype='string', id=None), 'input_ids': Sequence(feature=Value(dtype='int32', id=None), length=-1, id=None), 'attention_mask': Sequence(feature=Value(dtype='int8', id=None), length=-1, id=None), 'entity_tokenized': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None), 'answer': Sequenc

In [103]:
# Example answer
answer = dataset[0].get("answer")
role = dataset[0].get("role")
print("Role:", role)
print("Answer:", answer[:10])   

Role: None
Answer: tensor([-0.1185, -0.0948, -0.0394, -0.1386,  0.6860,  0.4518, -0.1260,  0.1470,
        -0.0445,  0.0749])


### Connect to Database

In [96]:
_, db = getConnection(use_dotenv=True)

### Update Documents in Database

In [99]:
def processResults(dataset):
    # Initial processing results
    processing_result = {"hero": [], "villain": [], "victim": []}
    object_id_prev = None

    for item in dataset:
        object_id = item['_id']
        role = item['role']
        answer = item['answer']

        # If the object_id changes, reset the processing_result
        if object_id_prev is not None and object_id_prev != object_id:
            yield object_id_prev, processing_result
            processing_result = {"hero": [], "villain": [], "victim": []}

        processing_result[role].append(answer)
        object_id_prev = object_id

    # Yield the final processing_result if any
    if processing_result["hero"] or processing_result["villain"] or processing_result["victim"]:
        yield object_id_prev, processing_result
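
`processResults` relies on all rows of one article being adjacent in the dataset. Under the same assumption, the grouping can be expressed more compactly with `itertools.groupby`; the following sketch uses plain dicts in place of dataset rows, and `group_by_article` is a hypothetical name.

```python
from itertools import groupby
from operator import itemgetter

ROLES = ("hero", "villain", "victim")


def group_by_article(rows):
    """Yield (article_id, {role: [answers]}) for consecutive rows sharing an '_id'."""
    for article_id, group in groupby(rows, key=itemgetter("_id")):
        result = {role: [] for role in ROLES}
        for row in group:
            result[row["role"]].append(row["answer"])
        yield article_id, result


rows = [
    {"_id": "a", "role": "hero", "answer": [0.1]},
    {"_id": "a", "role": "victim", "answer": [0.2]},
    {"_id": "b", "role": "villain", "answer": [0.3]},
]
grouped = list(group_by_article(rows))
```

If rows of the same article were not adjacent, both this sketch and `processResults` would emit that article more than once, so sorting by `_id` beforehand would be required.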

In [100]:
# Count the distinct article IDs so the progress bar has a total
unique_ids = set(dataset["_id"])
count_unique_ids = len(unique_ids)

for object_id, result in tqdm(processResults(dataset), total=count_unique_ids, desc="Uploading results"):
    pass
    #updateProcessingResults(db, object_id, {"embedding_result": result})

Uploading results:   0%|          | 0/50 [00:00<?, ?it/s]


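KeyError: '_id'

This `KeyError`, like the `Role: None` output and the tensor-valued `answer` shown earlier, is consistent with the saved dataset still carrying the `torch` format restricted to `input_ids`, `attention_mask`, and `entity_tokenized` that was set in `generatePredictions`. Calling `dataset.reset_format()` after loading should make `_id` and `role` accessible again when iterating over rows.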