# Role Labeling

One way to identify narratives in newspaper text is through considering the character archetypes relied on to compose the framing of an article. The main figures in an article may be represented as the heroes, villains, or victims in the text to guide the reader towards reading the article in context with existing qualities implicit in these character archetypes. Gomez-Zara et al present a dictionary-based method for computationally determining the hero, villain, and victim in a newspaper text, which Stammbach et al adapt by using an LLM for the same task. 

## Fetch Articles

In [1]:
from utils.preprocessing import *
from utils.accelerators import *
from utils.multithreading import *
from utils.database import *
from utils.files import *
from datasets import Dataset
import random

  from .autonotebook import tqdm as notebook_tqdm


### Connect to Database

Credentials are sourced from the `.env` file.

In [2]:
_, db = getConnection(use_dotenv=True)

### Query Database

Fetches a limited number of articles from the database that haven't been processed yet, 
returning specified fields like url, title, and parsing result text.

In [3]:
collection = "articles"
fields = {"url": 1, "title": 1, "parsing_result.text": 1}
query = {"processing_result": {"$exists": False}, 
         "parsing_result.text_length": {"$lt": 10000}}
articles = fetchArticleTexts(db, 50, 0, fields, query, collection)

Example article:

In [4]:
example_article = random.choice(articles)
title = example_article.get("title")
text = example_article.get("parsing_result").get("text")
print(f"Title: {title}\nText: {text}")


Title: Court chaos continues as right-wing True the Vote leaders are thrown in jail
    Masthead  |
    Privacy Policy  |
    Manage Preferences | Debug Logs
    For corrections contact
    corrections@rawstory.com, for support contact
    support@rawstory.com.



Processes the 'parsing_result' of each article to clean the text, and filters out articles 
that lack a 'title' or 'parsing_result'.


In [5]:
# Basic text cleaning, e.g. removing newlines, tabs, etc.
articles = cleanArticles(articles)

Cleaning articles: 100%|██████████| 50/50 [00:00<00:00, 4230.18it/s]


In [6]:
# Filter out articles with no title or no parsing result 
articles = [article for article in articles if article.get(
    "title", "") and article.get("parsing_result", "")]

print("Number of articles:", len(articles))

Number of articles: 50


### Export as JSON

Saves the given data to a JSON file for optional visual inspection.

In [7]:
exportAsJSON("../data/input/articles.json",  articles)

### Convert to HF Dataset

Convert article IDs to strings and transform a list of articles into a dataset with fields: id, title, url, and text extracted from parsing results. The HuggingFace `datasets` library provides several key advantages over plain JSON files:

- **Efficiency**: The datasets are memory-mapped, allowing you to work with data that's larger than your available RAM without loading the entire dataset into memory. 
- **Speed**: Datasets in the HuggingFace format (which is Arrow-based) can be loaded faster than large JSON files, facilitating quicker data operations.
- **Columnar Storage**: By using Apache Arrow for storage, HuggingFace datasets benefit from a columnar format that ensures more efficient serialization and deserialization compared to row-based storage, such as JSON.


In [8]:
column_names = ["_id", "title", "url", "parsing_result.text"]
articles = convertListToDataset(articles, column_names)
describeDataset(articles)

Number of rows: 50
Column names: ['_id', 'title', 'url', 'text']
Features (schema): {'_id': Value(dtype='string', id=None), 'title': Value(dtype='string', id=None), 'url': Value(dtype='string', id=None), 'text': Value(dtype='string', id=None)}


In [9]:
print("Example Article Text:", articles[42]["text"][:100])

Example Article Text: Kari Lake, the Trump-backed Republican nominee for governor of Arizona, is unfit for office. We've k


Save dataset to disk:

In [10]:
articles.save_to_disk('../data/input/articles')

Saving the dataset (1/1 shards): 100%|██████████| 50/50 [00:00<00:00, 8364.52 examples/s]


***

## Prepare Dataset 

In [11]:
from langchain.text_splitter import RecursiveCharacterTextSplitter, CharacterTextSplitter
from datasets import Dataset, load_from_disk
from transformers import AutoTokenizer
from multiprocessing import Pool
from utils.preprocessing import *
from utils.database import *
from utils.files import *
import transformers

The code `os.environ["TOKENIZERS_PARALLELISM"] = "false"` disables parallel tokenization in HuggingFace's libraries. It's a way to suppress warnings and prevent potential issues tied to multi-core tokenization.
See: https://stackoverflow.com/questions/62691279/how-to-disable-tokenizers-parallelism-true-false-warning

In [12]:
import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"
transformers.utils.logging.set_verbosity_error()

### Import Raw Dataset

In [13]:
articles = load_from_disk('../data/input/articles')
describeDataset(articles)

Number of rows: 50
Column names: ['_id', 'title', 'url', 'text']
Features (schema): {'_id': Value(dtype='string', id=None), 'title': Value(dtype='string', id=None), 'url': Value(dtype='string', id=None), 'text': Value(dtype='string', id=None)}


### Prepare Dataset

#### Define Prompt Template:

In [14]:
PROMPT_TEMPLATE = "Who is the {elt} in the following text?\nText: {article_text}"

# Test the template with a dummy text
print(PROMPT_TEMPLATE.format(elt='hero',
      article_text='Lorem ipsum dolor sit amet, consectetur adipiscing elit.'))

Who is the hero in the following text?
Text: Lorem ipsum dolor sit amet, consectetur adipiscing elit.


#### Expand Dataset

Functions to segment articles into chunks fitting within the input window:

In [15]:
# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-large")
print("Input window length:", tokenizer.model_max_length)

Input window length: 512


In [16]:
template_length = calcInputLength(tokenizer,
    PROMPT_TEMPLATE.format(elt='villain', article_text=' '))
print("Max length of empty prompt template:", template_length)

Max length of empty prompt template: 13


For each article, distinct prompts identify 'hero', 'villain', and 'victim'. If an article exceeds the model's input size, it's divided into chunks, generating additional prompts. It seems that one article results in about 10 to 12 prompts.

In [17]:
articles[42]

{'_id': '64d8e926516b265872e62a24',
 'title': "Kari Lake's Joke About Pelosi Attack Draws Laughter From GOP Crowd",
 'url': 'https://crooksandliars.com/2022/10/kari-lakes-joke-about-pelosi-attack-0',
 'text': 'Kari Lake, the Trump-backed Republican nominee for governor of Arizona, is unfit for office. We\'ve known this for a while, but she just proved that she lacks the empathy gene which should be required to hold office. On Monday, Lake made a joke about the violent assault against Speaker Nancy Pelosi\'s husband on Friday that landed him in the hospital, where he is still recovering from brain surgery. Do you remember in 2019 when a Trump supporter said the quiet part out loud, saying, "He\'s (Trump) not hurting the people he needs to be hurting"? That is who Republicans have become. Conservatives want Democrats to hurt and suffer. The GOP has become the party of victimization, so they think they\'re owed something. "It is not impossible to protect our kids at school. They act like 

In [18]:
roles=['hero', 'villain', 'victim']
col_name="text"
n_tokens=450
overlap=10

example_row = articles[42]
example_row_exp = expandRow(example_row, tokenizer, PROMPT_TEMPLATE, col_name, n_tokens, overlap,  roles)

print("Expanded row:", example_row_exp)
print("Expanded row length:", len(example_row_exp))

Expanded row: [{'_id': '64d8e926516b265872e62a24', 'title': "Kari Lake's Joke About Pelosi Attack Draws Laughter From GOP Crowd", 'url': 'https://crooksandliars.com/2022/10/kari-lakes-joke-about-pelosi-attack-0', 'text': 'Kari Lake, the Trump-backed Republican nominee for governor of Arizona, is unfit for office. We\'ve known this for a while, but she just proved that she lacks the empathy gene which should be required to hold office. On Monday, Lake made a joke about the violent assault against Speaker Nancy Pelosi\'s husband on Friday that landed him in the hospital, where he is still recovering from brain surgery. Do you remember in 2019 when a Trump supporter said the quiet part out loud, saying, "He\'s (Trump) not hurting the people he needs to be hurting"? That is who Republicans have become. Conservatives want Democrats to hurt and suffer. The GOP has become the party of victimization, so they think they\'re owed something. "It is not impossible to protect our kids at school. Th

Process datataset using multiple proesses:

In [19]:
num_processes = 12
params = (tokenizer, PROMPT_TEMPLATE, col_name, n_tokens, overlap, roles,)
dataset_hvv = processDataset(articles, num_processes, expandRow, params)


In [20]:
dataset_hvv.save_to_disk('../data/input/articles_chunkified')

Saving the dataset (1/1 shards): 100%|██████████| 387/387 [00:00<00:00, 24104.12 examples/s]


***

## Tokenize Dataset

Tokenization refers to the process of converting input text into smaller units, such as words or subwords, which are then represented as tokens. These tokens are mapped to indices in a vocabulary that the model can understand. Hugging Face provides a variety of tokenizers, each suited for different types of models. For instance, the BertTokenizer is designed for BERT-like models and tokenizes text into wordpieces. Similarly, the GPT2Tokenizer is tailored for GPT-2-like models and tokenizes text into subwords using the Byte-Pair Encoding (BPE) algorithm. 

#### Parameters

Below are descriptions of key parameters helpful for using these tokenizers:

**`add_special_tokens`**:
* Whether to add special tokens such as `[CLS]` and `[SEP]` (default is True).
* Special tokens are necessary for some models to function properly.

**`max_length`**:
* The maximum number of tokens for the output (default varies, often 512).
* Texts longer than this will be truncated.

**`padding`**:
* Whether to pad the output to `max_length`, and the padding strategy (default is False).
* Options include `'max_length'`, `'longest'`, or `True` to pad to the length of the longest sequence.

**`truncation`**:
* Whether to truncate sequences to `max_length` (default is False).

**`return_tensors`**:
* The framework to use for the returned tensors, either `'pt'` for PyTorch or `'tf'` for TensorFlow (default is None, which returns plain lists).

**`return_token_type_ids`**:
* Whether to return token type IDs (default is True).
* Necessary for some models to understand the different segments of input (e.g., question vs answer).

**`return_attention_mask`**:
* Whether to return the attention mask (default is True).
* Attention masks tell the model which tokens to pay attention to and which to ignore.

**`verbose`**:
* Whether to log information during tokenization (default is True).

**`is_split_into_words`**:
* Whether the input is pre-tokenized into words (default is False).

These parameters allow for fine-grained control over the tokenization process, ensuring the text is prepared in a way that's suitable for your model and task.

For more information, consider checking the [`encode` and `encode_plus` methods documentation](https://huggingface.co/transformers/main_classes/tokenizer.html).


In [21]:
# Paramater passed to the tokenizer
tokenizer_params = {"truncation": True, "is_split_into_words": False,
                    "add_special_tokens": True, "padding": "max_length"}

# Parameters passed to the tokenization function
params = {"tokenizer": tokenizer, "col_name": "prompt", "params": tokenizer_params}

# Tokenize the dataset
tokenized_dataset = dataset_hvv.map(tokenizeInputs, fn_kwargs=params)

Map: 100%|██████████| 387/387 [00:00<00:00, 872.59 examples/s] 


In [22]:
tokenized_dataset.save_to_disk('../data/input/articles_tokenized')

Saving the dataset (1/1 shards): 100%|██████████| 387/387 [00:00<00:00, 36354.58 examples/s]


***

## Make Predictions

In [23]:
from torch.utils.data import DataLoader, TensorDataset, SequentialSampler
from datasets import Dataset, load_from_disk, concatenate_datasets
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
from tqdm import tqdm
import threading
import torch
import pickle
import time
import copy

In [24]:
dataset = load_from_disk('../data/input/articles_tokenized')
print("Dataset length:", len(dataset))

Dataset length: 387


### Split Dataset

List infos about the available GPUs:

In [25]:
gpu_info_list = listAvailableGPUs()

GPU 0:
  Name: Tesla P100-PCIE-16GB
  Memory: 16276.00 MiB
  Compute Capability: 6.0

GPU 1:
  Name: Tesla P100-PCIE-16GB
  Memory: 16276.00 MiB
  Compute Capability: 6.0



Determine the number of available GPUs:

In [26]:
num_gpus = torch.cuda.device_count()
print(f'Number of available GPUs: {num_gpus}')


Number of available GPUs: 2


In [27]:
# Split the dataset into chunks (one for each GPU)
chunks = splitDataset(dataset, num_chunks=num_gpus)

# Print the length of each chunk
print("Number of chunks:", len(chunks))
for i, chunk in enumerate(chunks):
    print(f"Chunk {i} length:", len(chunk))


Number of chunks: 2
Chunk 0 length: 194
Chunk 1 length: 193


### Process Articles

Check GPU utilization:

In [28]:
!nvidia-smi

Sat Oct 21 15:52:17 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.125.06   Driver Version: 525.125.06   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla P100-PCIE...  Off  | 00000000:17:00.0 Off |                    0 |
| N/A   46C    P0    28W / 250W |      2MiB / 16384MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla P100-PCIE...  Off  | 00000000:65:00.0 Off |                    0 |
| N/A   48C    P0    30W / 250W |      2MiB / 16384MiB |      0%      Default |
|       

#### Parameters for Text Generation

Each parameter influences the text generation in a specific way. Below are the parameters along with a brief explanation:

**`max_length`**:
* Sets the maximum number of tokens in the generated text (default is 50).
* Generation stops if the maximum length is reached before the model produces an EOS token.
* A higher `max_length` allows for longer generated texts but may increase the time and computational resources required.

**`min_length`**:
* Sets the minimum number of tokens in the generated text (default is 10).
* Generation continues until this minimum length is reached even if an EOS token is produced.

**`num_beams`**:
* In beam search, sets the number of "beams" or hypotheses to keep at each step (default is 4).
* A higher number of beams increases the chances of finding a good output but also increases the computational cost.

**`num_return_sequences`**:
* Specifies the number of independently computed sequences to return (default is 3).
* When using sampling, multiple different sequences are generated independently from each other.

**`early_stopping`**:
* Stops generation if the model produces the EOS (End Of Sentence) token, even if the predefined maximum length is not reached (default is True).
* Useful when an EOS token signifies the logical end of a text (often represented as `</s>`).

**`do_sample`**:
* Tokens are selected probabilistically based on their likelihood scores (default is True).
* Introduces randomness into the generation process for diverse outputs.
* The level of randomness is controlled by the 'temperature' parameter.

**`temperature`**:
* Adjusts the probability distribution used for sampling the next token (default is 0.7).
* Higher values make the generation more random, while lower values make it more deterministic.

**`top_k`**:
* Limits the number of tokens considered for sampling at each step to the top K most likely tokens (default is 50).
* Can make the generation process faster and more focused.

**`top_p`**:
* Also known as nucleus sampling, sets a cumulative probability threshold (default is 0.95).
* Tokens are sampled only from the smallest set whose cumulative probability exceeds this threshold.

**`repetition_penalty`**:
* Discourages the model from repeating the same token by modifying the token's score (default is 1.5).
* Values greater than 1.0 penalize repetitions, and values less than 1.0 encourage repetitions.


In [29]:
def generatePredictions(process_id, dataset, device):
    """Generates predictions for a given dataset."""

    # Print some information about the process
    print(f"--------- Process {process_id:02} ---------")
    print(f"Dataset length: {len(dataset)}")
    print(f"Device: {device}")
    print(f"------------------------------")

    # Load tokenizer and model for generation
    tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")
    model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")
    model.eval()
    model.to(device)

    # Copy the dataset to avoid modifying the original
    dataset_full = copy.copy(dataset)
    dataset.set_format(type='torch', columns=[
        'input_ids', 'attention_mask'])

    # Create dataloader without explicit sampler for sequential loading
    BATCH_SIZE = 192
    dataloader = DataLoader(
        dataset, batch_size=BATCH_SIZE, shuffle=False)

    # Set generation parameters
    params = {'do_sample': True,
              'early_stopping': False,
              # 'max_length': 100,
              # 'min_length': 1,
              # 'num_beam_groups': 2,
              # 'num_beams': 2,
              # 'max_tokens': 32,
              # 'min_tokens': 1,
              # 'output_scores': False,
              'repetition_penalty': 1.0,
              # 'return_dict_in_generate': False,
              'temperature': 1.0,
              'top_k': 50,
              'top_p': 1.0, }

    # Make predictions
    predictions = []

    with torch.no_grad():
        for batch in tqdm(dataloader, desc="Batches"):
            batch = {k: v.to(device) for k, v in batch.items()}

            # Generate outputs
            batch_outputs = model.generate(
                input_ids=batch['input_ids'], attention_mask=batch['attention_mask'], **params, max_new_tokens=100)

            # Decode and store predictions
            decoded_outputs = [tokenizer.decode(
                output_id, skip_special_tokens=True) for output_id in batch_outputs]
            predictions.extend(decoded_outputs)

    # Ensure the new column has the same number of items as the dataset
    assert len(dataset_full) == len(
        predictions), "The length of new_column_values must match the dataset's length"

    # Add new column to the dataset and save it
    dataset_full = dataset_full.add_column('answer', predictions)
    dataset_full.save_to_disk('../data/output/articles_processed_' + str(id))

    return dataset_full
    

Start one thread per GPU before collecting and merging the results:

In [30]:
# Assuming datasets and devices are lists containing the datasets and device names
datasets = chunks  # and so on...
devices = ['cuda:0', 'cuda:1']  # and so on...

# Calls the function to start the threads
returned_datasets = startThreads(len(datasets), datasets, devices, generatePredictions)
print("Number of returned datasets:", len(returned_datasets))

# Concatenate the returned datasets
merged_dataset = concatenate_datasets(returned_datasets)
merged_dataset.save_to_disk('../data/output/articles_processed')

# Print the length of the merged dataset
print("Processing on both GPUs completed!")
print("Results:", len(merged_dataset))

--------- Process 00 ------------------ Process 01 ---------
Dataset length: 193
Device: cuda:1
------------------------------

Dataset length: 194
Device: cuda:0
------------------------------


Batches: 100%|██████████| 2/2 [00:09<00:00,  4.56s/it]
Saving the dataset (1/1 shards): 100%|██████████| 193/193 [00:00<00:00, 20484.87 examples/s]
Batches: 100%|██████████| 2/2 [00:08<00:00,  4.17s/it]
Saving the dataset (1/1 shards): 100%|██████████| 194/194 [00:00<00:00, 18737.08 examples/s]


Number of returned datasets: 2


Saving the dataset (1/1 shards): 100%|██████████| 387/387 [00:00<00:00, 9718.80 examples/s] 

Processing on both GPUs completed!
Results: 387





***

## Upload Results

In [31]:
from utils.preprocessing import *
from utils.database import *
from datasets import load_from_disk
from tqdm import tqdm

In [32]:
dataset = load_from_disk('../data/output/articles_processed')

In [33]:
# Exmample amswer 
answer = dataset[0].get("answer")
role = dataset[0].get("role")
print("Role:", role)
print("Answer:", answer)   

Role: hero
Answer: "Investigators in the Georgia criminal probe have long suspected the breach was not an organic effort [...]


In [34]:
dataset = dataset.map(lambda row: {'sort_key':
                                   str(row['_id']) + str(row['role']) +
                                   str(row['chunk'])
                                   })
sorted_dataset = dataset.sort('sort_key')
sorted_dataset = sorted_dataset.remove_columns('sort_key')


Map:   0%|          | 0/387 [00:00<?, ? examples/s]

Map: 100%|██████████| 387/387 [00:00<00:00, 2910.98 examples/s]


### Connect to Database

In [35]:
_, db = getConnection(use_dotenv=True)

### Update Documents in Database

In [36]:
def processResults(dataset):
    # Initial processing results
    processing_result = {"hero": [], "villain": [], "victim": []}
    object_id_prev = None

    for item in dataset:
        object_id = item['_id']
        role = item['role']
        answer = item['answer']

        # If the object_id changes, reset the processing_result
        if object_id_prev is not None and object_id_prev != object_id:
            yield object_id_prev, processing_result
            processing_result = {"hero": [], "villain": [], "victim": []}

        processing_result[role].append(answer)
        object_id_prev = object_id

    # Yield the final processing_result if any
    if processing_result["hero"] or processing_result["villain"] or processing_result["victim"]:
        yield object_id_prev, processing_result

In [37]:
# Assuming `ds` is your dataset object
unique_ids = set(dataset["_id"])

# Count of unique ids
count_unique_ids = len(unique_ids)
# print(count_unique_ids)‚

for object_id, result in tqdm(processResults(dataset), total=count_unique_ids, desc="Uploading results"):
    pass
    #updateProcessingResults(db, object_id, {"processing_result": result})

Uploading results: 51it [00:00, 164.61it/s]                        
