# Stage 1: Role Labeling

One way to identify narratives in newspaper text is to consider the character archetypes that frame an article. Its main figures may be cast as heroes, villains, or victims, which steers the reader to interpret the article through the qualities implicit in these archetypes. Gomez-Zara et al. present a dictionary-based method for computationally determining the hero, villain, and victim in a newspaper text, which Stammbach et al. adapt by using an LLM for the same task.

## Fetch Articles (for Testing)

In [1]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
from huggingface_hub import InferenceClient
from transformers import BertTokenizer
from utils.preprocessing import *
from utils.accelerators import *
from utils.multithreading import *
from utils.database import *
from utils.model import *
from utils.files import *
from datasets import Dataset
from rouge import Rouge
from tqdm import tqdm
import statistics
import hashlib
import random
import openai
import time
import math
import re



### Connect to Database

Credentials are sourced from the `.env` file.

In [2]:
_, db = getConnection(use_dotenv=True)

### Query Database

Fetches up to 50 articles that have not been processed yet (no `processing_result` field) and whose `parsing_result.text_length` is below 10,000, 
returning only the `url`, `title`, and `parsing_result.text` fields.

In [3]:
collection = "articles"
fields = {"url": 1, "title": 1, "parsing_result.text": 1}
query = {"processing_result": {"$exists": False}, 
         "parsing_result.text_length": {"$lt": 10000}}
articles = fetchArticleTexts(db, 50, 0, fields, query, collection)
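The implementation of `fetchArticleTexts` lives in `utils.database` and is not shown in this notebook. As a rough sketch only, and assuming `db` is a PyMongo database handle, a helper with the same argument order could look like the following; the name `fetch_article_texts_sketch` is hypothetical.

```python
def fetch_article_texts_sketch(db, limit, skip, fields, query, collection="articles"):
    """Hypothetical stand-in for fetchArticleTexts: filter, project, then page."""
    # find(filter, projection) returns a lazy cursor;
    # skip() and limit() restrict which matching documents are returned.
    cursor = db[collection].find(query, fields).skip(skip).limit(limit)
    return list(cursor)
```

In PyMongo a limit of 0 means "no limit", which matches the way `limit=0` is used later in this notebook to fetch all matching documents.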

Example article:

In [4]:
example_article = random.choice(articles)
title = example_article.get("title")
text = example_article.get("parsing_result").get("text")
print(f"Title: {title}\nText: {text}")


Title: 'Buckle up': Princeton professor warns that political violence is 'going to get worse'
Text: Political violence in the United States is continuing to get worse and with easy access to weapons of mass death, there is a fear that it could generate larger body counts. Princeton Professor Eddie Glaude warned this is just the beginning of MSNBC's The Last Word on Monday evening. "It reveals, the corruption of the heart. They've turned some of Americans into more monsters. It reveals we have political nihilists who are part of a major political party, who value nothing and no, except for holding power" said Glaude. "And you combine those political nihilists with cynical people and people who are motivated by hate, and you get violence." Glaude went on to say the problem isn't just rhetoric. "I'm thinking very quickly of two historical moments," he recalled. "Moments where there seems to be an existential threat that defined how Americans came to blows with each other. The mid-20th cen


Processes the 'parsing_result' of each article to clean the text, and filters out articles 
that lack a 'title' or 'parsing_result'.


In [5]:
# Basic text cleaning, e.g. removing newlines, tabs, etc.
articles = cleanArticles(articles)

Cleaning articles: 100%|██████████| 50/50 [00:00<00:00, 4354.91it/s]
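
The body of `cleanArticles` is defined in `utils.preprocessing` and not shown here. Going only by the comment in the cell above ("removing newlines, tabs, etc."), a minimal per-text sketch might collapse all whitespace runs into single spaces; `clean_text_sketch` is a hypothetical name, and the real helper may do more.

```python
import re


def clean_text_sketch(text):
    """Collapse newlines, tabs and repeated spaces into single spaces."""
    return re.sub(r"\s+", " ", text).strip()


print(clean_text_sketch("Political violence\n\tis   getting worse.\n"))
# prints: Political violence is getting worse.
```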


In [6]:
# Filter out articles with no title or no parsing result 
articles = [article for article in articles if article.get(
    "title", "") and article.get("parsing_result", "")]

print("Number of articles:", len(articles))

Number of articles: 50


### Export as JSON

Saves the given data to a JSON file for optional visual inspection.

In [7]:
exportAsJSON("../data/input/articles.json",  articles)

***

## Load Model

Vicuna-13B is an open-source chatbot obtained by fine-tuning LLaMA on user-shared conversations collected from ShareGPT. According to its authors' preliminary evaluation with GPT-4 as the judge, Vicuna-13B reaches more than 90% of the quality of OpenAI ChatGPT and Google Bard and outperforms models such as LLaMA and Stanford Alpaca in more than 90% of cases. The `vicuna-13b-v1.5-16k` variant used here is fine-tuned from Llama 2 and supports a 16K-token context.

See:
* https://github.com/lm-sys/FastChat
* https://huggingface.co/lmsys/vicuna-13b-v1.5-16k

```bash
# Start the controller service
nohup python3 -m fastchat.serve.controller --host 0.0.0.0 --port 21001 &

# Start the model_worker service
nohup python3 -m fastchat.serve.model_worker --model-path lmsys/vicuna-13b-v1.5-16k --num-gpus 2 &

# Start the gradio_web_server service
nohup python3 -m fastchat.serve.gradio_web_server --host 0.0.0.0 --port 7860 &

# Launch the RESTful API server
nohup python3 -m fastchat.serve.openai_api_server --host 0.0.0.0 --port 8080 &
```
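
Once the RESTful API server is running, it can be queried like an OpenAI-style completions endpoint. The sketch below uses only the standard library; the host `localhost` and the helper names `build_completion_request` and `complete` are assumptions for illustration, while port 8080 and the model name follow the commands above.

```python
import json
import urllib.request


def build_completion_request(api_base, model_name, prompt,
                             max_tokens=50, temperature=0.6):
    """Build a POST request for the OpenAI-compatible /completions route."""
    payload = {
        "model": model_name,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        url=f"{api_base.rstrip('/')}/completions",
        data=json.dumps(payload).encode("utf-8"),
        # The key is a placeholder, as with api_key="EMPTY" in this notebook
        headers={"Content-Type": "application/json",
                 "Authorization": "Bearer EMPTY"},
        method="POST",
    )


def complete(request, timeout=60):
    """Send the request and return the text of the first completion."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["text"]


request = build_completion_request(
    "http://localhost:8080/v1", "vicuna-13b-v1.5-16k", "Once upon a time")
# print(complete(request))  # requires the services above to be running
```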

Check the number of available GPUs:

In [8]:
import torch  # imported explicitly rather than relying on the star imports above
num_gpus = torch.cuda.device_count()
print(f'Number of available GPUs: {num_gpus}')

Number of available GPUs: 2


List information about the available GPUs:

In [9]:
gpu_info_list = listAvailableGPUs()

GPU 0:
  Name: Tesla P100-PCIE-16GB
  Memory: 16276.00 MiB
  Compute Capability: 6.0

GPU 1:
  Name: Tesla P100-PCIE-16GB
  Memory: 16276.00 MiB
  Compute Capability: 6.0



In [10]:
!nvidia-smi

Sat Oct 21 20:04:21 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.125.06   Driver Version: 525.125.06   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla P100-PCIE...  Off  | 00000000:17:00.0 Off |                    0 |
| N/A   46C    P0    34W / 250W |  13224MiB / 16384MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla P100-PCIE...  Off  | 00000000:65:00.0 Off |                    0 |
| N/A   48C    P0    36W / 250W |  13224MiB / 16384MiB |      0%      Default |
|       

Test Model:

In [11]:
model = RemoteModel(model_name="vicuna-13b-v1.5-16k",
                    api_base="http://merkur72.inf.uni-konstanz.de:8080/v1",
                    api_key="EMPTY")

print(model.generateAnswer("Once upon a time"))

, there was a guy named Warren. He was a pretty regular guy


***

## Define Prompt Template

In [12]:
# PROMPT_TEMPLATE = "Please identify entities which are portrayed as hero, villain and victim in the following news article. A hero is an individual, organisation, or entity admired for their courage, noble qualities, and outstanding achievements. A villain is a character, organisation, or entity known for their wickedness or malicious actions, often serving as an antagonist in a story or narrative. A victim is an individual, organisation, or entity who suffers harm or adversity, often due to an external force or action. Every entity can only be one of those roles. The solution must be returned in this format {{hero: \"Name\", villain: \"Name\", victim: \"Name\"}}. Article Headline: ''{headline}''. Article Text: ''{article_text}''  Solution: "

# PROMPT_TEMPLATE = "Please identify entities which are portrayed as hero, villain and victim in the following news article. Every entity can only be one of those roles. If not existing return None as name. The solution must be returned in this format {{hero: \"Name\", villain: \"Name\", victim: \"Name\"}}. Article Headline: ''{headline}''. Article Text: ''{article_text}''  Solution: "

# PROMPT_TEMPLATE = "Please identify entities which are portrayed as hero, villain and victim in the following news article. Each entity can only assume one role. If none apply, use 'None'. The solution must be returned in this format {{hero: \"Name\", villain: \"Name\", victim: \"Name\"}}. Article Headline: ''{headline}''. Article Text: ''{article_text}''  Solution: "

PROMPT_TEMPLATE = "Given the news article below, identify entities categorized as a hero, villain, or victim. Each entity can only assume one role. If none apply, use 'None'. The solution must be provided in this format: {{hero: \"Name\", villain: \"Name\", victim: \"Name\"}}. \n Headline: '{headline}' \n Text: '{article_text}' \n Solution: "

# Test the template with a dummy text
prompt_test = PROMPT_TEMPLATE.format(headline = 'Lorem ipsum dolor sit amet, consectetur adipiscing elit.', article_text='Lorem ipsum dolor sit amet, consectetur adipiscing elit.')
print(prompt_test)


Given the news article below, identify entities categorized as a hero, villain, or victim. Each entity can only assume one role. If none apply, use 'None'. The solution must be provided in this format: {hero: "Name", villain: "Name", victim: "Name"}. 
 Headline: 'Lorem ipsum dolor sit amet, consectetur adipiscing elit.' 
 Text: 'Lorem ipsum dolor sit amet, consectetur adipiscing elit.' 
 Solution: 
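
The doubled braces in `PROMPT_TEMPLATE` matter because `str.format` treats single braces as placeholders. The short sketch below, using a hypothetical shortened template, illustrates two points: literal braces in the template must be written as `{{` and `}}`, whereas braces that occur inside the substituted article text are copied verbatim and need no escaping.

```python
# Hypothetical, shortened template for illustration only
TEMPLATE_SKETCH = 'Format: {{hero: "Name"}} \n Text: \'{article_text}\''

# Braces inside the substituted value are copied verbatim, not re-interpreted
rendered = TEMPLATE_SKETCH.format(article_text="set {x} and {y}")
print(rendered)

# A single-brace version is parsed as a placeholder named "hero" and fails
try:
    '{hero: "Name"} {article_text}'.format(article_text="text")
    unescaped_ok = True
except KeyError:
    unescaped_ok = False
print("Unescaped template works:", unescaped_ok)
```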


## Define Parameters for Text Generation

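Each parameter influences text generation in a specific way. The parameters are explained briefly below; the defaults in parentheses are as recorded in this notebook and may differ from the library defaults (in Hugging Face `generate`, for example, `do_sample` defaults to `False` and `num_beams` to 1).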

**`max_length`**:
* Sets the maximum number of tokens in the generated text (default is 50).
* Generation stops if the maximum length is reached before the model produces an EOS token.
* A higher `max_length` allows for longer generated texts but may increase the time and computational resources required.

**`min_length`**:
* Sets the minimum number of tokens in the generated text (default is 10).
* The EOS token is suppressed until this minimum length is reached, so generation cannot end earlier.

**`num_beams`**:
* In beam search, sets the number of "beams" or hypotheses to keep at each step (default is 4).
* A higher number of beams increases the chances of finding a good output but also increases the computational cost.

**`num_return_sequences`**:
* Specifies the number of independently computed sequences to return (default is 3).
* When using sampling, multiple different sequences are generated independently from each other.

**`early_stopping`**:
* Controls when beam search stops: if True, generation ends as soon as `num_beams` finished candidates exist rather than continuing to search for better ones (default is True).
* Only affects beam-based decoding. Independently of this flag, a sequence ends when the model produces the EOS (End Of Sequence) token, often represented as `</s>`.

**`do_sample`**:
* Tokens are selected probabilistically based on their likelihood scores (default is True).
* Introduces randomness into the generation process for diverse outputs.
* The level of randomness is controlled by the 'temperature' parameter.

**`temperature`**:
* Adjusts the probability distribution used for sampling the next token (default is 0.7).
* Higher values make the generation more random, while lower values make it more deterministic.

**`top_k`**:
* Limits the number of tokens considered for sampling at each step to the top K most likely tokens (default is 50).
* Can make the generation process faster and more focused.

**`top_p`**:
* Also known as nucleus sampling, sets a cumulative probability threshold (default is 0.95).
* Tokens are sampled only from the smallest set whose cumulative probability exceeds this threshold.

**`repetition_penalty`**:
* Discourages the model from repeating the same token by modifying the token's score (default is 1.5).
* Values greater than 1.0 penalize repetitions, and values less than 1.0 encourage repetitions.
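
To make the interaction of `temperature`, `top_k`, and `top_p` concrete, the following is a small pure-Python sketch of how a next-token distribution could be reshaped before sampling. It is a simplified illustration, not the code path used by the model server; the function names and the toy logits are made up.

```python
import math


def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; a lower temperature sharpens them."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


def filter_top_k_top_p(probs, top_k=50, top_p=1.0):
    """Keep the top_k most likely tokens, then the smallest prefix whose
    cumulative probability reaches top_p, and renormalize the survivors."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for index in ranked[:top_k]:
        kept.append(index)
        cumulative += probs[index]
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}


toy_logits = [2.0, 1.0, 0.0]
print(softmax(toy_logits, temperature=1.0))
print(softmax(toy_logits, temperature=0.6))  # more mass on the first token
print(filter_top_k_top_p([0.5, 0.3, 0.2], top_k=3, top_p=0.7))
```

With `top_p=0.7` the third token is dropped because the first two already cover 0.8 of the probability mass; sampling would then draw from the two renormalized survivors.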


In [13]:
params = {'do_sample': True,
        'early_stopping': True,
        # 'max_length': 100,
        # 'min_length': 1,
        'logprobs': 1,
        'n': 1,
        'best_of': 1,
        
        'num_beam_groups': 2,
        'num_beams': 5,
        'num_return_sequences': 5,
        'max_tokens': 50,
        'min_tokens': 0,
        'output_scores': True,
        'repetition_penalty': 1.0,
        'temperature': 0.6,
        'top_k': 50,
        'top_p': 1.0 
        }

## Define Helper Functions

In [14]:
def extractTriplet(answer):
    """ Extracts the triplet from the answer string. """

    # Match each key together with its own value so that answers mixing
    # quoted and unquoted values (e.g. hero: None, villain: "X") are
    # captured completely and stay aligned.
    pattern = r'(\w+)\s*:\s*(?:"(.*?)"|([^,{}]+))'
    result = {}
    for key, quoted, bare in re.findall(pattern, answer):
        result[key] = quoted if quoted else bare.strip().strip('"')

    return result

In [15]:
def getAnswersTriplets(article, model, template, params):
    """ Generates answers for the given article using the model and template. """

    # Extract the article headline and text
    article_headline=article.get("title", "")
    article_text = article.get("parsing_result").get("text")

    # Generate the answer
    prompt = template.format(headline = article_headline, article_text = article_text)
    answer = model.generateAnswer(prompt, params)

    return answer

In [16]:
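def splitText(text, n_tokens, tokenizer, overlap=10):
    """Splits the input text into chunks of at most n_tokens tokens using a HuggingFace tokenizer.
    Every chunk after the first starts with the last `overlap` tokens of the previous chunk."""

    # The window advances by n_tokens - overlap tokens, which must be positive;
    # otherwise the loop below would never terminate.
    if not 0 <= overlap < n_tokens:
        raise ValueError("overlap must be non-negative and smaller than n_tokens")

    tokens = tokenizer.tokenize(text)
    chunks = []
    i = 0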

    # No previous chunk at the beginning, so no need for overlap
    chunks.append(tokenizer.convert_tokens_to_string(tokens[i:i+n_tokens]))
    i += n_tokens

    while i < len(tokens):
        # Now, we include overlap from the previous chunk
        start_index = i - overlap
        end_index = start_index + n_tokens
        chunk = tokens[start_index:end_index]
        chunks.append(tokenizer.convert_tokens_to_string(chunk))
        i += n_tokens - overlap  # Moving the index to account for the next overlap

    return chunks
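
The window arithmetic in `splitText` can be checked without a tokenizer. The sketch below applies the same sliding-window scheme to a plain list, using whitespace-separated words as a stand-in for tokens; `chunk_tokens` is a hypothetical helper, not part of the project utilities.

```python
def chunk_tokens(tokens, n_tokens, overlap):
    """Slide a window of n_tokens over tokens, advancing by n_tokens - overlap."""
    if not 0 <= overlap < n_tokens:
        raise ValueError("overlap must be non-negative and smaller than n_tokens")

    stride = n_tokens - overlap
    chunks = []
    start = 0
    while True:
        chunks.append(tokens[start:start + n_tokens])
        if start + n_tokens >= len(tokens):
            break  # this window already reached the end of the input
        start += stride
    return chunks


words = "GOP threats to Medicare and Social Security take center stage".split()
for chunk in chunk_tokens(words, n_tokens=4, overlap=1):
    print(" ".join(chunk))
```

Each printed chunk begins with the last word of the previous one. With the settings used later in this notebook (`chunk_size=1024`, `overlap=64`), every chunk after the first would contribute 960 new tokens.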

In [17]:
def processBatch(articles, model, template, params, chunk_size=1024, overlap=256, show_progress=False, verbose=False):
    """Processes a batch of articles and extracts the triplets."""
    runtimes = []  # List to store the runtime for each article

    # Iterate over the articles
    for article in tqdm(articles, desc="Generating answers", disable=not show_progress):
        start_time = time.time()  # Start the timer

        # Extract the article headline and text
        article_headline = article.get("title", "")
        article_text = article.get("parsing_result").get("text")

        # Split the article text into chunks
        chunks = splitText(article_text, chunk_size,
                            model.tokenizer, overlap=overlap)

        # print("Chunks:", len(chunks))

        chunk_results = []
        for chunk_id, chunk in enumerate(chunks):

            if verbose:
                print("Chunk:", chunk_id)
                print("Chunk Length:", calcInputLength(model.tokenizer, chunk))
            # print("Headline Length:", calcInputLength(
            #     tokenizer, article_headline))

            prompt = template.format(
                headline=article_headline, article_text=chunk)
            answer = model.generateAnswer(prompt, params)
            triplet = extractTriplet(answer)

            # print(chunk_id, "Answer:", triplet, "Type:", type(triplet))
            results = {
                "chunk_id": chunk_id,
                "chunk": chunk,
                "answer": answer,
                "triplet": triplet
            }
            chunk_results.append(results)

        article["triplets"] = chunk_results

        end_time = time.time()  # End the timer
        runtime = end_time - start_time  # Calculate the runtime
        runtimes.append(runtime)  # Store the runtime

    return articles, runtimes

In [18]:
def updateArticle(db, id: str, values: dict = None, collection="articles"):
    """Updates a single article document in the database (currently disabled)."""
    pass # TODO: Uncomment to update the database
    # from bson import ObjectId
    # filter = {"_id": ObjectId(id)}
    # values = {"$set": {**(values or {})}}
    # r = db[collection].update_one(filter, values)
    # return r

In [19]:
def updateArticles(db, articles):
    """Updates the articles in the database."""

    for article in tqdm(articles, desc="Uploading results"):
        id = article.get("_id")
        values = {"triplets": article.get("triplets", [])}
        #updateArticle(db, id, values) # TODO: Uncomment to update the database

## Test Examples

In [20]:
article = articles[40]
print("Article Title:", article.get("title"))
print("Article Text:", article.get("parsing_result").get("text")[:200])

Article Title: GOP threats to Medicare and Social Security take center stage in last days before election
Article Text: According to a report from CNN, President Joe Biden is headed to Florida on Tuesday where he will lead the charge against Republicans who want to gut Medicare and dismantle Social Security in a state 


In [21]:
splitText(article.get("title"), 5, model.tokenizer, overlap=1)

['GOP threats to',
 'to Medicare and Social',
 'Social Security take center stage',
 'stage in last days before',
 'before election']

In [22]:
title = article.get("title")
text = article.get("parsing_result").get("text")
prompt = PROMPT_TEMPLATE.format(headline =title, article_text = text)
input_length = calcInputLength(model.tokenizer, prompt)

print("Prompt: >>>", prompt, "<<<")
print("Prompt Input Length:", input_length)

Prompt: >>> Given the news article below, identify entities categorized as a hero, villain, or victim. Each entity can only assume one role. If none apply, use 'None'. The solution must be provided in this format: {hero: "Name", villain: "Name", victim: "Name"}. 
 Headline: 'GOP threats to Medicare and Social Security take center stage in last days before election' 
 Text: 'According to a report from CNN, President Joe Biden is headed to Florida on Tuesday where he will lead the charge against Republicans who want to gut Medicare and dismantle Social Security in a state that is up to its ears in retirees. With one week before the midterm election, Biden will warn voters that the GOP -- led by Sens. Rick Scott (R-FL) and Ron Johnson (R-WI) -- will destroy the social safety net if given the chance while painting them as "extreme MAGA Republicans.” "Among his chief foils is Scott, the head of Republicans’ campaign arm who had laid out a policy agenda that would put Medicare, Social Securi

In [23]:
answer = getAnswersTriplets(article, model, PROMPT_TEMPLATE, params)
print("Answer:", answer)

Answer: 
{hero: "Joe Biden", villain: "Rick Scott, Ron Johnson", victim: "Republican Party"}


In [24]:
articles, runtimes = processBatch(articles[:5], model, PROMPT_TEMPLATE, params, chunk_size = 1024, overlap= 64, show_progress=True)

Generating answers: 100%|██████████| 5/5 [00:38<00:00,  7.67s/it]


In [25]:
articles[2]

{'_id': ObjectId('64d8e926516b265872e629f2'),
 'title': 'Chicago police say up to fourteen people may have been shot nine critically injured on West Side',
 'url': 'https://biztoc.com/p/fgcuuxii?ref=rss&rd=1',
 'parsing_result': {'text': "Key events: 18': Luis Diaz opens the scoring for Liverpool, converting a great early cross from Salah. 29': Mohamed Salah appears to have... Chelsea are taking on Liverpool in their Premier League opener at Stamford Bridge. The score is currently level 1-1 in the second half. Tottenham began life without Harry Kane with an entertaining 2-2 draw at Brentford in the Premier League on Sunday. The Chiefs and Saints are facing off in the 2023 NFL preseason. We break down how to watch the matchup. A quick look at the Jaguars' 28-23 victory over the Dallas Cowboys in a 2023 Preseason Week 1 game at AT&T Stadium in Arlington, Texas, Saturday. Eagles linebacker and special teams ace Shaun Bradley was carted off the field in the third quarter of the team's pres

In [26]:
if runtimes:
    avg_runtime = sum(runtimes) / len(runtimes)
    print(f"Average runtime: {avg_runtime:.4f} seconds")
else:
    avg_runtime = 0

if len(runtimes) > 1:
    std_runtime = statistics.stdev(runtimes)
    print(f"Standard Deviation of runtime: {std_runtime:.4f} seconds")
else:
    std_runtime = 0

Average runtime: 7.6656 seconds
Standard Deviation of runtime: 6.6863 seconds


## Make Predictions

In [27]:
LIMIT = 10 # Number of articles to process in each batch
CHUNK_SIZE = 1024 # Number of tokens in each chunk
OVERLAP = 64 # Number of overlapping tokens between chunks

In [28]:
batch_id = 0

while True:
    print(f"------ Batch {batch_id} ------")

    # Fetch the next batch of articles
    articles = fetchArticleTexts(db, LIMIT, 0, fields, query)
    
    # Stop if no more articles are available
    if not articles:
        break
    
    # Process the batch of articles
    articles, runtimes = processBatch(articles, model, PROMPT_TEMPLATE, params, chunk_size=CHUNK_SIZE, overlap=OVERLAP, show_progress=True)

    # Update the articles in the database
    #updateArticles(db, articles)
    print(f"Updated {len(articles)} articles", end="\n\n")

    batch_id += 1
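    break  # Stop after one batch: the query is re-run with skip=0 and would return the same articles while no `processing_result` is written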

------ Batch 0 ------



Generating answers: 100%|██████████| 10/10 [01:07<00:00,  6.73s/it]

Updated 10 articles






In [29]:
raise SystemExit("Stopped before updating the database!")

SystemExit: Stopped before updating the database!



## Convert Results to Old Format

In [None]:
# Fetch articles from database
fields = {"triplets": 1}
query = {"triplets": {"$exists": True}}
articles = fetchArticleTexts(db,  limit=0, skip=0, fields=fields, query=query, collection="articles")


In [None]:
articles[0]

In [None]:
invalid_answers = [
    "None",
    "'None'",
    "'none'",
    "None}",
    "{None",
    "Not applicable",
    "Not available",
    "Not specified",
    "No data",
    "No value",
    "Invalid",
    "Unspecified",
    "Empty",
    "Missing",
    "Null",
    "Undefined",
    "N/A",
    "NA",
    "Not provided",
    "No information",
    "Not set",
    "No entry",
    "No response",
    "Not determined",
    "No result",
    "No answer",
    "No record",
    "No match",
    "No selection",
    "Not found",
    "Not valid",
    "Not given",
    "Not filled",
    "Not assigned",
    "No choice",
    "Not used",
    "No sample",
    "Not measured",
    "Not reported",
    "Not registered",
    "Not logged",
    "No feedback",
    "No score",
    "No grade",
    "No rating",
    "No rating available",
    "No rating provided",
    "No rating assigned",
    "No rating given",
    "No rating received",
    "No rating found",
    "No rating recorded",
    "No rating obtained",
    "No rating submitted",
    "No rating included",
]

In [None]:
def isNone(input_string, alternative_names):
    """Checks if the stripped input string exactly matches one of the alternative names."""

    # The comparison is case-sensitive, so only spellings listed in
    # alternative_names are recognised.
    return input_string.strip() in alternative_names
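
Because the check above is an exact, case-sensitive match, spellings such as `none` or `NONE` would only be caught if they were listed explicitly. One possible alternative, shown here purely as a sketch, is to normalize both sides before comparing; `normalize_answer` and `is_placeholder` are hypothetical names, and treating an empty string as a placeholder is an added design choice rather than the notebook's behaviour.

```python
def normalize_answer(value):
    """Strip whitespace, surrounding quotes and braces, and ignore case."""
    return value.strip().strip("'\"{}").strip().casefold()


def is_placeholder(value, placeholders):
    """Return True if value is empty or matches a placeholder after normalization."""
    normalized = {normalize_answer(name) for name in placeholders}
    candidate = normalize_answer(value)
    return candidate == "" or candidate in normalized


placeholders = ["None", "N/A", "Not specified"]
print(is_placeholder("'none'", placeholders))
print(is_placeholder(" None} ", placeholders))
print(is_placeholder("Joe Biden", placeholders))
```

With normalization, variants like `"None}"`, `"{None"`, and `"'none'"` no longer need separate entries in the list.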

In [None]:
#result = db.articles.update_many({}, {"$unset": {"embedding_result_tokens": ""}})
#result = db.articles.sampled.triplets.update_many({}, {"$unset": {"processing_result": ""}})
#result = db.articles.sampled.triplets.update_many({}, {"$unset": {"denoising_result": ""}})
#result = db.articles.sampled.triplets.update_many({}, {"$unset": {"embedding_result": ""}})

In [None]:
# Iterate through all documents in the collection
for article in tqdm(articles, desc="Uploading results"):
    chunks = article.get("triplets", [])  # Get the "triplets" property

    #print(chunks)

    heros, villains, victims = [], [], []

    # Extract data from the "triplets" property
    for chunk in chunks:


        triplet = chunk.get("triplet", {})
        hero = triplet.get("hero", "None")
        villain = triplet.get("villain", "None")
        victim = triplet.get("victim", "None")

        #print(hero, villain, victim)

        if not isNone(hero, invalid_answers):
            heros.append(hero)
        if not isNone(villain, invalid_answers):
            villains.append(villain)
        if not isNone(victim, invalid_answers):
            victims.append(victim)

    # Create the processing_result structure
    processing_result = {
        "hero": heros,
        "villain": villains,
        "victim": victims
    }
   
    #print(processing_result)
    
    # Update the document in the database   
    id = article.get("_id")
    values = {"processing_result": processing_result}
    updateArticle(db, id, values)
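
The aggregation inside the loop above can also be expressed as a pure function, which makes it easy to check without a database connection. This is a sketch that mirrors the loop's logic under the same exact-match rule; `collect_roles` and the sample chunks are hypothetical.

```python
def collect_roles(chunks, placeholders):
    """Gather the non-placeholder heroes, villains and victims across chunks."""
    roles = {"hero": [], "villain": [], "victim": []}
    for chunk in chunks:
        triplet = chunk.get("triplet", {})
        for role, names in roles.items():
            name = triplet.get(role, "None")  # a missing role counts as "None"
            if name.strip() not in placeholders:
                names.append(name)
    return roles


sample_chunks = [
    {"triplet": {"hero": "Joe Biden", "villain": "None", "victim": "Retirees"}},
    {"triplet": {"hero": "Joe Biden"}},
]
print(collect_roles(sample_chunks, ["None", "N/A"]))
```

The returned dictionary has the same shape as `processing_result`: one list per role, with one entry for every chunk in which the model named an entity.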