# Stage 1: Role Labeling

One way to identify narratives in newspaper text is to consider the character archetypes used to frame an article. The main figures in an article may be cast as heroes, villains, or victims, guiding the reader to interpret the article through the qualities implicit in these archetypes. Gomez-Zara et al. present a dictionary-based method for computationally determining the hero, villain, and victim in a newspaper text, which Stammbach et al. adapt by using an LLM for the same task.

## Fetch Articles (for Testing)

In [1]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
from huggingface_hub import InferenceClient
from transformers import BertTokenizer
from utils.preprocessing import *
from utils.accelerators import *
from utils.multithreading import *
from utils.database import *
from utils.model import *
from utils.files import *
from datasets import Dataset
from rouge import Rouge
from tqdm import tqdm
import statistics
import hashlib
import random
import openai
import time
import math
import re



### Connect to Database

Credentials are sourced from the `.env` file.

In [2]:
_, db = getConnection(use_dotenv=True)
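
How `getConnection` reads the credentials is not shown in this notebook. As a rough sketch, assuming the `.env` file holds plain `KEY=VALUE` lines, such a file could be parsed with the standard library alone. The helper `loadEnvFile` and the variable names `DB_HOST` and `DB_NAME` are hypothetical and are not part of `utils.database`.

```python
import os
import tempfile


def loadEnvFile(path):
    """Parse simple KEY=VALUE lines from a .env-style file into a dict (hypothetical helper)."""
    values = {}
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            # Skip blank lines, comments, and lines without an assignment
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            values[key.strip()] = value.strip().strip('"').strip("'")
    return values


# Demonstration with a temporary file; the variable names are made up
with tempfile.NamedTemporaryFile("w", suffix=".env", delete=False) as tmp:
    tmp.write('# database settings\nDB_HOST=localhost\nDB_NAME="news"\n')

env = loadEnvFile(tmp.name)
os.remove(tmp.name)
print(env)
```

Keeping credentials in a file like this, outside the notebook, avoids committing them together with the code.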

## Load Model

Vicuna-13B is an open-source chatbot trained by fine-tuning LLaMA on user-shared conversations collected from ShareGPT. A preliminary evaluation using GPT-4 as a judge suggests that Vicuna-13B achieves more than 90% of the quality of OpenAI ChatGPT and Google Bard, and outperforms other models such as LLaMA and Stanford Alpaca in more than 90% of cases.

See:
* https://github.com/lm-sys/FastChat
* https://huggingface.co/lmsys/vicuna-13b-v1.5-16k

```bash
# Start the controller service
nohup python3 -m fastchat.serve.controller --host 0.0.0.0 --port 21001 &

# Start the model_worker service
nohup python3 -m fastchat.serve.model_worker --model-path lmsys/vicuna-13b-v1.5-16k --num-gpus 2 &

# Start the gradio_web_server service
nohup python3 -m fastchat.serve.gradio_web_server --host 0.0.0.0 --port 7860 &

# Launch the RESTful API server
nohup python3 -m fastchat.serve.openai_api_server --host 0.0.0.0 --port 8080 &
```

Test Model:

In [3]:
model = RemoteModel(model_name="vicuna-13b-v1.5-16k",
                    api_base="http://merkur72.inf.uni-konstanz.de:8080/v1",
                    api_key="EMPTY")

print(model.generateAnswer("Once upon a time"))

, there was a young girl named Sophie. Sophie lived in a small
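
`RemoteModel` comes from `utils.model`, and its internals are not shown here. Since the FastChat API server started above follows the OpenAI REST conventions, a call such as `generateAnswer` presumably ends up as a POST to a completions endpoint. The sketch below builds such a request with the standard library only. `buildCompletionRequest` is a hypothetical helper, the `localhost` address is an assumption, and the request is constructed but not sent.

```python
import json
import urllib.request


def buildCompletionRequest(api_base, model_name, prompt, max_tokens=16, api_key="EMPTY"):
    """Build (but do not send) a POST request for an OpenAI-style completions endpoint."""
    payload = {"model": model_name, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        url=api_base.rstrip("/") + "/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


request = buildCompletionRequest(
    api_base="http://localhost:8080/v1",  # assumed local address of the API server
    model_name="vicuna-13b-v1.5-16k",
    prompt="Once upon a time",
)
print(request.full_url)
print(json.loads(request.data))

# Sending the request requires a running server:
# with urllib.request.urlopen(request) as response:
#     text = json.load(response)["choices"][0]["text"]
```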


***

## Define Prompt Template

In [4]:
# PROMPT_TEMPLATE = "Please identify entities which are portrayed as hero, villain and victim in the following news article. A hero is an individual, organisation, or entity admired for their courage, noble qualities, and outstanding achievements. A villain is a character, organisation, or entity known for their wickedness or malicious actions, often serving as an antagonist in a story or narrative. A victim is an individual, organisation, or entity who suffers harm or adversity, often due to an external force or action. Every entity can only be one of those roles. The solution must be returned in this format {{hero: \"Name\", villain: \"Name\", victim: \"Name\"}}. Article Headline: ''{headline}''. Article Text: ''{article_text}''  Solution: "

# PROMPT_TEMPLATE = "Please identify entities which are portrayed as hero, villain and victim in the following news article. Every entity can only be one of those roles. If not existing return None as name. The solution must be returned in this format {{hero: \"Name\", villain: \"Name\", victim: \"Name\"}}. Article Headline: ''{headline}''. Article Text: ''{article_text}''  Solution: "

# PROMPT_TEMPLATE = "Please identify entities which are portrayed as hero, villain and victim in the following news article. Each entity can only assume one role. If none apply, use 'None'. The solution must be returned in this format {{hero: \"Name\", villain: \"Name\", victim: \"Name\"}}. Article Headline: ''{headline}''. Article Text: ''{article_text}''  Solution: "

PROMPT_TEMPLATE = "Given the news article below, identify entities categorized as a hero, villain, or victim. Each entity can only assume one role. If none apply, use 'None'. The solution must be provided in this format: {{hero: \"Name\", villain: \"Name\", victim: \"Name\"}}. \n Headline: '{headline}' \n Text: '{article_text}' \n Solution: "

# Test the template with a dummy text
prompt_test = PROMPT_TEMPLATE.format(headline = 'Lorem ipsum dolor sit amet, consectetur adipiscing elit.', article_text='Lorem ipsum dolor sit amet, consectetur adipiscing elit.')
print(prompt_test)


Given the news article below, identify entities categorized as a hero, villain, or victim. Each entity can only assume one role. If none apply, use 'None'. The solution must be provided in this format: {hero: "Name", villain: "Name", victim: "Name"}. 
 Headline: 'Lorem ipsum dolor sit amet, consectetur adipiscing elit.' 
 Text: 'Lorem ipsum dolor sit amet, consectetur adipiscing elit.' 
 Solution: 
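
The doubled braces in the template are what allow the literal `{hero: ...}` format specification to survive `str.format`. A small illustration with a shortened, made-up template shows that `{{` and `}}` are emitted as single braces, while braces inside the substituted values are inserted unchanged:

```python
# Doubled braces are emitted literally; single-brace fields are substituted
template = "Format: {{hero: \"Name\"}}. Headline: '{headline}'"

filled = template.format(headline="Example headline")
print(filled)
# → Format: {hero: "Name"}. Headline: 'Example headline'

# Braces inside a substituted value are not re-interpreted as fields
with_braces = template.format(headline="a {b} c")
print(with_braces)
# → Format: {hero: "Name"}. Headline: 'a {b} c'
```

Article text containing curly braces therefore does not need to be escaped before it is passed to `format`.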


## Define Parameters for Text Generation

In [5]:
params = {'do_sample': True,
        'early_stopping': True,
        # 'max_length': 100,
        # 'min_length': 1,
        'logprobs': 1,
        'n': 1,
        'best_of': 1,
        
        'num_beam_groups': 2,
        'num_beams': 5,
        'num_return_sequences': 5,
        'max_tokens': 50,
        'min_tokens': 0,
        'output_scores': True,
        'repetition_penalty': 1.0,
        'temperature': 0.6,
        'top_k': 50,
        'top_p': 1.0 
        }

## Define Helper Functions

In [6]:
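def extractTriplet(answer):
    """Extracts the role triplet (hero, villain, victim) from the answer string."""

    # First pass: capture each key together with its quoted value, e.g. hero: "Name".
    # Matching key and value in one pattern keeps them aligned even if some
    # values are unquoted or the answer contains other quoted strings.
    pairs = re.findall(r'(\w+):\s*"(.*?)"', answer)
    result = dict(pairs)

    # Fallback: unquoted values, e.g. {hero: Name, villain: None, victim: None}
    if not result:
        pairs = re.findall(r'(\w+):\s*([^,]+)', answer)
        # Strip whitespace, quotes, and a trailing closing brace from the values
        result = {k: v.strip(' "{}') for k, v in pairs}

    return result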
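
The prompt asks the model to use 'None' when no entity fits a role, and a generated answer may also leave out a role entirely. A possible post-processing step, sketched here under that assumption, maps such cases to Python `None` and guarantees that all three role keys are present. `normalizeTriplet` and `ROLES` are hypothetical names that do not appear elsewhere in this notebook.

```python
ROLES = ("hero", "villain", "victim")


def normalizeTriplet(triplet):
    """Return a dict with exactly the three role keys; missing or 'None' values become None."""
    normalized = {}
    for role in ROLES:
        value = triplet.get(role)
        if value is None or value.strip().lower() in ("", "none"):
            normalized[role] = None
        else:
            normalized[role] = value.strip()
    return normalized


print(normalizeTriplet({"hero": "Firefighters", "villain": "None"}))
# → {'hero': 'Firefighters', 'villain': None, 'victim': None}
```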

In [7]:
def getAnswersTriplets(article, model, template, params):
    """ Generates answers for the given article using the model and template. """

    # Extract the article headline and text
    article_headline = article.get("title", "")
    article_text = article.get("parsing_result").get("text")

    # Generate the answer
    prompt = template.format(headline = article_headline, article_text = article_text)
    answer = model.generateAnswer(prompt, params)

    return answer

In [8]:
def splitText(text, n_tokens, tokenizer, overlap=10):
    """Splits the input text into chunks with n_tokens tokens using HuggingFace tokenizer, 
    with an overlap of overlap tokens from the previous and the next chunks."""
    
    tokens = tokenizer.tokenize(text)
    chunks = []
    i = 0

    # No previous chunk at the beginning, so no need for overlap
    chunks.append(tokenizer.convert_tokens_to_string(tokens[i:i+n_tokens]))
    i += n_tokens

    while i < len(tokens):
        # Now, we include overlap from the previous chunk
        start_index = i - overlap
        end_index = start_index + n_tokens
        chunk = tokens[start_index:end_index]
        chunks.append(tokenizer.convert_tokens_to_string(chunk))
        i += n_tokens - overlap  # Moving the index to account for the next overlap

    return chunks
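
To make the stepping logic of `splitText` easier to follow, the sketch below computes only the token index ranges, without a tokenizer. `chunkSpans` is a hypothetical helper written for illustration. The guard on `overlap` is an addition: with `overlap >= n_tokens` the index in the loop above would never advance.

```python
def chunkSpans(n_total, n_tokens, overlap):
    """Return (start, end) token index pairs following the stepping logic of splitText."""
    if not 0 <= overlap < n_tokens:
        raise ValueError("overlap must be non-negative and smaller than n_tokens")

    spans = [(0, min(n_tokens, n_total))]  # first chunk has no preceding overlap
    i = n_tokens
    while i < n_total:
        start = i - overlap  # step back to re-include the end of the previous chunk
        spans.append((start, min(start + n_tokens, n_total)))
        i += n_tokens - overlap
    return spans


print(chunkSpans(10, 4, 1))
# → [(0, 4), (3, 7), (6, 10)]
print(len(chunkSpans(3000, 1024, 64)))
# → 4
```

Each chunk after the first begins `overlap` tokens before the end of the previous one, so a 3000-token text with 1024-token chunks and an overlap of 64 results in four model calls.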

In [9]:
def processBatch(articles, model, template, params, chunk_size=1024, overlap=256, show_progress=False, verbose=False):
    """Processes a batch of articles and extracts the triplets."""
    runtimes = []  # List to store the runtime for each article

    # Iterate over the articles
    for article in tqdm(articles, desc="Generating answers", disable=not show_progress):
        start_time = time.time()  # Start the timer

        # Extract the article headline and text
        article_headline = article.get("title", "")
        article_text = article.get("parsing_result").get("text")

        # Split the article text into chunks
        chunks = splitText(article_text, chunk_size,
                            model.tokenizer, overlap=overlap)

        # print("Chunks:", len(chunks))

        chunk_results = []
        for chunk_id, chunk in enumerate(chunks):

            if verbose:
                print("Chunk:", chunk_id)
                print("Chunk Length:", calcInputLength(model.tokenizer, chunk))
            # print("Headline Length:", calcInputLength(
            #     tokenizer, article_headline))

            prompt = template.format(
                headline=article_headline, article_text=chunk)
            answer = model.generateAnswer(prompt, params)
            triplet = extractTriplet(answer)

            # print(chunk_id, "Answer:", triplet, "Type:", type(triplet))
            results = {
                "chunk_id": chunk_id,
                "chunk": chunk,
                "answer": answer,
                "triplet": triplet
            }
            chunk_results.append(results)

        article["triplets"] = chunk_results

        end_time = time.time()  # End the timer
        runtime = end_time - start_time  # Calculate the runtime
        runtimes.append(runtime)  # Store the runtime

    return articles, runtimes
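
`processBatch` returns the per-article runtimes, but they are not used further in this notebook. One way to summarize them, sketched with the standard `statistics` module, is shown below. `summarizeRuntimes` is a hypothetical helper, and the input values are made up for illustration rather than measured.

```python
import statistics


def summarizeRuntimes(runtimes):
    """Return count, mean, median, and total of per-article runtimes in seconds."""
    if not runtimes:
        return {"count": 0, "mean": None, "median": None, "total": 0.0}
    return {
        "count": len(runtimes),
        "mean": statistics.mean(runtimes),
        "median": statistics.median(runtimes),
        "total": sum(runtimes),
    }


# Made-up runtimes for illustration, not measurements from this notebook
summary = summarizeRuntimes([6.0, 7.0, 8.0])
print(summary)
```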

In [10]:
def updateArticle(db, id: str, values: dict = None, collection="articles"):
    """Updates an article document in the database."""
    pass # TODO: Uncomment to update the database
    # Requires `from bson import ObjectId` unless utils.database already exports it
    # filter = {"_id": ObjectId(id)}
    # values = {"$set": {**(values or {})}}
    # r = db[collection].update_one(filter, values)
    # return r

In [11]:
def updateArticles(db, articles):
    """Updates the articles in the database."""

    for article in tqdm(articles, desc="Uploading results"):
        id = article.get("_id")
        values = {"triplets": article.get("triplets", [])}
        #updateArticle(db, id, values) # TODO: Uncomment to update the database

## Make Predictions

In [12]:
collection = "articles"
fields = {"url": 1, "title": 1, "parsing_result.text": 1}
query = {"processing_result": {"$exists": False}, 
         "parsing_result.text_length": {"$lt": 10000}}

In [13]:
LIMIT = 10 # Number of articles to process in each batch
CHUNK_SIZE = 1024 # Number of tokens in each chunk
OVERLAP = 64 # Number of overlapping tokens between chunks
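
Before running the loop, it may help to spell out which documents the `query` filter defined above selects. Assuming `fetchArticleTexts` passes it to a MongoDB `find` call, the filter keeps articles that have no `processing_result` field and whose `parsing_result.text_length` is below 10000. The plain-Python predicate below mimics that behavior on made-up documents; `matchesQuery` is a hypothetical helper used only for illustration.

```python
def matchesQuery(document, max_length=10000):
    """Mimic the filter in plain Python: no `processing_result`, text_length below max_length."""
    if "processing_result" in document:
        return False
    text_length = document.get("parsing_result", {}).get("text_length")
    # A `$lt` comparison does not match documents where the field is missing
    return isinstance(text_length, (int, float)) and text_length < max_length


documents = [
    {"title": "A", "parsing_result": {"text_length": 500}},
    {"title": "B", "parsing_result": {"text_length": 20000}},
    {"title": "C", "parsing_result": {"text_length": 500}, "processing_result": {}},
    {"title": "D", "parsing_result": {}},
]
selected = [d["title"] for d in documents if matchesQuery(d)]
print(selected)
# → ['A']
```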

In [14]:
batch_id = 0

while True:
    print(f"------ Batch {batch_id} ------")

    # Fetch the next batch of articles
    articles = fetchArticleTexts(db, LIMIT, 0, fields, query)
    
    # Stop if no more articles are available
    if not articles:
        break
    
    # Process the batch of articles
    articles, runtimes = processBatch(articles, model, PROMPT_TEMPLATE, params, chunk_size=CHUNK_SIZE, overlap=OVERLAP, show_progress=True, verbose=False)

    # Update the articles in the database
    #updateArticles(db, articles)
    print(f"Updated {len(articles)} articles", end="\n\n")

    batch_id += 1
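    # NOTE: `query` filters on `processing_result`, but updateArticles() writes `triplets`;
    # make sure processed articles drop out of the query before removing this break.
    break # TODO: Remove to process all batches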

------ Batch 0 ------


Generating answers: 100%|██████████| 10/10 [01:08<00:00,  6.83s/it]

Updated 10 articles




