## Llama 13b for validation task

* input: dataframe with text pairs and additional information that Llama 2 should use to make an informed decision
* output: dataframe with additional columns: topic, topic match evaluation, topic match classification, news event, news event match evaluation, news event match classification, and the final classification for topic-level and news-event-level matching


In [1]:
import torch
import transformers
from transformers import AutoTokenizer, AutoModelForCausalLM
from torch import cuda, bfloat16
import pandas as pd
from langchain import PromptTemplate
from langchain.chains import LLMChain
from langchain.chains import SequentialChain
from transformers import pipeline
from langchain.llms import HuggingFacePipeline
import os

In [3]:
torch.clear_autocast_cache()  # call the function; without parentheses it is only referenced, not executed

In [11]:
hf_auth_file = 'analysis/hf_auth.txt'

In [4]:
# Read the API token from the file
with open(hf_auth_file, "r") as file:
    hf_auth = file.read().strip()  # Remove leading/trailing whitespaces

In [5]:
model_id = 'meta-llama/Llama-2-13b-chat-hf'

device = f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu'

# set quantization configuration to load large model with less GPU memory
# this requires the `bitsandbytes` library
bnb_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=bfloat16
)

# begin initializing HF items, need auth token for these

model_config = transformers.AutoConfig.from_pretrained(
    model_id,
    use_auth_token=hf_auth
)

model = transformers.AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    config=model_config,
    quantization_config=bnb_config,
    device_map='auto',
    use_auth_token=hf_auth
)
model.eval()
print(f"Model loaded on {device}")



Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

Model loaded on cuda:0


In [7]:
tokenizer = transformers.AutoTokenizer.from_pretrained(
    model_id,
    use_auth_token=hf_auth
)

In [8]:
device = torch.device('cuda')
print("GPU Name:", torch.cuda.get_device_name(device))
print("Memory Usage:", torch.cuda.memory_allocated(device) / 1024 ** 3, "GB")
print("Max Memory Usage:", torch.cuda.max_memory_allocated(device) / 1024 ** 3, "GB")

GPU Name: NVIDIA A10
Memory Usage: 3.559241771697998 GB
Max Memory Usage: 3.604752540588379 GB


In [9]:
generate_text = transformers.pipeline(
    model=model, tokenizer=tokenizer,
    return_full_text=True,  # langchain expects the full text
    task='text-generation',
    # we pass model parameters here too
    # 'randomness' of outputs; set to the minimum (0.0) because we want
    # the model to stick to the prompt as closely as possible
    temperature=0.0,
    max_new_tokens=512,  # max number of tokens to generate in the output
    repetition_penalty=1.1  # without this output begins repeating
)

llm = HuggingFacePipeline(pipeline=generate_text)

### Read in file

This file is used for prompt engineering: we divide it into smaller samples for tuning our prompts. The data contains 100 rows, which we split into 20 DataFrames of 5 rows each for quick testing.


In [13]:
# Function to navigate up 'n' levels
def navigate_up(current_directory, levels):
    for _ in range(levels):
        current_directory = os.path.dirname(current_directory)
    return current_directory

# Get the current working directory
current_directory = os.getcwd()

# Specify the number of levels to navigate up (4 levels in this case)
levels_to_navigate = 4

# Navigate up 'levels_to_navigate' folders
parent_directory = navigate_up(current_directory, levels_to_navigate)

# Define the path to the data file
file_path = os.path.join(parent_directory, 'newspaper_data', 'sample_1percent.csv')

# Now you can open and read the CSV file using pandas (already imported above)

df = pd.read_csv(file_path)

In [15]:
# Split the DataFrame into 20 smaller DataFrames for the sake of fast tuning of prompting, each containing 5 rows
chunk_size = 5
chunks = [df.iloc[i:i+chunk_size] for i in range(0, len(df), chunk_size)]

# Create variables for each smaller DataFrame
for i, chunk in enumerate(chunks):
    globals()[f'df{i + 1}'] = chunk

# Now you have variables df1, df2, df3, ... containing the smaller DataFrames
# You can access and work with them as needed
df20
#df2
#....


Unnamed: 0,Similarity_Score,Text1,Text2,Group,Date1,Date2,Publisher1,Publisher2,ID1,ID2,proper_nouns1,proper_nouns2,keywords1,keywords2
95,0.777635,"Een sterkere EU graag, maar nu is dat even bij...",En toch is Rutte erbij in verkiezingsdebat van...,high,2021-03-01 00:00:00,2021-03-01 00:00:00,Trouw,De Volkskrant,3285325,3285230,"PVV, D66, VVD","WNL, CDA'er Pieter Omtzigt, Tamara van Ark, Je...","['bikkelharde', 'mondkapjes', 'europese', 'opr...","['oneliners', 'ludieke', 'omtzigt', 'pitbull',..."
96,0.772102,"Makers van vaccins redden de mens, niet de bel...",Een vraag van een slachtoffer van de toeslagen...,high,2021-02-28 00:00:00,2021-02-28 22:15:58,Het Financieele Dagblad,NOS liveblog,3290697,3287475,"Makers van vaccins redden de mens, Martin Shkr...","Kristie, Jullie, Kristie, Jullie","['farmaceut', 'dwanglicenties', 'pharmaceutica...","['toeslagenaffaire', 'schandvlek', 'kristie', ..."
97,0.866637,Opstand begin van corona-uitbraak; Opstand beg...,Europese Volt rekent op een zetel in Den Haag...,high,2021-03-01 00:00:00,2021-03-01 00:00:00,De Telegraaf,NRC Handelsblad,3286378,3285633,"Opstand, Ter Apel, FNV, Sander Dekker, PI, Ter...","Europese Volt, Volt, Jan Boekestijn, Sander Sc...","['indigd', 'coronavirus', 'sander', 'bajes', '...","['grensoverschrijdende', 'schimmelpenninck', '..."
98,0.833621,Economen: hogere belastingen jagen bedrijven w...,"Meeste partijen willen hoger minimumloon, maar...",high,2021-03-01 00:00:00,2021-03-01 00:00:00,Het Financieele Dagblad,Het Financieele Dagblad,3290606,3290622,"Economen, Jakob de Haan, VVD, Centraal Planbur...","VVD, CDA, D66, ChristenUnie, CDA, SP, VVD, Gro...","['verslechtert', 'vestigingsklimaat', 'verkiez...","['bijstandsuitkeringen', '50pluss', 'maandloon..."


#  Prompting 

Prompting takes the shape of many sequential instructions. We divide each prompt into a system prompt, an example prompt, and a main prompt to generate a template for each subtask. We begin with the broadest level, the topic-level matching task. This task is itself divided into three separate subtasks: (1) create topic labels for each text, (2) compare the topic labels to decide to what extent they match, and (3) based on the explanation, create a single classification: topic match or no topic match.
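The three-part layout described above can be sketched in plain Python. This is a minimal illustration, not the notebook's actual prompts: the strings `system_part`, `example_part`, and `main_part` are shortened placeholders, and `str.format` stands in for the placeholder filling that LangChain's `PromptTemplate` performs later.

```python
# Shortened, hypothetical stand-ins for the three prompt parts.
system_part = "<s>[INST] <<SYS>>\nYou label topics.\n<</SYS>>\n"
example_part = "Example input text\n[/INST] Main topic 1: Politics; Subtopic: Elections\n"
main_part = "[INST]\nTexts: {text1} and {text2}\n[/INST]\n"

# Concatenate the parts in the same order as the notebook does.
template = system_part + example_part + main_part

# Fill the placeholders with the values of one row.
filled = template.format(text1="first article", text2="second article")
print(filled)
```

One consequence of this design is that any literal curly brace in the system or example part would be read as a placeholder, so such braces need to be doubled (`{{` and `}}`).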


## Step 1: Extract topics. 
The prompt template is based on Grootendorst's BERTopic Llama 2 implementation, with an example from our full dataset.
* Note that for each step we pass in a system prompt, give an example, and provide a main prompt that specifies the variables and content to be considered.
* We then create a chain from the prompt for further sequential chaining with LangChain.
* It is very important to initially extract both the main topic and the subtopic in order to obtain clear topics. If subtopics are not requested, the model might not understand that a text mentioning politicians and conspiracy theories belongs to the broader topic of politics, and may instead label it as conspiracy theories alone.


In [17]:
# System prompt: describes the information given to all conversations in this step
system_prompt_1 = """
<s>[INST] <<SYS>>
You are a helpful, respectful and honest assistant for labeling topics. A "topic" is a fundamental subject or theme that encompasses all aspects related to a particular area of interest or discussion. 
A topic serves as the overarching framework for exploring and discussing various facets within that subject. A topic consists of a main topic and a subtopic. A main topic is an overarching theme; a subtopic is a more specific thematic or content-based division within a broader main topic. All main topics are labeled Politics if the documents' keywords and proper nouns relate to politics. For instance, if the text discusses the economy but a politician, party, or government is mentioned either in the text or in the keywords, then it should be categorized as Politics and not Economy. \n
Main topic: Politics; Subtopic: Elections and campaigns
Main topic: Economy; Subtopic: Interest rates
Main topic: Health; Subtopic: Mental health
Main topic: Entertainment; Subtopic: Film and Television \n

If a text mentions politics, politicians' names or functions, parties, policy, or any other politics-related term, the main topic should always be Politics.\n


You must always return a main topic and a subtopic and nothing else in the following format: Main topic1 : Subtopic;, Main topic2: Subtopic \n
Do not return any notes. Only return the label and nothing more for each text.

<</SYS>>
"""

In [23]:
# Example prompt demonstrating the output we are looking for
example_prompt_1 = """
I have a document pair of the following texts:\n
- Contact met de kiezer; Deze keer zie je geen lijsttrekkers in windjacks rondlopen op markten. Kandidaat-Kamerleden moeten noodgedwongen online contact zoeken met de kiezers. Zoals de lijsttrekker van de ChristenUnie, Gert-Jan Segers, te zien is op de bovenstaande afbeelding terwijl hij vragen beantwoordt die kiezers hem stellen op het online platform Instagram. In plaats van direct met de burgers te praten, spreken politici nu voor de camera's. Livestreams op Facebook zijn ook populair. Zo had Mark Rutte dit weekend een gesprek met horeca-ondernemers en zond de VVD dat uit op Facebook. Naast de online campagne werd er dit weekend ook op de ouderwetse manier geflyerd. Maar de meeste campagnevoerende partijleden gingen niet aanbellen uit angst voor verdere verspreiding van het coronavirus. Forum voor Democratie was de enige partij die de straat op ging om campagne te voeren. Met een vrijheidskaravaan bezocht de partij Nijmegen en Venlo voor een manifestatie. Toen er meer dan tweehonderd mensen kwamen opdagen, moest burgemeester Hubert Bruls de bijeenkomst voortijdig beëindigen, hoewel deze wel was aangekondigd en aangevraagd. Een bezoeker in Venlo twitterde dat het centrale plein voor het eerst sinds carnaval vorig jaar weer vol stond met de komst van Baudet.
- Forum voor Democratie op zoek naar extra stemmen; Terwijl andere partijen zich nauwelijks op straat vertonen, reist Forum voor Democratie stad en land af. Deze optredens trekken niet alleen de aandacht van kiezers; het Openbaar Ministerie onderzoekt nu ook of het campagneteam van Baudet de coronaregels op grote schaal heeft overtreden. Volgens getuigen werden bij een bezoek aan Urk honderden handen geschud. En dan was er nog de volmachtrel. In een live-uitzending riep Baudet zijn kijkers op om zoveel mogelijk volmachtstemmen te regelen, aangezien kiezers dit jaar niet twee, maar drie volmachtstemmen mogen uitbrengen om de kans op besmetting te verkleinen. "Een persoon kan eigenlijk vier keer stemmen, als je maar die volmachten kunt regelen," zei Baudet, en dat was een enorme kans. Maar het ministerie van Binnenlandse Zaken zei: "Ho, dat is niet de bedoeling en is niet toegestaan." Het campagneteam van Forum leek daar toen al achter te komen, want de suggestie om stemmen te regelen werd snel uit de video van Baudet geknipt. Baudet houdt echter wel openlijk vast aan zijn standpunt dat er een grote kans is op verkiezingsfraude, maar dan door anderen, uiteraard.

The topic of each text is described by the following keywords: 'livesessies', 'vrijheidskaravaan', 'flyerende', 'facebook', 'windjacks'; 'besmettingskansen', 'volmachtsstemmen', 'schielijk', 'volmachten', 'baudets'
The following proper nouns appear in each text: Gert-Jan Segers, Mark Rutte, Forum voor Democratie, Hubert Bruls, Baudet; Forum voor Democratie, Forum voor Democratie, Ministerie kijkt, Urk, Baudet, Baudet, Baudet, Baudet

Based on the information about the topic above, please create a short label of the topic for each text. Only return the label and nothing more for each text in the following format:

[/INST] Main topic 1: Politics; Subtopic: Elections and campaigns; Main topic 2: Politics; Subtopic: Elections, campaigns and fraud 

"""

In [24]:
main_prompt_1 = """
[INST]
I have a document pair of the following texts:
{text1} and {text2}

The topic of each text is described by the following keywords: {keywords1} and {keywords2}
The following proper nouns appear in each text: {proper_nouns1}, {proper_nouns2}

Based on the information about the topic above, please create a short label of this topic for each text. Only return the label and nothing more for each text in the following format: Main topic 1 : Subtopic ; Main topic 2: Subtopic
[/INST]
"""

In [25]:
prompt_1 = system_prompt_1 + example_prompt_1 + main_prompt_1

In [26]:
# Create a PromptTemplate instance
prompt_template = PromptTemplate(
    input_variables=["text1", "text2", 'proper_nouns1', 'proper_nouns2', 'keywords1', 'keywords2'],
    template=prompt_1
)

# Create the LLMChain instance
chain_1 = LLMChain(llm = llm, prompt = prompt_template, output_key="topics")

### Step 1.1: Extract main topic from topics for matching

This step is required; otherwise the chain treats subtopics as the level of match and disregards pairs that match on a broad level.
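Because the Step 1 output follows a fixed `Main topic n: ...; Subtopic: ...` pattern, the LLM-based extraction can be cross-checked with a regular expression. The sketch below is an assumption for illustration only: `extract_main_topics` is a hypothetical helper that is not part of the notebook, and it only works when the model keeps to the requested format.

```python
import re

def extract_main_topics(topics: str) -> list[str]:
    """Return the main-topic labels found in a Step 1 style string."""
    # Capture whatever follows 'Main topic <n>:' up to the next ';' or line end.
    pattern = re.compile(r"Main topic\s*\d*\s*:\s*([^;\n]+)", flags=re.IGNORECASE)
    return [match.strip() for match in pattern.findall(topics)]

example = ("Main topic 1: Politics; Subtopic: Elections and campaigns; "
           "Main topic 2: Politics; Subtopic: Elections, campaigns and fraud")
print(extract_main_topics(example))  # ['Politics', 'Politics']
```

Comparing this result with the `main_topic` output of the chain is one way to spot rows where the model drifted from the format.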

In [28]:
# System prompt: describes the information given to all conversations in this step
system_prompt_1_1 = """
<s>[INST] <<SYS>>
You are a helpful, respectful and honest assistant for extracting the main topic from a topic. 
A topic consists of a main topic and a subtopic. A main topic is an overarching theme; a subtopic is a more specific thematic or content-based division within a broader main topic.
Main topic: Economy; Subtopic: Interest rates
Main topic: Health; Subtopic: Mental health
Main topic: Entertainment; Subtopic: Film and Television 

A main topic is everything before the word 'Subtopic'
Given a topic, you must always return the main topic and nothing else in the following format: Main topic 1: ; Main topic 2: 
Only return the main topic label and nothing more for each text.

<</SYS>>
"""

In [29]:
# Example prompt demonstrating the output we are looking for
example_prompt_1_1 = """
I have a pair of topics:
Main topic 1: Politics; Subtopic: Elections and campaigns; \n
Main topic 2: Politics; Subtopic: Elections, campaigns and fraud \n

Based on the information about the topic above, please extract the main topic from each topic. A main topic is everything before the word 'Subtopic'. In this case this word is Politics. Only return the label of the main topic and nothing more in the following format:

[/INST] Main topic 1: Politics; Main topic 2: Politics

"""

In [30]:
main_prompt_1_1 = """
[INST]
I have a pair of topics:
{topics}

Based on the information about the topic above, please extract the main topic from each topic. Only return the label of the main topic and nothing more in the following format:
Main topic 1: ; Main topic 2: 
[/INST]
"""

In [31]:
prompt_1_1 = system_prompt_1_1 + example_prompt_1_1 + main_prompt_1_1

In [32]:
# Create a PromptTemplate instance
prompt_template_1 = PromptTemplate(
    input_variables=["topics"],
    template=prompt_1_1
)

# Create the LLMChain instance
chain_1_1 = LLMChain(llm = llm, prompt = prompt_template_1, output_key="main_topic")

## Step 2: Evaluate topic level match
Compare the topics of the text pairs and make an evaluation of the match level.  

* We create a second chain for this task that takes the extracted main topics as input.

In [33]:
# Change the system prompt from the default one to a specific one in order to focus the model on a single task.
system_prompt_2 = """
<s>[INST] <<SYS>>
You are a helpful, respectful, and honest assistant for comparing the main topics of two texts. In this comparison, a match is solely based on the main topic and nothing else.
<</SYS>>
"""

In [34]:
# Example prompt demonstrating the output we are looking for
example_prompt_2 = """

The main topic of each text is described by the following labels:

Main topic 1: Politics;  
Main topic 2: Politics; 


Based on the information about the main topics above, please write a short evaluation about whether the two texts match on a main topic level. Make sure to only return the evaluation and nothing more in the following format:

[/INST] Evaluation: Yes, the two texts match on a main topic level because both texts touch upon the broader context of Politics. 
"""

In [35]:
#main prompt describing the task once more and adding the input variables to be considered
main_prompt_2 = """
[INST]

The main topic of each text is the following: 
{main_topic}

Based on the information about the topics above, please write a short evaluation about whether the two texts match on a main topic level. Make sure to only return the evaluation and nothing more in the following format:
Evaluation:
[/INST] 
"""

In [36]:
prompt_2 = system_prompt_2 + example_prompt_2 + main_prompt_2

In [37]:
prompt_template_2 = PromptTemplate(input_variables=["main_topic"], template=prompt_2)
chain_2 = LLMChain(llm = llm, prompt = prompt_template_2, output_key="topic_evaluation")


## Step 3: Create classification label based on evaluation
Provide a single label for the match level.

* We create a third chain for this task that takes the evaluation from Step 2 as input.
* Labels: 0 - no match, 1 - topic match
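The classification chain is asked to return only the label, but a reply may still arrive as `1`, as `1 - topic match`, or with surrounding whitespace. A minimal sketch of normalising such replies follows; `parse_match_label` is a hypothetical helper, not part of the notebook, and it deliberately returns `None` when the reply is ambiguous.

```python
import re
from typing import Optional

def parse_match_label(raw: str) -> Optional[int]:
    """Map a model reply such as '1 - topic match' to 0 or 1; None if unclear."""
    # Look for a standalone 0 or 1 anywhere in the reply.
    found = re.findall(r"\b([01])\b", raw)
    # Accept only unambiguous replies: exactly one distinct label present.
    if len(set(found)) == 1:
        return int(found[0])
    return None

print(parse_match_label(" 1 - topic match"))  # 1
print(parse_match_label("0 - no match"))      # 0
print(parse_match_label("either 0 or 1"))     # None
```

Rows that come back as `None` can then be inspected by hand instead of being silently counted as a match or a non-match.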

In [38]:
# System prompt: describes the information given to all conversations in this step
system_prompt_3 = """
<s>[INST] <<SYS>>
You are a helpful, respectful and honest assistant for classifying whether two texts match on a main topic level based on an evaluation provided. 

At each request explicitly assign one of the two labels below. 
0 - no match
1 - topic match

Make sure to only return the label and nothing else.
<</SYS>>
"""

In [39]:
# Example prompt demonstrating the output we are looking for
example_prompt_3 = """

The following evaluation describes the topic match level:
Yes, the two texts match on a main topic level. Both texts touch upon the broader context of Politics. 
Based on this information, please assign either '0 - no match' or '1 - topic match'. Make sure to only return the label and nothing more in the following format:

[/INST]: 1 
"""

In [40]:
# Main prompt: inserts the evaluation produced in Step 2 ({topic_evaluation})
main_prompt_3 = """
[INST]

The following evaluation describes the topic match level:
{topic_evaluation}

Based on this information, please assign either '0 - no match' or '1 - topic match'. Make sure to only return the label and nothing more in the following format:
[/INST] 
"""

In [41]:
prompt_3 = system_prompt_3 + example_prompt_3 + main_prompt_3

In [42]:
prompt_template_3 = PromptTemplate(input_variables=["topic_evaluation"], template=prompt_3)
chain_3 = LLMChain(llm = llm, prompt = prompt_template_3, output_key="match_topic")


## Step 4: Identify news events
* We ask the model to identify the news event described in each text.
* The input data remains the same, plus the topics extracted in Step 1.
* This prepares for assessing news-event-level matching, analogous to topic-level matching.

In [83]:
# System prompt: describes the information given to all conversations in this step
system_prompt_4 = """
<s>[INST] <<SYS>>
You are a helpful, respectful and honest assistant for identifying the news event described in a pair of documents. 
News events are specific events that lead to news coverage, such as a specific debate on a specific day in a specific parliament, a specific accident, or a specific football match. They can be covered by one or more articles in one or more outlets, but relate to one specific and identifiable event and are thus much more fine-grained than news topics, issues, or news categories.
News events can span over multiple days but not more than 10 days. Therefore articles that cover the same news event are published within the same few hours and in the course of a few days. 

Provide your answer as an explanation in maximum 100 tokens. Make sure to only return the news event identified and nothing else. Be very specific, name actors and places where possible. 
<</SYS>>
"""

In [84]:
# Example prompt demonstrating the output we are looking for
example_prompt_4 = """
I have a document pair of the following texts:
- Contact met de kiezer; Deze keer zie je geen lijsttrekkers in windjacks rondlopen op markten. Kandidaat-Kamerleden moeten noodgedwongen online contact zoeken met de kiezers. Zoals de lijsttrekker van de ChristenUnie, Gert-Jan Segers, te zien is op de bovenstaande afbeelding terwijl hij vragen beantwoordt die kiezers hem stellen op het online platform Instagram. In plaats van direct met de burgers te praten, spreken politici nu voor de camera's. Livestreams op Facebook zijn ook populair. Zo had Mark Rutte dit weekend een gesprek met horeca-ondernemers en zond de VVD dat uit op Facebook. Naast de online campagne werd er dit weekend ook op de ouderwetse manier geflyerd. Maar de meeste campagnevoerende partijleden gingen niet aanbellen uit angst voor verdere verspreiding van het coronavirus. Forum voor Democratie was de enige partij die de straat op ging om campagne te voeren. Met een vrijheidskaravaan bezocht de partij Nijmegen en Venlo voor een manifestatie. Toen er meer dan tweehonderd mensen kwamen opdagen, moest burgemeester Hubert Bruls de bijeenkomst voortijdig beëindigen, hoewel deze wel was aangekondigd en aangevraagd. Een bezoeker in Venlo twitterde dat het centrale plein voor het eerst sinds carnaval vorig jaar weer vol stond met de komst van Baudet.
- Forum voor Democratie op zoek naar extra stemmen; Terwijl andere partijen zich nauwelijks op straat vertonen, reist Forum voor Democratie stad en land af. Deze optredens trekken niet alleen de aandacht van kiezers; het Openbaar Ministerie onderzoekt nu ook of het campagneteam van Baudet de coronaregels op grote schaal heeft overtreden. Volgens getuigen werden bij een bezoek aan Urk honderden handen geschud. En dan was er nog de volmachtrel. In een live-uitzending riep Baudet zijn kijkers op om zoveel mogelijk volmachtstemmen te regelen, aangezien kiezers dit jaar niet twee, maar drie volmachtstemmen mogen uitbrengen om de kans op besmetting te verkleinen. "Een persoon kan eigenlijk vier keer stemmen, als je maar die volmachten kunt regelen," zei Baudet, en dat was een enorme kans. Maar het ministerie van Binnenlandse Zaken zei: "Ho, dat is niet de bedoeling en is niet toegestaan." Het campagneteam van Forum leek daar toen al achter te komen, want de suggestie om stemmen te regelen werd snel uit de video van Baudet geknipt. Baudet houdt echter wel openlijk vast aan zijn standpunt dat er een grote kans is op verkiezingsfraude, maar dan door anderen, uiteraard.\n

The following keywords appear in each text: 'livesessies', 'vrijheidskaravaan', 'flyerende', 'facebook', 'windjacks'; 'besmettingskansen', 'volmachtsstemmen', 'schielijk', 'volmachten', 'baudets'
The following proper nouns appear in each text: Gert-Jan Segers, Mark Rutte, Forum voor Democratie, Hubert Bruls, Baudet; Forum voor Democratie, Forum voor Democratie, Ministerie kijkt, Urk, Baudet, Baudet, Baudet, Baudet

The topic of each text is the following:
Main topic 1: Politics; Subtopic: Elections and campaigns; \n
Main topic 2: Politics; Subtopic: Elections, campaigns and fraud \n

Based on the information above, please identify the news events that describe the texts. Be as specific as possible. Make sure to only return the events and nothing more for each text in the following format:

[/INST] Event 1: Most political parties are shifting their campaign activity strategies due to the COVID-19 pandemic; Event 2: Forum voor Democratie party's campaign activities violate COVID-19 regulations while other parties have more pandemic-proof strategies. 

"""

In [85]:
# Main prompt: restates the task and inserts the input variables
main_prompt_4 = """
[INST]
I have a document pair of the following texts:
{text1} and {text2}

The following keywords appear in each text: {keywords1} and {keywords2}
The following proper nouns appear in each text: {proper_nouns1}, {proper_nouns2}

The topic of each text is the following:
{topics}

Based on the information above, please identify the news events that describe the texts. Make sure to only return the news events and nothing more in the following format:
[/INST] 

"""

In [86]:
prompt_4 = system_prompt_4 + example_prompt_4 + main_prompt_4

In [87]:
prompt_template_4 = PromptTemplate(input_variables=["text1", "text2", "proper_nouns1", "proper_nouns2", "keywords1", "keywords2", "topics"], template=prompt_4)
chain_4 = LLMChain(llm = llm, prompt = prompt_template_4, output_key="news_events")

## Step 5: Evaluate news event level match

* We ask the model to evaluate whether the identified news events match.
* The input is the news events from Step 4 plus the publication dates.
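The example prompt for this step writes dates as `01/03/2021`, while the `Date1` and `Date2` columns shown earlier hold timestamps such as `2021-03-01 00:00:00`. The sketch below shows one way to align the two and to check the ten-day window mentioned in the system prompt. It assumes the dates arrive as strings in exactly that timestamp format; `parse_date`, `prompt_date`, and `day_gap` are hypothetical helpers, not part of the notebook.

```python
from datetime import datetime

def parse_date(value: str) -> datetime:
    """Parse a timestamp shaped like '2021-03-01 00:00:00'."""
    return datetime.strptime(value, "%Y-%m-%d %H:%M:%S")

def prompt_date(value: str) -> str:
    """Format a timestamp as dd/mm/yyyy, as in the Step 5 example prompt."""
    return parse_date(value).strftime("%d/%m/%Y")

def day_gap(date1: str, date2: str) -> int:
    """Absolute difference between two timestamps, in whole days."""
    # Take abs() of the timedelta before reading .days, so that a gap of
    # less than one day counts as 0 regardless of the argument order.
    return abs(parse_date(date1) - parse_date(date2)).days

print(prompt_date("2021-02-28 22:15:58"))                     # 28/02/2021
print(day_gap("2021-02-28 00:00:00", "2021-02-28 22:15:58"))  # 0
print(day_gap("2021-03-12 00:00:00", "2021-03-01 00:00:00"))  # 11
```

A gap above ten days could be passed to the model as extra context or used to flag pairs for manual review; which of the two fits better is a design decision this sketch leaves open.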

In [107]:
# System prompt: describes the information given to all conversations in this step
system_prompt_5 = """
<s>[INST] <<SYS>>
You are a helpful, respectful and honest assistant for evaluating whether two texts pertain to the same news event.
News events consist of specific events that lead to news coverage around a news story, such as a specific debate on a specific day in a specific parliament, a specific accident, or a specific football match. \n
They can be covered by one or more articles in one or more outlets, but relate to one specific and identifiable event and are thus much more fine-grained than news topics, issues, or news categories.\n
News events can span over multiple days but not more than 10 days. Therefore articles that cover the same news event are published very close in time, a matter of hours or maximum a few days. 
Different news events can also be published on the same date or on a very close date. \n
The most important criteria for determining whether the two texts pertain to the same news event are the events mentioned in the text. The date overlap is a secondary objective. \n

An event within a news event must refer to a specific event or related developments around the event. A news event can include various articles, reports, and updates from news outlets, all contributing to the coverage of that specific event or issue and its surrounding aspects. \n
For example, the news event might revolve around the presidential election of a specific year, detailing campaign events, candidate profiles, polling data, and key issues.
Provide your answer as an explanation in maximum 100 tokens. Make sure to only return the evaluation and nothing else.
<</SYS>>
"""

In [122]:
# Example prompt demonstrating the output we are looking for
example_prompt_5 = """

The news events of each text are the following:
 Event 1: Most political parties are shifting their campaign activity strategies due to the COVID-19 pandemic; \n
 Event 2: Forum voor Democratie party's campaign activities violate COVID-19 regulations while other parties have more pandemic-proof strategies. \n

The publishing dates of the texts are the following:\n
date1: 01/03/2021; date2: 01/03/2021 \n 

Based on the information above, please write a short evaluation about whether the two texts match on a news event level. Make sure to only return the evaluation and nothing more in the following format:

[/INST] Evaluation: Both texts focus on one particular news event, the election campaign and party campaign activities amid the COVID-19 pandemic, which is a distinctive event. Both texts discuss aspects of the same election campaign, political parties and campaign strategies during the pandemic, indicating that they pertain to the same news event.
The texts were also published at a similar time and date which further indicates that they belong to the same news event. 
"""

In [117]:
# Main prompt: inserts the news events from Step 4 and the publication dates
main_prompt_5 = """
[INST]
The news events of each text are the following:
{news_events}

The publishing dates of the texts are the following:
{date1} and {date2}

Based on the information above, please write a short evaluation about whether the two texts match on a news event level. Make sure to only return the evaluation and nothing more in the following format:
[/INST] 

"""

In [118]:
prompt_5 = system_prompt_5 + example_prompt_5 + main_prompt_5

In [119]:
prompt_template_5 = PromptTemplate(input_variables=["news_events", "date1", "date2"], template=prompt_5)
chain_5 = LLMChain(llm = llm, prompt = prompt_template_5, output_key="event_evaluation")


## Step 6: Provide single label for news event level match

* We use the evaluation and the events to classify whether there is a match or no match on the news event level.

In [None]:
# System prompt: describes the information given to all conversations in this step
system_prompt_6 = """
<s>[INST] <<SYS>>
You are a helpful, respectful and honest assistant for classifying whether two texts match on news event level based on an evaluation provided. 

At each request explicitly assign one of the two labels below. 
0 - no match
1 - event match

Make sure to only return the label and nothing else.
<</SYS>>
"""

In [None]:
# Example prompt demonstrating the output we are looking for
example_prompt_6 = """

The following evaluation describes the news event match level:
Both texts focus on one particular news event, the election campaign and party campaign activities amid the COVID-19 pandemic, which is a distinctive event. Both texts discuss aspects of the same election campaign, political parties and campaign strategies during the pandemic, indicating that they pertain to the same news event.
The texts were also published at a similar time and date which further indicates that they belong to the same news event. 
Based on this information, please assign either '0 - no match' or '1 - event match'. Make sure to only return the label and nothing more in the following format:

[/INST]: 1 
"""
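Step 6 defines a system prompt and an example prompt, but no main prompt or chain. The sketch below shows what a main prompt could look like if it mirrors `main_prompt_3`. The name `main_prompt_6`, the `match_event` output key, and the commented wiring are assumptions modelled on the earlier steps, not code from the original notebook.

```python
# Hypothetical main prompt for Step 6, modelled on main_prompt_3.
main_prompt_6 = """
[INST]

The following evaluation describes the news event match level:
{event_evaluation}

Based on this information, please assign either '0 - no match' or '1 - event match'. Make sure to only return the label and nothing more in the following format:
[/INST]
"""

# Quick check that the placeholder is filled as expected.
preview = main_prompt_6.format(event_evaluation="Both texts cover the same debate.")
print(preview)

# Assumed wiring, following the pattern of the earlier steps:
# prompt_6 = system_prompt_6 + example_prompt_6 + main_prompt_6
# prompt_template_6 = PromptTemplate(input_variables=["event_evaluation"], template=prompt_6)
# chain_6 = LLMChain(llm=llm, prompt=prompt_template_6, output_key="match_event")
```

With such a chain appended to the sequential chain and `match_event` added to its output variables, the final classification on the news event level would become available alongside the topic-level label.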

## Create overall chain to combine previous chains into one big sequential chain

In [120]:
# Combine the previous chains into one sequential chain
overall_chain = SequentialChain(
    chains=[chain_1, chain_1_1, chain_2, chain_3, chain_4, chain_5],
    input_variables=["text1", "text2", "proper_nouns1", "proper_nouns2",
                     "keywords1", "keywords2", "date1", "date2"],
    output_variables=["topics", "main_topic", "topic_evaluation",
                      "match_topic", "news_events", "event_evaluation"],
    verbose=True)
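In a sequential chain, each step can only use variables that were either supplied at the start or produced by an earlier step. The sketch below illustrates that data flow in plain Python, using the input variables and output keys of the chains in this notebook. `missing_inputs` is a hypothetical checker written for illustration; it is not a LangChain function.

```python
# Each entry: (chain name, input variables it needs, output key it adds).
steps = [
    ("chain_1", ["text1", "text2", "proper_nouns1", "proper_nouns2",
                 "keywords1", "keywords2"], "topics"),
    ("chain_1_1", ["topics"], "main_topic"),
    ("chain_2", ["main_topic"], "topic_evaluation"),
    ("chain_3", ["topic_evaluation"], "match_topic"),
    ("chain_4", ["text1", "text2", "proper_nouns1", "proper_nouns2",
                 "keywords1", "keywords2", "topics"], "news_events"),
    ("chain_5", ["news_events", "date1", "date2"], "event_evaluation"),
]
initial_inputs = ["text1", "text2", "proper_nouns1", "proper_nouns2",
                  "keywords1", "keywords2", "date1", "date2"]

def missing_inputs(steps, initial_inputs):
    """Return {chain name: [variables not yet available]} for broken steps."""
    available = set(initial_inputs)
    problems = {}
    for name, needed, output_key in steps:
        missing = [var for var in needed if var not in available]
        if missing:
            problems[name] = missing
        # The output of this step becomes available to all later steps.
        available.add(output_key)
    return problems

print(missing_inputs(steps, initial_inputs))  # {}
```

An empty result means every step finds its inputs. Dropping `date1` and `date2` from the initial inputs, for example, would report them as missing for `chain_5`.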

In [121]:
# This is purely for tests


for index, row in df2.iterrows():
    full_text1 = row['Text1']  # Get the full text of Text1
    full_text2 = row['Text2']  # Get the full text of Text2

    input_variables = {
        "text1": full_text1,
        "text2": full_text2,
        "proper_nouns1": row['proper_nouns1'],
        "proper_nouns2": row['proper_nouns2'],
        "keywords1": row['keywords1'],
        "keywords2": row['keywords2'],
        "date1":row['Date1'],
        "date2":row['Date2']

    }

    # Generate text using the chain
    
    results = overall_chain(input_variables)
    print(results)



> Entering new SequentialChain chain...

> Finished chain.
{'text1': 'Alle pijlen zijn gericht op Rutte in RTL-debat; Reportage verkiezingscampagne Alles op Rutte, dat is de stilzwijgende afspraak waaraan al diens opponenten zich in deze campagne tot dusver houden. Na het lijsttrekkersdebat van Radio 1 en het running mate-debat is ook het premiersdebat van RTL het eerste grote tv-debat van deze verkiezingen een meertrapsaanval op de minister-president, die graag aan zijn vierde termijn zou beginnen. De tactiek van de VVD tekent zich scherper af bouw zo min mogelijk profiel op, pareer aanvallen met verzoenende reacties en blijf dichtbij de kernboodschap gelieve de premier te koesteren die het land door de coronacrisis leidt. Een ruimte die hij pakt als het aan het eind van het debat over vaccinatiebewijzen gaat We moeten verstandig blijven. De politiek moet niet op de stoel van artsen en verpleegkundigen gaan zitten, zeggen Sigrid Kaag D66 en Jesse Klaver GroenLinks op

### Run the overall chain and save the results into the df column

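This cell still has to be updated for the new output variables: `overall_chain` does not yet produce `match_event` (there is no Step 6 chain), and `main_topic` is not stored.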

In [None]:
# Create empty lists to collect the results
topics = []
topic_eval = []
match_topic = []
news_events = []
event_eval = []
match_event = []

# Iterating over the DataFrame
for index, row in df2.iterrows():
    full_text1 = row['Text1']  # Get the full text of Text1
    full_text2 = row['Text2']  # Get the full text of Text2

    input_variables = {
        "text1": full_text1,
        "text2": full_text2,
        "proper_nouns1": row['proper_nouns1'],
        "proper_nouns2": row['proper_nouns2'],
        "keywords1": row['keywords1'],
        "keywords2": row['keywords2'],
        "date1": row['Date1'],
        "date2": row['Date2']
    }

    # Process the input_variables and get the results
    results = overall_chain(input_variables)

    # Append results to respective lists
    topics.append(results['topics'])
    match_topic.append(results['match_topic'])
    topic_eval.append(results['topic_evaluation'])
    news_events.append(results['news_events'])
    event_eval.append(results['event_evaluation'])
    # 'match_event' is not an output of overall_chain yet, so avoid a KeyError
    match_event.append(results.get('match_event'))

# Work on a copy so the new columns are not assigned to a slice of df
df2 = df2.copy()
df2['Topic'] = topics
df2['Topic_eval'] = topic_eval
df2['Topic_match'] = match_topic
df2['News_events'] = news_events
df2['Event_eval'] = event_eval
df2['Event_match'] = match_event