## Zero shot LLama 13b prompting for task

* input dataframe with text pairs and additional info LLama should use to make informed decision


In [1]:
import torch
import transformers
from transformers import AutoTokenizer, AutoModelForCausalLM
from torch import cuda, bfloat16
import transformers
import pandas as pd

In [141]:
torch.clear_autocast_cache

<function torch.clear_autocast_cache>

In [3]:
model_id = 'meta-llama/Llama-2-13b-chat-hf'

device = f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu'

# set quantization configuration to load large model with less GPU memory
# this requires the `bitsandbytes` library
bnb_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=bfloat16
)

# begin initializing HF items, need auth token for these
hf_auth = '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'
model_config = transformers.AutoConfig.from_pretrained(
    model_id,
    use_auth_token=hf_auth
)

model = transformers.AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    config=model_config,
    quantization_config=bnb_config,
    device_map='auto',
    use_auth_token=hf_auth
)
model.eval()
print(f"Model loaded on {device}")



Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

Model loaded on cuda:0


In [142]:
tokenizer = transformers.AutoTokenizer.from_pretrained(
    model_id,
    use_auth_token=hf_auth
)



In [143]:
device = torch.device('cuda')
print("GPU Name:", torch.cuda.get_device_name(device))
print("Memory Usage:", torch.cuda.memory_allocated(device) / 1024 ** 3, "GB")
print("Max Memory Usage:", torch.cuda.max_memory_allocated(device) / 1024 ** 3, "GB")

GPU Name: NVIDIA A10
Memory Usage: 5.196333885192871 GB
Max Memory Usage: 7.891656398773193 GB


In [117]:
from transformers import pipeline
from langchain.llms import HuggingFacePipeline

generate_text = transformers.pipeline(
    model=model, tokenizer=tokenizer,
    return_full_text=True,  # langchain expects the full text
    task='text-generation',
    # we pass model parameters here too
    temperature=0.0,  # 'randomness' of outputs, 0.0 is the min and 1.0 the max we do not want any randomness here as we want the model to stick to the prompt as closely as possible
    max_new_tokens=512,  # max number of tokens to generate in the output
    repetition_penalty=1.1  # without this output begins repeating
)

llm = HuggingFacePipeline(pipeline=generate_text)

### Tokenize the text or at least check for average token length


In [118]:
# Define the token_len function
def token_len(text):
    tokens = tokenizer.encode(
        text
    )
    return len(tokens)

In [8]:
#read in test data
df = pd.read_csv('newspaper_data/sample_5.csv')

In [119]:
# Apply the token_len function to the DataFrame
df['Token Length_1'] = df['Text1'].apply(token_len)
df['Token Length_2'] = df['Text2'].apply(token_len)

In [145]:
df

Unnamed: 0.1,Unnamed: 0,Similarity_Score,Text1,Text2,Group,Date1,Date2,Publisher1,Publisher2,ID1,ID2,proper_nouns1,proper_nouns2,keywords1,keywords2,Token Length_1,Token Length_2
0,15440466,0.802175,’Geef Haagse muzikanten Walk of Fame’; Geef Ha...,Eindhoven kiest niet voor positieve discrimina...,high,2021-02-19T00:00:00,2021-03-17T17:00:10,De Telegraaf,NU.nl,2766277,6314875,"Geef Haagse, D66, Jos van Leeuwen, Tanja Verka...","GroenLinks, PvdA, anders op zoek, Jannie Vissc...","['muziekhelden', 'ruimtereizigers', 'haagse', ...","['voorgespiegeld', 'cursussen', 'migratieachte...",375,401
1,6529835,0.710634,Kamermeerderheid voor beperken vluchten uit Lo...,De CDA-campagne op sociale media draait vooral...,high,2021-01-13T22:07:29,2021-03-13T00:00:00,NOS liveblog,NRC Handelsblad,2778611,3297694,"GroenLinks, D66, GroenLinks, D66, SP, PvdA, SG...","Wopke Hoekstra, Wopke Hoekstra, CDA, adverteer...","['50plus', 'sgp', 'groenlinks', 'ontraden', 'k...","['koffiebarretje', 'adverteert', 'schaatsfilmp...",179,382
2,9289150,0.705326,"Ruim 2,6 miljoen Nederlandse kijkers voor inau...",De Nederlandse mondzorg vertoont gaten; Ziekte...,high,2021-01-21T07:49:42,2021-03-18T00:00:00,NU.nl,Trouw,3033444,5534774,"Joe Biden, NOS, Joe Biden, NOS, Lady Gaga, Jen...","Dokters, Tweede Kamer","['inauguratie', 'washington', 'plechtigheid', ...","['schuldhulpverleners', 'mondzorgconsulten', '...",363,404
3,7370652,0.719213,Demissionair kabinet trekt &#39;enig denkbare ...,kruiswoordtest 5900; Horizontaal 1Britse prins...,high,2021-01-15T14:39:22,2021-03-03T00:00:00,NU.nl,Trouw,3034205,3285929,"Hugo de Jonge Volksgezondheid, Sigrid Kaag voo...","Bernhard, Adriana 4, Sahara, Bernhard, James Watt","['ontwikkelingssamenwerking', 'toeslagenaffair...","['dressuurhengst', 'manchetknopen', 'complotde...",371,421
4,14563301,0.707523,Wekdienst 13/2: Schaatsdrukte verwacht • Carna...,‘Soms ben ik geschokt als mensen niet denken z...,high,2021-02-13T07:46:42,2021-02-23T00:00:00,NOS nieuws,NRC Handelsblad,2769942,2863915,"Carnaval anders dan anders, PVV","Rotterdamse jongeren, NRC, NRC met jongeren, J...","['schaatsdrukte', 'winterdag', 'winterdienstre...","['rotterdamse', 'rotterdam', 'erasmus', 'marke...",372,347


from transformers import pipeline
from langchain.llms import HuggingFacePipeline

pipe = pipeline(
    "text-generation",
    model=model, 
    tokenizer=tokenizer, 
    max_length=2048,
    temperature=0,
    top_p=0.95,
    repetition_penalty=1.15
)

local_llm = HuggingFacePipeline(pipeline=pipe)

Topic: The topic is the overarching subject or theme that encompasses all aspects related to elections. In this case, the topic is "elections," which is a broad and recurring subject in the news. It includes various elections happening at different levels of government (e.g., presidential, gubernatorial, local), electoral systems, voting procedures, and political analysis. The topic sets the stage for coverage and discussions surrounding elections.

Story: A story within the context of elections is a specific, often ongoing narrative that focuses on a particular election or related developments. A story can include various articles, reports, and updates from news outlets, all contributing to the coverage of that specific election or its surrounding events. For example, the story might revolve around the presidential election of a specific year, detailing campaign events, candidate profiles, polling data, and key issues.

Event: An event is a singular occurrence or happening within the broader context of an election story. Events are typically noteworthy and can be reported on by multiple news outlets. In the context of elections, an event might be something like a presidential debate, election day itself, the release of election results, or a major campaign rally. Events are the specific milestones or moments that shape the narrative of an election story.

To summarize, "elections" is the overarching topic, "the presidential election of a specific year" is the story that encompasses all coverage related to that election, and "presidential debates," "election day," and "release of election results" are individual events within that story. These distinctions help to clarify how news articles organize their coverage of elections, ensuring that readers can follow and understand the unfolding developments and narratives.

###  Default system prompt

In [131]:
# System prompt describes information given to all conversations
system_prompt = """
<s>[INST] <<SYS>>
You are a helpful, respectful and honest assistant for deciding whether a connection exists across document pairs.
<</SYS>>
"""

In [132]:
# Example prompt demonstrating the output we are looking for
example_prompt = """
I have a topic that contains the following documents:
- Traditional diets in most cultures were primarily plant-based with a little meat on top, but with the rise of industrial style meat production and factory farming, meat has become a staple food.
- Meat, but especially beef, is the word food in terms of emissions.
- Eating meat doesn't make you a bad person, not eating meat doesn't make you a good one.

The topic is described by the following keywords: 'meat, beef, eat, eating, emissions, steak, food, health, processed, chicken'.

Based on the information about the topic above, please create a short label of this topic. Make sure you to only return the label and nothing more.

[/INST] Environmental impacts of eating meat
"""

In [133]:
# Our main prompt with documents ([DOCUMENTS]) and keywords ([KEYWORDS]) tags
main_prompt = """
[INST]
I have a topic that contains the following documents:
[DOCUMENTS]

The topic is described by the following keywords: '[KEYWORDS]'.

Based on the information about the topic above, please create a short label of this topic. Make sure you to only return the label and nothing more.
[/INST]
"""

In [134]:
prompt_1 = system_prompt + example_prompt + main_prompt

In [135]:
prompt_1

"\n<s>[INST] <<SYS>>\nYou are a helpful, respectful and honest assistant for deciding whether a connection exists across document pairs.\n<</SYS>>\n\nI have a topic that contains the following documents:\n- Traditional diets in most cultures were primarily plant-based with a little meat on top, but with the rise of industrial style meat production and factory farming, meat has become a staple food.\n- Meat, but especially beef, is the word food in terms of emissions.\n- Eating meat doesn't make you a bad person, not eating meat doesn't make you a good one.\n\nThe topic is described by the following keywords: 'meat, beef, eat, eating, emissions, steak, food, health, processed, chicken'.\n\nBased on the information about the topic above, please create a short label of this topic. Make sure you to only return the label and nothing more.\n\n[/INST] Environmental impacts of eating meat\n\n[INST]\nI have a topic that contains the following documents:\n[DOCUMENTS]\n\nThe topic is described by

In [136]:
prompt_template

PromptTemplate(input_variables=['text1', 'text2', 'proper_nouns1', 'proper_nouns2', 'keywords1', 'keywords2'], output_parser=None, partial_variables={}, template='Given two text snippets from Dutch newspapers: text 1: {text1} and text 2: {text2}, the proper nouns that appear in the articles: {proper_nouns1} and {proper_nouns2}, and the keywords of the articles {keywords1} and {keywords2} extract the main topic and subtopic of each article. \nProvide your answer without explanation in the following format: Text 1 Topic: - Subtopic: , Text2 Topic:- Subtopic:\n\nA topic must refer to one of the following 5 main topics, do not invent new main topics, ever: \n\n(1) Main topic Politics: internal affairs, international politics, military and defense \n(2) Main topic Business: economy, education, welfare and social services\n(3) Main topic Health: \n(4) Main topic Entertainment: sports, culture, music, fashion, and human interest\n(5) Main topic Other: science and technology, environment, comm

## Using langchain chains to sequentially get evaluation and classification based on input data

* Chain_1 = ask the llm to use variables to make an informed decision about wether text pairs are similar on different levels
* Chain_2 = ask the llm to classify them into distinct categories based on the evaluation

In [137]:
# Importing the necessary functions or libraries
from langchain import PromptTemplate
from langchain.chains import LLMChain

#topic catergories are based on Vermeer et al, 2018
#slight adjustment: Health becomes also main topic given the covid-19 pandemic 

# topic extraction template

template = """<s>[INST] <<SYS>> You are a helpful, respectful and honest assistant for deciding whether a connection exists across document pairs.
<</SYS>>

[INST] Given two text snippets from Dutch newspapers: text 1: {text1} and text 2: {text2}, the proper nouns that appear in the articles: {proper_nouns1} and {proper_nouns2}, and the keywords of the articles {keywords1} and {keywords2} extract the main topic and subtopic of each article. 
Provide your answer without explanation in the following format: Text 1 Topic: - Subtopic: , Text2 Topic:- Subtopic:

A topic must refer to one of the following 5 main topics, do not invent new main topics, ever: 

(1) Main topic Politics: internal affairs, international politics, military and defense 
(2) Main topic Business: economy, education, welfare and social services
(3) Main topic Health: 
(4) Main topic Entertainment: sports, culture, music, fashion, and human interest
(5) Main topic Other: science and technology, environment, communication, weather, and religion and beliefs

Always include a main topic (for example: Politics) and a subtopic (for example: internal affairs, elections) at each request.

Text 1: {text1}
Text 2: {text2}
proper nouns 1: {proper_nouns1}
proper nouns 2: {proper_nouns2}
keywords 1: {keywords1}
keywords 2: {keywords2}

[/INST] Answer in English: """


# Create a PromptTemplate instance
prompt_template = PromptTemplate(
    input_variables=["text1", "text2", 'proper_nouns1', 'proper_nouns2', 'keywords1', 'keywords2'],
    template=template
)

# Create the LLMChain instance
chain_1 = LLMChain(llm = llm, prompt = prompt_template, output_key="topics")

In [139]:
prompt_template

PromptTemplate(input_variables=['text1', 'text2', 'proper_nouns1', 'proper_nouns2', 'keywords1', 'keywords2'], output_parser=None, partial_variables={}, template='<s>[INST] <<SYS>> You are a helpful, respectful and honest assistant for deciding whether a connection exists across document pairs.\n<</SYS>>\n\n[INST] Given two text snippets from Dutch newspapers: text 1: {text1} and text 2: {text2}, the proper nouns that appear in the articles: {proper_nouns1} and {proper_nouns2}, and the keywords of the articles {keywords1} and {keywords2} extract the main topic and subtopic of each article. \nProvide your answer without explanation in the following format: Text 1 Topic: - Subtopic: , Text2 Topic:- Subtopic:\n\nA topic must refer to one of the following 5 main topics, do not invent new main topics, ever: \n\n(1) Main topic Politics: internal affairs, international politics, military and defense \n(2) Main topic Business: economy, education, welfare and social services\n(3) Main topic Hea

In [138]:
# Test if it works

for index, row in df.iterrows():
    full_text1 = row['Text1']  # Get the full text of Text1
    full_text2 = row['Text2']  # Get the full text of Text2

    input_variables = {
        "text1": full_text1,
        "text2": full_text2,
        "proper_nouns1": row['proper_nouns1'],
        "proper_nouns2": row['proper_nouns2'],
        "keywords1": row['keywords1'],
        "keywords2": row['keywords2'],
    }

    # Generate text using the chain
    generated_text = chain_1.run(input_variables)
    
    print(generated_text)



Text 1 Topic: Politics - Subtopic: Internal Affairs

Text 2 Topic: Politics - Subtopic: Inclusiveness and Diversity


Text 1 Topic: Politics - Subtopic: Elections

Text 2 Topic: Business - Subtopic: Advertising


Text 1 Topic: Politics - Subtopic: Internal Affairs

Text 2 Topic: Health - Subtopic: Dental Care


Text 1 Topic: Politics - Subtopic: Internal Affairs

Text 2 Topic: Other - Subtopic: Science and Technology


Text 1 Topic: Politics - Subtopic: Internal Affairs

Text 2 Topic: Health - Subtopic: Mental Health


In [128]:
# topic match
template_2 = """You are an expert reasoner. Given two text snippets from Dutch newspapers explain carefully to what extent the two text snippets are similar on a topic level using {topics}. 
Provide your answer as an explanation in maximum 100 tokens. 

Answer: """


# Create the LLMChain instance for chain 2 

prompt_template_2 = PromptTemplate(input_variables=["topics"], template=template_2, batch_size=32)
chain_2 = LLMChain(llm = llm, prompt = prompt_template_2, output_key="topic_evaluation")


In [58]:
# story extraction

template_3 = """Given two text snippets from Dutch newspapers: text 1: {text1} and text 2: {text2}, their publishing dates {date1} and date {date2}, the proper nouns that appear in the article: {proper_nouns1} and {proper_nouns2}, and the keywords of the articles {keywords1} and {keywords2} extract the story of each article
Provide your answer as a single short label without explanation in the following format: Text 1 Story: , Text 2 Story:

Context: News stories are about something concrete: a place, a person, or an event. News stories occur close together in proximity, a matter or a sliding window of 3 days. Make sure that date {date1} and date {date2} are within three days of distance. 

Text 1: {text1}
Text 2: {text2}
proper nouns 1: {proper_nouns1}
proper nouns 2: {proper_nouns2}
keywords 1: {keywords1}
keywords 2: {keywords2}
Date 1: {date1}
Date 2: {date2}

Answer: """


# Create the LLMChain instance for chain 2 

prompt_template_3 = PromptTemplate(input_variables=["text1", "text2", "proper_nouns1", "proper_nouns2", 'keywords1', 'keywords2', 'date1', 'date2'], template=template_3, batch_size=32)
chain_3 = LLMChain(llm = llm, prompt = prompt_template_3, output_key="story")

In [129]:
#create overall chain to combine previous chains into one big sequential chain

from langchain.chains import SequentialChain

overall_chain = SequentialChain(
                  chains=[chain_1, chain_2], input_variables = ["text1", "text2", "proper_nouns1", "proper_nouns2", 'keywords1', 'keywords2'],output_variables=["topics", "topic_evaluation"],
                  verbose=True)

In [130]:
#this purely for tests


for index, row in df.iterrows():
    full_text1 = row['Text1']  # Get the full text of Text1
    full_text2 = row['Text2']  # Get the full text of Text2

    input_variables = {
        "text1": full_text1,
        "text2": full_text2,
        "proper_nouns1": row['proper_nouns1'],
        "proper_nouns2": row['proper_nouns2'],
        "keywords1": row['keywords1'],
        "keywords2": row['keywords2']

    }

    # Generate text using the chain
    
    results = overall_chain(input_variables)
    print(results)



[1m> Entering new SequentialChain chain...[0m

[1m> Finished chain.[0m
{'text1': '’Geef Haagse muzikanten Walk of Fame’; Geef Haagse muzikanten Walk of Fame D66 wil een Walk of Fame voor Haagse artiesten. foto Jos van Leeuwen D66 pad bij het Centraal Station door Tanja Verkaik DEN HAAG Haagse muziekhelden moeten een plek krijgen op een Haagse Walk of Fame. Raadsleden Birg l zmen en Dani l Scheper D66 scharen zich achter de lobby van muziekjournalist Martin Reitsma en prominente Hagenaars die pleiten voor een waardig eerbetoon voor Haagse muzikanten in de vorm van straatnamen. Een straatnaam is leuk, maar een Walk of Fame heeft allure, zegt raadslid Scheper. We moeten trots zijn op ons muzikale erfgoed, vindt zmen. Een geschikte plek voor de Walk of Fame is de loop vanaf Den Haag Centraal Station richting het nieuwe cultuurpaleis Amare. Straatnamen voor Haagse bands en artiesten is complexer, denkt Scheper. E n van de eisen die wordt gesteld aan een straatnaam is dat personen al o

In [85]:
# Importing the necessary functions or libraries
from langchain import PromptTemplate
from langchain.chains import LLMChain


# Define the template
template = """Given two text snippets from Dutch newspapers: text 1: {text1} and text 2: {text2}, their publishing dates {date1} and date {date2}, and the proper nouns that appear in the article: {proper_nouns1} and {proper_nouns2}, and the keywords of the articles {keywords1} and {keywords2} explain to what extent the two text snippets are similar on a topic level, a story level, or event level using all the variables provided./
Always interpret the temporal distance between {date1} and {date2} in your answer each time. Always add 'Final evaluation:' explicitly to the end of your evaluation each time you get a request, no exceptions from this format ever. /
Provide your answer in maximum 100 words each time, no exceptions. Be explicit about the reasons these do not match on a certain level and emphasize why they match. 

Context: A news event refers to a specific occurrence or happening that leads to news coverage. Different articles that cover the same news event will be publised very close together in time (a matter of hours perhaps a day). A news story is a more general term that encompasses all the related news reports or articles covering an event in a relatively close date range but longer than the date range of news events. In other words, a news event is the actual happening or occurrence, whereas a news story is the collection of news reports or articles that cover that event./
On the other hand, a topic is a broader area of focus that may encompass multiple news stories or events. For example, "airplane accidents" could be a topic, with each specific accident being a news event that might be reported on individually or collectively./


Text 1: {text1}
Text 2: {text2}
proper nouns 1: {proper_nouns1}
proper nouns 2: {proper_nouns2}
keywords 1: {keywords1}
keywords 2: {keywords2}
Date 1: {date1}
Date 2: {date2}


Answer: """

# Create a PromptTemplate instance
prompt_template = PromptTemplate(
    input_variables=["text1", "text2", "date1", "date2", 'proper_nouns1', 'proper_nouns2', 'keywords1', 'keywords2'],
    template=template
)

# Create the LLMChain instance
chain_1 = LLMChain(llm = llm, prompt = prompt_template, output_key="evaluation")

for index, row in df.iterrows():
    full_text1 = row['Text1']  # Get the full text of Text1
    full_text2 = row['Text2']  # Get the full text of Text2

    input_variables = {
        "text1": full_text1,
        "text2": full_text2,
        "proper_nouns1": row['proper_nouns1'],
        "proper_nouns2": row['proper_nouns2'],
        "keywords1": row['keywords1'],
        "keywords2": row['keywords2'],
        "date1": row['Date1'],
        "date2": row['Date2']
    }

    # Generate text using the chain
    generated_text = chain_1.run(input_variables)
    
    print(generated_text)
    




Final evaluation: The two text snippets are similar on the topic level but dissimilar on the story and event levels.

Similarities on the topic level: Both texts discuss the idea of giving recognition to individuals or groups who have made significant contributions to society. In Text 1, the recognition is in the form of a Walk of Fame for Haagse musicians, while in Text 2, it is in the form of inclusive hiring practices for municipal employees. Both texts also mention the importance of pride and heritage in relation to the recognition being given.

Dissimilarities on the story level: The two texts tell different stories. Text 1 is about the proposal for a Walk of Fame for Haagse musicians, while Text 2 is about the decision not to implement positive discrimination in the hiring process for municipal employees.

Dissimilarities on the event level: The two texts report on different events. Text 1 reports on a proposal for a Walk of Fame, while Text 2 reports on a decision not to imple

In [86]:
device = torch.device('cuda')
print("GPU Name:", torch.cuda.get_device_name(device))
print("Memory Usage:", torch.cuda.memory_allocated(device) / 1024 ** 3, "GB")
print("Max Memory Usage:", torch.cuda.max_memory_allocated(device) / 1024 ** 3, "GB")

GPU Name: NVIDIA A10
Memory Usage: 5.196333885192871 GB
Max Memory Usage: 7.891656398773193 GB


In [15]:
torch.clear_autocast_cache

<function torch.clear_autocast_cache>

 #If texts match on multiple levels make sure to choose the single right label from below:

News Event: This refers to the precise and identical occurrence covered in news. Various articles discussing the same news event are published almost immediately, typically within a few hours or a day.

News Story: This term encompasses all related news reports or articles about an event within a relatively close timeframe. A news event is the specific incident, while a news story comprises various news pieces covering that event.

Topic: A broader subject that may encompass multiple news stories or events. For instance, "airplane accidents" could be a topic, with each individual accident being reported as a news event.

In [87]:
# Chain2 - suggest age-appropriate gift
template_2 = """Task: You are an expert classifier. Your objective is to determine the degree of similarity between two text snippets on different levels: topic, story, or event. Make your decision based on the complete evaluation provided only and nothing else./



At each request excplicitly assign one of the labels and corresponding scores below on the closest matching classification without exceptions.  These scores are not depecting intensity they are simply placeholders for categories.

0 - No match
1 - Topic-level match (only when referring to the same topic)
2 - Story-level match (only when referring to the same story)
3 - Event-level match (only when referring to the same event)
4 - Topic and story-level match (only when referring to the same story and to the same event)
5 - Topic, story, and event-level match (only when referring to the same event, story, and event)
6 - Story and event-level match (only when referring to the same story and event)
7 - Topic and event-level match (only when referring to the same topic and event)

Please consider only the following evaluation when making your decision every time without exceptions:
{evaluation}
Classification:"""

prompt_template_2 = PromptTemplate(input_variables=["evaluation"], template=template_2, batch_size=32)
chain_2 = LLMChain(llm=llm, prompt=prompt_template_2, output_key="classification") 

In [89]:
#create overall chain to combine previous chains into one big sequential chain

from langchain.chains import SequentialChain

overall_chain = SequentialChain(
                  chains=[chain_1, chain_2], input_variables = ["text1", "text2", "date1", "date2","proper_nouns1", "proper_nouns2", 'keywords1', 'keywords2' ],output_variables=["evaluation", "classification"],
                  verbose=True)

In [90]:
#this purely for tests


for index, row in df.iterrows():
    full_text1 = row['Text1']  # Get the full text of Text1
    full_text2 = row['Text2']  # Get the full text of Text2

    input_variables = {
        "text1": full_text1,
        "text2": full_text2,
        "proper_nouns1": row['proper_nouns1'],
        "proper_nouns2": row['proper_nouns2'],
        "date1": row['Date1'],
        "date2": row['Date2'],
        "keywords1": row['keywords1'],
        "keywords2": row['keywords2']
        
    }

    # Generate text using the chain
    
    results = overall_chain(input_variables)
    print(results)



[1m> Entering new SequentialChain chain...[0m

[1m> Finished chain.[0m
{'text1': '’Geef Haagse muzikanten Walk of Fame’; Geef Haagse muzikanten Walk of Fame D66 wil een Walk of Fame voor Haagse artiesten. foto Jos van Leeuwen D66 pad bij het Centraal Station door Tanja Verkaik DEN HAAG Haagse muziekhelden moeten een plek krijgen op een Haagse Walk of Fame. Raadsleden Birg l zmen en Dani l Scheper D66 scharen zich achter de lobby van muziekjournalist Martin Reitsma en prominente Hagenaars die pleiten voor een waardig eerbetoon voor Haagse muzikanten in de vorm van straatnamen. Een straatnaam is leuk, maar een Walk of Fame heeft allure, zegt raadslid Scheper. We moeten trots zijn op ons muzikale erfgoed, vindt zmen. Een geschikte plek voor de Walk of Fame is de loop vanaf Den Haag Centraal Station richting het nieuwe cultuurpaleis Amare. Straatnamen voor Haagse bands en artiesten is complexer, denkt Scheper. E n van de eisen die wordt gesteld aan een straatnaam is dat personen al o

### Save the results into the df column

In [102]:
#this will be in final code

# Create empty lists to collect the results
evaluations = []
classifications = []

# Iterating over the DataFrame

for index, row in df.iterrows():
    paragraphs_text1 = row['Text1'].split('\n\n')  # Split the text into paragraphs
    paragraphs_text2 = row['Text2'].split('\n\n')  # Split the text into paragraphs
    
    #Extract the first paragraph, or the first two paragraphs if length < 5
    if len(paragraphs_text1[0]) < 5 and len(paragraphs_text1) > 1:
        first_paragraph_text1 = '\n\n'.join(paragraphs_text1[:2])
    else:
        first_paragraph_text1 = paragraphs_text1[0]

    if len(paragraphs_text2[0]) < 5 and len(paragraphs_text2) > 1:
        first_paragraph_text2 = '\n\n'.join(paragraphs_text2[:2])
    else:
        first_paragraph_text2 = paragraphs_text2[0]
    
    input_variables = {
        "text1": first_paragraph_text1,
        "text2": first_paragraph_text2,
        #"similarity_score": row['Similarity_Score'],
        "proper_nouns1": row['proper_nouns1'],
        "proper_nouns2": row['proper_nouns2'],
        "date1": row['Date1'],
        "date2": row['Date2']
    }

    # Append results to respective lists
    evaluations.append(results['evaluation'])
    classifications.append(results['classification'])

# Add new columns to the DataFrame
df['Evaluation'] = evaluations
df['Classification'] = classifications

# Print the updated DataFrame
df

Unnamed: 0.1,Unnamed: 0,Similarity_Score,Text1,Text2,Group,Date1,Date2,Publisher1,Publisher2,ID1,ID2,Named_Entities1,Named_Entities2,Token Length_1,Token Length_2,chunk1,chunk2,Evaluation,Classification
0,7524236,0.684123,Kijk met een economische bril naar migratie en...,De toekomst van de landbouw ; Op veel punten s...,medium,2021-03-05T00:00:00,2021-03-09T00:00:00,De Volkskrant,Algemeen Dagblad,3288032,3295057,"['Nederlandse', 'Forum van Democratie', 'anti-...","['SGP', 'Partij voor de Dieren', 'PvdD', 'één'...",797,1836,Kijk met een economische bril naar migratie en...,"het ietwat chargerend, maar met een serieuze o...",Final evaluation: The two text snippets are s...,1 (topic-level match).
1,944469,0.611839,Nieuwe dreun voor Schiphol en KLM ; Nieuwe dre...,'Rutte zegt: in de kern zijn we een diep socia...,medium,2021-01-08T00:00:00,2021-02-04T00:00:00,De Telegraaf,De Volkskrant,2676726,2756469,"['KLM\n\n', 'KLM', 'Yteke de Jong\n\nAmsterdam...","['#', 'Tweede', '2021', 'SP', 'Lilian Marijnis...",598,4077,Nieuwe dreun voor Schiphol en KLM ; Nieuwe dre...,Ze is alweer zo'n bekend gezicht op het Binnen...,Final evaluation: The two text snippets are s...,1 (topic-level match).
2,7286249,0.655534,Verkiezingen: Bijna alle partijen gaan nu acht...,"Wanneer een politicus de clown is, hoef je van...",medium,2021-03-03T00:00:00,2021-03-15T00:00:00,Algemeen Dagblad,De Volkskrant,3286360,3298131,"['Mark Rutte', 'Nederland', 'RTL', 'Mark', 'Ar...","['#', 'Mark Rutte', 'Wilders', 'extreemrechts'...",213,1375,Verkiezingen: Bijna alle partijen gaan nu acht...,Buiten een kleine kring van journalisten en po...,Final evaluation: The two text snippets are s...,1 (topic-level match).
