# LOAD DATA:

In [18]:
import json

pairs = []
# Read data from the English validation split (JSON Lines: one sample per line)
with open('val/mushroom.en-val.v1.jsonl', 'r') as f:
    for line in f:
        data = json.loads(line.strip())
        # Collect id, model_output_text, model_input, and soft_labels as pairs
        pairs.append((data['id'], data['model_output_text'], data['model_input'], data['soft_labels']))

In [22]:
for pair in pairs:
    print(pair[2])

What did Petra van Staveren win a gold medal for?
How many genera does the Erysiphales order contain?
Do all arthropods have antennae?
When did Chance the Rapper debut?
What is the UN Sustainable Development Goal 11’s definition of a sustainable city?
What are the four styles of Zhejiang cuisine?
If today is 14th October, and it is not a leap year, how many days remain until the end of the year?
Which network released the TV series of the The Punisher?
What is the population of the Spanish region of Galicia?
How many stages of labour are there in childbirth?
When did the merger of Takara and Tomy take place?
In which country is the Salzburg Red Bull Arena?
In which country is Cilleruelo de San Mamés located?
When did the Bleeding Kansas civil confrontations take place?
Who first described the white-winged chough?
In which city were the 26th biathlon world championships held?
What was the previous name of the Gillette Stadium?
Is a graphics address redressing table a type of input-outpu

# PART 1: DPR

In [1]:
!pip install datasets
!pip install cohere

Collecting datasets
  Downloading datasets-3.1.0-py3-none-any.whl.metadata (20 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.9.0,>=2023.1.0 (from fsspec[http]<=2024.9.0,>=2023.1.0->datasets)
  Downloading fsspec-2024.9.0-py3-none-any.whl.metadata (11 kB)
Downloading datasets-3.1.0-py3-none-any.whl (480 kB)
Downloading dill-0.3.8-py3-none-any.whl (116 kB)
Downloading fsspec-2024.9.0-py3-none-any.whl (

In [26]:
from datasets import load_dataset
import numpy as np
import cohere
import time

In [3]:
lang = "simple"  # "simple" config = Simple English Wikipedia
top_k = 5

In [62]:
# Read the API key from a local file instead of hard-coding it in the notebook
with open('apikey.txt', 'r') as f:
    key = f.read().strip()

co = cohere.Client(key)  # security risk evaded :D


The next step can take a while based on max_docs :(

In [64]:
max_docs = 100000
docs_stream = load_dataset("Cohere/wikipedia-2023-11-embed-multilingual-v3", lang, split="train", streaming=True)

docs = []
doc_embeddings = []

for doc in docs_stream:
    docs.append(doc)
    doc_embeddings.append(doc['emb'])
    if len(docs) >= max_docs:
        break

doc_embeddings = np.asarray(doc_embeddings)

But this step is hella fast :)
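
The reason it is fast: once the document embeddings sit in one NumPy matrix, scoring a query is a single matrix-vector product, and `np.argpartition` picks the top k without sorting every score. Here is a minimal sketch of that idea on made-up toy vectors. `top_k_by_dot` is a hypothetical helper and the 3-dimensional "embeddings" are invented for illustration; they are not the Cohere embeddings.

```python
import numpy as np

def top_k_by_dot(query_vec, doc_matrix, k):
    """Return the indices of the k highest dot-product scores, best first."""
    scores = doc_matrix @ query_vec                  # one score per document
    k = min(k, scores.shape[0])                      # guard against k > number of docs
    candidates = np.argpartition(scores, -k)[-k:]    # top-k indices, in no particular order
    ordered = candidates[np.argsort(scores[candidates])[::-1]]  # sort only the k candidates
    return ordered.tolist(), scores

# Toy data (invented): 4 documents, 3 dimensions
toy_docs = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.7, 0.0],
    [0.0, 0.0, 1.0],
])
toy_query = np.array([1.0, 0.2, 0.0])

hits, scores = top_k_by_dot(toy_query, toy_docs, k=2)
print(hits)  # [0, 2]
```

Only the k candidates get sorted, so the cost per query stays close to linear in the number of documents.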

In [81]:
queries = [(pair[0], pair[2]) for pair in pairs]

retrieved_passages = {}

for query in queries:
    response = co.embed(texts=[query[1]], model='embed-multilingual-v3.0', input_type="search_query")
    query_embedding = response.embeddings
    query_embedding = np.asarray(query_embedding)

    # Compute dot score between query embedding and document embeddings
    dot_scores = np.matmul(query_embedding, doc_embeddings.transpose())[0]
    top_k_hits = np.argpartition(dot_scores, -top_k)[-top_k:].tolist()

    # Sort top_k_hits by dot score
    top_k_hits.sort(key=lambda x: dot_scores[x], reverse=True)

    results = []
    for doc_id in top_k_hits:
        passage = f"{docs[doc_id]['title']}. {docs[doc_id]['text']}"
        distance = dot_scores[doc_id]
        results.append([passage, distance])

    retrieved_passages[query[0]] = {'query': query[1],
                              'results': results}

    # Avoid time-out from API usage limits (40 calls per minute):
    time.sleep(1.5)
    print(f"{query[0]}: [{query[1][:20]}...] --- Done.")

val-en-1: [What did Petra van S...] --- Done.
val-en-2: [How many genera does...] --- Done.
val-en-3: [Do all arthropods ha...] --- Done.
val-en-4: [When did Chance the ...] --- Done.
val-en-5: [What is the UN Susta...] --- Done.
val-en-6: [What are the four st...] --- Done.
val-en-7: [If today is 14th Oct...] --- Done.
val-en-8: [Which network releas...] --- Done.
val-en-9: [What is the populati...] --- Done.
val-en-10: [How many stages of l...] --- Done.
val-en-11: [When did the merger ...] --- Done.
val-en-12: [In which country is ...] --- Done.
val-en-13: [In which country is ...] --- Done.
val-en-14: [When did the Bleedin...] --- Done.
val-en-15: [Who first described ...] --- Done.
val-en-16: [In which city were t...] --- Done.
val-en-17: [What was the previou...] --- Done.
val-en-18: [Is a graphics addres...] --- Done.
val-en-19: [When was Captain Mor...] --- Done.
val-en-20: [In which oblast is N...] --- Done.
val-en-21: [Are kremlins similar...] --- Done.
val-en-22: [Where do H

In [82]:
print(retrieved_passages)

{'val-en-1': {'query': 'What did Petra van Staveren win a gold medal for?', 'results': [['Olympic Games. Individual athletes have also used the Olympic stage to promote their own political agenda. At the 1968 Summer Olympics, in Mexico City, two American track and field athletes, Tommie Smith and John Carlos, who finished first and third in the 200meter sprint race, performed the Black Power salute on the podium. The runner up Peter Norman wore an Olympic Project for Human Rights badge in support of Smith and Carlos. IOC President Avery Brundage then told the United States, to either send the two athletes home or withdraw the track and field team. The United States chose to send the pair home.', 0.48203595114005965], ["Belgium. Kim Clijsters and Justine Henin both were Player of the Year in the Women's Tennis Association. The Spa-Francorchamps motor-racing circuit hosts the Formula One World Championship Belgian Grand Prix. The Belgian driver, Jacky Ickx, won eight Grands Prix and six 

In [83]:
import json

with open('retrieved_passages_cohere.json', 'w') as f:
    json.dump(retrieved_passages, f, indent=4)

# PART 2: PROMPTING

In [11]:
!pip install torch
!pip install transformers



In [12]:
from transformers import T5Tokenizer, T5ForConditionalGeneration
import torch
import json
from sklearn.metrics import precision_score, recall_score, f1_score

In [13]:
# Load the T5 model and tokenizer
model_name = "t5-base"
model = T5ForConditionalGeneration.from_pretrained(model_name)
tokenizer = T5Tokenizer.from_pretrained(model_name)

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565


In [33]:
pairs = []
# Read data from the English validation split (JSON Lines: one sample per line)
with open('val/mushroom.en-val.v1.jsonl', 'r') as f:
    for line in f:
        data = json.loads(line.strip())
        # Collect id, model_output_text, model_input, and soft_labels as pairs
        pairs.append((data['id'], data['model_output_text'], data['model_input'], data['soft_labels']))

In [35]:
for pair in pairs[:10]:
    print(pair)

('val-en-1', 'Petra van Stoveren won a silver medal in the 2008 Summer Olympics in Beijing, China.', 'What did Petra van Staveren win a gold medal for?', [{'start': 10, 'prob': 0.2, 'end': 12}, {'start': 12, 'prob': 0.3, 'end': 13}, {'start': 13, 'prob': 0.2, 'end': 18}, {'start': 25, 'prob': 0.9, 'end': 31}, {'start': 31, 'prob': 0.1, 'end': 37}, {'start': 45, 'prob': 1.0, 'end': 49}, {'start': 49, 'prob': 0.3, 'end': 65}, {'start': 65, 'prob': 0.2, 'end': 69}, {'start': 69, 'prob': 0.9, 'end': 83}])
('val-en-2', 'The Elysiphale order contains 5 genera.', 'How many genera does the Erysiphales order contain?', [{'start': 4, 'prob': 0.2, 'end': 14}, {'start': 30, 'prob': 1.0, 'end': 31}, {'start': 31, 'prob': 0.2, 'end': 38}])
('val-en-3', 'Yes, all arachnids have antennas. However, not all of them are visible to the naked eye.', 'Do all arthropods have antennae?', [{'start': 0, 'prob': 0.6, 'end': 3}, {'start': 3, 'prob': 0.4, 'end': 8}, {'start': 8, 'prob': 0.2, 'end': 9}, {'start': 9

Prompt definition:

In [86]:
# Define the modified prompt template
prompt_template = "Context:\n\n{passage1}\n\n{passage2}\n\n{passage3}\n\nPremise: {text1}\nHypothesis: {text2}\nUsing the provided context, identify the part of the hypothesis that contradicts the premise."
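
The template has three fixed passage slots, so it assumes retrieval always returns at least three hits. Below is a minimal sketch of a more forgiving variant that joins however many passages are available. `build_prompt` and its `max_passages` parameter are hypothetical names, not part of the notebook; the wording of the prompt is copied from the template above.

```python
def build_prompt(passages, premise, hypothesis, max_passages=3):
    """Join up to max_passages passages into a context block and append the task."""
    context = "\n\n".join(passages[:max_passages])
    return (
        f"Context:\n\n{context}\n\n"
        f"Premise: {premise}\n"
        f"Hypothesis: {hypothesis}\n"
        "Using the provided context, identify the part of the hypothesis "
        "that contradicts the premise."
    )

demo = build_prompt(
    ["Passage A.", "Passage B."],  # placeholder passages: only two available
    "How many genera does the Erysiphales order contain?",
    "The Elysiphale order contains 5 genera.",
)
print(demo)
```

With exactly three passages this should produce the same string as `prompt_template.format(...)`.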

In [38]:
# List to hold all predicted spans
predicted_spans = []

# Helper functions for evaluation
def compute_exact_match(pred_start, pred_end, true_start, true_end):
    """Check if predicted span exactly matches the ground truth span."""
    return int(pred_start == true_start and pred_end == true_end)

def compute_f1(pred_start, pred_end, true_start, true_end):
    """Compute the character-level F1 score for a predicted span vs ground truth span.

    End offsets are exclusive, matching the soft_labels in the dataset.
    """
    pred_span = set(range(pred_start, pred_end))
    true_span = set(range(true_start, true_end))
    overlap = pred_span.intersection(true_span)

    if len(overlap) == 0:
        return 0.0
    precision = len(overlap) / len(pred_span)
    recall = len(overlap) / len(true_span)
    f1 = 2 * precision * recall / (precision + recall)
    return f1

# Evaluation metrics storage
exact_matches = 0
total_examples = 0
f1_scores = []
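
A quick sanity check of span-overlap scoring, as a sketch. It assumes the `end` offsets in `soft_labels` are exclusive, which is consistent with consecutive labels sharing a boundary (for example `end: 12` followed by `start: 12`). `char_f1` is a hypothetical standalone helper written for this check.

```python
def char_f1(pred_start, pred_end, true_start, true_end):
    """Character-level F1 between two spans with exclusive end offsets."""
    pred_chars = set(range(pred_start, pred_end))
    true_chars = set(range(true_start, true_end))
    overlap = pred_chars & true_chars
    if not overlap:
        return 0.0
    precision = len(overlap) / len(pred_chars)
    recall = len(overlap) / len(true_chars)
    return 2 * precision * recall / (precision + recall)

text = "Petra van Stoveren won a silver medal in the 2008 Summer Olympics in Beijing, China."
print(repr(text[25:31]))        # 'silver' (the span labelled with prob 0.9)
print(char_f1(25, 31, 25, 31))  # identical spans
print(char_f1(25, 37, 25, 31))  # prediction twice as long as the gold span
```

Slicing the text with the label offsets is a cheap way to confirm which end convention the data uses before trusting any score.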

Testing the prompt:

In [91]:
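# Use sample_id rather than id to avoid shadowing the built-in id()
sample_id, model_output_text, model_input, soft_labels = pairs[0]

# Look up the passages retrieved for this sample:
passages = retrieved_passages[sample_id]['results']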
passage1 = passages[0][0]
passage2 = passages[1][0]
passage3 = passages[2][0]

prompt = prompt_template.format(passage1=passage1, passage2=passage2, passage3=passage3, text1=model_input, text2=model_output_text)

print(prompt)


Context:

Olympic Games. Individual athletes have also used the Olympic stage to promote their own political agenda. At the 1968 Summer Olympics, in Mexico City, two American track and field athletes, Tommie Smith and John Carlos, who finished first and third in the 200meter sprint race, performed the Black Power salute on the podium. The runner up Peter Norman wore an Olympic Project for Human Rights badge in support of Smith and Carlos. IOC President Avery Brundage then told the United States, to either send the two athletes home or withdraw the track and field team. The United States chose to send the pair home.

Belgium. Kim Clijsters and Justine Henin both were Player of the Year in the Women's Tennis Association. The Spa-Francorchamps motor-racing circuit hosts the Formula One World Championship Belgian Grand Prix. The Belgian driver, Jacky Ickx, won eight Grands Prix and six 24 Hours of Le Mans. Belgium also has a strong reputation in motocross. Sporting events held each year in

In [92]:
# Process each pair
for pair in pairs:
    sample_id, model_output_text, model_input, soft_labels = pair

    # Look up the passages retrieved for this sample (top 3 only)
    passages = retrieved_passages[sample_id]['results']
    passage1 = passages[0][0]
    passage2 = passages[1][0]
    passage3 = passages[2][0]

    prompt = prompt_template.format(passage1=passage1, passage2=passage2, passage3=passage3, text1=model_input, text2=model_output_text)

    # Tokenize and generate prediction
    input_ids = tokenizer(prompt, return_tensors='pt').input_ids
    outputs = model.generate(input_ids)
    predicted_span_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

    # Find character indices of the predicted span in the model output.
    # str.find returns -1 when the text is absent, so end_char_idx is only
    # meaningful when start_char_idx != -1.
    start_char_idx = model_output_text.find(predicted_span_text)
    end_char_idx = start_char_idx + len(predicted_span_text)

    # Debugging output
    print(f"Input Text: {model_output_text}")
    print(f"Predicted Span Text: '{predicted_span_text}'")
    print(f"Start Char Index: {start_char_idx}, End Char Index: {end_char_idx}")

    # A span is valid only if it was found and is non-empty
    span_found = start_char_idx != -1 and start_char_idx < end_char_idx
    predicted_spans.append({
        'id': sample_id,
        'model_output_text': model_output_text,
        'target_text': model_input,
        'predicted_span': predicted_span_text if span_found else None,
        'hard_labels': [{'start': start_char_idx, 'end': end_char_idx}] if span_found else [],
        'soft_labels': soft_labels
    })



Input Text: Petra van Stoveren won a silver medal in the 2008 Summer Olympics in Beijing, China.
Predicted Span Text: 'False'
Start Char Index: -1, End Char Index: 4
Input Text: The Elysiphale order contains 5 genera.
Predicted Span Text: 'False'
Start Char Index: -1, End Char Index: 4
Input Text: Yes, all arachnids have antennas. However, not all of them are visible to the naked eye.
Predicted Span Text: 'True'
Start Char Index: -1, End Char Index: 3
Input Text: Chance the rapper debuted in 2011.
Predicted Span Text: 'False'
Start Char Index: -1, End Char Index: 4
Input Text: The UN's Sustainable City initiative defines a city as one that is:
- Equipped with infrastructure and services to ensure sustainable and equitable access to a range of basic services, such as water, sanitation, and electricity;
-.
Predicted Span Text: 'True'
Start Char Index: -1, End Char Index: 3
Input Text: Zhejing cuisine is known for its unique flavors and cooking techniques. The four main styles are: 1) Jia
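
The debug output above shows what happens when the generated text is not a substring of the model output: `str.find` returns -1, and the end index becomes `len(prediction) - 1` (4 for 'False', 3 for 'True'), which looks like a real offset but is not one. A small sketch of a lookup that makes the "not found" case explicit follows. `locate_span` is a hypothetical helper, not part of the notebook.

```python
def locate_span(text, span_text):
    """Return (start, end) with an exclusive end, or None if span_text is absent or empty."""
    if not span_text:
        return None  # str.find("") would return 0, which is not a useful span
    start = text.find(span_text)
    if start == -1:
        return None
    return start, start + len(span_text)

output_text = "The Elysiphale order contains 5 genera."
print(locate_span(output_text, "False"))  # None
print(locate_span(output_text, "5"))      # (30, 31)
```

The second call lands on the same offsets as the `prob: 1.0` soft label for this sample (`start: 30`, `end: 31`).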

In [93]:
# Print or save the predicted spans with indices
for prediction in predicted_spans:
    print(json.dumps(prediction, indent=2))

# Save results to predictions.jsonl (JSON Lines: one compact JSON object per line,
# so no indent here; indenting would spread each record over several lines)
with open('predictions.jsonl', 'w') as outfile:
    for prediction in predicted_spans:
        json.dump(prediction, outfile)
        outfile.write("\n")

{
  "id": "val-en-1",
  "model_output_text": "Petra van Stoveren won a silver medal in the 2008 Summer Olympics in Beijing, China.",
  "target_text": "What did Petra van Staveren win a gold medal for?",
  "predicted_span": null,
  "hard_labels": [],
  "soft_labels": [
    {
      "start": 10,
      "prob": 0.2,
      "end": 12
    },
    {
      "start": 12,
      "prob": 0.3,
      "end": 13
    },
    {
      "start": 13,
      "prob": 0.2,
      "end": 18
    },
    {
      "start": 25,
      "prob": 0.9,
      "end": 31
    },
    {
      "start": 31,
      "prob": 0.1,
      "end": 37
    },
    {
      "start": 45,
      "prob": 1.0,
      "end": 49
    },
    {
      "start": 49,
      "prob": 0.3,
      "end": 65
    },
    {
      "start": 65,
      "prob": 0.2,
      "end": 69
    },
    {
      "start": 69,
      "prob": 0.9,
      "end": 83
    }
  ]
}
{
  "id": "val-en-2",
  "model_output_text": "The Elysiphale order contains 5 genera.",
  "target_text": "How many genera d

# ALTERNATE VERSION:

Just some experiments I did with the Wikipedia retriever. It does not work right now but can be modified. It should work well, but it uses Wikipedia pages scraped with BeautifulSoup instead of a dataset, which means the retrieved documents are inconsistent (for example, only the first few paragraphs of a Wikipedia page). Just ignore this for now :).

In [None]:
!pip install -U wikipedia
!pip install -U langchain-community
!pip install wiki-passage-retriever

Collecting wikipedia
  Downloading wikipedia-1.4.0.tar.gz (27 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: wikipedia
  Building wheel for wikipedia (setup.py) ... done
  Created wheel for wikipedia: filename=wikipedia-1.4.0-py3-none-any.whl size=11679 sha256=c70f22d7ce995d1ccfd1a5b46ed400dc4c06967211ae743b77df47c1f8c99f7e
  Stored in directory: /root/.cache/pip/wheels/5e/b6/c5/93f3dec388ae76edc830cb42901bb0232504dfc0df02fc50de
Successfully built wikipedia
Installing collected packages: wikipedia
Successfully installed wikipedia-1.4.0
Collecting langchain-community
  Downloading langchain_community-0.3.8-py3-none-any.whl.metadata (2.9 kB)
Collecting SQLAlchemy<2.0.36,>=1.4 (from langchain-community)
  Downloading SQLAlchemy-2.0.35-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (9.6 kB)
Collecting dataclasses-json<0.7,>=0.5.7 (from langchain-community)
  Downloading dataclasses_json-0.6.7-py3-none-any.wh

In [None]:
from langchain_community.retrievers import WikipediaRetriever
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough

from wiki_passage_retriever.retrieve import get_most_relevant_passages

import spacy
from spacy.tokenizer import Tokenizer

The cache for model files in Transformers v4.22.0 has been updated. Migrating your old cache. This is a one-time only operation. You can interrupt this and resume the migration later on by calling `transformers.utils.move_cache()`.



In [None]:
queries = [
    "What is the capital of France?",
    "Who wrote the play 'Hamlet'?",
    "What did Petra van Staveren win a gold medal for?",
    "How did Steve Jobs die?",
]

In [None]:
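nlp = spacy.load('en_core_web_sm')

retriever = WikipediaRetriever()

passages = {}

for query in queries:
    # Run the full spaCy pipeline on the raw string so named entities are available
    doc = nlp(query)
    # Initialise once per query so later entities do not overwrite earlier ones
    passages[query] = {}
    for ent in doc.ents:
        # Retrieve Wikipedia pages for each named entity found in the query
        for wiki_doc in retriever.invoke(ent.text):
            title = wiki_doc.metadata['title']
            # Split the page text into paragraphs on blank lines
            passages[query][title] = wiki_doc.page_content.split("\n\n")
print(passages)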







In [None]:
n = 0
i = 0
keys = list(passages.keys())
entity_keys = list(passages[keys[n]].keys())

for passage in (passages[keys[n]][entity_keys[i]]):
    print(passage)

France, officially the French Republic, is a country located primarily in Western Europe. Its overseas regions and territories include French Guiana in South America, Saint Pierre and Miquelon in the North Atlantic, the French West Indies, and many islands in Oceania and the Indian Ocean, giving it one of the largest discontiguous exclusive economic zones in the world. Metropolitan France shares borders with Belgium and Luxembourg to the north, Germany to the northeast, Switzerland to the east, Italy and Monaco to the southeast, Andorra and Spain to the south, and a maritime border with the United Kingdom to the northwest. Its metropolitan area extends from the Rhine to the Atlantic Ocean and from the Mediterranean Sea to the English Channel and the North Sea. Its eighteen integral regions (five of which are overseas) span a combined area of 643,801 km2 (248,573 sq mi) and have a total population of 68.4 million as of January 2024. France is a semi-presidential republic with its capita

In [None]:
import time
from nltk.tokenize import sent_tokenize

In [None]:
doc_embeddings = []
passages_test = []

for query in passages:
    for ent in passages[query]:
        for text in passages[query][ent]:
            # Split each paragraph into sentences; each sentence becomes one passage
            sentences = sent_tokenize(text)
            for sentence in sentences:
                passage = sentence.strip()
                passages_test.append(passage)
                # Create document embeddings using Cohere (one call per passage)
                embs = co.embed(texts=[passage], model='embed-multilingual-v3.0', input_type="search_document")
                # embs.embeddings holds one vector per input text; keep the only one
                doc_embeddings.append(embs.embeddings[0])
                time.sleep(.5)

# Shape: (num_passages, embedding_dim), so no reshape is needed
doc_embeddings = np.asarray(doc_embeddings)

print(doc_embeddings)
print(passages_test[0])

[[ 0.03649902  0.04437256 -0.01615906 ... -0.00693512  0.05297852
   0.0231781 ]
 [ 0.01878357  0.01704407 -0.03839111 ... -0.00423813  0.04248047
   0.01931763]
 [ 0.0164032   0.0310669  -0.03289795 ...  0.0335083   0.00732422
   0.01004791]
 ...
 [ 0.01049042  0.0134964  -0.01540375 ...  0.03747559  0.01432037
   0.02098083]
 [ 0.00516891  0.05148315 -0.03573608 ...  0.01565552  0.01498413
  -0.00985718]
 [ 0.00354576  0.03131104 -0.00556564 ...  0.03527832  0.00279236
  -0.0091095 ]]
France, officially the French Republic, is a country located primarily in Western Europe. Its overseas regions and territories include French Guiana in South America, Saint Pierre and Miquelon in the North Atlantic, the French West Indies, and many islands in Oceania and the Indian Ocean, giving it one of the largest discontiguous exclusive economic zones in the world. Metropolitan France shares borders with Belgium and Luxembourg to the north, Germany to the northeast, Switzerland to the east, Italy an