# Context generation for 'prompts.json'

Although our pipeline can retrieve context relevant to a question posed to the QA model in real time, here we process the file 'prompts.json' to add context to each question ahead of time. This preprocessing step speeds up inference and avoids regenerating the context every time the inference process is run.

We start by installing the packages which are required to run the notebook.

In [1]:
!pip install transformers
!pip install sentencepiece
!pip install wikipedia-api

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.30.2-py3-none-any.whl (7.2 MB)
Collecting huggingface-hub<1.0,>=0.14.1 (from transformers)
  Downloading huggingface_hub-0.15.1-py3-none-any.whl (236 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1 (from transformers)
  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
Collecting safetensors>=0.3.1 (from transformers)
  Downloading safetensors-0.3.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)

Then, we set all the seeds to ensure the reproducibility of the code.

In [2]:
import torch
import numpy as np
import random

# Set the seed value
seed_value = 0

random.seed(seed_value) # Python
np.random.seed(seed_value) # numpy
torch.manual_seed(seed_value) # PyTorch

# If a GPU is used, set the seed for it as well
if torch.cuda.is_available():
    torch.cuda.manual_seed(seed_value)
    torch.cuda.manual_seed_all(seed_value)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
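
As a quick sanity check of what seeding provides, the sketch below uses only Python's built-in `random` module: re-seeding with the same value reproduces the same draws. The helper name `draw` is illustrative and not part of the notebook; the same idea applies to the NumPy and PyTorch generators seeded above.

```python
import random

def draw(seed, n=3):
    # Re-seed the generator, then draw n pseudo-random floats
    random.seed(seed)
    return [random.random() for _ in range(n)]

first = draw(0)
second = draw(0)
other = draw(1)

# Same seed gives the same sequence; a different seed gives a different one
print(first == second)
print(first == other)
```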

We now load the Keyword Generator Model (fine-tuned separately, see the corresponding notebook), together with its tokenizer.

In [3]:
from transformers import T5Tokenizer, T5ForConditionalGeneration
import wikipediaapi
import regex as re
import json

# Load the fine-tuned model and its tokenizer
try:
    model = T5ForConditionalGeneration.from_pretrained("lucazed/keyword-generator-complete")
    tokenizer = T5Tokenizer.from_pretrained("lucazed/keyword-generator-complete")
except Exception as e:
    print("Failed to load model or tokenizer:", e)
    model, tokenizer = None, None

Next, we define a function that, given a question, returns a list of candidate keywords (generated with beam search), each paired with the language of the question. By default, the function uses ten beams and returns ten keywords.

In [4]:
def generate_keywords_and_languages(question, num_return_sequences=10, num_beams=10):
    try:
        # Encode the prompt as a PyTorch tensor of token ids
        input_ids = tokenizer.encode('Keyword and Language of: ' + question, return_tensors="pt")

        # Generate a sequence of ids
        output_ids = model.generate(
            input_ids,
            max_length=10,
            num_return_sequences=num_return_sequences,
            no_repeat_ngram_size=3,
            num_beams=num_beams,
            early_stopping=True
        )

        # Decode the sequences
        keyword_and_language_pairs = [tokenizer.decode(ids, skip_special_tokens=True) for ids in output_ids]

        # Split the keyword and language
        keywords_and_languages = [pair.split("|") for pair in keyword_and_language_pairs]

    except Exception as e:
        print("Failed to generate keywords and languages:", e)
        keywords_and_languages = []

    return keywords_and_languages
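
Judging from the `split("|")` call, each decoded string appears to follow a `keyword|language` layout, and the language part can be empty or missing altogether. A minimal sketch of a more defensive parser that always yields a (keyword, language) pair is shown here; `parse_pair` is a hypothetical helper, not part of the notebook, and the sample strings are illustrative.

```python
def parse_pair(decoded):
    # Split on the first "|" only; str.partition always returns three parts,
    # so a missing separator simply leaves the language empty
    keyword, _, language = decoded.partition("|")
    return keyword.strip(), language.strip().upper()

samples = [
    "Tail recursion|EN",           # keyword and language
    "MapTr (computing)|",          # separator present, language empty
    "MapTr (computing language)",  # separator missing
]
pairs = [parse_pair(s) for s in samples]
print(pairs)
```

With this shape, the caller no longer needs to branch on the length of each split result: an empty language string can directly mean "query both Wikipedias".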

To make the code more robust, we also introduce a function that removes everything between parentheses in a keyword. This is useful because some keywords returned by the model include abbreviations in parentheses, which sometimes makes the Wikipedia page retrieval harder.

In [5]:
def remove_parentheses(text):
    # Use regular expression to remove everything between parentheses
    pattern = r"\([^()]*\)"
    result = re.sub(pattern, "", text)
    return result
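
To see what this pattern does and does not handle, the sketch below applies the same regular expression to a few keywords of the kind the model produces. It uses the standard-library `re` module instead of the `regex` package (an assumption made only to keep the example self-contained; the pattern behaves the same in both).

```python
import re

def remove_parentheses(text):
    # Same pattern as in the notebook: drop each "(...)" group
    return re.sub(r"\([^()]*\)", "", text)

balanced = remove_parentheses("MapTr (computing)")
untouched = remove_parentheses("Tail recursion")
unbalanced = remove_parentheses("Coefficient d'indice impair (")

# repr() makes leftover whitespace visible
for result in (balanced, untouched, unbalanced):
    print(repr(result))
```

Two details are worth noting: a trailing space is left behind after a removed group (so a `.strip()` on the result may be useful), and an unbalanced opening parenthesis, which can occur when generation is cut off by `max_length`, is left unchanged.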

If set to 'True', the debug variable enables extended output on the various steps of the context retrieval procedure.

In [6]:
debug = True

We then introduce a function that loads JSON files.

In [7]:
def load_json(file):
    with open(file, 'r') as f:
        data = json.load(f)
    return data
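
For reference, the pipeline further down reads the fields `guid`, `question` and `answer` from each record. The sketch below writes a tiny file with that layout and loads it back; the record values and the file name `prompts_sample.json` are made up for illustration, and the explicit UTF-8 encoding is an assumption that seems sensible given that some questions are in French.

```python
import json
import os
import tempfile

def load_json(file):
    # Explicit encoding avoids platform-dependent defaults
    with open(file, "r", encoding="utf-8") as f:
        return json.load(f)

# Hypothetical record; only the field names come from the notebook
sample = [
    {
        "guid": "example-guid",
        "question": "Qu'est-ce qu'une fonction récursive terminale ?",
        "answer": "placeholder answer",
    }
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "prompts_sample.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(sample, f, ensure_ascii=False)
    loaded = load_json(path)

print(loaded[0]["question"])
```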

At this point, the main pipeline is executed: it reads the dataset 'prompts.json', extracts each question, passes it to the keyword generator model to obtain a list of keywords, and looks these keywords up on Wikipedia through the Wikipedia API. To make the code more robust, if the model does not return the language associated with a particular keyword, we query both the English and the French Wikipedia. If none of the keywords leads to a valid Wikipedia page (disambiguation pages are skipped), the context field of the dataset is left empty. Otherwise, the summary of the first valid page found is used as context.
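
The selection rule described above can be sketched independently of the network calls. This is a simplified illustration under stated assumptions, not the notebook's exact control flow: it omits the retry with parentheses removed and, unlike the cell below, it also tries the French page when the English page is a disambiguation page. The names `is_disambiguation`, `languages_to_query`, `first_valid_summary`, `FAKE_WIKI` and `fake_lookup` are hypothetical, and the in-memory dictionary merely stands in for Wikipedia.

```python
# Marker strings copied from the notebook's disambiguation check
DISAMBIGUATION_MARKERS = (
    "may refer to",
    "plusieurs concepts",
    "dans les articles suivants",
    "Suivant le contexte, le terme",
)

def is_disambiguation(text):
    return any(marker in text for marker in DISAMBIGUATION_MARKERS)

def languages_to_query(language):
    # A recognised tag selects one edition; anything else falls back to both
    if language == "EN":
        return ["en"]
    if language == "FR":
        return ["fr"]
    return ["en", "fr"]

def first_valid_summary(candidates, lookup):
    # candidates: list of (keyword, language) pairs, best candidate first
    # lookup(lang, keyword) returns (text, summary), or None if no page exists
    for keyword, language in candidates:
        if not keyword:
            continue
        for lang in languages_to_query(language):
            page = lookup(lang, keyword)
            if page is None:
                continue
            text, summary = page
            if is_disambiguation(text):
                continue
            return summary
    return ""  # no valid page: the context stays empty

# Hypothetical in-memory stand-in for Wikipedia, for illustration only
FAKE_WIKI = {
    ("en", "Map"): ("Map may refer to: ...", "Map may refer to:"),
    ("en", "Tail call"): ("Full article text about tail calls.", "Summary about tail calls."),
}

def fake_lookup(lang, keyword):
    return FAKE_WIKI.get((lang, keyword))

candidates = [("MapTr", "EN"), ("Map", "EN"), ("Tail call", "")]
context = first_valid_summary(candidates, fake_lookup)
print(context)
```

Keeping the lookup behind a function argument makes the rule easy to test without network access, and collecting the marker strings in one tuple avoids repeating the same condition in every branch.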

In [8]:
# The dataset is loaded from a JSON file
prompts = load_json("/content/prompts.json")
datapoints = []

# For each prompt
for count, prompt in enumerate(prompts):
    print(f"Processing prompt {count + 1} of {len(prompts)}")

    guid = prompt["guid"]
    question = prompt["question"]
    answer = prompt["answer"]

    # Generate the keywords and languages (for the Wikipedia search)
    keywords_list = generate_keywords_and_languages(question)
    print(keywords_list)

    context = ""
    finished = False

    # For each keyword and language in the list
    for keyword_and_language in keywords_list:
        # If the keyword and language are both present, use them
        if len(keyword_and_language) == 2:
            keyword, language = keyword_and_language
        # If only the keyword is present, use it and keep the language empty (to use both English and French Wikipedia)
        elif len(keyword_and_language) == 1:
            keyword = keyword_and_language[0]
            language = ""
        else:
            keyword = ""
            language = ""
        try:
            if language == "EN":
                # Use the English Wikipedia
                wiki_wiki = wikipediaapi.Wikipedia("en")
            elif language == "FR":
                # Use the French Wikipedia
                wiki_wiki = wikipediaapi.Wikipedia("fr")
            else:
                # Use both the English and French Wikipedia
                wiki_wiki_1 = wikipediaapi.Wikipedia("en")
                wiki_wiki_2 = wikipediaapi.Wikipedia("fr")
            if not finished:
                if language in ("EN", "FR") and keyword != "":
                    # Get the Wikipedia page for the keyword
                    page = wiki_wiki.page(keyword)
                    # If the page exists
                    if page.exists():
                        # If the page is a disambiguation page, skip it
                        if "may refer to" in page.text or "plusieurs concepts" in page.text or "dans les articles suivants" in page.text or "Suivant le contexte, le terme" in page.text:
                            if debug:
                                print(f"Skipping disambiguation page for '{keyword}'")
                        else:
                            # Get the summary of the page and use it as the context
                            context = page.summary
                            if debug:
                                print(f"Main definition for '{keyword}':")
                                print(page.summary)
                            finished = True
                    else:
                        # If the page doesn't exist, try to remove the parentheses from the keyword
                        page = wiki_wiki.page(remove_parentheses(keyword))
                        if page.exists():
                            # If the page is a disambiguation page, skip it
                            if "may refer to" in page.text or "plusieurs concepts" in page.text or "dans les articles suivants" in page.text or "Suivant le contexte, le terme" in page.text:
                                if debug:
                                    print(f"Skipping disambiguation page for '{remove_parentheses(keyword)}'")
                            else:
                                # Get the summary of the page and use it as the context
                                context = page.summary
                                if debug:
                                    print(f"Main definition for '{remove_parentheses(keyword)}':")
                                    print(page.summary)
                                finished = True
                        else:
                            if debug:
                                print(f"No webpage found for '{keyword}'")
                elif keyword != "":
                    page_en = wiki_wiki_1.page(keyword)
                    page_fr = wiki_wiki_2.page(keyword)
                    # If the page exists in English
                    if page_en.exists():
                        # If the page is a disambiguation page, skip it
                        if "may refer to" in page_en.text or "plusieurs concepts" in page_en.text or "dans les articles suivants" in page_en.text or "Suivant le contexte, le terme" in page_en.text:
                            if debug:
                                print(f"Skipping disambiguation page for '{keyword}'")
                        else:
                            # Get the summary of the page and use it as the context
                            context = page_en.summary
                            if debug:
                                print(f"Main definition for '{keyword}':")
                                print(page_en.summary)
                            finished = True
                    # If the page exists in French
                    elif page_fr.exists():
                        # If the page is a disambiguation page, skip it
                        if "may refer to" in page_fr.text or "plusieurs concepts" in page_fr.text or "dans les articles suivants" in page_fr.text or "Suivant le contexte, le terme" in page_fr.text:
                            if debug:
                                print(f"Skipping disambiguation page for '{keyword}'")
                        else:
                            # Get the summary of the page and use it as the context
                            context = page_fr.summary
                            if debug:
                                print(f"Main definition for '{keyword}':")
                                print(page_fr.summary)
                            finished = True
                    else:
                        # If the page doesn't exist, try to remove the parentheses from the keyword
                        page_en = wiki_wiki_1.page(remove_parentheses(keyword))
                        page_fr = wiki_wiki_2.page(remove_parentheses(keyword))
                        # If the page exists in English
                        if page_en.exists():
                            # If the page is a disambiguation page, skip it
                            if "may refer to" in page_en.text or "plusieurs concepts" in page_en.text or "dans les articles suivants" in page_en.text or "Suivant le contexte, le terme" in page_en.text:
                                if debug:
                                    print(f"Skipping disambiguation page for '{remove_parentheses(keyword)}'")
                            else:
                                # Get the summary of the page and use it as the context
                                context = page_en.summary
                                if debug:
                                    print(f"Main definition for '{remove_parentheses(keyword)}':")
                                    print(page_en.summary)
                                finished = True
                        # If the page exists in French
                        elif page_fr.exists():
                            # If the page is a disambiguation page, skip it
                            if "may refer to" in page_fr.text or "plusieurs concepts" in page_fr.text or "dans les articles suivants" in page_fr.text or "Suivant le contexte, le terme" in page_fr.text:
                                if debug:
                                    print(f"Skipping disambiguation page for '{remove_parentheses(keyword)}'")
                            else:
                                # Get the summary of the page and use it as the context
                                context = page_fr.summary
                                if debug:
                                    print(f"Main definition for '{remove_parentheses(keyword)}':")
                                    print(page_fr.summary)
                                finished = True
                        else:
                            if debug:
                                print(f"No webpage found for '{keyword}'")
        except Exception as e:
            print(f"Failed to retrieve keyword '{keyword}' in language '{language}':", e)

    # Create the datapoint
    datapoint = {
        "guid": guid,
        "question": question,
        "answer": answer,
        "context": context
    }

    # Add the datapoint to the list of datapoints
    datapoints.append(datapoint)

# Save the datapoints to a JSON file
with open("datapoints_context.json", "w") as f:
    json.dump(datapoints, f)

Processing prompt 1 of 100
[["Coefficient d'indice impair", ''], ["Coefficient d'indices impair"], ["Coefficient d'inférence imp"], ["Coefficient d'indice impairs"], ["Coefficient d'indicateurs imp"], ["Coefficient d'intégration"], ["Coefficients d'indices imp"], ["Coefficient d'indice impair ("], ["Coefficient d'index impairs"], ["Coefficient d'interruption"]]
No webpage found for 'Coefficient d'indice impair'
No webpage found for 'Coefficient d'indices impair'
No webpage found for 'Coefficient d'inférence imp'
No webpage found for 'Coefficient d'indice impairs'
No webpage found for 'Coefficient d'indicateurs imp'
No webpage found for 'Coefficient d'intégration'
No webpage found for 'Coefficients d'indices imp'
No webpage found for 'Coefficient d'indice impair ('
No webpage found for 'Coefficient d'index impairs'


Token indices sequence length is longer than the specified maximum sequence length for this model (564 > 512). Running this sequence through the model will result in indexing errors


No webpage found for 'Coefficient d'interruption'
Processing prompt 2 of 100
[['MapTr', 'EN'], ['Tail-recursion', 'EN'], ['Tail-recursive programming', 'EN'], ['Tail recursive programming', 'EN'], ['MapTr (programming language)', ''], ['Tail recursion', 'EN'], ['MapTrNil', 'EN'], ['MapTr (computing)', ''], ['MapTr and mapTr', 'EN'], ['MapTr (computing language)']]
No webpage found for 'MapTr'
Main definition for 'Tail-recursion':
In computer science, a tail call is a subroutine call performed as the final action of a procedure. If the target of a tail is the same subroutine, the subroutine is said to be tail recursive, which is a special case of direct recursion. Tail recursion (or tail-end recursion) is particularly useful, and is often easy to optimize in implementations. 
Tail calls can be implemented without adding a new stack frame to the call stack. Most of the frame of the current procedure is no longer needed, and can be replaced by the frame of the tail call, modified as appro