## Building a Question Answering Model with Transformers

In this notebook, we demonstrate how to create a question answering model using the Transformers library from Hugging Face. We download and preprocess data from Wikipedia using natural language processing techniques and transformer-based models, fine-tune a pre-trained transformer model on the data, and generate answers to a set of predefined questions using the trained model.

# Install dependencies
We install pytorch with built-in CUDA support. If you don't have CUDA, you can install pytorch without CUDA support. You can find more information [here](https://pytorch.org/get-started/locally/).
Also we install transformers, pandas, mwparserfromhell, nltk, accelerate and nvidia-ml-py3.
We use mwparserfromhell to parse the raw text of the Wikipedia articles, nltk for tokenization, accelerate for multi-GPU training and nvidia-ml-py3 for GPU monitoring.

In [3]:
!pip install torch torchvision torchaudio --index-url https: // download.pytorch.org/whl/cu117
!pip install transformers pandas mwparserfromhell nltk accelerate nvidia-ml-py3 datasets

ERROR: Directory '//' is not installable. Neither 'setup.py' nor 'pyproject.toml' found.






# Download data from Wikipedia
We download each article about some specific category from Wikipedia. We use the category "Science fiction films" as an example. You can change the category to any other category you want. We also remove the references and external links sections from the articles and wiki markup and save the result to a CSV file.

In [4]:
import os
from typing import List, Dict

import mwparserfromhell
import pandas as pd
import requests
from accelerate import Accelerator
from tqdm.auto import tqdm
from transformers import T5Tokenizer, T5ForConditionalGeneration

# Set the category you want to download
csv_filename = "../data/articles.csv"
articles_category = 'Science_fiction_films'

# Get the list of subcategories and articles
def get_category_members(category: str, member_type: str) -> List[str]:
    base_url = 'https://en.wikipedia.org/w/api.php'
    params = {
        'action': 'query',
        'list': 'categorymembers',
        'cmtitle': category,
        'cmtype': member_type,
        'format': 'json',
        'cmlimit': 500
    }
    response = requests.get(base_url, params=params)
    data = response.json()
    for item in data['query']['categorymembers']:
        yield item['title']

# Get the raw text of the articles
def get_article_texts(articles: List[str]) -> Dict[str, str]:
    base_url = 'https://en.wikipedia.org/w/api.php'
    params = {
        'action': 'query',
        'prop': 'revisions',
        'rvprop': 'content',
        'format': 'json',
        'titles': '|'.join(articles)
    }
    response = requests.get(base_url, params=params)
    data = response.json()
    for page in data['query']['pages'].values():
        yield page['revisions'][0]['*']


def parse_and_remove_references_and_external_links(text: str) -> mwparserfromhell.wikicode.Wikicode:
    parsed_text = mwparserfromhell.parse(text)
    for section in parsed_text.get_sections(levels=[2]):
        if section.filter_headings()[0].title.strip().lower() in ["references", "external links", "see also", "awards and nominations", "filmography"]:
            parsed_text.remove(section)
    return parsed_text

# Download the articles by chunks of 50 articles
def download_articles(category: str):
    # Get the list of subcategories and articles
    subcategories = list(get_category_members(f'Category:{category}', 'subcat'))
    all_articles = []

    # Download the articles from the subcategories
    for subcategory in tqdm(subcategories, desc="Downloading subcategories"):
        articles = list(get_category_members(subcategory, 'page'))
        all_articles.extend(articles)

    # Download the articles by chunks of 50 articles
    for i in tqdm(range(0, len(all_articles), 50), desc="Downloading articles"):
        batch = all_articles[i:i + 50]
        texts = dict(zip(batch, get_article_texts(batch)))
        for title, wiki_text in texts.items():
            wiki_text_clean = parse_and_remove_references_and_external_links(wiki_text)
            text = wiki_text_clean.strip_code().strip()
            # Remove the references and external links sections
            yield {'title': title, 'wiki_text': wiki_text, 'wiki_text_clean': wiki_text_clean, 'text': text}

# Download the articles and save them to a CSV file
if not os.path.exists(csv_filename):
    articles_df = pd.DataFrame(download_articles(articles_category))
    articles_df = articles_df.dropna(subset=['text'])
    articles_df.to_csv(csv_filename, index=False)
else:
    # If the CSV file already exists, we just load it
    articles_df = pd.read_csv(csv_filename)
    articles_df.to_csv(csv_filename, index=False)

articles_df.tail()

Unnamed: 0,title,wiki_text,wiki_text_clean,text
1361,The Tunnel (1933 German-language film),{{Infobox film\n | name = Totò nella luna\n | ...,{{Infobox film\n | name = Totò nella luna\n | ...,Totò nella luna (internationally released as T...
1362,Ureme (film series),{{short description|1936 film by Del Lord}}\n{...,{{short description|1936 film by Del Lord}}\n{...,Trapped by Television is a 1936 American comed...
1363,The Voice from the Sky,{{More citations needed|date=June 2019}}\n''''...,{{More citations needed|date=June 2019}}\n''''...,"Ureme (also spelled ulemae, uroemae or wuroema..."
1364,Welcome to Willits,{{Short description|2016 film by Trevor Ryan}}...,{{Short description|2016 film by Trevor Ryan}}...,"Welcome to Willits, also known as Alien Hunter..."
1365,Yuli (2018 film),{{Infobox film\n| name = Yuli\n| ima...,{{Infobox film\n| name = Yuli\n| ima...,Yuli is a 2018 Peruvian science fiction action...


# Generate question-answer pairs
We use T5 to generate question-answer pairs from the Wikipedia articles. We use the T5-small model and the T5 tokenizer.


In [19]:
# Load the model and tokenizer
tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-small")
model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-small")

# Prepare the model for distributed training
accelerator = Accelerator()
model, tokenizer = accelerator.prepare(model, tokenizer)
device = accelerator.device

# Load the dataset
articles_df = pd.read_csv(csv_filename)
csv_questions_filename = "../data/questions.csv"


example_context = "Arthur Samuel, an IBM employee and pioneer in the field of computer gaming and artificial intelligence, made significant contributions to the field of machine learning and AI. He coined the term \"machine learning\" in 1959 and also used the synonym \"self-teaching computers\" during that time period. His work laid the foundation for the development and growth of machine learning as a crucial aspect of artificial intelligence."
example_section = "The term machine learning was coined in 1959 by Arthur Samuel, an IBM employee and pioneer in the field of computer gaming and artificial intelligence. The synonym self-teaching computers was also used in this time period."
example_question = "What was the contribution of Arthur Samuel in the field of machine learning and AI?"
example_answer = """
    Arthur Samuel, an IBM employee and pioneer in the field of computer gaming and artificial intelligence, made significant contributions to the field of machine learning and AI. He coined the term "machine learning" in 1959 and also used the synonym "self-teaching computers" during that time period. His work laid the foundation for the development and growth of machine learning as a crucial aspect of artificial intelligence.
"""

example_question_prompt = f"""
    Context: {example_context}
    Section: {example_section}
    Question: {example_question}
"""

example_answer_prompt = f"""
    Context: {example_context}
    Section: {example_section}
    Question: {example_question}
    Answer: {example_answer.strip()}
"""

def make_query_to_model(query) -> str:
    input_ids = tokenizer(query, return_tensors="pt", max_length=512, truncation=True).input_ids.to(device)
    outputs = model.generate(input_ids, max_length=128, num_beams=4, early_stopping=True, temperature=0.7)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)


def is_valid_question(question, existing_questions):
    return question.endswith("?") and existing_questions[existing_questions["question"] == question].empty


def is_valid_answer(answer, existing_answers):
    first_word = answer.split(" ")[0]
    return first_word not in ["Yes", "No"] and len(answer.split(" ")) > 1 and existing_answers[existing_answers["answer"] == answer].empty

def generate_and_evaluate_questions_and_answers(row):
    text_df = pd.DataFrame(columns=["question", "answer"])
    sections = mwparserfromhell.parse(row["wiki_text_clean"]).get_sections(levels=[2])
    article_summary = make_query_to_model(f"Summary: {row['text']}")
    for section in sections:
        section_text = section.strip_code().strip()
        question_prompt = f"""
            {example_question_prompt}

            Context: {article_summary}
            Section: {section_text}
            Question:
        """
        question = make_query_to_model(question_prompt)

        if not is_valid_question(question, text_df):
            continue

        answer_prompt = f"""
            {example_answer_prompt}

            Context: {article_summary}
            Section: {section_text}
            Question: {question}
            Answer:
        """
        answer = make_query_to_model(answer_prompt)

        if not is_valid_answer(answer, text_df):
            continue

        text_df = pd.concat([text_df, pd.DataFrame({"question": [question], "answer": [answer]})])

    return text_df


def load_or_generate_question_answers(csv_questions_filename, articles_df):
    if os.path.exists(csv_questions_filename):
        return pd.read_csv(csv_questions_filename)
    else:
        question_answers_df = generate_all_questions_and_answers(articles_df)
        question_answers_df = question_answers_df.drop_duplicates(keep=False, subset='question')
        question_answers_df = question_answers_df.drop_duplicates(keep=False, subset='answer')
        question_answers_df.to_csv(csv_questions_filename, index=False)
        return question_answers_df


def generate_all_questions_and_answers(articles_df):
    question_answers_df = pd.DataFrame(columns=["question", "answer"])

    with tqdm(articles_df.iterrows(), desc="Generating questions and answers", total=len(articles_df)) as pbar:
        for index, row in pbar:
            created_question_answers = generate_and_evaluate_questions_and_answers(row)
            question_answers_df = pd.concat([question_answers_df, created_question_answers])

            if not question_answers_df.empty:
                pbar.set_postfix_str("Last generated question-answer pair: " + question_answers_df.iloc[-1]["question"] + " - " + question_answers_df.iloc[-1]["answer"])

    return question_answers_df

question_answers_df = load_or_generate_question_answers(csv_questions_filename, articles_df)
question_answers_df.tail()

Unnamed: 0,question,answer
1819,What is the full name of the actor who plays B...,Bill Sage
1820,What was the first film made by the Ryans?,Welcome to Willits: After Sundown
1821,What is the full name of the director of Welco...,Trevor Ryan
1822,What is the name of the character that Yuli de...,Jorge ‘Coco’ Gutiérrez.
1823,What was the name of the film that took place ...,Marisela Puicón
