## Building a Question Answering Model with Transformers

In this notebook, we demonstrate how to create a question answering model using the Transformers library from Hugging Face. We download and preprocess data from Wikipedia using natural language processing techniques and transformer-based models, fine-tune a pre-trained transformer model on the data, and generate answers to a set of predefined questions using the trained model.

The notebook is divided into the following sections:

1. Introduction ([link](#Section-1:-Introduction))
2. Loading and Preprocessing Data ([link](#Section-2:-Loading-and-Preprocessing-Data))
3. Creating the QADataset Class ([link](#Section-3:-Creating-the-QADataset-Class))
4. Fine-Tuning the Transformer Model ([link](#Section-4:-Fine-Tuning-the-Transformer-Model))
5. Generating Answers to Questions ([link](#Section-5:-Generating-Answers-to-Questions))
6. Conclusions ([link](#Section-6:-Conclusions))
7. References ([link](#Section-7:-References))

The `QADataset` class is defined in Section 3, which is the core of this notebook. It loads the preprocessed data and creates PyTorch datasets for training and validation. The data loading and preprocessing functions are defined in Section 2. The transformer model is fine-tuned on the dataset in Section 4 using the PyTorch Lightning framework, and the trained model is used to generate answers to predefined questions in Section 5. Finally, Section 6 provides a summary of the results and discusses the potential improvements for the question answering model, while Section 7 provides links to resources related to the use of the Transformers library and transformer-based models for natural language processing.


# Section 1: Introduction

In this notebook, we will demonstrate the process of creating a question answering model using the Transformers library from Hugging Face. We will use natural language processing techniques and transformer-based models to achieve the best results.

In [58]:
from pathlib import Path
import json
import os
from typing import List, Dict

import pandas as pd
import requests
from tqdm.auto import tqdm

In [59]:
data_path = "../data"
articles_file_path = f"{data_path}/articles.csv"

dataset_path = "../datasets"
train_csv_path = f"{dataset_path}/train_dataset.csv"
val_csv_path = f"{dataset_path}/val_dataset.csv"

articles_category = 'Science_fiction_films'

models_path = "../models"
finetuned_model_path = f"{models_path}/model.pt"


if not os.path.exists(data_path):
    os.makedirs(data_path, exist_ok=True)
if not os.path.exists(dataset_path):
    os.makedirs(dataset_path, exist_ok=True)

# Section 2: Loading and Preprocessing Data

This section is dedicated to downloading and preprocessing data for the question answering model. We will use the Wikipedia API to download articles on a given topic, split them into sentences, and create a dataset of question-answer pairs.

In [60]:
!pip install mwparserfromhell



In [61]:
import mwparserfromhell

if not os.path.exists(articles_file_path):
    def get_category_members(category: str, member_type: str) -> List[str]:
        base_url = 'https://en.wikipedia.org/w/api.php'
        params = {
            'action': 'query',
            'list': 'categorymembers',
            'cmtitle': category,
            'cmtype': member_type,
            'format': 'json',
            'cmlimit': 500
        }
        response = requests.get(base_url, params=params)
        data = json.loads(response.text)
        members = [item['title'] for item in data['query']['categorymembers']]
        return members


    def get_article_texts(articles: List[str]) -> Dict[str, str]:
        base_url = 'https://en.wikipedia.org/w/api.php'
        params = {
            'action': 'query',
            'prop': 'revisions',
            'rvprop': 'content',
            'format': 'json',
            'titles': '|'.join(articles)
        }
        response = requests.get(base_url, params=params)
        data = json.loads(response.text)
        texts = {}
        for page in data['query']['pages'].values():
            raw_text = page['revisions'][0]['*']
            parsed_text = mwparserfromhell.parse(raw_text)
            cleaned_text = parsed_text.strip_code()
            texts[page['title']] = cleaned_text
        return texts


    def download_articles(category: str) -> pd.DataFrame:
        subcategories = get_category_members(f'Category:{category}', 'subcat')
        all_articles = []

        for subcategory in tqdm(subcategories, desc="Downloading subcategories"):
            articles = get_category_members(subcategory, 'page')
            all_articles.extend(articles)

        article_data = []
        for i in tqdm(range(0, len(all_articles), 50), desc="Downloading articles"):
            batch = all_articles[i:i + 50]
            texts = get_article_texts(batch)
            for title, text in texts.items():
                article_data.append({'title': title, 'text': text})

        return pd.DataFrame(article_data)

    articles_df = download_articles(articles_category)
    articles_df.to_csv(articles_file_path, index=False)
else:
    articles_df = pd.read_csv(articles_file_path)

articles_df.head()


Downloading subcategories:   0%|          | 0/28 [00:00<?, ?it/s]

Downloading articles:   0%|          | 0/28 [00:00<?, ?it/s]

Unnamed: 0,title,text
0,Alfonso Cuarón,"Alfonso Cuarón Orozco ( , ; born 28 November 1..."
1,Brad Bird,"Phillip Bradley Bird (born September 24, 1957)..."
2,Brian Patrick Butler,"Brian Patrick Butler is an American actor, fil..."
3,Charles Band,"Charles Robert Band (born December 27, 1951) i..."
4,Chris Carter (screenwriter),"Christopher Carl Carter (born October 13, 1956..."


# Section 3: Creating the QADataset Class

In this section, we define the QADataset class, which is used to load the preprocessed data and create PyTorch datasets for training and validation. We also define helper functions for keyword extraction and sentence replacement, which are used to create question-answer pairs from sentences.

In [62]:
!pip install spacy transformers
!python -m spacy download en_core_web_sm
!python -m nltk.downloader punkt

Collecting en-core-web-sm==3.5.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.5.0/en_core_web_sm-3.5.0-py3-none-any.whl (12.8 MB)
     --------------------------------------- 12.8/12.8 MB 21.8 MB/s eta 0:00:00
[38;5;2m✔ Download and installation successful[0m
You can now load the package via spacy.load('en_core_web_sm')


In [63]:
from transformers import AutoTokenizer
from sklearn.model_selection import train_test_split
from nltk import sent_tokenize
from torch.utils.data import Dataset
import os
import spacy
import re

base_model_name = "databricks/dolly-v2-3b"
tokenizer = AutoTokenizer.from_pretrained(base_model_name)
nlp = spacy.load("en_core_web_sm")


def get_question_word(entity_label):
    question_word_map = {
        "PERSON": "Who",
        "GPE": "Where",
        "ORG": "Which organization",
        "DATE": "When",
        "TIME": "At what time",
        "NOUN": "What",
        "PROPN": "Which",
    }
    return question_word_map.get(entity_label, "What")


def replace_keywords(sentence, keywords, question_word):
    words = sentence.split()
    replaced = False
    for i, word in enumerate(words):
        if word.strip(".,?;!") in keywords and not replaced:
            words[i] = question_word
            replaced = True
    return " ".join(words)


def extract_keywords(doc):
    # For NER:
    entities = [(ent.text, ent.label_) for ent in doc.ents]
    # For POS tagging (nouns and proper nouns):
    nouns = [(token.text, token.pos_) for token in doc if token.pos_ in ["NOUN", "PROPN"]]
    keywords = entities + nouns
    if not keywords:
        return None
    # Select the first keyword and its label
    keyword, label = keywords[0]
    question_word = get_question_word(label)
    return keyword, question_word


def create_qa_pairs(df):
    qa_pairs = []
    for index in tqdm(df.index, total=df.shape[0], desc="Creating QA pairs"):
        text = df.loc[index, "text"]
        # Remove the section headings
        cleaned_text = re.sub(r"==.*?==+", "", text)
        sentences = sent_tokenize(cleaned_text)
        docs = list(nlp.pipe(sentences))
        for sentence, doc in zip(sentences, docs):
            keyword_info = extract_keywords(doc)
            if keyword_info is not None:
                keyword, question_word = keyword_info
                question = replace_keywords(sentence, keyword, question_word)
                qa_pairs.append((index, question, keyword))
    qa_df = pd.DataFrame(qa_pairs, columns=["index", "question", "answer"])
    return qa_df


def split_data(qa_pairs, test_size=0.2, random_state=42):
    train_data, val_data = train_test_split(qa_pairs, test_size=test_size, random_state=random_state)
    return train_data, val_data

def save_qa_pairs_to_csv(train_qa_pairs, val_qa_pairs, train_csv_path, val_csv_path):
    train_df = pd.DataFrame(train_qa_pairs, columns=["index", "question", "answer"])
    val_df = pd.DataFrame(val_qa_pairs, columns=["index", "question", "answer"])
    train_df.to_csv(train_csv_path, index=False)
    val_df.to_csv(val_csv_path, index=False)

def load_qa_pairs_from_csv(train_csv_path, val_csv_path):
    train_df = pd.read_csv(train_csv_path)
    val_df = pd.read_csv(val_csv_path)
    return train_df, val_df


class QADataset(Dataset):
    def __init__(self, df, qa_df, tokenizer, max_input_length=512, max_output_length=512):
        self.df = df
        self.qa_df = qa_df
        self.tokenizer = tokenizer
        self.max_input_length = max_input_length
        self.max_output_length = max_output_length

    def __len__(self):
        return len(self.qa_df)

    def __getitem__(self, idx):
        index, question, answer = self.qa_df.loc[idx, ["index", "question", "answer"]]
        tokenized_input = self.tokenizer(question, max_length=self.max_input_length, padding="max_length",
                                         truncation=True, return_tensors="pt")
        tokenized_output = self.tokenizer(answer, max_length=self.max_output_length, padding="max_length",
                                          truncation=True, return_tensors="pt")
        return {"input_ids": tokenized_input["input_ids"].squeeze(),
                "attention_mask": tokenized_input["attention_mask"].squeeze(),
                "labels": tokenized_output["input_ids"].squeeze()}

# load train and validation datasets if they exist or create them and save them
if os.path.exists(train_csv_path) and os.path.exists(val_csv_path):
    train_qa_df, val_qa_df = load_qa_pairs_from_csv(train_csv_path, val_csv_path)
else:
    qa_df = create_qa_pairs(articles_df)
    train_qa_df, val_qa_df = split_data(qa_df)
    save_qa_pairs_to_csv(train_qa_df, val_qa_df, train_csv_path, val_csv_path)

train_dataset = QADataset(articles_df, train_qa_df, tokenizer)
val_dataset = QADataset(articles_df, val_qa_df, tokenizer)

Creating QA pairs:   0%|          | 0/1357 [00:00<?, ?it/s]

In [64]:
print(f"Number of training samples: {len(train_dataset)}")
train_qa_df.sample(5)

Number of training samples: 97236


Unnamed: 0,index,question,answer
105798,1151,The new name Which organization derived from T...,Atragon
101380,1105,Who later directed a feature-length stop-motio...,Burton
62058,706,"In Who turmoil, the last Mother Box is left un...",Mother Box
24457,295,Which organization Cinema criticized the drast...,Japan Cinema
101670,1108,It grossed What million at the U.S. box office...,$19.4 million


In [65]:
print(f"Number of validation samples: {len(val_dataset)}")
val_qa_df.sample(5)

Number of validation samples: 24310


Unnamed: 0,index,question,answer
94683,1029,Co-director Who reflected on the film in 2016 ...,Morton
72616,806,While both aspects are present in Which organi...,MacReady
79773,866,Which organization 1998 Fantasia's Toronto edi...,Fant-Asia
1864,24,"""Tobias, Scott (20 May 2007).",Tobias
46641,567,"What spoke to this in June 2018, stating that ...",Feige


# Section 4: Fine-Tuning the Transformer Model

This section is dedicated to fine-tuning the transformer-based language model using the PyTorch Lightning framework. We use the Trainer class from the transformers library to train the model on the QADataset, and evaluate its performance on a validation set. We also save the model and tokenizer for later use.

In [66]:
from transformers import AutoModelForCausalLM, TrainingArguments, Trainer
from torch.utils.data import DataLoader

batch_size = 3

model = AutoModelForCausalLM.from_pretrained(base_model_name)
train_dataloader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
val_dataloader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False)

training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    eval_accumulation_steps=1,
    prediction_loss_only=True,
    learning_rate=5e-5,
    weight_decay=0.01,
    save_total_limit=1,
    load_best_model_at_end=True,
    metric_for_best_model="loss",
    greater_is_better=False,
    evaluation_strategy="epoch",
    save_strategy="epoch",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
)

trainer.train()
trainer.evaluate()
model.save_pretrained(finetuned_model_path)
tokenizer.save_pretrained(finetuned_model_path)

OutOfMemoryError: CUDA out of memory. Tried to allocate 76.00 MiB (GPU 0; 6.00 GiB total capacity; 5.26 GiB already allocated; 0 bytes free; 5.28 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

# Section 5: Generating Answers to Questions

In this section, we load the saved model and tokenizer, and use them to generate answers to a set of predefined questions. We use the pipeline function from the transformers library to generate text from input strings, and print the generated answers to the console.

In [None]:
from transformers import pipeline

model = AutoModelForCausalLM.from_pretrained(finetuned_model_path)
tokenizer = AutoTokenizer.from_pretrained(finetuned_model_path)
qa_pipeline = pipeline("text2text-generation", model=model, tokenizer=tokenizer)
questions = [
    "What is the main theme of the movie Blade Runner?",
    "Who is the author of the novel Dune?",
    "What is the name of the spaceship in the Alien movie?",
    "Who directed the movie The Matrix?",
    "What is the setting of the Star Wars series?",
    "What is the similarity between the movie Matrix and Star Wars?",
]

for question in questions:
    answer = qa_pipeline(question, max_length=50)[0]["generated_text"]
    print(f"Q: {question}\nA: {answer}\n")


# Section 6: Conclusions

This section provides a summary of the results and discusses the potential improvements for the question answering model. We also discuss the applications of the Transformers library for natural language processing tasks in general.

# Section 7: References
This section provides links to resources related to the use of the Transformers library and transformer-based models for natural language processing. Links to the official Hugging Face website, examples of transformer-based models, and research papers on transformer models are provided.

Hugging Face official website: https://huggingface.co/
Transformers: https://huggingface.co/transformers/
Transformers examples: https://huggingface.co/examples
"Attention Is All You Need" paper on Transformers: https://arxiv.org/abs/1706.03762
"BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding" paper: https://arxiv.org/abs/1810.04805