## Building a Question Answering Model with Transformers

In this notebook, we demonstrate how to create a question answering model using the Transformers library from Hugging Face. We download articles from Wikipedia, turn their sentences into fill-in-the-blank question-answer pairs, fine-tune a pre-trained causal language model (`databricks/dolly-v2-3b`) on those pairs, and generate answers to a set of predefined questions using the fine-tuned model.

The notebook is divided into the following sections:

1. Introduction ([link](#Section-1:-Introduction))
2. Loading and Preprocessing Data ([link](#Section-2:-Loading-and-Preprocessing-Data))
3. Creating the QADataset Class ([link](#Section-3:-Creating-the-QADataset-Class))
4. Fine-Tuning the Transformer Model ([link](#Section-4:-Fine-Tuning-the-Transformer-Model))
5. Generating Answers to Questions ([link](#Section-5:-Generating-Answers-to-Questions))
6. Conclusions ([link](#Section-6:-Conclusions))
7. References ([link](#Section-7:-References))

The `QADataset` class, defined in Section 3, is the core of this notebook: it wraps the question-answer pairs as PyTorch datasets for training and validation. Section 2 defines the data download and preprocessing functions. Section 4 fine-tunes the transformer model with the `Trainer` class from the Transformers library, and Section 5 uses the fine-tuned model to answer a set of predefined questions. Section 6 summarizes the results and discusses potential improvements, and Section 7 links to resources on the Transformers library and transformer-based models for natural language processing.


# Section 1: Introduction

This notebook walks through building a question answering model with the Transformers library from Hugging Face. The cell below sets the file paths, the base model name, and the Wikipedia category used throughout, and imports the libraries needed to download the data.

In [1]:
articles_file_path = "../data/articles.csv"
train_dataset_path = "../datasets/train_dataset.pkl"
val_dataset_path = "../datasets/val_dataset.pkl"
tokenizer_for_sentence_splitting = 'punkt'
base_model_name = "databricks/dolly-v2-3b"
model_path = "./finetuned_model"
articles_category = 'Science_fiction_films'

batch_size = 1
import json
import os
from typing import List, Dict

import pandas as pd
import requests
from tqdm.auto import tqdm

# Section 2: Loading and Preprocessing Data

This section downloads the raw data for the question answering model. We use the Wikipedia API to list the pages in the subcategories of a given category, fetch their wikitext in batches of 50 titles, and cache the result as a CSV file. Sentence splitting and the creation of question-answer pairs follow in Section 3.

In [2]:
if not os.path.exists(articles_file_path):
    def get_category_members(category: str, member_type: str) -> List[str]:
        base_url = 'https://en.wikipedia.org/w/api.php'
        params = {
            'action': 'query',
            'list': 'categorymembers',
            'cmtitle': category,
            'cmtype': member_type,
            'format': 'json',
            'cmlimit': 500
        }
        response = requests.get(base_url, params=params)
        response.raise_for_status()  # fail early on HTTP errors
        data = response.json()
        members = [item['title'] for item in data['query']['categorymembers']]
        return members


    def get_article_texts(articles: List[str]) -> Dict[str, str]:
        base_url = 'https://en.wikipedia.org/w/api.php'
        params = {
            'action': 'query',
            'prop': 'revisions',
            'rvprop': 'content',
            'format': 'json',
            'titles': '|'.join(articles)
        }
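        response = requests.get(base_url, params=params)
        response.raise_for_status()  # fail early on HTTP errors
        data = response.json()
        texts = {}
        for page in data['query']['pages'].values():
            # Missing or invalid titles come back without a 'revisions' entry.
            if 'revisions' in page:
                texts[page['title']] = page['revisions'][0]['*']
        return texts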


    def download_articles(category: str) -> pd.DataFrame:
        subcategories = get_category_members(f'Category:{category}', 'subcat')
        all_articles = []

        for subcategory in tqdm(subcategories, desc="Downloading subcategories"):
            articles = get_category_members(subcategory, 'page')
            all_articles.extend(articles)

        article_data = []
        for i in tqdm(range(0, len(all_articles), 50), desc="Downloading articles"):
            batch = all_articles[i:i + 50]
            texts = get_article_texts(batch)
            for title, text in texts.items():
                article_data.append({'title': title, 'text': text})

        return pd.DataFrame(article_data)


    articles_df = download_articles(articles_category)
    articles_df.to_csv(articles_file_path, index=False)
else:
    articles_df = pd.read_csv(articles_file_path)

articles_df.head()


Unnamed: 0,title,text
0,Alfonso Cuarón,{{short description|Mexican filmmaker}}\n{{Red...
1,Brad Bird,"{{Short description|American film director, sc..."
2,Brian Patrick Butler,{{Short description|American actor and filmmak...
3,Charles Band,{{short description|American film director}}\n...
4,Chris Carter (screenwriter),{{Short description|American television and fi...
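
The category query above asks for at most 500 members per request (`cmlimit`). When a category holds more, the MediaWiki API adds a `continue` object to the response, and its values have to be sent back with the next request. The following is a minimal sketch of that loop, not part of the original notebook: `collect_category_members` and its `fetch` argument are hypothetical names, and `fetch` stands in for whatever performs the HTTP request and returns the decoded JSON.

```python
from typing import Callable, Dict, List


def collect_category_members(
    fetch: Callable[[Dict[str, str]], dict],
    category: str,
    member_type: str,
) -> List[str]:
    """Collect all member titles of a category, following continuation tokens.

    `fetch` is a hypothetical callable that takes the query parameters and
    returns the decoded JSON response, for example
    lambda p: requests.get(base_url, params=p).json()
    """
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": category,
        "cmtype": member_type,
        "format": "json",
        "cmlimit": "500",
    }
    titles: List[str] = []
    while True:
        data = fetch(params)
        titles.extend(item["title"] for item in data["query"]["categorymembers"])
        if "continue" not in data:
            return titles
        # Merge the continuation values (such as cmcontinue) into the next request.
        params = {**params, **data["continue"]}
```

Passing the request function in as an argument keeps the loop easy to try out with canned responses before pointing it at the live API.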


# Section 3: Creating the QADataset Class

In this section, we define the QADataset class, which is used to load the preprocessed data and create PyTorch datasets for training and validation. We also define helper functions for keyword extraction and sentence replacement, which are used to create question-answer pairs from sentences.

In [None]:
!pip install transformers

In [None]:
!python -m nltk.downloader stopwords

In [None]:
from transformers import AutoTokenizer
import nltk

nltk.download(tokenizer_for_sentence_splitting)
tokenizer = AutoTokenizer.from_pretrained(base_model_name)
from sklearn.model_selection import train_test_split
from nltk import sent_tokenize
from nltk.corpus import stopwords
from torch.utils.data import Dataset
import os
import pickle


class QADataset(Dataset):
    def __init__(self, df, qa_indices, tokenizer, max_input_length=512, max_output_length=512):
        self.df = df
        self.qa_indices = qa_indices
        self.tokenizer = tokenizer
        self.max_input_length = max_input_length
        self.max_output_length = max_output_length

    def __len__(self):
        return len(self.qa_indices)

    def __getitem__(self, idx):
        index, question, answer = self.qa_indices[idx]
        tokenized_input = self.tokenizer(question, max_length=self.max_input_length, padding="max_length",
                                         truncation=True, return_tensors="pt")
        tokenized_output = self.tokenizer(answer, max_length=self.max_output_length, padding="max_length",
                                          truncation=True, return_tensors="pt")
        return {"input_ids": tokenized_input["input_ids"].squeeze(),
                "attention_mask": tokenized_input["attention_mask"].squeeze(),
                "labels": tokenized_output["input_ids"].squeeze()}


def replace_keywords(sentence, keywords):
    words = sentence.split()
    replaced = False
    for i, word in enumerate(words):
        if word.strip(".,?;!") in keywords and not replaced:
            words[i] = "____"
            replaced = True
    return " ".join(words)


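def extract_keywords(sentence, num_keywords=1):
    # Build the stopword set once instead of reloading the list for every word.
    stop_words = set(stopwords.words("english"))
    # Strip the same punctuation as replace_keywords so the keyword can be matched.
    words = [word.strip(".,?;!") for word in sentence.split()]
    words = [word for word in words if word and word.lower() not in stop_words]
    # The longest remaining words are used as keywords.
    words = sorted(words, key=len, reverse=True)
    return words[:num_keywords]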


def create_qa_pairs(df):
    qa_pairs = []
    for index in tqdm(df.index, total=df.shape[0], desc="Creating QA pairs"):
        text = df.loc[index, "text"]
        sentences = sent_tokenize(text)
        for sentence in sentences:
            keywords = extract_keywords(sentence)
            question = replace_keywords(sentence, keywords)
            if "____" in question:
                qa_pairs.append((index, question, keywords[0]))
    return qa_pairs

def split_data(qa_pairs, test_size=0.2, random_state=42):
    train_data, val_data = train_test_split(qa_pairs, test_size=test_size, random_state=random_state)
    return train_data, val_data

# load train and validation datasets if they exist or create them and save them
if os.path.exists(train_dataset_path) and os.path.exists(val_dataset_path):
    with open(train_dataset_path, "rb") as train_file:
        train_dataset = pickle.load(train_file)

    with open(val_dataset_path, "rb") as val_file:
        val_dataset = pickle.load(val_file)
else:
    qa_pairs = create_qa_pairs(articles_df)
    train_qa_pairs, val_qa_pairs = split_data(qa_pairs)

    # create train and validation datasets
    train_dataset = QADataset(articles_df, train_qa_pairs, tokenizer)
    val_dataset = QADataset(articles_df, val_qa_pairs, tokenizer)

    with open(train_dataset_path, "wb") as train_file:
        pickle.dump(train_dataset, train_file)

    with open(val_dataset_path, "wb") as val_file:
        pickle.dump(val_dataset, val_file)
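
To make the fill-in-the-blank construction concrete, here is a small self-contained sketch of the same idea applied to one sentence. It is an illustration under assumptions rather than the notebook's own code: `make_cloze_pair` is a hypothetical helper, and `STOP_WORDS` is a tiny hand-written list standing in for NLTK's English stopword list.

```python
# Tiny illustrative stopword list; the notebook uses NLTK's English list instead.
STOP_WORDS = {"the", "a", "an", "is", "was", "of", "by", "in", "and", "to"}
PUNCTUATION = ".,?;!"


def make_cloze_pair(sentence: str):
    """Return (question, answer) by blanking the longest non-stopword, or None."""
    words = sentence.split()
    candidates = [w.strip(PUNCTUATION) for w in words]
    candidates = [w for w in candidates if w and w.lower() not in STOP_WORDS]
    if not candidates:
        return None
    answer = max(candidates, key=len)  # the first longest word wins on ties
    for i, word in enumerate(words):
        if word.strip(PUNCTUATION) == answer:
            words[i] = "____"  # blank only the first occurrence
            break
    return " ".join(words), answer


print(make_cloze_pair("Alien was directed by Ridley Scott."))
# → ('Alien was ____ by Ridley Scott.', 'directed')
```

The example also shows a limitation of the length heuristic: the longest word is not necessarily the most informative one, so the blank may fall on a verb rather than on a name.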

# Section 4: Fine-Tuning the Transformer Model

This section fine-tunes the transformer-based language model. We use the `Trainer` class from the Transformers library to train the model on the training `QADataset` and evaluate it on the validation set after each epoch. We also save the model and tokenizer for later use.

In [None]:
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(base_model_name)
from transformers import TrainingArguments, Trainer
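from torch.utils.data import DataLoader

# Trainer builds its own data loaders from train_dataset and eval_dataset;
# these loaders are only needed for manual inspection or a custom training loop.
train_dataloader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
val_dataloader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False)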

training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    eval_accumulation_steps=1,
    prediction_loss_only=True,
    learning_rate=5e-5,
    weight_decay=0.01,
    save_total_limit=1,
    load_best_model_at_end=True,
    metric_for_best_model="loss",
    greater_is_better=False,
    evaluation_strategy="epoch",
    save_strategy="epoch",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
)

trainer.train()
trainer.evaluate()
model.save_pretrained(model_path)
tokenizer.save_pretrained(model_path)
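
A note on labels: causal language models in Transformers compute the loss by comparing the prediction at each position with the next token of the same sequence, and label positions set to `-100` are ignored. One common way to prepare a question-answer pair for such a model is therefore to concatenate question and answer tokens and mask everything except the answer. The sketch below shows this with toy token IDs. It is an assumption about how the data could be prepared, not the behavior of the `QADataset` class above, and `build_causal_example` is a hypothetical helper.

```python
from typing import Dict, List

IGNORE_INDEX = -100  # label value skipped by the loss in Transformers models


def build_causal_example(
    prompt_ids: List[int],
    answer_ids: List[int],
    pad_id: int,
    max_length: int,
) -> Dict[str, List[int]]:
    """Concatenate prompt and answer tokens and mask everything but the answer."""
    input_ids = (prompt_ids + answer_ids)[:max_length]
    labels = ([IGNORE_INDEX] * len(prompt_ids) + answer_ids)[:max_length]
    attention_mask = [1] * len(input_ids)
    padding = max_length - len(input_ids)
    return {
        "input_ids": input_ids + [pad_id] * padding,
        "attention_mask": attention_mask + [0] * padding,
        # Padding positions are masked too, so they do not contribute to the loss.
        "labels": labels + [IGNORE_INDEX] * padding,
    }


example = build_causal_example(prompt_ids=[11, 12, 13], answer_ids=[21, 22], pad_id=0, max_length=8)
print(example["labels"])
# → [-100, -100, -100, 21, 22, -100, -100, -100]
```

With this layout `input_ids` and `labels` always have the same length, and only the answer tokens drive the gradient.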

# Section 5: Generating Answers to Questions

In this section, we load the saved model and tokenizer, and use them to generate answers to a set of predefined questions. We use the pipeline function from the transformers library to generate text from input strings, and print the generated answers to the console.

In [None]:
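from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model = AutoModelForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)
# A causal language model needs the "text-generation" task;
# "text2text-generation" is meant for encoder-decoder models.
qa_pipeline = pipeline("text-generation", model=model, tokenizer=tokenizer)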
questions = [
    "What is the main theme of the movie Blade Runner?",
    "Who is the author of the novel Dune?",
    "What is the name of the spaceship in the Alien movie?",
    "Who directed the movie The Matrix?",
    "What is the setting of the Star Wars series?",
    "What is the similarity between the movie Matrix and Star Wars?",
]

for question in questions:
    answer = qa_pipeline(question, max_length=50)[0]["generated_text"]
    print(f"Q: {question}\nA: {answer}\n")
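
A causal language model continues the prompt, and the `text-generation` pipeline returns the prompt followed by the continuation by default, so the printed answer may repeat the question. A minimal sketch of trimming the prompt is shown below; `extract_answer` is a hypothetical helper and the generated string is a placeholder, not real model output.

```python
def extract_answer(prompt: str, generated_text: str) -> str:
    """Return only the continuation when the output starts with the prompt."""
    if generated_text.startswith(prompt):
        generated_text = generated_text[len(prompt):]
    return generated_text.strip()


prompt = "Who directed the movie The Matrix?"
# Placeholder standing in for qa_pipeline(prompt)[0]["generated_text"].
generated = prompt + " <model continuation>"
print(extract_answer(prompt, generated))
# → <model continuation>
```

Passing `return_full_text=False` to the text-generation pipeline call should have a similar effect without post-processing.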


# Section 6: Conclusions

This section provides a summary of the results and discusses the potential improvements for the question answering model. We also discuss the applications of the Transformers library for natural language processing tasks in general.

# Section 7: References
This section provides links to resources related to the use of the Transformers library and transformer-based models for natural language processing. Links to the official Hugging Face website, examples of transformer-based models, and research papers on transformer models are provided.

- Hugging Face official website: https://huggingface.co/
- Transformers: https://huggingface.co/transformers/
- Transformers examples: https://huggingface.co/examples
- "Attention Is All You Need" paper on Transformers: https://arxiv.org/abs/1706.03762
- "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding" paper: https://arxiv.org/abs/1810.04805