# Case Study: Wikipedia Question-Answer Pair Generator

<a name="toc"></a>
## Table of Contents

1. [Import libraries](#import-lib)
2. [Define functions](#define-functions)
3. [Example & sample output for given Wikipedia page](#example)
4. [Summary & justifications for chosen model](#model-summary)

<a name="import-lib"></a>

## 1. Import libraries

In [None]:
#Python 3.8.16
#!pip3 install requests==2.28.2
#!pip3 install beautifulsoup4==4.12.2
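#!pip3 install nltk==3.8.1
#    (word_tokenize and pos_tag also need NLTK data; run once after importing nltk:
#     nltk.download('punkt'); nltk.download('averaged_perceptron_tagger'))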
#!pip3 install torch==1.7.1
#!pip3 install transformers==4.26.1
#!pip3 install pandas==2.0.0

In [33]:
import requests
from bs4 import BeautifulSoup
import nltk
import re
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
import pandas as pd
import time
import datetime
import os

<a name="define-functions"></a>

## 2. Define functions

In this section, four Python classes are defined for the Wikipedia QA extraction task.

[↑ Back to Table of Contents](#toc)

### 2.1 Extract text

This class extracts and processes text from a given Wikipedia URL.

In [34]:
class extractText:
    """A class for extracting text from a given URL.
    
    Args:
        url (str): A url link for extracting text
    """
    
    def __init__(self, url):
        self.url = url
        self.preprocessor = preprocessText()
    
    def get_wiki_text(self):
        """Extract text from Wikipedia from the given url.
        
        Returns:
            wiki_extract (list): A list of extracted text
        """
        print(f"Extracting text from wikipedia page: {self.url} ...")
        
        # Request the page and fail early on an HTTP error
        response = requests.get(self.url, timeout=30)
        response.raise_for_status()
        
        # Parse the html text
        html_text = BeautifulSoup(response.content, 'html.parser')
        print("Text extraction completed.")
        
        # Initialize the header tags to check against
        headers = ['h1', 'h2', 'h3', 'h4', 'h5', 'h6']
        
        # Empty list for storing text to be appended
        wiki_extract = []

        # For each tag
        print(f"Pre-processing extracted text ...")
        for t in html_text.find_all():
            # Check if the tag is one of the 6 headers tag and that it is not a tag for the table of contents
            if t.name in headers and t.text != 'Contents':
                for e in t.next_elements:
                    # If the element has a p (paragraph) tag
                    if e.name == 'p': 
                        # Get the cleaned paragraph text
                        text = self.preprocessor.clean_text(self.get_paragraph_text(e))
                        # If it is a proper sentence, append to the wiki_extract list
                        if self.preprocessor.proper_sentence(text):
                            wiki_extract.append(text)
                    # If the element is under a new header tag, exit the loop
                    if e.name in headers:
                        break    
        print(f"Text pre-processing completed.")
        
        # A list of extracted text                
        return wiki_extract

    def get_paragraph_text(self, p):
        """A method to join and return the extracted text.
        
        Args:
            p (obj): A beautifulsoup html tag object
        
        Returns:
            A string of text
        """
        # Return the paragraph of text
        return ''.join([tag.text for tag in p.children])
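
The header-then-paragraph traversal above can be illustrated on a small inline HTML string. The following is a minimal sketch, not the notebook's implementation: it assumes the standard-library `html.parser` in place of BeautifulSoup so that it runs without extra packages, it omits the `'Contents'` check, and `SectionParagraphParser` and `sample_html` are hypothetical names and data introduced only for illustration.

```python
from html.parser import HTMLParser

HEADERS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class SectionParagraphParser(HTMLParser):
    """Collect the text of every <p> that appears after a header tag."""

    def __init__(self):
        super().__init__()
        self.seen_header = False   # becomes True once any header tag opens
        self.in_paragraph = False  # True while inside a collected <p>
        self.buffer = []           # text fragments of the current paragraph
        self.paragraphs = []       # finished paragraph strings

    def handle_starttag(self, tag, attrs):
        if tag in HEADERS:
            self.seen_header = True
        elif tag == "p" and self.seen_header:
            self.in_paragraph = True
            self.buffer = []

    def handle_data(self, data):
        # Text inside nested tags such as <b> is joined into the paragraph
        if self.in_paragraph:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self.in_paragraph:
            self.paragraphs.append("".join(self.buffer).strip())
            self.in_paragraph = False


# Made-up sample page: one paragraph before any header, two after
sample_html = """
<p>Paragraph before any header.</p>
<h1>Example page</h1>
<p>First section with <b>bold</b> text.</p>
<h2>Second header</h2>
<p>Second section text.</p>
"""

parser = SectionParagraphParser()
parser.feed(sample_html)
parser.close()
print(parser.paragraphs)
```

As in `get_wiki_text`, a paragraph that appears before the first header is skipped, and the text of nested tags is merged into the paragraph string.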

### 2.2 Pre-process text

This class is used to pre-process the text extracted from the given Wikipedia page, before passing it through the model for question-answer pair generation.

In [35]:
class preprocessText:
    """
    A class for processing text before using it in the model.  
    """
    
    def __init__(self):
        pass
    
    def proper_sentence(self, text):
        """A method to check whether a given string of text forms a proper sentence. 
        It is assumed that a proper meaningful sentence must contain a verb and a noun, and must also have at least 3 words.
        
        Args:
            text (str): An input sentence or paragraph
        
        Returns:
            A boolean value indicating whether the given text forms a proper sentence
        """
        # Tokenize the input text
        tokens = nltk.word_tokenize(text)
        
        # Tag the part of speech for each token
        pos_tags = nltk.pos_tag(tokens)
        
        # Count the number of words that are at least two characters in length
        word_count = len(re.findall(r'[a-zA-Z]{2,}', text))
        
        # Collect the first letter of every verb (V*) and noun (N*) tag
        tag_initials = {tag[0] for _, tag in pos_tags if tag.startswith(('V', 'N'))}
        
        # Return True if the text has a verb and a noun and has a word count of at least 3
        return len(tag_initials) > 1 and word_count >= 3

    def clean_text(self, text):
        """A method for cleaning the given text.
        
        Args:
            text (str): An input sentence or paragraph to be cleaned
        
        Returns:
            new_text (str): An output cleaned sentence or paragraph
        """
        # Remove reference markers such as [12], [edit] and [citation needed]
        new_text = re.sub(r"\[\d+\]|\[(edit|citation needed)\]", "", text)
        
        # Remove new line characters
        new_text = re.sub(r"\n+", '', new_text) 
        
        # Remove non-breaking spaces and the other Latin-1 symbols in the range U+00A0 to U+00AF
        new_text = re.sub(r'[\xa0-\xaf]', '', new_text)
        
        # Remove backslash
        new_text = re.sub(r"\\", "", new_text)
        
        # Cleaned text
        return new_text
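
The two regular expressions that do most of the work in this class can be tried in isolation. The sketch below assumes a made-up sample string; `strip_reference_markers` and `count_words` are hypothetical helper names, not part of the notebook. It leaves out the part-of-speech check, which needs NLTK and its data files.

```python
import re

# Same reference pattern as clean_text, written as a raw string
REFERENCE_PATTERN = re.compile(r"\[\d+\]|\[(?:edit|citation needed)\]")
# Same word pattern as proper_sentence: runs of two or more ASCII letters
WORD_PATTERN = re.compile(r"[a-zA-Z]{2,}")


def strip_reference_markers(text):
    """Remove markers such as [3], [edit] and [citation needed]."""
    return REFERENCE_PATTERN.sub("", text)


def count_words(text):
    """Count words that are at least two letters long."""
    return len(WORD_PATTERN.findall(text))


# Made-up sample sentence with Wikipedia-style markers
sample = "The dome is large.[3] It is popular.[citation needed][edit]"
cleaned = strip_reference_markers(sample)

print(cleaned)                 # markers removed, sentence text kept
print(count_words(cleaned))    # seven words of two or more letters
print(count_words("A b c"))    # single letters are not counted
```

Because single letters are ignored, a fragment such as `"A b c"` has a word count of zero and would fail the three-word minimum in `proper_sentence`.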

### 2.3 Generate question-answer pairs

This class is used to generate question answer pairs from a given list of text extracted from the Wikipedia page.

In [36]:
class generateQA:
    """A class for generating question answer pairs from a given list of text.
    
    Args:
        prompt (str): A prompt to pass as input to the model along with input text, to generate questions
        model_checkpoint (str): The model to be used for the task
        truncation (bool): Whether to truncate the input sequence to the maximum acceptable length by the model
        skip_special_tokens (bool): Whether or not to remove special tokens in the decoded output.
        op_max_length (int): Maximum length of generated sequence output
    """
    
    def __init__(self, prompt, model_checkpoint, op_max_length, truncation=True, skip_special_tokens=True):
        self.prompt = prompt
        self.model_checkpoint = model_checkpoint
        self.truncation = truncation
        self.skip_special_tokens = skip_special_tokens
        self.op_max_length = op_max_length
        self.model = AutoModelForSeq2SeqLM.from_pretrained(self.model_checkpoint)
        self.tokenizer = AutoTokenizer.from_pretrained(self.model_checkpoint)
        
    def get_qa_pairs(self, input_text):
        """Generate question-answer pairs from input text.
        
        Args:
            input_text (list): A list containing input text for the model
        
        Returns:
            qa_pairs_dict (dict): A dictionary containing question and answer pairs
        """
        start_time = time.time()
        print(f"Generating question-answer pairs from pre-processed text ...")
        print(f"Model used: {self.model_checkpoint}")
        
        # A dictionary for storing questions as keys and answers as values
        qa_pairs_dict = {}
        
        # Loop through the input text to generate question-answer pairs
        for text in input_text:
            
            # Generate question by providing relevant prompt and input text to model
            q_input = self.tokenizer(self.prompt + ": " + text, return_tensors="pt", truncation=self.truncation)
            q_output = self.model.generate(**q_input, max_length=self.op_max_length)
            question = self.tokenizer.batch_decode(q_output, skip_special_tokens=self.skip_special_tokens)
            
            # Generate answer by providing question and input text to model
            a_input = self.tokenizer(question[0] + ": " + text, return_tensors="pt", truncation=self.truncation)
            a_output = self.model.generate(**a_input, max_length=self.op_max_length)
            answer = self.tokenizer.batch_decode(a_output, skip_special_tokens=self.skip_special_tokens)
            
            # Store the pair, keyed by question (a repeated question overwrites the earlier answer)
            qa_pairs_dict[question[0]] = answer[0]
            
        num_qa = len(qa_pairs_dict)
        print(f"Total no. of question-answer pairs generated: {num_qa}")
              
        end_time = time.time()
        print(f"Total time taken for generation of question-answer pairs: {(end_time-start_time)/60} mins")
        
        # Print the results
        print("\nResults: ")
        for i, item in enumerate(qa_pairs_dict.items()):
            print(f"\n{i+1}) Question: {item[0]}")
            print(f"{''.ljust(len(str(i+1))+2)}Answer:   {item[1]}")
        
        # Question answer pairs and input text
        return qa_pairs_dict, input_text
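
Because the pairs are stored in a dictionary keyed by question, two passages that produce the same question end up as a single entry, and matching pairs to passages by position afterwards assumes every question is unique. One possible alternative, sketched below under stated assumptions, keeps each passage next to its own pair in a list of tuples. `generate_pairs` is a hypothetical function, and `generate` is a stand-in for the tokenize, `model.generate`, and decode steps; the fake generator only exists to make the sketch runnable without downloading a model.

```python
def generate_pairs(passages, generate, prompt="Ask question"):
    """Return one (passage, question, answer) record per passage.

    `generate` is any callable that maps a model input string to text.
    """
    records = []
    for passage in passages:
        # Step 1: prompt + passage -> question
        question = generate(f"{prompt}: {passage}")
        # Step 2: question + passage -> answer
        answer = generate(f"{question}: {passage}")
        records.append((passage, question, answer))
    return records


def fake_generate(model_input):
    """Stand-in for the model: always asks the same question."""
    if model_input.startswith("Ask question: "):
        return "What is this passage about?"
    return "a placeholder answer"


passages = ["First made-up passage.", "Second made-up passage."]
records = generate_pairs(passages, fake_generate)

# Both passages are kept even though their questions are identical
print(len(records))
# A dictionary keyed by question would keep only one of them
print(len({question: answer for _, question, answer in records}))
```

With this layout, each record can be reviewed or exported without re-aligning pairs and passages.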

### 2.4 Run pipeline

This class combines methods from the previous classes into a single pipeline that can be used for the task.

In [37]:
class runPipeline:
    """Run the pipeline and save the results in a csv file.
    
    Args:
        url (str): A url link for extracting text
        prompt (str): A prompt for controlling the type of output to generate
        model_checkpoint (str): The model to be selected
        op_max_length (int): Maximum length of generated sequence output
        file_name (str): File name in local directory to save output to
    """
    
    def __init__(self, url, prompt, model_checkpoint, op_max_length, file_name):
        self.url = url
        self.prompt = prompt
        self.model_checkpoint = model_checkpoint
        self.op_max_length = op_max_length
        self.file_name = file_name
        
    def run(self):
        """Instantiate and run the pipeline.
        
        Returns:
            results (dict): A dictionary containing question and answer pairs 
            input_text (list): A list containing input text for the model
        """
        # Instantiate the QA generation object
        generate_qa = generateQA(
            prompt=self.prompt, 
            model_checkpoint=self.model_checkpoint,
            op_max_length=self.op_max_length,
            truncation=True,
            skip_special_tokens=True
        )
        
        # Instantiate an object and initialize the value for the html text extraction
        html_extract = extractText(url=self.url)
        
        # Get the processed data
        processed_extract = html_extract.get_wiki_text()
        
        # Run the processed data through the model to get the question-answer pairs
        results, input_text = generate_qa.get_qa_pairs(processed_extract)
        
        # Save the question-answer pairs in a csv file in the current working directory
        self.export_results(results)
        
        # Question answer pairs and input text
        return results, input_text
        
    def export_results(self, df):
        """Export the results to a csv file in the current working directory.
        
        Args:
            df (dict): A dictionary with questions as keys and answers as values
        """
        # Timestamp for the csv file
        timestamp = datetime.datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
        
        # Export the csv file to the current working directory
        pd.DataFrame.from_dict(df, orient='index', columns=['Answer']).to_csv(f"{self.file_name}{timestamp}.csv")
        
        print(f"\nGenerated question-answer pairs CSV file saved in the current working directory: ...\\{self.file_name}{timestamp}.csv")
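
The export step can be sketched without pandas to show the file naming and the resulting layout. This is an illustration only: it assumes the standard-library `csv` module instead of `DataFrame.to_csv`, writes to an in-memory buffer rather than a file, and adds an explicit `Question` header that the notebook's export does not set. `timestamped_name` and `write_pairs` are hypothetical helpers.

```python
import csv
import datetime
import io


def timestamped_name(prefix, now=None):
    """Build a file name using the same timestamp format as export_results."""
    now = now or datetime.datetime.now()
    return f"{prefix}{now.strftime('%Y-%m-%d_%H-%M-%S')}.csv"


def write_pairs(qa_pairs, handle):
    """Write question-answer pairs as two named columns."""
    writer = csv.writer(handle)
    writer.writerow(["Question", "Answer"])
    for question, answer in qa_pairs.items():
        writer.writerow([question, answer])


# A fixed datetime keeps the example reproducible
name = timestamped_name("Flan_T5_qa_outputs_", datetime.datetime(2023, 4, 1, 9, 30, 0))
print(name)

buffer = io.StringIO()
write_pairs({"How many hectares is the largest garden?": "54"}, buffer)
print(buffer.getvalue())
```

Passing a real file handle opened with `newline=""` in place of the buffer would write the same rows to disk.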

<a name="example"></a>

## 3. Example & sample output for given Wikipedia page

[↑ Back to Table of Contents](#toc)

In this section, question-answer pairs are generated for a given Wikipedia page using the classes defined in Section 2. The results are then saved to a CSV file.

**The pre-trained model used for the task is Flan-T5 large.**

In [38]:
# Sample Wikipedia pages
#url = 'https://en.wikipedia.org/wiki/NTUC_FairPrice'
#url = 'https://en.wikipedia.org/wiki/Merlion'
url = 'https://en.wikipedia.org/wiki/Gardens_by_the_Bay'

In [39]:
pipeline = runPipeline(
    url=url, 
    prompt='Ask question', 
    model_checkpoint='google/flan-t5-large',
    op_max_length=512,
    file_name="Flan_T5_qa_outputs_"
)

In [40]:
qa, text_input = pipeline.run()

Extracting text from wikipedia page: https://en.wikipedia.org/wiki/Gardens_by_the_Bay ...
Text extraction completed.
Pre-processing extracted text ...
Text pre-processing completed.
Generating question-answer pairs from pre-processed text ...
Model used: google/flan-t5-large
Total no. of question-answer pairs generated: 36
Total time taken for generation of question-answer pairs: 2.2591779907544454 mins

Results: 

1) Question: How many hectares is the largest garden?
   Answer:   54

2) Question: What was the name of the city that was transformed to a "City in a Garden"?
   Answer:   Garden City

3) Question: What year did the park receive its 50 millionth visitor mark?
   Answer:   2018

4) Question: what is the name of the project
   Answer:   neom

5) Question: How many acres is Bay Central Garden?
   Answer:   37

6) Question: How many acres is Bay East Garden?
   Answer:   79

7) Question: What is the purpose of the inlets?
   Answer:   to help cool areas of activity around them


Lastly, each input text is printed next to its generated question-answer pair to evaluate relevance and correctness.

In [41]:
# Review the inputs and outputs
for i, res in enumerate(zip(qa.items(), text_input)):
    print(f"{i+1}) Generated QA Pair: {res[0]}\n\nText Input: {res[1]}\n\n")

1) Generated QA Pair: ('How many hectares is the largest garden?', '54')

Text Input: The Gardens by the Bay is a nature park spanning 101 hectares (250 acres) in the Central Region of Singapore, adjacent to the Marina Reservoir. The park consists of three waterfront gardens: Bay South Garden (in Marina South), Bay East Garden (in Marina East) and Bay Central Garden (in Downtown Core and Kallang). The largest of the gardens is the Bay South Garden at 54 hectares (130 acres) designed by Grant Associates. Its Flower Dome is the largest glass greenhouse in the world.


2) Generated QA Pair: ('What was the name of the city that was transformed to a "City in a Garden"?', 'Garden City')

Text Input: Gardens by the Bay was part of the nation's plans to transform its "Garden City" to a "City in a Garden", with the aim of raising the quality of life by enhancing greenery and flora in the city. First announced by Prime Minister Lee Hsien Loong at Singapore's National Day Rally in 2005, Gardens b

<a name="model-summary"></a>

## 4. Summary & justifications for chosen model

[↑ Back to Table of Contents](#toc)

The pre-trained model selected for this task is **google/flan-t5-large**, downloaded from Hugging Face. Flan-T5 (Fine-tuned Language Net-T5) builds on the T5 (Text-to-Text Transfer Transformer) model. T5 is pre-trained on a mixture of NLP tasks, each expressed in a text-to-text format, so a single model can handle different tasks by prepending a task-specific prefix to the input. In this use case, the prefix 'Ask question' is prepended to a passage of Wikipedia text to generate a question, and the generated question is then prepended to the same passage to generate its answer. Flan-T5 is additionally fine-tuned on instruction-based NLP tasks, which improves its performance across a wide range of tasks.
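
The prefix mechanism can be tried on a single passage with a short sketch. It mirrors the tokenizer and `generate` calls used in Section 2.3, but `build_model_input` and `generate_text` are hypothetical helpers, and the default checkpoint `google/flan-t5-small` is an assumption chosen only to keep the download small. The `transformers` import is placed inside the function so that the prompt-building part runs on its own.

```python
def build_model_input(prefix, passage):
    """Join a task prefix and a passage in the 'prefix: passage' form."""
    return f"{prefix}: {passage}"


def generate_text(model_input, checkpoint="google/flan-t5-small", max_length=64):
    """Run one model input through a seq2seq checkpoint and decode the result."""
    # Imported lazily because loading transformers and the model is slow
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)
    inputs = tokenizer(model_input, return_tensors="pt", truncation=True)
    output_ids = model.generate(**inputs, max_length=max_length)
    return tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0]


passage = "The largest of the gardens is the Bay South Garden at 54 hectares."

# The same passage with two different prefixes yields two different tasks
question_input = build_model_input("Ask question", passage)
answer_input = build_model_input("How many hectares is the largest garden?", passage)
print(question_input)
print(answer_input)

# To run the model itself (downloads the checkpoint on first use):
# print(generate_text(question_input))
```

Only the text before the colon changes between the two inputs; the model and the passage stay the same.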

The Flan-T5 model was chosen for the following key reasons:
1. It is a state-of-the-art text-to-text NLP model, released in 2022, that has been compared to GPT-3. It performs well on a wide range of NLP tasks, including question answering.
2. It was fine-tuned on a large corpus of text data (the Flan collection of datasets), which includes Wikipedia data. Since the task involves question-answer generation on Wikipedia text, a model that has already been trained on Wikipedia data is a good fit.
3. It is open-source, designed to be customizable, and readily available from Hugging Face.