<a href="https://colab.research.google.com/github/nondescryptid/bert-qna/blob/main/Tomoe_Question_Answering_with_various_flavours_of_BERT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Question-Answering with BERT 
#### by Tomoe, for the AI Projects and Case Studies course
---
This project is based on: 
https://mccormickml.com/2021/05/27/question-answering-system-tf-idf/ (Referenced as "McCormick" for attribution of functions)
https://blog.fastforwardlabs.com/2020/06/22/how-to-explain-huggingface-bert-for-question-answering-nlp-models-with-tf-2.0.html (referenced as "FastForwardLabs")



## Overview of building a question-answering system 

A Question-Answering (QA) system comprises of 3 major parts:   
1) Dataset/External knowledge source - to be queried against  

2) Retriever - gets the most relevant data from our knowledge source  

3) Generator - This retrieves extract text from the dataset. In other QA systems, novel text may be generated using the question and data (e.g. GPT3)

Let's start by installing the packages we need!

In [1]:
# Install packages 
!pip install transformers 
!pip install datasets 
!pip install torch

# import 
from transformers import BertTokenizer, BertForQuestionAnswering, \
  AutoTokenizer,AutoModel, AutoModelForQuestionAnswering
from datasets import load_dataset
import pandas as pd
import torch

Collecting transformers
  Downloading transformers-4.19.2-py3-none-any.whl (4.2 MB)
[K     |████████████████████████████████| 4.2 MB 5.3 MB/s 
[?25hCollecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
[K     |████████████████████████████████| 596 kB 48.5 MB/s 
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.6.0-py3-none-any.whl (84 kB)
[K     |████████████████████████████████| 84 kB 2.3 MB/s 
Collecting tokenizers!=0.11.3,<0.13,>=0.11.1
  Downloading tokenizers-0.12.1-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.6 MB)
[K     |████████████████████████████████| 6.6 MB 24.7 MB/s 
Installing collected packages: pyyaml, tokenizers, huggingface-hub, transformers
  Attempting uninstall: pyyaml
    Found existing installation: PyYAML 3.13
    Uninstalling PyYAML-3.13:
      Successfully uninstalled PyYAML-3.13
Successfully installed huggingface-hub-0.

#1. Dataset
---
You can check out the text data itself by opening sexualoffences.txt.


## Loading the dataset

In [None]:
# Reading the sexualoffences.txt file itself
file = open("sexualoffences.txt", "r")
sexualoffences = file.read()

# Loading data into a format that the transformers package can use
so = load_dataset('text', data_files='sexualoffences.txt', split='train['
                                                                     '0:]')

Using custom data configuration default-7f6d053da4cd744f


Downloading and preparing dataset text/default to /root/.cache/huggingface/datasets/text/default-7f6d053da4cd744f/0.0.0/e16f44aa1b321ece1f87b07977cc5d70be93d69b20486d6dacd62e12cf25c9a5...


  0%|          | 0/1 [00:00<?, ?it/s]

  0%|          | 0/1 [00:00<?, ?it/s]

Dataset text downloaded and prepared to /root/.cache/huggingface/datasets/text/default-7f6d053da4cd744f/0.0.0/e16f44aa1b321ece1f87b07977cc5d70be93d69b20486d6dacd62e12cf25c9a5. Subsequent calls will reuse this data.


Note that the variable **so** (short for sexual offences :P) is an object. To demonstrate this, let's print it out:

In [None]:
print(so)

Dataset({
    features: ['text'],
    num_rows: 72
})


Note that **so** is an object. In order to access the text data itself, we need to obtain contents with the key "text": 

In [None]:
docs = so['text']

#2. TF-IDF Retriever 
BERT has a limit of 512 tokens that it accepts as an input. 

Tokenization is the process of breaking down text into smaller units that a model will accept as inputs. In BERT's case, it uses subword based tokenization. 

However, our document is too long to fit within 512 tokens. Hence, segmenting the document into different parts will be needed. 

We will use a function, segment_documents that takes in 2 inputs: docs, and max_doc_length (which limits the maximum length before the text is segmented).

---
For more on tokenization: https://towardsdatascience.com/word-subword-and-character-based-tokenization-know-the-difference-ea0976b64e17



In [None]:
# From McCormick
def segment_documents(docs, max_doc_length=450):
  # List containing full and segmented docs
  segmented_docs = []

  for doc in docs:
    # Split document by spaces to obtain a word count that roughly approximates the token count
    split_to_words = doc.split(" ")

    # If the document is longer than our maximum length, split it up into smaller segments and add them to the list 
    if len(split_to_words) > max_doc_length:
      for doc_segment in range(0, len(split_to_words), max_doc_length):
        segmented_docs.append( " ".join(split_to_words[doc_segment:doc_segment + max_doc_length]))

    # If the document is shorter than our maximum length, add it to the list
    else:
      segmented_docs.append(doc)

  return segmented_docs

Apply this function to our existing data: 

In [None]:
segmented_docs = segment_documents(docs, 450)

Print it out to see what segmented_docs looks like. 

Note how segmented_docs is a list of multiple strings. Words that were part of the same sentence have now been broken up. Each string is now considered a document (doc). 

In [None]:
print(segmented_docs)



## Vectorization: Representing words mathematically so that we can find the most relevant segments 

This converts our documents and question into vectors. Document vectors with the highest cosine similarity to our query vector will be the best candidates for our answer to the user's question. 

Vectors can be visualised as matrices e.g.  [0,0,0], which can also be thought of as points in 3D space. 

Using 3D space as an example, 2 documents can be "further" in 3D space, but still be considered similar. 

Here, similarity is measured by finding the cosine of the angle between two vectors, where cosine similarity is a value that ranges from 0 to 1. Cosine similarity is unaffected by the distance between vectors -- for instance, vectors of two similar documents can be far apart because of one word appearing 50 times, and the same word appearing 10 times. However, they can still be similar as they have a small angle between them. The smaller the angle, the higher the similarity. 

-- -
References:    
https://towardsdatascience.com/understanding-cosine-similarity-and-its-application-fd42f585296a
https://www.machinelearningplus.com/nlp/cosine-similarity/

## TD-IDF 
The technique we use here is TF-IDF, which stands for term frequency--inverse document frequency. 
This measures how relevant a word is to a document by evaluating how often a word appears, but also takes into account common words (aka stop words) that appear frequently such as "a","the", "is". 



In [None]:
# From McCormick 
def get_top_k_articles(query, docs, k):
    # Initialize a vectorizer that removes English stop words
    vectorizer = TfidfVectorizer(analyzer="word", stop_words='english')

    # Create a corpus of query and documents and convert to TFIDF vectors
    query_and_docs = [query] + docs
    matrix = vectorizer.fit_transform(query_and_docs)

    # Holds our cosine similarity scores
    scores = []

    # The first vector is our query text, so compute the similarity of our query against all document vectors
    for i in range(1, len(query_and_docs)):
        scores.append(cosine_similarity(matrix[0], matrix[i])[0][0])

    # Sort list of scores and return the top k highest scoring documents
    sorted_list = sorted(enumerate(scores), key=lambda x: x[1], reverse=True)
    top_doc_indices = [x[0] for x in sorted_list[:k]]
    top_docs = [docs[x] for x in top_doc_indices]

    return top_docs


## BERT Retriever
After the TD-IDF vectorizer has found the documents(segments) that are most relevant to the query, the shortlisted documents are passed to BERT to identify the span of words within the segment that are most likely to contain the answer to a given question. 

Here, we will use a BERT model that has been fine-tuned on SQuAD 1.0 (which is a question-answering dataset).

In [None]:
model = BertForQuestionAnswering.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')
tokenizer = BertTokenizer.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')

Downloading:   0%|          | 0.00/443 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/1.25G [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/226k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/455k [00:00<?, ?B/s]

Next we need a function that takes a **question** and **reference text**, and then returns the span of words in the reference that is most likely to be an answer to the input question. 

In [None]:
# From McCormick 
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def answer_question(question, answer_text):
    input_ids = tokenizer.encode(question, answer_text, max_length=512,
                                 truncation=True)

    # ======== Set Segment IDs ========
    # Search the input_ids for the first instance of the `[SEP]` token.
    sep_index = input_ids.index(tokenizer.sep_token_id)

    # The number of segment A tokens includes the [SEP] token itself.
    num_seg_a = sep_index + 1

    # The remainder are segment B.
    num_seg_b = len(input_ids) - num_seg_a

    # Construct the list of 0s and 1s.
    segment_ids = [0] * num_seg_a + [1] * num_seg_b

    # There should be a segment_id for every input token.
    assert len(segment_ids) == len(input_ids)

    outputs = model(torch.tensor([input_ids]),
                    # The tokens representing our input text.
                    token_type_ids=torch.tensor([segment_ids]),
                    # The segment IDs to differentiate question from answer_text
                    return_dict=True)
    start_scores = outputs.start_logits
    end_scores = outputs.end_logits

    # ======== Reconstruct Answer ========
    # Find the tokens with the highest `start` and `end` scores.
    answer_start = torch.argmax(start_scores)
    answer_end = torch.argmax(end_scores)

    # Get the string versions of the input tokens.
    tokens = tokenizer.convert_ids_to_tokens(input_ids)

    # Start with the first token.
    answer = tokens[answer_start]

    # Select the remaining answer tokens and join them with whitespace.
    for i in range(answer_start + 1, answer_end + 1):

        # If it's a subword token, then recombine it with the previous token.
        if tokens[i][0:2] == '##':
            answer += tokens[i][2:]

        # Otherwise, add a space then the token.
        else:
            answer += ' ' + tokens[i]

    return answer

# Now let's write something that we can actually run and get answers from


### Ask it anything. How about "What are my legal options?"

In [None]:
print("Hello! This program provides information about sexual offences in "
          "Singapore.")

# This prompts the user for a question until the user exits.
user_active = True
while user_active:
    # Accepts user's input 
    query = input("Enter your question or press N to exit: ")

    # Script ends if the user chooses to exit
    if query == "N" or query == "n":
        print("Exiting programme...")
        break

    # Segment documents
    segmented_docs = segment_documents(docs, 450)

    # Retrieve K most relevant documents to the query. Here, k = 3. 
    candidate_docs = get_top_k_articles(query, segmented_docs, 3)

    # Return the likeliest answers from each of our top k most relevant documents in descending order
    for doc in candidate_docs:
        # Prints the span of text in each candidate 
        print("Answer: ", answer_question(query, doc))
        print("Reference: ", doc, "\n")


Hello! This program provides information about sexual offences in Singapore.
Enter your question or press N to exit: What are my legal options
Answer:  if you have suffered sexual assault or harassment
Reference:  If you have suffered sexual assault or harassment, there are four legal actions 

Answer:  report the matter to law enforcement , file a magistrate ' s
Reference:  you can take: Report the matter to law enforcement, File a Magistrate's 

Answer:  complaint ( private prosecution ) , apply for a protection order / personal
Reference:  Complaint (Private Prosecution), Apply for a protection order/personal 

Enter your question or press N to exit: n
Exiting programme...


### The answers don't seem that good...
If you asked it "What are my legal options?", you might notice something strange. 

The answer for "What are my legal options" ought to contain:     
"Report the matter to law enforcement, File a Magistrate's Complaint (Private Prosecution), Apply for a protection order/personal protection order, Commence civil proceedings for compensation."

Instead, it appears as such:




```
Hello! This program provides information about sexual offences in Singapore.
Enter your question or press N to exit: what are my legal options
Answer:  if you have suffered sexual assault or harassment
Reference:  If you have suffered sexual assault or harassment, there are four legal actions 

Answer:  report the matter to law enforcement , file a magistrate ' s
Reference:  you can take: Report the matter to law enforcement, File a Magistrate's 

Answer:  complaint ( private prosecution ) , apply for a protection order / personal
Reference:  Complaint (Private Prosecution), Apply for a protection order/personal 

```



The answers individually seem to be cut off, and the reference for each answer appears to contain parts of a longer sentence that is the whole answer. 

I think this problem occurs for two reasons:   
1) The "reference" is the segment that the answer was found in. As mentioned earlier, segmentation does not always follow natural breaks between sentences.   
2) Answers are short as SQuAD (which BERT was fine-tuned on) comprises of short answers to factual questions based on Wikipedia pages. As a result, shorter spans of text within the already short segments are chosen as an answer. 

### Tweaking the programme: Let's try retrieving the entire sentence that the shortlisted reference texts are from instead!

This new function, **reference_context** will locate the reference text within the original document.  

Then, it will find the start of the sentence by finding the highest index where "." (full stop) appears, before the start of the reference. This is essentially the same as searching leftwards, until you reach the first full stop. 

After finding the full stop, we add 1 to the index of the start of the sentence to avoid printing the full stop. 

Then, we look to find the lowest index where the full stop appears, starting from the reference's beginning. This means we go rightwards from the reference until we find the first full stop. 


Finally, we print the full sentence as our answer. If the index of the reference's start is 0 (meaning that it is the beginning of the original text itself), then characters will be printed out from the start to the index that marks the end of the sentence. 

Otherwise, we print the characters from the start of the sentence to the end of the sentence. 

In [None]:
# From me :D 
def reference_context(doc, original_docs, query):
    answer = None
    # Check if SQuAD thinks that the answer to the question is not in the text. 
    # This is relevant for models finetuned on SQuAD 2.0, as opposed to 1.0
    if answer_question(query, doc) == "[CLS]":
        answer = "I don't know. You might want to consult another source."
    else:
        # Find the index of the beginning of first reference in the original
        # text
        reference_start = original_docs.find(doc)
        # Search for start of sentence, to the left of reference_start
        # USE .rfind() - It returns the highest index of the substring.
        sentence_start = original_docs.rfind(".", 0, reference_start) + 1
        # Find end of sentence by looking for the first full-stop to the
        # right of the reference's first character,
        sentence_end = original_docs.find(".", reference_start)
        if reference_start == 0:
            answer = original_docs[:sentence_end]

        else:
            answer = original_docs[sentence_start:sentence_end]

    return answer


We locate indexes of references within the original document itself as BERT introduces extra characters/tokens in the process of tokenization such as "[CLS]" and "[SEP]", so we cannot use the text after it has been processed. 

In [None]:
print("Hello! This program provides information about sexual offences in "
          "Singapore.")

user_active = True
while user_active:
    query = input("Enter your question or press N to exit: ")
    if query == "N" or query == "n":
        # Close sexualoffences.txt
        print("Exiting programme...")
        break
    # Segment docs
    segmented_docs = segment_documents(docs, 450)

    # Retrieve K most relevant paragraphs to the query
    candidate_docs = get_top_k_articles(query, segmented_docs, 3)

    # Return the likeliest answers from each of our top k most relevant documents in descending order
    print("Here are our top 3 answers")
    for doc in candidate_docs:
        print(reference_context(doc, sexualoffences, query), "\n")

Hello! This program provides information about sexual offences in Singapore.
Enter your question or press N to exit: N
Exiting programme...


Hopefully, the answers make more sense now. However, there's still one problem left: Sentences are ended in the middle of weblinks as they contain full-stops. 


This problem can be seen when we ask "How can I file a police report online?", where the link to the police's web portal is cut off.




```
Enter your question or press N to exit: how can i file a report online
Here are our top 3 answers


An online police report can be filed, if there is no
need for immediate help by the police, at this link: https://eservices 

If you have suffered sexual assault or harassment, there are four legal actions
you can take: Report the matter to law enforcement, File a Magistrate's
Complaint (Private Prosecution), Apply for a protection order/personal
protection order, Commence civil proceedings for compensation 



Police reports can be filed by visiting the nearest police centre or police post
to file a report 
```



I'll probably have to fix this in a future version with regex (regular expressions) to get my **reference_context** function to ignore full stops in weblinks when searching for the end of a sentence. This is kind of beyond me at the moment. Might work on it during winter break or something. 




## Trying other BERT models
Let's compare how these three models work.   
1) BERT fine-tuned on SQuAD v1 (```bertsquad1```)  
^ This one's the model that we've been using so far.    
2) LegalBERT  (```legalbert```)  
3) BERT fine-tuned on SQuAD v2 (```bertsquad2```)  


In [None]:
def get_pretrained_squad_model(model_name):
    model, tokenizer = None, None
    # BERT fine-tuned on SQuAD 1.0
    if model_name == "bertsquad1":
        model = BertForQuestionAnswering.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')
        tokenizer = BertTokenizer.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')

    # LegalBERT
    elif model_name == "legalbert":
        model = BertForQuestionAnswering.from_pretrained("nlpaueb/legal-bert-base-uncased")
        tokenizer = AutoTokenizer.from_pretrained("nlpaueb/legal-bert-base-uncased")
    # BERT fine-tuned on SQuAD 2.0
    elif model_name == "bertsquad2":
        model = AutoModelForQuestionAnswering.from_pretrained(
            "mrm8488/bert-medium-finetuned-squadv2")
        tokenizer = AutoTokenizer.from_pretrained(
            "mrm8488/bert-medium-finetuned-squadv2")
        
    # You can add on more models here 
    return model, tokenizer



""" Variable Declaration """
# If you want to test the models with different questions, add them to the question list.
questions = [
    "What are my legal options?",
    "What happens after filing a police report?",
    "What happens in a medical examination?",
    "How can I file a police report?",
    "How can I file a police report online?",
    "Can I file a police report online?"
    # Add your questions here 
]
model_names = ["bertsquad1", "legalbert", "bertsquad2"] # Update if you add more models
result_holder = []
context_result_holder = []

""" Set up sexualoffences.txt for segmentation """
so = load_dataset('text', data_files='sexualoffences.txt', split='train['
                                                                     '0:]')
docs =so['text']
segmented_docs = segment_documents(docs, 450)

# Open original text file
file = open("sexualoffences.txt", "r")
sexualoffences = file.read()

""" Scripting """
# There's got to be a more efficient way to do this but I'll leave it as this for now :P 
bertsquad1_ans = []
bertsquad1_context = []
bertsquad2_ans = []
bertsquad2_context = []
legalbert_ans = []
legalbert_context = [] # Add more empty lists to hold results for other models 


for model_name in model_names:
  # This changes the model and tokenizer configuration when model_names is iterated over
  model, tokenizer = get_pretrained_squad_model(model_name)
  for question in questions:    
    # Find top document
    # Retrieve top document (index 0).
    top_doc = get_top_k_articles(question, segmented_docs, 1)[0]
    answer = answer_question(question, top_doc)
    answer_context = reference_context(top_doc, sexualoffences, question)
  
    if model_name == "bertsquad1":
      bertsquad1_ans.append(answer)
      bertsquad1_context.append(answer_context)
    elif model_name == "bertsquad2": 
      bertsquad2_ans.append(answer)
      bertsquad2_context.append(answer_context)
    elif model_name == "legalbert":
      legalbert_ans.append(answer)
      legalbert_context.append(answer_context)
    # Add conditional logic for other models 
  

# Create dataframe
result_df = pd.DataFrame({
    "question": questions,
    "bertsquad1_ans": bertsquad1_ans,
    "bertsquad1_context": bertsquad1_context,
    "bertsquad2_ans": bertsquad2_ans,
    "bertsquad2_context": bertsquad2_context,
    "legalbert_ans": bertsquad2_ans,
    "legalbert_context": bertsquad2_context, # Add on if there are more models
})

# Print all results
pd.set_option('display.max_rows', None, 'display.max_columns', None)

# Close sexualoffences.txt at end
file.close()


Using custom data configuration default-7f6d053da4cd744f
Reusing dataset text (/root/.cache/huggingface/datasets/text/default-7f6d053da4cd744f/0.0.0/e16f44aa1b321ece1f87b07977cc5d70be93d69b20486d6dacd62e12cf25c9a5)
Some weights of the model checkpoint at nlpaueb/legal-bert-base-uncased were not used when initializing BertForQuestionAnswering: ['cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoi

In [None]:
# This allows pandas dataframes to be rendered nicely in Colab 
from google.colab import data_table

# Displays the table below with responses for each question by each BERT model. 
result_df

Unnamed: 0,question,bertsquad1_ans,bertsquad1_context,bertsquad2_ans,bertsquad2_context,legalbert_ans,legalbert_context
0,What are my legal options?,if you have suffered sexual assault or harassment,If you have suffered sexual assault or harassm...,sexual assault or harassment,If you have suffered sexual assault or harassm...,sexual assault or harassment,If you have suffered sexual assault or harassm...
1,What happens after filing a police report?,the police ' s investigation process will begin,"\n\nAfter filing a report, the police's invest...",the police ' s investigation process will begin,"\n\nAfter filing a report, the police's invest...",the police ' s investigation process will begin,"\n\nAfter filing a report, the police's invest..."
2,What happens in a medical examination?,a full physical examination,"\n\nIf forensic medical examination is needed,...",physical examination,"\n\nIf forensic medical examination is needed,...",physical examination,"\n\nIf forensic medical examination is needed,..."
3,How can I file a police report?,police,\n\nPolice reports can be filed by visiting th...,[CLS],I don't know. You might want to consult anothe...,[CLS],I don't know. You might want to consult anothe...
4,How can I file a police report online?,if,"\n\nAn online police report can be filed, if t...",if there is no,"\n\nAn online police report can be filed, if t...",if there is no,"\n\nAn online police report can be filed, if t..."
5,Can I file a police report online?,if there is no,"\n\nAn online police report can be filed, if t...",if there is no,"\n\nAn online police report can be filed, if t...",if there is no,"\n\nAn online police report can be filed, if t..."
