In [1]:
from similarity import spacy_model, rank_paragraphs, find_answer_paragraph
from pickle_docs import *
from tqdm import tqdm
from timeit import default_timer as timer
from squad_dataset_test import *

In this notebook, I test how well my simple similarity functions return the text (here, a context paragraph) that contains information relevant to a question.

Then, I check whether they return not only the correct paragraph but also the sentence that contains the answer to the question.

I use the SQuAD dataset to simplify the process (I don't have to create my own dataset).
To use this dataset, I extract the context paragraphs and the question-answer pairs, then process each paragraph and question with the spaCy model "en_vectors_web_lg". This makes it possible to use spaCy's similarity method, which compares GloVe word-vector embeddings.

The list of contexts (spaCy Doc objects) is exported as a pickle file.
The list of tuples (index of target context paragraph, question as a spaCy Doc, answer as a string) is exported as a second pickle file.
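
As a rough sketch of the extraction step, the snippet below walks a dictionary in the SQuAD JSON layout and builds the two lists before pickling them. It keeps plain strings instead of spaCy Doc objects so that it runs without the model. The name extract_squad, the toy data, and the choice to skip questions without answers are assumptions for illustration, not the actual preprocess_pickle implementation.

```python
import pickle

def extract_squad(squad):
    """Return (contexts, qas) from a SQuAD-format dict.

    contexts: list of paragraph strings
    qas: list of (paragraph index, question, first answer text) tuples
    """
    contexts, qas = [], []
    for article in squad["data"]:
        for paragraph in article["paragraphs"]:
            idx = len(contexts)
            contexts.append(paragraph["context"])
            for qa in paragraph["qas"]:
                # assumption: questions without any answer are skipped
                if not qa["answers"]:
                    continue
                qas.append((idx, qa["question"], qa["answers"][0]["text"]))
    return contexts, qas

# tiny hand-made example in the SQuAD layout (not real data)
toy = {"data": [{"title": "Toy", "paragraphs": [
    {"context": "The sky is blue. Grass is green.",
     "qas": [{"question": "What color is the sky?",
              "answers": [{"text": "blue", "answer_start": 11}]},
             {"question": "What color is the sea?", "answers": []}]}]}]}

contexts, qas = extract_squad(toy)
# round-trip through pickle in memory; the notebook writes files on disk instead
restored_contexts, restored_qas = pickle.loads(pickle.dumps((contexts, qas)))
print(restored_qas)
```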

In [2]:
start = timer()
preprocess_pickle(spacy_model(), "docs/squad contexts.pickle", "docs/squad qas.pickle")
time_passed = timer() - start
print("preprocess and pickle the dataset: %s seconds" % time_passed)

pickling docs and qas
86821
19035
preprocess and pickle the dataset: 116.9895018 seconds


Once the contexts and the (target, question, answer) tuples are pickled, we no longer need to process them through spaCy.
We simply load the pickle files.

In [3]:
start = timer()
contexts, qas = preprocess_unpickle("docs/squad contexts.pickle", "docs/squad qas.pickle")
time_passed = timer() - start
print("%s seconds to load %s context paragraphs and %s question-answers" % (time_passed, len(contexts), len(qas)))

loading the pickled files
89.0525707 seconds to load 19035 context paragraphs and 86821 question-answers


Once we load the lists, we can start testing.
I will be looking at just the first 100 context paragraphs and the 1165 question-answer pairs that correspond to those 100 context paragraphs.

In [4]:
qas = sorted(qas, key=lambda tup:tup[0])
sub_qas = [qa for qa in qas if qa[0] < 100]
sub_contexts = contexts[:100]

The similarity functions all work by checking the similarity between the question sentence and each of the sentences within a paragraph.

In similarity.find_answer_sentence, the question is compared with each sentence in the paragraph, producing one similarity value per sentence (1 being the most similar).

In find_answer_paragraph, the most similar sentence of each paragraph is compared with the most similar sentence of every other paragraph. The paragraph that contains the sentence with the highest similarity value across all paragraphs is returned as the predicted target paragraph.
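
The two-level maximum described above can be sketched as follows. This is only an illustration: token_overlap is a toy stand-in for spaCy's vector similarity, paragraphs are given as lists of sentence strings, and the names best_sentence and best_paragraph are hypothetical rather than the functions in similarity.py. The returned tuple follows the order used in this notebook (paragraph index, highest similarity value, sentence index).

```python
import re

def token_overlap(a, b):
    """Toy stand-in for vector similarity: Jaccard overlap of lowercase word sets."""
    ta = set(re.findall(r"\w+", a.lower()))
    tb = set(re.findall(r"\w+", b.lower()))
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def best_sentence(sentences, question, similarity=token_overlap):
    """Return (highest similarity, sentence index) within one paragraph."""
    scores = [similarity(question, s) for s in sentences]
    idx = max(range(len(scores)), key=scores.__getitem__)
    return scores[idx], idx

def best_paragraph(paragraphs, question, similarity=token_overlap):
    """Return (paragraph index, highest similarity, sentence index)."""
    best = None
    for p_idx, sentences in enumerate(paragraphs):
        score, s_idx = best_sentence(sentences, question, similarity)
        # keep the paragraph whose best sentence beats every earlier one
        if best is None or score > best[1]:
            best = (p_idx, score, s_idx)
    return best

paragraphs = [
    ["Paris is the capital of France.", "It lies on the Seine."],
    ["Berlin is the capital of Germany.", "It lies on the Spree."],
]
result = best_paragraph(paragraphs, "What is the capital of Germany?")
print(result)
```

Because only the single best sentence represents each paragraph, a paragraph with one near-duplicate of the question outranks a paragraph that answers it across several sentences.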

The code snippet below will compare the predicted target paragraph with the actual target paragraph for the 1165 question-answer pairs within the context of 100 paragraphs.

In [5]:
correct = 0
for qa in tqdm(sub_qas):
    predicted_target_paragraph,highest_similarity_value,predicted_target_sentence = find_answer_paragraph(sub_contexts, qa[1])
    if predicted_target_paragraph == qa[0]:
        correct += 1
print("number of question-answer pairs: %s, number of context paragraphs: %s" %(len(sub_qas), len(sub_contexts)))
print("number of correct paragraph index prediction: %s" %correct)
print("percent correct: %s" %(correct/len(sub_qas)))


100%|██████████████████████████████████████████████████████████████████████████████| 1165/1165 [04:36<00:00,  4.09it/s]


number of question-answer pairs: 1165, number of context paragraphs: 100
number of correct paragraph index prediction: 358
percent correct: 0.3072961373390558


The above took 4:36 (per the progress bar) to process 1165 questions, with 30.73% accuracy.

The dataset contains paragraphs from articles, so many of the paragraphs come from the same article.
This makes the similarity value fairly high even for paragraphs that do not contain the answer. (A paragraph may also contain the answer even though the question-answer pair was not written for it; I won't be addressing these types of issues.)
There are many possible reasons for the low accuracy, but to see whether the similarity between the question and a paragraph is useful at all, I wrote a function that returns the n paragraphs with the highest similarity values.

We can also see where the correct target paragraph ranked (if it ranked in the top n).
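
A minimal sketch of such a top-n ranking, assuming each paragraph has already been reduced to a single score (its highest sentence similarity). The helper names rank_top_n and rank_of_target and the scores are made up for illustration; the real rank_paragraphs also returns the sentence index.

```python
import heapq

def rank_top_n(paragraph_scores, n):
    """Return the n best (paragraph index, score) pairs, best first."""
    return heapq.nlargest(n, enumerate(paragraph_scores), key=lambda pair: pair[1])

def rank_of_target(ranking, target_idx):
    """0-based rank of target_idx in ranking, or None if it is not in the top n."""
    for rank, (p_idx, _score) in enumerate(ranking):
        if p_idx == target_idx:
            return rank
    return None

scores = [0.62, 0.91, 0.40, 0.88, 0.75]  # made-up per-paragraph maximum similarities
top3 = rank_top_n(scores, 3)
print(top3)
print(rank_of_target(top3, 4))  # paragraph 4 is in the top 3
print(rank_of_target(top3, 2))  # paragraph 2 is not
```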

In [6]:
rank_list = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
wrong = 0

# qa = (target paragraph index, question, answer)
for qa in tqdm(sub_qas):
    # top 10
    # most_similar = list of tuples (predicted paragraph index, max similarity score, predicted sentence index)
    most_similar = rank_paragraphs(sub_contexts, qa[1], 10)
    paragraph_indices = [paragraph[0] for paragraph in most_similar]
    try:
        # find the index of the target paragraph in the paragraph ranking
        idx = paragraph_indices.index(qa[0])
        rank_list[idx] += 1
    except ValueError:
        # if the target paragraph does not exist
        wrong += 1
print("distribution of ranking of target paragraph: %s" %rank_list)
print("number of times target paragraph was included in ranking: %s out of %s" %((len(sub_qas)-wrong),len(sub_qas)))   
    

100%|██████████████████████████████████████████████████████████████████████████████| 1165/1165 [04:23<00:00,  4.08it/s]


distribution of ranking of target paragraph: [358, 143, 88, 53, 39, 29, 35, 30, 27, 23]
number of times target paragraph was included in ranking: 825 out of 1165


Only 358 of the predictions correctly ranked the target paragraph as the best choice.
However, the target paragraph was included in the top 10 for 825/1165 questions (70.82%), with most of them placed in the top three (589/825 = 71.39%).

This took 4:23 (per the progress bar) to process the 1165 questions.
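
The percentages quoted here can be recomputed from the printed rank distribution. The snippet below is a small worked check that uses only the numbers from the output above; the variable names are my own.

```python
from itertools import accumulate

rank_list = [358, 143, 88, 53, 39, 29, 35, 30, 27, 23]  # distribution printed above
total_questions = 1165

# cumulative_hits[k - 1] = questions whose target paragraph ranked in the top k
cumulative_hits = list(accumulate(rank_list))

for k in (1, 3, 10):
    hits = cumulative_hits[k - 1]
    print("top-%d: %d/%d = %.2f%%" % (k, hits, total_questions, 100 * hits / total_questions))

# share of the top-10 hits that landed in the top three
top3_share = cumulative_hits[2] / cumulative_hits[-1]
print("top-3 share of top-10 hits: %.2f%%" % (100 * top3_share))
```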

This approach does not use any machine learning of its own, so it requires no training. However, it also means there is no model with a network searching for the answer itself (a span of characters).
Instead, this code compares the question with each sentence in a paragraph, under the assumption that the answer is most likely to be in the sentence most similar to the question.
This means that it does not handle examples such as:
    "What is the purpose of models such as A?"
    "There are models such as A. They have B purpose."
    prediction: "There are models such as A."
The pretrained GloVe vector space is created by a count-based model and is available directly through spaCy's vector model.
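
For reference, spaCy's default similarity estimate is the cosine similarity between averaged word vectors. The sketch below reproduces that computation in plain Python on made-up 3-dimensional vectors; the toy_vectors table and the helper names are assumptions, and real GloVe vectors are much larger.

```python
import math

def average_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]

def cosine_similarity(u, v):
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# made-up "word vectors" for illustration only
toy_vectors = {
    "cat": [1.0, 0.0, 0.0],
    "dog": [0.8, 0.6, 0.0],
    "car": [0.0, 0.0, 1.0],
}

def text_vector(words):
    """Represent a text as the average of its word vectors."""
    return average_vector([toy_vectors[w] for w in words])

sim_cat_dog = cosine_similarity(text_vector(["cat"]), text_vector(["dog"]))
sim_cat_car = cosine_similarity(text_vector(["cat"]), text_vector(["car"]))
print(sim_cat_dog, sim_cat_car)
```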


We can also see how accurately the similarity function picks the target sentence when the target paragraph appears in the ranking.

The answer_text2idx function returns a list of tuples (target_paragraph_index, question, answer, target_sentence_index).
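
One way to map an answer string to a sentence index is to take the first sentence that contains the answer text. The sketch below does this with a naive regex sentence splitter so that it runs without spaCy. Both helper names are hypothetical, and the real answer_text2idx may work differently (for example, by using spaCy's sentence boundaries).

```python
import re

def split_sentences(paragraph):
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", paragraph.strip()) if s]

def answer_sentence_index(paragraph, answer):
    """Index of the first sentence containing the answer text, or -1 if none does."""
    for idx, sentence in enumerate(split_sentences(paragraph)):
        if answer in sentence:
            return idx
    return -1

paragraph = "There are models such as A. They have B purpose."
print(split_sentences(paragraph))
print(answer_sentence_index(paragraph, "B purpose"))
print(answer_sentence_index(paragraph, "C"))
```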

In [7]:
idx_qas = answer_text2idx(contexts, qas)
pickle_all(idx_qas, "docs/squad idx_qas.pickle")

In [8]:
idx_qas = sorted(idx_qas, key=lambda tup:tup[0])
# use a list (not a set) so the sorted order is kept, as with sub_qas above
trimmed_qas = [qa for qa in idx_qas if qa[0] < 100]
trimmed_contexts = contexts[:100]

In [9]:
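rank_list = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
wrong = 0
par_sent_correct = 0

# qa = (target paragraph index, question, answer, target sentence index)
for qa in tqdm(trimmed_qas):
    # most_similar = list of tuples (paragraph index, max similarity score, sentence index of max)
    most_similar = rank_paragraphs(trimmed_contexts, qa[1], 10)
    paragraph_indices = [paragraph[0] for paragraph in most_similar]
    try:
        # position of the target paragraph in the ranking (ValueError if absent)
        rank = paragraph_indices.index(qa[0])
        rank_list[rank] += 1
        # compare the predicted sentence index with the target sentence index
        predicted_sentence_idx = most_similar[rank][2]
        target_sentence_idx = qa[3]
        if predicted_sentence_idx == target_sentence_idx:
            par_sent_correct += 1
    except ValueError:
        # the target paragraph is not in the top 10
        wrong += 1
print(rank_list)
print((len(trimmed_qas) - wrong)/len(trimmed_qas))
print(wrong, len(trimmed_qas))
print("correct sentence: ", par_sent_correct)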

100%|██████████████████████████████████████████████████████████████████████████████| 1165/1165 [04:39<00:00,  4.52it/s]


[358, 143, 88, 53, 39, 29, 35, 30, 27, 23]
0.7081545064377682
340 1165
correct sentence:  556


Of the 825 questions whose target paragraph appeared in the top 10, the target sentence was also correctly identified for 556 (67.39%).