In [1]:
%load_ext autoreload
%autoreload 2

---
# Project 2 - Part 1
For the first part, use [the pretrained BERT model from Hugging Face](https://huggingface.co/transformers/task_summary.html#question-answering) and feed it five 300-word sections from the book of your choice.
These sections should be selected so that they introduce the protagonist(s), the antagonist, the crime and crime scene, any significant evidence, and the resolution of the crime or a narrative that presents the case against the perpetrator.
The questions you ask should cover the identity and characteristics of the protagonist and of the antagonist/perpetrator, the nature and setting of the crime or crime scene, the evidence, and the case against the perpetrator.
Document the questions, ask them, and document the specificity and accuracy of the results.
Then feed the whole novel as context and repeat the questions. Document the results.
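DistilBERT accepts at most 512 tokens per input, so a whole novel cannot be passed as a single context. One way to handle the last step is to split the text into overlapping word windows and ask each question against every window. The sketch below is a minimal illustration, not part of the assignment: `chunk_words`, `size`, and `overlap` are hypothetical names, and the 50-word overlap is an arbitrary choice meant to reduce the chance of an answer being cut at a window boundary.

```python
def chunk_words(text, size=300, overlap=50):
    """Split text into windows of `size` words; consecutive windows share `overlap` words."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        # stop once a window has reached the end of the text
        if start + size >= len(words):
            break
    return chunks
```

Each returned chunk is a plain string, so it can be used directly as the `context` of a question-answering call.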

In [59]:
import os
from pprint import pprint

import tensorflow as tf
from datasets import load_dataset
from transformers import DistilBertTokenizerFast, TFDistilBertForQuestionAnswering, pipeline

from custom_squad import *  # helpers for creating and loading custom SQuAD-style questions and answers
from question_answering import *  # helpers for automating the question-answering process


---
# Load the Base DistilBERT (Not Fine-Tuned)

In [12]:
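# NOTE: 'distilbert-base-uncased' is the base checkpoint, so the QA head (qa_outputs)
# is newly initialized (see the warning below) and has not been trained for question answering.
checkpoint = 'distilbert-base-uncased'
tokenizer = DistilBertTokenizerFast.from_pretrained(checkpoint)
model = TFDistilBertForQuestionAnswering.from_pretrained(checkpoint)
generator = pipeline(task='question-answering', model=model, tokenizer=tokenizer)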

Some layers from the model checkpoint at distilbert-base-uncased were not used when initializing TFDistilBertForQuestionAnswering: ['vocab_layer_norm', 'activation_13', 'vocab_transform', 'vocab_projector']
- This IS expected if you are initializing TFDistilBertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFDistilBertForQuestionAnswering were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['dropout_39', 'qa_outputs']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


---
# Load in Custom Questions

In [47]:
samples = create_samples()

In [63]:
# peek at the samples
print(f'Num Samples: {len(samples)}')
print(f'Sample Keys: {samples[0].keys()}')


Num Samples: 25
Sample Keys: dict_keys(['id', 'title', 'context', 'question', 'answers'])


In [49]:
print('sample[0]:')
pprint(samples[0])


sample[0]:
{'answers': {'answer_start': [608], 'text': ['practitioner']},
 'context': 'Mr. Sherlock Holmes, who was usually very late in the mornings, '
            'save upon those not infrequent occasions when he was up all '
            'night, was seated at the breakfast table. I stood upon the '
            'hearth-rug and picked up the stick which our visitor had left '
            'behind him the night before. It was a fine, thick piece of wood, '
            'bulbous-headed, of the sort which is known as a \\"Penang '
            'lawyer.\\" Just under the head was a broad silver band nearly an '
            'inch across. \\"To James Mortimer, M.R.C.S., from his friends of '
            'the C.C.H.,\\" was engraved upon it, with the date \\"1884.\\" It '
            'was just such a stick as the old-fashioned family practitioner '
            'used to carry—dignified, solid, and reassuring. \\"Well, Watson, '
            'what do you make of it?\\" Holmes was sitting with his b
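Because each sample stores a character offset (`answer_start`) alongside the answer text, a quick consistency check can catch hand-labeling mistakes before any evaluation. The following is a minimal sketch that assumes only the SQuAD-style keys printed above; `check_answer_offsets` is a hypothetical helper, not something provided by `custom_squad`.

```python
def check_answer_offsets(sample):
    """Return (expected_text, text_found_at_offset) pairs for answers whose offsets do not match."""
    context = sample["context"]
    answers = sample["answers"]
    mismatches = []
    for start, text in zip(answers["answer_start"], answers["text"]):
        found = context[start:start + len(text)]
        if found != text:
            mismatches.append((text, found))
    return mismatches
```

An empty list means every `answer_start` points at its answer text; running it over all samples, for example `[check_answer_offsets(s) for s in samples]`, would flag any offsets that need correcting.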

---
# Automate QA

In [56]:
# Test pipeline generator
_ = qa(generator, samples[0], verbose=True)


QUESTION: "What is the guests occupation?"
GROUND TRUTH: "practitioner"
PREDICTED: "esteemed since those who know him give him this mark"
SCORE: "4.6428795030806214e-05"


In [57]:
# Run all of the samples through the generator
predicted = [qa(generator, s, verbose=False) for s in samples]

# Package the sample with the predicted
questions_answers = list(zip(samples, predicted))
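To document accuracy in a repeatable way, each predicted answer can be scored against its ground truth with SQuAD-style exact match and token-level F1. The sketch below works on plain strings only; how to pull the predicted string out of whatever `qa` returns is left open, since that helper's return format is not shown here. The names `normalize`, `exact_match`, and `token_f1` are illustrative.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, strip punctuation and English articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, truth):
    """True when the normalized strings are identical."""
    return normalize(prediction) == normalize(truth)


def token_f1(prediction, truth):
    """Harmonic mean of token precision and recall between prediction and truth."""
    pred_tokens = normalize(prediction).split()
    truth_tokens = normalize(truth).split()
    if not pred_tokens or not truth_tokens:
        # both empty counts as a match, one empty as a miss
        return float(pred_tokens == truth_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(truth_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)
```

Exact match is strict, while F1 gives partial credit when a prediction contains the ground-truth words among extra ones, which is useful for judging specificity separately from accuracy.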


---
# Non-Fine-Tuned Questions vs. Answers With 500-Word Contexts

In [62]:
n = len(questions_answers)

# print predictions to console for quick inspection
for i, question_answer in enumerate(questions_answers):
    print_answer(i, question_answer, n=n)

# write predictions to a file for later
os.makedirs('results', exist_ok=True)  # make sure the output directory exists
with open(os.path.join('results', 'non-fine-tuned-500-word-context.text'), 'w', encoding='utf8') as f:
    for i, question_answer in enumerate(questions_answers):
        print_answer(i, question_answer, n=n, file=f)

****************************************************************************************************

QUESTION #0:
"What is the guests occupation?"

GROUND TRUTH ANSWER:
"practitioner"

PREDICTED ANSWER (score=4.6428795030806214e-05):
"esteemed since those who know him give him this mark"

****************************************************************************************************

QUESTION #1:
"How did Sherlock Holmes see Watson?"

GROUND TRUTH ANSWER:
"well-polished, silver-plated coffee-pot"

PREDICTED ANSWER (score=4.631249976227991e-05):
"coffee-pot in front of me,\" said he"

****************************************************************************************************

QUESTION #2:
"What did Holmes and Watson inspect?"

GROUND TRUTH ANSWER:
"visitor's stick"

PREDICTED ANSWER (score=4.593794801621698e-05):
"coffee-pot in front of me,\" said he"

****************************************************************************************************

QUESTION #3:
"Where

---
# Non-Fine-Tuned Questions vs. Answers With Whole Book As Context

---
# OLD STUFF

In [4]:
def Q_A(questions, text):
    """Print the answer span the model extracts from `text` for each question."""
    for question in questions:
        inputs = tokenizer(question, text, add_special_tokens=True, return_tensors="tf")
        input_ids = inputs["input_ids"].numpy()[0]

        outputs = model(inputs)
        answer_start_scores = outputs.start_logits
        answer_end_scores = outputs.end_logits

        # Get the most likely beginning of the answer with the argmax of the start scores
        answer_start = tf.argmax(answer_start_scores, axis=1).numpy()[0]
        # Get the most likely end of the answer with the argmax of the end scores
        # (+1 because the slice below excludes its upper bound)
        answer_end = tf.argmax(answer_end_scores, axis=1).numpy()[0] + 1

        # NOTE: the two argmaxes are independent, so answer_end can be <= answer_start,
        # in which case the slice is empty and the answer is an empty string
        answer = tokenizer.convert_tokens_to_string(
            tokenizer.convert_ids_to_tokens(input_ids[answer_start:answer_end])
        )

        print(f"Question: {question}")
        print(f"Answer: {answer}")
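`Q_A` takes the argmax of the start and end scores independently, so the chosen end can land before the chosen start, which yields an empty answer. A common alternative is to search for the best-scoring span that satisfies `start <= end`. The sketch below shows that idea on plain arrays of logits; `best_span` and `max_answer_len` are hypothetical names, and the function does not mask out question or special tokens, so `[CLS]`/`[SEP]` spans would need separate handling.

```python
import numpy as np


def best_span(start_logits, end_logits, max_answer_len=30):
    """Return (start, end, score) maximizing start_logits[s] + end_logits[e]
    subject to s <= e < s + max_answer_len. `end` is inclusive."""
    start_logits = np.asarray(start_logits, dtype=float)
    end_logits = np.asarray(end_logits, dtype=float)
    best = (0, 0, -np.inf)
    for s in range(len(start_logits)):
        # only consider end positions at or after s, within the length limit
        last = min(s + max_answer_len, len(end_logits))
        e = s + int(np.argmax(end_logits[s:last]))
        score = start_logits[s] + end_logits[e]
        if score > best[2]:
            best = (s, e, score)
    return best
```

Because `end` is inclusive here, the matching token slice would be `input_ids[start:end + 1]`.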

In [5]:
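def split_text(text):
    """Yield consecutive 300-word chunks (as lists of words) from a file in the text/ directory."""
    filename = os.path.join('text', text)
    # read the whole file and split it into whitespace-separated words
    with open(filename, encoding="utf8") as file:
        word_list = file.read().split()
    for i in range(0, len(word_list), 300):
        yield word_list[i:i + 300]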

In [11]:
questions = [
    "Who is the protagonist?",
    "Who is the perpetrator?",
    "Where was the first victim murdered?",
]
word_list = list(split_text('the-hound-of-the-baskervilles.txt'))

# ask every question against each 300-word chunk
for chunk in word_list:
    text = ' '.join(chunk)
    Q_A(questions, text)

Question: Who is the protagonist?
Answer: 
Question: Who is the perpetrator?
Answer: 
Question: Where was the first victim murdered?
Answer: 
Question: Who is the protagonist?
Answer: [CLS]
Question: Who is the perpetrator?
Answer: [CLS]
Question: Where was the first victim murdered?
Answer: [SEP]
Question: Who is the protagonist?
Answer: [CLS] who is the protagonist? [SEP]
Question: Who is the perpetrator?
Answer: [SEP]
Question: Where was the first victim murdered?
Answer: [SEP]
Question: Who is the protagonist?
Answer: [CLS] who is the protagonist? [SEP]
Question: Who is the perpetrator?
Answer: [CLS]
Question: Where was the first victim murdered?
Answer: [CLS]
Question: Who is the protagonist?
Answer: hugo baskerville
Question: Who is the perpetrator?
Answer: [SEP]
Question: Where was the first victim murdered?
Answer: [CLS]
Question: Who is the protagonist?
Answer: [SEP] thing great black beast shape like hound large hound mortal eye rest. look thing tear throat hugo baskerville
Q