**Purpose:**

A BERT model accepts at most 512 tokens per input sequence, which is usually far less than the text of a full web page. To stay within that limit, this notebook uses the *generate_context()* function from *infersent_glove_context_generation.py*.

*generate_context()* takes a URL and a question, finds the N sentences on that web page that are most similar to the question, and concatenates them into a context of valid sequence length.
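
The selection step described above can be illustrated with a small, self-contained sketch. This is not the code in *infersent_glove_context_generation.py*: it swaps the InferSent/GloVe sentence embeddings for plain bag-of-words cosine similarity, and the names *build_context()*, *top_n* and *max_words* are hypothetical, chosen only for this example. The word budget is a rough proxy for the token limit, since WordPiece tokenization usually produces more tokens than words.

```python
import math
import re
from collections import Counter


def bow_vector(text):
    """Lowercased bag-of-words counts; a crude stand-in for a sentence embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def build_context(sentences, question, top_n=3, max_words=300):
    """Keep the top_n sentences most similar to the question, within a word budget."""
    q_vec = bow_vector(question)
    ranked = sorted(sentences,
                    key=lambda s: cosine(bow_vector(s), q_vec),
                    reverse=True)
    chosen, used = [], 0
    for sentence in ranked[:top_n]:
        n_words = len(sentence.split())
        if used + n_words > max_words:
            break  # Stop before the context exceeds the budget.
        chosen.append(sentence)
        used += n_words
    return " ".join(chosen)


sentences = [
    "Cricket is the most popular sport in India.",
    "The capital of India is New Delhi.",
    "Mumbai is a large city.",
]
question = "Which sport is most popular in India?"
print(build_context(sentences, question, top_n=1))
```

A real pipeline would compare dense sentence embeddings rather than word counts, but the ranking and concatenation logic has the same shape.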

This context and the question are then fed to the BERT QnA model to extract the answer.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
!pip install transformers

In [None]:
#!wget --directory-prefix='/content/drive/My Drive/colab_files/word_embeddings/' http://nlp.stanford.edu/data/glove.6B.zip

In [None]:
#!wget --directory-prefix='/content/drive/My Drive/colab_files/InferSent/encoder/' https://dl.fbaipublicfiles.com/infersent/infersent1.pkl

In [None]:
#import zipfile
#with zipfile.ZipFile('/content/drive/My Drive/colab_files/word_embeddings/glove.6B.zip', 'r') as zip_ref:
#    zip_ref.extractall('/content/drive/My Drive/colab_files/word_embeddings/glove/')

In [None]:
import sys
sys.path.append('/content/drive/My Drive/colab_files/modules')

import infersent_glove_context_generation as ig

import time
import os
import contextlib
import torch
import nltk
nltk.download('punkt')

In [None]:
from transformers import BertForQuestionAnswering
model = BertForQuestionAnswering.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')

from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')

In [None]:
def extract_answer_phrase(question, context):
    '''
    Takes a `question` string and a `context` string (which contains the
    answer), identifies the span of tokens within `context` that forms the
    answer, and returns that span as a string.
    '''
    # Apply the tokenizer to the input text, treating them as a text-pair.
    input_ids = tokenizer.encode(question, context)

    # Report how long the input sequence is.
    print('Query has {:,} tokens.\n'.format(len(input_ids)))

    # ======== Set Segment IDs ========
    # Search the input_ids for the first instance of the `[SEP]` token.
    sep_index = input_ids.index(tokenizer.sep_token_id)

    # The number of segment A tokens includes the [SEP] token itself.
    num_seg_a = sep_index + 1

    # The remainder are segment B.
    num_seg_b = len(input_ids) - num_seg_a

    # Construct the list of 0s and 1s.
    segment_ids = [0]*num_seg_a + [1]*num_seg_b

    # There should be a segment_id for every input token.
    assert len(segment_ids) == len(input_ids)

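    # ======== Evaluate ========
    # Run the question and context through the model. Gradients are not
    # needed for inference.
    with torch.no_grad():
        outputs = model(torch.tensor([input_ids]),  # The tokens representing our input text.
                        token_type_ids=torch.tensor([segment_ids]))  # The segment IDs to differentiate question from context.

    # Index by position so this works whether the installed transformers
    # version returns a plain tuple or a ModelOutput object.
    start_scores, end_scores = outputs[0], outputs[1]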

    # ======== Reconstruct Answer ========
    # Find the tokens with the highest `start` and `end` scores.
    answer_start = int(torch.argmax(start_scores))
    answer_end = int(torch.argmax(end_scores))

    # Get the string versions of the input tokens.
    tokens = tokenizer.convert_ids_to_tokens(input_ids)

    # Start with the first token.
    answer = tokens[answer_start]

    # Append the remaining answer tokens, merging subword pieces as we go.
    for i in range(answer_start + 1, answer_end + 1):
        
        # If it's a subword token, then recombine it with the previous token.
        if tokens[i][0:2] == '##':
            answer += tokens[i][2:]
        
        # Otherwise, add a space then the token.
        else:
            answer += ' ' + tokens[i]

    #print('Answer: "' + answer + '"')
    return answer

In [None]:
def find_answer(url, question):
    context = ig.generate_context(url, question)
    answer = extract_answer_phrase(question, context)
    return answer

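The two bookkeeping steps inside *extract_answer_phrase()*, building the segment IDs and recombining subword tokens, can be tried in isolation without loading the model. The sketch below is a minimal illustration: the helper names *segment_ids_for()* and *merge_wordpieces()* are hypothetical, and the token list is made up for the example rather than produced by the real tokenizer.

```python
def segment_ids_for(tokens, sep_token="[SEP]"):
    """Return 0 for every token up to and including the first [SEP], 1 after it."""
    num_seg_a = tokens.index(sep_token) + 1
    return [0] * num_seg_a + [1] * (len(tokens) - num_seg_a)


def merge_wordpieces(tokens):
    """Join WordPiece tokens, gluing '##' continuations onto the previous token."""
    text = ""
    for token in tokens:
        if token.startswith("##"):
            text += token[2:]      # Continuation piece: no space.
        elif text:
            text += " " + token    # New word: separate with a space.
        else:
            text = token           # First token of the answer.
    return text


# Illustrative tokens only; the real tokenizer may split these words differently.
tokens = ["[CLS]", "who", "?", "[SEP]", "jordan", "kel", "##ley", "[SEP]"]

print(segment_ids_for(tokens))
print(merge_wordpieces(tokens[4:7]))
```

In the notebook itself, the start and end indices passed to the slice come from the argmax of the model's start and end scores.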
**Inference:**

In [None]:
url = 'https://en.wikipedia.org/wiki/India'
question = 'Which sports does India play?'

start_time = time.time()
with open(os.devnull, "w") as f, contextlib.redirect_stdout(f):
    answer = find_answer(url, question)
end_time = time.time()
print('Answer: ', answer)
print('\n\nTotal Execution Time: {} seconds'.format(end_time - start_time))

Answer:  cricket is the most popular sport in india . in india , several traditional indigenous sports remain fairly popular , such as kabaddi , kho kho , pehlwani and gilli - danda . india has traditionally been the dominant country at the south asian games . corruption in india is perceived to have decreased . other sports in which indians have succeeded internationally include badminton ( saina nehwal and p v sindhu are two of the top - ranked female badminton players in the world ) , boxing , and wrestling


Total Execution Time: 27.443676710128784 seconds


In [None]:
url = 'https://en.wikipedia.org/wiki/Cryptocurrency'
question = 'Who launched the first Bitcoin ATM?'

start_time = time.time()
with open(os.devnull, "w") as f, contextlib.redirect_stdout(f):
    answer = find_answer(url, question)
end_time = time.time()
print('\n\nAnswer: ', answer)
print('\n\nTotal Execution Time: {} seconds'.format(end_time - start_time))





Answer:  jordan kelley


Total Execution Time: 20.821406841278076 seconds
