## Assignment 2
### Zhengjie Deng a1865926

#### Dependency version:
- Python: 3.8.16
- pandas: 1.4.2
- sklearn: 1.0.2
- nltk: 3.7
- tqdm: 4.64.1
- matplotlib: 3.7.0
- spacy: 3.5.0
- numpy: 1.23.5
- gensim: 3.8.3

In [None]:
import warnings
import gensim.downloader as api
from collections import Counter
import spacy
import pandas as pd
from tqdm import tqdm
import nltk
from nltk.corpus import stopwords


import numpy as np
import json
import re


# Install faiss
!pip install faiss
!pip install pyserini

# Download stopwords
nltk.download('stopwords')
nltk.download('wordnet2022')

# Download the en_core_web_sm model
spacy.cli.download("en_core_web_sm")


# Download the pre-trained GloVe model
glove_model = api.load('glove-wiki-gigaword-100')

# ignore the warning
warnings.filterwarnings("ignore")

# Some Kaggle-wordnet patch
! cp -rf /usr/share/nltk_data/corpora/wordnet2022 /usr/share/nltk_data/corpora/wordnet 

# Install Java for pyserini
import os

!wget "https://download.java.net/java/GA/jdk11/9/GPL/openjdk-11.0.2_linux-x64_bin.tar.gz"
!tar -xvf openjdk-11.0.2_linux-x64_bin.tar.gz

!mkdir -p /kaggle/working/jdk-11.0.2/jre/lib/amd64/server/
!ln -s /kaggle/working/jdk-11.0.2/lib/server/libjvm.so /kaggle/working/jdk-11.0.2/jre/lib/amd64/server/libjvm.so

# `!export` only affects a throwaway subshell, so set the variables from Python instead
os.environ["JAVA_HOME"] = "/kaggle/working/jdk-11.0.2/"
os.environ["PATH"] = "/kaggle/working/jdk-11.0.2/bin:" + os.environ["PATH"]

### 1. Reading dataset and pre-processing

#### 1.1 Read dataset

In [None]:
# load the dataset
f_metadata_path = "../input/CORD-19-research-challenge/metadata.csv"

# construct the data frame of metadata
df = pd.read_csv(f_metadata_path)
df.head(5)

#### 1.2 Dropping useless columns

In [None]:
# drop useless columns for this assignment
df_trimed = df.drop(["sha", "source_x", "doi", "license", "authors", "journal", "pmc_json_files",
                    "pmcid", "pubmed_id", "mag_id", "who_covidence_id", "arxiv_id", "url", "s2_id"], axis=1)
df_trimed.head(5)


#### 1.3 Sampling

Given that COVID-19 was first reported in December 2019, we can limit our data selection to articles published after this date. Therefore, we will sample data from December 1st, 2019, onwards.

In [None]:
# only keep the documents published on or after 2019-12-01
df_trimed = df_trimed[df_trimed["publish_time"] > '2019-12']
df_trimed.shape[0]
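
The filter above compares `publish_time` as plain strings, which works because ISO-style dates sort lexicographically. The sketch below illustrates that behaviour in plain Python; the sample values are made up, and the real column may contain other formats.

```python
# Hypothetical publish_time values; not taken from the dataset.
sample_dates = ["2019-11-30", "2019-12-01", "2020-03-15", "2019", "2020"]

# Same comparison as the filter: lexicographic string ordering.
kept = [d for d in sample_dates if d > "2019-12"]
print(kept)
```

Note that a year-only value such as "2019" is dropped by this comparison, even if the article actually appeared in December 2019.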


Check the missing values in the dataset.

In [None]:
# check the number of missing value in each column
missing_values_count = df_trimed.isnull().sum()
missing_values_count


Many rows of data are missing titles, abstracts, and PDF files, as demonstrated above. We will drop these rows from our analysis.

In [None]:
df_trimed = df_trimed.dropna(subset=['pdf_json_files', 'title', 'abstract'])
df_trimed.shape[0]


Next, we will check that each PDF JSON file path points to a file that actually exists. Rows whose files are missing will be dropped from the dataset.

In [None]:
# remove all the rows that do not have the real file
import os
# tqdm progress apply
tqdm.pandas(desc="removing all the rows that do not have the real file...")
print(df_trimed['pdf_json_files'])


df_trimed = df_trimed[df_trimed['pdf_json_files'].progress_apply(
    lambda x: os.path.isfile("../input/CORD-19-research-challenge/"+x))]
df_trimed.shape[0]

Based on the keywords of the test queries, we can sample the data further. We will only keep the rows whose title or abstract contain the keywords in the test queries. 

In [None]:
# function: select the rows whose title or abstract contain any of the given strings, ignoring case
def select_rows_contain_string(df, strings):
    # lowercase the keywords once so that the comparison really is case-insensitive
    strings = [string.lower() for string in strings]
    tqdm.pandas(
        desc="selecting the rows whose title or abstract contain the given strings...")
    return df[df.progress_apply(lambda row: any(string in row['title'].lower() or string in row['abstract'].lower() for string in strings), axis=1)]

In [None]:
# title or abstract contain "COVID-19", "SARS-CoV-2", "coronavirus", "2019-nCoV", "covid", "covid-19", "Covid-19"
strings = ["COVID-19", "SARS-CoV-2", "coronavirus",
           "2019-nCoV", "covid", "covid-19", "Covid-19"]
df_trimed_covid = select_rows_contain_string(df_trimed, strings)

keyword_sampled_df_list = []

# title or abstract contain "origin", "Wuhan" from the df_trimed_covid
strings = ["origin", "Wuhan"]
df_trimed_origin = select_rows_contain_string(df_trimed_covid, strings)
keyword_sampled_df_list.append(df_trimed_origin)

# title or abstract contain "rapid testing"
strings = ["rapid testing"]
df_trimed_testing = select_rows_contain_string(df_trimed_covid, strings)
keyword_sampled_df_list.append(df_trimed_testing)

# title or abstract contain "social distancing", "lockdown", "quarantine"
strings = ["social distancing", "lockdown", "quarantine"]
df_trimed_social = select_rows_contain_string(df_trimed_covid, strings)
keyword_sampled_df_list.append(df_trimed_social)

# title or abstract contain "transmission route"
strings = ["transmission route"]
df_trimed_transmission = select_rows_contain_string(df_trimed_covid, strings)
keyword_sampled_df_list.append(df_trimed_transmission)

# title or abstract contain "best masks", "preventing infection", "prevent infection", "preventing transmission", "prevent transmission", "preventing spread", "prevent spread"
strings = ["best masks", "preventing infection", "prevent infection",
           "preventing transmission", "prevent transmission", "preventing spread", "prevent spread"]
df_trimed_masks = select_rows_contain_string(df_trimed_covid, strings)
keyword_sampled_df_list.append(df_trimed_masks)

# title or abstract contain "hand sanitizer"
strings = ["hand sanitizer"]
df_trimed_sanitizer = select_rows_contain_string(df_trimed_covid, strings)
keyword_sampled_df_list.append(df_trimed_sanitizer)

# title or abstract contain "vaccine", "vaccination", "vaccines", "vaccinations"
strings = ["vaccine", "vaccination", "vaccines", "vaccinations"]
df_trimed_vaccine = select_rows_contain_string(df_trimed_covid, strings)
keyword_sampled_df_list.append(df_trimed_vaccine)

# title or abstract contain "Vitamin"
strings = ["Vitamin"]
df_trimed_vitamin = select_rows_contain_string(df_trimed_covid, strings)
keyword_sampled_df_list.append(df_trimed_vitamin)

# title or abstract contain "live outside the body"
strings = ["live outside the body"]
df_trimed_outside = select_rows_contain_string(df_trimed_covid, strings)
keyword_sampled_df_list.append(df_trimed_outside)

# title or abstract contain "initial symptoms"
strings = ["initial symptoms"]
df_trimed_symptoms = select_rows_contain_string(df_trimed_covid, strings)
keyword_sampled_df_list.append(df_trimed_symptoms)
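
The keyword sampling above depends on case-insensitive substring matching: both the keyword and the text must be lowercased, otherwise keywords such as "Wuhan" or "Vitamin" can never match lowercased text. Here is a minimal plain-Python sketch of that check; the helper name `contains_any` and the titles are hypothetical.

```python
def contains_any(text, keywords):
    """Return True if any keyword occurs in text, ignoring case."""
    text = text.lower()
    return any(keyword.lower() in text for keyword in keywords)

# Hypothetical titles used only for illustration.
titles = [
    "Early transmission dynamics in Wuhan",
    "Vitamin D status and outcomes",
    "A survey of graph algorithms",
]
matches = [t for t in titles if contains_any(t, ["Wuhan", "Vitamin"])]
print(matches)
```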


In [None]:
# join all the dataframes above into one dataframe and remove the duplicates
df_trimed_covid = pd.concat(keyword_sampled_df_list).drop_duplicates()
df_trimed_covid.shape[0]

Finally, we randomly sample 10000 rows from the dataset.

In [None]:
# randomly pick 10000 documents from the dataset (or all of them if fewer remain)

df_sampled = df_trimed_covid.sample(n=min(10000, len(df_trimed_covid)), random_state=42)
df_sampled

#### 1.4 Pre-processing the text data 

After sampling the data, our next step is to pre-process the text data. The first task is to extract the text from the PDF JSON files.

In [None]:
# function: copy the body text of a parsed PDF JSON file into the row
def extract_pdf(row, pdf_json):
    # paragraphs other than abstract, introduction, conclusion
    row["other_paragraph"] = []
    for body_paragraph in pdf_json['body_text']:
        # note: if a section spans several paragraphs, only the last one is kept
        if body_paragraph["section"] == "Introduction":
            row["introduction"] = body_paragraph["text"]
        elif body_paragraph["section"] == "Conclusion":
            row["conclusion"] = body_paragraph["text"]
        else:
            row["other_paragraph"].append(body_paragraph["text"])
    return row

# function: load a PDF JSON file and close the file handle afterwards
def load_pdf_json(path):
    with open("../input/CORD-19-research-challenge/" + path) as f:
        return json.load(f)

# extract the pdf json text from the sampled dataset
tqdm.pandas(desc="extracting the pdf json text from the sampled dataset...")
df_sampled = df_sampled.progress_apply(lambda row: extract_pdf(
    row, load_pdf_json(row['pdf_json_files'])), axis=1)
df_sampled.head()


Next, we can preprocess the text we got from the PDF JSON files. Preprocessing the text involves several tasks: first, we will convert all text to lowercase and remove stopwords. Next, we will perform lemmatization and store the preprocessed text in a separate column for further analysis.

In [None]:
stop_words = set(stopwords.words('english'))

# function that preprocesses the input string
def preprocess_text(string):
    string = string.lower()
    # remove stopwords
    string = " ".join([word for word in string.split()
                      if word not in stop_words])
    # lemmatization
    lemmatizer = nltk.stem.WordNetLemmatizer()
    string = " ".join([lemmatizer.lemmatize(word) for word in string.split()])
    return string

# function that preprocesses the text of a row in the sampled dataset
def preprocess_text_in_df(row):
    title = preprocess_text(row['title'])
    abstract = preprocess_text(row['abstract'])
    introduction = ""
    if pd.isnull(row['introduction']) == False:
        introduction = preprocess_text(row['introduction'])
    conclusion = ""
    if pd.isnull(row['conclusion']) == False:
        conclusion = preprocess_text(row['conclusion'])
    # join the paragraphs with spaces so that words at paragraph boundaries do not fuse
    other_paragraph = " ".join(
        preprocess_text(paragraph) for paragraph in row['other_paragraph'])
    row["preprocessed_text"] = title + " " + abstract + " " + \
        introduction + " " + conclusion + " " + other_paragraph
    return row


tqdm.pandas(desc="preprocessing the text in the sampled dataset...")
df_sampled = df_sampled.progress_apply(
    lambda row: preprocess_text_in_df(row), axis=1)


In [None]:
df_sampled["preprocessed_text"].head()

#### 1.5 Generating the paragraph list

In [None]:
# generate the list of paragraphs
paragraphs = []
for index, row in df_sampled.iterrows():
    paragraph_obj = {}
    paragraph_obj['text'] = row['abstract']
    paragraph_obj['p_id'] = row['cord_uid'] + "_0"
    paragraphs.append(paragraph_obj)
    # extract the paragraphs from the json file (same dataset root as the cells above)
    with open("../input/CORD-19-research-challenge/"+row['pdf_json_files']) as f:
        json_data = json.load(f)
        p_index = 1
        for body in json_data['body_text']:
            paragraph_obj = {}
            paragraph_obj['text'] = body['text']
            paragraph_obj['p_id'] = row['cord_uid'] + "_" + str(p_index)
            paragraphs.append(paragraph_obj)
            p_index += 1

# turn the list of paragraphs into a dataframe
df_paragraphs = pd.DataFrame(paragraphs)
# set the index of the dataframe to be the p_id
df_paragraphs = df_paragraphs.set_index('p_id')
df_paragraphs

### 2. Named Entity Recognition and Knowledge Base

#### 2.1 Entity extraction

To save time, we will only extract entities from the title, abstract, introduction, and conclusion sections of the text. These sections of the article are most likely to contain entities that are relevant to the topic of the article.

In [None]:
nlp = spacy.load("en_core_web_sm")

# function: get the named entities from the title, abstract, introduction, and conclusion of a row
def get_name_entity(row):
    # entity labels that are just numbers
    numeric_labels = {"CARDINAL", "PERCENT", "MONEY"}
    name_entity = []
    for column in ["title", "abstract", "introduction", "conclusion"]:
        # the introduction and conclusion may be missing
        if pd.isna(row[column]):
            continue
        doc = nlp(row[column])
        for ent in doc.ents:
            if ent.label_ not in numeric_labels:
                name_entity.append(ent.text)
    return name_entity


In [None]:
# get the name entity of the title, abstract, introduction, and conclusion of each data in the dataset
tqdm.pandas(desc="Getting name entity...")
df_sampled['name_entity'] = df_sampled.progress_apply(
    lambda row: get_name_entity(row), axis=1)


Next, we present the frequency of the extracted entities.

In [None]:
# merge the name entity into one list
name_entity_list = []
for name_entity in df_sampled['name_entity']:
    name_entity_list.extend(name_entity)

# count the frequency of each name entity and sort it
name_entity_count = Counter(name_entity_list)
name_entity_count = sorted(name_entity_count.items(),
                           key=lambda x: x[1], reverse=True)

# show the top 100 name entity
name_entity_count[:100]
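
`Counter.most_common` should give the same descending ordering without a manual sort. A small sketch with made-up entity strings (not taken from the dataset):

```python
from collections import Counter

# Hypothetical entity mentions, for illustration only.
entities = ["China", "WHO", "China", "Wuhan", "WHO", "China"]

# most_common(n) returns the n most frequent items as (item, count) pairs.
top_entities = Counter(entities).most_common(2)
print(top_entities)
```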


#### 2.2 Knowledge base

We will manually build knowledge bases that display synonyms of the entities and their associated keywords. This will be based on the results of the Named Entity Recognition and the test query set.

In [None]:
# manually create the knowledge base

# create the knowledge base
knowledge_base_synonym = {
    "COVID-19": ["SARS-CoV-2", "coronavirus disease 2019", "coronavirus disease 19", "coronavirus 2019", "coronavirus 19", "2019 novel coronavirus", "2019-nCoV", "2019-novel coronavirus", "2019 novel coronavirus pneumonia", "2019-nCoV pneumonia"],
    "rapid testing": ["RAT", "rapid test", "rapid antigen test", "rapid antigen tests", "rapid antigen testing", "rapid antigen test kit"],
    "origin": ["origins", "source", "sources"],
    "initial symptoms": ["early signs"]
}

knowledge_base_association = {
    "mask": ["n95", "cloth mask"],
    "vaccine": ["mrna"],
    "origin": ["wuhan", "fish market"],
    "symptoms": ["fever", "chill", "cough", "tired", "headache", "loss of taste or smell", "sore throat", "diarrhea"],
    "sanitizer": ["alcohol"],
    "social distancing": ["quarantine", "lockdown"],
    "transmission route": ["airborne", "droplet", "contact", "fomite"],
    "testing": ["PCR"]
}


### 3. Indexing method

To efficiently retrieve documents containing words from the query, we will utilize the inverted index method. This method creates a dictionary that maps words to the documents that contain them.

In [None]:
# the unique word set of the preprocessed_text of the dataset
unique_word_set = set([])

def get_unique_words(row):
    for word in row['preprocessed_text'].split():
        # skip tokens that are a single punctuation character
        if re.match(r'^[^\w\s]$', word) is None:
            unique_word_set.add(word)


tqdm.pandas(desc="Getting unique words...")
result = df_sampled.progress_apply(lambda row: get_unique_words(row), axis=1)


In [None]:
# the inverted index of the preprocessed_text of the dataset
inverted_index = {i: [] for i in unique_word_set}

# function: get the inverted index of the preprocessed_text of the dataset
def get_inverted_index(row, inverted_index):
    for word in row['preprocessed_text'].split():
        if word in inverted_index:
            inverted_index[word].append(row['cord_uid'])


tqdm.pandas(desc="Getting inverted index...")
result = df_sampled.progress_apply(
    lambda row: get_inverted_index(row, inverted_index), axis=1)


In [None]:
# get the documents containing the word "mask"
inverted_index["mask"]
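
To make the inverted index idea concrete, here is a toy version in plain Python. The documents and IDs are invented, and this sketch stores sets rather than the lists used in the notebook, so a document ID appears only once per word even when the word occurs several times.

```python
from collections import defaultdict

# Hypothetical preprocessed documents keyed by made-up IDs.
docs = {
    "doc1": "mask prevent infection",
    "doc2": "vaccine trial mask mask",
    "doc3": "vitamin d treatment",
}

# Map each word to the set of documents that contain it.
toy_index = defaultdict(set)
for doc_id, text in docs.items():
    for word in text.split():
        toy_index[word].add(doc_id)

print(sorted(toy_index["mask"]))
```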


### 4. Text matching utility

To obtain the answer snippets to a query, we will first extend the query by adding related entities from the knowledge base. Next, we will use the inverted index method to locate documents containing the words from the extended query. We will then calculate cosine similarity between the query and the documents to rank them and select the top 5 as the target documents. Finally, we will again use cosine similarity to rank the paragraphs in the target documents and extract the top 3 paragraphs as answer snippets.

First, we define the function to preprocess the query.

In [None]:
# function: extend the query with the knowledge base
def extend_query(query, knowledge_base_synonym, knowledge_base_association):
    # convert the query to lower case
    query = query.lower()
    # for each key in the knowledge base, if the key is in the query, then add the value to the query string
    for key in knowledge_base_synonym:
        # if the query string contains the key
        if key.lower() in query:
            # concatenate the query string with each value of the key
            for value in knowledge_base_synonym[key]:
                query += " " + value
    for key in knowledge_base_association:
        if key.lower() in query:
            for value in knowledge_base_association[key]:
                query += " " + value
    return query

# query preprocessing: extend the query with the knowledge base, remove question marks and stopwords, then lemmatize
def preprocess_query(query):
    # extend the query with the knowledge base
    query = extend_query(query, knowledge_base_synonym,
                         knowledge_base_association)
    query = query.lower()
    # remove question marks first, so that stopword removal and lemmatization see clean tokens
    query = query.replace("?", "")
    # remove stopwords
    query = " ".join([word for word in query.split()
                     if word not in stop_words])
    # lemmatization
    lemmatizer = nltk.stem.WordNetLemmatizer()
    query = " ".join([lemmatizer.lemmatize(word) for word in query.split()])
    return query
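
The query expansion step can be illustrated with a one-entry knowledge base. This is a simplified sketch, not the notebook's function: the helper `expand` is hypothetical and matches keys against the original query only, whereas `extend_query` matches against the growing query string.

```python
def expand(query, synonyms):
    """Append the synonyms of every key that occurs in the query."""
    query = query.lower()
    extra = []
    for key, values in synonyms.items():
        # substring test, as in the notebook
        if key.lower() in query:
            extra.extend(values)
    return " ".join([query] + extra)

# Hypothetical one-entry knowledge base.
toy_synonyms = {"origin": ["source", "sources"]}
print(expand("What is the origin of COVID-19?", toy_synonyms))
```

Because the match is a substring test, a key such as "origin" also fires on words like "original"; matching on whole tokens would avoid that.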


Next, we need to define the function to embed the query and the documents.

In [None]:
# function: get the vector representation of the query by averaging the vector representation of each word in the query
def embedding_string(string):
    vectors = []
    for word in string.split():
        if word in glove_model:
            vectors.append(glove_model[word])
    # no word of the string is in the GloVe vocabulary: fall back to the zero vector
    if not vectors:
        return np.zeros(100)
    return np.mean(vectors, axis=0)

# get the vector representation of each document in df_sampled
df_sampled['embedding'] = df_sampled.progress_apply(
    lambda row: embedding_string(row['preprocessed_text']), axis=1)

df_sampled['embedding']
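
The averaging scheme can be checked by hand on a tiny example. The sketch below uses invented 2-dimensional vectors in place of the 100-dimensional GloVe vectors, and plain lists in place of NumPy arrays.

```python
# Hypothetical 2-dimensional "embeddings", for illustration only.
toy_vectors = {"mask": [1.0, 0.0], "virus": [0.0, 1.0]}

def average_embedding(text, vectors, dim=2):
    """Average the vectors of the known words; zero vector if none is known."""
    found = [vectors[w] for w in text.split() if w in vectors]
    if not found:
        return [0.0] * dim
    return [sum(component) / len(found) for component in zip(*found)]

print(average_embedding("mask virus unknownword", toy_vectors))
print(average_embedding("nothing known", toy_vectors))
```

Unknown words are skipped, so a text made entirely of out-of-vocabulary words maps to the zero vector.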

#### 4.1 Cosine similarity

In [None]:
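# function: calculate the cosine similarity between two vectors
def cosine_similarity(vector1, vector2):
    norm_product = np.linalg.norm(vector1) * np.linalg.norm(vector2)
    # a zero vector (text with no known words) has no direction; treat its similarity as 0
    if norm_product == 0:
        return 0.0
    return np.dot(vector1, vector2) / norm_product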

#### 4.2 Get the top n documents or paragraphs

In [None]:
# function: get the top n documents that are most similar to the query from the indexed document set
def get_top_n_documents(query_vector, indexed_document_set, n):
    # select the documents in the indexed document set (copy so that df_sampled is not modified)
    df_indexed = df_sampled[df_sampled['cord_uid'].isin(indexed_document_set)].copy()
    # calculate the cosine similarity between the query and each document in the indexed document set
    df_indexed['similarity'] = df_indexed.progress_apply(
        lambda row: cosine_similarity(query_vector, row['embedding']), axis=1)
    # sort the documents by the similarity score
    df_indexed = df_indexed.sort_values(by=['similarity'], ascending=False)
    # get the top n documents
    df_top_n = df_indexed[:n]
    return df_top_n
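
Ranking by cosine similarity can be sketched without pandas. The vectors and IDs below are invented, and the helper treats a zero vector as having similarity 0, which is an assumption of this sketch.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / norm if norm else 0.0

# Hypothetical document vectors and query vector.
doc_vectors = {"doc1": [1.0, 0.0], "doc2": [1.0, 1.0], "doc3": [0.0, 0.0]}
query_vector = [1.0, 0.0]

# Sort document IDs by similarity to the query, most similar first.
ranked = sorted(doc_vectors,
                key=lambda d: cosine(query_vector, doc_vectors[d]),
                reverse=True)
print(ranked[:2])
```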


In [None]:
# function: get the top n paragraphs that are most similar to the query for each document in the top n documents
def get_top_n_paragraphs(query_vector, df_top_n_doc, n):
    top_n_paragraphs_dict = {}
    for index, row in df_top_n_doc.iterrows():
        # get the paragraphs of the article
        paragraph_list = {"paragraph": [], "similarity": []}
        paragraph_list["paragraph"].append(row["abstract"])

        # if the conclusion is not null, then add it to the paragraph list
        if pd.isnull(row["conclusion"]) == False:
            paragraph_list["paragraph"].append(row["conclusion"])

        if pd.isnull(row["introduction"]) == False:
            paragraph_list["paragraph"].append(row["introduction"])

        paragraph_list["paragraph"].extend(row["other_paragraph"])

        # calculate the cosine similarity between the query and each paragraph
        for paragraph in paragraph_list["paragraph"]:
            paragraph_list["similarity"].append(cosine_similarity(
                query_vector, embedding_string(paragraph)))
        
        df_paragraph_list = pd.DataFrame(paragraph_list)
        # sort the paragraphs by the similarity score
        df_paragraph_list = df_paragraph_list.sort_values(
            by=['similarity'], ascending=False)
        top_n_paragraphs_dict[row["cord_uid"]] = df_paragraph_list[:n]
    return top_n_paragraphs_dict


#### 4.3 Get the final answer of a query

In [None]:
# function: integrate the above functions to get the answer of the query
def get_answer(query, n):
    # preprocess the query
    query = preprocess_query(query)
    # get the indexed document set
    indexed_document_set = set([])
    for word in query.split():
        if word in inverted_index:
            for document in inverted_index[word]:
                indexed_document_set.add(document)
    # get the vector representation of the query
    query_vector = embedding_string(query)
    # get the top n documents that are most similar to the query
    df_top_n = get_top_n_documents(query_vector, indexed_document_set, n)
    # get the top n paragraphs that are most similar to the query for each document in the top n documents
    top_n_paragraphs_dict = get_top_n_paragraphs(query_vector, df_top_n, 3)
    return top_n_paragraphs_dict
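
The candidate-selection step in `get_answer` takes the union of the posting lists of all query words, so a document qualifies if it contains any one of them. A toy illustration with an invented index:

```python
# Hypothetical inverted index: word -> list of document IDs.
toy_index = {"mask": ["doc1", "doc2"], "vaccine": ["doc2", "doc3"]}

candidates = set()
for word in "best mask vaccine".split():
    # words missing from the index contribute nothing
    candidates.update(toy_index.get(word, []))

print(sorted(candidates))
```

With OR semantics the candidate set grows quickly for long expanded queries, which is why the cosine ranking that follows matters.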


### 5. Test utility and test results

In [None]:
# the test set with 10 queries
test_set = [
    "what is the origin of COVID-19?",
    "what types of rapid testing for Covid-19 have been developed?",
    "has social distancing had an impact on slowing the spread of COVID-19?",
    "what are the transmission routes of coronavirus?",
    "what are the best masks for preventing infection by Covid-19?",
    "what type of hand sanitizer is needed to destroy Covid-19?",
    "What vaccine candidates are being tested for Covid-19?",
    "does Vitamin D impact COVID-19 prevention and treatment?",
    "how long can the coronavirus live outside the body?",
    "what are the initial symptoms of Covid-19?",
]


In [None]:
# present the top_n_paragraphs_dict in a more readable way
def present_answer(top_n_paragraphs_dict):
    for key in top_n_paragraphs_dict:
        # print the document title
        print("Document: " +
              df_sampled[df_sampled["cord_uid"] == key]["title"].values[0])
        for index, row in top_n_paragraphs_dict[key].iterrows():
            print("Paragraph: " + row["paragraph"])
            print("Confidence: " + str(row["similarity"]))
            print("")
        print("---------------------------------------------------------------------------------------------")


In [None]:
for i in range(len(test_set)):
    print("Question " + str(i + 1) + ": " + test_set[i])
    present_answer(get_answer(test_set[i], 5))
    print("")
    print("=================================================================================================")
    print("=================================================================================================")
    print("")
    print("")


#### 5.1 Calculate the MRR of the result

After obtaining the test results, we will manually judge the correctness of each returned snippet and calculate the reciprocal rank for each query.

In [None]:
reciprocal_rank = np.zeros(len(test_set))
reciprocal_rank[0] = 1/2 # the second document is a relevant document
reciprocal_rank[1] = 1/5 # the fifth document is a relevant document
reciprocal_rank[2] = 0 # there is no relevant document in the top 5 documents
reciprocal_rank[3] = 1 # the first document is a relevant document
reciprocal_rank[4] = 0
reciprocal_rank[5] = 0
reciprocal_rank[6] = 1/4
reciprocal_rank[7] = 0
reciprocal_rank[8] = 0
reciprocal_rank[9] = 1

print("The mean reciprocal rank of the test set is: " +
      str(np.mean(reciprocal_rank)))
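
For reference, the mean reciprocal rank is the average over all queries of 1/rank of the first relevant result, counting 0 when nothing relevant is returned. The sketch below recomputes it from the ranks listed above; the helper name is hypothetical.

```python
def mean_reciprocal_rank(first_relevant_ranks):
    """Ranks are 1-based; None means no relevant result was returned."""
    reciprocal = [0.0 if rank is None else 1.0 / rank
                  for rank in first_relevant_ranks]
    return sum(reciprocal) / len(reciprocal)

# Ranks transcribed from the manual judgements above.
ranks = [2, 5, None, 1, None, None, 4, None, None, 1]
mrr = mean_reciprocal_rank(ranks)
print(mrr)
```

For these ranks the value is approximately 0.295.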


#### 5.2 Result analysis

The results show that only half of the queries returned a relevant answer within the top 5 documents: 1. "what is the origin of COVID-19?"; 2. "what types of rapid testing for Covid-19 have been developed?"; 3. "what are the transmission routes of coronavirus?"; 4. "What vaccine candidates are being tested for Covid-19?"; 5. "what are the initial symptoms of Covid-19?". The success of these queries can be attributed to the presence of their keywords in the knowledge base. Conversely, the queries that were answered incorrectly depend on keywords that are missing from the knowledge base or the dataset. A well-constructed knowledge base is therefore critical for precise answer retrieval; it can be improved by adding keywords and synonyms or by using relation extraction to establish the connections between entities. Despite these limitations, the outcomes highlight the potential of named entity recognition and knowledge bases for question-answering systems.

### 6. Simple user interface

In [None]:
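convert_json_list = []
def convert_to_json(row):
    # copy the list so that the dataframe's own "other_paragraph" is not modified
    paragraph_list = list(row['other_paragraph'])
    # the introduction may be missing (NaN)
    if pd.notnull(row['introduction']):
        paragraph_list.append(row['introduction'])
    # one record per paragraph, with ids of the form <cord_uid>-<n>
    for num, paragraph in enumerate(paragraph_list):
        convert_json_list.append(
        {
            "id": row["cord_uid"] + "-" + str(num),
            "contents": paragraph
        })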

In [None]:
x = df_sampled.progress_apply(
    lambda row: convert_to_json(row), axis=1)

json_str = json.dumps(convert_json_list)
with open("collection.json", "w") as outfile:
    outfile.write(json_str)
print(json_str[0:500])
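
Each record written to `collection.json` is a small object with an `id` and a `contents` field. A minimal sketch of that shape, using an invented document ID and paragraphs:

```python
import json

def to_records(doc_id, paragraphs):
    """One record per paragraph, with IDs of the form '<doc_id>-<n>'."""
    return [{"id": f"{doc_id}-{n}", "contents": p}
            for n, p in enumerate(paragraphs)]

# Hypothetical document ID and paragraphs.
records = to_records("abc123", ["First paragraph.", "Second paragraph."])
print(json.dumps(records, indent=2))
```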


In [None]:
!python -m pyserini.index.lucene \
  --collection JsonCollection \
  --input ./ \
  --index index \
  --generator DefaultLuceneDocumentGenerator \
  --threads 1 \
  --storePositions --storeDocvectors --storeRaw

In [None]:
from pyserini.search.lucene import LuceneSearcher
import json
searcher = LuceneSearcher('./index')
# load the collection and key it by id so that each hit can be looked up directly
with open("collection.json", "r") as f:
    text = {doc['id']: doc['contents'] for doc in json.load(f)}

hits = searcher.search('where did coronavirus origin')

for i in range(len(hits)):
    print(f'{i+1:2} {hits[i].docid} {hits[i].contents} {hits[i].score:.5f}')
    # print the paragraph that belongs to this hit
    if hits[i].docid in text:
        print(text[hits[i].docid])

In [None]:
# accept the query from the user
query = input("Please enter your question: ")
print(query)
# get the answer of the query
top_n_paragraphs_dict = get_answer(query, 5)
# present the answer
present_answer(top_n_paragraphs_dict)


### BERT QA

In [9]:
# connect with the summarization section
test_summerized_reference = "We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers. Unlike recent language representation models (Peters et al., 2018a; Radford et al., 2018), BERT is designed to pretrain deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be finetuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial taskspecific architecture modifications. BERT is conceptually simple and empirically powerful. It obtains new state-of-the-art results on eleven natural language processing tasks, including pushing the GLUE score to 80.5% (7.7% point absolute improvement), MultiNLI accuracy to 86.7% (4.6% absolute improvement), SQuAD v1.1 question answering Test F1 to 93.2 (1.5 point absolute improvement) and SQuAD v2.0 Test F1 to 83.1 (5.1 point absolute improvement)."
test_query = "What does the 'B' in BERT stand for?"

In [10]:
!pip install transformers
import torch


In [15]:
from transformers import BertForQuestionAnswering

model = BertForQuestionAnswering.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')

In [19]:
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')

In [20]:
def answer_question(question, answer_text):
    '''
    Takes a `question` string and an `answer_text` string (which contains the
    answer), and identifies the words within the `answer_text` that are the
    answer. Prints them out.
    '''
    # ======== Tokenize ========
    # Apply the tokenizer to the input text, treating them as a text-pair.
    input_ids = tokenizer.encode(question, answer_text)

    # Report how long the input sequence is.
    # print('Query has {:,} tokens.\n'.format(len(input_ids)))

    # ======== Set Segment IDs ========
    # Search the input_ids for the first instance of the `[SEP]` token.
    sep_index = input_ids.index(tokenizer.sep_token_id)

    # The number of segment A tokens includes the [SEP] token itself.
    num_seg_a = sep_index + 1

    # The remainder are segment B.
    num_seg_b = len(input_ids) - num_seg_a

    # Construct the list of 0s and 1s.
    segment_ids = [0]*num_seg_a + [1]*num_seg_b

    # There should be a segment_id for every input token.
    assert len(segment_ids) == len(input_ids)

    # ======== Evaluate ========
    # Run our example through the model.
    outputs = model(torch.tensor([input_ids]), # The tokens representing our input text.
                    token_type_ids=torch.tensor([segment_ids]), # The segment IDs to differentiate question from answer_text
                    return_dict=True) 

    start_scores = outputs.start_logits
    end_scores = outputs.end_logits

    # ======== Reconstruct Answer ========
    # Find the tokens with the highest `start` and `end` scores.
    answer_start = torch.argmax(start_scores)
    answer_end = torch.argmax(end_scores)

    # Get the string versions of the input tokens.
    tokens = tokenizer.convert_ids_to_tokens(input_ids)

    # Start with the first token.
    answer = tokens[answer_start]

    # Select the remaining answer tokens and join them with whitespace.
    for i in range(answer_start + 1, answer_end + 1):
        
        # If it's a subword token, then recombine it with the previous token.
        if tokens[i][0:2] == '##':
            answer += tokens[i][2:]
        
        # Otherwise, add a space then the token.
        else:
            answer += ' ' + tokens[i]

    print('Answer: "' + answer + '"')

In [21]:
answer_question(test_query, test_summerized_reference)

Answer: "bidirectional encoder representations from transformers"
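
The answer-reconstruction loop in `answer_question` can be tested in isolation, without loading the model. In the sketch below the helper name and the token split are hypothetical; the real tokenizer may split words differently.

```python
def join_wordpieces(tokens):
    """Merge WordPiece tokens back into words: '##' marks a continuation."""
    answer = tokens[0]
    for token in tokens[1:]:
        if token.startswith("##"):
            # continuation piece: glue it to the previous token
            answer += token[2:]
        else:
            # new word: separate it with a space
            answer += " " + token
    return answer

# Hypothetical tokenization, for illustration only.
print(join_wordpieces(["bid", "##ire", "##ction", "##al", "encoder"]))
```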


### 7. References