<a href="https://www.kaggle.com/code/leomauro/nlp-document-retrieval-for-question-answering?scriptVersionId=96151497" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

<a id='top'></a>
# NLP - Document Retrieval for Question Answering


Question answering (QA) is the task of answering a user's questions using a large collection of documents. It consists of two steps: (1) rank the documents that are likely to contain the answer to a given question; and (2) extract the content from these documents and compose an answer for the user. In this notebook, we are going to explore techniques for identifying the documents most similar to a question, a problem called _Document Retrieval_.

**Statement**: Given a question, the document retriever has to return the $k$ documents most likely to contain the answer to the question.

We are going to experiment with Document Retrieval on the [Stanford Question Answering Dataset](https://www.kaggle.com/datasets/stanfordu/stanford-question-answering-dataset). By the end, I hope you will be able to apply some of these algorithms to this problem.

> **Summary** - Document Retrieval for Question Answering.   
> Content for intermediate level in Machine Learning and Data Science!   

## Table of Contents
- [Data Exploration](#data)
- [Document Retrieval](#document)
    - TF-IDF
    - Word2Vec
- [Conclusion](#discussion)

![image](https://qa.fastforwardlabs.com/images/copied_from_nb/my_icons/QAworkflow.png)
_Image from [Intro to Automated Question Answering](https://qa.fastforwardlabs.com/methods/background/2020/04/28/Intro-to-QA.html)._

In [1]:
%%capture
!pip install wikipedia==1.4.0
!pip install scikit-learn==1.0.2
!pip install gensim==4.0.1

## What do we want to do?

We want to build a document retriever, similar to a search tool such as the one on Wikipedia. Let's explore the `wikipedia` library to see a retriever in action.

In [2]:
import wikipedia as wiki

k = 5
question = "What are the tourist hotspots in Portugal?"

results = wiki.search(question, results=k)
print('Question:', question)
print('Pages:  ', results)

Question: What are the tourist hotspots in Portugal?
Pages:   ['Tourist attraction', 'Portugal', 'Goa', 'Tourism', 'Algarve']


### Discussion

For this question, Wikipedia's Document Retrieval returned the 5 pages it considers most likely to contain the answer to the question.

<a id="data"></a>

---
# Data Exploration

In this section, we are going to load a `json` file into a `pandas.DataFrame`, and then build our list of documents.

[Back to Top](#top)

In [3]:
import json
import numpy as np
import pandas as pd

In [4]:
import os

# list the available data
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

/kaggle/input/stanford-question-answering-dataset/train-v1.1.json
/kaggle/input/stanford-question-answering-dataset/dev-v1.1.json


In [5]:
# based on: https://www.kaggle.com/code/sanjay11100/squad-stanford-q-a-json-to-pandas-dataframe
def squad_json_to_dataframe(file_path, record_path=['data','paragraphs','qas','answers']):
    """
    file_path: path to the squad json file.
    record_path: path to the deepest level in the json file; the default value is
    ['data','paragraphs','qas','answers']
    """
    # read the json file; the context manager closes it afterwards
    with open(file_path) as f:
        file = json.load(f)
    # parsing the different levels in the json file
    js = pd.json_normalize(file, record_path)
    m = pd.json_normalize(file, record_path[:-1])
    r = pd.json_normalize(file, record_path[:-2])
    # combining it into single dataframe
    idx = np.repeat(r['context'].values, r.qas.str.len())
    m['context'] = idx
    data = m[['id','question','context','answers']].set_index('id').reset_index()
    data['c_id'] = data['context'].factorize()[0]
    return data

In [6]:
# loading the data
file_path = '/kaggle/input/stanford-question-answering-dataset/train-v1.1.json'
data = squad_json_to_dataframe(file_path)
data

| | id | question | context | answers | c_id |
|---|---|---|---|---|---|
| 0 | 5733be284776f41900661182 | To whom did the Virgin Mary allegedly appear i... | Architecturally, the school has a Catholic cha... | [{'answer_start': 515, 'text': 'Saint Bernadet... | 0 |
| 1 | 5733be284776f4190066117f | What is in front of the Notre Dame Main Building? | Architecturally, the school has a Catholic cha... | [{'answer_start': 188, 'text': 'a copper statu... | 0 |
| 2 | 5733be284776f41900661180 | The Basilica of the Sacred heart at Notre Dame... | Architecturally, the school has a Catholic cha... | [{'answer_start': 279, 'text': 'the Main Build... | 0 |
| 3 | 5733be284776f41900661181 | What is the Grotto at Notre Dame? | Architecturally, the school has a Catholic cha... | [{'answer_start': 381, 'text': 'a Marian place... | 0 |
| 4 | 5733be284776f4190066117e | What sits on top of the Main Building at Notre... | Architecturally, the school has a Catholic cha... | [{'answer_start': 92, 'text': 'a golden statue... | 0 |
| ... | ... | ... | ... | ... | ... |
| 87594 | 5735d259012e2f140011a09d | In what US state did Kathmandu first establish... | Kathmandu Metropolitan City (KMC), in order to... | [{'answer_start': 229, 'text': 'Oregon'}] | 18890 |
| 87595 | 5735d259012e2f140011a09e | What was Yangon previously known as? | Kathmandu Metropolitan City (KMC), in order to... | [{'answer_start': 414, 'text': 'Rangoon'}] | 18890 |
| 87596 | 5735d259012e2f140011a09f | With what Belorussian city does Kathmandu have... | Kathmandu Metropolitan City (KMC), in order to... | [{'answer_start': 476, 'text': 'Minsk'}] | 18890 |
| 87597 | 5735d259012e2f140011a0a0 | In what year did Kathmandu create its initial ... | Kathmandu Metropolitan City (KMC), in order to... | [{'answer_start': 199, 'text': '1975'}] | 18890 |


In [7]:
# how many documents do we have?
data['c_id'].unique().size

18891

## Get the Unique Documents

Let's select the unique documents in our `data`. This will be the list of documents to search for the answers.

In [8]:
documents = data[['context', 'c_id']].drop_duplicates().reset_index(drop=True)
documents

| | context | c_id |
|---|---|---|
| 0 | Architecturally, the school has a Catholic cha... | 0 |
| 1 | As at most other universities, Notre Dame's st... | 1 |
| 2 | The university is the major seat of the Congre... | 2 |
| 3 | The College of Engineering was established in ... | 3 |
| 4 | All of Notre Dame's undergraduate students are... | 4 |
| ... | ... | ... |
| 18886 | Institute of Medicine, the central college of ... | 18886 |
| 18887 | Football and Cricket are the most popular spor... | 18887 |
| 18888 | The total length of roads in Nepal is recorded... | 18888 |
| 18889 | The main international airport serving Kathman... | 18889 |


<a href='#top'><span class="label label-info" style="font-size: 125%">Back to Top</span></a>

<a id="document"></a>

---
# Document Retrieval

In this section, we are going to explore techniques to retrieve documents. First, we create our document `vectorizer` and use it to encode both the documents and the questions into vectors. Then, we search by comparing the question vector with the document vectors. In the end, the algorithm returns the $k$ document vectors most similar to the question vector.
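
The search step described above can be sketched with plain NumPy. This is only a minimal illustration, not the pipeline used in this notebook: `top_k_cosine`, `toy_docs`, and `toy_query` are hypothetical names, and the 3-dimensional vectors are made-up values standing in for real document and question vectors.

```python
import numpy as np

def top_k_cosine(query_vec, doc_matrix, k=3):
    """Return the indices and scores of the k rows of doc_matrix most similar to query_vec."""
    doc_norms = np.linalg.norm(doc_matrix, axis=1)
    query_norm = np.linalg.norm(query_vec)
    # guard against division by zero for all-zero vectors
    denom = np.maximum(doc_norms * query_norm, 1e-12)
    # cosine similarity between the query and every document
    scores = (doc_matrix @ query_vec) / denom
    # sort by decreasing similarity and keep the first k indices
    order = np.argsort(-scores, kind="stable")[:k]
    return order, scores[order]

# toy "document vectors" (hypothetical values, one row per document)
toy_docs = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
# toy "question vector" (hypothetical values)
toy_query = np.array([1.0, 0.2, 0.0])

top_idx, top_scores = top_k_cosine(toy_query, toy_docs, k=2)
print('Top documents:', top_idx)
print('Similarities: ', top_scores)
```

Here the query points mostly along the first axis, so document 0 comes first and document 2, which shares that axis, comes second.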

[Back to Top](#top)

## TF-IDF

"In information retrieval, TF-IDF, short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. It is often used as a weighting factor in searches of information retrieval, text mining, and user modeling." [Wikipedia](https://en.wikipedia.org/wiki/Tf%E2%80%93idf)

In [9]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# defining the TF-IDF
tfidf_configs = {
    'lowercase': True,
    'analyzer': 'word',
    'stop_words': 'english',
    'binary': True,
    'max_df': 0.9,
    'max_features': 10_000
}
# defining the number of documents to retrieve
retriever_configs = {
    'n_neighbors': 1,
    'metric': 'cosine'
}

# defining our pipeline
pipeline = Pipeline([
    ('embedding', TfidfVectorizer(**tfidf_configs)),
    ('retriever', KNeighborsClassifier(**retriever_configs)),
])

In [10]:
# let's train the model to retrieve the document id 'c_id'
pipeline.fit(documents['context'], documents['c_id'])

Pipeline(steps=[('embedding',
                 TfidfVectorizer(binary=True, max_df=0.9, max_features=10000,
                                 stop_words='english')),
                ('retriever',
                 KNeighborsClassifier(metric='cosine', n_neighbors=1))])

Let's test the vectorizer: which information does our model use to build the vector?

In [11]:
def transform_text(vectorizer, text):
    '''
    Print the text and the vector[TF-IDF]
    vectorizer: sklearn.vectorizer
    text: str
    '''
    print('Text:', text)
    vector = vectorizer.transform([text])
    vector = vectorizer.inverse_transform(vector)
    print('Vect:', vector)

In [12]:
# vectorize the question
transform_text(pipeline['embedding'], question)

Text: What are the tourist hotspots in Portugal?
Vect: [array(['tourist', 'portugal'], dtype='<U18')]


What is the most similar document to this question?

In [13]:
# predict the most similar document
c_id = pipeline.predict([question])[0]
selected = documents.iloc[c_id]['context']

# vectorize the document
transform_text(pipeline['embedding'], selected)

Text: The two largest metropolitan areas have subway systems: Lisbon Metro and Metro Sul do Tejo in the Lisbon Metropolitan Area and Porto Metro in the Porto Metropolitan Area, each with more than 35 km (22 mi) of lines. In Portugal, Lisbon tram services have been supplied by the Companhia de Carris de Ferro de Lisboa (Carris), for over a century. In Porto, a tram network, of which only a tourist line on the shores of the Douro remain, began construction on 12 September 1895 (a first for the Iberian Peninsula). All major cities and towns have their own local urban transport network, as well as taxi services.
Vect: [array(['urban', 'transport', 'towns', 'tourist', 'systems', 'supplied',
       'subway', 'shores', 'services', 'september', 'remain', 'portugal',
       'porto', 'peninsula', 'network', 'mi', 'metropolitan', 'metro',
       'major', 'local', 'lisbon', 'lines', 'line', 'largest', 'km',
       'iberian', 'construction', 'cities', 'century', 'began', 'areas',
       'area', '35

### Evaluation

In [14]:
%%time
# predict one document for each question
y_test = data['c_id']
y_pred = pipeline.predict(data['question'])

CPU times: user 18.1 s, sys: 16.1 s, total: 34.2 s
Wall time: 34.2 s


In [15]:
from sklearn.metrics import accuracy_score

acc = accuracy_score(y_test, y_pred)
print('Accuracy:', f'{acc:.4f}')
print('Quantity:', int(acc*len(y_pred)), 'from', len(y_pred))

Accuracy: 0.4322
Quantity: 37856 from 87599


### Discussion

1. This is a difficult problem because we have many documents (in this notebook, ~19k documents) and the answer can be in one or more of them. Thus, returning only one document is neither complete nor fair. Usually, researchers return $k$ documents.
2. We reach a low accuracy (43.22%) because we have a lot of documents, and some are pretty similar. Even so, picking the right document out of ~19k candidates as the single top result for 43.22% of the questions is a good result for this problem.
3. TF-IDF has some limitations: (1) it can only compute similarity between questions and documents that share the same words, so it cannot capture synonyms; and (2) it cannot understand the context of the question or the meaning of the words.
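
Returning $k$ documents instead of one, as suggested in item 1, can be evaluated with a top-$k$ accuracy: a question counts as correct when its true document appears among the $k$ nearest ones. The sketch below is one possible way to do it, assuming scikit-learn's `NearestNeighbors` and its `kneighbors` method; the function `top_k_accuracy` and the toy texts are hypothetical and not part of this notebook's pipeline.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

def top_k_accuracy(doc_texts, questions, true_doc_idx, k=2):
    """Fraction of questions whose true document is among the k nearest documents."""
    tfidf = TfidfVectorizer(stop_words='english')
    doc_vectors = tfidf.fit_transform(doc_texts)
    # brute-force search works with sparse TF-IDF vectors and the cosine metric
    index = NearestNeighbors(metric='cosine', algorithm='brute').fit(doc_vectors)
    # neighbors[i] holds the positions of the k closest documents to question i
    neighbors = index.kneighbors(
        tfidf.transform(questions), n_neighbors=k, return_distance=False
    )
    hits = [true in row for true, row in zip(true_doc_idx, neighbors)]
    return float(np.mean(hits))

# made-up documents and questions, for illustration only
toy_docs = [
    "Lisbon has a metro and a historic tram network popular with tourists.",
    "Porto is known for port wine and the Douro river.",
    "Kathmandu is the capital of Nepal and has an international airport.",
]
toy_questions = [
    "Which city has a tram network?",
    "Where is the international airport?",
    "Is there a tram network on the Douro?",
]
toy_truth = [0, 2, 1]  # position of the document that answers each question

acc_at_1 = top_k_accuracy(toy_docs, toy_questions, toy_truth, k=1)
acc_at_2 = top_k_accuracy(toy_docs, toy_questions, toy_truth, k=2)
print('Top-1 accuracy:', f'{acc_at_1:.4f}')
print('Top-2 accuracy:', f'{acc_at_2:.4f}')
```

The third toy question shares more words with the Lisbon document than with the Porto one, so it is missed at $k=1$ but recovered at $k=2$.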

## Word2Vec / Embedding

"Word2vec is a technique for natural language processing published in 2013. The word2vec algorithm uses a neural network model to learn word associations from a large corpus of text. Once trained, such a model can detect synonymous words or suggest additional words for a partial sentence." [Wikipedia](https://en.wikipedia.org/wiki/Word2vec)

In [16]:
from gensim.parsing.preprocessing import preprocess_string

# create a corpus of tokens
corpus = documents['context'].tolist()
corpus = [preprocess_string(t) for t in corpus]

In [17]:
from gensim.models import Word2Vec
import gensim.downloader

# you can download a pretrained Word2Vec
# - or you can train your own model

# download a model
# 'glove-wiki-gigaword-300' (376.1 MB)
# 'word2vec-ruscorpora-300' (198.8 MB)
# 'word2vec-google-news-300' (1.6 GB)
# vectorizer = gensim.downloader.load('word2vec-ruscorpora-300')

# train your own model
vectorizer = Word2Vec(sentences=corpus, vector_size=300, window=5, min_count=1, workers=4).wv

In [18]:
# similar words to 'tourist'
vectorizer.most_similar('tourist', topn=5)

[('destin', 0.9756662845611572),
 ('visitor', 0.9436802268028259),
 ('melbourn', 0.9377897381782532),
 ('miami', 0.9367578625679016),
 ('detroit', 0.9308216571807861)]

In [19]:
def transform_text2(vectorizer, text, verbose=False):
    '''
    Transform the text into a vector [Word2Vec] by averaging its word vectors
    vectorizer: gensim KeyedVectors
    text: str
    verbose: if True, only print the known tokens (nothing is returned)
    '''
    tokens = preprocess_string(text)
    words = [vectorizer[w] for w in tokens if w in vectorizer]
    if verbose:
        print('Text:', text)
        print('Vector:', [w for w in tokens if w in vectorizer])
    elif len(words):
        return np.mean(words, axis=0)
    else:
        # no known token: fall back to a zero vector of the embedding size
        return np.zeros(vectorizer.vector_size, dtype=np.float32)

In [20]:
# just testing our Word2Vec
transform_text2(vectorizer, question, verbose=True)

Text: What are the tourist hotspots in Portugal?
Vector: ['tourist', 'hotspot', 'portug']


In [21]:
# let's train the model to retrieve the document id 'c_id'
retriever = KNeighborsClassifier(**retriever_configs)

# vectorize the documents, fit the retriever
document_vectors = documents['context'].apply(lambda x: transform_text2(vectorizer, x)).tolist()
retriever.fit(document_vectors, documents['c_id'])

KNeighborsClassifier(metric='cosine', n_neighbors=1)

### Evaluation

In [22]:
%%time
# vectorize the questions
data_vectors = data['question'].apply(lambda x: transform_text2(vectorizer, x)).tolist()

# predict one document for each question
y_test = data['c_id']
y_pred = retriever.predict(data_vectors)

CPU times: user 53.4 s, sys: 10.4 s, total: 1min 3s
Wall time: 34.7 s


In [23]:
acc = accuracy_score(y_test, y_pred)
print('Accuracy:', f'{acc:.4f}')
print('Quantity:', int(acc*len(y_pred)), 'from', len(y_pred))

Accuracy: 0.0313
Quantity: 2741 from 87599


### Discussion

1. We reach a really low accuracy (3.13%). I also tested the pretrained model `'word2vec-google-news-300'` and reached 20.98% accuracy. Thus, TF-IDF was at least twice as accurate as Word2Vec.
2. Problem: I simply average the word vectors to compose the document/question embedding; there are other pooling strategies for working with sentences. Alternatively, we can try more robust embedding techniques, such as BERT, MT5, DPR, etc.
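
As an illustration of the pooling strategies mentioned in item 2, the sketch below compares mean, max, and weighted-mean pooling. Everything in it is an assumption made for the example: the function `pool`, the 3-dimensional `toy_vectors`, and the `toy_weights` (which could be, for instance, IDF values) are hypothetical and were not evaluated on SQuAD.

```python
import numpy as np

# hypothetical 3-dimensional word vectors, for illustration only
toy_vectors = {
    'tourist': np.array([0.9, 0.1, 0.0]),
    'hotspot': np.array([0.7, 0.3, 0.1]),
    'portug':  np.array([0.1, 0.9, 0.2]),
}
# hypothetical per-token weights (e.g. IDF values); rarer tokens weigh more
toy_weights = {'tourist': 1.0, 'hotspot': 3.0, 'portug': 2.0}

def pool(tokens, vectors, strategy='mean', weights=None):
    """Combine the word vectors of the known tokens into one fixed-size vector."""
    known = [t for t in tokens if t in vectors]
    if not known:
        # no known token: return a zero vector of the embedding size
        dim = len(next(iter(vectors.values())))
        return np.zeros(dim, dtype=np.float32)
    stacked = np.stack([vectors[t] for t in known])
    if strategy == 'mean':
        return stacked.mean(axis=0)
    if strategy == 'max':
        return stacked.max(axis=0)
    if strategy == 'weighted':
        w = np.array([weights.get(t, 1.0) for t in known])
        # weighted average: each word vector is scaled by its token weight
        return (stacked * w[:, None]).sum(axis=0) / w.sum()
    raise ValueError(f'unknown strategy: {strategy}')

toy_tokens = ['tourist', 'hotspot', 'portug']
print('mean:    ', pool(toy_tokens, toy_vectors, 'mean'))
print('max:     ', pool(toy_tokens, toy_vectors, 'max'))
print('weighted:', pool(toy_tokens, toy_vectors, 'weighted', toy_weights))
```

Whether any of these beats the plain average on this dataset would have to be measured with the same evaluation as above.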

<a href='#top'><span class="label label-info" style="font-size: 125%">Back to Top</span></a>

<a id="discussion"></a>

---
# Conclusion

1. As mentioned, this problem is really complex due to the number of documents.
2. TF-IDF reached a good accuracy (43.22%) for this dataset, and it can increase if we return more documents.
3. There are also other algorithms for Document Retrieval, such as [BM25](https://pypi.org/project/rank-bm25/) and [DPR](https://aclanthology.org/2020.emnlp-main.550/).
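
To give an idea of how BM25 ranks documents, here is a minimal sketch in pure Python. It is not the implementation of the `rank-bm25` package linked above: the function `bm25_scores` is hypothetical, it uses an IDF variant with `1 +` inside the logarithm so that weights stay non-negative, and `k1=1.5` and `b=0.75` are commonly used values taken here as assumptions. The toy documents are made up.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, tokenized_docs, k1=1.5, b=0.75):
    """Score every tokenized document against the query with a BM25 formula."""
    n_docs = len(tokenized_docs)
    avg_len = sum(len(d) for d in tokenized_docs) / n_docs
    # document frequency: number of documents containing each term
    df = Counter(term for doc in tokenized_docs for term in set(doc))
    scores = []
    for doc in tokenized_docs:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            # the "1 +" inside the log keeps the IDF non-negative
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # term frequency saturates with k1; b controls length normalization
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# made-up, already tokenized documents
toy_tokenized_docs = [
    "lisbon tram metro portugal".split(),
    "porto wine douro portugal".split(),
    "kathmandu airport nepal capital".split(),
]
toy_query_tokens = ['tram', 'portugal']

scores = bm25_scores(toy_query_tokens, toy_tokenized_docs)
best = max(range(len(scores)), key=scores.__getitem__)
print('Scores:', [round(s, 3) for s in scores])
print('Best document:', best)
```

Like TF-IDF, this still matches exact words only, so it shares the synonym limitation discussed earlier.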

[Back to Top](#top)

## References

1. [Intro to Automated Question Answering](https://qa.fastforwardlabs.com/methods/background/2020/04/28/Intro-to-QA.html)
2. [Building a QA System with BERT on Wikipedia](https://qa.fastforwardlabs.com/pytorch/hugging%20face/wikipedia/bert/transformers/2020/05/19/Getting_Started_with_QA.html)
3. [Evaluating QA: Metrics, Predictions, and the Null Response](https://qa.fastforwardlabs.com/no%20answer/null%20threshold/bert/distilbert/exact%20match/f1/robust%20predictions/2020/06/09/Evaluating_BERT_on_SQuAD.html)
4. [Dense Passage Retrieval for Open-Domain Question Answering](https://aclanthology.org/2020.emnlp-main.550/)