<a id='top'></a>
# Automated Question Answering System

We are going to work on document retrieval over the [Stanford Question Answering Dataset](https://www.kaggle.com/datasets/stanfordu/stanford-question-answering-dataset) (SQuAD). Let's see how different algorithms handle the problem.

For reference, see the test scores reported for [SQuAD2.0](https://rajpurkar.github.io/SQuAD-explorer/).

![image](https://qa.fastforwardlabs.com/images/copied_from_nb/my_icons/QAworkflow.png)
_Image from [Intro to Automated Question Answering](https://qa.fastforwardlabs.com/methods/background/2020/04/28/Intro-to-QA.html)._

In [None]:
%%capture
!pip install wikipedia==1.4.0
!pip install scikit-learn==1.0.2
!pip install gensim==4.0.1

## What do we want to do?

We want to build a document retriever: a search tool like the one behind Wikipedia. Let's explore the `wikipedia` library to see a retriever in action.

In [None]:
import wikipedia as wiki

k = 5
question = "What are the tourist hotspots in Portugal?"

results = wiki.search(question, results=k)
print('Question:', question)
print('Pages:  ', results)

### Discussion

For this question, Wikipedia's search returned the 5 pages it considers most likely to contain the answer.

<a id="data"></a>

---
# Data Exploration

In this section, we load a `json` file into a `pandas.DataFrame` and then build our list of documents.

[Back to Top](#top)
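
To make the target shape concrete, here is a minimal sketch of the flattening step in plain Python, run on a tiny made-up SQuAD-style dictionary. The sample record and the helper name `flatten_squad` are illustrative assumptions; the notebook itself relies on `pandas.json_normalize` for this.

```python
def flatten_squad(squad):
    """Flatten a SQuAD-style dict into one row per question."""
    rows, context_ids = [], {}
    for article in squad["data"]:
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            # give each distinct context an integer id, in order of first appearance
            c_id = context_ids.setdefault(context, len(context_ids))
            for qa in paragraph["qas"]:
                rows.append({
                    "id": qa["id"],
                    "question": qa["question"],
                    "context": context,
                    "answers": qa["answers"],
                    "c_id": c_id,
                })
    return rows


# hypothetical sample that mimics the nesting data -> paragraphs -> qas -> answers
sample = {
    "data": [{
        "title": "Example",
        "paragraphs": [
            {
                "context": "Lisbon is the capital of Portugal.",
                "qas": [
                    {"id": "q1", "question": "What is the capital of Portugal?",
                     "answers": [{"text": "Lisbon", "answer_start": 0}]},
                    {"id": "q2", "question": "Which country has Lisbon as its capital?",
                     "answers": [{"text": "Portugal", "answer_start": 25}]},
                ],
            },
            {
                "context": "Porto lies on the Douro river.",
                "qas": [
                    {"id": "q3", "question": "Which river does Porto lie on?",
                     "answers": [{"text": "Douro", "answer_start": 18}]},
                ],
            },
        ],
    }]
}

rows = flatten_squad(sample)
for row in rows:
    print(row["id"], row["c_id"], row["question"])
```

Each question becomes one row, and questions that share a paragraph share a `c_id`, which is the label the retriever is later evaluated against.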

In [None]:
import json
import numpy as np
import pandas as pd

In [None]:
import os

# list the available data
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
# based on: https://www.kaggle.com/code/sanjay11100/squad-stanford-q-a-json-to-pandas-dataframe
def squad_json_to_dataframe(file_path, record_path=['data','paragraphs','qas','answers']):
    """
    file_path: path to the SQuAD json file.
    record_path: path to the deepest level in the json file; the default is
    ['data','paragraphs','qas','answers'].
    """
    # use a context manager so the file handle is always closed
    with open(file_path) as f:
        file = json.load(f)
    # parse the different levels of the json file
    js = pd.json_normalize(file, record_path)
    m = pd.json_normalize(file, record_path[:-1])
    r = pd.json_normalize(file,record_path[:-2])
    # combining it into single dataframe
    idx = np.repeat(r['context'].values, r.qas.str.len())
    m['context'] = idx
    data = m[['id','question','context','answers']].set_index('id').reset_index()
    data['c_id'] = data['context'].factorize()[0]
    return data

In [None]:
# loading the data
file_path = '/kaggle/input/stanford-question-answering-dataset/train-v1.1.json'
data = squad_json_to_dataframe(file_path)
data

In [None]:
# how many documents do we have?
data['c_id'].unique().size

## Get the Unique Documents

Let's select the unique documents in our `data`. This will be the list of documents to search for the answers.

In [None]:
documents = data[['context', 'c_id']].drop_duplicates().reset_index(drop=True)
documents

<a id="document"></a>

---
# Document Retrieval

In this section, we explore techniques for retrieving documents. First, we create a document `vectorizer`, which encodes both the documents and the questions as vectors. Then we search by comparing a question vector with the document vectors. In the end, the algorithm returns the $k$ document vectors most similar to the question vector.
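
The encode, compare, and rank steps can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions: the encoder is a raw bag-of-words count (not the TF-IDF or Word2Vec encoders used in this notebook), and the names `encode`, `cosine_similarity`, `retrieve`, and the toy documents are hypothetical.

```python
import math
from collections import Counter


def encode(text):
    """Toy encoder: bag-of-words counts over lowercased whitespace tokens."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question, docs, k=2):
    """Return the indices of the k documents most similar to the question."""
    q = encode(question)
    scored = [(cosine_similarity(q, encode(d)), i) for i, d in enumerate(docs)]
    # highest similarity first; ties broken by document index
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [i for _, i in scored[:k]]


toy_docs = [
    "lisbon is the capital of portugal",
    "the eiffel tower stands in paris",
    "porto is a coastal city in portugal",
]
top_k = retrieve("what is the capital of portugal", toy_docs, k=2)
print(top_k)  # → [0, 2]
```

The retriever never returns an answer, only the indices of candidate documents; a reader model would then look for the answer inside them.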


## TF-IDF

"In information retrieval, TF-IDF, short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. It is often used as a weighting factor in searches of information retrieval, text mining, and user modeling." 

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

# defining the TF-IDF
tfidf_configs = {
    'lowercase': True,
    'analyzer': 'word',
    'stop_words': 'english',
    'binary': True,
    'max_df': 0.9,
    'max_features': 10_000
}
# defining the number of documents to retrieve
retriever_configs = {
    'n_neighbors': 10,
    'metric': 'cosine'
}

# defining our pipeline
embedding = TfidfVectorizer(**tfidf_configs)
retriever = NearestNeighbors(**retriever_configs)

In [None]:
# fit the retriever on the document vectors; NearestNeighbors ignores the
# second argument and returns row positions, which coincide with 'c_id' here
X = embedding.fit_transform(documents['context'])
retriever.fit(X, documents['c_id'])

Let's test the vectorizer: which terms does our model use to build the vector?

In [None]:
def transform_text(vectorizer, text):
    '''
    Print the text and the vector[TF-IDF]
    vectorizer: sklearn.vectorizer
    text: str
    '''
    print('Text:', text)
    vector = vectorizer.transform([text])
    vector = vectorizer.inverse_transform(vector)
    print('Vect:', vector)

In [None]:
# vectorize the question
transform_text(embedding, question)

What is the most similar document to this question?

In [None]:
# predict the most similar document
X = embedding.transform([question])
c_id = retriever.kneighbors(X, return_distance=False)[0][0]
selected = documents.iloc[c_id]['context']

# vectorize the document
transform_text(embedding, selected)

### Evaluation
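
The evaluation relies on top-$k$ accuracy: a question counts as correct when its source document appears among the $k$ retrieved ones. The sketch below makes $k$ an explicit parameter so that the same ranked predictions can be scored at several cut-offs. The function name `top_k_accuracy` and the toy predictions are assumptions for illustration, and the slicing assumes each row of predictions is ordered from most to least similar.

```python
def top_k_accuracy(y_true, y_pred, k):
    """Fraction of questions whose true document is among the first k predictions."""
    if len(y_true) == 0:
        return 0.0
    hits = sum(1 for truth, ranked in zip(y_true, y_pred) if truth in ranked[:k])
    return hits / len(y_true)


# hypothetical ranked predictions for four questions (best match first)
toy_true = [3, 0, 7, 5]
toy_pred = [
    [3, 1, 2],  # correct at rank 1
    [4, 0, 9],  # correct at rank 2
    [1, 2, 7],  # correct at rank 3
    [8, 6, 4],  # never correct
]

for k in (1, 2, 3):
    print(f"top-{k}: {top_k_accuracy(toy_true, toy_pred, k):.2f}")
```

Accuracy can only stay equal or grow as $k$ increases, which is why top-1 and top-10 scores should always be reported together with their $k$.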

In [None]:
%%time
# retrieve the top-k documents (k = n_neighbors) for each question
X = embedding.transform(data['question'])
y_test = data['c_id']
y_pred = retriever.kneighbors(X, return_distance=False)

In [None]:
# top documents predicted for each question
y_pred

In [None]:
def top_accuracy(y_true, y_pred) -> float:
    right, count = 0, 0
    for i, y_t in enumerate(y_true):
        count += 1
        if y_t in y_pred[i]:
            right += 1
    return right / count if count > 0 else 0

In [None]:
acc = top_accuracy(y_test, y_pred)
print('Accuracy:', f'{acc:.4f}')
print('Quantity:', int(acc*len(y_pred)), 'from', len(y_pred))

### Discussion

1. This is a difficult problem because we have many documents (in this notebook, ~19k) and the answer can be in one or more of them. For that reason, a retriever usually returns $k$ documents; returning only one would be neither complete nor fair.
2. We reach a high top-10 accuracy (71.48%) but a lower top-1 accuracy (43.22%), because we have a lot of documents and some are very similar. Both are actually very good results for this problem.
3. TF-IDF has some limitations: (1) it can only compute similarity between questions and documents that share words, so it cannot capture synonyms; and (2) it cannot understand the context of the question or the meaning of the words.
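
The synonym limitation in point 3 can be reproduced with a hand-rolled TF-IDF. This is a sketch of the textbook formulation (term frequency times $\log(N/\mathrm{df})$), not the exact weighting scikit-learn applies, which by default smooths the IDF and L2-normalizes the vectors. The helper names and the toy texts are hypothetical.

```python
import math
from collections import Counter


def fit_idf(docs):
    """Learn log(N / df) for every term seen in the documents."""
    tokenized = [d.lower().split() for d in docs]
    df = Counter(term for tokens in tokenized for term in set(tokens))
    return {term: math.log(len(docs) / n) for term, n in df.items()}


def vectorize(text, idf):
    """Textbook TF-IDF vector; terms outside the learned vocabulary are dropped."""
    counts = Counter(t for t in text.lower().split() if t in idf)
    total = sum(counts.values())
    return {t: (c / total) * idf[t] for t, c in counts.items()}


def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


toy_docs = [
    "cheap hotels near the beach",           # shares words with the question
    "inexpensive accommodation by the sea",  # same meaning, different words
    "the stock market opened higher",        # unrelated
]
idf = fit_idf(toy_docs)
q = vectorize("cheap hotels", idf)
sims = [cosine(q, vectorize(d, idf)) for d in toy_docs]
for doc, sim in zip(toy_docs, sims):
    print(f"{sim:.3f}  {doc}")
```

The second document is a paraphrase of the first, yet it scores exactly zero because it shares no informative term with the question; "the" appears in every document and therefore gets an IDF of zero.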

## Word2Vec / Embedding

"Word2vec is a technique for natural language processing published in 2013. The word2vec algorithm uses a neural network model to learn word associations from a large corpus of text. Once trained, such a model can detect synonymous words or suggest additional words for a partial sentence."

In [None]:
from gensim.parsing.preprocessing import preprocess_string

# create a corpus of tokens
corpus = documents['context'].tolist()
corpus = [preprocess_string(t) for t in corpus]

In [None]:
from gensim.models import Word2Vec
import gensim.downloader

# you can download a pretrained Word2Vec
# - or you can train your own model

# download a model
# 'glove-wiki-gigaword-300' (376.1 MB)
# 'word2vec-ruscorpora-300' (198.8 MB, trained on a Russian corpus)
# 'word2vec-google-news-300' (1.6 GB)
# vectorizer = gensim.downloader.load('glove-wiki-gigaword-300')

# train your own model
vectorizer = Word2Vec(sentences=corpus, vector_size=300, window=5, min_count=1, workers=4).wv

In [None]:
# similar words to 'tourist'
vectorizer.most_similar('tourist', topn=5)

In [None]:
def transform_text2(vectorizer, text, verbose=False):
    '''
    Transform the text into a vector[Word2Vec]
    vectorizer: gensim KeyedVectors
    text: str
    Returns the mean of the known word vectors (zeros if no token is known).
    '''
    tokens = preprocess_string(text)
    words = [vectorizer[w] for w in tokens if w in vectorizer]
    if verbose:
        print('Text:', text)
        print('Vector:', [w for w in tokens if w in vectorizer])
    # always return a vector, even in verbose mode
    if len(words):
        return np.mean(words, axis=0)
    return np.zeros(vectorizer.vector_size, dtype=np.float32)

In [None]:
# just testing our Word2Vec
transform_text2(vectorizer, question, verbose=True)

In [None]:
# let's train the model to retrieve the document id 'c_id'
retriever2 = NearestNeighbors(**retriever_configs)

# vectorize the documents, fit the retriever
X = documents['context'].apply(lambda x: transform_text2(vectorizer, x)).tolist()
retriever2.fit(X, documents['c_id'])

### Evaluation

In [None]:
%%time
# vectorize the questions
X = data['question'].apply(lambda x: transform_text2(vectorizer, x)).tolist()

# retrieve the top-k documents (k = n_neighbors) for each question
y_test = data['c_id']
y_pred = retriever2.kneighbors(X, return_distance=False)

In [None]:
# top documents predicted for each question
y_pred

In [None]:
acc = top_accuracy(y_test, y_pred)
print('Accuracy:', f'{acc:.4f}')
print('Quantity:', int(acc*len(y_pred)), 'from', len(y_pred))

### Discussion

1. We did not reach a good top-10 accuracy (12.15%), and the top-1 accuracy was really low (3.07%). TF-IDF was clearly better.
2. Maybe the `vectorizer` did not receive enough data to train on. I suggest trying pretrained models, such as `'word2vec-google-news-300'`.
3. Another problem: I simply average the word vectors to build the document/question embedding; there are other pooling strategies for sentences. Alternatively, we can try more robust embedding techniques, such as BERT, MT5, DPR, etc.
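
To illustrate point 3, here is a small sketch of three pooling strategies over word vectors: the plain mean used in this notebook, element-wise max pooling, and a weighted mean (for example with IDF-like weights). The two-dimensional toy vectors, the weights, and the function names are made up for the example; real embeddings here have 300 dimensions.

```python
def mean_pool(vectors):
    """Average each dimension over all word vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def max_pool(vectors):
    """Keep the largest value seen in each dimension."""
    return [max(column) for column in zip(*vectors)]


def weighted_mean_pool(vectors, weights):
    """Average each dimension, letting some words count more than others."""
    total = sum(weights)
    return [
        sum(w * value for w, value in zip(weights, column)) / total
        for column in zip(*vectors)
    ]


# hypothetical vectors for the tokens "the", "tourist", "hotspots"
toy_vectors = [[0.0, 2.0], [4.0, 0.0], [2.0, 4.0]]
# hypothetical weights that silence the uninformative token "the"
toy_weights = [0.0, 1.0, 1.0]

print("mean:    ", mean_pool(toy_vectors))                         # → [2.0, 2.0]
print("max:     ", max_pool(toy_vectors))                          # → [4.0, 4.0]
print("weighted:", weighted_mean_pool(toy_vectors, toy_weights))   # → [3.0, 2.0]
```

With a plain mean every token pulls the sentence vector equally, so frequent but uninformative words dilute the signal; weighting is one inexpensive way to counter that before moving to sentence-level models.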

<a id="discussion"></a>

---
# Conclusion

1. As mentioned, this problem is really complex due to the number of documents.
2. TF-IDF reached a great top-10 accuracy (71.48%) on this dataset, and the accuracy can increase further if we return more documents.
3. There are also other algorithms for document retrieval, such as [BM25](https://pypi.org/project/rank-bm25/) and [DPR](https://aclanthology.org/2020.emnlp-main.550/).
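
As a pointer for the BM25 alternative in point 3, the sketch below scores documents with the commonly cited Okapi BM25 formula. It is a from-scratch illustration, not the implementation in the `rank-bm25` package; the function name `bm25_scores`, the parameter values `k1=1.5` and `b=0.75`, the `+ 1` inside the logarithm (a variant that keeps the IDF positive), and the toy corpus are all assumptions.

```python
import math
from collections import Counter


def bm25_scores(query, corpus, k1=1.5, b=0.75):
    """Score every document in the corpus against the query with Okapi BM25."""
    tokenized = [doc.lower().split() for doc in corpus]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # document frequency: number of documents containing each term
    df = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in counts:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            tf = counts[term]
            # term frequency saturates with k1 and is normalized by document length via b
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / (tf + length_norm)
        scores.append(score)
    return scores


toy_corpus = [
    "lisbon is the capital of portugal",
    "porto is a city in portugal",
    "paris is the capital of france",
]
scores = bm25_scores("capital of portugal", toy_corpus)
for doc, score in zip(toy_corpus, scores):
    print(f"{score:.3f}  {doc}")
```

Compared with TF-IDF, BM25 dampens repeated terms and penalizes long documents, but it still matches on exact words, so the synonym limitation discussed earlier remains.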

## References

1. [Intro to Automated Question Answering](https://qa.fastforwardlabs.com/methods/background/2020/04/28/Intro-to-QA.html)
2. [Building a QA System with BERT on Wikipedia](https://qa.fastforwardlabs.com/pytorch/hugging%20face/wikipedia/bert/transformers/2020/05/19/Getting_Started_with_QA.html)
3. [Evaluating QA: Metrics, Predictions, and the Null Response](https://qa.fastforwardlabs.com/no%20answer/null%20threshold/bert/distilbert/exact%20match/f1/robust%20predictions/2020/06/09/Evaluating_BERT_on_SQuAD.html)
4. [Dense Passage Retrieval for Open-Domain Question Answering](https://aclanthology.org/2020.emnlp-main.550/)