## "FAQ-Style QA": Utilizing existing FAQs for Question Answering

While *extractive Question Answering* works on plain text and is therefore more generalizable, a common alternative is to reuse existing FAQ data.

Pros:
- Very fast at inference time
- Reuses existing FAQ data
- Good control over the answers that are returned

Cons:
- Generalizability: we can only answer questions that are similar to the ones already in the FAQ

In some use cases, a combination of extractive QA and FAQ-style can also be an interesting option.
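
One way such a combination could work is to try the FAQ match first and fall back to extractive QA only when no stored question is similar enough. The sketch below is purely illustrative: `faq_lookup`, `extractive_qa`, the toy stand-ins, and the `0.8` threshold are hypothetical placeholders, not Haystack APIs.

```python
from typing import Callable, Tuple


def answer_with_fallback(
    question: str,
    faq_lookup: Callable[[str], Tuple[str, float]],
    extractive_qa: Callable[[str], str],
    min_score: float = 0.8,  # hypothetical similarity threshold
) -> Tuple[str, str]:
    """Return (answer, source), preferring a confident FAQ match."""
    faq_answer, score = faq_lookup(question)
    if score >= min_score:
        return faq_answer, "faq"
    return extractive_qa(question), "extractive"


# Toy stand-ins so the sketch runs without any model or database
def toy_faq_lookup(question: str) -> Tuple[str, float]:
    known = {"how do i reset my password?": "Use the 'Forgot password' link."}
    key = question.strip().lower()
    return (known[key], 1.0) if key in known else ("", 0.0)


def toy_extractive_qa(question: str) -> str:
    return "(answer extracted from documents)"


print(answer_with_fallback("How do I reset my password?", toy_faq_lookup, toy_extractive_qa))
print(answer_with_fallback("What is the refund policy?", toy_faq_lookup, toy_extractive_qa))
```

The threshold trades control for coverage: a higher value sends more questions to the slower extractive path, a lower value returns more FAQ answers that may only loosely match.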

*Use this [link](https://colab.research.google.com/github/deepset-ai/haystack/blob/master/tutorials/Tutorial4_Tutorial4_FAQ_style_QA.ipynb) to open the notebook in Google Colab.*


In [1]:
! pip install -q kaggle

In [None]:
from google.colab import files

contents = files.upload()

In [1]:
! mkdir ~/.kaggle
! cp kaggle.json ~/.kaggle/
! chmod 600 ~/.kaggle/kaggle.json

In [2]:
! kaggle datasets download 'Cornell-University/arxiv'
! unzip -qq arxiv.zip

Downloading arxiv.zip to /content/blog_nbs/scratch_nbs/google_scholar


In [2]:
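import json
from itertools import islice

# The snapshot is a large JSON-lines file (one record per line).
# Only the first 100 records are used below, so stop reading after 100 lines.
data = []
with open("arxiv-metadata-oai-snapshot.json", "r") as f:
    for line in islice(f, 100):
        data.append(json.loads(line))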

In [3]:
import pandas as pd
data = pd.DataFrame(data[:100])

In [6]:
# Install the latest release of Haystack in your own environment 
#! pip install farm-haystack

# Install the latest master of Haystack and install the version of torch that works with the colab GPUs
!pip install git+https://github.com/deepset-ai/haystack.git
!pip install torch==1.5.1+cu101 torchvision==0.6.1+cu101 -f https://download.pytorch.org/whl/torch_stable.html


In [4]:
from haystack import Finder
from haystack.database.elasticsearch import ElasticsearchDocumentStore

from haystack.retriever.dense import EmbeddingRetriever
from haystack.utils import print_answers
import pandas as pd
import requests


### Start an Elasticsearch server
You can start Elasticsearch on your local machine using Docker. If Docker is not readily available in your environment (e.g., in Colab notebooks), you can manually download the Elasticsearch release archive and run it directly.

In [None]:
# Recommended: Start Elasticsearch using Docker
# ! docker run -d -p 9200:9200 -e "discovery.type=single-node" elasticsearch:7.6.2

In [5]:
# In Colab / No Docker environments: Start Elasticsearch from source
# (uncomment the next three lines on the first run to download and unpack it)
# ! wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.6.2-linux-x86_64.tar.gz -q
# ! tar -xzf elasticsearch-7.6.2-linux-x86_64.tar.gz
# ! chown -R daemon:daemon elasticsearch-7.6.2

import os
from subprocess import Popen, PIPE, STDOUT
es_server = Popen(['elasticsearch-7.6.2/bin/elasticsearch'],
                   stdout=PIPE, stderr=STDOUT,
                   preexec_fn=lambda: os.setuid(1)  # as daemon
                  )
# wait until ES has started
! sleep 30
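
A fixed 30-second sleep can be too short on a slow machine and wasteful on a fast one. As an alternative, a minimal sketch that polls the HTTP port until the server responds is shown below. The helper name `wait_for_http` and its parameters are assumptions made for this illustration, and the URL assumes the default port `9200` used in this notebook.

```python
import http.client
import time
import urllib.request


def wait_for_http(url: str, timeout: float = 60.0, interval: float = 1.0) -> bool:
    """Poll `url` until it answers with HTTP 200 or `timeout` seconds have passed."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=interval) as response:
                if response.status == 200:
                    return True
        except (OSError, http.client.HTTPException):
            pass  # server not reachable yet (URLError is a subclass of OSError)
        time.sleep(interval)
    return False


# Example (assumes Elasticsearch listens on localhost:9200):
# if not wait_for_http("http://localhost:9200", timeout=60):
#     raise RuntimeError("Elasticsearch did not start in time")
```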


### Init the DocumentStore
In contrast to Tutorial 1 (extractive QA), we:

* rely on the `text_field` in Elasticsearch for the text that is returned as an answer; the code below does not override it, so the answers (here, the abstracts) are stored under `text`
* specify the name of the `embedding_field` in Elasticsearch that holds the embedding of each stored question; it is used later to compute the similarity to the incoming user question
* set `excluded_meta_data=["question_emb"]` so that the large embedding vectors are not returned in the search results

In [6]:
from haystack.database.elasticsearch import ElasticsearchDocumentStore
document_store = ElasticsearchDocumentStore(host="localhost", username="", password="",
                                            index="document",
                                            embedding_field="question_emb",
                                            embedding_dim=768,
                                            excluded_meta_data=["question_emb"])

09/16/2020 02:24:15 - INFO - elasticsearch -   HEAD http://localhost:9200/document [status:200 request:0.161s]
09/16/2020 02:24:15 - INFO - elasticsearch -   HEAD http://localhost:9200/label [status:200 request:0.007s]


### Create a Retriever using embeddings
Instead of retrieving via Elasticsearch's plain BM25, we want to use vector similarity of the questions (user question vs. FAQ ones).
We can use the `EmbeddingRetriever` for this purpose and specify a model that we use for the embeddings.
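
To make "vector similarity" concrete, here is a minimal sketch that ranks stored question vectors against a query vector. Cosine similarity is used as a common choice; the exact metric applied by the retriever and Elasticsearch is not stated here and may differ. The function names and the toy 3-dimensional vectors are made up for illustration (the store above is configured for 768 dimensions).

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_similarity(
    query_vec: Sequence[float],
    faq_vecs: Sequence[Sequence[float]],
    top_k: int = 3,
) -> List[Tuple[int, float]]:
    """Return (index, score) pairs of the top_k most similar stored vectors."""
    scores = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(faq_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top_k]


# Toy "embeddings" of three stored questions and one user question
faq_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.0]]
query_vec = [0.9, 0.1, 0.0]
print(rank_by_similarity(query_vec, faq_vecs, top_k=2))
```

The first stored vector points in almost the same direction as the query and ranks first; the mixed vector comes second.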

In [7]:
retriever = EmbeddingRetriever(document_store=document_store, embedding_model="deepset/sentence_bert", use_gpu=False)

09/16/2020 02:24:19 - INFO - haystack.retriever.dense -   Init retriever using embeddings of model deepset/sentence_bert
09/16/2020 02:24:19 - INFO - farm.utils -   device: cpu n_gpu: 0, distributed training: False, automatic mixed precision training: None
09/16/2020 02:24:19 - INFO - farm.infer -   Could not find `deepset/sentence_bert` locally. Try to download from model hub ...
	 We guess it's an *ENGLISH* model ... 
	 If not: Init the language model by supplying the 'language' param.
09/16/2020 02:24:31 - INFO - farm.utils -   device: cpu n_gpu: 0, distributed training: False, automatic mixed precision training: None


### Prepare & Index FAQ data
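We create a pandas DataFrame containing some FAQ data (i.e., curated question-answer pairs) and index it in Elasticsearch.
Here, instead of a curated FAQ, we use the arXiv metadata loaded above: each paper's title serves as the "question" and its abstract as the "answer". The COVID-19 FAQ download from the original tutorial is kept, commented out, further below.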
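
The records that end up in the DocumentStore are plain dicts holding a question, its answer text, and the question's embedding. The sketch below shows that shape without any model or database. The keys `question`, `text`, and `question_emb` follow the field names used in this notebook, while `build_faq_docs` and `fake_embed` are hypothetical helpers invented for the illustration.

```python
from typing import Callable, Dict, List, Sequence


def build_faq_docs(
    pairs: Sequence[Dict[str, str]],
    embed: Callable[[List[str]], List[List[float]]],
) -> List[dict]:
    """Attach an embedding of each question to its question/answer record."""
    questions = [p["question"].strip() for p in pairs]
    embeddings = embed(questions)
    return [
        {"question": q, "text": p["answer"], "question_emb": list(e)}
        for q, p, e in zip(questions, pairs, embeddings)
    ]


# Hypothetical stand-in for a real embedding model: [character count, word count]
def fake_embed(texts: List[str]) -> List[List[float]]:
    return [[float(len(t)), float(len(t.split()))] for t in texts]


pairs = [
    {"question": " What is FAQ-style QA? ", "answer": "Matching a query to stored questions."},
    {"question": "Is it fast?", "answer": "Yes, at inference time."},
]
docs = build_faq_docs(pairs, fake_embed)
print(docs[0])
```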

In [8]:
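# Now, let's write the dicts containing documents to our DB.

# Use each paper's title as the "question" and its abstract as the answer text
data['title'] = data['title'].apply(lambda x: x.strip())
data = data.rename(columns={'title': 'question', 'abstract': 'text'})
# Embed the questions (titles) so they can be compared with incoming queries
data['question_emb'] = retriever.embed_queries(texts=list(data['question'].values))

# Convert the DataFrame to a list of dicts and index them in the DocumentStore
docs_to_index = data.to_dict(orient='records')
document_store.write_documents(docs_to_index)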
# document_store.write_documents(data[['title', 'abstract']].rename(columns={'title':'name','abstract':'text'}).to_dict(orient='records'))

Inferencing Samples: 100%|██████████| 25/25 [03:17<00:00,  7.91s/ Batches]
09/16/2020 02:27:51 - INFO - elasticsearch -   POST http://localhost:9200/_bulk?refresh=wait_for [status:200 request:1.978s]


In [None]:
# # Download
# temp = requests.get("https://raw.githubusercontent.com/deepset-ai/COVID-QA/master/data/faqs/faq_covidbert.csv")
# open('small_faq_covid.csv', 'wb').write(temp.content)

# # Get dataframe with columns "question", "answer" and some custom metadata
# df = pd.read_csv("small_faq_covid.csv")
# # Minimal cleaning
# df.fillna(value="", inplace=True)
# df["question"] = df["question"].apply(lambda x: x.strip())
# print(df.head())

# # Get embeddings for our questions from the FAQs
# questions = list(df["question"].values)
# df["question_emb"] = retriever.embed_queries(texts=questions)
# df["question_emb"] = df["question_emb"].apply(list) # convert from numpy to list for ES indexing
# df = df.rename(columns={"answer": "text"})

# # Convert Dataframe to list of dicts and index them in our DocumentStore
# docs_to_index = df.to_dict(orient="records")
# document_store.write_documents(docs_to_index)

### Ask questions
Initialize a Finder (this time without a reader) and ask questions.

In [9]:
finder = Finder(reader=None, retriever=retriever)
prediction = finder.get_answers_via_similar_questions(question="Language models for chemistry", top_k_retriever=10)
print_answers(prediction, details="all")

 '$4500\\kms$ is easily obtained in the '
                                  'relativistic MONDian lensing model of\n'
                                  'Angus et al. (2007). However, MONDian model '
                                  'with little hot dark matter\n'
                                  '$M_{HDM} \\le 0.6\\times 10^{15}\\msun$ and '
                                  'CDM model with a small halo mass $\\le\n'
                                  '1\\times 10^{15}\\msun$ are barely '
                                  'consistent with lensing and velocity '
                                  'data.\n',
                       'document_id': '0704.0094',
                       'meta': {   'authors': 'HongSheng Zhao (SUPA, St '
                                              'Andrews)',
                                   'authors_parsed': [   [   'Zhao',
                                                             'HongSheng',
                                                            