<a href="https://colab.research.google.com/github/martindevoto/machine-learning-notebooks-personal/blob/main/Intro_Haystack_pt_4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Utilizing existing FAQs for Question Answering

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack/blob/master/tutorials/Tutorial4_FAQ_style_QA.ipynb)

While *extractive Question Answering* works on plain text and therefore generalizes better, a common alternative is to reuse existing FAQ data.

**Pros**:

- Very fast at inference time
- Reuses existing FAQ data
- Good control over answers, since every answer comes from a curated FAQ entry

**Cons**:

- Generalizability: we can only answer questions that are similar to ones already in the FAQ

In some use cases, a combination of extractive QA and FAQ-style can also be an interesting option.
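One way such a combination could work is to answer from the FAQ only when the best match is close enough, and hand the query to an extractive pipeline otherwise. The sketch below is a toy illustration under stated assumptions: `string_similarity` uses character-level matching from the standard library as a stand-in for embedding similarity, `extractive_fallback` is a hypothetical placeholder for an extractive QA pipeline, the threshold of `0.6` is arbitrary, and the FAQ entries are made up.

```python
from difflib import SequenceMatcher
from typing import Callable, Dict

# Toy FAQ with placeholder answers (illustrative only, not the dataset used below).
FAQ: Dict[str, str] = {
    "What is a novel coronavirus?": "Answer text for the first FAQ entry.",
    "How does the virus spread?": "Answer text for the second FAQ entry.",
}


def string_similarity(a: str, b: str) -> float:
    """Stand-in for embedding similarity: character-level ratio in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def extractive_fallback(query: str) -> str:
    """Hypothetical placeholder for an extractive QA pipeline."""
    return f"[extractive QA would handle: {query!r}]"


def answer(
    query: str,
    faq: Dict[str, str],
    fallback: Callable[[str], str],
    threshold: float = 0.6,  # assumed cut-off; tune on real data
) -> str:
    """Return the curated FAQ answer if the best match is close enough, else fall back."""
    best_question = max(faq, key=lambda q: string_similarity(query, q))
    if string_similarity(query, best_question) >= threshold:
        return faq[best_question]
    return fallback(query)


print(answer("How does the virus spread?", FAQ, extractive_fallback))
print(answer("zzzz qqqq", FAQ, extractive_fallback))
```

The first call matches an FAQ question exactly and returns its curated answer; the second has no close match and is routed to the fallback.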

In [None]:
# Make sure you have a GPU running
!nvidia-smi


In [None]:
# Install the latest release of Haystack in your own environment
#! pip install farm-haystack

# Install the latest master of Haystack
!pip install --upgrade pip
!pip install git+https://github.com/deepset-ai/haystack.git#egg=farm-haystack[colab]


In [None]:
from haystack.document_stores import ElasticsearchDocumentStore

from haystack.nodes import EmbeddingRetriever
import pandas as pd
import requests

### Start an Elasticsearch server
You can start Elasticsearch on your local machine using Docker. If Docker is not readily available in your environment (e.g., in Colab notebooks), you can manually download the Elasticsearch release archive and run it directly.

In [None]:
# Recommended: Start Elasticsearch using Docker via the Haystack utility function
from haystack.utils import launch_es

launch_es()



In [None]:
# In Colab / no-Docker environments: start Elasticsearch from the downloaded release archive
! wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.9.2-linux-x86_64.tar.gz -q
! tar -xzf elasticsearch-7.9.2-linux-x86_64.tar.gz
! chown -R daemon:daemon elasticsearch-7.9.2

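import os
from subprocess import Popen, PIPE, STDOUT

# Elasticsearch refuses to run as root, so start it under uid 1
# (the owner set by the chown above)
es_server = Popen(
    ['elasticsearch-7.9.2/bin/elasticsearch'], stdout=PIPE, stderr=STDOUT,
    preexec_fn=lambda: os.setuid(1)
)

# Wait until Elasticsearch has started
! sleep 30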
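
A fixed 30-second sleep can be too short on a slow machine and wastefully long on a fast one. A possible alternative is to poll the server until it responds. The sketch below is a minimal version under assumptions: `http_ok` and `wait_until_ready` are hypothetical helper names, and the URL assumes Elasticsearch is listening on its default HTTP port 9200 on `localhost`.

```python
import http.client
import time
import urllib.request
from typing import Callable


def http_ok(url: str = "http://localhost:9200", timeout: float = 2.0) -> bool:
    """Return True if a GET request to `url` succeeds with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (OSError, http.client.HTTPException):
        # Connection refused, timeout, or a malformed response: not ready yet.
        return False


def wait_until_ready(
    probe: Callable[[], bool],
    timeout: float = 60.0,
    interval: float = 1.0,
) -> bool:
    """Call `probe` every `interval` seconds until it returns True or `timeout` elapses."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if probe():
            return True
        time.sleep(interval)
    return False
```

Calling `wait_until_ready(http_ok)` in place of the sleep returns `True` as soon as the server answers, or `False` if it has not come up within the timeout.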

### Init the DocumentStore
In contrast to Tutorial 1 (extractive QA), we:

* specify the name of our `text_field` in Elasticsearch that we want to return as an answer
* specify the name of our `embedding_field` in Elasticsearch, where we store the embedding of each FAQ question; this embedding is later used to compute the similarity to the incoming user question
* set `excluded_meta_data=["question_emb"]` so that we don't return the huge embedding vectors in our search results

In [None]:
from haystack.document_stores import ElasticsearchDocumentStore

document_store = ElasticsearchDocumentStore(
    host='localhost',
    username='',
    password='',
    index='document',
    embedding_field='question_emb',
    embedding_dim=384,
    excluded_meta_data=['question_emb']
)
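
To make the roles of `embedding_field`, `embedding_dim`, and `excluded_meta_data` more concrete, here is a plain-Python sketch of the idea. It does not reproduce how Haystack or Elasticsearch implement these settings; `validate_record` and `strip_excluded` are hypothetical helpers, and the record contents are placeholders.

```python
from typing import Any, Dict, Iterable

EMBEDDING_FIELD = "question_emb"  # mirrors embedding_field above
EMBEDDING_DIM = 384               # mirrors embedding_dim above


def validate_record(record: Dict[str, Any]) -> None:
    """Raise ValueError if the stored vector does not have the expected dimension."""
    embedding = record[EMBEDDING_FIELD]
    if len(embedding) != EMBEDDING_DIM:
        raise ValueError(
            f"expected {EMBEDDING_DIM} dimensions, got {len(embedding)}"
        )


def strip_excluded(record: Dict[str, Any], excluded: Iterable[str]) -> Dict[str, Any]:
    """Return a copy of `record` without the excluded keys (e.g. large vectors)."""
    excluded_keys = set(excluded)
    return {key: value for key, value in record.items() if key not in excluded_keys}


# Placeholder record: a question, its answer, and a dummy 384-dimensional vector.
record = {
    "content": "What is a novel coronavirus?",
    "answer": "(answer text)",
    EMBEDDING_FIELD: [0.0] * EMBEDDING_DIM,
}

validate_record(record)
print(strip_excluded(record, [EMBEDDING_FIELD]))
```

The vector stays in the store for similarity search, but the copy handed back to the caller no longer carries 384 floats per result.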

### Create a Retriever using embeddings
Instead of retrieving via Elasticsearch's plain BM25, we want to use the vector similarity between questions (the user's question vs. the FAQ questions).
We can use the `EmbeddingRetriever` for this purpose and specify a model that we use for the embeddings.

In [None]:
retriever = EmbeddingRetriever(
    document_store=document_store, 
    embedding_model='sentence-transformers/all-MiniLM-L6-v2',
    use_gpu=True
)

INFO - haystack.modeling.utils -  Using devices: CUDA:0
INFO - haystack.modeling.utils -  Number of GPUs: 1
INFO - haystack.nodes.retriever.dense -  Init retriever using embeddings of model sentence-transformers/all-MiniLM-L6-v2
INFO - haystack.modeling.utils -  Using devices: CUDA
INFO - haystack.modeling.utils -  Number of GPUs: 1
INFO - haystack.modeling.model.language_model -  LOADING MODEL
INFO - haystack.modeling.model.language_model -  Could not find sentence-transformers/all-MiniLM-L6-v2 locally.
INFO - haystack.modeling.model.language_model -  Looking on Transformers Model Hub (in local cache and online)...


INFO - haystack.modeling.model.language_model -  Loaded sentence-transformers/all-MiniLM-L6-v2

INFO - haystack.modeling.data_handler.processor -  Initialized processor without tasks. Supply `metric` and `label_list` to the constructor for using the default task or add a custom task later via processor.add_task()
INFO - haystack.modeling.logger -  ML Logging is turned off. No parameters, metrics or artifacts will be logged to MLFlow.
INFO - haystack.modeling.utils -  Using devices: CUDA
INFO - haystack.modeling.utils -  Number of GPUs: 1
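
The retrieval idea described in this section, ranking stored FAQ questions by how close their vectors are to the query vector, can be sketched without any library. The example below is an illustration only: it uses cosine similarity, whereas the similarity function actually used by the document store is configurable and may differ, and the three-dimensional vectors are made-up stand-ins for the 384-dimensional model embeddings.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query_vec: Sequence[float],
    indexed: Dict[str, Sequence[float]],
    k: int = 2,
) -> List[Tuple[str, float]]:
    """Return the k stored questions most similar to the query, best first."""
    scored = [
        (question, cosine_similarity(query_vec, vec))
        for question, vec in indexed.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Made-up toy vectors standing in for real question embeddings.
indexed = {
    "How does the virus spread?": [0.9, 0.1, 0.0],
    "What is a novel coronavirus?": [0.1, 0.9, 0.0],
    "What is the source of the virus?": [0.2, 0.3, 0.9],
}
query_vec = [0.8, 0.2, 0.1]  # pretend embedding of the user question

for question, score in top_k(query_vec, indexed, k=2):
    print(f"{score:.3f}  {question}")
```

In FAQ-style QA the answer returned to the user is the curated answer attached to the best-matching question, so no reader model is involved at query time.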


### Prepare & Index FAQ data
We create a pandas DataFrame containing some FAQ data (i.e., curated question-answer pairs) and index it in Elasticsearch.
Here, we download some question-answer pairs related to COVID-19.

In [None]:
# Download the FAQ data and save it locally
response = requests.get("https://raw.githubusercontent.com/deepset-ai/COVID-QA/master/data/faqs/faq_covidbert.csv")
response.raise_for_status()
with open('small_faq_covid.csv', 'wb') as f:
    f.write(response.content)

# Get dataframe with columns "question", "answer" and some custom metadata
df = pd.read_csv('small_faq_covid.csv')
# Minimal cleaning
df.fillna(value='', inplace=True)
df['question'] = df['question'].apply(lambda x: x.strip())
print(df.head())

# Get embeddings for our questions from the FAQs
questions = list(df['question'].values)
df['question_emb'] = retriever.embed_queries(texts=questions)
# Rename "question" to "content" so the question text is indexed as the document text
df = df.rename(columns={'question': 'content'})

# Convert Dataframe to list of dicts and index them in our DocumentStore
docs_to_index = df.to_dict(orient='records')
document_store.write_documents(docs_to_index)

                                                                          question  ... last_update
0                                                     What is a novel coronavirus?  ...  2020/03/17
1              Why is the disease being called coronavirus disease 2019, COVID-19?  ...  2020/03/17
2  Why might someone blame or avoid individuals and groups (create stigma) beca...  ...  2020/03/17
3                             How can people help stop stigma related to COVID-19?  ...  2020/03/17
4                                                 What is the source of the virus?  ...  2020/03/17

[5 rows x 12 columns]


Inferencing Samples: 100%|██████████| 7/7 [00:03<00:00,  2.10 Batches/s]


### Ask questions
Initialize a pipeline (this time without a reader) and ask questions.

In [None]:
from haystack.pipelines import FAQPipeline

pipe = FAQPipeline(retriever=retriever)

In [None]:
from haystack.utils import print_answers

prediction = pipe.run(query="How is the virus spreading?",
                      params={'Retriever': {'top_k': 10}})
print_answers(prediction, details='minimum')

Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 33.32 Batches/s]



Query: How is the virus spreading?
Answers:
[   {   'answer': 'This virus was first detected in Wuhan City, Hubei '
                  'Province, China. The first infections were linked to a live '
                  'animal market, but the virus is now spreading from '
                  'person-to-person. It’s important to note that '
                  'person-to-person spread can happen on a continuum. Some '
                  'viruses are highly contagious (like measles), while other '
                  'viruses are less so.\n'
                  '\n'
                  'The virus that causes COVID-19 seems to be spreading easily '
                  'and sustainably in the community (“community spread”) in '
                  'some affected geographic areas. Community spread means '
                  'people have been infected with the virus in an area, '
                  'including some who are not sure how or where they became '
                  'infected.\n'
                  '\n'