<a href="https://colab.research.google.com/github/joshuaalpuerto/faq-haystack-guide/blob/main/JB_FAQ_style_QA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
%%bash

pip install --upgrade pip
pip install git+https://github.com/deepset-ai/haystack.git#egg=farm-haystack[colab]


## Logging

We configure the log message format and the log level before importing Haystack.
Example log message:
INFO - haystack.utils.preprocessing -  Converting data/tutorial1/218_Olenna_Tyrell.txt
The default log level in `basicConfig` is WARNING, so passing it explicitly is not necessary, but doing so makes it easy to change. The `haystack` logger is then set to INFO:

In [2]:
import logging

logging.basicConfig(format="%(levelname)s - %(name)s -  %(message)s", level=logging.WARNING)
logging.getLogger("haystack").setLevel(logging.INFO)
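
The interaction between the two calls above can be checked with a small standard-library sketch. It assumes nothing beyond Python's `logging` module; the logger name `some_other_library` is a made-up placeholder. A logger without an explicit level inherits the level of its nearest configured ancestor, so everything under `haystack` emits INFO while other libraries stay at WARNING.

```python
import logging

# Root logger at WARNING, the "haystack" logger at INFO (same levels as the cell above).
logging.getLogger().setLevel(logging.WARNING)
logging.getLogger("haystack").setLevel(logging.INFO)

# A child logger has no level of its own, so it inherits INFO from "haystack".
child = logging.getLogger("haystack.utils.preprocessing")
# A hypothetical unrelated logger inherits WARNING from the root logger.
other = logging.getLogger("some_other_library")

haystack_info_enabled = child.isEnabledFor(logging.INFO)
other_info_enabled = other.isEnabledFor(logging.INFO)

print(haystack_info_enabled, other_info_enabled)
```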

### Create a simple DocumentStore
The `InMemoryDocumentStore` is good for quick development and prototyping. For more scalable options, check out the [docs](https://docs.haystack.deepset.ai/docs/document_store).

In [3]:
from haystack.document_stores import InMemoryDocumentStore

document_store = InMemoryDocumentStore()

INFO:haystack.telemetry:Haystack sends anonymous usage data to understand the actual usage and steer dev efforts towards features that are most meaningful to users. You can opt-out at anytime by manually setting the environment variable HAYSTACK_TELEMETRY_ENABLED as described for different operating systems in the [documentation page](https://docs.haystack.deepset.ai/docs/telemetry#how-can-i-opt-out). More information at [Telemetry](https://docs.haystack.deepset.ai/docs/telemetry).
INFO:haystack.modeling.utils:Using devices: CUDA:0 - Number of GPUs: 1


### Create a Retriever using embeddings
Instead of retrieving via plain BM25 keyword matching, we want to use vector similarity between questions (the user's question vs. the stored FAQ questions).
We can use the `EmbeddingRetriever` for this purpose and specify the model used to compute the embeddings.
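
As a rough illustration of what "vector similarity" means here, the following toy sketch ranks two stored questions against a user question by cosine similarity. It is not Haystack's implementation: the three-dimensional vectors and the question labels are invented for the example, and cosine similarity is only one common choice (which similarity function is actually applied depends on how the document store is configured).

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical embeddings; a real model such as all-MiniLM-L6-v2 produces far longer vectors.
faq_vectors = {
    "stored question about visa timing": [0.9, 0.1, 0.0],
    "stored question about spouses": [0.1, 0.9, 0.2],
}
user_vector = [0.2, 0.8, 0.1]  # hypothetical embedding of the incoming question

# Rank stored questions from most to least similar to the user question.
ranked = sorted(
    faq_vectors,
    key=lambda question: cosine_similarity(user_vector, faq_vectors[question]),
    reverse=True,
)
best_match = ranked[0]
print(best_match)
```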

In [4]:
from haystack.nodes import EmbeddingRetriever

retriever = EmbeddingRetriever(
    document_store=document_store,
    embedding_model="sentence-transformers/all-MiniLM-L6-v2",
    use_gpu=True,
    scale_score=False,
)

INFO:haystack.modeling.utils:Using devices: CUDA:0 - Number of GPUs: 1


INFO:haystack.nodes.retriever.dense:Init retriever using embeddings of model sentence-transformers/all-MiniLM-L6-v2



In [5]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


### Prepare & Index FAQ data
We load some FAQ data (i.e., curated question-answer pairs) into a pandas DataFrame and index it in our DocumentStore.
Here, the data is a set of Jobbatical questions and answers.

In [6]:
import pandas as pd


data = pd.read_json('/content/drive/MyDrive/datasets/jb-qna.json')
data.info()
data.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 430 entries, 0 to 429
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   _id       430 non-null    object
 1   question  430 non-null    object
 2   answer    430 non-null    object
dtypes: object(3)
memory usage: 10.2+ KB


|  | _id | question | answer |
|---|---|---|---|
| 0 | 6047ac78ce1f20003d5b932b | What documents do I need to bring to the visa appointment? What happens if I... | Jobbatical agent will provide you with a complete checklist of what's requir... |
| 1 | 6047ac78ce1f20003d5b932c | I have a valid Schengen C visa and its validity will overlap with the D visa... | Having both valid C and D visas is fine as long as the visas have been issue... |
| 2 | 6047ac78ce1f20003d5b932d | Where can I apply for a D visa? Can I apply for it in Estonia? | In general, Estonian D visa must be applied for in your country of citizensh... |
| 3 | 6047ac78ce1f20003d5b932e | When should I apply for a D visa? | In general the visa applications are reviewed within 10-14 working days, dep... |
| 4 | 6047ac78ce1f20003d5b932f | I recently got married but don't have a marriage certificate, is that a prob... | The marriage certificate is a required document for your spouse to be able t... |


In [8]:
# Clean the data: drop rows with missing values, drop duplicate questions,
# lowercase the answers, strip leading/trailing periods from the answers,
# and strip surrounding whitespace from the questions.
data = data.dropna()
data = data.drop_duplicates(subset="question")

data["answer"] = data["answer"].str.lower()
data["answer"] = data["answer"].str.strip(".")
data["question"] = data["question"].str.strip()

# Keep only the question and answer columns; `_id` is not needed for retrieval.
# .copy() avoids a SettingWithCopyWarning when a column is added later.
data = data[["question", "answer"]].copy()
# Show the cleaned data (only the last expression of a cell is displayed)
data.head()

|  | question | answer |
|---|---|---|
| 0 | What documents do I need to bring to the visa appointment? What happens if I... | jobbatical agent will provide you with a complete checklist of what's requir... |
| 1 | I have a valid Schengen C visa and its validity will overlap with the D visa... | having both valid c and d visas is fine as long as the visas have been issue... |
| 2 | Where can I apply for a D visa? Can I apply for it in Estonia? | in general, estonian d visa must be applied for in your country of citizensh... |
| 3 | When should I apply for a D visa? | in general the visa applications are reviewed within 10-14 working days, dep... |
| 4 | I recently got married but don't have a marriage certificate, is that a prob... | the marriage certificate is a required document for your spouse to be able t... |
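
The string cleaning applied in the cell above can be checked on single values with plain Python, whose `str.lower` and `str.strip` behave like the pandas `.str` methods of the same name. This is only a sketch: the helper names and sample strings are invented for illustration. Note that `strip(".")` removes leading and trailing periods only, so periods inside an answer are kept.

```python
def clean_answer(text):
    """Lowercase the answer and strip leading/trailing periods."""
    return text.lower().strip(".")


def clean_question(text):
    """Strip leading/trailing whitespace from the question."""
    return text.strip()


# Hypothetical sample values, not taken from the dataset.
cleaned_answer = clean_answer("Yes. Contact support...")
cleaned_question = clean_question("  When should I apply?  ")

print(repr(cleaned_answer))
print(repr(cleaned_question))
```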


In [9]:

# Create embeddings for the questions from the FAQs.
# In contrast to most other search use cases, we don't create the embeddings from the content of our documents,
# but from the text field "question", because we want to match "incoming question" <-> "stored question".
questions = data["question"].tolist()
# We use embed_queries because we only want to embed the single "question" column rather than whole documents
data["embedding"] = retriever.embed_queries(queries=questions).tolist()
df = data.rename(columns={"question": "content"})

# Convert the DataFrame to a list of dicts and index them in our DocumentStore.
# Each row becomes one dict whose keys are the column names:
# [{"content": "Some question 1", "answer": "This is the answer for question 1", "embedding": [...]},
#  {"content": "Some question 2", "answer": "This is the answer for question 2", "embedding": [...]}]
docs_to_index = df.to_dict(orient="records")
document_store.write_documents(docs_to_index)
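
For reference, the shape of the records being indexed can be sketched without pandas or Haystack. The questions, answers, and two-dimensional embeddings below are placeholders; only the key names `content`, `answer`, and `embedding` come from the cell above.

```python
# Placeholder data standing in for the cleaned FAQ columns and their embeddings.
questions = ["Some question 1", "Some question 2"]
answers = ["this is the answer for question 1", "this is the answer for question 2"]
embeddings = [[0.1, 0.2], [0.3, 0.4]]  # real embeddings are much longer

# One dict per FAQ entry: the question is stored as the document content,
# the answer and the precomputed question embedding travel alongside it.
records = [
    {"content": question, "answer": answer, "embedding": embedding}
    for question, answer, embedding in zip(questions, answers, embeddings)
]

print(records[0])
```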


### Ask questions
Initialize a pipeline that uses only the retriever (no reader is needed, because the answers are already stored with the questions) and ask questions.
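
Conceptually, an FAQ lookup scores the query against every stored question, keeps the `top_k` best matches, and returns the answers stored with them. The sketch below illustrates that flow with a toy in-memory index and a dot-product score. It is not how `FAQPipeline` is implemented: `faq_lookup`, the index entries, the `matched_question` field, and the two-dimensional embeddings are all hypothetical.

```python
import heapq

# Hypothetical toy index; each entry mimics one indexed FAQ record.
faq_index = [
    {"content": "question A", "answer": "answer A", "embedding": [1.0, 0.0]},
    {"content": "question B", "answer": "answer B", "embedding": [0.6, 0.8]},
    {"content": "question C", "answer": "answer C", "embedding": [0.0, 1.0]},
]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def faq_lookup(query_embedding, index, top_k):
    """Return the stored answers of the top_k most similar stored questions."""
    scored = [(dot(query_embedding, doc["embedding"]), doc) for doc in index]
    best = heapq.nlargest(top_k, scored, key=lambda pair: pair[0])
    return [
        {"answer": doc["answer"], "matched_question": doc["content"], "score": score}
        for score, doc in best
    ]


# Hypothetical query embedding; in the notebook the retriever computes this from the query text.
results = faq_lookup([0.0, 1.0], faq_index, top_k=2)
for result in results:
    print(result)
```

Raising or lowering `top_k` changes only how many of the ranked matches are returned, not their order.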

In [12]:
from haystack.utils import print_answers
from haystack.pipelines import FAQPipeline

pipe = FAQPipeline(retriever=retriever)

# Run any question and change top_k to see more or fewer answers
prediction = pipe.run(query="Can I bring my wife to Estonia?", params={"Retriever": {"top_k": 5}})

print_answers(prediction, details="medium")


'Query: Can I bring my wife to Estonia?'
'Answers:'
[   {   'answer': 'it is allowed to work on the basis of a visa on the '
                  'condition that the estonian employer has registered your '
                  "spouse's short term employment. jobbatical can help with "
                  'this. without a short term employment registration it is '
                  'forbidden to work',
        'context': 'it is allowed to work on the basis of a visa on the '
                   'condition that the estonian employer has registered your '
                   "spouse's short term employment. jobbatical can help with "
                   'this. without a short term employment registration it is '
                   'forbidden to work',
        'score': 0.5965965390205383},
    {   'answer': 'yes. the family form must include data about your close '
                  'living family (parents, siblings, spouse, child) and is a '
                  'mandatory part of the residence permit