<a href="https://colab.research.google.com/github/joshuaalpuerto/faq-haystack-guide/blob/main/JB_FAQ_style_QA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
%%bash

pip install --upgrade pip
pip install git+https://github.com/deepset-ai/haystack.git#egg=farm-haystack[colab,faiss]

## Logging

We configure how logging messages should be displayed and which log level should be used before importing Haystack.
Example log message:
INFO - haystack.utils.preprocessing -  Converting data/tutorial1/218_Olenna_Tyrell.txt
The default log level in basicConfig is WARNING, so the explicit parameter is not strictly necessary, but passing it makes the level easy to change:

In [None]:
import logging

logging.basicConfig(format="%(levelname)s - %(name)s -  %(message)s", level=logging.WARNING)
logging.getLogger("haystack").setLevel(logging.INFO)
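
To see how the two calls above interact, here is a small standard-library sketch. It does not import Haystack; the logger names are only illustrative, and `force=True` is an addition so the example behaves the same in a session where logging was already configured.

```python
import logging

# Root logger at WARNING, as in the cell above. force=True (Python 3.8+)
# replaces any handlers a notebook session may already have installed.
logging.basicConfig(
    format="%(levelname)s - %(name)s -  %(message)s",
    level=logging.WARNING,
    force=True,
)
# Lower the threshold only for the "haystack" logger and its children.
logging.getLogger("haystack").setLevel(logging.INFO)

# A child logger without its own level uses the nearest configured ancestor.
haystack_child = logging.getLogger("haystack.utils.preprocessing")
unrelated = logging.getLogger("some_other_library")  # illustrative name

haystack_info_enabled = haystack_child.isEnabledFor(logging.INFO)
unrelated_info_enabled = unrelated.isEnabledFor(logging.INFO)
print(haystack_info_enabled, unrelated_info_enabled)
```

INFO messages from `haystack.*` loggers are shown, while other libraries stay at WARNING and above.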

### Create a simple DocumentStore
The InMemoryDocumentStore is good for quick development and prototyping. For more scalable options, check out the [docs](https://docs.haystack.deepset.ai/docs/document_store).

In [None]:
from haystack.document_stores import InMemoryDocumentStore

document_store = InMemoryDocumentStore()

INFO:haystack.telemetry:Haystack sends anonymous usage data to understand the actual usage and steer dev efforts towards features that are most meaningful to users. You can opt-out at anytime by manually setting the environment variable HAYSTACK_TELEMETRY_ENABLED as described for different operating systems in the [documentation page](https://docs.haystack.deepset.ai/docs/telemetry#how-can-i-opt-out). More information at [Telemetry](https://docs.haystack.deepset.ai/docs/telemetry).
INFO:haystack.modeling.utils:Using devices: CUDA:0 - Number of GPUs: 1


### Create a Retriever using embeddings
Instead of retrieving via Elasticsearch's plain BM25, we want to use vector similarity of the questions (user question vs. FAQ ones).
We can use the `EmbeddingRetriever` for this purpose and specify a model that we use for the embeddings.

In [None]:
from haystack.nodes import EmbeddingRetriever

retriever = EmbeddingRetriever(
    document_store=document_store,
    embedding_model="sentence-transformers/all-MiniLM-L6-v2",
    use_gpu=False,
    scale_score=False,
)
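
The idea of matching a user question against stored FAQ questions by vector similarity can be sketched in plain Python. The vectors below are made up for illustration (a real embedding model produces much longer ones), and cosine similarity is used as one common choice; the similarity function actually configured in the document store may differ.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up 3-dimensional "embeddings" of stored FAQ questions.
faq_embeddings = {
    "When should I apply for a D visa?": [0.9, 0.1, 0.0],
    "Where can I apply for a D visa?": [0.2, 0.9, 0.1],
    "Do I need a marriage certificate?": [0.0, 0.1, 0.9],
}
query_embedding = [0.8, 0.3, 0.0]  # stands in for the embedded user question

# Rank the stored questions by similarity to the query, best match first.
ranked = sorted(
    faq_embeddings.items(),
    key=lambda item: cosine_similarity(query_embedding, item[1]),
    reverse=True,
)
best_question = ranked[0][0]
print(best_question)
```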

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


### Prepare & Index FAQ data
We create a pandas DataFrame containing some FAQ data (i.e., curated question-answer pairs) and index it in our document store.
Here the data is a set of Jobbatical questions and answers.

In [None]:
import pandas as pd


data = pd.read_json('/content/drive/MyDrive/datasets/jb-qna.json')
data.info()
data.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 430 entries, 0 to 429
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   _id       430 non-null    object
 1   question  430 non-null    object
 2   answer    430 non-null    object
dtypes: object(3)
memory usage: 10.2+ KB


Unnamed: 0,_id,question,answer
0,6047ac78ce1f20003d5b932b,What documents do I need to bring to the visa appointment? What happens if I...,Jobbatical agent will provide you with a complete checklist of what's requir...
1,6047ac78ce1f20003d5b932c,I have a valid Schengen C visa and its validity will overlap with the D visa...,Having both valid C and D visas is fine as long as the visas have been issue...
2,6047ac78ce1f20003d5b932d,Where can I apply for a D visa? Can I apply for it in Estonia?,"In general, Estonian D visa must be applied for in your country of citizensh..."
3,6047ac78ce1f20003d5b932e,When should I apply for a D visa?,"In general the visa applications are reviewed within 10-14 working days, dep..."
4,6047ac78ce1f20003d5b932f,"I recently got married but don't have a marriage certificate, is that a prob...",The marriage certificate is a required document for your spouse to be able t...


In [None]:
# Clean the data: drop rows with missing values, drop duplicate questions,
# lowercase the answers, strip leading/trailing periods from the answers,
# and strip surrounding whitespace from the questions.
data = data.dropna()
data = data.drop_duplicates(subset='question')

data['answer'] = data['answer'].str.lower()
data['answer'] = data['answer'].str.strip(".")
data['question'] = data['question'].str.strip()

# Keep only the question and answer columns; _id is not needed for retrieval.
# .copy() avoids a SettingWithCopyWarning when a column is added later.
data = data[['question','answer']].copy()
# Display the cleaned data
data.head()

Unnamed: 0,question,answer
0,What documents do I need to bring to the visa appointment? What happens if I...,jobbatical agent will provide you with a complete checklist of what's requir...
1,I have a valid Schengen C visa and its validity will overlap with the D visa...,having both valid c and d visas is fine as long as the visas have been issue...
2,Where can I apply for a D visa? Can I apply for it in Estonia?,"in general, estonian d visa must be applied for in your country of citizensh..."
3,When should I apply for a D visa?,"in general the visa applications are reviewed within 10-14 working days, dep..."
4,"I recently got married but don't have a marriage certificate, is that a prob...",the marriage certificate is a required document for your spouse to be able t...
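
The cleaning steps can be illustrated without pandas on a few made-up rows. Note in particular that stripping `"."` only removes periods at the start and end of a string, not punctuation inside it. The rows and the helper name `clean_rows` are hypothetical, and the missing-value check only looks at the two columns used here.

```python
def clean_rows(rows):
    """Mirror the cleaning cell on a list of dicts: drop rows with missing
    values, keep the first row per question, lowercase answers, strip edges."""
    seen_questions = set()
    cleaned = []
    for row in rows:
        if row.get("question") is None or row.get("answer") is None:
            continue  # analogue of dropna()
        if row["question"] in seen_questions:
            continue  # analogue of drop_duplicates(subset="question")
        seen_questions.add(row["question"])
        cleaned.append({
            "question": row["question"].strip(),
            "answer": row["answer"].lower().strip("."),
        })
    return cleaned

rows = [
    {"question": " When should I apply? ", "answer": "Apply early. Processing takes time."},
    {"question": " When should I apply? ", "answer": "Duplicate question, dropped."},
    {"question": "Is a photo required?", "answer": None},
]
cleaned = clean_rows(rows)
print(cleaned)
```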


In [None]:

# Create embeddings for the FAQ questions.
# In contrast to most other search use cases, we embed the FAQ question rather than the answer text,
# because we want to match "incoming question" <-> "stored question".
questions = data["question"].tolist()
# embed_queries takes a plain list of strings, so only the question column is embedded
data["embedding"] = retriever.embed_queries(queries=questions).tolist()
df = data.rename(columns={"question": "content"})

# Convert the DataFrame to a list of dicts (one per row, column names as keys) and index them in the DocumentStore:
# [{"content": "Some question 1", "answer": "This is the answer for question 1", "embedding": [...]},
#  {"content": "Some question 2", "answer": "This is the answer for question 2", "embedding": [...]}]
docs_to_index = df.to_dict(orient="records")
document_store.write_documents(docs_to_index)

Batches:   0%|          | 0/10 [00:00<?, ?it/s]

INFO:haystack.document_stores.base:Duplicate Documents: Document with id 'be18e5ef1220558adceafa21489fac8b' already exists in index 'document'
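
Each record written to the document store carries the question as `content`, its vector as `embedding`, and the answer as an extra key. The sketch below assumes that keys which are not recognised document fields (here `answer`) are kept as metadata, which is how the FAQ answer can be returned later. The field list and the helper `to_document_like` are illustrative and not Haystack's code.

```python
# Field names treated as belonging to the document itself (assumed subset).
DOCUMENT_FIELDS = {"content", "embedding", "id", "score", "content_type"}

def to_document_like(record):
    """Split a flat record into document fields and a `meta` dict."""
    doc = {key: value for key, value in record.items() if key in DOCUMENT_FIELDS}
    doc["meta"] = {key: value for key, value in record.items() if key not in DOCUMENT_FIELDS}
    return doc

record = {
    "content": "Some question 1",                    # the FAQ question
    "answer": "This is the answer for question 1",   # the curated answer
    "embedding": [0.1, 0.2, 0.3],                    # made-up vector
}
doc = to_document_like(record)
print(doc["meta"])
```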


### Ask questions
Initialize a pipeline (this time without a reader) and ask questions.

In [None]:
from haystack.utils import print_answers
from haystack.pipelines import FAQPipeline

pipe = FAQPipeline(retriever=retriever)

# Run any question and change top_k to see more or fewer answers
prediction = pipe.run(query="When will I receive my residence permit?", params={"Retriever": {"top_k": 5}})
answers = [answer.to_dict() for answer in prediction['answers']]

print_answers(prediction, details="medium")

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

'Query: When will I receive my residence permit?'
'Answers:'
[   {   'answer': 'by law, the processing of your application can take up to 2 '
                  'months from submission, but the police board has the right '
                  'to extend the deadline if necessary. additional time for '
                  'processing can take up to a month',
        'context': 'by law, the processing of your application can take up to '
                   '2 months from submission, but the police board has the '
                   'right to extend the deadline if necessary. additional time '
                   'for processing can take up to a month',
        'score': 0.7636213898658752},
    {   'answer': "once you're in spain, you'll need to get a resident card "
                  '(tarjeta de identidad de extranjero, tie). this requires '
                  'making an appointment beforehand. on the day of your '
                  'appointment, the police will register your fingerprints and 
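
Since every returned answer carries a similarity score, one option before passing answers on to a later step is to drop weak matches. This is only a sketch: the helper `filter_by_score`, the example data, and the threshold of 0.5 are all made up for illustration, and a sensible threshold depends on the embedding model and similarity function.

```python
def filter_by_score(answers, min_score):
    """Keep answers whose similarity score is at least `min_score`, best first."""
    kept = [a for a in answers if a.get("score") is not None and a["score"] >= min_score]
    return sorted(kept, key=lambda a: a["score"], reverse=True)

# Made-up answers in the same shape as `answer.to_dict()` (only two keys shown).
example_answers = [
    {"answer": "processing can take up to 2 months", "score": 0.76},
    {"answer": "an unrelated answer", "score": 0.31},
    {"answer": "you'll need to make an appointment", "score": 0.58},
]
confident = filter_by_score(example_answers, min_score=0.5)
print([a["answer"] for a in confident])
```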

In [None]:
# Try to generate a response based on the answers we retrieved
# (the retrieved answers are really good, but the generated result is bad!)
from haystack import Document


# Wrap each retrieved answer in a Document to use as context for the generator
docs = [Document(content=answer['answer']) for answer in answers]
print(docs)

[<Document: {'content': "jobbatical agent will provide you with a complete checklist of what's required at the appointment, most items will be prepared by your agent, and all files are available on the platform prior the appointment.if a required document is missing, you risk your visa application rejected, and will need to re-apply. please be sure to follow the checklist provided by your agent, and bring the complete application pack to the appointment", 'content_type': 'text', 'score': None, 'meta': {}, 'id_hash_keys': ['content'], 'embedding': None, 'id': 'e35d011bacb0473d24fa3952c95a2668'}>, <Document: {'content': 'applying for a visa requires the application form, your passport with a minimum remaining validity of 1 year, 1 photo (passport size), approval letter, and criminal records', 'content_type': 'text', 'score': None, 'meta': {}, 'id_hash_keys': ['content'], 'embedding': None, 'id': '99b2d0b0489b739d725ae85f28b994b6'}>, <Document: {'content': 'ideally, the visa interview sho

In [None]:
# Let's try generating an answer with a seq2seq model
from haystack.nodes import Seq2SeqGenerator


generator = Seq2SeqGenerator(model_name_or_path="vblagoje/bart_lfqa", 
   max_length=300,
   min_length=100)
res = generator.predict(
   query="What do I need for my visa appointment?",
   documents=docs,
   top_k=1
)
print_answers(res, details="all")

INFO:haystack.modeling.utils:Using devices: CPU - Number of GPUs: 0


Downloading (…)okenizer_config.json:   0%|          | 0.00/27.0 [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/1.32k [00:00<?, ?B/s]

Downloading (…)olve/main/vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

Downloading (…)olve/main/merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/1.63G [00:00<?, ?B/s]

  return self.fget.__get__(instance, owner)()


'Query: What do I need for my visa appointment?'
'Answers:'
[   <Answer {'answer': "If you're in the US, you need to fill out a visa application form. If you're not, you'll need to go to a visa office and fill out an application for a visa. The visa office will have a list of what you need, and they'll be able to help you fill it out. If they can't help you, they'll send you back to the visa office to do it all over again, and you'll have to go through the process again. It's a pain in the ass, but it's worth it.", 'type': 'generative', 'score': None, 'context': None, 'offsets_in_document': None, 'offsets_in_context': None, 'document_ids': ['e35d011bacb0473d24fa3952c95a2668', '99b2d0b0489b739d725ae85f28b994b6', 'b60f73bbb7f5c1e7104c5a07e7020f01', '5365b87a9f8473f745d6f619e823c477', '4493bf634c06f6427870e54784a83f28'], 'meta': {'doc_scores': [None, None, None, None, None], 'content': ["jobbatical agent will provide you with a complete checklist of what's required at the appointment, mos
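
The generator receives the query together with all supporting documents as a single input. A minimal sketch of how such an input string can be assembled is shown below. The `question: ... context: <P> ...` template is a style used for ELI5-type BART models; that Haystack builds exactly this string for `vblagoje/bart_lfqa` is an assumption, and `build_lfqa_input` is a hypothetical helper.

```python
def build_lfqa_input(query, passages):
    """Join the query and the supporting passages into one input string."""
    context = "<P> " + " <P> ".join(passages)
    return f"question: {query} context: {context}"

# Made-up passages standing in for the retrieved FAQ answers.
passages = [
    "your agent will provide a checklist",
    "bring your passport and one photo",
]
model_input = build_lfqa_input("What do I need for my visa appointment?", passages)
print(model_input)
```

Because everything is concatenated into one sequence, long or numerous passages may be truncated by the tokenizer before the model sees them.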

In [None]:
# RAG gives an even worse response than the seq2seq generator above
from haystack.document_stores import FAISSDocumentStore
from haystack.nodes import RAGenerator, DensePassageRetriever

document_store = FAISSDocumentStore(embedding_dim=768, faiss_index_factory_str="Flat", return_embedding=True)
# Initialize DPR Retriever to encode documents, encode question and query documents
retriever = DensePassageRetriever(
    document_store=document_store,
    query_embedding_model="facebook/dpr-question_encoder-single-nq-base",
    passage_embedding_model="facebook/dpr-ctx_encoder-single-nq-base",
    use_gpu=True,
    embed_title=True,
)

generator = RAGenerator(
    model_name_or_path="facebook/rag-token-nq",
    retriever=retriever,
    use_gpu=True,
    top_k=1,
    max_length=200,
    min_length=5,
    embed_title=False,
    num_beams=2
)

res = generator.predict(
   query="What do I need for my visa appointment?",
   documents=docs,
   top_k=1
)
print_answers(res, details="all")

INFO:haystack.modeling.utils:Using devices: CPU - Number of GPUs: 0
INFO:haystack.modeling.model.language_model:Auto-detected model language: english
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'DPRQuestionEncoderTokenizer'. 
The class this function is called from is 'DPRContextEncoderTokenizerFast'.
INFO:haystack.modeling.model.language_model:Auto-detected model language: english
INFO:haystack.modeling.utils:Using devices: CPU - Number of GPUs: 0
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'RagTokenizer'. 
The class this function is called from is 'DPRQuestionEncoderTokenizer'.
The tokenizer class you load from this checkpoint is not the same type as the clas

Create embeddings:   0%|          | 0/16 [00:00<?, ? Docs/s]

'Query: What do I need for my visa appointment?'
'Answers:'
[   <Answer {'answer': ' approval letter ( unexpired )', 'type': 'generative', 'score': None, 'context': None, 'offsets_in_document': None, 'offsets_in_context': None, 'document_ids': ['e35d011bacb0473d24fa3952c95a2668', '99b2d0b0489b739d725ae85f28b994b6', 'b60f73bbb7f5c1e7104c5a07e7020f01', '5365b87a9f8473f745d6f619e823c477', '4493bf634c06f6427870e54784a83f28'], 'meta': {'doc_scores': [None, None, None, None, None], 'content': ["jobbatical agent will provide you with a complete checklist of what's required at the appointment, most items will be prepared by your agent, and all files are available on the platform prior the appointment.if a required document is missing, you risk your visa application rejected, and will need to re-apply. please be sure to follow the checklist provided by your agent, and bring the complete application pack to the appointment", 'applying for a visa requires the application form, your passport with 