<a href="https://colab.research.google.com/github/kynthesis/HaystackResearch/blob/main/3_Simple_FAQ_Pipeline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **How to Build a QA Pipeline for FAQ Data**



# 1. Check the GPU runtime

In [1]:
%%bash

nvidia-smi

Sat Jul  1 09:31:01 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  NVIDIA A100-SXM...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   30C    P0    42W / 400W |      0MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

# 2. Install Haystack

In [None]:
%%bash

pip install --upgrade pip
# Quote the package spec so the shell never treats the brackets as a glob pattern
pip install "farm-haystack[colab,inference]"

# 3. Enable logging

In [3]:
import logging

logging.basicConfig(format="%(levelname)s - %(name)s -  %(message)s", level=logging.WARNING)
logging.getLogger("haystack").setLevel(logging.INFO)
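The two calls above rely on how Python's `logging` module resolves levels through the logger hierarchy. The following is a minimal sketch of that behavior using only the standard library; the logger name `some_other_library` is a hypothetical placeholder, and `haystack.telemetry` is used only because it appears in the notebook output.

```python
import logging

# Same configuration as the cell above: WARNING by default, INFO for Haystack.
logging.basicConfig(format="%(levelname)s - %(name)s -  %(message)s", level=logging.WARNING)
logging.getLogger("haystack").setLevel(logging.INFO)

# A child logger with no level of its own inherits from its nearest configured ancestor.
child = logging.getLogger("haystack.telemetry")
# Hypothetical unrelated logger: it falls back to the root logger's level.
other = logging.getLogger("some_other_library")

child_level = logging.getLevelName(child.getEffectiveLevel())
other_level = logging.getLevelName(other.getEffectiveLevel())
print("haystack.telemetry effective level:", child_level)
print("some_other_library effective level:", other_level)
```

This is why INFO messages from `haystack.*` loggers are shown while other libraries stay quiet unless they emit warnings or errors.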

# 4. Initialize the DocumentStore

In [4]:
from haystack.document_stores import InMemoryDocumentStore

document_store = InMemoryDocumentStore()

INFO:haystack.telemetry:Haystack sends anonymous usage data to understand the actual usage and steer dev efforts towards features that are most meaningful to users. You can opt-out at anytime by manually setting the environment variable HAYSTACK_TELEMETRY_ENABLED as described for different operating systems in the [documentation page](https://docs.haystack.deepset.ai/docs/telemetry#how-can-i-opt-out). More information at [Telemetry](https://docs.haystack.deepset.ai/docs/telemetry).
INFO:haystack.modeling.utils:Using devices: CUDA:0 - Number of GPUs: 1


# 5. Prepare the FAQ files

In [5]:
from haystack.utils import fetch_archive_from_http

doc_dir = "data/mcu"

fetch_archive_from_http(
    url="https://github.com/kynthesis/HaystackResearch/raw/main/mcu_faq.zip",
    output_dir=doc_dir,
)

INFO:haystack.utils.import_utils:Fetching from https://github.com/kynthesis/HaystackResearch/raw/main/mcu_faq.zip to 'data/mcu'


True

In [6]:
import pandas as pd

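# Load the FAQ table (columns: question, answer, title)
df = pd.read_csv(f"{doc_dir}/mcu_faq.csv")

# Replace missing values with empty strings and trim whitespace around each question
df.fillna(value="", inplace=True)
df["question"] = df["question"].apply(lambda x: x.strip())
print(df.head())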

                                            question  \
0  Which characters were adapted from Marvel's Ir...   
1  Why was the fighter pilot's chute jammed durin...   
2                          What is 'Iron Man' about?   
3                     Is "Iron Man" based on a book?   
4            Who or what is the Invincible Iron Man?   

                                              answer            title  
0  For this list only the creators of the charact...  Iron Man (2008)  
1  First, it's a very common plot device in actio...  Iron Man (2008)  
2  When wealthy industrialist Tony Stark (Robert ...  Iron Man (2008)  
3  Iron Man is based on a comic book of the same ...  Iron Man (2008)  
4  Iron Man is the alias used by billionaire indu...  Iron Man (2008)  


# 6. Initialize the EmbeddingRetriever

In [None]:
from haystack.nodes import EmbeddingRetriever

retriever = EmbeddingRetriever(
    document_store=document_store,
    embedding_model="sentence-transformers/all-MiniLM-L6-v2",
    use_gpu=True,
    # Keep raw similarity scores instead of rescaling them to the [0, 1] range
    scale_score=False,
)

# 7. Index the documents into the DocumentStore

In [8]:
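# Embed the questions (not the answers): incoming queries are matched against them
questions = list(df["question"].values)
df["embedding"] = retriever.embed_queries(queries=questions).tolist()
# The document store expects the searchable text in a "content" field
df = df.rename(columns={"question": "content"})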

Batches:   0%|          | 0/11 [00:00<?, ?it/s]

In [None]:
# One dict per FAQ row, with the keys content, answer, title and embedding
docs_to_index = df.to_dict(orient="records")
document_store.write_documents(docs_to_index)
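The idea behind this indexing step is that a user query is later compared with the stored question embeddings, and the answer attached to the closest question is returned. The following is a minimal pure-Python sketch of that matching, not Haystack's implementation. The 3-dimensional vectors and the placeholder answers are made up for illustration (real `all-MiniLM-L6-v2` embeddings have 384 dimensions), and cosine similarity is an assumption here; the similarity actually used depends on how the document store is configured.

```python
import math

# Toy records shaped like the dicts in docs_to_index.
# Embeddings and answers are hypothetical placeholders, not real data.
faq_records = [
    {"content": "What is 'Iron Man' about?",
     "answer": "(answer text for this question)",
     "embedding": [0.9, 0.1, 0.0]},
    {"content": 'Is "Iron Man" based on a book?',
     "answer": "(answer text for this question)",
     "embedding": [0.1, 0.9, 0.2]},
]

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def best_match(query_embedding, records):
    """Return the record whose question embedding is most similar to the query."""
    return max(records, key=lambda r: cosine_similarity(query_embedding, r["embedding"]))

# Hypothetical embedding of a user query that is close to the first question.
query_embedding = [0.8, 0.2, 0.1]
match = best_match(query_embedding, faq_records)
print(match["content"], "->", match["answer"])
```

Because only the questions are embedded, the answer text never takes part in the similarity computation; it is simply carried along with the matched record.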

# 8. Create the FAQ pipeline with the Retriever

In [10]:
from haystack.pipelines import FAQPipeline

pipe = FAQPipeline(retriever=retriever)

# 9. Ask the FAQ pipeline a question

In [35]:
prediction = pipe.run(
    query="Is Red Skull dead?",
    # Return only the single best-matching FAQ entry
    params={"Retriever": {"top_k": 1}},
)

Batches:   0%|          | 0/1 [00:00<?, ?it/s]
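The `prediction` returned by `pipe.run` can also be inspected directly instead of only being printed. The following is a rough sketch of a helper that does this. It assumes the result is a dict with an `"answers"` list whose items expose `answer`, `score` and `meta` attributes, as the `Answer` objects in Haystack 1.x do, and that the `title` column ends up in `meta`. The helper name `summarize_answers` and the stand-in result are hypothetical; the score and texts in the stand-in are invented so that the sketch runs without Haystack installed.

```python
from types import SimpleNamespace

def summarize_answers(prediction):
    """Return (answer, score, title) tuples from a pipeline result dict."""
    rows = []
    for ans in prediction.get("answers", []):
        meta = ans.meta or {}
        rows.append((ans.answer, ans.score, meta.get("title")))
    return rows

# Stand-in for a real pipeline result; all values here are invented.
fake_prediction = {
    "query": "Is Red Skull dead?",
    "answers": [
        SimpleNamespace(answer="(answer text)", score=0.5, meta={"title": "(film title)"}),
    ],
}

for answer, score, title in summarize_answers(fake_prediction):
    print(f"{score:.2f} | {title} | {answer}")
```

With the real pipeline, the call would be `summarize_answers(prediction)`. Since `scale_score=False` was set on the retriever, the scores are raw similarity values rather than values rescaled to [0, 1].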

# 10. Get the answer from the FAQ pipeline

In [36]:
from haystack.utils import print_answers

print_answers(prediction, details="medium")

'Query: Is Red Skull dead?'
'Answers:'
[   {   'answer': 'No. The Tesseract converted him to energy and teleported '
                  'him into the rift it had opened. He does not appear in the '
                  'sequel, Captain America: The Winter Soldier (2014). '
                  'Furthermore, Hugo Weaving said that he is not interested in '
                  'returning as Schmidt. Avengers Infinity War and Endgame '
                  'brought the character back seven years later, now played by '
                  "Ross Marquand and deified as a 'Grim Reaper' who protects "
                  'the Soul Gem on planet Vormir.',
        'context': 'No. The Tesseract converted him to energy and teleported '
                   'him into the rift it had opened. He does not appear in the '
                   'sequel, Captain America: The Winter Soldier (2014). '
                   'Furthermore, Hugo Weaving said that he is not interested '
                   'in returning as Schmidt. Av