<a href="https://colab.research.google.com/github/kynthesis/HaystackResearch/blob/main/1_Basic_QA_Pipeline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **How to Build a Basic QA Pipeline**



# 1. Check the GPU runtime

In [1]:
%%bash

nvidia-smi

Sat Jul  1 04:43:32 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  NVIDIA A100-SXM...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   32C    P0    50W / 400W |      0MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

# 2. Install Haystack

In [None]:
%%bash

pip install --upgrade pip
# Quote the requirement so the shell never treats the brackets as a glob pattern
pip install "farm-haystack[colab,inference]"

# 3. Enable logging

In [4]:
import logging

logging.basicConfig(format="%(levelname)s - %(name)s -  %(message)s", level=logging.WARNING)
logging.getLogger("haystack").setLevel(logging.INFO)

# 4. Initialize the DocumentStore

In [5]:
from haystack.document_stores import InMemoryDocumentStore

document_store = InMemoryDocumentStore(use_bm25=True)

INFO:haystack.telemetry:Haystack sends anonymous usage data to understand the actual usage and steer dev efforts towards features that are most meaningful to users. You can opt-out at anytime by manually setting the environment variable HAYSTACK_TELEMETRY_ENABLED as described for different operating systems in the [documentation page](https://docs.haystack.deepset.ai/docs/telemetry#how-can-i-opt-out). More information at [Telemetry](https://docs.haystack.deepset.ai/docs/telemetry).
INFO:haystack.modeling.utils:Using devices: CUDA:0 - Number of GPUs: 1
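
The telemetry message above mentions opting out through the `HAYSTACK_TELEMETRY_ENABLED` environment variable. A minimal sketch of doing that from inside the notebook follows. It assumes that the string `"False"` is the value that disables telemetry and that the variable is read when Haystack is imported; both should be confirmed against the linked documentation page. The cell would need to run before any `haystack` import.

```python
import os

# Assumption: the string "False" disables telemetry, per the documentation
# page linked in the log message. Run this before importing haystack.
os.environ["HAYSTACK_TELEMETRY_ENABLED"] = "False"

print(os.environ["HAYSTACK_TELEMETRY_ENABLED"])
```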


# 5. Prepare the document files

In [6]:
from haystack.utils import fetch_archive_from_http

doc_dir = "data/witcher"

fetch_archive_from_http(
    url="https://github.com/kynthesis/HaystackResearch/raw/main/witcher.zip",
    output_dir=doc_dir,
)

INFO:haystack.utils.import_utils:Fetching from https://github.com/kynthesis/HaystackResearch/raw/main/witcher.zip to 'data/witcher'


True

# 6. Index the document files into the DocumentStore

In [None]:
import os
from haystack.pipelines.standard_pipelines import TextIndexingPipeline

# Build the full path of every entry in the download directory
files_to_index = [os.path.join(doc_dir, f) for f in os.listdir(doc_dir)]
indexing_pipeline = TextIndexingPipeline(document_store)
indexing_pipeline.run_batch(file_paths=files_to_index)
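
`os.listdir` returns every entry in the directory, including subdirectories and any non-text files the archive might contain. The sketch below shows one way to keep only regular files with a given extension before indexing. The helper name `list_text_files` is hypothetical, and the `.txt` extension is an assumption, since the notebook does not state what the archive contains.

```python
import os
import tempfile


def list_text_files(directory, extension=".txt"):
    """Return sorted paths of regular files in `directory` ending with `extension`."""
    paths = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        # Skip subdirectories and files with other extensions
        if os.path.isfile(path) and name.lower().endswith(extension):
            paths.append(path)
    return paths


# Demonstration on a throwaway directory with mixed contents
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.txt", "b.txt", "notes.md"):
        with open(os.path.join(tmp, name), "w", encoding="utf-8") as handle:
            handle.write("placeholder")
    os.mkdir(os.path.join(tmp, "subdir"))
    demo_files = [os.path.basename(p) for p in list_text_files(tmp)]

print(demo_files)
```

In the notebook, `list_text_files(doc_dir)` could stand in for the list comprehension that builds `files_to_index`, if the archive turns out to hold more than plain-text files.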

# 7. Initialize the Retriever

In [8]:
from haystack.nodes import BM25Retriever

retriever = BM25Retriever(document_store=document_store)
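
To give a feel for what a BM25 retriever does, here is a toy, pure-Python scorer based on a common Okapi BM25 formulation. It is an illustration only and not Haystack's implementation. The function name `bm25_scores`, the whitespace tokenization, the parameter defaults `k1=1.5` and `b=0.75`, and the three sample sentences are all assumptions made for this sketch.

```python
import math
from collections import Counter


def bm25_scores(query, documents, k1=1.5, b=0.75):
    """Score each document against the query with a basic Okapi BM25 formula."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: in how many documents each term appears
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            n_t = doc_freq[term]
            # Rare terms get a higher inverse document frequency
            idf = math.log(1 + (n_docs - n_t + 0.5) / (n_t + 0.5))
            tf = term_freq[term]
            # Longer-than-average documents are penalized through b
            norm = tf + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


# Toy documents, made up for this illustration
docs = [
    "ciri is the princess of cintra",
    "geralt is a witcher from rivia",
    "calanthe is the queen of cintra and grandmother of ciri",
]
scores = bm25_scores("grandmother of ciri", docs)
best = max(range(len(docs)), key=lambda i: scores[i])
print(best, scores)
```

The third sentence ranks first because it matches all query terms, including the rare term "grandmother". The second sentence shares no term with the query and scores zero.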

# 8. Initialize the Reader

In [9]:
from haystack.nodes import FARMReader

reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2", use_gpu=True)

INFO:haystack.modeling.utils:Using devices: CUDA:0 - Number of GPUs: 1
INFO:haystack.modeling.utils:Using devices: CUDA:0 - Number of GPUs: 1


INFO:haystack.modeling.model.language_model: * LOADING MODEL: 'deepset/roberta-base-squad2' (Roberta)


INFO:haystack.modeling.model.language_model:Auto-detected model language: english
INFO:haystack.modeling.model.language_model:Loaded 'deepset/roberta-base-squad2' (Roberta model) from model hub.


INFO:haystack.modeling.utils:Using devices: CUDA:0 - Number of GPUs: 1


# 9. Create a QA pipeline from the Retriever and Reader

In [10]:
from haystack.pipelines import ExtractiveQAPipeline

pipe = ExtractiveQAPipeline(reader, retriever)

# 10. Ask the QA pipeline a question

In [23]:
prediction = pipe.run(
    query="Who is the grandmother of Ciri?",
    params={"Retriever": {"top_k": 10}, "Reader": {"top_k": 5}}
)

Inferencing Samples:   0%|          | 0/1 [00:00<?, ? Batches/s]
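
In this setup the Retriever narrows the corpus to 10 candidate documents and the Reader returns up to 5 answers, each with a score. A small sketch of keeping only the higher-scoring answers follows. The helper name `confident_answers` and the `0.85` threshold are hypothetical. The stand-in objects only mirror the `answer` and `score` fields shown in this notebook's printed results; it is an assumption here that the objects in `prediction["answers"]` expose the same attributes.

```python
from types import SimpleNamespace


def confident_answers(answers, min_score=0.85):
    """Return (answer text, score) pairs with score >= min_score, best first."""
    kept = [
        (a.answer, a.score)
        for a in answers
        if a.score is not None and a.score >= min_score
    ]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Stand-in objects using values from this notebook's printed results;
# with a real run, the assumption is that prediction["answers"] can be passed instead.
sample = [
    SimpleNamespace(answer="Queen Calanthe", score=0.8988839983940125),
    SimpleNamespace(answer="queen", score=0.89717698097229),
    SimpleNamespace(answer="Muriel Countess of Garramone", score=0.8413123488426208),
]
top = confident_answers(sample)
print(top)
```

A threshold like this is a judgment call: reader scores are model confidences, not guarantees of correctness, so it may need tuning per dataset.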

# 11. Get the answers from the QA pipeline

In [26]:
from haystack.utils import print_answers

print_answers(prediction, details="medium")

'Query: Who is the grandmother of Ciri?'
'Answers:'
[   {   'answer': 'Queen Calanthe',
        'context': 'r shields with their swords.\n'
                   'Along the gang-plank towards them came Queen Calanthe. Her '
                   'grandmother. She who was officially\n'
                   'called Ard Rhena, the High',
        'score': 0.8988839983940125},
    {   'answer': 'queen',
        'context': 'tered. "I am a princess, and not an orphan. I have a '
                   'grandmother. She is queen, what do you think? When I tell '
                   'her that you wanted to hit me with a be',
        'score': 0.89717698097229},
    {   'answer': 'Muriel Countess of Garramone',
        'context': ' knows about my lineage...\n'
                   'In short I am a relative to Ciri, Muriel Countess of '
                   "Garramone, called the Fair, was Cirilla's "
                   'great-grandmother and also m',
        'score': 0.8413123488426208},
    {   'answer': 'the qu