<a href="https://colab.research.google.com/github/kynthesis/HaystackResearch/blob/main/2_Elasticsearch_QA_Pipeline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **How to build a scalable QA pipeline with Elasticsearch**



# 1. Check the GPU runtime

In [1]:
%%bash

nvidia-smi

Sat Jul  1 05:58:35 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  NVIDIA A100-SXM...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   31C    P0    45W / 400W |      0MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
                                                                               

# 2. Install Haystack

In [None]:
%%bash

pip install --upgrade pip
pip install "farm-haystack[colab,preprocessing,elasticsearch,inference]"

# 3. Enable logging

In [3]:
import logging

logging.basicConfig(format="%(levelname)s - %(name)s -  %(message)s", level=logging.WARNING)
logging.getLogger("haystack").setLevel(logging.INFO)

# 4. Install Elasticsearch

In [4]:
%%bash

wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.9.2-linux-x86_64.tar.gz -q
tar -xzf elasticsearch-7.9.2-linux-x86_64.tar.gz
chown -R daemon:daemon elasticsearch-7.9.2

In [5]:
%%bash --bg

sudo -u daemon -- elasticsearch-7.9.2/bin/elasticsearch

In [6]:
import time

# Wait for Elasticsearch to finish starting before connecting to it.
time.sleep(30)
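
A fixed 30-second sleep can be too short on a slow runtime and wasteful on a fast one. As a minimal sketch of an alternative, the helper below polls the server until it answers. It assumes Elasticsearch listens on `http://localhost:9200` (its default HTTP port); the function name `wait_for_elasticsearch` and its parameters are hypothetical and not part of Haystack or Elasticsearch.

```python
import http.client
import time
import urllib.request


def wait_for_elasticsearch(url="http://localhost:9200", timeout=60.0, interval=2.0):
    """Poll `url` until it answers with HTTP 200 or `timeout` seconds pass.

    Hypothetical helper: the default URL is an assumption about where the
    local Elasticsearch instance listens.
    """
    # The server is local, so ignore any proxy configured in the environment.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with opener.open(url, timeout=max(interval, 1.0)) as response:
                if response.status == 200:
                    return True
        except (OSError, http.client.HTTPException):
            # Connection refused or an incomplete reply: not ready yet.
            pass
        time.sleep(interval)
    return False
```

In the notebook, calling `wait_for_elasticsearch()` and checking that it returned `True` could stand in for the fixed sleep.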

# 4. Initialize the ElasticsearchDocumentStore

In [7]:
import os
from haystack.document_stores import ElasticsearchDocumentStore

host = os.environ.get("ELASTICSEARCH_HOST", "localhost")

document_store = ElasticsearchDocumentStore(host=host, username="", password="", index="document")

# 5. Prepare the document files

In [8]:
from haystack.utils import fetch_archive_from_http

doc_dir = "data/witcher"

fetch_archive_from_http(
    url="https://github.com/kynthesis/HaystackResearch/raw/main/witcher.zip",
    output_dir=doc_dir,
)

INFO:haystack.utils.import_utils:Fetching from https://github.com/kynthesis/HaystackResearch/raw/main/witcher.zip to 'data/witcher'


True

# 6. Initialize the Pipeline, TextConverter, and PreProcessor

In [9]:
from haystack import Pipeline
from haystack.nodes import TextConverter, PreProcessor

indexing_pipeline = Pipeline()
text_converter = TextConverter()
preprocessor = PreProcessor(
    clean_whitespace=True,
    clean_header_footer=True,
    clean_empty_lines=True,
    split_by="word",
    split_length=200,
    split_overlap=20,
    split_respect_sentence_boundary=True,
)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
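
To make the `split_length=200` and `split_overlap=20` settings concrete, here is a rough sketch of word-based splitting with overlap in plain Python. It is not Haystack's implementation: it ignores `split_respect_sentence_boundary` and all cleaning options, and the function name `split_words_with_overlap` is hypothetical.

```python
def split_words_with_overlap(text, split_length=200, split_overlap=20):
    """Split text into chunks of at most `split_length` words.

    Consecutive chunks share `split_overlap` words. Simplified illustration
    only; sentence boundaries are not respected.
    """
    step = split_length - split_overlap
    if step <= 0:
        raise ValueError("split_overlap must be smaller than split_length")

    words = text.split()
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + split_length]))
        # Stop once the current window already reaches the end of the text.
        if start + split_length >= len(words):
            break
    return chunks
```

With these settings each new chunk starts 180 words after the previous one, so a sentence cut at a chunk boundary usually appears whole in at least one of the two neighboring chunks.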


# 7. Add the required nodes to the Pipeline

In [10]:
indexing_pipeline.add_node(component=text_converter, name="TextConverter", inputs=["File"])
indexing_pipeline.add_node(component=preprocessor, name="PreProcessor", inputs=["TextConverter"])
indexing_pipeline.add_node(component=document_store, name="DocumentStore", inputs=["PreProcessor"])

# 8. Index the document files into the DocumentStore

In [None]:
import os

files_to_index = [os.path.join(doc_dir, f) for f in os.listdir(doc_dir)]
indexing_pipeline.run_batch(file_paths=files_to_index)

# 9. Initialize the Retriever

In [12]:
from haystack.nodes import BM25Retriever

retriever = BM25Retriever(document_store=document_store)
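
`BM25Retriever` delegates ranking to Elasticsearch. For intuition about what a BM25 score rewards, the sketch below implements a textbook BM25 formula in plain Python. It is an illustration under stated assumptions, not Elasticsearch's exact scoring: tokenization is a naive lowercase whitespace split, and `k1=1.2`, `b=0.75` are commonly used defaults. The function name `bm25_scores` is hypothetical.

```python
import math
from collections import Counter


def bm25_scores(query, documents, k1=1.2, b=0.75):
    """Score each document against the query with a textbook BM25 formula."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: the number of documents containing each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            tf = term_freq.get(term, 0)
            if tf == 0:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Term frequency saturates, and long documents are penalized.
            length_norm = 1 - b + b * len(tokens) / avg_len
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores
```

A document that matches several distinct query terms tends to outrank one that only repeats a single common term, which is why keyword-rich questions work well with this retriever.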

# 10. Initialize the Reader

In [13]:
from haystack.nodes import FARMReader

reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2", use_gpu=True)

INFO:haystack.modeling.utils:Using devices: CUDA:0 - Number of GPUs: 1


INFO:haystack.modeling.model.language_model: * LOADING MODEL: 'deepset/roberta-base-squad2' (Roberta)


INFO:haystack.modeling.model.language_model:Auto-detected model language: english
INFO:haystack.modeling.model.language_model:Loaded 'deepset/roberta-base-squad2' (Roberta model) from model hub.


INFO:haystack.modeling.utils:Using devices: CUDA:0 - Number of GPUs: 1


# 11. Build the QA pipeline from the Retriever and Reader

In [14]:
from haystack import Pipeline

querying_pipeline = Pipeline()
querying_pipeline.add_node(component=retriever, name="Retriever", inputs=["Query"])
querying_pipeline.add_node(component=reader, name="Reader", inputs=["Retriever"])

# 12. Ask the QA pipeline a question

In [29]:
prediction = querying_pipeline.run(
    query="Who is the White Wolf?",
    params={"Retriever": {"top_k": 10}, "Reader": {"top_k": 5}}
)

Inferencing Samples:   0%|          | 0/1 [00:00<?, ? Batches/s]
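
The `prediction` returned by `run` is a dictionary. The sketch below shows one way to pull the answers out of it programmatically. It assumes the dictionary has an `"answers"` key holding objects with `.answer` and `.score` attributes, as in Haystack 1.x; the helper `top_answers` is hypothetical, and the stand-in data is invented so the code runs without Haystack.

```python
from types import SimpleNamespace


def top_answers(prediction, min_score=0.0):
    """Return (answer, score) pairs at or above `min_score`, best first."""
    answers = prediction.get("answers", [])
    pairs = [(a.answer, a.score) for a in answers if a.score >= min_score]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


# Stand-in objects that mimic the assumed `.answer` and `.score` attributes.
fake_prediction = {
    "answers": [
        SimpleNamespace(answer="Geralt", score=0.75),
        SimpleNamespace(answer="Gwynbleidd", score=0.88),
    ]
}

for answer, score in top_answers(fake_prediction, min_score=0.8):
    print(f"{score:.2f}  {answer}")
```

Filtering on a score threshold is a design choice for this sketch; a lower `top_k` for the Reader is another way to limit how many answers come back.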

# 13. Get answers from the QA pipeline

In [30]:
from haystack.utils import print_answers

print_answers(prediction, details="medium")

'Query: Who is the White Wolf?'
'Answers:'
[   {   'answer': 'Gwynbleidd',
        'context': 'et broken, clinging to her.\n'
                   'The wounded man was the witcher, known as Gwynbleidd, '
                   'White Wolf.\n'
                   'Initially, the dryads had not known what to do.\n'
                   'The blee',
        'score': 0.8780103325843811},
    {   'answer': 'Geralt of Rivia',
        'context': ' A child of destiny ... A child of Elder Blood, the blood '
                   'of elves.\n'
                   'Geralt of Rivia, the White Wolf, and his destiny. No, no, '
                   "that's a legend. A poet'",
        'score': 0.8085618019104004},
    {   'answer': 'Geralt',
        'context': "ize what bird you have caught in your snares.''The "
                   "witcher?''Of course, Geralt, who is called the White "
                   'Wolf.\n'
                   'The same rogue who laid claim to the rig',
        'score': 0.7469651103019714},
  