<a href="https://colab.research.google.com/github/kynthesis/HaystackResearch/blob/main/9_Generation_Pipeline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **How to Build a Question Generation Pipeline**



# 1. Check the GPU runtime

In [1]:
%%bash

nvidia-smi

Sun Jul  2 13:09:57 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  NVIDIA A100-SXM...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   34C    P0    46W / 400W |      0MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces
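
The same check can be done from Python, which is handy when a later cell should branch on GPU availability. This is a minimal sketch, not part of the original notebook: the helper name `list_gpus` is hypothetical, and it only relies on `nvidia-smi -L`, which lists one GPU per line.

```python
import shutil
import subprocess


def list_gpus() -> list[str]:
    """Return one line per GPU reported by `nvidia-smi -L`, or [] if unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []  # NVIDIA driver tools are not on PATH
    try:
        completed = subprocess.run(
            ["nvidia-smi", "-L"], capture_output=True, text=True, timeout=30, check=False
        )
    except (OSError, subprocess.TimeoutExpired):
        return []
    if completed.returncode != 0:
        return []  # tool exists but could not talk to a driver
    return [line.strip() for line in completed.stdout.splitlines() if line.strip()]


gpus = list_gpus()
print(f"{len(gpus)} GPU(s) visible" if gpus else "No GPU visible; expect slower runs")
```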

# 2. Install Haystack

In [None]:
%%bash

pip install --upgrade pip
pip install "farm-haystack[colab,elasticsearch,inference]"
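
The imports used in this notebook (`haystack.nodes`, `haystack.pipelines`) belong to the Haystack 1.x API that `farm-haystack` provides, so it can help to confirm which version actually got installed. A small sketch using only the standard library; the helper name `installed_version` is hypothetical.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


version = installed_version("farm-haystack")
print(version or "farm-haystack is not installed in this environment")
```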

# 3. Import the required packages

In [4]:
from pprint import pprint
from tqdm.auto import tqdm
from haystack.nodes import QuestionGenerator, BM25Retriever, FARMReader
from haystack.document_stores import ElasticsearchDocumentStore
from haystack.pipelines import (
    QuestionGenerationPipeline,
    RetrieverQuestionGenerationPipeline,
    QuestionAnswerGenerationPipeline,
)
from haystack.utils import launch_es, print_questions

INFO:haystack.telemetry:Haystack sends anonymous usage data to understand the actual usage and steer dev efforts towards features that are most meaningful to users. You can opt-out at anytime by manually setting the environment variable HAYSTACK_TELEMETRY_ENABLED as described for different operating systems in the [documentation page](https://docs.haystack.deepset.ai/docs/telemetry#how-can-i-opt-out). More information at [Telemetry](https://docs.haystack.deepset.ai/docs/telemetry).


# 4. Set up Elasticsearch

In [5]:
# Download and unpack Elasticsearch 7.9.2, then hand the files to the daemon user
! wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.9.2-linux-x86_64.tar.gz -q
! tar -xzf elasticsearch-7.9.2-linux-x86_64.tar.gz
! chown -R daemon:daemon elasticsearch-7.9.2

import os
from subprocess import Popen, PIPE, STDOUT

# Elasticsearch refuses to run as root, so start it as the daemon user (uid 1)
es_server = Popen(
    ["elasticsearch-7.9.2/bin/elasticsearch"], stdout=PIPE, stderr=STDOUT, preexec_fn=lambda: os.setuid(1)  # as daemon
)

! sleep 30
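
A fixed 30-second sleep can be too short on a slow machine and wasteful on a fast one. One alternative is to poll the server until it answers. The sketch below is an assumption-laden illustration rather than the notebook's method: `http_ok` and `wait_until_ready` are hypothetical helpers, and the URL in the usage note assumes Elasticsearch's default port 9200 on localhost.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_ok(url: str) -> bool:
    """Return True if a GET on `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=5) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        return False  # connection refused, timeout, or HTTP error


def wait_until_ready(
    probe: Callable[[], bool], timeout_s: float = 120.0, interval_s: float = 2.0
) -> bool:
    """Call `probe` repeatedly until it returns True or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

In the notebook this could replace the sleep with `wait_until_ready(lambda: http_ok("http://localhost:9200"))`, which returns `False` instead of hanging if the server never comes up.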

# 5. Prepare the text content

In [6]:
text1 = "Iron Man is a superhero appearing in American comic books published by Marvel Comics. Co-created by writer and editor Stan Lee, developed by scripter Larry Lieber, and designed by artists Don Heck and Jack Kirby, the character first appeared in Tales of Suspense #39 (cover dated March 1963), and received his own title in Iron Man #1 (May 1968). In 1963, the character founded the Avengers superhero team with Thor, Ant-Man, Wasp and the Hulk."
text2 = "RMS Titanic was a British passenger liner, operated by the White Star Line, that sank in the North Atlantic Ocean on 15 April 1912 after striking an iceberg during her maiden voyage from Southampton, England to New York City, United States. Of the estimated 2,224 passengers and crew aboard, more than 1,500 died, making it the deadliest sinking of a single ship up to that time. It remains the deadliest peacetime sinking of an ocean liner or cruise ship. The disaster drew public attention, provided foundational material for the disaster film genre, and has inspired many artistic works."
text3 = "Counter-Strike (CS) is a series of multiplayer tactical first-person shooter video games in which teams of terrorists battle to perpetrate an act of terror (bombing, hostage-taking, assassination) while counter-terrorists try to prevent it (bomb defusal, hostage rescue, escort mission). The series began on Windows in 1999 with the release of the first game, Counter-Strike. It was initially released as a modification ('mod') for Half-Life that was designed by Minh 'Gooseman' Le and Jess 'Cliffe' Cliffe before the rights to the mod's intellectual property were acquired by Valve, the developers of Half-Life, who then turned Counter-Strike into a retail product released in 2000."

docs = [{"content": text1}, {"content": text2}, {"content": text3}]
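
Haystack 1.x document dictionaries also accept an optional `meta` dictionary next to `content`, which makes it easier to tell documents apart later. The sketch below builds such dictionaries from named passages; the helper `build_docs` and the `name` labels are hypothetical and not part of the original notebook.

```python
def build_docs(texts_by_name: dict[str, str]) -> list[dict]:
    """Turn {name: text} into Haystack-style document dicts, skipping empty texts."""
    docs = []
    for name, text in texts_by_name.items():
        content = text.strip()
        if not content:
            continue  # skip empty passages rather than indexing them
        docs.append({"content": content, "meta": {"name": name}})
    return docs


example_docs = build_docs({"iron_man": "Iron Man is a superhero.", "empty": "   "})
print(example_docs)
```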

# 6. Index the documents into the DocumentStore

In [7]:
document_store = ElasticsearchDocumentStore()
document_store.write_documents(docs)

# 7. Basic Question Generation

In [8]:
# Use the node's default question generation model
question_generator = QuestionGenerator()

question_generation_pipeline = QuestionGenerationPipeline(question_generator)
# Iterating over the document store yields each indexed document in turn
for idx, document in enumerate(document_store):
    print(f"\n * Generating questions for document {idx}: {document.content[:100]}...\n")
    result = question_generation_pipeline.run(documents=[document])
    print_questions(result)

INFO:haystack.modeling.utils:Using devices: CUDA:0 - Number of GPUs: 1
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
Using sep_token, but it is not set yet.

 * Generating questions for document 0: Iron Man is a superhero appearing in American comic books published by Marvel Comics. Co-created by ...


Generated questions:
 - Who is the creator of Iron Man?
 - What comic book was Iron Man first published in?
 - When was the first issue of Tales of Suspense published?
 - Who designed Iron Man for Marvel?
 - When was the cover of Tales of Suspense published?
 - When was Iron Man #1 released?
 - Who did Thor form the Avengers team with?

 * Generating questions for document 1: RMS Titanic was a British passenger liner, operated by the White Star Line, that sank in the North A...


Generated questions:
 - What was the name of the British passenger liner that sank in the North Atlantic Ocean on April 15, 1912?
 - When did the RMS Titanic sink?
 - How many passengers and crew were aboard the Titanic?
 - How many people died in the sinking of the USS Enterprise?
 - How many of the passengers and crew died?
 - What is the deadliest sinking in peac

# 8. Question Generation by topic

In [9]:
# Retrieve documents with BM25, then generate questions only for the matches
retriever = BM25Retriever(document_store=document_store)
rqg_pipeline = RetrieverQuestionGenerationPipeline(retriever, question_generator)

print("\n * Generating questions for documents matching the query 'Titanic'\n")
result = rqg_pipeline.run(query="Titanic")
print_questions(result)


 * Generating questions for documents matching the query 'Titanic'


Generated questions:
 - What was the name of the British passenger liner that sank in the North Atlantic Ocean on April 15, 1912?
 - When did the RMS Titanic sink?
 - How many passengers and crew were aboard the Titanic?
 - How many people died in the sinking of the USS Enterprise?
 - How many of the passengers and crew died?
 - What is the deadliest sinking in peacetime of an ocean liner or cruise ship?
 - What drew public attention?
 - What genre of film was disaster a foundation for?
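
The output above includes a question about the "USS Enterprise", which the Titanic passage never mentions. A rough way to flag such questions is to look for capitalized words in the question that do not occur in the source text. This is only a heuristic sketch with a hypothetical helper name (`ungrounded_terms`); it would miss lowercase errors and may flag legitimate paraphrases.

```python
import re

WORD = re.compile(r"[A-Za-z][A-Za-z'-]*")


def ungrounded_terms(question: str, source_text: str) -> list[str]:
    """Return capitalized words in `question` (after the first word) absent from `source_text`."""
    source_words = {w.lower() for w in WORD.findall(source_text)}
    question_words = WORD.findall(question)
    return [
        w for w in question_words[1:]  # the first word is capitalized by grammar alone
        if w[0].isupper() and w.lower() not in source_words
    ]


# Excerpt of the Titanic passage used earlier in this notebook
excerpt = (
    "RMS Titanic was a British passenger liner, operated by the White Star Line, "
    "that sank in the North Atlantic Ocean on 15 April 1912 after striking an iceberg"
)

print(ungrounded_terms("When did the RMS Titanic sink?", excerpt))
print(ungrounded_terms("How many people died in the sinking of the USS Enterprise?", excerpt))
```

The first call prints an empty list and the second prints `['USS', 'Enterprise']`, so the second question could be dropped or reviewed by hand.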


# 9. Question Generation with answers

In [10]:
# The reader extracts an answer to each generated question from the same document
reader = FARMReader("deepset/roberta-base-squad2")
qag_pipeline = QuestionAnswerGenerationPipeline(question_generator, reader)
for idx, document in enumerate(tqdm(document_store)):
    print(f"\n * Generating questions and answers for document {idx}: {document.content[:100]}...\n")
    result = qag_pipeline.run(documents=[document])
    print_questions(result)

INFO:haystack.modeling.utils:Using devices: CUDA:0 - Number of GPUs: 1
INFO:haystack.modeling.utils:Using devices: CUDA:0 - Number of GPUs: 1
INFO:haystack.modeling.model.language_model: * LOADING MODEL: 'deepset/roberta-base-squad2' (Roberta)
INFO:haystack.modeling.model.language_model:Auto-detected model language: english
INFO:haystack.modeling.model.language_model:Loaded 'deepset/roberta-base-squad2' (Roberta model) from model hub.
INFO:haystack.modeling.utils:Using devices: CUDA:0 - Number of GPUs: 1


 * Generating questions and answers for document 0: Iron Man is a superhero appearing in American comic books published by Marvel Comics. Co-created by ...




Generated pairs:
 - Q: Who is the creator of Iron Man?
      A: Stan Lee
 - Q: What comic book was Iron Man first published in?
      A: Tales of Suspense
 - Q: When was the first issue of Tales of Suspense published?
      A: March 1963
 - Q: Who designed Iron Man for Marvel?
      A: Don Heck and Jack Kirby
 - Q: When was the cover of Tales of Suspense published?
      A: March 1963
 - Q: When was Iron Man #1 released?
      A: May 1968
 - Q: Who did Thor form the Avengers team with?
      A: Ant-Man, Wasp and the Hulk

 * Generating questions and answers for document 1: RMS Titanic was a British passenger liner, operated by the White Star Line, that sank in the North A...




Generated pairs:
 - Q: What was the name of the British passenger liner that sank in the North Atlantic Ocean on April 15, 1912?
      A: RMS Titanic
 - Q: When did the RMS Titanic sink?
      A: 15 April 1912
 - Q: How many passengers and crew were aboard the Titanic?
      A: 2,224
 - Q: How many people died in the sinking of the USS Enterprise?
      A: more than 1,500
 - Q: How many of the passengers and crew died?
      A: more than 1,500
 - Q: What is the deadliest sinking in peacetime of an ocean liner or cruise ship?
      A: RMS Titanic
 - Q: What drew public attention?
      A: The disaster
 - Q: What genre of film was disaster a foundation for?
      A: disaster film

 * Generating questions and answers for document 2: Counter-Strike (CS) is a series of multiplayer tactical first-person shooter video games in which te...




Generated pairs:
 - Q: What is the name of the first-person shooter video game series Counter-Strike?
      A: CS
 - Q: When did the series begin on Windows?
      A: 1999
 - Q: What was the first game in the series?
      A: Counter-Strike
 - Q: How did counter-terrorists try to prevent an act of terror?
      A: bomb defusal, hostage rescue, escort mission
 - Q: When did the Counter-Strike series begin on Windows?
      A: 1999
 - Q: What was the name of the first game to be released on Windows in 1999?
      A: Counter-Strike
 - Q: Who designed the mod for Half-Life?
      A: Minh 'Gooseman' Le and Jess 'Cliffe' Cliffe
 - Q: Which company acquired the rights to the mod's intellectual property?
      A: Valve
 - Q: What company acquired Counter-Strike's intellectual property?
      A: Valve
 - Q: What company developed Half-Life?
      A: Valve
 - Q: When was Counter Strike released for sale?
      A: 2000
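
Several generated questions in the Counter-Strike output are near-duplicates that share one answer. Grouping the pairs by normalized answer makes these easy to spot. The sketch below works on plain `(question, answer)` tuples copied from the output above, not on the pipeline's result object; `group_by_answer` is a hypothetical helper.

```python
from collections import defaultdict


def group_by_answer(pairs: list[tuple[str, str]]) -> dict[str, list[str]]:
    """Group questions by their answer, ignoring case and extra whitespace."""
    grouped = defaultdict(list)
    for question, answer in pairs:
        key = " ".join(answer.lower().split())
        grouped[key].append(question)
    return dict(grouped)


pairs = [
    ("When did the series begin on Windows?", "1999"),
    ("When did the Counter-Strike series begin on Windows?", "1999"),
    ("What company developed Half-Life?", "Valve"),
    ("When was Counter Strike released for sale?", "2000"),
]

for answer, questions in group_by_answer(pairs).items():
    if len(questions) > 1:
        print(f"{answer!r} answers {len(questions)} questions: {questions}")
```

Sharing an answer does not always mean the questions are redundant (three different Valve questions above ask different things), so the groups are candidates for review rather than automatic removal.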


# 10. Question Generation with machine translation

In [12]:
from haystack.nodes import TransformersTranslator
from haystack.pipelines import TranslationWrapperPipeline

# Replace the English documents with a single Vietnamese passage
text1 = "Việt Nam, quốc hiệu là Cộng hòa Xã hội chủ nghĩa Việt Nam, là một quốc gia nằm ở cực Đông của bán đảo Đông Dương thuộc khu vực Đông Nam Á, giáp với Lào, Campuchia, Trung Quốc, biển Đông và vịnh Thái Lan."
docs = [{"content": text1}]
document_store.delete_documents()
document_store.write_documents(docs)

# Translate the Vietnamese input to English for the pipeline, then translate the results back
in_translator = TransformersTranslator(model_name_or_path="Helsinki-NLP/opus-mt-vi-en")
out_translator = TransformersTranslator(model_name_or_path="Helsinki-NLP/opus-mt-en-vi")

pipeline_with_translation = TranslationWrapperPipeline(
    input_translator=in_translator, output_translator=out_translator, pipeline=qag_pipeline
)

for idx, document in enumerate(tqdm(document_store)):
    print(f"\n * Generating questions and answers for document {idx}: {document.content[:100]}...\n")
    result = pipeline_with_translation.run(documents=[document])
    print_questions(result)

INFO:haystack.modeling.utils:Using devices: CUDA:0 - Number of GPUs: 1
INFO:haystack.modeling.utils:Using devices: CUDA:0 - Number of GPUs: 1



 * Generating questions and answers for document 0: Việt Nam, quốc hiệu là Cộng hòa Xã hội chủ nghĩa Việt Nam, là một quốc gia nằm ở cực Đông của bán đả...




Generated pairs:
 - Q: Quốc Liên của Việt Nam là gì?
      A: Việt Nam, Việt Nam của Việt Nam, là một quốc gia ở phía Đông Đông Đông Bắc
 - Q: Biên giới Việt Nam, Cam-pu-chia, Trung Quốc, Đông Hải và Vịnh Thái Lan ở đâu?
      A: ♪ Vùng xa phía đông của Đông Á ♪
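
The translated output above shows signs of degenerate generation, such as the same word repeated several times in a row. A simple sanity check is to measure the longest run of identical consecutive words. This is a heuristic sketch with a hypothetical helper (`longest_repeat_run`); what run length counts as suspicious is an assumption to tune per language.

```python
import string


def longest_repeat_run(text: str) -> int:
    """Return the length of the longest run of identical consecutive words in `text`."""
    tokens = [t.strip(string.punctuation).lower() for t in text.split()]
    tokens = [t for t in tokens if t]  # drop tokens that were only punctuation
    longest = current = 0
    previous = None
    for token in tokens:
        current = current + 1 if token == previous else 1
        longest = max(longest, current)
        previous = token
    return longest


answer = "Việt Nam, Việt Nam của Việt Nam, là một quốc gia ở phía Đông Đông Đông Bắc"
print(longest_repeat_run(answer))
```

For the first answer above the function returns 3, so a threshold of 3 or more would flag it for manual review.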
