<a href="https://colab.research.google.com/github/kynthesis/HaystackResearch/blob/main/Haystack_Seminal_Demo_V2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Haystack Presentation Demo V2**



# 1. Kiểm tra GPU runtime

In [1]:
%%bash

nvidia-smi

Mon Jul 10 05:56:16 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla V100-SXM2...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   34C    P0    23W / 300W |      0MiB / 16384MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

# 2. Cài đặt Haystack

In [None]:
%%bash

pip install --upgrade pip
pip install farm-haystack[colab,preprocessing,elasticsearch,inference]==1.17.2

# 3. Chuẩn bị các file tài liệu

In [3]:
from haystack.utils import fetch_archive_from_http

doc_dir = "data/mcu"

fetch_archive_from_http(
    url="https://github.com/kynthesis/HaystackResearch/raw/main/mcu.zip",
    output_dir=doc_dir,
)

True

# 4. Cài đặt Elasticsearch

In [4]:
%%bash

wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.9.2-linux-x86_64.tar.gz -q
tar -xzf elasticsearch-7.9.2-linux-x86_64.tar.gz
chown -R daemon:daemon elasticsearch-7.9.2

In [5]:
%%bash --bg

sudo -u daemon -- elasticsearch-7.9.2/bin/elasticsearch

In [6]:
import time

time.sleep(30)

# 5. Khởi tạo ElasticsearchDocumentStore

In [7]:
import os
from haystack.document_stores import ElasticsearchDocumentStore

document_store = ElasticsearchDocumentStore(
    host=os.environ.get("ELASTICSEARCH_HOST", "localhost"),
    username="",
    password="",
    index="mcu",
    similarity="dot_product",
    embedding_dim=768
)

# 6. Khởi tạo IndexingPipeline, TextConverter, và PreProcessor

In [8]:
from haystack import Pipeline
from haystack.nodes import TextConverter, PreProcessor

indexing_pipeline = Pipeline()
text_converter = TextConverter()
preprocessor = PreProcessor(
    clean_whitespace=True,
    clean_header_footer=True,
    clean_empty_lines=True,
    split_by="word",
    split_length=200,
    split_overlap=20,
    split_respect_sentence_boundary=True,
)

indexing_pipeline.add_node(component=text_converter, name="TextConverter", inputs=["File"])
indexing_pipeline.add_node(component=preprocessor, name="PreProcessor", inputs=["TextConverter"])
indexing_pipeline.add_node(component=document_store, name="DocumentStore", inputs=["PreProcessor"])

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


# 7. Indexing các file tài liệu vào ElasticsearchDocumentStore

In [None]:
import os

files_to_index = [doc_dir + "/" + f for f in os.listdir(doc_dir)]
indexing_pipeline.run_batch(file_paths=files_to_index)

# 8. Khởi tạo Retriever

In [11]:
from haystack.nodes import DensePassageRetriever

retriever = DensePassageRetriever(
    document_store=document_store,
    query_embedding_model="facebook/dpr-question_encoder-single-nq-base",
    passage_embedding_model="facebook/dpr-ctx_encoder-single-nq-base"
)

document_store.update_embeddings(retriever)

Downloading (…)okenizer_config.json:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/493 [00:00<?, ?B/s]

Downloading (…)solve/main/vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/438M [00:00<?, ?B/s]

Downloading (…)okenizer_config.json:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/492 [00:00<?, ?B/s]

Downloading (…)solve/main/vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/438M [00:00<?, ?B/s]

Updating embeddings:   0%|          | 0/496 [00:00<?, ? Docs/s]

Create embeddings:   0%|          | 0/496 [00:00<?, ? Docs/s]

# 9. Khởi tạo SemanticSearchPipeline

In [12]:
semantic_search_pipeline = Pipeline()

semantic_search_pipeline.add_node(component=retriever, name="Retriever", inputs=["Query"])

# 10. Demo SearchSearchPipeline

In [13]:
result = semantic_search_pipeline.run(
    "The green guy",
    params={"Retriever": {"top_k": 5}}
)

from haystack.utils import print_documents

print_documents(result, max_text_len=200, print_name=True, print_meta=True)


Query: The green guy

{   'content': 'Bruce Banner, a scientist on the run from the U.S. Government, '
               'must find a cure for the monster he turns into whenever he '
               'loses his temper.\n'
               '\n'
               "Depicting the events after the Gamma Bomb. 'The Incredible ...",
    'meta': {   '_split_id': 0,
                '_split_overlap': [   {   'doc_id': 'b45507df283cdf1071b66a4f50a81074',
                                          'range': [963, 1106]}]},
    'name': None}

{   'content': 'He explains that his mom made him the mix tape of her favorite '
               'songs. She listens and likes it. He asks her to dance, but she '
               "doesn't trust him. He says it reminds him of an old fable "
               'about other peop...',
    'meta': {   '_split_id': 9,
                '_split_overlap': [   {   'doc_id': '8c4ab0f3001b9c62484358842a71f9ba',
                                          'range': [0, 144]},
                 

# 11. Khởi tạo Reader

In [14]:
from haystack.nodes import FARMReader

reader = FARMReader("ahotrod/albert_xxlargev1_squad2_512")

Downloading (…)lve/main/config.json:   0%|          | 0.00/715 [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/890M [00:00<?, ?B/s]

Downloading (…)okenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

Downloading (…)ve/main/spiece.model:   0%|          | 0.00/760k [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/156 [00:00<?, ?B/s]

# 11. Khởi tạo ExtractiveQAPipeline

In [15]:
extractive_qa_pipeline = Pipeline()

extractive_qa_pipeline.add_node(component=retriever, name="Retriever", inputs=["Query"])
extractive_qa_pipeline.add_node(component=reader, name="Reader", inputs=["Retriever"])

# 12. Demo ExtractiveQAPipeline

In [16]:
result = extractive_qa_pipeline.run(
    query="Who is the guy with the Mind Stone?",
    params={"Retriever": {"top_k": 10}, "Reader": {"top_k": 5}}
)

from haystack.utils import print_answers

print_answers(result, details="medium", max_text_len=200)

Inferencing Samples:   0%|          | 0/1 [00:00<?, ? Batches/s]

'Query: Who is the guy with the Mind Stone?'
'Answers:'
[   {   'answer': "'Vision'.",
        'context': '" metal man (played by Paul Bettany), who they take to '
                   "referring to as 'Vision'. Wanda is instantly smitten with "
                   'Vision and even attempts to flirt wit',
        'score': 0.9543946385383606},
    {   'answer': 'Strucker',
        'context': 'Bruce, Tony and Steve manage to penetrate the fortress. '
                   'Steve captures Strucker as he is preparing to flee. '
                   'Strucker monologues for a bit, long enough',
        'score': 0.0014318835455924273},
    {   'answer': 'Dr. Arnim Zola',
        'context': 'r. A German voice speaks and analyzes the two. The voice '
                   'belongs to Dr. Arnim Zola (Toby Jones), the HYDRA '
                   'scientist responsible for building Red Skul',
        'score': 0.0011545359157025814},
    {   'answer': 'Yondu',
        'context': 'vaporizes Ronan. Rocket colle

# 13. Khởi tạo Generator

In [17]:
from haystack.nodes import Seq2SeqGenerator

generator = Seq2SeqGenerator(model_name_or_path="vblagoje/bart_lfqa")

Downloading (…)okenizer_config.json:   0%|          | 0.00/27.0 [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/1.32k [00:00<?, ?B/s]

Downloading (…)olve/main/vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

Downloading (…)olve/main/merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/1.63G [00:00<?, ?B/s]

# 14. Khởi tạo GenerativeQAPipeline

In [18]:
generative_qa_pipeline = Pipeline()

generative_qa_pipeline.add_node(component=retriever, name="Retriever", inputs=["Query"])
generative_qa_pipeline.add_node(component=reader, name="Generator", inputs=["Retriever"])

# 15. Demo GenerativeQAPipeline

In [28]:
result = extractive_qa_pipeline.run(
    query="Tell me something about Dr.Strange?",
    params={"Retriever": {"top_k": 5}}
)

from haystack.utils import print_answers

print_answers(result, details="medium", max_text_len=200)

Inferencing Samples:   0%|          | 0/1 [00:00<?, ? Batches/s]

'Query: Tell me something about Dr.Strange?'
'Answers:'
[   {   'answer': 'a surgeon with a God complex',
        'context': 'a deadly enemy who is out to destroy it.\n'
                   '\n'
                   'Stephen Strange is a surgeon with a God complex. When he '
                   'gets in an accident and his hands are injured, he f',
        'score': 0.7264957427978516},
    {   'answer': 'brilliant but arrogant',
        'context': 'Strange and Mordo save the world?\n'
                   '\n'
                   'Sadly, Dr Stephen Strange, a brilliant but arrogant New '
                   'York City neurosurgeon on top of his game, is more '
                   'interest',
        'score': 0.7238306999206543},
    {   'answer': 'brilliant but arrogant New York City neurosurgeon on top of '
                  'his game, is more interested in fame and his perfect '
                  'success rate than saving lives',
        'context': ', a brilliant but arrogant New York City ne

# 16. Khởi tạo Agent bằng API Key OpenAI

In [46]:
from getpass import getpass

api_key_prompt = "Enter OpenAI API key:"
api_key = getpass(api_key_prompt)

from haystack.agents import Agent
from haystack.nodes import PromptNode

prompt_node = PromptNode(model_name_or_path="text-davinci-003", api_key=api_key, stop_words=["Observation:"])
agent = Agent(prompt_node=prompt_node)

Enter OpenAI API key:··········


# 17. Cung cấp Tool cho Agent

In [None]:
from haystack.utils import fetch_archive_from_http

doc_dir = "data/mcu"

fetch_archive_from_http(
    url="https://github.com/kynthesis/HaystackResearch/raw/main/mcu.zip",
    output_dir=doc_dir,
)

from haystack.document_stores import InMemoryDocumentStore

document_store = InMemoryDocumentStore(use_bm25=True)

from haystack import Pipeline
from haystack.nodes import TextConverter, PreProcessor

indexing_pipeline = Pipeline()
text_converter = TextConverter()
preprocessor = PreProcessor(
    clean_whitespace=True,
    clean_header_footer=True,
    clean_empty_lines=True,
    split_by="word",
    split_length=200,
    split_overlap=20,
    split_respect_sentence_boundary=True,
)

indexing_pipeline.add_node(component=text_converter, name="TextConverter", inputs=["File"])
indexing_pipeline.add_node(component=preprocessor, name="PreProcessor", inputs=["TextConverter"])
indexing_pipeline.add_node(component=document_store, name="DocumentStore", inputs=["PreProcessor"])

import os

files_to_index = [doc_dir + "/" + f for f in os.listdir(doc_dir)]
indexing_pipeline.run_batch(file_paths=files_to_index)

In [77]:
from haystack.nodes import EmbeddingRetriever, FARMReader
from haystack.pipelines import ExtractiveQAPipeline

retriever = EmbeddingRetriever(
    document_store=document_store, embedding_model="sentence-transformers/multi-qa-mpnet-base-dot-v1", use_gpu=True
)
document_store.update_embeddings(retriever=retriever)
reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2", use_gpu=True)
mcu_qa = ExtractiveQAPipeline(reader=reader, retriever=retriever)

from haystack.utils import print_answers

result = mcu_qa.run("What is the nickname of the man who turned The Mad Titan into ash?")

print_answers(result, "minimum")

Updating Embedding:   0%|          | 0/496 [00:00<?, ? docs/s]

Batches:   0%|          | 0/16 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Inferencing Samples:   0%|          | 0/1 [00:00<?, ? Batches/s]

'Query: What is the nickname of the man who turned The Mad Titan into ash?'
'Answers:'
[   {   'answer': 'Thanos',
        'context': 'The mad titan Thanos has begun his quest to obtain all six '
                   'Infinity Stones, which will give him the power to wipe out '
                   'half of all life in the universe'},
    {   'answer': 'Mighty Thor',
        'context': "'s surprise - inexplicably wields his magical hammer, "
                   'Mjolnir, as the Mighty Thor. Together, they embark upon a '
                   'harrowing cosmic adventure to uncover '},
    {   'answer': 'THE INCREDIBLE HULK',
        'context': 'life as Bruce Banner or the creature he could permanently '
                   'become: THE INCREDIBLE HULK.\n'
                   '\n'
                   'Over the opening credits, we see how the Hulk was born. '
                   'Bruce '},
    {   'answer': 'Mysterio',
        'context': 'ates, and after watching a news report on him, they start '
 

In [47]:
from haystack.agents import Tool

search_tool = Tool(
    name="MCU_QA",
    pipeline_or_node=mcu_qa,
    description="useful for when you need to answer questions related to the Marvel's MCU.",
    output_variable="answers",
)

agent.add_tool(search_tool)

# 18. Demo Agent OpenAI

In [76]:
result = agent.run("What is the nickname of the man who turned The Mad Titan into ash?")

print(result["transcript"].split("---")[0])


Agent zero-shot-react started with {'query': 'What is the nickname of the man who turned The Mad Titan into ash?', 'params': None}
[32m find[0m[32m out[0m[32m who[0m[32m The[0m[32m Mad[0m[32m Titan[0m[32m is[0m[32m.[0m[32m
[0m[32mTool[0m[32m:[0m[32m MC[0m[32mU[0m[32m_[0m[32mQ[0m[32mA[0m[32m
[0m[32mTool[0m[32m Input[0m[32m:[0m[32m Who[0m[32m is[0m[32m The[0m[32m Mad[0m[32m Titan[0m[32m?[0m[32m
[0m

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Inferencing Samples:   0%|          | 0/1 [00:00<?, ? Batches/s]

Observation: [33mThanos[0m
Thought: [32m Now[0m[32m that[0m[32m I[0m[32m know[0m[32m who[0m[32m The[0m[32m Mad[0m[32m Titan[0m[32m is[0m[32m,[0m[32m I[0m[32m can[0m[32m find[0m[32m out[0m[32m the[0m[32m nickname[0m[32m of[0m[32m the[0m[32m man[0m[32m who[0m[32m turned[0m[32m him[0m[32m into[0m[32m ash[0m[32m.[0m[32m
[0m[32mTool[0m[32m:[0m[32m MC[0m[32mU[0m[32m_[0m[32mQ[0m[32mA[0m[32m
[0m[32mTool[0m[32m Input[0m[32m:[0m[32m Who[0m[32m turned[0m[32m Than[0m[32mos[0m[32m into[0m[32m ash[0m[32m?[0m[32m
[0m[32m
[0m

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Inferencing Samples:   0%|          | 0/1 [00:00<?, ? Batches/s]

Observation: [33mStark[0m
Thought: [32m Now[0m[32m that[0m[32m I[0m[32m found[0m[32m the[0m[32m identity[0m[32m of[0m[32m the[0m[32m man[0m[32m,[0m[32m I[0m[32m can[0m[32m figure[0m[32m out[0m[32m his[0m[32m nickname[0m[32m.[0m[32m
[0m[32mTool[0m[32m:[0m[32m MC[0m[32mU[0m[32m_[0m[32mQ[0m[32mA[0m[32m
[0m[32mTool[0m[32m Input[0m[32m:[0m[32m What[0m[32m is[0m[32m the[0m[32m nickname[0m[32m of[0m[32m Stark[0m[32m?[0m[32m
[0m[32m
[0m

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Inferencing Samples:   0%|          | 0/1 [00:00<?, ? Batches/s]

Observation: [33mIron Man[0m
Thought: [32m With[0m[32m this[0m[32m information[0m[32m,[0m[32m I[0m[32m can[0m[32m answer[0m[32m the[0m[32m question[0m[32m.[0m[32m
[0m[32mFinal[0m[32m Answer[0m[32m:[0m[32m Iron[0m[32m Man[0m find out who The Mad Titan is.
Tool: MCU_QA
Tool Input: Who is The Mad Titan?

Observation: Thanos
Thought: Now that I know who The Mad Titan is, I can find out the nickname of the man who turned him into ash.
Tool: MCU_QA
Tool Input: Who turned Thanos into ash?


Observation: Stark
Thought: Now that I found the identity of the man, I can figure out his nickname.
Tool: MCU_QA
Tool Input: What is the nickname of Stark?


Observation: Iron Man
Thought: With this information, I can answer the question.
Final Answer: Iron Man


# 19. Khởi tạo Chatbot bằng API HuggingFace

In [80]:
from getpass import getpass

model_api_key = getpass("Enter model provider API key:")

from haystack.nodes import PromptNode

model_name = "OpenAssistant/oasst-sft-1-pythia-12b"
prompt_node = PromptNode(model_name, api_key=model_api_key, max_length=256)

from haystack.agents.memory import ConversationSummaryMemory

summary_memory = ConversationSummaryMemory(prompt_node)

from haystack.agents.conversational import ConversationalAgent

conversational_agent = ConversationalAgent(prompt_node=prompt_node, memory=summary_memory)
conversational_agent.add_tool(search_tool)

Enter model provider API key:··········


# 20. Demo Chatbot OpenAssistant

In [85]:
import ipywidgets as widgets
from IPython.display import clear_output

## Text Input
user_input = widgets.Textarea(
    value="",
    placeholder="Type your prompt here",
    disabled=False,
    style={"description_width": "initial"},
    layout=widgets.Layout(width="100%", height="100%"),
)

## Submit Button
submit_button = widgets.Button(
    description="Submit", button_style="success", layout=widgets.Layout(width="100%", height="80%")
)


def on_button_clicked(b):
    user_prompt = user_input.value
    user_input.value = ""
    print("\nUser:\n", user_prompt)
    conversational_agent.run(user_prompt)


submit_button.on_click(on_button_clicked)

## Show Memory Button
memory_button = widgets.Button(
    description="Show Memory", button_style="info", layout=widgets.Layout(width="100%", height="100%")
)


def on_memory_button_clicked(b):
    memory = conversational_agent.memory.load()
    if len(memory):
        print("\nMemory:\n", memory)
    else:
        print("Memory is empty")


memory_button.on_click(on_memory_button_clicked)

## Clear Memory Button
clear_button = widgets.Button(
    description="Clear Memory", button_style="warning", layout=widgets.Layout(width="100%", height="100%")
)


def on_clear_button_button_clicked(b):
    conversational_agent.memory.clear()
    print("\nMemory is cleared\n")


clear_button.on_click(on_clear_button_button_clicked)

## Layout
grid = widgets.GridspecLayout(3, 3, height="200px", width="800px", grid_gap="10px")
grid[0, 2] = clear_button
grid[0:2, 0:2] = user_input
grid[2, 0:] = submit_button
grid[1, 2] = memory_button
display(grid)




User:
 Tell me something about Tony Stark

Agent conversational-agent started with {'query': 'Tell me something about Tony Stark', 'params': None}
[32mOk[0m[32m,[0m[32m Tony[0m[32m Stark[0m[32m is[0m[32m a[0m[32m fictional[0m[32m character[0m[32m in[0m[32m the[0m[32m Marvel[0m[32m Cin[0m[32mematic[0m[32m Universe[0m[32m.[0m[32m He[0m[32m is[0m[32m a[0m[32m wealthy[0m[32m industrial[0m[32mist[0m[32m and[0m[32m inventor[0m[32m who[0m[32m is[0m[32m known[0m[32m for[0m[32m his[0m[32m expertise[0m[32m in[0m[32m the[0m[32m field[0m[32m of[0m[32m technology[0m[32m and[0m[32m his[0m[32m philanth[0m[32mropic[0m[32m efforts[0m[32m.[0m[32m
[0m[32m
[0m[32mSt[0m[32mark[0m[32m was[0m[32m born[0m[32m in[0m[32m New[0m[32m York[0m[32m City[0m[32m and[0m[32m grew[0m[32m up[0m[32m with[0m[32m a[0m[32m passion[0m[32m for[0m[32m science[0m[32m and[0m[32m engineering[0m[32m.[0m[32m He