# Tutorial: Creating Your First QA Pipeline with Retrieval-Augmentation

- **Level**: Beginner
- **Time to complete**: 10 minutes
- **Components Used**: [`InMemoryDocumentStore`](https://docs.haystack.deepset.ai/docs/inmemorydocumentstore), [`SentenceTransformersDocumentEmbedder`](https://docs.haystack.deepset.ai/docs/sentencetransformersdocumentembedder), [`SentenceTransformersTextEmbedder`](https://docs.haystack.deepset.ai/docs/sentencetransformerstextembedder), [`InMemoryEmbeddingRetriever`](https://docs.haystack.deepset.ai/docs/inmemoryembeddingretriever), [`PromptBuilder`](https://docs.haystack.deepset.ai/docs/promptbuilder), [`OpenAIChatGenerator`](https://docs.haystack.deepset.ai/docs/openaichatgenerator)
- **Prerequisites**: You must have an [OpenAI API Key](https://platform.openai.com/api-keys).
- **Goal**: After completing this tutorial, you'll have learned the new prompt syntax and how to use PromptBuilder and OpenAIChatGenerator to build a generative question-answering pipeline with retrieval-augmentation.

> This tutorial uses Haystack 2.0. To learn more, read the [Haystack 2.0 announcement](https://haystack.deepset.ai/blog/haystack-2-release) or visit the [Haystack 2.0 Documentation](https://docs.haystack.deepset.ai/docs/intro).

## Overview

This tutorial shows you how to create a generative question-answering pipeline using the retrieval-augmentation ([RAG](https://www.deepset.ai/blog/llms-retrieval-augmentation)) approach with Haystack 2.0. The process involves four main components: [SentenceTransformersTextEmbedder](https://docs.haystack.deepset.ai/docs/sentencetransformerstextembedder) for creating an embedding for the user query, [InMemoryEmbeddingRetriever](https://docs.haystack.deepset.ai/docs/inmemoryembeddingretriever) for fetching relevant documents, [PromptBuilder](https://docs.haystack.deepset.ai/docs/promptbuilder) for creating a template prompt, and [OpenAIChatGenerator](https://docs.haystack.deepset.ai/docs/openaichatgenerator) for generating responses.

For this tutorial, you'll use the Wikipedia pages of [Seven Wonders of the Ancient World](https://en.wikipedia.org/wiki/Wonders_of_the_World) as Documents, but you can replace them with any text you want.


## Preparing the Colab Environment

- [Enable GPU Runtime in Colab](https://docs.haystack.deepset.ai/docs/enabling-gpu-acceleration)
- [Set logging level to INFO](https://docs.haystack.deepset.ai/docs/logging)

## Installing Haystack

Install Haystack 2.0 and other required packages with `pip`:

In [1]:
%%bash

pip install haystack-ai
pip install "datasets>=2.6.1"
pip install "sentence-transformers>=3.0.0"


### Enabling Telemetry

Knowing you're using this tutorial helps us decide where to invest our efforts to build a better product, but you can always opt out by commenting out the following line. See [Telemetry](https://docs.haystack.deepset.ai/docs/enabling-telemetry) for more details.

In [2]:
from haystack.telemetry import tutorial_running

tutorial_running(27)

## Fetching and Indexing Documents

You'll start creating your question-answering system by downloading the data and indexing it, together with its embeddings, in a DocumentStore.

In this tutorial, you will take a simple approach to writing documents and their embeddings into the DocumentStore. For a full indexing pipeline with preprocessing, cleaning and splitting, check out our tutorial on [Preprocessing Different File Types](https://haystack.deepset.ai/tutorials/30_file_type_preprocessing_index_pipeline).


### Initializing the DocumentStore

Initialize a DocumentStore to index your documents. A DocumentStore stores the Documents that the question answering system uses to find answers to your questions. In this tutorial, you'll be using the `InMemoryDocumentStore`.

In [1]:
from haystack.document_stores.in_memory import InMemoryDocumentStore

document_store = InMemoryDocumentStore()


> `InMemoryDocumentStore` is the simplest DocumentStore to get started with. It requires no external dependencies and it's a good option for smaller projects and debugging. But it doesn't scale up so well to larger Document collections, so it's not a good choice for production systems. To learn more about the different types of external databases that Haystack supports, see [DocumentStore Integrations](https://haystack.deepset.ai/integrations?type=Document+Store).

The DocumentStore is now ready. Now it's time to fill it with some Documents.

### Fetch the Data

You'll use the Wikipedia pages of [Seven Wonders of the Ancient World](https://en.wikipedia.org/wiki/Wonders_of_the_World) as Documents. We preprocessed the data and uploaded it to Hugging Face as the [Seven Wonders](https://huggingface.co/datasets/bilgeyucel/seven-wonders) dataset, so you don't need to perform any additional cleaning or splitting.

Fetch the data and convert it into Haystack Documents. The first two cells below load a local Excel work log (`work_daily.xlsx`) instead of the dataset; the dataset-loading code is kept, commented out, in the third cell.

In [None]:
import pandas as pd
from haystack import Document

# Read the Excel file without treating any row as a header
df = pd.read_excel("./work_daily.xlsx", header=None)

# List that collects the resulting Documents
docs = []

# Process the rows in steps of three, starting at the 4th row (skips the header rows)
for i in range(3, len(df), 3):
    try:
        # Skip if the sheet has fewer than the expected 21 columns
        if len(df.columns) < 21:
            continue

        # Columns are grouped in threes (7 groups in a 21-column sheet);
        # start at the 4th column, so columns 0-2 are skipped
        for col in range(3, 19, 3):
            # Date from the first row
            date = df.iloc[0, col]

            # Person's name
            person = df.iloc[i, col]

            # Work items
            yesterday_work = df.iloc[i, col + 1]  # yesterday's work
            today_work = df.iloc[i, col + 2]  # today's work

            # Keep the entry only if it has a date, a person, and at least one work item
            if pd.notna(date) and pd.notna(person) and (pd.notna(yesterday_work) or pd.notna(today_work)):
                content = f"""
                Date: {date.strftime('%Y-%m-%d')}
                Person: {person}
                Yesterday's work: {yesterday_work if pd.notna(yesterday_work) else 'None'}
                Today's work: {today_work if pd.notna(today_work) else 'None'}
                """

                # Create the Document object
                doc = Document(
                    content=content.strip(),  # strip leading and trailing whitespace
                    meta={
                        'date': date,
                        'person': person,
                        'source': 'work_daily.xlsx'
                    }
                )
                docs.append(doc)
                print(content)  # print for inspection

    except Exception as e:
        print(f"Error while processing row {i}: {e}")

print(f"Processed {len(docs)} records in total")

In [5]:
import pandas as pd
from haystack import Document

# Read the Excel file
df = pd.read_excel("./work_daily.xlsx", header=None)

# Convert the whole DataFrame to a single string
content = df.to_string()

# Create one Document object holding the entire sheet
doc = Document(
    content=content,
    meta={
        'source': 'work_daily.xlsx'
    }
)

# Build the document list
docs = [doc]

print("Content length:", len(content))
print("Documents created:", len(docs))

# Next, embed the documents and write them to the DocumentStore
# (doc_embedder is initialized in the Document Embedder section below)
docs_with_embeddings = doc_embedder.run(docs)
document_store.write_documents(docs_with_embeddings["documents"])

Content length: 2855585
Documents created: 1


Batches: 100%|██████████| 1/1 [00:02<00:00,  2.32s/it]


1
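With the whole sheet flattened into one Document of roughly 2.8 million characters, the embedding can only represent a small part of it: `all-MiniLM-L6-v2` truncates input longer than 256 word pieces by default. A minimal sketch of one way around this, assuming plain fixed-size character windows with overlap, is shown below. The `chunk_text` helper and its `size` and `overlap` values are hypothetical and not part of Haystack; each chunk could then be wrapped in its own `Document`.

```python
def chunk_text(text: str, size: int = 1000, overlap: int = 100) -> list[str]:
    """Split text into windows of at most `size` characters.

    Consecutive windows share `overlap` characters, so text cut at a
    boundary is repeated at the start of the next chunk.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")

    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # this window already reached the end of the text
    return chunks


# Hypothetical sample input: 2,500 digits standing in for the sheet text
sample = "".join(str(i % 10) for i in range(2500))
chunks = chunk_text(sample, size=1000, overlap=100)
print(len(chunks), [len(c) for c in chunks])
```

Character windows are the simplest possible split; splitting per row or per person, as the first cell above does, usually keeps each chunk more meaningful.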

In [5]:
# from datasets import load_dataset
# from haystack import Document

# dataset = load_dataset("bilgeyucel/seven-wonders", split="train")
# docs = [Document(content=doc["content"], meta=doc["meta"]) for doc in dataset]


### Initialize a Document Embedder

To store your data in the DocumentStore with embeddings, initialize a [SentenceTransformersDocumentEmbedder](https://docs.haystack.deepset.ai/docs/sentencetransformersdocumentembedder) with the model name and call `warm_up()` to download the embedding model.

> If you'd like, you can use a different [Embedder](https://docs.haystack.deepset.ai/docs/embedders) for your documents.

In [3]:
from haystack.components.embedders import SentenceTransformersDocumentEmbedder

doc_embedder = SentenceTransformersDocumentEmbedder(model="sentence-transformers/all-MiniLM-L6-v2")
doc_embedder.warm_up()

### Write Documents to the DocumentStore

Run the `doc_embedder` with the Documents. The embedder creates an embedding for each Document and stores it in the Document's `embedding` field. Then, you can write the Documents to the DocumentStore with the `write_documents()` method.

In [4]:
from haystack.document_stores.types import DuplicatePolicy

docs_with_embeddings = doc_embedder.run(docs)
# Skip Documents whose ID is already in the store, so that re-running this cell
# does not raise DuplicateDocumentError
document_store.write_documents(docs_with_embeddings["documents"], policy=DuplicatePolicy.SKIP)

Batches: 100%|██████████| 41/41 [00:15<00:00,  2.57it/s]
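
Writing the same Documents a second time fails by default because a Document's ID is derived from its content and metadata, so identical Documents collide. The sketch below illustrates the idea in plain Python with a SHA-256 hash and a dictionary. It is not Haystack's actual ID scheme or store: `TinyStore`, `doc_id`, and the `policy` strings are hypothetical names. In Haystack itself, `write_documents()` accepts a `policy` argument (for example `DuplicatePolicy.SKIP` or `DuplicatePolicy.OVERWRITE`) for the same purpose.

```python
import hashlib


def doc_id(content: str) -> str:
    """Derive a stable ID from the text, so equal content yields equal IDs."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


class TinyStore:
    """Toy in-memory store keyed by content-derived ID (illustration only)."""

    def __init__(self) -> None:
        self.docs: dict[str, str] = {}

    def write(self, contents: list[str], policy: str = "fail") -> int:
        """Write texts and return how many were stored.

        policy: "fail" raises on a duplicate ID, "skip" keeps the stored copy,
        "overwrite" replaces it.
        """
        if policy not in {"fail", "skip", "overwrite"}:
            raise ValueError(f"unknown policy: {policy}")
        written = 0
        for content in contents:
            key = doc_id(content)
            if key in self.docs:
                if policy == "fail":
                    raise ValueError(f"ID '{key}' already exists")
                if policy == "skip":
                    continue
            self.docs[key] = content
            written += 1
        return written


store = TinyStore()
print(store.write(["alpha", "beta"]))           # first write stores both texts
print(store.write(["alpha", "gamma"], "skip"))  # "alpha" is skipped, "gamma" is stored
```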

## Building the RAG Pipeline

The next step is to build a [Pipeline](https://docs.haystack.deepset.ai/docs/pipelines) to generate answers for the user query following the RAG approach. To create the pipeline, you first need to initialize each component, add them to your pipeline, and connect them.

### Initialize a Text Embedder

Initialize a text embedder to create an embedding for the user query. The created embedding will later be used by the Retriever to retrieve relevant documents from the DocumentStore.

> ⚠️ Notice that you used the `sentence-transformers/all-MiniLM-L6-v2` model to create embeddings for your documents earlier. You need to use the same model to embed user queries, so that query and document embeddings live in the same vector space.

In [10]:
from haystack.components.embedders import SentenceTransformersTextEmbedder

text_embedder = SentenceTransformersTextEmbedder(model="sentence-transformers/all-MiniLM-L6-v2")

### Initialize the Retriever

Initialize an [InMemoryEmbeddingRetriever](https://docs.haystack.deepset.ai/docs/inmemoryembeddingretriever) and make it use the InMemoryDocumentStore you initialized earlier in this tutorial. This Retriever fetches the Documents most relevant to the query.

In [6]:
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever

retriever = InMemoryEmbeddingRetriever(
    document_store=document_store,
    top_k=3,  # number of relevant documents to return; adjust as needed
)
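
Conceptually, an embedding retriever scores every stored Document against the query embedding and returns the `top_k` best matches. A minimal sketch of that ranking step, assuming cosine similarity as the scoring function and tiny hand-made 3-dimensional vectors in place of real model embeddings, might look like this. The `cosine` and `top_k_documents` helpers and the toy vectors are hypothetical and not Haystack's implementation.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat zero vectors as dissimilar to everything
    return dot / (norm_a * norm_b)


def top_k_documents(query: list[float], vectors: dict[str, list[float]], top_k: int = 3):
    """Return the top_k (name, score) pairs, highest similarity first."""
    scored = [(name, cosine(query, vector)) for name, vector in vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy "embeddings"; all-MiniLM-L6-v2 would produce 384-dimensional vectors instead
toy_vectors = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.7, 0.7, 0.0],
    "doc_c": [0.0, 0.0, 1.0],
    "doc_d": [0.9, 0.1, 0.0],
}
toy_query = [1.0, 0.0, 0.0]

for name, score in top_k_documents(toy_query, toy_vectors, top_k=3):
    print(f"{name}: {score:.3f}")
```

With `top_k=3`, the unrelated `doc_c` is dropped, which mirrors how the retriever passes only the closest Documents on to the prompt.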

### Define a Template Prompt

Create a custom prompt for a generative question-answering task using the RAG approach. The prompt should take in two parameters: `documents`, which are retrieved from a document store, and a `question` from the user. Use the Jinja2 looping syntax to combine the content of the retrieved documents in the prompt.

Next, initialize a `ChatPromptBuilder`, the chat-message counterpart of [PromptBuilder](https://docs.haystack.deepset.ai/docs/promptbuilder), with your prompt template. When given the necessary values, it fills in the template variables and produces the complete prompt as a list of chat messages, which is the input format `OpenAIChatGenerator` expects.

In [7]:
from haystack.components.builders import ChatPromptBuilder
from haystack.dataclasses import ChatMessage

template = [ChatMessage.from_user("""
Answer the question based on the following work-log entries. Answer in Traditional Chinese.

Work-log entries:
{% for document in documents %}
    {{ document.content }}
{% endfor %}

Question: {{question}}
Answer:
""")]

prompt_builder = ChatPromptBuilder(template=template)
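
To make the effect of the `{% for %}` loop concrete, here is a plain-Python sketch of the text the template expands to once the retrieved documents and the question are filled in. It mimics the result with string joining rather than calling Jinja2 or `ChatPromptBuilder`; the `render_prompt` helper and the sample contents are hypothetical, and the exact whitespace differs from what Jinja2 emits.

```python
def render_prompt(document_contents: list[str], question: str) -> str:
    """Build a RAG prompt: instructions, one line per document, then the question."""
    context = "\n".join(f"    {content}" for content in document_contents)
    return (
        "Answer the question based on the following work-log entries.\n\n"
        "Work-log entries:\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Made-up entries standing in for retrieved Document contents
prompt = render_prompt(
    [
        "Date: 2022-06-30 / Person: A / Today's work: fix login bug",
        "Date: 2022-07-01 / Person: B / Today's work: write release notes",
    ],
    "What did A work on?",
)
print(prompt)
```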

### Initialize a ChatGenerator


ChatGenerators are the components that interact with large language models (LLMs). Now, set the `OPENAI_API_KEY` environment variable and initialize an [OpenAIChatGenerator](https://docs.haystack.deepset.ai/docs/OpenAIChatGenerator) that can communicate with OpenAI GPT models. As you initialize it, provide a model name:

In [8]:
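import os
from getpass import getpass
from haystack.components.generators.chat import OpenAIChatGenerator

# Prompt for the key only if it is not already set in the environment
if "OPENAI_API_KEY" not in os.environ:
    os.environ["OPENAI_API_KEY"] = getpass("Enter OpenAI API key:")

chat_generator = OpenAIChatGenerator(model="gpt-4o")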

> You can replace `OpenAIChatGenerator` in your pipeline with another `ChatGenerator`. Check out the full list of chat generators [here](https://docs.haystack.deepset.ai/docs/generators).

### Build the Pipeline

To build a pipeline, add all components to your pipeline and connect them. Create connections from `text_embedder`'s "embedding" output to "query_embedding" input of `retriever`, from `retriever` to `prompt_builder` and from `prompt_builder` to `llm`. Explicitly connect the output of `retriever` with "documents" input of the `prompt_builder` to make the connection obvious as `prompt_builder` has two inputs ("documents" and "question").

For more information on pipelines and creating connections, refer to [Creating Pipelines](https://docs.haystack.deepset.ai/docs/creating-pipelines) documentation.

In [11]:
from haystack import Pipeline

basic_rag_pipeline = Pipeline()
# Add components to your pipeline
basic_rag_pipeline.add_component("text_embedder", text_embedder)
basic_rag_pipeline.add_component("retriever", retriever)
basic_rag_pipeline.add_component("prompt_builder", prompt_builder)
basic_rag_pipeline.add_component("llm", chat_generator)


In [12]:
# Now, connect the components to each other
basic_rag_pipeline.connect("text_embedder.embedding", "retriever.query_embedding")
basic_rag_pipeline.connect("retriever.documents", "prompt_builder.documents")
basic_rag_pipeline.connect("prompt_builder.prompt", "llm.messages")

<haystack.core.pipeline.pipeline.Pipeline object at 0x0000021900FD97F0>
🚅 Components
  - text_embedder: SentenceTransformersTextEmbedder
  - retriever: InMemoryEmbeddingRetriever
  - prompt_builder: ChatPromptBuilder
  - llm: OpenAIChatGenerator
🛤️ Connections
  - text_embedder.embedding -> retriever.query_embedding (List[float])
  - retriever.documents -> prompt_builder.documents (List[Document])
  - prompt_builder.prompt -> llm.messages (List[ChatMessage])

That's it! Your RAG pipeline is ready to generate answers to questions!
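
One way to picture what these connections imply is to treat them as edges in a dependency graph and derive an order in which the components can run. The sketch below does this with the standard library's `graphlib`. It only illustrates the data flow and makes no claim about how Haystack schedules components internally; `execution_order` is a hypothetical helper.

```python
from graphlib import TopologicalSorter


def execution_order(connections: list[tuple[str, str]]) -> list[str]:
    """Order components so that every sender runs before its receiver.

    Each connection is a (sender, receiver) pair such as
    ("text_embedder.embedding", "retriever.query_embedding").
    """
    predecessors: dict[str, set[str]] = {}
    for sender, receiver in connections:
        sender_name = sender.split(".")[0]  # drop the socket name
        receiver_name = receiver.split(".")[0]
        predecessors.setdefault(sender_name, set())
        predecessors.setdefault(receiver_name, set()).add(sender_name)
    return list(TopologicalSorter(predecessors).static_order())


connections = [
    ("text_embedder.embedding", "retriever.query_embedding"),
    ("retriever.documents", "prompt_builder.documents"),
    ("prompt_builder.prompt", "llm.messages"),
]
print(execution_order(connections))
```

Because the three connections form a single chain, the query has to be embedded before retrieval, and retrieval has to finish before the prompt can be built and sent to the LLM.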

## Asking a Question

When asking a question, use the `run()` method of the pipeline. Make sure to provide the question to both the `text_embedder` and the `prompt_builder`. This ensures that the `{{question}}` variable in the template prompt gets replaced with your specific question.

In [20]:
question = "What work did 馨文 record in the sheet, and between which dates did that work take place?"

response = basic_rag_pipeline.run({"text_embedder": {"text": question}, "prompt_builder": {"question": question}})

print(response["llm"]["replies"][0].text)

Batches: 100%|██████████| 1/1 [00:00<00:00, 47.74it/s]


根據工作日誌內容，馨文在表單上做了以下工作：

1. 2022-06-30：
   - (iRich) 中秋上線項目追蹤
   - (iRich) 收集掉落系統-零件盒開啟potocol串接
   - (iRich) 零件盒開啟獲獎表演串接
   - (iRich) 零件盒banner切換流程

2. 2022-07-01：
   - (Maruay) 獵龍滿月兔/奪寶海盜兔移植測試
   - (iRich) 海盜兔標價板問題修正

因此，這些工作的時程為2022年6月30日至2022年7月1日。


Here are some other example questions to test:

In [18]:
examples = [
    "Which people does this daily work report cover?",
    "For which dates can I look up these people's work?",
]
# Join the questions with a space so they do not run together
question_list = " ".join(examples)

response = basic_rag_pipeline.run({"text_embedder": {"text": question_list}, "prompt_builder": {"question": question_list}})

print(response["llm"]["replies"][0].text)

Batches: 100%|██████████| 1/1 [00:00<00:00, 35.81it/s]


這份工作日誌提供了以下人員的資訊：

1. 馨文
   - 可以查詢的工作時間為：2022-06-30（昨天）和2022-07-01（今天）

2. 彥鈞
   - 可以查詢的工作時間為：2022-06-30（昨天）和2022-07-01（今天）

3. 人員
   - 可以查詢的工作時間為：2022-06-30（昨天）和2022-07-01（今天）
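
Because the same question has to be passed to two components, it can be convenient to build the input dictionary in one place and to ask several questions one at a time instead of concatenating them. The sketch below assumes a `run` callable with the same input and output shape as `basic_rag_pipeline.run`. `build_inputs`, `ask_all`, and the `fake_run` stand-in are hypothetical helpers, and `fake_run` returns plain strings where the real pipeline returns `ChatMessage` objects.

```python
from typing import Callable


def build_inputs(question: str) -> dict:
    """Route one question to both components that need it."""
    return {
        "text_embedder": {"text": question},
        "prompt_builder": {"question": question},
    }


def ask_all(run: Callable[[dict], dict], questions: list[str]) -> list:
    """Run the pipeline once per question and collect the first reply of each."""
    return [run(build_inputs(q))["llm"]["replies"][0] for q in questions]


def fake_run(inputs: dict) -> dict:
    """Stand-in for basic_rag_pipeline.run that echoes the question it received."""
    question = inputs["prompt_builder"]["question"]
    return {"llm": {"replies": [f"(answer to: {question})"]}}


for reply in ask_all(fake_run, ["Who is in the report?", "Which dates are covered?"]):
    print(reply)
```

With the real pipeline, pass `basic_rag_pipeline.run` as `run` and read `.text` from each reply.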


## What's next

🎉 Congratulations! You've learned how to create a generative QA system for your documents with the RAG approach.

If you liked this tutorial, you may also enjoy:
- [Filtering Documents with Metadata](https://haystack.deepset.ai/tutorials/31_metadata_filtering)
- [Preprocessing Different File Types](https://haystack.deepset.ai/tutorials/30_file_type_preprocessing_index_pipeline)
- [Creating a Hybrid Retrieval Pipeline](https://haystack.deepset.ai/tutorials/33_hybrid_retrieval)

To stay up to date on the latest Haystack developments, you can [subscribe to our newsletter](https://landing.deepset.ai/haystack-community-updates) and [join the Haystack Discord community](https://discord.gg/haystack).

Thanks for reading!