# üèÜüé¨ RAG with Llama 3.1 and Haystack

  <img src="https://img-cdn.inc.com/image/upload/w_1280,ar_16:9,c_fill,g_auto,q_auto:best/images/panoramic/meta-llama3-inc_539927_dhgoal.webp" width="380"/>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
<img src="https://haystack.deepset.ai/images/haystack-ogimage.png" width="430" style="display:inline;">



A simple RAG example using [Llama 3.1 open models](https://huggingface.co/collections/meta-llama/llama-31-669fc079a0c406a149a5738f) and the [Haystack LLM framework](https://haystack.deepset.ai/). The original demo answers questions about the Oscars; this version indexes your own local documents instead.

## Installation

In [None]:
! pip install haystack-ai "transformers>=4.43.1" sentence-transformers accelerate bitsandbytes python-dotenv

## Authorization

- You need a Hugging Face account.
- You need to accept Meta's conditions at https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct and wait for authorization.

In [2]:
import os
from dotenv import load_dotenv
from pathlib import Path

# Load environment variables from .env file in project root
env_path = Path(__file__).parent.parent / '.env' if '__file__' in globals() else Path('..') / '.env'
load_dotenv(dotenv_path=env_path)

# HF_API_TOKEN will be automatically loaded from .env
if not os.getenv("HF_API_TOKEN"):
    raise ValueError("HF_API_TOKEN not found in .env file")

## RAG with Llama-3.1-8B-Instruct (with your own documents) 🏆🎬

### Load your documents

You can load documents from various formats: PDF, TXT, DOCX, MD, etc.
Place your documents in a folder and specify the path below.
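
Before running the converters, it can help to see which files in the folder will actually be picked up. The following is a minimal sketch using only the standard library; the function name `group_files_by_suffix` and the `SUPPORTED_SUFFIXES` set are hypothetical helpers introduced for illustration, not part of Haystack.

```python
from collections import defaultdict
from pathlib import Path

# Extensions covered by this sketch (an assumption for illustration).
SUPPORTED_SUFFIXES = {".txt", ".md", ".pdf"}


def group_files_by_suffix(folder: Path) -> dict[str, list[Path]]:
    """Return the supported files in `folder`, grouped by lowercase suffix."""
    groups: dict[str, list[Path]] = defaultdict(list)
    for path in sorted(folder.iterdir()):
        suffix = path.suffix.lower()
        if path.is_file() and suffix in SUPPORTED_SUFFIXES:
            groups[suffix].append(path)
    return dict(groups)
```

Calling `group_files_by_suffix(Path("../data/documents_ro"))` would return a dictionary such as `{".pdf": [...], ".txt": [...]}`, which makes it easy to spot unsupported or unexpectedly named files before indexing.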

In [3]:
from pathlib import Path
import rich

In [4]:
# Wikipedia source (commented out)
# import wikipedia
# from haystack.dataclasses import Document
#
# title = "96th_Academy_Awards"
# page = wikipedia.page(title=title, auto_suggest=False)
# raw_docs = [Document(content=page.content, meta={"title": page.title, "url":page.url})]

In [5]:
from haystack.components.converters import TextFileToDocument, PyPDFToDocument
from haystack.dataclasses import Document

# Path to your local documents folder
DOCUMENTS_PATH = Path("../data/documents_ro")

# Load documents based on file type
raw_docs = []

# Load text files
txt_files = list(DOCUMENTS_PATH.glob("*.txt"))
if txt_files:
    txt_converter = TextFileToDocument()
    txt_docs = txt_converter.run(sources=txt_files)
    raw_docs.extend(txt_docs["documents"])

# Load PDF files
pdf_files = list(DOCUMENTS_PATH.glob("*.pdf"))
if pdf_files:
    pdf_converter = PyPDFToDocument()
    pdf_docs = pdf_converter.run(sources=pdf_files)
    raw_docs.extend(pdf_docs["documents"])

# Load markdown files
md_files = list(DOCUMENTS_PATH.glob("*.md"))
if md_files:
    md_converter = TextFileToDocument()
    md_docs = md_converter.run(sources=md_files)
    raw_docs.extend(md_docs["documents"])

print(f"Loaded {len(raw_docs)} documents from {DOCUMENTS_PATH}")
for i, doc in enumerate(raw_docs[:3], 1):  # Show first 3
    print(f"\n--- Document {i} ---")
    print(f"Source: {doc.meta.get('file_path', 'Unknown')}")
    print(f"Content preview: {doc.content[:150]}...")

PyPDFToDocument could not extract text from the file ..\data\documents_ro\bilant-2024.pdf. Returning an empty document.
PyPDFToDocument could not extract text from the file ..\data\documents_ro\buget-2022-scanat.pdf. Returning an empty document.
PyPDFToDocument could not extract text from the file ..\data\documents_ro\plan-de-achizitii-2020.pdf. Returning an empty document.
PyPDFToDocument could not extract text from the file ..\data\documents_ro\raport-lg-544-anul-2025.pdf. Returning an empty document.


Loaded 18 documents from ..\data\documents_ro

--- Document 1 ---
Source: aviz-clasic-in-ron.pdf
Content preview: Furnizor:
S.C. Cubus Arts S.R.L.
Nr. ord. reg. com. / an: J2000000508324
CIF: RO 13548146
Adresa: Strada Morii 198
892200 Lugojoara, jud. Timiș, Român...

--- Document 2 ---
Source: aviz-liste-lungi-euro+ron-engleza.pdf
Content preview: DELIVERY NOTICE / AVIZ DE ÎNSOȚIRE A MĂRFII
Document number: SRV-1520
Date (dd/mm/yyyy): 08/11/2025
Seller / Furnizor:
S.C. Cubus Arts S.R.L.
Company ...

--- Document 3 ---
Source: bilant-2024.pdf
Content preview: ...
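
The warnings above show that some PDFs yield empty documents, and Document 3 has an empty preview. Such documents add nothing to retrieval, so one option is to separate them out before indexing. The sketch below is an assumption-laden illustration: `split_empty_documents` is a hypothetical helper, and `SimpleNamespace` objects stand in for Haystack `Document` instances (only the `content` and `meta` attributes are used).

```python
from types import SimpleNamespace


def split_empty_documents(docs):
    """Split `docs` into (kept, skipped) based on whether `content` has any text."""
    kept, skipped = [], []
    for doc in docs:
        has_text = bool((doc.content or "").strip())
        (kept if has_text else skipped).append(doc)
    return kept, skipped


# Stand-in objects for illustration; real Documents expose the same attributes.
sample = [
    SimpleNamespace(content="Furnizor: S.C. Cubus Arts S.R.L.", meta={"file_path": "aviz-clasic-in-ron.pdf"}),
    SimpleNamespace(content="", meta={"file_path": "bilant-2024.pdf"}),
    SimpleNamespace(content=None, meta={"file_path": "buget-2022-scanat.pdf"}),
]
kept, skipped = split_empty_documents(sample)
print(f"kept {len(kept)}, skipped {[d.meta['file_path'] for d in skipped]}")
```

In the notebook, the same function could be applied to `raw_docs` so that only the kept documents are passed to the indexing pipeline.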


### Indexing Pipeline

In [6]:
from haystack import Pipeline
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack import Document
from haystack.components.embedders import SentenceTransformersTextEmbedder, SentenceTransformersDocumentEmbedder
from haystack.components.converters import TextFileToDocument
from haystack.components.writers import DocumentWriter
from haystack.components.preprocessors import DocumentSplitter
from haystack.utils import ComponentDevice


In [7]:
document_store = InMemoryDocumentStore()

In [8]:
indexing_pipeline = Pipeline()
indexing_pipeline.add_component("splitter", DocumentSplitter(split_by="word", split_length=200))

indexing_pipeline.add_component(
    "embedder",
    SentenceTransformersDocumentEmbedder(
        model="Snowflake/snowflake-arctic-embed-l",  # good embedding model: https://huggingface.co/Snowflake/snowflake-arctic-embed-l
        # device=ComponentDevice.from_str("cuda:0"),    # load the model on GPU
        device=ComponentDevice.from_str("cpu"), 
    ))
indexing_pipeline.add_component("writer", DocumentWriter(document_store=document_store))

# connect the components
indexing_pipeline.connect("splitter", "embedder")
indexing_pipeline.connect("embedder", "writer")

<haystack.core.pipeline.pipeline.Pipeline object at 0x000002080D35A9E0>
🚅 Components
  - splitter: DocumentSplitter
  - embedder: SentenceTransformersDocumentEmbedder
  - writer: DocumentWriter
🛤️ Connections
  - splitter.documents -> embedder.documents (list[Document])
  - embedder.documents -> writer.documents (list[Document])
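
To make the splitter settings concrete, here is a rough plain-Python illustration of word-based chunking with `split_length=200`. It is only a sketch of the idea: `split_by_words` is a hypothetical function, and the real `DocumentSplitter` may treat whitespace, overlap, and metadata differently.

```python
def split_by_words(text: str, split_length: int = 200) -> list[str]:
    """Cut `text` into consecutive chunks of at most `split_length` words."""
    words = text.split()
    return [
        " ".join(words[start:start + split_length])
        for start in range(0, len(words), split_length)
    ]


# A 450-word text produces three chunks of 200, 200, and 50 words.
chunks = split_by_words(" ".join(f"w{i}" for i in range(450)))
print([len(chunk.split()) for chunk in chunks])
```

Each chunk is embedded and stored separately, which is why the number of written documents is much larger than the number of loaded files.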

In [9]:
indexing_pipeline.run({"splitter":{"documents":raw_docs}})

Batches: 100%|██████████| 21/21 [13:26<00:00, 38.41s/it]


{'writer': {'documents_written': 645}}

### RAG Pipeline

In [10]:
from haystack.components.builders import ChatPromptBuilder
from haystack.dataclasses import ChatMessage

template = [ChatMessage.from_user("""
Using the information contained in the context, give a comprehensive answer to the question.
If the answer cannot be deduced from the context, do not give an answer.

Context:
  {% for doc in documents %}
  {{ doc.content }} Source: {{ doc.meta['file_path'] }}
  {% endfor %};
  Question: {{query}}

""")]
prompt_builder = ChatPromptBuilder(template=template)

ChatPromptBuilder has 2 prompt variables, but `required_variables` is not set. By default, all prompt variables are treated as optional, which may lead to unintended behavior in multi-branch pipelines. To avoid unexpected execution, ensure that variables intended to be required are explicitly set in `required_variables`.
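
The template loops over the retrieved documents and appends the question at the end. A plain-Python approximation of what the rendered prompt looks like is sketched below. `build_prompt` is a hypothetical helper, the documents are `SimpleNamespace` stand-ins, and labeling each chunk with `meta['file_path']` is an assumption based on the metadata printed when the local files were loaded.

```python
from types import SimpleNamespace


def build_prompt(query: str, documents) -> str:
    """Assemble a prompt similar in shape to the chat template's rendered output."""
    context = "\n".join(
        f"{doc.content} Source: {doc.meta.get('file_path', 'unknown')}"
        for doc in documents
    )
    return (
        "Using the information contained in the context, "
        "give a comprehensive answer to the question.\n"
        "If the answer cannot be deduced from the context, do not give an answer.\n\n"
        f"Context:\n{context}\n"
        f"Question: {query}\n"
    )


# Stand-in documents for illustration only.
docs = [
    SimpleNamespace(content="Adresa: Strada Morii 198", meta={"file_path": "aviz-clasic-in-ron.pdf"}),
    SimpleNamespace(content="Document number: SRV-1520", meta={}),
]
print(build_prompt("Care este adresa firmei Cubus Arts?", docs))
```

Printing the assembled prompt for a sample query is a quick way to check that the context the model receives actually contains the expected passages.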


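The original setup uses the [`HuggingFaceLocalChatGenerator`](https://docs.haystack.deepset.ai/docs/huggingfacelocalchatgenerator), loading the model in Colab with 4-bit quantization. The commented-out cell below keeps a variant that swaps in Phi-3, which needs no access approval and runs on CPU.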

In [11]:
# import torch
# from haystack.components.generators.chat import HuggingFaceLocalChatGenerator

# # Using Phi-3 instead - no approval needed, works on CPU
# generator = HuggingFaceLocalChatGenerator(
#     model="microsoft/Phi-3-mini-4k-instruct",
#     huggingface_pipeline_kwargs={"device_map":"auto",
#                                   "model_kwargs":{"trust_remote_code": True}},
#     generation_kwargs={"max_new_tokens": 500})

# generator.warm_up()

In [32]:
# Using Ollama as the generator.
# Install the integration first if needed: pip install ollama-haystack
from haystack_integrations.components.generators.ollama import OllamaChatGenerator

generator = OllamaChatGenerator(
    model="llama3.1:8b",
    url="http://localhost:11435",  # Using port 11435 instead of default 11434
    generation_kwargs={
        "num_predict": 200,  # Ollama's option for the max number of generated tokens (reduced from 500 for faster responses)
        "temperature": 0.7
    },
    timeout=600,  # 10 minutes timeout for slow CPU processing
    keep_alive="30m"  # Keep model loaded for 30 minutes to avoid reload delays
)

**Alternative: Use Ollama (no HF access needed)**

If you don't have access to Meta Llama models yet, you can use Ollama:
1. Install Ollama from https://ollama.com
2. Run: `ollama pull llama3.1:8b`
3. Use the Ollama cell above instead of `HuggingFaceLocalChatGenerator`
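
Since the generator talks to a local Ollama server over HTTP, a quick reachability check can save a long wait on a misconfigured port. The sketch below uses only the standard library and queries Ollama's `/api/tags` endpoint, which lists locally available models. `ollama_is_reachable` is a hypothetical helper, and the default URL assumes Ollama's standard port 11434.

```python
import urllib.error
import urllib.request


def ollama_is_reachable(base_url: str = "http://localhost:11434", timeout: float = 2.0) -> bool:
    """Return True if a GET request to `<base_url>/api/tags` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or an HTTP error status.
        return False
```

For the configuration used in this notebook, the call would be `ollama_is_reachable("http://localhost:11435")`. A `False` result suggests the server is not running or is listening on a different port.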

In [33]:
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever

query_pipeline = Pipeline()

query_pipeline.add_component(
    "text_embedder",
    SentenceTransformersTextEmbedder(
        model="Snowflake/snowflake-arctic-embed-l",  # good embedding model: https://huggingface.co/Snowflake/snowflake-arctic-embed-l
        # device=ComponentDevice.from_str("cuda:0"),  # load the model on GPU
        device=ComponentDevice.from_str("cpu"), 
        prefix="Represent this sentence for searching relevant passages: ",  # as explained in the model card (https://huggingface.co/Snowflake/snowflake-arctic-embed-l#using-huggingface-transformers), queries should be prefixed
    ))
query_pipeline.add_component("retriever", InMemoryEmbeddingRetriever(document_store=document_store, top_k=5))
query_pipeline.add_component("prompt_builder", ChatPromptBuilder(template=template))
query_pipeline.add_component("generator", generator)

# connect the components
query_pipeline.connect("text_embedder.embedding", "retriever.query_embedding")
query_pipeline.connect("retriever.documents", "prompt_builder.documents")
query_pipeline.connect("prompt_builder", "generator")

ChatPromptBuilder has 2 prompt variables, but `required_variables` is not set. By default, all prompt variables are treated as optional, which may lead to unintended behavior in multi-branch pipelines. To avoid unexpected execution, ensure that variables intended to be required are explicitly set in `required_variables`.


<haystack.core.pipeline.pipeline.Pipeline object at 0x000002088CD43B20>
🚅 Components
  - text_embedder: SentenceTransformersTextEmbedder
  - retriever: InMemoryEmbeddingRetriever
  - prompt_builder: ChatPromptBuilder
  - generator: OllamaChatGenerator
🛤️ Connections
  - text_embedder.embedding -> retriever.query_embedding (list[float])
  - retriever.documents -> prompt_builder.documents (list[Document])
  - prompt_builder.prompt -> generator.messages (list[ChatMessage])
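
Conceptually, the retriever step compares the query embedding with every stored chunk embedding and keeps the `top_k` best matches. The following sketch shows that idea with cosine similarity in plain Python. It is an illustration only: `cosine_similarity` and `top_k_indices` are hypothetical helpers, and the similarity function actually used by the document store is configurable and may differ.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_indices(query_embedding, doc_embeddings, k=5):
    """Return the indices of the `k` embeddings most similar to the query."""
    scored = [
        (cosine_similarity(query_embedding, emb), idx)
        for idx, emb in enumerate(doc_embeddings)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [idx for _, idx in scored[:k]]


# Toy 2-dimensional embeddings for illustration.
print(top_k_indices([1.0, 0.0], [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]], k=2))
```

With `top_k=5`, as configured above, only the five highest-scoring chunks reach the prompt, so answers that depend on many documents at once may be incomplete.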

### Let's ask some questions!

In [34]:
def get_generative_answer(query):

  results = query_pipeline.run({
      "text_embedder": {"text": query},
      "prompt_builder": {"query": query}
    }
  )

  answer = results["generator"]["replies"][0].text
  rich.print(answer)

In [21]:
get_generative_answer("Facturi peste 150 de lei")

Batches: 100%|██████████| 1/1 [00:00<00:00,  7.53it/s]


In [31]:
get_generative_answer("Cate documente sunt in baza de date ?")


Batches: 100%|██████████| 1/1 [00:00<00:00,  6.09it/s]


In [35]:
get_generative_answer("Care este adresa firmei Cubus Arts?")


Batches: 100%|██████████| 1/1 [00:00<00:00,  5.55it/s]


In [36]:
get_generative_answer("Care este adresa firmei DEMO IMPEX?")

Batches: 100%|██████████| 1/1 [00:00<00:00,  5.26it/s]


In [37]:
get_generative_answer("Care este valoarea totala a facturilor din noiembrie 2025")

Batches: 100%|██████████| 1/1 [00:00<00:00,  5.11it/s]


---
This is a simple demo.
The RAG pipeline can be improved in several ways, for example by preprocessing the input documents more carefully.
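
As one small example of preprocessing, text extracted from PDFs often carries irregular spacing and long runs of blank lines. The sketch below normalizes whitespace before splitting. `normalize_whitespace` is a hypothetical helper written with the standard library only, and it is one possible cleaning step among many rather than a recommended recipe.

```python
import re


def normalize_whitespace(text: str) -> str:
    """Collapse repeated spaces/tabs and reduce 3+ consecutive newlines to one blank line."""
    text = text.replace("\r\n", "\n")
    text = re.sub(r"[ \t]+", " ", text)      # runs of spaces or tabs -> single space
    text = re.sub(r" ?\n ?", "\n", text)     # drop spaces around line breaks
    text = re.sub(r"\n{3,}", "\n\n", text)   # keep at most one blank line
    return text.strip()


print(repr(normalize_whitespace("Furnizor:   \r\n\r\n\r\n\r\n S.C.  Cubus\tArts ")))
```

Applying a step like this to each document's content before the splitter would make the 200-word chunks denser in actual text.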

To use Llama 3 models in Haystack, you also have **other options**:
- [LlamaCppGenerator](https://docs.haystack.deepset.ai/docs/llamacppgenerator) and [OllamaGenerator](https://docs.haystack.deepset.ai/docs/ollamagenerator): using the GGUF quantized format, these solutions are ideal to run LLMs on standard machines (even without GPUs).
- [HuggingFaceAPIChatGenerator](https://docs.haystack.deepset.ai/docs/huggingfaceapichatgenerator), which lets you query the Hugging Face API, a local TGI container, or a (paid) HF Inference Endpoint. TGI is a toolkit for efficiently deploying and serving LLMs in production.
- [vLLM via OpenAIChatGenerator](https://haystack.deepset.ai/integrations/vllm): high-throughput and memory-efficient inference and serving engine for LLMs.



(*Notebook by [Stefano Fiorucci](https://github.com/anakin87)*)