# Session 3: Haystack Continued

This notebook is designed for **VS Code** and uses **Ollama** to run LLMs locally.

**What you’ll do**
- Get an OpenSearch instance running locally with Podman
- Build a RAG pipeline with OpenSearch
- Build a Streamlit front end and connect it to the pipeline for queries
- Explore hybrid retrieval with OpenSearch


## 1. Installing Podman

**For macOS:** run one of the following in a terminal window:
```bash
brew install --cask podman-desktop  # GUI application
# or
brew install podman                 # terminal version
```

**For Windows:** download the installer from https://podman-desktop.io/downloads/windows

## 2. Starting up an OpenSearch Cluster

Run the following commands in the terminal:
```bash
# Start the Podman machine for the first time
podman machine init
podman machine set --rootful  # allows port binding without restrictions inside the VM
podman machine start
```

Run the following command to start your OpenSearch cluster:
```bash
podman run \
  -p 9200:9200 \
  -p 9600:9600 \
  -e "discovery.type=single-node" \
  -e "OPENSEARCH_INITIAL_ADMIN_PASSWORD=OSPassword246" \
  --name opensearch \
  docker.io/opensearchproject/opensearch:latest
```


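Open a new terminal window and run the following `curl` command to check that your OpenSearch cluster is running and reachable. The password must match the value you passed as `OPENSEARCH_INITIAL_ADMIN_PASSWORD` when starting the container:

```bash
curl -k -u admin:OSPassword246 https://localhost:9200
```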
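If you would rather run the same reachability check from a notebook cell, the sketch below mirrors the `curl` call using only the Python standard library. The helper names `build_health_request` and `check_cluster` are made up for this illustration, and the sketch assumes the admin password set when the container was started.

```python
import base64
import json
import ssl
import urllib.request


def build_health_request(url: str, user: str, password: str) -> urllib.request.Request:
    """Build a GET request carrying HTTP Basic credentials, like `curl -u user:password`."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(url, headers={"Authorization": f"Basic {token}"})


def check_cluster(url: str, user: str, password: str, timeout: float = 5.0) -> dict:
    """Return the cluster info JSON, skipping certificate verification like `curl -k`."""
    context = ssl.create_default_context()
    context.check_hostname = False  # must be disabled before verify_mode is relaxed
    context.verify_mode = ssl.CERT_NONE
    request = build_health_request(url, user, password)
    with urllib.request.urlopen(request, timeout=timeout, context=context) as response:
        return json.load(response)
```

With the container up, calling `check_cluster("https://localhost:9200", "admin", "OSPassword246")` should return a dictionary of cluster details. A connection error usually means the container is still starting or the port mapping is wrong.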

## 3. Creating a RAG Ingestion Pipeline Connected to OpenSearch
Use a few Wikipedia pages as the content to ingest:

```python
# You can swap out these URLs for any public text URLs you like
PUBLIC_URLS = [
    "https://en.wikipedia.org/wiki/Yellow_warbler",
    "https://en.wikipedia.org/wiki/Natural_language_processing",
    "https://en.wikipedia.org/wiki/Bioluminescence",
]
```


The code below builds the ingestion pipeline and prints the inputs required to run it:

```python
# We use the LinkContentFetcher and HTMLToDocument components to fetch the pages
# and convert them into Haystack Documents
from haystack import Pipeline
from haystack.components.converters import HTMLToDocument
from haystack.components.fetchers import LinkContentFetcher
from haystack.components.preprocessors import DocumentCleaner, DocumentSplitter
from haystack.components.writers import DocumentWriter
from haystack.document_stores.types import DuplicatePolicy
from haystack_integrations.document_stores.opensearch import OpenSearchDocumentStore

# Initialise all the components here:
# --- OpenSearch DocumentStore (local) ---
document_store = OpenSearchDocumentStore(
    hosts="https://localhost:9200",  # https, because the security plugin is enabled on this cluster
    index="public_texts",
    use_ssl=True,
    verify_certs=False,  # same effect as curl -k: the local certificate is not verified
    http_auth=("admin", "OSPassword246"),
)

# Takes a list of URLs and outputs streams (a list of ByteStream objects)
# https://docs.haystack.deepset.ai/docs/linkcontentfetcher
fetcher = LinkContentFetcher(
    user_agents=["ai-mutual-mentorship/0.1 (https://github.com/larry6point6/ai-mutual-mentorship-scheme)"]
)

# Takes a list of ByteStream objects and outputs a list of Haystack Documents
# https://docs.haystack.deepset.ai/docs/htmltodocument
converter = HTMLToDocument()

# Takes a list of Haystack Documents and splits them into smaller chunks
# https://docs.haystack.deepset.ai/docs/documentsplitter
splitter = DocumentSplitter(split_by="word", split_length=200, split_overlap=15)

# Removes empty lines, extra whitespace and repeated substrings by default
# (you can add custom cleaning parameters such as remove_regex)
# https://docs.haystack.deepset.ai/docs/documentcleaner
cleaner = DocumentCleaner()

writer = DocumentWriter(document_store=document_store, policy=DuplicatePolicy.SKIP)

# Initialise the ingestion pipeline
ingestion_pipeline = Pipeline()

# Add all the components to the pipeline
ingestion_pipeline.add_component(instance=fetcher, name="fetcher")
ingestion_pipeline.add_component(instance=converter, name="converter")
ingestion_pipeline.add_component(instance=cleaner, name="cleaner")
ingestion_pipeline.add_component(instance=splitter, name="splitter")
ingestion_pipeline.add_component(instance=writer, name="writer")

# Connect the outputs of each component to the inputs of the next.
# When there is only one matching input/output type, the socket names can be inferred.
ingestion_pipeline.connect("fetcher.streams", "converter")
ingestion_pipeline.connect("converter", "cleaner")
ingestion_pipeline.connect("cleaner", "splitter")
ingestion_pipeline.connect("splitter", "writer")

# Print the list of inputs the pipeline requires
ingestion_pipeline.inputs()
```


```
{'fetcher': {'urls': {'type': list[str], 'is_mandatory': True}},
 'converter': {'meta': {'type': dict[str, typing.Any] | list[dict[str, typing.Any]] | None,
   'is_mandatory': False,
   'default_value': None},
  'extraction_kwargs': {'type': dict[str, typing.Any] | None,
   'is_mandatory': False,
   'default_value': None}},
 'writer': {'policy': {'type': haystack.document_stores.types.policy.DuplicatePolicy | None,
   'is_mandatory': False,
   'default_value': None}}}
```
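To get a feel for what `split_length=200` and `split_overlap=15` mean on the splitter, the sketch below treats splitting as a sliding window over words. It is a simplified illustration written for this notebook, not Haystack's actual implementation, and the function name `split_words` is hypothetical.

```python
def split_words(text: str, split_length: int = 200, split_overlap: int = 15) -> list[str]:
    """Split text into chunks of up to split_length words, each sharing
    split_overlap words with the chunk before it."""
    if not 0 <= split_overlap < split_length:
        raise ValueError("split_overlap must be non-negative and smaller than split_length")
    words = text.split()
    step = split_length - split_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + split_length]))
        if start + split_length >= len(words):
            break  # this window already reached the end of the text
    return chunks


# Toy input: 500 numbered "words"
sample_text = " ".join(f"w{i}" for i in range(500))
sample_chunks = split_words(sample_text)
print(len(sample_chunks), [len(c.split()) for c in sample_chunks])
```

A larger overlap keeps more context across chunk boundaries, at the cost of storing more near-duplicate text in the index.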

Run the ingestion pipeline:

```python
# --- Run the ingestion pipeline ---
result = ingestion_pipeline.run({"fetcher": {"urls": PUBLIC_URLS}})

print("Ingestion result:", result)
print("Documents in store:", document_store.count_documents())
```

```
Ingestion result: {'writer': {'documents_written': 0}}
Documents in store: 137
```
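An output of `documents_written: 0` alongside 137 stored documents is most likely what a second run of the pipeline looks like: the writer uses `DuplicatePolicy.SKIP`, so chunks already in the index are left alone. The sketch below imitates that behaviour with a plain dictionary. It assumes document IDs are derived from the content, and the helper `write_with_skip` is hypothetical rather than part of Haystack.

```python
import hashlib


def write_with_skip(store: dict[str, str], texts: list[str]) -> int:
    """Write texts keyed by a content hash; skip any whose ID already exists."""
    written = 0
    for text in texts:
        doc_id = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if doc_id in store:
            continue  # duplicate ID: leave the stored copy untouched
        store[doc_id] = text
        written += 1
    return written


store: dict[str, str] = {}
chunks = ["chunk one", "chunk two", "chunk three"]
first_run = write_with_skip(store, chunks)   # everything is new
second_run = write_with_skip(store, chunks)  # everything is a duplicate
print(first_run, second_run, len(store))     # prints: 3 0 3
```

To ingest from scratch, you would point the document store at a fresh index or delete the existing documents first.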


## 5. Vector DBs & OpenSearch (Docker): Hybrid Retrieval
We’ll now use **OpenSearch** as the document store, then compare **BM25**, **dense embeddings**, and the **OpenSearchHybridRetriever**.

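> Quickstart (local dev):
```bash
# Single node, security disabled for local testing (see official docs for options)
docker run -d \
  -p 9200:9200 \
  -p 9600:9600 \
  -e "discovery.type=single-node" \
  -e "DISABLE_SECURITY_PLUGIN=true" \
  --name opensearch \
  opensearchproject/opensearch:latest
```
The code in this part connects over plain HTTP without credentials, which only works against a cluster with the security plugin disabled, like this one. If the secured container from section 2 is still running, stop it first, since both publish ports 9200 and 9600.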
OpenSearch Haystack integration: `pip install opensearch-haystack`.
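BM25 ranks documents by the words they share with the query, whereas dense retrieval ranks them by how close their embedding vectors are to the query's embedding. The toy sketch below shows the dense side with three-dimensional vectors and cosine similarity. Both the vectors and the choice of cosine similarity are assumptions for illustration; real `nomic-embed-text` vectors have 768 dimensions, and the similarity measure depends on how the index is configured.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_by_embedding(query_vec: list[float], doc_vecs: dict[str, list[float]], top_k: int = 5) -> list[str]:
    """Return the IDs of the top_k documents most similar to the query vector."""
    ranked = sorted(doc_vecs, key=lambda doc_id: cosine_similarity(query_vec, doc_vecs[doc_id]), reverse=True)
    return ranked[:top_k]


# Made-up embeddings for three documents and one query
toy_vectors = {
    "birds": [0.9, 0.1, 0.0],
    "nlp": [0.1, 0.9, 0.2],
    "glow": [0.0, 0.2, 0.9],
}
ranking = rank_by_embedding([0.2, 0.8, 0.1], toy_vectors, top_k=2)
print(ranking)  # prints: ['nlp', 'birds']
```

Because the ranking depends only on vector direction, a document can score highly without sharing any words with the query, which is the main difference you should see when comparing it against BM25.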


```python
EMBED_MODEL = "nomic-embed-text"
OLLAMA_ENDPOINT = "http://localhost:11434"
```

```python
# The Ollama embedders come from the ollama-haystack package
from haystack_integrations.components.embedders.ollama import (
    OllamaDocumentEmbedder,
    OllamaTextEmbedder,
)
from haystack_integrations.components.retrievers.opensearch import (
    OpenSearchBM25Retriever,
    OpenSearchEmbeddingRetriever,
    OpenSearchHybridRetriever,
)
from haystack_integrations.document_stores.opensearch import OpenSearchDocumentStore

# Adjust embedding_dim to your embedding model; nomic-embed-text -> 768
OPENSEARCH = {
    "hosts": ["http://localhost:9200"],
    "index": "demo_docs",
    "embedding_dim": 768,
}

doc_store = OpenSearchDocumentStore(**OPENSEARCH)

# Embed with Ollama and write to the index.
# DOCS is not defined in this section; it is assumed to be a list of
# Haystack Documents created earlier in the notebook.
op_embedder = OllamaDocumentEmbedder(model=EMBED_MODEL, url=OLLAMA_ENDPOINT)
docs_emb = op_embedder.run(documents=DOCS)
doc_store.write_documents(docs_emb["documents"])  # index
print("OpenSearch indexed docs.")
```



```python
# Three retrievers
os_bm25 = OpenSearchBM25Retriever(document_store=doc_store, top_k=5)
os_emb = OpenSearchEmbeddingRetriever(document_store=doc_store, top_k=5)
# The hybrid retriever combines both under the hood
os_hybrid = OpenSearchHybridRetriever(
    document_store=doc_store,
    embedder=OllamaTextEmbedder(model=EMBED_MODEL, url=OLLAMA_ENDPOINT),
    top_k=5,
)

query = "What is Retrieval-Augmented Generation?"

print("BM25:")
print([d.meta.get("source") for d in os_bm25.run(query=query)["documents"]])

print("Embedding:")
# The embedding retriever expects a query vector, so embed the query text first
query_embedder = OllamaTextEmbedder(model=EMBED_MODEL, url=OLLAMA_ENDPOINT)
query_embedding = query_embedder.run(text=query)["embedding"]
print([d.meta.get("source") for d in os_emb.run(query_embedding=query_embedding)["documents"]])

print("Hybrid:")
print([d.meta.get("source") for d in os_hybrid.run(query=query)["documents"]])
```
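One common way to combine a BM25 ranking and an embedding ranking into a single hybrid list is reciprocal rank fusion (RRF). The sketch below shows the idea on toy document IDs. Whether `OpenSearchHybridRetriever` uses exactly this formula depends on its join configuration, which this notebook does not set, so treat it as an illustration of the principle. The function name, the constant `k=60` and the sample rankings are assumptions.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60, top_k: int = 5) -> list[str]:
    """Fuse several ranked lists: each document scores 1 / (k + rank) per list it appears in."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest combined score first
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


bm25_ranking = ["a", "b", "c"]       # toy keyword-based result order
embedding_ranking = ["b", "c", "d"]  # toy embedding-based result order
fused = reciprocal_rank_fusion([bm25_ranking, embedding_ranking])
print(fused)  # prints: ['b', 'c', 'a', 'd']
```

Documents that appear in both lists ("b" and "c") rise above those found by only one retriever, which is why hybrid results often differ from both the BM25 and the embedding output.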
