# Local Semantic Search & RAG with Qdrant and Ollama

This notebook demonstrates a **local-first Retrieval-Augmented Generation (RAG) pipeline**.

**High-level flow**
1. Raw text documents are converted into vector embeddings using a local Ollama embedding model.
2. Embeddings are stored in Qdrant, a vector database optimized for similarity search.
3. User queries are embedded using the same model.
4. Qdrant retrieves the most semantically similar documents.
5. (Optional) Retrieved context is passed to a local LLM via Ollama to generate grounded answers.

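Embedding, storage, and generation all run locally, with no external API dependencies. Only the initial download of the source pages during indexing needs network access.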
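
Step 5 of the flow can be sketched without any model running. The helper below is a minimal illustration of how retrieved chunks might be folded into a single grounded prompt. The function name `build_grounded_prompt`, the instruction wording, and the example chunks are all assumptions made for this sketch; the notebook itself delegates this step to an agent with a retrieval tool.

```python
def build_grounded_prompt(question: str, chunks: list[str]) -> str:
    """Combine retrieved chunks and the user question into one prompt."""
    # Number each chunk so an answer can point back to its source.
    context = "\n\n".join(
        f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1)
    )
    return (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


# Hypothetical chunks standing in for what Qdrant would return.
example_chunks = [
    "Qdrant stores vectors and returns the nearest neighbours of a query.",
    "Ollama serves embedding and chat models on the local machine.",
]
prompt_text = build_grounded_prompt("Where are the vectors stored?", example_chunks)
print(prompt_text)
```

Keeping the context and the question in clearly separated sections makes it easier to check afterwards whether an answer was actually supported by the retrieved text.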

In [None]:
# !pip install beautifulsoup4 langchain python-dotenv langchain-community langchain-text-splitters langchain-ollama langchain-qdrant qdrant-client

## Embeddings

Embeddings are dense numerical representations of text that capture semantic meaning.
They are **mandatory** for semantic search: vector databases such as Qdrant operate on vectors,
not raw text.

Key points:
- The same embedding model must be used for both documents and queries.
- Vector dimensionality must match the Qdrant collection configuration.
- Semantically similar texts should map to nearby vectors, so a small change in wording should produce only a small change in the embedding.
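
The key points above can be illustrated with a few hand-written vectors. This is a sketch only: the three-dimensional vectors and the names `cat`, `kitten`, and `invoice` are made up for illustration, and a real embedding model produces far longer vectors. The dimension check mirrors the requirement that documents and queries come from the same model.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    if len(a) != len(b):
        # Vectors from models with different output sizes cannot be compared.
        raise ValueError(f"dimension mismatch: {len(a)} vs {len(b)}")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up 3-dimensional vectors; real embeddings have hundreds of dimensions.
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
invoice = [0.0, 0.1, 0.9]

sim_related = cosine_similarity(cat, kitten)
sim_unrelated = cosine_similarity(cat, invoice)
print(f"cat vs kitten:  {sim_related:.3f}")
print(f"cat vs invoice: {sim_unrelated:.3f}")
```

The related pair scores close to 1 and the unrelated pair close to 0, which is the property semantic search relies on.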

In [1]:
import bs4
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import WebBaseLoader
from langchain_qdrant import QdrantVectorStore
from langchain_ollama import OllamaEmbeddings, ChatOllama
from langchain.agents import create_agent
from qdrant_client import QdrantClient
from langchain.tools import tool

USER_AGENT environment variable not set, consider setting it to identify your requests.


In [2]:
from dotenv import load_dotenv
import os
load_dotenv()

True

In [3]:
#### INDEXING ####

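# Load documents
loader = WebBaseLoader(
    web_paths=(
        "https://lilianweng.github.io/posts/2023-06-23-agent/",
        "https://github.com/NirDiamant/Prompt_Engineering",
    ),
    bs_kwargs=dict(
        # Keep only elements with these CSS classes. They match the blog post's
        # markup; a page that lacks them may load with empty content.
        parse_only=bs4.SoupStrainer(
            class_=("post-content", "post-title", "post-header")
        )
    ),
)
docs = loader.load()

from uuid import uuid4

# Tag each source document with a unique ID; its chunks inherit this metadata.
for doc in docs:
    doc.metadata["doc_id"] = str(uuid4())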

In [4]:
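# Split documents into overlapping chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,       # maximum characters per chunk
    chunk_overlap=200,     # up to this many characters shared by neighbouring chunks
    add_start_index=True,  # record each chunk's character offset as "start_index"
)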
splits = text_splitter.split_documents(docs)
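
The effect of `chunk_size` and `chunk_overlap` can be seen in a simplified sliding-window chunker. This is not how `RecursiveCharacterTextSplitter` works internally (it prefers to split on separators such as paragraph and line breaks); the function `sliding_window_chunks` is a hypothetical stand-in that only shows how overlapping windows and start offsets relate.

```python
def sliding_window_chunks(
    text: str, chunk_size: int, chunk_overlap: int
) -> list[tuple[int, str]]:
    """Return (start_index, chunk) pairs using a fixed-size sliding window."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append((start, text[start:start + chunk_size]))
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
windows = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=4)
for start, chunk in windows:
    print(start, chunk)
```

Each window repeats the last four characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk when the overlap is large enough.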

In [6]:
splits[20]

Document(metadata={'source': 'https://lilianweng.github.io/posts/2023-06-23-agent/', 'doc_id': 'e6094d18-bad0-4f60-8a11-4ba36cc84363', 'start_index': 15618}, page_content='MRKL (Karpas et al. 2022), short for “Modular Reasoning, Knowledge and Language”, is a neuro-symbolic architecture for autonomous agents. A MRKL system is proposed to contain a collection of “expert” modules and the general-purpose LLM works as a router to route inquiries to the best suitable expert module. These modules can be neural (e.g. deep learning models) or symbolic (e.g. math calculator, currency converter, weather API).\nThey did an experiment on fine-tuning LLM to call a calculator, using arithmetic as a test case. Their experiments showed that it was harder to solve verbal math problems than explicitly stated math problems because LLMs (7B Jurassic1-large model) failed to extract the right arguments for the basic arithmetic reliably. The results highlight when the external symbolic tools can work reliably

In [7]:
# Embeddings
embeddings = OllamaEmbeddings(
    model="embeddinggemma",
)

In [8]:
vector = embeddings.embed_query("Hello world")

# The vector length must match the vector size of the Qdrant collection.
print(len(vector))

768


## Qdrant Collection

A Qdrant *collection* is analogous to a table in a relational database.

It defines:
- Vector size (must match the embedding model output dimension)
- Distance metric (e.g. cosine similarity)
- Optional payload schema for metadata filtering

In this notebook, the collection is created once and reused across runs.

In [9]:

# Qdrant client (local)
qdrant_client = QdrantClient(
    url="http://localhost:6333",
)

collection_name = "github_ai_repo"

# First run only: uncomment to create the collection and index the chunks.
# vectorstore = QdrantVectorStore.from_documents(
#    documents=splits,
#    embedding=embeddings,
#    url="http://localhost:6333",
#    collection_name=collection_name,
# )

# Later runs: connect to the collection that already exists.
vectorstore = QdrantVectorStore.from_existing_collection(
    embedding=embeddings,
    url="http://localhost:6333",
    collection_name=collection_name,
)

## Semantic Search

Semantic search works by:
1. Embedding the user query.
2. Computing similarity between the query vector and stored vectors.
3. Returning the closest matches according to the distance metric.

Unlike keyword search, this retrieves documents based on meaning rather than exact terms.
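
The three steps above can be sketched as a brute-force search over a tiny in-memory index. The vectors, the document IDs, and the helper names `normalize` and `top_k` are invented for illustration. Qdrant uses an approximate nearest-neighbour index rather than scanning every vector, so this shows the ranking idea, not Qdrant's implementation.

```python
import heapq
import math


def normalize(vector: list[float]) -> list[float]:
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def top_k(
    query_vec: list[float], stored: dict[str, list[float]], k: int
) -> list[tuple[str, float]]:
    """Return the k stored IDs with the highest cosine similarity to the query."""
    q = normalize(query_vec)
    scored = (
        (doc_id, sum(a * b for a, b in zip(q, normalize(vec))))
        for doc_id, vec in stored.items()
    )
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


# Made-up 3-dimensional vectors standing in for real embeddings.
toy_index = {
    "agents": [0.9, 0.1, 0.1],
    "prompting": [0.2, 0.9, 0.1],
    "cooking": [0.0, 0.1, 0.9],
}
hits = top_k([0.8, 0.3, 0.0], toy_index, k=2)
for doc_id, score in hits:
    print(f"{doc_id}: {score:.3f}")
```

The query vector points mostly in the same direction as the "agents" entry, so that entry ranks first even though no keywords are compared at any point.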

In [10]:
@tool(response_format="content_and_artifact")
def retrieve_context(query: str):
    """Retrieve information to help answer a query."""
    # Use the `vectorstore` created above and fetch the 2 most similar chunks.
    retrieved_docs = vectorstore.similarity_search(query, k=2)
    # The serialized text goes to the model; the raw documents are the artifact.
    serialized = "\n\n".join(
        f"Source: {doc.metadata}\nContent: {doc.page_content}"
        for doc in retrieved_docs
    )
    return serialized, retrieved_docs

In [14]:
model = ChatOllama(model="qwen3:0.6b", temperature=0)

In [15]:
tools = [retrieve_context]
# If desired, specify custom instructions
prompt = (
    "You have access to a tool that retrieves context from a blog post. "
    "Use the tool to help answer user queries."
)
agent = create_agent(model, tools, system_prompt=prompt)

In [None]:
query = (
    "What is the standard method for Prompt Engineering?\n\n"
    "Once you get the answer, look up common extensions of that method."
)

for event in agent.stream(
    {"messages": [{"role": "user", "content": query}]},
    stream_mode="values",
):
    event["messages"][-1].pretty_print()