In [1]:
import nest_asyncio
nest_asyncio.apply()

import ollama

In [1]:
import chromadb

# PersistentClient stores the data on disk at the given path instead of keeping it only in RAM.
chroma_client = chromadb.PersistentClient(path="./mini-llama-articles")
# Create a collection (a named bucket) called "mini-llama-articles" to hold this data set.
chroma_collection = chroma_client.create_collection("mini-llama-articles")

`BAAI/bge-small-en-v1.5` has a comparatively long context size; see the `max_position_embeddings` value (512) in its config:
https://huggingface.co/BAAI/bge-small-en-v1.5/blob/main/config.json#:~:text=%22max_position_embeddings%22%3A%20512%2C
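
A minimal sketch of reading that limit programmatically. The JSON string below is a hand-typed, one-field excerpt of the linked `config.json` (an assumption for illustration; the real file contains many more keys), and `config_excerpt` and `max_tokens` are names made up here.

```python
import json

# Illustrative one-field excerpt of the model's config.json.
# Only the field discussed above is reproduced.
config_excerpt = '{"max_position_embeddings": 512}'

config = json.loads(config_excerpt)
max_tokens = config["max_position_embeddings"]
print(f"Maximum sequence length: {max_tokens} tokens")
```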

Why `BAAI/bge-small-en-v1.5` is a good choice for a simple RAG project:

### 1. High Performance with Minimal Resource Cost
The `bge-small-en-v1.5` model balances accuracy and efficiency well. Despite being a "small" model (about 33 million parameters, roughly 130 MB of weights) that runs quickly on a standard CPU, it scores higher on the MTEB benchmark than some much larger models, such as OpenAI's older Ada-002. You get strong retrieval quality, meaning the right documents are found for your query, without needing expensive GPUs or paying for API calls, which makes it well suited to local development and testing.

### 2. Optimized for RAG Workflows
Unlike older sentence transformers such as `all-MiniLM-L6-v2`, which were designed for short sentences (max 256 tokens), this model was built with retrieval workloads like Retrieval-Augmented Generation (RAG) in mind. It supports a **512-token context window**, so it can "read" and index half-page paragraphs while losing less context. Additionally, the `v1.5` update addressed the "similarity distribution" issue, so the scores it assigns to documents are more meaningful and your system can better distinguish a "perfect match" from a "kind of relevant" document.

In [2]:
from llama_index.llms.ollama import Ollama
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core.settings import Settings

llm = Ollama(model="llama3.2:3b")
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

Settings.llm = llm
Settings.embed_model = embed_model



In [19]:
# Display the current global Settings in LlamaIndex
print("Chunk size:", Settings.chunk_size)
print("Chunk overlap:", Settings.chunk_overlap)

print("LLM:", Settings.llm)
print("Embedding model:", Settings.embed_model)
print("Tokenizer:", Settings.tokenizer )

print("Context Window:", Settings.context_window)
print("Number of output tokens:", Settings.num_output )

Chunk size: 1024
Chunk overlap: 200
LLM: callback_manager=<llama_index.core.callbacks.base.CallbackManager object at 0x000002E9387826C0> system_prompt=None messages_to_prompt=<function messages_to_prompt at 0x000002E9165F9EE0> completion_to_prompt=<function default_completion_to_prompt at 0x000002E9179198A0> output_parser=None pydantic_program_mode=<PydanticProgramMode.DEFAULT: 'default'> query_wrapper_prompt=None base_url='http://localhost:11434' model='llama3.2:3b' temperature=None context_window=131072 request_timeout=30.0 prompt_key='prompt' json_mode=False additional_kwargs={} is_function_calling_model=True keep_alive=None thinking=None
Embedding model: model_name='BAAI/bge-small-en-v1.5' embed_batch_size=10 callback_manager=<llama_index.core.callbacks.base.CallbackManager object at 0x000002E9387826C0> num_workers=None embeddings_cache=None max_length=512 normalize=True query_instruction=None text_instruction=None cache_folder=None show_progress_bar=False
Tokenizer: functools.part
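
To make the `Chunk size: 1024` and `Chunk overlap: 200` values above concrete, here is a rough sketch of overlapping windows over a token list. It is not LlamaIndex's actual splitter (which, as far as I know, also tries to respect sentence boundaries); `chunk_tokens` and the fake tokens are hypothetical names used only for illustration. Since the embedding model reports `max_length=512`, it may also be worth checking how 1024-token chunks are handled at embedding time.

```python
def chunk_tokens(tokens, chunk_size=1024, chunk_overlap=200):
    """Split a token list into windows of chunk_size sharing chunk_overlap tokens."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    stride = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the input
    return chunks


# Fake "tokens" standing in for a tokenized document
tokens = [f"tok{i}" for i in range(2500)]
chunks = chunk_tokens(tokens)
print([len(c) for c in chunks])  # [1024, 1024, 852]
```

Each chunk repeats the last 200 tokens of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk.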

# 2️⃣ Breaking it down
This code uses LlamaIndex Workflows, an event-driven architecture. In that system, steps don't call each other directly; they communicate by emitting and receiving events.

### **A. `Event`**

* `Event` is a **base class in LlamaIndex workflows**.
* Workflows in LlamaIndex are **step-based pipelines**, and **events** are the objects that pass between steps.
* Examples of events:

  * `StartEvent` → triggers the workflow
  * `StopEvent` → ends the workflow and carries the final result
* You can define **custom events** to carry specific data between steps.

---

### **B. `NodeWithScore`**

* Each `NodeWithScore` is a **document chunk + similarity score**.
* When you retrieve documents from your vector database, you get:

  * The **chunk of text** (Node)
  * Its **relevance score** (Score)

So `NodeWithScore` represents **a retrieved document and how relevant it is to the query**.
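
As a rough illustration of the "chunk + score" idea, the sketch below ranks a few chunks by cosine similarity and keeps the top two. `ScoredChunk` is a hypothetical stand-in for `NodeWithScore`, and the 3-dimensional vectors are made-up numbers, not real embeddings.

```python
import math
from dataclasses import dataclass


@dataclass
class ScoredChunk:
    """Hypothetical stand-in for NodeWithScore: a text chunk plus its score."""
    text: str
    score: float


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def top_k(query_vec, indexed, k=2):
    """Score every (text, vector) pair against the query and keep the best k."""
    scored = [ScoredChunk(text, cosine(query_vec, vec)) for text, vec in indexed]
    return sorted(scored, key=lambda c: c.score, reverse=True)[:k]


# Made-up 3-dimensional "embeddings" for three chunks
indexed = [
    ("RL training details", [0.9, 0.1, 0.0]),
    ("Author contact info", [0.0, 0.2, 0.9]),
    ("Cold-start data", [0.7, 0.6, 0.1]),
]
results = top_k([1.0, 0.2, 0.0], indexed, k=2)
for chunk in results:
    print(f"{chunk.score:.3f}  {chunk.text}")
```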

---

### **C. `RetrieverEvent`**

* This is a **custom event** that will hold the **results of the retrieval step** in your RAG workflow.
* By defining:

```python
nodes: list[NodeWithScore]
```

You are saying:

> “This event will carry a list of retrieved nodes (documents) with their similarity scores.”

# 4️⃣ How it fits into a RAG workflow

```
StartEvent(query)
       │
       ▼
Retrieve Step
       │
       ▼
RetrieverEvent(nodes=[NodeWithScore, ...])
       │
       ▼
Synthesize Step (uses nodes to generate answer)
```

* `RetrieverEvent` is **just a container**
* Carries **retrieved text chunks + scores** from retrieval → synthesis
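
The routing in the diagram can be imitated in a few lines of plain Python. This is only a toy sketch of event-driven dispatch, not how LlamaIndex implements workflows; `Start`, `Retrieved`, `Stop`, `STEPS`, and `run` are all hypothetical names.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class Start:
    query: str


@dataclass
class Retrieved:
    nodes: list


@dataclass
class Stop:
    result: str


async def retrieve(ev: Start) -> Retrieved:
    # Pretend retrieval: return one fake chunk for the query.
    return Retrieved(nodes=[f"chunk about {ev.query}"])


async def synthesize(ev: Retrieved) -> Stop:
    return Stop(result=f"answer built from {len(ev.nodes)} chunk(s)")


# Each step is registered under the event type it accepts.
STEPS = {Start: retrieve, Retrieved: synthesize}


async def run(event):
    # Hand the latest event to the step that accepts its type
    # until some step emits Stop.
    while not isinstance(event, Stop):
        event = await STEPS[type(event)](event)
    return event.result


answer = asyncio.run(run(Start(query="training")))
print(answer)  # answer built from 1 chunk(s)
```

Inside a notebook cell you would typically write `await run(Start(query="training"))` rather than calling `asyncio.run`.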


In [3]:
from llama_index.core.workflow import Event
from llama_index.core.schema import NodeWithScore


class RetrieverEvent(Event):
    """Result of running retrieval"""

    nodes: list[NodeWithScore]

## The `ctx: Context` parameter
`Context` is a shared, persistent state container for a single workflow run.
Every step can accept it, and the workflow below uses it to carry the query from `retrieve` to `synthesize`.

### 1. What is ctx? (The Shared Brain)
In a Workflow, every step (ingest, retrieve, synthesize) is an isolated island.
* The retrieve function has no idea what variables exist inside ingest.
* The synthesize function has no idea what variables exist inside retrieve.

ctx (Context) is the shared memory that connects these islands. It is a global storage box that lives for the entire duration of the workflow run. If you put something in the box during Step 1, you can take it out in Step 3.
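
A toy version of that "storage box", assuming nothing about LlamaIndex internals: `MiniStore` is a hypothetical class that only mimics the async `set`/`get` calls used on `ctx.store` in the workflow code, and the two step functions are simplified placeholders.

```python
import asyncio


class MiniStore:
    """Hypothetical stand-in for ctx.store: an async key-value store for one run."""

    def __init__(self):
        self._data = {}

    async def set(self, key, value):
        self._data[key] = value

    async def get(self, key, default=None):
        return self._data.get(key, default)


async def retrieve_step(store, query):
    # Save the query so a later step can read it back.
    await store.set("query", query)
    return ["chunk 1", "chunk 2"]


async def synthesize_step(store, nodes):
    # This step never received the query directly; it reads it from the store.
    query = await store.get("query")
    return f"Answering {query!r} from {len(nodes)} chunks"


async def main():
    store = MiniStore()
    nodes = await retrieve_step(store, "How was it trained?")
    return await synthesize_step(store, nodes)


summary = asyncio.run(main())
print(summary)  # Answering 'How was it trained?' from 2 chunks
```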

## Vector search: finds answers even when the wording does not match exactly

In [None]:
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.response_synthesizers import CompactAndRefine
from llama_index.core.workflow import (
    Context,
    Workflow,
    StartEvent,
    StopEvent,
    step
)

class RAGWorkflow(Workflow):
    @step
    async def ingest(self, ctx: Context, ev: StartEvent) -> StopEvent | None:
        dirname = ev.get("dirname")
        if not dirname:
            return None
        
        documents = SimpleDirectoryReader(dirname).load_data()
        # Vector Store Retriever (also known as Dense Retrieval or Semantic Search)
        index = VectorStoreIndex.from_documents(
            documents=documents
        )
        return StopEvent(result=index)
    
    @step
    async def retrieve(self, ctx: Context, ev: StartEvent) -> RetrieverEvent | None:
        """ Retrieve relevant documents from the index based on the query. """
        # Both values come from the StartEvent, not Context
        query = ev.get("query")
        index = ev.get("index")

        if not query:
            return None
        
        print(f"Retrieving documents for query: {query}")

        await ctx.store.set("query", query)

        if index is None:
            print("Index is empty, load some documents before querying!")
            return None

        retriever = index.as_retriever( similarity_top_k=2 )
        nodes = await retriever.aretrieve(query)
        print(f"Retrieved {len(nodes)} documents.")
        print("Document contents:")
        for node in nodes:
            print("-----"*10)
            print(node.get_text())
        return RetrieverEvent(nodes=nodes)


    # By the time synthesize runs, it has the retrieved nodes but not the original query,
    # because RetrieverEvent carries only the nodes.
    # Without ctx, the LLM would receive the retrieved text but no question to answer about it.
    # ctx bridges this gap: retrieve stores the query and synthesize reads it back.
    @step
    async def synthesize(self, ctx: Context, ev: RetrieverEvent) -> StopEvent:
        """Return a streaming response using reranked nodes."""
        summarizer = CompactAndRefine(streaming=True, verbose=True)
        query = await ctx.store.get("query", default=None)

        response = await summarizer.asynthesize(query, nodes=ev.nodes)
        return StopEvent(result=response)
        


In [7]:
w = RAGWorkflow()

In [None]:
# Ingest the documents. Only ingest does real work here: retrieve also receives the StartEvent but returns None because no query was passed.
index = await w.run(dirname="data")

In [None]:
# Run a query. Here ingest returns None (no dirname was passed), so retrieve and then synthesize produce the result.
result = await w.run(query="How was DeepSeekR1 trained?", index=index)
print("\nFinal response:")
async for chunk in result.async_response_gen():
    print(chunk, end="", flush=True)

Retrieving documents for query: How was DeepSeekR1 trained?
Retrieved 2 documents.
Document contents:
--------------------------------------------------
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via
Reinforcement Learning
DeepSeek-AI
research@deepseek.com
Abstract
We introduce our first-generation reasoning models, DeepSeek-R1-Zero and DeepSeek-R1.
DeepSeek-R1-Zero, a model trained via large-scale reinforcement learning (RL) without super-
vised fine-tuning (SFT) as a preliminary step, demonstrates remarkable reasoning capabilities.
Through RL, DeepSeek-R1-Zero naturally emerges with numerous powerful and intriguing
reasoning behaviors. However, it encounters challenges such as poor readability, and language
mixing. To address these issues and further enhance reasoning performance, we introduce
DeepSeek-R1, which incorporates multi-stage training and cold-start data before RL. DeepSeek-
R1 achieves performance comparable to OpenAI-o1-1217 on reasoning tasks. To support th

2025-12-07 15:50:53,879 - INFO - HTTP Request: POST http://localhost:11434/api/chat "HTTP/1.1 200 OK"


DeepSeek-R1 was trained using a multi-stage approach, which includes supervised fine-tuning (SFT) followed by large-scale reinforcement learning (RL). The model was initially trained with cold-start data before undergoing RL training. This combination of SFT and RL appears to have significantly enhanced the reasoning capabilities of DeepSeek-R1.