<a href="https://colab.research.google.com/github/knowusuboaky/RagAgent/blob/main/Open_Insights_DS_Gen_AI_Task.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **KWADWO NYAME OWUSU - BOAKYE'S CODE**

# Exercise: Multi-Modal Retrieval Augmented Generation from Scratch

## Objective

Implement a Retrieval Augmented Generation (RAG) system from scratch using the [PHI-3 vision](https://huggingface.co/microsoft/Phi-3-vision-128k-instruct) model, [Jina-CLIP-V1](https://huggingface.co/jinaai/jina-clip-v1) embeddings, and the Chroma vector database. Your task is to use this system to generate text based on the contents of the paper "Attention is All You Need" ([https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf](https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf)).

## Allowed Libraries

You are only permitted to use the following libraries:
- `torch`
- `chromadb`
- `numpy`
- `io`
- `fitz`
- `requests`
- `PIL`
- `transformers`

## Components to Implement

1. **Document Ingestion**: Load the "Attention is All You Need" paper and prepare it for processing. This includes handling the PDF and extracting text and images as needed.

2. **Embeddings Generation**: Use Jina-CLIP-V1 to create embeddings for the text and images extracted from the document.

3. **Vector Database Management**: Utilize the Chroma vector database to store and manage the embeddings.

4. **Retrieval Mechanism**: Implement a mechanism to retrieve relevant document segments (text or images) based on a given query using the stored embeddings.

5. **Generation Model**: Integrate the PHI-3 vision model to generate new content based on the retrieved segments.

## Submission

Your submission should include:
- The complete code for the exercise.
- A README file explaining your implementation and how to run the code.
- Examples of generated outputs based on sample queries.

## Evaluation Criteria

- Correctness: The RAG system should accurately retrieve relevant document segments and generate coherent outputs.
- Efficiency: The implementation should be optimized for performance.
- Creativity: Innovative approaches to integrate and utilize the components are encouraged.
- Clarity: Code should be well-documented and easy to understand.

Good luck and enjoy the exercise!

# Automating Retrieval-Augmented Q&A with AI

Learn how to build a **RAG powered assistant** that automatically retrieves, grounds, and answers user questions using **local PDFs**, **image embeddings**, and a **Phi 3 Vision LLM**, all running on **CPU** with an in memory **Chroma vector store**.

Let’s dive in!

---

## Core Capabilities

* **Unified Document Intelligence**
  Seamlessly ingests PDFs, extracts text and images, and stores their embeddings in an in memory vector store without any external database.

* **Lightweight Embedding Engine**
  Uses **Jina CLIP V1** to generate text and image embeddings for efficient similarity search.

* **Smart Retrieval Pipeline**

  1. Top k semantic vector hits
  2. Automatic similarity normalization and ranking
  3. Merges image and text results into a unified context

* **Concise, Context Aware Answers**
  Powered by **Phi 3 Vision 128k Instruct**, it generates clear, concise, and context grounded responses.

  * Up to 3 sentences
  * Based only on retrieved context
  * Safe fallback: *“I don’t know based on the provided information.”*

* **No External Dependencies**
  Works offline, with no API keys, GPUs (optional), or web connectors required.
  Everything runs locally using only the allowed libraries:
  `torch`, `chromadb`, `numpy`, `io`, `fitz`, `requests`, `PIL`, `transformers`.

Files location: [Github Repo](https://github.com/knowusuboaky/RagAgent)

### Install & Verify Libraries

In [1]:
# * Install & Verify Libraries

# Remove conflicting packages
!pip -q uninstall -y numpy scipy transformers tokenizers huggingface-hub safetensors accelerate || true

# Reinstall a consistent, Colab-friendly set
!pip -q install --no-cache-dir \
  "numpy==2.0.2" \
  "scipy==1.14.1" \
  "requests==2.32.4" \
  "pillow==11.3.0" \
  "transformers==4.57.1" \
  "tokenizers==0.22.1" \
  "huggingface-hub==0.35.3" \
  "safetensors==0.6.2" \
  "accelerate==1.0.1" \
  "chromadb==1.1.1" \
  "pymupdf==1.26.5" \
  "torch==2.4.1" \
  "torchvision==0.19.1" \
  "torchaudio==2.4.1"


[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m60.9/60.9 kB[0m [31m6.4 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m60.8/60.8 kB[0m [31m227.4 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m44.0/44.0 kB[0m [31m198.0 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m67.3/67.3 kB[0m [31m188.4 MB/s[0m eta [36m0:00:00[0m
[?25h  Installing build dependencies ... [?25l[?25hdone
  Getting requirements to build wheel ... [?25l[?25hdone
  Preparing metadata (pyproject.toml) ... [?25l[?25hdone
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m19.2/19.2 MB[0m [31m157.6 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m40.8/40.8 MB[0m [31m211.6 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m12.0/12.0 MB[0m [31m118.3 MB/s[0m e

### Load Libraries

In [2]:
# * Imports

from transformers import (
    AutoProcessor, AutoModel,
    AutoProcessor as HF_AutoProcessor, AutoModelForCausalLM,
    AutoConfig,
)
import io
import fitz
import requests
import numpy as np
import torch
from PIL import Image
import chromadb
from chromadb.config import Settings

### Helper Functions

These utilities power normalize, chunk_text, and the system prompt used by the agent. They are lightweight, GPU and CPU-friendly, and rely only on the allowed libraries. Together, they prepare clean, normalized text chunks and enforce concise, context-grounded responses for reliable question answering.

*Note: All helpers run entirely in memory and avoid side effects. Chunking is overlap-aware to preserve context, embeddings are L2-normalized for cosine similarity, and retrieval later in the agent converts distances to similarities for clearer ranking. The system prompt keeps answers concise, factual, and grounded in the retrieved context.*

In [3]:
# * Helper Functions

def normalize(t: torch.Tensor, eps: float = 1e-12) -> torch.Tensor:
    """
    Normalize embedding vectors to unit length for consistent cosine similarity.

    Args:
        t: Input tensor of shape (n, d).
        eps: Small epsilon to prevent division by zero.

    Returns:
        Normalized tensor with L2 norm = 1.
    """
    n = torch.norm(t, p=2, dim=1, keepdim=True).clamp_min(eps)
    return t / n


def chunk_text(text: str, max_chars: int = 1000, overlap: int = 200) -> list[str]:
    """
    Split long text into overlapping segments while preserving natural sentence flow.

    Args:
        text: The input string to chunk.
        max_chars: Maximum number of characters per chunk.
        overlap: Number of characters to repeat between chunks for context continuity.

    Returns:
        List of text chunks.
    """
    text = " ".join((text or "").split()).strip()
    out = []
    i = 0
    n = len(text)
    if n == 0:
        return out
    while i < n:
        j = min(i + max_chars, n)
        cut = j
        if j < n:
            window = text[i:j]
            pos_dot = window.rfind(". ")
            pos_spc = window.rfind(" ")
            if pos_dot >= max_chars // 2:
                cut = i + pos_dot + 2
            elif pos_spc >= max_chars // 2:
                cut = i + pos_spc + 1
        out.append(text[i:cut])
        if cut >= n:
            break
        i = max(0, cut - overlap)
    return out


# ... System prompt for RAG agent
qa_system_prompt = """You are an assistant for question-answering tasks.
Use the following pieces of retrieved context to answer the question.
If you don't know the answer, just say that you don't know.
Use three sentences maximum and keep the answer concise.

{context}"""

### Create The Agent

This section configures the class that loads models, prepares the vector store, and runs end-to-end Retrieval-Augmented Generation on GPU/CPU. The agent wraps document indexing, semantic retrieval, and grounded answering behind a simple synchronous interface.

*Note: This notebook uses local Hugging Face models for both generation (Phi-3 Vision Instruct) and embeddings (Jina CLIP V1), PyMuPDF for PDF parsing, and an in-memory Chroma vector store. No external web search or paid APIs are required.*

In [4]:
# * Setup Main Agent

class RagAgent:
    """
    A lightweight Retrieval-Augmented Generation (RAG) agent that combines a multimodal
    embedding model (Jina-CLIP) and a text-generation model (Phi-3-Vision) with an
    in-memory Chroma vector store for retrieval.

    The agent automatically selects GPU (CUDA) when available and falls back to CPU otherwise.
    """

    # ... set up device, load both models (embeddings + generator), and open an in-memory Chroma client
    def __init__(
        self,
        model: str | None = None,
        embedding_model: str | None = None,
    ):
        """
        Initialize the RAG agent, load the models, and create a Chroma client.

        Args:
            model: Optional model ID for the generator (Phi-3-Vision).
            embedding_model: Optional model ID for the embedding model (Jina-CLIP).
        """
        use_cuda = torch.cuda.is_available()
        self.device = torch.device("cuda" if use_cuda else "cpu")
        self.default_corpus: str | None = None

        # Load the embedding model (Jina-CLIP)
        emb_name = embedding_model or "jinaai/jina-clip-v1"
        self.clip_processor = AutoProcessor.from_pretrained(emb_name, trust_remote_code=True)
        self.clip_model = AutoModel.from_pretrained(emb_name, trust_remote_code=True).to(self.device).eval()

        # Load the generation model (Phi-3 Vision) in eager attention mode
        llm_name = model or "microsoft/phi-3-vision-128k-instruct"
        phi_cfg = AutoConfig.from_pretrained(llm_name, trust_remote_code=True)

        # Ensure supported attention path for Phi-3-Vision in Transformers
        if hasattr(phi_cfg, "use_flash_attention_2"):
            phi_cfg.use_flash_attention_2 = False
        if hasattr(phi_cfg, "attn_implementation"):
            phi_cfg.attn_implementation = "eager"
        if hasattr(phi_cfg, "use_sdpa"):
            phi_cfg.use_sdpa = False

        dtype = torch.float16 if use_cuda else torch.float32

        self.phi_processor = HF_AutoProcessor.from_pretrained(llm_name, trust_remote_code=True)
        self.phi_model = AutoModelForCausalLM.from_pretrained(
            llm_name,
            config=phi_cfg,
            trust_remote_code=True,
            torch_dtype=dtype,
            low_cpu_mem_usage=True,
            attn_implementation="eager",
        ).to(self.device).eval()

        # Disable KV cache to avoid DynamicCache.seen_tokens issues across versions
        if hasattr(self.phi_model, "config"):
            self.phi_model.config.use_cache = False
        if hasattr(self.phi_model, "generation_config"):
            self.phi_model.generation_config.use_cache = False

        # Initialize in-memory Chroma client
        self.chroma = chromadb.Client(Settings())

    # ... fetch raw PDF bytes from a URL and verify it is a valid PDF
    def fetch_document(self, url: str) -> bytes:
        """
        Download and validate a PDF from a URL.

        Args:
            url: Remote PDF URL.

        Returns:
            Raw PDF bytes.
        """
        r = requests.get(url, stream=True, timeout=60)
        r.raise_for_status()
        data = r.content
        if b"%PDF" not in data[:8]:
            raise ValueError("URL did not return a PDF.")
        fitz.open(stream=io.BytesIO(data), filetype="pdf").close()
        return data

    # ... read PDF bytes, extract page text into overlapping chunks, and optionally pull images
    def document_ingest(
        self,
        pdf_bytes: bytes,
        extract_images: bool = True,
        max_image_pages: int | None = 3
    ) -> dict:
        """
        Parse a PDF into structured text chunks and optional images.

        Args:
            pdf_bytes: Raw PDF data.
            extract_images: Whether to extract embedded images.
            max_image_pages: Maximum number of pages to scan for images.

        Returns:
            Dictionary with 'pages', 'chunks', and 'images' keys.
        """
        doc = fitz.open(stream=io.BytesIO(pdf_bytes), filetype="pdf")
        try:
            pages, chunks, images = [], [], []
            for i, page in enumerate(doc, start=1):
                text = page.get_text("text")
                pages.append({"page": i, "text": text})
                for seg in chunk_text(text, max_chars=1000, overlap=200):
                    chunks.append({"page": i, "text": seg})

            if extract_images:
                total = len(doc)
                upto = total if max_image_pages is None else min(max_image_pages, total)
                for p in range(upto):
                    for img in doc[p].get_images(full=True):
                        xref = img[0]
                        try:
                            base = doc.extract_image(xref)
                            img_bytes = base["image"]
                            pil = Image.open(io.BytesIO(img_bytes)).convert("RGB")
                            images.append({"page": p + 1, "xref": xref, "pil": pil})
                        except Exception:
                            pass

            return {"pages": pages, "chunks": chunks, "images": images}
        finally:
            doc.close()

    # ... embed a list of strings with Jina-CLIP and return L2-normalized vectors
    def embed_text(self, texts: list[str], batch: int = 16, max_length: int = 512) -> np.ndarray:
        """
        Encode a list of text strings into normalized Jina-CLIP embeddings.

        Args:
            texts: List of input strings.
            batch: Batch size for inference.
            max_length: Maximum token length per input.

        Returns:
            NumPy array of shape (n, d).
        """
        if not texts:
            return np.zeros((0, 0), dtype=np.float32)

        vecs = []
        with torch.no_grad():
            for i in range(0, len(texts), batch):
                t = texts[i:i + batch]
                inp = self.clip_processor(
                    text=t,
                    padding=True,
                    truncation=True,
                    max_length=max_length,
                    return_tensors="pt",
                ).to(self.device)

                if hasattr(self.clip_model, "get_text_features"):
                    feat = self.clip_model.get_text_features(**inp)
                else:
                    out = self.clip_model(**inp)
                    feat = getattr(out, "pooler_output", None) or out.last_hidden_state.mean(dim=1)

                vecs.append(normalize(feat).cpu())

        return torch.cat(vecs, dim=0).float().numpy()

    # ... embed a list of PIL images with Jina-CLIP and return L2-normalized vectors
    def embed_image(self, imgs: list[Image.Image], batch: int = 8) -> np.ndarray:
        """
        Encode a list of PIL images into normalized Jina-CLIP embeddings.

        Args:
            imgs: List of images.
            batch: Batch size for inference.

        Returns:
            NumPy array of shape (n, d).
        """
        if not imgs:
            return np.zeros((0, 0), dtype=np.float32)

        imgs = [im.convert("RGB") for im in imgs]
        vecs = []
        with torch.no_grad():
            for i in range(0, len(imgs), batch):
                b = imgs[i:i + batch]
                inp = self.clip_processor(images=b, return_tensors="pt").to(self.device)

                if hasattr(self.clip_model, "get_image_features"):
                    feat = self.clip_model.get_image_features(**inp)
                else:
                    out = self.clip_model(**inp)
                    feat = getattr(out, "image_embeds", None) or out.last_hidden_state.mean(dim=1)

                vecs.append(normalize(feat).cpu())

        return torch.cat(vecs, dim=0).float().numpy()

    # ... get or create a named Chroma collection to store/retrieve vectors
    def knowledge_database(self, name: str):
        """
        Return or create a named Chroma collection.

        Args:
            name: Collection name.

        Returns:
            A Chroma collection object.
        """
        return self.chroma.get_or_create_collection(name=name)

    # ... orchestrate end-to-end indexing: ingest -> embed -> add to Chroma (text and optional images)
    def index_document(self, url: str, name: str | None = None, include_images: bool = True) -> dict:
        """
        Ingest, embed, and index a PDF into Chroma collections.

        Args:
            url: PDF URL to index.
            name: Optional corpus name. Defaults to 'main'.
            include_images: Whether to embed and index images.

        Returns:
            Counts of pages, text chunks, and images indexed.
        """
        corpus = name or "main"
        self.default_corpus = corpus

        data = self.fetch_document(url)
        ing = self.document_ingest(data, extract_images=include_images, max_image_pages=3)

        # Text embeddings
        texts = [c["text"] for c in ing["chunks"]]
        text_ids = [f"{corpus}:chunk:{i}" for i in range(len(texts))]
        text_metas = [{"page": ing["chunks"][i]["page"]} for i in range(len(texts))]
        text_emb = self.embed_text(texts, batch=16)

        self.knowledge_database(f"{corpus}_text").add(
            ids=text_ids,
            embeddings=text_emb.astype(np.float32).tolist(),
            metadatas=text_metas,
            documents=texts,
        )

        # Image embeddings
        if include_images and ing["images"]:
            imgs = [x["pil"] for x in ing["images"]]
            img_ids = [f"{corpus}:img:{i}" for i in range(len(imgs))]
            img_metas = [{"page": ing["images"][i]["page"]} for i in range(len(imgs))]
            img_emb = self.embed_image(imgs, batch=8)

            self.knowledge_database(f"{corpus}_images").add(
                ids=img_ids,
                embeddings=img_emb.astype(np.float32).tolist(),
                metadatas=img_metas,
            )

        return {"pages": len(ing["pages"]), "chunks": len(texts), "images": len(ing["images"])}

    # ... run a similarity search against a Chroma collection and return scored hits
    def query_collection(self, collection, query_vec: np.ndarray, k: int = 5) -> list[dict]:
        """
        Perform cosine-based similarity search within a Chroma collection.

        Args:
            collection: Chroma collection to query.
            query_vec: Query embedding vector.
            k: Number of top results to return.

        Returns:
            List of dicts with keys: id, similarity, document, metadata.
        """
        res = collection.query(
            query_embeddings=[query_vec.astype(np.float32).tolist()],
            n_results=k,
            include=["documents", "metadatas", "distances"],
        )

        docs = res.get("documents", [[]])[0]
        metas = res.get("metadatas", [[]])[0]
        dists = res.get("distances", [[]])[0]
        ids = res.get("ids", [[]])[0] if "ids" in res else [None] * len(docs)

        hits = []
        for i in range(len(docs)):
            _id = ids[i] if i < len(ids) else None
            sim = 1.0 - float(dists[i])
            if sim < 0.0:
                sim = 0.0
            hits.append({
                "id": _id,
                "similarity": sim,
                "document": docs[i],
                "metadata": metas[i],
            })
        return hits

    # ... stitch the best snippets into a compact context window for the generator
    def build_context(self, hits: list[dict], max_chars: int = 2500) -> str:
        """
        Concatenate retrieved text snippets into a single grounded context window.

        Args:
            hits: Ranked retrieval results.
            max_chars: Maximum character budget for the context.

        Returns:
            Joined text block ready to inject into the system prompt.
        """
        s, used = [], 0
        for h in hits:
            if h.get("document"):
                page = h.get("metadata", {}).get("page")
                txt = f"[page {page}] " + " ".join(h["document"].split())
                if used + len(txt) > max_chars:
                    txt = txt[: max(0, max_chars - used)]
                s.append(txt)
                used += len(txt)
                if used >= max_chars:
                    break
        return "\n\n".join(s)

    # ... full RAG pipeline: embed question -> retrieve -> rank -> build context -> generate -> package results
    def invoke(
        self,
        question: str,
        name: str | None = None,
        top_k_each: int = 5,
        include_images_in_context: bool = False,
        max_new_tokens: int = 512,
        temperature: float = 0.2,
        top_p: float = 0.95,
    ) -> dict:
        """
        Run a complete RAG inference cycle using Phi-3 Vision.

        Returns:
            Dict with keys:
                question, answer, source, sources, system_prompt, context, hits
        """
        corpus = name or self.default_corpus
        if not corpus:
            return {
                "question": question,
                "answer": "No corpus indexed yet. Call index_document(url) first.",
                "source": None,
                "sources": [],
                "system_prompt": qa_system_prompt.format(context=""),
                "context": "",
                "hits": [],
            }

        # 1. Embed the question
        qvec = self.embed_text([question], batch=1)[0]

        # 2. Retrieve text and optional images
        text_col = self.knowledge_database(f"{corpus}_text")
        text_hits = self.query_collection(text_col, qvec, k=top_k_each)
        all_hits = [{"modality": "text", **h} for h in text_hits]

        if include_images_in_context:
            try:
                img_col = self.knowledge_database(f"{corpus}_images")
                img_hits = self.query_collection(img_col, qvec, k=max(2, top_k_each // 2))
                all_hits += [{"modality": "image", **h} for h in img_hits]
            except Exception:
                pass

        if not all_hits:
            return {
                "question": question,
                "answer": "No relevant segments were retrieved.",
                "source": None,
                "sources": [],
                "system_prompt": qa_system_prompt.format(context=""),
                "context": "",
                "hits": [],
            }

        # 3. Rank and build context
        sims = np.array([h["similarity"] for h in all_hits], dtype=np.float32)
        a, b = float(sims.min()), float(sims.max())
        scores = (sims - a) / (b - a) if b > a else np.ones_like(sims)
        for i, h in enumerate(all_hits):
            h["score"] = float(scores[i])
        all_hits.sort(key=lambda x: x["score"], reverse=True)

        context = self.build_context(all_hits, max_chars=2500)
        system_msg = qa_system_prompt.format(context=context)

        # 4. Generate an answer without echoing the prompt
        answer_text = ""
        try:
            messages = [
                {"role": "system", "content": system_msg},
                {"role": "user", "content": [{"type": "text", "text": question}]},
            ]
            inputs = self.phi_processor.apply_chat_template(
                messages, add_generation_prompt=True, return_tensors="pt", tokenize=True
            ).to(self.device)
            model_inputs = {"input_ids": inputs}

            with torch.no_grad():
                out_ids = self.phi_model.generate(
                    **model_inputs,
                    max_new_tokens=max_new_tokens,
                    do_sample=temperature > 0,
                    temperature=temperature,
                    top_p=top_p,
                    use_cache=False,
                    pad_token_id=getattr(self.phi_processor.tokenizer, "eos_token_id", None),
                    eos_token_id=getattr(self.phi_processor.tokenizer, "eos_token_id", None),
                )
            input_len = model_inputs["input_ids"].shape[-1]
            gen_tokens = out_ids[0, input_len:]
            answer_text = self.phi_processor.decode(gen_tokens, skip_special_tokens=True).strip()

        except Exception:
            prompt = f"{system_msg}\n\nQuestion: {question}\nAnswer:"
            model_inputs = self.phi_processor(text=prompt, return_tensors="pt")
            model_inputs = {k: v.to(self.device) for k, v in model_inputs.items()}
            with torch.no_grad():
                out_ids = self.phi_model.generate(
                    **model_inputs,
                    max_new_tokens=max_new_tokens,
                    do_sample=temperature > 0,
                    temperature=temperature,
                    top_p=top_p,
                    use_cache=False,
                    pad_token_id=getattr(self.phi_processor.tokenizer, "eos_token_id", None),
                    eos_token_id=getattr(self.phi_processor.tokenizer, "eos_token_id", None),
                )
            decoded = self.phi_processor.batch_decode(out_ids, skip_special_tokens=True)[0]
            marker = "Answer:"
            pos = decoded.rfind(marker)
            answer_text = decoded[pos + len(marker):].strip() if pos != -1 else decoded.strip()

        # 5. Compact sources for display
        src = []
        seen_pages = set()
        for h in all_hits:
            page = h.get("metadata", {}).get("page")
            if page in seen_pages:
                continue
            seen_pages.add(page)
            snippet = ""
            if h.get("document"):
                t = " ".join(h["document"].split())
                snippet = t[:240]
            src.append({
                "id": h.get("id"),
                "page": page,
                "score": float(h.get("score", 0.0)),
                "modality": h.get("modality", "text"),
                "snippet": snippet,
            })
            if len(src) >= 5:
                break

        # 6. Structured return with convenience alias 'source'
        return {
            "question": question,
            "answer": answer_text,
            "source": (src[0] if src else None),
            "sources": src,
            "system_prompt": system_msg,
            "context": context,
            "hits": all_hits,
        }

### Instantiate the AI Agent

In [5]:
# * Instantiate the AI Agent

rag_agent = RagAgent(model="microsoft/phi-3-vision-128k-instruct",
                     embedding_model="jinaai/jina-clip-v1")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


preprocessor_config.json:   0%|          | 0.00/527 [00:00<?, ?B/s]

processing_clip.py: 0.00B [00:00, ?B/s]

transform.py: 0.00B [00:00, ?B/s]

A new version of the following files was downloaded from https://huggingface.co/jinaai/jina-clip-implementation:
- transform.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
A new version of the following files was downloaded from https://huggingface.co/jinaai/jina-clip-implementation:
- processing_clip.py
- transform.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.


tokenizer_config.json: 0.00B [00:00, ?B/s]

vocab.txt: 0.00B [00:00, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/695 [00:00<?, ?B/s]

config.json: 0.00B [00:00, ?B/s]

configuration_clip.py: 0.00B [00:00, ?B/s]

A new version of the following files was downloaded from https://huggingface.co/jinaai/jina-clip-implementation:
- configuration_clip.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
`torch_dtype` is deprecated! Use `dtype` instead!


modeling_clip.py: 0.00B [00:00, ?B/s]

rope_embeddings.py: 0.00B [00:00, ?B/s]

A new version of the following files was downloaded from https://huggingface.co/jinaai/jina-clip-implementation:
- rope_embeddings.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.


eva_model.py: 0.00B [00:00, ?B/s]

A new version of the following files was downloaded from https://huggingface.co/jinaai/jina-clip-implementation:
- eva_model.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.


hf_model.py: 0.00B [00:00, ?B/s]

A new version of the following files was downloaded from https://huggingface.co/jinaai/jina-clip-implementation:
- hf_model.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
A new version of the following files was downloaded from https://huggingface.co/jinaai/jina-clip-implementation:
- modeling_clip.py
- rope_embeddings.py
- eva_model.py
- hf_model.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
`torch_dtype` is deprecated! Use `dtype` instead!


model.safetensors:   0%|          | 0.00/891M [00:00<?, ?B/s]

config.json: 0.00B [00:00, ?B/s]

configuration_bert.py: 0.00B [00:00, ?B/s]

A new version of the following files was downloaded from https://huggingface.co/jinaai/jina-bert-flash-implementation:
- configuration_bert.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.


modeling_bert.py: 0.00B [00:00, ?B/s]

block.py: 0.00B [00:00, ?B/s]

mha.py: 0.00B [00:00, ?B/s]

A new version of the following files was downloaded from https://huggingface.co/jinaai/jina-bert-flash-implementation:
- mha.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.


mlp.py: 0.00B [00:00, ?B/s]

A new version of the following files was downloaded from https://huggingface.co/jinaai/jina-bert-flash-implementation:
- mlp.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
A new version of the following files was downloaded from https://huggingface.co/jinaai/jina-bert-flash-implementation:
- block.py
- mha.py
- mlp.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.


bert_padding.py: 0.00B [00:00, ?B/s]

A new version of the following files was downloaded from https://huggingface.co/jinaai/jina-bert-flash-implementation:
- bert_padding.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.


embedding.py: 0.00B [00:00, ?B/s]

A new version of the following files was downloaded from https://huggingface.co/jinaai/jina-bert-flash-implementation:
- embedding.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
A new version of the following files was downloaded from https://huggingface.co/jinaai/jina-bert-flash-implementation:
- modeling_bert.py
- block.py
- bert_padding.py
- embedding.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.


config.json: 0.00B [00:00, ?B/s]

configuration_phi3_v.py: 0.00B [00:00, ?B/s]

A new version of the following files was downloaded from https://huggingface.co/microsoft/phi-3-vision-128k-instruct:
- configuration_phi3_v.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.


preprocessor_config.json:   0%|          | 0.00/464 [00:00<?, ?B/s]

processing_phi3_v.py: 0.00B [00:00, ?B/s]

image_processing_phi3_v.py: 0.00B [00:00, ?B/s]

A new version of the following files was downloaded from https://huggingface.co/microsoft/phi-3-vision-128k-instruct:
- image_processing_phi3_v.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
A new version of the following files was downloaded from https://huggingface.co/microsoft/phi-3-vision-128k-instruct:
- processing_phi3_v.py
- image_processing_phi3_v.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.


tokenizer_config.json: 0.00B [00:00, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/670 [00:00<?, ?B/s]

modeling_phi3_v.py: 0.00B [00:00, ?B/s]

image_embedding_phi3_v.py: 0.00B [00:00, ?B/s]

A new version of the following files was downloaded from https://huggingface.co/microsoft/phi-3-vision-128k-instruct:
- image_embedding_phi3_v.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
A new version of the following files was downloaded from https://huggingface.co/microsoft/phi-3-vision-128k-instruct:
- modeling_phi3_v.py
- image_embedding_phi3_v.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.


model.safetensors.index.json: 0.00B [00:00, ?B/s]

Fetching 2 files:   0%|          | 0/2 [00:00<?, ?it/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/4.94G [00:00<?, ?B/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/3.35G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

In [6]:
rag_agent

<__main__.RagAgent at 0x7f7c49cf21b0>

### Index a Document into the Vector Store

In [7]:
rag_agent.index_document("https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf")

{'pages': 11, 'chunks': 46, 'images': 1}

#### Quick Peek (Optional)

*Just run the ingest steps manually and inspect the result*

In [8]:
# * Document Ingest

url = "https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf"
pdf_bytes = rag_agent.fetch_document(url)
ing = rag_agent.document_ingest(pdf_bytes, extract_images=True, max_image_pages=3)

# Text chunks
len(ing["chunks"]), ing["chunks"][:2]   # show how many + first 2 chunks

# Pages
len(ing["pages"]), ing["pages"][0]      # page count + first page text

# Images
len(ing["images"]), ing["images"][:1]   # how many images + first image metadata

(1,
 [{'page': 3,
   'xref': 94,
   'pil': <PIL.Image.Image image mode=RGB size=1520x2239>}])

#### Inspect what's actually stored in Chroma

*After indexing, pull a few stored rows from the collections*

In [9]:
# * Text collection

corpus = rag_agent.default_corpus or "main"
col_text = rag_agent.knowledge_database(f"{corpus}_text")
res_text = col_text.get(include=["metadatas", "documents"], limit=5)

print("### Chroma:", f"{corpus}_text", "###")
print("Total rows returned:", len(res_text.get("ids", [])))

# IDs
print("\n### First up to 3 IDs:")
for i, _id in enumerate(res_text.get("ids", [])[:3], start=1):
    print(f"  {i}. {_id}")

# Metadatas
print("\n### First up to 3 metadatas:")
for i, md in enumerate(res_text.get("metadatas", [])[:3], start=1):
    print(f"  {i}. {md}")

# Document previews
docs = res_text.get("documents", [])
print("\n### Preview of up to 3 documents (first 600 chars each)\n")
for i, text in enumerate(docs[:3], start=1):
    if not text:
        continue
    meta = res_text["metadatas"][i-1] if i-1 < len(res_text["metadatas"]) else {}
    page = meta.get("page", "?")
    preview = text[:600] + ("..." if len(text) > 600 else "")
    print(f"[{i}] page={page} | chars={len(text)}")
    print(preview, "\n")

# * Images collection (if present)

try:
    col_img = rag_agent.knowledge_database(f"{corpus}_images")
    res_img = col_img.get(include=["metadatas"], limit=5)

    print("\n### Chroma:", f"{corpus}_images", "###")
    print("Total rows returned:", len(res_img.get("ids", [])))

    print("\n### Image IDs:")
    for i, _id in enumerate(res_img.get("ids", []), start=1):
        print(f"  {i}. {_id}")

    print("\n### Image metadatas:")
    for i, md in enumerate(res_img.get("metadatas", []), start=1):
        print(f"  {i}. {md}")

except Exception as e:
    print("\n### Chroma: images ###")
    print("No image collection or error:", e)

### Chroma: main_text ###
Total rows returned: 5

### First up to 3 IDs:
  1. main:chunk:0
  2. main:chunk:1
  3. main:chunk:2

### First up to 3 metadatas:
  1. {'page': 1}
  2. {'page': 1}
  3. {'page': 1}

### Preview of up to 3 documents (first 600 chars each)

[1] page=1 | chars=989
Attention Is All You Need Ashish Vaswani∗ Google Brain avaswani@google.com Noam Shazeer∗ Google Brain noam@google.com Niki Parmar∗ Google Research nikip@google.com Jakob Uszkoreit∗ Google Research usz@google.com Llion Jones∗ Google Research llion@google.com Aidan N. Gomez∗† University of Toronto aidan@cs.toronto.edu Łukasz Kaiser∗ Google Brain lukaszkaiser@google.com Illia Polosukhin∗‡ illia.polosukhin@gmail.com Abstract The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The best performing models als... 

[2] page=1 | chars=905
ence and convolutions entirely. Experiments on two machine translation tasks show 

### Start A Chat

In [10]:
# * Chat 1 - Ask a Question and Print the Answer

response = rag_agent.invoke("Tell me about the paper?")

print("Question: Tell me about the paper?")
print("\n--- Answer ---")
print(response["answer"])
print("\n--- Top Source ---")
print(response["source"])
print("\n--- All Sources ---")
print(response["sources"])
print("\n--- System Prompt ---")
print(response["system_prompt"])
print("\n--- Context ---")
print(response["context"])



Question: Tell me about the paper?

--- Answer ---
This paper discusses effective approaches to attention-based neural machine translation, a decomposable attention model, a deep reinforced model for abstractive summarization, and the use of output embedding to improve language models. It also covers neural machine translation of rare words with subword units, linear time neural machine translation, structured attention networks, Adam optimization method, factorization tricks for LSTM networks, and a structured self-attentive sentence embedding.

--- Top Source ---
{'id': 'main:chunk:43', 'page': 11, 'score': 1.0, 'modality': 'text', 'snippet': '[21] Minh-Thang Luong, Hieu Pham, and Christopher D Manning. Effective approaches to attention- based neural machine translation. arXiv preprint arXiv:1508.04025, 2015. [22] Ankur Parikh, Oscar Täckström, Dipanjan Das, and Jakob Uszkoreit. '}

--- All Sources ---
[{'id': 'main:chunk:43', 'page': 11, 'score': 1.0, 'modality': 'text', 'snippet': 

In [11]:
# * Chat 2 - Ask a Question and Print the Answer

response = rag_agent.invoke("What is the main idea?")

print("Question: What is the main idea?")
print("\n--- Answer ---")
print(response["answer"])
print("\n--- Top Source ---")
print(response["source"])
print("\n--- All Sources ---")
print(response["sources"])
print("\n--- System Prompt ---")
print(response["system_prompt"])
print("\n--- Context ---")
print(response["context"])

Question: What is the main idea?

--- Answer ---
The main idea is that self-attention is an attention mechanism that relates different positions of a single sequence in order to compute a representation of the sequence. It has been used successfully in a variety of tasks including reading comprehension, abstractive summarization, textual entailment, and learning task-independent sentence representations. End-to-end memory networks are based on a recurrent attention mechanism instead of sequence-aligned recurrence and have been shown to perform well on simple-language question answering and language modeling tasks. The Transformer is the first transduction model relying entirely on self-attention to compute representations of its input and output without using sequence-aligned RNNs or convolution.

--- Top Source ---
{'id': 'main:chunk:7', 'page': 2, 'score': 1.0, 'modality': 'text', 'snippet': 'ibed in section 3.2. Self-attention, sometimes called intra-attention is an attention mechan

In [12]:
# * Chat 3 - Ask a Question and Print the Answer

response = rag_agent.invoke("What are the main results?")

print("Question: What are the main results?")
print("\n--- Answer ---")
print(response["answer"])
print("\n--- Top Source ---")
print(response["source"])
print("\n--- All Sources ---")
print(response["sources"])
print("\n--- System Prompt ---")
print(response["system_prompt"])
print("\n--- Context ---")
print(response["context"])

Question: What are the main results?

--- Answer ---
The main results are that single-head attention is 0.9 BLEU worse than the best setting, and quality also drops off with too many heads.

--- Top Source ---
{'id': 'main:chunk:34', 'page': 8, 'score': 1.0, 'modality': 'text', 'snippet': 'e results in Table 3. In Table 3 rows (A), we vary the number of attention heads and the attention key and value dimensions, keeping the amount of computation constant, as described in Section 3.2.2. While single-head attention is 0.9 BLEU '}

--- All Sources ---
[{'id': 'main:chunk:34', 'page': 8, 'score': 1.0, 'modality': 'text', 'snippet': 'e results in Table 3. In Table 3 rows (A), we vary the number of attention heads and the attention key and value dimensions, keeping the amount of computation constant, as described in Section 3.2.2. While single-head attention is 0.9 BLEU '}, {'id': 'main:chunk:15', 'page': 4, 'score': 1.0, 'modality': 'text', 'snippet': 'Figure 2. Multi-head attention allows 

In [13]:
# * Chat 4 - Ask a Question and Print the Answer (Off Question)

response = rag_agent.invoke("Who is Messi?")

print("Question: Who is Messi?")
print("\n--- Answer ---")
print(response["answer"])
print("\n--- Top Source ---")
print(response["source"])
print("\n--- All Sources ---")
print(response["sources"])
print("\n--- System Prompt ---")
print(response["system_prompt"])
print("\n--- Context ---")
print(response["context"])

Question: Who is Messi?

--- Answer ---
I don't know.

--- Top Source ---
{'id': 'main:chunk:7', 'page': 2, 'score': 1.0, 'modality': 'text', 'snippet': 'ibed in section 3.2. Self-attention, sometimes called intra-attention is an attention mechanism relating different positions of a single sequence in order to compute a representation of the sequence. Self-attention has been used successfull'}

--- All Sources ---
[{'id': 'main:chunk:7', 'page': 2, 'score': 1.0, 'modality': 'text', 'snippet': 'ibed in section 3.2. Self-attention, sometimes called intra-attention is an attention mechanism relating different positions of a single sequence in order to compute a representation of the sequence. Self-attention has been used successfull'}, {'id': 'main:chunk:1', 'page': 1, 'score': 1.0, 'modality': 'text', 'snippet': 'ence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring signiﬁcant