<a href="https://colab.research.google.com/github/mickymics/RAG-Implementation/blob/main/native_rag_llamaindex_impl.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
%pip install nest_asyncio

In [None]:
import nest_asyncio
nest_asyncio.apply()

In [None]:
%pip install -Uq jedi

In [None]:
%pip install -Uq llama-index

In [None]:
%pip install -Uq llama-index-embeddings-huggingface

**Straightforward RAG Steps**

In [None]:
import getpass
import os
os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")

**Global Settings vs ServiceContext**

**Settings** = global defaults applied everywhere (the simple approach).

**ServiceContext** = custom configuration per index (advanced), e.g. different LLMs or embeddings for different indexes. Note that ServiceContext is deprecated since LlamaIndex v0.10 in favor of Settings.

**Randomness control knob** [temperature=0.1] - Temperature tells the model how creative or deterministic it should be when generating text. A value of 0.1 keeps answers consistent and fact-focused, which suits retrieval (RAG).

**Low temperature** (0.0 – 0.2) - very predictable, safe, repetitive answers.
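To make the effect of temperature concrete, here is a minimal sketch of temperature-scaled softmax over a few candidate-token scores. The function name and the scores are made up for illustration; this is not how the OpenAI API is implemented, only the general idea that dividing scores by a small temperature concentrates probability on the top candidate.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities after dividing them by temperature."""
    if temperature <= 0:
        # A temperature of exactly 0 is not handled in this sketch.
        raise ValueError("temperature must be positive in this sketch")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for three candidate next tokens.
logits = [2.0, 1.0, 0.5]

low = softmax_with_temperature(logits, 0.1)   # close to deterministic
high = softmax_with_temperature(logits, 1.0)  # probability is more spread out

print("temperature=0.1:", [round(p, 3) for p in low])
print("temperature=1.0:", [round(p, 3) for p in high])
```

At temperature 0.1 almost all of the probability lands on the highest-scoring token, which is why low values give repeatable answers.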

In [None]:
from llama_index.core import Settings, VectorStoreIndex, SimpleDirectoryReader
from llama_index.llms.openai import OpenAI
# from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# Load every file under ./data into Document objects
documents = SimpleDirectoryReader(input_dir="./data").load_data()
print(f"Loaded {len(documents)} documents")

# Global defaults used by every index and query engine created below
# Settings.embed_model = OpenAIEmbedding(model="text-embedding-ada-002")
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-m3")
Settings.llm = OpenAI(model="gpt-4o-mini", temperature=0.1)

**Default Behavior**

When *VectorStoreIndex.from_documents()* runs, it does the following internally:

Before embedding, chunking happens implicitly: **LlamaIndex** automatically splits the documents into smaller pieces, a.k.a. chunks (default chunk size of 1024 tokens).

The chunks are then embedded and stored in the multidimensional vector index.

**By default, LlamaIndex uses SentenceSplitter with chunk_size=1024 tokens** and some overlap, and it defaults to **OpenAI Embeddings (text-embedding-ada-002)** and **OpenAI LLM (gpt-3.5-turbo)**. Because `Settings.embed_model` and `Settings.llm` were set above, this notebook uses BAAI/bge-m3 and gpt-4o-mini instead.
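The idea of fixed-size chunks with overlap can be sketched in plain Python. This is an illustration only: `chunk_tokens` is a hypothetical helper, whitespace-separated words stand in for real tokens, and unlike SentenceSplitter it makes no attempt to respect sentence boundaries.

```python
def chunk_tokens(tokens, chunk_size, chunk_overlap):
    """Split a token list into windows of chunk_size that share chunk_overlap tokens."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the text
    return chunks


# Whitespace-separated words stand in for real tokenizer output.
text = "one two three four five six seven eight nine ten"
tokens = text.split()

chunks = chunk_tokens(tokens, chunk_size=4, chunk_overlap=1)
for chunk in chunks:
    print(" ".join(chunk))
```

The last token of each chunk reappears as the first token of the next one, so a sentence cut at a chunk boundary still has some context on both sides.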

In [None]:
index = VectorStoreIndex.from_documents(documents)

**Stages of querying:**

**Retrieval**: retrieves the most relevant chunks from the vector store (default similarity_top_k=2).



```
query_engine = index.as_query_engine()  # returns a query engine, which wraps a retriever

# "Source Node 1/2" and "Source Node 2/2" in the output mean that two chunks (nodes) were retrieved as the top candidates.
```
**Postprocessing**: Nodes retrieved are optionally reranked, transformed, or filtered based on the metadata or keywords attached.

**Response synthesis**: The retrieved chunks are sent to the LLM as context, along with the query/prompt, to produce the final response. The source nodes can be displayed with pprint_response(response, show_source=True).
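The postprocessing and response synthesis stages can be sketched without any library. Everything below is hypothetical (the `ScoredChunk` class, the helper names, the prompt wording, and the invented chunk texts and scores); LlamaIndex's own postprocessors and prompt templates differ. The sketch only shows the flow: filter the retrieved chunks, then pack the survivors into a prompt for the LLM.

```python
from dataclasses import dataclass, field


@dataclass
class ScoredChunk:
    """Hypothetical stand-in for a retrieved node with a similarity score."""
    text: str
    score: float
    metadata: dict = field(default_factory=dict)


def postprocess(chunks, min_score, required_keyword=None):
    """Drop low-scoring chunks, optionally require a keyword, sort best first."""
    kept = [c for c in chunks if c.score >= min_score]
    if required_keyword is not None:
        kept = [c for c in kept if required_keyword.lower() in c.text.lower()]
    return sorted(kept, key=lambda c: c.score, reverse=True)


def build_prompt(query, chunks):
    """Assemble the context chunks and the question into a single prompt string."""
    context = "\n\n".join(f"[{i}] {c.text}" for i, c in enumerate(chunks, start=1))
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )


# Invented example data, not taken from the notebook's ./data folder.
retrieved = [
    ScoredChunk("Grafana dashboards read metrics from Prometheus.", 0.64, {"file": "b.md"}),
    ScoredChunk("Prometheus runs in its own Kubernetes namespace.", 0.82, {"file": "a.md"}),
    ScoredChunk("Release notes for an unrelated service.", 0.21, {"file": "c.md"}),
]

kept = postprocess(retrieved, min_score=0.5, required_keyword="prometheus")
prompt = build_prompt("What is Prometheus namespace", kept)
print(prompt)
```

The string in `prompt` is what would be sent to the LLM; the low-scoring third chunk never reaches it.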

**Retrieval Technique**

The user query is embedded with the same embedding model as the chunks; unlike the chunks, it is not stored in the vector index.

At query time, LlamaIndex retrieves the most relevant chunks (the nearest neighbours) from the vector store using cosine similarity.
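As a rough illustration of nearest-neighbour retrieval, the sketch below ranks a few toy vectors by cosine similarity to a query vector and keeps the top two, mirroring similarity_top_k=2. The three-dimensional vectors, chunk ids, and helper names are invented; real embeddings have hundreds or thousands of dimensions, and a vector store may use a more efficient search than this brute-force loop.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as unrelated to everything
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunk_vecs, k=2):
    """Return the k (chunk_id, score) pairs most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), chunk_id)
              for chunk_id, vec in chunk_vecs.items()]
    scored.sort(reverse=True)
    return [(chunk_id, score) for score, chunk_id in scored[:k]]


# Toy embeddings, made up for this example.
chunk_vecs = {
    "chunk-a": [0.9, 0.1, 0.0],
    "chunk-b": [0.7, 0.7, 0.0],
    "chunk-c": [0.0, 0.2, 0.9],
}
query_vec = [1.0, 0.2, 0.0]

results = top_k(query_vec, chunk_vecs, k=2)
for chunk_id, score in results:
    print(chunk_id, round(score, 3))
```

Cosine similarity compares direction rather than length, so a long chunk and a short query can still score highly if they point the same way in embedding space.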

**By default, LlamaIndex uses OpenAI's gpt-3.5-turbo model**; in this notebook the query engine uses gpt-4o-mini instead, as configured in `Settings.llm`.

In [None]:
from llama_index.core.response.pprint_utils import pprint_response

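# Build a query engine over the index (retrieval + response synthesis)
query_engine = index.as_query_engine()
response = query_engine.query("What is Prometheus namespace")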
pprint_response(response, show_source=True)
print(response)