## LlamaIndex and Chroma

A simple notebook, largely inspired by Bhavik Jikadara's Medium post: https://bhavikjikadara.medium.com/llamaindex-chroma-building-a-simple-rag-pipeline-cd67fc184190

This is a one-stop notebook for anyone who just wants to test a simple RAG use case with Chroma and Llama 3 (served through Ollama).

## What is Retrieval-Augmented Generation (RAG)?
RAG (Retrieval-Augmented Generation) is a technique in natural language processing (NLP) that enhances the capabilities of language models by combining information retrieval with text generation. It allows a model to generate more accurate and contextually relevant responses by retrieving relevant pieces of information from a database or knowledge base before generating text.

![](https://miro.medium.com/v2/resize:fit:1400/format:webp/1*izQyDNIbCsyx8YI48avntA.png)

## 1. Data Indexing
The first step in building a RAG pipeline is data indexing. This process involves converting text data into a searchable database of vector embeddings, which represent the meaning of the text in a format that computers can easily understand.

- Document Chunking: The collection of documents is split into smaller chunks of text. This allows for more precise and relevant pieces of information to be fed into the language model when needed, avoiding information overload.
- Vector Embeddings: The chunks of text are then transformed into vector embeddings. These embeddings encode the meaning of natural language text into numerical representations.
- Vector Database: Finally, the vector embeddings are stored in a vector database, making them easily searchable.
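
The three indexing steps above can be sketched in plain Python. This is only an illustration under loud assumptions: `chunk_words` and `toy_embed` are hypothetical helpers, the "embedding" is a simple bag-of-words count rather than a real model, and the "vector database" is a Python list. The notebook itself relies on LlamaIndex and Chroma for all of this.

```python
from collections import Counter


def chunk_words(text: str, chunk_size: int = 8, overlap: int = 2) -> list[str]:
    """Split text into chunks of `chunk_size` words that share `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


def toy_embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(word.strip(".,").lower() for word in text.split())


text = (
    "Chroma stores vector embeddings. LlamaIndex splits documents into chunks. "
    "Each chunk is embedded and stored for later retrieval."
)

# The "vector database" here is just a list of (chunk, vector) pairs.
toy_index = [(chunk, toy_embed(chunk)) for chunk in chunk_words(text)]
for chunk, _ in toy_index:
    print(chunk)
```

The overlap means the last words of one chunk reappear at the start of the next, so a sentence cut at a chunk boundary is still seen in context by at least one chunk.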

## 2. Data Retrieval and Generation
Once the context data is stored as vector embeddings, the process of data retrieval and generation begins.

- Query Transformation: The user’s query (or prompt) is also transformed into a vector embedding, similar to how the context data was processed.
- Context Matching: The query vector is compared against all the vectors in the vector database. The top-k most similar chunks of context data are selected.
- Response Generation: The selected chunks of context, along with the user’s query, are fed into the language model (LLM) to generate a relevant and accurate response.
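
As a rough sketch of the context-matching step, the snippet below ranks stored chunks by cosine similarity to a query vector and keeps the top k. The vectors are made-up three-dimensional numbers and `cosine_similarity` / `top_k` are hypothetical helpers; a real pipeline would get the vectors from the embedding model and delegate the search to the vector database, which may use a different distance metric.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], store: list[tuple[str, list[float]]], k: int = 2):
    """Return the k (score, chunk) pairs most similar to the query vector."""
    scored = [(cosine_similarity(query_vec, vec), chunk) for chunk, vec in store]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Made-up chunks and vectors, for illustration only.
store = [
    ("Python and SQL experience", [0.9, 0.1, 0.0]),
    ("Hiking and photography", [0.0, 0.2, 0.9]),
    ("Machine learning projects", [0.7, 0.6, 0.1]),
]
query_vec = [1.0, 0.2, 0.0]

for score, chunk in top_k(query_vec, store, k=2):
    print(f"{score:.3f}  {chunk}")
```

The selected chunks would then be joined into the context part of the prompt that is sent to the LLM together with the user's question.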

## How to Build a Simple RAG Pipeline
Now that we’ve covered the theory behind a RAG pipeline, let’s dive into the practical implementation. Below are the steps we’ll follow:

- Set up the environment
- Import an LLM
- Import an embedding model
- Prepare the data
- Prompt Engineering
- Create the query engine

## Setting Up the Environment

First, we need to install the necessary libraries:

- **Chroma**: An AI-native open-source vector database, which will be used to create a vector database for our embeddings.
- **LlamaIndex**: A framework for building context-augmented generative AI applications with LLMs. It handles reading the context data, creating vector embeddings, building prompt templates, and prompting the LLM locally.

In [None]:
# Install the core libraries into the current kernel's environment:
%pip install chromadb llama-index

LlamaIndex ships as a minimal core, and each integration lives in its own package. To use Chroma (the vector database), Llama (the LLM, served through Ollama), and the text embedder (through Hugging Face), we need to install the matching llama-index integration packages.

In [None]:
%pip install llama-index-embeddings-huggingface
%pip install llama-index-llms-ollama
%pip install llama-index-vector-stores-chroma

In [None]:
import sys
import subprocess
import chromadb
from llama_index.core import PromptTemplate, Settings, SimpleDirectoryReader, StorageContext, VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.ollama import Ollama
from llama_index.vector_stores.chroma import ChromaVectorStore

## Install Ollama

Ollama uses a server-client architecture where models must be pulled (downloaded) and loaded onto the server before they can be used.

Here’s how you can automate the entire process in a Jupyter notebook, from starting the server to pulling the model and making requests:



In [None]:


def is_ollama_installed():
    """
    Check if Ollama is installed on the system.
    """
    try:
        result = subprocess.run(["ollama", "--version"], stdout=subprocess.PIPE, stderr=subprocess.PIPE)
        if result.returncode == 0:
            print(f"Ollama is installed: {result.stdout.decode().strip()}")
            return True
        else:
            print("Ollama is not installed.")
            return False
    except FileNotFoundError:
        print("Ollama is not installed.")
        return False

def install_ollama():
    """
    Install Ollama using the official installation script.
    """
    print("Installing Ollama...")
    try:
        # Run the official installation script. With shell=True the pipeline
        # must be a single string; passed as a list, only `curl` would run.
        subprocess.run(
            "curl -fsSL https://ollama.com/install.sh | sh",
            check=True,
            shell=True,
        )
        print("Ollama installation completed successfully.")
    except subprocess.CalledProcessError as e:
        print(f"Error occurred during installation: {e}")
        sys.exit(1)

def verify_installation():
    """
    Verify that Ollama is installed and available.
    """
    if is_ollama_installed():
        print("Ollama is ready to use.")
    else:
        print("Failed to verify Ollama installation. Please check manually.")
        sys.exit(1)

if not is_ollama_installed():
    install_ollama()
verify_installation()

## Importing Llama LLM

With the libraries imported, we can now bring in the Llama language model. I opted for Llama because it allows for local execution, which is both free and private. Using the Ollama library makes it simple:

### Launch Ollama server

In [None]:
import requests
import subprocess
import time

def start_ollama_server():
    try:
        subprocess.Popen(["ollama", "serve"])
        print("Ollama server started.")
    except FileNotFoundError:
        print("Error: Ollama is not installed. Please install it first.")

def pull_model(model_name):
    try:
        subprocess.run(["ollama", "pull", model_name], check=True)
        print(f"Model '{model_name}' pulled successfully.")
    except subprocess.CalledProcessError as e:
        print(f"Failed to pull model '{model_name}': {e}")

def is_server_running(base_url):
    try:
        response = requests.get(base_url)
        return response.status_code == 200
    except requests.exceptions.ConnectionError:
        return False

Now we start the server if it is not already running, and pull the model:

In [None]:

# Parameters
model_name = "llama3"
base_url = "http://localhost:11434"

# Ensure server is running
if not is_server_running(base_url):
    start_ollama_server()
    time.sleep(5)

# Pull the model
pull_model(model_name)
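
A fixed `time.sleep(5)` may be too short on a slow machine and needlessly long on a fast one. One possible alternative, sketched here with only the standard library, is to poll the server until it answers. `wait_for_server` is a hypothetical helper, not part of Ollama or LlamaIndex, and the timeout values are arbitrary assumptions.

```python
import time
import urllib.error
import urllib.request


def wait_for_server(url: str, timeout: float = 30.0, interval: float = 0.5) -> bool:
    """Poll `url` until it answers with HTTP 200 or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # Each attempt gets its own short request timeout (2 seconds).
            with urllib.request.urlopen(url, timeout=2) as response:
                if response.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server is not accepting connections yet
        time.sleep(interval)
    return False
```

In the cell above, the call `time.sleep(5)` could then be swapped for `wait_for_server(base_url)`, with the returned boolean checked before pulling the model.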

Then we test it out:

In [None]:
llm = Ollama(model=model_name, base_url=base_url)
response = llm.complete("Why is the sky blue?")
print(response.text)

Next, we need an embedding model to transform text into vector embeddings. I chose the **"BAAI/bge-small-en-v1.5"** model from Hugging Face, which is small and quick to set up, making it ideal for a proof of concept (POC).

In [None]:
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

"""
If you want a larger model or need Chinese-language support, these BGE models can be used instead:
    BGE_MODELS = (
        "BAAI/bge-small-en",
        "BAAI/bge-small-en-v1.5",
        "BAAI/bge-base-en",
        "BAAI/bge-base-en-v1.5",
        "BAAI/bge-large-en",
        "BAAI/bge-large-en-v1.5",
        "BAAI/bge-small-zh",
        "BAAI/bge-small-zh-v1.5",
        "BAAI/bge-base-zh",
        "BAAI/bge-base-zh-v1.5",
        "BAAI/bge-large-zh",
        "BAAI/bge-large-zh-v1.5",
    )
"""


Settings.llm = llm
Settings.embed_model = embed_model

## Preparing the Data
To prepare the data, we first read the context file using SimpleDirectoryReader. In this example, we're using a PDF of my one-page resume. We then create a vector database using Chroma and store the vector embeddings.

In [None]:
import chromadb.api
import chromadb.api.client


def collection_exist(chroma_client: chromadb.api.client.Client, name: str ):
    collection_list = chroma_client.list_collections()
    collection_name_set = set((item.name for item in collection_list))
    return name in collection_name_set

In [None]:
documents = SimpleDirectoryReader(input_files=["../data/external/Resume_Maxime_Bonnesoeur.pdf"]).load_data()
chroma_client = chromadb.EphemeralClient()
if not collection_exist(chroma_client=chroma_client, name = "ollama"):
    chroma_collection = chroma_client.create_collection("ollama")
else:
    chroma_collection = chroma_client.get_collection("ollama")
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(
    documents, 
    storage_context=storage_context, 
    embed_model=embed_model,
    transformations=[SentenceSplitter(chunk_size=256, chunk_overlap=10)]
)

## Prompt Engineering
With the RAG pipeline set up, the next step is writing a template query. This template assigns the LLM a task and persona, provides context, and plugs in the user’s question.

In [None]:
# Adjacent string literals are joined with no separator, so every sentence
# fragment carries its own trailing space or newline.
template = (
    "Imagine you are a data scientist's assistant and "
    "you answer a recruiter's questions about the data scientist's experience. "
    "Here is some context from the data scientist's "
    "resume related to the query:\n"
    "-----------------------------------------\n"
    "{context_str}\n"
    "-----------------------------------------\n"
    "Considering the above information, "
    "please respond to the following inquiry:\n\n"
    "Question: {query_str}\n\n"
    "Answer succinctly and ensure your response is "
    "clear to someone without a data science background. "
    "The data scientist's name is in the document."
)
qa_template = PromptTemplate(template)
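
To get a feel for what the LLM will receive, the placeholders can be filled by hand with plain `str.format`. This is a minimal sketch with a shortened, hypothetical template and made-up chunks; in the actual pipeline the query engine is expected to fill `{context_str}` with the retrieved chunks and `{query_str}` with the question.

```python
# Hypothetical, shortened template used only for this preview.
demo_template = (
    "Context:\n"
    "{context_str}\n"
    "\n"
    "Question: {query_str}\n"
    "Answer succinctly. "  # trailing space keeps the two sentences apart
    "Mention the person's name."
)

# Made-up retrieved chunks standing in for real resume text.
demo_chunks = [
    "Five years of Python development.",
    "Built data pipelines with SQL.",
]

preview = demo_template.format(
    context_str="\n".join(demo_chunks),
    query_str="Do you have experience with Python?",
)
print(preview)
```

Printing the filled-in prompt like this is a cheap way to spot formatting slips, such as two sentences glued together because a string fragment is missing its trailing space.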

Now we create the query engine, which ties together all the components of our RAG pipeline:

In [None]:
query_engine = index.as_query_engine(
    text_qa_template=qa_template,
    similarity_top_k=3
)

## Running the RAG Pipeline
The exciting part of building an AI application is seeing it work! To run the RAG pipeline, simply prompt the query engine with a question:

In [None]:
response = query_engine.query("Do you have experience with Python?")
print(response.response)