# LlamaIndex + W&B: RAG with Evaluation
 <img src="./cover.png" width="50%" align="center">
 
This Jupyter Notebook demonstrates how to use LlamaIndex with Weights & Biases (W&B) for Retrieval-Augmented Generation (RAG).  
We will set up the environment, read the documents, initialize W&B, run queries, and evaluate the results.

## 0️⃣ | Initial Setup
 First, we need to import necessary libraries and set up our environment.

In [None]:
#!pip install -r requirements.txt

In [None]:
# Importing required libraries
import warnings
import os
import openai
from pathlib import Path
from dotenv import load_dotenv
from llama_index.llms import OpenAI
import wandb

# Configuring warnings and environmental variables
warnings.filterwarnings("ignore")
WANDB_PROJECT = "test_local_v2"

### 📋 Read Documents
 We will now load the documents for our RAG setup. In this example, we use a PDF file named 'Mixtral.pdf'.

In [None]:
# Loading the PDFReader from llama_index
from llama_index import VectorStoreIndex, download_loader

PDFReader = download_loader("PDFReader")
loader = PDFReader()
documents = loader.load_data(file=Path("./Mixtral.pdf"))

In [None]:
documents[:3]

### 📉 Initialize W&B
 Weights & Biases (W&B) is used for tracking experiments, visualizing data, and sharing insights. We initialize it here for our project.

In [None]:
# Initialize W&B for tracking and visualizations
from llama_index import ServiceContext
from llama_index.callbacks import CallbackManager, WandbCallbackHandler

wandb_args = {"project": WANDB_PROJECT, "name": "baseline-rag"}
wandb_callback = WandbCallbackHandler(run_args=wandb_args)
callback_manager = CallbackManager([wandb_callback])

### 🏎️ Setup Local LLM

Here we use the Mistral-7B model. To make these experiments even faster, I will use a quantised version of the model. I highly recommend checking out QuIP-quantised models, since they support quantisation down to 2 bits with extremely low loss in quality.
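
To get a feel for what quantisation trades away, here is a minimal sketch of plain round-to-nearest quantisation on a handful of made-up weights. It is not QuIP and not the scheme used inside GGUF files; the function names and the sample values are assumptions for illustration only.

```python
def quantize(weights, bits):
    """Round each weight to the nearest of 2**bits evenly spaced levels."""
    lo, hi = min(weights), max(weights)
    levels = 2 ** bits - 1  # number of steps between lo and hi
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((w - lo) / scale) for w in weights]  # integers in [0, levels]
    return codes, lo, scale


def dequantize(codes, lo, scale):
    """Map the integer codes back to approximate float weights."""
    return [lo + c * scale for c in codes]


weights = [-0.8, -0.1, 0.0, 0.3, 0.9]  # made-up values, not real model weights
for bits in (8, 3, 2):
    codes, lo, scale = quantize(weights, bits)
    restored = dequantize(codes, lo, scale)
    max_err = max(abs(w - r) for w, r in zip(weights, restored))
    print(f"{bits}-bit: codes={codes} max_error={max_err:.4f}")
```

Fewer bits means fewer levels and a larger rounding error per weight. Methods such as QuIP aim to keep the model usable at low bit widths, which naive rounding like this does not.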

In [None]:
from llama_index import (
    SimpleDirectoryReader,
    VectorStoreIndex,
    ServiceContext,
)
from llama_index.llms import LlamaCPP
from llama_index.llms.llama_utils import (
    messages_to_prompt,
    completion_to_prompt,
)

Note that if you installed llama.cpp properly, it will use the available GPU backend by default: either CUDA or Metal.

In [None]:
llm = LlamaCPP(
    # You can pass in the URL to a GGUF model to download it automatically
    model_url=None,
    # optionally, you can set the path to a pre-downloaded model instead of model_url
    model_path="/Users/nkise/Documents/projects/Courses 📜/RAG/llama.cpp/models/mistral-instruct-7b-q3k-small.gguf",
    temperature=0.1,
    max_new_tokens=512,
    # kept just under 4096 tokens (the Llama 2 window this value was chosen for) to allow for some wiggle room
    context_window=3900,
    # kwargs to pass to __call__()
    generate_kwargs={},
    # kwargs to pass to __init__()
    # set to at least 1 to use GPU
    model_kwargs={"n_gpu_layers": 1},
    # transform inputs into the Llama 2 prompt format
    messages_to_prompt=messages_to_prompt,
    completion_to_prompt=completion_to_prompt,
    verbose=False,
)

Otherwise, if you want to use the GPT-3.5 API, run the following cell instead of the LlamaCPP cells above. It reassigns `llm`, so skip it if you want to keep the local model.
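
A missing API key usually only shows up as an error on the first request. A small helper can fail earlier with a clearer message. This is a minimal sketch; `require_env` and the `DEMO_API_KEY` variable are hypothetical names used only for this illustration.

```python
import os


def require_env(name):
    """Return the value of an environment variable or raise a clear error."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


# Demonstration with a throwaway variable (hypothetical name, not a real key)
os.environ["DEMO_API_KEY"] = "not-a-real-key"
print(require_env("DEMO_API_KEY"))

# In the notebook this could be used as:
# openai.api_key = require_env("OPENAI_API_KEY")
```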

In [None]:
load_dotenv()
openai.api_key = os.getenv(
    "OPENAI_API_KEY"
)  ## DON'T FORGET TO SET YOUR API KEY AS AN ENVIRONMENTAL VARIABLE

llm = OpenAI(model="gpt-3.5-turbo", temperature=0.1)

Let's test that the model is working.

In [None]:
response = llm.complete("Hello! Can you tell me a poem about cats and dogs?")
print(response.text)

## 1️⃣ Baseline RAG

 ### Setup ServiceContext
The ServiceContext in LlamaIndex bundles the components used during indexing and querying, such as the LLM, the embedding model, and the callbacks. We set it up with the required configuration.

Note that we also use a local model for embeddings. LlamaIndex will automatically detect the available GPU, so there is nothing to worry about if you are using a Mac with an M-series processor.

In [None]:
# Setting up the ServiceContext with the language model and embedding model
embed_model = "local:BAAI/bge-small-en-v1.5"
service_context = ServiceContext.from_defaults(
    llm=llm, 
    embed_model=embed_model, 
    callback_manager=callback_manager
)

 ### Create VectorStore
The `VectorStoreIndex` in LlamaIndex is responsible for chunking the documents, embedding the chunks, and storing the resulting vectors. We create and configure it here.
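
To make the chunk, embed, store, and retrieve steps concrete, here is a toy version in plain Python. It is only a sketch of the idea, not how LlamaIndex implements it: the "embedding" is a bag-of-words count vector standing in for a neural model, and `ToyVectorIndex` and the sample sentences are assumptions made for illustration.

```python
import math
import re
from collections import Counter


def chunk(text, size=12):
    """Split text into chunks of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def embed(text):
    """Toy embedding: word counts (a stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class ToyVectorIndex:
    def __init__(self, texts, chunk_size=12):
        # Chunk every document, then store one vector per chunk
        self.chunks = [c for t in texts for c in chunk(t, chunk_size)]
        self.vectors = [embed(c) for c in self.chunks]

    def retrieve(self, query, top_k=2):
        """Return the `top_k` chunks most similar to the query."""
        q = embed(query)
        scored = sorted(
            zip(self.chunks, self.vectors),
            key=lambda pair: cosine(q, pair[1]),
            reverse=True,
        )
        return [text for text, _ in scored[:top_k]]


docs = [  # illustrative sample sentences
    "Mixtral is a sparse mixture of experts language model.",
    "A router network selects two experts to process each token.",
    "The model weights are released under the Apache 2.0 license.",
]
toy_index = ToyVectorIndex(docs)
print(toy_index.retrieve("how does a router select experts", top_k=1))
```

A query engine then passes the retrieved chunks to the LLM as context, which is the part `index.as_query_engine()` adds on top of retrieval.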

In [None]:
# Creating the VectorStoreIndex for document handling
index = VectorStoreIndex.from_documents(documents, service_context=service_context)

# Converting the index to a query engine for retrieval
query_engine = index.as_query_engine()

 ### Testing the Query Engine
 Let's test our query engine by asking a few questions related to the loaded documents.

In [None]:
# Defining a function to display responses
from llama_index.response.notebook_utils import display_response


def query_and_display(question):
    response = query_engine.query(question)
    display_response(response)


In [None]:
# Testing the query engine with different questions
query_and_display("Who wrote Mixtral paper?")
query_and_display("What is Sparse MoE?")
query_and_display("How many experts are used in Sparse MoE?")
query_and_display("Where can I find the code?")

In [None]:
# Closing the W&B run after queries
wandb_callback.finish()

## 2️⃣ Evaluation
 We now move to the evaluation phase where we will assess the performance of our RAG setup using different metrics.

### ❓ Generating Eval questions

To evaluate, we need questions. Let's be honest: we are too lazy to write them ourselves. So let's use the question generator already available in LlamaIndex (`DatasetGenerator`) together with the GPT-3.5 API to generate them for us. Alternatively, you can use your local LLM: simply replace the `llm` object.

In [None]:
# Importing necessary modules for evaluation
import copy
import random
import nest_asyncio
import pandas as pd
from llama_index.evaluation import (
    DatasetGenerator,
    RelevancyEvaluator,
    ResponseEvaluator,
    RetrieverEvaluator,
)

In [None]:
# Initialize W&B for evaluation
wandb_args = {"project": WANDB_PROJECT, "name": "eval-questions-generation"}
wandb_callback = WandbCallbackHandler(run_args=wandb_args)
callback_manager = CallbackManager([wandb_callback])
llm_eval = OpenAI(temperature=0, model="gpt-3.5-turbo")
service_context = ServiceContext.from_defaults(
    llm=llm_eval, 
    embed_model=embed_model, 
    callback_manager=callback_manager
)

In [None]:
# Setting up the documents and generating questions for evaluation
random_documents = copy.deepcopy(documents)

# Shuffling the documents and selecting 5 random documents. Just to make the evaluation quicker
random.shuffle(random_documents)
random_documents = random_documents[:5]

In [None]:
# Generating questions from the documents for evaluation
data_generator = DatasetGenerator.from_documents(
    random_documents, service_context=service_context, num_questions_per_chunk=2
)

# Applying nest_asyncio to run async code in Jupyter
nest_asyncio.apply()
eval_questions = data_generator.generate_questions_from_nodes()

In [None]:
eval_questions[:3]

Ideally, you want to save the evaluation questions as an artifact in W&B. This way you can easily view, share, and reuse them.
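
The questions are stored one per line, so a question that itself contains a newline would come back as two questions. Here is a minimal local sketch of a safe round trip, without W&B. The helper names and sample questions are hypothetical.

```python
import tempfile
from pathlib import Path


def save_questions(questions, path):
    """Write one question per line, collapsing internal whitespace and newlines."""
    cleaned = [" ".join(q.split()) for q in questions]
    Path(path).write_text("\n".join(cleaned), encoding="utf-8")


def load_questions(path):
    """Read the questions back, skipping blank lines."""
    lines = Path(path).read_text(encoding="utf-8").split("\n")
    return [line for line in lines if line.strip()]


sample_questions = ["What is Sparse MoE?", "Who wrote\nthe paper?"]  # hypothetical examples
with tempfile.TemporaryDirectory() as tmp:
    file_path = Path(tmp) / "questions.txt"
    save_questions(sample_questions, file_path)
    reloaded = load_questions(file_path)
print(reloaded)
```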

In [None]:
import wandb

In [None]:
# Persisting the questions to a text file in a W&B artifact, for later loading
# Create an artifact object
artifact = wandb.Artifact(name="eval-questions", type="text")

# Add the list of questions as a file to the artifact
with artifact.new_file("questions.txt", mode="w") as f:
    f.write("\n".join(eval_questions))

# Log the artifact to W&B
wandb.log_artifact(artifact)

You can easily load them later:

In [None]:
# # Look up the artifact (requires an active W&B run)
# artifact = wandb.use_artifact("eval-questions:v0")

# # Download the artifact contents to a local directory
# artifact_dir = artifact.download()

# # Read the list of questions from the file
# with open(Path(artifact_dir) / "questions.txt", "r") as f:
#     questions = f.read().split("\n")

# # Print the list of questions
# print(questions)


In [None]:
wandb_callback.finish()

### 🔎 Evaluation on the validation set
 We evaluate the responses on a validation set to measure the effectiveness of our setup.
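
The summary numbers logged later are simply means over per-question scores. As a minimal sketch, assuming each evaluator gives a pass/fail judgement per question that is encoded as 1.0 or 0.0, the mean is the pass rate. `ToyEvalResult` is a hypothetical stand-in for an evaluator result, and the judgements below are made up.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class ToyEvalResult:
    """Hypothetical stand-in for one evaluator judgement on one question."""
    query: str
    passing: bool

    @property
    def score(self):
        # Assumption: a pass counts as 1.0 and a fail as 0.0
        return 1.0 if self.passing else 0.0


def summarize(results_by_metric):
    """Return the mean score (the pass rate) for each metric."""
    return {
        metric: mean(result.score for result in results)
        for metric, results in results_by_metric.items()
    }


toy_results = {  # made-up judgements, not real evaluation output
    "faithfulness": [ToyEvalResult("q1", True), ToyEvalResult("q2", True),
                     ToyEvalResult("q3", False), ToyEvalResult("q4", True)],
    "relevancy": [ToyEvalResult("q1", True), ToyEvalResult("q2", False)],
}
print(summarize(toy_results))
```

With only a handful of questions, one changed judgement moves the mean a lot, so treat small differences between runs with care.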

In [None]:
# Initialize W&B for response evaluation
wandb_args = {"project": WANDB_PROJECT, "name": "baseline-evaluation"}
wandb_callback = WandbCallbackHandler(run_args=wandb_args)
callback_manager = CallbackManager([wandb_callback])

In [None]:
# Preparing the data for evaluation
question_df = pd.DataFrame(columns=["questions"], data=eval_questions)
question_df.head()

In [None]:
# Setup for evaluating the responses
llm_eval = OpenAI(temperature=0, model="gpt-3.5-turbo")
service_context_eval = ServiceContext.from_defaults(
    llm=llm_eval, 
    callback_manager=callback_manager
)

In [None]:
# Running the evaluation using BatchEvalRunner
from llama_index.evaluation import (
    BatchEvalRunner,
    FaithfulnessEvaluator,
    RelevancyEvaluator,
)

faithfulness_evaluator = FaithfulnessEvaluator(service_context=service_context_eval)
relevancy_evaluator = RelevancyEvaluator(service_context=service_context_eval)
runner = BatchEvalRunner(
    {"faithfulness": faithfulness_evaluator, "relevancy": relevancy_evaluator},
    workers=8,
)

eval_results = await runner.aevaluate_queries(
    index.as_query_engine(), queries=eval_questions
)

Here is the thing: the current integration of W&B and LlamaIndex is not perfect, so we need a few workarounds to log our information properly. It is fairly easy, though: we just use the wandb library itself.

In [None]:

# Make a dataframe from the results.
faithfulness_df = pd.DataFrame.from_records(
    [eval_result.dict() for eval_result in eval_results["faithfulness"]]
)
relevancy_df = pd.DataFrame.from_records(
    [eval_result.dict() for eval_result in eval_results["relevancy"]]
)
relevancy_df.head()

In [None]:
# Save the questions, faithfulness_df, and relevancy_df to CSV, dropping columns that contain missing values
question_df.to_csv("questions.csv", index=False)
faithfulness_df.dropna(axis=1).to_csv("faithfulness.csv", index=False)
relevancy_df.dropna(axis=1).to_csv("relevancy.csv", index=False)

In [None]:
# Create two W&B tables, one for faithfulness and one for relevancy; they are logged in the next cell.
import wandb

faithfulness_table = wandb.Table(dataframe=faithfulness_df)
relevancy_table = wandb.Table(dataframe=relevancy_df)

In [None]:
wandb.log({"faithfulness": faithfulness_table, "relevancy": relevancy_table})

In [None]:
# Log the mean faithfulness and relevancy scores to W&B as scalars
wandb.log({"faithfulness_mean": faithfulness_df["score"].mean()})
wandb.log({"relevancy_mean": relevancy_df["score"].mean()})


In [None]:
faithfulness_df["score"].mean(), relevancy_df["score"].mean()

In [None]:
wandb_callback.finish()

## 🚀 Advanced RAG
Now let's ramp up the RAG quality. Our baseline RAG can already answer questions, but not particularly well. Let's explore an advanced setup that uses hierarchical node parsing, so that retrieved chunks can be merged into their larger parent context, and re-ranking, so that the most relevant chunks are prioritised.
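
The re-ranking idea can be sketched in a few lines: a cheap first stage returns many candidates, then a more careful scorer re-orders them and only the best few are kept. In the sketch below, a trivial token-overlap function stands in for the real re-ranking model, and the candidate passages are made up for illustration.

```python
import re


def tokens(text):
    """Lowercased set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap_score(query, passage):
    """Stand-in relevance score: fraction of query tokens found in the passage."""
    q = tokens(query)
    return len(q & tokens(passage)) / len(q) if q else 0.0


def rerank_passages(query, candidates, top_n, score_fn=overlap_score):
    """Re-score first-stage candidates and keep the `top_n` best."""
    ranked = sorted(candidates, key=lambda passage: score_fn(query, passage), reverse=True)
    return ranked[:top_n]


candidates = [  # pretend these came back from the first-stage retriever
    "The paper was written by researchers at a lab.",
    "Code is linked from the project page.",
    "A router picks two experts for every token.",
]
print(rerank_passages("how many experts per token", candidates, top_n=1))
```

The pattern is the same as in the setup below, where the first stage retrieves 12 nodes and the re-ranker keeps 4.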

The method here is the one described in the Advanced RAG course on DeepLearning.AI. I highly recommend checking it out!
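
Before using the real classes, here is a toy sketch of the two ideas behind auto-merging: split the text into nested chunks, then replace retrieved leaves with their parent when enough siblings were retrieved. It counts words instead of tokens, and the 0.5 ratio threshold with a strict "more than" comparison is an assumption of this sketch, as are the function names.

```python
def split_words(words, size):
    """Split a list of words into consecutive pieces of at most `size` words."""
    return [words[i:i + size] for i in range(0, len(words), size)]


def build_hierarchy(text, chunk_sizes):
    """Return (nodes, parent_of): node id -> text, and child id -> parent id."""
    nodes, parent_of = {}, {}

    def recurse(words, level, parent_id):
        if level == len(chunk_sizes):
            return
        for piece in split_words(words, chunk_sizes[level]):
            node_id = len(nodes)
            nodes[node_id] = " ".join(piece)
            if parent_id is not None:
                parent_of[node_id] = parent_id
            recurse(piece, level + 1, node_id)

    recurse(text.split(), 0, None)
    return nodes, parent_of


def auto_merge(retrieved_ids, parent_of, ratio_threshold=0.5):
    """Swap sibling nodes for their parent when more than the threshold share was retrieved."""
    children = {}
    for child, parent in parent_of.items():
        children.setdefault(parent, []).append(child)
    current = set(retrieved_ids)
    changed = True
    while changed:  # repeat so merges can propagate up the hierarchy
        changed = False
        for parent, kids in children.items():
            hits = current.intersection(kids)
            if hits and len(hits) / len(kids) > ratio_threshold:
                current = (current - hits) | {parent}
                changed = True
    return sorted(current)


text = " ".join(f"w{i}" for i in range(16))  # 16 placeholder words
nodes, parent_of = build_hierarchy(text, chunk_sizes=[8, 4, 2])
merged = auto_merge([2, 3, 9], parent_of)
for node_id in merged:
    print(node_id, "->", nodes[node_id])
```

Here leaves 2 and 3 are both children of node 1, so they are replaced by it, while leaf 9 stays on its own because only half of its siblings were retrieved.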

In [None]:
from llama_index.node_parser import HierarchicalNodeParser

# Create the hierarchical node parser. Note that we have to specify the chunk size for each layer
node_parser = HierarchicalNodeParser.from_defaults(
    chunk_sizes=[2048, 512, 256]
)

In [None]:
# Get the nodes from the documents
nodes = node_parser.get_nodes_from_documents(documents)

**Printing the leaf node**

In [None]:
from llama_index.node_parser import get_leaf_nodes

leaf_nodes = get_leaf_nodes(nodes)
print(leaf_nodes[0].text)

**Now the first-layer parent node**

In [None]:
nodes_by_id = {node.node_id: node for node in nodes}

# Look up the parent of the leaf node printed above
parent_node = nodes_by_id[leaf_nodes[0].parent_node.node_id]
print(parent_node.text)

In [None]:
# Initialise WandbCallbackHandler and pass any wandb.init args
wandb_args = {"project": WANDB_PROJECT, "name": "adv-rag"}
wandb_callback = WandbCallbackHandler(run_args=wandb_args)
# pass wandb_callback to the service context
callback_manager = CallbackManager([wandb_callback])

# Creating the specification for the context retrieval
auto_merging_context = ServiceContext.from_defaults(
    llm=llm,
    embed_model="local:BAAI/bge-small-en-v1.5",
    node_parser=node_parser, # Note that hierarchical node parser in here
    callback_manager=callback_manager
)

In [None]:
from llama_index import VectorStoreIndex, StorageContext

# StorageContext is a utility container for nodes, graphs, and other document types
storage_context = StorageContext.from_defaults()
storage_context.docstore.add_documents(nodes)

# Creating the Index from the configuration
automerging_index = VectorStoreIndex(
    leaf_nodes, storage_context=storage_context, service_context=auto_merging_context
)

In [None]:
from llama_index.indices.postprocessor import SentenceTransformerRerank
from llama_index.retrievers import AutoMergingRetriever
from llama_index.query_engine import RetrieverQueryEngine

# Getting retriever from the index
automerging_retriever = automerging_index.as_retriever(
    similarity_top_k=12
)

# Creating AutoMergingRetriever
# Note we pass the retriever from Index with hierarchical node parser
retriever = AutoMergingRetriever(
    automerging_retriever, 
    automerging_index.storage_context, 
    verbose=True,
)

# Creating the re-ranker, we will need it later for merged chunks
rerank = SentenceTransformerRerank(top_n=4, model="BAAI/bge-reranker-base")

# Creating the query engine wrapper. We need the wrapper to attach postprocessors.
# Note that we pass the AutoMergingRetriever here, not the base retriever.
auto_merging_engine = RetrieverQueryEngine.from_args(
    retriever,
    node_postprocessors=[rerank],
    verbose=True,
    service_context=auto_merging_context,
)

In [None]:
# Run the query engine on a user question.
response = auto_merging_engine.query("Who wrote Mixtral paper?")
display_response(response)

In [None]:
response = auto_merging_engine.query("What is Sparse MoE?")
display_response(response)

In [None]:
response = auto_merging_engine.query("How many experts are used in Sparse MoE?")
display_response(response)

In [None]:
response = auto_merging_engine.query("Where can I find the code?")
display_response(response)

In [None]:
# close wandb run
wandb_callback.finish()

 #### Evaluation of Advanced RAG
We perform the same evaluation as before, this time on the auto-merging query engine.

In [None]:
# Initialize W&B for response evaluation
wandb_args = {"project": WANDB_PROJECT, "name":"evaluation-adv-rag"}
wandb_callback = WandbCallbackHandler(run_args=wandb_args)
callback_manager = CallbackManager([wandb_callback])

In [None]:
# Setup for evaluating the responses
llm_eval = OpenAI(temperature=0, model="gpt-3.5-turbo")
service_context_eval = ServiceContext.from_defaults(
    llm=llm_eval, 
    callback_manager=callback_manager
)

In [None]:
from llama_index.evaluation import BatchEvalRunner
from llama_index.evaluation import FaithfulnessEvaluator, RelevancyEvaluator
faithfulness_evaluator = FaithfulnessEvaluator(service_context=service_context_eval)
relevancy_evaluator = RelevancyEvaluator(service_context=service_context_eval)

runner = BatchEvalRunner(
    {"faithfulness": faithfulness_evaluator, "relevancy": relevancy_evaluator},
    workers=1,
)

eval_results = await runner.aevaluate_queries(
    auto_merging_engine, queries=eval_questions,
)

In [None]:

# Make a dataframe from the results.
faithfulness_df = pd.DataFrame.from_records(
    [eval_result.dict() for eval_result in eval_results["faithfulness"]]
)
relevancy_df = pd.DataFrame.from_records(
    [eval_result.dict() for eval_result in eval_results["relevancy"]]
)
relevancy_df.head()

In [None]:
# Create two W&B tables, one for faithfulness and one for relevancy; they are logged in the next cell.
faithfulness_table = wandb.Table(dataframe=faithfulness_df)
relevancy_table = wandb.Table(dataframe=relevancy_df)

In [None]:
wandb.log({"faithfulness": faithfulness_table, "relevancy": relevancy_table})

In [None]:
# Log the mean faithfulness and relevancy scores to W&B as scalars
wandb.log({"faithfulness_mean": faithfulness_df["score"].mean()})
wandb.log({"relevancy_mean": relevancy_df["score"].mean()})

Note how the scores improved compared with the baseline run!

In [None]:
faithfulness_df["score"].mean(), relevancy_df["score"].mean()

In [None]:
wandb_callback.finish()

In [None]:
response = auto_merging_engine.query("How many experts are used in Sparse MoE?")
display_response(response)

In [None]:
response