<a href="https://colab.research.google.com/github/ivanleech/llm_evaluator/blob/main/llm_evaluator.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Using TruLens to evaluate LLM/RAG 🔎

📓 This notebook explores advanced RAG methods, using a direct query engine as the baseline and comparing it with sentence-window retrieval and auto-merging retrieval.

To evaluate the responses and the context retrieved by each method, we use TruLens🐙 to measure a triad of metrics: context relevance (context to query), answer relevance (answer to query), and groundedness (answer to context).
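The triad can be pictured as three pairwise comparisons among the query, the retrieved context, and the answer. The sketch below is only an illustration of which pair each metric looks at: `overlap_score` is a hypothetical word-overlap stand-in for the LLM judge that TruLens calls, and the `triad` function name is an assumption, not part of any library.

```python
import re

def tokens(text):
    """Lowercased word set; a crude stand-in for real semantic comparison."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def overlap_score(reference, candidate):
    """Fraction of the candidate's words that also occur in the reference (0.0 to 1.0)."""
    cand = tokens(candidate)
    if not cand:
        return 0.0
    return len(cand & tokens(reference)) / len(cand)

def triad(query, context, answer):
    """Return the three pairwise scores that make up the RAG triad."""
    return {
        "context_relevance": overlap_score(context, query),  # context vs. query
        "answer_relevance": overlap_score(answer, query),    # answer vs. query
        "groundedness": overlap_score(context, answer),      # answer vs. context
    }

scores = triad(
    query="what is attention",
    context="attention maps a query and key value pairs to an output",
    answer="attention maps a query to an output",
)
print(scores)
```

In the notebook itself each of these comparisons is delegated to an LLM, which is why the scores depend on the quality of the evaluating model.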

Instead of using OpenAI, we use Ollama 🦙 to run models locally, so users without OpenAI access can run this notebook. Running it on Google Colab with the free T4 GPU is recommended. 🖥️


## 01 Setup

In [1]:
!git clone https://github.com/ivanleech/llm_evaluator.git

Cloning into 'llm_evaluator'...
remote: Enumerating objects: 12, done.
remote: Counting objects: 100% (12/12), done.
remote: Compressing objects: 100% (11/11), done.
remote: Total 12 (delta 3), reused 2 (delta 0), pack-reused 0
Receiving objects: 100% (12/12), 13.06 KiB | 3.26 MiB/s, done.
Resolving deltas: 100% (3/3), done.


In [2]:
%%capture
!pip install -r llm_evaluator/requirements.txt

Collecting trulens_eval==0.19.2 (from -r llm_evaluator/requirements.txt (line 1))
  Downloading trulens_eval-0.19.2-py3-none-any.whl (632 kB)
Collecting llama_index==0.9.24 (from -r llm_evaluator/requirements.txt (line 2))
  Downloading llama_index-0.9.24-py3-none-any.whl (15.7 MB)
Collecting pydantic==2.5.3 (from -r llm_evaluator/requirements.txt (line 3))
  Downloading pydantic-2.5.3-py3-none-any.whl (381 kB)
Collecting litellm==1.18.0 (from -r llm_evaluator/requirements.txt (line 4))
  Downloading litellm-1.18.0-py3-none-any.whl (2.4 MB)

In [3]:
!curl https://ollama.ai/install.sh | sh

from trulens_eval import Tru
tru = Tru()
tru.reset_database()

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  8422    0  8422    0     0  23448      0 --:--:-- --:--:-- --:--:-- 23394
>>> Downloading ollama...
############################################################################################# 100.0%
>>> Installing ollama to /usr/local/bin...
>>> Creating ollama user...
>>> Adding current user to ollama group...
>>> Creating ollama systemd service...
>>> The Ollama API is now available at 0.0.0.0:11434.
>>> Install complete. Run "ollama" from the command line.
🦑 Tru initialized with db url sqlite:///default.sqlite .
🛑 Secret keys may be written to the database. See the `database_redact_keys` option of `Tru` to prevent this.


In [4]:
import urllib.request
from llama_index import Document
from llama_index import SimpleDirectoryReader

# Download the paper and store it locally as a PDF
url = 'https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf'
file = 'attention.pdf'
urllib.request.urlretrieve(url, file)

# Create a llama_index Document that will be used to build the index
# The index later supplies the context used to answer questions with RAG
documents = SimpleDirectoryReader(input_files=[file]).load_data()
document = Document(text="\n\n".join([doc.text for doc in documents]))

In [16]:
import os
from time import sleep
from subprocess import Popen
from llama_index.embeddings import HuggingFaceEmbedding

# Embedding model and reranking model, both pulled from HuggingFace
# model = 'dolphin-phi'
model = 'llama2'
base_url = 'http://localhost:11434'
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en")
reranker_model = "BAAI/bge-reranker-base"

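# Start the Ollama server in the background, then pull the model.
# The same model is used as both the generating LLM and the evaluating LLM.
p = Popen(["ollama", "serve"])  # long-running server process
sleep(1)  # give the server a moment to start
os.system(f'ollama pull {model}')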

0
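The fixed `sleep(1)` in the cell above may be too short on a slow machine. A minimal sketch of a polling alternative follows; `server_is_up` and `wait_until_ready` are hypothetical helper names, and the sketch assumes the Ollama server answers a plain HTTP GET at its base URL.

```python
import time
import urllib.error
import urllib.request

def server_is_up(url, timeout=2.0):
    """Return True if an HTTP GET to `url` succeeds."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except (urllib.error.URLError, OSError):
        return False

def wait_until_ready(probe, attempts=30, delay=1.0, sleep=time.sleep):
    """Call `probe()` until it returns True or the attempts run out."""
    for _ in range(attempts):
        if probe():
            return True
        sleep(delay)
    return False

# Possible use in place of sleep(1), assuming the server responds at base_url:
# wait_until_ready(lambda: server_is_up('http://localhost:11434'))
```

Passing the probe as a callable keeps the waiting logic independent of the network, so it can be exercised without a running server.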

In [6]:
from langchain.llms import Ollama
from trulens_eval import LiteLLM
import litellm
litellm.set_verbose=False

# Two objects are created from the same model: one generates answers, the other evaluates RAG performance

# Used by llama_index ServiceContext as the generating LLM
ollama = Ollama(base_url=base_url, model=model)

# Used by TruLens as the evaluating LLM
ollama_provider = LiteLLM(model_engine=f"ollama/{model}", api_base=base_url)

## 02 Advanced RAG

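Retrieval-Augmented Generation (RAG) retrieves the passages most relevant to a query from an index and passes them to the LLM as context for generating the answer.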
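As a rough picture of the retrieve-then-generate flow, here is a toy sketch in plain Python. The bag-of-words `embed` function is a stand-in for a neural embedding model, and `retrieve` and `build_prompt` are hypothetical names chosen for illustration; none of this is llama_index code.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy embedding: a bag-of-words count vector (real systems use a neural model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, chunks, top_k=2):
    """Return the top_k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:top_k]

def build_prompt(query, context_chunks):
    """Place the retrieved chunks in front of the question for the LLM."""
    context = "\n".join(context_chunks)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

chunks = [
    "The Transformer relies entirely on attention mechanisms.",
    "Model weights are stored on disk.",
    "Recurrent models process tokens sequentially.",
]
question = "What does the Transformer rely on?"
top = retrieve(question, chunks, top_k=1)
prompt = build_prompt(question, top)
print(prompt)
```

The three engines in this section differ mainly in how the chunks are cut and which text is handed to the LLM after retrieval.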


### Direct Query Engine

In [7]:
from llama_index import VectorStoreIndex
from llama_index import ServiceContext

# ServiceContext bundles the generating LLM and the embedding model
service_context = ServiceContext.from_defaults(llm=ollama, embed_model=embed_model)

# VectorStoreIndex converts the document text into embeddings using the embedding model
# The query engine built from the index answers questions using context retrieved from the document
index = VectorStoreIndex.from_documents([document], service_context=service_context)
query_engine = index.as_query_engine()

### Sentence-window retrieval

In [8]:
from llama_index.node_parser import SentenceWindowNodeParser
from llama_index import StorageContext
from llama_index import load_index_from_storage

# SentenceWindowNodeParser splits the document into single-sentence nodes and stores, in each node's
# metadata, a window of up to window_size sentences on either side of that sentence
node_parser = SentenceWindowNodeParser.from_defaults(window_size=3, window_metadata_key="window", original_text_metadata_key="original_text")
sentence_context = ServiceContext.from_defaults(llm=ollama, embed_model=embed_model, node_parser=node_parser)

# Create and persist the index if it does not exist locally; otherwise load it from disk
save_dir = 'sentence_index'
if not os.path.exists(save_dir):
    sentence_index = VectorStoreIndex.from_documents([document], service_context=sentence_context)
    sentence_index.storage_context.persist(persist_dir=save_dir)
else:
    sentence_index = load_index_from_storage(StorageContext.from_defaults(persist_dir=save_dir), service_context=sentence_context)
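The idea behind sentence-window retrieval can be sketched without any library: match on a single sentence, but hand the LLM the sentences around it. The code below is a simplified illustration, not the llama_index implementation; it splits sentences with a basic regular expression, and `sentence_window_nodes` and `replace_with_window` are hypothetical names.

```python
import re

def sentence_window_nodes(text, window_size=3):
    """Split text into sentences; each node keeps its sentence plus a surrounding window."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    nodes = []
    for i, sentence in enumerate(sentences):
        lo = max(0, i - window_size)
        hi = min(len(sentences), i + window_size + 1)
        nodes.append({
            "text": sentence,  # what gets embedded and matched against the query
            "metadata": {
                "original_text": sentence,
                "window": " ".join(sentences[lo:hi]),  # what is later handed to the LLM
            },
        })
    return nodes

def replace_with_window(node):
    """Mimic the metadata-replacement step: swap the node text for its window."""
    return node["metadata"].get("window", node["text"])

nodes = sentence_window_nodes("A one. B two. C three. D four. E five.", window_size=1)
print(nodes[2]["text"])
print(replace_with_window(nodes[2]))
```

Embedding single sentences keeps the match precise, while the window restores enough surrounding text for the answer to make sense.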

In [9]:
from llama_index.indices.postprocessor import MetadataReplacementPostProcessor
from llama_index.indices.postprocessor import SentenceTransformerRerank

rerank_top_n=2
similarity_top_k=6

# First, retrieve the top-k nodes most similar to the user query (here, the 6 most similar)
# Then rerank those nodes and keep the top-n most relevant as context for RAG (here, the 2 most relevant)
postproc = MetadataReplacementPostProcessor(target_metadata_key="window")
rerank = SentenceTransformerRerank(top_n=rerank_top_n, model=reranker_model)

sentence_window_engine = sentence_index.as_query_engine(similarity_top_k=similarity_top_k, node_postprocessors=[postproc, rerank])
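The two-stage pattern used here, a broad similarity search followed by a stricter rerank, can be sketched as follows. Both scoring functions are hypothetical toys standing in for the embedding similarity and the cross-encoder reranker; the point is only that the reranker chooses from the shortlist the first stage produced.

```python
def retrieve_then_rerank(query, candidates, first_stage, second_stage, top_k=6, top_n=2):
    """Keep the top_k candidates by a cheap score, then the top_n of those by a costlier score."""
    shortlist = sorted(candidates, key=lambda c: first_stage(query, c), reverse=True)[:top_k]
    return sorted(shortlist, key=lambda c: second_stage(query, c), reverse=True)[:top_n]

# Hypothetical scorers, for illustration only
def shared_words(query, text):
    """Number of distinct words the query and text have in common."""
    return len(set(query.lower().split()) & set(text.lower().split()))

def shared_words_ratio(query, text):
    """Shared words divided by the text length, which favours short, focused texts."""
    words = text.lower().split()
    return shared_words(query, text) / len(words) if words else 0.0

docs = [
    "attention is all you need",
    "attention",
    "recurrent networks are slow to train",
    "attention is computed from queries keys and values in the model",
]
best = retrieve_then_rerank("what is attention", docs, shared_words, shared_words_ratio, top_k=3, top_n=1)
narrow = retrieve_then_rerank("what is attention", docs, shared_words, shared_words_ratio, top_k=2, top_n=1)
print(best, narrow)
```

With `top_k=2` the shortest document never reaches the reranker, which shows why `similarity_top_k` should be comfortably larger than `rerank_top_n`.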


### Automerging Retrieval

In [10]:
from llama_index.node_parser import HierarchicalNodeParser
from llama_index.node_parser import get_leaf_nodes


chunk_sizes = [2048, 512, 128]
node_parser = HierarchicalNodeParser.from_defaults(chunk_sizes=chunk_sizes)
nodes = node_parser.get_nodes_from_documents(documents)
leaf_nodes = get_leaf_nodes(nodes)
merging_context = ServiceContext.from_defaults(llm=ollama, embed_model=embed_model)
storage_context = StorageContext.from_defaults()
storage_context.docstore.add_documents(nodes)

# Create and persist the index if it does not exist locally; otherwise load it from disk
save_dir = 'merging_index'
if not os.path.exists(save_dir):
    automerging_index = VectorStoreIndex(leaf_nodes, storage_context=storage_context, service_context=merging_context)
    automerging_index.storage_context.persist(persist_dir=save_dir)
else:
    automerging_index = load_index_from_storage(StorageContext.from_defaults(persist_dir=save_dir), service_context=merging_context)

In [11]:
from llama_index.retrievers import AutoMergingRetriever
from llama_index.query_engine import RetrieverQueryEngine

# Retrieval starts from the leaf nodes. When enough leaf nodes sharing the same parent are retrieved,
# they are merged into that parent so the LLM receives one larger, contiguous chunk.
# The merge can repeat up the hierarchy.
base_retriever = automerging_index.as_retriever(similarity_top_k=similarity_top_k)
retriever = AutoMergingRetriever(base_retriever, automerging_index.storage_context, verbose=True)
rerank = SentenceTransformerRerank(top_n=rerank_top_n, model=reranker_model)
auto_merging_engine = RetrieverQueryEngine.from_args(retriever, service_context=merging_context, node_postprocessors=[rerank])
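The merging rule can be illustrated with a simplified, single-level sketch. This is not the llama_index implementation: `auto_merge` is a hypothetical function, and the `ratio_threshold` name and its value of 0.5 are assumptions chosen for the example.

```python
from collections import defaultdict

def auto_merge(retrieved_leaves, parent_of, children_of, ratio_threshold=0.5):
    """Replace retrieved leaves with their parent when enough siblings were retrieved.

    parent_of: maps a leaf id to its parent id.
    children_of: maps a parent id to the list of all its child ids.
    """
    hits = defaultdict(list)
    for leaf in retrieved_leaves:
        hits[parent_of[leaf]].append(leaf)

    merged = []
    for parent, leaves in hits.items():
        ratio = len(leaves) / len(children_of[parent])
        if ratio > ratio_threshold:
            merged.append(parent)   # enough siblings retrieved: use the larger parent chunk
        else:
            merged.extend(leaves)   # otherwise keep the individual leaves
    return merged

children_of = {"P1": ["a", "b", "c"], "P2": ["d", "e", "f", "g"]}
parent_of = {leaf: parent for parent, kids in children_of.items() for leaf in kids}
result = auto_merge(["a", "b", "d"], parent_of, children_of)
print(result)
```

Two of the three children of `P1` were retrieved, so they collapse into `P1`, while the single hit under `P2` stays a leaf. Applying the same rule again to the merged parents would walk further up the chunk hierarchy.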

## 03 Use TruLens to evaluate model

In [12]:
import numpy as np
from trulens_eval.feedback import Groundedness
from trulens_eval import Feedback, TruLlama

qa_relevance = (Feedback(ollama_provider.relevance_with_cot_reasons, name="Answer Relevance")
              .on_input_output())

qs_relevance = (Feedback(ollama_provider.relevance_with_cot_reasons, name = "Context Relevance")
              .on_input()
              .on(TruLlama.select_source_nodes().node.text)
              .aggregate(np.mean))

grounded = Groundedness(groundedness_provider=ollama_provider)

groundedness = (Feedback(grounded.groundedness_measure_with_cot_reasons, name="Groundedness")
              .on(TruLlama.select_source_nodes().node.text)
              .on_output()
              .aggregate(grounded.grounded_statements_aggregator))

feedbacks = [qa_relevance, qs_relevance, groundedness]

eval_questions = ['What is the paper about?', 'What is attention in context of the paper?', 'Who are the authors of the paper?']

✅ In Answer Relevance, input prompt will be set to __record__.main_input or `Select.RecordInput` .
✅ In Answer Relevance, input response will be set to __record__.main_output or `Select.RecordOutput` .
✅ In Context Relevance, input prompt will be set to __record__.main_input or `Select.RecordInput` .
✅ In Context Relevance, input response will be set to __record__.app.query.rets.source_nodes[:].node.text .
✅ In Groundedness, input source will be set to __record__.app.query.rets.source_nodes[:].node.text .
✅ In Groundedness, input statement will be set to __record__.main_output or `Select.RecordOutput` .
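The Context Relevance feedback above scores each retrieved source node separately and then averages the scores with `np.mean`. A minimal sketch of that score-then-aggregate shape is shown below; `keyword_judge` is a hypothetical stand-in for the LLM judge, and the sketch uses `statistics.mean` only to stay dependency-free.

```python
from statistics import mean

def context_relevance(question, chunks, judge):
    """Score each retrieved chunk against the question, then average the scores."""
    per_chunk = [judge(question, chunk) for chunk in chunks]
    return mean(per_chunk) if per_chunk else 0.0

# Hypothetical judge: 1.0 if the chunk shares any word with the question, else 0.0
def keyword_judge(question, chunk):
    q_words = set(question.lower().rstrip("?").split())
    return 1.0 if q_words & set(chunk.lower().split()) else 0.0

score = context_relevance(
    "Who are the authors?",
    ["the authors are listed first", "dropout is applied to each sub-layer"],
    keyword_judge,
)
print(score)
```

Averaging means that one irrelevant chunk lowers the score even when another chunk is fully relevant, which is one reason reranking to a small `top_n` tends to help this metric.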


In [None]:
tru_recorder = TruLlama(query_engine, app_id='Direct Query Engine', feedbacks=feedbacks)
for question in eval_questions:
    with tru_recorder as recording:
        response = query_engine.query(question)
        print(question)
        print(response)

What is the paper about?
The paper "Attention Is All You Need" by Ashish Vaswani et al. discusses a new architecture for neural machine translation called the Transformer, which replaces traditional recurrent neural network (RNN) and convolutional neural network (CNN) components with attention mechanisms. The Transformer model relies entirely on self-attention mechanisms, eliminating the need for RNNs or CNNs in the encoder-decoder architecture.

The paper introduces several innovations to the attention mechanism, including:

1. Multi-head attention: Instead of performing a single attention function with dmodel-dimensional keys, values, and queries, the Transformer performs multiple attention functions in parallel, each with its own learned linear projection. This allows the model to jointly attend to information from different representation subspaces at different positions.
2. Scaled dot-product attention: The attention function is computed as a weighted sum of the values, where the 

In [None]:
tru_recorder_sentence_window = TruLlama(sentence_window_engine, app_id='Sentence Window Query Engine', feedbacks=feedbacks)
for question in eval_questions:
    with tru_recorder_sentence_window as recording:
        response = sentence_window_engine.query(question)
        print(question)
        print(response)

In [None]:
tru_recorder_automerging = TruLlama(auto_merging_engine, app_id='Automerging Query Engine', feedbacks=feedbacks)
for question in eval_questions:
    with tru_recorder_automerging as recording:
        response = auto_merging_engine.query(question)
        print(question)
        print(response)
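The three recording loops above share the same shape, so they could be folded into one helper. The sketch below is an assumption about how that might look: `run_questions` is a hypothetical name, and `EchoEngine` with `nullcontext()` are stand-ins so the example runs without a model or a TruLens recorder.

```python
from contextlib import nullcontext

def run_questions(engine, recorder, questions):
    """Query `engine` once per question inside the `recorder` context and collect the answers."""
    answers = {}
    for question in questions:
        with recorder:
            answers[question] = str(engine.query(question))
    return answers

# Stand-in engine so the sketch runs on its own
class EchoEngine:
    def query(self, question):
        return f"echo: {question}"

demo = run_questions(EchoEngine(), nullcontext(), ["What is the paper about?"])
print(demo)
```

With the notebook's objects, a call would look like `run_questions(query_engine, tru_recorder, eval_questions)`, repeated for the sentence-window and auto-merging engines.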

In [None]:
tru.get_leaderboard(app_ids=[])

In [None]:
# tru.get_records_and_feedback(app_ids=[])[0] # pass an empty list of app_ids to get all

In [None]:
!curl ipecho.net/plain

In [None]:
try:
    tru.stop_dashboard()
except Exception:
    pass  # ignore the error, e.g. when no dashboard was running
sleep(5)
# Open the URL printed below and enter the IP address shown above when prompted
tru.run_dashboard(port=8501)