<a href="https://colab.research.google.com/github/ivanleech/llm_evaluator/blob/main/llm_evaluator.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Using TruLens to evaluate LLM/RAG 🔎

📓 This notebook explores advanced RAG methods. It uses a direct query engine as the baseline and compares it with sentence-window retrieval and auto-merging retrieval.

To evaluate the responses and the retrieved context of each method, we use TruLens🐙 to measure a triad of metrics: context relevance (context to query), answer relevance (answer to query), and groundedness (answer to context).
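As a rough illustration of what each metric in the triad compares, the sketch below pairs up the three pieces of a RAG call. `RagRecord` and `triad_inputs` are hypothetical names made up for this example and are not part of TruLens; the actual scoring is done by an LLM judge later in the notebook.

```python
from dataclasses import dataclass


@dataclass
class RagRecord:
    """One RAG call: the user query, the retrieved chunks, and the final answer."""
    query: str
    contexts: list[str]
    answer: str


def triad_inputs(record: RagRecord) -> dict[str, list[tuple[str, str]]]:
    """Return the text pairs that each metric of the triad would be judged on."""
    return {
        # Is each retrieved chunk relevant to the query?
        "context_relevance": [(record.query, c) for c in record.contexts],
        # Does the answer address the query?
        "answer_relevance": [(record.query, record.answer)],
        # Is the answer supported by the retrieved chunks?
        "groundedness": [(c, record.answer) for c in record.contexts],
    }


record = RagRecord(
    query="What is the paper about?",
    contexts=["chunk one", "chunk two"],
    answer="It introduces the Transformer.",
)
pairs = triad_inputs(record)
print({name: len(p) for name, p in pairs.items()})
```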

Instead of OpenAI, we use Ollama 🦙 to run models locally, so users without OpenAI access can still run this notebook. We recommend running it on Google Colab with the free T4 GPU. 🖥️


## 01 Setup

In [1]:
%%capture
!git clone https://github.com/ivanleech/llm_evaluator.git
!pip install -r llm_evaluator/requirements.txt

In [2]:
!curl https://ollama.ai/install.sh | sh

from trulens_eval import Tru
tru = Tru()
tru.reset_database()

>>> Downloading ollama...
>>> Installing ollama to /usr/local/bin...
>>> Creating ollama user...
>>> Adding current user to ollama group...
>>> Creating ollama systemd service...
>>> The Ollama API is now available at 0.0.0.0:11434.
>>> Install complete. Run "ollama" from the command line.
🦑 Tru initialized with db url sqlite:///default.sqlite .
🛑 Secret keys may be written to the database. See the `database_redact_keys` option of `Tru` to prevent this.


In [3]:
import urllib.request
from llama_index import Document
from llama_index import SimpleDirectoryReader

# Download the paper and store it locally as a PDF
url = 'https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf'
file = 'attention.pdf'
urllib.request.urlretrieve(url, file)

# Load the PDF and merge its pages into a single llama_index Document
# The indexes built from this Document supply the context used to answer questions with RAG later
documents = SimpleDirectoryReader(input_files=[file]).load_data()
document = Document(text="\n\n".join([doc.text for doc in documents]))

In [4]:
import os
from time import sleep
from subprocess import Popen
from llama_index.embeddings import HuggingFaceEmbedding

# Load the embedding model from Hugging Face and name the reranker model (loaded later when the reranker is built)
# model = 'dolphin-phi'
model = 'llama2'
base_url = 'http://localhost:11434'
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en")
reranker_model = "BAAI/bge-reranker-base"

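# Start the Ollama server in the background, then pull the model
# The same model is used both to generate answers and to evaluate them
p = Popen(["ollama", "serve"])  # long-running server process
sleep(1)  # give the server a moment to start before pulling
os.system(f'ollama pull {model}')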

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


0
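The cell above waits a fixed `sleep(1)` before pulling the model. If the server takes longer to start, one possible alternative is to poll the port until it accepts connections. The helper below is a minimal sketch using only the standard library; `wait_for_port` is a hypothetical name, and the timeout values are arbitrary assumptions.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 30.0, interval: float = 0.5) -> bool:
    """Poll until a TCP connection to (host, port) succeeds or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Not listening yet (or unreachable): wait and retry.
            time.sleep(interval)
    return False


# Possible usage after Popen(["ollama", "serve"]); 11434 is the port in base_url:
# if not wait_for_port("localhost", 11434):
#     raise RuntimeError("Ollama server did not start in time")
```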

In [5]:
from langchain.llms import Ollama
from trulens_eval import LiteLLM
import litellm
litellm.set_verbose=False

# 2 objects are created from the same model, 1 for generating answers, 1 for evaluating the RAG performance

# Used by llama_index ServiceContext as generating llm
ollama = Ollama(base_url=base_url, model=model)

# Used by Trulens as evaluation llm
ollama_provider = LiteLLM(model_engine=f"ollama/{model}", api_base=base_url)

## 02 Advanced RAG

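Retrieval-Augmented Generation (RAG) retrieves passages relevant to a query from an index and passes them to the LLM as context for its answer. The three query engines below differ in how that context is chunked and retrieved.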


### Direct Query Engine

In [6]:
from llama_index import VectorStoreIndex
from llama_index import ServiceContext

# Creates ServiceContext, which is a bundler to hold generating llm and embedding llm
service_context = ServiceContext.from_defaults(llm=ollama, embed_model=embed_model)

# VectorStoreIndex converts text from earlier document into embeddings using embedding llm
# query_engine is created from the index, and is able to answer questions with context from the document
index = VectorStoreIndex.from_documents([document], service_context=service_context)
query_engine = index.as_query_engine()
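To make the retrieval step concrete, here is a toy sketch of embedding-based top-k retrieval. It is not how `VectorStoreIndex` is implemented: the bag-of-words `embed` function is a stand-in for the real embedding model, and all names and example chunks are made up for illustration.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: a bag-of-words count vector (stand-in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def top_k(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks whose vectors are most similar to the query vector."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


chunks = [
    "attention maps a query and key value pairs to an output",
    "the model was trained on eight gpus",
    "multi head attention runs several attention functions in parallel",
]
print(top_k("what is attention", chunks, k=2))
```

The retrieved chunks would then be placed in the prompt as context for the generating LLM.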

### Sentence-Window Retrieval

In [7]:
from llama_index.node_parser import SentenceWindowNodeParser
from llama_index import StorageContext
from llama_index import load_index_from_storage

# SentenceWindowNodeParser splits the document into single-sentence nodes and stores the
# surrounding sentences (window_size on each side) in each node's metadata under "window"
node_parser = SentenceWindowNodeParser.from_defaults(window_size=3, window_metadata_key="window", original_text_metadata_key="original_text")
sentence_context = ServiceContext.from_defaults(llm=ollama, embed_model=embed_model, node_parser=node_parser)

# Build the index and save it to a local directory if none exists there; otherwise load it from disk
save_dir = 'sentence_index'
if not os.path.exists(save_dir):
    sentence_index = VectorStoreIndex.from_documents([document], service_context=sentence_context)
    sentence_index.storage_context.persist(persist_dir=save_dir)
else:
    sentence_index = load_index_from_storage(StorageContext.from_defaults(persist_dir=save_dir), service_context=sentence_context)
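The idea behind the sentence-window parser can be sketched in a few lines. This is a simplified illustration, not the library's implementation: the regex sentence splitter is an assumption, and `sentence_windows` is a hypothetical helper. The dictionary keys mirror the metadata keys used in the cell above.

```python
import re


def sentence_windows(text: str, window_size: int = 3) -> list[dict]:
    """Split text into sentences and attach to each one its window of neighbouring sentences."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    nodes = []
    for i, sentence in enumerate(sentences):
        lo = max(0, i - window_size)
        hi = min(len(sentences), i + window_size + 1)
        nodes.append({
            "original_text": sentence,             # the single sentence that is embedded and matched
            "window": " ".join(sentences[lo:hi]),  # the wider context later handed to the LLM
        })
    return nodes


nodes = sentence_windows("A one. B two. C three. D four. E five.", window_size=1)
print(nodes[2])
```

Matching on a single sentence keeps the embedding focused, while the window gives the LLM enough surrounding text to answer.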

In [8]:
from llama_index.indices.postprocessor import MetadataReplacementPostProcessor
from llama_index.indices.postprocessor import SentenceTransformerRerank

rerank_top_n=2
similarity_top_k=6

# Retrieve the top k nodes most similar to the user query (here, the 6 most similar)
# Then rerank them and keep the top n most relevant as context for RAG (here, the 2 most relevant)
postproc = MetadataReplacementPostProcessor(target_metadata_key="window")
rerank = SentenceTransformerRerank(top_n=rerank_top_n, model=reranker_model)

sentence_window_engine = sentence_index.as_query_engine(similarity_top_k=similarity_top_k, node_postprocessors=[postproc, rerank])
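The flow configured above (retrieve top k, replace each sentence with its window, rerank to top n) can be sketched as follows. The `word_overlap` scorer is a toy stand-in used for both stages; the notebook itself uses embedding similarity for the first stage and a reranker model for the second. All function names and example nodes are hypothetical.

```python
def word_overlap(query: str, text: str) -> int:
    """Toy relevance score: number of distinct words shared by query and text."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def retrieve_then_rerank(query, nodes, first_stage_score, rerank_score,
                         similarity_top_k=6, rerank_top_n=2):
    """Two-stage retrieval over nodes shaped like {'original_text': ..., 'window': ...}."""
    # Stage 1: score the single sentences and keep the top k candidates.
    candidates = sorted(
        nodes, key=lambda n: first_stage_score(query, n["original_text"]), reverse=True
    )[:similarity_top_k]
    # Metadata replacement: swap each matched sentence for its wider window.
    windows = [n["window"] for n in candidates]
    # Stage 2: rescore the windows and keep only the top n.
    return sorted(windows, key=lambda w: rerank_score(query, w), reverse=True)[:rerank_top_n]


nodes = [
    {"original_text": "attention is defined here",
     "window": "what attention is defined here as a weighted sum"},
    {"original_text": "training used adam",
     "window": "training used adam with warmup"},
    {"original_text": "attention heads run in parallel",
     "window": "multi head attention heads run in parallel"},
]
best = retrieve_then_rerank("what is attention", nodes, word_overlap, word_overlap,
                            similarity_top_k=2, rerank_top_n=1)
print(best)
```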


### Automerging Retrieval

In [9]:
from llama_index.node_parser import HierarchicalNodeParser
from llama_index.node_parser import get_leaf_nodes


chunk_sizes = [2048, 512, 128]
node_parser = HierarchicalNodeParser.from_defaults(chunk_sizes=chunk_sizes)
nodes = node_parser.get_nodes_from_documents(documents)
leaf_nodes = get_leaf_nodes(nodes)
merging_context = ServiceContext.from_defaults(llm=ollama, embed_model=embed_model)
storage_context = StorageContext.from_defaults()
storage_context.docstore.add_documents(nodes)

# Build the index and save it to a local directory if none exists there; otherwise load it from disk
save_dir = 'merging_index'
if not os.path.exists(save_dir):
    automerging_index = VectorStoreIndex(leaf_nodes, storage_context=storage_context, service_context=merging_context)
    automerging_index.storage_context.persist(persist_dir=save_dir)
else:
    automerging_index = load_index_from_storage(StorageContext.from_defaults(persist_dir=save_dir), service_context=merging_context,)

In [10]:
from llama_index.retrievers import AutoMergingRetriever
from llama_index.query_engine import RetrieverQueryEngine

# The base retriever fetches the top-k leaf nodes. When enough leaves sharing a parent are retrieved,
# AutoMergingRetriever replaces them with the parent node, and repeats this up the hierarchy
base_retriever = automerging_index.as_retriever(similarity_top_k=similarity_top_k)
retriever = AutoMergingRetriever(base_retriever, automerging_index.storage_context, verbose=True)
rerank = SentenceTransformerRerank(top_n=rerank_top_n, model=reranker_model)
auto_merging_engine = RetrieverQueryEngine.from_args(retriever, service_context=merging_context, node_postprocessors=[rerank])
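The merging rule can be illustrated with a small, single-level sketch. This is a simplification under stated assumptions: the real retriever works over the node hierarchy in the docstore, and the `ratio_threshold` value, the strict `>` comparison, and all names here are chosen for illustration only.

```python
from collections import defaultdict


def auto_merge(retrieved_leaf_ids, parent_of, children_of, ratio_threshold=0.5):
    """Replace retrieved leaves with their parent when enough siblings were retrieved."""
    by_parent = defaultdict(list)
    for leaf in retrieved_leaf_ids:
        by_parent[parent_of[leaf]].append(leaf)

    merged = []
    for parent, leaves in by_parent.items():
        ratio = len(leaves) / len(children_of[parent])
        if ratio > ratio_threshold:
            merged.append(parent)  # enough children were hit: return the larger parent chunk
        else:
            merged.extend(leaves)  # too few: keep the individual leaves
    return merged


children_of = {"P1": ["a", "b", "c", "d"], "P2": ["e", "f", "g", "h"]}
parent_of = {leaf: p for p, leaves in children_of.items() for leaf in leaves}
print(auto_merge(["a", "b", "c", "e"], parent_of, children_of))
```

Here three of the four children of `P1` are retrieved, so they collapse into `P1`, while the lone hit under `P2` stays a leaf.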

## 03 Use TruLens to evaluate model

In [11]:
import numpy as np
from trulens_eval.feedback import Groundedness
from trulens_eval import Feedback, TruLlama

# Answer relevance: is the answer relevant to the question?
qa_relevance = (Feedback(ollama_provider.relevance_with_cot_reasons, name="Answer Relevance")
              .on_input_output())

# Context relevance: is each retrieved chunk relevant to the question? Scores are averaged over chunks
qs_relevance = (Feedback(ollama_provider.relevance_with_cot_reasons, name="Context Relevance")
              .on_input()
              .on(TruLlama.select_source_nodes().node.text)
              .aggregate(np.mean))

# Groundedness: is the answer supported by the retrieved chunks?
grounded = Groundedness(groundedness_provider=ollama_provider)

groundedness = (Feedback(grounded.groundedness_measure_with_cot_reasons, name="Groundedness")
              .on(TruLlama.select_source_nodes().node.text)
              .on_output()
              .aggregate(grounded.grounded_statements_aggregator))

feedbacks = [qa_relevance, qs_relevance, groundedness]

eval_questions = ['What is the paper about?', 'What is attention in context of the paper?', 'Who are the authors of the paper?']

✅ In Answer Relevance, input prompt will be set to __record__.main_input or `Select.RecordInput` .
✅ In Answer Relevance, input response will be set to __record__.main_output or `Select.RecordOutput` .
✅ In Context Relevance, input prompt will be set to __record__.main_input or `Select.RecordInput` .
✅ In Context Relevance, input response will be set to __record__.app.query.rets.source_nodes[:].node.text .
✅ In Groundedness, input source will be set to __record__.app.query.rets.source_nodes[:].node.text .
✅ In Groundedness, input statement will be set to __record__.main_output or `Select.RecordOutput` .
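Context relevance is scored once per retrieved chunk and then reduced to a single number by `.aggregate(np.mean)`. The sketch below shows that reduction with the standard library only. The explicit handling of an empty score list is an assumption added for illustration, and `aggregate_context_relevance` is a hypothetical helper, not a TruLens function.

```python
from statistics import fmean


def aggregate_context_relevance(chunk_scores: list[float]) -> float:
    """Average per-chunk relevance scores into one score for the record."""
    if not chunk_scores:
        # No chunk was scored (for example, nothing was retrieved): report NaN instead of raising.
        return float("nan")
    return fmean(chunk_scores)


print(aggregate_context_relevance([1.0, 0.5]))
print(aggregate_context_relevance([]))
```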


In [12]:
tru_recorder = TruLlama(query_engine, app_id='Direct Query Engine', feedbacks=feedbacks)
for question in eval_questions:
    with tru_recorder as recording:
        response = query_engine.query(question)
        print(question)
        print(response)



What is the paper about?
The paper "Attention Is All You Need" by Ashish Vaswani et al. discusses a new neural network architecture for machine translation called the Transformer. The Transformer replaces traditional recurrent neural networks (RNNs) and convolutional neural networks (CNNs) with attention mechanisms, which allow the model to focus on specific parts of the input sequence when generating each output element.

The paper introduces several key innovations:

1. Multi-head Attention: Instead of performing a single attention function with dmodel-dimensional keys, values, and queries, the Transformer performs multiple attention functions in parallel, each with their own set of learned linear projections. This allows the model to jointly attend to information from different representation subspaces at different positions.
2. Scaled Dot-Product Attention: The Transformer uses a scaled dot-product attention mechanism, which computes the attention weights by taking the dot product 

In [13]:
tru_recorder_sentence_window = TruLlama(sentence_window_engine, app_id='Sentence Window Query Engine', feedbacks=feedbacks)
for question in eval_questions:
    with tru_recorder_sentence_window as recording:
        response = sentence_window_engine.query(question)
        print(question)
        print(response)

What is the paper about?
Based on the provided context information, it appears that the paper "A Deep Reinforced Model for Abstractive Summarization" by Romain Paulus, Caiming Xiong, and Richard Socher (arXiv preprint arXiv:1705.04304, 2017) is focused on developing a deep reinforcement learning model for abstractive summarization. The paper proposes a new approach that combines the strengths of both sequence-to-sequence and reinforcement learning methods to improve the quality of generated summaries.

The authors propose a framework that uses a encoder-decoder architecture with a reinforcement learning agent to optimize the summary generation process. The encoder generates a contextualized representation of the input text, which is then passed to the decoder to generate the summary. The decoder is trained using a reinforcement learning agent that rewards the model for generating high-quality summaries based on a set of predefined criteria such as coherence, fluency, and accuracy.

The



What is attention in context of the paper?
In the context of the paper, "attention" refers to an attention mechanism used in various models, including the Transformer, ConvS2S, and ByteNet. Attention allows these models to focus on specific parts of the input sequence or output sequence when computing representations or making predictions. Self-attention is a type of attention that relates different positions within a single sequence, while end-to-end memory networks use a recurrent attention mechanism instead of sequence-aligned recurrence. The number of operations required to relate signals from two arbitrary input or output positions grows linearly for ConvS2S and logarithmically for ByteNet, which can make it more difficult to learn dependencies between distant positions. However, in the Transformer, attention is reduced to a constant number of operations through the use of Multi-Head Attention, which helps to counteract this effect.
Who are the authors of the paper?
Based on the p

In [14]:
tru_recorder_automerging = TruLlama(auto_merging_engine, app_id='Automerging Query Engine', feedbacks=feedbacks)
for question in eval_questions:
    with tru_recorder_automerging as recording:
        response = auto_merging_engine.query(question)
        print(question)
        print(response)



What is the paper about?
Based on the provided context information and the query, I can infer that the paper "Attention Is All You Need" is likely about a machine learning technique called attention-based neural networks. The paper presents a new architecture for neural networks that uses attention mechanisms to selectively focus on certain parts of the input data when processing it. This allows the network to pay more attention to important features and ignore irrelevant ones, potentially improving its performance on a given task.

The paper likely discusses the design and implementation of this attention-based architecture, as well as any experiments or evaluations that were conducted to test its effectiveness. The authors may also explore the theoretical underpinnings of the attention mechanism and how it differs from traditional neural network architectures.

Without prior knowledge, I cannot provide a more specific answer based on the content of the paper itself. However, based on



What is attention in context of the paper?
Attention in the context of the paper refers to a mechanism used in neural networks to relate different positions of a single sequence in order to compute a representation of the sequence. Self-attention, also known as intra-attention, is an attention mechanism that allows the network to focus on specific parts of the input sequence when computing its output. This is particularly useful in tasks such as reading comprehension, abstractive summarization, textual entailment, and learning task-independent sentence representations. Additionally, end-to-end memory networks are based on a recurrent attention mechanism instead of sequence-aligned recurrence, which has been shown to perform well on simple-language question answering and language modeling tasks.




Who are the authors of the paper?
Based on the provided context information, we can identify the authors of the paper as follows:

1. Rico Sennrich, Barry Haddow, and Alexandra Birch - The authors of the paper "Neural Machine Translation of Rare Words with Subword Units" (arXiv:1608.05859, 2016).
2. Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean - The authors of the paper "Attention Is All You Need" (arXiv:1308.0850, 2013).

Therefore, the answer to the query is: Rico Sennrich, Barry Haddow, Alexandra Birch, Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean.


In [15]:
tru.get_leaderboard(app_ids=[])

| app_id | Groundedness | Context Relevance | Answer Relevance | latency | total_cost |
|---|---|---|---|---|---|
| Direct Query Engine | 4.0 | 0.733333 | 0.3 | 36.666667 | 0.0 |
| Automerging Query Engine | 1.0 | 0.3 | 0.45 | 36.666667 | 0.0 |
| Sentence Window Query Engine | 0.9 | 0.816667 | 0.6 | 36.666667 | 0.0 |


In [16]:
# tru.get_records_and_feedback(app_ids=[])[0] # pass an empty list of app_ids to get all

In [17]:
!curl ipecho.net/plain

34.142.254.9

In [None]:
try:
  tru.stop_dashboard()
except Exception:
  # No dashboard was running, so there is nothing to stop
  pass
sleep(5)
# Open the URL printed below and enter the IP address from the previous cell in the box
tru.run_dashboard(port=8501)

Dashboard closed.
Dashboard closed.
Starting dashboard ...
Config file already exists. Skipping writing process.
Credentials file already exists. Skipping writing process.
npx: installed 22 in 5.296s

Go to this url and submit the ip given here. your url is: https://grumpy-lands-poke.loca.lt

