<a href="https://colab.research.google.com/github/ivanleech/llm_evaluator/blob/main/llm_evaluator.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Using TruLens to evaluate LLM/RAG 🔎

📓 This notebook explores advanced RAG methods. It uses a direct query engine as the baseline and compares it with sentence-window retrieval and auto-merging retrieval.

To evaluate the responses and the context retrieved by each method, we use TruLens🐙 to compute a triad of metrics: context relevance (context to query), answer relevance (answer to query), and groundedness (answer to context).

Instead of OpenAI, we use Ollama 🦙 to run models locally, so users without OpenAI access can still run this notebook. We recommend running it on Google Colab with the free T4 GPU. 🖥️


## 01 Setup

In [18]:
%%capture
!curl https://ollama.ai/install.sh | sh

from trulens_eval import Tru
tru = Tru()
tru.reset_database()

In [19]:
import urllib.request
from llama_index import Document
from llama_index import SimpleDirectoryReader

# Download the paper and store it locally as a PDF
url = 'https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf'
file = 'attention.pdf'
urllib.request.urlretrieve(url, file)

# Create a llama_index Document, which will be used to build the indexes
# The indexes supply the context used to answer questions with RAG later
documents = SimpleDirectoryReader(input_files=[file]).load_data()
document = Document(text="\n\n".join([doc.text for doc in documents]))

In [20]:
import os
from time import sleep
from subprocess import Popen
from llama_index.embeddings import HuggingFaceEmbedding

# Pull embedding llm and reranking llm from HuggingFace
# model = 'dolphin-phi'
model = 'llama2'
base_url = 'http://localhost:11434'
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en")
reranker_model = "BAAI/bge-reranker-base"

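# Start the Ollama server in the background, then pull the model.
# The same model is used as both the generating LLM and the evaluating LLM.
p = Popen(["ollama", "serve"])  # long-running server process
sleep(1)  # give the server a moment to start; a slow machine may need longer
os.system(f'ollama pull {model}')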

0
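
A fixed `sleep(1)` can be too short if the server is slow to start. As a hedged alternative, the sketch below polls until the server answers. The helper names `wait_until` and `ollama_is_up` are hypothetical, and treating a successful HTTP GET on `base_url` as "ready" is an assumption about the Ollama server, not something this notebook verifies.

```python
import time
import urllib.error
import urllib.request


def wait_until(probe, timeout=30.0, interval=0.5):
    """Call probe() repeatedly until it returns True or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


def ollama_is_up(base_url="http://localhost:11434"):
    """Return True if an HTTP GET on base_url succeeds (assumed readiness check)."""
    try:
        with urllib.request.urlopen(base_url, timeout=2) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False
```

In the cell above, `sleep(1)` could then be replaced by `wait_until(ollama_is_up)` before calling `ollama pull`.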

In [21]:
from langchain.llms import Ollama
from trulens_eval import LiteLLM
import litellm
litellm.set_verbose = False

# Two objects are created from the same model: one generates answers, the other evaluates RAG performance

# Used by llama_index ServiceContext as generating llm
ollama = Ollama(base_url=base_url, model=model)

# Used by TruLens as evaluation llm
ollama_provider = LiteLLM(model_engine=f"ollama/{model}", api_base=base_url)

## 02 Advanced RAG

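Retrieval-Augmented Generation (RAG) answers a question by first retrieving relevant passages from an index and then passing them to the LLM as context. The three query engines below differ only in how that context is retrieved.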


### Direct Query Engine

In [22]:
from llama_index import VectorStoreIndex
from llama_index import ServiceContext

# Creates ServiceContext, which is a bundler to hold generating llm and embedding llm
service_context = ServiceContext.from_defaults(llm=ollama, embed_model=embed_model)

# VectorStoreIndex converts text from earlier document into embeddings using embedding llm
# query_engine is created from the index, and is able to answer questions with context from the document
index = VectorStoreIndex.from_documents([document], service_context=service_context)
query_engine = index.as_query_engine()

### Sentence-Window Retrieval

In [23]:
from llama_index.node_parser import SentenceWindowNodeParser
from llama_index import StorageContext
from llama_index import load_index_from_storage

# SentenceWindowNodeParser splits the document into single-sentence nodes and stores
# the surrounding window (window_size sentences on each side) in each node's metadata
node_parser = SentenceWindowNodeParser.from_defaults(window_size=3, window_metadata_key="window", original_text_metadata_key="original_text")
sentence_context = ServiceContext.from_defaults(llm=ollama, embed_model=embed_model, node_parser=node_parser)

# Build the index and save it to a local directory if it does not exist yet; otherwise load it from disk
save_dir = 'sentence_index'
if not os.path.exists(save_dir):
    sentence_index = VectorStoreIndex.from_documents([document], service_context=sentence_context)
    sentence_index.storage_context.persist(persist_dir=save_dir)
else:
    sentence_index = load_index_from_storage(StorageContext.from_defaults(persist_dir=save_dir), service_context=sentence_context)
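
To make the idea concrete, here is a toy sketch of sentence-window parsing in plain Python. It is not the llama_index implementation: the function `build_sentence_windows` is hypothetical, the input is assumed to be pre-split into sentences, and only the two metadata keys from the cell above are mirrored.

```python
def build_sentence_windows(sentences, window_size=3):
    """Pair each sentence with the text of its surrounding window."""
    nodes = []
    for i, sentence in enumerate(sentences):
        start = max(0, i - window_size)
        end = min(len(sentences), i + window_size + 1)
        nodes.append({
            "original_text": sentence,                   # what gets embedded
            "window": " ".join(sentences[start:end]),    # what the LLM later sees
        })
    return nodes


sentences = [f"S{i}." for i in range(6)]
nodes = build_sentence_windows(sentences, window_size=1)
print(nodes[0]["window"])  # S0. S1.
print(nodes[3]["window"])  # S2. S3. S4.
```

The point of the design is that similarity search runs on small, precise units (single sentences), while the answer is generated from the wider window around each match.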

In [24]:
from llama_index.indices.postprocessor import MetadataReplacementPostProcessor
from llama_index.indices.postprocessor import SentenceTransformerRerank

rerank_top_n=2
similarity_top_k=6

# From the index, retrieve the top k nodes most similar to the user query (here, the top 6)
# Then rerank the retrieved nodes and keep the top n most relevant as context for RAG (here, the top 2)
postproc = MetadataReplacementPostProcessor(target_metadata_key="window")
rerank = SentenceTransformerRerank(top_n=rerank_top_n, model=reranker_model)

sentence_window_engine = sentence_index.as_query_engine(similarity_top_k=similarity_top_k, node_postprocessors=[postproc, rerank])
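
The two-stage selection can be sketched with made-up scores. This is an illustration only: `retrieve_then_rerank` is a hypothetical helper, and the numbers stand in for embedding similarity and reranker scores that the real engine computes with the models configured earlier.

```python
def retrieve_then_rerank(candidates, similarity_top_k, rerank_top_n):
    """candidates: list of (text, embedding_score, reranker_score) tuples."""
    # Stage 1: cheap vector similarity keeps the k closest chunks.
    retrieved = sorted(candidates, key=lambda c: c[1], reverse=True)[:similarity_top_k]
    # Stage 2: the reranker rescores only those k and keeps the best n.
    reranked = sorted(retrieved, key=lambda c: c[2], reverse=True)[:rerank_top_n]
    return [text for text, _, _ in reranked]


candidates = [
    ("chunk A", 0.91, 0.20),
    ("chunk B", 0.88, 0.95),
    ("chunk C", 0.80, 0.60),
    ("chunk D", 0.10, 0.99),  # strong reranker score, but never retrieved
]
selected = retrieve_then_rerank(candidates, similarity_top_k=3, rerank_top_n=2)
print(selected)  # ['chunk B', 'chunk C']
```

Note that "chunk D" is lost in stage 1, so `similarity_top_k` bounds what the reranker can ever recover.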

### Automerging Retrieval

In [25]:
from llama_index.node_parser import HierarchicalNodeParser
from llama_index.node_parser import get_leaf_nodes


chunk_sizes = [2048, 512, 128]
node_parser = HierarchicalNodeParser.from_defaults(chunk_sizes=chunk_sizes)
nodes = node_parser.get_nodes_from_documents(documents)
leaf_nodes = get_leaf_nodes(nodes)
merging_context = ServiceContext.from_defaults(llm=ollama, embed_model=embed_model)
storage_context = StorageContext.from_defaults()
storage_context.docstore.add_documents(nodes)

# Build the index and save it to a local directory if it does not exist yet; otherwise load it from disk
save_dir = 'merging_index'
if not os.path.exists(save_dir):
    automerging_index = VectorStoreIndex(leaf_nodes, storage_context=storage_context, service_context=merging_context)
    automerging_index.storage_context.persist(persist_dir=save_dir)
else:
    automerging_index = load_index_from_storage(StorageContext.from_defaults(persist_dir=save_dir), service_context=merging_context)

In [26]:
from llama_index.retrievers import AutoMergingRetriever
from llama_index.query_engine import RetrieverQueryEngine

# Retrieval starts from the leaf nodes. When enough of a parent's children are retrieved,
# they are merged and replaced by the parent node, which provides a larger block of context.
# Merging is repeated up the hierarchy for as long as that condition holds.
base_retriever = automerging_index.as_retriever(similarity_top_k=similarity_top_k)
retriever = AutoMergingRetriever(base_retriever, automerging_index.storage_context, verbose=True)
rerank = SentenceTransformerRerank(top_n=rerank_top_n, model=reranker_model)
auto_merging_engine = RetrieverQueryEngine.from_args(retriever, service_context=merging_context, node_postprocessors=[rerank])
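
The merging rule can be illustrated with a single-level toy example. This is a sketch under stated assumptions, not the `AutoMergingRetriever` code: the function `auto_merge`, the node names, and the 0.5 threshold are all chosen for illustration, and the real retriever works over the multi-level hierarchy built above.

```python
from collections import defaultdict


def auto_merge(retrieved_leaves, parent_of, children_of, ratio_threshold=0.5):
    """Replace leaves with their parent when enough siblings were retrieved."""
    hits = defaultdict(list)
    for leaf in retrieved_leaves:
        hits[parent_of[leaf]].append(leaf)

    merged = []
    for parent, leaves in hits.items():
        if len(leaves) / len(children_of[parent]) > ratio_threshold:
            merged.append(parent)   # enough coverage: return the larger chunk
        else:
            merged.extend(leaves)   # too sparse: keep the small chunks
    return merged


children_of = {"P1": ["a", "b", "c", "d"], "P2": ["e", "f", "g", "h"]}
parent_of = {leaf: p for p, leaves in children_of.items() for leaf in leaves}

result = auto_merge(["a", "b", "c", "e"], parent_of, children_of)
print(result)  # ['P1', 'e']
```

Three of the four children of `P1` were retrieved, so they collapse into `P1`; only one child of `P2` was retrieved, so it stays a leaf.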

## 03 Use TruLens to evaluate model

In [27]:
import numpy as np
from trulens_eval.feedback import Groundedness
from trulens_eval import Feedback, TruLlama

qa_relevance = (Feedback(ollama_provider.relevance_with_cot_reasons, name="Answer Relevance")
              .on_input_output())

qs_relevance = (Feedback(ollama_provider.relevance_with_cot_reasons, name = "Context Relevance")
              .on_input()
              .on(TruLlama.select_source_nodes().node.text)
              .aggregate(np.mean))

grounded = Groundedness(groundedness_provider=ollama_provider)

groundedness = (Feedback(grounded.groundedness_measure_with_cot_reasons, name="Groundedness")
              .on(TruLlama.select_source_nodes().node.text)
              .on_output()
              .aggregate(grounded.grounded_statements_aggregator))

feedbacks = [qa_relevance, qs_relevance, groundedness]

eval_questions = ['What is the paper about?', 'What is attention in context of the paper?', 'Who are the authors of the paper?']

✅ In Answer Relevance, input prompt will be set to __record__.main_input or `Select.RecordInput` .
✅ In Answer Relevance, input response will be set to __record__.main_output or `Select.RecordOutput` .
✅ In Context Relevance, input prompt will be set to __record__.main_input or `Select.RecordInput` .
✅ In Context Relevance, input response will be set to __record__.app.query.rets.source_nodes[:].node.text .
✅ In Groundedness, input source will be set to __record__.app.query.rets.source_nodes[:].node.text .
✅ In Groundedness, input statement will be set to __record__.main_output or `Select.RecordOutput` .
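
For Context Relevance, the selector `source_nodes[:].node.text` yields one text per retrieved node, so the feedback function is scored once per (question, chunk) pair and the scores are then averaged. The sketch below mimics that flow with a toy word-overlap scorer standing in for the LLM judge. `overlap_score` and `context_relevance` are hypothetical names, and the real scores come from the evaluating LLM, not from word overlap.

```python
from statistics import mean


def overlap_score(question, text):
    """Toy stand-in for an LLM judge: fraction of question words found in text."""
    q_words = set(question.lower().split())
    t_words = set(text.lower().split())
    return len(q_words & t_words) / len(q_words) if q_words else 0.0


def context_relevance(question, chunks, judge=overlap_score):
    """Score each retrieved chunk against the question, then average the scores."""
    return mean(judge(question, chunk) for chunk in chunks)


question = "what is attention"
chunks = ["attention is a weighted sum", "the optimizer is adam"]
score = context_relevance(question, chunks)
print(round(score, 3))
```

Because the result is a mean over chunks, one irrelevant chunk among a few retrieved ones pulls the score down noticeably, which is one reason reranking to a small `top_n` can raise this metric.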


In [37]:
! pip install html2text

Collecting html2text
  Downloading html2text-2020.1.16-py3-none-any.whl (32 kB)
Installing collected packages: html2text
Successfully installed html2text-2020.1.16


In [38]:
tru_recorder = TruLlama(query_engine, app_id='Direct Query Engine', feedbacks=feedbacks)
for question in eval_questions:
    with tru_recorder as recording:
        response = query_engine.query(question)
        print(question)
        print(response)



What is the paper about?
The paper "Attention Is All You Need" by Ashish Vaswani et al. discusses a new neural network architecture for sequence-to-sequence tasks, called the Transformer, which replaces traditional recurrent or convolutional layers with attention mechanisms. The Transformer is trained on machine translation tasks and achieves state-of-the-art results, outperforming previously published ensembles.

The paper introduces a new attention mechanism called "Scaled Dot-Product Attention," which computes the attention weights by taking the dot product of the query and key vectors, dividing each by the square root of the key dimension, and applying a softmax function. The paper also explores the use of multiple attention heads in parallel to jointly attend to information from different representation subspaces at different positions.

The Transformer architecture consists of an encoder and a decoder, each composed of multiple layers. The encoder takes in a sequence of tokens (e



What is attention in context of the paper?
In the context of the paper, "attention" refers to a mechanism used in the Transformer architecture to allow the model to focus on different parts of the input sequence simultaneously and weigh their importance when computing the output. Attention was introduced as an alternative to traditional recurrent neural network (RNN) architectures, which process the input sequence sequentially and have limited capacity to capture long-range dependencies.

In the Transformer architecture, attention is applied in three ways:

1. Encoder-decoder attention: The queries come from the previous decoder layer, and the memory keys and values come from the output of the encoder. This allows every position in the decoder to attend over all positions in the input sequence.
2. Self-attention layers in the encoder: Each position in the encoder can attend to all positions in the previous layer of the encoder.
3. Self-attention layers in the decoder: Each position in 



Who are the authors of the paper?
The authors of the paper are:

1. Sepp Hochreiter
2. Jürgen Schmidhuber
3. Rafal Jozefowicz
4. Oriol Vinyals
5. Mike Schuster
6. Noam Shazeer
7. Yonghui Wu
8. Łukasz Kaiser
9. Ilya Sutskever

These authors are known for their work in the field of natural language processing and deep learning, particularly in the area of transformer models.


In [29]:
tru_recorder_sentence_window = TruLlama(sentence_window_engine, app_id='Sentence Window Query Engine', feedbacks=feedbacks)
for question in eval_questions:
    with tru_recorder_sentence_window as recording:
        response = sentence_window_engine.query(question)
        print(question)
        print(response)



What is the paper about?
Based on the provided context information, it appears that the paper being discussed is "A Deep Reinforced Model for Abstractive Summarization" by Romain Paulus, Caiming Xiong, and Richard Socher. The paper proposes a deep reinforcement learning model for abstractive summarization, which aims to generate summaries that are both concise and accurate. The model uses a combination of sequence-to-sequence and reinforcement learning techniques to learn the optimal policy for summarization.

The paper also mentions other related work in the field of natural language processing, including the use of output embeddings to improve language models (as mentioned in [24]) and the concept of dropout as a regularization technique to prevent overfitting in neural networks (as mentioned in [27]).

Overall, the paper appears to be focused on developing a new deep reinforcement learning model for abstractive summarization, while also providing context and background information o



What is attention in context of the paper?
Attention in the context of the paper refers to a mechanism used in various neural network models, including the Transformer, ConvS2S, and ByteNet, to relate signals from different positions in a sequence. Attention allows the model to focus on specific parts of the input sequence when computing representations or making predictions, enabling it to capture longer-range dependencies and better handle inputs with varying lengths.

In the Transformer, attention is used entirely without recurrence or convolution, relying solely on self-attention to compute representations of its input and output. This allows the model to learn dependencies between distant positions in a more efficient manner than traditional sequence-aligned recurrent neural networks (RNNs) or convolutional neural networks (CNNs).

In contrast, ConvS2S and ByteNet use attention mechanisms that grow in complexity with the distance between positions, which can make it more difficult



Who are the authors of the paper?
Based on the provided context information, the authors of the paper "Deep Residual Learning for Image Recognition" are:

1. Kaiming He
2. Xiangyu Zhang
3. Shaoqing Ren
4. Jian Sun


In [30]:
tru_recorder_automerging = TruLlama(auto_merging_engine, app_id='Automerging Query Engine', feedbacks=feedbacks)
for question in eval_questions:
    with tru_recorder_automerging as recording:
        response = auto_merging_engine.query(question)
        print(question)
        print(response)

What is the paper about?
Based on the provided context information and the paper's PDF file path, it appears that the paper "Attention Is All You Need" is focused on a deep learning architecture called Transformer, which was introduced in 2017 by Vaswani et al. in the paper titled "Attention Is All You Need". The Transformer model relies entirely on self-attention mechanisms, eliminating the need for traditional recurrent neural networks (RNNs) or convolutional neural networks (CNNs).

The paper presents a novel attention mechanism that allows the model to focus on different parts of the input sequence simultaneously and weigh their importance. This allows the model to capture long-range dependencies and contextual relationships in the input data, which is particularly useful for natural language processing tasks such as machine translation and text summarization.

The paper also analyzes the performance of the Transformer model on several benchmark datasets and compares it to other st

In [31]:
tru.get_leaderboard(app_ids=[])

| app_id | Context Relevance | Groundedness | Answer Relevance | latency | total_cost |
|---|---|---|---|---|---|
| Automerging Query Engine | 0.95 | 1.0 | 0.9 | 43.0 | 0.0 |
| Direct Query Engine | 0.666667 | 0.5 | 0.3 | 43.0 | 0.0 |
| Sentence Window Query Engine | 0.6 | 0.777778 | 0.3 | 43.0 | 0.0 |
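
One simple way to compare the engines is to average the three triad scores per app. The sketch below does this with the leaderboard values copied by hand. Weighting the three metrics equally is an assumption made here for illustration, and with only three evaluation questions the ranking should be read as indicative rather than conclusive.

```python
leaderboard = {
    "Automerging Query Engine": {
        "Context Relevance": 0.95, "Groundedness": 1.0, "Answer Relevance": 0.9,
    },
    "Direct Query Engine": {
        "Context Relevance": 0.666667, "Groundedness": 0.5, "Answer Relevance": 0.3,
    },
    "Sentence Window Query Engine": {
        "Context Relevance": 0.6, "Groundedness": 0.777778, "Answer Relevance": 0.3,
    },
}


def triad_mean(scores):
    """Unweighted mean of the three triad metrics (an illustrative choice)."""
    return sum(scores.values()) / len(scores)


ranked = sorted(leaderboard, key=lambda app: triad_mean(leaderboard[app]), reverse=True)
for app in ranked:
    print(f"{app}: {triad_mean(leaderboard[app]):.3f}")
# Automerging Query Engine: 0.950
# Sentence Window Query Engine: 0.559
# Direct Query Engine: 0.489
```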


In [32]:
# tru.get_records_and_feedback(app_ids=[])[0] # pass an empty list of app_ids to get all

In [33]:
!curl ipecho.net/plain

34.16.162.186

In [34]:
try:
    tru.stop_dashboard()
except Exception:
    pass  # ignore errors, e.g. when no dashboard is running yet
sleep(5)
# Open the URL printed below and enter the IP address from the previous cell
tru.run_dashboard(port=8501)

Dashboard closed.
Dashboard closed.
Starting dashboard ...
Config file already exists. Skipping writing process.
Credentials file already exists. Skipping writing process.
npx: installed 22 in 5.543s

Go to this url and submit the ip given here. your url is: https://social-rooms-sort.loca.lt





RuntimeError: Dashboard failed to start in time. Please inspect dashboard logs for additional information.