# Chapter 3 RAG Application Evaluation Metrics

## 1. Introduction
This chapter introduces three metrics commonly used to evaluate RAG applications:
1. Answer relevance: whether the output of the RAG system is relevant to the question;
2. Context relevance: whether the context retrieved by the RAG system is relevant to the question;
3. Groundedness: whether the output of the RAG system is supported by the retrieved context.

<img src="./images/ch03_traid.jpg" width="500">

First, install the evaluation framework used in this chapter. If it is already installed, you can skip this step.

In [1]:
# requirements
# pip install trulens_eval

To keep the displayed output clean, we configure Python to ignore warning messages.

In [2]:
# Ignore warnings to prevent them from affecting the output
import warnings
warnings.filterwarnings('ignore')

Next, import the custom toolkit `utils` used in this course, and then set the OpenAI API key.
There are three ways to set the API key:
1. Set `OPENAI_API_KEY` as an environment variable and read it through `utils`;
2. Assign the key explicitly to `openai.api_key`;
3. If you do not have an OpenAI key, use a third-party service and set `openai.api_base` accordingly.
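The first two options can be combined into one small lookup. The sketch below is only an illustration: `resolve_api_key` is a hypothetical helper (it is not part of `utils` or `openai`), and it assumes the key lives in the `OPENAI_API_KEY` environment variable named above.

```python
import os
from typing import Mapping, Optional


def resolve_api_key(explicit_key: Optional[str] = None,
                    env: Mapping[str, str] = os.environ) -> str:
    """Hypothetical helper: prefer an explicitly passed key, else OPENAI_API_KEY."""
    key = (explicit_key or env.get("OPENAI_API_KEY", "")).strip()
    if not key:
        # Fail early with a clear message instead of failing on the first API call
        raise RuntimeError(
            "No API key found: set OPENAI_API_KEY or pass a key explicitly."
        )
    return key
```

The returned value could then be assigned to `openai.api_key`. Failing early keeps a missing key from surfacing later as an opaque authentication error.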

In [3]:
import utils
# Import custom toolkit

import os
import openai
# openai.api_key = utils.get_openai_api_key()
# Set OpenAI's API key, obtained from the environment variable

# openai.api_key = ""
# Or fill in your OpenAI API key here

# openai.api_key = "sk- "
# openai.api_base = " "
# Or customize the API key and API base address, which can be used for third-party API services


✅ In Answer Relevance, input prompt will be set to __record__.main_input or `Select.RecordInput` .
✅ In Answer Relevance, input response will be set to __record__.main_output or `Select.RecordOutput` .
✅ In Context Relevance, input prompt will be set to __record__.main_input or `Select.RecordInput` .
✅ In Context Relevance, input response will be set to __record__.app.query.rets.source_nodes[:].node.text .
✅ In Groundedness, input source will be set to __record__.app.query.rets.source_nodes[:].node.text .
✅ In Groundedness, input statement will be set to __record__.main_output or `Select.RecordOutput` .


Next, let’s start the tutorial setup.
First, you need to reset the database, which will be used to store questions, recall results, and answers for easy management and evaluation.

In [4]:
# Import the Tru class
from trulens_eval import Tru


# Instantiate the Tru class
tru = Tru()

# Reset the database
# The database will be used to store questions, intermediate recall results, answers, and evaluation results
tru.reset_database()


🦑 Tru initialized with db url sqlite:///default.sqlite .
🛑 Secret keys may be written to the database. See the `database_redact_keys` option of `Tru` to prevent this.


Next, import the SimpleDirectoryReader and use it to read the PDF files at the specified paths.
Note that the default parameters are suited to English documents. If the document is in Chinese, full-width sentence-ending punctuation needs to be converted to half-width punctuation afterwards.

In [5]:
# Setting up the Llama Index reader
from llama_index import SimpleDirectoryReader

# Read a PDF document from a folder and load it into a document object
# The document used is the Wikipedia page for the term "artificial intelligence"
documents = SimpleDirectoryReader(
    input_files=["./data/人工智能.pdf"]
).load_data()

documents_en = SimpleDirectoryReader(
    input_files=["./data/eBook-How-to-Build-a-Career-in-AI.pdf"]
).load_data()


For convenience, merge the pages of each PDF into a single document object, with pages separated by `"\n\n"`.

In [6]:
from llama_index import Document

# Merge the contents of documents into one large document, rather than having each page be a document
document = Document(
    text="\n\n".join([doc.text for doc in documents])
)

document_en = Document(
    text="\n\n".join([doc.text for doc in documents_en])
)


In [7]:
# Replace Chinese sentence-ending punctuation with its English equivalent for easier downstream processing
# For English documents, this step can be skipped
# Without this step, Chinese sentences cannot be split correctly, which inflates the sentence_window size and can push the input beyond the maximum length of gpt-3.5-turbo
document.text=document.text.replace('。','. ')
document.text=document.text.replace('！','! ')
document.text=document.text.replace('？','? ')
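The three `replace` calls can also be written as a single translation table. This is a minimal sketch with the same mapping; `PUNCT_TABLE` and `normalize_sentence_punctuation` are illustrative names, not part of the course toolkit.

```python
# Map each full-width sentence terminator to its half-width form plus a space,
# mirroring the three replace() calls in the cell above.
PUNCT_TABLE = str.maketrans({"。": ". ", "！": "! ", "？": "? "})


def normalize_sentence_punctuation(text: str) -> str:
    """Return text with Chinese sentence-ending punctuation converted."""
    return text.translate(PUNCT_TABLE)


sample = "人工智能是什么？它很重要。"
print(normalize_sentence_punctuation(sample))
```

A table makes it easy to add further characters later (for example the full-width semicolon) if sentence splitting still misbehaves.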



Set up the index storage. First, set the large language model used by the RAG application to gpt-3.5-turbo. Note that the version used here has a context window of 4096 tokens, so pay attention to the input length.
Then set the embedding model. We chose BAAI/bge-small-zh-v1.5 for the Chinese document and BAAI/bge-small-en-v1.5 for the English one. You can choose the model size and language to suit your scenario and the computing resources available.

In [8]:
# Set sentence_index
from utils import build_sentence_window_index

from llama_index.llms import OpenAI

# Set the large language model to use
# "gpt-3.5-turbo" is the name of the model
# temperature controls the randomness (diversity) of the generated text; a low value gives more deterministic output
llm = OpenAI(model="gpt-3.5-turbo", temperature=0.1)

# Set the embedding model
# BAAI/bge-small-zh-v1.5 is loaded and run locally
# All the contents of the document will be indexed into the sentence index object
# If Hugging Face cannot be reached directly (for example, from mainland China), you can switch to a Hugging Face mirror site
sentence_index = build_sentence_window_index(
    document,
    llm,
    embed_model="local:BAAI/bge-small-zh-v1.5",
    save_dir="sentence_index"
)

sentence_index_en = build_sentence_window_index(
    document_en,
    llm,
    embed_model="local:BAAI/bge-small-en-v1.5",
    save_dir="sentence_index_en"
)


Use the helper function from the toolkit to build the query engine used for retrieval, based on the index created in the previous step.

In [9]:
from utils import get_sentence_window_query_engine

# Create a search engine based on the sentence_index object
# Will be used later for recall in the RAG application
sentence_window_engine = \
get_sentence_window_query_engine(sentence_index)

In [10]:
sentence_window_engine_en = \
get_sentence_window_query_engine(sentence_index_en)

First, we test a single question to debug the pipeline and inspect the output.

In [11]:
output = sentence_window_engine.query(
    "AI的核心问题和长远目标是什么？")

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


In [12]:
output_en = sentence_window_engine_en.query(
    "How do you create your AI portfolio?")
# Example: Recall using a search engine

In [13]:
# In actual development, you can debug by viewing metadata
output.metadata

{'7e8484a0-f7d2-4b64-b683-fa76b1dec6fe': {'window': '⼈⼯智能的研究是⾼度技术性和专业的，各分⽀领域都是深⼊且各不相通的，因⽽涉及范围极⼴[9].  ⼈⼯智能的\n研究可以分为⼏个技术问题.  其分⽀领域主要集中在解决具体问题，其中之⼀是，如何使⽤各种不同的⼯具完成\n特定的应⽤程序. \n AI的核⼼问题包括建构能够跟⼈类似甚⾄超卓的推理、知识、计划、学习、交流、感知、移动 、移物、使⽤⼯\n具和操控机械的能⼒等[10].  通⽤⼈⼯智能（GAI）⽬前仍然是该领域的长远⽬标[11].  ⽬前弱⼈⼯智能已经有初\n步成果，甚⾄在⼀些影像识别、语⾔分析、棋类游戏等等单⽅⾯的能⼒达到了超越⼈类的⽔平，⽽且⼈⼯智能的\n通⽤性代表着，能解决上述的问题的是⼀样的AI程序，⽆须重新开发算法就可以直接使⽤现有的AI完成任务，与\n⼈类的处理能⼒相同，但达到具备思考能⼒的统合强⼈⼯智能还需要时间研究，⽐较流⾏的⽅法包括统计⽅法，\n计算智能和传统意义的AI. ',
  'original_text': 'AI的核⼼问题包括建构能够跟⼈类似甚⾄超卓的推理、知识、计划、学习、交流、感知、移动 、移物、使⽤⼯\n具和操控机械的能⼒等[10]. '},
 '4e0b1c3d-5ba5-4c6b-b5c6-29df66a75281': {'window': '⼈⼯智能的\n研究可以分为⼏个技术问题.  其分⽀领域主要集中在解决具体问题，其中之⼀是，如何使⽤各种不同的⼯具完成\n特定的应⽤程序. \n AI的核⼼问题包括建构能够跟⼈类似甚⾄超卓的推理、知识、计划、学习、交流、感知、移动 、移物、使⽤⼯\n具和操控机械的能⼒等[10].  通⽤⼈⼯智能（GAI）⽬前仍然是该领域的长远⽬标[11].  ⽬前弱⼈⼯智能已经有初\n步成果，甚⾄在⼀些影像识别、语⾔分析、棋类游戏等等单⽅⾯的能⼒达到了超越⼈类的⽔平，⽽且⼈⼯智能的\n通⽤性代表着，能解决上述的问题的是⼀样的AI程序，⽆须重新开发算法就可以直接使⽤现有的AI完成任务，与\n⼈类的处理能⼒相同，但达到具备思考能⼒的统合强⼈⼯智能还需要时间研究，⽐较流⾏的⽅法包括统计⽅法，\n计算智能和传统意义的AI.  ⽬前有⼤量的⼯具应⽤了⼈⼯智能，其中包括搜索和数学优化、逻辑推演. ',
  'or

In [14]:
output_en.metadata

{'e0d0633c-9a38-4330-8980-2e4ceea28d30': {'window': 'Chapter 4: Scoping Successful AI Projects.\n Chapter 5: Finding Projects that Complement \nYour Career Goals.\n Chapter 6: Building a Portfolio of Projects that \nShows Skill Progression.\n Chapter 7: A Simple Framework for Starting Your AI \nJob Search.\n Chapter 8: Using Informational Interviews to Find \nthe Right Job.\n Chapter 9: Finding the Right AI Job for You.\n',
  'original_text': 'Chapter 7: A Simple Framework for Starting Your AI \nJob Search.\n'},
 '600b5970-e3cd-47cd-a2e0-ea2ffb2c5e79': {'window': 'Chapter 6: Building a Portfolio of Projects that \nShows Skill Progression.\n Chapter 7: A Simple Framework for Starting Your AI \nJob Search.\n Chapter 8: Using Informational Interviews to Find \nthe Right Job.\n Chapter 9: Finding the Right AI Job for You.\n Chapter 10: Keys to Building a Career in AI.\n Chapter 11: Overcoming Imposter Syndrome.\n',
  'original_text': 'Chapter 9: Finding the Right AI Job for You.\n'}}
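The `window` and `original_text` fields in the metadata above reflect sentence-window retrieval: each sentence is indexed on its own, and its neighbouring sentences are kept alongside it as extra context. The sketch below illustrates the idea in plain Python under stated assumptions. `build_sentence_windows` is a hypothetical function, the regular-expression splitter is a naive stand-in for the real sentence parser, and the default `window_size=3` is only an assumed value.

```python
import re
from typing import Dict, List


def build_sentence_windows(text: str, window_size: int = 3) -> List[Dict[str, str]]:
    """Pair every sentence with the sentences up to window_size away on each side."""
    # Naive splitter: break after '.', '!' or '?' when followed by whitespace
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    nodes = []
    for i, sentence in enumerate(sentences):
        lo = max(0, i - window_size)
        hi = min(len(sentences), i + window_size + 1)
        nodes.append({
            "original_text": sentence,              # the sentence that is embedded
            "window": " ".join(sentences[lo:hi]),   # the context handed to the LLM
        })
    return nodes


nodes = build_sentence_windows("A one. B two. C three. D four. E five.", window_size=1)
print(nodes[2])
```

Because the splitter depends on half-width terminators followed by a space, this also shows why unconverted Chinese punctuation would leave the whole document as one oversized "sentence".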

## 2. Feedback functions
A feedback function measures the relationship between two of the three elements of a RAG interaction: the question, the retrieved context, and the answer. Each feedback function serves as an evaluation metric for one aspect of the RAG system's performance. In this tutorial, the three metrics are answer relevance, context relevance, and groundedness.

<img src="./images/ch03_feedback.jpg" width="500">
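One way to picture the three feedback functions is by the pair of texts each one compares. The sketch below is illustrative only: `RagRecord` and `triad_inputs` are hypothetical names, and the pairing simply follows the metric definitions in the introduction.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class RagRecord:
    """Hypothetical container for one RAG interaction."""
    question: str
    contexts: List[str]   # retrieved context chunks
    answer: str


def triad_inputs(record: RagRecord) -> Dict[str, List[Tuple[str, str]]]:
    """List the text pairs that each metric would be evaluated on."""
    return {
        # question vs. final answer
        "Answer Relevance": [(record.question, record.answer)],
        # question vs. each retrieved chunk
        "Context Relevance": [(record.question, c) for c in record.contexts],
        # each retrieved chunk (source) vs. final answer (statement)
        "Groundedness": [(c, record.answer) for c in record.contexts],
    }


record = RagRecord("What is strong AI?", ["chunk 1", "chunk 2"], "Strong AI is ...")
for metric, pairs in triad_inputs(record).items():
    print(metric, len(pairs))
```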

In [15]:
import nest_asyncio

# Allow nested asyncio event loops inside the notebook, so that streamlit can be used later for managing and visualizing evaluation results
nest_asyncio.apply()


In [16]:
from trulens_eval import OpenAI as fOpenAI

# Initialize the OpenAI gpt-3.5-turbo model as a provider
# The provider will be used to assist in evaluating various indicators of RAG applications: answer relevance, context relevance, groundedness.
provider = fOpenAI()

### 2.1 Answer Relevance
Answer relevance is used to evaluate whether the output of the RAG system is relevant to the question.

<img src="./images/ch03_answer_rele.jpg" width="500">

The structure of the feedback function of answer relevance is:

<img src="./images/ch03_structure_answer.jpg" width="500">

Here we can use the ready-made `Feedback` class. We only need to specify the evaluation method, the name of the metric, and the data to evaluate.

In [17]:
from trulens_eval import Feedback


# Set feedback for answer relevance here
# Use provider.relevance_with_cot_reasons as the evaluation function, that is, evaluate by calling LLM using chain of thought
# on_input_output() means evaluating on input and output
f_qa_relevance = Feedback(
    provider.relevance_with_cot_reasons,
    name="Answer Relevance"
).on_input_output()

✅ In Answer Relevance, input prompt will be set to __record__.main_input or `Select.RecordInput` .
✅ In Answer Relevance, input response will be set to __record__.main_output or `Select.RecordOutput` .


### 2.2 Context Relevance
Context relevance is used to evaluate whether the context retrieved by the RAG system is relevant to the question.

<img src="./images/ch03_context_rele.jpg" width=500>

The structure of its feedback function is:

<img src="./images/ch03_structure_context.jpg" width=500>

In [18]:
from trulens_eval import TruLlama

# Select the context for recall
context_selection = TruLlama.select_source_nodes().node.text

The settings here are similar to the previous step. You only need to modify the evaluation object.
You can also choose to modify the evaluation method for comparison.

In [19]:
import numpy as np


# Use provider.qs_relevance as the evaluation function
# on_input() passes the user question as the first argument
# on(context_selection) passes the retrieved context as the second argument
# aggregate(np.mean) uses np.mean as the aggregation function
# In effect: each retrieved context chunk in context_selection is scored separately, and the mean of those scores is the final result
f_qs_relevance = (
    Feedback(provider.qs_relevance,
             name="Context Relevance")
    .on_input()
    .on(context_selection)
    .aggregate(np.mean)
)

✅ In Context Relevance, input question will be set to __record__.main_input or `Select.RecordInput` .
✅ In Context Relevance, input statement will be set to __record__.app.query.rets.source_nodes[:].node.text .


In [20]:
import numpy as np

# Same as above, but with chain-of-thought reasons: score each retrieved context chunk in context_selection and take the mean as the result
f_qs_relevance = (
    Feedback(provider.qs_relevance_with_cot_reasons,
             name="Context Relevance")
    .on_input()
    .on(context_selection)
    .aggregate(np.mean)
)


✅ In Context Relevance, input question will be set to __record__.main_input or `Select.RecordInput` .
✅ In Context Relevance, input statement will be set to __record__.app.query.rets.source_nodes[:].node.text .
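The "score each chunk, then average" behaviour can be sketched without any LLM calls. In the example below, `toy_relevance` is a deliberately crude word-overlap stand-in for the LLM judge, and `context_relevance` is a hypothetical wrapper; neither reflects how TruLens computes its scores internally.

```python
from statistics import mean
from typing import Callable, List


def toy_relevance(question: str, context: str) -> float:
    """Toy stand-in for an LLM judge: share of question words found in the context."""
    q_words = set(question.lower().split())
    c_words = set(context.lower().split())
    return len(q_words & c_words) / len(q_words) if q_words else 0.0


def context_relevance(question: str,
                      contexts: List[str],
                      score_fn: Callable[[str, str], float] = toy_relevance) -> float:
    """Score every retrieved chunk separately, then aggregate with the mean."""
    scores = [score_fn(question, c) for c in contexts]
    return mean(scores)  # raises StatisticsError if no chunk was retrieved


score = context_relevance(
    "what is strong ai",
    ["strong ai is a view", "weather is nice"],
)
print(score)
```

Swapping `mean` for `min` or `max` changes what the metric rewards: the worst chunk, the best chunk, or the overall quality of the retrieved set.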


### 2.3 Groundedness

In [21]:
from trulens_eval.feedback import Groundedness

grounded = Groundedness(groundedness_provider=provider)

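Finally, groundedness is used to evaluate whether the output of the RAG system is supported by the retrieved context.
The cell above creates the `Groundedness` helper from the provider; the feedback itself is configured in the same way as the previous two.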

In [22]:
# Groundedness feedback: the retrieved context is the source and the final output is the statement; the chained calls work as in answer relevance and context relevance
f_groundedness = (
    Feedback(grounded.groundedness_measure_with_cot_reasons,
             name="Groundedness"
            )
    .on(context_selection)
    .on_output()
    .aggregate(grounded.grounded_statements_aggregator)
)

✅ In Groundedness, input source will be set to __record__.app.query.rets.source_nodes[:].node.text .
✅ In Groundedness, input statement will be set to __record__.main_output or `Select.RecordOutput` .
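To make the idea concrete, here is a rough sketch of a statement-level groundedness score: the answer is split into statements, each statement is scored against its best-supporting source, and the per-statement scores are averaged. This is an assumption-laden illustration. `toy_support` is a word-overlap stand-in for the LLM judge, and the "best source per statement, then mean" aggregation reflects only a general understanding of how such an aggregator behaves, which may differ between TruLens versions.

```python
import re
from statistics import mean
from typing import List


def split_statements(answer: str) -> List[str]:
    """Naively split an answer into sentence-level statements."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer.strip()) if s.strip()]


def toy_support(statement: str, source: str) -> float:
    """Toy stand-in for an LLM judge: share of statement words present in the source."""
    words = set(re.findall(r"\w+", statement.lower()))
    source_words = set(re.findall(r"\w+", source.lower()))
    return len(words & source_words) / len(words) if words else 0.0


def toy_groundedness(answer: str, sources: List[str]) -> float:
    """For each statement keep its best-supporting source, then average over statements."""
    per_statement = [
        max(toy_support(statement, source) for source in sources)
        for statement in split_statements(answer)
    ]
    return mean(per_statement)


sources = ["Strong AI holds that computers can think.", "Weak AI simulates thinking."]
print(toy_groundedness("Computers can think. Cats can fly.", sources))
```

In this example the first statement is fully supported and the second is mostly unsupported, so the overall score lands well below 1.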


## 3. Evaluation of the RAG Application

When evaluating a RAG system, the feedback functions can be implemented in many ways.
Manual scoring by human reviewers gives the most accurate results, but it is costly, so automatic evaluation is usually used in practice.
In this tutorial, gpt-3.5-turbo is used to evaluate the RAG system.
This approach is fast and inexpensive, but its results may not be as accurate as manual scoring.

<img src="./images/ch03_bench.jpg" width="500">

The following is the implementation of the evaluation process of the entire RAG system.

In [23]:
from trulens_eval import TruLlama
from trulens_eval import FeedbackMode


# Instantiate the TruLlama class to record the evaluation results
# sentence_window_engine is the search engine created previously
# app_id is the application ID, used to identify the application
tru_recorder = TruLlama(
    sentence_window_engine,
    app_id="App_1",
    feedbacks=[
        f_qa_relevance,
        f_qs_relevance,
        f_groundedness
    ]
)

tru_recorder_en = TruLlama(
    sentence_window_engine_en,
    app_id="App_2",
    feedbacks=[
        f_qa_relevance,
        f_qs_relevance,
        f_groundedness
    ]
)

Read the questions used for evaluation. To save time and reduce the cost of API calls, we use only a small number of questions.
In real scenarios, more questions can be written manually or generated by prompting an LLM with seed examples, so that more cases are covered.

In [24]:
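eval_questions = []
# Read the evaluation questions from ./data/eval_questions.txt (one per line); you can customize this file
# encoding='utf-8' is set explicitly so that the Chinese text is read correctly regardless of the platform default
with open('./data/eval_questions.txt', 'r', encoding='utf-8') as file:
    for line in file:
        item = line.strip()
        eval_questions.append(item)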


In [25]:
eval_questions_en = []
# Read the English evaluation questions, one per line
with open('./data/eval_questions_en.txt', 'r', encoding='utf-8') as file:
    for line in file:
        item = line.strip()
        eval_questions_en.append(item)

In [26]:
eval_questions

['人工智能中的先验知识是如何被存储的？',
 '人工智能的自我更新和自我提升是否可能导致其脱离人类的控制？',
 '管理者如何管理AI？',
 '强人工智能是什么？',
 '人工智能被滥用带来的危害？']

In [27]:
eval_questions_en

['What are the keys to building a career in AI?',
 "How can teamwork contribute to success in AI?'",
 "What is the importance of networking in AI?'",
 "What are some good habits to develop for a successful career?'",
 "How can altruism be beneficial in building a career?'",
 "What is imposter syndrome and how does it relate to AI?'",
 "Who are some accomplished individuals who have experienced imposter syndrome?'",
 "What is the first step to becoming good at AI?'",
 "What are some common challenges in AI?'",
 'Is it normal to find parts of AI challenging?']

In [28]:
eval_questions.append("如何在人工智能领域获得成功？")

In [29]:
eval_questions

['人工智能中的先验知识是如何被存储的？',
 '人工智能的自我更新和自我提升是否可能导致其脱离人类的控制？',
 '管理者如何管理AI？',
 '强人工智能是什么？',
 '人工智能被滥用带来的危害？',
 '如何在人工智能领域获得成功？']

Next, we start the evaluation by requesting the output of the RAG system and then evaluating the output using the feedback function.

In [30]:
# For each evaluation question, perform the evaluation and record the results
# Note: This process may take time, please be patient
for question in eval_questions:
    with tru_recorder as recording:
        sentence_window_engine.query(question)

In [31]:
for question in eval_questions_en:
    with tru_recorder_en as recording:
        sentence_window_engine_en.query(question)

The question and answer text in the records is stored with Unicode escape sequences, so it needs to be decoded back into readable Chinese characters before analysis.

In [32]:
records, feedback = tru.get_records_and_feedback(app_ids=[])

# Decode the \uXXXX escape sequences in the records back into Chinese characters for easy viewing
def decode_unicode(s):
    return s.encode('ascii').decode('unicode-escape')

records['input'] = records['input'].apply(decode_unicode)
records['output'] = records['output'].apply(decode_unicode)

records.head()

Unnamed: 0,app_id,app_json,type,record_id,input,output,tags,record_json,cost_json,perf_json,ts,Answer Relevance,Context Relevance,Groundedness,Answer Relevance_calls,Context Relevance_calls,Groundedness_calls,latency,total_tokens,total_cost
0,App_1,"{""tru_class_info"": {""name"": ""TruLlama"", ""modul...",RetrieverQueryEngine(llama_index.query_engine....,record_hash_e5a8b3d540d5ceefbfbcfc7ab8b66530,"""人工智能中的先验知识是如何被存储的？""","""人工智能中的先验知识是通过某种方式告知机器的知识，可以描述目标、特征、种类及对象之间的关系...",-,"{""record_id"": ""record_hash_e5a8b3d540d5ceefbfb...","{""n_requests"": 0, ""n_successful_requests"": 0, ...","{""start_time"": ""2024-03-09T20:29:42.283509"", ""...",2024-03-09T20:29:44.818854,0.8,0.85,1.0,"[{'args': {'prompt': '人工智能中的先验知识是如何被存储的？', 're...","[{'args': {'question': '人工智能中的先验知识是如何被存储的？', '...",[{'args': {'source': '知识表⽰是⼈⼯智能领域的核⼼研究问题之⼀，它的⽬...,2,0,0.0
1,App_1,"{""tru_class_info"": {""name"": ""TruLlama"", ""modul...",RetrieverQueryEngine(llama_index.query_engine....,record_hash_1606ea1a9ed8d5bbc0eb20da128e0433,"""人工智能的自我更新和自我提升是否可能导致其脱离人类的控制？""","""人工智能的自我更新和自我提升可能导致其脱离人类的控制。""",-,"{""record_id"": ""record_hash_1606ea1a9ed8d5bbc0e...","{""n_requests"": 0, ""n_successful_requests"": 0, ...","{""start_time"": ""2024-03-09T20:29:44.919348"", ""...",2024-03-09T20:29:46.465240,1.0,0.85,0.666667,[{'args': {'prompt': '人工智能的自我更新和自我提升是否可能导致其脱离人...,[{'args': {'question': '人工智能的自我更新和自我提升是否可能导致其脱...,[{'args': {'source': '⾄少，它本⾝应该有正常的情绪. ⼀个⼈⼯智能...,1,0,0.0
2,App_1,"{""tru_class_info"": {""name"": ""TruLlama"", ""modul...",RetrieverQueryEngine(llama_index.query_engine....,record_hash_c4b95535dc5bf0b50a931bbeb7332f16,"""管理者如何管理AI？""","""Management should consider adjusting their ro...",-,"{""record_id"": ""record_hash_c4b95535dc5bf0b50a9...","{""n_requests"": 0, ""n_successful_requests"": 0, ...","{""start_time"": ""2024-03-09T20:29:46.557848"", ""...",2024-03-09T20:29:48.281784,0.9,0.6,0.95,"[{'args': {'prompt': '管理者如何管理AI？', 'response':...","[{'args': {'question': '管理者如何管理AI？', 'statemen...",[{'args': {'source': 'AI逐渐普及后，将会在企业管理中扮演很重要的⾓⾊...,1,0,0.0
3,App_1,"{""tru_class_info"": {""name"": ""TruLlama"", ""modul...",RetrieverQueryEngine(llama_index.query_engine....,record_hash_2b91f2a57d165175d58926fd1a2dd22c,"""强人工智能是什么？""","""强人工智能是一种观点，认为计算机本身具有思维，而不仅仅是用来模拟人类思维的工具。根据这个观...",-,"{""record_id"": ""record_hash_2b91f2a57d165175d58...","{""n_requests"": 0, ""n_successful_requests"": 0, ...","{""start_time"": ""2024-03-09T20:29:48.371499"", ""...",2024-03-09T20:29:50.379916,1.0,,0.5,"[{'args': {'prompt': '强人工智能是什么？', 'response': ...",,[{'args': {'source': '强⼈⼯智能可以有两 类： ⼈类的⼈⼯智能，即机器...,2,0,0.0
4,App_1,"{""tru_class_info"": {""name"": ""TruLlama"", ""modul...",RetrieverQueryEngine(llama_index.query_engine....,record_hash_2fb0adc005659d26ec770abae7e96329,"""人工智能被滥用带来的危害？""","""The misuse of artificial intelligence can pot...",-,"{""record_id"": ""record_hash_2fb0adc005659d26ec7...","{""n_requests"": 0, ""n_successful_requests"": 0, ...","{""start_time"": ""2024-03-09T20:29:50.474544"", ""...",2024-03-09T20:29:53.133885,1.0,,,"[{'args': {'prompt': '人工智能被滥用带来的危害？', 'respons...",,,2,0,0.0
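The decoding step can be tried on a small escaped string. The sketch below repeats `decode_unicode` with one hedged addition: a fallback for strings that already contain non-ASCII characters, which would otherwise raise `UnicodeEncodeError`. The `json.loads` line is an alternative that assumes the stored value is a valid JSON string literal, as the quoted values in the table suggest.

```python
import json


def decode_unicode(s: str) -> str:
    # Turn literal backslash-u escapes into the characters they stand for
    try:
        return s.encode("ascii").decode("unicode-escape")
    except UnicodeEncodeError:
        # The text already contains non-ASCII characters, so nothing needs decoding
        return s


escaped = '"\\u4eba\\u5de5\\u667a\\u80fd"'   # a JSON-style string with escaped characters
print(decode_unicode(escaped))   # "人工智能" (the surrounding quotes are kept)
print(json.loads(escaped))       # 人工智能 (the quotes are parsed away)
```

Note that `unicode-escape` also interprets other backslash sequences such as `\n`, so it is best applied only to fields known to hold escaped text.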


In [33]:
import pandas as pd

# Display evaluation results
pd.set_option("display.max_colwidth", None)
display(records[["input", "output"] + feedback])

Unnamed: 0,input,output,Groundedness,Answer Relevance,Context Relevance
0,"""人工智能中的先验知识是如何被存储的？""","""人工智能中的先验知识是通过某种方式告知机器的知识，可以描述目标、特征、种类及对象之间的关系，也可以描述事件、时间、状态、原因和结果，以及任何需要机器存储的知识。""",1.0,0.8,0.85
1,"""人工智能的自我更新和自我提升是否可能导致其脱离人类的控制？""","""人工智能的自我更新和自我提升可能导致其脱离人类的控制。""",0.666667,1.0,0.85
2,"""管理者如何管理AI？""","""Management should consider adjusting their roles by relinquishing administrative tasks, focusing on enhancing their comprehensive judgment and analytical prediction capabilities, treating AI as a colleague to form collaborative teams, and acknowledging that AI technologies also have limitations and bottlenecks.""",0.95,0.9,0.6
3,"""强人工智能是什么？""","""强人工智能是一种观点，认为计算机本身具有思维，而不仅仅是用来模拟人类思维的工具。根据这个观点，只要计算机运行适当的程序，它就具有自己的思维能力。""",0.5,1.0,
4,"""人工智能被滥用带来的危害？""","""The misuse of artificial intelligence can potentially lead to violations of laws such as copyright infringement. There have been cases where artificial intelligence technology has been used to remove mosaic from explicit videos or alter the appearance of individuals in videos. Additionally, there are concerns that the development of artificial intelligence could lead to uncontrollable situations, where AI may manipulate human emotions, influence financial markets, and even develop weapons that are beyond human comprehension. Furthermore, there are predictions that certain professions may be replaced by machines and AI in the future, potentially leading to significant job losses and societal disruptions.""",,1.0,
5,"""如何在人工智能领域获得成功？""","""通过利用概率和经济学上的概念，发展出能够处理不确定或不完整的信息的方法，寻找更有效的算法，并强调感知运动的重要性，可以在人工智能领域获得成功。""",,0.8,
6,"""What are the keys to building a career in AI?""","""Learning foundational technical skills, working on projects, finding a job, and being part of a supportive community are the keys to building a career in AI.""",,0.8,
7,"""How can teamwork contribute to success in AI?'""","""Teamwork can contribute to success in AI by allowing individuals to lead projects effectively, even without a formal leadership position. Working on larger AI projects often requires collaboration and the ability to steer projects by applying deep technical insights. This teamwork can help improve projects significantly and allow individuals to grow as leaders within the field.""",,0.9,
8,"""What is the importance of networking in AI?'""","""Networking in AI is crucial as it can provide valuable insights, guidance, and opportunities for individuals looking to advance in the field. By connecting with professionals who have experience in AI, individuals can gain knowledge about the industry, potential career paths, and current trends. Networking also allows for the exchange of information, which can help individuals stay updated on the latest developments in AI and build relationships that may lead to job opportunities or collaborations in the future. Additionally, networking can help individuals establish a support system within the AI community, enabling them to seek advice, mentorship, and guidance as they navigate their careers in this rapidly evolving field.""",,,
9,"""What are some good habits to develop for a successful career?'""","""Developing good habits in areas such as eating, exercise, sleep, personal relationships, work, learning, and self-care can help individuals move forward in their careers while maintaining their health. Additionally, aiming to lift others during each step of one's own journey can lead to better outcomes in the long run.""",,,


In [34]:
# Get leaderboard
tru.get_leaderboard(app_ids=[])

| app_id | Groundedness | Answer Relevance | Context Relevance | latency | total_cost |
|---|---|---|---|---|---|
| App_1 | 0.779167 | 0.916667 | 0.766667 | 1.666667 | 0.0 |
| App_2 | | 0.85 | | 1.7 | 0.0 |
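The App_1 row is consistent with a plain mean over the records that actually have a score. The sketch below recomputes it from the six App_1 rows displayed earlier, using `None` for missing feedback results. `leaderboard_row` is a hypothetical helper that shows only the arithmetic, not how `get_leaderboard` is implemented.

```python
from statistics import mean
from typing import Dict, List, Optional

# Scores for the six App_1 records shown above; None marks a missing result
app1_scores: Dict[str, List[Optional[float]]] = {
    "Groundedness": [1.0, 0.666667, 0.95, 0.5, None, None],
    "Answer Relevance": [0.8, 1.0, 0.9, 1.0, 1.0, 0.8],
    "Context Relevance": [0.85, 0.85, 0.6, None, None, None],
}


def leaderboard_row(scores: Dict[str, List[Optional[float]]]) -> Dict[str, Optional[float]]:
    """Average each metric over the records that have a value."""
    row: Dict[str, Optional[float]] = {}
    for metric, values in scores.items():
        present = [v for v in values if v is not None]
        row[metric] = round(mean(present), 6) if present else None
    return row


print(leaderboard_row(app1_scores))
```

Because missing results are skipped rather than counted as zero, a metric averaged over only a few records (such as Context Relevance here) should be read with more caution than one computed over all of them.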


In [35]:
# Run the dashboard
# Note: Please check if the port is occupied. If so, please modify the port number.
tru.run_dashboard()

Starting dashboard ...
Config file already exists. Skipping writing process.
Credentials file already exists. Skipping writing process.





Dashboard started at http://10.31.153.170:8501 .


<Popen: returncode: None args: ['streamlit', 'run', '--server.headless=True'...>
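Regarding the port note in the dashboard cell: before launching, you can check whether a port is already taken with the standard library. This is a minimal sketch. `is_port_free` is a hypothetical helper, and it checks the loopback address by default, which may not match the address the dashboard actually binds to on your machine.

```python
import socket


def is_port_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if a TCP socket can currently be bound to (host, port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Binding failed, most likely because another process holds the port
            return False
    return True


print(is_port_free(8501))
```

If the port is occupied and your `trulens_eval` version's `run_dashboard` accepts a `port` argument, you can pass a different port number there.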