Merged

Changes from all commits (108 commits)
2cac660
init new workshop
markxnelson Apr 17, 2025
fc65b15
1st edit
ldemarchis Apr 18, 2025
35e1b56
renamed .md
ldemarchis Apr 18, 2025
b6cf0fb
Update manifest.json
ldemarchis Apr 18, 2025
71f2c2d
ToC
ldemarchis Apr 18, 2025
cbf17d5
refactoring
ldemarchis Apr 18, 2025
9b99291
refactoring
ldemarchis Apr 18, 2025
3922672
set OCI credentials added
ldemarchis Apr 18, 2025
4b5af38
Lab 1 Update
ldemarchis Apr 28, 2025
5360475
Lab 1 update
ldemarchis Apr 28, 2025
c409647
Lab 2 updated
ldemarchis Apr 29, 2025
1a94329
Lab 5 updated
ldemarchis Apr 29, 2025
0ed8e08
Lab 5 updated
ldemarchis Apr 29, 2025
8b3a12c
Lab 5 updated
ldemarchis Apr 29, 2025
521ff77
Lab 5 updated
ldemarchis Apr 29, 2025
52ce229
Update testbed.png
ldemarchis Apr 29, 2025
f0b10a6
Update evaluating.md
ldemarchis Apr 29, 2025
80412c9
Objectives updated
ldemarchis Apr 29, 2025
fa06c6c
Intros updated
ldemarchis Apr 29, 2025
98b575f
Lab 6 updated
ldemarchis Apr 30, 2025
9347428
Title updated
ldemarchis Apr 30, 2025
36f1c85
lab 3 updated
ldemarchis May 2, 2025
b966305
Update RAG.md
ldemarchis May 2, 2025
7389e5f
Update RAG.md
ldemarchis May 2, 2025
ab50412
changed images directory
ldemarchis May 2, 2025
4eabe97
Add .gitkeep to track images folder
ldemarchis May 2, 2025
6379bb0
Add files via upload
ldemarchis May 2, 2025
4e7e7bd
updated introduction
ldemarchis May 5, 2025
27eb805
Update get-started.md
ldemarchis May 5, 2025
17b309f
Update get-started.md
ldemarchis May 5, 2025
1cb4100
updated get-started
ldemarchis May 5, 2025
22d0265
updated introduction and get started
ldemarchis May 5, 2025
242c575
general update
ldemarchis May 5, 2025
b74dbe8
updated api-server
ldemarchis May 6, 2025
18152ed
Update server.md
ldemarchis May 6, 2025
466226a
Update server.md
ldemarchis May 6, 2025
81eabcc
corrections
ldemarchis May 6, 2025
9e42eca
updated explore.md
ldemarchis May 6, 2025
c0accab
Update explore.md
ldemarchis May 6, 2025
849e941
Update explore.md
ldemarchis May 6, 2025
c9ff4ec
Update explore.md
ldemarchis May 7, 2025
684d068
updated evaluating.md
ldemarchis May 7, 2025
dc48e80
generic updateds
ldemarchis May 7, 2025
1013fa1
corrections
ldemarchis May 7, 2025
c8d26c5
corrected typos
ldemarchis May 8, 2025
49348e7
Update introduction.md
ldemarchis May 8, 2025
f7f1764
references update
ldemarchis May 8, 2025
99b1d38
update acknowledgments
ldemarchis May 9, 2025
b9b5c1a
updated source document
ldemarchis May 9, 2025
dfd2fd9
Add files via upload
ldemarchis May 9, 2025
aa4f0c9
source document related content changed
ldemarchis May 9, 2025
efe2a70
Add files via upload
ldemarchis May 9, 2025
301d824
updated RAG.md
ldemarchis May 9, 2025
8953d8e
Update server.md
ldemarchis May 9, 2025
74a80f7
Update experimenting.md
ldemarchis May 12, 2025
a82d8a6
help.md created
ldemarchis May 12, 2025
76e9455
minor fixes
ldemarchis May 12, 2025
a49442f
Add files via upload
ldemarchis May 16, 2025
5a8a9f0
general review
ldemarchis May 16, 2025
86c8d92
estimated times for lab1&lab2
ldemarchis May 19, 2025
1ce7987
estimated times added to lab3-6
ldemarchis May 20, 2025
bf00b47
General updates
ldemarchis May 23, 2025
2774f06
alias podman=docker added
ldemarchis May 23, 2025
7ee7de8
Update get-started.md
ldemarchis May 23, 2025
d72bcde
volume version+OCI credentials as optional
ldemarchis May 27, 2025
d2ca8da
Update introduction.md
ldemarchis May 28, 2025
253bece
Update get-started.md
ldemarchis May 28, 2025
de8b91b
Update get-started.md
ldemarchis May 28, 2025
53b99ce
Update get-started.md
ldemarchis May 28, 2025
b8aac8a
copy button added
ldemarchis May 28, 2025
eb27081
note indentation added
ldemarchis May 28, 2025
2d75890
Update explore.md
ldemarchis May 28, 2025
3480499
Create getting_started-30_testset.json
ldemarchis Jun 4, 2025
fadf25b
updates
ldemarchis Jun 4, 2025
9bf4e2c
updates to 1.1
ldemarchis Jun 11, 2025
b52976c
re-directoring
ldemarchis Jun 17, 2025
51652ce
clean-up
ldemarchis Jun 17, 2025
4f7d0cd
test
ldemarchis Jun 17, 2025
db398c2
refactoring
ldemarchis Jun 17, 2025
8e2f4d8
Update manifest.json
ldemarchis Jun 17, 2025
93a121a
Update manifest.json
ldemarchis Jun 17, 2025
c156488
updated explore.md
ldemarchis Jun 17, 2025
1322060
Merge branch 'main' of https://github.com/markxnelson/developer
ldemarchis Jun 17, 2025
9f7799c
updated explore.md
ldemarchis Jun 17, 2025
26ace7c
updated prepare.md
ldemarchis Jun 17, 2025
b972fd2
updated rag.md
ldemarchis Jun 17, 2025
b579b2f
updated explore.md
ldemarchis Jun 19, 2025
bdbc615
Update explore.md
ldemarchis Jun 19, 2025
f23b2b7
Update explore.md
ldemarchis Jun 19, 2025
a6279a2
prepare.md adapted for Sandbox version
ldemarchis Jun 20, 2025
eb09755
updated introduction.md and server.md for sanbox version
ldemarchis Jun 20, 2025
d63a035
Rename sandbox to tenancy
andytael Jul 14, 2025
cb08dd9
spelling and linting
andytael Jul 14, 2025
1fe91f7
Adding deploy section
andytael Jul 14, 2025
0ee1379
Deploy using IaC
andytael Jul 14, 2025
141c59a
cleanup and fixes
andytael Jul 14, 2025
5d9d406
linting, spelling
andytael Jul 14, 2025
8740366
Merge branch 'oracle-livelabs:main' into main
andytael Jul 14, 2025
1900c26
minor changes
ldemarchis Jul 15, 2025
fbe7e03
typo fix
andytael Jul 15, 2025
bc11231
linting
andytael Jul 15, 2025
84d5beb
numbers
andytael Jul 15, 2025
24e460b
desktop linting, spelling
andytael Jul 15, 2025
9279b3f
lint
andytael Jul 15, 2025
6e14290
manifest update
andytael Jul 16, 2025
7612971
updates
andytael Jul 16, 2025
a6f41c6
manifest updates
andytael Jul 18, 2025
f368787
Merge branch 'main' into main
andytael Jul 18, 2025
151 changes: 151 additions & 0 deletions ai-optimizer/desktop/evaluating/evaluating.md
# Evaluating Performance

## Introduction

Adjusting certain parameters can improve the quality and accuracy of the chatbot’s responses. But can you be sure that a specific configuration remains reliable when scaled to hundreds or even thousands of different questions?

In this lab, you will explore the *Testbed* feature. The Testbed allows you to evaluate your chatbot at scale by generating a Q&A test dataset and automatically running it against your current configuration.

**Note**: The example shown in this lab relies on gpt-4o-mini. Feel free to use a local LLM (e.g. llama3.1) instead if you prefer not to, or cannot, use OpenAI models.

Estimated Time: 15 minutes

### Objectives

In this lab, you will:

* Explore the *Testbed* tab
* Generate a Q&A Test dataset
* Perform an evaluation on the Q&A Testset

### Prerequisites

* All previous labs successfully completed

## Task 1: Navigate to the Testbed tab

Access the *Testbed* from the left-hand menu:

![testbed](./images/testbed.png)

As a first step, you can either upload an existing Q&A test set, from a local file or from a saved collection in the database, or generate a new one from a local PDF file.

## Task 2: Generate a Q&A Test dataset

The AI Optimizer allows you to generate as many questions and answers as you need, based on a single document from your knowledge base. To enable test dataset generation, simply select the corresponding radio button:

![generate](./images/generatenew.png)

1. Upload a document

Upload the same document that was used to create the vector store. You can easily download it from [this link](https://docs.oracle.com/en/database/oracle/oracle-database/23/vecse/ai-vector-search-users-guide.pdf).

2. Increase the number of questions to be generated to 10 or more

Keep in mind that the process can take a significant amount of time, especially if you are using a local LLM without sufficient hardware resources. If you choose to use a remote OpenAI model instead, the generation time will be less affected by the number of Q&A pairs to create.

3. Leave the default options for:
* Q&A Language Model: **gpt-4o-mini**
* Q&A Embedding Model: **text-embedding-3-small**

4. Click the **Generate Q&A** button and wait until the process finishes:

![patience](./images/patience.png)

5. Browse the questions and answers generated:

![qa-browse](./images/qa-browse.png)

Note that the **Question** and **Answer** fields are editable, allowing you to modify the proposed Q&A pairs based on the **Context** (which is randomly extracted and not editable) and the **Metadata** generated by the Testbed engine.

In the *Metadata* field you'll find a **topic** tag that classifies each Q&A pair. The topic list is generated automatically by analyzing the document content. In the final report, it is used to break down the **Overall Correctness Score** and highlight areas where the chatbot lacks precision.

You can also export the generated Q&A dataset using the **Download** button. This allows you to review and edit it, for example in Visual Studio Code.

![qa-json](./images/qa-json.png)
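
If you prefer to inspect the export programmatically, a minimal sketch such as the following should work, assuming the download is a JSON array of Q&A records whose field names match the report columns described later in this lab. The exact structure may differ slightly, so check your own file first; the filename below is hypothetical.

```python
import json

# Hypothetical filename: use the name of the test set you downloaded
with open("ai-vector-search_testset.json", encoding="utf-8") as f:
    testset = json.load(f)

# Assumed structure: a list of records, each pairing a question with its
# reference answer, the context chunk it was generated from, and metadata
# such as the auto-assigned topic
first = testset[0]
print(first["question"])
print(first["reference_answer"])
print(first["metadata"]["topic"])
```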

6. Update the **Test Set Name**

Replace the automatically generated default name to make it easier to identify the test dataset later, especially when running repeated tests with different chatbot configurations. For example, change it from:

![default-test-set](./images/default-test-set.png)

to something more descriptive, like:

![test-rename](./images/test-rename.png)

## Task 3: Evaluate the Q&A Testset

Now you are ready to perform an evaluation on the Q&As you generated in the previous step.

1. In the left-hand menu:

* Under **Language Model Parameters**, select **gpt-4o-mini** from the **Chat model** dropdown list.

* Ensure **Enable RAG?** is selected (if it wasn't already).

* In the **Select Alias** dropdown list, choose the **TEST2** value.

* Leave all other parameters unchanged.

2. With **gpt-4o-mini** selected as the evaluation model, click the **Start Evaluation** button and wait a few seconds. All questions from your dataset will be submitted to the chatbot using the configuration defined in the left pane:

![start-eval](./images/start-eval.png)

3. Let's examine the result report, starting with the first section:

![result](./images/result-topic.png)

This section displays:

* The chatbot's **Evaluation Settings**, as configured in the left-hand pane before launching the large-scale test.

* The **RAG Settings**, including the database and vector store used, the name of the embedding **model**, and all associated parameters (e.g., **chunk size**, **top-k**).

* The **Overall Correctness Score**, representing the percentage of questions for which the LLM judged the chatbot's response correct when compared to the reference answer (illustrated in the sketch after this list).

* The **Correctness By Topic**, which breaks down the results based on the automatically generated topics assigned to each Q&A pair in the dataset.
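
For intuition, both scores reduce to simple ratios over the judge's verdicts. Here is a toy sketch with made-up judgments; the topic names are illustrative, not taken from the lab:

```python
from collections import defaultdict

# Illustrative judge verdicts, one per submitted question
results = [
    {"topic": "Vector Indexes", "correct": True},
    {"topic": "Vector Indexes", "correct": False},
    {"topic": "Similarity Search", "correct": True},
    {"topic": "Similarity Search", "correct": True},
]

# Overall Correctness Score: share of answers judged correct
overall = sum(r["correct"] for r in results) / len(results)
print(f"Overall Correctness Score: {overall:.0%}")  # 75%

# Correctness By Topic: the same ratio computed within each topic tag
by_topic = defaultdict(list)
for r in results:
    by_topic[r["topic"]].append(r["correct"])
for topic, marks in by_topic.items():
    print(f"{topic}: {sum(marks) / len(marks):.0%}")
```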

The second section of the report contains details on each question submitted, with a focus on the **Failures** collection and the **Full Report** list. To view all fields, scroll horizontally. In the image below, the second frame has been scrolled to the right:

![result](./images/result-question.png)

The main fields displayed are:

* **question**: the submitted question
* **reference_answer**: the expected answer used as a benchmark
* **reference_context**: the source document section used to generate the Q&A pair
* **agent_answer**: the response provided by the chatbot based on the current configuration and vector store
* **correctness_reason**: an explanation (if any) of why the response was judged incorrect. If the response was correct, this field displays **None**.

You can download the results in different formats:

* Click the **Download Report** button to generate an HTML summary of the *Overall Correctness Score* and *Correctness by Topic*.

* To export the **Full Report** and the **Failures** list, download them as .csv files using the download icons shown in the interface:

![csv](./images/download-csv.png)
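
Once downloaded, the CSV files are easy to slice offline. Below is a short sketch using pandas, under two assumptions you should verify against your own export: the filenames are hypothetical, and the `correctness_reason` column is empty when an answer was judged correct.

```python
import pandas as pd

# Hypothetical filenames: use whatever names you gave the downloaded files
full = pd.read_csv("full_report.csv")
failures = pd.read_csv("failures.csv")

# Re-derive the headline number from the raw rows, assuming an empty (NaN)
# correctness_reason marks a correct answer
print(f"Overall: {full['correctness_reason'].isna().mean():.0%}")

# Review the failures next to the expected answers
cols = ["question", "reference_answer", "agent_answer", "correctness_reason"]
print(failures[cols].head())
```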

## Task 4 (optional): Try a different Q&A Testset

Now let's perform a test using an external saved test dataset, which you can download [here](https://raw.githubusercontent.com/markxnelson/developer/refs/heads/main/ai-optimizer/getting_started-30_testset.json). This file contains 30 pre-generated questions.

If you wish to remove any Q&A pairs that you consider irrelevant or unhelpful, you can edit the file, save it, and then reload it as a local file—following the steps shown in the screenshot below:

![load-tests](./images/load-tests.png)
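
Editing by hand works fine, but for larger test sets a small script is quicker. Here is a sketch that assumes the file is a JSON array with a topic tag under `metadata`, as in the earlier loading example; the filter condition is purely illustrative.

```python
import json

with open("getting_started-30_testset.json", encoding="utf-8") as f:
    testset = json.load(f)

# Keep only the pairs you consider relevant; the topic value below is a
# hypothetical example of a filter condition -- adapt it to your own criteria
curated = [qa for qa in testset
           if qa.get("metadata", {}).get("topic") != "Off-Topic"]

with open("getting_started-curated_testset.json", "w", encoding="utf-8") as f:
    json.dump(curated, f, indent=2)

print(f"Kept {len(curated)} of {len(testset)} Q&A pairs")
```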

Next, let’s update the Chat Model parameters by setting the **Temperature** to **0** in the left-hand pane, under the **Language Model Parameters** section.
Why? Reference answers in a Q&A dataset are typically generated with a low level of creativity, to minimize randomness and express core concepts clearly, without unnecessary "frills". Setting the chatbot's temperature to 0 aligns its answers with that style.
Now, repeat the test to see whether the **Overall Correctness Score** improves.
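
The Testbed applies this setting for you, but for intuition, this is what temperature means at the API level. A minimal sketch using the OpenAI Python client directly, not part of the lab, assuming `OPENAI_API_KEY` is set in your environment:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# temperature=0 makes the model favor its most likely tokens, so repeated
# runs of the same evaluation stay close to deterministic
response = client.chat.completions.create(
    model="gpt-4o-mini",
    temperature=0,
    messages=[{"role": "user", "content": "What is a vector index?"}],
)
print(response.choices[0].message.content)
```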

* To compare with previous results, open the dropdown under **Previous Evaluations for...** and click the **View** button to display the associated report.

![previous](./images/previous.png)

* You can repeat the tests as many times as needed, changing the **Vector Store**, **Search Type**, and **Top K** parameters to apply the same tuning strategies you've used previously with individual questions—now extended to a full test using curated and reproducible data.

## Acknowledgements

* **Author** - Lorenzo De Marchis, Developer Evangelist, May 2025
* **Contributors** - Mark Nelson, John Lathouwers, Corrado De Bari, Jorge Ortiz Fuentes, Andy Tael
* **Last Updated By** - Lorenzo De Marchis, May 2025