# 🤖 GenAI Model Validation Workshop <a class='anchor' id='top'></a>

[Program](https://docs.google.com/document/d/1uqOlTim6czjeK16xXz4tXvYznTxkiy19moeoGNSynSY/edit?tab=t.0#heading=h.8c6jf2k12z8) | [GitHub](https://github.com/h2oai/h2o-genai-model-validation-training) | [Enterprise h2oGPTe](https://h2ogpte.h2oworld.h2o.ai/) | [EvalStudio](https://eval-studio.h2oworld.h2o.ai/)


## 📝 Outline <a class='anchor' id='outline'></a>
1. [Environment Preparation](#preparation)
2. [Embedding and Explainability](#embedding_explainability)
3. [Test Generation and Benchmarking](#test_gen)
4. [Eval Metrics and RAG](#eval_metrics)
5. [Human Evals](#human_evals)

## 🛠️ Environment Preparation <a class='anchor' id='preparation'></a> [↑](#top)

### Check your browser compatibility

Run the following cells. They check that your browser is compatible and refresh the page once.

*Technical Note: the Python kernel is not impacted by page refresh*

In [7]:
import ipywidgets as widgets

In [None]:
from IPython.display import display, Javascript


# Refresh the page only if a specific flag is not set
def refresh_page_once():
    display(
        Javascript("""
    if (!localStorage.getItem('pageRefreshed')) {
        localStorage.setItem('pageRefreshed', 'true');
        window.location.reload();
    } else {
        localStorage.removeItem('pageRefreshed');
    }
    """)
    )


refresh_page_once()

In [9]:
widgets.Button(description="Your browser is OK", disabled=True, button_style="success")

Button(button_style='success', description='Your browser is OK', disabled=True, style=ButtonStyle())

### Get h2oGPTe API Key

1. Go to [h2oGPTe Settings](https://h2ogpte.h2oworld.h2o.ai/settings).
2. Generate a new API key and copy it.
3. Paste the key into the text box below.
4. Click the 'Generate config' button.

In [None]:
from h2o_mrm import _generate_env

_generate_env("https://h2ogpte.h2oworld.h2o.ai")

In [26]:
!cat .env

H2OGPTE_API_KEY=sk-d1SJGpNBH9HBqr8JElpUKPMGYd6bIyIP7CQPgkNZ96sNyRcl
H2OGPTE_URL='https://h2ogpte.h2oworld.h2o.ai'
TOKENIZERS_PARALLELISM=false

### 🐍 Prepare Python Environment [↑](#top)

In [12]:
# Suppress Warnings
import warnings

warnings.filterwarnings("ignore")

# Load Environment Variables
from dotenv import load_dotenv

_ = load_dotenv()
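
Before moving on, it can help to confirm that the settings were actually loaded without echoing the key itself. The following is a minimal sketch, assuming the two variable names shown in the `.env` file above; the helper names `missing_env_vars` and `mask_secret` are hypothetical and not part of `h2o_mrm`.

```python
import os

# Variable names taken from the .env file shown above.
REQUIRED_VARS = ("H2OGPTE_API_KEY", "H2OGPTE_URL")


def missing_env_vars(required=REQUIRED_VARS, environ=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in required if not env.get(name)]


def mask_secret(value, visible=4):
    """Keep only the first few characters of a secret, e.g. for logging."""
    if len(value) <= visible:
        return "*" * len(value)
    return value[:visible] + "*" * (len(value) - visible)


missing = missing_env_vars()
if missing:
    print("Missing settings:", ", ".join(missing))
else:
    print("API key loaded:", mask_secret(os.environ["H2OGPTE_API_KEY"]))
```

Printing a masked value keeps the key out of saved notebook outputs.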

In [None]:
# Python packages
import os
import uuid
from pathlib import Path

# Experiment
from h2o_mrm.experiment import Experiment

# Topic Modeling
from h2o_mrm.widgets import topic_model_widget
from h2o_mrm.viz import create_chunk_distribution_map, create_topics_distribution_pie

# Question Generation
from h2o_mrm.widgets.chunk_nav import create_qa_gen_widget
from h2o_mrm.widgets.chunk_nav.core import create_question_generator, create_summarizer

# Generated Question Evaluation
from h2o_mrm.widgets.aw_data_table import create_genqa_eval_widget

# RAG Models
from h2o_mrm.rag_models import H2OGPTERAG, H2ogpteConfig

# Human Labeling
from h2o_mrm.widgets import human_labeling_widget

In [16]:
CACHE_LOC = "/home/jovyan/cache"
DOCS_LOC = "/home/jovyan/docs"

# 1. Embedding and Explainability <a class='anchor' id='embedding_explainability'></a> [↑](#top)

The goal of this experiment is to analyze the document ["Comptroller’s Handbook: Model Risk Management"](https://www.occ.treas.gov/publications-and-resources/publications/comptrollers-handbook/files/model-risk-management/index-model-risk-management.html) in the context of RAG systems.

## Experiment

An experiment defines the scope of work, including the documents and the RAG system under test.

It performs:
 - chunking of the documents using the h2oGPTe chunking strategy.
 - embedding of the chunks into vectors using the given embedding model.
  

> ℹ️ Note: we pre-cached computed results to speed up the workshop
  

In [31]:
exp = Experiment(
    "OCC Handbook",  # Do not change name since it is used for cache look-ups to speed up computation.
    max_tokens_per_chunk=320,
    embedding_model_name="BAAI/bge-m3",
    cache_dir=CACHE_LOC,
)
exp.add_documents([f"{DOCS_LOC}/pub-ch-model-risk.pdf"])

In [33]:
exp


Name:            OCC Handbook
Docs:            ['/tmp/home/jovyan/docs/pub-ch-model-risk.pdf']
Embedding model: BAAI/bge-m3
Chunks:          0 (max tokens/chunk: 320)
Topics:


Local cache embeddings: /tmp/home/jovyan/cache/chromadb
Local cache collection: /tmp/home/jovyan/cache/database.db


### Create Chunks

Divide the document into chunks of a specified number of tokens.

In this step:
- Documents that are part of this collection are parsed into plain text.
- The parsed text is divided into chunks. A chunk is a string of words/sentences whose token count adds up to at most the `max_tokens_per_chunk` value.
- Each chunk of tokens is then encoded into a vector of floats using an embedding model.
- The vectors are then stored in a vector database.
- A typical vector database can store each chunk as text, the embedding vector, and any metadata associated with it.
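
The steps above can be sketched in plain Python. This is only an illustration under simplifying assumptions: tokens are whitespace-separated words rather than model tokens, `toy_embedding` is a hypothetical stand-in for a real embedding model such as `BAAI/bge-m3`, and the list of dictionaries stands in for a vector database. It is not the h2oGPTe chunking strategy.

```python
import math


def chunk_text(text, max_tokens_per_chunk):
    """Split text into chunks of at most `max_tokens_per_chunk` whitespace tokens."""
    if max_tokens_per_chunk < 1:
        raise ValueError("max_tokens_per_chunk must be positive")
    tokens = text.split()
    return [
        " ".join(tokens[start:start + max_tokens_per_chunk])
        for start in range(0, len(tokens), max_tokens_per_chunk)
    ]


def toy_embedding(chunk, dim=8):
    """Hypothetical embedding: a unit-length histogram of character codes."""
    vec = [0.0] * dim
    for ch in chunk.lower():
        vec[ord(ch) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def build_records(text, source, max_tokens_per_chunk):
    """Create one record per chunk: text, embedding vector, and metadata."""
    return [
        {
            "text": chunk,
            "embedding": toy_embedding(chunk),
            "metadata": {"source": source, "chunk_index": index},
        }
        for index, chunk in enumerate(chunk_text(text, max_tokens_per_chunk))
    ]


records = build_records(
    "Model risk should be managed like other types of risk.",
    source="example.txt",
    max_tokens_per_chunk=4,
)
for record in records:
    print(record["metadata"]["chunk_index"], record["text"])
```

Each record carries the three pieces of information listed above, which is what later steps (clustering, question generation) read back from the store.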

In [None]:
# Create and Save Chunks
exp_chunks = exp.chunk_documents()

All the data from our documents is now parsed, chunked, transformed into vectors, and stored in a vector database.
The chunks in their text form can be reviewed as shown below.

In [None]:
print(exp_chunks[100].text)

#### Topic Modeling

In this section, we can identify the topics in our collection of documents by clustering all the chunks.
Creating a topic model involves multiple steps.
1. Dimensionality reduction: Since the embeddings are high-dimensional (1024 dimensions in this example), we need to reduce their dimensionality before applying clustering algorithms.
2. Clustering: Apply a clustering algorithm such as HDBSCAN or K-Means to group the vectors in the reduced-dimensional space into clusters.
3. Topic representation: Identify the most important words/phrases in each topic and create descriptions for the topics.

For each step, we can choose different techniques and each technique can have multiple hyper-parameters.
Therefore, we tend to create multiple topic models by tuning the hyper-parameters. We then measure the quality of each topic model using the silhouette score metric.
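
To make the ranking criterion concrete, here is a small pure-Python sketch of the silhouette score, used to compare two candidate clusterings of the same points. It assumes Euclidean distance and toy 2-D points; the workshop library may compute the score differently (for example on reduced embeddings, or with special handling of noise points), so treat it as an illustration of the metric only.

```python
import math
from collections import defaultdict


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance)."""
    clusters = defaultdict(list)
    for point, label in zip(points, labels):
        clusters[label].append(point)
    if len(clusters) < 2:
        raise ValueError("silhouette score needs at least two clusters")

    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        # (the distance to itself is 0, so it only affects the divisor).
        a = sum(math.dist(point, m) for m in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(point, m) for m in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Two tight groups far apart on the x-axis, clustered in two different ways.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
candidates = {
    "left_vs_right": [0, 0, 1, 1],
    "bottom_vs_top": [0, 1, 0, 1],
}
scores = {name: silhouette_score(points, labels) for name, labels in candidates.items()}
best = max(scores, key=scores.get)
print(best, round(scores[best], 3))
```

Scores close to 1 indicate compact, well-separated clusters, while negative scores indicate points that sit closer to another cluster than to their own.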

For this experiment, we already built 34 different topic models. The following command will show the top 10 topic models measured by the silhouette score.

In [None]:
# List topic models for this experiment
exp.list_topic_models(top=10)

We can build additional topic models automatically using the following command.

- Specify different ranges of values for each of the three hyper-parameters `n_neighbors`, `n_components`, and `min_cluster_size`.
- `n_neighbors` and `n_components` are inputs to the `UMAP` algorithm used for dimensionality reduction.
- `min_cluster_size` is an input to the `HDBSCAN` algorithm to control the clustering.
- Topic models with all combinations of the specified ranges will be built and ranked using the silhouette score.
- If you do not wish to explore all combinations of the hyper-parameters, you can specify a list of combinations to try.
- If `combinations` is specified, the arguments `n_neighbors`, `n_components`, and `min_cluster_size` will be ignored.
- `combinations` is a list of tuples of the form (`n_neighbors`, `n_components`, `min_cluster_size`).
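
As a rough sketch of the behaviour described above, the full grid is the Cartesian product of the three ranges, and an explicit `combinations` list replaces it. The helper `expand_grid` is hypothetical and only mirrors the description; it is not how `build_all_topic_models` is implemented.

```python
from itertools import product


def expand_grid(n_neighbors, n_components, min_cluster_size, combinations=None):
    """Return the (n_neighbors, n_components, min_cluster_size) tuples to try."""
    if combinations is not None:
        # An explicit list takes precedence; the three ranges are ignored.
        return list(combinations)
    return list(product(n_neighbors, n_components, min_cluster_size))


grid = expand_grid(
    n_neighbors=[35, 40],
    n_components=[5, 10, 15, 20],
    min_cluster_size=[5, 7, 9],
)
print(len(grid))  # 2 * 4 * 3 = 24 topic models

subset = expand_grid([35, 40], [5, 10], [5], combinations=[(10, 2, 10), (25, 25, 9)])
print(subset)
```

Since the grid grows multiplicatively, an explicit list of combinations is a handy way to keep the run time under control.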

In [None]:
exp.build_all_topic_models(
    n_neighbors=[35, 40],
    n_components=[5, 10, 15, 20],
    min_cluster_size=[5, 7, 9],
    # combinations=[(10, 2, 10), (10, 2, 11), (25, 25, 9), (15, 2, 10)],
)

To see the newly built topic models, re-run the `exp.list_topic_models(top=10)` command. To proceed with
our experiment, we need to select a topic model to represent the information in our collection.

`exp.set_best_topic_model()` selects the topic model with the highest silhouette score.
However, you can also select any topic model from the list by its id using the `exp.set_topic_model` command.

In [None]:
# Select the topic model with best silhouette_score
exp.set_best_topic_model()

# Alternatively, select any topic model by its id
# exp.set_topic_model("60d5c18c-9ad4-4d7a-8a9b-eac2dc9eb77e")

# Verify that the selected topic model is properly set
exp.selected_topic_model_id

The following is an interactive UI widget to visualize the selected topic model.
We can also build new topic models by interacting with the widget.

In [None]:
tmw = topic_model_widget.create_widget(
    topic_model_id=exp.selected_topic_model_id,
    cache_dir=exp.cache_dir,
    create_topic_cluster_data=exp.build_topic_cluster_creator(
        show_doc_in_tooltip=True,
        show_topic_names=True,
        # hidden_topics=[0],
    ),
)
tmw

If you interacted with the topic model widget to create a new topic model,
and want to select the newly created one as the topic model for this experiment,
run the following lines.

In [None]:
# Select the topic model from the widget above
# exp.set_topic_model(uuid.UUID(hex=tmw.topic_model_id))

# Verify that the new topic model is selected
# exp.selected_topic_model_id

The following 3 commands will help you understand the distribution of chunks across different topics.

In [None]:
exp.get_num_chunks_in_topic_chart()

In [None]:
create_chunk_distribution_map(exp.chunks, exp.topic_names, x_size=20)

In [None]:
create_topics_distribution_pie(exp.chunks, exp.topic_names, filter_topics=[-1])

# 2. Test Generation and Benchmarking <a class='anchor' id='test_gen'></a> [↑](#top)

- Automatic Prompt engineering
- Automatic QA generation

Users can implement different techniques for question generation. We have included an example implementation.

Question generation includes the following steps:
1. Select one or more chunks from a cluster.
2. If the clusters are small, all chunks can be selected.
3. If the clusters are large, a `twin` (a statistically similar subset) of the cluster can be selected. We do this to cover all the information represented in the cluster without exhaustively selecting all chunks.
4. Summarize the selected chunks using an LLM.
5. Generate questions with the summary as the reference using an LLM.
6. Validate generated questions using NLP techniques like `cosine similarity`, `BERTScore`, and `NLI Score`.
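
Of the validation techniques in step 6, cosine similarity is the simplest to illustrate. The sketch below compares a summary embedding with two question embeddings. The vectors and the `THRESHOLD` value are hypothetical; in practice the vectors would come from the embedding model, and the cut-off would need to be tuned.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention: a zero vector is similar to nothing
    return dot / (norm_u * norm_v)


# Hypothetical embeddings; real ones would come from the embedding model.
summary_vec = [0.9, 0.1, 0.3]
question_vecs = {
    "on_topic": [0.8, 0.2, 0.4],
    "off_topic": [-0.1, 0.9, -0.2],
}
THRESHOLD = 0.8  # illustrative cut-off, not a recommended value

for name, vec in question_vecs.items():
    score = cosine_similarity(summary_vec, vec)
    print(name, round(score, 3), "keep" if score >= THRESHOLD else "review")
```

A question whose embedding points in nearly the same direction as the summary is likely to be about the same content; a low or negative score flags it for review.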

The summarization and question generation steps can have multiple implementations using different LLMs and system prompts.

In [None]:
_ = load_dotenv()

In [None]:
os.getenv("H2OGPTE_API_KEY")

In [None]:
llama_summerizer = create_summarizer(
    model_type="h2ogpte",
    model_name="meta-llama/Meta-Llama-3.1-70B-Instruct",
)
llama_question_generator = create_question_generator(
    model_type="h2ogpte",
    model_name="meta-llama/Meta-Llama-3.1-70B-Instruct",
)

#### Interactive Question Generation

The following is a widget to experiment with the question generation process interactively.
- Pick the lasso tool from the top-right corner of the plot.
- Use the lasso tool to select a few chunks from a cluster.
- This will trigger the summarizer (top right) to generate the summary of the selected chunks.
- When the summary is ready, click the generate button (bottom right) to create questions with the generated summary as a reference.
- Experiment with different parts of the collection to verify that the generated questions are relevant.

In [None]:
question_gen_widget = create_qa_gen_widget(
    exp.chunks,
    fig_data=exp.fig_data,
    summarize_text=llama_summerizer,
    generate_questions=llama_question_generator,
)
question_gen_widget

#### Automatic Question Generation

We can automate the question generation process using the following command.

- Select the topics from which questions need to be generated.
- Specify a summarizer and a question generator implementation.
- Specify a sampling method to pick chunks from each cluster: `twinning` or `all` (exhaustive).

In [None]:
# exp.generate_questions(
#     topics=[0, 1, 2, 3, 4, 5, 6],
#     summarizer=llama_summerizer,
#     question_generator=llama_question_generator,
#     question_generator_name="Meta-Llama-3.1-70B-Instruct",
#     # sampling_method="twinning",
#     sampling_method="all",
# )

If you have already generated questions and saved them in the cache, but the topic model selected for this experiment has changed since then, the topics assigned to the questions need to be updated.
We can run this every time to make sure that the information in the cache is aligned with the current state of the experiment.

In [None]:
exp.update_questions_topics()

Preview the generated questions:

In [None]:
generated_questions = exp.list_generated_questions()
print(len(generated_questions))
for x in generated_questions[:5]:
    print(x)

#### Evaluate Generated Questions

In [None]:
exp.validate_generated_questions()

#### Load Validated Questions in a Widget

All questions and the validation scores are presented as a table to browse.

In [None]:
validated_questions = exp.get_validated_questions()
genq_eval_widget = create_genqa_eval_widget(validated_questions)
genq_eval_widget

# 3. Eval Metrics and RAG <a class='anchor' id='eval_metrics'></a> [↑](#top)

#### Metrics

- [X] Groundedness
- [X] Context Recall
- [X] Context Precision
- [X] Recall Relevancy
- [X] Precision Relevancy
- [X] Answer Relevancy
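
To give a feel for the retrieval-oriented metrics, the sketch below uses the classic set-based definitions of precision and recall over chunk ids. This is an assumption made for illustration: the metrics computed later by `exp.evaluate_answers` may be defined differently (for example, scored by a model rather than by id overlap), and the chunk ids are made up.

```python
def context_precision(retrieved, relevant):
    """Fraction of the retrieved chunks that are relevant."""
    retrieved, relevant = set(retrieved), set(relevant)
    if not retrieved:
        return 0.0
    return len(retrieved & relevant) / len(retrieved)


def context_recall(retrieved, relevant):
    """Fraction of the relevant chunks that were retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    if not relevant:
        return 0.0
    return len(retrieved & relevant) / len(relevant)


# Hypothetical chunk ids for a single question.
retrieved_chunks = ["c1", "c2", "c3", "c4"]
relevant_chunks = ["c2", "c4", "c7"]

print(context_precision(retrieved_chunks, relevant_chunks))  # 2 of 4 retrieved are relevant
print(context_recall(retrieved_chunks, relevant_chunks))     # 2 of 3 relevant were retrieved
```

Precision penalizes a retriever that returns irrelevant context, while recall penalizes one that misses context needed for the answer.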



#### Get Answers from RAG

In [None]:
# NOTE: To be able to use the cached data, please do not modify anything in this cell

rag_name = "h2ogpte.dev.h2o.ai"
rag_version = "1.6.0-dev28"
llm_name = "meta-llama/Meta-Llama-3.1-70B-Instruct"
llm_args = dict(
    temperature=0.0,
    seed=42,
    max_new_tokens=4096,
)

In [None]:
# NOTE: To be able to use the cached data, please do not modify anything in this cell

rag_under_test_id = exp.register_rag_under_test(
    rag_name=rag_name,
    rag_version=rag_version,
    llm_name=llm_name,
    llm_args=llm_args,
    embedding_model_name="BAAI/bge-m3",
)
rag_under_test_id

In [None]:
# NOTE: To be able to use the cached data, please do not modify anything in this cell

rag_collection_name = "OCC Handbook 3"
config = H2ogpteConfig.from_env()
rag = H2OGPTERAG(config, rag_collection_name, llm_name, llm_args)

In [None]:
# NOTE: To be able to use the cached data, please do not modify anything in this cell

rag.add_documents([Path(f"{DOCS_LOC}/pub-ch-model-risk.pdf")])

In [None]:
exp.get_answers_from_rag(
    rag_under_test_id=rag_under_test_id,
    answer_question=rag.answer_question,
)

In [None]:
exp.add_rag_chunks(rag_under_test_id, rag.get_all_chunks)

In [None]:
exp.evaluate_answers(rag_under_test_id)

In [None]:
exp.plot_metrics(rag_under_test_id)

# 4. Human Evaluation <a class='anchor' id='human_evals'></a> [↑](#top)

In [None]:
hlw = human_labeling_widget.create_widget(
    fig_data_json=exp.plot_metric(
        rag_under_test_id,
        metric="groundedness",
        cache_file=os.path.join(CACHE_LOC, "human_eval_groundedness_fig_data.json"),
    ),
    answer_info_func=exp.get_answer_info_func(
        rag_under_test_id,
        cache_file=os.path.join(CACHE_LOC, "human_eval_answer_info.json"),
    ),
    question_id=402,
)
hlw

In [1]:
print("Done!")

Done!
