## 📚 Prerequisites

Before starting, ensure your Azure Services are set up correctly, your Conda environment is ready, and your environment variables are configured according to the instructions in the [README.md](README.md) file.

## 📋 Table of Contents

This notebook guides you through evaluating LLM applications with PromptFlow, covering the following topics:

1. [**Built-in PromptFlow Evaluators and Application Scenarios**](#Built-in-PromptFlow-Evaluators-and-Application-Scenarios): Overview of the evaluators provided by PromptFlow and their use cases.

2. [**Building a Validation Framework with PromptFlow SDK and Azure AI Studio**](#Building-a-Validation-Framework-with-PromptFlow-SDK-and-Azure-AI-Studio): Steps to integrate PromptFlow SDK with Azure AI Studio for creating a robust validation framework.

3. [**Customizing the Validation to Fit Your Scenario**](#Customizing-the-Validation-to-Fit-Your-Scenario): Tailoring the validation framework to meet specific requirements of your project.

# Built-in PromptFlow Evaluators and Application Scenarios

The PromptFlow Evaluation Framework provides a suite of built-in evaluators designed to assess the performance and safety of language models across various application scenarios. These evaluators are categorized based on the type of assessment they perform, ranging from the quality of generated content to its safety and appropriateness.

## Application Scenarios

### Question and Answer
This scenario caters to applications that involve posing queries and generating responses. It is ideal for evaluating the model's ability to understand and process information accurately to provide relevant answers.

### Chat
This scenario is tailored for applications where the model engages in dialogue, employing a retrieval-augmented approach. It assesses the model's capability to extract pertinent information from provided documents and generate coherent, detailed responses.

## Overview of Evaluator Categories and Their Technical Applications

Each evaluator targets a specific scenario and set of inputs. For example, the **RelevanceEvaluator** requires a `question`, an `answer`, and a `context` to score how relevant the answer is to the question within that context. It is well suited to applications such as virtual assistants or customer support chatbots, where the relevance of responses directly affects user satisfaction.

### Evaluating Q/A Pairs for Accuracy and Coherence

This section covers how to evaluate a Q/A pair, in particular how closely a user's answer matches the ground truth. The following evaluators are the most relevant:

- **SimilarityEvaluator**: This tool scrutinizes the congruence between the user's answer and the ground truth. It is instrumental in gauging how well the user's response aligns with the expected answer, a feature paramount for platforms like educational portals where precision is of the essence.

- **F1ScoreEvaluator**: This evaluator computes the F1 score by examining the overlap between the user's answer and the ground truth. It offers invaluable insights into the precision (the relevance of the user's answer) and recall (the extent to which the ground truth is encapsulated by the user's answer), thereby facilitating a nuanced understanding of response accuracy.

- **RelevanceEvaluator**: Evaluates how relevant an answer is to the given question and context. It takes no ground truth directly, but it remains useful alongside the ground-truth comparisons above when the context of the question strongly influences whether an answer is accurate.

- **CoherenceEvaluator**: Assesses the logical flow and consistency of the user's answer. It takes only the `question` and `answer`, so it complements the ground-truth comparisons: a response should not only match the expected answer but also read as a logically consistent whole, which matters most for longer answers that require explanation or justification.

### Prioritizing Content Safety

Furthermore, the **ContentSafetyEvaluator** and **ContentSafetyChatEvaluator** play a critical role in applications that emphasize user safety, like social media platforms or community forums. These evaluators are dedicated to ensuring that generated content is devoid of any harmful or inappropriate material, safeguarding the community against potential risks.

Matching evaluators to the application scenario in this way improves the accuracy, relevance, and safety of responses across digital platforms.

## Evaluator Categories and Classes

| Category            | Evaluator Class            | Required Data Fields          | Example                                                                                                   | Purpose and Applications                                                                                   |
|---------------------|----------------------------|-------------------------------|-----------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------|
| Performance and Quality | GroundednessEvaluator      | answer, context                | `{"answer": "Paris.", "context": "France is a country in Europe. Its capital is Paris."}`                 | Measures how well the answer is grounded in the provided context. Useful for fact-checking applications.  |
|                     | RelevanceEvaluator         | question, answer, context      | `{"question": "What is the capital of France?", "answer": "Paris.", "context": "France is a country in Europe. Its capital is Paris."}` | Assesses the relevance of the answer to the given question and context. Ideal for QA systems.             |
|                     | CoherenceEvaluator         | question, answer               | `{"question": "What is the capital of France?", "answer": "Paris is the capital of France."}`             | Evaluates the logical flow and coherence of the conversation. Useful for dialogue systems.                |
|                     | FluencyEvaluator           | question, answer               | `{"question": "What is the capital of France?", "answer": "Paris is the capital of France."}`             | Checks the linguistic fluency and readability of the answer. Important for content generation.           |
|                     | SimilarityEvaluator        | question, answer, ground_truth | `{"question": "What is the capital of France?", "answer": "Paris is the capital of France.", "ground_truth": "The capital of France is Paris."}` | Compares the similarity between the generated answer and a ground truth answer. Useful for automated grading systems. |
|                     | F1ScoreEvaluator           | answer, ground_truth           | `{"answer": "Paris is the capital of France.", "ground_truth": "The capital of France is Paris."}`        | Calculates the F1 score based on the overlap between the generated answer and the ground truth. Useful for evaluating model precision and recall. |
| Risk and Safety     | ViolenceEvaluator          | question, answer               | `{"question": "What is the capital of France?", "answer": "Paris."}`                                      | Detects violent content in the model's responses. Essential for content moderation.                      |
|                     | SexualEvaluator            | question, answer               | `{"question": "What is the capital of France?", "answer": "Paris."}`                                      | Identifies sexual content in responses. Critical for maintaining content appropriateness.                |
|                     | SelfHarmEvaluator          | question, answer               | `{"question": "What is the capital of France?", "answer": "Paris."}`                                      | Screens for self-harm related content in answers. Important for user safety.                             |
|                     | HateUnfairnessEvaluator    | question, answer               | `{"question": "What is the capital of France?", "answer": "Paris."}`                                      | Detects hate speech and unfairness in content. Vital for ethical AI applications.                        |
| Composite           | QAEvaluator                | question, answer, context, ground_truth | `{"question": "What is the capital of France?", "answer": "Paris is the capital of France.", "context": "France is a country in Europe. Its capital is Paris.", "ground_truth": "The capital of France is Paris."}` | Combines quality evaluators for QA pairs. Useful for comprehensive QA system evaluation.                 |
|                     | ChatEvaluator              | question, answer, context, ground_truth | `{"question": "What is the capital of France?", "answer": "Paris is the capital of France.", "context": "France is a country in Europe. Its capital is Paris.", "ground_truth": "The capital of France is Paris."}` | Integrates quality evaluators for chat messages. Ideal for evaluating chatbots.                          |
|                     | ContentSafetyEvaluator     | question, answer               | `{"question": "What is the capital of France?", "answer": "Paris is the capital of France."}`             | Combines safety evaluators for QA pairs. Essential for ensuring content safety in QA systems.            |
|                     | ContentSafetyChatEvaluator | question, answer, context, ground_truth | `{"question": "What is the capital of France?", "answer": "Paris is the capital of France.", "context": "France is a country in Europe. Its capital is Paris.", "ground_truth": "The capital of France is Paris."}` | Merges safety evaluators for chat messages. Crucial for safe interactions in chat applications.          |
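
Before sending a record to an evaluator, it can help to check that the record carries the fields that evaluator needs. The following is a minimal sketch that transcribes the single-metric rows of the table above into a dictionary; `REQUIRED_FIELDS`, `missing_fields`, and `applicable_evaluators` are hypothetical helper names for illustration and are not part of the PromptFlow SDK.

```python
# Required input fields per evaluator, transcribed from the
# single-metric rows of the table above (composite rows omitted).
REQUIRED_FIELDS = {
    "GroundednessEvaluator": {"answer", "context"},
    "RelevanceEvaluator": {"question", "answer", "context"},
    "CoherenceEvaluator": {"question", "answer"},
    "FluencyEvaluator": {"question", "answer"},
    "SimilarityEvaluator": {"question", "answer", "ground_truth"},
    "F1ScoreEvaluator": {"answer", "ground_truth"},
    "ViolenceEvaluator": {"question", "answer"},
    "SexualEvaluator": {"question", "answer"},
    "SelfHarmEvaluator": {"question", "answer"},
    "HateUnfairnessEvaluator": {"question", "answer"},
}


def missing_fields(evaluator_name, record):
    """Return the required fields that are absent or empty in the record."""
    required = REQUIRED_FIELDS[evaluator_name]
    present = {key for key, value in record.items() if value not in (None, "")}
    return sorted(required - present)


def applicable_evaluators(record):
    """Return the evaluators whose required fields are all present."""
    return sorted(
        name for name in REQUIRED_FIELDS if not missing_fields(name, record)
    )


record = {"question": "What is the capital of France?", "answer": "Paris."}
print(missing_fields("RelevanceEvaluator", record))
print(applicable_evaluators(record))
```

A check like this is cheap to run over a whole dataset and surfaces missing columns before any model calls are made.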

## Building a Validation Framework with PromptFlow SDK and Azure AI Studio

### Evaluation Focus for LLM/SLM Benchmarking

When evaluating your LLM/SLM, consider the following key areas:

- **🧠 Understanding**: Measure the model's reasoning and comprehension abilities. Use established datasets such as MMLU, PubMedQA, and TruthfulQA to benchmark overall performance.

- **⚙️ Retrieval System/QA**: Examine the effectiveness of the LLM-based system in its entirety. This includes evaluating its ability to understand context and achieve domain-specific accuracy.

- **🛡️ Responsible AI (RAI)**: Ensure the model adheres to Responsible AI principles. This involves assessing ethical considerations, fairness, and transparency to meet responsible AI standards.

In [20]:
import os
from datetime import datetime
from pprint import pprint

# Define the target directory (change this to your project root)
TARGET_DIRECTORY = os.getcwd()

# Check if the directory exists
if os.path.exists(TARGET_DIRECTORY):
    # Change the current working directory
    os.chdir(TARGET_DIRECTORY)
    print(f"Directory changed to {os.getcwd()}")
else:
    print(f"Directory {TARGET_DIRECTORY} does not exist.")

### 0. Optional: Building Your Own SDK for Enhanced Control and Granularity

Be mindful of the level of abstraction. If your project requires specific functionalities, including custom encryption or other complex components, consider developing your own SDK.

In [21]:
from src.quality.gpt_evals import AzureAIQualityEvaluator

from dotenv import load_dotenv
load_dotenv()

False

In [22]:
quality_evals = AzureAIQualityEvaluator(azure_endpoint=os.environ.get("AZURE_AOAI_ENDPOINT"),
    api_key=os.environ.get("AZURE_AOAI_API_KEY"),
    azure_deployment=os.environ.get("AZURE_AOAI_COMPLETION_MODEL_DEPLOYMENT_ID"),
    api_version=os.environ.get("AZURE_AOAI_DEPLOYMENT_VERSION"),
    subscription_id=os.environ.get("AZURE_AI_STUDIO_SUBSCRIPTION_ID"),
    resource_group_name=os.environ.get("AZURE_AI_STUDIO_RESOURCE_GROUP_NAME"),
    project_name=os.environ.get("AZURE_AI_STUDIO_PROJECT_NAME"))

2024-08-30 08:15:21,373 - micro - MainProcess - INFO     AzureAIQualityEvaluator initialized successfully. (gpt_evals.py:__init__:61)
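
The `False` returned by `load_dotenv()` above indicates that no variables were loaded from a `.env` file, and `os.environ.get` returns `None` silently for anything that is unset. A minimal sketch of a pre-flight check follows; the variable names are the ones used in the cell above, while `REQUIRED_ENV_VARS` and `find_missing_env_vars` are hypothetical helpers introduced here for illustration.

```python
import os

# Environment variables read by the evaluator constructor above.
REQUIRED_ENV_VARS = [
    "AZURE_AOAI_ENDPOINT",
    "AZURE_AOAI_API_KEY",
    "AZURE_AOAI_COMPLETION_MODEL_DEPLOYMENT_ID",
    "AZURE_AOAI_DEPLOYMENT_VERSION",
    "AZURE_AI_STUDIO_SUBSCRIPTION_ID",
    "AZURE_AI_STUDIO_RESOURCE_GROUP_NAME",
    "AZURE_AI_STUDIO_PROJECT_NAME",
]


def find_missing_env_vars(names, environ=None):
    """Return the names that are unset or empty in the given environment."""
    environ = os.environ if environ is None else environ
    return [name for name in names if not environ.get(name)]


missing = find_missing_env_vars(REQUIRED_ENV_VARS)
if missing:
    print("Missing environment variables: " + ", ".join(missing))
else:
    print("All required environment variables are set.")
```

Running this before constructing the evaluator turns a late authentication failure into an early, readable message.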


### 1. Building Golden Datasets for Evaluation

- **Diversity**: Ensure the dataset spans a broad spectrum of scenarios to thoroughly assess model performance.

- **Complexity Levels**: Include both straightforward and complex queries to evaluate the model's depth of understanding.

- **Ambiguity**: Incorporate queries with multiple valid interpretations to test the model's ambiguity handling.

- **Data Enrichment**:
  - **Paraphrasing**: Use tools like GPT-4 to paraphrase existing queries, enhancing dataset variety.
  - **Synthetic Data**: Employ Large Language Models (LLMs) to generate data for underrepresented scenarios.
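
The criteria above can be tracked directly in the dataset. The sketch below assumes a hypothetical `complexity` label on each row (not a column the original dataset is known to have) and uses illustrative rows only; it counts coverage per label and serializes the evaluation fields to JSONL using the standard library.

```python
import json
from collections import Counter

# Illustrative rows only; the "complexity" label is a hypothetical column.
golden_rows = [
    {
        "question": "What is the capital of Italy?",
        "answer": "Rome is the capital of Italy.",
        "ground_truth": "The capital of Italy is Rome.",
        "complexity": "simple",
    },
    {
        "question": "Which is larger by area, the largest country or the largest continent?",
        "answer": "Asia is larger than Russia.",
        "ground_truth": "Asia, the largest continent, is larger than Russia, the largest country by area.",
        "complexity": "complex",
    },
    {
        "question": "What is the boiling point of water?",
        "answer": "The boiling point of water is 100 degrees Celsius.",
        "ground_truth": "The boiling point of water is 100 degrees Celsius at sea-level pressure.",
        "complexity": "ambiguous",
    },
]


def coverage(rows, key="complexity"):
    """Count rows per label so underrepresented scenarios stand out."""
    return Counter(row.get(key, "unlabeled") for row in rows)


def to_jsonl(rows, fields=("question", "answer", "ground_truth")):
    """Serialize the evaluation fields of each row as one JSON object per line."""
    return "\n".join(json.dumps({f: row[f] for f in fields}) for row in rows)


print(coverage(golden_rows))
print(to_jsonl(golden_rows))
```

Labels with low counts in the coverage summary are natural candidates for paraphrasing or synthetic generation.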

In [23]:
data_input_path = os.path.join(os.getcwd(), "my_utils", "data", "evaluations", "dataframe", "golden_eval_dataset.csv")

In [24]:
import pandas as pd
df = pd.read_csv(data_input_path)
df = df.drop(columns=["count"])
df.head()

FileNotFoundError: [Errno 2] No such file or directory: '/opt/anaconda3/envs/promptflow-eval-framework/lib/python3.9/site-packages/promptflow/evals/evaluators/_f1_score/flow/my_utils/data/evaluations/dataframe/golden_eval_dataset.csv'
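
The path in the error above points inside the installed `promptflow` package rather than the project, which suggests the working directory was no longer the project root when `os.getcwd()` was evaluated. One way to guard against this, sketched below under that assumption, is to capture the project root once and verify the file exists before reading it. `resolve_dataset_path` and `PROJECT_ROOT` are hypothetical names introduced for illustration.

```python
from pathlib import Path


def resolve_dataset_path(project_root, *parts):
    """Build an absolute dataset path under project_root and verify it exists."""
    path = Path(project_root).resolve().joinpath(*parts)
    if not path.is_file():
        raise FileNotFoundError(
            f"Dataset not found at {path}. "
            "Check that project_root points at the repository root."
        )
    return path


# Capture the root once, before anything can change the working directory.
PROJECT_ROOT = Path.cwd()
DATASET_PARTS = ("my_utils", "data", "evaluations", "dataframe", "golden_eval_dataset.csv")

try:
    data_input_path = resolve_dataset_path(PROJECT_ROOT, *DATASET_PARTS)
    print(f"Using dataset at {data_input_path}")
except FileNotFoundError as err:
    print(err)
```

If the notebook is launched from a different folder, set `PROJECT_ROOT` explicitly instead of relying on `Path.cwd()`.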

### 2. Evaluating Quality and Performance


**📊 What metrics are we evaluating?**

- **F1 Score**: The harmonic mean of precision and recall, computed over the tokens shared between the generated answer and the ground truth. Precision is the share of answer tokens that also appear in the ground truth; recall is the share of ground-truth tokens that appear in the answer.
  - **Range**: 0 (worst) to 1 (best).

- **GPT Groundedness**: Assesses how well the answer is supported by the provided context.
  - **Range**: 1 (not grounded in the context) to 5 (fully grounded).

- **GPT Relevance**: Evaluates how relevant the answer is to the given question and context.
  - **Range**: 1 (not relevant) to 5 (highly relevant).

- **GPT Coherence**: Measures the logical flow and consistency of the answer.
  - **Range**: 1 (incoherent) to 5 (highly coherent).

- **GPT Fluency**: Assesses the readability and grammatical quality of the text.
  - **Range**: 1 (hard to read) to 5 (extremely fluent).

- **GPT Similarity**: Measures how similar the answer is to the ground truth.
  - **Range**: 1 (not similar) to 5 (very similar).
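
To make the F1 score concrete, here is a simplified sketch of a token-overlap F1 between an answer and a ground truth. It lowercases the text and splits on word characters; the built-in `F1ScoreEvaluator` may normalize text differently, so treat the numbers as illustrative rather than as a reproduction of its output. `tokenize` and `token_f1` are hypothetical helper names.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def token_f1(answer, ground_truth):
    """Harmonic mean of token precision and recall between two strings."""
    answer_tokens = tokenize(answer)
    truth_tokens = tokenize(ground_truth)
    if not answer_tokens or not truth_tokens:
        return 0.0
    # Multiset intersection counts each shared token at most as often
    # as it occurs in both strings.
    shared = sum((Counter(answer_tokens) & Counter(truth_tokens)).values())
    if shared == 0:
        return 0.0
    precision = shared / len(answer_tokens)
    recall = shared / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)


# Same tokens in a different order: precision and recall are both 1.
print(token_f1("Rome is the capital of Italy.", "The capital of Italy is Rome."))  # 1.0

# A short answer has perfect precision but low recall.
print(token_f1("Paris.", "The capital of France is Paris."))
```

The second call shows why F1 penalizes terse answers: the single token is correct, but it covers only one of the six ground-truth tokens.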


In [5]:
data_input_path = os.path.join(os.getcwd(), "my_utils", "data", "evaluations", "dataframe", "golden_eval_dataset.csv")

# Execute the quality evaluation in parallel and batch mode.
# Each of the metrics described above (F1 score, GPT groundedness, relevance,
# coherence, fluency, and similarity) is computed concurrently across multiple
# data points, and the per-metric results are then aggregated into an overall
# quality assessment of the chat responses.
metrics_quality, azure_ai_studio_url = quality_evals.run_chat_quality(data_input=data_input_path)

2024-08-30 07:43:42,758 - micro - MainProcess - INFO     Data successfully converted to JSONL format. (gpt_evals.py:_convert_to_jsonl:79)
2024-08-30 07:43:42,758 - micro - MainProcess - INFO     Evaluating the quality of chat responses... (gpt_evals.py:run_chat_quality:104)


[2024-08-30 07:43:56 -0700][flowinvoker][INFO] - Getting connections from pf client with provider from args: local...

2024-08-30 07:43:56 -0700   98603 execution.flow     INFO     Start executing nodes in thread pool mode.
2024-08-30 07:43:56 -0700   98603 execution.flow     INFO     Start to run 2 nodes with concurrency level 16.
2024-08-30 07:43:56 -0700   98603 execution.flow     INFO     Current thread is not main thread, skip signal handler registration in AsyncNodesScheduler.
2024-08-30 07:43:56 -0700   98603 execution.flow     INFO     Executing node validate_inputs. node run id: a742d48f-8805-4cf9-b23d-9e1406745054_validate_inputs_faad6417-cecc-4933-bdbb-c1e1e6238687

[2024-08-30 07:43:57 -0700][flowinvoker][INFO] - Validating flow input with data {'answer': 'The Burj Khalifa is the tallest building in the world.', 'ground_truth': 'The tallest building in the world is the Burj Khalifa.'}
[2024-08-30 07:43:57 -0700][flowinvoker][INFO] - Execute flow with data {'answer': 'The Burj Khalifa is the tallest building in the world.', 'ground_truth': 'The tallest building in the world is the Burj Khalifa.'}
[2024-08-30 07:43:57 -0700][flowinvoker][INFO] - Validating flow input with data {'answer': 'Alexander Fleming discovered penicillin.', 'ground_truth': 'Alexander Fleming discovered penicillin.'}
[2024-08-30 07:43:57 -0700][flowinvoker][INFO] - Execute flow with data {'answer': 'Alexander Fleming discovered penicillin.', 'ground_truth': 'Alexander Fleming discovered penicillin.'}


2024-08-30 07:43:57 -0700   98603 execution.flow     INFO     Start executing nodes in thread pool mode.
2024-08-30 07:43:57 -0700   98603 execution.flow     INFO     Start to run 2 nodes with concurrency level 16.
2024-08-30 07:43:57 -0700   98603 execution.flow     INFO     Current thread is not main thread, skip signal handler registration in AsyncNodesScheduler.
2024-08-30 07:43:57 -0700   98603 execution.flow     INFO     Executing node validate_inputs. node run id: bbea0d2f-58bc-47e7-a530-5532f5ce088c_validate_inputs_fe49fc85-609f-4b66-b535-db105f3fec45

[2024-08-30 07:43:58 -0700][flowinvoker][INFO] - Validating flow input with data {'answer': 'Russia is the largest country by area.', 'ground_truth': 'The largest country by area is Russia.'}
[2024-08-30 07:43:58 -0700][flowinvoker][INFO] - Execute flow with data {'answer': 'Russia is the largest country by area.', 'ground_truth': 'The largest country by area is Russia.'}


2024-08-30 07:43:58 -0700   98603 execution.flow     INFO     Start executing nodes in thread pool mode.
2024-08-30 07:43:58 -0700   98603 execution.flow     INFO     Start to run 2 nodes with concurrency level 16.
2024-08-30 07:43:58 -0700   98603 execution.flow     INFO     Current thread is not main thread, skip signal handler registration in AsyncNodesScheduler.
2024-08-30 07:43:58 -0700   98603 execution.flow     INFO     Executing node validate_inputs. node run id: 0a5a38e5-6997-45d8-821a-c9ac48f2c9d8_validate_inputs_e612791b-4609-4db9-8006-2ce41ba49957
2024-08-30 07:43:58 -0700   98603 execution.flow     INFO     Node validate_inputs completes.
2024-08-30 07:43:58 -0700   98603 execution.flow     INFO     The node 'compute_f1_score' will be executed because the activate condition is met, i.e. '${validate_inputs.output}' is equal to 'True'.
2024-08-30 07:43:58 -0700   98603 execution.flow     INFO     Executing node compute_f1_score. node run id: 0a5a38e5-6997-45d8-821a-c9ac48f2c

[2024-08-30 07:43:58 -0700][flowinvoker][INFO] - Validating flow input with data {'answer': 'The Nile is the longest river in the world.', 'ground_truth': 'The longest river in the world is the Nile.'}
[2024-08-30 07:43:58 -0700][flowinvoker][INFO] - Execute flow with data {'answer': 'The Nile is the longest river in the world.', 'ground_truth': 'The longest river in the world is the Nile.'}
[2024-08-30 07:43:58 -0700][flowinvoker][INFO] - Validating flow input with data {'answer': 'Alexander Graham Bell invented the telephone.', 'ground_truth': 'Alexander Graham Bell invented the telephone.'}
[2024-08-30 07:43:58 -0700][flowinvoker][INFO] - Execute flow with data {'answer': 'Alexander Graham Bell invented the telephone.', 'ground_truth': 'Alexander Graham Bell invented the telephone.'}
[2024-08-30 07:43:58 -0700][flowinvoker][INFO] - Validating flow input with data {'answer': 'The freezing point of water is 0 degrees Celsius.', 'ground_truth': 'The freezing point of water is 0 degrees

2024-08-30 07:43:58 -0700   98603 execution.flow     INFO     Start executing nodes in thread pool mode.
2024-08-30 07:43:58 -0700   98603 execution.flow     INFO     Start to run 2 nodes with concurrency level 16.
2024-08-30 07:43:58 -0700   98603 execution.flow     INFO     Current thread is not main thread, skip signal handler registration in AsyncNodesScheduler.
2024-08-30 07:43:58 -0700   98603 execution.flow     INFO     Executing node validate_inputs. node run id: c2543195-5610-4f5c-84b8-a1ea983f7e33_validate_inputs_1c9d00a9-c4a0-43c5-ae51-b39c47394262

[2024-08-30 07:43:58 -0700][flowinvoker][INFO] - Validating flow input with data {'answer': 'Albert Einstein is known as the father of modern physics.', 'ground_truth': 'Albert Einstein is known as the father of modern physics.'}
[2024-08-30 07:43:58 -0700][flowinvoker][INFO] - Execute flow with data {'answer': 'Albert Einstein is known as the father of modern physics.', 'ground_truth': 'Albert Einstein is known as the father of modern physics.'}


2024-08-30 07:43:58 -0700   98603 execution.flow     INFO     Node validate_inputs completes.
2024-08-30 07:43:58 -0700   98603 execution.flow     INFO     The node 'compute_f1_score' will be executed because the activate condition is met, i.e. '${validate_inputs.output}' is equal to 'True'.
2024-08-30 07:43:58 -0700   98603 execution.flow     INFO     Executing node compute_f1_score. node run id: 2ece17d9-9c44-4f78-9518-ac6ede906761_compute_f1_score_5b91c7c2-7e4e-4bb5-80cd-6bbd666e68df


[2024-08-30 07:43:58 -0700][flowinvoker][INFO] - Validating flow input with data {'answer': 'Au is the chemical symbol for gold.', 'ground_truth': 'The chemical symbol for gold is Au.'}
[2024-08-30 07:43:58 -0700][flowinvoker][INFO] - Execute flow with data {'answer': 'Au is the chemical symbol for gold.', 'ground_truth': 'The chemical symbol for gold is Au.'}


2024-08-30 07:43:58 -0700   98603 execution.flow     INFO     Start executing nodes in thread pool mode.
2024-08-30 07:43:58 -0700   98603 execution.flow     INFO     Start to run 2 nodes with concurrency level 16.
2024-08-30 07:43:58 -0700   98603 execution.flow     INFO     Current thread is not main thread, skip signal handler registration in AsyncNodesScheduler.
2024-08-30 07:43:58 -0700   98603 execution.flow     INFO     Executing node validate_inputs. node run id: a77acd77-a6d4-4c04-bc76-b35a03cf473f_validate_inputs_406cd40a-f98c-4c14-88fa-de516c2e1d4d


[2024-08-30 07:43:58 -0700][flowinvoker][INFO] - Validating flow input with data {'answer': 'China is the most populous country in the world.', 'ground_truth': 'The most populous country in the world is China.'}
[2024-08-30 07:43:58 -0700][flowinvoker][INFO] - Execute flow with data {'answer': 'China is the most populous country in the world.', 'ground_truth': 'The most populous country in the world is China.'}
[2024-08-30 07:43:58 -0700][flowinvoker][INFO] - Validating flow input with data {'answer': 'The blue whale is the largest mammal in the world.', 'ground_truth': 'The largest mammal in the world is the blue whale.'}
[2024-08-30 07:43:58 -0700][flowinvoker][INFO] - Execute flow with data {'answer': 'The blue whale is the largest mammal in the world.', 'ground_truth': 'The largest mammal in the world is the blue whale.'}


2024-08-30 07:43:58 -0700   98603 execution.flow     INFO     Start executing nodes in thread pool mode.
2024-08-30 07:43:58 -0700   98603 execution.flow     INFO     Start to run 2 nodes with concurrency level 16.
2024-08-30 07:43:58 -0700   98603 execution.flow     INFO     Current thread is not main thread, skip signal handler registration in AsyncNodesScheduler.
2024-08-30 07:43:58 -0700   98603 execution.flow     INFO     Executing node validate_inputs. node run id: b15b0ada-ed6e-415d-b39a-1e8b5d576ae6_validate_inputs_048acf61-0bb3-49ba-b4a6-972ca2f02d16


[2024-08-30 07:43:58 -0700][flowinvoker][INFO] - Validating flow input with data {'answer': 'The atom is the smallest unit of matter.', 'ground_truth': 'The smallest unit of matter is the atom.'}
[2024-08-30 07:43:58 -0700][flowinvoker][INFO] - Execute flow with data {'answer': 'The atom is the smallest unit of matter.', 'ground_truth': 'The smallest unit of matter is the atom.'}


2024-08-30 07:43:58 -0700   98603 execution.flow     INFO     Node compute_f1_score completes.
2024-08-30 07:43:58 -0700   98603 execution.flow     INFO     Node compute_f1_score completes.
2024-08-30 07:43:58 -0700   98603 execution.flow     INFO     Start executing nodes in thread pool mode.
2024-08-30 07:43:58 -0700   98603 execution.flow     INFO     Start to run 2 nodes with concurrency level 16.
2024-08-30 07:43:58 -0700   98603 execution.flow     INFO     Current thread is not main thread, skip signal handler registration in AsyncNodesScheduler.
2024-08-30 07:43:58 -0700   98603 execution.flow     INFO     Executing node validate_inputs. node run id: 7e0c4740-a3f6-4303-ab7b-6275d978930d_validate_inputs_25a4d278-2835-41dc-9694-66ccd3ae2663

[2024-08-30 07:43:58 -0700][flowinvoker][INFO] - Validating flow input with data {'answer': 'The boiling point of water is 100 degrees Celsius.', 'ground_truth': 'The boiling point of water is 100 degrees Celsius.'}
[2024-08-30 07:43:58 -0700][flowinvoker][INFO] - Execute flow with data {'answer': 'The boiling point of water is 100 degrees Celsius.', 'ground_truth': 'The boiling point of water is 100 degrees Celsius.'}
[2024-08-30 07:43:58 -0700][flowinvoker][INFO] - Validating flow input with data {'answer': "Jane Austen wrote 'Pride and Prejudice'.", 'ground_truth': "Jane Austen wrote 'Pride and Prejudice'."}
[2024-08-30 07:43:58 -0700][flowinvoker][INFO] - Execute flow with data {'answer': "Jane Austen wrote 'Pride and Prejudice'.", 'ground_truth': "Jane Austen wrote 'Pride and Prejudice'."}


2024-08-30 07:43:58 -0700   98603 execution.flow     INFO     Node validate_inputs completes.
2024-08-30 07:43:58 -0700   98603 execution.flow     INFO     The node 'compute_f1_score' will be executed because the activate condition is met, i.e. '${validate_inputs.output}' is equal to 'True'.
2024-08-30 07:43:58 -0700   98603 execution.flow     INFO     Executing node compute_f1_score. node run id: b15b0ada-ed6e-415d-b39a-1e8b5d576ae6_compute_f1_score_878276ee-5d09-44dc-8eda-df148f0821d2
2024-08-30 07:43:58 -0700   98603 execution.flow     INFO     Start executing nodes in thread pool mode.
2024-08-30 07:43:58 -0700   98603 execution.flow     INFO     Start to run 2 nodes with concurrency level 16.
2024-08-30 07:43:58 -0700   98603 execution.flow     INFO     Current thread is not main thread, skip signal handler registration in AsyncNodesScheduler.
2024-08-30 07:43:58 -0700   98603 execution.flow     INFO     Executing node validate_inputs. node run id: d23bd814-76a0-4db8-8b49-8cb4a169

[2024-08-30 07:43:58 -0700][flowinvoker][INFO] - Validating flow input with data {'answer': "Nitrogen is the most abundant gas in the Earth's atmosphere.", 'ground_truth': "The most abundant gas in the Earth's atmosphere is nitrogen."}
[2024-08-30 07:43:58 -0700][flowinvoker][INFO] - Execute flow with data {'answer': "Nitrogen is the most abundant gas in the Earth's atmosphere.", 'ground_truth': "The most abundant gas in the Earth's atmosphere is nitrogen."}
[2024-08-30 07:43:58 -0700][flowinvoker][INFO] - Validating flow input with data {'answer': 'Rome is the capital of Italy.', 'ground_truth': 'The capital of Italy is Rome.'}
[2024-08-30 07:43:58 -0700][flowinvoker][INFO] - Execute flow with data {'answer': 'Rome is the capital of Italy.', 'ground_truth': 'The capital of Italy is Rome.'}
[2024-08-30 07:43:58 -0700][flowinvoker][INFO] - Validating flow input with data {'answer': 'The square root of 64 is 8.', 'ground_truth': 'The square root of 64 is 8.'}
[2024-08-30 07:43:58 -0700][

2024-08-30 07:43:59 -0700   98603 execution.flow     INFO     Start executing nodes in thread pool mode.
2024-08-30 07:43:59 -0700   98603 execution.flow     INFO     Start to run 2 nodes with concurrency level 16.
2024-08-30 07:43:59 -0700   98603 execution.flow     INFO     Current thread is not main thread, skip signal handler registration in AsyncNodesScheduler.
2024-08-30 07:43:59 -0700   98603 execution.flow     INFO     Executing node validate_inputs. node run id: 06d53481-e01a-43ef-bf0e-125a601d5fbc_validate_inputs_9cab7458-d0b6-46ff-b803-5a91bcd577f4

[2024-08-30 07:43:59 -0700][flowinvoker][INFO] - Validating flow input with data {'answer': 'O is the chemical symbol for oxygen.', 'ground_truth': 'The chemical symbol for oxygen is O.'}
[2024-08-30 07:43:59 -0700][flowinvoker][INFO] - Execute flow with data {'answer': 'O is the chemical symbol for oxygen.', 'ground_truth': 'The chemical symbol for oxygen is O.'}


2024-08-30 07:43:59 -0700   98603 execution.flow     INFO     Start executing nodes in thread pool mode.
2024-08-30 07:43:59 -0700   98603 execution.flow     INFO     Start to run 2 nodes with concurrency level 16.
2024-08-30 07:43:59 -0700   98603 execution.flow     INFO     Current thread is not main thread, skip signal handler registration in AsyncNodesScheduler.
2024-08-30 07:43:59 -0700   98603 execution.flow     INFO     Executing node validate_inputs. node run id: aa074184-ffbc-490b-9faf-4c42ef8b77ff_validate_inputs_372774f1-268b-4980-8a1b-16bb0543fccd
2024-08-30 07:43:59 -0700   98603 execution.flow     INFO     Node validate_inputs completes.
2024-08-30 07:43:59 -0700   98603 execution.flow     INFO     The node 'compute_f1_score' will be executed because the activate condition is met, i.e. '${validate_inputs.output}' is equal to 'True'.
2024-08-30 07:43:59 -0700   98603 execution.flow     INFO     Executing node compute_f1_score. node run id: aa074184-ffbc-490b-9faf-4c42ef8b7

[2024-08-30 07:43:59 -0700][flowinvoker][INFO] - Validating flow input with data {'answer': 'Asia is the largest continent on Earth.', 'ground_truth': 'The largest continent on Earth is Asia.'}
[2024-08-30 07:43:59 -0700][flowinvoker][INFO] - Execute flow with data {'answer': 'Asia is the largest continent on Earth.', 'ground_truth': 'The largest continent on Earth is Asia.'}


2024-08-30 07:43:59 -0700   98603 execution.flow     INFO     Start executing nodes in thread pool mode.
2024-08-30 07:43:59 -0700   98603 execution.flow     INFO     Start to run 2 nodes with concurrency level 16.
2024-08-30 07:43:59 -0700   98603 execution.flow     INFO     Current thread is not main thread, skip signal handler registration in AsyncNodesScheduler.
2024-08-30 07:43:59 -0700   98603 execution.flow     INFO     Executing node validate_inputs. node run id: def55b6d-4344-4b7f-b6f2-f7cd228e8cab_validate_inputs_39d64132-f4ba-40bb-bbdb-56342ed47b21
2024-08-30 07:43:59 -0700   98603 execution.flow     INFO     Node validate_inputs completes.
2024-08-30 07:43:59 -0700   98603 execution.flow     INFO     The node 'compute_f1_score' will be executed because the activate condition is met, i.e. '${validate_inputs.output}' is equal to 'True'.
2024-08-30 07:43:59 -0700   98603 execution.flow     INFO     Executing node compute_f1_score. node run id: def55b6d-4344-4b7f-b6f2-f7cd228e8

[2024-08-30 07:43:59 -0700][flowinvoker][INFO] - Validating flow input with data {'answer': 'The Pacific Ocean is the deepest ocean in the world.', 'ground_truth': 'The deepest ocean in the world is the Pacific Ocean.'}
[2024-08-30 07:43:59 -0700][flowinvoker][INFO] - Execute flow with data {'answer': 'The Pacific Ocean is the deepest ocean in the world.', 'ground_truth': 'The deepest ocean in the world is the Pacific Ocean.'}


2024-08-30 07:43:59 -0700   98603 execution.flow     INFO     Start executing nodes in thread pool mode.
2024-08-30 07:43:59 -0700   98603 execution.flow     INFO     Start to run 2 nodes with concurrency level 16.
2024-08-30 07:43:59 -0700   98603 execution.flow     INFO     Current thread is not main thread, skip signal handler registration in AsyncNodesScheduler.
2024-08-30 07:43:59 -0700   98603 execution.flow     INFO     Executing node validate_inputs. node run id: b9bab22e-c906-450e-b76b-dc798b13a4a8_validate_inputs_baa18652-8804-499d-998b-9d680fdb858f
2024-08-30 07:43:59 -0700   98603 execution.flow     INFO     Node validate_inputs completes.
2024-08-30 07:43:59 -0700   98603 execution.flow     INFO     The node 'compute_f1_score' will be executed because the activate condition is met, i.e. '${validate_inputs.output}' is equal to 'True'.
2024-08-30 07:43:59 -0700   98603 execution.flow     INFO     Executing node compute_f1_score. node run id: b9bab22e-c906-450e-b76b-dc798b13a

[2024-08-30 07:43:59 -0700][flowinvoker][INFO] - Validating flow input with data {'answer': 'George Washington was the first President of the United States.', 'ground_truth': 'George Washington was the first President of the United States.'}
[2024-08-30 07:43:59 -0700][flowinvoker][INFO] - Execute flow with data {'answer': 'George Washington was the first President of the United States.', 'ground_truth': 'George Washington was the first President of the United States.'}
[2024-08-30 07:43:59 -0700][flowinvoker][INFO] - Validating flow input with data {'answer': "Oxygen is the most abundant element in the Earth's crust.", 'ground_truth': "The most abundant element in the Earth's crust is oxygen."}
[2024-08-30 07:43:59 -0700][flowinvoker][INFO] - Execute flow with data {'answer': "Oxygen is the most abundant element in the Earth's crust.", 'ground_truth': "The most abundant element in the Earth's crust is oxygen."}


2024-08-30 07:43:59 -0700   98603 execution.flow     INFO     Start executing nodes in thread pool mode.
2024-08-30 07:43:59 -0700   98603 execution.flow     INFO     Start to run 2 nodes with concurrency level 16.
2024-08-30 07:43:59 -0700   98603 execution.flow     INFO     Current thread is not main thread, skip signal handler registration in AsyncNodesScheduler.
2024-08-30 07:43:59 -0700   98603 execution.flow     INFO     Executing node validate_inputs. node run id: 2f5a4b5c-806e-46ef-9e93-a44b7b06d9e9_validate_inputs_3365b839-dc39-433c-9dda-72f88b6ef881
2024-08-30 07:43:59 -0700   98603 execution.flow     INFO     Start executing nodes in thread pool mode.
2024-08-30 07:43:59 -0700   98603 execution.flow     INFO     Start to run 2 nodes with concurrency level 16.
2024-08-30 07:43:59 -0700   98603 execution.flow     INFO     Current thread is not main thread, skip signal handler registration in AsyncNodesScheduler.

[2024-08-30 07:43:59 -0700][flowinvoker][INFO] - Validating flow input with data {'answer': 'Canberra is the capital of Australia.', 'ground_truth': 'The capital of Australia is Canberra.'}
[2024-08-30 07:43:59 -0700][flowinvoker][INFO] - Execute flow with data {'answer': 'Canberra is the capital of Australia.', 'ground_truth': 'The capital of Australia is Canberra.'}


2024-08-30 07:43:59 -0700   98603 execution.flow     INFO     Start executing nodes in thread pool mode.
2024-08-30 07:43:59 -0700   98603 execution.flow     INFO     Start to run 2 nodes with concurrency level 16.
2024-08-30 07:43:59 -0700   98603 execution.flow     INFO     Current thread is not main thread, skip signal handler registration in AsyncNodesScheduler.
2024-08-30 07:43:59 -0700   98603 execution.flow     INFO     Executing node validate_inputs. node run id: 2e8745c0-88dd-4be4-aeec-2103b7f17185_validate_inputs_f01a29b6-4b62-4f67-838b-9e8dc38c0ff3
2024-08-30 07:43:59 -0700   98603 execution.flow     INFO     Node validate_inputs completes.
2024-08-30 07:43:59 -0700   98603 execution.flow     INFO     The node 'compute_f1_score' will be executed because the activate condition is met, i.e. '${validate_inputs.output}' is equal to 'True'.
2024-08-30 07:43:59 -0700   98603 execution.flow     INFO     Node validate_inputs completes.
2024-08-30 07:43:59 -0700   98603 execution.flo

[2024-08-30 07:43:59 -0700][flowinvoker][INFO] - Validating flow input with data {'answer': 'Greenland is the largest island in the world.', 'ground_truth': 'The largest island in the world is Greenland.'}
[2024-08-30 07:43:59 -0700][flowinvoker][INFO] - Execute flow with data {'answer': 'Greenland is the largest island in the world.', 'ground_truth': 'The largest island in the world is Greenland.'}


2024-08-30 07:43:59 -0700   98603 execution.flow     INFO     Node compute_f1_score completes.
2024-08-30 07:43:59 -0700   98603 execution.flow     INFO     Node compute_f1_score completes.


[2024-08-30 07:43:59 -0700][flowinvoker][INFO] - Validating flow input with data {'answer': 'Michelangelo painted the Sistine Chapel ceiling.', 'ground_truth': 'Michelangelo painted the Sistine Chapel ceiling.'}
[2024-08-30 07:43:59 -0700][flowinvoker][INFO] - Execute flow with data {'answer': 'Michelangelo painted the Sistine Chapel ceiling.', 'ground_truth': 'Michelangelo painted the Sistine Chapel ceiling.'}


2024-08-30 07:43:59 -0700   98603 execution.flow     INFO     Start executing nodes in thread pool mode.
2024-08-30 07:43:59 -0700   98603 execution.flow     INFO     Start to run 2 nodes with concurrency level 16.
2024-08-30 07:43:59 -0700   98603 execution.flow     INFO     Current thread is not main thread, skip signal handler registration in AsyncNodesScheduler.
2024-08-30 07:43:59 -0700   98603 execution.flow     INFO     Executing node validate_inputs. node run id: fe59d120-f8e5-465c-a364-b1cc63e0d1bb_validate_inputs_5ac78b42-06c8-4d60-96ee-345dfa0f66ea
2024-08-30 07:43:59 -0700   98603 execution.flow     INFO     Node compute_f1_score completes.
2024-08-30 07:43:59 -0700   98603 execution.flow     INFO     Start executing nodes in thread pool mode.
2024-08-30 07:43:59 -0700   98603 execution.flow     INFO     Start to run 2 nodes with concurrency level 16.
2024-08-30 07:43:59 -0700   98603 execution.flow     INFO     Current thread is not main thread, skip signal handler registr

[2024-08-30 07:43:59 -0700][flowinvoker][INFO] - Validating flow input with data {'answer': 'Avocado is the main ingredient in guacamole.', 'ground_truth': 'The main ingredient in guacamole is avocado.'}
[2024-08-30 07:43:59 -0700][flowinvoker][INFO] - Execute flow with data {'answer': 'Avocado is the main ingredient in guacamole.', 'ground_truth': 'The main ingredient in guacamole is avocado.'}
[2024-08-30 07:43:59 -0700][flowinvoker][INFO] - Validating flow input with data {'answer': 'Moscow is the capital of Russia.', 'ground_truth': 'The capital of Russia is Moscow.'}
[2024-08-30 07:43:59 -0700][flowinvoker][INFO] - Execute flow with data {'answer': 'Moscow is the capital of Russia.', 'ground_truth': 'The capital of Russia is Moscow.'}


2024-08-30 07:43:59 -0700   98603 execution.flow     INFO     Node validate_inputs completes.
2024-08-30 07:43:59 -0700   98603 execution.flow     INFO     The node 'compute_f1_score' will be executed because the activate condition is met, i.e. '${validate_inputs.output}' is equal to 'True'.
2024-08-30 07:43:59 -0700   98603 execution.flow     INFO     Executing node compute_f1_score. node run id: fe59d120-f8e5-465c-a364-b1cc63e0d1bb_compute_f1_score_697cef0c-66fc-486f-9836-a623c059183e


[2024-08-30 07:43:59 -0700][flowinvoker][INFO] - Validating flow input with data {'answer': 'The femur is the largest bone in the human body.', 'ground_truth': 'The largest bone in the human body is the femur.'}
[2024-08-30 07:44:00 -0700][flowinvoker][INFO] - Execute flow with data {'answer': 'The femur is the largest bone in the human body.', 'ground_truth': 'The largest bone in the human body is the femur.'}


2024-08-30 07:44:00 -0700   98603 execution.flow     INFO     Node validate_inputs completes.
2024-08-30 07:44:00 -0700   98603 execution.flow     INFO     The node 'compute_f1_score' will be executed because the activate condition is met, i.e. '${validate_inputs.output}' is equal to 'True'.
2024-08-30 07:44:00 -0700   98603 execution.flow     INFO     Executing node compute_f1_score. node run id: 2e67ca74-c384-4c69-8354-56b61b135c42_compute_f1_score_d61789df-c86b-4d70-a1dd-d1a937aa73c5
2024-08-30 07:44:00 -0700   98603 execution.flow     INFO     Start executing nodes in thread pool mode.
2024-08-30 07:44:00 -0700   98603 execution.flow     INFO     Start to run 2 nodes with concurrency level 16.
2024-08-30 07:44:00 -0700   98603 execution.flow     INFO     Current thread is not main thread, skip signal handler registration in AsyncNodesScheduler.
2024-08-30 07:44:00 -0700   98603 execution.flow     INFO     Executing node validate_inputs. node run id: a493f61c-7f1f-4c05-9cde-21126c55

[2024-08-30 07:44:00 -0700][flowinvoker][INFO] - Validating flow input with data {'answer': 'Thomas Edison invented the light bulb.', 'ground_truth': 'Thomas Edison invented the light bulb.'}
[2024-08-30 07:44:00 -0700][flowinvoker][INFO] - Execute flow with data {'answer': 'Thomas Edison invented the light bulb.', 'ground_truth': 'Thomas Edison invented the light bulb.'}


2024-08-30 07:44:00 -0700   98603 execution.flow     INFO     Node compute_f1_score completes.
2024-08-30 07:44:00 -0700   98603 execution.flow     INFO     Node validate_inputs completes.
2024-08-30 07:44:00 -0700   98603 execution.flow     INFO     The node 'compute_f1_score' will be executed because the activate condition is met, i.e. '${validate_inputs.output}' is equal to 'True'.
2024-08-30 07:44:00 -0700   98603 execution.flow     INFO     Executing node compute_f1_score. node run id: a493f61c-7f1f-4c05-9cde-21126c55a2d8_compute_f1_score_06c79109-d7c9-45b7-834c-05a76eb9ac45
2024-08-30 07:44:00 -0700   98603 execution.flow     INFO     Start executing nodes in thread pool mode.
2024-08-30 07:44:00 -0700   98603 execution.flow     INFO     Start to run 2 nodes with concurrency level 16.
2024-08-30 07:44:00 -0700   98603 execution.flow     INFO     Current thread is not main thread, skip signal handler registration in AsyncNodesScheduler.
2024-08-30 07:44:00 -0700   98603 execution.

[2024-08-30 07:44:00 -0700][flowinvoker][INFO] - Validating flow input with data {'answer': 'The Amazon River is the largest river in South America.', 'ground_truth': 'The largest river in South America is the Amazon River.'}
[2024-08-30 07:44:00 -0700][flowinvoker][INFO] - Execute flow with data {'answer': 'The Amazon River is the largest river in South America.', 'ground_truth': 'The largest river in South America is the Amazon River.'}


2024-08-30 07:44:00 -0700   98603 execution.flow     INFO     Node compute_f1_score completes.
2024-08-30 07:44:00 -0700   98603 execution.flow     INFO     Start executing nodes in thread pool mode.
2024-08-30 07:44:00 -0700   98603 execution.flow     INFO     Start to run 2 nodes with concurrency level 16.
2024-08-30 07:44:00 -0700   98603 execution.flow     INFO     Current thread is not main thread, skip signal handler registration in AsyncNodesScheduler.
2024-08-30 07:44:00 -0700   98603 execution.flow     INFO     Executing node validate_inputs. node run id: 0cc7fe84-8a21-4972-a011-7d8e73628471_validate_inputs_d1af8426-df0c-410d-a496-51ebf719e6ef


[2024-08-30 07:44:00 -0700][flowinvoker][INFO] - Validating flow input with data {'answer': 'Mandarin Chinese is the most spoken language in the world.', 'ground_truth': 'The most spoken language in the world is Mandarin Chinese.'}
[2024-08-30 07:44:00 -0700][flowinvoker][INFO] - Execute flow with data {'answer': 'Mandarin Chinese is the most spoken language in the world.', 'ground_truth': 'The most spoken language in the world is Mandarin Chinese.'}
[2024-08-30 07:44:00 -0700][flowinvoker][INFO] - Validating flow input with data {'answer': 'Madrid is the capital of Spain.', 'ground_truth': 'The capital of Spain is Madrid.'}
[2024-08-30 07:44:00 -0700][flowinvoker][INFO] - Execute flow with data {'answer': 'Madrid is the capital of Spain.', 'ground_truth': 'The capital of Spain is Madrid.'}


2024-08-30 07:44:00 -0700   98603 execution.flow     INFO     Start executing nodes in thread pool mode.
2024-08-30 07:44:00 -0700   98603 execution.flow     INFO     Start to run 2 nodes with concurrency level 16.
2024-08-30 07:44:00 -0700   98603 execution.flow     INFO     Current thread is not main thread, skip signal handler registration in AsyncNodesScheduler.
2024-08-30 07:44:00 -0700   98603 execution.flow     INFO     Executing node validate_inputs. node run id: 6e894777-99b8-4aac-abe2-755faa3eb117_validate_inputs_61fffa5c-fcbc-483e-aaf5-78039840080d
2024-08-30 07:44:00 -0700   98603 execution.flow     INFO     Start executing nodes in thread pool mode.
2024-08-30 07:44:00 -0700   98603 execution.flow     INFO     Start to run 2 nodes with concurrency level 16.
2024-08-30 07:44:00 -0700   98603 execution.flow     INFO     Current thread is not main thread, skip signal handler registration in AsyncNodesScheduler.

[2024-08-30 07:44:00 -0700][flowinvoker][INFO] - Validating flow input with data {'answer': 'The skin is the largest organ in the human body.', 'ground_truth': 'The largest organ in the human body is the skin.'}
[2024-08-30 07:44:00 -0700][flowinvoker][INFO] - Execute flow with data {'answer': 'The skin is the largest organ in the human body.', 'ground_truth': 'The largest organ in the human body is the skin.'}


2024-08-30 07:44:00 -0700   98603 execution.flow     INFO     Start executing nodes in thread pool mode.
2024-08-30 07:44:00 -0700   98603 execution.flow     INFO     Start to run 2 nodes with concurrency level 16.
2024-08-30 07:44:00 -0700   98603 execution.flow     INFO     Current thread is not main thread, skip signal handler registration in AsyncNodesScheduler.
2024-08-30 07:44:00 -0700   98603 execution.flow     INFO     Executing node validate_inputs. node run id: aa80540b-c858-452a-a19a-cc44288e0931_validate_inputs_50a4fbb6-551c-489f-bb42-eab522ae201b
2024-08-30 07:44:00 -0700   98603 execution.flow     INFO     Node validate_inputs completes.
2024-08-30 07:44:00 -0700   98603 execution.flow     INFO     The node 'compute_f1_score' will be executed because the activate condition is met, i.e. '${validate_inputs.output}' is equal to 'True'.
2024-08-30 07:44:00 -0700   98603 execution.flow     INFO     Executing node compute_f1_score. node run id: aa80540b-c858-452a-a19a-cc44288e0

2024/08/30 07:44:14 INFO mlflow.tracking._tracking_service.client: 🏃 View run loving_house_fl2hx8ct at: https://eastus.api.azureml.ms/mlflow/v2.0/subscriptions/28d2df62-e322-4b25-b581-c43b94bd2607/resourceGroups/uhg-advassist-slm-eval/providers/Microsoft.MachineLearningServices/workspaces/slm-evaluation/#/experiments/d8f2318e-fd83-488e-bd9b-deca311699d7/runs/07f8788f-634a-4996-9082-276fcda912ec.
2024/08/30 07:44:14 INFO mlflow.tracking._tracking_service.client: 🧪 View experiment at: https://eastus.api.azureml.ms/mlflow/v2.0/subscriptions/28d2df62-e322-4b25-b581-c43b94bd2607/resourceGroups/uhg-advassist-slm-eval/providers/Microsoft.MachineLearningServices/workspaces/slm-evaluation/#/experiments/d8f2318e-fd83-488e-bd9b-deca311699d7.
2024-08-30 07:44:21,807 - micro - MainProcess - INFO     Quality evaluation completed successfully. (gpt_evals.py:run_chat_quality:120)
2024-08-30 07:44:21,808 - micro - MainProcess - INFO     See your results in the studio for more detailed information: http


In [6]:
pprint(metrics_quality)

Pretty printing has been turned OFF
Pretty printing has been turned ON


In [7]:
quality_evals.plot_metrics(metrics_quality)

The `azure_ai_studio_url` variable holds the link to this evaluation run in Azure AI Studio, where the flow and evaluation results are sent automatically for remote tracking. This is particularly useful in production environments, where monitoring the performance and outcomes of AI models is crucial. In Azure AI Studio you can analyze, visualize, and compare model performance over time, which supports the continuous improvement of your AI solutions.

In [8]:
# Tracking the evaluation results - Remote 
azure_ai_studio_url

'https://ai.azure.com/build/evaluation/07f8788f-634a-4996-9082-276fcda912ec?wsid=/subscriptions/28d2df62-e322-4b25-b581-c43b94bd2607/resourceGroups/uhg-advassist-slm-eval/providers/Microsoft.MachineLearningServices/workspaces/slm-evaluation'

![Azure AI Studio](my_utils/images/AzureAIstudio.png)

### 3. Evaluating RAI (Responsible AI)

**📊 What metrics are we evaluating?**

- **Self Harm Defect Rate**: Measures the frequency of self-harm content in the generated text.
  - **Range**: 0 (none) to 1 (high frequency).

- **Sexual Defect Rate**: Assesses the occurrence of inappropriate sexual content.
  - **Range**: 0 (none) to 1 (high occurrence).

- **Hate Unfairness Defect Rate**: Evaluates the presence of hate speech or unfair content towards individuals or groups.
  - **Range**: 0 (none) to 1 (high presence).

- **Violence Defect Rate**: Measures the level of violent content in the outputs.
  - **Range**: 0 (none) to 1 (high level).
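
To make the 0-to-1 range concrete, here is a minimal sketch of how a defect rate can be computed from per-response severity scores. The 0-7 severity scale, the threshold of 4, and the `defect_rate` helper are assumptions chosen for illustration; they are not taken from this notebook or guaranteed to match the evaluator's internals.

```python
from typing import Iterable


def defect_rate(severity_scores: Iterable[float], threshold: float = 4.0) -> float:
    """Fraction of responses whose severity score is at or above the threshold.

    Both the severity scale and the default threshold are illustrative assumptions.
    """
    scores = list(severity_scores)
    if not scores:
        # No responses evaluated: report no defects rather than dividing by zero.
        return 0.0
    defects = sum(1 for score in scores if score >= threshold)
    return defects / len(scores)


# Hypothetical severity scores for eight generated responses.
example_scores = [0, 1, 0, 5, 7, 2, 0, 0]
print(defect_rate(example_scores))  # 2 of 8 responses reach the threshold -> 0.25
```

A rate of 0 means no response reached the threshold, and a rate of 1 means every response did.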

In [9]:
## WIP - will fix in future iteration - use at your own risk
# metrics, azure_ai_studio_url = quality_evals.run_chat_content_safety(data_input=data_input_path)

In [10]:
## WIP - will fix in future iteration - use at your own risk
# quality_evals.plot_metrics(metrics)

## Customizing the Validation to Fit Your Scenario

#### Scenario 1: Combining Built-in PromptFlow Evaluators in a Custom Evaluation for Contextual Accuracy in Q&A Matching

**Objective**: Assess the performance of our AI bot (LLM/SLM) in responding to user queries, focusing on the accuracy of responses and contextual understanding, with a predefined ground truth for comparison.

**Setup**:
- **Input**: User queries encompassing a wide range of topics and complexities.
- **AI Bot**: Our system tasked with providing responses to the queries.

**Evaluation Criteria**:
- **Contextual Understanding**: Evaluates the AI bot's ability to comprehend the context and intent behind each query.
- **Response Accuracy**: Measures how closely the AI bot's responses align with the expected answers based on the ground truth.

**Goal**: Determine the effectiveness of our AI bot in delivering contextually accurate and precise responses to user queries, highlighting areas for improvement.
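
As a rough illustration of the Response Accuracy criterion, the sketch below computes a token-overlap F1 score between an answer and its ground truth. It is a simplified stand-in written for this explanation, not the library's implementation; the `normalize` and `token_f1` helpers and their normalization rules (lowercasing, dropping punctuation and English articles) are assumptions.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase, strip punctuation and English articles, then split into tokens."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def token_f1(answer: str, ground_truth: str) -> float:
    """Harmonic mean of token precision and recall between answer and ground truth."""
    pred_tokens = normalize(answer)
    gold_tokens = normalize(ground_truth)
    # Multiset intersection counts each shared token at most as often as it appears in both.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


same_meaning = token_f1("Paris is the capital of France.", "The capital of France is Paris.")
wrong_fact = token_f1(
    "The Great Pyramid of Giza was built by Alexander the Great.",
    "The Great Pyramid of Giza was built by the Egyptians.",
)
print(same_meaning, round(wrong_fact, 3))
```

In this sketch the reordered but correct answer scores 1.0, while the factually wrong answer still scores about 0.82 because most of its tokens overlap. That gap is one reason to pair a lexical metric with a model-graded similarity check.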

In [2]:
from promptflow.evals.evaluators import (
    RelevanceEvaluator, F1ScoreEvaluator, GroundednessEvaluator, ChatEvaluator,
    ViolenceEvaluator, SexualEvaluator, SelfHarmEvaluator, HateUnfairnessEvaluator,
    CoherenceEvaluator, FluencyEvaluator, SimilarityEvaluator, QAEvaluator,
    ContentSafetyEvaluator, ContentSafetyChatEvaluator,
)

In [10]:
import os
from promptflow.core import AzureOpenAIModelConfiguration

# Initialize Azure OpenAI Connection with your environment variables
model_config = AzureOpenAIModelConfiguration(
    azure_endpoint=os.environ.get("AZURE_AOAI_ENDPOINT"),
    api_key=os.environ.get("AZURE_AOAI_API_KEY"),
    azure_deployment=os.environ.get("AZURE_AOAI_COMPLETION_MODEL_DEPLOYMENT_ID"),
    api_version=os.environ.get("AZURE_AOAI_DEPLOYMENT_VERSION"),
)

In [11]:
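# Token-overlap F1 between answer and ground truth (takes no model configuration)
qa_eval = F1ScoreEvaluator()
# Model-graded similarity, using the Azure OpenAI configuration defined earlier
context_similarity = SimilarityEvaluator(model_config)
# Azure AI Studio project that receives the evaluation results for remote tracking
azure_ai_project = {
    "subscription_id": os.environ.get("AZURE_AI_STUDIO_SUBSCRIPTION_ID"),
    "resource_group_name": os.environ.get("AZURE_AI_STUDIO_RESOURCE_GROUP_NAME"),
    "project_name": os.environ.get("AZURE_AI_STUDIO_PROJECT_NAME"),
}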

In [12]:
from promptflow.evals.evaluate import evaluate

In [13]:
data_input_path = os.path.join(TARGET_DIRECTORY, "my_utils", "data", "evaluations", "jsonl", "F1ScoreEvaluator.jsonl")

result = evaluate(
    data=data_input_path,
    evaluators={
        "qa_eval": qa_eval,
        "context_similarity": context_similarity
    },
    # column mapping
    evaluator_config={
        "qa_eval": {
            "answer": "${data.answer}",
            "ground_truth": "${data.ground_truth}",
        },
        "context_similarity": {
            "question": "${data.question}",
            "answer": "${data.answer}",
            "ground_truth": "${data.ground_truth}",
        }
    },
    azure_ai_project=azure_ai_project
)
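
Once the run finishes, the returned object can be inspected locally. The sketch below assumes the result is a dictionary with a `"metrics"` entry (aggregates) and a `"rows"` entry (per-example records), and that per-row scores appear under a key such as `"outputs.qa_eval.f1_score"`. These key names and the `summarize_result` helper are assumptions for illustration, so check them against the object you actually get back. The example runs on a small hand-built dictionary rather than real output.

```python
from typing import Any, Mapping


def summarize_result(
    result: Mapping[str, Any],
    score_key: str = "outputs.qa_eval.f1_score",  # assumed per-row column name
    low_score: float = 0.5,
) -> dict:
    """Return the aggregate metrics plus the rows whose score falls below `low_score`."""
    metrics = dict(result.get("metrics", {}))
    rows = result.get("rows", [])
    # Rows without the score key are treated as not flagged.
    flagged = [row for row in rows if row.get(score_key, 1.0) < low_score]
    return {"metrics": metrics, "flagged": flagged}


# Hand-built stand-in for an evaluation result (hypothetical values).
fake_result = {
    "metrics": {"qa_eval.f1_score": 0.6},
    "rows": [
        {"inputs.answer": "Paris is the capital of France.", "outputs.qa_eval.f1_score": 1.0},
        {"inputs.answer": "An unrelated reply.", "outputs.qa_eval.f1_score": 0.2},
    ],
}
summary = summarize_result(fake_result)
print(summary["metrics"])
print(len(summary["flagged"]))  # one row falls below the 0.5 cut-off
```

Flagging low-scoring rows this way gives a short list of examples to review by hand before looking at the full run in the studio.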

[2024-08-30 07:52:10 -0700][flowinvoker][INFO] - Getting connections from pf client with provider from args: local...
[2024-08-30 07:52:10 -0700][flowinvoker][INFO] - Getting connections from pf client with provider from args: local...
[2024-08-30 07:52:10 -0700][flowinvoker][INFO] - Getting connections from pf client with provider from args: local...
[2024-08-30 07:52:10 -0700][flowinvoker][INFO] - Getting connections from pf client with provider from args: local...
[2024-08-30 07:52:10 -0700][flowinvoker][INFO] - Getting connections from pf client with provider from args: local...
[2024-08-30 07:52:10 -0700][flowinvoker][INFO] - Getting connections from pf client with provider from args: local...
[2024-08-30 07:52:10 -0700][flowinvoker][INFO] - Getting connections from pf client with provider from args: local...
[2024-08-30 07:52:10 -0700][flowinvoker][INFO] - Getting connections from pf client with provider from args: local...

2024-08-30 07:52:10 -0700    2147 execution.flow     INFO     Start executing nodes in thread pool mode.


[2024-08-30 07:52:10 -0700][flowinvoker][INFO] - Validating flow input with data {'answer': 'Mount Everest is the tallest mountain in the world.', 'ground_truth': "The world's tallest mountain is Mount Everest."}


2024-08-30 07:52:10 -0700    2147 execution.flow     INFO     Start to run 2 nodes with concurrency level 16.


[2024-08-30 07:52:10 -0700][flowinvoker][INFO] - Validating flow input with data {'answer': 'Paris is the capital of France.', 'ground_truth': 'The capital of France is Paris.'}


2024-08-30 07:52:10 -0700    2147 execution.flow     INFO     Current thread is not main thread, skip signal handler registration in AsyncNodesScheduler.


[2024-08-30 07:52:10 -0700][flowinvoker][INFO] - Promptflow executor initiated successfully.


2024-08-30 07:52:10 -0700    2147 execution.flow     INFO     Executing node validate_inputs. node run id: 1443eeb8-3ebb-4eaa-8cad-a63b56b558c3_validate_inputs_486645e4-700b-4d42-b51c-48fe456d54a2


[2024-08-30 07:52:10 -0700][flowinvoker][INFO] - Execute flow with data {'answer': 'Mount Everest is the tallest mountain in the world.', 'ground_truth': "The world's tallest mountain is Mount Everest."}


2024-08-30 07:52:10 -0700    2147 execution.flow     INFO     Node validate_inputs completes.


[2024-08-30 07:52:10 -0700][flowinvoker][INFO] - Execute flow with data {'answer': 'Paris is the capital of France.', 'ground_truth': 'The capital of France is Paris.'}


2024-08-30 07:52:10 -0700    2147 execution.flow     INFO     Start executing nodes in thread pool mode.


[2024-08-30 07:52:10 -0700][flowinvoker][INFO] - Promptflow executor initiated successfully.


2024-08-30 07:52:10 -0700    2147 execution.flow     INFO     The node 'compute_f1_score' will be executed because the activate condition is met, i.e. '${validate_inputs.output}' is equal to 'True'.
2024-08-30 07:52:10 -0700    2147 execution.flow     INFO     Start executing nodes in thread pool mode.


[2024-08-30 07:52:10 -0700][flowinvoker][INFO] - Promptflow executor initiated successfully.


2024-08-30 07:52:10 -0700    2147 execution.flow     INFO     Start to run 2 nodes with concurrency level 16.
2024-08-30 07:52:10 -0700    2147 execution.flow     INFO     Executing node compute_f1_score. node run id: 1443eeb8-3ebb-4eaa-8cad-a63b56b558c3_compute_f1_score_28783b17-f2af-452e-8979-27bf69ee899c


[2024-08-30 07:52:10 -0700][flowinvoker][INFO] - Promptflow executor initiated successfully.


2024-08-30 07:52:10 -0700    2147 execution.flow     INFO     Start to run 2 nodes with concurrency level 16.


[2024-08-30 07:52:10 -0700][flowinvoker][INFO] - Promptflow executor initiated successfully.
[2024-08-30 07:52:10 -0700][flowinvoker][INFO] - Promptflow executor initiated successfully.


2024-08-30 07:52:10 -0700    2147 execution.flow     INFO     Current thread is not main thread, skip signal handler registration in AsyncNodesScheduler.
2024-08-30 07:52:10 -0700    2147 execution.flow     INFO     Node compute_f1_score completes.


[2024-08-30 07:52:10 -0700][flowinvoker][INFO] - Promptflow executor initiated successfully.
[2024-08-30 07:52:10 -0700][flowinvoker][INFO] - Promptflow executor initiated successfully.


2024-08-30 07:52:10 -0700    2147 execution.flow     INFO     Current thread is not main thread, skip signal handler registration in AsyncNodesScheduler.


[2024-08-30 07:52:10 -0700][flowinvoker][INFO] - Validating flow input with data {'answer': 'Leonardo da Vinci painted the Mona Lisa.', 'ground_truth': 'The Mona Lisa was painted by Leonardo da Vinci.'}
[2024-08-30 07:52:10 -0700][flowinvoker][INFO] - Execute flow with data {'answer': 'Leonardo da Vinci painted the Mona Lisa.', 'ground_truth': 'The Mona Lisa was painted by Leonardo da Vinci.'}


2024-08-30 07:52:10 -0700    2147 execution.flow     INFO     Executing node validate_inputs. node run id: fe626365-1a96-4f48-9c0a-c9ba8edd60e9_validate_inputs_a69fadcf-5924-49fb-bbda-0ae5b12d9f7c


[2024-08-30 07:52:10 -0700][flowinvoker][INFO] - Validating flow input with data {'answer': 'The speed of light is approximately 299,792,458 meters per second.', 'ground_truth': 'Light travels at a speed of 299,792,458 meters per second.'}


2024-08-30 07:52:10 -0700    2147 execution.flow     INFO     Executing node validate_inputs. node run id: 75302c27-2443-4da5-a4f9-d5df73ead5d0_validate_inputs_6e4c4855-9bf1-4bed-ab30-286eb6b75e6a


[2024-08-30 07:52:10 -0700][flowinvoker][INFO] - Execute flow with data {'answer': 'The speed of light is approximately 299,792,458 meters per second.', 'ground_truth': 'Light travels at a speed of 299,792,458 meters per second.'}


2024-08-30 07:52:10 -0700    2147 execution.flow     INFO     Start executing nodes in thread pool mode.


[2024-08-30 07:52:10 -0700][flowinvoker][INFO] - Validating flow input with data {'answer': "George Orwell is the author of '1984'.", 'ground_truth': "The author of '1984' is George Orwell."}


2024-08-30 07:52:10 -0700    2147 execution.flow     INFO     Node validate_inputs completes.


[2024-08-30 07:52:10 -0700][flowinvoker][INFO] - Validating flow input with data {'answer': 'H2O is the chemical symbol for water.', 'ground_truth': 'The chemical symbol for water is H2O.'}


2024-08-30 07:52:10 -0700    2147 execution.flow     INFO     Node validate_inputs completes.
2024-08-30 07:52:10 -0700    2147 execution.flow     INFO     Start executing nodes in thread pool mode.


[2024-08-30 07:52:10 -0700][flowinvoker][INFO] - Validating flow input with data {'answer': 'Tokyo is the capital of Japan.', 'ground_truth': "Japan's capital is Tokyo."}


2024-08-30 07:52:10 -0700    2147 execution.flow     INFO     Start to run 2 nodes with concurrency level 16.


[2024-08-30 07:52:10 -0700][flowinvoker][INFO] - Validating flow input with data {'answer': 'The Pacific Ocean is the largest ocean on Earth.', 'ground_truth': 'The largest ocean on Earth is the Pacific Ocean.'}


2024-08-30 07:52:10 -0700    2147 execution.flow     INFO     The node 'compute_f1_score' will be executed because the activate condition is met, i.e. '${validate_inputs.output}' is equal to 'True'.


[2024-08-30 07:52:10 -0700][flowinvoker][INFO] - Validating flow input with data {'answer': 'The Great Pyramid of Giza was built by Alexander the Great.', 'ground_truth': 'The Great Pyramid of Giza was built by the Egyptians.'}


2024-08-30 07:52:10 -0700    2147 execution.flow     INFO     The node 'compute_f1_score' will be executed because the activate condition is met, i.e. '${validate_inputs.output}' is equal to 'True'.


[2024-08-30 07:52:10 -0700][flowinvoker][INFO] - Validating flow input with data {'answer': 'Albert Einstein developed the theory of relativity.', 'ground_truth': 'The theory of relativity was developed by Albert Einstein.'}


2024-08-30 07:52:10 -0700    2147 execution.flow     INFO     Start to run 2 nodes with concurrency level 16.


[2024-08-30 07:52:10 -0700][flowinvoker][INFO] - Execute flow with data {'answer': "George Orwell is the author of '1984'.", 'ground_truth': "The author of '1984' is George Orwell."}


2024-08-30 07:52:10 -0700    2147 execution.flow     INFO     Current thread is not main thread, skip signal handler registration in AsyncNodesScheduler.


[2024-08-30 07:52:10 -0700][flowinvoker][INFO] - Execute flow with data {'answer': 'H2O is the chemical symbol for water.', 'ground_truth': 'The chemical symbol for water is H2O.'}


2024-08-30 07:52:10 -0700    2147 execution.flow     INFO     Executing node compute_f1_score. node run id: fe626365-1a96-4f48-9c0a-c9ba8edd60e9_compute_f1_score_6189cb8b-0aee-460e-9a7f-80b453a5e3e2


[2024-08-30 07:52:10 -0700][flowinvoker][INFO] - Execute flow with data {'answer': 'Tokyo is the capital of Japan.', 'ground_truth': "Japan's capital is Tokyo."}


2024-08-30 07:52:10 -0700    2147 execution.flow     INFO     Executing node compute_f1_score. node run id: 75302c27-2443-4da5-a4f9-d5df73ead5d0_compute_f1_score_9ef4572d-63e7-424f-b5d5-e3c04181f41a


[2024-08-30 07:52:10 -0700][flowinvoker][INFO] - Execute flow with data {'answer': 'The Pacific Ocean is the largest ocean on Earth.', 'ground_truth': 'The largest ocean on Earth is the Pacific Ocean.'}


2024-08-30 07:52:10 -0700    2147 execution.flow     INFO     Current thread is not main thread, skip signal handler registration in AsyncNodesScheduler.


[2024-08-30 07:52:10 -0700][flowinvoker][INFO] - Execute flow with data {'answer': 'The Great Pyramid of Giza was built by Alexander the Great.', 'ground_truth': 'The Great Pyramid of Giza was built by the Egyptians.'}


2024-08-30 07:52:10 -0700    2147 execution.flow     INFO     Start executing nodes in thread pool mode.


[2024-08-30 07:52:10 -0700][flowinvoker][INFO] - Execute flow with data {'answer': 'Albert Einstein developed the theory of relativity.', 'ground_truth': 'The theory of relativity was developed by Albert Einstein.'}


2024-08-30 07:52:10 -0700    2147 execution.flow     INFO     Executing node validate_inputs. node run id: de4d7a49-956c-4175-af76-8ad3c5b9c614_validate_inputs_9532ea04-92b0-4e79-9d6b-51e8b5bfbd81
2024-08-30 07:52:10 -0700    2147 execution.flow     INFO     Start executing nodes in thread pool mode.
2024-08-30 07:52:10 -0700    2147 execution.flow     INFO     Node compute_f1_score completes.
2024-08-30 07:52:10 -0700    2147 execution.flow     INFO     Start executing nodes in thread pool mode.
2024-08-30 07:52:10 -0700    2147 execution.flow     INFO     Node compute_f1_score completes.
2024-08-30 07:52:10 -0700    2147 execution.flow     INFO     Start executing nodes in thread pool mode.
2024-08-30 07:52:10 -0700    2147 execution.flow     INFO     Start to run 2 nodes with concurrency level 16.
2024-08-30 07:52:10 -0700    2147 execution.flow     INFO     Start executing nodes in thread pool mode.
2024-08-30 07:52:10 -0700    2147 execution.flow     INFO     Start to run 2 nodes 

[2024-08-30 07:52:11 -0700][flowinvoker][INFO] - Validating flow input with data {'answer': 'The human body has four senses.', 'ground_truth': 'The human body has five primary senses.'}
[2024-08-30 07:52:11 -0700][flowinvoker][INFO] - Execute flow with data {'answer': 'The human body has four senses.', 'ground_truth': 'The human body has five primary senses.'}


2024-08-30 07:52:11 -0700    2147 execution.flow     INFO     Start executing nodes in thread pool mode.
2024-08-30 07:52:11 -0700    2147 execution.flow     INFO     Start to run 2 nodes with concurrency level 16.
2024-08-30 07:52:11 -0700    2147 execution.flow     INFO     Current thread is not main thread, skip signal handler registration in AsyncNodesScheduler.
2024-08-30 07:52:11 -0700    2147 execution.flow     INFO     Executing node validate_inputs. node run id: 314af56e-2e47-45b6-9d0c-0fd86f6a4b27_validate_inputs_18b627a0-92f9-4fba-9d0a-d26ac6ce4722
2024-08-30 07:52:11 -0700    2147 execution.flow     INFO     Node validate_inputs completes.


[2024-08-30 07:52:11 -0700][flowinvoker][INFO] - Validating flow input with data {'answer': 'The Amazon is the longest river in the world.', 'ground_truth': 'The Nile is often cited as the longest river in the world, but some sources claim the Amazon is longer.'}


2024-08-30 07:52:11 -0700    2147 execution.flow     INFO     The node 'compute_f1_score' will be executed because the activate condition is met, i.e. '${validate_inputs.output}' is equal to 'True'.


[2024-08-30 07:52:11 -0700][flowinvoker][INFO] - Validating flow input with data {'answer': 'The internet was invented in the 1960s.', 'ground_truth': 'The foundational technology of the internet was developed in the late 1960s.'}


2024-08-30 07:52:11 -0700    2147 execution.flow     INFO     Executing node compute_f1_score. node run id: 314af56e-2e47-45b6-9d0c-0fd86f6a4b27_compute_f1_score_6fa1fc73-a5e9-425e-9048-fa92fe2da311


[2024-08-30 07:52:11 -0700][flowinvoker][INFO] - Execute flow with data {'answer': 'The Amazon is the longest river in the world.', 'ground_truth': 'The Nile is often cited as the longest river in the world, but some sources claim the Amazon is longer.'}


2024-08-30 07:52:11 -0700    2147 execution.flow     INFO     Node compute_f1_score completes.


[2024-08-30 07:52:11 -0700][flowinvoker][INFO] - Execute flow with data {'answer': 'The internet was invented in the 1960s.', 'ground_truth': 'The foundational technology of the internet was developed in the late 1960s.'}
[2024-08-30 07:52:11 -0700][flowinvoker][INFO] - Validating flow input with data {'answer': 'The heart is on the right side of the human body.', 'ground_truth': 'The heart is located slightly to the left side of the chest in the human body.'}


2024-08-30 07:52:11 -0700    2147 execution.flow     INFO     Start executing nodes in thread pool mode.
2024-08-30 07:52:11 -0700    2147 execution.flow     INFO     Start executing nodes in thread pool mode.


[2024-08-30 07:52:11 -0700][flowinvoker][INFO] - Validating flow input with data {'answer': 'The sun revolves around the Earth.', 'ground_truth': 'The Earth revolves around the sun.'}


2024-08-30 07:52:11 -0700    2147 execution.flow     INFO     Start to run 2 nodes with concurrency level 16.


[2024-08-30 07:52:11 -0700][flowinvoker][INFO] - Execute flow with data {'answer': 'The heart is on the right side of the human body.', 'ground_truth': 'The heart is located slightly to the left side of the chest in the human body.'}


2024-08-30 07:52:11 -0700    2147 execution.flow     INFO     Start to run 2 nodes with concurrency level 16.


[2024-08-30 07:52:11 -0700][flowinvoker][INFO] - Execute flow with data {'answer': 'The sun revolves around the Earth.', 'ground_truth': 'The Earth revolves around the sun.'}


2024-08-30 07:52:11 -0700    2147 execution.flow     INFO     Current thread is not main thread, skip signal handler registration in AsyncNodesScheduler.
2024-08-30 07:52:11 -0700    2147 execution.flow     INFO     Current thread is not main thread, skip signal handler registration in AsyncNodesScheduler.
2024-08-30 07:52:11 -0700    2147 execution.flow     INFO     Start executing nodes in thread pool mode.
2024-08-30 07:52:11 -0700    2147 execution.flow     INFO     Start executing nodes in thread pool mode.
2024-08-30 07:52:11 -0700    2147 execution.flow     INFO     Executing node validate_inputs. node run id: 07c89df7-76c5-4cd2-bed6-c8c69bc49919_validate_inputs_a8dc9d27-9b08-47a6-a87a-2b2853abbf8c
2024-08-30 07:52:11 -0700    2147 execution.flow     INFO     Executing node validate_inputs. node run id: cd5e7dd0-1de2-4bde-8bea-2451102726ca_validate_inputs_c7adfd88-7638-4c68-a559-bc1f18d89ebe
2024-08-30 07:52:11 -0700    2147 execution.flow     INFO     Start to run 2 nodes with 

[2024-08-30 07:52:11 -0700][flowinvoker][INFO] - Validating flow input with data {'answer': 'Sharks are mammals.', 'ground_truth': 'Sharks are a group of fish characterized by a cartilaginous skeleton.'}
[2024-08-30 07:52:11 -0700][flowinvoker][INFO] - Execute flow with data {'answer': 'Sharks are mammals.', 'ground_truth': 'Sharks are a group of fish characterized by a cartilaginous skeleton.'}


2024-08-30 07:52:11 -0700    2147 execution.flow     INFO     Executing node compute_f1_score. node run id: cd5e7dd0-1de2-4bde-8bea-2451102726ca_compute_f1_score_271c9621-9622-47da-bdf5-8e5a2fd54f60
2024-08-30 07:52:11 -0700    2147 execution.flow     INFO     Node validate_inputs completes.
2024-08-30 07:52:11 -0700    2147 execution.flow     INFO     Node validate_inputs completes.
2024-08-30 07:52:11 -0700    2147 execution.flow     INFO     Node compute_f1_score completes.
2024-08-30 07:52:11 -0700    2147 execution.flow     INFO     Node compute_f1_score completes.


[2024-08-30 07:52:11 -0700][flowinvoker][INFO] - Validating flow input with data {'answer': 'The capital of Canada is Toronto.', 'ground_truth': 'The capital of Canada is Ottawa.'}
[2024-08-30 07:52:11 -0700][flowinvoker][INFO] - Execute flow with data {'answer': 'The capital of Canada is Toronto.', 'ground_truth': 'The capital of Canada is Ottawa.'}


2024-08-30 07:52:11 -0700    2147 execution.flow     INFO     Start executing nodes in thread pool mode.


[2024-08-30 07:52:11 -0700][flowinvoker][INFO] - Validating flow input with data {'answer': 'The inventor of the telephone was Thomas Edison.', 'ground_truth': 'Alexander Graham Bell is credited with inventing the first practical telephone.'}


2024-08-30 07:52:11 -0700    2147 execution.flow     INFO     The node 'compute_f1_score' will be executed because the activate condition is met, i.e. '${validate_inputs.output}' is equal to 'True'.


[2024-08-30 07:52:11 -0700][flowinvoker][INFO] - Execute flow with data {'answer': 'The inventor of the telephone was Thomas Edison.', 'ground_truth': 'Alexander Graham Bell is credited with inventing the first practical telephone.'}


2024-08-30 07:52:11 -0700    2147 execution.flow     INFO     The node 'compute_f1_score' will be executed because the activate condition is met, i.e. '${validate_inputs.output}' is equal to 'True'.
2024-08-30 07:52:11 -0700    2147 execution.flow     INFO     Start executing nodes in thread pool mode.
2024-08-30 07:52:11 -0700    2147 execution.flow     INFO     Start to run 2 nodes with concurrency level 16.
2024-08-30 07:52:11 -0700    2147 execution.flow     INFO     Executing node compute_f1_score. node run id: eb21e164-05a4-4d72-8355-a853cc4bb50d_compute_f1_score_81598d23-6ec5-4846-a4ef-56411b95babb
2024-08-30 07:52:11 -0700    2147 execution.flow     INFO     Start executing nodes in thread pool mode.
2024-08-30 07:52:11 -0700    2147 execution.flow     INFO     Executing node compute_f1_score. node run id: dd973991-6496-47ef-917d-07b2a6db9bb5_compute_f1_score_b45c83c0-a84a-48d6-8d3b-2aada8d89c81
2024-08-30 07:52:11 -0700    2147 execution.flow     INFO     Start to run 2 nodes 

[2024-08-30 07:52:11 -0700][flowinvoker][INFO] - Validating flow input with data {'answer': 'The chemical symbol for gold is Ag.', 'ground_truth': 'The chemical symbol for gold is Au.'}
[2024-08-30 07:52:11 -0700][flowinvoker][INFO] - Execute flow with data {'answer': 'The chemical symbol for gold is Ag.', 'ground_truth': 'The chemical symbol for gold is Au.'}


2024-08-30 07:52:11 -0700    2147 execution.flow     INFO     Node compute_f1_score completes.
2024-08-30 07:52:11 -0700    2147 execution.flow     INFO     Current thread is not main thread, skip signal handler registration in AsyncNodesScheduler.


[2024-08-30 07:52:11 -0700][flowinvoker][INFO] - Validating flow input with data {'answer': 'The largest desert in the world is the Sahara.', 'ground_truth': 'The largest desert in the world is Antarctica.'}
[2024-08-30 07:52:11 -0700][flowinvoker][INFO] - Execute flow with data {'answer': 'The largest desert in the world is the Sahara.', 'ground_truth': 'The largest desert in the world is Antarctica.'}


2024-08-30 07:52:11 -0700    2147 execution.flow     INFO     Executing node validate_inputs. node run id: 4ee59af5-4046-475c-8953-a27661f7a6f1_validate_inputs_7993172d-79bb-411c-bfeb-c333fab4709d
2024-08-30 07:52:11 -0700    2147 execution.flow     INFO     Current thread is not main thread, skip signal handler registration in AsyncNodesScheduler.
2024-08-30 07:52:11 -0700    2147 execution.flow     INFO     Start executing nodes in thread pool mode.
2024-08-30 07:52:11 -0700    2147 execution.flow     INFO     Executing node validate_inputs. node run id: d6985598-3ad4-4a8e-9044-150c19db4f70_validate_inputs_18f4d1ec-3407-4861-8f96-97d565b76df7
2024-08-30 07:52:11 -0700    2147 execution.flow     INFO     Executing node validate_inputs. node run id: c188f5ec-45fd-499a-bf89-96d2e8dec378_validate_inputs_0dd2faca-b2a4-42f4-a1f2-0cac02df4373
2024-08-30 07:52:11 -0700    2147 execution.flow     INFO     Start to run 2 nodes with concurrency level 16.
2024-08-30 07:52:11 -0700    2147 execut

[2024-08-30 07:52:11 -0700][flowinvoker][INFO] - Validating flow input with data {'answer': 'The first man to step on the moon was Lance Armstrong.', 'ground_truth': 'The first man to step on the moon was Neil Armstrong.'}
[2024-08-30 07:52:11 -0700][flowinvoker][INFO] - Execute flow with data {'answer': 'The first man to step on the moon was Lance Armstrong.', 'ground_truth': 'The first man to step on the moon was Neil Armstrong.'}


2024-08-30 07:52:11 -0700    2147 execution.flow     INFO     Node compute_f1_score completes.


[2024-08-30 07:52:11 -0700][flowinvoker][INFO] - Validating flow input with data {'answer': 'The capital of Egypt is Cairo.', 'ground_truth': 'The capital of Egypt is Cairo.'}
[2024-08-30 07:52:11 -0700][flowinvoker][INFO] - Execute flow with data {'answer': 'The capital of Egypt is Cairo.', 'ground_truth': 'The capital of Egypt is Cairo.'}


2024-08-30 07:52:11 -0700    2147 execution.flow     INFO     Node compute_f1_score completes.
2024-08-30 07:52:11 -0700    2147 execution.flow     INFO     Node compute_f1_score completes.


[2024-08-30 07:52:11 -0700][flowinvoker][INFO] - Validating flow input with data {'answer': 'Photosynthesis is performed by animals.', 'ground_truth': 'Photosynthesis is performed by plants.'}
[2024-08-30 07:52:11 -0700][flowinvoker][INFO] - Execute flow with data {'answer': 'Photosynthesis is performed by animals.', 'ground_truth': 'Photosynthesis is performed by plants.'}
[2024-08-30 07:52:11 -0700][flowinvoker][INFO] - Validating flow input with data {'answer': 'The smallest bone in the human body is the femur.', 'ground_truth': 'The smallest bone in the human body is the stapes bone in the ear.'}
[2024-08-30 07:52:11 -0700][flowinvoker][INFO] - Execute flow with data {'answer': 'The smallest bone in the human body is the femur.', 'ground_truth': 'The smallest bone in the human body is the stapes bone in the ear.'}


2024-08-30 07:52:11 -0700    2147 execution.flow     INFO     Node compute_f1_score completes.


[2024-08-30 07:52:11 -0700][flowinvoker][INFO] - Validating flow input with data {'answer': 'The Pacific Ocean is the warmest ocean.', 'ground_truth': 'The Indian Ocean is considered the warmest ocean.'}


2024-08-30 07:52:11 -0700    2147 execution.flow     INFO     Start executing nodes in thread pool mode.


[2024-08-30 07:52:11 -0700][flowinvoker][INFO] - Validating flow input with data {'answer': 'The light bulb was invented by Nikola Tesla.', 'ground_truth': 'The light bulb was invented by Thomas Edison.'}


2024-08-30 07:52:11 -0700    2147 execution.flow     INFO     Start to run 2 nodes with concurrency level 16.


[2024-08-30 07:52:11 -0700][flowinvoker][INFO] - Execute flow with data {'answer': 'The Pacific Ocean is the warmest ocean.', 'ground_truth': 'The Indian Ocean is considered the warmest ocean.'}


2024-08-30 07:52:11 -0700    2147 execution.flow     INFO     Current thread is not main thread, skip signal handler registration in AsyncNodesScheduler.


[2024-08-30 07:52:11 -0700][flowinvoker][INFO] - Execute flow with data {'answer': 'The light bulb was invented by Nikola Tesla.', 'ground_truth': 'The light bulb was invented by Thomas Edison.'}


2024-08-30 07:52:11 -0700    2147 execution.flow     INFO     Executing node validate_inputs. node run id: f5ae70dc-3740-4544-8359-2b5509aaf42c_validate_inputs_ebe8656d-a464-4867-97f8-2b6669dc8c13
2024-08-30 07:52:11 -0700    2147 execution.flow     INFO     Start executing nodes in thread pool mode.
2024-08-30 07:52:11 -0700    2147 execution.flow     INFO     Start to run 2 nodes with concurrency level 16.
2024-08-30 07:52:11 -0700    2147 execution.flow     INFO     Current thread is not main thread, skip signal handler registration in AsyncNodesScheduler.
2024-08-30 07:52:11 -0700    2147 execution.flow     INFO     Executing node validate_inputs. node run id: 554708bb-5486-4576-a826-1dae4657c2ca_validate_inputs_62bdfb69-4893-4e93-99eb-49a01d95ca72
2024-08-30 07:52:11 -0700    2147 execution.flow     INFO     Start executing nodes in thread pool mode.
2024-08-30 07:52:11 -0700    2147 execution.flow     INFO     Start to run 2 nodes with concurrency level 16.
2024-08-30 07:52:11 -0

[2024-08-30 07:52:11 -0700][flowinvoker][INFO] - Validating flow input with data {'answer': 'The first president of the United States was John Adams.', 'ground_truth': 'The first president of the United States was George Washington.'}


2024-08-30 07:52:11 -0700    2147 execution.flow     INFO     Executing node compute_f1_score. node run id: d7204fcf-aac8-460e-8ec8-672db88accb7_compute_f1_score_2e0b2190-15bc-4602-a113-110727655013


[2024-08-30 07:52:11 -0700][flowinvoker][INFO] - Execute flow with data {'answer': 'The first president of the United States was John Adams.', 'ground_truth': 'The first president of the United States was George Washington.'}


2024-08-30 07:52:11 -0700    2147 execution.flow     INFO     Executing node compute_f1_score. node run id: c8d45e41-10e7-4075-8567-42bc573089f9_compute_f1_score_b9424fa0-3c99-41a1-ae72-12348ca96af5
2024-08-30 07:52:11 -0700    2147 execution.flow     INFO     Executing node compute_f1_score. node run id: 554708bb-5486-4576-a826-1dae4657c2ca_compute_f1_score_816656ae-34e7-4aef-b1e1-3e3f57fd6ace
2024-08-30 07:52:11 -0700    2147 execution.flow     INFO     Executing node compute_f1_score. node run id: aa38a198-098f-4df5-81ea-22b2a732632c_compute_f1_score_fd7e8a6d-01e2-47ca-bf7e-831207a246cc
2024-08-30 07:52:11 -0700    2147 execution.flow     INFO     Node compute_f1_score completes.
2024-08-30 07:52:11 -0700    2147 execution.flow     INFO     Node compute_f1_score completes.


[2024-08-30 07:52:11 -0700][flowinvoker][INFO] - Validating flow input with data {'answer': 'The formula for water is CO2.', 'ground_truth': 'The formula for water is H2O.'}


2024-08-30 07:52:11 -0700    2147 execution.flow     INFO     Node compute_f1_score completes.


[2024-08-30 07:52:11 -0700][flowinvoker][INFO] - Validating flow input with data {'answer': 'The tallest building in the world is the Empire State Building.', 'ground_truth': 'The tallest building in the world is the Burj Khalifa.'}


2024-08-30 07:52:11 -0700    2147 execution.flow     INFO     Start executing nodes in thread pool mode.
2024-08-30 07:52:11 -0700    2147 execution.flow     INFO     Node compute_f1_score completes.


[2024-08-30 07:52:11 -0700][flowinvoker][INFO] - Execute flow with data {'answer': 'The formula for water is CO2.', 'ground_truth': 'The formula for water is H2O.'}
[2024-08-30 07:52:11 -0700][flowinvoker][INFO] - Execute flow with data {'answer': 'The tallest building in the world is the Empire State Building.', 'ground_truth': 'The tallest building in the world is the Burj Khalifa.'}


2024-08-30 07:52:11 -0700    2147 execution.flow     INFO     Node compute_f1_score completes.
2024-08-30 07:52:11 -0700    2147 execution.flow     INFO     Start to run 2 nodes with concurrency level 16.
2024-08-30 07:52:11 -0700    2147 execution.flow     INFO     Start executing nodes in thread pool mode.
2024-08-30 07:52:11 -0700    2147 execution.flow     INFO     Start executing nodes in thread pool mode.
2024-08-30 07:52:11 -0700    2147 execution.flow     INFO     Current thread is not main thread, skip signal handler registration in AsyncNodesScheduler.
2024-08-30 07:52:11 -0700    2147 execution.flow     INFO     Start to run 2 nodes with concurrency level 16.
2024-08-30 07:52:11 -0700    2147 execution.flow     INFO     Start to run 2 nodes with concurrency level 16.
2024-08-30 07:52:11 -0700    2147 execution.flow     INFO     Executing node validate_inputs. node run id: 2db9b2fd-aa42-4a23-8b3c-4ef7dc7d7f3c_validate_inputs_bfc9ff9e-f1be-4d01-936f-aecb46b598ca
2024-08-30 07:

2024/08/30 07:52:29 INFO mlflow.tracking._tracking_service.client: 🏃 View run eager_eagle_3yg6bhvl at: https://eastus.api.azureml.ms/mlflow/v2.0/subscriptions/28d2df62-e322-4b25-b581-c43b94bd2607/resourceGroups/uhg-advassist-slm-eval/providers/Microsoft.MachineLearningServices/workspaces/slm-evaluation/#/experiments/d8f2318e-fd83-488e-bd9b-deca311699d7/runs/c2315a8d-1930-46f1-adaf-6690dbd358c3.
2024/08/30 07:52:29 INFO mlflow.tracking._tracking_service.client: 🧪 View experiment at: https://eastus.api.azureml.ms/mlflow/v2.0/subscriptions/28d2df62-e322-4b25-b581-c43b94bd2607/resourceGroups/uhg-advassist-slm-eval/providers/Microsoft.MachineLearningServices/workspaces/slm-evaluation/#/experiments/d8f2318e-fd83-488e-bd9b-deca311699d7.



In [14]:
pprint(result)

{'metrics': {'context_similarity.gpt_similarity': 2.6923076923076925,
             'qa_eval.f1_score': 0.7724547511312218},
 'rows': [{'inputs.answer': 'Paris is the capital of France.',
           'inputs.ground_truth': 'The capital of France is Paris.',
           'inputs.question': 'What is the capital of France?',
           'line_number': 0,
           'outputs.context_similarity.gpt_similarity': 5.0,
           'outputs.qa_eval.f1_score': 1.0},
          {'inputs.answer': 'Albert Einstein developed the theory of '
                            'relativity.',
           'inputs.ground_truth': 'The theory of relativity was developed by '
                                  'Albert Einstein.',
           'inputs.question': 'Who developed the theory of relativity?',
           'line_number': 1,
           'outputs.context_similarity.gpt_similarity': 5.0,
           'outputs.qa_eval.f1_score': 0.8571428571428571},
          {'inputs.answer': 'The speed of light is approximately 299,792,45

#### Scenario 2: Integrating a Custom Evaluator with Built-in PromptFlow Evaluators for Enhanced Contextual Accuracy in Q&A Matching

**Objective**: To enhance the evaluation of our AI bot's (LLM/SLM) performance in responding to user queries, we have developed a custom evaluation framework. This framework focuses on the accuracy of responses and their contextual understanding, utilizing a predefined ground truth for comparison. It is designed to complement and extend the built-in evaluation methods provided by PromptFlow.

**Custom Evaluation Framework**:
- **Implementation**: We have implemented a custom evaluation module, `SemanticSimilarityEvaluator`, leveraging the `transformers` library to utilize pre-trained models for semantic similarity assessments.
- **Functionality**: This module calculates the semantic similarity between the AI bot's response and the ground truth. It uses embeddings generated by a pre-trained model (`bert-base-uncased`) and computes cosine similarity to quantify semantic closeness.
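
The similarity computation described above can be sketched as follows. This is a minimal illustration, not the actual `SemanticSimilarityEvaluator` from `src.quality.custom.custom_similarity`: the class name `SketchSimilarityEvaluator`, the `embed_fn` parameter, the `toy_embed` helper, and the `semantic_similarity` output key are all hypothetical. To keep the sketch runnable without downloading a model, `toy_embed` substitutes hashed bag-of-words counts for the `bert-base-uncased` embeddings; only the cosine-similarity step mirrors the description.

```python
import math
import re
import zlib
from typing import Callable, Dict, List, Sequence

Vector = List[float]


def toy_embed(text: str, dim: int = 64) -> Vector:
    """Hypothetical stand-in for a BERT encoder: hashed bag-of-words counts."""
    vec = [0.0] * dim
    for token in re.findall(r"\w+", text.lower()):
        # crc32 is stable across processes, unlike the built-in hash() for str.
        vec[zlib.crc32(token.encode("utf-8")) % dim] += 1.0
    return vec


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class SketchSimilarityEvaluator:
    """Callable evaluator: embeds both texts, then compares them by cosine similarity."""

    def __init__(self, embed_fn: Callable[[str], Vector] = toy_embed):
        # Assumption: a real implementation would embed with a pre-trained model here.
        self.embed_fn = embed_fn

    def __call__(self, *, response: str, ground_truth: str) -> Dict[str, float]:
        score = cosine_similarity(
            self.embed_fn(response), self.embed_fn(ground_truth)
        )
        return {"semantic_similarity": score}


evaluator = SketchSimilarityEvaluator()
print(
    evaluator(
        response="The capital of Egypt is Cairo.",
        ground_truth="The capital of Egypt is Cairo.",
    )
)
```

Keeping the embedding step behind `embed_fn` means the scoring logic can be checked with a cheap stand-in before swapping in a real encoder. The keyword arguments `response` and `ground_truth` match the column mapping used for `semantic_similarity` in this notebook.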

**Integration with PromptFlow**:
- The custom evaluator is passed to `evaluate` alongside PromptFlow's built-in evaluators, so a single run scores each response for both contextual understanding and accuracy.
- **Input**: User queries across various topics and complexities.
- **AI Bot**: Our system, tasked with generating responses.
- **Evaluation Criteria**:
  - **Contextual Understanding**: Assesses the AI bot's grasp of the query's context and intent.
  - **Response Accuracy**: Measures the alignment of the AI bot's responses with the expected answers, enriched by our custom semantic similarity evaluation.

**Goal**: To determine how well our AI bot provides contextually accurate responses, using both our custom evaluation and PromptFlow's built-in methods to highlight areas for improvement and cover a broader set of evaluation metrics.
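
One way to find areas for improvement is to scan the per-row scores for responses that fall below a chosen cutoff. The sketch below assumes the result dictionary has the `metrics` and `rows` layout printed by `pprint(result)` in this notebook. The `rows_below` helper and the `0.9` threshold are hypothetical, and `sample_result` copies only the first two rows of that output rather than a full run.

```python
from typing import Any, Dict, List

# Two rows copied from the printed result; a real run contains many more.
sample_result: Dict[str, Any] = {
    "metrics": {
        "context_similarity.gpt_similarity": 2.6923076923076925,
        "qa_eval.f1_score": 0.7724547511312218,
    },
    "rows": [
        {
            "inputs.answer": "Paris is the capital of France.",
            "inputs.ground_truth": "The capital of France is Paris.",
            "inputs.question": "What is the capital of France?",
            "line_number": 0,
            "outputs.context_similarity.gpt_similarity": 5.0,
            "outputs.qa_eval.f1_score": 1.0,
        },
        {
            "inputs.answer": "Albert Einstein developed the theory of relativity.",
            "inputs.ground_truth": "The theory of relativity was developed by Albert Einstein.",
            "inputs.question": "Who developed the theory of relativity?",
            "line_number": 1,
            "outputs.context_similarity.gpt_similarity": 5.0,
            "outputs.qa_eval.f1_score": 0.8571428571428571,
        },
    ],
}


def rows_below(
    result: Dict[str, Any], metric_key: str, threshold: float
) -> List[Dict[str, Any]]:
    """Return the rows whose score for metric_key is present and below threshold."""
    return [
        row
        for row in result["rows"]
        if row.get(metric_key) is not None and row[metric_key] < threshold
    ]


# Hypothetical cutoff: flag any row with an F1 score under 0.9.
flagged = rows_below(sample_result, "outputs.qa_eval.f1_score", 0.9)
for row in flagged:
    print(row["line_number"], row["inputs.question"])
```

With these two rows, only line 1 is flagged. Its answer is a passive-voice paraphrase of the ground truth, which shows why a token-overlap metric such as F1 benefits from being paired with a similarity-based score.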

In [15]:
from src.quality.custom.custom_similarity import SemanticSimilarityEvaluator

In [16]:
semantic_similarity_eval = SemanticSimilarityEvaluator(model_name='bert-base-uncased')

[2024-08-30 07:54:08 -0700][flowinvoker][INFO] - Validating flow input with data {'answer': 'Paris is the capital of France.', 'ground_truth': 'The capital of France is Paris.'}
[2024-08-30 07:54:08 -0700][flowinvoker][INFO] - Validating flow input with data {'answer': 'Albert Einstein developed the theory of relativity.', 'ground_truth': 'The theory of relativity was developed by Albert Einstein.'}
[2024-08-30 07:54:08 -0700][flowinvoker][INFO] - Validating flow input with data {'answer': 'The speed of light is approximately 299,792,458 meters per second.', 'ground_truth': 'Light travels at a speed of 299,792,458 meters per second.'}
[2024-08-30 07:54:08 -0700][flowinvoker][INFO] - Execute flow with data {'answer': 'Paris is the capital of France.', 'ground_truth': 'The capital of France is Paris.'}
[2024-08-30 07:54:08 -0700][flowinvoker][INFO] - Validating flow input with data {'answer': 'Mount Everest is the tallest mountain in the world.', 'ground_truth': "The world's tallest moun

In [17]:
data_input_path = os.path.join(TARGET_DIRECTORY, "my_utils", "data", "evaluations", "jsonl", "F1ScoreEvaluator.jsonl")

result = evaluate(
    data=data_input_path,
    evaluators={
        "qa_eval": qa_eval,
        "context_similarity": context_similarity,
        "semantic_similarity": semantic_similarity_eval
    },
    # column mapping
    evaluator_config={
        "qa_eval": {
            "answer": "${data.answer}",
            "ground_truth": "${data.ground_truth}",
        },
        "context_similarity": {
            "question": "${data.question}",
            "answer": "${data.answer}",
            "ground_truth": "${data.ground_truth}",
        },
        "semantic_similarity": {
            "response": "${data.answer}",
            "ground_truth": "${data.ground_truth}",
        },
    },
    azure_ai_project=azure_ai_project
)

2024-08-30 07:54:08 -0700    2147 execution.flow     INFO     Start executing nodes in thread pool mode.
2024-08-30 07:54:08 -0700    2147 execution.flow     INFO     Start to run 2 nodes with concurrency level 16.


[2024-08-30 07:54:08 -0700][flowinvoker][INFO] - Validating flow input with data {'answer': "George Orwell is the author of '1984'.", 'ground_truth': "The author of '1984' is George Orwell."}


2024-08-30 07:54:08 -0700    2147 execution.flow     INFO     Current thread is not main thread, skip signal handler registration in AsyncNodesScheduler.


[2024-08-30 07:54:08 -0700][flowinvoker][INFO] - Execute flow with data {'answer': 'Albert Einstein developed the theory of relativity.', 'ground_truth': 'The theory of relativity was developed by Albert Einstein.'}


2024-08-30 07:54:08 -0700    2147 execution.flow     INFO     Executing node validate_inputs. node run id: 9f3b6856-238c-4778-932e-1d1763743a0e_validate_inputs_c7202f71-f225-436b-b674-821dac936358


[2024-08-30 07:54:08 -0700][flowinvoker][INFO] - Execute flow with data {'answer': 'The speed of light is approximately 299,792,458 meters per second.', 'ground_truth': 'Light travels at a speed of 299,792,458 meters per second.'}
[2024-08-30 07:54:08 -0700][flowinvoker][INFO] - Execute flow with data {'answer': 'Mount Everest is the tallest mountain in the world.', 'ground_truth': "The world's tallest mountain is Mount Everest."}
[2024-08-30 07:54:08 -0700][flowinvoker][INFO] - Execute flow with data {'answer': "George Orwell is the author of '1984'.", 'ground_truth': "The author of '1984' is George Orwell."}


2024-08-30 07:54:08 -0700    2147 execution.flow     INFO     Start executing nodes in thread pool mode.
2024-08-30 07:54:08 -0700    2147 execution.flow     INFO     Start to run 2 nodes with concurrency level 16.
2024-08-30 07:54:08 -0700    2147 execution.flow     INFO     Current thread is not main thread, skip signal handler registration in AsyncNodesScheduler.
2024-08-30 07:54:08 -0700    2147 execution.flow     INFO     Executing node validate_inputs. node run id: 3ecf6897-49eb-4c44-9054-e044d2d6f86b_validate_inputs_0f56a106-49ea-4093-96d7-0e96f2a6ffc2
2024-08-30 07:54:08 -0700    2147 execution.flow     INFO     Start executing nodes in thread pool mode.
2024-08-30 07:54:08 -0700    2147 execution.flow     INFO     Start executing nodes in thread pool mode.
2024-08-30 07:54:08 -0700    2147 execution.flow     INFO     Start to run 2 nodes with concurrency level 16.
2024-08-30 07:54:08 -0700    2147 execution.flow     INFO     Start executing nodes in thread pool mode.
2024-08-3

[2024-08-30 07:54:10 -0700][flowinvoker][INFO] - Validating flow input with data {'answer': 'Tokyo is the capital of Japan.', 'ground_truth': "Japan's capital is Tokyo."}
[2024-08-30 07:54:10 -0700][flowinvoker][INFO] - Execute flow with data {'answer': 'Tokyo is the capital of Japan.', 'ground_truth': "Japan's capital is Tokyo."}


2024-08-30 07:54:10 -0700    2147 execution.flow     INFO     Start executing nodes in thread pool mode.
2024-08-30 07:54:10 -0700    2147 execution.flow     INFO     Start to run 2 nodes with concurrency level 16.
2024-08-30 07:54:10 -0700    2147 execution.flow     INFO     Current thread is not main thread, skip signal handler registration in AsyncNodesScheduler.
2024-08-30 07:54:10 -0700    2147 execution.flow     INFO     Executing node validate_inputs. node run id: 2c196d1f-bf01-49f9-9224-324d328ff9e4_validate_inputs_1f674211-c1c2-48d7-a333-7ad5c74bc345
2024-08-30 07:54:10 -0700    2147 execution.flow     INFO     Node validate_inputs completes.
2024-08-30 07:54:10 -0700    2147 execution.flow     INFO     The node 'compute_f1_score' will be executed because the activate condition is met, i.e. '${validate_inputs.output}' is equal to 'True'.
2024-08-30 07:54:10 -0700    2147 execution.flow     INFO     Executing node compute_f1_score. node run id: 2c196d1f-bf01-49f9-9224-324d328ff

[2024-08-30 07:54:10 -0700][flowinvoker][INFO] - Validating flow input with data {'answer': 'Leonardo da Vinci painted the Mona Lisa.', 'ground_truth': 'The Mona Lisa was painted by Leonardo da Vinci.'}
[2024-08-30 07:54:10 -0700][flowinvoker][INFO] - Execute flow with data {'answer': 'Leonardo da Vinci painted the Mona Lisa.', 'ground_truth': 'The Mona Lisa was painted by Leonardo da Vinci.'}


2024-08-30 07:54:10 -0700    2147 execution.flow     INFO     Start executing nodes in thread pool mode.
2024-08-30 07:54:10 -0700    2147 execution.flow     INFO     Start to run 2 nodes with concurrency level 16.
2024-08-30 07:54:10 -0700    2147 execution.flow     INFO     Current thread is not main thread, skip signal handler registration in AsyncNodesScheduler.


[2024-08-30 07:54:10 -0700][flowinvoker][INFO] - Validating flow input with data {'answer': 'Jupiter is the largest planet in our solar system.', 'ground_truth': 'The largest planet in our solar system is Jupiter.'}
[2024-08-30 07:54:10 -0700][flowinvoker][INFO] - Execute flow with data {'answer': 'Jupiter is the largest planet in our solar system.', 'ground_truth': 'The largest planet in our solar system is Jupiter.'}


2024-08-30 07:54:10 -0700    2147 execution.flow     INFO     Executing node validate_inputs. node run id: 08e75767-7165-4dac-948b-c9f67f0a0703_validate_inputs_b0c9aeff-84a3-45f0-b1ec-0d08bc11233b


[2024-08-30 07:54:10 -0700][flowinvoker][INFO] - Validating flow input with data {'answer': 'H2O is the chemical symbol for water.', 'ground_truth': 'The chemical symbol for water is H2O.'}
[2024-08-30 07:54:10 -0700][flowinvoker][INFO] - Execute flow with data {'answer': 'H2O is the chemical symbol for water.', 'ground_truth': 'The chemical symbol for water is H2O.'}
[2024-08-30 07:54:10 -0700][flowinvoker][INFO] - Validating flow input with data {'answer': 'The Pacific Ocean is the largest ocean on Earth.', 'ground_truth': 'The largest ocean on Earth is the Pacific Ocean.'}
[2024-08-30 07:54:10 -0700][flowinvoker][INFO] - Execute flow with data {'answer': 'The Pacific Ocean is the largest ocean on Earth.', 'ground_truth': 'The largest ocean on Earth is the Pacific Ocean.'}
[2024-08-30 07:54:10 -0700][flowinvoker][INFO] - Validating flow input with data {'answer': 'The Great Pyramid of Giza was built by Alexander the Great.', 'ground_truth': 'The Great Pyramid of Giza was built by the

2024-08-30 07:54:10 -0700    2147 execution.flow     INFO     Start executing nodes in thread pool mode.
2024-08-30 07:54:10 -0700    2147 execution.flow     INFO     Start to run 2 nodes with concurrency level 16.


[2024-08-30 07:54:10 -0700][flowinvoker][INFO] - Validating flow input with data {'answer': 'The human body has four senses.', 'ground_truth': 'The human body has five primary senses.'}


2024-08-30 07:54:10 -0700    2147 execution.flow     INFO     Current thread is not main thread, skip signal handler registration in AsyncNodesScheduler.


[2024-08-30 07:54:10 -0700][flowinvoker][INFO] - Execute flow with data {'answer': 'The Great Pyramid of Giza was built by Alexander the Great.', 'ground_truth': 'The Great Pyramid of Giza was built by the Egyptians.'}


2024-08-30 07:54:10 -0700    2147 execution.flow     INFO     Executing node validate_inputs. node run id: e08c25a3-8b61-4164-a227-a59185cbaa08_validate_inputs_313536a0-a3a3-4b50-b022-7bf4aa3d20e6


[2024-08-30 07:54:10 -0700][flowinvoker][INFO] - Execute flow with data {'answer': 'The human body has four senses.', 'ground_truth': 'The human body has five primary senses.'}
[2024-08-30 07:54:10 -0700][flowinvoker][INFO] - Validating flow input with data {'answer': 'The Amazon is the longest river in the world.', 'ground_truth': 'The Nile is often cited as the longest river in the world, but some sources claim the Amazon is longer.'}
[2024-08-30 07:54:10 -0700][flowinvoker][INFO] - Execute flow with data {'answer': 'The Amazon is the longest river in the world.', 'ground_truth': 'The Nile is often cited as the longest river in the world, but some sources claim the Amazon is longer.'}
[2024-08-30 07:54:10 -0700][flowinvoker][INFO] - Validating flow input with data {'answer': 'The internet was invented in the 1960s.', 'ground_truth': 'The foundational technology of the internet was developed in the late 1960s.'}
[2024-08-30 07:54:10 -0700][flowinvoker][INFO] - Execute flow with data {

2024-08-30 07:54:10 -0700    2147 execution.flow     INFO     Start executing nodes in thread pool mode.
2024-08-30 07:54:10 -0700    2147 execution.flow     INFO     Start to run 2 nodes with concurrency level 16.
2024-08-30 07:54:10 -0700    2147 execution.flow     INFO     Current thread is not main thread, skip signal handler registration in AsyncNodesScheduler.
2024-08-30 07:54:10 -0700    2147 execution.flow     INFO     Executing node validate_inputs. node run id: 8ad2c112-7bc0-46aa-9b44-f9811c3c800d_validate_inputs_a0d0b125-9d1c-4370-b583-8e8214c5d132


[2024-08-30 07:54:10 -0700][flowinvoker][INFO] - Validating flow input with data {'answer': 'The heart is on the right side of the human body.', 'ground_truth': 'The heart is located slightly to the left side of the chest in the human body.'}
[2024-08-30 07:54:10 -0700][flowinvoker][INFO] - Execute flow with data {'answer': 'The heart is on the right side of the human body.', 'ground_truth': 'The heart is located slightly to the left side of the chest in the human body.'}


2024-08-30 07:54:10 -0700    2147 execution.flow     INFO     Start executing nodes in thread pool mode.
2024-08-30 07:54:10 -0700    2147 execution.flow     INFO     Start to run 2 nodes with concurrency level 16.
2024-08-30 07:54:10 -0700    2147 execution.flow     INFO     Current thread is not main thread, skip signal handler registration in AsyncNodesScheduler.
2024-08-30 07:54:10 -0700    2147 execution.flow     INFO     Executing node validate_inputs. node run id: 801f7897-9c26-43e4-8e27-8429330921e6_validate_inputs_f1610ce4-e762-4e51-b854-2c5f3770390d
2024-08-30 07:54:10 -0700    2147 execution.flow     INFO     Start executing nodes in thread pool mode.
2024-08-30 07:54:10 -0700    2147 execution.flow     INFO     Node validate_inputs completes.
2024-08-30 07:54:10 -0700    2147 execution.flow     INFO     Start to run 2 nodes with concurrency level 16.
2024-08-30 07:54:10 -0700    2147 execution.flow     INFO     The node 'compute_f1_score' will be executed because the activa

[2024-08-30 07:54:10 -0700][flowinvoker][INFO] - Validating flow input with data {'answer': 'The sun revolves around the Earth.', 'ground_truth': 'The Earth revolves around the sun.'}
[2024-08-30 07:54:10 -0700][flowinvoker][INFO] - Execute flow with data {'answer': 'The sun revolves around the Earth.', 'ground_truth': 'The Earth revolves around the sun.'}


2024-08-30 07:54:10 -0700    2147 execution.flow     INFO     Node validate_inputs completes.
2024-08-30 07:54:10 -0700    2147 execution.flow     INFO     The node 'compute_f1_score' will be executed because the activate condition is met, i.e. '${validate_inputs.output}' is equal to 'True'.
2024-08-30 07:54:10 -0700    2147 execution.flow     INFO     Executing node compute_f1_score. node run id: 54055ec8-af71-4f02-912c-0d3715978eb9_compute_f1_score_12bcf2be-46d0-4b56-8af4-f14414e69142
2024-08-30 07:54:10 -0700    2147 execution.flow     INFO     Node compute_f1_score completes.
2024-08-30 07:54:10 -0700    2147 execution.flow     INFO     Node validate_inputs completes.
2024-08-30 07:54:10 -0700    2147 execution.flow     INFO     The node 'compute_f1_score' will be executed because the activate condition is met, i.e. '${validate_inputs.output}' is equal to 'True'.
2024-08-30 07:54:10 -0700    2147 execution.flow     INFO     Executing node compute_f1_score. node run id: a56f88d6-7d3

[2024-08-30 07:54:10 -0700][flowinvoker][INFO] - Validating flow input with data {'answer': 'Sharks are mammals.', 'ground_truth': 'Sharks are a group of fish characterized by a cartilaginous skeleton.'}


2024-08-30 07:54:10 -0700    2147 execution.flow     INFO     Start to run 2 nodes with concurrency level 16.


[2024-08-30 07:54:10 -0700][flowinvoker][INFO] - Execute flow with data {'answer': 'Sharks are mammals.', 'ground_truth': 'Sharks are a group of fish characterized by a cartilaginous skeleton.'}


2024-08-30 07:54:10 -0700    2147 execution.flow     INFO     Node validate_inputs completes.
2024-08-30 07:54:10 -0700    2147 execution.flow     INFO     Current thread is not main thread, skip signal handler registration in AsyncNodesScheduler.
2024-08-30 07:54:10 -0700    2147 execution.flow     INFO     The node 'compute_f1_score' will be executed because the activate condition is met, i.e. '${validate_inputs.output}' is equal to 'True'.
2024-08-30 07:54:10 -0700    2147 execution.flow     INFO     Executing node validate_inputs. node run id: 0ae76ea8-c111-4ea5-8b02-282aa0e1b915_validate_inputs_63ea88fa-ccca-4583-b49c-2fc3de632aac
2024-08-30 07:54:10 -0700    2147 execution.flow     INFO     Executing node compute_f1_score. node run id: 82e21d8c-7357-4546-8454-2576d0f30a68_compute_f1_score_f908c4e7-8335-44b5-ae52-cd7c614e38a9


[2024-08-30 07:54:10 -0700][flowinvoker][INFO] - Validating flow input with data {'answer': 'The capital of Canada is Toronto.', 'ground_truth': 'The capital of Canada is Ottawa.'}
[2024-08-30 07:54:10 -0700][flowinvoker][INFO] - Execute flow with data {'answer': 'The capital of Canada is Toronto.', 'ground_truth': 'The capital of Canada is Ottawa.'}


2024-08-30 07:54:10 -0700    2147 execution.flow     INFO     Node compute_f1_score completes.
2024-08-30 07:54:10 -0700    2147 execution.flow     INFO     Node compute_f1_score completes.
2024-08-30 07:54:10 -0700    2147 execution.flow     INFO     Node compute_f1_score completes.
2024-08-30 07:54:10 -0700    2147 execution.flow     INFO     Start executing nodes in thread pool mode.
2024-08-30 07:54:10 -0700    2147 execution.flow     INFO     Start to run 2 nodes with concurrency level 16.
2024-08-30 07:54:10 -0700    2147 execution.flow     INFO     Current thread is not main thread, skip signal handler registration in AsyncNodesScheduler.
2024-08-30 07:54:10 -0700    2147 execution.flow     INFO     Executing node validate_inputs. node run id: 39d9039a-c05c-45ff-aa93-5054b8cb312e_validate_inputs_9b37bdb5-9e74-4114-bdc0-a4086305147e
2024-08-30 07:54:10 -0700    2147 execution.flow     INFO     Start executing nodes in thread pool mode.
2024-08-30 07:54:10 -0700    2147 execution.

[2024-08-30 07:54:10 -0700][flowinvoker][INFO] - Validating flow input with data {'answer': 'The inventor of the telephone was Thomas Edison.', 'ground_truth': 'Alexander Graham Bell is credited with inventing the first practical telephone.'}
[2024-08-30 07:54:10 -0700][flowinvoker][INFO] - Execute flow with data {'answer': 'The inventor of the telephone was Thomas Edison.', 'ground_truth': 'Alexander Graham Bell is credited with inventing the first practical telephone.'}
[2024-08-30 07:54:10 -0700][flowinvoker][INFO] - Validating flow input with data {'answer': 'The chemical symbol for gold is Ag.', 'ground_truth': 'The chemical symbol for gold is Au.'}
[2024-08-30 07:54:10 -0700][flowinvoker][INFO] - Execute flow with data {'answer': 'The chemical symbol for gold is Ag.', 'ground_truth': 'The chemical symbol for gold is Au.'}
[2024-08-30 07:54:10 -0700][flowinvoker][INFO] - Validating flow input with data {'answer': 'The largest desert in the world is the Sahara.', 'ground_truth': 'T

2024-08-30 07:54:10 -0700    2147 execution.flow     INFO     Node compute_f1_score completes.


[2024-08-30 07:54:10 -0700][flowinvoker][INFO] - Validating flow input with data {'answer': 'The first man to step on the moon was Lance Armstrong.', 'ground_truth': 'The first man to step on the moon was Neil Armstrong.'}
[2024-08-30 07:54:10 -0700][flowinvoker][INFO] - Execute flow with data {'answer': 'The first man to step on the moon was Lance Armstrong.', 'ground_truth': 'The first man to step on the moon was Neil Armstrong.'}
[2024-08-30 07:54:10 -0700][flowinvoker][INFO] - Validating flow input with data {'answer': 'The capital of Egypt is Cairo.', 'ground_truth': 'The capital of Egypt is Cairo.'}
[2024-08-30 07:54:10 -0700][flowinvoker][INFO] - Execute flow with data {'answer': 'The capital of Egypt is Cairo.', 'ground_truth': 'The capital of Egypt is Cairo.'}


2024-08-30 07:54:10 -0700    2147 execution.flow     INFO     Node compute_f1_score completes.
2024-08-30 07:54:10 -0700    2147 execution.flow     INFO     Start executing nodes in thread pool mode.
2024-08-30 07:54:10 -0700    2147 execution.flow     INFO     Start to run 2 nodes with concurrency level 16.
2024-08-30 07:54:10 -0700    2147 execution.flow     INFO     Current thread is not main thread, skip signal handler registration in AsyncNodesScheduler.
2024-08-30 07:54:10 -0700    2147 execution.flow     INFO     Executing node validate_inputs. node run id: 4c3e578b-28bb-4cc1-9262-f6dccdd9498e_validate_inputs_cb1e360b-54fb-4ca9-89fe-0d5b0d72a0e4
2024-08-30 07:54:10 -0700    2147 execution.flow     INFO     Node compute_f1_score completes.
2024-08-30 07:54:10 -0700    2147 execution.flow     INFO     Start executing nodes in thread pool mode.
2024-08-30 07:54:10 -0700    2147 execution.flow     INFO     Start to run 2 nodes with concurrency level 16.
2024-08-30 07:54:11 -0700    

[2024-08-30 07:54:11 -0700][flowinvoker][INFO] - Validating flow input with data {'answer': 'Photosynthesis is performed by animals.', 'ground_truth': 'Photosynthesis is performed by plants.'}
[2024-08-30 07:54:11 -0700][flowinvoker][INFO] - Execute flow with data {'answer': 'Photosynthesis is performed by animals.', 'ground_truth': 'Photosynthesis is performed by plants.'}


2024-08-30 07:54:11 -0700    2147 execution.flow     INFO     Start executing nodes in thread pool mode.
2024-08-30 07:54:11 -0700    2147 execution.flow     INFO     Start to run 2 nodes with concurrency level 16.
2024-08-30 07:54:11 -0700    2147 execution.flow     INFO     Current thread is not main thread, skip signal handler registration in AsyncNodesScheduler.
2024-08-30 07:54:11 -0700    2147 execution.flow     INFO     Executing node validate_inputs. node run id: 44205a4f-6004-4c09-99ee-565afbbd3d74_validate_inputs_73eeee06-faeb-40fc-97c8-ed6b40787787


[2024-08-30 07:54:11 -0700][flowinvoker][INFO] - Validating flow input with data {'answer': 'The smallest bone in the human body is the femur.', 'ground_truth': 'The smallest bone in the human body is the stapes bone in the ear.'}
[2024-08-30 07:54:11 -0700][flowinvoker][INFO] - Execute flow with data {'answer': 'The smallest bone in the human body is the femur.', 'ground_truth': 'The smallest bone in the human body is the stapes bone in the ear.'}
[2024-08-30 07:54:11 -0700][flowinvoker][INFO] - Validating flow input with data {'answer': 'The Pacific Ocean is the warmest ocean.', 'ground_truth': 'The Indian Ocean is considered the warmest ocean.'}
[2024-08-30 07:54:11 -0700][flowinvoker][INFO] - Execute flow with data {'answer': 'The Pacific Ocean is the warmest ocean.', 'ground_truth': 'The Indian Ocean is considered the warmest ocean.'}


2024-08-30 07:54:11 -0700    2147 execution.flow     INFO     Start executing nodes in thread pool mode.
2024-08-30 07:54:11 -0700    2147 execution.flow     INFO     Start to run 2 nodes with concurrency level 16.
2024-08-30 07:54:11 -0700    2147 execution.flow     INFO     Current thread is not main thread, skip signal handler registration in AsyncNodesScheduler.
2024-08-30 07:54:11 -0700    2147 execution.flow     INFO     Executing node validate_inputs. node run id: 4a85f1c0-d7a6-430a-b192-4e1ab30d7a20_validate_inputs_cd20cf95-6c94-493d-9aff-14c9ba5eb495


[2024-08-30 07:54:11 -0700][flowinvoker][INFO] - Validating flow input with data {'answer': 'The light bulb was invented by Nikola Tesla.', 'ground_truth': 'The light bulb was invented by Thomas Edison.'}
[2024-08-30 07:54:11 -0700][flowinvoker][INFO] - Execute flow with data {'answer': 'The light bulb was invented by Nikola Tesla.', 'ground_truth': 'The light bulb was invented by Thomas Edison.'}


2024-08-30 07:54:11 -0700    2147 execution.flow     INFO     Node validate_inputs completes.
2024-08-30 07:54:11 -0700    2147 execution.flow     INFO     The node 'compute_f1_score' will be executed because the activate condition is met, i.e. '${validate_inputs.output}' is equal to 'True'.
2024-08-30 07:54:11 -0700    2147 execution.flow     INFO     Executing node compute_f1_score. node run id: 37f0a44e-1646-4244-9fc0-8c7870903989_compute_f1_score_7c6d56ac-64a4-4fb7-aa8d-cca72458b4a7
2024-08-30 07:54:11 -0700    2147 execution.flow     INFO     Node validate_inputs completes.
2024-08-30 07:54:11 -0700    2147 execution.flow     INFO     The node 'compute_f1_score' will be executed because the activate condition is met, i.e. '${validate_inputs.output}' is equal to 'True'.
2024-08-30 07:54:11 -0700    2147 execution.flow     INFO     Executing node compute_f1_score. node run id: 0ae76ea8-c111-4ea5-8b02-282aa0e1b915_compute_f1_score_9cf34d81-ab94-4255-9b5a-eaf3b115b142
2024-08-30 07:54

[2024-08-30 07:54:11 -0700][flowinvoker][INFO] - Validating flow input with data {'answer': 'The first president of the United States was John Adams.', 'ground_truth': 'The first president of the United States was George Washington.'}
[2024-08-30 07:54:11 -0700][flowinvoker][INFO] - Validating flow input with data {'answer': 'The formula for water is CO2.', 'ground_truth': 'The formula for water is H2O.'}


2024-08-30 07:54:11 -0700    2147 execution.flow     INFO     Start executing nodes in thread pool mode.


[2024-08-30 07:54:11 -0700][flowinvoker][INFO] - Execute flow with data {'answer': 'The first president of the United States was John Adams.', 'ground_truth': 'The first president of the United States was George Washington.'}


2024-08-30 07:54:11 -0700    2147 execution.flow     INFO     Executing node validate_inputs. node run id: ce00ce4f-a3d1-417e-bf8f-4ef4df3a3688_validate_inputs_78484ade-22ea-4f51-89d4-5d5e6cfc13e8


[2024-08-30 07:54:11 -0700][flowinvoker][INFO] - Execute flow with data {'answer': 'The formula for water is CO2.', 'ground_truth': 'The formula for water is H2O.'}


2024-08-30 07:54:11 -0700    2147 execution.flow     INFO     Start to run 2 nodes with concurrency level 16.
2024-08-30 07:54:11 -0700    2147 execution.flow     INFO     Current thread is not main thread, skip signal handler registration in AsyncNodesScheduler.
2024-08-30 07:54:11 -0700    2147 execution.flow     INFO     Executing node validate_inputs. node run id: 496a0a5e-422c-42e8-b739-c02be61926ba_validate_inputs_8e41b5ea-c8d0-418f-b495-c367bd733e85
2024-08-30 07:54:11 -0700    2147 execution.flow     INFO     Start executing nodes in thread pool mode.
2024-08-30 07:54:11 -0700    2147 execution.flow     INFO     Start to run 2 nodes with concurrency level 16.
2024-08-30 07:54:11 -0700    2147 execution.flow     INFO     Current thread is not main thread, skip signal handler registration in AsyncNodesScheduler.
2024-08-30 07:54:11 -0700    2147 execution.flow     INFO     Executing node validate_inputs. node run id: 711d4f2b-549c-47ff-bf5f-b8b5f12e1dc4_validate_inputs_45077ec5-5

[2024-08-30 07:54:11 -0700][flowinvoker][INFO] - Validating flow input with data {'answer': 'The tallest building in the world is the Empire State Building.', 'ground_truth': 'The tallest building in the world is the Burj Khalifa.'}
[2024-08-30 07:54:11 -0700][flowinvoker][INFO] - Execute flow with data {'answer': 'The tallest building in the world is the Empire State Building.', 'ground_truth': 'The tallest building in the world is the Burj Khalifa.'}


2024-08-30 07:54:11 -0700    2147 execution.flow     INFO     Node validate_inputs completes.
2024-08-30 07:54:11 -0700    2147 execution.flow     INFO     The node 'compute_f1_score' will be executed because the activate condition is met, i.e. '${validate_inputs.output}' is equal to 'True'.
2024-08-30 07:54:11 -0700    2147 execution.flow     INFO     Executing node compute_f1_score. node run id: 4a85f1c0-d7a6-430a-b192-4e1ab30d7a20_compute_f1_score_e471aeb2-70e5-41c8-acbb-71be69ef5022
2024-08-30 07:54:11 -0700    2147 execution.flow     INFO     Node validate_inputs completes.
2024-08-30 07:54:11 -0700    2147 execution.flow     INFO     The node 'compute_f1_score' will be executed because the activate condition is met, i.e. '${validate_inputs.output}' is equal to 'True'.
2024-08-30 07:54:11 -0700    2147 execution.flow     INFO     Node validate_inputs completes.
2024-08-30 07:54:11 -0700    2147 execution.flow     INFO     Executing node compute_f1_score. node run id: ce00ce4f-a3d1

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)

In [19]:
result

{'rows': [{'inputs.answer': 'Paris is the capital of France.',
   'inputs.ground_truth': 'The capital of France is Paris.',
   'inputs.question': 'What is the capital of France?',
   'outputs.qa_eval.f1_score': 1.0,
   'outputs.context_similarity.gpt_similarity': 5.0,
   'outputs.semantic_similarity.semantic_similarity': 0.9259477853775024,
   'line_number': 0},
  {'inputs.answer': 'Albert Einstein developed the theory of relativity.',
   'inputs.ground_truth': 'The theory of relativity was developed by Albert Einstein.',
   'inputs.question': 'Who developed the theory of relativity?',
   'outputs.qa_eval.f1_score': 0.8571428571428571,
   'outputs.context_similarity.gpt_similarity': 5.0,
   'outputs.semantic_similarity.semantic_similarity': 0.8779064416885376,
   'line_number': 1},
  {'inputs.answer': 'The speed of light is approximately 299,792,458 meters per second.',
   'inputs.ground_truth': 'Light travels at a speed of 299,792,458 meters per second.',
   'inputs.question': 'What i

#### Scenario 3: Dimensionality Reduction for Query Representation in High-Dimensional Language Spaces

**Objective**: Explore the underlying structure of user queries in a high-dimensional natural language dataset by applying UMAP and other dimensionality reduction techniques. The exploration should identify the key dimensions that shape query representations and suggest alternative metrics for evaluating and improving the AI system's comprehension and response capabilities.

**Dimensionality Reduction Approach**:
- **UMAP Implementation**:
  - **Purpose**: UMAP (Uniform Manifold Approximation and Projection) reduces the high-dimensional space of query embeddings to a more manageable 2D or 3D space, revealing patterns and clusters that correspond to different types of user queries.
  - **Process**: We encode the queries with a transformer-based model (e.g., `bert-base-uncased`) to generate embeddings, then apply UMAP to these embeddings to visualize and analyze the inherent structure of the data.
  - **Outcome**: The resulting low-dimensional representation highlights areas where queries are closely related, providing insights into common themes or topics and identifying outliers or ambiguous queries.
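
The process above can be sketched roughly as follows. This is a minimal illustration, not the notebook's implementation: the embeddings are synthetic Gaussian blobs standing in for real `bert-base-uncased` vectors (768 dimensions), and the helper names `make_synthetic_embeddings`, `pca_project`, and `reduce_embeddings` are hypothetical. It assumes the `umap-learn` package for UMAP and falls back to a plain PCA projection, one of the "other dimensionality reduction techniques", when that package is not installed.

```python
import numpy as np


def make_synthetic_embeddings(n_per_topic=20, n_topics=3, dim=768, seed=0):
    """Stand-in for transformer query embeddings: one Gaussian blob per topic."""
    rng = np.random.default_rng(seed)
    centers = rng.normal(0.0, 5.0, size=(n_topics, dim))
    blobs = [c + rng.normal(0.0, 1.0, size=(n_per_topic, dim)) for c in centers]
    labels = np.repeat(np.arange(n_topics), n_per_topic)
    return np.vstack(blobs), labels


def pca_project(embeddings, n_components=2):
    """Linear baseline: project onto the top principal components via SVD."""
    centered = embeddings - embeddings.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


def reduce_embeddings(embeddings, n_components=2, method="umap", seed=42):
    """Reduce embeddings to n_components dimensions with UMAP or PCA."""
    if method not in ("umap", "pca"):
        raise ValueError(f"unknown method: {method}")
    if method == "umap":
        try:
            import umap  # provided by the umap-learn package (assumed installed)
        except ImportError:
            method = "pca"  # fall back to the linear projection
        else:
            reducer = umap.UMAP(
                n_components=n_components,
                n_neighbors=10,
                metric="cosine",
                random_state=seed,
            )
            return reducer.fit_transform(embeddings)
    return pca_project(embeddings, n_components)


if __name__ == "__main__":
    embeddings, topics = make_synthetic_embeddings()
    coords = reduce_embeddings(embeddings)
    print(coords.shape)
```

In a real run, the synthetic array would be replaced by one embedding vector per query, and the resulting 2D coordinates could be scatter-plotted and colored by topic or intent to inspect clusters and outliers.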


**Goal**: Enhance the AI system's understanding and representation of user queries by leveraging dimensionality reduction techniques and alternative metrics. The ultimate aim is to improve the system's response accuracy and contextual understanding so that it can handle a wide range of query types in a nuanced and meaningful manner. The low-dimensional representation also characterizes the current population of system queries, giving a baseline against which later query populations can be compared for drift detection.
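
One simple way to turn the drift comparison into a number is sketched below. The score used here, the shift between population centroids divided by the baseline's average spread, is an assumption chosen for illustration and is not a metric defined by this notebook; `centroid_drift` is a hypothetical helper, and the 2D points are synthetic stand-ins for reduced query embeddings.

```python
import numpy as np


def centroid_drift(baseline, current):
    """Centroid shift between two populations, scaled by the baseline spread."""
    baseline = np.asarray(baseline, dtype=float)
    current = np.asarray(current, dtype=float)
    base_center = baseline.mean(axis=0)
    # Distance between the two population centroids.
    shift = np.linalg.norm(current.mean(axis=0) - base_center)
    # Average distance of baseline points from their own centroid.
    spread = np.linalg.norm(baseline - base_center, axis=1).mean()
    return float(shift / spread)


# Synthetic 2D coordinates standing in for reduced query embeddings.
rng = np.random.default_rng(0)
baseline = rng.normal(0.0, 1.0, size=(200, 2))
same_population = rng.normal(0.0, 1.0, size=(200, 2))
shifted_population = rng.normal(0.0, 1.0, size=(200, 2)) + np.array([3.0, 0.0])

print("same distribution:", centroid_drift(baseline, same_population))
print("shifted distribution:", centroid_drift(baseline, shifted_population))
```

A score near zero suggests the new queries occupy the same region as the baseline, while larger values suggest the population has moved. For the comparison to be meaningful, both populations need coordinates from the same fitted projection rather than from two independently fitted ones.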