##### **Evaluation Setup**:
* Install the Fabric Data Agent SDK with pip.
* Load a **DataFrame** with `question` and `expected_answer` columns.
  * Either edit the in-cell DataFrame,
  * or upload a CSV file in "question,expected_answer" format to the lakehouse.
    * Copy the file path and load the data into a DataFrame using pandas.read_csv("<lakehouse_filepath>").
* Invoke the evaluate_data_agent API with data_frame, **data_agent_name**, workspace_name (Optional), table_name (Optional), and data_agent_stage (Optional).
  * data_agent_name : Name of the Data Agent.
  * workspace_name (Optional) : Workspace name, needed only if the Data Agent is in a different workspace. Default value is None.
  * table_name (Optional) : Name of the output table that stores the evaluation result. Default table name is 'evaluation_output'.
    * After evaluation there are two tables: <table_name> holds the evaluation output and <table_name>_steps holds the detailed steps.
  * data_agent_stage (Optional) : Data Agent stage, i.e., sandbox or production. Default value is production.
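
The CSV route has one pitfall worth illustrating: expected answers such as `46,117.30.` contain commas, so those fields must be quoted or they split into extra columns. The sketch below is a minimal illustration using Python's standard `csv` module, which quotes such fields by default. The in-memory buffer stands in for the file you would upload to the lakehouse, and the sample rows are taken from this notebook.

```python
import csv
import io

# Rows in "question,expected_answer" order (samples taken from this notebook).
rows = [
    ("show total sales for Canadian Dollar for January 2013", "46,117.30."),
    ("Total sales outside of US", "19968887.95"),
]

# Write the CSV to an in-memory buffer; in practice this would be a file
# that you then upload to the lakehouse.
buffer = io.StringIO()
writer = csv.writer(buffer)  # default quoting wraps fields that contain commas
writer.writerow(["question", "expected_answer"])
writer.writerows(rows)
csv_text = buffer.getvalue()

# Read it back to confirm the header names and that no field was split.
reader = csv.DictReader(io.StringIO(csv_text))
parsed = [(r["question"], r["expected_answer"]) for r in reader]
print(parsed == rows)
```

If the round trip holds, the same file should load cleanly with pandas.read_csv as described above.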


#### Install Fabric Data Agent SDK

In [1]:
%pip install -U fabric-data-agent-sdk

Collecting fabric-data-agent-sdk
  Downloading fabric_data_agent_sdk-0.1.0a0-py3-none-any.whl.metadata (3.4 kB)
Collecting openai>=1.57.0 (from fabric-data-agent-sdk)
  Downloading openai-1.72.0-py3-none-any.whl.metadata (25 kB)
Collecting httpx==0.27.2 (from fabric-data-agent-sdk)


  Downloading httpx-0.27.2-py3-none-any.whl.metadata (7.1 kB)
Collecting httpcore==1.* (from httpx==0.27.2->fabric-data-agent-sdk)
  Downloading httpcore-1.0.7-py3-none-any.whl.metadata (21 kB)
Collecting h11<0.15,>=0.13 (from httpcore==1.*->httpx==0.27.2->fabric-data-agent-sdk)
  Downloading h11-0.14.0-py3-none-any.whl.metadata (8.2 kB)
Collecting jiter<1,>=0.4.0 (from openai>=1.57.0->fabric-data-agent-sdk)
  Downloading jiter-0.9.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (5.2 kB)
Collecting pydantic<3,>=1.9.0 (from openai>=1.57.0->fabric-data-agent-sdk)
  Downloading pydantic-2.11.3-py3-none-any.whl.metadata (65 kB)
Collecting annotated-types>=0.6.0 (from pydantic<3,>=1.9.0->openai>=1.57.0->fabric-data-agent-sdk)
  

Collecting pydantic-core==2.33.1 (from pydantic<3,>=1.9.0->openai>=1.57.0->fabric-data-agent-sdk)
  Downloading pydantic_core-2.33.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.8 kB)
Collecting typing-inspection>=0.4.0 (from pydantic<3,>=1.9.0->openai>=1.57.0->fabric-data-agent-sdk)
  Downloading typing_inspection-0.4.0-py3-none-any.whl.metadata (2.6 kB)
Downloading fabric_data_agent_sdk-0.1.0a0-py3-none-any.whl (26 kB)
Downloading httpx-0.27.2-py3-none-any.whl (76 kB)
Downloading httpcore-1.0.7-py3-none-any.whl (78 kB)

Installing collected packages: typing-inspection, pydantic-core, jiter, h11, annotated-types, pydantic, httpcore, httpx, openai, fabric-data-agent-sdk


Successfully installed annotated-types-0.7.0 fabric-data-agent-sdk-0.1.0a0 h11-0.14.0 httpcore-1.0.7 httpx-0.27.2 jiter-0.9.0 openai-1.72.0 pydantic-2.11.3 pydantic-core-2.33.1 typing-inspection-0.4.0
Note: you may need to restart the kernel to use updated packages.


##### Load the DataFrame from an in-cell definition or an input CSV file

In [2]:
import pandas as pd

# Create a DataFrame with "question" and "expected_answer" columns. Update the questions and expected answers as needed.
df = pd.DataFrame(columns=["question", "expected_answer"],
                  data=[
                    ["show total sales for Canadian Dollar for January 2013", "46,117.30."],
                    ["what is the product with the highest total sales for Canadian Dollar in 2013", "Mountain-200 Black, 42"],
                    ["Total sales outside of US", "19968887.95"],
                    ["which product category had the highest total sales for Canadian Dollar in 2013", "Bikes (Total Sales: 938654.76)"]
                ])

# Load from input CSV file with data in format "question,expected_answer"
# input_file_path = "/lakehouse/default/Files/Data/Input/curated_2.csv"
# df = pd.read_csv(input_file_path)


##### Invoke Evaluation API with input parameters

In [3]:
from fabric.dataagent.evaluation import evaluate_data_agent

# Data Agent name
data_agent_name = "AgentEvaluation"

# Workspace name (Optional), needed only if the Data Agent is in a different workspace
workspace_name = None

# Table name (Optional) to store the evaluation result. Default value is 'evaluation_output'
# After evaluation there are two tables: <table_name> holds the evaluation output and <table_name>_steps holds the detailed steps.
table_name = "demo_evaluation_output"

# Data Agent stage, i.e., sandbox or production. Defaults to production.
data_agent_stage = "production"

# Evaluate the Data Agent. Returns the unique id for the evaluation run
evaluation_id = evaluate_data_agent(df, data_agent_name, workspace_name=workspace_name, table_name=table_name, data_agent_stage=data_agent_stage)

print(f"Unique Id for the current evaluation run: {evaluation_id}")

Processing Rows:   0%|          | 0/4 [00:00<?, ?step/s]

Processing Rows:  25%|██▌       | 1/4 [00:26<01:19, 26.35s/step]

Processing Rows:  50%|█████     | 2/4 [00:51<00:50, 25.42s/step]

Processing Rows:  75%|███████▌  | 3/4 [01:16<00:25, 25.58s/step]

Unique Id for the current evaluation run: 45dfc4ae-a6ae-4265-80ff-a75c7b463fb0


##### Overall summary of the evaluation runs stored in a table
Returns a DataFrame with one summary row per evaluation run.

Input Parameters:
* table_name (Optional) : Table name which contains the evaluation result. Default value is 'evaluation_output'
* verbose (Optional) : Flag to display the summary. Default is False.

In [None]:
from fabric.dataagent.evaluation import get_evaluation_summary

get_evaluation_summary(table_name)

index,evaluation_id,true_count,false_count,unclear_count,true_percentage
0,2cf897c6-801c-4e94-8664-f4848ffb6071,3,1,0,75.0
1,30c590da-6df8-494f-a359-642d34d4fa32,2,2,0,50.0
2,34fc1c4c-8696-48e5-aebb-78b3b681cc93,2,2,0,50.0
3,45dfc4ae-a6ae-4265-80ff-a75c7b463fb0,4,0,0,100.0
4,726cb9fe-614a-4dc0-b121-18dd224aa391,2,2,0,50.0
5,75d8e7db-3448-4e93-bb20-819e038e28a6,4,0,0,100.0
6,760af511-c8a8-497c-b729-6b113f4ac2a9,3,1,0,75.0
7,76e1e1c8-29ee-4505-bce2-9cd331927ef4,3,1,0,75.0
8,7b65fc7b-c8c8-46a1-9c2b-04adf65f4f48,3,1,0,75.0
9,934d185f-99da-4f48-908b-6a5a9c4fa9de,3,1,0,75.0
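
To make the summary columns concrete, the sketch below tallies a list of judgements into the same counts and a percentage. It is an illustration only: `summarize` is a hypothetical helper, not part of the SDK, and it assumes `true_percentage` is the share of true judgements among all rows, including unclear ones. The sample output above has no unclear rows, so it cannot confirm how the SDK treats them.

```python
def summarize(judgements):
    """Tally judgements (True, False, or None for unclear) into summary fields."""
    true_count = sum(1 for j in judgements if j is True)
    false_count = sum(1 for j in judgements if j is False)
    unclear_count = len(judgements) - true_count - false_count
    total = len(judgements)
    # Assumption: unclear rows count toward the denominator.
    true_percentage = 100.0 * true_count / total if total else 0.0
    return {
        "true_count": true_count,
        "false_count": false_count,
        "unclear_count": unclear_count,
        "true_percentage": true_percentage,
    }


# Three correct answers and one incorrect answer, as in the first summary row.
print(summarize([True, True, True, False]))
```

Under that assumption, three true judgements out of four give 75.0, which matches the rows with counts 3, 1, 0.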


##### Evaluation details of a single run
Returns the DataFrame with evaluation details.

Input Parameters:
* evaluation_id : Unique Id for an evaluation run.
* table_name (Optional) : Table name which contains the evaluation result. Default value is 'evaluation_output'.
* get_all_rows (Optional) : Flag to get all the rows for an evaluation. Default value is False, which returns only failed evaluation rows.
* verbose (Optional) : Flag to display the summary. Default is False.

**Note**: The thread URL in the evaluation details is accessible only to the person who ran the evaluation.

In [None]:
from fabric.dataagent.evaluation import get_evaluation_details

# Unique Id for an evaluation run
# evaluation_id = 'd36ce205-a88d-42bd-927d-260ec2e2a479'
# Evaluation output table name
table_name = "demo_evaluation_output"
# Flag to get all the rows for an evaluation. Default value is False, which returns only failed evaluation rows.
get_all_rows = False
# Flag to display the summary. Default is False.
verbose = True

eval_details = get_evaluation_details(evaluation_id, table_name, get_all_rows=get_all_rows, verbose=verbose)

question,expected_answer,evaluation_judgement,actual_answer,thread_url
show total sales for Canadian Dollar for January 2013,"46,117.30.",False,"There's content here that I can't work with. Try asking a new question. If that doesn't work, there might be an issue with content in your source data.",thread_QEBJbIKDGiBVl3xKxENJky90
Total sales outside of US,19968887.95,False,"The query for the total sales outside of the US seems to be irrelevant to the available data schema. Please provide additional context or rephrase your question so that it is more relevant to the existing data fields. The available data tables include information about accounts, currencies, customers, and dates.",thread_Ja1XoNI0ULNAnBHzCleUTrAA


### Advanced Options

##### Use customized prompt for evaluation
* critic_prompt (Optional): Prompt used to evaluate the actual answer from the Data Agent.
  * Please use the variables **query, expected_answer and actual_answer** as placeholders.
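
Because a custom prompt must carry all three placeholders, it can help to check a template before starting an evaluation run. The sketch below assumes the placeholders use `str.format`-style braces, as in this notebook's example prompt; how the SDK substitutes them internally is not shown here. `missing_placeholders` is a hypothetical helper, not an SDK function.

```python
from string import Formatter

REQUIRED_FIELDS = {"query", "expected_answer", "actual_answer"}


def missing_placeholders(template):
    """Return the required placeholder names that do not appear in template."""
    found = {name for _, name, _, _ in Formatter().parse(template) if name}
    return REQUIRED_FIELDS - found


template = "Query: {query}\nExpected: {expected_answer}\nActual: {actual_answer}"

# An empty set means every required placeholder is present.
print(missing_placeholders(template))

# Preview of the filled prompt, using str.format as a stand-in for the SDK.
print(template.format(query="Total sales outside of US",
                      expected_answer="19968887.95",
                      actual_answer="<answer from the Data Agent>"))
```

A non-empty result names the placeholders still to be added to the prompt.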

In [6]:
from fabric.dataagent.evaluation import evaluate_data_agent

# Prompt (Optional) to evaluate the actual response. Use the variables query, expected_answer and actual_answer as placeholders
critic_prompt = """
        Given the following query, expected answer, and actual answer, please determine if the actual answer is equivalent to expected answer. If they are equivalent, respond with 'yes'.

        Query: {query}

        Expected Answer:
        {expected_answer}

        Actual Answer:
        {actual_answer}

        Is the actual answer equivalent to the expected answer?
        """

# Data Agent name
data_agent_name = "AgentEvaluation"

# Evaluate the Data Agent. Returns the unique id for the evaluation run
evaluation_id = evaluate_data_agent(df, data_agent_name, critic_prompt=critic_prompt)

Processing Rows: 100%|██████████| 4/4 [01:46<00:00, 26.64s/step]


In [7]:
from fabric.dataagent.evaluation import get_evaluation_details

# Unique Id for an evaluation run
evaluation_id = '4e725e05-5b72-493f-b849-d8787decc188'
# Evaluation output table name
table_name = "evaluation_output"
# Flag to get all the rows for an evaluation. Default value is False, which returns only failed evaluation rows.
get_all_rows = True
# Flag to display the summary. Default is False.
verbose = True

eval_details = get_evaluation_details(evaluation_id, table_name, get_all_rows=get_all_rows, verbose=verbose)

question,expected_answer,evaluation_judgement,actual_answer,thread_url
show total sales for Canadian Dollar for January 2013,"46,117.30.",True,"The total sales for Canadian Dollar in January 2013 is 46,117.3.",thread_vpK7SupUSmaRbrxqBcZueJ7Q
what is the product with the highest total sales for Canadian Dollar in 2013,"Mountain-200 Black, 42",True,"The product with the highest total sales for Canadian Dollar in 2013 is ""Mountain-200 Black, 42"".",thread_ZOLweVJJQ4mrBa5pDnwYP6l8
Total sales outside of US,19968887.95,False,"The total sales outside of the US amount to approximately 1,001,630,000. The query used to obtain this information is: ```sql SELECT SUM(fs.SalesAmount) AS TotalSalesOutsideUS FROM FactInternetSales fs JOIN DimSalesTerritory dst ON fs.SalesTerritoryKey = dst.SalesTerritoryKey JOIN DimGeography dg ON dst.SalesTerritoryCountry = dg.EnglishCountryRegionName WHERE dg.EnglishCountryRegionName <> 'United States' ```",thread_h4Hdx5Mt44L0gZrG9YKRco4p
which product category had the highest total sales for Canadian Dollar in 2013,Bikes (Total Sales: 938654.76),True,"The product category that had the highest total sales for Canadian Dollar in 2013 was ""Bikes"".",thread_NTMS5V17E96KR5FTtqkutJ7M
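
With `get_all_rows=True` the result mixes passing and failing rows. The sketch below narrows such a result to the rows not judged True, assuming the details were exported as CSV text with the column names shown above. The rows are abbreviated stand-ins with answers and thread ids replaced by placeholders, and `failed_rows` is a hypothetical helper, not an SDK function.

```python
import csv
import io

# Abbreviated stand-in for exported evaluation details (placeholders, not real output).
details_csv = """question,expected_answer,evaluation_judgement,actual_answer,thread_url
show total sales for Canadian Dollar for January 2013,"46,117.30.",True,<answer>,<thread>
Total sales outside of US,19968887.95,False,<answer>,<thread>
"""


def failed_rows(csv_text):
    """Return rows whose evaluation_judgement is anything other than 'True'."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row for row in reader if row["evaluation_judgement"].strip() != "True"]


for row in failed_rows(details_csv):
    print(row["question"], "| expected:", row["expected_answer"])
```

Treating every judgement other than True as a failure also keeps any unclear rows in view, which is a choice of this sketch rather than documented SDK behavior.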
