# Working with Prompt Template Assets for a Retrieval-Augmented Generation task in watsonx.governance

This notebook creates a retrieval-augmented generation (RAG) prompt template asset (PTA) in a given project, configures watsonx.governance to monitor that PTA and evaluate generative quality metrics and model health metrics, and then promotes the PTA to a space and repeats the evaluation there.

If you wish to execute this notebook for task types other than RAG, refer to [Evaluating prompt template asset of different task types](https://github.com/IBM/watson-openscale-samples/blob/main/IBM%20Cloud/WML/notebooks/watsonx/README.md) for guidance on evaluating prompt templates for other available task types.

This notebook should be run using a Python 3.10 or greater runtime environment. If you are viewing this notebook in Watson Studio and do not see Python 3.10.x or higher in the upper right corner of your screen, please update the runtime now. 

**Note**: Run your notebook on a Cloud Pak for Data (CPD) cluster using version 5.0.0 or above.

## Learning goals

- Create a prompt template asset in a CPD project
- Configure watsonx.governance to monitor the created prompt template asset 
- Evaluate generative quality metrics and model health metrics
- Promote the prompt template asset to a space
- Evaluate the prompt template asset in a space 

## Prerequisites

- Service credentials for IBM watsonx.governance are required
- Watson OpenScale (WOS) credentials are required
- Watson Machine Learning (WML) credentials are required
- A `.csv` file containing test data to be evaluated
- ID of the CPD project in which you want to create the PTA
- ID of the CPD space to which you want to promote the PTA

## Contents

[Evaluating a Prompt Template Asset from a project](#evaluateproject)
- [Step 1 - Setup](#settingup)
- [Step 2 - Create a Prompt template](#prompt)
- [Step 3 - Set up the Prompt template](#ptatsetup)
- [Step 4 - Risk evaluations for the PTA subscription](#evaluate)
- [Step 5 - Display the Model Risk metrics](#mrmmetric)
- [Step 6 - Display the Generative AI Quality metrics](#genaimetrics)
- [Step 7 - Plot faithfulness and answer relevance metrics against records](#plotproject)
- [Step 8 - See factsheets information](#factsheetsspace)

[Evaluating a Prompt Template Asset from a space](#evaluatespace)
- [Step 9 - Promote a PTA to a space](#promottospace)
- [Step 10 - Create a deployment for a PTA in a space](#ptadeployment)
- [Step 11 - Set up the PTA in a space for evaluation with supported monitor parameters](#ptaspace)
- [Step 12 - Score the model and configure monitors](#score)
- [Step 13 - Display the source attributions for a record](#attributions)
- [Step 14 - Plot faithfulness and answer relevance metrics against records](#plotspace)
- [Step 15 - See factsheets information from a space](#factsheetsproject)

## Evaluating a Prompt Template Asset from a project <a name="evaluateproject"></a>

In the first section of this notebook, you will learn how to:

1. Create a PTA in a project
2. Create a `development`-type subscription for a PTA in OpenScale
3. Configure monitors supported by OpenScale for the subscription
4. Perform risk evaluations against the PTA subscription with a sample set of test data
5. Display the metrics generated with the risk evaluation
6. Display the factsheets information for the subscription

## Step 1 - Setup <a name="settingup"></a>

### Install the necessary packages

In [83]:
!pip install -U ibm-watson-openscale | tail -n 1
!pip install --upgrade ibm-watson-machine-learning | tail -n 1
!pip install matplotlib



**Note**: You may need to restart the kernel to use updated packages.

### Configure your credentials

Run your notebook on a CPD cluster using version 5.0.0 or above.

In [84]:
WOS_CREDENTIALS = {
     "url": "https://cpd-watsonx.apps.comarch.mop.ibm",
     "username": "<username>",
     "password" : "<password>"
}

WML_CREDENTIALS = {
     "url": "<url>",
     "username": "<username>",
     "password" : "<password>",
     "instance_id": "wml_local",
     "version" : "5.0"
}

**Note**: Replace the `WOS_CREDENTIALS` with your Watson OpenScale credentials, and the `WML_CREDENTIALS` with your Watson Machine Learning credentials.
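
Before going further, it can help to confirm that no `<placeholder>` values are left in the credential dictionaries. The following is a minimal sketch; the helper name `find_unset_placeholders` and the example values are hypothetical and are not part of any IBM SDK.

```python
def find_unset_placeholders(credentials):
    """Return the keys whose values still look like '<placeholder>' text."""
    return [
        key
        for key, value in credentials.items()
        if isinstance(value, str) and value.startswith("<") and value.endswith(">")
    ]

# Hypothetical example values, not real credentials
example_credentials = {
    "url": "https://cpd.example.com",
    "username": "<username>",
    "password": "<password>",
}
print(find_unset_placeholders(example_credentials))
```

In the notebook, the same check could be applied to `WOS_CREDENTIALS` and `WML_CREDENTIALS`; an empty list means every placeholder has been replaced.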

### Configure your project ID

To set up a development-type subscription in Watson OpenScale, the PTA must be within a CPD project. Supply the project ID where the PTA needs to be created.

In [85]:
project_id = "<project_id>"

### Configure your space ID

You can use an existing space, or you can create a new space to promote the model.

#### (Optional) If you choose an existing space

Set variable for an existing space:

In [86]:
import json
from ibm_watson_machine_learning import APIClient

wml_client = APIClient(WML_CREDENTIALS)
wml_client.version

'1.0.367'

List the available spaces:

### Create an access token

The following function generates an IAM access token using the provided credentials. The API calls for creating and scoring prompt template assets utilize the token generated by this function.

In [87]:
import json
import requests

def generate_access_token():
    """Request a bearer token from the CPD authorization endpoint."""
    headers = {
        "Content-Type": "application/json",
        "Accept": "application/json",
    }
    data = json.dumps({
        "username": WOS_CREDENTIALS["username"],
        "password": WOS_CREDENTIALS["password"],
    }).encode("utf-8")
    url = WOS_CREDENTIALS["url"] + "/icp4d-api/v1/authorize"
    # verify=False skips TLS certificate checks; use it only for clusters with self-signed certificates
    response = requests.post(url=url, data=data, headers=headers, verify=False)
    response.raise_for_status()  # fail early on a wrong URL or wrong credentials
    return response.json()["token"]

iam_access_token = generate_access_token()
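
The token is passed to later REST calls as a bearer token. A minimal sketch of building the request headers follows; the helper name `build_auth_headers` and the sample token string are hypothetical.

```python
def build_auth_headers(access_token):
    """Build JSON request headers that carry the token as a bearer credential."""
    return {
        "Authorization": "Bearer " + access_token,
        "Content-Type": "application/json",
        "Accept": "application/json",
    }

# Hypothetical token value, for illustration only
example_headers = build_auth_headers("abc123")
print(example_headers["Authorization"])
```

In the notebook, `build_auth_headers(iam_access_token)` would produce headers suitable for `requests` calls against the CPD cluster.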

## Step 2 - Create a Prompt template <a name="prompt"></a>

Create a prompt template for a RAG task:

In [88]:
credentials={
     "password": WML_CREDENTIALS["password"],
     "url": WML_CREDENTIALS["url"],
     "instance_id": "openshift",
     "username": WML_CREDENTIALS["username"]
}

In [89]:
from ibm_watson_machine_learning.foundation_models.prompts import PromptTemplate, PromptTemplateManager
from ibm_watson_machine_learning.foundation_models.utils.enums import ModelTypes

prompt_mgr = PromptTemplateManager(
                credentials = credentials,
                project_id = project_id
                )

prompt_input="""
<|start_of_role|>system<|end_of_role|>You are an AI assistant of OSS system of Telco companies using Comarch OSS that follows instruction extremely well. You will answer to OSS users' questions about provided context. Given the context, your task is to answer an OSS user's question by using information from the context. Your answer must be in English. When the question cannot be answered using the context or document, output the following response without additional explanations: "I CANNOT ANSWER THAT QUESTION BASED ON THE PROVIDED DOCUMENT.". If the context starts with "H", ignore the first letters until "

". The actual context starts after"

". When generating responses, prioritize correctness, i.e., ensure that your response is correct given the context and user query, and that it is grounded in the context. Furthermore, make sure that the response is supported by the given document or context. Always make sure that your response is relevant to the question. Avoid repeating information unless asked. Be clear and precise - do not include additional information unless asked by the OSS user. Answer with maximum one sentence. If the supporting explanation is too long to fit in one sentence, use bulleted points lists when possible rather than answering in multiple long sentences.<|end_of_text|><|start_of_role|>user<|end_of_role|>Use the following pieces of context to answer the question.

{contexts}

Question: {question}
"""

# Simplified prompt that is actually used: this assignment overrides the longer prompt above.
prompt_input="""

Answer the question using below context.
{retrieved_contexts}

Question: {user_input}
"""


!pip install --upgrade ibm-aigov-facts-client | tail -n 1
from ibm_cloud_sdk_core.authenticators import CloudPakForDataAuthenticator
from ibm_aigov_facts_client import AIGovFactsClient, CloudPakforDataConfig
creds=CloudPakforDataConfig(
            service_url=WOS_CREDENTIALS["url"],
            username=WOS_CREDENTIALS["username"],
            password=WOS_CREDENTIALS["password"]
        )
facts_client = AIGovFactsClient(
                cloud_pak_for_data_configs=creds,
                container_id=project_id,
                container_type="project",
                disable_tracing=True)





from ibm_aigov_facts_client import DetachedPromptTemplate, PromptTemplate

detached_information = DetachedPromptTemplate(
    prompt_id="detached_prompt 1",
    model_id="ibm/granite-3-8b-instruct",
    model_provider="Facebook",
    model_name="llama-3-70b-instruct",
    model_url="https://us-south.ml.cloud.ibm.com/ml/v1/deployments/insurance_test_deployment/text/generation?version=2021-05-01",
    prompt_url="prompt_url",
    prompt_additional_info={"IBM Cloud Region": "us-east1"}
)

task_id = "retrieval_augmented_generation"
name = "Prompt Template for Q&A LLM As a Judge"
#name = "Prompt Template for Q&A SLM"
description = "Detached prompt for Comarch RAG"
model_id = "ibm/granite-3-8b-instruct"

# Define parameters for PromptTemplate
prompt_variables = {"retrieved_contexts": "", "user_input": ""}
input_prefix = ""
output_prefix = ""

prompt_template = PromptTemplate(
    input=prompt_input,  # pass the prompt text directly to avoid shadowing the built-in input()
    prompt_variables=prompt_variables,
    input_prefix=input_prefix,
    output_prefix=output_prefix,
)

pta_details = facts_client.assets.create_detached_prompt(
    model_id=model_id,
    task_id=task_id,
    name=name,
    description=description,
    prompt_details=prompt_template,
    detached_information=detached_information)
project_pta_id = pta_details.to_dict()["asset_id"]
project_pta_id




2024/12/16 17:19:46 INFO : ------------------------------ Detached Prompt Creation Started ------------------------------
2024/12/16 17:19:47 INFO : The detached prompt with ID afca833f-aeb7-4c8f-b072-1c726081ff43 was created successfully in container_id 36e48345-0549-4909-8440-d050e3d59b8a.


'afca833f-aeb7-4c8f-b072-1c726081ff43'

In [90]:
'''
prompt_template = PromptTemplate(name="RAG QA",
                                 model_id="meta-llama/llama-3-1-8b-instruct",
                                 task_ids=["retrieval_augmented_generation"],
                                 input_prefix="",
                                 output_prefix="",
                                 input_text= prompt_input,
                                 input_variables=["contexts", "question"])

stored_prompt_template = prompt_mgr.store_prompt(prompt_template)
project_pta_id = stored_prompt_template.prompt_id
project_pta_id
'''


## Step 3 - Set up the Prompt template <a name="ptatsetup"></a>

### Configure OpenScale

In [91]:
from ibm_cloud_sdk_core.authenticators import CloudPakForDataAuthenticator

from ibm_watson_openscale import *
from ibm_watson_openscale.supporting_classes.enums import *
from ibm_watson_openscale.supporting_classes import *


authenticator = CloudPakForDataAuthenticator(
        url=WOS_CREDENTIALS['url'],
        username=WOS_CREDENTIALS['username'],
        password=WOS_CREDENTIALS['password'],
        disable_ssl_verification=True
    )

wos_client = APIClient(service_url=WOS_CREDENTIALS['url'],authenticator=authenticator)
print(wos_client.version)

3.0.41


### List available OpenScale datamarts and configure the datamart ID

In [92]:
wos_client.data_marts.show()

| name | description | internal_database | status | created_at | id |
|:-|:-|:-|:-|:-|:-|
|  |  | False | active | 2024-11-05 09:44:21.837000+00:00 | 00000000-0000-0000-0000-000000000000 |


In [93]:
data_mart_id = "00000000-0000-0000-0000-000000000000"

### Map the project ID to an OpenScale instance

When authentication is on CPD, you must take the additional step of mapping the `project_id` and `space_id` to an OpenScale instance.

### Set up the PTA in the project for evaluation with supported monitor parameters

The PTAs from a project are only supported with a `development`-type operational space ID. Running the following cell will create a `development`-type subscription from the PTA created within your project.

The available parameters that can be passed for the `execute_prompt_setup` function are:

 * `prompt_template_asset_id`: ID of the PTA for which a subscription needs to be created
 * `label_column`: The name of the column containing the ground truth or actual labels
 * `project_id`: The ID of the project
 * `space_id`: The ID of the space
 * `deployment_id`: (optional) The ID of the deployment
 * `operational_space_id`: The rank of the environment in which the monitoring is happening. Accepted values are `development`, `pre_production`, `production`
 * `problem_type`: (optional) The task type to monitor for the given PTA
 * `classification_type`: The classification type (`binary`/`multiclass`) applicable only for the `classification` problem (task) type
 * `input_data_type`: The input data type
 * `supporting_monitors`: Monitor configuration for the subscription to be created
 * `background_mode`: When `True`, the prompt setup operation will be executed in the background
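
The rules stated above (the accepted `operational_space_id` values, and the restriction of project PTAs to `development`) can be checked before calling `execute_prompt_setup`. This is a minimal sketch; the helper `check_prompt_setup_scope` is hypothetical and only encodes what this section says, not the full validation performed by the service.

```python
ACCEPTED_OPERATIONAL_SPACES = ("development", "pre_production", "production")

def check_prompt_setup_scope(operational_space_id, project_id=None):
    """Raise ValueError when the arguments contradict the rules described in this section."""
    if operational_space_id not in ACCEPTED_OPERATIONAL_SPACES:
        raise ValueError(
            "operational_space_id must be one of %s" % (ACCEPTED_OPERATIONAL_SPACES,)
        )
    # PTAs that live in a project are supported only with the development operational space
    if project_id is not None and operational_space_id != "development":
        raise ValueError("A PTA in a project supports only the 'development' operational space")
    return True

# Hypothetical project ID, for illustration only
print(check_prompt_setup_scope("development", project_id="example-project-id"))
```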

### (Optional) If you choose to use LLM As Judge for computing answer quality and retrieval quality metrics

The answer quality and retrieval quality metrics can be computed using fine-tuned smaller models or using LLM as judge. To compute metrics using LLM as judge, a `generative_ai_evaluator` integrated system must be created and provided during prompt setup.

#### (Optional) Create a Generative AI evaluator if you choose LLM as judge

The Generative AI evaluator can be any model from watsonx.ai or a custom endpoint that invokes external models.

Supported evaluator types:
* watsonx.ai (for connecting to watsonx.ai in local CPD or watsonx.ai on remote CPD or watsonx.ai on Cloud)
* custom (for connecting to any external models)

#### Integrated System parameters
| Parameter | Description |
|:-|:-|
| `name` | Name for the evaluator. |
| `description` | Description for the evaluator. |
| `type` | The type of integrated system. Provide `generative_ai_evaluator`. |
| `parameters` | The evaluator configuration details like `evaluator_type`, `space_id`, `project_id` and `model_id`. |
| `credentials` | The user credentials |
| `connection` [Optional]| The scoring endpoint details when the evaluator is of type `custom`. |

As an example, an evaluator using the `meta-llama/llama-3-1-8b-instruct` model from the watsonx.ai instance in the same CPD cluster is created below. Other watsonx.ai models that can be used include FLAN_T5_XXL, FLAN_UL2, FLAN_T5_XL, and MIXTRAL_8X7B_INSTRUCT_V01.
For more details on the parameters, and to create the other supported evaluators, see [Generative AI evaluator templates](https://github.com/IBM/watson-openscale-samples/wiki/Generative-AI-Evaluator-templates).

In [94]:
from ibm_watson_machine_learning.foundation_models.utils.enums import ModelTypes
from ibm_watson_openscale.utils.client_utils import get_cpd_decoded_token

uid = get_cpd_decoded_token(WOS_CREDENTIALS["url"], iam_access_token).get("uid")
gen_ai_evaluator = wos_client.integrated_systems.add(
    name="llm as a judge",
    description="evaluation through llm as a judge",
    type="generative_ai_evaluator",
    parameters={
        "evaluator_type": "watsonx.ai",
        "model_id": "meta-llama/llama-3-1-8b-instruct"
    },
    credentials={
        "wml_location": "cpd_local",
        "uid": uid,
        "url": WOS_CREDENTIALS["url"],
    }
)

# get evaluator integrated system ID
result = gen_ai_evaluator.result._to_dict()
evaluator_id = result["metadata"]["id"]
evaluator_id

'998c271d-3f0b-4cde-a7f1-cfdca6aec701'

#### Generative AI monitor parameters

##### Generative AI evaluator parameters
The `generative_ai_evaluator` details can be provided at the global level under `generative_ai_quality.parameters` to use the same evaluator for all the answer quality metrics (faithfulness, answer relevance, answer similarity) and retrieval quality metrics (context relevance, retrieval precision, average precision, reciprocal rank, hit rate, normalized discounted cumulative gain).

The generative AI evaluator can also be specified at the metric level to use a different evaluator for each metric. The metrics under which the `generative_ai_evaluator` parameter can be specified are `faithfulness`, `answer_relevance`, `answer_similarity` and `retrieval_quality`. A `generative_ai_evaluator` at the metric level takes precedence over the one at the global level.

| Parameter | Description | Default Value |
|:-|:-|:-|
| `enabled`| The flag to enable the generative AI evaluator. |  |
| `evaluator_id`| The ID of the generative AI evaluator integrated system. |  |
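
The precedence rule can be illustrated with a small helper that reads a monitor configuration and reports which evaluator applies to a metric. This is a sketch only: `resolve_evaluator` is a hypothetical helper, the evaluator IDs are made up, and falling back to the global evaluator when a metric-level entry is disabled is an assumption rather than documented service behavior.

```python
def resolve_evaluator(monitors, metric_name):
    """Return the evaluator_id that applies to a metric, or None if LLM as judge is off."""
    params = monitors["generative_ai_quality"]["parameters"]
    metric_config = params.get("metrics_configuration", {}).get(metric_name, {})

    # A metric-level evaluator takes precedence over the global one
    metric_level = metric_config.get("generative_ai_evaluator")
    if metric_level and metric_level.get("enabled"):
        return metric_level["evaluator_id"]

    # Assumption: otherwise fall back to the global evaluator, if it is enabled
    global_level = params.get("generative_ai_evaluator")
    if global_level and global_level.get("enabled"):
        return global_level["evaluator_id"]
    return None

# Hypothetical configuration with made-up evaluator IDs
example_monitors = {
    "generative_ai_quality": {
        "parameters": {
            "generative_ai_evaluator": {"enabled": True, "evaluator_id": "global-evaluator"},
            "metrics_configuration": {
                "faithfulness": {
                    "generative_ai_evaluator": {"enabled": True, "evaluator_id": "faithfulness-evaluator"}
                },
                "answer_relevance": {},
            },
        }
    }
}
print(resolve_evaluator(example_monitors, "faithfulness"))
print(resolve_evaluator(example_monitors, "answer_relevance"))
```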

##### Faithfulness parameters
The faithfulness parameters below are supported only when the generative AI evaluator is not used, i.e., when LLM as judge is not used to compute the metrics. When using LLM as judge, the source attributions are not computed.

| Parameter | Description | Default Value |
|:-|:-|:-|
| `attributions_count` [Optional]| Source attributions are computed for each sentence in the generated answer. The source attribution for a sentence is the set of sentences in the context that contributed to the LLM generating that sentence in the answer. The `attributions_count` parameter specifies the number of context sentences to identify as attributions. For example, if the value is set to 2, the top 2 sentences from the context are returned as source attributions. | `3` |
| `ngrams` [Optional]| The number of sentences to be grouped from the context when computing the faithfulness score. These grouped sentences are shown in the attributions. A very high value of `ngrams` might lower the faithfulness scores due to dispersion of data and inclusion of unrelated sentences in the attributions. A very low value might increase the metric computation time and lead to attributions that do not capture all the aspects of the answer. | `2` |

##### Context relevance parameters
The context relevance parameters below are supported only when the generative AI evaluator is not used, i.e., when LLM as judge is not used to compute the metrics.

| Parameter | Description | Default Value |
|:-|:-|:-|
| `ngrams` [Optional]| The number of sentences to be grouped from the context when computing the context relevance score. A very high value of `ngrams` might lower the context relevance scores due to dispersion of data. A very low value might increase the metric computation time and lead to a context relevance score that does not capture all the aspects of the question. | `2` |

##### Unsuccessful requests parameters

| Parameter | Description | Default Value |
|:-|:-|:-|
| `unsuccessful_phrases` [Optional]| The list of phrases to be used for comparing the model output to determine whether the request is unsuccessful or not. | `["i don't know", "i do not know", "i'm not sure", "i am not sure", "i'm unsure", "i am unsure", "i'm uncertain", "i am uncertain", "i'm not certain", "i am not certain", "i can't fulfill", "i cannot fulfill"]` |
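
To make the idea concrete, the sketch below flags a model output as unsuccessful when it contains one of the default phrases. Case-insensitive substring matching is an assumption made for illustration; the exact comparison used by the monitor is not described here, and `is_unsuccessful` is a hypothetical helper.

```python
DEFAULT_UNSUCCESSFUL_PHRASES = [
    "i don't know", "i do not know", "i'm not sure", "i am not sure",
    "i'm unsure", "i am unsure", "i'm uncertain", "i am uncertain",
    "i'm not certain", "i am not certain", "i can't fulfill", "i cannot fulfill",
]

def is_unsuccessful(model_output, phrases=DEFAULT_UNSUCCESSFUL_PHRASES):
    """Return True when the output contains any phrase (assumed: case-insensitive substring match)."""
    text = model_output.lower()
    return any(phrase in text for phrase in phrases)

print(is_unsuccessful("I'm not sure which document covers that."))
print(is_unsuccessful("Yes, an ongoing order can be suspended."))
```

A prompt that instructs the model to reply with a fixed refusal sentence would need that sentence added to `unsuccessful_phrases` for such requests to be counted.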


In [95]:
# Update the label_column, context_fields, question_field values based on the prompt and test data used
label_column = "ground_truths"
context_fields = ["retrieved_contexts"]
question_field = "user_input"

operational_space_id = "development"
problem_type= "retrieval_augmented_generation"
input_data_type= "unstructured_text"


monitors = {
    "generative_ai_quality": {
        "parameters": {
            # generative_ai_evaluator enables LLM as judge; comment it out to use the OpenScale fine-tuned smaller models (SLM) instead
            "generative_ai_evaluator": { # global LLM as judge configuration
                "enabled": True,
                "evaluator_id": evaluator_id,
            },
            "min_sample_size": 1,
            "metrics_configuration":{
                "faithfulness": {
                   #  "attributions_count": 3, # Not supported for LLM as Judge
                   #  "ngrams": 2,    # Not supported for LLM as Judge
                    # Uncomment generative_ai_evaluator to use a different evaluator for this metric.
                    # Takes higher precedence than the generative_ai_evaluator specified at global level.
                    #"generative_ai_evaluator": { # metric specific LLM as judge configuration
                    #    "enabled": True,
                    #    "evaluator_id": evaluator_id
                    #},
                },
                "answer_relevance": {
                    # Uncomment generative_ai_evaluator to use a different evaluator for this metric
                    # Takes higher precedence than the generative_ai_evaluator specified at global level.
                    #"generative_ai_evaluator": { # metric specific LLM as judge configuration
                    #    "enabled": True,
                    #    "evaluator_id": evaluator_id
                    #},
                },
                "rouge_score": {},
                #"exact_match": {},
                #"bleu": {},
                #"unsuccessful_requests": {
                    #"unsuccessful_phrases": []
                #},
                #"hap_input_score": {},
                #"hap_score": {},
                #"pii": {},
                #"pii_input": {},
                "retrieval_quality": {
                    # Uncomment generative_ai_evaluator to use a different evaluator for this metric
                    # Takes higher precedence than the generative_ai_evaluator specified at global level.
                    #"generative_ai_evaluator": { # metric specific LLM as judge configuration
                    #    "enabled": True,
                    #    "evaluator_id": evaluator_id
                    #},
                    # The metrics computed for retrieval quality are context_relevance, retrieval_precision, average_precision, reciprocal_rank, hit_rate, normalized_discounted_cumulative_gain
                    "context_relevance": {
                        #"ngrams": 2,    # Not supported for LLM as Judge
                    }
                },
                # The answer similarity metric is supported only when LLM as judge is configured. Comment it out when not using LLM as judge.
                "answer_similarity": {
                    # Uncomment generative_ai_evaluator to use a different evaluator for this metric
                    # Takes higher precedence than the generative_ai_evaluator specified at global level.
                    #"generative_ai_evaluator": { # metric specific LLM as judge configuration
                    #    "enabled": True,
                    #    "evaluator_id": evaluator_id
                    #},
                }
            }
        }
    }
}

response = wos_client.wos.execute_prompt_setup(prompt_template_asset_id = project_pta_id, 
                                               project_id = project_id,
                                               context_fields = context_fields,
                                               question_field = question_field,
                                               label_column = label_column,
                                               operational_space_id = operational_space_id, 
                                               problem_type = problem_type,
                                               input_data_type = input_data_type, 
                                               supporting_monitors = monitors, 
                                               background_mode = False)

result = response.result
result._to_dict()




 Waiting for end of adding prompt setup afca833f-aeb7-4c8f-b072-1c726081ff43 




running.....
finished

---------------------------------------------------------------
 Successfully finished setting up prompt template subscription 
---------------------------------------------------------------




{'prompt_template_asset_id': 'afca833f-aeb7-4c8f-b072-1c726081ff43',
 'project_id': '36e48345-0549-4909-8440-d050e3d59b8a',
 'deployment_id': '4d39b226-6cbf-4e7d-9c31-feed74e2b52e',
 'service_provider_id': 'a7f2e0e1-2bc3-4e4b-91b5-b12fe8469b68',
 'subscription_id': '3d629beb-9443-4aaa-bc18-bc538f24b6c0',
 'mrm_monitor_instance_id': '5ce4de35-a73b-4bc5-bce3-86a71375a185',
 'start_time': '2024-12-16T17:19:53.699919Z',
 'end_time': '2024-12-16T17:20:30.481594Z',
 'status': {'state': 'FINISHED'}}


With the following cell, you can read the prompt setup task and check its status:

In [96]:
response = wos_client.wos.get_prompt_setup(prompt_template_asset_id = project_pta_id,
                                                             project_id = project_id)

result = response.result
result_json = result._to_dict()

if result_json["status"]["state"] == "FINISHED":
    print("Finished prompt setup : The response is {}".format(result_json))
else:
    print("Prompt setup did not finish. The response is {}".format(result_json))

Finished prompt setup : The response is {'prompt_template_asset_id': 'afca833f-aeb7-4c8f-b072-1c726081ff43', 'project_id': '36e48345-0549-4909-8440-d050e3d59b8a', 'deployment_id': '4d39b226-6cbf-4e7d-9c31-feed74e2b52e', 'service_provider_id': 'a7f2e0e1-2bc3-4e4b-91b5-b12fe8469b68', 'subscription_id': '3d629beb-9443-4aaa-bc18-bc538f24b6c0', 'mrm_monitor_instance_id': '5ce4de35-a73b-4bc5-bce3-86a71375a185', 'start_time': '2024-12-16T17:19:53.699919Z', 'end_time': '2024-12-16T17:20:30.481594Z', 'status': {'state': 'FINISHED'}}


### Read the `subscription_id` from the prompt setup

Once the prompt setup status is `FINISHED`, read the subscription ID:

In [97]:
dev_subscription_id = result_json["subscription_id"]
dev_subscription_id

'3d629beb-9443-4aaa-bc18-bc538f24b6c0'

### Show all monitor instances in the development subscription
The following cell lists the monitors present in the development subscription, along with their respective statuses and other details. Please wait for all the monitors to be in an active state before proceeding further.

In [98]:
wos_client.monitor_instances.show(target_target_id = dev_subscription_id)

| data_mart_id | status | target_id | target_type | monitor_definition_id | created_at | id |
|:-|:-|:-|:-|:-|:-|:-|
| 00000000-0000-0000-0000-000000000000 | active | 3d629beb-9443-4aaa-bc18-bc538f24b6c0 | subscription | generative_ai_quality | 2024-12-16 17:20:16.409000+00:00 | ad1584d0-f96e-4ff7-97dc-4b492d88fe89 |
| 00000000-0000-0000-0000-000000000000 | active | 3d629beb-9443-4aaa-bc18-bc538f24b6c0 | subscription | model_health | 2024-12-16 17:20:17.483000+00:00 | fee1facd-2e69-433a-ae32-df1ea322965e |
| 00000000-0000-0000-0000-000000000000 | active | 3d629beb-9443-4aaa-bc18-bc538f24b6c0 | subscription | mrm | 2024-12-16 17:20:18.639000+00:00 | 5ce4de35-a73b-4bc5-bce3-86a71375a185 |
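
Waiting for the monitors to become active can be automated with a simple polling loop. The sketch below is generic: `get_states` is a hypothetical caller-supplied function that returns a mapping of monitor name to status (for example, built from the monitor instance listing), and the `preparing` status in the demonstration is made up.

```python
import time

def wait_until_active(get_states, timeout_seconds=300, poll_seconds=10, sleep=time.sleep):
    """Poll get_states() until every reported state is 'active' or the timeout elapses."""
    waited = 0
    while True:
        states = get_states()
        if states and all(state == "active" for state in states.values()):
            return states
        if waited >= timeout_seconds:
            raise TimeoutError("Monitors not active after %d seconds: %s" % (waited, states))
        sleep(poll_seconds)
        waited += poll_seconds

# Demonstration with canned responses instead of a live OpenScale call
_responses = iter([
    {"generative_ai_quality": "preparing", "model_health": "active", "mrm": "active"},
    {"generative_ai_quality": "active", "model_health": "active", "mrm": "active"},
])
final_states = wait_until_active(lambda: next(_responses), sleep=lambda seconds: None)
print(final_states)
```

The `sleep` argument is injectable so that the loop can be exercised without real delays; in the notebook the default `time.sleep` would be used.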


## Step 4 - Risk evaluations for the PTA subscription <a name="evaluate"></a>

### Evaluate the prompt template subscription

For risk assessment of a `development`-type subscription, you must have an evaluation dataset. The risk assessment function takes the evaluation dataset path as a parameter when evaluating the configured metrics. If the column names in the uploaded `.csv` file differ from the feature columns in the subscription, you can supply a mapping JSON file that associates the `.csv` column names with the feature column names in the subscription.
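
The format of the mapping JSON file is not shown in this notebook. As an alternative illustration, the sketch below renames the header of a CSV file locally so that it matches the feature columns before upload. The helper `rename_csv_columns` and the sample column names `question`, `contexts` and `answer` are hypothetical.

```python
import csv
import io

def rename_csv_columns(csv_text, column_mapping):
    """Return csv_text with header names replaced according to column_mapping."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return ""
    # Columns that are not in the mapping keep their original name
    header = [column_mapping.get(name, name) for name in rows[0]]
    output = io.StringIO()
    writer = csv.writer(output, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows[1:])
    return output.getvalue()

# Hypothetical source columns mapped to the names used in this notebook
sample_csv = "question,contexts,answer\nq1,c1,a1\n"
mapping = {"question": "user_input", "contexts": "retrieved_contexts", "answer": "ground_truths"}
print(rename_csv_columns(sample_csv, mapping))
```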

**Note**: If you are running this notebook from Watson Studio, you may first need to upload your test data to Watson Studio, then run the code snippet below to download the feedback data file from the project to a local directory.

In [104]:
# Download the RAG test data
# -f keeps rm quiet when the file does not exist yet
!rm -f paragraphs_watson_version_2.csv
# -O sets the output file name; a bare second argument would be treated as another URL
!wget -O paragraphs_watson_version_2.csv https://ibm.box.com/shared/static/1o6u5oeo7cmc9odoogdx3uc8nu7yi5s0.csv
#!wget https://ibm.box.com/shared/static/ur6dchp9ypx57jor3j5dz3inu85omhvf.csv
#!wget https://raw.githubusercontent.com/IBM/watson-openscale-samples/main/IBM%20Cloud/WML/assets/data/watsonx/rag_state_union.csv



In [120]:
import pandas as pd

#!cat paragraphs_watson_version_2.csv
df = pd.read_csv("paragraphs_watson_version_2.csv", encoding='unicode_escape')
print(df)

                                           user_input  \
0             Can I suspend an order that is ongoing?   
1             How python is supported in Comarch ROM?   
2      Where can I find python documentation for ROM?   
3   What are the main stages in ROM order processing?   
4                        Is Comarch ROM intent-based?   
5       Can I order template element directly in ROM?   
6                           What Resource Catalog is?   
7   What types of Resource Specification can be cr...   
8   What types of characteristics are supported by...   
9   How can I track changes introduced to specific...   
10                      What is Service Specification   
11  What types of characteristics are supported by...   
12                          What 'Origin' stands for?   
13            What does it mean to redefine relation?   
14                     What is Service Order Manager?   
15                            How to create an Order?   
16                     How to t

### Read the Model Risk metrics `instance_id` from OpenScale

Evaluating test data against the prompt template subscription requires the monitor instance ID for your OpenScale Model Risk metrics.

In [121]:
monitor_definition_id = "mrm"
target_target_id = dev_subscription_id
result = wos_client.monitor_instances.list(data_mart_id=data_mart_id,
                                           monitor_definition_id=monitor_definition_id,
                                           target_target_id=target_target_id,
                                           project_id=project_id).result
result_json = result._to_dict()
mrm_monitor_id = result_json["monitor_instances"][0]["metadata"]["id"]
mrm_monitor_id

'5ce4de35-a73b-4bc5-bce3-86a71375a185'
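
The lookup above indexes `["monitor_instances"][0]` directly, which raises a bare `IndexError` when no monitor instance matches the filters. A minimal sketch of a more defensive lookup follows. It only assumes the dictionary shape used in the cell above, and the helper name `first_monitor_instance_id` is hypothetical.

```python
def first_monitor_instance_id(result_json: dict) -> str:
    """Return the ID of the first monitor instance, or raise a descriptive error."""
    instances = result_json.get("monitor_instances") or []
    if not instances:
        raise LookupError(
            "No monitor instance found; check the monitor definition ID, "
            "the subscription ID, and the project ID."
        )
    return instances[0]["metadata"]["id"]


# Same shape as result_json in the cell above
example = {"monitor_instances": [{"metadata": {"id": "5ce4de35-a73b-4bc5-bce3-86a71375a185"}}]}
example_id = first_monitor_instance_id(example)
```

With the notebook's variables this would be called as `first_monitor_instance_id(result_json)`.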

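The following cell evaluates the test data against the PTA subscription and produces measurements for the configured monitors. It writes each row to its own CSV file and submits the rows one at a time, so a separate risk evaluation runs for every row.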

In [122]:
import pandas as pd

test_data_set_name = "data.csv"
content_type = "multipart/form-data"
test_data = "paragraphs_watson_version_2.csv"
test_data_path = "input.csv"
body = {}

llm_data = pd.read_csv(test_data, encoding="unicode_escape")

# Prepare the test data: drop columns that should not be sent for evaluation
cols_to_remove = ["ground_truths_documents"]
llm_data = llm_data.drop(columns=cols_to_remove, errors="ignore")

# Evaluate one record at a time: write each row to a single-row CSV file
# (header plus one data row) and run a risk evaluation on that file.
# To evaluate the whole data set in one run instead, write llm_data to
# test_data_path with index=False and call evaluate_risk once.
for _, row in llm_data.iterrows():
    # index=False keeps the pandas row index out of the uploaded file
    pd.DataFrame([row.to_dict()]).to_csv(test_data_path, index=False)

    # background_mode=False waits until the evaluation has finished
    response = wos_client.monitor_instances.mrm.evaluate_risk(
        monitor_instance_id=mrm_monitor_id,
        test_data_set_name=test_data_set_name,
        test_data_path=test_data_path,
        content_type=content_type,
        body=body,
        project_id=project_id,
        background_mode=False,
    )




 Waiting for risk evaluation of MRM monitor 5ce4de35-a73b-4bc5-bce3-86a71375a185 




upload_in_progress.
running....
finished

---------------------------------------
 Successfully finished evaluating risk 
---------------------------------------






{
    "result": "HTTP response",
    "headers": {
        "_store": {
            "date": [
                "date",
                "Mon, 16 Dec 2024 11:28:09 GMT"
            ],
            "content-type": [
                "content-type",
                "application/json"
            ],
            "content-length": [
                "content-length",
                "1323"
            ],
            "ibm-cpd-transaction-id": [
                "ibm-cpd-transaction-id",
                "e9f373ec-cd05-4685-a27d-36ff60d2893b"
            ],
            "strict-transport-security": [
                "strict-transport-security",
                "max-age=31536000 ; includeSubDomains, max-age=31536000; includeSubDomains"
            ],
            "x-content-type-options": [
                "x-content-type-options",
                "nosniff"
      

### Read the risk evaluation response

After initiating the risk evaluation, the evaluation results are available for review:

In [123]:
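import time

# evaluate_risk ran with background_mode=False above, so the evaluation has already finished.
# If you start it in background mode instead, wait before reading the result, for example:
# time.sleep(300)
response = wos_client.monitor_instances.mrm.get_risk_evaluation(mrm_monitor_id, project_id=project_id)
response.result.to_dict()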

{'metadata': {'id': '619f1dae-67ae-4046-a949-9f21dbfd9f4c',
  'created_at': '2024-12-16T18:48:02.556Z',
  'created_by': 'internal-service'},
 'entity': {'triggered_by': 'user',
  'parameters': {'evaluation_start_time': '2024-12-16T18:47:55.823428Z',
   'evaluation_tests': ['drift_v2',
    'fairness',
    'generative_ai_quality',
    'model_health',
    'quality'],
   'evaluator_user_key': '787ba75a-7727-429c-b3cf-a90e17a2dbe3',
   'facts': {'state': 'finished'},
   'is_auto_evaluated': False,
   'measurement_id': '00be5d1d-d491-4197-9f97-8081fb4c0be6',
   'project_id': '36e48345-0549-4909-8440-d050e3d59b8a',
   'prompt_template_asset_id': 'afca833f-aeb7-4c8f-b072-1c726081ff43',
   'secret_id': '3e706168-b95b-429a-a3fb-ed6556593c69',
   'user_iam_id': '1000331023',
   'wos_created_deployment_id': '4d39b226-6cbf-4e7d-9c31-feed74e2b52e',
   'publish_metrics': 'false'},
  'status': {'state': 'finished',
   'queued_at': '2024-12-16T18:48:02.515000Z',
   'started_at': '2024-12-16T18:48:03.27
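
The full response is verbose, and usually only a few fields matter when deciding whether to continue. A minimal sketch that pulls out the state, the list of evaluation tests, and the measurement ID is shown here. It assumes the dictionary layout visible in the output above, and the helper name `summarize_risk_evaluation` is hypothetical.

```python
def summarize_risk_evaluation(evaluation: dict) -> dict:
    """Extract the fields of a risk evaluation that are most often checked."""
    entity = evaluation.get("entity", {})
    parameters = entity.get("parameters", {})
    return {
        "state": entity.get("status", {}).get("state"),
        "evaluation_tests": parameters.get("evaluation_tests", []),
        "measurement_id": parameters.get("measurement_id"),
    }


# Trimmed copy of the response shown above
example_evaluation = {
    "entity": {
        "parameters": {
            "evaluation_tests": ["drift_v2", "fairness", "generative_ai_quality",
                                 "model_health", "quality"],
            "measurement_id": "00be5d1d-d491-4197-9f97-8081fb4c0be6",
        },
        "status": {"state": "finished"},
    }
}
example_summary = summarize_risk_evaluation(example_evaluation)
```

In the notebook this would be called as `summarize_risk_evaluation(response.result.to_dict())`. A state other than `finished` indicates that the metrics in the following steps may not be available yet.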

## Step 5 - Display the Model Risk metrics <a name="mrmmetric"></a>

Now that the measurements for the Foundation Model subscription have been calculated, you can review the Model Risk metrics generated for it:

In [124]:
wos_client.monitor_instances.show_metrics(monitor_instance_id=mrm_monitor_id, project_id=project_id)

0,1,2,3,4,5,6,7,8,9,10,11
2024-12-16 18:48:02.648000+00:00,tests_passed,00be5d1d-d491-4197-9f97-8081fb4c0be6,1.0,,,['test_data_set_name:data.csv'],mrm,5ce4de35-a73b-4bc5-bce3-86a71375a185,619f1dae-67ae-4046-a949-9f21dbfd9f4c,subscription,3d629beb-9443-4aaa-bc18-bc538f24b6c0
2024-12-16 18:48:02.648000+00:00,tests_run,00be5d1d-d491-4197-9f97-8081fb4c0be6,1.0,,,['test_data_set_name:data.csv'],mrm,5ce4de35-a73b-4bc5-bce3-86a71375a185,619f1dae-67ae-4046-a949-9f21dbfd9f4c,subscription,3d629beb-9443-4aaa-bc18-bc538f24b6c0
2024-12-16 18:48:02.648000+00:00,tests_skipped,00be5d1d-d491-4197-9f97-8081fb4c0be6,3.0,,,['test_data_set_name:data.csv'],mrm,5ce4de35-a73b-4bc5-bce3-86a71375a185,619f1dae-67ae-4046-a949-9f21dbfd9f4c,subscription,3d629beb-9443-4aaa-bc18-bc538f24b6c0
2024-12-16 18:48:02.648000+00:00,tests_failed,00be5d1d-d491-4197-9f97-8081fb4c0be6,0.0,,,['test_data_set_name:data.csv'],mrm,5ce4de35-a73b-4bc5-bce3-86a71375a185,619f1dae-67ae-4046-a949-9f21dbfd9f4c,subscription,3d629beb-9443-4aaa-bc18-bc538f24b6c0
2024-12-16 18:47:11.144000+00:00,tests_passed,782a7753-a674-4f84-bcbe-77dde0483420,1.0,,,['test_data_set_name:data.csv'],mrm,5ce4de35-a73b-4bc5-bce3-86a71375a185,e4123f0f-b048-4b29-b326-438619176a13,subscription,3d629beb-9443-4aaa-bc18-bc538f24b6c0
2024-12-16 18:47:11.144000+00:00,tests_run,782a7753-a674-4f84-bcbe-77dde0483420,1.0,,,['test_data_set_name:data.csv'],mrm,5ce4de35-a73b-4bc5-bce3-86a71375a185,e4123f0f-b048-4b29-b326-438619176a13,subscription,3d629beb-9443-4aaa-bc18-bc538f24b6c0
2024-12-16 18:47:11.144000+00:00,tests_skipped,782a7753-a674-4f84-bcbe-77dde0483420,3.0,,,['test_data_set_name:data.csv'],mrm,5ce4de35-a73b-4bc5-bce3-86a71375a185,e4123f0f-b048-4b29-b326-438619176a13,subscription,3d629beb-9443-4aaa-bc18-bc538f24b6c0
2024-12-16 18:47:11.144000+00:00,tests_failed,782a7753-a674-4f84-bcbe-77dde0483420,0.0,,,['test_data_set_name:data.csv'],mrm,5ce4de35-a73b-4bc5-bce3-86a71375a185,e4123f0f-b048-4b29-b326-438619176a13,subscription,3d629beb-9443-4aaa-bc18-bc538f24b6c0
2024-12-16 18:46:18.119000+00:00,tests_passed,2ec4d78b-ac40-42ad-af23-35efcbc1d24e,1.0,,,['test_data_set_name:data.csv'],mrm,5ce4de35-a73b-4bc5-bce3-86a71375a185,9d379df3-bf10-418b-a13a-15d791e6c955,subscription,3d629beb-9443-4aaa-bc18-bc538f24b6c0
2024-12-16 18:46:18.119000+00:00,tests_run,2ec4d78b-ac40-42ad-af23-35efcbc1d24e,1.0,,,['test_data_set_name:data.csv'],mrm,5ce4de35-a73b-4bc5-bce3-86a71375a185,9d379df3-bf10-418b-a13a-15d791e6c955,subscription,3d629beb-9443-4aaa-bc18-bc538f24b6c0


Note: First 10 records were displayed.


## Step 6 - Display the Generative AI quality metrics <a name="genaimetrics"></a>

The monitor instance ID is required for reading the Generative AI quality metrics.

In [125]:
monitor_definition_id = "generative_ai_quality"
result = wos_client.monitor_instances.list(data_mart_id = data_mart_id,
                                           monitor_definition_id = monitor_definition_id,
                                           target_target_id = target_target_id,
                                           project_id = project_id).result
result_json = result._to_dict()
genaiquality_monitor_id = result_json["monitor_instances"][0]["metadata"]["id"]
genaiquality_monitor_id

'ad1584d0-f96e-4ff7-97dc-4b492d88fe89'

Display the Generative AI quality monitor metrics generated through the risk evaluation.

In [126]:
wos_client.monitor_instances.show_metrics(monitor_instance_id=genaiquality_monitor_id, project_id=project_id, limit=20)

0,1,2,3,4,5,6,7,8,9,10,11
2024-12-16 18:48:31.957200+00:00,rouge2,f53b74bb-04f8-4b6f-9bd5-dd5fcdbbb2aa,0.0,,,"['computed_on:feedback', 'field_type:subscription', 'aggregation_type:mean']",generative_ai_quality,ad1584d0-f96e-4ff7-97dc-4b492d88fe89,7d67c212-9908-4aa0-8884-167cd6d2d5b5,subscription,3d629beb-9443-4aaa-bc18-bc538f24b6c0
2024-12-16 18:48:31.957200+00:00,faithfulness,f53b74bb-04f8-4b6f-9bd5-dd5fcdbbb2aa,0.0,,,"['computed_on:feedback', 'field_type:subscription', 'aggregation_type:mean']",generative_ai_quality,ad1584d0-f96e-4ff7-97dc-4b492d88fe89,7d67c212-9908-4aa0-8884-167cd6d2d5b5,subscription,3d629beb-9443-4aaa-bc18-bc538f24b6c0
2024-12-16 18:48:31.957200+00:00,average_precision,f53b74bb-04f8-4b6f-9bd5-dd5fcdbbb2aa,0.0,,,"['computed_on:feedback', 'field_type:subscription', 'aggregation_type:mean']",generative_ai_quality,ad1584d0-f96e-4ff7-97dc-4b492d88fe89,7d67c212-9908-4aa0-8884-167cd6d2d5b5,subscription,3d629beb-9443-4aaa-bc18-bc538f24b6c0
2024-12-16 18:48:31.957200+00:00,records_processed,f53b74bb-04f8-4b6f-9bd5-dd5fcdbbb2aa,1.0,,,"['computed_on:feedback', 'field_type:subscription', 'aggregation_type:mean']",generative_ai_quality,ad1584d0-f96e-4ff7-97dc-4b492d88fe89,7d67c212-9908-4aa0-8884-167cd6d2d5b5,subscription,3d629beb-9443-4aaa-bc18-bc538f24b6c0
2024-12-16 18:48:31.957200+00:00,hit_rate,f53b74bb-04f8-4b6f-9bd5-dd5fcdbbb2aa,0.0,,,"['computed_on:feedback', 'field_type:subscription', 'aggregation_type:mean']",generative_ai_quality,ad1584d0-f96e-4ff7-97dc-4b492d88fe89,7d67c212-9908-4aa0-8884-167cd6d2d5b5,subscription,3d629beb-9443-4aaa-bc18-bc538f24b6c0
2024-12-16 18:48:31.957200+00:00,rougelsum,f53b74bb-04f8-4b6f-9bd5-dd5fcdbbb2aa,0.1111,,,"['computed_on:feedback', 'field_type:subscription', 'aggregation_type:mean']",generative_ai_quality,ad1584d0-f96e-4ff7-97dc-4b492d88fe89,7d67c212-9908-4aa0-8884-167cd6d2d5b5,subscription,3d629beb-9443-4aaa-bc18-bc538f24b6c0
2024-12-16 18:48:31.957200+00:00,answer_relevance,f53b74bb-04f8-4b6f-9bd5-dd5fcdbbb2aa,1.0,,,"['computed_on:feedback', 'field_type:subscription', 'aggregation_type:mean']",generative_ai_quality,ad1584d0-f96e-4ff7-97dc-4b492d88fe89,7d67c212-9908-4aa0-8884-167cd6d2d5b5,subscription,3d629beb-9443-4aaa-bc18-bc538f24b6c0
2024-12-16 18:48:31.957200+00:00,reciprocal_rank,f53b74bb-04f8-4b6f-9bd5-dd5fcdbbb2aa,0.0,,,"['computed_on:feedback', 'field_type:subscription', 'aggregation_type:mean']",generative_ai_quality,ad1584d0-f96e-4ff7-97dc-4b492d88fe89,7d67c212-9908-4aa0-8884-167cd6d2d5b5,subscription,3d629beb-9443-4aaa-bc18-bc538f24b6c0
2024-12-16 18:48:31.957200+00:00,ndcg,f53b74bb-04f8-4b6f-9bd5-dd5fcdbbb2aa,1.0,,,"['computed_on:feedback', 'field_type:subscription', 'aggregation_type:mean']",generative_ai_quality,ad1584d0-f96e-4ff7-97dc-4b492d88fe89,7d67c212-9908-4aa0-8884-167cd6d2d5b5,subscription,3d629beb-9443-4aaa-bc18-bc538f24b6c0
2024-12-16 18:48:31.957200+00:00,retrieval_precision,f53b74bb-04f8-4b6f-9bd5-dd5fcdbbb2aa,0.0,,,"['computed_on:feedback', 'field_type:subscription', 'aggregation_type:mean']",generative_ai_quality,ad1584d0-f96e-4ff7-97dc-4b492d88fe89,7d67c212-9908-4aa0-8884-167cd6d2d5b5,subscription,3d629beb-9443-4aaa-bc18-bc538f24b6c0


Note: First 20 records were displayed.
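
In the output above `records_processed` is 1.0, so each mean is computed over a single record and should be read accordingly. To flag metrics that fall below a target, a small comparison like the following sketch can help. The threshold values and the helper name `metrics_below_threshold` are illustrative assumptions, not values configured by this notebook.

```python
def metrics_below_threshold(metrics, thresholds):
    """Return the metrics whose value is lower than their threshold."""
    return {
        name: value
        for name, value in metrics.items()
        if name in thresholds and value < thresholds[name]
    }


# Values copied from the output above; the thresholds are hypothetical.
observed = {"faithfulness": 0.0, "answer_relevance": 1.0, "rougelsum": 0.1111}
hypothetical_thresholds = {"faithfulness": 0.7, "answer_relevance": 0.7}
flagged = metrics_below_threshold(observed, hypothetical_thresholds)
```

Metrics without an entry in `hypothetical_thresholds`, such as `rougelsum` here, are ignored rather than flagged.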


### Display record-level metrics for Generative AI quality

Get the dataset ID for the Generative AI quality dataset:

In [127]:
result = wos_client.data_sets.list(target_target_id = dev_subscription_id,
                                target_target_type = "subscription",
                                type = "gen_ai_quality_metrics").result

genaiq_dataset_id = result.data_sets[0].metadata.id
genaiq_dataset_id

'dece6484-1d2b-4edf-98b0-f75a7cb6fa39'

Display record-level metrics for Generative AI quality:

In [128]:
wos_client.data_sets.show_records(data_set_id = genaiq_dataset_id)

0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18
0.0,0.0,0.0,MRM_695ae977-5666-4ea7-940b-e6f987a73f5d-0,0.0,feedback,2024-12-16T18:47:56.873Z,0.1111,1.0,"{'faithfulness': 0.0, 'faithfulness_attributions': []}","{'context_columns': ['retrieved_contexts'], 'context_relevances': [0.0]}",0.0,1.0,0.0,7d67c212-9908-4aa0-8884-167cd6d2d5b5,0.0,0.1111,0.0,0.1111
0.1169,0.2,1.0,MRM_922f8294-37c1-4574-b669-9fd343f9b45c-0,1.0,feedback,2024-12-16T18:47:04.842Z,0.3205,1.0,"{'faithfulness': 0.2, 'faithfulness_attributions': []}","{'context_columns': ['retrieved_contexts'], 'context_relevances': [0.8]}",1.0,1.0,1.0,60810f6c-5599-4410-83d0-c042f5dfe98b,0.8,0.1795,0.0,0.3205
0.0,0.2,0.0,MRM_f53caa74-d6c8-4b37-b38d-1ff0eeb06bed-0,0.0,feedback,2024-12-16T18:46:07.141Z,0.1842,0.8,"{'faithfulness': 0.2, 'faithfulness_attributions': []}","{'context_columns': ['retrieved_contexts'], 'context_relevances': [0.2]}",0.0,1.0,0.0,9cd60e2c-402d-4c75-9545-2c8d14a42d22,0.2,0.1842,0.0,0.1842
0.1026,0.2,1.0,MRM_729c52ff-f34b-4e13-8c06-b9f5685b42f4-0,1.0,feedback,2024-12-16T18:45:06.935Z,0.2,0.2,"{'faithfulness': 0.2, 'faithfulness_attributions': []}","{'context_columns': ['retrieved_contexts'], 'context_relevances': [0.8]}",1.0,1.0,1.0,0ce8cecb-6a66-4541-b866-8bb860dea749,0.8,0.2,0.0,0.225
0.1021,0.2,0.0,MRM_96b8a3c7-77fd-4bc9-9d36-fc62016d82b0-0,0.0,feedback,2024-12-16T18:44:16.331Z,0.2869,1.0,"{'faithfulness': 0.2, 'faithfulness_attributions': []}","{'context_columns': ['retrieved_contexts'], 'context_relevances': [0.2]}",0.0,1.0,0.0,fb065e17-af6c-4497-b9de-541bbac68f94,0.2,0.1941,0.2,0.3207
0.1033,0.2,0.0,MRM_a4c28441-e568-4299-bf5f-0b2c1ae5550b-0,0.0,feedback,2024-12-16T18:43:18.227Z,0.307,0.2,"{'faithfulness': 0.2, 'faithfulness_attributions': []}","{'context_columns': ['retrieved_contexts'], 'context_relevances': [0.2]}",0.0,1.0,0.0,11c13cf8-6dd0-4036-9af1-0fecc3c9c92d,0.2,0.2047,0.0,0.3721
0.1928,0.2,0.0,MRM_3f140d24-2c55-4fd7-9374-ead2c9a3382d-0,0.0,feedback,2024-12-16T18:42:25.730Z,0.2353,0.6,"{'faithfulness': 0.2, 'faithfulness_attributions': []}","{'context_columns': ['retrieved_contexts'], 'context_relevances': [0.2]}",0.0,1.0,0.0,b785efd8-e156-4932-9f47-b7286c9d02b5,0.2,0.2353,0.0,0.2824
0.5405,0.2,0.0,MRM_5b2dbdd4-200c-4c3d-bf76-74d39e8d955e-0,0.0,feedback,2024-12-16T18:41:34.030Z,0.6154,0.2,"{'faithfulness': 0.2, 'faithfulness_attributions': []}","{'context_columns': ['retrieved_contexts'], 'context_relevances': [0.2]}",0.0,1.0,0.0,e37e32b9-85cc-4445-a349-3257247f7463,0.2,0.6154,0.0,0.6667
0.5149,0.2,0.0,MRM_2bda6a9f-b7ba-447d-9697-ddf9e2bb9223-0,0.0,feedback,2024-12-16T18:40:43.439Z,0.5631,0.8,"{'faithfulness': 0.2, 'faithfulness_attributions': []}","{'context_columns': ['retrieved_contexts'], 'context_relevances': [0.2]}",0.0,1.0,0.0,744ab65a-3b4c-498a-8e62-af0de83b937d,0.2,0.6214,0.0,0.6214
0.0185,0.2,1.0,MRM_9e807983-42a7-43fe-a176-898d7a75b1e4-0,1.0,feedback,2024-12-16T18:39:45.861Z,0.1455,0.8,"{'faithfulness': 0.2, 'faithfulness_attributions': []}","{'context_columns': ['retrieved_contexts'], 'context_relevances': [0.8]}",1.0,1.0,1.0,7daf2ad8-93d0-49b9-ab4b-05728292f807,0.8,0.0727,0.2,0.1818


## Step 7 - Plot faithfulness and answer relevance metrics against records <a name="plotproject"></a>

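Retrieve a list of records and extract the record IDs, faithfulness values, and answer relevance values. The cell also saves the flattened records to the project as CSV files, once in the returned order and once in reversed order: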

In [129]:
import csv

import pandas as pd
from project_lib import Project

result = wos_client.data_sets.get_list_of_records(data_set_id=genaiq_dataset_id).result
records = result["records"]

# Collect the plot inputs and flatten every record into one table row in a single pass
x = []
y_faithfulness = []
y_answer_relevance = []
table_data = []
for record in records:
    values = record["entity"]["values"]
    x.append(record["metadata"]["id"][-5:])  # Only the last 5 characters, to fit in the display
    y_faithfulness.append(values["faithfulness"])
    y_answer_relevance.append(values["answer_relevance"])
    table_data.append({**values, **record["metadata"]})

# Save all records to a local CSV file and to the project
consolidated_df = pd.DataFrame(table_data)
consolidated_df.to_csv("watsonx_llm_as_judge_2.csv", index=False)

project = Project(project_id=project_id)
# overwrite=True replaces an existing asset of the same name, so re-running this
# cell does not fail with "An asset named ... already exists."
project.save_data(file_name="watsonx_llm_as_judge_2.csv",
                  data=consolidated_df.to_csv(index=False),
                  overwrite=True)


def reverse_csv(input_file, output_file):
    """Write the data rows of input_file in reverse order, keeping the header row first."""
    with open(input_file, mode="r", newline="") as infile:
        header, *rows = list(csv.reader(infile))

    rows.reverse()

    with open(output_file, mode="w", newline="") as outfile:
        writer = csv.writer(outfile)
        writer.writerow(header)
        writer.writerows(rows)


input_file = "watsonx_llm_as_judge_2.csv"
output_file = "watsonx_llm_as_judge_2_reversed.csv"
reverse_csv(input_file, output_file)

consolidated_df = pd.read_csv(output_file)
project.save_data(file_name=output_file,
                  data=consolidated_df.to_csv(index=False),
                  overwrite=True)

Plot faithfulness metrics against the records:

In [None]:
import matplotlib.pyplot as plt
plt.scatter(x, y_faithfulness, marker='o')

# Adding labels and title
plt.xlabel('X-axis - Record id (last 5 characters)')
plt.ylabel('Y-axis - Faithfulness')
plt.title('faithfulness vs record id')

# Display the graph
plt.show()

Plot answer relevance metrics against the records:

In [None]:
import matplotlib.pyplot as plt
plt.scatter(x, y_answer_relevance, marker='o')

# Adding labels and title
plt.xlabel('X-axis - Record id (last 5 characters)')
plt.ylabel('Y-axis - Answer relevance')
plt.title('answer_relevance vs record id')

# Display the graph
plt.show()
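
Scatter plots show the spread per record, but a few summary numbers are often easier to compare between runs. A minimal sketch using only the standard library follows. It assumes the record layout used above (`record["entity"]["values"][metric]`), and the helper name `summarize_metric` is hypothetical.

```python
from statistics import mean


def summarize_metric(records, metric):
    """Return count, mean, min, and max of `metric` over the given records."""
    values = [
        record["entity"]["values"][metric]
        for record in records
        if record["entity"]["values"].get(metric) is not None
    ]
    if not values:
        return None  # metric missing from every record
    return {
        "count": len(values),
        "mean": mean(values),
        "min": min(values),
        "max": max(values),
    }


# Illustrative records with the same nesting as result["records"]
example_records = [
    {"entity": {"values": {"faithfulness": 0.0, "answer_relevance": 1.0}}},
    {"entity": {"values": {"faithfulness": 0.5, "answer_relevance": 1.0}}},
    {"entity": {"values": {"faithfulness": 1.0, "answer_relevance": 0.5}}},
]
faithfulness_summary = summarize_metric(example_records, "faithfulness")
```

In the notebook this would be called as `summarize_metric(result["records"], "faithfulness")` and `summarize_metric(result["records"], "answer_relevance")`.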

## Step 8 - See factsheets information <a name="factsheetsspace"></a>

In [None]:
factsheets_url = "{}/wx/prompt-details/{}/factsheet?context=wx&project_id={}".format(WML_CREDENTIALS["url"], project_pta_id, project_id)
print("You can view the published facts for this project at {}".format(factsheets_url))
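
If the base URL carries a trailing slash or an ID contains characters that need escaping, plain string formatting can produce a malformed link. A minimal sketch that builds the same URL pattern with `urllib.parse` is shown here. The helper name `build_factsheet_url` and the example host are hypothetical.

```python
from urllib.parse import quote, urlencode


def build_factsheet_url(base_url, prompt_template_asset_id, project_id):
    """Build the factsheet URL for a prompt template asset in a project."""
    query = urlencode({"context": "wx", "project_id": project_id})
    asset = quote(prompt_template_asset_id, safe="")
    # rstrip avoids a double slash when the base URL ends with "/"
    return f"{base_url.rstrip('/')}/wx/prompt-details/{asset}/factsheet?{query}"


# IDs taken from the evaluation output earlier in this notebook; the host is a placeholder
example_url = build_factsheet_url(
    "https://cpd.example.com/",
    "afca833f-aeb7-4c8f-b072-1c726081ff43",
    "36e48345-0549-4909-8440-d050e3d59b8a",
)
```

With the notebook's variables this would be called as `build_factsheet_url(WML_CREDENTIALS["url"], project_pta_id, project_id)`.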

## Congratulations!

You have completed this notebook. You can now navigate to the prompt template asset in your OpenScale project / space and click on the `Evaluate` tab to visualize the results in the UI.

watsonx.governance

Copyright © 2024 IBM.