# Data Quality Check Demo

This notebook shows you how to improve the quality of your synthetic data with Data Quality Checks.

In [1]:
import os
from okareo import Okareo

OKAREO_API_KEY = os.environ['OKAREO_API_KEY']
okareo = Okareo(api_key=OKAREO_API_KEY)

First, upload a seed scenario.

In [2]:
# get seed scenario data points
seed_scenario = okareo.upload_scenario_set(
    scenario_name="WebBizz Articles",
    file_path="webbizz_10_articles.jsonl", # name already exists
)
seed_sdp = okareo.get_scenario_data_points(seed_scenario.scenario_id)
seed_sdp_by_id = {dp.id: {'input': dp.input_, 'label': dp.result} for dp in seed_sdp}

Next, generate questions from the seed scenario using our reverse question generator.

In [4]:
from okareo_api_client.models.scenario_set_generate import ScenarioSetGenerate, ScenarioType

vanilla_generate_request = ScenarioSetGenerate(
    name="Webbizz Articles - REVERSE_QUESTION",
    source_scenario_id=seed_scenario.scenario_id,
    number_examples=1,
    generation_type=ScenarioType.TEXT_REVERSE_QUESTION,
)

vanilla_generated_scenario = okareo.generate_scenario_set(vanilla_generate_request)
vanilla_generated_sdp = okareo.get_scenario_data_points(vanilla_generated_scenario.scenario_id)

In [6]:
for data in vanilla_generated_scenario.scenario_data:
    print(data.input_)

How can I get personalized product recommendations and faster checkout processes when shopping?
What measures does WebBizz use to ensure the security of customer data?
What kind of benefits can members of a special program at a shopping service typically get?
Can you explain how a Wishlist feature might benefit me when shopping online?
How can I return a product I’m not satisfied with?
Where can I find helpful guides to troubleshoot my technical issues?
How can I sort products to find what I need quickly on WebBizz?
Does WebBizz offer any exclusive deals or sales on their products?
What's one of the perks of subscribing to a newsletter?
What actions can I take to support a more environmentally-friendly lifestyle through my shopping choices?


Some of these questions are not specific to WebBizz or its product offerings.
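A rough way to quantify this is to count how many questions name the business. The sketch below is only a heuristic: `mentions_business` is a hypothetical helper, and a name match is a crude proxy, since a question can be specific to WebBizz without mentioning it.

```python
# Questions copied from the output of the cell above.
questions = [
    "How can I get personalized product recommendations and faster checkout processes when shopping?",
    "What measures does WebBizz use to ensure the security of customer data?",
    "What kind of benefits can members of a special program at a shopping service typically get?",
    "Can you explain how a Wishlist feature might benefit me when shopping online?",
    "How can I return a product I’m not satisfied with?",
    "Where can I find helpful guides to troubleshoot my technical issues?",
    "How can I sort products to find what I need quickly on WebBizz?",
    "Does WebBizz offer any exclusive deals or sales on their products?",
    "What's one of the perks of subscribing to a newsletter?",
    "What actions can I take to support a more environmentally-friendly lifestyle through my shopping choices?",
]


def mentions_business(question: str, name: str = "WebBizz") -> bool:
    """Hypothetical helper: case-insensitive check for the business name."""
    return name.lower() in question.lower()


named = [q for q in questions if mentions_business(q)]
print(f"{len(named)}/{len(questions)} questions mention WebBizz")
```

Only a minority of the generated questions name the business, which is what motivates a stricter filter.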

Let's try to make more specific questions by defining a data quality check to filter our synthetic data.

In [7]:
from okareo.checks import CheckOutputType

generate_request = ScenarioSetGenerate(
    name="Webbizz Articles - REVERSE_QUESTION (specific)",
    source_scenario_id=seed_scenario.scenario_id,
    number_examples=1,
    generation_type=ScenarioType.TEXT_REVERSE_QUESTION,
    checks=[
        {
            "name": "reverse_qa_specific",
            "description": "Check if the question is specific enough to the business described in the context.",
            "check_config": {
                "prompt_template": "Return True if the Question is specific to the business described in the given Context. Return False if the Question can be answered based on general information/common knowledge or doesn't relate to the specific business.\n\nContext: {input}\n\nQuestion: {generation}\n\nAnswer: ",
                "type": CheckOutputType.PASS_FAIL.value,
            },
        }
    ]
)
generated_scenario = okareo.generate_scenario_set(generate_request)

generated_sdp = okareo.get_scenario_data_points(generated_scenario.scenario_id)
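To see what the judge model might receive, here is a minimal sketch that fills the two placeholders with plain `str.format`. It assumes `{input}` is replaced by the seed article text and `{generation}` by the generated question, as the "Context"/"Question" labels in the template suggest; how Okareo actually renders the template is not shown in this notebook. The `context` string is a made-up stand-in, not a real WebBizz article.

```python
# Same wording as the check's prompt_template, split across lines for readability.
PROMPT_TEMPLATE = (
    "Return True if the Question is specific to the business described in the given Context. "
    "Return False if the Question can be answered based on general information/common knowledge "
    "or doesn't relate to the specific business.\n\n"
    "Context: {input}\n\nQuestion: {generation}\n\nAnswer: "
)

# Hypothetical seed text, used only for illustration.
context = "WebBizz Rewards members earn points on every purchase."
# A generated question taken from the outputs in this notebook.
question = "What benefits do I get from joining the WebBizz Rewards program?"

# Assumption: placeholders are substituted by name, as str.format does.
prompt = PROMPT_TEMPLATE.format(input=context, generation=question)
print(prompt)
```

The rendered prompt ends with `Answer: `, leaving the model to reply with `True` or `False`, which matches the `PASS_FAIL` output type.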

This generation took slightly longer for two reasons:
- We use the results of the `PASS_FAIL` check to filter the generated data, so we generate extra data to reach the desired number of data points after filtering.
- We apply the `checks` to each generated row. For `ModelBasedChecks`, this can be time-consuming.
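The two points above can be pictured with a minimal over-generate-then-filter loop. This is a sketch, not Okareo's implementation: `generate_with_filter`, `toy_generate`, `toy_check`, the `oversample` factor, and `max_rounds` are all hypothetical names chosen for illustration.

```python
import itertools
from typing import Callable, List, Tuple


def generate_with_filter(
    generate: Callable[[int], List[str]],
    check: Callable[[str], bool],
    target: int,
    oversample: int = 2,
    max_rounds: int = 5,
) -> Tuple[List[str], List[str]]:
    """Request extra rows each round and keep only those that pass the check."""
    passed: List[str] = []
    failed: List[str] = []
    for _ in range(max_rounds):
        if len(passed) >= target:
            break
        # Ask for more rows than are still missing, since some will be filtered out.
        for row in generate((target - len(passed)) * oversample):
            # The check runs once per generated row, which is where the extra time goes.
            (passed if check(row) else failed).append(row)
    return passed, failed


# Toy stand-ins for the generator and the model-based check.
_pool = itertools.cycle([
    "What benefits do I get from joining the WebBizz Rewards program?",
    "What are some benefits of subscribing to a newsletter?",
])


def toy_generate(n: int) -> List[str]:
    return [next(_pool) for _ in range(n)]


def toy_check(row: str) -> bool:
    return "WebBizz" in row


passed, failed = generate_with_filter(toy_generate, toy_check, target=3)
print(len(passed), "passed,", len(failed), "failed")
```

With a real model-based check, each call to `check` is a model request, so the cost grows with the number of rows generated rather than the number kept.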

Let's take a look at the generated data that was filtered by our `reverse_qa_specific` check.

In [8]:
for data in generated_scenario.scenario_data:
    print(data.input_)

What measures does WebBizz use to ensure the security of customer data?
How does WebBizz handle product delivery and inform customers about the status of their orders?
What kind of membership acknowledges loyal customers at WebBizz and what benefits does it offer?
Are there any pre-sale opportunities for members of any special program at WebBizz?
How long do I have to return an item if I'm not happy with my purchase?
What benefits do I get from joining the WebBizz Rewards program?
Where can I find helpful guides to troubleshoot my technical issues?
Is there a place that provides steps for resolving common user problems?
How can I easily find products based on their ratings and popularity?
What options do I have for sorting products to make my shopping experience smoother?
What kind of special promotions can I find on WebBizz?
How does WebBizz make sure to offer a variety of products?
Do subscribers often receive special offers or discounts?
What actions does WebBizz take to support sus

More of these rows either mention "WebBizz" directly or ask a more specific question about WebBizz.

We can also view the "failed" rows that were filtered out by the check we created.

In [9]:
for data in generated_scenario.failed_data:
    print(data.input_)

How can I get help if I have a question about my order at an online store?
What are the benefits of logging into an online shopping account before making a purchase?
Can you explain how a feature that lets you save products for later helps with shopping?
What are the benefits of using a system that offers special promotions on items you've shown interest in?
What are some benefits of subscribing to a newsletter?


These rows contain vague language that is not specific to WebBizz or its offerings, so the check seems to be doing a good job of filtering our synthetic data.

## Combining Checks

We can use more than one check to filter our data. Let's generate data using the predefined `consistency` check together with the custom check we created above.

In [10]:
consistency_request = ScenarioSetGenerate(
    name="Webbizz Articles - REVERSE_QUESTION (two checks)",
    source_scenario_id=seed_scenario.scenario_id,
    number_examples=1,
    generation_type=ScenarioType.TEXT_REVERSE_QUESTION,
    checks=[
        "consistency", # predefined check
        "reverse_qa_specific", # custom check from before
    ]
)
consistency_scenario = okareo.generate_scenario_set(consistency_request)

Let's look at the filtered data along with the corresponding check values. The filtered data should have:
- Relatively high `consistency` (compared to the `failed_data`) AND
- `reverse_qa_specific == True`

In [13]:
for i, data in enumerate(consistency_scenario.scenario_data):
    print(f"#{i} - {data.input_}")
    print(f"#{i} - {data.meta_data.additional_properties['checks']}")

#0 - Is there a way to get faster checkout processes?
#0 - {'consistency': 4.814105667168666, 'reverse_qa_specific': True}
#1 - Are there any pre-sale opportunities for members of any special program at WebBizz?
#1 - {'consistency': 4.403081052062947, 'reverse_qa_specific': True}
#2 - Can you explain how the Wishlist feature helps in managing my shopping preferences over time?
#2 - {'consistency': 3.4808619371200806, 'reverse_qa_specific': True}
#3 - What benefits does the Wishlist offer when I'm ready to make a purchase?
#3 - {'consistency': 3.6777785409306993, 'reverse_qa_specific': True}
#4 - How long do I have to return an item if I'm not happy with my purchase?
#4 - {'consistency': 4.888951533476209, 'reverse_qa_specific': True}
#5 - Where can I find helpful guides to troubleshoot my technical issues?
#5 - {'consistency': 4.9627813717214435, 'reverse_qa_specific': True}
#6 - What kind of special promotions can I find on WebBizz?
#6 - {'consistency': 4.739095341213009, 'reverse_qa_

Conversely, the failed data should have:
- Relatively low `consistency` (compared to the `scenario_data`) OR
- `reverse_qa_specific == False`

In [14]:
for i, data in enumerate(consistency_scenario.failed_data):
    print(f"#{i} - {data.input_}")
    print(f"#{i} - {data.meta_data.additional_properties['checks']}")

#0 - How can I get help if I have questions about my order?
#0 - {'consistency': 1.1178213432462065, 'reverse_qa_specific': True}
#1 - What measures does WebBizz use to ensure the security of customer data?
#1 - {'consistency': 2.6760806297729336, 'reverse_qa_specific': True}
#2 - How does WebBizz handle product delivery and inform customers about the status of their orders?
#2 - {'consistency': 1.8538589990873957, 'reverse_qa_specific': True}
#3 - What kind of membership acknowledges loyal customers at WebBizz and what benefits does it offer?
#3 - {'consistency': 2.5772645245427177, 'reverse_qa_specific': True}
#4 - What benefits do I get from joining the WebBizz Rewards program?
#4 - {'consistency': 2.7965437382454956, 'reverse_qa_specific': True}
#5 - Is there a place that provides steps for resolving common user problems?
#5 - {'consistency': 2.649361166317781, 'reverse_qa_specific': True}
#6 - How can I sort products to find what I need quickly on an online shop?
#6 - {'consistenc

We can see that either the `consistency` is low (in this case, less than 3) or the `reverse_qa_specific` field is `False`.
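The combined rule can be written out as a small predicate. In this sketch the cutoff of 3 is an assumption read off the outputs above, not a documented Okareo setting, and `passes_all_checks` is a hypothetical helper. The first four rows reuse check values printed above; the last row is made up to illustrate the `False` case.

```python
# Assumption: cutoff inferred from the printed outputs, not an official threshold.
CONSISTENCY_THRESHOLD = 3.0


def passes_all_checks(checks: dict) -> bool:
    """A row is kept only if consistency is high enough AND the custom check passed."""
    return (
        checks["consistency"] >= CONSISTENCY_THRESHOLD
        and checks["reverse_qa_specific"] is True
    )


rows = [
    ("Is there a way to get faster checkout processes?",
     {"consistency": 4.814105667168666, "reverse_qa_specific": True}),
    ("Can you explain how the Wishlist feature helps in managing my shopping preferences over time?",
     {"consistency": 3.4808619371200806, "reverse_qa_specific": True}),
    ("How can I get help if I have questions about my order?",
     {"consistency": 1.1178213432462065, "reverse_qa_specific": True}),
    ("What measures does WebBizz use to ensure the security of customer data?",
     {"consistency": 2.6760806297729336, "reverse_qa_specific": True}),
    # Hypothetical row: high consistency, but the custom check failed.
    ("What are some benefits of subscribing to a newsletter?",
     {"consistency": 4.5, "reverse_qa_specific": False}),
]

kept = [question for question, checks in rows if passes_all_checks(checks)]
dropped = [question for question, checks in rows if not passes_all_checks(checks)]
print(f"kept {len(kept)}, dropped {len(dropped)}")
```

A row is dropped as soon as either condition fails, which mirrors the AND/OR description of `scenario_data` and `failed_data` above.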

You can use Okareo's predefined checks or your own custom checks to improve the quality of your synthetic data!