# Generate Mock Data

This notebook demonstrates how to use the Intelligence Toolkit library to generate mock data: both structured records and unstructured texts.

See the [README](https://github.com/microsoft/intelligence-toolkit/blob/main/app/workflows/generate_mock_data/README.md) for more details.

In [1]:
import json
import os
import sys

import pandas as pd

# Add the parent directory to the import path so the local
# intelligence_toolkit package can be found
sys.path.append("..")

from intelligence_toolkit.AI.openai_configuration import OpenAIConfiguration
from intelligence_toolkit.generate_mock_data import GenerateMockData

In [2]:
# Create the workflow object
gmd = GenerateMockData()
# Set the AI configuration (the OPENAI_API_KEY environment variable must be set)
ai_configuration = OpenAIConfiguration(
    {
        "api_type": "OpenAI",
        "api_key": os.environ["OPENAI_API_KEY"],
        "model": "gpt-4o",
    }
)
gmd.set_ai_configuration(ai_configuration)
# Load the data schema
schema_path = "../example_outputs/generate_mock_data/customer_complaints/customer_complaints_schema.json"
# Use a context manager so the file handle is closed after reading
with open(schema_path, "r") as schema_file:
    json_schema = json.load(schema_file)
# Set the schema
gmd.set_schema(json_schema)
print("Loaded data schema")
print(json_schema)

Loaded data schema
{'$schema': 'http://json-schema.org/draft/2020-12/schema', 'title': 'Customer complaints', 'description': 'An example schema storing an array of customer complaints', 'type': 'object', 'properties': {'customer_complaints': {'type': 'array', 'description': 'The list of customers and their complaints', 'items': {'type': 'object', 'properties': {'name': {'type': 'string', 'description': 'The name of the customer'}, 'street': {'type': 'string', 'description': 'The street of the customer, including property name/number'}, 'city': {'type': 'string', 'description': 'The city of the customer'}, 'age': {'type': 'number', 'description': 'The age of the customer'}, 'email': {'type': 'string', 'description': 'The email address of the customer'}, 'price_issue': {'type': 'boolean', 'description': 'The complaint is a price issue'}, 'quality_issue': {'type': 'boolean', 'description': 'The complaint is a quality issue'}, 'service_issue': {'type': 'boolean', 'description': 'The compla
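The schema printed above nests each record type under a top-level array field, which is why the later cells expose one dataframe per array field. As a minimal sketch of that structure, the snippet below walks an abbreviated copy of the schema and lists each array field with the JSON types of its item properties. The helper `list_array_fields` is hypothetical and not part of the Intelligence Toolkit API, and only three of the printed properties are reproduced here.

```python
# Abbreviated copy of the schema printed above; only three of the
# item properties are reproduced (an illustrative subset).
schema = {
    "type": "object",
    "properties": {
        "customer_complaints": {
            "type": "array",
            "description": "The list of customers and their complaints",
            "items": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "age": {"type": "number"},
                    "price_issue": {"type": "boolean"},
                },
            },
        }
    },
}


def list_array_fields(schema: dict) -> dict:
    """Hypothetical helper: map each top-level array field to
    {item property name: JSON type}."""
    result = {}
    for field, spec in schema.get("properties", {}).items():
        if spec.get("type") == "array":
            item_props = spec.get("items", {}).get("properties", {})
            result[field] = {
                name: prop.get("type", "unknown")
                for name, prop in item_props.items()
            }
    return result


array_fields = list_array_fields(schema)
print(array_fields)
```

Running the same helper on the full `json_schema` loaded in the cell above should list every property of `customer_complaints`, assuming the schema keeps this shape.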

In [3]:
# Generate mock data records
await gmd.generate_data_records(
    num_records_overall=100,
    records_per_batch=10,
    duplicate_records_per_batch=1,
    related_records_per_batch=1,
)
print("Generated data records")

100%|██████████| 10/10 [00:20<00:00,  2.09s/it]

Generated data records
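The progress bar above reports 10 steps, which matches 100 records requested in batches of 10. The sketch below reproduces that arithmetic. It assumes, based only on the parameter names, that the per-batch duplicate and related counts apply to every batch; the library's internal batching logic is not shown in this notebook, and `plan_batches` is a hypothetical helper, not a library function.

```python
import math


def plan_batches(
    num_records_overall: int,
    records_per_batch: int,
    duplicate_records_per_batch: int,
    related_records_per_batch: int,
) -> dict:
    """Hypothetical helper: estimate batch count and per-run totals.

    Assumes every batch requests the same number of duplicate and
    related records, which is inferred from the parameter names.
    """
    if records_per_batch <= 0:
        raise ValueError("records_per_batch must be positive")
    num_batches = math.ceil(num_records_overall / records_per_batch)
    return {
        "num_batches": num_batches,
        "duplicates_requested": num_batches * duplicate_records_per_batch,
        "related_requested": num_batches * related_records_per_batch,
    }


plan = plan_batches(
    num_records_overall=100,
    records_per_batch=10,
    duplicate_records_per_batch=1,
    related_records_per_batch=1,
)
print(plan)
```

Under these assumptions the run above would ask for roughly 10 near-duplicate and 10 related records across the 100 generated.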

In [4]:
# Inspect the data as JSON
print(gmd.json_object)

{'customer_complaints': [{'name': 'Jon Doe', 'street': '123 Elm St.', 'city': 'Springfield', 'age': 35, 'email': 'jon.doe@example.com', 'price_issue': True, 'quality_issue': False, 'service_issue': False, 'delivery_issue': False, 'description_issue': False, 'product_code': 'A', 'quarter': '2022-Q3'}, {'name': 'Jane Doe', 'street': '123 Elm Street', 'city': 'Springfield', 'age': 32, 'email': 'janedoe@example.com', 'price_issue': False, 'quality_issue': True, 'service_issue': False, 'delivery_issue': False, 'description_issue': False, 'product_code': 'A', 'quarter': '2022-Q3'}, {'name': 'Alice Smith', 'street': '456 Oak Avenue', 'city': 'Shelbyville', 'age': 28, 'email': 'alice.smith@example.com', 'price_issue': False, 'quality_issue': True, 'service_issue': False, 'delivery_issue': False, 'description_issue': True, 'product_code': 'B', 'quarter': '2022-Q4'}, {'name': 'Bob Johnson', 'street': '789 Pine Road', 'city': 'Capital City', 'age': 45, 'email': 'b.johnson@example.com', 'price_iss
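Because `gmd.json_object` is a plain dictionary of lists, it can be summarized with the standard library alone. As a minimal sketch, the snippet below tallies the boolean issue flags over the first three records from the output above (copied here so the example runs on its own). In the notebook, `records` could instead be taken from `gmd.json_object["customer_complaints"]`.

```python
from collections import Counter

# First three records copied from the output above.
records = [
    {"name": "Jon Doe", "street": "123 Elm St.", "city": "Springfield",
     "age": 35, "email": "jon.doe@example.com", "price_issue": True,
     "quality_issue": False, "service_issue": False,
     "delivery_issue": False, "description_issue": False,
     "product_code": "A", "quarter": "2022-Q3"},
    {"name": "Jane Doe", "street": "123 Elm Street", "city": "Springfield",
     "age": 32, "email": "janedoe@example.com", "price_issue": False,
     "quality_issue": True, "service_issue": False,
     "delivery_issue": False, "description_issue": False,
     "product_code": "A", "quarter": "2022-Q3"},
    {"name": "Alice Smith", "street": "456 Oak Avenue", "city": "Shelbyville",
     "age": 28, "email": "alice.smith@example.com", "price_issue": False,
     "quality_issue": True, "service_issue": False,
     "delivery_issue": False, "description_issue": True,
     "product_code": "B", "quarter": "2022-Q4"},
]

ISSUE_FIELDS = [
    "price_issue",
    "quality_issue",
    "service_issue",
    "delivery_issue",
    "description_issue",
]

# Count how many records have each issue flag set to True.
issue_counts = Counter()
for record in records:
    for field in ISSUE_FIELDS:
        if record.get(field):
            issue_counts[field] += 1

print(dict(issue_counts))
```

A tally like this is a quick way to check that the generated flags are reasonably varied before moving on to text generation.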

In [5]:
# Inspect the data as dataframes (one per array field)
print(gmd.array_dfs)

{'customer_complaints':              name               street          city  age  \
0         Jon Doe          123 Elm St.   Springfield   35   
1        Jane Doe       123 Elm Street   Springfield   32   
2     Alice Smith       456 Oak Avenue   Shelbyville   28   
3     Bob Johnson        789 Pine Road  Capital City   45   
4       Eve Adams       321 Maple Lane    Ogdenville   39   
..            ...                  ...           ...  ...   
95    Sarah Black     321 Maple Street   Summerville   41   
96       Tom Blue  654 Birch Boulevard      Lakeside   36   
97      Emily Red      987 Cedar Court  Mountainview   27   
98  Oliver Yellow    159 Spruce Street       Seaside   50   
99   Grace Violet       753 Willow Way   Forestville   34   

                       email  price_issue  quality_issue  service_issue  \
0        jon.doe@example.com         True          False          False   
1        janedoe@example.com        False           True          False   
2    alice.smith@e

In [6]:
# Use the customer_complaints dataframe to generate mock text data (first 10 records only)
df = gmd.array_dfs["customer_complaints"][:10]
await gmd.generate_text_data(df)
print("Generated text data")

100%|██████████| 10/10 [00:16<00:00,  1.69s/it]

Generated text data

In [7]:
# Inspect texts as dataframe
print(gmd.text_df)

                                           mock_text
0  **Customer Feedback Report**\n\n**Customer Inf...
1  **Customer Feedback Report**\n\n**Customer Inf...
2  ---\n\n**Customer Feedback Report**\n\n**Custo...
3  ---\n\n**Customer Feedback Report**\n\n**Name:...
4  # Customer Service Report\n\n## Customer Infor...
5  **Customer Support Report**\n\n**Customer Info...
6  **Customer Complaint Report**\n\n**Customer In...
7  **Customer Feedback Report**\n\n**Customer Inf...
8  **Customer Service Report**\n\n**Customer Info...
9  # Customer Feedback Report\n\n## Customer Info...
