In [1]:
!guardrails hub install hub://reflex/valid_python --quiet
!guardrails hub install hub://guardrails/two_words --quiet
!guardrails hub install hub://guardrails/valid_range --quiet
!guardrails hub install hub://guardrails/valid_length --quiet

Installing hub://reflex/valid_python...
✅Successfully installed reflex/valid_python!

Installing hub://guardrails/two_words...
✅Successfully installed guardrails/two_words!

Installing hub://guardrails/valid_range...
✅Successfully installed guardrails/valid_range!

# Generating Structured Synthetic Data

!!! note
    To download this tutorial as a Jupyter notebook, click [here](https://github.com/ShreyaR/guardrails/blob/main/docs/examples/generate_structured_data.ipynb).

In this example, we'll generate structured dummy data for a `pandas` dataframe.

We assume that:

1. We don't need any external libraries that are not already installed in the environment.
2. We are able to execute the code in the environment.

## Objective

We want to generate structured synthetic data in which every column has a specific data type and every row respects those types. The data must also satisfy the following constraints:

1. There should be exactly 10 rows in the dataset.
2. Each user should have a first name and a last name.
3. The number of orders associated with each user should be between 0 and 50.
4. Each user should have a most recent order date.
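
The constraints above can be written down as a plain-Python check, independent of Guardrails. This is a minimal sketch: `check_rows` is a hypothetical helper, and the field names `user_name`, `num_orders`, and `last_order_date` are borrowed from the RAIL spec defined later in this tutorial.

```python
from datetime import date


def check_rows(rows):
    """Return a list of human-readable constraint violations (empty if none)."""
    problems = []
    # Constraint 1: exactly 10 rows.
    if len(rows) != 10:
        problems.append(f"expected 10 rows, got {len(rows)}")
    for i, row in enumerate(rows):
        # Constraint 2: a first name and a last name, i.e. two words.
        if len(str(row.get("user_name", "")).split()) != 2:
            problems.append(f"row {i}: user_name must be a first and last name")
        # Constraint 3: number of orders between 0 and 50.
        num_orders = row.get("num_orders")
        if not isinstance(num_orders, int) or not 0 <= num_orders <= 50:
            problems.append(f"row {i}: num_orders must be an integer in [0, 50]")
        # Constraint 4: a most recent order date.
        if not isinstance(row.get("last_order_date"), date):
            problems.append(f"row {i}: last_order_date must be a date")
    return problems


# Hand-written sample rows, not model output.
rows = [
    {
        "user_id": str(i),
        "user_name": "Jane Doe",
        "num_orders": i,
        "last_order_date": date(2024, 1, i),
    }
    for i in range(1, 11)
]
print(check_rows(rows))  # []
```

A check like this is useful as a sanity test on whatever the guard returns, since it does not depend on how the validators are configured.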

## Step 1: Generating `RAIL` Spec

Ordinarily, we would keep the `RAIL` spec in a separate file. For this example, we define the `RAIL` spec in the notebook as a string. We also show the same spec in a code-first format using a Pydantic model.

RAIL spec as an XML string:

In [2]:
rail_str = """
<rail version="0.1">

<output>
    <list name="user_orders" description="Generate a list of users, and how many orders they have placed in the past." format="length: 10 10" on-fail-length="noop">
        <object>
            <string name="user_id" description="The user's id." format="1-indexed" />
            <string name="user_name" description="The user's first name and last name" format="two-words" />
            <integer name="num_orders" description="The number of orders the user has placed" format="valid-range: 0 50" />
            <date name="last_order_date" description="Date of last order" />
        </object>
    </list>
</output>


<prompt>
Generate a dataset of fake user orders. Each row of the dataset should be valid.

${gr.complete_json_suffix}</prompt>

</rail>
"""
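
To see which fields and `format` constraints a spec like this declares, it can help to parse the XML directly. The sketch below uses only the standard library on a trimmed copy of the spec (descriptions and prompt omitted). It illustrates the spec's structure and is not how Guardrails itself parses RAIL.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the spec above, for illustration only.
spec = """
<rail version="0.1">
<output>
    <list name="user_orders" format="length: 10 10" on-fail-length="noop">
        <object>
            <string name="user_id" format="1-indexed" />
            <string name="user_name" format="two-words" />
            <integer name="num_orders" format="valid-range: 0 50" />
            <date name="last_order_date" />
        </object>
    </list>
</output>
</rail>
"""

root = ET.fromstring(spec)

# Each child of <object> is one column: the tag is the type,
# `name` is the column name, and `format` is the validator (if any).
fields = [
    (el.tag, el.get("name"), el.get("format"))
    for el in root.find("output/list/object")
]
for tag, name, fmt in fields:
    print(f"{name}: type={tag}, format={fmt}")
```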

RAIL spec as a Pydantic model:

In [3]:
from pydantic import BaseModel, Field
from guardrails.hub import ValidLength, TwoWords, ValidRange
from datetime import date
from typing import List

prompt = """
Generate a dataset of fake user orders. Each row of the dataset should be valid.

${gr.complete_json_suffix}"""

class Order(BaseModel):
    user_id: str = Field(description="The user's id.", validators=[("1-indexed", "noop")])
    user_name: str = Field(
        description="The user's first name and last name",
        validators=[TwoWords()]
    )
    num_orders: int = Field(
        description="The number of orders the user has placed",
        validators=[ValidRange(0, 50)]
    )

class Orders(BaseModel):
    user_orders: List[Order] = Field(
        description="Generate a list of users, and how many orders they have placed in the past.",
        validators=[ValidLength(10, 10, on_fail="noop")]
    )

## Step 2: Create a `Guard` object with the RAIL Spec

We create a `gd.Guard` object that will check, validate and correct the generated data. This object:

1. Enforces the quality criteria specified in the RAIL spec (e.g. two-word user names and order counts between 0 and 50).
2. Takes corrective action when the quality criteria are not met (e.g. by re-asking the LLM).
3. Compiles the schema and type info from the RAIL spec and adds it to the prompt.
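
The validate-then-correct idea can be sketched in a few lines of plain Python. Everything here is hypothetical and for illustration only: `fake_llm` stands in for a real model call, `validate` checks just the order-count range, and the loop is a simplification rather than Guardrails' actual implementation.

```python
def validate(rows):
    """Return error messages for rows whose num_orders falls outside 0..50."""
    return [
        f"row {i}: num_orders={row['num_orders']} is outside 0..50"
        for i, row in enumerate(rows)
        if not 0 <= row["num_orders"] <= 50
    ]


def fake_llm(prompt):
    """Stand-in for a model call: bad row first, fixed row once errors are in the prompt."""
    if "outside 0..50" in prompt:
        return [{"user_name": "Jane Doe", "num_orders": 12}]
    return [{"user_name": "Jane Doe", "num_orders": 99}]


def generate_with_reask(prompt, max_reasks=1):
    """Call the model, validate, and re-prompt with the errors until valid or out of retries."""
    output = fake_llm(prompt)
    for _ in range(max_reasks):
        errors = validate(output)
        if not errors:
            break
        # Feed the validation errors back to the model and try again.
        prompt = prompt + "\nFix these problems:\n" + "\n".join(errors)
        output = fake_llm(prompt)
    return output, validate(output)


rows, remaining_errors = generate_with_reask("Generate a dataset of fake user orders.")
print(rows, remaining_errors)
```

Note that the spec in this tutorial sets `noop` as the on-fail action for the list length, so a failed length check there is recorded rather than corrected.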

In [4]:
import guardrails as gd

from rich import print

From our RAIL string:

In [5]:
guard = gd.Guard.from_rail_string(rail_str)

From our Pydantic model:

In [6]:
guard = gd.Guard.from_pydantic(output_class=Orders, prompt=prompt)

The `Guard` object compiles the output schema and adds it to the prompt. We can see the final prompt below:

In [7]:
print(guard.base_prompt)

## Step 3: Wrap the LLM API call with `Guard`

In [11]:
import openai


res = guard(
    openai.chat.completions.create,
    max_tokens=2048,
    temperature=0
)
res.validated_output


{'user_orders': [{'user_id': '1', 'user_name': 'John Doe', 'num_orders': 20},
  {'user_id': '2', 'user_name': 'Jane Smith', 'num_orders': 15},
  {'user_id': '3', 'user_name': 'Michael Johnson', 'num_orders': 30},
  {'user_id': '4', 'user_name': 'Emily Brown', 'num_orders': 10},
  {'user_id': '5', 'user_name': 'David Wilson', 'num_orders': 5},
  {'user_id': '6', 'user_name': 'Sarah Martinez', 'num_orders': 25},
  {'user_id': '7', 'user_name': 'Robert Taylor', 'num_orders': 40},
  {'user_id': '8', 'user_name': 'Olivia Anderson', 'num_orders': 12},
  {'user_id': '9', 'user_name': 'William Thomas', 'num_orders': 8},
  {'user_id': '10', 'user_name': 'Sophia Garcia', 'num_orders': 18}]}

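Running the cell above returns a result object, `res`, that holds:

1. The raw LLM text output as a single string (`res.raw_llm_output`).
2. The validated output (`res.validated_output`): a dictionary whose `user_orders` key contains a list of dictionaries, each representing one row of the dataframe.

Because the Pydantic `Order` model above does not define `last_order_date`, that field is absent from the rows shown here.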
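
Since each dictionary is one row, the validated output can be loaded into a `pandas` dataframe directly. A minimal sketch, assuming `pandas` is installed; the two rows are copied from the output above so the snippet runs on its own, and in the notebook you would pass `res.validated_output` instead.

```python
import pandas as pd

# Two rows copied from the output above; in the notebook, use res.validated_output.
validated_output = {
    "user_orders": [
        {"user_id": "1", "user_name": "John Doe", "num_orders": 20},
        {"user_id": "2", "user_name": "Jane Smith", "num_orders": 15},
    ]
}

# Each dictionary becomes one row; the keys become the column names.
df = pd.DataFrame(validated_output["user_orders"])
print(df.dtypes)
```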

In [12]:
print(guard.history.last.tree)