# Generating Structured Synthetic Data

!!! note
    To download this tutorial as a Jupyter notebook, click [here](https://github.com/ShreyaR/guardrails/blob/main/docs/examples/generate_structured_data.ipynb).

In this example, we'll generate structured dummy data that can be loaded into a `pandas` DataFrame.

We assume that:

1. We don't need any external libraries that are not already installed in the environment.
2. We are able to execute the code in the environment.

## Objective

We want to generate structured synthetic data in which each column has a specific data type, and every row respects those column types. The data must also satisfy the following constraints:

1. There should be exactly 10 rows in the dataset.
2. Each user should have a first name and a last name.
3. The number of orders associated with each user should be between 0 and 50.
4. Each user should have a most recent order date.
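
The four constraints above can be expressed as a small plain-Python check. This is only a sketch for illustration: `check_rows` is a hypothetical helper and not part of Guardrails, the field names (`user_name`, `num_orders`, `last_order_date`) follow the RAIL spec used later in this tutorial, and it assumes that "a first name and a last name" means exactly two words and that the 0 to 50 range is inclusive.

```python
from datetime import date


def check_rows(rows):
    """Return a list of human-readable constraint violations (empty if none)."""
    problems = []
    # Constraint 1: exactly 10 rows.
    if len(rows) != 10:
        problems.append(f"expected 10 rows, got {len(rows)}")
    for i, row in enumerate(rows):
        # Constraint 2: a first name and a last name (assumed: exactly two words).
        if len(str(row.get("user_name", "")).split()) != 2:
            problems.append(f"row {i}: user_name is not two words")
        # Constraint 3: an integer order count between 0 and 50 (assumed inclusive).
        n = row.get("num_orders")
        if not isinstance(n, int) or isinstance(n, bool) or not 0 <= n <= 50:
            problems.append(f"row {i}: num_orders out of range")
        # Constraint 4: a most recent order date is present.
        if not isinstance(row.get("last_order_date"), date):
            problems.append(f"row {i}: last_order_date missing or not a date")
    return problems


# Hand-written sample rows, for illustration only (not LLM output).
sample = [
    {"user_name": f"Test User{i}", "num_orders": i, "last_order_date": date(2023, 1, i + 1)}
    for i in range(10)
]
print(check_rows(sample))  # prints []
```

In the rest of this tutorial, checks of this kind are declared in the RAIL spec rather than written by hand.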


## Step 1: Generating `RAIL` Spec

Ordinarily, we would keep the `RAIL` spec in a separate file. For this example, however, we define the `RAIL` spec in the notebook as a string.

In [38]:
rail_str = """
<rail version="0.1">

<output>
    <list name="user_orders" description="Generate a list of users, and how many orders they have placed in the past." format="length: 10 10" on-fail-length="noop">
        <object>
            <string name="user_id" description="The user's id." format="1-indexed" />
            <string name="user_name" description="The user's first name and last name" format="two-words" />
            <integer name="num_orders" description="The number of orders the user has placed" format="valid-range: 0 50" />
            <date name="last_order_date" description="Date of last order" />
        </object>
    </list>
</output>


<prompt>
Generate a dataset of fake user orders. Each row of the dataset should be valid.

@complete_json_suffix</prompt>

</rail>
"""

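
To see how the `<output>` block maps onto dataframe columns, the spec can be read with Python's standard XML parser. The sketch below is independent of Guardrails and only inspects the markup. It uses an abbreviated copy of the spec (descriptions and prompt omitted) so that it runs on its own, and the variable names `spec` and `columns` are illustrative.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the spec above (descriptions and prompt omitted).
spec = """
<rail version="0.1">
<output>
    <list name="user_orders" format="length: 10 10">
        <object>
            <string name="user_id" format="1-indexed" />
            <string name="user_name" format="two-words" />
            <integer name="num_orders" format="valid-range: 0 50" />
            <date name="last_order_date" />
        </object>
    </list>
</output>
</rail>
"""

root = ET.fromstring(spec)
# Each child of <object> describes one column: the tag is the data type,
# `name` is the column name, and `format` holds the constraint, if any.
columns = [
    (field.get("name"), field.tag, field.get("format"))
    for field in root.find("./output/list/object")
]
for name, kind, fmt in columns:
    print(f"{name}: {kind} ({fmt or 'no format constraint'})")
```

Each printed line corresponds to one column of the dataset, and the `format` attribute on the enclosing `<list>` carries the row-count constraint.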
## Step 2: Create a `Guard` object with the RAIL Spec

We create a `gd.Guard` object that checks, validates, and corrects the generated data. This object:

1. Enforces the quality criteria specified in the RAIL spec (e.g. exactly 10 rows, and an order count between 0 and 50).
2. Takes corrective action when the quality criteria are not met (e.g. re-asking the LLM).
3. Compiles the schema and type info from the RAIL spec and adds it to the prompt.
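
The idea of validating output and re-asking on failure can be illustrated with a toy loop. This is a conceptual sketch only and does not reflect how Guardrails is implemented: `generate_with_reask`, `fake_llm`, and `in_range` are hypothetical names, and `fake_llm` is a stand-in that returns canned values instead of calling a model.

```python
def generate_with_reask(generate, validate, max_reasks=1):
    """Call generate(feedback) until validate(output) reports no problems.

    Toy illustration only; not how Guardrails is implemented.
    Returns the last output and the number of re-asks that were used.
    """
    feedback = None
    for attempt in range(max_reasks + 1):
        output = generate(feedback)
        problems = validate(output)
        if not problems:
            return output, attempt
        # Feed the problems back so the next attempt can correct them.
        feedback = "; ".join(problems)
    return output, attempt


# Hypothetical stand-in for an LLM call: wrong at first, corrected after feedback.
def fake_llm(feedback):
    return {"num_orders": 75} if feedback is None else {"num_orders": 30}


# Hypothetical validator for the 0 to 50 order-count constraint.
def in_range(output):
    ok = 0 <= output["num_orders"] <= 50
    return [] if ok else ["num_orders must be between 0 and 50"]


result, reasks_used = generate_with_reask(fake_llm, in_range)
print(result, reasks_used)  # prints {'num_orders': 30} 1
```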

In [39]:
import guardrails as gd

from rich import print

guard = gd.Guard.from_rail_string(rail_str)

The `Guard` object compiles the output schema and adds it to the prompt. We can see the final prompt below:

In [40]:
print(guard.base_prompt)

## Step 3: Wrap the LLM API call with `Guard`

In [43]:
import openai

raw_llm_response, validated_response = guard(
    openai.Completion.create, engine="text-davinci-003", max_tokens=2048, temperature=0
)

Running the cell above returns:

1. The raw LLM text output as a single string.
2. A dictionary whose `user_orders` key contains a list of dictionaries, each representing one row of the dataframe.

In [44]:
print(validated_response)
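
As a possible last step, the validated rows can be turned into plain records ready for a dataframe. The sketch below is an assumption-laden illustration: `to_records` is a hypothetical helper, `example` is a hand-written stand-in for `validated_response` (not real model output), and the date handling assumes values may arrive either as ISO-formatted strings or as `date` objects.

```python
from datetime import date


def to_records(validated):
    """Flatten the validated output into a list of plain row dictionaries."""
    records = []
    for row in validated["user_orders"]:
        last = row["last_order_date"]
        # Assumption: dates may arrive as ISO strings or as date objects.
        if isinstance(last, str):
            last = date.fromisoformat(last)
        records.append(
            {**row, "num_orders": int(row["num_orders"]), "last_order_date": last}
        )
    return records


# Hand-written stand-in for validated_response, for illustration only.
example = {
    "user_orders": [
        {"user_id": "1", "user_name": "Jane Doe", "num_orders": 3,
         "last_order_date": "2023-01-15"},
        {"user_id": "2", "user_name": "John Smith", "num_orders": 0,
         "last_order_date": date(2023, 2, 1)},
    ]
}
records = to_records(example)
print(records[0]["last_order_date"])  # prints 2023-01-15
```

A list of dictionaries such as `records` can be passed directly to `pandas.DataFrame(records)`, which yields one row per user and one column per field.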