## 1. Setup and Installation

Install the required packages:
- `openai`: For calling the Azure OpenAI API
- `jsonschema`: For validating JSON outputs
- `python-dotenv`: For loading environment variables

In [1]:
# Install required packages
%pip install openai jsonschema python-dotenv

Note: you may need to restart the kernel to use updated packages.


## 2. Environment Configuration

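Create a `.env` file with your Azure OpenAI credentials (the code below reads `AZURE_OPENAI_API_KEY` and `AZURE_OPENAI_ENDPOINT`), then load it and create the client: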

In [2]:
import os
from dotenv import load_dotenv
from openai import AzureOpenAI

In [None]:
# Load environment variables
load_dotenv()

client = AzureOpenAI(
    api_key=os.getenv("AZURE_OPENAI_API_KEY"),
    api_version="2024-12-01-preview",
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT")
)
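
The `.env` file itself is not shown in this notebook. As a minimal sketch, the check below reports which settings are missing before the client is used. The variable names are taken from the `os.getenv` calls in the notebook; `missing_env_vars` is a hypothetical helper, not part of `openai` or `python-dotenv`.

```python
import os

# Names taken from the os.getenv calls in this notebook.
REQUIRED_VARS = ["AZURE_OPENAI_API_KEY", "AZURE_OPENAI_ENDPOINT"]


def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in the given environment mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


# Hypothetical usage: report missing settings after load_dotenv() has run.
missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
```

`AZURE_OPENAI_DEPLOYMENT_NAME` is left out of the required list because the notebook falls back to `"gpt-4o"` when it is unset.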

## 3. Define JSON Schema

Define a JSON schema that we want our outputs to conform to. This schema describes a person with name, age, email, and skills.

In [4]:
import json
import jsonschema

# Define the schema for a person
person_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer", "minimum": 0},
        "email": {"type": "string", "format": "email"},
        "skills": {
            "type": "array",
            "items": {"type": "string"}
        }
    },
    "required": ["name", "age", "email", "skills"],
    "additionalProperties": False
}

print(json.dumps(person_schema, indent=2))

{
  "type": "object",
  "properties": {
    "name": {
      "type": "string"
    },
    "age": {
      "type": "integer",
      "minimum": 0
    },
    "email": {
      "type": "string",
      "format": "email"
    },
    "skills": {
      "type": "array",
      "items": {
        "type": "string"
      }
    }
  },
  "required": [
    "name",
    "age",
    "email",
    "skills"
  ],
  "additionalProperties": false
}


## 4. Format Prompts to Request JSON

This approach is for models that don't natively support a JSON mode, so we need to explicitly instruct the model to return JSON.

In [5]:
def create_json_prompt(system_prompt, user_prompt, schema):
    """
    Create a prompt that instructs the model to return JSON according to a schema.
    """
    schema_str = json.dumps(schema, indent=2)
    
    # Create system message that instructs the model to return JSON
    system_message = f"""{system_prompt}

You MUST respond with a valid JSON object that conforms to the following schema:
```
{schema_str}
```

Your response should contain ONLY the JSON object, with no additional text before or after.
Do not include ```json or ``` markers in your response."""

    return system_message, user_prompt

In [6]:
# Example: Create a prompt for person information
system_prompt = "You are an assistant that provides information about people in JSON format."
user_prompt = "Give me information about a software engineer."

formatted_system, formatted_user = create_json_prompt(system_prompt, user_prompt, person_schema)

print("System prompt:")
print(formatted_system)
print("\nUser prompt:")
print(formatted_user)

System prompt:
You are an assistant that provides information about people in JSON format.

You MUST respond with a valid JSON object that conforms to the following schema:
```
{
  "type": "object",
  "properties": {
    "name": {
      "type": "string"
    },
    "age": {
      "type": "integer",
      "minimum": 0
    },
    "email": {
      "type": "string",
      "format": "email"
    },
    "skills": {
      "type": "array",
      "items": {
        "type": "string"
      }
    }
  },
  "required": [
    "name",
    "age",
    "email",
    "skills"
  ],
  "additionalProperties": false
}
```

Your response should contain ONLY the JSON object, with no additional text before or after.
Do not include ```json or ``` markers in your response.

User prompt:
Give me information about a software engineer.


## 5. Make API Calls to the LLM

Make a call to Azure OpenAI's GPT-4o and get the response:

In [7]:
def get_completion(system_prompt, user_prompt, deployment_name=None):
    """
    Get a completion from Azure OpenAI's GPT model.
    """
    if deployment_name is None:
        deployment_name = os.getenv("AZURE_OPENAI_DEPLOYMENT_NAME", "gpt-4o")
        
    try:
        response = client.chat.completions.create(
            model=deployment_name,
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": user_prompt}
            ],
            temperature=0,
            max_tokens=500
        )
        return response.choices[0].message.content
    except Exception as e:
        print(f"Error calling Azure OpenAI: {str(e)}")
        # Return dummy data for demonstration purposes
        return '{"name": "Jane Smith", "age": 28, "email": "jane@example.com", "skills": ["JavaScript", "React", "Node.js"]}'

In [8]:
# Get completion with our formatted prompts
raw_response = get_completion(formatted_system, formatted_user)
print("Raw model response:")
print(raw_response)

Raw model response:
{
  "name": "John Doe",
  "age": 30,
  "email": "john.doe@example.com",
  "skills": [
    "JavaScript",
    "Python",
    "Java",
    "C++",
    "SQL"
  ]
}
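
The prompt asks the model not to wrap its answer in code fence markers, but a model may still do so. As a hedged sketch, a small pre-processing step can remove a surrounding fence before the text is parsed. `strip_code_fences` is a hypothetical helper and only handles a fence that wraps the whole response.

```python
import json
import re

# Matches three backticks, an optional "json" tag, the payload, and a closing fence.
_FENCE_RE = re.compile(r"^`{3}(?:json)?\s*(.*?)\s*`{3}$", re.DOTALL | re.IGNORECASE)


def strip_code_fences(text):
    """Return the response text without a surrounding Markdown code fence, if any."""
    stripped = text.strip()
    match = _FENCE_RE.match(stripped)
    return match.group(1) if match else stripped


# Example: a fenced response still parses after stripping.
fence = "`" * 3
wrapped = fence + 'json\n{"name": "John Doe", "age": 30}\n' + fence
print(json.loads(strip_code_fences(wrapped)))
```

Responses without a fence pass through unchanged apart from whitespace trimming, so the helper could be applied to every raw response before `json.loads`.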


## 6. JSON Schema Validation Post-Processing

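Validate the response against our schema. Note that `jsonschema.validate` does not enforce `format` keywords such as `email` unless a format checker is supplied: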

In [10]:
def validate_and_parse_json(json_string, schema):
    """
    Validate a JSON string against a schema and return the parsed object if valid.
    """
    try:
        # Try to parse the JSON string
        parsed_json = json.loads(json_string)
        
        # Validate against schema
        jsonschema.validate(instance=parsed_json, schema=schema)
        print("✅ JSON is valid according to the schema")
        return parsed_json, True
    except json.JSONDecodeError as e:
        print(f"❌ Invalid JSON: {str(e)}")
        return None, False
    except jsonschema.exceptions.ValidationError as e:
        print(f"❌ Schema validation failed: {str(e)}")
        return parsed_json, False

In [11]:
# Validate the response
parsed_data, is_valid = validate_and_parse_json(raw_response, person_schema)

if is_valid:
    print("\nParsed data:")
    print(json.dumps(parsed_data, indent=2))

✅ JSON is valid according to the schema

Parsed data:
{
  "name": "John Doe",
  "age": 30,
  "email": "john.doe@example.com",
  "skills": [
    "JavaScript",
    "Python",
    "Java",
    "C++",
    "SQL"
  ]
}
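
`jsonschema.validate` raises a single error per call, so a response with several problems reveals them one at a time. The sketch below collects every violation in one pass and passes a `FormatChecker` so that `format` is checked too. `list_schema_errors` is a hypothetical helper, and `Draft7Validator` is an assumed choice of draft, since the schema declares no `$schema`.

```python
import jsonschema

# Same content as person_schema, repeated so the sketch runs on its own.
schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer", "minimum": 0},
        "email": {"type": "string", "format": "email"},
        "skills": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["name", "age", "email", "skills"],
    "additionalProperties": False,
}


def list_schema_errors(instance, schema):
    """Return (field path, message) pairs for every schema violation."""
    validator = jsonschema.Draft7Validator(
        schema, format_checker=jsonschema.FormatChecker()
    )
    return [
        ("/".join(str(part) for part in error.path) or "<root>", error.message)
        for error in validator.iter_errors(instance)
    ]


bad = {"name": "John Smith", "age": "twenty-five", "email": "invalid-email"}
for path, message in list_schema_errors(bad, schema):
    print(f"{path}: {message}")
```

An empty list means the instance is valid, so the helper could also stand in for the boolean check when a full error report is wanted.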


## 7. Repairing Invalid Outputs

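Sometimes the model returns JSON that does not conform to our schema. The repair function below fills in missing required fields with type-based defaults, replaces a malformed email, and resets array fields that are not lists. It does not fix values of the wrong type, as the `age` field in the example shows: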

In [12]:
def repair_json(parsed_json, schema):
    """
    Attempt to repair invalid JSON to make it conform to the schema.
    """
    # Make a copy to avoid modifying the original
    repaired = parsed_json.copy() if parsed_json else {}
    
    # Check each required field
    for field in schema.get("required", []):
        # If field is missing, add a default value based on its type
        if field not in repaired:
            field_schema = schema["properties"].get(field, {})
            field_type = field_schema.get("type")
            
            if field_type == "string":
                repaired[field] = f"default_{field}"
            elif field_type == "integer":
                repaired[field] = 0
            elif field_type == "number":
                repaired[field] = 0.0
            elif field_type == "boolean":
                repaired[field] = False
            elif field_type == "array":
                repaired[field] = []
            elif field_type == "object":
                repaired[field] = {}
    
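    # Special case for email format: replace any value that is not a string containing "@"
    email = repaired.get("email")
    if "email" in repaired and not (isinstance(email, str) and "@" in email):
        repaired["email"] = "default@example.com"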
    
    # Special case for arrays
    for field in repaired:
        field_schema = schema["properties"].get(field, {})
        if field_schema.get("type") == "array" and not isinstance(repaired[field], list):
            repaired[field] = []
    
    return repaired

In [13]:
# Example of invalid response for demonstration
invalid_response = '{"name": "John Smith", "age": "twenty-five", "email": "invalid-email"}'
print("Invalid response example:")
print(invalid_response)

# Try to parse and validate
parsed_invalid, is_valid = validate_and_parse_json(invalid_response, person_schema)

if not is_valid and parsed_invalid:
    # Attempt to repair
    repaired = repair_json(parsed_invalid, person_schema)
    
    print("\nRepaired data:")
    print(json.dumps(repaired, indent=2))
    
    # Validate the repaired data
    _, repaired_valid = validate_and_parse_json(json.dumps(repaired), person_schema)

Invalid response example:
{"name": "John Smith", "age": "twenty-five", "email": "invalid-email"}
❌ Schema validation failed: 'skills' is a required property

Failed validating 'required' in schema:
    {'type': 'object',
     'properties': {'name': {'type': 'string'},
                    'age': {'type': 'integer', 'minimum': 0},
                    'email': {'type': 'string', 'format': 'email'},
                    'skills': {'type': 'array',
                               'items': {'type': 'string'}}},
     'required': ['name', 'age', 'email', 'skills'],
     'additionalProperties': False}

On instance:
    {'name': 'John Smith', 'age': 'twenty-five', 'email': 'invalid-email'}

Repaired data:
{
  "name": "John Smith",
  "age": "twenty-five",
  "email": "default@example.com",
  "skills": []
}
❌ Schema validation failed: 'twenty-five' is not of type 'integer'

Failed validating 'type' in schema['properties']['age']:
    {'type': 'integer', 'minimum': 0}

On instance['age']:
    'twenty-five'
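
The repaired example still fails because `repair_json` leaves values of the wrong type untouched. One possible extension, sketched here under the assumption that numeric strings such as `"25"` are worth converting, is a small coercion step. `coerce_scalar` is a hypothetical helper; a value like `"twenty-five"` cannot be converted and would still need a default or a new model call.

```python
import math


def coerce_scalar(value, field_type):
    """Best-effort conversion of a string to a JSON Schema "integer" or "number".

    Returns (converted_value, True) only when a string was converted, and
    (value, False) otherwise.
    """
    if not isinstance(value, str):
        return value, False
    convert = {"integer": int, "number": float}.get(field_type)
    if convert is None:
        return value, False
    try:
        converted = convert(value.strip())
    except ValueError:
        return value, False
    # NaN and infinity are not valid JSON numbers, so reject them.
    if isinstance(converted, float) and not math.isfinite(converted):
        return value, False
    return converted, True


print(coerce_scalar("25", "integer"))
print(coerce_scalar("twenty-five", "integer"))
```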

## 8. Complete Integration: From Prompt to Valid JSON

Put everything together in a single function:

In [14]:
def get_structured_output(system_prompt, user_prompt, schema, deployment_name=None):
    """
    Complete pipeline to get structured output from LLM:
    1. Format the prompt to request JSON
    2. Call the LLM
    3. Validate the response against the schema
    4. Repair if necessary
    5. Return the validated JSON
    """
    # Format prompt
    formatted_system, formatted_user = create_json_prompt(system_prompt, user_prompt, schema)
    
    # Get response from LLM
    raw_response = get_completion(formatted_system, formatted_user, deployment_name)
    
    # Parse and validate
    parsed_data, is_valid = validate_and_parse_json(raw_response, schema)
    
    # Repair if invalid (only possible when the response parsed as JSON)
    if not is_valid and parsed_data:
        repaired = repair_json(parsed_data, schema)
        _, repaired_valid = validate_and_parse_json(json.dumps(repaired), schema)
        return repaired, repaired_valid
    
    return parsed_data, is_valid

In [15]:
# Example usage
results, success = get_structured_output(
    "You are an assistant that provides information about people.",
    "Give me information about a data scientist.",
    person_schema
)

print(f"Success: {success}")
print("Structured Output:")
print(json.dumps(results, indent=2))

✅ JSON is valid according to the schema
Success: True
Structured Output:
{
  "name": "Jane Doe",
  "age": 30,
  "email": "jane.doe@example.com",
  "skills": [
    "Python",
    "R",
    "SQL",
    "Machine Learning",
    "Data Visualization",
    "Statistics"
  ]
}
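
When repair is not enough, another option is to ask the model again. The sketch below is one way to wrap that idea, not part of the pipeline above: `retry_until_valid` is a hypothetical helper that takes the generation and validation steps as callables. With `temperature=0` a repeated call is likely to return the same text, so retries mainly help with a higher temperature or an amended prompt.

```python
import json


def retry_until_valid(generate, validate, max_attempts=3):
    """Call generate() until validate(result) succeeds or attempts run out.

    generate: zero-argument callable returning the raw response text.
    validate: callable returning a (parsed, is_valid) pair.
    Returns (parsed, is_valid, attempts_used).
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    parsed, is_valid = None, False
    for attempt in range(1, max_attempts + 1):
        parsed, is_valid = validate(generate())
        if is_valid:
            break
    return parsed, is_valid, attempt


# Demo with canned responses standing in for model calls.
canned = iter(["not json", '{"name": "Ada"}'])


def parse_only(text):
    """Parse JSON text, returning (parsed, True) or (None, False)."""
    try:
        return json.loads(text), True
    except json.JSONDecodeError:
        return None, False


result = retry_until_valid(lambda: next(canned), parse_only)
print(result)
```

In the notebook's terms, `generate` could wrap `get_completion` and `validate` could wrap `validate_and_parse_json` with the schema already bound.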


## 9. Testing with Different Queries

Try a few different queries to test the implementation:

In [16]:
test_queries = [
    "Tell me about a product manager.",
    "Describe a UX designer.",
    "Who is a typical machine learning engineer?",
]

for query in test_queries:
    print(f"\n--- Query: {query} ---")
    results, success = get_structured_output(
        "You are an assistant that provides information about professionals.",
        query,
        person_schema
    )
    
    if success:
        print("✅ Success")
    else:
        print("❌ Failed even after repair")
    
    print(json.dumps(results, indent=2))
    print("-" * 50)


--- Query: Tell me about a product manager. ---
✅ JSON is valid according to the schema
✅ Success
{
  "name": "Jane Doe",
  "age": 35,
  "email": "jane.doe@example.com",
  "skills": [
    "Product Development",
    "Market Research",
    "Agile Methodologies",
    "Project Management",
    "User Experience Design"
  ]
}
--------------------------------------------------

--- Query: Describe a UX designer. ---
✅ JSON is valid according to the schema
✅ Success
{
  "name": "Alex Johnson",
  "age": 29,
  "email": "alex.johnson@example.com",
  "skills": [
    "User Research",
    "Wireframing",
    "Prototyping",
    "Usability Tes