# Smart Schema QuickStart Guide

This notebook demonstrates how to use the Smart Schema package to generate Pydantic models from three kinds of input: JSON objects, pandas DataFrames, and plain-language field descriptions. Smart Schema infers a schema from the input and builds the model for you, with optional OpenAI integration for enhanced type inference.

## Installation

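First, let's install the package. The cell below runs an editable install from the parent directory; uncomment it if Smart Schema is not already installed in your environment: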

In [1]:
# !pip install -e ../

Using Python 3.12.7 environment at /Users/braindead/Documents/Development/Projects/smart-schema/.venv
Audited 1 package in 22ms


## Basic Usage

Let's start by importing the necessary modules:

In [2]:
from smart_schema.core.model_generator import ModelGenerator
import pandas as pd
import numpy as np
from datetime import datetime
import json

In [2]:
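import os

# Replace the placeholder with your own key, or export OPENAI_API_KEY in your
# shell beforehand and skip this cell. Avoid committing a real key to version control.
OPENAI_API_KEY = "<API_KEY>"
os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY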
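
If the key is exported in the shell instead of being written into the notebook, a small guard can fail early with a readable message. The following is a minimal sketch: `require_api_key` is a hypothetical helper for this notebook, not part of Smart Schema, and it treats the `<API_KEY>` placeholder used above as "not set".

```python
import os


def require_api_key(var_name: str = "OPENAI_API_KEY") -> str:
    """Return the API key stored in the environment, or raise a clear error.

    Hypothetical helper for this notebook; not part of Smart Schema.
    """
    key = os.environ.get(var_name, "").strip()
    # Reject both a missing value and the unedited notebook placeholder.
    if not key or key == "<API_KEY>":
        raise RuntimeError(
            f"Set the {var_name} environment variable before using smart inference."
        )
    return key
```

Calling `require_api_key()` once before the first `smart_inference=True` cell surfaces a missing key before any request is attempted.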

## 1. Generating Models from JSON Data

### 1.1 Basic JSON Example (without OpenAI)

Let's create a model from a simple JSON object:

In [4]:
# Create a sample JSON object
order_data = {
    "order_id": "ORD-2024-001",
    "customer_info": {
        "name": "Alice Johnson",
        "email": "alice.j@example.com",
        "shipping_address": {
            "street": "123 Main St",
            "city": "San Francisco",
            "state": "CA",
            "zip": "94105",
            "country": "USA"
        }
    },
    "items": [
        {
            "product_id": "PRD-001",
            "quantity": 2,
            "price": 299.99,
            "name": "Wireless Headphones"
        }
    ],
    "total_amount": 599.98,
    "payment_status": "completed",
    "shipping_details": {
        "carrier": "FedEx",
        "tracking_number": "FDX123456789",
        "estimated_delivery": "2024-03-25"
    }
}


# Create a model generator instance
generator = ModelGenerator(name="OrderModel")

# Generate the model
OrderModel = generator.from_json(order_data)

# Test the model by validating the sample order
order = OrderModel(**order_data)
print("Generated model:")
OrderModel.model_json_schema()

Generated model:


{'$defs': {'CustomerInfo': {'properties': {'name': {'default': None,
     'title': 'Name',
     'type': 'string'},
    'email': {'default': None, 'title': 'Email', 'type': 'string'},
    'shipping_address': {'allOf': [{'$ref': '#/$defs/ShippingAddress'}],
     'default': None}},
   'title': 'CustomerInfo',
   'type': 'object'},
  'ItemsItem': {'properties': {'product_id': {'default': None,
     'title': 'Product Id',
     'type': 'string'},
    'quantity': {'default': None, 'title': 'Quantity', 'type': 'integer'},
    'price': {'default': None, 'title': 'Price', 'type': 'number'},
    'name': {'default': None, 'title': 'Name', 'type': 'string'}},
   'title': 'ItemsItem',
   'type': 'object'},
  'ShippingAddress': {'properties': {'street': {'default': None,
     'title': 'Street',
     'type': 'string'},
    'city': {'default': None, 'title': 'City', 'type': 'string'},
    'state': {'default': None, 'title': 'State', 'type': 'string'},
    'zip': {'default': None, 'title': 'Zip', 'type'
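
To make the output above easier to read, here is a rough sketch of how a JSON value could be mapped to a type label, with nested objects named after their field (`customer_info` becomes `CustomerInfo`, the elements of `items` become `ItemsItem`). This is an illustration only: `to_class_name` and `infer_type_name` are hypothetical names, and the code is not Smart Schema's actual implementation.

```python
from typing import Any


def to_class_name(field_name: str) -> str:
    """Turn a snake_case field name into a CamelCase model name."""
    return "".join(part.capitalize() for part in field_name.split("_"))


def infer_type_name(field_name: str, value: Any) -> str:
    """Map a JSON value to a type label (illustrative, not Smart Schema's code)."""
    # bool must be tested before int because bool is a subclass of int in Python.
    if isinstance(value, bool):
        return "bool"
    if isinstance(value, int):
        return "int"
    if isinstance(value, float):
        return "float"
    if isinstance(value, str):
        return "str"
    if isinstance(value, dict):
        # A nested object becomes its own model, named after the field.
        return to_class_name(field_name)
    if isinstance(value, list):
        if not value:
            return "list[Any]"
        # Only the first element is inspected in this simplified sketch.
        item_type = infer_type_name(f"{field_name}_item", value[0])
        return f"list[{item_type}]"
    return "Any"


sample = {
    "order_id": "ORD-2024-001",
    "customer_info": {"name": "Alice Johnson"},
    "items": [{"product_id": "PRD-001", "quantity": 2}],
    "total_amount": 599.98,
}
inferred = {name: infer_type_name(name, value) for name, value in sample.items()}
print(inferred)
```

A sample-based approach like this cannot tell whether a field is nullable or whether a string holds a date, which is the gap the OpenAI-assisted inference in the next section is meant to address.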

### 1.2 JSON Example with OpenAI

Now let's apply the same workflow to a flat user record, this time with OpenAI enabled for enhanced type inference:

In [5]:
user_data = {
    "user_id": 1001,
    "username": "jane_doe",
    "email": "jane.doe@example.com",
    "age": 28,
    "is_active": True,
    "last_login": "2024-03-20T14:30:00"
}

# Create a model generator with smart inference
generator = ModelGenerator(name="UserModelSmart", smart_inference=True)

# Generate the model using OpenAI
UserModelSmart = generator.from_json(
    user_data,
    # api_key=OPENAI_API_KEY  # Or set OPENAI_API_KEY environment variable
)

# Test the model
user = UserModelSmart(**user_data)
print("Generated model with OpenAI:")
UserModelSmart.model_json_schema()

Generated schema: {
  "models": {
    "main_model": {
      "name": "UserModelSmart",
      "fields": {
        "user_id": {
          "type": "int",
          "is_nullable": false,
          "description": "Unique identifier for the user."
        },
        "username": {
          "type": "str",
          "is_nullable": false,
          "description": "The user's username."
        },
        "email": {
          "type": "str",
          "is_nullable": false,
          "description": "The user's email address."
        },
        "age": {
          "type": "int",
          "is_nullable": true,
          "description": "The user's age."
        },
        "is_active": {
          "type": "bool",
          "is_nullable": false,
          "description": "Status indicating if the user is active."
        },
        "last_login": {
          "type": "datetime",
          "is_nullable": true,
          "description": "The date and time when the user last logged in."
        }
      }
    }

{'properties': {'user_id': {'description': 'Unique identifier for the user.',
   'title': 'User Id',
   'type': 'integer'},
  'username': {'description': "The user's username.",
   'title': 'Username',
   'type': 'string'},
  'email': {'description': "The user's email address.",
   'title': 'Email',
   'type': 'string'},
  'age': {'anyOf': [{'type': 'integer'}, {'type': 'null'}],
   'description': "The user's age.",
   'title': 'Age'},
  'is_active': {'description': 'Status indicating if the user is active.',
   'title': 'Is Active',
   'type': 'boolean'},
  'last_login': {'anyOf': [{'format': 'date-time', 'type': 'string'},
    {'type': 'null'}],
   'description': 'The date and time when the user last logged in.',
   'title': 'Last Login'}},
 'required': ['user_id',
  'username',
  'email',
  'age',
  'is_active',
  'last_login'],
 'title': 'UserModelSmart',
 'type': 'object'}
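
In the schema above, `last_login` is typed as a date-time even though the sample value is a string. Before passing such strings to a generated model, it can help to check that they are in ISO format. The sketch below uses only the standard library; `is_iso_datetime` is a hypothetical helper, and Pydantic's own datetime parsing may accept a somewhat different set of formats.

```python
from datetime import datetime


def is_iso_datetime(text: str) -> bool:
    """Return True if `text` parses with datetime.fromisoformat."""
    try:
        datetime.fromisoformat(text)
    except ValueError:
        return False
    return True


print(is_iso_datetime("2024-03-20T14:30:00"))  # True
print(is_iso_datetime("20/03/2024 2:30pm"))    # False
```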

## 2. Generating Models from DataFrames

### 2.1 Basic DataFrame Example

Let's create a model from a pandas DataFrame:

In [6]:
# Create a sample DataFrame
df = pd.DataFrame({
    'product_id': ['PRD-001', 'PRD-002', 'PRD-003'],
    'name': ['Headphones', 'Mouse', 'Keyboard'],
    'price': [299.99, 49.99, 89.99],
    'in_stock': [True, True, False],
    'created_at': pd.date_range(start='2024-01-01', periods=3)
})

# Create a model generator instance
generator = ModelGenerator(name="ProductModel")

# Generate the model
ProductModel = generator.from_dataframe(df)

# Test the model
product = ProductModel(**df.iloc[0].to_dict())
print("Generated model:")
ProductModel.model_json_schema()

Generated model:


{'properties': {'product_id': {'default': None,
   'title': 'Product Id',
   'type': 'string'},
  'name': {'default': None, 'title': 'Name', 'type': 'string'},
  'price': {'default': None, 'title': 'Price', 'type': 'number'},
  'in_stock': {'default': None, 'title': 'In Stock', 'type': 'boolean'},
  'created_at': {'description': 'Datetime field for created_at',
   'format': 'date-time',
   'title': 'Created At',
   'type': 'string'}},
 'required': ['created_at'],
 'title': 'ProductModel',
 'type': 'object'}

### 2.2 DataFrame Example with DateTime Columns

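Let's generate the model again, this time naming the datetime columns explicitly. Because `created_at` is already a datetime column in this DataFrame, the resulting schema matches section 2.1 apart from the title: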

In [7]:
# Create a model generator instance
generator = ModelGenerator(name="ProductModelWithDates")

# Generate the model with datetime columns specified
ProductModelWithDates = generator.from_dataframe(
    df,
    datetime_columns=['created_at']
)

# Test the model
product = ProductModelWithDates(**df.iloc[0].to_dict())
print("Generated model with datetime handling:")
ProductModelWithDates.model_json_schema()

Generated model with datetime handling:


{'properties': {'product_id': {'default': None,
   'title': 'Product Id',
   'type': 'string'},
  'name': {'default': None, 'title': 'Name', 'type': 'string'},
  'price': {'default': None, 'title': 'Price', 'type': 'number'},
  'in_stock': {'default': None, 'title': 'In Stock', 'type': 'boolean'},
  'created_at': {'description': 'Datetime field for created_at',
   'format': 'date-time',
   'title': 'Created At',
   'type': 'string'}},
 'required': ['created_at'],
 'title': 'ProductModelWithDates',
 'type': 'object'}

## 3. Generating Models from Field Descriptions

### 3.1 Basic Field Description Example

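Let's create a model from field descriptions. Each field is a dictionary with a `name`, a `description`, and a `nullable` flag; the generator is created with `smart_inference=True`, so the field types are inferred from the descriptions: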

In [16]:
# Define field descriptions
blog_fields = [
    {
        "name": "post_id",
        "description": "Unique identifier for the blog post, it is alphanumeric.",
        "nullable": False
    },
    {
        "name": "title",
        "description": "Blog post title",
        "nullable": False
    },
    {
        "name": "content",
        "description": "Full text content of the blog post",
        "nullable": False
    },
    {
        "name": "author_id",
        "description": "ID of the post author, it is alphanumeric.",
        "nullable": False
    },
    {
        "name": "published_at",
        "description": "Publication date and time",
        "nullable": True
    },
    {
        "name": "tags",
        "description": "List of keywords associated with the post.",
        "nullable": False
    },
    {
        "name": "view_count",
        "description": "Number of times the post has been viewed",
        "nullable": False
    }
]

# Create a model generator instance
generator = ModelGenerator(name="BlogModel", smart_inference=True, openai_model="gpt-4o")

# Generate the model
BlogModel = generator.from_description(
    blog_fields,
    # api_key=OPENAI_API_KEY  # Or set OPENAI_API_KEY environment variable
)

# Test the model
blog_post = {
    "post_id": "BLOG-2024-001",
    "title": "Getting Started with Python Data Science",
    "content": "In this comprehensive guide...",
    "author_id": "AUTH-123",
    "published_at": "2024-03-20T09:00:00",
    "tags": ["python", "data-science", "tutorial"],
    "view_count": 1250
}

validated_post = BlogModel(**blog_post)
print("Generated model:")
BlogModel.model_json_schema()

Generated model:


{'properties': {'post_id': {'description': 'Unique identifier for the blog post, it is alphanumeric.',
   'title': 'Post Id',
   'type': 'string'},
  'title': {'description': 'Blog post title',
   'title': 'Title',
   'type': 'string'},
  'content': {'description': 'Full text content of the blog post',
   'title': 'Content',
   'type': 'string'},
  'author_id': {'description': 'ID of the post author, it is alphanumeric.',
   'title': 'Author Id',
   'type': 'string'},
  'published_at': {'anyOf': [{'format': 'date-time', 'type': 'string'},
    {'type': 'null'}],
   'description': 'Publication date and time',
   'title': 'Published At'},
  'tags': {'description': 'List of keywords associated with the post.',
   'items': {'type': 'string'},
   'title': 'Tags',
   'type': 'array'},
  'view_count': {'description': 'Number of times the post has been viewed',
   'title': 'View Count',
   'type': 'integer'}},
 'required': ['post_id',
  'title',
  'content',
  'author_id',
  'published_at',
  '

### 3.2 Complex Nested Field Description Example

Let's create a more complex model with nested structures:

In [12]:
# Define field descriptions for a complex order model
order_fields = [
    {
        "name": "order_id",
        "description": "Unique identifier for the order, it is alphanumeric.",
        "nullable": False
    },
    {
        "name": "customer_info",
        "description": "Customer details including name, email, and shipping address, this is nested information.",
        "nullable": False
    },
    {
        "name": "items",
        "description": "List of ordered items with quantity and price",
        "nullable": False
    },
    {
        "name": "total_amount",
        "description": "Total order amount including tax and shipping",
        "nullable": False
    },
    {
        "name": "payment_status",
        "description": "Current status of the payment (pending, completed, failed)",
        "nullable": False
    },
    {
        "name": "shipping_details",
        "description": "Shipping information including carrier and tracking number, this is nested information.",
        "nullable": True
    }
]

# Create a model generator instance
generator = ModelGenerator(name="OrderModel", smart_inference=True)

# Generate the model
OrderModel = generator.from_description(
    order_fields,
    # api_key=OPENAI_API_KEY  # Or set OPENAI_API_KEY environment variable
)

# Test the model
order_data = {
    "order_id": "ORD-2024-001",
    "customer_info": {
        "name": "Alice Johnson",
        "email": "alice.j@example.com",
        "shipping_address": {
            "street": "123 Main St",
            "city": "San Francisco",
            "state": "CA",
            "zip": "94105",
            "country": "USA"
        }
    },
    "items": [
        {
            "product_id": "PRD-001",
            "quantity": 2,
            "price": 299.99,
            "name": "Wireless Headphones"
        }
    ],
    "total_amount": 599.98,
    "payment_status": "completed",
    "shipping_details": {
        "carrier": "FedEx",
        "tracking_number": "FDX123456789",
        "estimated_delivery": "2024-03-25"
    }
}

validated_order = OrderModel(**order_data)
print("Generated model:")
OrderModel.model_json_schema()

Generated model:


{'properties': {'order_id': {'description': 'Unique identifier for the order, it is alphanumeric.',
   'title': 'Order Id',
   'type': 'string'},
  'customer_info': {'description': 'Customer details including name, email, and shipping address, this is nested information.',
   'title': 'Customer Info',
   'type': 'object'},
  'items': {'description': 'List of ordered items with quantity and price',
   'items': {'type': 'object'},
   'title': 'Items',
   'type': 'array'},
  'total_amount': {'description': 'Total order amount including tax and shipping',
   'title': 'Total Amount',
   'type': 'number'},
  'payment_status': {'description': 'Current status of the payment (pending, completed, failed)',
   'title': 'Payment Status',
   'type': 'string'},
  'shipping_details': {'anyOf': [{'type': 'object'}, {'type': 'null'}],
   'description': 'Shipping information including carrier and tracking number, this is nested information.',
   'title': 'Shipping Details'}},
 'required': ['order_id',
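
The field description lists in this section all use the same three keys. A quick check before calling `from_description` can catch typos early. The sketch below is an assumption-laden illustration: `check_field_specs` is a hypothetical helper, and whether Smart Schema itself requires all three keys is not confirmed by this notebook.

```python
# Keys used by every field description in this notebook (assumed, not enforced by Smart Schema).
REQUIRED_KEYS = {"name", "description", "nullable"}


def check_field_specs(fields: list[dict]) -> list[str]:
    """Return a list of problems found in field description dicts (empty if none)."""
    problems = []
    seen = set()
    for index, field in enumerate(fields):
        missing = REQUIRED_KEYS - set(field)
        if missing:
            problems.append(f"field #{index}: missing keys {sorted(missing)}")
            continue
        if field["name"] in seen:
            problems.append(f"field #{index}: duplicate name {field['name']!r}")
        seen.add(field["name"])
        if not isinstance(field["nullable"], bool):
            problems.append(f"field #{index}: 'nullable' should be True or False")
    return problems
```

For example, `check_field_specs(order_fields)` would return an empty list for the definitions above, since every entry carries all three keys and a boolean `nullable`.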
 

## 4. Best Practices and Tips

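1. **OpenAI Integration**:
   - Set your OpenAI API key in the environment variable `OPENAI_API_KEY` rather than hard-coding it in the notebook
   - Use `smart_inference=True` when you want inferred nullability and field descriptions in addition to basic types (compare the schemas in sections 1.1 and 1.2)
   - Smart inference calls the OpenAI API, which is billed by usage, so consider the cost before running it repeatedly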

2. **DateTime Handling**:
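   - Specify datetime columns explicitly with `datetime_columns` when working with DataFrames, so the result does not depend on automatic detection
   - Use ISO format for datetime strings (for example, `2024-03-20T14:30:00`)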

3. **Field Descriptions**:
   - Be specific in your field descriptions
   - Include information about the expected format
   - Specify whether fields are nullable

4. **Model Validation**:
   - Always test your generated models with sample data
   - Handle validation errors appropriately
   - Use the model's `model_json_schema()` method to inspect the generated schema

5. **Performance Considerations**:
   - For large datasets, consider using DataFrame-based generation
   - Cache generated models when possible
   - Use appropriate data types to minimize memory usage

## 5. Common Use Cases

1. **API Development**:
   - Generate request/response models
   - Validate incoming data
   - Document API schemas

2. **Data Processing**:
   - Validate data before processing
   - Ensure data consistency
   - Handle data type conversions

3. **Database Operations**:
   - Generate models from database schemas
   - Validate data before insertion
   - Ensure type safety

4. **Configuration Management**:
   - Validate configuration files
   - Ensure required fields are present
   - Handle optional settings