# Thinking with Types: What's the problem?

If you've seen my [talk](https://www.youtube.com/watch?v=yj-wSRJwrrc&t=1s) on this topic, you can skip this chapter.

Many times, when we want to use language models, it's not to make chatbots, but to communicate with other computer systems. This commonly means we want the model to output structured data like JSON. However, working with raw JSON or dictionaries can be a pain.

In this section we'll introduce Pydantic as a tool we can leverage in our day-to-day programming, and then use OpenAI function calling to extract some simple data out of a string. This lays the groundwork for introducing my library, Instructor.

## Problem 1: Working with JSON, Validation, and Pydantic

Let's say we have some simple JSON data and we want to work with it. We can use the `json` module to load it into a dictionary, but then we have to manually check the types and the validity of every field ourselves. For example, let's say the loaded data looks like this:

In [1]:
data = [
    {"first_name": "Jason", "age": 10}, 
    {"firstName": "Jason", "age": "10"}
]

Each record is supposed to have a `first_name` field, which is a string, and an `age` field, which is an integer. However, once this is loaded into a dictionary, nothing enforces that. The age could be a string or a float, the name could be a list, or a key could be spelled differently (`firstName` instead of `first_name`), as in the second record.

In [2]:
for obj in data:
    name = obj.get("first_name")
    age = obj.get("age")
    print(f"{name} is {age}")
    print(f"Next year he will be {age+1} years old")

Jason is 10
Next year he will be 11 years old
None is 10


TypeError: can only concatenate str (not "int") to str

So while we were able to program against a dictionary, nothing guaranteed that the data was valid: the second record has a misspelled key and a string where we expected an integer. To be safe we would have to check every key and every type by hand. This is a pain, and we can do better.
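
To make that cost concrete, here is a minimal sketch of what the manual checks might look like for the two records above. The helper name `parse_person` and its error messages are made up for illustration; this is one of many ways to hand-roll validation, not a recommended pattern.

```python
def parse_person(obj: dict) -> tuple[str, int]:
    """Hand-rolled validation for one record; raises ValueError on bad input."""
    name = obj.get("first_name")
    if not isinstance(name, str):
        raise ValueError(f"first_name must be a string, got {name!r}")
    age = obj.get("age")
    # bool is a subclass of int in Python, so exclude it explicitly
    if isinstance(age, bool) or not isinstance(age, int):
        raise ValueError(f"age must be an integer, got {age!r}")
    return name, age


data = [
    {"first_name": "Jason", "age": 10},
    {"firstName": "Jason", "age": "10"},
]

results = []
for obj in data:
    try:
        name, age = parse_person(obj)
        results.append(f"{name} will be {age + 1} next year")
    except ValueError as exc:
        # Record the problem instead of crashing halfway through the loop
        results.append(f"skipped: {exc}")

print(results)
```

Every new field means another pair of checks like these, which is exactly the boilerplate we would like a library to handle for us.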

## Pydantic to the rescue

Pydantic is a library that lets us define data structures as Python classes with type annotations and then validate data against them. Where it can, it also converts input to the declared types.

In [3]:
from pydantic import BaseModel, Field


class Person(BaseModel):
    name: str
    age: int

person = Person(name="Sam", age=30)
person

Person(name='Sam', age=30)

In [4]:
# Data is correctly cast to the right type
person = Person.model_validate({"name": "Sam", "age": "30"})
person

Person(name='Sam', age=30)

In [5]:
assert person.name == "Sam"
assert person.age == 20  # fails: age was parsed as 30

AssertionError: 

In [6]:
# Data is validated to get better error messages
person = Person.model_validate({"name": "Sam", "age": "30.2"})
person

ValidationError: 1 validation error for Person
age
  Input should be a valid integer, unable to parse string as an integer [type=int_parsing, input_value='30.2', input_type=str]
    For further information visit https://errors.pydantic.dev/2.4/v/int_parsing

By introducing Pydantic into any Python codebase you get a lot of benefits: type checking, validation, and autocomplete. This is a huge win, because bad data is caught at the boundary where it enters your program rather than deep inside your logic. It is even more useful when we rely on language models to generate data for us.
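
As a rough illustration of catching bad data at the boundary, the sketch below runs the two records from the first example through a Pydantic model. The `Record` model is an assumption introduced here (it mirrors the `first_name` and `age` keys); it is not part of the original notebook.

```python
from pydantic import BaseModel, ValidationError


class Record(BaseModel):
    # Hypothetical model matching the keys used in the first example
    first_name: str
    age: int


data = [
    {"first_name": "Jason", "age": 10},
    {"firstName": "Jason", "age": "10"},
]

valid, problems = [], []
for obj in data:
    try:
        valid.append(Record.model_validate(obj))
    except ValidationError as exc:
        # exc.errors() returns one dict per failing field
        problems.append([(err["loc"], err["type"]) for err in exc.errors()])

print(valid)
print(problems)
```

The first record validates, while the second is rejected because `first_name` is missing, so the misspelled key surfaces as an explicit error instead of a silent `None`.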

## Asking for JSON from OpenAI

In [7]:
from openai import OpenAI

client = OpenAI()

resp = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "user", "content": "Extract `Jason is 25 years old` into json"},
    ]
)

Person.model_validate_json(resp.choices[0].message.content)

Person(name='Jason', age=25)

In [8]:
resp = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "user", "content": "Extract `Jason Liu is thirty years old` into json"},
    ]
)

Person.model_validate_json(resp.choices[0].message.content)

Person(name='Jason Liu', age=30)

But what happens if I want to describe specifically what the schema should look like? What if I want `full_name`, `age`, and `birthday` as a `datetime.date`?

In [9]:
import datetime

class PersonBirthday(Person):
    birthday: datetime.date


resp = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "user", "content": f"Extract `Jason Liu is thirty years old his birthday is yesterday` into json today is {datetime.date.today()}"},
    ]
)

Person.model_validate_json(resp.choices[0].message.content)

Person(name='Jason Liu', age=30)

## Introduction to Function Calling 

The JSON could be anything! We could add more and more to the prompt and hope it works, or we can use something called function calling to directly specify the schema we want.


**Function Calling**

In an API call, you can describe functions and have the model intelligently choose to output a JSON object containing arguments to call one or many functions. The Chat Completions API does not call the function; instead, the model generates JSON that you can use to call the function in your code.
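
Since the API only hands back a function name and a JSON string of arguments, calling the function is up to us. Here is a minimal sketch of that last step. The `person` function, the `registry` dict, and the hard-coded `function_arguments` string are all stand-ins for illustration; in a real call the name and arguments would come from `resp.choices[0].message.function_call`.

```python
import json


def person(name: str, age: int) -> str:
    """Hypothetical local function that the model's arguments are meant for."""
    return f"{name} is {age} years old"


# Map the function names we advertised to the model onto real callables
registry = {"Person": person}

# Stand-ins for function_call.name and function_call.arguments
function_name = "Person"
function_arguments = '{"name": "Jason Liu", "age": 30}'

# The arguments arrive as a JSON string, so parse them before calling
kwargs = json.loads(function_arguments)
result = registry[function_name](**kwargs)
print(result)
```

Nothing here checks that the arguments match the schema, which is the gap Pydantic fills in the cells that follow.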

In [10]:
schema = {
    'properties': 
    {
        'name': {'type': 'string'},
        'age': {'type': 'integer'},
        'birthday': {'type': 'string', 'format': 'YYYY-MM-DD'},
    },
    'required': ['name', 'age'],
    'type': 'object'
}

resp = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "user", "content": f"Extract `Jason Liu is thirty years old his birthday is yesterday` into json today is {datetime.date.today()}"},
    ],
    functions=[{"name": "Person", "parameters": schema}],
    function_call="auto"
)


PersonBirthday.model_validate_json(resp.choices[0].message.function_call.arguments)

ValidationError: 1 validation error for PersonBirthday
birthday
  Input should be a valid date or datetime, input is too short [type=date_from_datetime_parsing, input_value='yesterday', input_type=str]
    For further information visit https://errors.pydantic.dev/2.4/v/date_from_datetime_parsing

It turns out Pydantic does more than parse and validate our data: a model can also generate its own JSON schema, and we can attach documentation to it!

In [None]:
PersonBirthday.model_json_schema()
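
Because the cell above is shown without its output, here is a self-contained sketch that redefines the same two models and inspects the parts of the generated schema that matter for function calling. It assumes Pydantic v2, as used in the rest of this notebook.

```python
import datetime

from pydantic import BaseModel


class Person(BaseModel):
    name: str
    age: int


class PersonBirthday(Person):
    birthday: datetime.date


schema = PersonBirthday.model_json_schema()

# Fields without defaults are listed as required
print(schema["required"])
# A datetime.date field becomes a string with the JSON Schema "date" format
print(schema["properties"]["birthday"])
```

Note that the generated schema marks `birthday` with the standard `date` format, whereas the hand-written schema earlier used `YYYY-MM-DD`.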

We can even define nested schemas and attach documentation with ease.

In [None]:
class Address(BaseModel):
    address: str = Field(description="Full street address")
    city: str
    state: str


class PersonAddress(Person):
    """A Person with an address"""
    address: Address


PersonAddress.model_json_schema()
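
To see where the documentation ends up, the sketch below rebuilds the same models and pulls out the relevant pieces of the schema. It assumes Pydantic v2, where nested models are collected under a `$defs` key and referenced from the parent's properties.

```python
from pydantic import BaseModel, Field


class Person(BaseModel):
    name: str
    age: int


class Address(BaseModel):
    address: str = Field(description="Full street address")
    city: str
    state: str


class PersonAddress(Person):
    """A Person with an address"""
    address: Address


schema = PersonAddress.model_json_schema()

# The class docstring becomes the schema's top-level description
print(schema["description"])
# The nested model is defined once under $defs and referenced by the parent
print(schema["properties"]["address"])
# Field(description=...) is attached to the individual property
print(schema["$defs"]["Address"]["properties"]["address"]["description"])
```

Both the docstring and the field description travel with the schema, so they reach the model as part of the function definition.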

These simple concepts are what we built into `instructor`, and most of the work has gone into documenting how to leverage schema engineering.
On top of that, `instructor.patch()` adds more capabilities to the OpenAI SDK.

In [None]:
import instructor

client = instructor.patch(client)

resp = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {
            "role": "user", 
            "content": f"""
            Today is {datetime.date.today()} 

            Extract `Jason Liu is thirty years old his birthday is yesterday`
            he lives at 123 Main St, San Francisco, CA"""},
    ],
    response_model=PersonAddress
)
resp

When we set `response_model`, the `create` call returns a Pydantic model instead of the raw response, so the data is already validated and we can work with it as a regular Python object.

## Is instructor the only way to do this?

No. Libraries like Marvin, LangChain, and LlamaIndex all now leverage Pydantic objects in similar ways, though each takes a different approach. With instructor, the goal is to be as lightweight as possible, get you as close as possible to the OpenAI API, and then get out of your way.

More importantly, we've also added straightforward validation and re-asking to the mix.
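
The idea behind re-asking can be sketched in a few lines: validate the model's answer and, if validation fails, send the error back and try again. The code below is only an illustration of that loop and is not how instructor implements it. `extract_with_retries` and `fake_model` are hypothetical names, and `fake_model` is a canned stand-in for a real language model call.

```python
from typing import Callable

from pydantic import BaseModel, ValidationError


class Person(BaseModel):
    name: str
    age: int


def extract_with_retries(
    ask: Callable[[str], str], prompt: str, max_retries: int = 2
) -> Person:
    """Call `ask`, validate its JSON reply, and re-ask with the error on failure."""
    message = prompt
    for attempt in range(max_retries + 1):
        raw = ask(message)
        try:
            return Person.model_validate_json(raw)
        except ValidationError as exc:
            if attempt == max_retries:
                raise  # out of retries, surface the last validation error
            # Feed the validation error back so the next answer can correct it
            message = f"{prompt}\n\nYour previous answer was invalid:\n{exc}\nPlease try again."
    raise ValueError("max_retries must be non-negative")


# Canned replies: the first has a non-numeric age, the second is valid
replies = iter(['{"name": "Jason", "age": "thirty"}', '{"name": "Jason", "age": 30}'])
calls = []


def fake_model(message: str) -> str:
    calls.append(message)
    return next(replies)


person = extract_with_retries(fake_model, "Extract `Jason is thirty years old` into json")
print(person, len(calls))
```

The first reply fails validation because `"thirty"` cannot be parsed as an integer, so the second call carries the error message and gets a valid answer back.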

The goal of instructor is to show you how to think about structured prompting and provide examples and documentation that you can take with you to any framework.