<a href="https://colab.research.google.com/github/jpsiyyadri/rag-pdf-content/blob/main/LLM_JSON_Schema_Workshop.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# LLM JSON Schema Workshop

This workshop is for developers and analysts.

You'll learn how to use JSON schema to define the exact JSON structure an LLM should generate.

There are 2 ways of generating JSON output.

1. [Prompts](#prompts): Telling the LLM the kind of JSON you want to generate (easy, less reliable)
2. [Tools](#tools): Giving the LLM the JSON schema to generate (harder, more reliable)

## Prompts

You can just tell an LLM to generate JSON.

[Anthropic's JSON guide](https://docs.anthropic.com/claude/docs/control-output-format) has good examples.

In addition, [OpenAI lets you specify a JSON mode](https://platform.openai.com/docs/guides/text-generation/json-mode).

Your prompt should mention the structure of the JSON you want to generate. The LLM will **TRY** to match that structure.

In [None]:
# Let's import a few libraries we'll use later
import json
import pandas as pd
import requests
from copy import deepcopy
from pprint import pprint
from tqdm import tqdm

# Let's extract the state name and ZIP code from these addresses
original_addresses = [
    {"place": "White House", "address": "1600 Pennsylvania Avenue, Washington DC"},
    {"place": "NYSE", "address": "11 Wall Street New York, NY"},
    {"place": "Empire State Building", "address": "350 Fifth Avenue New York, NY 10118"},
    {"place": "Hollywood sign", "address": "4059 Mt Lee Dr. Hollywood, CA 90068"},
    {"place": "Statue of Liberty", "address": "Liberty Island New York, NY 10004"},
]

In [None]:
# If you have an Anthropic or an OpenAI API key, set it here.
# claude_api_key = "sk-ant-api..."
# openai_api_key = "sk-..."

# If you have access, get your LLMFOUNDRY_TOKEN from https://llmfoundry.straive.com/code
# Then set a Google Colab secret named "LLMFOUNDRY_TOKEN" to that value.

from google.colab import userdata

claude_api_key = f"{userdata.get('LLMFOUNDRY_TOKEN')}:json-schema-workshop"
openai_api_key = f"{userdata.get('LLMFOUNDRY_TOKEN')}:json-schema-workshop"

In [None]:
# Let's define a function that will extract the state name and ZIP code

def extract_address_claude(address, debug=False):
    response = requests.post(
        "https://llmfoundry.straive.com/anthropic/v1/messages",
        headers={"Authorization": f"Bearer {claude_api_key}"},
        json={
            "model": "claude-3-haiku-20240307",
            "max_tokens": 50,
            "system": """Extract a flat JSON from message with ONLY these keys.

state_name: e.g. Texas
zip_code: 5-digit number (null if missing)
""",
            "messages": [{"role": "user", "content": address}],
        },
    )
    result = response.json()
    if debug:
        pprint(result, width=150, )
        print("-" * 80)
    return json.loads(result["content"][0]["text"])

In [None]:
# Let's try out an example
extract_address_claude("350 Fifth Avenue New York, NY 10118", debug=True)

{'content': [{'text': '{\n  "state_name": "New York",\n  "zip_code": "10118"\n}', 'type': 'text'}],
 'id': 'msg_01JH22VqgAJFsg1vdbYyYcEY',
 'model': 'claude-3-haiku-20240307',
 'role': 'assistant',
 'stop_reason': 'end_turn',
 'stop_sequence': None,
 'type': 'message',
 'usage': {'input_tokens': 56, 'output_tokens': 28}}
--------------------------------------------------------------------------------


{'state_name': 'New York', 'zip_code': '10118'}

In [None]:
# Fill in all the addresses
addresses = deepcopy(original_addresses)
for address in tqdm(addresses):
    address.update(extract_address_claude(address["address"]))

pd.DataFrame(addresses)

100%|██████████| 5/5 [00:01<00:00,  3.36it/s]


Unnamed: 0,place,address,state_name,zip_code
0,White House,"1600 Pennsylvania Avenue, Washington DC",District of Columbia,
1,NYSE,"11 Wall Street New York, NY",New York,
2,Empire State Building,"350 Fifth Avenue New York, NY 10118",New York,10118.0
3,Hollywood sign,"4059 Mt Lee Dr. Hollywood, CA 90068",California,90068.0
4,Statue of Liberty,"Liberty Island New York, NY 10004",New York,10004.0


Notice a few things:

1. It extracted state_name: "District of Columbia" even though the address only says "Washington DC".
2. It extracted state_name: "California" even though the address only has the abbreviation "CA".

(These behaviours are not reliable, though.)
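
The table also shows `zip_code` as floats such as `10118.0`. That suggests the model returned numbers in these calls (the prompt asks for a "5-digit number"), and pandas stores a numeric column with missing values as floats. Below is a minimal sketch of a normalizer that turns such values back into strings. `normalize_zip` is a hypothetical helper, not part of the workshop code, and it assumes plain 5-digit US ZIP codes only.

```python
import math


def normalize_zip(value):
    """Return a 5-character ZIP string, or None if value is not a usable ZIP."""
    if value is None or isinstance(value, bool):
        return None
    if isinstance(value, float):
        # NaN/inf (e.g. missing values) and non-whole numbers are not ZIP codes
        if not math.isfinite(value) or not value.is_integer():
            return None
        value = int(value)
    text = str(value).strip()
    if text.isdigit() and len(text) <= 5:
        # Numbers drop leading zeros (2134 should be "02134"), so pad them back
        return text.zfill(5)
    return None


print([normalize_zip(v) for v in ["10118", 10118, 10118.0, 2134, None, "unknown"]])
```

Keeping ZIP codes as strings avoids both the `.0` suffix and the loss of leading zeros.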

In [None]:
# Let's do the same thing with OpenAI

def extract_address_openai(address, debug=False):
    response = requests.post(
        "https://llmfoundry.straive.com/openai/v1/chat/completions",
        headers={"Authorization": f"Bearer {openai_api_key}"},
        json={
            "model": "gpt-3.5-turbo",
            # This increases the chances of getting a JSON object
            "response_format": {"type": "json_object"},
            "messages": [
                {"role": "system", "content": """Extract a flat JSON from message with ONLY these keys.

state_name: e.g. Texas
zip_code: 5-digit number (null if missing)
"""},
                {"role": "user", "content": address},
            ],
        },
    )
    result = response.json()
    if debug:
        pprint(result, width=150, )
        print("-" * 80)
    return json.loads(result["choices"][0]["message"]["content"])

In [None]:
# Let's try out an example
extract_address_openai("350 Fifth Avenue New York, NY 10118", debug=True)

{'choices': [{'finish_reason': 'stop',
              'index': 0,
              'logprobs': None,
              'message': {'content': '{\n    "state_name": "New York",\n    "zip_code": "10118"\n}', 'role': 'assistant'}}],
 'created': 1715309471,
 'id': 'chatcmpl-9NAdj0ybv6EpYdIxBWbw7VSiyMv8S',
 'model': 'gpt-3.5-turbo-0125',
 'object': 'chat.completion',
 'system_fingerprint': None,
 'usage': {'completion_tokens': 20, 'prompt_tokens': 52, 'total_tokens': 72}}
--------------------------------------------------------------------------------


{'state_name': 'New York', 'zip_code': '10118'}

In [None]:
# Fill in all the addresses
addresses = deepcopy(original_addresses)
for address in tqdm(addresses):
    address.update(extract_address_openai(address["address"]))

pd.DataFrame(addresses)

100%|██████████| 5/5 [00:01<00:00,  3.19it/s]


Unnamed: 0,place,address,state_name,zip_code
0,White House,"1600 Pennsylvania Avenue, Washington DC",Washington,
1,NYSE,"11 Wall Street New York, NY",New York,
2,Empire State Building,"350 Fifth Avenue New York, NY 10118",New York,10118.0
3,Hollywood sign,"4059 Mt Lee Dr. Hollywood, CA 90068",California,90068.0
4,Statue of Liberty,"Liberty Island New York, NY 10004",New York,10004.0


Notice a few things:

1. It extracted state_name: "Washington" (a different state), not "Washington DC".
2. It extracted state_name: "California" even though the address only has the abbreviation "CA".

(These behaviours are not reliable, though.)

## Exercise 1

Modify the following code to add

1. `state_code` (a 2-letter state code)
2. `time_zone` (ET, CT, MT, or PT).

**Evaluation**: The output DataFrame should contain 4 columns: `state_name`, `state_code`, `time_zone` and `zip_code` (in any order), filled out correctly.

In [None]:
# Modify this code to add a state_code and time_zone column

def extract_address_details_claude(address, debug=False):
    response = requests.post(
        "https://llmfoundry.straive.com/anthropic/v1/messages",
        headers={"Authorization": f"Bearer {claude_api_key}"},
        json={
            "model": "claude-3-haiku-20240307",
            "max_tokens": 50,
            "system": """Extract a flat JSON from message with ONLY these keys.

state_name: e.g. Texas
zip_code: 5-digit number (null if missing)
""",
            "messages": [{"role": "user", "content": address}],
        },
    )
    result = response.json()
    if debug:
        pprint(result, width=150, )
        print("-" * 80)
    return json.loads(result["content"][0]["text"])


addresses = deepcopy(original_addresses)
for address in tqdm(addresses):
    address.update(extract_address_details_claude(address["address"]))

pd.DataFrame(addresses)

## You may not get valid JSON

Instead of JSON, you may get [text containing JSON](https://llmfoundry.straive.com/history#?t=1715310212349.2773):

```
Here is the extracted JSON data:

{
  "state_name": "New York",
  "zip_code": null
}
```

Or Markdown containing JSON:

````
Here is the extracted JSON data:

```json
{
  "state_name": "New York",
  "zip_code": null
}
```
````

Or invalid JSON.

```
{
  "state_name": ""New York"",
  "zip_code": null
}
```


Parsing this kind of output is hard. For more reliable output, use [Tools](#tools).
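
If you have to stay with prompts, the first two cases can often be recovered by scanning for the first substring that parses as a JSON object. The sketch below uses only the standard library. `extract_first_json_object` is a hypothetical helper name, and it deliberately returns `None` for the invalid-JSON case rather than trying to repair it.

```python
import json


def extract_first_json_object(text):
    """Return the first JSON object embedded in text, or None if none parses."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start` and ignores what follows
            value, _end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            value = None
        if isinstance(value, dict):
            return value
        # Try the next opening brace, if any
        start = text.find("{", start + 1)
    return None


reply = 'Here is the extracted JSON data:\n\n{\n  "state_name": "New York",\n  "zip_code": null\n}'
print(extract_first_json_object(reply))
```

This is a fallback, not a fix: it cannot tell a correct object from a wrong one, and anything it cannot parse still needs a retry or manual handling.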

## Tools

You can give LLMs the exact [JSON schema](https://json-schema.org/) you want them to generate.

- [Anthropic supports a "tools" parameter](https://docs.anthropic.com/claude/docs/tool-use)
- [OpenAI supports a "tools" parameter](https://platform.openai.com/docs/api-reference/chat/create#chat-create-tools)

The easiest way to create a JSON schema is to have an LLM create it.

For example, the schema for `state_name` and `zip_code` is:

In [None]:
address_schema = {
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "object",
  "properties": {
    "state_name": {
      "type": "string",
      "description": "Full state name, e.g. Texas"
    },
    "zip_code": {
      "type": ["string", "null"],
      "description": "5-digit ZIP code. null if missing. E.g. 10004"
    }
  },
  "required": ["state_name"]
}
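
To see what this schema promises, here is a minimal sketch that checks a parsed result against it. It handles only `required` and `type` on a flat object, which is all that `address_schema` uses. `check_flat_object` and `matches_type` are hypothetical helpers, the type mapping is an approximation, and a complete validator (for example the third-party `jsonschema` package) would be the better choice for real schemas.

```python
# Copy of the schema above, without descriptions, so this block runs on its own
address_schema = {
    "type": "object",
    "properties": {
        "state_name": {"type": "string"},
        "zip_code": {"type": ["string", "null"]},
    },
    "required": ["state_name"],
}

# Approximate mapping from JSON Schema type names to Python types
PYTHON_TYPES = {
    "string": str,
    "number": (int, float),
    "integer": int,
    "boolean": bool,
    "object": dict,
    "array": list,
    "null": type(None),
}


def matches_type(value, type_name):
    """True if a parsed JSON value has the given JSON Schema type."""
    # bool is a subclass of int in Python, but JSON booleans are not numbers
    if isinstance(value, bool):
        return type_name == "boolean"
    return isinstance(value, PYTHON_TYPES[type_name])


def check_flat_object(data, schema):
    """Return a list of problems; an empty list means no problem was found."""
    problems = []
    for key in schema.get("required", []):
        if key not in data:
            problems.append(f"missing required key: {key}")
    for key, spec in schema.get("properties", {}).items():
        if key not in data or "type" not in spec:
            continue
        allowed = spec["type"]
        if isinstance(allowed, str):
            allowed = [allowed]
        if not any(matches_type(data[key], name) for name in allowed):
            found = type(data[key]).__name__
            problems.append(f"{key}: expected {' or '.join(allowed)}, got {found}")
    return problems


print(check_flat_object({"state_name": "New York", "zip_code": None}, address_schema))
print(check_flat_object({"zip_code": 10118}, address_schema))
```

Note that `zip_code` is typed as a string here, so a numeric `10118` counts as a violation.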

In [None]:
# Let's define a function that will extract the state name and ZIP code

def tool_address_claude(address, schema=address_schema, debug=False):
    response = requests.post(
        "https://llmfoundry.straive.com/anthropic/v1/messages",
        headers={
            "Authorization": f"Bearer {claude_api_key}",
            # Tool use requires this header
            "Anthropic-Beta": "tools-2024-04-04"
        },
        json={
            "model": "claude-3-haiku-20240307",
            "max_tokens": 200,
            "tools": [{
                "name": "update_address",
                "description": "Updates address with state name and ZIP code",
                "input_schema": schema
            }],
            "system": "Update this address",
            "messages": [{"role": "user", "content": address}],
        },
    )
    result = response.json()
    if debug:
        pprint(result, width=150, )
        print("-" * 80)
    for content in result["content"]:
        if content["type"] == "tool_use" and content["name"] == "update_address":
            return content["input"]

In [None]:
tool_address_claude("350 Fifth Avenue New York, NY 10118", debug=True)

{'content': [{'text': "Okay, let's update the address:", 'type': 'text'},
             {'id': 'toolu_016EA1J4Kkve98QCMsY5cjvZ',
              'input': {'state_name': 'New York', 'zip_code': '10118'},
              'name': 'update_address',
              'type': 'tool_use'}],
 'id': 'msg_01Xmo4mAzEx8CvfqMD5LRhhe',
 'model': 'claude-3-haiku-20240307',
 'role': 'assistant',
 'stop_reason': 'tool_use',
 'stop_sequence': None,
 'type': 'message',
 'usage': {'input_tokens': 419, 'output_tokens': 86}}
--------------------------------------------------------------------------------


{'state_name': 'New York', 'zip_code': '10118'}

In [None]:
addresses = deepcopy(original_addresses)
for address in tqdm(addresses):
    address.update(tool_address_claude(address["address"]))

pd.DataFrame(addresses)

100%|██████████| 5/5 [00:01<00:00,  3.42it/s]


Unnamed: 0,place,address,state_name,zip_code
0,White House,"1600 Pennsylvania Avenue, Washington DC",District of Columbia,20500
1,NYSE,"11 Wall Street New York, NY",New York,10005
2,Empire State Building,"350 Fifth Avenue New York, NY 10118",New York,10118
3,Hollywood sign,"4059 Mt Lee Dr. Hollywood, CA 90068",California,90068
4,Statue of Liberty,"Liberty Island New York, NY 10004",New York,10004


Notice that it also filled out the (correct) ZIP codes for the White House and the NYSE. Again, this is not reliable behavior.
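
The tool call itself is not guaranteed either. The debug output above shows a text block next to the `tool_use` block, and nothing in the response format promises that a `tool_use` block is present. If it is missing, `tool_address_claude` falls through its loop and returns `None`, and `address.update(None)` raises a `TypeError`. Below is a sketch of a stricter extractor that works on the already-parsed response. `get_tool_input` is a hypothetical name.

```python
def get_tool_input(result, tool_name):
    """Return the input of the first matching tool_use block, or raise ValueError."""
    for block in result.get("content", []):
        if block.get("type") == "tool_use" and block.get("name") == tool_name:
            return block["input"]
    raise ValueError(f"no tool_use block for {tool_name!r} in response")


# Response trimmed from the debug output above
sample = {
    "content": [
        {"type": "text", "text": "Okay, let's update the address:"},
        {
            "type": "tool_use",
            "id": "toolu_016EA1J4Kkve98QCMsY5cjvZ",
            "name": "update_address",
            "input": {"state_name": "New York", "zip_code": "10118"},
        },
    ],
    "stop_reason": "tool_use",
}
print(get_tool_input(sample, "update_address"))
```

Raising an explicit error makes the failing address easy to spot, so the caller can retry or skip it.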

In [None]:
# Let's do the same with OpenAI

def tool_address_openai(address, schema=address_schema, debug=False):
    response = requests.post(
        "https://llmfoundry.straive.com/openai/v1/chat/completions",
        headers={"Authorization": f"Bearer {openai_api_key}"},
        json={
            "model": "gpt-3.5-turbo",
            "tools": [{
                "type": "function",
                "function": {
                    "name": "update_address",
                    "description": "Updates address with state name and ZIP code",
                    "parameters": schema
                }
            }],
            "tool_choice": {
                "type": "function",
                "function": { "name": "update_address"}
            },
            "messages": [
                {"role": "system", "content": "Update this address"},
                {"role": "user", "content": address}
            ],
        },
    )
    result = response.json()
    if debug:
        pprint(result, width=150, )
        print("-" * 80)
    for choice in result["choices"]:
        for tool_call in choice.get("message", {}).get("tool_calls", []):
            if tool_call.get("function", {}).get("name") == "update_address":
                return json.loads(tool_call["function"]["arguments"])

In [None]:
tool_address_openai("350 Fifth Avenue New York, NY 10118", debug=True)

{'choices': [{'finish_reason': 'stop',
              'index': 0,
              'logprobs': None,
              'message': {'content': None,
                          'role': 'assistant',
                          'tool_calls': [{'function': {'arguments': '{"state_name":"New York","zip_code":"10118"}', 'name': 'update_address'},
                                          'id': 'call_HCNWVk3pYLiKzDsKudrFpvzG',
                                          'type': 'function'}]}}],
 'created': 1715312148,
 'id': 'chatcmpl-9NBKu45JZ58SBjmrOJRddkvXCgOZb',
 'model': 'gpt-3.5-turbo-0125',
 'object': 'chat.completion',
 'system_fingerprint': None,
 'usage': {'completion_tokens': 13, 'prompt_tokens': 147, 'total_tokens': 160}}
--------------------------------------------------------------------------------


{'state_name': 'New York', 'zip_code': '10118'}

In [None]:
addresses = deepcopy(original_addresses)
for address in tqdm(addresses):
    address.update(tool_address_openai(address["address"]))

pd.DataFrame(addresses)

100%|██████████| 5/5 [00:01<00:00,  3.21it/s]


Unnamed: 0,place,address,state_name,zip_code
0,White House,"1600 Pennsylvania Avenue, Washington DC",Washington D.C.,20500
1,NYSE,"11 Wall Street New York, NY",New York,10005
2,Empire State Building,"350 Fifth Avenue New York, NY 10118",New York,10118
3,Hollywood sign,"4059 Mt Lee Dr. Hollywood, CA 90068",California,90068
4,Statue of Liberty,"Liberty Island New York, NY 10004",New York,10004


## Exercise 2

In the code below, create a `detailed_address_schema` with these properties:

1. `state_name` (full name of the state)
2. `state_code` (a 2-letter state code)
3. `time_zone` (ET, CT, MT, or PT).
4. `zip_code` (5-digit ZIP code)

Run it and extract these parameters using Claude 3 Haiku.

**Evaluation**: The output DataFrame should contain 4 columns: `state_name`, `state_code`, `time_zone` and `zip_code` (in any order), filled out correctly.

In [None]:
detailed_address_schema = {
    # ... update the schema to include state_code and time_zone
}

addresses = deepcopy(original_addresses)
for address in tqdm(addresses):
    address.update(tool_address_claude(address["address"], schema=detailed_address_schema))

pd.DataFrame(addresses)

## Exercise 3

Create a JSON schema that returns an array of reactions from the description of reactions.

Each reaction contains an optional `reaction_title` and an array of reaction stages (max 4).

Each reaction stage has these keys:

- `stage`: number (1, 2, 3, ...)
- `compounds[]`: list of compounds mentioned in the reaction
  - `compound_name`: as mentioned in reaction text. Ignore quantities or concentrations. Just the compound name
  - `compound_type`: can be starting_material, solvent, or reagent (default)
- `duration`: number
- `duration_unit`: e.g. seconds, minutes, hours, days
- `temperature`: number (Room temperature is 27.99 °C)
- `temperature_unit`: e.g. C for Centigrade, K for Kelvin, F for Fahrenheit

**Evaluation**: The results should have exactly the right schema. Only the structure is checked, not whether the values were extracted correctly.

Here's a sample output for the first reaction:

```json
[
  {
    "reaction_title": null,
    "reaction_stages": [
      {
        "stage": 1,
        "compounds": [
          {"compound_name": "yellowish oil", "compound_type": "starting_material"},
          {"compound_name": "HCl", "compound_type": "reagent"}
        ],
        "duration": 2,
        "duration_unit": "days",
        "temperature": 27.99,
        "temperature_unit": "C"
      }
    ]
  }
]
```

In [None]:
reactions = [
    "The yellowish oil made above was mixed in 2.0 N HCl (100 mL) and the resultant mixture was stirred at roomtemperature for 2 days.",
    "2.14 Synthesis of Alcohol 26: To a solution of α,β-unsaturated ester 25 (1.50 g, 2.76 mmol) in dry DCM (50 mL) wasS21added DIBAL-H (1.5 M in toluene, 5.5 mL, 8.3 mmol) at -78 ℃ under the nitrogen atmosphere. The resulting mixture was stirred at -78 ℃ for 1h before being quenched.",
    "Diol 7:To a solution of 21,2 (12.0 g, 62.4 mmol, 1 eq.) in dichloromethane (120 mL), DIBALH(129 mL of 1 M solution in DCM, 129 mmol, 2.1 eq.) was added at 0 °C over 1 hr, and thenstirred for 60 min at 0 °C. Methanol (10 mL) and water (3 mL) was added to the resultingmixture at 0 °C. NaF (13 g) was added portionwise, and then the mixture was vigorously stirredovernight at r.t. r 20 min.",
    "Synthesis of Limaspermidine (2):To a solution of 24 (97.4 mg, 0.346 mmol, 1.0 equiv.) and p-toluenesulfonic acid (328.7 mg, 1.73 mmol, 5.0 equiv.)in MeOH (7 mL) was bubbled with ozone -78 ℃ until the reaction mixture turned brown from pale yellow. The reactionmixture then bubbled with O2 to blow off the excess ozone for 5 min. To this solution was added Me2S (3 mL) followedby NaHCO3 powder (10.0 equiv.) at 0 ℃. To this solution was NaBH4 (10.0 equiv.) slowly, and the reaction mixture wasslowly warmed to room temperature. To this solution was added another portion of NaBH4 (10.0 equiv.) slowly at roomtemperature, and the resultant reaction mixture was stirred for overnight.",
]