# Using OpenAI's structured JSON output feature
Sources:
- OpenAI guide to structured outputs, including the JSON Schema rules for 'strict' mode: https://platform.openai.com/docs/guides/structured-outputs
- `outlines`, a Python package for high-level use of structured LLM output: https://github.com/dottxt-ai/outlines

In [1]:
# !pip uninstall -y sympy pysb torch outlines
# !pip install pysb sympy==1.11.1
# !pip install torch sympy==1.13.1 --no-deps
# !pip install outlines --no-deps

In [2]:
import sys
from pathlib import Path
import os
# Add the parent of the current working directory to sys.path so that `indra_gpt` can be imported
sys.path.append(str(Path.cwd().parent))

In [3]:
from indra_gpt.util.util import merge_allOf, remove_null_from_anyOf
import json

## Providing the JSON schema directly as the response format
- I.e., provide the INDRA JSON schema to the `response_format` parameter of the `OpenAI.beta.chat.completions.parse` method of the OpenAI Python client.

OpenAI's 'strict' mode accepts only a subset of the JSON Schema language. The following is a non-exhaustive list of the rules it enforces:
1. The root schema is of type 'object'.
2. 'required' includes all and only the fields in 'properties'.
    - This ensures that the LLM generates every field defined in 'properties'.
    - To make a field effectively optional, we can add "null" as an allowed type for that field.
3. 'additionalProperties' is set to false on every schema of type 'object'.
    - This ensures that the LLM doesn't generate any key-value pair that is not defined in 'properties'.
4. Each sub-schema in 'anyOf', namely "#/definitions/RegulateActivity", "#/definitions/Modification", "#/definitions/SelfModification", etc., has a different first key
   in its 'properties' field. <br>
   This is achievable by adding a new field with a unique constant value. E.g. "#/definitions/RegulateActivity" has
   ```
   "kind": {
       "type": "string",
       "const": "RegulateActivity"
   }
   ```
   and "#/definitions/Modification" has
   ```
   "kind": {
       "type": "string",
       "const": "Modification"
   }
   ```
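
Rules 1 to 3 above can be checked mechanically before a schema is sent to the API. The following is a minimal sketch, not part of `indra_gpt` or the OpenAI SDK: the function name `find_strict_violations` and both example schemas are hypothetical, and the walk only follows 'properties', 'definitions', 'items' and 'anyOf'.

```python
def find_strict_violations(schema, path="#", is_root=True):
    """Return a list of messages for violations of rules 1-3 (illustrative sketch)."""
    problems = []
    if not isinstance(schema, dict):
        return problems
    # Rule 1: the root schema must be an object
    if is_root and schema.get("type") != "object":
        problems.append(f"{path}: root schema must have type 'object'")
    if schema.get("type") == "object":
        props = schema.get("properties", {})
        # Rule 2: 'required' lists all and only the keys of 'properties'
        if set(schema.get("required", [])) != set(props):
            problems.append(f"{path}: 'required' must list exactly the keys of 'properties'")
        # Rule 3: 'additionalProperties' is explicitly false
        if schema.get("additionalProperties") is not False:
            problems.append(f"{path}: 'additionalProperties' must be false")
    # Recurse into the places where sub-schemas can appear
    for key in ("properties", "definitions"):
        for name, sub in schema.get(key, {}).items():
            problems += find_strict_violations(sub, f"{path}/{key}/{name}", False)
    if isinstance(schema.get("items"), dict):
        problems += find_strict_violations(schema["items"], f"{path}/items", False)
    for i, sub in enumerate(schema.get("anyOf", [])):
        problems += find_strict_violations(sub, f"{path}/anyOf/{i}", False)
    return problems


# Hypothetical schema that breaks rules 2 and 3
bad_schema = {
    "type": "object",
    "properties": {"name": {"type": "string"}, "age": {"type": "integer"}},
    "required": ["name"],
}

# The same schema rewritten for 'strict' mode; "age" stays optional by allowing null
strict_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": ["integer", "null"]},
    },
    "required": ["name", "age"],
    "additionalProperties": False,
}

violations = find_strict_violations(bad_schema)
for message in violations:
    print(message)
```

Calling `find_strict_violations(strict_schema)` returns an empty list, since every field is required and the optional one is expressed through a "null" type.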

- The schema loaded below follows all of the rules for OpenAI's 'strict' mode except one: it contains 'allOf', which OpenAI doesn't currently support.
- We can resolve this by merging the items of each 'allOf' into the enclosing schema and removing the 'allOf' keyword.
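
The implementation of `merge_allOf` in `indra_gpt.util.util` is not reproduced in this notebook. To illustrate the idea only, here is a minimal sketch under stated assumptions: the name `merge_all_of_sketch` is hypothetical, references have the local form `#/definitions/<Name>`, 'properties' are merged by union, 'required' lists are concatenated without duplicates, and the input is never modified. The example schema (including the `Base` definition) is made up.

```python
def resolve_ref(ref, root):
    """Look up a local reference such as '#/definitions/Base' in the root schema."""
    node = root
    for part in ref.split("/")[1:]:  # skip the leading '#'
        node = node[part]
    return node


def merge_all_of_sketch(node, root):
    """Return a copy of `node` in which each 'allOf' is replaced by its merged items."""
    if isinstance(node, list):
        return [merge_all_of_sketch(item, root) for item in node]
    if not isinstance(node, dict):
        return node
    # Copy everything except 'allOf', recursing into nested schemas
    merged = {k: merge_all_of_sketch(v, root) for k, v in node.items() if k != "allOf"}
    for item in node.get("allOf", []):
        # Resolve only a first-level $ref; refs nested deeper are left as they are
        if "$ref" in item:
            item = resolve_ref(item["$ref"], root)
        item = merge_all_of_sketch(item, root)
        for key, value in item.items():
            if key == "properties":
                merged.setdefault("properties", {}).update(value)
            elif key == "required":
                current = merged.setdefault("required", [])
                current.extend(name for name in value if name not in current)
            else:
                # For other keywords, keep the first value encountered
                merged.setdefault(key, value)
    return merged


# Made-up schema with one 'allOf' that combines a reference and an inline object
example = {
    "type": "object",
    "properties": {
        "stmt": {
            "allOf": [
                {"$ref": "#/definitions/Base"},
                {
                    "type": "object",
                    "properties": {"kind": {"type": "string", "const": "Gef"}},
                    "required": ["kind"],
                },
            ]
        }
    },
    "definitions": {
        "Base": {
            "type": "object",
            "properties": {"id": {"type": "string"}},
            "required": ["id"],
        }
    },
}

merged_example = merge_all_of_sketch(example, example)
print(merged_example["properties"]["stmt"]["required"])  # ['id', 'kind']
```

Because the sketch builds new dictionaries and lists at every level, `example` still contains its 'allOf' afterwards, which mirrors the not-in-place behavior asserted for `merge_allOf` in this notebook.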

In [4]:
# Load the JSON schema
schema_path = "/Users/thomaslim/gyorilab/indra_gpt/indra_gpt/resources/indra_schema_openai.json"
with open(schema_path, "r") as f:
    schema = json.load(f)

In [5]:
# Merge every `allOf` occurrence, resolving only the first-level `$ref`s inside it
post_processed_schema = merge_allOf(schema, schema)

In [6]:
post_processed_schema

{'type': 'object',
 'properties': {'statements': {'type': 'array',
   'items': {'anyOf': [{'$ref': '#/definitions/RegulateActivity'},
     {'$ref': '#/definitions/Modification'},
     {'$ref': '#/definitions/SelfModification'},
     {'$ref': '#/definitions/ActiveForm'},
     {'$ref': '#/definitions/Gef'},
     {'$ref': '#/definitions/Gap'},
     {'$ref': '#/definitions/Complex'},
     {'$ref': '#/definitions/Association'},
     {'$ref': '#/definitions/Translocation'},
     {'$ref': '#/definitions/RegulateAmount'},
     {'$ref': '#/definitions/Conversion'}]}}},
 'required': ['statements'],
 'additionalProperties': False,
 'definitions': {'RegulateActivity': {'type': 'object',
   'properties': {'kind': {'type': 'string', 'const': 'RegulateActivity'},
    'type': {'type': 'string',
     'enum': ['Activation', 'Inhibition'],
     'description': 'The type of the statement'},
    'subj': {'anyOf': [{'$ref': '#/definitions/Agent'}, {'type': 'null'}],
     'description': "The agent responsible

In [7]:
# Verify that the modifications did not happen in-place
assert schema != post_processed_schema

### Use the `outlines` package to access OpenAI's 'structured outputs' feature

In [8]:
os.environ["OPENAI_API_KEY"] = "<Your API Key>"

In [9]:
import outlines

# Use the OpenAI API instead of a local model
model = outlines.models.openai("gpt-4o")  # Or "gpt-4o-mini" for lower API cost
# Define a generator constrained to the post-processed schema
generator_v1 = outlines.generate.json(model, json.dumps(post_processed_schema))
# Sample prompt
prompt = "We found that 24 hours of BMP pretreatment caused a doubling in PTEN half-life (15.1 hours to 28.4 hours, p = 0.03, n = 4)."

In [10]:
result_v1 = generator_v1(prompt)

INFO: [2025-02-06 14:47:42] httpx - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


In [11]:
result_v1

{'statements': [{'kind': 'RegulateAmount',
   'type': 'IncreaseAmount',
   'subj': {'name': 'BMP',
    'mods': None,
    'mutations': None,
    'bound_conditions': None,
    'activity': None,
    'location': None,
    'db_refs': None,
    'sbo': None},
   'obj': {'name': 'PTEN',
    'mods': None,
    'mutations': None,
    'bound_conditions': None,
    'activity': None,
    'location': None,
    'db_refs': None,
    'sbo': None},
   'evidence': [{'text': 'We found that 24 hours of BMP pretreatment caused a doubling in PTEN half-life (15.1 hours to 28.4 hours, p = 0.03, n = 4).',
     'source_api': None,
     'pmid': None,
     'source_id': None,
     'annotations': None,
     'epistemics': None}],
   'id': None,
   'supports': None,
   'supported_by': None}]}

Let us compare the above result with the corresponding JSON object from the benchmark dataset.

In [13]:
benchmark_path = "/Users/thomaslim/gyorilab/indra_gpt/indra_gpt/resources/indra_benchmark_corpus_sample_50.json"
with open(benchmark_path, "r") as f:  # close the file handle after loading
    indra_benchmark_corpus_sample_50 = json.load(f)
print(json.dumps(indra_benchmark_corpus_sample_50[0], indent=4))

{
    "type": "Activation",
    "subj": {
        "name": "BMP",
        "db_refs": {
            "FPLX": "BMP",
            "TEXT": "BMP"
        }
    },
    "obj": {
        "name": "PTEN",
        "db_refs": {
            "UP": "P60484",
            "HGNC": "9588",
            "TEXT": "PTEN"
        }
    },
    "obj_activity": "activity",
    "belief": 0.9145194475081968,
    "evidence": [
        {
            "source_api": "reach",
            "pmid": "21456062",
            "text": "We found that 24 hours of BMP pretreatment caused a doubling in PTEN half-life (15.1 hours to 28.4 hours, p = 0.03, n = 4).",
            "annotations": {
                "found_by": "Positive_activation_syntax_1_verb",
                "agents": {
                    "coords": [
                        [
                            26,
                            29
                        ],
                        [
                            64,
                            68
                   

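One way to make this comparison concrete is to drop the `None` placeholders from the LLM output and compare a few core fields. The following is a minimal sketch: the helper names `drop_none` and `core_fields` are hypothetical, and the two dictionaries are abbreviated copies of the outputs shown above.

```python
def drop_none(value):
    """Recursively remove keys whose value is None from nested dicts and lists."""
    if isinstance(value, dict):
        return {k: drop_none(v) for k, v in value.items() if v is not None}
    if isinstance(value, list):
        return [drop_none(v) for v in value]
    return value


def core_fields(stmt):
    """Extract the statement type and the agent names for a coarse comparison."""
    return {
        "type": stmt.get("type"),
        "subj": (stmt.get("subj") or {}).get("name"),
        "obj": (stmt.get("obj") or {}).get("name"),
    }


# Abbreviated copy of the first statement in `result_v1`
llm_stmt = {
    "kind": "RegulateAmount",
    "type": "IncreaseAmount",
    "subj": {"name": "BMP", "mods": None, "db_refs": None},
    "obj": {"name": "PTEN", "mods": None, "db_refs": None},
    "id": None,
}

# Abbreviated copy of the first benchmark statement
benchmark_stmt = {
    "type": "Activation",
    "subj": {"name": "BMP", "db_refs": {"FPLX": "BMP", "TEXT": "BMP"}},
    "obj": {"name": "PTEN", "db_refs": {"UP": "P60484", "HGNC": "9588", "TEXT": "PTEN"}},
}

llm_core = core_fields(drop_none(llm_stmt))
benchmark_core = core_fields(benchmark_stmt)
mismatches = {
    key: (llm_core[key], benchmark_core[key])
    for key in llm_core
    if llm_core[key] != benchmark_core[key]
}
print(mismatches)  # {'type': ('IncreaseAmount', 'Activation')}
```

For this sentence the two statements agree on the agents (BMP and PTEN) but not on the statement type: the LLM produced `IncreaseAmount`, while the benchmark entry is an `Activation`.
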
## Providing the Pydantic class representation of JSON schema as the response format
- We can define the INDRA statements as Pydantic classes and pass the class to the `response_format` parameter.
- If there is a one-to-one (i.e. equivalent) translation between a Pydantic class and a JSON schema, then we can use Pydantic instead of the JSON schema.

In [27]:
from pydantic import TypeAdapter, BaseModel
import json

# An example JSON schema
json_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer"},
        "hobbies": {
            "type": "array",
            "items": {"type": "string"}
        }
    },
    "required": ["name", "age"]
}

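# Attempt to convert the JSON Schema to a Pydantic model.
# This fails (see the error below): TypeAdapter validates data against a type,
# it does not build a model class from a JSON Schema.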
pydantic_model = TypeAdapter(BaseModel).validate_json(json.dumps(json_schema))
print(pydantic_model)

AttributeError: 'BaseModel' object has no attribute '__private_attributes__'

In [25]:
from pydantic import BaseModel

class LLMResponse(BaseModel):
    name: str
    age: int
    hobbies: list[str]

# Convert to JSON Schema
json_schema_from_pydantic = LLMResponse.model_json_schema()
print(json.dumps(json_schema_from_pydantic, indent=4))

{
    "properties": {
        "name": {
            "title": "Name",
            "type": "string"
        },
        "age": {
            "title": "Age",
            "type": "integer"
        },
        "hobbies": {
            "items": {
                "type": "string"
            },
            "title": "Hobbies",
            "type": "array"
        }
    },
    "required": [
        "name",
        "age",
        "hobbies"
    ],
    "title": "LLMResponse",
    "type": "object"
}
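
To check whether the schema generated from `LLMResponse` is equivalent to the hand-written example schema, the two can be compared after removing the 'title' annotations that Pydantic adds. The following is a minimal sketch: `strip_titles` is a hypothetical helper, and both schemas are copied from the cells above.

```python
def strip_titles(node):
    """Recursively drop 'title' annotations (assumes no property is itself named 'title')."""
    if isinstance(node, dict):
        return {k: strip_titles(v) for k, v in node.items() if k != "title"}
    if isinstance(node, list):
        return [strip_titles(v) for v in node]
    return node


# Hand-written example schema from the earlier cell
handwritten = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer"},
        "hobbies": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["name", "age"],
}

# Output of LLMResponse.model_json_schema() as printed above
from_pydantic = {
    "properties": {
        "name": {"title": "Name", "type": "string"},
        "age": {"title": "Age", "type": "integer"},
        "hobbies": {"items": {"type": "string"}, "title": "Hobbies", "type": "array"},
    },
    "required": ["name", "age", "hobbies"],
    "title": "LLMResponse",
    "type": "object",
}

normalized = strip_titles(from_pydantic)
differing_keys = sorted(
    key
    for key in set(handwritten) | set(normalized)
    if handwritten.get(key) != normalized.get(key)
)
print(differing_keys)  # ['required']
```

Apart from the titles, the only difference is 'required': `hobbies` has no default value in `LLMResponse`, so it is listed as required, whereas the hand-written schema leaves it optional.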
