# Using OpenAI's structured JSON output feature
Sources: 
- OpenAI web page for JSON formatting rules to use 'strict' mode: https://platform.openai.com/docs/guides/structured-outputs
- A Python Package for high-level usage of LLM structured output: https://github.com/dottxt-ai/outlines

In [1]:
# !pip uninstall -y sympy pysb torch outlines
# !pip install pysb sympy==1.11.1
# !pip install torch sympy==1.13.1 --no-deps
# !pip install outlines --no-deps


In [1]:
import sys
from pathlib import Path
import os
# Make the parent of the current working directory importable (e.g. for `indra_gpt`)
sys.path.append(str(Path.cwd().parent))


## Providing the JSON schema directly as the response format
- That is, pass the INDRA JSON schema as the `response_format` argument of `client.beta.chat.completions.parse`, where `client` is an `OpenAI()` instance.

OpenAI enforces only a subset of the JSON Schema language. The following non-exhaustive rules apply in 'strict' mode:
1. The root schema must be of type 'object'.
2. 'required' must list all, and only, the fields defined in 'properties'.
    - This ensures that the LLM emits every key defined in 'properties'; together with rule 3, it also cannot emit keys that the schema does not define.
    - To make a field in 'properties' optional, add "null" as an allowed type of that field.
3. 'additionalProperties' must be set to false on every schema of type 'object'.
4. Each sub-schema in 'anyOf' (here "#/definitions/RegulateActivity", "#/definitions/Modification", "#/definitions/SelfModification", etc.) must differ in the first key of its 'properties' field. <br>
   This can be achieved by adding a new first field with a unique constant value. E.g. "#/definitions/RegulateActivity" has
   ```
   "kind": {
       "type": "string",
       "const": "RegulateActivity"
   }
   ```
   And "#/definitions/Modification" has
   ```
   "kind": {
       "type": "string",
       "const": "Modification"
   }
   ```
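Rules 2 and 3 can be applied mechanically. The following is a minimal sketch and not part of `indra_gpt`: the helper names `nullable` and `make_strict` and the toy `agent` schema are assumptions made for illustration, and the result should still be checked against the OpenAI documentation linked above.

```python
import copy


def nullable(sub):
    """Return a version of the sub-schema `sub` that also accepts null."""
    t = sub.get("type")
    if isinstance(t, str):
        return sub if t == "null" else {**sub, "type": [t, "null"]}
    if isinstance(t, list):
        return sub if "null" in t else {**sub, "type": t + ["null"]}
    # No "type" keyword (e.g. a "$ref"): wrap the sub-schema in anyOf
    return {"anyOf": [sub, {"type": "null"}]}


def make_strict(node):
    """Recursively rewrite `node` in place (rules 2 and 3) and return it."""
    if isinstance(node, list):
        for item in node:
            make_strict(item)
    elif isinstance(node, dict):
        t = node.get("type")
        is_object = t == "object" or (isinstance(t, list) and "object" in t)
        props = node.get("properties")
        if is_object and isinstance(props, dict):
            required = set(node.get("required", []))
            for name in props:
                # Fields that were optional become required but nullable
                if name not in required:
                    props[name] = nullable(props[name])
            node["required"] = list(props)          # rule 2
            node["additionalProperties"] = False    # rule 3
        for value in node.values():
            make_strict(value)
    return node


# Hypothetical, heavily reduced Agent schema
agent = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "location": {"type": "string"},
        "activity": {"$ref": "#/definitions/ActivityCondition"},
    },
    "required": ["name"],
}
strict_agent = make_strict(copy.deepcopy(agent))
```

Here `location` ends up with type `["string", "null"]` and every key of 'properties' is listed in 'required', so the model must emit the key but may set it to null.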

- The schema below follows all of the rules for OpenAI 'strict' mode except one: it contains the 'allOf' keyword, which OpenAI does not currently support.
- 'allOf' can be resolved by merging its items into a single schema and removing the 'allOf' keyword.
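One way to perform such a merge is sketched below. This is an illustrative stand-in and not the `merge_allOf` helper from `indra_gpt.util.util`, whose implementation is not shown here; the names `resolve_ref` and `merge_all_of_sketch` and the toy schema are assumptions. It only follows local `$ref`s that appear directly inside 'allOf'.

```python
def resolve_ref(ref, root):
    """Follow a local reference such as '#/definitions/Statement' inside `root`."""
    node = root
    for part in ref.lstrip("#/").split("/"):
        node = node[part]
    return node


def merge_all_of_sketch(node, root):
    """Return a copy of `node` in which every 'allOf' is flattened."""
    if isinstance(node, list):
        return [merge_all_of_sketch(item, root) for item in node]
    if not isinstance(node, dict):
        return node
    # Copy everything except 'allOf', flattening nested schemas on the way
    merged = {k: merge_all_of_sketch(v, root) for k, v in node.items() if k != "allOf"}
    for part in node.get("allOf", []):
        if "$ref" in part:
            part = resolve_ref(part["$ref"], root)
        part = merge_all_of_sketch(part, root)
        for key, value in part.items():
            if key == "properties":
                merged.setdefault("properties", {}).update(value)
            elif key == "required":
                required = merged.setdefault("required", [])
                required.extend(r for r in value if r not in required)
            else:
                # Keep the outer value (e.g. 'description') if it already exists
                merged.setdefault(key, value)
    return merged


# Hypothetical, heavily reduced version of the INDRA schema
root = {
    "definitions": {
        "Statement": {
            "type": "object",
            "properties": {"id": {"type": "string"}},
            "required": ["id"],
        },
        "Gef": {
            "description": "GEF statement",
            "allOf": [
                {"$ref": "#/definitions/Statement"},
                {
                    "type": "object",
                    "properties": {"type": {"type": "string"}},
                    "required": ["type"],
                },
            ],
        },
    }
}
flat = merge_all_of_sketch(root, root)
gef = flat["definitions"]["Gef"]
```

After the merge, `gef` is a plain object schema whose 'properties' and 'required' combine those of `Statement` with its own.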

In [2]:
import json
from indra.statements.io import stmts_from_json
import pandas as pd


In [3]:
from indra_gpt.util.util import merge_allOf

# Load the JSON schema
schema_path = "/Users/thomaslim/gyorilab/indra_gpt/indra_gpt/resources/indra_schema_openai_structured_output.json"
with open(schema_path, "r") as f:
    schema = json.load(f)

# Resolve 'allOf', which OpenAI 'strict' mode does not support
post_processed_schema = merge_allOf(schema, schema)


In [4]:
# Sample Prompt
prompt = "We found that 24 hours of BMP pretreatment caused a doubling in PTEN half-life (15.1 hours to 28.4 hours, p = 0.03, n = 4)."


### Use `outlines` package to use OpenAI's 'structured outputs' feature.

In [12]:
import outlines

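# Use the OpenAI API instead of a local model
model = outlines.models.openai("gpt-4o")  # Or "gpt-4o-mini" for lower API cost
# Define the generator. The schema is passed as a JSON string:
# outlines accepts a Pydantic object, a function or a string, but not a dict.
generator = outlines.generate.json(model, json.dumps(post_processed_schema))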


ValueError: Cannot parse schema {'$schema': 'http://json-schema.org/draft-06/schema#', 'definitions': {'ModCondition': {'type': 'object', 'description': 'Mutation state of an amino acid position of an Agent.', 'properties': {'mod_type': {'type': 'string', 'description': "The type of post-translational modification, e.g., 'phosphorylation'. Valid modification types currently include: 'phosphorylation', 'ubiquitination', 'sumoylation', 'hydroxylation', and 'acetylation'. If an invalid modification type is passed an InvalidModTypeError is raised."}, 'residue': {'type': 'string', 'description': "String indicating the modified amino acid, e.g., 'Y' or 'tyrosine'. If None, indicates that the residue at the modification site is unknown or unspecified."}, 'position': {'type': 'string', 'description': "String indicating the position of the modified amino acid, e.g., '202'. If None, indicates that the position is unknown or unspecified."}, 'is_modified': {'type': 'boolean', 'description': 'Specifies whether the modification is present or absent. Setting the flag specifies that the Agent with the ModCondition is unmodified at the site.'}}, 'required': ['mod_type', 'is_modified']}, 'MutCondition': {'type': 'object', 'description': 'Mutation state of an amino acid position of an Agent.', 'properties': {'position': {'type': ['string', 'null'], 'description': 'Residue position of the mutation in the protein sequence.'}, 'residue_from': {'type': ['string', 'null'], 'description': 'Wild-type (unmodified) amino acid residue at the given position.'}, 'residue_to': {'type': ['string', 'null'], 'description': 'Amino acid at the position resulting from the mutation.'}}, 'required': ['position', 'residue_from', 'residue_to']}, 'ActivityCondition': {'type': 'object', 'description': 'An active or inactive state of a protein.', 'properties': {'activity_type': {'type': 'string', 'description': "The type of activity, e.g. 'kinase'. 
The basic, unspecified molecular activity is represented as 'activity'. Examples of other activity types are 'kinase', 'phosphatase', 'catalytic', 'transcription', etc."}, 'is_active': {'type': 'boolean', 'description': 'Specifies whether the given activity type is present or absent.'}}, 'required': ['activity_type', 'is_active']}, 'BoundCondition': {'type': 'object', 'description': 'Identify Agents bound (or not bound) to a given Agent in a given context.', 'properties': {'agent': {'$ref': '#/definitions/Agent'}, 'is_bound': {'type': 'boolean', 'description': 'Specifies whether the given Agent is bound or unbound in the current context.'}}, 'required': ['agent', 'is_bound']}, 'Agent': {'type': 'object', 'description': 'A molecular entity, e.g., a protein.', 'properties': {'name': {'type': 'string', 'description': 'The name of the agent, preferably a canonicalized name such as an HGNC gene name.'}, 'mods': {'type': 'array', 'items': {'$ref': '#/definitions/ModCondition'}, 'description': 'Modification state of the agent.'}, 'mutations': {'type': 'array', 'items': {'$ref': '#/definitions/MutCondition'}, 'description': 'Amino acid mutations of the agent.'}, 'bound_conditions': {'type': 'array', 'items': {'$ref': '#/definitions/BoundCondition'}, 'description': 'Other agents bound to the agent in this context.'}, 'activity': {'$ref': '#/definitions/ActivityCondition', 'description': 'Activity of the agent.'}, 'location': {'type': 'string', 'description': "Cellular location of the agent. Must be a valid name (e.g. 'nucleus') or identifier (e.g. 
'GO:0005634')for a GO cellular compartment."}, 'db_refs': {'type': 'object', 'description': 'Dictionary of database identifiers associated with this agent.'}, 'sbo': {'type': 'string', 'description': 'Role of this agent in the systems biology ontology'}}, 'required': ['name', 'db_refs']}, 'Concept': {'type': 'object', 'description': 'A concept/entity of interest that is the argument of a Statement', 'properties': {'name': {'type': 'string', 'description': 'The name of the concept, possibly a canonicalized name.'}, 'db_refs': {'type': 'object', 'description': 'Dictionary of database identifiers associated with this concept.'}}, 'required': ['name', 'db_refs']}, 'Context': {'type': 'object', 'description': 'The context in which a given Statement was reported.', 'properties': {'type': {'type': 'string', 'pattern': '^((bio)|(world))$', 'description': "Either 'world' or 'bio', depending on the type of context being repersented."}}}, 'BioContext': {'type': 'object', 'description': 'The biological context of a Statement.', 'properties': {'type': {'type': 'string', 'pattern': '^bio$', 'description': "The type of context, in this case 'bio'."}, 'location': {'$ref': '#/definitions/RefContext', 'description': 'Cellular location, typically a sub-cellular compartment.'}, 'cell_line': {'$ref': '#/definitions/RefContext', 'description': 'Cell line context, e.g., a specific cell line, like BT20.'}, 'cell_type': {'$ref': '#/definitions/RefContext', 'description': 'Cell type context, broader than a cell line, like macrophage.'}, 'organ': {'$ref': '#/definitions/RefContext', 'description': 'Organ context.'}, 'disease': {'$ref': '#/definitions/RefContext', 'description': 'Disease context.'}, 'species': {'$ref': '#/definitions/RefContext', 'description': 'Species context.'}}}, 'RefContext': {'type': 'object', 'description': 'Represents a context identified by name and grounding references.', 'properties': {'name': {'type': 'string', 'description': 'The name associated with the 
context.'}, 'db_refs': {'type': 'object', 'description': 'Dictionary of database identifiers associated with this context.'}}}, 'Evidence': {'type': 'object', 'description': 'Container for evidence supporting a given statement.', 'properties': {'source_api': {'type': 'string', 'description': "String identifying the INDRA API used to capture the statement, e.g., 'trips', 'biopax', 'bel'."}, 'pmid': {'type': 'string', 'description': 'String indicating the Pubmed ID of the source of the statement.'}, 'source_id': {'type': 'string', 'description': 'For statements drawn from databases, ID of the database entity corresponding to the statement.'}, 'text': {'type': 'string', 'description': 'Natural language text supporting the statement.'}, 'annotations': {'type': 'object', 'description': 'Dictionary containing additional information on the context of the statement, e.g., species, cell line, tissue type, etc. The entries may vary depending on the source of the information.'}, 'epistemics': {'type': 'object', 'description': 'A dictionary describing various forms of epistemic certainty associated with the statement.'}}}, 'Statement': {'type': 'object', 'description': "All statement types, below, may have these fields and 'inherit' from this schema", 'properties': {'evidence': {'type': 'array', 'items': {'$ref': '#/definitions/Evidence'}}, 'id': {'type': 'string', 'description': 'Statement UUID'}, 'supports': {'type': 'array', 'items': {'type': 'string'}}, 'supported_by': {'type': 'array', 'items': {'type': 'string'}}}, 'required': ['id']}, 'Modification': {'description': 'Statement representing the modification of a protein.', 'allOf': [{'$ref': '#/definitions/Statement'}, {'type': 'object', 'properties': {'type': {'type': 'string', 'pattern': 
'^((Phosphorylation)|(Dephosphorylation)|(Ubiquitination)|(Deubiquitination)|(Sumoylation)|(Desumoylation)|(Hydroxylation)|(Dehydroxylation)|(Acetylation)|(Deacetylation)|(Glycosylation)|(Deglycosylation)|(Farnesylation)|(Defarnesylation)|(Geranylgeranylation)|(Degeranylgeranylation)|(Palmitoylation)|(Depalmitoylation)|(Myristoylation)|(Demyristoylation)|(Ribosylation)|(Deribosylation)|(Methylation)|(Demethylation))$', 'description': 'The type of the statement'}, 'enz': {'$ref': '#/definitions/Agent', 'description': 'The enzyme involved in the modification.'}, 'sub': {'$ref': '#/definitions/Agent', 'description': 'The substrate of the modification.'}, 'residue': {'type': 'string', 'description': 'The amino acid residue being modified, or None if it is unknown or unspecified.'}, 'position': {'type': 'string', 'description': 'The position of the modified amino acid, or None if it is unknown or unspecified.'}}, 'required': ['type']}]}, 'SelfModification': {'description': 'Statement representing the self-modification of a protein.', 'allOf': [{'$ref': '#/definitions/Statement'}, {'type': 'object', 'properties': {'type': {'type': 'string', 'pattern': '^((Autophosphorylation)|(Transphosphorylation))$', 'description': 'The type of the statement'}, 'enz': {'$ref': '#/definitions/Agent', 'description': 'The enzyme involved in the modification.'}, 'residue': {'type': 'string', 'description': 'The amino acid residue being modified, or None if it is unknown or unspecified.'}, 'position': {'type': 'string', 'description': 'The position of the modified amino acid, or None if it is unknown or unspecified.'}}, 'required': ['type']}]}, 'RegulateActivity': {'description': 'Regulation of activity (such as activation and inhibition)', 'allOf': [{'$ref': '#/definitions/Statement'}, {'type': 'object', 'properties': {'type': {'type': 'string', 'pattern': '^((Activation)|(Inhibition))$', 'description': 'The type of the statement'}, 'subj': {'$ref': '#/definitions/Agent', 'description': 
"The agent responsible for the change in activity, i.e., the 'upstream' node."}, 'obj': {'$ref': '#/definitions/Agent', 'description': "The agent whose activity is influenced by the subject, i.e., the 'downstream' node."}, 'obj_activity': {'type': 'string', 'description': "The activity of the obj Agent that is affected, e.g., its 'kinase' activity."}}, 'required': ['type']}]}, 'ActiveForm': {'description': 'Specifies conditions causing an Agent to be active or inactive. Types of conditions influencing a specific type of biochemical activity can include modifications, bound Agents, and mutations.', 'allOf': [{'$ref': '#/definitions/Statement'}, {'type': 'object', 'properties': {'type': {'type': 'string', 'pattern': '^ActiveForm$', 'description': 'The type of the statement'}, 'agent': {'$ref': '#/definitions/Agent', 'description': 'The Agent in a particular active or inactive state. The sets of ModConditions, BoundConditions, and MutConditions on the given Agent instance indicate the relevant conditions.'}, 'activity': {'type': 'string', 'description': "The type of activity influenced by the given set of conditions, e.g., 'kinase'."}, 'is_active': {'type': 'boolean', 'description': 'Whether the conditions are activating (True) or inactivating (False).'}}, 'required': ['type', 'agent', 'activity']}]}, 'Gef': {'description': 'Exchange of GTP for GDP on a small GTPase protein mediated by a GEF. Represents the generic process by which a guanosine exchange factor (GEF) catalyzes nucleotide exchange on a GTPase protein.', 'allOf': [{'$ref': '#/definitions/Statement'}, {'type': 'object', 'properties': {'type': {'type': 'string', 'pattern': '^Gef$', 'description': 'The type of the statement'}, 'gef': {'$ref': '#/definitions/Agent', 'description': 'The guanosine exchange factor.'}, 'ras': {'$ref': '#/definitions/Agent', 'description': 'The GTPase protein.'}}, 'required': ['type']}]}, 'Gap': {'description': "Acceleration of a GTPase protein's GTP hydrolysis rate by a GAP. 
Represents the generic process by which a GTPase activating protein (GAP) catalyzes GTP hydrolysis by a particular small GTPase protein.", 'allOf': [{'$ref': '#/definitions/Statement'}, {'type': 'object', 'properties': {'type': {'type': 'string', 'pattern': '^Gap$', 'description': 'The type of the statement'}, 'gap': {'$ref': '#/definitions/Agent', 'description': 'The GTPase activating protein.'}, 'ras': {'$ref': '#/definitions/Agent', 'description': 'The GTPase protein.'}}, 'required': ['type']}]}, 'Complex': {'description': 'A set of proteins observed to be in a complex.', 'allOf': [{'$ref': '#/definitions/Statement'}, {'type': 'object', 'properties': {'type': {'type': 'string', 'pattern': '^((Complex)|(Association))$', 'description': 'The type of the statement'}, 'members': {'type': 'array', 'items': {'$ref': '#/definitions/Agent'}}}, 'required': ['type']}]}, 'Association': {'description': 'A set of unordered concepts that are associated with each other.', 'allOf': [{'$ref': '#/definitions/Complex'}]}, 'Translocation': {'description': 'The translocation of a molecular agent from one location to another.', 'allOf': [{'$ref': '#/definitions/Statement'}, {'type': 'object', 'properties': {'type': {'type': 'string', 'pattern': '^Translocation$', 'description': 'The type of the statement'}, 'agent': {'$ref': '#/definitions/Agent', 'description': 'The agent which translocates.'}, 'from_location': {'type': 'string', 'description': "The location from which the agent translocates. This must be a valid GO cellular component name (e.g. 'cytoplasm') or ID (e.g. 'GO:0005737')."}, 'to_location': {'type': 'string', 'description': 'The location to which the agent translocates. 
This must be a valid GO cellular component name or ID.'}}, 'required': ['type', 'agent']}]}, 'RegulateAmount': {'description': 'Represents directed, two-element interactions.', 'allOf': [{'$ref': '#/definitions/Statement'}, {'type': 'object', 'properties': {'type': {'type': 'string', 'pattern': '^((IncreaseAmount)|(DecreaseAmount))$'}, 'subj': {'$ref': '#/definitions/Agent', 'description': 'The mediating protein'}, 'obj': {'$ref': '#/definitions/Agent', 'description': 'The affected protein'}}, 'required': ['type']}]}, 'Conversion': {'description': 'Conversion of molecular species mediated by a controller protein.', 'allOf': [{'$ref': '#/definitions/Statement'}, {'type': 'object', 'properties': {'type': {'type': 'string', 'pattern': '^Conversion$'}, 'subj': {'$ref': '#/definitions/Agent', 'description': 'The protein mediating the conversion.'}, 'obj_from': {'type': 'array', 'items': {'$ref': '#/definitions/Agent'}, 'description': 'The list of molecular species being consumed by the conversion.'}, 'obj_to': {'type': 'array', 'items': {'$ref': '#/definitions/Agent'}, 'description': 'The list of molecular species being created by the conversion.'}}, 'required': ['type']}]}}, 'type': 'array', 'items': {'anyOf': [{'$ref': '#/definitions/RegulateActivity'}, {'$ref': '#/definitions/Modification'}, {'$ref': '#/definitions/SelfModification'}, {'$ref': '#/definitions/ActiveForm'}, {'$ref': '#/definitions/Gef'}, {'$ref': '#/definitions/Gap'}, {'$ref': '#/definitions/Complex'}, {'$ref': '#/definitions/Association'}, {'$ref': '#/definitions/Translocation'}, {'$ref': '#/definitions/RegulateAmount'}]}}. The schema must be either a Pydantic object, a function or a string that contains the JSON Schema specification

In [6]:
result = generator(prompt)


INFO: [2025-02-27 14:24:11] httpx - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


In [7]:
print(json.dumps(result, indent=2))


{
  "statements": [
    {
      "kind": "RegulateAmount",
      "type": "IncreaseAmount",
      "subj": {
        "name": "BMP",
        "mods": [],
        "mutations": [],
        "bound_conditions": [],
        "activity": null,
        "location": null,
        "db_refs": {
          "HGNC": "BMP",
          "UP": null,
          "FPLX": null,
          "CHEBI": null,
          "GO": null,
          "TEXT": "BMP",
          "NCIT": null
        },
        "sbo": null
      },
      "obj": {
        "name": "PTEN",
        "mods": [],
        "mutations": [],
        "bound_conditions": [],
        "activity": null,
        "location": null,
        "db_refs": {
          "HGNC": "PTEN",
          "UP": null,
          "FPLX": null,
          "CHEBI": null,
          "GO": null,
          "TEXT": "PTEN",
          "NCIT": null
        },
        "sbo": null
      },
      "evidence": [
        {
          "text": "We found that 24 hours of BMP pretreatment caused a doubling in PTEN 

In [None]:
from indra.statements import stmts_from_json
from indra_gpt.post_process.post_process import PostProcessor

# Post-process the raw LLM output, then keep the list under the 'statements' key
pp = PostProcessor()
post_processed_statement_json = pp.post_process_extracted_statement_json(result)
post_processed_statement_json = post_processed_statement_json['statements']

# Convert the statement JSON into INDRA Statement objects
statements = stmts_from_json(post_processed_statement_json)


INFO: [2025-02-24 09:24:55] indra.preassembler.grounding_mapper.disambiguate - INDRA DB is not available for text content retrieval for grounding disambiguation.


In [8]:
statements


[IncreaseAmount(BMP(), PTEN())]

### Use a SageMaker endpoint serving `outlines`

In [None]:
import boto3
import json

# Define the SageMaker runtime client
sm_runtime = boto3.client("sagemaker-runtime", region_name="us-east-2")

endpoint_name = "outlines-serverless-ep-2025-02-25-19-44-15"

# Example JSON payload
payload = {
    "json_schema": post_processed_schema,
    "prompt": prompt
}

# Invoke the endpoint
response = sm_runtime.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType="application/json",
    Body=json.dumps(payload)
)

# Parse the response
result = json.loads(response["Body"].read().decode("utf-8"))
print(result)


{'output': {'statements': [{'kind': 'RegulateAmount', 'type': 'IncreaseAmount', 'subj': {'name': 'BMP', 'mods': None, 'mutations': None, 'bound_conditions': None, 'activity': None, 'location': None, 'db_refs': {'HGNC': None, 'UP': None, 'FPLX': None, 'CHEBI': None, 'GO': None, 'TEXT': 'BMP', 'NCIT': None}, 'sbo': None}, 'obj': {'name': 'PTEN', 'mods': None, 'mutations': None, 'bound_conditions': None, 'activity': None, 'location': None, 'db_refs': {'HGNC': 'HGNC:9588', 'UP': 'P60484', 'FPLX': 'PTEN', 'CHEBI': None, 'GO': None, 'TEXT': 'PTEN', 'NCIT': None}, 'sbo': None}, 'evidence': [{'text': '24 hours of BMP pretreatment caused a doubling in PTEN half-life (15.1 hours to 28.4 hours, p = 0.03, n = 4).', 'source_api': 'experimental', 'pmid': None, 'source_id': None, 'annotations': None, 'epistemics': None}], 'id': None, 'supports': None, 'supported_by': None}]}}


### Directly using OpenAI API call

In [20]:
response_format={
        "type": "json_schema", 
        "json_schema": {
            "name": "indra_statement_json",
            "strict": True, 
            "schema": post_processed_schema
        }
}


In [28]:
from openai import OpenAI
client = OpenAI()
completion = client.beta.chat.completions.parse(
    model="gpt-4o-2024-08-06",
    messages=[
        {"role": "system", "content": "Extract the event information."},
        {"role": "user", "content": "We found that 24 hours of BMP pretreatment caused a doubling in PTEN half-life (15.1 hours to 28.4 hours, p = 0.03, n = 4)."},
    ],
    # Enforce the schema defined in the previous cell
    response_format=response_format,
)


INFO: [2025-02-11 18:29:28] httpx - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


In [16]:
completion.choices[0].message.content


'{"statements":[{"kind":"RegulateAmount","type":"IncreaseAmount","subj":{"name":"BMP","mods":null,"mutations":null,"bound_conditions":null,"activity":null,"location":null,"db_refs":null,"sbo":null},"obj":{"name":"PTEN","mods":null,"mutations":null,"bound_conditions":null,"activity":null,"location":null,"db_refs":null,"sbo":null},"evidence":[{"text":"We found that 24 hours of BMP pretreatment caused a doubling in PTEN half-life (15.1 hours to 28.4 hours, p = 0.03, n = 4).","source_api":null,"pmid":null,"source_id":null,"annotations":null,"epistemics":null}],"id":null,"supports":null,"supported_by":null}]}'

In [9]:
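# The message content is a JSON string; parse it before pretty-printing
result = json.loads(completion.choices[0].message.content)
print(json.dumps(result, indent=4))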


{
    "statements": [
        {
            "kind": "RegulateAmount",
            "type": "IncreaseAmount",
            "subj": {
                "name": "BMP",
                "mods": null,
                "mutations": null,
                "bound_conditions": null,
                "activity": null,
                "location": null,
                "db_refs": null,
                "sbo": null
            },
            "obj": {
                "name": "PTEN",
                "mods": null,
                "mutations": null,
                "bound_conditions": null,
                "activity": null,
                "location": null,
                "db_refs": null,
                "sbo": null
            },
            "evidence": [
                {
                    "text": "We found that 24 hours of BMP pretreatment caused a doubling in PTEN half-life (15.1 hours to 28.4 hours, p = 0.03, n = 4).",
                    "source_api": null,
                    "pmid": null,
           

## Apply to benchmark corpus

In [5]:
corpus_path = "/Users/thomaslim/gyorilab/indra_gpt/indra_gpt/resources/indra_benchmark_corpus_sample_50.json"
with open(corpus_path, "r") as f:
    indra_benchmark_corpus_sample_50 = json.load(f)


In [11]:
# indra_benchmark_corpus_sample_50 is assumed to be a list of INDRA statement JSON objects
data = [
    {
        'text': obj['evidence'][0]['text'] if 'evidence' in obj and obj['evidence'] else None,
        'original_json_statement': obj
    }
    for obj in indra_benchmark_corpus_sample_50
]

# Convert list to DataFrame
df = pd.DataFrame(data)


In [12]:
# Use indra.statements.io.stmts_from_json to convert each original JSON statement to a list of INDRA statements
df['indra_statements_from_original'] = None  # object column, so each cell can hold a list
for i, row in df.iterrows():
    try:
        indra_statement_from_original = stmts_from_json([row.original_json_statement], on_missing_support='handle')
        df.at[i, 'indra_statements_from_original'] = indra_statement_from_original
    except Exception as e:
        df.at[i, 'indra_statements_from_original'] = str(e)


In [13]:
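# Extract statements from each evidence text with the LLM
df['extracted_statement_json'] = None  # object column, so each cell can hold a list
for i, row in df.iterrows():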
    try:
        extracted_statement_json = generator(row.text)
        df.at[i, 'extracted_statement_json'] = extracted_statement_json['statements']
    except Exception as e:
        df.at[i, 'extracted_statement_json'] = str(e)


INFO: [2025-02-11 11:25:07] openai._base_client - Retrying request to /chat/completions in 0.414386 seconds


INFO: [2025-02-11 11:25:31] httpx - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO: [2025-02-11 11:25:31] openai._base_client - Retrying request to /chat/completions in 0.464199 seconds
INFO: [2025-02-11 11:25:38] httpx - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO: [2025-02-11 11:25:38] openai._base_client - Retrying request to /chat/completions in 0.395519 seconds
INFO: [2025-02-11 11:25:53] httpx - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO: [2025-02-11 11:25:53] openai._base_client - Retrying request to /chat/completions in 0.396036 seconds
INFO: [2025-02-11 11:26:17] httpx - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO: [2025-02-11 11:26:17] openai._base_client - Retrying request to /chat/completions in 0.412226 seconds
INFO: [2025-02-11 11:26:46] httpx - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/

In [14]:
def remove_empty_strings_and_lists(d):
    """Recursively delete keys whose value is None, "", [] or {}, in place.

    Dicts nested inside lists are cleaned too, but the list elements
    themselves are kept even if they end up empty.
    """
    for key, value in list(d.items()):
        if isinstance(value, dict):
            remove_empty_strings_and_lists(value)
        elif isinstance(value, list):
            for item in value:
                if isinstance(item, dict):
                    remove_empty_strings_and_lists(item)
        # Checked after recursion, so dicts that became empty are removed as well
        if value in [None, "", [], {}]:
            del d[key]
    return d
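A non-mutating variant of this cleanup is sketched below, in case the raw LLM output should stay untouched for later inspection. The names `prune_empty` and `_is_empty` and the sample statement are assumptions for illustration. Unlike the function above, this version also drops list elements that end up empty.

```python
def _is_empty(v):
    """True for None and for empty strings, lists and dicts (but not 0 or False)."""
    return v is None or (isinstance(v, (str, list, dict)) and len(v) == 0)


def prune_empty(value):
    """Return a copy of `value` without empty entries, at any nesting depth."""
    if isinstance(value, dict):
        pruned = {k: prune_empty(v) for k, v in value.items()}
        return {k: v for k, v in pruned.items() if not _is_empty(v)}
    if isinstance(value, list):
        pruned = [prune_empty(v) for v in value]
        return [v for v in pruned if not _is_empty(v)]
    return value


# Hypothetical statement shaped like the LLM output shown earlier
raw_stmt = {
    "type": "IncreaseAmount",
    "subj": {
        "name": "BMP",
        "mods": [],
        "activity": None,
        "db_refs": {"HGNC": None, "TEXT": "BMP"},
    },
    "evidence": [{"text": "BMP pretreatment ...", "pmid": None}],
    "id": None,
    "supports": [],
}
cleaned_stmt = prune_empty(raw_stmt)
```

`raw_stmt` is left unchanged, while `cleaned_stmt` keeps only `type`, `subj` (with `name` and `db_refs["TEXT"]`) and the evidence text.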


In [17]:
# Now apply indra.statements.io.stmts_from_json to convert the extracted_statement_json to a list of INDRA statements
df['indra_statement_from_generated'] = None  # object column, so each cell can hold a list
for i, row in df.iterrows():
    try:
        cleaned = [remove_empty_strings_and_lists(x) for x in row.extracted_statement_json]
        indra_statement_from_generated = stmts_from_json(cleaned, on_missing_support='handle')
        df.at[i, 'indra_statement_from_generated'] = indra_statement_from_generated
    except Exception as e:
        df.at[i, 'indra_statement_from_generated'] = str(e)




In [72]:
i = 2
print(df.text[i])
print(df.indra_statements_from_original[i])
print(df.indra_statement_from_generated[i])


These HY specific CD8+ T cells produced interferon gamma (IFNG) following peptide stimulation, demonstrating their functional capacity.
[Activation(KDM5D(), IFNG())]
[ActiveForm(CD8+ T cell(location: extracellular region), cytokine production, True)]


(* Parsing 'Association' statements currently fails. In the 'indra.statements.statements' module, line 2263 reads
`members = [Statement._from_json(m) for m in members]`, so each dict in 'members' is expected to be a Statement rather than
an Agent. A Statement requires a 'type' property, which Agent does not have, so parsing raises a missing-key error. Either this line should be changed to `members = [Agent._from_json(m) for m in members]`, or the schema should add a 'type' field to the Agent definition.)