# Using Intrinsics on OpenAI-Compatible Inference Backends

This notebook demonstrates how to use the shared input and output processing code for
intrinsics when performing model inference with an OpenAI-compatible backend such as
vLLM.

In [1]:
# Imports go in this cell
import openai
import json
import granite_common

## Constants

Change the value of the constants `intrinsic_name` and `base_model` in the cell that 
follows to change which intrinsic will be demonstrated in the remainder of this notebook.

The remaining constants are derived from these values.

In [2]:
testdata_dir = "../tests/granite_common/rag_agent_lib/testdata"

# Change the following two constants to select a different intrinsic for the remainder
# of this notebook.
intrinsic_name = "context_relevance"
base_model = "granite-3.3-8b-instruct"

# Adjust remaining constants according to intrinsic
io_yaml_file = None  # None --> load from Hugging Face Hub
arg_file = None
if intrinsic_name == "answerability":
    request_json_file = f"{testdata_dir}/input_json/answerable.json"
    io_yaml_file = f"{testdata_dir}/input_yaml/answerability.yaml"
elif intrinsic_name == "context_relevance":
    request_json_file = f"{testdata_dir}/input_json/context_relevance.json"
    io_yaml_file = f"{testdata_dir}/input_yaml/context_relevance.yaml"
    arg_file = f"{testdata_dir}/input_args/context_relevance.json"
else:
    # Fail early instead of hitting a NameError on `request_json_file` later on
    raise ValueError(f"No test data configured for intrinsic '{intrinsic_name}'")

if io_yaml_file is None:
    # Fetch LoRA directory so we can read io.yaml, then parse io.yaml
    lora_dir = granite_common.rag_agent_lib.util.obtain_lora(intrinsic_name, base_model)
    io_yaml_file = lora_dir / "io.yaml"

## Instantiate the input and output processing classes

The constructors for the classes `RagAgentLibRewriter` and `RagAgentLibResultProcessor`
serve as factories: given the `io.yaml` configuration file for an intrinsic, they produce
that intrinsic's input and output processors, respectively.

In [3]:
print(
    "Instantiating input and output processing from configuration file:\n"
    f"{io_yaml_file}"
)

rewriter = granite_common.RagAgentLibRewriter(config_file=io_yaml_file)
result_processor = granite_common.RagAgentLibResultProcessor(config_file=io_yaml_file)

Instantiating input and output processing from configuration file:
../tests/granite_common/rag_agent_lib/testdata/input_yaml/context_relevance.yaml


## Input processing

The cells that follow load an example OpenAI-compatible chat completion request from
a local file, then show how to apply input processing to the request.

In [4]:
# Read original request from the appropriate file
print(f"Loading request data from {request_json_file}")
with open(request_json_file, encoding="utf-8") as f:
    request_json_str = f.read()
request_json = json.loads(request_json_str)

# Some parameters like model name aren't kept in the JSON files that we use for testing.
# Apply appropriate values for those parameters.
request_json["model"] = intrinsic_name
request_json["temperature"] = 0.0

print("Original request:")
print(json.dumps(request_json, indent=2))

Loading request data from ../tests/granite_common/rag_agent_lib/testdata/input_json/context_relevance.json
Original request:
{
  "messages": [
    {
      "content": "Who is the CEO of Microsoft?",
      "role": "user"
    }
  ],
  "extra_body": {
    "documents": [
      {
        "doc_id": "1",
        "text": "This document has nothing to do with the certainty check and is only here for testing."
      }
    ]
  },
  "model": "context_relevance",
  "temperature": 0.0
}
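
A request with the same shape can also be assembled in code rather than loaded from a
file. The sketch below is only an illustration: `build_request` is a hypothetical helper,
not part of `granite_common`, and the field names simply mirror the example request
printed above. Other intrinsics may expect additional fields.

```python
import json


def build_request(user_message: str, document_texts: list[str]) -> dict:
    """Assemble a chat completion request shaped like the example above.

    Hypothetical helper for illustration; not part of granite_common.
    """
    return {
        "messages": [{"content": user_message, "role": "user"}],
        "extra_body": {
            # Number the documents "1", "2", ... as in the example request
            "documents": [
                {"doc_id": str(index), "text": text}
                for index, text in enumerate(document_texts, start=1)
            ]
        },
    }


example_request = build_request(
    "Who is the CEO of Microsoft?",
    ["This document is only here for testing."],
)
# As in the cell above, the model name and temperature are set separately
example_request["model"] = "context_relevance"
example_request["temperature"] = 0.0

print(json.dumps(example_request, indent=2))
```

A dictionary built this way can stand in for `request_json` in the cells that follow.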


In [5]:
# Some intrinsics take one or more additional arguments besides the target chat
# completion request. Load the additional arguments from a file if that is the case.
intrinsic_kwargs = {}
if arg_file is not None:
    with open(arg_file, encoding="utf-8") as file:
        intrinsic_kwargs = json.load(file)
    print(f"Using additional arguments:\n{intrinsic_kwargs}")

Using additional arguments:
{'document_content': 'Microsoft Corporation is an American multinational corporation and technology conglomerate headquartered in Redmond, Washington.[2] Founded in 1975, the company became influential in the rise of personal computers through software like Windows, and the company has since expanded to Internet services, cloud computing, video gaming and other fields. Microsoft is the largest software maker, one of the most valuable public U.S. companies,[a] and one of the most valuable brands globally.'}


In [6]:
# Run request through input processing.
rewritten_request = rewriter.transform(request_json, **intrinsic_kwargs)

print("Request after input processing:")
print(rewritten_request.model_dump_json(indent=2))

Request after input processing:
{
  "messages": [
    {
      "content": "Who is the CEO of Microsoft?",
      "role": "user"
    },
    {
      "content": "DOCUMENT: Microsoft Corporation is an American multinational corporation and technology conglomerate headquartered in Redmond, Washington.[2] Founded in 1975, the company became influential in the rise of personal computers through software like Windows, and the company has since expanded to Internet services, cloud computing, video gaming and other fields. Microsoft is the largest software maker, one of the most valuable public U.S. companies,[a] and one of the most valuable brands globally.\n",
      "role": "user"
    }
  ],
  "model": "context_relevance",
  "extra_body": {
    "documents": [
      {
        "text": "This document has nothing to do with the certainty check and is only here for testing.",
        "doc_id": "1"
      }
    ],
    "guided_json": {
      "title": "ContextRelevanceOutput",
      "type": "object",
   

## Running inference

Passing a request through `RagAgentLibRewriter.transform()` turns it into a request
that can be sent directly to an OpenAI-compatible inference endpoint for the intrinsic.

The cells that follow show how to perform inference using the chat completions API.

To run these cells, you'll need to start a server such as vLLM that serves the intrinsic
under the appropriate model name at the base URL specified by `openai_base_url` in the
next cell.

In [7]:
# Connect to the local inference server
openai_base_url = "http://localhost:55555/v1"
openai_api_key = "granite_intrinsics_1234"
client = openai.OpenAI(base_url=openai_base_url, api_key=openai_api_key)
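
Before sending the request, it can help to confirm that the server actually lists the
intrinsic under the expected model name. OpenAI-compatible servers expose the served
models at the `/models` path under the base URL. The following is a minimal sketch that
uses only the standard library; the helper names `extract_model_ids` and
`fetch_model_ids` are hypothetical, and the sample payload is hand-written for
illustration rather than captured from a real server.

```python
import json
import urllib.request


def extract_model_ids(models_payload: dict) -> list[str]:
    """Return the model IDs from the body of a `/models` response."""
    return [entry["id"] for entry in models_payload.get("data", [])]


def fetch_model_ids(base_url: str, api_key: str, timeout: float = 5.0) -> list[str]:
    """Query `<base_url>/models` and return the IDs of the models being served."""
    request = urllib.request.Request(
        f"{base_url.rstrip('/')}/models",
        headers={"Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return extract_model_ids(json.load(response))


# Offline illustration with a hand-written payload (assumed, not real server output)
sample_payload = {
    "object": "list",
    "data": [
        {"id": "granite-3.3-8b-instruct", "object": "model"},
        {"id": "context_relevance", "object": "model"},
    ],
}
print("context_relevance" in extract_model_ids(sample_payload))
```

With a server running, `fetch_model_ids(openai_base_url, openai_api_key)` should return
a list that includes the value of `intrinsic_name`. If it does not, the request in the
next cell is likely to fail with a "model not found" style error.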

In [8]:
# Pass our rewritten request directly to `chat.completions.create()`
chat_completion = client.chat.completions.create(**rewritten_request.model_dump())

print("Immediately after low-level inference, first completion is:")
print(chat_completion.choices[0].model_dump_json(indent=2))

Immediately after low-level inference, first completion is:
{
  "finish_reason": "stop",
  "index": 0,
  "logprobs": {
    "content": [
      {
        "token": "{",
        "bytes": [
          123
        ],
        "logprob": -0.011047743260860443,
        "top_logprobs": [
          {
            "token": "{",
            "bytes": [
              123
            ],
            "logprob": -0.011047743260860443
          },
          {
            "token": "{\"",
            "bytes": [
              123,
              34
            ],
            "logprob": -4.511047840118408
          },
          {
            "token": "<|end_of_text|>",
            "bytes": [
              60,
              124,
              101,
              110,
              100,
              95,
              111,
              102,
              95,
              116,
              101,
              120,
              116,
              124,
              62
            ],
            "logprob": -9999.0


In [9]:
print(chat_completion.model_dump_json(indent=2))

{
  "id": "chatcmpl-d2739b219b984788b765b807d0696697",
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "logprobs": {
        "content": [
          {
            "token": "{",
            "bytes": [
              123
            ],
            "logprob": -0.011047743260860443,
            "top_logprobs": [
              {
                "token": "{",
                "bytes": [
                  123
                ],
                "logprob": -0.011047743260860443
              },
              {
                "token": "{\"",
                "bytes": [
                  123,
                  34
                ],
                "logprob": -4.511047840118408
              },
              {
                "token": "<|end_of_text|>",
                "bytes": [
                  60,
                  124,
                  101,
                  110,
                  100,
                  95,
                  111,
                  102,
              

## Post-processing inference results

The raw output of some intrinsics needs additional post-processing before an application
can consume it easily. The method `RagAgentLibResultProcessor.transform()` performs this
post-processing.

The cells that follow show how to use this method to transform the raw output of the
`chat.completions.create()` API call into the intrinsic's application-level output
value.

By convention, this application-level output value is returned in the same format as a
chat completion response.

In [10]:
processed_chat_completion = result_processor.transform(chat_completion)

print("After post-processing, first completion is:")
print(processed_chat_completion.choices[0].model_dump_json(indent=2))

Transforming path '('context_relevance',)'
    Old value: 'value='irrelevant' begin=23 end=35'
    New value: '0.47131842717316635'
After post-processing, first completion is:
{
  "index": 0,
  "message": {
    "content": "{\"context_relevance\": 0.47131842717316635}",
    "role": "assistant",
    "tool_calls": []
  },
  "finish_reason": "stop"
}
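
Because the processed result keeps the chat completion format, the `content` field is
itself a JSON string that the application still has to decode. The sketch below shows
one way to do so, using the content string from the output above as a literal so that it
runs without a server. `extract_intrinsic_value` is a hypothetical helper, not part of
`granite_common`.

```python
import json


def extract_intrinsic_value(content: str, field: str) -> float:
    """Decode the JSON message content and return the named field as a float.

    Hypothetical helper for illustration; not part of granite_common.
    """
    parsed = json.loads(content)
    return float(parsed[field])


# Content string copied from the post-processed completion shown above
content = '{"context_relevance": 0.47131842717316635}'
score = extract_intrinsic_value(content, "context_relevance")
print(f"context_relevance = {score:.3f}")
```

In the notebook itself, the same helper could be applied to
`processed_chat_completion.choices[0].message.content`.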
