# Evaluating an OpenAPI Chain

This notebook goes over ways to semantically evaluate an OpenAPI Chain, which calls an endpoint defined by the OpenAPI specification using purely natural language.

In [1]:
from langchain.tools import OpenAPISpec, APIOperation
from langchain.chains import OpenAPIEndpointChain, LLMChain
from langchain.requests import Requests
from langchain.llms import OpenAI

## Load the API Chain

Load a wrapper of the spec (so we can work with it more easily). You can load from a url or from a local file.

In [2]:
# Load and parse the OpenAPI Spec
spec = OpenAPISpec.from_url("https://www.klarna.com/us/shopping/public/openai/v0/api-docs/")
# Load a single endpoint operation
operation = APIOperation.from_openapi_spec(spec, '/public/openai/v0/products', "get")
# Select any LangChain LLM
llm = OpenAI()
# Create the endpoint chain
api_chain = OpenAPIEndpointChain.from_api_operation(
    operation, 
    llm, 
    requests=Requests(), 
    verbose=True,
    return_intermediate_steps=True # Return request and response text
)

Attempting to load an OpenAPI 3.0.1 spec.  This may result in degraded performance. Convert your OpenAPI spec to 3.1.* spec for better support.


### *Optional*: Generate Input Questions and Request Ground Truth Queries

In [3]:
# import re
# from langchain.prompts import PromptTemplate

# template = """Below is a service description:

# {spec}

# Imagine you're a new user trying to use {operation} through a search bar. What are 10 different things you want to request?
# Wants/Questions:
# 1. """

# prompt = PromptTemplate.from_template(template)

# generation_chain = LLMChain(llm=llm, prompt=prompt)

# questions_ = generation_chain.run(spec=operation.to_typescript(), operation=operation.operation_id).split('\n')
# # Strip preceding numeric bullets
# questions = [re.sub(r'^\d+\. ', '', q).strip() for q in questions_]
# questions

In [4]:
# ground_truths = [
# {"q": ...} # What are the best queries for each input?
# ]

## Run the API Chain

The two simplest questions a user of the API Chain are:
- Did the chain succesfully access the endpoint?
- Did the action accomplish the correct result?


In [5]:
from collections import defaultdict
# Collect metrics to report at completion
scores = defaultdict(list)

In [6]:
questions = [
     'What iPhone models are available?',
     'Are there any budget laptops?',
     'Show me the cheapest gaming PC.',
     'Are there any tablets under $400?',
     'What are the best headphones?',
     'What are the top rated laptops?',
     'I want to buy some shoes. I like Adidas and Nike.',
     'I want to buy a new skirt',
     'My company is asking me to get a professional Deskopt PC - money is no object.',
     'What are the best budget cameras?'
]

In [7]:
## Run the the API chain itself
chain_outputs = []
failed_examples = []
for question in questions:
    try:
        chain_outputs.append(api_chain(question))
        scores["completed"].append(1.0)
    except Exception as e:
        failed_examples.append({'q': question, 'error': e})
        scores["completed"].append(0.0)



[1m> Entering new OpenAPIEndpointChain chain...[0m


[1m> Entering new APIRequesterChain chain...[0m
Prompt after formatting:
[32;1m[1;3mYou are a helpful AI Assistant. Please provide JSON arguments to agentFunc() based on the user's instructions.

API_SCHEMA: ```typescript
type productsUsingGET = (_: {
/* A precise query that matches one very small category or product that needs to be searched for to find the products the user is looking for. If the user explicitly stated what they want, use that as a query. The query is as specific as possible to the product name or category mentioned by the user in its singular form, and don't contain any clarifiers like latest, newest, cheapest, budget, premium, expensive or similar. The query is always taken from the latest topic, if there is a new topic a new query is started. */
		q: string,
/* number of products returned */
		size?: number,
/* (Optional) Minimum price in local currency for the product searched for. Either explicitly sta


[1m> Finished chain.[0m
[33;1m[1;3mThere are several Apple iPhone models available, including the Apple iPhone 14 Pro Max 256GB, Apple iPhone 12 128GB, Apple iPhone 13 128GB, Apple iPhone 14 Pro 128GB, Apple iPhone 14 Pro 256GB, Apple iPhone 14 Pro Max 128GB, Apple iPhone 13 Pro Max 128GB, Apple iPhone 14 128GB, Apple iPhone 12 Pro 512GB, and Apple iPhone 12 mini 64GB.[0m

[1m> Finished chain.[0m


[1m> Entering new OpenAPIEndpointChain chain...[0m


[1m> Entering new APIRequesterChain chain...[0m
Prompt after formatting:
[32;1m[1;3mYou are a helpful AI Assistant. Please provide JSON arguments to agentFunc() based on the user's instructions.

API_SCHEMA: ```typescript
type productsUsingGET = (_: {
/* A precise query that matches one very small category or product that needs to be searched for to find the products the user is looking for. If the user explicitly stated what they want, use that as a query. The query is as specific as possible to the product name or category 


[1m> Finished chain.[0m
[33;1m[1;3mYes, there are several budget laptops available in the API response. Some options include the HP 14-dq0055dx, HP 14-dq0054dx, HP 15-dw0083wm, HP 17-cn0023dx, Samsung Chromebook 4 XE310XBA-KC1US, HP 15-ef1041nr, and Apple MacBook Air MD760LL/A 13.3-Inch Laptop Intel Core.[0m

[1m> Finished chain.[0m


[1m> Entering new OpenAPIEndpointChain chain...[0m


[1m> Entering new APIRequesterChain chain...[0m
Prompt after formatting:
[32;1m[1;3mYou are a helpful AI Assistant. Please provide JSON arguments to agentFunc() based on the user's instructions.

API_SCHEMA: ```typescript
type productsUsingGET = (_: {
/* A precise query that matches one very small category or product that needs to be searched for to find the products the user is looking for. If the user explicitly stated what they want, use that as a query. The query is as specific as possible to the product name or category mentioned by the user in its singular form, and don't contain any


[1m> Finished chain.[0m
[33;1m[1;3mThe cheapest gaming PC is the Alarco Gaming PC (X_BLACK_GTX750), which has a price of $499.99.[0m

[1m> Finished chain.[0m


[1m> Entering new OpenAPIEndpointChain chain...[0m


[1m> Entering new APIRequesterChain chain...[0m
Prompt after formatting:
[32;1m[1;3mYou are a helpful AI Assistant. Please provide JSON arguments to agentFunc() based on the user's instructions.

API_SCHEMA: ```typescript
type productsUsingGET = (_: {
/* A precise query that matches one very small category or product that needs to be searched for to find the products the user is looking for. If the user explicitly stated what they want, use that as a query. The query is as specific as possible to the product name or category mentioned by the user in its singular form, and don't contain any clarifiers like latest, newest, cheapest, budget, premium, expensive or similar. The query is always taken from the latest topic, if there is a new topic a new query is started


[1m> Finished chain.[0m
[33;1m[1;3mYes, there are tablets available for under $400. These include the Apple iPad 10.2" 32GB (2019) for $249.99, the Samsung Galaxy Tab A8 10.5 SM-X200 32GB for $178.14, and the Amazon Fire HD 8" 32GB (10th Generation) for $44.99.[0m

[1m> Finished chain.[0m


[1m> Entering new OpenAPIEndpointChain chain...[0m


[1m> Entering new APIRequesterChain chain...[0m
Prompt after formatting:
[32;1m[1;3mYou are a helpful AI Assistant. Please provide JSON arguments to agentFunc() based on the user's instructions.

API_SCHEMA: ```typescript
type productsUsingGET = (_: {
/* A precise query that matches one very small category or product that needs to be searched for to find the products the user is looking for. If the user explicitly stated what they want, use that as a query. The query is as specific as possible to the product name or category mentioned by the user in its singular form, and don't contain any clarifiers like latest, newest, cheapest, bu


[1m> Finished chain.[0m
[33;1m[1;3mIt depends on your preferences, budget and what type of headphones you are looking for. The API response includes several different models, such as Apple AirPods (2nd generation) 2019, Apple AirPods Pro (2nd generation) 2022, Apple AirPods (3rd generation) with Lightning Charging Case, Apple AirPods Max, Apple AirPods (3rd generation) with MagSafe Charging Case, Bose Noise Cancelling Headphones 700, Beats Studio Buds, Beats Studio3 Wireless, Apple AirPods Pro (1st generation) with MagSafe Charging Case 2021, and Beats Solo3 Wireless.[0m

[1m> Finished chain.[0m


[1m> Entering new OpenAPIEndpointChain chain...[0m


[1m> Entering new APIRequesterChain chain...[0m
Prompt after formatting:
[32;1m[1;3mYou are a helpful AI Assistant. Please provide JSON arguments to agentFunc() based on the user's instructions.

API_SCHEMA: ```typescript
type productsUsingGET = (_: {
/* A precise query that matches one very small category or product that need


[1m> Finished chain.[0m
[33;1m[1;3mIt looks like the API response contains a list of laptops with their respective attributes and prices. Unfortunately, the API response does not provide ratings, so it is not possible to answer your query.[0m

[1m> Finished chain.[0m


[1m> Entering new OpenAPIEndpointChain chain...[0m


[1m> Entering new APIRequesterChain chain...[0m
Prompt after formatting:
[32;1m[1;3mYou are a helpful AI Assistant. Please provide JSON arguments to agentFunc() based on the user's instructions.

API_SCHEMA: ```typescript
type productsUsingGET = (_: {
/* A precise query that matches one very small category or product that needs to be searched for to find the products the user is looking for. If the user explicitly stated what they want, use that as a query. The query is as specific as possible to the product name or category mentioned by the user in its singular form, and don't contain any clarifiers like latest, newest, cheapest, budget, premium, expensi


[1m> Finished chain.[0m
[33;1m[1;3mI found a few Nike and Adidas options in the API response. Here are the details: Nike Dunk Low M - Black/White, Nike Air Jordan 4 Retro M - Midnight Navy, Nike Air Force 1 '07 M - White, Nike Dunk Low W - White/Black, Nike Air Jordan 1 Retro High M - White/University Blue/Black, Nike Air Jordan 1 Retro High OG M - True Blue/Cement Grey/White, Nike Air Jordan 11 Retro Cherry - White/Varsity Red/Black, Nike Dunk High W - White/Black, and Nike Air Jordan 5 Retro M - Black/Taxi/Aquatone.[0m

[1m> Finished chain.[0m


[1m> Entering new OpenAPIEndpointChain chain...[0m


[1m> Entering new APIRequesterChain chain...[0m
Prompt after formatting:
[32;1m[1;3mYou are a helpful AI Assistant. Please provide JSON arguments to agentFunc() based on the user's instructions.

API_SCHEMA: ```typescript
type productsUsingGET = (_: {
/* A precise query that matches one very small category or product that needs to be searched for to find the products the user 


[1m> Finished chain.[0m
[33;1m[1;3mBased on the API response, there are several options for purchasing a new skirt. Visit the URL links provided to find the skirt that best fits your needs.[0m

[1m> Finished chain.[0m


[1m> Entering new OpenAPIEndpointChain chain...[0m


[1m> Entering new APIRequesterChain chain...[0m
Prompt after formatting:
[32;1m[1;3mYou are a helpful AI Assistant. Please provide JSON arguments to agentFunc() based on the user's instructions.

API_SCHEMA: ```typescript
type productsUsingGET = (_: {
/* A precise query that matches one very small category or product that needs to be searched for to find the products the user is looking for. If the user explicitly stated what they want, use that as a query. The query is as specific as possible to the product name or category mentioned by the user in its singular form, and don't contain any clarifiers like latest, newest, cheapest, budget, premium, expensive or similar. The query is always taken from the 


[1m> Finished chain.[0m
[32;1m[1;3m{"q": "camera", "size": 10, "min_price": 0, "max_price": 500}[0m
[36;1m[1;3m{"products":[{"name":"DJI Mini 2 Dog Camera","url":"https://www.klarna.com/us/shopping/pl/cl1559/3200969756/RC-Toys/DJI-Mini-2-Dog-Camera/?utm_source=openai&ref-site=openai_plugin","price":"$448.50","attributes":[]},{"name":"Insta360 Sphere with Landing Pad","url":"https://www.klarna.com/us/shopping/pl/cl1559/3203217960/RC-Toys/Insta360-Sphere-with-Landing-Pad/?utm_source=openai&ref-site=openai_plugin","price":"$429.99","attributes":[]},{"name":"DJI FPV Gimbal Camera","url":"https://www.klarna.com/us/shopping/pl/cl1559/3200210144/RC-Toys/DJI-FPV-Gimbal-Camera/?utm_source=openai&ref-site=openai_plugin","price":"$121.06","attributes":[]},{"name":"Parrot Camera & Body","url":"https://www.klarna.com/us/shopping/pl/cl1559/4168697/RC-Toys/Parrot-Camera-Body/?utm_source=openai&ref-site=openai_plugin","price":"$36.19","attributes":[]},{"name":"DJI FPV Air Unit","url":"https://

In [8]:
# If the chain failed to run, show the failing examples
failed_examples

[]

In [9]:
answers = [res['output'] for res in chain_outputs]
answers

['There are several Apple iPhone models available, including the Apple iPhone 14 Pro Max 256GB, Apple iPhone 12 128GB, Apple iPhone 13 128GB, Apple iPhone 14 Pro 128GB, Apple iPhone 14 Pro 256GB, Apple iPhone 14 Pro Max 128GB, Apple iPhone 13 Pro Max 128GB, Apple iPhone 14 128GB, Apple iPhone 12 Pro 512GB, and Apple iPhone 12 mini 64GB.',
 'Yes, there are several budget laptops available in the API response. Some options include the HP 14-dq0055dx, HP 14-dq0054dx, HP 15-dw0083wm, HP 17-cn0023dx, Samsung Chromebook 4 XE310XBA-KC1US, HP 15-ef1041nr, and Apple MacBook Air MD760LL/A 13.3-Inch Laptop Intel Core.',
 'The cheapest gaming PC is the Alarco Gaming PC (X_BLACK_GTX750), which has a price of $499.99.',
 'Yes, there are tablets available for under $400. These include the Apple iPad 10.2" 32GB (2019) for $249.99, the Samsung Galaxy Tab A8 10.5 SM-X200 32GB for $178.14, and the Amazon Fire HD 8" 32GB (10th Generation) for $44.99.',
 'It depends on your preferences, budget and what typ

## Evaluate the requests chain

The API Chain has two main components:
1. Translate the user query to an API request
2. Translate the API response to a natural language response

Here, we construct an evaluation chain to grade the request synthesizer against selected human queries 

In [10]:
import json

# Define ground truth labels
truth_queries = [
    {"q": "iPhone"},
    {"q": "laptop", "max_price": 300},
    {"q": "tablet"},
    {"q": "headphone"},
    {"q": "laptop", "max_price": 400},
    {"q": "shoe"},
    {"q": "skirt"},
    {"q": "professional desktop PC", "max_price": 10000},
    {"q": "camera", "max_price": 300},
]
truth_queries = [json.dumps(q) for q in truth_queries]

In [11]:
# Collect the API queries generated by the chain
predicted_queries = [output["intermediate_steps"]["request_args"] for output in chain_outputs]

In [12]:
from langchain.prompts import PromptTemplate

template = """You are trying to answer the following question by querying an API:

> Question: {question}

The query you know you should be executing against the API is:

> Query: {truth_query}

Is the following predicted query semantically the same (eg likely to produce the same answer)?

> Predicted Query: {predict_query}

Please give the Predicted Query a grade of either an A, B, C, D, or F, along with an explanation of why. End the evaluation with 'Final Grade: <the letter>'

> Explanation: Let's think step by step."""

prompt = PromptTemplate.from_template(template)

eval_chain = LLMChain(llm=llm, prompt=prompt)

In [13]:
request_eval_results = []
for question, predict_query, truth_query in list(zip(questions, predicted_queries, truth_queries))[:3]:
    eval_output = eval_chain.run(
        question=question,
        truth_query=truth_query,
        predict_query=predict_query,
    )
    request_eval_results.append(eval_output)
request_eval_results

[" This predicted query is asking for a list of iPhones with a size of 10, min price of 0, and max price of 10000. This query might return a list of iPhones that meet this criteria, but it will not necessarily answer the original question, which was to list all iPhone models. Therefore, this query would receive a grade of 'F'. Final Grade: F",
 ' The original query specifically includes the max_price parameter, but the predicted query does not. In this case, the predicted query might return results that are not budget laptops, because it does not specify a maximum price. Therefore, the predicted query does not accurately capture the intent of the original query. Final Grade: F',
 ' The Predicted Query does a good job of providing the necessary parameters to search for a gaming PC, specifying "gaming pc" as the query, limiting the size of the search results to 10, and setting a minimum price of 0. However, it does not set a maximum price for the gaming PC, so it may not find the cheapes

In [14]:
import re
from typing import List
# Parse the evaluation chain responses into a rubric
def parse_eval_results(results: List[str]) -> List[float]:
    rubric = {
        "A": 1.0,
        "B": 0.75,
        "C": 0.5,
        "D": 0.25,
        "F": 0
    }
    return [rubric[re.search(r'Final Grade: (\w+)', res).group(1)] for res in results]


parsed_results = parse_eval_results(request_eval_results)
# Collect the scores for a final evaluation table
scores['request_synthesizer'].extend(parsed_results)

## Evaluate the Response Chain

The second component translated the structured API response to a natural language response.
Evaluate this against the user's original question.

In [15]:
from langchain.prompts import PromptTemplate

template = """You are trying to answer the following question by querying an API:

> Question: {question}

The API returned a response of:

> Response: {api_response}

You told the user: {answer}

Please evaluate the accuracy and utility of your response to the user's original question.
Give a letter grade of either an A, B, C, D, or F, along with an explanation of why. End the evaluation with 'Final Grade: <the letter>'

> Explanation: Let's think step by step."""

prompt = PromptTemplate.from_template(template)

eval_chain = LLMChain(llm=llm, prompt=prompt)

In [16]:
# Extract the API responses from the chain
api_responses = [output["intermediate_steps"]["response_text"] for output in chain_outputs]

In [17]:
# Run the grader chain
response_eval_results = []
for question, api_response, answer in list(zip(questions, api_responses, answers))[:3]:
    request_eval_results.append(eval_chain.run(question=question, api_response=api_response, answer=answer))
request_eval_results

[" This predicted query is asking for a list of iPhones with a size of 10, min price of 0, and max price of 10000. This query might return a list of iPhones that meet this criteria, but it will not necessarily answer the original question, which was to list all iPhone models. Therefore, this query would receive a grade of 'F'. Final Grade: F",
 ' The original query specifically includes the max_price parameter, but the predicted query does not. In this case, the predicted query might return results that are not budget laptops, because it does not specify a maximum price. Therefore, the predicted query does not accurately capture the intent of the original query. Final Grade: F',
 ' The Predicted Query does a good job of providing the necessary parameters to search for a gaming PC, specifying "gaming pc" as the query, limiting the size of the search results to 10, and setting a minimum price of 0. However, it does not set a maximum price for the gaming PC, so it may not find the cheapes

In [18]:
# Reusing the rubric from above, parse the evaluation chain responses
parsed_response_results = parse_eval_results(request_eval_results)
# Collect the scores for a final evaluation table
scores['result_synthesizer'].extend(parsed_response_results)

In [19]:
# Print out Score statistics for the evaluation session
header = "{:<20}\t{:<10}\t{:<10}\t{:<10}".format("Metric", "Min", "Mean", "Max")
print(header)
for metric, metric_scores in scores.items():
    mean_scores = sum(metric_scores) / len(metric_scores) if len(metric_scores) > 0 else float('nan')
    row = "{:<20}\t{:<10.2f}\t{:<10.2f}\t{:<10.2f}".format(metric, min(metric_scores), mean_scores, max(metric_scores))
    print(row)


Metric              	Min       	Mean      	Max       
completed           	1.00      	1.00      	1.00      
request_synthesizer 	0.00      	0.25      	0.75      
result_synthesizer  	0.00      	0.62      	1.00      


In [20]:
# Re-show the examples for which the chain failed to complete
failed_examples

[]