# Evaluating an OpenAPI Chain

This notebook goes over ways to semantically evaluate an OpenAPI Chain, which calls an endpoint defined by the OpenAPI specification using purely natural language.

In [1]:
from langchain.tools import OpenAPISpec, APIOperation
from langchain.chains import OpenAPIEndpointChain, LLMChain
from langchain.requests import Requests
from langchain.llms import OpenAI

## Load the API Chain

Load a wrapper of the spec (so we can work with it more easily). You can load from a url or from a local file.

In [2]:
# Load and parse the OpenAPI Spec
spec = OpenAPISpec.from_url("https://www.klarna.com/us/shopping/public/openai/v0/api-docs/")
# Load a single endpoint operation
operation = APIOperation.from_openapi_spec(spec, '/public/openai/v0/products', "get")
# Select any LangChain LLM
llm = OpenAI()
# Create the endpoint chain
api_chain = OpenAPIEndpointChain.from_api_operation(
    operation, 
    llm, 
    requests=Requests(), 
    verbose=True,
    return_intermediate_steps=True # Return request and response text
)

Attempting to load an OpenAPI 3.0.1 spec.  This may result in degraded performance. Convert your OpenAPI spec to 3.1.* spec for better support.


### *Optional*: Generate Input Questions and Request Ground Truth Queries

In [3]:
# import re
# from langchain.prompts import PromptTemplate

# template = """Below is a service description:

# {spec}

# Imagine you're a new user trying to use {operation} through a search bar. What are 10 different things you want to request?
# Wants/Questions:
# 1. """

# prompt = PromptTemplate.from_template(template)

# generation_chain = LLMChain(llm=llm, prompt=prompt)

# questions_ = generation_chain.run(spec=operation.to_typescript(), operation=operation.operation_id).split('\n')
# # Strip preceding numeric bullets
# questions = [re.sub(r'^\d+\. ', '', q).strip() for q in questions_]
# questions

In [4]:
# ground_truths = [
# {"q": ...} # What are the best queries for each input?
# ]

## Run the API Chain

The two simplest questions a user of the API Chain are:
- Did the chain succesfully access the endpoint?
- Did the action accomplish the correct result?


In [5]:
from collections import defaultdict
# Collect metrics to report at completion
scores = defaultdict(list)

In [6]:
questions = [
     'What iPhone models are available?',
     'Are there any budget laptops?',
     'Show me the cheapest gaming PC.',
     'Are there any tablets under $400?',
     'What are the best headphones?',
     'What are the top rated laptops?',
     'I want to buy some shoes. I like Adidas and Nike.',
     'I want to buy a new skirt',
     'My company is asking me to get a professional Deskopt PC - money is no object.',
     'What are the best budget cameras?'
]

In [7]:
## Run the the API chain itself
chain_outputs = []
failed_examples = []
for question in questions:
    try:
        chain_outputs.append(api_chain(question))
        scores["completed"].append(1.0)
    except Exception as e:
        failed_examples.append({'q': question, 'error': e})
        scores["completed"].append(0.0)



[1m> Entering new OpenAPIEndpointChain chain...[0m


[1m> Entering new APIRequesterChain chain...[0m
Prompt after formatting:
[32;1m[1;3mYou are a helpful AI Assistant. Please provide JSON arguments to agentFunc() based on the user's instructions.

API_SCHEMA: ```typescript
type productsUsingGET = (_: {
/* A precise query that matches one very small category or product that needs to be searched for to find the products the user is looking for. If the user explicitly stated what they want, use that as a query. The query is as specific as possible to the product name or category mentioned by the user in its singular form, and don't contain any clarifiers like latest, newest, cheapest, budget, premium, expensive or similar. The query is always taken from the latest topic, if there is a new topic a new query is started. */
		q: string,
/* number of products returned */
		size?: number,
/* (Optional) Minimum price in local currency for the product searched for. Either explicitly sta


[1m> Finished chain.[0m
[33;1m[1;3mThe following iPhone models are available: Apple iPhone 12 128GB, Apple iPhone 13 128GB, Apple iPhone 14 Pro 128GB, Apple iPhone 14 128GB, Apple iPhone 12 Pro 512GB, Apple iPhone 12 mini 64GB, Apple iPhone 14 Plus 128GB, Apple iPhone 14 256GB, Apple iPhone 11 Pro Max 512GB, Apple iPhone 14 Plus 256GB.[0m

[1m> Finished chain.[0m


[1m> Entering new OpenAPIEndpointChain chain...[0m


[1m> Entering new APIRequesterChain chain...[0m
Prompt after formatting:
[32;1m[1;3mYou are a helpful AI Assistant. Please provide JSON arguments to agentFunc() based on the user's instructions.

API_SCHEMA: ```typescript
type productsUsingGET = (_: {
/* A precise query that matches one very small category or product that needs to be searched for to find the products the user is looking for. If the user explicitly stated what they want, use that as a query. The query is as specific as possible to the product name or category mentioned by the user in its singu


[1m> Finished chain.[0m
[32;1m[1;3m{"q": "gaming PC", "size": 1, "min_price": 0, "max_price": null}[0m
[36;1m[1;3m{"products":[{"name":"Alarco Gaming PC (X_BLACK_GTX750)","url":"https://www.klarna.com/us/shopping/pl/cl223/3203154750/Desktop-Computers/Alarco-Gaming-PC-%28X_BLACK_GTX750%29/?utm_source=openai&ref-site=openai_plugin","price":"$499.99","attributes":["Graphics Memory:4 GB","Memory Type:DDR3","Operating System:Windows","Hard Drive:HDD","Processor:Intel Core i7","RAM:16 GB","Windows Version:Windows 10 Pro","Type:Tower","Processor Cores:4"]}]}[0m


[1m> Entering new APIResponderChain chain...[0m
Prompt after formatting:
[32;1m[1;3mYou are a helpful AI assistant trained to answer user queries from API responses.
You attempted to call an API, which resulted in:
API_RESPONSE: {"products":[{"name":"Alarco Gaming PC (X_BLACK_GTX750)","url":"https://www.klarna.com/us/shopping/pl/cl223/3203154750/Desktop-Computers/Alarco-Gaming-PC-%28X_BLACK_GTX750%29/?utm_source=openai&r


[1m> Finished chain.[0m
[33;1m[1;3mYes, there are several tablets under $400. They include the Apple iPad 10.2" 32GB (2019) ($249.99), Samsung Galaxy Tab A8 10.5 SM-X200 32GB ($178.90), and Amazon Fire HD 8" 32GB (10th Generation) ($44.99).[0m

[1m> Finished chain.[0m


[1m> Entering new OpenAPIEndpointChain chain...[0m


[1m> Entering new APIRequesterChain chain...[0m
Prompt after formatting:
[32;1m[1;3mYou are a helpful AI Assistant. Please provide JSON arguments to agentFunc() based on the user's instructions.

API_SCHEMA: ```typescript
type productsUsingGET = (_: {
/* A precise query that matches one very small category or product that needs to be searched for to find the products the user is looking for. If the user explicitly stated what they want, use that as a query. The query is as specific as possible to the product name or category mentioned by the user in its singular form, and don't contain any clarifiers like latest, newest, cheapest, budget, premium, expens


[1m> Finished chain.[0m
[32;1m[1;3m{"q": "laptops", "size": 10}[0m
[36;1m[1;3m{"products":[{"name":"Apple MacBook Air (2020) M1 OC 7C GPU 8GB 256GB SSD 13\"","url":"https://www.klarna.com/us/shopping/pl/cl27/3200009205/Laptops/Apple-MacBook-Air-%282020%29-M1-OC-7C-GPU-8GB-256GB-SSD-13/?utm_source=openai&ref-site=openai_plugin","price":"$719.99","attributes":["Processor:Apple M1","Hard Drive Type:SSD","Screen Size:13.3\"","Screen Resolution:2560x1600","Operating System:macOS","Series:Apple Macbook Air","SSD Size:256 GB","RAM:8 GB","Release Year:2020"]},{"name":"Apple MacBook Air (2022) M2 OC 8C GPU 8GB 256GB SSD 13.6\"","url":"https://www.klarna.com/us/shopping/pl/cl27/3201585850/Laptops/Apple-MacBook-Air-%282022%29-M2-OC-8C-GPU-8GB-256GB-SSD-13.6/?utm_source=openai&ref-site=openai_plugin","price":"$1049.00","attributes":["Processor:Apple M2","Hard Drive Type:SSD","Screen Size:13.6\"","Screen Resolution:2560x1664","Operating System:macOS","Series:Apple Macbook Air","SSD Size:256


[1m> Finished chain.[0m
[33;1m[1;3mI attempted to call the API, however it returned an API_RESPONSE with a list of products. I'm not able to determine the top rated laptops from this API_RESPONSE.[0m

[1m> Finished chain.[0m


[1m> Entering new OpenAPIEndpointChain chain...[0m


[1m> Entering new APIRequesterChain chain...[0m
Prompt after formatting:
[32;1m[1;3mYou are a helpful AI Assistant. Please provide JSON arguments to agentFunc() based on the user's instructions.

API_SCHEMA: ```typescript
type productsUsingGET = (_: {
/* A precise query that matches one very small category or product that needs to be searched for to find the products the user is looking for. If the user explicitly stated what they want, use that as a query. The query is as specific as possible to the product name or category mentioned by the user in its singular form, and don't contain any clarifiers like latest, newest, cheapest, budget, premium, expensive or similar. The query is always taken fr


[1m> Finished chain.[0m
[33;1m[1;3mI found some Nike and Adidas shoes in the API response. Check out the product links and let me know which one you would like to purchase![0m

[1m> Finished chain.[0m


[1m> Entering new OpenAPIEndpointChain chain...[0m


[1m> Entering new APIRequesterChain chain...[0m
Prompt after formatting:
[32;1m[1;3mYou are a helpful AI Assistant. Please provide JSON arguments to agentFunc() based on the user's instructions.

API_SCHEMA: ```typescript
type productsUsingGET = (_: {
/* A precise query that matches one very small category or product that needs to be searched for to find the products the user is looking for. If the user explicitly stated what they want, use that as a query. The query is as specific as possible to the product name or category mentioned by the user in its singular form, and don't contain any clarifiers like latest, newest, cheapest, budget, premium, expensive or similar. The query is always taken from the latest topic, if 


[1m> Finished chain.[0m
[33;1m[1;3mI have found several skirts that may suit your needs. You can find them here: https://www.klarna.com/us/shopping/pl/cl10001/?utm_source=openai&ref-site=openai_plugin[0m

[1m> Finished chain.[0m


[1m> Entering new OpenAPIEndpointChain chain...[0m


[1m> Entering new APIRequesterChain chain...[0m
Prompt after formatting:
[32;1m[1;3mYou are a helpful AI Assistant. Please provide JSON arguments to agentFunc() based on the user's instructions.

API_SCHEMA: ```typescript
type productsUsingGET = (_: {
/* A precise query that matches one very small category or product that needs to be searched for to find the products the user is looking for. If the user explicitly stated what they want, use that as a query. The query is as specific as possible to the product name or category mentioned by the user in its singular form, and don't contain any clarifiers like latest, newest, cheapest, budget, premium, expensive or similar. The query is always take


[1m> Finished chain.[0m
[32;1m[1;3m{"q": "camera", "size": 10, "min_price": 0, "max_price": 100}[0m
[36;1m[1;3m{"products":[{"name":"Parrot Camera & Body","url":"https://www.klarna.com/us/shopping/pl/cl1559/4168697/RC-Toys/Parrot-Camera-Body/?utm_source=openai&ref-site=openai_plugin","price":"$36.19","attributes":[]},{"name":"Fujifilm Instax Mini 9","url":"https://www.klarna.com/us/shopping/pl/cl126/3992661/Analogue-Cameras/Fujifilm-Instax-Mini-9/?utm_source=openai&ref-site=openai_plugin","price":"$25.73","attributes":[]},{"name":"WiFi Hidden Spy Camera","url":"https://www.klarna.com/us/shopping/pl/cl589/3202535321/Surveillance-Cameras/WiFi-Hidden-Spy-Camera/?utm_source=openai&ref-site=openai_plugin","price":"$19.99","attributes":[]},{"name":"Blink Outdoor Add-on Camera","url":"https://www.klarna.com/us/shopping/pl/cl589/3201199946/Surveillance-Cameras/Blink-Outdoor-Add-on-Camera/?utm_source=openai&ref-site=openai_plugin","price":"$76.49","attributes":[]},{"name":"Wireless Secu

In [8]:
# If the chain failed to run, show the failing examples
failed_examples

[]

In [9]:
answers = [res['output'] for res in chain_outputs]
answers

['The following iPhone models are available: Apple iPhone 12 128GB, Apple iPhone 13 128GB, Apple iPhone 14 Pro 128GB, Apple iPhone 14 128GB, Apple iPhone 12 Pro 512GB, Apple iPhone 12 mini 64GB, Apple iPhone 14 Plus 128GB, Apple iPhone 14 256GB, Apple iPhone 11 Pro Max 512GB, Apple iPhone 14 Plus 256GB.',
 'Yes, there are budget laptops in the API response. The HP 14-dq0055dx, HP 14-dq0054dx and HP 15-dw0083wm all have prices of $199.99 or less.',
 'The Alarco Gaming PC (X_BLACK_GTX750) is the cheapest gaming PC available for $499.99.',
 'Yes, there are several tablets under $400. They include the Apple iPad 10.2" 32GB (2019) ($249.99), Samsung Galaxy Tab A8 10.5 SM-X200 32GB ($178.90), and Amazon Fire HD 8" 32GB (10th Generation) ($44.99).',
 'Based on the API response, it looks like the best headphones available are the Apple AirPods Max, which come with active noise cancelling and an over-ear design for $409.00.',
 "I attempted to call the API, however it returned an API_RESPONSE wi

## Evaluate the requests chain

The API Chain has two main components:
1. Translate the user query to an API request
2. Translate the API response to a natural language response

Here, we construct an evaluation chain to grade the request synthesizer against selected human queries 

In [10]:
import json

# Define ground truth labels
truth_queries = [
    {"q": "iPhone"},
    {"q": "laptop", "max_price": 300},
    {"q": "tablet"},
    {"q": "headphone"},
    {"q": "laptop", "max_price": 400},
    {"q": "shoe"},
    {"q": "skirt"},
    {"q": "professional desktop PC", "max_price": 10000},
    {"q": "camera", "max_price": 300},
]
truth_queries = [json.dumps(q) for q in truth_queries]

In [11]:
# Collect the API queries generated by the chain
predicted_queries = [output["intermediate_steps"]["request_args"] for output in chain_outputs]

In [12]:
from langchain.prompts import PromptTemplate

template = """You are trying to answer the following question by querying an API:

> Question: {question}

The query you know you should be executing against the API is:

> Query: {truth_query}

Is the following predicted query semantically the same (eg likely to produce the same answer)?

> Predicted Query: {predict_query}

Please give the Predicted Query a grade of either an A, B, C, D, or F, along with an explanation of why. End the evaluation with 'Final Grade: <the letter>'

> Explanation: Let's think step by step."""

prompt = PromptTemplate.from_template(template)

eval_chain = LLMChain(llm=llm, prompt=prompt)

In [13]:
request_eval_results = []
for question, predict_query, truth_query in list(zip(questions, predicted_queries, truth_queries)):
    eval_output = eval_chain.run(
        question=question,
        truth_query=truth_query,
        predict_query=predict_query,
    )
    request_eval_results.append(eval_output)
request_eval_results

[" The first query is just the keyword 'iPhone' - this will likely return a list of iPhone models. The predicted query adds additional parameters to this query, but they are not necessary to answer the original question. The size parameter is irrelevant since the original question is not asking for a specific number of results. The min_price and max_price parameters are also irrelevant since the original question is not asking for a specific price range. All of these additional parameters are likely to limit the results, so the predicted query is not semantically the same as the original. Final Grade: D",
 ' The original query specifies that the maximum price for the laptop should be $300. The predicted query, on the other hand, sets the minimum price to $0 and the maximum price to $500. This means that the original query is more specific than the predicted query, as it excludes any laptops that cost more than $300. Therefore, the predicted query is likely to produce more results than 

In [14]:
import re
from typing import List
# Parse the evaluation chain responses into a rubric
def parse_eval_results(results: List[str]) -> List[float]:
    rubric = {
        "A": 1.0,
        "B": 0.75,
        "C": 0.5,
        "D": 0.25,
        "F": 0
    }
    return [rubric[re.search(r'Final Grade: (\w+)', res).group(1)] for res in results]


parsed_results = parse_eval_results(request_eval_results)
# Collect the scores for a final evaluation table
scores['request_synthesizer'].extend(parsed_results)

## Evaluate the Response Chain

The second component translated the structured API response to a natural language response.
Evaluate this against the user's original question.

In [15]:
from langchain.prompts import PromptTemplate

template = """You are trying to answer the following question by querying an API:

> Question: {question}

The API returned a response of:

> API result: {api_response}

Your response to the user: {answer}

Please evaluate the accuracy and utility of your response to the user's original question, conditioned on the information available.
Give a letter grade of either an A, B, C, D, or F, along with an explanation of why. End the evaluation with 'Final Grade: <the letter>'

> Explanation: Let's think step by step."""

prompt = PromptTemplate.from_template(template)

eval_chain = LLMChain(llm=llm, prompt=prompt, verbose=True)

In [16]:
# Extract the API responses from the chain
api_responses = [output["intermediate_steps"]["response_text"] for output in chain_outputs]

In [17]:
# Run the grader chain
response_eval_results = []
for question, api_response, answer in list(zip(questions, api_responses, answers)):
    request_eval_results.append(eval_chain.run(question=question, api_response=api_response, answer=answer))
request_eval_results



[1m> Entering new LLMChain chain...[0m
Prompt after formatting:
[32;1m[1;3mYou are trying to answer the following question by querying an API:

> Question: What iPhone models are available?

The API returned a response of:

> API result: {"products":[{"name":"Apple iPhone 12 128GB","url":"https://www.klarna.com/us/shopping/pl/cl1/5261401/Mobile-Phones/Apple-iPhone-12-128GB/?utm_source=openai&ref-site=openai_plugin","price":"$679.00","attributes":["Operating System:iOS","Series:Apple iPhone 12","Release Year:2020","Color:Red,White,Blue,Purple,Black,Green","Screen Size:6.1\"","Internal Memory Size:128GB"]},{"name":"Apple iPhone 13 128GB","url":"https://www.klarna.com/us/shopping/pl/cl1/3200339453/Mobile-Phones/Apple-iPhone-13-128GB/?utm_source=openai&ref-site=openai_plugin","price":"$799.00","attributes":["Operating System:iOS","Series:Apple iPhone 13","RAM:4 GB","Release Year:2021","Color:Red,Pink,White,Blue,Black,Green","Screen Size:6.1\"","Internal Memory Size:128GB"]},{"name":"


[1m> Finished chain.[0m


[1m> Entering new LLMChain chain...[0m
Prompt after formatting:
[32;1m[1;3mYou are trying to answer the following question by querying an API:

> Question: Are there any tablets under $400?

The API returned a response of:

> API result: {"products":[{"name":"Apple iPad 10.2\" 64GB 2021 (9th Generation)","url":"https://www.klarna.com/us/shopping/pl/cl224/3200340170/Tablets/Apple-iPad-10.2-64GB-2021-%289th-Generation%29/?utm_source=openai&ref-site=openai_plugin","price":"$269.00","attributes":["OS Version:Apple iPadOS","Series:Apple iPad","USB Type:USB (Type C)","Operating System:iPadOS","Chipset:Apple A13","RAM:3 GB","Storage Capacity:64 GB","Video Resolution (max):1080p (Full HD)","Release Year:2021","Screen Type:LCD"]},{"name":"Apple iPad 10.9\" 64GB (2022)","url":"https://www.klarna.com/us/shopping/pl/cl224/3202789305/Tablets/Apple-iPad-10.9-64GB-%282022%29/?utm_source=openai&ref-site=openai_plugin","price":"$399.00","attributes":["Series:Apple iPad"


[1m> Finished chain.[0m


[1m> Entering new LLMChain chain...[0m
Prompt after formatting:
[32;1m[1;3mYou are trying to answer the following question by querying an API:

> Question: I want to buy some shoes. I like Adidas and Nike.

The API returned a response of:

> API result: {"products":[{"name":"Nike Dunk Low M - Black/White","url":"https://www.klarna.com/us/shopping/pl/cl337/3200177969/Shoes/Nike-Dunk-Low-M-Black-White/?utm_source=openai&ref-site=openai_plugin","price":"$79.00","attributes":["Fastening:Laced","Target Group:Man","Color:White,Black","Upper Material:Leather","Size (US):9.5,10,11,12,13,14,15,16,17,18,11.5,10.5,2,3,4,5,6,7,8,9,2.5,3.5,4.5,16.5,5.5,15.5,6.5,14.5,13.5,7.5,8.5,12.5","Series:Nike Dunk"]},{"name":"Nike Air Jordan 4 Retro M - Midnight Navy","url":"https://www.klarna.com/us/shopping/pl/cl337/3202929835/Shoes/Nike-Air-Jordan-4-Retro-M-Midnight-Navy/?utm_source=openai&ref-site=openai_plugin","price":"$272.00","attributes":["Fastening:Laced","Midsole:Foa


[1m> Finished chain.[0m


[1m> Entering new LLMChain chain...[0m
Prompt after formatting:
[32;1m[1;3mYou are trying to answer the following question by querying an API:

> Question: My company is asking me to get a professional Deskopt PC - money is no object.

The API returned a response of:

> API result: {"products":[{"name":"Blue Microphones Yeti X USB Condenser Microphone","url":"https://www.klarna.com/us/shopping/pl/cl176/3202810592/Microphones/Blue-Microphones-Yeti-X-USB-Condenser-Microphone/?utm_source=openai&ref-site=openai_plugin","price":"$169.99","attributes":["Suitable for:Podcast"]},{"name":"Lenovo 30DK000RUS P340 i510500 8G 256G Windows 10 Professional ThinkStation","url":"https://www.klarna.com/us/shopping/pl/cl223/3201082453/Desktop-Computers/Lenovo-30DK000RUS-P340-i510500-8G-256G-Windows-10-Professional-ThinkStation/?utm_source=openai&ref-site=openai_plugin","price":"$962.56","attributes":["Third-party Certifications:TCO,Energy Star,EPEAT","Memory Type:DDR4","O

[" The first query is just the keyword 'iPhone' - this will likely return a list of iPhone models. The predicted query adds additional parameters to this query, but they are not necessary to answer the original question. The size parameter is irrelevant since the original question is not asking for a specific number of results. The min_price and max_price parameters are also irrelevant since the original question is not asking for a specific price range. All of these additional parameters are likely to limit the results, so the predicted query is not semantically the same as the original. Final Grade: D",
 ' The original query specifies that the maximum price for the laptop should be $300. The predicted query, on the other hand, sets the minimum price to $0 and the maximum price to $500. This means that the original query is more specific than the predicted query, as it excludes any laptops that cost more than $300. Therefore, the predicted query is likely to produce more results than 

In [21]:
# Reusing the rubric from above, parse the evaluation chain responses
parsed_response_results = parse_eval_results(request_eval_results)
# Collect the scores for a final evaluation table
scores['result_synthesizer'].extend(parsed_response_results)

In [22]:
# Print out Score statistics for the evaluation session
header = "{:<20}\t{:<10}\t{:<10}\t{:<10}".format("Metric", "Min", "Mean", "Max")
print(header)
for metric, metric_scores in scores.items():
    mean_scores = sum(metric_scores) / len(metric_scores) if len(metric_scores) > 0 else float('nan')
    row = "{:<20}\t{:<10.2f}\t{:<10.2f}\t{:<10.2f}".format(metric, min(metric_scores), mean_scores, max(metric_scores))
    print(row)


Metric              	Min       	Mean      	Max       
completed           	1.00      	1.00      	1.00      
request_synthesizer 	0.00      	0.44      	1.00      
result_synthesizer  	0.00      	0.68      	1.00      


In [23]:
# Re-show the examples for which the chain failed to complete
failed_examples

[]