In [2]:
%load_ext autoreload
%autoreload 2

# Extraction

Let's see how to evaluate a chain's ability to extract structured information from text.

In [3]:
from langchain_benchmarks import clone_public_dataset, registry

For this code to work, please configure LangSmith environment variables with your credentials.
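As a hedged sketch, the check below reports which LangSmith variables are still unset before you run the later cells. The variable names (`LANGCHAIN_API_KEY`, `LANGCHAIN_ENDPOINT`, `LANGCHAIN_TRACING_V2`) are assumptions based on common LangSmith setups and are not stated in this notebook; `missing_langsmith_vars` is a hypothetical helper.

```python
import os

# Assumed variable names; adjust them to match your LangSmith setup.
ASSUMED_LANGSMITH_VARS = (
    "LANGCHAIN_API_KEY",
    "LANGCHAIN_ENDPOINT",
    "LANGCHAIN_TRACING_V2",
)


def missing_langsmith_vars(env=None):
    """Return the assumed LangSmith variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in ASSUMED_LANGSMITH_VARS if not env.get(name)]


missing = missing_langsmith_vars()
if missing:
    print("Missing environment variables:", ", ".join(missing))
```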

In [4]:
registry

Name,Type,Dataset ID,Description
Tool Usage - Typewriter (1 func),ToolUsageTask,placeholder,"Environment with a single function that accepts a single letter as input, and ""prints"" it on a piece of paper. The objective of this task is to evaluate the ability to use the provided tools to repeat a given input string. For example, if the string is 'abc', the tools 'a', 'b', and 'c' must be invoked in that order. The dataset includes examples of varying difficulty. The difficulty is measured by the length of the string."
Tool Usage - Typewriter,ToolUsageTask,placeholder,"Environment with 26 functions each representing a letter of the alphabet. In this variation of the typewriter task, there are 26 parameterless functions, where each function represents a letter of the alphabet (instead of a single function that takes a letter as an argument). The object is to evaluate the ability of use the functions to repeat the given string. For example, if the string is 'abc', the tools 'a', 'b', and 'c' must be invoked in that order. The dataset includes examples of varying difficulty. The difficulty is measured by the length of the string."
Tool Usage - Relational Data,ToolUsageTask,e95d45da-aaa3-44b3-ba2b-7c15ff6e46f5,"Environment with fake data about users and their locations and favorite foods. The environment provides a set of tools that can be used to query the data. The objective of this task is to evaluate the ability to use the provided tools to answer questions about relational data. The dataset contains 21 examples of varying difficulty. The difficulty is measured by the number of tools that need to be used to answer the question. Each example is composed of a question, a reference answer, and information about the sequence in which tools should be used to answer the question. Success is measured by the ability to answer the question correctly, and efficiently."
Multiverse Math,ToolUsageTask,placeholder,"An environment that contains a few basic math operations, but with altered results. For example, multiplication of 5*3 will be re-interpreted as 5*3*1.1. The basic operations retain some basic properties, such as commutativity, associativity, and distributivity; however, the results are different than expected. The objective of this task is to evaluate the ability to use the provided tools to solve simple math questions and ignore any innate knowledge about math."
Email Extraction,ExtractionTask,https://smith.langchain.com/public/36bdfe7d-3cd1-4b36-b957-d12d95810a2b/d,"A dataset of 42 real emails deduped from a spam folder, with semantic HTML tags removed, as well as a script for initial extraction and formatting of other emails from an arbitrary .mbox file like the one exported by Gmail. Some additional cleanup of the data was done by hand after the initial pass. See https://github.com/jacoblee93/oss-model-extraction-evals."


In [5]:
task = registry["Email Extraction"]
task

0,1
Name,Email Extraction
Type,ExtractionTask
Dataset ID,https://smith.langchain.com/public/36bdfe7d-3cd1-4b36-b957-d12d95810a2b/d
Description,"A dataset of 42 real emails deduped from a spam folder, with semantic HTML tags removed, as well as ..."


In [6]:
print(task.description)

A dataset of 42 real emails deduped from a spam folder, with semantic HTML tags removed, as well as a script for initial extraction and formatting of other emails from an arbitrary .mbox file like the one exported by Gmail.

Some additional cleanup of the data was done by hand after the initial pass.

See https://github.com/jacoblee93/oss-model-extraction-evals.
    


Clone the dataset associated with this task.

In [7]:
clone_public_dataset(task.dataset_id, dataset_name=task.name)

Dataset Email Extraction already exists. Skipping.
You can access the dataset at https://smith.langchain.com/o/e081f11e-fbd2-41b4-9fa8-5d76c76ef854/datasets/c652e524-8faf-4796-a0fd-d56415b7a5a6.


In [8]:
import pprint

pprint.pprint(task.schema.schema())

{'definitions': {'ToneEnum': {'description': 'The tone of the email.',
                              'enum': ['positive', 'negative'],
                              'title': 'ToneEnum',
                              'type': 'string'}},
 'description': 'Relevant information about an email.',
 'properties': {'action_items': {'description': 'A list of action items '
                                                'requested by the email',
                                 'items': {'type': 'string'},
                                 'title': 'Action Items',
                                 'type': 'array'},
                'sender': {'description': "The sender's name, if available",
                           'title': 'Sender',
                           'type': 'string'},
                'sender_address': {'description': "The sender's address, if "
                                                  'available',
                                   'title': 'Sender Address',
                 

## Define an extraction chain

Let's build an extraction chain that we can use for evaluation.

In [9]:
from langchain_benchmarks.extraction.implementations import (
    create_openai_function_based_extractor,
)
from langchain.chat_models import ChatOpenAI

In [10]:
extraction_chain = create_openai_function_based_extractor(
    task.instructions, ChatOpenAI(model="gpt-3.5-turbo-16k", temperature=0), task.schema
)

In [11]:
extraction_chain.invoke(
    {
        "email": "Hello Dear MR. I want you to send me gold to get rich. First buy an envelope. Then open it and put some gold inside. Then close it and finally mail it to my address at 12345 My Gold Way. You can call me any time at 000-1212-1111."
    }
)

{'output': {'sender': 'Unknown',
  'sender_phone_number': '000-1212-1111',
  'sender_address': '12345 My Gold Way',
  'action_items': ['Buy an envelope',
   'Put gold inside',
   'Close the envelope',
   "Mail it to sender's address"],
  'topic': 'Request to send gold',
  'tone': 'positive'}}

The call above confirms that the chain runs and returns structured output.
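One lightweight way to check a result like the sample output above is to verify its shape. The sketch below is only illustrative: the field names are copied from that sample output (the schema printout in this notebook is truncated, so the set is an assumption), the allowed tones come from the `ToneEnum` shown in the schema, and `check_extraction` is a hypothetical helper, not part of `langchain_benchmarks`.

```python
# Field names copied from the sample output; treat this set as an assumption
# because the full schema is truncated in this notebook.
EXPECTED_FIELDS = {
    "sender",
    "sender_phone_number",
    "sender_address",
    "action_items",
    "topic",
    "tone",
}
ALLOWED_TONES = {"positive", "negative"}  # from the ToneEnum in the schema


def check_extraction(result):
    """Return a list of problems found in a result shaped like {'output': {...}}."""
    output = result.get("output", {})
    problems = [
        f"missing field: {name}" for name in sorted(EXPECTED_FIELDS - set(output))
    ]
    if "tone" in output and output["tone"] not in ALLOWED_TONES:
        problems.append(f"unexpected tone: {output['tone']!r}")
    if not isinstance(output.get("action_items", []), list):
        problems.append("action_items is not a list")
    return problems
```

Calling `check_extraction(extraction_chain.invoke({"email": ...}))` would return an empty list when every expected field is present and well-formed.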

## Eval

Let's evaluate the extraction chain now.

In [12]:
from langchain_benchmarks.extraction import get_eval_config
from langsmith.client import Client

In [13]:
client = Client()

In [15]:
eval_config = get_eval_config(ChatOpenAI(model="gpt-4"))

In [17]:
test_run = client.run_on_dataset(
    dataset_name=task.name,
    llm_or_chain_factory=extraction_chain,
    evaluation=eval_config,
    verbose=True,
    tags=["openai-functions"],
)

View the evaluation results for project 'test-worthwhile-sound-23' at:
https://smith.langchain.com/o/e081f11e-fbd2-41b4-9fa8-5d76c76ef854/projects/p/7d5e8f03-26d1-4f2f-b954-7b8aff02a086?eval=true

View all tests for Dataset Email Extraction at:
https://smith.langchain.com/o/e081f11e-fbd2-41b4-9fa8-5d76c76ef854/datasets/c652e524-8faf-4796-a0fd-d56415b7a5a6
[------------------------------------------------->] 42/42
 Eval quantiles:
                       0.25  0.5  0.75      mean  mode
score_string:accuracy   0.3  0.3   0.7  0.471429   0.3
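The summary row reports quartiles, the mean, and the mode of the per-example scores. As a rough sketch of how such a summary can be computed, the snippet below uses Python's `statistics` module. The scores are illustrative values, not the 42 scores from this run, and the quantile method chosen here may differ from the one LangSmith uses; `summarize` is a hypothetical helper.

```python
import statistics


def summarize(scores):
    """Compute quartiles, mean, and mode for a list of numeric scores."""
    q1, median, q3 = statistics.quantiles(scores, n=4, method="inclusive")
    return {
        "0.25": q1,
        "0.5": median,
        "0.75": q3,
        "mean": statistics.fmean(scores),
        "mode": statistics.mode(scores),
    }


# Illustrative values only; replace with the scores from your own test run.
illustrative_scores = [0.1, 0.3, 0.3, 0.3, 0.7]
print(summarize(illustrative_scores))
```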


## Inspect

Here, we'll take a look at the underlying results a little bit.

A few things to note:

* The mean accuracy score is about 0.47, and both the median and the mode are 0.3 (so it's messing up a lot!)
* In the rows shown below, the chain often leaves `sender_address` empty or returns no action items, even when the reference contains them.

In [23]:
import pandas as pd

df = test_run.to_dataframe()
df = pd.json_normalize(df.to_dict(orient="records"))
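`pd.json_normalize` flattens nested dictionaries into dotted column names, which is why the resulting columns read like `output.output.sender`. The standard-library sketch below illustrates the idea only; `flatten` is a hypothetical helper and not how pandas implements it.

```python
def flatten(record, parent_key="", sep="."):
    """Flatten nested dicts into one dict with dotted keys (lists are left as-is)."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else str(key)
        if isinstance(value, dict):
            # Recurse into nested dicts, carrying the dotted prefix along.
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


row = {
    "input": {"email": "Hello"},
    "output": {"output": {"sender": "Unknown", "action_items": ["Buy an envelope"]}},
}
print(flatten(row))
```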

In [26]:
df

Unnamed: 0,score_string:accuracy,input.email,output.output.sender,output.output.sender_address,output.output.action_items,output.output.topic,output.output.tone,reference.output.tone,reference.output.topic,reference.output.sender,reference.output.action_items,reference.output.sender_address,reference.output.sender_phone_number,output.output.sender_phone_number
0,0.7,--- \n|\n\n# We Provide Unique Financing Opti...,info@championadvance.com,"42 Broadway, New York, NY 10004","[Fill out the application, Provide last 3 mont...",Unique financing options,positive,positive,Financing options for businesses,Champion Advance,"[Start the financing process, Complete the one...","42 Broadway, New York, NY 10004",888-422-2162,
1,0.3,--- \n| | QUALIFY NOW \n--- \n \n \nHell...,Sam at EMC,,[Check how much financing you're eligible to r...,Operational evolution and low-rate business loans,positive,positive,Business loan and credit line offer,Sam at EMC,[Check how much financing you're eligible to r...,"475 Washington Blvd, Suite 7, Marina Del Rey, ...",(310) 491-7947,
2,0.3,--- \n|\n\nCostco \n \n--- \n \nANSWER \...,Costco,,[unsubscribe],Loyalty Program,positive,positive,Invitation to participate in a Loyalty Program...,Enterprise,"[Participate in the Loyalty Program, Receive a...","9101 W. Sahara Ave, Las Vegas, NV 89117",,
3,0.3,---|--- \n \n| \n--- \n \n| \n--- \n|\n...,info@championadvance.com,,[],Unique financing options,positive,positive,Financing options for businesses,Champion Advance,"[Start the financing process, Complete the one...","42 Broadway, New York, NY 10004",888-422-2162,
4,0.1,"| | | | |\n\nOCTOBER 2023, VOL. 23 NO. 2 ...",,,[],Email Summary,positive,positive,Princeton University Alumni News and Events,,[Send comments and questions to alumweb@prince...,"University Advancement, Princeton University, ...",,
5,0.7,"Hello, \n \nThis is Eli Zafrani from Getty A...",Eli Zafrani,"75 Broad St, New York, NY 10004","[Fill out updated application, Provide last 4 ...",Offer for extra working capital,positive,positive,Loan offer from Getty Advance,Eli Zafrani,"[Fill out the updated application, Provide the...","75 Broad St, New York, NY 10004",,
6,0.3,| Auroras and Adventure Await in Anchorage \n...,,,[],Auroras and Adventure Await in Anchorage,positive,positive,"Travel and Tourism in Anchorage, Alaska",Dunhill Vacations Inc.,[START PLANNING],"2307 W. Broward Blvd, Ste 402 - Fort Lauderdal...",,
7,0.7,"Hi Jacob, it's Scott. \n \n=0AIt's been the h...",Scott Wiener,"312 Clay St. Suite 300 Oakland, CA 94607 Unite...",[Join re-election campaign as a founding donor...,Announcement of re-election campaign,positive,positive,Announcement of Senator Scott Wiener's run for...,Senator Scott Wiener,[Join the re-election campaign as a founding d...,"312 Clay St., Suite 300, Oakland, CA 94607, Un...",,
8,0.1,---|---|---|--- \n \nBook with Fall Sale Ext...,,,[],Fall Sale Extras,positive,positive,Fall Cruise Sale Promotions,Dunhill Vacations Inc.,[Book a cruise with Fall Sale Extras by Novemb...,"2307 W. Broward Blvd, Ste 402 - Fort\nLauderda...",,
9,0.3,--- \n \n|\n\nNewsom Signed Our Bill to Expa...,Matt Haney,,"[visit my website, send me an email]",Newsom Signed Our Bill to Expand Access to Add...,positive,positive,Governor Newsom signing AB 816 to expand acces...,Matt Haney,"[visit my website, send me an email]","Capitol Office: 1021 O Street, Suite 5310, P.O...",,
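To see which fields drive the low scores, one option is to compare each `output.output.*` column with its `reference.output.*` counterpart. The sketch below works on records such as those produced by `df.to_dict(orient="records")`. Exact equality is a crude proxy and is not the LLM-graded `score_string:accuracy` metric; `field_match_rates` is a hypothetical helper, and the two records are abbreviated from rows 0 and 1 of the table.

```python
def field_match_rates(records, fields):
    """Fraction of records where the predicted field exactly equals the reference."""
    rates = {}
    for field in fields:
        matches = sum(
            1
            for r in records
            if r.get(f"output.output.{field}") == r.get(f"reference.output.{field}")
        )
        rates[field] = matches / len(records) if records else 0.0
    return rates


# Abbreviated from rows 0 and 1 above. Note that missing values exported from
# pandas are NaN, which never compares equal, so they count as mismatches.
records = [
    {
        "output.output.sender": "info@championadvance.com",
        "reference.output.sender": "Champion Advance",
        "output.output.tone": "positive",
        "reference.output.tone": "positive",
    },
    {
        "output.output.sender": "Sam at EMC",
        "reference.output.sender": "Sam at EMC",
        "output.output.tone": "positive",
        "reference.output.tone": "positive",
    },
]
print(field_match_rates(records, ["sender", "tone"]))
```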

