# Email Extraction

Let's examine how to evaluate an email extraction task.

In [1]:
from langchain_benchmarks import clone_public_dataset, registry

For this code to work, please configure LangSmith environment variables with your credentials.
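As a minimal sketch of that setup, the credentials can be checked from Python before running the rest of the notebook. The variable name `LANGCHAIN_API_KEY` is an assumption about what your LangSmith installation reads, and `missing_langsmith_vars` is a hypothetical helper written for this example, not part of any library.

```python
import os

# Assumption: the LangSmith client reads its API key from this variable.
REQUIRED_VARS = ("LANGCHAIN_API_KEY",)


def missing_langsmith_vars(env=None, required=REQUIRED_VARS):
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


missing = missing_langsmith_vars()
if missing:
    print("Set these environment variables before continuing:", ", ".join(missing))
```

If the list comes back empty, the variables are at least present; whether the key itself is valid is only confirmed once a request is made.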

In [2]:
task = registry["Email Extraction"]
task

| Field | Value |
|---|---|
| Name | Email Extraction |
| Type | ExtractionTask |
| Dataset ID | 36bdfe7d-3cd1-4b36-b957-d12d95810a2b |
| Description | A dataset of 42 real emails deduped from a spam folder, with semantic HTML tags removed, as well as a script for initial extraction and formatting of other emails from an arbitrary .mbox file like the one exported by Gmail. Some additional cleanup of the data was done by hand after the initial pass. See https://github.com/jacoblee93/oss-model-extraction-evals. |


In [3]:
print(task.description)

A dataset of 42 real emails deduped from a spam folder, with semantic HTML tags removed, as well as a script for initial extraction and formatting of other emails from an arbitrary .mbox file like the one exported by Gmail.

Some additional cleanup of the data was done by hand after the initial pass.

See https://github.com/jacoblee93/oss-model-extraction-evals.
    


Clone the dataset associated with this task.

In [9]:
clone_public_dataset(task.dataset_id, dataset_name=task.name)

  0%|          | 0/42 [00:00<?, ?it/s]

Finished fetching examples. Creating dataset...
New dataset created you can access it at https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/309a2fce-ce68-43aa-befb-67f94d0c3570.
Done creating dataset.


In [4]:
import pprint

pprint.pprint(task.schema.schema())

{'definitions': {'ToneEnum': {'description': 'The tone of the email.',
                              'enum': ['positive', 'negative'],
                              'title': 'ToneEnum',
                              'type': 'string'}},
 'description': 'Relevant information about an email.',
 'properties': {'action_items': {'description': 'A list of action items '
                                                'requested by the email',
                                 'items': {'type': 'string'},
                                 'title': 'Action Items',
                                 'type': 'array'},
                'sender': {'description': "The sender's name, if available",
                           'title': 'Sender',
                           'type': 'string'},
                'sender_address': {'description': "The sender's address, if "
                                                  'available',
                                   'title': 'Sender Address',
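
The schema output above is cut off, so a compact summary of the property names and types can be easier to read than the full dump. The following is a minimal sketch: `summarize_properties` is a hypothetical helper, and `sample_schema` contains only fields copied from the visible portion of the output.

```python
def summarize_properties(schema):
    """Map each property name to its declared JSON Schema type ('?' if absent)."""
    return {
        name: spec.get("type", "?")
        for name, spec in schema.get("properties", {}).items()
    }


# Fragment copied from the visible part of the schema output.
sample_schema = {
    "properties": {
        "action_items": {
            "items": {"type": "string"},
            "title": "Action Items",
            "type": "array",
        },
        "sender": {"title": "Sender", "type": "string"},
    }
}

print(summarize_properties(sample_schema))
```

On the real task, the same helper could be called as `summarize_properties(task.schema.schema())`.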
                 

## Define an extraction chain

Let's build an extraction chain that we can use for evaluation.

In [None]:
from langchain.chat_models import ChatOpenAI

from langchain_benchmarks.extraction.implementations import (
    create_openai_function_based_extractor,
)

In [None]:
extraction_chain = create_openai_function_based_extractor(
    task.instructions, ChatOpenAI(model="gpt-3.5-turbo-16k", temperature=0), task.schema
)

Let's test that the chain works on a sample email.

In [None]:
extraction_chain.invoke(
    {
        "email": "Hello Dear MR. I want you to send me gold to get rich. First buy an envelope. Then open it and put some gold inside. Then close it and finally mail it to my address at 12345 My Gold Way. You can call me any time at 000-1212-1111."
    }
)

## Eval

Now let's evaluate the chain.

In [None]:
from langsmith.client import Client

from langchain_benchmarks.extraction import get_eval_config

In [None]:
client = Client()

In [None]:
eval_config = get_eval_config(ChatOpenAI(model="gpt-4"))

In [None]:
test_run = client.run_on_dataset(
    dataset_name=task.name,
    llm_or_chain_factory=extraction_chain,
    evaluation=eval_config,
    verbose=True,
    tags=["openai-functions"],
)

## Inspect

Here we take a closer look at the underlying results.

A few things to note:

* Correctness is 66%, so roughly one in three examples is extracted incorrectly.
* The number of tool invocations made by the agent can be very large: for example, 15 invocations when a single one would have sufficed.

In [None]:
import pandas as pd

df = test_run.to_dataframe()
df = pd.json_normalize(df.to_dict(orient="records"))

In [None]:
df
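
To put a number on a figure such as the correctness rate, the per-example records can be aggregated directly. The sketch below works on plain dictionaries like those produced by `df.to_dict(orient="records")`. The column name `feedback.correctness` and the helper `mean_score` are assumptions made for illustration, so check `df.columns` for the names your run actually produced.

```python
import math


def mean_score(records, key):
    """Average the numeric values under `key`, skipping None and NaN entries."""
    values = [
        r[key]
        for r in records
        if isinstance(r.get(key), (int, float)) and not math.isnan(r[key])
    ]
    return sum(values) / len(values) if values else None


# Hypothetical records; real ones would come from df.to_dict(orient="records").
records = [
    {"feedback.correctness": 1},
    {"feedback.correctness": 0},
    {"feedback.correctness": 1},
    {"feedback.correctness": None},  # e.g. an example with no feedback recorded
]

print(mean_score(records, "feedback.correctness"))
```

Skipping missing values rather than counting them as zero keeps the average from being dragged down by examples that were never scored. Whether that is the right choice depends on how you want to treat failed evaluations.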

An example of a poorly behaving agent that seems to have gotten stuck in a loop!