# Email Extraction

Let's evaluate an LLM on its ability to extract structured information from email texts.

In [None]:
# %pip install -U langchain langchain_benchmarks openai

In [1]:
import os

# Get your LangSmith API key from https://smith.langchain.com/settings
# Uncomment and fill in both keys if they are not already set in your environment.
# os.environ["LANGCHAIN_API_KEY"] = "sk-..."
# os.environ["OPENAI_API_KEY"] = "sk-..."

In [2]:
from langchain_benchmarks import clone_public_dataset, registry

For this code to work, please configure LangSmith environment variables with your credentials.
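
A quick way to confirm the credentials are in place before running the remaining cells is to check the environment directly. The sketch below is not part of `langchain_benchmarks`; `missing_env_vars` is a hypothetical helper, and the variable names are the two that appear in the setup cell above.

```python
import os

# The two variables referenced in the setup cell above.
REQUIRED_VARS = ("LANGCHAIN_API_KEY", "OPENAI_API_KEY")


def missing_env_vars(required=REQUIRED_VARS, env=None):
    """Return the names in `required` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


missing = missing_env_vars()
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("All required environment variables are set.")
```

Passing a plain dictionary as `env` makes the helper easy to try out without touching the real environment.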

In [3]:
task = registry["Email Extraction"]
task

| Field | Value |
|---|---|
| Name | Email Extraction |
| Type | ExtractionTask |
| Dataset ID | a1742786-bde5-4f51-a1d8-e148e5251ddb |
| Description | A dataset of 42 real emails deduped from a spam folder, with semantic HTML tags removed, as well as a script for initial extraction and formatting of other emails from an arbitrary .mbox file like the one exported by Gmail. Some additional cleanup of the data was done by hand after the initial pass. See https://github.com/jacoblee93/oss-model-extraction-evals. |


In [4]:
print(task.description)

A dataset of 42 real emails deduped from a spam folder, with semantic HTML tags removed, as well as a script for initial extraction and formatting of other emails from an arbitrary .mbox file like the one exported by Gmail.

Some additional cleanup of the data was done by hand after the initial pass.

See https://github.com/jacoblee93/oss-model-extraction-evals.
    


Clone the dataset associated with this task into your LangSmith account.

In [5]:
clone_public_dataset(task.dataset_id, dataset_name=task.name)

Dataset Email Extraction already exists. Skipping.
You can access the dataset at https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/309a2fce-ce68-43aa-befb-67f94d0c3570.


In [6]:
import pprint

pprint.pprint(task.schema.schema())

{'definitions': {'ToneEnum': {'description': 'The tone of the email.',
                              'enum': ['positive', 'negative'],
                              'title': 'ToneEnum',
                              'type': 'string'}},
 'description': 'Relevant information about an email.',
 'properties': {'action_items': {'description': 'A list of action items '
                                                'requested by the email',
                                 'items': {'type': 'string'},
                                 'title': 'Action Items',
                                 'type': 'array'},
                'sender': {'description': "The sender's name, if available",
                           'title': 'Sender',
                           'type': 'string'},
                'sender_address': {'description': "The sender's address, if "
                                                  'available',
                                   'title': 'Sender Address',
                 

## Define an extraction chain

Let's build the extraction chain that we can use to get structured information from the emails.

In [10]:
from langchain.chat_models import ChatOpenAI

from langchain_benchmarks.extraction.implementations import (
    create_openai_function_based_extractor,
)

In [11]:
extraction_chain = create_openai_function_based_extractor(
    task.instructions, ChatOpenAI(model="gpt-3.5-turbo-16k", temperature=0), task.schema
)

In [12]:
extraction_chain.invoke(
    {
        "input": "Hello Dear MR. I want you to send me gold to get rich."
        " First buy an envelope. Then open it and put some gold inside. "
        "Then close it and finally mail it to my address at 12345 My Gold Way."
        " You can call me any time at 000-1212-1111."
    }
)

{'output': {'sender': 'Unknown',
  'sender_phone_number': '000-1212-1111',
  'sender_address': '12345 My Gold Way',
  'action_items': ['Buy an envelope',
   'Put gold inside',
   'Close the envelope',
   "Mail it to sender's address"],
  'topic': 'Request to send gold',
  'tone': 'positive'}}
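
The result is a plain dictionary under the `output` key, so it can be sanity-checked without extra tooling. The sketch below uses a hypothetical helper, `check_extraction`, which is not part of `langchain_benchmarks`. The known keys are taken from the example output above and the allowed tones from the `ToneEnum` definition in the schema printout. Because that printout is truncated, the full schema may define more fields, so every key is treated as optional here.

```python
# Keys seen in the example output above; the full schema may define more.
KNOWN_KEYS = {
    "sender",
    "sender_phone_number",
    "sender_address",
    "action_items",
    "topic",
    "tone",
}
# Values listed under ToneEnum in the schema printout.
ALLOWED_TONES = {"positive", "negative"}


def check_extraction(result):
    """Return a list of human-readable problems found in a chain result."""
    problems = []
    output = result.get("output")
    if not isinstance(output, dict):
        return ["result has no 'output' dictionary"]
    for key in sorted(set(output) - KNOWN_KEYS):
        problems.append(f"unexpected key: {key}")
    if "tone" in output and output["tone"] not in ALLOWED_TONES:
        problems.append(f"unexpected tone: {output['tone']!r}")
    items = output.get("action_items", [])
    if not isinstance(items, list) or not all(isinstance(i, str) for i in items):
        problems.append("action_items should be a list of strings")
    return problems


example = {
    "output": {
        "sender": "Unknown",
        "sender_phone_number": "000-1212-1111",
        "sender_address": "12345 My Gold Way",
        "action_items": ["Buy an envelope", "Put gold inside"],
        "topic": "Request to send gold",
        "tone": "positive",
    }
}
print(check_extraction(example))  # prints []
```

A check like this only confirms the shape of the output. Whether the extracted values are actually right is what the evaluation measures.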

Now it's time to measure our chain's effectiveness!

## Evaluate

We'll run the chain over every example in the dataset and grade its outputs with a GPT-4-based evaluator.

In [13]:
from langsmith.client import Client

from langchain_benchmarks.extraction import get_eval_config

In [14]:
client = Client()

In [17]:
# Use GPT-4 as the grader; a fixed seed makes its judgments more repeatable.
eval_llm = ChatOpenAI(model="gpt-4", model_kwargs={"seed": 42})
eval_config = get_eval_config(eval_llm)

In [None]:
test_run = client.run_on_dataset(
    dataset_name=task.name,
    llm_or_chain_factory=extraction_chain,
    evaluation=eval_config,
    verbose=True,
    tags=["openai-functions"],
)

View the evaluation results for project 'test-notable-cake-39' at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/projects/p/9950f779-8f98-4ca0-90ab-30e4f9f7af6c?eval=true

View all tests for Dataset Email Extraction at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/309a2fce-ce68-43aa-befb-67f94d0c3570
[------------------------------------------------->] 42/42

## Inspect

Let's take a closer look at the underlying results.

A few things to note:

* Correctness is 66%, so the chain is getting a lot of examples wrong.
* The number of tool invocations made by the agent can be very large; e.g., 15 invocations, when only a single invocation was actually needed.
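
A headline number such as the correctness score is typically the mean of the per-example feedback scores. The sketch below shows that calculation on made-up data. The `scores` list is illustrative and not taken from this run, and skipping missing scores is an assumption about how failed evaluations should be handled.

```python
from statistics import mean


def aggregate_score(scores):
    """Average the numeric scores, ignoring examples with no score (None)."""
    valid = [s for s in scores if s is not None]
    if not valid:
        return None
    return mean(valid)


# Illustrative per-example scores: 1 = judged correct, 0 = judged incorrect,
# None = the evaluator did not return a score.
scores = [1, 0, 1, 1, None, 0]
print(aggregate_score(scores))  # prints 0.6
```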

In [None]:
import pandas as pd

df = test_run.to_dataframe()
df = pd.json_normalize(df.to_dict(orient="records"))
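
The `pd.json_normalize` call flattens nested dictionaries in each record into top-level columns whose names join the nested keys with a dot. The pure-Python sketch below mimics that behavior for a single record to show the resulting column names. `flatten` is a hypothetical helper, and the record's keys are made up for illustration rather than taken from the actual run.

```python
def flatten(record, parent_key="", sep="."):
    """Flatten nested dicts into one dict with dot-joined keys.

    Lists and other non-dict values are kept as-is.
    """
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


# Made-up record shaped like one row of nested results.
record = {
    "output": {"sender": "Unknown", "action_items": ["Buy an envelope"]},
    "feedback": {"score": 1},
}
print(flatten(record))
```

With flat, dotted column names, individual fields can be filtered or sorted like any other DataFrame column.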

In [None]:
df

An example of a poorly behaving agent that seems to have gotten stuck in a loop!