---
author: Martiño Ríos García
date: 2024-05-17
title: 9 | Agents
keep-ipynb: True
---

This notebook shows how to build LLM agents and how they work. As a practical application, we will try to extract chemical reactions from an image.

::: {.callout-caution title="Installing the packages needed"}

All the dependencies needed to run the Notebooks of the book can be installed by installing the package. This Notebook is an exception, because we are going to use a package that needs specific dependency versions. Thus, to install the dependencies used in this Notebook, we recommend running the following commands in your console:

```
conda create --name agent_notebook
conda activate agent_notebook
conda install python=3.8
git clone https://github.com/thomas0809/RxnScribe.git
cd RxnScribe
pip install -e .
cd -
pip install torch PubChemPy rdkit-pypi matplotlib huggingface-hub python-dotenv langchain langchain-community langchain-openai openai
```

With this you will be able to run the Notebook.
:::

In [1]:
import warnings
warnings.filterwarnings('ignore')

In [2]:
import os
import time
import requests
import base64

import torch
import pubchempy as pcp
from rxnscribe import RxnScribe
from rdkit import Chem  

import matplotlib.pyplot as plt 
from huggingface_hub import hf_hub_download
from dotenv import load_dotenv

from langchain import hub
from langchain.pydantic_v1 import BaseModel, Field
from langchain.agents import AgentExecutor
from langchain.tools import StructuredTool
from langchain.agents.react.agent import create_react_agent
from langchain_openai import ChatOpenAI

from openai import OpenAI

In [3]:
load_dotenv(".env", override=True)

True

::: {.column-margin}
The *.env* file needs to contain your personal OpenAI key. We stress that using environment variables is the safest way to keep personal API keys secret.
:::
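
Before creating the client, it can help to fail early when the key is missing. The following is a minimal sketch, assuming the key is stored under the name `OPENAI_API_KEY` (the variable the *OpenAI* client reads by default); the helper `require_env` is a hypothetical name introduced here for illustration.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or fail with a clear message."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set; add it to the .env file or export it in the shell."
        )
    return value


# Example usage (commented out because it needs a real key):
# api_key = require_env("OPENAI_API_KEY")
```

Calling the helper right after `load_dotenv` turns a missing key into an explicit error instead of a failed API request later on.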

For this small demonstration we will use *OpenAI*'s newest model at the time of writing: *GPT-4o*.

In [4]:
client = OpenAI()

In [5]:
model = "gpt-4o"

The *GPT-4o* model by *OpenAI*, like its predecessor *GPT-4 Turbo*, is a multimodal model, meaning that it can work with inputs such as text and images. Although these models work quite well for some tasks, they may not perform well when the data is very field-specific. To demonstrate this, we provide an image describing a chemical reaction and ask the model to extract the information describing the reaction.

@article{Gibbons2015,
  title = {Synthesis of Naamidine A and Selective Access to N2-Acyl-2-aminoimidazole Analogues},
  volume = {80},
  ISSN = {1520-6904},
  url = {http://dx.doi.org/10.1021/acs.joc.5b01703},
  DOI = {10.1021/acs.joc.5b01703},
  number = {20},
  journal = {The Journal of Organic Chemistry},
  publisher = {American Chemical Society (ACS)},
  author = {Gibbons,  Joseph B. and Salvant,  Justin M. and Vaden,  Rachel M. and Kwon,  Ki-Hyeok and Welm,  Bryan E. and Looper,  Ryan E.},
  year = {2015},
  month = sep,
  pages = {10076–10085}
}

To do the test we are going to work with an image extracted from a work by Gibbons et al.[@Gibbons2015].

In [6]:
image_file = "acs.joc.5b01703-Scheme-c1.png"

![Figure taken from a published article by Gibbons et al.[@Gibbons2015] illustrating the synthesis of Naamine A (8) via an addition−hydroamination−isomerization sequence utilizing the propargylcyanamide 7.](acs.joc.5b01703-Scheme-c1.png)

To pass the image to the model through the prompt, the image needs to be encoded. To do that, we use the function that [*OpenAI* proposes in its vision guidelines](https://platform.openai.com/docs/guides/vision).

In [7]:
# Function to encode the image
def encode_image(image_path):
  with open(image_path, "rb") as image_file:
    return base64.b64encode(image_file.read()).decode("utf-8")
      
# Getting the base64 string
base64_image = encode_image(image_file)

After that, we generate the prompt using the function just defined to encode the image.

In [8]:
messages = [
    {
        "role": "system",
        "content": "You are a chemistry expert assistant and your task is to extract information about chemical reactions from images."
    },
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Please extract all the information from the next image containing a chemical reaction."},
            {
              "type": "image_url",
              "image_url": {
                "url": f"data:image/png;base64,{base64_image}",
                },
            },
        ],
    },
]
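
The `url` field in the message is a data URL: a MIME type followed by the base64 payload. As a small sketch, the MIME type can be guessed from the file extension instead of being typed by hand; `to_data_url` is a hypothetical helper name, and the approach assumes the extension reflects the real image format.

```python
import base64
import mimetypes


def to_data_url(image_path: str) -> str:
    """Encode an image file as a base64 data URL with a MIME type guessed from its extension."""
    mime_type, _ = mimetypes.guess_type(image_path)
    if mime_type is None or not mime_type.startswith("image/"):
        raise ValueError(f"Cannot infer an image MIME type from {image_path!r}")
    with open(image_path, "rb") as handle:
        encoded = base64.b64encode(handle.read()).decode("utf-8")
    return f"data:{mime_type};base64,{encoded}"


# Example usage (commented out because it needs the image file on disk):
# data_url = to_data_url("acs.joc.5b01703-Scheme-c1.png")
```

The returned string could be placed directly in the `url` field of the `image_url` entry.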

Then we request the completion through the *OpenAI* API.

In [9]:
response = client.chat.completions.create(
  model=model,
  messages=messages,
)

In [10]:
response.choices[0]

Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='The image contains a stepwise chemical reaction sequence involving a series of transformations leading to the synthesis of several compounds, including naamidine A. Here are the details:\n\nStarting Compound:\n\\[ \\text{Compound } 7 \\]\n- Structure: \n  - Contains a benzyl-protected phenol group.\n  - An alkyne with a methoxy phenyl group.\n  - An imine group with a methyl substituent.\n\n#### Reactions:\n\n1. **First Step:**\n   - Reagent: La(OTf)\\(_3\\) \n   - Yield: 76%\n   - Product: Intermediate structure (not explicitly shown).\n\n2. **Second Step:**\n   - Reagents: \n     - HCl \n     - JandaJel-NH2, EtOH\n   - Yield: 70%\n   - Product: Intermediate compound \n     - [Not explicitly named]\n\n3. **Third Step:**\n   - Reagent: Pd/C, H\\(_2\\)\n   - Yield: 63%\n   - Products:\n     - **Compound 8:**\n       - Structure: \n         - A phenyl group with an OH substitution (R1 = OH).\n    

It is possible to see that even such a recent and powerful multimodal model struggles to understand what is going on, especially when it comes to extracting the molecules as chemical representations.

Luckily, some tools have been developed to extract this information from images.

One of these tools is *RxnScribe*, developed by Qian et al.[@RxnScribe], which can extract chemical reactions from images. We will give it as a tool to the agent that we create below. With this and other tools, we will try to get the model to extract information such as the IUPAC name and the InChI representation of the reactants involved in the reaction.

@article{RxnScribe,
    author = {Qian, Yujie and Guo, Jiang and Tu, Zhengkai and Coley, Connor W. and Barzilay, Regina},
    title = {RxnScribe: A Sequence Generation Model for Reaction Diagram Parsing},
    journal = {Journal of Chemical Information and Modeling},
    doi = {10.1021/acs.jcim.3c00439}
}

In [11]:
# Define a class to describe the input to the tool
class ExtractionInput(BaseModel):
    image_path: str = Field(description="Path to the image-file that contain the reaction")

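# Define the function that will do the reaction extraction.
def extractor(image_path: str) -> list:
    ckpt_path = hf_hub_download("yujieq/RxnScribe", "pix2seq_reaction_full.ckpt")
    # Use the GPU when one is available, otherwise fall back to the CPU.
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = RxnScribe(ckpt_path, device=device)
    # Run the prediction on the path received by the tool, not on a global variable.
    results = model.predict_image_file(image_path, molscribe=True, ocr=True)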

    # Clean the output to reduce the number of tokens
    for result in results:
        for key, value in result.items():
            for v in value:
                if 'molfile' in v:
                    v.pop('molfile')
    return results
                
# Describe the tool for the model
image_extractor = StructuredTool.from_function(
    func=extractor,
    name="Reaction extractor",
    description="Extract chemical reactions information such as reactants, products and catalysts from images",
    args_schema=ExtractionInput,
)

::: {.column-margin}
To create the agent we use the *LangChain* package, which allows creating an agent in a few lines of code. We do so because it fits the purpose of this Notebook. However, as we point out in the article, building an agent without frameworks such as LangChain or LlamaIndex is feasible and offers more clarity and flexibility.
:::
::: {.column-margin}
Note that before returning the output from the RxnScribe tool, we "pop" some elements from it. This is simply because those elements are quite big, and since we are not going to need them, we will save some tokens.
:::
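
The cleaning step can be sketched in isolation on mock data. The example below assumes the nested shape visible in this Notebook's output (a list of reactions, each mapping a role to a list of entries); the helper name `drop_keys` and all values in `mock` are invented for illustration. Unlike the in-place `pop` used in the tool, this variant works on a copy.

```python
from copy import deepcopy


def drop_keys(reactions: list, keys: tuple = ("molfile",)) -> list:
    """Return a copy of the reactions with the given keys removed from every entry."""
    cleaned = deepcopy(reactions)
    for reaction in cleaned:
        for entries in reaction.values():
            for entry in entries:
                for key in keys:
                    entry.pop(key, None)
    return cleaned


# Mock data in the shape of the tool output (values are invented).
mock = [
    {
        "reactants": [{"category": "[Mol]", "smiles": "CCO", "molfile": "...long text..."}],
        "conditions": [{"category": "[Txt]", "text": ["EtOH"]}],
        "products": [{"category": "[Mol]", "smiles": "CC=O", "molfile": "...long text..."}],
    }
]
print(drop_keys(mock))
```

Working on a copy keeps the full prediction available in case the dropped fields are needed later.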

If we print the variable containing the tool, we will be able to see what is going to be passed to the agent.

In [12]:
print(image_extractor.name)
print(image_extractor.description)
print(image_extractor.args)
print(image_extractor.return_direct)

Reaction extractor
Extract chemical reactions information such as reactants, products and catalysts from images
{'image_path': {'title': 'Image Path', 'description': 'Path to the image-file that contain the reaction', 'type': 'string'}}
False


Similarly, we can define other tools that help the agent go from SMILES to InChI and to IUPAC name.

In [13]:
class ChemicalTools(BaseModel):
    smiles: str = Field(description="SMILES representation for the molecule")

def smiles_to_inchi(smiles:str) -> str:
    molecule = Chem.MolFromSmiles(smiles)
    return Chem.MolToInchi(molecule)

CACTUS = "https://cactus.nci.nih.gov/chemical/structure/{0}/{1}"

def smiles_to_iupac(smiles:str) -> str:
    """
    Use the chemical name resolver https://cactus.nci.nih.gov/chemical/structure. 
    If this does not work, use pubchem.
    """
    try:
        time.sleep(0.001)
        rep = "iupac_name"
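        # Percent-encode the SMILES so characters such as '#' are not read as URL syntax.
        url = CACTUS.format(requests.utils.quote(smiles, safe=""), rep)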
        response = requests.get(url, allow_redirects=True, timeout=10)
        response.raise_for_status()
        name = response.text
        if "html" in name:
            return None
        return name
    except Exception:
        try:
            compound = pcp.get_compounds(smiles, "smiles")
            return compound[0].iupac_name
        except Exception:
            return None

smiles_to_inchi_converter = StructuredTool.from_function(
    func=smiles_to_inchi,
    name="Smiles to InChI",
    description="Return the inchi representation of the given smiles representation",
    args_schema=ChemicalTools,
)

smiles_to_iupac_converter = StructuredTool.from_function(
    func=smiles_to_iupac,
    name="Smiles to IUPAC",
    description="Return the iupac name of the given smiles representation",
    args_schema=ChemicalTools,
)
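
SMILES strings often contain characters with a special meaning in URLs. For instance, `#` denotes a triple bond in SMILES but starts the fragment part of a URL, which HTTP clients normally do not send to the server. The sketch below shows the effect with the standard library; `build_cactus_url` is a hypothetical helper, and whether the resolver accepts every encoded SMILES is not verified here.

```python
from urllib.parse import quote, urlsplit

CACTUS = "https://cactus.nci.nih.gov/chemical/structure/{0}/{1}"


def build_cactus_url(smiles: str, representation: str = "iupac_name") -> str:
    """Build a resolver URL with the SMILES percent-encoded as a single path segment."""
    return CACTUS.format(quote(smiles, safe=""), representation)


smiles = "CC#N"  # acetonitrile; '#' is the triple bond
raw_url = CACTUS.format(smiles, "iupac_name")
safe_url = build_cactus_url(smiles)

# Without encoding, everything after '#' is parsed as a URL fragment.
print(urlsplit(raw_url).path)
print(urlsplit(safe_url).path)
```

In the first case the path stops at the `#`, so the request would describe a different, truncated structure.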

Once all the tools are defined, we create a list that is passed to the agent to indicate which tools it has available.

In [14]:
tools = [image_extractor, smiles_to_inchi_converter, smiles_to_iupac_converter]

After creating and defining the tools, we can start to construct the agent itself starting by defining the model to use. For the agent we will use the *GPT-4o* model as well.

In [15]:
llm = ChatOpenAI(
    temperature=0,
    model_name="gpt-4o",
    request_timeout=1000,
    streaming=False,
)

Then we define the prompt, the agent and what *LangChain* calls the "*chain*".

::: {.callout-tip title="ReAct prompt used" collapse="true"}
Answer the following questions as best you can. You have access to the following tools:

{tools}

Use the following format:

Question: the input question you must answer
Thought: you should always think about what to do
Action: the action to take, should be one of [{tool_names}]
Action Input: the input to the action
Observation: the result of the action
... (this Thought/Action/Action Input/Observation can repeat N times)
Thought: I now know the final answer
Final Answer: the final answer to the original input question

Begin!

Question: {input}
Thought:{agent_scratchpad}
:::
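
To make the placeholders in the prompt concrete, the sketch below fills an abbreviated version of the template by hand. The `ToolSpec` class and the `fill_prompt` function are hypothetical stand-ins, and the "name: description" rendering of each tool is an assumption; the exact text *LangChain* produces may differ.

```python
from dataclasses import dataclass


@dataclass
class ToolSpec:
    """Hypothetical stand-in for a tool: only the fields shown to the model."""
    name: str
    description: str


# Abbreviated version of the ReAct prompt, keeping only the placeholders.
TEMPLATE = (
    "You have access to the following tools:\n\n{tools}\n\n"
    "Action: the action to take, should be one of [{tool_names}]\n\n"
    "Question: {input}\nThought:{agent_scratchpad}"
)


def fill_prompt(tools: list, question: str, scratchpad: str = "") -> str:
    """Substitute the tool list, tool names, question and scratchpad into the template."""
    return TEMPLATE.format(
        tools="\n".join(f"{t.name}: {t.description}" for t in tools),
        tool_names=", ".join(t.name for t in tools),
        input=question,
        agent_scratchpad=scratchpad,
    )


specs = [
    ToolSpec("Reaction extractor", "Extract chemical reactions information from images"),
    ToolSpec("Smiles to InChI", "Return the inchi representation of the given smiles representation"),
]
print(fill_prompt(specs, "Which reactants appear in scheme.png?"))
```

The model only ever sees this rendered text, which is why clear tool names and descriptions matter.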

In [16]:
# Import the ReAct prompt from the hub.
prompt = hub.pull("hwchase17/react")
# Construct the ReAct agent
agent = create_react_agent(llm, tools, prompt)
# Define the chain for the agent.
agent_executor = AgentExecutor(
    agent=agent, 
    tools=tools,
    verbose=True,
)
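
Conceptually, the executor alternates between asking the model for the next step and running the requested tool until a final answer appears. The following is a simplified sketch of such a loop, not *LangChain*'s actual implementation; `parse_step`, `run_agent` and the scripted `fake_llm` are hypothetical names, and the parsing assumes the model follows the ReAct format exactly.

```python
import re

ACTION_RE = re.compile(r"Action:\s*(.+?)\s*\nAction Input:\s*(.+)", re.DOTALL)


def parse_step(text: str):
    """Return ('final', answer) or ('action', tool_name, tool_input) from model text."""
    if "Final Answer:" in text:
        return ("final", text.split("Final Answer:", 1)[1].strip())
    match = ACTION_RE.search(text)
    if match is None:
        raise ValueError(f"Could not parse model output: {text!r}")
    return ("action", match.group(1).strip(), match.group(2).strip().strip('"'))


def run_agent(llm, tools: dict, question: str, max_steps: int = 5) -> str:
    """Alternate between model calls and tool calls until a final answer appears."""
    scratchpad = ""
    for _ in range(max_steps):
        text = llm(question, scratchpad)
        step = parse_step(text)
        if step[0] == "final":
            return step[1]
        _, name, tool_input = step
        observation = tools[name](tool_input)
        # Append the step and its result so the next model call can see them.
        scratchpad += f"{text}\nObservation: {observation}\nThought:"
    raise RuntimeError("No final answer within the step limit")


# Scripted stand-in for a language model, so the sketch runs offline.
def fake_llm(question: str, scratchpad: str) -> str:
    if "Observation:" not in scratchpad:
        return ' I need the length.\nAction: length\nAction Input: "CCO"'
    return " I now know the final answer\nFinal Answer: 3 characters"


print(run_agent(fake_llm, {"length": lambda s: len(s)}, "How long is CCO?"))
```

A real agent would replace `fake_llm` with a call to the model and would need more tolerant parsing, since models do not always follow the format.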

The last step before calling the agent is to define the query that we want the agent to solve.

In [17]:
query = f"I want the iupac name, the smiles and the inchi representations for all the reactants of the reaction contained in the image: {image_file}"

And finally we run the agent.

In [19]:
agent_executor.invoke({"input": query})



> Entering new AgentExecutor chain...
To answer this question, I need to extract the chemical reactions information from the provided image, and then convert the SMILES representations of the reactants to their IUPAC names and InChI representations.

Action: Reaction extractor
Action Input: "RxnScribe/assets/acs.joc.5b01703-Scheme-c1.png"
[{'reactants': [{'category': '[Mol]', 'bbox': (0.0050025012506253125, 0.0168063706650073, 0.3171585792896448, 0.2835397373483489), 'category_id': 1, 'smiles': 'COc1ccc(C#CC(Cc2ccc(OCc3ccccc3)cc2)N(C)C#N)cc1'}], 'conditions': [{'category': '[Txt]', 'bbox': (0.3536768384192096, 0.02385420352452649, 0.37868934467233617, 0.05421409891937838), 'category_id': 2, 'text': []}, {'category': '[Mol]', 'bbox': (0.3646823411705853, 0.0016264229675813516, 0.5147573786893447, 0.11222318476311326), 'category_id': 1, 'smiles': 'C1CC2(CCN1)OCCO2'}, {'category': '[Txt]', 'bbox': (0.36368184092046024, 0.11276532575230705, 0.53676838

{'input': 'I want the iupac name, the smiles and the inchi representations for all the reactants of the reaction contained in the image: RxnScribe/assets/acs.joc.5b01703-Scheme-c1.png',
 'output': '- **Reactant 1**:\n  - **SMILES**: `COc1ccc(C#CC(Cc2ccc(OCc3ccccc3)cc2)N(C)C#N)cc1`\n  - **IUPAC Name**: [4-(4-methoxyphenyl)-1-(4-phenylmethoxyphenyl)but-3-yn-2-yl]-methylcyanamide\n  - **InChI**: InChI=1S/C26H24N2O2/c1-28(20-27)24(13-8-21-9-14-25(29-2)15-10-21)18-22-11-16-26(17-12-22)30-19-23-6-4-3-5-7-23/h3-7,9-12,14-17,24H,18-19H2,1-2H3\n\n- **Reactant 2**:\n  - **SMILES**: `C1CC2(CCN1)OCCO2`\n  - **IUPAC Name**: 1,4-dioxa-8-azaspiro[4.5]decane\n  - **InChI**: InChI=1S/C7H13NO2/c1-3-8-4-2-7(1)9-5-6-10-7/h8H,1-6H2'}

It is possible to observe that the agent reasons correctly: it first uses the tool to extract the information from the image, and then converts the SMILES into InChI representations and IUPAC names.