# Chemistry Agent System

A multi-agent system for computational chemistry using [smolagents](https://github.com/huggingface/smolagents).

## Quick Install via conda or pip:
```bash
conda create -n chemagent python=3.11 -y
conda activate chemagent
conda install -c conda-forge rdkit morfeus-ml ipykernel -y
pip install 'smolagents[toolkit]' 'smolagents[transformers]' scikit-learn pandas
```
OR
```bash
pip install rdkit morfeus-ml ipykernel 'smolagents[toolkit]' 'smolagents[transformers]' scikit-learn pandas huggingface-hub
```

You need a Hugging Face account to use lightweight, open-weight LLMs.
Generate a personal access token in your Hugging Face account settings, then run the following command and enter the token when prompted:
```bash
hf auth login
```
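
Where an interactive prompt is inconvenient (for example in scripted runs), the token can instead be supplied through the `HF_TOKEN` environment variable, which `huggingface_hub` reads. A minimal sketch of checking for it up front; the helper name `find_hf_token` is hypothetical and not part of any library:

```python
import os


def find_hf_token(env=None):
    """Return the Hugging Face token from HF_TOKEN, or None if it is unset or blank."""
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN", "").strip()
    return token or None


if find_hf_token() is None:
    print("HF_TOKEN is not set; run `hf auth login` or export the variable first.")
```

Passing a mapping as `env` makes the helper easy to check without touching the real environment.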

In [None]:
# In a Google Colab environment, run this:
!pip install rdkit morfeus-ml ipykernel 'smolagents[toolkit]' 'smolagents[transformers]' scikit-learn pandas
!hf auth login

## Setup

In [1]:
from smolagents import CodeAgent, InferenceClientModel, tool
from smolagents.local_python_executor import LocalPythonExecutor
import json, os, requests, pickle, csv
from rdkit import Chem
from rdkit.Chem import AllChem, Descriptors
from typing import Any

OUTPUT_DIR = "agent_outputs"
os.makedirs(OUTPUT_DIR, exist_ok=True)


In [2]:
# Model setup
model = InferenceClientModel("Qwen/Qwen2.5-Coder-7B-Instruct")

# Secure executor with authorized chemistry imports
AUTHORIZED_IMPORTS = [
    "json", "csv", "os", "pathlib",
    "numpy", "pandas", "pickle",
    "rdkit", "rdkit.Chem", "rdkit.Chem.AllChem", "rdkit.Chem.Descriptors",
    "rdkit.Chem.rdMolAlign", "rdkit.Chem.rdFingerprintGenerator",
    "rdkit.DataStructs.cDataStructs", "morfeus", "sklearn", "scipy",
]

chemistry_executor = LocalPythonExecutor(
    additional_authorized_imports=AUTHORIZED_IMPORTS,
    max_print_outputs_length=50000
)
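
The idea behind an import allowlist can be illustrated with the standard library alone. The sketch below is not how `LocalPythonExecutor` is implemented; it is a simplified illustration that assumes exact-match module names, and the helper `find_unauthorized_imports` is hypothetical:

```python
import ast


def find_unauthorized_imports(code, allowed):
    """Return the imported module names in `code` that are not on the allowlist."""
    allowed_set = set(allowed)
    blocked = []
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [node.module or ""]
        else:
            continue
        blocked.extend(name for name in names if name not in allowed_set)
    return blocked


snippet = "import json\nimport subprocess\nfrom rdkit.Chem import AllChem\n"
print(find_unauthorized_imports(snippet, ["json", "rdkit", "rdkit.Chem"]))
# prints ['subprocess']
```

The code is only parsed, never executed, so the check itself is safe to run on untrusted snippets. The real executor's matching rules may differ from this exact-match simplification.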

## Tools

In [3]:
@tool
def write_file(path: str, content: Any, format: str = "txt") -> str:
    """
    Write data to a file in various formats.
    
    Args:
        path: File path where data will be saved
        content: Data to write (dict for json, list for csv, str for txt, Mol for pickle, etc.)
        format: File format - 'txt', 'json', 'csv', 'xyz', 'pickle'
    
    Returns:
        Confirmation message with file path
    """
    try:
        if format == "pickle":
            with open(path, "wb") as f:
                pickle.dump(content, f)
        elif format == "json":
            with open(path, "w") as f:
                json.dump(content, f, indent=2)
        elif format == "csv":
            with open(path, "w", newline="") as f:
                if isinstance(content, list) and len(content) > 0:
                    writer = csv.DictWriter(f, fieldnames=content[0].keys())
                    writer.writeheader()
                    writer.writerows(content)
                else:
                    f.write(str(content))
        elif format == "xyz":
            with open(path, "w") as f:
                f.write(str(content))
        else:  # txt and default
            with open(path, "w") as f:
                f.write(str(content))
        return f"Successfully wrote to {path}"
    except Exception as e:
        return f"Error writing file: {str(e)}"

@tool
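def function_help(function_name: str) -> str:
    """
    Provide help information for a given function.
    
    Args:
        function_name: Dotted name of the function to get help for, e.g. 'rdkit.Chem.AddHs'
    Returns:
        Help string for the specified function
    """
    import pydoc  # help() only prints and returns None, so render the text instead
    try:
        return pydoc.render_doc(function_name, renderer=pydoc.plaintext)
    except ImportError as e:
        return f"No documentation found: {e}"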

@tool
def read_file(path: str, format: str = "txt") -> Any:
    """
    Read data from a file in various formats.
    
    Args:
        path: File path to read from
        format: File format - 'txt', 'json', 'csv', 'xyz', 'pickle'
    
    Returns:
        Data from file (dict/list for json, list of dicts for csv, str for txt, object for pickle, etc.)
    """
    try:
        if format == "pickle":
            with open(path, "rb") as f:
                return pickle.load(f)
        elif format == "json":
            with open(path, "r") as f:
                return json.load(f)
        elif format == "csv":
            with open(path, "r") as f:
                reader = csv.DictReader(f)
                return list(reader)
        elif format == "xyz" or format == "txt":
            with open(path, "r") as f:
                return f.read()
        else:
            with open(path, "r") as f:
                return f.read()
    except Exception as e:
        return f"Error reading file: {str(e)}"

@tool
def get_smiles(name: str) -> str:
    """
    Convert any molecule name to a valid SMILES. Never guess a SMILES string!
    Args:
        name (str): Molecule name that is used as query.
    Returns:
        smiles: The corresponding SMILES string or None.
    """
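    # URL-encode the name so spaces and special characters survive in the request path
    url_smiles = f"https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/{requests.utils.quote(name)}/property/SMILES/JSON"
    try:
        response = requests.get(url_smiles, timeout=30)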
        response.raise_for_status()
        data = response.json()

        properties = data.get("PropertyTable", {}).get("Properties", [])
        if properties and "SMILES" in properties[0]:
            smiles = properties[0]["SMILES"]
        else:
            smiles = None

    except requests.exceptions.ConnectionError:
        print("No connection to the PubChem API could be established.")
        return None
    except Exception:
        # Covers HTTP errors (e.g. an unknown name) and malformed responses
        return None
    return smiles

@tool
def get_amino_acids_with_solubility() -> str:
    """
    Write all 20 amino acids with their SMILES and water solubility data to a JSON file.
    Returns:
        The path of the created JSON file and a description of its fields.
    """
    # Solubility in g/L at 25°C
    solubility = {
        "glycine": 249.9, "alanine": 166.5, "valine": 88.5, "leucine": 21.7,
        "isoleucine": 34.2, "proline": 1620.0, "phenylalanine": 27.0, 
        "tryptophan": 11.4, "serine": 362.0, "threonine": 97.0, "cysteine": 277.0,
        "tyrosine": 0.45, "asparagine": 28.5, "glutamine": 41.0, 
        "aspartic_acid": 5.0, "glutamic_acid": 8.6, "lysine": 739.0, 
        "arginine": 182.0, "histidine": 41.9, "methionine": 56.2
    }
    amino_acids = {'glycine': 'NCC(=O)O',
    'alanine': 'C[C@H](N)C(=O)O',
    'valine': 'CC(C)[C@H](N)C(=O)O',
    'leucine': 'CC(C)C[C@H](N)C(=O)O',
    'isoleucine': 'CC[C@H](C)[C@H](N)C(=O)O',
    'proline': 'OC(=O)[C@@H]1CCCN1',
    'phenylalanine': 'N[C@@H](Cc1ccccc1)C(=O)O',
    'tryptophan': 'N[C@@H](Cc1c[nH]c2ccccc12)C(=O)O',
    'serine': 'N[C@@H](CO)C(=O)O',
    'threonine': 'C[C@@H](O)[C@H](N)C(=O)O',
    'cysteine': 'N[C@@H](CS)C(=O)O',
    'tyrosine': 'N[C@@H](Cc1ccc(O)cc1)C(=O)O',
    'asparagine': 'N[C@@H](CC(=O)N)C(=O)O',
    'glutamine': 'N[C@@H](CCC(=O)N)C(=O)O',
    'aspartic_acid': 'N[C@@H](CC(=O)O)C(=O)O',
    'glutamic_acid': 'N[C@@H](CCC(=O)O)C(=O)O',
    'lysine': 'NCCCC[C@H](N)C(=O)O',
    'arginine': 'N[C@@H](CCCNC(=N)N)C(=O)O',
    'histidine': 'N[C@@H](Cc1cnc[nH]1)C(=O)O',
    'methionine': 'CSCC[C@H](N)C(=O)O',
    'trimethylphosphine': 'CP(C)C',
    'triphenylphosphine': 'c1ccc(P(c2ccccc2)c3ccccc3)cc1'
    }

    
    data = []
    for name in solubility.keys():
        data.append({
            "name": name,
            "smiles": amino_acids.get(name, ""),
            "solubility_g_L": solubility[name]
        })
    path = "inputs/amino_acids_with_solubility.json"
    os.makedirs("inputs", exist_ok=True)  # this directory is not created during setup
    with open(path, "w") as f:
        json.dump(data, f, indent=2)
    return "File created at: " + path + "\n\nThe file contains a list of JSON objects with the fields: name, smiles, solubility_g_L."


@tool
def generate_conformers(smiles: str, n: int = 1, rmsd_threshold: float = 0.0) -> Chem.Mol:
    '''Generate conformers for a given molecule represented by its SMILES string.
    Args:
        smiles (str): SMILES string of the molecule.
        n (int): Number of conformers to generate.
        rmsd_threshold (float): RMSD threshold for pruning similar conformers.
    Returns:
        Chem.Mol: RDKit molecule object with generated conformers.
    '''
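    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Could not parse SMILES: {smiles}")
    mol = Chem.AddHs(mol)  # explicit hydrogens are needed for 3D embedding
    AllChem.EmbedMultipleConfs(mol, n, pruneRmsThresh=rmsd_threshold)
    AllChem.MMFFOptimizeMoleculeConfs(mol)
    return mol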


@tool
def smiles_to_xyz(smiles: str) -> str:
    '''Convert a SMILES string to an XYZ file.
    Args:
        smiles (str): SMILES string of the molecule. 
    Returns:
        str: Path to the created XYZ file.
    '''
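    # The input is already a SMILES string, so no PubChem name lookup is needed
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Could not parse SMILES: {smiles}")
    mol = Chem.AddHs(mol)
    AllChem.EmbedMolecule(mol)
    AllChem.MMFFOptimizeMolecule(mol)
    # SMILES may contain characters such as '/' that are not valid in file names
    safe_name = "".join(c if c.isalnum() else "_" for c in smiles)
    path = f"agent_outputs/{safe_name}.xyz"
    Chem.MolToXYZFile(mol, path)
    return f"Created {path}"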

@tool
def get_descriptors(mol: Chem.Mol) -> dict:
    '''Calculate all available RDKit descriptors for a given molecule.
    Args:
        mol (Chem.Mol): RDKit molecule object.
    Returns:
        dict: Mapping of descriptor name to its value.
    '''
    return Descriptors.CalcMolDescriptors(mol)
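
One consequence of the CSV branches in `write_file` and `read_file` is that values come back as strings, because `csv.DictReader` does not infer types. A minimal standard-library sketch of the same round trip; the row contents are made up for illustration:

```python
import csv
import os
import tempfile

rows = [
    {"name": "example_a", "value": 1.5},
    {"name": "example_b", "value": 2.0},
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "rows.csv")
    # Same pattern as the CSV branch of write_file
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
    # Same pattern as the CSV branch of read_file
    with open(path, "r", newline="") as f:
        loaded = list(csv.DictReader(f))

# Numbers are read back as text and need an explicit conversion
values = [float(row["value"]) for row in loaded]
print(loaded[0]["value"], type(loaded[0]["value"]).__name__)
# prints: 1.5 str
```

Any code that consumes such a file for numeric work has to convert the columns first, for example with `float()` as shown.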


## Expert Agents

In [6]:
RDKIT_INSTRUCTIONS = """You are an RDKit expert. You write Python code to perform cheminformatics tasks.

WORKFLOW:
1. Use the get_smiles tool to look up SMILES for molecule names. Never guess a SMILES string.
2. Write RDKit code to perform calculations
3. ALWAYS save results to files in 'agent_outputs/' directory:
   - CSV for tabular data (fingerprints, descriptors)
   - JSON for structured results
   - pickle (using tools) for mol objects

IMPORTANT:
- Long vectors like fingerprints should never be printed directly!
- ALWAYS use the provided tools to read/write files.

CAPABILITIES:
- smiles to xyz conversion
- conformer generation
- descriptor calculation
"""

rdkit_expert = CodeAgent(
    tools=[get_smiles, get_amino_acids_with_solubility, write_file, read_file, generate_conformers, smiles_to_xyz, get_descriptors],
    model=model,
    name="rdkit_expert",
    description="RDKit expert for fingerprints, descriptors, and conformers",
    instructions=RDKIT_INSTRUCTIONS,
    executor=chemistry_executor,
    max_steps=10
)

In [7]:
MORFEUS_INSTRUCTIONS = """You are a Morfeus expert for 3D steric descriptor calculations.

CAPABILITIES:
- SASA: Solvent accessible surface area
- BuriedVolume: %Vbur around a central atom
- Sterimol: L, B1, B5 parameters for substituents  
- ConeAngle: Tolman cone angle for ligands

WORKFLOW:
1. Load an XYZ structure
2. Calculate steric descriptors
3. Save results to 'agent_outputs/'

EXAMPLE CODE PATTERN:
from morfeus import SASA, read_xyz
elements, coordinates = read_xyz("n-heptane.xyz")
sasa = SASA(elements, coordinates)
print(sasa.volume, sasa.area, sasa.atom_areas[1])  # Example of accessing volume, area, and atom area

IMPORTANT:
- Morfeus uses 1-based atom indexing! 
- ALWAYS use help() to get the detailed parameters of a function or class, e.g. help(SASA)
- ALWAYS use the provided tools to read/write files.

"""

morfeus_expert = CodeAgent(
    tools=[get_smiles, write_file, read_file],
    model=model,
    name="morfeus_expert", 
    description="Morfeus expert for 3D steric descriptors",
    instructions=MORFEUS_INSTRUCTIONS,
    executor=chemistry_executor,
    max_steps=8
)

In [8]:
COORDINATOR_INSTRUCTIONS = """You coordinate chemistry tasks between expert agents.

TEAM:
- rdkit_expert: Fingerprints, molecular descriptors, conformers
- morfeus_expert: 3D steric descriptors (buried volume, Sterimol, cone angle, SASA)

WORKFLOW:
- Make a plan as a Markdown to-do list, including how to use the expert agents
- Follow the plan step by step by invoking tools or using agents
- You can run code yourself (e.g. pandas or sklearn). NEVER run RDKit or Morfeus code yourself; use the experts instead.

OUTPUT:
- All results are saved to 'agent_outputs/' directory. Summarize what files were created (never include the raw file content).
"""

coordinator = CodeAgent(
    tools=[write_file, read_file],
    model=model,
    name="coordinator",
    description="Coordinates chemistry expert agents",
    instructions=COORDINATOR_INSTRUCTIONS,
    managed_agents=[rdkit_expert, morfeus_expert],
    executor=chemistry_executor,
    max_steps=12,
)

## Examples

#### Example 1: Generate conformers

In [7]:
result = rdkit_expert.run("Generate 100 conformers for ibuprofen with RMSD threshold 0.125 Å. Output the number of generated conformers.")
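
The RMSD threshold controls how different two conformers must be for both to be kept. A simplified standard-library sketch of that pruning idea, assuming conformers are plain lists of (x, y, z) tuples with identical atom ordering. It omits the structural alignment a cheminformatics toolkit would typically perform first, so it is not equivalent to RDKit's pruning, and the function names are hypothetical:

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally ordered coordinate lists (no alignment)."""
    if len(coords_a) != len(coords_b):
        raise ValueError("Conformers must have the same number of atoms")
    squared = sum(
        (a - b) ** 2
        for atom_a, atom_b in zip(coords_a, coords_b)
        for a, b in zip(atom_a, atom_b)
    )
    return math.sqrt(squared / len(coords_a))


def prune_conformers(conformers, threshold):
    """Keep a conformer only if it is at least `threshold` away from every conformer kept so far."""
    kept = []
    for conf in conformers:
        if all(rmsd(conf, other) >= threshold for other in kept):
            kept.append(conf)
    return kept


# Toy two-atom "conformers" with made-up coordinates (in Å)
conformers = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (1.05, 0.0, 0.0)],  # nearly identical to the first
    [(0.0, 0.0, 0.0), (0.0, 1.0, 0.0)],   # clearly different
]
print(len(prune_conformers(conformers, threshold=0.125)))
# prints 2
```

With a threshold above zero, the number of conformers returned can be smaller than the number requested, which is why the prompt asks the agent to report the final count.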

#### Example 2: Calculate Morfeus descriptors

In [None]:
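def make_xyz(name):
    """Look up `name` on PubChem and write an MMFF-optimized structure to inputs/<name>.xyz."""
    smiles = get_smiles(name)
    if smiles is None:
        raise ValueError(f"No SMILES found for {name}")
    mol = Chem.MolFromSmiles(smiles)
    mol = Chem.AddHs(mol)
    AllChem.EmbedMolecule(mol)
    AllChem.MMFFOptimizeMolecule(mol)
    os.makedirs("inputs", exist_ok=True)  # this directory is not created during setup
    path = f"inputs/{name}.xyz"
    Chem.MolToXYZFile(mol, path)
    return path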

make_xyz("ibuprofen")
make_xyz("caffeine")
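
The files written here use the plain-text XYZ layout: an atom count on the first line, a free-form comment on the second, then one `element x y z` line per atom. A minimal standard-library sketch for sanity-checking such a file before handing it to an agent; the parser `parse_xyz` and the approximate water coordinates are illustrative assumptions, not part of the notebook:

```python
def parse_xyz(text):
    """Parse XYZ text into (elements, coordinates), checking the declared atom count."""
    lines = text.splitlines()
    n_atoms = int(lines[0].strip())
    # Skip the comment line (index 1) and ignore trailing blank lines
    atom_lines = [line for line in lines[2:] if line.strip()]
    if len(atom_lines) != n_atoms:
        raise ValueError(f"Header declares {n_atoms} atoms, found {len(atom_lines)}")
    elements, coordinates = [], []
    for line in atom_lines:
        symbol, x, y, z = line.split()[:4]
        elements.append(symbol)
        coordinates.append((float(x), float(y), float(z)))
    return elements, coordinates


# Approximate water geometry, for illustration only
example = """3
water
O 0.000 0.000 0.117
H 0.000 0.757 -0.467
H 0.000 -0.757 -0.467
"""
elements, coordinates = parse_xyz(example)
print(elements, len(coordinates))
# prints ['O', 'H', 'H'] 3
```

The list positions are 0-based, whereas Morfeus addresses atoms with 1-based indices, so atom 1 in Morfeus corresponds to `elements[0]` here.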

In [10]:
result = morfeus_expert.run("Calculate SASA for inputs/ibuprofen.xyz and inputs/caffeine.xyz. Save as CSV.")

#### Challenging Example: ML dataset for amino acid solubility

In [None]:
result = coordinator.run("""Build ML dataset for amino acid solubility:
1. Data at inputs/amino_acids_with_solubility.json (name, smiles, solubility_g_L)
2. Use RDKit to generate molecular descriptors and put them in rdkit_descriptors.csv
3. Find feature correlations with solubility and train regression model on combined features
""")

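Step 3 of the prompt asks for feature correlations with the target. A minimal sketch of a Pearson correlation ranking using only the standard library; the feature names and all numbers below are made-up toy values, not results from the amino acid dataset:

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")  # a constant column has no defined correlation
    return cov / math.sqrt(var_x * var_y)


# Hypothetical feature columns and target (toy numbers)
features = {
    "feature_a": [1.0, 2.0, 3.0, 4.0],
    "feature_b": [4.0, 3.0, 2.0, 1.0],
    "feature_c": [1.0, 3.0, 2.0, 4.0],
}
target = [10.0, 20.0, 30.0, 40.0]

# Rank features by the absolute strength of their correlation with the target
ranking = sorted(
    ((name, pearson(col, target)) for name, col in features.items()),
    key=lambda item: abs(item[1]),
    reverse=True,
)
for name, r in ranking:
    print(f"{name}: r = {r:+.2f}")
```

With only 20 amino acids and many descriptors, strong correlations can appear by chance, so a ranking like this is best treated as a screening step before model training.
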
#### How to "debug" this agent?
- Use a more capable (and more expensive) model and *hope* that it performs better
- Refine the instructions
- Give hints
- Provide helpful tools that reduce the complexity of the task
- ...