# Validation case study: Matching NMR spectra to composition of the molecule

After data extraction, automated checks help ensure that the extracted data are valid. One effective check matches each extracted NMR spectrum against the molecule it was recorded for: the number of protons and peaks expected from the molecular structure is compared with the numbers found in the extracted spectrum. This approach is similar to that employed by {cite:t}`Patiny_2023`.

In this notebook, we demonstrate how to perform this automated validation check. Such checks help catch extraction errors and thereby improve the reliability of extracted spectroscopic data and of the analyses built on them.

### Data extraction

The first step in our process involves extracting the NMR spectra and the analyzed molecule using a Large Language Model (LLM). To accomplish this, we developed a basic prompt that includes the desired information and the content of the article. As a result, we obtain the names of the molecules and the NMR spectra of all included molecules in a structured JSON format.

We first define some logic to call the LLM.

In [None]:
import llmstructdata  # noqa: F401
from litellm import completion
import json


# Add the content of the XML file to the prompt
def format_prompt(template: str, data: dict) -> str:
    return template.format(**data)


# Define the function to call the LiteLLM API
def call_litellm(prompt: str, model: str = "gpt-4o", temperature: float = 0.0, **kwargs) -> tuple:
    """Call a model through LiteLLM and return its answer with token usage.

    Args:
        prompt (str): Prompt to send to the model.
        model (str, optional): Name of the model. Defaults to "gpt-4o".
        temperature (float, optional): Inference temperature. Defaults to 0.0.
        kwargs (dict, optional): Additional arguments to pass to the API.

    Returns:
        tuple: message content and token usage (message_content, input_tokens, output_tokens)
    """
    messages = [
        {
            "role": "system",
            "content": (
                "You are a scientific assistant, extracting NMR spectra and the analyzed molecule "
                "out of XML documents in valid JSON format. Extract just data which you are 100% confident about the "
                "accuracy. Keep the entries short without details. Be careful with numbers."
            ),
        },
        {"role": "user", "content": prompt},
    ]

    response = completion(
        model=model,
        messages=messages,
        temperature=temperature,
        response_format={"type": "json_object"},
        **kwargs,
    )

    # Extract and return the message content and token usage
    message_content = response["choices"][0]["message"]["content"]
    input_tokens = response["usage"]["prompt_tokens"]
    output_tokens = response["usage"]["completion_tokens"]
    return message_content, input_tokens, output_tokens

In [3]:
# Define the prompt template
prompt_template = """Extract all 1H-NMR-spectra and the related analyzed molecule out of this XML file: {data}. 
Extract the complete 1-H-NMR-spectra as text. Extract the full IUPAC name of the molecules without abbreviations and details.
Extract the data in the following JSON format:
    {{"molecules": [
        {{
            "molecule":
            "nmr_spectra":
        }},
        {{
            "molecule":
            "nmr_spectra":
        }}
        ]
    }}"""

with open("XML_content.json", "r", encoding="utf-8") as file:
    data = json.load(file)

# Add the XML data to the prompt
prompt = format_prompt(prompt_template, {"data": data})
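
The doubled braces in the template above are needed because `str.format` treats single braces as placeholders. The following minimal sketch illustrates this with a shortened, hypothetical template and a placeholder string standing in for the article content; it is not the prompt used in this notebook.

```python
# Shortened, hypothetical template: {data} is a placeholder,
# while {{ and }} are escaped and become literal braces.
template = (
    "Extract the spectra from: {data}. "
    'Use this format: {{"molecules": [{{"molecule": "", "nmr_spectra": ""}}]}}'
)


def format_prompt(template: str, data: dict) -> str:
    # Same helper as defined earlier in the notebook
    return template.format(**data)


# The article content is replaced by a dummy string here
prompt = format_prompt(template, {"data": "<article>dummy content</article>"})
print(prompt)
```

After formatting, the JSON skeleton appears with single braces, which is what the model should see.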

Now we can perform the actual call to the LLM.

In [4]:
# Call the LiteLLM API and print the output and token usage
output, input_tokens, output_tokens = call_litellm(prompt=prompt)
output = json.loads(output)
print("Output: ", output)
print("Input tokens used:", input_tokens, "Output tokens used:", output_tokens)

with open("NMR_data.json", "w", encoding="utf-8") as json_file:
    json.dump(output, json_file, indent=4)

Output:  {'molecules': [{'molecule': '1-(N,N-diphenylamino)pyrene', 'nmr_spectra': 'δ H (CDCl3) 6.92–7.23 (10H, m, Ar-H) and 7.90–8.37 (9H, m, Py-H)'}, {'molecule': '1-[N,N-di(4-methylphenyl)amino]pyrene', 'nmr_spectra': 'δ H (CDCl3) 2.28 (6H, s, Me), 6.94 (4H, d, J =8.7Hz, Ar-H), 7.00 (4H, d, J =8.7Hz, Ar-H) and 7.78–8.16 (9H, m, Py-H)'}, {'molecule': 'isomer of 1,6- and 1,8-bis-(N,N-diphenylamino)pyrene', 'nmr_spectra': 'δ H (CDCl3) 6.89–7.24 (20H, m, Ar-H) and 7.79–8.16 (8H, m, Py-H)'}, {'molecule': 'isomer of 1,6- and 1,8-bis-[N,N-di(4-methylphenyl)amino]pyrene', 'nmr_spectra': 'δ H (CDCl3) 2.26, 2.28 (12H, s, Me), 6.88–7.02 (16H, m, Ar-H) and 7.74–8.12 (8H, m, Py-H)'}, {'molecule': '1,3,6,8-tetrakis-(N,N-diphenylamino)pyrene', 'nmr_spectra': 'δ H (CDCl3) 6.89–7.17 (40H, m, Ar-H), 7.66 (2H, s, Py-H a) and 7.97 (4H, s, Py-H b)'}, {'molecule': '1,3,6,8-tetrakis-[N,N-di(4-methylphenyl)amino]pyrene', 'nmr_spectra': 'δ H (CDCl3) 2.35 (24H, s, Me), 6.85 (16H, d, J =8.7Hz, Ar-H), 6.93 (16
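
Before running chemistry-based checks, it can be useful to confirm that the returned JSON has the requested structure. The sketch below is one possible way to do this with the standard library only; the function name `parse_extraction` is introduced here for illustration and is not part of the notebook's code.

```python
import json


def parse_extraction(raw: str) -> list:
    """Parse the raw model answer and keep only well-formed entries.

    An entry is kept if it is a dict with non-empty string values
    for both "molecule" and "nmr_spectra".
    """
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return []

    molecules = payload.get("molecules") if isinstance(payload, dict) else None
    if not isinstance(molecules, list):
        return []

    valid = []
    for entry in molecules:
        if not isinstance(entry, dict):
            continue
        name = entry.get("molecule")
        spectrum = entry.get("nmr_spectra")
        if isinstance(name, str) and name.strip() and isinstance(spectrum, str) and spectrum.strip():
            valid.append({"molecule": name.strip(), "nmr_spectra": spectrum.strip()})
    return valid


# Small hand-written example: one complete entry and two incomplete ones
raw_answer = json.dumps(
    {
        "molecules": [
            {
                "molecule": "1-(N,N-diphenylamino)pyrene",
                "nmr_spectra": "δ H (CDCl3) 6.92–7.23 (10H, m, Ar-H) and 7.90–8.37 (9H, m, Py-H)",
            },
            {"molecule": "", "nmr_spectra": "some text"},
            {"molecule": "name without spectrum"},
        ]
    }
)
entries = parse_extraction(raw_answer)
print(len(entries))
```

Only the first entry survives this filter, so the later validation steps never see entries with a missing name or spectrum.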

### Validity check with NMR spectra and SMILES

Next, we count the hydrogen atoms reported in the extracted NMR spectra and compare them with the number of hydrogen atoms in the molecule. We also compare the number of peaks in the NMR spectra with the number of topologically distinct hydrogen environments in the molecule. If the numbers do not match, this points to a possible error in the extraction.

To do so, we need to define a few helper functions. The first one computes the number of topologically distinct hydrogen environments, that is, the number of groups of symmetry-equivalent hydrogen atoms.

In [10]:
import rdkit
from rdkit import Chem
import numpy as np 

In [13]:
def get_number_of_topologically_distinct_atoms(molecule, atomic_number: int = 1):
    """Return the number of unique `element` environments based on environmental topology.

    Args:
        molecule (rdkit.Chem.rdchem.Mol): Molecular instance.
        atomic_number (int, optional): Atomic number. Defaults to 1.

    Returns:
        int: Number of unique environments.
    """
    if atomic_number == 1:
        # add hydrogen
        mol = Chem.AddHs(molecule)
    else: 
        mol = molecule

    # Canonical ranks without tie-breaking: symmetry-equivalent atoms share a rank
    atom_ranks = list(rdkit.Chem.rdmolfiles.CanonicalRankAtoms(mol, breakTies=False))

    # Convert to an array so the ranks can be indexed with a list of atom indices
    atom_ranks = np.array(atom_ranks)

    # Atom indices
    atom_indices = [
        atom.GetIdx() for atom in mol.GetAtoms() if atom.GetAtomicNum() == atomic_number
    ]
    # Count them
    return len(set(atom_ranks[atom_indices]))

If we look at an example, e.g., for benzene `c1ccccc1`, we can see that the number of topologically distinct hydrogen atoms is 1.
In contrast, if we look at ethanol, `CCO`, we can see that the number of topologically distinct hydrogen atoms is 3.

In [14]:
get_number_of_topologically_distinct_atoms(Chem.MolFromSmiles("c1ccccc1"), atomic_number=1)

1

In [15]:
get_number_of_topologically_distinct_atoms(Chem.MolFromSmiles("CCO"), atomic_number=1)

3

In [15]:
import json
import re
from rdkit import Chem
import package.llmstructdata.utils as utils

correct_H_number = 0
incorrect_H_number = 0
correct_peak_number = 0
incorrect_peak_number = 0
pattern = re.compile(r"\d+\.\d+\s*\([^)]*\)")

# Load JSON NMR data
with open("NMR_data.json", "r") as file:
    data = json.load(file)


# Calculate the number of hydrogen atoms
def count_hydrogens_from_nmr(nmr_spectra: str) -> int:
    pattern2 = r"\b(\d+)H\b"
    matches = re.findall(pattern2, nmr_spectra)
    return sum(int(match) for match in matches)


# Loop over all molecules in data
for molecule_data in data["molecules"]:
    molecule_name = molecule_data["molecule"]
    nmr_spectra = molecule_data["nmr_spectra"]

    print(f"Processing molecule: {molecule_name}")

    # Calculate number of hydrogen atoms in NMR data
    H_number_nmr = count_hydrogens_from_nmr(nmr_spectra)
    print(f"Number of hydrogen atoms in the NMR spectra: {H_number_nmr}")

    # Count the number of peaks in the NMR spectra
    peaks = pattern.findall(nmr_spectra)
    number_of_peaks = len(peaks)

    # Convert molecules into SMILES
    mol_smiles = utils.name_to_smiles(molecule_name)
    if mol_smiles:
        # Convert SMILES into RDKit objects
        mol = Chem.MolFromSmiles(mol_smiles)
        if mol:
            # Calculate the number of hydrogen atoms from formula
            my_mol_with_explicit_h = Chem.AddHs(mol)
            H_number = (
                my_mol_with_explicit_h.GetNumAtoms()
                - my_mol_with_explicit_h.GetNumHeavyAtoms()
            )
            print(f"Number of H atoms in the molecule {molecule_name}: {H_number}")

            # Calculate the number of peaks in the NMR spectra of the molecule
            peak_number = get_number_of_topologically_distinct_atoms(mol, atomic_number=1)
            print(
                f"The number of peaks in the NMR spectra is {number_of_peaks}, the calculated peak number from the SMILES is {peak_number}."
            )
        else:
            print(
                f"Failed to create RDKit molecule object from SMILES for {molecule_name}"
            )
            H_number = None
            peak_number = None
    else:
        print(f"Failed to convert {molecule_name} to SMILES")
        H_number = None
        peak_number = None

    # A failed conversion (None) counts as a mismatch for both checks
    if H_number is not None and H_number_nmr == H_number:
        correct_H_number += 1
    else:
        incorrect_H_number += 1
    if peak_number is not None and peak_number == number_of_peaks:
        correct_peak_number += 1
    else:
        incorrect_peak_number += 1

print(
    f"Out of {correct_H_number+incorrect_H_number} molecules, {correct_H_number} molecules had a matching hydrogen atom number to the extracted NMR spectra."
)
print(
    f"Out of {correct_peak_number+incorrect_peak_number} molecules, {correct_peak_number} molecules had a matching number of NMR peaks to the extracted NMR spectra."
)

Processing molecule: 1-(N,N-diphenylamino)pyrene
Number of hydrogen atoms in the NMR spectra: 19
Number of H atoms in the molecule 1-(N,N-diphenylamino)pyrene: 19
The number of peaks in the NMR spectra is 2, the calculated peak number from the SMILES is 12.
Processing molecule: 1-[N,N-di(4-methylphenyl)amino]pyrene
Number of hydrogen atoms in the NMR spectra: 23
Number of H atoms in the molecule 1-[N,N-di(4-methylphenyl)amino]pyrene: 23
The number of peaks in the NMR spectra is 4, the calculated peak number from the SMILES is 12.
Processing molecule: 1,3,6,8-tetrakis-(N,N-diphenylamino)pyrene
Number of hydrogen atoms in the NMR spectra: 46
Number of H atoms in the molecule 1,3,6,8-tetrakis-(N,N-diphenylamino)pyrene: 46
The number of peaks in the NMR spectra is 3, the calculated peak number from the SMILES is 5.
Processing molecule: 1,3,6,8-tetrakis-[N,N-di(4-methylphenyl)amino]pyrene
Number of hydrogen atoms in the NMR spectra: 62
Number of H atoms in the molecule 1,3,6,8-tetrakis-[N,N
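
To see how the two regular expressions in the cell above behave, the sketch below applies them to two spectra strings taken from the extraction output. The helper name `count_peaks` is introduced here for illustration only; the patterns themselves are copied from the notebook.

```python
import re

# Patterns copied from the validation cell
PEAK_PATTERN = re.compile(r"\d+\.\d+\s*\([^)]*\)")  # shift followed by "(...)"
H_PATTERN = re.compile(r"\b(\d+)H\b")  # integrals such as "6H"


def count_hydrogens_from_nmr(nmr_spectra: str) -> int:
    # Sum all integrals; "8.7Hz" is not matched because "H" is followed by "z"
    return sum(int(match) for match in H_PATTERN.findall(nmr_spectra))


def count_peaks(nmr_spectra: str) -> int:
    # Hypothetical helper: number of shift/assignment pairs
    return len(PEAK_PATTERN.findall(nmr_spectra))


spectrum_a = (
    "δ H (CDCl3) 2.28 (6H, s, Me), 6.94 (4H, d, J =8.7Hz, Ar-H), "
    "7.00 (4H, d, J =8.7Hz, Ar-H) and 7.78–8.16 (9H, m, Py-H)"
)
spectrum_b = (
    "δ H (CDCl3) 2.26, 2.28 (12H, s, Me), 6.88–7.02 (16H, m, Ar-H) "
    "and 7.74–8.12 (8H, m, Py-H)"
)

print(count_hydrogens_from_nmr(spectrum_a), count_peaks(spectrum_a))  # 23 4
print(count_hydrogens_from_nmr(spectrum_b), count_peaks(spectrum_b))  # 36 3
```

In the second spectrum, the two shifts `2.26, 2.28` share one assignment and are counted as a single peak, because only the shift directly before the parenthesis matches the pattern. Ranges such as `6.88–7.02` are likewise counted once.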

The agreement of hydrogen counts and peak counts can serve as a validation of the extracted data. Because hydrogen atoms in very similar environments often appear in an NMR spectrum as a single overlapping peak, the calculated number of peaks can deviate from the observed one. To account for this, one could score the extracted datapoints based on their deviation from the calculated values. For example, if neither the peak number nor the hydrogen number of a molecule matches, one could delete the datapoint. If only one of the two values mismatches, one could give the datapoint a lower weight for model training.
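
The weighting idea can be sketched as follows. This is only one possible scheme: the function name `datapoint_weight` and the value `0.5` for a partial match are assumptions made for illustration, not values used in this notebook.

```python
def datapoint_weight(h_nmr, h_mol, peaks_nmr, peaks_mol, partial_weight=0.5):
    """Return a training weight for one extracted datapoint.

    1.0 if both checks agree, `partial_weight` if exactly one agrees,
    and 0.0 (drop the datapoint) if neither agrees. A value of None
    for the molecule-derived numbers is treated as a mismatch.
    """
    h_ok = h_mol is not None and h_nmr == h_mol
    peaks_ok = peaks_mol is not None and peaks_nmr == peaks_mol

    if h_ok and peaks_ok:
        return 1.0
    if h_ok or peaks_ok:
        return partial_weight  # assumed value, to be tuned
    return 0.0


# Numbers for 1-(N,N-diphenylamino)pyrene from the output above:
# 19 vs. 19 hydrogen atoms, 2 vs. 12 peaks
print(datapoint_weight(19, 19, 2, 12))  # 0.5

# Name-to-SMILES conversion failed, so nothing can be confirmed
print(datapoint_weight(19, None, 2, None))  # 0.0
```

A finer-grained variant could scale the weight with the size of the deviation instead of using fixed steps.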