# Validation case study: Matching NMR spectra to composition of the molecule

After data extraction, automated checks should be used to ensure the validity of the data. One possible check is to match the extracted NMR spectrum against the extracted molecule: the number of protons and the number of peaks reported in the extracted spectrum are compared with the values expected from the structure of the molecule. This notebook shows an example of how such an automated check could be implemented.

### Data extraction

As a first step, we extract the NMR spectra and the analyzed molecules using an LLM.
To do so, we created a basic prompt that contains the desired information and the content of the article. The model returns the names of the molecules and the NMR spectra of all included molecules in JSON format.

In [10]:
import os
from litellm import completion
from bs4 import BeautifulSoup
import json

# Add the content of the XML file to the prompt
def format_prompt(template: str, data: dict) -> str:
    return template.format(**data)

# Define the function to call the LiteLLM API
def call_litellm(prompt, model="gpt-4o", temperature: float = 0.0, **kwargs):
    """Call LiteLLM model

    Args:
        prompt (str): Prompt to send to the model
        model (str, optional): Name of the model. Defaults to "gpt-4o".
        temperature (float, optional): Inference temperature. Defaults to 0.

    Returns:
        tuple: Message content (str), input token count (int), and output token count (int)
    """
    messages = [
        {
            "role": "system",
            "content": (
                "You are a scientific assistant, extracting NMR spectra and the analyzed molecule "
                "out of XML documents in valid JSON format. Extract just data which you are 100% confident about the "
                "accuracy. Keep the entries short without details. Be careful with numbers."
            ),
        },
        {"role": "user", "content": prompt},
    ]

    response = completion(
        model=model,
        messages=messages,
        temperature=temperature,
        response_format={"type": "json_object"},
        **kwargs,
    )

    # Extract and return the message content and token usage
    message_content = response["choices"][0]["message"]["content"]
    input_tokens = response["usage"]["prompt_tokens"]
    output_tokens = response["usage"]["completion_tokens"]
    return message_content, input_tokens, output_tokens

# Define the prompt template
prompt_template = ("""Extract all 1H-NMR-spectra and the related analyzed molecule out of this XML file: {data}. 
Extract the complete 1-H-NMR-spectra as text. Extract the full IUPAC name of the molecules without abbreviations and details.
Extract the data in the following json format:
    {{"molecules": [
        {{
            "molecule":
            "nmr_spectra":
        }},
        {{
            "molecule":
            "nmr_spectra":
        }}
        ]
    }}""")


# Load the OpenAI API key from the environment; LiteLLM reads the same variable
api_key = os.getenv("OPENAI_API_KEY")

# Fail early with a clear message if the key is missing
if api_key is None:
    raise RuntimeError("OPENAI_API_KEY is not set")

with open("XML_content.json", 'r', encoding='utf-8') as file:
    data = json.load(file)

# Add the XML data to the prompt
prompt = format_prompt(prompt_template, {"data": data})

# Call the LiteLLM API and print the output and token usage
output, input_tokens, output_tokens = call_litellm(prompt=prompt)
output = json.loads(output)
print("Output: ", output)
print("Input tokens used:", input_tokens, "Output tokens used:", output_tokens)

with open("NMR_data.json", "w", encoding='utf-8') as json_file:
    json.dump(output, json_file, indent=4)

Output:  {'molecules': [{'molecule': '1-(N,N-diphenylamino)pyrene', 'nmr_spectra': 'δ H (CDCl3) 6.92–7.23 (10H, m, Ar-H) and 7.90–8.37 (9H, m, Py-H)'}, {'molecule': '1-[N,N-di(4-methylphenyl)amino]pyrene', 'nmr_spectra': 'δ H (CDCl3) 2.28 (6H, s, Me), 6.94 (4H, d, J =8.7Hz, Ar-H), 7.00 (4H, d, J =8.7Hz, Ar-H) and 7.78–8.16 (9H, m, Py-H)'}, {'molecule': '1,3,6,8-tetrakis-(N,N-diphenylamino)pyrene', 'nmr_spectra': 'δ H (CDCl3) 6.89–7.17 (40H, m, Ar-H), 7.66 (2H, s, Py-H a) and 7.97 (4H, s, Py-H b)'}, {'molecule': '1,3,6,8-tetrakis-[N,N-di(4-methylphenyl)amino]pyrene', 'nmr_spectra': 'δ H (CDCl3) 2.35 (24H, s, Me), 6.85 (16H, d, J =8.7Hz, Ar-H), 6.93 (16H, d, J =8.7Hz, Ar-H), 7.57 (2H, s, Py-H a) and 7.93 (4H, s, Py-H b)'}]}
Input tokens used: 7432 Output tokens used: 428
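
Before the extracted JSON is passed on to the validity check, it may be worth confirming that it has the structure requested in the prompt. The following is a minimal sketch and not part of the original notebook; the function name `validate_extraction` and the second, deliberately incomplete entry in `example` are hypothetical.

```python
def validate_extraction(output):
    """Return a list of problems found in the extracted JSON (empty if none)."""
    if not isinstance(output, dict):
        return ["top-level value is not a JSON object"]
    molecules = output.get("molecules")
    if not isinstance(molecules, list) or not molecules:
        return ["'molecules' is missing or not a non-empty list"]

    problems = []
    for i, entry in enumerate(molecules):
        if not isinstance(entry, dict):
            problems.append(f"entry {i}: not a JSON object")
            continue
        # Both keys requested in the prompt must be non-empty strings
        for key in ("molecule", "nmr_spectra"):
            value = entry.get(key)
            if not isinstance(value, str) or not value.strip():
                problems.append(f"entry {i}: '{key}' is missing or empty")
    return problems


example = {
    "molecules": [
        # First entry taken from the output above
        {"molecule": "1-(N,N-diphenylamino)pyrene",
         "nmr_spectra": "δ H (CDCl3) 6.92–7.23 (10H, m, Ar-H) and 7.90–8.37 (9H, m, Py-H)"},
        # Hypothetical entry with an empty molecule name
        {"molecule": "", "nmr_spectra": "δ H (CDCl3) 2.28 (6H, s, Me)"},
    ]
}
print(validate_extraction(example))  # -> ["entry 1: 'molecule' is missing or empty"]
```

Entries that fail this check could be skipped or sent back for re-extraction before any chemistry-based validation is attempted.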


### Validity check with NMR spectra and SMILES

As a next step, the hydrogen atoms in the extracted NMR spectrum and in the extracted molecule are counted and compared. In addition, the number of peaks in the NMR spectrum is compared with the number of symmetry-distinct hydrogen atoms in the molecule, which is calculated from its SMILES. If the number of hydrogen atoms or the number of peaks does not match, an error in the extraction can be suspected.
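
To illustrate how the counts are read from the NMR text alone, the sketch below applies the two regular expressions used in the validation cell to two of the extracted spectra. The function name `count_peaks_and_hydrogens` is an assumption made for this illustration.

```python
import re

# A chemical shift directly followed by a parenthesised assignment counts as one peak.
# For a range such as "6.92–7.23 (...)", only the upper value is matched, so the
# range still counts once.
PEAK_PATTERN = re.compile(r"\d+\.\d+\s*\([^)]*\)")
# Integrations such as "10H"; "8.7Hz" is not matched because of the trailing \b
H_PATTERN = re.compile(r"\b(\d+)H\b")


def count_peaks_and_hydrogens(nmr_spectra):
    """Return (number of peaks, total number of hydrogens) found in the text."""
    peaks = PEAK_PATTERN.findall(nmr_spectra)
    hydrogens = sum(int(n) for n in H_PATTERN.findall(nmr_spectra))
    return len(peaks), hydrogens


spectrum_a = "δ H (CDCl3) 6.92–7.23 (10H, m, Ar-H) and 7.90–8.37 (9H, m, Py-H)"
spectrum_b = ("δ H (CDCl3) 2.28 (6H, s, Me), 6.94 (4H, d, J =8.7Hz, Ar-H), "
              "7.00 (4H, d, J =8.7Hz, Ar-H) and 7.78–8.16 (9H, m, Py-H)")

print(count_peaks_and_hydrogens(spectrum_a))  # -> (2, 19)
print(count_peaks_and_hydrogens(spectrum_b))  # -> (4, 23)
```

Both patterns depend on the formatting used in this article; spectra reported in a different notation would likely need adjusted expressions.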

In [15]:
import json
import re
from rdkit import Chem
import package.llmstructdata.utils as utils

correct_H_number = 0
incorrect_H_number = 0
correct_peak_number = 0
incorrect_peak_number = 0
pattern = re.compile(r'\d+\.\d+\s*\([^)]*\)')

# Load JSON NMR data
with open('NMR_data.json', 'r') as file:
    data = json.load(file)

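def get_num_hydrogen_symmetry_classes(m):
    """Count symmetry-distinct hydrogen atoms (the molecule must have explicit Hs)."""
    # With breakTies=False, symmetry-equivalent atoms share the same rank
    orders = Chem.CanonicalRankAtoms(m, breakTies=False)
    h_classes = set()
    for atom, sym_class in zip(m.GetAtoms(), orders):
        if atom.GetAtomicNum() == 1:
            h_classes.add(sym_class)
    return len(h_classes)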

# Calculate the number of hydrogen atoms
def count_hydrogens_from_nmr(nmr_spectra: str) -> int:
    pattern2 = r'\b(\d+)H\b'
    matches = re.findall(pattern2, nmr_spectra)
    return sum(int(match) for match in matches)

# Loop over all molecules in data
for molecule_data in data['molecules']:
    molecule_name = molecule_data['molecule']
    nmr_spectra = molecule_data['nmr_spectra']
    
    print(f"Processing molecule: {molecule_name}")
    
    # Calculate number of hydrogen atoms in NMR data
    H_number_nmr = count_hydrogens_from_nmr(nmr_spectra)
    print(f"Number of hydrogen atoms in the NMR spectra: {H_number_nmr}")
    
    # Count the number of peaks in the NMR spectra
    peaks = pattern.findall(nmr_spectra)
    number_of_peaks = len(peaks)
    
    # Convert molecules into SMILES
    mol_smiles = utils.name_to_smiles(molecule_name)
    if mol_smiles:
        # Convert SMILES into RDKit objects
        mol = Chem.MolFromSmiles(mol_smiles)
        if mol:
            # Calculate the number of hydrogen atoms from formula
            my_mol_with_explicit_h = Chem.AddHs(mol)
            H_number = my_mol_with_explicit_h.GetNumAtoms() - my_mol_with_explicit_h.GetNumHeavyAtoms()
            print(f"Number of H atoms in the molecule {molecule_name}: {H_number}")
            
            # Calculate the number of peaks in the NMR spectra of the molecule
            peak_number = get_num_hydrogen_symmetry_classes(my_mol_with_explicit_h)
            print(f"The number of peaks in the NMR spectra is {number_of_peaks}, the calculated peak number from the SMILES is {peak_number}.")
        else:
            print(f"Failed to create RDKit molecule object from SMILES for {molecule_name}")
            H_number = None
            peak_number = None
    else:
        print(f"Failed to convert {molecule_name} to SMILES")
        H_number = None
        peak_number = None
    
    # A failed conversion counts as a mismatch for both checks
    if H_number is not None and H_number_nmr == H_number:
        correct_H_number += 1
    else: 
        incorrect_H_number += 1
    if peak_number is not None and peak_number == number_of_peaks:
        correct_peak_number += 1
    else: 
        incorrect_peak_number += 1

print(f"Out of {correct_H_number+incorrect_H_number} molecules, {correct_H_number} molecules had a matching hydrogen atom number to the extracted NMR spectra.")
print(f"Out of {correct_peak_number+incorrect_peak_number} molecules, {correct_peak_number} molecules had a matching number of NMR peaks to the extracted NMR spectra.")

Processing molecule: 1-(N,N-diphenylamino)pyrene
Number of hydrogen atoms in the NMR spectra: 19
Number of H atoms in the molecule 1-(N,N-diphenylamino)pyrene: 19
The number of peaks in the NMR spectra is 2, the calculated peak number from the SMILES is 12.
Processing molecule: 1-[N,N-di(4-methylphenyl)amino]pyrene
Number of hydrogen atoms in the NMR spectra: 23
Number of H atoms in the molecule 1-[N,N-di(4-methylphenyl)amino]pyrene: 23
The number of peaks in the NMR spectra is 4, the calculated peak number from the SMILES is 12.
Processing molecule: 1,3,6,8-tetrakis-(N,N-diphenylamino)pyrene
Number of hydrogen atoms in the NMR spectra: 46
Number of H atoms in the molecule 1,3,6,8-tetrakis-(N,N-diphenylamino)pyrene: 46
The number of peaks in the NMR spectra is 3, the calculated peak number from the SMILES is 5.
Processing molecule: 1,3,6,8-tetrakis-[N,N-di(4-methylphenyl)amino]pyrene
Number of hydrogen atoms in the NMR spectra: 62
Number of H atoms in the molecule 1,3,6,8-tetrakis-[N,N

Matching the number of hydrogen atoms and the number of peaks can serve as a validation of the extracted data. Because hydrogen atoms in very similar environments can appear in an NMR spectrum as a single overlapping signal, the observed number of peaks can deviate from the calculated one. In the output above, for example, the hydrogen counts of the first three molecules match, whereas the reported peak counts (2, 4, and 3) are lower than the calculated ones (12, 12, and 5). To address this, one could score the extracted data points by their deviation from the calculated values. For example, if neither the peak number nor the hydrogen number of a molecule matches, the data point could be discarded. If only one value mismatches, the data point could be given a lower weight for model training.
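
The weighting idea can be sketched as follows. This is only an illustration under stated assumptions: the function name `datapoint_weight` and the weights 1.0, 0.5, and 0.0 are hypothetical choices, not values used in this notebook.

```python
def datapoint_weight(h_nmr, h_mol, peaks_nmr, peaks_mol, partial_weight=0.5):
    """Return a training weight for one extracted data point.

    h_mol and peaks_mol may be None if the name could not be converted
    to a structure; this is treated as a mismatch.
    """
    h_match = h_mol is not None and h_nmr == h_mol
    peak_match = peaks_mol is not None and peaks_nmr == peaks_mol

    if h_match and peak_match:
        return 1.0             # both checks pass: keep with full weight
    if h_match or peak_match:
        return partial_weight  # one check fails: keep with reduced weight (assumed value)
    return 0.0                 # both checks fail: discard the data point


# Values taken from the printed output for the first and third molecule
print(datapoint_weight(19, 19, 2, 12))  # -> 0.5
print(datapoint_weight(46, 46, 3, 5))   # -> 0.5
# Hypothetical case in which both counts disagree
print(datapoint_weight(19, 20, 2, 12))  # -> 0.0
```

Because overlapping signals can only reduce the number of observed peaks, a less strict variant might accept any reported peak count that does not exceed the calculated one; whether that is appropriate depends on how the data will be used.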