# Prompt to NOMAD searches

NOMAD has an extensive search API with hundreds of searchable quantities. Writing search queries is hard for people who do not know the keys or our query format. In this project, we explore whether and how LLMs can generate NOMAD search queries from human input.

## Search key documentation

As a first step, we need to produce some information on our search quantities, i.e. the keys you can use in your search queries. We use the following process:
- get all search quantities from the NOMAD code
- in the beginning: filter exotic quantities (e.g. from plugins, optimade, etc)
- in the beginning: only use a sub-set, i.e. the most "popular", quantities
- try to retrieve good values for those quantities that can be aggregated over values
- provide the quantities with a fixed schema: `key` -> `description`, `type`, `values`
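
The fixed schema from the last step can be sketched as a small record type plus a sanity check. This is only an illustration: `SearchKeyInfo` and `check_entry` are hypothetical names that are not part of NOMAD, and the sample entry uses placeholder text and values rather than real NOMAD metadata.

```python
from typing import TypedDict, Union


class SearchKeyInfo(TypedDict, total=False):
    """Hypothetical record for one search key (not a NOMAD class)."""
    description: str            # human-readable explanation of the quantity
    type: str                   # e.g. "str", "int", "float", "bool" or "enum"
    values: Union[dict, list]   # exhaustive values, optionally with entry counts
    example_values: list        # a few samples when the full set is too large


def check_entry(key: str, info: dict) -> list:
    """Return a list of problems found in one vocabulary entry."""
    problems = []
    for field in ("description", "type"):
        if not info.get(field):
            problems.append(f"{key}: missing '{field}'")
    return problems


# Placeholder entry for illustration only
vocabulary = {
    "results.material.symmetry.crystal_system": SearchKeyInfo(
        description="Placeholder description of the crystal system.",
        type="str",
        values=["cubic"],
    )
}

for key, info in vocabulary.items():
    print(key, check_entry(key, info))
```

Keeping every entry in the same shape makes it easier to serialize the whole vocabulary into the prompt later on.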

In [1]:
from nomad import config
from nomad.datamodel import EntryArchive
from nomad.metainfo import Reference, MEnum
from nomad.metainfo.elasticsearch_extension import entry_index
import json
import requests

Schema is deprecated, use plugins. ()


In [2]:
# Read all possible search quantities from NOMAD and filter exotic and technical keys
search_keys = dict()
if not entry_index.doc_type.mapping:
    entry_index.doc_type.create_mapping(EntryArchive.m_def)
for key, value in entry_index.doc_type.quantities.items():
    annotation = value.annotation

    try:
        if isinstance(annotation.definition.type, Reference):
            continue

        if isinstance(annotation.definition.type, MEnum):
            type = list(annotation.definition.type)
        else:
            type = annotation.definition.type.__name__
    except Exception:  # fall back to the string representation of the type
        type = str(annotation.definition.type)

    if annotation.field:
        key = f'{key}.{annotation.field}'

    # In the beginning we do not want these
    if '__suggestion' in key:
        continue
    if 'optimade' in key:
        continue
    if 'topology' in key:
        continue
    if 'eln' in key:
        continue
    if key.startswith('data'):
        continue
    if not annotation.definition.description:
        continue
    
    search_keys[key] = dict(
        description=annotation.definition.description,
        aggregatable=annotation.aggregatable,
        type=type
    )

print(f'roughly {len(json.dumps(search_keys).split())} tokens')

roughly 6213 tokens
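
The count above splits on whitespace, so it counts words rather than model tokens, and compact JSON with long dotted keys will likely tokenize into more pieces than that. A minimal sketch of a second rough estimate follows. The helper `estimate_tokens` is hypothetical, and the default of four characters per token is an assumed rule of thumb for English text, not a property of any particular tokenizer.

```python
import json


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> dict:
    """Return two rough size estimates for a prompt fragment."""
    return {
        "words": len(text.split()),                        # what the notebook prints
        "by_chars": round(len(text) / chars_per_token),    # assumed heuristic
    }


# Placeholder entry for illustration only
sample = json.dumps(
    {"results.material.elements": {"description": "Placeholder text.", "type": "str"}}
)
print(estimate_tokens(sample))
```

For an exact number, the tokenizer of the model that is actually used would have to be applied to the final prompt.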


In [3]:
# In the beginning we might limit ourselves to a sub-set of search quantities
limited_keys = """
results.material.material_name
results.material.structural_type
results.material.dimensionality
results.material.elements
results.material.n_elements
results.material.elements_exclusive
results.material.chemical_formula_descriptive
results.material.chemical_formula_reduced
results.material.chemical_formula_hill
results.material.chemical_formula_iupac
results.material.symmetry.bravais_lattice
results.material.symmetry.crystal_system
results.material.symmetry.hall_number
results.material.symmetry.hall_symbol
results.material.symmetry.point_group
results.material.symmetry.space_group_number
results.material.symmetry.space_group_symbol
results.material.symmetry.structure_name
results.material.symmetry.strukturbericht_designation
results.method.method_name
results.method.simulation.program_name
results.method.simulation.dft.basis_set_type
results.method.simulation.dft.core_electron_treatment
results.method.simulation.dft.spin_polarized
results.method.simulation.dft.scf_threshold_energy_change
results.method.simulation.dft.van_der_Waals_method
results.method.simulation.dft.relativity_method
results.method.simulation.dft.smearing_kind
results.method.simulation.dft.smearing_width
results.method.simulation.dft.jacobs_ladder
results.method.simulation.dft.xc_functional_type
results.method.simulation.dft.xc_functional_names
results.properties.available_properties
results.properties.electronic.band_gap.value
results.properties.electronic.band_gap.type
results.properties.geometry_optimization.convergence_tolerance_energy_difference
results.properties.geometry_optimization.convergence_tolerance_force_maximum
results.properties.geometry_optimization.final_force_maximum
results.properties.geometry_optimization.final_energy_difference
results.properties.geometry_optimization.final_displacement_maximum""".split()
search_keys = {key: value for key, value in search_keys.items() if key in limited_keys}
print(f'roughly {len(json.dumps(search_keys).split())} tokens')

roughly 1126 tokens


In [4]:
# Run aggregations against the central NOMAD for those quantities that are aggregatable
aggregatable_keys = [key for key, value in search_keys.items() if value['aggregatable']]

# For those keys we want all values, because these are the exact values to use.
# For the other aggregatable keys we are more interested in example values.
all_value_keys = """
results.material.symmetry.bravais_lattice
results.material.symmetry.crystal_system
results.material.symmetry.hall_symbol
results.material.symmetry.point_group
results.material.symmetry.space_group_symbol
results.material.symmetry.strukturbericht_designation
results.method.method_name
results.method.simulation.program_name
results.method.simulation.dft.basis_set_type
results.method.simulation.dft.core_electron_treatment
results.method.simulation.dft.spin_polarized
results.method.simulation.dft.van_der_Waals_method
results.method.simulation.dft.relativity_method
results.method.simulation.dft.smearing_kind
results.method.simulation.dft.jacobs_ladder
results.method.simulation.dft.xc_functional_type
results.method.simulation.dft.xc_functional_names
results.properties.available_properties
results.properties.electronic.band_gap.type
""".split()

def create_aggregation(key):
    return {
        "terms": {
            "pagination": {
                "page_size": 500 if key in all_value_keys else 5
            }, 
            "quantity": key
        }
    }

query = {
    "pagination": {
        "page_size": 0
    },
    "aggregations": {
        key: create_aggregation(key) for key in aggregatable_keys
    }
}

url = f'{config.client.url}/v1/entries/query'
response = requests.post(url, json=query)
response

<Response [200]>

In [5]:
# Add the aggregation values to the search key data. 
# Override the aggregated information if there are 
# fixed enum values. Remove the distracting "aggregatable" key.
aggregation_data = response.json()["aggregations"]
for key in aggregatable_keys:
    if key in all_value_keys:
        search_keys[key]['values'] = {
            item['value']: item['count'] for item in aggregation_data[key]['terms']['data']
        }
    else:
        search_keys[key]['example_values'] = [item['value'] for item in aggregation_data[key]['terms']['data']]
for value in search_keys.values():
    value.pop('aggregatable', None)
    if isinstance(value['type'], list):
        value['values'] = value['type']
        value['type'] = 'enum'

print(f'roughly {len(json.dumps(search_keys).split())} tokens')

roughly 3752 tokens


This is what the search key data looks like:

In [6]:
print(json.dumps(search_keys['results.method.simulation.program_name'], indent=2))
print(json.dumps(search_keys['results.material.chemical_formula_hill'], indent=2))

{
  "description": "The name of the used program.",
  "type": "str",
  "values": {
    "ABINIT": 18795,
    "AMS": 34942,
    "ASAP": 1572,
    "ATK": 163,
    "Amber": 1,
    "BAND": 538,
    "BigDFT": 702,
    "CASTEP": 6203,
    "CP2K": 4143,
    "CPMD": 5,
    "Charmm": 5,
    "Crystal": 12473,
    "DFTB+": 24,
    "DL_POLY": 1,
    "DL_POLY_4": 1,
    "DMol3": 1,
    "FHI-aims": 1312459,
    "FHI-vibes": 227,
    "GAMESS": 73,
    "GPAW": 9467,
    "GROMACS": 13532,
    "Gaussian": 2196293,
    "Gromos": 3,
    "LAMMPS": 1597,
    "LOBSTER": 241,
    "MOLCAS": 4,
    "MaterialsProject": 500,
    "NWChem": 2533,
    "OCEAN": 651,
    "ONETEP": 7,
    "ORCA": 96521,
    "Octopus": 107861,
    "OpenKIM": 1,
    "OpenMX": 2186,
    "Phonopy": 1282,
    "Quantum ESPRESSO XSPECTRA": 363,
    "Quantum Espresso": 117873,
    "Quantum Espresso EPW": 189,
    "Quantum Espresso Phonon": 451,
    "Siesta": 10,
    "VASP": 8666823,
    "WIEN2k": 2324,
    "Wannier90": 1,
    "YAMBO": 137,
    

## Example queries

Information on the search keys only describes the search "vocabulary"; we still need to teach the LLM some syntax. We use a few different queries to teach by example.

In [7]:
# Some example queries
example_queries = """
This query looks for VASP calculations for materials that contain C and O and other elements and produced a DOS (density of states) property:
{
  "query": {
    "results.method.simulation.program_name:any": [  
      "VASP"
    ],
    "results.material.elements:any": [
      "C",
      "O"
    ],
    "results.properties.available_properties:all": [
      "dos_electronic"
    ]
  }
}

This query looks for exciting calculations with certain elements in the unit cell with cubic symmetry. The simulations need to have calculated
a DOS, a band gap, and a band structure.
{
  "query": {
    "results.method.simulation.program_name:any": [  
      "exciting"
    ],
    "results.material.elements:any": [
      "Ti",
      "O"
    ],
    "results.material.symmetry.crystal_system": "cubic",
    "results.properties.available_properties:all": [
      "dos_electronic", "band_gap", "band_structure_electronic"
    ]
  }
}  

Here we are looking for either exciting or VASP calculations on materials that are exclusively made from titanium and oxygen.
{
  "query": {
    "and": [
        {
            "or": [
                {
                    "results.method.simulation.program_name": "exciting"
                },
                {
                    "results.method.simulation.program_name": "VASP"
                }
            ]
        },
        {
            "results.material.elements:all": [
              "Ti",
              "O"
            ]
        }
    ]
  }
}
"""
print(f'roughly {len(example_queries.split())} tokens')

roughly 143 tokens
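
Because the examples teach the LLM syntax, it can be worth checking that every JSON object embedded in the example text actually parses; a trailing comma, for instance, makes an object invalid. The following is a minimal sketch using only the standard library. `extract_json_objects` is a hypothetical helper, and it assumes that the descriptive text between the examples contains no `{` characters.

```python
import json


def extract_json_objects(text: str) -> list:
    """Parse every top-level JSON object embedded in free text.

    Raises json.JSONDecodeError if an object is not valid JSON.
    """
    decoder = json.JSONDecoder()
    objects = []
    pos = text.find("{")
    while pos != -1:
        # raw_decode returns the parsed object and the index where it ended
        obj, end = decoder.raw_decode(text, pos)
        objects.append(obj)
        pos = text.find("{", end)
    return objects


good = '''
A query for VASP entries:
{
  "query": {"results.method.simulation.program_name": "VASP"}
}
'''
bad = '''
The same query with a trailing comma:
{
  "query": {"results.method.simulation.program_name": "VASP",}
}
'''

print(len(extract_json_objects(good)), "valid example(s)")
try:
    extract_json_objects(bad)
except json.JSONDecodeError as error:
    print(f"invalid example: {error}")
```

Running such a check on `example_queries` before building the prompt would catch syntax slips in the examples early.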


## The prompt

We now combine everything into a prompt template that comprises:
- some instructions
- the search key data (vocabulary)
- search examples (syntax)
- the human input

In [8]:
template = '''
Your job is to generate JSON queries for a specific search API for a database of 
computational materials science data that consists of DFT calculations and simulations.

In the generated queries, you are only allowed to use a specific set of keys. 
It is important to use the full keys.
Most keys also only allow a specific set of values. The following JSON data describes 
all available search keys with example values, potential exclusive values, the value type, 
and a human description. If the values are given, you can only use those values. 
Capitalization is important. The following 
data also provides how many entries might have a given value.

```
{search_keys}
```

The JSON queries need to follow a strict syntactical format. Each query is a JSON
object with one top level "query" key. Behind this "query" key, 
multiple criteria can be combined with "and", "or", and "not" operators.
If you want to pass multiple values to a key, use the ":any" (some values match) and ":all" (all values match) suffix on the keys. 
Don't use a $ sign. Here are a few example queries in JSON format:

```
{example_queries}
```

Now generate a query that matches the following description. Only use keys that are necessary based on the description.

```
{input}
```

Your output has to be valid JSON and only valid JSON.
'''

print(f'roughly {len(template.split())} tokens')

roughly 224 tokens


## Use the LLM

We finally use the LLM by generating a prompt from the template, feeding it into the LLM, and parsing the output as JSON.

In [9]:
# Generating the search query
from langchain_core.output_parsers import JsonOutputParser
from langchain_core.prompts import ChatPromptTemplate
import json
from langchain_community.llms import Ollama

def generate_query(input):
    # Connect to the Ollama server that hosts the model
    llm = Ollama(model="llama3:70b", base_url='http://172.28.105.30/backend')
    
    prompt = ChatPromptTemplate.from_template(template)
    output_parser = JsonOutputParser()
    
    chain = prompt | llm | output_parser

    params = {
        "search_keys": json.dumps(search_keys, indent=2),
        "example_queries": example_queries,
        "input": input
    }

    # print(f'roughly {len(prompt.invoke(params).messages[0].content.split())} tokens')
    
    return chain.invoke(params)

prompt = "I am looking for VASP simulations of bulk materials made from nickel or iron that have a dos available."
api_query = generate_query(prompt)
queries = {prompt: api_query}

print(json.dumps(api_query, indent=2))

{
  "query": {
    "and": [
      {
        "results.method.simulation.program_name": "VASP"
      },
      {
        "results.material.elements:any": [
          "Ni",
          "Fe"
        ]
      },
      {
        "results.properties.available_properties:all": [
          "dos_electronic"
        ]
      }
    ]
  }
}


Let's try a few more "wild" examples:

In [10]:
prompts = [
    'I need to know the band structure of pure silicon',
    'Are there any elastic constant calculations?',
    'My teacher told me, I need to learn about a code called octopy or something.'
]


for prompt in prompts:
    query = None
    # Retry up to three times; the LLM does not always return valid JSON
    for _ in range(3):
        try:
            query = generate_query(prompt)
            break
        except Exception:
            pass
    queries[prompt] = query
    print(f'{prompt}:\n\n{json.dumps(query, indent=2)}\n\n')

I need to know the band structure of pure silicon:

{
  "query": {
    "results.material.elements:all": [
      "Si"
    ],
    "results.properties.available_properties:any": [
      "band_structure_electronic"
    ]
  }
}


Are there any elastic constant calculations?:

{
  "query": {
    "results.properties.available_properties:any": [
      "elastic_constants"
    ]
  }
}


My teacher told me, I need to learn about a code called octopy or something.:

{
  "query": {
    "results.method.simulation.program_name": "octopus"
  }
}




## Run the generated queries

Let's run the queries against NOMAD:
- do they produce errors?
- do they produce results?

In [11]:
def search_database(query_json: dict) -> int:
    """ Send a query to the search API of the database and return the number of results. """
    # Copy the generated query and only ask for the total, not for the entries
    api_query = dict(**query_json)
    api_query.update(owner='visible', pagination=dict(page_size=0))

    # Set the API endpoint URL
    url = f'{config.client.url}/v1/entries/query'

    # Send a POST request to the API endpoint with the query JSON object
    response = requests.post(url, json=api_query)

    # Check if the response was successful (200 OK)
    if response.status_code != 200:
        raise RuntimeError(f"Error. Status code {response.status_code}, {response.text}")

    # print(json.dumps(response.json(), indent=2))

    return response.json()["pagination"]["total"]

In [12]:
for prompt, query in queries.items():
    print(f'{prompt}: {search_database(query)}')

I am looking for VASP simulations of bulk materials made from nickel or iron that have a dos available.: 472820
I need to know the band structure of pure silicon: 19596
Are there any elastic constant calculations?: 0
My teacher told me, I need to learn about a code called octopy or something.: 0
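
The last query uses `octopus`, while the vocabulary lists `Octopus`, so capitalization is one plausible cause for the zero hits. A generated query could be checked against the vocabulary before it is sent. The sketch below is an assumption about how such a check might look: `check_query` is a hypothetical helper, it only understands the `and`/`or`/`not` operators and the `:any`/`:all` suffixes used in this notebook, and the sample vocabulary is a two-value excerpt of the program name data shown earlier.

```python
def check_query(query: dict, vocabulary: dict) -> list:
    """Return human-readable problems for the keys and values in a query."""
    problems = []

    def visit(node):
        if isinstance(node, list):
            for item in node:
                visit(item)
            return
        if not isinstance(node, dict):
            return
        for raw_key, value in node.items():
            if raw_key in ("and", "or", "not"):
                visit(value)
                continue
            key = raw_key.split(":", 1)[0]  # drop a ":any" or ":all" suffix
            info = vocabulary.get(key)
            if info is None:
                problems.append(f"unknown key: {key}")
                continue
            allowed = info.get("values")
            if not allowed:
                continue  # no exhaustive value list for this key
            by_lower = {str(v).lower(): v for v in allowed}
            for item in value if isinstance(value, list) else [value]:
                if not isinstance(item, (str, int, float, bool)) or item in allowed:
                    continue
                hint = by_lower.get(str(item).lower())
                suffix = f" (did you mean {hint!r}?)" if hint is not None else ""
                problems.append(f"{key}: value {item!r} not allowed{suffix}")

    visit(query.get("query", {}))
    return problems


vocabulary = {
    "results.method.simulation.program_name": {
        "values": {"Octopus": 107861, "VASP": 8666823}
    }
}
generated = {"query": {"results.method.simulation.program_name": "octopus"}}
print(check_query(generated, vocabulary))
```

The returned messages could be shown to the user or fed back to the LLM as a correction hint.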


## Conclusions

It works, sometimes.

- What is good for a human is also good for LLMs. The API is much easier to use if you know the keys and the values.
- Correcting after an unsuccessful search might help a lot.
- LLMs seem to struggle when they need to be specific: capitalization, numbers, restricting to fixed values. Maybe a more coding-oriented LLM would do a better job?
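
The idea of correcting after an unsuccessful search could be sketched as a small feedback loop. This was not tried in this notebook. `generate_with_feedback` is a hypothetical helper, the wording of the hint is an assumption, and the two `fake_*` functions are stand-ins so that the example runs without an LLM or a NOMAD server.

```python
from typing import Callable, Optional


def generate_with_feedback(
    description: str,
    generate: Callable[[str], dict],
    count_hits: Callable[[dict], int],
    max_rounds: int = 3,
) -> Optional[dict]:
    """Ask for a query and re-ask with a hint while it fails or finds nothing."""
    request = description
    for _ in range(max_rounds):
        try:
            query = generate(request)
            hits = count_hits(query)
        except Exception as error:  # e.g. malformed JSON or an HTTP error
            request = f"{description}\n\nThe previous attempt failed with: {error}"
            continue
        if hits > 0:
            return query
        request = (
            f"{description}\n\nThe previous query returned no results: {query}. "
            "Check capitalization and allowed values, then try again."
        )
    return None


def fake_generate(request: str) -> dict:
    """Stand-in for the LLM: fixes the capitalization once it gets a hint."""
    name = "Octopus" if "returned no results" in request else "octopus"
    return {"query": {"results.method.simulation.program_name": name}}


def fake_count_hits(query: dict) -> int:
    """Stand-in for the search: only the correctly capitalized name matches."""
    name = query["query"]["results.method.simulation.program_name"]
    return 1 if name == "Octopus" else 0


print(generate_with_feedback("a code called octopy", fake_generate, fake_count_hits))
```

In the notebook, `generate_query` and `search_database` have matching signatures and could be passed in place of the stand-ins.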