In this notebook, we show an example of how to use modules in `orchestrator.computer.score` to compute:
* QUESTS dataset efficiency
* QUESTS dataset diversity
* QUESTS delta entropy

In [None]:
import orchestrator
from orchestrator.utils.input_output import safe_read
from orchestrator.utils.setup_input import init_and_validate_module_type

import numpy as np

First, we load our dataset from local storage.

In [None]:
dataset_path = "./query_configs.extxyz"
dataset = safe_read(dataset_path, index=":")
print("Number of configurations in the dataset:", len(dataset))

Alternatively, we could retrieve the dataset from ColabFit storage, assuming it has already been uploaded to the database.
You can check the available datasets by calling `storage.list_data()`.

To compute the QUESTS efficiency, diversity, and delta entropy scores, each configuration must have its QUESTS descriptor computed and stored in `.arrays` of the ASE atoms object.

In [None]:
# Instantiate QUESTSDescriptor
descriptor_input = {
    "descriptor_type":"QUESTSDescriptor",
    "descriptor_args": {
        "num_nearest_neighbors": 32,
        "cutoff": 5.0
    }
}
descriptor = init_and_validate_module_type(
    module_name="descriptor",
    input_args=descriptor_input,
    single_input_dict=True
)

# Compute the descriptor
calc_ids = descriptor.run(
    'quests_descriptor_generation', 
    compute_args={}, # no special args for computing QUESTS descriptors
    configs=dataset,
    workflow=None, # will use the default (local) workflow
    job_details={}, # no special job_details to modify
    batch_size=100, # we'll compute all descriptors at once
)

In [None]:
# Parse the output into a new list of configurations (alternatively, we could overwrite `dataset`)
dataset_with_descriptors = descriptor.data_from_calc_ids(calc_ids)

Additionally, we generate a dummy redundant dataset by duplicating the original atomic configurations.
Comparing the QUESTS scores of the original and redundant datasets illustrates what each score means.

In [None]:
# Build a redundant dataset by repeating the whole list of configurations
redundant_dataset = dataset_with_descriptors * 5  # 5 copies of every original configuration

# QUESTS dataset efficiency score

The QUESTS dataset efficiency score is computed using the `orchestrator.computer.score.quests.QUESTSEfficiencyScore` module.
The efficiency score measures how oversampled the dataset is.
A score near 1 means that the dataset has very little redundancy.

In [None]:
# Instantiate QUESTSEfficiencyScore
score_input = {"score_type":"QUESTSEfficiencyScore", "score_args": {}}
score = init_and_validate_module_type(
    module_name="score",
    input_args=score_input,
    # Set the following argument to True because the input dictionary
    # only contains argument for one class instance
    single_input_dict=True,
)

In [None]:
# Compute the QUESTS efficiency score
compute_args = {
    "score_quantity":"EFFICIENCY",
    "apply_mask": False,
    "descriptors_key": descriptor.OUTPUT_KEY + "_descriptors",  # Key for atoms.arrays dictionary for the precomputed QUESTS descriptor
    "bandwidth": 0.015  # Gaussian kernel bandwidth for KDE, default is 0.015
}
calc_ids = score.run(
    'efficiency_calc',
    dataset_with_descriptors, 
    compute_args, 
    batch_size=1
)
efficiency_score = score.data_from_calc_ids(calc_ids)[0][score.OUTPUT_KEY+'_score'][0] # nested list, just have one score
print("Efficiency score:", efficiency_score)

In [None]:
# Compute the QUESTS efficiency score for the redundant dataset
calc_ids = score.run(
    'efficiency_calc',
    redundant_dataset, 
    compute_args, 
    batch_size=1
)
efficiency_score = score.data_from_calc_ids(calc_ids)[0][score.OUTPUT_KEY+'_score'][0] # nested list, just have one score
print("Efficiency score:", efficiency_score)

The QUESTS efficiency score of the original dataset is higher than that of the redundant dataset, implying that the latter has more redundancy than the original, which agrees with how the redundant dataset was constructed.
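
To build intuition for this result, here is a toy sketch that uses only the Python standard library. It assumes a Gaussian-kernel entropy of the form $\mathcal{H} = -\frac{1}{n}\sum_i \log\left[\frac{1}{n}\sum_j K(x_i, x_j)\right]$. The helper names and the normalization `exp(H) / n` are hypothetical choices for illustration, and they are not necessarily what `QUESTSEfficiencyScore` computes internally.

```python
import math


def gaussian_kernel(x, y, h):
    """Gaussian kernel between two descriptor vectors with bandwidth h."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * h**2))


def kde_entropy(points, h):
    """Assumed form: H = -(1/n) * sum_i log((1/n) * sum_j K(x_i, x_j))."""
    n = len(points)
    log_density_sum = 0.0
    for x in points:
        # Each point overlaps with itself, so the density is never zero
        density = sum(gaussian_kernel(x, y, h) for y in points) / n
        log_density_sum += math.log(density)
    return -log_density_sum / n


def toy_efficiency(points, h):
    """Hypothetical normalization: effective number of points divided by n."""
    return math.exp(kde_entropy(points, h)) / len(points)


# Three well-separated toy "environments" (1-D descriptors for simplicity)
original = [(0.0,), (1.0,), (2.0,)]
redundant = original * 5  # same duplication trick as in the notebook

bandwidth = 0.015
eff_original = toy_efficiency(original, bandwidth)    # close to 1.0
eff_redundant = toy_efficiency(redundant, bandwidth)  # close to 0.2
print("Toy efficiency (original): ", eff_original)
print("Toy efficiency (redundant):", eff_redundant)
```

Duplicating the points leaves the entropy unchanged while the number of points grows, so under this assumed normalization the efficiency drops by the duplication factor.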

# QUESTS dataset diversity score

The QUESTS dataset diversity score is calculated using `orchestrator.computer.score.quests.QUESTSDiversityScore`.
The diversity score measures dataset coverage, i.e., a dataset with a higher diversity score covers a larger region of the configuration space.

In [None]:
# Instantiate QUESTSDiversityScore
score_input = {"score_type":"QUESTSDiversityScore", "score_args": {}}
score = init_and_validate_module_type(
    module_name="score",
    input_args=score_input,
    single_input_dict=True
)

In [None]:
# Compute the QUESTS diversity score
# The arguments are the same as those used for the efficiency score, except for score_quantity
compute_args = {
    "score_quantity": "DIVERSITY",
    "apply_mask": False,
    "descriptors_key": descriptor.OUTPUT_KEY + "_descriptors",  # Key for atoms.arrays dictionary for the precomputed QUESTS descriptor
    "bandwidth": 0.015
}
calc_ids = score.run(
    'diversity_calc',
    dataset_with_descriptors, 
    compute_args, 
    batch_size=1
)
diversity_score = score.data_from_calc_ids(calc_ids)[0][score.OUTPUT_KEY+'_score'][0] # nested list, just have one score
print("Diversity score:", diversity_score)

In [None]:
# Compute the QUESTS diversity score for the redundant dataset
calc_ids = score.run(
    'diversity_calc',
    redundant_dataset, 
    compute_args, 
    batch_size=1
)
diversity_score = score.data_from_calc_ids(calc_ids)[0][score.OUTPUT_KEY+'_score'][0] # nested list, just have one score
print("Diversity score:", diversity_score)

Notice that the two datasets have the same diversity.
Although we have more configurations in the redundant dataset, they don't increase the coverage because they are just duplicates of the original dataset.
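
The same behavior can be reproduced with a toy standard-library sketch. It assumes a diversity of the form $D = \log \sum_i \left[\sum_j K(x_i, x_j)\right]^{-1}$ with a Gaussian kernel. The function names are hypothetical and the code is only an illustration of the idea, not the implementation behind `QUESTSDiversityScore`.

```python
import math


def gaussian_kernel(x, y, h):
    """Gaussian kernel between two descriptor vectors with bandwidth h."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * h**2))


def toy_diversity(points, h):
    """Assumed form: D = log(sum_i 1 / sum_j K(x_i, x_j))."""
    weight_sum = 0.0
    for x in points:
        # Points that overlap with many others contribute a smaller weight
        overlap = sum(gaussian_kernel(x, y, h) for y in points)
        weight_sum += 1.0 / overlap
    return math.log(weight_sum)


# Three well-separated toy "environments" (1-D descriptors for simplicity)
original = [(0.0,), (1.0,), (2.0,)]
redundant = original * 5  # exact duplicates, as in the notebook

bandwidth = 0.015
div_original = toy_diversity(original, bandwidth)
div_redundant = toy_diversity(redundant, bandwidth)
print("Toy diversity (original): ", div_original)
print("Toy diversity (redundant):", div_redundant)
```

Each duplicated point is down-weighted by the number of its copies, so both toy datasets give a diversity of approximately $\log 3$.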

# QUESTS delta entropy score

The QUESTS delta entropy ($\delta \mathcal{H}$) score is calculated using `orchestrator.computer.score.quests.QUESTSDeltaEntropyScore`.
The delta entropy measures the contribution of a data point (atomic environment) to the total entropy of the reference dataset.
Rare environments have large $\delta \mathcal{H}$ values.
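
As a rough illustration, the sketch below assumes $\delta \mathcal{H}(y) = -\log \sum_j K(y, x_j)$, where the $x_j$ are reference environments and $K$ is a Gaussian kernel. The function names and the toy 1-D descriptors are hypothetical, and the exact (non-approximate) evaluation shown here may differ in detail from what `QUESTSDeltaEntropyScore` does.

```python
import math


def gaussian_kernel(x, y, h):
    """Gaussian kernel between two descriptor vectors with bandwidth h."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * h**2))


def toy_delta_entropy(y, reference, h):
    """Assumed form: dH(y | reference) = -log(sum_j K(y, x_j))."""
    overlap = sum(gaussian_kernel(y, x, h) for x in reference)
    if overlap == 0.0:
        # No measurable overlap with the reference set at all
        return math.inf
    return -math.log(overlap)


# Toy reference set: three nearly identical 1-D "environments"
reference = [(0.0,), (0.0,), (0.01,)]
bandwidth = 0.015

# An environment already well covered by the reference set
dh_common = toy_delta_entropy((0.0,), reference, bandwidth)
# An environment that sits away from every reference point
dh_rare = toy_delta_entropy((0.05,), reference, bandwidth)

print("dH of a well-covered environment:", dh_common)  # negative
print("dH of a rare environment:", dh_rare)            # positive
```

Under this assumption, an environment that overlaps with several reference points gets a negative value, while an environment far from all of them gets a large positive value.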

In [None]:
# Instantiate QUESTSDeltaEntropyScore
score_input = {"score_type":"QUESTSDeltaEntropyScore", "score_args": {}}
score = init_and_validate_module_type(
    module_name="score",
    input_args=score_input, 
    single_input_dict=True
)

In [None]:
# We'll compute dH for 2/3 of the dataset, using the other 1/3 (the first 3 configurations) as the reference
reference = dataset_with_descriptors[:3]
reference_descriptors = np.concatenate([c.get_array(descriptor.OUTPUT_KEY+'_descriptors') for c in reference])
compute_dataset = dataset_with_descriptors[3:]

compute_args = {
    "score_quantity":"DELTA_ENTROPY",
    "reference_set": reference_descriptors,
    "descriptors_key": descriptor.OUTPUT_KEY + "_descriptors",
    "bandwidth": 0.015,
    "approx": False,  # Don't use the approximate dH calculation for this example
    "num_nearest_neighbors": 3,  # Number of nearest neighbors used in the dH calculation - only used if approx = True
    "graph_neighbors": 10  # Parameter for the approximate nearest neighbor search - only used if approx = True
}
calc_ids = score.run(
    'dH_calc',
    compute_dataset, 
    compute_args,
    batch_size=6, # compute all at once
)
configs_with_score = score.data_from_calc_ids(calc_ids)
print("Sample delta entropy scores:", configs_with_score[0].get_array(score.OUTPUT_KEY+"_score"))

We can use the delta entropy information to mask (include or exclude) some atomic environments in the configurations.

In [None]:
from orchestrator.utils.data_standard import SELECTION_MASK_KEY
dH_threshold = -3  # Exclude environments with dH values at or below this threshold
for atoms in configs_with_score:
    natoms = atoms.get_global_number_of_atoms()
    masks_atoms = np.zeros(natoms, dtype=bool)
    masks_atoms[atoms.get_array(score.OUTPUT_KEY+"_score") > dH_threshold] = True
    # Add the masking array to atoms.arrays
    atoms.set_array(SELECTION_MASK_KEY, masks_atoms)

In [None]:
total_atoms = np.sum([len(x) for x in configs_with_score])
atoms_after_mask = np.sum(np.concatenate([x.get_array(SELECTION_MASK_KEY) for x in configs_with_score]))
print(f'After masking, a dataset with {total_atoms} atoms was reduced in size to {atoms_after_mask} atoms')