# Aggregating data with MDF

Searches with `Forge.search()` return at most 10,000 results. Two methods work around this limit: `Forge.aggregate_sources()`, which collects every record from the named source, and `Forge.aggregate()`, which collects every record matching a query.

In [1]:
import json
from mdf_forge.forge import Forge

In [2]:
mdf = Forge()

## aggregate_sources - NIST XPS DB
Example: We want to collect all records from the NIST XPS Database and analyze their energy uncertainties. This database has almost 30,000 records, which is more than `search()` can return, so we have to use `aggregate_sources()`.

In [3]:
# First, let's aggregate all the nist_xps_db data.
all_entries = mdf.aggregate_sources("nist_xps_db")
print(len(all_entries))

100%|██████████| 29189/29189 [00:27<00:00, 1056.63it/s]

29189





In [4]:
# Now, let's parse out the energy_uncertainty_ev and print the results for analysis.
uncertainties = {}
for record in all_entries:
    if record["mdf"]["resource_type"] == "record":
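        # The values are strings, so default to "" (not 0): mixed int/str keys
        # cannot be sorted by json.dumps(sort_keys=True) in Python 3.
        unc = record.get("nist_xps_db_v1", {}).get("energy_uncertainty_ev", "")
        uncertainties[unc] = uncertainties.get(unc, 0) + 1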
print(json.dumps(uncertainties, sort_keys=True, indent=4, separators=(',', ': ')))

{
    "": 24344,
    "0.001": 2,
    "0.003": 2,
    "0.005": 12,
    "0.01": 27,
    "0.010": 3,
    "0.012": 2,
    "0.015": 2,
    "0.02": 159,
    "0.020": 6,
    "0.025": 10,
    "0.026": 1,
    "0.03": 122,
    "0.030": 14,
    "0.04": 56,
    "0.042": 1,
    "0.05": 416,
    "0.050": 1,
    "0.06": 12,
    "0.07": 33,
    "0.070": 1,
    "0.075": 1,
    "0.08": 14,
    "0.1": 1501,
    "0.10": 14,
    "0.100": 2,
    "0.12": 4,
    "0.13": 1,
    "0.15": 220,
    "0.17": 1,
    "0.2": 1660,
    "0.20": 2,
    "0.200": 1,
    "0.25": 24,
    "0.3": 266,
    "0.30": 4,
    "0.4": 117,
    "0.40": 4,
    "0.5": 108,
    "0.6": 9,
    "0.7": 2,
    "0.8": 4,
    "0.9": 2,
    "1.2": 1,
    "2.0": 1
}
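The tally above keys on the raw strings, so numerically equal values such as `"0.1"`, `"0.10"`, and `"0.100"` are counted separately. If the trailing zeros are not meaningful for the analysis (they may encode reported precision, so this is a judgment call), the counts can be merged by converting each key to a number. The following is a minimal sketch on a subset of the output above; `raw_counts` and `merge_equivalent_keys` are illustrative names, not part of Forge.

```python
from collections import Counter

# Subset of the tally printed above; keys are the raw strings from the records.
raw_counts = {
    "": 24344,
    "0.01": 27,
    "0.010": 3,
    "0.1": 1501,
    "0.10": 14,
    "0.100": 2,
    "0.2": 1660,
    "0.20": 2,
    "0.200": 1,
}


def merge_equivalent_keys(counts):
    """Merge keys that denote the same number; map "" (missing) to None."""
    merged = Counter()
    for key, count in counts.items():
        value = float(key) if key else None
        merged[value] += count
    return dict(merged)


merged_counts = merge_equivalent_keys(raw_counts)

# Print the missing-value bucket first, then the numeric values in ascending order.
for value, count in sorted(
    merged_counts.items(), key=lambda kv: (kv[0] is not None, kv[0] or 0.0)
):
    print(value, count)
```

With this subset, the three spellings of 0.1 collapse into one entry with 1501 + 14 + 2 = 1517 records.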


## aggregate - Multiple Datasets
Example: We want to analyze how often other elements are studied together with gallium (Ga), and what the most frequent elemental pairing is. More than 10,000 records contain gallium data, so a single `search()` cannot return them all and we use `aggregate()` instead.
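The pairing question comes down to counting, for each record that contains Ga, every other element in that record, and then taking the most frequent one. Here is a minimal sketch of that idea on made-up data; `sample_records` and `tally_partners` are hypothetical names, and the record layout only mirrors the two fields (`mdf.resource_type` and `material.elements`) that the cells in this section read.

```python
from collections import Counter

# Hypothetical records, shaped like the fields this notebook reads from results.
sample_records = [
    {"mdf": {"resource_type": "record"}, "material": {"elements": ["Ga", "As"]}},
    {"mdf": {"resource_type": "record"}, "material": {"elements": ["Ga", "N"]}},
    {"mdf": {"resource_type": "record"}, "material": {"elements": ["Al", "Ga", "As"]}},
    {"mdf": {"resource_type": "dataset"}},  # hypothetical non-record entry, skipped
]


def tally_partners(records, target="Ga"):
    """Count how often each element other than `target` appears alongside it."""
    partners = Counter()
    for record in records:
        if record["mdf"]["resource_type"] != "record":
            continue
        elems = record["material"]["elements"]
        if target in elems:
            # set() guards against an element being listed twice in one record.
            partners.update(e for e in set(elems) if e != target)
    return partners


partners = tally_partners(sample_records)
print(partners.most_common(1))  # [('As', 2)]
```

Excluding the target element matters: in a plain tally Ga appears in every record by construction, so it would always rank first.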

In [5]:
# First, let's aggregate everything that has "Ga" in the list of elements.
all_results = mdf.aggregate("material.elements:Ga")
print(len(all_results))

100%|██████████| 25582/25582 [01:03<00:00, 381.77it/s]

25582





In [6]:
# Now, let's parse out the other elements in each record and keep a running tally to print out.
elements = {}
for record in all_results:
    if record["mdf"]["resource_type"] == "record":
        elems = record["material"]["elements"]
        for elem in elems:
            elements[elem] = elements.get(elem, 0) + 1
print(json.dumps(elements, sort_keys=True, indent=4, separators=(',', ': ')))

{
    "Ac": 651,
    "Ag": 550,
    "Al": 556,
    "Ar": 2,
    "As": 1296,
    "Au": 589,
    "B": 528,
    "Ba": 670,
    "Be": 496,
    "Bi": 550,
    "Br": 52,
    "C": 85,
    "Ca": 613,
    "Cd": 562,
    "Ce": 599,
    "Cl": 75,
    "Co": 875,
    "Cr": 678,
    "Cs": 501,
    "Cu": 741,
    "Dy": 578,
    "Er": 641,
    "Eu": 561,
    "F": 105,
    "Fe": 708,
    "Ga": 25582,
    "Gd": 575,
    "Ge": 643,
    "H": 167,
    "Hf": 630,
    "Hg": 526,
    "Ho": 567,
    "I": 59,
    "In": 585,
    "Ir": 543,
    "K": 583,
    "La": 719,
    "Li": 851,
    "Lu": 513,
    "Mg": 1004,
    "Mn": 608,
    "Mo": 635,
    "N": 150,
    "Na": 722,
    "Nb": 539,
    "Nd": 573,
    "Ni": 754,
    "Np": 503,
    "O": 2514,
    "On": 6,
    "Os": 665,
    "Ox": 39,
    "P": 160,
    "Pa": 609,
    "Pb": 519,
    "Pd": 604,
    "Pm": 624,
    "Pr": 692,
    "Pt": 699,
    "Pu": 527,
    "Rb": 544,
    "Re": 496,
    "Rh": 554,
    "Ru": 533,
    "S": 186,
    "Sb": 568,
    "Sc": 680,
    "Se