# Aggregating data with MDF

Searches using `Forge.search()` are limited to 10,000 results. Two methods get around this limit: `Forge.aggregate_source()`, which retrieves every entry from a named source, and `Forge.aggregate()`, which retrieves every entry matching a query.
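As a rough illustration of why a per-query cap need not limit the total, the sketch below collects a toy in-memory dataset by issuing several queries that each stay under the cap. Everything in it (`capped_search`, `aggregate_all`, `CAP`, the integer IDs) is hypothetical; it is not the Forge API and does not claim to show how the aggregate methods work internally.

```python
CAP = 10                 # stand-in for the 10,000-result limit, scaled down
DATA = list(range(35))   # toy "index" of 35 records identified by integer ID


def capped_search(lo, hi):
    """Return records with lo <= id < hi, truncated to at most CAP results."""
    return [r for r in DATA if lo <= r < hi][:CAP]


def aggregate_all(total, step=CAP):
    """Collect every record by issuing one capped query per ID range."""
    results = []
    for lo in range(0, total, step):
        results.extend(capped_search(lo, lo + step))
    return results


single = capped_search(0, len(DATA))    # one query: truncated at the cap
everything = aggregate_all(len(DATA))   # many small queries: complete
print(len(single), len(everything))
```

A single query returns only `CAP` records, while the range-by-range loop recovers all of them.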

In [2]:
import json
from mdf_forge.forge import Forge

In [3]:
mdf = Forge()

Error: Unable to create Transfer client (invalid_grant).
Error: Unable to create Search client (invalid_grant).
Error: Unable to create MDF Authorizer (invalid_grant).


## aggregate_source - NIST XPS DB
Example: We want to collect all records from the NIST XPS Database and analyze the quality metrics. This database has almost 30,000 records, well over the 10,000-result search limit, so we have to use `aggregate_source()`.

In [3]:
# First, let's aggregate all the nist_xps_db data.
all_entries = mdf.aggregate_source("nist_xps_db")
print(len(all_entries))

100%|██████████| 29189/29189 [00:27<00:00, 1067.83it/s]

29189

In [4]:
# Now, let's parse out the "Quality of Data" and print the results for analysis.
qualities = {}
for record in all_entries:
    if record["mdf"]["resource_type"] == "record":
        raw = json.loads(record["mdf"]["raw"])
        # Count each distinct "Quality of Data" value, starting from 0 if unseen.
        quality = raw["Quality of Data"]
        qualities[quality] = qualities.get(quality, 0) + 1
print(qualities)

{'': 15940, 'good': 4, 'Good': 1615, 'Adequate': 11630}
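The output above counts `'good'` and `'Good'` separately. A minimal sketch of one way to merge labels that differ only in case, using the counts printed above; treating the empty string as "unrated" is an assumption made here for illustration, not something the database documents.

```python
from collections import Counter

# Counts copied from the output above.
qualities = {'': 15940, 'good': 4, 'Good': 1615, 'Adequate': 11630}

normalized = Counter()
for label, count in qualities.items():
    # Assumption: an empty label means no quality rating was recorded.
    key = label.strip().lower() or "unrated"
    normalized[key] += count

print(dict(normalized))
print(sum(normalized.values()))
```

The total of the merged counts is 29189, which matches the number of entries returned by `aggregate_source()`.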


## aggregate - Multiple Datasets
Example: We want to analyze how often other elements are studied together with gallium (Ga), and which element is paired with it most frequently. More than 10,000 records contain gallium, so `Forge.search()` alone would not return all of them.
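Before running this on real data, the pairing question can be sketched on a few made-up records. The `sample_records` list is hypothetical and only mimics the `mdf` fields used in this notebook (`resource_type`, `elements`); `collections.Counter` is one convenient way to keep the tally and pick the most common partner.

```python
from collections import Counter

# Hypothetical records shaped like the "mdf" block used in this notebook.
sample_records = [
    {"mdf": {"resource_type": "record", "elements": ["Ga", "As"]}},
    {"mdf": {"resource_type": "record", "elements": ["Ga", "N"]}},
    {"mdf": {"resource_type": "record", "elements": ["Ga", "As", "Al"]}},
    {"mdf": {"resource_type": "dataset", "elements": ["Ga", "N"]}},
]

partners = Counter()
for record in sample_records:
    # Only count individual records, not dataset-level entries.
    if record["mdf"]["resource_type"] == "record":
        # Skip Ga itself so the tally only holds its partners.
        partners.update(e for e in record["mdf"]["elements"] if e != "Ga")

top_element, top_count = partners.most_common(1)[0]
print(top_element, top_count)
```

Excluding Ga from the tally matters because every matching record contains it, so it would otherwise always rank first.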

In [5]:
# First, let's aggregate everything that has "Ga" in the list of elements.
all_results = mdf.aggregate("mdf.elements:Ga")
print(len(all_results))

100%|██████████| 29168/29168 [01:07<00:00, 298.34it/s]

29168

In [6]:
# Now, let's parse out the other elements in each record and keep a running tally to print out.
elements = {}
for record in all_results:
    if record["mdf"]["resource_type"] == "record":
        elems = record["mdf"]["elements"]
        for elem in elems:
            # Increment the tally for this element, starting from 0 if unseen.
            elements[elem] = elements.get(elem, 0) + 1
print(json.dumps(elements, sort_keys=True, indent=4, separators=(',', ': ')))

{
    "Ac": 651,
    "Ag": 588,
    "Al": 576,
    "Ar": 2,
    "As": 1330,
    "Au": 649,
    "B": 681,
    "Ba": 802,
    "Be": 496,
    "Bi": 583,
    "Br": 127,
    "C": 1843,
    "Ca": 682,
    "Cd": 581,
    "Ce": 639,
    "Cl": 672,
    "Co": 954,
    "Cr": 712,
    "Cs": 552,
    "Cu": 817,
    "Dy": 600,
    "Er": 670,
    "Eu": 614,
    "F": 344,
    "Fe": 793,
    "Ga": 29168,
    "Gd": 612,
    "Ge": 776,
    "H": 1933,
    "Hf": 659,
    "Hg": 534,
    "Ho": 598,
    "I": 187,
    "In": 603,
    "Ir": 557,
    "K": 698,
    "La": 960,
    "Li": 990,
    "Lu": 527,
    "Mg": 955,
    "Mn": 715,
    "Mo": 606,
    "N": 1402,
    "Na": 840,
    "Nb": 564,
    "Nd": 616,
    "Ni": 836,
    "Np": 506,
    "O": 4031,
    "Os": 668,
    "P": 676,
    "Pa": 607,
    "Pb": 544,
    "Pd": 653,
    "Pm": 624,
    "Pr": 714,
    "Pt": 743,
    "Pu": 535,
    "Rb": 588,
    "Re": 498,
    "Rh": 584,
    "Ru": 562,
    "S": 512,
    "Sb": 632,
    "Sc": 715,
    "Se": 374,
    "Si": 107