Apart from `requests`, which is used to download the data, this tool is written using only the Python standard library. To manipulate the structure objects included here, you will need the `pymatgen` Python package. You can substantially speed up the JSON parsing and serialization performed here by also installing a faster JSON library such as `ujson` or `orjson`.
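
The optional speed-up mentioned above can be sketched as a small fallback shim. The names `fast_loads` and `fast_dumps` are hypothetical and are not used elsewhere in this notebook; the sketch assumes only that `orjson.loads` accepts `str` or `bytes` and that `orjson.dumps` returns `bytes`.

```python
import json

try:
    # Optional third-party dependency; not required by the rest of the notebook.
    import orjson

    def fast_loads(text):
        """Parse a JSON document from str or bytes using orjson."""
        return orjson.loads(text)

    def fast_dumps(obj):
        """Serialize to a str; orjson.dumps returns bytes, so decode it."""
        return orjson.dumps(obj).decode("utf-8")

except ImportError:
    # Fall back to the standard library when orjson is not installed.
    fast_loads = json.loads
    fast_dumps = json.dumps


# Round-trip a toy document (hypothetical data) through whichever backend is active.
toy_doc = {"provenance": {"original_mp_id": "mp-1"}, "energy": -1.5}
round_tripped = fast_loads(fast_dumps(toy_doc))
```

If you adopt a shim like this, the `json.loads` and `json.dumps` calls in the cells below could be swapped for `fast_loads` and `fast_dumps`; the two backends may format numbers and whitespace slightly differently.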

In [None]:
import requests
import gzip
import json

The first step is to download the datasets: MP-ALOE is hosted on FigShare, and MatPES on Amazon S3 OpenData.

In [None]:
for dataset_file, url in {
    "MP-ALOE-2025.jsonl.gz": "https://figshare.com/ndownloader/files/55909331",
    "MatPES-R2SCAN-2025.1.json.gz": "https://s3.us-east-1.amazonaws.com/materialsproject-contribs/MatPES_2025_1/MatPES-R2SCAN-2025.1.json.gz",
}.items():
    resp = requests.get(url)
    if resp.status_code != 200:
        raise ValueError(
            f"Unable to download {url} (HTTP {resp.status_code}); "
            "please try again shortly."
        )
    # The full response body is held in memory before being written to disk.
    with open(dataset_file, "wb") as f:
        f.write(resp.content)
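
Because both files are large, it may be preferable to stream each download to disk and to skip files that are already present. The following is a minimal sketch of that idea using `urllib.request` from the standard library instead of `requests`. The helper name `download_if_missing` and its parameters are hypothetical, and the sketch has not been run against the hosts above.

```python
import shutil
import urllib.request
from pathlib import Path


def download_if_missing(url, dest, timeout=60, chunk_size=1024 * 1024):
    """Stream `url` to `dest` unless a non-empty file already exists there."""
    dest = Path(dest)
    if dest.exists() and dest.stat().st_size > 0:
        return dest  # reuse the earlier download

    # Write under a temporary name so an interrupted download never looks complete.
    partial = dest.with_name(dest.name + ".part")
    # urlopen raises urllib.error.HTTPError for 4xx/5xx responses.
    with urllib.request.urlopen(url, timeout=timeout) as resp, open(partial, "wb") as f:
        # Copy in fixed-size chunks rather than reading the whole body into memory.
        shutil.copyfileobj(resp, f, length=chunk_size)
    partial.replace(dest)
    return dest
```

Calling `download_if_missing(url, dataset_file)` inside the loop over filenames and URLs would then take the place of the `requests.get` call and the explicit file write.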

We load the MatPES dataset first, so that we can determine which MP materials it includes.

In [None]:
with gzip.open("MatPES-R2SCAN-2025.1.json.gz", "rt") as f:
    matpes_r2scan = json.load(f)

In [None]:
matpes_mpids = {
    matpes_doc["provenance"]["original_mp_id"]
    for matpes_doc in matpes_r2scan
    if matpes_doc["provenance"]["original_mp_id"]
}

We now load the MP-ALOE dataset, retaining only those structures whose origin is not one of the MP structures already in MatPES.

In [None]:
mp_aloe = []
with gzip.open("MP-ALOE-2025.jsonl.gz", "rt") as f:
    for line in f:
        aloe_doc = json.loads(line)
        if aloe_doc["provenance"]["original_mp_id"] in matpes_mpids:
            continue
        mp_aloe.append(aloe_doc)
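
The filtering step above can be checked on toy data before running it on the full file. This sketch wraps the same logic in a hypothetical generator, `iter_docs_not_in`, and uses made-up documents that carry only a `provenance` entry. It assumes, as the loop above does, that each line is one JSON document; unlike the loop above, it also tolerates a missing `provenance` key.

```python
import gzip
import json
import os
import tempfile


def iter_docs_not_in(path, excluded_ids):
    """Yield documents from a gzipped JSON Lines file whose original_mp_id is not excluded."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            doc = json.loads(line)
            # Tolerate a missing or null provenance entry (an assumption about the schema).
            mp_id = (doc.get("provenance") or {}).get("original_mp_id")
            if mp_id in excluded_ids:
                continue
            yield doc


# Toy documents (hypothetical IDs): one overlaps with the excluded set, two do not.
toy_docs = [
    {"provenance": {"original_mp_id": "mp-1"}},
    {"provenance": {"original_mp_id": "mp-2"}},
    {"provenance": {"original_mp_id": None}},
]

with tempfile.TemporaryDirectory() as tmp:
    toy_path = os.path.join(tmp, "toy.jsonl.gz")
    with gzip.open(toy_path, "wt", encoding="utf-8") as f:
        for toy_doc in toy_docs:
            f.write(json.dumps(toy_doc) + "\n")
    kept = list(iter_docs_not_in(toy_path, {"mp-1"}))

kept_ids = [doc["provenance"]["original_mp_id"] for doc in kept]
```

Documents with no MP origin are kept, which matches the loop above: the set of MatPES IDs never contains an empty value, so such documents can never match it.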

We can now write the combined dataset to a gzipped JSON Lines file, with one document per line.

In [None]:
with gzip.open("MP-ALOE-MATPES-R2SCAN-2025.jsonl.gz", "wt") as f:
    for doc in matpes_r2scan + mp_aloe:
        f.write(json.dumps(doc) + "\n")
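
As a quick sanity check on a file written this way, the number of lines can be counted without parsing any JSON. For the combined file, that count should equal `len(matpes_r2scan) + len(mp_aloe)`. The helper name `count_jsonl_records` is hypothetical, and the sketch below exercises it on a small toy file only.

```python
import gzip
import json
import os
import tempfile


def count_jsonl_records(path):
    """Count the non-empty lines in a gzipped JSON Lines file."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return sum(1 for line in f if line.strip())


# Write three toy documents (made-up content) and count them back.
with tempfile.TemporaryDirectory() as tmp:
    toy_path = os.path.join(tmp, "toy.jsonl.gz")
    with gzip.open(toy_path, "wt", encoding="utf-8") as f:
        for i in range(3):
            f.write(json.dumps({"index": i}) + "\n")
    toy_count = count_jsonl_records(toy_path)
```

This relies on `json.dumps` never emitting raw newlines inside a document, which holds for its default settings (no `indent` argument).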

To read the combined dataset back in, use `pymatgen` to parse the structures:

In [None]:
from pymatgen.core import Structure

combined_dataset = []
with gzip.open("MP-ALOE-MATPES-R2SCAN-2025.jsonl.gz", "rt") as f:
    for line in f:
        doc = json.loads(line)
        doc["structure"] = Structure.from_dict(doc["structure"])
        combined_dataset.append(doc)

In [None]:
len(combined_dataset)