### Requirements:

In [None]:
!pip install "mp_api>=0.45.13" "mpcontribs-client>=5.10.4" "pandas>=2.3.0"

## Programmatically uploading data to MPContribs using the Python client

This guide will walk you through uploading a dataset of experimentally-determined properties of a handful of cubic solids from the following paper:

> F. Tran, J. Stelzl, and P. Blaha, J. Chem. Phys., vol. 144, p. 204120, 2016, DOI: [10.1063/1.4948636](https://doi.org/10.1063/1.4948636)

(This data has been parsed manually and is redistributed without claims of copyright.)

We will upload this data to the MPContribs project [`test_solid_data`](https://next-gen.materialsproject.org/contribs/projects/test_solid_data).

Let's start by inspecting the data. It is semi-structured JSON, with a `data` field and `metadata` field:

In [None]:
import gzip
import json

with gzip.open("cubic_solid_expt_data.json.gz", "rt") as f:
    user_data = json.load(f)

The `data` subset of `user_data` contains a list of entries, each with the following fields:
- `formula` (str) : The chemical formula of the entry
- `a0` (float) : The cubic lattice constant of the structure in the entry
- `b0` (float) : The bulk modulus of the solid
- `e0` (float) : The cohesive energy of the solid
- `cif` (str) : A crystallographic information file (CIF) representation of the structure in the entry

Fortunately for us, the `data` field is structured to be readable with `pandas`. While you don't strictly need to use `pandas` to upload data to MPContribs, it is well suited to the columnar format that MPContribs uses.

In [None]:
import pandas as pd

pd.DataFrame(user_data["data"])

The `metadata` subset contains `citation` information, the `units` of the fields in `data`, and their `long_names` / descriptions. These are the basic metadata/provenance information required by MPContribs. Note that a preprint is also an acceptable citation.

In [None]:
for k, v in user_data["metadata"].items():
    print(k, v)
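
Before uploading, it can help to confirm that every data column has both a unit and a description. The following is a minimal sketch on a toy dictionary that mimics the `data` / `metadata` layout described above. The entry values, unit strings, and descriptions are placeholders rather than the contents of the actual file, and `missing_metadata` (including its `skip` argument) is a hypothetical helper, not part of any client.

```python
# Toy stand-in for `user_data`; every value here is a placeholder for illustration.
toy_user_data = {
    "data": [
        {"formula": "X", "a0": 1.0, "b0": 2.0, "e0": 3.0, "cif": "..."},
    ],
    "metadata": {
        "units": {"a0": "angstrom", "b0": "GPa", "e0": "eV"},
        "long_names": {
            "a0": "Cubic lattice constant",
            "b0": "Bulk modulus",
            "e0": "Cohesive energy",
        },
    },
}


def missing_metadata(
    user_data: dict, skip: tuple[str, ...] = ("formula", "cif")
) -> dict[str, set[str]]:
    """Return the data fields that lack a unit or a long name (hypothetical helper)."""
    # Collect every field name used by any entry, ignoring the skipped ones.
    fields = {k for entry in user_data["data"] for k in entry} - set(skip)
    meta = user_data["metadata"]
    return {
        "units": fields - set(meta["units"]),
        "long_names": fields - set(meta["long_names"]),
    }


print(missing_metadata(toy_user_data))
```

Empty sets mean that every column is covered. Whether `formula` and `cif` need units in your own dataset is an assumption to adjust through `skip`.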

Column names in MPContribs can contain only alphanumeric characters, so underscores are not allowed. The `snake_to_camel` function defined below converts snake-case column names (e.g., `some_data_to_upload`) to camel case (`someDataToUpload`).

Let's start by creating the project. If your `MP_API_KEY` is set as an environment variable, you don't need to call `MPRester` with any kwargs. Otherwise, pass an `api_key = <str>` kwarg to it.

In [None]:
from mp_api.client import MPRester
from mpcontribs.client import Client as ContribsClient


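def snake_to_camel(in_str: str) -> str:
    """Convert a snake_case name to camelCase, e.g. `num_sites` -> `numSites`."""
    # Lowercase the input, then capitalize the first letter of every segment but the first.
    return "".join(
        (s[0].upper() if i > 0 else s[0]) + s[1:]
        for i, s in enumerate(in_str.lower().split("_"))
    )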


PROJECT_NAME = "test_solid_data"
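
To see what `snake_to_camel` does to the column names used in this guide, here is a small standalone check. The function body is copied from the cell above so that the snippet runs on its own; the list of names is only an illustration.

```python
def snake_to_camel(in_str: str) -> str:
    # Same logic as the helper defined above, repeated so this snippet is self-contained.
    return "".join(
        (s[0].upper() if i > 0 else s[0]) + s[1:]
        for i, s in enumerate(in_str.lower().split("_"))
    )


for name in ("some_data_to_upload", "num_sites", "lattice.a", "symmetry.number"):
    print(f"{name!r} -> {snake_to_camel(name)!r}")
```

Two properties are worth keeping in mind: the input is lowercased first, so existing capital letters are lost, and dots are left untouched, so nested names such as `lattice.a` pass through unchanged. Names with consecutive, leading, or trailing underscores produce an empty segment and raise an `IndexError` with this implementation.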

In [None]:
with (
    MPRester(
        # api_key= # use as needed, snake_case `api_key`
    ) as mpr
):
    mpr.contribs.create_project(
        name=PROJECT_NAME,
        title="Example MPContribs entry",  # can be anything with standard characters
        authors=", ".join(user_data["metadata"]["citation"]["authors"]),
        description=(  # again can be anything, could be a paper/preprint abstract
            "Experimental data on cubic solid state geometries, elastic, and energetic properties with zero-point corrections."
        ),
        url=user_data["metadata"]["citation"]["url"],
    )

In [None]:
client = ContribsClient(
    project=PROJECT_NAME,
    apikey=MPRester().api_key,  # Note that this is one word, not snake case as before
)

Suppose now that you want to update the project metadata, for example, adding more relevant links to the data / code used to process it. Those simple updates are handled with `ContribsClient.update_project`:

In [None]:
client.update_project(
    {
        "references": [
            {"label": "doi", "url": user_data["metadata"]["citation"]["url"]},
            {
                "label": "github",
                "url": "https://github.com/esoteric-ephemera/mpcontribs-example/",
            },
        ]
    }
)

Now we can actually start annotating and uploading data! The first step is specifying the columns and their long descriptions (`metadata`). Note that the MPContribs client can interpret units/dimensions from [`pint`](https://github.com/hgrecco/pint). Most common scientific units are handled there, such as angstrom, as are units defined by fundamental constants like the elementary charge `e`.

When specifying units, any values which are text only should be given units of `None` type. Any numeric or boolean data which is dimensionless, e.g., the number of sites in a structure (`num_sites` below), should be given an empty string (`""`, equivalently `str()`) as its unit. All other numeric data can use units as needed.

Also note that fields can be nested using dot notation. The `lattice.a`, `lattice.b`, etc. examples below will nest an `a` and `b` field under a lattice super-heading, as will the `symmetry.number` and `symmetry.symbol` under a `symmetry` super-heading.
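
To make the dot notation concrete, the sketch below expands dot-separated keys into nested dictionaries. It is only an illustration of the naming convention in plain Python: `nest` is a hypothetical helper, and this is not how the MPContribs client or server is implemented.

```python
def nest(flat: dict) -> dict:
    """Expand dot-separated keys into nested dictionaries (illustrative only)."""
    nested: dict = {}
    for key, value in flat.items():
        # Everything before the last dot is a chain of super-headings.
        *parents, leaf = key.split(".")
        node = nested
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = value
    return nested


flat_columns = {
    "lattice.a": "angstrom",
    "lattice.b": "angstrom",
    "lattice.alpha": "degree",
}
print(nest(flat_columns))
```

All three keys end up under a single `lattice` heading, which mirrors how the nested columns are grouped.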

In [None]:
_columns_with_units = {
    **user_data["metadata"]["units"],
    **{f"lattice.{k}": "angstrom" for k in ("a", "b", "c")},
    **{f"lattice.{k}": "degree" for k in ("alpha", "beta", "gamma")},
    "num_sites": "",  # Dimensionless numeric or boolean data uses an empty string as its unit
    "symmetry.number": "",
    "symmetry.symbol": None,  # Text-only data uses `None` as its unit
}

columns_with_units = {snake_to_camel(k): v for k, v in _columns_with_units.items()}

column_descriptions = {
    snake_to_camel(k): v
    for k, v in {
        **{
            k: f"{v}, in {user_data['metadata']['units'][k]}"
            for k, v in user_data["metadata"]["long_names"].items()
        },
        **{
            f"lattice.{k}": f"The lattice {k} parameter in angstrom"
            for k in ("a", "b", "c")
        },
        **{
            f"lattice.{k}": f"The lattice {k} angle in degrees"
            for k in ("alpha", "beta", "gamma")
        },
        "num_sites": "The number of sites in the primitive cell",
        "symmetry.number": "The international space group number",
        "symmetry.symbol": "The international space group symbol",
    }.items()
}

All contribs entries require an identifier field, which can be arbitrary (UUID, integer, etc.) and non-unique. In this example, we will use the Materials Project ID (MPID) of a corresponding matched structure. Using MPIDs will actually permit dynamic linking of data between MPContribs and the core MP data. The code below just sets up the columns.

<details>
<summary>
The links to MPIDs were pre-built as follows:
</summary>

```python
from mp_api.client import MPRester
from pymatgen.core import Structure
from emmet.core.mpid import AlphaID
from pymatgen.analysis.structure_matcher import StructureMatcher
import gzip
import json

with gzip.open("cubic_solid_expt_data.json.gz","rt") as f:
    user_data = json.load(f)
    
matcher = StructureMatcher()
with MPRester() as mpr:
    for i, entry in enumerate(user_data["data"]):
        s = Structure.from_str(entry["cif"],fmt="cif")
        mp_summary = mpr.materials.summary.search(formula=s.composition.reduced_formula)
        for doc in mp_summary:
            if matcher.fit(doc.structure,s):
                user_data["data"][i]["material_id"] = AlphaID(doc.material_id).formatted
                break

with gzip.open("cubic_solid_expt_data.json.gz","wt") as f:
    json.dump(user_data, f)

```

</details>

In [None]:
client.init_columns(columns_with_units)
client.update_project({"other": column_descriptions})

Let's first simulate running some basic analysis on our data as a `pandas.DataFrame`:

In [None]:
from pymatgen.core import Structure
import pandas as pd

data_upload = pd.DataFrame(user_data["data"])
data_upload["structure"] = data_upload.cif.apply(
    lambda x: Structure.from_str(x, fmt="cif").to_primitive()
)

for k in ("a", "b", "c", "alpha", "beta", "gamma"):
    data_upload[f"lattice.{k}"] = data_upload.structure.apply(
        lambda x: getattr(x.lattice, k)
    )

data_upload["num_sites"] = data_upload.structure.apply(lambda x: len(x))

for i, k in enumerate(
    [
        "symbol",
        "number",
    ]
):
    data_upload[f"symmetry.{k}"] = data_upload.structure.apply(
        lambda x: x.get_space_group_info()[i]
    )

Now we put the data in a format that the MPContribs client can parse. This should be a list of dicts of the form:
```python
contribs_format = [
    {
        "data": {
            ... # data for a single entry, basically one row in the table
        },
        "identifier": ..., # required 
        "project": ... , # required
        "formula": ..., # strongly recommended
        "structures": [], # optional, list of `pymatgen.core.structure.Structure` objects, see below.
        "tables": [], # optional, list of `pandas.DataFrame`-like objects
        "attachments": [], # optional list of small, non-columnar data objects, see note below
    }
]
```

<b><i><span style="color: red;">All data in `data` must include units which match those in `columns_with_units` for those units to be parsed correctly.</span></i></b>

Note that complex objects beyond standard JSON (str, bool, int, and float) should not be put in the `data` dict. There is a separate field for uploading `pymatgen.core.structure.Structure` data: `structures` above.
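
As a minimal sketch of the format, the snippet below builds one contribution from a toy row, attaching the unit to each value as a string and checking that the required keys are present. The row values, the identifier, and the helper `format_quantity` are placeholders introduced for illustration; only the key names come from the format described above.

```python
def format_quantity(value, unit) -> str:
    """Render a value with its unit; use the bare value when there is no unit."""
    return f"{value} {unit}" if unit else str(value)


REQUIRED_KEYS = {"identifier", "project", "data"}

# Placeholder units and row, for illustration only.
# Columns without a unit carry either an empty string or None.
toy_units = {"a0": "angstrom", "numSites": "", "symmetry.symbol": None}
toy_row = {"a0": 1.0, "numSites": 2, "symmetry.symbol": "P1"}

contribution = {
    "identifier": "example-id",  # hypothetical identifier
    "project": "test_solid_data",
    "data": {k: format_quantity(toy_row[k], unit) for k, unit in toy_units.items()},
}

# Fail early if a required top-level key is missing.
missing = REQUIRED_KEYS - contribution.keys()
assert not missing, f"missing required keys: {missing}"
print(contribution["data"])
```

Formatting values without a unit as bare strings avoids the trailing space that a plain `f"{value} {unit}"` would leave behind.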

<details>
<summary>
Simple array/matrix/tensorial data can be input as dot notation, e.g., a 2x2 array as:
</summary>

```python

a = [
    [1., 2.],
    [3., 4.]
] # original

contribs_format = [
    {
        "data": {
            f"a.{i}{j}": a[i][j] for j in range(2) for i in range(2)
        },
        ...
    }
]
```

</details>
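
A runnable version of the same idea, generalized to any rectangular 2D nested list, might look like the following. `flatten_array` is a hypothetical helper written for this guide, not part of the MPContribs client.

```python
import itertools


def flatten_array(name: str, array: list[list[float]]) -> dict[str, float]:
    """Map each element of a 2D nested list to a `name.ij` key (illustrative helper)."""
    n_rows, n_cols = len(array), len(array[0])
    return {
        f"{name}.{i}{j}": array[i][j]
        for i, j in itertools.product(range(n_rows), range(n_cols))
    }


a = [
    [1.0, 2.0],
    [3.0, 4.0],
]
print(flatten_array("a", a))
```

The concatenated `ij` index is only unambiguous while both dimensions are smaller than 10; for larger arrays a separator between the indices would be needed.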

Smaller data objects can also optionally be included as [tables (columnar data) or attachments (non-columnar, like JPEGs)](https://docs.materialsproject.org/uploading-data/what-is-mpcontribs); however, it may be more performant to upload these objects to AWS separately.

In [None]:
contribs_format = [
    {
        "data": {
            snake_to_camel(k): f"{getattr(row, k)} {unit if unit else ''}"
            for k, unit in _columns_with_units.items()
        },
        "identifier": row.material_id,
        "project": PROJECT_NAME,
        "formula": row.structure.formula,
        "structures": [row.structure],
    }
    for _, row in data_upload.iterrows()
]

Now we're ready to upload our data! You can upload up to 500 entries (rows) of a dataset before it requires approval from MP admins.

<b><i>For uploading, querying, or deleting very large datasets (>10,000 rows), you may want to explicitly pass the `timeout = <time in sec>` kwarg to many of the functions below to avoid issues related to MongoDB timeouts.</i></b>

In [None]:
client.submit_contributions(contribs_format)  # timeout = ... can be passed here

Suppose you made a mistake and need to remove one or more entries. Calling `delete_contributions` with no arguments, as below, removes all contributions in the project.

To remove only a subset of the data, pass a `pymongo`-syntax dict query to `delete_contributions`.

In [None]:
client.delete_contributions()  # timeout = ... can be passed here

When you've received approval from MP admins, you can make your project public:

In [None]:
client.init_columns()
client.make_public()

And if you change your mind, just make it private again:

In [None]:
client.make_private()

<b><i><span style="color: red;">DANGER ZONE:</span> to start from scratch, you can delete your project!</i></b>

In [None]:
client.delete_project()