## Requirements

In [None]:
!pip install "pandas>=2.3.0" "pyarrow>=21.0.0" "boto3>=1.40.0" "python-dotenv>=1.0.0"

## Making a large-data download from OpenData

To upload larger data objects, such as raw data, to MPContribs, you will need to obtain IAM credentials from MP staff. You should also think carefully about how you structure your data so that it is amenable to cloud storage.
Formats like `parquet` for columnar data and `zarr` for hierarchical data support partial retrieval, so a user can read part of a dataset without downloading all of it.
`parquet` also makes it easy to filter rows by column values at read time.

In this example, we will simulate using a `.env` file to store AWS credentials:
```bash
>>> ~/.env
aws_access_key_id=abcdefg...
aws_secret_access_key=abcdefg...
```

Next, we convert the gzipped JSON data file to a single parquet file for download:

In [None]:
import gzip
import json
import pandas as pd
import pyarrow as pa

OPENDATA_BUCKET = "materialsproject-contribs"
PROJECT_NAME = "test_solid_data"

with gzip.open("cubic_solid_expt_data.json.gz", "rt") as f:
    user_data = json.load(f)

data = pd.DataFrame(user_data["data"])
arrow_table = pa.Table.from_pandas(data)

As a reasonable best practice, we will embed the citation and column metadata in the `arrow` table schema. Schema metadata must be a flat dict whose keys and values are strings (or bytes). This means we must flatten the nested `metadata` dict of the dataset into `key.sub_key` entries and serialize each value to a JSON string:

In [None]:
import pyarrow.parquet as pq

pq.write_table(
    arrow_table.replace_schema_metadata(
        metadata={
            **arrow_table.schema.metadata,
            **{
                f"{key}.{sub_key}": json.dumps(v)
                for key, vals in user_data["metadata"].items()
                for sub_key, v in vals.items()
            },
        }
    ),
    "solid_data.parquet",
)
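
A user who downloads the file will need to reverse this flattening to recover the nested metadata. The following is a minimal sketch of one way to do that; `unflatten_metadata` is a hypothetical helper, not part of `pyarrow`, and it assumes that top-level metadata keys contain no `.` characters. The example metadata is also made up for illustration.

```python
import json


def unflatten_metadata(raw: dict, skip: tuple = ("pandas",)) -> dict:
    """Rebuild a nested dict from `key.sub_key` -> JSON-string entries."""
    nested: dict = {}
    for raw_key, raw_value in raw.items():
        # pyarrow returns schema metadata as bytes; accept str as well.
        key = raw_key.decode() if isinstance(raw_key, bytes) else raw_key
        value = raw_value.decode() if isinstance(raw_value, bytes) else raw_value
        # Skip the entry pandas adds on its own, and anything not flattened.
        if key in skip or "." not in key:
            continue
        top, sub = key.split(".", 1)
        nested.setdefault(top, {})[sub] = json.loads(value)
    return nested


# Hypothetical metadata, flattened the same way as in the cell above.
example = {
    "citation": {"title": "Example dataset"},
    "columns": {"a0": {"unit": "angstrom"}},
}
flat = {
    f"{key}.{sub_key}".encode(): json.dumps(v).encode()
    for key, vals in example.items()
    for sub_key, v in vals.items()
}
flat[b"pandas"] = b"{}"  # stand-in for the metadata pandas stores itself

restored = unflatten_metadata(flat)
print(restored)
```

With `pyarrow` installed, the input dict could come from `pq.read_schema("solid_data.parquet").metadata`, which reads only the schema rather than the full table.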

The following convenience functions upload a file to, or delete a file from, the AWS S3 OpenData bucket.

In [None]:
import boto3
from dotenv import load_dotenv
from pathlib import Path
import os


def get_s3_client():
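    """Start an S3 client from credentials stored in a `.env` file.

    In a notebook, `load_dotenv()` searches the working directory and its
    parents, so `~/.env` is found when running from under the home directory.
    """
    load_dotenv()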
    return boto3.client(
        "s3",
        **{k: os.environ[k] for k in ("aws_access_key_id", "aws_secret_access_key")},
    )


def upload_single_file_to_aws(
    in_path: str | Path,
    out_path: str | Path | None = None,
) -> None:
    """Convenience function to upload a file to S3.

    Parameters
    ----------
    in_path : str | Path
        The path to the file to be uploaded.
    out_path : str | Path | None
        If a str or Path, the full key (prefix and file name) to upload to.
        If None (default), the key is `{PROJECT_NAME}/{file name of in_path}`,
        i.e. `s3://materialsproject-contribs/{PROJECT_NAME}/{file name}`.
    """
    # Build the default key with "/" so it is a valid S3 key on any OS.
    out_path = str(out_path or f"{PROJECT_NAME}/{Path(in_path).name}")
    with open(str(in_path), "rb") as f:
        get_s3_client().upload_fileobj(f, Bucket=OPENDATA_BUCKET, Key=out_path)


def delete_single_file_in_aws(in_path: str | Path) -> None:
    """Remove the key `{PROJECT_NAME}/{file name of in_path}` from your OpenData bucket."""
    # Only the file name of `in_path` is used; keys always use "/" separators.
    get_s3_client().delete_object(
        Bucket=OPENDATA_BUCKET, Key=f"{PROJECT_NAME}/{Path(in_path).name}"
    )

Now we simply upload the parquet file we created earlier:

In [None]:
upload_single_file_to_aws("solid_data.parquet")