# MDF Connect Client Tutorial

In [1]:
from mdf_connect_client import MDFConnectClient

## Table of contents

- [Overview/instantiation](#MDF-Connect-Client)
- [Mandatory inputs](#Mandatory-inputs)
- [Recommended inputs](#Recommended-inputs)
- [Optional inputs](#Optional-inputs)
- [Submitting a dataset](#Submitting-a-dataset)
- [Checking submission status](#Checking-submission-status)

## MDF Connect Client
The MDF Connect Client (`MDFConnectClient`) is a class designed to help you submit datasets to MDF Connect using Python. When you instantiate the Client, it will attempt to authenticate you with Globus automatically. You cannot use MDF Connect anonymously.

Note: While you can access and modify the internal variables of a client (for example, `mdfcc.data`), it is recommended that you instead only use the helper functions. This tutorial accesses those variables only for display purposes.

**IMPORTANT**: To submit data to MDF Connect, you must have an account recognized by Globus Auth (including Google, ORCiD, many academic institutions, or a [free Globus ID](https://www.globusid.org/create)). Additionally, you must be in the [MDF Connect Convert](https://www.globus.org/app/groups/cc192dca-3751-11e8-90c1-0a7c735d220a/about) Globus Group.

### MDFCC Constructor (`MDFConnectClient`)
It is recommended that you use the helper functions (detailed below) to construct your MDF Connect submission. However, if you have already assembled all or part of your submission, you can pre-load your client with the appropriate metadata.

#### Recommended arguments:
- `test` (boolean): When `True`, enables test mode. When `False`, disables test mode. For more information about test mode, see `set_test()`.

#### Optional arguments (if you have already prepared metadata):
- `dc` (dictionary): The DataCite block.
- `mdf` (dictionary): The MDF block.
- `mrr` (dictionary): The MRR block.
- `custom` (dictionary): The `__custom` block.
- `data` (list): The list of data locations.
- `index` (dictionary): The special indexing instructions.
- `services` (dictionary): The requested service integrations.

Note: If you have the entire submission already prepared, you can skip to [Submitting a dataset](#Submitting-a-dataset) and pass your submission directly to `submit_dataset()`.

#### Advanced optional arguments (developer use only):
- `service_instance` (string): The instance of the MDF Connect service to use. Normal users should not alter this value.
- `authorizer` (Authorizer): A valid, authenticated Authorizer from the Globus SDK. Normal users do not need to change this value. Accepted Authorizers are:
    - globus_sdk.RefreshTokenAuthorizer
    - globus_sdk.ClientCredentialsAuthorizer
    - globus_sdk.NullAuthorizer (This Authorizer will fail authentication.)

The advanced arguments may cause issues, and are not recommended for normal users.

In [2]:
mdfcc = MDFConnectClient(test=True)

## Mandatory inputs

- [`create_dc_block`](#create_dc_block)
- [`add_data`](#add_data)

### `create_dc_block`
The `dc` (DataCite) block is mandatory for all submissions. `create_dc_block()` helps create the `dc` block for you.

#### Required arguments:
- `title` (string): The title of the dataset.
- `authors` (string or list of strings): The authors of the dataset, in one of these forms:
    - "Givenname Familyname"
    - "Familyname, Givenname"
    - "Familyname; Givenname"

#### Arguments with defaults:
- `publisher` (string): The publisher of this dataset (*not* an associated paper). The default is "Materials Data Facility".
- `publication_year` (integer or string): The year the dataset was published. The default is the current year.
- `resource_type` (string): The type of resource. The default is "Dataset". Leave it as the default unless you know that your submission needs a different value.
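
As a rough illustration of how these defaults could end up in the `dc` block, here is a minimal sketch. `dc_defaults` is a hypothetical helper, not part of the client, and the output keys are copied from the example outputs shown in this tutorial. Keeping `resourceTypeGeneral` fixed at "Dataset" is an assumption, since only the default case appears here.

```python
from datetime import date


def dc_defaults(publisher="Materials Data Facility",
                publication_year=None,
                resource_type="Dataset"):
    """Return the DataCite fields that have defaults.

    Hypothetical helper for illustration; not part of mdf_connect_client.
    """
    if publication_year is None:
        # The default publication year is the current year.
        publication_year = date.today().year
    return {
        "publisher": publisher,
        # The tutorial's outputs store the year as a string, even for an int input.
        "publicationYear": str(publication_year),
        "resourceType": {
            "resourceType": resource_type,
            # Assumption: only the "Dataset" case is shown in this tutorial.
            "resourceTypeGeneral": "Dataset",
        },
    }


print(dc_defaults(publication_year=2000))
```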

#### Optional arguments (not present by default):
- `affiliations` (string or list of strings): The affiliations of the authors, in the same order as the authors. If the number of affiliations differs from the number of authors, all affiliations will be applied to all authors. An author with multiple affiliations can be given a list of them. (See examples below for more details.)
- `description` (string): A description of the dataset.
- `dataset_doi` (string): The DOI for this dataset (*not* an associated paper).
- `related_dois` (string or list of strings): DOIs related to this dataset, such as an associated paper's DOI. This *does not* include a DOI for the dataset itself. 
- `subjects` (string or list of strings): Subjects (in DataCite terminology) or tags related to the dataset.
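
Several of these arguments accept either a single string or a list of strings. The sketch below shows one way that normalization might work, using `related_dois` as the example. Both function names are hypothetical, and the output shape (including the `IsPartOf` relation type) is copied from the example output in this tutorial rather than from the client's source.

```python
def as_list(value):
    """Wrap a single string in a list; copy any other iterable into a list."""
    return [value] if isinstance(value, str) else list(value)


def related_identifiers(related_dois):
    """Build DataCite-style related identifier entries (hypothetical helper)."""
    return [
        {
            "relatedIdentifier": doi,
            "relatedIdentifierType": "DOI",
            # Relation type as it appears in this tutorial's example output.
            "relationType": "IsPartOf",
        }
        for doi in as_list(related_dois)
    ]


print(related_identifiers("10.1234/paper1"))
print(related_identifiers(["10.1234/paper1", "10.1234/paper2"]))
```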


If you understand the DataCite schema, you can also add other keyword arguments corresponding to DataCite fields. Additional information on DataCite fields is available from the [official DataCite website](https://schema.datacite.org/meta/kernel-4.1/).

You cannot clear the `dc` block. You can overwrite the `dc` block by calling this method again.
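
To make the three accepted author formats concrete, here is a minimal sketch of how such strings could be split into name parts. `split_author_name` is a hypothetical helper, not the client's implementation. Treating the last word as the family name for the "Givenname Familyname" form is an assumption.

```python
def split_author_name(name):
    """Split an author string into DataCite-style name parts.

    Hypothetical helper for illustration; not part of mdf_connect_client.
    """
    for separator in (",", ";"):
        if separator in name:
            # "Familyname, Givenname" or "Familyname; Givenname"
            family, given = name.split(separator, 1)
            break
    else:
        # "Givenname Familyname": assume the last word is the family name.
        given, _, family = name.strip().rpartition(" ")
    family, given = family.strip(), given.strip()
    return {
        "creatorName": f"{family}, {given}" if given else family,
        "familyName": family,
        "givenName": given,
    }


for author in ["Foo Smith", "Smith, Bar", "Smith; Baz"]:
    print(split_author_name(author)["creatorName"])
```

All three inputs produce a `creatorName` in the "Familyname, Givenname" form, which matches the shape of the `creators` entries in the example outputs.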

In [3]:
# Extra affiliations examples
# Assume we have three authors: Alice, Bob, and Cathy
authors = ["Fromnist, Alice", "Fromnist; Bob", "Cathy Multiples"]

# If all authors are from NIST:
affiliations = "NIST"
# Equivalent to ["NIST", "NIST", "NIST"]

# If all authors are from both NIST and UChicago:
affiliations = ["NIST", "UChicago"]
# Equivalent to [["NIST", "UChicago"], ["NIST", "UChicago"], ["NIST", "UChicago"]]

# If Alice and Bob are from NIST, Cathy is from NIST and UChicago:
affiliations = ["NIST", "NIST", ["NIST", "UChicago"]]
# This is the only way to express these affiliations

# Incorrect: 4 affiliations for 3 authors means every entry is applied to all authors,
# and affiliations applied to all authors must not contain nested lists.
affiliations = ["NIST", ["NIST", "UChicago"], "Argonne", "Oak Ridge"]
# Do not use this format; it is invalid for this number of authors.
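
The affiliation rules illustrated above can be sketched as a small function. This is an assumption-laden illustration of the described behavior, not the client's code. `expand_affiliations` is a hypothetical name, and raising an error for nested lists in the apply-to-all case is a design choice of this sketch.

```python
def expand_affiliations(authors, affiliations):
    """Return one list of affiliations per author (hypothetical helper)."""
    if isinstance(affiliations, str):
        affiliations = [affiliations]
    if len(affiliations) == len(authors):
        # One entry per author; each entry may be a string or a list.
        return [[a] if isinstance(a, str) else list(a) for a in affiliations]
    # Different count: every affiliation applies to every author,
    # so nested lists make no sense here.
    if any(not isinstance(a, str) for a in affiliations):
        raise ValueError("Nested lists are only valid with one entry per author")
    return [list(affiliations) for _ in authors]


authors = ["Fromnist, Alice", "Fromnist; Bob", "Cathy Multiples"]
print(expand_affiliations(authors, "NIST"))
print(expand_affiliations(authors, ["NIST", "NIST", ["NIST", "UChicago"]]))
```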

In [4]:
mdfcc.create_dc_block(title="Example of Dataset Submission",
                      authors=["Foo Smith", "Smith, Bar", "Smith; Baz"],
                      affiliations=["The International Institute of Data"],
                      publisher="The Journal of Datasets",
                      publication_year=2000,
                      description="This is an example submission.",
                      dataset_doi="10.1234/dataset",
                      related_dois=["10.1234/paper1", "10.1234/paper2"],
                      subjects=["Examples", "Datasets"])
mdfcc.dc

{'creators': [{'affiliations': ['The International Institute of Data'],
   'creatorName': 'Smith, Foo',
   'familyName': 'Smith',
   'givenName': 'Foo'},
  {'affiliations': ['The International Institute of Data'],
   'creatorName': 'Smith, Bar',
   'familyName': 'Smith',
   'givenName': 'Bar'},
  {'affiliations': ['The International Institute of Data'],
   'creatorName': 'Smith, Baz',
   'familyName': 'Smith',
   'givenName': 'Baz'}],
 'descriptions': [{'description': 'This is an example submission.',
   'descriptionType': 'Other'}],
 'identifier': {'identifier': '10.1234/dataset', 'identifierType': 'DOI'},
 'publicationYear': '2000',
 'publisher': 'The Journal of Datasets',
 'relatedIdentifiers': [{'relatedIdentifier': '10.1234/paper1',
   'relatedIdentifierType': 'DOI',
   'relationType': 'IsPartOf'},
  {'relatedIdentifier': '10.1234/paper2',
   'relatedIdentifierType': 'DOI',
   'relationType': 'IsPartOf'}],
 'resourceType': {'resourceType': 'Dataset', 'resourceTypeGeneral': 'Datase

In [5]:
mdfcc.create_dc_block(title="Smaller Example Dataset Submission",
                      authors=["Foo Smith", "Smith, Bar"],
                      affiliations=["Foo University", "The Bureau of Bar"])

mdfcc.dc

{'creators': [{'affiliations': ['Foo University'],
   'creatorName': 'Smith, Foo',
   'familyName': 'Smith',
   'givenName': 'Foo'},
  {'affiliations': ['The Bureau of Bar'],
   'creatorName': 'Smith, Bar',
   'familyName': 'Smith',
   'givenName': 'Bar'}],
 'publicationYear': '2018',
 'publisher': 'Materials Data Facility',
 'resourceType': {'resourceType': 'Dataset', 'resourceTypeGeneral': 'Dataset'},
 'titles': [{'title': 'Smaller Example Dataset Submission'}]}

### `add_data`
Every submission must include some data. `add_data()` will add a data location to your dataset. This action is cumulative, so each call adds more data. Subsequent calls do not overwrite.

#### Required arguments:
- `data_location` (string): The location of the data.

You can add data located at a Globus endpoint, HTTP(S) link, or on Google Drive. MDF Connect will extract data located in archives, including zips.

- Globus endpoint: `globus://endpoint_id/path/to/data` or you can copy the "Get link" link from the Globus Web App. The data must be shared with `mdf_dataset_submission`, which is the MDF Connect Globus account. (This account will show up as `c17f27bb-f200-486a-b785-2a25e82af505@clients.auth.globus.org`.)
- HTTP(S): Copy the link exactly. The data must be accessible without authentication.
- Google Drive: `googledrive:///path/from/shared/location`. You must share the data with materialsdatafacility@gmail.com.

To clear all data from the submission, call `clear_data()`.
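
If you want to sanity-check a location string before adding it, a sketch like the following may help. `describe_location` is a hypothetical helper that uses only the standard library. Recognizing Globus Web App links by their `origin_id` and `origin_path` query parameters is an assumption based on the example link used in this tutorial, not a documented rule.

```python
from urllib.parse import parse_qs, urlsplit


def describe_location(location):
    """Classify a data location string (hypothetical helper)."""
    parts = urlsplit(location)
    if parts.scheme == "globus":
        return {"kind": "globus", "endpoint": parts.netloc, "path": parts.path}
    if parts.scheme == "googledrive":
        return {"kind": "googledrive", "path": parts.path}
    if parts.scheme in ("http", "https"):
        query = parse_qs(parts.query)
        # Assumption: a Globus Web App link carries these two query parameters.
        if (parts.netloc.endswith("globus.org")
                and "origin_id" in query and "origin_path" in query):
            return {
                "kind": "globus",
                "endpoint": query["origin_id"][0],
                "path": query["origin_path"][0],
            }
        return {"kind": "http", "url": location}
    raise ValueError(f"Unrecognized data location: {location!r}")


print(describe_location("globus://1a2b3c/data/"))
print(describe_location("googledrive:///mydata.zip"))
```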

In [6]:
mdfcc.add_data("https://dl.dropboxusercontent.com/u/12345/abcdef")
mdfcc.add_data(["googledrive:///mydata.zip", "globus://1a2b3c/data/"])
mdfcc.data

['https://dl.dropboxusercontent.com/u/12345/abcdef',
 'googledrive:///mydata.zip',
 'globus://1a2b3c/data/']

In [7]:
mdfcc.clear_data()
mdfcc.data

[]

In [8]:
# This is the actual test data, using a Globus Web App link
mdfcc.add_data("https://www.globus.org/app/transfer?origin_id=e38ee745-6d04-11e5-ba46-22000b92c6ec&origin_path=%2Fcitrine_mdf_demo%2Falloy.pbe%2FAlFe%2F")
mdfcc.data

['https://www.globus.org/app/transfer?origin_id=e38ee745-6d04-11e5-ba46-22000b92c6ec&origin_path=%2Fcitrine_mdf_demo%2Falloy.pbe%2FAlFe%2F']

## Recommended inputs

- [`add_index`](#add_index)
- [`add_service`](#add_service)
- [`set_test`](#set_test)
- [`add_repositories`](#add_repositories)

### `add_index`
To process JSON, CSV, YAML, XML, or Excel files, MDF Connect requires a mapping that translates the file into MDF schema format. `add_index()` will add a mapping for a specific data type to your submission.

#### Required arguments:
- `data_type` (string): The type of data being mapped. Supported types include:
    - `json`
    - `csv`
    - `yaml`
    - `xml`
    - `excel`
    - `filename` (This type is special; see below.)
- `mapping` (dictionary of strings): The mapping of MDF fields to your data type's fields (see below).

#### Arguments with defaults:
- `delimiter` (string): For tabular data (ex. CSV), the column delimiter. The default is "," (comma).
- `na_values` (string or list of strings): Values to treat as "data missing" entries. For tabular data (ex. CSV), the default is blank and space. For other types (ex. JSON), the default is nothing (no values will be discarded).
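
To illustrate what `delimiter` and `na_values` mean for tabular data, here is a small standard-library sketch. It is not how MDF Connect parses files. `read_table` and the sample text are made up for this example, and the defaults mirror the "blank and space" behavior described above.

```python
import csv
import io


def read_table(text, delimiter=",", na_values=("", " ")):
    """Parse delimited text, dropping cells that count as missing data.

    Hypothetical helper for illustration only.
    """
    if isinstance(na_values, str):
        na_values = [na_values]
    rows = []
    for row in csv.DictReader(io.StringIO(text), delimiter=delimiter):
        rows.append({key: value for key, value in row.items()
                     if value not in na_values})
    return rows


# Made-up sample: the second row has no band gap value.
sample = "composition;band_gap\nAl2O3;8.8\nFe;\n"
print(read_table(sample, delimiter=";"))
```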

#### About `mapping`:
Mappings must be dictionaries, where the key is the MDF schema field (expressed in dot notation) and the value is the data's field or column name. "Dot notation" means one string that uses a period between dictionary levels. For example, `block.field.subfield` is the dot notation equivalent of `my_dict["block"]["field"]["subfield"]` in Python.
The exception to this is `filename` mapping. To extract data from a file's name, create a regular expression that returns the correct information. The mapping field is still the associated MDF field in dot notation, but the mapping value is the regular expression you created.
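
As an illustration of a `filename` mapping, the sketch below pairs MDF fields with regular expressions and applies them to a made-up file name. How MDF Connect itself evaluates the expression (for example, whole match versus capture group) is not specified here, so using the whole match of `re.search` is an assumption. `extract_from_filename` is a hypothetical helper.

```python
import re

# MDF field (dot notation) -> regular expression applied to the file name.
filename_mapping = {
    "material.composition": r"^[A-Za-z0-9]+(?=_)",
    "crystal_structure.space_group_number": r"(?<=_sg)\d+",
}


def extract_from_filename(filename, mapping):
    """Return the fields whose expressions match the file name."""
    found = {}
    for mdf_field, pattern in mapping.items():
        match = re.search(pattern, filename)
        if match:
            # Assumption: the whole match is the extracted value.
            found[mdf_field] = match.group(0)
    return found


print(extract_from_filename("Al2O3_sg167.json", filename_mapping))
```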

Fields with missing data will be ignored. If you have multiple schemas for one data type in one dataset, you can combine the mappings safely.

Each data type can only have one associated mapping, so multiple calls with the same data type will overwrite. Calls with different data types will not overwrite. To clear all the mappings, call `clear_index()`.

For more information on the MDF schemas, see the [official schema repository](https://github.com/materials-data-facility/data-schemas).

For the following example, assume we're submitting a dataset that contains a JSON file structured like this:
```json
{
    "my_data": {
        "mat": {
            "comp": "H"
        },
        "atom_num": 1
    },
    "space_grp": 10
}
```

In [9]:
# This is the mapping we would use to get the JSON file into MDF format.
mapping = {
    "material.composition": "my_data.mat.comp",
    "crystal_structure.number_of_atoms": "my_data.atom_num",
    "crystal_structure.space_group_number": "space_grp",
    # We could add another field here, if we had multiple JSON schemas.
    "dft.converged": "dft_info.conv"
    # This field would be ignored by MDF Connect in this submission because the field doesn't exist in the data.
}

In [10]:
mdfcc.add_index("json", mapping)
mdfcc.index

{'json': {'mapping': {'crystal_structure.number_of_atoms': 'my_data.atom_num',
   'crystal_structure.space_group_number': 'space_grp',
   'dft.converged': 'dft_info.conv',
   'material.composition': 'my_data.mat.comp'}}}
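
To see what such a mapping means in practice, here is a minimal sketch that applies it to the example JSON record. The three helpers are hypothetical and only illustrate dot notation and the "missing fields are ignored" rule. They are not MDF Connect's actual processing code.

```python
def get_by_dots(data, dotted_key):
    """Look up a dot-notation key in nested dictionaries, or return None."""
    current = data
    for part in dotted_key.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current


def set_by_dots(target, dotted_key, value):
    """Set a dot-notation key, creating intermediate dictionaries as needed."""
    *parents, leaf = dotted_key.split(".")
    for part in parents:
        target = target.setdefault(part, {})
    target[leaf] = value


def apply_mapping(record, mapping):
    """Translate a record into MDF-shaped fields, skipping missing data."""
    result = {}
    for mdf_field, source_field in mapping.items():
        value = get_by_dots(record, source_field)
        if value is not None:
            set_by_dots(result, mdf_field, value)
    return result


record = {"my_data": {"mat": {"comp": "H"}, "atom_num": 1}, "space_grp": 10}
example_mapping = {
    "material.composition": "my_data.mat.comp",
    "crystal_structure.number_of_atoms": "my_data.atom_num",
    "crystal_structure.space_group_number": "space_grp",
    "dft.converged": "dft_info.conv",  # Not present in the record, so skipped.
}
print(apply_mapping(record, example_mapping))
```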

In [11]:
mdfcc.clear_index()
mdfcc.index

{}

### `add_service`
MDF Connect has integrations to submit data to other community services, as well as additional MDF-related options. To automatically submit your dataset to an integrated service, use `add_service()`.

#### Required arguments:
- `service` (string): One service to push your dataset to. Integrated services include:
    - `globus_publish`, the Globus/MDF publication service with DOI minting
    - `citrine`, industry-partnered machine-learning specialists
    - `mrr`, the NIST Materials Resource Registry

#### Arguments with defaults:
- `parameters` (dictionary): Optional, service-specific parameters. Fields include:
    - For `globus_publish`:
        - `parameters` currently unavailable.
    - For `citrine`:
        - `public` (boolean): When `True`, the data will be made public. Otherwise, the data will be inaccessible. The default is `True`.

This action is cumulative, so subsequent calls will add more services rather than overwrite previous ones.

To clear all the service selections from your submission, call `clear_services()`.

In [12]:
mdfcc.add_service("citrine")
mdfcc.services

{'citrine': True}

In [13]:
mdfcc.clear_services()
mdfcc.services

{}

### `set_test`
You can use `set_test()` to create a test submission as a dry-run for MDF Connect. The submission will go through the normal processing, but the results will not be submitted to the normal locations. This flag is a great way to tell if your submission will process the way you want it to.

#### Required arguments:
- `test` (boolean): When `True`, enables test mode. When `False`, disables test mode.

#### About test mode:
Test datasets are submitted to test, sandbox, or temporary resources instead of live resources, as listed below. These settings override all other parameters.
- Tests are ingested into the `mdf-test` search index
- Tests are submitted to the MDF Test collection in Globus Publish (if `globus_publish` is a requested service)
- Tests are not made public on Citrination (if `citrine` is a requested service)
- Tests are given a special `source_id` by prepending `_test_`

In [14]:
mdfcc.set_test(False)
mdfcc.test

False

In [15]:
mdfcc.set_test(True)
mdfcc.test

True

### `add_repositories`
`add_repositories()` adds repository tags to your submission. This action is cumulative, so each call adds more repositories. Subsequent calls do not overwrite.

#### Required arguments:
- `repositories` (string or list of strings): The repositories to add.

Some repositories may be added automatically if implied by your supplied tags. Repositories that aren't recognized may be discarded.

To clear your repository tags, call `clear_repositories()`.

In [16]:
mdfcc.add_repositories(["APS", "NREL"])
mdfcc.mdf

{'repositories': ['APS', 'NREL']}

In [17]:
mdfcc.clear_repositories()
mdfcc.mdf

{}

## Optional inputs

- [`set_custom_block`](#set_custom_block)
- [`set_custom_descriptions`](#set_custom_descriptions)
- [`set_acl`](#set_acl)
- [`set_source_name`](#set_source_name)
- [`create_mrr_block`](#create_mrr_block)

### `set_custom_block`
The `__custom` block is an area for you to add your own custom metadata, if it isn't covered by the MDF schema. It can be set by calling `set_custom_block()`.

#### Required arguments:
- `custom_fields` (dictionary): Custom field-value pairs for your dataset.

You are allowed ten keys in your custom dictionary. You may additionally add descriptions of your fields by creating a new field called "\[field\]\_desc" with the string description inside. You can also add descriptions by calling `set_custom_descriptions()`.

Note that, unlike the `index` mappings, you supply the actual values for the dataset-level `__custom` block.

Subsequent calls will overwrite your `__custom` block. You can clear the `__custom` block by passing in an empty dictionary.
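
If you want to check a custom dictionary against the ten-key limit before setting it, a sketch like this could be used. `count_custom_fields` is a hypothetical helper. Not counting `_desc` description keys toward the limit is an assumption of this sketch, since the tutorial does not say how they are counted.

```python
MAX_CUSTOM_FIELDS = 10
DESC_SUFFIX = "_desc"


def count_custom_fields(custom_fields):
    """Count value fields and raise if there are more than ten.

    Hypothetical helper; assumes description keys are not counted.
    """
    value_keys = [key for key in custom_fields
                  if not key.endswith(DESC_SUFFIX)]
    if len(value_keys) > MAX_CUSTOM_FIELDS:
        raise ValueError(
            f"{len(value_keys)} custom fields given, "
            f"but only {MAX_CUSTOM_FIELDS} are allowed"
        )
    return len(value_keys)


example = {
    "quench_method": "water",
    "quench_method_desc": "The method of quenching",
}
print(count_custom_fields(example))
```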

In [18]:
custom_values = {
        "quench_method": "water",
        "quench_method_desc": "The method of quenching"
}
mdfcc.set_custom_block(custom_values)
mdfcc.custom

{'quench_method': 'water', 'quench_method_desc': 'The method of quenching'}

In [19]:
mdfcc.set_custom_block({})
mdfcc.custom

{}

### `set_custom_descriptions`
To add descriptions for your `__custom` block fields, you can call `set_custom_descriptions()`.

#### Required arguments:
- `custom_descriptions` (dictionary): The custom fields and descriptions. The dictionary fields must be the same as your `__custom` block fields, and the values must be their descriptions.

Every field in `__custom` can have a description, but descriptions are not allowed without a corresponding field.

Subsequent calls will overwrite the descriptions you provide. To clear descriptions, you have to use `set_custom_block()`.
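
The relationship between fields and their descriptions can be sketched as a simple merge. `with_descriptions` is a hypothetical helper that mimics the resulting `__custom` block shape shown in this tutorial. Raising a `KeyError` for a description without a matching field is this sketch's choice, not necessarily the client's behavior.

```python
def with_descriptions(custom_fields, custom_descriptions):
    """Return a copy of the custom block with "<field>_desc" keys added."""
    merged = dict(custom_fields)
    for field, description in custom_descriptions.items():
        if field not in custom_fields:
            # Descriptions are not allowed without a corresponding field.
            raise KeyError(f"No custom field named {field!r} to describe")
        merged[field + "_desc"] = description
    return merged


print(with_descriptions({"quench_method": "water"},
                        {"quench_method": "The method of quenching"}))
```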

In [20]:
custom_values = {
        "quench_method": "water"
}
mdfcc.set_custom_block(custom_values)
mdfcc.custom

{'quench_method': 'water'}

In [21]:
custom_desc = {
        "quench_method": "The method of quenching"
}
mdfcc.set_custom_descriptions(custom_desc)
mdfcc.custom

{'quench_method': 'water', 'quench_method_desc': 'The method of quenching'}

In [22]:
mdfcc.set_custom_block({})
mdfcc.custom

{}

### `set_acl`
`set_acl()` sets the Access Control List for this submission.

#### Required arguments:
- `acl` (string or list of strings): The Access Control List. The ACL must contain either the Globus UUIDs of users and/or groups allowed to access the submission, or `"public"` to make the submission open to everyone. The default ACL is `"public"`.

You can reset the ACL to the default with `clear_acl()`.
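
Before calling `set_acl()`, you may want to verify that each entry is either `"public"` or a well-formed UUID. The following is a minimal sketch using the standard `uuid` module. `normalize_acl` is a hypothetical helper and performs no lookup against Globus. Placeholder strings such as `"UUID1"` would be rejected by it.

```python
import uuid


def normalize_acl(acl="public"):
    """Return the ACL as a list, checking that each entry is well formed."""
    entries = [acl] if isinstance(acl, str) else list(acl)
    for entry in entries:
        if entry == "public":
            continue
        try:
            uuid.UUID(entry)  # Checks the format only, not that the ID exists.
        except ValueError:
            raise ValueError(
                f"{entry!r} is neither 'public' nor a UUID"
            ) from None
    return entries


print(normalize_acl())
print(normalize_acl("c17f27bb-f200-486a-b785-2a25e82af505"))
```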

In [23]:
mdfcc.set_acl(["UUID1", "UUID2"])
mdfcc.mdf

{'acl': ['UUID1', 'UUID2']}

In [24]:
mdfcc.clear_acl()
mdfcc.mdf

{}

### `set_source_name`
`set_source_name()` sets the `source_name` of your dataset. By default, the `source_name` is generated based on the title of your dataset (as set in the `dc` block). If your title is long or otherwise unwieldy to type or remember, setting a custom `source_name` can help.

#### Required arguments:
- `source_name` (string): The desired `source_name`, which must be unique for new datasets.

Please note that your `source_name` will be cleaned when submitted to MDF Connect, so the actual `source_name` may differ from this value. Additionally, the `source_id` (which is the `source_name` plus version) is required to fetch the status of a submission. `check_status()` can handle this for you.
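
For intuition about how a title, a `source_name`, and a `source_id` relate, here is a rough sketch. The exact cleaning rules are applied by MDF Connect and are not documented in this tutorial, so both functions are hypothetical guesses that merely reproduce the shape of the `source_id` in the example output. The version string is treated as opaque.

```python
import re


def guess_source_name(title):
    """Lowercase the title and replace other characters with underscores.

    Assumption: this only approximates the service's cleaning.
    """
    cleaned = re.sub(r"[^a-z0-9]+", "_", title.lower())
    return cleaned.strip("_")


def guess_source_id(source_name, version, test=False):
    """Combine name and version; test submissions get a "_test_" prefix."""
    prefix = "_test_" if test else ""
    return f"{prefix}{source_name}_v{version}"


name = guess_source_name("Smaller Example Dataset Submission")
print(name)
print(guess_source_id(name, "3-3", test=True))
```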

You can reset the `source_name` to the default by calling `clear_source_name()`.

In [25]:
mdfcc.set_source_name("my_foobar_dataset")
mdfcc.mdf

{'source_name': 'my_foobar_dataset'}

In [26]:
mdfcc.clear_source_name()
mdfcc.mdf

{}

### `create_mrr_block`
`create_mrr_block()` adds data for the NIST Materials Resource Registry into your submission. Currently, you must build a dictionary with the appropriate fields yourself in order to attach MRR metadata.

#### Required arguments:
- `mrr_data` (dictionary): The Materials Resource Registry metadata.

You can clear the `mrr` block by passing in an empty dictionary.

In [27]:
mdfcc.create_mrr_block({"dataOrigin": "experiment"})
mdfcc.mrr

{'dataOrigin': 'experiment'}

In [28]:
mdfcc.create_mrr_block({})
mdfcc.mrr

{}

## Submitting a dataset
After you have created your submission with the above helpers, you can submit and check your submission with the following helpers.

- [`get_submission`](#get_submission)
- [`reset_submission`](#reset_submission)
- [`submit_dataset`](#submit_dataset)

### `get_submission`
`get_submission()` shows you your current submission, as it will be sent to MDF Connect. This method is a great way to check for any errors.

In [29]:
mdfcc.get_submission()

{'data': ['https://www.globus.org/app/transfer?origin_id=e38ee745-6d04-11e5-ba46-22000b92c6ec&origin_path=%2Fcitrine_mdf_demo%2Falloy.pbe%2FAlFe%2F'],
 'dc': {'creators': [{'affiliations': ['Foo University'],
    'creatorName': 'Smith, Foo',
    'familyName': 'Smith',
    'givenName': 'Foo'},
   {'affiliations': ['The Bureau of Bar'],
    'creatorName': 'Smith, Bar',
    'familyName': 'Smith',
    'givenName': 'Bar'}],
  'publicationYear': '2018',
  'publisher': 'Materials Data Facility',
  'resourceType': {'resourceType': 'Dataset',
   'resourceTypeGeneral': 'Dataset'},
  'titles': [{'title': 'Smaller Example Dataset Submission'}]},
 'test': True}
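
One check you might run on the dictionary returned by `get_submission()` is that the mandatory pieces are present. The sketch below is a hypothetical helper. The key names (`dc`, `titles`, `creators`, `data`) are taken from the output above, and the check itself is not part of the client.

```python
def missing_mandatory(submission):
    """List the mandatory parts that are absent or empty in a submission."""
    missing = []
    dc = submission.get("dc") or {}
    if not dc.get("titles"):
        missing.append("dc.titles")
    if not dc.get("creators"):
        missing.append("dc.creators")
    if not submission.get("data"):
        missing.append("data")
    return missing


# A hand-written stand-in for the result of mdfcc.get_submission().
draft = {
    "dc": {"titles": [{"title": "Smaller Example Dataset Submission"}]},
    "data": [],
    "test": True,
}
print(missing_mandatory(draft))
```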

### `reset_submission`
If you need to clear away your entire submission, call `reset_submission()`. This is irreversible.

Caution: This method will clear the current `source_id`, which means that you will have to keep track of any previous `source_id`s from other submissions to see their statuses.

In [30]:
# mdfcc.reset_submission()

### `submit_dataset`
`submit_dataset()` will send your dataset to MDF Connect for indexing. You will get back the `source_id` if the submission is successful. The `source_id` is the unique identifier for your specific submission, and can be used to check the status of your submission later. The `source_id` is also saved to the client.

#### Optional arguments:
- `resubmit` (boolean): If you wish to submit this dataset after submitting it previously, set this to `True`. If this is the first submission, leave this `False`. The default is `False`.
- `submission` (dictionary): If you have assembled your own MDF Connect submission without this client, you can submit it by passing the dictionary in here. By default, the submission made in the client will be used.
- `reset` (boolean): If `True`, the submission will be cleared after the submission attempt, with `reset_submission()`. The test flag will be preserved. The default is `False`. Caution: This flag will clear your `source_id`, which means that you will have to keep track of it manually to check your submission's status.

#### Return value:
- A dictionary with the following submission information:
    - `success` (boolean): `True` if the submission was successfully sent to MDF Connect. `False` otherwise.
    - `source_id` (string): The `source_id` of the submission, from MDF Connect. This value may be `None` or an old ID if the submission failed.
    - `error` (string): If the submission failed, the reason for failure. If the submission succeeded, this will be `None` instead.
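
Because `source_id` may be `None` or stale after a failed submission, it can be worth checking `success` before using it. A minimal sketch, with `require_source_id` as a hypothetical helper operating on a dictionary of the shape described above:

```python
def require_source_id(result):
    """Return the source_id of a successful submission, or raise."""
    if not result["success"]:
        # On failure, source_id may be None or an old ID, so do not use it.
        raise RuntimeError(f"Submission failed: {result['error']}")
    return result["source_id"]


# A hand-written stand-in for the return value of mdfcc.submit_dataset().
result = {
    "error": None,
    "source_id": "_test_smaller_example_dataset_submission_v3-3",
    "success": True,
}
print(require_source_id(result))
```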

In [31]:
mdfcc.submit_dataset()

{'error': None,
 'source_id': '_test_smaller_example_dataset_submission_v3-3',
 'success': True}

## Checking submission status
After submitting a dataset to MDF Connect, you can see the status of the submission's processing with these helpers.

- [`check_status`](#check_status)
- [`check_all_submissions`](#check_all_submissions)

### `check_status`
To see the progress your submission is making, use `check_status()`. If you haven't cleared the submission from the client, you can use it without arguments to check the most recent submission status.

#### Optional arguments:
- `source_id` (string): The `source_id` of the submission you want to check on. If you don't supply a `source_id`, the ID of the last submission you made with the client will be used instead (an error will result if you have not submitted a dataset with the client yet and also don't supply an ID).
- `raw` (boolean): When `False`, a nicely-formatted status summary will be printed to standard output. When `True`, the full status result will be returned instead (the full result is not recommended for direct human consumption). The default is `False`.

In [1]:
mdfcc.check_status()

In [None]:
mdfcc.check_status("_test_smaller_example_dataset_submission_v1-1")

### `check_all_submissions`
If you want to see the status of all submissions you've made to MDF Connect, use `check_all_submissions()`. This method is helpful if you forget a submission's `source_id`, or you have multiple submissions processing at once.

#### Optional arguments:
- `verbose` (boolean): When `False`, a basic summary of your submissions will be printed. When `True`, the full status summary of each submission will be printed, in the same form as `check_status()`. The default is `False`. (This argument has no effect if `raw` is `True`.)
- `active` (boolean): When `False`, all of your submissions will be shown. When `True`, only submissions that are still active will be shown. The default is `False`.
- `raw` (boolean): When `False`, the summary selected by `verbose` will be printed. When `True`, the full status result will be returned instead (the full result is not recommended for direct human consumption). The default is `False`.

In [33]:
mdfcc.check_all_submissions(active=True)

Error 404. MDF Connect may be experiencing technical difficulties.


In [None]:
mdfcc.check_all_submissions(verbose=True, active=True)