In [1]:
import mdf_toolbox

# Table of contents

- [Overview/instantiation](#MDF-Connect-Client)
- [Mandatory inputs](#Mandatory-inputs)
- [Recommended inputs](#Recommended-inputs)
- [Optional inputs](#Optional-inputs)
- [Submitting a dataset](#Submitting-a-dataset)

# MDF Connect Client
The MDF Connect Client (`MDFConnectClient`) is a class designed to help you submit datasets to MDF Connect using Python. It can be automatically created with the `login()` utility (see the Authentication Utilities tutorial).

Note: While you can access and modify the internal variables of a client (for example, `mdfcc.data`), it is recommended that you instead only use the helper functions. This tutorial accesses those variables only for display purposes.

**IMPORTANT**: To submit data to MDF Connect, you must have an account recognized by Globus Auth (including Google, ORCiD, many academic institutions, or a [free Globus ID](https://www.globusid.org/create)). Additionally, you must be in the [MDF Connect Convert](https://www.globus.org/app/groups/cc192dca-3751-11e8-90c1-0a7c735d220a/about) Globus Group.

In [2]:
mdfcc = mdf_toolbox.login(services=["mdf_connect"])

You can also instantiate the client directly, and it will attempt to authenticate you automatically. You cannot use MDF Connect anonymously.

Additionally, you can set the initial state of the client this way.

In [3]:
mdfcc = mdf_toolbox.MDFConnectClient(test=True)

# Mandatory inputs

## `create_dc_block`
The `dc` (DataCite) block is mandatory for all submissions. `create_dc_block()` creates the `dc` block for you.

You must provide the `title` and `authors` of your submission. You may also provide the authors' `affiliations` (in the same order as the authors), the `publisher` and `publication_year`, and the `description`, as well as the DOI for the dataset itself as `dataset_doi` and related DOIs (such as a journal article) as `related_dois`.

Note about `affiliations`: If the affiliations for each author are different, you must supply each author's affiliation(s) in order. If you supply a different number of affiliations (1 affiliation for 3 authors, for example), all affiliations will apply to all authors. See the extended example below.

If you understand the DataCite schema, you can also add other keyword arguments corresponding to DataCite fields.

You cannot clear the `dc` block. You can overwrite the `dc` block by calling this method again.

In [4]:
# Extra affiliations examples
authors = ["Fromnist, Alice", "Fromnist; Bob", "Cathy Multiples"]

# If all authors are from NIST:
affiliations = "NIST"

# If all authors are from both NIST and UChicago:
affiliations = ["NIST", "UChicago"]

# If Alice and Bob are from NIST, Cathy is from NIST and UChicago:
affiliations = ["NIST", "NIST", ["NIST", "UChicago"]]

# This is incorrect! If applying affiliations to all authors, lists must not be nested.
# These apply to all authors because there are 4 affiliations for 3 authors.
affiliations = ["NIST", ["NIST", "UChicago"], "Argonne", "Oak Ridge"]
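
The matching rule described in the note on `affiliations` can be pictured with a small standalone sketch. This is not the client's actual implementation; `expand_affiliations` is a hypothetical helper that only follows the rule as stated in this tutorial: when the number of affiliations equals the number of authors they are paired by position, otherwise every affiliation applies to every author.

```python
def expand_affiliations(authors, affiliations):
    """Return one list of affiliations per author (illustrative only)."""
    # A single string is treated as a one-item list.
    if isinstance(affiliations, str):
        affiliations = [affiliations]
    if len(affiliations) == len(authors):
        # Same count: pair by position, wrapping bare strings into lists.
        return [[a] if isinstance(a, str) else list(a) for a in affiliations]
    # Different count: every affiliation applies to every author.
    return [list(affiliations) for _ in authors]


authors = ["Fromnist, Alice", "Fromnist; Bob", "Cathy Multiples"]
print(expand_affiliations(authors, "NIST"))
print(expand_affiliations(authors, ["NIST", "UChicago"]))
print(expand_affiliations(authors, ["NIST", "NIST", ["NIST", "UChicago"]]))
```

The sketch does not reject nested lists in the "applies to all authors" case, which this tutorial flags as incorrect input.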


In [5]:
mdfcc.create_dc_block(title="Foo, Bar, and Baz in Big Data",
                      authors=["Foo Smith", "Smith, Bar", "Smith; Baz"],
                      affiliations=["The Foo Bar Institute of Data"],
                      publisher="The Journal of Foo-Bar Data")
mdfcc.dc

{'creators': [{'affiliations': ['The Foo Bar Institute of Data'],
   'creatorName': 'Smith, Foo',
   'familyName': 'Smith',
   'givenName': 'Foo'},
  {'affiliations': ['The Foo Bar Institute of Data'],
   'creatorName': 'Smith, Bar',
   'familyName': 'Smith',
   'givenName': 'Bar'},
  {'affiliations': ['The Foo Bar Institute of Data'],
   'creatorName': 'Smith, Baz',
   'familyName': 'Smith',
   'givenName': 'Baz'}],
 'publicationYear': '2018',
 'publisher': 'The Journal of Foo-Bar Data',
 'resourceType': {'resourceType': 'Dataset', 'resourceTypeGeneral': 'Dataset'},
 'titles': [{'title': 'Foo, Bar, and Baz in Big Data'}]}

## `add_data`
Every submission must include some kind of data. `add_data()` adds a data location to your dataset. This action is cumulative: each call adds more data and does not overwrite earlier calls.

You can add data located at a Globus endpoint, HTTP(S) link, or Google Drive. Connect will extract data located in archives, including zips.

- Globus endpoint: `globus://endpoint_id/path/to/data` or you can copy the "Get link" link from the Globus Web App. The data must be shared with `mdf_dataset_submission`, which is the MDF Connect Globus account. (This account will show up as `c17f27bb-f200-486a-b785-2a25e82af505@clients.auth.globus.org`.)
- HTTP(S): Copy the link exactly. The data must be accessible without authentication.
- Google Drive: `googledrive:///path/from/shared/location`. You must share the data with materialsdatafacility@gmail.com.

To clear all data from the submission, call `clear_data()`.
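
Before adding locations, it can help to check which of the three kinds a given string belongs to. The sketch below is a hypothetical helper (`classify_location` is not part of `mdf_toolbox`), and it assumes that a Globus Web App link is an HTTPS URL on `www.globus.org` under `/app/transfer`, as in the test data link used later in this tutorial.

```python
from urllib.parse import urlparse


def classify_location(location):
    """Guess the kind of data location from its URL scheme (illustrative only)."""
    parsed = urlparse(location)
    if parsed.scheme == "globus":
        return "globus"
    if parsed.scheme == "googledrive":
        return "google_drive"
    if parsed.scheme in ("http", "https"):
        # Assumption: Globus Web App links live under www.globus.org/app/transfer.
        if parsed.netloc == "www.globus.org" and parsed.path.startswith("/app/transfer"):
            return "globus"
        return "http"
    return "unknown"


for loc in ["globus://1a2b3c/data/",
            "googledrive:///mydata.zip",
            "https://dl.dropboxusercontent.com/u/12345/abcdef"]:
    print(loc, "->", classify_location(loc))
```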

In [6]:
mdfcc.add_data("https://dl.dropboxusercontent.com/u/12345/abcdef")
mdfcc.add_data(["googledrive:///mydata.zip", "globus://1a2b3c/data/"])
mdfcc.data

['https://dl.dropboxusercontent.com/u/12345/abcdef',
 'googledrive:///mydata.zip',
 'globus://1a2b3c/data/']

In [7]:
mdfcc.clear_data()
mdfcc.data

[]

In [8]:
# This is the actual test data, using a Globus Web App link
mdfcc.add_data("https://www.globus.org/app/transfer?origin_id=e38ee745-6d04-11e5-ba46-22000b92c6ec&origin_path=%2Fcitrine_mdf_demo%2Falloy.pbe%2FAlFe%2F")

# Recommended inputs

## `add_index`
To process JSON, CSV, YAML, XML, or Excel files, MDF Connect requires a mapping that translates the file into MDF schema format. `add_index()` allows you to attach a `mapping` to a `data_type`, and specify a `delimiter` for tabular data, as well as the `na_values`, when applicable.

Mappings must be dictionaries, where the key is the MDF schema field, expressed in dot notation (see examples below) and the value is the data's field or column.

Each data type can only have one associated mapping, so multiple calls with the same data type will overwrite. Calls with different data types will not overwrite. To clear all the mappings, call `clear_index()`.

For more information on the MDF schemas, see the official JSONSchema repository at https://github.com/materials-data-facility/data-schemas.

For the following example, assume there is a JSON file in the data that is structured like this:
```json
{
    "my_data": {
        "mat": {
            "comp": "H"
        },
        "atom_num": 1
    },
    "space_grp": 10
}
```

In [9]:
mapping = {
    "material.composition": "my_data.mat.comp",
    "crystal_structure.number_of_atoms": "my_data.atom_num",
    "crystal_structure.space_group_number": "space_grp"
}

In [10]:
mdfcc.add_index("json", mapping)
mdfcc.index

{'json': {'mapping': {'crystal_structure.number_of_atoms': 'my_data.atom_num',
   'crystal_structure.space_group_number': 'space_grp',
   'material.composition': 'my_data.mat.comp'}}}

In [11]:
mdfcc.clear_index()
mdfcc.index

{}
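
To make the dot notation concrete, here is a rough sketch of how such a mapping could be applied to the example JSON file. It is only an illustration of the idea, not how MDF Connect processes files internally; `get_by_path`, `set_by_path`, and `apply_mapping` are hypothetical helpers.

```python
def get_by_path(record, dotted_path):
    """Follow a dot-notation path through nested dictionaries."""
    value = record
    for key in dotted_path.split("."):
        value = value[key]
    return value


def set_by_path(target, dotted_path, value):
    """Create nested dictionaries as needed and set the final key."""
    *parents, leaf = dotted_path.split(".")
    for key in parents:
        target = target.setdefault(key, {})
    target[leaf] = value


def apply_mapping(source, mapping):
    """Build a nested record: keys are MDF fields, values are source fields."""
    result = {}
    for mdf_field, source_field in mapping.items():
        set_by_path(result, mdf_field, get_by_path(source, source_field))
    return result


source = {"my_data": {"mat": {"comp": "H"}, "atom_num": 1}, "space_grp": 10}
mapping = {
    "material.composition": "my_data.mat.comp",
    "crystal_structure.number_of_atoms": "my_data.atom_num",
    "crystal_structure.space_group_number": "space_grp",
}
print(apply_mapping(source, mapping))
```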

## `add_service`
MDF Connect has integrations to submit data to other community services, as well as additional MDF-related options. To automatically submit your dataset to an integrated service, use `add_service()`. If the service you're submitting to has additional configuration parameters, use `parameters` to set them. This action is cumulative, so subsequent calls will add more services rather than overwrite previous ones.

Integrated services include:

- `globus_publish`, the Globus/MDF publication service with DOI minting
- `citrine`, industry-partnered machine-learning specialists
- `mrr`, the NIST Materials Resource Registry

To clear all the service selections from your submission, call `clear_services()`.

In [12]:
mdfcc.add_service("citrine")
mdfcc.services

{'citrine': True}

In [13]:
mdfcc.clear_services()
mdfcc.services

{}

## `set_test`
You can use `set_test()` to create a test submission as a dry-run for MDF Connect. The submission will go through the normal processing, but the results will not be submitted to the normal locations. This flag is a great way to tell if your submission will process the way you want it to.

- Tests are ingested into the `mdf-test` search index
- Tests are submitted to the MDF Test collection in Globus Publish (if `globus_publish` requested)
- Tests are not made public on Citrination (if `citrine` requested)
- Tests are given a special `source_id` by prepending `_test_`

To turn the test flag off, call `set_test(False)`.

In [14]:
mdfcc.set_test(False)
mdfcc.test

False

In [15]:
mdfcc.set_test(True)
mdfcc.test

True

## `add_repositories`
`add_repositories()` adds repository tags to your submission. This action is cumulative, so each call adds more repositories. Subsequent calls do not overwrite.

Some repositories may be added automatically if implied by your supplied tags. Repositories that aren't recognized may be discarded.

To clear your repository tags, call `clear_repositories()`.

In [16]:
mdfcc.add_repositories(["APS", "NREL"])
mdfcc.mdf

{'repositories': ['APS', 'NREL']}

In [17]:
mdfcc.clear_repositories()
mdfcc.mdf

{}

# Optional inputs

## `set_custom_block`
The `__custom` block is an area for you to add your own custom metadata, if it isn't covered by the MDF schema. It can be set by calling `set_custom_block()`. You are allowed ten keys in this dictionary.

Note that, unlike the `index` mappings, you supply the actual values for the dataset-level `__custom` block.

You can clear the `__custom` block by passing in an empty dictionary.

In [18]:
custom_values = {
        "quench_method": "water"
}
mdfcc.set_custom_block(custom_values)
mdfcc.custom

{'quench_method': 'water'}

In [19]:
mdfcc.set_custom_block({})
mdfcc.custom

{}

## `set_acl`
`set_acl()` sets the Access Control List for this submission. It can contain the Globus UUIDs of users and/or groups allowed to access the submission, or `"public"` to make the submission open to everyone.

You can reset the ACL to the default (public) with `clear_acl()`.

In [20]:
mdfcc.set_acl(["UUID1", "UUID2"])
mdfcc.mdf

{'acl': ['UUID1', 'UUID2']}

In [21]:
mdfcc.clear_acl()
mdfcc.mdf

{}

## `set_source_name`
`set_source_name()` sets the `source_name` of your dataset. By default, the `source_name` is generated based on the title of your dataset (as set in the `dc` block). If your title is long or otherwise unwieldy to type or remember, setting a custom `source_name` can help.

You can reset the `source_name` to the default by calling `clear_source_name()`.

In [22]:
mdfcc.set_source_name("my_foobar_dataset")
mdfcc.mdf

{'source_name': 'my_foobar_dataset'}

In [23]:
mdfcc.clear_source_name()
mdfcc.mdf

{}

## `create_mrr_block`
`create_mrr_block()` adds data for the NIST Materials Resource Registry into your submission. Currently, you must build a dictionary with the appropriate fields yourself.

You can clear the `mrr` block by passing in an empty dictionary.

In [24]:
mdfcc.create_mrr_block({"dataOrigin": "experiment"})
mdfcc.mrr

{'dataOrigin': 'experiment'}

In [25]:
mdfcc.create_mrr_block({})
mdfcc.mrr

{}

# Submitting a dataset
After you have created your submission with the above helpers, you can submit and check your submission with the following helpers.

## `get_submission`
`get_submission()` shows you your current submission, as it will be sent to MDF Connect. This method is a great way to check for any errors.

In [26]:
mdfcc.get_submission()

{'data': ['https://www.globus.org/app/transfer?origin_id=e38ee745-6d04-11e5-ba46-22000b92c6ec&origin_path=%2Fcitrine_mdf_demo%2Falloy.pbe%2FAlFe%2F'],
 'dc': {'creators': [{'affiliations': ['The Foo Bar Institute of Data'],
    'creatorName': 'Smith, Foo',
    'familyName': 'Smith',
    'givenName': 'Foo'},
   {'affiliations': ['The Foo Bar Institute of Data'],
    'creatorName': 'Smith, Bar',
    'familyName': 'Smith',
    'givenName': 'Bar'},
   {'affiliations': ['The Foo Bar Institute of Data'],
    'creatorName': 'Smith, Baz',
    'familyName': 'Smith',
    'givenName': 'Baz'}],
  'publicationYear': '2018',
  'publisher': 'The Journal of Foo-Bar Data',
  'resourceType': {'resourceType': 'Dataset',
   'resourceTypeGeneral': 'Dataset'},
  'titles': [{'title': 'Foo, Bar, and Baz in Big Data'}]},
 'test': True}

## `reset_submission`
If you need to clear away your entire submission, call `reset_submission()`. This is irreversible.

Caution: This method will clear the current `source_id`, which means that you will have to keep track of any previous `source_id`s from other submissions.

In [27]:
# mdfcc.reset_submission()

## `submit_dataset`
`submit_dataset()` will send your dataset to MDF Connect for indexing. You will get back the `source_id` if the submission is successful. The `source_id` is the unique identifier for your specific submission, and can be used to check the status of your submission later. The `source_id` is also saved to the client.

You can set `test=True` here to force a test submission (`test=False` is the default and has no effect). If you need to submit the same dataset again, you can use `resubmit=True` which bypasses the duplicate submission check. If you want to clear your submission after sending it, set `reset=True`.

If you have assembled your submission manually (not using the helpers), you can give the dictionary to `submission` and the method will send it for you.

In [28]:
mdfcc.submit_dataset()

'_test_foo_bar_baz_in_big_data_v4'
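
If you keep track of several `source_id`s, it may be handy to split them into parts. The sketch below is based only on the examples in this tutorial: it assumes a `source_id` is an optional `_test_` prefix, followed by the source name, followed by a `_v<number>` suffix. That layout is an inference from the sample output, not a documented format, and `split_source_id` is a hypothetical helper.

```python
import re


def split_source_id(source_id):
    """Split a source_id into test flag, name, and version (assumed layout)."""
    is_test = source_id.startswith("_test_")
    rest = source_id[len("_test_"):] if is_test else source_id
    # Assumption: the ID ends in "_v" followed by digits.
    match = re.fullmatch(r"(?P<name>.+)_v(?P<version>\d+)", rest)
    if match is None:
        return {"test": is_test, "source_name": rest, "version": None}
    return {"test": is_test,
            "source_name": match["name"],
            "version": int(match["version"])}


print(split_source_id("_test_foo_bar_baz_in_big_data_v4"))
```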

## `check_status`
To see the progress your submission is making, use `check_status()`. If you haven't cleared the submission from the client, you can use it without arguments to check the most recent submission status. You can also pass in a `source_id` to check the status of a different submission.

In [29]:
mdfcc.check_status()


Status of TEST convert submission _test_foo_bar_baz_in_big_data_v4 (Foo, Bar, and Baz in Big Data)
Submitted by Jonathon Gaff at 2018-07-05T14:22:37.862744Z

Conversion initialization was successful.
Conversion data download is in progress.
Data conversion has not started yet.
Ingestion preparation has not started yet.
Ingestion initialization has not started yet.
Ingestion data download has not started yet.
Integration data download has not started yet.
Globus Search ingestion has not started yet.
Globus Publish publication has not started yet.
Citrine upload has not started yet.
Materials Resource Registration has not started yet.
Post-processing cleanup has not started yet.



In [30]:
mdfcc.check_status("_test_foo_bar_baz_in_big_data_v1")


Status of TEST convert submission _test_foo_bar_baz_in_big_data_v1 (Foo, Bar, and Baz in Big Data)
Submitted by Jonathon Gaff at 2018-06-27T21:14:52.076218Z

Conversion initialization was successful.
Conversion data download was successful.
Data conversion was successful: 7 records parsed out of 7 groups.
Ingestion preparation was successful.
Ingestion initialization was successful.
Ingestion data download was not requested or required.
Integration data download was not requested or required.
Globus Search ingestion was successful.
Globus Publish publication was not requested or required.
Citrine upload was not requested or required.
Materials Resource Registration was not requested or required.
Post-processing cleanup was successful.



You can also use `raw=True` to get the raw response from MDF Connect.

This response is messy and not meant for general human consumption.
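
If you do need to work with the raw response, a small helper can condense it. The following sketch assumes the response contains a `status_list` of dictionaries with `signal` and `text` keys, as in the output shown below; `summarize_status` is a hypothetical helper, and the sample data is a shortened copy of that output.

```python
from collections import Counter


def summarize_status(raw_status):
    """Count how many processing steps reported each signal."""
    signals = Counter(step["signal"] for step in raw_status["status_list"])
    return dict(signals)


sample = {
    "source_id": "_test_foo_bar_baz_in_big_data_v1",
    "status_list": [
        {"signal": "success", "text": "Conversion initialization was successful."},
        {"signal": "success", "text": "Conversion data download was successful."},
        {"signal": "idle", "text": "Ingestion data download was not requested or required."},
    ],
}
print(summarize_status(sample))
```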

In [31]:
mdfcc.check_status("_test_foo_bar_baz_in_big_data_v1", raw=True)

{'source_id': '_test_foo_bar_baz_in_big_data_v1',
 'status_code': 'SSMSSNNSNNNS',
 'status_list': [{'signal': 'success',
   'text': 'Conversion initialization was successful.'},
  {'signal': 'success', 'text': 'Conversion data download was successful.'},
  {'signal': 'success',
   'text': 'Data conversion was successful: 7 records parsed out of 7 groups.'},
  {'signal': 'success', 'text': 'Ingestion preparation was successful.'},
  {'signal': 'success', 'text': 'Ingestion initialization was successful.'},
  {'signal': 'idle',
   'text': 'Ingestion data download was not requested or required.'},
  {'signal': 'idle',
   'text': 'Integration data download was not requested or required.'},
  {'signal': 'success', 'text': 'Globus Search ingestion was successful.'},
  {'signal': 'idle',
   'text': 'Globus Publish publication was not requested or required.'},
  {'signal': 'idle', 'text': 'Citrine upload was not requested or required.'},
  {'signal': 'idle',
   'text': 'Materials Resource Regi