# Getting Data

*author: Joseph Montoya*

This notebook demonstrates a few basic examples of matminer's data retrieval features.  Matminer supports data retrieval from the following sources:

* [Materials Project](https://materialsproject.org)
* [Citrine Informatics](https://citrination.com)
* [The Materials Platform for Data Science (MPDS)](https://mpds.io)
* [The Materials Data Facility](https://materialsdatafacility.org/)

Each resource has a corresponding object in matminer designed for retrieving data and preprocessing it into a pandas dataframe.  Matminer can also access and aggregate data from your own [MongoDB database](https://www.mongodb.com/), if you have one.

## Materials Project

The Materials Project data retrieval tool, `matminer.data_retrieval.retrieve_MP.MPDataRetrieval`, is initialized using an API key that can be found on your personal dashboard page on [materialsproject.org](https://materialsproject.org) once you've created an account.  If you've set your API key via pymatgen (e.g. `pmg config --add PMG_MAPI_KEY YOUR_API_KEY_HERE`), the data retrieval tool may be initialized without an input argument.

In [None]:
from matminer.data_retrieval.retrieve_MP import MPDataRetrieval

In [None]:
mpdr = MPDataRetrieval()  # or MPDataRetrieval(api_key="YOUR_API_KEY_HERE")

Getting a dataframe from the Materials Project is essentially equivalent to using the MPRester's query method (see [`pymatgen.ext.matproj.MPRester`](http://pymatgen.org/_modules/pymatgen/ext/matproj.html)).  The inputs are `criteria`, a mongo-style dictionary used to filter the data, and `properties`, a list of supported properties to return.  See the [MAPI documentation](https://github.com/materialsproject/mapidoc/tree/master/materials) for a list of supported properties and information about them.
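
To make the `criteria` syntax concrete, here is a minimal, self-contained sketch of how a mongo-style dictionary selects records. The `matches` helper and the sample records are hypothetical and only illustrate the semantics. The real filtering is done by the Materials Project database, and this sketch covers only plain equality, `$gt`, and `$exists` on top-level keys.

```python
def matches(doc, criteria):
    """Return True if `doc` satisfies every condition in `criteria`.

    Hypothetical helper covering a small subset of mongo-style operators.
    """
    for key, condition in criteria.items():
        if isinstance(condition, dict):
            # Operator form, e.g. {"band_gap": {"$gt": 4.0}}
            for op, operand in condition.items():
                if op == "$gt":
                    value = doc.get(key)
                    if value is None or not value > operand:
                        return False
                elif op == "$exists":
                    if (key in doc) != operand:
                        return False
                else:
                    raise ValueError("Unsupported operator: {}".format(op))
        elif doc.get(key) != condition:
            # Plain form, e.g. {"nelements": 1}; None also matches a missing key
            return False
    return True


# Made-up records for illustration only, not real Materials Project data
docs = [
    {"pretty_formula": "A", "nelements": 1, "band_gap": 0.0},
    {"pretty_formula": "AB", "nelements": 2, "band_gap": 5.2},
    {"pretty_formula": "AB2", "nelements": 2},
]

elemental = [d["pretty_formula"] for d in docs if matches(d, {"nelements": 1})]
wide_gap = [d["pretty_formula"] for d in docs if matches(d, {"band_gap": {"$gt": 4.0}})]
print(elemental, wide_gap)
```

Every key in the dictionary must be satisfied at once, so adding a key narrows the selection.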

#### Example 1: Get densities of all elemental materials, i.e. those that contain one element

In [None]:
df = mpdr.get_dataframe({"nelements": 1}, ['density', 'pretty_formula'])
print("There are {} entries on MP with 1 element".format(df['density'].count()))

In [None]:
df.head()

#### Example 2: Get all band gaps larger than 4.0 eV

In [None]:
df = mpdr.get_dataframe({"band_gap": {"$gt": 4.0}}, ['band_gap'])

In [None]:
print("There are {} entries on MP with a band gap larger than 4.0".format(df['band_gap'].count()))

In [None]:
df.head()

#### Example 3: Get all VRH shear and bulk moduli from the "elasticity" sub-document for which no warnings are found

In [None]:
df = mpdr.get_dataframe({"elasticity": {"$exists": True}, "elasticity.warnings": None},
                        ['elasticity.K_VRH', 'elasticity.G_VRH'])

In [None]:
print("There are {} elastic entries on MP with no warnings".format(df['elasticity.K_VRH'].count()))

In [None]:
df.head()
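
The dotted property names in Example 3 address fields inside a nested sub-document, and they also become the dataframe column names. A minimal sketch of that lookup follows, assuming a nested record shaped like the "elasticity" sub-document. The `project` helper and the record values are hypothetical and are not matminer's implementation.

```python
def project(doc, properties):
    """Build one flat row by resolving dotted names such as 'elasticity.K_VRH'.

    Hypothetical helper; missing fields resolve to None.
    """
    row = {}
    for prop in properties:
        value = doc
        for part in prop.split("."):
            # Step one level deeper, or give up if we are no longer in a dict
            value = value.get(part) if isinstance(value, dict) else None
        row[prop] = value
    return row


# Made-up record for illustration only, not real Materials Project data
doc = {
    "material_id": "example-1",
    "elasticity": {"K_VRH": 100.0, "G_VRH": 50.0, "warnings": None},
}

row = project(doc, ["elasticity.K_VRH", "elasticity.G_VRH"])
print(row)
```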

## Citrine Informatics

The Citrine Informatics data retrieval tool, `matminer.data_retrieval.retrieve_Citrine.CitrineDataRetrieval`, is initialized using an API key that can be found on the "Account Settings" tab under your username in the upper right-hand corner of the user interface at [citrination.com](https://citrination.com).  You can also set the environment variable `CITRINE_KEY` to have your API key read automatically by the Citrine Informatics Python API (e.g. put `export CITRINE_KEY=YOUR_API_KEY_HERE` into your .bashrc).
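
The fallback from an explicit key to an environment variable can be sketched as follows. This is a generic illustration of the pattern, not the code matminer or the Citrine client actually runs. The `resolve_api_key` helper and the `DEMO_API_KEY` variable are hypothetical names.

```python
import os


def resolve_api_key(api_key=None, env_var="CITRINE_KEY"):
    """Prefer an explicitly passed key, else fall back to the environment.

    Hypothetical helper illustrating the lookup order only.
    """
    if api_key is not None:
        return api_key
    key = os.environ.get(env_var)
    if key is None:
        raise ValueError("No API key given and ${} is not set".format(env_var))
    return key


# Demonstration with a placeholder variable so no real key is needed
os.environ["DEMO_API_KEY"] = "key-from-environment"
from_env = resolve_api_key(env_var="DEMO_API_KEY")
explicit = resolve_api_key("key-from-argument", env_var="DEMO_API_KEY")
print(from_env, explicit)
```

Keeping the key in the environment rather than in the notebook avoids committing credentials along with your code.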

In [None]:
from matminer.data_retrieval.retrieve_Citrine import CitrineDataRetrieval

#### Example 1: Get band gaps of various entries with formula Si

In [None]:
cdr = CitrineDataRetrieval() # or CitrineDataRetrieval(api_key=YOUR_API_KEY) if $CITRINE_KEY is not set

In [None]:
df = cdr.get_dataframe(formula='Si', prop='band gap', data_type='EXPERIMENTAL')

In [None]:
df.head()

#### Example 2: Get adsorption energies of O\* and OH\*

In [None]:
df_OH = cdr.get_dataframe(prop='adsorption energy of OH')
df_O = cdr.get_dataframe(prop='adsorption energy of O')

In [None]:
df_OH.head()

In [None]:
df_O.head()

## MPDS - The Materials Platform for Data Science

The [Materials Platform for Data Science](https://mpds.io/) interface is contained in `matminer.data_retrieval.retrieve_MPDS.MPDSDataRetrieval`, and is invoked using an API key and an optional endpoint.  As with the Citrine and MP interfaces, MPDS can be invoked without specifying your API key if `MPDS_KEY` is set as an environment variable (e.g. put `export MPDS_KEY=YOUR_MPDS_KEY` into your .bashrc or .bash_profile).

In [None]:
from matminer.data_retrieval.retrieve_MPDS import MPDSDataRetrieval

In [None]:
mpdsdr = MPDSDataRetrieval() # or MPDSDataRetrieval(api_key=YOUR_API_KEY)

The `get_dataframe` method of the `MPDSDataRetrieval` class uses the search functionality documented on the [MPDS website](http://developer.mpds.io/#Categories).  The `search` keyword argument takes a dictionary whose keys are search categories and whose values are the terms to search for in each category.  Note that the search functionality of the MPDS interface may be severely limited without full (i.e. paid subscription) access to the database.

In [None]:
df = mpdsdr.get_dataframe(search={"elements": "K-Ag", 
                                  "props": "heat capacity"})

In [None]:
df.head()

## MDF - The Materials Data Facility


The MDF data retrieval tool, `matminer.data_retrieval.retrieve_MDF.MDFDataRetrieval`, is initialized using a Globus initialization key.  The first time you create an `MDFDataRetrieval` object, you should be prompted with a string of numbers and letters that you can enter on the MDF Globus authentication website.  Authentication is optional, however: with `anonymous=True`, several of the MDF datasets are available without logging in.  A number of them are not, so you will have to authenticate via the web to access the entirety of MDF.

In [None]:
from matminer.data_retrieval.retrieve_MDF import MDFDataRetrieval

In [None]:
mdf_dr = MDFDataRetrieval(anonymous=True)  # or anonymous=False if you have a Globus login

In [None]:
df = mdf_dr.get_dataframe(elements=['Ag', 'Be'], sources=["oqmd"])

In [None]:
df.head()

In [None]:
print("There are {} entries in the Ag-Be chemical system".format(len(df)))