# Materials data science: data retrieval and filtering – Exercises

## Load and examine the `elastic_tensor` dataset

Matminer includes a dataset called "elastic_tensor". It contains a set of computed elastic properties of materials sourced from the paper:

> "Charting the complete elastic properties of inorganic crystalline compounds", M. de Jong et al., Sci. Data 2 (2015) 150009.

Load this dataset using the `load_dataset()` function and determine:
- the number of entries it contains (tip: pandas `DataFrame` objects have a `count()` method)
- the largest value of bulk modulus in the dataset (bulk modulus is given in the `K_VRH` column)

There are two ways to load most matminer datasets. How else can you access the `elastic_tensor` dataset?
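
As a warm-up for the two questions above, here is a minimal sketch of how row counts and column maxima behave in pandas. It uses a toy `DataFrame` with invented values (the names `toy`, `n_rows`, `non_null`, and `k_max` are hypothetical and the numbers are not taken from the real dataset), so it only illustrates the mechanics.

```python
import pandas as pd

# Toy stand-in for the real dataset; every value here is invented.
toy = pd.DataFrame(
    {
        "formula": ["A", "B", "C"],
        "K_VRH": [10.0, float("nan"), 250.0],
    }
)

n_rows = len(toy)            # number of rows, missing values included
non_null = toy.count()       # per-column count of non-missing values
k_max = toy["K_VRH"].max()   # largest value in the column; NaN is skipped

print(n_rows)
print(non_null)
print(k_max)
```

Note that `count()` returns one number per column and ignores missing values, so it only equals the number of rows for columns without gaps.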

In [None]:
from matminer.datasets import load_dataset

df = load_dataset(______)

## Filter the dataset based on the number of sites

You are constructing a machine learning model for elastic constants that is only intended for structures containing a small number of atomic sites. Filter the dataset to include only entries where `nsites` is less than 20 and determine:
- the number of entries in the filtered dataset
- the number of entries that you have removed from the original dataset
- the average number of sites across all entries in your filtered dataset
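
One common way to do this kind of filtering in pandas is a boolean mask. The sketch below applies it to a toy `DataFrame` with invented values (the names `toy`, `mask`, and `small` are hypothetical), assuming the same `nsites` column name as the real dataset.

```python
import pandas as pd

# Toy stand-in for the real dataset; every value here is invented.
toy = pd.DataFrame(
    {
        "formula": ["A", "B", "C", "D"],
        "nsites": [2, 8, 20, 36],
    }
)

mask = toy["nsites"] < 20        # boolean Series, True where the row is kept
small = toy[mask]                # new DataFrame with only the matching rows

n_kept = len(small)
n_removed = len(toy) - n_kept
mean_sites = small["nsites"].mean()

print(n_kept, n_removed, mean_sites)
```

Because the comparison is strict, the row with exactly 20 sites is excluded. Indexing with a mask returns a new object, so `toy` itself is left unchanged.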


In [None]:
from matminer.datasets.convenience_loaders import load_elastic_tensor

df = load_elastic_tensor()

## Remove columns unnecessary for machine learning

The elastic tensor dataset contains many columns that are not particularly relevant for machine learning. Filter the dataset so that it only contains the `formula`, `structure`, and `K_VRH` (bulk modulus) columns.

*Tip: pandas `DataFrame` objects implement a `drop()` method that can be used for dropping both rows and columns. Make sure you set the `axis` argument correctly.*
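
To illustrate the `axis` argument mentioned in the tip, the sketch below drops columns from a toy `DataFrame` and compares the result with selecting the wanted columns directly. The data are invented and the names `toy`, `dropped`, and `selected` are hypothetical; the toy frame also omits the `structure` column for brevity.

```python
import pandas as pd

# Toy stand-in for the real dataset; every value here is invented.
toy = pd.DataFrame(
    {
        "formula": ["A", "B"],
        "nsites": [2, 4],
        "K_VRH": [10.0, 20.0],
        "G_VRH": [5.0, 8.0],
    }
)

# axis=1 makes drop() interpret the labels as column names;
# the default axis=0 would look for row labels instead.
dropped = toy.drop(["nsites", "G_VRH"], axis=1)

# Alternative: keep only the columns you want by passing a list of names.
selected = toy[["formula", "K_VRH"]]

print(list(dropped.columns))
print(dropped.equals(selected))
```

Dropping is convenient when only a few columns must go, while selecting is shorter when only a few should stay. Neither call modifies `toy` in place.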

In [None]:
from matminer.datasets.convenience_loaders import load_elastic_tensor

df = load_elastic_tensor()

## Advanced exercise: calculate Young's modulus

Young's modulus, $E$, is given by:

$$
E = \frac{9KG}{G+3K},
$$

where $K$ is the bulk modulus (column `K_VRH`), and $G$ is the shear modulus (column `G_VRH`).

Calculate Young's modulus for all entries in the dataset and store the values in a new column called `E_VRH`. What is the average Young's modulus over the entire dataset?
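
Column arithmetic in pandas is element-wise, so the formula can be applied to whole columns at once. The following sketch evaluates it on a toy `DataFrame` with invented moduli (the values and the names `toy`, `K`, `G`, and `mean_E` are hypothetical), assuming the same column names as the real dataset.

```python
import pandas as pd

# Toy stand-in for the real dataset; every value here is invented.
toy = pd.DataFrame(
    {
        "K_VRH": [100.0, 200.0],   # bulk modulus
        "G_VRH": [50.0, 100.0],    # shear modulus
    }
)

K = toy["K_VRH"]
G = toy["G_VRH"]

# E = 9KG / (G + 3K), evaluated row by row without an explicit loop.
toy["E_VRH"] = 9 * K * G / (G + 3 * K)

mean_E = toy["E_VRH"].mean()

print(toy)
print(mean_E)
```

For the first toy row this gives $E = 9 \cdot 100 \cdot 50 / (50 + 300) \approx 128.6$, which is a quick way to check the expression by hand.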

In [None]:
from matminer.datasets.convenience_loaders import load_elastic_tensor

df = load_elastic_tensor()