# Part 2: Generating descriptors for machine learning

In this lesson, we will learn how to generate machine-learning descriptors from pymatgen materials objects. First, we'll generate some descriptors with matminer's "featurizer" classes. Next, we'll use what we learned about dataframes in the previous section to examine our descriptors and prepare them for input to machine learning models.


<img src="resources/featurizers_overview.png" alt="featurizers overview" style="width: 700px;"/>

### Featurizers transform materials primitives into machine-learnable features

The general idea of featurizers is that they accept a materials primitive (e.g., pymatgen Composition) and output a vector. For example:


\begin{align}
f(\mathrm{Fe}_2\mathrm{O}_3) \rightarrow [1.5, 7.8, 9.1, 0.09]
\end{align}
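
As a rough illustration of this mapping, the sketch below turns an Fe2O3-like composition into a short numeric vector. The function name `toy_featurize`, the choice of features, and the small atomic-number table are assumptions made for this example; they are not matminer's method and will not reproduce the numbers in the equation above.

```python
# Toy stand-in for a featurizer: composition in, fixed-length vector out.
# ATOMIC_NUMBER and toy_featurize are illustrative names, not matminer API.
ATOMIC_NUMBER = {"O": 8, "Fe": 26}


def toy_featurize(composition):
    """Map {element: amount} to [n_elements, max fraction, mean atomic number]."""
    total = sum(composition.values())
    fractions = {el: amt / total for el, amt in composition.items()}
    mean_z = sum(frac * ATOMIC_NUMBER[el] for el, frac in fractions.items())
    return [float(len(fractions)), max(fractions.values()), mean_z]


vector = toy_featurize({"Fe": 2, "O": 3})
print(vector)  # element count, largest fraction, mean atomic number
```

The key property is that every input yields a vector of the same length and ordering, which is what lets a model consume many materials at once.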

#### Matminer contains featurizers for the following pymatgen objects:
* Composition
* Crystal structure
* Crystal sites
* Bandstructure
* Density of states

#### Depending on the featurizer, the features returned may be:
* numerical, categorical, or mixed vectors
* matrices 
* other pymatgen objects (for further processing)

### Featurizers play nice with dataframes
Because materials data usually lives in pandas dataframes, all featurizers work natively with them. We'll provide examples of this later in the lesson.


### Featurizers present in matminer
Matminer hosts over 60 featurizers, most of which implement methods published in peer-reviewed papers. You can find a full list of featurizers on the [matminer website](https://hackingmaterials.lbl.gov/matminer/featurizer_summary.html). All featurizers have parallelization and error tolerance built into their core methods.

In this lesson, we'll go over the main methods present in all featurizers. By the end of this unit, you will be able to generate descriptors for a wide range of materials informatics problems using one common software interface.

## The `featurize` method and basics

The core method of any matminer featurizer is `featurize`. This method accepts a materials object and returns a machine learning vector or matrix. Let's see an example on a pymatgen composition:

In [None]:
from pymatgen.core import Composition

fe2o3 = Composition("Fe2O3")

As a trivial example, we'll get the element fractions with the `ElementFraction` featurizer.

In [None]:
from matminer.featurizers.composition import ElementFraction

# Featurizers are classes, so create an instance before featurizing
ef = ElementFraction()

Now we can featurize our composition.

We've managed to generate features for learning, but what do they mean? One way to check is by reading the `Features` section in the documentation of any featurizer... but a much easier way is to use the `feature_labels()` method.

We now see the labels in the order that we generated the features. 
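
Because the values from `featurize` and the names from `feature_labels` come back in the same order, the two can be zipped into a labeled record. The sketch below assumes only those two methods; `ToyElementFraction` and `labeled_features` are hypothetical names for this example, and the toy class covers just three elements so that it runs without matminer installed.

```python
class ToyElementFraction:
    """Minimal stand-in mimicking the featurize/feature_labels pair."""

    elements = ["H", "O", "Fe"]  # fixed order, so every vector has the same layout

    def featurize(self, composition):
        total = sum(composition.values())
        return [composition.get(el, 0) / total for el in self.elements]

    def feature_labels(self):
        return list(self.elements)


def labeled_features(featurizer, obj):
    """Pair each feature value with its label, keeping the featurizer's order."""
    return dict(zip(featurizer.feature_labels(), featurizer.featurize(obj)))


record = labeled_features(ToyElementFraction(), {"Fe": 2, "O": 3})
print(record)
```

With matminer installed, the same helper should accept a real featurizer, for example `labeled_features(ElementFraction(), Composition("Fe2O3"))`, since it relies only on the two methods named above.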

## Featurizing dataframes

We just generated descriptors and their labels for a single sample, but most of the time our data lives in pandas dataframes. Fortunately, matminer featurizers implement a `featurize_dataframe()` method that works directly on dataframes.

Let's grab a new dataset from matminer and use our `ElementFraction` featurizer on it.

First, we download a dataset as we did in the previous unit. In this example, we'll download a dataset of superhard materials.

In [None]:
from matminer.datasets.dataset_retrieval import load_dataset

df = load_dataset("brgoch_superhard_training")
df.head()

Next, we can use the `featurize_dataframe()` method (implemented by all featurizers) to apply ElementFraction to all of our data at once. The only required arguments are the dataframe as input and the input column name (in this case it is `composition`). `featurize_dataframe()` is parallelized by default using multiprocessing.

If we look at the dataframe, we can see our new feature columns.
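
Conceptually, featurizing a dataframe means applying `featurize` to one column row by row and appending the results as labeled columns. The sketch below shows that idea with plain pandas; `add_features` and `ToyElementFraction` are hypothetical names, and the code leaves out the parallelization and error handling that matminer's `featurize_dataframe()` provides.

```python
import pandas as pd


class ToyElementFraction:
    """Stand-in featurizer covering three elements (not matminer code)."""

    elements = ["B", "C", "N"]

    def featurize(self, composition):
        total = sum(composition.values())
        return [composition.get(el, 0) / total for el in self.elements]

    def feature_labels(self):
        return list(self.elements)


def add_features(df, featurizer, col_id):
    """Return a copy of df with one new column per feature label."""
    rows = [featurizer.featurize(obj) for obj in df[col_id]]
    features = pd.DataFrame(rows, columns=featurizer.feature_labels(), index=df.index)
    return pd.concat([df, features], axis=1)


toy_df = pd.DataFrame({"composition": [{"B": 1, "N": 1}, {"B": 4, "C": 1}]})
featurized = add_features(toy_df, ToyElementFraction(), "composition")
print(featurized)
```

With matminer itself, the corresponding call takes the dataframe and the input column name, e.g. `ElementFraction().featurize_dataframe(df, "composition")`.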

## Structure Featurizers

The same syntax works for other kinds of featurizers. Let's now assign descriptors to structures. First, let's load a dataset containing structures.

In [None]:
df = load_dataset("phonon_dielectric_mp")

df.head()

Let's calculate some basic density features of these structures using `DensityFeatures`.

In [None]:
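from matminer.featurizers.structure import DensityFeatures

# Instantiate the featurizer and list the features it will produce
densityf = DensityFeatures()
densityf.feature_labels()
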
These are the features we will get. Now we use `featurize_dataframe()` to generate these features for all the samples in the dataframe. Since we are using the structures as input to the featurizer, we select the "structure" column.

Let's examine the dataframe and see the structural features.
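
To give a feel for what density-type features measure, the sketch below computes mass density and volume per atom by hand for a rock-salt NaCl cell. The lattice parameter (about 5.64 angstrom) and the rounded atomic masses are approximate textbook values, and `toy_density_features` is a hypothetical helper, not matminer's `DensityFeatures` implementation.

```python
AMU_PER_A3_TO_G_PER_CM3 = 1.66054  # 1 amu per cubic angstrom, expressed in g/cm^3


def toy_density_features(masses_amu, volume_a3):
    """Return (density in g/cm^3, volume per atom in cubic angstrom) for one cell."""
    density = sum(masses_amu) * AMU_PER_A3_TO_G_PER_CM3 / volume_a3
    volume_per_atom = volume_a3 / len(masses_amu)
    return density, volume_per_atom


# Conventional NaCl cell: 4 Na + 4 Cl atoms in a cube of edge ~5.64 angstrom.
masses = [22.99] * 4 + [35.45] * 4
density, vpa = toy_density_features(masses, 5.64 ** 3)
print(f"density ~ {density:.2f} g/cm^3, volume per atom ~ {vpa:.1f} A^3")
```

Both quantities depend on the cell geometry as well as the chemistry, which is why they require a structure rather than a composition as input.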

## Conversion Featurizers

In addition to Bandstructure/DOS/Structure/Composition featurizers, matminer also provides a featurizer interface for converting between pymatgen objects (e.g., assigning oxidation states to compositions) in a fault-tolerant fashion. These featurizers are found in `matminer.featurizers.conversions` and work with the same `featurize`/`featurize_dataframe` syntax as the other featurizers.

The dataset we loaded previously stores each material's formula only as a string in the `formula` column. To convert this data into a `composition` column containing pymatgen `Composition` objects, we can use the `StrToComposition` conversion featurizer on the `formula` column.

In [None]:
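from matminer.featurizers.conversions import StrToComposition

# Convert the formula strings into pymatgen Composition objects
df = StrToComposition().featurize_dataframe(df, "formula")
df.head()
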
We can see a new `composition` column has been added to the dataframe.
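
The core of such a conversion is turning a formula string into a structured object. The sketch below does this with a regular expression and returns `None` instead of raising on unparsable input, loosely mirroring the fault-tolerant behaviour described above. `parse_formula` is a hypothetical helper that handles only flat formulas such as `Fe2O3` (no parentheses, no check that symbols are real elements); pymatgen's `Composition` is far more general.

```python
import re

# An element symbol followed by an optional integer or decimal amount.
TOKEN = re.compile(r"([A-Z][a-z]?)(\d+(?:\.\d+)?)?")


def parse_formula(formula):
    """Parse a flat formula like 'Fe2O3' into {element: amount}, or None on failure."""
    amounts = {}
    position = 0
    for match in TOKEN.finditer(formula):
        if match.start() != position:  # unexpected characters between tokens
            return None
        element, count = match.groups()
        amounts[element] = amounts.get(element, 0.0) + float(count or 1)
        position = match.end()
    if position != len(formula) or not amounts:
        return None
    return amounts


for text in ["Fe2O3", "TiO2", "not a formula"]:
    print(text, "->", parse_formula(text))
```

Returning a sentinel rather than raising keeps one malformed string from halting the conversion of an entire column.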

## Advanced capabilities

Featurizers have several powerful capabilities worth mentioning briefly before we practice (and _many_ more not covered here).

**Dealing with Errors**

Often, data is messy and certain featurizers will encounter errors. Set `ignore_errors=True` in `featurize_dataframe()` to skip errors; if you'd like to see the errors returned in an additional column, also set `return_errors=True`.
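
The effect of these two flags can be pictured as a try/except around each row. The sketch below is an illustration under stated assumptions rather than matminer's code: `safe_featurize` and `PickyFeaturizer` are hypothetical names, failed rows are filled with NaN, and the error message is returned alongside each row.

```python
import math


class PickyFeaturizer:
    """Stand-in featurizer that fails on empty input (not matminer code)."""

    def featurize(self, composition):
        total = sum(composition.values())
        if total == 0:
            raise ValueError("empty composition")
        return [composition.get("O", 0) / total]

    def feature_labels(self):
        return ["O fraction"]


def safe_featurize(featurizer, objects):
    """Featurize each object; on failure return NaNs plus the error message."""
    n_features = len(featurizer.feature_labels())
    results = []
    for obj in objects:
        try:
            results.append((featurizer.featurize(obj), None))
        except Exception as exc:  # broad on purpose: any failure becomes a NaN row
            results.append(([math.nan] * n_features, str(exc)))
    return results


rows = safe_featurize(PickyFeaturizer(), [{"Fe": 2, "O": 3}, {}])
for features, error in rows:
    print(features, error)
```

Keeping the error text next to the NaN row makes it easy to filter out failed samples later and inspect why they failed.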

**Citing the authors**

Many featurizers implement methods from peer-reviewed studies. Please cite these original works using the `citations()` method, which returns the BibTeX-formatted references in a Python list. For example:
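
When a workflow combines several featurizers, their references can be gathered in one pass. The helper below assumes only that each object has a `citations()` method returning a list of strings, as described above; `collect_citations`, `StubFeaturizer`, and the placeholder BibTeX entries are made up for illustration and do not correspond to real publications.

```python
class StubFeaturizer:
    """Stand-in exposing citations(); the entries are placeholders, not real papers."""

    def __init__(self, entries):
        self._entries = entries

    def citations(self):
        return list(self._entries)


def collect_citations(featurizers):
    """Return the unique citation strings across featurizers, in first-seen order."""
    unique = []
    for featurizer in featurizers:
        for entry in featurizer.citations():
            if entry not in unique:
                unique.append(entry)
    return unique


a = StubFeaturizer(["@misc{placeholder_a, title={Placeholder A}}"])
b = StubFeaturizer([
    "@misc{placeholder_a, title={Placeholder A}}",
    "@misc{placeholder_b, title={Placeholder B}}",
])
references = collect_citations([a, b])
print("\n\n".join(references))
```

With matminer installed, real featurizer instances could be passed in place of the stubs, since the helper calls nothing but `citations()`.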

## Let's practice!

Now, let's practice. You'll pick up where you left off from the last lesson, add some descriptors using the techniques described here, and prepare your data for the final unit.