# Lesson 2: Generating descriptors for machine learning

In this lesson, we will learn how to generate machine-learning descriptors from materials objects in pymatgen. First, we'll generate some descriptors with matminer's "featurizer" classes. Next, we'll use some of what we learned about dataframes in the previous section to examine our descriptors and prepare them for input to machine learning models.


<img src="resources/featurizers_overview.png" alt="featurizers overview" style="width: 700px;"/>

### Featurizers transform materials primitives into machine-learnable features

The general idea of featurizers is that they accept a materials primitive (e.g., pymatgen Composition) and output a tensor. For example:


\begin{align}
f(\text{Fe2O3}) \rightarrow [1.5, 7.8, 9.1, 0.09]
\end{align}

#### Matminer contains featurizers for the following pymatgen objects:
* Composition
* Crystal structure
* Crystal sites
* Bandstructure
* DOS

#### Depending on the featurizer, the features returned may be:
* numerical, categorical, or mixed vectors
* matrices 
* other pymatgen objects (for further processing)

### Featurizers play nice with dataframes
Because most of our data lives in pandas dataframes, all featurizers work natively with them. We'll provide examples of this later in the lesson.


### Featurizers present in matminer
Matminer hosts over 60 featurizers, most of which implement methods published in peer-reviewed papers. You can find a full list of featurizers on the [matminer website](https://hackingmaterials.lbl.gov/matminer/featurizer_summary.html). All featurizers have parallelization and convenient error tolerance built into their core methods.

In this lesson, we'll go over the main methods present in all featurizers. By the end of this unit, you will be able to generate descriptors for a wide range of materials informatics problems using one common software interface.
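
To make the idea of "one common software interface" concrete, here is a minimal sketch in plain Python. The names `ToyFeaturizer` and `AtomCount` are hypothetical and are not part of matminer; the sketch only mirrors the pattern of a `featurize` method paired with a `feature_labels` method, plus a helper that applies `featurize` to many entries.

```python
from abc import ABC, abstractmethod


class ToyFeaturizer(ABC):
    """Hypothetical stand-in for a featurizer base class (not matminer's code)."""

    @abstractmethod
    def featurize(self, x):
        """Return a list of feature values for one input object."""

    @abstractmethod
    def feature_labels(self):
        """Return one label per value produced by featurize()."""

    def featurize_many(self, entries):
        """Apply featurize() to every entry, giving one row of features each."""
        return [self.featurize(x) for x in entries]


class AtomCount(ToyFeaturizer):
    """Toy featurizer whose input is a dict of element symbol -> amount."""

    def featurize(self, x):
        return [len(x), sum(x.values())]

    def feature_labels(self):
        return ["n_elements", "n_atoms"]


counter = AtomCount()
labels = counter.feature_labels()
rows = counter.featurize_many([{"Fe": 2, "O": 3}, {"Zn": 1, "O": 1}])
print(labels)  # ['n_elements', 'n_atoms']
print(rows)    # [[2, 5], [2, 2]]
```

Because every featurizer exposes the same two methods, code that consumes features does not need to know which featurizer produced them.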

## Part 1: The "featurize" method and basics

### 1.1 The "featurize" method
The core method of any matminer featurizer is `featurize`. This method accepts a materials object and returns a machine learning vector or matrix. Let's see an example on a pymatgen composition:

In [1]:
from pymatgen.core import Composition  # root-level imports were removed in recent pymatgen releases

fe2o3 = Composition("Fe2O3")

As a trivial example, we'll get the element fractions with the `ElementFraction` featurizer.

In [2]:
from matminer.featurizers.composition import ElementFraction

ef = ElementFraction()

Now we can featurize our composition.

In [3]:
element_fractions = ef.featurize(fe2o3)

print(element_fractions)

[0, 0, 0, 0, 0, 0, 0, 0.6, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]


### 1.2 Feature labels

We've managed to generate features for learning, but what do they mean? One way to check is by reading the `Features` section in the documentation of any featurizer, but a much easier way is to use the `feature_labels` method.

In [4]:
element_fraction_labels = ef.feature_labels()
print(element_fraction_labels)

['H', 'He', 'Li', 'Be', 'B', 'C', 'N', 'O', 'F', 'Ne', 'Na', 'Mg', 'Al', 'Si', 'P', 'S', 'Cl', 'Ar', 'K', 'Ca', 'Sc', 'Ti', 'V', 'Cr', 'Mn', 'Fe', 'Co', 'Ni', 'Cu', 'Zn', 'Ga', 'Ge', 'As', 'Se', 'Br', 'Kr', 'Rb', 'Sr', 'Y', 'Zr', 'Nb', 'Mo', 'Tc', 'Ru', 'Rh', 'Pd', 'Ag', 'Cd', 'In', 'Sn', 'Sb', 'Te', 'I', 'Xe', 'Cs', 'Ba', 'La', 'Ce', 'Pr', 'Nd', 'Pm', 'Sm', 'Eu', 'Gd', 'Tb', 'Dy', 'Ho', 'Er', 'Tm', 'Yb', 'Lu', 'Hf', 'Ta', 'W', 'Re', 'Os', 'Ir', 'Pt', 'Au', 'Hg', 'Tl', 'Pb', 'Bi', 'Po', 'At', 'Rn', 'Fr', 'Ra', 'Ac', 'Th', 'Pa', 'U', 'Np', 'Pu', 'Am', 'Cm', 'Bk', 'Cf', 'Es', 'Fm', 'Md', 'No', 'Lr']


We now see the labels in the order that we generated the features. 

In [5]:
print(element_fraction_labels[7], element_fractions[7])
print(element_fraction_labels[25], element_fractions[25])

O 0.6
Fe 0.4
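
Looking up indices by hand gets tedious for long vectors. A small sketch of one alternative is to pair each label with its value and keep only the nonzero entries. The two lists below are shortened, hypothetical stand-ins for the output of `ef.feature_labels()` and `ef.featurize(fe2o3)`, so that the snippet runs on its own.

```python
# Shortened, hypothetical stand-ins for ef.feature_labels() and ef.featurize(fe2o3)
labels = ["H", "O", "Fe", "Lr"]
values = [0, 0.6, 0.4, 0]

# Pair each label with its value and keep only the elements that are present
nonzero = {label: value for label, value in zip(labels, values) if value != 0}
print(nonzero)  # {'O': 0.6, 'Fe': 0.4}
```

The same pattern should work with the full lists, since the labels and the features are returned in the same order.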


## Part 2: Featurizing with dataframes

We just generated some descriptors and their labels from an individual sample. But most of the time our data is in pandas dataframes! Fortunately, matminer's featurizers implement a `featurize_dataframe` method which interacts natively with dataframes.


Let's grab a new dataset from matminer and use our `ElementFraction` featurizer on it.

### 2.1 Prepare the dataset
First, we download a dataset as we did in the previous unit. In this example, we'll download a dataset of experimental thermal conductivities.

In [6]:
from matminer.datasets.dataset_retrieval import load_dataset

df = load_dataset("citrine_thermal_conductivity")

df.head()

Unnamed: 0,formula,k_expt,k-units,k_condition,k_condition_units
0,BeS,157.0,W/m.K,room temperature,"[{'name': 'Temperature', 'scalars': [{'value':..."
1,CdS,19.9,W/m.K,room temperature,"[{'name': 'Temperature', 'scalars': [{'value':..."
2,GaN,181.0,W/m.K,room temperature,"[{'name': 'Temperature', 'scalars': [{'value':..."
3,ZnO,64.5,W/m.K,room temperature,"[{'name': 'Temperature', 'scalars': [{'value':..."
4,ZnSe,15.6,W/m.K,room temperature,"[{'name': 'Temperature', 'scalars': [{'value':..."


We need to convert the string objects in the "formula" column to pymatgen `Composition` objects.  For now, we'll just use a simple list comprehension (later, we'll show how to use conversion featurizers for more robust conversions). 

In [7]:
df["composition"] = [Composition(f) for f in df["formula"]]
df.head()

Unnamed: 0,formula,k_expt,k-units,k_condition,k_condition_units,composition
0,BeS,157.0,W/m.K,room temperature,"[{'name': 'Temperature', 'scalars': [{'value':...","(Be, S)"
1,CdS,19.9,W/m.K,room temperature,"[{'name': 'Temperature', 'scalars': [{'value':...","(Cd, S)"
2,GaN,181.0,W/m.K,room temperature,"[{'name': 'Temperature', 'scalars': [{'value':...","(Ga, N)"
3,ZnO,64.5,W/m.K,room temperature,"[{'name': 'Temperature', 'scalars': [{'value':...","(Zn, O)"
4,ZnSe,15.6,W/m.K,room temperature,"[{'name': 'Temperature', 'scalars': [{'value':...","(Zn, Se)"


### 2.2 Using "featurize_dataframe"

Next, we can use the `featurize_dataframe` method (implemented in all featurizers) to apply `ElementFraction` to all of our data at once. The only required arguments are the dataframe and the name of the input column (in this case, `composition`). `featurize_dataframe` is parallelized by default using multiprocessing, although a dataset this small does not really need it.

In [8]:
feature_df = ef.featurize_dataframe(df, "composition")

feature_df.head()

Unnamed: 0,formula,k_expt,k-units,k_condition,k_condition_units,composition,H,He,Li,Be,...,Pu,Am,Cm,Bk,Cf,Es,Fm,Md,No,Lr
0,BeS,157.0,W/m.K,room temperature,"[{'name': 'Temperature', 'scalars': [{'value':...","(Be, S)",0.0,0,0.0,0.5,...,0,0,0,0,0,0,0,0,0,0
1,CdS,19.9,W/m.K,room temperature,"[{'name': 'Temperature', 'scalars': [{'value':...","(Cd, S)",0.0,0,0.0,0.0,...,0,0,0,0,0,0,0,0,0,0
2,GaN,181.0,W/m.K,room temperature,"[{'name': 'Temperature', 'scalars': [{'value':...","(Ga, N)",0.0,0,0.0,0.0,...,0,0,0,0,0,0,0,0,0,0
3,ZnO,64.5,W/m.K,room temperature,"[{'name': 'Temperature', 'scalars': [{'value':...","(Zn, O)",0.0,0,0.0,0.0,...,0,0,0,0,0,0,0,0,0,0
4,ZnSe,15.6,W/m.K,room temperature,"[{'name': 'Temperature', 'scalars': [{'value':...","(Zn, Se)",0.0,0,0.0,0.0,...,0,0,0,0,0,0,0,0,0,0


The `featurize_dataframe` method returns the augmented dataframe (use `inplace=True` to modify the original). 
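
The difference between returning an augmented copy and modifying the original can be sketched without pandas. In the toy below, a list of dicts stands in for a dataframe, and `toy_featurize_rows` and `atom_counts` are hypothetical helpers rather than matminer functions.

```python
import copy


def toy_featurize_rows(rows, col, featurize, labels, inplace=False):
    """Hypothetical helper: add one labeled column per feature to each row."""
    target = rows if inplace else copy.deepcopy(rows)
    for row in target:
        row.update(zip(labels, featurize(row[col])))
    return target


def atom_counts(amounts):
    """Toy featurizer: number of distinct elements and total number of atoms."""
    return [len(amounts), sum(amounts.values())]


labels = ["n_elements", "n_atoms"]
rows = [
    {"formula": "BeS", "composition": {"Be": 1, "S": 1}},
    {"formula": "Fe2O3", "composition": {"Fe": 2, "O": 3}},
]

augmented = toy_featurize_rows(rows, "composition", atom_counts, labels)
print("n_atoms" in rows[0])     # False: the original rows are untouched
print(augmented[1]["n_atoms"])  # 5

# The labels double as column names, so they select the model inputs directly
X = [[row[label] for label in labels] for row in augmented]
print(X)  # [[2, 2], [2, 5]]
```

With a real dataframe, the analogous last step would be to select the columns named by `feature_labels()`.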

## Part 3: Different kinds of featurizers

We can use similar syntax for other kinds of featurizers. Let's get Magpie statistics on our composition using the `magpie` preset of the `ElementProperty` composition featurizer. The `ElementProperty` featurizer generates elemental statistics from stoichiometries.

Commonly used forms of some featurizers can be instantiated with the `from_preset` static method for quick setup.

### 3.1 Using ElementProperty
First, let's instantiate our new featurizer using `from_preset`. 

In [9]:
from matminer.featurizers.composition import ElementProperty

ep_magpie = ElementProperty.from_preset("magpie")

We can now use `featurize`, `feature_labels`, and `featurize_dataframe` in the same way as we did for `ElementFraction`.

In [10]:
magpie_stats = ep_magpie.featurize(fe2o3)
magpie_labels = ep_magpie.feature_labels()
magpie_df = ep_magpie.featurize_dataframe(df, "composition")


print(f"Statistics from featurizing Fe2O3: \n{magpie_stats}")
print(f"\n\nFeature labels of magpie: \n{magpie_labels}\n")
magpie_df.head()

Statistics from featurizing Fe2O3: 
[8.0, 26.0, 18.0, 15.2, 8.64, 8.0, 55.0, 87.0, 32.0, 74.2, 15.36, 87.0, 15.9994, 55.845, 39.8456, 31.93764, 19.125887999999996, 15.9994, 54.8, 1811.0, 1756.2, 757.28, 842.976, 54.8, 8.0, 16.0, 8.0, 12.8, 3.84, 16.0, 2.0, 4.0, 2.0, 2.8, 0.96, 2.0, 66.0, 132.0, 66.0, 92.4, 31.68, 66.0, 1.83, 3.44, 1.6099999999999999, 2.7960000000000003, 0.7727999999999999, 3.44, 2.0, 2.0, 0.0, 2.0, 0.0, 2.0, 0.0, 4.0, 4.0, 2.4, 1.9200000000000004, 4.0, 0.0, 6.0, 6.0, 2.4, 2.88, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 6.0, 8.0, 2.0, 6.8, 0.96, 6.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 2.0, 1.2, 0.9600000000000002, 2.0, 0.0, 4.0, 4.0, 1.6, 1.9200000000000004, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 4.0, 2.0, 2.8, 0.96, 2.0, 9.105, 10.73, 1.625, 9.755, 0.78, 9.105, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.1106628, 2.1106628, 0.84426512, 1.0131181439999999, 0.0, 12.0, 229.0, 217.0, 98.8, 104.16, 12.0]


Feature labels of magpie: 
['MagpieData minimum Number', 'MagpieData maximu

Unnamed: 0,formula,k_expt,k-units,k_condition,k_condition_units,composition,MagpieData minimum Number,MagpieData maximum Number,MagpieData range Number,MagpieData mean Number,...,MagpieData range GSmagmom,MagpieData mean GSmagmom,MagpieData avg_dev GSmagmom,MagpieData mode GSmagmom,MagpieData minimum SpaceGroupNumber,MagpieData maximum SpaceGroupNumber,MagpieData range SpaceGroupNumber,MagpieData mean SpaceGroupNumber,MagpieData avg_dev SpaceGroupNumber,MagpieData mode SpaceGroupNumber
0,BeS,157.0,W/m.K,room temperature,"[{'name': 'Temperature', 'scalars': [{'value':...","(Be, S)",4.0,16.0,12.0,10.0,...,0.0,0.0,0.0,0.0,70.0,194.0,124.0,132.0,62.0,70.0
1,CdS,19.9,W/m.K,room temperature,"[{'name': 'Temperature', 'scalars': [{'value':...","(Cd, S)",16.0,48.0,32.0,32.0,...,0.0,0.0,0.0,0.0,70.0,194.0,124.0,132.0,62.0,70.0
2,GaN,181.0,W/m.K,room temperature,"[{'name': 'Temperature', 'scalars': [{'value':...","(Ga, N)",7.0,31.0,24.0,19.0,...,0.0,0.0,0.0,0.0,64.0,194.0,130.0,129.0,65.0,64.0
3,ZnO,64.5,W/m.K,room temperature,"[{'name': 'Temperature', 'scalars': [{'value':...","(Zn, O)",8.0,30.0,22.0,19.0,...,0.0,0.0,0.0,0.0,12.0,194.0,182.0,103.0,91.0,12.0
4,ZnSe,15.6,W/m.K,room temperature,"[{'name': 'Temperature', 'scalars': [{'value':...","(Zn, Se)",30.0,34.0,4.0,32.0,...,0.0,0.0,0.0,0.0,14.0,194.0,180.0,104.0,90.0,14.0


All featurizers follow this general syntax.
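
To see what "elemental statistics from stoichiometries" means, the sketch below recomputes the six statistics of the atomic number for Fe2O3 by hand. The definitions used here (weighting by atomic fraction, and taking the mode as the value of the most abundant element) are assumptions inferred from the output above, not taken from matminer's source; under those assumptions the result agrees with the first six values printed for Fe2O3.

```python
# Atomic numbers of the two elements in Fe2O3
atomic_number = {"Fe": 26, "O": 8}
amounts = {"Fe": 2, "O": 3}

total = sum(amounts.values())
fractions = {el: n / total for el, n in amounts.items()}
values = [atomic_number[el] for el in amounts]

minimum = min(values)
maximum = max(values)
value_range = maximum - minimum
# Mean and average deviation are weighted by atomic fraction (assumed definition)
mean = sum(fractions[el] * atomic_number[el] for el in amounts)
avg_dev = sum(fractions[el] * abs(atomic_number[el] - mean) for el in amounts)
# Mode is taken as the value of the most abundant element (assumed definition)
mode = atomic_number[max(fractions, key=fractions.get)]

stats = [minimum, maximum, value_range, mean, avg_dev, mode]
print([round(s, 2) for s in stats])  # [8, 26, 18, 15.2, 8.64, 8]
```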

### 3.2 Featurizing a structure

Let's now assign descriptors to a structure. We do this with the same syntax as the composition featurizers. First, let's load a dataset containing structures. 

In [11]:
df_structures = load_dataset("phonon_dielectric_mp")

df_structures.head()

Unnamed: 0,mpid,eps_electronic,eps_total,last phdos peak,structure,formula
0,mp-1000,6.311555,12.773454,98.585771,"[[2.8943817 2.04663693 5.01321616] Te, [0. 0....",BaTe
1,mp-1002124,24.137743,32.965593,677.585725,"[[0. 0. 0.] Hf, [-3.78195772 -3.78195772 -3.78...",HfC
2,mp-1002164,8.111021,11.169464,761.585719,"[[0. 0. 0.] Ge, [ 3.45311592 3.45311592 -3.45...",GeC
3,mp-10044,10.032168,10.128936,701.585723,"[[0.98372595 0.69559929 1.70386332] B, [0. 0. ...",BAs
4,mp-1008223,3.979201,6.394043,204.585763,"[[0. 0. 0.] Ca, [ 4.95 4.95 -4.95] Se]",CaSe


Let's calculate some basic density features of these structures using `DensityFeatures`.

In [12]:
from matminer.featurizers.structure import DensityFeatures

densityf = DensityFeatures()
densityf.feature_labels()

['density', 'vpa', 'packing fraction']

These are the features we will get. Now we use `featurize_dataframe` to generate these features for all the samples in the dataframe. Since we are using the structures as input to the featurizer, we select the "structure" column.

In [13]:
densityf.featurize_dataframe(df_structures, "structure")

Unnamed: 0,mpid,eps_electronic,eps_total,last phdos peak,structure,formula,density,vpa,packing fraction
0,mp-1000,6.311555,12.773454,98.585771,"[[2.8943817 2.04663693 5.01321616] Te, [0. 0....",BaTe,4.937886,44.545547,0.596286
1,mp-1002124,24.137743,32.965593,677.585725,"[[0. 0. 0.] Hf, [-3.78195772 -3.78195772 -3.78...",HfC,9.868234,16.027886,0.531426
2,mp-1002164,8.111021,11.169464,761.585719,"[[0. 0. 0.] Ge, [ 3.45311592 3.45311592 -3.45...",GeC,5.760895,12.199996,0.394180
3,mp-10044,10.032168,10.128936,701.585723,"[[0.98372595 0.69559929 1.70386332] B, [0. 0. ...",BAs,5.087634,13.991016,0.319600
4,mp-1008223,3.979201,6.394043,204.585763,"[[0. 0. 0.] Ca, [ 4.95 4.95 -4.95] Se]",CaSe,2.750191,35.937000,0.428523
5,mp-1008506,18.476618,23.405620,1138.585689,"[[0. 0. 0.] Ba, [2.15053493 1.24161183 2.85808...",BaGaSiH,4.643219,21.112798,0.668709
6,mp-1008556,5.189262,9.319102,718.585722,"[[-2.23741407 0. -2.23366548] Al, [0....",AlGaN2,4.630279,11.181778,0.440107
7,mp-1008559,9.327246,9.467906,795.585716,"[[1.60015264 0.92384464 2.65049608] B, [0.0000...",BP,2.953063,11.748010,0.287761
8,mp-10086,6.035038,15.705183,339.585752,"[[2.84699546 0.94899849 0. ] F, [0.9489...",YSF,4.685387,16.535411,0.587455
9,mp-1008680,19.183229,23.997046,297.585755,"[[0. 0. 0.] Ti, [ 2.99535473 2.99535473 -2.99...",TiGePt,9.749872,17.916515,0.557796
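
As a sanity check on two of these columns, the sketch below recomputes `vpa` (volume per atom) and `density` for the CaSe row from first principles. It assumes that `density` is in g/cm^3 and `vpa` in cubic angstroms per atom, that the cell contains two atoms, and that the atomic masses are 40.078 (Ca) and 78.96 (Se); the cell volume is back-calculated from the tabulated `vpa`. This is not the `DensityFeatures` implementation, and the packing fraction is left out because it also needs atomic radii.

```python
AMU_IN_GRAMS = 1.66053907e-24   # grams per atomic mass unit
CUBIC_ANGSTROM_IN_CM3 = 1e-24   # cm^3 per cubic angstrom

# CaSe cell: two atoms, volume back-calculated from the table (2 * 35.937)
masses_amu = [40.078, 78.96]    # Ca, Se (assumed atomic masses)
cell_volume_A3 = 71.874

vpa = cell_volume_A3 / len(masses_amu)
density = sum(masses_amu) * AMU_IN_GRAMS / (cell_volume_A3 * CUBIC_ANGSTROM_IN_CM3)
print(round(vpa, 3), round(density, 2))  # 35.937 2.75
```

Both numbers agree with the CaSe row above, which supports the assumed units.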


## Part 4: More capabilities

Featurizers have several other useful capabilities that are worth mentioning briefly before we practice (and _many_ more that are not covered here).


### Dealing with Errors
Often, data is messy and certain featurizers will encounter errors. Set `ignore_errors=True` in `featurize_dataframe` to skip errors; if you'd like to see the errors returned in an additional column, also set `return_errors=True`.
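
The effect of these two flags can be sketched in plain Python. The helper `toy_featurize_many` below is hypothetical, and filling failed rows with NaN is an assumption about how skipped entries are represented; the sketch only illustrates the pattern of catching an error per entry and optionally recording it in an extra column.

```python
import math


def toy_featurize_many(featurize, entries, n_features,
                       ignore_errors=False, return_errors=False):
    """Hypothetical helper that mimics per-entry error handling."""
    rows = []
    for entry in entries:
        try:
            features, error = list(featurize(entry)), None
        except Exception as exc:
            if not ignore_errors:
                raise
            # Assumed behavior: a failed entry yields NaN for every feature
            features, error = [math.nan] * n_features, repr(exc)
        rows.append(features + [error] if return_errors else features)
    return rows


def mean_atomic_number(symbols):
    """Toy featurizer with a deliberately tiny lookup table."""
    lookup = {"Fe": 26, "O": 8}
    return [sum(lookup[s] for s in symbols) / len(symbols)]


rows = toy_featurize_many(mean_atomic_number, [["Fe", "O"], ["Xx"]], 1,
                          ignore_errors=True, return_errors=True)
print(rows)  # [[17.0, None], [nan, "KeyError('Xx')"]]
```

Without `ignore_errors=True`, the second entry would stop the whole run with a `KeyError`.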

### Citing the authors
Many featurizers implement methods from peer-reviewed studies. Please cite these original works; the `citations` method returns the BibTeX-formatted references in a Python list.

### Conversions
In addition to Bandstructure/DOS/Structure/Composition featurizers, matminer also provides a featurizer interface for converting between pymatgen objects (e.g., assigning oxidation states to compositions) in a fault-tolerant fashion. These featurizers are found in `matminer.featurizers.conversions` and use the same `featurize`/`featurize_dataframe` syntax as the other featurizers.

Here's an example converting string formulas into pymatgen Compositions:

In [16]:
from matminer.featurizers.conversions import StrToComposition
df_structures = StrToComposition().featurize_dataframe(df_structures, "formula")
df_structures.head()

Unnamed: 0,mpid,eps_electronic,eps_total,last phdos peak,structure,formula,composition
0,mp-1000,6.311555,12.773454,98.585771,"[[2.8943817 2.04663693 5.01321616] Te, [0. 0....",BaTe,"(Ba, Te)"
1,mp-1002124,24.137743,32.965593,677.585725,"[[0. 0. 0.] Hf, [-3.78195772 -3.78195772 -3.78...",HfC,"(Hf, C)"
2,mp-1002164,8.111021,11.169464,761.585719,"[[0. 0. 0.] Ge, [ 3.45311592 3.45311592 -3.45...",GeC,"(Ge, C)"
3,mp-10044,10.032168,10.128936,701.585723,"[[0.98372595 0.69559929 1.70386332] B, [0. 0. ...",BAs,"(B, As)"
4,mp-1008223,3.979201,6.394043,204.585763,"[[0. 0. 0.] Ca, [ 4.95 4.95 -4.95] Se]",CaSe,"(Ca, Se)"


We'll use this conversion featurizer in the exercises section. 
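
For intuition about what a string-to-composition conversion has to do, here is a toy parser written with the standard library. The function `parse_formula` is hypothetical and far simpler than pymatgen's own parsing: it handles only flat formulas such as `AlGaN2` (no parentheses or charges) and does not check that the symbols are real elements.

```python
import re

# An element symbol followed by an optional integer or decimal amount
_TOKEN = re.compile(r"([A-Z][a-z]?)(\d+(?:\.\d+)?)?")


def parse_formula(formula):
    """Toy parser: map a flat formula string to {element symbol: amount}."""
    amounts = {}
    position = 0
    for match in _TOKEN.finditer(formula):
        if match.start() != position:
            raise ValueError(f"cannot parse {formula!r}")
        symbol, number = match.groups()
        amounts[symbol] = amounts.get(symbol, 0.0) + (float(number) if number else 1.0)
        position = match.end()
    if position != len(formula) or not amounts:
        raise ValueError(f"cannot parse {formula!r}")
    return amounts


print(parse_formula("BaTe"))    # {'Ba': 1.0, 'Te': 1.0}
print(parse_formula("AlGaN2"))  # {'Al': 1.0, 'Ga': 1.0, 'N': 2.0}
```

Malformed input raises a `ValueError` here; a fault-tolerant conversion would catch that error per row instead of stopping.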

### Running multiple featurizers
Use the `MultipleFeaturizer` featurizer to run multiple featurizers in a single command. Enable the `multiindex` parameter to more easily keep track of your features. Here's a more complex example (don't worry about all the details!):  

In [17]:
from matminer.featurizers.structure import GlobalSymmetryFeatures
from matminer.featurizers.base import MultipleFeaturizer

gsm = GlobalSymmetryFeatures()               # Generate some symmetry features about the structures
mf = MultipleFeaturizer([densityf, gsm])     # Combine our density featurizer and the new featurizer into one MultipleFeaturizer

mf.featurize_dataframe(df_structures, "structure", multiindex=True)

Unnamed: 0_level_0,Input Data,Input Data,Input Data,Input Data,Input Data,Input Data,Input Data,DensityFeatures,DensityFeatures,DensityFeatures,GlobalSymmetryFeatures,GlobalSymmetryFeatures,GlobalSymmetryFeatures,GlobalSymmetryFeatures
Unnamed: 0_level_1,mpid,eps_electronic,eps_total,last phdos peak,structure,formula,composition,density,vpa,packing fraction,spacegroup_num,crystal_system,crystal_system_int,is_centrosymmetric
0,mp-1000,6.311555,12.773454,98.585771,"[[2.8943817 2.04663693 5.01321616] Te, [0. 0....",BaTe,"(Ba, Te)",4.937886,44.545547,0.596286,225,cubic,1,True
1,mp-1002124,24.137743,32.965593,677.585725,"[[0. 0. 0.] Hf, [-3.78195772 -3.78195772 -3.78...",HfC,"(Hf, C)",9.868234,16.027886,0.531426,216,cubic,1,False
2,mp-1002164,8.111021,11.169464,761.585719,"[[0. 0. 0.] Ge, [ 3.45311592 3.45311592 -3.45...",GeC,"(Ge, C)",5.760895,12.199996,0.394180,216,cubic,1,False
3,mp-10044,10.032168,10.128936,701.585723,"[[0.98372595 0.69559929 1.70386332] B, [0. 0. ...",BAs,"(B, As)",5.087634,13.991016,0.319600,216,cubic,1,False
4,mp-1008223,3.979201,6.394043,204.585763,"[[0. 0. 0.] Ca, [ 4.95 4.95 -4.95] Se]",CaSe,"(Ca, Se)",2.750191,35.937000,0.428523,216,cubic,1,False
5,mp-1008506,18.476618,23.405620,1138.585689,"[[0. 0. 0.] Ba, [2.15053493 1.24161183 2.85808...",BaGaSiH,"(Ba, Ga, Si, H)",4.643219,21.112798,0.668709,156,trigonal,3,False
6,mp-1008556,5.189262,9.319102,718.585722,"[[-2.23741407 0. -2.23366548] Al, [0....",AlGaN2,"(Al, Ga, N)",4.630279,11.181778,0.440107,115,tetragonal,4,False
7,mp-1008559,9.327246,9.467906,795.585716,"[[1.60015264 0.92384464 2.65049608] B, [0.0000...",BP,"(B, P)",2.953063,11.748010,0.287761,186,hexagonal,2,False
8,mp-10086,6.035038,15.705183,339.585752,"[[2.84699546 0.94899849 0. ] F, [0.9489...",YSF,"(Y, S, F)",4.685387,16.535411,0.587455,129,tetragonal,4,True
9,mp-1008680,19.183229,23.997046,297.585755,"[[0. 0. 0.] Ti, [ 2.99535473 2.99535473 -2.99...",TiGePt,"(Ti, Ge, Pt)",9.749872,17.916515,0.557796,216,cubic,1,False
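
Conceptually, combining featurizers amounts to concatenating their feature vectors and their labels, and a multiindex adds the name of the originating featurizer to each label. The sketch below illustrates that idea with hypothetical toy classes; it is not matminer's `MultipleFeaturizer`, and the `multiindex` argument on `feature_labels` exists only in this toy.

```python
class ElementCount:
    """Toy featurizer: number of distinct elements in a composition dict."""

    def featurize(self, x):
        return [len(x)]

    def feature_labels(self):
        return ["n_elements"]


class AtomTotal:
    """Toy featurizer: total number of atoms in a composition dict."""

    def featurize(self, x):
        return [sum(x.values())]

    def feature_labels(self):
        return ["n_atoms"]


class ToyMultipleFeaturizer:
    """Hypothetical combiner: runs several featurizers and joins their outputs."""

    def __init__(self, featurizers):
        self.featurizers = featurizers

    def featurize(self, x):
        features = []
        for f in self.featurizers:
            features.extend(f.featurize(x))
        return features

    def feature_labels(self, multiindex=False):
        labels = []
        for f in self.featurizers:
            for label in f.feature_labels():
                # With multiindex, tag each label with its featurizer's class name
                labels.append((type(f).__name__, label) if multiindex else label)
        return labels


combined = ToyMultipleFeaturizer([ElementCount(), AtomTotal()])
features = combined.featurize({"Fe": 2, "O": 3})
flat_labels = combined.feature_labels()
multi_labels = combined.feature_labels(multiindex=True)
print(features)      # [2, 5]
print(multi_labels)  # [('ElementCount', 'n_elements'), ('AtomTotal', 'n_atoms')]
```

The two-level labels mirror the two header rows in the table above, where each column is grouped under the featurizer that produced it.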


## Let's practice!

Now, let's practice. You'll pick up where you left off from the last lesson, add some descriptors using the techniques described here, and prepare your data for the final unit.