# Materials data science: data retrieval and filtering – Exercises

## Exercise 1: Load and examine the `elastic_tensor_2015` dataset

Matminer includes a dataset called `elastic_tensor_2015`, which contains computed elastic properties of materials taken from the paper:

> "Charting the complete elastic properties of inorganic crystalline compounds", M. de Jong et al., Sci. Data 2 (2015) 150009.

Load this dataset using the `load_dataset()` function and determine:
- the number of entries it contains (tip: pandas `DataFrame` objects have a `describe()` method that reports the `count` for each numeric column)
- the largest value of bulk modulus in the dataset (bulk modulus is given in the `K_VRH` column)

In [1]:
from matminer.datasets import load_dataset

df = load_dataset("elastic_tensor_2015")

df.describe()

,nsites,space_group,volume,elastic_anisotropy,G_Reuss,G_VRH,G_Voigt,K_Reuss,K_VRH,K_Voigt,poisson_ratio,kpoint_density
count,1181.0,1181.0,1181.0,1181.0,1181.0,1181.0,1181.0,1181.0,1181.0,1181.0,1181.0,1181.0
mean,12.42591,163.403895,207.177098,2.145013,64.050568,67.543145,71.03572,135.171392,136.259661,137.347932,0.287401,7536.833192
std,11.817997,65.040733,192.355747,19.140097,44.69638,44.579408,45.388731,73.582579,72.886978,72.922887,0.062177,3446.890979
min,2.0,4.0,15.850527,5e-06,1.87027,2.722175,3.57408,4.714976,6.476135,6.476138,0.042582,1000.0
25%,5.0,124.0,83.944059,0.14503,30.244413,34.117959,37.270657,74.960699,76.43535,76.520333,0.249159,7000.0
50%,10.0,193.0,168.920404,0.355287,56.263878,59.735163,62.635382,129.98479,130.382766,131.849056,0.290198,7000.0
75%,16.0,221.0,261.420345,0.923117,86.979486,91.332142,95.785011,189.195104,189.574194,190.912352,0.328808,7000.0
max,152.0,229.0,2398.906164,397.297866,520.845926,522.921225,524.996524,435.658754,435.661487,435.66422,0.467523,45000.0
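
Both answers can be read from the `describe()` output above: the `count` row gives 1181 entries, and the `max` row gives 435.661487 for `K_VRH`. They can also be computed directly. The following is a minimal sketch that uses a small hand-built `DataFrame` as a stand-in for the full dataset, so it runs without downloading anything. The four `K_VRH` values are copied from dataset rows shown in the later exercise outputs, and the name `toy_df` is only illustrative.

```python
import pandas as pd

# Stand-in for load_dataset("elastic_tensor_2015"): four rows copied from
# the dataset, not the full 1181-entry table.
toy_df = pd.DataFrame(
    {
        "formula": ["Nb4CoSi", "SiOs", "Ga", "Ir"],
        "K_VRH": [194.268884, 295.077545, 49.130670, 346.322761],
    }
)

n_entries = len(toy_df)           # number of rows in the DataFrame
max_bulk = toy_df["K_VRH"].max()  # largest bulk modulus in the column

# idxmax() returns the index label of the row holding the maximum
stiffest = toy_df.loc[toy_df["K_VRH"].idxmax(), "formula"]

print(n_entries, max_bulk, stiffest)
```

Applied to the real `df`, the same `len(df)` and `df["K_VRH"].max()` calls should agree with the `count` and `max` rows of `describe()`.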


## Exercise 2: Filter the dataset based on the number of sites

You are building a machine learning model for elastic constants that is intended only for structures with a small number of atomic sites. Filter the dataset to include only entries where `nsites` is less than 20, then determine:
- the number of entries in the filtered dataset
- the mean number of sites across all entries in your filtered dataset


In [2]:
from matminer.datasets import load_dataset

df = load_dataset("elastic_tensor_2015")

# complete exercise below

mask = df["nsites"] < 20
filtered_df = df[mask]

n_filtered = len(filtered_df)
mean_n_sites = filtered_df["nsites"].mean()

print("# entries in filtered dataset: {}".format(n_filtered))
print("mean # sites: {}".format(mean_n_sites))

# entries in filtered dataset: 975
mean # sites: 8.554871794871795
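
The boolean mask used above is one of several equivalent ways to filter rows. As a sketch, the same selection can also be written with `DataFrame.query()`. The toy data below is an assumption for illustration: the first three `nsites` values come from dataset rows shown in this document, while the 40-site `HypotheticalLarge` row is made up so that the filter has something to remove.

```python
import pandas as pd

# Toy stand-in for the real dataset; the last row is invented.
toy_df = pd.DataFrame(
    {
        "formula": ["Nb4CoSi", "Al(CoSi)2", "SiOs", "HypotheticalLarge"],
        "nsites": [12, 5, 2, 40],
    }
)

# Boolean-mask filtering, as in the exercise solution
mask_filtered = toy_df[toy_df["nsites"] < 20]

# The same condition expressed as a query string
query_filtered = toy_df.query("nsites < 20")

# Both approaches select the same rows
assert mask_filtered.equals(query_filtered)

n_filtered = len(query_filtered)
mean_n_sites = query_filtered["nsites"].mean()
print(n_filtered, mean_n_sites)
```

Either form returns a new `DataFrame`, so the original data remains available for other filters.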


## Exercise 3: Remove columns unnecessary for machine learning

The elastic tensor dataset contains many columns that are not particularly relevant for machine learning. Filter the dataset so that it contains only the `formula`, `structure`, and `K_VRH` (bulk modulus) columns.

*Tip: pandas `DataFrame` objects provide a `drop()` method that can drop both rows and columns. Make sure you set the `axis` argument correctly.*

In [3]:
from matminer.datasets import load_dataset

df = load_dataset("elastic_tensor_2015")

# complete exercise below

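# drop() returns a new DataFrame and leaves df itself unchanged;
# axis=1 tells pandas that the labels are column names, not row labels
df.drop(['material_id', 'nsites', 'space_group', 'volume',
         'elastic_anisotropy', 'G_Reuss', 'G_VRH', 'G_Voigt',
         'K_Reuss', 'K_Voigt', 'poisson_ratio', 'compliance_tensor',
         'elastic_tensor', 'elastic_tensor_original', 'cif', 'kpoint_density',
         'poscar'],
        axis=1)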

,formula,structure,K_VRH
0,Nb4CoSi,"[[0.94814328 2.07280467 2.5112 ] Nb, [5.273...",194.268884
1,Al(CoSi)2,"[[0. 0. 0.] Al, [1.96639263 1.13529553 0.75278...",175.449907
2,SiOs,"[[1.480346 1.480346 1.480346] Si, [0. 0. 0.] Os]",295.077545
3,Ga,"[[0. 1.09045794 0.84078375] Ga, [0. ...",49.130670
4,SiRu2,"[[1.0094265 4.24771709 2.9955487 ] Si, [3.028...",256.768081
5,AlCo3C,"[[0. 0. 0.] Al, [0. 1.861157 1.861157] C...",234.454510
6,CdSnSb2,"[[0. 3.322252 9.855906] Cd, [0. 0....",36.233286
7,Ir,"[[0. 0. 0.] Ir, [0. 1.938308 1.938308] I...",346.322761
8,SbIr,"[[2.03154403 1.17291049 4.19959275] Sb, [-2.03...",160.280886
9,MnSbIr,"[[3.0584545 0. 0. ] Mn, [3.058454...",131.974860
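
When only a few columns are to be kept, selecting them by name is often shorter than listing every column to drop. The following is a minimal sketch on a toy `DataFrame`. The column names follow the dataset, but the `structure` values are placeholder strings rather than the structure objects stored in the real dataset, and `toy_df` and `keep` are illustrative names.

```python
import pandas as pd

# Toy stand-in with a reduced set of columns
toy_df = pd.DataFrame(
    {
        "material_id": ["mp-10003", "mp-10015"],
        "formula": ["Nb4CoSi", "SiOs"],
        "nsites": [12, 2],
        "structure": ["<structure placeholder>", "<structure placeholder>"],
        "K_VRH": [194.268884, 295.077545],
    }
)

keep = ["formula", "structure", "K_VRH"]

# Option 1: select the wanted columns directly
selected = toy_df[keep]

# Option 2: drop everything that is not in the keep list
unwanted = [col for col in toy_df.columns if col not in keep]
dropped = toy_df.drop(columns=unwanted)

print(list(selected.columns))
```

Both options return a new `DataFrame`; `toy_df` still has all five columns afterwards unless the result is assigned back to it.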


## Advanced exercise: calculate Young's modulus

Young's modulus, $E$, is given by:

$$
E = \frac{9KG}{G+3K},
$$

where $K$ is the bulk modulus (column `K_VRH`), and $G$ is the shear modulus (column `G_VRH`).
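
As a quick sanity check of the formula, the sketch below evaluates it in plain Python for the first entry of the dataset (Nb4CoSi, with `K_VRH` = 194.268884 and `G_VRH` = 97.141604, as listed in the dataset output further below). The function name `youngs_modulus` is a hypothetical helper, not part of matminer or pandas.

```python
def youngs_modulus(bulk: float, shear: float) -> float:
    """Return E = 9KG / (G + 3K) for bulk modulus K and shear modulus G."""
    return 9 * bulk * shear / (shear + 3 * bulk)


# K_VRH and G_VRH for Nb4CoSi, copied from the dataset
e_nb4cosi = youngs_modulus(bulk=194.268884, shear=97.141604)
print(e_nb4cosi)
```

The result should agree with the `E_VRH` value of 249.790066 that the pandas solution reports for the same entry.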

Calculate Young's modulus for all entries in the dataset and store the values in a new column called `E_VRH`. What is the average Young's modulus over the entire dataset?

In [4]:
from matminer.datasets import load_dataset

df = load_dataset("elastic_tensor_2015")

# complete exercise below

df["E_VRH"] = (9 * df["K_VRH"] * df["G_VRH"]) / (df["G_VRH"] + 3 * df["K_VRH"])

df

,material_id,formula,nsites,space_group,volume,structure,elastic_anisotropy,G_Reuss,G_VRH,G_Voigt,...,K_VRH,K_Voigt,poisson_ratio,compliance_tensor,elastic_tensor,elastic_tensor_original,cif,kpoint_density,poscar,E_VRH
0,mp-10003,Nb4CoSi,12,124,194.419802,"[[0.94814328 2.07280467 2.5112 ] Nb, [5.273...",0.030688,96.844535,97.141604,97.438674,...,194.268884,194.270146,0.285701,"[[0.004385293093993, -0.0016070693558990002, -...","[[311.33514638650246, 144.45092552856926, 126....","[[311.33514638650246, 144.45092552856926, 126....",#\#CIF1.1\n###################################...,7000,Nb8 Co2 Si2\n1.0\n6.221780 0.000000 0.000000\n...,249.790066
1,mp-10010,Al(CoSi)2,5,164,61.987320,"[[0. 0. 0.] Al, [1.96639263 1.13529553 0.75278...",0.266910,93.939650,96.252006,98.564362,...,175.449907,177.252050,0.268105,"[[0.0037715428949660003, -0.000844229828709, -...","[[306.93357350984974, 88.02634955100905, 105.6...","[[306.93357350984974, 88.02634955100905, 105.6...",#\#CIF1.1\n###################################...,7000,Al1 Co2 Si2\n1.0\n3.932782 0.000000 0.000000\n...,244.115367
2,mp-10015,SiOs,2,221,25.952539,"[[1.480346 1.480346 1.480346] Si, [0. 0. 0.] Os]",0.756489,120.962289,130.112955,139.263621,...,295.077545,295.077545,0.307780,"[[0.0019959391925840004, -0.000433146670736000...","[[569.5291276937579, 157.8517489654999, 157.85...","[[569.5291276937579, 157.8517489654999, 157.85...",#\#CIF1.1\n###################################...,7000,Si1 Os1\n1.0\n2.960692 0.000000 0.000000\n0.00...,340.318316
3,mp-10021,Ga,4,63,76.721433,"[[0. 1.09045794 0.84078375] Ga, [0. ...",2.376805,12.205989,15.101901,17.997812,...,49.130670,49.235377,0.360593,"[[0.021647143908635, -0.005207263618160001, -0...","[[69.28798774976904, 34.7875015216915, 37.3877...","[[70.13259066665267, 40.60474945058445, 37.387...",#\#CIF1.1\n###################################...,7000,Ga4\n1.0\n2.803229 0.000000 0.000000\n0.000000...,41.095069
4,mp-10025,SiRu2,12,62,160.300999,"[[1.0094265 4.24771709 2.9955487 ] Si, [3.028...",0.196930,100.110773,101.947798,103.784823,...,256.768081,258.480904,0.324682,"[[0.00410214297725, -0.001272204332729, -0.001...","[[349.3767766177825, 186.67131003104407, 176.4...","[[407.4791016459293, 176.4759188081947, 213.83...",#\#CIF1.1\n###################################...,7000,Si4 Ru8\n1.0\n4.037706 0.000000 0.000000\n0.00...,270.096776
5,mp-10037,AlCo3C,5,221,51.574959,"[[0. 0. 0.] Al, [0. 1.861157 1.861157] C...",0.420936,111.795761,116.501644,121.207527,...,234.454510,234.454510,0.286852,"[[0.0024941939373720004, -0.000536220847369000...","[[454.4459976453145, 124.45737520710227, 124.4...","[[454.4459976453145, 124.45737520710227, 124.4...",#\#CIF1.1\n###################################...,7000,Al1 Co3 C1\n1.0\n3.722314 0.000000 0.000000\n0...,299.840792
6,mp-10063,CdSnSb2,16,122,580.176940,"[[0. 3.322252 9.855906] Cd, [0. 0....",0.629264,16.692188,17.742410,18.792631,...,36.233286,36.235001,0.289520,"[[0.031566609273636005, -0.011127154138626002,...","[[52.009961164143604, 28.587634307507095, 28.4...","[[52.009961164143604, 28.587634307507095, 28.4...",#\#CIF1.1\n###################################...,7000,Cd4 Sn4 Sb8\n1.0\n6.644504 0.000000 0.000000\n...,45.758371
7,mp-101,Ir,4,225,58.258386,"[[0. 0. 0.] Ir, [0. 1.938308 1.938308] I...",0.174540,212.791803,216.505869,220.219935,...,346.322761,346.322761,0.241326,"[[0.002253121746957, -0.000645312071643, -0.00...","[[576.3314487020527, 231.31771069302795, 231.3...","[[576.3314487020527, 231.31771069302795, 231.3...",#\#CIF1.1\n###################################...,7000,Ir4\n1.0\n3.876616 0.000000 0.000000\n0.000000...,537.508631
8,mp-10125,SbIr,4,194,80.054967,"[[2.03154403 1.17291049 4.19959275] Sb, [-2.03...",0.591712,55.983343,59.245410,62.507476,...,160.280886,161.001027,0.335456,"[[0.0077860474611950005, -0.00433997100871, -0...","[[220.0614254478324, 138.7015190388057, 101.96...","[[220.0614254478324, 138.7015190388057, 101.96...",#\#CIF1.1\n###################################...,7000,Sb2 Ir2\n1.0\n4.063084 0.000000 0.000000\n-2.0...,158.239306
9,mp-10154,MnSbIr,12,216,228.873769,"[[3.0584545 0. 0. ] Mn, [3.058454...",0.109313,39.747198,40.181671,40.616145,...,131.974860,131.975106,0.361794,"[[0.010785803911000001, -0.004120264435156, -0...","[[176.45558761804512, 109.47766517684619, 109....","[[176.45558761804512, 109.47766517684619, 109....",#\#CIF1.1\n###################################...,7000,Mn4 Sb4 Ir4\n1.0\n6.116909 0.000000 0.000000\n...,109.438316
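
The cell above adds the `E_VRH` column but does not print the average that the exercise asks for; calling `.mean()` on the new column gives it. The following is a minimal sketch on two rows copied from the output above, so the printed mean covers these two rows only and is not the answer for the full dataset.

```python
import pandas as pd

# Two rows copied from the dataset output (Nb4CoSi and SiOs)
toy_df = pd.DataFrame(
    {
        "formula": ["Nb4CoSi", "SiOs"],
        "K_VRH": [194.268884, 295.077545],
        "G_VRH": [97.141604, 130.112955],
    }
)

# Column arithmetic is element-wise, so no explicit loop is needed
k, g = toy_df["K_VRH"], toy_df["G_VRH"]
toy_df["E_VRH"] = 9 * k * g / (g + 3 * k)

# Average Young's modulus over the rows in this toy table
mean_e = toy_df["E_VRH"].mean()
print(mean_e)
```

On the real dataset, `df["E_VRH"].mean()` after the assignment in the cell above would give the dataset-wide average.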
