# Extract data and featurizers from Materials Project

In this notebook we will demonstrate how to
1. Extract all materials with an ICSD entry and a GGA-calculated band gap larger than $0.1$ eV from Materials Project.
2. Featurize all resulting entries using functions with low memory demand (which are, however, notoriously time consuming).
3. Concatenate the results from 1 and 2.
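
The selection criterion in step 1 can be illustrated offline on a toy table. This is only a sketch: the column names `icsd_ids` and `band_gap` are assumptions about how the extracted entries are laid out, the rows are made up for illustration, and the actual query in the extraction script may differ.

```python
import pandas as pd

# Made-up rows for illustration only; these are not real Materials Project entries.
entries = pd.DataFrame({
    "material_id": ["toy-1", "toy-2", "toy-3", "toy-4"],
    "icsd_ids": [[12345], [], [678, 910], [111]],
    "band_gap": [1.50, 2.00, 0.05, 0.10],
})

# Keep entries that have at least one ICSD id ...
has_icsd = entries["icsd_ids"].apply(len) > 0
# ... and a band gap strictly larger than 0.1 eV.
gap_above_threshold = entries["band_gap"] > 0.1

selected = entries[has_icsd & gap_above_threshold].reset_index(drop=True)
print(selected["material_id"].tolist())
```

Note that the comparison is strict, so an entry with a band gap of exactly $0.1$ eV is excluded.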

This notebook only shows which commands to execute and in which order. It is highly recommended to run them from a command prompt rather than in a Jupyter notebook: Jupyter's default settings restrict I/O over long durations, and a notebook's strength is visualizing and explaining rather than running long, heavy scripts.

Without further ado, we start with initial imports.

In [1]:
import pandas as pd
import os

Then open a command prompt and run the following script:

>> python3 extract_all_MP_prop

In [2]:
# Collect the numeric suffixes of the partial result files MP_featurized_<N>.csv
numbers = []
directory = "."
prefix = "MP_featurized_"
for filename in os.listdir(directory):
    if filename.startswith(prefix) and filename.endswith(".csv"):
        # Strip the prefix and the ".csv" extension to get the number
        numbers.append(int(filename[len(prefix):-4]))
numbers.sort()
print(numbers)

[3950, 4451, 6152, 9903, 10404, 11455, 13056, 17457, 18458, 19009, 22460, 23161, 23462]
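
As an alternative to slicing filenames by fixed offsets, the same discovery step could be sketched with a regular expression. The helper name `parse_checkpoint_numbers` is hypothetical and not part of the original notebook; it only assumes the `MP_featurized_<N>.csv` naming seen above.

```python
import os
import re

# Matches exactly "MP_featurized_<digits>.csv" and captures the digits.
CHECKPOINT_PATTERN = re.compile(r"MP_featurized_(\d+)\.csv")

def parse_checkpoint_numbers(filenames):
    """Return the sorted integer suffixes of files named MP_featurized_<N>.csv.

    Hypothetical helper for illustration; other filenames are ignored.
    """
    numbers = []
    for filename in filenames:
        match = CHECKPOINT_PATTERN.fullmatch(filename)
        if match:
            numbers.append(int(match.group(1)))
    return sorted(numbers)

# Usage on the current directory, mirroring the cell above.
print(parse_checkpoint_numbers(os.listdir(".")))
```

Because the pattern requires digits after the underscore, a file such as `MP_featurized.csv` or `MP_featurized_old.csv` is skipped instead of raising a `ValueError` in `int()`.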


In [3]:
#numbers = [3950, 4451, 6152, 9903, 10404, 11455, 13056, 17457, 18458, 19009, 22460, 23161, 23462]
# Read the partial files in ascending order, then the final file,
# and concatenate everything in a single call.
paths = ["./MP_featurized_" + str(num) + ".csv" for num in numbers]
paths.append("./MP_featurized.csv")
MP_featurized = pd.concat([pd.read_csv(path, sep=",") for path in paths]).reset_index(drop=True)
MP_featurized

Unnamed: 0,material_id,full_formula,band_gap,is_gap_direct,direct_gap,p_ex1_norm,p_ex1_degen,n_ex1_norm,n_ex1_degen,cbm_hybridization,cbm_character_1,cbm_specie_1,cbm_location_1,cbm_score_1,vbm_hybridization,vbm_character_1,vbm_specie_1,vbm_location_1,vbm_score_1
0,mp-540537,Cr4Te8O22,2.3209,1.0,2.3209,0.500000,1.0,0.500000,1.0,3.171776,p,Te,0.860122;0.67244;0.586467,0.128493,3.034352,d,Cr,0.001039;0.682243;0.884618,0.137514
1,mp-554288,Na4P4H16O20,,,,,,,,4.056414,s,Na,0.98348;0.504075;0.485511,0.033103,3.007152,p,O,0.235423;0.772961;0.000403,0.105398
2,mp-23931,Li4P4H8O12,,,,,,,,3.786044,p,P,0.815108;0.744944;0.844605,0.052936,2.751642,p,O,0.734207;0.705863;0.21093,0.096116
3,mp-27499,Ba4Te4S12,1.9176,0.0,2.1318,0.444444,2.0,0.333333,2.0,3.215407,p,Te,0.25;0.025213;0.74493,0.112398,3.010518,p,S,0.487639;0.123794;0.159797,0.069645
4,mp-680577,K4Ga8Cl28,,,,,,,,3.854201,s,Ga,0.629841;0.714246;0.629212,0.066665,3.318495,p,Cl,0.930067;0.572759;0.237617,0.053614
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
25266,mp-999461,Na1Pr1Se2,1.7927,0.0,2.4461,0.000000,1.0,0.707107,3.0,0.828770,d,Pr,0.0;0.0;0.0,0.798910,0.968218,p,Se,0.254429;0.254429;0.254429,0.466686
25267,mp-999470,Na1Nd1S2,2.1922,0.0,2.7634,0.092655,6.0,0.707107,3.0,0.624306,d,Nd,0.0;0.0;0.0,0.839284,0.969339,p,S,0.74477;0.74477;0.74477,0.466185
25268,mp-999471,Na1Nd1Se2,1.7868,0.0,2.4733,0.000000,1.0,0.707107,3.0,0.828332,d,Nd,0.0;0.0;0.0,0.800581,0.963803,p,Se,0.745088;0.745088;0.745088,0.467247
25269,mp-999472,Na1La1Se2,2.2374,0.0,2.7279,0.121682,6.0,0.707107,3.0,1.099126,f,La,0.5;0.5;0.5,0.542193,1.171197,p,Se,0.755143;0.755143;0.755143,0.430345
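
After concatenating several partial files, a few sanity checks may be worthwhile. The sketch below uses small toy frames in place of the real files and rests on two assumptions that the notebook does not confirm: that partial files could overlap in `material_id`, and that the CSV files may have been written with their index, which pandas reads back as a column named `Unnamed: 0`.

```python
import pandas as pd

# Toy stand-ins for two partial files; the real files have many more columns.
part_a = pd.DataFrame({
    "Unnamed: 0": [0, 1],
    "material_id": ["mp-540537", "mp-554288"],
    "band_gap": [2.3209, None],
})
part_b = pd.DataFrame({
    "Unnamed: 0": [0, 1],
    "material_id": ["mp-554288", "mp-27499"],
    "band_gap": [None, 1.9176],
})

combined = pd.concat([part_a, part_b], ignore_index=True)

# Drop the saved index column if it exists; errors="ignore" skips it otherwise.
combined = combined.drop(columns=["Unnamed: 0"], errors="ignore")

# Assumption: each material should appear once, so keep the first occurrence.
combined = combined.drop_duplicates(subset="material_id", keep="first").reset_index(drop=True)

# Count rows where the featurizer produced no band gap value.
n_missing_gap = int(combined["band_gap"].isna().sum())
print(combined)
print("rows without band_gap:", n_missing_gap)
```

Counting the missing values is useful here because several rows in the output above have empty `band_gap` fields.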
