# Data mining of semiconductor materials

This notebook serves as an example of how to extract information from the Materials Project database [1]. 
In total, there are 126335 entries in the Materials Project, of which 48644, roughly 39%, are deemed structurally similar to an experimental ICSD entry according to pymatgen's StructureMatcher algorithm. Additionally, 65783 entries have a calculated band gap larger than $0.1$ eV. These two sets overlap in 25271 entries, which is our starting point. 
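
As a quick sanity check on the figures quoted above, the fractions can be recomputed directly. The variable names are illustrative and not part of the original notebook.

```python
# Entry counts quoted in the text above
total_entries = 126335
icsd_matched = 48644   # structurally similar to an experimental ICSD entry
gapped = 65783         # calculated band gap larger than 0.1 eV
overlap = 25271        # entries satisfying both conditions

icsd_fraction = icsd_matched / total_entries
gapped_fraction = gapped / total_entries
overlap_fraction = overlap / total_entries

print(f"ICSD-matched: {icsd_fraction:.1%}")
print(f"Band gap > 0.1 eV: {gapped_fraction:.1%}")
print(f"Both: {overlap_fraction:.1%}")
```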

The notebook consists of 4 stages and is strongly inspired by Ferrenti et al. [2]. However, we diverge at stage 3: the extraction of data from other databases has been moved out of this notebook and is done beforehand. 

Additionally, we will use the data mining process to find both fitting and unfitting candidates. Good candidates are found by applying a given set of criteria and are assigned the label $1$. We then apply the opposite criteria to find unfitted candidates, which are assigned the label $0$. 
 
 
### Contents
  #### Fitted candidates
    - Stage 1
        - $50\%+$ nuclear-spin-zero ($I = 0$) isotopes
        - Calculated non-magnetic
        - Has experimental ICSD entry
        - Crystallize in non-polar space groups
    - Stage 2
        - No Th, U, Cd, Hg
        - No noble gases or rare-earth elements
    - Stage 3
        - Band gap restriction
    - Stage 4
        - Remove thermodynamically unstable compounds
        
  #### Unfitted candidates
    - Stage 1 + 2
        - Query
    - Stage 3
        - Band gap restriction
        - Summarize band gaps and other properties
    - Stage 4
        - Remove thermodynamically unstable compounds

[1] Ong, S. P.; Cholia, S.; Jain, A.; Brafman, M.; Gunter, D.; Ceder, G.; 
Persson, K. a. The Materials Application Programming Interface (API): A 
simple, flexible and efficient API for materials data based on
REpresentational State Transfer (REST) principles, Comput. Mater. Sci.,
2015, 97, 209–215. doi:10.1016/j.commatsci.2014.10.037.

[2] Ferrenti, A.M., de Leon, N.P., Thompson, J.D. et al. Identifying candidate hosts for quantum defects via data mining. npj Comput Mater 6, 126 (2020). https://doi.org/10.1038/s41524-020-00391-7

We start off with some imports. 

In [1]:
import os
import warnings

# Create the output folder for each stage if it does not already exist
for i in range(1, 5):
    os.makedirs(os.path.join('data', 'stage_' + str(i)), exist_ok=True)

# Tools for query of data
from pymatgen.ext.matproj import MPRester
from pymatgen.core import Composition

# Finding correct use of polar groups
from pymatgen.symmetry.groups import SYMM_DATA, sg_symbol_from_int_number

# pandas
import pandas as pd
import numpy as np

from tqdm import tqdm

from helperFunctions.keepTypesReadWrite import read_csv, to_csv

# Ignore warnings from nan-values
# (np.warnings is not available in recent NumPy releases, so use the warnings module)
warnings.filterwarnings('ignore')

# Fitted candidates
## Stage 1
Next, we need to decide which elements to include. The article mentioned above [2] has some strict restrictions, which are as follows: 
- $50\%+$ nuclear-spin-zero ($I = 0$) isotopes
- Calculated non-magnetic
- Has experimental ICSD entry
- Crystallize in non-polar space groups

This stage is done entirely through a single query to the Materials Project via pymatgen. However, the restrictions need to be set up properly first. 

In [2]:
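# NOTE: despite the name, these are the elements to EXCLUDE (passed to "$nin" below),
# i.e. the elements that do not satisfy the spin-zero isotope criterion.
spin_zero_isotopes = [
    "H", "Li", "Be", "B", "N", "F", "Na", "Al", "P", "Cl", "K", "Sc",
    "V", "Mn", "Co", "Cu", "Ga", "As", "Br", "Rb", "Y", "Nb", "Tc", 
    "Rh", "Ag", "In", "Sb", "I", "Cs", "Lu", "Ta", "Re", "Ir", "Au",
    "Tl", "Bi", "Po", "At", "Rn", "La","Pr", "Pm", "Eu", "Tb", "Ho",
    "Tm", "Ac", "Pa", "Np", "Pu", "Am"]  # 51 elements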

print("Number of excluded periodic-elements from isotopes: {}".format(len(spin_zero_isotopes)))

Number of excluded periodic-elements from isotopes: 51


In [3]:
# This is a list of the point groups as noted in pymatgen
point_groups = []
for i in range(1,231):
    symbol = sg_symbol_from_int_number(i)
    point_groups.append(SYMM_DATA['space_group_encoding'][symbol]['point_group'])

# Note that there are 40 of them, rather than 32.
print("Number of point groups denoted in pymatgen: {}".format(len(set(point_groups))))

# This is because multiple conventions are used for the same point group.
# This dictionary can be used to convert between them.
point_group_conv = {'321' :'32', '312': '32', '3m1' :'3m', '31m': '3m',
                    '-3m1' : '-3m', '-31m': '-3m', '-4m2': '-42m', '-62m': '-6m2' }

# Using this dictionary we can correct to the standard point group notation.
corrected_point_groups = [point_group_conv.get(pg, pg) for pg in point_groups]
# Which produces the correct number of point groups. 32.
print("Number of point groups in conventional notation: {}".format(len(set(corrected_point_groups))))

# There are 10 polar point groups
polar_point_groups = ['1', '2', 'm', 'mm2', '4', '4mm', '3', '3m', '6', '6mm']

# Polar spacegroups have polar point groups.
polar_spacegroups = []

# There are 230 spacegroups
for i in range(1,231):
    symbol = sg_symbol_from_int_number(i)
    pg = SYMM_DATA['space_group_encoding'][symbol]['point_group']
    if point_group_conv.get(pg, pg) in polar_point_groups:
        polar_spacegroups.append(i)
        
# 68 of the 230 spacegroups are polar.
print("Number of polar spacegroups: {}".format(len(polar_spacegroups)))

Number of point groups denoted in pymatgen: 40
Number of point groups in conventional notation: 32
Number of polar spacegroups: 68
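
The notation conversion and the polarity check from the cell above can be isolated in a small helper that runs without pymatgen. This is a minimal sketch: `is_polar_point_group` is a hypothetical name, and the two constants are copied from the cell above.

```python
# Mapping from alternative settings to the conventional point group symbol,
# copied from the cell above
POINT_GROUP_CONV = {
    "321": "32", "312": "32", "3m1": "3m", "31m": "3m",
    "-3m1": "-3m", "-31m": "-3m", "-4m2": "-42m", "-62m": "-6m2",
}

# The 10 polar point groups, as listed in the cell above
POLAR_POINT_GROUPS = {"1", "2", "m", "mm2", "4", "4mm", "3", "3m", "6", "6mm"}


def is_polar_point_group(symbol):
    """Return True if the point group symbol (in either setting) is polar."""
    conventional = POINT_GROUP_CONV.get(symbol, symbol)
    return conventional in POLAR_POINT_GROUPS


# A few spot checks, chosen for illustration
for symbol in ["31m", "-31m", "mm2", "m-3m"]:
    print(symbol, is_polar_point_group(symbol))
```

Converting to the conventional symbol before the lookup is what keeps settings such as `31m` from being missed.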


### Query

In [4]:
if os.path.exists('data/stage_1/fitted_MP_data_stage_1.csv'):
    fitted_entries = pd.read_csv('data/stage_1/fitted_MP_data_stage_1.csv', sep=",")
else: 
    with MPRester("b7RtVfJTsUg6TK8E") as mpr:
    
        criteria = {'elements':{"$nin": spin_zero_isotopes}, #not included
                    'icsd_ids': {'$gte': 0}, #All compounds deemed similar to a structure in ICSD
                    "magnetic_type": {"$eq": "NM"}, #non-magnetic
                    "spacegroup.number": {"$nin": polar_spacegroups}
                    }

        props = ["material_id","full_formula","icsd_ids", "band_gap","spacegroup","pretty_formula"]
        fitted_entries = pd.DataFrame(mpr.query(criteria=criteria, properties=props))    
        
print("Number of entries after query: {}".format(len(fitted_entries)))

Number of entries after query: 4511
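
To illustrate how the `$nin` criterion on `elements` acts, here is a pure Python emulation on a few toy records. It assumes the query follows MongoDB semantics, where `$nin` on a list-valued field matches only if none of the listed values occur in it. `matches_nin` and the records are hypothetical, and nothing here contacts the Materials Project.

```python
def matches_nin(values, excluded):
    """Emulate $nin on a list-valued field: match only if no value is excluded."""
    return not any(value in excluded for value in values)


# Toy records; the element lists follow from the formulas
records = [
    {"pretty_formula": "BaTe", "elements": ["Ba", "Te"]},
    {"pretty_formula": "SiO2", "elements": ["Si", "O"]},
    {"pretty_formula": "GaAs", "elements": ["Ga", "As"]},
    {"pretty_formula": "BP", "elements": ["B", "P"]},
]

# A small subset of the exclusion list defined in the notebook
excluded_subset = {"B", "P", "Ga", "As"}

kept = [record["pretty_formula"] for record in records
        if matches_nin(record["elements"], excluded_subset)]
print(kept)
```

A single excluded element is enough to reject a compound, which is why the exclusion list removes such a large share of the database.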


At this stage, Ferrenti et al. have $3363$ compounds, compared with our $4511$ compounds. 

In [5]:
import ast

def polarGroupUnitTest(entries):
    """Unit test: report, and drop in place, any entry with a polar point group."""
    # Polar point groups, keyed by crystal system
    exclude_polar_space_groups = {
         "triclinic":   ["1"], 
         "monoclinic":  ["2", "m"],
         "orthorhombic":["mm2"],
         "tetragonal":  ["4", "4mm"],
         "trigonal":    ["3", "3m"],
         "hexagonal":   ["6", "6mm"]
        }

    # Collect the index labels of all entries with a polar point group
    deleteEntries = []
    for i, entry in entries.iterrows():
        spacegroup = entry["spacegroup"]
        # When read back from CSV, the spacegroup dict is stored as a string
        if isinstance(spacegroup, str):
            spacegroup = ast.literal_eval(spacegroup)
        polar_groups = exclude_polar_space_groups.get(spacegroup["crystal_system"], [])
        if spacegroup["point_group"] in polar_groups:
            deleteEntries.append(i)

    if deleteEntries:
        # Drop the offending rows by index label, in place
        entries.drop(index=deleteEntries, inplace=True)
        print("Test not passed, polar groups could be wrong")
    else:
        print("Polar group test passed.")
polarGroupUnitTest(fitted_entries)

fitted_entries.to_csv("data/stage_1/fitted_MP_data_stage_1.csv", sep=",", index = False)

Polar group test passed.
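
The unit test above has to handle the `spacegroup` column in two forms, because a dict written to CSV comes back as a string. The sketch below reproduces this with the standard library only and parses the string with `ast.literal_eval`, which does not execute arbitrary code. The record is hypothetical and its id is a placeholder, not a real entry.

```python
import ast
import csv
import io

# Hypothetical record shaped like one row of the query result
row = {
    "material_id": "mp-0000",  # placeholder id
    "spacegroup": {"symbol": "Fm-3m", "number": 225,
                   "point_group": "m-3m", "crystal_system": "cubic"},
}

# Write the record to an in-memory CSV; the dict is stored as its string form
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(row))
writer.writeheader()
writer.writerow(row)

# Read it back: every field is now a plain string
buffer.seek(0)
loaded = next(csv.DictReader(buffer))
print(type(loaded["spacegroup"]))

# Parse the string back into a dict
spacegroup = ast.literal_eval(loaded["spacegroup"])
print(spacegroup["point_group"])
```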


## Stage 2

Here, we ensure that our data satisfies the following restrictions:
- No Th, U, Cd, Hg
- No noble gases or rare-earth elements

In [6]:
# Exclude the following elements:
exclude_elements = [ 
    "Th", "U", "Cd", "Hg", #restriction nr 1 above (4)
    "He", "Ne", "Ar", "Kr", "Xe", "Rn", "Og", #no noble gases (7)
    "Sc", "Y", "La", "Ce", "Pr", "Nd", "Pm", "Sm", "Eu", "Gd",
     "Tb", "Dy", "Ho", "Er", "Tm", "Yb", "Lu"]
    # No rare-earth elements (17)

exclude_elements.extend(spin_zero_isotopes)
exclude_elements = list(set(exclude_elements))

### Query

In [7]:
if os.path.exists('data/stage_2/fitted_MP_data_stage_2.csv'):
    fitted_entries = pd.read_csv("data/stage_2/fitted_MP_data_stage_2.csv", sep=",")
else: 
    with MPRester("b7RtVfJTsUg6TK8E") as mpr:
    
        criteria = {"elements":{"$nin": exclude_elements}, #not included
                    "task_id":{"$in": fitted_entries["material_id"].values.tolist()},
                    "icsd_ids": {"$gt": 0}, #All compounds deemed similar to a structure in ICSD
                    "magnetic_type": {"$eq": "NM"}, #non-magnetic
                    "spacegroup.number": {"$nin": polar_spacegroups}
                    }

        props = ["material_id","full_formula","icsd_ids", "band_gap","spacegroup","pretty_formula"]
        fitted_entries = pd.DataFrame(mpr.query(criteria=criteria, properties=props))
        fitted_entries.to_csv("data/stage_2/fitted_MP_data_stage_2.csv", sep=",", index = False)
        
print("Number of entries after query: {}".format(len(fitted_entries)))
fitted_entries

Number of entries after query: 2730


Unnamed: 0,material_id,full_formula,icsd_ids,band_gap,spacegroup,pretty_formula
0,mp-100,Hf1,"[76412, 638608, 104208, 53023]",0.0000,"{'symprec': 0.1, 'source': 'spglib', 'symbol':...",Hf
1,mp-1000,Ba1Te1,"[616165, 616163, 29152, 43656]",1.8555,"{'symprec': 0.1, 'source': 'spglib', 'symbol':...",BaTe
2,mp-10000,Hf4S2,[43203],0.0000,"{'symprec': 0.1, 'source': 'spglib', 'symbol':...",Hf2S
3,mp-10008,Ca7Ge1,[43321],0.0000,"{'symprec': 0.1, 'source': 'spglib', 'symbol':...",Ca7Ge
4,mp-10013,Sn1S1,[43409],0.2371,"{'symprec': 0.1, 'source': 'spglib', 'symbol':...",SnS
...,...,...,...,...,...,...
2725,mp-999192,Si2Ni2,"[187623, 187622]",0.0000,"{'symprec': 0.1, 'source': 'spglib', 'symbol':...",SiNi
2726,mp-999200,Si4,[181908],0.4517,"{'symprec': 0.1, 'source': 'spglib', 'symbol':...",Si
2727,mp-999259,Ru2C2,[188285],0.0000,"{'symprec': 0.1, 'source': 'spglib', 'symbol':...",RuC
2728,mp-999308,Os2C2,"[181031, 181032, 181029, 168278]",0.0000,"{'symprec': 0.1, 'source': 'spglib', 'symbol':...",OsC


At this stage, Ferrenti et al. have $1993$ compounds, compared with our $2730$ compounds. 

## Stage 3

The GGA functional is commonly known to underestimate the band gap of materials, so in this stage we try to find as many predictions as possible and compare them to experimental data. 

We are looking for semiconductors such as Si, which has an experimental band gap of about $1.1$ eV. Note that the cell below applies a lower cutoff than this: it keeps only Materials Project entries with a calculated band gap of at least $0.4$ eV, which leaves room for the GGA underestimation. 
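
The band gap restriction is a simple threshold filter. As a minimal sketch, the helper below (`filter_by_band_gap` is a hypothetical name) applies it to four calculated band gaps copied from the stage 2 table above, once with the limit of $0.4$ eV set in the following cell and once with $1.1$ eV for comparison.

```python
# Calculated band gaps (eV) copied from the stage 2 table above
band_gaps = {
    "Hf": 0.0000,
    "BaTe": 1.8555,
    "SnS": 0.2371,
    "Si": 0.4517,
}


def filter_by_band_gap(gaps, lower_limit):
    """Keep only the materials whose band gap is at least lower_limit (in eV)."""
    return {name: gap for name, gap in gaps.items() if gap >= lower_limit}


print(filter_by_band_gap(band_gaps, lower_limit=0.4))
print(filter_by_band_gap(band_gaps, lower_limit=1.1))
```

With the calculated GGA value, Si passes the $0.4$ eV limit but would be rejected by a $1.1$ eV limit.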

In [8]:
lowerBandGapLimit = 0.4

fitted_entries = fitted_entries[fitted_entries["band_gap"] >= lowerBandGapLimit]
fitted_entries.to_csv("data/stage_3/fitted_MP_data_stage_3.csv", sep=",", index = False)

fitted_entries

Unnamed: 0,material_id,full_formula,icsd_ids,band_gap,spacegroup,pretty_formula
1,mp-1000,Ba1Te1,"[616165, 616163, 29152, 43656]",1.8555,"{'symprec': 0.1, 'source': 'spglib', 'symbol':...",BaTe
8,mp-1002124,Hf1C1,[185992],0.5774,"{'symprec': 0.1, 'source': 'spglib', 'symbol':...",HfC
9,mp-1002164,Ge1C1,[182363],1.8486,"{'symprec': 0.1, 'source': 'spglib', 'symbol':...",GeC
16,mp-10064,Si1O2,[44271],1.9997,"{'symprec': 0.1, 'source': 'spglib', 'symbol':...",SiO2
17,mp-1006878,Ba1O2,[180398],2.3433,"{'symprec': 0.1, 'source': 'spglib', 'symbol':...",BaO2
...,...,...,...,...,...,...
2706,mp-985829,Hf1S2,"[638851, 601164, 603757, 182677, 638847]",1.2325,"{'symprec': 0.1, 'source': 'spglib', 'symbol':...",HfS2
2707,mp-985831,Hf1Se2,"[195308, 638902, 182678, 638899, 603743]",0.5549,"{'symprec': 0.1, 'source': 'spglib', 'symbol':...",HfSe2
2710,mp-9921,Zr2S6,"[42073, 651463, 651485]",1.0948,"{'symprec': 0.1, 'source': 'spglib', 'symbol':...",ZrS3
2711,mp-9922,Hf2S6,"[638846, 42074]",1.1361,"{'symprec': 0.1, 'source': 'spglib', 'symbol':...",HfS3


## Stage 4

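In the last stage we reject all entries with an energy above hull (E Above Hull) larger than $0.2$ eV per atom, since larger values indicate a thermodynamically unstable material. We use the values calculated in the Materials Project for this purpose. Note that the filter in the cell below is commented out and `e_above_hull` is not among the queried properties, so no entries are removed here. 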
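
Since the notebook's own filter is commented out in the cell below, here is a minimal sketch of the intended step, assuming an `e_above_hull` value were available for each entry. The records and their values are made up for illustration.

```python
E_ABOVE_HULL_LIMIT = 0.2

# Hypothetical records; the e_above_hull values are invented for illustration
records = [
    {"material_id": "example-a", "e_above_hull": 0.00},
    {"material_id": "example-b", "e_above_hull": 0.05},
    {"material_id": "example-c", "e_above_hull": 0.35},
]

# Keep only entries strictly below the limit, as in the commented-out filter
stable_enough = [r for r in records if r["e_above_hull"] < E_ABOVE_HULL_LIMIT]
kept_ids = [r["material_id"] for r in stable_enough]
print(kept_ids)
```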

In [9]:
EAboveHullLimit = 0.2

#bandGaps    = bandGaps   [entries["e_above_hull"] < EAboveHullLimit].reset_index(drop=True)
#spaceGroups = spaceGroups[entries["e_above_hull"] < EAboveHullLimit].reset_index(drop=True)
#icsdIDs     = icsdIDs    [entries["e_above_hull"] < EAboveHullLimit].reset_index(drop=True)
#entries     = entries    [entries["e_above_hull"] < EAboveHullLimit].reset_index(drop=True)

In [10]:
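# Label 1: fitted candidate. assign returns a new DataFrame, which avoids
# writing into a filtered slice of the stage 2 data.
fitted_entries = fitted_entries.assign(candidate=1.0)
fitted_entries.to_csv("data/stage_4/fitted_MP_data_stage_4.csv", sep=",", index=False)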
#bandGaps.to_csv   ("data/stage_4/bandgaps_stage_4.csv",    sep=",", index = False)
#spaceGroups.to_csv("data/stage_4/spaceGroups_stage_4.csv", sep=",", index = False)



# Unfitted candidates
## Stage 1 + 2
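To find unfitted candidates, we begin with the opposite criterion for which elements to include, based on stable nuclear-spin-zero isotopes, and the opposite space group criterion. The full set of opposite criteria is listed below. However, the magnetic criterion is left out of the actual query, for the reason given in the Query section.

- $50\%+$ nuclear-spin-nonzero ($I \neq 0$) isotopes
- Calculated magnetic (not applied in the query)
- Has experimental ICSD entry
- Crystallize in polar space groups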

Additionally, we fold stage 2 into this step (otherwise we would widen the search from stage 1 to stage 2). That means we include all the elements we excluded in stage 1 and stage 2. 
- Include Th, U, Cd, Hg
- Include noble gases or rare-earth elements

In [11]:
# NOTE: despite the name, these are the elements with predominantly spin-zero
# isotopes; they are passed to "$nin" below and thus excluded from the unfitted set.
spin_isotopes = ["He", "C", "O", "Ne", "Mg", "Si", "S", "Ar",
                "Ca", "Ti", "Cr", "Fe", "Ni", "Zn", "Ge", "Se",
                "Kr", "Sr", "Zr", "Mo", "Ru", "Pd", "Cd", "Sn",
                "Te", "Xe", "Ba", "Hf", "W", "Os", "Pt", "Hg",
                "Pb", "Ce", "Nd", "Sm", "Gd", "Dy", "Er", "Yb",
                "Th", "U"]  # 42 elements

# Elements that were excluded in stage 2 of the fitted search and are allowed here:
include_elements = [ 
    "Th", "U", "Cd", "Hg",  # (4)
    "He", "Ne", "Ar", "Kr", "Xe", "Rn", "Og",  # noble gases (7)
    "Sc", "Y", "La", "Ce", "Pr", "Nd", "Pm", "Sm", "Eu", "Gd",
     "Tb", "Dy", "Ho", "Er", "Tm", "Yb", "Lu"]  # rare-earth elements (17)

# Remove the allowed elements from the exclusion list
spin_isotopes = [el for el in spin_isotopes if el not in include_elements]

len(spin_isotopes)

26

### Query

Requiring a polar space group together with a magnetic type is a strict criterion, since it reduces the number of entries to $52$. If we instead remove the space group restriction, we end up with $1683$ entries, and if we remove the magnetic type restriction, we end up with $486$. Evidently, few entries are both polar and calculated to be magnetic, so the query below leaves the magnetic criterion commented out.

In [12]:
if os.path.exists('data/stage_1/unfitted_MP_data_stage_1.csv'):
    unfitted_entries = pd.read_csv('data/stage_1/unfitted_MP_data_stage_1.csv', sep=",")
else: 
    with MPRester("b7RtVfJTsUg6TK8E") as mpr:
    
        criteria = {'elements':{"$nin": spin_isotopes}, #not included
                    'icsd_ids': {'$gte': 0}, #All compounds deemed similar to a structure in ICSD
                    #"magnetic_type": {"$ne": "NM"}, #non-magnetic not equal
                    "spacegroup.number": {"$in": polar_spacegroups}
                    }

        props = ["material_id","full_formula","icsd_ids", "band_gap","spacegroup","pretty_formula"]
        unfitted_entries = pd.DataFrame(mpr.query(criteria=criteria, properties=props))    
        
print("Number of entries after query: {}".format(len(unfitted_entries)))

Number of entries after query: 486


In [13]:
unfitted_entries.to_csv("data/stage_1/unfitted_MP_data_stage_1.csv", sep=",", index = False)


## Stage 3

As in the fitted case, the GGA functional is commonly known to underestimate the band gap of materials. 

For the unfitted candidates we keep only Materials Project entries with a calculated band gap of at least $0.1$ eV, which is the limit set in the cell below. 

In [14]:
lowerBandGapLimit = 0.1

unfitted_entries = unfitted_entries[unfitted_entries["band_gap"] >= lowerBandGapLimit]
unfitted_entries.to_csv("data/stage_3/unfitted_MP_data_stage_3.csv", sep=",", index = False)

unfitted_entries

Unnamed: 0,material_id,full_formula,icsd_ids,band_gap,spacegroup,pretty_formula
0,mp-680577,K4Ga8Cl28,[12],4.5052,"{'symprec': 0.1, 'source': 'spglib', 'symbol':...",KGa2Cl7
1,mp-27529,P2I6,[311],2.3551,"{'symprec': 0.1, 'source': 'spglib', 'symbol':...",PI3
3,mp-23515,K24Co12Cl48,"[419658, 419685, 660, 71303]",0.7891,"{'symprec': 0.1, 'source': 'spglib', 'symbol':...",K2CoCl4
4,mp-683945,K8Er24F80,[1658],7.0219,"{'symprec': 0.1, 'source': 'spglib', 'symbol':...",KEr3F10
5,mp-27755,K4Al4Cl16,[1704],5.2915,"{'symprec': 0.1, 'source': 'spglib', 'symbol':...",KAlCl4
...,...,...,...,...,...,...
453,mp-1198361,Al12Bi48Cl48,[426517],1.4598,"{'symprec': 0.1, 'source': 'spglib', 'symbol':...",Al(BiCl)4
454,mp-1195481,Hg4As8Xe20F88,[428000],2.1803,"{'symprec': 0.1, 'source': 'spglib', 'symbol':...",HgAs2Xe5F22
455,mp-1190782,Li8Yb8Bi8,[602201],0.2643,"{'symprec': 0.1, 'source': 'spglib', 'symbol':...",LiYbBi
464,mp-1008559,B2P2,[615155],1.0759,"{'symprec': 0.1, 'source': 'spglib', 'symbol':...",BP


## Stage 4

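In the last stage we reject all entries with an energy above hull (E Above Hull) larger than $0.2$ eV per atom, since larger values indicate a thermodynamically unstable material. We use the values calculated in the Materials Project for this purpose. As for the fitted candidates, the filter in the cell below is commented out, so no entries are removed here. 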

In [15]:
EAboveHullLimit = 0.2

#bandGaps    = bandGaps   [entries["e_above_hull"] < EAboveHullLimit].reset_index(drop=True)
#spaceGroups = spaceGroups[entries["e_above_hull"] < EAboveHullLimit].reset_index(drop=True)
#icsdIDs     = icsdIDs    [entries["e_above_hull"] < EAboveHullLimit].reset_index(drop=True)
#entries     = entries    [entries["e_above_hull"] < EAboveHullLimit].reset_index(drop=True)

In [16]:
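# Label 0: unfitted candidate. assign returns a new DataFrame, which avoids
# writing into a filtered slice of the stage 1 data.
unfitted_entries = unfitted_entries.assign(candidate=0.0)
unfitted_entries.to_csv("data/stage_4/unfitted_MP_data_stage_4.csv", sep=",", index=False)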
#bandGaps.to_csv   ("data/stage_4/bandgaps_stage_4.csv",    sep=",", index = False)
#spaceGroups.to_csv("data/stage_4/spaceGroups_stage_4.csv", sep=",", index = False)

# Combine data and create training and test set

In this section we combine the data we have extracted, both from the Generated Data notebook and the labels from the data mining sections above. 

In [17]:
training_data = pd.concat([fitted_entries,unfitted_entries]).reset_index(drop=True)
training_data
#entries = pd.read_csv("data/databases/initialDataMP/MP/MP_FLAGBIGFILE.csv", sep=",")

Unnamed: 0,material_id,full_formula,icsd_ids,band_gap,spacegroup,pretty_formula,candidate
0,mp-1000,Ba1Te1,"[616165, 616163, 29152, 43656]",1.8555,"{'symprec': 0.1, 'source': 'spglib', 'symbol':...",BaTe,1.0
1,mp-1002124,Hf1C1,[185992],0.5774,"{'symprec': 0.1, 'source': 'spglib', 'symbol':...",HfC,1.0
2,mp-1002164,Ge1C1,[182363],1.8486,"{'symprec': 0.1, 'source': 'spglib', 'symbol':...",GeC,1.0
3,mp-10064,Si1O2,[44271],1.9997,"{'symprec': 0.1, 'source': 'spglib', 'symbol':...",SiO2,1.0
4,mp-1006878,Ba1O2,[180398],2.3433,"{'symprec': 0.1, 'source': 'spglib', 'symbol':...",BaO2,1.0
...,...,...,...,...,...,...,...
1617,mp-1198361,Al12Bi48Cl48,[426517],1.4598,"{'symprec': 0.1, 'source': 'spglib', 'symbol':...",Al(BiCl)4,0.0
1618,mp-1195481,Hg4As8Xe20F88,[428000],2.1803,"{'symprec': 0.1, 'source': 'spglib', 'symbol':...",HgAs2Xe5F22,0.0
1619,mp-1190782,Li8Yb8Bi8,[602201],0.2643,"{'symprec': 0.1, 'source': 'spglib', 'symbol':...",LiYbBi,0.0
1620,mp-1008559,B2P2,[615155],1.0759,"{'symprec': 0.1, 'source': 'spglib', 'symbol':...",BP,0.0
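
The notebook stops at the combined table and does not show the split into training and test sets. One possible way to do it is sketched below with the standard library only. The 80/20 ratio, the fixed seed, and the name `stratified_split` are all assumptions rather than the notebook's method. The split is stratified so that both sets keep the ratio of label $1$ to label $0$.

```python
import random


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Split row indices into train and test sets, preserving the label ratio."""
    rng = random.Random(seed)
    train, test = [], []
    for label in sorted(set(labels)):
        # Shuffle the rows of each label separately, then cut off the test share
        indices = [i for i, value in enumerate(labels) if value == label]
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


# Hypothetical label column: 10 fitted (1.0) and 5 unfitted (0.0) candidates
labels = [1.0] * 10 + [0.0] * 5
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

The returned indices could then be used to select rows from `training_data`, for example with `training_data.iloc[train_idx]`.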
