# Data mining of semiconductor materials

This notebook serves as an example of how to extract information from the Materials Project database [1]. 
In total, there are 126335 entries in the Materials Project, of which 48644 (roughly 39%) are deemed structurally similar to an experimental ICSD entry according to pymatgen's StructureMatcher algorithm. Additionally, 65783 entries have a calculated band gap larger than $0.1$ eV. The overlap between these two sets is 25271 entries, which is our starting point. 
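
The shares quoted above follow directly from the entry counts. As a quick sanity check, here is a minimal sketch that uses only the numbers stated in this cell; the variable and function names are illustrative and not part of the project code.

```python
# Entry counts quoted in the text above
total_entries = 126335   # all Materials Project entries
icsd_like = 48644        # structurally similar to an experimental ICSD entry
with_band_gap = 65783    # calculated band gap larger than 0.1 eV
overlap = 25271          # entries satisfying both conditions


def percent(part, whole):
    """Return part/whole as a percentage rounded to one decimal."""
    return round(100 * part / whole, 1)


icsd_share = percent(icsd_like, total_entries)
gap_share = percent(with_band_gap, total_entries)
overlap_share = percent(overlap, total_entries)

print(icsd_share, gap_share, overlap_share)
```

The starting point of 25271 entries thus corresponds to about a fifth of the database.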

The notebook consists of 4 stages and is strongly inspired by Ferrenti et al. [2]. However, we diverge at stage 3, since the extraction from other databases has been moved out and is done before this notebook is run. 

Additionally, we will use the data mining process to find both fitted and unfitted candidates. Fitted candidates satisfy a given set of criteria for good candidates and receive the label $1$. Unfitted candidates are found with the complete opposite criteria and receive the label $0$. 
 
 
### Contents
  #### Fitted candidates
    - Stage 1
        - $50\% + l = 0$ isotopes
        - Calculated non-magnetic
        - Has experimental ICSD entry
        - Crystallize in non-polar space groups
    - Stage 2
        - No Th, U, Cd, Hg
        - No noble gases or rare-earth elements
    - Stage 3
        - bandgap restriction
        
  #### Unfitted candidates
    - Stage 1
        - $50\% + l \neq 0$ isotopes
        - Calculated magnetic
        - Has experimental ICSD entry
        - Crystallize in polar space groups
    - Stage 2
        - Include Th, U, Cd, Hg
        - Include noble gases or rare-earth elements    
    - Stage 3
        - bandgap restriction
        
   #### Summarize bandgaps and other properties

[1] Ong, S. P.; Cholia, S.; Jain, A.; Brafman, M.; Gunter, D.; Ceder, G.; 
Persson, K. A. The Materials Application Programming Interface (API): A 
simple, flexible and efficient API for materials data based on
REpresentational State Transfer (REST) principles, Comput. Mater. Sci.,
2015, 97, 209–215. doi:10.1016/j.commatsci.2014.10.037.

[2] Ferrenti, A.M., de Leon, N.P., Thompson, J.D. et al. Identifying candidate hosts for quantum defects via data mining. npj Comput Mater 6, 126 (2020). https://doi.org/10.1038/s41524-020-00391-7

Initially, we start off with some imports. 

In [1]:
# Optional: Load the "autoreload" extension so that code can change
%load_ext autoreload

#OPTIONAL: Always reload modules so that as you change code in src, it gets loaded
%autoreload 2

In [2]:
from pathlib import Path
data_dir = Path.cwd().parent.parent / "data" 
print("Current data directory {}".format(data_dir))

In [3]:
# Tools for query of data
from pymatgen import MPRester, Composition

# Finding correct use of polar groups
from src.data.utils import polarGroupUsedInMP, sortByMPID, filterIDs

# pandas
import pandas as pd
import numpy as np

from tqdm import tqdm

# Ignore warnings caused by NaN values
# (np.warnings was only an alias for the standard warnings module)
import warnings
warnings.filterwarnings('ignore')

In [4]:
InsertApproach = "01-naive-approach"

# Fitted candidates
## Stage 1
Next, we need to decide which elements to include. The article mentioned above [2] imposes the following strict restrictions: 
- $50\% + l = 0$ isotopes
- Calculated non-magnetic
- Has experimental ICSD entry
- Crystallize in non-polar space groups

This stage is done entirely through a query via pymatgen; however, the restrictions need to be set up properly first. 
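
The element restriction is passed to the query as a MongoDB-style `$nin` condition, which on a list field only matches entries that contain none of the listed values. The following is a rough local sketch of that behaviour, not the query itself: the helper names are hypothetical, the exclusion list is shortened for illustration, and the formulas are examples taken from the query results in this notebook.

```python
import re


def elements_in(formula):
    """Return the set of element symbols in a formula such as 'Ba1Te1'."""
    return set(re.findall(r"[A-Z][a-z]?", formula))


def passes_nin(formula, excluded):
    """Mimic {'elements': {'$nin': excluded}}: no excluded element may appear."""
    return elements_in(formula).isdisjoint(excluded)


# Shortened exclusion list, for illustration only
excluded = ["H", "Li", "N", "F", "Na", "Al", "P"]
formulas = ["Ba1Te1", "Si1O2", "Na1N3", "Al2P2"]

kept = [formula for formula in formulas if passes_nin(formula, excluded)]
print(kept)
```

A single excluded element is enough to reject a compound, which is why the restriction removes a large part of the database.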

In [5]:
# Despite the variable name, these are the elements that are excluded in the query below
spin_zero_isotopes = [
    "H", "Li", "Be", "B", "N", "F", "Na", "Al", "P", "Cl", "K", "Sc",
    "V", "Mn", "Co", "Cu", "Ga", "As", "Br", "Rb", "Y", "Nb", "Tc", 
    "Rh", "Ag", "In", "Sb", "I", "Cs", "Lu", "Ta", "Re", "Ir", "Au",
    "Tl", "Bi", "Po", "At", "Rn", "La","Pr", "Pm", "Eu", "Tb", "Ho",
    "Tm", "Ac", "Pa", "Np", "Pu", "Am"]  #(51)

print("Number of excluded periodic-elements from isotopes: {}".format(len(spin_zero_isotopes)))

Number of excluded periodic-elements from isotopes: 51


In [6]:
polar_spacegroups = polarGroupUsedInMP()

Number of point groups denoted in pymatgen: 40
Number of point groups in conventional notation: 32
Number of polar spacegroups: 68
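
The helper `polarGroupUsedInMP` belongs to this project and its implementation is not shown here. As an independent cross-check of the count of 68, the sketch below enumerates the space-group numbers belonging to the ten polar point groups, assuming the standard International Tables numbering; the dictionary and variable names are illustrative.

```python
# Inclusive space-group number ranges for the ten polar point groups
polar_ranges = {
    "1":   [(1, 1)],
    "2":   [(3, 5)],
    "m":   [(6, 9)],
    "mm2": [(25, 46)],
    "4":   [(75, 80)],
    "4mm": [(99, 110)],
    "3":   [(143, 146)],
    "3m":  [(156, 161)],
    "6":   [(168, 173)],
    "6mm": [(183, 186)],
}

# Flatten the ranges into one sorted list of space-group numbers
polar_numbers = sorted(
    number
    for ranges in polar_ranges.values()
    for start, stop in ranges
    for number in range(start, stop + 1)
)

print(len(polar_numbers))
```

The list has the same length as the one reported by the helper, although the helper may arrive at it differently, for instance by mapping pymatgen's point-group notation to the conventional one.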


### Query

In [7]:
with MPRester("b7RtVfJTsUg6TK8E") as mpr:
    
    criteria = {'elements':{"$nin": spin_zero_isotopes}, #not included
                    'icsd_ids': {'$gte': 0}, #All compounds deemed similar to a structure in ICSD
                    "magnetic_type": {"$eq": "NM"}, #non-magnetic
                    "spacegroup.number": {"$nin": polar_spacegroups}
                    }

    props = ["material_id","full_formula", "spacegroup", "band_gap"]
    fitted_entries = pd.DataFrame(mpr.query(criteria=criteria, properties=props))    
        
print("Number of entries after query: {}".format(len(fitted_entries)))

Number of entries after query: 4347


For comparison, Ferrenti et al. report $3363$ compounds at this stage, against the $4347$ entries returned by our query. 

In [8]:
def polarGroupUnitTest(entries):
    # Unit test for polar groups: after the query, no entry should
    # still belong to one of the polar point groups listed below.
    exclude_polar_space_groups =  {
         "triclinic":   ["1"], 
         "monoclinic":  ["2", "m"],
         "orthorhombic":["mm2"],
         "tetragonal":  ["4", "4mm"],
         "trigonal":    ["3", "3m"],
         "hexagonal":   ["6", "6mm"]
        }

    # Collect the index labels of all entries with a polar point group
    deleteEntries = []
    for i, entry in entries.iterrows():
        crystal_system = entry["spacegroup"]["crystal_system"]
        point_group = entry["spacegroup"]["point_group"]
        if point_group in exclude_polar_space_groups.get(crystal_system, []):
            deleteEntries.append(i)

    # Drop the offending rows in place; `del entries[i]` would target a column
    entries.drop(index=deleteEntries, inplace=True)
    if len(deleteEntries) > 0:
        print("Test not passed, polar groups could be wrong")
    else:
        print("Polar group test passed.")
polarGroupUnitTest(fitted_entries)

#fitted_entries.to_csv("data/stage_1/fitted_MP_data_stage_1.csv", sep=",", index = False)

Polar group test passed.


## Stage 2

Here, we additionally require the following:
- No Th, U, Cd, Hg
- No noble gases or rare-earth elements

In [9]:
# Exclude the following elements:
exclude_elements = [ 
    "Th", "U", "Cd", "Hg", #restriction nr 1 above (4)
    "He", "Ne", "Ar", "Kr", "Xe", "Rn", "Og", #no noble gases (7)
    "Sc", "Y", "La", "Ce", "Pr", "Nd", "Pm", "Sm", "Eu", "Gd",
     "Tb", "Dy", "Ho", "Er", "Tm", "Yb", "Lu"]  # No rare-earth elements (17)

exclude_elements.extend(spin_zero_isotopes)
exclude_elements = list(set(exclude_elements))

### Query

In [10]:
with MPRester("b7RtVfJTsUg6TK8E") as mpr:
    
    criteria = {"elements":{"$nin": exclude_elements}, #not included
                    "task_id":{"$in": fitted_entries["material_id"].values.tolist()},
                    "icsd_ids": {"$gt": 0}, #All compounds deemed similar to a structure in ICSD
                    "magnetic_type": {"$eq": "NM"}, #non-magnetic
                    "spacegroup.number": {"$nin": polar_spacegroups}
                    }

    props = ["material_id","full_formula", "spacegroup", "band_gap"]
    fitted_entries = pd.DataFrame(mpr.query(criteria=criteria, properties=props))
        
print("Number of entries after query: {}".format(len(fitted_entries)))
fitted_entries

Number of entries after query: 2644


Unnamed: 0,material_id,full_formula,spacegroup,band_gap
0,mp-100,Hf1,"{'symprec': 0.1, 'source': 'spglib', 'symbol':...",0.0000
1,mp-1000,Ba1Te1,"{'symprec': 0.1, 'source': 'spglib', 'symbol':...",1.8555
2,mp-10000,Hf4S2,"{'symprec': 0.1, 'source': 'spglib', 'symbol':...",0.0000
3,mp-10008,Ca7Ge1,"{'symprec': 0.1, 'source': 'spglib', 'symbol':...",0.0000
4,mp-10013,Sn1S1,"{'symprec': 0.1, 'source': 'spglib', 'symbol':...",0.2371
...,...,...,...,...
2639,mp-999192,Si2Ni2,"{'symprec': 0.1, 'source': 'spglib', 'symbol':...",0.0000
2640,mp-999200,Si4,"{'symprec': 0.1, 'source': 'spglib', 'symbol':...",0.4517
2641,mp-999259,Ru2C2,"{'symprec': 0.1, 'source': 'spglib', 'symbol':...",0.0000
2642,mp-999308,Os2C2,"{'symprec': 0.1, 'source': 'spglib', 'symbol':...",0.0000


For comparison, Ferrenti et al. report $1993$ compounds at this stage, against the $2644$ entries returned by our query. 

## Stage 3

The GGA functional is commonly known to underestimate the band gap of materials. In this stage we therefore try to collect as many band gap predictions as possible and compare them to experimental data. 

We only keep Materials Project entries whose calculated band gap is at least the lower limit `lowerBandGapLimit` set in the next cell, since we are looking for semiconductors such as Si, which has an experimental band gap of about $1.1$ eV. 

In [11]:
lowerBandGapLimit = 0.4

# .copy() avoids a SettingWithCopyWarning when the candidate column is added later
fitted_entries = fitted_entries[fitted_entries["band_gap"] >= lowerBandGapLimit].copy()
#fitted_entries.to_csv("data/stage_3/fitted_MP_data_stage_3.csv", sep=",", index = False)

fitted_entries

Unnamed: 0,material_id,full_formula,spacegroup,band_gap
1,mp-1000,Ba1Te1,"{'symprec': 0.1, 'source': 'spglib', 'symbol':...",1.8555
8,mp-1002124,Hf1C1,"{'symprec': 0.1, 'source': 'spglib', 'symbol':...",0.5774
9,mp-1002164,Ge1C1,"{'symprec': 0.1, 'source': 'spglib', 'symbol':...",1.8486
16,mp-10064,Si1O2,"{'symprec': 0.1, 'source': 'spglib', 'symbol':...",1.9997
17,mp-1006878,Ba1O2,"{'symprec': 0.1, 'source': 'spglib', 'symbol':...",2.3433
...,...,...,...,...
2620,mp-985829,Hf1S2,"{'symprec': 0.1, 'source': 'spglib', 'symbol':...",1.2325
2621,mp-985831,Hf1Se2,"{'symprec': 0.1, 'source': 'spglib', 'symbol':...",0.5549
2624,mp-9921,Zr2S6,"{'symprec': 0.1, 'source': 'spglib', 'symbol':...",1.0948
2625,mp-9922,Hf2S6,"{'symprec': 0.1, 'source': 'spglib', 'symbol':...",1.1361
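
The band gap restriction is a plain threshold on the calculated value. The sketch below applies the same cutoff to four rows copied from the stage 2 output, without pandas; the variable names are illustrative.

```python
lower_band_gap_limit = 0.4  # eV, same value as lowerBandGapLimit in the cell above

# (material_id, full_formula, band_gap) rows taken from the stage 2 output
rows = [
    ("mp-100", "Hf1", 0.0000),
    ("mp-1000", "Ba1Te1", 1.8555),
    ("mp-10013", "Sn1S1", 0.2371),
    ("mp-999200", "Si4", 0.4517),
]

# Keep only entries at or above the cutoff
kept_ids = [mpid for mpid, _, gap in rows if gap >= lower_band_gap_limit]
print(kept_ids)
```

Metallic entries such as Hf and narrow-gap entries such as SnS are dropped, while BaTe and the Si polymorph remain.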


In [12]:
fitted_entries["candidate"] = np.ones(len(fitted_entries))

# Unfitted candidates
## Stage 1 + 2
To find unfitted candidates, we begin by applying the opposite criterion for which elements to include, based on stable zero-nuclear-spin isotopes. The magnetic and polar space group criteria are inverted as well, although the query further down leaves the magnetic restriction commented out.

- $50\% + l \neq 0$ isotopes
- Calculated magnetic
- Has experimental ICSD entry
- Crystallize in polar space groups

Additionally, we fold stage 2 into this step (otherwise we would widen the search when going from stage 1 to stage 2). That means we include all the elements that were excluded in stage 1 and stage 2. 
- Include Th, U, Cd, Hg
- Include noble gases or rare-earth elements

In [13]:
spin_isotopes = ["He","C", "O", "Ne", "Mg", "Si", "S", "Ar",
                "Ca", "Ti", "Cr", "Fe", "Ni", "Zn", "Ge", "Se",
                "Se", "Kr", "Sr", "Zr", "Mo", "Ru", "Pd", "Cd",
                "Sn", "Te", "Xe", "Ba", "Hf", "W", "Os", "Pt",
                "Hg", "Pb", "Ce", "Nd", "Sm", "Gd", "Dy", "Er",
                "Yb", "Th", "U"] # 43 entries ("Se" is listed twice)

# Elements excluded from the fitted candidates, which may appear in unfitted ones:
include_elements = [ 
    "Th", "U", "Cd", "Hg", # restriction nr 1 of stage 2 (4)
    "He", "Ne", "Ar", "Kr", "Xe", "Rn", "Og", # noble gases (7)
    "Sc", "Y", "La", "Ce", "Pr", "Nd", "Pm", "Sm", "Eu", "Gd",
     "Tb", "Dy", "Ho", "Er", "Tm", "Yb", "Lu"]
    # rare-earth elements (17)

# Remove every element listed in include_elements from the exclusion list
spin_isotopes = [ele for ele in spin_isotopes if ele not in include_elements]

len(spin_isotopes)

27

In [14]:
#spin_isotopes

### Query

Here we find that combining the polar space group criterion with the magnetic type criterion is very strict, since it reduces the number of entries to $52$. If we remove the space group restriction we end up with $1683$ entries, and if we instead remove the magnetic type restriction we end up with $486$. Thus, it is clear that few entries are both magnetic and polar, which is why the magnetic restriction is commented out in the query below.

In [15]:

with MPRester("b7RtVfJTsUg6TK8E") as mpr:
    
    criteria = {'elements':{"$nin": spin_isotopes}, #not included
                'icsd_ids': {'$gte': 0}, #All compounds deemed similar to a structure in ICSD
                #"magnetic_type": {"$ne": "NM"}, #non-magnetic not equal
                "spacegroup.number": {"$in": polar_spacegroups}
                }

    props = ["material_id","full_formula", "spacegroup", "band_gap"]
    unfitted_entries = pd.DataFrame(mpr.query(criteria=criteria, properties=props))    
        
print("Number of entries after query: {}".format(len(unfitted_entries)))

Number of entries after query: 526


## Stage 3

For consistency, we only look at semiconductors. We therefore keep a lower band gap limit for the unfitted candidates as well, since the features we generate assume that a material has a band gap. 

In [16]:
lowerBandGapLimit = 0.1

# .copy() avoids a SettingWithCopyWarning when the candidate column is added later
unfitted_entries = unfitted_entries[unfitted_entries["band_gap"] >= lowerBandGapLimit].copy()
unfitted_entries

Unnamed: 0,material_id,full_formula,spacegroup,band_gap
3,mp-1008559,B2P2,"{'symprec': 0.1, 'source': 'spglib', 'symbol':...",1.0759
4,mp-1018100,Al2Sb2,"{'symprec': 0.1, 'source': 'spglib', 'symbol':...",0.9121
6,mp-1019323,Th2H1N3,"{'symprec': 0.1, 'source': 'spglib', 'symbol':...",1.5488
7,mp-10419,Na8Re2N6,"{'symprec': 0.1, 'source': 'spglib', 'symbol':...",2.4466
11,mp-1064952,Na1N3,"{'symprec': 0.1, 'source': 'spglib', 'symbol':...",4.0425
...,...,...,...,...
517,mp-570089,Cd12I24,"{'symprec': 0.1, 'source': 'spglib', 'symbol':...",2.4122
521,mp-696998,K12Bi2H6Cl16F8,"{'symprec': 0.1, 'source': 'spglib', 'symbol':...",4.1881
522,mp-866663,K6Ta2F16,"{'symprec': 0.1, 'source': 'spglib', 'symbol':...",3.8500
523,mp-8880,Al2P2,"{'symprec': 0.1, 'source': 'spglib', 'symbol':...",2.0228


In [17]:
unfitted_entries["candidate"] = np.zeros(len(unfitted_entries))

# Combine data and create training and test set

In [18]:
assert unfitted_entries[unfitted_entries["material_id"].isin(fitted_entries["material_id"])].shape[0]==0
assert fitted_entries[fitted_entries["material_id"].isin(unfitted_entries["material_id"])].shape[0]==0

In [19]:
trainingTargets = pd.concat([fitted_entries,unfitted_entries]).reset_index(drop=True)

trainingTargets = sortByMPID(trainingTargets)
trainingTargets = filterIDs(trainingTargets)
trainingTargets

A total of 43 MPIDs are inconsistent with the rest.
A total of 43 MPIDs were dropped from the dataset provided.


Unnamed: 0,material_id,full_formula,spacegroup,band_gap,candidate
0,mp-7,S6,"{'symprec': 0.1, 'source': 'spglib', 'symbol':...",2.4881,1.0
1,mp-14,Se3,"{'symprec': 0.1, 'source': 'spglib', 'symbol':...",1.0119,1.0
2,mp-19,Te3,"{'symprec': 0.1, 'source': 'spglib', 'symbol':...",0.5752,1.0
3,mp-24,C8,"{'symprec': 0.1, 'source': 'spglib', 'symbol':...",2.7785,1.0
4,mp-47,C4,"{'symprec': 0.1, 'source': 'spglib', 'symbol':...",3.3395,1.0
...,...,...,...,...,...
1644,mp-1205479,K44Sb22F110,"{'symprec': 0.1, 'source': 'spglib', 'symbol':...",4.7325,0.0
1645,mp-1208643,Sr4Hf4S12,"{'symprec': 0.1, 'source': 'spglib', 'symbol':...",2.0461,1.0
1646,mp-1210722,Mg2Te2Mo2O12,"{'symprec': 0.1, 'source': 'spglib', 'symbol':...",3.2147,1.0
1647,mp-1232407,Li6B6H32N4,"{'symprec': 0.1, 'source': 'spglib', 'symbol':...",5.9610,0.0


In [20]:
drop_columns =  ["full_formula", "spacegroup", "band_gap"]
trainingTargets = trainingTargets.drop(drop_columns, axis=1)

In [21]:
trainingTargets

Unnamed: 0,material_id,candidate
0,mp-7,1.0
1,mp-14,1.0
2,mp-19,1.0
3,mp-24,1.0
4,mp-47,1.0
...,...,...
1644,mp-1205479,0.0
1645,mp-1208643,1.0
1646,mp-1210722,1.0
1647,mp-1232407,0.0


In [22]:
#Path(data_dir / InsertApproach / "processed").mkdir(parents=True, exist_ok=True)
#training_data.to_pickle(data_dir / InsertApproach / "processed" / "trainingCombo.pkl")

# Distribution of entries in the data
How are the entries distributed across the different features? 
## Elements

In [23]:
trainingTargets

Unnamed: 0,material_id,candidate
0,mp-7,1.0
1,mp-14,1.0
2,mp-19,1.0
3,mp-24,1.0
4,mp-47,1.0
...,...,...
1644,mp-1205479,0.0
1645,mp-1208643,1.0
1646,mp-1210722,1.0
1647,mp-1232407,0.0


In [24]:
#trainingTargets = pd.read_pickle(data_dir / InsertApproach / "processed" / "trainingCombo.pkl")
generatedData = pd.read_pickle(data_dir / "interim" / "featurized" / "featurized-19-03-2021.pkl")

In [25]:
testSet = (
    trainingTargets.merge(generatedData, 
              on='material_id', 
              how='outer', 
              indicator=True)
    .query('_merge != "both"')
    .drop(columns='_merge')
)
trainingSet = (
    trainingTargets.merge(generatedData,
                on="material_id",
                indicator=False,
                how="left",
                suffixes=(False, False))
)

In [26]:
print(testSet.shape)
print(trainingSet.shape)

(23655, 4872)
(1649, 4872)
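
The two merges above split the featurized data into labeled and unlabeled rows. The following toy sketch shows the same pattern on two small frames; all values and the column `feature_a` are made up, and `targets` and `features` merely stand in for `trainingTargets` and `generatedData`.

```python
import pandas as pd

# Toy stand-ins for trainingTargets and generatedData (all values are made up)
targets = pd.DataFrame({"material_id": ["mp-1", "mp-2"],
                        "candidate": [1.0, 0.0]})
features = pd.DataFrame({"material_id": ["mp-1", "mp-2", "mp-3", "mp-4"],
                         "feature_a": [0.1, 0.2, 0.3, 0.4]})

# An outer merge with indicator=True adds a "_merge" column that records
# whether a row came from the left frame, the right frame, or both
merged = targets.merge(features, on="material_id", how="outer", indicator=True)

# Rows that are not present in both frames have no label yet
unlabeled = merged.query('_merge != "both"').drop(columns="_merge")

# A left merge keeps exactly the labeled rows and attaches their features
labeled = targets.merge(features, on="material_id", how="left")

print(sorted(unlabeled["material_id"]))
print(labeled.shape)
```

Note that `_merge != "both"` also keeps any `left_only` rows, that is, labeled materials without features. Filtering on membership in the featurized data afterwards, as done later in this notebook, removes them.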


In [27]:
print("The number of good qubit host candidates: {}, or {:0.2%}".format(trainingSet[trainingTargets.values==1].shape[0], trainingSet[trainingTargets.values==1].shape[0]/trainingSet.shape[0]))
print("The number of bad qubit host candidates: {}, or {:0.2%}".format(trainingSet[trainingTargets.values==0].shape[0], trainingSet[trainingTargets.values==0].shape[0]/trainingSet.shape[0]))

The number of good qubit host candidates: 1278, or 77.50%
The number of bad qubit host candidates: 371, or 22.50%
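
The class balance can be recomputed from the two counts printed above. This is a minimal sketch with illustrative names; it relies on Python's `%` format type, which multiplies the fraction by 100 and appends the percent sign.

```python
# Class counts as printed by the cell above (1 = fitted, 0 = unfitted)
counts = {1: 1278, 0: 371}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

for label, share in shares.items():
    # "{:.2%}" turns a fraction such as 0.5 into "50.00%"
    print("label {}: {} entries, {:.2%}".format(label, counts[label], share))
```

Roughly three out of four training entries are fitted candidates, so the two classes are imbalanced.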


In [59]:
print("Information about elements in compounds is given in the interval:")
print(np.where(generatedData.columns=="H")[0][0],np.where(generatedData.columns=="Pu")[0][0])
# plotting 
import plotly.graph_objects as go
elements = generatedData.columns[
    np.where(generatedData.columns=="H")[0][0]:np.where(generatedData.columns=="Pu")[0][0]
].values
print(len(elements))
fig = go.Figure( 
    layout = go.Layout (
        title=go.layout.Title(text="Distribution of elements in training data"),
        yaxis=dict(title='Number'),
        xaxis=dict(title='Elements'),
        xaxis_tickangle=-90
        )
    )
width_plotly = 548.1896533333334

fittedCounts = trainingSet[elements][trainingTargets.values==1].fillna(0).astype(bool).sum(axis=0)
unFittedCounts   = trainingSet[elements][trainingTargets.values==0].fillna(0).astype(bool).sum(axis=0)

fig.add_traces(go.Bar(name = "Fitted candidates",#prettyNames[i], 
                        x = trainingSet[elements].columns,
                        y = fittedCounts,
                        text = trainingSet[elements].columns,
                        )
                )
fig.add_traces(go.Bar(name = "Unfitted candidates",#prettyNames[i], 
                        x = trainingSet[elements].columns,
                        y = unFittedCounts,
                        text = trainingSet[elements].columns,
                        )
                )
fig.update_layout(
                    {"plot_bgcolor": "rgba(0, 0, 0, 0)",
                       "paper_bgcolor": "rgba(0, 0, 0, 0)",
                      },
                      font=dict(
                        family="Palatino",
                        color="Black",
                        size=12),
                      autosize=False,
                      width=width_plotly*1.5,
                      height=width_plotly/2,
                     )
fig.write_image(str(Path.cwd().parent.parent / 
                                    "reports" / "figures"  / "buildingFeatures" 
                                    / "-normalized-elements-histogram.pdf"))
fig.show()

Information about elements in compounds is given in the interval:
4702 4795
93


In [60]:
fig = go.Figure( 
    layout = go.Layout (
        title=go.layout.Title(text="Normal adjusted counts of elements in training data"),
        yaxis=dict(title='Normal adjusted counts'),
        xaxis=dict(title='Elements'),
        xaxis_tickangle=-45
        )
    )

fig.add_traces(
    go.Bar(name = "Fitted candidates",#prettyNames[i], 
                   x =    trainingSet[elements].columns,
                   y =    fittedCounts/np.sum(fittedCounts),
                   text = trainingSet[elements].columns,
                    )
                )
fig.add_traces(go.Bar(name = "Unfitted candidates",#prettyNames[i], 
                    x = trainingSet[elements].columns,
                    y = unFittedCounts/np.sum(unFittedCounts),
                    text = trainingSet[elements].columns,
                    )
                )
fig.update_layout(
                    {"plot_bgcolor": "rgba(0, 0, 0, 0)",
                       "paper_bgcolor": "rgba(0, 0, 0, 0)",
                      },
                      font=dict(
                        family="Palatino",
                        color="Black",
                        size=12),
                      autosize=False,
                      width=width_plotly,
                      height=height_plotly,
                     )
fig.show()
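
The second figure divides each element count by the total count of its class, so that the two classes can be compared even though they differ in size. A minimal sketch of that normalization, using made-up counts and a hypothetical helper `normalize`:

```python
# Made-up per-element counts for the two classes
fitted_counts = {"O": 60, "S": 30, "Si": 10}
unfitted_counts = {"O": 5, "S": 1, "Si": 4}


def normalize(counts):
    """Divide each count by the class total so that the values sum to one."""
    total = sum(counts.values())
    return {element: n / total for element, n in counts.items()}


fitted_share = normalize(fitted_counts)
unfitted_share = normalize(unfitted_counts)

print(fitted_share["O"], unfitted_share["O"])
```

In this toy example oxygen appears twelve times as often among the fitted entries in absolute terms, yet its normalized shares in the two classes are close.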

## Known candidates 

How do the already known candidates, such as SiC, fare in the data mining process? 

In [None]:
feat = ["material_id","MP_Eg","pretty_formula","full_formula"]

#generatedData[feat][generatedData["pretty_formula"]=="SiC"]

In [None]:
#generatedData[feat][generatedData["pretty_formula"]=="C"]

In [None]:
feat = ["material_id","MP_Eg","pretty_formula","full_formula", "candidate"]

trainingSet[feat][(trainingSet["pretty_formula"]=="SiC")]

In [None]:
trainingSet[feat][(trainingSet["pretty_formula"]=="C")]

## Combine training targets with preprocessed data

In [None]:
PCAGeneratedData = pd.read_pickle(data_dir / "processed" / "processedData.pkl")
PCAGeneratedData.shape

In [None]:
trainingSet = (
    trainingTargets.merge(PCAGeneratedData,
                on="material_id",
                indicator=False,
                how="left",
                suffixes=(False, False))
)
trainingSet.shape

In [None]:
# Drop new candidates that might have been added to MP after featurization 
trainingSet = trainingSet[trainingSet["material_id"].isin(PCAGeneratedData["material_id"])]
trainingSet.shape

In [None]:
testSet = (
    trainingTargets.merge(PCAGeneratedData, 
              on='material_id', 
              how='outer', 
              indicator=True)
    .query('_merge != "both"')
    .drop(columns='_merge')
)
testSet.shape

In [None]:
# Drop new candidates that might have been added to MP after featurization 
testSet = testSet[testSet["material_id"].isin(PCAGeneratedData["material_id"])]

# Change column order
trainingSet.insert(1, 'full_formula', trainingSet.pop("full_formula"))
testSet    .insert(1, 'full_formula', testSet    .pop("full_formula"))
testSet    .insert(1, 'pretty_formula', testSet    .pop("pretty_formula"))

### Write to file

In [None]:
trainingSet = trainingSet.drop(["pretty_formula"], axis=1)
trainingTarget = trainingSet.pop("candidate")

Path(data_dir / InsertApproach / "processed").mkdir(parents=True, exist_ok=True)
trainingSet   .to_pickle(data_dir / InsertApproach /"processed" / "trainingData.pkl")
trainingTarget.to_pickle(data_dir / InsertApproach / "processed" / "trainingTarget.pkl")
testSet       .to_pickle(data_dir / InsertApproach / "processed" / "testSet.pkl")

In [None]:
testSet