# Preprocessing

In this Jupyter notebook we investigate what the data produced by generatedDataset.ipynb looks like and try to understand how it is put together.

We sort the data by the Materials Project ID, "material_id", in ascending order. In addition, we one-hot encode the categorical features.

In [1]:
# pandas
import pandas as pd
import numpy as np

from tqdm import tqdm

# plotting 
import plotly.graph_objects as go

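# Ignore warnings raised by NaN values.
# Newer NumPy releases no longer expose np.warnings, so use the standard-library module.
import warnings
warnings.filterwarnings('ignore')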

## Reading and sorting data 

In [2]:
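def sortByMPID(df):
    # Numeric sort key: everything after the first three characters of the ID.
    # "mp-24" gives 24; an "mvc-12905" ID gives -12905, so mvc entries sort first.
    mpid_num = [int(i[3:]) for i in df["material_id"]]
    # assign() returns a copy, so the caller's DataFrame is left unmodified
    df = df.assign(mpid_num=mpid_num)
    df = df.sort_values(by="mpid_num").reset_index(drop=True)
    df = df.drop(columns=["mpid_num"])
    #df = df.set_index("material_id")
    return df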
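
As an aside, the same ordering can be sketched without an explicit Python loop. The helper name `sort_by_mpid_vectorized` and the toy IDs below are illustrative assumptions, not part of the original notebook. Note one deliberate difference: this sketch uses the digits after the last "-" as the key, so an "mvc-" ID would be ordered by its positive number instead of being placed first.

```python
import pandas as pd

def sort_by_mpid_vectorized(df):
    # Hypothetical helper: use the digits after the final "-" as the sort key
    key = df["material_id"].str.rsplit("-", n=1).str[-1].astype(int)
    # Reorder the rows by the sorted key, then renumber the index
    return df.loc[key.sort_values(kind="stable").index].reset_index(drop=True)

# Toy data (made up for illustration)
demo = pd.DataFrame({"material_id": ["mp-24", "mp-7", "mp-1205479", "mp-14"]})
sorted_demo = sort_by_mpid_vectorized(demo)
print(sorted_demo)
```

A plain string sort would put "mp-1205479" before "mp-14", which is why a numeric key is needed in the first place.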

In [3]:
trainingTargets = pd.read_csv("data/generatedData/trainingData.csv")
generatedData = pd.read_csv("data/generatedData/generatedDataset.csv")
trainingTargets = sortByMPID(trainingTargets)
trainingTargets

Unnamed: 0,material_id,full_formula,candidate
0,mp-7,S6,1.0
1,mp-14,Se3,1.0
2,mp-19,Te3,1.0
3,mp-24,C8,1.0
4,mp-47,C4,1.0
...,...,...,...
1617,mp-1205479,K44Sb22F110,0.0
1618,mp-1208643,Sr4Hf4S12,1.0
1619,mp-1210722,Mg2Te2Mo2O12,1.0
1620,mp-1232407,Li6B6H32N4,0.0


In [4]:
generatedData = sortByMPID(generatedData)
generatedData.describe()

Unnamed: 0,band_gap,is_gap_direct,direct_gap,p_ex1_norm,p_ex1_degen,n_ex1_norm,n_ex1_degen,cbm_hybridization,cbm_score_1,vbm_hybridization,...,mean EN difference,std_dev EN difference,MP_Eg,OQMD_Eg,AFLOW_Eg,AFLOW-fitted_Eg,AFLOWML_Eg,JARVIS-TBMBJ_Eg,JARVIS-OPT_Eg,Exp_Eg
count,12848.0,12848.0,12848.0,12848.0,12848.0,12848.0,12848.0,19678.0,19678.0,19679.0,...,23737.0,23737.0,25270.0,10600.0,1938.0,1938.0,1293.0,3004.0,4888.0,211.0
mean,2.492202,0.255915,2.597508,0.369069,1.809776,0.292602,1.577133,2.814983,0.201666,2.693822,...,1.455607,0.462081,2.595326,2.779184,2.157502,3.821313,2.886241,3.689999,2.559848,3.85082
std,1.631642,0.436391,1.614176,0.318004,1.670934,0.32296,1.299285,1.025838,0.186079,0.8419221,...,0.593101,0.314908,1.639196,1.781121,1.6488,2.222582,1.486446,2.427274,1.809115,2.890128
min,0.0012,0.0,0.0,0.0,1.0,0.0,1.0,0.016361,0.006692,-1e-10,...,-0.53,0.0,0.1001,0.139,0.0323,0.95654,0.161,0.0125,0.0103,0.1
25%,1.212075,0.0,1.34365,0.0,1.0,0.0,1.0,2.076816,0.073008,2.141951,...,1.041143,0.212132,1.311575,1.38775,1.03765,2.31175,1.825,2.0365,1.151525,1.71
50%,2.22645,0.0,2.3456,0.458333,1.0,0.0,1.0,2.835915,0.141427,2.775154,...,1.476667,0.473762,2.3756,2.429,1.8536,3.41165,2.756,3.1426,2.1723,3.115
75%,3.544275,1.0,3.6266,0.707107,2.0,0.592655,2.0,3.567915,0.255938,3.270814,...,1.87,0.687388,3.6897,3.877,2.798725,4.685678,3.971,4.9736,3.65405,5.735
max,17.8914,1.0,18.3462,1.354982,24.0,1.327173,24.0,5.676938,0.998379,5.080257,...,3.19,1.59099,17.9023,18.278,17.5012,24.5046,7.043,32.1886,17.9682,13.6


## One-hot encoding features

The features listed in hotencodeColumns below are categorical. They must be converted into one-hot encoded columns, one per category, before a learning algorithm can use them.

In [5]:
hotencodeColumns = ["vbm_specie_1","vbm_character_1","cbm_character_1","cbm_specie_1"]
# One-hot encode the categorical columns
one_hot = pd.get_dummies(generatedData[hotencodeColumns])
# Drop the original categorical columns now that they are encoded
generatedData = generatedData.drop(hotencodeColumns, axis = 1)
# Report the number of encoded columns, then join them to the DataFrame
print("Number of new features from hotencoding categorical features:{}".format(len(one_hot.columns)))
generatedData = generatedData.join(one_hot)
generatedData

Number of new features from hotencoding categorical features:176


Unnamed: 0,material_id,band_gap,is_gap_direct,direct_gap,p_ex1_norm,p_ex1_degen,n_ex1_norm,n_ex1_degen,cbm_hybridization,cbm_location_1,...,cbm_specie_1_Tl,cbm_specie_1_Tm,cbm_specie_1_U,cbm_specie_1_V,cbm_specie_1_W,cbm_specie_1_Xe,cbm_specie_1_Y,cbm_specie_1_Yb,cbm_specie_1_Zn,cbm_specie_1_Zr
0,mvc-12905,1.1543,0.0,1.1581,0.433013,2.0,0.417062,4.0,2.213249,0.649809;0.834621;0.48443,...,0,0,0,0,0,0,0,0,0,0
1,mp-7,2.5113,0.0,2.6150,0.307164,2.0,0.467268,6.0,2.261753,0.875688;0.771772;0.102434,...,0,0,0,0,0,0,0,0,0,0
2,mp-14,1.0119,0.0,1.1935,0.707107,3.0,0.687184,3.0,1.604914,0.0;0.219209;0.666667,...,0,0,0,0,0,0,0,0,0,0
3,mp-19,0.1845,0.0,0.2035,0.654854,6.0,0.687184,3.0,1.527698,0.26895;0.0;0.333333,...,0,0,0,0,0,0,0,0,0,0
4,mp-24,2.4070,1.0,2.4070,0.866025,1.0,0.866025,1.0,2.621089,0.0;0.688271;0.5,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
25265,mp-1540000,,,,,,,,,,...,0,0,0,0,0,0,0,0,0,0
25266,mp-1541522,,,,,,,,,,...,0,0,0,0,0,0,0,0,0,0
25267,mp-1541714,,,,,,,,,,...,0,0,0,0,0,0,0,0,0,0
25268,mp-1542038,,,,,,,,,,...,0,0,0,0,0,0,0,0,0,0
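
To see what `pd.get_dummies` does on a small scale, here is a minimal sketch on a three-row toy table. The values are made up for illustration; only the column names are borrowed from the notebook.

```python
import pandas as pd

# Toy data (assumed values, not taken from the real dataset)
toy = pd.DataFrame({
    "material_id": ["mp-7", "mp-14", "mp-19"],
    "vbm_specie_1": ["S", "Se", "Te"],
    "vbm_character_1": ["p", "p", "s"],
})

# Each category becomes its own indicator column, named <column>_<category>
encoded = pd.get_dummies(toy[["vbm_specie_1", "vbm_character_1"]])
print(list(encoded.columns))

# Replace the categorical columns with their encoded counterparts
toy = toy.drop(columns=["vbm_specie_1", "vbm_character_1"]).join(encoded)
print(toy.shape)
```

Three species and two orbital characters give five indicator columns, and every row has exactly one active indicator per original feature.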


## Splitting one feature into several floating features

Additionally, some features are stored as strings and should be converted to floats. This applies to "cbm_location_1" and "vbm_location_1", each of which holds three semicolon-separated values.

In [6]:
splitColumns = ["cbm_location_1", "vbm_location_1"]
for column in splitColumns:
    # Split "a;b;c" into three string columns
    newColumns = generatedData[column].str.split(";", n = 2, expand = True)
    for i in range(0,3):
        # np.float was removed from NumPy; the builtin float is equivalent
        generatedData[column + "_" + str(i)] = np.array(newColumns[i]).astype(float)
generatedData = generatedData.drop(splitColumns, axis=1)
print(generatedData.shape)

(25270, 310)
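
The split can be checked on a small example. The sketch below uses two location strings copied from the table above plus one missing entry, and assumes that a missing string should simply become NaN in all three new columns.

```python
import numpy as np
import pandas as pd

# Two real-looking location strings and one missing value
toy = pd.DataFrame({
    "cbm_location_1": ["0.0;0.219209;0.666667", "0.26895;0.0;0.333333", np.nan]
})

# Split on ";" into separate columns and convert them to floats
parts = toy["cbm_location_1"].str.split(";", expand=True).astype(float)
parts.columns = [f"cbm_location_1_{i}" for i in range(parts.shape[1])]
print(parts)
```

The missing row stays NaN after the conversion, so it is handled later together with the other missing values.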


## Handling NaN, inf, and zero values

It is important to decide how to handle invalid entries such as NaN or very large numbers. This dataset has many missing values. Two common remedies are to replace them with 0 or with the mean of the feature. Here we first drop the columns that contain only zeros and then fill the remaining NaN values with the column mean.

In [7]:
# Remove columns that contain only zeros
print(generatedData.shape)
generatedData = generatedData.loc[:, (generatedData != 0).any(axis=0)]
# Fill NaN with the column mean; numeric_only=True skips string columns such as material_id
print(generatedData.shape)
generatedData = generatedData.fillna(generatedData.mean(numeric_only=True))

(25270, 310)
(25270, 296)
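
The cell above does not treat infinite values explicitly. A minimal sketch of one way to fold them into the same mean imputation is shown below. The toy values and the `all_zero` and `ratio` column names are assumptions made for illustration, and replacing inf with NaN first is one possible choice, not something the original notebook does.

```python
import numpy as np
import pandas as pd

# Toy data (made up): one NaN, one inf, and one column of only zeros
toy = pd.DataFrame({
    "material_id": ["mp-7", "mp-14", "mp-19"],
    "band_gap": [2.0, np.nan, 4.0],
    "all_zero": [0, 0, 0],
    "ratio": [1.0, np.inf, 3.0],
})

# Drop columns that contain only zeros
toy = toy.loc[:, (toy != 0).any(axis=0)]
# Treat +/-inf as missing so that they do not distort the mean
toy = toy.replace([np.inf, -np.inf], np.nan)
# Fill missing values with the mean of each numeric column
toy = toy.fillna(toy.mean(numeric_only=True))
print(toy)
```

Without the `replace` step, the mean of `ratio` would itself be infinite, so the order of the last two steps matters.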


## Training data and test data

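Next, we split the data into a training set and a test set for the machine-learning procedure. The training set holds the materials that have a label in trainingTargets. The test set holds the rows that are not matched in both tables, which is intended to be the unlabelled materials.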

In [8]:
trainingSet = (
    trainingTargets.merge(generatedData,
                on="material_id",
                indicator=False,
                how="left",
                suffixes=(False, False))
)

In [9]:
testSet = (
    trainingTargets.merge(generatedData, 
              on='material_id', 
              how='outer', 
              indicator=True)
    .query('_merge != "both"')
    .drop(columns='_merge')
)
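
The effect of the two merges can be illustrated on toy tables. The sketch below uses made-up values and keeps only the `right_only` rows for the test set, which is a slightly stricter filter than `_merge != "both"`: it also excludes labelled materials that have no features. Whether that distinction matters for the real data is not something the notebook states.

```python
import pandas as pd

# Toy tables (assumed values): two labelled materials, four with features
targets = pd.DataFrame({"material_id": ["mp-7", "mp-14"],
                        "candidate": [1.0, 0.0]})
features = pd.DataFrame({"material_id": ["mp-7", "mp-14", "mp-19", "mp-24"],
                         "band_gap": [2.5, 1.0, 0.2, 2.4]})

# Training set: every labelled material with its features attached
train = targets.merge(features, on="material_id", how="left")

# Test set: materials that have features but no label
merged = targets.merge(features, on="material_id", how="outer", indicator=True)
test = (
    merged.query('_merge == "right_only"')
          .drop(columns=["_merge", "candidate"])
)
print(train)
print(test)
```

Dropping `candidate` from the test set removes a column that would otherwise contain only NaN.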

In [10]:
trainingTarget = trainingSet.pop("candidate")
#trainingSet    = trainingSet.drop(["material_id"], axis=1)


## Writing to file

Once we are satisfied with the preprocessing, we write the preprocessed data to a folder so that it is ready for the next notebook.

In [11]:
trainingSet.to_csv("data/preprocessedData/trainingSet.csv", sep=",", index=False)
trainingTarget.to_csv("data/preprocessedData/trainingTarget.csv", sep=",", index=False)
testSet.to_csv("data/preprocessedData/testSet.csv", sep=",", index=False)
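
The `index=False` argument matters when the files are read back. The sketch below writes a toy table to a temporary directory with and without the index to show where an "Unnamed: 0" column, like the one in the tables above, can come from. The file names and values are made up for illustration.

```python
import os
import tempfile

import pandas as pd

# Toy data (assumed values)
df = pd.DataFrame({"material_id": ["mp-7", "mp-14"],
                   "band_gap": [2.5113, 1.0119]})

with tempfile.TemporaryDirectory() as tmp:
    out_dir = os.path.join(tmp, "preprocessedData")
    # Create the output folder if it does not exist yet
    os.makedirs(out_dir, exist_ok=True)

    with_index = os.path.join(out_dir, "with_index.csv")
    without_index = os.path.join(out_dir, "without_index.csv")
    df.to_csv(with_index)                   # index is written as an unnamed column
    df.to_csv(without_index, index=False)   # index is left out

    cols_with = list(pd.read_csv(with_index).columns)
    cols_without = list(pd.read_csv(without_index).columns)

print(cols_with)
print(cols_without)
```

Calling `os.makedirs` with `exist_ok=True` before writing also avoids an error when the target folder is missing.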


In [12]:
import requests
preamble="https://www.materialsproject.org/rest/v2/"
url = preamble + "api_check"
params = {'API_KEY':'unique_api_key'}
response=requests.get(url, params)
(response)

<Response [200]>

In [13]:
response.json()

{'valid_response': True,
 'response': {'api_key_valid': False,
  'details': 'API_KEY is not a valid key.',
  'version': {'db': '2020_09_08', 'pymatgen': '2020.8.13', 'rest': '2.0'}}}
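
Note that the HTTP status 200 above only means that the request succeeded, not that the key was accepted. A minimal sketch of checking the JSON body instead is given below. The helper name `api_key_is_valid` is hypothetical, and the sample dictionary is copied from the output above so that the example runs without network access.

```python
def api_key_is_valid(payload):
    # Hypothetical helper: the key counts as valid only if the response itself
    # is valid and the server reports api_key_valid as true
    inner = payload.get("response", {})
    return bool(payload.get("valid_response")) and bool(inner.get("api_key_valid"))

# Response body as printed above
sample = {
    "valid_response": True,
    "response": {
        "api_key_valid": False,
        "details": "API_KEY is not a valid key.",
        "version": {"db": "2020_09_08", "pymatgen": "2020.8.13", "rest": "2.0"},
    },
}

print(api_key_is_valid(sample))
```

In the notebook this would be called as `api_key_is_valid(response.json())`.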