# Part 4. Editing Training Data with the SDR REST API: Adding a New Descriptor
<hr/>
## Example Goals:
  In this example, we use the SDR REST API to extract the kinds of data a user wants from an existing materials database, create a new descriptor by referring to values in an additional table, and finally prepare the newly built training data for deep learning. Data extraction uses the SDR RESTful API, which is built on Lucene queries. To aid understanding, we first run a more complex query that extracts a small amount of data, compute properties such as the average electronegativity, and build the training data. We then extract a larger amount of data and validate and analyze it with visualization techniques. As in the previous examples, we work with DFT calculation data from the Open Quantum Materials Database [[1]](http://dx.doi.org/10.1007/s11837-013-0755-4)[[2]](http://dx.doi.org/10.1038/npjcompumats.2015.10) ([OQMD](http://www.oqmd.org), Northwestern Univ.), which is well established in the materials science community.
  
### Overview
  The example is organized in the following order.
1. Data Extraction
<br/> 1.1. Extracting a Small Amount of Data (by Searching)
<br><br>
2. Building Learning Datasets
<br/> 2.1. Adding Stoichiometric Calculated Values - Electronegativity
<br/> 2.2. Merging Data Frames
<br><br>
3. Advanced Learning Data Analysis and Visualization
<br/> 3.1. Binary Compounds: Data Extraction
<br/> 3.2. Binary Compounds: Building Learning Datasets
<br/> 3.3. Binary Compounds: Data Visualization
<br/> 3.4. Extracting the Full Dataset (by Crawling)
<hr/>

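### 1. Data Extraction
#### 1.1. Extracting a Small Amount of Data (by Searching)
Before extracting data, we briefly describe how the SDR REST API works. First, the user builds an HTTPS request using the same Lucene-based syntax as the Advanced Search page of the materials web site and sends it to the server. The request, which includes the user id and password, is encrypted in transit. After authenticating the user, the server extracts the query part, searches the database, and returns the matching data to the user. The data received from the server is stored in the user's development environment as a JSON file.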

In [1]:
########################################
## path settings #######################
########################################
#
# path setting
currentPath = "./"
dataPath = "data/"
figPath = ".figures/"
modelPath = "models/"
data_name = "final_energy_per_atom_"

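For security reasons, user_id, user_pwd, and server_address are filled with placeholder values. To run the exercise, replace them with appropriate values first. In this example, the wget command is used instead of curl so that spaces and decimal points in the query can be entered without trouble. Here we search the oqmd-type data for compounds that contain Li and consist of six kinds of elements.
** Note: still deciding whether to explain personal certificates (Let's Encrypt) in more detail later.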

In [2]:
import subprocess
user_id = "NONE"
user_pwd = "NONE"

extra_args = "--no-check-certificate"
server_address = "NONE" 
rest_api_option = "/rest/api/search/"

#set the data type
data_type = "oqmd"
basic_lucene_query = "DataType:"+data_type+" AND "

#set the query
other_lucene_query = "elements: Li AND nelements:6"

full_query = '"'+ server_address + rest_api_option + basic_lucene_query + other_lucene_query + '"'
jsonResultFileName = "query_result.json"

command_line = 'wget -O' + ' ' + currentPath + dataPath + jsonResultFileName +' '+'--user'+ ' '+ user_id + ' '+ '--password' + ' ' + user_pwd + ' ' + extra_args + ' ' + full_query

subprocess.call(command_line, shell=True)

0
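
The request URL assembled in the cell above relies on shell quoting to carry the spaces in the Lucene query. As a hedged alternative, the sketch below builds the same URL with explicit percent-encoding. The helper name `build_search_url` and the host `https://example.org` are hypothetical, and whether the SDR server accepts a percent-encoded query has not been verified here.

```python
from urllib.parse import quote

def build_search_url(server_address, data_type, other_query,
                     rest_api_option="/rest/api/search/"):
    """Assemble the search URL, percent-encoding the Lucene query part."""
    lucene_query = "DataType:" + data_type + " AND " + other_query
    # quote() encodes spaces as %20; ':' is kept readable via `safe`
    return server_address + rest_api_option + quote(lucene_query, safe=":")

# https://example.org is a placeholder, not the real SDR server address
url = build_search_url("https://example.org", "oqmd",
                       "elements: Li AND nelements:6")
print(url)
```

Keeping the URL construction in a small function also makes it easy to reuse for the later `nelements:2` query.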

We can confirm that a list of datasets, containing information such as DataType, collectionId, and datasetId, is returned in JSON format.

In [4]:
import pandas as pd
query_result = pd.read_json(currentPath+dataPath+jsonResultFileName)
query_result

Unnamed: 0,DataType,avg_dielectric_constant,bandgap,collectionId,coordinate,createDate,crystalsystem,dataTypeId,datasetId,density,...,runtype,spacegrouphall,spacegroupnum,spacegroupsymbol,status,title,unitcellformula,userId,userName,volume
0,oqmd,,5.933,21912,"[{'value': [0.192493, 0.7608969999999999, 0.08...",Thu Oct 26 16:26:01 GMT+09:00 2017,Triclinic,21915,868039,2.940309,...,GGA,-P 1,2,P-1,0,CsLiH4S2N2O6,"{'S': 4, 'Li': 2, 'Cs': 2, 'N': 4, 'O': 12, 'H...",20433,siahn,374.892
1,oqmd,,5.934,21912,"[{'value': [0.354311, 0.9872749999999999, 0.03...",Thu Oct 26 16:26:03 GMT+09:00 2017,Monoclinic,21915,868220,2.253201,...,GGA,P 2yb,4,P21,0,KLiH4S2N2O6,"{'S': 4, 'Li': 2, 'N': 4, 'O': 12, 'H': 8, 'K'...",20433,siahn,350.994
2,oqmd,,5.231,21912,"[{'value': [0, 0, 0], 'label': 'Cs'}, {'value'...",Thu Oct 26 16:30:32 GMT+09:00 2017,Tetragonal,21915,892156,2.898363,...,GGA,-I 4,87,I4/m,0,CsKNa2Li12Si4O16,"{'Si': 4, 'Li': 12, 'Cs': 1, 'O': 16, 'Na': 2,...",20433,siahn,383.509
3,oqmd,,0.0,21912,"[{'value': [0.72818, 0, 0.25], 'label': 'B'}, ...",Thu Oct 26 16:30:56 GMT+09:00 2017,Monoclinic,21915,894362,3.624381,...,GGA,-C 2yc,15,C2/c,0,LiCu2BP2H2O10,"{'P': 4, 'B': 2, 'Li': 2, 'Cu': 4, 'O': 20, 'H...",20433,siahn,337.829


A total of four datasets were found. Let us look at the details of one of them. It contains materials information such as unitcellformula, finalenergy, density, volume, and lattice, along with management data such as collectionId, datasetId, and userId.

In [5]:
query_result.loc[0]

DataType                                                                oqmd
avg_dielectric_constant                                                     
bandgap                                                                5.933
collectionId                                                           21912
coordinate                 [{'value': [0.192493, 0.7608969999999999, 0.08...
createDate                                Thu Oct 26 16:26:01 GMT+09:00 2017
crystalsystem                                                      Triclinic
dataTypeId                                                             21915
datasetId                                                             868039
density                                                              2.94031
dielectric_electronic                                                       
dielectric_ionic                                                            
electron                                                                    

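### 2. Building Learning Datasets
#### 2.1. Adding Stoichiometric Calculated Values - Electronegativity

Using the formula \\(  T^{avg}_{A_xB_yC_z} = \frac{xT_{A}}{x+y+z} + \frac{yT_{B}}{x+y+z} + \frac{zT_{C}}{x+y+z} \\), where \\(T_A\\) is the property value of element A and x, y, z are the atom counts, we compute the average electronegativity of \\(Cs_2Li_2H_8S_4N_4O_{12} \\) (query_result.loc[0]). To do so, we first look up the electronegativity of each element in the reference table.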

In [6]:
fileName = "reference_elements_dataset.csv"
atomtable = pd.read_csv(currentPath+dataPath+fileName)

In [7]:
atomtable.head()

Unnamed: 0,z,name,symbol,group,period,valenceoftheelements,numofvalenceelectrons,thermalconductivity,entalpyofatomization,fusion,...,melt,boil,specific_heat,electronegativity,first_ionization_energy,electron_affinity,s_elec,p_elec,d_elec,f_elec
0,1,Hydrogen,H,1,1,1.0,1.0,0.1805,218.0,0.558,...,14.175,20.280001,14.304,2.2,13.5984,0.754,1,0,0,0
1,2,Helium,He,18,1,0.0,2.0,0.1513,0.0,0.02,...,,4.22,5.193,0.0,24.5874,9.7,2,0,0,0
2,3,Lithium,Li,1,2,1.0,1.0,85.0,159.0,3.0,...,453.850006,1615.0,3.582,0.98,5.39172,0.618,3,0,0,0
3,4,Beryllium,Be,2,2,2.0,2.0,190.0,324.0,7.95,...,1560.150024,2742.0,1.825,1.57,9.3227,-2.4,4,0,0,0
4,5,Boron,B,13,2,3.0,3.0,27.0,563.0,50.0,...,2573.149902,4200.0,1.026,2.04,8.29803,0.279,4,1,0,0


The electronegativity of each element can be checked as follows.

In [8]:
print('Cs', atomtable.loc[atomtable['symbol']=='Cs'].electronegativity.values)
print('Li', atomtable.loc[atomtable['symbol']=='Li'].electronegativity.values)
print('H', atomtable.loc[atomtable['symbol']=='H'].electronegativity.values)
print('S', atomtable.loc[atomtable['symbol']=='S'].electronegativity.values)
print('N', atomtable.loc[atomtable['symbol']=='N'].electronegativity.values)
print('O', atomtable.loc[atomtable['symbol']=='O'].electronegativity.values)

Cs [ 0.79000002]
Li [ 0.98000002]
H [ 2.20000005]
S [ 2.57999992]
N [ 3.03999996]
O [ 3.44000006]
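
As a worked check of the averaging formula, the values printed above can be combined by hand for \\(Cs_2Li_2H_8S_4N_4O_{12} \\). The sketch below hard-codes the six electronegativity values, rounded to two decimals, instead of reading them from the reference table, so it is an illustration of the arithmetic and not part of the original pipeline.

```python
# Electronegativity values as printed above, rounded to two decimals
electronegativity = {"Cs": 0.79, "Li": 0.98, "H": 2.20,
                     "S": 2.58, "N": 3.04, "O": 3.44}
# Unit cell formula of query_result.loc[0]: Cs2 Li2 H8 S4 N4 O12
formula = {"Cs": 2, "Li": 2, "H": 8, "S": 4, "N": 4, "O": 12}

total_atoms = sum(formula.values())  # x + y + z + ...
weighted_sum = sum(count * electronegativity[element]
                   for element, count in formula.items())
avg = weighted_sum / total_atoms
print(total_atoms, round(avg, 6))
```

The weighted sum is 84.9 over 32 atoms, which gives roughly 2.653125.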


A function that computes avg_electronegativity from the dict-form unitcellformula can be written simply as follows.

In [9]:
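def getElectronegativity_from_dict(dict_comp, atomtable):
    """Return the atom-count-weighted average electronegativity.

    dict_comp maps element symbols to atom counts (e.g. unitcellformula).
    """
    total_num_of_atoms = 0
    sum_of_electronegativity = 0
    for key, value in dict_comp.items():
        # take the single matching value; float() on a whole array is deprecated in recent NumPy
        electronegativity = float(atomtable.loc[atomtable['symbol']==key].electronegativity.values[0])
        sum_of_electronegativity += value * electronegativity
        total_num_of_atoms += value
    return sum_of_electronegativity/total_num_of_atoms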

We now compute avg_electronegativity for each of the four compounds in query_result.

In [10]:
avg_electronegativity = []
for index in range(0,len(query_result)):
    avg_electronegativity.append(getElectronegativity_from_dict(query_result.unitcellformula.values[index], atomtable))

In [11]:
avg_electronegativity

[2.6531250216249997, 2.655000019875, 2.163055585333333, 2.7777778173888885]

In [12]:
query_result['avg_electronegativity'] = avg_electronegativity

We can confirm that the computed average electronegativity has been added successfully.

In [13]:
query_result

Unnamed: 0,DataType,avg_dielectric_constant,bandgap,collectionId,coordinate,createDate,crystalsystem,dataTypeId,datasetId,density,...,spacegrouphall,spacegroupnum,spacegroupsymbol,status,title,unitcellformula,userId,userName,volume,avg_electronegativity
0,oqmd,,5.933,21912,"[{'value': [0.192493, 0.7608969999999999, 0.08...",Thu Oct 26 16:26:01 GMT+09:00 2017,Triclinic,21915,868039,2.940309,...,-P 1,2,P-1,0,CsLiH4S2N2O6,"{'S': 4, 'Li': 2, 'Cs': 2, 'N': 4, 'O': 12, 'H...",20433,siahn,374.892,2.653125
1,oqmd,,5.934,21912,"[{'value': [0.354311, 0.9872749999999999, 0.03...",Thu Oct 26 16:26:03 GMT+09:00 2017,Monoclinic,21915,868220,2.253201,...,P 2yb,4,P21,0,KLiH4S2N2O6,"{'S': 4, 'Li': 2, 'N': 4, 'O': 12, 'H': 8, 'K'...",20433,siahn,350.994,2.655
2,oqmd,,5.231,21912,"[{'value': [0, 0, 0], 'label': 'Cs'}, {'value'...",Thu Oct 26 16:30:32 GMT+09:00 2017,Tetragonal,21915,892156,2.898363,...,-I 4,87,I4/m,0,CsKNa2Li12Si4O16,"{'Si': 4, 'Li': 12, 'Cs': 1, 'O': 16, 'Na': 2,...",20433,siahn,383.509,2.163056
3,oqmd,,0.0,21912,"[{'value': [0.72818, 0, 0.25], 'label': 'B'}, ...",Thu Oct 26 16:30:56 GMT+09:00 2017,Monoclinic,21915,894362,3.624381,...,-C 2yc,15,C2/c,0,LiCu2BP2H2O10,"{'P': 4, 'B': 2, 'Li': 2, 'Cu': 4, 'O': 20, 'H...",20433,siahn,337.829,2.777778


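However, query_result in its current state cannot be used directly as input data for machine learning, because fields such as lattice and unitcellformula are stored as nested lists or dicts. Nested values can be expanded into multiple independent columns by passing the result of the pandas tolist() function to a DataFrame. We convert 1) the lattice information and 2) the unit cell formula into columns, in that order.

1) Lattice Information: Because the lattice lengths correspond to only a few columns, the column names can simply be typed in by hand. The steps are as follows.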

In [14]:
#query_result['lattice'] or query_result.lattice 
#either form accesses the data in that column.
query_result.lattice

0     [5.465751, 7.691087, 9.59254]
1    [5.080959, 8.305728, 8.617235]
2    [6.364259, 8.389621, 8.389621]
3    [4.705827, 7.849667, 9.586717]
Name: lattice, dtype: object

In [15]:
listed_lattice_length = pd.DataFrame(query_result.lattice.tolist(), columns = ['lattice_a','lattice_b','lattice_c'])
listed_lattice_length

Unnamed: 0,lattice_a,lattice_b,lattice_c
0,5.465751,7.691087,9.59254
1,5.080959,8.305728,8.617235
2,6.364259,8.389621,8.389621
3,4.705827,7.849667,9.586717


In [16]:
listed_lattice_angle = pd.DataFrame()
listed_lattice_angle['latticealpha'] = query_result.latticealpha
listed_lattice_angle['latticebeta'] = query_result.latticebeta
listed_lattice_angle['latticegamma'] = query_result.latticegamma
listed_lattice_angle

Unnamed: 0,latticealpha,latticebeta,latticegamma
0,73.198165,76.233527,86.935936
1,90.0,105.164234,90.0
2,98.271477,67.710071,112.289929
3,90.481528,90.0,107.4424


In [17]:
listed_lattice_information = pd.concat([listed_lattice_length, listed_lattice_angle], axis = 1)
listed_lattice_information

Unnamed: 0,lattice_a,lattice_b,lattice_c,latticealpha,latticebeta,latticegamma
0,5.465751,7.691087,9.59254,73.198165,76.233527,86.935936
1,5.080959,8.305728,8.617235,90.0,105.164234,90.0
2,6.364259,8.389621,8.389621,98.271477,67.710071,112.289929
3,4.705827,7.849667,9.586717,90.481528,90.0,107.4424


2) Unit Cell Formula: Because the number of possible elements is large, typing the column names by hand is impractical. The 'symbol' column of the reference table already holds the list of element symbols, so we use it.

In [18]:
#check the element types and atom counts in each compound of the current query result
query_result.unitcellformula

0    {'S': 4, 'Li': 2, 'Cs': 2, 'N': 4, 'O': 12, 'H...
1    {'S': 4, 'Li': 2, 'N': 4, 'O': 12, 'H': 8, 'K'...
2    {'Si': 4, 'Li': 12, 'Cs': 1, 'O': 16, 'Na': 2,...
3    {'P': 4, 'B': 2, 'Li': 2, 'Cu': 4, 'O': 20, 'H...
Name: unitcellformula, dtype: object

In [19]:
#create the element columns
element_columns = atomtable.symbol.values

In [20]:
basic_listed_unitcellformula = pd.DataFrame(query_result.unitcellformula.tolist(), columns=element_columns)
basic_listed_unitcellformula.fillna(0, inplace=True)
basic_listed_unitcellformula

Unnamed: 0,H,He,Li,Be,B,C,N,O,F,Ne,...,Lr,Rf,Db,Sg,Bh,Hs,Mt,Ds,Rg,Cn
0,8.0,0.0,2,0.0,0.0,0.0,4.0,12,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,8.0,0.0,2,0.0,0.0,0.0,4.0,12,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,12,0.0,0.0,0.0,0.0,16,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,4.0,0.0,2,0.0,2.0,0.0,0.0,20,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
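
The conversion above can be reproduced on toy data to see what `tolist()` and the `columns` argument do. In this minimal sketch the two formulas and the short `element_columns` list are made up for illustration; in the notebook the column list comes from `atomtable.symbol.values`.

```python
import pandas as pd

# Toy stand-ins for query_result.unitcellformula and atomtable.symbol.values
formulas = pd.Series([{"Li": 2, "O": 1}, {"Na": 1, "Cl": 1}])
element_columns = ["H", "Li", "O", "Na", "Cl"]

# Each dict becomes one row; keys missing from a dict become NaN,
# and elements that never occur (here "H") still get a column.
flat = pd.DataFrame(formulas.tolist(), columns=element_columns).fillna(0)
print(flat)
```

Passing the full element list as `columns` fixes the column order and width, so every dataset ends up with the same feature layout regardless of which elements it happens to contain.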


Because the label values used later for machine learning are per-atom quantities, the formula is expressed as the ratio of each element's atom count to the total number of atoms.

In [21]:
rate_unitcellformula = []
for row in range(0, query_result.shape[0]):
    rate_unitcellformula.append({k: v / query_result.nsites[row] for k, v in query_result.unitcellformula[row].items()})
query_result['rate_unitcellformula'] = rate_unitcellformula
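
The ratio computation can also be written as a small standalone function. This sketch normalizes by the sum of the atom counts in the formula itself, which assumes that `nsites` equals that sum, as it does for the four compounds shown here. The name `to_fractions` is hypothetical.

```python
def to_fractions(formula):
    """Convert atom counts to atomic fractions that sum to 1."""
    total = sum(formula.values())
    return {element: count / total for element, count in formula.items()}

# Unit cell formula of query_result.loc[0]
fractions = to_fractions({"Cs": 2, "Li": 2, "H": 8, "S": 4, "N": 4, "O": 12})
print(fractions["S"], fractions["Li"], fractions["O"])
```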

In [22]:
query_result

Unnamed: 0,DataType,avg_dielectric_constant,bandgap,collectionId,coordinate,createDate,crystalsystem,dataTypeId,datasetId,density,...,spacegroupnum,spacegroupsymbol,status,title,unitcellformula,userId,userName,volume,avg_electronegativity,rate_unitcellformula
0,oqmd,,5.933,21912,"[{'value': [0.192493, 0.7608969999999999, 0.08...",Thu Oct 26 16:26:01 GMT+09:00 2017,Triclinic,21915,868039,2.940309,...,2,P-1,0,CsLiH4S2N2O6,"{'S': 4, 'Li': 2, 'Cs': 2, 'N': 4, 'O': 12, 'H...",20433,siahn,374.892,2.653125,"{'S': 0.125, 'Li': 0.0625, 'Cs': 0.0625, 'N': ..."
1,oqmd,,5.934,21912,"[{'value': [0.354311, 0.9872749999999999, 0.03...",Thu Oct 26 16:26:03 GMT+09:00 2017,Monoclinic,21915,868220,2.253201,...,4,P21,0,KLiH4S2N2O6,"{'S': 4, 'Li': 2, 'N': 4, 'O': 12, 'H': 8, 'K'...",20433,siahn,350.994,2.655,"{'S': 0.125, 'Li': 0.0625, 'N': 0.125, 'O': 0...."
2,oqmd,,5.231,21912,"[{'value': [0, 0, 0], 'label': 'Cs'}, {'value'...",Thu Oct 26 16:30:32 GMT+09:00 2017,Tetragonal,21915,892156,2.898363,...,87,I4/m,0,CsKNa2Li12Si4O16,"{'Si': 4, 'Li': 12, 'Cs': 1, 'O': 16, 'Na': 2,...",20433,siahn,383.509,2.163056,"{'Si': 0.111111111111, 'Li': 0.333333333333, '..."
3,oqmd,,0.0,21912,"[{'value': [0.72818, 0, 0.25], 'label': 'B'}, ...",Thu Oct 26 16:30:56 GMT+09:00 2017,Monoclinic,21915,894362,3.624381,...,15,C2/c,0,LiCu2BP2H2O10,"{'P': 4, 'B': 2, 'Li': 2, 'Cu': 4, 'O': 20, 'H...",20433,siahn,337.829,2.777778,"{'P': 0.111111111111, 'B': 0.0555555555556, 'L..."


In [23]:
#create the element columns
element_columns = atomtable.symbol.values

The formula information expressed as ratios is obtained as follows.

In [24]:
listed_unitcellformula = pd.DataFrame(query_result.rate_unitcellformula.tolist(), columns=element_columns)
listed_unitcellformula.fillna(0, inplace=True)
listed_unitcellformula

Unnamed: 0,H,He,Li,Be,B,C,N,O,F,Ne,...,Lr,Rf,Db,Sg,Bh,Hs,Mt,Ds,Rg,Cn
0,0.25,0.0,0.0625,0.0,0.0,0.0,0.125,0.375,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.25,0.0,0.0625,0.0,0.0,0.0,0.125,0.375,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.333333,0.0,0.0,0.0,0.0,0.444444,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.111111,0.0,0.055556,0.0,0.055556,0.0,0.0,0.555556,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


#### 2.2. Merging Data Frames
Next, we extract the features and labels, leaving out fields that are not used for training, such as DataType, collectionId, and createDate. We then append the lattice information and unit cell formula generated above to complete the machine learning datasets.

In [25]:
#labels and features extraction
listed_extracted = query_result[[
    #Each label used as a Y-value (supervised)
    'bandgap', 'finalenergyperatom','formationenergy', 
    
    #Basic information of each compound 
    'spacegroupnum','nelements', 'nsites','density','mass','volume',
    
    #Derived properties by calculations
    'avg_electronegativity']]

In [26]:
print("1) listed_extracted: ", listed_extracted.shape)
print("2) listed_lattice_information: ", listed_lattice_information.shape)
print("3) listed_unitcellformula: ", listed_unitcellformula.shape)

for_learning_datasets = pd.concat([listed_extracted, listed_lattice_information], axis = 1)
for_learning_datasets = pd.concat([for_learning_datasets, listed_unitcellformula], axis = 1)

print("--------------------------------------")
print("+) for_learning_datasets: ", for_learning_datasets.shape)

#show the result
for_learning_datasets

1) listed_extracted:  (4, 10)
2) listed_lattice_information:  (4, 6)
3) listed_unitcellformula:  (4, 112)
--------------------------------------
+) for_learning_datasets:  (4, 128)


Unnamed: 0,bandgap,finalenergyperatom,formationenergy,spacegroupnum,nelements,nsites,density,mass,volume,avg_electronegativity,...,Lr,Rf,Db,Sg,Bh,Hs,Mt,Ds,Rg,Cn
0,5.933,-5.576149,-1.367975,2,6,32,2.940309,664.035112,374.892,2.653125,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,5.934,-5.592953,-1.369649,4,6,32,2.253201,476.421717,350.994,2.655,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,5.231,-5.874426,-2.508278,87,6,36,2.898363,669.607304,383.509,2.163056,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,-6.13161,-1.942571,15,6,36,3.624381,737.602971,337.829,2.777778,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


The resulting machine learning dataset contains 3 kinds of labels and 125 features.
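
The column-wise merge can be illustrated on a reduced example. The three toy frames below are made-up stand-ins for the extracted labels, the lattice information, and the formula fractions. The point of the sketch is that `pd.concat(..., axis=1)` lines rows up by index, so all pieces need the same index for the shapes to add up as shown.

```python
import pandas as pd

# Toy stand-ins with the same default RangeIndex (0, 1)
labels = pd.DataFrame({"bandgap": [5.933, 0.0]})
lattice = pd.DataFrame({"lattice_a": [5.47, 4.71]})
fractions = pd.DataFrame({"Li": [0.0625, 0.0556], "O": [0.375, 0.5556]})

merged = pd.concat([labels, lattice, fractions], axis=1)
# Row count is preserved; column counts add up (1 + 1 + 2)
print(merged.shape, merged.columns.tolist())
```

If one of the frames had been filtered or re-sorted beforehand, resetting its index before concatenating would avoid rows being matched to the wrong compound or padded with NaN.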

### 3. 심화된 학습 데이터 분석 및 가시화
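So far, we illustrated dataset building with sample-sized data matching the condition "elements: Li AND nelements:6". We now extract more data, build a dataset, and look at techniques for analyzing the relationships in the data by visualizing them as graphs.
#### 3.1. Binary Compounds: Data Extraction
To shorten the run time of the example, we extract the binary compounds from the OQMD datasets and analyze the correlations in the data. The search returns 25,877 simulation datasets.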

In [42]:
#set the query
other_lucene_query = "nelements:2"

full_query = '"'+ server_address + rest_api_option + basic_lucene_query + other_lucene_query + '"'
jsonResultFileName = "query_result_complex.json"

command_line = 'wget -O' + ' ' + currentPath + dataPath + jsonResultFileName +' '+'--user'+ ' '+ user_id + ' '+ '--password' + ' ' + user_pwd + ' ' + extra_args + ' ' + full_query

subprocess.call(command_line, shell=True)

complex_query_result = pd.read_json(currentPath+dataPath+jsonResultFileName)
complex_query_result.fillna(0,inplace=True)

#### 3.2. Binary Compounds: Building Learning Datasets
Using the same approach as in the example above, we compute the average electronegativity, the lattice information, and the formula information.

In [43]:
""" building datasets by adding average electronegativity, converted lattice information, and converted formula
"""
complexset_avg_electronegativity = []
for index in range(0,len(complex_query_result)):
    complexset_avg_electronegativity.append(getElectronegativity_from_dict(complex_query_result.unitcellformula.values[index], atomtable))

# adding average electronegativity
complex_query_result['avg_electronegativity'] = complexset_avg_electronegativity

# converting lattice information
complex_listed_lattice_length = pd.DataFrame(complex_query_result.lattice.tolist(), columns = ['lattice_a','lattice_b','lattice_c'])
complex_listed_lattice_angle = pd.DataFrame()
complex_listed_lattice_angle['latticealpha'] = complex_query_result.latticealpha
complex_listed_lattice_angle['latticebeta'] = complex_query_result.latticebeta
complex_listed_lattice_angle['latticegamma'] = complex_query_result.latticegamma
complex_listed_lattice_information = pd.concat([complex_listed_lattice_length, complex_listed_lattice_angle], axis = 1)

In [44]:
complex_rate_unitcellformula = []
for row in range(0, complex_query_result.shape[0]):
    complex_rate_unitcellformula.append({k: v / complex_query_result.nsites[row] for k, v in complex_query_result.unitcellformula[row].items()})
complex_query_result['rate_unitcellformula'] = complex_rate_unitcellformula
complex_listed_unitcellformula = pd.DataFrame(complex_query_result.rate_unitcellformula.tolist(), columns=element_columns)
complex_listed_unitcellformula.fillna(0, inplace=True)

In [45]:
#labels and features extraction
complex_listed_extracted = complex_query_result[[
    #Each label used as a Y-value (supervised)
    'bandgap', 'finalenergyperatom','formationenergy', 
    
    #Basic information of each compound 
    'spacegroupnum','nelements', 'nsites','density','mass','volume',
    
    #Derived properties by calculations
    'avg_electronegativity']]

complex_for_learning_datasets = pd.concat([complex_listed_extracted, complex_listed_lattice_information], axis = 1)
complex_for_learning_datasets = pd.concat([complex_for_learning_datasets, complex_listed_unitcellformula], axis = 1)

print("--------------------------------------")
print("+) for_learning_datasets: ", complex_for_learning_datasets.shape)

#show the result
complex_for_learning_datasets.head()

--------------------------------------
+) for_learning_datasets:  (25877, 128)


Unnamed: 0,bandgap,finalenergyperatom,formationenergy,spacegroupnum,nelements,nsites,density,mass,volume,avg_electronegativity,...,Lr,Rf,Db,Sg,Bh,Hs,Mt,Ds,Rg,Cn
0,2.083,-4.603133,-2.106159,225,2,2,8.149358,252.014,51.3345,1.825,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,-5.185152,0.106576,216,2,2,2.620305,59.059301,37.4149,2.045,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,-3.317846,-0.327448,221,2,2,7.481067,215.739998,47.8713,1.41,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,-6.786665,0.53432,225,2,2,4.003831,40.096201,16.624,2.225,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,-3.678396,-0.262073,123,2,2,8.595496,124.073399,23.9616,1.78,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [46]:
complex_for_learning_datasets.describe()

Unnamed: 0,finalenergyperatom,formationenergy,spacegroupnum,nelements,nsites,density,mass,volume,avg_electronegativity,lattice_a,...,Lr,Rf,Db,Sg,Bh,Hs,Mt,Ds,Rg,Cn
count,25877.0,25877.0,25877.0,25877.0,25877.0,25877.0,25877.0,25877.0,25877.0,25877.0,...,25877.0,25877.0,25877.0,25877.0,25877.0,25877.0,25877.0,25877.0,25877.0,25877.0
mean,-4.98398,-0.197996,119.297059,2.0,9.058778,7.086645,724.426096,192.315764,1.742834,5.65513,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
std,7.621827,7.282629,84.495316,0.0,10.675945,3.393081,718.655117,248.79776,0.465077,2.886247,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
min,-203.629754,-198.69561,1.0,2.0,2.0,0.659759,7.94894,8.87725,0.805,1.3761,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,-6.676607,-0.421062,38.0,2.0,4.0,4.599304,259.808998,67.018,1.42,3.779207,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,-4.76032,-0.067326,139.0,2.0,6.0,6.741047,480.674408,132.189,1.655,4.903808,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,-3.08344,0.104474,221.0,2.0,12.0,8.874437,937.776398,234.65,1.94,6.150845,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,1122.552855,1126.321181,230.0,2.0,184.0,21.559559,7511.915863,3471.91,3.845,38.346241,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


One problem shows up here. When describe() is called on complex_for_learning_datasets to inspect the data distribution, the number of columns drops by one, from 128 to 127. The following code examines the problem in detail.

In [47]:
import numpy as np
error_rows = complex_for_learning_datasets[~complex_for_learning_datasets.applymap(np.isreal).all(1)]
error_rows

Unnamed: 0,bandgap,finalenergyperatom,formationenergy,spacegroupnum,nelements,nsites,density,mass,volume,avg_electronegativity,...,Lr,Rf,Db,Sg,Bh,Hs,Mt,Ds,Rg,Cn
3188,,-2.399415,-0.133001,25,2,8,13.219683,1049.38797,131.772,2.095,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3787,,-8.521683,0.075234,6,2,12,11.262275,1673.417976,246.653,1.52,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3789,,-9.827054,0.071113,59,2,12,9.693963,1358.936005,232.705,1.453333,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3793,,-7.699577,-0.822668,26,2,12,12.772653,1457.706001,189.451,1.91,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3828,,-6.568589,0.145195,59,2,12,5.992966,962.936005,266.725,1.403333,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7556,,-3.805277,-1.026036,12,2,9,7.078194,1303.556007,305.714,2.028889,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7584,,-7.847338,-2.628218,167,2,10,4.795598,299.762407,103.763,2.716,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7585,,-6.346611,-0.558179,20,2,10,4.228852,359.480408,141.111,2.118,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7586,,-4.350178,-0.83411,9,2,10,3.556926,471.28199,219.945,2.272,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7587,,-3.639521,-0.99434,36,2,10,3.396985,724.009987,353.801,1.864,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [67]:
error_rows.shape

(47, 128)

The missing column is 'bandgap': the label value is absent in rows ranging from 3188 to 18218. In other words, the 47 simulations above are calculations for which the bandgap had not yet been computed. To build a proper bandgap model, we filter out these rows and then cast the dataset to a float type.

In [49]:
complex_for_learning_datasets.shape

(25877, 128)

In [50]:
error_rows.shape

(47, 128)

In [68]:
filtered_complex_for_learning_datasets = complex_for_learning_datasets[complex_for_learning_datasets.applymap(np.isreal).all(1)]
filtered_complex_for_learning_datasets = filtered_complex_for_learning_datasets.astype(np.float32)
filtered_complex_for_learning_datasets.shape

(25830, 128)
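
An alternative way to find and drop such rows is to coerce every column to numbers and look for the values that fail to parse. The sketch below uses a made-up three-row frame and assumes that a missing bandgap is stored as a non-numeric placeholder such as an empty string. It is not the method used in the cells above.

```python
import pandas as pd

# Toy frame: the second row has a non-numeric bandgap placeholder
toy = pd.DataFrame({"bandgap": [2.083, "", 0.0],
                    "density": [8.15, 13.22, 2.62]})

# Coerce every column to numbers; anything unparsable becomes NaN
numeric = toy.apply(pd.to_numeric, errors="coerce")
bad_rows = numeric.isna().any(axis=1)

# Keep only fully numeric rows and cast to float32
filtered = numeric[~bad_rows].astype("float32")
print(bad_rows.tolist(), filtered.shape)
```

Unlike the `np.isreal` check, this variant also drops rows that contain genuine NaN values, which may or may not be desired. It also avoids `DataFrame.applymap`, which recent pandas releases deprecate in favor of `DataFrame.map`.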

In [69]:
filtered_complex_for_learning_datasets.describe()

Unnamed: 0,bandgap,finalenergyperatom,formationenergy,spacegroupnum,nelements,nsites,density,mass,volume,avg_electronegativity,...,Lr,Rf,Db,Sg,Bh,Hs,Mt,Ds,Rg,Cn
count,25830.0,25830.0,25830.0,25830.0,25830.0,25830.0,25830.0,25830.0,25830.0,25830.0,...,25830.0,25830.0,25830.0,25830.0,25830.0,25830.0,25830.0,25830.0,25830.0,25830.0
mean,0.233056,-4.982612,-0.196927,119.405884,2.0,9.05695,7.087548,724.330078,192.296799,1.742248,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
std,0.911475,7.628321,7.289134,84.495323,0.0,10.685522,3.393343,719.086853,248.996628,0.464993,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
min,0.0,-203.629761,-198.695602,1.0,2.0,2.0,0.659759,7.94894,8.87725,0.805,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,-6.675405,-0.419524,38.0,2.0,4.0,4.600797,259.546898,66.9289,1.42,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,0.0,-4.756761,-0.066782,139.0,2.0,6.0,6.743203,480.674408,131.946999,1.655,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,0.0,-3.082532,0.105063,221.0,2.0,12.0,8.874264,937.691223,234.60075,1.939643,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,10.34,1122.552856,1126.321167,230.0,2.0,184.0,21.559559,7511.916016,3471.909912,3.845,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


The 25,830 simulation rows are now represented cleanly in 128 columns.

 <hr/>
###### References
[1] Saal, J. E., Kirklin, S., Aykol, M., Meredig, B., and Wolverton, C. "Materials Design and Discovery with High-Throughput Density Functional Theory: The Open Quantum Materials Database (OQMD)", JOM 65, 1501-1509 (2013). doi:10.1007/s11837-013-0755-4 [Link](http://dx.doi.org/10.1007/s11837-013-0755-4)

[2] Kirklin, S., Saal, J.E., Meredig, B., Thompson, A., Doak, J.W., Aykol, M., Rühl, S. and Wolverton, C. "The Open Quantum Materials Database (OQMD): assessing the accuracy of DFT formation energies", npj Computational Materials 1, 15010 (2015). doi:10.1038/npjcompumats.2015.10 [Link](http://dx.doi.org/10.1038/npjcompumats.2015.10)