# CSD Melting Point Data Overview

In [1]:
import os
import joblib 
import pandas as pd

## 1. List all the files
* The following two binary files are assumed to have been downloaded from CCDC and placed in the `Datasets` folder (the cells below load them from `./../Datasets`):
* `melting_point_fps.job`: a binary file of RDKit fingerprints (radius 2, 1024 bits)
* `melting_point_rdkit_2d.job`: a binary file of 209 RDKit 2D descriptors
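
The notebook does not show a cell that lists the files, so the following is a minimal sketch of how that could be done with the already imported `os` module. The helper name `list_job_files` is hypothetical, and the folder path is assumed to match the one used by the load calls in this notebook.

```python
import os


def list_job_files(folder):
    """Return the sorted names of the .job files located directly in `folder`."""
    return sorted(
        name
        for name in os.listdir(folder)
        if name.endswith('.job') and os.path.isfile(os.path.join(folder, name))
    )


# Assumed location of the downloaded files; adjust to the local layout.
data_dir = './../Datasets'
if os.path.isdir(data_dir):
    print(list_job_files(data_dir))
```

The `os.path.isdir` guard only keeps the cell from failing when the folder is missing; it prints nothing in that case.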

## 2. The number of data points: 58810

In [3]:
df_fps = joblib.load('./../Datasets/melting_point_fps.job')
df_rdkit_2d = joblib.load('./../Datasets/melting_point_rdkit_2d.job')

In [5]:
print('The number of data points for both files: ', df_fps.shape, df_rdkit_2d.shape)

The number of data points for both files:  (58810, 4) (58810, 211)
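
Both shapes report 58810 rows. Whether the two tables are also row-aligned (same IDs in the same order, same `std_temp` and split labels) is not stated here, so a check along the following lines may be useful before combining them. This is a minimal sketch on toy data; `check_aligned` is a hypothetical helper, and the assumption that the shared columns are meant to match is mine.

```python
import pandas as pd


def check_aligned(df_a, df_b, cols=('std_temp', 'clustering', 'random')):
    """Return True if both frames share the same index and identical values in `cols`."""
    if not df_a.index.equals(df_b.index):
        return False
    return all(df_a[c].equals(df_b[c]) for c in cols)


# Toy stand-ins for the two tables (column names taken from this notebook).
ids = ['CID_00000', 'CID_00001']
toy_fps = pd.DataFrame(
    {'std_temp': [462.65, 519.0], 'clustering': ['TRN', 'TRN'], 'random': ['TRN', 'TRN']},
    index=ids,
)
toy_2d = toy_fps.copy()
toy_2d['MolWt'] = [371.478, 361.388]  # an extra descriptor column does not affect the check

aligned = check_aligned(toy_fps, toy_2d)
print(aligned)
```

On the real data the call would be `check_aligned(df_fps, df_rdkit_2d)`.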


## 3. Show examples of RDKit 2D descriptors
* There are 209 RDKit 2D descriptors.
* std_temp: standardized melting point (K), the third column from the end.
* clustering: training/test split based on Butina clustering
    * TRN: training set
    * TST: test set
* random: Training/test split based on random sampling
    * TRN: training set
    * TST: test set
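
The `clustering` and `random` columns described above can be used to select training and test rows. The following is a minimal sketch on a toy frame built from a few values shown in this notebook; `split_train_test` is a hypothetical helper, not part of the dataset.

```python
import pandas as pd


def split_train_test(df, split_col):
    """Return (train_df, test_df) according to the TRN/TST labels in `split_col`."""
    labels = df[split_col]
    unknown = set(labels.unique()) - {'TRN', 'TST'}
    if unknown:
        raise ValueError(f'unexpected split labels: {sorted(unknown)}')
    return df[labels == 'TRN'], df[labels == 'TST']


# Toy frame with one descriptor column plus the target and split columns.
toy = pd.DataFrame(
    {
        'MolWt': [371.478, 361.388, 274.057, 285.774, 348.358],
        'std_temp': [462.65, 519.0, 456.65, 375.15, 519.15],
        'clustering': ['TRN', 'TRN', 'TRN', 'TST', 'TRN'],
        'random': ['TRN', 'TRN', 'TST', 'TRN', 'TRN'],
    },
    index=['CID_00000', 'CID_00001', 'CID_00002', 'CID_00003', 'CID_00004'],
)

train_cl, test_cl = split_train_test(toy, 'clustering')
train_rd, test_rd = split_train_test(toy, 'random')
print(len(train_cl), len(test_cl))
```

For modeling, the `std_temp`, `clustering`, and `random` columns would typically be dropped from the feature matrix after the split.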

In [6]:
df_rdkit_2d.head()

| | MaxEStateIndex | MinEStateIndex | MaxAbsEStateIndex | MinAbsEStateIndex | qed | MolWt | HeavyAtomMolWt | ExactMolWt | NumValenceElectrons | NumRadicalElectrons | ... | fr_term_acetylene | fr_tetrazole | fr_thiazole | fr_thiocyan | fr_thiophene | fr_unbrch_alkane | fr_urea | std_temp | clustering | random |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CID_00000 | 11.984096 | -0.494253 | 11.984096 | 0.027149 | 0.507381 | 371.478 | 338.214 | 371.242021 | 150.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 462.65 | TRN | TRN |
| CID_00001 | 14.628345 | -1.058673 | 14.628345 | 0.037222 | 0.815581 | 361.388 | 340.22 | 361.14895 | 138.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 519.0 | TRN | TRN |
| CID_00002 | 10.301851 | -0.913655 | 10.301851 | 0.736759 | 0.664344 | 274.057 | 267.001 | 273.949077 | 62.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 456.65 | TRN | TST |
| CID_00003 | 9.049972 | -0.076466 | 9.049972 | 0.076466 | 0.903515 | 285.774 | 269.646 | 285.092042 | 102.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 375.15 | TST | TRN |
| CID_00004 | 13.202943 | -1.882482 | 13.202943 | 0.04488 | 0.837269 | 348.358 | 332.23 | 348.111007 | 130.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 519.15 | TRN | TRN |


## 4. Show examples of Fingerprint descriptors
* std_temp: standardized melting point (K), the second column of the displayed table.
* FP: RDKit fingerprint (radius 2, 1024 bits)
* clustering: training/test split based on Butina clustering
    * TRN: training set
    * TST: test set
* random: Training/test split based on random sampling
    * TRN: training set
    * TST: test set
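
Because each `FP` entry holds a whole bit vector in a single cell, most modeling libraries need it expanded into a 2D array first. The following is a minimal sketch, assuming each entry is a sequence of 0/1 values of equal length; `fingerprints_to_matrix` is a hypothetical helper, and the toy vectors use 4 bits instead of 1024 for brevity.

```python
import numpy as np
import pandas as pd


def fingerprints_to_matrix(fp_column):
    """Stack per-molecule bit sequences into an (n_molecules, n_bits) uint8 array."""
    return np.stack([np.asarray(fp, dtype=np.uint8) for fp in fp_column])


# Toy frame mimicking the layout of the fingerprint table.
toy = pd.DataFrame(
    {'std_temp': [462.65, 519.0], 'FP': [[0, 1, 1, 0], [0, 0, 0, 1]]},
    index=['CID_00000', 'CID_00001'],
)

X = fingerprints_to_matrix(toy['FP'])  # feature matrix, one row per molecule
y = toy['std_temp'].to_numpy()         # melting points in K
print(X.shape)  # (2, 4)
```

`np.stack` raises a `ValueError` if the vectors differ in length, which doubles as a basic sanity check on the column.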

In [7]:
df_fps.head()

| | std_temp | FP | clustering | random |
|---|---|---|---|---|
| CID_00000 | 462.65 | [0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... | TRN | TRN |
| CID_00001 | 519.0 | [0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, ... | TRN | TRN |
| CID_00002 | 456.65 | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, ... | TRN | TST |
| CID_00003 | 375.15 | [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... | TST | TRN |
| CID_00004 | 519.15 | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... | TRN | TRN |
