# Feature Representation Methods in ChemML

To build a machine learning model, raw chemical data must first be converted into a numerical representation. The representation encodes the spatial or topological information that defines a molecule. The resulting features may be either continuous (molecular descriptors) or discrete (molecular fingerprints).

In [1]:
from chemml.chem import Molecule
from chemml.datasets import load_organic_density
import numpy as np
import warnings
warnings.filterwarnings('ignore')

## Creating `chemml.chem.Molecule` objects from SMILES

All feature representation methods available in ChemML require `chemml.chem.Molecule` objects as input.

In [2]:
# Import an existing dataset from ChemML
molecules, target, dragon_subset = load_organic_density()
mol_objs_list = []
for smi in molecules['smiles']:
    mol = Molecule(smi, 'smiles')
    # Add explicit hydrogens before generating 3D coordinates
    mol.hydrogens('add')
    # Generate 3D coordinates optimized with the MMFF94s force field
    mol.to_xyz('MMFF', maxIters=10000, mmffVariant='MMFF94s')
    mol_objs_list.append(mol)

## [Coulomb Matrix](https://doi.org/10.1103/PhysRevLett.108.058301)

The Coulomb matrix is a simple molecular descriptor that mimics the electrostatic interaction between nuclei.
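As a rough illustration of what this descriptor contains, the sketch below builds an unsorted Coulomb matrix with NumPy, following the definition in the paper linked in the heading: diagonal entries are `0.5 * Z_i**2.4` and off-diagonal entries are `Z_i * Z_j / |R_i - R_j|`. The helper names `coulomb_matrix` and `sort_by_row_norm` are hypothetical and are not part of ChemML. The sketch does not reproduce ChemML's padding or flattening, and the distance unit is simply whatever unit the coordinates are supplied in.

```python
import numpy as np

def coulomb_matrix(charges, coords):
    """Unsorted Coulomb matrix for one molecule (illustrative helper, not ChemML API)."""
    charges = np.asarray(charges, dtype=float)
    coords = np.asarray(coords, dtype=float)
    # Pairwise distances between all nuclei
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    np.fill_diagonal(dist, 1.0)  # placeholder to avoid division by zero on the diagonal
    cm = np.outer(charges, charges) / dist
    # Diagonal term from the original definition: 0.5 * Z^2.4
    np.fill_diagonal(cm, 0.5 * charges ** 2.4)
    return cm

def sort_by_row_norm(cm):
    """Reorder rows and columns by descending row norm (one common 'sorted' variant)."""
    order = np.argsort(-np.linalg.norm(cm, axis=1))
    return cm[order][:, order]

# Toy example: a hydrogen and an oxygen nucleus 2.0 distance units apart
cm = coulomb_matrix([1, 8], [[0.0, 0.0, 0.0], [0.0, 0.0, 2.0]])
cm_sorted = sort_by_row_norm(cm)
```

The diagonal term explains the values that recur in the ChemML output for this dataset: 388.02 is `0.5 * 16**2.4` (sulfur), 73.52 is `0.5 * 8**2.4` (oxygen), and 53.36 is `0.5 * 7**2.4` (nitrogen).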

In [3]:
from chemml.chem import CoulombMatrix

# The Coulomb matrix type can be sorted (SC), unsorted (UM), unsorted triangular (UT), eigenspectrum (E), or random (RC)
CM = CoulombMatrix(cm_type='SC', n_jobs=-1)

features = CM.represent(mol_objs_list)
print(features[:5])

featurizing molecules in batches of 41 ...
500/500 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 31s 62ms/step
Merging batch features ...    [DONE]
         0          1           2          3          4           5     \
0  388.023441  67.563708  388.023441  46.773229  71.039369  388.023441   
1   73.516695  12.680652   73.516695  13.507027  15.308382   53.358707   
2  388.023441  12.326815   73.516695  40.634121   5.708873   53.358707   
3  388.023441  74.210056  388.023441  48.421164  40.116551   73.516695   
4  388.023441  34.568998  388.023441  20.742817  20.052751   73.516695   

        6          7          8          9     ...  1643  1644  1645  1646  \
0  43.471152  31.884893  23.619685  53.358707  ...   0.0   0.0   0.0   0.0   
1  15.511767  10.387461   7.267910  53.358707  ...   0.0   0.0   0.0   0.0   
2  24.711259   7.202549  20.353557  53.358707  ...   0.0   0.0   0.0   0.0   
3  26.725815  20.224894  15.430350  73.516695  ...   0.0

## [Fingerprints from RDKit](https://www.rdkit.org/)

Molecular fingerprints are a way of encoding the structure of a molecule. The most common type of fingerprint is a series of binary digits (bits) that represent the presence or absence of particular substructures in the molecule. Comparing fingerprints allows you to determine the similarity between two molecules, to find matches to a query substructure, etc.

In [4]:
from chemml.chem import RDKitFingerprint

# RDKit fingerprint types: 'morgan', 'hashed_topological_torsion' or 'htt', 'MACCS' or 'maccs', 'hashed_atom_pair' or 'hap'
morgan_fp = RDKitFingerprint(fingerprint_type='morgan', vector='bit', n_bits=1024, radius=3)
features = morgan_fp.represent(mol_objs_list)
print(features[:5])



   0     1     2     3     4     5     6     7     8     9     ...  1014  \
0     0     0     0     0     0     0     1     0     0     0  ...     0   
1     0     0     0     0     0     0     0     0     0     0  ...     0   
2     0     0     0     0     0     0     0     0     0     0  ...     0   
3     0     0     0     0     0     0     1     0     0     0  ...     0   
4     0     0     0     1     0     0     0     0     1     0  ...     0   

   1015  1016  1017  1018  1019  1020  1021  1022  1023  
0     0     0     0     0     1     0     0     0     0  
1     0     0     0     0     0     0     0     0     0  
2     0     0     0     0     0     0     0     0     0  
3     0     0     0     0     0     0     0     0     0  
4     0     0     0     0     1     0     1     0     0  

[5 rows x 1024 columns]
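The similarity comparison mentioned in the introduction to this section can be sketched with the Tanimoto (Jaccard) coefficient, a common choice for bit-vector fingerprints: the number of bits set in both fingerprints divided by the number set in either. The `tanimoto` helper below is hypothetical and is not part of ChemML or RDKit. Returning 0.0 for two all-zero fingerprints is a convention chosen for this sketch.

```python
import numpy as np

def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between two equal-length bit vectors (illustrative helper)."""
    a = np.asarray(fp_a, dtype=bool)
    b = np.asarray(fp_b, dtype=bool)
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 0.0  # convention used here when neither fingerprint has any bit set
    return float(np.logical_and(a, b).sum() / union)

# Two toy 4-bit fingerprints sharing one of three set bits
similarity = tanimoto([1, 1, 0, 0], [1, 0, 1, 0])
```

Assuming `features` is a table with one row per molecule, as the printed output suggests, a call such as `tanimoto(features.iloc[0], features.iloc[3])` would compare the first and fourth molecules.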


## Molecule tensors from `chemml.chem.Molecule` objects

Molecule tensors can be used to create neural graph fingerprints using `chemml.models`.

In [5]:
from chemml.chem import tensorise_molecules
atoms, bonds, edges = tensorise_molecules(
    molecules=mol_objs_list,
    max_degree=5,
    max_atoms=None,
    n_jobs=-1,
    batch_size=100,
    verbose=True,
)

Tensorising molecules in batches of 100 ...
500/500 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 14s 27ms/step
Merging batch tensors ...    [DONE]


In [6]:
print("Matrix for atom features (num_molecules, max_atoms, num_atom_features):\n", atoms.shape)
print("Matrix for connectivity between atoms (num_molecules, max_atoms, max_degree):\n", edges.shape)
print("Matrix for bond features (num_molecules, max_atoms, max_degree, num_bond_features):\n", bonds.shape)

Matrix for atom features (num_molecules, max_atoms, num_atom_features):
 (500, 57, 62)
Matrix for connectivity between atoms (num_molecules, max_atoms, max_degree):
 (500, 57, 5)
Matrix for bond features (num_molecules, max_atoms, max_degree, num_bond_features):
 (500, 57, 5, 6)
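To make the `(num_molecules, max_atoms, max_degree)` layout of the connectivity tensor more concrete, here is a minimal sketch that builds such an array by hand for two toy molecules. The `build_edge_tensor` helper is hypothetical, and the use of `-1` to pad unused neighbour slots is an assumption made for illustration; ChemML's `tensorise_molecules` may use a different padding value or atom ordering.

```python
import numpy as np

def build_edge_tensor(bond_lists, num_atoms, max_degree):
    """Padded neighbour-index tensor (illustrative helper, not ChemML API).

    bond_lists: one list of (i, j) atom-index pairs per molecule.
    num_atoms:  number of atoms in each molecule.
    Raises IndexError if an atom has more than max_degree neighbours.
    """
    max_atoms = max(num_atoms)
    # -1 marks an unused neighbour slot or a padded atom (assumed convention)
    edges = np.full((len(bond_lists), max_atoms, max_degree), -1, dtype=int)
    for m, bonds in enumerate(bond_lists):
        degree = np.zeros(max_atoms, dtype=int)
        for i, j in bonds:
            edges[m, i, degree[i]] = j
            degree[i] += 1
            edges[m, j, degree[j]] = i
            degree[j] += 1
    return edges

# A three-atom chain (0-1-2) and a single isolated atom
toy_edges = build_edge_tensor([[(0, 1), (1, 2)], []], num_atoms=[3, 1], max_degree=2)
```

Row `toy_edges[0, 1]` lists both neighbours of the middle atom, while every slot of the second molecule stays at the padding value because it has no bonds.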


## [Descriptors from RDKit](https://www.rdkit.org/docs/source/rdkit.Chem.Descriptors.html)

RDKit provides a comprehensive set of molecular descriptors covering topological, geometrical, electronic, and constitutional properties. The calculation is efficient for large datasets, and the `RDKDesc` class allows either specific descriptors or all of them to be selected. The results integrate with other RDKit functions and Python workflows.

In [7]:
from chemml.chem import RDKDesc

rdd = RDKDesc()
features = rdd.represent(mol_objs_list)
print(features[:5])

[11:33:15] 

****
Pre-condition Violation
bad result vector size
Violation occurred on line 42 in file C:\rdkit\build\temp.win-amd64-cpython-312\Release\rdkit\Code\GraphMol\Descriptors\Crippen.cpp
Failed Expression: logpContribs.size() == mol.getNumAtoms() && mrContribs.size() == mol.getNumAtoms()
****

   MaxAbsEStateIndex  MaxEStateIndex  MinAbsEStateIndex  MinEStateIndex  \
0           8.638741        8.638741           0.039513       -3.852793   
1           8.193508        8.193508           0.286175       -0.599619   
2           8.666998        8.666998           0.049406       -3.394321   
3           8.032986        8.032986           0.013750       -2.742153   
4           9.189644        9.189644           0.166504       -3.306171   

        qed        SPS    MolWt  HeavyAtomMolWt  ExactMolWt  \
0  0.816913  70.705882  285.503         266.351  285.067963   
1  0.735869  16.444444  240.222         232.158  240.064725   
2  0.801905  32.818182  313.386         298.266  313.099731   
3  0.749729  51.384615  218.299         208.219  218.007136   
4  0.772983  38.095238  319.415         306.311  319.056152   

   NumValenceElectrons  ...  fr_sulfonamd  fr_sulfone  fr_term_acetylene  \
0                   94  ...             0           0                  0   
1                 

## [Descriptors from Mordred](https://github.com/mordred-descriptor/mordred)

Note: This class requires Mordred to be installed (see the link in the heading above).

Mordred molecular descriptors are an open-source alternative to Dragon and RDKit descriptors. The library can generate more than 1800 descriptors, compared with more than 5200 for Dragon and about 200 for RDKit.

In [8]:
from chemml.chem import Mordred

mord = Mordred()
features = mord.represent(mol_objs_list)
print(features[:5])

   nAcid  nBase    SpAbs_A   SpMax_A  SpDiam_A     SpAD_A   SpMAD_A   LogEE_A  \
0      0      0  22.998278  2.356990  4.586123  22.998278  1.352840  3.780704   
1      0      0  24.114905  2.401132  4.700060  24.114905  1.339717  3.834684   
2      0      1  30.133660  2.388743  4.753098  30.133660  1.369712  4.046753   
3      0      0  16.550756  2.429396  4.799667  16.550756  1.273135  3.507942   
4      0      2  28.597221  2.464180  4.847996  28.597221  1.361772  4.008031   

      VE1_A     VE2_A  ...     TSRW10          MW       AMW  WPath  WPol  \
0  3.771227  0.221837  ...  64.739856  285.067963  7.918555    564    19   
1  3.859715  0.214429  ...  64.604946  240.064725  9.233259    624    25   
2  4.414919  0.200678  ...  71.264318  313.099731  8.462155   1142    30   
3  3.324299  0.255715  ...  58.294869  218.007136  9.478571    226    18   
4  4.162396  0.198209  ...  71.560531  319.056152  9.384004    870    29   

   Zagreb1  Zagreb2  mZagreb1  mZagreb2                 

## [Descriptors from PaDELPy](https://pypi.org/project/padelpy/)

Note: This class requires PaDELPy to be installed (see the link in the heading above).

PaDEL-Descriptor is open-source software for calculating molecular descriptors and fingerprints. It computes 797 descriptors (663 1D/2D and 134 3D) and 10 fingerprint types, using the Chemistry Development Kit together with custom implementations. It offers both a GUI and a CLI, supports multiple file formats, and uses multithreading for efficient calculations.

In [11]:
from chemml.chem import PadelDesc

padel = PadelDesc()
features = padel.represent(mol_objs_list[:100])
print(features[:5])

RuntimeError: PaDEL-Descriptor encountered an error: PaDEL-Descriptor timed out during subprocess call