In [1]:
from pyanitools import anidataloader

# path to the HDF5 file containing the data
hdf5file = 'ani_gdb_s01.h5'

# construct the data loader class
adl = anidataloader(hdf5file)

# iterate over the molecules in the data set one by one
for data in adl:
#     print(data.keys())

    # extract the data
    P = data['path']
    X = data['coordinates']
    E = data['energies']
    S = data['species']
    sm = data['smiles']

    # print the data
    print("Path:   ", P)
    print("  Smiles:      ","".join(sm))
    print("  Symbols:     ", S)
    print("  Coordinates: ", X.shape)
    print("  Energies:    ", E.shape, "\n")

# closes the H5 data file
adl.cleanup()

Path:    /gdb11_s01/gdb11_s01-0
  Smiles:       [H]C([H])([H])[H]
  Symbols:      ['C', 'H', 'H', 'H', 'H']
  Coordinates:  (5400, 5, 3)
  Energies:     (5400,) 

Path:    /gdb11_s01/gdb11_s01-1
  Smiles:       [H]N([H])[H]
  Symbols:      ['N', 'H', 'H', 'H']
  Coordinates:  (3600, 4, 3)
  Energies:     (3600,) 

Path:    /gdb11_s01/gdb11_s01-2
  Smiles:       [H]O[H]
  Symbols:      ['O', 'H', 'H']
  Coordinates:  (1800, 3, 3)
  Energies:     (1800,) 



# Bard AI says:
To encode molecular strings for an ANN, I would use the following steps:

1. Tokenize the molecular strings: The first step is to tokenize the molecular strings into individual tokens. This can be done using a variety of techniques, such as regular expressions or a pre-trained text tokenizer.
2. Embed the tokens: The next step is to embed the tokens into a numerical representation. This can be done using a variety of techniques, such as one-hot encoding, word2vec, or GloVe.
3. Pad or truncate the sequences: The final step is to pad or truncate the sequences to a fixed length. This is necessary because ANNs typically require input data to be of a fixed length.

Here is an example of how to encode molecular strings using the above steps:

In [None]:
import re
import numpy as np

# Regular expression matching one SMILES token: bracket atoms such as [H],
# two-letter halogens, single-letter atoms, ring-closure digits, bonds and branches
SMILES_TOKEN = re.compile(r'\[[^\]]+\]|Br|Cl|%\d{2}|[A-Za-z]|\d|[=#\-+\\/().:~@*$]')

# Define a function to tokenize molecular strings
def tokenize_smiles(smiles):
  # Split the SMILES string into individual tokens
  return SMILES_TOKEN.findall(smiles)

# Define a function to embed molecular strings
def embed_smiles(smiles, vocab):
  # Tokenize the SMILES string
  tokens = tokenize_smiles(smiles)

  # Create a one-hot encoding of each token;
  # vocab maps each token to its column index
  embeddings = np.zeros((len(tokens), len(vocab)))
  for i, token in enumerate(tokens):
    embeddings[i, vocab[token]] = 1

  return embeddings

# Define a function to pad or truncate sequences
def pad_or_truncate_sequences(sequences, max_len):
  # Truncate the sequences to the maximum length along the token axis (axis 0)
  sequences = sequences[:max_len]

  # Pad the remaining rows with zeros
  sequences = np.pad(sequences, ((0, max_len - sequences.shape[0]), (0, 0)), 'constant')

  return sequences

# Example usage:

# Build the vocabulary from the SMILES strings printed by the data loader above
dataset_smiles = ['[H]C([H])([H])[H]', '[H]N([H])[H]', '[H]O[H]']
all_tokens = sorted({tok for s in dataset_smiles for tok in tokenize_smiles(s)})
vocab = {tok: i for i, tok in enumerate(all_tokens)}

# The one-hot embedding dimension equals the vocabulary size
embedding_dim = len(vocab)

# Encode a SMILES string
smiles = 'CCC'
embedding = embed_smiles(smiles, vocab)

# Pad or truncate the sequence to a fixed length
max_len = 1024
padded_embedding = pad_or_truncate_sequences(embedding, max_len)

# ChatGPT also suggests:
Embedding Layers:
- Instead of one-hot encoding, you can use embedding layers. These layers are often part of neural network architectures.
- Embeddings map characters to continuous vectors of real numbers. The vectors are learned during training.
- Each character in the string is represented by an embedding vector, and these vectors are combined to represent the entire molecular string.
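
To make the lookup concrete, here is a minimal NumPy sketch of what an embedding layer computes in its forward pass. The character vocabulary, `embedding_dim = 4`, and the random table are illustrative assumptions; in a real network the table is a trainable weight that is updated during training, not a fixed random matrix.

```python
import numpy as np

# Hypothetical character-level vocabulary; index 0 is reserved for padding
vocab = {'<pad>': 0, 'C': 1, 'N': 2, 'O': 3, 'H': 4, '[': 5, ']': 6, '(': 7, ')': 8}
embedding_dim = 4  # assumed size, chosen only for illustration

# The embedding table holds one row of real numbers per vocabulary entry.
# Here it is random; during training these rows would be learned.
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), embedding_dim))

def embed_characters(smiles, vocab, table):
    # Map each character to its integer index, then look up its row
    indices = np.array([vocab[ch] for ch in smiles])
    return table[indices]

vectors = embed_characters('[H]O[H]', vocab, embedding_table)
print(vectors.shape)  # (7, 4): one embedding vector per character
```

Identical characters map to identical rows, so the two `[` characters in the string share one vector.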

Fingerprint-Based Encoding:
- Molecular fingerprints are binary vectors that encode the presence or absence of substructures (e.g., functional groups, atom types) in a molecule.
- These fingerprints are used to represent molecular structures and can be fed directly into ANNs.
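
As a rough illustration of the idea, the sketch below builds a toy hashed fingerprint from SMILES substrings. This is a stand-in under stated assumptions, not a chemically meaningful fingerprint: real fingerprints are computed from the molecular graph by a cheminformatics toolkit, and the function name, `n_bits`, and `max_len` here are hypothetical.

```python
import hashlib

def toy_fingerprint(smiles, n_bits=64, max_len=3):
    # Set one bit for every substring of length 1..max_len.
    # A stable hash (SHA-256) keeps the result reproducible across runs.
    bits = [0] * n_bits
    for size in range(1, max_len + 1):
        for start in range(len(smiles) - size + 1):
            fragment = smiles[start:start + size]
            digest = hashlib.sha256(fragment.encode('utf-8')).digest()
            bits[int.from_bytes(digest[:4], 'big') % n_bits] = 1
    return bits

fp_water = toy_fingerprint('[H]O[H]')
fp_ammonia = toy_fingerprint('[H]N([H])[H]')
print(len(fp_water), len(fp_ammonia))  # both vectors have the same fixed length
```

Every molecule yields a vector of the same length regardless of the length of its SMILES string, which is why no padding step is needed before feeding fingerprints to a network.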

- each input should represent a conformation of a molecule
- atomic environment vector as columns
- Oliver's guess: >20 features for radial and >30 for spherical, ~50 features total.
- radial corresponds to which shell it's on
- spherical corresponds to orbital orientation
- rotational and translational invariance: if conformation2 is generated by rotation or translation of conformation1, both conformations should have the same energy and the same input representation.
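
One way to check the invariance requirement is to build the representation from interatomic distances only. The sketch below uses a Behler-Parrinello-style radial symmetry function with a cosine cutoff. The values of `eta`, `cutoff`, and the shell centers are assumptions for illustration, not the parameters of any published model; the geometry is a made-up water-like molecule, and the angular part and the per-element split of neighbors are omitted.

```python
import numpy as np

def cutoff_function(r, cutoff):
    # Smoothly decays to zero at the cutoff radius
    return np.where(r <= cutoff, 0.5 * np.cos(np.pi * r / cutoff) + 0.5, 0.0)

def radial_features(coords, shells, eta=4.0, cutoff=5.0):
    # coords: (n_atoms, 3); returns an array of shape (n_atoms, n_shells)
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)        # pairwise distances
    mask = ~np.eye(len(coords), dtype=bool)     # exclude each atom's self-distance
    # Gaussian centered on each shell radius, weighted by the cutoff function
    gauss = np.exp(-eta * (dist[:, :, None] - shells[None, None, :]) ** 2)
    weights = cutoff_function(dist, cutoff) * mask
    return (gauss * weights[:, :, None]).sum(axis=1)

# Hypothetical water-like geometry (angstroms) and shell centers
conformation1 = np.array([[0.0, 0.0, 0.0], [0.96, 0.0, 0.0], [-0.24, 0.93, 0.0]])
shells = np.linspace(0.5, 4.5, 8)

# Build conformation2 by rotating about the z axis and translating
theta = 0.7
rotation = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                     [np.sin(theta),  np.cos(theta), 0.0],
                     [0.0, 0.0, 1.0]])
conformation2 = conformation1 @ rotation.T + np.array([1.5, -2.0, 0.3])

features1 = radial_features(conformation1, shells)
features2 = radial_features(conformation2, shells)
print(np.allclose(features1, features2))  # True: distances are unchanged
```

Raw Cartesian coordinates would fail this check, which is the reason for using distance-based (and angle-based) features as network inputs.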

# Dr. Allgood suggests

- RDKit
- PCA to compact parameters
- genetic algorithm to find optimal hyperparameters for deep learning
    - you're gonna have to tune hyperparameters either way
- can also use an autoencoder (more powerful PCA!)
- for ANN, we can use the architecture in the paper
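
The PCA suggestion can be sketched with a plain SVD. The feature matrix below is random placeholder data standing in for per-conformation descriptors, and `n_components = 5` is an arbitrary choice; a library implementation would normally be used in place of the hypothetical `pca_reduce` helper.

```python
import numpy as np

def pca_reduce(features, n_components):
    # Center each feature column, then project onto the top principal axes
    centered = features - features.mean(axis=0)
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:n_components]
    # Fraction of the total variance captured by each kept component
    explained = singular_values[:n_components] ** 2 / np.sum(singular_values ** 2)
    return centered @ components.T, explained

rng = np.random.default_rng(0)
features = rng.normal(size=(200, 50))   # placeholder: 200 samples, 50 features
reduced, explained = pca_reduce(features, n_components=5)
print(reduced.shape)  # (200, 5)
```

The `explained` array gives a quick way to decide how many components to keep before passing the compacted features to the network.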
