### Feat2LLM


In this tutorial, you will learn how to generate a string-based representation for any numerical feature vector.

<img src="scheme_new.png" width="90%" height="40%" />

Steps 1 to 3 in the diagram are all executed in the next cell, with ethanol as the target molecule.
Before running it, here is what happens in each step:

1) First, we download the MD trajectory data and store it.

2) From the molecular geometries we generate representation vectors, here the MBDF representation [1]. To compress the representation further, we apply a dimensionality reduction, here down to 10 components.

3) Finally, the compressed (numerical!) representation vectors are saved to disk.


[1] Danish Khan, Stefan Heinen, O. Anatole von Lilienfeld; Kernel-based quantum machine learning at record rate: Many-body distribution functionals as compact representations. J. Chem. Phys. 21 July 2023; 159 (3): 034106. https://doi.org/10.1063/5.0152215
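The dimensionality reduction in step 2 can be illustrated with a small, self-contained sketch. The notebook does not state which method `gen_representation` uses internally, so the example below assumes plain PCA (via an SVD) purely for illustration. The function `pca_reduce` and the random toy data are hypothetical and not part of Feat2LLM.

```python
import numpy as np

def pca_reduce(X, n_components):
    """Project the rows of X onto the top principal components (plain PCA via SVD)."""
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, sorted by decreasing singular value
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:n_components].T

rng = np.random.default_rng(0)
X_full = rng.normal(size=(200, 40))   # toy stand-in for 200 representation vectors
X_small = pca_reduce(X_full, n_components=10)
print(X_small.shape)  # (200, 10)
```

Each row keeps its identity (one row per geometry), only the number of columns shrinks, which is what makes the later conversion to short strings practical.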

In [None]:
import os
import numpy as np

from Feat2LLM.load_data import SmallMolTraj

mol = "ethanol"
smallMol = SmallMolTraj(mol)
smallMol.get_data()
smallMol.gen_representation(n_components=10)
smallMol.save()

Feel free to inspect some of the attributes more closely, such as the molecular geometries
`R` and the total energies `E`.

In [None]:
smallMol.R, smallMol.R.shape, smallMol.E

The representation vectors `cMBDF` and their lower-dimensional version `cMBDF_trans` are stored in the `results` attribute. Note that `y` is the same as `E`.

In [None]:
smallMol.results

Next, we visualize the first two dimensions of the reduced representation vector, colored by energy.

In [None]:
X, y = smallMol.results["cMBDF_trans"], smallMol.results["y"]

In [None]:
import matplotlib.pyplot as plt

# Select the first two columns of X
X = X[:, :2]

# Scatter plot of the first two components, colored by the energy y
plt.figure(figsize=(6, 6))
plt.scatter(X[:, 0], X[:, 1], c=y, cmap="viridis", s=50, alpha=0.5)

plt.xlabel("$X_{1}$")
plt.ylabel("$X_{2}$")
plt.show()

Finally, we perform the last step before fitting the model: converting the numerical vectors to string representations. We also shift the energy scale so that the lowest-energy conformer sits at zero.
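To make the energy shift concrete, here is a toy sketch with made-up energy values (arbitrary units, not taken from the ethanol trajectory). It also shows how a prediction on the shifted scale could be mapped back to absolute energies by adding the offset again.

```python
import numpy as np

E = np.array([-154.90, -154.88, -154.93, -154.85])  # toy energies, arbitrary units
E_min = E.min()
E_shifted = E - E_min          # new array; the original E is left untouched

# Values on the shifted scale map back by adding the offset again
E_restored = E_shifted + E_min
print(E_shifted.min())  # 0.0
```

Keeping the offset around is what allows model outputs to be reported as absolute energies later on.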

In [None]:
from Feat2LLM.vec2str import ZipFeaturizer
from sklearn.model_selection import train_test_split

X = smallMol.results["cMBDF_trans"]
y = smallMol.results["y"]

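# Shift energies so the lowest-energy conformer is at zero
# (out of place, so smallMol.results["y"] is not modified)
y_min = np.min(y)
y = y - y_min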

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
converter = ZipFeaturizer(n_bins=300)

X_train = converter.bin_vectors(X_train)
X_test = converter.bin_vectors(X_test)

In [None]:
X_test
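The internals of `ZipFeaturizer.bin_vectors` are not shown in this notebook. As a rough illustration of how numerical vectors can become strings, the sketch below assumes uniform binning per feature and joins the integer bin indices with spaces. The function `bin_to_strings` and its `lo`/`hi` arguments are hypothetical; they are included so that bin edges computed on the training set can be reused for the test set, which may or may not match what Feat2LLM does.

```python
import numpy as np

def bin_to_strings(X, n_bins, lo=None, hi=None):
    """Map each row of X to a space-separated string of integer bin indices."""
    lo = X.min(axis=0) if lo is None else lo
    hi = X.max(axis=0) if hi is None else hi
    # Scale each column to [0, 1], guarding against constant columns
    span = np.where(hi > lo, hi - lo, 1.0)
    scaled = np.clip((X - lo) / span, 0.0, 1.0)
    # Convert to integer bins 0 .. n_bins - 1
    idx = np.minimum((scaled * n_bins).astype(int), n_bins - 1)
    return [" ".join(str(i) for i in row) for row in idx]

X_toy = np.array([[0.0, 10.0], [0.5, 20.0], [1.0, 30.0]])
print(bin_to_strings(X_toy, n_bins=4))  # ['0 0', '2 2', '3 3']
```

A larger `n_bins` preserves more numerical resolution at the cost of a larger vocabulary of distinct tokens.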

In [None]:
from Feat2LLM.roberta_finetuning import write_data_to_json, load_JSON_data, MoleculeDataset

# change the filename depending on the dataset
write_data_to_json(X_train, y_train, 'train.json')
write_data_to_json(X_test, y_test, 'test.json')

data = load_JSON_data("train.json")
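The file layout produced by `write_data_to_json` is not documented here. The following sketch shows one plausible layout, a list of records pairing each string with its target value, using only the standard library. The helper names `write_records` and `read_records` and the `"input"`/`"output"` keys are assumptions made for illustration, not the Feat2LLM format.

```python
import json
import os
import tempfile

def write_records(strings, targets, path):
    """Write one {"input": ..., "output": ...} record per sample (hypothetical layout)."""
    # float() ensures NumPy scalars such as float32 are JSON-serializable
    records = [{"input": s, "output": float(t)} for s, t in zip(strings, targets)]
    with open(path, "w") as fh:
        json.dump(records, fh, indent=2)

def read_records(path):
    """Load the list of records back from disk."""
    with open(path) as fh:
        return json.load(fh)

path = os.path.join(tempfile.mkdtemp(), "toy_train.json")
write_records(["0 0", "2 2"], [0.0, 1.5], path)
records = read_records(path)
print(records[1])  # {'input': '2 2', 'output': 1.5}
```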

In [None]:
import torch
from torch import nn
from torch.utils.data import DataLoader
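# transformers.AdamW is deprecated; use the PyTorch optimizer instead
# (note: torch.optim.AdamW defaults to weight_decay=0.01)
from torch.optim import AdamW
from transformers import RobertaTokenizer, RobertaModel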

# Hold out 20% of train.json for evaluation in this cell
# (test.json, written above, stays untouched; skip this split to use it instead)
train_data, test_data = train_test_split(data, test_size=0.2, random_state=42)
tokenizer       = RobertaTokenizer.from_pretrained('roberta-base')
train_dataset   = MoleculeDataset(train_data, tokenizer)
test_dataset    = MoleculeDataset(test_data, tokenizer)

# Define the custom model with a regression head
class RobertaForRegression(nn.Module):
    def __init__(self):
        super().__init__()
        self.roberta = RobertaModel.from_pretrained('roberta-base')
        self.regression_head = nn.Linear(self.roberta.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        outputs = self.roberta(input_ids=input_ids, attention_mask=attention_mask)
        sequence_output = outputs.last_hidden_state[:, 0, :]
        logits = self.regression_head(sequence_output)
        return logits

# Set device: Apple/NVIDIA/CPU
if torch.backends.mps.is_available():
    device = torch.device("mps")
elif torch.cuda.is_available():
    device = torch.device("cuda")
else:
    device = torch.device("cpu")
model = RobertaForRegression().to(device)
optimizer = AdamW(model.parameters(), lr=1e-6)

# DataLoader setup
train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)
test_loader  = DataLoader(test_dataset, batch_size=64, shuffle=False)

# Training loop
model.train()
for epoch in range(2):  # Number of epochs
    for batch in train_loader:
        optimizer.zero_grad()
        inputs, labels = batch['input_ids'].to(device), batch['labels'].to(device)
        mask = batch['attention_mask'].to(device)
        outputs = model(inputs, mask).squeeze(-1)
        loss = nn.MSELoss()(outputs, labels)
        loss.backward()
        optimizer.step()
        print(f"Epoch {epoch}, Loss: {loss.item()}")

# Evaluate the model
model.eval()
total_loss = 0
with torch.no_grad():
    for batch in test_loader:
        inputs, labels = batch['input_ids'].to(device), batch['labels'].to(device)
        mask = batch['attention_mask'].to(device)
        outputs = model(inputs, mask).squeeze(-1)
        loss = nn.MSELoss()(outputs, labels)
        total_loss += loss.item()
    print(f"Test Loss: {total_loss / len(test_loader)}")

# Save model and optimizer state
def save_model(model, optimizer, epoch, loss, filepath):
    torch.save({
        'model_state_dict': model.state_dict(),
        'optimizer_state_dict': optimizer.state_dict(),
        'epoch': epoch,
        'loss': loss
    }, filepath)

# Save the checkpoint into the save_models directory after training.
# Note: `loss` here is the loss of the last evaluated batch.
os.makedirs('save_models', exist_ok=True)
save_model(model, optimizer, epoch, loss.item(), os.path.join('save_models', 'regression.pth'))
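A saved checkpoint is only useful if it can be restored. The sketch below shows the round trip with a tiny stand-in model (`nn.Linear`), so that it runs without downloading RoBERTa; in the notebook you would rebuild `RobertaForRegression` instead. The dictionary keys mirror those used in `save_model`, while the file name and the toy values for `epoch` and `loss` are made up.

```python
import os
import tempfile

import torch
from torch import nn

# Tiny stand-in model; the real notebook would rebuild RobertaForRegression instead
model = nn.Linear(4, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

path = os.path.join(tempfile.mkdtemp(), "toy_checkpoint.pth")
torch.save({
    "model_state_dict": model.state_dict(),
    "optimizer_state_dict": optimizer.state_dict(),
    "epoch": 1,
    "loss": 0.25,
}, path)

# Restore into freshly constructed objects
restored = nn.Linear(4, 1)
restored_opt = torch.optim.AdamW(restored.parameters(), lr=1e-3)
checkpoint = torch.load(path, map_location="cpu")
restored.load_state_dict(checkpoint["model_state_dict"])
restored_opt.load_state_dict(checkpoint["optimizer_state_dict"])
restored.eval()  # switch to inference mode before predicting
```

Loading with `map_location="cpu"` keeps the example independent of the device the checkpoint was written on.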