<h1><center>ASDS 5303 Final Project Assignment #2 Dataset 1: Drug SMILES Strings and Classifications Data Preparation </center></h1>

## Group Members:
### Henry Berrios #1001392315
### LeMaur Kydd #1001767382

# **A. Introduction & Dataset Overview**

## <ins>Dataset Description:</ins>
The SMILES Strings and Drug Classification dataset, sourced from a [paper](https://doi.org/10.1021/acs.jcim.9b00236), compiles data from a few different sources, namely [PubChem](https://pubchem.ncbi.nlm.nih.gov/) and [ZINC](https://zinc.docking.org/). It contains SMILES strings, which serve as the basis of our training data, and drug classifications, which are our target variable. There are 6 other chemical features that could be used to improve classification performance if necessary.

The features in dataset 1 are as follows:
- IsomericSMILES Strings
- De-salted SMILES Strings
- Drug Classification (Target)
- XLogP
- Molecular Weight
- CID (PubChem Molecular ID#)
- HBondAcceptorCount
- HBondDonorCount

## <ins>Defining the ML Problem</ins>
- Supervised Learning Task: Classification
- Goal: Predict the Drug Classification of a given SMILES String using only the SMILES String data.
- Potential Use: A molecule's medicinal value can be estimated from its SMILES String, without costly lab research.
- Target variable: Drug Classification (categorical variable)

# **B. Data Loading & Cleaning**

For this section, we will be pulling code from the 1st assignment, as well as improving some sections that needed changes after review.

### Importing Libraries

In [31]:
# Import Libraries (same as 1st assignment)
import pandas as pd
import numpy as np
from scipy.io import wavfile

import torch
import torch.nn as nn
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
import torch.nn.functional as F

from google.colab import drive

### Load Dataset 1 (from 1st assignment)

In [20]:
drugs = pd.read_excel('/content/drug_smiles_categories.xlsx') # Read the dataset containing the Drug SMILES Strings and their Drug Categories into a pandas dataframe.

## Preprocessing Dataset 1

Note: This dataset requires very minimal preprocessing as the data was sourced from reputable sources such as PubChem. The only necessary preprocessing step is encoding.

In [21]:
# We are going to subset the original dataset to only capture the SMILES Strings and their labels
drugs_subset = drugs[['IsomericSMILES', 'drug_class']]
drugs_subset.head()

| | IsomericSMILES | drug_class |
|---|---|---|
| 0 | `CN(C)CCCCCCN(C)C.C(CBr)CBr` | hematologic |
| 1 | `C1CN=C(N1)NC2=C(C3=NC=CN=C3C=C2)Br.[C@@H](C(C(...` | cardio |
| 2 | `C1CSC2=NC(CN21)C3=CC=C(C=C3)Br` | antiinfective |
| 3 | `C1C2CC3CC1CC(C2)C3NC4=CC=C(C=C4)Br` | cns |
| 4 | `CC(CCC(C#C)N)N` | antineoplastic |


### Condensing Target Variable

Initially we had 6,935 data points spread unevenly across 12 drug classifications, which would create a class imbalance. To remedy this issue, in this step I will condense the 8 smallest categories into a single category called "other", leaving 5 classes in total.

In [22]:
# various parts of this code block were autofilled in with AI tools

value_counts = drugs_subset['drug_class'].value_counts() # Collect the counts of the unique values of the target variable
smallest_categories = value_counts.nsmallest(8).index # Select the 8 smallest categories
# .assign returns a new DataFrame, which avoids the SettingWithCopyWarning raised when writing into a subset of `drugs`
drugs_subset = drugs_subset.assign(drug_class=drugs_subset['drug_class'].replace(smallest_categories, 'other')) # Make the replacement
drugs_subset['drug_class'].value_counts() # Check the value counts again

| drug_class | count |
|---|---|
| antiinfective | 2396 |
| other | 1437 |
| antineoplastic | 1174 |
| cns | 1141 |
| cardio | 787 |
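
The condensing step above can also be illustrated without pandas. The following is a minimal sketch using only the standard library; the helper name `condense_rare_classes` and the toy labels are hypothetical and are not taken from the real dataset.

```python
from collections import Counter

def condense_rare_classes(labels, n_smallest, other_label="other"):
    """Replace the n_smallest least frequent labels with other_label."""
    counts = Counter(labels)
    # Sort classes by ascending frequency; ties are broken alphabetically
    # (an arbitrary but deterministic choice made for this sketch).
    by_frequency = sorted(counts.items(), key=lambda kv: (kv[1], kv[0]))
    rare = {cls for cls, _ in by_frequency[:n_smallest]}
    return [other_label if lab in rare else lab for lab in labels]

# Toy labels (hypothetical, for illustration only)
toy_labels = (
    ["antiinfective"] * 5
    + ["cns"] * 4
    + ["cardio"] * 3
    + ["dermatologic"] * 2
    + ["urological"] * 1
)

condensed = condense_rare_classes(toy_labels, n_smallest=2)
condensed_counts = Counter(condensed)
print(condensed_counts)
```

In this toy example the two rarest classes are merged, so the "other" bucket holds their combined count while the larger classes are left unchanged.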


# **C. Convert Dataset into Tensor Format**

In [23]:
# various parts of this code block were autofilled in with AI tools

train_df, test_df = train_test_split(drugs_subset, test_size=0.2, random_state=42, stratify=drugs_subset['drug_class']) # Split the data into training and testing sets

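# Tokenize SMILES Strings
# Note: Tokenizer defaults to lower=True, so aromatic atoms (e.g. 'c') and aliphatic atoms (e.g. 'C') share one token here; pass lower=False to keep them distinct.
tokenizer = Tokenizer(char_level=True, filters="") # Initialize the tokenizer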
tokenizer.fit_on_texts(train_df['IsomericSMILES']) # Fit the tokenizer on the training data

# SMILES Strings to sequences
x_train_sequences = tokenizer.texts_to_sequences(train_df['IsomericSMILES']) # Convert the training SMILES Strings to sequences
x_test_sequences = tokenizer.texts_to_sequences(test_df['IsomericSMILES']) # Convert the testing SMILES Strings to sequences

# Pad the sequences to the length of the longest training SMILES String (longer test sequences are truncated to this length)
max_length = max(map(len, x_train_sequences)) # Get max length
x_train_padded = pad_sequences(x_train_sequences, maxlen=max_length, padding='post') # Pad the training sequences
x_test_padded = pad_sequences(x_test_sequences, maxlen=max_length, padding='post')

# Convert to PyTorch tensors
x_train_tensor = torch.tensor(x_train_padded, dtype=torch.long) # Convert the training sequences to a PyTorch tensor
x_test_tensor = torch.tensor(x_test_padded, dtype=torch.long)
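
The `stratify=` argument in the split above keeps the class proportions roughly equal between the training and testing sets. The following is a simplified sketch of that idea in plain Python; it is not scikit-learn's actual algorithm, and the helper name `stratified_split` and the toy labels are hypothetical.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx) with roughly the same class mix in both."""
    rng = random.Random(seed)
    # Group row indices by class label
    by_class = defaultdict(list)
    for idx, lab in enumerate(labels):
        by_class[lab].append(idx)

    train_idx, test_idx = [], []
    for lab in sorted(by_class):
        idxs = by_class[lab]
        rng.shuffle(idxs)
        # Take the same fraction of every class for the test set
        n_test = round(len(idxs) * test_size)
        test_idx.extend(idxs[:n_test])
        train_idx.extend(idxs[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Toy labels (hypothetical): 10 rows of class "a", 5 rows of class "b"
toy_labels = ["a"] * 10 + ["b"] * 5
train_idx, test_idx = stratified_split(toy_labels)
print(len(train_idx), len(test_idx))
```

Splitting before fitting the tokenizer, as the cell above does, also means the vocabulary is built from training data only.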

In [24]:
x_train_tensor[0] # before embedding

tensor([ 1,  7,  1,  6,  9,  8, 11,  1,  1,  7,  1,  6,  9,  8,  2, 10, 11,  1,
         2,  4,  5,  3,  7,  1,  6,  9,  8,  2,  1,  2,  1,  3,  1,  3, 10,  1,
         2,  4,  5,  3,  5,  1,  3,  1, 12,  4, 10,  1, 13,  4,  1,  2, 10, 12,
         3,  1,  4,  1,  1, 14,  4,  1,  1, 17,  4,  1,  2,  1,  4,  1, 14, 13,
         3,  5,  1,  1, 20,  4,  1, 17,  1,  4,  1,  1,  2,  4,  1, 20,  3,  1,
        24,  4,  1, 10,  4,  1,  2, 10, 24,  3,  7,  1,  6,  6,  9,  8, 27,  1,
         7,  1,  6,  6,  9,  8,  2,  1, 10, 27,  1,  2,  4,  5,  3,  7,  1,  6,
         6,  9,  8,  2,  1, 29,  4,  1,  1,  4,  1,  1,  4,  1, 29,  3, 10,  1,
         2,  4,  5,  3,  5,  1,  3,  1,  5,  1,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0, 

In [27]:
# various parts of this code block were autofilled in with AI tools

# Now we will embed the SMILES tensors
embedding_dim = 128 #embedding dimension
embedding_layer = torch.nn.Embedding(num_embeddings=len(tokenizer.word_index) + 1, embedding_dim=embedding_dim) # Initialize the embedding layer
x_train_tensor = embedding_layer(x_train_tensor) # Embed the training sequences
x_test_tensor = embedding_layer(x_test_tensor) # Embed the testing sequences

In [29]:
x_train_tensor[0] # after embedding

tensor([[ 1.3483, -0.1175,  0.0347,  ..., -1.1763, -0.7915, -0.1359],
        [-0.4946,  0.9437,  1.7089,  ...,  0.5129, -0.5214,  1.7644],
        [ 1.3483, -0.1175,  0.0347,  ..., -1.1763, -0.7915, -0.1359],
        ...,
        [ 0.0292,  0.6025, -0.3574,  ...,  1.0423,  0.3369, -2.7219],
        [ 0.0292,  0.6025, -0.3574,  ...,  1.0423,  0.3369, -2.7219],
        [ 0.0292,  0.6025, -0.3574,  ...,  1.0423,  0.3369, -2.7219]],
       grad_fn=<SelectBackward0>)

In [32]:
# various parts of this code block were autofilled in with AI tools

# Apply one hot encoding to the drug class labels
label_encoder = LabelEncoder() # Initialize the label encoder
y_train_encoded = label_encoder.fit_transform(train_df['drug_class']) # Fit and transform the training labels
y_test_encoded = label_encoder.transform(test_df['drug_class']) # Transform

num_classes = len(label_encoder.classes_) # Get the number of classes
y_train_tensor = F.one_hot(torch.tensor(y_train_encoded), num_classes=num_classes).float() # Convert the training labels to one-hot encoding
y_test_tensor = F.one_hot(torch.tensor(y_test_encoded), num_classes=num_classes).float() # Convert the testing labels to one-hot encoding
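
The label encoding and one-hot steps above can be sketched in plain Python as follows. This is an illustration under the assumption that classes are indexed in sorted order, which is how `LabelEncoder` orders `classes_`; the helper names and toy labels are hypothetical.

```python
def fit_label_encoder(labels):
    """Map each distinct label to an integer index, in sorted order."""
    classes = sorted(set(labels))
    class_to_index = {cls: i for i, cls in enumerate(classes)}
    return classes, class_to_index

def one_hot(index, num_classes):
    """Return a list of floats with a single 1.0 at position `index`."""
    row = [0.0] * num_classes
    row[index] = 1.0
    return row

# Toy training labels (hypothetical subset of the condensed classes)
train_labels = ["cns", "other", "cardio", "cns"]

classes, class_to_index = fit_label_encoder(train_labels)
encoded = [class_to_index[lab] for lab in train_labels]
one_hot_rows = [one_hot(i, len(classes)) for i in encoded]

print(classes)
print(encoded)
print(one_hot_rows)
```

Whether the one-hot form is needed depends on the loss function used later; for example, PyTorch's `nn.CrossEntropyLoss` also accepts integer class indices as targets.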

#### Tensorizing Commentary

Tensorizing SMILES text data takes a few steps. I knew of this process from prior work with text data in general and with this dataset in particular. First, each SMILES string is tokenized, which means splitting it either into individual characters or into atom-based tokens that keep multi-character atom symbols together. For this dataset we went with character-based tokenizing for the first round of modeling. The second step, sequencing, maps each token to an integer index; with TensorFlow's `tf.keras.preprocessing.text.Tokenizer`, fitting the vocabulary (`fit_on_texts`) and sequencing (`texts_to_sequences`) are done back to back.
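
Character-level tokenizing and sequencing can be sketched in plain Python as follows. This is a simplified illustration, not the Keras implementation: the helper names `fit_char_vocab` and `to_sequence` are hypothetical, frequency ties are broken alphabetically here, and case is preserved (the Keras `Tokenizer` lowercases text unless `lower=False` is passed).

```python
from collections import Counter

def fit_char_vocab(smiles_list):
    """Assign each character an index starting at 1, most frequent first."""
    counts = Counter(ch for smiles in smiles_list for ch in smiles)
    # Index 0 is left unused so it can serve as the padding value later
    ordered = sorted(counts, key=lambda ch: (-counts[ch], ch))
    return {ch: i for i, ch in enumerate(ordered, start=1)}

def to_sequence(smiles, vocab):
    """Convert a SMILES string to a list of integer indices."""
    # Characters not seen while fitting are skipped, which mirrors the
    # Keras Tokenizer when no oov_token is configured
    return [vocab[ch] for ch in smiles if ch in vocab]

# Toy training strings (hypothetical, not from the dataset)
train_smiles = ["CCO", "CCN", "C=O"]

vocab = fit_char_vocab(train_smiles)
seq = to_sequence("CCO", vocab)
unseen = to_sequence("CCS", vocab)  # "S" never appeared in training

print(vocab)
print(seq)
print(unseen)
```

A character-level vocabulary also splits two-letter atoms such as `Br` and `Cl` into separate tokens, which is the main trade-off against atom-based tokenizing.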

The remaining steps are padding and embedding. For padding we used `tensorflow.keras.preprocessing.sequence.pad_sequences` to standardize the length of each tokenized SMILES string, so that our input layer consistently receives data with the same dimensions. After some research on what to do next, we learned from the `torch.nn.Embedding` docs about passing the sequences through an embedding layer. This layer maps each integer token to a dense 128-dimensional vector. Note that the layer used here is randomly initialized, so its vectors only come to encode structural information once the embedding is trained together with a model such as an LSTM or Transformer. Lastly, the embedded sequences and the one-hot labels are stored as PyTorch tensors and saved under appropriate file names.
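
Padding and embedding lookup can be sketched in plain Python as follows. This is an illustration only: the helper names are hypothetical, the sketch truncates over-long sequences at the end (Keras `pad_sequences` truncates from the start by default), and the random table only mimics the fact that `torch.nn.Embedding` starts from normally distributed weights.

```python
import random

def pad_post(sequences, max_length, pad_value=0):
    """Right-pad (or cut) every sequence to exactly max_length items."""
    return [
        seq[:max_length] + [pad_value] * (max_length - len(seq))
        for seq in sequences
    ]

def make_embedding_table(vocab_size, dim, seed=0):
    """Build a vocab_size x dim table of random values (one row per token)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(vocab_size)]

def embed(sequence, table):
    """Look up the vector for each token index in the sequence."""
    return [table[token] for token in sequence]

# Toy integer sequences (hypothetical); index 0 is reserved for padding
sequences = [[1, 1, 2], [3]]

padded = pad_post(sequences, max_length=4)
table = make_embedding_table(vocab_size=4, dim=3)
embedded = embed(padded[0], table)

print(padded)
print(len(embedded), len(embedded[0]))
```

Because the lookup is a plain table index, repeated tokens (including the padding token) always map to identical rows, which matches the repeated rows visible in the embedded tensor output earlier.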

# **D. Save Processed Data**

In [34]:
# mounting my drive
drive.mount('/content/drive')

torch.save(x_train_tensor, '/content/drive/MyDrive/X_train_tensor_d1.pt')
torch.save(x_test_tensor, '/content/drive/MyDrive/X_test_tensor_d1.pt')
torch.save(y_train_tensor, '/content/drive/MyDrive/y_train_tensor_d1.pt')
torch.save(y_test_tensor, '/content/drive/MyDrive/y_test_tensor_d1.pt')

Mounted at /content/drive


# **References**
- TensorFlow Developers. (2023). Tokenizer API: tf.keras.preprocessing.text.Tokenizer. Retrieved from https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text/Tokenizer.
- RDKit Developers. (2023). RDKit: Open-source cheminformatics. Retrieved from https://www.rdkit.org/docs/.
- Krenn, M., Häse, F., Nigam, A., Friederich, P., & Aspuru-Guzik, A. (2020). SELFIES: A robust representation of semantically constrained graphs with an example application in chemistry. Machine Learning: Science and Technology, 1(4). Retrieved from https://iopscience.iop.org/article/10.1088/2632-2153/aba947/meta.
- NVIDIA. (2020). CUDA Programming Guide. Retrieved from https://developer.nvidia.com/cuda-toolkit.
- OpenAI. (2025). Response generated by ChatGPT [Large language model]. OpenAI. Retrieved from https://chat.openai.com
- Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., et al. (2019). PyTorch: An imperative style, high-performance deep learning library. Retrieved from https://proceedings.neurips.cc/paper_files/paper/2019/file/bdbca288fee7f92f2bfa9f7012727740-Paper.pdf.
- PyTorch Community. (2024). torch.Tensor.view. Retrieved from https://pytorch.org/docs/stable/generated/torch.Tensor.view.html/.
- PyTorch Developers. (2023). Data types in PyTorch. Retrieved from https://pytorch.org/docs/stable/tensors.html#torch-tensor.
- Raschka, S., Liu, Y., & Mirjalili, V. (2022). Machine Learning with PyTorch and Scikit-Learn. Packt Publishing.
- Scikit-learn Developers. (2023). Preprocessing data: StandardScaler. Retrieved from https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html.