<a href="https://colab.research.google.com/github/grissharrisdennis/Machine-learning-Projects/blob/main/Similarity_Prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Similarity Prediction

Molecular similarity assessment using machine learning.
The goal is to predict how humans evaluate the similarity of two molecules.

Molecular similarity is a broad topic with implications in many areas of chemistry. It is rooted in the paradigm that ‘similar molecules have similar properties’. For this reason, methods for determining molecular similarity are widely used in pharmaceutical companies, e.g., in the context of structure-activity relationships. Similarity evaluation is also used in chemical legislation, specifically in the procedure that decides whether a new molecule can obtain orphan drug status, with the consequent financial benefits. For this procedure, the European Medicines Agency relies on experts’ judgments. Because the perception of similarity depends on the observer, models that reproduce human perception are useful.

The dataset was created by Enrico Gandini during his PhD at Università degli Studi di Milano.

[Link to the dataset](https://archive.ics.uci.edu/dataset/750/similarity+prediction-1)

# Acknowledgements
Gandini, Enrico, Gilles Marcou, Fanny Bonachera, Alexandre Varnek, Stefano Pieraccini, and Maurizio Sironi. 2022. "Molecular Similarity Perception Based on Machine-Learning Models." International Journal of Molecular Sciences 23, no. 11: 6114. https://doi.org/10.3390/ijms23116114



In [None]:
!pip install rdkit

Collecting rdkit
  Downloading rdkit-2023.9.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (34.4 MB)
Installing collected packages: rdkit
Successfully installed rdkit-2023.9.4


In [None]:
import pandas as pd
import numpy as np
import os

In [None]:
from google.colab import drive
drive.mount('/content/drive/')

Mounted at /content/drive/


In [None]:
# The image directory and its file list are needed later for the SVG-to-PNG conversion
train_data_images_path='/content/drive/My Drive/dataset_Similarity_Prediction/original_training_set/images_2D'
train_data_images=os.listdir(train_data_images_path)
#train_data_conformers=os.listdir('/content/drive/My Drive/dataset_Similarity_Prediction/original_training_set/conformers_3D')
train_data=pd.read_csv('/content/drive/My Drive/dataset_Similarity_Prediction/original_training_set/original_training_set.csv')

In [None]:
test_data=pd.read_csv('/content/drive/My Drive/dataset_Similarity_Prediction/new_dataset/new_dataset.csv')

In [None]:
ids = train_data['id_pair']
smiles_a = train_data['curated_smiles_molecule_a']
smiles_b = train_data['curated_smiles_molecule_b']
tanimoto_coefficients = train_data['tanimoto_cdk_Extended']
tanimoto_combo = train_data['TanimotoCombo']
frac_similar = train_data['frac_similar']


In [None]:
tids = test_data['id_pair']
tsmiles_a = test_data['curated_smiles_molecule_a']
tsmiles_b = test_data['curated_smiles_molecule_b']
ttanimoto_coefficients = test_data['tanimoto_cdk_Extended']
ttanimoto_combo = test_data['TanimotoCombo']
tfrac_similar = test_data['frac_similar']

In [None]:
from rdkit import Chem
from rdkit.Chem import AllChem

def convert_smiles_to_fingerprints(smiles):
    # Convert SMILES to an RDKit Mol object (MolFromSmiles returns None for invalid input)
    molecule = Chem.MolFromSmiles(smiles)
    if molecule is None:
        raise ValueError(f"Could not parse SMILES: {smiles}")

    # Generate a Morgan fingerprint with radius 2 as a fixed-length bit vector
    fingerprint = AllChem.GetMorganFingerprintAsBitVect(molecule, 2)

    # Convert the RDKit bit vector to a list of integer bits (0 or 1)
    fingerprint_bitvector = [int(bit) for bit in fingerprint.ToBitString()]

    return fingerprint_bitvector
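
Judging by its name, the `tanimoto_cdk_Extended` column holds a Tanimoto coefficient computed on CDK fingerprints. As a minimal sketch of what such a coefficient measures, the function below computes it for two equal-length bit lists like those returned by `convert_smiles_to_fingerprints`. The helper name `tanimoto`, the toy bit vectors, and the choice to return 0.0 when both vectors are empty are illustrative assumptions, not part of the original notebook. Values computed this way on Morgan fingerprints will not match the CDK-based column, because the fingerprints differ.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient of two equal-length 0/1 sequences (hypothetical helper)."""
    if len(bits_a) != len(bits_b):
        raise ValueError("fingerprints must have the same length")
    # Accept integer bits as well as '0'/'1' characters
    a = [int(x) for x in bits_a]
    b = [int(x) for x in bits_b]
    # Bits set in both fingerprints, and bits set in at least one of them
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    # Assumed convention: two all-zero fingerprints get similarity 0.0
    return both / either if either else 0.0

# Toy example: one shared bit out of three bits set overall
example_similarity = tanimoto([1, 1, 0, 0], [1, 0, 1, 0])
```

For a real pair from the dataset, the call would be `tanimoto(convert_smiles_to_fingerprints(a), convert_smiles_to_fingerprints(b))`.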


In [None]:
smiles_list = []
for i, j in zip(smiles_a, smiles_b):
    fingerprint_i = convert_smiles_to_fingerprints(i)
    fingerprint_j = convert_smiles_to_fingerprints(j)

    smiles_list.append([fingerprint_i, fingerprint_j])
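
Each entry of `smiles_list` is a pair of fingerprints, so the full list has the shape `(n_pairs, 2, n_bits)`. A minimal sketch of how such a structure flattens into one feature vector per pair, using small random bit arrays as stand-ins for real fingerprints (the sizes and the names `toy_pairs` and `toy_features` are assumptions for illustration):

```python
import numpy as np

n_pairs, n_bits = 3, 8  # toy sizes, far smaller than real fingerprints
rng = np.random.default_rng(0)

# Stand-in for smiles_list: one (fingerprint_a, fingerprint_b) pair per row
toy_pairs = rng.integers(0, 2, size=(n_pairs, 2, n_bits))

# Concatenate the two fingerprints of each pair into a single feature vector
toy_features = toy_pairs.reshape(n_pairs, -1)
```

Because the two fingerprints are concatenated, the features depend on the order of the molecules: swapping molecule a and molecule b yields a different input vector for the same pair.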

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from keras.models import Sequential
from keras.layers import Dense

# Flatten each pair of fingerprints into a single numeric feature vector
flattened_smiles_list = np.array(smiles_list, dtype=np.float32).reshape(len(smiles_list), -1)

# Split the data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(flattened_smiles_list, frac_similar, test_size=0.2, random_state=42)

# Standardize the features; the scaler is fitted on the training split only
numerical_scaler = StandardScaler()
X_train_numerical = numerical_scaler.fit_transform(X_train)
X_val_numerical = numerical_scaler.transform(X_val)

# frac_similar is a continuous target, so it is used directly for regression
# (label-encoding it would replace the fractions with class indices)
y_train = np.asarray(y_train, dtype=np.float32)
y_val = np.asarray(y_val, dtype=np.float32)

# Input size is the length of one flattened fingerprint pair
input_shape = flattened_smiles_list.shape[1]

model = Sequential()
model.add(Dense(128, activation='relu', input_shape=(input_shape,)))
model.add(Dense(64, activation='relu'))
model.add(Dense(1))

# Compile the model
model.compile(optimizer='adam', loss='mean_squared_error', metrics=['mae'])

# Train the model
model.fit(X_train_numerical, y_train, epochs=10, batch_size=32, validation_data=(X_val_numerical, y_val))


Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.src.callbacks.History at 0x7aa558280790>

In [24]:
tsmiles_list = []
for i, j in zip(tsmiles_a, tsmiles_b):
    fingerprint_i = convert_smiles_to_fingerprints(i)
    fingerprint_j = convert_smiles_to_fingerprints(j)
    tsmiles_list.append([fingerprint_i, fingerprint_j])

In [28]:
# Flatten the list of fingerprints for test data
tflattened_smiles_list = np.array(tsmiles_list, dtype=np.float32).reshape(len(tsmiles_list), -1)

# Scale the test features with the scaler fitted on the training data
X_test_numerical = numerical_scaler.transform(tflattened_smiles_list)

In [29]:
predictions = model.predict(X_test_numerical)
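
The predictions can be compared with the human judgments stored in `tfrac_similar`. A minimal sketch of a mean absolute error check, assuming that `model.predict` returns an array of shape `(n_samples, 1)` and that the targets are on the same scale as the labels the model was trained on. The helper name and the toy numbers are illustrative, not results from the notebook.

```python
import numpy as np

def mean_absolute_error(y_true, y_pred):
    """Mean absolute difference between targets and predictions (hypothetical helper)."""
    # Flatten both inputs so an (n, 1) prediction array lines up with (n,) targets
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    if y_true.shape != y_pred.shape:
        raise ValueError("targets and predictions must have the same number of values")
    return float(np.mean(np.abs(y_true - y_pred)))

# Toy values standing in for tfrac_similar and model.predict(...) output
toy_targets = np.array([0.0, 0.5, 1.0])
toy_predictions = np.array([[0.25], [0.5], [0.5]])
toy_mae = mean_absolute_error(toy_targets, toy_predictions)
```

In the notebook, the corresponding call would be `mean_absolute_error(tfrac_similar, predictions)`.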



In [None]:
!pip install cairosvg

Collecting cairosvg
  Downloading CairoSVG-2.7.1-py3-none-any.whl (43 kB)
Collecting cairocffi (from cairosvg)
  Downloading cairocffi-1.6.1-py3-none-any.whl (75 kB)
Collecting cssselect2 (from cairosvg)
  Downloading cssselect2-0.7.0-py3-none-any.whl (15 kB)
Installing collected packages: cssselect2, cairocffi, cairosvg
Successfully installed cairocffi-1.6.1 cairosvg-2.7.1 cssselect2-0.7.0


In [None]:
import cairosvg
from PIL import Image
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Conv2D, Flatten, Dense, Concatenate


In [None]:
def extract_images_from_svg(svg_file):
    # Write the PNG next to the SVG, swapping only the file extension
    png_file = os.path.splitext(svg_file)[0] + '.png'

    # Convert SVG to PNG using CairoSVG
    cairosvg.svg2png(url=svg_file, write_to=png_file)
    return png_file

In [None]:
svg_files = [os.path.join(train_data_images_path, file) for file in train_data_images if file.endswith('.svg')]
a_images = []
b_images = []
for svg_file in svg_files:
    png_img_path = extract_images_from_svg(svg_file)
    #png_img_path=preprocess_image(png_img_p)
    # Character 104 of the full path holds the molecule label ('a' or 'b');
    # this index depends on the exact directory path and file naming used here
    if png_img_path[104:105] == 'a':
        a_images.append(png_img_path)
    elif png_img_path[104:105] == 'b':
        b_images.append(png_img_path)

In [None]:
images_data = [[] for _ in range(101)]  # One list of image paths per number (0-100)
for img_path in a_images + b_images:
    # The number is read from the first three characters of the last '_'-separated part of the path
    molecule_num = int(img_path.split('_')[-1][:3])
    images_data[molecule_num].append(img_path)
images_data.pop(0)  # Drop index 0, leaving the entries for numbers 1-100
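
The grouping step above can also be sketched with a dictionary keyed by the number, which avoids fixing the range to 0-100 in advance. The file names below are hypothetical, since the real naming scheme is not shown in the notebook; the only assumption carried over from the cell above is that the last `_`-separated part of the path starts with a three-digit number. The helper name `group_by_pair_number` is also an illustration.

```python
from collections import defaultdict

def group_by_pair_number(paths):
    """Group image paths by the three-digit number in their last '_'-separated part."""
    groups = defaultdict(list)
    for path in paths:
        number = int(path.split('_')[-1][:3])
        groups[number].append(path)
    # Sort the numbers and the paths within each group for a stable order
    return {number: sorted(groups[number]) for number in sorted(groups)}

# Hypothetical file names, used only to illustrate the grouping
toy_paths = ["img_b_002.png", "img_a_001.png", "img_a_002.png", "img_b_001.png"]
toy_groups = group_by_pair_number(toy_paths)
```

With these toy names, each group holds the path for molecule a followed by the path for molecule b.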

In [None]:
# Convert images_data to a NumPy array; at this point it holds PNG file paths, not pixel data
image_pairs = np.array(images_data)
frac_similar_values = np.array(train_data['frac_similar'])  # Target values from train_data
print(image_pairs)
# Once the images are loaded, image_pairs should have shape (num_samples, 2, img_width, img_height, img_channels),
# which can be split into one (num_samples, img_width, img_height, img_channels) array per molecule of the pair
#image_pairs = image_pairs.reshape(-1, 2, img_width, img_height, img_channels)
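
The comments above describe a five-dimensional array of image pairs. A minimal sketch of how such an array could be assembled and split into one array per molecule, using random arrays as stand-ins for decoded PNGs (the sizes, the `(height, width, channels)` axis order, and all names are assumptions for illustration):

```python
import numpy as np

num_samples, height, width, channels = 4, 8, 8, 3  # toy sizes
rng = np.random.default_rng(0)

# Stand-ins for decoded images: one (image_a, image_b) tuple per pair
toy_images = [
    (rng.random((height, width, channels)), rng.random((height, width, channels)))
    for _ in range(num_samples)
]

# Stack each pair, then stack all pairs: (num_samples, 2, height, width, channels)
toy_pairs = np.stack([np.stack(pair) for pair in toy_images])

# Split into one array per molecule, e.g. as inputs for a two-branch model
images_a, images_b = toy_pairs[:, 0], toy_pairs[:, 1]
```

This only works when every image has the same size, so real images would need to be resized to a common shape before stacking.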
