<a href="https://colab.research.google.com/github/grissharrisdennis/Machine-learning-Projects/blob/main/Similarity_Prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Similarity Prediction

Molecular similarity assessments using machine learning.
Useful for the prediction of molecular similarity evaluations by humans.

Molecular similarity is an impressively broad topic with many implications in several areas of chemistry. Its roots lie in the paradigm that ‘similar molecules have similar properties’. For this reason, methods for determining molecular similarity find wide application in pharmaceutical companies, e.g., in the context of structure-activity relationships. The similarity evaluation is also used in the field of chemical legislation, specifically in the procedure to judge if a new molecule can obtain the status of orphan drug with the consequent financial benefits. For this procedure, the European Medicines Agency uses experts’ judgments. It is clear that the perception of the similarity depends on the observer, so the development of models to reproduce the human perception is useful.

The dataset was created by Enrico Gandini during his PhD at Università degli Studi di Milano.

[Link to the dataset](https://archive.ics.uci.edu/dataset/750/similarity+prediction-1)[click here]

# Acknowledgements
Gandini, Enrico, Gilles Marcou, Fanny Bonachera, Alexandre Varnek, Stefano Pieraccini, and Maurizio Sironi. 2022.
 "Molecular Similarity Perception Based on Machine-Learning Models" International Journal of Molecular Sciences 23, no. 11: 6114. https://doi.org/10.3390/ijms23116114



In [2]:
!pip install cairosvg

Collecting cairosvg
  Downloading CairoSVG-2.7.1-py3-none-any.whl (43 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m43.2/43.2 kB[0m [31m626.3 kB/s[0m eta [36m0:00:00[0m
[?25hCollecting cairocffi (from cairosvg)
  Downloading cairocffi-1.6.1-py3-none-any.whl (75 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m75.1/75.1 kB[0m [31m1.5 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting cssselect2 (from cairosvg)
  Downloading cssselect2-0.7.0-py3-none-any.whl (15 kB)
Installing collected packages: cssselect2, cairocffi, cairosvg
Successfully installed cairocffi-1.6.1 cairosvg-2.7.1 cssselect2-0.7.0


In [18]:
import pandas as pd
import cairosvg
from PIL import Image
import numpy as np
import os
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Conv2D, Flatten, Dense, Concatenate

In [4]:
from google.colab import drive
drive.mount('/content/drive/')

Mounted at /content/drive/


In [5]:
train_data_images_path='/content/drive/My Drive/dataset_Similarity_Prediction/original_training_set/images_2D'
train_data_images=os.listdir('/content/drive/My Drive/dataset_Similarity_Prediction/original_training_set/images_2D')
train_data_conformers=os.listdir('/content/drive/My Drive/dataset_Similarity_Prediction/original_training_set/conformers_3D')
train_data=pd.read_csv('/content/drive/My Drive/dataset_Similarity_Prediction/original_training_set/original_training_set.csv')

In [14]:
print(train_data)

    id_pair                          curated_smiles_molecule_a  \
0         1                         CCN(CC)CC(=O)Nc1c(C)cccc1C   
1         2  Cc1nc2n(c(=O)c1CCN1CCC(c3noc4cc(F)ccc34)CC1)CC...   
2         3                                 COc1ccccc1OCC(O)CO   
3         4   CCOc1ccccc1OCCN[C@H](C)Cc1ccc(OC)c(S(N)(=O)=O)c1   
4         5                                 C[C@H](N)Cc1ccccc1   
..      ...                                                ...   
95       96  CCC(=O)O[C@]1(C(=O)CCl)[C@@H](C)C[C@H]2[C@@H]3...   
96       97                    C[C@H](N)[C@H](O)c1ccc(O)c(O)c1   
97       98                      CCOC(=O)C1(c2ccccc2)CCN(C)CC1   
98       99  CC1(C)O[C@@H]2C[C@H]3[C@@H]4C[C@H](F)C5=CC(=O)...   
99      100  CC(=O)OCC(=O)[C@@]1(OC(C)=O)[C@@H](C)C[C@H]2[C...   

                            curated_smiles_molecule_b  tanimoto_cdk_Extended  \
0                   CCCN1CCCC[C@H]1C(=O)Nc1c(C)cccc1C               0.641434   
1   Cc1nc2n(c(=O)c1CCN1CCC(c3noc4cc(F)ccc34)CC1

In [6]:
ids = train_data['id_pair']
smiles_a = train_data['curated_smiles_molecule_a']
smiles_b = train_data['curated_smiles_molecule_b']
tanimoto_coefficients = train_data['tanimoto_cdk_Extended']
tanimoto_combo = train_data['TanimotoCombo']
frac_similar = train_data['frac_similar']


In [19]:
def extract_images_from_svg(svg_file):
    png_file = svg_file.replace('.svg', '.png')

    # Convert SVG to PNG using CairoSVG
    cairosvg.svg2png(url=svg_file, write_to=png_file)
    return png_file

In [31]:
def preprocess_image(image):
  img = Image.open(image)
  img = img.convert('RGB')  # Ensure it's in RGB format

    # Convert image to array
  img_array = np.array(img)
  img_array = img_array.reshape((1,) + img_array.shape)  # Add batch dimension

    # Define and apply transformations
  datagen = ImageDataGenerator(
        rescale=1./255,
        rotation_range=20,
        width_shift_range=0.2,
        height_shift_range=0.2,
        horizontal_flip=True
    )
  transformed_image = datagen.random_transform(img_array)
  standardized_image = datagen.standardize(transformed_image)

  return standardized_image[0]

In [25]:
svg_files = [os.path.join(train_data_images_path, file) for file in train_data_images if file.endswith('.svg')]
a_images=[]
b_images=[]
for svg_file in svg_files:
    png_img_path = extract_images_from_svg(svg_file)
    #png_img_path=preprocess_image(png_img_p)
    if png_img_path[104:105]=='a':
      a_images.append(png_img_path)
    elif png_img_path[104:105]=='b':
      b_images.append(png_img_path)

In [26]:
images_data = [[] for _ in range(101)]  # Initialize a list of lists for each number (0-100)
for img_path in a_images + b_images:
    molecule_num = int(img_path.split('_')[-1][:3])  # Extract the molecule number
    images_data[molecule_num].append(img_path)

In [33]:
for i in images_data:
  for j in i:
    j=preprocess_image(j)

ValueError: ignored

In [28]:
img_width, img_height, img_channels = 64, 64, 3

In [34]:
# Convert images_data to numpy arrays for input to the model
image_pairs = np.array(images_data)  # Assuming images_data contains pairs of preprocessed images
frac_similar_values = np.array(train_data['frac_similar'])  # Assuming frac_similar is a column from train_data

# Assuming your image_pairs are of shape (num_samples, 2, img_width, img_height, img_channels)
# Reshape to (num_samples, img_width, img_height, img_channels) for each image in the pair
#image_pairs = image_pairs.reshape(-1, 2, img_width, img_height, img_channels)


  image_pairs = np.array(images_data)  # Assuming images_data contains pairs of preprocessed images


In [35]:
img_width, img_height, img_channels = 64, 64, 3  # Update with your image dimensions

# Define inputs for image pairs
input_1 = Input(shape=(img_width, img_height, img_channels))
input_2 = Input(shape=(img_width, img_height, img_channels))

# CNN for image processing
convolutional_layer = Conv2D(32, (3, 3), activation='relu')
flatten_layer = Flatten()

# Process first image
x1 = convolutional_layer(input_1)
x1 = flatten_layer(x1)

# Process second image
x2 = convolutional_layer(input_2)
x2 = flatten_layer(x2)

# Concatenate processed image representations
combined = Concatenate()([x1, x2])

# Merge with frac_similar input
frac_similar_input = Input(shape=(1,))
combined_with_frac_similar = Concatenate()([combined, frac_similar_input])

# Output layer
output = Dense(1, activation='sigmoid')(combined_with_frac_similar)  # Sigmoid for similarity prediction

# Create the model
model = Model(inputs=[input_1, input_2, frac_similar_input], outputs=output)

# Compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

In [36]:
# Train the model
model.fit([image_pairs[:, 0], image_pairs[:, 1], frac_similar_values],
          frac_similar_values,  # Assuming similarity as the target
          epochs=10, batch_size=32, validation_split=0.2)


IndexError: ignored