<a href="https://colab.research.google.com/github/grissharrisdennis/Machine-learning-Projects/blob/main/Similarity_Prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Similarity Prediction

Molecular similarity assessment using machine learning.
The goal is to predict how human observers judge the similarity of pairs of molecules.

Molecular similarity is a broad topic with implications in several areas of chemistry. Its roots lie in the paradigm that ‘similar molecules have similar properties’. For this reason, methods for determining molecular similarity are widely applied in pharmaceutical companies, e.g., in the context of structure-activity relationships. Similarity evaluation is also used in chemical legislation, specifically in the procedure that decides whether a new molecule can obtain orphan drug status, with the consequent financial benefits. For this procedure, the European Medicines Agency relies on experts’ judgments. Because perceived similarity depends on the observer, models that reproduce human perception are useful.

The dataset was created by Enrico Gandini during his PhD at Università degli Studi di Milano.

[Link to the dataset](https://archive.ics.uci.edu/dataset/750/similarity+prediction-1)

# Acknowledgements
Gandini, Enrico, Gilles Marcou, Fanny Bonachera, Alexandre Varnek, Stefano Pieraccini, and Maurizio Sironi. 2022. "Molecular Similarity Perception Based on Machine-Learning Models." International Journal of Molecular Sciences 23, no. 11: 6114. https://doi.org/10.3390/ijms23116114



In [2]:
!pip install cairosvg

Collecting cairosvg
  Downloading CairoSVG-2.7.1-py3-none-any.whl (43 kB)
Collecting cairocffi (from cairosvg)
  Downloading cairocffi-1.6.1-py3-none-any.whl (75 kB)
Collecting cssselect2 (from cairosvg)
  Downloading cssselect2-0.7.0-py3-none-any.whl (15 kB)
Installing collected packages: cssselect2, cairocffi, cairosvg
Successfully installed cairocffi-1.6.1 cairosvg-2.7.1 cssselect2-0.7.0


In [3]:
import pandas as pd
import cairosvg
from PIL import Image
import numpy as np
import os


In [4]:
from google.colab import drive
drive.mount('/content/drive/')

Mounted at /content/drive/


In [5]:
# Root folder of the training set on the mounted Drive
base_dir = '/content/drive/My Drive/dataset_Similarity_Prediction/original_training_set'
train_data_images_path = os.path.join(base_dir, 'images_2D')
train_data_images = os.listdir(train_data_images_path)
train_data_conformers = os.listdir(os.path.join(base_dir, 'conformers_3D'))
train_data = pd.read_csv(os.path.join(base_dir, 'original_training_set.csv'))

In [6]:
ids = train_data['id_pair']
smiles_a = train_data['curated_smiles_molecule_a']
smiles_b = train_data['curated_smiles_molecule_b']
tanimoto_coefficients = train_data['tanimoto_cdk_Extended']
tanimoto_combo = train_data['TanimotoCombo']
frac_similar = train_data['frac_similar']
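
The two Tanimoto columns read above are similarity scores supplied with the dataset. As a reminder of what a fingerprint-based Tanimoto coefficient measures, here is a minimal sketch on toy data. It assumes a fingerprint can be represented as the set of its "on" bit positions; the function name `tanimoto` and the example fingerprints are illustrative and not part of the notebook, and the dataset's own values were presumably computed with a cheminformatics toolkit rather than with this code.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto (Jaccard) coefficient between two collections of 'on' bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = a | b
    if not union:
        # Convention chosen for this sketch: two empty fingerprints score 0.0
        return 0.0
    return len(a & b) / len(union)


# Toy fingerprints (hypothetical bit positions, not taken from the dataset)
fp_a = {1, 4, 7, 9}
fp_b = {1, 4, 8}

# 2 shared bits out of 5 distinct bits
print(tanimoto(fp_a, fp_b))
```

The coefficient ranges from 0 (no shared bits) to 1 (identical bit sets).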


In [7]:
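def extract_images_from_svg(svg_file):
    """Convert an SVG file to a PNG in the same folder and return the PNG path."""
    # Swap only the file extension; str.replace would also change '.svg' elsewhere in the path
    png_file = os.path.splitext(svg_file)[0] + '.png'

    # Convert SVG to PNG using CairoSVG
    cairosvg.svg2png(url=svg_file, write_to=png_file)

    return png_file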
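
The function above writes each PNG into the same folder as its SVG, so the dataset directory on Drive fills up with generated files. One possible alternative, sketched below, is to derive the output path in a separate folder and skip files that were already converted. The helper names `png_path_for` and `needs_conversion` and the example folders are hypothetical; only standard-library path handling is used.

```python
import os


def png_path_for(svg_path, out_dir):
    """Return the PNG path inside out_dir that corresponds to svg_path."""
    stem, _ext = os.path.splitext(os.path.basename(svg_path))
    return os.path.join(out_dir, stem + ".png")


def needs_conversion(svg_path, out_dir):
    """True when no PNG has been written yet for this SVG."""
    return not os.path.exists(png_path_for(svg_path, out_dir))


# Hypothetical input and output locations, for illustration only
print(png_path_for("/data/images_2D/image_molecule_015b.svg", "/tmp/png_cache"))
```

The returned path could then be passed as `write_to` in the `cairosvg.svg2png` call, after creating the output folder with `os.makedirs(out_dir, exist_ok=True)`.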

In [13]:
svg_files = [os.path.join(train_data_images_path, file) for file in train_data_images if file.endswith('.svg')]
# Convert each SVG to PNG and split the resulting paths by molecule side (a or b)
a_images = []
b_images = []
for svg_file in svg_files:
    png_img_path = extract_images_from_svg(svg_file)
    # File names end in <number><a|b>, e.g. image_molecule_015b.png;
    # read the side from the name instead of a fixed character offset in the path
    side = os.path.splitext(os.path.basename(png_img_path))[0][-1]
    if side == 'a':
        a_images.append(png_img_path)
    elif side == 'b':
        b_images.append(png_img_path)

# One list of image paths per molecule number (0-100)
images_data = [[] for _ in range(101)]

for img_path in a_images + b_images:
    molecule_num = int(img_path.split('_')[-1][:3])  # e.g. '015b.png' -> 15
    images_data[molecule_num].append(img_path)
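
The grouping in this cell relies on file names of the form `image_molecule_015b`, as seen in the printed paths. A more defensive variant, sketched below, parses the number and the side with a regular expression and stores them in a dictionary, so unexpected files are skipped and no fixed upper bound on the molecule number is needed. The pattern is an assumption inferred from those file names, and `group_by_pair` and the example paths are hypothetical.

```python
import os
import re
from collections import defaultdict

# Assumed naming scheme: image_molecule_<three digits><a|b>.<extension>
NAME_PATTERN = re.compile(r"image_molecule_(\d{3})([ab])\.\w+$")


def group_by_pair(paths):
    """Map molecule number -> {'a': path, 'b': path} for matching file names."""
    pairs = defaultdict(dict)
    for path in paths:
        match = NAME_PATTERN.search(os.path.basename(path))
        if match is None:
            continue  # skip files that do not follow the assumed scheme
        number, side = int(match.group(1)), match.group(2)
        pairs[number][side] = path
    return dict(pairs)


# Made-up paths for illustration
example = [
    "/some/dir/image_molecule_015b.png",
    "/some/dir/image_molecule_008a.png",
    "/some/dir/image_molecule_015a.png",
    "/some/dir/notes.txt",
]
pairs = group_by_pair(example)
print(sorted(pairs))
```

Keeping both sides under one key also makes it easy to spot pairs where one of the two images is missing.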

['/content/drive/My Drive/dataset_Similarity_Prediction/original_training_set/images_2D/image_molecule_015b.svg', '/content/drive/My Drive/dataset_Similarity_Prediction/original_training_set/images_2D/image_molecule_008a.svg', '/content/drive/My Drive/dataset_Similarity_Prediction/original_training_set/images_2D/image_molecule_010a.svg', '/content/drive/My Drive/dataset_Similarity_Prediction/original_training_set/images_2D/image_molecule_006a.svg', '/content/drive/My Drive/dataset_Similarity_Prediction/original_training_set/images_2D/image_molecule_014a.svg', '/content/drive/My Drive/dataset_Similarity_Prediction/original_training_set/images_2D/image_molecule_003a.svg', '/content/drive/My Drive/dataset_Similarity_Prediction/original_training_set/images_2D/image_molecule_011b.svg', '/content/drive/My Drive/dataset_Similarity_Prediction/original_training_set/images_2D/image_molecule_006b.svg', '/content/drive/My Drive/dataset_Similarity_Prediction/original_training_set/images_2D/image_mo