# What is Solubility and Why is it Important in Drug Discovery?
Solubility refers to the ability of a substance (such as a drug compound) to dissolve in a solvent, typically water. In the context of pharmaceuticals, aqueous solubility is a critical property that influences a drug's absorption, distribution, and overall effectiveness.

**Importance of Solubility in Drug Discovery and Development:**

*   **Bioavailability:** Poorly soluble drugs often exhibit low bioavailability, meaning less of the drug reaches systemic circulation.
*   **Formulation Challenges:** Low solubility complicates drug formulation and delivery (e.g., oral tablets).
*   **ADMET Profile:** Solubility is closely tied to Absorption, Distribution, Metabolism, Excretion, and Toxicity (ADMET) properties.
*   **Lead Optimization:** During drug design, solubility is optimized alongside potency and selectivity to develop a balanced drug candidate.

**Prediction via QSAR:**

Quantitative Structure–Activity Relationship (QSAR) models help predict solubility using molecular descriptors, enabling faster screening of compounds in silico before synthesis or testing.


In [None]:
!pip install rdkit

**Now we will use RDKit to build a regression-based QSAR model to predict the solubility of compounds.**

The model predicts the aqueous solubility of a given set of compounds from descriptors such as logP, molecular weight, number of rotatable bonds, hydrogen-bond donor and acceptor counts, and aromatic proportion.

The dataset comes from the paper: John S. Delaney, "ESOL: Estimating Aqueous Solubility Directly from Molecular Structure", Journal of Chemical Information and Computer Sciences 2004, 44 (3), 1000-1005.
DOI: 10.1021/ci034243x (https://pubs.acs.org/doi/10.1021/ci034243x)

John S. Delaney introduced the ESOL model, a straightforward method for estimating the aqueous solubility of a compound directly from its molecular structure. The model was fitted by linear regression on a dataset of 2,874 compounds and incorporates nine molecular descriptors, of which calculated logP (octanol-water partition coefficient) is the most significant, followed by molecular weight, aromatic atom proportion, and the number of rotatable bonds. ESOL performed consistently across three validation sets, predicting solubility within a factor of 5–8 of measured values, which makes it competitive with the established General Solubility Equation for drug-like molecules.
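
To make the idea of a linear solubility model concrete, the sketch below encodes a linear equation over the four descriptors named above. The coefficients are the ones usually quoted for the final ESOL equation; treat them as an assumption and check them against the paper before relying on them. The function name `esol_log_s`, its argument names, and the input values are illustrative only.

```python
def esol_log_s(clogp, mol_wt, rot_bonds, aromatic_prop):
    """Estimate log10 aqueous solubility (mol/L) with an ESOL-style linear equation.

    The coefficients are the commonly quoted ESOL values (an assumption here;
    verify against Delaney 2004 before use).
    """
    return (0.16
            - 0.63 * clogp
            - 0.0062 * mol_wt
            + 0.066 * rot_bonds
            - 0.74 * aromatic_prop)


# Hypothetical descriptor values, chosen only to show the arithmetic
log_s = esol_log_s(clogp=2.0, mol_wt=100.0, rot_bonds=1, aromatic_prop=0.5)
print(round(log_s, 3))
```

Because the prediction is on a log10 scale, an accuracy of "within a factor of 5–8" corresponds to an error of roughly 0.7 to 0.9 log units, since log10(5) ≈ 0.70 and log10(8) ≈ 0.90.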

In [None]:
# URL of the CSV file containing the training data
url = "https://raw.githubusercontent.com/Rajnishphe/AIDD-2022/main/ML%20Based%20QSAR/delaney.csv"

In [None]:
# Read the input solubility data with pandas; it is used to build and train the model
import pandas as pd
sol = pd.read_csv(url)

In [None]:
#Take a look at the input data
sol

In [None]:
# Take a look at one structure (row 1142 of the table)
from rdkit import Chem
Chem.MolFromSmiles(sol.SMILES[1142])

In [None]:
# Convert each SMILES string into an RDKit molecule object and collect the objects in a list
from rdkit import Chem
mol_list= []
for element in sol.SMILES:
  mol = Chem.MolFromSmiles(element)
  mol_list.append(mol)

In [None]:
#length of molecule list
len(mol_list)

# Calculate Descriptors
Descriptors are extracted from each structure: logP, molecular weight, number of rotatable bonds, hydrogen-bond donor and acceptor counts, TPSA, and aromatic proportion. Aromatic proportion is calculated separately with a small helper function; the others are calculated directly with RDKit.
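
For reference, the aromatic proportion used here can be written as the fraction of heavy (non-hydrogen) atoms that are aromatic. The symbols below are just notation for this note; toluene (6 aromatic carbons out of 7 heavy atoms) serves as a worked example:

```latex
\mathrm{AP} = \frac{N_{\text{aromatic}}}{N_{\text{heavy}}},
\qquad
\mathrm{AP}_{\text{toluene}} = \frac{6}{7} \approx 0.857
```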

**Some other packages besides RDKit that can calculate descriptors:**

https://github.com/mordred-descriptor/mordred

https://github.com/ecrl/padelpy


In [None]:
from rdkit import Chem
from rdkit.Chem import Descriptors

# Define a function to calculate aromatic proportion
def get_aromatic_proportion(mol):
    aromatic_atoms = sum([1 for atom in mol.GetAtoms() if atom.GetIsAromatic()])
    heavy_atoms = Descriptors.HeavyAtomCount(mol)
    return aromatic_atoms / heavy_atoms if heavy_atoms > 0 else 0

# Initialize empty lists for each descriptor
mol_wt = []
logp = []
num_rot_bonds = []
num_h_donors = []
num_h_acceptors = []
aromatic_proportion = []
tpsa = []  # List to store TPSA values

# Calculate descriptors
for mol in mol_list:
    if mol is not None:
        mol_wt.append(Descriptors.MolWt(mol))
        logp.append(Descriptors.MolLogP(mol))
        num_rot_bonds.append(Descriptors.NumRotatableBonds(mol))
        num_h_donors.append(Descriptors.NumHDonors(mol))
        num_h_acceptors.append(Descriptors.NumHAcceptors(mol))
        aromatic_proportion.append(get_aromatic_proportion(mol))
        tpsa.append(Descriptors.TPSA(mol))  # Topological polar surface area
    else:
        mol_wt.append(None)
        logp.append(None)
        num_rot_bonds.append(None)
        num_h_donors.append(None)
        num_h_acceptors.append(None)
        aromatic_proportion.append(None)
        tpsa.append(None)  # Handle None case for TPSA

# Add descriptors to the dataframe
sol['MolWt'] = mol_wt
sol['LogP'] = logp
sol['NumRotatableBonds'] = num_rot_bonds
sol['NumHDonors'] = num_h_donors
sol['NumHAcceptors'] = num_h_acceptors
sol['AromaticProportion'] = aromatic_proportion
sol['TPSA'] = tpsa  # Add TPSA to the DataFrame

# Display the updated dataframe
sol


In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Plot pairwise scatter plot using Seaborn
sns.pairplot(sol[['MolWt', 'LogP', 'TPSA', 'NumHDonors', 'NumHAcceptors', 'AromaticProportion']])

# Add title
plt.suptitle('Pairwise Relationship between Descriptors', y=1.02)

# Show plot
plt.show()


In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Select the descriptor columns from the solubility DataFrame
X = sol[['MolWt', 'LogP', 'TPSA', 'NumHDonors', 'NumHAcceptors', 'AromaticProportion']]  # Features

# Calculate the correlation matrix
corr_matrix = X.corr()

# Plot the heatmap
plt.figure(figsize=(10,8))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.show()


In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Features and Target variable
X = sol[['MolWt', 'LogP', 'NumRotatableBonds', 'NumHDonors', 'NumHAcceptors', 'AromaticProportion']]  # Features
y = sol['measured log(solubility:mol/L)']  # Target variable

# Step 1: Split first, so the test set stays unseen while the scaler is fitted
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 2: Fit the scaler on the training data only, then apply it to both sets
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Now you can use X_train and X_test for model training and evaluation
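
One point worth keeping in mind with scaling is that the mean and standard deviation should come from the training rows only, and the same statistics are then reused for the test rows. The following is a minimal pure-Python sketch of that idea, not the scikit-learn implementation; the helper names `fit_standardizer` and `standardize` and the molecular weights are hypothetical.

```python
from statistics import mean, pstdev


def fit_standardizer(train_values):
    """Learn the mean and population standard deviation from training values only."""
    mu = mean(train_values)
    sigma = pstdev(train_values)
    # Guard against a constant feature to avoid division by zero
    return mu, (sigma if sigma > 0 else 1.0)


def standardize(values, mu, sigma):
    """Apply previously learned statistics to any list of values."""
    return [(v - mu) / sigma for v in values]


# Hypothetical molecular weights, split by hand into train and test
train_mw = [100.0, 200.0, 300.0]
test_mw = [400.0]

mu, sigma = fit_standardizer(train_mw)          # statistics come from the training set
train_scaled = standardize(train_mw, mu, sigma)
test_scaled = standardize(test_mw, mu, sigma)   # the test set reuses the same statistics
print(train_scaled, test_scaled)
```

The scaled test value can fall outside the range seen in training, which is expected: the test data had no influence on the statistics.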


In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score

# Initialize Random Forest Regressor
model = RandomForestRegressor(n_estimators=100, random_state=42)

# Train the model
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Evaluate the model
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f'Mean Absolute Error: {mae}')
print(f'R2 Score: {r2}')
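
To show what the two metrics measure, here is a small pure-Python sketch that computes them by hand on made-up numbers. The function names and the toy values are hypothetical; the sketch assumes the target is a log10 solubility, so an error in log units translates into a multiplicative (fold) error.

```python
def mean_absolute_error_manual(y_true, y_pred):
    """Average absolute difference between measured and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score_manual(y_true, y_pred):
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Hypothetical measured and predicted log solubilities (log10 mol/L)
measured = [-1.0, -2.0, -3.0, -4.0]
predicted = [-1.5, -2.0, -2.5, -4.0]

toy_mae = mean_absolute_error_manual(measured, predicted)
toy_r2 = r2_score_manual(measured, predicted)
toy_fold_error = 10 ** toy_mae  # an error in log10 units is a multiplicative factor
print(toy_mae, toy_r2, toy_fold_error)
```

An MAE of 0.25 log units, for instance, means predictions are off by a factor of about 1.8 on average in concentration terms.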


In [None]:
import joblib

# Save the trained model to a file
joblib.dump(model, 'random_forest_model.pkl')
joblib.dump(scaler, 'scaler.pkl')

In [None]:
!ls


In [None]:
# Feature importance from Random Forest
importances = model.feature_importances_
feature_names = X.columns

# Create a bar plot of feature importance
plt.barh(feature_names, importances)
plt.xlabel('Feature Importance')
plt.title('Feature Importance for Predicting Solubility')
plt.show()


In [None]:
import matplotlib.pyplot as plt

# Plot predicted vs measured values; the dashed diagonal marks perfect agreement
plt.scatter(y_test, y_pred)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'k--', lw=2)
plt.xlabel('Measured log solubility (mol/L)')
plt.ylabel('Predicted log solubility (mol/L)')
plt.title('Measured vs Predicted Solubility (Random Forest)')
plt.show()


In [None]:
# k-fold Cross validation

from sklearn.model_selection import cross_val_score, KFold
from sklearn.ensemble import RandomForestRegressor
import numpy as np

# Features and target
X = sol[['MolWt', 'LogP', 'NumRotatableBonds', 'NumHDonors', 'NumHAcceptors', 'AromaticProportion']]  # Features
y = sol['measured log(solubility:mol/L)']  # Target variable

# Initialize Random Forest Regressor
model = RandomForestRegressor(n_estimators=100, random_state=42)

# Set up cross-validation (e.g., 5-fold cross-validation)
cv = KFold(n_splits=5, random_state=42, shuffle=True)

# Perform cross-validation and get the scores (e.g., R² for regression)
cv_scores = cross_val_score(model, X, y, cv=cv, scoring='r2')

# Print cross-validation results
print("Cross-validation R² scores:", cv_scores)
print("Mean R² score:", np.mean(cv_scores))
print("Standard Deviation of R² scores:", np.std(cv_scores))
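
As a rough illustration of what k-fold cross-validation does with the rows, the sketch below partitions sample indices into shuffled folds using only the standard library. It is not scikit-learn's `KFold` and will not reproduce its shuffling; the function name `k_fold_indices` and the dataset size are hypothetical.

```python
import random


def k_fold_indices(n_samples, n_splits=5, seed=42):
    """Yield (train_indices, validation_indices) pairs for shuffled k-fold CV."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Distribute samples so that fold sizes differ by at most one
    fold_sizes = [n_samples // n_splits + (1 if i < n_samples % n_splits else 0)
                  for i in range(n_splits)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size


# Hypothetical dataset of 10 molecules split into 5 folds
for fold, (train_idx, val_idx) in enumerate(k_fold_indices(10, n_splits=5), start=1):
    print(f"fold {fold}: {len(train_idx)} train rows, validation rows {sorted(val_idx)}")
```

Every row appears in exactly one validation fold, so each molecule is predicted once by a model that never saw it during training.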


In [None]:
from sklearn.model_selection import learning_curve

# Score the model on increasing training-set sizes with 5-fold cross-validation
train_sizes, train_scores, val_scores = learning_curve(model, X, y, cv=5, scoring='r2')

plt.plot(train_sizes, np.mean(train_scores, axis=1), label='Training score')
plt.plot(train_sizes, np.mean(val_scores, axis=1), label='Cross-validation score')
plt.xlabel('Training Size')
plt.ylabel('R² score')
plt.legend()
plt.title('Learning Curves')
plt.show()


# Predicting the Solubility of External, Unknown Molecules


In [None]:
# Load a new set of molecules whose solubility will be predicted with the model we built
sol1 = pd.read_csv('https://raw.githubusercontent.com/Rajnishphe/AIDD-2022/main/ML%20Based%20QSAR/new_1.csv')

In [None]:
sol1


In [None]:
import pandas as pd
from rdkit import Chem
from rdkit.Chem import Descriptors

# Define a function to calculate aromatic proportion
def get_aromatic_proportion(mol):
    aromatic_atoms = sum([1 for atom in mol.GetAtoms() if atom.GetIsAromatic()])
    heavy_atoms = Descriptors.HeavyAtomCount(mol)
    return aromatic_atoms / heavy_atoms if heavy_atoms > 0 else 0

# Initialize empty lists for each descriptor
mol_wt = []
logp = []
num_rot_bonds = []
num_h_donors = []
num_h_acceptors = []
aromatic_proportion = []
tpsa = []  # List to store TPSA values

# Calculate descriptors
for smiles in sol1['SMILES']:  # Loop through the SMILES column in your sol1 DataFrame
    mol = Chem.MolFromSmiles(smiles)

    if mol is not None:
        mol_wt.append(Descriptors.MolWt(mol))
        logp.append(Descriptors.MolLogP(mol))
        num_rot_bonds.append(Descriptors.NumRotatableBonds(mol))
        num_h_donors.append(Descriptors.NumHDonors(mol))
        num_h_acceptors.append(Descriptors.NumHAcceptors(mol))
        aromatic_proportion.append(get_aromatic_proportion(mol))
        tpsa.append(Descriptors.TPSA(mol))  # Topological polar surface area
    else:
        mol_wt.append(None)
        logp.append(None)
        num_rot_bonds.append(None)
        num_h_donors.append(None)
        num_h_acceptors.append(None)
        aromatic_proportion.append(None)
        tpsa.append(None)  # Handle None case for TPSA

# Add descriptors to the dataframe
sol1['MolWt'] = mol_wt
sol1['LogP'] = logp
sol1['NumRotatableBonds'] = num_rot_bonds
sol1['NumHDonors'] = num_h_donors
sol1['NumHAcceptors'] = num_h_acceptors
sol1['AromaticProportion'] = aromatic_proportion
sol1['TPSA'] = tpsa  # Add TPSA to the DataFrame

# Save the updated DataFrame to a new CSV file
sol1.to_csv('updated_sol1.csv', index=False)

# Display the updated dataframe
sol1



In [None]:
import joblib

# Load model and scaler
model = joblib.load('random_forest_model.pkl')
scaler = joblib.load('scaler.pkl')  # reuse the training scaler so new data is scaled identically

X_sol1 = sol1[['MolWt', 'LogP', 'NumRotatableBonds', 'NumHDonors', 'NumHAcceptors', 'AromaticProportion']]
X_sol1_scaled = scaler.transform(X_sol1)

predicted_solubility = model.predict(X_sol1_scaled)
sol1['Predicted_Solubility'] = predicted_solubility
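
The predicted values are on the same scale as the training target. Assuming that target is the log10 of molar solubility, as in the ESOL paper, a prediction can be converted back to a concentration as sketched below. The function names and the example numbers are hypothetical.

```python
def log_s_to_mol_per_l(log_s):
    """Convert log10 solubility (mol/L) back to a concentration in mol/L."""
    return 10 ** log_s


def log_s_to_mg_per_ml(log_s, mol_wt):
    """Convert log10 solubility (mol/L) to mg/mL using molecular weight in g/mol.

    mol/L * g/mol = g/L, which is numerically equal to mg/mL.
    """
    return log_s_to_mol_per_l(log_s) * mol_wt


# Hypothetical prediction: log S = -3.0 for a compound of 250 g/mol
print(log_s_to_mol_per_l(-3.0))
print(log_s_to_mg_per_ml(-3.0, 250.0))
```

Reporting mg/mL alongside log S can make predictions easier to compare with formulation requirements, which are often stated in mass units.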


In [None]:
sol1

In [None]:
# Install lazypredict to compare many regression models at once with LazyRegressor
!pip install lazypredict

In [None]:
import lazypredict
from lazypredict.Supervised import LazyRegressor

# Initialize LazyRegressor and fit it; fit() returns a pair whose first element is the table of model scores
regressor = LazyRegressor()
models, predictions = regressor.fit(X_train, X_test, y_train, y_test)

# Display the performance of different models
print(models)
