# HuSSPred Virtual Screening Instructions

This guide explains how to use the provided Jupyter Notebook script to run virtual screening predictions for skin sensitization using the **HuSSPred** model.

---

## 🚀 How to Use This Script

1. **Prepare Your Input File**  
   - Your input file must be an **Excel file** (`.xlsx` format).  
   - It must contain a column named: **"QSAR_READY_SMILES"** with SMILES representations of the chemicals.
   - It must be added to the **"../Batch_predictions/data"** Folder

2. **Modify the Script for Your File**  
   - Locate the following line in the script:
     ```python
     data_vs = "../data/FILENAME.xlsx"
     ```
   - Change only the **filename** at the end (`FILENAME.xlsx`) to match your actual Excel file.

3. **Run the Notebook**  
   - Open the Jupyter Notebook.  
   - Run all the cells to process your data and obtain predictions.

4. **Where to Find the Results**  
   - The predictions will be saved as a new Excel file at:  
     ```python
     output_path = "../results/1_QSAR_readysmiles only_dedup_excel_results.xlsx"
     ```
   - The results file will contain:
     - **Binary Probability Active**: The probability of a compound being active.
     - **Binary Predicted Outcome**: `"1"` (active) or `"0"` (inactive).
     - **Binary Applicability Domain**: `"Inside"` (reliable prediction) or `"Outside"` (unreliable prediction).

---

## 📂 File Locations  
- **Input file**: Place your Excel file inside the `../data` folder.  
- **Output file**: The results will be saved in the `../results/` folder.  

Once you run the script, your predictions will appear in the output Excel file. 🚀

# Binary Prediction

In [7]:
# Importing packages
import os
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem, PandasTools
import numpy as np
import pandas as pd
from pandas import DataFrame
import joblib
from sklearn.metrics import euclidean_distances

# Load training compounds
data_train = "../models/dataset_ss_DSA05_WES_GHS_BIN_Binary.sdf"
train_mols = [mol for mol in Chem.SDMolSupplier(data_train) if mol is not None]

# Generate binary Morgan fingerprint with radius 2 (no features) for training data
train_fp = [AllChem.GetMorganFingerprintAsBitVect(m, 2, 2048, useFeatures=False) for m in train_mols]

# Load test compounds from Excel file for virtual screening
data_vs = "../data/2_BatchSearch_QSAR SMILES for HUSSpred.xlsx"
output_path = "../results/2_BatchSearch_QSAR SMILES_predictions.xlsx"

test_df = pd.read_excel(data_vs)

# Standardize column names
test_df.columns = test_df.columns.str.strip().str.upper()  # Convert to uppercase for consistency

# Verify the correct column name for QSAR-ready SMILES
expected_col = "QSAR_READY_SMILES"
expected_col = expected_col.upper()  # Ensure uppercase matching

if expected_col not in test_df.columns:
    closest_match = [col for col in test_df.columns if expected_col in col.upper()]
    if closest_match:
        expected_col = closest_match[0]  # Use the closest match
        print(f"Warning: Using '{expected_col}' instead of 'QSAR_READY_SMILES'")
    else:
        raise KeyError(f"Column '{expected_col}' not found in test_df. Available columns: {test_df.columns.tolist()}")

# Convert "QSAR_READY_SMILES" to RDKit molecules, ensuring invalid SMILES are removed
valid_smiles = []
valid_mols = []

for smi in test_df[expected_col]:
    if pd.notna(smi):  # Ensure the value is not NaN
        mol = Chem.MolFromSmiles(smi)
        if mol is not None:
            valid_smiles.append(smi)
            valid_mols.append(mol)

# Ensure we only generate fingerprints for valid molecules
test_fp = [AllChem.GetMorganFingerprintAsBitVect(m, 2, 2048, useFeatures=False) for m in valid_mols]

# Convert RDKit fingerprints to numpy array
def rdkit_numpy_convert(fp):
    output = []
    for f in fp:
        arr = np.zeros((1,))
        DataStructs.ConvertToNumpyArray(f, arr)
        output.append(arr)
    return np.asarray(output)

# Convert fingerprints to numpy arrays
x_train = rdkit_numpy_convert(train_fp)
x_test = rdkit_numpy_convert(test_fp)

# Load selected feature indices from the specified folder
load_folder = "../models"
load_path = os.path.join(load_folder, 'selected_features_svm_binary.npy')
selected_features = np.load(load_path)

# Apply selected features
x_train_selected = x_train[:, selected_features]
x_test_selected = x_test[:, selected_features]

# Load the calibrated SVM model and the threshold
calibrated_model_data = joblib.load(os.path.join(load_folder, "ss_DSA05_binary_morgan_r2_2048_svm_calibrated_with_threshold.joblib"))
loaded_model = calibrated_model_data['model']
loaded_threshold = calibrated_model_data['threshold']

# Define the applicability domain calculation function
def calculate_applicability_domain(X_train, X_test, threshold=0.5):
    distances = euclidean_distances(X_train, X_train)
    avg_distance = np.mean(distances)
    std_distance = np.std(distances)
    
    # Define applicability domain threshold
    apd_threshold = avg_distance + threshold * std_distance
    
    # Calculate test distances to training data
    test_distances = euclidean_distances(X_test, X_train)
    min_distances = np.min(test_distances, axis=1)
    
    # Calculate applicability scores
    applicability_scores = 1 - (min_distances / apd_threshold)
    
    # Ensure the applicability scores are between 0 and 1
    applicability_scores = np.clip(applicability_scores, 0, 1)
    
    # Convert to percentage
    applicability_scores_percentage = applicability_scores * 100
    
    return applicability_scores_percentage

# Calculate applicability domain for test data
applicability_scores = calculate_applicability_domain(x_train_selected, x_test_selected)

# Check for any NaN values in the descriptors
if np.isnan(x_test_selected).any():
    raise ValueError("NaN values found in descriptors.")

# Predict new data
orig_pp1 = loaded_model.predict_proba(x_test_selected)[:, 1]

# Check for NaN values in the predictions
if np.isnan(orig_pp1).any():
    raise ValueError("NaN values found in predictions.")

# Apply the threshold to binarize predictions
predicted_classes = (orig_pp1 >= loaded_threshold).astype(int)

# Convert numpy array to pandas dataframe
# Convert numpy array to pandas dataframe with SMILES column
vs1 = DataFrame({
    "QSAR_READY_SMILES": valid_smiles,  # Ensure SMILES are included
    " Binary Probability Active": np.round(orig_pp1, 2),
    "Binary Predicted Outcome": np.where(orig_pp1 >= loaded_threshold, "1", "0"),
    "Binary Applicabiliity Domain": np.where(applicability_scores >= 50, "Inside", "Outside")
})

# Ensure column names are properly formatted before merging
vs1.columns = vs1.columns.str.strip().str.upper()

# Merge predictions with the original dataset while preserving all original columns
expected_col = "QSAR_READY_SMILES"
if expected_col in test_df.columns and expected_col in vs1.columns:
    test_df = test_df.merge(vs1, on=expected_col, how="left")
else:
    raise KeyError(f"Column '{expected_col}' missing from one of the datasets. Test_df columns: {test_df.columns.tolist()}, vs1 columns: {vs1.columns.tolist()}")

# Save the updated dataset with all original columns + HuSSPred virtual screening results
test_df.to_excel(output_path, sheet_name="Sheet1", index=False)

# Print final dataframe
test_df

[09:10:10] SMILES Parse Error: syntax error while parsing: 
[09:10:10] SMILES Parse Error: Failed parsing SMILES ' ' for input: ' '
[09:10:10] SMILES Parse Error: syntax error while parsing: 
[09:10:10] SMILES Parse Error: Failed parsing SMILES ' ' for input: ' '
[09:10:10] SMILES Parse Error: syntax error while parsing: 
[09:10:10] SMILES Parse Error: Failed parsing SMILES ' ' for input: ' '
[09:10:10] SMILES Parse Error: syntax error while parsing: 
[09:10:10] SMILES Parse Error: Failed parsing SMILES ' ' for input: ' '
[09:10:10] SMILES Parse Error: syntax error while parsing: 
[09:10:10] SMILES Parse Error: Failed parsing SMILES ' ' for input: ' '
[09:10:10] SMILES Parse Error: syntax error while parsing: 
[09:10:10] SMILES Parse Error: Failed parsing SMILES ' ' for input: ' '
[09:10:10] SMILES Parse Error: syntax error while parsing: 
[09:10:10] SMILES Parse Error: Failed parsing SMILES ' ' for input: ' '
[09:10:10] SMILES Parse Error: syntax error while parsing: 
[09:10:10] SMILE

Unnamed: 0,INPUT,FOUND_BY,DTXSID,PREFERRED_NAME,QSAR_READY_SMILES,CASRN,INCHIKEY,BINARY PROBABILITY ACTIVE,BINARY PREDICTED OUTCOME,BINARY APPLICABILIITY DOMAIN
0,DTXSID7049392,DSSTox_Substance_Id,DTXSID7049392,4-Nitrobenzylbromide,[O-][N+](=O)C1=CC=C(CBr)C=C1,100-11-8,VOLRSQPSJGXRNJ-UHFFFAOYSA-N,0.83,1,Inside
1,DTXSID4025745,DSSTox_Substance_Id,DTXSID4025745,4-Nitrobenzyl chloride,[O-][N+](=O)C1=CC=C(CCl)C=C1,100-14-1,KGCNHWXDPDPSBV-UHFFFAOYSA-N,0.83,1,Inside
2,DTXSID8024658,DSSTox_Substance_Id,DTXSID8024658,alpha-Bromotoluene,BrCC1=CC=CC=C1,100-39-0,AGEZXYOZHKGVCM-UHFFFAOYSA-N,0.94,1,Inside
3,DTXSID8039241,DSSTox_Substance_Id,DTXSID8039241,Benzaldehyde,O=CC1=CC=CC=C1,100-52-7,HUMNYLRZRPPJDN-UHFFFAOYSA-N,0.95,1,Inside
4,DTXSID8021147,DSSTox_Substance_Id,DTXSID8021147,Phenylhydrazine,NNC1=CC=CC=C1,100-63-0,HKOOXMFOFWEVGF-UHFFFAOYSA-N,0.87,1,Inside
...,...,...,...,...,...,...,...,...,...,...
8325,DTXSID201021734,DSSTox_Substance_Id,DTXSID201021734,Methyl 2-[(3-phenyl-2-propen-1-ylidene)amino]b...,,94386-48-8,OOJFPYZBLJTFST-UHFFFAOYSA-N,,,
8326,DTXSID0025072,DSSTox_Substance_Id,DTXSID0025072,"1,3-Dihydroxy-2-propanone",,96-26-4,RXKJFZQQPQGTFL-UHFFFAOYSA-N,,,
8327,DTXSID6026616,DSSTox_Substance_Id,DTXSID6026616,Triethylaluminum,,97-93-8,VOITXYVAKOUIBA-UHFFFAOYSA-N,,,
8328,DTXSID20105687,DSSTox_Substance_Id,DTXSID20105687,"Fatty acids, C16-18 and C18-unsatd., esters wi...",,98859-60-0,,,,


# Multiclass Predictions

In [None]:
# Importing packages
import os
from rdkit import Chem
import numpy as np
import pandas as pd
from pandas import DataFrame
import joblib
from sklearn.preprocessing import MinMaxScaler
from mordred import Calculator, descriptors

# Load training compounds from SDF file
data_train = "../models/dataset_ss_DSA05_WES_GHS_SUB MC.sdf"
train_mols = [mol for mol in Chem.SDMolSupplier(data_train) if mol is not None]

# Load test compounds from Excel file (for virtual screening)
data_vs = output_path
output_path2 = "../results/2_BatchSearch_QSAR SMILES_predictions2.xlsx"


# Read the test dataset from Excel
test_df = pd.read_excel(data_vs)

# Standardize column names
test_df.columns = test_df.columns.str.strip().str.upper()

# Identify the correct column for QSAR-ready SMILES
expected_col = "QSAR_READY_SMILES"
if expected_col not in test_df.columns:
    raise KeyError(f"Column '{expected_col}' not found in test_df.")

# Convert SMILES to RDKit molecules
valid_smiles = []
valid_mols = []
for smi in test_df[expected_col]:
    mol = Chem.MolFromSmiles(str(smi)) if pd.notna(smi) else None
    if mol:
        valid_smiles.append(smi)
        valid_mols.append(mol)

if not valid_mols:
    raise ValueError("No valid molecules found in the test dataset.")

# Calculate Mordred descriptors
calc = Calculator(descriptors, ignore_3D=True)
x_train = calc.pandas(train_mols)
x_test = calc.pandas(valid_mols)

# Drop any descriptor columns with NaN values
x_train.dropna(axis=1, how="any", inplace=True)
x_test.dropna(axis=1, how="any", inplace=True)

# Keep only common descriptors
common_columns = x_train.columns.intersection(x_test.columns)
if common_columns.empty:
    raise ValueError("No common descriptor columns found between training and test data.")

x_train = x_train[common_columns]
x_test = x_test[common_columns]

# Min-Max scaling
scaler = MinMaxScaler()
x_train_scaled = scaler.fit_transform(x_train)
x_test_scaled = scaler.transform(x_test)

# Load selected features
load_folder = "../models"
load_path = os.path.join(load_folder, "selected_features_rf_multiclass.npy")
selected_features = np.load(load_path, allow_pickle=True)

# Ensure selected features exist in both datasets
selected_features = [f for f in selected_features if f in common_columns]
if not selected_features:
    raise ValueError("No selected features remain after NaN removal.")

# Apply feature selection
selected_indices = [x_train.columns.get_loc(feature) for feature in selected_features]
x_train_selected = x_train_scaled[:, selected_indices]
x_test_selected = x_test_scaled[:, selected_indices]

# **FIX: Replace NaN values with 0 (to prevent errors)**
x_train_selected = np.nan_to_num(x_train_selected, nan=0.0)
x_test_selected = np.nan_to_num(x_test_selected, nan=0.0)

# Define the applicability domain calculation function
def calculate_applicability_domain(X_train, X_test, threshold=0.5):
    distances = euclidean_distances(X_train, X_train)
    avg_distance = np.mean(distances)
    std_distance = np.std(distances)
    
    # Define applicability domain threshold
    apd_threshold = avg_distance + threshold * std_distance
    
    # Calculate test distances to training data
    test_distances = euclidean_distances(X_test, X_train)
    min_distances = np.min(test_distances, axis=1)
    
    # Calculate applicability scores
    applicability_scores = 1 - (min_distances / apd_threshold)
    
    # Ensure the applicability scores are between 0 and 1
    applicability_scores = np.clip(applicability_scores, 0, 1)
    
    # Convert to percentage
    applicability_scores_percentage = applicability_scores * 100
    
    return applicability_scores_percentage

# Calculate applicability domain for test data
applicability_scores = calculate_applicability_domain(x_train_selected, x_test_selected)

# Load trained model
loaded_model = joblib.load(os.path.join(load_folder, "DSA05_mordred_rf_multiclass.joblib"))

# Predict probabilities and classes
orig_pp1 = loaded_model.predict_proba(x_test_selected)
predicted_classes = np.argmax(orig_pp1, axis=1)

# Save results
vs1 = DataFrame({
    "QSAR_READY_SMILES": valid_smiles,
    "PROB_CLASS_0": np.round(orig_pp1[:, 0], 2),
    "PROB_CLASS_1": np.round(orig_pp1[:, 1], 2),
    "PROB_CLASS_2": np.round(orig_pp1[:, 2], 2),
    "Multiclass Prediction": predicted_classes,
    "Applicability Domain Multiclass": np.where(applicability_scores >= 50, "Inside", "Outside")
})

# Merge results with original dataset
test_df = test_df.merge(vs1, on="QSAR_READY_SMILES", how="left")

# Save to Excel
test_df.to_excel(output_path2, sheet_name="Sheet1", index=False)

# Print final dataframe
test_df

[09:10:11] SMILES Parse Error: syntax error while parsing: 
[09:10:11] SMILES Parse Error: Failed parsing SMILES ' ' for input: ' '
[09:10:11] SMILES Parse Error: syntax error while parsing: 
[09:10:11] SMILES Parse Error: Failed parsing SMILES ' ' for input: ' '
[09:10:11] SMILES Parse Error: syntax error while parsing: 
[09:10:11] SMILES Parse Error: Failed parsing SMILES ' ' for input: ' '
[09:10:11] SMILES Parse Error: syntax error while parsing: 
[09:10:11] SMILES Parse Error: Failed parsing SMILES ' ' for input: ' '
[09:10:11] SMILES Parse Error: syntax error while parsing: 
[09:10:11] SMILES Parse Error: Failed parsing SMILES ' ' for input: ' '
[09:10:11] SMILES Parse Error: syntax error while parsing: 
[09:10:11] SMILES Parse Error: Failed parsing SMILES ' ' for input: ' '
[09:10:11] SMILES Parse Error: syntax error while parsing: 
[09:10:11] SMILES Parse Error: Failed parsing SMILES ' ' for input: ' '
[09:10:11] SMILES Parse Error: syntax error while parsing: 
[09:10:11] SMILE

  return ufunc.reduce(obj, axis, dtype, out, **passkwargs)


100%|██████████| 158/158 [00:00<00:00, 187.02it/s]
 19%|█▉        | 1431/7419 [00:04<00:19, 311.19it/s]

  return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
  return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
  return ufunc.reduce(obj, axis, dtype, out, **passkwargs)


 34%|███▍      | 2528/7419 [00:08<00:15, 323.06it/s]

  return ufunc.reduce(obj, axis, dtype, out, **passkwargs)


 62%|██████▏   | 4635/7419 [00:16<00:09, 296.97it/s]

  return ufunc.reduce(obj, axis, dtype, out, **passkwargs)


 64%|██████▎   | 4717/7419 [00:17<00:08, 322.09it/s]

  return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
  return ufunc.reduce(obj, axis, dtype, out, **passkwargs)


 66%|██████▌   | 4900/7419 [00:17<00:07, 337.67it/s]

  return ufunc.reduce(obj, axis, dtype, out, **passkwargs)


100%|██████████| 7419/7419 [00:27<00:00, 273.67it/s]
  return xp.asarray(numpy.nanmin(X, axis=axis))
  return xp.asarray(numpy.nanmax(X, axis=axis))
https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations
https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations


Unnamed: 0,INPUT,FOUND_BY,DTXSID,PREFERRED_NAME,QSAR_READY_SMILES,CASRN,INCHIKEY,BINARY PROBABILITY ACTIVE,BINARY PREDICTED OUTCOME,BINARY APPLICABILIITY DOMAIN,PROB_CLASS_0,PROB_CLASS_1,PROB_CLASS_2,Multiclass Prediction,Applicability Domain Multiclass
0,DTXSID7049392,DSSTox_Substance_Id,DTXSID7049392,4-Nitrobenzylbromide,[O-][N+](=O)C1=CC=C(CBr)C=C1,100-11-8,VOLRSQPSJGXRNJ-UHFFFAOYSA-N,0.83,1.0,Inside,0.15,0.15,0.70,2.0,Outside
1,DTXSID4025745,DSSTox_Substance_Id,DTXSID4025745,4-Nitrobenzyl chloride,[O-][N+](=O)C1=CC=C(CCl)C=C1,100-14-1,KGCNHWXDPDPSBV-UHFFFAOYSA-N,0.83,1.0,Inside,0.23,0.18,0.60,2.0,Inside
2,DTXSID8024658,DSSTox_Substance_Id,DTXSID8024658,alpha-Bromotoluene,BrCC1=CC=CC=C1,100-39-0,AGEZXYOZHKGVCM-UHFFFAOYSA-N,0.94,1.0,Inside,0.20,0.25,0.55,2.0,Outside
3,DTXSID8039241,DSSTox_Substance_Id,DTXSID8039241,Benzaldehyde,O=CC1=CC=CC=C1,100-52-7,HUMNYLRZRPPJDN-UHFFFAOYSA-N,0.95,1.0,Inside,0.23,0.42,0.35,1.0,Inside
4,DTXSID8021147,DSSTox_Substance_Id,DTXSID8021147,Phenylhydrazine,NNC1=CC=CC=C1,100-63-0,HKOOXMFOFWEVGF-UHFFFAOYSA-N,0.87,1.0,Inside,0.25,0.36,0.39,2.0,Inside
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
123393,DTXSID201021734,DSSTox_Substance_Id,DTXSID201021734,Methyl 2-[(3-phenyl-2-propen-1-ylidene)amino]b...,,94386-48-8,OOJFPYZBLJTFST-UHFFFAOYSA-N,,,,,,,,
123394,DTXSID0025072,DSSTox_Substance_Id,DTXSID0025072,"1,3-Dihydroxy-2-propanone",,96-26-4,RXKJFZQQPQGTFL-UHFFFAOYSA-N,,,,,,,,
123395,DTXSID6026616,DSSTox_Substance_Id,DTXSID6026616,Triethylaluminum,,97-93-8,VOITXYVAKOUIBA-UHFFFAOYSA-N,,,,,,,,
123396,DTXSID20105687,DSSTox_Substance_Id,DTXSID20105687,"Fatty acids, C16-18 and C18-unsatd., esters wi...",,98859-60-0,,,,,,,,,
