
# ChemEmbed — Simple Testing Notebook (Single File)
## 1. Overview
This Jupyter notebook, `ChemEmbed_Simple_Test.ipynb`, is a self-contained testing environment for the ChemEmbed chemical structure prediction pipeline. It runs the entire workflow, from data loading to prediction and result matching, inside a single notebook, so the pipeline itself needs no command-line invocation or external scripts.

The main advantages are simplicity and portability: edit a single configuration dictionary and run the cells in order to test the pipeline with your own data and models.

### How it works
The pipeline can operate in two primary modes:

**with_smiles**: The input MSP file contains SMILES strings for each spectrum (typically for training or validation).

**without_smiles**: The input MSP file only contains spectral data (the standard use case for predicting unknown compounds).

It also handles different ionization modes by specifying an adduct type:

**+**: Positive mode

**-**: Negative mode
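
As a minimal sketch of how the adduct setting selects inputs, the helper below picks the MSP file and model path for a run. The function name `select_inputs` is hypothetical and not part of the repository; the dictionary keys match those used in this notebook's `CONFIG` cell. Unlike the notebook, which treats any value other than `+` as negative mode, this sketch rejects unknown adducts.

```python
from typing import Tuple


def select_inputs(cfg: dict) -> Tuple[str, str]:
    """Return (msp_file, model_path) for the configured adduct.

    Hypothetical helper for illustration only; the notebook performs
    the equivalent selection inline in its pipeline cell.
    """
    adduct = cfg["adduct"]
    if adduct not in ("+", "-"):
        raise ValueError(f"adduct must be '+' or '-', got {adduct!r}")
    suffix = "positive" if adduct == "+" else "negative"
    return cfg[f"msp_file_{suffix}"], cfg[f"model_path_{suffix}"]


# Example with placeholder paths (not real files).
example_cfg = {
    "adduct": "-",
    "msp_file_positive": "pos.msp",
    "msp_file_negative": "neg.msp",
    "model_path_positive": "model_pos",
    "model_path_negative": "model_neg",
}
print(select_inputs(example_cfg))
```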

## To run this notebook

### Installation

**Clone this repository to your local machine**:

```bash
git clone https://github.com/massspecdl/ChemEmbed.git
cd ChemEmbed
```

**Create a virtual environment (optional but recommended)**:

```bash
python -m venv venv
source venv/bin/activate  # On Windows use `venv\Scripts\activate`
```

**Install required dependencies**:

```bash
pip install -r requirements.txt
```

**Download the data and models** by following the Setup Instructions in the Repository Management section below.

## 📂 Repository Management

This section outlines the structure of the repository, describes the key files, and provides instructions for setting up the necessary data and models.

### File & Directory Descriptions

The repository is organized into source code, pre-trained models, sample data, and a reference database.

* **`ChemBERT_model/`**: Not used by this notebook.
* **`Trained_ChemEmbed_Model_Only_Positive_Mode`**: The CNN pre-trained model in **positive** ionization mode.
* **`Trained_ChemEmbed_Model_Only_Negative_Mode`**: The CNN pre-trained model in **negative** ionization mode.
* **`sample_reference_database.pkl`**: A sample reference database used to match model predictions against known chemical structures.
* **Sample Data (`.msp` files)**: These files provide sample MS/MS data for testing the pipeline. They are categorized as follows:
    * **`Sample_Input_Positive_Spectra_with_SMILE.msp`**: Positive mode spectra that include ground truth SMILES strings.
    * **`Sample_Input_Positive_Spectra_without_SMILE.msp`**: Positive mode spectra without SMILES strings.
    * **`Sample_Input_Negative_Spectra_With_SMILE.msp`**: Negative mode spectra that include ground truth SMILES strings.
    * **`Sample_Input_Negative_Spectra_Without_SMILE.msp`**: Negative mode spectra without SMILES strings.

***

### ⚙️ Setup Instructions

To run this project, you must first download the required models and data files and place them in the correct directory.

1.  **Install `gdown`**: Open your terminal or command prompt and install the `gdown` package to download files from Google Drive.
    ```bash
    pip install gdown
    ```
2.  **Download Files**: Run the following command in your terminal. This will download the compressed folder containing all the necessary models and data.
    ```bash
    gdown "https://drive.google.com/drive/folders/1GEAiTPPTUsLJxYYOr2zVuAm_74LyAwAw?usp=sharing" --folder
    ```
3.  **Unzip the Folder**: Extract the contents of the downloaded `Data_and_model_folder.zip` file.
    ```bash
    unzip Data_and_model_folder.zip
    ```
4.  **Organize Your Repository**: After unzipping, move all the extracted files and folders into your main project directory alongside your Python code. The final structure should look like this:

    ```
    /your_project_root/
    ├── ChemBERT_model/          # No Usage in this notebook
    ├── Trained_ChemEmbed_Model_Only_Positive_Mode
    ├── Trained_ChemEmbed_Model_Only_Negative_Mode
    ├── sample_reference_database.pkl
    ├── Sample_Input_Negative_Spectra_Without_SMILE.msp
    ├── Sample_Input_Positive_Spectra_with_SMILE.msp
    ├── Sample_Input_Positive_Spectra_without_SMILE.msp
    └── Sample_Input_Negative_Spectra_With_SMILE.msp
    ```

Your repository is now set up and ready to run.
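
To confirm the layout before running the notebook, a small check like the following can list which expected entries are missing. This is a sketch under stated assumptions: the names come from the directory listing above, `ChemBERT_model/` is left out because the notebook does not use it, and the function name `missing_entries` is hypothetical.

```python
from pathlib import Path

# Names taken from the directory listing in this README.
EXPECTED_ENTRIES = [
    "Trained_ChemEmbed_Model_Only_Positive_Mode",
    "Trained_ChemEmbed_Model_Only_Negative_Mode",
    "sample_reference_database.pkl",
    "Sample_Input_Negative_Spectra_Without_SMILE.msp",
    "Sample_Input_Positive_Spectra_with_SMILE.msp",
    "Sample_Input_Positive_Spectra_without_SMILE.msp",
    "Sample_Input_Negative_Spectra_With_SMILE.msp",
]


def missing_entries(root: str = ".") -> list:
    """Return the expected names that do not exist under `root`."""
    root_path = Path(root)
    return [name for name in EXPECTED_ENTRIES if not (root_path / name).exists()]


# Check the current working directory; pass another folder if the
# files live elsewhere (for example "Data_and_model_folder").
missing = missing_entries(".")
if missing:
    print("Missing entries:", missing)
else:
    print("All expected files are present.")
```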



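Note: the default `CONFIG` paths in this notebook point into `Data_and_model_folder/`. If you move the files into the project root next to the code, as shown above, update the paths in `CONFIG` accordingly.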


## How to run
**Input Files**

- `msp_file_positive` / `msp_file_negative`: Paths to the input MSP files containing spectra data for positive and negative mode.
- `reference_database`: Path to the reference database pickle file.
- `model_path_positive` / `model_path_negative`: Paths to the pre-trained CNN model files for positive and negative mode.

**Output Files**

- `preprocessed_data`: Path where the preprocessed data will be saved (pickle format).
- `prediction_results`: Path for the final prediction results CSV file. Supports variable substitution for `top_n_candidates`.

**Parameters**

- `top_n_candidates`: Number of top candidate molecules to retrieve from the reference database (default: 5).
- `input_file_type`: Type of the input MSP file. Options are `'with_smiles'` and `'without_smiles'` (default: `'with_smiles'`).
- `adduct`: Ionization mode. Options are `'+'` (M+H) and `'-'` (M-H).



```python
# ==== EDIT ME: Minimal config for testing ====

CONFIG = {
    # Input files
    "msp_file_positive": "Data_and_model_folder/Sample_Input_Positive_Spectra_without_SMILE.msp",
    "msp_file_negative": "Data_and_model_folder/Sample_Input_Negative_Spectra_Without_SMILE.msp",
    "reference_database": "Data_and_model_folder/sample_reference_database.pkl",

    # Model paths
    "model_path_positive": "Data_and_model_folder/Trained_ChemEmbed_Model_Only_Positive_Mode",
    "model_path_negative": "Data_and_model_folder/Trained_ChemEmbed_Model_Only_Negative_Mode",

    # Outputs
    "preprocessed_data": "preprocessed_data.pkl",
    "prediction_results": "prediction_results.csv",

    # Parameters
    "tolerance": 0.01,
    "max_mz": 700,
    "resolution": 0.01,
    "intensity_threshold": 1,   # in percentage
    "top_n_candidates": 5,

    # Input type & adduct
    # Options: 'with_smiles' or 'without_smiles'
    "input_file_type": "without_smiles",
    # Options: '+' (M+H) or '-' (M-H)
    "adduct": "+"
}

print("CONFIG ready. Edit paths/options above as needed.")
```

Output:

```
CONFIG ready. Edit paths/options above as needed.
```


```python
# ==== Validate Config (quick checks) ====

import json

REQUIRED_KEYS = [
    "msp_file_positive", "msp_file_negative", "reference_database",
    "model_path_positive", "model_path_negative",
    "preprocessed_data", "prediction_results",
    "tolerance", "max_mz", "resolution", "intensity_threshold",
    "top_n_candidates", "input_file_type", "adduct"
]

missing = [k for k in REQUIRED_KEYS if k not in CONFIG]
if missing:
    raise KeyError(f"Missing required config keys: {missing}")

print("✅ Config keys look good.\n")
print("Planned run summary:")
print(json.dumps({
    "Input type": CONFIG["input_file_type"],
    "Adduct": CONFIG["adduct"],
    "MSP (+)": CONFIG["msp_file_positive"],
    "MSP (-)": CONFIG["msp_file_negative"],
    "Reference DB": CONFIG["reference_database"],
    "Model (+)": CONFIG["model_path_positive"],
    "Model (-)": CONFIG["model_path_negative"],
    "Preprocessed out": CONFIG["preprocessed_data"],
    "Predictions out": CONFIG["prediction_results"],
    "Params": {
        "tolerance": CONFIG["tolerance"],
        "max_mz": CONFIG["max_mz"],
        "resolution": CONFIG["resolution"],
        "intensity_threshold": CONFIG["intensity_threshold"],
        "top_n_candidates": CONFIG["top_n_candidates"],
    }
}, indent=2))
```

Output:

```
✅ Config keys look good.

Planned run summary:
{
  "Input type": "without_smiles",
  "Adduct": "+",
  "MSP (+)": "Data_and_model_folder/Sample_Input_Positive_Spectra_without_SMILE.msp",
  "MSP (-)": "Data_and_model_folder/Sample_Input_Negative_Spectra_Without_SMILE.msp",
  "Reference DB": "Data_and_model_folder/sample_reference_database.pkl",
  "Model (+)": "Data_and_model_folder/Trained_ChemEmbed_Model_Only_Positive_Mode",
  "Model (-)": "Data_and_model_folder/Trained_ChemEmbed_Model_Only_Negative_Mode",
  "Preprocessed out": "preprocessed_data.pkl",
  "Predictions out": "prediction_results.csv",
  "Params": {
    "tolerance": 0.01,
    "max_mz": 700,
    "resolution": 0.01,
    "intensity_threshold": 1,
    "top_n_candidates": 5
  }
}
```
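
The validation cell only checks that the keys exist. A value check could be sketched as follows. The function name `check_config_values` is hypothetical, and the accepted ranges (positive numbers for `tolerance`, `max_mz` and `resolution`, 0 to 100 for the percentage threshold, at least 1 candidate) are assumptions rather than rules taken from the repository.

```python
def _is_number(value) -> bool:
    """True for int/float values, excluding booleans."""
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def check_config_values(cfg: dict) -> list:
    """Return a list of problems found; an empty list means no issue was detected."""
    problems = []

    if cfg.get("input_file_type") not in ("with_smiles", "without_smiles"):
        problems.append("input_file_type must be 'with_smiles' or 'without_smiles'")

    if cfg.get("adduct") not in ("+", "-"):
        problems.append("adduct must be '+' or '-'")

    # Assumed: these numeric parameters must be strictly positive.
    for key in ("tolerance", "max_mz", "resolution"):
        value = cfg.get(key)
        if not _is_number(value) or value <= 0:
            problems.append(f"{key} must be a positive number, got {value!r}")

    # Assumed: the threshold is a percentage between 0 and 100.
    threshold = cfg.get("intensity_threshold")
    if not _is_number(threshold) or not 0 <= threshold <= 100:
        problems.append(f"intensity_threshold must be within 0-100, got {threshold!r}")

    top_n = cfg.get("top_n_candidates")
    if not isinstance(top_n, int) or isinstance(top_n, bool) or top_n < 1:
        problems.append(f"top_n_candidates must be an integer >= 1, got {top_n!r}")

    return problems


# Example using the parameter values from this notebook's CONFIG.
example = {
    "input_file_type": "without_smiles",
    "adduct": "+",
    "tolerance": 0.01,
    "max_mz": 700,
    "resolution": 0.01,
    "intensity_threshold": 1,
    "top_n_candidates": 5,
}
print(check_config_values(example))
```

In the notebook, the same function could be called on `CONFIG` right after the key check, raising an error if the returned list is not empty.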


```python
# ==== Run Pipeline (same logic as main.py, but inline) ====

from torch.utils.data import DataLoader

try:
    # Imports from your project
    from data_processing import (
        msp_to_dataframe_with_smiles,
        msp_to_dataframe_without_smiles,
        preprocess_spectra_with_smiles,
        preprocess_spectra_without_smiles,
        process_data_with_smiles,
        process_data_without_smiles
    )

    from model_utils import (
        load_model,
        predict_with_smiles,
        predict_without_smiles
    )

    from reference_utils import (
        load_reference_database_with_smiles,
        load_reference_database_without_smiles,
        match_predictions_to_reference_with_smiles,
        match_predictions_to_reference_without_smiles
    )
except ModuleNotFoundError as e:
    print("❌ Import error:", e)
    print("Tip: Make sure your repo root is the working directory or add it to PYTHONPATH.")
    raise

def run_pipeline(cfg: dict):
    input_type = cfg['input_file_type']
    adduct = cfg['adduct']

    if input_type == 'with_smiles':
        # Processing for 'with_smiles'
        if adduct == "+":
            print("Adduct=+, loading positive MSP (with SMILES).")
            msp_df = msp_to_dataframe_with_smiles(cfg['msp_file_positive'])
        else:
            print("Adduct=-, loading negative MSP (with SMILES).")
            msp_df = msp_to_dataframe_with_smiles(cfg['msp_file_negative'])
        norm_df = preprocess_spectra_with_smiles(msp_df, cfg['intensity_threshold'])
        final_up = process_data_with_smiles(norm_df, cfg['tolerance'], cfg['resolution'], cfg['max_mz'])
        final_up.to_pickle(cfg['preprocessed_data'])
        from data_loaders import inference_dataset_loader as data_loader_module
        test_dataset = data_loader_module.class_ls(cfg['preprocessed_data'])
        predict_fn = predict_with_smiles
        reference_loader_fn = load_reference_database_with_smiles
        matcher_fn = match_predictions_to_reference_with_smiles
    else:
        # Processing for 'without_smiles'
        if adduct == "+":
            print("Adduct=+, loading positive MSP (without SMILES).")
            msp_df = msp_to_dataframe_without_smiles(cfg['msp_file_positive'])
        else:
            print("Adduct=-, loading negative MSP (without SMILES).")
            msp_df = msp_to_dataframe_without_smiles(cfg['msp_file_negative'])
        norm_df = preprocess_spectra_without_smiles(msp_df, cfg['intensity_threshold'])
        final_up = process_data_without_smiles(norm_df, cfg['tolerance'], cfg['resolution'], cfg['max_mz'])
        final_up.to_pickle(cfg['preprocessed_data'])
        from data_loaders import spectra_inference_dataset_loader as data_loader_module
        test_dataset = data_loader_module.class_ls(cfg['preprocessed_data'])
        predict_fn = predict_without_smiles
        reference_loader_fn = load_reference_database_without_smiles
        matcher_fn = match_predictions_to_reference_without_smiles

    # DataLoader
    test_loader = DataLoader(dataset=test_dataset,
                             batch_size=1,
                             drop_last=True,
                             shuffle=False,
                             num_workers=0)

    # Load model based on adduct
    if adduct == "+":
        model_cnn = load_model(cfg['model_path_positive'])
    else:
        model_cnn = load_model(cfg['model_path_negative'])

    # Predict
    prediction_df = predict_fn(model_cnn, test_loader, input_type)

    # Load reference
    reference_df = reference_loader_fn(cfg['reference_database'], adduct)

    # Match predictions
    final_results_df = matcher_fn(prediction_df, reference_df, cfg['top_n_candidates'], input_type, adduct)

    # Save results
    out_csv = cfg['prediction_results']
    final_results_df.to_csv(out_csv, index=False)
    print(f"✅ Saved predictions to: {out_csv}")

    return final_results_df

# Execute
try:
    results_df = run_pipeline(CONFIG)
    print("✅ Pipeline completed.")
except FileNotFoundError as e:
    print("❌ File not found:", e)
    print("Check the CONFIG paths for MSP/model/reference files.")
except Exception as e:
    print("⚠️ Pipeline error:", repr(e))
```

Example output when the MSP file is not found at the configured path:

```
Adduct=+, loading positive MSP (without SMILES).
❌ File not found: [Errno 2] No such file or directory: 'Data_and_model_folder/Sample_Input_Positive_Spectra_without_SMILE.msp'
Check the CONFIG paths for MSP/model/reference files.
```


```python
# ==== Preview Results (if the CSV exists) ====

import os
import pandas as pd
from IPython.display import display  # explicit import so the cell also works outside a notebook kernel

out_path = CONFIG.get("prediction_results", "prediction_results.csv")
if os.path.exists(out_path):
    df = pd.read_csv(out_path)
    print(f"Loaded: {out_path} | shape={df.shape}")
    display(df.head(10))
else:
    print(f"No predictions file at: {out_path}")
```


Output from a successful run:

```
Loaded: prediction_results.csv | shape=(79, 16)
```

First 10 rows as displayed (long SMILES strings appear truncated with `...`):

```
Unnamed: 0,Unique_ID,Top_1_cosine,Top_2_cosine,Top_3_cosine,Top_4_cosine,Top_5_cosine,Top_1_SMILE,Top_2_SMILE,Top_3_SMILE,Top_4_SMILE,Top_5_SMILE,Top_1_InChIKey,Top_2_InChIKey,Top_3_InChIKey,Top_4_InChIKey,Top_5_InChIKey
0,"('ID1',)",0.987484,0.984343,0.982876,0.982376,0.981056,O=C1OC2=C(C=C1)C=CC=3OC(C)(C)C(OC4OC(CO)C(O)C(...,O=C(OCC1OC(OC2C(=O)OCC2C)C(O)C(O)C1O)C=CC=3C=C...,CC(C)(OC1OC(CO)C(O)C(O)C1O)C1Cc2c(ccc3ccc(=O)o...,CC(C)(O[C@@H]1O[C@H](CO)[C@@H](O)[C@H](O)[C@H]...,COc1cc(O[C@@H]2O[C@H](CO)[C@@H](O)[C@H](O)[C@H...,BPBRRMGZTUDRDI,CEKFJUIHLOBFGM,UCJHITBUIWHISE,HXCGUCZXPFBNRD,GHKWPHRULCFTBB
1,"('ID2',)",0.96829,0.954043,0.95326,0.904643,,O=C(CCC1=CC=C(O)C(O)=C1)CC(OC2OCC(O)C(O)C2O)CC...,O=C(OCC1=CC(O)C2CCOC(OC3OC(CO)C(O)C(O)C3O)C12)...,O=C1C(O)=C2C(C3=C(O)C(O)=C(C(O)=C13)C(CO)COC(=...,O=C(C=1C(O)=C(C(OC)=C(C1O)CC2=C(O)C(=C(O)C(=C2...,,AQRNEKDRSXYJIN,CKCOHJIGJRMCES,BLLJRCSMXYBSCC,CJYWXYKMJMRSBY,
2,"('ID5',)",0.941018,0.940985,0.940542,0.939786,0.932445,O=C1OCCC(C)C(O)C(=O)OOC23CCC(=CC3OC4CC(OC(=O)C...,O=C(OCC12C(OC(=O)C)C(=O)C(C)C3(CC(OC3OC(=O)C)C...,O=C(OC1OC(C2=COC=C2)CC13C(C)CC45OCC67OC(OCC47C...,O=C(OCC12C(OC(=O)C)CC(C)C3(C(=O)OC(C4=COC=C4)C...,COC(=O)OC1C=C2C(=CC(=O)OC2(C)C)C(C)C2CC3(C)C(=...,CNNHLYVIVPPUNJ,CSZXTLSYVDFPEB,DVJHEQZGVQCVQF,BVKLUYUFIJHFAO,NLXBYYROKNGJOC
3,"('ID6',)",0.952908,0.95221,0.951438,0.950261,0.945487,O=C(OC1OC(C2=COC=C2)CC13C(C)CC45OCC67OC(OCC47C...,O=C(OCC12C(OC(=O)C)C(=O)C(C)C3(CC(OC3OC(=O)C)C...,O=C(OCC12C(OC(=O)C)CC(C)C3(C(=O)OC(C4=COC=C4)C...,O=C1OCCC(C)C(O)C(=O)OOC23CCC(=CC3OC4CC(OC(=O)C...,COC(=O)OC1C=C2C(=CC(=O)OC2(C)C)C(C)C2CC3(C)C(=...,DVJHEQZGVQCVQF,CSZXTLSYVDFPEB,BVKLUYUFIJHFAO,CNNHLYVIVPPUNJ,NLXBYYROKNGJOC
4,"('ID7',)",0.96505,0.960649,0.960031,0.95979,0.959694,OC=1C=C(OC2OC(CO)C(O)C(O)C2O)C=C3OC(C=4C=CC=CC...,OC1=CC=C(C=C1OC)C2OC(O)C3C(OCC23)C4=CC(OC)=C(O...,O=C(OC)C1(O)CC2=CC(OC)=C(O)C=C2C(C3=CC=C(O)C(O...,O=C(C1=CC(OC)=C(O)C(OC)=C1)C2COC(C3=CC=C(O)C(O...,COc1ccc(/C=C/c2cc(O)cc(O[C@@H]3O[C@H](CO)[C@@H...,AAHNTCWRJBNODQ,DQFGZXKOVWIUGY,AHZRNDDIIHQSQY,FCUUNRWODYPBEE,MFMQRDLLSRLUJY
5,"('ID9',)",0.948677,0.935848,0.933014,0.931897,0.921545,O=C(OCC1OC(OCCC=CCC=2C(=O)CCC2C)C(O)C(O)C1O)C=...,O=C(OCC12C(OC(=O)C)C(=O)C(C)C3(CC(OC3OC(=O)C)C...,O=C(OCC12C(OC(=O)C)CC(C)C3(C(=O)OC(C4=COC=C4)C...,O=CC1=CCC(OC(=O)C)C(=CCC(OC(=O)C)C2(OC2C3OC(=O...,COC(=O)OC1C=C2C(=CC(=O)OC2(C)C)C(C)C2CC3(C)C(=...,AYSWQYNATUUOGQ,CSZXTLSYVDFPEB,BVKLUYUFIJHFAO,FYCJDKWSHNOLAG,NLXBYYROKNGJOC
6,"('ID10',)",0.984175,0.982955,0.982955,0.980052,0.979746,O=C1C=C(OC2=C1C(=CC(OC)=C2C3OC(CO)C(O)C(O)C3O)...,COc1cc(O[C@@H]2O[C@H](CO)[C@@H](O)[C@H](O)[C@H...,COc1cc(OC2OC(CO)C(O)C(O)C2O)cc2cc(C)c(C(C)=O)c...,O=C1OC2=C(C=C1)C=CC=3OC(C)(C)C(OC4OC(CO)C(O)C(...,O=C(OCC1OC(OC2C(=O)OCC2C)C(O)C(O)C1O)C=CC=3C=C...,CNMGVPHFWMWCGL,GHKWPHRULCFTBB,FEZDDTIDMGTSLT,BPBRRMGZTUDRDI,CEKFJUIHLOBFGM
7,"('ID11',)",0.966461,0.958574,0.956658,0.955036,0.951897,O=C(O)C1=C(O)C=C(O)C2=C1CC(OC(=O)C)CC2C,O=C1OC(CC2=CC=C(O)C(OC)=C12)CCC(=O)OC,O=C(O)C1=CC=C2OC(CC2=C1)C(O)(C)COC(=O)C,O=C1C2=C(O)C=C(OC)C(O)=C2C(O)C(C1)CC(=O)C,O=C1C2=C(O)C=C(OC)C=C2C(O)C(O)(C1)CC(=O)C,FQAQKEQYUMXFEA,CSIZCWHTOQSSKC,AGZADXYKMDGMON,DFKLLVDXSVLRDE,CYKQKQBPUJQHFY
8,"('ID12',)",0.967126,0.961954,0.961487,0.957625,0.957118,O=C(OCC1OC(OC2COC(=O)C2C)C(O)C(O)C1O)C=CC3=CC=...,O=C(OCC=1C=CC=CC1OC2OC(CO)C(O)C(O)C2O)C3(O)C=C...,CC(=O)OCC1OC(Oc2c(C)c(O)c3c(=O)cc(C)oc3c2C)C(O...,O=C1OC=2C(O)=C3OC(C)(C)C(OC4OC(CO)C(O)C(O)C4O)...,CC(C)(O)C1Cc2cc3ccc(=O)oc3c(OC3OC(CO)C(O)C(O)C...,AAQFUUDEHZBVHT,CZDNLUMNELLDDD,NJRLDXGREKHXGV,DJNJDDXGDIUVGC,JWWFVRMFYKPZNE
9,"('ID13',)",0.938371,0.912507,0.908634,0.87287,0.787278,O=C(OC1C2C3(OC(=O)C)COC3CC(O)C2(C(=O)C(O)C4=C(...,O=C(OCC(=O)C1CCC2(O)C3CCC4(OC(=O)C)CC(OC(=O)C)...,O=C(OC1C(=C2C3OC(=O)C(O)(C)C3(O)C(OC(=O)C(C)CC...,OC1=CC=C(C=C1OC)CC(C)C(C)CC2=CC(OC)=C(OC3OC(CO...,O=C(OC(C1=CC(OC)=C2OCOC2=C1)C(OC(=O)C(O)(C)C(O...,GADAUHGBDGVMET,GADAHPHEPRHMOO,FIAZIVNRHQWTPY,FWXROUSCNCIRNX,CIKCPLCXGQIZCS
```
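
For downstream use, the results file can be reduced to the best candidate per spectrum. The sketch below is not part of the repository and the function name `top1_candidates` is hypothetical. It relies only on the column names visible in the preview above (`Unique_ID`, `Top_1_cosine`, `Top_1_InChIKey`) and assumes `Unique_ID` is stored as the string form of a one-element tuple, as the preview suggests.

```python
import ast
import csv
import io


def top1_candidates(csv_text: str) -> list:
    """Return (spectrum_id, top-1 cosine, top-1 InChIKey) for each row."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        raw_id = row["Unique_ID"]
        # Unique_ID looks like "('ID1',)"; unwrap it when it parses as a tuple.
        try:
            parsed = ast.literal_eval(raw_id)
        except (ValueError, SyntaxError):
            parsed = raw_id
        if isinstance(parsed, tuple) and parsed:
            spectrum_id = parsed[0]
        else:
            spectrum_id = raw_id
        # Empty score cells (fewer candidates than requested) become None.
        score = float(row["Top_1_cosine"]) if row["Top_1_cosine"] else None
        rows.append((spectrum_id, score, row["Top_1_InChIKey"]))
    return rows


# Small inline sample using the first row of the preview above.
sample = (
    "Unique_ID,Top_1_cosine,Top_1_InChIKey\n"
    "\"('ID1',)\",0.987484,BPBRRMGZTUDRDI\n"
)
print(top1_candidates(sample))

# With a real results file:
# with open("prediction_results.csv", newline="") as fh:
#     print(top1_candidates(fh.read()))
```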



---

### Tips
- If imports fail, ensure the notebook's working directory is the project root (where those modules live),
  or add the path to `sys.path`:
  ```python
  import sys, pathlib
  sys.path.append(str(pathlib.Path('.').resolve()))
  ```
- You can share this single notebook with a colleague, who can test the pipeline by editing only the `CONFIG` cell, provided the project modules and data files are available on their machine.