<a href="https://colab.research.google.com/github/pcmay/AI-Guidelines/blob/main/ALyzer3DAI_batch.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>




<div style="display: flex; justify-content: space-between; align-items: center;">
<img src="https://raw.githubusercontent.com/petercmay89/ALyzer3D.AI/main/white.png" width="10%">
<img src="https://raw.githubusercontent.com/petercmay89/ALyzer3D.AI/main/ALyzer3D.AI_logo.png" width="25%">
<img src="https://raw.githubusercontent.com/petercmay89/ALyzer3D.AI/main/white.png" width="25%">
<img src="https://raw.githubusercontent.com/petercmay89/ALyzer3D.AI/main/ColabFold_logo.png" width="25%">
<img src="https://raw.githubusercontent.com/petercmay89/ALyzer3D.AI/main/white.png" width="10%">
</div>



Welcome to **ALyzer3D.AI BATCH**. This notebook allows you to predict the amyloidogenicity of the VL domains of a list of light chains. The tool will first generate 3D structures with [ColabFold](https://colab.research.google.com/github/sokrypton/ColabFold/blob/main/AlphaFold2.ipynb) and then automatically analyze them with the ALyzer3D.AI model.

**Instructions:**

1. **Enter Your Sequences**: In the first cell (sequences_input), paste the amino acid sequences of your light chains' VL domains. Format: >ID1:sequence1;>ID2:sequence2;>ID3:sequence3;...
2. **Select a GPU**: Click Runtime, select Change runtime type, select T4 GPU (or any GPU option available). Click Save.
3. **Run Everything**: Click on the menu Runtime -> Run all.

The notebook will now execute all the steps for you: it will install dependencies, run the ColabFold structure predictions and perform the ALyzer3D.AI analyses on the resulting top-ranked structures. At Step 3, you will be able to download a CSV file with the results.


---



In [None]:
#@title Install Dependencies and Mount
import os
import sys
from sys import version_info

# Check if ColabFold and its dependencies are already installed
if not os.path.isfile("COLABFOLD_READY"):
    print("Installing ColabFold...")
    # Install ColabFold
    os.system("pip install -q --no-warn-conflicts 'colabfold[alphafold-minus-jax] @ git+https://github.com/sokrypton/ColabFold'")

    # Fix for TensorFlow "undefined symbol" error
    os.system("rm -f /usr/local/lib/python3.*/dist-packages/tensorflow/core/kernels/libtfkernel_sobol_op.so")

    # Create symbolic links
    os.system("ln -s /usr/local/lib/python3.*/dist-packages/colabfold colabfold")
    os.system("ln -s /usr/local/lib/python3.*/dist-packages/alphafold alphafold")
    os.system("touch COLABFOLD_READY")

# Install ALyzer3D.AI dependencies
print("Installing ALyzer3D.AI and its dependencies...")
os.system("git clone https://github.com/petercmay89/ALyzer3D.AI.git > /dev/null 2>&1")
sys.path.insert(0, '/content/ALyzer3D.AI')
# Added biopython to the install list
os.system("pip install -q transformers scikit-learn joblib biopython > /dev/null 2>&1")

In [None]:
#@title Run Batch Prediction and Analysis

#@markdown ### Paste your sequences below (format: `>ID1:SEQUENCE;>ID2:SEQUENCE`)
sequences_input = '>ID1:DIRLTQSPSSLSASVGDRVTITCQASQHINNYLNWYQHKPGQAPKVLIYDASNLATGVPSRFSGNGSGTHFTLTINSLQPEDAATYYCQQHDDLPLTFGGGTKVEIR;>ID2:DIRLTQSPSSLSASVGDRVTITCQASQHINNYLNWYQHKPGQAPKVLIYDASNLATGVPSRFSGNGSGTHFTLTINSLQPEDAATYYCQQHDDLPLTFGGGTKVEIR;>ID3:DIRLTQSPSSLSASVGDRVTITCQASQHINNYLNWYQHKPGQAPKVLIYDASNLATGVPSRFSGNGSGTHFTLTINSLQPEDAATYYCQQHDDLPLTFGGGTKVEIR;>ID4:DIRLTQSPSSLSASVGDRVTITCQASQHINNYLNWYQHKPGQAPKVLIYDASNLATGVPSRFSGNGSGTHFTLTINSLQPEDAATYYCQQHDDLPLTFGGGTKVEIR;>ID5:DIRLTQSPSSLSASVGDRVTITCQASQHINNYLNWYQHKPGQAPKVLIYDASNLATGVPSRFSGNGSGTHFTLTINSLQPEDAATYYCQQHDDLPLTFGGGTKVEIR;>ID6:DIRLTQSPSSLSASVGDRVTITCQASQHINNYLNWYQHKPGQAPKVLIYDASNLATGVPSRFSGNGSGTHFTLTINSLQPEDAATYYCQQHDDLPLTFGGGTKVEIR;>ID7:DIRLTQSPSSLSASVGDRVTITCQASQHINNYLNWYQHKPGQAPKVLIYDASNLATGVPSRFSGNGSGTHFTLTINSLQPEDAATYYCQQHDDLPLTFGGGTKVEIR;>ID8:DIRLTQSPSSLSASVGDRVTITCQASQHINNYLNWYQHKPGQAPKVLIYDASNLATGVPSRFSGNGSGTHFTLTINSLQPEDAATYYCQQHDDLPLTFGGGTKVEIR;>ID9:DIRLTQSPSSLSASVGDRVTITCQASQHINNYLNWYQHKPGQAPKVLIYDASNLATGVPSRFSGNGSGTHFTLTINSLQPEDAATYYCQQHDDLPLTFGGGTKVEIR;>ID10:DIRLTQSPSSLSASVGDRVTITCQASQHINNYLNWYQHKPGQAPKVLIYDASNLATGVPSRFSGNGSGTHFTLTINSLQPEDAATYYCQQHDDLPLTFGGGTKVEIR;>ID11:DIRLTQSPSSLSASVGDRVTITCQASQHINNYLNWYQHKPGQAPKVLIYDASNLATGVPSRFSGNGSGTHFTLTINSLQPEDAATYYCQQHDDLPLTFGGGTKVEIR;>ID12:DIRLTQSPSSLSASVGDRVTITCQASQHINNYLNWYQHKPGQAPKVLIYDASNLATGVPSRFSGNGSGTHFTLTINSLQPEDAATYYCQQHDDLPLTFGGGTKVEIR;>ID13:DIRLTQSPSSLSASVGDRVTITCQASQHINNYLNWYQHKPGQAPKVLIYDASNLATGVPSRFSGNGSGTHFTLTINSLQPEDAATYYCQQHDDLPLTFGGGTKVEIR;>ID14:DIRLTQSPSSLSASVGDRVTITCQASQHINNYLNWYQHKPGQAPKVLIYDASNLATGVPSRFSGNGSGTHFTLTINSLQPEDAATYYCQQHDDLPLTFGGGTKVEIR;>ID15:DIRLTQSPSSLSASVGDRVTITCQASQHINNYLNWYQHKPGQAPKVLIYDASNLATGVPSRFSGNGSGTHFTLTINSLQPEDAATYYCQQHDDLPLTFGGGTKVEIR;>ID16:SASASLGASVNFTCTLSNEHSTYAITWHQQQPKKGPRYLMKVKSDGSHNKGDGIPDRFSGSSSGAERYLTISSLQSDNEADYYCQTWDTDILVFGGGTNLTVL;>ID17:SASASLGASVNFTCTLSNEHSTYAITWHQQQPKKGPRYLMKVKSDGSHNKGDGIPDRFSGSSSGAERYLTISSLQSDNEADYYCQTWDTDILVFGGGTNLTVL;>ID18:SASASLGASVNFTCTLSNEHSTYAITWHQQQPKKGPRYLMKVKSDGSHNKGDGIPDRFSGSSSGAERYLTISSLQSDNEADYYCQTWDTDILVFGGGTNLTVL;>ID19:SASASLGASVNFTCTLSNEHSTYAITWHQQQPKKGPRYLMKVKSDGSHNKGDGIPDRFSGSSSGAERYLTISSLQSDNEADYYCQTWDTDILVFGGGTNLTVL;>ID20:SASASLGASVNFTCTLSNEHSTYAITWHQQQPKKGPRYLMKVKSDGSHNKGDGIPDRFSGSSSGAERYLTISSLQSDNEADYYCQTWDTDILVFGGGTNLTVL;>ID21:SASASLGASVNFTCTLSNEHSTYAITWHQQQPKKGPRYLMKVKSDGSHNKGDGIPDRFSGSSSGAERYLTISSLQSDNEADYYCQTWDTDILVFGGGTNLTVL;>ID22:SASASLGASVNFTCTLSNEHSTYAITWHQQQPKKGPRYLMKVKSDGSHNKGDGIPDRFSGSSSGAERYLTISSLQSDNEADYYCQTWDTDILVFGGGTNLTVL;>ID23:SASASLGASVNFTCTLSNEHSTYAITWHQQQPKKGPRYLMKVKSDGSHNKGDGIPDRFSGSSSGAERYLTISSLQSDNEADYYCQTWDTDILVFGGGTNLTVL;>ID24:SASASLGASVNFTCTLSNEHSTYAITWHQQQPKKGPRYLMKVKSDGSHNKGDGIPDRFSGSSSGAERYLTISSLQSDNEADYYCQTWDTDILVFGGGTNLTVL;>ID25:SASASLGASVNFTCTLSNEHSTYAITWHQQQPKKGPRYLMKVKSDGSHNKGDGIPDRFSGSSSGAERYLTISSLQSDNEADYYCQTWDTDILVFGGGTNLTVL;>ID26:SASASLGASVNFTCTLSNEHSTYAITWHQQQPKKGPRYLMKVKSDGSHNKGDGIPDRFSGSSSGAERYLTISSLQSDNEADYYCQTWDTDILVFGGGTNLTVL;>ID27:SASASLGASVNFTCTLSNEHSTYAITWHQQQPKKGPRYLMKVKSDGSHNKGDGIPDRFSGSSSGAERYLTISSLQSDNEADYYCQTWDTDILVFGGGTNLTVL;>ID28:SASASLGASVNFTCTLSNEHSTYAITWHQQQPKKGPRYLMKVKSDGSHNKGDGIPDRFSGSSSGAERYLTISSLQSDNEADYYCQTWDTDILVFGGGTNLTVL;>ID29:SASASLGASVNFTCTLSNEHSTYAITWHQQQPKKGPRYLMKVKSDGSHNKGDGIPDRFSGSSSGAERYLTISSLQSDNEADYYCQTWDTDILVFGGGTNLTVL;>ID30:SASASLGASVNFTCTLSNEHSTYAITWHQQQPKKGPRYLMKVKSDGSHNKGDGIPDRFSGSSSGAERYLTISSLQSDNEADYYCQTWDTDILVFGGGTNLTVL' #@param {type:"string"}
output_dir = 'colabfold_output'

# --- Standard Parameters ---
num_relax = 0
template_mode = "none"
msa_mode = "mmseqs2_uniref_env"
model_type = "auto"
pair_mode = "unpaired_paired"
num_recycles = "auto"
# ---------------------------

import os
import re
import glob
import json
import numpy as np
import joblib
import torch
import tensorflow as tf
from pathlib import Path
import warnings
from IPython.display import display, HTML
from transformers import AutoTokenizer, EsmModel
from Bio.PDB import PDBParser
from Bio.PDB.Polypeptide import is_aa
from Bio.Data.PDBData import protein_letters_3to1
from Bio.SeqUtils.ProtParam import ProteinAnalysis

# Suppress warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter(action='ignore', category=UserWarning)

from colabfold.download import download_alphafold_params
from colabfold.utils import setup_logging
from colabfold.batch import get_queries, run, set_model_type

# ------------------------------------------------------------------------------
# 1. DEFINE PREDICTOR CLASS (New Logic)
# ------------------------------------------------------------------------------

MODEL_FOLDER_NAME = "paper_model_scalar_pathway_v1_minus5_stripped_80_20_seed3"
REPO_PATH = "/content/ALyzer3D.AI"
FULL_MODEL_DIR = os.path.join(REPO_PATH, MODEL_FOLDER_NAME)
MAX_LENGTH = 120
PLM_MODEL_NAME = "facebook/esm2_t6_8M_UR50D"

class AmyloidPredictor:
    def __init__(self, model_dir):
        self.model_dir = model_dir
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

        print(f"   -> Loading ESM-2 Model ({PLM_MODEL_NAME})...")
        self.tokenizer = AutoTokenizer.from_pretrained(PLM_MODEL_NAME)
        self.plm_model = EsmModel.from_pretrained(PLM_MODEL_NAME).to(self.device)
        self.plm_model.eval()

        print(f"   -> Searching for models in: '{os.path.basename(model_dir)}'")
        self.models = []
        self.scalers = []
        self._load_ensemble()
        print("   -> Predictor initialized.")

    def _load_ensemble(self):
        if not os.path.exists(self.model_dir):
            raise FileNotFoundError(f"Directory '{self.model_dir}' does not exist.")

        model_files = sorted(glob.glob(os.path.join(self.model_dir, "*.keras")))
        if not model_files:
            model_files = sorted(glob.glob(os.path.join(self.model_dir, "*.h5")))
        scaler_files = sorted(glob.glob(os.path.join(self.model_dir, "*.joblib")))

        if not model_files or len(model_files) != len(scaler_files):
            raise ValueError(f"Found {len(model_files)} models and {len(scaler_files)} scalers. Mismatch or empty.")

        for mp, sp in zip(model_files, scaler_files):
            self.models.append(tf.keras.models.load_model(mp, compile=False, safe_mode=False))
            self.scalers.append(joblib.load(sp))

    def _get_embedding(self, sequence):
        inputs = self.tokenizer(
            sequence, return_tensors="pt", truncation=True, max_length=1022
        ).to(self.device)
        with torch.no_grad():
            outputs = self.plm_model(**inputs)
        return outputs.last_hidden_state.squeeze(0).mean(dim=0).cpu().numpy()

    def _calculate_rog(self, pdb_path):
        try:
            parser = PDBParser(QUIET=True)
            structure = parser.get_structure("s", pdb_path)
            model = structure[0]
            atoms = list(model.get_atoms())
            if not atoms: return 0.0
            com = sum(a.coord for a in atoms) / len(atoms)
            rog_sq = sum(np.sum((a.coord - com)**2) for a in atoms)
            n_res = len(list(model.get_residues()))
            return np.sqrt(rog_sq / len(atoms)) / np.sqrt(n_res) if n_res > 0 else 0.0
        except:
            return 0.0

    def _get_biochem_features(self, sequence):
        try:
            seq = "".join(c for c in sequence if c in "ACDEFGHIKLMNPQRSTVWY")
            pa = ProteinAnalysis(seq)
            return [pa.isoelectric_point(), pa.gravy(), pa.aromaticity(), pa.molecular_weight()]
        except:
            return [7.0, 0.0, 0.0, 12000.0]

    def _load_sequence(self, pdb_path):
        parser = PDBParser(QUIET=True)
        chain = parser.get_structure("s", pdb_path)[0].get_chains().__next__()
        return "".join(
            protein_letters_3to1.get(r.get_resname().upper(), 'X')
            for r in chain.get_residues() if is_aa(r, standard=True)
        )

    def predict(self, pdb_path, json_path):
        try:
            sequence = self._load_sequence(pdb_path)
            with open(json_path, 'r') as f:
                data = json.load(f)
        except Exception as e:
            return {"error": f"File loading failed: {e}"}

        plddt = np.array(data['plddt'])
        pae = np.array(data['pae'])

        L_struct = min(len(sequence), len(plddt), pae.shape[0])
        effective_len = L_struct - 5

        if effective_len <= 0:
            return {"error": f"Protein too short (len={L_struct}) for -5 truncation."}

        slice_len = min(effective_len, MAX_LENGTH)

        pad_pae = np.zeros((MAX_LENGTH, MAX_LENGTH))
        pad_pae[:slice_len, :slice_len] = pae[:slice_len, :slice_len]

        pad_plddt = np.zeros(MAX_LENGTH)
        pad_plddt[:slice_len] = plddt[:slice_len]

        pad_row = np.zeros(MAX_LENGTH)
        pad_col = np.zeros(MAX_LENGTH)
        if slice_len > 0:
            pad_row[:slice_len] = np.mean(pae[:slice_len, :slice_len], axis=1)
            pad_col[:slice_len] = np.mean(pae[:slice_len, :slice_len], axis=0)

        embedding = self._get_embedding(sequence)
        biochem = self._get_biochem_features(sequence)
        rog = self._calculate_rog(pdb_path)
        raw_scalars = np.array(biochem + [rog]).reshape(1, -1)

        inputs_base = {
            "pae_input": np.expand_dims(pad_pae, [0, -1]),
            "plddt_input": np.expand_dims(pad_plddt, [0, -1]),
            "embedding_input": np.expand_dims(embedding, 0),
            "pae_row_input": np.expand_dims(pad_row, [0, -1]),
            "pae_col_input": np.expand_dims(pad_col, [0, -1]),
            "length_input": np.array([effective_len])
        }

        fold_preds = []
        for model, scaler in zip(self.models, self.scalers):
            inputs_fold = inputs_base.copy()
            inputs_fold["scalar_features_input"] = scaler.transform(raw_scalars)
            pred = model.predict(inputs_fold, verbose=0)[0][0]
            fold_preds.append(pred)

        avg_prob = np.mean(fold_preds)

        return {
            "sequence": sequence,
            "label": "AMYLOID" if avg_prob > 0.6 else "NON-AMYLOID",
            "probability": float(avg_prob),
            "fold_scores": fold_preds
        }

# ------------------------------------------------------------------------------
# 2. MAIN BATCH EXECUTION
# ------------------------------------------------------------------------------

# This list will store the results for Cell 3
all_results = []

# Create output directory
os.makedirs(output_dir, exist_ok=True)

# Load the AI model
predictor = None
try:
    predictor = AmyloidPredictor(model_dir=FULL_MODEL_DIR)
except Exception as e:
    print(f"❗️ Error loading model: {e}")

# Proceed only if the model was loaded successfully
if predictor:
    # Parse the input sequences
    sequences_to_process = []
    processed_input = sequences_input.strip()
    if processed_input.startswith('>'):
        processed_input = processed_input[1:]
    entries = processed_input.split(';')

    for entry in entries:
        if ':' in entry:
            parts = entry.split(':', 1)
            if len(parts) == 2 and parts[0].strip() and parts[1].strip():
                sequences_to_process.append({'id': parts[0].strip(), 'sequence': parts[1].strip()})
            else:
                print(f"⚠️ Skipping malformed entry part: '{entry}'")
        elif entry.strip():
            print(f"⚠️ Skipping malformed entry (missing ':'): '{entry}'")

    print(f"\n✅ Found {len(sequences_to_process)} sequences to process.")

    # Process each parsed sequence
    for item in sequences_to_process:
        query_id = item['id']
        query_sequence = "".join(item['sequence'].split())

        sanitized_jobname = re.sub(r'\W+', '', query_id)
        jobname = f"{output_dir}/{sanitized_jobname}"
        os.makedirs(jobname, exist_ok=True)

        queries_path = os.path.join(jobname, f"{sanitized_jobname}.csv")
        with open(queries_path, "w") as text_file:
            text_file.write(f"id,sequence\n{sanitized_jobname},{query_sequence}")

        print(f"\n{'='*60}\nProcessing ID: {query_id}")

        result_dir = Path(jobname)
        setup_logging(result_dir.joinpath("log.txt"))

        queries, is_complex = get_queries(queries_path)
        model_type_run = set_model_type(is_complex, model_type)
        num_recycles_parsed = None if num_recycles == "auto" else int(num_recycles)

        download_alphafold_params(model_type_run, Path("."))

        # Run ColabFold
        run(
            queries=queries, result_dir=result_dir, use_templates=(template_mode != "none"),
            num_relax=num_relax, msa_mode=msa_mode, model_type=model_type_run,
            num_models=5, num_recycles=3, model_order=[1, 2, 3, 4, 5],
            is_complex=is_complex, data_dir=Path("."), keep_existing_results=False,
            rank_by="auto", pair_mode=pair_mode, stop_at_score=100.0,
            zip_results=False, user_agent="colabfold/google-colab-main",
        )

        pdb_file = next(Path(jobname).glob("*_unrelaxed_rank_001*.pdb"), None)
        json_file = next(Path(jobname).glob("*_scores_rank_001*.json"), None)

        if pdb_file and json_file:
            print(f"   -> Analyzing structure...")
            result = predictor.predict(pdb_path=str(pdb_file), json_path=str(json_file))

            if result.get("error"):
                print(f"❗️ Analysis Error: {result['error']}")
                all_results.append({
                    "ID": query_id,
                    "Sequence": query_sequence,
                    "Prediction": "Error",
                    "Probability": 0.0,
                    "Notes": result['error']
                })
            else:
                prob = result['probability']
                label = result['label']
                print(f"   -> Prediction: {label} ({prob:.4f})")

                all_results.append({
                    "ID": query_id,
                    "Sequence": result['sequence'],
                    "Prediction": label,
                    "Probability": prob,
                    "Notes": "Success"
                })
        else:
            print(f"❗️ Error: Could not find ColabFold output files for {sanitized_jobname}.")
            all_results.append({
                "ID": query_id, "Sequence": query_sequence,
                "Prediction": "Processing Error",
                "Probability": 0.0,
                "Notes": "ColabFold failed to generate output"
            })

    print("\n\n✅ Batch processing complete.")

In [None]:
#@title Download Results as CSV
import pandas as pd
from google.colab import files

# Check if the 'all_results' list exists and has content
if 'all_results' in locals() and all_results:
    # Convert the list of dictionaries to a pandas DataFrame
    results_df = pd.DataFrame(all_results)

    # Create a cleaner display column for confidence
    results_df['Confidence %'] = (results_df['Probability'] * 100).round(2)

    # Reorder columns slightly for readability
    cols = ['ID', 'Prediction', 'Probability', 'Confidence %', 'Sequence', 'Notes']
    results_df = results_df[cols]

    # Define the CSV filename
    csv_filename = 'alyzer3d_batch_results.csv'

    # Save the DataFrame to a CSV file
    results_df.to_csv(csv_filename, index=False)

    print(f"✅ Results have been saved to '{csv_filename}'.")
    print("Preview of results:")
    display(results_df.head())

    print("\nStarting download...")
    # Trigger the file download in the browser
    files.download(csv_filename)
else:
    print("❗️ No results found to download. Please run the 'Batch Prediction and Analysis' cell (Cell 2) first.")