# Run this in Kaggle

# 🧬 Intelligent SNP-Based Machine Learning for Genetic Risk Assessment

Dataset used in this project:
🔗 https://www.kaggle.com/datasets/huebitsvizg/snp-dataset-1

This dataset contains:
- X_converted.csv → SNP matrix (already converted from raw X.txt)
- y_converted.csv → phenotype values

You do NOT need to convert the raw `.txt` files manually.

---

## 📘 How to Use the Dataset in Kaggle

1. Open the above dataset link.  
2. Click **Add Dataset** → attach it to your Kaggle notebook.  
3. Kaggle will mount the dataset at:


4. All code in this notebook loads files directly from that folder.

You will see these files automatically:
- `/kaggle/input/snp-dataset-1/X_converted.csv`
- `/kaggle/input/snp-dataset-1/y_converted.csv`

These are already cleaned, separated, and ready for PCA + ML.

---

## 📘 What This Notebook Does

This notebook builds a complete ML pipeline for SNP-based genetic risk prediction:

✔ Loads SNP matrix (X) and phenotype vector (y)  
✔ Cleans and aligns data  
✔ Reduces dimensionality using PCA  
✔ Trains 3 models:  
- LightGBM baseline  
- XGBoost (feature-selected)  
- Final optimized XGBoost (GPU)  
✔ Saves:
- final_xgboost_model.pkl  
- top_snps.json  

These two files will later be uploaded into the **Colab Web App** for user-facing inference.

---

## 📘 After Running the Notebook

You will automatically generate:

1. **final_xgboost_model.pkl**  
2. **top_snps.json**

Download these from:



These are the only two files required to run the next stage (Colab web app).

---

## 🚀 You're Ready!

Once the dataset is attached and this notebook is executed:

✔ All preprocessing is handled  
✔ All models are trained  
✔ All final files are exported  
✔ The user can proceed directly to the Colab web application  



In [None]:
import pandas as pd
import numpy as np


## Step 1: Load phenotype (y)

The dataset provides `y.txt` which contains phenotype values.
We load it to understand the number of samples and verify formatting.

This is required before doing SNP alignment.


In [None]:
y_path = "/kaggle/input/snp-dataset-for-gwas/y.txt"

y = pd.read_csv(y_path, header=None)

print("y shape:", y.shape)
print(y.head())


In [None]:
import pandas as pd

# Load space-separated SNP matrix
X = pd.read_csv("/kaggle/input/snp-dataset-for-gwas/X.txt",
                sep=r"\s+",   # raw string avoids an invalid-escape warning
                low_memory=False)

# Save as CSV
X.to_csv("X_converted.csv", index=False)

print("Conversion completed: X_converted.csv created.")


## Step 2: Convert raw X.txt → X_converted.csv

The original dataset is space-separated and extremely large.
This step converts it into a standard CSV format so ML models can load it faster.

NOTE:
- Kaggle version already includes `X_converted.csv`
- You may skip this conversion if using the provided processed dataset
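
If the raw file is too large to convert in one call, the same conversion can be done a block of rows at a time. The sketch below is only an illustration on a tiny temporary file: the helper name `convert_in_chunks`, the toy file names, and the chunk size are assumptions, not part of the original notebook. How much memory it saves depends on the shape of the real matrix.

```python
import os
import tempfile

import pandas as pd


def convert_in_chunks(src_path, dst_path, chunksize=100):
    """Convert a whitespace-separated text file to CSV, `chunksize` rows at a time."""
    reader = pd.read_csv(src_path, sep=r"\s+", chunksize=chunksize)
    for i, chunk in enumerate(reader):
        # Write the header with the first chunk, then append the rest.
        chunk.to_csv(
            dst_path,
            mode="w" if i == 0 else "a",
            header=(i == 0),
            index=False,
        )


# Toy demonstration on a small temporary file (not the real X.txt).
tmp_dir = tempfile.mkdtemp()
src = os.path.join(tmp_dir, "X_toy.txt")
dst = os.path.join(tmp_dir, "X_toy.csv")
with open(src, "w") as f:
    f.write("snp1 snp2 snp3\n")
    for r in range(5):
        f.write(f"{r % 3} {(r + 1) % 3} {(r + 2) % 3}\n")

convert_in_chunks(src, dst, chunksize=2)
converted = pd.read_csv(dst)
print(converted.shape)
```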


In [None]:
y = pd.read_csv("/kaggle/input/snp-dataset-for-gwas/y.txt", header=None)
y.to_csv("y_converted.csv", index=False)
print("y_converted.csv created.")


In [None]:
!head -n 1 /kaggle/input/snp-dataset-1/X_converted.csv | cut -d',' -f1-20


## Step 3: Quick Inspection of Converted SNP Matrix

We inspect the first few columns and rows to confirm:

✔ Column count (SNP markers)  
✔ Valid encoding (0,1,2)  
✔ No parsing errors  

This is especially important when handling very large SNP matrices.
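
The encoding check can also be automated rather than done by eye. The following is a minimal sketch, assuming genotypes are coded as 0/1/2 as described above; the helper `invalid_genotype_columns` and the toy column names are hypothetical.

```python
import pandas as pd


def invalid_genotype_columns(df, allowed=(0, 1, 2)):
    """Return the names of columns holding any value outside `allowed` (NaN counts as invalid)."""
    bad = ~df.isin(list(allowed))
    return [col for col in df.columns if bad[col].any()]


# Toy sample: snp2 contains a 3, snp3 contains a missing value.
sample = pd.DataFrame({
    "snp1": [0, 1, 2],
    "snp2": [0, 3, 1],
    "snp3": [2, 2, None],
})
print(invalid_genotype_columns(sample))
```

On the real data this could be applied to a small slice such as `df_small` before loading the full matrix.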


In [None]:
import pandas as pd

df_small = pd.read_csv(
    "/kaggle/input/snp-dataset-1/X_converted.csv",
    usecols=range(20),   # only first 20 columns
    nrows=2              # only first 2 rows
)

df_small



## Step 4: Clean phenotype vector y

We perform:

✔ Remove non-numeric entries  
✔ Reset index  
✔ Keep exactly the first 1000 samples  
✔ Ensure y aligns with the rows of X for correct training  

This prevents shape mismatches when the models are trained.


In [None]:
import pandas as pd

# Load y
y = pd.read_csv("/kaggle/input/snp-dataset-1/y_converted.csv",
                header=None)

# Keep only numeric rows and convert them to a numeric dtype
y = pd.to_numeric(y[0], errors='coerce').dropna()

# Reset index
y = y.reset_index(drop=True)

# Remove the extra last row (make y = 1000 rows)
y = y.iloc[:1000].reset_index(drop=True)

print("Final y shape:", y.shape)
print(y.head())


## Step 5: Load SNP matrix (X)

We load the entire SNP matrix using dtype=int8 (0/1/2 variants).
int8 stores each genotype in one byte instead of the eight bytes used by the default int64, which reduces RAM usage substantially.
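
To make the memory saving concrete, the sketch below compares the two dtypes on a small synthetic genotype matrix. The matrix size and random seed are arbitrary assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic genotypes: 1000 individuals x 50 SNPs, values in {0, 1, 2}.
rng = np.random.default_rng(0)
genotypes = rng.integers(0, 3, size=(1000, 50))

as_int64 = pd.DataFrame(genotypes.astype("int64"))
as_int8 = as_int64.astype("int8")

# Bytes used by the data columns only (index excluded).
bytes64 = int(as_int64.memory_usage(index=False).sum())
bytes8 = int(as_int8.memory_usage(index=False).sum())

print(bytes64, bytes8, bytes64 / bytes8)
```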


In [None]:
# Load only first row to confirm separator
X_test = pd.read_csv("/kaggle/input/snp-dataset-1/X_converted.csv",
                     nrows=1)

print("Columns detected:", len(X_test.columns))
print(X_test.iloc[:, :10])   # first 10 SNPs


In [None]:
import pandas as pd

X = pd.read_csv(
    "/kaggle/input/snp-dataset-1/X_converted.csv",
    dtype="int8",          # SNPs are 0/1/2 → fits in int8
    low_memory=False
)

print("X shape:", X.shape)


## Step 6: PCA — Reduce Dimensionality

SNP matrices often contain hundreds of thousands of columns.

We reduce to:
- 300 principal components  
- retaining major genetic variation  

This dramatically speeds up modeling.
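
One way to judge whether a given number of components keeps enough variation is the explained-variance ratio. The sketch below computes it with a plain NumPy SVD on synthetic data instead of scikit-learn, purely as an illustration; the helper name and the toy matrix are assumptions. The ratios follow the same definition as `pca.explained_variance_ratio_`.

```python
import numpy as np


def explained_variance_ratio(X, n_components):
    """Fraction of total variance captured by each of the first `n_components` PCs."""
    X_centered = X - X.mean(axis=0)
    # Singular values of the centered matrix; their squares are proportional to PC variances.
    singular_values = np.linalg.svd(X_centered, compute_uv=False)
    variances = singular_values ** 2
    return variances[:n_components] / variances.sum()


# Toy genotype matrix: 100 individuals x 400 SNPs.
rng = np.random.default_rng(42)
X_toy = rng.integers(0, 3, size=(100, 400)).astype(float)

ratios = explained_variance_ratio(X_toy, 30)
print("Variance retained by 30 components:", round(float(ratios.sum()), 3))
```

With the fitted scikit-learn object, `pca.explained_variance_ratio_.sum()` gives the corresponding number for the 300 components used here.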


In [None]:
from sklearn.decomposition import PCA

pca = PCA(n_components=300, random_state=42)

X_reduced = pca.fit_transform(X)

print("Reduced shape:", X_reduced.shape)


## Step 7: Train Baseline Model (LightGBM)

We first train a baseline regressor on PCA-reduced SNPs.

Outputs:
- RMSE error  
- R² score  
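
For reference, both metrics can be written out by hand. This is a minimal NumPy sketch on three made-up values, not the notebook's own evaluation code.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error: square root of the average squared residual."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)


# One prediction is off by 1: RMSE = sqrt(1/3), R2 = 1 - 1/2.
print(rmse([1, 2, 3], [1, 2, 4]))
print(r2([1, 2, 3], [1, 2, 4]))
```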


In [None]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
from lightgbm import LGBMRegressor

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X_reduced, y, test_size=0.2, random_state=42
)

# Model
model_lgbm = LGBMRegressor(
    n_estimators=500,
    learning_rate=0.05,
    max_depth=-1
)

# Train
model_lgbm.fit(X_train, y_train)

# Predict
preds = model_lgbm.predict(X_test)

# Metrics
rmse = mean_squared_error(y_test, preds) ** 0.5  # RMSE; the squared= argument was removed in newer scikit-learn
r2 = r2_score(y_test, preds)

print("LightGBM RMSE:", rmse)
print("LightGBM R2:", r2)


## Step 8: SNP Feature Selection

We compute the correlation of each SNP with the phenotype y
and select the 500 SNPs with the largest absolute correlation.

This reduces noise and can improve model accuracy.
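
The selection step can be sketched on synthetic data as follows. The helper `top_correlated_snps`, the toy column names, and the planted signal in `snp3` and `snp7` are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd


def top_correlated_snps(X, y, k):
    """Return the k column names with the largest absolute Pearson correlation to y."""
    corr = X.corrwith(y)
    return list(corr.abs().sort_values(ascending=False).head(k).index)


# Toy data: 200 individuals x 20 SNPs; only snp3 and snp7 influence the phenotype.
rng = np.random.default_rng(0)
X_toy = pd.DataFrame(
    rng.integers(0, 3, size=(200, 20)),
    columns=[f"snp{i}" for i in range(20)],
)
y_toy = 2.0 * X_toy["snp3"] - 1.5 * X_toy["snp7"] + rng.normal(0, 0.1, size=200)

selected = top_correlated_snps(X_toy, y_toy, k=2)
print(sorted(selected))
```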


In [None]:
# Correlation of every SNP column with the phenotype
corr = X.corrwith(y)

top_snps = corr.abs().sort_values(ascending=False).head(500).index
X_top = X[top_snps]

from xgboost import XGBRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error

X_train, X_test, y_train, y_test = train_test_split(
    X_top, y, test_size=0.2, random_state=42
)

model = XGBRegressor(
    n_estimators=800,
    learning_rate=0.05,
    max_depth=6,
    subsample=0.8,
    colsample_bytree=0.8,
    tree_method="hist",
    device="cuda"    # enables GPU
)

model.fit(X_train, y_train)

preds = model.predict(X_test)

print("RMSE:", mean_squared_error(y_test, preds) ** 0.5)
print("R2:", r2_score(y_test, preds))


## Step 10: FINAL Optimized XGBoost Model

We tune:
- n_estimators  
- learning_rate  
- max_depth  
- reg_alpha/reg_lambda  
- subsample parameters  

This produces the final model used in the web app.


In [None]:
from xgboost import XGBRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X_top, y, test_size=0.2, random_state=42
)

# FINAL MODEL
final_model = XGBRegressor(
    n_estimators=1200,
    learning_rate=0.03,
    max_depth=7,
    subsample=0.9,
    colsample_bytree=0.9,
    reg_alpha=1.0,
    reg_lambda=1.0,
    tree_method="hist",
    device="cuda"
)

# Train
final_model.fit(X_train, y_train)

# Predict
final_preds = final_model.predict(X_test)

# Metrics
final_rmse = mean_squared_error(y_test, final_preds) ** 0.5
final_r2 = r2_score(y_test, final_preds)

print("FINAL MODEL RMSE:", final_rmse)
print("FINAL MODEL R2:", final_r2)


## Step 11: Save model + selected SNP list

Files created:
- final_xgboost_model.pkl  → ML model  
- top_snps.json            → list of selected SNP column names  

These must be downloaded if the user wants to run the Colab app.


In [None]:
import joblib

# Save the model
joblib.dump(final_model, "final_xgboost_model.pkl")

# Save SNP list
import json
with open("top_snps.json", "w") as f:
    json.dump(list(top_snps), f)

print("Model and SNP list saved!")


## Step 12: Reload saved model for validation

We load:
- model  
- selected SNP list  
- reconstructed test data  

This step confirms that the saved files can be reloaded.
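
A quick consistency check after reloading is to confirm that every saved SNP name still exists in the dataset being scored. The sketch below is a hypothetical helper working on a temporary JSON file; the `rs...` names are made up.

```python
import json
import os
import tempfile


def missing_snps(snp_list_path, available_columns):
    """Return saved SNP names that are absent from `available_columns`."""
    with open(snp_list_path, "r") as f:
        wanted = json.load(f)
    available = set(available_columns)
    return [name for name in wanted if name not in available]


# Toy demonstration with a temporary SNP list.
tmp_dir = tempfile.mkdtemp()
toy_path = os.path.join(tmp_dir, "top_snps_toy.json")
with open(toy_path, "w") as f:
    json.dump(["rs1", "rs2", "rs9"], f)

print(missing_snps(toy_path, ["rs1", "rs2", "rs3"]))
```

An empty list means the saved SNP selection matches the columns of the data file.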


In [None]:
import joblib
import json
import pandas as pd

# Load model
model = joblib.load("/kaggle/working/final_xgboost_model.pkl")

# Load SNP columns
with open("/kaggle/working/top_snps.json", "r") as f:
    top_snps = json.load(f)

print("Model and SNP list loaded successfully!")
print("Number of SNPs used:", len(top_snps))


## Step 13: Evaluate prediction performance

We compute:
- RMSE  
- R² scores  

And compare actual vs predicted values for first few samples.


In [None]:
import joblib
import json

MODEL_PATH = "/kaggle/input/snp-models-best/other/default/1/final_xgboost_model.pkl"
SNPS_PATH = "/kaggle/input/snp-models-best/other/default/1/top_snps.json"

model = joblib.load(MODEL_PATH)

with open(SNPS_PATH, "r") as f:
    top_snps = json.load(f)

print("Model loaded successfully!")
print("Top SNP count:", len(top_snps))


In [None]:
import pandas as pd

X_test_input = pd.read_csv(
    "/kaggle/input/snp-dataset-1/X_converted.csv",
    usecols=top_snps,
    dtype='int8'
)

# Keep only the selected SNP columns
X_test_input = X_test_input[top_snps]

print("Final input shape:", X_test_input.shape)


In [None]:
# Predict for all samples using the already loaded model & selected columns
preds = model.predict(X_test_input)

print("Predictions shape:", preds.shape)
print(preds[:10])   # first 10 predictions


In [None]:
import pandas as pd

# Load y
y = pd.read_csv("/kaggle/input/snp-dataset-1/y_converted.csv",
                header=None)

# Keep only numeric rows and convert them to a numeric dtype
y = pd.to_numeric(y[0], errors='coerce').dropna()

# Reset index
y = y.reset_index(drop=True)

# Remove the extra last row (make y = 1000 rows)
y = y.iloc[:1000].reset_index(drop=True)

print("Final y shape:", y.shape)
print(y.head())


In [None]:
y_true = y.values

comparison_df = pd.DataFrame({
    "Actual": y_true[:10],
    "Predicted": preds[:10]
})

comparison_df


In [None]:
from sklearn.metrics import mean_squared_error, r2_score

# Note: preds cover every row of X_converted.csv, including samples the model may have been trained on
rmse = mean_squared_error(y_true, preds) ** 0.5
r2 = r2_score(y_true, preds)

print("Final RMSE:", rmse)
print("Final R2:", r2)


In [None]:
# 🎉 COMPLETE PIPELINE SUCCESSFULLY EXECUTED

After running this notebook, download:

1. final_xgboost_model.pkl
2. top_snps.json

These will be uploaded into the Colab Web App in the next stage.

The user only needs:
✔ This Kaggle notebook
✔ The dataset link
✔ The Colab notebook

All models, JSON files, PCA data, and predictions are generated automatically.


# Run this in Colab

# 🧬 SNP Cluster Explorer — Colab Web App (LLM + KMeans)

This notebook turns your trained SNP ML pipeline into a **live web application**.

It uses:
• Important SNP list → `important_snps.json`  
• KMeans model for clustering → `kmeans_75snps.pkl`  
• SNP dataset → `X_converted.csv`  
• Mistral-7B-Instruct (4-bit) for cluster explanations  

These files should already exist because you exported them from the **Kaggle training notebook**.

---

## 📘 Before You Start: Upload the Required Files to Google Drive

Place the following inside:


Required files:
- `important_snps.json`
- `kmeans_75snps.pkl`
- `X_converted.csv`

Your folder structure:


If the user runs this notebook **without these files**, the app will fail.  
Make sure they are uploaded before executing.
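
A small pre-flight check can catch missing files before the server starts. The sketch below is an assumption-laden illustration: the helper `missing_files` is hypothetical, and the demo uses a temporary folder rather than the real Drive path.

```python
import os
import tempfile

REQUIRED_FILES = ["important_snps.json", "kmeans_75snps.pkl", "X_converted.csv"]


def missing_files(folder, required=REQUIRED_FILES):
    """Return the required file names that are not present in `folder`."""
    return [name for name in required if not os.path.isfile(os.path.join(folder, name))]


# Toy demonstration: a temporary folder that holds only one of the three files.
demo_folder = tempfile.mkdtemp()
open(os.path.join(demo_folder, "X_converted.csv"), "w").close()

print(missing_files(demo_folder))
```

In Colab, the same function would be called with the Drive folder that holds the project files.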

---

## 📘 What This Web App Does

✔ User enters a **row number**  
✔ App loads SNP values for that individual  
✔ KMeans predicts genetic cluster  
✔ LLM generates explanation of that cluster  
✔ UI displays SNP preview + explanation  
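
The cluster prediction step boils down to nearest-centre assignment: scikit-learn's `KMeans.predict` returns the index of the closest cluster centre. The sketch below reproduces that idea with NumPy on made-up centres; the centre values and the 4-SNP row are assumptions, not taken from `kmeans_75snps.pkl`.

```python
import numpy as np


def nearest_cluster(row, centers):
    """Index of the cluster centre with the smallest Euclidean distance to `row`."""
    distances = np.linalg.norm(centers - row, axis=1)
    return int(np.argmin(distances))


# Three hypothetical cluster centres over four SNPs.
centers = np.array([
    [0.0, 0.0, 0.0, 0.0],
    [2.0, 2.0, 2.0, 2.0],
    [0.0, 2.0, 0.0, 2.0],
])
row = np.array([0.0, 2.0, 1.0, 2.0])

# Distances are 3, sqrt(5), and 1, so the third centre (index 2) is closest.
print(nearest_cluster(row, centers))
```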

---

## 🚀 Workflow Overview

1. Mount Google Drive  
2. Install dependencies  
3. Create Flask server  
4. Load SNP models + LLM  
5. Build UI (index.html + css)  
6. Run ngrok → generate Public URL  
7. Use the app like a real website  

---



In [None]:
from google.colab import drive
drive.mount('/content/drive/')

# 📦 Install Dependencies

This cell installs everything needed for:

• Backend server (Flask)  
• LLM inference (Transformers + bitsandbytes)  
• Clustering (scikit-learn, joblib)  
• Public hosting (ngrok)  

Notes:
- First execution may take 2–3 minutes.
- Mistral 7B (4-bit) loads **in the background**, so the app becomes available immediately while the LLM finishes loading.


In [None]:
!ls "/content/drive/MyDrive/Colab Notebooks/On-Going Projects/SNP Project"

In [None]:
!pip install  flask pyngrok pandas scikit-learn joblib
!pip install  transformers accelerate huggingface-hub sentencepiece protobuf
!pip install  bitsandbytes
!pip install  ngrok

In [None]:
!mkdir -p templates static uploads

# 🧠 Create Flask Application (Backend)

This cell generates the entire backend:

✔ Loads:
   - important SNP indices  
   - KMeans clustering model  
   - Full SNP dataset (only selected SNPs)  

✔ Implements:
   - Background LLM loading thread  
   - `/api/predict` endpoint for cluster prediction  
   - Dynamic explanation generation  

Important:
- The first request may return:  
  “LLM is warming up — try again in 1–3 minutes”  
  because the 7B model loads in the background.
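
The warm-up behaviour follows a common pattern: the first caller starts a loader thread and gets a placeholder, and later callers get the loaded object. The sketch below shows the pattern in isolation with a fake loader that just sleeps. The class `BackgroundLoader` and `slow_load` are hypothetical names and do not appear in `app.py`.

```python
import threading
import time


class BackgroundLoader:
    """Load a slow resource in a daemon thread without blocking callers."""

    def __init__(self, load_fn):
        self._load_fn = load_fn
        self._lock = threading.Lock()
        self._thread = None
        self.resource = None

    def get(self):
        """Return the resource if ready; otherwise start loading (once) and return None."""
        with self._lock:
            if self.resource is None and self._thread is None:
                self._thread = threading.Thread(target=self._run, daemon=True)
                self._thread.start()
            return self.resource

    def _run(self):
        result = self._load_fn()  # slow work happens outside the lock
        with self._lock:
            self.resource = result

    def wait(self, timeout=None):
        """Block until the loader thread finishes (used here only for the demo)."""
        if self._thread is not None:
            self._thread.join(timeout)


def slow_load():
    time.sleep(0.2)  # stands in for downloading and quantizing a model
    return "model-ready"


loader = BackgroundLoader(slow_load)
first = loader.get()      # loading has only just started
loader.wait(timeout=5)
second = loader.get()     # loading has finished
print(first, second)
```

A request handler built on this pattern returns a "warming up" message whenever `get()` yields `None`.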


In [None]:
%%writefile app.py
import os
import json
import joblib
import threading
import numpy as np
import pandas as pd
from flask import Flask, render_template, request, jsonify

# Reduce TF noise if it's present in the environment
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "3"

import torch
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    BitsAndBytesConfig,
    pipeline
)

# ============================================================
# App Configuration
# ============================================================
app = Flask(__name__, template_folder="templates", static_folder="static")
os.makedirs("uploads", exist_ok=True)

# CUDA memory safety
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
torch.cuda.empty_cache()

# ============================================================
# Load SNP Models (local, small)
# ============================================================
MODEL_DIR = "/content/drive/MyDrive/Colab Notebooks/On-Going Projects/SNP Project"

IMPORTANT_SNPS_PATH = os.path.join(MODEL_DIR, "important_snps.json")
KMEANS_MODEL_PATH = os.path.join(MODEL_DIR, "kmeans_75snps.pkl")
DATASET_PATH = os.path.join(MODEL_DIR, "X_converted.csv")

with open(IMPORTANT_SNPS_PATH, "r") as f:
    IMPORTANT_SNPS = json.load(f)

KMEANS = joblib.load(KMEANS_MODEL_PATH)
FULL_DATA = pd.read_csv(DATASET_PATH, usecols=IMPORTANT_SNPS, dtype="int8")

# ============================================================
# Cluster Labels
# ============================================================
CLUSTER_LABELS = {
    0: "High variation on chromosome 3 & 6",
    1: "More stable SNP profile, low diversity",
    2: "High-risk variants in immune-related SNPs",
    3: "Strong variation in chromosome 14 region",
    4: "High heterozygosity across genome",
    5: "Miscellaneous genetic pattern group"
}

# ============================================================
# LLM Loader state + background loading
# ============================================================
LLM = None
TOKENIZER = None
PIPELINE = None
LLM_LOCK = threading.Lock()
LLM_LOADING = False

def _really_load_llm():
    """Blocking load of the model; run inside a background thread."""
    global LLM, TOKENIZER, PIPELINE, LLM_LOADING
    try:
        model_name = "mistralai/Mistral-7B-Instruct-v0.2"
        print("🚀 Background LLM load started (4-bit NF4) ...")

        bnb = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_use_double_quant=True,
            bnb_4bit_quant_type="nf4",
            bnb_4bit_compute_dtype=torch.float16,
        )

        TOKENIZER = AutoTokenizer.from_pretrained(model_name, use_fast=True)
        LLM = AutoModelForCausalLM.from_pretrained(
            model_name,
            device_map="auto",
            quantization_config=bnb,
        )

        # create a pipeline once model/tokenizer are ready
        PIPELINE = pipeline(
            "text-generation",
            model=LLM,
            tokenizer=TOKENIZER,
            max_new_tokens=250,
            temperature=0.4,
            do_sample=True,
        )

        print("✔ Background LLM load finished.")
    except Exception as e:
        print("‼ LLM load failed:", e)
    finally:
        # mark loading finished even on error so future calls may retry
        with LLM_LOCK:
            LLM_LOADING = False

def ensure_llm_background():
    """If not loaded, start background loading (non-blocking)."""
    global LLM_LOADING
    with LLM_LOCK:
        if PIPELINE is not None:
            return True
        if LLM_LOADING:
            return False
        # start background loader
        LLM_LOADING = True
        t = threading.Thread(target=_really_load_llm, daemon=True)
        t.start()
        return False

def get_pipeline():
    """Return pipeline if ready, otherwise None."""
    return PIPELINE

# ============================================================
# Generate Explanation (uses pipeline if ready)
# ============================================================
def generate_cluster_explanation(cluster_id, cluster_name):
    pipe = get_pipeline()
    if pipe is None:
        # trigger background loading if not already started
        ensure_llm_background()
        return ("LLM is warming up — explanation will be generated shortly. "
                "Try again in ~1–3 minutes (model download time depends on your GPU / network).")

    prompt = f"""
[INST]
Explain this genetic cluster in simple terms.

Cluster ID: {cluster_id}
Cluster Name: {cluster_name}

Write 4–6 lines:
• What this cluster represents
• Common SNP/variation pattern
• No medical predictions
• Simple, beginner-friendly explanation
[/INST]
"""
    out = pipe(prompt)[0]["generated_text"]
    if "[/INST]" in out:
        out = out.split("[/INST]")[-1].strip()
    return out.strip()

# ============================================================
# ROUTES
# ============================================================
@app.route("/")
def home():
    return render_template("index.html")

@app.route("/api/predict", methods=["POST"])
def predict_cluster():
    try:
        row_id = request.form.get("row_id")
        if row_id is None:
            return jsonify({"error": "Row number required"}), 400

        try:
            row_id = int(row_id)
        except (TypeError, ValueError):
            return jsonify({"error": "Invalid row number"}), 400

        if row_id < 0 or row_id >= len(FULL_DATA):
            return jsonify({"error": "Row number out of range"}), 400

        # Get SNPs (single row)
        snp_row = FULL_DATA.iloc[row_id]  # Series

        # KMeans prediction (fast, local)
        cluster_id = int(KMEANS.predict(snp_row.values.reshape(1, -1))[0])
        cluster_name = CLUSTER_LABELS.get(cluster_id, "Unknown Cluster")

        # Preview: first 10 SNP values as native ints
        preview_values = [int(x) for x in snp_row.iloc[:10].tolist()]

        # Explanation (may be placeholder while model loads)
        explanation = generate_cluster_explanation(cluster_id, cluster_name)

        return jsonify({
            "row_id": row_id,
            "cluster_id": cluster_id,
            "cluster_name": cluster_name,
            "explanation": explanation,
            "snp_count": len(IMPORTANT_SNPS),
            "preview": preview_values
        })

    except Exception as e:
        print("🔥 ERROR:", e)
        torch.cuda.empty_cache()
        return jsonify({"error": str(e)}), 500

# ============================================================
# RUN SERVER
# ============================================================
if __name__ == "__main__":
    # optionally kick off background LLM loading at server startup:
    # ensure_llm_background()
    app.run(host="0.0.0.0", port=8000)



# 🎨 Frontend UI — SNP Cluster Explorer

This creates the user interface:

✔ Row input box  
✔ SNP preview (first 10 SNPs)  
✔ Cluster ID + cluster name  
✔ AI-generated explanation  
✔ Dynamic badges for clusters  
✔ Loading spinner + error messages  

You don’t need to edit this unless customizing styling.


In [None]:
%%writefile templates/index.html
<!doctype html>
<html lang="en">
<head>
  <meta charset="utf-8"/>
  <meta name="viewport" content="width=device-width,initial-scale=1"/>
  <title>SNP Cluster Explorer</title>
  <link rel="stylesheet" href="{{ url_for('static', filename='style.css') }}">
</head>

<body>

<div class="particles"></div>

<header class="topbar">
  <div class="topbar-left">
    <span class="logo-icon">🧬</span>
    <div>
      <div class="brand-title">SNP Cluster Explorer</div>
      <div class="brand-sub">Genetic Pattern Classification</div>
    </div>
  </div>
</header>

<main class="container">

  <div class="page-header">
    <h2 class="page-title">🔬 Explore Genetic Clusters</h2>
    <p class="page-subtitle">Enter a row number (0–999) to fetch SNP values, compute cluster, and generate an AI explanation.</p>
  </div>

  <div class="main-form">

    <!-- INPUT SECTION -->
    <div class="form-section">
      <h3 class="section-title">📥 Step 1: Enter Row Number</h3>
      <p class="section-desc">The dataset contains 1000 individuals. Select a row to analyze.</p>

      <label class="input-label">Row Number (0–999)</label>
      <input id="rowInput" type="number" min="0" max="999" placeholder="Example: 0">

      <div class="action-row">
        <button class="submit-button primary" onclick="predictCluster()">🧬 Predict Cluster</button>
        <button class="submit-button ghost" onclick="clearOutput()">🗑️ Clear</button>
      </div>

      <!-- LOADING SPINNER -->
      <div id="loader" class="loader" style="display:none; margin-top:12px; text-align:center;">
        <div class="spinner"></div>
        <p style="margin-top:8px;">Fetching row data & generating explanation...</p>
      </div>
    </div>

    <!-- RESULT SECTION -->
    <div id="resultSection" class="form-section" style="display:none">
      <h3 class="section-title">📊 Step 2: Cluster Prediction Result</h3>

      <!-- Cluster Badge -->
      <div class="ai-badge" id="clusterBadge"></div>

      <!-- Row Preview -->
      <label class="input-label">Preview of SNP Values (first 10 SNPs)</label>
      <textarea id="rowPreview" rows="2" readonly></textarea>

      <!-- Cluster Info -->
      <label class="input-label">Cluster ID</label>
      <input id="clusterId" type="text" readonly>

      <label class="input-label">Cluster Name</label>
      <input id="clusterName" type="text" readonly>

      <!-- AI Explanation -->
      <label class="input-label">AI Explanation</label>
      <textarea id="explanation" rows="8" readonly class="fade-in"></textarea>

      <div class="action-row" style="margin-top:16px">
        <button class="submit-button success" onclick="scrollToTop()">🔼 Back to Top</button>
      </div>
    </div>

    <div id="status" class="status-message" style="display:none"></div>

  </div>
</main>

<footer class="footer">
  SNP Cluster Explorer — Powered by Mistral 7B • © 2025
</footer>

<script>
function scrollToTop() {
  window.scrollTo({ top: 0, behavior: "smooth" });
}

function clearOutput() {
  document.getElementById("rowInput").value = "";
  document.getElementById("resultSection").style.display = "none";
  document.getElementById("loader").style.display = "none";
  setStatus("", "");
}

async function predictCluster() {
  const rowEl = document.getElementById("rowInput");
  const row = (rowEl.value || "").toString().trim();

  if (row === "") {
    setStatus("❌ Please enter a row number.", "error");
    return;
  }

  setStatus("🔍 Fetching SNP values from dataset...", "info");
  document.getElementById("loader").style.display = "block";
  document.getElementById("resultSection").style.display = "none";

  const fd = new FormData();
  fd.append("row_id", row);

  try {
    const r = await fetch("/api/predict", { method: "POST", body: fd });
    if (!r.ok) {
      const err = await r.json().catch(()=>({error: r.statusText}));
      throw new Error(err.error || r.statusText || "Server error");
    }
    const j = await r.json();

    if (j.error) {
      setStatus("❌ " + j.error, "error");
      document.getElementById("loader").style.display = "none";
      return;
    }

    document.getElementById("resultSection").style.display = "block";

    document.getElementById("clusterId").value = j.cluster_id;
    document.getElementById("clusterName").value = j.cluster_name;
    document.getElementById("explanation").value = j.explanation || "";

    // safely render preview (may be undefined)
    const previewArr = Array.isArray(j.preview) ? j.preview : [];
    document.getElementById("rowPreview").value = previewArr.join(", ");

    const badge = document.getElementById("clusterBadge");
    badge.textContent = `Cluster ${j.cluster_id}`;
    badge.className = "ai-badge cluster-" + j.cluster_id;

    setStatus("✅ Prediction returned.", "success");

    document.getElementById("loader").style.display = "none";
    document.getElementById("resultSection").scrollIntoView({ behavior: "smooth" });

  } catch (e) {
    console.error("Fetch error:", e);
    setStatus("❌ Network error: " + (e.message || "Failed to fetch"), "error");
    document.getElementById("loader").style.display = "none";
  }
}

function setStatus(msg, type) {
  const el = document.getElementById("status");
  el.textContent = msg;
  el.className = "status-message " + (type || "");
  el.style.display = msg ? "block" : "none";
}
</script>

</body>
</html>



# 🎨 Styling (CSS)

This stylesheet provides:

• Dark-glass UI  
• Animations  
• Cluster color badges  
• Responsive layout  
• Modern typography  

The UI theme matches the “Genomics + AI” aesthetic.


In [None]:
%%writefile static/style.css
/* ============================================================
   BASE THEME
   ============================================================ */
:root {
  --bg: #061026;
  --card: #0e1724;
  --glass: rgba(255,255,255,0.05);
  --text: #ffffff;
  --muted: #c7d2e8;

  --primary: #6366f1;
  --primary-dark: #4f46e5;
  --success: #10b981;

  /* Cluster colors */
  --c0: #60a5fa;
  --c1: #34d399;
  --c2: #fbbf24;
  --c3: #c084fc;
  --c4: #fb7185;
  --c5: #2dd4bf;

  --shadow: 0 18px 40px rgba(0,0,0,0.55);
}

body {
  margin: 0;
  background: linear-gradient(135deg, #041025, #0b1730);
  color: var(--text);
  font-family: Inter, system-ui, sans-serif;
}

/* ============================================================
   FULL PAGE WRAPPER
   ============================================================ */
.container {
  max-width: 900px;
  margin: 40px auto;
  padding: 20px;
}

/* Title section */
.page-header {
  text-align: center;
  margin-bottom: 30px;
}

.page-title {
  font-size: 32px;
  font-weight: 800;
}

.page-subtitle {
  font-size: 15px;
  color: var(--muted);
  margin-top: 8px;
}

/* ============================================================
   BACKGROUND PARTICLES
   ============================================================ */
.particles {
  position: fixed;
  inset: 0;
  pointer-events: none;
  background:
    radial-gradient(circle at 10% 20%, rgba(99,102,241,0.12), transparent 20%),
    radial-gradient(circle at 80% 80%, rgba(16,185,129,0.10), transparent 25%);
  filter: blur(12px);
}

/* ============================================================
   TOPBAR
   ============================================================ */
.topbar {
  background: rgba(255,255,255,0.04);
  backdrop-filter: blur(12px);
  padding: 16px 30px;
  display: flex;
  justify-content: space-between;
  align-items: center;
  border-bottom: 1px solid rgba(255,255,255,0.05);
  position: sticky;
  top: 0;
  z-index: 10;
}

.topbar-left {
  display: flex;
  align-items: center;
  gap: 14px;
}

.logo-icon {
  font-size: 32px;
}

.brand-title {
  font-size: 20px;
  font-weight: 700;
}

.brand-sub {
  font-size: 12px;
  color: var(--muted);
}

/* ============================================================
   CARD (Main Form)
   ============================================================ */
.main-form {
  background: var(--card);
  padding: 30px;
  border-radius: 14px;
  box-shadow: var(--shadow);
  border: 1px solid rgba(255,255,255,0.06);
}

/* Section inside card */
.form-section {
  margin-bottom: 32px;
  padding-bottom: 24px;
  border-bottom: 1px solid rgba(255,255,255,0.08);
}

.form-section:last-child {
  border-bottom: none;
}

.section-title {
  font-size: 20px;
  font-weight: 700;
}

.section-desc {
  margin-top: 10px;
  color: var(--muted);
  line-height: 1.5rem;
}

/* ============================================================
   INPUTS
   ============================================================ */
.input-label {
  display: block;
  margin-top: 16px;
  margin-bottom: 6px;
  color: var(--muted);
  font-size: 13px;
  font-weight: 700;
  text-transform: uppercase;
}

input[type="number"], input[type="text"], textarea {
  width: 100%;
  padding: 14px;
  border-radius: 10px;
  background: rgba(255,255,255,0.04);
  border: 1px solid rgba(255,255,255,0.12);
  color: white;
  font-size: 15px;
}

textarea {
  resize: vertical;
}

input:focus, textarea:focus {
  outline: none;
  border-color: var(--primary);
}

/* ============================================================
   BUTTONS
   ============================================================ */
.action-row {
  margin-top: 20px;
  display: flex;
  gap: 12px;
  flex-wrap: wrap;
}

.submit-button {
  padding: 12px 22px;
  border-radius: 12px;
  border: none;
  font-weight: 700;
  cursor: pointer;
  transition: all 0.2s;
}

/* Primary button */
.submit-button.primary {
  background: linear-gradient(90deg, var(--primary), var(--primary-dark));
  color: white;
}
.submit-button.primary:hover { transform: translateY(-2px); }

/* Ghost button */
.submit-button.ghost {
  background: transparent;
  border: 1px solid rgba(255,255,255,0.25);
  color: white;
}
.submit-button.ghost:hover {
  background: rgba(255,255,255,0.05);
}

/* Success button */
.submit-button.success {
  background: linear-gradient(90deg, var(--success), #0d9a72);
  color: #fff;
}

/* ============================================================
   STATUS MESSAGES + LOADING
   ============================================================ */
.status-message {
  margin-top: 16px;
  padding: 12px;
  border-radius: 10px;
  font-weight: 600;
  display: none;
  animation: fadein 0.2s ease-in;
}

.status-message.info {
  background: rgba(59,130,246,0.15);
  color: #93c5fd;
  display: block;
}

.status-message.success {
  background: rgba(16,185,129,0.15);
  color: #6ee7b7;
  display: block;
}

.status-message.error {
  background: rgba(239,68,68,0.15);
  color: #fca5a5;
  display: block;
}

/* Loader spinner */
.spinner {
  width: 36px;
  height: 36px;
  border: 4px solid rgba(255,255,255,0.15);
  border-top: 4px solid var(--primary);
  border-radius: 50%;
  margin: 0 auto;
  animation: spin 1s linear infinite;
}

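@keyframes spin { 100% { transform: rotate(360deg); } }

/* Fade-in referenced by .status-message, .ai-badge, and the .fade-in class in index.html */
@keyframes fadein { from { opacity: 0; } to { opacity: 1; } }
.fade-in { animation: fadein 0.2s ease-in; }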

/* ============================================================
   CLUSTER BADGES
   ============================================================ */
.ai-badge {
  padding: 10px 16px;
  border-radius: 30px;
  font-weight: 800;
  margin-bottom: 16px;
  display: inline-block;
  font-size: 14px;
  animation: fadein 0.2s ease-in;
}

.cluster-0 { background: var(--c0); color: #000; }
.cluster-1 { background: var(--c1); color: #000; }
.cluster-2 { background: var(--c2); color: #000; }
.cluster-3 { background: var(--c3); color: #000; }
.cluster-4 { background: var(--c4); color: #000; }
.cluster-5 { background: var(--c5); color: #000; }

/* ============================================================
   RESPONSIVE
   ============================================================ */
@media (max-width: 600px) {
  .main-form { padding: 20px; }
  .page-title { font-size: 26px; }
}




# 📘 Kill Previous Processes

This ensures Flask and ngrok do not conflict:

- Stops earlier Flask sessions  
- Stops older ngrok tunnels  
- Prevents "port already in use" errors  

Safe to run every time before starting server.


In [None]:
!pkill -f flask || echo "No flask running"
!pkill -f ngrok || echo "No ngrok running"


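# 📘 Checking Port 8000 (User Instructions)

If the server fails to start, port 8000 may already be occupied.

Run `!lsof -i :8000`. If the output lists a listening Python process, for example:

python   12345 LISTEN

kill it using the PID shown (here `12345`) with `!kill -9 12345`.

Then launch Flask again.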
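
The same check can be done from Python without `lsof`. This is a minimal sketch using the standard `socket` module. The helper `port_in_use` is hypothetical, and the demo opens its own listener on a free port instead of touching port 8000.

```python
import socket


def port_in_use(port, host="127.0.0.1"):
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(0.5)
        return s.connect_ex((host, port)) == 0


# Demo: open a listener on a free port chosen by the OS, then close it.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))
listener.listen(1)
demo_port = listener.getsockname()[1]

busy = port_in_use(demo_port)            # listener is open
listener.close()
free_again = not port_in_use(demo_port)  # listener is closed

print(busy, free_again)
```

Calling `port_in_use(8000)` before starting Flask shows whether an earlier server is still running.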


In [None]:
!lsof -i :8000

In [None]:
# Replace 8369 with the PID reported by lsof above
!kill -9 8369

# 📘 Run Flask App in Background

This starts the backend without blocking the notebook:

`!nohup python app.py > flask.log 2>&1 &`

Logs are stored in `flask.log`.


In [None]:
!nohup python app.py > flask.log 2>&1 &


# 📘 Ngrok Setup

Ngrok provides a public HTTPS link to the Flask server.

The ngrok token has been removed from this notebook for safety.

To use ngrok:
1. Get a token → https://dashboard.ngrok.com/get-started/your-authtoken  
2. Set it inside the notebook: `conf.get_default().auth_token = "YOUR_NGROK_TOKEN_HERE"`
3. Start the tunnel: `public_url = ngrok.connect(8000)`

The shareable app link is printed as the Public URL.
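
To avoid pasting the token into a notebook cell that might be shared, it can be read from an environment variable instead. The sketch below is one possible approach: the helper `get_ngrok_token` and the choice of `NGROK_AUTHTOKEN` as the variable name are assumptions, and the demo sets a dummy value so that the snippet runs on its own.

```python
import os


def get_ngrok_token(env_var="NGROK_AUTHTOKEN"):
    """Read the ngrok authtoken from the environment and reject empty or placeholder values."""
    token = os.environ.get(env_var, "").strip()
    if not token or token == "YOUR_NGROK_TOKEN_HERE":
        raise RuntimeError(f"Set the {env_var} environment variable to your ngrok authtoken.")
    return token


# Demo only: a dummy value stands in for a real token.
os.environ["NGROK_AUTHTOKEN"] = "demo-token-not-real"
print(get_ngrok_token())
```

The returned value would then be assigned to `conf.get_default().auth_token` in place of the hard-coded string.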


In [None]:
from pyngrok import ngrok, conf
conf.get_default().auth_token = "YOUR_NGROK_TOKEN_HERE"

public_url = ngrok.connect(8000)
print("🌍 Public URL:", public_url)

!sleep 3 && tail -n 30 flask.log

# 📘 View Logs

To debug the backend, run:

`!tail -n 50 flask.log`

The log shows:
- Model loading issues  
- Prompt errors  
- Script formatting errors  
- Runtime crashes  


In [None]:
!tail -n 50 flask.log