# 🧬 NeuroGenAI | DNABERT Semantic Embeddings Bridge
### What is DNABERT?
#### DNABERT is a BERT-style transformer pre-trained on genomic sequences that are tokenized into k-mers. As BERT does for natural language, it learns contextual representations of DNA that can be reused as embeddings.

### Why k-mer Encoding?
#### DNA is tokenized into overlapping k-mers: substrings of length k taken with a sliding window (e.g., "ACGTGA" for k = 6). This lets the model learn motifs and local sequence structure from context.
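
To make the sliding-window idea concrete, here is a minimal sketch of overlapping k-mer tokenization. The function name `kmer_tokenize` and the `stride` parameter are illustrative assumptions, not part of DNABERT or of this project's `DNAEmbedder`, whose actual tokenization may differ.

```python
from typing import List


def kmer_tokenize(sequence: str, k: int = 6, stride: int = 1) -> List[str]:
    """Split a DNA sequence into overlapping k-mers (hypothetical helper)."""
    if k <= 0 or stride <= 0:
        raise ValueError("k and stride must be positive")
    sequence = sequence.upper()
    # Slide a window of width k across the sequence, moving `stride` bases each step.
    # Sequences shorter than k yield an empty list.
    return [sequence[i:i + k] for i in range(0, len(sequence) - k + 1, stride)]


tokens = kmer_tokenize("ACGTGACCT", k=6)
print(tokens)  # ['ACGTGA', 'CGTGAC', 'GTGACC', 'TGACCT']
```

A sequence of length L produces L - k + 1 tokens at stride 1, so neighbouring tokens share k - 1 bases.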

### Why LoRA / QLoRA?
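#### Parameter-efficient fine-tuning (PEFT) methods such as LoRA train small low-rank adapter matrices instead of all model weights, and QLoRA applies the same idea on top of a quantized base model. This makes it practical to adapt DNABERT to specific genomes or classification tasks on limited hardware.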
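
As a rough illustration of why LoRA is cheap, the sketch below compares the trainable parameters of one full weight matrix with those of a rank-r adapter pair. The 768 x 768 layer size (BERT-base hidden size) and rank 8 are assumptions chosen for the example, and `lora_param_counts` is a hypothetical helper, not a function from any PEFT library.

```python
def lora_param_counts(d_in: int, d_out: int, rank: int):
    """Return (full, lora) trainable-parameter counts for one linear layer."""
    full = d_in * d_out            # every entry of the weight matrix W is trained
    lora = rank * (d_in + d_out)   # only A (rank x d_in) and B (d_out x rank) are trained
    return full, lora


# Assumed sizes: a 768 x 768 projection and rank 8.
full, lora = lora_param_counts(768, 768, 8)
print(f"full: {full:,}  lora: {lora:,}  ratio: {lora / full:.2%}")
# full: 589,824  lora: 12,288  ratio: 2.08%
```

The frozen weight W stays untouched. Only the low-rank product B·A is learned and added to it, which is where the memory and compute savings come from.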

## 🔧 Setup & Imports

In [1]:
import os
import json
import numpy as np
import pandas as pd
from pathlib import Path

# Add src path
import sys
src_path = Path().resolve().parents[1] / "src"
if str(src_path) not in sys.path:
    sys.path.append(str(src_path))

from nlp.dna_embedding_model import DNAEmbedder

## 📥 Load Clean FASTA Sequences

In [None]:
# Load cleaned FASTA sequences
fasta_path = "data/processed/human_fasta_clean.csv"
df = pd.read_csv(fasta_path)

# For testing: keep sequences of at least 30 bases, then take the first 100
df = df[df['Length'] >= 30].head(100)  # Change as needed
print(f"✅ Loaded {len(df)} sequences.")

## 🧠 Initialize DNABERT Embedding Engine

In [None]:
embedder = DNAEmbedder(model_id="armheb/DNA_bert_6", k=6)

## 💾 Save Embeddings as .npy

In [11]:
# Extract sequences
sequences = df["Sequence"].tolist()

# Embed all
embeddings = embedder.embed_batch(sequences)
print("✅ Final embedding shape:", embeddings.shape)

# Save as .npy
os.makedirs("data/processed", exist_ok=True)
np.save("data/processed/fasta_dnabert_embeddings.npy", embeddings)
print("📁 Saved to: data/processed/fasta_dnabert_embeddings.npy")

🔁 Embedding sequence 1/100...
🔁 Embedding sequence 2/100...
🔁 Embedding sequence 3/100...
[... remaining progress output truncated ...]

## 🧾 Log Embedding Metadata

In [None]:
# Save metadata for reproducibility
meta = {
    "model_id": embedder.model_id,
    "vector_dim": embeddings.shape[1],
    "sequence_count": embeddings.shape[0],
    "source_fasta": fasta_path,
    "kmer_size": embedder.k,
    # str() keeps this JSON-serializable in case device is a torch.device object
    "device": str(embedder.device),
    "huggingface_url": f"https://huggingface.co/{embedder.model_id}"
}

# Create the output directory first so open() does not fail on a fresh checkout
meta_path = "data/outputs/3. DNABERT + SNN + NLP/embedding_info.json"
os.makedirs(os.path.dirname(meta_path), exist_ok=True)

with open(meta_path, "w") as f:
    json.dump(meta, f, indent=4)

print(f"✅ Metadata saved to: {meta_path}")
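
A downstream notebook could read this metadata back and check it against the embedding array before using it. The sketch below shows one way to do that, under stated assumptions: `load_embedding_meta` and `matches_shape` are hypothetical helpers, and the demo values (in particular `vector_dim` of 768 and the temporary file) are placeholders rather than results from this run.

```python
import json
import tempfile
from pathlib import Path

# Keys a consumer of the metadata file would rely on (taken from the dict above).
REQUIRED_KEYS = {"model_id", "vector_dim", "sequence_count", "kmer_size"}


def load_embedding_meta(path):
    """Read the metadata JSON and fail early if expected keys are missing."""
    with open(path, "r") as f:
        meta = json.load(f)
    missing = REQUIRED_KEYS - meta.keys()
    if missing:
        raise KeyError(f"metadata is missing keys: {sorted(missing)}")
    return meta


def matches_shape(meta, shape):
    """Check that (sequence_count, vector_dim) agrees with an array shape."""
    return (meta["sequence_count"], meta["vector_dim"]) == tuple(shape)


# Demo with placeholder values written to a temporary file.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "embedding_info.json"
    demo_meta = {
        "model_id": "armheb/DNA_bert_6",
        "vector_dim": 768,        # placeholder, not a measured value
        "sequence_count": 100,
        "kmer_size": 6,
    }
    demo_path.write_text(json.dumps(demo_meta, indent=4))
    loaded = load_embedding_meta(demo_path)
    print(matches_shape(loaded, (100, 768)))
```

In practice the shape passed to `matches_shape` would come from the array loaded from `fasta_dnabert_embeddings.npy`.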