# Plot Nucleotide Frequencies & GC Content
- Count A, T, G, C per sequence and globally
- Calculate GC content = (G + C) / (A + T + G + C)
- Plot overall distribution and optionally per gene
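
As a quick worked example of the GC formula above, the sketch below counts bases in one short sequence. The helper name `gc_content` and the choice to ignore ambiguous bases such as `N` are assumptions for illustration, not part of the pipeline in the cells that follow.

```python
from collections import Counter

def gc_content(seq):
    """Return (G + C) / (A + T + G + C) for one sequence; 0.0 if it has no A/T/G/C."""
    counts = Counter(seq.upper())
    total = sum(counts[b] for b in "ATGC")  # ambiguous bases (e.g. N) are ignored
    return (counts["G"] + counts["C"]) / total if total else 0.0

example = "ATGCGAAT"  # A=3, T=2, G=2, C=1
print(gc_content(example))  # 3 / 8 = 0.375
```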

## Preparation from DEV-8: Source DNA from NCBI (Homo sapiens) in FASTA

In [None]:
!wget https://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/mRNA_Prot/human.1.rna.fna.gz
!gunzip human.1.rna.fna.gz
!pip install biopython

from Bio import SeqIO
import pandas as pd

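def fasta_to_csv(filepath, output_path, max_len=200):
    """Parse a FASTA file into a DataFrame and save it as CSV.

    Only the first `max_len` bases of each sequence are stored (followed by
    "..." if truncated), so downstream statistics describe this prefix only.
    """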
    data = []
    for record in SeqIO.parse(filepath, "fasta"):
        data.append({
            "ID": record.id,
            "Length": len(record.seq),
            "Description": record.description,
            "Sequence": str(record.seq[:max_len]) + ("..." if len(record.seq) > max_len else "")
        })
    df = pd.DataFrame(data)
    df.to_csv(output_path, index=False)
    print(f"✅ Saved {len(df)} records to {output_path}")
    return df

In [None]:
df = fasta_to_csv("human.1.rna.fna", "human_rna.csv")
df.head(200)

## Plot Basic Nucleotide Frequencies

### Nucleotide Frequency & GC Content Calculation

In [None]:
import matplotlib.pyplot as plt
from collections import Counter

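def analyze_nucleotide_freq(df):
    """Return global A/T/C/G counts and a list of per-sequence GC fractions.

    Characters other than A/T/C/G (e.g. N) are excluded from the totals.
    """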
    freqs = {'A': 0, 'T': 0, 'C': 0, 'G': 0}
    gc_content = []

    for seq in df["Sequence"].str.replace("...", "", regex=False):  # Remove ellipsis
        counts = Counter(seq.upper())
        for base in freqs:
            freqs[base] += counts.get(base, 0)
        total = sum(counts.get(b, 0) for b in "ATCG")
        gc = (counts.get('G', 0) + counts.get('C', 0)) / total if total > 0 else 0
        gc_content.append(gc)

    return freqs, gc_content

nuc_freqs, gc_list = analyze_nucleotide_freq(df)

### Plot Results

In [None]:
# Plot nucleotide frequencies
plt.figure(figsize=(6,4))
plt.bar(list(nuc_freqs.keys()), list(nuc_freqs.values()), color=["green", "red", "blue", "orange"])
plt.title("Overall Nucleotide Frequency (A/T/C/G)")
plt.ylabel("Count")
plt.show()

# Plot GC content distribution
plt.figure(figsize=(6,4))
plt.hist(gc_list, bins=30, color='purple', edgecolor='black')
plt.title("GC Content per Sequence")
plt.xlabel("GC Content")
plt.ylabel("Number of Sequences")
plt.show()
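
The per-gene view mentioned in the goals at the top is not implemented in the cells above. One possible sketch groups per-sequence GC values by gene symbol. It assumes the RefSeq description carries the symbol in parentheses right before a comma, as in `... (A2M), transcript variant 1, mRNA`. The helper names `gene_symbol` and `mean_gc_per_gene` and the sample records are hypothetical.

```python
import re
from collections import defaultdict

# Assumption: the gene symbol is the parenthesised token directly followed by a comma.
SYMBOL_RE = re.compile(r"\(([^()]+)\),")

def gene_symbol(description):
    """Extract a gene symbol from a description line, or None if none is found."""
    match = SYMBOL_RE.search(description)
    return match.group(1) if match else None

def mean_gc_per_gene(descriptions, gc_values):
    """Average GC content over all transcripts that share a gene symbol."""
    grouped = defaultdict(list)
    for desc, gc in zip(descriptions, gc_values):
        symbol = gene_symbol(desc)
        if symbol is not None:
            grouped[symbol].append(gc)
    return {symbol: sum(vals) / len(vals) for symbol, vals in grouped.items()}

# Made-up records for illustration only
descs = [
    "NM_000001.1 Homo sapiens example gene one (GENE1), transcript variant 1, mRNA",
    "NM_000002.1 Homo sapiens example gene one (GENE1), transcript variant 2, mRNA",
    "NR_000003.1 Homo sapiens example two (pseudogene) (GENE2), non-coding RNA",
]
print(mean_gc_per_gene(descs, [0.40, 0.60, 0.50]))
```

With the notebook's data this could be called as `mean_gc_per_gene(df["Description"], gc_list)`, and the resulting dictionary plotted with `plt.bar`.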

## Create k-mer Splitter (For ML & SNN)

- Generate overlapping k-mers from each RNA sequence
- Example:
  - Sequence: "ATGCGAAT" with k=6 →
  - Output: ["ATGCGA", "TGCGAA", "GCGAAT"]
- Store as:
  - ID: transcript ID
  - KMER: individual k-mer

### Define the k-mer Generator

In [None]:
# Define the k-mer Generator
def generate_kmers(sequence, k=6):
    return [sequence[i:i+k] for i in range(len(sequence) - k + 1)]
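
A quick check of the generator against the example in the text, plus one optional extension. The `valid_kmers` helper is hypothetical: it drops k-mers that contain anything other than A/C/G/T (for instance `N`), which may or may not be the desired behaviour for a given model.

```python
def generate_kmers(sequence, k=6):
    # Same definition as in the cell above, repeated so this sketch runs on its own
    return [sequence[i:i+k] for i in range(len(sequence) - k + 1)]

def valid_kmers(sequence, k=6, alphabet="ACGT"):
    """Hypothetical helper: keep only k-mers made entirely of characters in `alphabet`."""
    allowed = set(alphabet)
    return [kmer for kmer in generate_kmers(sequence.upper(), k)
            if set(kmer) <= allowed]

print(generate_kmers("ATGCGAAT", k=6))  # ['ATGCGA', 'TGCGAA', 'GCGAAT']
print(generate_kmers("ATG", k=6))       # [] because the sequence is shorter than k
print(valid_kmers("ATGNCGAATC", k=4))   # k-mers overlapping the N are dropped
```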

In [None]:
# Apply to Data
def kmers_to_dataframe(df, k=6):
    kmers_data = []

    for _, row in df.iterrows():
        seq = row["Sequence"].replace("...", "")  # remove ellipsis
        kmers = generate_kmers(seq, k)
        for kmer in kmers:
            kmers_data.append({
                "ID": row["ID"],
                "KMER": kmer
            })

    kmer_df = pd.DataFrame(kmers_data)
    print(f"✅ Created {len(kmer_df)} k-mers (k={k}) from {len(df)} sequences")
    return kmer_df

In [None]:
# Create output folder if needed
import os
os.makedirs("data", exist_ok=True)

# Save to CSV
kmer_df = kmers_to_dataframe(df, k=6)
kmer_df.to_csv("data/human_rna_kmers.csv", index=False)
print("📁 Saved to: data/human_rna_kmers.csv")

In [None]:
kmer_df.head()

## Spike Encoding (For SNN Phase)

### 🧠 What Are Spiking Neural Networks (SNNs)?
SNNs are often described as the third generation of neural networks, inspired by how biological neurons process information:
- Instead of passing continuous values (as in CNNs or LSTMs), neurons in SNNs fire spikes (discrete events) over time.
- Learning and computation can rely on spike timing, not just spike rate, which allows sparse, energy-efficient, and biologically plausible computing.

### 🔌 Why Spike Encode Gene Sequences?
RNA/DNA data is sequential, symbolic, and sparse, which makes it a natural candidate for SNNs:
- Genomic data has positional dependencies (e.g., motif positioning) that can be mapped onto time
- K-mers can be treated like event triggers (e.g., spiking "neurons" that activate on biologically meaningful motifs)
- SNNs can offer low-latency, low-power inference for on-chip or embedded bioinformatics

### ⚙️ Types of Spike Encoders for Gene Sequences
1. Rate Coding (Poisson Encoding)
   - Encode values (e.g. nucleotide frequency, k-mer presence) as spike rates.
   - Each input neuron has a firing rate r; it spikes randomly using a Poisson process.
   - Good for: statistical regularities, k-mer presence/frequency.
2. Temporal Coding
   - Stronger signal = earlier spike
   - One spike per neuron; timing encodes importance
   - Useful if you extract bio-features like GC content, motif scores, etc.
3. Population Coding
   - Each k-mer activates a set of neurons, like a distributed code.
   - Often used with Gaussian tuning curves over input space
   - Useful if you want to embed k-mers via NLP methods (Word2Vec, etc.) and convert embeddings to spike rates.
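
To make the temporal and population schemes above more concrete, here is a minimal sketch for a single scalar feature in [0, 1], such as GC content. The function names, the linear latency mapping, and the parameters `t_max`, `n_neurons`, `max_rate`, and `sigma` are all illustrative assumptions; other mappings are equally valid.

```python
import math

def latency_encode(value, t_max=100.0):
    """Temporal (latency) coding: map a value in [0, 1] to one spike time in ms.
    Stronger signal -> earlier spike; a value of 0 spikes at t_max."""
    value = min(max(value, 0.0), 1.0)  # clamp to [0, 1]
    return (1.0 - value) * t_max

def population_rates(value, n_neurons=5, max_rate=100.0, sigma=0.15):
    """Population coding: Gaussian tuning curves with centres spread evenly over [0, 1].
    Returns one firing rate (Hz) per neuron; requires n_neurons >= 2."""
    centres = [i / (n_neurons - 1) for i in range(n_neurons)]
    return [max_rate * math.exp(-((value - c) ** 2) / (2 * sigma ** 2))
            for c in centres]

gc = 0.5  # e.g. a GC content value
print(latency_encode(gc))                            # single spike time in ms
print([round(r, 1) for r in population_rates(gc)])   # one rate per neuron
```

The rates returned by `population_rates` could then drive a rate-based (Poisson) encoder, one spike train per neuron.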

### Task
We need to simulate spike trains from our nucleotide or k-mer data.
We'll implement a Poisson-based spike encoder (common in SNN prep):
- Each k-mer gets a "firing probability" based on frequency or embedding
- Generate spike times as a list per input

### Dummy Frequency-Based Poisson Encoder (Conceptual)

In [None]:
# Let’s simulate spike timing for k-mers using frequency-based intensity.
import numpy as np

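def poisson_spike_train(kmer, rate=20, duration=100, rng=None):
    """
    Generate approximate Poisson spike times for a k-mer.
    - `kmer`: label only; not yet used to set the rate (dummy encoder)
    - `rate`: firing rate in Hz
    - `duration`: simulation time in ms
    - `rng`: optional `np.random.Generator` for reproducible output
    """
    rng = np.random.default_rng() if rng is None else rng
    # One Bernoulli draw per 1 ms bin; rate / 1000 converts Hz to probability per ms
    fired = rng.random(duration) < rate / 1000
    return np.flatnonzero(fired).tolist()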
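
The per-millisecond draw in `poisson_spike_train` approximates a Poisson process well when `rate / 1000` is small. An alternative is to sample exponential inter-spike intervals, which yields continuous spike times. The sketch below uses only the standard library; the name `poisson_spike_times` and its `seed` parameter are hypothetical.

```python
import random

def poisson_spike_times(rate_hz, duration_ms, seed=None):
    """Draw spike times (ms) from a homogeneous Poisson process.
    Inter-spike intervals are exponential with mean 1000 / rate_hz ms."""
    if rate_hz <= 0:
        return []
    rng = random.Random(seed)
    spikes, t = [], 0.0
    while True:
        t += rng.expovariate(rate_hz / 1000.0)  # argument is the rate per ms
        if t >= duration_ms:
            return spikes
        spikes.append(t)

spikes = poisson_spike_times(rate_hz=50, duration_ms=1000, seed=42)
print(len(spikes), [round(s, 1) for s in spikes[:5]])
```

At 50 Hz over 1000 ms the expected spike count is about 50, with run-to-run variation.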

### Apply to a Few Sample K-mers

In [None]:
# Simulate spikes for a few example k-mers
sampled_kmers = kmer_df['KMER'].sample(5, random_state=42)

for kmer in sampled_kmers:
    spikes = poisson_spike_train(kmer, rate=50)
    print(f"K-mer: {kmer} → Spikes (ms): {spikes[:10]}... ({len(spikes)} spikes)")

## Store Clean FASTA Preview

In [None]:
df.to_csv("data/human_rna_clean.csv", index=False)
print("📁 Saved clean preview to: data/human_rna_clean.csv")