# 🧬 Encode DNA with k-mers (n-grams)

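This notebook generates **k-mer feature vectors** from real gene sequences using BioNLP-style processing: each sequence is split into overlapping length-k substrings (k-mers), which are then counted like n-grams in text.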

Goal: Make sequences ML-ready by turning them into structured, vectorized form.


In [1]:
# ✅ Imports
import pandas as pd
from collections import Counter
import os

In [None]:
# Load cleaned FASTA dataset
fasta_path = "data/processed/human_fasta_clean.csv"
df = pd.read_csv(fasta_path)

# Preview
print("✅ Loaded:", df.shape)
df.head()

In [3]:
def generate_kmers(sequence, k=6):
    """Generate overlapping k-mers from DNA sequence."""
    return [sequence[i:i+k] for i in range(len(sequence) - k + 1)]

def count_kmers(sequence, k=6):
    """Count frequency of each k-mer in a sequence."""
    kmers = generate_kmers(sequence.upper(), k)
    return Counter(kmers)

def encode_kmer_counts(sequences, k=6):
    """Turn list of sequences into DataFrame of k-mer frequencies."""
    kmer_dicts = [count_kmers(seq, k) for seq in sequences]
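    # Named kmer_df so it is not confused with the global `df` loaded above
    kmer_df = pd.DataFrame(kmer_dicts)
    # k-mers absent from a sequence come back as NaN; treat them as zero counts
    kmer_df = kmer_df.fillna(0).astype(int)
    print(f"✅ Encoded {len(sequences)} sequences into matrix of shape {kmer_df.shape}")
    return kmer_df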
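
As a quick sanity check of the sliding-window logic above, the sketch below runs it on a made-up 8-base sequence. The toy sequence and `k=3` are illustrative assumptions, not data from this notebook. A sequence of length n yields n - k + 1 overlapping k-mers, and a sequence shorter than k yields none.

```python
from collections import Counter

def generate_kmers(sequence, k=6):
    """Same sliding-window logic as the cell above, repeated so this runs standalone."""
    return [sequence[i:i+k] for i in range(len(sequence) - k + 1)]

toy_seq = "ATGCATGC"  # hypothetical toy sequence, not from the dataset
toy_kmers = generate_kmers(toy_seq, k=3)
toy_counts = Counter(toy_kmers)

print(toy_kmers)        # 8 - 3 + 1 = 6 overlapping 3-mers
print(dict(toy_counts)) # how often each 3-mer occurs

# A sequence shorter than k produces an empty list rather than an error
print(generate_kmers("AT", k=3))
```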

In [None]:
# Use only first 100 for now (speed)
subset = df["Sequence"].head(100)

# Encode
encoded_kmers_df = encode_kmer_counts(subset, k=6)

# Preview output
encoded_kmers_df.head()
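
Note that the columns of `encoded_kmers_df` are only the k-mers that actually occur in the encoded subset, so a different subset can produce a different set of columns. If a fixed feature layout is needed, one option is to reindex against all 4^k k-mers over A/C/G/T. This is a sketch under that assumption, not something this notebook does: the helper name `align_to_full_vocab` and the toy matrix are hypothetical, and any column containing other characters (for example `N`) would be dropped by the reindex.

```python
from itertools import product

import pandas as pd

def align_to_full_vocab(kmer_df, k=6, alphabet="ACGT"):
    """Reindex columns to every possible k-mer over `alphabet`, filling gaps with 0."""
    vocab = ["".join(p) for p in product(alphabet, repeat=k)]
    return kmer_df.reindex(columns=vocab, fill_value=0)

# Hypothetical toy count matrix with k=2, standing in for encoded_kmers_df
toy = pd.DataFrame([{"AT": 2, "TG": 1}, {"GG": 3}]).fillna(0).astype(int)
aligned = align_to_full_vocab(toy, k=2)

print(aligned.shape)          # 2 rows, 4**2 columns
print(aligned[["AT", "TG", "GG", "AA"]])
```

With `k=6` the aligned matrix would have 4**6 = 4096 columns regardless of which sequences are encoded.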

In [None]:
# Sum all k-mers across rows and sort
top_kmers = encoded_kmers_df.sum().sort_values(ascending=False).head(20)

# Plot
top_kmers.plot(kind="bar", title="Top 20 Most Common 6-mers", figsize=(10, 4))
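
Because the totals above are raw counts, longer sequences contribute more to the ranking than shorter ones. If that matters for a downstream model, one possible adjustment is to convert each row to relative frequencies first. The sketch below is an assumption about how that could look, with a hypothetical helper `to_relative_frequencies` and a toy count matrix in place of `encoded_kmers_df`.

```python
import pandas as pd

def to_relative_frequencies(count_df):
    """Divide each row by its total so every non-empty row sums to 1."""
    row_totals = count_df.sum(axis=1)
    # All-zero rows (sequences shorter than k) would divide by zero; leave them as zeros
    safe_totals = row_totals.replace(0, 1)
    return count_df.div(safe_totals, axis=0)

# Hypothetical toy counts: three sequences, two 3-mers
toy_counts = pd.DataFrame({"ATG": [2, 0, 0], "TGC": [2, 5, 0]})
toy_freqs = to_relative_frequencies(toy_counts)

print(toy_freqs)
```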

In [None]:
# Save matrix
out_path = "data/processed/fasta_kmer_6mer.csv"
encoded_kmers_df.to_csv(out_path, index=False)

print("📁 Saved k-mer matrix to:", out_path)

### ✔️ Done:
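- Extracted overlapping 6-mers from the first 100 gene sequences
- Encoded them into a k-mer count matrix (one row per sequence, one column per observed 6-mer)
- Saved the matrix as clean input for ML at `data/processed/fasta_kmer_6mer.csv`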