# 🧬 Parse FASTA
## 📌 Why This Dataset?

We're using:
- 📌 [human.1.rna.fna.gz](https://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/mRNA_Prot/) from the RefSeq database
- 📚 RefSeq is maintained by NCBI and widely used in publications
- 🧬 The `.rna.fna` files contain transcript (RNA) sequences for human genes, which are used in studies of protein synthesis, gene function analysis, and genome mapping

### 🔍 Why NCBI RefSeq FASTA?

The RefSeq FASTA dataset is a curated, non-redundant source of **transcribed gene sequences** for Homo sapiens.  
It’s widely used in:
- Transcriptomics
- Functional annotation
- Deep learning on genomics

By limiting ourselves to `human.1.rna.fna`, we start with a manageable subset while still working with **the same curated data used in published research pipelines**.

✅ Filtered out sequences shorter than 20 nucleotides  
✅ Created truncated previews of each sequence for easier inspection and feature design
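
The two steps above can be sketched on a toy example. This is a minimal illustration in plain Python, not the notebook's actual parser: the helper names `keep_sequence` and `make_preview`, the toy sequences, and the shortened preview length are all assumptions chosen to mirror the 20-base cutoff and the preview truncation.

```python
MIN_LEN = 20       # assumed cutoff, mirroring the length filter described above
PREVIEW_LEN = 10   # shortened here so the toy output stays readable

def keep_sequence(seq, min_len=MIN_LEN):
    """Return True when the sequence is long enough to keep."""
    return len(seq) >= min_len

def make_preview(seq, max_len=PREVIEW_LEN):
    """Truncate long sequences and mark the cut with an ellipsis."""
    return seq[:max_len] + ("..." if len(seq) > max_len else "")

# Toy records (hypothetical names, not real RefSeq entries)
toy = {
    "short_1": "ACGTACGT",        # 8 bases, dropped by the filter
    "long_1": "ACGT" * 6,         # 24 bases, kept
}

kept = {name: seq for name, seq in toy.items() if keep_sequence(seq)}
previews = {name: make_preview(seq) for name, seq in kept.items()}
print(previews)
```

Only `long_1` survives the filter, and its preview is cut to the first ten bases followed by `...`.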

In [None]:
# Make sure the target folders exist before downloading
!mkdir -p data/raw data/processed
!wget -O data/raw/human_rna.fna.gz https://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/mRNA_Prot/human.1.rna.fna.gz
!gunzip -f data/raw/human_rna.fna.gz

# 🧬 Install dependencies
!pip install biopython

# ✅ Imports
import os
import pandas as pd
from Bio import SeqIO
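
If unpacking the archive is not desirable, the compressed file could also be read in place. The sketch below uses Python's standard `gzip` module to count FASTA header lines. The function name `count_fasta_records` is hypothetical, and the demo writes a tiny throwaway file rather than touching the real download.

```python
import gzip
import os
import tempfile

def count_fasta_records(gz_path):
    """Count lines starting with '>' in a gzip-compressed FASTA file."""
    count = 0
    with gzip.open(gz_path, "rt") as handle:  # "rt" decodes the bytes to text
        for line in handle:
            if line.startswith(">"):
                count += 1
    return count

# Demo on a small temporary file (toy records, not real RefSeq entries)
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.fna.gz")
    with gzip.open(demo_path, "wt") as out:
        out.write(">toy_1 example\nACGTACGT\n>toy_2 example\nGGCC\n")
    n_records = count_fasta_records(demo_path)

print(n_records)
```

On the real data this would have to be called on `data/raw/human_rna.fna.gz` before `gunzip` runs, since `gunzip` removes the `.gz` file once it has been unpacked.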

In [None]:
def strip_fasta_comments(input_path, output_path):
    """
    Drops everything before the first '>' header line in a FASTA file,
    e.g. comment lines starting with '#', ';', or '!'.
    Lines after the first header are copied unchanged.
    """
    with open(input_path, 'r') as infile, open(output_path, 'w') as outfile:
        write = False
        for line in infile:
            if line.startswith('>'):
                write = True  # Start writing after first valid sequence entry
            if write:
                outfile.write(line)

# Clean and save new version
strip_fasta_comments("data/raw/human_rna.fna", "data/processed/human_rna_cleaned.fna")
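
To see what the cleaning step does, the same logic can be applied to lines held in memory. This is a sketch that assumes a hypothetical helper, `drop_preamble`, which mirrors `strip_fasta_comments` but works on any iterable of lines. The sample text is made up.

```python
import io

def drop_preamble(lines):
    """Yield lines starting from the first FASTA header ('>')."""
    started = False
    for line in lines:
        if line.startswith(">"):
            started = True  # everything from here on is passed through
        if started:
            yield line

# Made-up input: two comment lines, one record, one comment-like line after it
sample = io.StringIO(
    "# exported by some tool\n"
    "; another comment\n"
    ">toy_1 example record\n"
    "ACGTACGT\n"
    "# text after the first header is kept as-is\n"
)

cleaned = list(drop_preamble(sample))
print("".join(cleaned), end="")
```

Only the preamble is removed. Comment-like lines that appear after the first header pass through unchanged, so they would still reach the parser.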

In [None]:
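def parse_fasta_to_df(filepath, max_len=200):
    """
    Parse a FASTA file into a DataFrame, skipping sequences shorter than 20 bases.
    `max_len` is the number of leading bases kept in the "Preview" column.
    """
    records = []
    for record in SeqIO.parse(filepath, "fasta"):
        # Skip very short sequences
        if len(record.seq) < 20:
            continue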
        records.append({
            "ID": record.id,
            "Description": record.description,
            "Sequence": str(record.seq),
            "Length": len(record.seq),
            "Preview": str(record.seq[:max_len]) + ("..." if len(record.seq) > max_len else "")
        })
    df = pd.DataFrame(records)
    print(f"✅ Parsed {len(df)} transcript sequences.")
    return df

# Parse cleaned version
fasta_df = parse_fasta_to_df("data/processed/human_rna_cleaned.fna")
fasta_df.head()
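
One simple feature that could be derived from the `ID` column is the molecule type implied by the RefSeq accession prefix (for example `NM_` for mRNA and `NR_` for non-coding RNA). The sketch below is an illustration only: the function name `accession_category`, the label strings, and the example accessions are assumptions, and the mapping covers only the four RNA prefixes.

```python
# RefSeq accession prefixes for RNA records and an assumed label for each
RNA_PREFIXES = {
    "NM_": "mRNA",
    "NR_": "non-coding RNA",
    "XM_": "predicted mRNA",
    "XR_": "predicted non-coding RNA",
}

def accession_category(accession):
    """Map an accession to a coarse label based on its three-character prefix."""
    return RNA_PREFIXES.get(accession[:3], "other")

# Illustrative accession strings, not taken from the dataset
examples = ["NM_000001.1", "XR_000002.1", "toy_3"]
categories = [accession_category(a) for a in examples]
print(categories)
```

With the parsed DataFrame, something like `fasta_df["ID"].map(accession_category)` could add this as a new column, assuming the IDs are plain accessions.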

In [None]:
fasta_df.to_csv("data/processed/human_fasta_clean.csv", index=False)
print("📁 Saved parsed FASTA to: data/processed/human_fasta_clean.csv")