# EDA Taxonomy Analysis

This notebook analyzes the taxonomy distribution in the training and testing datasets for the CAFA 6 competition.

**Goals:**
1. Count taxonomy occurrences in `train_taxonomy.tsv`.
2. Extract taxonomy IDs from `testsuperset.fasta`.
3. Check whether all test taxonomies are present in the training set.
4. Check whether any protein is associated with multiple taxonomies in the training set.

In [None]:
import pandas as pd
import os

# Define file paths
DATA_DIR = "../../data"
TRAIN_TAXONOMY_PATH = os.path.join(DATA_DIR, "Train", "train_taxonomy.tsv")
TEST_FASTA_PATH = os.path.join(DATA_DIR, "Test", "testsuperset.fasta")

print(f"Train Taxonomy Path: {TRAIN_TAXONOMY_PATH}")
print(f"Test Fasta Path: {TEST_FASTA_PATH}")

Train Taxonomy Path: ../../data/Train/train_taxonomy.tsv
Test Fasta Path: ../../data/Test/testsuperset.fasta


## 1. Analyze Training Taxonomy

In [2]:
# Load train_taxonomy.tsv
train_tax_df = pd.read_csv(TRAIN_TAXONOMY_PATH, sep="\t", header=None, names=["ProteinID", "TaxonomyID"])
print(f"Loaded {len(train_tax_df)} rows from train_taxonomy.tsv")
train_tax_df.head()

Loaded 82404 rows from train_taxonomy.tsv


,ProteinID,TaxonomyID
0,A0A0C5B5G6,9606
1,A0JNW5,9606
2,A0JP26,9606
3,A0PK11,9606
4,A1A4S6,9606


In [3]:
# Count unique taxonomies in Train
train_taxonomies = set(train_tax_df["TaxonomyID"].unique())
print(f"Number of unique taxonomies in Train: {len(train_taxonomies)}")

Number of unique taxonomies in Train: 1381
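
The cell above reports only the number of distinct taxonomies. To see how proteins are spread across them, a per-taxonomy count can be sketched as below. The `toy_train_tax_df` frame and its rows are hypothetical stand-ins for `train_tax_df`, so the printed numbers say nothing about the real data.

```python
import pandas as pd

# Hypothetical stand-in for train_tax_df; these rows are illustrative only.
toy_train_tax_df = pd.DataFrame(
    {
        "ProteinID": ["P1", "P2", "P3", "P4", "P5"],
        "TaxonomyID": [9606, 9606, 10090, 9606, 10090],
    }
)

# Number of proteins per taxonomy, most frequent first.
taxonomy_counts = toy_train_tax_df["TaxonomyID"].value_counts()

# Share of all proteins contributed by each taxonomy.
taxonomy_share = taxonomy_counts / taxonomy_counts.sum()

print(taxonomy_counts.head(10))
print(taxonomy_share.head(10))
```

Swapping `toy_train_tax_df` for `train_tax_df` should give the same summary for the real training file, assuming the column names used when it was loaded.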


## 2. Analyze Test Taxonomy from Fasta

In [4]:
from Bio import SeqIO

# Parse testsuperset.fasta to extract taxonomy IDs using BioPython
test_taxonomies = set()

# Iterate over the fasta file
for record in SeqIO.parse(TEST_FASTA_PATH, "fasta"):
    # Header format: >ProteinID TaxonomyID
    # record.description contains the full header line after >
    # e.g., "A0A0C5B5G6 9606"
    parts = record.description.split()
    if len(parts) >= 2:
        try:
            # The second part is expected to be the TaxonomyID
            tax_id = int(parts[1])
            test_taxonomies.add(tax_id)
        except ValueError:
            # Second field is not an integer taxonomy ID; skip this record
            pass

print(f"Number of unique taxonomies in Test: {len(test_taxonomies)}")
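
The loop above keeps only the set of taxonomy IDs and silently skips headers it cannot parse. A possible variant, sketched here, also counts proteins per taxonomy and records the skipped headers so they can be inspected. The helper `parse_tax_id`, the list `toy_descriptions`, and the IDs in it are hypothetical; in the notebook the strings would come from `record.description`.

```python
from collections import Counter


def parse_tax_id(description):
    """Return the taxonomy ID from a 'ProteinID TaxonomyID' header, or None."""
    parts = description.split()
    if len(parts) < 2:
        return None
    try:
        return int(parts[1])
    except ValueError:
        return None


# Hypothetical header descriptions standing in for record.description values.
toy_descriptions = [
    "A0A0C5B5G6 9606",
    "Q00001 10090",
    "Q00002 9606",
    "Q00003",          # no taxonomy field
    "Q00004 unknown",  # non-integer taxonomy field
]

test_tax_counts = Counter()
skipped_headers = []
for description in toy_descriptions:
    tax_id = parse_tax_id(description)
    if tax_id is None:
        skipped_headers.append(description)
    else:
        test_tax_counts[tax_id] += 1

print(f"Unique taxonomies: {len(test_tax_counts)}")
print(f"Most common: {test_tax_counts.most_common(5)}")
print(f"Skipped headers: {len(skipped_headers)}")
```

Keeping the skipped headers makes it easy to tell whether an unexpectedly small taxonomy count reflects the data or a header format the parser did not anticipate.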

## 3. Compare Train and Test Taxonomies

In [5]:
# Check coverage
missing_in_train = test_taxonomies - train_taxonomies
is_covered = len(missing_in_train) == 0

print(f"Does Train cover all Test taxonomies? {is_covered}")
print(f"Number of Test taxonomies missing in Train: {len(missing_in_train)}")

if not is_covered:
    print("Missing Taxonomies:", missing_in_train)
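
The check above compares sets of taxonomy IDs, so a rare missing taxonomy weighs as much as a common one. A protein-weighted view, sketched below, estimates what fraction of test proteins belong to taxonomies absent from the training set. It assumes per-taxonomy protein counts for the test set are available; `toy_test_tax_counts` and `toy_train_taxonomies` are hypothetical values, not results from the competition files.

```python
from collections import Counter

# Hypothetical inputs: proteins per test taxonomy and the set of train taxonomies.
toy_test_tax_counts = Counter({9606: 6, 10090: 3, 99999: 1})
toy_train_taxonomies = {9606, 10090}

# Test taxonomies that never appear in the training set.
missing = set(toy_test_tax_counts) - toy_train_taxonomies

# Weight each missing taxonomy by the number of test proteins it contains.
n_test_proteins = sum(toy_test_tax_counts.values())
n_uncovered = sum(toy_test_tax_counts[tax_id] for tax_id in missing)
uncovered_fraction = n_uncovered / n_test_proteins

print(f"Missing taxonomies: {sorted(missing)}")
print(f"Test proteins in missing taxonomies: {n_uncovered} of {n_test_proteins}")
print(f"Uncovered fraction: {uncovered_fraction:.1%}")
```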

## 4. Check Protein-Taxonomy Multiplicity in Train
Check whether any protein in the training set is associated with more than one taxonomy.

In [6]:
# Count unique taxonomies per protein
protein_tax_counts = train_tax_df.groupby("ProteinID")["TaxonomyID"].nunique()

# Check if any protein has > 1 taxonomy
multi_tax_proteins = protein_tax_counts[protein_tax_counts > 1]

print(f"Number of proteins with multiple taxonomies: {len(multi_tax_proteins)}")

if len(multi_tax_proteins) > 0:
    print("Examples:")
    print(multi_tax_proteins.head())

Number of proteins with multiple taxonomies: 0
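
Because `nunique()` counts distinct taxonomies per protein, it would not flag a protein that is listed several times with the same taxonomy. A complementary check for repeated `ProteinID` rows could look like the sketch below. The `toy_df` frame is hypothetical and built to contain both kinds of repetition; it does not reflect the contents of `train_taxonomy.tsv`.

```python
import pandas as pd

# Hypothetical frame: P2 is an exact duplicate row, P3 has two different taxonomies.
toy_df = pd.DataFrame(
    {
        "ProteinID": ["P1", "P2", "P2", "P3", "P3"],
        "TaxonomyID": [9606, 10090, 10090, 9606, 7227],
    }
)

# All rows whose ProteinID appears more than once, regardless of taxonomy.
repeated_id_rows = toy_df[toy_df.duplicated(subset=["ProteinID"], keep=False)]

# Rows that are repeated in full (same ProteinID and same TaxonomyID).
exact_duplicate_rows = toy_df[toy_df.duplicated(keep=False)]

# True only if every ProteinID occurs exactly once.
is_one_row_per_protein = toy_df["ProteinID"].is_unique

print(f"Rows with a repeated ProteinID: {len(repeated_id_rows)}")
print(f"Exact duplicate rows: {len(exact_duplicate_rows)}")
print(f"One row per protein? {is_one_row_per_protein}")
```

If `is_one_row_per_protein` is true for the real frame, the protein-to-taxonomy mapping can be treated as a plain lookup table.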
