### <span style="color:teal"> __TAXONOMY DICTIONARY MAPPING__ </span>
____

This methodology resolves scientific names and phylum-level classifications for NCBI TaxIDs in two stages, with batching, fallbacks, and rate-limit safeguards.

First, it collects all unique `ncbi_taxon_id` values from the dataset and queries **NCBI E-utilities (esummary, db=taxonomy)** in batches of 200 using a provided API key. For each TaxID, it prefers the **species-rank** scientific name and otherwise accepts the scientific name at whatever rank NCBI returns. Any IDs not resolved by NCBI are retried one at a time against the **UniProt Taxonomy REST API**; entries that remain unresolved are labeled `"Unknown"`. This produces `taxon_dict`, a mapping TaxID → scientific name, which is saved to JSON and accompanied by a small summary report (counts resolved via NCBI, via UniProt, and unresolved).

In the second stage, the code retrieves **taxonomy XML** from NCBI (efetch, db=taxonomy) in batches of 100 and parses each `<Taxon>` record's `LineageEx` to find the ancestor whose `<Rank>` equals `"phylum"`, recording that ancestor's `<ScientificName>`. If no phylum is found, the entry is marked `"phylum_Not_Found"`; if a batch request fails twice, every ID in that batch is recorded as `"Error"`.

The workflow respects NCBI's rate guidelines with short sleeps between requests (approximately 0.34 to 0.40 s), catches and logs request errors instead of aborting, and skips the sentinel `-1` by assigning `"Unknown"` up front (while still including it in the final mapping for completeness). The two outputs, **`taxon_dict.json`** (TaxID → scientific name) and **`taxon_phyla_dict.json`** (TaxID → phylum), are persisted for **downstream analyses**, enabling consistent taxon naming and phylum-level grouping or filtering without re-querying external services. The phylum file is derived strictly from the parsed lineage and serves as a compact classification layer suitable for stratification, summaries, or QC.
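The resolution order described above (NCBI first, then UniProt, then `"Unknown"`) can be tried offline before any network call is made. The following is a minimal sketch, not the notebook's implementation: `chunked`, `resolve_names`, `fake_ncbi`, and `fake_uniprot` are hypothetical names, the two lookup functions are stand-ins that read from small hard-coded tables, and TaxID `111` with its name is made up for illustration.

```python
from typing import Callable, Dict, Iterable, List, Optional, Tuple


def chunked(ids: List[str], size: int) -> Iterable[List[str]]:
    """Yield consecutive slices holding at most `size` IDs."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def resolve_names(
    ids: List[str],
    primary: Callable[[List[str]], Dict[str, str]],
    fallback: Callable[[str], Optional[str]],
    batch_size: int = 200,
) -> Tuple[Dict[str, str], Dict[str, int]]:
    """Resolve names batch-wise via `primary`, then per ID via `fallback`."""
    names: Dict[str, str] = {}
    counts = {"primary": 0, "fallback": 0, "unresolved": 0}
    for batch in chunked(ids, batch_size):
        hits = primary(batch)  # one call per batch
        for tid in batch:
            if tid in hits:
                names[tid] = hits[tid]
                counts["primary"] += 1
                continue
            name = fallback(tid)  # one call per missing ID
            if name:
                names[tid] = name
                counts["fallback"] += 1
            else:
                names[tid] = "Unknown"
                counts["unresolved"] += 1
    return names, counts


# Stand-ins for the real services (assumptions for this sketch only).
def fake_ncbi(batch: List[str]) -> Dict[str, str]:
    table = {"562": "Escherichia coli"}
    return {tid: table[tid] for tid in batch if tid in table}


def fake_uniprot(tid: str) -> Optional[str]:
    return {"111": "placeholder taxon"}.get(tid)


names, counts = resolve_names(["562", "111", "-1"], fake_ncbi, fake_uniprot, batch_size=2)
print(names)
print(counts)
```

Passing the lookups in as functions keeps the precedence logic separate from HTTP handling, so the same loop can be exercised with stubs like these or with real request functions.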

In [10]:
# Loading packages
import pandas as pd
import numpy as np
import json
import requests
import time
from tqdm import tqdm
import xml.etree.ElementTree as ET

In [11]:
# Load the main dataset (GMRepo 'species_abundance.txt')
df = pd.read_csv("/mnt/iusers01/fatpou01/bmh01/msc-bioinf-2024-2025/h44063jg/gm_repository/species_abundance.txt", sep = '\t')

In [12]:
taxon_id = df['ncbi_taxon_id'].unique()
# Resolve an up-to-date scientific name for each TaxID from NCBI, falling back to UniProt

API_KEY = "87fdd45a6ee743fecdbd1b7e9f010d669109"
BATCH_SIZE = 200

# Ensure your taxon_id list exists and convert to strings
taxon_id = [str(x) for x in taxon_id]

# Final dictionary and counters
taxon_dict = {}
from_ncbi = 0
from_uniprot = 0
unresolved = 0

def fetch_ncbi_batch(batch_ids):
    """Query NCBI for names (species preferred, fallback to higher rank)."""
    url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi"
    params = {
        'db': 'taxonomy',
        'id': ",".join(batch_ids),
        'retmode': 'json',
        'api_key': API_KEY
    }
    species = {}
    fallback = {}
    try:
        r = requests.get(url, params=params, timeout=15)
        if r.status_code == 200:
            result = r.json().get("result", {})
            for tid in batch_ids:
                record = result.get(tid, {})
                name = record.get("scientificname")
                rank = record.get("rank", "")
                if name:
                    if rank == "species":
                        species[tid] = name
                    else:
                        fallback[tid] = name
    except Exception as e:
        print(f" NCBI error for {batch_ids[:3]}: {e}")
    return species, fallback

def get_name_from_uniprot(tax_id):
    """Query UniProt as fallback."""
    url = f"https://rest.uniprot.org/taxonomy/{tax_id}"
    try:
        r = requests.get(url, timeout=10)
        if r.status_code == 200:
            return r.json().get("scientificName")
    except Exception as e:
        print(f" UniProt error for {tax_id}: {e}")
    return None

# Step 1: Query NCBI
missing_ids = []
for i in tqdm(range(0, len(taxon_id), BATCH_SIZE), desc="NCBI Batch lookup"):
    batch = taxon_id[i:i+BATCH_SIZE]
    species_dict, fallback_dict = fetch_ncbi_batch(batch)

    for tid in batch:
        tid_int = int(tid)
        if tid in species_dict:
            taxon_dict[tid_int] = species_dict[tid]
            from_ncbi += 1
        elif tid in fallback_dict:
            taxon_dict[tid_int] = fallback_dict[tid]
            from_ncbi += 1
        else:
            missing_ids.append(tid)

    time.sleep(0.4)  # NCBI rate limit

# Step 2: Query UniProt
for tid in tqdm(missing_ids, desc="UniProt fallback"):
    name = get_name_from_uniprot(tid)
    tid_int = int(tid)
    if name:
        taxon_dict[tid_int] = name
        from_uniprot += 1
    else:
        taxon_dict[tid_int] = "Unknown"
        unresolved += 1

# Step 3: Save the complete dictionary
with open("taxon_dict.json", "w") as f:
    json.dump({str(k): v for k, v in taxon_dict.items()}, f, indent=2)

# Step 4: Report summary
print("\n✅ Taxonomic Name Resolution Summary")
print("------------------------------------")
print(f"🔹 Total Taxon IDs Provided     : {len(taxon_id)}")
print(f"🔹 Resolved from NCBI           : {from_ncbi}")
print(f"🔹 Resolved from UniProt        : {from_uniprot}")
print(f"🔹 Unresolved IDs               : {unresolved}")
print(f"🔹 Final Entries in Dictionary  : {len(taxon_dict)}")


# Save final dictionary
with open("/mnt/iusers01/fatpou01/bmh01/msc-bioinf-2024-2025/h44063jg/gmp_jms/taxon_dict.json", "w") as f:
    json.dump({str(k): v for k, v in taxon_dict.items()}, f, indent=2)

print("\n💾 Files saved:")
print(" - taxon_dict.json")

NCBI Batch lookup: 100%|██████████| 45/45 [00:35<00:00,  1.26it/s]
UniProt fallback: 100%|██████████| 92/92 [00:41<00:00,  2.21it/s]


✅ Taxonomic Name Resolution Summary
------------------------------------
🔹 Total Taxon IDs Provided     : 8910
🔹 Resolved from NCBI           : 8818
🔹 Resolved from UniProt        : 91
🔹 Unresolved IDs               : 1
🔹 Final Entries in Dictionary  : 8910

💾 Files saved:
 - taxon_dict.json
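Because JSON object keys are always strings, the integer TaxIDs used in memory come back as strings when the file is reloaded. The sketch below shows one way to restore them and to list the unresolved IDs; it uses toy data and a temporary file (both assumptions for illustration) rather than the notebook's real `taxon_dict.json`.

```python
import json
import os
import tempfile

# Toy mapping standing in for the real taxon_dict (assumption for this sketch).
taxon_dict = {562: "Escherichia coli", -1: "Unknown"}

# Write it the same way the notebook does: keys converted to strings.
path = os.path.join(tempfile.mkdtemp(), "taxon_dict.json")
with open(path, "w") as f:
    json.dump({str(k): v for k, v in taxon_dict.items()}, f, indent=2)

# Reload: every key is now a str.
with open(path) as f:
    loaded = json.load(f)

# Convert keys back to int so lookups match the dataset's numeric TaxID column.
restored = {int(k): v for k, v in loaded.items()}

# IDs that neither service could name.
unresolved_ids = sorted(k for k, v in restored.items() if v == "Unknown")
print(unresolved_ids)
```

Converting the keys once at load time avoids silent lookup misses when the mapping is later applied to an integer `ncbi_taxon_id` column.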





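The phylum stage depends on walking each record's `LineageEx` element. The sketch below isolates that parsing step so it can be checked without a network request. `SAMPLE_XML` is a hand-written, heavily abbreviated imitation of an efetch taxonomy response (the second record uses a made-up TaxID), and `phylum_by_taxid` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Hand-written, abbreviated sample (an assumption, not a real NCBI response).
SAMPLE_XML = """<TaxaSet>
  <Taxon>
    <TaxId>562</TaxId>
    <ScientificName>Escherichia coli</ScientificName>
    <Rank>species</Rank>
    <LineageEx>
      <Taxon><TaxId>2</TaxId><ScientificName>Bacteria</ScientificName><Rank>superkingdom</Rank></Taxon>
      <Taxon><TaxId>1224</TaxId><ScientificName>Pseudomonadota</ScientificName><Rank>phylum</Rank></Taxon>
    </LineageEx>
  </Taxon>
  <Taxon>
    <TaxId>999999999</TaxId>
    <ScientificName>made-up taxon without a phylum</ScientificName>
    <Rank>no rank</Rank>
    <LineageEx>
      <Taxon><TaxId>2</TaxId><ScientificName>Bacteria</ScientificName><Rank>superkingdom</Rank></Taxon>
    </LineageEx>
  </Taxon>
</TaxaSet>"""


def phylum_by_taxid(xml_text, missing="phylum_Not_Found"):
    """Map each top-level TaxId to the name of its phylum-rank ancestor."""
    result = {}
    root = ET.fromstring(xml_text)
    # findall("Taxon") matches direct children of the root only, so the
    # ancestor <Taxon> elements nested in LineageEx are not treated as records.
    for taxon in root.findall("Taxon"):
        tid = taxon.findtext("TaxId")
        phylum = missing
        for ancestor in taxon.findall("LineageEx/Taxon"):
            if ancestor.findtext("Rank") == "phylum":
                phylum = ancestor.findtext("ScientificName")
                break
        result[tid] = phylum
    return result


phyla = phylum_by_taxid(SAMPLE_XML)
print(phyla)
```

Note that the keys come from the `<TaxId>` text in the response, so they are strings, and an ID that NCBI does not return would simply be absent from the result.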
In [13]:
# === Your API key here
NCBI_API_KEY = "87fdd45a6ee743fecdbd1b7e9f010d669109"

# === Function to fetch taxonomy XML for a batch of taxon IDs
def fetch_phylum_batch(taxon_ids):
    joined_ids = ",".join(str(t) for t in taxon_ids)
    params = {
        "db": "taxonomy",
        "id": joined_ids,
        "retmode": "xml",
        "api_key": NCBI_API_KEY
    }
    url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

    try:
        # A timeout prevents one stalled request from hanging the whole run
        response = requests.get(url, params=params, timeout=30)
        response.raise_for_status()
        return response.text
    except Exception as e:
        print(f"⚠️ Error fetching batch {taxon_ids[:3]}...: {e}")
        return None

# === Main script to fetch phylum for each TaxID
def batch_get_phylum(taxon_dict, batch_size=100):
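    # Keys may be ints (in memory) or strings (reloaded from JSON); compare as
    # strings so the -1 sentinel is always excluded from the NCBI queries.
    all_taxids = [int(tid) for tid in taxon_dict.keys() if str(tid) != "-1"]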
    taxon_phylum_map = {"-1": "Unknown"}

    for i in range(0, len(all_taxids), batch_size):
        batch = all_taxids[i:i + batch_size]
        print(f"🔍 Processing batch {i+1} to {i+len(batch)} of {len(all_taxids)}...")

        xml_data = fetch_phylum_batch(batch)
        if not xml_data:
            # Retry once after delay
            time.sleep(1)
            xml_data = fetch_phylum_batch(batch)
            if not xml_data:
                for tid in batch:
                    taxon_phylum_map[str(tid)] = "Error"
                continue

        # Parse each Taxon block individually
        root = ET.fromstring(xml_data)
        for taxon in root.findall("Taxon"):
            tid = taxon.findtext("TaxId")
            phylum_name = "phylum_Not_Found"

            for ancestor in taxon.findall(".//LineageEx/Taxon"):
                if ancestor.findtext("Rank") == "phylum":
                    phylum_name = ancestor.findtext("ScientificName")
                    break

            taxon_phylum_map[tid] = phylum_name

        time.sleep(0.34)  # NCBI rate limit: keep <3 requests/sec

    return taxon_phylum_map

# === Example usage
# taxon_dict = {"562": "Escherichia coli", "1280": "Staphylococcus aureus", ...}
output_dict = batch_get_phylum(taxon_dict, batch_size=100)

unique_values = set(output_dict.values())
print(f"Number of unique values: {len(unique_values)}")

# === Save to JSON
with open("/mnt/iusers01/fatpou01/bmh01/msc-bioinf-2024-2025/h44063jg/gmp_jms/taxon_phyla_dict.json", "w") as f:
    json.dump(output_dict, f, indent=2)

print("✅ Done! Phyla saved to 'taxon_phyla_dict.json'")

🔍 Processing batch 1 to 100 of 8910...
🔍 Processing batch 101 to 200 of 8910...
🔍 Processing batch 201 to 300 of 8910...
🔍 Processing batch 301 to 400 of 8910...
🔍 Processing batch 401 to 500 of 8910...
🔍 Processing batch 501 to 600 of 8910...
🔍 Processing batch 601 to 700 of 8910...
🔍 Processing batch 701 to 800 of 8910...
🔍 Processing batch 801 to 900 of 8910...
🔍 Processing batch 901 to 1000 of 8910...
🔍 Processing batch 1001 to 1100 of 8910...
🔍 Processing batch 1101 to 1200 of 8910...
🔍 Processing batch 1201 to 1300 of 8910...
🔍 Processing batch 1301 to 1400 of 8910...
🔍 Processing batch 1401 to 1500 of 8910...
🔍 Processing batch 1501 to 1600 of 8910...
🔍 Processing batch 1601 to 1700 of 8910...
🔍 Processing batch 1701 to 1800 of 8910...
🔍 Processing batch 1801 to 1900 of 8910...
🔍 Processing batch 1901 to 2000 of 8910...
🔍 Processing batch 2001 to 2100 of 8910...
🔍 Processing batch 2101 to 2200 of 8910...
🔍 Processing batch 2201 to 2300 of 8910...
🔍 Processing batch 2301 to 2400 