# NB 04: Cross-Species Metal Fitness Families

Map metal-important genes to ortholog groups and identify gene families
with conserved metal fitness phenotypes across multiple species.

**Runs locally** — uses ortholog groups from `essential_genome/data/`.

**Inputs**:
- `data/metal_important_genes.csv` (from NB02)
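- `essential_genome/data/all_ortholog_groups.csv` (falls back to `fitness_modules/data/orthologs/ortholog_groups.csv`)
- `essential_genome/data/family_conservation.tsv`
- `essential_genome/data/essential_families.tsv`
- `essential_genome/data/all_seed_annotations.tsv` (optional; used for functional descriptions if present)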

**Outputs**:
- `data/conserved_metal_families.csv`
- `data/novel_metal_candidates.csv`
- `figures/metal_family_conservation_heatmap.png`

In [1]:
import pandas as pd
import numpy as np
from pathlib import Path
from scipy import stats
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
import seaborn as sns

PROJECT_DIR = Path('..').resolve()
DATA_DIR = PROJECT_DIR / 'data'
FIGURES_DIR = PROJECT_DIR / 'figures'
EG_DATA = PROJECT_DIR.parent / 'essential_genome' / 'data'
FM_DATA = PROJECT_DIR.parent / 'fitness_modules' / 'data'

# Load data
metal_important = pd.read_csv(DATA_DIR / 'metal_important_genes.csv')
metal_important['locusId'] = metal_important['locusId'].astype(str)
print(f'Metal-important gene records: {len(metal_important):,}')

# Ortholog groups — try essential_genome first, then fitness_modules
og_path = EG_DATA / 'all_ortholog_groups.csv'
if not og_path.exists():
    og_path = FM_DATA / 'orthologs' / 'ortholog_groups.csv'
ortholog_groups = pd.read_csv(og_path)
ortholog_groups['locusId'] = ortholog_groups['locusId'].astype(str)
print(f'Ortholog group assignments: {len(ortholog_groups):,}')
print(f'Unique ortholog groups: {ortholog_groups["OG_id"].nunique():,}')

family_cons = pd.read_csv(EG_DATA / 'family_conservation.tsv', sep='\t')
print(f'Family conservation records: {len(family_cons):,}')

essential_fam = pd.read_csv(EG_DATA / 'essential_families.tsv', sep='\t')
print(f'Essential family records: {len(essential_fam):,}')

Metal-important gene records: 12,838
Ortholog group assignments: 179,237
Unique ortholog groups: 17,222
Family conservation records: 16,758
Essential family records: 17,222


## 1. Map Metal-Important Genes to Ortholog Groups

In [2]:
# Join metal-important genes with ortholog groups
metal_og = metal_important.merge(
    ortholog_groups[['orgId', 'locusId', 'OG_id']],
    on=['orgId', 'locusId'],
    how='left'
)

n_mapped = metal_og['OG_id'].notna().sum()
print(f'Metal-important genes mapped to OGs: {n_mapped:,} / {len(metal_og):,} '
      f'({100*n_mapped/len(metal_og):.1f}%)')

metal_og_mapped = metal_og[metal_og['OG_id'].notna()].copy()
print(f'Unique OGs with metal phenotype: {metal_og_mapped["OG_id"].nunique():,}')

Metal-important genes mapped to OGs: 10,876 / 12,838 (84.7%)
Unique OGs with metal phenotype: 2,891
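
The left join above implicitly assumes that each `(orgId, locusId)` pair belongs to at most one ortholog group; if that were not true, rows would be duplicated and the mapped count inflated. The following is a minimal sketch, on toy data with hypothetical values, of how the `validate` and `indicator` options of `DataFrame.merge` could be used to check that assumption and to list the unmapped genes. It is not part of the original analysis.

```python
import pandas as pd

# Toy stand-ins for the notebook's tables (all values are hypothetical)
genes = pd.DataFrame({
    'orgId': ['orgA', 'orgA', 'orgB'],
    'locusId': ['1', '2', '1'],
})
ogs = pd.DataFrame({
    'orgId': ['orgA', 'orgB'],
    'locusId': ['1', '1'],
    'OG_id': ['OG00001', 'OG00001'],
})

# validate='many_to_one' raises MergeError if a gene maps to more than one OG;
# indicator=True adds a '_merge' column recording where each row matched.
joined = genes.merge(ogs, on=['orgId', 'locusId'], how='left',
                     validate='many_to_one', indicator=True)

n_mapped = (joined['_merge'] == 'both').sum()
unmapped = joined.loc[joined['_merge'] == 'left_only', ['orgId', 'locusId']]
print(f'Mapped: {n_mapped} / {len(joined)}')
print(unmapped.to_string(index=False))
```

With `validate` in place, a duplicated ortholog assignment fails loudly at the join instead of silently changing the denominator of the mapping rate.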


## 2. Identify Conserved Metal Families

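A "conserved metal family" is an ortholog group whose members show metal
fitness defects in ≥2 organisms. Two counts are computed below: a strict one
(≥2 organisms for the same metal) and a relaxed one (≥2 organisms for any
metal). The relaxed count defines the families annotated and saved in later
sections.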
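
To make the difference between the same-metal and any-metal counts concrete, here is a small sketch on toy records (the OG, organism, and metal values are hypothetical). It uses the same `groupby` and `nunique` pattern as the cell that follows.

```python
import pandas as pd

# Toy records (hypothetical): one row per gene x metal with a fitness defect
records = pd.DataFrame({
    'OG_id':         ['OG1', 'OG1', 'OG2', 'OG2', 'OG3'],
    'orgId':         ['orgA', 'orgB', 'orgA', 'orgB', 'orgA'],
    'metal_element': ['Zinc', 'Zinc', 'Zinc', 'Copper', 'Nickel'],
})

# Strict: >= 2 organisms share a defect for the same metal
per_metal = (records.groupby(['OG_id', 'metal_element'])['orgId']
             .nunique().reset_index(name='n_organisms'))
strict = set(per_metal.loc[per_metal['n_organisms'] >= 2, 'OG_id'])

# Relaxed: >= 2 organisms have a defect for any metal
any_metal = records.groupby('OG_id')['orgId'].nunique()
relaxed = set(any_metal[any_metal >= 2].index)

print(sorted(strict))   # OG1 only
print(sorted(relaxed))  # OG1 and OG2
```

`OG2` is conserved only under the relaxed definition, because its two organisms respond to different metals.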

In [3]:
# For each OG × metal, count organisms with metal phenotype
og_metal_counts = metal_og_mapped.groupby(['OG_id', 'metal_element']).agg(
    n_organisms=('orgId', 'nunique'),
    organisms=('orgId', lambda x: ','.join(sorted(x.unique()))),
    mean_fitness=('mean_fit', 'mean'),
    min_fitness=('min_fit', 'min'),
).reset_index()

# Conserved: metal phenotype in >= 2 organisms
conserved_2 = og_metal_counts[og_metal_counts['n_organisms'] >= 2]
conserved_3 = og_metal_counts[og_metal_counts['n_organisms'] >= 3]

print(f'OG × metal combinations: {len(og_metal_counts):,}')
print(f'Conserved (≥2 organisms): {len(conserved_2):,} families × metals '
      f'({conserved_2["OG_id"].nunique()} unique OGs)')
print(f'Conserved (≥3 organisms): {len(conserved_3):,} families × metals '
      f'({conserved_3["OG_id"].nunique()} unique OGs)')

# Also count across ANY metal (organism has ANY metal phenotype)
og_any_metal = metal_og_mapped.groupby('OG_id').agg(
    n_organisms_any=('orgId', 'nunique'),
    n_metals=('metal_element', 'nunique'),
    metals=('metal_element', lambda x: ','.join(sorted(x.unique()))),
).reset_index()

conserved_any_2 = og_any_metal[og_any_metal['n_organisms_any'] >= 2]
conserved_any_3 = og_any_metal[og_any_metal['n_organisms_any'] >= 3]
print(f'\nOGs with metal phenotype in ≥2 organisms (any metal): {len(conserved_any_2):,}')
print(f'OGs with metal phenotype in ≥3 organisms (any metal): {len(conserved_any_3):,}')

OG × metal combinations: 7,231
Conserved (≥2 organisms): 1,836 families × metals (906 unique OGs)
Conserved (≥3 organisms): 704 families × metals (353 unique OGs)



OGs with metal phenotype in ≥2 organisms (any metal): 1,182
OGs with metal phenotype in ≥3 organisms (any metal): 601


In [4]:
# Top conserved families by organism breadth
print('\nTop 30 conserved metal families (by # organisms with metal phenotype):')
print('=' * 90)
top_families = og_any_metal.sort_values('n_organisms_any', ascending=False).head(30)
for _, row in top_families.iterrows():
    print(f'  {row.OG_id:12s}  {row.n_organisms_any:2d} organisms  '
          f'{row.n_metals:2d} metals  [{row.metals}]')


Top 30 conserved metal families (by # organisms with metal phenotype):
  OG00424       18 organisms   7 metals  [Aluminum,Cobalt,Copper,Molybdenum,Nickel,Tungsten,Zinc]
  OG00245       18 organisms   5 metals  [Aluminum,Cobalt,Copper,Nickel,Zinc]
  OG00128       17 organisms   9 metals  [Aluminum,Chromium,Cobalt,Copper,Manganese,Molybdenum,Nickel,Tungsten,Zinc]
  OG00605       17 organisms   7 metals  [Aluminum,Cobalt,Copper,Molybdenum,Nickel,Tungsten,Zinc]
  OG00016       16 organisms   5 metals  [Aluminum,Cobalt,Copper,Nickel,Zinc]
  OG00082       15 organisms   8 metals  [Aluminum,Chromium,Cobalt,Copper,Molybdenum,Nickel,Tungsten,Zinc]
  OG00342       15 organisms   5 metals  [Aluminum,Cobalt,Copper,Nickel,Zinc]
  OG00019       15 organisms   9 metals  [Aluminum,Cadmium,Chromium,Cobalt,Copper,Iron,Nickel,Uranium,Zinc]
  OG00151       14 organisms   5 metals  [Aluminum,Cobalt,Copper,Nickel,Zinc]
  OG00075       14 organisms   7 metals  [Aluminum,Cobalt,Copper,Iron,Molybdenum,Nickel,

## 3. Annotate Conserved Families

In [5]:
# Load SEED annotations for functional context
seed_path = EG_DATA / 'all_seed_annotations.tsv'
if seed_path.exists():
    seed_annot = pd.read_csv(seed_path, sep='\t')
    seed_annot['locusId'] = seed_annot['locusId'].astype(str)
    print(f'SEED annotations: {len(seed_annot):,}')

    og_seed = ortholog_groups.merge(
        seed_annot[['orgId', 'locusId', 'seed_desc']].drop_duplicates(),
        on=['orgId', 'locusId'], how='left'
    )
    og_seed_summary = og_seed.groupby('OG_id')['seed_desc'].agg(
        lambda x: x.dropna().mode().iloc[0] if len(x.dropna()) > 0 and len(x.dropna().mode()) > 0 else 'hypothetical'
    ).reset_index()
    og_seed_summary.columns = ['OG_id', 'seed_description']
else:
    print('No SEED annotations available')
    og_seed_summary = pd.DataFrame(columns=['OG_id', 'seed_description'])

# Add conservation and essentiality
conserved_annotated = og_any_metal.merge(family_cons, on='OG_id', how='left')
conserved_annotated = conserved_annotated.merge(og_seed_summary, on='OG_id', how='left')
conserved_annotated = conserved_annotated.merge(
    essential_fam[['OG_id', 'essentiality_class', 'frac_essential']],
    on='OG_id', how='left'
)
print(f'\nAnnotated families: {len(conserved_annotated):,}')

# Identify novel candidates with finer categories
conserved_with_data = conserved_annotated[conserved_annotated['n_organisms_any'] >= 2].copy()

# Classify each family by how informative its annotation is:
# 'annotated', 'novel_domain', 'novel_metal_function', or 'truly_unknown'
def classify_novelty(row):
    desc = str(row.get('rep_desc', ''))
    seed = str(row.get('seed_description', ''))
    combined = (desc + ' ' + seed).lower()
    if any(k in combined for k in ['hypothetical', 'unknown', 'uncharacterized', 'duf', 'upf']):
        if any(k in combined for k in ['duf', 'upf']):
            return 'novel_domain'  # Has a DUF/UPF domain — known domain, unknown metal function
        elif any(k in combined for k in ['transporter', 'regulator', 'kinase', 'transferase',
                                          'oxidoreductase', 'hydrolase', 'permease', 'efflux']):
            return 'novel_metal_function'  # Has functional hint, unknown metal role
        else:
            return 'truly_unknown'  # No functional annotation at all
    return 'annotated'

conserved_with_data['novelty_class'] = conserved_with_data.apply(classify_novelty, axis=1)

novel = conserved_with_data[conserved_with_data['novelty_class'] != 'annotated'].copy()
truly_unknown = conserved_with_data[conserved_with_data['novelty_class'] == 'truly_unknown']
novel_domain = conserved_with_data[conserved_with_data['novelty_class'] == 'novel_domain']
novel_metal = conserved_with_data[conserved_with_data['novelty_class'] == 'novel_metal_function']

print(f'Conserved families (>=2 orgs): {len(conserved_with_data):,}')
print(f'Novel candidates total: {len(novel):,}')
print(f'  Truly unknown (no annotation): {len(truly_unknown)}')
print(f'  Novel domain (DUF/UPF, unknown metal role): {len(novel_domain)}')
print(f'  Novel metal function (known domain, uncharacterized metal role): {len(novel_metal)}')

SEED annotations: 177,519



Annotated families: 2,891
Conserved families (>=2 orgs): 1,182
Novel candidates total: 149
  Truly unknown (no annotation): 89
  Novel domain (DUF/UPF, unknown metal role): 43
  Novel metal function (known domain, uncharacterized metal role): 17
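
The keyword checks in `classify_novelty` are plain substring matches, so a short key such as `duf` or `upf` could in principle match inside an unrelated word. Below is a sketch of a stricter variant based on regular expressions. It assumes DUF/UPF identifiers are followed by digits (as in `DUF3108` and `UPF0042`), and `classify_text` is a hypothetical helper, not a function from this notebook. Its counts would not necessarily match the ones printed above.

```python
import re

# Assumed identifier shape: 'DUF' or 'UPF' followed by digits (e.g. DUF3108)
DOMAIN_RE = re.compile(r'\b(?:duf|upf)\d+\b', re.IGNORECASE)
UNKNOWN_RE = re.compile(r'\b(?:hypothetical|unknown|uncharacterized)\b',
                        re.IGNORECASE)
# No leading \b, so suffixes such as 'aminotransferase' still count as hints
HINT_RE = re.compile(
    r'(?:transporter|regulator|kinase|transferase|oxidoreductase|'
    r'hydrolase|permease|efflux)\b', re.IGNORECASE)

def classify_text(text: str) -> str:
    """Hypothetical helper mirroring the notebook's categories on one string."""
    if DOMAIN_RE.search(text):
        return 'novel_domain'
    if UNKNOWN_RE.search(text):
        return 'novel_metal_function' if HINT_RE.search(text) else 'truly_unknown'
    return 'annotated'

examples = [
    'Protein of unknown function (DUF3108).',
    'Uncharacterized PLP-dependent aminotransferase YfdZ',
    'hypothetical protein',
    'Copper-exporting P-type ATPase',
]
for text in examples:
    print(f'{classify_text(text):22s} {text}')
```

To apply it to the notebook's table, the helper would be called on the concatenated `rep_desc` and `seed_description` strings, as `classify_novelty` does.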


## 4. Figures

In [6]:
# Distribution of organism breadth for metal OGs
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Left: histogram of organism count per OG
ax = axes[0]
ax.hist(og_any_metal['n_organisms_any'], bins=range(1, og_any_metal['n_organisms_any'].max()+2),
        color='steelblue', alpha=0.8, edgecolor='black', linewidth=0.5, align='left')
ax.set_xlabel('Number of organisms with metal fitness defect')
ax.set_ylabel('Number of ortholog families')
ax.set_title('Metal Fitness Gene Family Breadth')
ax.axvline(2, color='red', linestyle='--', alpha=0.7, label='≥2 = conserved')
ax.legend()

# Right: conservation (pct_core) vs organism breadth
ax = axes[1]
plot_data = conserved_annotated[conserved_annotated['pct_core'].notna()]
ax.scatter(plot_data['n_organisms_any'], plot_data['pct_core'], 
           alpha=0.3, s=10, color='steelblue')
# Add trend
if len(plot_data) > 10:
    bins = plot_data.groupby('n_organisms_any')['pct_core'].mean()
    ax.plot(bins.index, bins.values, 'ro-', markersize=6, linewidth=2, label='Mean')
ax.set_xlabel('# organisms with metal phenotype')
ax.set_ylabel('Pangenome conservation (% core)')
ax.set_title('Metal Family Breadth vs Conservation')
ax.legend()

plt.tight_layout()
FIGURES_DIR.mkdir(parents=True, exist_ok=True)  # make sure the output directory exists
fig.savefig(FIGURES_DIR / 'metal_family_conservation_heatmap.png', dpi=150, bbox_inches='tight')
plt.show()
print(f'Saved: figures/metal_family_conservation_heatmap.png')

Saved: figures/metal_family_conservation_heatmap.png


## 5. Save Results

In [7]:
# --- Issue #8: Characterize novel candidates further ---
print('Top 20 Novel Metal Biology Candidates (hypothetical + conserved):')
print('=' * 100)
novel_sorted = novel.sort_values('n_organisms_any', ascending=False)
for _, row in novel_sorted.head(20).iterrows():
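    # Prefer rep_desc, then the SEED description; both can be NaN after the left joins
    desc = row.get('rep_desc')
    if pd.isna(desc):
        desc = row.get('seed_description')
    if pd.isna(desc):
        desc = 'hypothetical'
    desc = str(desc)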
    ess = row.get('essentiality_class_x', 'unknown')
    pct = row.get('pct_core', 0)
    print(f'  {row.OG_id:10s}  {int(row.n_organisms_any):2d} orgs  '
          f'{int(row.n_metals):2d} metals  core={pct:.0f}%  '
          f'ess={ess}  [{desc[:60]}]')
    print(f'              metals: {row.metals}')

# Summarize novel candidates by functional hints
print(f'\nNovel candidate summary:')
print(f'  Total: {len(novel)}')
# Hints can overlap (e.g. a DUF-containing membrane protein), so the
# "completely unknown" count uses the combined mask rather than subtraction.
rep_desc = novel['rep_desc'].fillna('')
duf_mask = rep_desc.str.contains('DUF', case=False)
membrane_mask = rep_desc.str.contains('membrane|transport', case=False)
# r'ase\b' matches words ending in "-ase" rather than any "ase" substring
enzyme_mask = rep_desc.str.contains(r'ase\b|enzyme', case=False)
duf_count = duf_mask.sum()
membrane_count = membrane_mask.sum()
enzyme_count = enzyme_mask.sum()
print(f'  With DUF domain: {duf_count}')
print(f'  Membrane/transport hint: {membrane_count}')
print(f'  Enzyme hint: {enzyme_count}')
print(f'  Completely unknown: {(~(duf_mask | membrane_mask | enzyme_mask)).sum()}')

# Essentiality distribution of novel candidates
if 'essentiality_class_x' in novel.columns:
    print(f'\n  Essentiality classes:')
    print(novel['essentiality_class_x'].value_counts().to_string())

# Save
conserved_with_data.to_csv(DATA_DIR / 'conserved_metal_families.csv', index=False)
novel.to_csv(DATA_DIR / 'novel_metal_candidates.csv', index=False)

print(f'\nSaved: data/conserved_metal_families.csv ({len(conserved_with_data)} families)')
print(f'Saved: data/novel_metal_candidates.csv ({len(novel)} hypothetical families)')

print('\n' + '=' * 80)
print('NB04 SUMMARY: Cross-Species Metal Fitness Families')
print('=' * 80)
print(f'Metal-important genes mapped to OGs: {n_mapped:,}')
print(f'Unique OGs with metal phenotype: {metal_og_mapped["OG_id"].nunique():,}')
print(f'Conserved families (>=2 organisms): {len(conserved_with_data):,}')
print(f'Conserved families (>=3 organisms): {len(conserved_any_3):,}')
print(f'Novel candidates (hypothetical + conserved): {len(novel):,}')
print(f'  DUF-containing: {duf_count}, Membrane/transport: {membrane_count}, Enzyme: {enzyme_count}')
print('=' * 80)

Top 20 Novel Metal Biology Candidates (hypothetical + conserved):
  OG01383     11 orgs   6 metals  core=100%  ess=variably_essential  [YebC/PmpR transcriptional regulator]
              metals: Aluminum,Chromium,Cobalt,Copper,Tungsten,Zinc
  OG02233      8 orgs   4 metals  core=92%  ess=variably_essential  [phospholipid transport system substrate-binding protein]
              metals: Cobalt,Copper,Nickel,Zinc
  OG02094      8 orgs   7 metals  core=97%  ess=variably_essential  [Uncharacterized P-loop ATPase protein UPF0042]
              metals: Aluminum,Cobalt,Copper,Molybdenum,Nickel,Tungsten,Zinc
  OG00391      7 orgs   9 metals  core=95%  ess=variably_essential  [Uncharacterized PLP-dependent aminotransferase YfdZ]
              metals: Cobalt,Copper,Iron,Mercury,Molybdenum,Nickel,Selenium,Tungsten,Zinc
  OG02716      7 orgs   4 metals  core=92%  ess=never_essential  [Protein of unknown function (DUF3108).]
              metals: Aluminum,Copper,Nickel,Zinc
  OG03264      6 orgs   