# NB 01: Metal Experiment Classification

Identify and classify all metal-related experiments in the Fitness Browser. The
Fitness Browser covers 48 organisms; this notebook scans the 32 that have cached
experiment files. Build a master table mapping each metal experiment to its metal
element, organism, concentration, and experiment metadata.

**Runs locally** — reads cached experiment files from `fitness_modules/data/annotations/`.

**Outputs**: `data/metal_experiments.csv` — master table of all metal experiments.

In [1]:
import pandas as pd
import numpy as np
from pathlib import Path
import re

# Paths
PROJECT_DIR = Path('..').resolve()
DATA_DIR = PROJECT_DIR / 'data'
DATA_DIR.mkdir(exist_ok=True)
FIGURES_DIR = PROJECT_DIR / 'figures'
FIGURES_DIR.mkdir(exist_ok=True)

# Cached experiment files from fitness_modules project
FM_ANNOTATIONS = PROJECT_DIR.parent / 'fitness_modules' / 'data' / 'annotations'

print(f'Project dir: {PROJECT_DIR}')
print(f'Experiment annotations dir: {FM_ANNOTATIONS}')
print(f'Annotation files exist: {FM_ANNOTATIONS.exists()}')

Project dir: /home/psdehal/pangenome_science/BERIL-research-observatory/projects/metal_fitness_atlas
Experiment annotations dir: /home/psdehal/pangenome_science/BERIL-research-observatory/projects/fitness_modules/data/annotations
Annotation files exist: True


## 1. Define Metal Compound → Element Mapping

Map compound names in the Fitness Browser (FB) `condition_1` field to standardized
metal element names. Also flag each metal's category (toxic vs. essential) and
whether it is on the USGS critical minerals list.

In [2]:
# Compound name → (metal_element, metal_category, usgs_critical)
# metal_category: 'toxic' = external stress, 'essential' = nutrient/cofactor
METAL_COMPOUND_MAP = {
    # Toxic/stress metals — broad panel
    'Nickel (II) chloride hexahydrate': ('Nickel', 'toxic', True),
    'Cobalt chloride hexahydrate': ('Cobalt', 'toxic', True),
    'copper (II) chloride dihydrate': ('Copper', 'toxic', False),
    'Copper (II) sulfate pentahydrate': ('Copper', 'toxic', False),
    'Zinc sulfate heptahydrate': ('Zinc', 'toxic', False),
    'Zinc Pyrithione': ('Zinc', 'toxic', False),  # antimicrobial, not pure metal
    'Aluminum chloride hydrate': ('Aluminum', 'toxic', True),
    'Uranyl acetate': ('Uranium', 'toxic', True),
    'Sodium Chromate': ('Chromium', 'toxic', True),
    'Potassium dichromate': ('Chromium', 'toxic', True),
    'mercury (II) chloride': ('Mercury', 'toxic', False),
    'Cadmium chloride hemipentahydrate': ('Cadmium', 'toxic', False),
    'Cisplatin': ('Platinum', 'toxic', False),  # DNA-damaging agent
    
    # Essential/nutrient metals
    'Iron (II) chloride tetrahydrate': ('Iron', 'essential', False),
    'Sodium molybdate': ('Molybdenum', 'essential', False),
    'Sodium tungstate dihydrate': ('Tungsten', 'essential', True),
    'Sodium selenate': ('Selenium', 'essential', False),
    'Manganese (II) chloride tetrahydrate': ('Manganese', 'essential', True),
}

# Also match by keyword in condition_1 or expDesc for edge cases
METAL_KEYWORDS = {
    'nickel': 'Nickel',
    'cobalt': 'Cobalt',
    'copper': 'Copper',
    'zinc': 'Zinc',
    'aluminum': 'Aluminum',
    'uranyl': 'Uranium',
    'uranium': 'Uranium',
    'chromat': 'Chromium',
    'dichromat': 'Chromium',
    'mercury': 'Mercury',
    'cadmium': 'Cadmium',
    'molybdat': 'Molybdenum',
    'tungstat': 'Tungsten',
    'selenat': 'Selenium',
    'selenite': 'Selenium',
    'manganese': 'Manganese',
    'iron': 'Iron',
    'cisplatin': 'Platinum',
}

print(f'Defined {len(METAL_COMPOUND_MAP)} compound mappings')
print(f'Defined {len(METAL_KEYWORDS)} keyword patterns')

Defined 18 compound mappings
Defined 18 keyword patterns
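
The keyword fallback used later needs a category and a critical-minerals flag for each element. One way to keep those consistent with the compound table is to derive them from it instead of repeating lists of element names. A minimal sketch on a trimmed copy of the mapping; `element_info` is a hypothetical helper, not part of this notebook:

```python
# Trimmed copy of the compound map, for illustration only.
compound_map = {
    'Nickel (II) chloride hexahydrate': ('Nickel', 'toxic', True),
    'copper (II) chloride dihydrate': ('Copper', 'toxic', False),
    'Copper (II) sulfate pentahydrate': ('Copper', 'toxic', False),
    'Sodium tungstate dihydrate': ('Tungsten', 'essential', True),
}


def element_info(cmap):
    """Collapse compound -> (element, category, critical) into
    element -> (category, critical).

    Raises ValueError if two compounds disagree about the same element.
    """
    info = {}
    for compound, (element, category, critical) in cmap.items():
        previous = info.setdefault(element, (category, critical))
        if previous != (category, critical):
            raise ValueError(f'Inconsistent annotation for {element}: {compound}')
    return info


info = element_info(compound_map)
print(info['Copper'])   # ('toxic', False)
print(sorted(info))     # ['Copper', 'Nickel', 'Tungsten']
```

A lookup built this way would also surface any compound whose annotation contradicts another compound of the same element.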


## 2. Load All Cached Experiment Files

In [3]:
# Load experiment metadata for all 32 organisms with cached data
exp_files = sorted(FM_ANNOTATIONS.glob('*_experiments.csv'))
print(f'Found {len(exp_files)} experiment files')

all_experiments = []
for f in exp_files:
    org_id = f.stem.replace('_experiments', '')
    df = pd.read_csv(f)
    df['orgId'] = org_id
    all_experiments.append(df)

experiments = pd.concat(all_experiments, ignore_index=True)
print(f'\nTotal experiments loaded: {len(experiments):,}')
print(f'Organisms: {experiments["orgId"].nunique()}')
print(f'\nColumns: {list(experiments.columns)}')
print(f'\nExperiments per organism:')
print(experiments.groupby('orgId').size().sort_values(ascending=False).to_string())

Found 32 experiment files

Total experiments loaded: 6,804
Organisms: 32

Columns: ['expName', 'expDesc', 'expGroup', 'condition_1', 'media', 'cor12', 'mad12', 'nMapped', 'orgId']

Experiments per organism:
orgId
DvH                   757
Btheta                519
Methanococcus_S2      371
psRCH2                350
Putida                300
Phaeo                 274
Marino                255
pseudo3_N2E3          211
Koxy                  208
Cola                  202
WCS417                201
Caulo                 198
SB2B                  190
pseudo6_N2E2          188
Dino                  186
pseudo5_N2C3_1        184
Miya                  178
Pedo557               177
MR1                   176
Keio                  168
Korea                 162
PV4                   160
pseudo1_N1B4          147
acidovorax_3H11       140
Methanococcus_JJ      129
SynE                  129
BFirm                 113
Kang                  108
ANA3                  107
Cup4G11               106
pseudo1

## 3. Classify Metal Experiments

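Match each experiment to a metal by trying, in order: an exact compound-name match
on `condition_1`, a keyword match in `condition_1`, a keyword match in `expDesc`,
and finally the `expGroup` labels `metal limitation` and `iron`.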

In [4]:
def classify_metal(row):
    """Classify an experiment row as metal-related or not.
    
    Returns (metal_element, metal_category, usgs_critical, match_method)
    or (None, None, None, None) if not metal-related.
    """
    condition = str(row.get('condition_1', ''))
    desc = str(row.get('expDesc', ''))
    group = str(row.get('expGroup', ''))
    
    # Method 1: Exact compound match
    if condition in METAL_COMPOUND_MAP:
        metal, cat, crit = METAL_COMPOUND_MAP[condition]
        return metal, cat, crit, 'compound_match'
    
    # Method 2: Keyword match in condition_1
    for keyword, metal in METAL_KEYWORDS.items():
        if keyword.lower() in condition.lower():
            cat = 'essential' if metal in ('Iron', 'Molybdenum', 'Tungsten', 'Selenium', 'Manganese') else 'toxic'
            crit = metal in ('Nickel', 'Cobalt', 'Aluminum', 'Uranium', 'Chromium', 'Tungsten', 'Manganese')
            return metal, cat, crit, 'keyword_condition'
    
    # Method 3: Keyword match in expDesc
    for keyword, metal in METAL_KEYWORDS.items():
        if keyword.lower() in desc.lower():
            cat = 'essential' if metal in ('Iron', 'Molybdenum', 'Tungsten', 'Selenium', 'Manganese') else 'toxic'
            crit = metal in ('Nickel', 'Cobalt', 'Aluminum', 'Uranium', 'Chromium', 'Tungsten', 'Manganese')
            return metal, cat, crit, 'keyword_desc'
    
    # Method 4: expGroup = 'metal limitation' or 'iron'
    if group == 'metal limitation':
        # Method 3 has already returned for any description that names a
        # specific metal, so rows reaching this point (psRCH2 metal limitation
        # experiments) are classified as general metal limitation.
        return 'Metal_limitation', 'essential', False, 'metal_limitation_group'
    
    if group == 'iron':
        return 'Iron', 'essential', False, 'iron_group'
    
    return None, None, None, None


# Apply classification
results = experiments.apply(classify_metal, axis=1, result_type='expand')
results.columns = ['metal_element', 'metal_category', 'usgs_critical', 'match_method']

experiments_annotated = pd.concat([experiments, results], axis=1)

# Filter to metal experiments only
metal_exps = experiments_annotated[experiments_annotated['metal_element'].notna()].copy()

print(f'Total experiments: {len(experiments):,}')
print(f'Metal experiments: {len(metal_exps):,} ({100*len(metal_exps)/len(experiments):.1f}%)')
print(f'\nMetal experiments by match method:')
print(metal_exps['match_method'].value_counts().to_string())

Total experiments: 6,804
Metal experiments: 559 (8.2%)

Metal experiments by match method:
match_method
compound_match            463
keyword_desc               87
metal_limitation_group      9
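
The keyword methods use plain substring tests, so a short keyword can match inside an unrelated word (for example, `iron` is a substring of `environment`). Whether any description in the cached files triggers this has not been checked here. A hedged sketch of a stricter alternative that anchors each keyword at the start of a word; `match_keyword` and the sample strings are hypothetical:

```python
import re

# Subset of the keyword table, for illustration only.
keywords = {'iron': 'Iron', 'chromat': 'Chromium', 'zinc': 'Zinc'}

# \b before the keyword requires it to start a word; stems such as
# 'chromat' can still be followed by a suffix ('chromate').
patterns = {
    kw: re.compile(r'\b' + re.escape(kw), re.IGNORECASE)
    for kw in keywords
}


def match_keyword(text):
    """Return the first element whose keyword starts a word in text, else None."""
    for kw, element in keywords.items():
        if patterns[kw].search(text):
            return element
    return None


print(match_keyword('Iron (II) chloride 0.5 mM'))  # Iron
print(match_keyword('environmental sample'))        # None
print(match_keyword('Sodium Chromate 1 mM'))        # Chromium
```

A leading word boundary does not remove every false positive (a stem like `chromat` would still match `chromatography`), so spot-checking the `keyword_desc` rows by hand remains worthwhile.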


## 4. Extract Concentrations

Parse concentration from `expDesc` where possible.

In [5]:
def extract_concentration(desc):
    """Extract concentration from experiment description.
    
    Returns (value, unit) or (None, None).
    """
    desc = str(desc)
    
    # Pattern: number followed by a unit (mM, uM, µM, mg/L, or ppm), case-insensitive.
    # This also covers the plain mM/uM case, so no separate fallback is needed.
    match = re.search(r'(\d+\.?\d*)\s*(mM|uM|µM|mg/L|ppm)', desc, re.IGNORECASE)
    if match:
        return float(match.group(1)), match.group(2)
    
    # Pattern: "limitation (0.2x)" for iron
    match = re.search(r'(\d+\.?\d*)x', desc)
    if match and ('limitation' in desc.lower() or 'excess' in desc.lower()):
        return float(match.group(1)), 'x_relative'
    
    return None, None


metal_exps[['concentration', 'conc_unit']] = metal_exps['expDesc'].apply(
    lambda x: pd.Series(extract_concentration(x))
)

n_with_conc = metal_exps['concentration'].notna().sum()
print(f'Experiments with parsed concentration: {n_with_conc}/{len(metal_exps)} '
      f'({100*n_with_conc/len(metal_exps):.0f}%)')
print(f'\nConcentration units found:')
print(metal_exps['conc_unit'].value_counts(dropna=False).to_string())

Experiments with parsed concentration: 411/559 (74%)

Concentration units found:
conc_unit
mM            354
NaN           148
uM             39
x_relative     18
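
Because the unit regex is case-insensitive and the unit is returned as written, spelling variants (`mM` vs `MM`, `uM` vs `µM`) would be counted as separate units. The output above shows only `mM` and `uM`, so this may not matter for the current data. A minimal sketch of normalizing units to one canonical spelling; `parse_concentration` and the `CANONICAL` table are assumptions made for this example:

```python
import re

# \u00b5 is the micro sign; the pattern mirrors the units parsed above.
CONC_RE = re.compile(r'(\d+\.?\d*)\s*(mM|uM|\u00b5M|mg/L|ppm)', re.IGNORECASE)

# Canonical spelling keyed by lower-cased unit (an assumption for this sketch).
# Both the micro sign and the Greek letter mu map to 'uM'.
CANONICAL = {
    'mm': 'mM',
    'um': 'uM',
    '\u00b5m': 'uM',
    '\u03bcm': 'uM',
    'mg/l': 'mg/L',
    'ppm': 'ppm',
}


def parse_concentration(desc):
    """Return (value, canonical_unit), or (None, None) if nothing parses."""
    match = CONC_RE.search(str(desc))
    if match is None:
        return None, None
    raw_unit = match.group(2)
    # Fall back to the raw spelling if the unit is not in the table.
    return float(match.group(1)), CANONICAL.get(raw_unit.lower(), raw_unit)


print(parse_concentration('Nickel (II) chloride 1.5 MM'))   # (1.5, 'mM')
print(parse_concentration('Cobalt chloride 250 \u00b5M'))   # (250.0, 'uM')
print(parse_concentration('no stress'))                     # (None, None)
```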


## 5. Summary Statistics

In [6]:
# Summary by metal element
metal_summary = metal_exps.groupby('metal_element').agg(
    n_experiments=('expName', 'count'),
    n_organisms=('orgId', 'nunique'),
    organisms=('orgId', lambda x: ', '.join(sorted(x.unique()))),
    category=('metal_category', 'first'),
    usgs_critical=('usgs_critical', 'first'),
).sort_values('n_experiments', ascending=False)

print('=' * 80)
print('METAL EXPERIMENT SUMMARY')
print('=' * 80)
print(f'\nTotal metal experiments: {len(metal_exps)}')
print(f'Metals: {metal_exps["metal_element"].nunique()}')
print(f'Organisms with metal data: {metal_exps["orgId"].nunique()}')
print()

# Display summary table (without organism list for readability)
display_df = metal_summary[['n_experiments', 'n_organisms', 'category', 'usgs_critical']].copy()
display_df.columns = ['Experiments', 'Organisms', 'Category', 'USGS Critical']
print(display_df.to_string())

METAL EXPERIMENT SUMMARY

Total metal experiments: 559
Metals: 16
Organisms with metal data: 31

                  Experiments  Organisms   Category USGS Critical
metal_element                                                    
Cobalt                     89         27      toxic          True
Nickel                     79         26      toxic          True
Platinum                   67         24      toxic         False
Copper                     60         23      toxic         False
Aluminum                   54         22      toxic          True
Zinc                       52         17      toxic         False
Iron                       45          3  essential         False
Molybdenum                 39          1  essential         False
Tungsten                   22          1  essential          True
Chromium                   11          2      toxic          True
Selenium                    9          1  essential         False
Metal_limitation            9          1  ess

In [7]:
# Summary by organism
org_summary = metal_exps.groupby('orgId').agg(
    n_metal_experiments=('expName', 'count'),
    n_metals=('metal_element', 'nunique'),
    metals=('metal_element', lambda x: ', '.join(sorted(x.unique()))),
).sort_values('n_metals', ascending=False)

print('\nMetal experiments by organism (sorted by number of metals tested):')
print('=' * 80)
for _, row in org_summary.iterrows():
    print(f'{row.name:25s}  {row.n_metal_experiments:3d} exps  '
          f'{row.n_metals:2d} metals  [{row.metals}]')


Metal experiments by organism (sorted by number of metals tested):
DvH                        149 exps  13 metals  [Aluminum, Chromium, Cobalt, Copper, Iron, Manganese, Mercury, Molybdenum, Nickel, Selenium, Tungsten, Uranium, Zinc]
psRCH2                      61 exps  10 metals  [Aluminum, Cadmium, Chromium, Cobalt, Copper, Metal_limitation, Nickel, Platinum, Uranium, Zinc]
Dino                        19 exps   6 metals  [Aluminum, Cobalt, Copper, Nickel, Platinum, Zinc]
MR1                         12 exps   6 metals  [Aluminum, Cobalt, Copper, Nickel, Platinum, Zinc]
Marino                      24 exps   6 metals  [Aluminum, Cobalt, Copper, Nickel, Platinum, Zinc]
Korea                        8 exps   6 metals  [Aluminum, Cobalt, Copper, Nickel, Platinum, Zinc]
Cola                        20 exps   6 metals  [Aluminum, Cobalt, Copper, Nickel, Platinum, Zinc]
PV4                         15 exps   6 metals  [Aluminum, Cobalt, Copper, Nickel, Platinum, Zinc]
SB2B                       

In [8]:
# Focus: metals tested in >= 3 organisms (usable for cross-species comparison)
cross_species_metals = metal_summary[metal_summary['n_organisms'] >= 3].index.tolist()

print('\nMetals with >= 3 organisms (usable for cross-species analysis):')
print('=' * 80)
for metal in cross_species_metals:
    row = metal_summary.loc[metal]
    print(f'  {metal:12s}  {row.n_experiments:3d} experiments  '
          f'{row.n_organisms:2d} organisms  '
          f'category={row.category}  critical={row.usgs_critical}')

print(f'\n{len(cross_species_metals)} metals suitable for cross-species analysis')

# Exclude Platinum/Cisplatin (DNA damage agent, not metal stress)
analysis_metals = [m for m in cross_species_metals if m != 'Platinum']
print(f'{len(analysis_metals)} metals after excluding Platinum/Cisplatin: {analysis_metals}')


Metals with >= 3 organisms (usable for cross-species analysis):
  Cobalt         89 experiments  27 organisms  category=toxic  critical=True
  Nickel         79 experiments  26 organisms  category=toxic  critical=True
  Platinum       67 experiments  24 organisms  category=toxic  critical=False
  Copper         60 experiments  23 organisms  category=toxic  critical=False
  Aluminum       54 experiments  22 organisms  category=toxic  critical=True
  Zinc           52 experiments  17 organisms  category=toxic  critical=False
  Iron           45 experiments   3 organisms  category=essential  critical=False

7 metals suitable for cross-species analysis
6 metals after excluding Platinum/Cisplatin: ['Cobalt', 'Nickel', 'Copper', 'Aluminum', 'Zinc', 'Iron']
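
The cross-species filter reduces to a per-metal count of distinct organisms followed by a threshold. A toy illustration with made-up rows (not Fitness Browser data) shows the mechanics:

```python
import pandas as pd

# Made-up rows: Cobalt in 3 organisms, Iron in 2, Platinum in 1.
toy = pd.DataFrame({
    'orgId': ['A', 'B', 'C', 'A', 'B', 'A'],
    'metal_element': ['Cobalt', 'Cobalt', 'Cobalt', 'Iron', 'Iron', 'Platinum'],
})

# Count distinct organisms per metal, then keep metals at or above the threshold.
n_orgs = toy.groupby('metal_element')['orgId'].nunique()
keep = n_orgs[n_orgs >= 3].index.tolist()
print(keep)  # ['Cobalt']
```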


In [9]:
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
import seaborn as sns

# Build organism × metal experiment count matrix
# Focus on the analysis metals (exclude Platinum)
metal_exps_analysis = metal_exps[metal_exps['metal_element'].isin(analysis_metals)]

org_metal_matrix = metal_exps_analysis.groupby(
    ['orgId', 'metal_element']
).size().unstack(fill_value=0)

# Sort: metals by total experiments (descending), organisms by total metals tested
metal_order = org_metal_matrix.sum().sort_values(ascending=False).index.tolist()
org_order = (org_metal_matrix > 0).sum(axis=1).sort_values(ascending=False).index.tolist()
org_metal_matrix = org_metal_matrix.loc[org_order, metal_order]

# Plot heatmap
fig, ax = plt.subplots(figsize=(14, 10))
sns.heatmap(
    org_metal_matrix,
    cmap='YlOrRd',
    annot=True,
    fmt='d',
    linewidths=0.5,
    ax=ax,
    cbar_kws={'label': 'Number of experiments'}
)
ax.set_title('Fitness Browser: Metal Experiments per Organism', fontsize=14)
ax.set_xlabel('Metal Element')
ax.set_ylabel('Organism (orgId)')
plt.tight_layout()
fig.savefig(FIGURES_DIR / 'organism_metal_matrix.png', dpi=150, bbox_inches='tight')
plt.show()
print(f'Saved: figures/organism_metal_matrix.png')

Saved: figures/organism_metal_matrix.png


In [10]:
# Summary by metal category
print('\nMetal experiments by category:')
print('=' * 60)
for cat in ['toxic', 'essential']:
    subset = metal_exps_analysis[metal_exps_analysis['metal_category'] == cat]
    metals = sorted(subset['metal_element'].unique())
    print(f'\n  {cat.upper()} metals ({len(metals)}): {metals}')
    print(f'    Experiments: {len(subset)}')
    print(f'    Organisms: {subset["orgId"].nunique()}')

# USGS critical minerals
critical = metal_exps_analysis[metal_exps_analysis['usgs_critical'] == True]
critical_metals = sorted(critical['metal_element'].unique())
print(f'\n  USGS CRITICAL MINERALS in FB: {critical_metals}')
print(f'    Experiments: {len(critical)}')
print(f'    Organisms: {critical["orgId"].nunique()}')


Metal experiments by category:

  TOXIC metals (5): ['Aluminum', 'Cobalt', 'Copper', 'Nickel', 'Zinc']
    Experiments: 334
    Organisms: 29

  ESSENTIAL metals (1): ['Iron']
    Experiments: 45
    Organisms: 3

  USGS CRITICAL MINERALS in FB: ['Aluminum', 'Cobalt', 'Nickel']
    Experiments: 222
    Organisms: 29


## 6. Save Master Table

In [11]:
# Save the full metal experiments table
output_cols = [
    'orgId', 'expName', 'expDesc', 'expGroup', 'condition_1', 'media',
    'metal_element', 'metal_category', 'usgs_critical', 'match_method',
    'concentration', 'conc_unit',
    'cor12', 'mad12', 'nMapped'
]
metal_exps_out = metal_exps[output_cols].sort_values(['metal_element', 'orgId', 'expName'])
metal_exps_out.to_csv(DATA_DIR / 'metal_experiments.csv', index=False)

print(f'Saved: data/metal_experiments.csv')
print(f'  Rows: {len(metal_exps_out):,}')
print(f'  Metals: {metal_exps_out["metal_element"].nunique()}')
print(f'  Organisms: {metal_exps_out["orgId"].nunique()}')

# Also save the analysis-ready subset (excluding Platinum)
analysis_out = metal_exps_out[metal_exps_out['metal_element'].isin(analysis_metals)]
analysis_out.to_csv(DATA_DIR / 'metal_experiments_analysis.csv', index=False)

print(f'\nSaved: data/metal_experiments_analysis.csv (excluding Platinum/Cisplatin)')
print(f'  Rows: {len(analysis_out):,}')
print(f'  Metals: {analysis_out["metal_element"].nunique()}')
print(f'  Organisms: {analysis_out["orgId"].nunique()}')

Saved: data/metal_experiments.csv
  Rows: 559
  Metals: 16
  Organisms: 31

Saved: data/metal_experiments_analysis.csv (excluding Platinum/Cisplatin)
  Rows: 379
  Metals: 6
  Organisms: 31
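
Before downstream notebooks rely on the saved tables, a few structural checks can catch silent problems. The sketch below is one possible set of checks, not part of this notebook; `check_master_table` is a hypothetical helper, and it assumes that `(orgId, expName)` identifies a single experiment:

```python
import pandas as pd


def check_master_table(df):
    """Return a list of human-readable problems found in a metal experiment table."""
    problems = []
    # Assumption: (orgId, expName) identifies one experiment.
    if df.duplicated(subset=['orgId', 'expName']).any():
        problems.append('duplicate (orgId, expName) rows')
    if df['metal_element'].isna().any():
        problems.append('missing metal_element')
    # A parsed value should always come with a unit, and vice versa.
    if (df['concentration'].notna() != df['conc_unit'].notna()).any():
        problems.append('concentration and conc_unit out of sync')
    return problems


# Deliberately broken toy table (made-up values) that trips all three checks.
toy = pd.DataFrame({
    'orgId': ['A', 'A', 'B'],
    'expName': ['set1IT001', 'set1IT001', 'set1IT002'],
    'metal_element': ['Cobalt', 'Cobalt', None],
    'concentration': [1.0, 1.0, 0.5],
    'conc_unit': ['mM', 'mM', None],
})
print(check_master_table(toy))
```

In this notebook the same function could be applied to the frame read back from `data/metal_experiments.csv`; an empty list would indicate that none of these checks failed.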


In [12]:
print('=' * 80)
print('NB01 SUMMARY: Metal Experiment Classification')
print('=' * 80)
print(f'Total FB experiments scanned: {len(experiments):,}')
print(f'Metal experiments identified: {len(metal_exps):,} ({100*len(metal_exps)/len(experiments):.1f}%)')
print(f'Unique metals: {metal_exps["metal_element"].nunique()}')
print(f'Organisms with metal data: {metal_exps["orgId"].nunique()}')
print(f'\nCross-species analysis metals (>= 3 orgs, excl. Pt): {len(analysis_metals)}')
print(f'  {analysis_metals}')
print(f'\nUSGS critical minerals covered: {critical_metals}')
print(f'\nOutputs:')
print(f'  data/metal_experiments.csv — all metal experiments')
print(f'  data/metal_experiments_analysis.csv — analysis subset')
print(f'  figures/organism_metal_matrix.png — organism x metal heatmap')
print('=' * 80)

NB01 SUMMARY: Metal Experiment Classification
Total FB experiments scanned: 6,804
Metal experiments identified: 559 (8.2%)
Unique metals: 16
Organisms with metal data: 31

Cross-species analysis metals (>= 3 orgs, excl. Pt): 6
  ['Cobalt', 'Nickel', 'Copper', 'Aluminum', 'Zinc', 'Iron']

USGS critical minerals covered: ['Aluminum', 'Cobalt', 'Nickel']

Outputs:
  data/metal_experiments.csv — all metal experiments
  data/metal_experiments_analysis.csv — analysis subset
  figures/organism_metal_matrix.png — organism x metal heatmap
