# Ecotype Reanalysis: Environmental vs Human-Associated Species

**Runs locally** (no Spark needed) — loads data from parent projects.

## Background

The [ecotype_analysis](../../ecotype_analysis/) project found that **phylogeny dominates over environment** in predicting gene content similarity (median partial correlation 0.003 for environment vs 0.014 for phylogeny, across 172 species).

The [env_embedding_explorer](../../env_embedding_explorer/) project then discovered that **38% of AlphaEarth genomes are human-associated** (clinical, gut), and that their embeddings carry a **weaker geographic signal**: a 2.0x ratio vs 3.4x for environmental samples, i.e. about 41% lower (equivalently, the environmental ratio is 70% higher).

**Hypothesis**: The weak environment effect in the ecotype analysis was diluted by clinical samples. Environmental species should show stronger environment–gene content correlations.

## Data dependencies

| Source | File | Content |
|--------|------|--------|
| ecotype_analysis | `data/ecotype_correlation_results.csv` | Partial correlations for 213 species |
| env_embedding_explorer | `data/alphaearth_with_env.csv` | `genome_id` and `isolation_source` per genome (`env_category` is recomputed in this notebook) |
| ecotype_analysis | `data/target_genomes_expanded.csv` | Target genomes with species assignments |

In [1]:
import pandas as pd
import numpy as np
from scipy.stats import mannwhitneyu, spearmanr
import plotly.express as px
import plotly.graph_objects as go
import os
import warnings
warnings.filterwarnings('ignore')

DATA_DIR = '../data'
FIG_DIR = '../figures'
os.makedirs(DATA_DIR, exist_ok=True)
os.makedirs(FIG_DIR, exist_ok=True)

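def save_fig(fig, name):
    """Write a figure to FIG_DIR as a static PNG and an interactive HTML file."""
    # write_image needs a static-export backend (e.g. the kaleido package)
    fig.write_image(os.path.join(FIG_DIR, f'{name}.png'), scale=2)
    fig.write_html(os.path.join(FIG_DIR, f'{name}.html'))
    print(f'Saved figures/{name}.png + .html')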

## 1. Load data from parent projects

In [2]:
# Ecotype correlation results (recomputed from per-species ANI + gene cluster files)
corr_df = pd.read_csv('../../ecotype_analysis/data/ecotype_correlation_results.csv')
print(f'Ecotype correlations: {len(corr_df)} species')
print(f'With partial correlations: {corr_df["r_partial_emb_jaccard"].notna().sum()}')
print(f'With NaN partial corr: {corr_df["r_partial_emb_jaccard"].isna().sum()}')
corr_df.head(3)

Ecotype correlations: 213 species
With partial correlations: 183
With NaN partial corr: 30


|   | species | short_name | n_genomes | r_emb_jaccard | r_ani_jaccard | r_partial_emb_jaccard | p_partial_emb |
|---|---------|------------|-----------|---------------|---------------|-----------------------|---------------|
| 0 | s__Acinetobacter_baumannii--RS_GCF_009759685.1 | Acinetobacter_baumannii | 3505 | NaN | 0.871024 | NaN | NaN |
| 1 | s__Mycobacterium_tuberculosis--RS_GCF_000195955.2 | Mycobacterium_tuberculosis | 2556 | NaN | 0.389331 | NaN | NaN |
| 2 | s__Enterobacter_hormaechei_A--RS_GCF_001729745.1 | Enterobacter_hormaechei_A | 812 | 0.158944 | 0.664785 | 0.187425 | 0.0 |


In [3]:
# Target genomes with species assignments
genomes_df = pd.read_csv('../../ecotype_analysis/data/target_genomes_expanded.csv')
print(f'Target genomes: {len(genomes_df):,}')
print(f'Species: {genomes_df["gtdb_species_clade_id"].nunique()}')

Target genomes: 13,381
Species: 224


In [4]:
# Isolation sources from env_embedding_explorer (env_category is recomputed below)
env_df = pd.read_csv('../../env_embedding_explorer/data/alphaearth_with_env.csv',
                     usecols=['genome_id', 'isolation_source'])
print(f'Environment data: {len(env_df):,} genomes')

# Recompute env_category using the same harmonization as env_embedding_explorer
ENV_CATEGORIES = [
    ('Marine', ['ocean', 'marine', 'sea water', 'seawater', 'deep sea', 'coastal water',
                'marine sediment', 'coral', 'sponge', 'hydrothermal', 'estuary', 'brackish']),
    ('Freshwater', ['freshwater', 'fresh water', 'river', 'lake', 'stream', 'pond',
                    'groundwater', 'aquifer', 'drinking water', 'aquatic biome']),
    ('Soil', ['soil', 'rhizosphere', 'compost', 'permafrost', 'sediment', 'mud', 'biome', 'meadow']),
    ('Human gut', ['feces', 'fecal', 'faeces', 'stool', 'rectal swab', 'human gut', 'infant fec']),
    ('Human clinical', ['blood', 'sputum', 'urine', 'wound', 'abscess', 'patient', 'clinical',
                        'cerebrospinal fluid', 'lung', 'throat swab', 'nasopharynx', 'pus',
                        'liver', 'bodily fluid', 'tissue']),
    ('Human other', ['human', 'homo sapiens', 'homo', 'skin', 'oral', 'saliva', 'nasal']),
    ('Animal', ['chicken', 'cattle', 'cow', 'pig', 'sheep', 'dog', 'cat', 'mouse', 'fish',
                'bird', 'animal', 'bovine', 'rumen', 'pork']),
    ('Plant', ['plant', 'leaf', 'root', 'phyllosphere', 'endophyte']),
    ('Food', ['food', 'cheese', 'milk', 'dairy', 'ferment', 'meat']),
    ('Wastewater', ['wastewater', 'sewage', 'sludge', 'bioreactor', 'treatment plant']),
    ('Extreme', ['hot spring', 'hypersaline', 'acid mine', 'soda lake', 'subsurface',
                 'aspo', 'olkiluoto']),
]

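def harmonize(v):
    """Map a free-text isolation_source to a category; the first keyword match wins."""
    if pd.isna(v):
        return 'Unknown'
    vl = str(v).lower().strip()
    if vl in ('', 'missing', 'not collected', 'not applicable', 'unknown', 'na', 'environmental'):
        return 'Unknown'
    # Categories are checked in list order and keywords match as substrings,
    # so e.g. 'marine sediment' resolves to Marine before Soil sees 'sediment'.
    for cat, kws in ENV_CATEGORIES:
        for kw in kws:
            if kw in vl:
                return cat
    return 'Other'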

env_df['env_category'] = env_df['isolation_source'].apply(harmonize)
print(f'\nEnv category distribution:')
print(env_df['env_category'].value_counts().to_string())

Environment data: 83,287 genomes

Env category distribution:
env_category
Human clinical    16648
Other             15134
Human gut         13429
Unknown           10495
Soil               5921
Freshwater         5871
Marine             5758
Wastewater         2861
Animal             1973
Plant              1629
Human other        1624
Food               1017
Extreme             927
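Because `harmonize` matches keywords as plain substrings, short keywords such as `cat` or `oral` can fire inside unrelated words. The sketch below illustrates this on a reduced keyword table and compares it with a hypothetical word-boundary variant. `CATEGORIES`, `harmonize_substring`, `harmonize_word`, and the sample strings are illustrative assumptions, not part of the parent project.

```python
import re

# Reduced, illustrative keyword table (a small subset of ENV_CATEGORIES)
CATEGORIES = [
    ('Marine', ['marine', 'coral']),
    ('Human other', ['oral', 'skin']),
    ('Animal', ['cat', 'dog']),
]

def harmonize_substring(value):
    """First-match substring rule, mirroring the notebook's harmonize()."""
    text = value.lower().strip()
    for category, keywords in CATEGORIES:
        for kw in keywords:
            if kw in text:
                return category
    return 'Other'

def harmonize_word(value):
    """Hypothetical variant: a keyword must start at a word boundary."""
    text = value.lower().strip()
    for category, keywords in CATEGORIES:
        for kw in keywords:
            if re.search(r'\b' + re.escape(kw), text):
                return category
    return 'Other'

# Hypothetical isolation_source strings
samples = ['sampling location 4', 'cat feces', 'floral nectar', 'coral reef']
comparison = {s: (harmonize_substring(s), harmonize_word(s)) for s in samples}
for source, (by_substring, by_word) in comparison.items():
    print(f'{source!r}: substring={by_substring}, word-boundary={by_word}')
```

Under the substring rule, `'sampling location 4'` lands in Animal (via `cat`) and `'floral nectar'` in Human other (via `oral`); the word-boundary variant sends both to Other. How often this happens in the real data is not known from this notebook, so the sketch is only a prompt to spot-check the category counts above.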


## 2. Classify species by dominant environment

For each species, we count its genomes per environment category and classify by majority vote. Fractions are taken over all of the species' genomes, including those labeled Unknown or Other:
- **Environmental**: >50% of genomes are Soil, Marine, Freshwater, Extreme, or Plant
- **Human-associated**: >50% are Human gut, Human clinical, or Human other
- **Mixed/Other**: neither fraction exceeds 50%
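A consequence of the majority-vote rule is that genomes without a usable label still count in the denominator. The toy example below is a minimal sketch of the rule on made-up data; the species names, counts, and the helper `classify_species` are hypothetical.

```python
import pandas as pd

ENV_TYPES = {'Soil', 'Marine', 'Freshwater', 'Extreme', 'Plant'}
HUMAN_TYPES = {'Human gut', 'Human clinical', 'Human other'}

# Hypothetical toy data: one env_category label per genome
toy = pd.DataFrame({
    'species': ['sp_A'] * 10 + ['sp_B'] * 4,
    'env_category': (['Soil'] * 4 + ['Human gut'] * 1 + ['Unknown'] * 5
                     + ['Marine'] * 3 + ['Other'] * 1),
})

def classify_species(categories):
    """Majority vote over ALL genomes of a species, labeled or not."""
    n_total = len(categories)
    frac_env = categories.isin(ENV_TYPES).sum() / n_total
    frac_human = categories.isin(HUMAN_TYPES).sum() / n_total
    if frac_env > 0.5:
        return 'Environmental'
    if frac_human > 0.5:
        return 'Human-associated'
    return 'Mixed/Other'

groups = toy.groupby('species')['env_category'].apply(classify_species)
print(groups.to_string())
```

Here `sp_A` has 4 Soil genomes out of 5 labeled ones, yet it is classified as Mixed/Other because the 5 Unknown genomes bring its environmental fraction down to 0.4. `sp_B` reaches 0.75 and is classified as Environmental.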

In [5]:
ENV_TYPES = {'Soil', 'Marine', 'Freshwater', 'Extreme', 'Plant'}
HUMAN_TYPES = {'Human gut', 'Human clinical', 'Human other'}

# Left-join genomes with env_category. The row count can exceed len(genomes_df)
# if env_df contains repeated genome_id values.
merged = genomes_df.merge(env_df[['genome_id', 'env_category']], on='genome_id', how='left')
merged['env_category'] = merged['env_category'].fillna('Unknown')
print(f'Matched genomes: {len(merged):,}')

# Per-species counts
species_env = merged.groupby('gtdb_species_clade_id')['env_category'].value_counts().unstack(fill_value=0)
env_cols = [c for c in species_env.columns if c in ENV_TYPES]
human_cols = [c for c in species_env.columns if c in HUMAN_TYPES]

species_env['n_env'] = species_env[env_cols].sum(axis=1) if env_cols else 0
species_env['n_human'] = species_env[human_cols].sum(axis=1) if human_cols else 0
species_env['n_total'] = merged.groupby('gtdb_species_clade_id').size()
species_env['frac_env'] = species_env['n_env'] / species_env['n_total']
species_env['frac_human'] = species_env['n_human'] / species_env['n_total']

def classify(row):
    if row['frac_env'] > 0.5: return 'Environmental'
    elif row['frac_human'] > 0.5: return 'Human-associated'
    else: return 'Mixed/Other'

species_env['group'] = species_env.apply(classify, axis=1)

print(f'\nSpecies classification ({len(species_env)} species):')
for g, n in species_env['group'].value_counts().items():
    print(f'  {g}: {n} ({100*n/len(species_env):.1f}%)')

Matched genomes: 13,385

Species classification (224 species):
  Human-associated: 106 (47.3%)
  Mixed/Other: 71 (31.7%)
  Environmental: 47 (21.0%)


**47% of ecotype species are human-associated** by genome-level classification, and only **21% are predominantly environmental**. This confirms the clinical bias; the open question is whether it affects the correlation results.
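The merged row count printed above (13,385) is slightly larger than the 13,381 target genomes loaded earlier. A left join grows in exactly this way when the right-hand table repeats a key. The sketch below shows, on made-up data, how such duplicates could be detected and how pandas' `validate` argument can guard the merge. The toy frames are hypothetical, and whether de-duplication is appropriate for the real data is an assumption that would need checking.

```python
import pandas as pd

# Hypothetical toy tables; 'g2' appears twice on the right-hand side
genomes = pd.DataFrame({'genome_id': ['g1', 'g2', 'g3'],
                        'species': ['A', 'A', 'B']})
env = pd.DataFrame({'genome_id': ['g1', 'g2', 'g2'],
                    'env_category': ['Soil', 'Marine', 'Marine']})

# An unchecked left join silently duplicates the 'g2' row
merged = genomes.merge(env, on='genome_id', how='left')
n_extra = len(merged) - len(genomes)

# Count repeated keys on the right-hand side
n_dup_keys = int(env['genome_id'].duplicated().sum())

# validate='many_to_one' raises if right-hand keys are not unique
try:
    genomes.merge(env, on='genome_id', how='left', validate='many_to_one')
    raised = False
except pd.errors.MergeError:
    raised = True

# One possible remedy: keep the first row per genome_id before merging
env_unique = env.drop_duplicates(subset='genome_id')
merged_unique = genomes.merge(env_unique, on='genome_id', how='left',
                              validate='many_to_one')
print(n_extra, n_dup_keys, raised, len(merged_unique))
```

Duplicated rows inflate both the per-category counts and `n_total` for the affected species, so a check of this kind is cheap insurance even when only a handful of rows are involved.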

## 3. Join with correlation results

In [6]:
# Join correlations with species classification
corr_df = corr_df.merge(
    species_env[['group', 'frac_env', 'frac_human', 'n_env', 'n_human', 'n_total']],
    left_on='species', right_index=True, how='left'
)
corr_df['group'] = corr_df['group'].fillna('Unknown')

print(f'Species with correlations + classification:')
print(corr_df['group'].value_counts().to_string())

# NaN analysis: which groups have the most NaN partial correlations?
print(f'\nNaN partial correlations by group:')
for g in ['Environmental', 'Human-associated', 'Mixed/Other']:
    sub = corr_df[corr_df['group'] == g]
    n_nan = sub['r_partial_emb_jaccard'].isna().sum()
    print(f'  {g}: {n_nan}/{len(sub)} NaN ({100*n_nan/len(sub):.0f}%)')

print(f'\nLargest NaN species (may indicate bias):')
nan_species = corr_df[corr_df['r_partial_emb_jaccard'].isna()].sort_values('n_genomes', ascending=False)
nan_species[['short_name', 'n_genomes', 'group']].head(10)

Species with correlations + classification:
group
Human-associated    100
Mixed/Other          66
Environmental        47

NaN partial correlations by group:
  Environmental: 10/47 NaN (21%)
  Human-associated: 7/100 NaN (7%)
  Mixed/Other: 13/66 NaN (20%)

Largest NaN species (may indicate bias):


|    | short_name | n_genomes | group |
|----|------------|-----------|-------|
| 0 | Acinetobacter_baumannii | 3505 | Human-associated |
| 1 | Mycobacterium_tuberculosis | 2556 | Human-associated |
| 3 | Enterococcus_B_faecium | 805 | Human-associated |
| 4 | Vibrio_cholerae | 541 | Mixed/Other |
| 6 | Neisseria_gonorrhoeae | 304 | Mixed/Other |
| 13 | Citrobacter_freundii | 215 | Human-associated |
| 18 | Enterobacter_roggenkampii | 142 | Human-associated |
| 23 | Cutibacterium_acnes | 134 | Mixed/Other |
| 27 | Vibrio_vulnificus | 112 | Mixed/Other |
| 28 | Vibrio_alginolyticus | 108 | Mixed/Other |


## 4. Statistical tests

### Binary comparison: Mann-Whitney U test

Do environmental species show stronger partial correlations than human-associated species? The test is one-sided (`alternative='greater'`), matching the direction of the hypothesis.

In [7]:
env_corrs = corr_df[corr_df['group'] == 'Environmental']['r_partial_emb_jaccard'].dropna()
human_corrs = corr_df[corr_df['group'] == 'Human-associated']['r_partial_emb_jaccard'].dropna()
mixed_corrs = corr_df[corr_df['group'] == 'Mixed/Other']['r_partial_emb_jaccard'].dropna()

print('=== Partial Correlations (env|phylo) by Group ===')
for name, vals in [('Environmental', env_corrs), ('Human-associated', human_corrs), ('Mixed/Other', mixed_corrs)]:
    print(f'\n{name} (n={len(vals)}):')
    print(f'  Median: {vals.median():.4f}')
    print(f'  Mean:   {vals.mean():.4f}')
    print(f'  Std:    {vals.std():.4f}')
    print(f'  Range:  [{vals.min():.4f}, {vals.max():.4f}]')

stat, pval = mannwhitneyu(env_corrs, human_corrs, alternative='greater')
print(f'\n=== Mann-Whitney U Test (Environmental > Human-associated) ===')
print(f'U = {stat:.0f}, p = {pval:.4f}')
print(f'Significant at p<0.05: {"YES" if pval < 0.05 else "NO"}')
print(f'\nResult: {"H1 supported" if pval < 0.05 else "H0 not rejected"} — '
      f'{"environmental species show stronger signal" if pval < 0.05 else "no evidence that environmental species show stronger signal"}')

=== Partial Correlations (env|phylo) by Group ===

Environmental (n=37):
  Median: 0.0511
  Mean:   0.0729
  Std:    0.2993
  Range:  [-0.4971, 0.7822]

Human-associated (n=93):
  Median: 0.0838
  Mean:   0.1095
  Std:    0.2262
  Range:  [-0.2971, 0.7259]

Mixed/Other (n=53):
  Median: 0.1092
  Mean:   0.1478
  Std:    0.2607
  Range:  [-0.3811, 0.6935]

=== Mann-Whitney U Test (Environmental > Human-associated) ===
U = 1536, p = 0.8301
Significant at p<0.05: NO

Result: H0 not rejected — no evidence that environmental species show stronger signal
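The U statistic can also be read as an effect size, which helps interpret a non-significant result. The sketch below assumes SciPy 1.7 or later, where `mannwhitneyu` reports U for its first argument, and uses the rounded values printed above. `common_language_effect` is a hypothetical helper, not part of the notebook.

```python
def common_language_effect(u_stat, n_x, n_y):
    """Estimated P(X > Y), ties counted as one half, from U for sample x."""
    return u_stat / (n_x * n_y)

# Values as printed above: U = 1536, n = 37 environmental, n = 93 human-associated
p_env_greater = common_language_effect(1536, 37, 93)

# Rank-biserial correlation: 0 means no shift, negative means x tends to be lower
rank_biserial = 2 * p_env_greater - 1

print(f'P(env > human) ~ {p_env_greater:.3f}, rank-biserial r ~ {rank_biserial:.3f}')
```

Under these assumptions the probability that a randomly chosen environmental species has a larger partial correlation than a randomly chosen human-associated one is about 0.45, slightly below the 0.5 expected under no difference. That agrees in direction with the medians reported above (0.0511 vs 0.0838).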


### Continuous analysis: Spearman correlation on fraction environmental

Instead of the binary majority-vote classification, we can use the continuous fraction of environmental genomes (`frac_env`) per species. This can be more powerful because it avoids the arbitrary 50% threshold and uses each species' actual composition rather than a three-way label.

In [8]:
# Continuous analysis: does higher fraction of environmental genomes predict stronger env partial correlation?
valid = corr_df[corr_df['r_partial_emb_jaccard'].notna()].copy()

rho_env, p_env = spearmanr(valid['frac_env'], valid['r_partial_emb_jaccard'])
rho_human, p_human = spearmanr(valid['frac_human'], valid['r_partial_emb_jaccard'])

print('=== Continuous Analysis: Spearman Correlation ===')
print(f'\nFraction environmental vs partial correlation:')
print(f'  rho = {rho_env:.4f}, p = {p_env:.4f}, n = {len(valid)}')
print(f'  {"Significant" if p_env < 0.05 else "Not significant"} at p<0.05')

print(f'\nFraction human-associated vs partial correlation:')
print(f'  rho = {rho_human:.4f}, p = {p_human:.4f}, n = {len(valid)}')
print(f'  {"Significant" if p_human < 0.05 else "Not significant"} at p<0.05')

print(f'\nInterpretation: {"Higher environmental fraction DOES predict stronger env signal" if (rho_env > 0 and p_env < 0.05) else "No evidence that environmental fraction predicts env signal strength"}')

=== Continuous Analysis: Spearman Correlation ===

Fraction environmental vs partial correlation:
  rho = -0.0853, p = 0.2510, n = 183
  Not significant at p<0.05

Fraction human-associated vs partial correlation:
  rho = 0.0298, p = 0.6884, n = 183
  Not significant at p<0.05

Interpretation: No evidence that environmental fraction predicts env signal strength
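For readers less familiar with Spearman's rho: it is the Pearson correlation of the ranks, so it measures monotone association and is unaffected by monotone rescaling of `frac_env`. The sketch below illustrates both properties; the arrays are made-up stand-ins for `frac_env` and the partial correlations, not values from the data.

```python
import numpy as np
from scipy.stats import pearsonr, rankdata, spearmanr

# Hypothetical toy values standing in for frac_env and r_partial_emb_jaccard
frac_env = np.array([0.0, 0.1, 0.25, 0.5, 0.8, 1.0])
partial_r = np.array([0.12, 0.30, -0.05, 0.08, 0.02, -0.10])

# Spearman's rho as computed in the notebook
rho, p_value = spearmanr(frac_env, partial_r)

# The same quantity obtained as Pearson's r on the ranks
rho_from_ranks, _ = pearsonr(rankdata(frac_env), rankdata(partial_r))

# A monotone transform of x leaves the ranks, and hence rho, unchanged
rho_transformed, _ = spearmanr(np.sqrt(frac_env), partial_r)

print(f'rho={rho:.4f}, from ranks={rho_from_ranks:.4f}, after sqrt={rho_transformed:.4f}')
```

This is why the choice of scale for `frac_env` (raw fraction, logit, and so on) would not change the Spearman result reported above.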


## 5. Visualizations

In [9]:
# Box plot: partial correlations by species group
plot_df = corr_df[corr_df['group'].isin(['Environmental', 'Human-associated', 'Mixed/Other'])].copy()
plot_df = plot_df[plot_df['r_partial_emb_jaccard'].notna()]

fig_box = px.box(
    plot_df, x='group', y='r_partial_emb_jaccard',
    title='Environment–Gene Content Partial Correlation by Species Group',
    labels={'r_partial_emb_jaccard': 'Partial Correlation (env|phylo)', 'group': ''},
    color='group', points='all',
)
fig_box.add_hline(y=0, line_dash='dash', line_color='gray')
fig_box.update_layout(width=700, height=500, showlegend=False)
save_fig(fig_box, 'partial_corr_by_group')
fig_box.show()

Saved figures/partial_corr_by_group.png + .html


In [10]:
# Overlaid histograms
fig_hist = go.Figure()
for group, color in [('Environmental', 'green'), ('Human-associated', 'red'), ('Mixed/Other', 'gray')]:
    vals = corr_df[corr_df['group'] == group]['r_partial_emb_jaccard'].dropna()
    if len(vals) > 0:
        fig_hist.add_trace(go.Histogram(
            x=vals, name=f'{group} (n={len(vals)})',
            opacity=0.6, marker_color=color, nbinsx=30,
        ))
fig_hist.update_layout(
    title='Distribution of Environment Partial Correlations by Species Group',
    xaxis_title='Partial Correlation (env|phylo)', yaxis_title='Count',
    barmode='overlay', width=800, height=450,
)
fig_hist.add_vline(x=0, line_dash='dash', line_color='black')
save_fig(fig_hist, 'partial_corr_distributions')
fig_hist.show()

Saved figures/partial_corr_distributions.png + .html


In [11]:
# Continuous: scatter of frac_env vs partial correlation
fig_cont = px.scatter(
    valid, x='frac_env', y='r_partial_emb_jaccard',
    color='group', hover_data=['short_name', 'n_genomes'],
    title=f'Fraction Environmental vs Partial Correlation (Spearman rho={rho_env:.3f}, p={p_env:.3f})',
    labels={'frac_env': 'Fraction Environmental Genomes',
            'r_partial_emb_jaccard': 'Partial Correlation (env|phylo)'},
    opacity=0.6,
)
fig_cont.add_hline(y=0, line_dash='dash', line_color='gray')
fig_cont.update_layout(width=800, height=550)
save_fig(fig_cont, 'frac_env_vs_partial_corr')
fig_cont.show()

Saved figures/frac_env_vs_partial_corr.png + .html


In [12]:
# Species classification pie chart
group_counts = species_env['group'].value_counts().reset_index()
group_counts.columns = ['group', 'count']
fig_pie = px.pie(
    group_counts, values='count', names='group',
    title=f'Species Classification by Dominant Environment ({len(species_env)} species)',
    color='group',
    color_discrete_map={'Environmental': 'green', 'Human-associated': 'red', 'Mixed/Other': 'gray'},
)
fig_pie.update_layout(width=500, height=400)
save_fig(fig_pie, 'species_classification')
fig_pie.show()

Saved figures/species_classification.png + .html


## 6. NaN species analysis

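30 species produced NaN partial correlations, likely due to zero variance in distance matrices. Are these disproportionately from one group, potentially biasing the comparison? The rates computed below are 21.3% (Environmental), 19.7% (Mixed/Other), and 7.0% (Human-associated), so NaN results are roughly three times as frequent among environmental species as among human-associated ones, even though the largest individual NaN species are clinical pathogens.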
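The zero-variance explanation can be illustrated with a small sketch. It assumes the partial correlation follows the textbook first-order formula; the notebook does not state how the parent project computed it. The helpers `pearson` and `partial_corr` and the toy distances are hypothetical.

```python
import numpy as np

def pearson(x, y):
    """Pearson r; NaN when either input has zero variance."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    denom = np.sqrt((xc ** 2).sum() * (yc ** 2).sum())
    return float('nan') if denom == 0 else float((xc * yc).sum() / denom)

def partial_corr(r_xy, r_xz, r_yz):
    """First-order partial correlation of x and y given z (textbook formula)."""
    denom = np.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))
    if np.isnan(denom) or denom == 0:
        return float('nan')
    return float((r_xy - r_xz * r_yz) / denom)

# Hypothetical pairwise distances: the embedding distances have no variance
emb_dist = [0.0, 0.0, 0.0, 0.0]
jaccard_dist = [0.10, 0.25, 0.18, 0.30]
ani_dist = [0.01, 0.03, 0.02, 0.04]

r_emb_jac = pearson(emb_dist, jaccard_dist)   # NaN: zero variance in emb_dist
r_emb_ani = pearson(emb_dist, ani_dist)       # NaN for the same reason
r_ani_jac = pearson(ani_dist, jaccard_dist)   # finite
r_partial = partial_corr(r_emb_jac, r_emb_ani, r_ani_jac)

print(r_emb_jac, r_ani_jac, r_partial)
```

The pattern matches the first rows of the results table shown earlier, where `r_emb_jaccard` and the partial correlation are both NaN while `r_ani_jaccard` is finite. Other causes, such as missing embeddings, would produce the same pattern and cannot be ruled out from this notebook alone.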

In [13]:
# NaN rate by group
print('NaN partial correlation rate by group:')
for g in ['Environmental', 'Human-associated', 'Mixed/Other']:
    sub = corr_df[corr_df['group'] == g]
    n_nan = sub['r_partial_emb_jaccard'].isna().sum()
    n_total = len(sub)
    print(f'  {g}: {n_nan}/{n_total} ({100*n_nan/n_total:.1f}%)')

print(f'\nIf NaN species are disproportionately from one group, the comparison may be biased.')
print(f'The largest NaN species are typically clinical pathogens with narrow ANI ranges:')
nan_sp = corr_df[corr_df['r_partial_emb_jaccard'].isna()].sort_values('n_genomes', ascending=False)
nan_sp[['short_name', 'n_genomes', 'group', 'r_ani_jaccard']].head(10)

NaN partial correlation rate by group:
  Environmental: 10/47 (21.3%)
  Human-associated: 7/100 (7.0%)
  Mixed/Other: 13/66 (19.7%)

If NaN species are disproportionately from one group, the comparison may be biased.
The largest NaN species are typically clinical pathogens with narrow ANI ranges:


|    | short_name | n_genomes | group | r_ani_jaccard |
|----|------------|-----------|-------|---------------|
| 0 | Acinetobacter_baumannii | 3505 | Human-associated | 0.871024 |
| 1 | Mycobacterium_tuberculosis | 2556 | Human-associated | 0.389331 |
| 3 | Enterococcus_B_faecium | 805 | Human-associated | 0.710227 |
| 4 | Vibrio_cholerae | 541 | Mixed/Other | 0.822172 |
| 6 | Neisseria_gonorrhoeae | 304 | Mixed/Other | 0.577787 |
| 13 | Citrobacter_freundii | 215 | Human-associated | 0.528894 |
| 18 | Enterobacter_roggenkampii | 142 | Human-associated | 0.42645 |
| 23 | Cutibacterium_acnes | 134 | Mixed/Other | 0.595432 |
| 27 | Vibrio_vulnificus | 112 | Mixed/Other | 0.435496 |
| 28 | Vibrio_alginolyticus | 108 | Mixed/Other | 0.214131 |
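One way to ask whether the NaN rates differ between groups by more than chance is Fisher's exact test on the 2x2 table of NaN vs non-NaN counts. This is a sketch of a possible follow-up check, not something the original analysis ran; the counts are copied from the output above.

```python
from scipy.stats import fisher_exact

# Rows: Environmental, Human-associated; columns: NaN, non-NaN
table = [[10, 47 - 10],
         [7, 100 - 7]]

odds_ratio, p_value = fisher_exact(table, alternative='two-sided')
print(f'odds ratio = {odds_ratio:.2f}, p = {p_value:.4f}')
```

A small p-value here would suggest that dropping NaN species removes environmental species preferentially, which is worth keeping in mind when reading the group comparison in section 4.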


## 7. The 27x discrepancy with the original analysis

Our median partial correlation (0.081) is about 27x higher than the median reported by the original ecotype analysis (0.003). This section documents methodological differences that may explain the gap.

In [14]:
print('=== Methodological Differences ===')
print(f'\nOriginal ecotype analysis:')
print(f'  Species: 172 (with ANI data)')
print(f'  Max genomes/species: 250 (diversity-maximizing downsampling)')
print(f'  Median partial corr: 0.003')
print(f'\nThis reanalysis:')
print(f'  Species: {corr_df["r_partial_emb_jaccard"].notna().sum()} (with valid partial correlations)')
print(f'  Max genomes/species: {corr_df["n_genomes"].max()} (no downsampling)')
print(f'  Median partial corr: {corr_df["r_partial_emb_jaccard"].median():.4f}')
print(f'\nKey differences:')
print(f'  1. No diversity-maximizing downsampling — all genomes with embeddings used')
print(f'  2. More genomes per species increases statistical power')
print(f'  3. Different genome sets may change distance distributions')
print(f'  4. Large species (>250 genomes) included for first time')
print(f'\nNote: The 27x difference affects the absolute magnitude but not the')
print(f'group comparison (Environmental vs Human), which is the question we are testing.')

=== Methodological Differences ===

Original ecotype analysis:
  Species: 172 (with ANI data)
  Max genomes/species: 250 (diversity-maximizing downsampling)
  Median partial corr: 0.003

This reanalysis:
  Species: 183 (with valid partial correlations)
  Max genomes/species: 3505 (no downsampling)
  Median partial corr: 0.0809

Key differences:
  1. No diversity-maximizing downsampling — all genomes with embeddings used
  2. More genomes per species increases statistical power
  3. Different genome sets may change distance distributions
  4. Large species (>250 genomes) included for first time

Note: The 27x difference affects the absolute magnitude but not the
group comparison (Environmental vs Human), which is the question we are testing.


## 8. Summary

In [15]:
print('=== Reanalysis Summary ===')
print(f'\nHypothesis: Environmental species show stronger environment–gene content')
print(f'correlations because their AlphaEarth embeddings carry more geographic signal.')
print(f'\nResult: H0 NOT REJECTED')
print(f'\n  Mann-Whitney U (binary): p = {pval:.4f}')
print(f'  Spearman (continuous):  rho = {rho_env:.4f}, p = {p_env:.4f}')
print(f'\n  Environmental median:      {env_corrs.median():.4f} (n={len(env_corrs)})')
print(f'  Human-associated median:   {human_corrs.median():.4f} (n={len(human_corrs)})')
print(f'  Mixed/Other median:        {mixed_corrs.median():.4f} (n={len(mixed_corrs)})')
print(f'\nConclusion: The clinical sampling bias in AlphaEarth embeddings does NOT')
print(f'explain the weak environment–gene content signal. The original ecotype analysis')
print(f'conclusion — that phylogeny dominates — holds for all environment types.')

=== Reanalysis Summary ===

Hypothesis: Environmental species show stronger environment–gene content
correlations because their AlphaEarth embeddings carry more geographic signal.

Result: H0 NOT REJECTED

  Mann-Whitney U (binary): p = 0.8301
  Spearman (continuous):  rho = -0.0853, p = 0.2510

  Environmental median:      0.0511 (n=37)
  Human-associated median:   0.0838 (n=93)
  Mixed/Other median:        0.1092 (n=53)

Conclusion: The clinical sampling bias in AlphaEarth embeddings does NOT
explain the weak environment–gene content signal. The original ecotype analysis
conclusion — that phylogeny dominates — holds for all environment types.


In [16]:
# Save results
corr_df.to_csv(os.path.join(DATA_DIR, 'ecotype_corr_with_env_group.csv'), index=False)
print(f'Saved data/ecotype_corr_with_env_group.csv ({len(corr_df)} species)')

print(f'\nFigures:')
for f in sorted(os.listdir(FIG_DIR)):
    if f.endswith('.png'):
        print(f'  {f}')

Saved data/ecotype_corr_with_env_group.csv (213 species)

Figures:
  frac_env_vs_partial_corr.png
  partial_corr_by_group.png
  partial_corr_distributions.png
  species_classification.png
