# NB01: Ecotype Reanalysis — Environmental vs Human-Associated Species

**Runs locally or on JupyterHub** (no Spark needed).

## Background

The [ecotype_analysis](../../ecotype_analysis/) project found that **phylogeny dominates over environment** in predicting gene content similarity across 172 bacterial species (median partial correlation 0.003 for environment vs 0.014 for phylogeny). It also tested whether ecological category matters, finding no significant difference (p=0.66).

However, the [env_embedding_explorer](../../env_embedding_explorer/) project subsequently discovered that **38% of AlphaEarth genomes are human-associated** (clinical, gut), and that their embeddings carry a **markedly weaker geographic signal** (2.0x ratio vs 3.4x for environmental samples, so the environmental signal is about 1.7 times stronger). Hospitals worldwide look similar from satellite, so human-associated genomes have homogeneous embeddings regardless of geography.

This notebook tests whether the weak environment effect in the ecotype analysis was diluted by the clinical sample bias. We classify each species by its dominant environment type (using genome-level isolation_source harmonization) and compare the environment–gene content correlations between truly environmental and human-associated species.
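
As a reminder of what a partial correlation measures, the idea can be illustrated by residualization: regress both variables on the control, then correlate the residuals. The sketch below uses synthetic values and hypothetical names (`partial_corr`, `phylo`, `env`, `gene`). It is not the ecotype_analysis implementation, which may use a different estimator (for example, a rank-based or permutation-based one).

```python
import numpy as np

def partial_corr(x, y, z):
    """Pearson correlation of x and y after regressing out z (hypothetical helper)."""
    x, y, z = (np.asarray(a, dtype=float) for a in (x, y, z))
    design = np.column_stack([np.ones_like(z), z])  # intercept + control variable
    res_x = x - design @ np.linalg.lstsq(design, x, rcond=None)[0]
    res_y = y - design @ np.linalg.lstsq(design, y, rcond=None)[0]
    return float(np.corrcoef(res_x, res_y)[0, 1])

rng = np.random.default_rng(0)
phylo = rng.normal(size=500)                # synthetic phylogenetic similarity
env = 0.8 * phylo + rng.normal(size=500)    # environment tracks phylogeny
gene = 0.9 * phylo + rng.normal(size=500)   # gene content driven by phylogeny only

raw_r = float(np.corrcoef(env, gene)[0, 1])
partial_r = partial_corr(env, gene, phylo)
print(f'raw r = {raw_r:.3f}, partial r (env | phylo) = {partial_r:.3f}')
```

In this toy setup the raw correlation is clearly positive because both variables follow phylogeny, while the partial correlation stays near zero. That is the pattern the parent project reported for most species.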

## Data dependencies

| Source | File | Content |
|--------|------|--------|
| ecotype_analysis | `data/ecotype_correlation_results.csv` | Partial correlations for 172 species |
| ecotype_analysis | `data/target_genomes_expanded.csv` | 13,381 genomes with species assignments |
| env_embedding_explorer | `data/alphaearth_with_env.csv` | Harmonized env_category per genome |
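
Before running the notebook it can help to confirm that all three inputs are present. The following is a minimal sketch, assuming the notebook is executed from its own `notebooks/` directory so that the relative paths used in the loading cells resolve. `REQUIRED_FILES` and `missing_inputs` are hypothetical names introduced here for illustration.

```python
import os

# Paths from the table above, relative to this notebook's directory.
REQUIRED_FILES = {
    'ecotype correlations': '../../ecotype_analysis/data/ecotype_correlation_results.csv',
    'target genomes': '../../ecotype_analysis/data/target_genomes_expanded.csv',
    'environment categories': '../../env_embedding_explorer/data/alphaearth_with_env.csv',
}

def missing_inputs(paths, exists=os.path.exists):
    """Return the labels of inputs whose files are absent (hypothetical helper)."""
    return [label for label, path in paths.items() if not exists(path)]

missing = missing_inputs(REQUIRED_FILES)
print('All inputs present' if not missing else f'Missing inputs: {missing}')
```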

In [None]:
import pandas as pd
import numpy as np
from scipy.stats import mannwhitneyu, spearmanr
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import os
import warnings
warnings.filterwarnings('ignore')

DATA_DIR = '../data'
FIG_DIR = '../figures'
os.makedirs(DATA_DIR, exist_ok=True)
os.makedirs(FIG_DIR, exist_ok=True)

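def save_fig(fig, name):
    """Save a Plotly figure as PNG and interactive HTML under FIG_DIR."""
    # Static PNG export requires the kaleido package
    fig.write_image(os.path.join(FIG_DIR, f'{name}.png'), scale=2)
    fig.write_html(os.path.join(FIG_DIR, f'{name}.html'))
    print(f'Saved figures/{name}.png + .html')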

## 1. Load data from parent projects

In [None]:
# Ecotype correlation results
ecotype_path = '../../ecotype_analysis/data/ecotype_correlation_results.csv'
if not os.path.exists(ecotype_path):
    # Try alternate paths
    alt = '../../ecotype_analysis/data/ecotype_correlation_results_with_categories.csv'
    if os.path.exists(alt):
        ecotype_path = alt
    else:
        raise FileNotFoundError(
            'Ecotype results not found. Run ecotype_analysis/notebooks/02_ecotype_correlation_analysis.ipynb first.'
        )

corr_df = pd.read_csv(ecotype_path)
print(f'Ecotype correlations: {len(corr_df)} species')
print(f'Columns: {list(corr_df.columns)}')
corr_df.head(3)

In [None]:
# Genome metadata from ecotype analysis
genomes_path = '../../ecotype_analysis/data/target_genomes_expanded.csv'
if os.path.exists(genomes_path):
    genomes_df = pd.read_csv(genomes_path)
    print(f'Target genomes: {len(genomes_df):,}')
    print(f'Species: {genomes_df["species"].nunique()}')
else:
    print(f'WARNING: {genomes_path} not found')
    genomes_df = None

In [None]:
# Harmonized environment categories from env_embedding_explorer
env_path = '../../env_embedding_explorer/data/alphaearth_with_env.csv'
env_df = pd.read_csv(env_path, usecols=['genome_id', 'species', 'env_category', 'isolation_source'])
print(f'Environment data: {len(env_df):,} genomes')
print(f'Env categories: {env_df["env_category"].value_counts().to_dict()}')

## 2. Classify species by dominant environment

For each species in the ecotype analysis, we determine its dominant environment type by majority vote over its genomes' harmonized `env_category` values. A label requires a strict majority (more than 50% of the species' categorized genomes):
- **Environmental**: majority Soil, Marine, Freshwater, Extreme, or Plant
- **Human-associated**: majority Human gut, Human clinical, or Human other
- **Mixed/Other**: no strict majority in either group, including species dominated by Animal, Food, Wastewater, Other, or Unknown
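
The rule above can be sketched on toy data as follows. The species names and genome lists are invented for illustration, and `classify` is a hypothetical helper rather than the function used in the cell below. Note that the denominator counts each genome exactly once.

```python
from collections import Counter

ENV_TYPES = {'Soil', 'Marine', 'Freshwater', 'Extreme', 'Plant'}
HUMAN_TYPES = {'Human gut', 'Human clinical', 'Human other'}

def classify(categories):
    """Label one species from the env_category values of its genomes."""
    counts = Counter(categories)
    total = sum(counts.values())  # every genome counted once
    if total == 0:
        return 'Mixed/Other'
    frac_env = sum(n for c, n in counts.items() if c in ENV_TYPES) / total
    frac_human = sum(n for c, n in counts.items() if c in HUMAN_TYPES) / total
    if frac_env > 0.5:
        return 'Environmental'
    if frac_human > 0.5:
        return 'Human-associated'
    return 'Mixed/Other'

# Invented toy genomes, for illustration only
toy = {
    'species_A': ['Soil', 'Soil', 'Marine', 'Human gut'],                # 3/4 environmental
    'species_B': ['Human clinical', 'Human gut', 'Human gut', 'Soil'],   # 3/4 human
    'species_C': ['Soil', 'Human gut', 'Animal', 'Unknown'],             # no majority
}
labels = {species: classify(cats) for species, cats in toy.items()}
print(labels)
```

A species split exactly 50/50 between the two groups falls into Mixed/Other, because the threshold is strict.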

In [None]:
ENV_TYPES = {'Soil', 'Marine', 'Freshwater', 'Extreme', 'Plant'}
HUMAN_TYPES = {'Human gut', 'Human clinical', 'Human other'}

# Count genomes per env_category per species
species_env = env_df.groupby('species')['env_category'].value_counts().unstack(fill_value=0)

# Compute fraction environmental vs human-associated
category_cols = list(species_env.columns)  # raw env_category count columns only
env_cols = [c for c in category_cols if c in ENV_TYPES]
human_cols = [c for c in category_cols if c in HUMAN_TYPES]

# Total over the raw category columns only, so the derived n_env / n_human
# columns are not added in again (which would cap every fraction at 0.5)
species_env['n_total'] = species_env[category_cols].sum(axis=1)
species_env['n_env'] = species_env[env_cols].sum(axis=1) if env_cols else 0
species_env['n_human'] = species_env[human_cols].sum(axis=1) if human_cols else 0
species_env['frac_env'] = species_env['n_env'] / species_env['n_total']
species_env['frac_human'] = species_env['n_human'] / species_env['n_total']

# Classify
def classify_species(row):
    if row['frac_env'] > 0.5:
        return 'Environmental'
    elif row['frac_human'] > 0.5:
        return 'Human-associated'
    else:
        return 'Mixed/Other'

species_env['species_group'] = species_env.apply(classify_species, axis=1)

print('Species classification (all AlphaEarth species):')
print(species_env['species_group'].value_counts().to_string())
print(f'\nTotal species with env data: {len(species_env):,}')

## 3. Join with ecotype correlation results

We need to match species between the ecotype analysis (which uses `gtdb_species_clade_id` format) and the env_embedding_explorer (which uses GTDB species names). The join key may need adjustment depending on the column formats.
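
One way to make the two naming schemes comparable is to reduce both to a common key before joining. The sketch below is illustrative only: `normalize_species` is a hypothetical helper, and the assumption that clade IDs carry a `--<accession>` suffix after the species name has not been confirmed against the ecotype_analysis files. The accession string in the example is a placeholder.

```python
def normalize_species(name):
    """Reduce a species identifier to a comparable key (hypothetical helper).

    Assumes clade IDs look like 's__Genus_species--<accession>'; that suffix
    format is an assumption, not something this notebook confirms.
    """
    key = str(name).split('--', 1)[0]   # drop an accession suffix, if present
    if key.startswith('s__'):
        key = key[3:]                   # drop the GTDB rank prefix
    return key.strip().replace(' ', '_')

examples = [
    's__Escherichia coli',                      # GTDB species name
    's__Escherichia_coli--EXAMPLE_ACCESSION',   # placeholder clade ID
    'Escherichia coli',                         # short name
]
keys = {normalize_species(e) for e in examples}
print(keys)
```

If all three spellings collapse to a single key, a dictionary built from normalized names can be used for the join. Any species left unmatched should still be inspected by hand.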

In [None]:
# Identify the species column in the ecotype correlation results
print('Ecotype correlation columns:', list(corr_df.columns))
print(f'\nFirst few species values:')

# Find the species ID column
species_col = None
for col in ['species', 'short_name', 'gtdb_species_clade_id', 'species_id']:
    if col in corr_df.columns:
        species_col = col
        break

if species_col:
    print(f'Using column: {species_col}')
    print(corr_df[species_col].head(5).to_string())
else:
    print('WARNING: No obvious species column found')
    print(corr_df.head(2).to_string())

In [None]:
# Join ecotype correlations with species classification
# The species names may need harmonization between the two datasets
# env_embedding_explorer uses GTDB species like 's__Escherichia coli'
# ecotype_analysis may use short names or clade IDs

# Build a mapping from GTDB species to species_group
species_group_map = species_env['species_group'].to_dict()

# Try direct join first
if species_col and species_col in corr_df.columns:
    corr_df['species_group'] = corr_df[species_col].map(species_group_map)
    n_matched = corr_df['species_group'].notna().sum()
    print(f'Direct match: {n_matched}/{len(corr_df)} species matched')
    
    if n_matched < len(corr_df) * 0.5:
        # Try matching with 's__' prefix or short name
        print('Low match rate — trying alternative matching...')
        # Try stripping 's__' prefix for matching
        alt_map = {k.replace('s__', '').replace(' ', '_'): v 
                   for k, v in species_group_map.items()}
        corr_df['species_group'] = corr_df[species_col].map(
            lambda x: alt_map.get(str(x).replace('s__', '').replace(' ', '_'), 
                                  species_group_map.get(x))
        )
        n_matched = corr_df['species_group'].notna().sum()
        print(f'After alternative matching: {n_matched}/{len(corr_df)} species matched')

# Fill unmatched as 'Unknown'
corr_df['species_group'] = corr_df['species_group'].fillna('Unknown')

print(f'\nSpecies group distribution in ecotype results:')
print(corr_df['species_group'].value_counts().to_string())

## 4. Compare environment effects between groups

The key test: do environmental species show stronger partial correlations between environment similarity and gene content similarity?
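
For orientation, here is a minimal sketch of a one-sided Mann-Whitney U test on synthetic values, with a rank-biserial effect size added. The effect size is an addition for illustration and is not reported by the cells below. The arrays `env_toy` and `human_toy` are invented, and the effect-size formula assumes the returned statistic is U for the first sample, as in SciPy 1.7 and later.

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(1)
env_toy = rng.normal(loc=0.05, scale=0.05, size=40)     # synthetic "environmental" values
human_toy = rng.normal(loc=0.00, scale=0.05, size=60)   # synthetic "human-associated" values

# One-sided alternative: values in env_toy tend to be larger than in human_toy
u_stat, p_one_sided = mannwhitneyu(env_toy, human_toy, alternative='greater')

# Rank-biserial correlation in [-1, 1]; +1 means every env value exceeds every human value
rank_biserial = 2 * u_stat / (len(env_toy) * len(human_toy)) - 1

print(f'U = {u_stat:.0f}, one-sided p = {p_one_sided:.4g}, rank-biserial = {rank_biserial:.2f}')
```

Reporting an effect size alongside the p-value helps separate "no difference" from "too few species to tell", which matters when one group is small.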

In [None]:
# Identify the partial correlation column
partial_col = None
for col in ['r_partial_emb_jaccard', 'partial_corr_env', 'r_partial']:
    if col in corr_df.columns:
        partial_col = col
        break

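if partial_col is None:
    # Fall back to any 'partial' column, preferring names that mention env/emb
    # so the phylogeny partial correlation is not picked up by accident
    candidates = [c for c in corr_df.columns if 'partial' in c.lower()]
    env_like = [c for c in candidates if 'env' in c.lower() or 'emb' in c.lower()]
    if candidates:
        partial_col = (env_like or candidates)[0]
    else:
        raise ValueError(f'No partial correlation column found. Columns: {list(corr_df.columns)}')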

print(f'Using partial correlation column: {partial_col}')

# Summary statistics by group
for group in ['Environmental', 'Human-associated', 'Mixed/Other', 'Unknown']:
    subset = corr_df[corr_df['species_group'] == group][partial_col].dropna()
    if len(subset) > 0:
        print(f'\n{group} (n={len(subset)}):')
        print(f'  Median: {subset.median():.4f}')
        print(f'  Mean:   {subset.mean():.4f}')
        print(f'  Std:    {subset.std():.4f}')
        print(f'  Range:  [{subset.min():.4f}, {subset.max():.4f}]')

In [None]:
# Mann-Whitney U test: Environmental vs Human-associated
env_corrs = corr_df[corr_df['species_group'] == 'Environmental'][partial_col].dropna()
human_corrs = corr_df[corr_df['species_group'] == 'Human-associated'][partial_col].dropna()

if len(env_corrs) > 0 and len(human_corrs) > 0:
    stat, pval = mannwhitneyu(env_corrs, human_corrs, alternative='greater')
    print(f'Mann-Whitney U test (Environmental > Human-associated):')
    print(f'  Environmental: n={len(env_corrs)}, median={env_corrs.median():.4f}')
    print(f'  Human-associated: n={len(human_corrs)}, median={human_corrs.median():.4f}')
    print(f'  U={stat:.0f}, p={pval:.4f} (one-sided)')
    print(f'  Significant at p<0.05: {"YES" if pval < 0.05 else "NO"}')
else:
    print(f'Cannot run test: Environmental n={len(env_corrs)}, Human n={len(human_corrs)}')

## 5. Visualizations

In [None]:
# Box plot: partial correlations by species group
plot_df = corr_df[corr_df['species_group'].isin(['Environmental', 'Human-associated', 'Mixed/Other'])].copy()

fig_box = px.box(
    plot_df, x='species_group', y=partial_col,
    title='Environment–Gene Content Partial Correlation by Species Group',
    labels={partial_col: 'Partial Correlation (env|phylo)', 'species_group': ''},
    color='species_group',
    points='all',
)
fig_box.add_hline(y=0, line_dash='dash', line_color='gray')
fig_box.update_layout(width=700, height=500, showlegend=False)
save_fig(fig_box, 'partial_corr_by_group')
fig_box.show()

In [None]:
# Overlaid histograms
fig_hist = go.Figure()
for group, color in [('Environmental', 'green'), ('Human-associated', 'red'), ('Mixed/Other', 'gray')]:
    vals = corr_df[corr_df['species_group'] == group][partial_col].dropna()
    if len(vals) > 0:
        fig_hist.add_trace(go.Histogram(
            x=vals, name=f'{group} (n={len(vals)})',
            opacity=0.6, marker_color=color, nbinsx=30,
        ))

fig_hist.update_layout(
    title='Distribution of Environment Partial Correlations by Species Group',
    xaxis_title='Partial Correlation (env|phylo)',
    yaxis_title='Count',
    barmode='overlay', width=800, height=450,
)
fig_hist.add_vline(x=0, line_dash='dash', line_color='black')
save_fig(fig_hist, 'partial_corr_distributions')
fig_hist.show()

In [None]:
# Scatter: environment vs phylogeny partial correlations, colored by group
phylo_col = None
for col in ['r_partial_ani_jaccard', 'partial_corr_phylo', 'r_ani_jaccard']:
    if col in corr_df.columns:
        phylo_col = col
        break

if phylo_col:
    fig_scatter = px.scatter(
        plot_df, x=phylo_col, y=partial_col, color='species_group',
        hover_data=[species_col] if species_col else None,
        title='Environment vs Phylogeny Partial Correlations — by Species Group',
        labels={partial_col: 'Env Partial Corr', phylo_col: 'Phylo Partial Corr'},
        opacity=0.6,
    )
    fig_scatter.add_hline(y=0, line_dash='dash', line_color='gray')
    fig_scatter.add_vline(x=0, line_dash='dash', line_color='gray')
    fig_scatter.update_layout(width=800, height=600)
    save_fig(fig_scatter, 'env_vs_phylo_by_group')
    fig_scatter.show()
else:
    print('No phylogeny partial correlation column found — skipping scatter plot')

## 6. Interpretation

In [None]:
# Summary
print('=== Reanalysis Summary ===')
print(f'\nOriginal ecotype analysis: 172 species, median env partial corr = 0.003')
print(f'\nThis reanalysis:')

for group in ['Environmental', 'Human-associated', 'Mixed/Other']:
    vals = corr_df[corr_df['species_group'] == group][partial_col].dropna()
    if len(vals) > 0:
        print(f'  {group}: n={len(vals)}, median={vals.median():.4f}')

if len(env_corrs) > 0 and len(human_corrs) > 0:
    print(f'\nMann-Whitney U (Environmental > Human): p={pval:.4f}')
    if pval < 0.05:
        print('RESULT: Environmental species show SIGNIFICANTLY stronger environment effects.')
        print('The weak signal in the original analysis was partially due to clinical sample bias.')
    else:
        print('RESULT: No significant difference detected. Clinical bias does not appear to explain the weak signal.')
        print('The original conclusion (phylogeny dominates) is consistent with the environmental species as well.')

In [None]:
# Save results
corr_df.to_csv(os.path.join(DATA_DIR, 'ecotype_corr_with_env_group.csv'), index=False)
print(f'Saved data/ecotype_corr_with_env_group.csv')

print(f'\nFigures saved:')
for f in sorted(os.listdir(FIG_DIR)):
    if f.endswith('.png'):
        print(f'  {f}')