# End-to-End NMDC Study Analysis
## Study: nmdc:sty-11-aygzgv51

This notebook demonstrates a comprehensive analysis workflow using the NMDC API utilities:

1. **Study Exploration** - Retrieve and examine study metadata
2. **Sample Parameter Analysis** - Explore biosample characteristics (depth, pH, temperature, ecosystem types)
3. **Data Type Discovery** - Identify available data types (functional annotations, metabolites)
4. **Functional Annotation Analysis** - Examine EC numbers, PFAM domains, COG categories, KEGG orthologs
5. **Enrichment Analysis** - Compare functional profiles across sample groups
6. **Visualizations** - Create informative plots of the findings

### Scientific Questions
- What are the environmental characteristics of samples in this study?
- What functional capabilities (enzymes, protein families) are present?
- Are certain functions enriched in specific environmental conditions?
- What metabolic pathways are represented across different sample types?

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import json
from pathlib import Path
from collections import Counter, defaultdict

# NMDC API utilities
from nmdc_api_utilities.study_search import StudySearch
from nmdc_api_utilities.biosample_search import BiosampleSearch
from nmdc_api_utilities.data_object_search import DataObjectSearch

# Set up plotting style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)
plt.rcParams['font.size'] = 10

# Study ID
STUDY_ID = "nmdc:sty-11-aygzgv51"

print(f"Analyzing NMDC Study: {STUDY_ID}")
print("="*60)

## 1. Study Exploration

First, let's retrieve the study metadata and understand what this study is about.

In [None]:
# Initialize study search client
study_client = StudySearch(env="prod")

# Get study metadata
study = study_client.get_record_by_id(collection_id=STUDY_ID)

# Display key study information
print("Study Information")
print("="*60)
print(f"ID: {study.get('id', 'N/A')}")
print(f"Name: {study.get('name', 'N/A')}")
print(f"Description: {study.get('description', 'N/A')[:500]}...")  # First 500 chars
print(f"\nPrincipal Investigator: {study.get('principal_investigator', {}).get('has_raw_value', 'N/A')}")
print(f"Ecosystem Category: {study.get('ecosystem_category', 'N/A')}")
print(f"Ecosystem Type: {study.get('ecosystem_type', 'N/A')}")
print(f"Ecosystem Subtype: {study.get('ecosystem_subtype', 'N/A')}")

# Check for study-level environmental context
if 'study_category' in study:
    print(f"\nStudy Category: {study['study_category']}")

print("\n" + "="*60)

## 2. Sample Parameter Exploration

Now let's retrieve all biosamples associated with this study and explore their environmental parameters.

In [None]:
# Load biosamples from local data (the API can be slow or time out)
try:
    # Try to load from downloaded study data
    # dump-study creates individual biosample.json files in each biosample directory
    # (Path is already imported at the top of the notebook)
    study_dir = Path('../study_data_complete')
    biosample_files = list(study_dir.glob("nmdc_bsm-*/biosample.json"))
    
    if not biosample_files:
        raise FileNotFoundError("No biosample files found in study_data_complete")
    
    print(f"Found {len(biosample_files)} biosample files")
    
    # Aggregate individual biosample files into a list
    biosamples = []
    for biosample_file in biosample_files:
        with open(biosample_file) as f:
            biosample = json.load(f)
            biosamples.append(biosample)
    
    print(f"Loaded {len(biosamples)} biosamples from local cache")
except FileNotFoundError as e:
    # Fallback to API
    print(f"Local files not found: {e}")
    print("Fetching biosamples from API (this may take a while)...")
    biosamples = study_client.get_linked_biosamples(study_id=STUDY_ID)

print(f"Found {len(biosamples)} biosamples in this study")
print("="*60)

# Display a few sample IDs
print("\nSample IDs (first 10):")
for i, bs in enumerate(biosamples[:10]):
    print(f"  {i+1}. {bs.get('id', 'N/A')} - {bs.get('name', 'N/A')[:50]}")

### 2.1 Environmental Parameters

Let's extract and analyze key environmental parameters from the biosamples.

In [None]:
# Extract environmental parameters
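def extract_numeric_value(field_dict):
    """Extract a numeric value from an NMDC measurement field (dict or bare number)."""
    if field_dict is None:
        return None
    # Bare numbers (e.g. lat_lon latitude/longitude) are returned as-is
    if isinstance(field_dict, (int, float)) and not isinstance(field_dict, bool):
        return field_dict
    if isinstance(field_dict, dict):
        # Try different field names; compare against None so a value of 0 is kept
        for key in ['has_numeric_value', 'has_maximum_numeric_value', 'has_minimum_numeric_value']:
            if field_dict.get(key) is not None:
                return field_dict[key]
    return None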

def extract_raw_value(field_dict):
    """Extract raw value from NMDC fields."""
    if not field_dict:
        return None
    if isinstance(field_dict, dict) and 'has_raw_value' in field_dict:
        return field_dict['has_raw_value']
    return str(field_dict) if field_dict else None

# Build a dataframe of sample parameters
sample_data = []
for bs in biosamples:
    sample_info = {
        'id': bs.get('id'),
        'name': bs.get('name'),
        'ecosystem_category': extract_raw_value(bs.get('ecosystem_category')),
        'ecosystem_type': extract_raw_value(bs.get('ecosystem_type')),
        'ecosystem_subtype': extract_raw_value(bs.get('ecosystem_subtype')),
        'depth': extract_numeric_value(bs.get('depth')),
        'ph': extract_numeric_value(bs.get('ph')),
        'temp': extract_numeric_value(bs.get('temp')),
        'lat': extract_numeric_value(bs.get('lat_lon', {}).get('latitude') if isinstance(bs.get('lat_lon'), dict) else None),
        'lon': extract_numeric_value(bs.get('lat_lon', {}).get('longitude') if isinstance(bs.get('lat_lon'), dict) else None),
    }
    sample_data.append(sample_info)

df_samples = pd.DataFrame(sample_data)

print(f"\nSample DataFrame Shape: {df_samples.shape}")
print("\nFirst few rows:")
display(df_samples.head())

print("\nData Summary:")
display(df_samples.describe())
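
To make the flattening step above concrete, here is a minimal, self-contained sketch of how a measurement field might be reduced to a value and a unit. The record below is hypothetical, and the `has_unit` key and the helper name `numeric_with_unit` are assumptions for illustration; check them against the actual biosample records before relying on them.

```python
from typing import Any, Optional, Tuple

NUMERIC_KEYS = ("has_numeric_value", "has_maximum_numeric_value", "has_minimum_numeric_value")

def numeric_with_unit(field: Any) -> Tuple[Optional[float], Optional[str]]:
    """Return (value, unit) from a measurement dict or a bare number (hypothetical helper)."""
    if isinstance(field, bool):
        return None, None
    if isinstance(field, (int, float)):
        return float(field), None
    if isinstance(field, dict):
        for key in NUMERIC_KEYS:
            # Compare against None so that a legitimate value of 0 is kept
            if field.get(key) is not None:
                return float(field[key]), field.get("has_unit")
    return None, None

# Hypothetical biosample record, not taken from the study
example_biosample = {
    "id": "nmdc:bsm-example",
    "depth": {"has_numeric_value": 0.0, "has_unit": "m"},
    "ph": 6.8,
    "lat_lon": {"latitude": 46.37, "longitude": -119.27},
}

depth, depth_unit = numeric_with_unit(example_biosample.get("depth"))
ph, _ = numeric_with_unit(example_biosample.get("ph"))
lat, _ = numeric_with_unit(example_biosample.get("lat_lon", {}).get("latitude"))
temp, _ = numeric_with_unit(example_biosample.get("temp"))
print(depth, depth_unit, ph, lat, temp)
```

Keeping the unit next to the value makes it easier to spot samples that report depth in different units before they are pooled into one histogram.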

### 2.2 Ecosystem Distribution

In [None]:
# Analyze ecosystem distribution
print("Ecosystem Distribution")
print("="*60)

for col in ['ecosystem_category', 'ecosystem_type', 'ecosystem_subtype']:
    if col in df_samples.columns:
        counts = df_samples[col].value_counts()
        print(f"\n{col.replace('_', ' ').title()}:")
        for val, count in counts.items():
            if val:
                print(f"  {val}: {count}")

# Visualize ecosystem types
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
for idx, col in enumerate(['ecosystem_category', 'ecosystem_type', 'ecosystem_subtype']):
    if col in df_samples.columns:
        data = df_samples[col].value_counts()
        if len(data) > 0:
            data.plot(kind='bar', ax=axes[idx], color='steelblue')
            axes[idx].set_title(f"{col.replace('_', ' ').title()} Distribution")
            axes[idx].set_xlabel('')
            axes[idx].set_ylabel('Count')
            axes[idx].tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.show()

### 2.3 Environmental Parameter Distributions

In [None]:
# Plot distributions of numeric parameters
numeric_params = ['depth', 'ph', 'temp']
available_params = [p for p in numeric_params if p in df_samples.columns and df_samples[p].notna().sum() > 0]

if available_params:
    fig, axes = plt.subplots(1, len(available_params), figsize=(5*len(available_params), 4))
    if len(available_params) == 1:
        axes = [axes]
    
    for idx, param in enumerate(available_params):
        data = df_samples[param].dropna()
        axes[idx].hist(data, bins=20, color='skyblue', edgecolor='black', alpha=0.7)
        axes[idx].set_title(f"{param.upper()} Distribution")
        axes[idx].set_xlabel(param.title())
        axes[idx].set_ylabel('Frequency')
        
        # Add statistics
        mean_val = data.mean()
        median_val = data.median()
        axes[idx].axvline(mean_val, color='red', linestyle='--', label=f'Mean: {mean_val:.2f}')
        axes[idx].axvline(median_val, color='green', linestyle='--', label=f'Median: {median_val:.2f}')
        axes[idx].legend()
    
    plt.tight_layout()
    plt.show()
    
    # Print statistics
    print("\nEnvironmental Parameter Statistics:")
    print("="*60)
    for param in available_params:
        data = df_samples[param].dropna()
        print(f"\n{param.upper()}:")
        print(f"  Count: {len(data)}")
        print(f"  Mean: {data.mean():.2f}")
        print(f"  Median: {data.median():.2f}")
        print(f"  Std Dev: {data.std():.2f}")
        print(f"  Min: {data.min():.2f}")
        print(f"  Max: {data.max():.2f}")
else:
    print("No numeric environmental parameters available for visualization.")

### 2.4 Geographic Distribution

If lat/lon coordinates are available, let's visualize the geographic distribution.

In [None]:
# Plot geographic distribution
if 'lat' in df_samples.columns and 'lon' in df_samples.columns:
    df_geo = df_samples[['lat', 'lon']].dropna()
    
    if len(df_geo) > 0:
        fig, ax = plt.subplots(figsize=(12, 8))
        scatter = ax.scatter(df_geo['lon'], df_geo['lat'], 
                           c=range(len(df_geo)), cmap='viridis', 
                           s=100, alpha=0.6, edgecolors='black')
        ax.set_xlabel('Longitude', fontsize=12)
        ax.set_ylabel('Latitude', fontsize=12)
        ax.set_title('Geographic Distribution of Samples', fontsize=14)
        ax.grid(True, alpha=0.3)
        plt.colorbar(scatter, label='Sample Index', ax=ax)
        plt.tight_layout()
        plt.show()
        
        print(f"\nMapped {len(df_geo)} samples with coordinates")
    else:
        print("No samples with valid lat/lon coordinates found.")
else:
    print("No geographic coordinates available.")

## 3. Data Type Discovery

Now let's explore what types of data are available for this study. We're particularly interested in:
- Functional annotations (from metagenomes and proteomes)
- Metabolite data
- GFF annotation files

In [None]:
# Get all data objects for the study
print("Retrieving data objects for the study...")
print("This may take a minute...\n")

try:
    data_objects = study_client.get_all_linked_data_objects(study_id=STUDY_ID, group_by_type=True)
except Exception as e:
    print(f"API request failed: {e}")
    print("Using local data from study dump...\n")
    
    # Load from local files (same dump directory used for the biosamples above)
    data_objects = defaultdict(list)
    
    study_dir = Path('../study_data_complete')
    for biosample_dir in study_dir.glob('nmdc_bsm-*'):
        data_dir = biosample_dir / 'data_objects'
        if data_dir.exists():
            # Get metadata for each file
            for file in data_dir.glob('*.json'):
                try:
                    with open(file) as f:
                        obj = json.load(f)
                except (OSError, json.JSONDecodeError) as err:
                    # Skip unreadable or malformed metadata files, but report them
                    print(f"  Skipping {file}: {err}")
                    continue
                obj_type = obj.get('data_object_type', obj.get('type', 'Unknown'))
                data_objects[obj_type].append(obj)

print(f"Found {len(data_objects)} data object types")
print("="*60)

# Display data types and counts
data_type_counts = {}
for data_type, objects in data_objects.items():
    count = len(objects)
    data_type_counts[data_type] = count
    print(f"{data_type}: {count} files")

# Sort by count
data_type_counts_sorted = dict(sorted(data_type_counts.items(), key=lambda x: x[1], reverse=True))

In [None]:
# Visualize data type distribution
if data_type_counts_sorted:
    fig, ax = plt.subplots(figsize=(12, 8))
    types = list(data_type_counts_sorted.keys())
    counts = list(data_type_counts_sorted.values())
    
    bars = ax.barh(types, counts, color='coral', edgecolor='black', alpha=0.7)
    ax.set_xlabel('Number of Files', fontsize=12)
    ax.set_ylabel('Data Object Type', fontsize=12)
    ax.set_title('Distribution of Data Types in Study', fontsize=14)
    ax.grid(axis='x', alpha=0.3)
    
    # Add count labels
    for bar in bars:
        width = bar.get_width()
        ax.text(width, bar.get_y() + bar.get_height()/2, 
               f'{int(width)}', ha='left', va='center', fontsize=9)
    
    plt.tight_layout()
    plt.show()

### 3.1 Focus on Functional Annotations

Let's identify data types related to functional annotations (EC numbers, PFAM, COG, KEGG, etc.)

In [None]:
# Keywords for functional annotation files
annotation_keywords = ['annotation', 'ec', 'pfam', 'cog', 'kegg', 'ko', 'gff', 'ortholog', 
                       'cath', 'smart', 'tigrfam', 'superfamily', 'functional']

# Filter for annotation-related data types
annotation_types = {}
for data_type, objects in data_objects.items():
    if any(keyword in data_type.lower() for keyword in annotation_keywords):
        annotation_types[data_type] = objects

print("Functional Annotation Data Types:")
print("="*60)
for data_type, objects in annotation_types.items():
    print(f"\n{data_type}: {len(objects)} files")
    # Show a few example file names
    for obj in objects[:3]:
        print(f"  - {obj.get('name', 'N/A')[:80]}")
    if len(objects) > 3:
        print(f"  ... and {len(objects)-3} more")
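
One caveat with the substring filter above: short keywords such as `ec` or `ko` can match inside unrelated words (for example, `ec` occurs in "Direct"). A possible refinement, sketched below under the assumption that short keywords should only match whole tokens, compares the two behaviours. The type names are illustrative and the helper names `naive_match` and `strict_match` are hypothetical.

```python
import re

ANNOTATION_KEYWORDS = ['annotation', 'ec', 'pfam', 'cog', 'kegg', 'ko', 'gff', 'ortholog',
                       'cath', 'smart', 'tigrfam', 'superfamily', 'functional']

def naive_match(type_name, keywords):
    """Plain substring matching, as in the cell above."""
    name = type_name.lower()
    return any(k in name for k in keywords)

def strict_match(type_name, keywords, short_len=3):
    """Short keywords must equal a whole token; longer ones may match as substrings."""
    name = type_name.lower()
    tokens = set(re.findall(r"[a-z0-9]+", name))
    for k in keywords:
        if len(k) <= short_len:
            if k in tokens:
                return True
        elif k in name:
            return True
    return False

# Illustrative type names only
example_types = [
    "Annotation KEGG Orthology",
    "Direct Infusion FT ICR-MS Analysis Results",
]
for t in example_types:
    print(t, naive_match(t, ANNOTATION_KEYWORDS), strict_match(t, ANNOTATION_KEYWORDS))
```

The same idea applies to the short `ms` keyword in the metabolite filter.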

### 3.2 Metabolite and Chemical Data

Let's check for metabolite data.

In [None]:
# Keywords for metabolite/chemical data
metabolite_keywords = ['metabol', 'chemical', 'compound', 'lipidomic', 'metabolomic', 
                       'organic', 'natural products', 'mass spec', 'ms', 'nmr']

# Filter for metabolite-related data types
metabolite_types = {}
for data_type, objects in data_objects.items():
    if any(keyword in data_type.lower() for keyword in metabolite_keywords):
        metabolite_types[data_type] = objects

if metabolite_types:
    print("Metabolite/Chemical Data Types:")
    print("="*60)
    for data_type, objects in metabolite_types.items():
        print(f"\n{data_type}: {len(objects)} files")
        for obj in objects[:5]:
            print(f"  - {obj.get('name', 'N/A')[:80]}")
        if len(objects) > 5:
            print(f"  ... and {len(objects)-5} more")
else:
    print("No metabolite data found for this study.")
    print("This study appears to focus on genomic/metagenomic data.")

## 4. Functional Annotation Analysis

Now let's analyze the functional annotations in detail. We'll look for patterns in:
- EC numbers (enzyme functions)
- PFAM domains (protein families)
- COG categories (functional categories)
- KEGG orthologs (metabolic pathways)

### 4.1 Sample Annotation Files

Let's examine some annotation files to understand their content.

In [None]:
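# Find GFF and annotation TSV files by data object type
# (the names must match the keys of data_objects exactly; check the list printed in Section 3)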
gff_files = data_objects.get('Annotation GFF', [])
ec_files = data_objects.get('Annotation Enzyme Commission', [])
ko_files = data_objects.get('Annotation KEGG Orthology', [])
pfam_files = data_objects.get('Annotation PFAM', [])
cog_files = data_objects.get('Annotation COG', [])

print("Available Annotation Files:")
print("="*60)
print(f"GFF files: {len(gff_files)}")
print(f"EC annotation files: {len(ec_files)}")
print(f"KEGG Orthology files: {len(ko_files)}")
print(f"PFAM files: {len(pfam_files)}")
print(f"COG files: {len(cog_files)}")

# Show example file URLs (if available)
if gff_files:
    print("\nExample GFF file:")
    print(f"  ID: {gff_files[0].get('id')}")
    print(f"  Name: {gff_files[0].get('name')}")
    if 'url' in gff_files[0]:
        print(f"  URL: {gff_files[0]['url']}")
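
Before downloading any of these files, it can help to estimate how much data is involved. The sketch below sums a per-object size field; it assumes each data object record carries a `file_size_bytes` entry, which should be verified against the records actually returned. The example records and the helper name `total_size_gb` are hypothetical.

```python
def total_size_gb(objects, size_key="file_size_bytes"):
    """Return (total size in GB, number of objects without a usable size)."""
    sizes = [o[size_key] for o in objects
             if isinstance(o.get(size_key), (int, float))]
    return sum(sizes) / 1e9, len(objects) - len(sizes)

# Hypothetical data object records
example_objects = [
    {"id": "nmdc:dobj-example-1", "file_size_bytes": 1_500_000_000},
    {"id": "nmdc:dobj-example-2", "file_size_bytes": 500_000_000},
    {"id": "nmdc:dobj-example-3"},  # size not reported
]

size_gb, missing = total_size_gb(example_objects)
print(f"{size_gb:.1f} GB across {len(example_objects) - missing} files "
      f"({missing} without a reported size)")
```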

### 4.2 Aggregate Functional Profile

Since individual annotation files can be large, this section does not parse them. Instead, it summarizes how many annotation files of each type are available across the study.

**Note**: This is a demonstration. For real analysis, you would download and parse the annotation files.

In [None]:
# Create summary of available functional data
functional_summary = {
    'Total Biosamples': len(biosamples),
    'GFF Annotation Files': len(gff_files),
    'EC Annotation Files': len(ec_files),
    'KEGG Orthology Files': len(ko_files),
    'PFAM Files': len(pfam_files),
    'COG Files': len(cog_files),
}

print("Functional Annotation Summary:")
print("="*60)
for key, value in functional_summary.items():
    print(f"{key}: {value}")

# Calculate coverage (annotation files per biosample, as a percentage)
# Note: this counts files, not samples, so it can exceed 100% when a
# biosample has more than one file of a given type.
if len(biosamples) > 0:
    print("\nAnnotation Coverage:")
    print("="*60)
    for ann_type, count in [("EC", len(ec_files)), ("KO", len(ko_files)), 
                             ("PFAM", len(pfam_files)), ("COG", len(cog_files))]:
        coverage = (count / len(biosamples)) * 100
        print(f"{ann_type}: {count} files / {len(biosamples)} biosamples ({coverage:.1f}%)")
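
As the note in this section says, a real analysis would parse the downloaded annotation files. The sketch below shows one way to count EC numbers from the attributes column of a GFF file. It assumes GFF3-style `key=value` attributes separated by `;` and an attribute named `ec_number`; both should be checked against the actual files. The GFF content and the helper name `count_ec_numbers` are made up for illustration.

```python
import io
from collections import Counter

def count_ec_numbers(gff_lines, attr_key="ec_number"):
    """Count EC numbers found in column 9 of GFF lines (attribute name is an assumption)."""
    counts = Counter()
    for line in gff_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and header/comment lines
        cols = line.split("\t")
        if len(cols) < 9:
            continue  # not a complete feature line
        for pair in cols[8].split(";"):
            key, sep, value = pair.strip().partition("=")
            if sep and key == attr_key:
                # A gene may carry several comma-separated EC numbers
                for ec in value.split(","):
                    ec = ec.strip()
                    if ec:
                        counts[ec] += 1
    return counts

# Made-up GFF content for illustration
example_gff = (
    "##gff-version 3\n"
    "contig_1\texample\tCDS\t1\t300\t.\t+\t0\tID=gene_1;ec_number=EC:2.7.1.1\n"
    "contig_1\texample\tCDS\t400\t900\t.\t-\t0\tID=gene_2;ec_number=EC:2.7.1.1,EC:1.1.1.1\n"
    "contig_2\texample\tCDS\t10\t200\t.\t+\t0\tID=gene_3;product=hypothetical protein\n"
)

ec_counts = count_ec_numbers(io.StringIO(example_gff))
for ec, n in ec_counts.most_common():
    print(ec, n)
```

Because the function accepts any iterable of lines, it can be given an open file handle, so large files are read line by line rather than loaded into memory.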

## 5. Enrichment Analysis

For enrichment analysis, we would typically:
1. Download the study data using `nmdc dump-study`
2. Run enrichment analysis using `nmdc enrich`

Let's demonstrate the workflow and what questions we could answer.

### 5.1 Enrichment Analysis Strategy

Based on the sample parameters we explored, here are interesting enrichment analyses we could perform:

#### If depth data is available:
```bash
# Compare shallow vs deep samples
nmdc enrich ./study_data --group-by depth --threshold <median_depth> --annotation-type ec_number
```

#### If pH data is available:
```bash
# Compare acidic vs neutral/alkaline samples
nmdc enrich ./study_data --group-by ph --threshold 7.0 --annotation-type pfam
```

#### If multiple ecosystem types:
```bash
# Compare specific ecosystem types
nmdc enrich ./study_data --group-by ecosystem_type --categories "Soil,Marine" --annotation-type ko
```

#### If temperature data:
```bash
# Compare cold vs warm environments
nmdc enrich ./study_data --group-by temp --bins 2 --annotation-type cog
```
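
To illustrate what a two-group enrichment test computes for a single feature, here is a minimal sketch of a two-sided Fisher's exact test on presence/absence counts, written with the standard library only. This is an assumption about the kind of test involved, not a description of what `nmdc enrich` actually runs, and the counts are invented.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    a, b: samples with / without the feature in group 1
    c, d: samples with / without the feature in group 2
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of seeing x feature-positive samples in group 1
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables at least as unlikely as the observed one
    p = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))
    return min(p, 1.0)

# Invented counts: feature present in 8 of 10 shallow samples and 1 of 10 deep samples
p_value = fisher_exact_two_sided(8, 2, 1, 9)
print(f"p = {p_value:.4f}")
```

With many features tested at once, each p-value would still need a multiple-testing correction before it is interpreted.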

In [None]:
# Generate enrichment analysis recommendations based on available parameters
print("Recommended Enrichment Analyses:")
print("="*60)

recommendations = []

# Check for continuous variables
if 'depth' in df_samples.columns and df_samples['depth'].notna().sum() > 5:
    median_depth = df_samples['depth'].median()
    recommendations.append({
        'name': 'Depth-based enrichment',
        'rationale': f'Compare functions in shallow (≤{median_depth:.1f}) vs deep (>{median_depth:.1f}) samples',
        'command': f'nmdc enrich ./study_data --group-by depth --threshold {median_depth:.1f} --annotation-type ec_number',
        'question': 'Are certain enzymes enriched at different depths?'
    })

if 'ph' in df_samples.columns and df_samples['ph'].notna().sum() > 5:
    median_ph = df_samples['ph'].median()
    recommendations.append({
        'name': 'pH-based enrichment',
        'rationale': f'Compare functions in lower-pH (≤{median_ph:.1f}) vs higher-pH (>{median_ph:.1f}) samples',
        'command': f'nmdc enrich ./study_data --group-by ph --threshold {median_ph:.1f} --annotation-type pfam',
        'question': 'Which protein families are associated with pH tolerance?'
    })

if 'temp' in df_samples.columns and df_samples['temp'].notna().sum() > 5:
    recommendations.append({
        'name': 'Temperature-based enrichment',
        'rationale': 'Compare functions in cold vs warm environments',
        'command': 'nmdc enrich ./study_data --group-by temp --bins 2 --annotation-type cog',
        'question': 'What COG categories are temperature-dependent?'
    })

# Check for categorical variables
if 'ecosystem_type' in df_samples.columns:
    eco_types = df_samples['ecosystem_type'].value_counts()
    if len(eco_types) >= 2:
        top_two_types = eco_types.head(2).index.tolist()
        top_two = ','.join(top_two_types)  # comma-separated form for the CLI flag
        recommendations.append({
            'name': 'Ecosystem type comparison',
            'rationale': f'Compare functions between {" and ".join(top_two_types)}',
            'command': f'nmdc enrich ./study_data --group-by ecosystem_type --categories "{top_two}" --annotation-type ko',
            'question': 'What metabolic pathways differ between ecosystem types?'
        })

# Display recommendations
if recommendations:
    for i, rec in enumerate(recommendations, 1):
        print(f"\n{i}. {rec['name']}")
        print(f"   Scientific Question: {rec['question']}")
        print(f"   Rationale: {rec['rationale']}")
        print(f"   Command: {rec['command']}")
else:
    print("\nInsufficient sample parameters for enrichment analysis.")
    print("Need more than 5 samples with a numeric parameter, or at least two ecosystem types.")
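
The threshold-based recommendations above split samples into a "low" group (value ≤ threshold) and a "high" group (value > threshold). A minimal sketch of that grouping, using the median as the default threshold, is shown here. The sample IDs, depth values and the helper name `split_by_threshold` are hypothetical.

```python
from statistics import median

def split_by_threshold(samples, field, threshold=None):
    """Split sample IDs into low (<= threshold) and high (> threshold) groups."""
    values = {sid: rec[field] for sid, rec in samples.items()
              if rec.get(field) is not None}
    if threshold is None:
        threshold = median(values.values())
    low = sorted(sid for sid, v in values.items() if v <= threshold)
    high = sorted(sid for sid, v in values.items() if v > threshold)
    return threshold, low, high

# Hypothetical samples; s5 has no depth and is left out of both groups
example_samples = {
    "s1": {"depth": 0.1},
    "s2": {"depth": 0.5},
    "s3": {"depth": 2.0},
    "s4": {"depth": 10.0},
    "s5": {"depth": None},
}

threshold, shallow, deep = split_by_threshold(example_samples, "depth")
print(threshold, shallow, deep)
```

When many samples share the median value, the two groups can end up very uneven, so it is worth checking group sizes before running a comparison.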

### 5.2 Real Enrichment Results

An enrichment analysis was run comparing samples from two environmental media (`env_medium` ENVO:01000017 vs ENVO:00002007), using real functional annotation data from GFF files.
The cell below loads those results if the output file is present in one of the expected locations.

In [None]:
# Load real enrichment results
# Try multiple possible locations for enrichment results
enrichment_paths = [
    '../study_data_complete/enrichment_env_medium_ec.tsv',
    'enrichment_env_medium_ec.tsv',
    'enrichment_with_all_labels.tsv',
    '../enrichment_env_medium_ec.tsv'
]

enrichment_file = None
for path in enrichment_paths:
    if Path(path).exists():
        enrichment_file = path
        break

if enrichment_file:
    try:
        enrichment_df = pd.read_csv(enrichment_file, sep='\t')
        
        print("Real Enrichment Analysis Results")
        print(f"Loaded from: {enrichment_file}")
        print(f"Comparing: {enrichment_df['group1_name'].iloc[0]} vs {enrichment_df['group2_name'].iloc[0]}")
        print("="*60)
        print(f"Total features tested: {len(enrichment_df)}")
        
        # Sort by p-value so that "top N" below really means most significant
        enrichment_df = enrichment_df.sort_values('p_value').reset_index(drop=True)
        
        # Filter for significant results
        significant = enrichment_df[enrichment_df['fdr'] < 0.05]
        print(f"Significant features (FDR < 0.05): {len(significant)}")
        
        # Show top 20 most significant
        print("\nTop 20 Most Significant Features:")
        print("="*60)
        display_cols = ['feature_id', 'group1_name', 'group1_count', 
                        'group2_name', 'group2_count', 'p_value', 
                        'fdr', 'effect_size', 'enriched_in']
        # Add feature_name if available (from OAKlib labeling)
        if 'feature_name' in enrichment_df.columns:
            display_cols.insert(1, 'feature_name')
        
        top20 = enrichment_df.head(20)[display_cols]
        display(top20)
        
        # Visualize enrichment
        if len(significant) > 0:
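            # Prepare data for plotting
            plot_data = enrichment_df.head(100).copy()  # Top 100 for visualization
            with np.errstate(divide='ignore'):
                plot_data['neg_log_pval'] = -np.log10(plot_data['p_value'])
                plot_data['log2_fc'] = np.log2(plot_data['effect_size'])
            # Drop points whose p-value or effect size is 0 (their log is infinite)
            plot_data = plot_data.replace([np.inf, -np.inf], np.nan).dropna(
                subset=['neg_log_pval', 'log2_fc'])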
            
            fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))
            
            # Volcano plot
            colors = ['red' if fdr < 0.05 else 'gray' for fdr in plot_data['fdr']]
            ax1.scatter(plot_data['log2_fc'], plot_data['neg_log_pval'], 
                       c=colors, s=60, alpha=0.6, edgecolors='black', linewidth=0.5)
            ax1.axhline(-np.log10(0.05), color='blue', linestyle='--', 
                       linewidth=2, label='p=0.05', alpha=0.7)
            ax1.axvline(0, color='black', linestyle='-', linewidth=0.5)
            ax1.set_xlabel('Log2(Fold Change)', fontsize=12, fontweight='bold')
            ax1.set_ylabel('-Log10(p-value)', fontsize=12, fontweight='bold')
            ax1.set_title('Enrichment Volcano Plot (Top 100)', fontsize=14, fontweight='bold')
            ax1.legend()
            ax1.grid(True, alpha=0.3)
            
            # Bar plot of top enriched features (10 in each direction)
            group1_enriched = significant[significant['enriched_in'] == significant['group1_name'].iloc[0]].head(10)
            group2_enriched = significant[significant['enriched_in'] == significant['group2_name'].iloc[0]].head(10)
            
            # Combine and sort by fold change
            top_features = pd.concat([group1_enriched, group2_enriched]).sort_values('effect_size')
            
            colors_bar = ['steelblue' if x == top_features['group1_name'].iloc[0] 
                         else 'coral' for x in top_features['enriched_in']]
            
            # Use feature_name for labels if available, otherwise feature_id
            if 'feature_name' in top_features.columns:
                # str() guards against missing (NaN) feature names, which are floats
                labels = [f"{row['feature_id']}\n{str(row['feature_name'])[:40]}" 
                         for _, row in top_features.iterrows()]
            else:
                labels = top_features['feature_id'].tolist()
            
            ax2.barh(range(len(top_features)), top_features['effect_size'], 
                    color=colors_bar, edgecolor='black', alpha=0.7, linewidth=0.5)
            ax2.set_yticks(range(len(top_features)))
            ax2.set_yticklabels(labels, fontsize=8)
            ax2.axvline(1, color='black', linestyle='--', linewidth=1)
            ax2.set_xlabel('Fold Change', fontsize=12, fontweight='bold')
            ax2.set_ylabel('Feature', fontsize=12, fontweight='bold')
            ax2.set_title('Top Enriched Features (10 per group)', fontsize=14, fontweight='bold')
            ax2.set_xscale('log')
            ax2.grid(axis='x', alpha=0.3)
            
            # Add legend
            from matplotlib.patches import Patch
            legend_elements = [
                Patch(facecolor='steelblue', label=top_features['group1_name'].iloc[0]),
                Patch(facecolor='coral', label=top_features['group2_name'].iloc[0])
            ]
            ax2.legend(handles=legend_elements, loc='best')
            
            plt.tight_layout()
            plt.show()
            
            # Summary statistics
            print("\n" + "="*60)
            print("Enrichment Summary:")
            print(f"  Features enriched in {enrichment_df['group1_name'].iloc[0]}: "
                  f"{len(significant[significant['enriched_in'] == enrichment_df['group1_name'].iloc[0]])}")
            print(f"  Features enriched in {enrichment_df['group2_name'].iloc[0]}: "
                  f"{len(significant[significant['enriched_in'] == enrichment_df['group2_name'].iloc[0]])}")
            
            # Show some interesting highly enriched features
            print("\nHighly Enriched Features (Fold Change > 10):")
            highly_enriched = significant[significant['effect_size'] > 10].head(10)
            if len(highly_enriched) > 0:
                for _, row in highly_enriched.iterrows():
                    if 'feature_name' in row:
                        print(f"  {row['feature_id']} ({row['feature_name']}): {row['effect_size']:.1f}x enriched in {row['enriched_in']}")
                    else:
                        print(f"  {row['feature_id']}: {row['effect_size']:.1f}x enriched in {row['enriched_in']}")
            else:
                print("  (None with fold change > 10)")
                
    except Exception as e:
        print(f"Error loading enrichment results: {e}")
        import traceback
        traceback.print_exc()
else:
    print("Enrichment results file not found in any of the expected locations:")
    for path in enrichment_paths:
        print(f"  - {path}")
    print("\nTo generate real enrichment results, run:")
    print("  cd notebooks")
    print("  nmdc enrich ../study_data_complete --group-by env_medium --categories 'ENVO:01000017,ENVO:00002007' --annotation-type ec_number --output enrichment_env_medium_ec.tsv")
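
The cell above filters on an `fdr` column. The source does not say how that column is computed; a common choice is the Benjamini-Hochberg adjustment, sketched below with the standard library as an assumption. The p-values are invented, and the example shows why a feature with p < 0.05 can still fail the FDR < 0.05 filter, which is also why some points above the p = 0.05 line in the volcano plot stay gray.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Invented p-values for five features
example_p = [0.001, 0.008, 0.039, 0.041, 0.60]
example_fdr = benjamini_hochberg(example_p)

for p, q in zip(example_p, example_fdr):
    flag = "significant" if q < 0.05 else "not significant"
    print(f"p = {p:.3f}  fdr = {q:.5f}  {flag}")
```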

## 6. Summary and Next Steps

### What We Discovered

1. **Study Characteristics**: Retrieved study metadata and context
2. **Sample Parameters**: Analyzed environmental parameters across biosamples
3. **Data Availability**: Identified available functional annotation data types
4. **Enrichment Potential**: Identified opportunities for comparative functional analysis

### Next Steps for Real Analysis

To perform a complete analysis with real data:

```bash
# 1. Download the complete study data
nmdc dump-study nmdc:sty-11-aygzgv51 ./study_data --download-data \
  --include-types "Annotation GFF,Annotation Enzyme Commission,Annotation KEGG Orthology"

# 2. Run enrichment analysis (use one of the recommended analyses above)
nmdc enrich ./study_data --group-by depth --threshold 10.0 \
  --annotation-type ec_number --output enrichment_results.tsv

# 3. Analyze GFF files for specific functions
#    (dump-study writes one nmdc_bsm-* directory per biosample)
nmdc gff query ./study_data/nmdc_bsm-*/data_objects/*.gff \
  --ec "2.7.%" --output kinases.tsv

# 4. Find biosynthetic gene clusters
nmdc gff find-bgc ./study_data/nmdc_bsm-*/data_objects/*.gff \
  --min-genes 5 --output bgc_candidates.json
```

### Scientific Questions to Explore

- How does microbial functional diversity vary with environmental parameters?
- What metabolic pathways are enriched in specific ecosystem types?
- Are there depth-dependent or pH-dependent functional adaptations?
- What biosynthetic gene clusters are present and how are they distributed?
- Can we identify ecosystem-specific enzymes or protein families?

In [None]:
# Final summary statistics
print("Analysis Summary")
print("="*60)
print(f"Study ID: {STUDY_ID}")
print(f"Total Biosamples: {len(biosamples)}")
print(f"Total Data Object Types: {len(data_objects)}")
print(f"Functional Annotation Types: {len(annotation_types)}")
if metabolite_types:
    print(f"Metabolite Data Types: {len(metabolite_types)}")

# Parameter coverage
print("\nParameter Coverage:")
for param in ['depth', 'ph', 'temp', 'lat', 'lon']:
    if param in df_samples.columns:
        count = df_samples[param].notna().sum()
        pct = (count / len(df_samples)) * 100
        print(f"  {param}: {count}/{len(df_samples)} ({pct:.1f}%)")

print("\n" + "="*60)
print("Analysis complete!")
print("\nFor full analysis with downloaded data, use the commands shown above.")

## References and Resources

- NMDC Portal: https://microbiomedata.org/
- NMDC API Documentation: https://api.microbiomedata.org/docs
- nmdc_api_utilities Documentation: https://microbiomedata.github.io/nmdc_api_utilities/
- Enrichment Analysis Guide: See `ENRICHMENT_ANALYSIS.md` in the repository
- GFF Utilities Guide: See `GFF_UTILITIES.md` in the repository