# Enformer Oracle - Updated Example

This notebook demonstrates the new features of the Enformer oracle:
- Environment isolation to avoid dependency conflicts
- Reference genome integration for accurate predictions
- ENCODE track identifier support
- Simplified prediction API

## Setup

In [None]:
import chorus
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os

## 1. Initialize Enformer with Environment Isolation

The new Enformer implementation runs in an isolated conda environment to avoid TensorFlow/PyTorch conflicts.

In [None]:
# Path to reference genome (update this to your hg38.fa location)
REFERENCE_FASTA = "/path/to/hg38.fa"

# Check if reference exists
if not os.path.exists(REFERENCE_FASTA):
    print(f"⚠️  Reference genome not found at {REFERENCE_FASTA}")
    print("Please update REFERENCE_FASTA to point to your hg38.fa file")
    print("\nWithout a reference genome, the examples fall back to sequence-based predictions")
    REFERENCE_FASTA = None

In [None]:
# Create Enformer oracle with environment isolation
oracle = chorus.create_oracle(
    'enformer',
    use_environment=True,  # Use isolated conda environment
    reference_fasta=REFERENCE_FASTA  # Optional, for genomic coordinates
)

print("✓ Enformer oracle created with environment isolation")
print(f"  Sequence length: {oracle.sequence_length:,} bp")
print(f"  Output window: 114,688 bp (896 bins × 128 bp)")
print(f"  Tracks: 5,313 human tracks")
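
The figures printed above follow from the model geometry: a 393,216 bp input, of which only the central 896 bins of 128 bp each are predicted. A minimal sketch of that arithmetic, using plain constants rather than any attribute of the `chorus` oracle (the names below are illustrative and not part of its API):

```python
# Illustrative constants; these are not read from the chorus oracle.
INPUT_LENGTH = 393_216  # bp of sequence the model consumes
BIN_SIZE = 128          # bp summarized by each output bin
NUM_BINS = 896          # number of output bins

# Width of the predicted window and the unpredicted flank on each side.
output_length = NUM_BINS * BIN_SIZE
flank = (INPUT_LENGTH - output_length) // 2

print(f"Output window: {output_length:,} bp")
print(f"Flank on each side: {flank:,} bp")
```

The flank is sequence the model reads for context but does not report predictions for, which is why a requested region should sit near the center of the input.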

In [None]:
# Load pre-trained model
print("Loading Enformer model (this may take a moment)...")
oracle.load_pretrained_model()
print("✓ Model loaded successfully!")

## 2. Basic Sequence Prediction

Predict regulatory activity from a DNA sequence.

In [None]:
# Example: Create a test sequence
test_sequence = "ACGT" * 1000  # 4,000 bp sequence

# Predict using descriptive track names
predictions = oracle.predict(
    test_sequence,
    ['DNase:K562', 'ATAC-seq:K562']
)

print(f"Input sequence: {len(test_sequence):,} bp")
print(f"\nPredictions:")
for track, values in predictions.items():
    print(f"  {track}: shape={values.shape}, mean={values.mean():.4f}")
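
Sequence models such as Enformer operate on a one-hot encoding of the DNA rather than on the string itself. How the oracle encodes its input internally is not shown in this notebook, so the following is only a plain-Python illustration of the idea. `one_hot_encode` is a hypothetical helper, and mapping `N` to all zeros is an assumed convention.

```python
# Hypothetical helper for illustration; not part of the chorus API.
ONE_HOT = {
    "A": (1, 0, 0, 0),
    "C": (0, 1, 0, 0),
    "G": (0, 0, 1, 0),
    "T": (0, 0, 0, 1),
    "N": (0, 0, 0, 0),  # assumed convention: unknown base -> all zeros
}

def one_hot_encode(sequence):
    """Return one (A, C, G, T) indicator tuple per base."""
    try:
        return [ONE_HOT[base] for base in sequence.upper()]
    except KeyError as err:
        raise ValueError(f"Unexpected character in sequence: {err}") from None

encoded = one_hot_encode("ACGTN")
print(encoded)
```

Rejecting unexpected characters early tends to give a clearer error than a failure deep inside a prediction call.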

## 3. Genomic Coordinate Prediction with Reference Genome

When a reference genome is provided, the oracle accepts genomic coordinates and extracts the sequence, including its surrounding genomic context, from the reference.

In [None]:
if REFERENCE_FASTA:
    # Define a genomic region
    chrom = "chrX"
    start = 48780505
    end = 48785229
    
    print(f"Predicting for region: {chrom}:{start:,}-{end:,}")
    print(f"Region length: {end-start:,} bp")
    
    # Use ENCODE identifier for specific experiment
    track_id = "ENCFF413AHU"  # A specific DNase:K562 experiment
    
    # Predict using genomic coordinates
    predictions = oracle.predict(
        (chrom, start, end),  # Tuple of genomic coordinates
        [track_id]
    )
    
    print(f"\n✓ Predictions complete!")
    print(f"  Track: {track_id} (DNase accessibility in K562 cells)")
    print(f"  Output shape: {predictions[track_id].shape}")
    print(f"  Mean signal: {predictions[track_id].mean():.4f}")
    print(f"  Max signal: {predictions[track_id].max():.4f}")
else:
    print("Skipping coordinate-based prediction (requires reference genome)")
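
To give the model real flanking sequence, the requested region has to be embedded in a full-length input window. The notebook does not show how the oracle does this. One plausible approach, consistent with the centering that `create_bedgraph` assumes in section 5, is to center a 393,216 bp window on the region midpoint. `centered_window` is a hypothetical helper, and clamping at chromosome ends is deliberately left out of this sketch.

```python
def centered_window(start, end, window=393_216):
    """Return (window_start, window_end) for a fixed-size window centered on [start, end)."""
    center = (start + end) // 2
    window_start = center - window // 2
    return window_start, window_start + window

# Same region as the coordinate-based example above.
w_start, w_end = centered_window(48_780_505, 48_785_229)
print(f"chrX:{w_start:,}-{w_end:,} ({w_end - w_start:,} bp)")
```

A real implementation would also need to handle windows that run past either end of the chromosome.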

## 4. Using ENCODE Identifiers vs Descriptive Names

The oracle accepts either an ENCODE identifier, which pins a single experiment, or a descriptive track name, which may match several tracks.

In [None]:
# Example sequence
sequence = "ACGT" * 500

# Method 1: Using ENCODE identifier (specific experiment)
predictions_encode = oracle.predict(
    sequence,
    ['ENCFF413AHU']  # Specific DNase:K562 experiment
)

# Method 2: Using descriptive name (may match multiple tracks)
predictions_desc = oracle.predict(
    sequence,
    ['DNase:K562']  # General DNase in K562 cells
)

print("Using ENCODE identifier (ENCFF413AHU):")
print(f"  Mean signal: {predictions_encode['ENCFF413AHU'].mean():.4f}")

print("\nUsing descriptive name (DNase:K562):")
print(f"  Mean signal: {predictions_desc['DNase:K562'].mean():.4f}")
print("\nNote: These may differ as there are multiple DNase:K562 tracks")
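
The two styles of track specifier are easy to tell apart by shape: ENCODE file accessions such as `ENCFF413AHU` consist of `ENCFF`, three digits, and three uppercase letters, while the descriptive names used here follow an `assay:cell_type` pattern. The sketch below classifies a specifier on that basis. `describe_track` is a hypothetical helper, and the way the oracle actually resolves names to tracks is not shown in this notebook.

```python
import re

# ENCODE file accessions look like ENCFF413AHU: "ENCFF", three digits, three letters.
ENCODE_FILE_RE = re.compile(r"^ENCFF\d{3}[A-Z]{3}$")

def describe_track(track):
    """Classify a track string as an ENCODE accession or an 'assay:cell_type' name."""
    if ENCODE_FILE_RE.match(track):
        return {"kind": "encode_id", "accession": track}
    assay, _, cell_type = track.partition(":")
    return {"kind": "descriptive", "assay": assay, "cell_type": cell_type}

for name in ["ENCFF413AHU", "DNase:K562", "ChIP-seq_H3K27ac:K562"]:
    print(name, "->", describe_track(name))
```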

## 5. Creating BedGraph Files for Visualization

Convert predictions to BedGraph format for genome browser visualization.

In [None]:
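def create_bedgraph(predictions, chrom, start, end, track_name, output_file):
    """Create a BedGraph file from Enformer predictions.

    Assumes `predictions` is a 1-D array of 896 bins (128 bp each) from a
    393,216 bp input window centered on the midpoint of [start, end).
    """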
    
    # Calculate coordinate mapping
    output_length = 896 * 128  # 114,688 bp
    input_length = 393216
    output_offset = (input_length - output_length) // 2
    
    # Map region to output bins
    region_center = (start + end) // 2
    input_center = input_length // 2
    
    region_start_in_input = input_center - (region_center - start)
    region_end_in_input = input_center + (end - region_center)
    
    region_start_in_output = region_start_in_input - output_offset
    region_end_in_output = region_end_in_input - output_offset
    
    start_bin = max(0, region_start_in_output // 128)
    end_bin = min(896, (region_end_in_output + 127) // 128)
    
    # Write BedGraph
    with open(output_file, 'w') as f:
        # Header
        f.write(f'track type=bedGraph name="{track_name}" ')
        f.write(f'description="Enformer prediction" ')
        f.write(f'visibility=full autoScale=on color=255,0,0\n')
        
        # Data
        for i in range(start_bin, end_bin):
            bin_start_in_output = i * 128
            bin_pos_in_region = bin_start_in_output - region_start_in_output
            bin_start_genomic = start + bin_pos_in_region
            bin_end_genomic = bin_start_genomic + 128
            
            # Clip to region bounds
            bin_start_genomic = max(start, bin_start_genomic)
            bin_end_genomic = min(end, bin_end_genomic)
            
            if bin_start_genomic < bin_end_genomic:
                value = float(predictions[i])
                f.write(f"{chrom}\t{bin_start_genomic}\t{bin_end_genomic}\t{value:.6f}\n")
    
    return output_file
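
The index arithmetic in `create_bedgraph` reduces to a simpler statement: if the 896 predicted bins are centered on the region midpoint, bin `i` covers 128 bp starting at `center - 57,344 + 128 * i`. The sketch below recomputes the overlapping bin range that way as a cross-check. `overlapping_bins` is a hypothetical helper written for this check and is not part of `chorus`.

```python
BIN_SIZE = 128
NUM_BINS = 896

def overlapping_bins(start, end):
    """Return (first_bin, last_bin_exclusive, output_start) for region [start, end)."""
    center = (start + end) // 2
    # Genomic coordinate where output bin 0 begins.
    output_start = center - (NUM_BINS * BIN_SIZE) // 2
    first_bin = max(0, (start - output_start) // BIN_SIZE)
    # Ceiling division so a partially covered last bin is included.
    last_bin = min(NUM_BINS, -(-(end - output_start) // BIN_SIZE))
    return first_bin, last_bin, output_start

first_bin, last_bin, output_start = overlapping_bins(48_780_505, 48_785_229)
print(first_bin, last_bin, last_bin - first_bin)  # 429 467 38
```

For the chrX example region, 38 bins overlap the 4,724 bp interval, and the first and last of them are clipped to the region bounds when written out.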

In [None]:
# Create a BedGraph file from the coordinate-based predictions in section 3
# (track_id is only defined once that cell has run)
if REFERENCE_FASTA and 'track_id' in locals():
    bedgraph_file = create_bedgraph(
        predictions[track_id],
        chrom, start, end,
        f"{track_id}_Enformer",
        "enformer_prediction.bedgraph"
    )
    
    print(f"✓ Created BedGraph file: {bedgraph_file}")
    print("\nTo visualize:")
    print("1. Open UCSC Genome Browser or IGV")
    print("2. Load hg38 genome")
    print(f"3. Navigate to {chrom}:{start}-{end}")
    print(f"4. Upload {bedgraph_file} as a custom track")
else:
    print("Create predictions with genomic coordinates first to generate BedGraph")
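
Before uploading a file to a genome browser, it can be worth confirming that it parses and that its intervals are non-empty and non-overlapping. The sketch below does this for BedGraph text with the standard library only. `read_bedgraph` and `check_intervals` are hypothetical helpers, the check assumes a single chromosome in file order, and the example values are synthetic rather than real predictions.

```python
import io

def read_bedgraph(handle):
    """Parse BedGraph data lines into (chrom, start, end, value) tuples."""
    records = []
    for line in handle:
        line = line.strip()
        if not line or line.startswith(("track", "browser", "#")):
            continue  # skip header and comment lines
        chrom, start, end, value = line.split()[:4]
        records.append((chrom, int(start), int(end), float(value)))
    return records

def check_intervals(records):
    """Return True if intervals are non-empty and do not overlap in file order."""
    previous_end = None
    for _, start, end, _ in records:
        if start >= end or (previous_end is not None and start < previous_end):
            return False
        previous_end = end
    return True

# Synthetic example in the same layout that create_bedgraph writes.
example = io.StringIO(
    'track type=bedGraph name="demo"\n'
    "chrX\t48780505\t48780563\t0.125000\n"
    "chrX\t48780563\t48780691\t0.250000\n"
)
records = read_bedgraph(example)
print(len(records), check_intervals(records))
```

To check a file on disk, pass an open file handle in place of the `io.StringIO` object.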

## 6. Analyzing Multiple Tracks

Predict multiple regulatory signals simultaneously.

In [None]:
# Select multiple tracks of interest
tracks = [
    'DNase:K562',
    'ATAC-seq:K562',
    'ChIP-seq_H3K4me3:K562',
    'ChIP-seq_H3K27ac:K562',
    'CAGE:K562'
]

# Use a test sequence
test_seq = "ACGT" * 2000

# Predict all tracks
print("Predicting multiple tracks...")
multi_predictions = oracle.predict(test_seq, tracks)

# Analyze results
print("\nTrack statistics:")
for track in tracks:
    values = multi_predictions[track]
    print(f"\n{track}:")
    print(f"  Shape: {values.shape}")
    print(f"  Mean: {values.mean():.4f}")
    print(f"  Max: {values.max():.4f}")
    print(f"  Std: {values.std():.4f}")
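
The per-track statistics printed above can also be collected into rows, which makes it easier to sort or compare tracks. The sketch below uses only the standard library and synthetic placeholder values. `summarize` is a hypothetical helper that accepts any mapping from track name to a sequence of numbers.

```python
import statistics

def summarize(track_values):
    """Return one summary row per track, sorted by mean signal (highest first)."""
    rows = []
    for track, values in track_values.items():
        values = [float(v) for v in values]
        rows.append({
            "track": track,
            "mean": statistics.fmean(values),
            "max": max(values),
            "std": statistics.pstdev(values),  # population std, like NumPy's default
        })
    return sorted(rows, key=lambda row: row["mean"], reverse=True)

# Synthetic placeholder values, not real Enformer output.
demo = {"DNase:K562": [0.0, 1.0, 2.0], "CAGE:K562": [4.0, 4.0, 4.0]}
for row in summarize(demo):
    print(f"{row['track']:<12} mean={row['mean']:.4f} max={row['max']:.4f} std={row['std']:.4f}")
```

Because tracks from different assays are on different scales, a ranking by mean is mainly meaningful among tracks of the same assay type.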

## 7. Visualizing Predictions

Create a simple visualization of the predictions.

In [None]:
# Visualize multiple tracks
if 'multi_predictions' in locals():
    fig, axes = plt.subplots(len(tracks), 1, figsize=(12, 2*len(tracks)), sharex=True)
    
    for track, ax in zip(tracks, axes):
        values = multi_predictions[track]
        positions = np.arange(len(values)) * 128 / 1000  # Convert to kb
        
        ax.plot(positions, values, linewidth=1)
        ax.fill_between(positions, values, alpha=0.3)
        ax.set_ylabel(track.split(':')[0], fontsize=10)
        ax.set_xlim(0, positions[-1])
        
        # Add max value annotation
        max_idx = np.argmax(values)
        ax.annotate(f'Max: {values[max_idx]:.2f}',
                   xy=(positions[max_idx], values[max_idx]),
                   xytext=(5, 5), textcoords='offset points',
                   fontsize=8, alpha=0.7)
    
    axes[-1].set_xlabel('Position (kb)', fontsize=12)
    plt.suptitle('Enformer Predictions Across Multiple Tracks', fontsize=14)
    plt.tight_layout()
    plt.show()

## 8. Example: Predicting Effects of a Promoter Variant

Compare predictions for reference and alternate alleles.

In [None]:
# Example promoter sequence with a variant
promoter_ref = """
GCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGC
CCAATCGGCGGCGGCGGCGGCGGCGGCGGCGGCGGCGGCGGCGGCGGCGGCGGCGGCGGC
TATAAAGGCGGCGGCGGCGGCGGCGGCGGCGGCGGCGGCGGCGGCGGCGGCGGCGGCGGC
CAGCTGCAGCTGCAGCTGCAGCTGCAGCTGCAGCTGCAGCTGCAGCTGCAGCTGCAGCTG
""".replace('\n', '')

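# Create alternate with a TATA box mutation (TATAAA -> TATGAA)
promoter_alt = promoter_ref.replace('TATAAA', 'TATGAA', 1)

# Show the TATA box with 5 bp of flanking sequence on each side
tata_start = promoter_ref.index('TATAAA')
print(f"Reference: ...{promoter_ref[tata_start - 5:tata_start + 11]}...")
print(f"Alternate: ...{promoter_alt[tata_start - 5:tata_start + 11]}...")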

# Predict for both sequences
pred_ref = oracle.predict(promoter_ref, ['CAGE:K562', 'ChIP-seq_TBP:K562'])
pred_alt = oracle.predict(promoter_alt, ['CAGE:K562', 'ChIP-seq_TBP:K562'])

# Compare predictions
print("\nPrediction comparison:")
for track in ['CAGE:K562', 'ChIP-seq_TBP:K562']:
    ref_mean = pred_ref[track].mean()
    alt_mean = pred_alt[track].mean()
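    # Percent change; undefined (NaN) when the reference mean is zero
    change = (alt_mean - ref_mean) / ref_mean * 100 if ref_mean != 0 else float('nan')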
    
    print(f"\n{track}:")
    print(f"  Reference: {ref_mean:.4f}")
    print(f"  Alternate: {alt_mean:.4f}")
    print(f"  Change: {change:+.1f}%")
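
A percentage change becomes unstable when the reference signal is close to zero. A commonly used alternative, which neither this notebook nor the oracle prescribes, is a log2 ratio with a small pseudocount. In the sketch below, `log2_fold_change` is a hypothetical helper, the default `pseudocount` is an arbitrary assumption, and the inputs are synthetic.

```python
import math

def log2_fold_change(ref_mean, alt_mean, pseudocount=1e-6):
    """Return log2((alt + p) / (ref + p)); negative means the alternate allele lowers signal."""
    return math.log2((alt_mean + pseudocount) / (ref_mean + pseudocount))

# Synthetic placeholder means, not real predictions.
print(f"Halved signal: {log2_fold_change(2.0, 1.0):+.3f}")
print(f"Both zero:     {log2_fold_change(0.0, 0.0):+.3f}")
```

The pseudocount should be small relative to typical signal values for the track, since it dominates the ratio when both means are near zero.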

## Summary

This notebook demonstrated the key features of the updated Enformer oracle:

1. **Environment Isolation**: Runs in an isolated conda environment to avoid dependency conflicts
2. **Reference Genome Support**: Automatically extracts sequences with proper genomic context
3. **ENCODE Identifiers**: Use specific experiment IDs for reproducibility
4. **Simplified API**: Single `predict()` method for both sequences and coordinates
5. **BedGraph Export**: Easy visualization in genome browsers
6. **Multi-track Analysis**: Predict multiple regulatory signals simultaneously

### Key Improvements:
- Coordinate inputs are no longer padded with N's; the surrounding context comes from the actual genomic sequence
- Tracks can be selected from the 5,313 available by ENCODE identifier or descriptive name
- Cleaner API that handles both sequence and coordinate inputs
- Isolated environments prevent TensorFlow/PyTorch conflicts