# Indexed Random Access: FAI and TBI

This notebook demonstrates how to use **FAI (FASTA Index)** and **TBI (Tabix Index)** for efficient random access to genomic data.

## What You'll Learn

1. **FAI (FASTA Index)**
   - Build and load FASTA indices
   - Fetch entire sequences or specific regions
   - Use cases: gene extraction, reference lookups

2. **TBI (Tabix Index)**
   - Load tabix indices for VCF/BED/GFF3 files
   - Query specific genomic regions
   - Use cases: targeted variant lookup, peak overlaps

3. **Performance Benefits**
   - Typically 100-1000× faster than sequential scanning when querying small regions of large files
   - Constant memory usage
   - Essential for large-scale genomics

In [None]:
import biometal
import tempfile
import os
import struct

## Part 1: FAI - FASTA Indexing

FAI enables O(1) random access to any sequence in a FASTA file without loading the entire file into memory.
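
To make the index less of a black box, here is a minimal pure-Python sketch of the five values a `.fai` record stores per sequence (name, length, byte offset of the first base, bases per line, bytes per line). The helper `build_fai_records` is a hypothetical name used only for illustration; it is not part of biometal and makes no claim about how `FaiIndex.build()` is implemented.

```python
import io

def build_fai_records(handle):
    """Scan a binary FASTA stream once and return one tuple per sequence:
    (name, length, offset, line_bases, line_width)."""
    records = []
    name = None
    length = offset = line_bases = line_width = 0
    pos = 0  # byte position where the current line starts
    for raw in handle:
        if raw.startswith(b">"):
            if name is not None:
                records.append((name, length, offset, line_bases, line_width))
            name = raw[1:].split()[0].decode()  # name ends at first whitespace
            length = line_bases = line_width = 0
            offset = pos + len(raw)  # first base follows the header line
        elif name is not None and raw.strip():
            bases = len(raw.rstrip(b"\r\n"))
            if line_bases == 0:
                # The first sequence line defines the line layout.
                line_bases, line_width = bases, len(raw)
            length += bases
        pos += len(raw)
    if name is not None:
        records.append((name, length, offset, line_bases, line_width))
    return records

demo = b">chr1 demo\nACGTACGT\nACGT\n>chrM\nATCG\n"
for record in build_fai_records(io.BytesIO(demo)):
    print(*record, sep="\t")
```

Because only these few integers are kept per sequence, the index stays small no matter how long the sequences are.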

In [None]:
# Create a sample FASTA file
fasta_data = """>chr1 Human chromosome 1
ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGT
TGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCA
>chr2 Human chromosome 2
GGGGCCCCAAAATTTTGGGGCCCCAAAATTTTGGGGCCCCAAAATTTTGGGGCCCCAAAA
TTTTGGGGCCCCAAAATTTTGGGGCCCCAAAATTTTGGGGCCCCAAAATTTTGGGGCCCC
>chrM Mitochondrial DNA
ATCGATCGATCGATCGATCGATCG
"""

# Write to temporary file
with tempfile.NamedTemporaryFile(mode='w', suffix='.fa', delete=False) as f:
    fasta_path = f.name
    f.write(fasta_data)

print(f"Created FASTA file: {fasta_path}")

### Step 1: Build the Index

In [None]:
# Build FAI index from FASTA file
fai_index = biometal.FaiIndex.build(fasta_path)
print(f"Built index with {len(fai_index)} sequences")
print(f"Sequence names: {fai_index.sequence_names()}")

### Step 2: Save and Load Index

In [None]:
# Save index to file
fai_path = fasta_path + ".fai"
fai_index.write(fai_path)
print(f"Saved index to: {fai_path}")

# Load index from file (fast: <1ms)
loaded_index = biometal.FaiIndex.from_path(fai_path)
print(f"Loaded index: {len(loaded_index)} sequences")

### Step 3: Query Sequence Metadata

In [None]:
# Get info for each sequence
for seq_name in loaded_index.sequence_names():
    info = loaded_index.get_info(seq_name)
    print(f"{seq_name}:")
    print(f"  Length: {info['length']} bp")
    print(f"  Offset: {info['offset']} bytes")
    print(f"  Line bases: {info['line_bases']}")
    print(f"  Line width: {info['line_width']}")

### Step 4: Fetch Entire Sequences

In [None]:
# Fetch entire sequences
chr1 = loaded_index.fetch("chr1", fasta_path)
print(f"chr1: {len(chr1)} bp")
print(f"First 60 bp: {chr1[:60]}")

chrM = loaded_index.fetch("chrM", fasta_path)
print(f"\nchrM: {len(chrM)} bp")
print(f"Full sequence: {chrM}")

### Step 5: Fetch Specific Regions

This is where FAI is most useful: extracting small regions without reading the entire sequence.
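
The arithmetic behind a region fetch can be sketched with the `offset`, `line_bases`, and `line_width` fields reported in Step 3. The function `read_region` below is a hypothetical stand-alone helper, not the biometal implementation, and it assumes the caller has already checked that the range lies inside the sequence.

```python
import io

def read_region(handle, offset, line_bases, line_width, start, end):
    """Return bases [start, end) of one sequence using its .fai fields."""
    if not 0 <= start <= end:
        raise ValueError("expected 0 <= start <= end")

    def byte_pos(base):
        # Each full line before `base` costs line_width bytes (bases + newline);
        # the remainder is the column inside the current line.
        return offset + (base // line_bases) * line_width + base % line_bases

    handle.seek(byte_pos(start))
    raw = handle.read(byte_pos(end) - byte_pos(start))
    # The byte range may span line breaks, so strip them out.
    return raw.replace(b"\n", b"").replace(b"\r", b"").decode()

fasta = io.BytesIO(b">chr1 demo\nACGTACGT\nTTTTCCCC\nGG\n")
print(read_region(fasta, offset=11, line_bases=8, line_width=9, start=6, end=10))
```

One seek and one short read are enough, which is why the cost depends on the size of the region rather than the size of the file.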

In [None]:
# Fetch region [0, 20) from chr1
region1 = loaded_index.fetch_region("chr1", 0, 20, fasta_path)
print(f"chr1:0-20 = {region1}")

# Fetch region [50, 70) - crosses the line boundary at base 60
region2 = loaded_index.fetch_region("chr1", 50, 70, fasta_path)
print(f"chr1:50-70 = {region2}")

# Fetch middle of chr2
region3 = loaded_index.fetch_region("chr2", 30, 50, fasta_path)
print(f"chr2:30-50 = {region3}")

### Use Case: Extract Gene Sequence and Calculate GC Content

In [None]:
# Simulate extracting a gene at chr1:25-55
gene_seq = loaded_index.fetch_region("chr1", 25, 55, fasta_path)
print(f"Gene sequence: {gene_seq}")
print(f"Length: {len(gene_seq)} bp")

# Calculate GC content
gc_count = sum(1 for base in gene_seq if base in 'GC')
gc_percent = (gc_count / len(gene_seq)) * 100
print(f"GC content: {gc_percent:.1f}%")

## Part 2: TBI - Tabix Indexing

TBI enables efficient region queries on sorted, BGZF-compressed tab-delimited files (VCF, BED, GFF3).
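
Tabix narrows a query by mapping the region to a small set of bins, using the hierarchical binning scheme described in the SAM and tabix specifications (one 512 Mb bin at the top, then five finer levels down to 16 kb). The sketch below follows that published scheme; `reg2bins` is written here for illustration and is not a biometal function.

```python
def reg2bins(beg, end):
    """List every bin that may overlap the 0-based half-open range [beg, end)."""
    if not 0 <= beg < end:
        raise ValueError("expected 0 <= beg < end")
    end -= 1  # switch to an inclusive last position
    bins = [0]  # bin 0 spans the whole 512 Mb coordinate space
    # (shift, first bin id) per level: 64 Mb, 8 Mb, 1 Mb, 128 kb, 16 kb
    for shift, first in ((26, 1), (23, 9), (20, 73), (17, 585), (14, 4681)):
        bins.extend(range(first + (beg >> shift), first + (end >> shift) + 1))
    return bins

print(reg2bins(0, 16_384))
print(reg2bins(0, 100_000))
```

Only the chunks stored under the returned bins need to be considered for a query. Bin 4681 is the first 16 kb bin, which is why it shows up in the synthetic index used in this notebook.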

In [None]:
def create_vcf_tbi():
    """Create a minimal VCF TBI index for demonstration"""
    data = bytearray()
    
    # Magic "TBI\1"
    data.extend(b"TBI\x01")
    
    # n_ref = 2 (chr1, chr2)
    data.extend(struct.pack('<i', 2))
    
    # format = 2 (VCF)
    data.extend(struct.pack('<i', 2))
    
    # Column numbers are 1-based in the TBI format: col_seq = 1, col_beg = 2, col_end = 0
    data.extend(struct.pack('<i', 1))  # CHROM
    data.extend(struct.pack('<i', 2))  # POS
    data.extend(struct.pack('<i', 0))  # END (0 = no separate end column for VCF)
    
    # meta = '#', skip = 0
    data.extend(struct.pack('<i', ord('#')))
    data.extend(struct.pack('<i', 0))
    
    # l_nm = 10 ("chr1\0chr2\0")
    data.extend(struct.pack('<i', 10))
    data.extend(b"chr1\0chr2\0")
    
    # Index for chr1: 2 bins
    data.extend(struct.pack('<i', 2))  # n_bin
    
    # Bin 0 (entire chromosome)
    data.extend(struct.pack('<I', 0))  # bin_id
    data.extend(struct.pack('<i', 1))  # n_chunk
    data.extend(struct.pack('<Q', 0x100000))  # chunk start
    data.extend(struct.pack('<Q', 0x200000))  # chunk end
    
    # Bin 4681 (first 16 kb bin: positions 0-16,383)
    data.extend(struct.pack('<I', 4681))
    data.extend(struct.pack('<i', 1))
    data.extend(struct.pack('<Q', 0x150000))
    data.extend(struct.pack('<Q', 0x180000))
    
    # Linear index: 3 intervals
    data.extend(struct.pack('<i', 3))
    data.extend(struct.pack('<Q', 0x100000))
    data.extend(struct.pack('<Q', 0x120000))
    data.extend(struct.pack('<Q', 0x150000))
    
    # Index for chr2: 1 bin
    data.extend(struct.pack('<i', 1))
    data.extend(struct.pack('<I', 0))
    data.extend(struct.pack('<i', 1))
    data.extend(struct.pack('<Q', 0x300000))
    data.extend(struct.pack('<Q', 0x400000))
    
    # Linear index
    data.extend(struct.pack('<i', 2))
    data.extend(struct.pack('<Q', 0x300000))
    data.extend(struct.pack('<Q', 0x350000))
    
    return bytes(data)

# Create TBI file
tbi_data = create_vcf_tbi()
with tempfile.NamedTemporaryFile(mode='wb', suffix='.vcf.gz.tbi', delete=False) as f:
    tbi_path = f.name
    f.write(tbi_data)

print(f"Created TBI file: {tbi_path}")
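
As a cross-check on the layout written above, the fixed-size header can be decoded with a single `struct.unpack_from` call. This is a stand-alone sketch: `parse_tbi_header` is a hypothetical helper, and the sample values (columns 1, 2, 0) are the 1-based settings htslib uses for VCF. Note that index files produced by the `tabix` tool are BGZF-compressed, so they would need `gzip.decompress()` before this parser could read them.

```python
import struct

def parse_tbi_header(data):
    """Decode the fixed TBI header and the reference-name block."""
    magic, n_ref, fmt, col_seq, col_beg, col_end, meta, skip, l_nm = (
        struct.unpack_from("<4s8i", data, 0)
    )
    if magic != b"TBI\x01":
        raise ValueError("not a TBI index")
    start = struct.calcsize("<4s8i")  # 4-byte magic + eight int32 fields
    # Names are NUL-terminated and concatenated; drop the empty tail.
    names = data[start:start + l_nm].split(b"\x00")[:-1]
    return {
        "n_ref": n_ref, "format": fmt,
        "col_seq": col_seq, "col_beg": col_beg, "col_end": col_end,
        "meta": chr(meta), "skip": skip,
        "names": [n.decode() for n in names],
    }

header = struct.pack("<4s8i", b"TBI\x01", 2, 2, 1, 2, 0, ord("#"), 0, 10)
parsed = parse_tbi_header(header + b"chr1\x00chr2\x00")
print(parsed)
```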

### Step 1: Load TBI Index

In [None]:
# Load TBI index
tbi_index = biometal.TbiIndex.from_path(tbi_path)
print("Loaded TBI index")
print(f"Format: {tbi_index.format()}")
print(f"References: {tbi_index.references()}")
print(f"Number of references: {len(tbi_index)}")

### Step 2: Query Index Metadata

In [None]:
# Get metadata
print("Index metadata:")
print(f"  CHROM column: {tbi_index.col_seq()}")
print(f"  POS column: {tbi_index.col_beg()}")
print(f"  END column: {tbi_index.col_end()}")
print(f"  Comment char: '{tbi_index.meta_char()}'")
print(f"  Skip lines: {tbi_index.skip_lines()}")

# Check reference info
for ref_name in tbi_index.references():
    info = tbi_index.get_info(ref_name)
    print(f"\n{ref_name}:")
    print(f"  Bins: {info['n_bins']}")
    print(f"  Intervals: {info['n_intervals']}")

### Step 3: Query Genomic Regions

In [None]:
# Query chr1:0-100000
print("Query: chr1:0-100000")
chunks = tbi_index.query("chr1", 0, 100000)
print(f"Found {len(chunks)} chunks:")
for i, (start, end) in enumerate(chunks):
    print(f"  Chunk {i}: 0x{start:016x} - 0x{end:016x}")
    print(f"           (compressed: 0x{start >> 16:x}, uncompressed: 0x{start & 0xFFFF:x})")

# Query chr2:0-50000
print("\nQuery: chr2:0-50000")
chunks = tbi_index.query("chr2", 0, 50000)
print(f"Found {len(chunks)} chunks:")
for i, (start, end) in enumerate(chunks):
    print(f"  Chunk {i}: 0x{start:016x} - 0x{end:016x}")
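
The chunk boundaries returned by a query are BGZF virtual offsets: the upper 48 bits give the byte position of a compressed block in the file, and the lower 16 bits give the position inside that block once it is decompressed. A small sketch of packing and unpacking them (both helper names are hypothetical, chosen for this example):

```python
def split_virtual_offset(voffset):
    """Split a BGZF virtual offset into (block_start, within_block)."""
    return voffset >> 16, voffset & 0xFFFF

def make_virtual_offset(block_start, within_block):
    """Pack a compressed-file offset and an in-block offset into one integer."""
    if not 0 <= within_block < 1 << 16:
        raise ValueError("within_block must fit in 16 bits")
    return (block_start << 16) | within_block

voffset = make_virtual_offset(block_start=4096, within_block=123)
print(hex(voffset), split_virtual_offset(voffset))
```

For the synthetic chunk start `0x100000` used in this notebook, the block starts at byte `0x10` and the in-block offset is 0.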

### Step 4: Check Reference Existence

In [None]:
# Check if references exist
print(f"chr1 exists: {tbi_index.contains('chr1')}")
print(f"chr2 exists: {tbi_index.contains('chr2')}")
print(f"chr99 exists: {tbi_index.contains('chr99')}")

### Step 5: Error Handling

In [None]:
# Test error handling
try:
    tbi_index.query("chr99", 0, 100000)
    print("ERROR: Should have raised exception")
except ValueError as e:
    print(f"✓ Correctly rejected chr99: {e}")

try:
    tbi_index.query("chr1", 100000, 50000)
    print("ERROR: Should have raised exception")
except ValueError as e:
    print(f"✓ Correctly rejected invalid range: {e}")

## Use Case: Targeted Variant Analysis

Practical workflow for extracting variants in a specific gene region.

In [None]:
# Scenario: Find variants in the BRCA2 gene region
gene_name = "BRCA2"
chromosome = "chr1"  # Demo only: BRCA2 is on chr13, but the synthetic index has just chr1 and chr2
gene_start = 32_889_000
gene_end = 32_974_000

print(f"Analyzing {gene_name} region: {chromosome}:{gene_start:,}-{gene_end:,}")
print()

# Step 1: Query TBI index
print("Step 1: Query TBI index for relevant file chunks")
chunks = tbi_index.query(chromosome, gene_start, gene_end)
print(f"→ Found {len(chunks)} chunks to read")
print()

# Step 2: Show file offsets
print("Step 2: File offsets to seek to")
for i, (start, end) in enumerate(chunks):
    compressed_offset = start >> 16
    uncompressed_offset = start & 0xFFFF
    print(f"  Chunk {i}:")
    print(f"    Virtual offset: 0x{start:016x}")
    print(f"    Seek to byte: {compressed_offset}")
    print(f"    Uncompressed offset: {uncompressed_offset}")
print()

# Step 3: Workflow explanation
print("Step 3: Complete workflow")
print("1. Open BGZF-compressed VCF file")
print("2. Seek to first chunk's compressed offset")
print("3. Stream VCF records using biometal.VcfStream")
print(f"4. Filter records where {gene_start} <= POS < {gene_end}")
print("5. Process variants (parse genotypes, calculate AF, etc.)")
print()

print("Benefits:")
print("  • Skip reading entire VCF file")
print("  • Only decompress relevant BGZF blocks")
print("  • 100-1000× faster than full scan")
print("  • Constant memory usage (~5 MB)")
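
Step 4 of the workflow above (filtering by position) can be sketched in plain Python. The index only narrows the search to a few chunks, so records just outside the region may still be read and have to be discarded. The function name and the VCF records below are made up for illustration; in a real pipeline the lines would come from the decompressed chunks rather than a list.

```python
def filter_vcf_lines(lines, chrom, start, end):
    """Yield (chrom, pos, ref, alt) for records with start <= POS < end."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header/meta lines
        fields = line.rstrip("\n").split("\t")
        pos = int(fields[1])
        if fields[0] == chrom and start <= pos < end:
            yield fields[0], pos, fields[3], fields[4]

# Hypothetical records placed around the region boundaries
vcf_lines = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "chr1\t32888999\t.\tA\tG\t50\tPASS\t.\n",
    "chr1\t32889000\t.\tC\tT\t60\tPASS\t.\n",
    "chr1\t32973999\t.\tG\tA\t70\tPASS\t.\n",
    "chr1\t32974000\t.\tT\tC\t80\tPASS\t.\n",
]
hits = list(filter_vcf_lines(vcf_lines, "chr1", 32_889_000, 32_974_000))
print(hits)
```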

## Performance Characteristics

### FAI (FASTA Index)

| Operation | Complexity | Typical Time |
|-----------|------------|-------------|
| Load index | O(n) sequences | <1 ms |
| Sequence lookup | O(1) hash | <1 µs |
| Region fetch | O(1) seek + O(m) read | 1-10 ms |
| Memory | O(n) sequences | ~200 bytes/seq |

### TBI (Tabix Index)

| Operation | Complexity | Typical Time |
|-----------|------------|-------------|
| Load index | O(n) file size | 1-10 ms |
| Region query | O(log n) bins | <1 ms |
| Chunk count | Variable | 1-10 chunks |
| Memory | O(n) references | 1-10 MB |
| Speedup vs scan | Depends on region | 100-1000× |

### When to Use Indexed Access

✅ **Use FAI/TBI when:**
- Querying specific regions (<10% of file)
- Random access patterns
- Memory-constrained environments
- Interactive analysis

❌ **Use streaming when:**
- Processing entire file
- Sequential access
- Simple filtering


## Cleanup

In [None]:
# Clean up temporary files
import os
os.unlink(fasta_path)
os.unlink(fai_path)
os.unlink(tbi_path)
print("Cleaned up temporary files")

## Summary

In this notebook, you learned:

1. **FAI (FASTA Index)**
   - Build indices with `FaiIndex.build()`
   - Fetch sequences with `fetch()` and regions with `fetch_region()`
   - O(1) access to any sequence or region

2. **TBI (Tabix Index)**
   - Load indices with `TbiIndex.from_path()`
   - Query regions with `query()` to get file chunks
   - O(log n) region queries on compressed files

3. **Performance**
   - Typically 100-1000× faster than sequential scanning for small-region queries
   - Constant memory usage
   - Essential for large-scale genomics

4. **Use Cases**
   - Gene extraction and analysis
   - Targeted variant analysis
   - Peak overlap analysis
   - Region-specific queries

## Next Steps

- Try with real genomic data
- Combine with VCF/BED streaming parsers
- Batch process multiple regions
- Build production pipelines