# ManifestLookup Demo

This notebook demonstrates how to use the `ManifestLookup` class to query gene and tissue data from a Parquet manifest file, stored either locally or on S3.

The ManifestLookup class provides methods to:
- Check if gene_id and tissue_id combinations exist
- Get file paths for specific gene/tissue combinations
- List all tissues for a given gene
- List all genes for a given tissue
- Get all unique genes and tissues in the dataset

## Setup

First, let's import the required modules and set up logging:

In [1]:
import logging
from lookup import ManifestLookup

# Configure logging to see what's happening
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)


## Initialize the ManifestLookup

We'll create a ManifestLookup instance using either a local file or an S3 path. The class handles both automatically.

In [None]:
# Choose your data source:
# Option 1: Local file
# parquet_file_path = "../data/manifest.parquet"

# Option 2: S3 file (used in this demo)
parquet_file_path = "s3://czi-dnacell-staging/alzhimers_disease/v1/manifest.parquet"

print(f"Loading data from: {parquet_file_path}")

In [None]:
# Initialize the lookup directly; call lookup.close() when finished (see the Cleanup section)
lookup = ManifestLookup(parquet_file_path)
print(f"✓ Successfully loaded manifest data from {parquet_file_path}")
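
This notebook does not show how the class tells a local file from an S3 path. One plausible approach is to inspect the URL scheme. The sketch below uses a hypothetical helper, `is_s3_path`, which is an assumption for illustration and not part of `ManifestLookup`:

```python
from urllib.parse import urlparse


def is_s3_path(path: str) -> bool:
    """Return True when the path uses the s3:// scheme (hypothetical helper)."""
    return urlparse(path).scheme == "s3"


# The two data sources offered in the cell above.
for candidate in (
    "../data/manifest.parquet",
    "s3://czi-dnacell-staging/alzhimers_disease/v1/manifest.parquet",
):
    source = "S3" if is_s3_path(candidate) else "local"
    print(f"{source:<5} {candidate}")
```

A scheme check like this needs no network access, so it can run before any credentials are configured.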

## 1. Query Specific Gene-Tissue Combinations

Let's check if specific gene-tissue combinations exist and get their file paths:

In [None]:
# Example gene and tissue IDs
example_gene = "ENSG00000003989.16"
example_tissue = "model_tissue_21"

print(f"Looking up gene: {example_gene}, tissue: {example_tissue}")
print("-" * 60)

# Check if the combination exists
exists = lookup.exists(example_gene, example_tissue)
print(f"Does combination exist? {exists}")

# Get the S3 file path if it exists
if exists:
    s3_file_path = lookup.get_s3_file_path(example_gene, example_tissue)
    print(f"S3 file path: {s3_file_path}")
    
    # Also get the local file path (downloads the file)
    local_file_path = lookup.get_file_path(example_gene, example_tissue)
    print(f"Local file path (downloaded): {local_file_path}")
else:
    print("No file path found for this combination")

## 2. Get All Records for a Specific Gene

Let's see all tissues associated with our example gene:

In [None]:
gene_records = lookup.get_records_for_gene(example_gene)
print(f"Found {len(gene_records)} records for gene {example_gene}:")
print("-" * 60)

# Show first few records
for i, record in enumerate(gene_records[:5]):
    print(f"{i+1}. {record}")
    
if len(gene_records) > 5:
    print(f"... and {len(gene_records) - 5} more records")

## 3. Get All Records for a Specific Tissue

Now let's see all genes associated with a specific tissue:

In [None]:
example_tissue_id = "tissue_21"  # Note: the class handles "model_" and "tissue_" prefixes automatically
tissue_records = lookup.get_records_for_tissue(example_tissue_id)

print(f"Found {len(tissue_records)} records for tissue {example_tissue_id}:")
print("-" * 60)

# Show first few records
for i, record in enumerate(tissue_records[:5]):
    print(f"{i+1}. {record}")
    
if len(tissue_records) > 5:
    print(f"... and {len(tissue_records) - 5} more records")
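
The cells in this notebook pass tissue identifiers as `"model_tissue_21"`, `"tissue_21"`, and the integer `21`. The sketch below shows one way such inputs could be reduced to a single canonical form. The function `normalize_tissue_id` and the choice of an integer as the canonical form are assumptions for illustration; the real class may normalize differently.

```python
import re
from typing import Union


def normalize_tissue_id(tissue: Union[int, str]) -> int:
    """Reduce 21, "21", "tissue_21" or "model_tissue_21" to the integer 21.

    Hypothetical helper: not taken from the ManifestLookup source.
    """
    if isinstance(tissue, int):
        return tissue
    # Optional "model_" and "tissue_" prefixes, followed by the numeric ID.
    match = re.fullmatch(r"(?:model_)?(?:tissue_)?(\d+)", tissue.strip())
    if match is None:
        raise ValueError(f"Unrecognized tissue identifier: {tissue!r}")
    return int(match.group(1))


for raw in ("model_tissue_21", "tissue_21", "21", 21):
    print(f"{raw!r:<20} -> {normalize_tissue_id(raw)}")
```

Normalizing once at the boundary keeps every query method working with the same key type, whichever spelling the caller used.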

## 4. Explore the Dataset

Let's get an overview of all unique genes and tissues in the dataset:

In [None]:
# Get all unique genes
unique_genes = lookup.get_unique("gene_id")
print(f"Total unique genes: {len(unique_genes)}")
print(f"First 10 genes: {unique_genes[:10]}")
print()

In [None]:
# Get all unique tissues
unique_tissues = lookup.get_unique("tissue_id")
print(f"Total unique tissues: {len(unique_tissues)}")
print(f"All tissues: {unique_tissues}")

## 5. S3 File Path vs Local File Path

The ManifestLookup class provides two methods for file access:
- `get_s3_file_path()` - Returns the S3 path recorded in the manifest, without downloading anything
- `get_file_path()` - Downloads the file from S3 and returns the local path

In [None]:
# Demonstrate the difference between S3 and local file paths
test_gene = "ENSG00000003989.16"
test_tissue = 21

print("Comparing S3 vs Local file path access:")
print("-" * 50)

# Method 1: Get S3 path only (no download)
s3_path = lookup.get_s3_file_path(test_gene, test_tissue)
if s3_path:
    print(f"S3 path: {s3_path}")
    print("✓ Fast - no download required")
else:
    print("❌ No file found for this gene-tissue combination")
    
print()

# Method 2: Get local path (downloads file)
if s3_path:  # Only attempt download if S3 path exists
    print("Downloading file to local storage...")
    local_path = lookup.get_file_path(test_gene, test_tissue)
    if local_path:
        print(f"Local path: {local_path}")
        print("✓ File downloaded and ready for local processing")
    else:
        print("❌ Download failed")
        
print()
print("Use get_s3_file_path() when you only need the S3 location")
print("Use get_file_path() when you need to process the file locally")
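
To make the difference concrete, the sketch below shows the bookkeeping a download step typically needs: splitting an S3 URI into bucket and key, and choosing a local destination. The helpers `split_s3_uri` and `local_cache_path`, the `cache` directory, and the example URI are all hypothetical; the notebook does not show how `get_file_path()` stores its files.

```python
from pathlib import Path
from typing import Tuple
from urllib.parse import urlparse


def split_s3_uri(uri: str) -> Tuple[str, str]:
    """Split "s3://bucket/some/key" into ("bucket", "some/key")."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def local_cache_path(uri: str, cache_dir: str = "cache") -> Path:
    """Mirror the bucket/key layout under a local cache directory (assumed layout)."""
    bucket, key = split_s3_uri(uri)
    return Path(cache_dir) / bucket / key


# Hypothetical URI, used only for illustration.
example_uri = "s3://example-bucket/predictions/gene_a/tissue_21.parquet"
bucket, key = split_s3_uri(example_uri)
print(f"bucket={bucket}, key={key}")
print(f"would download to: {local_cache_path(example_uri)}")
# An actual download could then pass bucket, key and the destination
# to an S3 client, for example boto3's download_file.
```

The sketch performs no network I/O, which is also why `get_s3_file_path()` is the cheaper call when only the location is needed.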

In [None]:
# Example 1: Find all genes that exist in a specific tissue
target_tissue = 21
genes_in_tissue = lookup.get_records_for_tissue(target_tissue)
gene_ids_in_tissue = [record.gene_id for record in genes_in_tissue]

print(f"Genes available in tissue {target_tissue}:")
print(f"Total: {len(gene_ids_in_tissue)}")
print(f"Sample: {gene_ids_in_tissue[:5]}")
print()

# Example 2: Batch processing - get S3 paths for multiple files
print("Batch S3 path lookup:")
print("-" * 30)
sample_genes = gene_ids_in_tissue[:3] if gene_ids_in_tissue else []
for gene in sample_genes:
    s3_path = lookup.get_s3_file_path(gene, target_tissue)
    print(f"{gene[:20]:<20} -> {s3_path}")
print()

In [None]:
# Example 3: Check multiple gene-tissue combinations with S3 paths
test_combinations = [
    ("ENSG00000003989.16", "tissue_21"),
    ("ENSG00000003989.16", "tissue_1"),
    ("NONEXISTENT_GENE", "tissue_21")
]

print("Testing multiple combinations with S3 paths:")
print("-" * 55)
for gene, tissue in test_combinations:
    exists = lookup.exists(gene, tissue)
    if exists:
        s3_path = lookup.get_s3_file_path(gene, tissue)
        status = f"✓ S3: {s3_path}"
    else:
        status = "✗ NOT FOUND"
    print(f"{gene[:20]:<20} + {tissue:<15} = {status}")
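
The loop above calls `exists()` and then `get_s3_file_path()` for each pair. If `get_s3_file_path()` returns a falsy value for missing pairs, as the `if s3_path:` check in section 5 suggests but does not confirm, a single call per pair is enough. The sketch below wraps that idea in a hypothetical helper, `resolve_s3_paths`, and uses a stub object so it runs without the real manifest:

```python
from typing import Dict, Iterable, Optional, Tuple

Pair = Tuple[str, str]


def resolve_s3_paths(lookup, pairs: Iterable[Pair]) -> Dict[Pair, Optional[str]]:
    """Map each (gene, tissue) pair to its S3 path, or None when absent."""
    results: Dict[Pair, Optional[str]] = {}
    for gene, tissue in pairs:
        # Assumption: get_s3_file_path returns a falsy value for missing pairs.
        results[(gene, tissue)] = lookup.get_s3_file_path(gene, tissue) or None
    return results


class StubLookup:
    """Hypothetical stand-in for ManifestLookup, backed by a plain dict."""

    def __init__(self, table):
        self._table = table

    def get_s3_file_path(self, gene, tissue):
        return self._table.get((gene, tissue))


stub = StubLookup({("GENE_A", "tissue_21"): "s3://example-bucket/gene_a_21.parquet"})
resolved = resolve_s3_paths(stub, [("GENE_A", "tissue_21"), ("GENE_B", "tissue_21")])
for (gene, tissue), path in resolved.items():
    print(f"{gene:<10} + {tissue:<10} = {path or 'NOT FOUND'}")
```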

## 6. Cleanup

In [None]:
# Close the connection and cleanup resources
lookup.close()
print("✓ ManifestLookup connection closed successfully")

## Best Practices

### Using Context Manager (Recommended)

For automatic cleanup, use the context manager pattern:

In [None]:
# Recommended pattern using 'with' statement
def demonstrate_context_manager(parquet_file_path):
    with ManifestLookup(parquet_file_path) as lookup:
        # All your operations here
        exists = lookup.exists("ENSG00000003989.16", "tissue_21")
        print(f"Gene-tissue combination exists: {exists}")
        
        # Get S3 path without downloading
        if exists:
            s3_path = lookup.get_s3_file_path("ENSG00000003989.16", "tissue_21")
            print(f"S3 path: {s3_path}")
        
        # Connection is automatically closed when exiting the 'with' block
    print("✓ Connection automatically closed")

# Uncomment to run:
# demonstrate_context_manager(parquet_file_path)
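
For readers unfamiliar with how a `with` statement triggers cleanup, the sketch below shows the general protocol using a hypothetical `DemoResource` class. It illustrates `__enter__` and `__exit__` in plain Python and is not the actual `ManifestLookup` implementation.

```python
class DemoResource:
    """Hypothetical resource that records whether it has been closed."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True

    def __enter__(self):
        # The value returned here is bound to the name after 'as'.
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Runs on normal exit and when the block raises.
        self.close()
        return False  # do not suppress exceptions


with DemoResource() as resource:
    print(f"Inside the block: closed={resource.closed}")
print(f"After the block:  closed={resource.closed}")

# Cleanup still happens when the block raises.
failing = DemoResource()
try:
    with failing:
        raise RuntimeError("simulated failure")
except RuntimeError:
    print(f"After an error:   closed={failing.closed}")
```

This is the main advantage over calling `close()` by hand: an exception in the middle of the block cannot skip the cleanup.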

## Summary

The `ManifestLookup` class provides:

1. **File Path Resolution**: Supports both local files and S3 paths
2. **Efficient Querying**: Uses DuckDB with indexes for fast lookups
3. **Flexible File Access**: Both S3 paths and local downloaded files
4. **Resource Management**: Proper cleanup with context managers
5. **Error Handling**: Robust error handling for missing files and invalid schemas
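
The indexed-lookup idea in point 2 can be illustrated without DuckDB. The sketch below deliberately swaps in Python's built-in `sqlite3` module as a stand-in so that it runs with no extra dependencies. The table layout, the `s3_file_path` column name, and the sample rows are assumptions, not the real manifest schema.

```python
import sqlite3

# Hypothetical sample rows: (gene_id, tissue_id, s3_file_path).
rows = [
    ("GENE_A", 21, "s3://example-bucket/gene_a_21.parquet"),
    ("GENE_A", 1, "s3://example-bucket/gene_a_1.parquet"),
    ("GENE_B", 21, "s3://example-bucket/gene_b_21.parquet"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE manifest (gene_id TEXT, tissue_id INTEGER, s3_file_path TEXT)"
)
conn.executemany("INSERT INTO manifest VALUES (?, ?, ?)", rows)
# A composite index lets the database answer pair lookups without a full scan.
conn.execute("CREATE INDEX idx_gene_tissue ON manifest (gene_id, tissue_id)")


def exists(gene_id: str, tissue_id: int) -> bool:
    """Return True when the (gene_id, tissue_id) pair is in the table."""
    row = conn.execute(
        "SELECT 1 FROM manifest WHERE gene_id = ? AND tissue_id = ? LIMIT 1",
        (gene_id, tissue_id),
    ).fetchone()
    return row is not None


def get_unique(column: str) -> list:
    """Return the sorted distinct values of an allowed column."""
    if column not in {"gene_id", "tissue_id"}:
        raise ValueError(f"Unknown column: {column!r}")
    query = f"SELECT DISTINCT {column} FROM manifest ORDER BY {column}"
    return [r[0] for r in conn.execute(query)]


print(exists("GENE_A", 21), exists("GENE_A", 99))
print(get_unique("gene_id"), get_unique("tissue_id"))
```

Checking the column name against an allow-list before interpolating it into the query guards against SQL injection, since column names cannot be passed as bound parameters.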

### Key Methods:
- `exists(gene_id, tissue_id)` - Check if combination exists
- `get_s3_file_path(gene_id, tissue_id)` - Get S3 file path directly (no download)
- `get_file_path(gene_id, tissue_id)` - Download file and get local path
- `get_records_for_gene(gene_id)` - Get all records for a gene
- `get_records_for_tissue(tissue_id)` - Get all records for a tissue
- `get_unique(column_name)` - Get all unique values for a column

### When to Use Which Method:
- **get_s3_file_path()**: When you only need the S3 location, for example to pass it to another AWS service or to record it as metadata (no download)
- **get_file_path()**: When you need to process the file locally (downloads and caches the file)