# Biomni NIBR Quick Start

This notebook demonstrates how to use Biomni for biomedical AI research with the NIBR data lake.

## 1. Setup and Verification

In [None]:
import os
import pandas as pd
import sys

# Add Biomni to path
sys.path.insert(0, '/app')

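# Check that the data lake is mounted before listing it
data_path = '/biomni_data/data_lake'
if not os.path.isdir(data_path):
    raise FileNotFoundError(f"Data lake not mounted at {data_path}")
files = sorted(os.listdir(data_path))  # sorted so the sample below is stable
print(f"✅ Data lake mounted with {len(files)} files")
print("📊 Total size: ~14GB")
print("\nSample files:")
for f in files[:5]:
    size = os.path.getsize(os.path.join(data_path, f)) / (1024 * 1024)  # MB
    print(f"  • {f}: {size:.1f} MB")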
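
The total size printed above is a fixed estimate. As a minimal sketch, the helper below measures a directory tree instead; `directory_size_bytes` is a hypothetical name introduced here and is not part of Biomni.

```python
import os

def directory_size_bytes(path):
    """Sum the sizes of all regular files under `path`, including subdirectories."""
    total = 0
    for root, _dirs, names in os.walk(path):
        for name in names:
            full = os.path.join(root, name)
            # Skip broken symlinks and anything that is not a regular file
            if os.path.isfile(full):
                total += os.path.getsize(full)
    return total

data_lake = '/biomni_data/data_lake'
if os.path.isdir(data_lake):
    print(f"Total size: {directory_size_bytes(data_lake) / 1024**3:.2f} GiB")
```

Walking a large mount can take a while, so this is best run once rather than in every session.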

## 2. Initialize Biomni Agent

**Note**: Set `OPENAI_API_KEY` or `ANTHROPIC_API_KEY` in the environment or in a `.env` file before running the queries in section 4.
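
A quick way to confirm that a key is visible to the notebook process is to inspect the environment. This is a sketch only: `configured_api_keys` is a hypothetical helper, it checks the two variable names used in this notebook, and it does not read a `.env` file.

```python
import os

# Variable names taken from the commented-out lines in this notebook
KEY_NAMES = ("OPENAI_API_KEY", "ANTHROPIC_API_KEY")

def configured_api_keys(env=None):
    """Return the names of the API key variables set to a non-empty value."""
    env = os.environ if env is None else env
    return [name for name in KEY_NAMES if env.get(name, "").strip()]

found = configured_api_keys()
if found:
    print(f"API key(s) found: {', '.join(found)}")
else:
    print("No API key found; set one before running agent queries.")
```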

In [None]:
# Set API key (replace with your actual key)
# os.environ['OPENAI_API_KEY'] = 'your-key-here'
# os.environ['ANTHROPIC_API_KEY'] = 'your-key-here'

from biomni_local_mount import A1LocalMount

# Initialize agent with local mount (no S3 download)
agent = A1LocalMount(
    path='/',
    skip_download=True,
    validate_data=True,
    # llm='gpt-4'  # or 'claude-3', requires API key
)

print("✅ Biomni agent initialized!")

## 3. Explore Available Data

In [None]:
# Load and preview DisGeNET gene-disease associations
disgenet = pd.read_parquet(f'{data_path}/DisGeNET.parquet')
print("DisGeNET Gene-Disease Associations:")
print(f"Shape: {disgenet.shape}")
print(f"\nColumns: {list(disgenet.columns)}")
print("\nFirst 5 rows:")
disgenet.head()

In [None]:
# Check DepMap cancer cell line data
depmap_files = [f for f in files if f.startswith('DepMap')]
print("DepMap Cancer Cell Line Files:")
for f in depmap_files:
    size = os.path.getsize(f"{data_path}/{f}") / (1024*1024)
    print(f"  • {f}: {size:.1f} MB")
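
Beyond filtering by prefix, it can help to see which file formats the data lake contains. The sketch below counts file names by extension; `count_by_extension` is a hypothetical helper, and the example names are made up for illustration.

```python
import os
from collections import Counter

def count_by_extension(filenames):
    """Count names by lowercase extension; names without one go under '(none)'."""
    counts = Counter()
    for name in filenames:
        ext = os.path.splitext(name)[1].lower()
        counts[ext or "(none)"] += 1
    return counts

# Hypothetical file names, for illustration only
example = ["a.parquet", "b.PARQUET", "c.csv", "README"]
for ext, n in count_by_extension(example).most_common():
    print(f"  {ext}: {n}")
```

In this notebook the same call would be `count_by_extension(files)`, using the listing from section 1.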

## 4. Run Biomedical Queries

**Note**: These queries require a valid LLM API key.

In [None]:
# Example queries (uncomment after setting API key)

# # Query 1: Disease-gene associations
# result = agent.run("What are the top genes associated with Alzheimer's disease?")
# print(result)

# # Query 2: Drug targets
# result = agent.run("Find potential drug targets for type 2 diabetes")
# print(result)

# # Query 3: Cancer dependencies
# result = agent.run("Which genes are essential in lung cancer cell lines according to DepMap?")
# print(result)
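
When running several queries in a row, one failed call (for example, a missing key or a network error) would otherwise stop the cell. The sketch below collects a result or an error message per query. It assumes only that the agent exposes `run(prompt)`, as in the examples above; `run_queries` and `EchoAgent` are hypothetical names, and `EchoAgent` is a stand-in so the example runs without an API key.

```python
def run_queries(agent, queries):
    """Run each query through agent.run and collect results or error messages."""
    results = {}
    for query in queries:
        try:
            results[query] = agent.run(query)
        except Exception as exc:  # keep going so one failure does not stop the batch
            results[query] = f"Error: {exc}"
    return results

class EchoAgent:
    """Hypothetical stand-in for the real agent, used only to exercise the helper."""
    def run(self, prompt):
        if not prompt:
            raise ValueError("empty prompt")
        return f"echo: {prompt}"

print(run_queries(EchoAgent(), ["Find potential drug targets for type 2 diabetes"]))
```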

## 5. Direct Data Analysis

In [None]:
# Load gene info
try:
    gene_info = pd.read_parquet(f'{data_path}/gene_info.parquet')
    print(f"Gene info loaded: {gene_info.shape[0]} genes")
    print(f"Columns: {list(gene_info.columns)[:10]}")  # First 10 columns
except FileNotFoundError:
    print("gene_info.parquet not found in current dataset")

In [None]:
# Analyze disease associations
if 'disgenet' in locals():
    # Top diseases by number of association rows.
    # value_counts() counts rows, which equals the number of genes only if
    # each gene appears once per disease.
    if 'disease_name' in disgenet.columns:
        top_diseases = disgenet['disease_name'].value_counts().head(10)
        print("Top 10 diseases by number of associations:")
        for disease, count in top_diseases.items():
            print(f"  {disease}: {count} associations")
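
Counting rows per disease is not the same as counting distinct genes if a gene can appear in more than one row for a disease. The sketch below counts distinct genes instead. It assumes the table has `disease_name` and `gene_symbol` columns (the names used elsewhere in this notebook); `unique_genes_per_disease` is a hypothetical helper and the rows are made up.

```python
import pandas as pd

def unique_genes_per_disease(df, top_n=10):
    """Count distinct gene symbols per disease, largest first."""
    return (
        df.groupby("disease_name")["gene_symbol"]
        .nunique()
        .sort_values(ascending=False)
        .head(top_n)
    )

# Made-up rows for illustration; not real DisGeNET data
toy = pd.DataFrame({
    "disease_name": ["D1", "D1", "D1", "D2"],
    "gene_symbol": ["G1", "G1", "G2", "G3"],
})
print(unique_genes_per_disease(toy))
```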

## 6. Custom Analysis Functions

In [None]:
def find_disease_genes(disease_name, top_n=10):
    """Return the top genes for a disease as a DataFrame, or a message string on failure."""
    try:
        disgenet = pd.read_parquet(f'{data_path}/DisGeNET.parquet')

        # Filter for disease: case-insensitive, literal substring match.
        # regex=False keeps characters such as '(' or '+' from being read as a pattern.
        disease_data = disgenet[
            disgenet['disease_name'].str.contains(
                disease_name, case=False, na=False, regex=False
            )
        ]

        if disease_data.empty:
            return f"No data found for '{disease_name}'"

        # Rank by score when available; otherwise return the first matching rows
        if 'score' in disease_data.columns:
            top_genes = disease_data.nlargest(top_n, 'score')[['gene_symbol', 'score', 'disease_name']]
        else:
            top_genes = disease_data.head(top_n)

        return top_genes
    except Exception as e:
        # Missing file or missing columns end up here
        return f"Error: {e}"

# Example usage
# alzheimer_genes = find_disease_genes('Alzheimer', top_n=5)
# print(alzheimer_genes)

## 7. Save Results

In [None]:
# Results are saved to /biomni_data/results/
results_path = '/biomni_data/results'
os.makedirs(results_path, exist_ok=True)

# Example: Save analysis results
# results_df.to_csv(f'{results_path}/analysis_results.csv', index=False)
# print(f"Results saved to {results_path}/")

print(f"✅ Results directory ready at: {results_path}")
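
Writing every run to the same file name overwrites earlier results. One option, sketched below, is to add a timestamp to the file name. `timestamped_path` is a hypothetical helper and the `analysis_results` stem is only an example.

```python
from datetime import datetime
from pathlib import Path

def timestamped_path(results_dir, stem, suffix=".csv", now=None):
    """Build a path like <results_dir>/<stem>_YYYYmmdd_HHMMSS<suffix>."""
    now = datetime.now() if now is None else now
    return Path(results_dir) / f"{stem}_{now:%Y%m%d_%H%M%S}{suffix}"

print(timestamped_path("/biomni_data/results", "analysis_results"))
```

The returned path can be passed straight to `DataFrame.to_csv`.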

## Resources

- **Data location**: `/biomni_data/data_lake/` (76 files, 14GB)
- **Cache**: `/biomni_data/cache/`
- **Results**: `/biomni_data/results/`
- **Logs**: `/biomni_data/logs/`

### Available Datasets
- **DisGeNET**: Gene-disease associations
- **DepMap**: Cancer cell line dependencies
- **BindingDB**: Drug-protein binding data
- **GTEx**: Tissue gene expression
- **DrugBank**: Drug-target interactions
- **And 70+ more biomedical datasets**

### Container Info
- **Name**: biomni-nibr
- **Memory**: 16GB
- **CPUs**: 4 cores
- **Image**: biomni:tier2 (3.25GB)