# 🦠 Viral AI Variants Explorer

This notebook demonstrates how to explore the **VirusSeq Variants** table on Viral AI using the Omics AI Explorer Python library.

**Target Dataset**: `collections.virusseq.variants` on [viral.ai](https://viral.ai)

## What we'll cover:
- Connect to the Viral AI network
- Explore the VirusSeq collection
- Query the variants table
- Display the first 10 rows of variant data

---

## 📦 Setup and Installation

First, let's install and import the Omics AI Explorer library:

In [ ]:
# Install the Omics AI Explorer library
!pip install git+https://github.com/mfiume/omics-ai-python-library.git --quiet

# Import the functional API
from omics_ai import list_collections, list_tables, get_schema_fields, query

# Import data analysis libraries
import pandas as pd
from datetime import datetime

print("✅ All imports successful!")
print(f"📅 Started at: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
print("\n🦠 Viral AI Variants Explorer Ready!")

## 🔗 Connect to Viral AI

Let's connect to the Viral AI network and explore what's available:

In [ ]:
# Connect to Viral AI and explore collections
# (success is only reported after list_collections returns below)
print("🔗 Connecting to Viral AI...")
print("🌐 Network: viral.ai")

# Test basic connection
try:
    collections = list_collections("viral")
    print(f"✅ Connection successful! Found {len(collections)} collections.")
    
    # Look for the virusseq collection
    virusseq_collection = None
    for collection in collections:
        if collection.get('slugName') == 'virusseq':
            virusseq_collection = collection
            break
    
    if virusseq_collection:
        print("\n🎯 Found VirusSeq collection:")
        print(f"   📋 Name: {virusseq_collection.get('name', 'N/A')}")
        print(f"   🔗 Slug: {virusseq_collection.get('slugName', 'N/A')}")
        # `or` also covers a description that is present but None
        description = virusseq_collection.get('description') or 'N/A'
        print(f"   📝 Description: {description[:100]}{'...' if len(description) > 100 else ''}")
    else:
        print("❌ VirusSeq collection not found")
        print("Available collections:")
        for collection in collections[:5]:
            print(f"   - {collection.get('name', 'Unnamed')} ({collection.get('slugName', 'no-slug')})")
            
except Exception as e:
    print(f"❌ Connection failed: {e}")
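
The collection lookup in the cell above can also be written as a small reusable helper. This is a minimal sketch, not part of the library: `find_collection` and the sample records are hypothetical, and only the `name`, `slugName`, and `description` keys are taken from the cell above.

```python
from typing import Dict, List, Optional


def find_collection(collections: List[Dict], slug: str) -> Optional[Dict]:
    """Return the first collection whose 'slugName' equals `slug`, else None."""
    return next((c for c in collections if c.get("slugName") == slug), None)


# Hypothetical records standing in for the result of list_collections("viral")
sample_collections = [
    {"name": "Example Collection", "slugName": "example"},
    {"name": "VirusSeq", "slugName": "virusseq", "description": None},
]

match = find_collection(sample_collections, "virusseq")

# `or` guards against a description that is present but None
description = (match.get("description") or "N/A")[:100] if match else "N/A"
print(match["name"], "|", description)
```

Returning `None` for a missing slug keeps the "not found" branch explicit for the caller, the same way the loop above does.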

## 📊 Explore VirusSeq Tables

Now let's see what tables are available in the VirusSeq collection:

In [ ]:
# List tables in the virusseq collection
try:
    tables = list_tables("viral", "virusseq")
    print(f"📋 Found {len(tables)} tables in VirusSeq collection:")
    print()
    
    variants_table = None
    for i, table in enumerate(tables, 1):
        table_name = table.get('qualified_table_name', table.get('name', 'Unknown'))
        display_name = table.get('display_name', table_name)
        size = table.get('size', 'Unknown')
        
        print(f"   {i}. {display_name}")
        print(f"      🔗 Table: {table_name}")
        print(f"      📏 Size: {size} rows")
        
        # Check if this is our target variants table
        if 'variants' in table_name.lower():
            variants_table = table
            print("      👆 This is our target table!")
        
        print()
    
    if variants_table:
        print(f"🎯 Target table found: {variants_table.get('qualified_table_name')}")
    else:
        print("❌ Variants table not found in the list")
        
except Exception as e:
    print(f"❌ Failed to list tables: {e}")

## 🔍 Explore Variants Table Schema

Let's examine the structure (schema) of the variants table to understand what data is available:

In [ ]:
# Get schema for the variants table
try:
    fields = get_schema_fields("viral", "virusseq", "collections.virusseq.variants")
    print(f"🔍 Variants table schema - {len(fields)} fields:")
    print()
    
    # Show first 15 fields
    for i, field in enumerate(fields[:15], 1):
        field_name = field['field']
        field_type = field['type']
        sql_type = field.get('sql_type', '')
        
        print(f"   {i:2d}. {field_name:<25} | {field_type:<15} | {sql_type}")
    
    if len(fields) > 15:
        print(f"   ... and {len(fields) - 15} more fields")
    
    print(f"\n📊 Key fields we'll see in the data:")
    key_fields = ['pos', 'ref', 'alt', 'chrom', 'variant_id', 'sample_id']
    for field in fields:
        if any(key in field['field'].lower() for key in key_fields):
            print(f"   🔹 {field['field']}: {field['type']}")
            
except Exception as e:
    print(f"❌ Failed to get schema: {e}")
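
For wider schemas, the list of field dictionaries may be easier to scan as a table. The sketch below assumes each entry has the `field`, `type`, and `sql_type` keys used in the cell above. The field names are the columns that appear in the DataFrame output later in this notebook, but the type values are made up for illustration.

```python
import pandas as pd

# Hypothetical schema entries: names match columns seen later in this notebook,
# the type strings are assumptions and may differ from the real schema.
sample_fields = [
    {"field": "start_position", "type": "integer", "sql_type": "BIGINT"},
    {"field": "end_position", "type": "integer", "sql_type": "BIGINT"},
    {"field": "reference_bases", "type": "string", "sql_type": "VARCHAR"},
    {"field": "alternate_bases", "type": "string", "sql_type": "VARCHAR"},
    {"field": "sequence_accession", "type": "string", "sql_type": "VARCHAR"},
]

# One row per field; fixing the column order keeps the table stable
schema_df = pd.DataFrame(sample_fields, columns=["field", "type", "sql_type"])
print(schema_df.to_string(index=False))

# Quick summary of how many fields there are per type
type_counts = schema_df["type"].value_counts()
print(type_counts)
```

With real data, `sample_fields` would be replaced by the `fields` list returned by `get_schema_fields`.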

## 🔬 Query Variants Data

Now let's query the variants table to get the first 10 rows of actual data:

In [ ]:
# Query the first 10 rows from the variants table
try:
    print("🔬 Querying variants table...")
    print("⚡ This may take a moment as the query is processed asynchronously.")
    print()
    
    # Use the functional API to query
    result = query("viral", "virusseq", "collections.virusseq.variants", limit=10)
    
    # Extract the data
    data = result.get('data', [])
    pagination = result.get('pagination', {})
    
    print(f"\n🎉 Successfully retrieved {len(data)} variant records!")
    
    if pagination:
        total = pagination.get('total', 'Unknown')
        print(f"📊 Total variants in table: {total}")
    
    print("\n" + "="*80)
    print("📋 FIRST 10 VARIANT RECORDS:")
    print("="*80)
    
    # Display each variant record
    for i, variant in enumerate(data, 1):
        print(f"\n🔹 Variant {i}:")
        
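        # Show key fields first (substring match, as in the schema cell,
        # so 'pos' also finds names such as 'start_position')
        key_fields = ['pos', 'ref', 'alt', 'chrom', 'variant_id', 'sample_id']
        for name, value in variant.items():
            if any(key in name.lower() for key in key_fields):
                print(f"   {name:<20}: {value}")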
        
        # Show a few other interesting fields
        other_fields = ['quality', 'filter', 'info', 'genotype']
        for key in other_fields:
            if key in variant:
                value = variant[key]
                if isinstance(value, str) and len(value) > 50:
                    value = value[:50] + "..."
                print(f"   {key:<12}: {value}")
        
        # Show total number of fields in this record
        print(f"   📏 Total fields: {len(variant)}")
        
        if i < len(data):
            print("   " + "-"*50)
    
    print("\n" + "="*80)
    
except Exception as e:
    print(f"❌ Query failed: {e}")
    import traceback
    print("\n🔍 Error details:")
    traceback.print_exc()
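
Retrieving more than one page of rows usually means looping until enough records have arrived. The sketch below shows that loop in isolation. It is an assumption-heavy illustration: this notebook only shows `query` being called with `limit`, so the `fetch_page(offset, limit)` callable is hypothetical, and a stub stands in for the real service.

```python
from typing import Callable, Dict, List


def fetch_rows(
    fetch_page: Callable[[int, int], List[Dict]],
    total_wanted: int,
    page_size: int = 100,
) -> List[Dict]:
    """Collect up to `total_wanted` rows by requesting one page at a time."""
    rows: List[Dict] = []
    offset = 0
    while len(rows) < total_wanted:
        want = min(page_size, total_wanted - len(rows))
        page = fetch_page(offset, want)
        if not page:  # the source has no more rows
            break
        rows.extend(page)
        offset += len(page)
    return rows


# Stub data source with 25 made-up rows (hypothetical, for illustration only)
FAKE_TABLE = [{"start_position": 10000 + i} for i in range(25)]


def fake_fetch_page(offset: int, limit: int) -> List[Dict]:
    """Return a slice of the fake table, mimicking a paged endpoint."""
    return FAKE_TABLE[offset:offset + limit]


rows = fetch_rows(fake_fetch_page, total_wanted=23, page_size=10)
print(len(rows))
```

If the library's `query` function supports an offset-style argument, `fetch_page` could wrap it. That would need to be checked against the library's documentation first.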

## 📈 Convert to DataFrame

Let's convert the variant data to a pandas DataFrame for easier analysis:

In [6]:
# Convert to DataFrame if we have data
try:
    if 'data' in locals() and data:
        # Create DataFrame
        df = pd.DataFrame(data)
        
        print(f"📊 Created DataFrame with {len(df)} rows and {len(df.columns)} columns")
        print()
        
        # Show basic info
        print("🔍 DataFrame Info:")
        print(f"   Shape: {df.shape}")
        print(f"   Columns: {len(df.columns)}")
        print()
        
        # Show column names
        print("📋 Column Names:")
        for i, col in enumerate(df.columns[:20], 1):
            print(f"   {i:2d}. {col}")
        
        if len(df.columns) > 20:
            print(f"   ... and {len(df.columns) - 20} more columns")
        
        print()
        
        # Show first few rows with key columns
        key_columns = []
        for col in ['pos', 'ref', 'alt', 'chrom', 'variant_id', 'sample_id']:
            if col in df.columns:
                key_columns.append(col)
        
        if key_columns:
            print("🔹 Key columns preview:")
            print(df[key_columns].head())
        else:
            print("🔹 First 5 columns preview:")
            print(df.iloc[:, :5].head())
        
    else:
        print("❌ No data available to create DataFrame")
        
except Exception as e:
    print(f"❌ Failed to create DataFrame: {e}")

📊 Created DataFrame with 10 rows and 5 columns

🔍 DataFrame Info:
   Shape: (10, 5)
   Columns: 5

📋 Column Names:
    1. start_position
    2. end_position
    3. reference_bases
    4. alternate_bases
    5. sequence_accession

🔹 First 5 columns preview:
   start_position  end_position reference_bases alternate_bases  \
0           10455         10456               C               T   
1           10457         10458               C               T   
2           10457         10458               C               T   
3           10457         10458               C               T   
4           10457         10458               C               T   

                    sequence_accession  
0  hCoV-19/Canada/AB-ABPHL-102772/2023  
1  hCoV-19/Canada/BC-BCCDC-101979/2021  
2   hCoV-19/Canada/BC-BCCDC-92384/2021  
3   hCoV-19/Canada/BC-BCCDC-75322/2021  
4   hCoV-19/Canada/BC-BCCDC-78406/2021  
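
The output above falls back to "First 5 columns preview" because none of the short names (`pos`, `ref`, `alt`, ...) exactly equals a column such as `start_position`. One possible workaround is to match by substring, as sketched below. `pick_columns` is a hypothetical helper, and the DataFrame is rebuilt from the five preview rows shown above.

```python
import pandas as pd


def pick_columns(columns, keywords):
    """Return the column names that contain any keyword (case-insensitive)."""
    lowered = [k.lower() for k in keywords]
    return [c for c in columns if any(k in c.lower() for k in lowered)]


# The five preview rows from the output above
df = pd.DataFrame({
    "start_position": [10455, 10457, 10457, 10457, 10457],
    "end_position": [10456, 10458, 10458, 10458, 10458],
    "reference_bases": ["C", "C", "C", "C", "C"],
    "alternate_bases": ["T", "T", "T", "T", "T"],
    "sequence_accession": [
        "hCoV-19/Canada/AB-ABPHL-102772/2023",
        "hCoV-19/Canada/BC-BCCDC-101979/2021",
        "hCoV-19/Canada/BC-BCCDC-92384/2021",
        "hCoV-19/Canada/BC-BCCDC-75322/2021",
        "hCoV-19/Canada/BC-BCCDC-78406/2021",
    ],
})

key_columns = pick_columns(df.columns, ["pos", "ref", "alt", "accession"])
print(df[key_columns].head())
```

Substring matching is permissive: a short keyword such as `ref` would also match an unrelated column containing those letters, so exact names are safer once the schema is known.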


## 🎯 Summary

**What we accomplished:**

✅ Connected to the Viral AI network  
✅ Explored the VirusSeq collection  
✅ Examined the variants table schema  
✅ Successfully queried the first 10 variant records  
✅ Converted the data to a pandas DataFrame  

**Next steps you could try:**
- Query more data by increasing the `limit` parameter
- Add filters to focus on specific positions or sequences (for example, by `start_position` or `sequence_accession`)
- Analyze variant frequencies and distributions
- Export the data for further analysis
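
As a starting point for the filtering, frequency, and export ideas above, here is a minimal pandas sketch. It runs locally on the five preview rows shown earlier rather than on a fresh query. The `n_sequences` column name is an arbitrary choice, and the CSV is written to an in-memory buffer, so no file is created.

```python
import io

import pandas as pd

# The five preview rows from the DataFrame output earlier in this notebook
df = pd.DataFrame({
    "start_position": [10455, 10457, 10457, 10457, 10457],
    "end_position": [10456, 10458, 10458, 10458, 10458],
    "reference_bases": ["C", "C", "C", "C", "C"],
    "alternate_bases": ["T", "T", "T", "T", "T"],
    "sequence_accession": [
        "hCoV-19/Canada/AB-ABPHL-102772/2023",
        "hCoV-19/Canada/BC-BCCDC-101979/2021",
        "hCoV-19/Canada/BC-BCCDC-92384/2021",
        "hCoV-19/Canada/BC-BCCDC-75322/2021",
        "hCoV-19/Canada/BC-BCCDC-78406/2021",
    ],
})

# Filter: keep only variants at one position
at_10457 = df[df["start_position"] == 10457]

# Frequency: count how many sequences carry each (position, ref, alt) variant
counts = (
    df.groupby(["start_position", "reference_bases", "alternate_bases"])
    .size()
    .reset_index(name="n_sequences")
    .sort_values("n_sequences", ascending=False)
)
print(counts)

# Export: write the summary as CSV (pass a file path instead to save to disk)
buffer = io.StringIO()
counts.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
```

With only five rows the counts are not meaningful; the same steps apply unchanged to a larger result set.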

---

🦠 **Viral AI Variants Explorer** - Powered by [Omics AI Explorer](https://github.com/mfiume/omics-ai-python-library)