# 🦠 Viral AI Variants Explorer

This notebook demonstrates how to explore the **VirusSeq Variants** table on Viral AI using the Omics AI Explorer Python library.

**Target Dataset**: `collections.virusseq.variants` on [viral.ai](https://viral.ai)

## What we'll cover:
- Connect to the Viral AI network
- Explore the VirusSeq collection
- Query the variants table
- Display the first 10 rows of variant data

---

## 📦 Setup and Installation

First, let's install and import the Omics AI Explorer library:

In [ ]:
# Install the Omics AI Explorer library
!pip install git+https://github.com/mfiume/omics-ai-python-library.git --quiet

# Import required libraries
try:
    from omics_ai import OmicsAIClient
    print("✅ Successfully imported OmicsAIClient!")
except ImportError:
    print("⚠️ Package import failed, using fallback implementation...")
    
    # Fallback implementation based on the working debug script
    import requests
    import json
    import time
    from typing import Dict, List, Optional, Any
    from urllib.parse import quote
    
    def parse_json_lines_response(raw_text: str) -> Dict[str, Any]:
        """Parse JSON Lines response from Viral AI API."""
        if not raw_text.strip():
            raise Exception("Empty response received")
        
        # Split by lines and filter out empty lines
        lines = [line.strip() for line in raw_text.strip().split('\n') if line.strip()]
        
        if not lines:
            raise Exception("No valid lines found in response")
        
        # Parse each line as JSON
        json_objects = []
        for line in lines:
            try:
                json_objects.append(json.loads(line))
            except json.JSONDecodeError:
                # Skip lines that are not valid JSON (for example, partial chunks)
                continue
        
        if not json_objects:
            raise Exception("No valid JSON objects found in response")
        
        # Find the object with data (usually the last non-empty one)
        for obj in reversed(json_objects):
            if obj and 'data' in obj:
                return obj
        
        # If no data object found, check for next_page_token (polling case)
        for obj in reversed(json_objects):
            if obj and 'next_page_token' in obj:
                return obj
        
        # If we get here, we have only empty objects {} or unexpected format
        if all(not obj for obj in json_objects):
            return {"next_page_token": "empty_response_poll"}
        
        # Return the last non-empty object
        non_empty_objects = [obj for obj in json_objects if obj]
        if non_empty_objects:
            return non_empty_objects[-1]
        
        raise Exception(f"No data or next_page_token found. Objects: {json_objects}")
    
    class OmicsAIClient:
        """Simplified Omics AI Explorer client for Viral AI."""
        
        def __init__(self, network: str = "viral.ai"):
            if not network.startswith(('http://', 'https://')):
                network = f"https://{network}"
            self.network = network.rstrip('/')
            self.session = requests.Session()
            self.session.headers.update({
                'User-Agent': 'viral-ai-explorer/1.0',
                'Accept': 'application/json',
                'Content-Type': 'application/json'
            })
        
        def _make_request(self, method: str, endpoint: str, **kwargs):
            url = f"{self.network}{endpoint}"
            response = self.session.request(method, url, **kwargs)
            response.raise_for_status()
            return response
        
        def list_collections(self) -> List[Dict[str, Any]]:
            response = self._make_request('GET', '/api/collections')
            return response.json()
        
        def list_tables(self, collection_slug: str) -> List[Dict[str, Any]]:
            endpoint = f"/api/collections/{quote(collection_slug)}/tables"
            response = self._make_request('GET', endpoint)
            return response.json()
        
        def get_schema_fields(self, collection_slug: str, table_name: str) -> List[Dict[str, str]]:
            endpoint = f"/api/collection/{quote(collection_slug)}/data-connect/table/{quote(table_name)}/info"
            response = self._make_request('GET', endpoint)
            schema = response.json()
            
            data_model = schema.get('data_model', {}).get('properties', {})
            fields = []
            for field_name, field_spec in data_model.items():
                field_type = field_spec.get('type', '')
                if isinstance(field_type, list):
                    field_type = ', '.join(field_type)
                if field_type == 'array' and 'items' in field_spec:
                    item_type = field_spec['items'].get('type', '')
                    if isinstance(item_type, list):
                        item_type = ', '.join(item_type)
                    field_type = f"array<{item_type}>"
                
                fields.append({
                    'field': field_name,
                    'type': field_type,
                    'sql_type': field_spec.get('sqlType', '')
                })
            return fields
        
        def query(self, collection_slug: str, table_name: str, 
                 filters=None, limit: int = 100, offset: int = 0,
                 max_polls: int = 10, poll_interval: float = 2.0) -> Dict[str, Any]:
            """Query with auto-polling for async results."""
            if filters is None:
                filters = {}
                
            payload = {
                "tableName": table_name,
                "filters": filters,
                "pagination": {"limit": limit, "offset": offset}
            }
            
            endpoint = f"/api/collections/{quote(collection_slug)}/tables/{quote(table_name)}/filter"
            
            for poll_count in range(max_polls):
                response = self._make_request('POST', endpoint, json=payload)
                
                # Parse response
                try:
                    result = parse_json_lines_response(response.text)
                except Exception as e:
                    raise Exception(f"Failed to parse response: {e}") from e
                
                # Check if we have data or need to poll
                if 'data' in result and isinstance(result['data'], list):
                    return result
                elif 'next_page_token' in result:
                    # Forward real tokens; the placeholder token only means "poll again"
                    if result['next_page_token'] != 'empty_response_poll':
                        payload['next_page_token'] = result['next_page_token']
                    time.sleep(poll_interval)
                else:
                    return result  # Return whatever we got
            
            raise Exception(f"Query timed out after {max_polls} polls")
    
    print("✅ Using fallback implementation with working JSON Lines parser!")

# Import data analysis libraries
import pandas as pd
from datetime import datetime

print("\n🦠 Viral AI Variants Explorer Ready!")
print(f"📅 Started at: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
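
The fallback parser above treats the response body as JSON Lines: one JSON object per line, with empty `{}` objects apparently sent while the query is still running. The sketch below shows that idea in isolation. The `sample_body` string and the helper name `iter_json_lines` are made up for illustration and are not part of the library.

```python
import json

def iter_json_lines(raw_text):
    """Yield one parsed object per non-empty line, skipping malformed lines."""
    for line in raw_text.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            yield json.loads(line)
        except json.JSONDecodeError:
            continue  # ignore partial or malformed lines

# Hypothetical response body: two empty keep-alive objects, then the payload
sample_body = '{}\n{}\n{"data": [{"start_position": 10455}]}\n'

objects = list(iter_json_lines(sample_body))

# Scan from the end, since the object carrying 'data' usually arrives last
payload = next((obj for obj in reversed(objects) if "data" in obj), None)
print(payload)
```

Scanning from the end mirrors the fallback's own loop over `reversed(json_objects)`.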

## 🔗 Connect to Viral AI

Let's connect to the Viral AI network and explore what's available:

In [2]:
# Create client for Viral AI
client = OmicsAIClient("viral.ai")

print("🔗 Connected to Viral AI!")
print(f"🌐 Network: {client.network}")

# Test basic connection
try:
    collections = client.list_collections()
    print(f"✅ Connection successful! Found {len(collections)} collections.")
    
    # Look for the virusseq collection
    virusseq_collection = None
    for collection in collections:
        if collection.get('slugName') == 'virusseq':
            virusseq_collection = collection
            break
    
    if virusseq_collection:
        print(f"\n🎯 Found VirusSeq collection:")
        print(f"   📋 Name: {virusseq_collection.get('name', 'N/A')}")
        print(f"   🔗 Slug: {virusseq_collection.get('slugName', 'N/A')}")
        print(f"   📝 Description: {(virusseq_collection.get('description') or 'N/A')[:100]}...")
    else:
        print("❌ VirusSeq collection not found")
        print("Available collections:")
        for collection in collections[:5]:
            print(f"   - {collection.get('name', 'Unnamed')} ({collection.get('slugName', 'no-slug')})")
            
except Exception as e:
    print(f"❌ Connection failed: {e}")

🔗 Connected to Viral AI!
🌐 Network: https://viral.ai
✅ Connection successful! Found 18 collections.

🎯 Found VirusSeq collection:
   📋 Name: VirusSeq SARS-CoV-2 Genome Sequences
   🔗 Slug: virusseq
   📝 Description: <p>The&nbsp;mission&nbsp;of&nbsp;Canadian&nbsp;COVID&nbsp;Genomics&nbsp;Network&nbsp;(<a href="https...
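
The description printed above is raw HTML with `&nbsp;` entities, which makes the 100-character preview hard to read. A minimal sketch of cleaning it with the standard library follows. The helper name `clean_description` is hypothetical, and the sample string is a shortened stand-in for the real description, not its full text.

```python
import html
import re

def clean_description(raw, max_len=100):
    """Strip HTML tags, decode entities, and collapse whitespace."""
    text = re.sub(r"<[^>]+>", " ", raw)       # drop tags such as <p> and <a href=...>
    text = html.unescape(text)                 # &nbsp; becomes a non-breaking space
    text = re.sub(r"\s+", " ", text).strip()   # \s also matches non-breaking spaces
    return text if len(text) <= max_len else text[:max_len] + "..."

# Shortened, illustrative sample in the same style as the API output
sample = "<p>The&nbsp;mission&nbsp;of&nbsp;Canadian&nbsp;COVID&nbsp;Genomics&nbsp;Network</p>"
cleaned = clean_description(sample)
print(cleaned)
```

A regular expression is enough for a short preview; for arbitrary HTML a real parser would be the safer choice.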


## 📊 Explore VirusSeq Tables

Now let's see what tables are available in the VirusSeq collection:

In [3]:
# List tables in the virusseq collection
try:
    tables = client.list_tables("virusseq")
    print(f"📋 Found {len(tables)} tables in VirusSeq collection:")
    print()
    
    variants_table = None
    for i, table in enumerate(tables, 1):
        table_name = table.get('qualified_table_name', table.get('name', 'Unknown'))
        display_name = table.get('display_name', table_name)
        size = table.get('size', 'Unknown')
        
        print(f"   {i}. {display_name}")
        print(f"      🔗 Table: {table_name}")
        print(f"      📏 Size: {size} rows")
        
        # Check if this is our target variants table
        if 'variants' in table_name.lower():
            variants_table = table
            print(f"      👆 This is our target table!")
        
        print()
    
    if variants_table:
        print(f"🎯 Target table found: {variants_table.get('qualified_table_name')}")
    else:
        print("❌ Variants table not found in the list")
        
except Exception as e:
    print(f"❌ Failed to list tables: {e}")

📋 Found 3 tables in VirusSeq collection:

   1. variants
      🔗 Table: collections.virusseq.variants
      📏 Size: 42235025 rows
      👆 This is our target table!

   2. samples
      🔗 Table: collections.virusseq.samples
      📏 Size: 631138 rows

   3. Files
      🔗 Table: collections.virusseq._files
      📏 Size: 1888810 rows

🎯 Target table found: collections.virusseq.variants
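
The cell above picks the target by checking whether `'variants'` appears anywhere in the table name, which would also match any other table whose name contains that word. A sketch of an exact-match lookup is shown below. It assumes the `qualified_table_name` and `size` keys seen in the listing and uses hard-coded copies of the three rows rather than a live call; `find_table` is a hypothetical helper.

```python
def find_table(tables, qualified_name):
    """Return the table dict whose qualified name matches exactly, or None."""
    return next(
        (t for t in tables if t.get("qualified_table_name") == qualified_name),
        None,
    )

# Values copied from the listing above
tables = [
    {"qualified_table_name": "collections.virusseq.variants", "size": 42235025},
    {"qualified_table_name": "collections.virusseq.samples", "size": 631138},
    {"qualified_table_name": "collections.virusseq._files", "size": 1888810},
]

target = find_table(tables, "collections.virusseq.variants")
# The ',' format spec adds thousands separators to the row count
summary = f"{target['qualified_table_name']}: {target['size']:,} rows"
print(summary)
```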


## 🔍 Explore Variants Table Schema

Let's examine the structure (schema) of the variants table to understand what data is available:

In [4]:
# Get schema for the variants table
try:
    fields = client.get_schema_fields("virusseq", "collections.virusseq.variants")
    print(f"🔍 Variants table schema - {len(fields)} fields:")
    print()
    
    # Show first 15 fields
    for i, field in enumerate(fields[:15], 1):
        field_name = field['field']
        field_type = field['type']
        sql_type = field.get('sql_type', '')
        
        print(f"   {i:2d}. {field_name:<25} | {field_type:<15} | {sql_type}")
    
    if len(fields) > 15:
        print(f"   ... and {len(fields) - 15} more fields")
    
    print(f"\n📊 Key fields we'll see in the data:")
    key_fields = ['pos', 'ref', 'alt', 'chrom', 'variant_id', 'sample_id']
    for field in fields:
        if any(key in field['field'].lower() for key in key_fields):
            print(f"   🔹 {field['field']}: {field['type']}")
            
except Exception as e:
    print(f"❌ Failed to get schema: {e}")

🔍 Variants table schema - 5 fields:

    1. start_position            | string          | bigint
    2. end_position              | string          | bigint
    3. reference_bases           | string          | varchar
    4. alternate_bases           | string          | varchar
    5. sequence_accession        | string          | varchar

📊 Key fields we'll see in the data:
   🔹 start_position: string
   🔹 end_position: string
   🔹 reference_bases: string
   🔹 alternate_bases: string
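
Note that the schema reports the position fields with JSON type `string` but SQL type `bigint`. In the run shown in this notebook the positions came back as integers, so no conversion was needed, but if a response did deliver them as strings, the `sql_type` column could drive a cast. The sketch below assumes a small mapping (`SQL_CASTERS`) that is not part of the library, and a hypothetical string-encoded record.

```python
# Assumed mapping from SQL type names to Python converters; extend as needed
SQL_CASTERS = {"bigint": int, "varchar": str}

def cast_record(record, fields):
    """Convert each value according to the sql_type declared in the schema."""
    casters = {f["field"]: SQL_CASTERS.get(f.get("sql_type")) for f in fields}
    typed = {}
    for key, value in record.items():
        caster = casters.get(key)
        # Leave None values and fields without a known caster untouched
        typed[key] = caster(value) if caster and value is not None else value
    return typed

# Field list as returned by get_schema_fields in the cell above
fields = [
    {"field": "start_position", "type": "string", "sql_type": "bigint"},
    {"field": "end_position", "type": "string", "sql_type": "bigint"},
    {"field": "reference_bases", "type": "string", "sql_type": "varchar"},
    {"field": "alternate_bases", "type": "string", "sql_type": "varchar"},
    {"field": "sequence_accession", "type": "string", "sql_type": "varchar"},
]

# Hypothetical record with string-encoded positions
record = {"start_position": "10455", "end_position": "10456",
          "reference_bases": "C", "alternate_bases": "T"}
typed = cast_record(record, fields)
print(typed)
```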


## 🔬 Query Variants Data

Now let's query the variants table to get the first 10 rows of actual data:

In [5]:
# Query the first 10 rows from the variants table
try:
    print("🔬 Querying variants table...")
    print("⚡ This may take a moment as the query is processed asynchronously.")
    print()

    # Use the standard query method from the library
    result = client.query(
        collection_slug="virusseq", 
        table_name="collections.virusseq.variants", 
        filters={},  # No filters - get all data
        limit=10     # First 10 rows
    )

    # Extract the data
    data = result.get('data', [])
    pagination = result.get('pagination', {})
    
    print(f"\n🎉 Successfully retrieved {len(data)} variant records!")
    
    if pagination:
        total = pagination.get('total', 'Unknown')
        print(f"📊 Total variants in table: {total}")
    
    print("\n" + "="*80)
    print("📋 FIRST 10 VARIANT RECORDS:")
    print("="*80)
    
    # Display each variant record
    for i, variant in enumerate(data, 1):
        print(f"\n🔹 Variant {i}:")
        
        # Show the fields reported by the schema for this table
        key_fields = ['start_position', 'end_position', 'reference_bases',
                      'alternate_bases', 'sequence_accession']
        for key in key_fields:
            if key in variant:
                print(f"   {key:<20}: {variant[key]}")
        
        # Show any remaining fields, truncating long string values
        for key, value in variant.items():
            if key in key_fields:
                continue
            if isinstance(value, str) and len(value) > 50:
                value = value[:50] + "..."
            print(f"   {key:<20}: {value}")
        
        # Show total number of fields in this record
        print(f"   📏 Total fields: {len(variant)}")
        
        if i < len(data):
            print("   " + "-"*50)
    
    print("\n" + "="*80)
    
except Exception as e:
    print(f"❌ Query failed: {e}")
    import traceback
    print("\n🔍 Error details:")
    traceback.print_exc()

🔬 Querying variants table...
⚡ This may take a moment as the query is processed asynchronously.

Going in!
Results:
{'data': [
  {'start_position': 10455, 'end_position': 10456, 'reference_bases': 'C', 'alternate_bases': 'T', 'sequence_accession': 'hCoV-19/Canada/AB-ABPHL-102772/2023'},
  {'start_position': 10457, 'end_position': 10458, 'reference_bases': 'C', 'alternate_bases': 'T', 'sequence_accession': 'hCoV-19/Canada/BC-BCCDC-101979/2021'},
  {'start_position': 10457, 'end_position': 10458, 'reference_bases': 'C', 'alternate_bases': 'T', 'sequence_accession': 'hCoV-19/Canada/BC-BCCDC-92384/2021'},
  {'start_position': 10457, 'end_position': 10458, 'reference_bases': 'C', 'alternate_bases': 'T', 'sequence_accession': 'hCoV-19/Canada/BC-BCCDC-75322/2021'},
  {'start_position': 10457, 'end_position': 10458, 'reference_bases': 'C', 'alternate_bases': 'T', 'sequence_accession': 'hCoV-19/Canada/BC-BCCDC-78406/2021'},
  {'start_position': 10457, 'end_position': 10458, 'reference_bases': 'C', 'alternat... [output truncated]
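
The query above fetches a single page with `limit=10`. To read more rows, one option is to advance `offset` page by page. The sketch below assumes the server honours the `limit` and `offset` values the fallback client places in its payload, which this notebook does not verify. `iter_rows` and `FakeClient` are hypothetical names; the fake client stands in for `OmicsAIClient` so the example runs offline.

```python
def iter_rows(client, collection_slug, table_name, page_size=100, max_rows=1000):
    """Yield rows page by page until the table is exhausted or max_rows is hit."""
    offset = 0
    while offset < max_rows:
        limit = min(page_size, max_rows - offset)
        result = client.query(collection_slug, table_name, limit=limit, offset=offset)
        rows = result.get("data", [])
        if not rows:
            break
        yield from rows
        offset += len(rows)
        if len(rows) < limit:
            break  # a short page means there is nothing left to fetch

class FakeClient:
    """Offline stand-in that slices an in-memory list like a paged API would."""
    def __init__(self, rows):
        self.rows = rows

    def query(self, collection_slug, table_name, filters=None, limit=100, offset=0):
        return {"data": self.rows[offset:offset + limit]}

fake = FakeClient([{"start_position": n} for n in range(25)])
all_rows = list(iter_rows(fake, "virusseq", "collections.virusseq.variants", page_size=10))
print(len(all_rows))
```

The `max_rows` cap is deliberate: with tens of millions of rows in the variants table, an unbounded loop is rarely what you want.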

## 📈 Convert to DataFrame

Let's convert the variant data to a pandas DataFrame for easier analysis:

In [6]:
# Convert to DataFrame if we have data
try:
    if 'data' in locals() and data:
        # Create DataFrame
        df = pd.DataFrame(data)
        
        print(f"📊 Created DataFrame with {len(df)} rows and {len(df.columns)} columns")
        print()
        
        # Show basic info
        print("🔍 DataFrame Info:")
        print(f"   Shape: {df.shape}")
        print(f"   Columns: {len(df.columns)}")
        print()
        
        # Show column names
        print("📋 Column Names:")
        for i, col in enumerate(df.columns[:20], 1):
            print(f"   {i:2d}. {col}")
        
        if len(df.columns) > 20:
            print(f"   ... and {len(df.columns) - 20} more columns")
        
        print()
        
        # Show first few rows with key columns
        key_columns = []
        for col in ['pos', 'ref', 'alt', 'chrom', 'variant_id', 'sample_id']:
            if col in df.columns:
                key_columns.append(col)
        
        if key_columns:
            print(f"🔹 Key columns preview:")
            print(df[key_columns].head())
        else:
            print(f"🔹 First 5 columns preview:")
            print(df.iloc[:, :5].head())
        
    else:
        print("❌ No data available to create DataFrame")
        
except Exception as e:
    print(f"❌ Failed to create DataFrame: {e}")

📊 Created DataFrame with 10 rows and 5 columns

🔍 DataFrame Info:
   Shape: (10, 5)
   Columns: 5

📋 Column Names:
    1. start_position
    2. end_position
    3. reference_bases
    4. alternate_bases
    5. sequence_accession

🔹 First 5 columns preview:
   start_position  end_position reference_bases alternate_bases  \
0           10455         10456               C               T   
1           10457         10458               C               T   
2           10457         10458               C               T   
3           10457         10458               C               T   
4           10457         10458               C               T   

                    sequence_accession  
0  hCoV-19/Canada/AB-ABPHL-102772/2023  
1  hCoV-19/Canada/BC-BCCDC-101979/2021  
2   hCoV-19/Canada/BC-BCCDC-92384/2021  
3   hCoV-19/Canada/BC-BCCDC-75322/2021  
4   hCoV-19/Canada/BC-BCCDC-78406/2021  
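
With the rows in a DataFrame, a natural first analysis is counting how many sequences carry each variant. The sketch below rebuilds a small DataFrame from the five rows previewed above, so it runs on its own, and groups by position and bases. The column name `n_sequences` is chosen here for illustration.

```python
import pandas as pd

# The five rows shown in the preview above
rows = [
    {"start_position": 10455, "end_position": 10456, "reference_bases": "C",
     "alternate_bases": "T", "sequence_accession": "hCoV-19/Canada/AB-ABPHL-102772/2023"},
    {"start_position": 10457, "end_position": 10458, "reference_bases": "C",
     "alternate_bases": "T", "sequence_accession": "hCoV-19/Canada/BC-BCCDC-101979/2021"},
    {"start_position": 10457, "end_position": 10458, "reference_bases": "C",
     "alternate_bases": "T", "sequence_accession": "hCoV-19/Canada/BC-BCCDC-92384/2021"},
    {"start_position": 10457, "end_position": 10458, "reference_bases": "C",
     "alternate_bases": "T", "sequence_accession": "hCoV-19/Canada/BC-BCCDC-75322/2021"},
    {"start_position": 10457, "end_position": 10458, "reference_bases": "C",
     "alternate_bases": "T", "sequence_accession": "hCoV-19/Canada/BC-BCCDC-78406/2021"},
]
sample_df = pd.DataFrame(rows)

# Count sequences per distinct variant (position + reference + alternate)
counts = (
    sample_df.groupby(["start_position", "reference_bases", "alternate_bases"])
    .size()
    .reset_index(name="n_sequences")
    .sort_values("n_sequences", ascending=False)
    .reset_index(drop=True)
)
print(counts)
```

On a 10-row sample these counts say little about real frequencies; the same grouping becomes meaningful once more rows are fetched.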


## 🎯 Summary

**What we accomplished:**

✅ Connected to the Viral AI network  
✅ Explored the VirusSeq collection  
✅ Examined the variants table schema  
✅ Successfully queried the first 10 variant records  
✅ Converted the data to a pandas DataFrame  

**Next steps you could try:**
- Query more data by increasing the `limit` parameter
- Add filters to focus on specific positions or sequence accessions
- Analyze variant frequencies and distributions
- Export the data for further analysis
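
For the export step, the list of row dictionaries returned by the query can be written as CSV with the standard library alone. This is a minimal sketch: `rows_to_csv` is a hypothetical helper, and the two sample rows are copied from the output earlier in this notebook. With a DataFrame in hand, `df.to_csv("variants.csv", index=False)` does the same job.

```python
import csv
import io

def rows_to_csv(rows, fieldnames):
    """Serialize a list of row dicts to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

fieldnames = ["start_position", "end_position", "reference_bases",
              "alternate_bases", "sequence_accession"]

# Two rows copied from the query output above
sample_rows = [
    {"start_position": 10455, "end_position": 10456, "reference_bases": "C",
     "alternate_bases": "T", "sequence_accession": "hCoV-19/Canada/AB-ABPHL-102772/2023"},
    {"start_position": 10457, "end_position": 10458, "reference_bases": "C",
     "alternate_bases": "T", "sequence_accession": "hCoV-19/Canada/BC-BCCDC-101979/2021"},
]

csv_text = rows_to_csv(sample_rows, fieldnames)
print(csv_text)
```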

---

🦠 **Viral AI Variants Explorer** - Powered by [Omics AI Explorer](https://github.com/mfiume/omics-ai-python-library)