# 🎵 Maha-System: The Indian AI Orchestra

**Run India's sovereign AI models on a free Google Colab T4 GPU**

This notebook implements the "Jugaad" architecture from the manifesto:
- Sequential model loading (hot-swap) to fit in 16GB VRAM
- Translate-Reason-Verify (TRV) pipeline for SOTA reasoning
- Cultural contextualization for Indian languages

📄 [Read the Manifesto](https://github.com/yourusername/maha-system)

⚠️ **Requirements**: GPU Runtime (Runtime → Change runtime type → T4 GPU)
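
The sequential-loading (hot-swap) idea listed above can be sketched in a few lines. This is a minimal illustration, not the code in `maha_system`: the `hot_swap` helper, the `load`/`release` callables, and the stand-in string "models" are all hypothetical, chosen so the sketch runs without a GPU. It shows the one property that matters for a 16GB card: only one model is resident at any time.

```python
import gc
from contextlib import contextmanager
from typing import Callable, Iterator, TypeVar

T = TypeVar("T")


@contextmanager
def hot_swap(load: Callable[[], T], release: Callable[[T], None]) -> Iterator[T]:
    """Load one model, yield it, and always release it afterwards."""
    model = load()
    try:
        yield model
    finally:
        release(model)  # backend-specific cleanup (hypothetical hook)
        del model
        gc.collect()    # drop lingering Python references before the next load


# Stand-in "models" (plain strings) so the sketch runs anywhere.
loaded = []  # names of the models currently resident


def make_loader(name: str) -> Callable[[], str]:
    def load() -> str:
        loaded.append(name)
        return name
    return load


def release(name: str) -> None:
    loaded.remove(name)


peak = 0
for name in ["translator", "reasoner", "critic"]:
    with hot_swap(make_loader(name), release) as model:
        peak = max(peak, len(loaded))  # how many models are resident right now

print(peak, loaded)  # → 1 []
```

With a real backend, `load` might construct a `llama_cpp.Llama(model_path=..., n_gpu_layers=-1)` and `release` would free it; peak VRAM is then set by the largest single model rather than by the sum of all three.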

In [None]:
# @title 1. Setup: Install Dependencies
# Build llama-cpp-python with CUDA support (older releases used -DLLAMA_CUBLAS=on)
!CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python --quiet
!pip install huggingface_hub langgraph pyyaml --quiet

print("✅ Dependencies installed")

In [None]:
# @title 2. Mount Google Drive (Optional but Recommended)
# Models are large (~15GB total). Storing in Drive avoids re-downloading.

from google.colab import drive
import os

drive.mount('/content/drive')

# Symlink /content/models to Drive for persistent storage
MODEL_DIR = "/content/drive/MyDrive/maha-system/models"
os.makedirs(MODEL_DIR, exist_ok=True)
# -n replaces an existing link instead of nesting a new one inside it on re-runs
!ln -sfn "{MODEL_DIR}" /content/models

print(f"📁 Models will be stored in: {MODEL_DIR}")

In [None]:
# @title 3. Download Models (One-time)
# Downloads ~15GB of models. Takes 5-10 minutes depending on connection.

from huggingface_hub import hf_hub_download
import os

models_to_download = [
    {
        "name": "Sarvam-1 (2B) - Translator",
        "repo_id": "sarvamai/sarvam-1",
        "filename": "sarvam-1-2b-q4.gguf"
    },
    {
        "name": "DeepSeek-R1 (8B) - Reasoner",
        "repo_id": "deepseek-ai/DeepSeek-R1-Distill-Llama-8B-GGUF",
        "filename": "DeepSeek-R1-Distill-Llama-8B-Q4_K_M.gguf"
    },
    {
        "name": "Airavata (7B) - Critic",
        "repo_id": "ai4bharat/airavata",
        "filename": "airavata-7b-q4.gguf"
    }
]

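failed = []  # names of models that could not be downloaded
for model in models_to_download:
    print(f"⬇️ Downloading {model['name']}...")
    try:
        # Current huggingface_hub releases resume interrupted downloads and
        # write real files into local_dir by default, so the deprecated
        # resume_download / local_dir_use_symlinks arguments are omitted.
        path = hf_hub_download(
            repo_id=model["repo_id"],
            filename=model["filename"],
            local_dir="/content/models",
        )
        size = os.path.getsize(path) / (1024**3)
        print(f"   ✅ {size:.2f} GiB - {path}")
    except Exception as e:
        failed.append(model["name"])
        print(f"   ❌ Error: {e}")

# Only report success if every model actually downloaded
if failed:
    print(f"\n⚠️ Failed to download: {', '.join(failed)}")
else:
    print("\n🎵 Orchestra assembled!")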

In [None]:
# @title 4. Clone Maha-System Repository
!git clone https://github.com/yourusername/maha-system.git /content/maha-system
%cd /content/maha-system
!pip install -e . --quiet

In [None]:
# @title 5. Test VRAM Management
import torch
from maha_system.core import VRAMManager

print("GPU Status:")
print(f"  Available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"  Device: {torch.cuda.get_device_name(0)}")
    stats = VRAMManager.get_memory_stats()
    print(f"  Total VRAM: {stats['total_gb']:.2f} GB")
    print(f"  Currently allocated: {stats['allocated_gb']:.2f} GB")

# Test flush protocol
VRAMManager.flush()
print("\n✅ VRAM flush protocol working")

In [None]:
# @title 6. Run Maha-System Demo
from maha_system.core import JugaadOrchestrator, TRVPipeline
import yaml

# Configuration
MODEL_PATHS = {
    "translator": "/content/models/sarvam-1-2b-q4.gguf",
    "reasoner": "/content/models/DeepSeek-R1-Distill-Llama-8B-Q4_K_M.gguf",
    "critic": "/content/models/airavata-7b-q4.gguf"
}

# Load prompts
with open("prompts/meta_prompts.yaml") as f:
    PROMPTS = yaml.safe_load(f)

# Initialize
orchestrator = JugaadOrchestrator(MODEL_PATHS)
pipeline = TRVPipeline(orchestrator, PROMPTS)

# Test query (Hindi riddle; roughly: "Two pieces of one rope, both of them dry. What does this mean?")
TEST_QUERY = "एक रस्सी की दो टुकड़े, दोनों के दोनों रूखे। इसका मतलब क्या है?"

print("🧪 Running TRV Pipeline on Hindi riddle...")
print(f"Query: {TEST_QUERY}\n")

result = pipeline.execute(
    query=TEST_QUERY,
    language="hindi",
    enable_critic=True
)

print(f"\n🎯 Answer: {result['final_answer']}")
print(f"\nIterations: {result.get('iterations', 0)}")

In [None]:
# @title 7. Interactive Mode (Run this cell multiple times)
query = """एक रस्सी की दो टुकड़े, दोनों के दोनों रूखे""" #@param {type:"string"}
language = "hindi" #@param ["hindi", "tamil", "telugu", "hinglish"]
enable_critic = True #@param {type:"boolean"}
show_reasoning = False #@param {type:"boolean"}

result = pipeline.execute(
    query=query,
    language=language,
    enable_critic=enable_critic
)

print(f"\n📝 Answer:")
print(result['final_answer'])

if show_reasoning:
    print("\n🔍 Reasoning Trace:")
    for step in result['reasoning_trace']:
        output = step['output']
        # Truncate long outputs to 300 characters
        preview = output[:300] + "..." if len(output) > 300 else output
        print(f"\n{step['phase'].upper()}:")
        print(preview)

## 💡 Tips for Best Results

1. **VRAM Management**: If you hit out-of-memory (OOM) errors, restart the runtime (Ctrl+M, then .) and run cells 1-5 again.
2. **Model Hot-swap**: The system unloads each model after use. The first query is slow because the models have to be loaded; subsequent queries are faster.
3. **Critic Phase**: Disable it for faster responses; enable it for higher accuracy on complex reasoning.
4. **Language Support**: Works best for Hindi and Tamil. For Hinglish, use `language='hinglish'` and the Bridge model.
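
The speed/accuracy trade-off in tip 3 is easier to see with a sketch of a Translate-Reason-Verify loop. This is an assumption about the general shape of such a loop, not the actual `TRVPipeline` implementation: the `trv` function, the `max_iterations` parameter, and the stub callables are hypothetical. Only the result keys (`final_answer`, `iterations`, `reasoning_trace`) mirror what the cells above read.

```python
def trv(query, translate, reason, verify, enable_critic=True, max_iterations=2):
    """Translate -> Reason -> (optionally) Verify, revising when the critic objects."""
    english = translate(query)
    trace = [{"phase": "translate", "output": english}]
    answer = reason(english, None)
    trace.append({"phase": "reason", "output": answer})

    iterations = 0
    while enable_critic and iterations < max_iterations:
        iterations += 1
        ok, feedback = verify(english, answer)
        trace.append({"phase": "verify", "output": feedback})
        if ok:
            break
        # Critic rejected the draft: reason again with its feedback
        answer = reason(english, feedback)
        trace.append({"phase": "reason", "output": answer})

    return {"final_answer": answer, "iterations": iterations, "reasoning_trace": trace}


# Stub "models" standing in for the translator, reasoner, and critic.
def translate(query):
    return query.upper()

def reason(english, feedback):
    return "draft" if feedback is None else "revised"

def verify(english, answer):
    return (True, "looks fine") if answer == "revised" else (False, "needs revision")


with_critic = trv("riddle", translate, reason, verify, enable_critic=True)
without_critic = trv("riddle", translate, reason, verify, enable_critic=False)

print(with_critic["final_answer"], with_critic["iterations"])        # → revised 2
print(without_critic["final_answer"], without_critic["iterations"])  # → draft 0
```

In this sketch the critic-enabled run makes five model calls (one translate, two reason, two verify) against two for the critic-free run. Under hot-swapping, each extra call also implies another model load, which is why disabling the critic shortens response time.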

## 📊 Benchmark Results

Expected performance on T4 GPU:
- **Sarvam-1 (2B)**: 1.5GB VRAM, 50 tokens/sec
- **DeepSeek-R1 (8B)**: 5GB VRAM, 25 tokens/sec
- **Total Pipeline**: ~30-60 seconds per complex query (including hot-swap overhead)
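
As a rough sanity check on the pipeline figure, the per-model throughput numbers above can be combined with a guess at output lengths and swap cost. The token counts (200 and 800) and the 5-second swap overhead below are illustrative assumptions, not measurements, and the `estimate_seconds` helper is hypothetical; the critic is left out because no throughput is listed for it.

```python
def estimate_seconds(stages, swap_overhead_s):
    """Sum generation time (tokens / tokens_per_sec) plus one swap per stage."""
    generation = sum(tokens / tps for tokens, tps in stages)
    return generation + swap_overhead_s * len(stages)


stages = [
    (200, 50.0),  # translator: assumed 200 output tokens at 50 tokens/sec
    (800, 25.0),  # reasoner: assumed 800 output tokens at 25 tokens/sec
]

total = estimate_seconds(stages, swap_overhead_s=5.0)  # assumed 5 s per model load
print(f"{total:.0f} s")  # → 46 s
```

Under these assumptions the estimate is 4 s + 32 s of generation plus 10 s of swapping, which falls inside the 30-60 second range quoted above. Longer reasoning traces or an enabled critic would push it higher.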