# 🚀 Xoron Multimodal Model - Kaggle Setup Guide

This notebook demonstrates how to:
1. Install dependencies and set up kt-kernel
2. Download the Xoron model from HuggingFace
3. Load and run the model
4. Test text generation
5. Test image generation (snowy mountain)
6. Test video generation (windy mountain)

**Model:** `Backup-bdg/Xoron-Dev-MultiMoe`

**Features:**
- Full multimodal support (text, image, video, audio)
- MoE LLM with MLA (Multi-Head Latent Attention)
- 128K context with Ring Attention
- TiTok 1D tokenization for vision/video
- Conformer-based audio processing

---
## 📦 Step 1: Install Dependencies

This cell installs the required Python packages, including PyTorch and Transformers. kt-kernel itself is installed separately in Step 2.

In [None]:
# Install PyTorch with CUDA support
!pip install -q torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

# Install core dependencies
!pip install -q transformers safetensors huggingface_hub accelerate

# Install multimodal dependencies
!pip install -q Pillow opencv-python librosa soundfile

# Install CLI utilities
!pip install -q typer rich pyyaml

print("✅ Core dependencies installed!")
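
To confirm what the install cell actually pulled in, a quick check with the standard library's `importlib.metadata` can help. This is a minimal sketch and not part of the original notebook: the helper name `report_versions` is hypothetical, and the names it queries are pip distribution names (for example `opencv-python`, not `cv2`).

```python
from importlib.metadata import PackageNotFoundError, version


def report_versions(distributions):
    """Map each pip distribution name to its installed version, or None if missing."""
    found = {}
    for name in distributions:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Distribution names taken from the pip commands above.
wanted = ["torch", "transformers", "safetensors", "accelerate", "Pillow", "opencv-python"]
for name, ver in report_versions(wanted).items():
    print(f"   {name}: {ver or 'NOT INSTALLED'}")
```

A `NOT INSTALLED` entry usually means the corresponding `pip install` line failed quietly because of the `-q` flag, so rerunning that line without `-q` is a reasonable next step.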

---
## 🔧 Step 2: Install kt-kernel

kt-kernel provides high-performance CPU/GPU inference kernels for MoE models.

In [None]:
import os

# Install kt-kernel from the current repository
if os.path.exists('./kt-kernel'):
    %cd kt-kernel
    !pip install -q -e .
    %cd ..
    print("✅ kt-kernel installed from local directory!")
else:
    # If running from a different location, clone the repo first
    !git clone -b feature/xoron-multimodal-support https://github.com/nigfuapp-web/xformer.git temp_xformer
    %cd temp_xformer/kt-kernel
    !pip install -q -e .
    %cd ../..
    print("✅ kt-kernel installed from GitHub!")
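
After the install, it may be worth checking whether the package can be located at all before moving on. The sketch below assumes the importable module is named `kt_kernel`, which is the name Step 4 imports from; the helper `is_importable` is hypothetical.

```python
import importlib.util


def is_importable(module_name):
    """Return True if a top-level module can be located without importing it."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except (ImportError, ValueError):
        return False


print("kt_kernel importable:", is_importable("kt_kernel"))
```

If this prints `False`, Step 4 may still succeed, because that cell adds the `kt-kernel/python` directory to `sys.path` before importing.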

---
## 📥 Step 3: Download Xoron Model from HuggingFace

Downloads the `Backup-bdg/Xoron-Dev-MultiMoe` model.

In [None]:
from huggingface_hub import snapshot_download
import os

MODEL_REPO = "Backup-bdg/Xoron-Dev-MultiMoe"
MODEL_DIR = "./xoron-model"

if not os.path.exists(MODEL_DIR):
    print(f"📥 Downloading model from {MODEL_REPO}...")
    snapshot_download(
        repo_id=MODEL_REPO,
        local_dir=MODEL_DIR
    )
    print(f"✅ Model downloaded to {MODEL_DIR}")
else:
    print(f"✅ Model already exists at {MODEL_DIR}")

# List downloaded files
files = os.listdir(MODEL_DIR)
print("\n📁 Downloaded files:")
for f in files[:10]:
    print(f"   {f}")
if len(files) > 10:
    print(f"   ... and {len(files) - 10} more files")
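
The listing above shows file names only. To see how much disk space the download takes, a small helper along these lines could be used. It is a sketch with a hypothetical name, `directory_size_bytes`, and it relies only on the standard library.

```python
import os


def directory_size_bytes(root):
    """Sum the sizes of all files under root, including nested directories."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            path = os.path.join(dirpath, filename)
            if os.path.isfile(path):  # skips broken symlinks
                total += os.path.getsize(path)
    return total


model_dir = "./xoron-model"  # same path as MODEL_DIR above
if os.path.isdir(model_dir):
    print(f"Model size on disk: {directory_size_bytes(model_dir) / 1e9:.2f} GB")
```

Comparing this number with the free space in the Kaggle working directory is a cheap way to catch a partial download.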

---
## 🧠 Step 4: Load the Xoron Model

Loads the model and processor with automatic device placement.

In [None]:
import os
import sys
import torch

# Add kt-kernel to path
kt_kernel_path = './kt-kernel/python'
if os.path.exists(kt_kernel_path):
    sys.path.insert(0, kt_kernel_path)
elif os.path.exists('./temp_xformer/kt-kernel/python'):
    sys.path.insert(0, './temp_xformer/kt-kernel/python')

from kt_kernel.models.xoron import XoronForCausalLM, XoronMultimodalProcessor

print("🔄 Loading Xoron model...")
print(f"   PyTorch version: {torch.__version__}")
print(f"   CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"   GPU: {torch.cuda.get_device_name(0)}")
    print(f"   GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")

# Load model
model = XoronForCausalLM.from_pretrained(
    MODEL_DIR,
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True,
)

# Load processor
processor = XoronMultimodalProcessor.from_pretrained(MODEL_DIR)

device = next(model.parameters()).device
print(f"\n✅ Model loaded successfully!")
print(f"   Device: {device}")
print(f"   Model type: {type(model).__name__}")
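
Whether the model fits on the GPU reported above depends mostly on the weight size. With `torch.float16`, each parameter takes 2 bytes, so a rough estimate is simple arithmetic. The sketch below uses hypothetical parameter counts, since the notebook does not state Xoron's size, and it ignores activations and the KV cache.

```python
def estimate_weight_memory_gb(num_parameters, bytes_per_parameter=2):
    """Approximate weight memory in GB (1 GB = 1e9 bytes, as in the cell above)."""
    return num_parameters * bytes_per_parameter / 1e9


# Hypothetical parameter counts, for illustration only.
for billions in (1, 7, 13):
    gb = estimate_weight_memory_gb(billions * 1_000_000_000)
    print(f"   {billions}B parameters in float16: ~{gb:.1f} GB of weights")
```

Once the model is loaded, the real count is available as `sum(p.numel() for p in model.parameters())` and can be passed to the same helper.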

---
## 💬 Step 5: Test Text Generation

Let's test the model with a simple text prompt.

In [None]:
def generate_text(prompt, max_tokens=256, temperature=0.7):
    """Generate text from a prompt."""
    inputs = processor(text=prompt, return_tensors="pt")
    inputs = {k: v.to(device) if hasattr(v, 'to') else v for k, v in inputs.items()}
    
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_tokens,
            temperature=temperature,
            do_sample=True,
            top_p=0.9,
        )
    
    response = processor.decode(outputs[0], skip_special_tokens=True)
    return response

# Test text generation
print("="*60)
print("🤖 TEST: Text Generation")
print("="*60)

prompt = "Hello! Tell me about yourself and your capabilities as a multimodal AI assistant."
print(f"\n📝 Prompt: {prompt}\n")

response = generate_text(prompt)
print(f"💬 Response:\n{response}")
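
The `temperature` and `top_p` arguments in `generate_text` control how the next token is sampled. Assuming `model.generate` follows the usual Transformers sampling semantics, the idea can be sketched in plain Python as below. This is an illustration of the technique on a toy list of logits, not the library's implementation, and both function names are hypothetical.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Turn logits into probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p=0.9):
    """Keep the most likely tokens until their cumulative mass reaches top_p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for index in order:
        kept.append(index)
        cumulative += probs[index]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


toy_logits = [2.0, 1.0, 0.5, -1.0]  # hypothetical scores for four tokens
probs = softmax_with_temperature(toy_logits, temperature=0.7)
print(top_p_filter(probs, top_p=0.9))
```

Lowering `temperature` or `top_p` makes the output more deterministic; raising them increases variety at the cost of coherence.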

---
## 🏔️ Step 6: Generate Snowy Mountain Image

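Ask the model to generate an image of a snow-covered mountain. The cell first requests a text response, then attempts direct image generation if the model exposes a generator.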

In [None]:
print("="*60)
print("🏔️ TEST: Image Generation - Snowy Mountain")
print("="*60)

prompt = """Generate a beautiful picture of a majestic mountain peak covered with lots of pristine white snow. 
The scene should have dramatic lighting, with the sun casting golden rays on the snow-covered peaks. 
Include a clear blue sky and some pine trees at the base of the mountain. 
Make it photorealistic and highly detailed."""

print(f"\n📝 Prompt: {prompt}\n")

# Generate response
response = generate_text(prompt, max_tokens=300)
print(f"💬 Response:\n{response}")

# Try actual image generation if available
if hasattr(model, 'generate_image') and hasattr(model.config, 'has_generator') and model.config.has_generator:
    print("\n🎨 Attempting image generation...")
    try:
        inputs = processor(text=prompt, return_tensors="pt")
        inputs = {k: v.to(device) if hasattr(v, 'to') else v for k, v in inputs.items()}
        
        with torch.no_grad():
            hidden = model(**inputs, output_hidden_states=True, return_dict=True)
            image = model.generate_image(hidden.hidden_states[-1])
            
        if image is not None:
            from PIL import Image
            import numpy as np
            
            img_np = image[0].cpu().permute(1, 2, 0).numpy()
            img_np = ((img_np + 1) * 127.5).clip(0, 255).astype(np.uint8)
            img = Image.fromarray(img_np)
            img.save("snowy_mountain.png")
            print("✅ Image saved to: snowy_mountain.png")
            display(img)
    except Exception as e:
        print(f"⚠️ Image generation not available: {e}")
else:
    print("\n📝 Note: Direct image generation requires trained generator weights.")
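
The conversion in the cell above assumes the generator returns values in `[-1, 1]` in channel-first `(C, H, W)` order, and maps them to 8-bit pixels in `(H, W, C)` order for PIL. The pure-Python sketch below mirrors that arithmetic on nested lists so the mapping is easy to inspect; the function names are hypothetical, and the notebook itself does this with NumPy.

```python
def to_uint8(value):
    """Map a value in [-1, 1] to an integer in [0, 255], clipping out-of-range input."""
    scaled = (value + 1.0) * 127.5
    return int(min(255.0, max(0.0, scaled)))  # truncates, like astype(np.uint8)


def chw_to_hwc(image):
    """Reorder a nested-list image from (C, H, W) to (H, W, C) and convert to 8-bit."""
    channels, height, width = len(image), len(image[0]), len(image[0][0])
    return [
        [[to_uint8(image[c][y][x]) for c in range(channels)] for x in range(width)]
        for y in range(height)
    ]


# Hypothetical 3-channel, 1x2 image with values in [-1, 1].
demo = [[[-1.0, 1.0]], [[0.0, 0.0]], [[1.0, -1.0]]]
print(chw_to_hwc(demo))
```

If the generator instead returned values in `[0, 1]`, this mapping would wash the image out, so checking the tensor's minimum and maximum before saving is worthwhile.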

---
## 🌬️ Step 7: Generate Windy Mountain Video

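Ask the model to generate a video of a mountain in windy conditions. As in Step 6, the cell requests a text response first, then attempts direct video generation if a video generator is available.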

In [None]:
print("="*60)
print("🌬️ TEST: Video Generation - Windy Mountain")
print("="*60)

prompt = """Generate a video of a mountain landscape with windy climate. 
Show trees swaying dramatically in the strong wind, clouds moving rapidly across the sky, 
and perhaps some snow being blown off the mountain peaks. 
The atmosphere should feel dynamic and powerful, with dramatic weather conditions.
Make it cinematic with smooth camera movement."""

print(f"\n📝 Prompt: {prompt}\n")

# Generate response
response = generate_text(prompt, max_tokens=300)
print(f"💬 Response:\n{response}")

# Try actual video generation if available
if hasattr(model, 'generate_video') and hasattr(model.config, 'has_video_generator') and model.config.has_video_generator:
    print("\n🎬 Attempting video generation...")
    try:
        inputs = processor(text=prompt, return_tensors="pt")
        inputs = {k: v.to(device) if hasattr(v, 'to') else v for k, v in inputs.items()}
        
        with torch.no_grad():
            hidden = model(**inputs, output_hidden_states=True, return_dict=True)
            video = model.generate_video(hidden.hidden_states[-1], num_frames=16)
            
        if video is not None:
            import cv2
            import numpy as np
            
            video_np = video[0].cpu().numpy()
            h, w = video_np.shape[2], video_np.shape[3]
            
            fourcc = cv2.VideoWriter_fourcc(*'mp4v')
            out = cv2.VideoWriter('windy_mountain.mp4', fourcc, 8, (w, h))
            
            for frame in video_np:
                frame = np.transpose(frame, (1, 2, 0))
                frame = ((frame + 1) * 127.5).clip(0, 255).astype(np.uint8)
                out.write(cv2.cvtColor(frame, cv2.COLOR_RGB2BGR))
            
            out.release()
            print("✅ Video saved to: windy_mountain.mp4")
    except Exception as e:
        print(f"⚠️ Video generation not available: {e}")
else:
    print("\n📝 Note: Direct video generation requires trained generator weights.")
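
The writer code above assumes `video[0]` has the layout `(frames, channels, height, width)`. A small validation step before opening the `VideoWriter` can make a wrong layout fail loudly instead of producing a corrupt file. This is a sketch with a hypothetical helper name, and the 256x256 resolution is only an example.

```python
def describe_video(shape, fps):
    """Validate a (frames, channels, height, width) shape and derive writer settings."""
    if len(shape) != 4:
        raise ValueError(f"expected 4 dimensions (T, C, H, W), got {len(shape)}")
    frames, channels, height, width = shape
    if channels != 3:
        raise ValueError(f"expected 3 colour channels, got {channels}")
    return {
        "frame_size": (width, height),  # cv2.VideoWriter takes (width, height)
        "num_frames": frames,
        "duration_seconds": frames / fps,
    }


# 16 frames at 8 fps, matching the values used in the cell above.
print(describe_video((16, 3, 256, 256), fps=8))
```

In the notebook, `describe_video(video_np.shape, fps=8)` would be called right after `video_np` is created.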

---
## 🎤 Step 8: Test Audio Understanding (Optional)

If you have an audio file, you can test audio understanding.

In [None]:
print("="*60)
print("🎤 TEST: Audio Capabilities")
print("="*60)

# Check if audio capabilities are available
if hasattr(model.config, 'has_audio_encoder') and model.config.has_audio_encoder:
    print("✅ Audio encoder is available")
    print("   - Can understand spoken language")
    print("   - Supports raw waveform input")
else:
    print("⚠️ Audio encoder not available in current model")

if hasattr(model.config, 'has_audio_decoder') and model.config.has_audio_decoder:
    print("✅ Audio decoder is available")
    print("   - Can generate speech (TTS)")
    print("   - Supports zero-shot speaker cloning")
else:
    print("⚠️ Audio decoder not available in current model")

# Example of how to use audio input (if you have an audio file)
print("\n📝 To test audio understanding, you can use:")
print("""
inputs = processor(
    text="What is being said in this audio?",
    audio=["path/to/audio.wav"],
    return_tensors="pt"
)
inputs = {k: v.to(device) if hasattr(v, 'to') else v for k, v in inputs.items()}

outputs = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(outputs[0]))
""")
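
The notebook checks each capability flag with a `hasattr(...) and ...` pair. The same checks can be collected in one place with `getattr` and a default, as sketched below. The flag names are the ones this notebook already uses; `capability_report` and the stand-in config are hypothetical.

```python
from types import SimpleNamespace

# Flags checked elsewhere in this notebook.
CAPABILITY_FLAGS = (
    "has_vision_encoder",
    "has_audio_encoder",
    "has_audio_decoder",
    "has_generator",
    "has_video_generator",
)


def capability_report(config):
    """Return {flag: bool}, treating a missing attribute as False."""
    return {flag: bool(getattr(config, flag, False)) for flag in CAPABILITY_FLAGS}


# Stand-in config for illustration; in the notebook, pass model.config instead.
demo_config = SimpleNamespace(has_audio_encoder=True, has_generator=False)
for flag, enabled in capability_report(demo_config).items():
    print(f"   {'✅' if enabled else '⚠️'} {flag}")
```

Treating a missing attribute as `False` keeps the check from raising `AttributeError` on checkpoints whose config omits a flag.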

---
## 🖼️ Step 9: Test Image Understanding (Optional)

Test the model's ability to understand images.

In [None]:
print("="*60)
print("🖼️ TEST: Image Understanding")
print("="*60)

# Check if vision capabilities are available
if hasattr(model.config, 'has_vision_encoder') and model.config.has_vision_encoder:
    print("✅ Vision encoder is available")
    print(f"   - Vision model: {model.config.vision_model_name}")
    print(f"   - TiTok enabled: {model.config.use_vision_titok}")
    print(f"   - Dual-stream: {model.config.use_vision_dual_stream}")
else:
    print("⚠️ Vision encoder not available in current model")

# Example of how to use image input
print("\n📝 To analyze an image, you can use:")
print("""
from PIL import Image

# Load an image
image = Image.open("your_image.jpg")

# Process with the model
inputs = processor(
    text="Describe this image in detail.",
    images=[image],
    return_tensors="pt"
)
inputs = {k: v.to(device) if hasattr(v, 'to') else v for k, v in inputs.items()}

outputs = model.generate(**inputs, max_new_tokens=200)
print(processor.decode(outputs[0]))
""")

---
## 🌐 Step 10: Start API Server (Optional)

You can also run the model as an API server using SGLang. SGLang is not installed by the earlier steps, so install it separately first.

In [None]:
print("="*60)
print("🌐 API Server Instructions")
print("="*60)

print("""
To run Xoron as an API server, open a terminal and run:

```bash
python -m sglang.launch_server \\
    --model ./xoron-model \\
    --host 0.0.0.0 \\
    --port 8000 \\
    --trust-remote-code \\
    --tensor-parallel-size 1 \\
    --max-total-tokens 8192 \\
    --mem-fraction-static 0.85
```

Then you can make API calls:

```python
import requests

response = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "xoron",
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 100
    }
)
print(response.json())
```
""")
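
For a client that does not depend on `requests`, the same call can be made with the standard library. The sketch below assumes the server returns an OpenAI-style chat completion (a `choices` list whose first entry holds a `message`), which is what the `/v1/chat/completions` path suggests; the helper names are hypothetical.

```python
import json
import urllib.request


def build_chat_request(prompt, model="xoron", max_tokens=100):
    """Build the JSON payload used in the example above."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }


def extract_reply(response_json):
    """Pull the assistant text out of an OpenAI-style chat completion, or None."""
    choices = response_json.get("choices") or []
    if not choices:
        return None
    return choices[0].get("message", {}).get("content")


def chat(prompt, base_url="http://localhost:8000", timeout=60):
    """POST a prompt to the server; raises urllib.error.URLError if it is unreachable."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_chat_request(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return extract_reply(json.load(response))
```

With the server running, `print(chat("Hello!"))` prints only the reply text rather than the full JSON body.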

---
## 📊 Model Information

In [None]:
print("="*60)
print("📊 Model Configuration")
print("="*60)

config = model.config

print(f"\n🧠 LLM Architecture:")
print(f"   Hidden size: {config.hidden_size}")
print(f"   Num layers: {config.num_layers}")
print(f"   Num heads: {config.num_heads}")
print(f"   Vocab size: {config.vocab_size}")
print(f"   Max positions: {config.max_position_embeddings}")

print(f"\n🎯 MoE Configuration:")
print(f"   Use MoE: {config.use_moe}")
print(f"   Num experts: {config.num_experts}")
print(f"   Experts per token: {config.num_experts_per_tok}")
print(f"   MoE layer freq: {config.moe_layer_freq}")

print(f"\n👁️ Vision Configuration:")
print(f"   Vision model: {config.vision_model_name}")
print(f"   Use TiTok: {config.use_vision_titok}")
print(f"   Num vision tokens: {config.num_vision_tokens}")

print(f"\n🎬 Video Configuration:")
print(f"   Use VideoTiTok: {config.use_video_titok}")
print(f"   Max frames: {config.video_max_frames}")

print(f"\n🎤 Audio Configuration:")
print(f"   Sample rate: {config.audio_sample_rate}")
print(f"   Use raw waveform: {config.use_raw_waveform}")

print(f"\n🎨 Generation Configuration:")
print(f"   Enable generation: {config.enable_generation}")
print(f"   Use flow matching: {config.generation_use_flow_matching}")
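
The two MoE numbers printed above determine how sparse each MoE layer is: only `num_experts_per_tok` of the `num_experts` experts run for a given token. The sketch below computes that fraction, assuming standard top-k routing with no always-on shared experts; the example values are hypothetical, not Xoron's actual configuration.

```python
def active_expert_fraction(num_experts, num_experts_per_tok):
    """Fraction of experts in one MoE layer that process a given token."""
    if not 0 < num_experts_per_tok <= num_experts:
        raise ValueError("num_experts_per_tok must be between 1 and num_experts")
    return num_experts_per_tok / num_experts


# Hypothetical values; substitute config.num_experts and config.num_experts_per_tok.
print(f"   Active experts per token: {active_expert_fraction(8, 2):.0%}")
```

All expert weights still have to be held in memory, so this fraction describes compute per token, not memory use.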

---
## 🧹 Cleanup (Optional)

In [None]:
# Uncomment to free GPU memory
# import gc
# del model
# del processor
# gc.collect()  # drop lingering references before clearing the CUDA cache
# torch.cuda.empty_cache()
# print("✅ Memory cleared")

---
## 📚 Summary

This notebook demonstrated:

1. ✅ Installing kt-kernel and dependencies
2. ✅ Downloading the Xoron model from HuggingFace
3. ✅ Loading the model with automatic device placement
4. ✅ Text generation capabilities
5. ✅ Image generation (snowy mountain)
6. ✅ Video generation (windy mountain)
7. ✅ Checking audio and vision understanding capabilities
8. ✅ API server deployment options

**Repository:** https://github.com/nigfuapp-web/xformer (branch: `feature/xoron-multimodal-support`)

**Model:** `Backup-bdg/Xoron-Dev-MultiMoe`