# Overseer Gemma 3n Model Training Notebook

This notebook provides a comprehensive training pipeline for the Overseer system using Google's Gemma 3n model. It includes:

- **Data Preparation**: Loading and preprocessing training data from multiple sources
- **Model Configuration**: Setting up the Gemma model with custom parameters
- **Fine-tuning**: Training the model on system administration and AI assistant tasks
- **Evaluation**: Testing model performance and generating metrics
- **Continuous Learning**: Integration with user feedback for model improvement

## Prerequisites

Before running this notebook, ensure you have:
1. Valid Hugging Face token with access to Gemma models
2. Kaggle API credentials for data collection
3. Required Python packages installed (see requirements.txt)
4. Sufficient GPU memory (recommended: 16GB+ VRAM)
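
The credential prerequisites can be checked before anything else runs. The sketch below is a minimal helper, assuming the environment variable names `HF_TOKEN`, `KAGGLE_USERNAME`, and `KAGGLE_KEY` that this notebook reads later; the function name `missing_credentials` is hypothetical and not part of the Overseer codebase.

```python
import os

# Variable names taken from the environment check later in this notebook.
REQUIRED_ENV_VARS = ("HF_TOKEN", "KAGGLE_USERNAME", "KAGGLE_KEY")


def missing_credentials(env=None):
    """Return the names of required variables that are unset or empty.

    `env` defaults to os.environ; passing a plain dict makes the check testable.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_ENV_VARS if not env.get(name)]


missing = missing_credentials()
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
else:
    print("All credentials found.")
```

Only `HF_TOKEN` is strictly needed to load the model; the notebook falls back to synthetic data when the Kaggle credentials are absent.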

In [None]:
# Import Required Libraries
import os
import sys
import warnings
import logging
from dotenv import load_dotenv
from pathlib import Path

# ML and DL libraries
import torch
import torch.nn as nn
import numpy as np
import pandas as pd
from tqdm.auto import tqdm

# Transformers and datasets
from transformers import (
    AutoTokenizer, 
    AutoModelForCausalLM,
    TrainingArguments,
    Trainer,
    DataCollatorForLanguageModeling,
    EarlyStoppingCallback
)
from datasets import Dataset, DatasetDict

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display, HTML, clear_output

# Custom modules (make sure the training directory is in path)
sys.path.append('/Users/lionelweng/Downloads/Overseer/training')
from training_config import TrainingConfig
from data_preparation import SystemCommandsDataset
from fine_tuning import OverseerTrainer
from evaluation import ModelEvaluator
from continuous_learning import ContinuousLearningManager
from kaggle_data_collector import KaggleDataCollector

# Setup logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Load environment variables
load_dotenv(dotenv_path='/Users/lionelweng/Downloads/Overseer/.env')

print("✅ All imports successful!")
print(f"🔥 PyTorch version: {torch.__version__}")
import transformers  # imported as a module here only to report its version
print(f"🤗 Transformers version: {transformers.__version__}")
print(f"💾 CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"🚀 GPU: {torch.cuda.get_device_name()}")
    print(f"💰 GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")

## 1. Training Configuration Setup

Let's initialize our training configuration with all the parameters needed for fine-tuning the Gemma model.
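
For orientation, here is a sketch of what a configuration object with these fields might look like. It is not the project's actual `TrainingConfig` (that lives in `training_config.py`); the class name `TrainingConfigSketch` and every default value are placeholders, and only the field names are taken from the attributes this notebook reads.

```python
import math
from dataclasses import dataclass


@dataclass
class TrainingConfigSketch:
    """Hypothetical stand-in for TrainingConfig; all defaults are placeholders."""

    base_model: str = "your-gemma-3n-checkpoint"  # placeholder, not a real model ID
    learning_rate: float = 2e-5
    batch_size: int = 4
    num_epochs: int = 3
    max_sequence_length: int = 512
    use_gpu: bool = True
    mixed_precision: bool = True
    output_dir: str = "./overseer_model"
    train_split: float = 0.8
    val_split: float = 0.1
    test_split: float = 0.1

    def __post_init__(self):
        # The three fractions must cover the whole dataset.
        total = self.train_split + self.val_split + self.test_split
        if not math.isclose(total, 1.0):
            raise ValueError(f"Splits must sum to 1.0, got {total}")


sketch_config = TrainingConfigSketch()
print(sketch_config)
```

Validating the split fractions at construction time catches a misconfigured split before any data is loaded.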

In [None]:
# Initialize Training Configuration
config = TrainingConfig()

# Display current configuration
print("🔧 Training Configuration:")
print("=" * 50)
print(f"Base Model: {config.base_model}")
print(f"Learning Rate: {config.learning_rate}")
print(f"Batch Size: {config.batch_size}")
print(f"Epochs: {config.num_epochs}")
print(f"Max Sequence Length: {config.max_sequence_length}")
print(f"Use GPU: {config.use_gpu}")
print(f"Mixed Precision: {config.mixed_precision}")
print(f"Output Directory: {config.output_dir}")
print("=" * 50)

# Create output directory if it doesn't exist
os.makedirs(config.output_dir, exist_ok=True)
print(f"📁 Output directory ready: {config.output_dir}")

# Check environment variables
hf_token = os.getenv('HF_TOKEN')
kaggle_username = os.getenv('KAGGLE_USERNAME')
kaggle_key = os.getenv('KAGGLE_KEY')

print("\n🔐 Environment Check:")
print(f"HF Token: {'✅ Set' if hf_token else '❌ Missing'}")
print(f"Kaggle Username: {'✅ Set' if kaggle_username else '❌ Missing'}")
print(f"Kaggle Key: {'✅ Set' if kaggle_key else '❌ Missing'}")

if not hf_token:
    print("\n⚠️  Warning: HF_TOKEN not found. You'll need this to access Gemma models.")
    print("Set it in your .env file: HF_TOKEN=your_token_here")

## 2. Data Preparation

Now let's prepare our training data by collecting datasets from Kaggle and generating synthetic examples.
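
Combining several sources tends to produce empty or duplicated records. The sketch below shows one way to clean a list of examples before splitting, assuming each example is a dict with `input` and `output` keys (the shape this notebook indexes later). The helper `clean_examples` is hypothetical and not part of `SystemCommandsDataset`.

```python
def clean_examples(examples):
    """Drop empty and duplicate examples, preserving the original order."""
    seen = set()
    cleaned = []
    for example in examples:
        text_in = str(example.get("input", "")).strip()
        text_out = str(example.get("output", "")).strip()
        if not text_in or not text_out:
            continue  # skip records with a missing or blank field
        key = (text_in, text_out)
        if key in seen:
            continue  # skip exact duplicates
        seen.add(key)
        cleaned.append({"input": text_in, "output": text_out})
    return cleaned


raw = [
    {"input": "list files", "output": "ls -la"},
    {"input": "list files", "output": "ls -la"},  # duplicate
    {"input": "", "output": "noop"},              # empty input
]
print(len(clean_examples(raw)))
```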

In [None]:
# Initialize data components
print("🗂️  Initializing data collection components...")
kaggle_collector = KaggleDataCollector()
dataset_processor = SystemCommandsDataset()

# Option to download fresh data from Kaggle
download_fresh_data = True  # Set to False to skip Kaggle download

if download_fresh_data and kaggle_username and kaggle_key:
    print("\n📥 Downloading fresh data from Kaggle...")
    try:
        dataset_processor.load_kaggle_data(kaggle_collector)
        print("✅ Kaggle data download completed!")
    except Exception as e:
        print(f"⚠️  Could not download Kaggle data: {e}")
        print("Proceeding with synthetic data only...")
else:
    print("⏭️  Skipping Kaggle download - using existing/synthetic data only")

# Generate training examples
print("\n🔄 Creating training examples...")
training_examples = dataset_processor.create_training_examples()
synthetic_examples = dataset_processor.generate_synthetic_data()

# Combine all training data
all_training_data = training_examples + synthetic_examples
print(f"📊 Total training examples: {len(all_training_data)}")
print(f"   - From datasets: {len(training_examples)}")
print(f"   - Synthetic: {len(synthetic_examples)}")

# Display some sample training examples
print("\n📋 Sample Training Examples:")
print("=" * 60)
for i, example in enumerate(all_training_data[:3]):
    print(f"\nExample {i+1}:")
    print(f"Input: {example['input'][:100]}...")
    print(f"Output: {example['output'][:100]}...")
print("=" * 60)

# Create train/validation/test splits
print("\n🔀 Creating data splits...")
# Shuffle first: the combined list is ordered (dataset examples, then synthetic),
# so an unshuffled split may leave the test set almost entirely synthetic.
import random
random.Random(42).shuffle(all_training_data)  # fixed seed (arbitrary) for reproducibility

train_size = int(len(all_training_data) * config.train_split)
val_size = int(len(all_training_data) * config.val_split)

train_data = all_training_data[:train_size]
val_data = all_training_data[train_size:train_size + val_size]
test_data = all_training_data[train_size + val_size:]  # remainder goes to the test set

print("📈 Data Split Summary:")
print(f"   - Training: {len(train_data)} examples ({config.train_split:.0%})")
print(f"   - Validation: {len(val_data)} examples ({config.val_split:.0%})")
print(f"   - Test: {len(test_data)} examples ({config.test_split:.0%})")

## 3. Model Loading and Setup

Let's load the Gemma model and configure it for our training task.
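
Before tokenization, each example has to be rendered as a single training string. The sketch below uses the same `User: ... Assistant:` layout that the evaluation cells in this notebook use for prompts. Whether `OverseerTrainer.prepare_dataset` formats examples this way is an assumption, and `format_example` is a hypothetical helper.

```python
def format_example(example, eos_token=""):
    """Render one input/output pair in the User/Assistant prompt layout.

    `eos_token` is appended so the model learns where a reply ends;
    in practice it would come from the tokenizer (e.g. tokenizer.eos_token).
    """
    prompt = f"User: {example['input']}\nAssistant:"
    return f"{prompt} {example['output']}{eos_token}"


print(format_example({"input": "Show disk usage", "output": "Run `df -h`."}))
```

Keeping the training layout identical to the inference prompt matters: a mismatch between the two is a common cause of poor generations after fine-tuning.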

In [None]:
# Initialize the trainer with our configuration
print("🤖 Initializing Overseer Trainer...")
try:
    trainer = OverseerTrainer(config)
    print("✅ Trainer initialized successfully!")
    print(f"📄 Tokenizer vocab size: {len(trainer.tokenizer)}")
    print(f"🧠 Model parameters: {trainer.model.num_parameters():,}")
    
    # Display model info
    print(f"\n🔍 Model Details:")
    print(f"   - Architecture: {trainer.model.config.architectures}")
    print(f"   - Hidden size: {trainer.model.config.hidden_size}")
    print(f"   - Attention heads: {trainer.model.config.num_attention_heads}")
    print(f"   - Layers: {trainer.model.config.num_hidden_layers}")
    print(f"   - Vocab size: {trainer.model.config.vocab_size}")
    
except Exception as e:
    print(f"❌ Error initializing trainer: {e}")
    print("Check your HF_TOKEN and model access permissions.")
    raise

# Prepare datasets for training
print(f"\n🔄 Preparing datasets...")
train_dataset = trainer.prepare_dataset(train_data)
val_dataset = trainer.prepare_dataset(val_data)

print(f"✅ Datasets prepared:")
print(f"   - Training dataset: {len(train_dataset)} examples")
print(f"   - Validation dataset: {len(val_dataset)} examples")

# Display tokenized example
print(f"\n🔤 Sample Tokenized Input:")
sample_idx = 0
sample_input = train_dataset[sample_idx]['input_ids']
print(f"Token IDs: {sample_input[:20]}...")
print(f"Decoded: {trainer.tokenizer.decode(sample_input[:20])}...")
print(f"Full length: {len(sample_input)} tokens")

## 4. Training the Model

Now let's start the actual training process. This will fine-tune the Gemma model on our system administration and AI assistant tasks.
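
A duration estimate needs two numbers: the number of optimizer steps and the time per step. The sketch below computes the first exactly (assuming no gradient accumulation and that the last partial batch is kept) and leaves the second as a parameter to be measured on your hardware. Both function names are hypothetical.

```python
import math


def estimate_training_steps(num_examples, batch_size, num_epochs):
    """Optimizer steps for a run, assuming no gradient accumulation."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * num_epochs


def estimate_minutes(total_steps, seconds_per_step):
    """Convert a step count into minutes for a measured per-step time."""
    return total_steps * seconds_per_step / 60


steps = estimate_training_steps(num_examples=1000, batch_size=4, num_epochs=3)
print(f"{steps} steps, about {estimate_minutes(steps, seconds_per_step=2.0):.0f} minutes at 2 s/step")
```

Timing the first few steps of a real run and feeding that value into `estimate_minutes` gives a far better estimate than any fixed assumption.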

In [None]:
# Setup training monitoring
import time
from datetime import datetime

print("🚀 Starting model training...")
print(f"📅 Training started at: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
# Rough estimate only: assumes about one training step per second
total_steps = config.num_epochs * len(train_dataset) // config.batch_size
print(f"⏱️  Estimated training time: ~{total_steps // 60} minutes (assuming ~1 step/s)")

# Start training
start_time = time.time()

try:
    # Call the training method from our trainer
    training_results = trainer.train(train_dataset, val_dataset)
    
    training_time = time.time() - start_time
    print(f"\n✅ Training completed successfully!")
    print(f"⏱️  Total training time: {training_time/60:.1f} minutes")
    print(f"📊 Final training loss: {training_results.get('train_loss', 'N/A')}")
    print(f"📊 Final validation loss: {training_results.get('eval_loss', 'N/A')}")
    
except Exception as e:
    print(f"❌ Training failed: {e}")
    print("This might be due to insufficient GPU memory or other issues.")
    print("Try reducing batch_size or max_sequence_length in the config.")
    raise

# Memory cleanup
if torch.cuda.is_available():
    torch.cuda.empty_cache()
    current_memory = torch.cuda.memory_allocated() / 1e9
    max_memory = torch.cuda.max_memory_allocated() / 1e9
    print(f"\n💾 GPU Memory Usage:")
    print(f"   - Current: {current_memory:.1f} GB")
    print(f"   - Peak: {max_memory:.1f} GB")

## 5. Model Evaluation

Let's evaluate our trained model's performance on various tasks and generate some test outputs.
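
Printed samples are useful for spot checks, but a numeric score makes runs comparable. The sketch below implements two simple text-overlap metrics, exact match and token-level F1, over whitespace-separated lowercase tokens. These are generic metrics chosen for illustration; they are not necessarily what the project's `ModelEvaluator` computes.

```python
from collections import Counter


def _tokens(text):
    """Lowercase and split on whitespace."""
    return text.lower().split()


def exact_match(prediction, reference):
    """True when both texts contain the same token sequence."""
    return _tokens(prediction) == _tokens(reference)


def token_f1(prediction, reference):
    """Harmonic mean of token precision and recall (bag-of-words overlap)."""
    pred, ref = _tokens(prediction), _tokens(reference)
    if not pred or not ref:
        return float(pred == ref)  # 1.0 only if both are empty
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


print(token_f1("use nvidia-smi", "run nvidia-smi"))
```

Overlap metrics reward surface similarity only, so a correct answer phrased differently can score low; treat them as a coarse signal alongside manual review.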

In [None]:
# Evaluate on test data
print("📊 Evaluating model on test dataset...")

if len(test_data) > 0:
    test_dataset = trainer.prepare_dataset(test_data)
    
    # Generate some predictions
    print(f"\n🔮 Generating test predictions...")
    test_examples = test_data[:5]  # Take first 5 test examples
    
    for i, example in enumerate(test_examples):
        print(f"\n--- Test Example {i+1} ---")
        print(f"Input: {example['input']}")
        print(f"Expected: {example['output'][:100]}...")
        
        # Generate prediction using the model
        try:
            input_text = f"User: {example['input']}\nAssistant:"
            inputs = trainer.tokenizer(input_text, return_tensors="pt", truncation=True, max_length=512)
            
            # Move inputs to whichever device the model is on (CUDA, MPS, or CPU)
            inputs = {k: v.to(trainer.model.device) for k, v in inputs.items()}
            
            with torch.no_grad():
                outputs = trainer.model.generate(
                    **inputs,
                    max_new_tokens=150,
                    temperature=0.7,
                    do_sample=True,
                    pad_token_id=trainer.tokenizer.eos_token_id
                )
            
            prediction = trainer.tokenizer.decode(outputs[0], skip_special_tokens=True)
            # Extract just the assistant's response
            if "Assistant:" in prediction:
                prediction = prediction.split("Assistant:")[-1].strip()
            
            print(f"Generated: {prediction[:100]}...")
            
        except Exception as e:
            print(f"Error generating prediction: {e}")
        
        print("-" * 50)
else:
    print("⚠️  No test data available for evaluation")

# Summarize the run (dataset and model sizes; no quality metrics are computed here)
print(f"\n📈 Model Performance Summary:")
print("=" * 50)
print(f"✅ Training completed successfully")
print(f"📊 Training examples processed: {len(train_data)}")
print(f"📊 Validation examples: {len(val_data)}")
print(f"📊 Test examples: {len(test_data)}")
print(f"⚡ Model size: {trainer.model.num_parameters():,} parameters")
print(f"💾 Output saved to: {config.output_dir}")
print("=" * 50)

## 6. Interactive Model Testing

Test your trained model with custom prompts!
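
The decoded output of a causal language model contains the prompt followed by the reply, so the reply has to be separated out. The sketch below isolates that step as a small pure function, assuming the `Assistant:` marker used in this notebook's prompts; `extract_assistant_reply` is a hypothetical helper name.

```python
def extract_assistant_reply(decoded_text, marker="Assistant:"):
    """Return the text after the last occurrence of `marker`.

    If the marker is absent, the whole text is returned (stripped).
    """
    _, found, reply = decoded_text.rpartition(marker)
    return reply.strip() if found else decoded_text.strip()


print(extract_assistant_reply("User: How do I list files?\nAssistant: Use `ls -la`."))
```

One caveat of splitting on the last marker: if the model itself emits `Assistant:` inside its reply, everything before that point is dropped.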

In [None]:
def test_model_interactive(prompt, max_tokens=200, temperature=0.7):
    """
    Test the trained model with a custom prompt
    """
    print(f"🤖 Testing prompt: {prompt}")
    print("-" * 60)
    
    try:
        # Format the input
        input_text = f"User: {prompt}\nAssistant:"
        inputs = trainer.tokenizer(input_text, return_tensors="pt", truncation=True, max_length=512)
        
        # Move inputs to whichever device the model is on (CUDA, MPS, or CPU)
        inputs = {k: v.to(trainer.model.device) for k, v in inputs.items()}
        
        # Generate response
        with torch.no_grad():
            outputs = trainer.model.generate(
                **inputs,
                max_new_tokens=max_tokens,
                temperature=temperature,
                do_sample=True,
                pad_token_id=trainer.tokenizer.eos_token_id,
                repetition_penalty=1.1
            )
        
        # Decode the response
        full_response = trainer.tokenizer.decode(outputs[0], skip_special_tokens=True)
        
        # Extract just the assistant's response
        if "Assistant:" in full_response:
            response = full_response.split("Assistant:")[-1].strip()
        else:
            response = full_response
        
        print(f"🎯 Response: {response}")
        
    except Exception as e:
        print(f"❌ Error generating response: {e}")
    
    print("-" * 60)

# Test with some example prompts
test_prompts = [
    "How do I monitor GPU usage on my system?",
    "I need to find all Python files in my project",
    "What's the best way to backup my database?",
    "Help me set up a development environment for machine learning",
    "How do I check which processes are using the most memory?"
]

print("🧪 Testing model with sample prompts...")
for i, prompt in enumerate(test_prompts, 1):
    print(f"\n🔍 Test {i}/{len(test_prompts)}:")
    test_model_interactive(prompt, max_tokens=150, temperature=0.7)
    
print("\n✅ Interactive testing completed!")
print("\n💡 Tip: You can call test_model_interactive('your prompt here') to test with custom prompts!")

## 7. Model Saving and Deployment

Let's save our trained model and prepare it for deployment.
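
When checking what was written to disk, note that a saved model directory can contain subdirectories (for example, intermediate checkpoints), which a top-level listing misses. The sketch below sums file sizes recursively; `directory_size_bytes` is a hypothetical helper.

```python
import os


def directory_size_bytes(path):
    """Total size of all regular files under `path`, including subdirectories."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            if not os.path.islink(file_path):  # skip symlinks to avoid double counting
                total += os.path.getsize(file_path)
    return total


print(f"Current directory: {directory_size_bytes('.') / (1024 ** 2):.2f} MiB")
```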

In [None]:
# Save the trained model and tokenizer
print("💾 Saving trained model...")

try:
    # Save model and tokenizer
    trainer.model.save_pretrained(config.output_dir)
    trainer.tokenizer.save_pretrained(config.output_dir)
    
    # Save configuration
    import json
    config_dict = {
        'base_model': config.base_model,
        'learning_rate': config.learning_rate,
        'batch_size': config.batch_size,
        'num_epochs': config.num_epochs,
        'max_sequence_length': config.max_sequence_length,
        'training_completed': datetime.now().isoformat(),
        'total_training_examples': len(train_data),
        'model_parameters': trainer.model.num_parameters()
    }
    
    with open(os.path.join(config.output_dir, 'training_config.json'), 'w') as f:
        json.dump(config_dict, f, indent=2)
    
    print(f"✅ Model saved successfully to: {config.output_dir}")
    print(f"📁 Saved files:")
    saved_files = os.listdir(config.output_dir)
    for file in saved_files:
        print(f"   - {file}")
    
    # Calculate model size
    model_size = sum(os.path.getsize(os.path.join(config.output_dir, f)) 
                    for f in saved_files if os.path.isfile(os.path.join(config.output_dir, f)))
    print(f"📏 Total model size: {model_size / (1024**3):.2f} GB")
    
except Exception as e:
    print(f"❌ Error saving model: {e}")
    raise  # Stop here so the success summary below is not printed after a failed save

# Create deployment summary
print(f"\n🚀 Deployment Information:")
print("=" * 50)
print(f"Model Location: {config.output_dir}")
print(f"Base Model: {config.base_model}")
print(f"Training Date: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
print(f"Parameters: {trainer.model.num_parameters():,}")
print(f"Training Examples: {len(train_data)}")
print(f"Validation Examples: {len(val_data)}")
print("=" * 50)

print(f"\n📋 To use this model in production:")
print(f"```python")
print(f"from transformers import AutoTokenizer, AutoModelForCausalLM")
print(f"")
print(f"tokenizer = AutoTokenizer.from_pretrained('{config.output_dir}')")
print(f"model = AutoModelForCausalLM.from_pretrained('{config.output_dir}')")
print(f"```")

# Integration with continuous learning
print(f"\n🔄 Setting up continuous learning...")
try:
    learning_manager = ContinuousLearningManager()
    print("✅ Continuous learning manager initialized")
    print("📝 User interactions will be logged for future training iterations")
except Exception as e:
    print(f"⚠️  Could not initialize continuous learning: {e}")

print(f"\n🎉 Training pipeline completed successfully!")
print(f"📊 Summary:")
print(f"   - Model: Gemma 3n fine-tuned for system administration")
print(f"   - Training time: {training_time/60:.1f} minutes")
print(f"   - Examples processed: {len(train_data)}")
print(f"   - Model saved to: {config.output_dir}")
print(f"   - Ready for deployment!")

## 8. Next Steps and Usage

🎉 **Congratulations!** You've successfully trained your Overseer Gemma 3n model.

### What you accomplished:
- ✅ Loaded and configured the Gemma 3n base model
- ✅ Prepared training data from multiple sources
- ✅ Fine-tuned the model on system administration tasks
- ✅ Evaluated model performance
- ✅ Saved the trained model for deployment

### Next steps:
1. **Integration**: Integrate this model into the Overseer backend system
2. **Testing**: Run more comprehensive evaluation on real-world tasks
3. **Monitoring**: Set up continuous learning to improve the model over time
4. **Optimization**: Consider model quantization for faster inference
5. **Deployment**: Deploy to production environment with proper monitoring

### Usage in production:
```python
# Load your trained model
from transformers import AutoTokenizer, AutoModelForCausalLM

model_path = "your_output_directory"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path)

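# Generate responses
def generate_response(prompt):
    inputs = tokenizer(f"User: {prompt}\nAssistant:", return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=200)
    # Decode only the newly generated tokens so the prompt is not echoed back
    new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True).strip()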
```

### Continuous improvement:
- Monitor user interactions and feedback
- Periodically retrain with new data
- A/B test different model versions
- Track performance metrics in production
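
The first two points can be sketched as a small feedback log. The example below appends each interaction to a JSON Lines file and later reads back the well-rated ones in the `input`/`output` shape used for training in this notebook. The file format, the 1-5 rating scale, and both function names are assumptions for illustration; this is not the `ContinuousLearningManager` API.

```python
import json
from datetime import datetime, timezone


def log_interaction(path, prompt, response, rating=None):
    """Append one interaction as a JSON line; `rating` is an optional 1-5 score."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "prompt": prompt,
        "response": response,
        "rating": rating,
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return record


def load_positive_examples(path, min_rating=4):
    """Read the log and keep interactions rated at or above `min_rating`."""
    examples = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue
            record = json.loads(line)
            rating = record.get("rating")
            if rating is not None and rating >= min_rating:
                examples.append({"input": record["prompt"], "output": record["response"]})
    return examples
```

An append-only log keeps writes cheap at serving time and leaves filtering decisions, such as the rating threshold, to the retraining job.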