# 🚀 Google Colab Training Notebook for Trading Bot ML Models

This notebook provides a complete ML training pipeline that can be run in Google Colab. It will:

1. **Clone your repository** or upload as ZIP
2. **Install dependencies** with fallback handling
3. **Mount Google Drive** for artifact storage
4. **Train models** using your existing modules
5. **Save artifacts** to both repo and Drive
6. **Validate models** and generate manifest
7. **Download results** as ZIP file

## 📋 Instructions:

1. **Replace `GITHUB_REPO_URL`** with your actual repository URL (or upload the repo as a ZIP)
2. **Set `CFG["fast_test"] = True`** for quick testing, or `False` for full training
3. **Run all cells** in order
4. **Check outputs** and download your trained models

⚠️ **Private repos**: Put a personal access token in the URL: `https://TOKEN@github.com/user/repo.git`
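
Embedding a token in the clone URL means the token can end up in notebook output. A minimal sketch of one way to handle that, assuming the token lives in an environment variable called `GITHUB_TOKEN` (a hypothetical name, not something this notebook defines). The helpers `build_clone_url` and `redact_url` are illustrative only:

```python
import os
from urllib.parse import urlsplit, urlunsplit

def build_clone_url(repo_url, token=None):
    """Return repo_url with the token inserted as the userinfo part, if given."""
    if not token:
        return repo_url
    parts = urlsplit(repo_url)
    host = parts.netloc.split("@")[-1]
    return urlunsplit(parts._replace(netloc=f"{token}@{host}"))

def redact_url(url):
    """Strip any userinfo (token) from a URL so it is safe to print."""
    parts = urlsplit(url)
    host = parts.netloc.split("@")[-1]
    return urlunsplit(parts._replace(netloc=host))

# GITHUB_TOKEN is an assumed variable name; set it yourself before running.
url = build_clone_url("https://github.com/user/repo.git", os.environ.get("GITHUB_TOKEN"))
print(redact_url(url))
```

Pass `url` to git, and print only the redacted form.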

In [None]:
# 🔧 Configuration Section - MODIFY THESE VALUES
import os
import sys
from datetime import datetime

# ===== USER CONFIGURATION =====
GITHUB_REPO_URL = "https://github.com/krish567366/bot-model.git"  # Replace with your GitHub repo URL
REPO_NAME = "bot-model"  # Your repository name

# Training configuration
SYMBOL = "BTC-USD"
INTERVAL = "1m"  # Recorded in model metadata; Step 5 requests "1h" bars regardless of this value
CFG = {
    "fast_test": False,     # True = quick smoke test, False = full training
    "horizon": 5,           # Prediction horizon
    "pos_thresh": 0.002,    # Positive threshold (0.2%)
    "n_splits": 2,          # Cross-validation splits
    "seed": 42,             # Random seed
    "n_estimators": 100,    # Boosting rounds (fast_test)
    "n_estimators_full": 1000  # Boosting rounds (full training)
}

# Paths (automatically configured)
REPO_PATH = f"/content/{REPO_NAME}"
MODEL_SAVE_REPO_PATH = f"{REPO_PATH}/models/"
MODEL_SAVE_DRIVE_PATH = "/content/drive/MyDrive/trading_bot_models/"
RUN_TIMESTAMP = datetime.now().strftime("%Y%m%d_%H%M%S")

# Global state
DRIVE_MOUNTED = False
MODULES_IMPORTED = False

print("🔧 Configuration loaded:")
print(f"  Symbol: {SYMBOL}")
print(f"  Fast test mode: {CFG['fast_test']}")
print(f"  Repository: {GITHUB_REPO_URL.split('@')[-1]}")  # Hide any embedded access token
print(f"  Run timestamp: {RUN_TIMESTAMP}")

# 🎯 Data Strategy: Real Market Data Integration

This notebook now integrates with your **existing data pipeline** for production-quality training:

## 📊 **Data Source Priority:**
1. **🌟 Real Market Data** (Yahoo Finance via your pipeline)
2. **💾 Cached Data** (From previous runs)
3. **🔧 Synthetic Data** (Fallback only)

## 🚀 **Your Data Pipeline Features:**
- ✅ **Multi-source ingestion** (yfinance, Alpha Vantage, CCXT)
- ✅ **Data validation & cleaning**
- ✅ **Technical indicator computation**
- ✅ **Flexible storage** (SQLite/PostgreSQL)
- ✅ **Real-time & historical processing**
- ✅ **ML-ready feature engineering**

## ⚡ **What This Means:**
- **Training on REAL market data** instead of synthetic
- **Production-grade features** from your existing pipeline
- **Consistent data between training and inference**
- **Automatic fallback** if real data is unavailable

**Set `CFG['fast_test'] = False` for full 2-year training dataset!**
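
The source priority above amounts to trying each source in order and keeping the first one that returns data. A minimal sketch of that pattern follows. The three loaders are stand-ins invented for illustration, not the repo's pipeline, and `load_with_fallback` is a hypothetical helper:

```python
def load_with_fallback(loaders):
    """Try each (name, loader) pair in priority order; return the first success."""
    errors = {}
    for name, loader in loaders:
        try:
            data = loader()
        except Exception as exc:  # a failed source should not stop the chain
            errors[name] = str(exc)
            continue
        if data:  # treat an empty result as a miss and move on
            return name, data, errors
    raise RuntimeError(f"All data sources failed: {errors}")

# Stand-in loaders for illustration only.
def real_market_data():
    raise ConnectionError("network unavailable")

def cached_data():
    return []  # nothing cached yet

def synthetic_data():
    return [100.0, 100.5, 99.8]

source, rows, errors = load_with_fallback([
    ("real", real_market_data),
    ("cache", cached_data),
    ("synthetic", synthetic_data),
])
print(source, len(rows), errors)
```

Returning the source name with the data lets later cells report what was actually used, not just what was configured.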

# 📥 Step 1: Clone Repository

Clone your trading bot repository to access the training modules.
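
As a sketch of the clone step on its own, the snippet below builds the `git clone` command as an argument list, so the URL and path need no shell quoting. The shallow `--depth 1` clone is an assumption that only the latest commit is needed for training. `build_clone_command` and `clone` are illustrative names:

```python
import subprocess

def build_clone_command(repo_url, dest, shallow=True):
    """Build the git command as an argument list (no shell quoting needed)."""
    cmd = ["git", "clone"]
    if shallow:
        cmd += ["--depth", "1"]  # skip history; fetch only the latest commit
    return cmd + [repo_url, dest]

def clone(repo_url, dest):
    """Run git clone; print git's error output on failure and return success."""
    result = subprocess.run(build_clone_command(repo_url, dest),
                            capture_output=True, text=True)
    if result.returncode != 0:
        print(result.stderr.strip())
    return result.returncode == 0
```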

In [None]:
def clone_repository():
    """Clone the GitHub repository"""
    import shutil
    import subprocess
    
    if GITHUB_REPO_URL == "<YOUR_REPO_URL>":
        print("❌ Please set GITHUB_REPO_URL in the configuration section above!")
        print("   Example: GITHUB_REPO_URL = 'https://github.com/username/trading-bot.git'")
        print("   For private repos: GITHUB_REPO_URL = 'https://TOKEN@github.com/username/trading-bot.git'")
        return False
    
    try:
        # Print the URL without any embedded access token
        safe_url = GITHUB_REPO_URL.split("@")[-1]
        print(f"🔄 Cloning repository from {safe_url}...")
        
        # Remove existing directory if present
        if os.path.exists(REPO_PATH):
            print("📁 Removing existing repository...")
            shutil.rmtree(REPO_PATH)
        
        # Clone repository (an argument list avoids shell-quoting problems)
        result = subprocess.run(["git", "clone", GITHUB_REPO_URL, REPO_PATH])
        
        if result.returncode == 0 and os.path.exists(REPO_PATH):
            print("✅ Repository cloned successfully")
            
            # Add to Python path
            if REPO_PATH not in sys.path:
                sys.path.insert(0, REPO_PATH)
                print(f"✅ Added {REPO_PATH} to Python path")
            
            # Check for key files
            key_files = ["requirements.txt", "ai/", "core/"]
            for file in key_files:
                if os.path.exists(os.path.join(REPO_PATH, file)):
                    print(f"  ✓ Found {file}")
                else:
                    print(f"  ⚠️  Missing {file}")
            
            return True
        else:
            print("❌ Repository cloning failed")
            return False
            
    except Exception as e:
        print(f"❌ Error cloning repository: {e}")
        return False

# Alternative: Upload ZIP file
def upload_repo_zip():
    """Upload repository as ZIP file (alternative to git clone)"""
    try:
        from google.colab import files
        print("📁 Upload your repository as a ZIP file:")
        uploaded = files.upload()
        
        if len(uploaded) == 1:
            zip_name = list(uploaded.keys())[0]
            print(f"📦 Extracting {zip_name}...")
            
            import zipfile
            with zipfile.ZipFile(zip_name, 'r') as zip_ref:
                # Top-level folder names inside the archive
                top_level = sorted({n.split('/')[0] for n in zip_ref.namelist() if '/' in n})
                zip_ref.extractall("/content/")
            
            # Use the archive's own top-level folder; scanning /content/ could
            # pick up unrelated directories such as "drive" or ".config"
            extracted_dirs = [d for d in top_level
                              if d != "__MACOSX" and os.path.isdir(f"/content/{d}")]
            
            if extracted_dirs:
                global REPO_PATH
                REPO_PATH = f"/content/{extracted_dirs[0]}"
                
                if REPO_PATH not in sys.path:
                    sys.path.insert(0, REPO_PATH)
                
                print(f"✅ Repository extracted to {REPO_PATH}")
                return True
            else:
                print("❌ Could not find extracted directory")
                return False
        else:
            print("❌ Please upload exactly one ZIP file")
            return False
            
    except ImportError:
        print("⚠️  Not running in Colab - ZIP upload not available")
        return False
    except Exception as e:
        print(f"❌ Error uploading ZIP: {e}")
        return False

# Clone repository (or use ZIP upload as fallback)
clone_success = clone_repository()

if not clone_success:
    print("\n💡 Alternative: Upload repository as ZIP file")
    print("   Uncomment the next line to use ZIP upload instead:")
    print("   # clone_success = upload_repo_zip()")
    
    # Uncomment this line if you want to use ZIP upload:
    # clone_success = upload_repo_zip()

# 📦 Step 2: Install Dependencies

Install required Python packages with robust fallback handling.
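
One way to keep this step fast on reruns is to install only what is missing. The sketch below checks importability first and then calls pip through the running interpreter. It is an illustration, not the cell's actual logic; `missing_packages` and `pip_install` are hypothetical helpers. Note that the import name and the pip name can differ (`sklearn` vs `scikit-learn`):

```python
import importlib.util
import subprocess
import sys

def missing_packages(import_to_pip):
    """Return pip names for top-level modules that cannot currently be imported."""
    return [pip_name for module, pip_name in import_to_pip.items()
            if importlib.util.find_spec(module) is None]

def pip_install(packages):
    """Install with the pip that belongs to the running interpreter."""
    if not packages:
        return True
    cmd = [sys.executable, "-m", "pip", "install", "-q", *packages]
    return subprocess.run(cmd).returncode == 0

# Maps import name -> pip name.
wanted = {"sklearn": "scikit-learn", "lightgbm": "lightgbm", "joblib": "joblib"}
print(missing_packages(wanted))
```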

In [None]:
def install_dependencies():
    """Install required dependencies with fallbacks"""
    
    print("🔄 Installing dependencies...")
    
    # Try requirements.txt first
    requirements_path = os.path.join(REPO_PATH, "requirements.txt")
    
    if os.path.exists(requirements_path):
        print("📄 Found requirements.txt, installing...")
        # Use the running interpreter's pip so packages land in this kernel's environment
        result = os.system(f'"{sys.executable}" -m pip install -q -r "{requirements_path}"')
        
        if result == 0:
            print("✅ Requirements installed from requirements.txt")
        else:
            print("⚠️  Some packages from requirements.txt failed, continuing with manual installs...")
    else:
        print("📄 No requirements.txt found, installing core packages...")
    
    # Core ML packages
    core_packages = [
        "pandas", "numpy", "scikit-learn", "joblib",
        "lightgbm", "xgboost", "matplotlib", "seaborn",
        "nest_asyncio"  # For async support in Colab
    ]
    
    # Optional packages (won't fail if not installed)
    optional_packages = [
        "catboost", "optuna", "shap", "yfinance", "ccxt", 
        "ta", "alpha_vantage", "sqlalchemy"  # For data pipeline
    ]
    
    # os.system reports failure through its return code, not an exception,
    # so no try/except is needed around these calls
    print("🔄 Installing core ML packages...")
    for package in core_packages:
        result = os.system(f'"{sys.executable}" -m pip install -q {package}')
        if result == 0:
            print(f"  ✓ {package}")
        else:
            print(f"  ⚠️  {package} - failed")
    
    print("🔄 Installing data pipeline packages...")
    for package in optional_packages:
        result = os.system(f'"{sys.executable}" -m pip install -q {package}')
        if result == 0:
            print(f"  ✓ {package}")
        else:
            print(f"  ⚠️  {package} - skipped")
    
    # Verify key packages
    print("\n🔍 Verifying package installations...")
    key_imports = {
        'pandas': 'pd',
        'numpy': 'np',
        'sklearn': 'sklearn',
        'lightgbm': 'lgb',
        'joblib': 'joblib',
        'nest_asyncio': 'nest_asyncio',
        'yfinance': 'yf'
    }
    
    successful_imports = []
    failed_imports = []
    
    for package, alias in key_imports.items():
        try:
            __import__(package)
            successful_imports.append(package)
            print(f"  ✓ {package}")
        except ImportError:
            failed_imports.append(package)
            print(f"  ❌ {package}")
    
    print(f"\n✅ Successfully imported: {len(successful_imports)}/{len(key_imports)} key packages")
    
    if failed_imports:
        print(f"⚠️  Failed imports: {failed_imports}")
        print("   Training will continue but some features may be unavailable")
        
        # Special handling for yfinance failure
        if 'yfinance' in failed_imports:
            print("   📉 yfinance unavailable - will use synthetic data only")
    
    return len(failed_imports) == 0

# Install dependencies
install_success = install_dependencies()

# 💾 Step 3: Mount Google Drive

Mount Google Drive to save trained models for long-term storage.

In [None]:
def mount_google_drive():
    """Mount Google Drive safely"""
    global DRIVE_MOUNTED
    
    try:
        from google.colab import drive
        print("🔄 Mounting Google Drive...")
        drive.mount('/content/drive')
        
        # Verify mount
        if os.path.exists('/content/drive/MyDrive'):
            print("✅ Google Drive mounted successfully")
            print(f"📁 Drive path: {MODEL_SAVE_DRIVE_PATH}")
            
            # Create models directory in Drive if it doesn't exist
            os.makedirs(MODEL_SAVE_DRIVE_PATH, exist_ok=True)
            DRIVE_MOUNTED = True
            return True
        else:
            print("❌ Drive mount verification failed")
            return False
            
    except ImportError:
        print("⚠️  Not running in Google Colab - Drive mount skipped")
        return False
    except Exception as e:
        print(f"⚠️  Drive mount failed: {e}")
        print("Continuing without Drive backup...")
        return False

# Mount Google Drive
mount_success = mount_google_drive()

if mount_success:
    print("💡 Models will be saved to both repo and Google Drive")
else:
    print("💡 Models will be saved to repo only")

# 📥 Step 4: Import Modules

Import the trading bot modules and verify everything is working.
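
The import logic in this step follows one pattern throughout: try the preferred module, fall back to an older one, and otherwise continue without it. A compact sketch of that pattern with `importlib` is shown below. `import_first` is a hypothetical helper, not part of the repo, and the module names are the ones this notebook tries, so they only resolve inside the cloned repository:

```python
import importlib

def import_first(candidates):
    """Return (module_name, attribute) for the first candidate that imports.

    candidates is a list of (module_name, attribute_name) pairs in priority
    order. Returns (None, None) if none of them is available.
    """
    for module_name, attr in candidates:
        try:
            module = importlib.import_module(module_name)
            return module_name, getattr(module, attr)
        except (ImportError, AttributeError):
            continue
    return None, None

name, compute_features = import_first([
    ("ai.feature_engineering_v2", "compute_features_deterministic"),
    ("ai.feature_engineering", "compute_features_deterministic"),
])
print(name or "feature engineering module not found")
```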

In [None]:
def import_trading_modules():
    """Import trading bot modules with fallbacks"""
    global MODULES_IMPORTED
    
    print("🔄 Importing trading bot modules...")
    
    # Core imports
    import pandas as pd
    import numpy as np
    import joblib
    from datetime import datetime, timedelta
    import json
    
    # Set random seed
    np.random.seed(CFG['seed'])
    
    # Try to import trading bot modules
    modules = {}
    
    try:
        # Real data integration (NEW)
        try:
            from ai.real_data_integration import MLDataIntegrator, get_real_training_data
            modules['real_data'] = MLDataIntegrator({
                'data_sources': ['yfinance'],
                'storage_path': f'{REPO_PATH}/data/',
                'cache_enabled': True,
                'real_time_enabled': False
            })
            print("  ✓ ai.real_data_integration - REAL DATA ENABLED!")
        except ImportError:
            print("  ⚠️  Real data integration not found - will use synthetic data")
        
        # Feature engineering
        try:
            from ai.feature_engineering_v2 import compute_features_deterministic
            modules['feature_engineering'] = compute_features_deterministic
            print("  ✓ ai.feature_engineering_v2")
        except ImportError:
            try:
                from ai.feature_engineering import compute_features_deterministic
                modules['feature_engineering'] = compute_features_deterministic
                print("  ✓ ai.feature_engineering")
            except ImportError:
                print("  ⚠️  Feature engineering module not found - will use synthetic features")
        
        # Training modules
        try:
            from ai.training_v2 import train_lightgbm_model
            modules['training'] = train_lightgbm_model
            print("  ✓ ai.training_v2")
        except ImportError:
            try:
                from ai.train_lgbm import train_and_validate_lgbm
                modules['training'] = train_and_validate_lgbm
                print("  ✓ ai.train_lgbm")
            except ImportError:
                print("  ⚠️  Training module not found - will use built-in training")
        
        # Model registry
        try:
            from ai.registry import ModelRegistry
            modules['registry'] = ModelRegistry()
            print("  ✓ ai.registry")
        except ImportError:
            print("  ⚠️  Model registry not found - will save manually")
        
        # Data pipeline components
        try:
            from core.pipeline import YFinanceSource, DataPipeline
            modules['data_pipeline'] = DataPipeline()
            modules['yfinance'] = YFinanceSource()
            print("  ✓ core.pipeline - Data sources available")
        except ImportError:
            print("  ⚠️  Data pipeline not found")
        
        # Data generation (for testing)
        try:
            from ai.training_v2 import generate_synthetic_ohlcv_data
            modules['data_generator'] = generate_synthetic_ohlcv_data
            print("  ✓ Synthetic data generator")
        except ImportError:
            print("  ⚠️  Will create basic synthetic data")
        
        MODULES_IMPORTED = True
        print("✅ Module import completed")
        
        # Show data source priority
        if 'real_data' in modules:
            print("\n🎯 DATA SOURCE PRIORITY:")
            print("  1. Real market data (Yahoo Finance)")
            print("  2. Cached historical data")
            print("  3. Synthetic data (fallback)")
        else:
            print("\n⚠️  Using synthetic data only")
            
        return modules
        
    except Exception as e:
        print(f"❌ Error importing modules: {e}")
        print("   Will proceed with basic fallback implementations")
        return {}

# Import modules
trading_modules = import_trading_modules()

# 🔄 Step 5: Generate Training Data

Create or load training data for model development.
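
Both targets in this step come from the forward return over `CFG['horizon']` bars: the regression target is the return itself, and the binary label is 1 when that return exceeds `CFG['pos_thresh']`. A small worked example in plain Python, using made-up prices (`make_labels` is an illustrative helper):

```python
def make_labels(closes, horizon, pos_thresh):
    """Return (future_returns, binary_labels) for rows that have a full horizon."""
    future_returns = [closes[i + horizon] / closes[i] - 1
                      for i in range(len(closes) - horizon)]
    labels = [1 if r > pos_thresh else 0 for r in future_returns]
    return future_returns, labels

closes = [100.0, 100.1, 100.5, 100.2, 100.9]  # made-up prices
rets, labels = make_labels(closes, horizon=2, pos_thresh=0.002)
print([round(r, 4) for r in rets], labels)
```

The last `horizon` rows have no future price, so they get no label. The cell below handles the same case by masking rows whose future return is NaN.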

In [None]:
async def generate_training_data():
    """Generate training data - prioritizing real market data over synthetic"""
    import pandas as pd
    import numpy as np
    
    # Try to use real data first
    if 'real_data' in trading_modules:
        try:
            print("🌟 Using REAL MARKET DATA from your data pipeline!")
            
            # Determine data parameters based on test mode
            if CFG['fast_test']:
                period = "6m"  # 6 months for fast testing (yfinance itself spells this "6mo")
                interval = "1h"
                print(f"  📊 Fast test mode: {period} of {interval} data")
            else:
                period = "2y"   # 2 years for full training
                interval = "1h"
                print(f"  📊 Full training mode: {period} of {interval} data")
            
            # Fetch real training data using your pipeline
            integrator = trading_modules['real_data']
            dataset = await integrator.prepare_training_dataset(
                symbol=SYMBOL,
                period=period,
                interval=interval,
                horizon=CFG['horizon'],
                pos_thresh=CFG['pos_thresh']
            )
            
            print("✅ Real data loaded successfully!")
            print(f"  📈 Data source: {dataset['metadata'].get('data_source', 'Yahoo Finance')}")
            print(f"  📅 Date range: {dataset['metadata']['data_range']['start']} to {dataset['metadata']['data_range']['end']}")
            print(f"  📊 Raw data points: {len(dataset['ohlcv_data'])}")
            print(f"  🧮 ML features: {dataset['metadata']['n_features']}")
            print(f"  🎯 Training samples: {dataset['metadata']['n_samples']}")
            print(f"  📈 Class distribution: {dataset['metadata']['class_distribution']}")
            
            # Return real data
            return dataset['X'], dataset['y_binary'], dataset['y_regression'], dataset['timestamps']
            
        except Exception as e:
            print(f"❌ Real data loading failed: {e}")
            print("📉 Falling back to synthetic data...")
    
    # Fallback to synthetic data
    print("🔧 Generating synthetic training data...")
    
    # Data size based on test mode
    n_periods = 500 if CFG['fast_test'] else 2000
    
    print(f"🔄 Generating {n_periods} periods of synthetic data...")
    
    # Generate synthetic OHLCV data
    dates = pd.date_range(start='2023-01-01', periods=n_periods, freq='h')  # 'H' is deprecated in recent pandas
    
    # Random walk with drift for realistic price movement
    np.random.seed(CFG['seed'])
    returns = np.random.normal(0.0001, 0.01, n_periods)  # Small positive drift
    log_prices = np.cumsum(returns)
    prices = 50000 * np.exp(log_prices)  # Start around $50,000
    
    data = []
    for i, (date, price) in enumerate(zip(dates, prices)):
        # Generate OHLC bars whose high/low always bracket both open and close
        volatility = abs(np.random.normal(0, 0.008))  # Per-bar range, ~0.8% scale
        open_price = prices[i-1] if i > 0 else price
        high = max(open_price, price) * (1 + volatility)
        low = min(open_price, price) * (1 - volatility)
        volume = np.random.uniform(100, 1000)
        
        data.append({
            'timestamp': date,
            'open': open_price,
            'high': high,
            'low': low,
            'close': price,
            'volume': volume
        })
    
    df = pd.DataFrame(data)
    
    # Create features
    print("🔄 Computing features...")
    features = pd.DataFrame(index=df.index)
    
    # Price features
    features['returns'] = df['close'].pct_change()
    features['log_returns'] = np.log(df['close'] / df['close'].shift(1))
    features['price_ma5'] = df['close'].rolling(5).mean()
    features['price_ma20'] = df['close'].rolling(20).mean()
    features['price_std'] = df['close'].rolling(20).std()
    
    # Volume features
    features['volume'] = df['volume']
    features['volume_ma5'] = df['volume'].rolling(5).mean()
    features['volume_ratio'] = df['volume'] / features['volume_ma5']
    
    # Technical indicators
    features['rsi'] = compute_rsi(df['close'], 14)
    features['macd'] = compute_macd(df['close'])
    features['bollinger_upper'], features['bollinger_lower'] = compute_bollinger_bands(df['close'])
    
    # Volatility
    features['volatility'] = features['returns'].rolling(20).std()
    features['volatility_ma'] = features['volatility'].rolling(5).mean()
    
    # Clean features
    features = features.dropna()
    
    # Create labels
    print("🔄 Creating labels...")
    future_periods = CFG['horizon']
    threshold = CFG['pos_thresh']
    
    # Calculate future returns
    future_returns = df['close'].shift(-future_periods) / df['close'] - 1
    
    # Binary classification: Will price move up > threshold?
    labels_binary = (future_returns > threshold).astype(int)
    
    # Regression target: actual future return
    labels_regression = future_returns
    
    # Align features and labels: keep rows that have complete features
    # (features was already passed through dropna) and a known future return
    has_future = future_returns.loc[features.index].notna().to_numpy()
    valid_idx = features.index[has_future]
    
    X = features.loc[valid_idx].reset_index(drop=True)
    y_binary = labels_binary.loc[valid_idx].reset_index(drop=True)
    y_regression = labels_regression.loc[valid_idx].reset_index(drop=True)
    timestamps = df['timestamp'].loc[valid_idx].reset_index(drop=True)
    
    print("✅ Synthetic dataset created:")
    print(f"  Samples: {len(X)}")
    print(f"  Features: {X.shape[1]}")
    print(f"  Time range: {timestamps.iloc[0]} to {timestamps.iloc[-1]}")
    print(f"  Binary class distribution: {y_binary.value_counts().to_dict()}")
    print(f"  Regression target stats: mean={y_regression.mean():.4f}, std={y_regression.std():.4f}")
    
    return X, y_binary, y_regression, timestamps

def compute_rsi(prices, window=14):
    """Compute RSI using simple rolling means of gains and losses (not Wilder's smoothing)"""
    delta = prices.diff()
    gain = (delta.where(delta > 0, 0)).rolling(window=window).mean()
    loss = (-delta.where(delta < 0, 0)).rolling(window=window).mean()
    rs = gain / loss
    return 100 - (100 / (1 + rs))

def compute_macd(prices, fast=12, slow=26):
    """Compute the MACD line (fast EMA minus slow EMA); no signal line is returned"""
    ema_fast = prices.ewm(span=fast).mean()
    ema_slow = prices.ewm(span=slow).mean()
    return ema_fast - ema_slow

def compute_bollinger_bands(prices, window=20, std_dev=2):
    """Compute Bollinger Bands"""
    ma = prices.rolling(window).mean()
    std = prices.rolling(window).std()
    upper = ma + (std * std_dev)
    lower = ma - (std * std_dev)
    return upper, lower

# Generate training data
print("🚀 Starting data generation...")

# Colab and Jupyter kernels already run an event loop and support top-level
# `await`, so the coroutine is awaited directly. Calling asyncio.run() here
# would fail with "cannot be called from a running event loop".
try:
    import nest_asyncio
    nest_asyncio.apply()  # Lets pipeline code that starts its own loop run inside the kernel's loop
except ImportError:
    pass  # Top-level await below works without nest_asyncio

X, y_binary, y_regression, timestamps = await generate_training_data()

print("\n🎉 Data generation completed!")
print(f"Final dataset: {len(X)} samples, {len(X.columns)} features")

# 🏋️ Step 6: Train Models

Train machine learning models using LightGBM and other algorithms.
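
Later cells refer to `binary_splits`, `regression_splits` and `CFG['n_splits']`, but this section does not show how the splits are built. For time-ordered market data, one common choice is an expanding-window split, where each test fold comes strictly after its training data. The sketch below is an assumption about a reasonable approach, not the repo's implementation, and `expanding_window_splits` is a hypothetical name:

```python
def expanding_window_splits(n_samples, n_splits):
    """Yield (train_indices, test_indices) with every test fold after its train fold."""
    fold = n_samples // (n_splits + 1)
    if fold == 0:
        raise ValueError("not enough samples for the requested number of splits")
    for k in range(1, n_splits + 1):
        train_end = fold * k
        # The final fold absorbs any leftover samples
        test_end = n_samples if k == n_splits else train_end + fold
        yield range(0, train_end), range(train_end, test_end)

for train_idx, test_idx in expanding_window_splits(n_samples=10, n_splits=2):
    print(list(train_idx), list(test_idx))
```

scikit-learn's `TimeSeriesSplit` provides the same idea for use with estimators. Shuffled splits would leak future information into training.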

In [None]:
# Feature Analysis and Overview
print("📊 FEATURE ANALYSIS:")
print(f"Total Features: {len(X.columns)}")
print(f"Training Samples: {len(X)}")

# Show data source information
data_source = "Real Market Data via your existing data pipeline" if 'real_data' in trading_modules else "Synthetic trading data"
print(f"Data Source: {data_source}")

if 'real_data' in trading_modules:
    print("🌟 Using PRODUCTION-GRADE features from your data pipeline:")
    print("   • Technical indicators from ta library")
    print("   • Market data from Yahoo Finance")
    print("   • Advanced feature engineering")
    print("   • Volume and volatility metrics")
else:
    print("🔧 Using SYNTHETIC features for testing:")
    print("   • Simulated price movements")
    print("   • Basic technical indicators")
    print("   • Test-grade feature generation")

print("\n📈 Feature Categories:")
# Rules are checked in order, so specific keywords ('volatility', 'bollinger')
# come before the shorter substrings they contain ('vol', 'low')
category_rules = [
    ('Technical Indicators', ['sma', 'ema', 'bb', 'rsi', 'macd', 'bollinger', 'moving']),
    ('Volatility Features', ['volatility', 'std', 'var']),
    ('Volume Features', ['volume', 'vol']),
    ('Return Features', ['return', 'pct', 'change']),
    ('Price Features', ['price', 'close', 'open', 'high', 'low']),
]
feature_types = {}
for col in X.columns:
    name = col.lower()
    category = next((label for label, keys in category_rules
                     if any(k in name for k in keys)), 'Other Features')
    feature_types[category] = feature_types.get(category, 0) + 1

for category, count in feature_types.items():
    print(f"  • {category}: {count}")

print("\nSample Features:")
print(f"  {list(X.columns[:10])}")
if len(X.columns) > 10:
    print(f"  ... and {len(X.columns) - 10} more features")

# Feature importance preview (basic correlation analysis)
print("\n🎯 Top Correlated Features (with target):")
try:
    correlations = X.corrwith(y_binary).abs().sort_values(ascending=False)
    # head(5) also copes with datasets that have fewer than five features
    for name, value in correlations.head(5).items():
        print(f"  • {name}: {value:.3f}")
except Exception:
    print("  (Correlation analysis skipped)")

print("\n✅ Feature analysis completed!")

# 💾 Step 7: Save Model Artifacts

Save trained models and metadata to both repository and Google Drive.
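
The artifacts in this step are laid out as `<base>/<symbol>/<run timestamp>/<model id>/`, with a `meta.json` next to the model file. The sketch below shows only the metadata part, written to a temporary directory so it can run anywhere. `write_metadata` is an illustrative helper and the values are made up:

```python
import json
import os
import tempfile
from datetime import datetime

def write_metadata(base_dir, symbol, run_timestamp, model_id, metadata):
    """Write meta.json under <base>/<symbol>/<run>/<model_id>/ and return its path."""
    model_dir = os.path.join(base_dir, symbol, run_timestamp, model_id)
    os.makedirs(model_dir, exist_ok=True)
    meta_path = os.path.join(model_dir, "meta.json")
    with open(meta_path, "w") as f:
        # default=str converts values json cannot encode (e.g. datetime) to strings
        json.dump(metadata, f, indent=2, default=str)
    return meta_path

with tempfile.TemporaryDirectory() as tmp:
    path = write_metadata(tmp, "BTC-USD", "20240101_000000",
                          "lgbm_binary_20240101_000000",
                          {"symbol": "BTC-USD", "created": datetime(2024, 1, 1)})
    with open(path) as f:
        loaded = json.load(f)
print(loaded["created"])
```

Because `default=str` stringifies anything non-serializable, values such as NumPy scalars or timestamps come back as strings when the file is read.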

In [None]:
def create_model_artifacts(model, metrics, params, model_type, X_sample):
    """Create comprehensive model artifacts"""
    import json  # json is only imported inside import_trading_modules(), so import it here too
    import joblib
    from sklearn.preprocessing import StandardScaler
    
    # Create model directory
    model_id = f"lgbm_{model_type}_{RUN_TIMESTAMP}"
    model_dir = os.path.join(MODEL_SAVE_REPO_PATH, SYMBOL, RUN_TIMESTAMP, model_id)
    os.makedirs(model_dir, exist_ok=True)
    
    print(f"📁 Creating artifacts in: {model_dir}")
    
    # Save model
    model_path = os.path.join(model_dir, "model.pkl")
    joblib.dump(model, model_path, compress=3)
    print("  ✓ Model saved: model.pkl")
    
    # Create and save scaler (even if not used, for consistency)
    scaler = StandardScaler()
    scaler.fit(X_sample)  # Fit on sample data for consistency
    scaler_path = os.path.join(model_dir, "scaler.pkl")
    joblib.dump(scaler, scaler_path, compress=3)
    print("  ✓ Scaler saved: scaler.pkl")
    
    # Create comprehensive metadata
    metadata = {
        'model_id': model_id,
        'model_type': f'lightgbm_{model_type}',
        'symbol': SYMBOL,
        'interval': INTERVAL,
        'timestamp': RUN_TIMESTAMP,
        'training_config': CFG,
        'model_params': params,
        'metrics': metrics,
        'feature_names': list(X_sample.columns),
        'n_features': len(X_sample.columns),
        'training_samples': len(X_sample),
        'fast_test_mode': CFG['fast_test'],
        'random_seed': CFG['seed'],
        'version': '1.0',
        'framework': 'lightgbm',
        'task_type': model_type,
        'colab_training': True
    }
    
    # Save metadata
    meta_path = os.path.join(model_dir, "meta.json")
    with open(meta_path, 'w') as f:
        json.dump(metadata, f, indent=2, default=str)
    print("  ✓ Metadata saved: meta.json")
    
    # Save feature names
    feature_names_path = os.path.join(model_dir, "feature_names.json")
    with open(feature_names_path, 'w') as f:
        json.dump(list(X_sample.columns), f)
    print("  ✓ Feature names saved: feature_names.json")
    
    return model_dir, metadata

def copy_to_drive(source_dir, model_id):
    """Copy artifacts to Google Drive"""
    if not DRIVE_MOUNTED:
        print("⚠️  Google Drive not mounted, skipping Drive backup")
        return None
    
    try:
        import shutil
        
        # Create destination directory
        drive_model_dir = os.path.join(MODEL_SAVE_DRIVE_PATH, SYMBOL, RUN_TIMESTAMP, model_id)
        os.makedirs(os.path.dirname(drive_model_dir), exist_ok=True)
        
        # Copy entire model directory
        if os.path.exists(drive_model_dir):
            shutil.rmtree(drive_model_dir)
        
        shutil.copytree(source_dir, drive_model_dir)
        print(f"✅ Artifacts copied to Google Drive: {drive_model_dir}")
        
        return drive_model_dir
    
    except Exception as e:
        print(f"⚠️  Failed to copy to Google Drive: {e}")
        return None

# Save artifacts for both models
saved_models = {}

if 'binary_model' in locals() and binary_model is not None:
    print("\n💾 Saving Binary Classification Model...")
    binary_dir, binary_metadata = create_model_artifacts(
        binary_model, binary_metrics, binary_params, 'binary', binary_splits['X_train']
    )
    binary_drive_dir = copy_to_drive(binary_dir, binary_metadata['model_id'])
    
    saved_models['binary'] = {
        'local_path': binary_dir,
        'drive_path': binary_drive_dir,
        'metadata': binary_metadata
    }

if 'regression_model' in locals() and regression_model is not None:
    print("\n💾 Saving Regression Model...")
    regression_dir, regression_metadata = create_model_artifacts(
        regression_model, regression_metrics, regression_params, 'regression', regression_splits['X_train']
    )
    regression_drive_dir = copy_to_drive(regression_dir, regression_metadata['model_id'])
    
    saved_models['regression'] = {
        'local_path': regression_dir,
        'drive_path': regression_drive_dir,
        'metadata': regression_metadata
    }

print(f"\n✅ All artifacts saved! Models: {list(saved_models.keys())}")

# ✅ Step 8: Model Validation

Validate that saved models can be loaded and used for inference.
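
Validation starts by confirming that every expected artifact exists before anything is loaded. A minimal sketch of that check follows, using the file names saved in Step 7. `missing_artifacts` is a hypothetical helper, and the temporary directory stands in for a real model folder:

```python
import os
import tempfile

REQUIRED_FILES = ("model.pkl", "scaler.pkl", "meta.json")

def missing_artifacts(model_dir, required=REQUIRED_FILES):
    """Return the required file names that are absent from model_dir."""
    return [name for name in required
            if not os.path.isfile(os.path.join(model_dir, name))]

with tempfile.TemporaryDirectory() as tmp:
    open(os.path.join(tmp, "model.pkl"), "wb").close()  # create one empty artifact
    missing = missing_artifacts(tmp)
print(missing)
```

Collecting all missing names first produces one clear error per model, where loading file by file would stop at the first failure.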

In [None]:
def validate_saved_models():
    """Validate that saved models work correctly"""
    import json  # Needed for json.load below; not imported at notebook level
    import joblib
    
    print("🔍 Validating saved models...")
    
    validation_results = {}
    
    for model_type, model_info in saved_models.items():
        print(f"\n🔄 Validating {model_type} model...")
        
        try:
            model_dir = model_info['local_path']
            
            # Load model and scaler
            model_path = os.path.join(model_dir, "model.pkl")
            scaler_path = os.path.join(model_dir, "scaler.pkl")
            meta_path = os.path.join(model_dir, "meta.json")
            
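            # Check files exist; stop validating this model if any are missing
            missing = []
            for path, name in [(model_path, "model.pkl"), (scaler_path, "scaler.pkl"), (meta_path, "meta.json")]:
                if os.path.exists(path):
                    print(f"  ✓ Found {name}")
                else:
                    print(f"  ❌ Missing {name}")
                    missing.append(name)
            if missing:
                raise FileNotFoundError(f"Missing artifacts: {missing}")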
            
            # Load artifacts
            model = joblib.load(model_path)
            scaler = joblib.load(scaler_path)
            
            with open(meta_path, 'r') as f:
                metadata = json.load(f)
            
            print(f"  ✓ Loaded model: {metadata['model_id']}")
            print(f"  ✓ Features: {metadata['n_features']}")
            print(f"  ✓ Training samples: {metadata['training_samples']}")
            
            # Test prediction on sample data
            if model_type == 'binary':
                test_X = binary_splits['X_test'].iloc[:5]  # First 5 test samples
                test_y = binary_splits['y_test'].iloc[:5]
            else:
                test_X = regression_splits['X_test'].iloc[:5]
                test_y = regression_splits['y_test'].iloc[:5]
            
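            # Make predictions. The scaler above is loaded only to confirm it deserializes;
            # if the model was trained on scaled inputs, apply scaler.transform(test_X) first.
            predictions = model.predict(test_X)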
            
            print(f"  ‚úì Sample predictions shape: {predictions.shape}")
            print(f"  ‚úì Sample predictions (first 3): {predictions[:3]}")
            
            # Validate prediction format
            if model_type == 'binary':
                # Binary predictions should be probabilities between 0 and 1
                if all(0 <= p <= 1 for p in predictions):
                    print(f"  ‚úÖ Binary probabilities valid (0-1 range)")
                else:
                    print(f"  ‚ö†Ô∏è  Binary probabilities outside 0-1 range")
            else:
                # Regression predictions should be reasonable returns
                if all(abs(p) < 1 for p in predictions):  # |return| < 100%
                    print(f"  ‚úÖ Regression predictions reasonable")
                else:
                    print(f"  ‚ö†Ô∏è  Regression predictions seem extreme")
            
            validation_results[model_type] = {
                'status': 'success',
                'model_path': model_path,
                'predictions_sample': predictions[:3].tolist(),
                'metadata': metadata
            }
            
            print(f"  ‚úÖ {model_type.capitalize()} model validation successful")
            
        except Exception as e:
            print(f"  ‚ùå {model_type.capitalize()} model validation failed: {e}")
            validation_results[model_type] = {
                'status': 'failed',
                'error': str(e)
            }
    
    return validation_results

# Validate models
if saved_models:
    validation_results = validate_saved_models()
    
    print(f"\nüèÜ Validation Summary:")
    for model_type, result in validation_results.items():
        status = "‚úÖ" if result['status'] == 'success' else "‚ùå"
        print(f"  {status} {model_type.capitalize()} Model")
else:
    print("‚ö†Ô∏è  No models to validate")

# 📋 Step 9: Create Run Manifest

Create a comprehensive manifest file documenting this training run.
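
The dataset hash computed in the cell below covers only the feature names, the row count, and the seed, so two datasets with the same shape but different values would get the same hash. The following is a minimal sketch of a content-aware fingerprint, assuming rows can be iterated as tuples; `fingerprint_rows` is a hypothetical helper, not part of the notebook.

```python
import hashlib
import json

def fingerprint_rows(feature_names, rows, seed, length=12):
    """Hash column names, seed, and every row so that changed values change the hash."""
    digest = hashlib.sha256()
    header = json.dumps({"features": list(feature_names), "seed": seed}, sort_keys=True)
    digest.update(header.encode("utf-8"))
    for row in rows:
        # repr() keeps full float precision; rows are hashed in order
        digest.update(repr(tuple(row)).encode("utf-8"))
    return digest.hexdigest()[:length]

names = ["returns", "rsi"]
a = fingerprint_rows(names, [(0.01, 55.0), (-0.02, 48.5)], seed=42)
b = fingerprint_rows(names, [(0.01, 55.0), (-0.02, 48.6)], seed=42)
print(a, b, a != b)
```

With a pandas DataFrame, one option would be to pass `X.columns` and `X.itertuples(index=False)` as the two first arguments.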

In [None]:
def create_run_manifest():
    """Create a comprehensive run manifest"""
    
    # Create runs directory
    runs_dir = os.path.join(REPO_PATH, "runs", f"colab-{RUN_TIMESTAMP}")
    os.makedirs(runs_dir, exist_ok=True)
    
    # Get git commit hash if available
    git_commit = "unknown"
    try:
        import subprocess
        result = subprocess.run(['git', 'rev-parse', 'HEAD'], 
                              cwd=REPO_PATH, capture_output=True, text=True)
        if result.returncode == 0:
            git_commit = result.stdout.strip()[:12]  # Short hash
    except Exception:
        pass  # git is not installed or REPO_PATH is not a repository
    
    # Calculate dataset hash (simple hash of feature names and data size)
    import hashlib
    feature_string = f"{list(X.columns)}_{len(X)}_{CFG['seed']}"
    dataset_hash = hashlib.md5(feature_string.encode()).hexdigest()[:12]
    
    # Create comprehensive manifest
    manifest = {
        'run_info': {
            'timestamp': RUN_TIMESTAMP,
            'git_commit': git_commit,
            'dataset_hash': dataset_hash,
            'colab_session': True,
            'fast_test_mode': CFG['fast_test']
        },
        'configuration': CFG,
        'data_info': {
            'symbol': SYMBOL,
            'interval': INTERVAL,
            'n_samples': len(X),
            'n_features': len(X.columns),
            'feature_names': list(X.columns),
            'time_range': {
                'start': str(timestamps.iloc[0]),
                'end': str(timestamps.iloc[-1])
            }
        },
        'models': {},
        'artifacts': {
            'repo_base_path': MODEL_SAVE_REPO_PATH,
            'drive_base_path': MODEL_SAVE_DRIVE_PATH if DRIVE_MOUNTED else None,
            'saved_models': []
        },
        'validation_results': globals().get('validation_results', {}),  # locals() cannot see notebook-level variables here
        'environment': {
            'python_version': sys.version,
            'key_packages': {}
        }
    }
    
    # Add package versions safely
    import pandas as pd
    import numpy as np
    
    manifest['environment']['key_packages']['pandas'] = pd.__version__
    manifest['environment']['key_packages']['numpy'] = np.__version__
    
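    try:
        import lightgbm
        manifest['environment']['key_packages']['lightgbm'] = lightgbm.__version__
    except ImportError:
        pass  # optional package, leave it out of the manifest
    
    try:
        import sklearn
        manifest['environment']['key_packages']['sklearn'] = sklearn.__version__
    except ImportError:
        pass  # optional package, leave it out of the manifest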
    
    # Add model information
    for model_type, model_info in saved_models.items():
        manifest['models'][model_type] = {
            'model_id': model_info['metadata']['model_id'],
            'local_path': model_info['local_path'],
            'drive_path': model_info['drive_path'],
            'metrics': model_info['metadata']['metrics'],
            'params': model_info['metadata']['model_params']
        }
        
        manifest['artifacts']['saved_models'].append({
            'type': model_type,
            'id': model_info['metadata']['model_id'],
            'path': model_info['local_path']
        })
    
    # Save manifest
    manifest_path = os.path.join(runs_dir, "manifest.json")
    with open(manifest_path, 'w') as f:
        json.dump(manifest, f, indent=2, default=str)
    
    print(f"üìã Run manifest created: {manifest_path}")
    
    # Display summary
    print(f"\nüìä Training Run Summary:")
    print(f"  Run ID: colab-{RUN_TIMESTAMP}")
    print(f"  Git Commit: {git_commit}")
    print(f"  Dataset Hash: {dataset_hash}")
    print(f"  Models Trained: {len(saved_models)}")
    print(f"  Total Samples: {len(X)}")
    print(f"  Features: {len(X.columns)}")
    print(f"  Fast Test Mode: {CFG['fast_test']}")
    
    if saved_models:
        print(f"\nüéØ Model Performance:")
        for model_type, model_info in saved_models.items():
            metrics = model_info['metadata']['metrics']
            if model_type == 'binary':
                print(f"  Binary: AUC={metrics['auc']:.4f}, Accuracy={metrics['accuracy']:.4f}")
            else:
                print(f"  Regression: RMSE={metrics['rmse']:.6f}, R¬≤={metrics['r2']:.4f}")
    
    return manifest_path, manifest

# Create run manifest
if saved_models:
    manifest_path, manifest = create_run_manifest()
else:
    print("‚ö†Ô∏è  No models saved, skipping manifest creation")

# 📥 Step 10: Display Results & Download

Display the training results and provide download options.

In [None]:
def display_results():
    """Display comprehensive training results"""
    
    # Check for models first so the success banner is not printed for a failed run
    if not saved_models:
        print("❌ No models were successfully trained")
        return
    
    print("🏆" + "="*60)
    print("🏆 GOOGLE COLAB TRAINING COMPLETED SUCCESSFULLY!")
    print("🏆" + "="*60)
    
    print(f"\nüìä TRAINING SUMMARY:")
    print(f"  ‚Ä¢ Run Timestamp: {RUN_TIMESTAMP}")
    print(f"  ‚Ä¢ Symbol: {SYMBOL}")
    print(f"  ‚Ä¢ Training Mode: {'Fast Test' if CFG['fast_test'] else 'Full Training'}")
    print(f"  ‚Ä¢ Models Trained: {len(saved_models)}")
    print(f"  ‚Ä¢ Dataset Size: {len(X)} samples, {len(X.columns)} features")
    
    print(f"\nüéØ MODEL PERFORMANCE:")
    for model_type, model_info in saved_models.items():
        metadata = model_info['metadata']
        metrics = metadata['metrics']
        
        print(f"\n  üìà {model_type.upper()} MODEL:")
        print(f"    Model ID: {metadata['model_id']}")
        print(f"    Framework: {metadata['framework']}")
        
        if model_type == 'binary':
            print(f"    AUC Score: {metrics['auc']:.4f}")
            print(f"    Accuracy: {metrics['accuracy']:.4f}")
        else:
            print(f"    RMSE: {metrics['rmse']:.6f}")
            print(f"    R¬≤ Score: {metrics['r2']:.4f}")
    
    print(f"\nüìÅ ARTIFACT LOCATIONS:")
    for model_type, model_info in saved_models.items():
        print(f"\n  {model_type.upper()} MODEL ARTIFACTS:")
        print(f"    Local Path: {model_info['local_path']}")
        if model_info['drive_path']:
            print(f"    Google Drive: {model_info['drive_path']}")
        
        # List files in directory
        if os.path.exists(model_info['local_path']):
            files = os.listdir(model_info['local_path'])
            print(f"    Files: {', '.join(files)}")
    
    # Display sample metadata
    if saved_models:
        sample_model = list(saved_models.values())[0]
        print(f"\nüìã SAMPLE MODEL METADATA:")
        
        # Pretty print a subset of metadata
        display_metadata = {
            'model_id': sample_model['metadata']['model_id'],
            'model_type': sample_model['metadata']['model_type'],
            'training_config': sample_model['metadata']['training_config'],
            'metrics': sample_model['metadata']['metrics'],
            'n_features': sample_model['metadata']['n_features'],
            'training_samples': sample_model['metadata']['training_samples']
        }
        
        print(json.dumps(display_metadata, indent=2))

def create_download_zip():
    """Create ZIP file of all artifacts for download"""
    try:
        from google.colab import files
        import zipfile
        
        if not saved_models:
            print("‚ùå No models to package")
            return
        
        # Create ZIP filename
        zip_filename = f"trading_bot_models_{RUN_TIMESTAMP}.zip"
        zip_path = f"/content/{zip_filename}"
        
        print(f"üì¶ Creating download package: {zip_filename}")
        
        with zipfile.ZipFile(zip_path, 'w', zipfile.ZIP_DEFLATED) as zipf:
            for model_type, model_info in saved_models.items():
                model_dir = model_info['local_path']
                
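                # Add all files from model directory
                # (the loop variable must not be named `files`: that would shadow
                # google.colab.files and break files.download() below)
                for root, _dirs, filenames in os.walk(model_dir):
                    for filename in filenames:
                        file_path = os.path.join(root, filename)
                        # Create relative path for ZIP
                        arcname = os.path.relpath(file_path, MODEL_SAVE_REPO_PATH)
                        zipf.write(file_path, arcname)
                        print(f"  ✓ Added: {arcname}")
            
            # Add manifest if it exists (manifest_path is a notebook-level global,
            # so it is looked up in globals() rather than locals())
            run_manifest = globals().get('manifest_path')
            if run_manifest and os.path.exists(run_manifest):
                zipf.write(run_manifest, f"runs/colab-{RUN_TIMESTAMP}/manifest.json")
                print(f"  ✓ Added: manifest.json")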
        
        print(f"‚úÖ ZIP package created: {zip_filename}")
        print(f"üì• Downloading...")
        
        # Download the ZIP file
        files.download(zip_path)
        
        print("‚úÖ Download initiated!")
        print("üí° The ZIP file contains all model artifacts and metadata")
        
    except ImportError:
        print("‚ö†Ô∏è  Not running in Google Colab - download not available")
        print("üí° You can manually copy files from the paths shown above")
    except Exception as e:
        print(f"‚ùå Error creating download package: {e}")

# Display results
display_results()

print(f"\n" + "="*60)
print("üéâ NEXT STEPS:")
print("="*60)

# Build the list first so the numbering stays consecutive in both modes
next_steps = []
if CFG['fast_test']:
    next_steps.append("🚀 For production training, set CFG['fast_test'] = False and re-run")
next_steps += [
    "📥 Download your model artifacts using the ZIP package below",
    "🧪 Test your models in your trading environment",
    "📈 Integrate with your backtesting and live trading systems",
    "🔄 Monitor performance and retrain as needed",
]
for number, step in enumerate(next_steps, start=1):
    print(f"{number}. {step}")

print(f"\nüí° Model artifacts are saved in:")
print(f"   Repository: {MODEL_SAVE_REPO_PATH}")
if DRIVE_MOUNTED:
    print(f"   Google Drive: {MODEL_SAVE_DRIVE_PATH}")

print(f"\nü§ñ To use these models in production:")
print("   ‚Ä¢ Load with: model = joblib.load('model.pkl')")
print("   ‚Ä¢ Make predictions: predictions = model.predict(features)")
print("   ‚Ä¢ Check metadata for feature requirements and preprocessing")

# 📥 Download Model Artifacts

Download all trained models and artifacts as a ZIP file.
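
The packaging step can be sketched independently of Colab. The example below is a minimal illustration under stated assumptions, not the notebook's `create_download_zip`: `zip_directory` is a hypothetical helper, and the files it bundles are placeholders written to a temporary directory. It stores each file under a path relative to the source directory and then runs the archive's built-in CRC check.

```python
import os
import tempfile
import zipfile

def zip_directory(src_dir, zip_path):
    """Write every file under src_dir into zip_path, using paths relative to src_dir."""
    added = []
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zipf:
        for root, _dirs, filenames in os.walk(src_dir):
            for filename in sorted(filenames):
                file_path = os.path.join(root, filename)
                arcname = os.path.relpath(file_path, src_dir)
                zipf.write(file_path, arcname)
                added.append(arcname)
    return added

with tempfile.TemporaryDirectory() as tmp:
    # Placeholder artifacts standing in for a trained model directory
    model_dir = os.path.join(tmp, "models", "binary")
    os.makedirs(model_dir)
    for name in ("model.pkl", "meta.json"):
        with open(os.path.join(model_dir, name), "w") as f:
            f.write("placeholder")
    zip_path = os.path.join(tmp, "bundle.zip")
    entries = zip_directory(os.path.join(tmp, "models"), zip_path)
    with zipfile.ZipFile(zip_path) as zipf:
        corrupt = zipf.testzip()  # None means every member passed its CRC check
    print(entries, corrupt)
```

Keeping the archive outside the directory being zipped avoids the archive trying to include itself.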

In [None]:
# Create and download ZIP package of all artifacts
create_download_zip()

# Display final status (the success message is printed only if models exist)
if saved_models:
    print("\n🎊 Training completed successfully!")
    print("🎯 Your models are ready for production use!")
    print(f"\n✅ Successfully trained {len(saved_models)} models:")
    for model_type in saved_models.keys():
        print(f"  • {model_type.capitalize()} model")
    
    print(f"\nüèÜ Best practices implemented:")
    print("  ‚úì Time-based train/val/test splits")
    print("  ‚úì Comprehensive model evaluation")
    print("  ‚úì Artifact versioning and metadata")
    print("  ‚úì Model validation and integrity checks")
    print("  ‚úì Google Drive backup (if mounted)")
    print("  ‚úì Downloadable model packages")
else:
    print("\n‚ö†Ô∏è  No models were successfully trained")
    print("Please check the error messages above and try again")

# 🚀 Trading Bot ML Training - Google Colab Edition

## 📋 SETUP INSTRUCTIONS (REQUIRED):

### 1. Replace Repository URL
```python
GITHUB_REPO_URL = "<YOUR_REPO_URL>"  # ← REPLACE THIS!
```

### 2. Choose Training Mode
- **`fast_test=True`** (default): Quick test run with synthetic data (5 minutes)
- **`fast_test=False`**: Full training with real data (30-60 minutes)

### 3. Private Repository?
If your repo is private, use:
```python
# GITHUB_REPO_URL = "https://<TOKEN>@github.com/owner/repo.git"
```
Replace `<TOKEN>` with your GitHub personal access token.
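
Hard-coding a token in a notebook makes it easy to leak through saved outputs or version control. One possible approach, sketched below, is to read the token from an environment variable and redact it before printing. The variable name `GITHUB_TOKEN` and the helpers `with_token` and `redact` are assumptions made for this illustration.

```python
import os
from urllib.parse import urlsplit, urlunsplit

def with_token(repo_url, token):
    """Insert a token into an https URL: https://TOKEN@host/owner/repo.git"""
    parts = urlsplit(repo_url)
    if parts.scheme != "https":
        raise ValueError("token authentication expects an https:// URL")
    host = parts.hostname
    if parts.port:
        host = f"{host}:{parts.port}"
    return urlunsplit((parts.scheme, f"{token}@{host}", parts.path, parts.query, parts.fragment))

def redact(url):
    """Hide any credentials before printing a URL in notebook output."""
    parts = urlsplit(url)
    if "@" not in parts.netloc:
        return url
    host = parts.netloc.rsplit("@", 1)[1]
    return urlunsplit((parts.scheme, f"***@{host}", parts.path, parts.query, parts.fragment))

token = os.environ.get("GITHUB_TOKEN", "example-token")  # assumed variable name
private_url = with_token("https://github.com/owner/repo.git", token)
print(redact(private_url))
```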

### 4. Alternative: Upload ZIP
Instead of cloning, you can upload your repo as a ZIP file and uncomment the ZIP upload section.

---

## 🎯 What This Notebook Does:
1. **Clone** your trading bot repository
2. **Install** all dependencies automatically
3. **Mount** Google Drive for artifact storage
4. **Train** LightGBM and XGBoost models using your existing modules
5. **Save** trained models to repo and Google Drive
6. **Validate** model artifacts
7. **Download** results to your local machine

## 📦 Output Artifacts:
- `models/{symbol}/{timestamp}/{model_id}/` - Model files (pkl, meta.json)
- `runs/colab-{timestamp}/manifest.json` - Training manifest
- Google Drive backup (if mounted)
- ZIP download for local machine
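
To make the layout above concrete, here is a small sketch that builds both paths from their components. The helper names and the example `model_id` value are hypothetical; the notebook's own save code may assemble these paths differently.

```python
from pathlib import PurePosixPath

def artifact_dir(symbol, timestamp, model_id, base="models"):
    """Build models/{symbol}/{timestamp}/{model_id} as a POSIX-style path."""
    return PurePosixPath(base) / symbol / timestamp / model_id

def manifest_file(timestamp, base="runs"):
    """Build runs/colab-{timestamp}/manifest.json."""
    return PurePosixPath(base) / f"colab-{timestamp}" / "manifest.json"

# "lgbm_binary" is a made-up model ID used only for illustration
print(artifact_dir("BTC-USD", "20240101_120000", "lgbm_binary"))
print(manifest_file("20240101_120000"))
```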

**Ready? Let's start! 👇**

# ⚙️ Configuration Section

**IMPORTANT: Modify these variables before running!**

In [None]:
# =============================================================================
# üîß USER CONFIGURATION - MODIFY THESE VALUES!
# =============================================================================

# TODO: Replace with your GitHub repository URL
GITHUB_REPO_URL = "<YOUR_REPO_URL>"  # Example: "https://github.com/username/trading-bot.git"

# Training Configuration
SYMBOL = "BTC-USD"
INTERVAL = "1m"

CFG = {
    "fast_test": True,        # Set to False for full training
    "horizon": 5,             # Future periods for prediction
    "pos_thresh": 0.002,      # Positive class threshold (0.2%)
    "n_splits": 2,            # Cross-validation splits (fast_test)
    "seed": 42,               # Random seed
    "n_periods": None,        # Dataset size; derived from fast_test below
}

# Set n_periods based on fast_test
CFG["n_periods"] = 1000 if CFG["fast_test"] else 5000

# Paths (will be set after repo clone)
REPO_NAME = None  # Will be extracted from GITHUB_REPO_URL
REPO_PATH = None  # Will be set to /content/{REPO_NAME}
MODEL_SAVE_REPO_PATH = None  # Will be set to {REPO_PATH}/models/
MODEL_SAVE_DRIVE_PATH = "/content/drive/MyDrive/models/"

# Status flags
DRIVE_MOUNTED = False
REPO_CLONED = False

print("‚úÖ Configuration loaded")
print(f"üéØ Training Mode: {'Fast Test' if CFG['fast_test'] else 'Full Training'}")
print(f"üìä Symbol: {SYMBOL} | Interval: {INTERVAL}")
print(f"üî¢ Dataset Size: {CFG['n_periods']} periods")

if GITHUB_REPO_URL == "<YOUR_REPO_URL>":
    print("‚ö†Ô∏è  WARNING: Please replace GITHUB_REPO_URL with your actual repository URL!")
    print("   Example: GITHUB_REPO_URL = 'https://github.com/username/trading-bot.git'")

# 📥 Repository Setup

Clone your trading bot repository and set up the Python environment.

In [None]:
import os
import sys
import subprocess
import shutil
from pathlib import Path
import json
from datetime import datetime

def extract_repo_name(url):
    """Extract repository name from GitHub URL"""
    url = url.rstrip('/')  # tolerate a trailing slash
    if url.endswith('.git'):
        url = url[:-4]
    return url.split('/')[-1]

def clone_repository(repo_url):
    """Clone the repository"""
    global REPO_NAME, REPO_PATH, MODEL_SAVE_REPO_PATH, REPO_CLONED
    
    if repo_url == "<YOUR_REPO_URL>":
        print("‚ùå ERROR: Please replace GITHUB_REPO_URL with your actual repository URL!")
        return False
    
    try:
        print(f"🔄 Cloning repository: {repo_url.split('@')[-1]}")  # hide any embedded token
        
        # Extract repo name
        REPO_NAME = extract_repo_name(repo_url)
        REPO_PATH = f"/content/{REPO_NAME}"
        MODEL_SAVE_REPO_PATH = f"{REPO_PATH}/models/"
        
        # Remove existing directory if it exists
        if os.path.exists(REPO_PATH):
            print(f"üóëÔ∏è  Removing existing directory: {REPO_PATH}")
            shutil.rmtree(REPO_PATH)
        
        # Clone repository
        result = subprocess.run(
            ["git", "clone", repo_url, REPO_PATH],
            capture_output=True,
            text=True,
            cwd="/content"
        )
        
        if result.returncode != 0:
            print(f"‚ùå Git clone failed: {result.stderr}")
            print("üí° If this is a private repo, make sure you're using a personal access token:")
            print("   https://<TOKEN>@github.com/username/repo.git")
            return False
        
        # Add to Python path
        if REPO_PATH not in sys.path:
            sys.path.insert(0, REPO_PATH)
        
        print(f"‚úÖ Repository cloned successfully to: {REPO_PATH}")
        print(f"üìÅ Python path updated: {REPO_PATH}")
        
        # Show repository structure
        print("\nüìÇ Repository structure:")
        for root, dirs, files in os.walk(REPO_PATH):
            # Limit depth to avoid clutter
            level = root.replace(REPO_PATH, '').count(os.sep)
            if level < 3:
                indent = ' ' * 2 * level
                print(f"{indent}{os.path.basename(root)}/")
                subindent = ' ' * 2 * (level + 1)
                for file in files[:5]:  # Show only first 5 files per directory
                    print(f"{subindent}{file}")
                if len(files) > 5:
                    print(f"{subindent}... and {len(files) - 5} more files")
        
        REPO_CLONED = True
        return True
        
    except Exception as e:
        print(f"‚ùå Error cloning repository: {e}")
        return False

# Clone the repository
clone_success = clone_repository(GITHUB_REPO_URL)

if not clone_success:
    print("\nüîÑ Alternative: Upload ZIP file")
    print("If cloning failed, you can upload your repo as a ZIP file instead.")
    print("Uncomment and run the next cell to use ZIP upload.")

In [None]:
# # ALTERNATIVE: Upload ZIP file (uncomment if git clone failed)
# from google.colab import files
# import zipfile

# print("üì¶ Upload your repository as a ZIP file:")
# uploaded = files.upload()

# if uploaded:
#     zip_name = list(uploaded.keys())[0]
#     print(f"üì• Extracting {zip_name}...")
    
#     with zipfile.ZipFile(zip_name, 'r') as zip_ref:
#         zip_ref.extractall('/content')
    
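#     # Find extracted directory (skip Colab's own folders and hidden directories)
#     for item in sorted(os.listdir('/content')):
#         if os.path.isdir(f'/content/{item}') and item not in ('sample_data', 'drive') and not item.startswith('.'):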
#             REPO_NAME = item
#             REPO_PATH = f'/content/{item}'
#             MODEL_SAVE_REPO_PATH = f'{REPO_PATH}/models/'
#             break
    
#     if REPO_PATH and REPO_PATH not in sys.path:
#         sys.path.insert(0, REPO_PATH)
    
#     print(f"‚úÖ ZIP extracted to: {REPO_PATH}")
#     REPO_CLONED = True

# 📦 Install Dependencies

Install required packages for ML training.
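
A minimal sketch of an install step follows, assuming the package list shown here; the actual requirements depend on your repository. It checks which modules are already importable and builds a `pip` command only for the missing ones. `REQUIRED`, `missing_packages`, and `install` are hypothetical names, and the final call is left commented out so nothing is installed by accident.

```python
import importlib.util
import subprocess
import sys

# Assumed mapping of import name -> pip package name; adjust to the repository's requirements
REQUIRED = {"lightgbm": "lightgbm", "sklearn": "scikit-learn", "joblib": "joblib"}

def missing_packages(required):
    """Return pip names for modules that cannot be found in the current environment."""
    return [pip_name for module, pip_name in required.items()
            if importlib.util.find_spec(module) is None]

def install(pip_names):
    """Install packages with the interpreter that runs the notebook."""
    if not pip_names:
        return 0
    cmd = [sys.executable, "-m", "pip", "install", "--quiet", *pip_names]
    return subprocess.run(cmd).returncode

to_install = missing_packages(REQUIRED)
print("Missing:", to_install or "nothing")
# install(to_install)  # uncomment to actually run pip
```

Using `sys.executable -m pip` installs into the same interpreter the kernel is running, which avoids mismatches between `pip` on the PATH and the notebook environment.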

In [None]:
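# NOTE: this cell redefines display_results() with data-source reporting added.
# It relies on `saved_models`, `X`, and `trading_modules` from the training cells,
# so call it only after training has finished.
def display_results():
    """Display comprehensive training results, including the data source"""
    
    # Check for models first so the success banner is not printed for a failed run
    if not saved_models:
        print("❌ No models were successfully trained")
        return
    
    print("🏆" + "="*60)
    print("🏆 GOOGLE COLAB TRAINING COMPLETED SUCCESSFULLY!")
    print("🏆" + "="*60)
    
    print(f"\n📊 TRAINING SUMMARY:")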
    print(f"  ‚Ä¢ Run Timestamp: {RUN_TIMESTAMP}")
    print(f"  ‚Ä¢ Symbol: {SYMBOL}")
    print(f"  ‚Ä¢ Training Mode: {'Fast Test' if CFG['fast_test'] else 'Full Training'}")
    print(f"  ‚Ä¢ Models Trained: {len(saved_models)}")
    print(f"  ‚Ä¢ Dataset Size: {len(X)} samples, {len(X.columns)} features")
    
    # Show data source information
    data_source = "üåü Real Market Data" if 'real_data' in trading_modules else "üîß Synthetic Data"
    print(f"  ‚Ä¢ Data Source: {data_source}")
    
    if 'real_data' in trading_modules:
        print(f"    üìà Via your existing data pipeline (Yahoo Finance)")
        print(f"    üéØ Production-grade features and validation")
    else:
        print(f"    ‚ö†Ô∏è  Synthetic data used (real data unavailable)")
    
    print(f"\nüéØ MODEL PERFORMANCE:")
    for model_type, model_info in saved_models.items():
        metadata = model_info['metadata']
        metrics = metadata['metrics']
        
        print(f"\n  üìà {model_type.upper()} MODEL:")
        print(f"    Model ID: {metadata['model_id']}")
        print(f"    Framework: {metadata['framework']}")
        print(f"    Data Source: {data_source}")
        
        if model_type == 'binary':
            print(f"    AUC Score: {metrics['auc']:.4f}")
            print(f"    Accuracy: {metrics['accuracy']:.4f}")
        else:
            print(f"    RMSE: {metrics['rmse']:.6f}")
            print(f"    R¬≤ Score: {metrics['r2']:.4f}")
    
    print(f"\n📁 ARTIFACT LOCATIONS:")
    for model_type, model_info in saved_models.items():
        print(f"\n  {model_type.upper()} MODEL ARTIFACTS:")
        print(f"    Local Path: {model_info['local_path']}")
        if model_info['drive_path']:
            print(f"    Google Drive: {model_info['drive_path']}")
        
        # List files in directory
        if os.path.exists(model_info['local_path']):
            files = os.listdir(model_info['local_path'])
            print(f"    Files: {', '.join(files)}")
    
    # Display sample metadata
    if saved_models:
        sample_model = list(saved_models.values())[0]
        print(f"\nüìã SAMPLE MODEL METADATA:")
        
        # Pretty print a subset of metadata including data source info
        display_metadata = {
            'model_id': sample_model['metadata']['model_id'],
            'model_type': sample_model['metadata']['model_type'],
            'data_source': data_source,
            'training_config': sample_model['metadata']['training_config'],
            'metrics': sample_model['metadata']['metrics'],
            'n_features': sample_model['metadata']['n_features'],
            'training_samples': sample_model['metadata']['training_samples']
        }
        
        print(json.dumps(display_metadata, indent=2))

# 💾 Mount Google Drive

Mount Google Drive to save trained models for long-term storage.
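
Once Drive is mounted it behaves like an ordinary directory, so backing up a run reduces to a directory copy. The sketch below illustrates the idea with temporary folders standing in for the repo and Drive paths; `backup_artifacts` is a hypothetical helper, and it simply skips the copy when the backup location does not exist (for example, when Drive is not mounted).

```python
import os
import shutil
import tempfile

def backup_artifacts(src_dir, backup_root):
    """Copy src_dir into backup_root if that location exists; return the copy path or None."""
    if not os.path.isdir(backup_root):
        return None  # e.g. Drive is not mounted, keep the local copy only
    target = os.path.join(backup_root, os.path.basename(os.path.normpath(src_dir)))
    shutil.copytree(src_dir, target, dirs_exist_ok=True)  # dirs_exist_ok needs Python 3.8+
    return target

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "run_001")
    os.makedirs(src)
    with open(os.path.join(src, "meta.json"), "w") as f:
        f.write("{}")
    drive = os.path.join(tmp, "fake_drive")  # stands in for /content/drive/MyDrive/...
    os.makedirs(drive)
    copied = backup_artifacts(src, drive)
    copied_files = os.listdir(copied)
    skipped = backup_artifacts(src, os.path.join(tmp, "not_mounted"))
print(copied_files, skipped)
```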

In [None]:
def mount_google_drive():
    """Mount Google Drive safely"""
    global DRIVE_MOUNTED
    
    try:
        from google.colab import drive
        print("üîÑ Mounting Google Drive...")
        drive.mount('/content/drive')
        
        # Verify mount
        if os.path.exists('/content/drive/MyDrive'):
            print("‚úÖ Google Drive mounted successfully")
            print(f"üìÅ Drive path: {MODEL_SAVE_DRIVE_PATH}")
            
            # Create models directory in Drive if it doesn't exist
            os.makedirs(MODEL_SAVE_DRIVE_PATH, exist_ok=True)
            DRIVE_MOUNTED = True
            return True
        else:
            print("‚ùå Drive mount verification failed")
            return False
            
    except ImportError:
        print("‚ö†Ô∏è  Not running in Google Colab - Drive mount skipped")
        return False
    except Exception as e:
        print(f"‚ö†Ô∏è  Drive mount failed: {e}")
        print("Continuing without Drive backup...")
        return False

# Mount Google Drive
mount_success = mount_google_drive()

if mount_success:
    print("üí° Models will be saved to both repo and Google Drive")
else:
    print("üí° Models will be saved to repo only")

# 📥 Import Modules

Import the trading bot modules and verify everything is working.
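
The import cells below try a preferred module and fall back to an older one when it is missing. That pattern can be sketched generically with `importlib`. The helper `first_importable` is hypothetical, and the module paths are the ones referenced in this notebook; whether they exist depends on your repository.

```python
import importlib

def first_importable(candidates):
    """Return (name, module) for the first candidate that imports, or (None, None)."""
    for name in candidates:
        try:
            return name, importlib.import_module(name)
        except ImportError:
            continue  # try the next candidate
    return None, None

# Preferred module first, older version second
name, module = first_importable(["arbi.ai.inference_v2", "arbi.ai.inference"])
print("Inference module:", name or "not found, using fallback")
```

Catching `ImportError` specifically, rather than every exception, keeps genuine bugs inside a module from being mistaken for a missing module.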

In [None]:
import warnings
warnings.filterwarnings('ignore')

# Core libraries
import pandas as pd
import numpy as np
import json
import joblib
from datetime import datetime, timedelta
from pathlib import Path
import hashlib
import subprocess

# ML libraries
import lightgbm as lgb
from sklearn.model_selection import TimeSeriesSplit
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import roc_auc_score, mean_squared_error, r2_score

# Optional libraries (with fallbacks)
try:
    import xgboost as xgb
    XGB_AVAILABLE = True
    print("‚úÖ XGBoost available")
except ImportError:
    XGB_AVAILABLE = False
    print("‚ö†Ô∏è  XGBoost not available - will skip XGBoost models")

try:
    import optuna
    OPTUNA_AVAILABLE = True
    print("‚úÖ Optuna available")
except ImportError:
    OPTUNA_AVAILABLE = False
    print("‚ö†Ô∏è  Optuna not available - will skip hyperparameter optimization")

# Set random seeds
np.random.seed(CFG['seed'])

print(f"\nüîß Core libraries imported successfully")
print(f"üéØ Random seed: {CFG['seed']}")

In [None]:
def import_trading_modules():
    """Import trading bot modules with fallbacks"""
    
    if not REPO_CLONED:
        print("‚ùå Repository not available for module imports")
        return False
    
    print("üîÑ Importing trading bot modules...")
    
    # Try to import existing modules
    modules_imported = {}
    
    # Feature engineering
    try:
        from arbi.ai.feature_engineering_v2 import compute_features_deterministic, load_feature_schema
        modules_imported['feature_engineering'] = True
        print("‚úÖ Feature engineering module")
    except ImportError as e:
        print(f"‚ö†Ô∏è  Feature engineering module not found: {e}")
        modules_imported['feature_engineering'] = False
    
    # Training module
    try:
        from arbi.ai.training_v2 import generate_synthetic_ohlcv_data
        modules_imported['training'] = True
        print("‚úÖ Training module")
    except ImportError:
        try:
            from arbi.ai.train_lgbm import train_and_validate_lgbm
            modules_imported['training'] = True
            print("‚úÖ LightGBM training module")
        except ImportError as e:
            print(f"‚ö†Ô∏è  Training module not found: {e}")
            modules_imported['training'] = False
    
    # Model registry
    try:
        from arbi.ai.registry import ModelRegistry
        modules_imported['registry'] = True
        print("‚úÖ Model registry")
    except ImportError as e:
        print(f"‚ö†Ô∏è  Model registry not found: {e}")
        modules_imported['registry'] = False
    
    # Inference module
    try:
        from arbi.ai.inference_v2 import ProductionInferenceEngine
        modules_imported['inference'] = True
        print("‚úÖ Inference engine")
    except ImportError:
        try:
            from arbi.ai.inference import InferenceEngine
            modules_imported['inference'] = True
            print("‚úÖ Inference engine (v1)")
        except ImportError as e:
            print(f"‚ö†Ô∏è  Inference module not found: {e}")
            modules_imported['inference'] = False
    
    imported_count = sum(modules_imported.values())
    total_count = len(modules_imported)
    
    print(f"\nüìä Module Import Summary: {imported_count}/{total_count} modules imported")
    
    if imported_count == 0:
        print("‚ö†Ô∏è  No trading bot modules found - will use fallback implementations")
        return False
    elif imported_count < total_count:
        print("‚ö†Ô∏è  Some modules missing - will use fallbacks where needed")
        return True
    else:
        print("‚úÖ All modules imported successfully")
        return True

# Import trading bot modules
modules_available = import_trading_modules()

# 🏋️ Model Training

Train LightGBM and XGBoost models using your existing modules or fallback implementations.
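
Both models learn from the forward return over `horizon` periods: the regression target is the return itself, and the binary label is 1 when that return exceeds `pos_thresh`. The sketch below shows the arithmetic in plain Python so each step is visible; the notebook itself computes the same quantity with pandas `shift`. `make_labels` and the sample prices are made up for illustration.

```python
def make_labels(closes, horizon, pos_thresh):
    """Return (forward_returns, binary_labels) for rows that have a price `horizon` steps ahead."""
    forward_returns = []
    labels = []
    # The last `horizon` rows have no future price, so they are left out
    for t in range(len(closes) - horizon):
        r = closes[t + horizon] / closes[t] - 1.0
        forward_returns.append(r)
        labels.append(1 if r > pos_thresh else 0)
    return forward_returns, labels

closes = [100.0, 100.1, 100.5, 100.0, 100.6]
returns, labels = make_labels(closes, horizon=2, pos_thresh=0.002)
print(labels)
```

Because each label looks `horizon` steps into the future, features and labels must stay aligned row by row, and any train/test split should respect time order to avoid leaking future prices into training.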

In [None]:
def create_fallback_features(df):
    """Create basic technical indicators as fallback"""
    features = pd.DataFrame(index=df.index)
    
    # Price features
    features['returns'] = df['close'].pct_change()
    features['log_returns'] = np.log(df['close'] / df['close'].shift(1))
    features['price_ma5'] = df['close'].rolling(5).mean()
    features['price_ma20'] = df['close'].rolling(20).mean()
    features['price_ratio_ma5'] = df['close'] / features['price_ma5']
    features['price_ratio_ma20'] = df['close'] / features['price_ma20']
    
    # Volume features
    features['volume_ma5'] = df['volume'].rolling(5).mean()
    features['volume_ratio'] = df['volume'] / features['volume_ma5']
    features['volume_price_trend'] = features['volume_ratio'] * features['returns']
    
    # Volatility
    features['volatility'] = features['returns'].rolling(20).std()
    features['volatility_ratio'] = features['returns'].abs() / features['volatility']
    
    # RSI
    delta = df['close'].diff()
    gain = (delta.where(delta > 0, 0)).rolling(window=14).mean()
    loss = (-delta.where(delta < 0, 0)).rolling(window=14).mean()
    rs = gain / loss
    features['rsi'] = 100 - (100 / (1 + rs))
    
    # MACD
    exp1 = df['close'].ewm(span=12).mean()
    exp2 = df['close'].ewm(span=26).mean()
    features['macd'] = exp1 - exp2
    features['macd_signal'] = features['macd'].ewm(span=9).mean()
    features['macd_hist'] = features['macd'] - features['macd_signal']
    
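    # Drop warm-up rows and any infinities produced by zero denominators
    return features.replace([np.inf, -np.inf], np.nan).dropna()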

def generate_fallback_ohlcv_data(n_periods=1000, symbol="BTC-USD"):
    """Generate synthetic OHLCV data"""
    dates = pd.date_range(start='2023-01-01', periods=n_periods, freq='h')  # hourly bars; the 'H' alias is deprecated in recent pandas
    
    # Random walk with drift and regime changes
    np.random.seed(CFG['seed'])
    
    # Create regime changes
    regime_changes = np.random.choice(n_periods, size=5, replace=False)
    regime_changes.sort()
    
    returns = []
    current_vol = 0.01
    
    for i in range(n_periods):
        # Change volatility at regime boundaries
        if i in regime_changes:
            current_vol = np.random.uniform(0.005, 0.02)
        
        # Generate return with current volatility
        ret = np.random.normal(0.00005, current_vol)
        returns.append(ret)
    
    returns = np.array(returns)
    prices = 50000 * np.exp(np.cumsum(returns))
    
    data = []
    for i, (date, price) in enumerate(zip(dates, prices)):
        high = price * (1 + abs(np.random.normal(0, 0.005)))
        low = price * (1 - abs(np.random.normal(0, 0.005)))
        open_price = prices[i-1] if i > 0 else price
        volume = np.random.uniform(100, 1000) * (1 + abs(returns[i]) * 10)
        
        data.append({
            'timestamp': date,
            'open': open_price,
            'high': high,
            'low': low,
            'close': price,
            'volume': volume
        })
    
    return pd.DataFrame(data)

def create_training_dataset(n_periods, symbol):
    """Create training dataset with features and labels"""
    
    print(f"üîÑ Creating training dataset ({n_periods} periods)...")
    
    # Generate or load OHLCV data
    try:
        if modules_available:
            from arbi.ai.training_v2 import generate_synthetic_ohlcv_data
            df = generate_synthetic_ohlcv_data(n_periods, symbol)
            print("✅ Using repository OHLCV generation")
        else:
            raise ImportError("repository modules not available")
    except Exception as e:
        # Report why the repository path was skipped instead of hiding the error
        print(f"⚠️  Repository OHLCV generation unavailable ({e})")
        df = generate_fallback_ohlcv_data(n_periods, symbol)
        print("✅ Using fallback OHLCV generation")
    
    # Compute features
    try:
        if modules_available:
            from arbi.ai.feature_engineering_v2 import compute_features_deterministic
            feature_result = compute_features_deterministic(df, symbol)
            feature_df = feature_result.features
            print("✅ Using repository feature engineering")
        else:
            raise ImportError("Using fallback")
    except Exception:
        # Any failure in the repository path falls back to the local feature builder
        feature_df = create_fallback_features(df)
        print("✅ Using fallback feature engineering")
    
    # Create labels
    future_periods = CFG['horizon']
    threshold = CFG['pos_thresh']
    
    # Calculate future returns
    future_returns = df['close'].shift(-future_periods) / df['close'] - 1
    
    # Binary classification: 1 if return > threshold, 0 otherwise
    labels_binary = (future_returns > threshold).astype(int)
    
    # Regression target: actual future return
    labels_regression = future_returns
    
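    # Keep only rows that have both a complete feature vector and a known future
    # return. Feature engineering may drop warm-up rows, so align on the shared
    # index rather than a positional mask (assumes feature_df keeps df's index).
    valid_index = feature_df.index.intersection(future_returns.dropna().index)
    
    feature_df = feature_df.loc[valid_index].reset_index(drop=True)
    labels_binary = labels_binary.loc[valid_index].reset_index(drop=True)
    labels_regression = labels_regression.loc[valid_index].reset_index(drop=True)
    timestamps = df.loc[valid_index, 'timestamp'].reset_index(drop=True)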
    
    print("✅ Dataset created:")
    print(f"  Samples: {len(feature_df)}")
    print(f"  Features: {feature_df.shape[1]}")
    print(f"  Positive class: {labels_binary.sum()}/{len(labels_binary)} ({100*labels_binary.mean():.1f}%)")
    print(f"  Regression target range: {labels_regression.min():.4f} to {labels_regression.max():.4f}")
    
    return feature_df, labels_binary, labels_regression, timestamps

# Create training dataset
# Falls back to 1000 periods if 'n_periods' was never added to CFG
X, y_binary, y_regression, timestamps = create_training_dataset(CFG.get('n_periods', 1000), SYMBOL)
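
The labeling step in `create_training_dataset` can be checked by hand on a toy series. The sketch below is illustrative only: the prices, `horizon = 2`, and `pos_thresh = 0.01` are made-up values chosen for easy arithmetic, not the notebook's configuration.

```python
import pandas as pd

# Toy close prices (hypothetical values)
close = pd.Series([100.0, 101.0, 103.0, 102.0, 104.0])
horizon = 2        # stands in for CFG['horizon']
pos_thresh = 0.01  # stands in for CFG['pos_thresh']

# Return realized `horizon` bars ahead; the last `horizon` rows have no future bar
future_returns = close.shift(-horizon) / close - 1

# Drop rows without a future price before labeling, so a NaN never becomes a 0 label
valid = future_returns.dropna()
labels = (valid > pos_thresh).astype(int)

print(valid.round(4).tolist())
print(labels.tolist())  # -> [1, 0, 0]
```

Only the first row clears the 1% threshold (103 / 100 - 1 = 3%); the other two future returns are just under 1%, and the final two rows are dropped because no price exists two bars ahead.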

# 🎉 Notebook Complete!

This completes the data-preparation portion of the Google Colab training notebook. To continue with training and saving models, run the additional chunks or use the CLI script.

**Next steps:**
1. Run the remaining training cells
2. Save model artifacts
3. Create training manifest
4. Download results

**Or use the CLI script:** `python tools/colab_train.py`
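
For the "Create training manifest" step, one possible shape is a JSON file that lists each saved artifact with its size and SHA-256 hash. This is a minimal sketch under stated assumptions, not the repository's manifest format: `build_manifest`, the field names, and the demo directory are all hypothetical, and only the Python standard library is used.

```python
import hashlib
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def build_manifest(model_dir, config):
    """Describe every file in model_dir (hypothetical manifest layout)."""
    artifacts = []
    for path in sorted(Path(model_dir).iterdir()):
        # Skip subdirectories and any manifest written by an earlier run
        if not path.is_file() or path.name == "manifest.json":
            continue
        artifacts.append({
            "file": path.name,
            "bytes": path.stat().st_size,
            "sha256": hashlib.sha256(path.read_bytes()).hexdigest(),
        })
    return {
        "created_utc": datetime.now(timezone.utc).isoformat(),
        "config": config,
        "artifacts": artifacts,
    }


# Demo on a throwaway directory containing a fake artifact
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "model.bin").write_bytes(b"not a real model")
    manifest = build_manifest(tmp, {"horizon": 5, "seed": 42})
    (Path(tmp) / "manifest.json").write_text(json.dumps(manifest, indent=2))
    print(json.dumps(manifest["artifacts"], indent=2))
```

In the notebook, `model_dir` would presumably be `MODEL_SAVE_REPO_PATH` and `config` would be `CFG`; recording hashes makes it possible to confirm that the copy on Drive matches the copy in the repo.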