# Kaggle Competition - Training Pipeline

This notebook is designed to work with:
- Google Colab for computation (GPU/TPU)
- Google Drive for data storage
- GitHub repository for code versioning
- Local development with Claude Code

## Setup Checklist
- [ ] Update `GITHUB_REPO` with your repository URL
- [ ] Update `DRIVE_BASE` with your Google Drive path
- [ ] Update `COMPETITION_NAME` with the competition name
- [ ] Ensure the competition data is in Google Drive under `data/raw/` inside `DRIVE_BASE`


## Configuration

In [None]:
# === CONFIGURATION - UPDATE THESE ===
GITHUB_REPO = "https://github.com/your-username/competition-name.git"
DRIVE_BASE = "/content/drive/MyDrive/kaggle/competition-name"
COMPETITION_NAME = "competition-name"

# Paths
DATA_PATH = f"{DRIVE_BASE}/data/raw"
PROCESSED_PATH = f"{DRIVE_BASE}/data/processed"
MODEL_PATH = f"{DRIVE_BASE}/models"
OUTPUT_PATH = f"{DRIVE_BASE}/outputs"
SUBMISSION_PATH = f"{DRIVE_BASE}/submissions"

# Experiment tracking
EXPERIMENT_NAME = "baseline_v1"
EXPERIMENT_NOTES = "Initial baseline with basic features"

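Because the configuration cell ships with template values, a quick check before running the rest of the notebook can catch a setting that was never edited. The sketch below is one possible way to do that; `find_placeholders` and `build_paths` are hypothetical helper names, and the `titanic` values are purely illustrative.

```python
from pathlib import PurePosixPath

# Substrings that mark a value as still being the template default.
PLACEHOLDERS = ("your-username", "competition-name")


def find_placeholders(settings):
    """Return the names of settings that still contain a template placeholder."""
    return [
        name
        for name, value in settings.items()
        if any(token in value for token in PLACEHOLDERS)
    ]


def build_paths(drive_base):
    """Derive the five project directories from one base path."""
    base = PurePosixPath(drive_base)
    return {
        "DATA_PATH": str(base / "data" / "raw"),
        "PROCESSED_PATH": str(base / "data" / "processed"),
        "MODEL_PATH": str(base / "models"),
        "OUTPUT_PATH": str(base / "outputs"),
        "SUBMISSION_PATH": str(base / "submissions"),
    }


# Illustrative values only: the repository URL is deliberately left unedited.
settings = {
    "GITHUB_REPO": "https://github.com/your-username/competition-name.git",
    "DRIVE_BASE": "/content/drive/MyDrive/kaggle/titanic",
    "COMPETITION_NAME": "titanic",
}

unset = find_placeholders(settings)
paths = build_paths(settings["DRIVE_BASE"])
print("Still using template values:", unset)
print(paths["DATA_PATH"])
```

In the notebook itself, the same check could be run against the real `GITHUB_REPO`, `DRIVE_BASE`, and `COMPETITION_NAME` values, stopping early if the returned list is not empty.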

## 1. Environment Setup

In [None]:
# Check GPU availability
import torch
print(f"GPU Available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU Device: {torch.cuda.get_device_name(0)}")
    print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")

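The training cell later passes `device='gpu'` to LightGBM, which fails on a CPU-only runtime. One option is to pick the device once, up front. The sketch below uses the presence of `nvidia-smi` on the PATH as a rough heuristic; `pick_lightgbm_device` and `LGBM_DEVICE` are hypothetical names, and the check says nothing about whether the installed LightGBM build actually supports GPU training.

```python
import shutil


def pick_lightgbm_device():
    """Return 'gpu' if the NVIDIA driver tool is on PATH, else 'cpu'.

    Heuristic assumption: finding nvidia-smi implies a usable GPU runtime.
    It does not prove the installed LightGBM build has GPU support.
    """
    return "gpu" if shutil.which("nvidia-smi") else "cpu"


LGBM_DEVICE = pick_lightgbm_device()
print(f"LightGBM device: {LGBM_DEVICE}")
```

The result could then be passed as `device=LGBM_DEVICE` when constructing the model, so the same notebook runs on both runtime types.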

In [None]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

# Create the project directories if they do not exist yet
import os
for path in [DATA_PATH, PROCESSED_PATH, MODEL_PATH, OUTPUT_PATH, SUBMISSION_PATH]:
    os.makedirs(path, exist_ok=True)
    print(f"✓ {path}")


## 2. Clone Repository

In [None]:
# Clone your GitHub repository
!rm -rf /content/repo  # Clean up if exists
!git clone {GITHUB_REPO} /content/repo
%cd /content/repo
!git status


## 3. Install Dependencies

In [None]:
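# Install dependencies from requirements.txt or pyproject.toml
import os

# Option A: requirements.txt
if os.path.exists('/content/repo/requirements.txt'):
    !pip install -q -r /content/repo/requirements.txt
    print("✓ Dependencies installed from requirements.txt")

# Option B: pyproject.toml with uv
elif os.path.exists('/content/repo/pyproject.toml'):
    !pip install -q uv
    # --system targets Colab's interpreter; uv otherwise expects a virtual environment
    !uv pip install --system -e /content/repo
    print("✓ Dependencies installed from pyproject.toml")

else:
    print("⚠ No requirements.txt or pyproject.toml found; nothing installed")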

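The choice between the two install options can also be expressed as a plain function, which makes the decision easy to test outside Colab. This is a minimal sketch, not the notebook's own mechanism: `install_command` is a hypothetical helper, and the `--system` flag is included on the assumption that the runtime has no active virtual environment.

```python
import os
import sys
import tempfile


def install_command(repo_dir):
    """Return the pip/uv argument list for repo_dir, or None if nothing applies."""
    requirements = os.path.join(repo_dir, "requirements.txt")
    pyproject = os.path.join(repo_dir, "pyproject.toml")
    if os.path.exists(requirements):
        return [sys.executable, "-m", "pip", "install", "-q", "-r", requirements]
    if os.path.exists(pyproject):
        return ["uv", "pip", "install", "--system", "-e", repo_dir]
    return None


# Demonstration against a throwaway directory instead of /content/repo.
with tempfile.TemporaryDirectory() as demo_repo:
    empty_result = install_command(demo_repo)
    open(os.path.join(demo_repo, "pyproject.toml"), "w").close()
    uv_result = install_command(demo_repo)
    open(os.path.join(demo_repo, "requirements.txt"), "w").close()
    pip_result = install_command(demo_repo)

print(empty_result)
print(uv_result[:4])
print(pip_result[1:4])
```

The returned list can be handed to `subprocess.run(cmd, check=True)`, which raises if the install fails instead of continuing silently.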

## 4. Import Custom Modules

In [None]:
# Add repository to Python path
import sys
sys.path.insert(0, '/content/repo/src')

# Import your custom modules
# Example:
# from data_loader import load_train_data, load_test_data
# from feature_engineering import create_features
# from models import train_model, predict

# Import standard libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime

print("✓ Modules imported")


## 5. Load Data from Google Drive

In [None]:
# Load data from Google Drive
print("Loading data from Google Drive...")

# Example: Adjust to your competition's data structure
train_df = pd.read_csv(f"{DATA_PATH}/train.csv")
test_df = pd.read_csv(f"{DATA_PATH}/test.csv")

print(f"Train shape: {train_df.shape}")
print(f"Test shape: {test_df.shape}")

# Display sample
display(train_df.head())

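The later cells assume the training file has `id` and `target` columns. Reading only the header row is a cheap way to confirm that before loading a large file. The following sketch uses the standard `csv` module; `missing_columns` is a hypothetical helper and the demo file is a throwaway stand-in for `train.csv`.

```python
import csv
import os
import tempfile


def missing_columns(csv_path, required):
    """Return the required column names absent from the CSV header row."""
    with open(csv_path, newline="") as f:
        header = next(csv.reader(f), [])
    return [name for name in required if name not in header]


# Demonstration with a tiny throwaway file; 'id' and 'target' are the
# column names the training cells assume.
with tempfile.TemporaryDirectory() as tmp:
    demo_csv = os.path.join(tmp, "train.csv")
    with open(demo_csv, "w", newline="") as f:
        csv.writer(f).writerows([["id", "feature_a"], [1, 0.5]])
    missing = missing_columns(demo_csv, ["id", "target"])

print("Missing columns:", missing)
```

Against the real data this would be called as `missing_columns(f"{DATA_PATH}/train.csv", ["id", "target"])`, with the column names adjusted to the competition.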

## 6. Exploratory Data Analysis (Optional)

In [None]:
# Quick EDA
print("\n=== Data Info ===")
train_df.info()

print("\n=== Missing Values ===")
print(train_df.isnull().sum())

print("\n=== Basic Statistics ===")
display(train_df.describe())


## 7. Feature Engineering

In [None]:
# Feature engineering
print("Creating features...")

# Example: Use your custom functions
# train_df = create_features(train_df)
# test_df = create_features(test_df)

# Or implement inline for quick experiments
# train_df['new_feature'] = train_df['col1'] * train_df['col2']

print(f"✓ Features created. Train shape: {train_df.shape}")


## 8. Train Model

In [None]:
# Prepare data for training
from sklearn.model_selection import train_test_split

# Example: Adjust to your competition
feature_cols = [col for col in train_df.columns if col not in ['id', 'target']]
X = train_df[feature_cols]
y = train_df['target']

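# Stratify so both splits keep the same class balance (assumes a classification target)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)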

print(f"Train: {X_train.shape}, Val: {X_val.shape}")


In [None]:
# Train model
from lightgbm import LGBMClassifier, early_stopping, log_evaluation
from sklearn.metrics import roc_auc_score

print("Training model...")

model = LGBMClassifier(
    n_estimators=1000,
    learning_rate=0.01,
    max_depth=7,
    num_leaves=31,
    device='gpu',  # Requires a GPU-enabled LightGBM build; use 'cpu' otherwise
    random_state=42
)

model.fit(
    X_train, y_train,
    eval_set=[(X_val, y_val)],
    eval_metric='auc',
    callbacks=[early_stopping(50), log_evaluation(100)]
)

# Evaluate
val_pred = model.predict_proba(X_val)[:, 1]
val_score = roc_auc_score(y_val, val_pred)
print(f"\n✓ Validation AUC: {val_score:.4f}")


## 9. Save Model to Google Drive

In [None]:
# Save model
import joblib
from datetime import datetime

timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
model_filename = f"{MODEL_PATH}/{EXPERIMENT_NAME}_{timestamp}.pkl"

joblib.dump(model, model_filename)
print(f"✓ Model saved: {model_filename}")

# Save feature importance
feature_importance = pd.DataFrame({
    'feature': feature_cols,
    'importance': model.feature_importances_
}).sort_values('importance', ascending=False)

importance_filename = f"{OUTPUT_PATH}/feature_importance_{timestamp}.csv"
feature_importance.to_csv(importance_filename, index=False)
print(f"✓ Feature importance saved: {importance_filename}")


## 10. Generate Predictions

In [None]:
# Generate predictions on test set
print("Generating predictions...")

X_test = test_df[feature_cols]
test_pred = model.predict_proba(X_test)[:, 1]

print(f"✓ Predictions generated for {len(test_pred)} samples")


## 11. Create Submission File

In [None]:
# Create submission file
submission = pd.DataFrame({
    'id': test_df['id'],
    'target': test_pred
})

submission_filename = f"{SUBMISSION_PATH}/submission_{EXPERIMENT_NAME}_{timestamp}.csv"
submission.to_csv(submission_filename, index=False)

print(f"✓ Submission saved: {submission_filename}")
display(submission.head())

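A few basic checks on the submission can save a wasted upload. The sketch below looks for the wrong row count, duplicate ids, and predictions that are not finite probabilities. `submission_problems` is a hypothetical helper, and the `[0, 1]` range is an assumption that holds for `predict_proba` output but not for every competition format.

```python
import math


def submission_problems(ids, preds, expected_rows):
    """Return a list of human-readable problems; empty means none were found."""
    ids, preds = list(ids), list(preds)
    problems = []
    if len(ids) != expected_rows or len(preds) != expected_rows:
        problems.append(
            f"expected {expected_rows} rows, got {len(ids)} ids and {len(preds)} predictions"
        )
    if len(set(ids)) != len(ids):
        problems.append("duplicate ids")
    if any(not math.isfinite(p) for p in preds):
        problems.append("non-finite predictions")
    elif any(p < 0.0 or p > 1.0 for p in preds):
        problems.append("predictions outside [0, 1]")
    return problems


# Made-up values for illustration.
good = submission_problems([1, 2, 3], [0.1, 0.9, 0.5], expected_rows=3)
bad = submission_problems([1, 1, 3], [0.1, float("nan"), 0.5], expected_rows=3)
print(good)
print(bad)
```

In the notebook, a call such as `submission_problems(submission['id'], submission['target'], len(test_df))` should return an empty list before the file is submitted.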

## 12. Submit to Kaggle (Optional)

In [None]:
# Setup Kaggle credentials (use Colab secrets)
from google.colab import userdata
import os

try:
    os.environ['KAGGLE_USERNAME'] = userdata.get('KAGGLE_USERNAME')
    os.environ['KAGGLE_KEY'] = userdata.get('KAGGLE_KEY')
except Exception as e:
    print(f"⚠ Could not load Kaggle credentials: {e}")
    print("Tip: Add KAGGLE_USERNAME and KAGGLE_KEY to Colab secrets")
else:
    # A failing shell command does not raise a Python exception,
    # so read the CLI output to confirm the upload actually succeeded.
    !kaggle competitions submit -c {COMPETITION_NAME} \
        -f {submission_filename} \
        -m "{EXPERIMENT_NAME}: {EXPERIMENT_NOTES}"
    print("Submit command finished; check the CLI output above for the result")

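A `!` shell command does not raise a Python exception when it fails, so a `try`/`except` around it cannot detect a rejected submission. Running the CLI through `subprocess` exposes the exit status. This is a sketch under that assumption; `build_submit_command` and `submit` are hypothetical helpers, and only the `-c`, `-f`, and `-m` flags already used in the submit cell appear here.

```python
import subprocess


def build_submit_command(competition, submission_file, message):
    """Build the argument list for `kaggle competitions submit`."""
    return [
        "kaggle", "competitions", "submit",
        "-c", competition,
        "-f", submission_file,
        "-m", message,
    ]


def submit(competition, submission_file, message):
    """Run the CLI and return True only if it exits with status 0."""
    cmd = build_submit_command(competition, submission_file, message)
    try:
        result = subprocess.run(cmd, capture_output=True, text=True)
    except FileNotFoundError:
        print("kaggle CLI not found on PATH")
        return False
    print(result.stdout or result.stderr)
    return result.returncode == 0


# Only the command is built here; call submit(...) to actually upload.
cmd = build_submit_command("competition-name", "submission.csv", "baseline_v1: notes")
print(cmd)
```

Passing the arguments as a list also avoids shell quoting problems when the message contains quotes or other special characters.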

## 13. Log Experiment

In [None]:
# Log experiment details
import json

experiment_log = {
    'experiment_name': EXPERIMENT_NAME,
    'timestamp': timestamp,
    'val_score': float(val_score),
    'model': 'LightGBM',
    'features': feature_cols,
    'params': model.get_params(),
    'notes': EXPERIMENT_NOTES,
    'model_path': model_filename,
    'submission_path': submission_filename
}

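log_file = f"{OUTPUT_PATH}/experiments.jsonl"
with open(log_file, 'a') as f:
    # default=str keeps logging from failing on a non-JSON-serializable parameter value
    json.dump(experiment_log, f, default=str)
    f.write('\n')

print(f"✓ Experiment logged: {log_file}")
print(json.dumps(experiment_log, indent=2, default=str))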

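Since each run appends one JSON object per line to `experiments.jsonl`, the log can be read back to compare experiments. The sketch below loads the file and picks the run with the highest `val_score`. `load_experiments` and `best_experiment` are hypothetical helpers, and the scores in the demo are made up.

```python
import json
import os
import tempfile


def load_experiments(log_file):
    """Read a JSON Lines log into a list of dicts, skipping blank lines."""
    with open(log_file) as f:
        return [json.loads(line) for line in f if line.strip()]


def best_experiment(runs):
    """Return the run with the highest val_score, or None for an empty log."""
    return max(runs, key=lambda run: run["val_score"], default=None)


# Demonstration with made-up scores in a throwaway file.
with tempfile.TemporaryDirectory() as tmp:
    demo_log = os.path.join(tmp, "experiments.jsonl")
    with open(demo_log, "a") as f:
        for name, score in [("baseline_v1", 0.81), ("baseline_v2", 0.84)]:
            f.write(json.dumps({"experiment_name": name, "val_score": score}) + "\n")
    runs = load_experiments(demo_log)

best = best_experiment(runs)
print(best["experiment_name"], best["val_score"])
```

Against the real log this would be `load_experiments(f"{OUTPUT_PATH}/experiments.jsonl")`. The result is a list of dicts, so it can also be passed to `pd.DataFrame(...)` for sorting and plotting.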

## Summary

**Experiment:** the value of `EXPERIMENT_NAME` set in the Configuration cell

**Results:**
- Validation AUC: see the output of the training cell in section 8
- Model saved to Google Drive
- Submission file created

**Next Steps:**
1. Review feature importance
2. Update code in local repository
3. Run new experiments with improved features/models
4. Track results and iterate
