# Model Training for Air Quality Prediction

This notebook demonstrates the machine learning model training pipeline for air quality prediction.

## Objectives
1. Train a baseline Random Forest model
2. Optimize LightGBM hyperparameters using Optuna
3. Train the primary LightGBM model with the best parameters
4. Evaluate model performance
5. Compare the baseline and primary models
6. Extract feature importance
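
The evaluation and comparison objectives can be illustrated with a small helper built on the scikit-learn metrics this notebook imports. This is a sketch only: `regression_report` and `compare_models` are hypothetical names, the numbers are toy values rather than results, and the custom accuracy metric is not reproduced because its definition is not given here.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


def regression_report(y_true, y_pred):
    """Return MAE, RMSE and R^2 for one set of predictions."""
    return {
        "mae": float(mean_absolute_error(y_true, y_pred)),
        # Take the square root explicitly so this works across scikit-learn versions
        "rmse": float(np.sqrt(mean_squared_error(y_true, y_pred))),
        "r2": float(r2_score(y_true, y_pred)),
    }


def compare_models(y_true, predictions):
    """Build a table with one row per model, sorted by RMSE (lowest first).

    `predictions` maps a model name to its predicted values.
    """
    rows = {name: regression_report(y_true, y_pred) for name, y_pred in predictions.items()}
    return pd.DataFrame.from_dict(rows, orient="index").sort_values("rmse")


# Toy values for illustration only; these are not results from the notebook.
y_true = [1.0, 2.0, 3.0]
toy_predictions = {
    "baseline": [1.0, 2.0, 4.0],
    "primary": [1.0, 2.0, 3.5],
}
comparison = compare_models(y_true, toy_predictions)
print(comparison)
```

Putting every model through the same report function keeps the baseline and primary scores directly comparable.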

## Models
- **Baseline**: Random Forest (n_estimators=100, max_depth=10, random_state=42, n_jobs=-1)
- **Primary**: LightGBM with hyperparameter optimization
- **Evaluation**: Time-based 80:20 split (earliest 80% of the data for training, most recent 20% for testing) with a custom accuracy metric
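
The baseline and evaluation settings listed above can be sketched as follows. This is a minimal illustration, not the project's implementation: the `time_based_split` helper, the `timestamp`, `temperature`, `humidity` and `pm25` column names, and the synthetic data are all assumptions made for the example. Only the Random Forest hyperparameters and the 80:20 ratio come from this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor


def time_based_split(df, time_col, train_fraction=0.8):
    """Split chronologically: the earliest rows train, the latest rows test."""
    ordered = df.sort_values(time_col).reset_index(drop=True)
    cutoff = int(len(ordered) * train_fraction)
    return ordered.iloc[:cutoff], ordered.iloc[cutoff:]


# Synthetic stand-in data; all column names are hypothetical.
rng = np.random.default_rng(42)
n_rows = 200
frame = pd.DataFrame({
    "timestamp": pd.date_range("2024-01-01", periods=n_rows, freq="D"),
    "temperature": rng.normal(20.0, 5.0, n_rows),
    "humidity": rng.uniform(30.0, 90.0, n_rows),
})
frame["pm25"] = (
    0.5 * frame["temperature"] + 0.1 * frame["humidity"] + rng.normal(0.0, 1.0, n_rows)
)

train_df, test_df = time_based_split(frame, "timestamp")
feature_cols = ["temperature", "humidity"]

# Baseline hyperparameters as listed in the Models section
baseline = RandomForestRegressor(n_estimators=100, max_depth=10, random_state=42, n_jobs=-1)
baseline.fit(train_df[feature_cols], train_df["pm25"])
test_predictions = baseline.predict(test_df[feature_cols])

# Impurity-based feature importances, highest first
importances = pd.Series(baseline.feature_importances_, index=feature_cols).sort_values(
    ascending=False
)
print(importances)
```

Splitting by time rather than at random keeps future observations out of the training set, so the test score reflects how the model would behave when forecasting.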


In [None]:
# Install required packages if not already installed
import subprocess
import sys

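# pip distribution names whose import name differs
IMPORT_NAMES = {"scikit-learn": "sklearn"}

def install_package(package):
    """Install package using pip if it cannot already be imported."""
    # Look up the import name, e.g. "scikit-learn" is imported as "sklearn"
    import_name = IMPORT_NAMES.get(package, package)
    try:
        __import__(import_name)
        print(f"{package} is already installed")
    except ImportError:
        print(f"Installing {package}...")
        subprocess.check_call([sys.executable, "-m", "pip", "install", package])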

# Install required packages
required_packages = [
    "pandas",
    "numpy",
    "matplotlib",
    "seaborn",
    "scikit-learn",
    "lightgbm",
    "optuna",
    "xgboost",
    "tqdm",
    "joblib"
]

for package in required_packages:
    install_package(package)

# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split, TimeSeriesSplit
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import lightgbm as lgb
import optuna
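# Newer Optuna releases ship integrations in the separate `optuna-integration`
# package; install it if this import fails
from optuna.integration import LightGBMPruningCallback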
import joblib
import warnings
import os

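# Configure plotting
plt.style.use('seaborn-v0_8')  # this style name requires matplotlib >= 3.6
sns.set_palette("husl")

# Suppress all warnings to keep notebook output readable
warnings.filterwarnings('ignore')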

print("Model training libraries imported successfully!")


## 1. Data Loading and Model Setup


In [None]:
# Load engineered features data
data_path = "../data/features/"

# TODO: Replace with actual engineered data loading
print("Model training section - to be implemented with engineered data")
print(f"Expected data path: {data_path}")

# Import the project-local model training class from ../src
import sys
sys.path.append('../src')
from models import AirQualityModelTrainer

# Initialize model trainer
model_trainer = AirQualityModelTrainer(data_path + "engineered_features.csv")

print("Model trainer initialized successfully!")
