# Breakout Stock Classifier: Scaffolding and Expansion

This notebook scaffolds a modular backend for a breakout stock classifier, breaks out model components, adds data point functionality, displays and edits the training cell, and updates the workflow to handle more stocks.

## 0. Setup & Installation

Run this cell first to install required packages (required for Kaggle/Colab).

In [1]:
# Install required packages (needed on Kaggle/Colab; comment out if they are already installed)
!pip install eodhd xgboost scikit-learn pandas numpy joblib -q

# Optional: Upload your stock list CSV in Kaggle/Colab
# For Kaggle: Use "Add Data" button on the right panel instead
# For Colab:
# from google.colab import files
# print("📁 Upload your stocks-list.csv file:")
# uploaded = files.upload()

## Google Colab / Kaggle: Uploading Your CSV

### For Google Colab:

If you are using Google Colab, you can upload your `stocks-list.csv` file directly to the Colab runtime with the following code cell:

```python
from google.colab import files
uploaded = files.upload()  # This will prompt you to select and upload your CSV file
```

- After uploading, the file will be in the current working directory.
- If your code expects the file in a `data/` folder, move it with:

```python
import os
os.makedirs('data', exist_ok=True)
os.replace('stocks-list.csv', 'data/stocks-list.csv')
```

Alternatively, you can mount your Google Drive and access files from there:

```python
from google.colab import drive
drive.mount('/content/drive')
# Then use the path: '/content/drive/My Drive/path/to/stocks-list.csv'
```

### For Kaggle:

1. **Upload your dataset to Kaggle:**
   - Go to [kaggle.com/datasets](https://www.kaggle.com/datasets)
   - Click "New Dataset" and upload `stocks-list.csv`
   - Name it "stocks-list" (or similar)

2. **Add the dataset to your notebook:**
   - In your Kaggle notebook, click "Add Data" on the right panel
   - Search for your uploaded dataset
   - Add it to your notebook

3. **The notebook will automatically detect Kaggle paths:**
   - `/kaggle/input/stocks-list/stocks-list.csv`
   - `/kaggle/input/stock-list/stocks-list.csv`

4. **Output files** will be saved to `/kaggle/working/` and appear in the "Output" tab

Adjust your code to use the correct path depending on your upload method.
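
One way to handle the different upload locations is to probe a list of candidate paths and use the first one that exists. The sketch below is a minimal illustration: `find_first_existing` is a hypothetical helper name, and the candidate list is an assumption that should be adjusted to match your own dataset name.

```python
from pathlib import Path
from typing import Iterable, Optional


def find_first_existing(candidates: Iterable[str]) -> Optional[Path]:
    """Return the first candidate path that exists as a file, or None."""
    for candidate in candidates:
        path = Path(candidate)
        if path.is_file():
            return path
    return None


# Candidate locations based on the upload options described above
CANDIDATE_PATHS = [
    "/kaggle/input/stocks-list/stocks-list.csv",  # Kaggle "Add Data"
    "/content/stocks-list.csv",                   # Colab upload
    "data/stocks-list.csv",                       # file moved into data/
    "stocks-list.csv",                            # current directory
]

csv_path = find_first_existing(CANDIDATE_PATHS)
print(csv_path if csv_path else "stocks-list.csv not found in any candidate location")
```

Returning `None` instead of raising lets the caller decide whether a missing file is fatal.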

## 1. Scaffold Model Backend
Set up the basic backend structure for the breakout classifier, including imports and class/function definitions.

In [2]:
# Imports and backend scaffolding
import pandas as pd
import numpy as np
from typing import List, Dict
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
import joblib

# Placeholder for backend class
class BreakoutStockClassifier:
    def __init__(self):
        self.model = None
        self.features = None
    
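    def fit(self, X, y):
        # Build a default model only when none has been assigned, so a
        # pre-configured model (as in the training cell) is not overwritten
        if self.model is None:
            self.model = XGBClassifier(n_estimators=200, max_depth=5, random_state=42)
        self.model.fit(X, y)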
    
    def predict(self, X):
        return self.model.predict_proba(X)[:, 1] if self.model else None
    
    def save(self, path):
        joblib.dump(self.model, path)
    
    def load(self, path):
        self.model = joblib.load(path)


In [3]:
# --- EOD Historical Data: Download US stock price data (replaces Alpha Vantage) ---

from eodhd import APIClient
import pandas as pd
import time
import numpy as np

# Read the EODHD API token from the environment rather than hard-coding it
import os
EODHD_API_KEY = os.environ.get("EODHD_API_KEY", "")

def download_eodhd_bulk(tickers, api_key=EODHD_API_KEY, start_date="2015-01-01", end_date=None, batch_size=5, delay=2):
    """
    Download daily historical data for a list of tickers from EODHD.
    Returns a DataFrame with all data concatenated.
    Ensures 'close' and 'date' columns exist for all tickers.
    """
    # Defensive validation: ensure tickers is a list of strings
    if tickers is None:
        print("Warning: tickers is None, returning empty DataFrame")
        return pd.DataFrame()
    
    if not hasattr(tickers, '__iter__') or isinstance(tickers, (str, int, float)):
        print(f"Warning: tickers is not iterable (got {type(tickers).__name__}), returning empty DataFrame")
        return pd.DataFrame()
    
    # Convert to list and filter out non-string items
    tickers = [t for t in tickers if isinstance(t, str) and t.strip()]
    
    if len(tickers) == 0:
        print("Warning: No valid tickers provided, returning empty DataFrame")
        return pd.DataFrame()
    
    client = APIClient(api_key)
    all_data = []
    total = len(tickers)
    for i in range(0, total, batch_size):
        batch = tickers[i:i+batch_size]
        print(f"Downloading batch {i//batch_size+1} ({i+1}-{min(i+batch_size, total)}) of {total}...")
        for ticker in batch:
            try:
                df = client.get_historical_data(
                    symbol=ticker,
                    interval="d",
                    iso8601_start=start_date,
                    iso8601_end=end_date if end_date else ""
                )
                if not df.empty:
                    # Ensure 'date' column exists (EODHD may return date as index)
                    if 'date' not in df.columns:
                        if df.index.name == 'date' or df.index.name is None:
                            df = df.reset_index()
                            # Rename index column to 'date' if needed
                            if 'index' in df.columns:
                                df = df.rename(columns={'index': 'date'})
                            elif df.columns[0] not in ['date', 'open', 'high', 'low', 'close', 'volume']:
                                df = df.rename(columns={df.columns[0]: 'date'})
                    
                    # Ensure 'close' column exists
                    if 'close' not in df.columns:
                        print(f"Ticker {ticker} missing 'close' column, adding as NaN.")
                        df['close'] = np.nan
                    
                    df['symbol'] = ticker
                    all_data.append(df)
                else:
                    print(f"No data for {ticker}")
            except Exception as e:
                print(f"Failed to download {ticker}: {e}")
        if i + batch_size < total:
            print(f"Sleeping for {delay} seconds to avoid API rate limits...")
            time.sleep(delay)
    if all_data:
        result_df = pd.concat(all_data, ignore_index=True)
        # Final check: ensure 'date' column exists
        if 'date' not in result_df.columns:
            print("Warning: 'date' column still missing after concat, attempting to fix...")
            result_df = result_df.reset_index()
            if 'index' in result_df.columns:
                result_df = result_df.rename(columns={'index': 'date'})
        return result_df
    else:
        print("No valid data downloaded.")
        return pd.DataFrame()

# Example usage:
# us_tickers = ["AAPL", "MSFT", "GOOG"]
# df = download_eodhd_bulk(us_tickers)
# print(df.head())

# Alpha Vantage has been replaced by EODHD; this stub makes any leftover calls fail loudly
def download_alpha_vantage_bulk(*args, **kwargs):
    raise NotImplementedError("Alpha Vantage integration has been replaced by EODHD.")
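
The batching pattern used by `download_eodhd_bulk` can be seen in isolation with a small helper. This is only a sketch: `iter_batches` is a hypothetical name, not part of the notebook or the `eodhd` package, and it leaves out the download and sleep steps.

```python
from typing import Iterator, List, Sequence


def iter_batches(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items`, each at most `batch_size` long."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


tickers = ["AAPL", "MSFT", "GOOG", "AMZN", "NVDA", "META", "TSLA"]
for number, batch in enumerate(iter_batches(tickers, batch_size=3), start=1):
    print(f"Batch {number}: {batch}")
# Batch 1: ['AAPL', 'MSFT', 'GOOG']
# Batch 2: ['AMZN', 'NVDA', 'META']
# Batch 3: ['TSLA']
```

The last batch is simply shorter when the ticker count is not a multiple of the batch size.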

In [4]:
# --- Download and use a comprehensive ticker list from an external CSV ---
import os

def get_tickers_from_csv(csv_path: str) -> list:
    """Load a comprehensive list of tickers from an external CSV file."""
    import pandas as pd
    df = pd.read_csv(csv_path)
    # Accept common column names for tickers
    for col in ['symbol', 'ticker', 'Ticker', 'SYMBOL', 'Symbol']:
        if col in df.columns:
            tickers = df[col].dropna().unique().tolist()
            return tickers
    raise ValueError(f"No ticker column found in {csv_path}. Columns found: {df.columns.tolist()}")

# Example usage:
# Download a full US stock list from NASDAQ, NYSE, AMEX, or use a third-party source like 'eodhistoricaldata.com', 'nasdaqtrader.com', or 'stockanalysis.com'.
# Place the CSV in your data directory, e.g., 'data/all_us_tickers.csv'.
# all_tickers = get_tickers_from_csv('data/all_us_tickers.csv')


## 2. Break Out Model Components
Separate the workflow into modular functions for data loading, preprocessing, model definition, and evaluation.

In [5]:
# Data loading function
def load_stock_data(csv_path: str) -> pd.DataFrame:
    """Load historical stock data from CSV or other sources."""
    return pd.read_csv(csv_path)

# Quality filter function
def filter_quality_stocks(df: pd.DataFrame, min_price=5.0, min_volume=500000) -> pd.DataFrame:
    """Filter out low-quality stocks (penny stocks, illiquid stocks)."""
    df = df.copy()
    
    # Minimum price filter (avoid penny stocks)
    df = df[df['close'] >= min_price]
    
    # Minimum volume filter (ensure liquidity)
    df = df[df['volume'] >= min_volume]
    
    # Price stability check (avoid stocks with extreme volatility)
    df['price_std_30d'] = df.groupby('symbol')['close'].transform(
        lambda x: x.rolling(30, min_periods=1).std()
    )
    df['volatility'] = df['price_std_30d'] / df['close']
    
    # Drop rows where the 30-day rolling std of close exceeds 50% of the price
    df = df[df['volatility'] < 0.5]
    
    return df

# Calculate momentum features
def calculate_momentum_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add momentum indicators to help identify sustained moves."""
    df = df.copy()
    
    # Sort by symbol and date
    df = df.sort_values(['symbol', 'date'])
    
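    # RSI (Relative Strength Index), 14-period simple-average variant.
    # Rolling means are computed per symbol so a window never spans two tickers.
    delta = df.groupby('symbol')['close'].diff()
    gain = delta.where(delta > 0, 0).groupby(df['symbol']).transform(lambda x: x.rolling(14).mean())
    loss = (-delta.where(delta < 0, 0)).groupby(df['symbol']).transform(lambda x: x.rolling(14).mean())
    rs = gain / (loss + 1e-10)  # Avoid division by zero
    df['rsi'] = 100 - (100 / (1 + rs))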
    
    # MACD
    ema_12 = df.groupby('symbol')['close'].transform(lambda x: x.ewm(span=12, adjust=False).mean())
    ema_26 = df.groupby('symbol')['close'].transform(lambda x: x.ewm(span=26, adjust=False).mean())
    df['macd'] = ema_12 - ema_26
    df['macd_signal'] = df.groupby('symbol')['macd'].transform(lambda x: x.ewm(span=9, adjust=False).mean())
    df['macd_histogram'] = df['macd'] - df['macd_signal']
    
    # Rate of Change (10-day)
    df['roc_10'] = (df['close'] / df.groupby('symbol')['close'].shift(10) - 1) * 100
    
    # Average True Range (volatility measure)
    df['prev_close'] = df.groupby('symbol')['close'].shift(1)
    # True range = max(high - low, |high - prev_close|, |low - prev_close|);
    # max(axis=1) skips the NaN terms on each symbol's first row
    df['tr'] = pd.concat(
        [
            df['high'] - df['low'],
            (df['high'] - df['prev_close']).abs(),
            (df['low'] - df['prev_close']).abs(),
        ],
        axis=1,
    ).max(axis=1)
    df['atr'] = df.groupby('symbol')['tr'].transform(lambda x: x.rolling(14, min_periods=1).mean())
    
    # Volume ratio
    df['avg_volume_30d'] = df.groupby('symbol')['volume'].transform(
        lambda x: x.rolling(30, min_periods=1).mean()
    )
    df['volume_ratio'] = df['volume'] / (df['avg_volume_30d'] + 1)  # Avoid division by zero
    
    return df

# Preprocessing function with trend filtering and sustained breakout definition
def preprocess_data(df: pd.DataFrame) -> pd.DataFrame:
    """
    Feature engineering and breakout labeling with trend filtering.
    Labels sustained breakouts (not just one-day spikes).
    """
    df = df.copy()
    
    # Ensure all price/volume columns are numeric
    for col in ['open', 'high', 'low', 'close', 'volume']:
        df[col] = pd.to_numeric(df[col], errors='coerce')
    
    # Sort by symbol and date
    df = df.sort_values(['symbol', 'date'])
    
    # Calculate moving averages for trend detection
    df['sma_50'] = df.groupby('symbol')['close'].transform(lambda x: x.rolling(50, min_periods=1).mean())
    df['sma_200'] = df.groupby('symbol')['close'].transform(lambda x: x.rolling(200, min_periods=1).mean())
    
    # Trend filter: Price must be above both 50 and 200 day MA
    df['uptrend'] = (df['close'] > df['sma_50']) & (df['close'] > df['sma_200'])
    
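    # Filter out downtrending stocks BEFORE labeling breakouts
    print(f"Before trend filter: {len(df)} rows")
    df = df[df['uptrend']].copy()
    print(f"After trend filter (uptrend only): {len(df)} rows")
    
    # Calculate multiple forward returns for sustained breakout detection.
    # shift(-days) counts the rows that remain for each symbol after filtering,
    # so the horizons are measured in retained rows rather than calendar days.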
    valid = (df['close'].notnull()) & (df['close'] != 0)
    
    for days in [30, 60, 90]:
        col_name = f'forward_return_{days}d'
        df[col_name] = np.nan
        shifted = df.groupby('symbol')['close'].shift(-days)
        df.loc[valid, col_name] = shifted[valid] / df['close'][valid] - 1
    
    # Calculate max drawdown during breakout period (30 days)
    df['future_min_price'] = df.groupby('symbol')['close'].transform(
        lambda x: x.rolling(30, min_periods=1).min().shift(-30)
    )
    df['max_drawdown_30d'] = (df['future_min_price'] / df['close']) - 1
    
    # Volume confirmation
    df['volume_confirmed'] = df['volume_ratio'] > 1.2
    
    # Breakout definition: SUSTAINED growth across all periods + volume + no major drawdown
    df['breakout'] = (
        (df['forward_return_30d'] > 0.20) &   # Up 20%+ at 30 days
        (df['forward_return_60d'] > 0.30) &   # Up 30%+ at 60 days
        (df['forward_return_90d'] > 0.40) &   # Up 40%+ at 90 days
        (df['max_drawdown_30d'] > -0.15) &    # No major crash (>15% drop)
        (df['volume_confirmed'] == True)      # Volume above average
    ).astype(int)
    
    # Drop rows with any NaN in required feature columns
    df = df.dropna(subset=['open', 'high', 'low', 'close', 'volume', 'rsi', 'macd', 'atr'])
    
    return df

# Model evaluation function
def evaluate_model(model, X_test, y_test):
    from sklearn.metrics import roc_auc_score, classification_report, confusion_matrix
    y_pred_proba = model.predict_proba(X_test)[:, 1]
    y_pred = (y_pred_proba > 0.5).astype(int)
    
    print("\n" + "="*60)
    print("MODEL EVALUATION")
    print("="*60)
    
    # ROC AUC
    if len(np.unique(y_test)) > 1:
        auc = roc_auc_score(y_test, y_pred_proba)
        print(f"ROC AUC: {auc:.3f}")
    else:
        print("ROC AUC: N/A (only one class in test set)")
    
    # Classification report
    print("\nClassification Report:")
    print(classification_report(y_test, y_pred, target_names=['No Breakout', 'Breakout']))
    
    # Confusion matrix
    print("\nConfusion Matrix:")
    cm = confusion_matrix(y_test, y_pred)
    print(f"True Negatives:  {cm[0,0]:>6}")
    print(f"False Positives: {cm[0,1]:>6}")
    print(f"False Negatives: {cm[1,0]:>6}")
    print(f"True Positives:  {cm[1,1]:>6}")
    
    # Breakout prediction stats
    print(f"\nBreakout predictions: {y_pred.sum()} out of {len(y_pred)} ({y_pred.sum()/len(y_pred)*100:.1f}%)")
    print("="*60)
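
To make the forward-return calculation in `preprocess_data` concrete, the sketch below applies the same `groupby(...).shift(-n)` pattern to a toy frame. The prices and the 2-row horizon are made up for illustration; the notebook itself uses horizons of 30, 60, and 90 rows.

```python
import pandas as pd

# Toy prices for two symbols (values invented for illustration)
toy = pd.DataFrame({
    "symbol": ["AAA"] * 4 + ["BBB"] * 4,
    "date": pd.to_datetime(["2024-01-02", "2024-01-03", "2024-01-04", "2024-01-05"] * 2),
    "close": [10.0, 11.0, 12.0, 15.0, 20.0, 19.0, 18.0, 21.0],
})
toy = toy.sort_values(["symbol", "date"])

horizon = 2  # rows ahead; stands in for the 30/60/90 horizons
future_close = toy.groupby("symbol")["close"].shift(-horizon)
toy["forward_return"] = future_close / toy["close"] - 1

# The last `horizon` rows of each symbol have no future price, so they are NaN
print(toy)
```

Because the shift is applied per symbol, a return never mixes prices from two tickers.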


## 3. Add Data Point Functionality
Implement a function to add or update individual data points for training or testing.

In [6]:
# Function to add or update a data point
def add_data_point(df: pd.DataFrame, new_row: Dict) -> pd.DataFrame:
    """Add or update a single data point in the DataFrame."""
    df = df.copy()
    # Assume 'date' and 'symbol' uniquely identify a row
    mask = (df['date'] == new_row['date']) & (df['symbol'] == new_row['symbol'])
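    if mask.any():
        # Assign column by column; assigning a DataFrame here would align on
        # the index and can write NaN into the matched row
        for col, value in new_row.items():
            df.loc[mask, col] = value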
    else:
        df = pd.concat([df, pd.DataFrame([new_row])], ignore_index=True)
    return df
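
A small usage sketch shows the add-or-update behavior on a toy frame. The helper `upsert_row` below is a hypothetical stand-alone variant written for this example (it updates matched rows column by column), and the prices are invented.

```python
import pandas as pd


def upsert_row(df: pd.DataFrame, new_row: dict) -> pd.DataFrame:
    """Update the row matching (date, symbol) if present, else append it."""
    df = df.copy()
    mask = (df["date"] == new_row["date"]) & (df["symbol"] == new_row["symbol"])
    if mask.any():
        for col, value in new_row.items():
            df.loc[mask, col] = value
    else:
        df = pd.concat([df, pd.DataFrame([new_row])], ignore_index=True)
    return df


prices = pd.DataFrame({
    "date": ["2024-01-02", "2024-01-03"],
    "symbol": ["AAPL", "AAPL"],
    "close": [185.0, 184.0],
})

# Existing key: the close for 2024-01-03 is overwritten in place
prices = upsert_row(prices, {"date": "2024-01-03", "symbol": "AAPL", "close": 184.25})
# New key: a row is appended
prices = upsert_row(prices, {"date": "2024-01-04", "symbol": "AAPL", "close": 181.9})
print(prices)
```

The input frame is copied first, so the caller's original DataFrame is left unchanged.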

## 4. Display and Edit Training Cell
This cell trains the breakout classifier. You can edit model parameters or code as needed.
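
One parameter worth understanding before editing is `scale_pos_weight`, which the training cell sets to the ratio of negative to positive training samples. As a quick arithmetic check, the sketch below reproduces that ratio from the class counts reported in the run recorded in this notebook (180,195 training samples, 1,662 of them breakouts). The helper name is hypothetical.

```python
def negative_to_positive_ratio(n_samples: int, n_positive: int) -> float:
    """Ratio of negative to positive samples; falls back to 1.0 with no positives."""
    n_negative = n_samples - n_positive
    return n_negative / n_positive if n_positive > 0 else 1.0


ratio = negative_to_positive_ratio(n_samples=180_195, n_positive=1_662)
print(f"scale_pos_weight = {ratio:.2f}")  # scale_pos_weight = 107.42
```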

In [7]:
# Training cell: Handle 5000+ stocks with batching and progress tracking

from pathlib import Path
import time
import os

# Configuration
CONFIG = {
    'start_date': '2020-01-01',  # 5 years of data
    'batch_size': 100,  # Process 100 stocks at a time
    'api_delay': 3,  # Seconds between API batches
    'min_price': 5.0,
    'min_volume': 500000,
    'max_stocks': None,  # None = use all stocks from CSV
}

print("="*60)
print("TRAINING BREAKOUT CLASSIFIER ON 5000+ STOCKS")
print("="*60)

# 1. Load tickers from CSV - try multiple paths (Kaggle, Colab, Local)
csv_path = None
for path in [
    # Kaggle paths (when using "Add Data" feature)
    '/kaggle/input/stocks-list/stocks-list.csv',
    '/kaggle/input/stockss-list/stocks-list.csv',
    '/kaggle/input/stock-list/stocks-list.csv',
    '/kaggle/working/stocks-list.csv',
    # Colab paths
    '/content/stocks-list.csv',
    '/stocks-list.csv',
    # Current directory
    'stocks-list.csv',
    # Local development paths
    '../data/stocks-list.csv',
    'data/stocks-list.csv',
]:
    if os.path.exists(path):
        csv_path = path
        break

if csv_path is None:
    raise FileNotFoundError("stocks-list.csv not found! Please upload it first.")

all_tickers = get_tickers_from_csv(csv_path)

if CONFIG['max_stocks']:
    all_tickers = all_tickers[:CONFIG['max_stocks']]

print(f"Loaded {len(all_tickers)} tickers from {csv_path}")
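# Rough estimate only: this treats api_delay as a per-ticker cost, although the delay is applied once per batch
print(f"Estimated download time: {(len(all_tickers) * CONFIG['api_delay'] / 60):.1f} minutes")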
print(f"Start time: {time.strftime('%H:%M:%S')}")

# 2. Download historical data in batches with progress tracking
print("\n" + "="*60)
print("DOWNLOADING HISTORICAL DATA")
print("="*60)

all_dataframes = []
failed_tickers = []
total_rows = 0

for batch_num, i in enumerate(range(0, len(all_tickers), CONFIG['batch_size']), 1):
    batch_tickers = all_tickers[i:i + CONFIG['batch_size']]
    progress = (i / len(all_tickers)) * 100
    
    print(f"\nBatch {batch_num} ({i+1}-{min(i+CONFIG['batch_size'], len(all_tickers))} of {len(all_tickers)}) [{progress:.1f}%]")
    
    try:
        batch_df = download_eodhd_bulk(
            batch_tickers,
            start_date=CONFIG['start_date'],
            batch_size=20,  # EODHD internal batch
            delay=1  # Shorter delay within batch
        )
        
        if not batch_df.empty:
            # Ensure lowercase columns
            batch_df.columns = [col.lower() for col in batch_df.columns]
            all_dataframes.append(batch_df)
            total_rows += len(batch_df)
            print(f"Downloaded {len(batch_df)} rows from {batch_df['symbol'].nunique()} stocks")
        else:
            print(f"Batch {batch_num} returned no data")
            failed_tickers.extend(batch_tickers)
    
    except Exception as e:
        print(f"Batch {batch_num} failed: {e}")
        failed_tickers.extend(batch_tickers)
    
    # Progress update
    print(f"Progress: {total_rows:,} rows | {len(all_dataframes)} batches | {len(failed_tickers)} failures")
    
    # Rate limiting between batches
    if i + CONFIG['batch_size'] < len(all_tickers):
        print(f"Cooling down for {CONFIG['api_delay']} seconds...")
        time.sleep(CONFIG['api_delay'])

# 3. Combine all data
print("\n" + "="*60)
print("COMBINING DATA")
print("="*60)

if all_dataframes:
    bulk_df = pd.concat(all_dataframes, ignore_index=True)
    print(f"Combined {len(bulk_df):,} total rows")
    print(f"Unique stocks: {bulk_df['symbol'].nunique()}")
    print(f"Failed tickers: {len(failed_tickers)}")
else:
    raise ValueError("No data downloaded successfully!")

# 4. Check for required columns
print(f"Columns in bulk_df: {bulk_df.columns.tolist()}")

if 'close' not in bulk_df.columns:
    raise ValueError(f"'close' column not found. Columns: {bulk_df.columns.tolist()}")

# Ensure 'date' column exists (EODHD may return date as index)
if 'date' not in bulk_df.columns:
    print("'date' column not found, attempting to recover from index...")
    bulk_df = bulk_df.reset_index()
    # Try common column name variations
    for col in bulk_df.columns:
        if 'date' in col.lower() or 'time' in col.lower():
            bulk_df = bulk_df.rename(columns={col: 'date'})
            print(f"   Renamed '{col}' to 'date'")
            break
    # If still no date column, use the first column if it looks like dates
    if 'date' not in bulk_df.columns and 'index' in bulk_df.columns:
        bulk_df = bulk_df.rename(columns={'index': 'date'})
        print("   Renamed 'index' to 'date'")

    if 'date' not in bulk_df.columns:
        raise ValueError(f"'date' column not found and could not be recovered. Columns: {bulk_df.columns.tolist()}")

print(f"Required columns verified: 'close' and 'date' present")

# 5. Filter quality stocks
print("\n" + "="*60)
print("FILTERING QUALITY STOCKS")
print("="*60)
print(f"Before filter: {len(bulk_df):,} rows")

bulk_df = filter_quality_stocks(
    bulk_df,
    min_price=CONFIG['min_price'],
    min_volume=CONFIG['min_volume']
)

print(f"After filter: {len(bulk_df):,} rows")
print(f"Removed: {total_rows - len(bulk_df):,} low-quality data points")
print(f"Remaining stocks: {bulk_df['symbol'].nunique()}")

# 6. Add momentum features (vectorized for speed)
print("\n" + "="*60)
print("CALCULATING MOMENTUM FEATURES")
print("="*60)

start_time = time.time()
bulk_df = calculate_momentum_features(bulk_df)
calc_time = time.time() - start_time

print(f"Calculated in {calc_time:.1f} seconds")
print(f"Features added: RSI, MACD, ROC, ATR, Volume Ratio")

# 7. Preprocess with trend filter and sustained breakout definition
print("\n" + "="*60)
print("LABELING SUSTAINED BREAKOUTS")
print("="*60)

bulk_df = preprocess_data(bulk_df)

print(f"\nFINAL DATASET:")
print(f"   Total rows: {len(bulk_df):,}")
print(f"   Unique stocks: {bulk_df['symbol'].nunique()}")
print(f"   Date range: {bulk_df['date'].min()} to {bulk_df['date'].max()}")
print(f"   Breakouts found: {bulk_df['breakout'].sum():,}")
print(f"   Breakout rate: {bulk_df['breakout'].mean():.2%}")

# 8. Validate sufficient training data
if bulk_df['breakout'].sum() < 50:
    print("\nWARNING: Very few breakouts found (<50). Consider:")
    print("   - Lowering return thresholds in preprocess_data()")
    print("   - Using more historical data (start_date='2015-01-01')")
    print("   - Checking data quality")
elif bulk_df['breakout'].sum() < 200:
    print("\nCAUTION: Limited breakouts found (<200). Model may be less robust.")
else:
    print(f"\nSufficient training data: {bulk_df['breakout'].sum()} breakouts")

# 9. Select features and prepare training data
print("\n" + "="*60)
print("PREPARING TRAINING DATA")
print("="*60)

features = [
    'open', 'high', 'low', 'close', 'volume',
    'sma_50', 'sma_200', 'rsi', 'macd', 'macd_histogram',
    'roc_10', 'atr', 'volume_ratio'
]

X = bulk_df[features].fillna(0)
y = bulk_df['breakout']

print(f"Feature matrix shape: {X.shape}")
print(f"Features: {len(features)}")
print(f"Class distribution:")
print(f"   No Breakout: {(y==0).sum():,} ({(y==0).mean():.1%})")
print(f"   Breakout:    {(y==1).sum():,} ({(y==1).mean():.1%})")

# 10. Train/test split with stratification
print("\n" + "="*60)
print("SPLITTING TRAIN/TEST DATA")
print("="*60)

try:
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y
    )
    print("Stratified split successful")
except ValueError as e:
    print(f"Cannot stratify: {e}")
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )

print(f"Train: {len(X_train):,} samples ({y_train.sum():,} breakouts, {y_train.mean():.2%})")
print(f"Test:  {len(X_test):,} samples ({y_test.sum():,} breakouts, {y_test.mean():.2%})")

# 11. Train model with optimized hyperparameters
print("\n" + "="*60)
print("TRAINING XGBOOST CLASSIFIER")
print("="*60)

# Calculate class weight to handle imbalance
pos_count = y_train.sum()
neg_count = len(y_train) - pos_count
scale_pos_weight = neg_count / pos_count if pos_count > 0 else 1

print(f"Scale pos weight: {scale_pos_weight:.2f}")

clf = BreakoutStockClassifier()
clf.model = XGBClassifier(
    n_estimators=300,
    max_depth=6,
    learning_rate=0.05,
    scale_pos_weight=scale_pos_weight,
    random_state=42,
    eval_metric='logloss',
    n_jobs=-1,  # Use all CPU cores
    tree_method='hist'  # Faster for large datasets
)

train_start = time.time()
clf.fit(X_train, y_train)
train_time = time.time() - train_start

print(f"Training complete in {train_time:.1f} seconds!")

# 12. Evaluate model
print("\n" + "="*60)
print("MODEL EVALUATION")
print("="*60)

evaluate_model(clf.model, X_test, y_test)

# 13. Feature importance
print("\n" + "="*60)
print("TOP 10 MOST IMPORTANT FEATURES")
print("="*60)

feature_importance = pd.DataFrame({
    'feature': features,
    'importance': clf.model.feature_importances_
}).sort_values('importance', ascending=False)

print(feature_importance.head(10).to_string(index=False))

# 14. Save model with metadata - detect environment
print("\n" + "="*60)
print("SAVING MODEL")
print("="*60)

# Detect environment and set output directory
if os.path.exists('/kaggle/working'):
    # Kaggle environment - save to working directory (shows in Output tab)
    model_dir = '/kaggle/working'
    print("Detected Kaggle environment")
elif os.path.exists('/content'):
    # Colab environment
    model_dir = '/content/ml_models'
    print("Detected Colab environment")
else:
    # Local environment
    model_dir = '../python/ml_models'
    print("Detected local environment")

os.makedirs(model_dir, exist_ok=True)

# Save model with descriptive name
model_filename = 'breakout_classifier_xgb_stockslist.pkl'
model_path = f'{model_dir}/{model_filename}'
clf.save(model_path)

# Save metadata
metadata = {
    'training_date': time.strftime('%Y-%m-%d %H:%M:%S'),
    'total_stocks': len(all_tickers),
    'successful_stocks': bulk_df['symbol'].nunique(),
    'failed_stocks': len(failed_tickers),
    'training_samples': len(X_train),
    'test_samples': len(X_test),
    'breakout_rate': float(y.mean()),
    'features': features,
    'config': CONFIG,
    'model_performance': {
        'training_time_seconds': train_time,
        'test_breakouts': int(y_test.sum())
    }
}

import json
metadata_path = f'{model_dir}/breakout_classifier_metadata.json'
with open(metadata_path, 'w') as f:
    json.dump(metadata, f, indent=2)

print(f"Model saved to: {model_path}")
print(f"Metadata saved to: {metadata_path}")

# 15. Summary
print("\n" + "="*60)
print("TRAINING COMPLETE!")
print("="*60)
print(f"Trained on {bulk_df['symbol'].nunique()} stocks")
print(f"Found {bulk_df['breakout'].sum():,} sustained breakouts")
print(f"Model finds SUSTAINED breakouts in UPTRENDING stocks")
print(f"Model location: {model_path}")

if '/kaggle/' in model_dir:
    print("\nTo download: Go to Output tab on the right")
elif '/content/' in model_dir:
    print("\nNext steps:")
    print("   1. Run the download cell below to get a zip file")
    print("   2. Upload to your python/ml_models/ directory")
    print("   3. Use in your ML API for predictions")
else:
    print("\nNext steps:")
    print("   1. Model is saved locally")
    print("   2. Use in your ML API for predictions")
print("="*60)

TRAINING BREAKOUT CLASSIFIER ON 5000+ STOCKS
Loaded 5519 tickers from /kaggle/input/stockss-list/stocks-list.csv
Estimated download time: 275.9 minutes
Start time: 05:26:59

DOWNLOADING HISTORICAL DATA

Batch 1 (1-100 of 5519) [0.0%]
Downloading batch 1 (1-20) of 100...
Sleeping for 1 seconds to avoid API rate limits...
Downloading batch 2 (21-40) of 100...
Sleeping for 1 seconds to avoid API rate limits...
Downloading batch 3 (41-60) of 100...
Sleeping for 1 seconds to avoid API rate limits...
Downloading batch 4 (61-80) of 100...
Sleeping for 1 seconds to avoid API rate limits...
Downloading batch 5 (81-100) of 100...
Downloaded 29452 rows from 100 stocks
Progress: 29,452 rows | 1 batches | 0 failures
Cooling down for 3 seconds...

Batch 2 (101-200 of 5519) [1.8%]
Downloading batch 1 (1-20) of 100...
No data for AEC
Sleeping for 1 seconds to avoid API rate limits...
Downloading batch 2 (21-40) of 100...
Sleeping for 1 seconds to avoid API rate limits...
Downloading batch 3 (41-60) of

FINAL DATASET:
   Total rows: 225,244
   Unique stocks: 3341
   Date range: 2020-01-06 00:00:00 to 2025-11-28 00:00:00
   Breakouts found: 2,077
   Breakout rate: 0.92%

Sufficient training data: 2077 breakouts

PREPARING TRAINING DATA
Feature matrix shape: (225244, 13)
Features: 13
Class distribution:
   No Breakout: 223,167 (99.1%)
   Breakout:    2,077 (0.9%)

SPLITTING TRAIN/TEST DATA
Stratified split successful
Train: 180,195 samples (1,662 breakouts, 0.92%)
Test:  45,049 samples (415 breakouts, 0.92%)

TRAINING XGBOOST CLASSIFIER
Scale pos weight: 107.42
Training complete in 2.0 seconds!

MODEL EVALUATION

MODEL EVALUATION
ROC AUC: 0.956

Classification Report:
              precision    recall  f1-score   support

 No Breakout       0.99      1.00      1.00     44634
    Breakout       0.55      0.04      0.08       415

    accuracy                           0.99     45049
   macro avg       0.77      0.52      0.54     45049
weighted avg       0.99      0.99      0.99     450
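
The report above uses a fixed 0.5 probability cutoff, which gives a precision of 0.55 but a recall of only 0.04 for the breakout class. One common way to explore that trade-off is to recompute precision and recall at several cutoffs. The sketch below does this with plain NumPy on synthetic scores; the data, the helper name `precision_recall_at`, and the cutoff values are assumptions for illustration, not results from this model.

```python
import numpy as np


def precision_recall_at(y_true: np.ndarray, scores: np.ndarray, threshold: float):
    """Precision and recall of the positive class when predicting scores > threshold."""
    y_pred = scores > threshold
    true_pos = np.sum(y_pred & (y_true == 1))
    pred_pos = np.sum(y_pred)
    actual_pos = np.sum(y_true == 1)
    precision = true_pos / pred_pos if pred_pos > 0 else 0.0
    recall = true_pos / actual_pos if actual_pos > 0 else 0.0
    return float(precision), float(recall)


# Synthetic, imbalanced example: about 1% positives with somewhat higher scores
rng = np.random.default_rng(0)
y_true = (rng.random(20_000) < 0.01).astype(int)
scores = np.clip(rng.normal(0.15, 0.1, size=y_true.size) + 0.25 * y_true, 0, 1)

for threshold in (0.2, 0.3, 0.4, 0.5):
    p, r = precision_recall_at(y_true, scores, threshold)
    print(f"threshold={threshold:.1f}  precision={p:.2f}  recall={r:.2f}")
```

With a real model, `scores` would be the output of `predict_proba(X_test)[:, 1]` and `y_true` the test labels.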

## 5. Improved Breakout Detection

The labeling in this notebook targets **sustained breakouts** rather than one-day spikes:

### Key Improvements:

1. **Trend Filter**: Only considers stocks in uptrends (close above both the 50-day and 200-day moving averages)
2. **Sustained Returns**: Requires forward returns above a rising threshold at all three horizons:
   - 20%+ at 30 days
   - 30%+ at 60 days
   - 40%+ at 90 days
3. **Volume Confirmation**: Breakouts need above-average volume (more than 1.2x the 30-day average)
4. **Quality Filter**: Excludes penny stocks (<$5) and illiquid stocks (<500K volume)
5. **Drawdown Protection**: Excludes rows where the close falls more than 15% within the following 30 days
6. **Momentum Features**: RSI, MACD, and ROC are supplied as model inputs to help identify momentum

### Expected Results:
- Fewer breakout labels (more selective); the run above labels 0.92% of rows
- Labels that require gains to persist across 30, 60, and 90 days
- A target better suited to trading or investment decisions than one-day spikes

### If you get too few breakouts:
- Lower the return thresholds (e.g., 15%/25%/35%)
- Increase the stock universe (use more tickers)
- Use more historical data (start_date="2015-01-01")
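
If the thresholds need adjusting, it can help to pass them in as parameters rather than editing the labeling code. The sketch below is one hypothetical way to do that: `label_breakouts` and its arguments are illustrative names, and it assumes the forward-return, drawdown, and volume-ratio columns have already been computed as in `preprocess_data`.

```python
import pandas as pd


def label_breakouts(
    df: pd.DataFrame,
    return_thresholds=(0.20, 0.30, 0.40),  # minimum 30/60/90-day forward returns
    max_drawdown=-0.15,                    # worst allowed 30-day drawdown
    min_volume_ratio=1.2,                  # volume relative to the 30-day average
) -> pd.Series:
    """Return a 0/1 Series marking rows that satisfy every breakout condition."""
    r30, r60, r90 = return_thresholds
    is_breakout = (
        (df["forward_return_30d"] > r30)
        & (df["forward_return_60d"] > r60)
        & (df["forward_return_90d"] > r90)
        & (df["max_drawdown_30d"] > max_drawdown)
        & (df["volume_ratio"] > min_volume_ratio)
    )
    return is_breakout.astype(int)


# Toy rows (values invented): only the first meets the default thresholds
toy = pd.DataFrame({
    "forward_return_30d": [0.25, 0.18],
    "forward_return_60d": [0.35, 0.28],
    "forward_return_90d": [0.45, 0.38],
    "max_drawdown_30d": [-0.05, -0.05],
    "volume_ratio": [1.5, 1.5],
})

print(label_breakouts(toy).tolist())                                        # [1, 0]
print(label_breakouts(toy, return_thresholds=(0.15, 0.25, 0.35)).tolist())  # [1, 1]
```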


## 6. Download Models (For Colab Users)

In [8]:
# Download models for Colab (UNCOMMENT if running in Colab)
# from google.colab import files
# import zipfile
# import os

# print("📦 Creating zip file with breakout classifier models...")

# # Create zip file
# zip_filename = 'breakout_classifier_models.zip'
# with zipfile.ZipFile(zip_filename, 'w') as zipf:
#     for file in ['breakout_classifier_xgb_stockslist.pkl', 'breakout_classifier_metadata.json']:
#         file_path = f"{model_dir}/{file}"
#         if os.path.exists(file_path):
#             zipf.write(file_path, file)
#             print(f"  ✅ Added {file}")
#         else:
#             print(f"  ⚠️  {file} not found")

# print(f"\n⬇️  Downloading {zip_filename}...")
# files.download(zip_filename)
# print("✅ Download complete!")
# print("\n📋 Next steps:")
# print("1. Extract the zip file")
# print("2. Upload the .pkl file to your project's python/ml_models/ directory")
# print("3. Use the model in your ML API for breakout predictions")