# Breakout Stock Classifier: Scaffolding and Expansion

This notebook scaffolds a modular backend for a breakout stock classifier. It separates the workflow into reusable components, adds a helper for inserting or updating individual data points, presents an editable training cell, and extends the workflow to handle a larger universe of stocks.

## Google Colab: Uploading Your CSV

If you are using Google Colab, you can upload your `stocks-list.csv` file directly to the Colab runtime with the following code cell:

```python
from google.colab import files
uploaded = files.upload()  # This will prompt you to select and upload your CSV file
```

- After uploading, the file will be in the current working directory.
- If your code expects the file in a `data/` folder, move it with:

```python
import os
os.makedirs('data', exist_ok=True)
os.replace('stocks-list.csv', 'data/stocks-list.csv')
```

Alternatively, you can mount your Google Drive and access files from there:

```python
from google.colab import drive
drive.mount('/content/drive')
# Then use the path: '/content/drive/My Drive/path/to/stocks-list.csv'
```

Adjust your code to use the correct path depending on your upload method.
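
One way to handle both upload methods is to probe a short list of candidate locations and use the first one that exists. The sketch below assumes a hypothetical helper name, `resolve_csv_path`, and candidate paths taken from the examples above; adjust them to your setup.

```python
from pathlib import Path
from typing import Iterable, Optional


def resolve_csv_path(candidates: Iterable[str]) -> Optional[Path]:
    """Return the first candidate path that exists as a file, or None."""
    for candidate in candidates:
        path = Path(candidate)
        if path.is_file():
            return path
    return None


# Candidate locations mirror the upload options above (assumed, not exhaustive).
CSV_CANDIDATES = [
    "data/stocks-list.csv",  # moved into data/ after files.upload()
    "stocks-list.csv",       # left in the working directory
    "/content/drive/My Drive/path/to/stocks-list.csv",  # placeholder Drive path
]

csv_file = resolve_csv_path(CSV_CANDIDATES)
print(csv_file if csv_file else "stocks-list.csv not found in any candidate location")
```

The returned `Path` can then be passed (as a string) to whichever loader expects the CSV location.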

## 1. Scaffold Model Backend
Set up the basic backend structure for the breakout classifier, including imports and class/function definitions.

In [None]:
# Imports and backend scaffolding
import pandas as pd
import numpy as np
from typing import List, Dict
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
import joblib

# Placeholder for backend class
class BreakoutStockClassifier:
    def __init__(self):
        self.model = None
        self.features = None
    
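    def fit(self, X, y):
        # Build a default model only if none was assigned, so that a custom
        # self.model (e.g. one with scale_pos_weight) is not overwritten.
        if self.model is None:
            self.model = XGBClassifier(n_estimators=200, max_depth=5, random_state=42)
        self.model.fit(X, y)
    
    def predict(self, X):
        # Returns the probability of the positive (breakout) class
        return self.model.predict_proba(X)[:, 1] if self.model is not None else None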
    
    def save(self, path):
        joblib.dump(self.model, path)
    
    def load(self, path):
        self.model = joblib.load(path)


In [None]:
# --- EOD Historical Data: Download US stock price data (replaces Alpha Vantage) ---

from eodhd import APIClient
import pandas as pd
import time
import numpy as np

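# Read your EODHD API token from the environment rather than hardcoding it.
# The variable name EODHD_API_KEY is a suggestion; set it before running this cell.
import os
EODHD_API_KEY = os.environ.get("EODHD_API_KEY", "")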

def download_eodhd_bulk(tickers, api_key=EODHD_API_KEY, start_date="2015-01-01", end_date=None, batch_size=5, delay=2):
    """
    Download daily historical data for a list of tickers from EODHD.
    Returns a DataFrame with all data concatenated.
    Ensures 'close' column exists for all tickers.
    """
    client = APIClient(api_key)
    all_data = []
    total = len(tickers)
    for i in range(0, total, batch_size):
        batch = tickers[i:i+batch_size]
        print(f"Downloading batch {i//batch_size+1} ({i+1}-{min(i+batch_size, total)}) of {total}...")
        for ticker in batch:
            try:
                df = client.get_historical_data(
                    symbol=ticker,
                    interval="d",
                    iso8601_start=start_date,
                    iso8601_end=end_date if end_date else ""
                )
                if not df.empty:
                    if 'close' not in df.columns:
                        print(f"Ticker {ticker} missing 'close' column, adding as NaN.")
                        df['close'] = np.nan
                    df['symbol'] = ticker
                    all_data.append(df)
                else:
                    print(f"No data for {ticker}")
            except Exception as e:
                print(f"Failed to download {ticker}: {e}")
        if i + batch_size < total:
            print(f"Sleeping for {delay} seconds to avoid API rate limits...")
            time.sleep(delay)
    if all_data:
        return pd.concat(all_data, ignore_index=True)
    else:
        print("No valid data downloaded.")
        return pd.DataFrame()

# Example usage:
# us_tickers = ["AAPL", "MSFT", "GOOG"]
# df = download_eodhd_bulk(us_tickers)
# print(df.head())

# Alpha Vantage support was removed; this stub fails loudly if old code still calls it
def download_alpha_vantage_bulk(*args, **kwargs):
    raise NotImplementedError("Alpha Vantage integration has been replaced by EODHD.")
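
The fixed `delay` between batches is a simple guard against rate limits. If individual requests still fail intermittently, one option is to retry each call with exponential backoff. The sketch below is not part of the EODHD client; `with_retries` and its parameters are hypothetical names, and the `sleep` argument is injectable so the demo runs without actually waiting.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(fn: Callable[[], T], retries: int = 3, base_delay: float = 1.0,
                 sleep: Callable[[float], None] = time.sleep) -> T:
    """Call fn(); on failure wait base_delay * 2**attempt, then try again.

    Re-raises the last exception once `retries` attempts have failed.
    """
    for attempt in range(retries):
        try:
            return fn()
        except Exception:
            if attempt == retries - 1:
                raise
            sleep(base_delay * 2 ** attempt)
    raise ValueError("retries must be at least 1")


# Demo with a stand-in for a flaky download: fails twice, then succeeds.
calls = {"n": 0}


def flaky_download() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"


waits = []  # record the delays instead of sleeping
result = with_retries(flaky_download, retries=3, base_delay=1.0, sleep=waits.append)
print(result, waits)
```

Inside `download_eodhd_bulk`, the existing `client.get_historical_data(...)` call could be passed in as a zero-argument lambda.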


In [None]:
# --- Download and use a comprehensive ticker list from an external CSV ---
import os

def get_tickers_from_csv(csv_path: str) -> list:
    """Load a comprehensive list of tickers from an external CSV file."""
    import pandas as pd
    df = pd.read_csv(csv_path)
    # Accept common column names for tickers
    for col in ['symbol', 'ticker', 'Ticker', 'SYMBOL', 'Symbol']:
        if col in df.columns:
            tickers = df[col].dropna().unique().tolist()
            return tickers
    raise ValueError(f"No ticker column found in {csv_path}. Columns found: {df.columns.tolist()}")

# Example usage:
# Download a full US stock list from NASDAQ, NYSE, AMEX, or use a third-party source like 'eodhistoricaldata.com', 'nasdaqtrader.com', or 'stockanalysis.com'.
# Place the CSV in your data directory, e.g., 'data/all_us_tickers.csv'.
# all_tickers = get_tickers_from_csv('data/all_us_tickers.csv')
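
`get_tickers_from_csv` checks a fixed list of column names. A possible variation, sketched here with only the standard library, matches the header case-insensitively and normalizes the symbols. The function name `read_tickers` and the normalization rules (strip whitespace, uppercase, drop duplicates while keeping order) are assumptions, not part of the notebook.

```python
import csv
import io
from typing import List, TextIO


def read_tickers(handle: TextIO) -> List[str]:
    """Read tickers from a CSV whose header contains 'symbol' or 'ticker' (any case)."""
    reader = csv.DictReader(handle)
    headers = reader.fieldnames or []
    column = next((h for h in headers if h.strip().lower() in {"symbol", "ticker"}), None)
    if column is None:
        raise ValueError(f"No ticker column found. Columns found: {headers}")
    seen = set()
    tickers = []
    for row in reader:
        value = (row.get(column) or "").strip().upper()
        if value and value not in seen:  # skip blanks and duplicates
            seen.add(value)
            tickers.append(value)
    return tickers


# In-memory sample with mixed case, stray whitespace, a duplicate, and a blank.
sample = io.StringIO("Name,TICKER\nApple, aapl\nMicrosoft,MSFT\nApple again,AAPL\nEmpty,\n")
print(read_tickers(sample))
```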


## 2. Break Out Model Components
Separate the workflow into modular functions for data loading, preprocessing, model definition, and evaluation.

In [None]:
# Data loading function
def load_stock_data(csv_path: str) -> pd.DataFrame:
    """Load historical stock data from CSV or other sources."""
    return pd.read_csv(csv_path)

# Quality filter function
def filter_quality_stocks(df: pd.DataFrame, min_price=5.0, min_volume=500000) -> pd.DataFrame:
    """Filter out low-quality stocks (penny stocks, illiquid stocks)."""
    df = df.copy()
    
    # Minimum price filter (avoid penny stocks)
    df = df[df['close'] >= min_price]
    
    # Minimum volume filter (ensure liquidity)
    df = df[df['volume'] >= min_volume]
    
    # Price stability check: rolling 30-row std of close, relative to price
    df['price_std_30d'] = df.groupby('symbol')['close'].transform(
        lambda x: x.rolling(30, min_periods=1).std()
    )
    df['volatility'] = df['price_std_30d'] / df['close']
    
    # Filter out extremely volatile stocks (rolling std at or above 50% of the close)
    df = df[df['volatility'] < 0.5]
    
    return df

# Calculate momentum features
def calculate_momentum_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add momentum indicators to help identify sustained moves."""
    df = df.copy()
    
    # Sort by symbol and date
    df = df.sort_values(['symbol', 'date'])
    
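    # RSI (Relative Strength Index), 14-row simple-average variant.
    # The rolling means are grouped by symbol so a window never spans two tickers.
    delta = df.groupby('symbol')['close'].diff()
    gain = delta.where(delta > 0, 0).groupby(df['symbol']).transform(lambda x: x.rolling(14).mean())
    loss = (-delta.where(delta < 0, 0)).groupby(df['symbol']).transform(lambda x: x.rolling(14).mean())
    rs = gain / (loss + 1e-10)  # Avoid division by zero
    df['rsi'] = 100 - (100 / (1 + rs))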
    
    # MACD
    ema_12 = df.groupby('symbol')['close'].transform(lambda x: x.ewm(span=12, adjust=False).mean())
    ema_26 = df.groupby('symbol')['close'].transform(lambda x: x.ewm(span=26, adjust=False).mean())
    df['macd'] = ema_12 - ema_26
    df['macd_signal'] = df.groupby('symbol')['macd'].transform(lambda x: x.ewm(span=9, adjust=False).mean())
    df['macd_histogram'] = df['macd'] - df['macd_signal']
    
    # Rate of Change (10-day), in percent
    df['roc_10'] = (df['close'] / df.groupby('symbol')['close'].shift(10) - 1) * 100
    
    # Average True Range (volatility measure)
    df['prev_close'] = df.groupby('symbol')['close'].shift(1)
    # True range: largest of the three ranges. max(axis=1) skips the NaN terms
    # on each symbol's first row, where prev_close is undefined.
    df['tr'] = pd.concat([
        df['high'] - df['low'],
        (df['high'] - df['prev_close']).abs(),
        (df['low'] - df['prev_close']).abs(),
    ], axis=1).max(axis=1)
    df['atr'] = df.groupby('symbol')['tr'].transform(lambda x: x.rolling(14, min_periods=1).mean())
    
    # Volume ratio
    df['avg_volume_30d'] = df.groupby('symbol')['volume'].transform(
        lambda x: x.rolling(30, min_periods=1).mean()
    )
    df['volume_ratio'] = df['volume'] / (df['avg_volume_30d'] + 1)  # Avoid division by zero
    
    return df

# Preprocessing function with trend filtering and sustained breakout definition
def preprocess_data(df: pd.DataFrame) -> pd.DataFrame:
    """
    Feature engineering and breakout labeling with trend filtering.
    Labels sustained breakouts (not just one-day spikes).
    """
    df = df.copy()
    
    # Ensure all price/volume columns are numeric
    for col in ['open', 'high', 'low', 'close', 'volume']:
        df[col] = pd.to_numeric(df[col], errors='coerce')
    
    # Sort by symbol and date
    df = df.sort_values(['symbol', 'date'])
    
    # Calculate moving averages for trend detection
    df['sma_50'] = df.groupby('symbol')['close'].transform(lambda x: x.rolling(50, min_periods=1).mean())
    df['sma_200'] = df.groupby('symbol')['close'].transform(lambda x: x.rolling(200, min_periods=1).mean())
    
    # Trend flag: price must be above both the 50- and 200-day MA
    df['uptrend'] = (df['close'] > df['sma_50']) & (df['close'] > df['sma_200'])
    
    # Compute forward-looking columns on the full series BEFORE dropping rows:
    # shift(-days) counts rows, so filtering first would stretch the horizon.
    valid = (df['close'].notnull()) & (df['close'] != 0)
    
    for days in [30, 60, 90]:
        col_name = f'forward_return_{days}d'
        df[col_name] = np.nan
        shifted = df.groupby('symbol')['close'].shift(-days)
        df.loc[valid, col_name] = shifted[valid] / df['close'][valid] - 1
    
    # Lowest close over the next 30 rows, relative to the current close
    df['future_min_price'] = df.groupby('symbol')['close'].transform(
        lambda x: x.rolling(30, min_periods=1).min().shift(-30)
    )
    df['max_drawdown_30d'] = (df['future_min_price'] / df['close']) - 1
    
    # Keep only uptrending rows for labeling and training
    print(f"Before trend filter: {len(df)} rows")
    df = df[df['uptrend']].copy()
    print(f"After trend filter (uptrend only): {len(df)} rows")
    
    # Volume confirmation
    df['volume_confirmed'] = df['volume_ratio'] > 1.2
    
    # Breakout definition: SUSTAINED growth across all periods + volume + no major drawdown
    df['breakout'] = (
        (df['forward_return_30d'] > 0.20) &   # Up 20%+ at 30 days
        (df['forward_return_60d'] > 0.30) &   # Up 30%+ at 60 days
        (df['forward_return_90d'] > 0.40) &   # Up 40%+ at 90 days
        (df['max_drawdown_30d'] > -0.15) &    # No major crash (>15% drop)
        (df['volume_confirmed'] == True)      # Volume above average
    ).astype(int)
    
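    # Drop rows with NaN in required features, and rows too recent to have a
    # full forward window (their label would otherwise silently default to 0)
    df = df.dropna(subset=['open', 'high', 'low', 'close', 'volume', 'rsi', 'macd', 'atr',
                           'forward_return_90d', 'max_drawdown_30d'])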
    
    return df

# Model evaluation function
def evaluate_model(model, X_test, y_test):
    from sklearn.metrics import roc_auc_score, classification_report, confusion_matrix
    y_pred_proba = model.predict_proba(X_test)[:, 1]
    y_pred = (y_pred_proba > 0.5).astype(int)
    
    print("\n" + "="*60)
    print("MODEL EVALUATION")
    print("="*60)
    
    # ROC AUC
    if len(np.unique(y_test)) > 1:
        auc = roc_auc_score(y_test, y_pred_proba)
        print(f"ROC AUC: {auc:.3f}")
    else:
        print("ROC AUC: N/A (only one class in test set)")
    
    # Classification report
    print("\nClassification Report:")
    print(classification_report(y_test, y_pred, target_names=['No Breakout', 'Breakout']))
    
    # Confusion matrix
    print("\nConfusion Matrix:")
    cm = confusion_matrix(y_test, y_pred)
    print(f"True Negatives:  {cm[0,0]:>6}")
    print(f"False Positives: {cm[0,1]:>6}")
    print(f"False Negatives: {cm[1,0]:>6}")
    print(f"True Positives:  {cm[1,1]:>6}")
    
    # Breakout prediction stats
    print(f"\nBreakout predictions: {y_pred.sum()} out of {len(y_pred)} ({y_pred.sum()/len(y_pred)*100:.1f}%)")
    print("="*60)
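
Because `shift(-n)` counts rows rather than dates, forward returns depend on which rows are present when they are computed. The toy sketch below (made-up prices, a 2-row horizon instead of 30) compares labeling before and after dropping a row, to show why the order of filtering and labeling matters.

```python
import pandas as pd

# Made-up prices for one symbol over six consecutive rows.
prices = pd.DataFrame({
    "symbol": ["AAA"] * 6,
    "date": pd.date_range("2024-01-01", periods=6, freq="D"),
    "close": [10.0, 11.0, 12.0, 9.0, 13.0, 14.0],
})


def forward_return(df: pd.DataFrame, rows: int) -> pd.Series:
    """Return close[t + rows] / close[t] - 1 within each symbol (row-based)."""
    future = df.groupby("symbol")["close"].shift(-rows)
    return future / df["close"] - 1


# Label first, on the complete series.
full = prices.assign(fwd_2=forward_return(prices, 2))

# Filter first (drop the row where close < 10), then label.
filtered = prices[prices["close"] >= 10].copy()
filtered["fwd_2"] = forward_return(filtered, 2)

comparison = full[["date", "close", "fwd_2"]].merge(
    filtered[["date", "fwd_2"]], on="date", how="left", suffixes=("_full", "_filtered")
)
print(comparison)
```

On the second row, the full series looks ahead to the close of 9.0 and reports a loss, while the filtered series skips that row and reports a gain.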


## 3. Add Data Point Functionality
Implement a function to add or update individual data points for training or testing.

In [None]:
# Function to add or update a data point
def add_data_point(df: pd.DataFrame, new_row: Dict) -> pd.DataFrame:
    """Add or update a single data point in the DataFrame."""
    df = df.copy()
    # Assume 'date' and 'symbol' uniquely identify a row
    mask = (df['date'] == new_row['date']) & (df['symbol'] == new_row['symbol'])
    if mask.any():
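        # Assign by column name. Assigning a DataFrame here would align on
        # index and could write NaN into the matched row.
        for col, value in new_row.items():
            df.loc[mask, col] = value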
    else:
        df = pd.concat([df, pd.DataFrame([new_row])], ignore_index=True)
    return df
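
A self-contained sketch of the add-or-update pattern on a toy frame. The helper name `upsert_row` is hypothetical; it assumes, as above, that `date` and `symbol` together identify a row, and it assigns by column name so that index alignment does not interfere.

```python
from typing import Any, Dict

import pandas as pd


def upsert_row(df: pd.DataFrame, new_row: Dict[str, Any]) -> pd.DataFrame:
    """Return a copy of df with new_row updated in place or appended."""
    out = df.copy()
    mask = (out["date"] == new_row["date"]) & (out["symbol"] == new_row["symbol"])
    if mask.any():
        for column, value in new_row.items():
            out.loc[mask, column] = value  # scalar assignment, no index alignment
    else:
        out = pd.concat([out, pd.DataFrame([new_row])], ignore_index=True)
    return out


toy = pd.DataFrame({
    "date": ["2024-01-02", "2024-01-03"],
    "symbol": ["AAA", "AAA"],
    "close": [10.0, 11.0],
})

# Existing (date, symbol) pair: the row is updated.
updated = upsert_row(toy, {"date": "2024-01-03", "symbol": "AAA", "close": 11.5})
# New (date, symbol) pair: the row is appended.
appended = upsert_row(updated, {"date": "2024-01-04", "symbol": "AAA", "close": 12.0})
print(appended)
```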

## 4. Display and Edit Training Cell
This cell trains the breakout classifier. You can edit model parameters or code as needed.

In [None]:
# Training cell: Improved breakout detection with trend filtering and sustained growth

from pathlib import Path

# 1. Load tickers from your CSV
csv_path = Path('../data/stocks-list.csv')
all_tickers = get_tickers_from_csv(str(csv_path))
print(f"✅ Loaded {len(all_tickers)} tickers from {csv_path}")

# 2. Download historical price data (adjust slice as needed)
print("\n📊 Downloading historical data...")
bulk_df = download_eodhd_bulk(all_tickers[:500], start_date="2020-01-01")  # data from 2020-01-01 onward

# 3. Ensure columns are lowercase
bulk_df.columns = [col.lower() for col in bulk_df.columns]
print(f"Downloaded {len(bulk_df)} total rows")

# 4. Check for required columns
if 'close' not in bulk_df.columns:
    raise ValueError(f"'close' column not found. Columns: {bulk_df.columns.tolist()}")

# 5. Filter quality stocks first (remove penny stocks and low liquidity)
print("\n🔍 Filtering quality stocks...")
bulk_df = filter_quality_stocks(bulk_df, min_price=5.0, min_volume=500000)
print(f"After quality filter: {len(bulk_df)} rows")

# 6. Add momentum features (RSI, MACD, volume ratio, etc.)
print("\n📈 Calculating momentum features...")
bulk_df = calculate_momentum_features(bulk_df)

# 7. Preprocess with trend filter and sustained breakout definition
print("\n🎯 Labeling sustained breakouts...")
bulk_df = preprocess_data(bulk_df)
print(f"After preprocessing: {len(bulk_df)} rows")
print(f"Breakout rate: {bulk_df['breakout'].mean():.2%} ({bulk_df['breakout'].sum()} breakouts)")

# 8. Check if we have enough breakouts to train
if bulk_df['breakout'].sum() < 10:
    print("\n⚠️  WARNING: Very few breakouts found. Consider:")
    print("   - Lowering return thresholds")
    print("   - Downloading more stocks")
    print("   - Using more historical data")

# 9. Select features and target
features = ['open', 'high', 'low', 'close', 'volume', 
            'sma_50', 'sma_200', 'rsi', 'macd', 'macd_histogram', 
            'roc_10', 'atr', 'volume_ratio']

X = bulk_df[features].fillna(0)  # Handle any remaining NaNs
y = bulk_df['breakout']

print(f"\n📊 Training data shape: {X.shape}")
print(f"Features: {features}")

# 10. Train/test split with stratification (if possible)
try:
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y
    )
    print(f"Train set: {len(X_train)} samples ({y_train.sum()} breakouts)")
    print(f"Test set: {len(X_test)} samples ({y_test.sum()} breakouts)")
except ValueError as e:
    print(f"⚠️  Cannot stratify (likely too few breakouts): {e}")
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )

# 11. Train model with class imbalance handling
print("\n🚀 Training XGBoost classifier...")
clf = BreakoutStockClassifier()

# Calculate scale_pos_weight to handle imbalanced data
pos_count = y_train.sum()
neg_count = len(y_train) - pos_count
scale_pos_weight = neg_count / pos_count if pos_count > 0 else 1

clf.model = XGBClassifier(
    n_estimators=300,
    max_depth=6,
    learning_rate=0.05,
    scale_pos_weight=scale_pos_weight,
    random_state=42,
    eval_metric='logloss'
)

clf.fit(X_train, y_train)
print("✅ Training complete!")

# 12. Evaluate model
evaluate_model(clf.model, X_test, y_test)

# 13. Feature importance
print("\n🔝 Top 10 Most Important Features:")
feature_importance = pd.DataFrame({
    'feature': features,
    'importance': clf.model.feature_importances_
}).sort_values('importance', ascending=False)
print(feature_importance.head(10).to_string(index=False))

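# 14. Save model (create the target directory if it does not exist)
model_path = '../ml_models/breakout_classifier_xgb.pkl'
Path(model_path).parent.mkdir(parents=True, exist_ok=True)
clf.save(model_path)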
print(f"\n💾 Model saved to: {model_path}")
print("\n🎉 Training complete! Model finds SUSTAINED breakouts in UPTRENDING stocks.")
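
The split in step 10 is random, so rows from the same period can land in both the train and test sets, and the forward-looking labels of neighboring dates overlap. A common alternative for time-indexed data is to split by date. The sketch below uses made-up data and a hypothetical helper, `time_based_split`; it illustrates the idea and is not the notebook's method.

```python
import pandas as pd


def time_based_split(df: pd.DataFrame, date_col: str = "date", test_fraction: float = 0.2):
    """Split rows so that every test date is later than every train date."""
    dates = pd.to_datetime(df[date_col])
    unique_dates = dates.drop_duplicates().sort_values().reset_index(drop=True)
    n_test = max(1, int(round(len(unique_dates) * test_fraction)))
    cutoff = unique_dates.iloc[-n_test]  # first date that belongs to the test set
    train = df[dates < cutoff]
    test = df[dates >= cutoff]
    return train, test


# Made-up frame: two symbols, ten dates each.
toy = pd.DataFrame({
    "date": list(pd.date_range("2024-01-01", periods=10, freq="D")) * 2,
    "symbol": ["AAA"] * 10 + ["BBB"] * 10,
    "breakout": [0, 0, 1, 0, 0, 0, 1, 0, 0, 1] * 2,
})

train_df, test_df = time_based_split(toy, test_fraction=0.2)
print(len(train_df), len(test_df))
print(train_df["date"].max(), "<", test_df["date"].min())
```

With labels that look 30 to 90 rows ahead, leaving a gap of at least the label horizon between the two sets would further reduce overlap.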


## 5. Improved Breakout Detection

This updated notebook now labels **sustained breakouts** rather than one-day spikes:

### Key Improvements:

1. **Trend Filter**: Only considers stocks in uptrends (price above 50 & 200-day MA)
2. **Sustained Returns**: Requires forward returns above a rising threshold at 30, 60, AND 90 trading days:
   - 20%+ at 30 days
   - 30%+ at 60 days  
   - 40%+ at 90 days
3. **Volume Confirmation**: Breakouts need above-average volume (1.2x+)
4. **Quality Filter**: Excludes penny stocks (<$5) and illiquid stocks (<500K volume)
5. **Drawdown Protection**: Excludes cases where the close falls more than 15% below the current price within the next 30 trading days
6. **Momentum Features**: RSI, MACD, ROC help identify real momentum
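
The labeling rule in items 2, 3, and 5 can be written as a single predicate. The sketch below restates the thresholds from `preprocess_data` for one observation; the function name `is_sustained_breakout` and the sample values are illustrative.

```python
def is_sustained_breakout(ret_30d: float, ret_60d: float, ret_90d: float,
                          max_drawdown_30d: float, volume_ratio: float) -> bool:
    """Apply the notebook's breakout thresholds to a single observation.

    Returns are fractions (0.20 means +20%). max_drawdown_30d is negative
    when the lowest close in the window is below the current close.
    """
    return (
        ret_30d > 0.20
        and ret_60d > 0.30
        and ret_90d > 0.40
        and max_drawdown_30d > -0.15
        and volume_ratio > 1.2
    )


# Steady climb on strong volume: labeled as a breakout.
print(is_sustained_breakout(0.25, 0.35, 0.45, -0.05, 1.5))
# Same returns, but an 18% dip within the first 30 days: rejected.
print(is_sustained_breakout(0.25, 0.35, 0.45, -0.18, 1.5))
# One-month spike that fades by day 60: rejected.
print(is_sustained_breakout(0.30, 0.10, 0.05, -0.02, 2.0))
```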

### Expected Results:
- Fewer breakout labels, because every condition must hold at once
- Labels that reflect sustained growth rather than short-lived spikes
- Labels intended to be more relevant to real trading/investment decisions

### If you get too few breakouts:
- Lower the return thresholds (e.g., 15%/25%/35%)
- Increase the stock universe (use more tickers)
- Use more historical data (start_date="2015-01-01")
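
Before retraining, it can help to check how many labels each threshold set would produce. The sketch below counts labels on made-up forward returns for the default thresholds and for the relaxed 15%/25%/35% set mentioned above; `count_labels` is a hypothetical helper and ignores the volume and drawdown conditions for brevity.

```python
import pandas as pd

# Made-up forward returns (fractions) for five observations.
returns = pd.DataFrame({
    "forward_return_30d": [0.22, 0.16, 0.25, 0.05, 0.18],
    "forward_return_60d": [0.31, 0.27, 0.28, 0.10, 0.26],
    "forward_return_90d": [0.45, 0.36, 0.50, 0.12, 0.30],
})


def count_labels(df: pd.DataFrame, t30: float, t60: float, t90: float) -> int:
    """Count rows whose 30/60/90-row forward returns all exceed the thresholds."""
    hits = (
        (df["forward_return_30d"] > t30)
        & (df["forward_return_60d"] > t60)
        & (df["forward_return_90d"] > t90)
    )
    return int(hits.sum())


for thresholds in [(0.20, 0.30, 0.40), (0.15, 0.25, 0.35)]:
    print(thresholds, count_labels(returns, *thresholds))
```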
