# Train LSTM Model for Premium Tier

This notebook trains the LSTM deep learning model for the Premium tier. The model carries a 10% weight in the ensemble.

**Training Time:** 30-60 minutes  
**GPU Required:** Yes (T4 or better recommended)  
**Output:** lstm.pth, lstm_scaler.pkl, lstm_metadata.json  
**Recommended:** Run in Google Colab with GPU runtime

---

## 🚀 Quick Start Guide

### For Google Colab Users (RECOMMENDED):

1. **Upload this notebook to Colab**
   - Go to [colab.research.google.com](https://colab.research.google.com)
   - File → Upload notebook → Select this file

2. **Enable GPU Runtime** ⚠️ IMPORTANT!
   - Runtime → Change runtime type
   - Hardware accelerator → **GPU** (select T4 or better)
   - Click Save
   - Without GPU, training will take HOURS instead of minutes!

3. **Install packages** (Cell 2 below)
   - Uncomment the install line
   - Run the cell

4. **Check GPU is enabled** (Cell 3 below)
   - Run cell 3
   - You should see: "Using device: cuda" ✅
   - If you see "cpu", go back to step 2

5. **Upload your stock list CSV** (Optional)
   - If you have `eodhd_us_tickers.csv`, upload it:
   ```python
   from google.colab import files
   uploaded = files.upload()
   ```
   - Or skip this step; the notebook then downloads the S&P 500 list automatically

6. **Run all cells**
   - Runtime → Run all
   - Go for a walk 🚶 (takes 30-60 minutes with GPU)

7. **Download trained models**
   - Run the last cell to download a zip file
   - Upload `lstm.pth` and `lstm_scaler.pkl` to your `ml_models/` directory

### For Local Users:

**Prerequisites:**
- NVIDIA GPU with CUDA support
- PyTorch with CUDA installed

1. **Install dependencies:**
   ```bash
   # Install PyTorch with CUDA (check pytorch.org for your system)
   pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
   
   # Install other packages
   pip install pandas numpy yfinance joblib scikit-learn
   ```

2. **Verify GPU:**
   ```python
   import torch
   print(torch.cuda.is_available())  # Should print True
   ```

3. **Run all cells in order**

4. **Models save to:** `../ml_models/`

---

## ⚠️ Important Notes

- **GPU is effectively REQUIRED** - the notebook falls back to CPU, but training will be extremely slow
- **Free Colab has limits** - GPU sessions are time-limited (roughly 12 hours at most) and availability varies
- **Training 200 stocks** takes ~30-60 minutes with a T4 GPU
- **More stocks** = longer training time (roughly linear scaling)

---

## 1. Setup & Check GPU

In [None]:
# Install required packages (UNCOMMENT if running in Colab)
# !pip install torch pandas numpy yfinance joblib scikit-learn -q

# Optional: Upload your stock list CSV in Colab
# from google.colab import files
# print("📁 Upload your eodhd_us_tickers.csv or stocks-list.csv file:")
# uploaded = files.upload()
# # Move to data directory
# import os
# os.makedirs('data', exist_ok=True)
# for filename in uploaded.keys():
#     os.rename(filename, f'data/{filename}')
#     print(f"✅ Uploaded {filename} to data/")

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
import pandas as pd
import numpy as np
import yfinance as yf
import joblib
import json
from datetime import datetime, timedelta
from sklearn.preprocessing import StandardScaler
import warnings
warnings.filterwarnings('ignore')

# Check GPU availability
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print("="*60)
print("🔍 GPU CHECK")
print("="*60)
print(f"Using device: {device}")

if device.type == 'cuda':
    print(f"✅ GPU ENABLED: {torch.cuda.get_device_name(0)}")
    print(f"Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")
    print("\n🚀 Training will be FAST!")
else:
    print("❌ GPU NOT AVAILABLE - Training will be VERY SLOW!")
    print("\n💡 If you're in Google Colab:")
    print("   1. Runtime → Change runtime type")
    print("   2. Hardware accelerator → GPU")
    print("   3. Click Save")
    print("   4. Re-run this cell")
    print("\n⚠️  Consider enabling GPU before proceeding!")
print("="*60)

## 2. Configuration

In [None]:
CONFIG = {
    'training_universe_size': None,  # None = use every ticker in the list; set e.g. 200 for a faster run
    'sequence_length': 60,          # 60 trading days of history per sequence
    'forward_prediction_days': 30,  # Predict the return 30 trading days ahead
    'hidden_size': 128,             # LSTM hidden units
    'num_layers': 2,                # LSTM layers
    'dropout': 0.2,
    'batch_size': 64,
    'epochs': 50,
    'learning_rate': 0.001,
    'output_dir': 'ml_models',
}

print("Configuration:")
for key, value in CONFIG.items():
    print(f"  {key}: {value}")

## 3. Load Stock Universe

In [None]:
# Load stock list from CSV, falling back to the S&P 500 list
import os

universe = None
csv_paths = ['../data/eodhd_us_tickers.csv', 'data/eodhd_us_tickers.csv',
             '../data/stocks-list.csv', 'data/stocks-list.csv']
symbol_columns = ['Symbol', 'symbol', 'ticker', 'Ticker', 'SYMBOL']

for path in csv_paths:
    if not os.path.exists(path):
        continue
    stocks_df = pd.read_csv(path)
    # Use the first column name that matches a known ticker header
    col = next((c for c in symbol_columns if c in stocks_df.columns), None)
    if col is None:
        print(f"⚠️  {path} has no ticker column, skipping")
        continue
    # head(None) keeps every row, so training_universe_size=None means "all"
    universe = (stocks_df[col].dropna().astype(str)
                .head(CONFIG['training_universe_size']).tolist())
    print(f"✅ Loaded {len(universe)} stocks from {path}")
    break

if universe is None:
    # Fallback: download the S&P 500 constituents from Wikipedia
    print("⚠️  CSV not found, downloading S&P 500 list...")
    sp500_table = pd.read_html('https://en.wikipedia.org/wiki/List_of_S%26P_500_companies')[0]
    universe = sp500_table['Symbol'].head(CONFIG['training_universe_size']).tolist()
    print(f"✅ Loaded {len(universe)} stocks from S&P 500")

print(f"Total stocks: {len(universe)}")
print(f"First 10: {universe[:10]}")
print(f"\n⏱️  ESTIMATED TRAINING TIME:")
print(f"   - 200 stocks: ~30-60 minutes with GPU")
print(f"   - 1000 stocks: ~2-3 hours with GPU")
print(f"   - 5000 stocks: ~8-12 hours with GPU")
print(f"\n💡 TIP: Training with {len(universe)} stocks will produce more robust models!")
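
The column lookup in the cell above tries a fixed list of header names. A minimal sketch of a more forgiving variant follows, assuming that a case-insensitive match on `symbol` or `ticker` is acceptable and that Yahoo Finance expects a dash where listings use a dot for share classes (for example `BRK.B` becomes `BRK-B`). The helper names `find_symbol_column` and `normalize_ticker` are hypothetical and not part of the notebook.

```python
def find_symbol_column(columns):
    """Return the first column whose name is 'symbol' or 'ticker', ignoring case."""
    for name in columns:
        if str(name).strip().lower() in ("symbol", "ticker"):
            return name
    return None


def normalize_ticker(symbol):
    """Upper-case a ticker and swap '.' for '-' (assumed Yahoo Finance convention)."""
    return str(symbol).strip().upper().replace(".", "-")


columns = ["Name", "TICKER", "Exchange"]
print(find_symbol_column(columns))
print([normalize_ticker(s) for s in ["brk.b", " aapl ", "BF.B"]])
```

Returning `None` when no column matches lets the caller decide whether to try the next file or fall back to the S&P 500 download.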

## 4. Download & Prepare Time Series Data

In [None]:
print("📊 Downloading historical data...\n")

all_sequences = []
all_labels = []
failed_tickers = []

end_date = datetime.now()
start_date = end_date - timedelta(days=730)  # 2 years

for i, symbol in enumerate(universe):
    try:
        ticker = yf.Ticker(symbol)
        df = ticker.history(start=start_date, end=end_date)
        
        if len(df) < CONFIG['sequence_length'] + CONFIG['forward_prediction_days']:
            failed_tickers.append(symbol)
            continue
        
        # 🔧 FIX: Handle division by zero and extreme values
        # Calculate returns with safety checks
        df['return'] = df['Close'].pct_change()
        
        # Replace infinite returns (from zero prices)
        df['return'] = df['return'].replace([np.inf, -np.inf], np.nan)
        
        # Clip extreme returns (>100% or <-100%)
        df['return'] = df['return'].clip(-1.0, 1.0)
        
        # Fill NaN returns with 0
        df['return'] = df['return'].fillna(0)
        
        # Normalize volume safely
        volume_mean = df['Volume'].mean()
        volume_std = df['Volume'].std()
        
        if volume_std > 0:
            df['volume_norm'] = (df['Volume'] - volume_mean) / volume_std
        else:
            df['volume_norm'] = 0
        
        # Clip volume to reasonable range
        df['volume_norm'] = df['volume_norm'].clip(-10, 10)
        
        # Create sequences
        for j in range(CONFIG['sequence_length'], len(df) - CONFIG['forward_prediction_days']):
            # Features: return and normalized volume
            sequence = df[['return', 'volume_norm']].iloc[j-CONFIG['sequence_length']:j].values
            
            # Label: forward return
            current_price = df['Close'].iloc[j]
            future_price = df['Close'].iloc[j + CONFIG['forward_prediction_days']]
            
            # Safety check for label calculation
            if current_price <= 0 or future_price <= 0:
                continue
            
            label = (future_price / current_price) - 1
            
            # Clip label to reasonable range
            label = np.clip(label, -1.0, 1.0)
            
            # Only add if no NaN or Inf
            if not np.isnan(sequence).any() and not np.isinf(sequence).any() and not np.isnan(label) and not np.isinf(label):
                all_sequences.append(sequence)
                all_labels.append(label)
        
        if (i + 1) % 25 == 0:
            print(f"Processed {i+1}/{len(universe)} stocks, {len(all_sequences):,} sequences")
    
    except Exception as e:
        failed_tickers.append(symbol)
        if len(failed_tickers) <= 5:  # Print first 5 errors
            print(f"⚠️  {symbol} failed: {str(e)[:100]}")
        continue

print(f"\n✅ Data collection complete!")
print(f"Total sequences: {len(all_sequences):,}")
print(f"Failed tickers: {len(failed_tickers)}")
if failed_tickers:
    # Show at most the first 10 symbols that failed
    suffix = "..." if len(failed_tickers) > 10 else ""
    print(f"Failed symbols: {failed_tickers[:10]}{suffix}")
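
The windowing loop above pairs each block of `sequence_length` rows with the return realized `forward_prediction_days` later. The sketch below reproduces that indexing on a toy price series with NumPy only; `make_windows` is a hypothetical helper, and it uses a single feature (daily return) instead of the notebook's two.

```python
import numpy as np


def make_windows(close, seq_len, horizon):
    """Build (X, y) pairs: X holds seq_len daily returns ending at day j-1,
    y is the return from day j to day j + horizon."""
    close = np.asarray(close, dtype=float)
    # Daily return; the first day has no previous close, so use 0
    returns = np.zeros_like(close)
    returns[1:] = close[1:] / close[:-1] - 1.0

    X, y = [], []
    for j in range(seq_len, len(close) - horizon):
        X.append(returns[j - seq_len:j])
        y.append(close[j + horizon] / close[j] - 1.0)
    return np.array(X), np.array(y)


prices = np.linspace(100.0, 120.0, 21)   # 21 toy closing prices
X, y = make_windows(prices, seq_len=5, horizon=3)
print(X.shape, y.shape)                  # one window per valid start day
```

Each ticker yields `len(df) - sequence_length - forward_prediction_days` windows, so with the default 60/30 settings a ticker with 500 rows of history contributes 410 sequences.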

## 5. Prepare PyTorch Dataset

In [None]:
# Convert to numpy arrays
X = np.array(all_sequences)
y = np.array(all_labels)

print(f"X shape: {X.shape}")
print(f"y shape: {y.shape}")

# 🔧 FIX: Clean data before scaling
print("\n🔍 Checking for invalid values...")

# Check for NaN
nan_mask_X = np.isnan(X)
nan_mask_y = np.isnan(y)
print(f"NaN in X: {nan_mask_X.any()} ({nan_mask_X.sum()} values)")
print(f"NaN in y: {nan_mask_y.any()} ({nan_mask_y.sum()} values)")

# Check for infinity
inf_mask_X = np.isinf(X)
inf_mask_y = np.isinf(y)
print(f"Inf in X: {inf_mask_X.any()} ({inf_mask_X.sum()} values)")
print(f"Inf in y: {inf_mask_y.any()} ({inf_mask_y.sum()} values)")

# Remove sequences with NaN or Inf
valid_mask = ~(nan_mask_X.any(axis=(1,2)) | inf_mask_X.any(axis=(1,2)) | nan_mask_y | inf_mask_y)
X_clean = X[valid_mask]
y_clean = y[valid_mask]

print(f"\n✅ Cleaned data:")
print(f"   Original: {len(X):,} sequences")
print(f"   After cleaning: {len(X_clean):,} sequences")
print(f"   Removed: {len(X) - len(X_clean):,} invalid sequences ({(1 - len(X_clean)/len(X))*100:.2f}%)")

# Clip extreme values (handle outliers)
print("\n🔧 Clipping extreme values...")
percentile_99 = np.percentile(np.abs(X_clean), 99)
X_clean = np.clip(X_clean, -percentile_99, percentile_99)

y_percentile_99 = np.percentile(np.abs(y_clean), 99)
y_clean = np.clip(y_clean, -y_percentile_99, y_percentile_99)

print(f"   X clipped to [-{percentile_99:.4f}, {percentile_99:.4f}]")
print(f"   y clipped to [-{y_percentile_99:.4f}, {y_percentile_99:.4f}]")

# Split train/test before scaling so test-set statistics never reach the scaler
split_idx = int(0.8 * len(X_clean))
X_train_raw, X_test_raw = X_clean[:split_idx], X_clean[split_idx:]
y_train, y_test = y_clean[:split_idx], y_clean[split_idx:]

# Scale features: fit on the training split only, then apply to both splits
print("\n🔄 Scaling features...")
scaler = StandardScaler()
n_features = X_clean.shape[-1]
X_train = scaler.fit_transform(X_train_raw.reshape(-1, n_features)).reshape(X_train_raw.shape)
X_test = scaler.transform(X_test_raw.reshape(-1, n_features)).reshape(X_test_raw.shape)

# Final check
if not (np.isfinite(X_train).all() and np.isfinite(X_test).all()):
    raise ValueError("❌ Scaling produced NaN or Inf values!")
else:
    print("✅ Scaling successful - no invalid values")

print(f"\n📊 Final dataset split:")
print(f"   Training samples: {len(X_train):,}")
print(f"   Testing samples: {len(X_test):,}")

# Create PyTorch datasets
class StockDataset(Dataset):
    def __init__(self, X, y):
        self.X = torch.FloatTensor(X)
        self.y = torch.FloatTensor(y)
    
    def __len__(self):
        return len(self.X)
    
    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]

train_dataset = StockDataset(X_train, y_train)
test_dataset = StockDataset(X_test, y_test)

train_loader = DataLoader(train_dataset, batch_size=CONFIG['batch_size'], shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=CONFIG['batch_size'], shuffle=False)

print("\n✅ PyTorch datasets created and ready for training!")
print(f"   Batch size: {CONFIG['batch_size']}")
print(f"   Training batches: {len(train_loader)}")
print(f"   Test batches: {len(test_loader)}")
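
Scaling a 3-D array of shape `(samples, time steps, features)` per feature comes down to flattening the first two axes, computing statistics, and reshaping back. The following NumPy-only sketch does this with statistics taken from the training split alone, which keeps test data from influencing the scaling. `standardize_sequences` is a hypothetical name, and plain mean and standard deviation stand in for `StandardScaler`.

```python
import numpy as np


def standardize_sequences(X_train, X_test):
    """Scale each feature with mean/std computed on the training split only."""
    n_features = X_train.shape[-1]
    flat = X_train.reshape(-1, n_features)      # (samples * steps, features)
    mean = flat.mean(axis=0)
    std = flat.std(axis=0)
    std[std == 0] = 1.0                         # constant feature: avoid division by zero
    return (X_train - mean) / std, (X_test - mean) / std, mean, std


rng = np.random.default_rng(0)
X = rng.normal(loc=[0.0, 5.0], scale=[1.0, 3.0], size=(100, 60, 2))
X_train, X_test = X[:80], X[80:]
X_train_s, X_test_s, mean, std = standardize_sequences(X_train, X_test)
print(X_train_s.reshape(-1, 2).mean(axis=0).round(6))
print(X_train_s.reshape(-1, 2).std(axis=0).round(6))
```

The same `mean` and `std` must be reused at inference time, which is why the notebook persists its scaler alongside the model weights.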

## 6. Define LSTM Model

In [None]:
class StockLSTM(nn.Module):
    def __init__(self, input_size, hidden_size, num_layers, dropout):
        super(StockLSTM, self).__init__()
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        
        self.lstm = nn.LSTM(
            input_size=input_size,
            hidden_size=hidden_size,
            num_layers=num_layers,
            dropout=dropout if num_layers > 1 else 0,
            batch_first=True
        )
        
        self.fc = nn.Sequential(
            nn.Linear(hidden_size, 64),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(64, 1)
        )
    
    def forward(self, x):
        # LSTM forward pass
        lstm_out, _ = self.lstm(x)
        
        # Use last time step
        last_output = lstm_out[:, -1, :]
        
        # Fully connected layer
        output = self.fc(last_output)
        
        # Drop only the trailing feature dim so a batch of one stays 1-D
        return output.squeeze(-1)

# Initialize model
input_size = X.shape[2]  # Number of features
model = StockLSTM(
    input_size=input_size,
    hidden_size=CONFIG['hidden_size'],
    num_layers=CONFIG['num_layers'],
    dropout=CONFIG['dropout']
).to(device)

# Loss and optimizer
criterion = nn.MSELoss()
optimizer = optim.Adam(model.parameters(), lr=CONFIG['learning_rate'])

print("Model architecture:")
print(model)
print(f"\nTotal parameters: {sum(p.numel() for p in model.parameters()):,}")
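
The `Total parameters` figure printed above can be cross-checked by hand. The sketch below counts weights for the architecture in this notebook, assuming PyTorch's LSTM layout of four gates per layer with separate input-hidden and hidden-hidden bias vectors; `lstm_param_count` is a hypothetical helper.

```python
def lstm_param_count(input_size, hidden_size, num_layers, fc_hidden=64):
    """Count trainable parameters of a stacked LSTM followed by a two-layer head."""
    total = 0
    layer_input = input_size
    for _ in range(num_layers):
        # 4 gates, each with input weights, recurrent weights, and two bias vectors
        total += 4 * (layer_input * hidden_size + hidden_size * hidden_size + 2 * hidden_size)
        layer_input = hidden_size  # deeper layers read the previous layer's output
    total += hidden_size * fc_hidden + fc_hidden   # Linear(hidden_size, 64)
    total += fc_hidden * 1 + 1                     # Linear(64, 1)
    return total


print(lstm_param_count(input_size=2, hidden_size=128, num_layers=2))  # 208001
```

Almost all of the weights sit in the recurrent layers, so `hidden_size` is the setting that most affects model size and training time.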

## 7. Train Model

In [None]:
print("🚀 Training LSTM model...\n")

train_losses = []
test_losses = []
best_test_loss = float('inf')

for epoch in range(CONFIG['epochs']):
    # Training
    model.train()
    train_loss = 0
    for X_batch, y_batch in train_loader:
        X_batch, y_batch = X_batch.to(device), y_batch.to(device)
        
        # Forward pass
        optimizer.zero_grad()
        outputs = model(X_batch)
        loss = criterion(outputs, y_batch)
        
        # Backward pass
        loss.backward()
        optimizer.step()
        
        train_loss += loss.item()
    
    train_loss /= len(train_loader)
    train_losses.append(train_loss)
    
    # Validation
    model.eval()
    test_loss = 0
    with torch.no_grad():
        for X_batch, y_batch in test_loader:
            X_batch, y_batch = X_batch.to(device), y_batch.to(device)
            outputs = model(X_batch)
            loss = criterion(outputs, y_batch)
            test_loss += loss.item()
    
    test_loss /= len(test_loader)
    test_losses.append(test_loss)
    
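    # Save best model (clone the tensors; state_dict() returns live references
    # that later optimizer steps would overwrite)
    if test_loss < best_test_loss:
        best_test_loss = test_loss
        best_model_state = {k: v.detach().clone() for k, v in model.state_dict().items()}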
    
    # Print progress
    if (epoch + 1) % 5 == 0:
        print(f"Epoch [{epoch+1}/{CONFIG['epochs']}] - Train Loss: {train_loss:.6f}, Test Loss: {test_loss:.6f}")

print("\n✅ Training complete!")
print(f"Best test loss: {best_test_loss:.6f}")

# Restore best model
model.load_state_dict(best_model_state)

## 8. Evaluate Model

In [None]:
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

model.eval()
all_preds = []
all_targets = []

with torch.no_grad():
    for X_batch, y_batch in test_loader:
        X_batch = X_batch.to(device)
        outputs = model(X_batch)
        # reshape(-1) keeps a single-sample batch iterable
        all_preds.extend(outputs.cpu().numpy().reshape(-1))
        all_targets.extend(y_batch.numpy().reshape(-1))

all_preds = np.array(all_preds)
all_targets = np.array(all_targets)

mse = mean_squared_error(all_targets, all_preds)
mae = mean_absolute_error(all_targets, all_preds)
r2 = r2_score(all_targets, all_preds)

print("\n" + "="*60)
print("📊 LSTM MODEL PERFORMANCE")
print("="*60)
print(f"MSE:  {mse:.6f}")
print(f"MAE:  {mae:.6f}")
print(f"R² Score: {r2:.6f}")
print("="*60)
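
For reference, the three metrics reported above reduce to a few lines of NumPy. This sketch assumes the same definitions scikit-learn uses for single-output regression; `regression_metrics` is a hypothetical helper and the sample values are made up for illustration.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Return (MSE, MAE, R^2) for 1-D arrays of targets and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred
    mse = np.mean(err ** 2)
    mae = np.mean(np.abs(err))
    # R^2 compares the model against always predicting the mean target
    ss_res = np.sum(err ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    return mse, mae, r2


mse, mae, r2 = regression_metrics([0.10, -0.05, 0.02, 0.00], [0.08, -0.01, 0.00, 0.01])
print(f"MSE={mse:.6f} MAE={mae:.6f} R2={r2:.4f}")
```

An R² at or below zero means the model does no better than predicting the average forward return for every sequence.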

## 9. Save Model

In [None]:
import os

output_dir = f"../{CONFIG['output_dir']}"
os.makedirs(output_dir, exist_ok=True)

print(f"💾 Saving model to {output_dir}/...\n")

# Save PyTorch model
torch.save({
    'model_state_dict': model.state_dict(),
    'input_size': input_size,
    'hidden_size': CONFIG['hidden_size'],
    'num_layers': CONFIG['num_layers'],
    'dropout': CONFIG['dropout'],
    'sequence_length': CONFIG['sequence_length'],
}, f"{output_dir}/lstm.pth")
print("✅ Saved lstm.pth")

# Save scaler
joblib.dump(scaler, f"{output_dir}/lstm_scaler.pkl")
print("✅ Saved lstm_scaler.pkl")

# Save metadata
metadata = {
    'training_date': datetime.now().isoformat(),
    'num_stocks': len(universe) - len(failed_tickers),
    'training_samples': len(X_train),
    'test_samples': len(X_test),
    'sequence_length': CONFIG['sequence_length'],
    'input_size': input_size,
    'performance': {
        'mse': float(mse),
        'mae': float(mae),
        'r2': float(r2),
        'best_test_loss': float(best_test_loss)
    },
    'config': CONFIG
}

with open(f"{output_dir}/lstm_metadata.json", 'w') as f:
    json.dump(metadata, f, indent=2)
print("✅ Saved lstm_metadata.json")

print("\n" + "="*60)
print("🎉 LSTM TRAINING COMPLETE!")
print("="*60)
print(f"\nFiles saved to: {output_dir}/")
print("  - lstm.pth")
print("  - lstm_scaler.pkl")
print("  - lstm_metadata.json")
print("\nReady to deploy! 🚀")
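
To use the saved checkpoint elsewhere, the architecture has to be rebuilt from the hyperparameters stored next to the weights. A minimal sketch follows, assuming the `StockLSTM` layout from this notebook and the checkpoint keys written by the save cell above. `load_lstm` is a hypothetical helper, and the demo writes a throwaway checkpoint to a temporary directory rather than reading a real `lstm.pth`. Inputs to the real model would also need to be transformed with the saved `lstm_scaler.pkl` first.

```python
import os
import tempfile

import torch
import torch.nn as nn


class StockLSTM(nn.Module):
    """Same layout as the training notebook: LSTM -> Linear(64) -> Linear(1)."""

    def __init__(self, input_size, hidden_size, num_layers, dropout):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, num_layers,
                            dropout=dropout if num_layers > 1 else 0,
                            batch_first=True)
        self.fc = nn.Sequential(nn.Linear(hidden_size, 64), nn.ReLU(),
                                nn.Dropout(dropout), nn.Linear(64, 1))

    def forward(self, x):
        lstm_out, _ = self.lstm(x)
        return self.fc(lstm_out[:, -1, :]).squeeze(-1)


def load_lstm(path, device="cpu"):
    """Rebuild the model from a checkpoint dict and switch it to eval mode."""
    ckpt = torch.load(path, map_location=device)
    model = StockLSTM(ckpt["input_size"], ckpt["hidden_size"],
                      ckpt["num_layers"], ckpt["dropout"]).to(device)
    model.load_state_dict(ckpt["model_state_dict"])
    model.eval()  # disables dropout for inference
    return model, ckpt["sequence_length"]


# Demo with a small, untrained model (not the real checkpoint)
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "lstm.pth")
    demo = StockLSTM(input_size=2, hidden_size=8, num_layers=2, dropout=0.2)
    torch.save({"model_state_dict": demo.state_dict(), "input_size": 2,
                "hidden_size": 8, "num_layers": 2, "dropout": 0.2,
                "sequence_length": 60}, path)
    model, seq_len = load_lstm(path)

with torch.no_grad():
    preds = model(torch.randn(3, seq_len, 2))
print(preds.shape)
```

Storing the hyperparameters in the checkpoint means the loading side does not need a copy of `CONFIG`.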

## 10. Download Models (For Colab Users)

In [None]:
# Download models for Colab (UNCOMMENT if running in Colab)
# from google.colab import files
# import zipfile
# import os

# print("📦 Creating zip file with LSTM models...")

# # Create zip file
# zip_filename = 'lstm_models.zip'
# with zipfile.ZipFile(zip_filename, 'w') as zipf:
#     for file in ['lstm.pth', 'lstm_scaler.pkl', 'lstm_metadata.json']:
#         file_path = f"{output_dir}/{file}"
#         if os.path.exists(file_path):
#             zipf.write(file_path, file)
#             print(f"  ✅ Added {file}")
#         else:
#             print(f"  ⚠️  {file} not found")

# print(f"\n⬇️  Downloading {zip_filename}...")
# files.download(zip_filename)
# print("✅ Download complete!")
# print("\n📋 Next steps:")
# print("1. Extract the zip file")
# print("2. Upload lstm.pth and lstm_scaler.pkl to your project's ml_models/ directory")
# print("3. The LSTM model adds 10% weight to ensemble predictions for Premium tier")
# print("4. Make sure to train ensemble models first (train_ensemble_models.ipynb)")