# Machine Learning for Stock Price Prediction

This notebook demonstrates machine learning approaches for predicting stock prices.

## ⚠️ IMPORTANT DISCLAIMER ⚠️

**Stock market prediction is extremely challenging and inherently uncertain:**
- Markets are influenced by countless unpredictable factors
- Past performance does not guarantee future results
- These models are for educational purposes only
- Do NOT use these predictions for actual trading without professional advice
- Always consult with a qualified financial advisor

## Learning Objectives:
- Create features from stock data for ML models
- Train Random Forest model for price prediction
- Train LSTM neural network for time-series forecasting
- Evaluate and compare model performance
- Understand limitations and risks of ML in finance
- Visualize predictions and errors
- Simulate trading strategies based on predictions

## Table of Contents:
1. [Setup and Data Loading](#setup)
2. [Feature Engineering](#features)
3. [Data Preparation](#preparation)
4. [Random Forest Model](#random-forest)
5. [LSTM Neural Network](#lstm)
6. [Model Comparison](#comparison)
7. [Prediction Visualization](#visualization)
8. [Trading Simulation](#trading)
9. [Key Takeaways and Limitations](#takeaways)

<a id='setup'></a>
## 1. Setup and Data Loading

In [1]:
import sys
sys.path.append('..')

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

# Import our modules
from src.data.fetcher import get_stock_data
from src.data.preprocessor import StockDataPreprocessor
from src.models import (
    FeatureEngineer,
    RandomForestPredictor,
    ModelEvaluator,
    prepare_ml_dataset,
    evaluate_model
)

# Try to import LSTM (TensorFlow may not be installed)
try:
    from src.models import LSTMPredictor
    HAS_LSTM = True
    print("TensorFlow available - LSTM models enabled")
except ImportError:
    HAS_LSTM = False
    print("TensorFlow not available - LSTM models disabled")
    print("Install with: pip install tensorflow")

# Configure display
pd.set_option('display.max_columns', None)
%matplotlib inline

print("\nImports successful!")

TensorFlow not installed. LSTM model unavailable.
TensorFlow available - LSTM models enabled

Imports successful!


In [2]:
# Fetch stock data
ticker = 'AAPL'
df = get_stock_data(ticker, start='2020-01-01')

print(f"Data loaded for {ticker}")
print(f"Date range: {df.index[0].date()} to {df.index[-1].date()}")
print(f"Total records: {len(df)}")
print(f"\nData preview:")
display(df.tail())

Cached AAPL data
Successfully fetched AAPL (1506 records)
Data loaded for AAPL
Date range: 2020-01-02 to 2025-12-29
Total records: 1506

Data preview:


Unnamed: 0_level_0,Open,High,Low,Close,Volume,Dividends,Stock Splits
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
2025-12-22 00:00:00-05:00,272.859985,273.880005,270.51001,270.970001,36571800,0.0,0.0
2025-12-23 00:00:00-05:00,270.839996,272.5,269.559998,272.359985,29642000,0.0,0.0
2025-12-24 00:00:00-05:00,272.339996,275.429993,272.200012,273.809998,17910600,0.0,0.0
2025-12-26 00:00:00-05:00,274.160004,275.369995,272.859985,273.399994,21521800,0.0,0.0
2025-12-29 00:00:00-05:00,272.690002,274.359985,272.350006,273.76001,23700400,0.0,0.0


<a id='features'></a>
## 2. Feature Engineering

Create features from raw stock data for machine learning models.

In [3]:
# Create all features
print("Creating features...")
df_features = FeatureEngineer.prepare_features(df)

print(f"\nOriginal columns: {len(df.columns)}")
print(f"After feature engineering: {len(df_features.columns)}")
print(f"New features created: {len(df_features.columns) - len(df.columns)}")

print("\nFeature categories created:")
print("  - Lagged features (Close, Volume)")
print("  - Rolling statistics (mean, std, min, max)")
print("  - Return features (1d, 5d, 10d, 20d)")
print("  - Volatility features")
print("  - Technical indicators (SMA, EMA, RSI, MACD, Bollinger Bands, ATR)")
print("  - Temporal features (day of week, month, cyclical encoding)")

Creating features...

Original columns: 7
After feature engineering: 63
New features created: 56

Feature categories created:
  - Lagged features (Close, Volume)
  - Rolling statistics (mean, std, min, max)
  - Return features (1d, 5d, 10d, 20d)
  - Volatility features
  - Technical indicators (SMA, EMA, RSI, MACD, Bollinger Bands, ATR)
  - Temporal features (day of week, month, cyclical encoding)


In [4]:
# Show sample of created features
print("Sample features:")
feature_cols = [col for col in df_features.columns if col not in df.columns]
print(f"\nTotal new features: {len(feature_cols)}")
print("\nFirst 20 features:")
for i, col in enumerate(feature_cols[:20], 1):
    print(f"  {i}. {col}")
if len(feature_cols) > 20:
    print(f"  ... and {len(feature_cols) - 20} more")

Sample features:

Total new features: 56

First 20 features:
  1. return_1d
  2. return_5d
  3. return_10d
  4. return_20d
  5. log_return
  6. volatility_5d
  7. volatility_10d
  8. volatility_20d
  9. Close_rolling_mean_5
  10. Close_rolling_std_5
  11. Close_rolling_min_5
  12. Close_rolling_max_5
  13. Close_rolling_mean_10
  14. Close_rolling_std_10
  15. Close_rolling_min_10
  16. Close_rolling_max_10
  17. Close_rolling_mean_20
  18. Close_rolling_std_20
  19. Close_rolling_min_20
  20. Close_rolling_max_20
  ... and 36 more


<a id='preparation'></a>
## 3. Data Preparation

Prepare features and target variable for ML models.

In [5]:
# Prepare ML dataset
print("Preparing ML dataset...")
X, y, scaler = prepare_ml_dataset(
    df,
    target_column='Close',
    forecast_horizon=1,  # Predict next day
    target_type='price',  # Predict actual price
    scale_features=True,
    scaler_type='minmax'
)

print(f"\nFeatures (X) shape: {X.shape}")
print(f"Target (y) shape: {y.shape}")
print(f"Number of features: {X.shape[1]}")
print(f"Number of samples: {X.shape[0]}")

Preparing ML dataset...


TypeError: prepare_ml_dataset() got an unexpected keyword argument 'target_type'

In [None]:
# Split data into train and test sets
# IMPORTANT: Don't shuffle time-series data!
split_ratio = 0.8
split_idx = int(len(X) * split_ratio)

X_train = X.iloc[:split_idx]
X_test = X.iloc[split_idx:]
y_train = y.iloc[:split_idx]
y_test = y.iloc[split_idx:]

print(f"Training set: {X_train.shape[0]} samples ({split_ratio*100:.0f}%)")
print(f"Test set: {X_test.shape[0]} samples ({(1-split_ratio)*100:.0f}%)")
print(f"\nTraining period: {X_train.index[0].date()} to {X_train.index[-1].date()}")
print(f"Test period: {X_test.index[0].date()} to {X_test.index[-1].date()}")

<a id='random-forest'></a>
## 4. Random Forest Model

Train a Random Forest model for price prediction.

In [None]:
# Initialize and train Random Forest
print("Training Random Forest model...\n")

rf_model = RandomForestPredictor(
    n_estimators=100,
    max_depth=10,
    min_samples_split=5,
    random_state=42
)

rf_model.fit(X_train, y_train)

print("\nModel trained successfully!")

In [None]:
# Make predictions
rf_train_pred = rf_model.predict(X_train)
rf_test_pred = rf_model.predict(X_test)

# Evaluate on training set
print("=== Random Forest Performance ===")
print("\nTraining Set:")
train_metrics = rf_model.evaluate(X_train, y_train)
for metric, value in train_metrics.items():
    print(f"  {metric}: {value:.4f}")

# Evaluate on test set
print("\nTest Set:")
test_metrics = rf_model.evaluate(X_test, y_test)
for metric, value in test_metrics.items():
    print(f"  {metric}: {value:.4f}")

In [None]:
# Feature importance analysis
print("=== Top 15 Most Important Features ===")
importance_df = rf_model.get_feature_importance(top_n=15)
display(importance_df)

# Plot feature importance
plt.figure(figsize=(10, 6))
plt.barh(range(len(importance_df)), importance_df['importance'], color='skyblue', edgecolor='black')
plt.yticks(range(len(importance_df)), importance_df['feature'])
plt.xlabel('Importance', fontsize=12)
plt.title('Random Forest - Feature Importance (Top 15)', fontsize=14, fontweight='bold')
plt.gca().invert_yaxis()
plt.grid(True, alpha=0.3, axis='x')
plt.tight_layout()
plt.show()

In [None]:
# Visualize predictions
ModelEvaluator.plot_predictions(
    y_test.values,
    rf_test_pred,
    title='Random Forest - Predictions vs Actual',
    dates=y_test.index
)

In [None]:
# Directional accuracy
rf_dir_acc = ModelEvaluator.calculate_directional_accuracy(y_test.values, rf_test_pred)
print(f"Directional Accuracy: {rf_dir_acc:.2f}%")
print("\nDirectional accuracy measures how often the model correctly predicts")
print("whether the price will go up or down (regardless of magnitude).")
print("Above 50% is better than random guessing.")

<a id='lstm'></a>
## 5. LSTM Neural Network

Train an LSTM model for time-series prediction (if TensorFlow is available).

In [None]:
if HAS_LSTM:
    print("Preparing data for LSTM...")
    
    # For LSTM, we need to reshape data into sequences
    from sklearn.preprocessing import MinMaxScaler
    
    # Use only Close price for simplicity
    close_prices = df['Close'].values.reshape(-1, 1)
    
    # Scale data
    lstm_scaler = MinMaxScaler(feature_range=(0, 1))
    close_scaled = lstm_scaler.fit_transform(close_prices)
    
    # Create LSTM model
    lookback = 60  # Use 60 days to predict next day
    lstm_model = LSTMPredictor(
        lookback=lookback,
        lstm_units=[50, 50],
        dropout=0.2,
        learning_rate=0.001
    )
    
    # Create sequences
    X_lstm, y_lstm = lstm_model.create_sequences(close_scaled, close_scaled)
    
    print(f"\nLSTM sequences created:")
    print(f"  X shape: {X_lstm.shape} (samples, lookback, features)")
    print(f"  y shape: {y_lstm.shape}")
    
    # Split data
    lstm_split = int(len(X_lstm) * 0.8)
    X_lstm_train = X_lstm[:lstm_split]
    X_lstm_test = X_lstm[lstm_split:]
    y_lstm_train = y_lstm[:lstm_split]
    y_lstm_test = y_lstm[lstm_split:]
    
    print(f"\nTraining set: {X_lstm_train.shape[0]} sequences")
    print(f"Test set: {X_lstm_test.shape[0]} sequences")
else:
    print("TensorFlow not available. Skipping LSTM model.")
    print("Install with: pip install tensorflow")

In [None]:
if HAS_LSTM:
    print("Training LSTM model...")
    print("This may take a few minutes...\n")
    
    lstm_model.fit(
        X_lstm_train,
        y_lstm_train,
        epochs=30,
        batch_size=32,
        validation_split=0.2,
        early_stopping=True,
        patience=5,
        verbose=1
    )
    
    print("\nLSTM model trained!")
else:
    print("Skipping LSTM training (TensorFlow not available)")

In [None]:
if HAS_LSTM:
    # Plot training history
    lstm_model.plot_training_history()
else:
    print("No training history to plot")

In [None]:
if HAS_LSTM:
    # Make predictions
    lstm_train_pred_scaled = lstm_model.predict(X_lstm_train)
    lstm_test_pred_scaled = lstm_model.predict(X_lstm_test)
    
    # Inverse transform to get actual prices
    lstm_train_pred = lstm_scaler.inverse_transform(lstm_train_pred_scaled.reshape(-1, 1)).flatten()
    lstm_test_pred = lstm_scaler.inverse_transform(lstm_test_pred_scaled.reshape(-1, 1)).flatten()
    y_lstm_train_actual = lstm_scaler.inverse_transform(y_lstm_train.reshape(-1, 1)).flatten()
    y_lstm_test_actual = lstm_scaler.inverse_transform(y_lstm_test.reshape(-1, 1)).flatten()
    
    # Evaluate
    print("=== LSTM Performance ===")
    print("\nTest Set:")
    lstm_metrics = ModelEvaluator.calculate_regression_metrics(y_lstm_test_actual, lstm_test_pred)
    for metric, value in lstm_metrics.items():
        print(f"  {metric}: {value:.4f}")
    
    lstm_dir_acc = ModelEvaluator.calculate_directional_accuracy(y_lstm_test_actual, lstm_test_pred)
    print(f"\nDirectional Accuracy: {lstm_dir_acc:.2f}%")
else:
    print("Skipping LSTM evaluation")

In [None]:
if HAS_LSTM:
    # Visualize LSTM predictions
    # Create date index for LSTM test set
    lstm_test_dates = df.index[lookback + lstm_split:]
    
    ModelEvaluator.plot_predictions(
        y_lstm_test_actual,
        lstm_test_pred,
        title='LSTM - Predictions vs Actual',
        dates=lstm_test_dates
    )
else:
    print("Skipping LSTM visualization")

<a id='comparison'></a>
## 6. Model Comparison

Compare performance of different models.

In [None]:
# Compare models
if HAS_LSTM:
    # Align test sets (LSTM has shorter test set due to lookback)
    # Use only the overlapping test period
    rf_test_aligned = y_test.values[-len(y_lstm_test_actual):]
    rf_pred_aligned = rf_test_pred[-len(y_lstm_test_actual):]
    
    models_comparison = {
        'Random Forest': (rf_test_aligned, rf_pred_aligned),
        'LSTM': (y_lstm_test_actual, lstm_test_pred)
    }
    
    comparison_df = ModelEvaluator.compare_models(models_comparison)
    
    print("=== Model Comparison ===")
    display(comparison_df)
    
    # Determine best model
    best_model_rmse = comparison_df['RMSE'].idxmin()
    best_model_r2 = comparison_df['R2'].idxmax()
    best_model_dir = comparison_df['Dir_Accuracy_%'].idxmax()
    
    print(f"\nBest model by RMSE: {best_model_rmse}")
    print(f"Best model by R²: {best_model_r2}")
    print(f"Best model by Directional Accuracy: {best_model_dir}")
else:
    print("Only Random Forest model available for comparison")
    print("Install TensorFlow to train LSTM: pip install tensorflow")

In [None]:
if HAS_LSTM:
    # Visual comparison
    comparison_dates = lstm_test_dates
    ModelEvaluator.plot_model_comparison(models_comparison, dates=comparison_dates)
else:
    print("Skipping visual comparison")

<a id='visualization'></a>
## 7. Prediction Visualization

Detailed visualization of model predictions and errors.

In [None]:
# Residual analysis for Random Forest
print("=== Random Forest Residual Analysis ===")
ModelEvaluator.plot_residuals(y_test.values, rf_test_pred, title='Random Forest - Residual Analysis')

In [None]:
# Error distribution
print("=== Random Forest Error Distribution ===")
ModelEvaluator.plot_error_distribution(y_test.values, rf_test_pred)

In [None]:
if HAS_LSTM:
    print("=== LSTM Residual Analysis ===")
    ModelEvaluator.plot_residuals(y_lstm_test_actual, lstm_test_pred, title='LSTM - Residual Analysis')
else:
    print("Skipping LSTM residual analysis")

<a id='trading'></a>
## 8. Trading Simulation

Simulate trading strategies based on model predictions.

In [None]:
# Simulate trading with Random Forest
print("=== Trading Simulation: Random Forest ===")
rf_pl = ModelEvaluator.calculate_profit_loss(
    y_test.values,
    rf_test_pred,
    initial_capital=10000,
    commission=0.001
)

print(f"\nInitial Capital: $10,000.00")
print(f"Final Capital: ${rf_pl['Final_Capital']:.2f}")
print(f"Total Return: {rf_pl['Total_Return_%']:.2f}%")
print(f"Number of Trades: {rf_pl['Trades']}")

# Buy and hold comparison
buy_hold_return = ((y_test.values[-1] - y_test.values[0]) / y_test.values[0]) * 100
print(f"\nBuy & Hold Return: {buy_hold_return:.2f}%")

if rf_pl['Total_Return_%'] > buy_hold_return:
    print("✓ Model strategy outperformed buy & hold")
else:
    print("✗ Model strategy underperformed buy & hold")

In [None]:
if HAS_LSTM:
    print("=== Trading Simulation: LSTM ===")
    lstm_pl = ModelEvaluator.calculate_profit_loss(
        y_lstm_test_actual,
        lstm_test_pred,
        initial_capital=10000,
        commission=0.001
    )
    
    print(f"\nInitial Capital: $10,000.00")
    print(f"Final Capital: ${lstm_pl['Final_Capital']:.2f}")
    print(f"Total Return: {lstm_pl['Total_Return_%']:.2f}%")
    print(f"Number of Trades: {lstm_pl['Trades']}")
    
    # Buy and hold for same period
    lstm_buy_hold = ((y_lstm_test_actual[-1] - y_lstm_test_actual[0]) / y_lstm_test_actual[0]) * 100
    print(f"\nBuy & Hold Return: {lstm_buy_hold:.2f}%")
    
    if lstm_pl['Total_Return_%'] > lstm_buy_hold:
        print("✓ Model strategy outperformed buy & hold")
    else:
        print("✗ Model strategy underperformed buy & hold")
else:
    print("Skipping LSTM trading simulation")

<a id='takeaways'></a>
## 9. Key Takeaways and Limitations

### Key Learnings:

1. **Feature Engineering is Critical**
   - Created 50+ features from raw OHLCV data
   - Technical indicators, lagged values, and statistical features
   - Feature importance reveals which signals matter most

2. **Multiple Models, Different Strengths**
   - Random Forest: Good feature importance, fast training
   - LSTM: Captures temporal patterns, slower but powerful
   - No single "best" model - depends on objectives

3. **Directional Accuracy Matters**
   - Predicting exact price is extremely hard
   - Predicting direction (up/down) is more achievable
   - Above 50% directional accuracy is better than random

4. **Transaction Costs Impact Results**
   - Commission fees reduce profitability
   - Frequent trading amplifies costs
   - Real-world trading has additional costs (slippage, spreads)

### Critical Limitations:

⚠️ **DO NOT USE THESE MODELS FOR REAL TRADING WITHOUT:**

1. **Acknowledging Fundamental Challenges:**
   - Markets are influenced by countless unpredictable factors
   - Black swan events (COVID-19, crashes) can't be predicted
   - Past patterns may not repeat in the future

2. **Understanding Model Limitations:**
   - Models trained on historical data (look-ahead bias risk)
   - Overfitting to past patterns is common
   - Market regime changes invalidate models
   - Correlation ≠ Causation

3. **Recognizing Practical Challenges:**
   - Backtesting overestimates real performance
   - Real trading has psychological factors
   - Liquidity issues affect execution
   - Regulatory and tax implications

4. **Risk Management:**
   - Never risk money you can't afford to lose
   - Diversification is essential
   - Stop-loss strategies required
   - Position sizing critical

### Best Practices:

✓ Use as one input among many (not sole decision maker)
✓ Combine with fundamental analysis
✓ Regularly retrain models with new data
✓ Monitor model drift and performance
✓ Use proper risk management
✓ Paper trade before real money
✓ Consult professional financial advisors

### Conclusion:

Machine learning for stock prediction is an exciting field with real potential, but it's **extremely challenging** and **inherently risky**. These models are educational tools to understand ML techniques, not guaranteed money-making systems.

**Remember: If predicting stock prices was easy, everyone would be rich!**

In [None]:
# Summary of results
print("="*60)
print("SUMMARY OF RESULTS")
print("="*60)

print(f"\nStock: {ticker}")
print(f"Test Period: {y_test.index[0].date()} to {y_test.index[-1].date()}")
print(f"Test Samples: {len(y_test)}")

print("\n--- Random Forest ---")
print(f"RMSE: ${test_metrics['RMSE']:.2f}")
print(f"R²: {test_metrics['R2']:.4f}")
print(f"MAPE: {test_metrics['MAPE']:.2f}%")
print(f"Directional Accuracy: {rf_dir_acc:.2f}%")
print(f"Simulated Return: {rf_pl['Total_Return_%']:.2f}%")

if HAS_LSTM:
    print("\n--- LSTM ---")
    print(f"RMSE: ${lstm_metrics['RMSE']:.2f}")
    print(f"R²: {lstm_metrics['R2']:.4f}")
    print(f"MAPE: {lstm_metrics['MAPE']:.2f}%")
    print(f"Directional Accuracy: {lstm_dir_acc:.2f}%")
    print(f"Simulated Return: {lstm_pl['Total_Return_%']:.2f}%")

print("\n--- Baseline (Buy & Hold) ---")
print(f"Return: {buy_hold_return:.2f}%")

print("\n" + "="*60)
print("Remember: These are educational simulations only!")
print("Never trade based solely on ML predictions.")
print("="*60)