# <center> **ITESO** </center>
# <center> **Final Project Procesamiento de Datos Masivos** </center>
---
## <center> **Machine Learning Applications** </center>
## <center> **Real-Time Stock Price Analysis** </center>
---
## <center> **Par de Foraneos** </center>
---
#### <center> **Spring 2025** </center>
---
#### <center> **05/14/2025** </center>

---
**Professor**: Dr. Pablo Camarillo Ramirez
**Students**: Eddie, Konrad, Diego, Aaron

## Introduction

In this notebook, we'll apply machine learning to our processed stock price data. Having already set up the data streaming, collection, and preprocessing pipelines, we now focus on applying ML techniques to predict trading signals and evaluate trading strategies using backtesting.

Our approach will be to:
1. Either use our processed parquet files or download historical data
2. Prepare the data with technical indicators and lag features
3. Generate trading signals (BUY/SELL/WAIT) based on future price movements
4. Train SVM models to predict these signals
5. Optimize parameters using Optuna
6. Backtest the trading strategy and evaluate performance
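
Step 3 can be made concrete with a small sketch. The notebook delegates labeling to `generate_signals` in `stock_utils`, whose implementation is not shown here, so the function below is only an illustration. The name `label_signals`, the `horizon` of future bars, and the `threshold` value are all assumptions chosen for the example.

```python
def label_signals(closes, horizon=5, threshold=0.02):
    """Label each bar BUY/SELL/WAIT from the return over the next `horizon` bars.

    Hypothetical helper: horizon and threshold are illustrative values,
    not the ones used by stock_utils.generate_signals.
    """
    signals = []
    for i, price in enumerate(closes):
        if i + horizon >= len(closes):
            # No future price is available, so the label is unknown
            signals.append(None)
            continue
        future_return = closes[i + horizon] / price - 1.0
        if future_return > threshold:
            signals.append("BUY")
        elif future_return < -threshold:
            signals.append("SELL")
        else:
            signals.append("WAIT")
    return signals


closes = [100.0, 101.0, 99.0, 104.0, 97.0, 98.0]
print(label_signals(closes, horizon=2, threshold=0.02))
# ['WAIT', 'BUY', 'SELL', 'SELL', None, None]
```

The last `horizon` bars have no label because their future price is not yet known. Those rows would need to be dropped before training so the model never sees a target built from missing data.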

## Import Necessary Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from par_de_foraneos.stock_utils import load_and_prepare_data, generate_signals, \
                                        StockModel, Model, Backtest

ModuleNotFoundError: No module named 'par_de_foraneos'

## Data Source Selection

We can either use the parquet files generated by our streaming pipeline or download historical data directly from Yahoo Finance.

In [4]:
# Define the tickers we're working with
STOCKS = ['CAT', 'AAPL', 'NVDA', 'CVX']

# Choose data source: 'parquet' or 'historical'
data_source = "historical"
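
As a rough sketch of what the `data_source` switch could select between, the helper below dispatches on the source name. It is not the project's loader: `load_prices`, the `parquet_dir` path, the file naming scheme, and the use of the `yfinance` package for Yahoo Finance are all assumptions made for illustration.

```python
def load_prices(ticker, source="historical", parquet_dir="data/processed"):
    """Return a price DataFrame for `ticker` from the chosen source.

    Hypothetical helper: `parquet_dir` and the file naming are assumptions;
    the real pipeline's paths are not shown in this notebook.
    """
    if source == "parquet":
        # Imported lazily so each branch only needs its own dependency
        import pandas as pd
        return pd.read_parquet(f"{parquet_dir}/{ticker}.parquet")
    if source == "historical":
        import yfinance as yf
        # period="5y" is an illustrative choice, not the project's setting
        return yf.download(ticker, period="5y", progress=False)
    raise ValueError(
        f"Unknown data_source {source!r}; expected 'parquet' or 'historical'"
    )
```

Rejecting unknown values explicitly means a typo in `data_source` fails immediately instead of silently falling back to one of the two sources.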

## Process Each Stock

Now we'll process each stock in our list using the StockModel class imported from stock_utils.py.

In [None]:
# Set to True to optimize parameters, False to use default parameters
optimize = True

# Process each stock
stock_models = {}
backtest_results = {}

for ticker in STOCKS:
    print(f"\nProcessing {ticker}...")
    
    # Initialize stock model
    stock_model = StockModel(ticker, data_source=data_source)
    
    # Prepare data
    success = stock_model.prepare_data()
    if not success:
        print(f"Skipping {ticker} due to data preparation failure")
        continue
    
    # Optimize indicators if requested
    if optimize:
        print(f"Optimizing indicators for {ticker}...")
        stock_model.optimize_indicators(n_trials=20)
    
    # Train model
    print(f"Training model for {ticker}...")
    stock_model.train_model()
    
    # Optimize backtest parameters if requested
    if optimize:
        print(f"Optimizing backtest parameters for {ticker}...")
        stock_model.optimize_backtest(n_trials=20)
    
    # Run backtest
    print(f"Running backtest for {ticker}...")
    backtest_results[ticker] = stock_model.run_backtest()
    
    # Store model
    stock_models[ticker] = stock_model
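
The internals of `StockModel.train_model()` live in `stock_utils.py` and are not shown here. The sketch below illustrates one plausible shape for that step: lagged returns as features, a three-class target, and a scikit-learn `SVC`. The synthetic prices, the number of lags, the horizon, the threshold, and the SVM hyperparameters are all assumptions for illustration.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Synthetic random-walk closes stand in for real prices (illustration only)
closes = 100 * np.cumprod(1 + rng.normal(0, 0.01, 500))
returns = np.diff(closes) / closes[:-1]  # returns[k] is the move from bar k to k+1

n_lags, horizon, threshold = 3, 5, 0.01  # illustrative values

X, y = [], []
for p in range(n_lags, len(closes) - horizon):
    # Features: the n_lags most recent returns known at bar p
    X.append(returns[p - n_lags:p])
    # Target: 1 = BUY, -1 = SELL, 0 = WAIT, from the return over the next `horizon` bars
    future_return = closes[p + horizon] / closes[p] - 1
    if future_return > threshold:
        y.append(1)
    elif future_return < -threshold:
        y.append(-1)
    else:
        y.append(0)
X, y = np.array(X), np.array(y)

# Chronological split: the test set is strictly later than the training set
split = int(len(X) * 0.8)
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]

# Scaling matters for SVMs because the RBF kernel is distance-based
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
model.fit(X_train, y_train)
predictions = model.predict(X_test)
print(f"Test accuracy: {(predictions == y_test).mean():.3f}")
```

The split is chronological rather than shuffled because a random split would let the model train on bars that come after the ones it is tested on.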

## Portfolio Performance Analysis

Next we analyze overall performance, treating the backtest results as an equally weighted portfolio of all the stocks.

In [None]:
# Calculate portfolio metrics
if len(backtest_results) > 0:
    # Calculate average metrics
    avg_calmar = sum(res['calmar'] for res in backtest_results.values()) / len(backtest_results)
    avg_sharpe = sum(res['sharpe'] for res in backtest_results.values()) / len(backtest_results)
    avg_sortino = sum(res['sortino'] for res in backtest_results.values()) / len(backtest_results)
    avg_return = sum(res['return'] for res in backtest_results.values()) / len(backtest_results)
    
    # Calculate total portfolio value over time (equal weighting)
    min_length = min(len(res['portfolio_value']) for res in backtest_results.values())
    portfolio_values = np.zeros(min_length)
    
    for ticker, res in backtest_results.items():
        # Normalize to percentage of initial investment
        normalized_values = np.array(res['portfolio_value'][:min_length]) / res['portfolio_value'][0]
        portfolio_values += normalized_values / len(backtest_results)
    
    # Scale back to dollars. portfolio_values is already the equal-weighted
    # average of the normalized curves (it starts at 1.0), so multiply by the
    # total investment rather than the per-stock amount.
    initial_investment = 1_000_000  # $1M total, divided equally among stocks
    per_stock_investment = initial_investment / len(backtest_results)
    portfolio_values = portfolio_values * initial_investment
    
    print("\n==== Portfolio Performance Summary ====\n")
    print(f"Number of stocks: {len(backtest_results)}")
    print(f"Average Calmar Ratio: {avg_calmar:.4f}")
    print(f"Average Sharpe Ratio: {avg_sharpe:.4f}")
    print(f"Average Sortino Ratio: {avg_sortino:.4f}")
    print(f"Average Return: {avg_return:.2f}%")
    print(f"Initial Portfolio Value: ${initial_investment:,.2f}")
    print(f"Final Portfolio Value: ${portfolio_values[-1]:,.2f}")
    print(f"Total Return: {((portfolio_values[-1]/portfolio_values[0])-1)*100:.2f}%")
else:
    print("No backtest results available for portfolio analysis.")
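
The Sharpe, Sortino, and Calmar values averaged above come from `stock_utils`, whose formulas are not shown. For reference, the sketch below computes the three ratios from a single equity curve under common conventions. The zero risk-free rate, the 252 periods per year, the use of simple returns, and the name `risk_metrics` are assumptions, so the results may not match the project's numbers.

```python
import math
import statistics


def risk_metrics(values, periods_per_year=252):
    """Sharpe, Sortino, and Calmar ratios from an equity curve.

    Assumptions for this sketch: zero risk-free rate, simple (not log)
    returns, and a curve with at least three points and nonzero volatility.
    """
    returns = [b / a - 1.0 for a, b in zip(values, values[1:])]
    mean_r = statistics.fmean(returns)
    annual_factor = math.sqrt(periods_per_year)

    sharpe = mean_r / statistics.stdev(returns) * annual_factor

    # Downside deviation: root mean square of the negative returns only
    downside = math.sqrt(sum(min(r, 0.0) ** 2 for r in returns) / len(returns))
    sortino = mean_r / downside * annual_factor if downside > 0 else float("inf")

    # Maximum drawdown: largest peak-to-trough decline along the curve
    peak, max_drawdown = values[0], 0.0
    for v in values:
        peak = max(peak, v)
        max_drawdown = max(max_drawdown, 1.0 - v / peak)

    years = len(returns) / periods_per_year
    annual_return = (values[-1] / values[0]) ** (1.0 / years) - 1.0
    calmar = annual_return / max_drawdown if max_drawdown > 0 else float("inf")

    return {
        "sharpe": sharpe,
        "sortino": sortino,
        "calmar": calmar,
        "max_drawdown": max_drawdown,
    }


curve = [100.0, 102.0, 101.0, 105.0, 103.0, 108.0]
metrics = risk_metrics(curve)
print({name: round(value, 4) for name, value in metrics.items()})
```

Applying a function like this to the combined `portfolio_values` curve gives portfolio-level ratios, which generally differ from the average of the per-stock ratios because none of the three ratios is linear in the underlying returns.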

## Visualization

Let's visualize the portfolio performance and each stock's performance.

In [None]:
# Create portfolio value visualization
if len(backtest_results) > 0:
    plt.figure(figsize=(14, 7))
    
    # Plot portfolio value
    plt.plot(portfolio_values, label='Portfolio Value', linewidth=3, color='black')
    
    # Initial investment line
    plt.axhline(y=initial_investment, color='r', linestyle='--', label='Initial Investment')
    
    plt.title('Portfolio Performance Over Time', fontsize=16)
    plt.xlabel('Time Steps', fontsize=12)
    plt.ylabel('Portfolio Value ($)', fontsize=12)
    plt.legend()
    plt.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.show()
    
    # Individual stock performance
    for ticker, res in backtest_results.items():
        # Create figure with two y-axes
        fig, ax1 = plt.subplots(figsize=(14, 7))
        
        # Plot portfolio value on left axis
        color = 'tab:blue'
        ax1.set_xlabel('Time Steps', fontsize=12)
        ax1.set_ylabel('Portfolio Value ($)', color=color, fontsize=12)
        ax1.plot(res['portfolio_value'], color=color, linewidth=2, label='Portfolio Value')
        ax1.tick_params(axis='y', labelcolor=color)
        ax1.axhline(y=res['initial_value'], color='r', linestyle='--', label='Initial Investment')
        
        # Create second y-axis for stock price
        ax2 = ax1.twinx()
        color = 'tab:orange'
        ax2.set_ylabel('Stock Price ($)', color=color, fontsize=12)
        ax2.plot(res['close_prices'], color=color, linewidth=1, alpha=0.7, label=f'{ticker} Price')
        ax2.tick_params(axis='y', labelcolor=color)
        
        # Title and legend
        plt.title(f'{ticker} - Portfolio Value vs Stock Price', fontsize=16)
        
        # Combined legend
        lines1, labels1 = ax1.get_legend_handles_labels()
        lines2, labels2 = ax2.get_legend_handles_labels()
        ax1.legend(lines1 + lines2, labels1 + labels2, loc='upper left')
        
        plt.grid(True, alpha=0.3)
        plt.tight_layout()
        plt.show()
else:
    print("No backtest results available for visualization.")

## Signal Distribution Analysis

Let's analyze the distribution of trading signals for each stock.

In [None]:
if len(stock_models) > 0:
    plt.figure(figsize=(14, 8))
    
    for i, (ticker, model) in enumerate(stock_models.items()):
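        # Get signal counts, sorted by label so the bar order is the same for
        # every stock (value_counts alone orders by frequency, which varies)
        signal_counts = model.test_data['Signal'].value_counts().sort_index()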
        
        # Create subplot
        plt.subplot(2, 2, i+1)
        signal_counts.plot(kind='bar', color=['g', 'r', 'gray'])
        plt.title(f'{ticker} Signal Distribution')
        plt.ylabel('Count')
        plt.xlabel('Signal Type')
        plt.grid(axis='y', alpha=0.3)
        
        # Add count labels
        for j, count in enumerate(signal_counts):
            plt.text(j, count + 0.5, str(count), ha='center')
    
    plt.tight_layout()
    plt.show()
else:
    print("No signal distribution available for analysis.")
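
A positional color list only matches the bars when the signal order is fixed and no class is missing from a ticker's test set. One way to make the colors follow the labels is sketched below. It assumes the `Signal` column holds the strings `BUY`, `SELL`, and `WAIT`, which the notebook does not confirm, and `ordered_counts` is a hypothetical helper.

```python
SIGNAL_COLORS = {"BUY": "g", "SELL": "r", "WAIT": "gray"}


def ordered_counts(counts, order=("BUY", "SELL", "WAIT")):
    """Return (labels, values, colors) in a fixed order.

    Signals missing from `counts` get a count of 0, so every stock's
    chart has the same bars in the same positions.
    """
    labels = list(order)
    values = [counts.get(label, 0) for label in labels]
    colors = [SIGNAL_COLORS[label] for label in labels]
    return labels, values, colors


labels, values, colors = ordered_counts({"WAIT": 120, "BUY": 40})
print(labels, values, colors)
# ['BUY', 'SELL', 'WAIT'] [40, 0, 120] ['g', 'r', 'gray']
```

With a pandas Series, the same effect could be obtained with `signal_counts.reindex(["BUY", "SELL", "WAIT"], fill_value=0)` before plotting.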

## Conclusion