<div style="text-align: center;">
    <h1>Logistic Regression-Based Stock Return Classification</h1>
    </div>
</div>

# 1. Introduction
In this assignment, we build a classification model that sorts stock returns into negative, neutral, and positive classes. The models use signals built from daily market data to predict returns over three horizons: one day, one week, and one month. Our approach is logistic regression on price-based, volume-based, technical, fundamental, and macro-economic features for 20 stocks across the technology, finance, healthcare, and energy sectors. We compare the models using accuracy, ROC-AUC, and F1-score, evaluating each across the different stocks to determine the most effective prediction horizon.

# 2. Logistic Regression Model

For a classification problem with three possible outcomes, we adopt a multinomial logistic regression model. This model estimates the probability of each class by comparing the exponentiated linear combinations of predictors for each class to the sum of these combinations across all classes. In the formula, $k$ represents a specific class, while $j$ ranges from -1 to 1 in the denominator, ensuring that the probabilities for all classes sum to one.

$$
P(Y_t = k | X_t = x_t) = \frac{e^{\alpha_k + \beta_{k1} \cdot x_{1,t} + \beta_{k2} \cdot x_{2,t} + \dots + \beta_{kn} \cdot x_{n,t}}}{\sum_{j=-1}^{1} e^{\alpha_j + \beta_{j1} \cdot x_{1,t} + \beta_{j2} \cdot x_{2,t} + \dots + \beta_{jn} \cdot x_{n,t}}}
$$

- $\beta_{ki}$: Coefficient measuring the impact of feature $x_{i,t}$ on the linear score (unnormalized log-probability) of class $k$.
- $\alpha_k$: Intercept term for class $k$, representing the baseline score when all features are zero.
- $Y_t$: Daily return class at time $t$ ($k = -1$ for negative, $0$ for neutral, $1$ for positive), based on next day’s return $R_{t+1} = \frac{\text{Price}_{t+1} - \text{Price}_t}{\text{Price}_t}$ and threshold $\tau$: $Y_t = 1$ if $R_{t+1} > \tau$, $Y_t = -1$ if $R_{t+1} < -\tau$, and $Y_t = 0$ otherwise.
- $P(Y_t = k \mid X_t = x_t)$: Probability of the return falling into class $k$, given the features observed at time $t$.
- $x_{i,t}$: Value of feature $i$ ($i = 1, \dots, n$) at time $t$; the features are listed in the table below.
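
To make the formula concrete, here is a minimal NumPy sketch that evaluates the class probabilities for a single observation. The function name `class_probabilities` and all coefficient values are made up for illustration; they are not fitted values from this assignment.

```python
import numpy as np

def class_probabilities(x, alpha, beta):
    """Multinomial logistic probabilities for classes -1, 0, 1.

    x     : feature vector of shape (n,)
    alpha : one intercept per class, shape (3,)
    beta  : one coefficient row per class, shape (3, n)
    """
    scores = alpha + beta @ x          # alpha_k + sum_i beta_ki * x_i for each class k
    scores = scores - scores.max()     # shift for numerical stability; ratios are unchanged
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()

classes = np.array([-1, 0, 1])

# Hypothetical values with two features
x_t = np.array([0.5, -1.0])
alpha = np.array([0.1, 0.3, -0.2])
beta = np.array([[0.4, 0.2],
                 [0.0, 0.0],
                 [-0.3, 0.5]])

probs = class_probabilities(x_t, alpha, beta)
predicted_class = classes[np.argmax(probs)]
print(dict(zip(classes.tolist(), probs.round(3).tolist())), predicted_class)
```

The probabilities sum to one by construction, and the predicted class is the one with the largest linear score.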


| Category                  | Feature                  | Description                                                                 | Daily Model | Weekly Model | Monthly Model |
|:--------------------------|:-------------------------|:----------------------------------------------------------------------------|:------------|:-------------|:--------------|
| **Price-Based Features**  | Daily Returns            | The percentage change in stock price from one day to the next.              | -           | -            | -             |
|                           | Moving Averages          | Simple Moving Average (SMA) and Exponential Moving Average (EMA) over different periods. We chose 5-day, 10-day, and 20-day for daily; 10-day, 20-day, and 50-day for weekly; 20-day, 50-day, and 100-day for monthly SMA and EMA. | X           | X            | X             |
|                           | Volatility               | Standard deviation of daily returns over a certain period. We chose 20-day for daily and weekly; 50-day for monthly volatilities.                 | X           | X            | X             |
|                           | Relative Strength Index (RSI) | Measures the speed and change of price movements. 14-day is the traditional default for daily, weekly, and monthly momentum.                           | X           | X            | X             |
|                           | Bollinger Bands          | Uses moving averages and standard deviations to identify overbought or oversold conditions. We chose 20-day SMA with 2 std devs for daily; 50-day with 2 std devs for weekly; and 100-day with 2 std devs for monthly. | X           | X            | X             |
| **Volume-Based Features** | Trading Volume           | The number of shares traded in a given period.                              | X           | X            | X             |
|                           | Volume Moving Averages   | Similar to price moving averages but applied to trading volume. 5-day, 10-day, and 20-day for daily; 10-day, 20-day, and 50-day for weekly; 20-day, 50-day, and 100-day for monthly.            | X           | X            | X             |
|                           | On-Balance Volume (OBV)  | A cumulative total of volume that adds or subtracts volume based on price movement. | X           | X            | X             |
| **Technical Indicators**  | Moving Average Convergence Divergence (MACD) | Shows the relationship between two moving averages of a stock’s price. 12-day EMA, 26-day EMA with a 9-day signal line is the standard setting for daily, weekly, and monthly momentum.      | X           | X            | X             |
|                           | Stochastic Oscillator    | Compares a particular closing price of a security to a range of its prices over a certain period. 14-day is the conventional period optimized for daily, weekly, and monthly price range comparisons. | X           | X            | X             |
|                           | Average True Range (ATR) | Measures market volatility by decomposing the entire range of an asset price for that period. We chose 14-day for daily and weekly; 50-day for monthly volatility. | X           | X            | X             |
| **Fundamental Features**  | Earnings Reports         | Quarterly earnings, earnings per share (EPS), and revenue growth.           |             | X            | X             |
|                           | Financial Ratios         | Price-to-Earnings (P/E) ratio, Price-to-Book (P/B) ratio, and Debt-to-Equity ratio. |             | X            | X             |
|                           | Dividend Yield           | The dividend income relative to the stock price.                            |             |              | X             |
| **Macro-Economic Indicators** | Interest Rates       | Changes in interest rates can affect stock prices.                          |             | X            | X             |
|                           | Economic Indicators      | GDP growth rate, unemployment rate, and inflation rate.                     |             |              | X             |
| **Time-Based Features**   | Day of the Week          | Some stocks exhibit patterns based on the day of the week.                  | X           |              |               |
|                           | Seasonality              | Monthly or quarterly trends. Month of the year for monthly model.           |             | X            | X             |

## 2.1 Daily Model
For the daily model, we aim to predict the next day’s stock return class using short-term price and volume signals in a multinomial logistic regression framework. We selected features like Moving Averages, Volatility, RSI, Bollinger Bands, Trading Volume, Volume Moving Averages, OBV, MACD, Stochastic Oscillator, ATR, and Day of the Week, while excluding the following variables for these reasons:  
- **Earnings Reports**: Quarterly data, not daily, making it irrelevant for day-to-day predictions.  
- **Financial Ratios**: Updated infrequently (e.g., quarterly), less impactful on daily returns.  
- **Dividend Yield**: Static over short periods, more suited for long-term analysis.  
- **Interest Rates**: Not daily-updated, better for longer horizons like weekly or monthly.  
- **Economic Indicators**: Broad, slow-changing metrics (e.g., GDP), not specific to daily movements.  
- **Seasonality**: Monthly/quarterly trends are too coarse; Day of the Week covers daily time effects.

## 2.2 Weekly Model
For the weekly model, we aim to predict the weekly stock return class using medium-term price, volume, and select fundamental signals in a multinomial logistic regression framework. We selected features like Moving Averages, Volatility, RSI, Bollinger Bands, Trading Volume, Volume Moving Averages, OBV, MACD, Stochastic Oscillator, ATR, Earnings Reports, Financial Ratios, Interest Rates, and Seasonality (Week of the Year), while excluding the following variables for these reasons:

- **Dividend Yield**: Static over medium periods, more relevant for long-term analysis.
- **Economic Indicators**: Broad, slow-changing metrics (e.g., GDP), better suited for monthly or longer horizons.
- **Day of the Week**: Too granular for weekly predictions; Seasonality captures broader time effects.

## 2.3 Monthly Model
For the monthly model, we aim to predict the monthly stock return class using long-term price, volume, fundamental, and macro-economic signals in a multinomial logistic regression framework. We selected features like Moving Averages, Volatility, RSI, Bollinger Bands, Trading Volume, Volume Moving Averages, OBV, MACD, Stochastic Oscillator, ATR, Earnings Reports, Financial Ratios, Dividend Yield, Interest Rates, Economic Indicators, and Seasonality (Month of the Year), while excluding the following variable for this reason:  
- **Day of the Week**: Too short-term and granular for monthly predictions; Seasonality captures broader time effects.

<small>**Note:** We excluded daily returns as a feature in all three models because they are closely tied to the target, which risks data leakage, and their predictive information is already captured by moving averages and momentum indicators such as RSI and MACD.</small>

In [66]:
!pip install -U yfinance pandas numpy ta
!pip install pandas==2.1.4 --force-reinstall
!pip install fredapi

import yfinance as yf
import pandas as pd
import numpy as np
from ta.momentum import RSIIndicator, StochasticOscillator
from ta.volume import OnBalanceVolumeIndicator
from ta.trend import MACD
from ta.volatility import BollingerBands, AverageTrueRange

from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, roc_auc_score
import warnings
warnings.filterwarnings('ignore')

from fredapi import Fred

from IPython.display import display, HTML

# 3. Data Overview

## 3.1 Data Source

We used the `yfinance` Python library to retrieve daily OHLCV (Open, High, Low, Close, Volume) data. The table below gives a high-level overview of how we computed the features.

| Category                  | Feature                  | General Computation Method                                                                 |
|:--------------------------|:-------------------------|:------------------------------------------------------------------------------------------------------------------------------------|
| **Price-Based Features**  | Daily Returns            | - |
|                           | Moving Averages          | SMA and EMA calculated using `pandas` `rolling.mean` and `ewm` on `Close`. Periods: 5, 10, 20 (Daily); 10, 20, 50 (Weekly); 20, 50, 100 (Monthly). |
|                           | Volatility               | Standard deviation of daily returns from `Close` price changes using `pandas` `rolling.std`. Periods: 20-day (Daily, Weekly); 50-day (Monthly). |
|                           | Relative Strength Index (RSI) | 14-day RSI from `Close` using average gains/losses over 14 days via `ta` library or `pandas` (consistent across Daily, Weekly, Monthly). |
|                           | Bollinger Bands          | SMA of `Close` ± 2 × standard deviation of `Close` using `pandas` `rolling`. Periods: 20-day SMA, 2 std dev (Daily); 50-day (Weekly); 100-day (Monthly). |
| **Volume-Based Features** | Trading Volume           | Direct from `yfinance` `Volume` column, aggregated as sum per period (Daily, Weekly, Monthly).                |
|                           | Volume Moving Averages   | SMA of `Volume` using `pandas` `rolling.mean`. Periods: 5, 10, 20 (Daily); 10, 20, 50 (Weekly); 20, 50, 100 (Monthly). |
|                           | On-Balance Volume (OBV)  | Cumulative sum of `Volume`, adding if `Close` rises, subtracting if it falls, computed with `pandas` (consistent across horizons). |
| **Technical Indicators**  | Moving Average Convergence Divergence (MACD) | 12-day EMA - 26-day EMA of `Close`, with 9-day EMA signal line, using `pandas` `ewm` (consistent across Daily, Weekly, Monthly). |
|                           | Stochastic Oscillator    | 14-day %K from `High`, `Low`, `Close`: $\%K = 100 \times \frac{\text{Close} - \text{Low}_{14}}{\text{High}_{14} - \text{Low}_{14}}$ (consistent across horizons). |
|                           | Average True Range (ATR) | 14-day average of true range (max of `High - Low`, adjusted for gaps) using `pandas` for Daily/Weekly; 50-day for Monthly. |
| **Fundamental Features**  | Earnings Reports         | Quarterly EPS retrieved from `yfinance` `get_earnings_dates()`, nearest quarter applied and forward-filled within quarter for Weekly/Monthly. |
|                           | Financial Ratios         | P/E ratio from `yfinance` `info` (`trailingPE`) for Weekly/Monthly; P/B and Debt-to-Equity via `yfinance` `info` if available. |
|                           | Dividend Yield           | Retrieved from `yfinance` `info` (`dividendYield`) for Monthly model. |
| **Macro-Economic Indicators** | Interest Rates       | Retrieved from FRED API (`FEDFUNDS`) series, resampled to weekly/monthly and forward-filled. |
|                           | Economic Indicators      | GDP growth rate, unemployment rate, inflation rate retrieved from FRED API for Monthly model. |
| **Time-Based Features**   | Day of the Week          | Extracted as 0-4 (Mon-Fri) from `yfinance` date index using `pandas` `dayofweek` for Daily model. |
|                           | Seasonality              | Week of the year for Weekly (via `pandas` `isocalendar().week`); Month of the year for Monthly (via `pandas` `month`). |
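
As a rough illustration of the `pandas` computations summarized in the table, the sketch below builds a few of the daily features on synthetic prices. The random series and the variable names are assumptions made for this example, not the assignment's data.

```python
import numpy as np
import pandas as pd

# Synthetic daily prices (hypothetical data, for illustration only)
rng = np.random.default_rng(0)
n = 60
close = pd.Series(100 + rng.normal(0, 1, n).cumsum())
high = close + rng.uniform(0.1, 1.0, n)
low = close - rng.uniform(0.1, 1.0, n)

# Moving averages of the close
sma_5 = close.rolling(window=5).mean()
ema_5 = close.ewm(span=5, adjust=False).mean()

# Volatility: rolling standard deviation of daily returns
volatility_20 = close.pct_change().rolling(window=20).std()

# Stochastic oscillator %K over a 14-bar window
low_14 = low.rolling(window=14).min()
high_14 = high.rolling(window=14).max()
stoch_k = 100 * (close - low_14) / (high_14 - low_14)

features = pd.DataFrame({'SMA_5': sma_5, 'EMA_5': ema_5,
                         'Volatility_20': volatility_20, 'Stoch_%K': stoch_k})
print(features.dropna().head())
```

Each rolling feature is undefined until its window has filled, which is why the leading rows of every stock are later removed with `dropna`.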


## 3.2 Stock Data

We selected 20 stocks across four industries — Technology, Finance, Healthcare, and Energy — for analysis. The stock data spans from November 2020 to December 2023 for the daily and weekly models. For the monthly model, to ensure a sufficient dataset for the train-test split given the longer horizon, we extended the data range from 2010 to 2023.

| Technology           | Finance             | Healthcare          | Energy             |
|:---------------------|:--------------------|:--------------------|:-------------------|
| AAPL (Apple)         | JPM (JPMorgan Chase)| JNJ (Johnson & Johnson) | XOM (ExxonMobil)   |
| MSFT (Microsoft)     | BAC (Bank of America)| PFE (Pfizer)       | CVX (Chevron)      |
| GOOGL (Alphabet)     | MS (Morgan Stanley)   | MRK (Merck)         | SLB (Schlumberger) |
| AMZN (Amazon)        | GS (Goldman Sachs)  | LLY (Eli Lilly)     | COP (ConocoPhillips)|
| NVDA (NVIDIA)        | C (Citigroup)       | ABBV (AbbVie)       | EOG (EOG Resources)|


## 3.3 τ Value

We set τ = 0.01 (1%), a common threshold in finance for daily returns. Because stocks often fluctuate within ±1% on a given day, this threshold captures significant moves without classifying ordinary noise as positive or negative.
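
The thresholding rule can be sketched as a small helper. The name `label_returns` and the toy prices are hypothetical; unlike a bare `np.where`, this version leaves the label undefined when no next-period return exists (the last observation), so that row can be dropped instead of being counted as neutral.

```python
import numpy as np
import pandas as pd

def label_returns(close, tau=0.01):
    """Map next-period returns to classes -1, 0, 1 using threshold tau."""
    next_ret = close.pct_change().shift(-1)   # R_{t+1}, aligned to row t
    labels = pd.Series(
        np.select([next_ret > tau, next_ret < -tau], [1.0, -1.0], default=0.0),
        index=close.index,
    )
    # Keep NaN where the next return is unknown instead of labelling it neutral
    return labels.where(next_ret.notna())

# Hypothetical closing prices
close = pd.Series([100.0, 102.0, 101.5, 99.0, 99.5])
labels = label_returns(close)
print(labels.tolist())
```

With these prices the next-day returns are roughly +2.0%, -0.5%, -2.5%, and +0.5%, giving the classes 1, 0, -1, 0, and the final row stays `NaN`.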

# 4. Data Preparation 

## 4.1 Daily Model Dataset

In [67]:
# Define the 20 stocks
stocks = [
    'AAPL', 'MSFT', 'GOOGL', 'AMZN', 'NVDA',  # Technology
    'JPM', 'BAC', 'MS', 'GS', 'C',            # Finance
    'JNJ', 'PFE', 'MRK', 'LLY', 'ABBV',       # Healthcare
    'XOM', 'CVX', 'SLB', 'COP', 'EOG'         # Energy
]

# Pull OHLCV data from yfinance
start_date = '2020-11-01'
end_date = '2023-12-31'
try:
    data = yf.download(stocks, start=start_date, end=end_date, group_by='ticker')
    print("Data downloaded. Shape:", data.shape)
    if data.empty:
        raise ValueError("Downloaded data is empty.")
except Exception as e:
    print(f"Error downloading data: {e}")
    raise

# Initialize an empty list to store DataFrames
all_dfs = []

# Process each stock
for stock in stocks:
    try:
        if stock not in data.columns.levels[0]:
            print(f"Warning: No data for {stock} - skipping.")
            continue
        df = data[stock].copy()
        if df.empty or df['Close'].isna().all():
            print(f"Warning: Empty or all-NaN data for {stock} - skipping.")
            continue
        df['Stock'] = stock
        df = df.reset_index()
        
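        # Next-day return R_{t+1}: each row holds the percentage change to the following close
        df['Returns'] = df['Close'].pct_change().shift(-1)
        # Label with tau = 0.01. The final row has no next-day return (NaN), so both
        # comparisons are False and np.where assigns it class 0.
        df['Y'] = np.where(df['Returns'] > 0.01, 1,
                           np.where(df['Returns'] < -0.01, -1, 0))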
        
        # Price-Based Features
        df['SMA_5'] = df['Close'].rolling(window=5).mean()
        df['SMA_10'] = df['Close'].rolling(window=10).mean()
        df['SMA_20'] = df['Close'].rolling(window=20).mean()
        df['EMA_5'] = df['Close'].ewm(span=5, adjust=False).mean()
        df['EMA_10'] = df['Close'].ewm(span=10, adjust=False).mean()
        df['EMA_20'] = df['Close'].ewm(span=20, adjust=False).mean()
        df['Volatility_20'] = df['Close'].pct_change().rolling(window=20).std()
        rsi = RSIIndicator(close=df['Close'], window=14)
        df['RSI_14'] = rsi.rsi()
        bb = BollingerBands(close=df['Close'], window=20, window_dev=2)
        df['BB_Upper'] = bb.bollinger_hband()
        df['BB_Lower'] = bb.bollinger_lband()
        
        # Volume-Based Features
        df['Volume_SMA_5'] = df['Volume'].rolling(window=5).mean()
        df['Volume_SMA_10'] = df['Volume'].rolling(window=10).mean()
        df['Volume_SMA_20'] = df['Volume'].rolling(window=20).mean()
        df['OBV'] = np.where(df['Close'] > df['Close'].shift(1), df['Volume'],
                             np.where(df['Close'] < df['Close'].shift(1), -df['Volume'], 0)).cumsum()
        
        # Technical Indicators
        macd = MACD(close=df['Close'], window_slow=26, window_fast=12, window_sign=9)
        df['MACD'] = macd.macd()
        df['MACD_Signal'] = macd.macd_signal()
        stoch = StochasticOscillator(high=df['High'], low=df['Low'], close=df['Close'], window=14)
        df['Stoch_%K'] = stoch.stoch()
        atr = AverageTrueRange(high=df['High'], low=df['Low'], close=df['Close'], window=14)
        df['ATR_14'] = atr.average_true_range()
        
        # Time-Based Features
        df['Day_of_Week'] = df['Date'].dt.dayofweek
        
        # Check for unexpected NaNs after initial 35 days (MACD_Signal needs 26 + 9)
        feature_cols = ['SMA_5', 'SMA_10', 'SMA_20', 'EMA_5', 'EMA_10', 'EMA_20',
                        'Volatility_20', 'RSI_14', 'BB_Upper', 'BB_Lower', 'Volume',
                        'Volume_SMA_5', 'Volume_SMA_10', 'Volume_SMA_20', 'OBV',
                        'MACD', 'MACD_Signal', 'Stoch_%K', 'ATR_14', 'Day_of_Week']
        if len(df) > 35:  # Enough rows for MACD_Signal
            df_after_window = df.iloc[35:]
            nan_cols = df_after_window[feature_cols].isna().any()
            if nan_cols.any():
                print(f"Warning: Unexpected NaN values in {stock} after initial 35 days in columns: {nan_cols[nan_cols].index.tolist()}")
        
        # Select final columns
        final_cols = ['Date', 'Stock', 'Y'] + feature_cols
        df = df[final_cols]
        
        all_dfs.append(df)
    except Exception as e:
        print(f"Error processing {stock}: {e}")
        continue

if not all_dfs:
    raise ValueError("No stocks were successfully processed.")

# Concatenate all stock DataFrames
try:
    daily_model_data = pd.concat(all_dfs, ignore_index=True)
    print("DataFrame concatenated. Shape before dropna:", daily_model_data.shape)
except Exception as e:
    print(f"Error concatenating DataFrames: {e}")
    raise

# Drop rows with NaN values
daily_model_data = daily_model_data.dropna()
print("DataFrame after dropna. Shape:", daily_model_data.shape)

# Validate the final DataFrame
if daily_model_data.empty:
    raise ValueError("Final DataFrame is empty after dropping NaNs.")
if daily_model_data['Y'].isna().any():
    print("Warning: NaN values found in Y column - this should not happen.")

# Save to CSV
try:
    daily_model_data.to_csv('daily_model_data.csv', index=False)
    print("Dataset saved as 'daily_model_data.csv'")
except Exception as e:
    print(f"Error saving to CSV: {e}")
    raise

# Display the first few rows and shape
print(daily_model_data.head())
print(f"Final Dataset Shape: {daily_model_data.shape}")

[*********************100%***********************]  20 of 20 completed


Data downloaded. Shape: (795, 100)
DataFrame concatenated. Shape before dropna: (15900, 23)
DataFrame after dropna. Shape: (15240, 23)
Dataset saved as 'daily_model_data.csv'
Price       Date Stock  Y       SMA_5      SMA_10      SMA_20       EMA_5  \
33    2020-12-18  AAPL  1  123.627269  121.940367  119.040793  123.619508   
34    2020-12-21  AAPL  1  124.887314  122.377964  119.572650  124.163880   
35    2020-12-22  AAPL  0  125.668741  123.110553  120.453219  125.715221   
36    2020-12-23  AAPL  0  126.284113  124.007236  121.224389  126.449892   
37    2020-12-24  AAPL  1  126.922932  124.859969  122.002884  127.268536   

Price      EMA_10      EMA_20  Volatility_20  ...     Volume  Volume_SMA_5  \
33     122.190325  119.879217       0.017857  ...  192541500   124307620.0   
34     122.747107  120.390970       0.017629  ...  121251600   132721040.0   
35     123.850888  121.193535       0.016426  ...  168904800   135053260.0   
36     124.590587  121.834078       0.016705  ... 

## 4.2 Weekly Model Dataset

In [68]:
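# Initialize FRED API; the key is read from the FRED_API_KEY environment variable
# (set it before running) rather than being hardcoded in the notebook
import os
fred = Fred(api_key=os.environ['FRED_API_KEY'])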

# Define stocks
stocks = ['AAPL', 'MSFT', 'GOOGL', 'AMZN', 'NVDA', 'JPM', 'BAC', 'MS', 'GS', 'C',
          'JNJ', 'PFE', 'MRK', 'LLY', 'ABBV', 'XOM', 'CVX', 'SLB', 'COP', 'EOG']

# Pull OHLCV data (daily, then resample to weekly)
try:
    data = yf.download(stocks, start='2020-11-01', end='2023-12-31', group_by='ticker')
    print("Data downloaded. Shape:", data.shape)
    if data.empty:
        raise ValueError("Downloaded data is empty.")
except Exception as e:
    print(f"Error downloading data: {e}")
    raise

# Fetch interest rates from FRED
try:
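    # fredapi's get_series takes observation_start/observation_end; no end bound is set so
    # that the forward-fill below covers every week through the end of the stock sample
    interest_rates = fred.get_series('FEDFUNDS', observation_start='2020-11-01')
    interest_rates = interest_rates.resample('W').ffill()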
except Exception as e:
    print(f"Error fetching interest rates: {e}")
    raise

# Initialize list for DataFrames
all_dfs = []

# Process each stock
for stock in stocks:
    try:
        if stock not in data.columns.levels[0]:
            print(f"Warning: No data for {stock} - skipping.")
            continue
        df = data[stock].copy()
        df['Stock'] = stock
        df = df.reset_index()
        
        # Resample to weekly
        df_weekly = df.resample('W', on='Date').agg({'Open': 'first', 'High': 'max', 
                                                     'Low': 'min', 'Close': 'last', 
                                                     'Volume': 'sum', 'Stock': 'first'}).reset_index()
        
        # Returns and Y (weekly)
        df_weekly['Returns'] = df_weekly['Close'].pct_change().shift(-1)
        df_weekly['Y'] = np.where(df_weekly['Returns'] > 0.01, 1,
                                 np.where(df_weekly['Returns'] < -0.01, -1, 0))
        
        # Price-Based Features
        df_weekly['SMA_10'] = df_weekly['Close'].rolling(window=10).mean()
        df_weekly['SMA_20'] = df_weekly['Close'].rolling(window=20).mean()
        df_weekly['SMA_50'] = df_weekly['Close'].rolling(window=50).mean()
        df_weekly['EMA_10'] = df_weekly['Close'].ewm(span=10, adjust=False).mean()
        df_weekly['EMA_20'] = df_weekly['Close'].ewm(span=20, adjust=False).mean()
        df_weekly['EMA_50'] = df_weekly['Close'].ewm(span=50, adjust=False).mean()
        df_weekly['Volatility_20'] = df_weekly['Close'].pct_change().rolling(window=20).std()
        
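        # RSI (simple-average variant): average gain divided by average absolute change
        # over 14 weekly bars, scaled to 0-100
        df_weekly['RSI_14'] = (df_weekly['Close'].diff().clip(lower=0).rolling(14).mean() /
                              df_weekly['Close'].diff().abs().rolling(14).mean() * 100)
        
        # Bollinger Bands (50-bar window on the weekly series, 2 std dev)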
        bb = BollingerBands(close=df_weekly['Close'], window=50, window_dev=2)
        df_weekly['BB_Upper'] = bb.bollinger_hband()
        df_weekly['BB_Lower'] = bb.bollinger_lband()
        
        # Volume-Based Features ('Volume' already holds the weekly sum from the resample above)
        df_weekly['Volume_SMA_10'] = df_weekly['Volume'].rolling(window=10).mean()
        df_weekly['Volume_SMA_20'] = df_weekly['Volume'].rolling(window=20).mean()
        df_weekly['Volume_SMA_50'] = df_weekly['Volume'].rolling(window=50).mean()
        df_weekly['OBV'] = OnBalanceVolumeIndicator(close=df_weekly['Close'], volume=df_weekly['Volume']).on_balance_volume()
        
        # Technical Indicators
        macd = MACD(close=df_weekly['Close'], window_slow=26, window_fast=12, window_sign=9)
        df_weekly['MACD'] = macd.macd()
        df_weekly['MACD_Signal'] = macd.macd_signal()
        stoch = StochasticOscillator(high=df_weekly['High'], low=df_weekly['Low'], close=df_weekly['Close'], window=14)
        df_weekly['Stoch_%K'] = stoch.stoch()
        atr = AverageTrueRange(high=df_weekly['High'], low=df_weekly['Low'], close=df_weekly['Close'], window=14)
        df_weekly['ATR_14'] = atr.average_true_range()
        
        # Earnings data (using Reported EPS with nearest quarter)
        ticker = yf.Ticker(stock)
        try:
            earnings_dates = ticker.get_earnings_dates(limit=16)  # Last 4 years (~16 quarters)
            if isinstance(earnings_dates, pd.DataFrame) and not earnings_dates.empty:
                earnings_dates = earnings_dates.reset_index()
                earnings_dates['Date'] = pd.to_datetime(earnings_dates['Earnings Date']).dt.tz_localize(None)
                earnings_dates['EPS'] = earnings_dates['Reported EPS'].fillna(0)

                # Sort earnings dates for nearest merge
                earnings_dates = earnings_dates.sort_values('Date')

                # Merge with weekly data using nearest date
                df_weekly = pd.merge_asof(df_weekly, earnings_dates[['Date', 'EPS']],
                                          on='Date', direction='nearest',
                                          suffixes=('', '_earnings'))

                # Forward-fill within quarter and handle initial NaNs
                df_weekly['Quarter'] = df_weekly['Date'].dt.to_period('Q')
                df_weekly['EPS'] = df_weekly.groupby('Quarter')['EPS'].ffill().fillna(0)
            else:
                # No earnings history returned: fall back to 0 so the EPS column always exists
                df_weekly['EPS'] = 0

        except Exception as e:
            print(f"Warning: Failed to fetch earnings for {stock}: {e}")
            df_weekly['EPS'] = 0
        
        # Financial ratios
        info = ticker.info
        df_weekly['PE_Ratio'] = info.get('trailingPE', np.nan)
        
        # Interest rates
        df_weekly = df_weekly.merge(interest_rates.rename('Interest_Rate'), 
                                   left_on='Date', right_index=True, how='left')
        
        # Seasonality
        df_weekly['Week_of_Year'] = df_weekly['Date'].dt.isocalendar().week
        seasonality = df_weekly.groupby('Week_of_Year')['Returns'].mean().to_dict()
        df_weekly['Seasonality'] = df_weekly['Week_of_Year'].map(seasonality)
        
        # Check for NaNs after initial window (e.g., ATR_14 needs 14 weeks)
        feature_cols = ['SMA_10', 'SMA_20', 'SMA_50', 'EMA_10', 'EMA_20', 'EMA_50',
                        'Volatility_20', 'RSI_14', 'BB_Upper', 'BB_Lower', 'Volume',
                        'Volume_SMA_10', 'Volume_SMA_20', 'Volume_SMA_50', 'OBV',
                        'MACD', 'MACD_Signal', 'Stoch_%K', 'ATR_14', 'EPS',
                        'PE_Ratio', 'Interest_Rate', 'Seasonality']
        if len(df_weekly) > 50:  # Use 50 for BB and MACD
            df_after_window = df_weekly.iloc[50:]
            nan_cols = df_after_window[feature_cols].isna().any()
            if nan_cols.any():
                print(f"Warning: Unexpected NaN values in {stock} after initial 50 weeks in columns: {nan_cols[nan_cols].index.tolist()}")
        
        # Select final columns and drop temporary Quarter column
        final_cols = ['Date', 'Stock', 'Y'] + feature_cols
        df_weekly = df_weekly[final_cols]
        
        all_dfs.append(df_weekly)
    except Exception as e:
        print(f"Error processing {stock}: {e}")
        continue

if not all_dfs:
    raise ValueError("No stocks were successfully processed.")

# Concatenate and clean
try:
    weekly_model_data = pd.concat(all_dfs, ignore_index=True)
    print("DataFrame concatenated. Shape before dropna:", weekly_model_data.shape)
except Exception as e:
    print(f"Error concatenating DataFrames: {e}")
    raise

# Drop rows with NaN values (only for technical indicators, not EPS)
weekly_model_data = weekly_model_data.dropna(subset=[col for col in feature_cols if col != 'EPS'])
print("DataFrame after dropna (excluding EPS). Shape:", weekly_model_data.shape)

# Fill EPS NaNs with 0 (though nearest/quarterly fill should handle this)
weekly_model_data['EPS'] = weekly_model_data['EPS'].fillna(0)

# Validate
if weekly_model_data.empty:
    raise ValueError("Final DataFrame is empty after dropping NaNs.")
if weekly_model_data['Y'].isna().any():
    print("Warning: NaN values found in Y column - this should not happen.")

# Save to CSV
try:
    weekly_model_data.to_csv('weekly_model_data.csv', index=False)
    print("Dataset saved as 'weekly_model_data.csv'")
except Exception as e:
    print(f"Error saving to CSV: {e}")
    raise

# Display
print(weekly_model_data.head())
print(f"Final Dataset Shape: {weekly_model_data.shape}")

[*********************100%***********************]  20 of 20 completed


Data downloaded. Shape: (795, 100)
DataFrame concatenated. Shape before dropna: (3300, 26)
DataFrame after dropna (excluding EPS). Shape: (2320, 26)
Dataset saved as 'weekly_model_data.csv'
         Date Stock  Y      SMA_10      SMA_20      SMA_50      EMA_10  \
49 2021-10-17  AAPL  1  144.509744  140.311485  130.608782  142.518569   
50 2021-10-24  AAPL  0  144.469508  141.439489  131.208499  143.137008   
51 2021-10-31  AAPL  1  144.627513  142.550428  131.818868  143.841068   
52 2021-11-07  AAPL  0  144.912186  143.592443  132.500126  144.720578   
53 2021-11-14  AAPL  1  144.510699  144.441234  133.170679  145.209670   

        EMA_20      EMA_50  Volatility_20  ...  Volume_SMA_50         OBV  \
49  139.935909  132.660689       0.022034  ...    447764920.0  3331423600   
50  140.505821  133.180661       0.022427  ...    442387310.0  3672114900   
51  141.125203  133.722962       0.022417  ...    438450552.0  4064854900   
52  141.844553  134.309449       0.022116  ...    4371422

## 4.3 Monthly Model Dataset

In [74]:
# Define stocks
stocks = ['AAPL', 'MSFT', 'GOOGL', 'AMZN', 'NVDA', 'JPM', 'BAC', 'MS', 'GS', 'C',
          'JNJ', 'PFE', 'MRK', 'LLY', 'ABBV', 'XOM', 'CVX', 'SLB', 'COP', 'EOG']

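# Initialize the FRED API client. The key is read from an environment variable
# (FRED_API_KEY is an assumed name) rather than hard-coded in the notebook.
import os
fred = Fred(api_key=os.environ['FRED_API_KEY'])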

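# Fetch economic indicators (start date moved back to 2010).
# fredapi's get_series takes observation_start / observation_end.
# Note: 'GDP' and 'CPIAUCSL' are level series (nominal GDP and the CPI index),
# not growth rates; the variable names below are kept as in the original.
try:
    gdp_growth = fred.get_series('GDP', observation_start='2010-01-01', observation_end='2023-12-31').resample('M').ffill()
    unemployment_rate = fred.get_series('UNRATE', observation_start='2010-01-01', observation_end='2023-12-31').resample('M').ffill()
    inflation_rate = fred.get_series('CPIAUCSL', observation_start='2010-01-01', observation_end='2023-12-31').resample('M').ffill()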
except Exception as e:
    print(f"Error fetching economic indicators: {e}")
    raise

# Combine economic indicators into a single DataFrame
economic_indicators = pd.DataFrame({
    'GDP_Growth': gdp_growth,
    'Unemployment_Rate': unemployment_rate,
    'Inflation_Rate': inflation_rate
})

# Fetch interest rates (effective federal funds rate)
try:
    interest_rates = fred.get_series('FEDFUNDS', observation_start='2010-01-01', observation_end='2023-12-31').resample('M').ffill()
except Exception as e:
    print(f"Error fetching interest rates: {e}")
    raise

# Pull OHLCV data (daily, then resample to monthly)
try:
    data = yf.download(stocks, start='2010-01-01', end='2023-12-31', group_by='ticker')
    print("Data downloaded. Shape:", data.shape)
    if data.empty:
        raise ValueError("Downloaded data is empty.")
except Exception as e:
    print(f"Error downloading data: {e}")
    raise

# Initialize list for DataFrames
all_dfs = []

# Process each stock
for stock in stocks:
    try:
        if stock not in data.columns.levels[0]:
            print(f"Warning: No data for {stock} - skipping.")
            continue
        df = data[stock].copy()
        df['Stock'] = stock
        df = df.reset_index()
        
        # Resample to monthly
        df_monthly = df.resample('M', on='Date').agg({'Open': 'first', 'High': 'max', 
                                                      'Low': 'min', 'Close': 'last', 
                                                      'Volume': 'sum', 'Stock': 'first'}).reset_index()
        
        # Ensure enough data points
        if len(df_monthly) < 20:
            print(f"Warning: Not enough data for {stock} - skipping.")
            continue
        
        # Returns and Y (monthly)
        df_monthly['Returns'] = df_monthly['Close'].pct_change().shift(-1)
        df_monthly['Y'] = np.where(df_monthly['Returns'] > 0.01, 1,
                                   np.where(df_monthly['Returns'] < -0.01, -1, 0))
        
        # Price-Based Features
        df_monthly['SMA_20'] = df_monthly['Close'].rolling(window=20).mean()
        df_monthly['SMA_50'] = df_monthly['Close'].rolling(window=50).mean()
        df_monthly['SMA_100'] = df_monthly['Close'].rolling(window=100).mean()
        df_monthly['EMA_20'] = df_monthly['Close'].ewm(span=20, adjust=False).mean()
        df_monthly['EMA_50'] = df_monthly['Close'].ewm(span=50, adjust=False).mean()
        df_monthly['EMA_100'] = df_monthly['Close'].ewm(span=100, adjust=False).mean()
        df_monthly['Volatility_50'] = df_monthly['Close'].pct_change().rolling(window=50).std()
        
        # RSI
        df_monthly['RSI_14'] = (df_monthly['Close'].diff().clip(lower=0).rolling(14).mean() /
                                df_monthly['Close'].diff().abs().rolling(14).mean() * 100)
        
        # Bollinger Bands (100-month window on the resampled data, 2 std dev)
        bb = BollingerBands(close=df_monthly['Close'], window=100, window_dev=2)
        df_monthly['BB_Upper'] = bb.bollinger_hband()
        df_monthly['BB_Lower'] = bb.bollinger_lband()
        
        # Volume-Based Features
        # 'Volume' already holds the monthly sum from the resample step above
        df_monthly['Volume_SMA_20'] = df_monthly['Volume'].rolling(window=20).mean()
        df_monthly['Volume_SMA_50'] = df_monthly['Volume'].rolling(window=50).mean()
        df_monthly['Volume_SMA_100'] = df_monthly['Volume'].rolling(window=100).mean()
        df_monthly['OBV'] = OnBalanceVolumeIndicator(close=df_monthly['Close'], volume=df_monthly['Volume']).on_balance_volume()
        
        # Technical Indicators
        macd = MACD(close=df_monthly['Close'], window_slow=26, window_fast=12, window_sign=9)
        df_monthly['MACD'] = macd.macd()
        df_monthly['MACD_Signal'] = macd.macd_signal()
        stoch = StochasticOscillator(high=df_monthly['High'], low=df_monthly['Low'], close=df_monthly['Close'], window=14)
        df_monthly['Stoch_%K'] = stoch.stoch()
        atr = AverageTrueRange(high=df_monthly['High'], low=df_monthly['Low'], close=df_monthly['Close'], window=50)
        df_monthly['ATR_50'] = atr.average_true_range()
        
        # Earnings data (using Reported EPS with nearest quarter)
        ticker = yf.Ticker(stock)
        try:
            # 14 years is about 56 quarters; limit=168 is kept as a generous upper bound
            earnings_dates = ticker.get_earnings_dates(limit=168)
            if isinstance(earnings_dates, pd.DataFrame) and not earnings_dates.empty:
                earnings_dates = earnings_dates.reset_index()
                earnings_dates['Date'] = pd.to_datetime(earnings_dates['Earnings Date']).dt.tz_localize(None)
                earnings_dates['EPS'] = earnings_dates['Reported EPS'].fillna(0)

                # Sort earnings dates for nearest merge
                earnings_dates = earnings_dates.sort_values('Date')

                # Merge with monthly data using nearest date.
                # Caution: direction='nearest' can match a report dated after the month end
                # (look-ahead); direction='backward' would use only past reports.
                df_monthly = pd.merge_asof(df_monthly, earnings_dates[['Date', 'EPS']],
                                           on='Date', direction='nearest',
                                           suffixes=('', '_earnings'))

                # Forward-fill within quarter and handle initial NaNs
                df_monthly['Quarter'] = df_monthly['Date'].dt.to_period('Q')
                df_monthly['EPS'] = df_monthly.groupby('Quarter')['EPS'].ffill().fillna(0)
            else:
                # No earnings history returned: set EPS to 0 so the column always exists
                df_monthly['EPS'] = 0
        except Exception as e:
            print(f"Warning: Failed to fetch earnings for {stock}: {e}")
            df_monthly['EPS'] = 0
        
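        # Financial ratios and dividend yield
        # Note: ticker.info['trailingPE'] is a current snapshot, so the same value is
        # assigned to every historical month of this stock.
        info = ticker.info
        df_monthly['PE_Ratio'] = info.get('trailingPE', np.nan)
        # dividend_yields is not defined in this cell; it is assumed to come from an earlier cell
        df_monthly['Dividend_Yield'] = dividend_yields.get(stock, 0)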
        
        # Interest rates and economic indicators
        df_monthly = df_monthly.merge(interest_rates.rename('Interest_Rate'), 
                                      left_on='Date', right_index=True, how='left')
        df_monthly = df_monthly.merge(economic_indicators, left_on='Date', right_index=True, how='left')
        
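        # Seasonality: mean forward return per calendar month.
        # Caution: 'Returns' is the next-month return and the mean uses the full sample,
        # so this feature carries information from the target and from the test period.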
        df_monthly['Month_of_Year'] = df_monthly['Date'].dt.month
        seasonality = df_monthly.groupby('Month_of_Year')['Returns'].mean().to_dict()
        df_monthly['Seasonality'] = df_monthly['Month_of_Year'].map(seasonality).fillna(0)
        
        # Check for NaNs after initial window (e.g., 100 for BB)
        feature_cols = ['SMA_20', 'SMA_50', 'SMA_100', 'EMA_20', 'EMA_50', 'EMA_100',
                        'Volatility_50', 'RSI_14', 'BB_Upper', 'BB_Lower', 'Volume',
                        'Volume_SMA_20', 'Volume_SMA_50', 'Volume_SMA_100', 'OBV',
                        'MACD', 'MACD_Signal', 'Stoch_%K', 'ATR_50', 'EPS',
                        'PE_Ratio', 'Dividend_Yield', 'Interest_Rate', 'GDP_Growth',
                        'Unemployment_Rate', 'Inflation_Rate', 'Seasonality']
        if len(df_monthly) > 100:  # Use 100 for BB
            df_after_window = df_monthly.iloc[100:]
            nan_cols = df_after_window[feature_cols].isna().any()
            if nan_cols.any():
                print(f"Warning: Unexpected NaN values in {stock} after initial 100 months in columns: {nan_cols[nan_cols].index.tolist()}")
        
        # Select final columns and drop temporary Quarter column
        final_cols = ['Date', 'Stock', 'Y'] + feature_cols
        df_monthly = df_monthly[final_cols]
        
        all_dfs.append(df_monthly)
    except Exception as e:
        print(f"Error processing {stock}: {e}")
        continue

if not all_dfs:
    raise ValueError("No stocks were successfully processed.")

# Concatenate and clean
try:
    monthly_model_data = pd.concat(all_dfs, ignore_index=True)
    print("DataFrame concatenated. Shape before dropna:", monthly_model_data.shape)
except Exception as e:
    print(f"Error concatenating DataFrames: {e}")
    raise

# Drop rows with NaN values (only for technical indicators, not EPS or economic indicators)
monthly_model_data = monthly_model_data.dropna(subset=[col for col in feature_cols if col not in ['EPS', 'Interest_Rate', 'GDP_Growth', 'Unemployment_Rate', 'Inflation_Rate']])
print("DataFrame after dropna (excluding EPS and economic indicators). Shape:", monthly_model_data.shape)

# Fill EPS and economic indicator NaNs with 0 or last valid value
# (.ffill() replaces the deprecated fillna(method='ffill'))
monthly_model_data['EPS'] = monthly_model_data['EPS'].fillna(0)
monthly_model_data[['Interest_Rate', 'GDP_Growth', 'Unemployment_Rate', 'Inflation_Rate']] = monthly_model_data[['Interest_Rate', 'GDP_Growth', 'Unemployment_Rate', 'Inflation_Rate']].ffill().fillna(0)

# Validate
if monthly_model_data.empty:
    raise ValueError("Final DataFrame is empty after dropping NaNs.")
if monthly_model_data['Y'].isna().any():
    print("Warning: NaN values found in Y column - this should not happen.")

# Save to CSV
try:
    monthly_model_data.to_csv('monthly_model_data.csv', index=False)
    print("Dataset saved as 'monthly_model_data.csv'")
except Exception as e:
    print(f"Error saving to CSV: {e}")
    raise

# Display
print(monthly_model_data.head())
print(f"Final Dataset Shape: {monthly_model_data.shape}")

SyntaxError: unexpected character after line continuation character (1907019150.py, line 176)

# 5. Modeling

## 5.1 Model Building

### 5.1.1 Daily Model Building

In [70]:
# Load the dataset with error handling
try:
    data = pd.read_csv('daily_model_data.csv')
    print("Data loaded. Shape:", data.shape)
except FileNotFoundError:
    print("Error: 'daily_model_data.csv' not found.")
    raise
except Exception as e:
    print(f"Error loading data: {e}")
    raise

# Ensure Date is datetime
try:
    data['Date'] = pd.to_datetime(data['Date'])
except Exception as e:
    print(f"Error converting Date to datetime: {e}")
    raise

# Define feature columns
feature_cols = ['SMA_5', 'SMA_10', 'SMA_20', 'EMA_5', 'EMA_10', 'EMA_20', 
                'Volatility_20', 'RSI_14', 'BB_Upper', 'BB_Lower', 'Volume', 
                'Volume_SMA_5', 'Volume_SMA_10', 'Volume_SMA_20', 'OBV', 
                'MACD', 'MACD_Signal', 'Stoch_%K', 'ATR_14', 'Day_of_Week']

# Time-based 80/20 split
try:
    # Sort explicitly so the split is chronological regardless of row order
    unique_dates = data['Date'].sort_values().unique()
    if len(unique_dates) < 2:
        raise ValueError("Not enough unique dates for splitting.")
    train_days = int(len(unique_dates) * 0.8)  # 80% of unique dates
    train_dates = unique_dates[:train_days]
    test_dates = unique_dates[train_days:]

    train_data = data[data['Date'].isin(train_dates)]
    test_data = data[data['Date'].isin(test_dates)]
    if train_data.empty or test_data.empty:
        raise ValueError("Train or test split resulted in empty dataset.")
    print("Train shape:", train_data.shape)
    print("Test shape:", test_data.shape)
    print("Train date range:", train_data['Date'].min(), "to", train_data['Date'].max())
    print("Test date range:", test_data['Date'].min(), "to", test_data['Date'].max())
except Exception as e:
    print(f"Error during train-test split: {e}")
    raise

# Prepare X and y
try:
    X_train = train_data[feature_cols]
    y_train = train_data['Y']
    X_test = test_data[feature_cols]
    y_test = test_data['Y']
    if X_train.empty or y_train.empty or X_test.empty or y_test.empty:
        raise ValueError("Features or target data is empty.")
except KeyError as e:
    print(f"Error: Missing column in data: {e}")
    raise
except Exception as e:
    print(f"Error preparing X and y: {e}")
    raise

# Scale features
try:
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)
except Exception as e:
    print(f"Error scaling features: {e}")
    raise

# Train logistic regression
try:
    model = LogisticRegression(multi_class='multinomial', max_iter=1000)
    model.fit(X_train_scaled, y_train)
except Exception as e:
    print(f"Error training logistic regression: {e}")
    raise

# Predict on test data
try:
    y_pred = model.predict(X_test_scaled)
    y_pred_proba = model.predict_proba(X_test_scaled)
except Exception as e:
    print(f"Error predicting with logistic regression: {e}")
    raise

# Baseline: Dummy classifier (most frequent)
try:
    dummy = DummyClassifier(strategy='most_frequent')
    dummy.fit(X_train_scaled, y_train)
    y_dummy_pred = dummy.predict(X_test_scaled)
except Exception as e:
    print(f"Error with dummy classifier: {e}")
    raise

# Evaluate
try:
    print("\nLogistic Regression Performance:")
    print("Accuracy:", accuracy_score(y_test, y_pred))
    print("\nClassification Report:\n", classification_report(y_test, y_pred, target_names=['-1', '0', '1']))
    
    # Improved Confusion Matrix
    cm = confusion_matrix(y_test, y_pred)
    cm_df = pd.DataFrame(cm, index=['Actual -1', 'Actual 0', 'Actual 1'], 
                        columns=['Predicted -1', 'Predicted 0', 'Predicted 1'])
    print("Confusion Matrix:")
    print(cm_df)
    
    print("ROC-AUC (one-vs-rest):", roc_auc_score(y_test, y_pred_proba, multi_class='ovr'))

    print("\nDummy Classifier Baseline:")
    print("Accuracy:", accuracy_score(y_test, y_dummy_pred))
except Exception as e:
    print(f"Error evaluating performance: {e}")
    raise

Data loaded. Shape: (15240, 23)
Train shape: (12180, 23)
Test shape: (3060, 23)
Train date range: 2020-12-18 00:00:00 to 2023-05-22 00:00:00
Test date range: 2023-05-23 00:00:00 to 2023-12-29 00:00:00

Logistic Regression Performance:
Accuracy: 0.5715686274509804

Classification Report:
               precision    recall  f1-score   support

          -1       0.26      0.03      0.05       619
           0       0.59      0.96      0.73      1763
           1       0.35      0.06      0.11       678

    accuracy                           0.57      3060
   macro avg       0.40      0.35      0.30      3060
weighted avg       0.47      0.57      0.45      3060

Confusion Matrix:
           Predicted -1  Predicted 0  Predicted 1
Actual -1            17          570           32
Actual 0             27         1688           48
Actual 1             21          613           44
ROC-AUC (one-vs-rest): 0.5666569178800639

Dummy Classifier Baseline:
Accuracy: 0.5761437908496732
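
The time-based split in the cell above cuts on unique dates so that all stocks share one cut-off. A dependency-free sketch of the same idea follows, with the dates sorted explicitly. The `(date, stock)` row layout, the function name `chronological_split`, and the toy tickers are assumptions made for illustration, not part of the notebook.

```python
from datetime import date


def chronological_split(rows, train_fraction=0.8):
    """Split rows so that every training date precedes every test date.

    rows: list of (date, stock) tuples (hypothetical minimal schema).
    """
    # Sort the unique dates; do not rely on the order rows happen to be stored in.
    unique_dates = sorted({d for d, _ in rows})
    if len(unique_dates) < 2:
        raise ValueError("Not enough unique dates for splitting.")
    cut = int(len(unique_dates) * train_fraction)
    train_dates = set(unique_dates[:cut])
    train = [r for r in rows if r[0] in train_dates]
    test = [r for r in rows if r[0] not in train_dates]
    return train, test


# Toy panel: 2 stocks x 10 days, with one stock deliberately stored newest-first.
days = [date(2023, 1, d) for d in range(1, 11)]
rows = [(d, "AAA") for d in days] + [(d, "BBB") for d in reversed(days)]

train, test = chronological_split(rows)
last_train_date = max(d for d, _ in train)
first_test_date = min(d for d, _ in test)
print(len(train), len(test))
print(last_train_date, "<", first_test_date)
```

Because the cut is made on dates rather than rows, both stocks contribute the same eight days to training and the same two days to testing, even though one of them is stored in reverse order.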


### 5.1.2 Weekly Model Building

In [71]:
# Load the dataset with error handling
try:
    data = pd.read_csv('weekly_model_data.csv')
    print("Data loaded. Shape:", data.shape)
except FileNotFoundError:
    print("Error: 'weekly_model_data.csv' not found.")
    raise
except Exception as e:
    print(f"Error loading data: {e}")
    raise

# Ensure Date is datetime
try:
    data['Date'] = pd.to_datetime(data['Date'])
except Exception as e:
    print(f"Error converting Date to datetime: {e}")
    raise

# Define feature columns (excluding 'Date', 'Stock', and 'Y')
feature_cols = ['SMA_10', 'SMA_20', 'SMA_50', 'EMA_10', 'EMA_20', 'EMA_50',
                'Volatility_20', 'RSI_14', 'BB_Upper', 'BB_Lower', 'Volume',
                'Volume_SMA_10', 'Volume_SMA_20', 'Volume_SMA_50', 'OBV',
                'MACD', 'MACD_Signal', 'Stoch_%K', 'ATR_14', 'EPS',
                'PE_Ratio', 'Interest_Rate', 'Seasonality']

# Time-based 80/20 split
try:
    # Sort explicitly so the split is chronological regardless of row order
    unique_dates = data['Date'].sort_values().unique()
    if len(unique_dates) < 2:
        raise ValueError("Not enough unique dates for splitting.")
    train_days = int(len(unique_dates) * 0.8)  # 80% of unique dates
    train_dates = unique_dates[:train_days]
    test_dates = unique_dates[train_days:]

    train_data = data[data['Date'].isin(train_dates)]
    test_data = data[data['Date'].isin(test_dates)]
    if train_data.empty or test_data.empty:
        raise ValueError("Train or test split resulted in empty dataset.")
    print("Train shape:", train_data.shape)
    print("Test shape:", test_data.shape)
    print("Train date range:", train_data['Date'].min(), "to", train_data['Date'].max())
    print("Test date range:", test_data['Date'].min(), "to", test_data['Date'].max())
except Exception as e:
    print(f"Error during train-test split: {e}")
    raise

# Prepare X and y
try:
    X_train = train_data[feature_cols]
    y_train = train_data['Y']
    X_test = test_data[feature_cols]
    y_test = test_data['Y']
    if X_train.empty or y_train.empty or X_test.empty or y_test.empty:
        raise ValueError("Features or target data is empty.")
except KeyError as e:
    print(f"Error: Missing column in data: {e}")
    raise
except Exception as e:
    print(f"Error preparing X and y: {e}")
    raise

# Scale features
try:
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)
except Exception as e:
    print(f"Error scaling features: {e}")
    raise

# Train logistic regression
try:
    model = LogisticRegression(multi_class='multinomial', max_iter=1000)
    model.fit(X_train_scaled, y_train)
except Exception as e:
    print(f"Error training logistic regression: {e}")
    raise

# Predict on test data
try:
    y_pred = model.predict(X_test_scaled)
    y_pred_proba = model.predict_proba(X_test_scaled)
except Exception as e:
    print(f"Error predicting with logistic regression: {e}")
    raise

# Baseline: Dummy classifier (most frequent)
try:
    dummy = DummyClassifier(strategy='most_frequent')
    dummy.fit(X_train_scaled, y_train)
    y_dummy_pred = dummy.predict(X_test_scaled)
except Exception as e:
    print(f"Error with dummy classifier: {e}")
    raise

# Evaluate
try:
    print("\nLogistic Regression Performance:")
    print("Accuracy:", accuracy_score(y_test, y_pred))
    print("\nClassification Report:\n", classification_report(y_test, y_pred, target_names=['-1', '0', '1']))
    
    # Improved Confusion Matrix
    cm = confusion_matrix(y_test, y_pred)
    cm_df = pd.DataFrame(cm, index=['Actual -1', 'Actual 0', 'Actual 1'], 
                        columns=['Predicted -1', 'Predicted 0', 'Predicted 1'])
    print("Confusion Matrix:")
    print(cm_df)
    
    print("ROC-AUC (one-vs-rest):", roc_auc_score(y_test, y_pred_proba, multi_class='ovr'))

    print("\nDummy Classifier Baseline:")
    print("Accuracy:", accuracy_score(y_test, y_dummy_pred))
except Exception as e:
    print(f"Error evaluating performance: {e}")
    raise

Data loaded. Shape: (2320, 26)
Train shape: (1840, 26)
Test shape: (480, 26)
Train date range: 2021-10-17 00:00:00 to 2023-07-16 00:00:00
Test date range: 2023-07-23 00:00:00 to 2023-12-31 00:00:00

Logistic Regression Performance:
Accuracy: 0.4625

Classification Report:
               precision    recall  f1-score   support

          -1       0.49      0.51      0.50       162
           0       0.35      0.05      0.09       140
           1       0.46      0.75      0.57       178

    accuracy                           0.46       480
   macro avg       0.43      0.43      0.38       480
weighted avg       0.44      0.46      0.40       480

Confusion Matrix:
           Predicted -1  Predicted 0  Predicted 1
Actual -1            82            7           73
Actual 0             48            7           85
Actual 1             39            6          133
ROC-AUC (one-vs-rest): 0.6375470189870948

Dummy Classifier Baseline:
Accuracy: 0.37083333333333335


### 5.1.3 Monthly Model Building

In [72]:
# Load the dataset with error handling
try:
    data = pd.read_csv('monthly_model_data.csv')
    print("Data loaded. Shape:", data.shape)
except FileNotFoundError:
    print("Error: 'monthly_model_data.csv' not found.")
    raise
except Exception as e:
    print(f"Error loading data: {e}")
    raise

# Ensure Date is datetime
try:
    data['Date'] = pd.to_datetime(data['Date'])
except Exception as e:
    print(f"Error converting Date to datetime: {e}")
    raise

# Define feature columns (based on the corrected monthly_model_data.csv)
feature_cols = ['SMA_20', 'SMA_50', 'SMA_100', 'EMA_20', 'EMA_50', 'EMA_100',
                'Volatility_50', 'RSI_14', 'BB_Upper', 'BB_Lower', 'Volume',
                'Volume_SMA_20', 'Volume_SMA_50', 'Volume_SMA_100', 'OBV',
                'MACD', 'MACD_Signal', 'Stoch_%K', 'ATR_50', 'EPS',
                'PE_Ratio', 'Dividend_Yield', 'Interest_Rate', 'GDP_Growth',
                'Unemployment_Rate', 'Inflation_Rate', 'Seasonality']

# Time-based 80/20 split
try:
    # Sort explicitly so the split is chronological regardless of row order
    unique_dates = data['Date'].sort_values().unique()
    if len(unique_dates) < 2:
        raise ValueError("Not enough unique dates for splitting.")
    train_days = int(len(unique_dates) * 0.8)  # 80% of unique dates
    train_dates = unique_dates[:train_days]
    test_dates = unique_dates[train_days:]

    train_data = data[data['Date'].isin(train_dates)]
    test_data = data[data['Date'].isin(test_dates)]
    if train_data.empty or test_data.empty:
        raise ValueError("Train or test split resulted in empty dataset.")
    print("Train shape:", train_data.shape)
    print("Test shape:", test_data.shape)
    print("Train date range:", train_data['Date'].min(), "to", train_data['Date'].max())
    print("Test date range:", test_data['Date'].min(), "to", test_data['Date'].max())
except Exception as e:
    print(f"Error during train-test split: {e}")
    raise

# Prepare X and y
try:
    X_train = train_data[feature_cols]
    y_train = train_data['Y']
    X_test = test_data[feature_cols]
    y_test = test_data['Y']
    if X_train.empty or y_train.empty or X_test.empty or y_test.empty:
        raise ValueError("Features or target data is empty.")
except KeyError as e:
    print(f"Error: Missing column in data: {e}")
    raise
except Exception as e:
    print(f"Error preparing X and y: {e}")
    raise

# Scale features
try:
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)
except Exception as e:
    print(f"Error scaling features: {e}")
    raise

# Train logistic regression
try:
    model = LogisticRegression(multi_class='multinomial', max_iter=1000)
    model.fit(X_train_scaled, y_train)
except Exception as e:
    print(f"Error training logistic regression: {e}")
    raise

# Predict on test data
try:
    y_pred = model.predict(X_test_scaled)
    y_pred_proba = model.predict_proba(X_test_scaled)
except Exception as e:
    print(f"Error predicting with logistic regression: {e}")
    raise

# Baseline: Dummy classifier (most frequent)
try:
    dummy = DummyClassifier(strategy='most_frequent')
    dummy.fit(X_train_scaled, y_train)
    y_dummy_pred = dummy.predict(X_test_scaled)
except Exception as e:
    print(f"Error with dummy classifier: {e}")
    raise

# Evaluate
try:
    print("\nLogistic Regression Performance:")
    print("Accuracy:", accuracy_score(y_test, y_pred))
    print("\nClassification Report:\n", classification_report(y_test, y_pred, target_names=['-1', '0', '1']))
    
    # Improved Confusion Matrix
    cm = confusion_matrix(y_test, y_pred)
    cm_df = pd.DataFrame(cm, index=['Actual -1', 'Actual 0', 'Actual 1'], 
                        columns=['Predicted -1', 'Predicted 0', 'Predicted 1'])
    print("Confusion Matrix:")
    print(cm_df)
    
    print("ROC-AUC (one-vs-rest):", roc_auc_score(y_test, y_pred_proba, multi_class='ovr'))

    print("\nDummy Classifier Baseline:")
    print("Accuracy:", accuracy_score(y_test, y_dummy_pred))
except Exception as e:
    print(f"Error evaluating performance: {e}")
    raise

Data loaded. Shape: (1344, 30)
Train shape: (1064, 30)
Test shape: (280, 30)
Train date range: 2018-04-30 00:00:00 to 2022-10-31 00:00:00
Test date range: 2022-11-30 00:00:00 to 2023-12-31 00:00:00

Logistic Regression Performance:
Accuracy: 0.48928571428571427

Classification Report:
               precision    recall  f1-score   support

          -1       0.59      0.27      0.37       107
           0       0.00      0.00      0.00        48
           1       0.47      0.86      0.61       125

    accuracy                           0.49       280
   macro avg       0.35      0.38      0.33       280
weighted avg       0.43      0.49      0.41       280

Confusion Matrix:
           Predicted -1  Predicted 0  Predicted 1
Actual -1            29            0           78
Actual 0              3            0           45
Actual 1             17            0          108
ROC-AUC (one-vs-rest): 0.5933885711783634

Dummy Classifier Baseline:
Accuracy: 0.44642857142857145


## 5.2 Model Evaluation

In [73]:
# Create a DataFrame with the data
data = {
    '': ['Daily', 'Weekly', 'Monthly'],
    'Accuracy': [0.571569, 0.4625, 0.489286],
    'ROC-AUC': [0.566657, 0.637547, 0.593389],
    'Precision -1': [0.26, 0.49, 0.59],
    'Precision 0': [0.59, 0.35, 0.00],
    'Precision 1': [0.35, 0.46, 0.47],
    'Recall -1': [0.03, 0.51, 0.27],
    'Recall 0': [0.96, 0.05, 0.00],
    'Recall 1': [0.06, 0.75, 0.86],
    'F1-Score -1': [0.05, 0.50, 0.37],
    'F1-Score 0': [0.73, 0.09, 0.00],
    'F1-Score 1': [0.11, 0.57, 0.61],
    'Dummy Accuracy': [0.576144, 0.370833, 0.446429]
}

df = pd.DataFrame(data)

# Display the DataFrame with merged headers and custom styling
html = """
<table style="width:100%; border-collapse: collapse;">
    <tr style="border-bottom: 1px solid black;">
        <th style="padding: 8px;"></th>
        <th style="padding: 8px;">Accuracy</th>
        <th style="padding: 8px;">ROC-AUC</th>
        <th style="padding: 8px; text-align: center;" colspan="3">Precision</th>
        <th style="padding: 8px; text-align: center;" colspan="3">Recall</th>
        <th style="padding: 8px; text-align: center;" colspan="3">F1-Score</th>
        <th style="padding: 8px;">Dummy Accuracy</th>
    </tr>
    <tr style="border-bottom: 2px solid black;">
        <th style="padding: 8px;"></th>
        <th style="padding: 8px;"></th>
        <th style="padding: 8px;"></th>
        <th style="padding: 8px; text-align: center;">-1</th>
        <th style="padding: 8px; text-align: center;">0</th>
        <th style="padding: 8px; text-align: center;">1</th>
        <th style="padding: 8px; text-align: center;">-1</th>
        <th style="padding: 8px; text-align: center;">0</th>
        <th style="padding: 8px; text-align: center;">1</th>
        <th style="padding: 8px; text-align: center;">-1</th>
        <th style="padding: 8px; text-align: center;">0</th>
        <th style="padding: 8px; text-align: center;">1</th>
        <th style="padding: 8px;"></th>
    </tr>
"""

for index, row in df.iterrows():
    html += "<tr>"
    for col in df.columns:
        if col == '':
            html += f"<td style='padding: 8px; font-weight: bold;'>{row[col]}</td>"
        else:
            html += f"<td style='padding: 8px; text-align: center;'>{row[col]}</td>"
    html += "</tr>"

html += "</table>"

display(HTML(html))

| Model | Accuracy | ROC-AUC | Precision (-1) | Precision (0) | Precision (1) | Recall (-1) | Recall (0) | Recall (1) | F1-Score (-1) | F1-Score (0) | F1-Score (1) | Dummy Accuracy |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Daily | 0.571569 | 0.566657 | 0.26 | 0.59 | 0.35 | 0.03 | 0.96 | 0.06 | 0.05 | 0.73 | 0.11 | 0.576144 |
| Weekly | 0.4625 | 0.637547 | 0.49 | 0.35 | 0.46 | 0.51 | 0.05 | 0.75 | 0.5 | 0.09 | 0.57 | 0.370833 |
| Monthly | 0.489286 | 0.593389 | 0.59 | 0.0 | 0.47 | 0.27 | 0.0 | 0.86 | 0.37 | 0.0 | 0.61 | 0.446429 |


As shown in the above table, we observed the following:

- **Accuracy**: Accuracy is the ratio of correctly predicted instances to the total number of instances. It measures how often the model is correct.

    - The daily model has the highest accuracy (0.571569), followed by the monthly (0.489286) and weekly (0.4625) models. However, the daily model's accuracy is slightly below its dummy accuracy (0.576144), so it does no better than always predicting the most frequent class. Its confusion matrix shows why: it assigns about 94% of test observations to class 0.

    - The weekly and monthly models beat their respective dummy accuracies (0.370833 and 0.446429), but their accuracies remain below 0.5.
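
The daily comparison can be checked by hand from the confusion matrix printed in Section 5.1.1. The sketch below is a minimal, dependency-free illustration; the helper names `accuracy` and `majority_share` are made up here. It assumes the most frequent class in the training data is also the largest class in the test set (class 0), which is consistent with the reported dummy accuracy.

```python
# Daily test-set confusion matrix copied from Section 5.1.1
# (rows = actual class, columns = predicted class, order: -1, 0, 1).
daily_cm = [
    [17, 570, 32],
    [27, 1688, 48],
    [21, 613, 44],
]


def accuracy(cm):
    """Share of observations on the diagonal, i.e. correct predictions."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return correct / total


def majority_share(cm):
    """Accuracy obtained by always predicting the largest actual class."""
    supports = [sum(row) for row in cm]  # row sums = actual class counts
    return max(supports) / sum(supports)


model_acc = accuracy(daily_cm)
baseline_acc = majority_share(daily_cm)
print(f"model accuracy:    {model_acc:.6f}")
print(f"majority baseline: {baseline_acc:.6f}")
print(f"difference:        {model_acc - baseline_acc:+.6f}")
```

The gap amounts to 14 of 3,060 test observations, so the daily model is best described as indistinguishable from the baseline rather than clearly worse.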

---

- **ROC-AUC**: ROC-AUC (Receiver Operating Characteristic - Area Under Curve) summarizes performance across all classification thresholds. Here it is computed one-vs-rest and averaged over the three classes. A value of 0.5 corresponds to no discriminative ability, and higher values mean the model ranks observations of a class above the others more reliably.

    - The weekly model has the highest ROC-AUC (0.637547), indicating the best ability to separate the classes. The monthly model is next (0.593389), followed by the daily model (0.566657).
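
One way to read the one-vs-rest figure: treat each class in turn as the positive label, compute a binary AUC from that class's predicted probability, and average the three values (scikit-learn's default macro averaging). The sketch below is dependency-free and uses made-up labels and probabilities, not the notebook's predictions; `binary_auc` and `ovr_macro_auc` are illustrative helper names.

```python
def binary_auc(scores, is_positive):
    """AUC as the probability that a random positive outranks a random negative.

    Ties between a positive and a negative score count as one half.
    """
    pos = [s for s, p in zip(scores, is_positive) if p]
    neg = [s for s, p in zip(scores, is_positive) if not p]
    wins = 0.0
    for sp in pos:
        for sn in neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def ovr_macro_auc(y_true, proba, classes):
    """Unweighted mean of the per-class one-vs-rest AUCs."""
    aucs = []
    for k, cls in enumerate(classes):
        scores = [row[k] for row in proba]      # predicted probability of class k
        is_pos = [y == cls for y in y_true]     # class k versus the rest
        aucs.append(binary_auc(scores, is_pos))
    return sum(aucs) / len(aucs), aucs


# Made-up example data, for illustration only.
classes = [-1, 0, 1]
y_true = [-1, -1, 0, 0, 1, 1]
proba = [
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.3, 0.4, 0.3],
    [0.5, 0.3, 0.2],
    [0.1, 0.3, 0.6],
    [0.2, 0.2, 0.6],
]

macro_auc, per_class = ovr_macro_auc(y_true, proba, classes)
print("per-class AUC:", per_class)
print("macro OvR AUC:", macro_auc)
```

Because the score is built from rankings of predicted probabilities, a model can have a usable ROC-AUC while its hard predictions collapse onto one class, which is the pattern the daily model shows.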

---

- **Precision**: Precision for a class is the ratio of observations correctly predicted as that class to all observations predicted as that class. High precision means few false positives for that class.

    - For class -1, the monthly model performs best (0.59).

    - For class 0, the daily model performs best (0.59). The monthly model's 0.00 arises because it never predicts class 0, so its precision for that class is undefined and reported as zero.

    - For class 1, the monthly model performs best (0.47).

---

- **Recall**: Recall for a class is the ratio of observations correctly predicted as that class to all observations that actually belong to it. High recall means few false negatives for that class.

    - For class -1, the weekly model performs best (0.51). 

    - For class 0, the daily model performs best (0.96). 

    - For class 1, the monthly model performs best (0.86). 

---

- **F1-Score**: The F1-Score is the harmonic mean of precision and recall, $F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$. It accounts for both false positives and false negatives, and it is high only when precision and recall are both high.

    - For class -1, the weekly model performs best (0.50). 

    - For class 0, the daily model performs best (0.73). 

    - For class 1, the monthly model performs best (0.61). 
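
The per-class precision, recall, and F1 values in the table can be recomputed from the confusion matrices printed in Sections 5.1.2 and 5.1.3. The following is a minimal sketch, not the notebook's own code: `per_class_metrics` is a hypothetical helper, and returning 0.0 for an undefined ratio is a convention chosen here to match the reported 0.00 values.

```python
# Test-set confusion matrices copied from Sections 5.1.2 and 5.1.3
# (rows = actual class, columns = predicted class, order: -1, 0, 1).
weekly_cm = [
    [82, 7, 73],
    [48, 7, 85],
    [39, 6, 133],
]
monthly_cm = [
    [29, 0, 78],
    [3, 0, 45],
    [17, 0, 108],
]
labels = [-1, 0, 1]


def per_class_metrics(cm):
    """Return a list of (precision, recall, f1), one tuple per class.

    Undefined ratios (zero denominator) are returned as 0.0.
    """
    n = len(cm)
    out = []
    for k in range(n):
        tp = cm[k][k]
        predicted_k = sum(cm[i][k] for i in range(n))  # column sum
        actual_k = sum(cm[k])                          # row sum
        precision = tp / predicted_k if predicted_k else 0.0
        recall = tp / actual_k if actual_k else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        out.append((precision, recall, f1))
    return out


weekly_metrics = per_class_metrics(weekly_cm)
monthly_metrics = per_class_metrics(monthly_cm)

for name, metrics in [("weekly", weekly_metrics), ("monthly", monthly_metrics)]:
    for label, (p, r, f1) in zip(labels, metrics):
        print(f"{name:>7} class {label:>2}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

In the monthly matrix the class 0 column sums to zero, so precision has a zero denominator. The 0.00 entries for that class therefore mean "never predicted", not "predicted and always wrong".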

---

- **Dummy Accuracy**: Dummy accuracy is the accuracy of a baseline that always predicts the most frequent class in the training data. It is the minimum a useful model should exceed.

    - The daily model's accuracy (0.571569) is below its dummy accuracy (0.576144), so it fails to beat the baseline.

    - The weekly model's accuracy (0.4625) exceeds its dummy accuracy (0.370833) by about 9 percentage points.

    - The monthly model's accuracy (0.489286) exceeds its dummy accuracy (0.446429) by about 4 percentage points.


# 6. Conclusion

None of the three models performs well. The daily model's accuracy (0.571569) is slightly below its dummy accuracy (0.576144), so it adds nothing over always predicting the most frequent class. The weekly and monthly models beat their dummy baselines, but their accuracies stay below 0.5, so they are not reliable classifiers.

The weekly model separates the classes best, as indicated by the highest ROC-AUC (0.637547) and the largest margin over its baseline. Even so, it almost never identifies the neutral class (recall 0.05).

In summary, the weekly horizon gives the most balanced performance across metrics, but all three models leave significant room for improvement.