# Stock Market Prediction Pipeline: A Machine Learning Approach

**Authors:** Team 003 (Itay, Moran, Shaked)  
**Course:** Workshop in Data Science, Tel Aviv University  
**Date:** November 2025

---

## Abstract

This report documents the development of an end-to-end machine learning pipeline for predicting short-term stock movements of Apple Inc. (AAPL). We address the challenge of non-stationarity in financial time series by modeling **Logarithmic Returns** rather than raw prices. Our evaluation framework incorporates statistical metrics (MSE), financial risk metrics (Sharpe Ratio), and trading utility metrics (Directional Accuracy). This document serves as a living research report, integrating theoretical methodology with empirical results.

## Table of Contents

1. [Introduction & Problem Formulation](#1.-Introduction-&-Problem-Formulation)
    - [1.1 Objective](#1.1-Objective)
    - [1.2 Target Variable & Stationarity](#1.2-Target-Variable-&-Stationarity)
2. [Methodology](#2.-Methodology)
    - [2.1 Evaluation Strategy](#2.1-Evaluation-Strategy)
    - [2.2 Validation Strategy](#2.2-Validation-Strategy)
3. [Empirical Analysis](#3.-Empirical-Analysis)
    - [3.1 Stationarity Tests](#3.1-Stationarity-Tests)
    - [3.2 Baseline Performance](#3.2-Baseline-Performance)
4. [Data Access & Ingestion](#4.-Data-Access-&-Ingestion)
5. [Feature Engineering](#5.-Feature-Engineering)
6. [Modeling](#6.-Modeling)
7. [Advanced Evaluation](#7.-Advanced-Evaluation)
8. [Explainability](#8.-Explainability)

In [None]:
# --- Setup & Configuration ---
import sys
import platform
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Internal Modules (src/)
from src.data.loader import fetch_sample_data
from src.evaluation.analysis import check_stationarity, plot_price_vs_returns
from src.models.baselines import NaiveBaseline, RandomBaseline
from src.evaluation.metrics import evaluate_regression, print_eval
from src.evaluation.plots import set_style
from src.features.preprocessing import LogReturnTransformer

# Apply Academic Plotting Style
%matplotlib inline
set_style()

print(f"Environment: Python {platform.python_version()}")
print("Pipeline modules loaded successfully.")

## 1. Introduction & Problem Formulation

### 1.1 Objective
The unpredictability of financial markets presents a significant challenge for predictive modeling. Traditional forecasting methods often fail due to the stochastic nature of asset prices. The primary objective of this research is to develop a robust machine learning pipeline capable of predicting the **Logarithmic Returns** of the next trading day ($t+1$) for Apple Inc. (AAPL).

### 1.2 Target Variable & Stationarity
A fundamental assumption of many statistical learning methods is that the underlying data-generating process is stationary (i.e., its mean, variance, and autocovariance structure do not change over time). Raw stock prices ($P_t$) violate this assumption: they trend, and their variance changes over time (heteroscedasticity).

To mitigate this, we transform our target variable to **Log-Returns** ($Y_t$):

$$Y_t = \ln(P_{t+1}) - \ln(P_t) = \ln\left(\frac{P_{t+1}}{P_t}\right)$$

This transformation offers three key advantages:
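1.  **Stationarity:** Log-returns are approximately stationary, which makes them a more suitable target for ML algorithms than raw prices.
2.  **Additivity:** Unlike simple percentage returns, log-returns are time-additive: the log-return over several days is the sum of the daily log-returns.
3.  **Normality:** They are closer to normally distributed than prices, which simplifies statistical inference, although empirical daily returns typically have heavier tails than a Gaussian.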
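
The target definition and the additivity property can be illustrated with a minimal sketch. The prices below are made up for illustration, and the variable names are assumptions; this is not the implementation of `LogReturnTransformer`.

```python
import numpy as np
import pandas as pd

# Made-up closing prices, for illustration only.
close = pd.Series([100.0, 102.0, 101.0, 105.0], name="Close")

# Y_t = ln(P_{t+1}) - ln(P_t): the value stored at row t is the NEXT day's log-return.
log_price = np.log(close)
target = (log_price.shift(-1) - log_price).dropna()

# Additivity: daily log-returns sum to the log-return over the whole period.
total_log_return = np.log(close.iloc[-1] / close.iloc[0])
print("sum of daily log-returns:", target.sum())
print("whole-period log-return: ", total_log_return)

# Simple percentage returns do not add up in the same way.
simple = close.pct_change().dropna()
total_simple_return = close.iloc[-1] / close.iloc[0] - 1
print("sum of simple returns:   ", simple.sum())
print("whole-period simple:     ", total_simple_return)
```

Because the target at row $t$ uses $P_{t+1}$, the last row has no target and is dropped.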

## 2. Methodology

### 2.1 Evaluation Strategy
We employ a multi-faceted evaluation framework to assess model performance from statistical, financial, and trading perspectives.

| Metric | Formula | Rationale |
| :--- | :--- | :--- |
| **MSE** | $\frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2$ | Penalizes large errors; standard loss function for regression. |
| **Sharpe Ratio** | $\frac{E[R_p] - R_f}{\sigma_p} \sqrt{252}$ | Measures risk-adjusted return, annualized from daily returns (252 trading days per year). Crucial for validating financial viability. |
| **Directional Accuracy** | $\frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\left[\operatorname{sign}(y_i) = \operatorname{sign}(\hat{y}_i)\right]$ | Assesses the model's ability to predict market direction (Up/Down). |
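
The three metrics in the table can be sketched in a few lines of NumPy. The function names, the toy arrays, and the long/short strategy used to obtain portfolio returns for the Sharpe Ratio are assumptions made for illustration; they are not the contents of `src.evaluation.metrics`.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error between realized and predicted returns."""
    return float(np.mean((y_true - y_pred) ** 2))

def directional_accuracy(y_true, y_pred):
    """Fraction of observations where the predicted sign matches the realized sign."""
    return float(np.mean(np.sign(y_true) == np.sign(y_pred)))

def sharpe_ratio(daily_returns, risk_free_daily=0.0, periods_per_year=252):
    """Annualized Sharpe ratio computed from daily returns."""
    excess = daily_returns - risk_free_daily
    return float(excess.mean() / excess.std(ddof=1) * np.sqrt(periods_per_year))

# Toy data, for illustration only.
y_true = np.array([0.01, -0.02, 0.015, -0.005])
y_pred = np.array([0.005, 0.01, 0.02, -0.01])

# Assumed strategy: go long when the prediction is positive, short when negative.
strategy_returns = np.sign(y_pred) * y_true

print("MSE:                 ", mse(y_true, y_pred))
print("Directional accuracy:", directional_accuracy(y_true, y_pred))
print("Sharpe ratio:        ", sharpe_ratio(strategy_returns))
```

Note that under this definition a prediction of exactly zero only counts as correct when the realized return is also exactly zero, which matters when scoring a zero-prediction baseline.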

### 2.2 Validation Strategy
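To prevent **Look-Ahead Bias**, we use a **Walk-Forward Validation** scheme with an expanding window: each fold trains on all observations up to a cutoff date and tests on the period immediately after it. Randomly shuffled k-fold CV is ruled out because it would let the model train on observations that occur after the ones it is tested on.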
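
An expanding-window split can be sketched as a small index generator. The function name `expanding_window_splits` and its parameters are hypothetical, chosen for illustration; the pipeline's actual splitter may differ (scikit-learn's `TimeSeriesSplit` offers similar behavior).

```python
import numpy as np

def expanding_window_splits(n_samples, n_folds, min_train):
    """Yield (train_idx, test_idx) pairs; training data always precedes test data."""
    test_size = (n_samples - min_train) // n_folds
    if test_size < 1:
        raise ValueError("not enough samples for the requested number of folds")
    for k in range(n_folds):
        train_end = min_train + k * test_size   # training window grows each fold
        test_end = train_end + test_size        # test block follows immediately
        yield np.arange(0, train_end), np.arange(train_end, test_end)

splits = list(expanding_window_splits(n_samples=500, n_folds=4, min_train=100))
for train_idx, test_idx in splits:
    print(f"train rows 0..{train_idx[-1]}  ->  test rows {test_idx[0]}..{test_idx[-1]}")
```

Every training index is strictly smaller than every test index in the same fold, so no future observation can leak into training.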

## 3. Empirical Analysis

### 3.1 Stationarity Tests
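We first empirically validate the stationarity of our target variable using the Augmented Dickey-Fuller (ADF) test. The null hypothesis of the test is that the series has a unit root (is non-stationary), so a small p-value (e.g., below 0.05) is evidence of stationarity.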
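
To make the logic of the test concrete, the following sketch runs a plain Dickey-Fuller regression with a constant on synthetic data. This is a simplification: it omits the lagged-difference terms that make the test "augmented", and the helper `dickey_fuller_t` is hypothetical rather than the code behind `check_stationarity`. The value -2.86 is the asymptotic 5% critical value for the constant-only case.

```python
import numpy as np

def dickey_fuller_t(series):
    """t-statistic of rho in the regression diff(y)_t = alpha + rho * y_{t-1} + e_t."""
    y = np.asarray(series, dtype=float)
    dy = np.diff(y)                                   # first differences
    X = np.column_stack([np.ones(len(dy)), y[:-1]])   # constant + lagged level
    beta, *_ = np.linalg.lstsq(X, dy, rcond=None)     # OLS coefficients
    resid = dy - X @ beta
    sigma2 = resid @ resid / (len(dy) - X.shape[1])   # residual variance
    cov = sigma2 * np.linalg.inv(X.T @ X)             # OLS covariance matrix
    return float(beta[1] / np.sqrt(cov[1, 1]))

# Synthetic data: i.i.d. returns and the log-price path they imply.
rng = np.random.default_rng(0)
returns = rng.normal(0.0005, 0.015, size=500)
log_price = np.log(100.0) + np.cumsum(returns)

CRITICAL_5PCT = -2.86  # reject the unit-root null when the statistic is below this
t_price = dickey_fuller_t(log_price)
t_returns = dickey_fuller_t(returns)
print("log-price t-stat:", t_price, "| reject unit root:", t_price < CRITICAL_5PCT)
print("returns   t-stat:", t_returns, "| reject unit root:", t_returns < CRITICAL_5PCT)
```

A strongly negative statistic for the returns and a much weaker one for the price level is the pattern the ADF test is expected to show on the real series.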

In [None]:
# 1. Data Ingestion
df_research = fetch_sample_data("AAPL", period="2y")

# 2. Feature Engineering (Log Returns)
log_transformer = LogReturnTransformer()
df_research = log_transformer.transform(df_research)
df_research = df_research.dropna()

# 3. Visualization
plot_price_vs_returns(df_research, 'Log_Returns')

# 4. Hypothesis Testing (ADF)
check_stationarity(df_research['Close'], "Raw Close Price")
check_stationarity(df_research['Log_Returns'], "Log Returns")

### 3.2 Baseline Performance
To establish a performance benchmark, we evaluate two naive models. Any sophisticated model must significantly outperform these baselines to be considered valuable.

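*   **Naive Baseline (Zero Strategy):** Predicts $\hat{y}_{t+1} = 0$ (martingale assumption: the best forecast of tomorrow's price is today's price).
*   **Random Baseline:** Draws $\hat{y}_{t+1} \sim \mathcal{N}(\mu, \sigma^2)$, where $\mu$ and $\sigma$ are the mean and standard deviation of the observed log-returns.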
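
The two baselines can be sketched as follows on synthetic returns. The classes `ZeroBaseline` and `GaussianBaseline` are hypothetical stand-ins written for illustration; they are not the `NaiveBaseline` and `RandomBaseline` classes in `src.models.baselines`, whose interfaces may differ.

```python
import numpy as np

class ZeroBaseline:
    """Always predicts a return of zero."""
    def predict(self, n):
        return np.zeros(n)

class GaussianBaseline:
    """Predicts i.i.d. draws from a normal fitted to the observed returns."""
    def __init__(self, seed=42):
        self.rng = np.random.default_rng(seed)

    def fit(self, y):
        self.mu_ = float(np.mean(y))
        self.sigma_ = float(np.std(y, ddof=1))
        return self

    def predict(self, n):
        return self.rng.normal(self.mu_, self.sigma_, size=n)

# Synthetic daily log-returns, for illustration only.
y = np.random.default_rng(0).normal(0.0005, 0.015, size=500)

pred_zero = ZeroBaseline().predict(len(y))
pred_random = GaussianBaseline(seed=42).fit(y).predict(len(y))

mse_zero = float(np.mean((y - pred_zero) ** 2))
mse_random = float(np.mean((y - pred_random) ** 2))
print("MSE zero baseline:  ", mse_zero)
print("MSE random baseline:", mse_random)
```

The zero baseline's MSE equals the mean squared return, while independent random draws add their own variance on top of it, so the random baseline is expected to score roughly twice as badly on MSE.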

In [None]:
# Initialize Baselines
naive_model = NaiveBaseline(strategy="zero")
random_model = RandomBaseline(seed=42)

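# Fit (Random Baseline learns the mean/std of the returns). This is an in-sample fit,
# so the scores below are a rough benchmark, not an out-of-sample estimate.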
y_true = df_research['Log_Returns']
random_model.fit(y_true)

# Predict
y_pred_naive = naive_model.predict(df_research)
y_pred_random = random_model.predict(df_research)

# Evaluate
metrics_naive = evaluate_regression(y_true, y_pred_naive)
metrics_random = evaluate_regression(y_true, y_pred_random)

print_eval(metrics_naive, "Naive Baseline (Zero)")
print_eval(metrics_random, "Random Baseline")

## 4. Data Access & Ingestion

*(To be implemented in Stage 3)*

This section will detail the data contracts, caching mechanisms, and ingestion procedures for the full historical dataset.

## 5. Feature Engineering

*(To be implemented in Stage 4)*

We will implement technical indicators (RSI, MACD, Bollinger Bands) and potentially sentiment scores as features for our models.

## 6. Modeling

*(To be implemented in Stage 5)*

This section will cover the training and validation of advanced machine learning models (e.g., XGBoost, LightGBM) and deep learning architectures (LSTM, Transformer).

## 7. Advanced Evaluation

*(To be implemented in Stage 6)*

This section will present comprehensive backtesting, error analysis, and a performance comparison against the baselines established in Section 3.

## 8. Explainability

*(To be implemented in Stage 7)*

This section will interpret model predictions using SHAP values and feature importance analysis.