# Stock Market Movement Prediction
### Using Machine Learning to Forecast Next-Day Stock Price Movements
This project tackles a binary classification problem in financial markets -
predicting whether individual US stocks will move up or down the following day.
The goal is to develop a model that can assist in making data-driven investment decisions.
The model uses 20 days of historical price returns and trading volumes, along with
categorical stock metadata like industry and sector classifications, to identify
predictive patterns in market behavior.
The public benchmark accuracy of 51.31% was achieved using a Random Forest model that considered
the previous 5 days of data along with the average sector returns from the prior day.

#### Agenda
1. **Data Preprocessing**
   - Loading training and test datasets
   - Handling missing values and target encoding
   - Feature engineering (technical indicators of different types)
2. **Model Implementation and Evaluation**
   - Decision Tree Classifier
      - Baseline model (accuracy: 0.510)
      - Tuned model with hyperparameter optimization (accuracy: 0.5325)
   - XGBoost Classifier
      - Baseline model (accuracy: 0.53)
      - Tuned model with hyperparameter optimization (accuracy: 0.8775)
   - Neural Network
      - Accuracy: 0.5144
3. **Model Comparison**
   - Cross-validation results
   - Feature importance analysis
   - ROC curves and confusion matrices


## Data description

Three datasets are provided as CSV files: training inputs, training outputs, and test inputs.

Input datasets comprise 47 columns: the first ID column contains unique row identifiers while the other 46 descriptive features correspond to:

* **DATE**: an index of the date (the dates are randomized and anonymized so there is no continuity or link between any dates),
* **STOCK**: an index of the stock,
* **INDUSTRY**: an index of the stock industry domain (e.g., aeronautic, IT, oil company),
* **INDUSTRY_GROUP**: an index of the group industry,
* **SUB_INDUSTRY**: a lower level index of the industry,
* **SECTOR**: an index of the work sector,
* **RET_1 to RET_20**: the historical residual returns over the last 20 days (i.e., RET_1 is the return of the previous day and so on),
* **VOLUME_1 to VOLUME_20**: the historical relative volume traded over the last 20 days (i.e., VOLUME_1 is the relative volume of the previous day and so on).

Output datasets contain only 2 columns:

* **ID**: the unique row identifier (corresponding to the input identifiers),
* **RET**: the binary target, i.e., the sign of the residual stock return at time $t$.

------------------------------------------------------------------------------------------------
The one-day return of a stock $j$ at time $t$, given its price $P_j^t$, is:
$$R_j^t = \frac{P_j^t}{P_j^{t-1}} - 1$$
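
As a minimal illustration of this formula, the sketch below computes one-day returns from a short price series with pandas. The prices are made-up numbers for illustration only; they do not come from the challenge data, which already ships returns rather than prices.

```python
import pandas as pd

# Hypothetical closing prices of a single stock on consecutive days
prices = pd.Series([100.0, 102.0, 99.96, 99.96])

# Return at day t = price at t / price at t-1, minus 1.
# The first day has no previous price, so its return is NaN.
returns = prices / prices.shift(1) - 1

# pandas offers the same computation as a built-in
returns_builtin = prices.pct_change()
```

The target `RET` corresponds to the sign of such a return (in its residual form), not to its magnitude.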

The volume features are relative volumes: the ratio of a stock's volume to the median volume of the past 20 days.
If any day within this 20-day window has a missing volume value, the calculation yields NaN values for subsequent days.
For example, if there is a missing value on day $D$, then the relative volumes for days $D$ to $D+19$ will be affected.

The relative volume $\tilde{V}^t_j$ at time $t$ of a stock $j$ is calculated as:
$$
\tilde{V}^t_j = \frac{V^t_j}{\text{median}( \{ V^{t-1}_j, \ldots, V^{t-20}_j \} )}
$$

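The adjusted relative volume $\hat{V}^t_j$, which is the quantity stored in the `VOLUME_i` columns, is then obtained by subtracting the average relative volume across the $n$ stocks at time $t$:
$$
\hat{V}^t_j = \tilde{V}^t_j - \frac{1}{n} \sum_{i=1}^{n} \tilde{V}^t_i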
$$
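
The two formulas above can be sketched in pandas as follows. The volumes are randomly generated and the stock names are placeholders, so this is an illustration under stated assumptions rather than the organizers' actual computation.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical raw volumes: 30 days (rows) for 3 stocks (columns)
volumes = pd.DataFrame(
    rng.integers(1_000, 5_000, size=(30, 3)).astype(float),
    columns=["stock_a", "stock_b", "stock_c"],
)

# Median of the previous 20 days (t-1 ... t-20): shift by one day so that
# day t itself is excluded, then require a full 20-day window.
past_median = volumes.shift(1).rolling(window=20, min_periods=20).median()

# Relative volume: today's volume divided by the past 20-day median
relative_volume = volumes / past_median

# Adjusted relative volume: subtract the cross-sectional mean of each day
adjusted_volume = relative_volume.sub(relative_volume.mean(axis=1), axis=0)
```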
------------------------------------------------------------------------------------------------
Guidelines from the organizers:
The solution files submitted by participants shall follow this output dataset format (i.e., contain only two columns, ID and RET, where the ID values correspond to the input test data).
An example submission file containing random predictions is provided.

**418595 observations (i.e. lines) are available for the training datasets while 198429 observations are used for the test datasets.**


## Implementation Steps

This notebook implements the following steps:

1. **Data Loading and Preprocessing**
   - Load training and test datasets
   - Handle missing values and data cleaning
   - Calculate technical indicators using TA-Lib
   - Filter out infinity values and remove duplicated columns
   - Split data into training and test sets (75%/25% split)

2. **Feature Engineering**
   - Calculate technical indicators like RSI, OBV, EMA etc.
   - Save indicators to pickle files for reuse
   - Drop unnecessary ID and categorical columns
   - Remove redundant technical indicators

3. **Model Development and Tuning**
   - Decision Tree Classifier
      - Baseline model (accuracy: 0.510)
      - Tuned model with hyperparameters (accuracy: 0.533)
   - Gradient Boosting
      - Stepwise tuning of n_estimators, tree params, leaf params
      - Best model achieves significant improvement
   - Neural Network
      - Simple feed-forward architecture
      - Training with BCE loss and Adam optimizer

4. **Model Comparison and Analysis**
   - Compare accuracy across all models
   - Analyze feature importance
   - Key findings on model performance and technical indicators
   - Discussion of overfitting and benchmark results

### Importing libraries

In [14]:
import sys
from pathlib import Path

try:
    # vscode
    path = Path(__file__).parent.parent
    path = path / "src"
except NameError:
    # jupyter notebook
    path = Path().absolute().parent
sys.path.append(str(path))

import kedro.ipython
from kedro.ipython import get_ipython

kedro.ipython.load_ipython_extension(get_ipython())

/Users/krzysztofwojdalski/github_projects/ml-in-finance-i-project/src/ml_in_finance_i_project


In [15]:
import logging as log

from IPython.display import Markdown as md

from src.ml_in_finance_i_project.utils import get_node_idx, get_node_outputs

In [16]:
# Load the datasets
x_train_raw = catalog.load("x_train_raw")
y_train_raw = catalog.load("y_train_raw")
x_test_raw = catalog.load("x_test_raw")

### Checking and configuring environment

In [17]:
## Google Colab used to speed up the computation in xgboost model
## warning: this function must be run before importing libraries
# if running in Google Colab
def setup_colab_environment():
    """
    Set up Google Colab environment by mounting drive and creating symlinks.
    Returns True if running in Colab, False otherwise.
    """
    try:
        import os

        from google.colab import drive

        drive.mount("/content/drive")
        req_symlinks = [
            ("data", "ml_in_finance_i_project/data"),
            ("src", "ml_in_finance_i_project/src"),
        ]
        # Create symlinks if they don't exist
        for dest, src in req_symlinks:
            if not os.path.exists(dest):
                os.symlink(f"/content/drive/Othercomputers/My Mac/{src}", dest)
        return True

    except ImportError:
        return False

#### Run pipeline node definition. This one must be evaluated within the notebook

In [18]:
def run_pipeline_node(pipeline_name: str, node_name: str, inputs: dict):
    """Run a specific node from a pipeline.

    Args:
        pipeline_name: Name of the pipeline
        node_name: Name of the node to run
        inputs: Dictionary of input parameters for the node

    Returns:
        Output from running the node
    """
    node_idx = get_node_idx(pipelines[pipeline_name], node_name)
    return pipelines[pipeline_name].nodes[node_idx].run(inputs)
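
`get_node_idx` is imported from the project's `utils` module, and its source is not shown in this notebook. A helper of this kind could plausibly look like the sketch below. The name `get_node_idx_sketch` and the stand-in pipeline objects are hypothetical, and the real helper may differ.

```python
from types import SimpleNamespace


def get_node_idx_sketch(pipeline, node_name: str) -> int:
    """Return the position of the node called `node_name` in `pipeline.nodes`."""
    for idx, node in enumerate(pipeline.nodes):
        if node.name == node_name:
            return idx
    raise ValueError(f"Node {node_name!r} not found in pipeline")


# Stand-in objects that mimic only the `.nodes` and `.name` attributes
fake_pipeline = SimpleNamespace(
    nodes=[SimpleNamespace(name="load_node"), SimpleNamespace(name="plot_node")]
)
idx = get_node_idx_sketch(fake_pipeline, "plot_node")
```

Looking nodes up by name rather than by a hard-coded index keeps the notebook cells valid if the pipeline is later reordered.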

##### Setup environment

In [19]:
IN_COLAB = setup_colab_environment()

# Configure logging to stdout
log.basicConfig(
    level=log.INFO,
    format="%(asctime)s - %(levelname)s - %(message)s",
    handlers=[log.StreamHandler()],
)
conf_params = context.config_loader.get("parameters")
target = conf_params["model_options"]["target"]
kfold = conf_params["model_options"]["kfold"]

## Loading data

#### Preprocessing data
* Dropping rows with NAs
* Encoding `RET` from bool to binary
* [Optional] Loading only a fraction of the dataset to speed up the computation
* [Optional] Trimming the dataset to the first n days (for instance, `n_days=5` means we will only load `RET_1` to `RET_5`)
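
The node that performs these steps lives in the project's `data_processing` pipeline and is not shown here. The sketch below only approximates the listed steps on a tiny made-up frame. The function name `preprocess_sketch`, its parameters, and the order of the steps are assumptions, not the project's API.

```python
from typing import Optional

import numpy as np
import pandas as pd


def preprocess_sketch(x: pd.DataFrame, y: pd.DataFrame, n_days: int = 5,
                      sample_n: Optional[int] = None, seed: int = 0) -> pd.DataFrame:
    """Hypothetical stand-in for the preprocessing steps (not the project's node)."""
    # Join inputs and target on the shared ID column
    df = x.merge(y, on="ID").set_index("ID")
    # Encode the boolean target as 0/1
    df["RET"] = df["RET"].astype(int)
    # Keep only the first n_days of history (RET_1..RET_n, VOLUME_1..VOLUME_n)
    history = [f"{name}_{i}" for i in range(1, n_days + 1) for name in ("RET", "VOLUME")]
    other = [c for c in df.columns if not c.startswith(("RET_", "VOLUME_"))]
    df = df[other + history]
    # Drop rows that still contain missing values
    df = df.dropna()
    # Optionally subsample to speed up experiments
    if sample_n is not None:
        df = df.sample(n=min(sample_n, len(df)), random_state=seed)
    return df


# Made-up example data with three days of history
x = pd.DataFrame({
    "ID": [0, 1, 2],
    "STOCK": [10, 11, 12],
    "RET_1": [0.01, np.nan, -0.02],
    "VOLUME_1": [0.5, 0.1, -0.3],
    "RET_2": [0.00, 0.03, 0.01],
    "VOLUME_2": [0.2, 0.4, 0.6],
    "RET_3": [0.02, 0.01, np.nan],
    "VOLUME_3": [0.1, 0.1, 0.1],
})
y = pd.DataFrame({"ID": [0, 1, 2], "RET": [True, False, True]})
train_df = preprocess_sketch(x, y, n_days=2)
```

In this sketch the trimming happens before rows with NAs are dropped, so a missing value in a discarded column (here `RET_3`) does not remove the row.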

In [20]:
# Run the data processing pipeline node
out = run_pipeline_node(
    "data_processing",
    "load_and_preprocess_data_node",
    {
        "x_train_raw": x_train_raw,
        "y_train_raw": y_train_raw,
        "x_test_raw": x_test_raw,
        "params:remove_id_cols": conf_params["remove_id_cols"],
        "params:n_days": conf_params["n_days"],
        "params:sample_n": conf_params["sample_n"],
    },
)

#### Problem visualization

* Plot returns and volume
* Red bar is the target variable (RET) we are trying to predict

In [33]:
# Plot returns and volume
run_pipeline_node(
    "reporting",
    "plot_returns_volume_node",
    {
        "train_df": out["train_df"],
        "params:example_row_id": 2,
    },
)["returns_volume_plot"]

#### Glimpse of the data

* As we can see, the raw data is not clean (NaN values are visible)
* We will need to clean it up in the next step

In [22]:
out["test_df"].head()

| ID | DATE | STOCK | INDUSTRY | INDUSTRY_GROUP | SECTOR | SUB_INDUSTRY | RET_1 | VOLUME_1 | RET_2 | VOLUME_2 | ... | RET_1_SUB_INDUSTRY_median | RET_1_SUB_INDUSTRY_std | RET_2_SUB_INDUSTRY_median | RET_2_SUB_INDUSTRY_std | RET_3_SUB_INDUSTRY_median | RET_3_SUB_INDUSTRY_std | RET_4_SUB_INDUSTRY_median | RET_4_SUB_INDUSTRY_std | RET_5_SUB_INDUSTRY_median | RET_5_SUB_INDUSTRY_std |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 602753 | 215 | 712 | 53 | 19 | 7 | 126 | 0.020782 | -0.432386 | 0.016149 | -0.091659 | ... | 0.017591 | 0.005053 | 0.016149 | 0.006815 | 0.001227 | 0.006519 | -0.007307 | 0.00598 | -0.015588 | 0.021054 |
| 540249 | 152 | 3677 | 27 | 8 | 4 | 66 | -0.00472 | -0.552429 | 0.009311 | -0.66723 | ... | -0.0056 | 0.007644 | 0.009311 | 0.003188 | -0.000182 | 0.002441 | 0.003651 | 0.013475 | -0.013314 | 0.01819 |
| 525141 | 133 | 2571 | 34 | 11 | 5 | 85 | -0.010844 | 0.105083 | -0.030768 | -0.023836 | ... | -0.0097 | 0.001619 | -0.022414 | 0.011815 | -0.000185 | 0.004249 | -0.017408 | 0.00048 | -0.018167 | 0.000826 |
| 567011 | 176 | 4525 | 41 | 14 | 6 | 101 | -0.000775 | -0.543201 | 0.002331 | -0.338408 | ... | -0.000775 | NaN | 0.002331 | NaN | -0.000776 | NaN | -0.00155 | NaN | -0.000775 | NaN |
| 508581 | 114 | 4180 | 66 | 24 | 9 | 162 | -0.035664 | 0.108727 | -0.026389 | 1.278625 | ... | -0.035664 | NaN | -0.026389 | NaN | 0.015515 | NaN | -0.001409 | NaN | -0.011142 | NaN |


#### Info about the dataset

In [23]:
print("Training Dataset Info:")
out["train_df"].info()
print("\nTest Dataset Info:")
out["test_df"].info()

Training Dataset Info:
<class 'pandas.core.frame.DataFrame'>
Index: 10000 entries, 43340 to 177228
Data columns (total 87 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   DATE                         10000 non-null  int64  
 1   STOCK                        10000 non-null  int64  
 2   INDUSTRY                     10000 non-null  int64  
 3   INDUSTRY_GROUP               10000 non-null  int64  
 4   SECTOR                       10000 non-null  int64  
 5   SUB_INDUSTRY                 10000 non-null  int64  
 6   RET_1                        10000 non-null  float64
 7   VOLUME_1                     10000 non-null  float64
 8   RET_2                        10000 non-null  float64
 9   VOLUME_2                     10000 non-null  float64
 10  RET_3                        10000 non-null  float64
 11  VOLUME_3                     10000 non-null  float64
 12  RET_4                        10000 non-null  float6

#### Plot NaN percentages across categorical variables


Possible reasons for missing values:
1. Market closures (market data might be missing for weekends and public holidays) - some dates clearly show a very high
percentage of missing values
2. Data collection issues (for instance, market data might come from different US venues, e.g. NYSE, NASDAQ, CBOE, etc.)
3. Randomization and anonymization of dates
4. The way relative volumes are calculated (one day missing causes missing values for the next 19 days) - could have something to do with calculating volumes on weekends / public holidays
5. Done on purpose by the organizers to make the problem more challenging
6. Some stocks might be delisted or suspended from trading (reference data problem) - some stocks in fact have up to 100% missing values
7. Some stocks might be barely trading (either due to low volume or in a non-continuous manner)
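
To check hypotheses like these, one can compute the share of NaN cells per category. The sketch below does this per `DATE` on a tiny made-up frame. It is not the implementation of `plot_nan_percentages_node`, whose source is not shown here, and the same pattern would apply to `STOCK` or `SECTOR`.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the training frame
df = pd.DataFrame({
    "DATE": [1, 1, 2, 2],
    "RET_1": [0.01, np.nan, np.nan, np.nan],
    "VOLUME_1": [0.3, 0.2, np.nan, 0.1],
})

feature_cols = ["RET_1", "VOLUME_1"]

# Share of missing cells per DATE, in percent:
# first the NaN rate of each column within a date, then the mean across columns
nan_pct_by_date = (
    df[feature_cols].isna().groupby(df["DATE"]).mean().mean(axis=1) * 100
)
```

Dates with a value close to 100% would support the market-closure explanation, while stocks with a value close to 100% would point to delisted or suspended names.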

In [24]:
run_pipeline_node(
    "reporting",
    "plot_nan_percentages_node",
    {"train_df": out["train_df"]},
)["nan_percentages_plot"]

#### Check Class Imbalance

The classes are almost perfectly balanced. This is expected, as the target variable is the sign of the return.
Intuitively, one would expect the sign of the return to be slightly more likely to be positive than negative
unless the data comes from a bear market period.

In [25]:
md(
    f"Class imbalance: {out['train_df']['RET'].value_counts(normalize=True)[0] * 100:.2f}%"
    + f" {out['train_df']['RET'].value_counts(normalize=True)[1] * 100:.2f}%"
)

Class imbalance: 50.02% 49.98%

#### Plot correlation matrix

Findings:
* Most stock returns are nearly uncorrelated with each other (this is expected).
Otherwise, someone could make a lot of money
by exploiting this non-subtle pattern.
    * Eventually, excess alpha would converge to 0
* Among stock returns, the strongest correlations are between returns on adjacent days (e.g. `RET_1` and `RET_2`)
    * This is expected, as the magnitude of a stock's return is likely to be correlated with the magnitude of its return on the previous day
* Volumes are highly correlated with each other, which is largely expected given the way the `VOLUME_i` variables are calculated.
Moreover, volatility and volumes tend to cluster, hence the correlation is positive.
* There is a strong positive correlation between the volume of the previous day and the return of the following day
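
For readers who want to inspect such correlations numerically rather than visually, a minimal pandas sketch follows. The data is random noise standing in for a few `RET_i` and `VOLUME_i` columns, so the resulting numbers say nothing about the real dataset. The plotting node itself is not reproduced here.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical stand-in for a few feature columns
df = pd.DataFrame(
    rng.normal(size=(500, 4)),
    columns=["RET_1", "RET_2", "VOLUME_1", "VOLUME_2"],
)

# Pearson correlation between every pair of columns
corr = df.corr()

# Mask the diagonal (always 1) and report, for each column,
# its largest absolute correlation with any other column
off_diagonal = ~np.eye(len(corr), dtype=bool)
max_abs_corr = corr.where(off_diagonal).abs().max()
```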

In [13]:
out_corr = run_pipeline_node(
    "reporting",
    "plot_correlation_matrix_node",
    {"train_df": out["train_df"]},
)
out_corr["correlation_matrix_plot"]



## Feature engineering - cont'd

* This step follows the organizers' notebook and extends the set of variables they used
* Calculate statistical features
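
Column names such as `RET_1_SUB_INDUSTRY_median` and `RET_1_SUB_INDUSTRY_std` in the `head()` output suggest group-wise statistics of the lagged returns. The sketch below shows one way to build such features. Grouping by `SUB_INDUSTRY` within each `DATE` is an assumption, and the data is made up.

```python
import pandas as pd

# Hypothetical miniature of the training frame
df = pd.DataFrame({
    "DATE": [1, 1, 1, 2],
    "SUB_INDUSTRY": [126, 126, 66, 126],
    "RET_1": [0.02, 0.01, -0.005, 0.03],
})

# Statistics of RET_1 over all stocks of the same sub-industry on the same date,
# broadcast back to every row of the group
group = df.groupby(["SUB_INDUSTRY", "DATE"])["RET_1"]
df["RET_1_SUB_INDUSTRY_median"] = group.transform("median")
df["RET_1_SUB_INDUSTRY_std"] = group.transform("std")
```

Groups with a single row get a median equal to the row's own value and a NaN standard deviation, which may explain rows in the `head()` output where the `_std` columns are empty.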

In [26]:


import warnings

with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    out2 = run_pipeline_node(
        "data_processing",
        "calculate_statistical_features_node",
        {
            "train_df": out["train_df"],
            "test_df": out["test_df"],
        },
    )

### Feature Engineering - Technical Indicators using TA-Lib

In this part, we calculate the technical indicators for the train and test data.
We save the results in pickle files to avoid recalculating them every time.
The following TA-Lib functions are applied, each with the parameters shown:
- `talib.OBV`: `{"data_type": "both"}`
- `talib.RSI`: `{"data_type": "ret"}`
- `talib.MOM`: `{"timeperiod": 5, "data_type": "ret"}`
- `talib.ROCR`: `{"timeperiod": 5, "data_type": "ret"}`
- `talib.CMO`: `{"timeperiod": 14, "data_type": "ret"}`
- `talib.EMA`: `{"timeperiod": 5, "data_type": "ret"}`
- `talib.SMA`: `{"timeperiod": 5, "data_type": "ret"}`
- `talib.WMA`: `{"timeperiod": 5, "data_type": "ret"}`
- `talib.MIDPOINT`: `{"timeperiod": 10, "data_type": "ret"}`

----

Calculating the technical indicators is optional, as it typically takes a long time (unless we use a small subset of the data). When `calculate` is `False`, the previously saved outputs are loaded from the catalog instead.
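
To make the indicators more concrete without requiring a TA-Lib installation, the sketch below computes pandas counterparts of three of them (SMA, MOM, and EMA with `timeperiod=5`) on a single made-up row of lagged returns. This is a stand-in rather than the project's node: how the project orders and feeds the `RET_i` columns into TA-Lib is not shown in this notebook, and TA-Lib's values may differ from these in the warm-up period.

```python
import pandas as pd

# Hypothetical lagged returns of one stock, ordered oldest (RET_10) to newest (RET_1)
ret = pd.Series(
    [0.01, -0.02, 0.00, 0.03, -0.01, 0.02, 0.01, -0.03, 0.02, 0.01],
    index=[f"RET_{i}" for i in range(10, 0, -1)],
)

timeperiod = 5

# Simple moving average over the last `timeperiod` values (counterpart of SMA)
sma = ret.rolling(window=timeperiod).mean()

# Momentum: current value minus the value `timeperiod` steps earlier (counterpart of MOM)
mom = ret - ret.shift(timeperiod)

# Exponential moving average with smoothing factor 2 / (timeperiod + 1)
ema = ret.ewm(span=timeperiod, adjust=False).mean()
```

Each indicator produces one value per lag, so applying nine indicators to 20 lags quickly multiplies the number of columns, which is why redundant ones are dropped later.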

In [27]:
calculate = True

if calculate:
    out3 = run_pipeline_node(
        "data_processing",
        "calculate_technical_indicators_node",
        {
            "train_df_statistical_features": out2["train_df_statistical_features"],
            "test_df_statistical_features": out2["test_df_statistical_features"],
            "params:features_ret_vol": conf_params["features_ret_vol"],
        },
    )
else:
    # Load previously saved outputs under the name used by the following cells
    out3 = get_node_outputs(
        pipelines["data_processing"].nodes[
            get_node_idx(
                pipelines["data_processing"], "calculate_technical_indicators_node"
            )
        ],
        catalog,
    )

### Columns to drop

* ID columns can only be dropped once all features are calculated, because some of them were used in those calculations.
* ID columns could bring in some predictive power, but we don't want to use them in this case
* We drop the following columns: `ID`, `STOCK`, `DATE`, `INDUSTRY`, `INDUSTRY_GROUP`, `SECTOR`, `SUB_INDUSTRY`

In [28]:

out4 = run_pipeline_node(
    "data_processing",
    "drop_id_cols_node",
    {
        "train_ta_indicators": out3["train_ta_indicators"],
        "test_ta_indicators": out3["test_ta_indicators"],
    },
)

#### Drop obsolete technical indicators

* The assumption is that some technical indicators are not useful for the prediction and may cause the model to overfit
* For instance, SMA(10), SMA(11), etc. don't give any information in the context of RET.
* It's an arbitrary choice, but we want to keep the number of features low

In [29]:
out5 = run_pipeline_node(
    "data_processing",
    "drop_obsolete_technical_indicators_node",
    {
        "train_ta_indicators_dropped": out4["train_ta_indicators_dropped"],
        "test_ta_indicators_dropped": out4["test_ta_indicators_dropped"],
        "params:target": conf_params["model_options"]["target"],
    },
)

#### Filter infinity values

* Filtering out infinity values as they can cause problems with model calculation (for instance, neural networks can't handle them)
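
A minimal sketch of such a filter is shown below. Whether the project's node drops the affected rows or replaces the values is not visible in this notebook, so dropping rows is an assumption here, and the column names and values are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: ratio-type indicators can produce +/-inf on division by zero
df = pd.DataFrame({
    "ROCR_5": [1.02, np.inf, 0.98, -np.inf],
    "RET_1": [0.01, 0.02, -0.01, 0.00],
    "RET": [1, 0, 1, 0],
})

# Check every column except the target
features = df.columns.drop("RET")

# Keep only rows where no feature is +inf or -inf (NaN is left for a later step)
has_inf = np.isinf(df[features]).any(axis=1)
filtered = df.loc[~has_inf]
```

An alternative is `df.replace([np.inf, -np.inf], np.nan)`, which keeps the rows and defers the decision to the NaN-handling step.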

In [30]:
out6 = run_pipeline_node(
    "data_processing",
    "filter_infinity_values_node",
    {
        "train_df_technical_indicators": out5["train_df_technical_indicators"],
        "test_df_technical_indicators": out5["test_df_technical_indicators"],
        "params:target": conf_params["model_options"]["target"],
    },
)

#### Remove duplicated columns and handle NaN values

* Before the data is passed to the model, we need to make sure that no duplicated columns or NaN values are present
* NaN values were handled earlier, but that was before the technical indicators were calculated, so we need to handle them again
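
Both clean-up steps can be sketched in a few lines of pandas. The example assumes that duplicates are identified by column name and that rows with NaN values are dropped. The project's node may instead compare column contents or impute values, and the column names are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: two columns share the name "SMA_5" and one row has a NaN
df = pd.DataFrame(
    [[0.1, 0.1, 1.0], [0.2, 0.2, np.nan], [0.3, 0.3, 3.0]],
    columns=["SMA_5", "SMA_5", "EMA_5"],
)

# Keep the first occurrence of each column name
deduped = df.loc[:, ~df.columns.duplicated()]

# Drop rows that still contain NaN values
clean = deduped.dropna()
```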

In [31]:
# Remove duplicated columns and handle NaN values
out7 = run_pipeline_node(
    "data_processing",
    "remove_duplicates_and_nans_node",
    {
        "train_df_filtered": out6["train_df_filtered"],
        "test_df_filtered": out6["test_df_filtered"],
    },
)