This is a basic ML algorithm taken from the Medium article "Introduction to Quantitative Trading - Building a Machine Learning Model".

In [None]:
# Essential libraries -> pandas, numpy, yfinance, talib (used for technical analysis of financial markets), plotly, scikit-learn and XGBoost (classification model)

import pandas as pd
import numpy as np 
import yfinance as yf 
import talib as ta 
import plotly.express as px
from sklearn.metrics import classification_report
from xgboost import XGBClassifier

In [None]:
# Define parameters: ticker, training period, and test period
# A long out-of-sample (test) period gives a more reliable check for overfitting
ticker = "AAPL"
start_date = '2012-01-01'
end_train = '2017-12-31'
end_date = '2025-11-24'

df = yf.download(ticker, start=start_date, end=end_date)

# Flatten column names if MultiIndex
if isinstance(df.columns, pd.MultiIndex):
    df.columns = df.columns.droplevel(1)
    
# Verify columns are correct
print(df.columns.tolist())



['Close', 'High', 'Low', 'Open', 'Volume']





In [20]:
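# Data Preprocessing
# Prices should be adjusted for dividends and stock splits to reflect the stock's true performance over time.
# The download above returned no 'Adj Close' column, which suggests the prices are already
# adjusted (yfinance's auto_adjust option), so the manual adjustment below stays disabled.
#df["Adj Low"] = df["Low"] - ( df["Close"] - df["Adj Close"] )
#df["Adj High"] = df["High"] - ( df["Close"] - df["Adj Close"] )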


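# Calculate daily returns
# Target is the next day's return; Target_cat is 1 if that return is positive and 0 otherwise
# (the last row has no next-day return, so its Target is NaN and the row is dropped later)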
df["Returns"] = df["Close"].pct_change()
df["Target"] = df["Returns"].shift(-1)
df["Target_cat"] = np.where(df["Target"] > 0, 1, 0)

In [21]:
# Feature Engineering
# We want to add technical indicator features to help the model learn patterns in the data

# z-score: how many standard deviations the close is from its 15-day rolling mean
# Aroon oscillator: based on how many periods have passed since the last high or low
# price trend: sum of the previous four daily returns

df['std15'] = df['Close'].rolling(15).std()
df['moving_average'] = df['Close'].rolling(15).mean()
df['zscore'] = ( df['Close'] - df['moving_average'] )/ df['std15']
df['aroon'] = ta.AROONOSC(df['High'], df['Low'], timeperiod=14)
df['price_trend'] = df['Returns'].shift().rolling(4).sum()

# Next: categorise the features into quantile bins using the qcut function from pandas.
# Coarse bins limit how closely the model can fit noise in the raw indicator values.
# Note: the bin edges here are computed over the full sample, including the test period.

df['zscore'] = pd.qcut(df['zscore'], 6, labels=False)
df['aroon'] = pd.qcut(df['aroon'], 4, labels=False)
df['price_trend'] = pd.qcut(df['price_trend'], 6, labels=False)

In [22]:
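# Drop rows with NaN values produced by the rolling calculations and the shifted target
# (.copy() gives an independent DataFrame, avoiding SettingWithCopyWarning on later column assignments)
# Create a list with the feature names

df = df.dropna().copy()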
features = ['aroon', 'zscore', 'price_trend']

In [23]:
# Main Bit: Building the Model

# Split data into training and testing sets by date
# (a boolean mask keeps the two sets disjoint; label slices with .loc include both endpoints)
train_mask = df.index <= end_train
X_train, X_test = df.loc[train_mask, features], df.loc[~train_mask, features]
y_train, y_test = df.loc[train_mask, 'Target_cat'], df.loc[~train_mask, 'Target_cat']

# Classification model : XGBoost
model = XGBClassifier()
model.fit(X_train, y_train)

[Output: fitted XGBClassifier parameter table (truncated). objective = 'binary:logistic'; enable_categorical = False; base_score, booster, callbacks, colsample_bylevel, colsample_bynode, colsample_bytree, device, early_stopping_rounds shown with no value.]


In [26]:
# Evaluating the model: check the performance using scikit learn classification report

y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.47      0.45      0.46       921
           1       0.54      0.56      0.55      1063

    accuracy                           0.51      1984
   macro avg       0.51      0.51      0.51      1984
weighted avg       0.51      0.51      0.51      1984



In [27]:
# An accuracy of 0.51 is close to a coin flip. For comparison, always predicting
# class 1 would score 1063/1984 (about 0.54) on this test set.

# Backtesting the Strategy: Very Important Step!

# We need to check whether the model is profitable.
# To do so, we look at the accumulated return from trading on the model's predictions.
# Predictions are 0 or 1; we map them to -1 (short) or 1 (long) to use as the signal.
# The model's return is the signal multiplied by the next day's return,
# e.g. a signal of -1 (sell) with a negative return (price went down) gives a positive return for the model.


df['train_test'] = np.where(df.index > end_train, 'Test', 'Train')
y_pred_all = model.predict(df[features])
df['Signal'] = np.where(y_pred_all == 1, 1, -1)
df['Model_Returns'] = df['Signal'] * df['Target']

fig = px.line(df, x = df.index, y = df['Model_Returns'].cumsum()*100, color="train_test",
            labels={'y': 'Cumulative Returns (%)'},
            title=f"{model.__class__.__name__} - {ticker}",
            line_shape='linear')

fig.show()



## Summary of Concepts, Methodology, and Strategy

This section provides an overview of the data pipeline, technical
indicators, machine learning approach, and backtesting logic implemented
in this notebook.

------------------------------------------------------------------------

### 1. Objective

The goal of this project is to build a simple **binary classification
model** that predicts whether a stock's next-day return will be positive
or negative.\
The target is defined as:

-   **1** → next-day return is positive\
-   **0** → next-day return is zero or negative

This allows the model to serve as the engine of a long/short trading
strategy.

------------------------------------------------------------------------

### 2. Data Collection and Preprocessing

Daily OHLCV price data for AAPL is downloaded from Yahoo Finance using
`yfinance`, covering 2012--2025. The following fields are computed:

-   **Daily Returns**\
    $\text{Returns}_t = (\text{Close}_t - \text{Close}_{t-1}) / \text{Close}_{t-1}$

-   **Next-Day Return (Target)**\
    $\text{Target}_t = \text{Returns}_{t+1}$

-   **Directional Label**\
    $\text{Target\_cat}_t = 1$ if $\text{Target}_t > 0$, else $0$

These labels allow the machine learning model to learn directional price
movement.
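
The three definitions above can be sketched on a toy price series. The prices below are made up for illustration and are not AAPL data; the column names mirror the notebook's.

```python
import numpy as np
import pandas as pd

# Hypothetical closing prices, for illustration only.
toy = pd.DataFrame({"Close": [100.0, 102.0, 101.0, 101.0, 104.0]})

# Daily return: (Close_t - Close_{t-1}) / Close_{t-1}
toy["Returns"] = toy["Close"].pct_change()

# Next-day return, aligned to the current row.
toy["Target"] = toy["Returns"].shift(-1)

# Directional label: 1 if the next-day return is positive, else 0.
toy["Target_cat"] = np.where(toy["Target"] > 0, 1, 0)

print(toy)
```

Note that the last row has no next-day return, so its `Target` is NaN and `np.where` assigns it label 0. In the notebook that row is removed by the later `dropna()` call.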

------------------------------------------------------------------------

### 3. Feature Engineering

Several technical indicators are created to help the model detect
structure in price movements.

#### 3.1 Z-Score (normalized price deviation)

-   Rolling 15-day moving average\
-   Rolling 15-day standard deviation\
-   Z-Score defined as:\
    $z_t = (\text{Close}_t - \text{MA}_{15,t}) / \text{std}_{15,t}$

This captures how far the current price deviates from its recent
average.
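
As a minimal sketch, the rolling z-score can be wrapped in a small helper. The function name `rolling_zscore` and the toy input are hypothetical; the calculation follows the notebook's use of `rolling(15)` with pandas' default sample standard deviation.

```python
import pandas as pd

def rolling_zscore(close: pd.Series, window: int = 15) -> pd.Series:
    """Distance of each close from its rolling mean, in rolling standard deviations."""
    mean = close.rolling(window).mean()
    std = close.rolling(window).std()  # pandas default: sample std (ddof=1)
    return (close - mean) / std

# Hypothetical, steadily rising prices.
close = pd.Series(range(1, 21), dtype=float)
z = rolling_zscore(close, window=15)
print(z.tail())
```

The first `window - 1` values are NaN because the rolling window is not yet full.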

#### 3.2 Aroon Oscillator

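The Aroon Oscillator measures trend strength based on how long it has
been since the most recent high and low within a lookback window (14
periods here). It is the difference between Aroon Up and Aroon Down and
ranges from -100 to +100.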
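
The notebook relies on TA-Lib's `AROONOSC` for this indicator. As a rough illustration of the idea only, the sketch below implements one common textbook definition in pandas; the helper name `aroon_oscillator` is hypothetical, and its window and tie-breaking conventions are assumptions that may not match TA-Lib exactly.

```python
import numpy as np
import pandas as pd

def aroon_oscillator(high: pd.Series, low: pd.Series, period: int = 14) -> pd.Series:
    """Aroon Up minus Aroon Down over a window of period + 1 bars (assumed convention)."""
    window = period + 1
    # Position of the extreme inside each window: 0 = oldest bar, period = newest bar.
    pos_high = high.rolling(window).apply(np.argmax, raw=True)
    pos_low = low.rolling(window).apply(np.argmin, raw=True)
    aroon_up = 100.0 * pos_high / period      # 100 when the high is the latest bar
    aroon_down = 100.0 * pos_low / period     # 100 when the low is the latest bar
    return aroon_up - aroon_down

# Hypothetical steadily rising market: each bar sets a new high.
high = pd.Series(np.arange(1.0, 21.0))
low = high - 1.0
osc = aroon_oscillator(high, low)
print(osc.tail())
```

In a steady uptrend the newest bar is always the highest and the oldest bar in the window is the lowest, so the oscillator sits at its upper bound.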

#### 3.3 Price Trend

Short-term cumulative momentum, the sum of the previous four daily
returns:
$\text{price\_trend}_t = \sum_{i=1}^{4} \text{Returns}_{t-i}$
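
A small sketch of this feature on made-up returns, using the same `shift().rolling(4).sum()` pattern as the notebook. The shift excludes the current day's return, so only the four preceding days are summed.

```python
import pandas as pd

# Hypothetical daily returns, for illustration only.
returns = pd.Series([0.01, -0.02, 0.03, 0.01, 0.02, -0.01])

# shift() moves each return one row down, so row t sees returns t-1 ... t-4.
price_trend = returns.shift().rolling(4).sum()
print(price_trend)
```

The first four rows are NaN: the shift introduces one missing value and the four-row window needs four valid inputs.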

#### 3.4 Quantile Binning

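To stabilize the features and help the model generalize, each technical
indicator is transformed into discrete quantile bins using `pd.qcut`:
six bins for the z-score and the price trend, four for the Aroon
Oscillator. The bin edges are computed over the full sample, so they
also reflect the test period.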
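
The notebook applies `pd.qcut` to the whole dataset at once. One possible alternative, sketched below on synthetic data, is to learn the quantile edges on the training slice only and reuse them for the test slice with `pd.cut`. The variable names and the 300/200 split are assumptions for illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
values = pd.Series(rng.normal(size=500))      # hypothetical feature values
train, test = values.iloc[:300], values.iloc[300:]

# Learn six quantile bins from the training slice only.
train_codes, edges = pd.qcut(train, 6, labels=False, retbins=True)

# Open the outer edges so unseen extreme test values still fall into a bin.
edges[0], edges[-1] = -np.inf, np.inf

train_bins = pd.cut(train, bins=edges, labels=False)
test_bins = pd.cut(test, bins=edges, labels=False)

print(train_bins.value_counts().sort_index())
print(test_bins.value_counts().sort_index())
```

With this approach the training bins are equally populated by construction, while the test bins need not be, which reflects how the feature distribution drifts out of sample.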

------------------------------------------------------------------------

### 4. Train/Test Split

To avoid look-ahead bias, a time-based split is used:

-   Training: 2012--2017\
-   Testing: 2018--2025
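
A minimal sketch of a date-based split using a boolean mask on the index. The toy frame and its business-day index are hypothetical; the cutoff matches the notebook's `end_train`.

```python
import numpy as np
import pandas as pd

# Hypothetical frame indexed by business days around the cutoff.
idx = pd.bdate_range("2017-12-26", "2018-01-05")
toy = pd.DataFrame({"feature": np.arange(len(idx))}, index=idx)

end_train = pd.Timestamp("2017-12-31")
train_mask = toy.index <= end_train

train, test = toy[train_mask], toy[~train_mask]
print(len(train), len(test))
```

Because the mask and its negation partition the rows, no date can appear in both sets, and every training row precedes every test row.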

------------------------------------------------------------------------

### 5. Machine Learning Model

An XGBoost classifier (`XGBClassifier` with default hyperparameters) is
trained on the three binned features.

------------------------------------------------------------------------

### 6. Model Evaluation

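A classification report evaluates accuracy, precision, recall, and
F1-score on the test set. On the 1,984 test days the model reaches an
accuracy of 0.51, with F1-scores of 0.46 for class 0 and 0.55 for
class 1.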
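
To put the accuracy in context, it can help to compare it with a naive baseline that always predicts the majority class. The sketch below uses only the class supports and the accuracy printed in the notebook's classification report; the variable names are illustrative.

```python
# Class supports and accuracy copied from the classification report above.
support = {0: 921, 1: 1063}
reported_accuracy = 0.51

total = sum(support.values())
majority_class = max(support, key=support.get)

# Accuracy of always predicting the majority class.
baseline_accuracy = support[majority_class] / total

print(f"majority class: {majority_class}")
print(f"baseline accuracy: {baseline_accuracy:.3f}")
print(f"model minus baseline: {reported_accuracy - baseline_accuracy:+.3f}")
```

On these numbers the always-up baseline scores slightly higher than the model, which suggests the reported accuracy alone is weak evidence of predictive skill.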

------------------------------------------------------------------------

### 7. Backtesting the Strategy

#### 7.1 Prediction → Trading Signal

Signal = +1 (long) if the model predicts a positive next-day return, else -1 (short)

#### 7.2 Strategy Returns

Model_Returns_t = Signal_t × Target_t

#### 7.3 Cumulative Performance

$\text{Cumulative\_Returns}_t = \sum_{s \le t} \text{Model\_Returns}_s$
(a simple sum of daily returns, without compounding)
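
The three backtest steps can be sketched on made-up predictions and returns. The compounded curve at the end is an addition for comparison and is not computed in the notebook; like the notebook's backtest, this sketch assumes no transaction costs.

```python
import numpy as np
import pandas as pd

# Hypothetical model predictions and next-day returns.
pred = np.array([1, 0, 1, 1, 0])
target = pd.Series([0.01, -0.02, -0.01, 0.03, 0.02])

# 7.1 Prediction -> trading signal: 1 becomes +1 (long), 0 becomes -1 (short).
signal = np.where(pred == 1, 1, -1)

# 7.2 Strategy return: the signal times the next day's return.
model_returns = signal * target

# 7.3 Cumulative performance as a simple sum, as in the notebook.
cum_sum = model_returns.cumsum()

# Assumed alternative: compound the daily returns instead of summing them.
cum_compound = (1 + model_returns).cumprod() - 1

print(pd.DataFrame({"sum": cum_sum, "compound": cum_compound}))
```

The two curves stay close over short horizons with small returns but can diverge over years of daily data, so the choice matters when reading the final cumulative figure.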

------------------------------------------------------------------------

### 8. Takeaways

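This notebook demonstrates a complete, if basic, quantitative ML
workflow: data download, labelling, feature engineering, a time-based
train/test split, model training, evaluation, and backtesting.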
