<a href="https://colab.research.google.com/github/nickhward/ML_Trading_methods/blob/main/macd_ml_baseline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Baseline XGBoost Model: An Introduction

This notebook establishes a baseline **XGBoost model** that serves as the benchmark for all subsequent model iterations.

## Objective

The baseline is the reference point against which later models are compared, so that any gain can be attributed to a specific change. Improvements will be measured in the following areas:

- **Data Preprocessing**
- **Feature Selection**
- **Hyperparameter Tuning**

Comparing every change against the same baseline keeps model development systematic and makes it possible to track the effect of each modification over time.

In [1]:
import pandas as pd
import numpy as np

macd_data = pd.read_csv('/content/drive/MyDrive/macd new - macd_strat.csv')

macd_data.head()

| | Unnamed: 0 | Outcome | Order | Date | EMA | Points | Take Profit | Stop Loss | Entry | Amount | Profit | Challenge Profit | Unnamed: 12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | LOSS | SELL | 2015-12-01 10:20:00 | 17790.5254 | 94.17 | 17644.275 | 17879.7 | 17785.53 | -2000 | -2000.0 | -2000.0 | |
| 1 | 1 | LOSS | BUY | 2015-12-02 12:30:00 | 17829.52727 | 22.45 | 17885.585 | 17829.46 | 17851.91 | -2000 | -4000.0 | -4000.0 | |
| 2 | 2 | LOSS | BUY | 2015-12-02 13:35:00 | 17832.37283 | 14.46 | 17865.61 | 17829.46 | 17843.92 | -2000 | -6000.0 | -6000.0 | |
| 3 | 3 | WIN | SELL | 2015-12-14 14:50:00 | 17348.06082 | 31.53 | 17268.295 | 17347.12 | 17315.59 | 3000 | -3000.0 | -3000.0 | |
| 4 | 4 | LOSS | BUY | 2015-12-16 12:00:00 | 17497.78091 | 42.16 | 17627.5 | 17522.1 | 17564.26 | -2000 | -5000.0 | -5000.0 | |


In [2]:
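# Split trades by direction; .copy() makes each subset an independent DataFrame,
# so later column assignments do not trigger SettingWithCopyWarning
macd_sells = macd_data[macd_data['Order'] == 'SELL'].copy()
macd_buys = macd_data[macd_data['Order'] == 'BUY'].copy()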

In [3]:
us30_data = pd.read_csv('/content/drive/MyDrive/file.csv')
# The file is stored newest-first; reverse it so rows run in chronological order
us30_data = us30_data.iloc[::-1].reset_index(drop=True)

us30_data['200_EMA'] = us30_data['Last'].ewm(span=200, adjust=False).mean()
us30_data.describe()

| | Open | High | Low | Last | Change | Volume | 200_EMA |
|---|---|---|---|---|---|---|---|
| count | 149333.0 | 149333.0 | 149333.0 | 149333.0 | 149333.0 | 149333.0 | 149333.0 |
| mean | 26238.988747 | 26252.213311 | 26225.603104 | 26239.006851 | 0.103376 | 3685441.0 | 26228.58718 |
| std | 5579.442101 | 5582.587937 | 5576.117095 | 5579.41565 | 32.911242 | 6649631.0 | 5576.591825 |
| min | 15464.97 | 15471.44 | 15450.56 | 15464.9 | -2250.46 | 0.0 | 15793.006835 |
| 25% | 21815.66 | 21825.64 | 21806.6 | 21815.25 | -7.83 | 1722201.0 | 21815.681169 |
| 50% | 25904.94 | 25916.54 | 25891.99 | 25904.55 | 0.0 | 2615862.0 | 25910.012643 |
| 75% | 30895.44 | 30916.07 | 30874.92 | 30896.34 | 8.23 | 3855830.0 | 30910.314944 |
| max | 36947.89 | 36952.65 | 36934.63 | 36947.65 | 1480.81 | 569378300.0 | 36752.325894 |


In [6]:
def feature_engineering_us30(data):
    """Add MACD (12, 26) and 9-period signal line columns; modifies `data` in place."""
    exp12 = data['Last'].ewm(span=12, adjust=False).mean()
    exp26 = data['Last'].ewm(span=26, adjust=False).mean()
    macd_line = exp12 - exp26
    signal_line = macd_line.ewm(span=9, adjust=False).mean()

    data['MACD Line'] = macd_line
    data['Signal Line'] = signal_line
    return data


In [7]:
feature_data = feature_engineering_us30(us30_data)
feature_data.rename(columns={'Time':'Date'}, inplace=True)
feature_data.head()


| | Date | Open | High | Low | Last | Change | %Chg | Volume | 200_EMA | MACD Line | Signal Line |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 11/17/2015 09:30 | 17486.99 | 17523.64 | 17486.99 | 17503.73 | 20.72 | +0.12% | 6745474 | 17503.73 | 0.0 | 0.0 |
| 1 | 11/17/2015 09:35 | 17501.59 | 17510.94 | 17476.07 | 17480.54 | -23.19 | -0.13% | 2175392 | 17503.499254 | -1.849915 | -0.369983 |
| 2 | 11/17/2015 09:40 | 17480.31 | 17487.96 | 17466.98 | 17481.08 | 0.54 | 0.00% | 2122478 | 17503.276177 | -3.235119 | -0.94301 |
| 3 | 11/17/2015 09:45 | 17480.61 | 17483.01 | 17462.98 | 17468.85 | -12.23 | -0.07% | 2320527 | 17502.933628 | -5.259139 | -1.806236 |
| 4 | 11/17/2015 09:50 | 17468.92 | 17480.28 | 17453.62 | 17473.63 | 4.78 | +0.03% | 2042108 | 17502.642049 | -6.403665 | -2.725722 |
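
The first rows of `MACD Line` and `Signal Line` can be reproduced by hand, which makes clear what `ewm(span=..., adjust=False).mean()` computes: a recursive average seeded with the first value, with smoothing factor `alpha = 2 / (span + 1)`. The following is a minimal pure-Python sketch; the helper name `ema` is hypothetical, and the three prices are the first `Last` values shown above.

```python
def ema(values, span):
    """Recursive EMA, equivalent to pandas ewm(span=span, adjust=False).mean()."""
    alpha = 2.0 / (span + 1)
    out = [values[0]]  # seeded with the first observation
    for x in values[1:]:
        out.append(alpha * x + (1 - alpha) * out[-1])
    return out

# First three 'Last' prices from the table above
last = [17503.73, 17480.54, 17481.08]

# MACD line = fast EMA (12) minus slow EMA (26); signal line = EMA (9) of the MACD line
macd_line = [fast - slow for fast, slow in zip(ema(last, 12), ema(last, 26))]
signal_line = ema(macd_line, 9)

for m, s in zip(macd_line, signal_line):
    print(f"{m:.6f}  {s:.6f}")
```

The printed values agree with rows 0 to 2 of the table, for example `-1.849915` and `-0.369983` for row 1.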


In [8]:
# Parse timestamps, then inner-join each trade with the market features recorded at the same timestamp

macd_sells['Date'] = pd.to_datetime(macd_sells['Date'])
macd_buys['Date'] = pd.to_datetime(macd_buys['Date'])
feature_data['Date'] = pd.to_datetime(feature_data['Date'])
macd_sells = macd_sells.reset_index()
macd_buys = macd_buys.reset_index()

macd_sells = pd.merge(macd_sells, feature_data, on = 'Date')
macd_buys = pd.merge(macd_buys, feature_data, on='Date')

print(f'Macd_buys shape: {macd_buys.shape}')
print(f'Macd_sells shape: {macd_sells.shape}')

for_later = macd_buys


Macd_buys shape: (762, 24)
Macd_sells shape: (277, 24)
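
`pd.merge` defaults to an inner join, so any trade whose timestamp has no matching bar in `feature_data` is dropped without warning. One way to check for this, sketched below on made-up data, is a left join with `indicator=True`. The frames and their contents here are hypothetical and only illustrate the check.

```python
import pandas as pd

# Hypothetical toy data: three trades, but only two have a matching price bar
trades = pd.DataFrame({
    'Date': pd.to_datetime(['2015-12-01 10:20', '2015-12-02 12:30', '2015-12-02 12:32']),
    'Outcome': ['LOSS', 'LOSS', 'WIN'],
})
bars = pd.DataFrame({
    'Date': pd.to_datetime(['2015-12-01 10:20', '2015-12-02 12:30']),
    'Volume': [1000, 2000],
})

# Inner join (the default) keeps only trades with a matching timestamp
inner = pd.merge(trades, bars, on='Date')

# Left join with an indicator column exposes the trades that found no match
checked = pd.merge(trades, bars, on='Date', how='left', indicator=True)
unmatched = checked[checked['_merge'] == 'left_only']

print(len(trades), len(inner), len(unmatched))
```

Applied to the real frames, a non-empty `unmatched` would mean that some trades never reach the model.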


In [9]:
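# Drop index/helper columns, raw prices and trade levels, and columns derived from
# the trade result (Amount, Profit, Challenge Profit), which would leak the outcome
dropping = ['200_EMA', 'Unnamed: 12', 'Open', 'High', 'Low', 'Last', 'index', 'Unnamed: 0', 'Challenge Profit', 'Entry', 'Amount', 'Profit', 'Change', '%Chg', 'Take Profit', 'Stop Loss', 'Order', 'EMA', 'Date']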

macd_sells['distance_ema'] = abs(macd_sells['Low'] - macd_sells['EMA'])
macd_buys['distance_ema'] = abs(macd_buys['Low'] - macd_buys['EMA'])
macd_sells['hour'] = macd_sells['Date'].dt.hour
macd_buys['hour'] = macd_buys['Date'].dt.hour
macd_sells['minute'] = macd_sells['Date'].dt.minute
macd_buys['minute'] = macd_buys['Date'].dt.minute

macd_sells = macd_sells.drop(columns=dropping)
macd_buys = macd_buys.drop(columns=dropping)

print(macd_buys.columns)

Index(['Outcome', 'Points', 'Volume', 'MACD Line', 'Signal Line',
       'distance_ema', 'hour', 'minute'],
      dtype='object')


In [10]:
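# Encode the target as binary: WIN -> 1, LOSS -> 0
map_target = {'WIN' : 1, 'LOSS' : 0}
macd_buys['Outcome'] = macd_buys['Outcome'].map(map_target)
macd_buys = macd_buys.dropna()
# The baseline is trained on BUY trades only; macd_sells is not used below
y = macd_buys.pop('Outcome')
X = macd_buys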



In [11]:
from sklearn.model_selection import cross_validate, StratifiedKFold
from sklearn.metrics import make_scorer, precision_score, recall_score, f1_score, accuracy_score
from xgboost import XGBClassifier


# Default-parameter XGBoost, evaluated with 5-fold stratified cross-validation on X and y from the previous cell
xgb_model = XGBClassifier()
cv = StratifiedKFold(n_splits=5)

# Define dictionary of metrics
scoring = {'accuracy': make_scorer(accuracy_score), 
           'precision': make_scorer(precision_score), 
           'recall': make_scorer(recall_score),
           'f1': make_scorer(f1_score)}

scores = cross_validate(xgb_model, X, y, cv=cv, scoring=scoring, return_train_score=True)

print("Train Accuracy: ", scores['train_accuracy'].mean())
print("Test Accuracy: ", scores['test_accuracy'].mean())
print("Train Precision: ", scores['train_precision'].mean())
print("Test Precision: ", scores['test_precision'].mean())
print("Train Recall: ", scores['train_recall'].mean())
print("Test Recall: ", scores['test_recall'].mean())
print("Train F1 Score: ", scores['train_f1'].mean())
print("Test F1 Score: ", scores['test_f1'].mean())

Train Accuracy:  1.0
Test Accuracy:  0.5406690746474029
Train Precision:  1.0
Test Precision:  0.4021292205946937
Train Recall:  1.0
Test Recall:  0.32686409307244846
Train F1 Score:  1.0
Test F1 Score:  0.35442555227944117


# Addressing Overfitting: A Priority

From the initial assessment, it's clear that the XGBoost model overfits heavily. It scores a perfect 1.0 on accuracy, precision, recall, and F1 on the training folds, but on the held-out (test) folds accuracy falls to about 0.54, precision to about 0.40, recall to about 0.33, and F1 to about 0.35.

The gap between training and test performance indicates that the model has memorized the training data rather than learned patterns that generalize to unseen data.
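
The size of the gap can be made explicit by subtracting the test score from the training score for each metric. The sketch below does this with the scores reported above, rounded to four decimals; the variable names are illustrative.

```python
# Mean cross-validation scores reported above (rounded to four decimals)
train = {'accuracy': 1.0, 'precision': 1.0, 'recall': 1.0, 'f1': 1.0}
test = {'accuracy': 0.5407, 'precision': 0.4021, 'recall': 0.3269, 'f1': 0.3544}

# Generalization gap per metric: train score minus test score
gap = {name: round(train[name] - test[name], 4) for name in train}

for name, value in gap.items():
    print(f"{name:<10} gap = {value:.4f}")
```

Recall shows the largest gap, so the baseline misses most winning trades it has not seen during training.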

The next iterations will therefore focus on curbing this overfitting, using methods such as regularization, early stopping, or gathering more data.

## Model Performance Summary

Here is the performance of the XGBoost model on the training and test folds, averaged over the 5-fold cross-validation:

- **Training Accuracy**: 1.0
- **Test Accuracy**: 0.5406690746474029

- **Training Precision**: 1.0
- **Test Precision**: 0.4021292205946937

- **Training Recall**: 1.0
- **Test Recall**: 0.32686409307244846

- **Training F1 Score**: 1.0
- **Test F1 Score**: 0.35442555227944117