# Investment strategies using machine learning

Made by Junho Kook

This notebook is a draft of a Python program that uses machine learning to help plan investment strategies.

# Purpose of Project

This project generates technical indicators (moving average, volume moving average, RSI) from financial market data and uses machine learning models (XGBoost, Random Forest) to predict the short-term direction (up or down) of a stock price. Beyond simple data analysis, it evaluates experimentally how a machine-learning-based investment strategy performs, and it can serve as a starting point for algorithmic trading, strategy backtesting, and portfolio rebalancing in real investment settings.

Concretely, the project builds features from technical indicators commonly used in investing, computed on ETF data, and trains a binary classification model that predicts the direction of the next day's return (up/down). The predictions can feed a real investment strategy: an investor might buy when a rise is predicted and take a defensive position when a fall is predicted.
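As a rough illustration of how up/down predictions might drive a strategy, the sketch below holds the asset only on days predicted to rise and stays in cash otherwise. The function names `long_flat_returns` and `cumulative_return` and all sample numbers are hypothetical, and transaction costs are ignored.

```python
def long_flat_returns(predictions, next_day_returns):
    """Return per-day strategy returns for a long-or-flat rule.

    predictions: 1 if the next day is predicted to rise, else 0.
    next_day_returns: realized next-day returns as fractions (0.01 = 1%).
    """
    if len(predictions) != len(next_day_returns):
        raise ValueError("inputs must have the same length")
    # Hold the asset when a rise is predicted, otherwise earn nothing (cash).
    return [r if p == 1 else 0.0 for p, r in zip(predictions, next_day_returns)]


def cumulative_return(returns):
    """Compound a list of per-period returns into one total return."""
    total = 1.0
    for r in returns:
        total *= 1.0 + r
    return total - 1.0


# Made-up predictions and returns, for illustration only.
preds = [1, 0, 1, 0]
rets = [0.01, -0.02, 0.005, 0.01]

strategy = long_flat_returns(preds, rets)
print(strategy)                     # [0.01, 0.0, 0.005, 0.0]
print(cumulative_return(strategy))  # about 0.01505
print(cumulative_return(rets))      # buy-and-hold, for comparison
```

A real backtest would also need to account for trading costs and for the timing of when a prediction becomes available.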

## 1. Import required libraries

In [1]:
import warnings
import glob
import os
import datetime

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn import preprocessing, svm
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LinearRegression
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             mean_squared_error, precision_score, r2_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import (GridSearchCV, TimeSeriesSplit,
                                     cross_validate, train_test_split)
from sklearn.preprocessing import StandardScaler
from xgboost import XGBClassifier, plot_importance

sns.set()

## 2. Import Dataset

In [4]:
# Load the main price and volume data for the ETFs (Exchange-Traded Funds)
df = pd.read_csv('ETFs_main.csv')

## 3. Creating Technical Indicators

Generate standard technical analysis indicators and use them as input features for the machine learning models.

In [5]:
# Calculate the n-day moving average (MA) of the closing price (CLOSE_SPY)
def moving_average(df, n):
    MA = pd.Series(df['CLOSE_SPY'].rolling(n, min_periods=n).mean(), name='MA_' + str(n))
    df = df.join(MA)
    return df

# Calculate the n-day moving average (VMA) of VOLUME
def volume_moving_average(df, n):
    VMA = pd.Series(df['VOLUME'].rolling(n, min_periods=n).mean(), name='VMA_' + str(n))
    df = df.join(VMA)
    return df

# Calculate the n-day RSI (Relative Strength Index), which compares the size of recent
# gains with recent losses; this version uses simple rolling means of both
def relative_strength_index(df, n):
    delta = df['CLOSE_SPY'].diff()
    gain = (delta.where(delta > 0, 0)).rolling(window=n).mean()
    loss = (-delta.where(delta < 0, 0)).rolling(window=n).mean()
    RS = gain / loss
    RSI = 100 - (100 / (1 + RS))
    RSI.name = 'RSI_' + str(n)
    df = df.join(RSI)
    return df
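To make the rolling calculations concrete, the sketch below applies the same formulas to six made-up prices with a 3-day window. The prices and the short window are assumptions chosen for readability. Note that this RSI uses simple rolling means, which differs from the smoothed averages of the classic Wilder formulation.

```python
import pandas as pd

# Made-up closing prices, for illustration only.
close = pd.Series([10.0, 11.0, 12.0, 11.0, 13.0, 12.0])
n = 3  # window length (the notebook uses 45 for the averages and 14 for RSI)

# n-day simple moving average: NaN until n observations are available.
ma = close.rolling(n, min_periods=n).mean()

# RSI from simple rolling means of gains and losses.
delta = close.diff()
gain = delta.where(delta > 0, 0).rolling(window=n).mean()
loss = (-delta.where(delta < 0, 0)).rolling(window=n).mean()
rsi = 100 - 100 / (1 + gain / loss)

print(pd.DataFrame({'close': close, 'MA_3': ma, 'RSI_3': rsi}).round(2))
```

In the third row the window contains no losses, so the gain/loss ratio is infinite and the RSI comes out as 100; the following rows mix gains and losses and fall between 0 and 100.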

In [6]:
# Apply technical indicators
df = moving_average(df, 45)
df = volume_moving_average(df, 45)
df = relative_strength_index(df, 14)

# Set 'Dates' column as index
df = df.set_index('Dates')
df = df.dropna()

print(len(df))

2727


## 4. Create a target variable

In [7]:
# Compute the daily return of CLOSE_SPY, from which the target is derived
df['pct_change'] = df['CLOSE_SPY'].pct_change()

# Binary classification target: 1 if the daily return is positive, 0 otherwise
# (the first row, whose return is NaN, is also labeled 0)
df['target'] = np.where(df['pct_change'] > 0, 1, 0)
df = df.dropna(subset=['target'])  # Remove missing values

df['target'] = df['target'].astype(np.int64)

print(df['target'].value_counts())

target
1    1471
0    1256
Name: count, dtype: int64


In [8]:
# Shift the target variable for next-day prediction
df['target'] = df['target'].shift(-1)
df = df.dropna()
print(len(df))

2725
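The effect of `shift(-1)` is easiest to see on a tiny example. The sketch below uses four made-up closing prices (the frame `toy` and its column names are hypothetical) and shows that each row ends up carrying the direction of the following day, while the last row loses its label.

```python
import numpy as np
import pandas as pd

# Made-up closing prices on four consecutive trading days.
toy = pd.DataFrame({'close': [100.0, 101.0, 100.5, 102.0]},
                   index=['day1', 'day2', 'day3', 'day4'])

# Same-day direction: 1 if the close rose versus the previous day.
toy['up_today'] = np.where(toy['close'].pct_change() > 0, 1, 0)

# shift(-1) moves each label up one row, so every row now carries
# the direction of the following day; the last row has no next day.
toy['target'] = toy['up_today'].shift(-1)

print(toy)
print(toy.dropna())  # the last row is dropped, as in the notebook
```

The shift also turns the integer labels into floats, which is why the label counts further down are shown as `1.0` and `0.0`.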


In [9]:
# Separate the explanatory (feature) variables from the target variable
y_var = df['target']
x_var = df.drop(['target', 'OPEN', 'HIGH', 'LOW', 'VOLUME', 'CLOSE_SPY', 'pct_change'], axis=1)

In [10]:
# Check the share of up days (up / total)
up = df[df['target'] == 1].target.count()
total = df.target.count()
print('up/down ratio: {0:.2f}'.format(up / total))

up/down ratio: 0.54


## 5. Split train and test datasets

In [11]:
# Chronological 70/30 split: shuffle=False keeps the time order, so random_state has no effect
X_train, X_test, y_train, y_test = train_test_split(x_var, y_var, test_size=0.3, shuffle=False, random_state=3)

train_count = y_train.count()
test_count = y_test.count()

print('train set label ratio')
print(y_train.value_counts() / train_count)
print('test set label ratio')
print(y_test.value_counts() / test_count)

train set label ratio
target
1.0    0.543786
0.0    0.456214
Name: count, dtype: float64
test set label ratio
target
1.0    0.530562
0.0    0.469438
Name: count, dtype: float64
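Because `shuffle=False`, the split above amounts to slicing the rows in time order. The sketch below reproduces the sizes under the assumption that scikit-learn rounds the test share up and assigns the remaining rows to training; the list `rows` is a hypothetical stand-in for the dated rows.

```python
import math

n_samples = 2725   # rows left after the target shift above
test_size = 0.3

# Assumed rounding rule: the test share is rounded up, the rest is training.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test
print(n_train, n_test)  # 1907 818

# With shuffle=False the split is a plain slice in time order.
rows = list(range(n_samples))            # stand-in for the dated rows
train_rows, test_rows = rows[:n_train], rows[n_train:]
assert max(train_rows) < min(test_rows)  # no test row precedes a training row
```

The 818 test rows agree with the totals of the confusion matrices reported in section 6.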


In [12]:
x_var.head()

| Dates | CLOSE_GLD | CLOSE_FXY | CLOSE_T10Y2Y | CLOSE_TED | CLOSE_USO | CLOSE_UUP | CLOSE_VIX | CLOSE_VWO | MA_45 | VMA_45 | RSI_14 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 2007-04-30 | 67.09 | 83.7166 | 2.4361 | 0.57 | 51.24 | 24.49 | 14.22 | 40.935 | 143.601556 | 111646600.0 | 70.95672 |
| 2007-05-02 | 66.66 | 83.38 | 2.4366 | 0.59 | 49.59 | 24.66 | 13.08 | 42.02 | 143.680667 | 112161300.0 | 79.237288 |
| 2007-05-03 | 67.49 | 83.11 | 2.4346 | 0.6 | 49.28 | 24.69 | 13.09 | 42.435 | 143.780222 | 112342100.0 | 79.604579 |
| 2007-05-04 | 68.19 | 83.23 | 2.4006 | 0.6 | 48.3 | 24.6 | 12.91 | 42.595 | 143.905111 | 112885300.0 | 79.411765 |
| 2007-05-08 | 67.88 | 83.37 | 2.3913 | 0.6 | 48.64 | 24.73 | 13.21 | 42.36 | 144.029111 | 113135700.0 | 74.368231 |


## 6. Model Learning and Evaluation

- XGBoost
- RandomForest

- GridSearchCV
- Evaluation

In [13]:
# Confusion Matrix and Performance Evaluation Functions
def get_confusion_matrix(y_test, pred):
    confusion = confusion_matrix(y_test, pred)
    accuracy = accuracy_score(y_test, pred)
    precision = precision_score(y_test, pred)
    recall = recall_score(y_test, pred)
    f1 = f1_score(y_test, pred)
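    # Note: pred holds hard 0/1 labels, so this AUC reflects a single decision
    # threshold rather than a ranking by predicted probability
    roc_score = roc_auc_score(y_test, pred)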
    print('confusion matrix')
    print(confusion)
    print('accuracy: {0:.4f}, precision: {1:.4f}, recall: {2:.4f}, F1: {3:.4f}, ROC AUC score: {4:.4f}'.format(
        accuracy, precision, recall, f1, roc_score))

In [14]:
# Train the XGBoost model and predict on the test set
xgb_dis = XGBClassifier(n_estimators=400, learning_rate=0.1, max_depth=3)
xgb_dis.fit(X_train, y_train)
xgb_pred = xgb_dis.predict(X_test)

# Accuracy on the training set, for comparison with the test metrics
print(xgb_dis.score(X_train, y_train))

get_confusion_matrix(y_test, xgb_pred)

0.8479286837965391
confusion matrix
[[333  51]
 [358  76]]
accuracy: 0.5000, precision: 0.5984, recall: 0.1751, F1: 0.2709, ROC AUC score: 0.5212
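The metrics printed above follow directly from the four cells of the confusion matrix. The sketch below recomputes them by hand from the XGBoost matrix; the variable names are illustrative, and the last line adds a reference point (always predicting "up") that is not part of the original evaluation function.

```python
# Confusion matrix reported for XGBoost: rows = actual, columns = predicted.
tn, fp = 333, 51   # actual down: predicted down / predicted up
fn, tp = 358, 76   # actual up:   predicted down / predicted up

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)            # true positive rate
specificity = tn / (tn + fp)       # true negative rate
f1 = 2 * precision * recall / (precision + recall)

# With hard 0/1 predictions the ROC curve has a single interior point,
# so its area reduces to the mean of recall and specificity.
roc_auc_from_labels = (recall + specificity) / 2

# Reference point: always predicting "up" on this test set.
always_up_accuracy = (tp + fn) / total

print(f'accuracy: {accuracy:.4f}, precision: {precision:.4f}, '
      f'recall: {recall:.4f}, F1: {f1:.4f}, ROC AUC: {roc_auc_from_labels:.4f}')
print(f'always-up accuracy: {always_up_accuracy:.4f}')
```

The first line reproduces the reported 0.5000, 0.5984, 0.1751, 0.2709, and 0.5212. The always-up reference comes out at 0.5306, which is the share of up days in the test set, so the model's test accuracy sits below that simple reference while its training accuracy is about 0.85.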


In [15]:
# Tune a RandomForest model with a grid search over the parameters below
n_estimators = range(10, 200, 10)
params = {
    'bootstrap': [True],
    'n_estimators': n_estimators,
    'max_depth': [4, 6, 8, 10, 12],
    'min_samples_leaf': [2, 3, 4, 5],
    'min_samples_split': [2, 4, 6, 8, 10],
    'max_features': [4]
}

# Cross-validation settings: expanding-window splits that respect the time order
my_cv = TimeSeriesSplit(n_splits=5)

# Fit the models with GridSearchCV
clf = GridSearchCV(RandomForestClassifier(), params, cv=my_cv, n_jobs=-1)
clf.fit(X_train, y_train)

# Print the best parameters and the best mean cross-validation score (accuracy)
print('best parameter:\n', clf.best_params_)
print('best prediction: {0:.4f}'.format(clf.best_score_))

best parameter:
 {'bootstrap': True, 'max_depth': 4, 'max_features': 4, 'min_samples_leaf': 4, 'min_samples_split': 6, 'n_estimators': 70}
best prediction: 0.5565
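To show what the time-series cross-validation does, the sketch below prints the folds that `TimeSeriesSplit(n_splits=5)` produces for twelve toy samples and counts the candidates in the grid searched above. The array `X_toy` is a hypothetical stand-in for `X_train`; the parameter lists are copied from the notebook.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid, TimeSeriesSplit

# Twelve time-ordered samples standing in for X_train.
X_toy = np.arange(12).reshape(-1, 1)

# Each fold trains on an expanding window and validates on the next block.
folds = list(TimeSeriesSplit(n_splits=5).split(X_toy))
for train_idx, val_idx in folds:
    print('train:', train_idx.tolist(), 'validate:', val_idx.tolist())

# Size of the grid searched above (same value lists as the notebook).
params = {
    'bootstrap': [True],
    'n_estimators': list(range(10, 200, 10)),
    'max_depth': [4, 6, 8, 10, 12],
    'min_samples_leaf': [2, 3, 4, 5],
    'min_samples_split': [2, 4, 6, 8, 10],
    'max_features': [4],
}
n_candidates = len(ParameterGrid(params))
print(n_candidates, 'candidates x', len(folds), 'folds =',
      n_candidates * len(folds), 'fits')
```

Validation data always comes after the training data in each fold, so no future information leaks into the fit. The grid holds 1900 candidates, which means 9500 model fits across the five folds before the final refit.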




In [16]:
# Check performance on test dataset
pred_con = clf.predict(X_test)
accuracy_con = accuracy_score(y_test, pred_con)
print('accuracy: {0:.4f}'.format(accuracy_con))
get_confusion_matrix(y_test, pred_con)

accuracy: 0.5061
confusion matrix
[[319  65]
 [339  95]]
accuracy: 0.5061, precision: 0.5938, recall: 0.2189, F1: 0.3199, ROC AUC score: 0.5248
