<a href="https://colab.research.google.com/github/kennedynjoroge/10-steps-to-become-a-data-scientist/blob/master/Forex_Algorithmic_Trading_Prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Forex Algorithmic Trading Prediction


## 1.0 Defining the question
### Objective
Predict the highest and lowest price of the candle in the next upcoming hour.

### Metric for success
- Accuracy of predicted highest and lowest price.
- Cumulative net profit of the orders based on predicted signals.
- Number of won/lost orders
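
The profit and won/lost metrics can be computed directly from a table of closed orders. This is a minimal sketch, assuming a hypothetical `orders` DataFrame with one `pnl` value per order; the numbers are illustrative and are not results from this notebook.

```python
import pandas as pd

# Hypothetical per-order profit/loss values (illustrative only)
orders = pd.DataFrame({"pnl": [12.5, -8.0, 4.0, -3.5, 9.0]})

# Cumulative net profit after each order
orders["cumulative_net_profit"] = orders["pnl"].cumsum()

# Number of won and lost orders
won = int((orders["pnl"] > 0).sum())
lost = int((orders["pnl"] <= 0).sum())

print(orders)
print("won:", won, "lost:", lost)
```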

### Understanding the context
Foreign exchange (forex) is the global market for exchanging one currency for another. About 5.3 trillion dollars' worth of currency is traded daily. The forex market is open 24 hours a day, 5 days a week, closing only over the weekend. The trading day starts in Sydney, then moves to Tokyo, London, Frankfurt and finally New York, before trading starts all over again in Sydney.

Buying a currency is buying a share in a particular country. The price of the currency is usually a direct reflection of the market’s opinion on the current and future health of its respective economy.

Forex trading is the simultaneous buying of one currency and selling of another. Currencies are traded through a broker or dealer, and are traded in pairs, e.g. USD/JPY (US dollar/Japanese yen).

The forex market is a decentralized global network of trading partners, including banks, public and private institutions, retail dealers, speculators, and central banks involved in the business of buying and selling money. Trades can take place anywhere as long as you have an Internet connection. While commercial and financial transactions make up part of the trading volume, 90% of currency trading is based on speculation.

You would buy the pair if you believe the base currency will appreciate (gain value) relative to the quote currency.
You would sell the pair if you think the base currency will depreciate (lose value) relative to the quote currency.
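
The buy and sell rules above can be illustrated with a small profit calculation. This is a sketch only: `trade_profit` is a hypothetical helper, the prices are made up, and spread and commission are ignored.

```python
def trade_profit(entry_price, exit_price, units, side):
    """Profit in the quote currency for a closed position.

    side is "buy" (long the base currency) or "sell" (short the base currency).
    """
    direction = 1 if side == "buy" else -1
    return direction * (exit_price - entry_price) * units


# USD/JPY rises from 110.00 to 110.50: the buyer gains, the seller loses
long_profit = trade_profit(110.00, 110.50, 10_000, "buy")
short_profit = trade_profit(110.00, 110.50, 10_000, "sell")
print(long_profit, short_profit)
```

The result is expressed in the quote currency (JPY for USD/JPY).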

About 90% of retail traders lose money because of poor trading decisions. The objective is therefore to use machine learning to increase the chances of making a profit.
More information about forex trading is available at https://www.babypips.com/learn/forex/

### Data Source
- Forex Capital Markets (FXCM) is a retail broker. Its Python wrapper API is used to extract historical bid (sell) and ask (buy) data for the USD/JPY instrument.
Historical data allows detailed examination of a market's past behaviour, giving traders and investors perspective on the inner workings of that market.

- Quandl is a marketplace for financial and economic data. Its Python wrapper API is used to extract fundamental datasets, i.e. interest and employment rates.


### Assumptions
- Data from the demo/sandbox environment is similar to production.
- Currency and fundamental data, e.g. interest rates, are valid.

## 2.0 Libraries and Data Importation



### Import Libraries


In [0]:
import datetime as dt  # Date and time handling
import pytz  # Time zone conversion (UTC to GMT+3)
import numpy as np
import pandas as pd
import plotly.graph_objects as go
import matplotlib.pyplot as plt
plt.style.use('seaborn')  # named 'seaborn-v0_8' in matplotlib >= 3.6
%matplotlib inline

#Modelling
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn import svm
from sklearn.preprocessing import MinMaxScaler
from sklearn.impute import SimpleImputer

In [0]:
## FXCM API - for importing currencies
# API definition https://www.fxcm.com/fxcmpy/appendix.html
!pip  install fxcmpy -q 
!pip install python-socketio -q 
import fxcmpy
# api token/key
api = fxcmpy.fxcmpy(config_file='fxcm.cfg')

In [2]:
## Quandl API - for fundamental data import e.g. interest rate, employment rate.
# API Definition https://www.quandl.com/data/FRED-Federal-Reserve-Economic-Data/documentation
!pip install quandl -q 
import quandl
# api token/key
quandl.ApiConfig.api_key = "Y4Z_EXQ7qh7xxJhC8J6E"



### Import Bid and Ask Historical Data

Import historical bid and ask price data for the currency pair. To start with, 10 years of data are imported. If need be, more years will be added so that the model has more data to generalize from.

The dataset is hourly candle data. A candle is a sample of the raw underlying tick data combined with descriptive statistics about the sample itself (such as the tick count).
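
To make the idea of a candle concrete, the sketch below builds hourly candles from a handful of made-up tick prices using pandas resampling. The tick values and timestamps are hypothetical; FXCM delivers ready-made candles, so this is not how the notebook's data is produced.

```python
import pandas as pd

# Hypothetical tick prices with timestamps (illustrative only)
ticks = pd.Series(
    [110.10, 110.30, 110.05, 110.20, 110.40, 110.35],
    index=pd.to_datetime([
        "2020-01-02 09:05", "2020-01-02 09:20", "2020-01-02 09:55",
        "2020-01-02 10:10", "2020-01-02 10:30", "2020-01-02 10:59",
    ]),
)

# Group the ticks into one-hour windows
hourly = ticks.resample(pd.Timedelta(hours=1))

# Open, high, low and close of each window, plus the number of ticks
candles = hourly.ohlc()
candles["tickqty"] = hourly.count()
print(candles)
```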

In [76]:
data_201920 = api.get_candles('USD/JPY', period = 'H1', start = dt.datetime(2019, 3, 30), stop = dt.datetime(2020, 3, 30))
data_201819 = api.get_candles('USD/JPY', period = 'H1', start = dt.datetime(2018, 3, 30), stop = dt.datetime(2019, 3, 30))
data_201718 = api.get_candles('USD/JPY', period = 'H1', start = dt.datetime(2017, 3, 30), stop = dt.datetime(2018, 3, 30))
data_201617 = api.get_candles('USD/JPY', period = 'H1', start = dt.datetime(2016, 3, 30), stop = dt.datetime(2017, 3, 30))
data_201516 = api.get_candles('USD/JPY', period = 'H1', start = dt.datetime(2015, 3, 30), stop = dt.datetime(2016, 3, 30))
data_201415 = api.get_candles('USD/JPY', period = 'H1', start = dt.datetime(2014, 3, 30), stop = dt.datetime(2015, 3, 30))
data_201314 = api.get_candles('USD/JPY', period = 'H1', start = dt.datetime(2013, 3, 30), stop = dt.datetime(2014, 3, 30))
data_201213 = api.get_candles('USD/JPY', period = 'H1', start = dt.datetime(2012, 3, 30), stop = dt.datetime(2013, 3, 30))
data_201112 = api.get_candles('USD/JPY', period = 'H1', start = dt.datetime(2011, 3, 30), stop = dt.datetime(2012, 3, 30))
data_201011 = api.get_candles('USD/JPY', period = 'H1', start = dt.datetime(2010, 3, 30), stop = dt.datetime(2011, 3, 30))

#Concatenate data for all the 10 years
years_list = [data_201920, data_201819, data_201718, data_201617, data_201516, data_201415, data_201314, data_201213, data_201112, data_201011]
df = pd.concat(years_list)

#Print Count
print("Count of hourly records per year \n")
print("2019-2020 - ",len(data_201920))
print("2018-2019 - ",len(data_201819))
print("2017-2018 - ",len(data_201718))
print("2016-2017 - ",len(data_201617))
print("2015-2016 - ",len(data_201516))
print("2014-2015 - ",len(data_201415))
print("2013-2014 - ",len(data_201314))
print("2012-2013 - ",len(data_201213))
print("2011-2012 - ",len(data_201112))
print("2010-2011 - ",len(data_201011))
print("\n Merged Data",len(df))

#View Sample
print("\n Sample Records From the DataFrame Before and After Time Conversion")
print(df.head(5)) # Before time conversion
df = df.tz_localize("UTC").tz_convert('Africa/Nairobi').tz_localize(None)
df.head(5)

Count of hourly records per year 

2019-2020 -  6300
2018-2019 -  6315
2017-2018 -  6331
2016-2017 -  6331
2015-2016 -  6360
2014-2015 -  6304
2013-2014 -  6295
2012-2013 -  6217
2011-2012 -  6281
2010-2011 -  6247

 Merged Data 62981

 Sample Records From the DataFrame Before and After Time Conversion
                     bidopen  bidclose  bidhigh  ...  askhigh   asklow  tickqty
date                                             ...                           
2019-03-31 18:00:00  110.847   110.954  110.993  ...  110.998  110.970      193
2019-03-31 19:00:00  110.954   111.007  111.018  ...  111.030  110.974      215
2019-03-31 20:00:00  111.007   111.013  111.091  ...  111.107  111.012      321
2019-03-31 21:00:00  111.013   110.932  111.041  ...  111.086  110.893     1320
2019-03-31 22:00:00  110.932   110.936  110.983  ...  110.999  110.911     7250

[5 rows x 9 columns]


date,bidopen,bidclose,bidhigh,bidlow,askopen,askclose,askhigh,asklow,tickqty
2019-03-31 21:00:00,110.847,110.954,110.993,110.923,110.886,110.993,110.998,110.97,193
2019-03-31 22:00:00,110.954,111.007,111.018,110.94,110.993,111.012,111.03,110.974,215
2019-03-31 23:00:00,111.007,111.013,111.091,110.987,111.012,111.074,111.107,111.012,321
2019-04-01 00:00:00,111.013,110.932,111.041,110.878,111.074,110.965,111.086,110.893,1320
2019-04-01 01:00:00,110.932,110.936,110.983,110.897,110.965,110.95,110.999,110.911,7250
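
The ten near-identical `get_candles` calls in the cell above could be driven by a loop over yearly windows. The sketch below assumes a hypothetical `fetch_candles` stand-in that returns dummy rows, so it runs without an FXCM connection; in the notebook it would wrap `api.get_candles`. It also sorts the merged index, since rolling and shift-based features rely on rows being in chronological order.

```python
import datetime as dt
import pandas as pd


def yearly_windows(first_year, last_year, month=3, day=30):
    """Return (start, stop) datetime pairs, each covering one year."""
    return [
        (dt.datetime(year, month, day), dt.datetime(year + 1, month, day))
        for year in range(first_year, last_year)
    ]


def fetch_candles(start, stop):
    """Hypothetical stand-in for api.get_candles(...); returns dummy rows."""
    index = pd.date_range(start, periods=3, freq=pd.Timedelta(hours=1), name="date")
    return pd.DataFrame({"bidclose": [100.0, 100.1, 100.2]}, index=index)


windows = yearly_windows(2010, 2020)

# Fetch newest window first (as in the cell above), then restore time order
frames = [fetch_candles(start, stop) for start, stop in reversed(windows)]
candles_all = pd.concat(frames).sort_index()

print(len(windows), candles_all.index.is_monotonic_increasing)
```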


Each row represents one hourly candle for the USD/JPY currency pair.

a) Date - Date and hour of the candle. Originally in UTC/GMT; converted to GMT+3 local time.

b) BID PRICES - Price that the retail trader sells at

- Bid open - price at the start of the candle

- Bid close - price at the end of the candle

- Bid high - highest price during the candle window

- Bid low - lowest price during the candle window

c) ASK PRICES - Price that the retail trader buys at

- Ask open - price at the start of the candle

- Ask close - price at the end of the candle

- Ask high - highest price during the candle window

- Ask low - lowest price during the candle window

d) Tick Qty - number of price changes that occurred within the candle boundaries. It does not tell when the changes occurred or how big or small they were.
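
The UTC to GMT+3 conversion mentioned in item (a) uses the same pandas calls as the import cell. A minimal sketch on two rows taken from the sample output:

```python
import pandas as pd

# Two candle timestamps as delivered (naive, but meant as UTC)
utc_index = pd.to_datetime(["2019-03-31 18:00", "2019-03-31 19:00"])
sample = pd.DataFrame({"bidopen": [110.847, 110.954]}, index=utc_index)

# Mark the timestamps as UTC, convert to Nairobi time (UTC+3),
# then drop the timezone information again
local = sample.tz_localize("UTC").tz_convert("Africa/Nairobi").tz_localize(None)
print(local.index[0])
```

The first timestamp moves from 18:00 to 21:00, which matches the before and after samples printed by the notebook.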

In [9]:
df.describe()

statistic,bidopen,bidclose,bidhigh,bidlow,askopen,askclose,askhigh,asklow,tickqty
count,62981.0,62981.0,62981.0,62981.0,62981.0,62981.0,62981.0,62981.0,62981.0
mean,101.711437,101.711665,101.788197,101.632136,101.72706,101.727287,101.802874,101.647278,7700.089503
std,13.966758,13.966725,13.973681,13.959091,13.967255,13.967223,13.974333,13.959214,8652.60488
min,75.671,75.671,75.746,75.559,75.705,75.705,75.759,75.575,1.0
25%,88.504,88.504,88.576,88.397,88.518,88.518,88.589,88.413,2204.0
50%,106.29,106.29,106.374,106.209,106.305,106.305,106.389,106.226,5222.0
75%,111.728,111.728,111.795,111.661,111.743,111.743,111.81,111.676,10288.0
max,125.676,125.676,125.85,125.602,125.689,125.689,125.864,125.617,171038.0


### Interest Rates

In [77]:
# Read US Interest rates from quandl for the last 10 years
# API frequency is daily
v_start_date = '2010-03-30'
v_end_date = '2020-03-30'
df_US_Interest_Rates = quandl.get("FRED/DFF", start_date=v_start_date, end_date=v_end_date,timezone='GMT+3')
df_US_Interest_Rates.columns = df_US_Interest_Rates.columns.str.replace('Value', 'US_Monthly_Interest_Rate')
df_JPY_Interest_Rates = quandl.get("MOFJ/INTEREST_RATE_JAPAN_40Y", start_date=v_start_date, end_date=v_end_date,timezone='GMT+3')
df_JPY_Interest_Rates.columns = df_JPY_Interest_Rates.columns.str.replace('Value', 'JPY_Monthly_Interest_Rate')

#Create transaction date from index
df['Transaction_Date'] = df.index.strftime('%Y-%m-%d')
df['Candle_Date'] = df.index
df_US_Interest_Rates['Transaction_Date']  = df_US_Interest_Rates.index.strftime('%Y-%m-%d')
df_JPY_Interest_Rates['Transaction_Date']  = df_JPY_Interest_Rates.index.strftime('%Y-%m-%d')

df_JPY_Interest_Rates[0:5]

Date,JPY_Monthly_Interest_Rate,Transaction_Date
2010-03-30,2.32,2010-03-30
2010-03-31,2.308,2010-03-31
2010-04-01,2.285,2010-04-01
2010-04-02,2.258,2010-04-02
2010-04-05,2.259,2010-04-05


In [67]:
# Check length of Japanese versus US interest rates
print(len(df_JPY_Interest_Rates), len(df_US_Interest_Rates))

2448 3654


Some Japanese interest rates are missing compared to the US interest rates (2448 versus 3654 daily records). Imputation based on the nearest neighbour is preferred.
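
One way to impute from the nearest neighbour is to reindex the daily series onto a full calendar and let pandas pick the closest available date. This is a sketch under that assumption, not the notebook's implemented method; the first two values come from the table above and the third is made up.

```python
import pandas as pd

# Daily rates with a weekend gap (last value is illustrative)
rates = pd.Series(
    [2.258, 2.259, 2.270],
    index=pd.to_datetime(["2010-04-02", "2010-04-05", "2010-04-06"]),
)

# Full calendar including the missing weekend days
full_days = pd.date_range("2010-04-02", "2010-04-06", freq="D")

# Each missing day takes the value of the closest date that has a rate
filled = rates.reindex(full_days, method="nearest")
print(filled)
```

Saturday takes Friday's value and Sunday takes Monday's value, because those are the closest dates with data.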

### Merge Currency and Interest Data

In [78]:
# Merge Dataframe and  US interest rates
df = df.merge(df_US_Interest_Rates,on='Transaction_Date',how='left')
df.set_index("Candle_Date", inplace = True) 
df[0:3]

Candle_Date,bidopen,bidclose,bidhigh,bidlow,askopen,askclose,askhigh,asklow,tickqty,Transaction_Date,US_Monthly_Interest_Rate
2019-03-31 21:00:00,110.847,110.954,110.993,110.923,110.886,110.993,110.998,110.97,193,2019-03-31,2.43
2019-03-31 22:00:00,110.954,111.007,111.018,110.94,110.993,111.012,111.03,110.974,215,2019-03-31,2.43
2019-03-31 23:00:00,111.007,111.013,111.091,110.987,111.012,111.074,111.107,111.012,321,2019-03-31,2.43


In [79]:
# Merge Dataframe and  Japan interest rates
df['Candle_Date'] = df.index
df = df.merge(df_JPY_Interest_Rates,on='Transaction_Date',how='left')
df.set_index("Candle_Date", inplace = True) 
df[0:3]

Candle_Date,bidopen,bidclose,bidhigh,bidlow,askopen,askclose,askhigh,asklow,tickqty,Transaction_Date,US_Monthly_Interest_Rate,JPY_Monthly_Interest_Rate
2019-03-31 21:00:00,110.847,110.954,110.993,110.923,110.886,110.993,110.998,110.97,193,2019-03-31,2.43,
2019-03-31 22:00:00,110.954,111.007,111.018,110.94,110.993,111.012,111.03,110.974,215,2019-03-31,2.43,
2019-03-31 23:00:00,111.007,111.013,111.091,110.987,111.012,111.074,111.107,111.012,321,2019-03-31,2.43,
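
The empty `JPY_Monthly_Interest_Rate` values above appear because a left merge on `Transaction_Date` needs an exact date match, and there is presumably no Japanese rate for 2019-03-31, a Sunday. One option, assuming it is acceptable to carry the most recent published rate forward, is `pd.merge_asof`. The rate values below are hypothetical.

```python
import pandas as pd

# Hourly candles (sorted by time, as merge_asof requires)
candles = pd.DataFrame({
    "Candle_Date": pd.to_datetime(
        ["2019-03-31 21:00", "2019-03-31 22:00", "2019-04-01 09:00"]
    ),
    "bidclose": [110.954, 111.007, 111.100],
})

# Hypothetical daily rates with no entry for the weekend
daily_rates = pd.DataFrame({
    "Date": pd.to_datetime(["2019-03-29", "2019-04-01"]),
    "JPY_Monthly_Interest_Rate": [0.55, 0.56],
})

# For each candle, take the latest rate dated at or before the candle time
merged = pd.merge_asof(
    candles, daily_rates,
    left_on="Candle_Date", right_on="Date",
    direction="backward",
)
print(merged)
```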


### Unemployment rates

In [0]:
# % of Total Labor Force. API Frequency is annual.
US_Unemployment_percent = quandl.get("ODA/USA_LUR", start_date=v_start_date, end_date=v_end_date)
JPY_Unemployment_percent = quandl.get("ODA/JPN_LUR", start_date=v_start_date, end_date=v_end_date)
# quandl.get("ODA/USA_LUR", authtoken="Y4Z_EXQ7qh7xxJhC8J6E")
JPY_Unemployment_percent

Date,Value
2010-12-31,5.058
2011-12-31,4.583
2012-12-31,4.325
2013-12-31,4.008
2014-12-31,3.583
2015-12-31,3.375
2016-12-31,3.108
2017-12-31,2.817
2018-12-31,2.442
2019-12-31,2.433


## Feature Engineering

In [0]:
##Spread
#A spread is the price difference between where a trader may purchase or sell an underlying asset. 
df['spread']= df['askclose'] - df['bidclose']

## Bid Close Return
# References: https://quant.stackexchange.com/questions/21092/calculating-log-returns-across-multiple-securities-and-time , https://stackoverflow.com/questions/31742545/python-calculating-log-returns-of-a-time-series
# NB: In Quantitative Finance, doing your math in log-returns is considered good manners
df['bid_close_return'] = np.log(df['bidclose']/df['bidclose'].shift(1))

## Bid Close Up or Down
#Was the bid close price up or down
df['bid_close_up_down'] = np.where(df['bid_close_return'] >0 ,1,0) # Bid Close went up or down?

##Simple Moving Average
df['bid_close_rolling_mean_1'] = df['bidclose'].rolling(window=20).mean()
df['bid_close_rolling_mean_2'] = df['bidclose'].rolling(window=50).mean()

## Bid High Return
# References: https://quant.stackexchange.com/questions/21092/calculating-log-returns-across-multiple-securities-and-time , https://stackoverflow.com/questions/31742545/python-calculating-log-returns-of-a-time-series
# NB: In Quantitative Finance, doing your math in log-returns is considered good manners
df['bid_high_return'] = np.log(df['bidhigh']/df['bidhigh'].shift(1))

## Bid High Up or Down
#Was the bid high price up or down
df['bid_high_up_down'] = np.where(df['bid_high_return'] >0 ,1,0) # Bid High went up or down?

##Simple Moving Average
df['bid_high_rolling_mean_1'] = df['bidhigh'].rolling(window=20).mean()
df['bid_high_rolling_mean_2'] = df['bidhigh'].rolling(window=50).mean()

# NB: shift(1) returns the previous row's high; shift(-1) would return the following row's high
df['bid_high_next'] = df['bidhigh'].shift(1)
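
Because the objective is to predict the next hour, the direction of `shift` matters. The sketch below shows, on a few made-up highs in chronological order, how a previous-candle feature and a next-candle target could be built. The column names `prev_high`, `next_high` and `next_high_up` are hypothetical and are not used elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical hourly highs, sorted by time
highs = pd.DataFrame(
    {"bidhigh": [110.0, 110.5, 110.2, 110.8]},
    index=pd.date_range("2020-01-02 09:00", periods=4, freq=pd.Timedelta(hours=1)),
).sort_index()

highs["prev_high"] = highs["bidhigh"].shift(1)    # value from the previous candle
highs["next_high"] = highs["bidhigh"].shift(-1)   # value from the following candle

# The last candle has no following candle, so it cannot be labelled
highs = highs.dropna(subset=["next_high"])

# Target: 1 if the next candle's high is above the current high
highs["next_high_up"] = (highs["next_high"] > highs["bidhigh"]).astype(int)

# Log return of the high relative to the previous candle (NaN for the first row)
highs["high_return"] = np.log(highs["bidhigh"] / highs["prev_high"])
print(highs)
```

Only values known at prediction time (the current and previous candles) should be used as features; the next-candle columns belong on the target side.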

In [0]:
df = df.drop(['Transaction_Date','JPY_Monthly_Interest_Rate'], axis=1)
df[0:10]

## Exploratory Data Analysis

### Data Cleaning

a) Missing records

In [0]:
#Drop rows with null values
df = df.dropna()

In [0]:
df.info()

b) Duplicate records

In [0]:
pass

c) Outliers

In [0]:
pass
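
The duplicate and outlier steps are left as placeholders above. A possible approach is sketched below on made-up data: duplicates are detected on the timestamp index and outliers are flagged with the 1.5 × IQR rule. Both choices are assumptions rather than the notebook's method.

```python
import pandas as pd

# Hypothetical candles with one repeated timestamp and one extreme tick count
idx = pd.to_datetime([
    "2020-01-02 09:00", "2020-01-02 10:00", "2020-01-02 10:00",
    "2020-01-02 11:00", "2020-01-02 12:00", "2020-01-02 13:00",
])
sample = pd.DataFrame({"tickqty": [200, 220, 220, 210, 230, 5000]}, index=idx)

# b) Duplicates: count repeated timestamps and keep the first row of each
n_duplicates = int(sample.index.duplicated().sum())
deduped = sample[~sample.index.duplicated(keep="first")]

# c) Outliers: flag values more than 1.5 * IQR outside the quartiles
q1, q3 = deduped["tickqty"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (deduped["tickqty"] < q1 - 1.5 * iqr) | (deduped["tickqty"] > q3 + 1.5 * iqr)

print("duplicates:", n_duplicates)
print(deduped[is_outlier])
```

Whether flagged rows should be removed is a separate decision; in forex data, extreme tick counts may reflect genuine market events.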

###  Univariate Analysis

In [0]:
df['bidclose'].plot()
plt.title("Bid Close ")
plt.show()

df['bid_close_rolling_mean_2'].plot()
plt.title("Bid Close Rolling Mean ")
plt.show()

In [0]:
import plotly.io as pio  # needed for pio.write_html below
#Create Plot
fig = go.Figure(data=[go.Candlestick(x=df.index,open=df['bidopen'],high=df['bidhigh'],low=df['bidlow'],close=df['bidclose'])])
#Specify title and y axis
fig.update_layout( title='Forex Pricing Patterns for last 10 years',yaxis_title='Price',xaxis_title='Year')
pio.write_html(fig, file='../forex.html')
fig.show(renderer = "colab",auto_open=True)
# fig.write_html('tmp.html', auto_open=True)

Price was lowest in late 2011 (around 75.6) and highest in mid-2015 (around 125.9).


### Bivariate Analysis

### Multivariate Analysis

## Pre-Modelling Steps

a) Normality Test

b) Scaling and train test split

In [0]:
df.columns

In [0]:
columns = ['bidopen', 'bidclose', 'bidhigh', 'bidlow', 'askopen', 'askclose','askhigh', 'asklow', 'tickqty', 
           'US_Monthly_Interest_Rate', 'spread','bid_close_return', 'bid_close_up_down', 'bid_close_rolling_mean_1',
       'bid_close_rolling_mean_2','bid_high_rolling_mean_1', 'bid_high_rolling_mean_2', 'bid_high_next']
labels = df['bid_high_up_down'].values #bid_high_next 'bid_close_return',  'bid_high_up_down'
features = df[list(columns)].values

min_max = MinMaxScaler()
newfeatures = min_max.fit_transform(features)
X_train, X_test, y_train, y_test = train_test_split(newfeatures, labels, test_size=0.1)
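
`train_test_split` shuffles rows by default, and the scaler above is fitted on all rows before the split. For time-ordered data, a common alternative is to hold out the most recent rows and fit the scaler on the training part only. The sketch below shows this on synthetic features; the 90/10 ratio mirrors the cell above, and everything else is an assumption.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Synthetic features, assumed to be already sorted by time
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = (X[:, 0] > 0).astype(int)

# Chronological split: first 90% for training, last 10% for testing
split = int(len(X) * 0.9)
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]

# Fit the scaler on training rows only, then apply it to both parts
scaler = MinMaxScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.shape, X_test_scaled.shape)
```

Scaled test values may fall slightly outside the 0 to 1 range, since the scaler has not seen them.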

## Modelling

Support Vector Machine
- Also try the radial basis function (RBF) kernel and compare its results with the linear kernel.
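
The kernel comparison could be set up as a simple loop. The sketch below uses scikit-learn's synthetic `make_moons` data rather than the forex features, so the scores say nothing about this notebook's model; it only shows the mechanics.

```python
from sklearn import svm
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split

# Synthetic two-class data with a non-linear boundary
X, y = make_moons(n_samples=400, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Fit one classifier per kernel and record the test accuracy
scores = {}
for kernel in ("linear", "rbf"):
    clf = svm.SVC(kernel=kernel).fit(X_train, y_train)
    scores[kernel] = clf.score(X_test, y_test)

print(scores)
```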


In [0]:
clf = svm.SVC(kernel='linear')
clf = clf.fit(X_train, y_train)

accuracy = clf.score(X_train, y_train)
print (' training data accuracy ', accuracy*100)

accuracy = clf.score(X_test, y_test)
print (' testing data accuracy ', accuracy*100)

ypredict = clf.predict(X_train)
print ('\n Training classification report\n', classification_report(y_train, ypredict))

ypredict = clf.predict(X_test)
print ('\n Testing classification report\n', classification_report(y_test, ypredict))

 training data accuracy  69.80692116094615
 testing data accuracy  70.0

 Training classification report
               precision    recall  f1-score   support

           0       0.74      0.68      0.71     30412
           1       0.66      0.72      0.69     26197

    accuracy                           0.70     56609
   macro avg       0.70      0.70      0.70     56609
weighted avg       0.70      0.70      0.70     56609


 Testing classification report
               precision    recall  f1-score   support

           0       0.74      0.68      0.71      3378
           1       0.66      0.73      0.69      2912

    accuracy                           0.70      6290
   macro avg       0.70      0.70      0.70      6290
weighted avg       0.70      0.70      0.70      6290
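
To put the 70% test accuracy in context, it helps to compare it with a baseline that always predicts the majority class. The class counts below are taken from the testing classification report above.

```python
# Test-set class counts from the classification report
support = {0: 3378, 1: 2912}
total = sum(support.values())

# Accuracy of always predicting the most frequent class
majority_baseline = max(support.values()) / total

model_accuracy = 0.70  # test accuracy reported above

print(f"majority-class baseline: {majority_baseline:.3f}")
print(f"improvement over baseline: {model_accuracy - majority_baseline:.3f}")
```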



Arima

In [0]:
from statsmodels.tsa.arima.model import ARIMA  # statsmodels.tsa.arima_model was removed in newer statsmodels releases



Gradient Boosting

In [0]:
from sklearn.ensemble import GradientBoostingClassifier
clf = GradientBoostingClassifier(random_state=0, learning_rate=0.01) #, n_estimators=10000

clf.fit(X_train, y_train)

accuracy = clf.score(X_test, y_test)
print('Testing Accuracy: %f' % accuracy)

KeyboardInterrupt: ignored

In [0]:
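#Predict test data
pred = clf.predict(X_test)
# Calculate equity..
contracts  = 10000.0
commission = 0.0

# Recover the unscaled bid close returns of the test rows (undo the MinMaxScaler)
return_col = columns.index('bid_close_return')
test_returns = min_max.inverse_transform(X_test)[:, return_col]

df_trade = pd.DataFrame(test_returns, columns=['return'])
df_trade['label']  = y_test
df_trade['pred']   = pred
df_trade['won']    = df_trade['label'] == df_trade['pred']
# NB: train_test_split shuffles rows by default, so the shifted value is only the
# following candle's return if the split keeps time order (shuffle=False)
df_trade['return'] = df_trade['return'].shift(-1)
df_trade.drop(df_trade.index[len(df_trade)-1], inplace=True)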

def calc_profit(row):
    if row['won']:
        return abs(row['return'])*contracts - commission
    else:
        return -abs(row['return'])*contracts - commission

df_trade['pnl'] = df_trade.apply(lambda row: calc_profit(row), axis=1)
df_trade['equity'] = df_trade['pnl'].cumsum()

display(df_trade.tail())
df_trade.plot(y='equity', figsize=(10,4), title='Backtest with $10000 initial capital')
plt.xlabel('Trades')
plt.ylabel('Equity (USD)')
for r in df_trade.iterrows():
    if r[1]['won']:
        plt.axvline(x=r[0], linewidth=0.5, alpha=0.8, color='g')
    else:
        plt.axvline(x=r[0], linewidth=0.5, alpha=0.8, color='r')
