# Background

This post introduces data scientists who are interested in cryptocurrency exchange to models that can be used to predict buy and sell opportunities.  It covers several strategies with in-depth descriptions and provides access to the code used to simulate a variety of models with varying performance.

## What is Forex

Forex is short for foreign currency exchange and is the trading of currencies with the goal of making profits by timing the buy and sell of specific currency pairs, typically with the help of candlestick charts.  Trading strategies are created by looking for patterns that can be used to predict future fluctuations in the exchange price.

## Candlestick Charts

A candlestick chart is the standard plot for visualizing trading activity.  Each candle resembles a box plot that shows 4 prices within a given period: the high, low, open and close price.  The box, or body of the candle, takes one color when the open price is greater than the close and another when the close is greater than the open.  In the chart below, a white candlestick means the close price is higher than the open price, meaning the price is going up.  The lines coming out of the candlestick body are called "shadows" or "wicks" and represent the price spread for the given period by extending out to the high and low price.  An individual candlestick can represent a period as short as a second or as long as days, weeks or more.  The chart below is a 15-minute candlestick chart, so each candlestick represents a 15-minute period.

![images/15min_candle.png](images/15min_candle.png)
<center><b>Figure X</b> - Here is an example 15-minute candlestick chart for the Ethereum/Bitcoin cryptocurrency exchange rate.<br>This visualization was rendered using the Python library <a href='https://github.com/matplotlib/mplfinance'><code>mplfinance</code></a>.</center>
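
To make the four prices concrete, the following is a minimal sketch of how the prices traded within one period could be collapsed into a single candle.  The `Candle` class and the `build_candle` and `is_rising` helpers are hypothetical names used for illustration only, and the prices are made up.

```python
from dataclasses import dataclass


@dataclass
class Candle:
    open: float
    high: float
    low: float
    close: float


def build_candle(prices):
    """Collapse the traded prices of one period (in time order) into a candle."""
    if not prices:
        raise ValueError("a candle needs at least one price")
    return Candle(open=prices[0], high=max(prices), low=min(prices), close=prices[-1])


def is_rising(candle):
    """True when the close is above the open (a white candle in the chart above)."""
    return candle.close > candle.open


# Hypothetical prices traded within a single 15-minute period.
candle = build_candle([0.0605, 0.0609, 0.0603, 0.0607])
print(candle, is_rising(candle))
```

The wicks of the candle correspond to the distance between the body (open and close) and the `high` and `low` values.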

# Feature Engineering

# Strategies

## Target and Stop Loss

In traditional forex trading, stop and limit orders protect an investor by buying or selling a currency automatically when the price reaches a certain level.  Using these, a predictive model can focus only on buy opportunities and rely on a simple strategy to determine when to sell.  A sell strategy defines two sell prices for a given buy opportunity.  The first is the **target**, the higher price that results in a profit, and the second is the **stop loss**, the lower price that results in a loss.  When a buy opportunity is identified and a target and stop loss are calculated, the purchase can be made and the sell happens automatically, either by the exchange or by another system that monitors the market price.

In the example below, a buy opportunity is identified at the close of the 4:00am candlestick at a price of `0.060497` Bitcoin (`BTC`) per 1.0 Ethereum (`ETH`).  Buying ETH at this price, a target and stop loss are calculated with a `1.0% : 0.5%` ratio, giving `0.061102` for the target and `0.060195` for the stop loss.  The price reaches the target eight candlesticks (2 hours) later at 6:00am, thus securing a `1.0%` profit (assuming no fees).

![images/target_profits.png](images/target_profits.png)
<center><b>Figure X</b> - Example <code>ETH</code> buy opportunity.</center>
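
The arithmetic behind the target and stop loss in this example can be reproduced with a few lines.  This is only a sketch; `target_and_stop` is a hypothetical helper name, not a function from the project code.

```python
def target_and_stop(buy_price, target_pct, stop_pct):
    """Return (target, stop_loss) prices for a buy at buy_price."""
    target = buy_price * (1 + target_pct)
    stop_loss = buy_price * (1 - stop_pct)
    return target, stop_loss


# Values from the example: buy at 0.060497 with a 1.0% : 0.5% ratio.
target, stop_loss = target_and_stop(0.060497, 0.01, 0.005)
print(f"target={target:.6f} stop_loss={stop_loss:.6f}")
# target=0.061102 stop_loss=0.060195
```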

### Identifying Buying Opportunities

Using a target and stop loss approach simplifies the model to a binary classification problem, but it creates a new problem: there is no labeled data to train on.  The goal here is to create a label for each record, where a record is the data for one candlestick.  This includes the low, high, open and close prices as well as some additional features such as volume and number of trades.  Using the close price for each record, a target and stop loss price are calculated using the same threshold ratio that will be used on the deployed model.  Using the example above, a ratio of `1.0% : 0.5%` returns a target price `1.0%` higher than the close price and a stop loss `0.5%` below the close price.  The next step is to peek into the future and see what happens first: does the price reach the target or the stop loss?  If it reaches the target first, the record's label is `1`, meaning "buy".  Another consideration is how far into the future to look.  This is called the "window".  Typically, a window of 15 candles is used.  If the price reaches the stop loss first, or if it hovers between the target and stop loss for the whole window, the record's label is `0`, meaning "not buy".
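
The labelling procedure described above can be sketched as follows.  This is an illustration under stated assumptions rather than the project's actual labelling code: `label_records` and the dictionary keys are hypothetical, and when a single candle spans both the target and the stop loss the sketch conservatively treats the stop loss as having been hit first.

```python
def label_records(candles, target_pct=0.01, stop_pct=0.005, window=15):
    """Label each candle 1 ("buy") or 0 ("not buy").

    candles: list of dicts with "high", "low" and "close" prices, oldest first.
    """
    labels = []
    for i, candle in enumerate(candles):
        target = candle["close"] * (1 + target_pct)
        stop_loss = candle["close"] * (1 - stop_pct)
        label = 0  # default: stop loss hit first, or neither level reached
        for future in candles[i + 1 : i + 1 + window]:
            # Assumption: the order of events inside one candle is unknown,
            # so the stop loss is checked before the target.
            if future["low"] <= stop_loss:
                break
            if future["high"] >= target:
                label = 1
                break
        labels.append(label)
    return labels


# Made-up candles for illustration.
candles = [
    {"high": 100.2, "low": 99.8, "close": 100.0},
    {"high": 100.6, "low": 99.9, "close": 100.5},
    {"high": 101.2, "low": 100.4, "close": 101.0},
    {"high": 101.1, "low": 100.0, "close": 100.2},
]
print(label_records(candles))  # [1, 0, 0, 0]
```

Only the first candle is labeled `1`: its target of `101.0` is reached two candles later without the stop loss of `99.5` being touched in between.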

A common question is why not make the stop loss as small as possible?  Setting the stop loss too small can result in being "wicked out" of a trade.  Looking at the figure above, if the stop loss is made too small, the wick of the next candle after the buy could poke through it, so the stop loss is breached before the target price is reached, resulting in a "not buy".  Therefore, setting a larger stop loss gives some buffer for the price to fluctuate before a gain is achieved while still limiting losses.

For the remainder of the target stop loss strategy discussion, the strategy will focus on `BTC` buy opportunities with the starting coin being `ETH`.  In other words, `ETH` will be used to buy `BTC`, and the `BTC` will be sold back to `ETH` when the price reaches a target or stop loss price.  This can cause a bit of confusion because the price is the number of BTC per 1 ETH, which means a profit is made when the price actually drops, so the target will be lower than the stop loss (the opposite of the figure above).  It is for this reason that the `reverse` flag is set to `True` (seen below).
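
The effect of the `reverse` flag on the two price levels can be sketched as below.  The helper name `trade_levels` is hypothetical, and the choice to mirror the percentages around the close price is an assumption of this sketch, not necessarily how the project code implements the flag.

```python
def trade_levels(close, target_pct, stop_pct, reverse=False):
    """Return (target, stop_loss); with reverse=True a falling price is the profit."""
    if reverse:
        # Assumption: the percentages are simply mirrored around the close.
        return close * (1 - target_pct), close * (1 + stop_pct)
    return close * (1 + target_pct), close * (1 - stop_pct)


target, stop_loss = trade_levels(0.060497, 0.01, 0.005, reverse=True)
print(target < stop_loss)  # True: the target sits below the stop loss
```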

### Determining Ideal Ratios

In the example above, a `1.0% : 0.5%` ratio is used, but is this a good ratio to use?  Setting a ratio of `10% : 5%` might be too high because it would be unlikely to gain `10%`, resulting in a very sparsely labeled dataset.  Likewise, using a ratio of `0.1% : 0.005%` could be too low, especially when considering transaction fees (to be discussed later).  It's also worth mentioning that using a percentage might result in inconsistencies since some currency pairs are more volatile than others and volatility for a given pair can change over time.  For this reason, forex traders sometimes use a ratio based on the average true range (ATR).  For example, an ATR ratio of `2:1` is a good place to start.
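
As a rough illustration of an ATR-based ratio, the sketch below computes the true range of each candle, averages it, and places the target and stop loss at multiples of that average.  Several things here are assumptions: the helper names are hypothetical, a plain mean is used where some ATR definitions use a smoothed average, and the period of `14` is a common default rather than a value taken from this project.

```python
def true_range(high, low, prev_close):
    """Largest of the candle's range and its gaps from the previous close."""
    return max(high - low, abs(high - prev_close), abs(low - prev_close))


def atr(candles, period=14):
    """Plain-mean ATR over the last `period` candles (needs period + 1 candles)."""
    if len(candles) < period + 1:
        raise ValueError("not enough candles")
    recent = candles[-(period + 1):]
    ranges = [
        true_range(cur["high"], cur["low"], prev["close"])
        for prev, cur in zip(recent, recent[1:])
    ]
    return sum(ranges) / period


def atr_levels(close, atr_value, ratio=(2, 1)):
    """Target and stop loss placed at ATR multiples around the close."""
    target_mult, stop_mult = ratio
    return close + target_mult * atr_value, close - stop_mult * atr_value


# Made-up candles for illustration, using a short period of 2.
candles = [
    {"high": 10.0, "low": 9.0, "close": 9.5},
    {"high": 11.0, "low": 9.5, "close": 10.5},
    {"high": 10.5, "low": 9.0, "close": 9.5},
]
value = atr(candles, period=2)
print(value, atr_levels(9.5, value))  # 1.5 (12.5, 8.0)
```

Because the levels scale with recent volatility, the same `2:1` ratio produces wider levels in a volatile market and tighter ones in a quiet market.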

Models generally perform better on balanced data, so getting half of the labels to be `1` is ideal.  But achieving this with a ratio that is consistent and profitable may not be practical.  To find a good ratio, different multiples are generated and the percentage of `1`'s is plotted.  In the ATR ratio figure below, the multiple of `2x` means the numerator is `2` times the denominator, where the denominator is the `x-axis` value.  Therefore, when `x-axis = 3` the ratio is `6:3`.  When the multiple is `4x` and `x-axis = 2`, the ratio is `8:2`, etc.  For the percentage ratio, `x-axis` represents the numerator and the denominator is the numerator divided by the legend's label.  For example, when `x-axis = 0.01` for the `/2` line, the ratio is `1.0% : 0.5%`.

![images/find_ratio.png](images/find_ratio.png)
<center><b>Figure X</b> - Finding the best ratios to maximize label data using a window of <code>30</code> on ETHBTC 15 minute candles.</center>
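
The mapping from a plotted point to a ratio, as described for the two plots, can be written out explicitly.  The function names below are hypothetical and exist only to restate the rule in code.

```python
def atr_ratio(multiple, x):
    """ATR plot: the x-axis value is the denominator (stop loss multiple)."""
    return (multiple * x, x)


def pct_ratio(divisor, x):
    """Percentage plot: the x-axis value is the numerator (target percentage)."""
    return (x, x / divisor)


print(atr_ratio(2, 3))     # (6, 3)
print(atr_ratio(4, 2))     # (8, 2)
print(pct_ratio(2, 0.01))  # (0.01, 0.005)
```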

Unsurprisingly, as the ratio or ratio multiple grows, fewer buy opportunities can be found in the data because there are fewer windows where high profits can be achieved and fewer windows where smaller stop losses don't get wicked out by the volatility of the market.  None of the plots reaches the goal of `50%`, but the results provide plenty of options to avoid sparsely labeled data.  From this analysis, the share of `1` labels for the ATR ratio is maximized at `2:1`, with approximately `40%` of the labels being `1`.  For the percentage ratio it is maximized at `1.0% : 0.5%`, with approximately `27%` of the labels being `1`.

### Building Xy Datasets

The imbalance of a dataset is not the only criterion for determining if one ratio is better than another, but it does give a sense.  To explore this further, several labeled datasets are generated with different labelling strategies by changing the ratio, the window, and whether the ratio represents a percentage or an ATR ratio.  The following table shows the 12 labeled datasets that are used to compare model performance.

|dataset|use_atr|ratio|reverse|window|size|true_labels|imbalance|train_imbal|test_imbal|
|:-|:-:|:-:|:-:|:-:|-:|-:|-:|-:|-:|
|20210806a|False|(0.01, 0.005)|True|30|141174|37979|0.269023|0.262033|0.319201|
|20210806b|False|(0.01, 0.0025)|True|30|141174|23782|0.168459|0.169634|0.196954|
|20210806c|False|(0.0075, 0.0025)|True|30|141174|29824|0.211257|0.220094|0.233808|
|20210806d|True|(2, 1)|True|15|141174|44769|0.317119|0.317522|0.341973|
|20210806e|True|(4, 2)|True|15|141174|17024|0.120589|0.123430|0.124245|
|20210806f|True|(4, 1)|True|15|141174|15304|0.108405|0.110710|0.111111|
|20210806g|True|(4, 2)|True|30|141174|31640|0.224121|0.227442|0.238302|
|20210806h|True|(4, 1)|True|30|141174|26488|0.187627|0.189435|0.201498|
|20210806i|True|(2, 1)|True|30|141174|55315|0.391821|0.391990|0.409738|
|20210806j|False|(0.01, 0.005)|True|15|141174|29035|0.205668|0.187259|0.265218|
|20210806k|False|(0.01, 0.0025)|True|15|141174|19184|0.135889|0.129528|0.174831|
|20210806l|False|(0.0075, 0.0025)|True|15|141174|25995|0.184134|0.183938|0.224619|

Feature engineering was initially performed at this stage, but this later became unnecessary because the AWS pipeline already completes this step (discussed later).  For this reason, each dataset already includes all the features discussed in the feature engineering section plus `14` lookbacks, resulting in `542` features (a lookback is a previous record's features).
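
To illustrate what a lookback is, the sketch below appends the features of the previous records to each record.  The function name, the `_lb{k}` column naming and the choice to drop the first records that lack a full history are all assumptions made for this example.

```python
def add_lookbacks(records, lookbacks):
    """Append the features of the previous `lookbacks` records to every record.

    records: list of dicts, oldest first. The first `lookbacks` records are
    dropped because they lack a full history (an assumption of this sketch).
    """
    rows = []
    for i in range(lookbacks, len(records)):
        row = dict(records[i])
        for k in range(1, lookbacks + 1):
            for name, value in records[i - k].items():
                row[f"{name}_lb{k}"] = value
        rows.append(row)
    return rows


# Made-up records for illustration.
records = [
    {"close": 1.0, "volume": 10},
    {"close": 1.1, "volume": 12},
    {"close": 1.2, "volume": 9},
]
print(add_lookbacks(records, 2))
```

With `2` lookbacks, the single remaining row holds its own `close` and `volume` plus `close_lb1`, `volume_lb1`, `close_lb2` and `volume_lb2`.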

Each dataset is also split into train and validation sets according to the table below.

|Purpose|Start Date|End Date|Number of Records|
|:-|:-:|:-:|-:|
|*not used*|--|2017/12/31|16353|
|Train|2018/01/01|2020/12/31|104796|
|Validation|2021/01/01|2021/07/29|20025|
|Test|2021/07/30|--|--|

In 2017, Binance started reporting figures and it took some time for these to develop robustness.  For this reason, 2017 is excluded from the training and testing sets.  The final test set analysis is performed on AWS in simulation using near-live data.
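
The date boundaries in the table can be expressed as a small lookup, sketched below.  The `SPLITS` list and the `split_for` helper are hypothetical names; only the dates come from the table.

```python
from datetime import date

# (name, first day, last day); None means the range is open-ended.
SPLITS = [
    ("not used", None, date(2017, 12, 31)),
    ("train", date(2018, 1, 1), date(2020, 12, 31)),
    ("validation", date(2021, 1, 1), date(2021, 7, 29)),
    ("test", date(2021, 7, 30), None),
]


def split_for(day):
    """Return the name of the split that a candle's date falls into."""
    for name, start, end in SPLITS:
        if (start is None or day >= start) and (end is None or day <= end):
            return name
    raise ValueError(f"no split covers {day}")


print(split_for(date(2019, 6, 1)))   # train
print(split_for(date(2021, 7, 29)))  # validation
```

Splitting by date rather than at random keeps every validation record later in time than every training record.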

### Simulating Trades on Labeled Data

To get a sense of what an ideal profit would look like for each of the labelling strategies, it is necessary to run them through a simulator.  Why is this necessary?  Why can't this be calculated from the above figures?  To answer this, imagine having two candles, one after another, where the labeling has marked both of these as `1`.  In a deployed model, when a buy signal is received, all available currency will be spent.  When evaluating the next candlestick data, the model will be evaluating for selling, not buying, so that candlestick will not be evaluated for a buy opportunity.

The simulator uses the same logic as the deployed model pipeline, but instead of looking at predictions it looks at the labeled validation data, which is already guaranteed to contain profitable trades, assuming it uses the same ratio and other hyperparameters as the dataset's labelling strategy.  In short, the simulator progresses through the labeled data, looks for the next buy opportunity, calculates the target and stop loss prices, finds the next record that surpasses one of these, calculates the profit or loss along with any fees, and then repeats the process until it reaches the end.  The last sell indicates the maximum profit achievable for the dataset.  The table below shows each dataset's number of trades and maximized profit, based on a starting value of `1 ETH` and a fee of `0.1%` for each buy or sell transaction, using the validation data.
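
The simulator loop described above can be sketched roughly as follows.  This is not the deployed pipeline's code: `simulate` is a hypothetical function, it handles only the non-reversed percentage case, it assumes sells fill exactly at the target or stop loss price, and it assumes a new buy may happen at the close of the candle in which the previous position was sold.

```python
def simulate(candles, labels, target_pct=0.01, stop_pct=0.005, fee=0.001, start=1.0):
    """Walk forward through labeled candles; return (final_value, num_trades)."""
    value = start
    trades = 0
    i = 0
    while i < len(candles):
        if labels[i] != 1:
            i += 1
            continue
        buy_price = candles[i]["close"]
        target = buy_price * (1 + target_pct)
        stop_loss = buy_price * (1 - stop_pct)
        # Find the first later candle that breaches either level.
        exit_index, sell_price = None, None
        for j in range(i + 1, len(candles)):
            if candles[j]["low"] <= stop_loss:
                exit_index, sell_price = j, stop_loss
                break
            if candles[j]["high"] >= target:
                exit_index, sell_price = j, target
                break
        if exit_index is None:
            break  # the position never closes within the data
        # One fee on the buy and one on the sell.
        value *= (1 - fee) * (sell_price / buy_price) * (1 - fee)
        trades += 1
        i = exit_index  # labels between the buy and the sell are skipped
    return value, trades


# Made-up candles; the first two are both labeled as buys.
candles = [
    {"high": 100.2, "low": 99.8, "close": 100.0},
    {"high": 100.6, "low": 99.9, "close": 100.5},
    {"high": 101.2, "low": 100.4, "close": 101.0},
    {"high": 101.1, "low": 100.0, "close": 100.2},
]
print(simulate(candles, [1, 1, 0, 0]))
```

Although two candles carry a `1` label, only one trade is made, because the second label falls while the first position is still open.  This is why the achievable profit cannot be read directly from the label counts.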

|dataset|sim_num_trades|sim_max_profit|sim_bad_trades|
|:-|-:|-:|-:|
|20210806a|1179.0|13228.993600|0.0|
|20210806b|1093.0|6620.481356|0.0|
|20210806c|1456.0|3126.592203|0.0|
|20210806d|1128.0|17342.974458|1.0|
|20210806e|335.0|416.560613|0.0|
|20210806f|340.0|467.521090|1.0|
|20210806g|368.0|1084.998178|0.0|
|20210806h|376.0|1295.216356|2.0|
|20210806i|1033.0|9843.078247|2.0|
|20210806j|1199.0|15539.693313|0.0|
|20210806k|1097.0|6837.112001|0.0|
|20210806l|1497.0|3921.842442|0.0|

While the label data guarantees a trade is profitable, it doesn't guarantee the profit surpasses the fee amount.  For this reason, some labels result in bad trades.  In the above table, only the ATR ratio datasets result in bad trades, which makes sense since the targets of all the percentage-based datasets are larger than the fees.  Keeping an eye on this number will be important when using an ATR ratio dataset.
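
The point at which a winning trade stops covering its fees can be worked out directly.  The sketch below assumes the `0.1%` fee is charged as a fraction of the traded amount on both the buy and the sell; the function names are hypothetical.

```python
def net_return(gross_gain, fee=0.001):
    """Net fractional return of one round trip (one buy fee and one sell fee)."""
    return (1 - fee) * (1 + gross_gain) * (1 - fee) - 1


def break_even_gain(fee=0.001):
    """Smallest gross gain for which the net return is not negative."""
    return 1 / (1 - fee) ** 2 - 1


print(f"{net_return(0.01):.6f}")   # 0.007981
print(f"{break_even_gain():.6f}")  # 0.002003
```

Under these assumptions a target needs to be roughly `0.2%` above the buy price to break even.  The percentage targets used here (`1.0%` and `0.75%`) clear that bar, while an ATR-based target can fall below it when the ATR is small.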

### Comparing Datasets with Base Classifiers

For each of the datasets, a set of base classifiers is trained.  Below is a table of the base classifiers used.

|Name|Parameters|
|:-|:-|
|GaussianNB|*none*|
|LogisticRegression|`random_state=42, max_iter=10000`|
|RandomForestClassifier|`random_state=42, n_jobs=-1`|
|AdaBoostClassifier|`random_state=42`|
|GradientBoostingClassifier|`random_state=42`|
|XGBClassifier|`n_jobs=-1, random_state=42, use_label_encoder=False`|
|MLPClassifier|`random_state=42`|

<small>One note about the `MLPClassifier`: since this classifier is sensitive to scaling, `make_pipeline()` with `StandardScaler()` is used.</small>

This exercise determines the F1-score, precision, recall and the simulator's profit on the validation set for each dataset/classifier combination, resulting in `84` results.  This was repeated two more times on the same datasets and classifiers, first reducing the number of lookbacks from `14` to `3` and then to `0`, thus reducing the number of features each time and ending up with a total of `252` trained model results.

### Identifying Best Performing Dataset/Classifier Combinations

When it comes to ranking best performance, precision is a good starting point.  A high precision means the model is able to reduce the number of false positives (FP).  In other words, it reduces the chance of predicting a buy opportunity that turns out to be unprofitable--the absolute worst case that should be avoided.  A low recall, on the other hand, just means the model is predicting fewer buy opportunities than expected.  This is generally fine so long as it does predict buys often enough (one true positive every couple of days on average).

Looking at precision alone can be misleading.  In the extreme, a precision of `1.0` is perfect, but if recall is very low, such as having only `1` true positive (TP), the model would be ineffective since it so rarely makes predictions.  The F1-score, which weights precision and recall equally, is therefore not a good indicator of performance either, and several factors must be considered.  Maximizing precision and the number of TPs is the overall goal.  Ranking on the number of TPs does little to explain performance if the number of FPs is still high.  Therefore, ranking is performed first on precision, second on the difference between TPs and FPs, and then on the ratio between TPs and FPs.  The top 10 models based on this ranking are shown in the table below.

|Rank|Classifier|Dataset|Lookbacks|TP|FP|Diff|Ratio|Precision|Recall|Sim. Profit|
|-:|:-|:-|-:|-:|-:|-:|-:|-:|-:|-:|
|1|LogisticRegression|20210806i|0|453|277|176|1.64|0.6204|0.0082|1.3210|
|2|AdaBoostClassifier|20210806g|14|98|13|85|7.54|0.8824|0.0031|1.0243|
|3|LogisticRegression|20210806i|3|840|631|209|1.33|0.5708|0.0152|1.6651|
|4|RandomForestClassifier|20210806g|14|2075|1856|219|1.12|0.5278|0.0656|1.2348|
|5|LogisticRegression|20210806i|14|1731|1535|196|1.13|0.5299|0.0313|1.0503|
|6|LogisticRegression|20210806d|0|40|26|14|1.54|0.6000|0.0009|1.0045|
|7|GradientBoostingClassifier|20210806d|0|58|38|20|1.53|0.6000|0.0013|1.0126|
|8|GradientBoostingClassifier|20210806i|3|2710|2460|250|1.10|0.5241|0.0490|0.6239|
|9|GradientBoostingClassifier|20210806i|14|1919|1770|149|1.08|0.5201|0.0347|0.8400|
|10|AdaBoostClassifier|20210806a|14|60|54|6|1.11|0.5263|0.0016|0.9951|


Reviewing the simulated profit for each of these, the top 7 all produce profits, suggesting the ranking is robust.  The highest performing models were then deployed to AWS for live simulations, which are discussed below.
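
The Diff, Ratio and Precision columns of the ranking table above are consistent with the simple formulas sketched below, computed from the TP and FP counts.  The helper name `ranking_metrics` is hypothetical, and small differences in the last decimal place of the table's precision values may come from how those values were computed or rounded.

```python
def ranking_metrics(tp, fp):
    """Return (precision, diff, ratio) from true and false positive counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    diff = tp - fp
    ratio = tp / fp if fp else float("inf")
    return precision, diff, ratio


# TP and FP counts of the rank 1 model in the table.
precision, diff, ratio = ranking_metrics(453, 277)
print(f"precision={precision:.3f} diff={diff} ratio={ratio:.2f}")
# precision=0.621 diff=176 ratio=1.64
```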

### Building a Logistic Regression Ensemble

A large number of logistic regression models outperformed other models and are easy to train and tune, so it seems logical to ask if performance could be improved further with an ensemble.  To build an ensemble, the prediction from each model is weighted and summed, and if that sum is greater than or equal to some threshold, the prediction is considered a `1`, otherwise `0`.  The weights are the precision of each model on the validation set.  The equation for this can be expressed as follows:

\begin{align}
x_{j} &= \sum_{i=0}^M \left( p_{m[i]} \cdot c_{m[i]}\right)
\\
r_{j} &= \begin{cases}
1 & \text{if } x_{j} \geq t \\
0 & \text{if } x_{j} < t \\
\end{cases}
\end{align}

Where $r_{j}$ is the prediction result for the $j$th record, $p_{m[i]}$ is the prediction of $i$th model in $m$, $c_{m[i]}$ is the precision of said model, and $t$ is the hyperparameter threshold.  Any model that has a validation set precision of `0` would always be zeroed out so these models will not be included in the ensemble.
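
The weighted vote for a single record can be sketched in a few lines.  The function name and the example precisions are hypothetical; the logic follows the equation above, with zero-precision models skipped.

```python
def ensemble_predict(predictions, precisions, threshold):
    """Precision-weighted vote for one record.

    predictions: 0/1 output of each model for the record.
    precisions: each model's validation precision, in the same order.
    """
    score = sum(p * c for p, c in zip(predictions, precisions) if c > 0)
    return 1 if score >= threshold else 0


# Three hypothetical models; the third has zero precision and never counts.
print(ensemble_predict([1, 0, 1], [0.62, 0.57, 0.0], threshold=0.5))  # 1
print(ensemble_predict([1, 0, 1], [0.62, 0.57, 0.0], threshold=0.7))  # 0
```

Because the weights are not normalized, the useful range of the threshold depends on how many models are in the ensemble and how precise they are.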

A good value for $t$ can be found by trying out different thresholds and measuring the precision, recall and simulated profit.  Another issue that needs to be considered is that the datasets use different ratios, so which ratio should be used for the ensemble?  It stands to reason that the ratio of the model/dataset with the highest precision should be used since that model carries the most weight.  However, simulations show this is not always the case, as will be shown later with the scaled version.  For the logistic regression ensemble, it so happens that these are aligned.  In the figure below, profit is maximized at a threshold of `0.12` with a value of `1.86`, which surpasses any individual model's simulated performance.

![images/ensemble_find_t.png](images/ensemble_find_t.png)
<center><b>Figure X</b> - Simulated precision, recall and profit for varying thresholds on a Logistic Regression ensemble.</center>

### Scaling Data in Isolation

There is a forex theory that a good strategy is generalizable, in that it can be applied to any currency pair, even opposite pairs, and be profitable.  All models previously trained (with the exception of the MLP) used unscaled data, so it is impractical to expect one of these models to perform well for both ETH to BTC and the opposite BTC to ETH.  Likewise, using standard scaling, as done for the MLP, is also not practical since the dataset for BTC to ETH trades is scaled very differently.  So can the data be scaled in isolation?  The answer is yes, but at the cost of zeroing out one of the features (the open price, which always scales to `0`).  By defining the open price as the mean and using the close, high, and low in the calculation of a standard deviation, all price data can be scaled such that a difference of one standard deviation maps to -1 or 1.  The formula can be described as follows:

\begin{align}
\mu_{j} &= p_{j}^\text{open}
\\
\sigma_{j} &= \sqrt{\left(p_{j}^\text{close} - p_{j}^\text{open}\right)^2 + \left(p_{j}^\text{high} - p_{j}^\text{open}\right)^2 + \left(p_{j}^\text{low} - p_{j}^\text{open}\right)^2}
\\
p_{j} &= \begin{cases}
\frac{\left(p_{j} - \mu_{j}\right)}{\sigma_{j}} & \text{if } p_{j} \text{ is a price} \\
\frac{p_{j}}{p_{j}^\text{ATR}} & \text{if } p_{j} \text{ is an ATR difference} \\
\frac{p_{j} - p_{j}^\text{ATR}}{p_{j}^\text{ATR}} & \text{if } p_{j} \text{ is an ATR lookback} \\
\frac{p_{j}}{p_{j}^\text{RSI}} & \text{if } p_{j} \text{ is an RSI difference} \\
\frac{p_{j} - 50}{20} & \text{if } p_{j} \text{ is an RSI or RSI lookback} \\
\end{cases}
\end{align}

Using this scaling algorithm, each record is scaled individually, independent of the other data in the dataset.  Repeating the same model/dataset comparison using this scaler produces another 252 trained models.  Many of the logistic regression models returned profits in simulation of `ETHBTC`, suggesting again that an ensemble might outperform a single model.
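
The price and RSI cases of the formula above can be sketched as follows; the ATR cases are omitted for brevity.  The function names and dictionary keys are hypothetical, and returning zeros for a flat candle (where the standard deviation is `0`) is an assumption of this sketch that the formula itself does not cover.

```python
import math

PRICE_KEYS = ("open", "high", "low", "close")


def scale_prices(candle):
    """Scale one candle's prices using only that candle's own values."""
    mu = candle["open"]
    sigma = math.sqrt(
        (candle["close"] - mu) ** 2
        + (candle["high"] - mu) ** 2
        + (candle["low"] - mu) ** 2
    )
    if sigma == 0:
        # Assumption: a flat candle (all four prices equal) scales to zeros.
        return {name: 0.0 for name in PRICE_KEYS}
    return {name: (candle[name] - mu) / sigma for name in PRICE_KEYS}


def scale_rsi(rsi):
    """Center the RSI on 50 so that 30 and 70 map to -1 and 1."""
    return (rsi - 50) / 20


# Made-up candle: sigma = sqrt(2**2 + 2**2 + 1**2) = 3.
scaled = scale_prices({"open": 10.0, "high": 12.0, "low": 9.0, "close": 12.0})
print(scaled, scale_rsi(70))
```

Since the result depends only on the candle itself, the same function can be applied to any currency pair without fitting a scaler on a dataset first.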

### Ensemble with Custom Scaling

Using the scaler discussed previously, a new ensemble of logistic regression models can be produced with the goal of having a model that is able to perform on both `ETHBTC` and `BTCETH` trading.  Again, the threshold and ratio are found by brute force.  In the figure below, profit is maximized at a threshold of `0.10` with a value of `1.35`, again surpassing any individual model's simulated performance.  This time, the profit is maximized with a ratio of `4:2`, in contrast with the highest precision dataset's ratio of `2:1`.

![images/scaled_ensemble_find_t.png](images/scaled_ensemble_find_t.png)
<center><b>Figure X</b> - Simulated precision, recall and profit for varying thresholds on a scaled Logistic Regression ensemble.</center>

### Deep Neural Network

To see how a deep neural network can perform on the same datasets used to train the base classifiers, the dataset `20210806i` is chosen for its consistent performance.  A variety of ResNet28 models are built using PyTorch and trained with varying hyperparameters or model changes over 20 epochs on a GPU to allow for quicker iteration.  A subset of these run configurations is shown in the table below.

|Model Name|Scaler|Width|Optimizer|Learning Rate|
|:-|:-|-:|:-|:-|
|nm_torch1_alpha33|CustomScaler1|128|AdamW|0.006|
|nm_torch1_alpha34|*none*|128|AdamW|0.006|
|nm_torch1_alpha35|*none*|32|AdamW|0.001|
|nm_torch1_alpha36|*none*|32|AdamW|0.006|
|nm_torch1_alpha37|CustomScaler1|32|AdamW|0.006|
|nm_torch1_alpha38|CustomScaler1|128|AdamW|0.03|

Below is a figure of the training and validation results for this subset of models.  While 20 epochs is still quite early for a comprehensive analysis, some trends do start to appear, and they only become more pronounced with further epochs, up to as many as 100, when the training set begins to converge.

![images/resnet28_res.png](images/resnet28_res.png)
<center><b>Figure X</b> - Training and validation results for a subset of ResNet28 models over 20 epochs.</center>

The general trend that is consistent across all models is that the training loss drops predictably and consistently but the validation loss rises precipitously after just a few epochs.  The recall on the validation set generally does slowly improve, which also improves the F1-score, but the precision remains erratic, averaging around `0.5`.  This unexpected behavior is likely due to a shift in how the `ETHBTC` market behaves over time, so the model is learning a strategy that is no longer profitable in 2021.  To validate this, each model was simulated, with the results in the table below.

|Model Name|Buys|Starting Value|Ending Value|
|:-|-:|:-:|:-|
|nm_torch1_alpha33|3989|1.0|0.3472|
|nm_torch1_alpha34|0|1.0|0.0|
|nm_torch1_alpha35|51|1.0|1.0391|
|nm_torch1_alpha36|0|1.0|0.0|
|nm_torch1_alpha37|3083|1.0|0.4240|
|nm_torch1_alpha38|236|1.0|0.9454|