# Reinforcement Learning for Trading

This is my first attempt at reinforcement learning. This project is quite similar to the one described in the following article: 
    https://www.mlq.ai/deep-reinforcement-learning-for-trading-with-tensorflow-2-0/

### Project Setup

In [3]:
!pip install --upgrade pip
!pip install pandas-datareader
!pip install tqdm
!pip install tensorflow


In [4]:
import math
import random
import numpy as np
import pandas as pd
import tensorflow as tf
import matplotlib.pyplot as plt
import pandas_datareader as data_reader

from tqdm import tqdm_notebook, tqdm
from collections import deque

## Define the Deep Q-Learning Trader

This is the actual "deep learning" part; it is object-based. 

I am going to create an AI_Trader() class that will learn to trade profitably (hopefully, at least). 

This AI_Trader object needs some definition...
* The only actions it can take are Buy, Hold, or Sell
* It can 'remember' 2000 events
* It will need an inventory of stocks it owns
* It will need a discount factor, $\gamma$, that weights future rewards and so helps maximize the long-term reward
* It will need an exploration parameter, $\epsilon$, to decide whether to use the model or randomize the action it takes
* Over time, the AI_Trader should stop randomizing and obey the model, so $\epsilon$ should decrease
* If $\epsilon$ is going to decrease, we need to define the speed at which it decreases
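
To get a feel for how fast $\epsilon$ shrinks, here is a small sketch that counts decay steps. The helper name `steps_until_floor` is made up for illustration, and the default values (start at 1.0, floor at 0.01, multiply by 0.995 per step) are assumptions taken from the class definition in this notebook.

```python
# Hypothetical helper (not part of AI_Trader): counts how many multiplicative
# decay steps epsilon needs before it reaches its floor.
def steps_until_floor(epsilon=1.0, epsilon_final=0.01, epsilon_decay=0.995):
    steps = 0
    # Same rule the trader uses: keep decaying while epsilon is above the floor
    while epsilon > epsilon_final:
        epsilon *= epsilon_decay
        steps += 1
    return steps

n_steps = steps_until_floor()
print(n_steps)
```

With these values the count comes out at roughly 900 steps, so how quickly the AI_Trader stops exploring depends on how often the decay step is applied.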

In [3]:
class AI_Trader():
    
    def __init__(self, state_size, action_space=3, model_name="AITrader"):
        self.state_size = state_size
        self.action_space = action_space
        self.memory = deque(maxlen=2000)
        self.inventory = []
        self.model_name = model_name
        
        self.gamma = 0.95
        self.epsilon = 1.0
        self.epsilon_final = 0.01
        self.epsilon_decay = 0.995
        

## Define the Neural Network

This is the model that the AI_Trader is going to use to learn how to make profitable trades. 

In this instance, it will be a Neural Network with 4 Dense Layers that outputs the expected utility of taking each of the three possible actions: Buy, Hold, or Sell. The AI_Trader learns over time: the model sees the state at time $t$, the AI_Trader takes an action, new data arrives at time $t+1$, the AI_Trader is rewarded or punished for its action, the model is updated, the AI_Trader takes a new action, and so on. So we will need a function that builds a Neural Network with 4 Layers for the AI_Trader to train. 

Here it is:

In [4]:
def model_builder(self):
    model = tf.keras.models.Sequential()
    
    model.add(tf.keras.layers.Dense(units=32, activation='relu', input_dim=self.state_size))
    model.add(tf.keras.layers.Dense(units=64, activation='relu'))
    model.add(tf.keras.layers.Dense(units=128, activation='relu'))
    # Linear output: one value per action
    model.add(tf.keras.layers.Dense(units=self.action_space, activation='linear'))
    
    #Now we can compile the model. We will use the Adam optimizer and MSE as the loss function.
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001), loss='mse')
    return model

## Build a Trading Function

This function will take the state as an input and allow the AI_Trader to take action based on that state. It is the function that *actually does the trading*...can't think of a better way to say that. Hopefully it makes sense.

This function will be called **trade**, and will take one argument, **state**. 

Based on the state, the function will decide if our AI_Trader should do what the model says, or take a random action. Since $\epsilon$ controls how much the AI_Trader still explores, we will tell it to perform a random action with probability $\epsilon$, or follow the model with probability $1-\epsilon$. In code, this means we draw a random uniform $U \in [0,1]$, and if $U > \epsilon$, follow the model. And of course if $U \le \epsilon$, perform a random action.
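
Here is a minimal standalone sketch of that rule, separate from the AI_Trader class. The function name `epsilon_greedy`, the seeded generator, and the scores in `q_values` are all made up for illustration; in the notebook the scores would come from the model.

```python
import random
import numpy as np

def epsilon_greedy(q_values, epsilon, rng):
    # Explore: with probability epsilon, pick any action uniformly at random
    if rng.random() <= epsilon:
        return rng.randrange(len(q_values))
    # Exploit: otherwise pick the action with the highest predicted value
    return int(np.argmax(q_values))

rng = random.Random(0)                  # seeded so the sketch is repeatable
q_values = np.array([0.1, 0.7, 0.2])    # made-up scores for actions 0, 1, 2

greedy_action = epsilon_greedy(q_values, epsilon=0.0, rng=rng)
random_actions = [epsilon_greedy(q_values, epsilon=1.0, rng=rng) for _ in range(5)]
print(greedy_action, random_actions)
```

With $\epsilon = 1$ every action is random, which is where the AI_Trader starts; as $\epsilon$ falls, more and more actions come from the model.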

In [5]:
def trade(self, state):
    # Draw random number and see if it is less than epsilon
    if random.random() <= self.epsilon:
        # random action
        return random.randrange(self.action_space)
    
    # Otherwise follow the model: predict a value for each action and pick the largest
    actions = self.model.predict(state)
    return np.argmax(actions[0])

## Train the Model

The model needs a training function that the AI_Trader can learn from. The training function will train the preliminary model on the saved data, then allow the AI_Trader to learn from the performance of the trained model. Step by step:

1. Collect a batch of recent experiences from the AI_Trader's **memory**
2. Get the reward for each experience in the batch based on **state**, **next_state**, and the **action** taken
3. Set the **target** based on the prediction from the model, with the entry for the **action** taken replaced by that reward
4. Fit the model with **state** as the independent variable and **target** as the dependent variable
5. If the previous $\epsilon$ is greater than the final $\epsilon$, let $\epsilon$ decay
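
Steps 2 and 3 are easier to follow with numbers. The sketch below uses made-up arrays in place of `model.predict(state)` and `model.predict(next_state)`; the reward, the action index, and the value of `gamma` are also just assumptions for the example.

```python
import numpy as np

gamma = 0.95          # discount on future rewards
reward = 1.0          # made-up immediate reward
action = 2            # made-up index of the action that was taken
done = False          # not the last step of the episode

q_state = np.array([[0.2, 0.5, 0.1]])        # stand-in for model.predict(state)
q_next_state = np.array([[0.3, 0.8, 0.4]])   # stand-in for model.predict(next_state)

# Step 2: immediate reward plus the discounted best value of the next state
target_value = reward
if not done:
    target_value = reward + gamma * np.amax(q_next_state[0])

# Step 3: keep the model's own predictions, overwrite only the action taken
target = q_state.copy()
target[0][action] = target_value
print(target)
```

Only the entry for the chosen action changes, so fitting on this target nudges the model toward the observed outcome for that action and leaves its estimates for the other actions alone.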


In [6]:
def batch_train(self, batch_size):
    # batch starts empty
    batch = []
    # Collect the most recent experiences from memory (this range yields batch_size - 1 of them)
    for i in range(len(self.memory) - batch_size + 1, len(self.memory)):
        batch.append(self.memory[i])
        
    # Build a target for each experience in the batch and fit the model on it
    for state, action, reward, next_state, done in batch:
        
        if not done:
            reward = reward + self.gamma * np.amax(self.model.predict(next_state)[0])
            
        target = self.model.predict(state)
        target[0][action] = reward
    
        self.model.fit(state, target, epochs=1, verbose=0)
    
    if self.epsilon > self.epsilon_final:
        self.epsilon *= self.epsilon_decay

## Stock Data Preprocessing

The stock data will need to be loaded, preprocessed so that it is in a precise format that the AI_Trader can easily understand, and normalized. This is going to take three functions. They are all pretty simple.

In [7]:
# This function squashes a value into the range (0, 1); we will apply it to price differences
def sigmoid(x):
    return 1 / (1+math.exp(-x))

# This function formats the stock prices like "(-)$ {price}"
def stocks_price_format(n):
    if n <0:
        return "-$ {0:2f}".format(abs(n))
    else:
        return "$ {0:2f}".format(abs(n))

# Load the data from yahoo finance
def dataset_loader(stock_name):
    dataset = data_reader.DataReader(stock_name, data_source='yahoo')
    
    # First and last dates in the dataset (informational only; not used below)
    start_date = str(dataset.index[0]).split()[0]
    end_date = str(dataset.index[-1]).split()[0]
    close = dataset['Close']
    
    return close

In [8]:
dataset_loader('AAPL').head()

Date
2015-12-29    27.184999
2015-12-30    26.830000
2015-12-31    26.315001
2016-01-04    26.337500
2016-01-05    25.677500
Name: Close, dtype: float64

### Create States

The AI_Trader needs the **state**.

Here is what is happening under the hood, in general terms:
* Stock prices are floating point numbers that represent the value of a stock at a point in time
* The AI_Trader needs to predict what is going to happen at the next point in time: will the stock price be higher or lower than it is now?
* Based on that prediction, the AI_Trader needs to take the best action: buy, sell, or hold.

How should the AI_Trader make those predictions? That's the real art here...for this example, we are going to let the AI_Trader use a regression model with data from a window of previous time-steps. 

Instead of using raw stock prices, we will use the difference in price from one day to the next. Maybe this will all be easier to explain after the code. Meet me on the other side of the next cell!

In [9]:
def state_creator(data, timestep, window_size):
    starting_id = timestep - window_size + 1
    if starting_id >= 0:
        windowed_data = data[starting_id:timestep+1]
    else:
        windowed_data = abs(starting_id) * [data[0]] + list(data[0:timestep+1])
    
    state = []
    for i in range(window_size - 1):
        state.append(sigmoid(windowed_data[i+1] - windowed_data[i]))
        
    return np.array([state])

Okay. The state_creator function takes in the data, the current timestep, and the window_size. It creates a starting_id that is the starting point of the window being used to create the state. If that starting_id is zero or positive, we grab the data in the window. If the starting_id is negative, we carry the earliest observation back until we have a list of prices that has length of window_size.

Then, we find the differences from one timestep to the next and normalize them using the previously defined **sigmoid** function; a difference of zero has the final value of 0.5, negative differences have final values below 0.5 and positive differences have final values greater than 0.5.
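
A tiny worked example may help. The sketch below restates the same logic for a plain Python list so it runs on its own; `make_state` and the toy `prices` are made up for illustration and are not part of the notebook.

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def make_state(prices, timestep, window_size):
    # Same windowing idea as state_creator, written for a plain list
    start = timestep - window_size + 1
    if start >= 0:
        window = prices[start:timestep + 1]
    else:
        # Not enough history yet: repeat the first price to pad the window
        window = -start * [prices[0]] + prices[:timestep + 1]
    # window_size prices give window_size - 1 squashed differences
    return [sigmoid(window[i + 1] - window[i]) for i in range(window_size - 1)]

prices = [10.0, 11.0, 10.5, 10.5, 12.0]   # made-up closing prices

early_state = make_state(prices, timestep=1, window_size=4)   # needs padding
full_state = make_state(prices, timestep=4, window_size=4)    # full window
print(early_state)
print(full_state)
```

The padded entries have a difference of zero, so they show up as 0.5 in the state. Note that a window of 4 prices produces only 3 values, because each value is a difference between two neighboring prices.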

Let's see what it looks like if we load a dataset for Apple.

In [10]:
stock_name = 'AAPL'
data = dataset_loader(stock_name)
data.head()

Date
2015-12-29    27.184999
2015-12-30    26.830000
2015-12-31    26.315001
2016-01-04    26.337500
2016-01-05    25.677500
Name: Close, dtype: float64

## Q-Learning: Training the AI_Trader

The AI_Trader needs to learn, so we need to train it. Since the state_creator function works with windows, the AI_Trader needs to know the **window_size**; because we train the model in batches, the AI_Trader needs to know the **batch_size**; because the AI_Trader learns by events, it needs to be told how many **episodes** it should train for; because we use the difference in prices rather than raw closing prices, there are N-1 **data_samples** in the training set.

In [11]:
window_size = 10
batch_size = 32
episodes = 1000
data_samples = len(data) - 1

So...the AI_Trader class is defined in pieces above. The next cell puts the whole thing together.

In [12]:
class AI_Trader():
    
    def __init__(self, state_size, action_space=3, model_name="AITrader"):
        self.state_size = state_size
        self.action_space = action_space
        self.memory = deque(maxlen=2000)
        self.inventory = []
        self.model_name = model_name
        
        self.gamma = 0.95
        self.epsilon = 1.0
        self.epsilon_final = 0.01
        self.epsilon_decay = 0.995
        
        self.model = self.model_builder()
        
    def model_builder(self):
        model = tf.keras.models.Sequential()
    
        model.add(tf.keras.layers.Dense(units=32, activation='relu', input_dim=self.state_size))
        model.add(tf.keras.layers.Dense(units=64, activation='relu'))
        model.add(tf.keras.layers.Dense(units=128, activation='relu'))
        # Linear output: one value per action
        model.add(tf.keras.layers.Dense(units=self.action_space, activation='linear'))
    
        #Now we can compile the model. We will use the Adam optimizer and MSE as the loss function.
        model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001), loss='mse')
        return model
    
    def trade(self, state):
        # Draw random number and see if it is less than epsilon
        if random.random() <= self.epsilon:
            # random action
            return random.randrange(self.action_space)
    
        # Otherwise follow the model: predict a value for each action and pick the largest
        actions = self.model.predict(state)
        return np.argmax(actions[0])
        
    def batch_train(self, batch_size):
        # batch starts empty
        batch = []
        # Collect the most recent experiences from memory (this range yields batch_size - 1 of them)
        for i in range(len(self.memory) - batch_size + 1, len(self.memory)):
            batch.append(self.memory[i])
        
        # Build a target for each experience in the batch and fit the model on it
        for state, action, reward, next_state, done in batch:
        
            if not done:
                reward = reward + self.gamma * np.amax(self.model.predict(next_state)[0])
            
            target = self.model.predict(state)
            target[0][action] = reward
    
            self.model.fit(state, target, epochs=1, verbose=0)
    
        if self.epsilon > self.epsilon_final:
            self.epsilon *= self.epsilon_decay

We can now initialize the AI_Trader.

In [13]:
trader = AI_Trader(state_size = window_size)
trader.model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense (Dense)                (None, 32)                352       
_________________________________________________________________
dense_1 (Dense)              (None, 64)                2112      
_________________________________________________________________
dense_2 (Dense)              (None, 128)               8320      
_________________________________________________________________
dense_3 (Dense)              (None, 3)                 387       
Total params: 11,171
Trainable params: 11,171
Non-trainable params: 0
_________________________________________________________________


After hunting down a few typos and errors (which are not in the final version, so you can't see them any more!), the AI_Trader appears to be working exactly as expected. The model looks good, and the AI_Trader should be ready to train. 
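
As a quick sanity check on the summary, the parameter counts can be reproduced by hand: a Dense layer has one weight per input-output pair plus one bias per output unit. The sketch below assumes an input of size 10 (the **window_size**) followed by layers of 32, 64, 128, and 3 units.

```python
# Input size followed by the number of units in each Dense layer
layer_sizes = [10, 32, 64, 128, 3]

# Each Dense layer has (inputs + 1 bias) * units parameters
params_per_layer = [
    (n_in + 1) * n_out
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
]
total_params = sum(params_per_layer)

print(params_per_layer)
print(total_params)
```

The per-layer numbers and the total should line up with the "Param #" column and the "Total params" row in the summary.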

### Define a Training Loop

The training loop is going to iterate through all 1000 **episodes**. In each iteration, it should:

1. Start the next episode
2. Define the initial state using *state_creator*
3. Define new variables to track **total_profit** and set initial **trader.inventory=[]**
4. Define the **timestep**, where 1 = 1 Day, and define **action**, **next_state**, and **reward**
5. Update the **inventory** based on the **action**
6. Update **total_profit** based on the **reward**
7. Check if this is the last event (sample) in the dataset
8. Append the data to the trader's memory with *trader.memory.append()*
9. Change the **state** to the **next_state** to continue iterating through the episode
10. ***IF*** we've reached the end of the episode (**done == True**), *print* out the **total_profit**

That's plenty of stuff. On top of that, the loop also needs to handle training and saving:

* If there is more information in the AI_Trader's **memory** than **batch_size**, call *trader.batch_train(batch_size)*
* Every 10 episodes, save the model in an H5 file using *trader.model.save()*
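
Steps 5 and 6 carry the bookkeeping, so here is a stripped-down sketch of just that part, with no model involved. The function `simulate`, the toy prices, and the fixed action list are made up for illustration; it assumes 1 means Buy, 2 means Sell, shares are sold oldest-first, and losing trades get a reward of zero while still counting against the profit.

```python
def simulate(prices, actions):
    inventory = []        # prices paid for shares currently held
    total_profit = 0.0
    rewards = []
    for price, action in zip(prices, actions):
        reward = 0.0
        if action == 1:                      # Buy: remember the price paid
            inventory.append(price)
        elif action == 2 and inventory:      # Sell: only if something is held
            buy_price = inventory.pop(0)     # oldest purchase is sold first
            reward = max(price - buy_price, 0.0)   # losses are clipped to zero
            total_profit += price - buy_price      # profit keeps the losses
        rewards.append(reward)
    return rewards, total_profit

prices = [10.0, 12.0, 11.0, 9.0]   # made-up prices
actions = [1, 1, 2, 2]             # buy, buy, sell, sell

rewards, total_profit = simulate(prices, actions)
print(rewards, total_profit)
```

The difference between the two quantities matters: the reward is what the model learns from, while **total_profit** is what we print at the end of the episode.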

Now let's do it!

In [None]:
for episode in range(1, episodes+1):
    # Indicate Training Status
    print('Episode: {}/{}'.format(episode, episodes))
    
    # Initialize all the stuff we need in each episode
    state = state_creator(data, 0, window_size + 1)
    
    total_profit = 0
    trader.inventory = []
    
    # Iterate through the episode
    for t in tqdm(range(data_samples)):
        
        # Select the action to take
        action = trader.trade(state)
        
        # Define the next_state
        next_state = state_creator(data, t+1, window_size+1)
        
        # Initialize reward
        reward = 0
        
        if action == 1: #Buy
            trader.inventory.append(data[t])
            print("AI_Trader bought: ", stocks_price_format(data[t]))
            
        elif action == 2 and len(trader.inventory) > 0: #Sell
            buy_price = trader.inventory.pop(0)
            reward = max(data[t] - buy_price, 0)
            total_profit += data[t] - buy_price
            print("AI_Trader sold: ", stocks_price_format(data[t]), 
                  "\nProfit: " + stocks_price_format(data[t] - buy_price))
        
        # Check if the episode is ending
        if t == data_samples - 1:
            done=True
        else:
            done = False
        
        # Append what just happened to memory
        trader.memory.append((state, action, reward, next_state, done))
        
        # Move state forward
        state = next_state
        
        # If done, print total_profit
        if done:
            print("################################################################")
            print("TOTAL PROFIT: {}".format(total_profit))
            print("################################################################")
        
        # Once memory holds more than batch_size experiences, train on a batch
        if len(trader.memory) > batch_size:
            trader.batch_train(batch_size)
            
    # Every 10 episodes, save the model
    if episode % 10 == 0:
        trader.model.save("ai_trader_{}.h5".format(episode))

Episode: 1/1000
AI_Trader bought:  $ 24.632500
AI_Trader bought:  $ 24.347500
AI_Trader sold:  $ 24.879999 
Profit: $ 0.247499
AI_Trader sold:  $ 24.165001 
Profit: -$ 0.182499
AI_Trader bought:  $ 24.860001
AI_Trader sold:  $ 23.522499 
Profit: -$ 1.337502
AI_Trader bought:  $ 24.087500
AI_Trader sold:  $ 23.504999 
Profit: -$ 0.582500
AI_Trader bought:  $ 23.752501
AI_Trader sold:  $ 23.567499 
Profit: -$ 0.185001
AI_Trader bought:  $ 23.424999
AI_Trader sold:  $ 23.497499 
Profit: $ 0.072500
AI_Trader bought:  $ 24.219999
AI_Trader bought:  $ 24.172501
AI_Trader bought:  $ 25.132500
AI_Trader bought:  $ 25.187500
AI_Trader sold:  $ 25.467501 
Profit: $ 1.247501
AI_Trader sold:  $ 25.280001 
Profit: $ 1.107500
AI_Trader sold:  $ 25.565001 
Profit: $ 0.432501
AI_Trader bought:  $ 25.629999
AI_Trader sold:  $ 26.145000 
Profit: $ 0.957500
AI_Trader sold:  $ 26.492500 
Profit: $ 0.862501
AI_Trader bought:  $ 26.450001
AI_Trader sold:  $ 26.480000 
Profit: $ 0.029999

  5%|▍         | 57/1257 [01:29<1:13:36,  3.68s/it]