# Stock Buy/Sell/Sit Prediction using RL + expReplay!

## Our Methodology

Many would argue that an RL implementation should start by defining the agent and then move on to the other parts. I prefer a top-down approach:

1. Define what training / evaluation looks like for some entity named agent.
2. Define the agent.

Ref : 
https://github.com/llSourcell/Reinforcement_Learning_for_Stock_Prediction

### About the Dataset

Obtain the stock information: Name, Window_Size, Episode_Count

a. Name - Apple, Google, or some other stock name

b. Window_Size - Similar to a supervised learning approach to this problem, we divide the dataset into overlapping frames:
   
   1. If d1, d2, d3, d4, d5, ... is our data and the window/frame size is 2, then we have
   
   2. d1,d2; d2,d3; d3,d4; ...
   
c. Episode_Count - Number of episodes to run.
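
The framing described in (b) can be sketched in a few lines. This is only an illustration: `make_windows` is a hypothetical helper that is not part of the notebook, and the notebook's own `getState` (defined further down) additionally pads the start of the series and normalizes differences.

```python
def make_windows(series, window_size):
    """Return every run of `window_size` consecutive items (stride 1)."""
    return [series[i:i + window_size]
            for i in range(len(series) - window_size + 1)]


frames = make_windows(["d1", "d2", "d3", "d4", "d5"], 2)
print(frames)
# [['d1', 'd2'], ['d2', 'd3'], ['d3', 'd4'], ['d4', 'd5']]
```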



In [1]:
# First we'll have a look at the dataset!
import pandas as pd

In [2]:
data_ = pd.read_csv('./data/stockPrediction/^GSPC_2011.csv')
data_.head()

Unnamed: 0,Date,Open,High,Low,Close,Adj Close,Volume
0,2011-01-03,1257.619995,1276.170044,1257.619995,1271.869995,1271.869995,4286670000
1,2011-01-04,1272.949951,1274.119995,1262.660034,1270.199951,1270.199951,4796420000
2,2011-01-05,1268.780029,1277.630005,1265.359985,1276.560059,1276.560059,4764920000
3,2011-01-06,1276.290039,1278.170044,1270.430054,1273.849976,1273.849976,4844100000
4,2011-01-07,1274.410034,1276.829956,1261.699951,1271.5,1271.5,4963110000


In [3]:
data_.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 252 entries, 0 to 251
Data columns (total 7 columns):
Date         252 non-null object
Open         252 non-null float64
High         252 non-null float64
Low          252 non-null float64
Close        252 non-null float64
Adj Close    252 non-null float64
Volume       252 non-null int64
dtypes: float64(5), int64(1), object(1)
memory usage: 13.9+ KB


In [4]:
import numpy as np
import math

# Adapted from the referenced code: read the 'Close' column of a CSV into a list.
def getStockDataVec(key):
    vec = []
    with open("./data/stockPrediction/" + key + ".csv", "r") as f:
        lines = f.read().splitlines()

    # Skip the header row; column index 4 is 'Close' (Date, Open, High, Low, Close, ...).
    for line in lines[1:]:
        vec.append(float(line.split(",")[4]))

    return vec

In [5]:
data = getStockDataVec('^GSPC_2011')
print(data[:4])

[1271.869995, 1270.199951, 1276.560059, 1273.849976]


In [6]:
# the same can be done via pandas:
data_ = data_['Close']
print(data_[:4])

0    1271.869995
1    1270.199951
2    1276.560059
3    1273.849976
Name: Close, dtype: float64


In [7]:
# number of time steps per episode (each step needs a following price) and replay batch size
l = len(data) - 1
batch_size = 32

In [8]:
# define the number of episodes and window_size
ep = 2
window_size = 10

In [9]:
def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def getState(data, t, n):
    # Returns the state at time t: the n prices ending at t, turned into
    # n - 1 sigmoid-squashed consecutive differences, with shape (1, n - 1).
    d = t - n + 1
    # Take data[d..t]; if the window would start before index 0, pad on the left with data[0].
    block = data[d:t + 1] if d >= 0 else -d * [data[0]] + data[0:t + 1]
    res = []

    # Squash each price difference into the range (0, 1).
    for i in range(n - 1):
        res.append(sigmoid(block[i + 1] - block[i]))

    return np.array([res])
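
To see what `getState` produces, here is a small worked example. It is a sketch under assumptions: `state_features` is a hypothetical, standard-library-only re-implementation that returns a plain list instead of a NumPy array, and the prices are made up.

```python
import math


def sigmoid(x):
    return 1 / (1 + math.exp(-x))


def state_features(prices, t, n):
    """List version of the state: n - 1 squashed differences ending at t."""
    d = t - n + 1
    # Left-pad with the first price when the window starts before index 0.
    block = prices[d:t + 1] if d >= 0 else -d * [prices[0]] + prices[0:t + 1]
    return [sigmoid(block[i + 1] - block[i]) for i in range(n - 1)]


prices = [10.0, 11.0, 9.0, 9.0]            # made-up prices
early = state_features(prices, t=1, n=3)   # window padded to [10, 10, 11]
late = state_features(prices, t=3, n=3)    # window is [11, 9, 9]
print([round(x, 3) for x in early])        # [0.5, 0.731]
print([round(x, 3) for x in late])         # [0.119, 0.5]
```

A flat day maps to 0.5, a rise to a value above 0.5, and a fall to a value below it. Differences of several points already push the sigmoid close to 0 or 1, which is worth keeping in mind for index-level prices.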

## Train Loop 
Now we can write our train loop, in which we train our RL agent. We only need to know how the agent behaves; its actual implementation can be deferred until later.
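
The training loop below relies on only a small interface: `act(state)`, a `memory` buffer, an `inventory` list, and `expReplay(batch_size)`. A minimal stand-in that satisfies it, assuming nothing about the eventual model, might look like this (`RandomAgent` is a hypothetical placeholder, not the agent defined later):

```python
import random
from collections import deque


class RandomAgent:
    """Placeholder exposing the interface the training loop expects."""

    def __init__(self, state_size, action_size=3):
        self.state_size = state_size
        self.action_size = action_size      # 0 = Sit, 1 = Buy, 2 = Sell
        self.memory = deque(maxlen=1000)    # stored transitions
        self.inventory = []                 # prices of purchases currently held

    def act(self, state):
        # No learning yet: pick any action uniformly at random.
        return random.randrange(self.action_size)

    def expReplay(self, batch_size):
        # A real agent would update its model from stored transitions here.
        pass


agent = RandomAgent(state_size=10)
action = agent.act(state=None)
agent.memory.append((None, action, 0, None, False))
```

Writing the loop against this interface first makes it possible to run it end to end before any model exists.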

In [10]:
# This is just a helper function for pretty console outputs.
def formatPrice(n):
	return ("-$" if n < 0 else "$") + "{0:.2f}".format(abs(n))

In [11]:
def train():
    agent = Agent(window_size)

    # Note: range(ep + 1) runs episodes 0..ep inclusive.
    for e in range(ep + 1):
        print('Running Episode:{}/{}'.format(e, ep))

        # Get the initial state: the set of features defined by our window.
        # getState takes the dataset, the starting position 0, and the window length.
        # Note: we pass the plain list `data` (not the pandas Series `data_`) so that
        # getState's list-based padding works as is.
        state = getState(data, 0, window_size + 1)
        print('State: {}'.format(state.shape))
        # The agent holds an inventory of purchases, which is empty at the start of each
        # episode, and we accumulate total_profit as we go.
        agent.inventory = []
        total_profit = 0

        # Iterate over every time step in the dataset.
        for t in range(l):
            # Let the agent choose an action based on the current state.
            action = agent.act(state)

            # There are 3 actions: Sit=0, Buy=1, Sell=2.
            # If the action is to sit, we move straight to the next state with 0 reward.
            reward = 0
            next_state = getState(data, t + 1, window_size + 1)

            if action == 1:  # the agent chooses to buy
                # Add the purchase price to the agent's inventory and log it.
                agent.inventory.append(data[t])
                print('Buy:  {}'.format(formatPrice(data[t])))

            elif action == 2 and len(agent.inventory) > 0:
                # The agent can only sell if it already holds something in its inventory.
                # The reward for a purchase is realised only when it is sold, so here we
                # pop the oldest bought price, compute the reward (losses are clipped to 0),
                # and update total_profit with the actual gain or loss.
                boughtPrice = agent.inventory.pop(0)
                reward = max(data[t] - boughtPrice, 0)
                total_profit += data[t] - boughtPrice

                print('Sell:  {}  | Profit:  {}'.format(formatPrice(data[t]), formatPrice(data[t] - boughtPrice)))

            # Check whether this is the last step of the episode.
            done = (t == l - 1)

            # Record the experience so the agent can learn from it later (experience replay).
            # Stored as a tuple (state, action, reward, next_state, done).
            agent.memory.append((state, action, reward, next_state, done))

            # The next state becomes the current state for the next time step.
            state = next_state

            if done:
                print('{0} \nTotal Profit:  {1} \n{0}'.format('-' * 20, formatPrice(total_profit)))

            # Replay stored experiences once enough have accumulated.
            if len(agent.memory) > batch_size:
                agent.expReplay(batch_size)

        # Save the agent periodically (the models/ directory must already exist).
        if e % 10 == 0:
            torch.save(agent.model, "models/model_ep" + str(e))
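
Because `inventory.pop(0)` removes the oldest purchase, each sale is matched first-in, first-out, which is why a logged profit may not equal the difference to the most recent buy. A small stand-alone sketch of that bookkeeping, with made-up prices and a hypothetical `settle` helper:

```python
def settle(trades):
    """Replay (action, price) pairs with FIFO matching.

    Returns (total_profit, rewards), where each reward is clipped at 0
    in the same way as the training loop's reward for a sale.
    """
    inventory, total_profit, rewards = [], 0.0, []
    for action, price in trades:
        if action == "buy":
            inventory.append(price)
        elif action == "sell" and inventory:
            bought = inventory.pop(0)          # oldest purchase first
            total_profit += price - bought     # real profit, may be negative
            rewards.append(max(price - bought, 0))
    return total_profit, rewards


trades = [("buy", 100.0), ("buy", 110.0), ("sell", 105.0), ("sell", 105.0)]
print(settle(trades))  # (0.0, [5.0, 0])
```

Note that the second sale loses 5 but earns a reward of 0, so with this reward definition losing trades go unrewarded rather than being penalized.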

## The Agent

Now we can go on to define the agent. We know what behaviors it must exhibit, so the only thing left is to implement them.
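
One of those behaviors is epsilon-greedy exploration with a multiplicatively decaying epsilon. As a rough sketch, using the same values the agent below is initialised with (1.0, 0.01, 0.995), the following counts how many decay steps it takes to reach the floor. `steps_to_floor` is a hypothetical helper, not part of the notebook.

```python
import math


def steps_to_floor(epsilon=1.0, epsilon_min=0.01, decay=0.995):
    """Count multiplicative decay steps until epsilon <= epsilon_min."""
    steps = 0
    while epsilon > epsilon_min:
        epsilon *= decay
        steps += 1
    return steps


# Closed form for comparison: smallest n with decay**n <= epsilon_min.
closed_form = math.ceil(math.log(0.01) / math.log(0.995))
print(steps_to_floor(), closed_form)  # 919 919
```

In the agent below, the decay is applied once per `expReplay` call, so the floor is reached only after that many replay calls.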

In [12]:
# The reference uses Keras; I prefer PyTorch.

In [13]:
import torch
import random
from collections import deque

In [14]:
# We need to define agentModel beforehand.

class agentModel(torch.nn.Module):
    def __init__(self, inp, out):
        super(agentModel, self).__init__()

        # A small MLP mapping a state vector to one value per action.
        self.loc = torch.nn.Sequential(
            torch.nn.Linear(inp, 64),
            torch.nn.ReLU(True),
            torch.nn.Linear(64, 32),
            torch.nn.ReLU(True),
            torch.nn.Linear(32, 8),
            torch.nn.ReLU(True),
            torch.nn.Linear(8, out)
        )

    def forward(self, x):
        return self.loc(x)

    def predict(self, x):
        # Keras-style helper: accept a NumPy array and return the model output.
        x = torch.from_numpy(x)
        x = x.type(torch.FloatTensor)
        return self.forward(x)
    

In [15]:
#We define an agent class 
class Agent :
    
    def __init__(self,state_size,is_eval=False,model_name="") :
        '''
            state_size is the number of features in a state vector (the model's input size).
            is_eval is the mode of the agent: evaluation (no random exploration) or training.
            model_name is the file name under models/ to load when is_eval is True.
        '''
        
        self.state_size = state_size 
        self.action_size = 3 #Sit, Buy, Sell
        self.memory = deque(maxlen=1000) #memory is limited, to 1000 instances.
        self.inventory = [] #This records purchases.
        self.model_name = model_name 
        self.is_eval = is_eval
        
        
        # Some hyper-parameters:
        self.gamma = 0.95  # discount factor in the Bellman equation
        # epsilon controls exploration vs. exploitation (epsilon-greedy strategy):
        # we draw a random number, and if it is at most epsilon we choose a
        # random action (explore); otherwise we choose the predicted action (exploit).
        # As training proceeds we want to explore less and exploit more, so epsilon decays.
        # To keep some stochasticity and avoid getting stuck in local minima,
        # epsilon_min is a floor that never lets exploration drop to 0.
        self.epsilon = 1.0
        self.epsilon_min = 0.01
        self.epsilon_decay = 0.995
        
        #if is_eval is True : then load the model, else Create One!
        self.model = torch.load('models/'+model_name) if is_eval else self.createModel()
        #define Model criterion and optimizer 
        self.optimizer = torch.optim.Adam(self.model.parameters(),lr=0.001)
        self.criterion = torch.nn.MSELoss(reduction='sum')
        
    
    def createModel(self):
        # Returns the deep learning model for the agent. You can pass as many parameters
        # as you want, depending on your implementation details.
        # agentModel is defined in the previous cell.
        return agentModel(self.state_size, self.action_size)
    
    def act(self, state):
        # Use the epsilon-greedy approach for exploitation vs. exploration.
        if not self.is_eval and random.random() <= self.epsilon:
            # Return a random action from the action range.
            return random.randrange(self.action_size)

        # Otherwise, exploit the model to output an action.
        # I have interfaced PyTorch similarly to Keras for the ease of Keras users.

        # The model outputs one estimated value (Q-value) per action, not probabilities.
        options = self.model.predict(state)

        # One could also sample from these values; here we take the argmax.
        # .item() converts the result to a plain Python int.
        return torch.argmax(options[0]).item()
    
    
    def expReplay(self, batch_size):
        mini_batch = []
        l = len(self.memory)

        # Build a mini-batch from the batch_size most recent experiences in memory.
        for i in range(l - batch_size, l):
            mini_batch.append(self.memory[i])

        # Use each instance in this mini-batch to update the model parameters.
        for state, action, reward, next_state, done in mini_batch:

            # If the example ended the episode (done=True), the target is just the reward received.
            # If done=False, we use the Bellman equation to find the target value (utility):
            # the immediate reward plus the discounted best value of the next state,
            # which is the second term in the assignment inside the `if` below.
            target = float(reward)
            if not done:
                out = self.model.predict(next_state)[0].data
                target = reward + self.gamma * torch.max(out).item()

            # Make the model prediction at this state.
            prediction = self.model.predict(state)
            # Copy the prediction (detached from the graph) and overwrite the taken action's
            # entry with the target, so that only this action contributes to the loss.
            target_f = prediction.detach().clone()
            target_f[0][action] = target

            # Train one step: minimise the loss between the prediction and target_f.
            self.optimizer.zero_grad()
            loss = self.criterion(prediction, target_f)
            loss.backward()
            self.optimizer.step()

        # Decay epsilon to bias the agent more towards exploitation.
        if self.epsilon > self.epsilon_min:
            self.epsilon *= self.epsilon_decay
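

The target used in `expReplay` is the one-step Q-learning (Bellman) target: the immediate reward plus, unless the episode ended, the discounted best value of the next state. A framework-free sketch with made-up numbers follows; `td_target` and `squared_error` are hypothetical helpers, while the notebook computes the same quantities with tensors.

```python
def td_target(reward, next_q_values, done, gamma=0.95):
    """One-step Q-learning target for a single transition."""
    if done:
        return reward                      # no future beyond a terminal step
    return reward + gamma * max(next_q_values)


def squared_error(q_values, action, target):
    """Loss on the taken action only; other actions keep their predictions."""
    return (q_values[action] - target) ** 2


q_now = [0.2, 1.0, -0.3]      # made-up Q(s, .) for Sit, Buy, Sell
q_next = [0.5, 2.0, 1.0]      # made-up Q(s', .)
target = td_target(reward=1.0, next_q_values=q_next, done=False)
print(target)                 # about 2.9, i.e. 1.0 + 0.95 * 2.0
print(squared_error(q_now, action=1, target=target))
```

Only the taken action's entry contributes to the loss. When doing this with tensors, the target vector should be a detached copy of the prediction, since comparing a tensor with itself yields zero loss and no gradient.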
            

## Training the Agent

In [16]:
# Sanity check: a random input of size 10 should map to 3 action values.
tr = Agent(10)
p = tr.model(torch.randn(1,10))
print(p.shape)

torch.Size([1, 3])


In [17]:
train()

Running Episode:0/2
State: (1, 10)
Buy:  $1271.50
Buy:  $1269.75
Sell:  $1274.48  | Profit:  $2.98
Buy:  $1285.96
Buy:  $1293.24
Buy:  $1295.02
Buy:  $1280.26
Buy:  $1283.35
Sell:  $1291.18  | Profit:  $21.43
Buy:  $1296.63
Buy:  $1276.34
Buy:  $1286.12
Sell:  $1304.03  | Profit:  $18.07
Sell:  $1310.87  | Profit:  $17.63
Sell:  $1319.05  | Profit:  $24.03
Sell:  $1324.57  | Profit:  $44.31
Buy:  $1320.88
Buy:  $1321.87
Sell:  $1332.32  | Profit:  $48.97
Sell:  $1340.43  | Profit:  $43.80
Sell:  $1343.01  | Profit:  $66.67
Sell:  $1315.44  | Profit:  $29.32
Sell:  $1307.40  | Profit:  -$13.48
Buy:  $1306.10
Buy:  $1319.88
Sell:  $1327.22  | Profit:  $5.35
Sell:  $1306.33  | Profit:  $0.23
Buy:  $1308.44
Sell:  $1321.82  | Profit:  $1.94
Buy:  $1320.02
Sell:  $1295.11  | Profit:  -$13.33
Buy:  $1281.87
Buy:  $1273.72
Buy:  $1279.21
Sell:  $1298.38  | Profit:  -$21.64
Sell:  $1297.54  | Profit:  $15.67
Sell:  $1309.66  | Profit:  $35.94
Buy:  $1313.80
Buy:  $1328.26
Sell:  $1335.54  | Pr
[... output truncated ...]

Buy:  $1293.24
Sell:  $1295.02  | Profit:  $1.78
Buy:  $1286.12
Sell:  $1307.59  | Profit:  $21.47
Buy:  $1307.40
Sell:  $1306.10  | Profit:  -$1.30
Buy:  $1319.88
Sell:  $1327.22  | Profit:  $7.34
Buy:  $1306.33
Sell:  $1308.44  | Profit:  $2.11
Buy:  $1330.97
Buy:  $1310.13
Sell:  $1321.82  | Profit:  -$9.15
Sell:  $1320.02  | Profit:  $9.89
Buy:  $1298.38
Sell:  $1293.77  | Profit:  -$4.61
Buy:  $1305.14
Sell:  $1312.62  | Profit:  $7.48
Buy:  $1335.25
Sell:  $1347.24  | Profit:  $11.99
Buy:  $1355.66
Sell:  $1360.48  | Profit:  $4.82
Buy:  $1346.29
Sell:  $1357.16  | Profit:  $10.87
Buy:  $1326.73
Buy:  $1325.84
Sell:  $1343.80  | Profit:  $17.07
Sell:  $1345.02  | Profit:  $19.18
Buy:  $1300.67
Sell:  $1292.28  | Profit:  -$8.39
Buy:  $1188.68
Sell:  $1209.11  | Profit:  $20.43
Buy:  $1162.95
Sell:  $1175.38  | Profit:  $12.43
Buy:  $1207.25
Sell:  $1203.66  | Profit:  -$3.59
Buy:  $1218.28
Sell:  $1237.90  | Profit:  $19.62
-------------------- 
Total Profit:  $137.77 
----------