# Machine Learning Engineering NanoDegree Project


## Reinforcement Learning in investing

Question: Can we use Reinforcement Learning to time the market (S&P 500)?

Many investors such as Warren Buffett say that the best way to accumulate weath is to keep your money in the S&P 500 Index Fund. The S&P 500 is a weighted average of all the stocks of the top 500 companies weighted by the value of the company. Historically S&P 500 has gone 7% above inflation which beats many other funds and performs way better than keeping your money in the bank. My question is whether we can make more than 7% over inflation by timing the market. I want to see if reinforcement learning can be used to predict when the stocks will go up or down and use that information to take money out or put money back in the market.

## Data Preprocessing:

In [291]:
import csv
import random
data_file = open('Shiller_Main_Data.csv')
read_file = csv.reader(data_file)
data_dictionary = {}                # Where all the S&P Data is stored
count = 0
header = None
for item in read_file:
    if count == 0:
        header = item               # First line of the data has header names such as ('Real Price')
    else:
        Date = item[0]              # The month and year of S&P Price and other data
        After_Decimal = Date[Date.index('.') + 1:]
        if After_Decimal == '1':    # Changing Data entries '1900.1' to '1900.10' 
            Date += '0'             # '1900.01' mean January, 1900 while '1900.10' mean October, 1900
        data_dictionary[count] = {}
        data_dictionary[count][header[0]] = Date
        data_dictionary[count][header[1]] = float(item[1])    #S&P Price
        data_dictionary[count][header[2]] = float(item[2])    #Long Interest Rate
        data_dictionary[count][header[3]] = float(item[3])    #Real Price
        data_dictionary[count][header[4]] = float(item[4])    #Real Dividend
        data_dictionary[count][header[5]] = float(item[5])    #CAPE
        data_dictionary[count][header[6]] = float(item[6])    #Date Fraction
    count += 1
max_count = len(data_dictionary)
print (max_count)

1638


This is the preprocessing step in order to take the data from .csv file and organize it into an easy to use way. The variable max_count represents the number of months in the data. It includes the information of the Date, S&P Price, Long Interest Rate, Real Price, Real Dividend, CAPE, and Date Fraction of every month from January of 1881 to June of 2017.
* S&P Price: Weighted average of all stocks from top 500 companies
* Long Interest Rate: Interest Rate on bonds
* Real Price: The S&P Price in terms of the dollar today
* Real Dividend: The S&P Dividend in terms of the dollar today
* CAPE: Cyclically Adjusted Price Per Earnings Ratio
* Date Fraction: Used to graph the Date

The Real Price and Real Dividend will be used to figure out the amount of money made each month, while CAPE and Long Interest Rate will be used for the predictions. The Real Price will also be used for rewards in the Q-Learning algorithm. The Data set will be split approximately 70% for training and 30% for testing. The data from January 1881 to December 1975 will be used for learning and the data from January 1976 to June 2017 will be used for testing. Can Reinforcement Learning use information from 1881-1975 to make good financial decisions for the last 40 years? 

## Base Cases:

Let's see how much money is made during the testing interval by holding it in the market and this will be used as the base case. If Reinforcement Learning performs better than the Base Case, then we have succeeded. We will show how putting $1000 a month (today's dollars) in S&P can accumlate. This is done to simulate real life where people continue to invest money that they earn.

In [292]:
def BaseCase(data_dictionary,start,end):
    money = 0         
    shares = 0
    for n in range(start,end):
        Real_Price = data_dictionary[n]['Real Price']
        Real_Dividend = data_dictionary[n]['Real Dividend']
        if (n == start): #First Month Money 
            money = 1000 #No Investment has occured yet
        else:            
            money = shares*(Real_Price + Real_Dividend/12) + 1000 #Shares converted into money
        shares = money/Real_Price #Money converted into shares
    return money

The Real_Dividend is for the whole year, so we divide it by 12. We assume that you reinvest all the money made from the stock back into the stock market.

In [293]:
training_start = 1 # January, 1881
training_end = (1976 - 1881)*12 - 1  #December, 1975
training_case = BaseCase(data_dictionary,training_start,training_end+1)
testing_case = BaseCase(data_dictionary,training_end + 1, max_count + 1)
print ("The Amount of Money Made from 1881 to 1975 by holding in the Market:")
print ("${}".format(training_case))
print ("The Amount of Money Made from 1976 to 2017 by holding in the Market:")
print ("${}".format(testing_case))

The Amount of Money Made from 1881 to 1975 by holding in the Market:
$55323777.7857
The Amount of Money Made from 1976 to 2017 by holding in the Market:
$3351912.47717


The testing_case will be compared to the result of the Q-Learning Algorithm. If the result does better it means that the market can be timed. The training_case is important because it represents how much money can be made holding in the market in the training time period. We look at the models that do best in the training period compared to this and use that on the testing case.

## Reinforcement Learning: Q Learning

The greedy epsilon Q learning algorithm will be used to make predictions. We set discount factor to 0 because we are not trying to encourage the program to get a certain goal as we don't know what the maximum amount of money can be made in a real world scenario. The formula becomes:

$Q_{(s,t)}$ = (1 - $\alpha$)$Q_{(s,t)}$ + $\alpha$$r_{t}$
* $Q_{(s,t)}$ represents the score for a state at the current time.
* $\alpha$ is the learning rate (range 0-1)
* $r$ is the reward

Price difference will be compared month to month and rewards will depend on that:
* If money is kept in the market and the price goes up that is a positive reward
* If money is kept in the market and the price goes down that is a negative reward
* If stocks are sold and the price goes up that is a negative reward
* If stocks are sold and the price goes down that is a positive reward

A state is based on the Cape and long interest rate. The action is whether we hold the money in the market or we sell. Holding in the market includes buying shares for the $1000 that is earned each month. We initialize each state an action pair in a dictionary with score of 0. Based on the rewards for each occurence, the dictionary entry is updated based on the Q-Learning formula above. The result of the training will be this dictionary. When the testing occurs, this will be used to determine at a give time whether stocks should be kept or sold. Whichever action has a higher score for a given long interest rate and Cape pair will be the action that is taken. Now in order to create the dictionary, continous values of interest rate and Cape need to be converted to discrete values so that they can be kept in a dictionary.

In [294]:
def interest_range(data_dictionary,division):
    min_interest = float('inf')
    max_interest = 8.0               #8.0 is set based on data
    for items in data_dictionary:
        if data_dictionary[items]['Long Interest Rate'] < min_interest:
            min_interest = data_dictionary[items]['Long Interest Rate']
    interest_array = []
    space = float((max_interest - min_interest)/division)                        #even intervals
    interest_array.append((float('-inf'),min_interest + space))                  #set lowerbound
    for n in range(1,division-1):
        interest_array.append((min_interest+space*n,min_interest+space*(n+1)))
    interest_array.append((min_interest + space*(division - 1),float('inf')))    #set upperbounds
    return interest_array
def cape_range(data_dictionary,division):  #Same as interest_range(), but with different max value
    min_cape = float('inf')
    max_cape = 32          #32 based on data
    for items in data_dictionary:
        if data_dictionary[items]['CAPE'] < min_cape:
            min_cape = data_dictionary[items]['CAPE']
    cape_array = []
    space = float((max_cape-min_cape)/division)
    cape_array.append((float('-inf'),min_cape + space))
    for n in range(1,division-1):
        cape_array.append((min_cape + space*n,min_cape + space*(n+1)))
    cape_array.append((min_cape + space*(division - 1),float('inf')))
    return cape_array

In [295]:
print (interest_range(data_dictionary,2))      #Long Interest ranges with 2 intervals
print (interest_range(data_dictionary,5))      #Long Interest ranges with 5 intervals
print (cape_range(data_dictionary,2))          #CAPE Ranges with 2 intervals
print (cape_range(data_dictionary,5))          #CAPE Ranges with 5 intervals

[(-inf, 4.75), (4.75, inf)]
[(-inf, 2.8), (2.8, 4.1), (4.1, 5.4), (5.4, 6.7), (6.7, inf)]
[(-inf, 18.39), (18.39, inf)]
[(-inf, 10.224), (10.224, 15.668), (15.668, 21.112000000000002), (21.112000000000002, 26.556), (26.556, inf)]


If you look at the data in the testing set, the long interest rates go really high in the 1980's. In the training data it never reaches that high, so an upper limit has to be set. Anything close to 8.0 or above was treated as the same range. In the testing set during the dot com boom the Cape went really high and in the last couple years it has been really high. It never went that high in the training set so 32 is set the upper bound. Now here we initialize the dictionary:

In [296]:
def create_learner_dictionary(data_dictionary,division_interest,division_cape):
    interest_array = interest_range(data_dictionary,division_interest)
    cape_array = cape_range(data_dictionary,division_cape)
    combination = [(x,y) for x in interest_array for y in cape_array]    #Every state action pair
    learner_dictionary = {}
    for items in combination:
        learner_dictionary[items] = {}
        learner_dictionary[items]['hold'] = 0                            #Initialize dictionary
        learner_dictionary[items]['sell'] = 0
    return learner_dictionary

In [297]:
print (create_learner_dictionary(data_dictionary,2,2))   #Prints example initalized dictionary with 4 states

{((-inf, 4.75), (18.39, inf)): {'sell': 0, 'hold': 0}, ((4.75, inf), (-inf, 18.39)): {'sell': 0, 'hold': 0}, ((-inf, 4.75), (-inf, 18.39)): {'sell': 0, 'hold': 0}, ((4.75, inf), (18.39, inf)): {'sell': 0, 'hold': 0}}


A function to access dictionary scores for givien interest and Cape. Since the values are continous, the values have to be mapped to the ranges:

In [298]:
def dictionary_entry(learner_dictionary,interest,cape):
    for items in learner_dictionary:
        if interest >= items[0][0] and interest < items[0][1]:         #finds long interest rate range
            if cape >= items[1][0] and cape < items[1][1]:             #finds cape range
                return items

In the learning stage, epsilon determines the probability that we choose sell or hold randomly. Otherwise we choose the max based on the dictionary entries. In the beginning we don't know what will work, so we base the decision on randomness and overtime we choose decisions based on what we learned. Epsilon determines the probability that we choose randomness. Epsilon will decay over time. Here we have the sell_or_hold function which uses the find_max function. We also have assign_rewards function which determines action scores for each state.

In [299]:
def find_max(learner_dictionary,state):                             #Chooses 'sell' or 'hold' from dictionary
    s = learner_dictionary[state]['sell']
    h = learner_dictionary[state]['hold']
    if s == h:
        action = random.choice(['sell','hold'])
    elif s > h:
        action = 'sell'
    else:
        action = 'hold'
    return action
def sell_or_hold(epsilon,state,learner_dictionary):
    chance = random.uniform(0,1.0)
    if chance <= epsilon:                                            #epsilon determines randomness
        action = random.choice(['sell','hold'])
    else:
        action = find_max(learner_dictionary,state)
    return action
def assign_rewards(alpha,history,data_dictionary,learner_dictionary,current):
    Real_Price = data_dictionary[current]['Real Price']
    Previous_Real_Price = data_dictionary[current-1]['Real Price']
    Price_Difference = Real_Price - Previous_Real_Price                   #Basing rewards on the Price Difference
    Real_Dividend = data_dictionary[current]['Real Dividend']
    Previous_Action = history[-1][2]                                      #Rewarding based on results
    Previous_State = history[-1][1]                                       #and previous action/states
    if (Price_Difference >= 0 and Previous_Action == 'hold'):
        reward = Price_Difference                  
    elif (Price_Difference >= 0 and Previous_Action == 'sell'):
        reward = -1*(Price_Difference)
    elif (Price_Difference < 0 and Previous_Action == 'hold'):
        reward = -1*(Price_Difference)
    else:
        reward = Price_Difference                           #Price_Difference < 0 and Previous_Action == 'sell'
    learner_dictionary[Previous_State][Previous_Action] = (1-alpha)*learner_dictionary[Previous_State][Previous_Action] + (alpha)*reward
    return learner_dictionary

Based on experimenting with different rewards, using price difference from current month to last month worked the best. Adding dividend or static rewards made the program perform worse. Rewards need to be given based on the gained or lost money. Gaining money is just as important as not losing money so that is why the rewards are equal. 

## Learning

In [300]:
import math
money = 0                 #Initialize money and shares
shares = 0
def Learning(alpha,data_dictionary,division_interest,division_cape,start,end):
    history = []          
    learner_dictionary = create_learner_dictionary(data_dictionary,division_interest,division_cape)
    for n in range(start,end):
        t = n - 1                      #Time Step
        epsilon = 1 - .001*t             #Determines how epsilon decays
        if (epsilon < 0 or epsilon > 1):
            epsilon = 0
        Real_Price = data_dictionary[n]['Real Price']
        Real_Dividend = data_dictionary[n]['Real Dividend']
        Interest_Rate = data_dictionary[n]['Long Interest Rate']
        CAPE = data_dictionary[n]['CAPE']
        if (n == start):
            money = 1000              #$1000 made first month 
        else:
            learner_dictionary = assign_rewards(alpha,history,data_dictionary,learner_dictionary,n)
            if (shares > 0):
                money = shares*(Real_Price + Real_Dividend/12) + 1000   #Shares converted to money
            else:
                money += 1000
        state = dictionary_entry(learner_dictionary,Interest_Rate,CAPE)
        action = sell_or_hold(epsilon,state,learner_dictionary)
        history.append([Real_Price,state,action])              #Records information at each time step
        if (action == 'hold'):
            shares = money/Real_Price                       #Money converted to shares
        else:
            shares = 0
    return money,learner_dictionary

## Testing

In [301]:
def Testing(learner_dictionary,data_dictionary,start,end):
    money = 0                            #Intialize money and shares
    shares = 0
    history = []
    for n in range(start,end):
        Real_Price = data_dictionary[n]['Real Price']
        Real_Dividend = data_dictionary[n]['Real Dividend']
        Interest_Rate = data_dictionary[n]['Long Interest Rate']
        CAPE = data_dictionary[n]['CAPE']
        if (n == start):
            money = 1000                 #$1000 made first month
        else:
            if (shares > 0):
                money = shares*(Real_Price + Real_Dividend/12) + 1000     #Shares converted to money
            else:
                money += 1000
        state = dictionary_entry(learner_dictionary,Interest_Rate,CAPE)
        action = find_max(learner_dictionary,state)
        history.append(action)              #This represents total actions taken
        if (action == 'hold'):
            shares = money/Real_Price          #Money converted to shares
        else:
            shares = 0
    return money,history

Here is an example result:

In [302]:
alpha = .85
division_interest = 4
division_cape = 4
start = training_start
end = training_end
old_money, learner_dictionary = Learning(alpha,data_dictionary,division_interest,division_cape,start,end)
money, history = Testing(learner_dictionary,data_dictionary,end+1,max_count+1)
print ("${}".format(money))

$3921540.31194


Now let's find the best alpha value, interest range divisions, and cape range divisions that lead to the best results.

In [303]:
max_money = 0
values = None
for n in range(0,101):                 #Range of alphas
    for s in range(2,11):           #Number of long interest rate divisions
        for t in range(2,11):       #Number of cape divisions
            money, learner_dictionary = Learning(.01*n,data_dictionary,s,t,start,end)
            if money > max_money:
                max_money = money
                values = (n,s,t)
print ("${}".format(max_money))
print ("alpha: {}".format(.01*values[0]))
print ("Long Interest Rate Intervals: {}".format(values[1]))
print ("CAPE Intervals: {}".format(values[2]))

$147496528.385
alpha: 0.46
Long Interest Rate Intervals: 9
CAPE Intervals: 2


Let's use these values on the test set and see the results:

In [304]:
max_money = 0
for _ in range(10):
    old_money, learner_dictionary = Learning(values[0]*.01,data_dictionary,values[1],values[2],start,end)
    money, history = Testing(learner_dictionary,data_dictionary,end+1,max_count+1)
    print ("${}".format(money))

$2333764.42864
$2368155.94771
$2307656.07719
$2593611.8265
$2583602.04642
$2298300.67749
$2429477.65495
$2239984.90776
$2205415.08121
$2850026.67737


There is a wide range of answers so let's look at the values that give the highest average out of 10 runs:

In [305]:
max_money = 0
values = None
for n in range(0,101):
    for s in range(2,11):
        for t in range(2,11):
            average = 0
            for _ in range(10):
                money, learner_dictionary = Learning(.01*n,data_dictionary,s,t,start,end)
                average += money
            average /= 10
            if average > max_money:
                max_money = average
                values = (n,s,t)
print ("${}".format(max_money))
print ("alpha: {}".format(.01*values[0]))
print ("Long Interest Rate Intervals: {}".format(values[1]))
print ("CAPE Intervals: {}".format(values[2]))

$55300595.7521
alpha: 0.28
Long Interest Rate Intervals: 10
CAPE Intervals: 10


Now let's find the best dictionary from these values:

In [306]:
max_money = 0
record_dict = {}
for _ in range(10):
    money,learner_dictionary = Learning(.01*values[0],data_dictionary,values[1],values[2],start,end)
    if money > max_money:
        max_money = money
        record_dict = learner_dictionary
print ("${}".format(max_money))

$43893335.4021


Now let's use this best dictionary on Testing and see the results:

In [307]:
max_money = 0
max_hist = None
for _ in range(10):
    money, history = Testing(record_dict,data_dictionary,end+1,max_count+1)
    if money > max_money:
        max_money = money
        max_hist = history
print ("${}".format(max_money))

$2588032.21255


There is a wide range of results so let's use the best result. We use the history which is a dictionary of 'hold' or 'sell' for each month and use it to create a .csv file that can show graphically where stocks were held and where they were sold.

In [308]:
head_row = ['Real Price','Date Fraction','Sell or Hold']        #Header for file
data_file_2 = open('record_list.csv','wb')                      #Create .csv file where data can be recorded
write_file = csv.writer(data_file_2)
write_file.writerow(head_row)
for n in range(end+1,max_count+1):
    row = []
    row.append(data_dictionary[n][header[3]])                   #Real Price
    row.append(data_dictionary[n][header[6]])                   #Date Dictionary
    if history[n-end-1] == 'hold':
        row.append(data_dictionary[n][header[3]])               #Record Price where held
    else:
        row.append(0)                                           #Record 0 where sold
    write_file.writerow(row)
data_file_2.close()