# PREDICTING STOCK RETURNS WITH CLUSTER-THEN-PREDICT

In the second lecture sequence this week, we heard about cluster-then-predict, a methodology in which you first cluster observations and then build cluster-specific prediction models. In the lecture sequence, we saw how this methodology helped improve the prediction of heart attack risk. In this assignment, we'll use cluster-then-predict to predict whether future stock returns will be positive, using historical stock data.

When selecting which stocks to invest in, investors seek to obtain good future returns. In this problem, we will first use clustering to identify clusters of stocks that have similar returns over time. Then, we'll use logistic regression to predict whether or not the stocks will have positive future returns.
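The two-step idea can be sketched on synthetic data before turning to the stock dataset. Everything in this block (the variables `x1`, `x2`, `y`, the choice of two clusters, and the seed) is made up for illustration and is not part of the assignment.

```r
set.seed(42)  # arbitrary seed for this synthetic example

# Synthetic data: two predictors and a binary outcome (not the stock data)
n  <- 300
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- rbinom(n, 1, plogis(0.5 * x1 - 0.5 * x2))
d  <- data.frame(x1, x2, y)

# Step 1: cluster on the predictors only (the outcome is left out)
km <- kmeans(d[, c("x1", "x2")], centers = 2)

# Step 2: fit one logistic regression per cluster
models <- lapply(1:2, function(k) {
  glm(y ~ x1 + x2, family = binomial, data = d[km$cluster == k, ])
})

# Step 3: predict each observation with the model for its own cluster
pred <- numeric(n)
for (k in 1:2) {
  idx <- km$cluster == k
  pred[idx] <- predict(models[[k]], newdata = d[idx, ], type = "response")
}

# In-sample accuracy of the combined predictions at a 0.5 threshold
mean((pred > 0.5) == d$y)
```

The outcome is excluded from the clustering step so that cluster membership can also be determined for new observations whose outcome is not yet known.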

For this problem, we'll use StocksCluster.csv, which contains monthly stock returns from the NASDAQ stock exchange. The NASDAQ is the second-largest stock exchange in the world, and it lists many technology companies. The stock price data used in this problem was obtained from infochimps, a website providing access to many datasets.

Each observation in the dataset contains the monthly returns of a particular company in a particular year. The years included are 2000-2009. The companies are limited to tickers that were listed on the exchange for the entire period 2000-2009 and whose stock price never fell below $1. So, for example, one observation is for Yahoo in 2000, and another observation is for Yahoo in 2001. Our goal will be to predict whether or not the stock return in December will be positive, using the stock returns for the first 11 months of the year.

This dataset contains the following variables:

* **ReturnJan** = the return for the company's stock during January (in the year of the observation). 
* **ReturnFeb** = the return for the company's stock during February (in the year of the observation). 
* **ReturnMar** = the return for the company's stock during March (in the year of the observation). 
* **ReturnApr** = the return for the company's stock during April (in the year of the observation). 
* **ReturnMay** = the return for the company's stock during May (in the year of the observation). 
* **ReturnJune** = the return for the company's stock during June (in the year of the observation). 
* **ReturnJuly** = the return for the company's stock during July (in the year of the observation). 
* **ReturnAug** = the return for the company's stock during August (in the year of the observation). 
* **ReturnSep** = the return for the company's stock during September (in the year of the observation). 
* **ReturnOct** = the return for the company's stock during October (in the year of the observation). 
* **ReturnNov** = the return for the company's stock during November (in the year of the observation). 
* **PositiveDec** = whether or not the company's stock had a positive return in December (in the year of the observation). This variable takes value 1 if the return was positive, and value 0 if the return was not positive.

For the first 11 variables, the value stored is a proportional change in stock value during that month. For instance, a value of 0.05 means the stock increased in value 5% during the month, while a value of -0.02 means the stock decreased in value 2% during the month.
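As a small illustration of how these values are defined, the sketch below derives proportional monthly returns from a series of month-end prices. The prices are made up for the example and are not taken from StocksCluster.csv.

```r
# Hypothetical month-end prices for one stock (illustrative values only)
prices <- c(100, 105, 102.9, 102.9)

# Proportional change from one month to the next: (new - old) / old
returns <- diff(prices) / head(prices, -1)
round(returns, 4)

# A PositiveDec-style indicator: 1 if the return is positive, 0 otherwise
as.integer(returns > 0)
```

The last month has a return of exactly zero, so its indicator is 0: "not positive" covers both flat and negative months.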

# Exploring the Dataset

In [1]:
options(jupyter.plot_mimetypes = 'image/png')

In [2]:
stocks = read.csv('data/StocksCluster.csv')

In [3]:
str(stocks)

'data.frame':	11580 obs. of  12 variables:
 $ ReturnJan  : num  0.0807 -0.0107 0.0477 -0.074 -0.031 ...
 $ ReturnFeb  : num  0.0663 0.1021 0.036 -0.0482 -0.2127 ...
 $ ReturnMar  : num  0.0329 0.1455 0.0397 0.0182 0.0915 ...
 $ ReturnApr  : num  0.1831 -0.0844 -0.1624 -0.0247 0.1893 ...
 $ ReturnMay  : num  0.13033 -0.3273 -0.14743 -0.00604 -0.15385 ...
 $ ReturnJune : num  -0.0176 -0.3593 0.0486 -0.0253 -0.1061 ...
 $ ReturnJuly : num  -0.0205 -0.0253 -0.1354 -0.094 0.3553 ...
 $ ReturnAug  : num  0.0247 0.2113 0.0334 0.0953 0.0568 ...
 $ ReturnSep  : num  -0.0204 -0.58 0 0.0567 0.0336 ...
 $ ReturnOct  : num  -0.1733 -0.2671 0.0917 -0.0963 0.0363 ...
 $ ReturnNov  : num  -0.0254 -0.1512 -0.0596 -0.0405 -0.0853 ...
 $ PositiveDec: int  0 0 0 1 1 1 1 0 0 0 ...


In [5]:
# Proportion of observations with a positive December return
mean(stocks$PositiveDec)

In [6]:
cor(stocks)

| | ReturnJan | ReturnFeb | ReturnMar | ReturnApr | ReturnMay | ReturnJune | ReturnJuly | ReturnAug | ReturnSep | ReturnOct | ReturnNov | PositiveDec |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ReturnJan | 1.0000 | 0.0668 | -0.0905 | -0.0377 | -0.0444 | 0.0922 | -0.0814 | -0.0228 | -0.0264 | 0.1430 | 0.0676 | 0.0047 |
| ReturnFeb | 0.0668 | 1.0000 | -0.1560 | -0.1914 | -0.0955 | 0.1700 | -0.0618 | 0.1316 | 0.0435 | -0.0873 | -0.1547 | -0.0382 |
| ReturnMar | -0.0905 | -0.1560 | 1.0000 | 0.0097 | -0.0039 | -0.0859 | 0.0034 | -0.0220 | 0.0765 | -0.0119 | 0.0373 | 0.0224 |
| ReturnApr | -0.0377 | -0.1914 | 0.0097 | 1.0000 | 0.0638 | -0.0110 | 0.0806 | -0.0518 | -0.0289 | 0.0485 | 0.0318 | 0.0944 |
| ReturnMay | -0.0444 | -0.0955 | -0.0039 | 0.0638 | 1.0000 | -0.0211 | 0.0909 | -0.0331 | 0.0220 | 0.0172 | 0.0480 | 0.0582 |
| ReturnJune | 0.0922 | 0.1700 | -0.0859 | -0.0110 | -0.0211 | 1.0000 | -0.0292 | 0.0107 | 0.0447 | -0.0226 | -0.0653 | 0.0234 |
| ReturnJuly | -0.0814 | -0.0618 | 0.0034 | 0.0806 | 0.0909 | -0.0292 | 1.0000 | 0.0007 | 0.0689 | -0.0547 | -0.0484 | 0.0744 |
| ReturnAug | -0.0228 | 0.1316 | -0.0220 | -0.0518 | -0.0331 | 0.0107 | 0.0007 | 1.0000 | 0.0007 | -0.0756 | -0.1165 | 0.0042 |
| ReturnSep | -0.0264 | 0.0435 | 0.0765 | -0.0289 | 0.0220 | 0.0447 | 0.0689 | 0.0007 | 1.0000 | -0.0581 | -0.0197 | 0.0416 |
| ReturnOct | 0.1430 | -0.0873 | -0.0119 | 0.0485 | 0.0172 | -0.0226 | -0.0547 | -0.0756 | -0.0581 | 1.0000 | 0.1917 | -0.0526 |


In [10]:
# Mean of every column, sorted from smallest to largest
sort(colMeans(stocks))

# Initial Logistic Regression Model

## Train/test split

In [12]:
library(caTools)
set.seed(144)

spl = sample.split(stocks$PositiveDec, SplitRatio = 0.7)

stocksTrain = subset(stocks, spl == TRUE)

stocksTest = subset(stocks, spl == FALSE)

In [13]:
model_LR = glm(PositiveDec ~ .,
              family = binomial,
              data = stocksTrain)

In [18]:
# Training-set accuracy at a 0.5 threshold:
predictTrain = predict(model_LR, type = 'response', newdata = stocksTrain)
t = table(stocksTrain$PositiveDec, predictTrain > 0.5)
t
sum(diag(t)) / sum(t)

   
    FALSE TRUE
  0   990 2689
  1   787 3640

In [19]:
# Test-set accuracy at a 0.5 threshold:
predictTest = predict(model_LR, type = 'response', newdata = stocksTest)
t = table(stocksTest$PositiveDec, predictTest > 0.5)
t
sum(diag(t)) / sum(t)

   
    FALSE TRUE
  0   417 1160
  1   344 1553

In [21]:
# Test-set baseline: proportion of test observations with a positive December return
mean(stocksTest$PositiveDec)
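The test-set accuracy and the baseline can both be checked by hand from the counts in the test-set confusion matrix printed above. The sketch below re-enters those four counts rather than using the fitted model, so it runs without the data file.

```r
# Counts copied from the test-set table above (rows = actual, columns = predicted)
cm <- matrix(c(417, 1160,
               344, 1553),
             nrow = 2, byrow = TRUE,
             dimnames = list(actual = c("0", "1"),
                             predicted = c("FALSE", "TRUE")))

# Accuracy: correctly classified observations (the diagonal) over all observations
accuracy <- sum(diag(cm)) / sum(cm)

# Baseline: always predict the more frequent actual class
baseline <- max(rowSums(cm)) / sum(cm)

round(c(accuracy = accuracy, baseline = baseline), 4)
```

This works out to an accuracy of about 0.567 against a baseline of about 0.546.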

# Clustering Stocks

In [22]:
limitedTrain = stocksTrain

limitedTrain$PositiveDec = NULL

limitedTest = stocksTest

limitedTest$PositiveDec = NULL

## Preprocessing with caret: normalization

In [23]:
library(caret)

preproc = preProcess(limitedTrain)

normTrain = predict(preproc, limitedTrain)

normTest = predict(preproc, limitedTest)


* * *

In [24]:
mean(normTrain$ReturnJan)

In [25]:
mean(normTest$ReturnJan)
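By default, caret's `preProcess` centers and scales each column, and `predict` then applies the parameters learned from the training set to whatever data it is given. The sketch below mimics that for a single column in base R, using synthetic values in place of `ReturnJan`; the names `train_x` and `test_x` and the seed are assumptions made for this example.

```r
set.seed(1)  # arbitrary seed for this synthetic example

# Synthetic stand-ins for one training column and one test column
train_x <- rnorm(700, mean = 0.01, sd = 0.15)
test_x  <- rnorm(300, mean = 0.01, sd = 0.15)

# Centering and scaling parameters come from the training data only
mu    <- mean(train_x)
sigma <- sd(train_x)

norm_train <- (train_x - mu) / sigma
norm_test  <- (test_x  - mu) / sigma

mean(norm_train)  # zero up to floating-point error
mean(norm_test)   # close to zero, but generally not exactly zero
```

Because the test column is shifted by the training mean rather than its own mean, its normalized mean is small but typically nonzero.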

# K-Means Clustering

In [26]:
set.seed(144)
k = 3

KMC = kmeans(normTrain, centers = k, iter.max = 1000)
str(KMC)

List of 9
 $ cluster     : Named int [1:8106] 1 1 1 3 1 3 2 2 1 3 ...
  ..- attr(*, "names")= chr [1:8106] "1" "2" "4" "6" ...
 $ centers     : num [1:3, 1:11] -0.4523 0.2641 0.7425 -0.1497 -0.0415 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:3] "1" "2" "3"
  .. ..$ : chr [1:11] "ReturnJan" "ReturnFeb" "ReturnMar" "ReturnApr" ...
 $ totss       : num 89155
 $ withinss    : num [1:3] 31204 38032 9937
 $ tot.withinss: num 79173
 $ betweenss   : num 9982
 $ size        : int [1:3] 3157 4696 253
 $ iter        : int 5
 $ ifault      : int 0
 - attr(*, "class")= chr "kmeans"


Recall from the recitation that we can use the **flexclust** package to obtain training set and testing set cluster assignments for our observations (note that the call to as.kcca may take a while to complete):

In [27]:
library(flexclust)

KMC.kcca = as.kcca(KMC, normTrain)

clusterTrain = predict(KMC.kcca)

clusterTest = predict(KMC.kcca, newdata=normTest)



In [30]:
table(clusterTest)

clusterTest
   1    2    3 
1298 2080   96 
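Conceptually, assigning a new observation to an existing k-means cluster means finding the closest cluster center. The base-R sketch below shows that rule on made-up two-dimensional centers and points; it is an illustration of the idea and not the flexclust implementation.

```r
# Hypothetical cluster centers (rows) in a two-variable space
centers <- rbind(c(-1, -1),
                 c( 0,  1),
                 c( 2,  0))

# Hypothetical new observations to assign (rows)
new_points <- rbind(c(-0.8, -1.2),
                    c( 0.1,  0.9),
                    c( 1.9,  0.3),
                    c( 0.9,  0.6))

# Squared Euclidean distance from every point to every center
dist_sq <- sapply(seq_len(nrow(centers)), function(j) {
  rowSums(sweep(new_points, 2, centers[j, ])^2)
})

# Each point goes to the cluster whose center is closest
assigned <- apply(dist_sq, 1, which.min)
assigned
table(assigned)
```

Here the four points are assigned to clusters 1, 2, 3, and 2 respectively.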

# Cluster-Specific Predictions

In [33]:
stocksTrain1 = stocksTrain[clusterTrain == 1, ]
stocksTrain2 = stocksTrain[clusterTrain == 2, ]
stocksTrain3 = stocksTrain[clusterTrain == 3, ]

stocksTest1 = stocksTest[clusterTest == 1, ]
stocksTest2 = stocksTest[clusterTest == 2, ]
stocksTest3 = stocksTest[clusterTest == 3, ]

In [34]:
# mean() on a whole data frame returns NA with a warning; colMeans() gives per-column means
colMeans(stocksTrain1)

In [45]:
mean(stocksTrain1$PositiveDec)
mean(stocksTrain2$PositiveDec)
mean(stocksTrain3$PositiveDec)

In [46]:
model_LR_1 = glm(PositiveDec ~ .,
              family = binomial,
              data = stocksTrain1)
model_LR_2 = glm(PositiveDec ~ .,
              family = binomial,
              data = stocksTrain2)
model_LR_3 = glm(PositiveDec ~ .,
              family = binomial,
              data = stocksTrain3)

In [52]:
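# For each coefficient, count how many of the three cluster models estimate it as positive.
# A value of 0 or 3 means the sign agrees across clusters; 1 or 2 means it differs.
(model_LR_1$coefficients > 0) + (model_LR_2$coefficients > 0) + (model_LR_3$coefficients > 0)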

In [53]:
predictTest1 = predict(model_LR_1, type = 'response', newdata = stocksTest1)
t = table(stocksTest1$PositiveDec, predictTest1 > 0.5)
t
sum(diag(t)) / sum(t)

   
    FALSE TRUE
  0    30  471
  1    23  774

In [54]:
predictTest2 = predict(model_LR_2, type = 'response', newdata = stocksTest2)
t = table(stocksTest2$PositiveDec, predictTest2 > 0.5)
t
sum(diag(t)) / sum(t)

   
    FALSE TRUE
  0   388  626
  1   309  757

In [55]:
predictTest3 = predict(model_LR_3, type = 'response', newdata = stocksTest3)
t = table(stocksTest3$PositiveDec, predictTest3 > 0.5)
t
sum(diag(t)) / sum(t)

   
    FALSE TRUE
  0    49   13
  1    21   13

In [56]:
AllPredictions = c(predictTest1, predictTest2, predictTest3)
AllOutcomes = c(stocksTest1$PositiveDec, stocksTest2$PositiveDec, stocksTest3$PositiveDec)

In [64]:
t = table(AllOutcomes, AllPredictions > 0.5)
t
sum(diag(t)) / sum(t)

           
AllOutcomes FALSE TRUE
          0   467 1110
          1   353 1544
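The pooled accuracy is the size-weighted average of the three per-cluster accuracies. The sketch below recomputes it from the counts in the three cluster-specific tables above, re-entered by hand so that it runs without the fitted models.

```r
# Per-cluster test-set counts copied from the three tables above
# (order within each vector: actual 0 / predicted FALSE, actual 0 / predicted TRUE,
#  actual 1 / predicted FALSE, actual 1 / predicted TRUE)
cluster1 <- c(30, 471, 23, 774)
cluster2 <- c(388, 626, 309, 757)
cluster3 <- c(49, 13, 21, 13)

# Accuracy = (true negatives + true positives) / cluster size
acc <- function(counts) (counts[1] + counts[4]) / sum(counts)
per_cluster <- c(acc(cluster1), acc(cluster2), acc(cluster3))
sizes <- c(sum(cluster1), sum(cluster2), sum(cluster3))

# Pooled accuracy is the size-weighted mean of the per-cluster accuracies
pooled <- sum(per_cluster * sizes) / sum(sizes)

round(per_cluster, 4)
round(pooled, 4)
```

The per-cluster accuracies are roughly 0.619, 0.550, and 0.646, and the pooled value is about 0.579. Cluster 2 is the largest, so it pulls the pooled figure toward its own lower accuracy.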

We see a modest improvement over the original logistic regression model: test-set accuracy rises from about 0.567 to about 0.579. Since predicting stock returns is a notoriously hard problem, even a gain of roughly one percentage point is worthwhile. By investing only in stocks for which we are more confident of positive returns (that is, the ones with higher predicted probabilities), this cluster-then-predict model can give us an edge over the original logistic regression model.
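Selecting only the higher-probability predictions can be sketched as follows. The probabilities, outcomes, and the 0.65 cutoff are all hypothetical values chosen for illustration, not results from the models above.

```r
# Hypothetical predicted probabilities and actual outcomes (illustrative only)
pred_prob <- c(0.72, 0.55, 0.48, 0.66, 0.61, 0.39, 0.58, 0.70)
outcome   <- c(1,    0,    0,    1,    0,    0,    1,    0)

# Keep only the stocks whose predicted probability clears a stricter cutoff
cutoff   <- 0.65   # assumed cutoff for this sketch
selected <- pred_prob >= cutoff

sum(selected)             # how many stocks we would invest in
mean(outcome[selected])   # share of the selected stocks with a positive return
mean(outcome)             # share across all stocks, for comparison
```

Raising the cutoff trades coverage for confidence: fewer stocks are selected, and whether the hit rate among them actually improves has to be checked on held-out data.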