https://courses.edx.org/courses/course-v1:MITx+15.071x_3+1T2016/courseware/d32b0c36ff484c228b8117257349d0e6/27bfa0a7d1304080a09965a5773c16f3/

Predicting Stock Returns with Cluster-Then-Predict

In the second lecture sequence this week, we heard about cluster-then-predict, a methodology in which you first cluster observations and then build cluster-specific prediction models. In the lecture sequence, we saw how this methodology helped improve the prediction of heart attack risk. In this assignment, we'll use cluster-then-predict to predict future stock returns using historical stock data.

When selecting which stocks to invest in, investors seek to obtain good future returns. In this problem, we will first use clustering to identify clusters of stocks that have similar returns over time. Then, we'll use logistic regression to predict whether or not the stocks will have positive future returns.
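
The two steps described above can be sketched on simulated data before we touch the real dataset. This is only an illustration under stated assumptions: the data frame `toy`, its columns `x1`, `x2`, `y`, and the choice of two clusters are all hypothetical and are not part of the assignment.

```r
# Minimal cluster-then-predict sketch on simulated data (all names hypothetical).
set.seed(1)
n  <- 300
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- rbinom(n, 1, plogis(0.8 * x1 - 0.5 * x2))
toy <- data.frame(x1, x2, y)

# Step 1: cluster on the predictors only (the outcome is left out).
km_sketch <- kmeans(toy[, c("x1", "x2")], centers = 2)

# Step 2: fit one logistic regression per cluster.
models <- lapply(split(toy, km_sketch$cluster), function(d) {
  glm(y ~ x1 + x2, data = d, family = binomial)
})

# `models` holds one fitted glm per cluster.
length(models)
```

Leaving the outcome out of the clustering step matters: the cluster of a new observation has to be determined before its outcome is known.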

For this problem, we'll use StocksCluster.csv, which contains monthly stock returns from the NASDAQ stock exchange. The NASDAQ is the second-largest stock exchange in the world, and it lists many technology companies. The stock price data used in this problem was obtained from infochimps, a website providing access to many datasets.

Each observation in the dataset contains the monthly returns of a particular company in a particular year. The years included are 2000-2009. The companies are limited to tickers that were listed on the exchange for the entire period 2000-2009, and whose stock price never fell below $1. So, for example, one observation is for Yahoo in 2000, and another observation is for Yahoo in 2001. Our goal will be to predict whether or not the stock return in December will be positive, using the stock returns for the first 11 months of the year.

This dataset contains the following variables:

    ReturnJan = the return for the company's stock during January (in the year of the observation). 
    ReturnFeb = the return for the company's stock during February (in the year of the observation). 
    ReturnMar = the return for the company's stock during March (in the year of the observation). 
    ReturnApr = the return for the company's stock during April (in the year of the observation). 
    ReturnMay = the return for the company's stock during May (in the year of the observation). 
    ReturnJune = the return for the company's stock during June (in the year of the observation). 
    ReturnJuly = the return for the company's stock during July (in the year of the observation). 
    ReturnAug = the return for the company's stock during August (in the year of the observation). 
    ReturnSep = the return for the company's stock during September (in the year of the observation). 
    ReturnOct = the return for the company's stock during October (in the year of the observation). 
    ReturnNov = the return for the company's stock during November (in the year of the observation). 
    PositiveDec = whether or not the company's stock had a positive return in December (in the year of the observation). This variable takes value 1 if the return was positive, and value 0 if the return was not positive.

For the first 11 variables, the value stored is a proportional change in stock value during that month. For instance, a value of 0.05 means the stock increased in value 5% during the month, while a value of -0.02 means the stock decreased in value 2% during the month.
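
As a small worked example of a proportional change, the sketch below computes a monthly return from two prices and derives a PositiveDec-style indicator from it. The prices are made up, and the source does not say exactly how the returns in the file were computed (for instance, whether dividends are included), so treat the formula as an assumption.

```r
# Hypothetical month-start and month-end prices for one stock.
price_start <- 20.00
price_end   <- 21.00

# Proportional return: change in price relative to the starting price.
monthly_return <- (price_end - price_start) / price_start
monthly_return                 # 0.05, i.e. a 5% increase

# PositiveDec-style indicator: 1 if the return is positive, 0 otherwise.
positive <- as.integer(monthly_return > 0)
positive                       # 1
```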

In [1]:
stocks = read.csv("StocksCluster.csv")

In [2]:
str(stocks)

'data.frame':	11580 obs. of  12 variables:
 $ ReturnJan  : num  0.0807 -0.0107 0.0477 -0.074 -0.031 ...
 $ ReturnFeb  : num  0.0663 0.1021 0.036 -0.0482 -0.2127 ...
 $ ReturnMar  : num  0.0329 0.1455 0.0397 0.0182 0.0915 ...
 $ ReturnApr  : num  0.1831 -0.0844 -0.1624 -0.0247 0.1893 ...
 $ ReturnMay  : num  0.13033 -0.3273 -0.14743 -0.00604 -0.15385 ...
 $ ReturnJune : num  -0.0176 -0.3593 0.0486 -0.0253 -0.1061 ...
 $ ReturnJuly : num  -0.0205 -0.0253 -0.1354 -0.094 0.3553 ...
 $ ReturnAug  : num  0.0247 0.2113 0.0334 0.0953 0.0568 ...
 $ ReturnSep  : num  -0.0204 -0.58 0 0.0567 0.0336 ...
 $ ReturnOct  : num  -0.1733 -0.2671 0.0917 -0.0963 0.0363 ...
 $ ReturnNov  : num  -0.0254 -0.1512 -0.0596 -0.0405 -0.0853 ...
 $ PositiveDec: int  0 0 0 1 1 1 1 0 0 0 ...


In [3]:
summary(stocks$PositiveDec)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.0000  0.0000  1.0000  0.5461  1.0000  1.0000 

In [4]:
sum(stocks$PositiveDec)/nrow(stocks)

In [5]:
which.max(cor(stocks))

In [6]:
colnames(stocks)

In [7]:
sort(cor(stocks), decreasing = TRUE)

In [8]:
cor(stocks)

,ReturnJan,ReturnFeb,ReturnMar,ReturnApr,ReturnMay,ReturnJune,ReturnJuly,ReturnAug,ReturnSep,ReturnOct,ReturnNov,PositiveDec
ReturnJan,1.0,0.066774583,-0.090496798,-0.037678006,-0.044411417,0.092238307,-0.081429765,-0.022792019,-0.026437153,0.142977229,0.067632333,0.004728518
ReturnFeb,0.06677458,1.0,-0.15598326,-0.19135192,-0.09552092,0.16999448,-0.06177851,0.13155979,0.04350177,-0.08732427,-0.15465828,-0.03817318
ReturnMar,-0.090496798,-0.155983263,1.0,0.009726288,-0.003892789,-0.085905486,0.00337416,-0.0220054,0.076518327,-0.011923758,0.037323535,0.022408661
ReturnApr,-0.037678006,-0.191351924,0.009726288,1.0,0.063822504,-0.011027752,0.080631932,-0.051756051,-0.028920972,0.048540025,0.031761837,0.094353528
ReturnMay,-0.044411417,-0.09552092,-0.003892789,0.063822504,1.0,-0.021074539,0.090850264,-0.033125658,0.021962862,0.017166728,0.04804659,0.058201934
ReturnJune,0.09223831,0.16999448,-0.08590549,-0.01102775,-0.02107454,1.0,-0.0291526,0.01071053,0.04474727,-0.02263599,-0.06527054,0.02340975
ReturnJuly,-0.081429765,-0.0617785094,0.0033741597,0.0806319317,0.0908502642,-0.0291525996,1.0,0.0007137558,0.0689478037,-0.0547089088,-0.0483738369,0.0743642097
ReturnAug,-0.0227920187,0.1315597863,-0.0220053995,-0.051756051,-0.033125658,0.010710526,0.0007137558,1.0,0.0007407139,-0.0755945614,-0.1164890345,0.0041669657
ReturnSep,-0.0264371526,0.0435017706,0.0765183267,-0.0289209718,0.0219628623,0.0447472692,0.0689478037,0.0007407139,1.0,-0.0580792362,-0.0197197998,0.0416302863
ReturnOct,0.14297723,-0.08732427,-0.01192376,0.04854003,0.01716673,-0.02263599,-0.05470891,-0.07559456,-0.05807924,1.0,0.19167279,-0.05257496
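
Among the values shown, the largest correlation off the diagonal is 0.19167279, between ReturnOct and ReturnNov. Calling which.max directly on a correlation matrix does not find this, because every diagonal entry equals 1. One way to locate the strongest pair is to blank out the diagonal first. The sketch below does that on a simulated data frame; `toy` and its columns are hypothetical stand-ins for `stocks`.

```r
# Toy data frame standing in for `stocks` (values are simulated).
set.seed(42)
toy <- data.frame(a = rnorm(50), b = rnorm(50))
toy$c <- toy$a + rnorm(50, sd = 0.1)    # c is built to track a closely

cm_cor <- cor(toy)
diag(cm_cor) <- NA                       # ignore self-correlations

# Position of the largest remaining entry (first match of the symmetric pair).
idx <- which(cm_cor == max(cm_cor, na.rm = TRUE), arr.ind = TRUE)[1, ]
best_pair <- c(rownames(cm_cor)[idx["row"]], colnames(cm_cor)[idx["col"]])
best_pair
```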


In [9]:
sort(colMeans(stocks))

In [10]:
library(caTools)

In [11]:
set.seed(144)
split = sample.split(stocks$PositiveDec, SplitRatio = 0.7)

In [12]:
stocksTrain = subset(stocks, split == TRUE)
stocksTest = subset(stocks, split == FALSE)
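
The split above keeps the share of positive outcomes the same in both pieces: the outputs further down show 4427 of 8106 training rows and 1897 of 3474 test rows with PositiveDec equal to 1, about 54.6% in each. The base-R sketch below illustrates the idea of a stratified 70/30 split on a made-up outcome vector. It is not the code caTools uses internally.

```r
# Base-R sketch of a stratified 70/30 split (not caTools' own code).
set.seed(144)
y <- rep(c(0, 1), times = c(40, 60))     # hypothetical 0/1 outcome

in_train <- logical(length(y))
for (cls in unique(y)) {
  idx <- which(y == cls)
  # Sample 70% of the rows within each class separately.
  in_train[sample(idx, round(0.7 * length(idx)))] <- TRUE
}

# Class proportions are preserved in both pieces.
mean(y[in_train])      # share of 1s in the training part
mean(y[!in_train])     # share of 1s in the test part
```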

In [13]:
StocksModel = glm(PositiveDec ~ ., data = stocksTrain, family = binomial)

In [14]:
stocksPredTrain = predict(StocksModel, type = "response")

In [15]:
stocksPredTrain[1:10]

In [16]:
t = table(stocksTrain$PositiveDec, stocksPredTrain > 0.5)
t

   
    FALSE TRUE
  0   990 2689
  1   787 3640

In [17]:
sum(diag(t))/nrow(stocksTrain)
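
The accuracy computed here is the sum of the diagonal of the confusion matrix divided by the number of observations. The sketch below redoes that arithmetic from the counts printed above, so it runs without the dataset. The name `cm_train` is introduced only for this illustration.

```r
# Training-set confusion matrix copied from the output above.
cm_train <- matrix(c(990, 787, 2689, 3640), nrow = 2,
                   dimnames = list(actual = c("0", "1"),
                                   predicted = c("FALSE", "TRUE")))

# Correct predictions sit on the diagonal.
train_acc <- sum(diag(cm_train)) / sum(cm_train)
round(train_acc, 4)    # 0.5712
```

The diagonal approach assumes the table has both a FALSE and a TRUE column. If a model predicted the same class for every row, comparing the predicted and actual vectors element by element and taking the mean would be a safer way to get the same number.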

In [18]:
stocksPredTest = predict(StocksModel, newdata = stocksTest, type = "response")

In [19]:
t = table(stocksTest$PositiveDec, stocksPredTest > 0.5)
t

   
    FALSE TRUE
  0   417 1160
  1   344 1553

In [20]:
sum(diag(t))/nrow(stocksTest)

In [21]:
table(stocksTest$PositiveDec)


   0    1 
1577 1897 

In [22]:
1897/(1897+1577)
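
To put the model's test-set accuracy next to the baseline of always predicting the most common outcome, the sketch below recomputes both from the counts printed above. The variable names are introduced only for this illustration.

```r
# Test-set confusion matrix copied from the output above (rows: actual 0, actual 1).
cm_test <- matrix(c(417, 344, 1160, 1553), nrow = 2)

model_acc    <- sum(diag(cm_test)) / sum(cm_test)
baseline_acc <- max(rowSums(cm_test)) / sum(cm_test)   # always predict the majority class

round(c(model = model_acc, baseline = baseline_acc), 4)
# model 0.5671, baseline 0.5461
```

On these counts the logistic regression beats the baseline by roughly two percentage points.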

In [23]:
limitedTrain = stocksTrain

In [24]:
limitedTrain$PositiveDec = NULL

In [25]:
limitedTest = stocksTest

In [26]:
limitedTest$PositiveDec = NULL

In [27]:
library(caret)

Loading required package: lattice
Loading required package: ggplot2
Warning message: package ‘ggplot2’ was built under R version 3.3.0

In [28]:
preproc = preProcess(limitedTrain)

In [29]:
normTrain = predict(preproc, limitedTrain)

In [30]:
normTest = predict(preproc, limitedTest)

In [31]:
mean(normTrain$ReturnJan)

In [32]:
mean(normTest$ReturnJan)
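
By default, caret's preProcess centers and scales each variable, and the constants are learned from the data passed to preProcess, here the training set. That is why the training mean is zero up to rounding while the test mean is only close to zero. The base-R sketch below shows the same effect on simulated vectors. It illustrates the idea and is not caret's implementation.

```r
# Hypothetical train/test vectors of monthly returns.
set.seed(7)
train <- rnorm(100, mean = 0.01, sd = 0.1)
test  <- rnorm(40,  mean = 0.01, sd = 0.1)

# Learn centering and scaling constants from the training data only.
mu    <- mean(train)
sigma <- sd(train)

norm_train <- (train - mu) / sigma
norm_test  <- (test  - mu) / sigma    # reuses the training constants

mean(norm_train)    # zero up to floating-point error
mean(norm_test)     # generally close to, but not exactly, zero
```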

In [33]:
set.seed(144)
km = kmeans(normTrain, centers = 3)

In [56]:
summary(km)

             Length Class  Mode   
cluster      8106   -none- numeric
centers        33   -none- numeric
totss           1   -none- numeric
withinss        3   -none- numeric
tot.withinss    1   -none- numeric
betweenss       1   -none- numeric
size            3   -none- numeric
iter            1   -none- numeric
ifault          1   -none- numeric

In [34]:
table(km$cluster)


   1    2    3 
3157 4696  253 
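
Calling summary on a kmeans result only lists its components, as seen above: `centers` has length 33 because there are 3 clusters and 11 variables. The components themselves are usually more informative. The sketch below inspects them on a small simulated matrix; `toy_pts` and `km_toy` are hypothetical names.

```r
# Two well-separated groups of simulated 2-D points.
set.seed(144)
toy_pts <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
                 matrix(rnorm(40, mean = 5), ncol = 2))
km_toy <- kmeans(toy_pts, centers = 2)

km_toy$size         # observations per cluster, same counts as table(km_toy$cluster)
km_toy$centers      # one row of coordinates per cluster centroid
```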

In [35]:
library(flexclust)

Warning message: package ‘flexclust’ was built under R version 3.3.0
Loading required package: grid
Loading required package: modeltools
Warning message: package ‘modeltools’ was built under R version 3.3.0
Loading required package: stats4


In [36]:
km.kcca = as.kcca(km, normTrain)

In [55]:
summary(km.kcca)

kcca object of family ‘kmeans’ 

call:
as.kcca(object = km, data = normTrain)

cluster info:
  size  av_dist max_dist separation
1 3157 2.686794 42.55611  0.9913070
2 4696 2.348008 39.19962  0.9657141
3  253 5.436050 32.64870  3.1846970

convergence after 1 iterations
sum of within cluster distances: 20883.77 


In [37]:
clusterTrain = predict(km.kcca)

In [38]:
clusterTest = predict(km.kcca, newdata = normTest)

In [39]:
table(clusterTest)

clusterTest
   1    2    3 
1298 2080   96 

In [40]:
table(clusterTrain)

clusterTrain
   1    2    3 
3157 4696  253 
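
The training assignments match the sizes reported by kmeans earlier (3157, 4696, 253). For test observations, a k-means model assigns each point to the cluster whose centroid is nearest. The sketch below shows that rule in base R with made-up centroids and points. It is an illustration of the idea, not flexclust's code, and `assign_cluster` is a hypothetical helper.

```r
# Hypothetical centroids (one per row) and new points to assign.
centers <- rbind(c(0, 0), c(5, 5))
new_pts <- rbind(c(0.2, -0.1), c(4.8, 5.3), c(1, 1))

# For each point, pick the centroid with the smallest squared Euclidean distance.
assign_cluster <- function(pts, ctrs) {
  apply(pts, 1, function(p) which.min(colSums((t(ctrs) - p)^2)))
}

assigned <- assign_cluster(new_pts, centers)
assigned    # 1 2 1
```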

In [41]:
stocksTrain1 = subset(stocksTrain, clusterTrain == 1)
stocksTrain2 = subset(stocksTrain, clusterTrain == 2)
stocksTrain3 = subset(stocksTrain, clusterTrain == 3)

In [42]:
mean(stocksTrain1$PositiveDec)
mean(stocksTrain2$PositiveDec)
mean(stocksTrain3$PositiveDec)
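
The three means above are the share of positive December returns within each training cluster. The same per-group calculation can be written in one call with tapply, sketched here on made-up vectors; `outcome` and `cluster` are hypothetical stand-ins for the PositiveDec column and the cluster labels.

```r
# Hypothetical outcome and cluster labels.
outcome <- c(1, 0, 1, 1, 0, 0, 1, 0)
cluster <- c(1, 1, 1, 2, 2, 3, 3, 3)

# Share of positive outcomes within each cluster.
pos_rate <- tapply(outcome, cluster, mean)
pos_rate
```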

In [43]:
stocksTest1 = subset(stocksTest, clusterTest == 1)
stocksTest2 = subset(stocksTest, clusterTest == 2)
stocksTest3 = subset(stocksTest, clusterTest == 3)

In [44]:
StocksModel1 = glm(PositiveDec~., data = stocksTrain1, family = binomial)
StocksModel2 = glm(PositiveDec~., data = stocksTrain2, family = binomial)
StocksModel3 = glm(PositiveDec~., data = stocksTrain3, family = binomial)

In [45]:
summary(StocksModel1)


Call:
glm(formula = PositiveDec ~ ., family = binomial, data = stocksTrain1)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.7307  -1.2910   0.8878   1.0280   1.5023  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  0.17224    0.06302   2.733  0.00628 ** 
ReturnJan    0.02498    0.29306   0.085  0.93206    
ReturnFeb   -0.37207    0.29123  -1.278  0.20139    
ReturnMar    0.59555    0.23325   2.553  0.01067 *  
ReturnApr    1.19048    0.22439   5.305 1.12e-07 ***
ReturnMay    0.30421    0.22845   1.332  0.18298    
ReturnJune  -0.01165    0.29993  -0.039  0.96901    
ReturnJuly   0.19769    0.27790   0.711  0.47685    
ReturnAug    0.51273    0.30858   1.662  0.09660 .  
ReturnSep    0.58833    0.28133   2.091  0.03651 *  
ReturnOct   -1.02254    0.26007  -3.932 8.43e-05 ***
ReturnNov   -0.74847    0.28280  -2.647  0.00813 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

In [46]:
summary(StocksModel2)


Call:
glm(formula = PositiveDec ~ ., family = binomial, data = stocksTrain2)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.2012  -1.1941   0.8583   1.1334   1.9424  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  0.10293    0.03785   2.719 0.006540 ** 
ReturnJan    0.88451    0.20276   4.362 1.29e-05 ***
ReturnFeb    0.31762    0.26624   1.193 0.232878    
ReturnMar   -0.37978    0.24045  -1.579 0.114231    
ReturnApr    0.49291    0.22460   2.195 0.028189 *  
ReturnMay    0.89655    0.25492   3.517 0.000436 ***
ReturnJune   1.50088    0.26014   5.770 7.95e-09 ***
ReturnJuly   0.78315    0.26864   2.915 0.003554 ** 
ReturnAug   -0.24486    0.27080  -0.904 0.365876    
ReturnSep    0.73685    0.24820   2.969 0.002989 ** 
ReturnOct   -0.27756    0.18400  -1.509 0.131419    
ReturnNov   -0.78747    0.22458  -3.506 0.000454 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

In [47]:
summary(StocksModel3)


Call:
glm(formula = PositiveDec ~ ., family = binomial, data = stocksTrain3)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.9146  -1.0393  -0.7689   1.1921   1.6939  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)  
(Intercept) -0.181896   0.325182  -0.559   0.5759  
ReturnJan   -0.009789   0.448943  -0.022   0.9826  
ReturnFeb   -0.046883   0.213432  -0.220   0.8261  
ReturnMar    0.674179   0.564790   1.194   0.2326  
ReturnApr    1.281466   0.602672   2.126   0.0335 *
ReturnMay    0.762512   0.647783   1.177   0.2392  
ReturnJune   0.329434   0.408038   0.807   0.4195  
ReturnJuly   0.774164   0.729360   1.061   0.2885  
ReturnAug    0.982605   0.533158   1.843   0.0653 .
ReturnSep    0.363807   0.627774   0.580   0.5622  
ReturnOct    0.782242   0.733123   1.067   0.2860  
ReturnNov   -0.873752   0.738480  -1.183   0.2367  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

In [48]:
predictTest1 = predict(StocksModel1, newdata = stocksTest1, type = "response")
predictTest2 = predict(StocksModel2, newdata = stocksTest2, type = "response")
predictTest3 = predict(StocksModel3, newdata = stocksTest3, type = "response")

In [49]:
t = table(stocksTest1$PositiveDec, predictTest1>0.5)
t
sum(diag(t))/sum(t)

   
    FALSE TRUE
  0    30  471
  1    23  774

In [50]:
t = table(stocksTest2$PositiveDec, predictTest2>0.5)
t
sum(diag(t))/sum(t)

   
    FALSE TRUE
  0   388  626
  1   309  757

In [51]:
t = table(stocksTest3$PositiveDec, predictTest3>0.5)
t
sum(diag(t))/sum(t)

   
    FALSE TRUE
  0    49   13
  1    21   13

In [52]:
AllPredictions = c(predictTest1 > 0.5, predictTest2 > 0.5, predictTest3 > 0.5)

In [53]:
AllOutcomes = c(stocksTest1$PositiveDec, stocksTest2$PositiveDec, stocksTest3$PositiveDec)

In [54]:
t = table(AllOutcomes, AllPredictions)
t
sum(diag(t))/sum(t)

           AllPredictions
AllOutcomes FALSE TRUE
          0   467 1110
          1   353 1544
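
The pooled confusion matrix gives the overall test accuracy of the cluster-then-predict approach. The sketch below recomputes it from the printed counts and checks that it equals the average of the per-cluster accuracies weighted by cluster size. All variable names are introduced only for this illustration.

```r
# Pooled test-set confusion matrix copied from the output above.
cm_all <- matrix(c(467, 353, 1110, 1544), nrow = 2)
overall_acc <- sum(diag(cm_all)) / sum(cm_all)
round(overall_acc, 4)    # 0.5789

# Same number as a size-weighted average of the per-cluster accuracies.
sizes <- c(1298, 2080, 96)
accs  <- c(804 / 1298, 1145 / 2080, 62 / 96)
weighted_acc <- weighted.mean(accs, sizes)
```

On these counts the overall accuracy is about 0.5789, compared with about 0.5671 for the single logistic regression on the same test set, a gain of a little over one percentage point.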