## Predicting Stock Returns with Cluster-Then-Predict

In the second lecture sequence this week, we heard about cluster-then-predict, a methodology in which you first cluster observations and then build cluster-specific prediction models. In the lecture sequence, we saw how this methodology helped improve the prediction of heart attack risk. In this assignment, we'll use cluster-then-predict to predict future stock returns using historical stock data.

When selecting which stocks to invest in, investors seek to obtain good future returns. In this problem, we will first use clustering to identify clusters of stocks that have similar returns over time. Then, we'll use logistic regression to predict whether or not the stocks will have positive future returns.

For this problem, we'll use StocksCluster.csv, which contains monthly stock returns from the NASDAQ stock exchange. The NASDAQ is the second-largest stock exchange in the world, and it lists many technology companies. The stock price data used in this problem was obtained from [infochimps](http://www.infochimps.com/datasets/nasdaq-exchange-daily-1970-2010-open-close-high-low-and-volume), a website providing access to many datasets.

Each observation in the dataset contains the monthly returns of a particular company in a particular year. The years included are 2000-2009. The companies are limited to tickers that were listed on the exchange for the entire period 2000-2009, and whose stock price never fell below $1. So, for example, one observation is for Yahoo in 2000, and another observation is for Yahoo in 2001. Our goal will be to predict whether or not the stock return in December will be positive, using the stock returns for the first 11 months of the year.

This dataset contains the following variables:

- ReturnJan = the return for the company's stock during January (in the year of the observation). 
- ReturnFeb = the return for the company's stock during February (in the year of the observation). 
- ReturnMar = the return for the company's stock during March (in the year of the observation). 
- ReturnApr = the return for the company's stock during April (in the year of the observation). 
- ReturnMay = the return for the company's stock during May (in the year of the observation). 
- ReturnJune = the return for the company's stock during June (in the year of the observation). 
- ReturnJuly = the return for the company's stock during July (in the year of the observation). 
- ReturnAug = the return for the company's stock during August (in the year of the observation). 
- ReturnSep = the return for the company's stock during September (in the year of the observation). 
- ReturnOct = the return for the company's stock during October (in the year of the observation). 
- ReturnNov = the return for the company's stock during November (in the year of the observation). 
- PositiveDec = whether or not the company's stock had a positive return in December (in the year of the observation). This variable takes value 1 if the return was positive, and value 0 if the return was not positive.

For the first 11 variables, the value stored is a proportional change in stock value during that month. For instance, a value of 0.05 means the stock increased in value 5% during the month, while a value of -0.02 means the stock decreased in value 2% during the month.
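
As a quick illustration of how such a value arises, the sketch below computes a proportional return from a start-of-month and an end-of-month price. The prices and the helper name `monthly_return` are hypothetical and are not taken from the dataset.

```R
# Hypothetical helper: proportional change in price over one month.
monthly_return = function(price_start, price_end) {
  (price_end - price_start) / price_start
}

# Hypothetical prices, chosen to reproduce the two values in the text.
monthly_return(100, 105)  # a 5% increase
monthly_return(50, 49)    # a 2% decrease
```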

### Exploring the Dataset

In [1]:
stocks = read.csv('./dataset/StocksCluster.csv')

In [2]:
str(stocks)

'data.frame':	11580 obs. of  12 variables:
 $ ReturnJan  : num  0.0807 -0.0107 0.0477 -0.074 -0.031 ...
 $ ReturnFeb  : num  0.0663 0.1021 0.036 -0.0482 -0.2127 ...
 $ ReturnMar  : num  0.0329 0.1455 0.0397 0.0182 0.0915 ...
 $ ReturnApr  : num  0.1831 -0.0844 -0.1624 -0.0247 0.1893 ...
 $ ReturnMay  : num  0.13033 -0.3273 -0.14743 -0.00604 -0.15385 ...
 $ ReturnJune : num  -0.0176 -0.3593 0.0486 -0.0253 -0.1061 ...
 $ ReturnJuly : num  -0.0205 -0.0253 -0.1354 -0.094 0.3553 ...
 $ ReturnAug  : num  0.0247 0.2113 0.0334 0.0953 0.0568 ...
 $ ReturnSep  : num  -0.0204 -0.58 0 0.0567 0.0336 ...
 $ ReturnOct  : num  -0.1733 -0.2671 0.0917 -0.0963 0.0363 ...
 $ ReturnNov  : num  -0.0254 -0.1512 -0.0596 -0.0405 -0.0853 ...
 $ PositiveDec: int  0 0 0 1 1 1 1 0 0 0 ...


In [3]:
summary(stocks)

   ReturnJan            ReturnFeb           ReturnMar        
 Min.   :-0.7616205   Min.   :-0.690000   Min.   :-0.712994  
 1st Qu.:-0.0691663   1st Qu.:-0.077748   1st Qu.:-0.046389  
 Median : 0.0009965   Median :-0.010626   Median : 0.009878  
 Mean   : 0.0126316   Mean   :-0.007605   Mean   : 0.019402  
 3rd Qu.: 0.0732606   3rd Qu.: 0.043600   3rd Qu.: 0.077066  
 Max.   : 3.0683060   Max.   : 6.943694   Max.   : 4.008621  
   ReturnApr           ReturnMay          ReturnJune       
 Min.   :-0.826503   Min.   :-0.92207   Min.   :-0.717920  
 1st Qu.:-0.054468   1st Qu.:-0.04640   1st Qu.:-0.063966  
 Median : 0.009059   Median : 0.01293   Median :-0.000880  
 Mean   : 0.026308   Mean   : 0.02474   Mean   : 0.005938  
 3rd Qu.: 0.085338   3rd Qu.: 0.08396   3rd Qu.: 0.061566  
 Max.   : 2.528827   Max.   : 6.93013   Max.   : 4.339713  
   ReturnJuly           ReturnAug           ReturnSep        
 Min.   :-0.7613096   Min.   :-0.726800   Min.   :-0.839730  
 1st Qu.:-0.0731917   

In [4]:
cor(stocks)

| | ReturnJan | ReturnFeb | ReturnMar | ReturnApr | ReturnMay | ReturnJune | ReturnJuly | ReturnAug | ReturnSep | ReturnOct | ReturnNov | PositiveDec |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ReturnJan | 1.0 | 0.06677458 | -0.090496798 | -0.037678006 | -0.044411417 | 0.09223831 | -0.081429765 | -0.0227920187 | -0.0264371526 | 0.14297723 | 0.06763233 | 0.004728518 |
| ReturnFeb | 0.066774583 | 1.0 | -0.155983263 | -0.191351924 | -0.09552092 | 0.16999448 | -0.0617785094 | 0.1315597863 | 0.0435017706 | -0.08732427 | -0.15465828 | -0.038173184 |
| ReturnMar | -0.090496798 | -0.15598326 | 1.0 | 0.009726288 | -0.003892789 | -0.08590549 | 0.0033741597 | -0.0220053995 | 0.0765183267 | -0.01192376 | 0.03732353 | 0.022408661 |
| ReturnApr | -0.037678006 | -0.19135192 | 0.009726288 | 1.0 | 0.063822504 | -0.01102775 | 0.0806319317 | -0.051756051 | -0.0289209718 | 0.04854003 | 0.03176184 | 0.094353528 |
| ReturnMay | -0.044411417 | -0.09552092 | -0.003892789 | 0.063822504 | 1.0 | -0.02107454 | 0.0908502642 | -0.033125658 | 0.0219628623 | 0.01716673 | 0.04804659 | 0.058201934 |
| ReturnJune | 0.092238307 | 0.16999448 | -0.085905486 | -0.011027752 | -0.021074539 | 1.0 | -0.0291525996 | 0.010710526 | 0.0447472692 | -0.02263599 | -0.06527054 | 0.023409745 |
| ReturnJuly | -0.081429765 | -0.06177851 | 0.00337416 | 0.080631932 | 0.090850264 | -0.0291526 | 1.0 | 0.0007137558 | 0.0689478037 | -0.05470891 | -0.04837384 | 0.07436421 |
| ReturnAug | -0.022792019 | 0.13155979 | -0.0220054 | -0.051756051 | -0.033125658 | 0.01071053 | 0.0007137558 | 1.0 | 0.0007407139 | -0.07559456 | -0.11648903 | 0.004166966 |
| ReturnSep | -0.026437153 | 0.04350177 | 0.076518327 | -0.028920972 | 0.021962862 | 0.04474727 | 0.0689478037 | 0.0007407139 | 1.0 | -0.05807924 | -0.0197198 | 0.041630286 |
| ReturnOct | 0.142977229 | -0.08732427 | -0.011923758 | 0.048540025 | 0.017166728 | -0.02263599 | -0.0547089088 | -0.0755945614 | -0.0580792362 | 1.0 | 0.19167279 | -0.052574956 |


### Initial Logistic Regression Model

In [5]:
library(caTools)

In [6]:
set.seed(144)
spl = sample.split(stocks$PositiveDec, SplitRatio=0.7)
stocksTrain = subset(stocks, spl == TRUE)
stocksTest = subset(stocks, spl == FALSE)

In [7]:
StocksModel = glm(PositiveDec ~ ., data=stocksTrain, family='binomial')

In [8]:
summary(StocksModel)


Call:
glm(formula = PositiveDec ~ ., family = "binomial", data = stocksTrain)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.4333  -1.2265   0.9102   1.1006   2.2611  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  0.14878    0.02384   6.240 4.37e-10 ***
ReturnJan    0.31742    0.13906   2.283  0.02246 *  
ReturnFeb   -0.29349    0.13113  -2.238  0.02521 *  
ReturnMar    0.28716    0.14890   1.928  0.05380 .  
ReturnApr    1.05849    0.14527   7.286 3.18e-13 ***
ReturnMay    0.75472    0.16438   4.591 4.40e-06 ***
ReturnJune   0.49435    0.15937   3.102  0.00192 ** 
ReturnJuly   0.75114    0.16110   4.662 3.12e-06 ***
ReturnAug    0.09395    0.17503   0.537  0.59142    
ReturnSep    0.72669    0.17083   4.254 2.10e-05 ***
ReturnOct   -0.60645    0.14452  -4.196 2.71e-05 ***
ReturnNov   -0.84449    0.15698  -5.380 7.46e-08 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial 
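
To make the coefficients more concrete, the sketch below converts a log-odds value into a predicted probability with the base R function `plogis`, using two estimates copied from the summary above. The scenario (a stock with a 10% April return and zero returns in every other month) is hypothetical.

```R
# Coefficient estimates copied from the summary of StocksModel above.
intercept = 0.14878
coefApr   = 1.05849

# Hypothetical stock: every monthly return is 0, so only the intercept matters.
pAllZero = plogis(intercept)

# Hypothetical stock: ReturnApr = 0.10, all other returns 0.
pAprUp = plogis(intercept + coefApr * 0.10)

c(pAllZero, pAprUp)
```

Because `plogis(0)` equals 0.5, comparing a predicted probability with 0.5 is equivalent to asking whether the linear predictor is positive.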

In [9]:
stocksPred = predict(StocksModel, newdata=stocksTrain, type='response')

In [10]:
table(stocksTrain$PositiveDec, stocksPred > 0.5)

   
    FALSE TRUE
  0   990 2689
  1   787 3640

In [11]:
accuracy = (990 + 3640) / nrow(stocksTrain)
accuracy

In [12]:
testPred = predict(StocksModel, newdata=stocksTest, type='response')
table(stocksTest$PositiveDec, testPred > 0.5)

   
    FALSE TRUE
  0   417 1160
  1   344 1553

In [13]:
accuracy = (417 + 1553) / nrow(stocksTest)
accuracy

In [14]:
table(stocksTest$PositiveDec)


   0    1 
1577 1897 

In [15]:
baseline_acc = 1897 / nrow(stocksTest)
baseline_acc
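
The accuracies above are computed from hard-coded counts. As a cross-check, the sketch below rebuilds the two confusion matrices from the printed tables and derives the same quantities. The helper name `accuracy_from_table` is an assumption introduced here for illustration.

```R
# Hypothetical helper: share of observations on the diagonal of a 2x2 table.
accuracy_from_table = function(cm) sum(diag(cm)) / sum(cm)

# Counts copied from the tables printed above (rows = actual, cols = predicted).
trainCM = matrix(c(990, 2689,
                   787, 3640), nrow = 2, byrow = TRUE)
testCM  = matrix(c(417, 1160,
                   344, 1553), nrow = 2, byrow = TRUE)

trainAcc = accuracy_from_table(trainCM)
testAcc  = accuracy_from_table(testCM)

# Baseline: always predict the most common outcome (PositiveDec = 1),
# so the baseline accuracy is the share of actual 1s in the test set.
baselineAcc = sum(testCM[2, ]) / sum(testCM)

round(c(train = trainAcc, test = testAcc, baseline = baselineAcc), 4)
```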

### Clustering Stocks

Now, let's cluster the stocks. The first step in this process is to remove the dependent variable using the following commands:
```R
limitedTrain = stocksTrain

limitedTrain$PositiveDec = NULL

limitedTest = stocksTest

limitedTest$PositiveDec = NULL
```

In [16]:
limitedTrain = stocksTrain
limitedTrain$PositiveDec = NULL
limitedTest = stocksTest
limitedTest$PositiveDec = NULL

In the market segmentation assignment in this week's homework, you were introduced to the preProcess command from the caret package, which normalizes variables by subtracting the mean and dividing by the standard deviation.

In cases where we have a training and testing set, we'll want to normalize by the mean and standard deviation of the variables in the training set. We can do this by passing just the training set to the preProcess function:
```R
library(caret)

preproc = preProcess(limitedTrain)

normTrain = predict(preproc, limitedTrain)

normTest = predict(preproc, limitedTest)
```
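
The sketch below mimics this centering and scaling with base R on a small made-up data set, to show why the normalized test-set means end up close to, but not exactly, zero. The values and the `toy*` names are hypothetical, and this is only an illustration of the idea, not a replacement for `preProcess`.

```R
# Made-up training and testing data with one variable.
toyTrain = data.frame(x = c(1, 2, 3, 4, 5))
toyTest  = data.frame(x = c(2, 4, 6))

# Statistics are estimated on the training set only.
mu    = sapply(toyTrain, mean)
sigma = sapply(toyTrain, sd)

# Apply the same training-set statistics to both sets.
toyNormTrain = as.data.frame(scale(toyTrain, center = mu, scale = sigma))
toyNormTest  = as.data.frame(scale(toyTest,  center = mu, scale = sigma))

colMeans(toyNormTrain)  # zero up to floating-point error
colMeans(toyNormTest)   # generally not zero
```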

In [18]:
library("caret")
preproc = preProcess(limitedTrain)
normTrain = predict(preproc, limitedTrain)
normTest = predict(preproc, limitedTest)

In [19]:
summary(normTrain)

   ReturnJan          ReturnFeb          ReturnMar          ReturnApr      
 Min.   :-4.57682   Min.   :-3.43004   Min.   :-4.54609   Min.   :-5.0227  
 1st Qu.:-0.48271   1st Qu.:-0.35589   1st Qu.:-0.40758   1st Qu.:-0.4757  
 Median :-0.07055   Median :-0.01875   Median :-0.05778   Median :-0.1104  
 Mean   : 0.00000   Mean   : 0.00000   Mean   : 0.00000   Mean   : 0.0000  
 3rd Qu.: 0.35898   3rd Qu.: 0.25337   3rd Qu.: 0.36106   3rd Qu.: 0.3400  
 Max.   :18.06234   Max.   :34.92751   Max.   :24.77296   Max.   :14.6959  
   ReturnMay          ReturnJune         ReturnJuly         ReturnAug       
 Min.   :-4.96759   Min.   :-4.82957   Min.   :-5.19139   Min.   :-5.60378  
 1st Qu.:-0.43045   1st Qu.:-0.45602   1st Qu.:-0.51832   1st Qu.:-0.47163  
 Median :-0.06983   Median :-0.04354   Median :-0.02372   Median :-0.07393  
 Mean   : 0.00000   Mean   : 0.00000   Mean   : 0.00000   Mean   : 0.00000  
 3rd Qu.: 0.35906   3rd Qu.: 0.37273   3rd Qu.: 0.47735   3rd Qu.: 0.39967  
 Max. 

In [20]:
summary(normTest)

   ReturnJan           ReturnFeb           ReturnMar          ReturnApr       
 Min.   :-3.743836   Min.   :-3.251044   Min.   :-4.07731   Min.   :-4.47865  
 1st Qu.:-0.485690   1st Qu.:-0.348951   1st Qu.:-0.40662   1st Qu.:-0.51121  
 Median :-0.066856   Median :-0.006860   Median :-0.05674   Median :-0.11414  
 Mean   :-0.000419   Mean   :-0.003862   Mean   : 0.00583   Mean   :-0.03638  
 3rd Qu.: 0.357729   3rd Qu.: 0.264647   3rd Qu.: 0.35653   3rd Qu.: 0.32742  
 Max.   : 8.412973   Max.   : 9.552365   Max.   : 9.00982   Max.   : 6.84589  
   ReturnMay          ReturnJune         ReturnJuly          ReturnAug       
 Min.   :-5.84445   Min.   :-4.73628   Min.   :-5.201454   Min.   :-4.62097  
 1st Qu.:-0.43819   1st Qu.:-0.44968   1st Qu.:-0.512039   1st Qu.:-0.51546  
 Median :-0.05346   Median :-0.02678   Median :-0.026576   Median :-0.10277  
 Mean   : 0.02651   Mean   : 0.04315   Mean   : 0.006016   Mean   :-0.04973  
 3rd Qu.: 0.42290   3rd Qu.: 0.43010   3rd Qu.: 0.457193 

In [23]:
RNGkind(sample.kind = "Rounding")
set.seed(144)
km = kmeans(normTrain, centers = 3)

“non-uniform 'Rounding' sampler used”


In [24]:
table(km$cluster)


   1    2    3 
3157 4696  253 

Recall from the recitation that we can use the flexclust package to obtain training set and testing set cluster assignments for our observations (note that the call to as.kcca may take a while to complete):
```R
library(flexclust)

km.kcca = as.kcca(km, normTrain)

clusterTrain = predict(km.kcca)

clusterTest = predict(km.kcca, newdata=normTest)
```
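
Conceptually, assigning a new observation to a k-means cluster means finding the nearest cluster center. The base R sketch below illustrates that idea on made-up, well-separated data. The helper `assign_nearest` is hypothetical and is not a description of how flexclust is implemented internally.

```R
# Hypothetical helper: index of the closest center (Euclidean distance) per row.
assign_nearest = function(newdata, centers) {
  newdata = as.matrix(newdata)
  apply(newdata, 1, function(row) {
    # t(centers) has one column per center; subtract the row from each column.
    which.min(colSums((t(centers) - row)^2))
  })
}

# Made-up data: two tight, well-separated groups of three points each.
toy = rbind(cbind(c(0, 0.1, -0.1), c(0, 0.1, -0.1)),
            cbind(c(10, 10.1, 9.9), c(10, 9.9, 10.1)))
set.seed(1)
toyKm = kmeans(toy, centers = 2, nstart = 5)

# Training points get the same labels k-means gave them.
all(assign_nearest(toy, toyKm$centers) == toyKm$cluster)

# New points are labelled by whichever center they are closer to.
assign_nearest(rbind(c(1, 1), c(9, 9)), toyKm$centers)
```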

In [25]:
library("flexclust")
km.kcca = as.kcca(km, normTrain)
clusterTrain = predict(km.kcca)
clusterTest = predict(km.kcca, newdata=normTest)

Loading required package: grid

Loading required package: modeltools

Loading required package: stats4



In [26]:
table(clusterTest)

clusterTest
   1    2    3 
1298 2080   96 

### Cluster-Specific Predictions

Using the subset function, build data frames stocksTrain1, stocksTrain2, and stocksTrain3, containing the elements in the stocksTrain data frame assigned to clusters 1, 2, and 3, respectively (be careful to take subsets of stocksTrain, not of normTrain). Similarly build stocksTest1, stocksTest2, and stocksTest3 from the stocksTest data frame.

In [36]:
stocksTrain1 = subset(stocksTrain, clusterTrain == 1)
stocksTrain2 = subset(stocksTrain, clusterTrain == 2)
stocksTrain3 = subset(stocksTrain, clusterTrain == 3)

In [37]:
stocksTest1 = subset(stocksTest, clusterTest == 1)
stocksTest2 = subset(stocksTest, clusterTest == 2)
stocksTest3 = subset(stocksTest, clusterTest == 3)

In [38]:
mean(stocksTrain1$PositiveDec)

In [39]:
mean(stocksTrain2$PositiveDec)

In [40]:
mean(stocksTrain3$PositiveDec)
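
The three `mean` calls can also be expressed as one grouped summary. The sketch below shows the pattern with `tapply` on made-up vectors; `toyOutcome` and `toyCluster` are hypothetical stand-ins for `stocksTrain$PositiveDec` and `clusterTrain`.

```R
# Made-up outcomes (0/1) and cluster labels for six observations.
toyOutcome = c(1, 0, 1, 1, 0, 0)
toyCluster = c(1, 1, 2, 2, 2, 3)

# Proportion of positive outcomes within each cluster.
toyRates = tapply(toyOutcome, toyCluster, mean)
toyRates

# With the real objects this would read:
# tapply(stocksTrain$PositiveDec, clusterTrain, mean)
```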

Build logistic regression models StocksModel1, StocksModel2, and StocksModel3, which predict PositiveDec using all the other variables as independent variables. StocksModel1 should be trained on stocksTrain1, StocksModel2 should be trained on stocksTrain2, and StocksModel3 should be trained on stocksTrain3.

In [41]:
StocksModel1 = glm(PositiveDec ~ ., data=stocksTrain1, family='binomial')
StocksModel2 = glm(PositiveDec ~ ., data=stocksTrain2, family='binomial')
StocksModel3 = glm(PositiveDec ~ ., data=stocksTrain3, family='binomial')

In [42]:
summary(StocksModel1)


Call:
glm(formula = PositiveDec ~ ., family = "binomial", data = stocksTrain1)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.7307  -1.2910   0.8878   1.0280   1.5023  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  0.17224    0.06302   2.733  0.00628 ** 
ReturnJan    0.02498    0.29306   0.085  0.93206    
ReturnFeb   -0.37207    0.29123  -1.278  0.20139    
ReturnMar    0.59555    0.23325   2.553  0.01067 *  
ReturnApr    1.19048    0.22439   5.305 1.12e-07 ***
ReturnMay    0.30421    0.22845   1.332  0.18298    
ReturnJune  -0.01165    0.29993  -0.039  0.96901    
ReturnJuly   0.19769    0.27790   0.711  0.47685    
ReturnAug    0.51273    0.30858   1.662  0.09660 .  
ReturnSep    0.58833    0.28133   2.091  0.03651 *  
ReturnOct   -1.02254    0.26007  -3.932 8.43e-05 ***
ReturnNov   -0.74847    0.28280  -2.647  0.00813 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial

In [43]:
summary(StocksModel2)


Call:
glm(formula = PositiveDec ~ ., family = "binomial", data = stocksTrain2)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.2012  -1.1941   0.8583   1.1334   1.9424  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  0.10293    0.03785   2.719 0.006540 ** 
ReturnJan    0.88451    0.20276   4.362 1.29e-05 ***
ReturnFeb    0.31762    0.26624   1.193 0.232878    
ReturnMar   -0.37978    0.24045  -1.579 0.114231    
ReturnApr    0.49291    0.22460   2.195 0.028189 *  
ReturnMay    0.89655    0.25492   3.517 0.000436 ***
ReturnJune   1.50088    0.26014   5.770 7.95e-09 ***
ReturnJuly   0.78315    0.26864   2.915 0.003554 ** 
ReturnAug   -0.24486    0.27080  -0.904 0.365876    
ReturnSep    0.73685    0.24820   2.969 0.002989 ** 
ReturnOct   -0.27756    0.18400  -1.509 0.131419    
ReturnNov   -0.78747    0.22458  -3.506 0.000454 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial

In [44]:
summary(StocksModel3)


Call:
glm(formula = PositiveDec ~ ., family = "binomial", data = stocksTrain3)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.9146  -1.0393  -0.7689   1.1921   1.6939  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)  
(Intercept) -0.181896   0.325182  -0.559   0.5759  
ReturnJan   -0.009789   0.448943  -0.022   0.9826  
ReturnFeb   -0.046883   0.213432  -0.220   0.8261  
ReturnMar    0.674179   0.564790   1.194   0.2326  
ReturnApr    1.281466   0.602672   2.126   0.0335 *
ReturnMay    0.762512   0.647783   1.177   0.2392  
ReturnJune   0.329434   0.408038   0.807   0.4195  
ReturnJuly   0.774164   0.729360   1.061   0.2885  
ReturnAug    0.982605   0.533158   1.843   0.0653 .
ReturnSep    0.363807   0.627774   0.580   0.5622  
ReturnOct    0.782242   0.733123   1.067   0.2860  
ReturnNov   -0.873752   0.738480  -1.183   0.2367  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken

In [46]:
predTest1 = predict(StocksModel1, newdata=stocksTest1, type='response')
predTest2 = predict(StocksModel2, newdata=stocksTest2, type='response')
predTest3 = predict(StocksModel3, newdata=stocksTest3, type='response')

In [47]:
table(stocksTest1$PositiveDec, predTest1 > 0.5)

   
    FALSE TRUE
  0    30  471
  1    23  774

In [48]:
table(stocksTest2$PositiveDec, predTest2 > 0.5)

   
    FALSE TRUE
  0   388  626
  1   309  757

In [49]:
table(stocksTest3$PositiveDec, predTest3 > 0.5)

   
    FALSE TRUE
  0    49   13
  1    21   13

In [50]:
acc_test1 = (30 + 774) / nrow(stocksTest1)
acc_test1

In [51]:
acc_test2 = (388 + 757) / nrow(stocksTest2)
acc_test2

In [52]:
acc_test3 = (49 + 13) / nrow(stocksTest3)
acc_test3

To compute the overall test-set accuracy of the cluster-then-predict approach, we can combine all the test-set predictions into a single vector and all the true outcomes into a single vector:
```R
AllPredictions = c(predTest1, predTest2, predTest3)

AllOutcomes = c(stocksTest1$PositiveDec, stocksTest2$PositiveDec, stocksTest3$PositiveDec)
```
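
Once predictions and outcomes are aligned in two vectors, accuracy can be computed directly instead of reading counts off a table. A minimal sketch with made-up values follows; all `toy*` names are hypothetical.

```R
# Made-up predicted probabilities and true outcomes for two clusters.
toyPred1 = c(0.7, 0.4)
toyOut1  = c(1, 1)
toyPred2 = c(0.2, 0.9, 0.6)
toyOut2  = c(0, 1, 0)

# Concatenate in the same cluster order so positions still match.
toyAllPredictions = c(toyPred1, toyPred2)
toyAllOutcomes    = c(toyOut1, toyOut2)

# Share of observations where the thresholded prediction equals the outcome.
toyAccuracy = mean((toyAllPredictions > 0.5) == toyAllOutcomes)
toyAccuracy
```

The key requirement is that both vectors are built in the same cluster order; otherwise predictions would be compared with the wrong outcomes.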

In [63]:
AllPredictions = c(predTest1, predTest2, predTest3)
AllOutcomes = c(stocksTest1$PositiveDec, stocksTest2$PositiveDec, stocksTest3$PositiveDec)

In [64]:
table(AllOutcomes, AllPredictions > 0.5)

           
AllOutcomes FALSE TRUE
          0   467 1110
          1   353 1544

In [70]:
overall_acc = (467 + 1544) / 3474
overall_acc
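
The overall accuracy is also the size-weighted average of the three per-cluster accuracies. The sketch below checks this using only the counts printed in the tables above; the variable names are introduced here for illustration.

```R
# Correct predictions and sizes per test cluster, copied from the tables above.
correct = c(30 + 774, 388 + 757, 49 + 13)
sizes   = c(1298, 2080, 96)

clusterAcc = correct / sizes

# Weighting each cluster accuracy by its share of the test set
# recovers the overall accuracy.
weightedAcc = sum(clusterAcc * sizes) / sum(sizes)

round(clusterAcc, 4)
round(weightedAcc, 4)
```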

We see a modest improvement over the original logistic regression model: test-set accuracy rises from about 0.567 (1970 of 3474 correct) to about 0.579 (2011 of 3474 correct), against a baseline of about 0.546. Since predicting stock returns is a notoriously hard problem, this is a good increase in accuracy. By investing in stocks for which we are more confident that they will have positive returns (by selecting the ones with higher predicted probabilities), this cluster-then-predict model can give us an edge over the original logistic regression model.