## Predicting Stock Returns with Cluster-Then-Predict

In the second lecture sequence this week, we heard about cluster-then-predict, a methodology in which you first cluster observations and then build cluster-specific prediction models. In the lecture sequence, we saw how this methodology helped improve the prediction of heart attack risk. In this assignment, we'll use cluster-then-predict to predict whether stocks will have positive future returns, using historical stock data.

When selecting which stocks to invest in, investors seek to obtain good future returns. In this problem, we will first use clustering to identify clusters of stocks that have similar returns over time. Then, we'll use logistic regression to predict whether or not the stocks will have positive future returns.
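The two-stage idea can be sketched on synthetic data before touching the real dataset. This is a minimal illustration, not the assignment's solution: the data frame `toy`, its columns `x1`, `x2`, `y`, and the choice of two clusters are all assumptions made for the example.

```R
# Synthetic data only: two hypothetical predictors and a binary outcome.
set.seed(1)
n = 200
toy = data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y = rbinom(n, 1, plogis(0.5 * toy$x1 - 0.5 * toy$x2))

# Step 1: cluster on the predictors only (the outcome is left out).
km = kmeans(toy[, c("x1", "x2")], centers = 2)

# Step 2: fit one logistic regression per cluster.
models = lapply(1:2, function(k) {
  glm(y ~ x1 + x2, data = toy[km$cluster == k, ], family = "binomial")
})

# Step 3: predict each observation with the model of its own cluster.
pred = numeric(n)
for (k in 1:2) {
  idx = km$cluster == k
  pred[idx] = predict(models[[k]], newdata = toy[idx, ], type = "response")
}
head(pred)
```

Each observation ends up with a predicted probability from the model fitted to its own cluster, which is the pattern the rest of this problem applies to the stock returns.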

For this problem, we'll use StocksCluster.csv, which contains monthly stock returns from the NASDAQ stock exchange. The NASDAQ is the second-largest stock exchange in the world, and it lists many technology companies. The stock price data used in this problem was obtained from [infochimps](http://www.infochimps.com/datasets/nasdaq-exchange-daily-1970-2010-open-close-high-low-and-volume), a website providing access to many datasets.

Each observation in the dataset contains the monthly returns of a particular company in a particular year. The dataset covers the years 2000-2009 and is limited to tickers that were listed on the exchange for that entire period and whose stock price never fell below $1. So, for example, one observation is for Yahoo in 2000, and another is for Yahoo in 2001. Our goal is to predict whether or not the stock return in December will be positive, using the stock returns for the first 11 months of the year.

This dataset contains the following variables:

- ReturnJan = the return for the company's stock during January (in the year of the observation). 
- ReturnFeb = the return for the company's stock during February (in the year of the observation). 
- ReturnMar = the return for the company's stock during March (in the year of the observation). 
- ReturnApr = the return for the company's stock during April (in the year of the observation). 
- ReturnMay = the return for the company's stock during May (in the year of the observation). 
- ReturnJune = the return for the company's stock during June (in the year of the observation). 
- ReturnJuly = the return for the company's stock during July (in the year of the observation). 
- ReturnAug = the return for the company's stock during August (in the year of the observation). 
- ReturnSep = the return for the company's stock during September (in the year of the observation). 
- ReturnOct = the return for the company's stock during October (in the year of the observation). 
- ReturnNov = the return for the company's stock during November (in the year of the observation). 
- PositiveDec = whether or not the company's stock had a positive return in December (in the year of the observation). This variable takes value 1 if the return was positive, and value 0 if the return was not positive.

For the first 11 variables, the value stored is a proportional change in stock value during that month. For instance, a value of 0.05 means the stock increased in value 5% during the month, while a value of -0.02 means the stock decreased in value 2% during the month.
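As a small illustration of what a proportional change means, the sketch below computes returns from month-start and month-end prices. The prices are hypothetical, and the formula is an assumption about how such returns are typically computed; the source does not say how the values in StocksCluster.csv were derived.

```R
# Hypothetical month-start and month-end prices (not from StocksCluster.csv).
priceStart = c(100, 50)
priceEnd = c(105, 49)

# Proportional change: (end - start) / start
monthlyReturn = (priceEnd - priceStart) / priceStart
monthlyReturn
```

The first stock gains 5% (0.05) and the second loses 2% (-0.02), matching the two cases described above.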

### Exploring the Dataset

In [1]:
stocks = read.csv('./dataset/StocksCluster.csv')

In [2]:
str(stocks)

'data.frame':	11580 obs. of  12 variables:
 $ ReturnJan  : num  0.0807 -0.0107 0.0477 -0.074 -0.031 ...
 $ ReturnFeb  : num  0.0663 0.1021 0.036 -0.0482 -0.2127 ...
 $ ReturnMar  : num  0.0329 0.1455 0.0397 0.0182 0.0915 ...
 $ ReturnApr  : num  0.1831 -0.0844 -0.1624 -0.0247 0.1893 ...
 $ ReturnMay  : num  0.13033 -0.3273 -0.14743 -0.00604 -0.15385 ...
 $ ReturnJune : num  -0.0176 -0.3593 0.0486 -0.0253 -0.1061 ...
 $ ReturnJuly : num  -0.0205 -0.0253 -0.1354 -0.094 0.3553 ...
 $ ReturnAug  : num  0.0247 0.2113 0.0334 0.0953 0.0568 ...
 $ ReturnSep  : num  -0.0204 -0.58 0 0.0567 0.0336 ...
 $ ReturnOct  : num  -0.1733 -0.2671 0.0917 -0.0963 0.0363 ...
 $ ReturnNov  : num  -0.0254 -0.1512 -0.0596 -0.0405 -0.0853 ...
 $ PositiveDec: int  0 0 0 1 1 1 1 0 0 0 ...


In [3]:
summary(stocks)

   ReturnJan            ReturnFeb           ReturnMar        
 Min.   :-0.7616205   Min.   :-0.690000   Min.   :-0.712994  
 1st Qu.:-0.0691663   1st Qu.:-0.077748   1st Qu.:-0.046389  
 Median : 0.0009965   Median :-0.010626   Median : 0.009878  
 Mean   : 0.0126316   Mean   :-0.007605   Mean   : 0.019402  
 3rd Qu.: 0.0732606   3rd Qu.: 0.043600   3rd Qu.: 0.077066  
 Max.   : 3.0683060   Max.   : 6.943694   Max.   : 4.008621  
   ReturnApr           ReturnMay          ReturnJune       
 Min.   :-0.826503   Min.   :-0.92207   Min.   :-0.717920  
 1st Qu.:-0.054468   1st Qu.:-0.04640   1st Qu.:-0.063966  
 Median : 0.009059   Median : 0.01293   Median :-0.000880  
 Mean   : 0.026308   Mean   : 0.02474   Mean   : 0.005938  
 3rd Qu.: 0.085338   3rd Qu.: 0.08396   3rd Qu.: 0.061566  
 Max.   : 2.528827   Max.   : 6.93013   Max.   : 4.339713  
   ReturnJuly           ReturnAug           ReturnSep        
 Min.   :-0.7613096   Min.   :-0.726800   Min.   :-0.839730  
 1st Qu.:-0.0731917   

In [7]:
cor(stocks)

,ReturnJan,ReturnFeb,ReturnMar,ReturnApr,ReturnMay,ReturnJune,ReturnJuly,ReturnAug,ReturnSep,ReturnOct,ReturnNov,PositiveDec
ReturnJan,1.0,0.06677458,-0.090496798,-0.037678006,-0.044411417,0.09223831,-0.081429765,-0.0227920187,-0.0264371526,0.14297723,0.06763233,0.004728518
ReturnFeb,0.066774583,1.0,-0.155983263,-0.191351924,-0.09552092,0.16999448,-0.0617785094,0.1315597863,0.0435017706,-0.08732427,-0.15465828,-0.038173184
ReturnMar,-0.090496798,-0.15598326,1.0,0.009726288,-0.003892789,-0.08590549,0.0033741597,-0.0220053995,0.0765183267,-0.01192376,0.03732353,0.022408661
ReturnApr,-0.037678006,-0.19135192,0.009726288,1.0,0.063822504,-0.01102775,0.0806319317,-0.051756051,-0.0289209718,0.04854003,0.03176184,0.094353528
ReturnMay,-0.044411417,-0.09552092,-0.003892789,0.063822504,1.0,-0.02107454,0.0908502642,-0.033125658,0.0219628623,0.01716673,0.04804659,0.058201934
ReturnJune,0.092238307,0.16999448,-0.085905486,-0.011027752,-0.021074539,1.0,-0.0291525996,0.010710526,0.0447472692,-0.02263599,-0.06527054,0.023409745
ReturnJuly,-0.081429765,-0.06177851,0.00337416,0.080631932,0.090850264,-0.0291526,1.0,0.0007137558,0.0689478037,-0.05470891,-0.04837384,0.07436421
ReturnAug,-0.022792019,0.13155979,-0.0220054,-0.051756051,-0.033125658,0.01071053,0.0007137558,1.0,0.0007407139,-0.07559456,-0.11648903,0.004166966
ReturnSep,-0.026437153,0.04350177,0.076518327,-0.028920972,0.021962862,0.04474727,0.0689478037,0.0007407139,1.0,-0.05807924,-0.0197198,0.041630286
ReturnOct,0.142977229,-0.08732427,-0.011923758,0.048540025,0.017166728,-0.02263599,-0.0547089088,-0.0755945614,-0.0580792362,1.0,0.19167279,-0.052574956
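Scanning a correlation matrix by eye gets tedious, so one option is to find the strongest pairwise correlation programmatically. The sketch below uses a synthetic data frame `toy` with hypothetical columns `a`, `b`, `c`; on the real data, the same steps could presumably be applied to `cor(stocks)`.

```R
# Synthetic data for three hypothetical variables (not the real returns).
set.seed(1)
toy = data.frame(a = rnorm(100))
toy$b = toy$a + rnorm(100, sd = 0.5)   # constructed to correlate with a
toy$c = rnorm(100)

corMat = cor(toy)
diag(corMat) = NA   # ignore each variable's correlation with itself

# Row/column position of the largest remaining correlation.
best = which(corMat == max(corMat, na.rm = TRUE), arr.ind = TRUE)[1, ]
rownames(corMat)[best["row"]]
colnames(corMat)[best["col"]]
```

Because a correlation matrix is symmetric, the maximum appears twice; taking the first match is enough to identify the pair.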


### Initial Logistic Regression Model

In [9]:
library(caTools)

In [10]:
set.seed(144)
spl = sample.split(stocks$PositiveDec, SplitRatio=0.7)
stocksTrain = subset(stocks, spl == TRUE)
stocksTest = subset(stocks, spl == FALSE)
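`sample.split` keeps the proportion of each outcome class roughly the same in the training and testing sets. The base-R sketch below illustrates that idea on a hypothetical outcome vector `y`; it is not caTools's implementation, only a way to see what a stratified 70/30 split does.

```R
# Hypothetical outcome vector: 400 zeros and 600 ones.
set.seed(144)
y = rep(c(0, 1), times = c(400, 600))

# Within each class, mark roughly 70% of the rows for training.
inTrain = logical(length(y))
for (cls in unique(y)) {
  idx = which(y == cls)
  inTrain[sample(idx, size = round(0.7 * length(idx)))] = TRUE
}

# The share of ones is the same in both parts.
mean(y[inTrain])
mean(y[!inTrain])
```

Both parts keep the 60% share of ones, which is why accuracy on the test set can be compared fairly against a baseline computed on that same set.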

In [12]:
StocksModel = glm(PositiveDec ~ ., data=stocksTrain, family='binomial')

In [13]:
summary(StocksModel)


Call:
glm(formula = PositiveDec ~ ., family = "binomial", data = stocksTrain)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.4333  -1.2265   0.9102   1.1006   2.2611  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  0.14878    0.02384   6.240 4.37e-10 ***
ReturnJan    0.31742    0.13906   2.283  0.02246 *  
ReturnFeb   -0.29349    0.13113  -2.238  0.02521 *  
ReturnMar    0.28716    0.14890   1.928  0.05380 .  
ReturnApr    1.05849    0.14527   7.286 3.18e-13 ***
ReturnMay    0.75472    0.16438   4.591 4.40e-06 ***
ReturnJune   0.49435    0.15937   3.102  0.00192 ** 
ReturnJuly   0.75114    0.16110   4.662 3.12e-06 ***
ReturnAug    0.09395    0.17503   0.537  0.59142    
ReturnSep    0.72669    0.17083   4.254 2.10e-05 ***
ReturnOct   -0.60645    0.14452  -4.196 2.71e-05 ***
ReturnNov   -0.84449    0.15698  -5.380 7.46e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial 
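To see how the coefficients in the summary output above turn into a predicted probability, the sketch below applies them by hand to a single hypothetical stock. The coefficient values are copied from the output; the stock's returns (flat all year except a 10% gain in April) are made up for illustration.

```R
# Coefficients copied from the summary(StocksModel) output above.
coefs = c("(Intercept)" = 0.14878, ReturnJan = 0.31742, ReturnFeb = -0.29349,
          ReturnMar = 0.28716, ReturnApr = 1.05849, ReturnMay = 0.75472,
          ReturnJune = 0.49435, ReturnJuly = 0.75114, ReturnAug = 0.09395,
          ReturnSep = 0.72669, ReturnOct = -0.60645, ReturnNov = -0.84449)

# Hypothetical stock: flat in every month except a 10% gain in April.
returns = setNames(rep(0, 11), names(coefs)[-1])
returns["ReturnApr"] = 0.10

# Linear predictor (log-odds), then the logistic function.
logOdds = coefs["(Intercept)"] + sum(coefs[-1] * returns)
probPositive = plogis(logOdds)
probPositive
```

Since the log-odds are positive here, the predicted probability of a positive December return is above 0.5, so this stock would be classified as `TRUE` at the 0.5 threshold used below.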

In [19]:
stocksPred = predict(StocksModel, newdata=stocksTrain, type='response')

In [20]:
table(stocksTrain$PositiveDec, stocksPred > 0.5)

   
    FALSE TRUE
  0   990 2689
  1   787 3640

In [22]:
accuracy = (990 + 3640) / nrow(stocksTrain)
accuracy

In [23]:
testPred = predict(StocksModel, newdata=stocksTest, type='response')
table(stocksTest$PositiveDec, testPred > 0.5)

   
    FALSE TRUE
  0   417 1160
  1   344 1553

In [24]:
accuracy = (417 + 1553) / nrow(stocksTest)
accuracy

In [25]:
table(stocksTest$PositiveDec)


   0    1 
1577 1897 

In [26]:
baseline_acc = 1897 / nrow(stocksTest)
baseline_acc
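The accuracy and baseline figures can also be derived directly from a confusion matrix rather than by typing the counts into a formula. The sketch below rebuilds the test-set table above as a matrix (the name `confTest` is introduced here for illustration) and computes both quantities from it.

```R
# Confusion-matrix counts copied from the test-set table above
# (rows = actual PositiveDec, columns = predicted probability > 0.5).
confTest = matrix(c(417, 344, 1160, 1553), nrow = 2,
                  dimnames = list(actual = c("0", "1"),
                                  predicted = c("FALSE", "TRUE")))

# Accuracy: correct predictions (the diagonal) over all observations.
modelAcc = sum(diag(confTest)) / sum(confTest)

# Baseline: always predict the most common actual outcome.
baselineAcc = max(rowSums(confTest)) / sum(confTest)

round(c(model = modelAcc, baseline = baselineAcc), 4)
```

With these counts, the model's test-set accuracy comes to about 0.567 against a baseline of about 0.546, so the logistic regression improves on the baseline only slightly.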

### Clustering Stocks

Now, let's cluster the stocks. The first step is to remove the dependent variable, because assigning an observation to a cluster must not require knowing the outcome we are trying to predict. Use the following commands:
```R
limitedTrain = stocksTrain
limitedTrain$PositiveDec = NULL
limitedTest = stocksTest
limitedTest$PositiveDec = NULL
```

In [27]:
limitedTrain = stocksTrain
limitedTrain$PositiveDec = NULL
limitedTest = stocksTest
limitedTest$PositiveDec = NULL