## The Stock Market Data
We will use the Stock Market dataset from the book "An Introduction to Statistical Learning, with Applications in R" by G. James, D. Witten, T. Hastie and R. Tibshirani (Springer, 2013). The dataset is included in the R package ISLR.

The dataset contains daily percentage returns for the S&P 500 stock index between 2001 and 2005. The raw index values were obtained from Yahoo Finance and then converted to percentage returns and lagged.

The data contains 1250 observations, each with 9 variables: Year, Lag1, Lag2, Lag3, Lag4, Lag5, Volume, Today and Direction. Direction is the class variable, with two possible values: Up or Down.

### Load data

In [4]:
install.packages("ISLR")
require(ISLR)
names(Smarket)
summary(Smarket)



Loading required package: ISLR


      Year           Lag1                Lag2                Lag3          
 Min.   :2001   Min.   :-4.922000   Min.   :-4.922000   Min.   :-4.922000  
 1st Qu.:2002   1st Qu.:-0.639500   1st Qu.:-0.639500   1st Qu.:-0.640000  
 Median :2003   Median : 0.039000   Median : 0.039000   Median : 0.038500  
 Mean   :2003   Mean   : 0.003834   Mean   : 0.003919   Mean   : 0.001716  
 3rd Qu.:2004   3rd Qu.: 0.596750   3rd Qu.: 0.596750   3rd Qu.: 0.596750  
 Max.   :2005   Max.   : 5.733000   Max.   : 5.733000   Max.   : 5.733000  
      Lag4                Lag5              Volume           Today          
 Min.   :-4.922000   Min.   :-4.92200   Min.   :0.3561   Min.   :-4.922000  
 1st Qu.:-0.640000   1st Qu.:-0.64000   1st Qu.:1.2574   1st Qu.:-0.639500  
 Median : 0.038500   Median : 0.03850   Median :1.4229   Median : 0.038500  
 Mean   : 0.001636   Mean   : 0.00561   Mean   :1.4783   Mean   : 0.003138  
 3rd Qu.: 0.596750   3rd Qu.: 0.59700   3rd Qu.:1.6417   3rd Qu.: 0.596750  
 Max. 

In [None]:
?Smarket

### Analyze data

In [None]:
pairs(Smarket,col=Smarket$Direction)

In [None]:
cor(Smarket) # This won't work. Why?

In [None]:
cor(Smarket[,-9]) # Note that Volume has some correlation with Year...
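
`cor()` only accepts numeric input, and `Direction` is a factor, which is why column 9 has to be dropped. Instead of hard-coding the column index, the numeric columns can be selected programmatically. The sketch below uses a made-up toy data frame and a hypothetical helper `numeric_cols()`; neither is part of the notebook or of ISLR.

```r
# Hypothetical helper: keep only the numeric columns of a data frame
numeric_cols <- function(df) {
  df[, sapply(df, is.numeric), drop = FALSE]
}

# Toy data (made-up values): two numeric columns and one factor
toy <- data.frame(
  x = c(1, 2, 3, 4),
  y = c(2, 4, 6, 9),
  g = factor(c("a", "b", "a", "b"))
)

toy_cor <- cor(numeric_cols(toy))  # works: the factor column g is left out
toy_cor
```

Since `Direction` is the only non-numeric column in `Smarket`, `cor(numeric_cols(Smarket))` should give the same matrix as `cor(Smarket[,-9])`.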

In [None]:
boxplot(Smarket$Volume~Smarket$Year)

In [None]:
# Direction is derived from Today
cor(as.numeric(Smarket$Direction),Smarket$Today)
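
This correlation cannot be a perfect 1 in general, because `Direction` keeps only the sign of `Today` and discards its magnitude. A minimal sketch with made-up returns illustrates this. `direction_from_return()` is a hypothetical helper, and treating a return of exactly zero as "Down" is an assumption, not something taken from the dataset documentation.

```r
# Hypothetical helper: derive an Up/Down label from a percentage return.
# Assumption: a return of exactly 0 is labelled "Down".
direction_from_return <- function(today) {
  factor(ifelse(today > 0, "Up", "Down"), levels = c("Down", "Up"))
}

toy_today <- c(-1.2, 0.4, 2.5, -0.3, 0.9)   # made-up returns, not from Smarket
toy_dir <- direction_from_return(toy_today)
toy_dir

# as.numeric() on a factor returns the level codes (Down = 1, Up = 2)
toy_r <- cor(as.numeric(toy_dir), toy_today)
toy_r
```

On the real data, `table(direction_from_return(Smarket$Today), Smarket$Direction)` would show how closely this rule reproduces the `Direction` column.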

### Logistic regression - quick view

In [None]:
glm.fit <- glm(Direction~Lag1+Lag2+Lag3+Lag4+Lag5+Volume, data=Smarket, family=binomial)
summary(glm.fit)

In [None]:
glm.probs <- predict(glm.fit,type="response") 
glm.probs
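
With `type="response"`, `predict()` returns probabilities rather than log-odds: the linear predictor is passed through the inverse logit function. The sketch below shows that transformation on made-up log-odds values, and uses a toy factor to show which class the probability refers to. All variable names in it are illustrative only.

```r
# Inverse logit: maps a linear predictor (log-odds) to a probability
eta <- c(-2, 0, 2)                  # made-up log-odds values
p_manual <- 1 / (1 + exp(-eta))
p_builtin <- plogis(eta)            # base R's logistic distribution function
all.equal(p_manual, p_builtin)

# With family = binomial and a factor response, glm() treats the first
# level as "failure", so the fitted probability refers to the second level.
toy_direction <- factor(c("Down", "Up", "Up"))
levels(toy_direction)               # "Down" "Up"
contrasts(toy_direction)            # dummy coding: Down = 0, Up = 1
```

For `Smarket`, `contrasts(Smarket$Direction)` can be used in the same way to confirm that the predicted probabilities are for the market going up.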

In [None]:
glm.pred <- ifelse(glm.probs>0.5,"Up","Down")
glm.pred

In [None]:
table(glm.pred,Smarket$Direction)
mean(glm.pred==Smarket$Direction)
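
The thresholding, confusion matrix and accuracy steps above can be wrapped in one small function so they are easy to repeat for other models. This is only a sketch: `classify_accuracy()` is a hypothetical helper, and the toy probabilities and labels are made up. Keep in mind that an accuracy measured on the same data used for fitting tends to be optimistic.

```r
# Hypothetical helper: threshold probabilities and compare with the true labels
classify_accuracy <- function(probs, truth, threshold = 0.5,
                              positive = "Up", negative = "Down") {
  pred <- factor(ifelse(probs > threshold, positive, negative),
                 levels = c(negative, positive))
  truth <- factor(truth, levels = c(negative, positive))
  list(
    confusion = table(Predicted = pred, Actual = truth),
    accuracy  = mean(pred == truth)   # share of correct predictions
  )
}

# Toy example (made-up numbers, not from Smarket)
toy_probs <- c(0.62, 0.48, 0.55, 0.30)
toy_truth <- c("Up", "Up", "Down", "Down")
res <- classify_accuracy(toy_probs, toy_truth)
res$confusion
res$accuracy
```

With the objects from this section, the equivalent call would be `classify_accuracy(glm.probs, Smarket$Direction)`.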

### Logistic regression - correct version

In [None]:
# Make training and test set
train <- (Smarket$Year < 2005)
glm.fit <- glm(Direction~Lag1+Lag2+Lag3+Lag4+Lag5+Volume, data=Smarket, family=binomial, subset=train)
glm.fit

In [None]:
glm.probs <- predict(glm.fit,newdata=Smarket[!train,], type="response") 
glm.pred <- ifelse(glm.probs >0.5,"Up","Down")
Direction.2005 <- Smarket$Direction[!train]
table(glm.pred,Direction.2005) # Overfitting!
mean(glm.pred==Direction.2005)
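
To judge a held-out accuracy, it helps to compare it with the trivial classifier that always predicts the most frequent class. The sketch below computes that baseline for a made-up label vector; `baseline_accuracy()` is a hypothetical helper, not part of the notebook.

```r
# Hypothetical helper: accuracy of always predicting the most frequent class
baseline_accuracy <- function(truth) {
  max(prop.table(table(truth)))
}

toy_labels <- c("Up", "Up", "Down", "Up")   # made-up labels, not from Smarket
toy_baseline <- baseline_accuracy(toy_labels)
toy_baseline
```

Applied to the test set, `baseline_accuracy(Direction.2005)` gives the figure a model has to beat on the 2005 data before it can be called useful.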

### Logistic regression - smaller model

In [None]:
glm.fit <- glm(Direction~Lag1+Lag2, data=Smarket,family=binomial, subset=train)
glm.fit

In [None]:
glm.probs <- predict(glm.fit,newdata=Smarket[!train,],type="response") 
glm.pred <- ifelse(glm.probs > 0.5,"Up","Down")
table(glm.pred,Direction.2005)
mean(glm.pred==Direction.2005)

### Logistic regression - Using caret...

In [None]:
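require(caret)
# Use the same predictors as in the models above: Year is dropped, and Today
# must be excluded because Direction is derived from it
predictors <- c("Lag1", "Lag2", "Lag3", "Lag4", "Lag5", "Volume")
glmFit <- train(Smarket[train, predictors], y = Smarket$Direction[train], method = "glm", preProcess = c("center", "scale"),
                tuneLength = 10, control=glm.control(maxit=500), trControl = trainControl(method = "cv"))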
glmFit

## Exercise 2
Using the Smarket dataset, perform 10-fold cross-validation with logistic regression.
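
As a starting hint, one possible way to assign rows to folds in base R is sketched below. The seed is arbitrary, and the variable names are illustrative; the model fitting and evaluation inside each fold are left for the exercise.

```r
set.seed(1)                                 # arbitrary seed, for reproducibility
n <- 1250                                   # number of rows in Smarket
k <- 10                                     # number of folds
folds <- sample(rep(1:k, length.out = n))   # one random fold label per row
table(folds)                                # 125 rows in each fold

# Rows held out when fold 1 serves as the test set
test_idx <- which(folds == 1)
length(test_idx)
```

For each fold `i`, the rows with `folds == i` act as the test set and all remaining rows are used for fitting; averaging the 10 test accuracies gives the cross-validated estimate.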