# 4.7.2 Logistic Regression
Next, we will fit a logistic regression model in order to predict `Direction` using `Lag1` through `Lag5` and `Volume`. The `glm()` function can be used to fit many types of generalized linear models, including logistic regression. The syntax of the `glm()` function is similar to that of `lm()`, except that we must pass in the argument `family = binomial` in order to tell `R` to run a logistic regression rather than some other type of generalized linear model.

In [1]:
library(ISLR2)
attach(Smarket)

In [2]:
glm.fits <- glm(
    Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
    data = Smarket, family = binomial
  )
summary(glm.fits)


Call:
glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + 
    Volume, family = binomial, data = Smarket)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.126000   0.240736  -0.523    0.601
Lag1        -0.073074   0.050167  -1.457    0.145
Lag2        -0.042301   0.050086  -0.845    0.398
Lag3         0.011085   0.049939   0.222    0.824
Lag4         0.009359   0.049974   0.187    0.851
Lag5         0.010313   0.049511   0.208    0.835
Volume       0.135441   0.158360   0.855    0.392

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1731.2  on 1249  degrees of freedom
Residual deviance: 1727.6  on 1243  degrees of freedom
AIC: 1741.6

Number of Fisher Scoring iterations: 3


The smallest _p_-value here is associated with `Lag1`. The negative coefficient for this predictor suggests that if the market had a positive return yesterday, then it is less likely to go up today. However, at a value of $0.15$, the _p_-value is still relatively large, and so there is no clear evidence of a real association between `Lag1` and `Direction`.  

We use the `coef()` function in order to access just the coefficients for this fitted model. We can also use the `summary()` function to access particular aspects of the fitted model, such as the _p_-values for the coefficients.

In [3]:
coef(glm.fits)

In [4]:
summary(glm.fits)$coef

                Estimate Std. Error    z value  Pr(>|z|)
(Intercept) -0.126000257 0.24073574 -0.5233966 0.6006983
Lag1        -0.073073746 0.05016739 -1.4565986 0.1452272
Lag2        -0.042301344 0.05008605 -0.8445733 0.3983491
Lag3         0.011085108 0.04993854  0.2219750 0.8243333
Lag4         0.009358938 0.04997413  0.1872757 0.8514445
Lag5         0.010313068 0.04951146  0.2082966 0.8349974
Volume       0.135440659 0.15835970  0.8552723 0.3924004


In [5]:
summary(glm.fits)$coef[,4]

The `predict()` function can be used to predict the probability that the market will go up, given values of the predictors. The `type = "response"` option tells `R` to output probabilities of the form $P(Y = 1|X)$, as opposed to other information such as the logit. If no data set is supplied to the `predict()` function, then the probabilities are computed for the training data that was used to fit the logistic regression model. Here we have printed only the first ten probabilities. We know that these values correspond to the probability of the market going up, rather than down, because the `contrasts()` function indicates that `R` has created a dummy variable with a 1 for `Up`.

In [6]:
glm.probs <- predict(glm.fits, type = "response")
glm.probs[1:10]

In [7]:
contrasts(Direction)

     Up
Down  0
Up    1
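
The relationship between the logit and the probability returned by `type = "response"` can be checked directly. The following is a minimal sketch on synthetic data rather than on `Smarket`; the names `x`, `y`, and `fit` are hypothetical and chosen only for illustration.

```r
# Synthetic stand-in data; x, y, and fit are hypothetical names.
set.seed(1)
x <- rnorm(200)
y <- factor(ifelse(runif(200) < plogis(0.5 * x), "Up", "Down"))

fit <- glm(y ~ x, family = binomial)

# type = "link" returns the linear predictor (the logit);
# type = "response" returns the fitted probability of the second factor level.
logit <- predict(fit, type = "link")
prob  <- predict(fit, type = "response")

# Applying the inverse logit (plogis) to the logit should recover the probability.
all.equal(plogis(logit), prob)

# Factor levels are sorted alphabetically, so "Up" is the level coded as 1.
levels(y)
```

With a two-level factor response, `glm()` treats the first level as the reference and models the probability of the second, which is consistent with what `contrasts()` reports.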


In order to make a prediction as to whether the market will go up or down on a particular day, we must convert these predicted probabilities into class labels, `Up` or `Down`. The following two commands create a vector of class predictions based on whether the predicted probability of a market increase is greater than or less than $0.5$.

In [8]:
glm.pred <- rep("Down", 1250)
glm.pred[glm.probs > .5] <- "Up"

The first command creates a vector of $1,250$ `Down` elements. The second line transforms to `Up` all of the elements for which the predicted probability of a market increase exceeds $0.5$. Given these predictions, the `table()` function can be used to produce a confusion matrix in order to determine how many observations were correctly or incorrectly classified.

In [9]:
table(glm.pred, Direction)

        Direction
glm.pred Down  Up
    Down  145 141
    Up    457 507

In [10]:
(507 + 145) / 1250

In [11]:
mean(glm.pred == Direction)

The diagonal elements of the confusion matrix indicate correct predictions, while the off-diagonals represent incorrect predictions. Hence our model correctly predicted that the market would go up on $507$ days and that it would go down on $145$ days, for a total of $507 + 145 = 652$ correct predictions. The `mean()` function can be used to compute the fraction of days for which the prediction was correct. In this case, logistic regression correctly predicted the movement of the market $52.2\%$ of the time.
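
The same arithmetic can be done from the confusion matrix itself. The sketch below re-enters the four counts shown above by hand, so it runs without the fitted model; the object name `conf.mat` is hypothetical.

```r
# Counts copied from the confusion matrix above
# (rows: predicted class, columns: actual class). matrix() fills column by column.
conf.mat <- matrix(
  c(145, 457, 141, 507),
  nrow = 2,
  dimnames = list(glm.pred = c("Down", "Up"), Direction = c("Down", "Up"))
)

correct  <- sum(diag(conf.mat))     # correct predictions lie on the diagonal
accuracy <- correct / sum(conf.mat) # fraction of days predicted correctly
train.error <- 1 - accuracy         # training error rate

correct
accuracy
train.error
```

Dividing by `sum(conf.mat)` instead of a hard-coded $1,250$ keeps the calculation valid if the number of observations changes.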

At first glance, it appears that the logistic regression model is working a little better than random guessing. However, this result is misleading because we trained and tested the model on the same set of $1,250$ observations. In other words, $100\% - 52.2\% = 47.8\%$ is the _training_ error rate. As we have seen previously, the training error rate is often overly optimistic: it tends to underestimate the test error rate. In order to better assess the accuracy of the logistic regression model in this setting, we can fit the model using part of the data, and then examine how well it predicts the _held out_ data. This will yield a more realistic error rate, in the sense that in practice we will be interested in our model's performance not on the data that we used to fit the model, but rather on days in the future for which the market's movements are unknown.

To implement this strategy, we will first create a vector corresponding to the observations from 2001 through 2004. We will then use this vector to create a held out data set of observations from 2005.

In [12]:
train <- (Year < 2005)
Smarket.2005 <- Smarket[!train,]
dim(Smarket.2005)

In [13]:
Direction.2005 <- Direction[!train]

The object `train` is a vector of $1,250$ elements, corresponding to the observations in our data set. The elements of the vector that correspond to observations that occurred before 2005 are set to `TRUE`, whereas those that correspond to observations in 2005 are set to `FALSE`. The object `train` is a _Boolean_ vector, since its elements are `TRUE` and `FALSE`. Boolean vectors can be used to obtain a subset of the rows or columns of a matrix. For instance, the command `Smarket[train,]` would pick out a submatrix of the stock market data set, corresponding only to the dates before 2005, since those are the ones for which the elements of `train` are `TRUE`. The `!` symbol can be used to reverse all of the elements of a Boolean vector. That is, `!train` is a vector similar to `train`, except that the elements that are `TRUE` in `train` get swapped to `FALSE` in `!train`, and the elements that are `FALSE` in `train` get swapped to `TRUE` in `!train`. Therefore, `Smarket[!train,]` yields a submatrix of the stock market data containing only the observations for which `train` is `FALSE`, that is, the observations with dates in 2005. The output above indicates that there are $252$ such observations.
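
Boolean indexing is easy to see on a small example. The following sketch uses a toy data frame rather than `Smarket`; the names `toy` and `keep` and all of the values are hypothetical.

```r
# Toy data frame; the names and values are hypothetical.
toy <- data.frame(
  Year = c(2003, 2004, 2005, 2005),
  Lag1 = c(0.4, -0.2, 1.1, -0.6)
)

keep <- toy$Year < 2005  # Boolean vector: TRUE TRUE FALSE FALSE
before <- toy[keep, ]    # rows for which keep is TRUE (before 2005)
after  <- toy[!keep, ]   # rows for which keep is FALSE (2005)

before
after

# TRUE counts as 1 in arithmetic, so sum() gives the number of rows kept.
sum(keep)
```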

We now fit a logistic regression model using only the subset of the observations that correspond to dates before 2005, using the `subset` argument. We then obtain predicted probabilities of the stock market going up for each of the days in our test set&mdash;that is, for the days in 2005.

In [14]:
glm.fits <- glm(
    Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
    data = Smarket, family = binomial, subset = train
  )
glm.probs <- predict(glm.fits, Smarket.2005, type = "response")

Notice that we have trained and tested our model on two completely separate data sets: training was performed using only the dates before 2005, and testing was performed using only the dates in 2005. Finally, we compute the predictions for 2005 and compare them to the actual movements of the market over that time period.

In [15]:
glm.pred <- rep("Down", 252)
glm.pred[glm.probs > .5] <- "Up"
table(glm.pred, Direction.2005)

        Direction.2005
glm.pred Down Up
    Down   77 97
    Up     34 44

In [16]:
mean(glm.pred == Direction.2005)

In [17]:
mean(glm.pred != Direction.2005)

The `!=` notation means _not equal to_, and so the last command computes the test set error rate. The results are rather disappointing: the test error rate is $52\%$, which is worse than random guessing! Of course this result is not at all surprising, given that one would not generally expect to be able to use previous days' returns to predict future market performance. (After all, if it were possible to do so, then the authors of this book would be out striking it rich rather than writing a statistics textbook.)

We recall that the logistic regression model had very underwhelming _p_-values associated with all of the predictors, and that the smallest _p_-value, though not very small, corresponded to `Lag1`. Perhaps by removing the variables that appear not to be helpful in predicting `Direction`, we can obtain a more effective model. After all, using predictors that have no relationship with the response tends to cause a deterioration in the test error rate (since such predictors cause an increase in variance without a corresponding decrease in bias), and so removing such predictors may in turn yield an improvement. Below we have refit the logistic regression using just `Lag1` and `Lag2`, which seemed to have the highest predictive power in the original logistic regression model.

In [18]:
glm.fits <- glm(Direction ~ Lag1 + Lag2, data = Smarket, family = binomial, subset = train)
glm.probs <- predict(glm.fits, Smarket.2005, type = "response")
glm.pred <- rep("Down", 252)
glm.pred[glm.probs > .5] <- "Up"
table(glm.pred, Direction.2005)

        Direction.2005
glm.pred Down  Up
    Down   35  35
    Up     76 106

In [19]:
mean(glm.pred == Direction.2005)

In [20]:
106 / (106 + 76)

Now the results appear to be a little better: $56\%$ of the daily movements have been correctly predicted. It is worth noting that in this case a much simpler strategy of predicting that the market will increase every day will also be correct $56\%$ of the time! Hence, in terms of overall error rate, the logistic regression method is no better than the naive approach. However, the confusion matrix shows that on days when logistic regression predicts an increase in the market, it has a $58\%$ accuracy rate. This suggests a possible trading strategy of buying on days when the model predicts an increasing market, and avoiding trades on days when a decrease is predicted. Of course one would need to investigate more carefully whether this small improvement was real or just due to random chance.
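
The three rates discussed above can be reproduced from the confusion matrix alone. This sketch re-enters the counts from the table above by hand; the object names are hypothetical.

```r
# Counts copied from the 2005 confusion matrix for the Lag1 + Lag2 model
# (rows: predicted class, columns: actual class).
conf.mat <- matrix(
  c(35, 76, 35, 106),
  nrow = 2,
  dimnames = list(glm.pred = c("Down", "Up"), Direction.2005 = c("Down", "Up"))
)

n <- sum(conf.mat)

# Overall fraction of days predicted correctly.
accuracy <- sum(diag(conf.mat)) / n

# Naive strategy: always predict "Up". It is right on every day the market rose.
naive.accuracy <- sum(conf.mat[, "Up"]) / n

# Accuracy restricted to the days on which the model predicted "Up".
up.accuracy <- conf.mat["Up", "Up"] / sum(conf.mat["Up", ])

round(c(accuracy = accuracy, naive = naive.accuracy, up = up.accuracy), 3)
```

The overall accuracy and the naive accuracy coincide here because the number of correct predictions ($35 + 106$) happens to equal the number of days on which the market actually rose ($35 + 106$).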

Suppose that we want to predict the returns associated with particular values of `Lag1` and `Lag2`. In particular, we want to predict `Direction` on a day when `Lag1` and `Lag2` equal $1.2$ and $1.1$, respectively, and on a day when they equal $1.5$ and $-0.8$. We do this using the `predict()` function.

In [21]:
predict(glm.fits, newdata = data.frame(Lag1 =  c(1.2, 1.5), Lag2 = c(1.1, -0.8)), type = "response")
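
To see what `predict()` does with new data, one can apply the fitted coefficients by hand and pass the result through the inverse logit. The sketch below does this on synthetic data, so its numbers are not those of the `Smarket` model; the data frame `sim` and the names `fit`, `new.days`, `manual`, and `builtin` are hypothetical.

```r
# Synthetic stand-in for the training data; all values are simulated.
set.seed(1)
n <- 500
sim <- data.frame(Lag1 = rnorm(n), Lag2 = rnorm(n))
sim$Direction <- factor(
  ifelse(runif(n) < plogis(-0.1 * sim$Lag1), "Up", "Down"),
  levels = c("Down", "Up")
)

fit <- glm(Direction ~ Lag1 + Lag2, data = sim, family = binomial)

# The two hypothetical days from the text.
new.days <- data.frame(Lag1 = c(1.2, 1.5), Lag2 = c(1.1, -0.8))

# Manual calculation: linear predictor, then inverse logit.
b <- coef(fit)
manual <- plogis(b["(Intercept)"] + b["Lag1"] * new.days$Lag1 + b["Lag2"] * new.days$Lag2)

# Built-in calculation.
builtin <- predict(fit, newdata = new.days, type = "response")

all.equal(unname(manual), unname(builtin))
```

The column names in `newdata` must match the predictor names used in the model formula, which is why the data frame is built with `Lag1` and `Lag2`.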