# DS106 Modeling : Lesson Four Companion Notebook

### Table of Contents <a class="anchor" id="DS106L4_toc"></a>

* [Table of Contents](#DS106L4_toc)
    * [Page 1 - Introduction](#DS106L4_page_1)
    * [Page 2 - Multiple Regression Models](#DS106L4_page_2)
    * [Page 3 - Stepwise Regression](#DS106L4_page_3)
    * [Page 4 - Backward Elimination](#DS106L4_page_4)
    * [Page 5 - Forward Selection](#DS106L4_page_5)
    * [Page 6 - Hybrid Stepwise - Forward and Backward Selection](#DS106L4_page_6)
    * [Page 7 - Key Terms](#DS106L4_page_7)
    * [Page 8 - Lesson 4 Practice Hands-On](#DS106L4_page_8)
    * [Page 9 - Lesson 4 Practice Hands-On Solution](#DS106L4_page_9)

    

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 1 - Overview of this Module<a class="anchor" id="DS106L4_page_1"></a>

[Back to Top](#DS106L4_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

from IPython.display import VimeoVideo
# Tutorial Video Name: Modeling with step-wise regression
VimeoVideo('246121381', width=720, height=480)


# Introduction

As a data scientist, you will regularly be faced with a large dataset and a request from the process owner to "make sense of the data." This is a daunting task. One approach to a problem like this is a technique called *stepwise regression*.

You have learned in previous lessons about linear regression, logistic regression, and non-linear regression. In each case, you had a single response variable and a single predictor variable with which you created a model. Well, you're hitting the big-time now, because now you're adding additional predictors! As many as your data allows! Just adding them in is the concept of *multiple regression*. But if you want to sort out which ones are most important, then you'll need to employ *stepwise regression*.

By the end of this lesson, you should be able to:

* Understand multiple regression
* Differentiate between the three different types of stepwise regression
* Complete backward elimination, forward selection, and hybrid stepwise regression in R

This lesson will culminate in a hands-on in which you determine the best linear regression model through stepwise regression in R.

<div class="panel panel-success">
    <div class="panel-heading">
        <h3 class="panel-title">Additional Info!</h3>
    </div>
    <div class="panel-body">
        <p>You may want to watch this <a href="https://vimeo.com/438421794"> recorded live workshop </a> that goes over stepwise regression in R.</p>
    </div>
</div>


<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 2 - Multiple Regression Models<a class="anchor" id="DS106L4_page_2"></a>

[Back to Top](#DS106L4_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


# Multiple Regression Models

When you have only one independent and one dependent variable, that is referred to as *simple regression*. But if you want to add in more independent variables, then that is referred to as *multiple regression*. 

In the real world, there are many things that go into predicting any one outcome, which means that you need to find a way to estimate with multiple predictors (independent variables). For example, if you want to predict the stopping distance for a car on dry roads, you might want to use:

* The initial speed of the car
* The weight of the car
* The amount of tread on the tires

The high-level theory is that you add predictor variables when they improve the model. In other words, if the ability to predict increases significantly, then you should include those additional predictor variables.

For example:

* If height is a pretty good predictor of weight, then height **and** BMI might be a better predictor of weight.
* If foot width is a pretty good predictor of foot length, then foot width **and** height might be a better predictor of foot length.
* If high school GPA is a pretty good predictor of college GPA, then high school GPA **and** SAT score might be a better predictor of college GPA.
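
As a rough illustration of this idea, the sketch below compares a simple regression with a multiple regression using R's built-in ```mtcars``` data, which is used later in this lesson. The choice of ```wt``` and ```hp``` as predictors, and the names ```simple_fit```, ```multiple_fit```, and ```adj_r2```, are assumptions made only for this example.

```{r}
# Simple regression: one predictor
simple_fit = lm(mpg ~ wt, data = mtcars)

# Multiple regression: the same predictor plus one more
multiple_fit = lm(mpg ~ wt + hp, data = mtcars)

# Adjusted R-squared for each model, side by side
adj_r2 = c(
  simple   = summary(simple_fit)$adj.r.squared,
  multiple = summary(multiple_fit)$adj.r.squared
)
adj_r2
```

If the second value is noticeably larger than the first, the extra predictor is probably earning its place in the model.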

---

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 3 - Stepwise Regression<a class="anchor" id="DS106L4_page_3"></a>

[Back to Top](#DS106L4_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


# Stepwise Regression

*Stepwise regression* is a method of fitting regression models in which the choice of predictive variables is carried out by an automatic procedure. In each step, a predictor variable is considered for addition to or subtraction from the set of explanatory variables based on some pre-specified criterion. Usually, this takes the form of a sequence of *F*-tests or *t*-tests. Stepwise regression is a tool that can help determine a complex relationship. Modeling is an iterative process, and stepwise regression certainly has a seat at the table.

The term "stepwise" comes from the fact that the process is iterative. Each time a term is eliminated, the remaining model is re-evaluated. This is in contrast to determining multiple terms that can be eliminated on the initial observation. The phrase "...in the presence of the other terms" is important here, because each term individually will behave differently depending on what other terms are in the model.

Suppose you have a dataset that includes a single response variable and 6 different possible predictor variables. How do you figure out what the best model is? One approach would be to look at all possible models and simply choose the best one based on its R<sup>2</sup> value. Seems reasonable, right? The higher the R<sup>2</sup> value is, the more predicting power the model has - at least, that is the theory. However, this approach has a couple of weaknesses that make it less than ideal:

* **This becomes a lot of models very quickly!** The number of possible models is two raised to the number of predictors: 2<sup>x</sup>, where x is the number of predictors. In the case of 6 predictor variables, that would be 2<sup>6</sup> models, or 64 models. You could take the time to look through 64 models to determine which one is best, but it would be very tedious. And that's not even including any interactions between your variables or any terms for non-linear modeling. Can you imagine what would happen if there were 30 predictor variables? That would give 2<sup>30</sup> possible models, which is about 1.07 billion. You and all your friends could spend 1000 lifetimes looking over the models and doing nothing else, and you would still never find the best one!

* **R<sup>2</sup> will never decrease:** As you add predictor terms to a model, R<sup>2</sup> may stay the same, or it might increase, but it will never decrease. In other words, the model that contains **all** of the predictor variables will have the highest R<sup>2</sup> value. If that is your only criterion for determining which model is best, then you would automatically have to use the model that contains all predictor variables.
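
Both weaknesses can be checked with a few lines of R. This is only a sketch: the three nested models and the name ```r2``` are assumptions chosen for illustration, using the built-in ```mtcars``` data.

```{r}
# Number of candidate models for 6 and for 30 predictors (main effects only)
2^6
2^30

# R-squared for nested models that add one predictor at a time
r2 = c(
  one   = summary(lm(mpg ~ wt, data = mtcars))$r.squared,
  two   = summary(lm(mpg ~ wt + hp, data = mtcars))$r.squared,
  three = summary(lm(mpg ~ wt + hp + gear, data = mtcars))$r.squared
)
r2

# Change in R-squared from each model to the next; none should be negative
diff(r2)
```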

---

## Overfitting

You might wonder what is wrong with including all the predictor variables. Well, using all of the predictor variables increases the chances of *overfitting* the model. Basically, the model fits your current data so well that it won't fit similar data later. It can't be generalized.

There is a thing called the *Principle of Parsimony* in science. It is applied to many branches of science, and in a nutshell, it means that the most acceptable explanation of any phenomenon is usually the simplest explanation. That is not to say that there are no complex solutions to science problems; there certainly are. The more accurate interpretation is that if there are two "equally" good models for a particular relationship, whichever is simpler is usually the preferred model.

---

## Approaches for Stepwise Regression

So now that you know what doesn't work, what does? Stepwise regression, of course! There are three different approaches:

1. Backward Elimination
2. Forward Selection
3. Hybrid

You will learn about all three in this lesson, with the help of the organization "Dragonfly Statistics" and their tutorial videos.  

---

### 1. Backward Elimination

Start with all the predictors, and then start eliminating one variable at a time, starting with the IV that has the least effect on improving the model. Keep removing terms until the point at which removing a term has a significant detrimental effect on the model, and then stop.

---

#### Example

Suppose you wanted to create a model to predict someone's weight. You all have a pretty good understanding of what contributes to someone's weight. For example, you can all pretty much agree a person's height, gender, body type, body fat percentage, etc. all could help predict a person's weight. Some of those might be better predictors than others, but they all probably have some sort of prediction ability. On the other hand, you might know stuff about a person that probably has little or nothing to do with a person's weight, such as their IQ, the number of cousins they have, or whether or not they like rodeo.

Someone who is an expert on human weight might be able to toss out the meaningless predictors, state the predictors that are completely meaningful, and even weight the predictors that are marginal. But for the rest of you, how would you know? The technique of backward elimination starts with all of the predictors in the model, and then determines (one at a time) which terms don't belong, based on criteria that will be discussed later.

The general approach is that if a single predictor - in the presence of all other predictors - does not add to the quality of the model, then it is dropped. Each time a predictor is dropped, the new model is evaluated to determine if there is another predictor that can be dropped.

---

### 2. Forward Selection

Start with the single predictor that has the greatest prediction power in the model, and then add predictors one at a time, beginning with the remaining predictor that has the greatest prediction power in the presence of all the terms already included in the model. Once you get to a point where adding an additional term has little effect in improving the predictability of the model, stop.

---

### 3. Hybrid: Combining Forward Selection and Backward Elimination

This method combines the first two methods. At each step, the method determines if a term needs to be added, and if any of the terms previously added need to be eliminated.

---

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 4 - Backward Elimination<a class="anchor" id="DS106L4_page_4"></a>

[Back to Top](#DS106L4_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


# Backward Elimination

Please watch **[this video](https://www.youtube.com/watch?v=0aTtMJO-pE4&t)** first.  What follows here won't make a lot of sense outside the context of the video.

---

## Load in Data

First, load the built-in R data set called ```mtcars```. You will know you have it if you type the command:

```{r}
head(mtcars)
```

And receive this in reply: 

```text
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
```

---

## Question Setup

Create a model that will use ```mpg``` as the response variable, and the other 10 columns of data as potential predictor variables. It is assumed that not all 10 predictors really belong in the model.

---

## Get a Baseline

You will start by creating a model object called ```FitAll```, a linear model that uses all 10 predictor variables. The command looks like this:

```{r}
FitAll = lm(mpg ~ ., data = mtcars)
```

```mpg``` is your y, or response variable, and ```.``` means "all."  Instead of saying ```mpg ~ .```, you could have listed out all 10 predictor variables, as in "mpg ~ cyl + disp + hp + drat + ...". It is much easier to use the ```.``` rather than list them all out. 
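
If you want to convince yourself that the two forms are equivalent, a quick check along these lines should work. The name ```FitExplicit``` is hypothetical and used only for this comparison.

```{r}
# Fit the model using the "." shorthand
FitAll = lm(mpg ~ ., data = mtcars)

# Fit the same model with every predictor written out
FitExplicit = lm(mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb,
                 data = mtcars)

# List the terms that "." expanded to
attr(terms(FitAll), "term.labels")

# Check that both fits give the same coefficients
all.equal(coef(FitAll), coef(FitExplicit))
```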

Then get the model summary, using the ```summary()``` function:

```{r}
summary(FitAll)
```

This provides you the following:

```text

Call:
lm(formula = mpg ~ ., data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.4506 -1.6044 -0.1196  1.2193  4.6271 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept) 12.30337   18.71788   0.657   0.5181  
cyl         -0.11144    1.04502  -0.107   0.9161  
disp         0.01334    0.01786   0.747   0.4635  
hp          -0.02148    0.02177  -0.987   0.3350  
drat         0.78711    1.63537   0.481   0.6353  
wt          -3.71530    1.89441  -1.961   0.0633 .
qsec         0.82104    0.73084   1.123   0.2739  
vs           0.31776    2.10451   0.151   0.8814  
am           2.52023    2.05665   1.225   0.2340  
gear         0.65541    1.49326   0.439   0.6652  
carb        -0.19942    0.82875  -0.241   0.8122  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.65 on 21 degrees of freedom
Multiple R-squared:  0.869,	Adjusted R-squared:  0.8066 
F-statistic: 13.93 on 10 and 21 DF,  p-value: 3.793e-07
```

Although the overall *p* value at the bottom is significant, none of the individual predictors are. This pattern usually means that the predictors overlap heavily in the information they carry: together they explain ```mpg``` well, but no single predictor adds much once the other nine are already in the model. The full model is therefore probably more complicated than it needs to be.

Once you have created a new model after backward elimination, you can compare the above model summary with all 10 predictors against what R has helped determine is the best fit. 

---

## Try Backward Elimination

The real guts of the method is in the following line of code, using the ```step()``` function and specifying the argument ```direction='backward'``` to ensure you're doing backward elimination.

```{r}
step(FitAll, direction = 'backward')
```

This bit of code runs a stepwise regression, going backward only, on the linear model you just created using the ```mtcars``` dataset.

Once you enter the command above, you get several pages of output. If you scroll up to the top of the output, right below the command, you will see this:

```text
Start:  AIC=70.9
mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb
```

*AIC* stands for *Akaike Information Criterion*. It was developed by a Japanese statistician named Hirotugu Akaike in the 1970s. The AIC is a measure of the relative quality of a model. However, it doesn't tell you if a model is any good or not. It simply helps you to compare two or more models against each other. When comparing two models, the model with the smaller AIC is generally accepted to be the better of the two models. In the case of the ```mtcars``` data, the start AIC of 70.9 is for the original model with all ten predictors.
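
To see where the starting AIC comes from, here is a minimal sketch. It relies on the fact that ```step()``` uses ```extractAIC()```, which for a linear model computes n * log(RSS / n) + 2 * (number of estimated coefficients). The names ```n```, ```rss```, ```k```, and ```aic_by_hand``` are only illustrative.

```{r}
FitAll = lm(mpg ~ ., data = mtcars)

n   = nrow(mtcars)                 # number of observations
rss = sum(residuals(FitAll)^2)     # residual sum of squares
k   = length(coef(FitAll))         # estimated coefficients, including the intercept

# AIC on the scale that step() reports for linear models
aic_by_hand = n * log(rss / n) + 2 * k
aic_by_hand

# extractAIC() returns the equivalent degrees of freedom, then the AIC
extractAIC(FitAll)
```

Note that the separate ```AIC()``` function in R reports a value that differs from this one by a constant, so the two should not be mixed. Only differences between models on the same scale are meaningful.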

Now, look at the table below the start model:

```text
       Df Sum of Sq    RSS    AIC
- cyl   1    0.0799 147.57 68.915
- vs    1    0.1601 147.66 68.932
- carb  1    0.4067 147.90 68.986
- gear  1    1.3531 148.85 69.190
- drat  1    1.6270 149.12 69.249
- disp  1    3.9167 151.41 69.736
- hp    1    6.8399 154.33 70.348
- qsec  1    8.8641 156.36 70.765
<none>              147.49 70.898
- am    1   10.5467 158.04 71.108
- wt    1   27.0144 174.51 74.280
```

In this portion of the output, there are 11 rows of information. Each row pertains to a single candidate model derived from the start model shown above. The first column is unlabeled, but it contains the variable that has been removed for that particular row of output. For example, the first row says ```- cyl```. That means the first row is the model with only 9 predictor variables, where the ```cyl``` variable has been left out.

Notice that for 10 potential predictor variables, there are 11 rows of output. This is a comparison of 11 models - the original model with all 10 predictors is included, as well as 10 different models where each model has had exactly one predictor variable removed. You will note that there is a model on the 9th row that is labeled as ```<none>```. This is the model with no predictors removed, or the original model. Note that the AIC in the far right column is 70.898 for that model, which is essentially the same as the ```Start``` model AIC listed above the table of 70.9 (rounding has taken place).

The table has been sorted from smallest AIC at the top to largest AIC at the bottom. The takeaway from this table is as follows: Of all the models with just 9 predictors, it would appear that the model that excludes ```cyl``` is the best, because it has the smallest AIC.

Before moving on, there are a couple important points to make:

1. You don't really know if the model with ```cyl``` removed is any good or not; you just know that it is better than all the other models with a single predictor removed.

2. You might assume at this point that all of the terms above the row with "<none>" could easily be removed. Since individually removing each of those terms makes for a better model, why not just remove all of them in one fell swoop, just to save time? This is actually an erroneous thought process. The power of stepwise regression is that each iteration evaluates a term in the presence or absence of the other terms in the model. This will be illustrated shortly: at this first iteration, it might appear that the only terms that belong in the model are ```wt``` and ```am```, but the later iterations tell a different story.
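
A single iteration of this table can be reproduced on its own with the ```drop1()``` function, which fits every single-term deletion from a model. The sketch below sorts the result by AIC to mirror the layout that ```step()``` prints. The name ```deletions``` is only illustrative.

```{r}
FitAll = lm(mpg ~ ., data = mtcars)

# One row per candidate model: "<none>" plus each single-term deletion
deletions = drop1(FitAll)

# Sort by AIC so the best candidate is on top, as in the step() output
deletions = deletions[order(deletions$AIC), ]
deletions
```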

---

### Iteration Two

Move on to the next iteration:

```text
Step:  AIC=68.92
mpg ~ disp + hp + drat + wt + qsec + vs + am + gear + carb

       Df Sum of Sq    RSS    AIC
- vs    1    0.2685 147.84 66.973
- carb  1    0.5201 148.09 67.028
- gear  1    1.8211 149.40 67.308
- drat  1    1.9826 149.56 67.342
- disp  1    3.9009 151.47 67.750
- hp    1    7.3632 154.94 68.473
<none>              147.57 68.915
- qsec  1   10.0933 157.67 69.032
- am    1   11.8359 159.41 69.384
- wt    1   27.0280 174.60 72.297
```

This is the next portion of the output. Notice that there is a new start AIC of 68.92. This is because the predictor ```cyl``` has been removed from the model, based on the results above. In fact, right below the AIC is the current model, which has ```mpg``` as a function of ```disp```, ```hp```, ```drat```, ```wt```, ```qsec```, ```vs```, ```am```, ```gear```, and ```carb```. There is no ```cyl``` in the model.

Below the new model, there is a table of 10 different models (9 predictor variables, removed one at a time, and the 10th model with no predictor variables removed). The output suggests that ```vs``` can be removed from the model to make it better. 

---

### Iteration Three

The next iteration will be calculated with both the ```cyl``` and ```vs``` terms removed:

```text
Step:  AIC=66.97
mpg ~ disp + hp + drat + wt + qsec + am + gear + carb

       Df Sum of Sq    RSS    AIC
- carb  1    0.6855 148.53 65.121
- gear  1    2.1437 149.99 65.434
- drat  1    2.2139 150.06 65.449
- disp  1    3.6467 151.49 65.753
- hp    1    7.1060 154.95 66.475
<none>              147.84 66.973
- am    1   11.5694 159.41 67.384
- qsec  1   15.6830 163.53 68.200
- wt    1   27.3799 175.22 70.410
```

The model with only 8 predictor terms has an AIC of 66.97. According to the table below the current model, if ```carb``` is removed, the AIC will go down a bit more. 

---

### Iteration Four

The model with only 7 predictors looks like this:

```text
Step:  AIC=65.12
mpg ~ disp + hp + drat + wt + qsec + am + gear

       Df Sum of Sq    RSS    AIC
- gear  1     1.565 150.09 63.457
- drat  1     1.932 150.46 63.535
<none>              148.53 65.121
- disp  1    10.110 158.64 65.229
- am    1    12.323 160.85 65.672
- hp    1    14.826 163.35 66.166
- qsec  1    26.408 174.94 68.358
- wt    1    69.127 217.66 75.350
```

Now the AIC is at 65.12.

Have you been paying attention to the models below the ```<none>``` row in each table? Remember, the interpretation of those rows is that if those variables get removed, the model will actually be a bit worse than the current model. Have you noticed that at each iteration, the variables below the ```<none>``` line are not always the same variables?

---

### Iteration Five

Move on to the next iteration:

```text
Step:  AIC=63.46
mpg ~ disp + hp + drat + wt + qsec + am

       Df Sum of Sq    RSS    AIC
- drat  1     3.345 153.44 62.162
- disp  1     8.545 158.64 63.229
<none>              150.09 63.457
- hp    1    13.285 163.38 64.171
- am    1    20.036 170.13 65.466
- qsec  1    25.574 175.67 66.491
- wt    1    67.572 217.66 73.351
```

You are down to six terms from ten. And it looks like you still have one or two terms you can eliminate...

---

### Iteration Six

Removing ```drat``` is next:

```text
Step:  AIC=62.16
mpg ~ disp + hp + wt + qsec + am

       Df Sum of Sq    RSS    AIC
- disp  1     6.629 160.07 61.515
<none>              153.44 62.162
- hp    1    12.572 166.01 62.682
- qsec  1    26.470 179.91 65.255
- am    1    32.198 185.63 66.258
- wt    1    69.043 222.48 72.051
```

---

### Iteration Seven

Now removing ```disp``` is next. This might be the last term that is eliminated, looking at the output:

```text
Step:  AIC=61.52
mpg ~ hp + wt + qsec + am

       Df Sum of Sq    RSS    AIC
- hp    1     9.219 169.29 61.307
<none>              160.07 61.515
- qsec  1    20.225 180.29 63.323
- am    1    25.993 186.06 64.331
- wt    1    78.494 238.56 72.284
```

But it turns out that once ```disp``` was removed from the model, it suddenly looks like ```hp``` doesn't belong in the model either, whereas earlier it looked like it might.

---

### Iteration Eight

So now remove ```hp``` from the model:

```text
Step:  AIC=61.31
mpg ~ wt + qsec + am

       Df Sum of Sq    RSS    AIC
<none>              169.29 61.307
- am    1    26.178 195.46 63.908
- qsec  1   109.034 278.32 75.217
- wt    1   183.347 352.63 82.790
```

You only have 3 terms left in the model now. They are ```wt```, ```qsec```, and ```am```. If you look for a description of the mtcars dataset on the web, you will discover that ```wt``` is the weight of the car in thousands of pounds, ```qsec``` is the time it takes for the car to cover 1/4 mile from rest, and ```am``` is a coded variable for the type of transmission (0 for automatic, and 1 for manual).

---

## The Final Model

According to the output table, the model that includes all three of those variables is better than any of the three models that omit one of those variables. You are done with creating a model using backward elimination. If you haven't reached this conclusion on your own, don't worry, R has reached it for you!

The next little bit of output tells not only which terms belong in the model, but what the coefficients are, too. You will see the following:

```text
Call:
lm(formula = mpg ~ wt + qsec + am, data = mtcars)

Coefficients:
(Intercept)           wt         qsec           am  
      9.618       -3.917        1.226        2.936  
```

This output tells you that the best model for the ```mtcars``` data uses ```wt```, ```qsec```, and ```am``` to predict the ```mpg``` for a car, using the following equation:

```mpg = 9.618 + (-3.917) * wt + 1.226 * qsec + 2.936 * am```

Pay attention to the signs of the coefficients for a moment:

* The coefficient of ```wt``` is (-3.917). What that means is as long as everything else is equal, an increase of a car's weight by 1000 lbs will lead to an estimated decrease in the car's ```mpg``` by about 3.9 miles per gallon of gas.
* The coefficient of ```qsec``` is 1.226. Everything else being equal, as a car's ```qsec``` increases, so does the ```mpg```. Does that make sense? Well, yes - if the ```qsec``` is larger, that means the car is slower off the start. If you know anything about cars, more power is usually associated with more speed, which is usually associated with worse gas mileage. In this case, you are being told that less power leads to higher gas mileage - for every additional second it takes to run a quarter mile, the gas mileage is improved by about 1.2 miles per gallon. It is pretty much an accepted fact that gutless cars get better gas mileage than fast cars.
* The coefficient for the transmission type indicates that all other things being equal, two identical cars where one is an automatic transmission and one is a manual transmission will have different miles per gallon ratings. The manual transmission should get around 3 mpg better mileage than the automatic transmission.
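
Because ```step()``` returns the final fitted model, you can store it and use it for prediction. The sketch below does that for a made-up car; the values in ```new_car``` and the names ```BackwardFit```, ```predicted```, and ```by_hand``` are assumptions for illustration only.

```{r}
FitAll = lm(mpg ~ ., data = mtcars)

# step() returns the final model; trace = 0 suppresses the printout
BackwardFit = step(FitAll, direction = 'backward', trace = 0)
formula(BackwardFit)

# A hypothetical car: 3,000 lbs, 18-second quarter mile, manual transmission
new_car = data.frame(wt = 3, qsec = 18, am = 1)
predicted = predict(BackwardFit, newdata = new_car)

# The same prediction computed directly from the model's coefficients
b = coef(BackwardFit)
by_hand = b[["(Intercept)"]] + b[["wt"]] * 3 + b[["qsec"]] * 18 + b[["am"]] * 1

c(predict = unname(predicted), by_hand = by_hand)
```

The two numbers should agree, which confirms that ```predict()``` is doing nothing more than plugging values into the fitted equation.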

So now you have a model. Is it any good? You know it is the best model you could find using backward elimination, but is it any good?

You can determine this by creating a new model with just those variables and then looking at the summary. You will create a new model object called ```fitsome```, which will just have the terms you are interested in:

```{r}
fitsome = lm(mpg ~ am + qsec + wt, data = mtcars)
summary(fitsome)
```

And here is the output:

```text
Call:
lm(formula = mpg ~ am + qsec + wt, data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.4811 -1.5555 -0.7257  1.4110  4.6610 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   9.6178     6.9596   1.382 0.177915    
am            2.9358     1.4109   2.081 0.046716 *  
qsec          1.2259     0.2887   4.247 0.000216 ***
wt           -3.9165     0.7112  -5.507 6.95e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.459 on 28 degrees of freedom
Multiple R-squared:  0.8497,	Adjusted R-squared:  0.8336 
F-statistic: 52.75 on 3 and 28 DF,  p-value: 1.21e-11
```

There are a few things to note here from the above output:

* There are coefficients for just the three predictor variables, as well as an intercept.
* There is a multiple R<sup>2</sup> = 0.8497. The practical interpretation of this is that the model explains 84.97% of the variation in the ```mpg``` variable, and there is another 15.03% of the variation that can be chalked up to noise or random error.
* There is an adjusted R<sup>2</sup> = 0.8336. The adjustment is a modification that is supposed to take into account the number of terms in the model. For your purpose, you are more interested in the adjusted R<sup>2</sup> than the multiple R<sup>2</sup>.
* There is an *F*-statistic (52.75 with 3 and 28 degrees of freedom) and a *p* value of 0.0000000000121 (this is represented in scientific notation in R) which indicates that the model is better than no model at all, because the *p* value is less than 0.05.

So, have you really gained anything by doing the backward elimination? To the untrained eye, one might say "no gain." The R<sup>2</sup> values are very similar, and both models are statistically significant. However, to the trained eye, you have adhered to the "Principle of Parsimony" and come up with an equally good (maybe marginally better) model that is much simpler. Further, now all of your individual predictors are significant! Mission accomplished.
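
One way to put numbers on this comparison is sketched below. It places the two models side by side on adjusted R<sup>2</sup> and AIC, and then uses ```anova()``` to run a partial *F*-test of whether the seven dropped predictors add anything significant. The name ```comparison``` is only illustrative.

```{r}
FitAll  = lm(mpg ~ ., data = mtcars)
fitsome = lm(mpg ~ am + qsec + wt, data = mtcars)

# Adjusted R-squared: full model versus reduced model
c(full    = summary(FitAll)$adj.r.squared,
  reduced = summary(fitsome)$adj.r.squared)

# AIC on the same scale that step() reports
c(full = extractAIC(FitAll)[2], reduced = extractAIC(fitsome)[2])

# Partial F-test: do the seven dropped predictors add anything significant?
comparison = anova(fitsome, FitAll)
comparison
```

A large *p* value in the ```anova()``` table suggests that the dropped predictors were not contributing much beyond the three that remain.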

---

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 5 - Forward Selection<a class="anchor" id="DS106L4_page_5"></a>

[Back to Top](#DS106L4_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


# Forward Selection

Please watch **[this video](https://www.youtube.com/watch?v=OYEII--K_k4&t)** on forward selection.

The difference between backward elimination and forward selection is hopefully pretty obvious by now. Backward elimination starts with all the predictors, and then removes them one by one until you are left with a model that is better than any model with one more predictor removed. On the other hand, forward selection starts with no predictors, and adds predictors one at a time until adding one more predictor does not appreciably improve the model.

---

## Create the Original Model

The commands are similar to those of backward elimination, but instead of starting with all the predictors, you start with none of the predictors. This is done by subbing out the ```.``` in the model for a ```1```:

```{r}
fitstart = lm(mpg ~ 1, data = mtcars)
```

In this command, the part that says ```mpg ~ 1``` tells the model to start without any predictors; the ```1``` part forces the model to simply use the average ```mpg``` for the dataset as the predicted value of ```mpg```. It is often called the *naive model*, meaning that it is a model where you assume all future ```mpg``` fuel ratings are simply equal to the mean of the historical fuel ratings.

Then run a summary for this model: 

```{r}
summary(fitstart)
```

You will get the following:

```text
Call:
lm(formula = mpg ~ 1, data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max 
-9.6906 -4.6656 -0.8906  2.7094 13.8094 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   20.091      1.065   18.86   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 6.027 on 31 degrees of freedom
```

Notice that the estimate for the intercept is 20.091, which is simply the average mpg for the 32 vehicles in the original ```mtcars``` dataset. This is a model, but it is a poor one, even though the intercept is significant.
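
You can confirm the link between the intercept and the mean directly. This is a small sketch using only base R functions:

```{r}
fitstart = lm(mpg ~ 1, data = mtcars)

nrow(mtcars)              # number of vehicles in the dataset
mean(mtcars$mpg)          # average mpg
coef(fitstart)            # intercept of the naive model

# Every fitted value is the same number: the mean
unique(round(fitted(fitstart), 6))
```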

---

## Begin Forward Selection

At this point, you will use the ```step()``` function to complete forward selection. R needs some instruction about which variables are potential predictors. This is called the scope of the stepwise approach. You could either spell out the entire scope of the potential model as follows:

```{r}
step(fitstart, direction = 'forward', scope = (~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb))
```

Or you can utilize the fact that the ```FitAll``` model was defined before when you did the backward elimination approach, and simplify the command as follows:

```{r}
step(fitstart, direction = 'forward', scope = (formula(FitAll)))
```

In either case, ```direction = 'forward'``` specifies that you are doing forward selection, and the argument ```scope=``` specifies the model that ```step()``` is moving towards.

Here is what the results look like, one step at a time.

---

### First Iteration

The output below shows that the AIC for the model with no terms at all is 115.94. If you look at the table below the AIC, you can see that adding _any_ predictor variable will create a better model than the model with no predictors, because any model with one variable has an AIC smaller than 115.94. Since the predictors are sorted, from best to worst, it appears that ```wt``` is the best individual predictor, so it should be added to the model first.

```text
Start:  AIC=115.94
mpg ~ 1

       Df Sum of Sq     RSS     AIC
+ wt    1    847.73  278.32  73.217
+ cyl   1    817.71  308.33  76.494
+ disp  1    808.89  317.16  77.397
+ hp    1    678.37  447.67  88.427
+ drat  1    522.48  603.57  97.988
+ vs    1    496.53  629.52  99.335
+ am    1    405.15  720.90 103.672
+ carb  1    341.78  784.27 106.369
+ gear  1    259.75  866.30 109.552
+ qsec  1    197.39  928.66 111.776
<none>              1126.05 115.943
```
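
A single forward iteration like this one can be reproduced on its own with the ```add1()``` function, which fits every single-term addition within a given scope. The sketch below sorts the result by AIC to mirror the layout that ```step()``` prints. The name ```additions``` is only illustrative.

```{r}
fitstart = lm(mpg ~ 1, data = mtcars)
FitAll   = lm(mpg ~ ., data = mtcars)

# One row per candidate model: "<none>" plus each single-term addition
additions = add1(fitstart, scope = formula(FitAll))

# Sort by AIC so the best candidate is on top, as in the step() output
additions = additions[order(additions$AIC), ]
additions
```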

---

### Second Iteration

Now, ```wt``` is in the model: 

```text
Step:  AIC=73.22
mpg ~ wt

       Df Sum of Sq    RSS    AIC
+ cyl   1    87.150 191.17 63.198
+ hp    1    83.274 195.05 63.840
+ qsec  1    82.858 195.46 63.908
+ vs    1    54.228 224.09 68.283
+ carb  1    44.602 233.72 69.628
+ disp  1    31.639 246.68 71.356
<none>              278.32 73.217
+ drat  1     9.081 269.24 74.156
+ gear  1     1.137 277.19 75.086
+ am    1     0.002 278.32 75.217
```

You can see that the next term to add is ```cyl```. As a reminder, the AIC on the row labeled ```cyl``` is the quality of the model with ```cyl``` added to a model that already contains ```wt```, not the quality of a model with ```cyl``` by itself. If you compare this table with the table above, it is clear that having both ```wt``` and ```cyl``` in the model (AIC 63.198) is better than just having ```wt``` (73.217) or just having ```cyl``` (76.494).
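One way to check that comparison yourself is to fit the three models directly and score each one the way ```step()``` does. This is only a sketch; the object names are illustrative.

```r
fit_wt     <- lm(mpg ~ wt, data = mtcars)
fit_cyl    <- lm(mpg ~ cyl, data = mtcars)
fit_wt_cyl <- update(fit_wt, . ~ . + cyl)    # add cyl to the model that already has wt

# extractAIC() returns c(edf, AIC); keep only the AIC for each model
aics <- sapply(list(wt = fit_wt, cyl = fit_cyl, wt_cyl = fit_wt_cyl),
               function(m) extractAIC(m)[2])
aics
```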

---

### Third Iteration

With ```wt``` and ```cyl``` already in the model, it still makes sense to add ```hp``` to get some improvement.

```text
Step:  AIC=63.2
mpg ~ wt + cyl

       Df Sum of Sq    RSS    AIC
+ hp    1   14.5514 176.62 62.665
+ carb  1   13.7724 177.40 62.805
<none>              191.17 63.198
+ qsec  1   10.5674 180.60 63.378
+ gear  1    3.0281 188.14 64.687
+ disp  1    2.6796 188.49 64.746
+ vs    1    0.7059 190.47 65.080
+ am    1    0.1249 191.05 65.177
+ drat  1    0.0010 191.17 65.198
```

---

### Fourth Iteration

Now, with a model including ```wt```, ```cyl```, and ```hp```, you are done. There is no point in adding any more terms, because according to the table, the AIC will actually increase if you do.

```text
Step:  AIC=62.66
mpg ~ wt + cyl + hp

       Df Sum of Sq    RSS    AIC
<none>              176.62 62.665
+ am    1    6.6228 170.00 63.442
+ disp  1    6.1762 170.44 63.526
+ carb  1    2.5187 174.10 64.205
+ drat  1    2.2453 174.38 64.255
+ qsec  1    1.4010 175.22 64.410
+ gear  1    0.8558 175.76 64.509
+ vs    1    0.0599 176.56 64.654

Call:
lm(formula = mpg ~ wt + cyl + hp, data = mtcars)

Coefficients:
(Intercept)           wt          cyl           hp  
   38.75179     -3.16697     -0.94162     -0.01804  
```
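Rather than reading the four iterations off the console, you can store the result and inspect its ```anova``` component, which records the path taken. The sketch below assumes the same ```fitstart``` and ```FitAll``` definitions as before; ```fwd_path``` is an illustrative name.

```r
fitstart <- lm(mpg ~ 1, data = mtcars)   # assumed starting model
FitAll   <- lm(mpg ~ ., data = mtcars)   # assumed full model

fwd_path <- step(fitstart, direction = "forward",
                 scope = formula(FitAll), trace = 0)

# One row per step: the term added, the residual deviance (RSS), and the AIC
fwd_path$anova
```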

---

## Examine the Final Model

Do the same thing you did for the backward elimination model: build a model that only contains ```wt```, ```cyl```, and ```hp```. You called it ```fitsome``` in the backward elimination model; call this one ```fitsome2```:

```{r}
fitsome2 = lm(mpg ~ wt + cyl + hp, data = mtcars)
```

And then look at the summary:

```{r}
summary(fitsome2)
```

Here are the results:

```text
Call:
lm(formula = mpg ~ wt + cyl + hp, data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.9290 -1.5598 -0.5311  1.1850  5.8986 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 38.75179    1.78686  21.687  < 2e-16 ***
wt          -3.16697    0.74058  -4.276 0.000199 ***
cyl         -0.94162    0.55092  -1.709 0.098480 .  
hp          -0.01804    0.01188  -1.519 0.140015    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.512 on 28 degrees of freedom
Multiple R-squared:  0.8431,	Adjusted R-squared:  0.8263 
F-statistic: 50.17 on 3 and 28 DF,  p-value: 2.184e-11
```

When you did the backward elimination model, you compared the initial model (all ten predictors) with the optimized model and concluded that the two are similar as far as R<sup>2</sup> is concerned, so the model with just 3 predictors seems preferable on simplicity alone. Forward selection has no comparable full starting model: you started from the average ```mpg```, with no predictor variables at all.

Nonetheless, take a look at the specifics of the forward selection model shown above:

* The multiple R<sup>2</sup> = 0.8431. The practical interpretation of this is that the model explains 84.31% of the variation in the ```mpg``` variable, and the other 15.69% of the variation is left unexplained and can be chalked up to noise or random error.
* There is an adjusted R<sup>2</sup> = 0.8263. The "adjustment" is a modification that takes into account the number of terms in the model. The multiple R<sup>2</sup> never goes down when a predictor is added, so for comparing models of different sizes you are more interested in the adjusted R<sup>2</sup> than the multiple R<sup>2</sup>.
* There is an *F*-statistic (50.17 with 3 and 28 degrees of freedom) and a *p* value of 0.00000000002184 (R shows this in scientific notation as 2.184e-11), which indicates that the model is better than no model at all, because the *p* value is less than 0.05.
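The three numbers in the bullets above can also be pulled out of the summary object, and the adjustment can be reproduced by hand. This is a sketch; the names ```s```, ```r2```, ```p```, and ```adj_r2``` are illustrative.

```r
fitsome2 <- lm(mpg ~ wt + cyl + hp, data = mtcars)
s <- summary(fitsome2)

r2 <- s$r.squared
n  <- nrow(mtcars)
p  <- length(coef(fitsome2)) - 1          # number of predictors, excluding the intercept

# Adjusted R-squared penalizes R-squared for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
c(multiple = r2, adjusted_by_hand = adj_r2, adjusted_from_summary = s$adj.r.squared)

# p value of the overall F test, from the stored F statistic and its degrees of freedom
f <- s$fstatistic                          # named vector: value, numdf, dendf
pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE)
```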

The R<sup>2</sup> for the backward elimination model and the forward selection model are similar. So, which one is better? One could argue that the AIC for the backward elimination model is slightly lower, so it must be better. But for all practical purposes, they are essentially the same. However, did you notice that the predictors in the two models are not the same? That is actually a pretty common result. The truth is that in many cases, there might not be a single 'best' model that is far superior to the 'second best' model; instead, there might be a cluster of models that are essentially equally good.
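To put the two final models side by side, a small comparison table helps. The sketch below reruns both searches quietly, assuming ```fitstart``` and ```FitAll``` are the intercept-only and full models on ```mtcars```; the data frame ```compare``` and its column names are illustrative.

```r
fitstart <- lm(mpg ~ 1, data = mtcars)   # assumed starting model
FitAll   <- lm(mpg ~ ., data = mtcars)   # assumed full model

bwd <- step(FitAll, direction = "backward", trace = 0)
fwd <- step(fitstart, direction = "forward", scope = formula(FitAll), trace = 0)

# Collect the selected terms, the AIC used by step(), and the adjusted R-squared
compare <- data.frame(
  model  = c("backward", "forward"),
  terms  = c(paste(attr(terms(bwd), "term.labels"), collapse = " + "),
             paste(attr(terms(fwd), "term.labels"), collapse = " + ")),
  aic    = c(extractAIC(bwd)[2], extractAIC(fwd)[2]),
  adj_r2 = c(summary(bwd)$adj.r.squared, summary(fwd)$adj.r.squared)
)
compare
```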

---

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 6 - Hybrid Stepwise - Forward and Backward Selection<a class="anchor" id="DS106L4_page_6"></a>

[Back to Top](#DS106L4_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


# Hybrid Stepwise - Forward and Backward Selection

**[This video](https://www.youtube.com/watch?v=ejR8LnQziPY)** shows how to do the regression method that does both forward steps and backward steps. In general terms, you start with no predictors, and use the mean value for ```mpg``` only. This is the same way you started the forward selection model. You will use the ```step()``` function again, but this time, utilize ```direction="both"``` as an argument, to specify the hybrid stepwise approach: 

```{r}
step(fitstart, direction="both", scope=formula(FitAll))
```

---

## First Iteration 


Here is what the first iteration looks like. You will notice that it is identical to the first iteration of the forward selection approach. Adding ```wt``` to the model is the first step.

```text
Start:  AIC=115.94
mpg ~ 1

       Df Sum of Sq     RSS     AIC
+ wt    1    847.73  278.32  73.217
+ cyl   1    817.71  308.33  76.494
+ disp  1    808.89  317.16  77.397
+ hp    1    678.37  447.67  88.427
+ drat  1    522.48  603.57  97.988
+ vs    1    496.53  629.52  99.335
+ am    1    405.15  720.90 103.672
+ carb  1    341.78  784.27 106.369
+ gear  1    259.75  866.30 109.552
+ qsec  1    197.39  928.66 111.776
<none>              1126.05 115.943
```

---

## Second Iteration

Look at the bottom of the table shown below. The output not only specifies what the new AIC would be if a predictor were added (as indicated by the "+" sign at the start of each row), but also what the new AIC would be if ```wt``` were removed from the model (as indicated by the "-" at the start of the row with ```wt```). It would be silly to add ```wt``` in one step, and then turn around and immediately remove it in the next step. However, once there are several terms in the model, this isn't such a far-fetched idea.

```text
Step:  AIC=73.22
mpg ~ wt

       Df Sum of Sq     RSS     AIC
+ cyl   1     87.15  191.17  63.198
+ hp    1     83.27  195.05  63.840
+ qsec  1     82.86  195.46  63.908
+ vs    1     54.23  224.09  68.283
+ carb  1     44.60  233.72  69.628
+ disp  1     31.64  246.68  71.356
<none>               278.32  73.217
+ drat  1      9.08  269.24  74.156
+ gear  1      1.14  277.19  75.086
+ am    1      0.00  278.32  75.217
- wt    1    847.73 1126.05 115.943
```

The second iteration suggests that the best way to improve the model is to add the ```cyl``` predictor, which is what you will do in the next iteration below!
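The "-" row in the table above corresponds to what ```drop1()``` computes: the AIC of the model after removing one term. A minimal sketch, with ```fit_wt``` as an illustrative name:

```r
fit_wt <- lm(mpg ~ wt, data = mtcars)    # the model after the first step

# One row per removable term, plus <none> for the current model
removal <- drop1(fit_wt)
removal
```

Removing ```wt``` takes you back to the intercept-only model, which is why its AIC matches the starting AIC from the first iteration.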

---

## Third Iteration

Now, with ```cyl``` in the model also, each term that is not already in the model becomes a candidate for inclusion, and each term already in the model becomes a candidate for elimination. As usual, you will let the AIC dictate what happens. In this case, the model is improved most by adding the ```hp``` term. 

```text
Step:  AIC=63.2
mpg ~ wt + cyl

       Df Sum of Sq    RSS    AIC
+ hp    1    14.551 176.62 62.665
+ carb  1    13.772 177.40 62.805
<none>              191.17 63.198
+ qsec  1    10.567 180.60 63.378
+ gear  1     3.028 188.14 64.687
+ disp  1     2.680 188.49 64.746
+ vs    1     0.706 190.47 65.080
+ am    1     0.125 191.05 65.177
+ drat  1     0.001 191.17 65.198
- cyl   1    87.150 278.32 73.217
- wt    1   117.162 308.33 76.494
```

---

## Fourth Iteration

With the fourth iteration, the model suggests that having ```wt```, ```cyl```, and ```hp``` as predictors is better than any other model with another single predictor added, and better than any other model with one of those predictors removed. The iterative process has stabilized, and you have the same model as you did when you did the forward selection approach.

```text
Step:  AIC=62.66
mpg ~ wt + cyl + hp

       Df Sum of Sq    RSS    AIC
<none>              176.62 62.665
- hp    1    14.551 191.17 63.198
+ am    1     6.623 170.00 63.442
+ disp  1     6.176 170.44 63.526
- cyl   1    18.427 195.05 63.840
+ carb  1     2.519 174.10 64.205
+ drat  1     2.245 174.38 64.255
+ qsec  1     1.401 175.22 64.410
+ gear  1     0.856 175.76 64.509
+ vs    1     0.060 176.56 64.654
- wt    1   115.354 291.98 76.750
```

Once again, mission accomplished. There is no need to fit a new model with those three predictors, since you already did that above.
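If you would like to confirm that the hybrid and forward searches land on the same model, you can compare their coefficients. This sketch assumes the same ```fitstart``` and ```FitAll``` definitions used throughout; ```both``` and ```fwd``` are illustrative names.

```r
fitstart <- lm(mpg ~ 1, data = mtcars)   # assumed starting model
FitAll   <- lm(mpg ~ ., data = mtcars)   # assumed full model

both <- step(fitstart, direction = "both",    scope = formula(FitAll), trace = 0)
fwd  <- step(fitstart, direction = "forward", scope = formula(FitAll), trace = 0)

formula(both)
all.equal(coef(both), coef(fwd))         # TRUE when the two searches agree
```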

---

## Hierarchical Regression 

In *hierarchical regression*, you, rather than an algorithm, pick which variables are added or removed next, which allows you to make statements about how much variance a variable explains "over and above" other variables. This approach can be quite useful, especially when answering stakeholder questions about which variables are most important.
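As a rough sketch of the idea, suppose you decide to enter ```wt``` first and ```cyl``` second. That order is chosen here purely for illustration, and the names ```block1```, ```block2```, and ```r2_change``` are hypothetical.

```r
block1 <- lm(mpg ~ wt, data = mtcars)          # first block: wt only
block2 <- update(block1, . ~ . + cyl)          # second block: add cyl

# Extra share of variance explained by cyl over and above wt
r2_change <- summary(block2)$r.squared - summary(block1)$r.squared
r2_change

# F test of whether the second block improves on the first
anova(block1, block2)
```

Because you control the order, the change in R<sup>2</sup> answers a specific question ("what does ```cyl``` add once ```wt``` is accounted for?") rather than whatever order the AIC happens to pick.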

---

## Summary

* Stepwise regression has three basic approaches - backward elimination, forward selection, and a combination of the two.
* Backward elimination creates a model with all predictor variables included, and then removes them one at a time until the model is optimized.
* Forward selection starts with no predictor variables included, and adds them one at a time until the model is optimized.
* The combination approach starts with no predictor terms. At every step, the current model is compared with all models that either add a single predictor or remove a single predictor, and the best change is made, until no change improves the model.

---

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 7 - Key Terms<a class="anchor" id="DS106L4_page_7"></a>

[Back to Top](#DS106L4_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


# Key Terms

Below is a list and short description of the important keywords learned in this lesson. Please read through and go back and review any concepts you do not fully understand. Great Work!

<table class="table table-striped">
    <tr>
        <th>Keyword</th>
        <th>Description</th>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Stepwise Regression</td>
        <td>A process to determine exactly what variables make up the best-fitting regression model.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Multiple Regression</td>
        <td>A regression with multiple independent variables.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Overfitting</td>
        <td>When a model fits the sample data so closely, noise included, that it predicts future data poorly.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Principle of Parsimony</td>
        <td>When all other things are equal, go with the simplest explanation.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Backward Elimination</td>
        <td>Starting with all the predictors and removing them one at a time to find the best-fitting model.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Forward Selection</td>
        <td>Starting with no predictors and adding them one at a time to find the best-fitting model.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Hybrid Stepwise Regression</td>
        <td>Starting with no predictors and adding or removing them one at a time to find the best-fitting model.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Akaike Information Criterion (AIC)</td>
        <td>An indicator of model fit quality. The smaller, the better.</td>
    </tr>
</table>

---

## Key R Code

<table class="table table-striped">
    <tr>
        <th>Keyword</th>
        <th>Description</th>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>step()</td>
        <td>A function to perform stepwise regression.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>direction="backward"</td>
        <td>An argument to step() that performs backward elimination.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>direction="forward"</td>
        <td>An argument to step() that performs forward selection.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>scope=</td>
        <td>An argument to step() that specifies what model you are moving towards.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>direction="both"</td>
        <td>An argument to step() that specifies hybrid stepwise regression.</td>
    </tr>
</table>

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 8 - Lesson 4 Practice Hands-On<a class="anchor" id="DS106L4_page_8"></a>

[Back to Top](#DS106L4_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


# Stepwise Regression Hands On

This Hands-On will **not** be graded, but you are encouraged to complete it. The best way to become a great data scientist is to practice!

<div class="panel panel-danger">
    <div class="panel-heading">
        <h3 class="panel-title">Caution!</h3>
    </div>
    <div class="panel-body">
        <p>Do not submit your project until you have completed all requirements, as you will not be able to resubmit.</p>
    </div>
</div>

---

## Part I: Backwards Elimination

Use **[this data file](https://repo.exeterlms.com/documents/V2/DataScience/Modeling-Optimization/IQ.zip)**, which contains test scores and IQ for 15 individuals. Each individual took 5 tests. The IQ is the response variable, and the five different tests are the potential predictor variables. Perform a backwards elimination on this data, then answer the following questions:

* Which model is the best? Why?
* From the best model, what is the adjusted R<sup>2</sup> value and what does it mean? 
* From the best model, how does each variable influence IQ?

---

## Part II: Compare Stepwise Regression Types

The **[following dataset](https://repo.exeterlms.com/documents/V2/DataScience/Modeling-Optimization/stepwiseRegression.zip)** will be used for this analysis. This data has a single response (Y) variable, and twelve predictor (X1 through X12) variables. Use these data to run all three kinds of stepwise regressions (backward elimination, forward selection, and the hybrid method). After completing these analyses, answer the following questions:

* Which model was the best for each type of method?
* How do the final models from each method compare to each other?
* From your chosen "best model," explain what variable(s) contribute to predicting Y and for how much variance they account.

<div class="panel panel-danger">
    <div class="panel-heading">
        <h3 class="panel-title">Caution!</h3>
    </div>
    <div class="panel-body">
        <p>Be sure to zip and submit your entire directory when finished!</p>
    </div>
</div>


<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 9 - Lesson 4 Practice Hands-On Solution<a class="anchor" id="DS106L4_page_9"></a>

[Back to Top](#DS106L4_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


# Lesson 4 Practice Hands-On Solution

Below you will find the solution to the Lesson 4 hands-on!

---

## Part I

```r
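# Backward elimination
# Assumes the IQ data file has already been loaded into a data frame named IQ
FitAll1 <- lm(IQ ~ ., data = IQ)   # full model: all five tests as predictors
summary(FitAll1)

step(FitAll1, direction = 'backward')

# Refit with only the predictors retained by step() and compare with the full model
fitsome <- lm(IQ ~ Test1 + Test2 + Test4, data = IQ)
summary(fitsome)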
```

---

## Part II

```r
# Assumes the data file has already been loaded into a data frame named stepwiseRegression

# Backward elimination: start from the full model
FitAll <- lm(Y ~ ., data = stepwiseRegression)
summary(FitAll)

step(FitAll, direction = 'backward', scope = formula(FitAll))

# Forward selection: start from the intercept-only model
fitstart <- lm(Y ~ 1, data = stepwiseRegression)
summary(fitstart)

step(fitstart, direction = 'forward', scope = formula(FitAll))

# Hybrid stepwise: terms can be added or removed at each step
step(fitstart, direction = 'both', scope = formula(FitAll))

# Refit the selected model to inspect its coefficients and adjusted R-squared
fitsome <- lm(Y ~ X2 + X4 + X6 + X10 + X11 + X12, data = stepwiseRegression)
summary(fitsome)

# Compare the output from the three approaches and note the order in which predictors are added or eliminated
```