# DS106 Modeling : Lesson Two Companion Notebook

### Table of Contents <a class="anchor" id="DS106L2_toc"></a>

* [Table of Contents](#DS106L2_toc)
    * [Page 1 - Introduction](#DS106L2_page_1)
    * [Page 2 - What is Logistic Regression?](#DS106L2_page_2)
    * [Page 3 - Assumptions of Logistic Regression](#DS106L2_page_3)
    * [Page 4 - Logistic Regression Setup in R](#DS106L2_page_4)
    * [Page 5 - Running Logistic Regression and Interpreting the Output](#DS106L2_page_5)
    * [Page 6 - Logistic Regression in Python](#DS106L2_page_6)
    * [Page 7 - Key Terms](#DS106L2_page_7)
    * [Page 8 - Lesson 2 Hands-On](#DS106L2_page_8)

    

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 1 - Overview of this Module<a class="anchor" id="DS106L2_page_1"></a>

[Back to Top](#DS106L2_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

```python
from IPython.display import VimeoVideo
# Tutorial Video Name: Modeling with logistic regression
VimeoVideo('246121316', width=720, height=480)
```

# Introduction

In this lesson, you will learn how to compute logistic regressions, which have a categorical dependent variable instead of a continuous variable. By the end of the lesson, you should be able to:

* Understand the theory behind logistic regression
* Recognize the assumptions for logistic regression
* Test assumptions and compute logistic regression in R
* Compute logistic regression in Python

This lesson will culminate in a hands-on in which you will complete logistic regression in R.

<div class="panel panel-success">
    <div class="panel-heading">
        <h3 class="panel-title">Additional Info!</h3>
    </div>
    <div class="panel-body">
        <p>You may want to watch this <a href="https://vimeo.com/465050172"> recorded live workshop </a> that goes over the concepts in this lesson.</p>
    </div>
</div>

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 2 - What is Logistic Regression?<a class="anchor" id="DS106L2_page_2"></a>

[Back to Top](#DS106L2_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


# What is Logistic Regression?

Now that you have covered simple linear regression, where you have a single independent, continuous variable predicting a single dependent, continuous variable, jump into logistic regression! Logistic regression goes by several names:

* Logistic regression (of course)
* Logit regression
* Logit model

Logistic regression is a type of regression that works when you have a single categorical dependent variable. For this lesson, you will only consider the case where the dependent categorical variable can take on two values. If you want to get technical, this is called *Binary Logistic Regression*. If your dependent variable has more than two levels, then you have *Multinomial Logistic Regression*.

To give you an example of what a binary dependent variable might look like, you could have: 

* 0 and 1
* Win and Loss
* Pass and Fail
* Success and Failure
* Healthy and Sick
* Alive and Dead

---

## An Example

Take a look at an example. The following spreadsheet shows results for 20 students that took an exam. The data collected are the amount of time spent studying for the exam, and whether or not they passed the exam:

![A spreadsheet with two columns, hours and outcome. Rows beneath the hours heading show the amount of time a student studied for an exam. The rows beneath the outcome heading show fail or pass.](Media/L02-02.png)

Do some eyeball analysis on these data for a few minutes. Even though the data set is pretty limited, it seems that studying for 1.5 hours or less is a recipe for failure. On the other hand, everyone who studied 4 hours or more passed the test. It's difficult to say with anyone who studied between 1.5 and 4 hours, though. Without doing much math yet, create a simple model that would help explain the likelihood of passing the exam. Set up a graph to try and model the probability of passing the exam, based on the amount of study. Here is what you know for sure:

![A graph showing the probability of passing an exam based on hours of study. The x axis is labeled hours of study and runs from zero to seven in increments of one. The x axis is also labeled fail. A horizontal line at the top of the graph is labeled pass. On the fail line, a thick red line runs from zero to one point five. On the pass line, a thick red line runs from four to seven.](Media/L02-03.png)

If the thick red lines represent whether a student passes or fails based on study time, then anything between 1.5 hours and 4 hours is variable. If you stick with the thought that more study improves your chances of passing, you might simply connect the two horizontal lines, like this:

![A graph showing the probability of passing an exam based on hours of study. The x axis is labeled hours of study and runs from zero to seven in increments of one. The x axis is also labeled fail. A horizontal line at the top of the graph is labeled pass. On the fail line, a thick red line runs from zero to one point five. The line then moves upward on the chart to the pass line. On the pass line, a thick red line runs from four to seven.](Media/L02-04.png)

Now add a scale to the vertical axis, going from 0 (fail) to 1 (pass):

![A graph showing the probability of passing an exam based on hours of study. The x axis is labeled hours of study and runs from zero to seven in increments of one. The x axis is also labeled fail. The y axis runs from zero to one in increments of zero point two five. A horizontal line at the top of the graph is labeled pass. On the fail line, a thick red line runs from zero to one point five. The line then moves upward on the chart to the pass line. On the pass line, a thick red line runs from four to seven.](Media/L02-05.png)

Now you can make some statements about probability. For instance, you might ask "What is the probability of passing, assuming that I study for 3.25 hours?" To answer this based on your 'non-mathematical' model, you simply draw a vertical line straight up from 3.25 until it intersects with the thick red line. At that point, the line should be drawn horizontally and to the left, until it hits the vertical axis, like the green line added below:

![A graph showing the probability of passing an exam based on hours of study. The x axis is labeled hours of study and runs from zero to seven in increments of one. The x axis is also labeled fail. The y axis runs from zero to one in increments of zero point two five. A horizontal line at the top of the graph is labeled pass. On the fail line, a thick red line runs from zero to one point five. The line then moves upward on the chart to the pass line. On the pass line, a thick red line runs from four to seven. A dotted green line runs from left to right, starting on the y axis just below zero point seven five. When it hits the upward diagonal red line, the dotted green line runs directly downward to the x axis, at about three point two five.](Media/L02-06.png)

Using eyeball analysis, it looks like the horizontal dashed green line intersects the vertical axis at about 0.7. This becomes your prediction. You can now make the statement that if you study for 3.25 hours, you have a 70% probability of passing the exam.

Now add in the actual data points:

![A graph showing the probability of passing an exam based on hours of study. The x axis is labeled hours of study and runs from zero to seven in increments of one. The x axis is also labeled fail. The y axis runs from zero to one in increments of zero point two five. A horizontal line at the top of the graph is labeled pass. On the fail line, a thick red line runs from zero to one point five. The line then moves upward on the chart to the pass line. On the pass line, a thick red line runs from four to seven. A dotted green line runs from left to right, starting on the y axis just below zero point seven five. When it hits the upward diagonal red line, the dotted green line runs directly downward to the x axis, at about three point two five. Data points are plotted on the pass line and the fail line.](Media/L02-07.png)

Even though this is different from the 'best fit' line you created in the linear regression lesson, can you see how it is sort of a 'best fit' in a sense, given the parameters within which you are working?

Okay, one final thing...typically predictions are smooth curves. Again, without doing any math, a smoothed curve for the prediction might look something like this:

![A graph showing the probability of passing an exam based on hours of study. The x axis is labeled hours of study and runs from zero to seven in increments of one. The x axis is also labeled fail. The y axis runs from zero to one in increments of zero point two five. A horizontal line at the top of the graph is labeled pass. On the fail line, a line runs from zero to one point five before beginning to smoothly curve upward. The line then moves upward on the chart toward the pass line. Near the pass line, the line begins to curve before it is horizontal and runs from four to seven. A dotted green line runs from left to right, starting on the y axis just below zero point seven five. When it hits the upward slope of the curve, the dotted green line runs directly downward to the x axis, at about three point two five. Data points are plotted on the pass line and the fail line.](Media/L02-08.png)

As you can see, the prediction for 3.25 hours of study is pretty much the same.
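
The smooth S-shaped curve drawn above has the same shape as the logistic function, which is what logistic regression fits. Here is a minimal Python sketch of that curve. The intercept and slope are hypothetical values picked by hand so the curve roughly mimics the eyeballed picture; they are not fitted to the exam data.

```python
import math

def logistic(z):
    """Map any real number z to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients, chosen by hand for illustration only
INTERCEPT = -4.675
SLOPE = 1.7

def pass_probability(hours):
    """Probability of passing for a given number of study hours."""
    return logistic(INTERCEPT + SLOPE * hours)

for hours in (1.0, 2.75, 3.25, 5.0):
    print(f"{hours:>5} hours -> P(pass) = {pass_probability(hours):.2f}")
```

With these made-up coefficients, 3.25 hours of study lands at roughly 0.70, in line with the eyeball estimate. A fitted model would estimate the intercept and slope from the data instead.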

---

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 3 - Assumptions of Logistic Regression<a class="anchor" id="DS106L2_page_3"></a>

[Back to Top](#DS106L2_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


# Assumptions of Logistic Regression

Logistic regression has five assumptions. When they are met, you can trust that your model's estimates are as free of avoidable error as possible. The assumptions are as follows:

1. Meets minimum sample size of having at least one case per cell, with no more than 20% of cells having fewer than five cases. 
    > What is meant by cells? If you made a 2x2 chart of the actual and predicted outcomes, you should have at least 1 case in each. 
2. Linearity in the logit
3. Absence of multicollinearity
4. Absence of outliers 
5. Independence of errors

You will learn about them all in-depth below.

---

## 1. Meets Minimum Sample Size

In order to check this, you just need to make sure that you have at least one case each that: 

* Met the condition and was predicted to meet the condition
* Met the condition but was predicted to fail the condition
* Failed the condition and was predicted to fail the condition
* Failed the condition and was predicted to meet the condition

You'll be able to check this in something called a *confusion matrix*.
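
The sample size rule can be written as a small check on the four cell counts. This is a rough Python sketch only: the function name and the example counts are hypothetical, not taken from the lesson's data.

```python
def meets_minimum_sample_size(cell_counts):
    """Check the sample size rule on the counts from every cell of the chart.

    Rule: every cell has at least one case, and no more than 20% of the
    cells have fewer than five cases.
    """
    if min(cell_counts) < 1:
        return False
    small_cells = sum(1 for count in cell_counts if count < 5)
    return small_cells / len(cell_counts) <= 0.20

# Hypothetical 2x2 charts, flattened into lists of four counts
print(meets_minimum_sample_size([40, 12, 9, 35]))  # every cell has 5 or more
print(meets_minimum_sample_size([40, 3, 9, 35]))   # one of four cells is below 5
print(meets_minimum_sample_size([40, 0, 9, 35]))   # one cell is empty
```

With only four cells, a single small cell is already 25% of the chart, so in the 2x2 case the rule amounts to requiring at least five cases in every cell.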

---

## 2. Linearity in the Logit

In logistic regression, the *logit*, also known as the *log-odds*, is the natural logarithm of the odds of your outcome, where the odds are the probability of the outcome divided by one minus that probability. You need to make sure that the logit is linearly related to the independent variable. You can check this with a graph once you have created the logit term.
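
To make the relationship between probability, odds, and the logit concrete, here is a short Python sketch. The helper names are illustrative, not part of any library.

```python
import math

def odds(p):
    """Odds of an outcome with probability p (0 < p < 1)."""
    return p / (1.0 - p)

def logit(p):
    """Log-odds: the natural log of the odds."""
    return math.log(odds(p))

def inverse_logit(z):
    """Convert a logit value back into a probability."""
    return 1.0 / (1.0 + math.exp(-z))

for p in (0.25, 0.5, 0.75):
    print(f"p = {p:.2f}  odds = {odds(p):.2f}  logit = {logit(p):+.3f}")
```

A probability of 0.5 has odds of 1 and a logit of 0; probabilities above 0.5 give positive logits, and probabilities below 0.5 give negative ones.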

---

## 3. Absence of Multicollinearity

You will test for the absence of multicollinearity for a logistic regression the exact same way you would for a linear regression: correlate the independent variables with each other. If any pair is too highly correlated (roughly .6 to .7 or higher), then you probably have related independent variables and thus the presence of multicollinearity.
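
As a rough illustration of that check, the sketch below computes a Pearson correlation between two independent variables in plain Python. The variable names and values are made up for the example, and the .7 cutoff is just the rule of thumb mentioned above.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical independent variables
hours_studied = [1.0, 2.0, 2.5, 3.0, 4.0, 5.5]
practice_tests = [0, 1, 1, 2, 3, 4]

r = pearson_r(hours_studied, practice_tests)
print(f"r = {r:.2f}")
if abs(r) >= 0.7:
    print("These predictors look too closely related to use together.")
```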

---

## 4. Absence of Outliers

Again, you'll test the absence of outliers in the same way that you tested them for linear regression.  

---

## 5. Independence of Errors 

The assumption of independent errors just means that your residuals cannot be related to any part of your data. You can test this in several ways, but the easiest is to plot the residuals against your index number (the row number). You can also plot the autocorrelation of the residuals, or compute an autocorrelation statistic such as Moran's I. 

---

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 4 - Logistic Regression Setup in R<a class="anchor" id="DS106L2_page_4"></a>

[Back to Top](#DS106L2_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


# Logistic Regression Setup in R 

Now that you understand the basics of logistic regression, and all the assumptions you will need to meet, you will begin the prep work for logistic regression in R.

---

## Load in Libraries

You will need the following libraries to conduct logistic regression in R: 

```{r}
library("caret")
library("magrittr")
library("dplyr")
library("tidyr")
library("lmtest")
library("popbio")
library("e1071")
```

```caret``` and ```lmtest``` will be used to test assumptions; ```dplyr```, ```tidyr```, and ```magrittr``` are used for data wrangling; and ```popbio``` is used to graph your logistic regression model. ```e1071``` is loaded because the ```confusionMatrix()``` function in ```caret``` relies on it.

---

## Load in Data

For this example, you will use this **[baseball dataset](https://repo.exeterlms.com/documents/V2/DataScience/Modeling-Optimization/baseball.zip)**. Each baseball season, each team plays 162 regular season games. Since baseball doesn't ever end in a tie, each game has a winner and a loser. There are 30 teams, but each game includes two teams, so there are a total of (162 \* 30) / 2 total games played, or 2430 total games. 

Since each of the 2430 games has a winner and a loser, you have a table with 4860 rows of data. It looks like this:

![A spreadsheet of baseball data. The column headings are game, date, team, opp, W forward slash L, R, R A, D forward slash N, att, date, team, and H R count. Twenty rows of data are shown.](Media/L02-14.png)
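
The game and row counts quoted above follow from simple arithmetic, sketched here in Python as a quick sanity check:

```python
TEAMS = 30
GAMES_PER_TEAM = 162

# Every game involves two teams, so divide by two to avoid double counting
total_games = (TEAMS * GAMES_PER_TEAM) // 2

# The table has one row per team per game: a winner row and a loser row
total_rows = total_games * 2

print(total_games, total_rows)  # 2430 4860
```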

---

## Question Setup

Even though runs can be scored a number of different ways, you can investigate how well home runs predict the winner. It seems logical to assume that the more home runs a team hits in a particular game, the more likely they are to win. So you will do a regression where the predictor (IV) is the number of home runs hit by a team, and the response variable (DV) is whether the team wins or loses. This is a case of a quantitative predictor variable and a categorical response variable.

---

## Data Wrangling

The one thing that absolutely has to be done before you can dive into logistic regression is the recoding of the outcome variable (DV) to zeros and ones.  Currently, your wins and losses variable (```W.L```) has a ```W``` indicating a win and an ```L``` indicating a loss.  In R, that just won't fly - you need this outcome variable to be numeric.  The following code will create a new wins and losses column that will re-code this variable numerically:

```{r}
baseball$WinsR <- NA
baseball$WinsR[baseball$W.L=='W'] <- 1
baseball$WinsR[baseball$W.L=='L'] <- 0
```

---

## Testing Assumptions

Great! Now you are ready to begin testing logistic regression assumptions!

---

### Appropriate Sample Size

The first thing you need to do to test for appropriate sample size is to create the logistic regression model.  

---

#### Run the Base Logistic Model

Just like with linear regression, you typically need to create a model first before you can ensure that it meets all the assumptions - you just won't use it yet.

```{r}
mylogit <- glm(WinsR ~ HR.Count, data=baseball, family="binomial")
```

This code should look somewhat familiar to you, as it follows the same pattern as your linear regression code. However, instead of ```lm()``` you now use ```glm()```, and you need to specify ```family=```. Here you have chosen ```binomial``` because you are doing binary logistic regression, in which the outcome follows a binomial distribution. Place your dependent variable first (the new ```WinsR``` column that you recoded), and then add your independent variable after the tilde. ```baseball``` is the name of your dataset.

---

#### Predict Wins and Losses

With that model created (but not interpreted!), you can make predictions about wins and losses.  To do this, you will use the ```predict()``` function on your logit model first: 

```{r}
probabilities <- predict(mylogit, type = "response")
```

Then convert your probabilities to a positive or negative prediction: anything above .5 (half) is positive, and anything at or below .5 is negative. This is done using the ```ifelse()``` function on the ```probabilities``` variable you just created, and the result is assigned to your baseball data set as the column ```Predicted```, so that you can later compare it with the recoded wins and losses column.

```{r}
# Label probabilities above .5 as "pos" and everything else as "neg"
baseball$Predicted <- ifelse(probabilities > .5, "pos", "neg")
```

---

#### Recode the Predicted Variable

Just like you recoded wins and losses, you also need to recode your new Predicted variable: 

```{r}
baseball$PredictedR <- NA
baseball$PredictedR[baseball$Predicted=='pos'] <- 1
baseball$PredictedR[baseball$Predicted=='neg'] <- 0
```

---

#### Convert Variables to Factors

The next thing you need to do is to convert the ```WinsR``` and the ```PredictedR``` columns to factors.  This is necessary because the next line of code you will run requires these variables to be factors. Simply specify the dataset and call the variable before the arrow, then use the function ```as.factor()``` and call the variable again.  

```{r}
baseball$PredictedR <- as.factor(baseball$PredictedR)
baseball$WinsR <- as.factor(baseball$WinsR)
```

---

#### Create a Confusion Matrix

And now you are finally ready to create a 2x2 chart, also known as a *confusion matrix*, which will not only test your sample size, but will also provide some information on the accuracy of your predictions. Using the ```caret``` library, you will call the ```confusionMatrix()``` function, specifying that you want to compare your predicted values (```PredictedR```) to your actual data, which is the ```WinsR``` column.

```{r}
conf_mat <- caret::confusionMatrix(baseball$PredictedR, baseball$WinsR)
conf_mat
```

The results are shown below:

```text
Confusion Matrix and Statistics

          Reference
Prediction    0    1
         0 1917 1240
         1  513 1190
                                          
               Accuracy : 0.6393          
                 95% CI : (0.6256, 0.6528)
    No Information Rate : 0.5             
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.2786          
 Mcnemar's Test P-Value : < 2.2e-16       
                                          
            Sensitivity : 0.7889          
            Specificity : 0.4897          
         Pos Pred Value : 0.6072          
         Neg Pred Value : 0.6988          
             Prevalence : 0.5000          
         Detection Rate : 0.3944          
   Detection Prevalence : 0.6496          
      Balanced Accuracy : 0.6393          
                                          
       'Positive' Class : 0              
```

The first thing you will notice is the table at the top.  This is your 2x2 chart, and it shows the following: 

* **Top left corner (Reference: 0, Prediction: 0):**  These are the cases that failed the condition and were predicted to fail the condition.  In your case, a loss was predicted and a loss actually happened.  
    > This is the number you accurately predicted as "did not happen."
* **Top right corner (Reference: 1, Prediction: 0):** These are the cases that were predicted to fail the condition, but did not actually fail.  In terms of the current dataset: a loss was predicted, but the team actually won.
* **Bottom left corner (Reference: 0, Prediction: 1):** These are the cases where a success was predicted, but a failure actually happened.  This means that the team was predicted to win, but they actually lost. 
* **Bottom right corner (Reference: 1, Prediction: 1):** These are the cases where a success was predicted and a success actually happened. So, the team was predicted to win and they actually won. 
    > This is the number you accurately predicted as "did happen."

If any one of these four cells in the chart is below 5, then you do not meet the minimum sample size for binary logistic regression, because a single cell is already 25% of the chart and the limit is 20%. Luckily, you have a large dataset, and so you pass this assumption!

The confusion matrix output also contains other useful information. The accuracy rate shows how often your predictions are correct: a .639 accuracy rate means that roughly 64% of your predictions match the actual outcome. If you added additional independent variables to your model, then perhaps this accuracy rate would go up. The output also reports sensitivity, specificity, the positive predictive value, and the negative predictive value. Note that the last line of the output lists ```0``` as the 'Positive' class, so here those statistics treat a loss as the positive outcome. Although these are not statistics you will focus on now, they will come up again later when you discuss receiver operating characteristic (ROC) curve analyses, so it's worth knowing where they can be located in R.  
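
To see where the headline statistics come from, the Python sketch below recomputes them from the four cell counts printed in the confusion matrix output. Only the counts come from the lesson; the variable names are illustrative. The calculation follows the output's own labeling, in which ```0``` (a loss) is the 'Positive' class.

```python
# Cell counts copied from the confusion matrix output above
pred0_ref0 = 1917  # predicted loss, actual loss
pred0_ref1 = 1240  # predicted loss, actual win
pred1_ref0 = 513   # predicted win, actual loss
pred1_ref1 = 1190  # predicted win, actual win

total = pred0_ref0 + pred0_ref1 + pred1_ref0 + pred1_ref1

# Share of all predictions that match the actual outcome
accuracy = (pred0_ref0 + pred1_ref1) / total

# The output treats 0 (a loss) as the 'Positive' class
sensitivity = pred0_ref0 / (pred0_ref0 + pred1_ref0)     # actual losses caught
specificity = pred1_ref1 / (pred0_ref1 + pred1_ref1)     # actual wins caught
pos_pred_value = pred0_ref0 / (pred0_ref0 + pred0_ref1)  # predicted losses that were losses
neg_pred_value = pred1_ref1 / (pred1_ref0 + pred1_ref1)  # predicted wins that were wins

print(f"Accuracy       : {accuracy:.4f}")
print(f"Sensitivity    : {sensitivity:.4f}")
print(f"Specificity    : {specificity:.4f}")
print(f"Pos Pred Value : {pos_pred_value:.4f}")
print(f"Neg Pred Value : {neg_pred_value:.4f}")
```

The printed values match the corresponding lines of the R output, which is a handy way to confirm which cell is which.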

---

### Logit Linearity 

Now that you have your model and your predictions, you can calculate the logit and then graph it against your predictor values.

You will need to do a little more data wrangling to properly create your logit. You only want to assess the linearity of the logit with numeric variables, so using the library ```dplyr```, and the ```select_if()``` function, you can select only numeric columns from the full dataset by specifying as the argument ```is.numeric```.

```{r}
baseball1 <- baseball %>% 
dplyr::select_if(is.numeric)
```

Then you will pull the column names and store them in ```predictors``` using the ```colnames()``` function: 

```{r}
predictors <- colnames(baseball1)
```

And finally you can create the logit, using the ```mutate()``` function from ```dplyr``` and the ```gather()``` function from ```tidyr```. The logit is calculated as the log of the probabilities divided by one minus the probabilities.

```{r}
baseball1 <- baseball1 %>%
mutate(logit=log(probabilities/(1-probabilities))) %>%
gather(key= "predictors", value="predictor.value", -logit)
```

<div class="panel panel-success">
    <div class="panel-heading">
        <h3 class="panel-title">Fun Fact!</h3>
    </div>
    <div class="panel-body">
        <p>If you haven't encountered %>% before, it basically means that you are stringing things together. It comes from the magrittr library and is often found in use with the dplyr library in particular. But anything that uses %>% can be written out in its long form as well. </p>
    </div>
</div>

With this logit in hand, you can graph to assess for linearity, using your dear friend, ```ggplot```!

```{r} 
ggplot(baseball1, aes(logit, predictor.value))+
geom_point(size=.5, alpha=.5)+
geom_smooth(method= "loess")+
theme_bw()+
facet_wrap(~predictors, scales="free_y")
```

![Six graphs resulting from the use of the g g plot function. The horizontal axis for each is logit. The vertical axis is predictor dot value. Top row of three graphs, from left to right, Att, Game, H R count. Bottom row of three graphs, from left to right, R, R A, Wins R.](Media/106.L7.1.png)

This will automatically give you a graph of the logit with every numeric variable.  Of course, in this case, all you care about is the number of home runs, denoted by the upper right hand corner graph labeled ```HR.Count```. Lucky for you, this shows a nice strong linear relationship, so you are good to move on to testing the next assumption!

---

### Multicollinearity

The next assumption you would normally test for is multicollinearity; you can't have your independent variables too closely related to each other.  However, since you only have one independent variable, you can skip this step.

---

### Independent Errors

You can test for independent errors by graphing the residuals against your index.  

---

#### Graphing the Errors

You can test this a number of ways. The first way is to graph the errors, and there's a nice, easy line of code for this:

```{r}
plot(mylogit$residuals)
```

Where ```mylogit``` is the model you created, and residuals is an automatic output of that model that you can call.

Here is the graph this code yields:

![A graph with a horizontal axis labeled index and a vertical axis labeled my logit residuals. Data is plotted throughout the graph and is mostly clustered at the top.](Media/106.L7.2.png)

You are looking for a pretty even distribution of points all the way across your x axis. You have that, so you have met the assumption of independent errors.

---

#### Use The Durbin-Watson Test

Alternatively, you can use the Durbin-Watson test to see whether you have independence of errors.  You'll use the function ```dwtest()``` out of the ```lmtest``` library:

```{r}
dwtest(mylogit, alternative="two.sided")
```

Using the ```alternative="two.sided"``` argument means that you are testing for both positive and negative autocorrelation of errors. 

This code yields the following output: 

```text
	Durbin-Watson test

data:  mylogit
DW = 2.0828, p-value = 0.003875
alternative hypothesis: true autocorrelation is not 0

```

If this test is not statistically significant (p > .05), then you are automatically good to go, and you have independent errors. However, if it is significant, as it is here (p = .0039), you can then look at the actual value of the Durbin-Watson test statistic. If it is under 1 or greater than 3, then you have violated the assumption of independent errors. Since your DW value is 2.08, you are in an acceptable range and have met the assumption of independent errors through testing as well as graphing!
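
For intuition about what the DW value measures, here is a small Python sketch of the textbook Durbin-Watson formula: the sum of squared differences between consecutive residuals, divided by the sum of squared residuals. The residual lists are made up, and this is not necessarily the exact computation that ```dwtest()``` performs on a ```glm``` object.

```python
def durbin_watson(residuals):
    """Textbook Durbin-Watson statistic for a sequence of residuals."""
    numerator = sum(
        (residuals[i] - residuals[i - 1]) ** 2
        for i in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator

# Hypothetical residual patterns
drifting = [1.0, 1.0, 1.0, 1.0]       # each residual repeats the last one
alternating = [1.0, -1.0, 1.0, -1.0]  # each residual flips the last one

print(durbin_watson(drifting))     # 0.0, strong positive autocorrelation
print(durbin_watson(alternating))  # 3.0, negative autocorrelation
```

The statistic ranges from 0 to 4. Values near 2 suggest little autocorrelation, which is why a DW of 2.08 is reassuring.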

<div class="panel panel-success">
    <div class="panel-heading">
        <h3 class="panel-title">Additional Info!</h3>
    </div>
    <div class="panel-body">
        <p>If you would like more precision in your interpretation of the Durbin-Watson statistic, check out <a href="https://www.statisticshowto.com/durbin-watson-test-coefficient/"> this webpage</a></p>
    </div>
</div>

---

### Screening for Outliers

To screen for outliers, you will use the ```influence.measures``` function that you used previously: 

```{r}
infl <- influence.measures(mylogit)
summary(infl)
```

This yields the following output: 

```text
Potentially influential observations of
	 glm(formula = WinsR ~ HR.Count, family = "binomial", data = baseball) :

     dfb.1_ dfb.HR.C dffit   cov.r   cook.d hat  
16   -0.01   0.01     0.02    1.00_*  0.00   0.00
233  -0.01   0.01     0.01    1.00_*  0.00   0.00
275  -0.01   0.01     0.02    1.00_*  0.00   0.00
285  -0.01   0.01     0.02    1.00_*  0.00   0.00
309  -0.01   0.01     0.02    1.00_*  0.00   0.00
320  -0.01   0.01     0.02    1.00_*  0.00   0.00
327  -0.01   0.01     0.02    1.00_*  0.00   0.00
334  -0.01   0.01     0.02    1.00_*  0.00   0.00
345  -0.01   0.01     0.02    1.00_*  0.00   0.00
437  -0.01   0.01     0.02    1.00_*  0.00   0.00
498  -0.01   0.01     0.02    1.00_*  0.00   0.00
501  -0.01   0.01     0.02    1.00_*  0.00   0.00
586  -0.01   0.01     0.02    1.00_*  0.00   0.00
671  -0.01   0.01     0.01    1.00_*  0.00   0.00
679  -0.01   0.01     0.02    1.00_*  0.00   0.00
684  -0.01   0.01     0.02    1.00_*  0.00   0.00
694  -0.01   0.01     0.01    1.00_*  0.00   0.00
698  -0.01   0.01     0.02    1.00_*  0.00   0.00
777   0.00   0.01     0.01    1.00_*  0.00   0.00
779  -0.01   0.01     0.01    1.00_*  0.00   0.00
788   0.04  -0.06    -0.06_*  1.00    0.01   0.00
801  -0.01   0.01     0.02    1.00_*  0.00   0.00
904  -0.01   0.01     0.01    1.00_*  0.00   0.00
1040 -0.01   0.01     0.02    1.00_*  0.00   0.00
1088 -0.01   0.01     0.02    1.00_*  0.00   0.00
1092 -0.01   0.01     0.02    1.00_*  0.00   0.00
1115 -0.01   0.01     0.02    1.00_*  0.00   0.00
1133 -0.01   0.01     0.02    1.00_*  0.00   0.00
1135 -0.01   0.01     0.02    1.00_*  0.00   0.00
1157 -0.01   0.01     0.01    1.00_*  0.00   0.00
1213 -0.01   0.01     0.02    1.00_*  0.00   0.00
1247 -0.01   0.01     0.02    1.00_*  0.00   0.00
1251 -0.01   0.01     0.02    1.00_*  0.00   0.00
1258 -0.01   0.01     0.02    1.00_*  0.00   0.00
1273 -0.01   0.01     0.02    1.00_*  0.00   0.00
1277 -0.01   0.01     0.02    1.00_*  0.00   0.00
1280 -0.01   0.01     0.02    1.00_*  0.00   0.00
1282 -0.01   0.01     0.02    1.00_*  0.00   0.00
1326 -0.01   0.01     0.02    1.00_*  0.00   0.00
1330 -0.01   0.01     0.02    1.00_*  0.00   0.00
1348 -0.01   0.01     0.02    1.00_*  0.00   0.00
1370 -0.01   0.01     0.02    1.00_*  0.00   0.00
1377 -0.01   0.01     0.02    1.00_*  0.00   0.00
1385 -0.01   0.01     0.02    1.00_*  0.00   0.00
1467 -0.01   0.01     0.01    1.00_*  0.00   0.00
1474 -0.01   0.01     0.02    1.00_*  0.00   0.00
1524  0.00   0.01     0.01    1.00_*  0.00   0.00
1539 -0.01   0.01     0.02    1.00_*  0.00   0.00
1549 -0.01   0.01     0.02    1.00_*  0.00   0.00
1561  0.00   0.01     0.01    1.00_*  0.00   0.00
1577 -0.01   0.01     0.02    1.00_*  0.00   0.00
1582 -0.01   0.01     0.02    1.00_*  0.00   0.00
1626 -0.01   0.01     0.01    1.00_*  0.00   0.00
1636 -0.01   0.01     0.02    1.00_*  0.00   0.00
1645 -0.01   0.01     0.02    1.00_*  0.00   0.00
1646 -0.01   0.01     0.02    1.00_*  0.00   0.00
1667 -0.01   0.01     0.02    1.00_*  0.00   0.00
1677 -0.01   0.01     0.02    1.00_*  0.00   0.00
1703 -0.01   0.01     0.02    1.00_*  0.00   0.00
1707 -0.01   0.01     0.02    1.00_*  0.00   0.00
1718  0.04  -0.06    -0.06_*  1.00    0.01   0.00
1754 -0.01   0.01     0.02    1.00_*  0.00   0.00
1759 -0.01   0.01     0.02    1.00_*  0.00   0.00
1783 -0.01   0.01     0.02    1.00_*  0.00   0.00
1811 -0.01   0.01     0.01    1.00_*  0.00   0.00
1827 -0.01   0.01     0.01    1.00_*  0.00   0.00
1835 -0.01   0.01     0.01    1.00_*  0.00   0.00
1849 -0.01   0.01     0.02    1.00_*  0.00   0.00
1853 -0.01   0.01     0.02    1.00_*  0.00   0.00
1860 -0.01   0.01     0.02    1.00_*  0.00   0.00
1862 -0.01   0.01     0.02    1.00_*  0.00   0.00
1867 -0.01   0.01     0.02    1.00_*  0.00   0.00
1879 -0.01   0.01     0.01    1.00_*  0.00   0.00
1901 -0.01   0.01     0.01    1.00_*  0.00   0.00
1914 -0.01   0.01     0.01    1.00_*  0.00   0.00
1994 -0.01   0.01     0.02    1.00_*  0.00   0.00
2004 -0.01   0.01     0.01    1.00_*  0.00   0.00
2011 -0.01   0.01     0.01    1.00_*  0.00   0.00
2017 -0.01   0.01     0.02    1.00_*  0.00   0.00
2023 -0.01   0.01     0.02    1.00_*  0.00   0.00
2033 -0.01   0.01     0.01    1.00_*  0.00   0.00
2038 -0.01   0.01     0.02    1.00_*  0.00   0.00
2043 -0.01   0.01     0.02    1.00_*  0.00   0.00
2067 -0.01   0.01     0.01    1.00_*  0.00   0.00
2080 -0.01   0.01     0.02    1.00_*  0.00   0.00
2106 -0.01   0.01     0.01    1.00_*  0.00   0.00
2148 -0.01   0.01     0.02    1.00_*  0.00   0.00
2154 -0.01   0.01     0.02    1.00_*  0.00   0.00
2175 -0.01   0.01     0.02    1.00_*  0.00   0.00
2206 -0.01   0.01     0.02    1.00_*  0.00   0.00
2226 -0.01   0.01     0.02    1.00_*  0.00   0.00
2254 -0.01   0.01     0.02    1.00_*  0.00   0.00
2289 -0.01   0.01     0.02    1.00_*  0.00   0.00
2332  0.04  -0.06    -0.06_*  1.00    0.01   0.00
2357  0.00   0.01     0.01    1.00_*  0.00   0.00
2390 -0.01   0.01     0.02    1.00_*  0.00   0.00
2410 -0.01   0.01     0.02    1.00_*  0.00   0.00
2427 -0.01   0.01     0.02    1.00_*  0.00   0.00
2496 -0.01   0.01     0.02    1.00_*  0.00   0.00
2500 -0.01   0.01     0.02    1.00_*  0.00   0.00
2525 -0.01   0.01     0.02    1.00_*  0.00   0.00
2538 -0.01   0.01     0.02    1.00_*  0.00   0.00
2573 -0.01   0.01     0.02    1.00_*  0.00   0.00
2633 -0.01   0.01     0.01    1.00_*  0.00   0.00
2643  0.00   0.01     0.01    1.00_*  0.00   0.00
2657 -0.01   0.01     0.01    1.00_*  0.00   0.00
2744 -0.01   0.01     0.01    1.00_*  0.00   0.00
2764 -0.01   0.01     0.02    1.00_*  0.00   0.00
2773 -0.01   0.01     0.02    1.00_*  0.00   0.00
2809 -0.01   0.01     0.01    1.00_*  0.00   0.00
2814 -0.01   0.01     0.02    1.00_*  0.00   0.00
2831 -0.01   0.01     0.02    1.00_*  0.00   0.00
2833 -0.01   0.01     0.02    1.00_*  0.00   0.00
2890 -0.01   0.01     0.01    1.00_*  0.00   0.00
2917 -0.01   0.01     0.01    1.00_*  0.00   0.00
2920 -0.01   0.01     0.02    1.00_*  0.00   0.00
2936 -0.01   0.01     0.02    1.00_*  0.00   0.00
2966 -0.01   0.01     0.02    1.00_*  0.00   0.00
2992 -0.01   0.01     0.02    1.00_*  0.00   0.00
3009 -0.01   0.01     0.02    1.00_*  0.00   0.00
3039 -0.01   0.01     0.02    1.00_*  0.00   0.00
3060 -0.01   0.01     0.02    1.00_*  0.00   0.00
3121 -0.01   0.01     0.02    1.00_*  0.00   0.00
3153 -0.01   0.01     0.01    1.00_*  0.00   0.00
3207 -0.01   0.01     0.01    1.00_*  0.00   0.00
3243 -0.01   0.01     0.01    1.00_*  0.00   0.00
3275 -0.01   0.01     0.01    1.00_*  0.00   0.00
3293 -0.01   0.01     0.01    1.00_*  0.00   0.00
3315 -0.01   0.01     0.02    1.00_*  0.00   0.00
3325 -0.01   0.01     0.02    1.00_*  0.00   0.00
3353 -0.01   0.01     0.01    1.00_*  0.00   0.00
3398 -0.01   0.01     0.02    1.00_*  0.00   0.00
3400 -0.01   0.01     0.02    1.00_*  0.00   0.00
3410 -0.01   0.01     0.02    1.00_*  0.00   0.00
3513 -0.01   0.01     0.02    1.00_*  0.00   0.00
3540 -0.01   0.01     0.01    1.00_*  0.00   0.00
3577 -0.01   0.01     0.01    1.00_*  0.00   0.00
3588 -0.01   0.01     0.02    1.00_*  0.00   0.00
3592 -0.01   0.01     0.02    1.00_*  0.00   0.00
3593  0.04  -0.07    -0.07_*  1.00    0.01   0.00
3615 -0.01   0.01     0.01    1.00_*  0.00   0.00
3620 -0.01   0.01     0.02    1.00_*  0.00   0.00
3625  0.04  -0.06    -0.06_*  1.00    0.01   0.00
3629  0.00   0.01     0.01    1.00_*  0.00   0.00
3655 -0.01   0.01     0.02    1.00_*  0.00   0.00
3670 -0.01   0.01     0.02    1.00_*  0.00   0.00
3705 -0.01   0.01     0.02    1.00_*  0.00   0.00
3714 -0.01   0.01     0.02    1.00_*  0.00   0.00
3735 -0.01   0.01     0.02    1.00_*  0.00   0.00
3741 -0.01   0.01     0.02    1.00_*  0.00   0.00
3742  0.04  -0.06    -0.06_*  1.00    0.01   0.00
3755 -0.01   0.01     0.02    1.00_*  0.00   0.00
3781  0.00   0.01     0.01    1.00_*  0.00   0.00
3795 -0.01   0.01     0.02    1.00_*  0.00   0.00
3841  0.00   0.01     0.01    1.00_*  0.00   0.00
3867 -0.01   0.01     0.02    1.00_*  0.00   0.00
3903 -0.01   0.01     0.02    1.00_*  0.00   0.00
3920 -0.01   0.01     0.02    1.00_*  0.00   0.00
3948 -0.01   0.01     0.02    1.00_*  0.00   0.00
3950 -0.01   0.01     0.02    1.00_*  0.00   0.00
3959 -0.01   0.01     0.02    1.00_*  0.00   0.00
4034 -0.01   0.01     0.02    1.00_*  0.00   0.00
4056 -0.01   0.01     0.02    1.00_*  0.00   0.00
4081 -0.01   0.01     0.02    1.00_*  0.00   0.00
4090 -0.01   0.01     0.02    1.00_*  0.00   0.00
4093  0.00   0.01     0.01    1.00_*  0.00   0.00
 [ reached getOption("max.print") -- omitted 26 rows ]
```

Notice that this is not even all the output! You may want to consider creating your own function that prints only the suspicious rows.  Remember that if the absolute value of ```dfb.1_``` or ```dffit``` is greater than 1, or if ```hat``` is greater than .3 or so, you probably have an outlier that should be examined and possibly removed.
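
One way to picture such a filter is the minimal Python sketch below. It assumes the influence measures have already been brought into Python as a list of dictionaries, which is a hypothetical setup and not something this lesson does. The function name `suspicious_rows`, the `id` key, and the third sample row are made up for illustration. The first two rows copy values from the output above, assuming the usual column order of `dfb.1_`, `dfb.HR.C`, `dffit`, `cov.r`, `cook.d`, `hat`.

```python
def suspicious_rows(rows, dfbeta_cut=1.0, dffit_cut=1.0, hat_cut=0.3):
    """Return only the rows whose influence measures exceed the cutoffs."""
    flagged = []
    for row in rows:
        if (abs(row["dfb.1_"]) > dfbeta_cut
                or abs(row["dffit"]) > dffit_cut
                or row["hat"] > hat_cut):
            flagged.append(row)
    return flagged


rows = [
    {"id": 1879, "dfb.1_": -0.01, "dffit": 0.01, "hat": 0.00},  # copied from the output above
    {"id": 2332, "dfb.1_": 0.04, "dffit": -0.06, "hat": 0.00},  # copied from the output above
    {"id": 9999, "dfb.1_": 1.20, "dffit": 1.35, "hat": 0.41},   # made-up row for illustration
]

# Print only the row ids that cross at least one cutoff
print([row["id"] for row in suspicious_rows(rows)])
```

The cutoffs are keyword arguments so that the "or so" in the rule of thumb can be adjusted without rewriting the function.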

---

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 5 - Running Logistic Regression and Interpreting the Output<a class="anchor" id="DS106L2_page_5"></a>

[Back to Top](#DS106L2_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


# Running Logistic Regression and Interpreting the Output

Having passed all your assumptions (HOOAH!), you can now proceed to actually calling your logistic regression model and interpreting the output.

All you need to do is ask for the summary: 

```{r}
summary(mylogit)
```

And this is the output it returns:

```text
Call:
glm(formula = WinsR ~ HR.Count, family = "binomial", data = baseball)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.5366  -1.1171  -0.3553   1.2389   1.5338  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -0.80749    0.04658  -17.34   <2e-16 ***
HR.Count     0.66398    0.03044   21.81   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 6737.4  on 4859  degrees of freedom
Residual deviance: 6161.4  on 4858  degrees of freedom
AIC: 6165.4

Number of Fisher Scoring iterations: 4
```

In this output, the first thing to check is whether your independent variable, the number of home runs, was a significant predictor of whether a team won or lost.  Looking in the ```Coefficients``` table under ```HR.Count```, you see that the *p* value is significant at *p* < .001, which is great news. This means that the number of home runs is a significant predictor of winning versus losing. The *z* value given next to *p* is the *Wald Statistic*, and you can think of it similarly to the *t* tests you had for individual predictors in linear regression. The difference is that the Wald statistic is used for models like logistic regression, where the outcome is categorical.

In the same line, the estimate tells you how much the independent variable influences the dependent.  So, for every one unit increase in home runs, you see that the log odds of winning a game (versus losing) are increased by .66.
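
Log odds are hard to picture, so it can help to convert them. The short Python sketch below is only arithmetic on the two coefficients printed above and does not refit the model. Exponentiating the slope gives an odds ratio of about 1.94, and plugging values into the logistic function gives predicted probabilities. The function name `win_probability` is an illustrative choice, and the sketch assumes that ```WinsR``` is coded so that 1 means a win.

```python
import math

# Coefficients copied from the summary(mylogit) output above
intercept = -0.80749
slope = 0.66398  # change in log odds per one-unit increase in HR.Count

# Exponentiating a log-odds coefficient gives an odds ratio
odds_ratio = math.exp(slope)


def win_probability(hr_count):
    """Predicted probability of a win for a given HR.Count value."""
    log_odds = intercept + slope * hr_count
    return 1 / (1 + math.exp(-log_odds))


print(f"Odds ratio per unit of HR.Count: {odds_ratio:.2f}")
for hr in range(4):
    print(f"HR.Count = {hr}: P(win) = {win_probability(hr):.3f}")
```

Note that the odds ratio is constant for every one-unit step, while the change in probability is not: it depends on where along the curve you start.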

You also have a number of other components here that tell you about model fit, including the deviance residuals, null and residual deviance, and the AIC.  
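
To see how these fit numbers relate to each other, here is a small Python sketch that works only from the values printed above. It relies on two facts about 0/1 outcomes: the residual deviance equals -2 times the log-likelihood, and the drop from null to residual deviance is a likelihood-ratio chi-square statistic (1 degree of freedom here, because one predictor was added). McFadden's pseudo R-squared is included as one common deviance-based measure; it is not something the R summary prints.

```python
import math

# Values copied from the summary(mylogit) output above
null_deviance = 6737.4      # intercept-only model
residual_deviance = 6161.4  # model that includes HR.Count
n_parameters = 2            # intercept + HR.Count

# Drop in deviance: likelihood-ratio chi-square statistic with 1 df
lr_chi_square = null_deviance - residual_deviance

# For 1 degree of freedom, the chi-square upper tail equals erfc(sqrt(x / 2))
p_value = math.erfc(math.sqrt(lr_chi_square / 2))

# AIC = residual deviance + 2 * number of estimated parameters
aic = residual_deviance + 2 * n_parameters

# McFadden's pseudo R-squared: proportional reduction in deviance
mcfadden_r2 = 1 - residual_deviance / null_deviance

print(f"LR chi-square: {lr_chi_square:.1f}, p-value: {p_value:.3g}")
print(f"AIC: {aic:.1f}")
print(f"McFadden pseudo R-squared: {mcfadden_r2:.3f}")
```

The AIC computed this way matches the 6165.4 reported by R, which is a quick check that the pieces have been read correctly.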

---

## Graphing the Logistic Model

Want to plot it? You can use the ```popbio``` library to do so with this code: 

```{r}
library(popbio)  # provides logi.hist.plot()
logi.hist.plot(baseball$HR.Count, baseball$WinsR, boxp=FALSE, type="hist", col="gray")
```

It will yield this graphic:

![A graph resulting from the popbio function. The left side of the graph is labeled probability. The right side of the graph is labeled frequency. A red line slopes upward from left to right across the graph.](Media/106.L7.5.png)

---

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 6 - Logistic Regression in Python<a class="anchor" id="DS106L2_page_6"></a>

[Back to Top](#DS106L2_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


# Logistic Regression in Python

Although creating a logistic regression model in Python can be done, it is very difficult to test for any of the logistic-specific assumptions such as sample size, linearity of the logit, or independence of errors.  Therefore, you'll be shown just how to create the model; but beware! Without testing assumptions, you will not be able to definitively say that the results are accurate.  Only use Python for logistic regression when the stakes aren't very high and you just need to get an idea of what will happen.

---

## Import Packages

You will only need two packages in Python in order to complete logistic regression: ```pandas``` for reading in your data, and ```statsmodels``` for running the analysis:

```python
import pandas as pd
import statsmodels.api as sm
```

---

## Read in Data

Next, you'll need to read in your data.  You'll use the **[same baseball dataset](https://repo.exeterlms.com/documents/V2/DataScience/Modeling-Optimization/baseball.zip)** as you did in R.
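
A minimal sketch of that step is shown below. The file name `baseball.csv` is an assumption about what the zip archive contains once extracted, and `load_baseball` is just an illustrative helper name; adjust the path to match your own files.

```python
import pandas as pd


def load_baseball(path="baseball.csv"):
    """Read the baseball data into a DataFrame.

    The default file name is an assumption; pass the real path if it differs.
    """
    return pd.read_csv(path)
```

You would then call `baseball = load_baseball()` and look at `baseball.head()` to confirm that the ```W/L``` and ```HR Count``` columns used below are present.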

---

## Data Wrangling

There are two components of data wrangling you will need to tackle in Python.  First, you will need to recode your outcome variable into numeric.  Second, you will need to make each variable its own dataframe.

---

### Recoding the Outcome Variable

The wins and losses column, ```W/L```, will need to be recoded from string to numeric: 

```python
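def recode(result):
    # apply() passes one cell at a time: "W" for a win, "L" for a loss
    if result == "W":
        return 1
    if result == "L":
        return 0
    # Any other value falls through and returns None

# Create a numeric outcome column: 1 = win, 0 = loss
baseball['WLr'] = baseball['W/L'].apply(recode)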
```

---

### Make Each Variable a Dataframe

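Once that is done, you can pull your IV and DV out into their own objects by subsetting (strictly speaking, selecting a single column like this gives you a pandas Series rather than a full dataframe).  By convention, you can call them x and y: 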

```python
x = baseball['HR Count']
y = baseball['WLr']
```

And that's all the data wrangling you'll need to do before getting on with the analysis!

---

## Run the Analysis

Now you can run your logistic regression code, using the function ```Logit()``` out of the ```statsmodels``` package.  You'll then fit the results, using ```fit()```, and will lastly print out a summary: 

```python
# Note: sm.Logit() does not add an intercept on its own, so this fits HR Count alone
logit = sm.Logit(y, x)
results = logit.fit()
print(results.summary2())
```

Here is the output you will receive:

```text
Optimization terminated successfully.
         Current function value: 0.667079
         Iterations 5
                         Results: Logit
================================================================
Model:              Logit            Pseudo R-squared: 0.038    
Dependent Variable: WLr              AIC:              6486.0070
Date:               2019-08-21 11:32 BIC:              6492.4958
No. Observations:   4860             Log-Likelihood:   -3242.0  
Df Model:           0                LL-Null:          -3368.7  
Df Residuals:       4859             LLR p-value:      nan      
Converged:          1.0000           Scale:            1.0000   
No. Iterations:     5.0000                                      
------------------------------------------------------------------
            Coef.    Std.Err.      z      P>|z|    [0.025   0.975]
------------------------------------------------------------------
HR Count    0.2798     0.0184   15.2304   0.0000   0.2438   0.3158
================================================================
```

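As you can see in the ```P>|z|``` column, the number of home runs is still a significant predictor of whether a team won or lost the game.  The ```Coef.``` column is on the log odds scale, just like the estimate in R: each additional home run increases the log odds of winning by about 0.28, which multiplies the odds of winning by roughly exp(0.28) ≈ 1.32. It does not mean a 28% increase in the probability of winning.  Notice also that this coefficient differs from the 0.66 you saw in R. ```sm.Logit()``` does not add an intercept unless you supply one (for example, with ```sm.add_constant()```), so the model above was fit without an intercept, which is also why ```Df Model``` is 0 and the ```LLR p-value``` is ```nan```.  In the grand scheme of things, just looking at the number of home runs alone does not seem to have a huge impact: the ```Pseudo R-squared``` value is just under 0.04, so home runs improve the fit only slightly over a model with no predictors.  *Pseudo R-Squared* plays a role similar to the Multiple R-Squared in linear regression; however, it is calculated from log-likelihoods rather than from explained variance, which is why it retains the moniker "pseudo."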

---

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 7 - Key Terms<a class="anchor" id="DS106L2_page_7"></a>

[Back to Top](#DS106L2_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


# Key Terms

Below is a list and short description of the important keywords learned in this lesson. Please read through and go back and review any concepts you do not fully understand. Great Work!

<table class="table table-striped">
    <tr>
        <th>Keyword</th>
        <th>Description</th>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Logistic Regression</td>
        <td>AKA logit regression or logit model.  A regression model in which the dependent variable is categorical and the log odds of the outcome are modeled as a linear function of the predictors.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Logit</td>
        <td>AKA log-odds. The natural log of the odds of the outcome, where the odds are p / (1 - p) and p is the probability of the outcome.</td>
    </tr>
</table>

---

## Key R Libraries

<table class="table table-striped">
    <tr>
        <th>Keyword</th>
        <th>Description</th>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>caret</td>
        <td>For testing regression assumptions.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>lmtest</td>
        <td>For testing regression assumptions.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>popbio</td>
        <td>For graphing logistic regression.</td>
    </tr>
</table>

---

## Key R Code

<table class="table table-striped">
    <tr>
        <th>Keyword</th>
        <th>Description</th>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>glm()</td>
        <td>Creates a logistic regression model.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>family="binomial"</td>
        <td>An argument to glm() in which you specify that the outcome is a binary category.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>predict()</td>
        <td>Makes predictions about your outcome.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>confusionMatrix()</td>
        <td>Creates a confusion matrix to test for the assumption of sample size.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>dwtest()</td>
        <td>Conducts a Durbin-Watson test for independence of errors.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>logi.hist.plot()</td>
        <td>Creates a graph of your logistic model.</td>
    </tr>
</table>

---

## Key Python Code

<table class="table table-striped">
    <tr>
        <th>Keyword</th>
        <th>Description</th>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>sm.Logit()</td>
        <td>Creates a logistic regression model.</td>
    </tr>
</table>

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 8 - Lesson 2 Hands-On<a class="anchor" id="DS106L2_page_8"></a>

[Back to Top](#DS106L2_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


This Hands-On **will** be graded, so make sure you complete each part. When you are done, please submit one document with all of your findings for grading.

<div class="panel panel-danger">
    <div class="panel-heading">
        <h3 class="panel-title">Caution!</h3>
    </div>
    <div class="panel-body">
        <p>Do not submit your project until you have completed all requirements, as you will not be able to resubmit.</p>
    </div>
</div>

---

## Modeling with Logistic Regression Hands-On

Geologists have long known that the presence of antimony in a ground sample can be a good indication that there is a gold deposit nearby.  In the **[attached data](https://repo.exeterlms.com/documents/V2/DataScience/Modeling-Optimization/minerals.zip)**, you will find antimony (Sb) levels for 64 different locations, and whether or not gold was found nearby.  The "gold" column is coded as 0 for no gold present, and 1 for gold present.

Use logistic regression in R to create a prediction model that will give the probability of the presence of gold as a response.  Your complete report should contain the following information:

1.  Testing and correction for assumptions if necessary. 
2.  Interpretation of the results in layman's terms.

<div class="panel panel-danger">
    <div class="panel-heading">
        <h3 class="panel-title">Caution!</h3>
    </div>
    <div class="panel-body">
        <p>Be sure to zip and submit your entire directory when finished!</p>
    </div>
</div>