# Lab 14 - Classification and Logistic Regression

Note: Some of this lab is based on the Harvard Data Science CS109 Lab 4, Fall 2015.

We now move on to *classification*, which means predicting categorical data.  In this lab, we cover *logistic regression*, which adapts linear regression to predict one of two categories.

In logistic regression, we fit a *sigmoid function* to the data.  Run the code below to see two examples of sigmoid functions.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline 

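xx = np.linspace(-10,10,100)
# Standard sigmoid: rises from 0 to 1 and crosses 0.5 at x = 0
yy1 = 1/(1 + np.exp(-xx))
# Steeper sigmoid that falls from 1 to 0 and crosses 0.5 at x = -1
yy2 = 1/(1 + np.exp(5*xx +5))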

fig, ax = plt.subplots(1, 2, figsize=(20, 5))


ax[0].plot(xx,yy1)
ax[1].plot(xx,yy2)

The equation of the sigmoid function is: 

$$f(x_1, x_2, ..., x_n) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + ... + \beta_n x_n)}}$$

Notice that the exponent of $e$ looks like part of the linear regression equation.  This is not a coincidence!
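To make that connection concrete, here is a minimal sketch of the one-variable case. The helper name `sigmoid` and the coefficient values are illustrative choices, picked so that they match the two curves plotted above. The curve crosses 0.5 where $\beta_0 + \beta_1 x = 0$, that is, at $x = -\beta_0/\beta_1$, and the sign of $\beta_1$ decides whether the curve rises or falls.

```python
import numpy as np

def sigmoid(x, beta0, beta1):
    """Logistic function applied to the linear expression beta0 + beta1*x."""
    return 1 / (1 + np.exp(-(beta0 + beta1 * x)))

xx = np.linspace(-10, 10, 101)

# beta0 = 0, beta1 = 1 gives the first curve plotted above.
rising = sigmoid(xx, 0, 1)

# beta0 = -5, beta1 = -5 gives the second curve: the exponent becomes 5*x + 5.
falling = sigmoid(xx, -5, -5)

# The curve passes through 0.5 where beta0 + beta1*x = 0.
midpoint = -(-5) / (-5)   # x = -1 for the second curve
print(sigmoid(midpoint, -5, -5))
```

Changing `beta0` slides the curve left or right, while a larger absolute value of `beta1` makes the transition between 0 and 1 sharper.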

### Section 1: Loading and cleaning the data, and exploratory data analysis

The Challenger Space Shuttle tragically exploded in 1986, killing all astronauts on board.  The explosion was shown to have been caused by an O-ring failure, likely due to cold temperatures on the day of the launch (and also poor engineering that allowed this failure to cause such a catastrophe).

This lab will use experimental data from tests on whether O-rings failed at different temperatures.  The data set can be downloaded from [https://raw.githubusercontent.com/megan-owen/MAT328-Techniques_in_Data_Science/main/data/chall.txt](https://raw.githubusercontent.com/megan-owen/MAT328-Techniques_in_Data_Science/main/data/chall.txt).

Each row in the data represents one experiment.  The first column is the temperature in Fahrenheit that the experiment was conducted at, and the second column is 1 if the O-ring failed in that experiment, and 0 if it did not. 

Import the necessary libraries.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import statsmodels.formula.api as smf


%matplotlib inline

Open the data file in Jupyter notebook or another text editor.  What do you notice about it?  What needs to be changed or added when we read in the data?

See Lab 4 for a reminder on how to deal with the spaces separating the columns instead of commas.

We can add column names directly in `read_csv()`.  To do so, add the parameters: `header = None` and `names = ["Temperature", "Failure"]`.

<details><summary>Answer:</summary>
<code>data = pd.read_csv("chall.txt", sep = r"\s+", header = None, names = ["Temperature", "Failure"])</code>
</details>

Create a scatter plot with temperature on the x axis and failure on the y axis.

What do you notice about the graph?

### Section 2: Logistic regression

We will now use statsmodels to fit a logistic regression model to the data.  Notice that the code is similar to the code we used to fit a linear regression model.  What is the independent variable?  What is the dependent variable?

In [None]:
logit_model = smf.logit('Failure ~ Temperature',data).fit()
logit_model.summary()

Is there an R-squared value in the summary?  

To get the formula for the model, we plug the intercept and variable coefficients into the sigmoid equation at the start of the lab.  The intercept coefficient 15.0429 replaces $\beta_0$ and the temperature coefficient -0.2322 replaces $\beta_1$.  $x_1$ will represent the temperature variable in the equation.

$$y = \frac{1}{1 + e^{-(15.0429 -0.2322x_1)}}$$
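As a rough check of what this equation says, the sketch below evaluates it at a few temperatures. The function name `failure_probability` and the example temperatures are illustrative assumptions; only the two coefficients come from the summary quoted above.

```python
import numpy as np

# Coefficients copied from the model summary quoted in the lab.
beta0 = 15.0429   # intercept
beta1 = -0.2322   # temperature coefficient

def failure_probability(temperature):
    """Modeled probability of O-ring failure at a temperature in Fahrenheit."""
    return 1 / (1 + np.exp(-(beta0 + beta1 * temperature)))

# Example temperatures chosen for illustration only.
for t in [55, 65, 75]:
    print(t, round(float(failure_probability(t)), 3))

# Temperature at which the modeled probability is exactly 0.5.
t_half = -beta0 / beta1
print(round(t_half, 1))
```

Because the temperature coefficient is negative, the modeled failure probability drops as the temperature rises, and `t_half` marks where the model switches from "more likely to fail" to "less likely to fail".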


There is another way to get the model parameters:

In [None]:
logit_model.params

We can use these parameters to graph the model equation on the data.  

First, create 200 evenly spaced x values (look at the data to see what their range should be): 

In [None]:
x = np.linspace(50, 85, 200)
x

Next, we can compute $\beta_0 + \beta_1 x$ for all of these x values:

In [None]:
p = logit_model.params
reg = p['Intercept'] + x*p['Temperature']
reg

Finally we can plug `reg` into the logistic equation to get the y values:

In [None]:
y = 1/(1 + np.exp(-reg))
y

Plot another scatter plot of the data, plus the plot of our calculated x and y values:

There are a few different ways to make predictions from the logistic regression function, but the easiest is to predict a 1 if the function (y) is > 0.5, predict a 0 if y is < 0.5 and predict either 0 or 1 (chosen randomly) if y = 0.5.
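The 0.5 threshold rule can be sketched as follows. The temperatures in `temps` are hypothetical inputs, not rows from the data set, and the coefficients are the ones read from the model summary earlier in this section. For simplicity the sketch does not handle an exact tie at 0.5, which essentially never happens with real-valued probabilities.

```python
import numpy as np

# Coefficients from the O-ring model summary earlier in the lab.
beta0, beta1 = 15.0429, -0.2322

# Hypothetical temperatures to classify (not rows from the data set).
temps = np.array([53, 60, 66, 70, 81])

# Modeled probability of failure at each temperature.
prob = 1 / (1 + np.exp(-(beta0 + beta1 * temps)))

# Predict 1 (failure) when the probability exceeds 0.5, otherwise 0.
pred = (prob > 0.5).astype(int)
print(pred)
```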

### Section 3:  Confusion matrix

One way to understand how well our model works is to make a *confusion table* or *confusion matrix*, which counts how many correct predictions and how many of each type of error the model makes.  We can create the table using the `pred_table()` method of the fitted model.

In [None]:
logit_model.pred_table()

The confusion matrix can be read as follows:
<code>   
                       predicted
             |       0        |       1        |
             -----------------------------------
observed | 0 | true negative  | false positive |
         | 1 | false negative | true positive  |
</code>
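To see where these four counts come from, here is a minimal sketch that builds the same kind of table by hand. The `observed` and `predicted` arrays are made-up labels for illustration, not the O-ring results, and the layout (rows for observed, columns for predicted) follows the table above.

```python
import numpy as np

# Made-up labels for illustration; these are not the O-ring results.
observed  = np.array([0, 0, 0, 1, 1, 0, 1, 0])
predicted = np.array([0, 1, 0, 1, 0, 0, 1, 0])

# table[i, j] counts cases observed as i and predicted as j.
table = np.zeros((2, 2), dtype=int)
for obs, pred in zip(observed, predicted):
    table[obs, pred] += 1

print(table)

# Correct predictions sit on the diagonal of the table.
accuracy = np.trace(table) / table.sum()
print(accuracy)
```

The off-diagonal entries are the two kinds of mistakes: the top-right entry counts false positives and the bottom-left entry counts false negatives.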


How many correct predictions did the model make?   What kind of wrong predictions did the model make?

### Section 4: Pima (Akimel Oʼodham) Indian Diabetes data

The Akimel O'odham people, who have also been known as the Pima Indians since European colonization of the US, currently have a high prevalence of diabetes.   A data set of different possible diabetes indicators and whether the person has diabetes is on [Kaggle](https://www.kaggle.com/uciml/pima-indians-diabetes-database) and can be downloaded from [https://raw.githubusercontent.com/megan-owen/MAT328-Techniques_in_Data_Science/main/data/diabetes.csv](https://raw.githubusercontent.com/megan-owen/MAT328-Techniques_in_Data_Science/main/data/diabetes.csv).

Read in the dataset.

Plot a scatter plot of the `Glucose` (x) vs. `Outcome` (y) columns.  The `Glucose` column is the plasma glucose concentration 2 hours after an oral glucose tolerance test, and measures how a person's body is able to handle a large amount of sugar.

Fit a logistic regression model to this data, using Glucose as the independent variable and Outcome as the dependent variable.

What is the equation of the logistic regression model?

<details><summary>Answer:</summary>
$$y = \frac{1}{1 + e^{-(-5.3501 + 0.0379x)}}$$
</details>

Plot the model equation on top of your scatter plot.

We can also plot the logistic regression model using Seaborn's `regplot()`.  Use `regplot()` as if you were doing linear regression on the variables, but add in the parameter `logistic = True`.

Compute the confusion matrix for this model.  What does it tell you about the fit of this model?