# Logistic Regression


## Intuition of Logistic Regression: The Big Picture
When you want to predict a categorical outcome with a linear model, you run into trouble: the continuous output that simple linear regression gives you does not make sense as a class label. Logistic regression handles binary prediction by passing the linear output through a transformation, so that the result can be read as the probability of belonging to one class. This transformation, called the sigmoid function, lies at the heart of logistic regression.

Linear regression formula:
$$Y = B_0 + B_1X_1$$

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings("ignore")
%matplotlib inline
data = pd.read_csv('diabetes.csv')
data.columns

In [None]:
data.head()

In [None]:
sns.lmplot(x='Age', y='BMI', data=data, ci=None)


Linear regression works for a continuous response such as BMI. However, it is not suited to predicting a categorical outcome, such as whether someone has diabetes (the `Outcome` column).
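To make the problem concrete, here is a minimal sketch that fits a straight line to a 0/1 outcome. The numbers are made up for illustration and are not taken from `diabetes.csv`; the variable names `bmi`, `outcome`, `pred_low`, and `pred_high` are likewise only for this example.

```python
import numpy as np

# Made-up illustration data (NOT from diabetes.csv): a 0/1 outcome against BMI
bmi = np.array([18, 22, 25, 28, 30, 33, 36, 40, 45, 50], dtype=float)
outcome = np.array([0, 0, 0, 0, 1, 0, 1, 1, 1, 1], dtype=float)

# Ordinary least-squares line: outcome ~ intercept + slope * bmi
# (np.polyfit returns the highest-degree coefficient first)
slope, intercept = np.polyfit(bmi, outcome, deg=1)

# Predictions at a low and a high BMI value
pred_low = intercept + slope * 10
pred_high = intercept + slope * 65

print(f"prediction at BMI=10: {pred_low:.2f}")
print(f"prediction at BMI=65: {pred_high:.2f}")
```

With this toy data the line predicts a value below 0 at the low end and above 1 at the high end, neither of which can be read as a probability.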

In [None]:
plt.figure(figsize = (10,6))
plt.scatter(data.BMI, data.Outcome)
plt.xlabel('BMI')
plt.ylabel('Outcome')

In [None]:
# lmplot creates its own figure, so set the size with height/aspect rather than plt.figure
sns.lmplot(x='BMI', y='Outcome', data=data, ci=None, height=6, aspect=10/6)
plt.xlabel('BMI')
plt.ylabel('Outcome')

Unfortunately, a straight line is a poor fit here: it can predict values below 0 or above 1. We want to predict a probability that represents the binary outcome of 1 or 0. What should we do?

<img src="https://media.giphy.com/media/8lQyyys3SGBoUUxrUp/giphy.gif" >

![Screen%20Shot%202018-12-12%20at%2011.42.35%20PM.png](attachment:Screen%20Shot%202018-12-12%20at%2011.42.35%20PM.png)

Graph credit: *An Introduction to Statistical Learning with Applications in R*

Because our outcome variable is binary, we need to confine the prediction to a probability between 0 and 1 for any given values of the predictor(s). Therefore, we need a function that transforms the output of the linear regression to achieve this goal. Enter the sigmoid function.

Let $P(y=1|X) = p(X)$,

$$p(X) = \frac{e^{B_0+B_1 X_1}}{1+e^{B_0+B_1 X_1}}$$

Doing a bit of manipulation, we get:

$$\frac{p(X)}{1-p(X)} = e^{B_0+B_1 X_1}$$
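To spell out that manipulation, first subtract $p(X)$ from 1:

$$1 - p(X) = \frac{1+e^{B_0+B_1 X_1}}{1+e^{B_0+B_1 X_1}} - \frac{e^{B_0+B_1 X_1}}{1+e^{B_0+B_1 X_1}} = \frac{1}{1+e^{B_0+B_1 X_1}}$$

Dividing $p(X)$ by this quantity cancels the shared denominator $1+e^{B_0+B_1 X_1}$ and leaves only the numerator $e^{B_0+B_1 X_1}$, which is the odds expression above.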

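Taking the log of both sides, we get:

$$\log\left(\frac{p(X)}{1-p(X)}\right) = B_0+B_1 X_1$$

The left-hand side is called the log-odds (or logit): it is a linear function of the predictors, just like the response in linear regression.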
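The two directions of this transformation can be checked numerically. The sketch below is only an illustration: the helper names `sigmoid` and `logit` and the sample values in `z` are chosen for this example. It uses the equivalent form $\frac{1}{1+e^{-z}} = \frac{e^{z}}{1+e^{z}}$, where $z$ stands for the linear part $B_0+B_1 X_1$.

```python
import numpy as np

def sigmoid(z):
    """Map a linear score z (the log-odds) to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def logit(p):
    """Map a probability p back to the log of the odds, log(p / (1 - p))."""
    return np.log(p / (1.0 - p))

# A few arbitrary linear scores, chosen only for illustration
z = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])

p = sigmoid(z)          # probabilities, all strictly between 0 and 1
recovered = logit(p)    # applying the logit undoes the sigmoid

print(p)
print(recovered)
```

However large or small the linear score is, the sigmoid output stays between 0 and 1, and a score of 0 corresponds to a probability of 0.5.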

### Logistic Regression Model Construction & Coefficients

In [None]:
# interpreting the logistic regression's coefficients with continuous features 
# build a logistic regression


In [None]:
# fit a logistic regression on BMI and Age, then check its accuracy on the held-out test set
X = data[['BMI', 'Age']]
y = data.Outcome
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.2, random_state = 42)
logreg = LogisticRegression().fit(X_train,y_train)
logreg.score(X_test, y_test, sample_weight=None)

In [None]:
# we can also use the predict_proba function to examine the predicted probability for the given classes
# logreg.predict_proba(X_test.iloc[1:2,:])
# logreg.predict_log_proba(X_test.iloc[1:2,:])

#### Logistic Regression Coefficients and Interpretation

In [None]:
# examine the coefficients
logreg.coef_

In [None]:
# examine the intercepts
logreg.intercept_

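The fitted logistic regression can be written in terms of the log-odds as:

$$\log\left(\frac{p(X)}{1-p(X)}\right) = -4.64 + 0.077X_1 + 0.045X_2$$

where $X_1$ is BMI and $X_2$ is Age, following the column order in `X`.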
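To see how `coef_` and `intercept_` relate to predicted probabilities, the sketch below rebuilds the output of `predict_proba` by hand. It runs on synthetic stand-in data rather than `diabetes.csv`, and the names `model`, `manual_prob`, and `sklearn_prob` are chosen for this example only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (NOT the diabetes dataset): two features and a 0/1 label
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
true_logodds = -0.5 + 1.2 * X[:, 0] - 0.8 * X[:, 1]
y = (rng.random(200) < 1 / (1 + np.exp(-true_logodds))).astype(int)

model = LogisticRegression().fit(X, y)

# Log-odds from the fitted equation: intercept + coefficients . features
log_odds = model.intercept_[0] + X @ model.coef_[0]

# The sigmoid turns log-odds into P(y = 1 | X)
manual_prob = 1 / (1 + np.exp(-log_odds))

# predict_proba returns one column per class; column 1 is P(y = 1 | X)
sklearn_prob = model.predict_proba(X)[:, 1]

print(np.allclose(manual_prob, sklearn_prob))
```

For a binary problem, the hand-computed probabilities should agree with scikit-learn's, which is what justifies reading the coefficients as terms in a log-odds equation.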


#### Your turn: what is the probability of being diabetic given BMI = 45 and age = 25?

- First, calculate the log-odds of having diabetes when BMI = 45 and age = 25.
- Then exponentiate to remove the log and get the odds of having diabetes when BMI = 45 and age = 25.
- Finally, convert the odds to the probability of having diabetes when BMI = 45 and age = 25, using $p = \frac{\text{odds}}{1+\text{odds}}$.
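The three steps above can be sketched on a different example, so the exercise itself is left for you. This assumes the rounded intercept and coefficients from the equation in this notebook; the inputs BMI = 30 and age = 40 are picked arbitrarily for illustration.

```python
import math

# Rounded intercept and coefficients copied from the equation in this notebook
intercept, b_bmi, b_age = -4.64, 0.077, 0.045

# Example inputs chosen for illustration (deliberately not the exercise values)
bmi, age = 30, 40

# Step 1: log-odds from the linear equation
log_odds = intercept + b_bmi * bmi + b_age * age

# Step 2: exponentiate to get the odds
odds = math.exp(log_odds)

# Step 3: convert the odds to a probability
prob = odds / (1 + odds)

print(f"log-odds={log_odds:.2f}, odds={odds:.3f}, probability={prob:.3f}")
```

Because the coefficients are rounded, the result will differ slightly from what `logreg.predict_proba` returns for the same inputs.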

In [None]:
# compute the predicted log-odds for BMI=45 and age=25 using the equation
# logodds = logreg.intercept_

BMI_45_age_25_logodds = None
BMI_45_age_25_logodds

In [None]:
# convert the log-odds to odds by exponentiating

BMI_45_age_25_odds = None
BMI_45_age_25_odds

In [None]:
# compute the probability from the odds

BMI_45_age_25_prob = None
BMI_45_age_25_prob

__Bottom line__: A positive coefficient is associated with an increase in the log-odds of the response (and thus an increase in the probability), and a negative coefficient is associated with a decrease in the log-odds of the response (and thus a decrease in the probability).
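One common way to put a number on this is to exponentiate a coefficient: a one-unit increase in a feature multiplies the odds by $e^{B}$, holding the other feature fixed. The sketch below assumes the rounded coefficients from the equation in this notebook; the dictionary names are chosen for this example.

```python
import math

# Rounded coefficients copied from the fitted equation in this notebook
coefficients = {"BMI": 0.077, "Age": 0.045}

# exp(coefficient) is the factor by which the odds are multiplied when
# that feature increases by one unit and the other feature is held fixed
odds_multipliers = {name: math.exp(b) for name, b in coefficients.items()}

for name, factor in odds_multipliers.items():
    print(f"{name}: odds multiplied by {factor:.3f} per one-unit increase")
```

A positive coefficient gives a multiplier above 1 (the odds go up), while a negative coefficient would give a multiplier below 1 (the odds go down).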