<a href="https://colab.research.google.com/github/nickstone1911/data-analysis-practice/blob/main/Logistic_Regression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Advanced Business Analytics: Introduction to `Logistic Regression` in `scikit-learn`

In this code-along notebook, we will:

1. Introduce the basics of the `LogisticRegression` classifier in `scikit-learn`, a popular machine learning classifier
2. Examine features and capabilities of `LogisticRegression`
3. Learn how to install and import the `LogisticRegression` classifier
4. Fit a logistic regression model to data
5. Interpret the coefficients produced by the logistic regression model to determine features that are most influential
---

Resources:
>- [Logistic Regression Slides](https://docs.google.com/presentation/d/1zSurhiNUg4z7fgqpXnVEQ89GTufod92wod9HYNMt3qE/edit#slide=id.g26618ceed22_0_86)
>- https://stats.idre.ucla.edu/stata/faq/how-do-i-interpret-odds-ratios-in-logistic-regression/
>- https://stats.idre.ucla.edu/other/mult-pkg/faq/general/faq-how-do-i-interpret-odds-ratios-in-logistic-regression/

# Importing Necessary Libraries
>- In the next cell we will import from `scikit-learn`, including the `sklearn.datasets` module. This module provides a way to load popular tutorial datasets such as the `iris` dataset
>- We will import the `LogisticRegression` classifier for this tutorial to align with our readings and examples in the course



In [None]:
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

# Load the `iris` Dataset
>- For this tutorial we will load the `iris` dataset from `scikit-learn`
>- We will then select only two of the flower classes to make this a binary classification problem

In [None]:
iris = datasets.load_iris(as_frame = True)

print(dir(iris))

['DESCR', 'data', 'data_module', 'feature_names', 'filename', 'frame', 'target', 'target_names']


In [None]:
iris_df = iris.frame

iris_df = iris_df[iris_df['target']<= 1]

iris_df['target'].value_counts()

0    50
1    50
Name: target, dtype: int64

Let's remind ourselves what the values 0 and 1 mean:
>- 0 = 'setosa'
>- 1 = 'versicolor'

It's important to note the class denoted as '1' because we use that when interpreting coefficients.

In [None]:
iris.target_names

array(['setosa', 'versicolor', 'virginica'], dtype='<U10')
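The mapping between integer codes and species names can also be built directly from `target_names`, which avoids hard-coding it. This is a minimal sketch; the name `code_to_name` is introduced here for illustration and is not part of the original notebook.

```python
from sklearn import datasets

iris = datasets.load_iris(as_frame=True)

# Pair each integer target code with its species name.
# `code_to_name` is a hypothetical helper name used only in this sketch.
code_to_name = dict(enumerate(iris.target_names))
print(code_to_name)
```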

Descriptive Analytics: Quick check of the means by target.

In [None]:
iris_df.groupby('target').mean()

target,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm)
0,5.006,3.428,1.462,0.246
1,5.936,2.77,4.26,1.326


---
# End of Video 1
---

# Fitting a Logistic Regression Model

In [None]:
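# Create the classifier with scikit-learn's default settings
log_model = LogisticRegression()

# Features: the four measurement columns; target: the 0/1 class label
X = iris_df.iloc[:, 0:4]
y = iris_df['target']

# Estimate the coefficients from the data
log_model.fit(X, y)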
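After fitting, it can help to confirm which class the model treats as '1' and to look at a few predicted probabilities. The sketch below rebuilds the same two-class data so it runs on its own. The variable names `probs` and `train_accuracy` are introduced here for illustration. Note that `LogisticRegression()` applies L2 regularization by default (`C=1.0`), so the fitted coefficients are somewhat shrunk toward zero.

```python
from sklearn import datasets
from sklearn.linear_model import LogisticRegression

# Rebuild the two-class (setosa vs. versicolor) data used in this notebook.
iris_df = datasets.load_iris(as_frame=True).frame
iris_df = iris_df[iris_df['target'] <= 1]

X = iris_df.iloc[:, 0:4]
y = iris_df['target']

log_model = LogisticRegression().fit(X, y)

# classes_ lists the labels in the column order used by predict_proba.
print(log_model.classes_)

# Each row holds P(target = 0) and P(target = 1); the two values sum to 1.
probs = log_model.predict_proba(X.head())
print(probs.round(3))

# Fraction of training rows classified correctly (not a held-out estimate).
train_accuracy = log_model.score(X, y)
print(train_accuracy)
```

Because the accuracy here is measured on the same rows used for fitting, it should be read as a sanity check rather than an estimate of performance on new flowers.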

In [None]:
X

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm)
0,5.1,3.5,1.4,0.2
1,4.9,3.0,1.4,0.2
2,4.7,3.2,1.3,0.2
3,4.6,3.1,1.5,0.2
4,5.0,3.6,1.4,0.2
...,...,...,...,...
95,5.7,3.0,4.2,1.2
96,5.7,2.9,4.2,1.3
97,6.2,2.9,4.3,1.3
98,5.1,2.5,3.0,1.1


In [None]:
iris_df

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),target
0,5.1,3.5,1.4,0.2,0
1,4.9,3.0,1.4,0.2,0
2,4.7,3.2,1.3,0.2,0
3,4.6,3.1,1.5,0.2,0
4,5.0,3.6,1.4,0.2,0
...,...,...,...,...,...
95,5.7,3.0,4.2,1.2,1
96,5.7,2.9,4.2,1.3,1
97,6.2,2.9,4.3,1.3,1
98,5.1,2.5,3.0,1.1,1


In [None]:
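# Pair each feature name with its coefficient and odds ratio
feature_names = list(X.columns)

# coef_ has shape (1, n_features) for a binary problem; take the single row
model_coef = list(log_model.coef_[0])

# Exponentiating a log-odds coefficient gives an odds ratio
odds_ratios = list(np.exp(model_coef))

coef_odds = list(zip(model_coef, odds_ratios))

feature_coef = dict(zip(feature_names, coef_odds))

feature_coef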

{'sepal length (cm)': (0.4403648151613586, 1.553273772983347),
 'sepal width (cm)': (-0.9069681265060137, 0.4037464784833325),
 'petal length (cm)': (2.3084956594396826, 10.059280683084747),
 'petal width (cm)': (0.9623276250899723, 2.6177826040293937)}

Create a DataFrame from `feature_coef`

In [None]:
importance_df = pd.DataFrame.from_dict(feature_coef,
                                       orient = 'index',
                                       columns = ['Coeff', 'Odds_ratio']).sort_values(by = 'Odds_ratio', ascending = False)
importance_df

Unnamed: 0,Coeff,Odds_ratio
petal length (cm),2.308496,10.059281
petal width (cm),0.962328,2.617783
sepal length (cm),0.440365,1.553274
sepal width (cm),-0.906968,0.403746


---
# End of Video 2

---

# Coefficient Interpretation

Things to note about logistic regression coefficients:

>- The coefficients are expressed on the *log-odds* scale and cannot be interpreted directly as in linear regression.
>- However, we can compare odds ratios and percentage increases in odds given a one unit increase in the feature variable
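The relationship between coefficients, odds ratios, and percent change in odds can be written out as follows. This is the standard logistic regression form; the symbols $p$, $\beta_j$, and $x_j$ are notation introduced here, with $p$ the probability of the class coded as 1.

$$\log\frac{p}{1-p} = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k$$

Increasing $x_j$ by one unit while holding the other features fixed adds $\beta_j$ to the log-odds, so the odds are multiplied by $e^{\beta_j}$:

$$\frac{\text{odds}(x_j + 1)}{\text{odds}(x_j)} = e^{\beta_j}, \qquad \text{percent change in odds} = \left(e^{\beta_j} - 1\right) \times 100$$

For example, a coefficient of $2.3085$ gives $e^{2.3085} \approx 10.06$, which is a percent change of about $(10.06 - 1) \times 100 \approx 906\%$.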


---

## Interpreting the coefficients in our example

Recall that 1 = 'Versicolor' and 0 = 'Setosa'. So, each estimated coefficient in our example is the expected change in the log-odds of being a 'Versicolor' for a one-unit increase in the corresponding feature, holding the other features constant.

When we exponentiate a coefficient, we get the ratio of two odds (the odds ratio). From it we can determine the percent change in the odds of being 'Versicolor' for a one-unit increase in the feature, holding the other features constant.
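One way to convince ourselves of this is to compute the odds from the model's predicted probabilities before and after adding one unit to a single feature, and compare that ratio with the exponentiated coefficient. The sketch below refits the same model so it runs on its own. The helper `odds_of_class_1` and the names `base` and `bumped` are hypothetical, introduced only for this illustration.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression

# Rebuild the two-class data and refit the model used in this notebook.
iris_df = datasets.load_iris(as_frame=True).frame
iris_df = iris_df[iris_df['target'] <= 1]
X = iris_df.iloc[:, 0:4]
y = iris_df['target']
log_model = LogisticRegression().fit(X, y)


def odds_of_class_1(model, row):
    """Return p / (1 - p), where p is the predicted probability of class 1."""
    p = model.predict_proba(row)[0, 1]
    return p / (1 - p)


# Take one flower and make a copy with petal length increased by 1 cm.
base = X.iloc[[0]].copy()
bumped = base.copy()
bumped['petal length (cm)'] += 1.0

# Ratio of odds computed from predictions vs. directly from the coefficient.
ratio_from_predictions = odds_of_class_1(log_model, bumped) / odds_of_class_1(log_model, base)
ratio_from_coefficient = np.exp(log_model.coef_[0][2])  # petal length is column index 2

print(ratio_from_predictions, ratio_from_coefficient)
```

The two printed values should agree up to floating-point error, regardless of which flower is chosen as the starting row.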

Let's add a column for the percent change in the odds of being the 'Versicolor' class given a one-unit increase in each feature.

In [None]:
importance_df['Odds_inc'] = round((importance_df['Odds_ratio'] - 1) * 100, 2)

In [None]:
importance_df

Unnamed: 0,Coeff,Odds_ratio,Odds_inc
petal length (cm),2.308496,10.059281,905.93
petal width (cm),0.962328,2.617783,161.78
sepal length (cm),0.440365,1.553274,55.33
sepal width (cm),-0.906968,0.403746,-59.63

## Interpreting the coefficients and Odds Ratios

Here are some of the things we can learn from the coefficients and odds ratios:

1. `petal length (cm)` has the strongest influence on predicting 'Versicolor', and `petal width (cm)` has the second strongest
>- The odds ratio of 10.06 on `petal length (cm)` indicates that a one-unit (1 cm) increase in petal length multiplies the odds of an Iris being a 'Versicolor' by about 10, an increase of over 900%
>- The odds ratio of 2.62 on `petal width (cm)` indicates that a one-unit increase in petal width increases the odds of an Iris being a 'Versicolor' by 162%
>- Recall that when we built a decision tree on this dataset, the decision tree model indicated that `petal length (cm)` was the most informative attribute
2. The coefficients on all features except `sepal width (cm)` are positive, indicating that an increase in any of those features increases the likelihood of the Iris being a 'Versicolor'
>- A one-unit increase in `sepal width (cm)` *decreases* the odds of the Iris being a 'Versicolor' by about 60%
>- Said another way, a larger `sepal width (cm)` increases the likelihood of the Iris being a 'Setosa'
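Ranking features by coefficient size implicitly assumes that "one unit" means something comparable across features. As a hedged cross-check (not part of the original notebook), the sketch below standardizes the features before fitting, so each coefficient describes a one-standard-deviation increase instead of a 1 cm increase. The use of `StandardScaler` and `make_pipeline`, and the names `scaled_model` and `scaled_coef`, are assumptions made for this illustration.

```python
import pandas as pd
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Rebuild the two-class data used in this notebook.
iris_df = datasets.load_iris(as_frame=True).frame
iris_df = iris_df[iris_df['target'] <= 1]
X = iris_df.iloc[:, 0:4]
y = iris_df['target']

# Scale each feature to mean 0 and standard deviation 1 before fitting.
scaled_model = make_pipeline(StandardScaler(), LogisticRegression())
scaled_model.fit(X, y)

# Coefficients now describe a one-standard-deviation increase in each feature.
scaled_coef = pd.Series(
    scaled_model.named_steps['logisticregression'].coef_[0],
    index=X.columns,
    name='Coeff_per_std',
).sort_values(ascending=False)
print(scaled_coef)
```

If the ordering of features is similar to the unscaled table, the ranking is less likely to be an artifact of measurement scale.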



Recall our descriptive analytics...

In [None]:
iris_df.groupby('target').mean()

target,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm)
0,5.006,3.428,1.462,0.246
1,5.936,2.77,4.26,1.326
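The group means and the coefficient signs can be lined up side by side to check that they tell the same story. This is a small sketch that refits the same model so it runs on its own. The names `mean_diff`, `comparison`, and `same_sign` are introduced here for illustration.

```python
import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn.linear_model import LogisticRegression

# Rebuild the two-class data and refit the model used in this notebook.
iris_df = datasets.load_iris(as_frame=True).frame
iris_df = iris_df[iris_df['target'] <= 1]
X = iris_df.iloc[:, 0:4]
y = iris_df['target']
log_model = LogisticRegression().fit(X, y)

# Difference in feature means: versicolor (1) minus setosa (0).
group_means = iris_df.groupby('target').mean()
mean_diff = group_means.loc[1] - group_means.loc[0]

# Put the mean differences next to the fitted coefficients.
coef = pd.Series(log_model.coef_[0], index=X.columns)
comparison = pd.DataFrame({'mean_diff': mean_diff, 'coef': coef})

# True where the coefficient points in the same direction as the mean difference.
comparison['same_sign'] = np.sign(comparison['mean_diff']) == np.sign(comparison['coef'])
print(comparison)
```

A feature whose mean is higher for 'Versicolor' and whose coefficient is positive is consistent with the interpretation above; `sepal width (cm)` is the one feature where both values are negative.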


---
# End of Video 3
---