# Python for Machine Learning

### *Session \#4*


### Helpful shortcuts
---

**SHIFT** + **ENTER** ----> Execute Cell

**UP/DOWN ARROWS** --> Move cursor between cells (then ENTER to start typing)

**TAB** ----> See autocomplete options

**ESC** then **b** ----> Create Cell 

**ESC** then **dd** ----> Delete Cell

**\[python expression\]?** ---> Explanation of that Python expression

**ESC** then **m** then __ENTER__ ----> Switch to Markdown mode

## I. Logistic Regression

### Warm Ups

*Type the given code into the cell below*

---

In [None]:
from warnings import simplefilter
simplefilter(action='ignore', category=FutureWarning)

from matplotlib import pyplot as plt

import pandas as pd
import numpy as np
from sklearn.preprocessing import scale
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

from yellowbrick.classifier import ConfusionMatrix, ClassPredictionError, \
                                   ROCAUC, PrecisionRecallCurve
from yellowbrick.target import class_balance

from imblearn.pipeline import make_pipeline
from imblearn.over_sampling import RandomOverSampler

In [None]:
df = pd.read_csv(
    'https://raw.githubusercontent.com/patricks1/'
    'noble-machine-learning/master/Session4/heart_attack.csv'
)

In [None]:
def assign_tts(df, model, tgt_txt, features='all', state=None):
    # Split df into train/test sets and store them as attributes on model
    if features == 'all':
        # Use every column except the target as a feature
        X = df.drop(columns=[tgt_txt])
    elif isinstance(features, list):
        X = df[features]
    else:
        raise ValueError("Features should either be 'all' or a list")
    y = df[tgt_txt]
    tts = train_test_split(X, y, random_state=state)
    model.X_train = tts[0]
    model.X_test = tts[1]
    model.y_train = tts[2]
    model.y_test = tts[3]
    return None

def std_fit(model):
    # Fit the model on its stored training split
    model.fit(model.X_train, model.y_train)
    return None

def std_predict(model):
    # Predict classes for the stored test split
    return model.predict(model.X_test)

def std_score(model):
    # Score the model on the stored test split
    return model.score(model.X_test, model.y_test)

**Create and fit classifier**:

In [None]:
modelI = LogisticRegression()
assign_tts(df, modelI, 'heart_attack', features=['sys_bp'], state=1)
std_fit(modelI)

**Use model to classify**:

In [None]:
print('number of predicted heart attacks: {0:0.0f}\n' \
      .format(std_predict(modelI).sum()))
display(std_predict(modelI))

**Use model to get probabilities**: `model.predict_proba(X_test)`

In [None]:
modelI.predict_proba(modelI.X_test)

### Exercises
---
**Use the following `x` and `y` variables in your analysis for this section.**

In [None]:
x = modelI.X_test['sys_bp']
y = modelI.y_test

**1. Calculate the log odds of someone having a heart attack based on systolic blood pressure assuming the equation for log odds has a slope of 0.01 and an intercept of -1** 

**2. Now convert those log odds into probabilities of having a heart attack.**
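
As a rough illustration of the conversion (on made-up numbers, not the notebook's data), log odds map to probabilities through the logistic function $p = 1/(1 + e^{-z})$. The names `log_odds` and `probs` are hypothetical and exist only for this sketch.

```python
import numpy as np

# Hypothetical log odds, chosen only to illustrate the conversion
log_odds = np.array([-2.0, 0.0, 2.0])

# Logistic (sigmoid) function: p = 1 / (1 + exp(-log_odds))
probs = 1 / (1 + np.exp(-log_odds))

print(probs)
```

Log odds of zero correspond to a probability of exactly 0.5, and log odds of equal size but opposite sign give probabilities that sum to 1.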

**3. For people who had a heart attack, the log likelihood of the data (i.e. having a heart attack) given our model is the log probability. For people who did not have a heart attack, the log likelihood of the data (i.e. not having a heart attack) given our model is log(1 - the probability).**
\begin{equation}
\log\mathcal{L} = 
\begin{cases} 
\log p(\mathrm{heart\ attack}) & \text{if } \mathrm{heart\ attack}=1 \\
\log(1-p(\mathrm{heart\ attack})) & \text{if } \mathrm{heart\ attack}=0 
\end{cases}
\end{equation}

- **Create an array of log likelihoods based on the logic above.**  
(You can use `np.log` to find logs.)
- **Sum the individual log likelihoods to find the total log likelihood of the data given our model.**
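
The case logic above can be sketched on a small made-up example. The arrays `p` and `outcome` are hypothetical stand-ins for predicted probabilities and observed labels; `np.where` is one way, among several, to pick the right branch for each person.

```python
import numpy as np

# Hypothetical predicted probabilities and observed outcomes (1 = event, 0 = no event)
p = np.array([0.9, 0.2, 0.6, 0.1])
outcome = np.array([1, 0, 1, 0])

# Use log(p) where the outcome is 1 and log(1 - p) where it is 0
log_liks = np.where(outcome == 1, np.log(p), np.log(1 - p))

# Total log likelihood of the data given these probabilities
total_log_lik = log_liks.sum()
print(total_log_lik)
```

Each individual log likelihood is at most zero, so a total closer to zero means the probabilities fit the observed outcomes better.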

**4. Put your work from Exercises 1-3 into a function that takes a `coef` and `intercept` as arguments.** 

- Assume that anyone with a probability of a heart attack greater than 0.5 does have one. 
- Build on what you have by calculating your model's recall, which is the percentage of heart attacks that you correctly predicted. It is calculated as follows.  
$\mathrm{Recall = \dfrac{true\ positives}{true\ positives + false\ negatives}}$  

(A "positive" is a heart attack. A "negative" is no heart attack.)

- Also have your function calculate your model's accuracy, which is the percentage of predictions it got right.  
(Hint: You can turn a `pd.Series` of booleans into integers with `series.astype(int)`.)
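
For reference, here is a minimal sketch of thresholding, recall, and accuracy on made-up arrays. The names `p`, `actual`, and `predicted` are assumptions for this example only, and plain NumPy arrays are used in place of a `pd.Series`.

```python
import numpy as np

# Hypothetical probabilities and true labels (1 = positive, 0 = negative)
p = np.array([0.9, 0.4, 0.7, 0.2, 0.6])
actual = np.array([1, 1, 0, 0, 1])

# Classify as positive when the probability exceeds 0.5
predicted = (p > 0.5).astype(int)

# Count true positives and false negatives among the actual positives
true_pos = ((predicted == 1) & (actual == 1)).sum()
false_neg = ((predicted == 0) & (actual == 1)).sum()

recall = true_pos / (true_pos + false_neg)
accuracy = (predicted == actual).mean()

print(recall, accuracy)
```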

In [None]:
def evaluate(coef, intercept):
    x = modelI.X_test['sys_bp']
    y = modelI.y_test
    
    # Your code here

**5. Now add to your function so that it plots the systolic blood pressure data and the heart attack data along with your predicted probabilities.**

In [None]:
def evaluate(coef, intercept):
    x = modelI.X_test['sys_bp']
    y = modelI.y_test
    
    # Your code here

**6. Make adjustments to your slope and intercept to try and manually maximize the log likelihood.**

**7. Now give your function the best possible coefficient and intercept that** `sklearn` **found when we ran the logistic regression fit for** `modelI` **in the Warm Ups.**  
Remember that the model's `coef_` is a 2D NumPy array. Passing it to your function as is will probably cause problems, so pull out the scalar value you need first.
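
To see the shapes involved, here is a small sketch on a made-up single-feature dataset (`X_toy`, `y_toy`, and `toy_model` are hypothetical names, not part of the notebook). Indexing is one way to get scalars out; flattening the array would work as well.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Tiny made-up dataset with a single feature
X_toy = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

toy_model = LogisticRegression().fit(X_toy, y_toy)

print(toy_model.coef_.shape)       # (1, 1): one row, one column per feature
print(toy_model.intercept_.shape)  # (1,)

# Index into the arrays to get plain scalars
slope = toy_model.coef_[0, 0]
intercept = toy_model.intercept_[0]
```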

## II. ROC Curves and Class Imbalance

### Warm Ups

*Type the given code into the cell below*

---

**Create and fit classifier**: 
```python
model_roc = ROCAUC(model)
model_roc.fit(X_train, y_train)
model_roc.score(X_test, y_test)
model_roc.show()
```

In [None]:
model_rocIIw = ROCAUC(modelI)
std_fit(model_rocIIw)
std_score(model_rocIIw)
model_rocIIw.show()

# micro-avg (pooled true positive rate): (tp0 + tp1) / (tp0 + tp1 + fn0 + fn1)
# macro-avg (mean of per-class true positive rates): [tp0 / (tp0 + fn0) + tp1 / (tp1 + fn1)] / 2

**Create ClassBalance visualization:**

In [None]:
class_balance(df['heart_attack'])

**Plot the Precision Recall Curve**: 

In [None]:
model_prc = PrecisionRecallCurve(modelI)
assign_tts(df, model_prc, 'heart_attack', features=['sys_bp'], state=1)
model_prc.fit(model_prc.X_train, model_prc.y_train)
model_prc.score(model_prc.X_test, model_prc.y_test)
model_prc.show()
plt.show()

### Exercises
---

**1. Interpret the ROC curve below. What is the highest sensitivity we can reach while keeping false positives under 20% (i.e., specificity > 0.8)?**

**If we care about both classes equally, what sensitivity and specificity should we choose?**

![image.png](../images/roc.png)

**2. If you were creating a machine learning model to catch credit card fraud, would you use an ROC curve or a precision-recall curve?**

**3. Train and plot an ROC curve with a** `KNeighborsClassifier` **model and a** `LogisticRegression` **model. Which model performs better with this data, according to the AUC?** 

Hint: If the solver warns that it failed to converge, try passing `max_iter=10000` when creating your `LogisticRegression` model.

**4. Do the same thing but replace the** `ROCAUC` **visualizers with** `PrecisionRecallCurve` **visualizers and rerun to get a minority-class focused view on performance.**

**5. Let's examine more severe class imbalance.**

**Run the code below to drop most of the positive cases, then split the data into X and y again**

**Create a** `class_balance` **visualization to verify that the classes are now very imbalanced.**

In [None]:
df_drop = df.drop(df.query('heart_attack == 1').sample(n=400).index)

# Add your code down here

**6. Rewrite and run code similar to what you wrote for Exercise 3, but this time use `df_drop` as your data source. Which model's performance has suffered more? Why?**

Hint: If the solver warns that it failed to converge, try passing `max_iter=10000` when creating your `LogisticRegression` model.

**7. Do the same thing but replace the** `ROCAUC` **visualizers with** `PrecisionRecallCurve` **visualizers and rerun to get a minority-class focused view on performance.**

## III. Stratified Sampling and Oversampling

### Warm Ups

*Type the given code into the cell below*

---

**Use stratified sampling:**
```python
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y)
```
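
As a quick check of what `stratify` does, the sketch below splits a made-up imbalanced label array (`X_toy` and `y_toy` are hypothetical). The `test_size` and `random_state` values are arbitrary choices for this example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up imbalanced labels: 90 negatives and 10 positives
y_toy = np.array([0] * 90 + [1] * 10)
X_toy = np.arange(100).reshape(-1, 1)

# Stratifying on y keeps the 10% positive rate in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=0
)

print(y_tr.sum(), y_te.sum())
```

With 20 test rows and a 10% positive rate, the stratified split places 2 positives in the test set and 8 in the training set.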

**Use RandomOverSampler to balance data:**
```python
sampler = RandomOverSampler()
X_resampled, y_resampled = sampler.fit_resample(X_train, y_train)
```
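
To make the idea of random oversampling concrete, here is a hand-rolled NumPy sketch on made-up data. It is not `imblearn`'s implementation, only an illustration of the concept: minority-class rows are drawn with replacement until both classes are the same size. All names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up imbalanced data: 8 negatives and 2 positives
y_toy = np.array([0] * 8 + [1] * 2)
X_toy = np.arange(10).reshape(-1, 1)

minority_idx = np.flatnonzero(y_toy == 1)
majority_idx = np.flatnonzero(y_toy == 0)

# Draw extra minority rows with replacement to match the majority count
n_extra = len(majority_idx) - len(minority_idx)
extra_idx = rng.choice(minority_idx, size=n_extra, replace=True)

keep_idx = np.concatenate([majority_idx, minority_idx, extra_idx])
X_bal, y_bal = X_toy[keep_idx], y_toy[keep_idx]

print(y_bal.mean())
```

The balanced labels have a mean of 0.5, and every added row is a duplicate of an existing minority row, so no new information is created.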

**Use RandomOverSampler in pipeline:**
```python
model = make_pipeline(RandomOverSampler(), LogisticRegression())
```

### Exercises
---

**1. Inside the for-loop, call** `train_test_split()` **WITHOUT the stratify parameter**

**Then within the for-loop call** `y_test.sum()` **to count the number of positive cases.**

**Rerun with the stratify parameter set to** `y`

In [None]:
for i in range(10):
    
    # Add your code here

**2. Create a** `RandomOverSampler()` **and call** `.fit_resample()` **on** `X_train` **and** `y_train`

**This will return two arrays -- the rebalanced versions of** `X_train` **and** `y_train` 

**Take the mean of the new rebalanced** `y_train` **to show that it's balanced**

**3. Fit a** `LogisticRegression()` **model to the training data, and use it to plot a ConfusionMatrix**

**What are the accuracy and sensitivity of the model?**

**4. Create a pipeline with a** `RandomOverSampler` **and** `LogisticRegression()` **and fit it to the training data**

**What are the accuracy and sensitivity of the new model?**