# Python for Machine Learning

### *Session \#3*


### Helpful shortcuts
---

**SHIFT** + **ENTER** ----> Execute Cell

**UP/DOWN ARROWS** ----> Move cursor between cells (then ENTER to start typing)

**TAB** ----> See autocomplete options

**ESC** then **b** ----> Create Cell

**ESC** then **dd** ----> Delete Cell

**\[python expression\]?** ----> Explanation of that Python expression

**ESC** then **m** then **ENTER** ----> Switch to Markdown mode

## I. Logistic Regression

### Warm Ups

*Type the given code into the cell below*

---

```python
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

from yellowbrick.classifier import ConfusionMatrix
from yellowbrick.classifier import ClassPredictionError
from yellowbrick.classifier import ROCAUC
from yellowbrick.classifier import PrecisionRecallCurve
from yellowbrick.target import ClassBalance

from imblearn.pipeline import make_pipeline
from imblearn.over_sampling import RandomOverSampler

from matplotlib import pyplot as plt
%matplotlib inline

df = pd.read_excel('titanic.xlsx').dropna()
```

**Split into data sets**: 
```python
X = df[['age']]
y = df['survived']
X_train, X_test, y_train, y_test = train_test_split(X, y)
```

**Create and fit classifier**: 
```python
model = LogisticRegression()
model.fit(X_train, y_train)
```

**Use model to classify**: `model.predict(X_test)`

**Use model to get probabilities**: `model.predict_proba(X_test)`
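To see how the two calls relate, here is a minimal sketch on synthetic data (the ages and labels are made up, not taken from `titanic.xlsx`, and `demo_model` is a hypothetical name). For a binary `LogisticRegression`, `predict_proba` returns one column per class in the order given by `classes_`, and `predict` returns the class with the larger probability.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up data for illustration only: one feature (age) and a 0/1 label.
rng = np.random.default_rng(0)
ages = rng.uniform(1, 80, size=200).reshape(-1, 1)
labels = (ages.ravel() + rng.normal(0, 15, size=200) < 35).astype(int)

demo_model = LogisticRegression()
demo_model.fit(ages, labels)

proba = demo_model.predict_proba(ages)   # shape (200, 2): one column per class
preds = demo_model.predict(ages)         # shape (200,): the predicted class

# Columns follow demo_model.classes_, so column 1 is P(label == 1).
# Picking the most probable class per row reproduces predict().
picked = demo_model.classes_[np.argmax(proba, axis=1)]
```

Each row of `proba` sums to 1, so in the binary case the second column alone carries all the information.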

### Exercises
---

**1. Copy/paste the slope from** `model.coef_` **and the intercept from** `model.intercept_`

**2. Plot the underlying linear model using** `plt.plot()`.

**First feed in** `X_test` **as the x-axis and** `X_test*slope + intercept` **as the y-axis**

**3. Use** `plt.scatter` **to plot** `X_test` **and** `y_test`, **and also to plot** `curve_x` **and** `curve_y`, **which show the curve of the logistic classifier**

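```python
# Evenly spaced ages, shaped as a single column to match the model's input
curve_x = np.linspace(-100, 200, 100).reshape(-1, 1)
# Column 1 of predict_proba is P(survived == 1), assuming survived is coded 0/1
curve_y = model.predict_proba(curve_x)[:, 1]
```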
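The straight line from exercise 2 and the S-shaped curve from exercise 3 are linked by the logistic (sigmoid) function: for a binary logistic regression, the probability of class 1 is the sigmoid of `x*slope + intercept`. A minimal sketch, using made-up values for `slope` and `intercept` rather than the ones your fitted model reports (`sigmoid` and `survival_probability` are hypothetical helper names):

```python
import math

def sigmoid(z):
    """Map any real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients for illustration; substitute the values
# you copied from model.coef_ and model.intercept_.
slope = -0.01
intercept = 0.1

def survival_probability(age):
    linear_score = age * slope + intercept   # the straight line from exercise 2
    return sigmoid(linear_score)             # squashed into the range (0, 1)

p_at_boundary = survival_probability(10)     # linear score is 0 at this age
p_young = survival_probability(5)
p_old = survival_probability(70)
```

Where the line crosses zero the probability is exactly 0.5, which is the point at which `model.predict` switches from one class to the other.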

**4. What is your own probability of survival? The model expects 2-D input (one row per person, one column per feature), so you'll need to wrap your age in nested lists. So if your age is 50, you'll input** `[[50]]`

**5. Use one-hot encoding to add all the columns to your model. Call model.score() to see the accuracy of your model.**

Hint: Use `make_column_transformer` to separate categorical and numeric data
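As a rough guide to the hint, here is a sketch on a tiny made-up table. The column names (`age`, `fare`, `sex`) and the variable names (`toy_trans`, `toy_pipeline`) are assumptions for illustration and may not match `titanic.xlsx`. It uses `make_pipeline` from `sklearn.pipeline`; the `imblearn` version imported above is called the same way.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny made-up table; the column names are assumptions and may not
# match the columns in titanic.xlsx.
toy = pd.DataFrame({
    "age":      [22, 38, 26, 35, 54, 2, 27, 14],
    "fare":     [7.25, 71.3, 7.9, 53.1, 51.9, 21.1, 11.1, 30.1],
    "sex":      ["male", "female", "female", "female",
                 "male", "male", "female", "female"],
    "survived": [0, 1, 1, 1, 0, 0, 1, 1],
})
X_toy = toy[["age", "fare", "sex"]]
y_toy = toy["survived"]

# Scale the numeric columns and one-hot encode the categorical one.
toy_trans = make_column_transformer(
    (StandardScaler(), ["age", "fare"]),
    (OneHotEncoder(handle_unknown="ignore"), ["sex"]),
)

# The pipeline applies the transformer, then fits the classifier.
toy_pipeline = make_pipeline(toy_trans, LogisticRegression())
toy_pipeline.fit(X_toy, y_toy)
toy_accuracy = toy_pipeline.score(X_toy, y_toy)   # accuracy on the toy rows
```

In the exercise itself, fit on `X_train` and score on `X_test` so the accuracy reflects data the model has not seen.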

## II. Evaluating Classifiers

### Warm Ups

*Type the given code into the cell below*

---

**Test accuracy of your model:** `pipeline.score(X_test, y_test)`

**Create confusion matrix:** 
```python
viz = ConfusionMatrix(pipeline)
viz.fit(X_train, y_train)
viz.score(X_test, y_test)
```
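The four cells of a binary confusion matrix are simple counts. The sketch below tallies them by hand on made-up labels (not Titanic results), which may help when reading the plot:

```python
# Made-up labels for illustration: 1 = survived, 0 = did not survive.
actual    = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]
predicted = [1, 0, 0, 1, 0, 1, 0, 0, 0, 0]

pairs = list(zip(actual, predicted))
true_pos  = sum(1 for a, p in pairs if a == 1 and p == 1)   # correctly predicted 1
false_pos = sum(1 for a, p in pairs if a == 0 and p == 1)   # predicted 1, actually 0
false_neg = sum(1 for a, p in pairs if a == 1 and p == 0)   # predicted 0, actually 1
true_neg  = sum(1 for a, p in pairs if a == 0 and p == 0)   # correctly predicted 0

# Accuracy is the share of predictions on the diagonal of the matrix.
accuracy = (true_pos + true_neg) / len(pairs)
```

In this toy example false negatives (2) outnumber false positives (1), even though accuracy is 0.7.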

**Plot class prediction error:**
```python
viz = ClassPredictionError(pipeline)
viz.fit(X_train, y_train)
viz.score(X_test, y_test)
```

**Plot ROC curve:** 
```python
viz = ROCAUC(pipeline)
viz.fit(X_train, y_train)
viz.score(X_test, y_test)
```
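An ROC curve is traced by sweeping the decision threshold and recording the false positive rate and true positive rate at each setting. A minimal hand-rolled sketch on made-up scores (`roc_point` is a hypothetical helper, not part of Yellowbrick or scikit-learn):

```python
# Made-up labels and predicted probabilities of class 1, for illustration.
actual = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]

def roc_point(threshold):
    """Return (false positive rate, true positive rate) at one threshold."""
    predicted = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    positives = sum(actual)
    negatives = len(actual) - positives
    return fp / negatives, tp / positives

# Sweep the threshold from strict to lenient to trace the curve.
roc_points = [roc_point(t) for t in (1.0, 0.75, 0.5, 0.35, 0.0)]
```

The curve always runs from (0, 0) to (1, 1). A model no better than chance stays near the diagonal, while an ideal model climbs to a true positive rate of 1 before producing any false positives.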

### Exercises
---

**1. Plot the confusion matrix. What is more common -- false positives or false negatives?**

**2. Plot the class prediction error. Which class has a higher rate of error?**

**3. Create an ROC curve for the model. Is the model closer to an ideal model or random chance?**

## III. Class Imbalance

### Warm Ups

*Type the given code into the cell below*

---

**Plot the class balance**: 
```python
viz = ClassBalance()
viz.fit(y)
```
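The same information can be checked numerically by counting labels. A short sketch on made-up labels (not the Titanic data):

```python
from collections import Counter

# Made-up labels for illustration: 1 = survived, 0 = did not survive.
labels = [0, 0, 1, 0, 1, 0, 0, 1, 0, 0]

counts = Counter(labels)                                  # samples per class
majority_class, majority_count = counts.most_common(1)[0]
majority_share = majority_count / len(labels)             # baseline accuracy
```

Here `majority_share` is 0.7, so a model that always predicts class 0 would already score 70% accuracy. That is why accuracy alone can mislead on imbalanced data.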

**Plot precision-recall curve:**
```python
viz = PrecisionRecallCurve(pipeline)
viz.fit(X_train, y_train)
viz.score(X_test, y_test)
```
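Precision and recall are both computed from the positive class only, which is why they react to imbalance more strongly than the false positive rate used by an ROC curve. A sketch with made-up counts, assuming class 1 is the rare class:

```python
# Made-up counts for illustration, with class 1 as the rare positive class.
true_pos, false_pos, false_neg, true_neg = 5, 10, 5, 980

# Of everything predicted positive, how much was actually positive?
precision = true_pos / (true_pos + false_pos)
# Of everything actually positive, how much did the model find?
recall = true_pos / (true_pos + false_neg)
# The x-axis of an ROC curve: share of negatives wrongly flagged.
false_positive_rate = false_pos / (false_pos + true_neg)
```

With these numbers the false positive rate is about 0.01, which looks excellent on an ROC curve, yet precision is only about 0.33.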

**Create a balanced model:** 
```python
# trans is assumed to be the column transformer you built in Section I, exercise 5
pipeline = make_pipeline(RandomOverSampler(), trans, LogisticRegression())
```
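Conceptually, random oversampling duplicates randomly chosen minority-class rows until the classes are the same size. The sketch below illustrates that idea in plain Python on made-up data; `oversample` is a hypothetical helper and is not how `imblearn` implements `RandomOverSampler`.

```python
import random
from collections import Counter

def oversample(rows, labels, seed=0):
    """Duplicate randomly chosen minority-class rows until every class
    has as many rows as the largest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    new_rows, new_labels = list(rows), list(labels)
    for cls, count in counts.items():
        candidates = [r for r, l in zip(rows, labels) if l == cls]
        for _ in range(target - count):
            new_rows.append(rng.choice(candidates))   # sample with replacement
            new_labels.append(cls)
    return new_rows, new_labels

# Made-up data: seven rows of class 0 and three rows of class 1.
rows = [[22], [38], [26], [35], [54], [2], [27], [14], [4], [58]]
labels = [0, 0, 1, 0, 1, 0, 0, 1, 0, 0]

balanced_rows, balanced_labels = oversample(rows, labels)
balanced_counts = Counter(balanced_labels)
```

Resampling should only touch the training data. Putting the sampler inside an `imblearn` pipeline takes care of this, since samplers are applied during `fit` and skipped during `predict` and `score`.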

### Exercises
---

**1. Make a ClassBalance visualization. Which class is overrepresented in this dataset?**

**2. Plot a precision-recall curve of this imbalanced dataset. How does this graph compare to the ROC curve?**

**3. Create a new model that uses RandomOverSampler() as part of the pipeline**