# Python for Machine Learning

### *Session \#3*


### Helpful shortcuts
---

**SHIFT** + **ENTER** ----> Execute cell

**UP/DOWN ARROWS** ----> Move cursor between cells (then **ENTER** to start typing)

**TAB** ----> See autocomplete options

**ESC** then **b** ----> Create cell

**ESC** then **dd** ----> Delete cell

**\[python expression\]?** ----> Explanation of that Python expression

**ESC** then **m** then **ENTER** ----> Switch to Markdown mode

## I. Logistic Regression

### Warm Ups

*Type the given code into the cell below*

---

```python
import pandas as pd
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import train_test_split
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

from yellowbrick.classifier import ConfusionMatrix
from yellowbrick.classifier import ClassPredictionError
from yellowbrick.classifier import ROCAUC
from yellowbrick.classifier import PrecisionRecallCurve

from imblearn.pipeline import make_pipeline
from imblearn.over_sampling import RandomOverSampler

df = pd.read_excel('titanic.xlsx').dropna()
```

**Split into data sets**: 
```python
X = df[['pclass', 'age', 'fare', 'adult_male']]
y = df['survived']
X_train, X_test, y_train, y_test = train_test_split(X, y)
```

**Create and fit classifier**: 
```python
model = LogisticRegressionCV(cv=10)
model.fit(X_train, y_train)
```

**Use model to classify**: `model.predict(X_test)`

**Use model to get probabilities**: `model.predict_proba(X_test)`

**Test accuracy of your model:** `model.score(X_test, y_test)`
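The calls above can be tried end to end without the spreadsheet. The sketch below is only an illustration: the data is randomly generated (not the real Titanic data), the survival rule is made up, and every `toy_` name is hypothetical. It uses `cv=5` and `max_iter=1000` purely to keep the run short and quiet.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the Titanic columns used above (NOT real data).
rng = np.random.default_rng(0)
n = 400
toy_X = pd.DataFrame({
    "pclass": rng.integers(1, 4, size=n),
    "age": rng.uniform(1, 70, size=n),
    "fare": rng.uniform(5, 100, size=n),
    "adult_male": rng.integers(0, 2, size=n),
})
# Made-up rule: adult males and higher class numbers survive less often.
toy_logit = 1.5 - 2.0 * toy_X["adult_male"] - 0.5 * (toy_X["pclass"] - 1)
toy_y = (rng.uniform(size=n) < 1 / (1 + np.exp(-toy_logit))).astype(int)

toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, random_state=0
)

toy_model = LogisticRegressionCV(cv=5, max_iter=1000)
toy_model.fit(toy_X_train, toy_y_train)

toy_labels = toy_model.predict(toy_X_test)            # hard 0/1 predictions
toy_probs = toy_model.predict_proba(toy_X_test)       # one column per class
toy_accuracy = toy_model.score(toy_X_test, toy_y_test)  # fraction correct

print(toy_labels[:5])
print(toy_probs[:5].round(3))
print(round(toy_accuracy, 3))
```

Each row of `predict_proba` sums to 1, and the columns follow the order of `toy_model.classes_`, so column 1 is the probability of class `1`.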

### Exercises
---

**1. Train a logistic classifier on all the numeric columns. What is the accuracy of the model?**

**2. What are the most important features for determining survival?**

Hint: Use `model.coef_` to see the coefficients of the model
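Coefficient sizes are only comparable when the features share a scale, so one common approach is to standardize before fitting and then rank by absolute value. The sketch below shows that idea on made-up data; the column names `strong`, `weak`, and `noise` are hypothetical, and plain `LogisticRegression` is used here only for speed (it exposes `coef_` in the same way).

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Made-up data: one feature drives the label, one barely matters, one is noise.
rng = np.random.default_rng(1)
n = 500
toy_X = pd.DataFrame({
    "strong": rng.normal(size=n),
    "weak": rng.normal(scale=50, size=n),   # large scale, small real effect
    "noise": rng.normal(size=n),
})
toy_logit = 3.0 * toy_X["strong"] + 0.005 * toy_X["weak"]
toy_y = (rng.uniform(size=n) < 1 / (1 + np.exp(-toy_logit))).astype(int)

# Put every feature on the same scale so coefficients can be compared.
toy_scaled = StandardScaler().fit_transform(toy_X)
toy_clf = LogisticRegression().fit(toy_scaled, toy_y)

# For a binary problem coef_ has shape (1, n_features); take row 0.
toy_coefs = pd.Series(toy_clf.coef_[0], index=toy_X.columns)
toy_ranked = toy_coefs.reindex(toy_coefs.abs().sort_values(ascending=False).index)
print(toy_ranked)
```

The sign of each coefficient gives the direction of the effect on the log-odds of class `1`; the magnitude is what the ranking uses.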

**3. Let's say you bought a second-class ticket for 30 dollars. What is your probability of survival?**

**4. Use one-hot encoding to add the other columns to your model. What is your accuracy now?**

Hint: Use `make_column_transformer` to separate categorical and numeric data
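As a rough illustration of the hint, the sketch below scales two numeric columns and one-hot encodes one categorical column before the classifier sees them. The eight rows are invented, and `embark_town` is a hypothetical column name; substitute whatever categorical columns your spreadsheet actually has. It imports scikit-learn's own `make_pipeline` under the alias `sk_make_pipeline` so that it does not shadow the imblearn version imported earlier.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline as sk_make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Invented rows for illustration only (NOT the real dataset).
toy = pd.DataFrame({
    "age": [22, 38, 26, 35, 54, 2, 27, 14],
    "fare": [7.25, 71.28, 7.92, 53.10, 51.86, 21.08, 11.13, 30.07],
    "embark_town": ["Southampton", "Cherbourg", "Southampton", "Southampton",
                    "Southampton", "Queenstown", "Cherbourg", "Cherbourg"],
})
toy_y = [0, 1, 1, 1, 0, 0, 1, 1]

# Each tuple is (transformer, list of columns it applies to).
toy_transformer = make_column_transformer(
    (StandardScaler(), ["age", "fare"]),
    (OneHotEncoder(handle_unknown="ignore"), ["embark_town"]),
)

# 2 scaled numeric columns + 3 one-hot columns (one per town).
toy_encoded = toy_transformer.fit_transform(toy)
print(toy_encoded.shape)

# Chaining the transformer with a classifier keeps preprocessing and model together.
toy_pipeline = sk_make_pipeline(toy_transformer, LogisticRegression())
toy_pipeline.fit(toy, toy_y)
print(toy_pipeline.predict(toy.head(3)))
```

Columns not listed in any tuple are dropped by default, so every column the model should see has to appear in one of the lists.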

## II. Evaluating Classifiers

### Warm Ups

*Type the given code into the cell below*

---
**Create confusion matrix:** 
```python
viz = ConfusionMatrix(model)  # wrap the classifier; `model` itself stays usable
viz.fit(X_train, y_train)
viz.score(X_test, y_test)
```

**Plot class prediction error:**
```python
viz = ClassPredictionError(model)
viz.fit(X_train, y_train)
viz.score(X_test, y_test)
```

**Plot ROC curve:** 
```python
viz = ROCAUC(model)
viz.fit(X_train, y_train)
viz.score(X_test, y_test)
```
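The numbers behind these plots can also be read directly with `sklearn.metrics`, which may help when checking what a plot shows. The sketch below uses eight made-up labels and scores (not model output from this session) and a 0.5 cut-off chosen only for illustration.

```python
from sklearn.metrics import confusion_matrix, roc_auc_score

# Made-up true labels and predicted probabilities of class 1.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8, 0.3, 0.6, 0.7, 0.9]
y_pred = [int(s >= 0.5) for s in y_score]   # threshold at 0.5

# For binary labels, ravel() yields: true neg, false pos, false neg, true pos.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)                  # 3 1 1 3

# Area under the ROC curve: 0.5 is random chance, 1.0 is an ideal model.
auc = roc_auc_score(y_true, y_score)
print(auc)                             # 0.6875
```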

### Exercises
---

**1. Plot the confusion matrix. What is more common -- false positives or false negatives?**

**2. Plot the class prediction error. Which class has a higher rate of error?**

**3. Create an ROC curve for the model. Is the model closer to an ideal model or random chance?**

## III. Class Imbalance

### Warm Ups

*Type the given code into the cell below*

---

**Plot the class balance**: 
```python
from yellowbrick.target import ClassBalance  # ClassBalance lives in yellowbrick.target

viz = ClassBalance()
viz.fit(y)
```
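The same information is available as plain numbers from pandas. A minimal sketch, using a made-up ten-value target rather than the real `survived` column:

```python
import pandas as pd

# Made-up target: seven 0s and three 1s.
toy_y = pd.Series([0, 0, 0, 0, 0, 0, 1, 1, 1, 0], name="survived")

toy_counts = toy_y.value_counts()                 # rows per class
toy_shares = toy_y.value_counts(normalize=True)   # fraction per class
print(toy_counts)
print(toy_shares)
```

On the real data, the equivalent call would be `y.value_counts(normalize=True)`.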

**Plot precision-recall curve:**
```python
model = PrecisionRecallCurve(LogisticRegressionCV(cv=10))
model.fit(X_train, y_train)
model.score(X_test, y_test)
```
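To see how the two summaries can disagree on imbalanced data, the sketch below computes both on the same ten made-up scores (three positives, seven negatives). The labels and scores are invented for illustration and are not output from the Titanic model.

```python
from sklearn.metrics import (
    average_precision_score,
    precision_recall_curve,
    roc_auc_score,
)

# Made-up imbalanced example: 7 negatives, 3 positives.
y_true = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
y_score = [0.05, 0.1, 0.2, 0.3, 0.35, 0.45, 0.8, 0.4, 0.7, 0.9]

# One (precision, recall) pair per threshold, plus a final (1, 0) end point.
precision, recall, thresholds = precision_recall_curve(y_true, y_score)

ap = average_precision_score(y_true, y_score)   # summary of the PR curve
auc = roc_auc_score(y_true, y_score)            # summary of the ROC curve
print(round(ap, 3), round(auc, 3))
```

Precision is computed only over predicted positives, so a handful of high-scoring negatives pulls it down quickly when positives are rare; the ROC summary is less sensitive to that.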

**Create a balanced model:** 
```python
model = make_pipeline(RandomOverSampler(), LogisticRegressionCV(cv=10))
```
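Random oversampling draws extra rows, with replacement, from the smaller classes until every class is as large as the biggest one. The sketch below shows that idea with NumPy only; `random_oversample` is a hypothetical helper written for this illustration and is not imblearn's implementation. Inside an imblearn pipeline, the resampling step is applied when fitting, not when predicting.

```python
import numpy as np

def random_oversample(X, y, seed=0):
    """Hypothetical helper: duplicate random rows of the smaller classes
    until every class has as many rows as the largest class."""
    rng = np.random.default_rng(seed)
    X, y = np.asarray(X), np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    keep = []
    for cls, count in zip(classes, counts):
        idx = np.flatnonzero(y == cls)
        # Sample the shortfall with replacement (zero extra rows for the majority).
        extra = rng.choice(idx, size=target - count, replace=True)
        keep.append(np.concatenate([idx, extra]))
    keep = np.concatenate(keep)
    return X[keep], y[keep]

# Made-up data: 8 rows of class 0 and 2 rows of class 1.
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([0] * 8 + [1] * 2)

X_bal, y_bal = random_oversample(X_toy, y_toy)
print(np.bincount(y_toy), np.bincount(y_bal))   # [8 2] [8 8]
```

No new information is created: the balanced set contains only copies of existing rows, which is why the oversampler should only ever see training data.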

### Exercises
---

**1. Make a ClassBalance visualization. Which class is overrepresented in this dataset?**

**2. Plot a precision-recall curve of this imbalanced dataset. How does this graph compare to the ROC curve?**

**3. Create a new model that uses RandomOverSampler() as part of the pipeline**