# Python for Machine Learning

### *Session \#3*


### Helpful shortcuts
---

**SHIFT** + **ENTER** ----> Execute Cell

**UP/DOWN ARROWS** --> Move cursor between cells (then ENTER to start typing)

**TAB** ----> See autocomplete options

**ESC** then **b** ----> Create Cell 

**ESC** then **dd** ----> Delete Cell

**\[python expression\]?** ---> Explanation of that Python expression

**ESC** then **m** then __ENTER__ ----> Switch to Markdown mode

## I. Review and Class Imbalance

```python
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from yellowbrick.classifier import ConfusionMatrix
from imblearn.over_sampling import RandomOverSampler
from imblearn.pipeline import make_pipeline

import pandas as pd

df = pd.read_csv('heart_attack.csv')
```

**Prepare datasets**
```python
X = df.drop(columns='heart_attack')
y = df['heart_attack']
X_train, X_test, y_train, y_test = train_test_split(X, y)
```

**Use RandomOverSampler to balance data:**
```python
sampler = RandomOverSampler()
X_resampled, y_resampled = sampler.fit_resample(X_train, y_train)
```
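To make the idea behind random oversampling concrete, here is a minimal NumPy-only sketch. It is not how `RandomOverSampler` is implemented. The helper `random_oversample` and the toy arrays are hypothetical names invented for this illustration. It duplicates randomly chosen rows of the smaller class until both classes are the same size.

```python
import numpy as np

def random_oversample(X, y, seed=0):
    """Hypothetical helper: duplicate random rows of the smaller classes
    until every class has as many rows as the largest one."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    keep = [np.arange(len(y))]  # always keep every original row
    for cls, count in zip(classes, counts):
        if count < target:
            members = np.flatnonzero(y == cls)
            # sample with replacement from this class to fill the gap
            keep.append(rng.choice(members, size=target - count, replace=True))
    idx = np.concatenate(keep)
    return X[idx], y[idx]

# Toy data (made up): 90 negatives and 10 positives
X_toy = np.arange(200, dtype=float).reshape(100, 2)
y_toy = np.array([0] * 90 + [1] * 10)

X_bal, y_bal = random_oversample(X_toy, y_toy)
print(y_toy.mean(), y_bal.mean())  # 0.1 0.5
```

The mean of a 0/1 target is the share of positives, so a mean of 0.5 after resampling indicates the two classes are balanced.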

**Add RandomOverSampler to KNN using pipeline:**
```python
model = make_pipeline(RandomOverSampler(), KNeighborsClassifier(n_neighbors=30))
```

### Exercises
---

**1. Create a** `RandomOverSampler` **and use** `.fit_resample()` **on** `X_train` **and** `y_train`

**This will return two arrays -- the rebalanced versions of** `X_train` **and** `y_train`. **Take the mean of the second one.**

**2. Fit a** `KNeighborsClassifier` **to the training data, and use it to plot a ConfusionMatrix**

**What is the sensitivity of the model (true positive rate)?**

**3. Create a pipeline with a** `RandomOverSampler` **and** `KNeighborsClassifier` **and fit it to the training data**

**What is the sensitivity of the model (true positive rate)?**

**4. Add a** `StandardScaler` **to your pipeline, to equal out the importance of your features**
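Exercises 2 and 3 ask for the sensitivity, which can be read off the confusion matrix as true positives divided by all actual positives. The sketch below shows the arithmetic with scikit-learn's `confusion_matrix` and `recall_score`. The labels are made up for illustration and are not taken from the heart attack data.

```python
from sklearn.metrics import confusion_matrix, recall_score

# Made-up labels for illustration: 1 = heart attack, 0 = no heart attack
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

sensitivity = tp / (tp + fn)             # 3 / (3 + 1)
print(sensitivity)                       # 0.75
print(recall_score(y_true, y_pred))      # 0.75, recall is another name for sensitivity
```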

## II. Regression 

### Warm Ups

*Type the given code into the cell below*

---

```python
from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

df = pd.read_csv('diamonds.csv')
```

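**Train model and show Residuals Plot:**
```python
from yellowbrick.regressor import ResidualsPlot

model = ResidualsPlot(LinearRegression())
model.fit(X_train, y_train)   # fits the regressor and draws the training residuals
model.score(X_test, y_test)   # draws the test residuals and returns the score
```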
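A residual is the gap between an observed target value and the model's prediction for it. As a rough sketch of the numbers behind a residuals plot, the block below computes them directly with scikit-learn on synthetic data. The arrays `X_demo` and `y_demo` are invented for this example, and the sign convention (observed minus predicted) is an assumption, since libraries differ on it.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (invented): a noisy straight line
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(200, 1))
y_demo = 3.0 * X_demo[:, 0] + 2.0 + rng.normal(0, 1.0, size=200)

# Train on the first 150 rows, hold out the last 50
reg = LinearRegression().fit(X_demo[:150], y_demo[:150])

# Residuals on the held-out rows: observed minus predicted
residuals = y_demo[150:] - reg.predict(X_demo[150:])

# For regressors, .score() is the R^2 coefficient of determination
r2 = reg.score(X_demo[150:], y_demo[150:])
print(residuals[:5], r2)
```

Residuals scattered evenly around zero suggest the model fits all target ranges about equally well. A visible pattern suggests it struggles in part of the range.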

**Find average error:** 
```python
y_predicted = model.predict(X_test)
mean_absolute_error(y_test, y_predicted)  # scikit-learn's argument order is (y_true, y_pred)
```
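The mean absolute error is the average size of the prediction errors, in the same units as the target. A small worked example with made-up prices shows that the by-hand calculation matches `mean_absolute_error`:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Made-up values for illustration
y_actual = np.array([100.0, 250.0, 400.0])
y_guess = np.array([110.0, 240.0, 430.0])

# Errors are 10, 10 and 30, so the average is 50 / 3
by_hand = np.mean(np.abs(y_actual - y_guess))
from_sklearn = mean_absolute_error(y_actual, y_guess)
print(by_hand, from_sklearn)
```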

### Exercises
---

**1. Separate the dataframe into** `X_train`, `y_train`, `X_test`, `y_test`

**2. Use a ColumnTransformer to create a model with both** `OneHotEncoder()` **and** `StandardScaler()` **preprocessing for the appropriate columns.**

**3. Add a ResidualsPlot visualizer to the model, and fit to** `X_train`. **What range of target values does the model have the most difficulty with?**

**4. Rerun the above code, but add a LearningCurve visualizer to the model instead. Does the model need more data or not?**
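As a starting point for exercise 2, here is a minimal sketch of the `ColumnTransformer` pattern on a toy dataframe. The column names `size`, `grade` and `value` and the lists `numeric` and `categorical` are hypothetical and would need to be replaced with the real diamond columns. It also uses `make_pipeline` from `sklearn.pipeline`, which is enough when no resampling step is involved.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data with one numeric and one categorical feature (made up)
toy = pd.DataFrame({
    'size': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    'grade': ['a', 'b', 'a', 'c', 'b', 'c'],
    'value': [10.0, 22.0, 30.0, 45.0, 52.0, 65.0],
})
numeric = ['size']
categorical = ['grade']

# Each entry is (name, transformer, columns it applies to)
transformer = ColumnTransformer([
    ('scale', StandardScaler(), numeric),
    ('onehot', OneHotEncoder(handle_unknown='ignore'), categorical),
])

toy_model = make_pipeline(transformer, LinearRegression())
toy_model.fit(toy[numeric + categorical], toy['value'])

# One scaled column plus three one-hot columns
encoded = transformer.fit_transform(toy[numeric + categorical])
print(encoded.shape)  # (6, 4)
```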

## III. Bias, Variance and Polynomial Regression

### Warm Ups

*Type the given code into the cell below*

---

**Add polynomial features to model:**
```python
from sklearn.preprocessing import PolynomialFeatures

columns = ['carat']
X = df[columns]
y = df['price']

model = make_pipeline(PolynomialFeatures(3), LinearRegression())
model.fit(X, y)
```
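To see what `PolynomialFeatures(3)` hands to the linear regression, it can help to transform a couple of values by hand. In this small sketch the array `carat_demo` is made up. Each input value `x` becomes the row `[1, x, x**2, x**3]`.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Two made-up single-feature samples
carat_demo = np.array([[2.0], [3.0]])

expanded = PolynomialFeatures(3).fit_transform(carat_demo)
print(expanded)  # rows: [1, 2, 4, 8] and [1, 3, 9, 27]
```

The regression itself stays linear. It simply gets one coefficient per generated column, which is what lets it fit a curve.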

**Predict and plot using model** 
```python
from matplotlib import pyplot as plt
%matplotlib inline

plt.scatter(X, y)
plt.scatter(X, model.predict(X))
```

**Use a validation curve to test different polynomial degrees**
```python
from yellowbrick.model_selection import ValidationCurve

model = make_pipeline(PolynomialFeatures(), LinearRegression())
model = ValidationCurve(model, 
                        param_name='polynomialfeatures__degree', 
                        param_range=range(1, 5))

model.fit(X, y)
```
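The numbers behind a validation curve can also be computed without plotting, using scikit-learn's `validation_curve` function. The sketch below runs on synthetic cubic data. `X_curve`, `y_curve` and the choice of 5 folds are assumptions made for this illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import validation_curve
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data (invented): a noisy cubic relationship
rng = np.random.default_rng(0)
X_curve = rng.uniform(-2, 2, size=(120, 1))
y_curve = X_curve[:, 0] ** 3 - X_curve[:, 0] + rng.normal(0, 0.3, size=120)

degrees = list(range(1, 5))
train_scores, valid_scores = validation_curve(
    make_pipeline(PolynomialFeatures(), LinearRegression()),
    X_curve, y_curve,
    param_name='polynomialfeatures__degree',  # step name + '__' + parameter
    param_range=degrees,
    cv=5,
)

# One row per degree, one column per cross-validation fold
mean_valid = valid_scores.mean(axis=1)
for degree, score in zip(degrees, mean_valid):
    print(degree, round(score, 3))
```

A training score that keeps rising while the validation score stalls or drops is the usual sign that the degree is too high.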

### Exercises
---

**1. Train a polynomial regression model using all the numeric columns** 

**2. Use a validation curve to test polynomial degrees up to 7**

**3. At what degree does the model start to overfit?**

## IV. Regularization

### Warm Ups

*Type the given code into the cell below*

---
**Train a ridge regression model:**
```python
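from sklearn.linear_model import Ridge
model = Ridge(alpha=1)
pipeline = make_pipeline(transformer, model)  # transformer: e.g. your ColumnTransformer from Section II
pipeline.fit(X_train, y_train)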
```

**Train a lasso regression model:**
```python
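from sklearn.linear_model import Lasso
model = Lasso(alpha=1)
pipeline = make_pipeline(transformer, model)  # transformer: e.g. your ColumnTransformer from Section II
pipeline.fit(X_train, y_train)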
```

**Get coefficients from the regressor in the Pipeline (after fitting):** `model.coef_`
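The difference between the two penalties shows up in the coefficients: Ridge shrinks them toward zero, while Lasso can set some exactly to zero. This is a minimal sketch on synthetic data where only the first two of five features matter. The arrays, the `StandardScaler` step and the alpha values are assumptions chosen for the illustration, not recommendations for the diamond data.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (invented): the target depends on features 0 and 1 only
rng = np.random.default_rng(0)
X_reg = rng.normal(size=(200, 5))
y_reg = 3.0 * X_reg[:, 0] - 2.0 * X_reg[:, 1] + rng.normal(0, 0.1, size=200)

ridge_pipe = make_pipeline(StandardScaler(), Ridge(alpha=1)).fit(X_reg, y_reg)
lasso_pipe = make_pipeline(StandardScaler(), Lasso(alpha=0.5)).fit(X_reg, y_reg)

# make_pipeline names each step after its lowercased class name
ridge_coefs = ridge_pipe.named_steps['ridge'].coef_
lasso_coefs = lasso_pipe.named_steps['lasso'].coef_

# Coefficients that Lasso set exactly to zero mark eliminated features
n_eliminated = int(np.sum(lasso_coefs == 0))
print(ridge_coefs)
print(lasso_coefs, n_eliminated)
```

Using `named_steps` is an alternative to keeping a separate `model` variable, and it works even when only the fitted pipeline is at hand.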

### Exercises
---

**1. Train a Ridge Regression model on the entire diamond dataset, as in the Section II exercises. What range works best for the alpha parameter?**

**2. Access the coefficients from the trained Ridge model. Which features were most important and least important?** 

*Hint: You can get the column names from* `df[numeric].columns` *and* `transformer.get_feature_names_out()` *(called* `get_feature_names()` *in scikit-learn versions before 1.0)*

**3. Train a Lasso Regression model on the entire diamond dataset, as in the Section II exercises. What range works best for the alpha parameter?**

**4. Access the coefficients from the trained Lasso model. How many features were eliminated?** 