# Python for Machine Learning

### *Session \#3*


### Helpful shortcuts
---

**SHIFT** + **ENTER** ----> Execute cell

**UP/DOWN ARROWS** ----> Move cursor between cells (then ENTER to start typing)

**TAB** ----> See autocomplete options

**ESC** then **b** ----> Create cell

**ESC** then **dd** ----> Delete cell

**\[python expression\]?** ----> Explanation of that Python expression

**ESC** then **m** then **ENTER** ----> Switch to Markdown mode

## I. KNN Regression and Evaluation

## II. Linear Regression 

### Warm Ups

*Type the given code into the cell below*

---

**Choose model and hyperparameters:**
```python
from sklearn.linear_model import LinearRegression
model = LinearRegression()
```

**Fit model to data:** `model.fit(X_train, y_train)`

**Use model to predict:** `y_predicted = model.predict(X_test)`

**Find average error:** 
```python
from sklearn.metrics import mean_absolute_error
mean_absolute_error(y_test, y_predicted)  # arguments are (y_true, y_pred)
```

**Find R2 score:** `model.score(X_test, y_test)`
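To make the two scores above concrete, here is a minimal sketch that computes them by hand. The numbers are made up for illustration and the helper names `mae` and `r2` are hypothetical; `mean_absolute_error` and the regressor's `score` method report the same quantities.

```python
# Hand-computed versions of the two scores, using made-up numbers.
y_test = [3.0, 5.0, 7.0, 9.0]        # true values (illustrative)
y_predicted = [2.5, 5.0, 8.0, 9.5]   # model output (illustrative)

def mae(y_true, y_pred):
    """Mean absolute error: the average size of the prediction errors."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def r2(y_true, y_pred):
    """R2: 1 - (squared error of the model / squared error of guessing the mean)."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

print(mae(y_test, y_predicted))           # 0.5
print(round(r2(y_test, y_predicted), 3))  # 0.925
```

The mean absolute error is in the same units as the target, while an R2 of 1 means perfect predictions and an R2 of 0 means the model does no better than always predicting the mean.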

### Exercises
---

**1. Create a LinearRegression model**

**2. Fit the model using** `X_train` **and** `y_train`

**3. Make a** `y_predicted` **variable using** `model.predict()` **on** `X_test`

**4. Use** `y_predicted` **and** `y_test` **to find the mean absolute error of your model**

**5. Use** `X_test` **and** `y_test` **to find the R2 score of your model**

### Warm Ups

*Type the given code into the cell below*

---

```python
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Ridge, Lasso

from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

from sklearn.pipeline import make_pipeline

from yellowbrick.features import rank1d, rank2d
from yellowbrick.regressor import PredictionError, ResidualsPlot
from yellowbrick.model_selection import LearningCurve, ValidationCurve

df = pd.read_csv("diamonds.csv")
```

**Prepare datasets**
```python
X = df.drop(columns=['price'])  # positional axis argument no longer works in pandas 2.x
y = df['price']
X_train, X_test, y_train, y_test = train_test_split(X, y)
```

**Split pipeline between numeric/categorical**: 
```python
numeric = ['x', 'y', 'z', 'carat', 'table', 'depth']
categorical = ['cut', 'color', 'clarity']

transformer = make_column_transformer(
        (StandardScaler(), numeric),
        (OneHotEncoder(), categorical)
    )
```
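As a rough illustration of what the column transformer produces, the sketch below runs the same two preprocessors on a tiny made-up frame. The frame `toy` and its values are assumptions for illustration only, not rows from `diamonds.csv`.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny made-up frame standing in for the diamonds data (values are illustrative).
toy = pd.DataFrame({
    "carat": [0.3, 0.7, 1.1, 1.6],
    "depth": [61.0, 62.5, 60.2, 63.1],
    "cut": ["Ideal", "Good", "Premium", "Ideal"],
})

toy_transformer = make_column_transformer(
    (StandardScaler(), ["carat", "depth"]),  # rescale to mean 0, std 1
    (OneHotEncoder(), ["cut"]),              # one 0/1 column per category
)

toy_out = toy_transformer.fit_transform(toy)
print(toy_out.shape)  # (4, 5): 2 scaled columns + 3 one-hot columns
```

The three distinct values of `cut` become three indicator columns, which is why the output is wider than the input.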

**Add yellowbrick visualizer to the pipeline:**
```python
model = ResidualsPlot(LinearRegression())
pipeline = make_pipeline(transformer, model)
```

**Fit and score model:** 
```python
pipeline.fit(X_train, y_train)
pipeline.score(X_test, y_test)
```

### Exercises
---

**1. Separate the dataframe into** `X_train`, `y_train`, `X_test`, `y_test`

**2. Use a ColumnTransformer to create a model with both** `OneHotEncoder()` **and** `StandardScaler()` **preprocessing for the appropriate columns.**

**3. Add a ResidualsPlot visualizer to the model, and fit to** `X_train`. **What target values does the model have the most difficulty with?**

**4. Rerun the above code, but add a LearningCurve visualizer to the model instead. Does the model need more data or not?**

## II. Bias, Variance and Polynomial Regression

### Warm Ups

*Type the given code into the cell below*

---

**Add polynomial features to model:**
```python
from sklearn.preprocessing import PolynomialFeatures

model = make_pipeline(PolynomialFeatures(3), LinearRegression())
model.fit(X[['carat']], y)

```
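To see what the polynomial step feeds into the linear model, this small sketch expands two made-up values. The array `toy_X` is an assumption for illustration.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Two made-up single-feature samples, shaped (n_samples, 1).
toy_X = np.array([[2.0], [3.0]])

# Degree 3 expands each value x into the columns [1, x, x**2, x**3].
expanded = PolynomialFeatures(3).fit_transform(toy_X)
print(expanded)
```

The linear regression then fits one coefficient per expanded column, which is how a straight-line model ends up drawing a curve.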

**Predict and plot using model** 
```python
from matplotlib import pyplot as plt
%matplotlib inline

plt.scatter(X[['carat']], y)
plt.scatter(X[['carat']], model.predict(X[['carat']]))
```

**Use a validation curve to test different polynomial degrees**
```python
model = make_pipeline(PolynomialFeatures(), LinearRegression())
model = ValidationCurve(model, 
                        param_name='polynomialfeatures__degree', 
                        param_range=range(1, 5))

model.fit(X[['carat']], y)
```
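If you want the numbers behind such a plot, scikit-learn's own `validation_curve` function returns training and cross-validation scores for each parameter value. The sketch below uses synthetic data (an assumption for illustration, not the diamonds dataset) with a known quadratic trend.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import validation_curve
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data (illustrative): a quadratic trend plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=200)
toy_X = x.reshape(-1, 1)
toy_y = x ** 2 + rng.normal(0, 0.5, size=200)

degrees = [1, 2, 3, 4]
train_scores, val_scores = validation_curve(
    make_pipeline(PolynomialFeatures(), LinearRegression()),
    toy_X, toy_y,
    param_name="polynomialfeatures__degree",
    param_range=degrees,
    cv=5,
)

# Each row is one degree, each column one cross-validation fold.
for degree, score in zip(degrees, val_scores.mean(axis=1)):
    print(degree, round(score, 3))
```

A degree whose training score keeps rising while its validation score falls is a sign of overfitting.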

### Exercises
---

**1. Train a polynomial regression model using all the numeric columns** 

**2. Use a validation curve to test polynomial degrees up to 7**

**3. At what degree does the model start to overfit?**

## III. Regularization

### Warm Ups

*Type the given code into the cell below*

---
**Train a ridge regression model:**
```python
model = Ridge(alpha=1)
pipeline = make_pipeline(transformer, model)
```

**Train a lasso regression model:**
```python
model = Lasso(alpha=1)
pipeline = make_pipeline(transformer, model)
```

**Get coefficients from the regressor in the pipeline (after fitting):** `model.coef_`
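The difference between the two penalties shows up in the coefficients. The following sketch uses synthetic data (an assumption for illustration) in which only the first of three features matters.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data (illustrative): only the first of three features
# actually influences the target.
rng = np.random.default_rng(0)
toy_X = rng.normal(size=(200, 3))
toy_y = 5 * toy_X[:, 0] + rng.normal(0, 0.1, size=200)

ridge = Ridge(alpha=1).fit(toy_X, toy_y)
lasso = Lasso(alpha=1).fit(toy_X, toy_y)

print("ridge:", ridge.coef_)  # small but nonzero weights on the noise features
print("lasso:", lasso.coef_)  # noise features are set exactly to zero
print("features kept by lasso:", np.count_nonzero(lasso.coef_))
```

Ridge shrinks every coefficient toward zero without eliminating any, while Lasso can drop features entirely, which is what exercise 4 asks you to count.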

### Exercises
---

**1. Train a Ridge Regression model on the entire diamond dataset, as in exercise set 1. What range works best for the alpha parameter?** 

**2. Access the coefficients from the trained Ridge model. Which features were most important and least important?** 

*Hint: You can get the column names from* `df[numeric].columns` *and* `transformer.get_feature_names_out()` *(called* `get_feature_names()` *in older scikit-learn versions)*

**3. Train a Lasso Regression model on the entire diamond dataset, as in exercise set 1. What range works best for the alpha parameter?** 

**4. Access the coefficients from the trained Lasso model. How many features were eliminated?** 

## I. Logistic Regression

### Warm Ups

*Type the given code into the cell below*

---

```python
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

from yellowbrick.classifier import ConfusionMatrix
from yellowbrick.classifier import ClassPredictionError
from yellowbrick.classifier import ROCAUC
from yellowbrick.classifier import PrecisionRecallCurve
from yellowbrick.target import ClassBalance

from imblearn.pipeline import make_pipeline
from imblearn.over_sampling import RandomOverSampler

from matplotlib import pyplot as plt
%matplotlib inline

df = pd.read_excel('titanic.xlsx').dropna()
```

**Split into data sets**: 
```python
X = df[['age']]
y = df['survived']
X_train, X_test, y_train, y_test = train_test_split(X, y)
```

**Create and fit classifier**: 
```python
model = LogisticRegression()
model.fit(X_train, y_train)
```

**Use model to classify**: `model.predict(X_test)`

**Use model to get probabilities**: `model.predict_proba(X_test)`
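For a single feature, the probability of class 1 is the sigmoid of a straight line. The sketch below shows that relationship with hypothetical `slope` and `intercept` values; the real ones come from `model.coef_` and `model.intercept_` after fitting, and the helper names are made up for illustration.

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1 / (1 + math.exp(-z))

# Hypothetical values for illustration; read the real ones from
# model.coef_ and model.intercept_ after fitting.
slope = -0.01
intercept = 0.1

def survival_probability(age):
    """Probability of class 1 for a single-feature logistic model."""
    return sigmoid(slope * age + intercept)

for age in (10, 30, 50, 70):
    print(age, round(survival_probability(age), 3))
```

With a negative slope the probability falls as age rises, and `model.predict` simply reports class 1 whenever this probability is above 0.5.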

### Exercises
---

**1. Copy/paste the slope from** `model.coef_` **and the intercept from** `model.intercept_`

**2. Plot the underlying linear model using plt.plot().**

**First feed in** `X_test` **as the x-axis and** `X_test*slope + intercept` **as the y-axis**

**3. Use** `plt.scatter` **to plot** `X_test` **and** `y_test`, **and also to plot** `curve_x` **and** `curve_y`, **which show the curve of the logistic classifier**

```python
curve_x = np.linspace(-100, 200, 100).reshape(-1, 1)
# Column 1 of predict_proba is the probability of class 1 (survived)
curve_y = model.predict_proba(curve_x)[:, 1]
```

**4. What is your own probability of survival? Because the model expects two-dimensional input (a list of rows, each row a list of features), you'll need to wrap your age in two lists. So if your age is 50, you'll input** `[[50]]`

**5. Use one-hot encoding to add all the columns to your model. Call model.score() to see the accuracy of your model.**

*Hint: Use* `make_column_transformer` *to separate categorical and numeric data*

## II. Evaluating Classifiers

### Warm Ups

*Type the given code into the cell below*

---

**Test accuracy of your model:** `pipeline.score(X_test, y_test)`

**Create confusion matrix:** 
```python
# Use a separate name so `pipeline` itself stays unwrapped for later cells
viz = ConfusionMatrix(pipeline)
viz.score(X_test, y_test)
```

**Plot class prediction error:**
```python
viz = ClassPredictionError(pipeline)
viz.score(X_test, y_test)
```

**Plot ROC curve:** 
```python
viz = ROCAUC(pipeline)
viz.score(X_test, y_test)
```
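The visualizers above are all built from four counts. As a minimal sketch with made-up labels (not Titanic results), they can be tallied by hand:

```python
# Made-up labels for illustration: 1 = positive class, 0 = negative class.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

accuracy = (tp + tn) / len(pairs)
tpr = tp / (tp + fn)  # true positive rate, the y-axis of an ROC curve
fpr = fp / (fp + tn)  # false positive rate, the x-axis of an ROC curve

print(tp, tn, fp, fn)      # 3 3 1 1
print(accuracy, tpr, fpr)  # 0.75 0.75 0.25
```

An ROC curve repeats the `tpr` and `fpr` calculation at many probability thresholds instead of the single default cutoff.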

### Exercises
---

**1. Plot the confusion matrix. What is more common -- false positives or false negatives?**

**2. Plot the class prediction error. Which class has a higher rate of error?**

**3. Create an ROC curve for the model. Is the model closer to an ideal model or random chance?**

## III. Class Imbalance

### Warm Ups

*Type the given code into the cell below*

---

**Plot the class balance**: 
```python
viz = ClassBalance()
viz.fit(y)
```

**Plot precision-recall curve:**
```python
viz = PrecisionRecallCurve(pipeline)
viz.fit(X_train, y_train)  # the visualizer inspects the target during fit
viz.score(X_test, y_test)
```

**Create a balanced model:** 
```python
pipeline = make_pipeline(RandomOverSampler(), transformer, LogisticRegression())
```
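Conceptually, random oversampling duplicates rows from the smaller class until the classes are the same size. The sketch below is a simplified pure-Python illustration of that idea, not imbalanced-learn's implementation; the function `oversample` and the data are hypothetical.

```python
import random
from collections import Counter

def oversample(rows, labels, seed=0):
    """Duplicate randomly chosen rows of smaller classes until all match the largest."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for label, count in counts.items():
        pool = [r for r, l in zip(rows, labels) if l == label]
        # Draw with replacement; the largest class needs zero extra rows.
        extra = [rng.choice(pool) for _ in range(target - count)]
        out_rows.extend(extra)
        out_labels.extend([label] * len(extra))
    return out_rows, out_labels

# Made-up data for illustration: six rows of class 0, two rows of class 1.
rows = [[22], [35], [41], [58], [19], [63], [27], [30]]
labels = [0, 0, 0, 0, 0, 0, 1, 1]

new_rows, new_labels = oversample(rows, labels)
print(Counter(labels))      # Counter({0: 6, 1: 2})
print(Counter(new_labels))  # Counter({0: 6, 1: 6})
```

Oversampling belongs on the training data only, which is why the sampler sits inside the imbalanced-learn pipeline: it is applied during `fit` and skipped when predicting or scoring.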

### Exercises
---

**1. Make a ClassBalance visualization. Which class is overrepresented in this dataset?**

**2. Plot a precision-recall curve of this imbalanced dataset. How does this graph compare to the ROC curve?**

**3. Create a new model that uses RandomOverSampler() as part of the pipeline**