# Python for Machine Learning

### *Session \#3*


### Helpful shortcuts
---

**SHIFT** + **ENTER** ----> Execute Cell

**UP/DOWN ARROWS** ----> Move cursor between cells (then ENTER to start typing)

**TAB** ----> See autocomplete options

**ESC** then **b** ----> Create Cell

**ESC** then **dd** ----> Delete Cell

**\[python expression\]?** ----> Explanation of that Python expression

**ESC** then **m** then **ENTER** ----> Switch to Markdown mode

## I. Regression and Parametric Models

### Warm Ups

*Type the given code into the cell below*

---

```python
from matplotlib import pyplot as plt
%matplotlib inline

import pandas as pd

from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.metrics import mean_absolute_error
from sklearn.preprocessing import PolynomialFeatures, scale
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

from yellowbrick.model_selection import LearningCurve, ValidationCurve
from yellowbrick.regressor import ResidualsPlot

from warnings import simplefilter
simplefilter(action='ignore', category=FutureWarning)

df = pd.read_csv('housing.csv', usecols=range(9))
df = df.dropna()
```

**Create training/test sets:** 
```python
# Target column name matches the one used in Section III
X = df.drop(columns=['median_house_value'])
y = df['median_house_value']
X_train, X_test, y_train, y_test = train_test_split(X, y)
```

**Train model and show Residuals Plot:** 
```python
model = ResidualsPlot(LinearRegression())
model.fit(X_train, y_train)
model.score(X_test, y_test)
```
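As a rough illustration of what a residuals plot displays, the sketch below computes residuals (actual value minus predicted value) by hand. The data is synthetic, and the names `rng`, `X_demo`, `y_demo`, and `demo_model` are made up for this example; they are not part of the housing data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (an assumption for illustration): y is roughly 3*x + 2 plus noise
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(100, 1))
y_demo = 3 * X_demo[:, 0] + 2 + rng.normal(0, 1, size=100)

demo_model = LinearRegression().fit(X_demo, y_demo)

# A residual is the actual value minus the model's prediction
residuals = y_demo - demo_model.predict(X_demo)

# For least squares with an intercept, training residuals average out to about zero
print(abs(residuals.mean()))

# .score() on a regressor returns the R2 value
print(demo_model.score(X_demo, y_demo))
```

A residuals plot places these values on the vertical axis, so points far from zero mark the predictions the model got most wrong.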

**Find mean absolute error:** 
```python
y_predicted = model.predict(X_test)
# sklearn's signature is (y_true, y_pred); MAE itself is symmetric
mean_absolute_error(y_test, y_predicted)
```
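To make the metric concrete, here is a minimal sketch that computes mean absolute error by hand on four made-up values and compares it with sklearn's result. The arrays `y_true` and `y_pred` are invented for the example.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Made-up actual and predicted values, for illustration only
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

# Mean absolute error: the average size of the errors, ignoring their sign
manual_mae = np.mean(np.abs(y_true - y_pred))
sklearn_mae = mean_absolute_error(y_true, y_pred)

print(manual_mae, sklearn_mae)
```

Because the errors are not squared, the result is in the same units as the target, which makes it easy to interpret.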

### Exercises
---

**1. For each machine learning task, state whether it's better suited for a parametric or non-parametric model:**

* Predicting spread of a contagious disease 
* Recognizing handwritten digits
* Predicting the position of a planet
* Facial recognition

**2. Say we are trying to predict car price based on age.**

**For cars _older_ than any of the ones in the training set, how would a linear model predict price? KNN?**

**Which model will perform better on _slightly older_ cars? Which will perform better on _much older_ cars?**

![car_prices](../images/car_prices.png)

**3. Train a LinearRegression model and a KNN model, named** `linear` **and** `knn`

**What are the R2 score and mean absolute error of each?**

**4. Use** `scale()` **on your feature matrix X before training** `linear` **and** `knn`

**How does this affect the R2 score and mean absolute error of both models?**
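For reference, a small sketch of what `scale()` does to a feature matrix: each column is shifted to mean 0 and rescaled to standard deviation 1. The matrix `X_demo` is invented for the example and deliberately mixes a small-valued column with a large-valued one.

```python
import numpy as np
from sklearn.preprocessing import scale

# Made-up feature matrix: the second column is thousands of times larger than the first
X_demo = np.array([[1.0, 1000.0],
                   [2.0, 3000.0],
                   [3.0, 5000.0]])

# Standardize each column independently
X_scaled = scale(X_demo)

print(X_scaled.mean(axis=0))  # per-column means after scaling
print(X_scaled.std(axis=0))   # per-column standard deviations after scaling
```

After scaling, both columns vary over the same range, so neither dominates purely because of its units.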

**5. Paste your solution to the last problem below**


**To study the effect of limited data, rerun** `train_test_split` **with** `test_size=0.95`

**Which model fares better when data is scarce?**
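As a quick check of what `test_size=0.95` means in practice, the sketch below splits a made-up 200-row data set and prints how many rows land in each part. The names `X_demo`, `y_demo`, `X_small`, and `X_large` are placeholders, and `random_state=0` is only there to make the split repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 200 rows, one feature
X_demo = np.arange(200).reshape(-1, 1)
y_demo = np.arange(200)

# Hold out 95% of the rows for testing, leaving only about 5% for training
X_small, X_large, y_small, y_large = train_test_split(
    X_demo, y_demo, test_size=0.95, random_state=0
)

print(len(X_small), len(X_large))
```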

**6. Create a residual plot of** `linear` **and** `knn`

**Which model performs better on extreme values? Which performs better overall?**

## II. Overfitting and Regularization

### Warm Ups

*Type the given code into the cell below*

---
**Create a ridge regression model:**
```python
model = Ridge(alpha=1)
```

**Train model and show Learning Curve:** 
```python
model_learn = LearningCurve(model)
model_learn.fit(X_train, y_train)
model_learn.score(X_test, y_test)
model_learn.finalize()
```

**Feature importance (requires a fitted model, e.g. after** `model.fit(X_train, y_train)`**):** `sorted(zip(model.coef_, X.columns))`
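The same coefficient-sorting idea can be tried on synthetic data where the true relationship is known. This is only a sketch: the columns `a`, `b`, `c` and the target formula are invented, and the features share a common scale, which is what makes the coefficients comparable.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge

# Invented data: the target depends positively on 'a', negatively on 'c', not at all on 'b'
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(200, 3)), columns=['a', 'b', 'c'])
y_demo = 3 * X_demo['a'] - 2 * X_demo['c'] + rng.normal(0, 0.1, size=200)

# coef_ only exists once the model has been fit
demo_model = Ridge(alpha=1).fit(X_demo, y_demo)

# Pair each coefficient with its column name and sort from most negative to most positive
ranked = sorted(zip(demo_model.coef_, X_demo.columns))
for coef, name in ranked:
    print(f"{name}: {coef:.2f}")
```

Sorting puts strongly negative features first and strongly positive ones last; features with coefficients near zero end up in the middle.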

### Exercises
---

**1. For each of the following situations, state whether LASSO or Ridge regression would be more appropriate:**

* Predicting height based on thousands of genes
* Predicting rent prices based on a few house/neighborhood features
* Predicting a customer's monthly expenses using features like location, age, etc
* Predicting a patient's blood pressure based on dozens of facts about lifestyle, pre-existing conditions, etc

**2. Let's see how linear regression can overfit**

**Create your feature matrix and target vector using just the first 100 rows of** `df`

**Train a LinearRegression model from these data sets** 

**3. What is the model's R2 score on the test set? How about the training set?**

**What does this tell you about the model?**

**4. Replace the LinearRegression model with a Ridge model**

**Use a ValidationCurve to choose a good value for alpha** 
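For comparison, the numbers behind a validation curve can also be computed without plotting, using sklearn's `validation_curve` function rather than the Yellowbrick visualizer. The sketch below runs on synthetic data; `X_demo`, `y_demo`, the range of `alphas`, and `cv=5` are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import validation_curve

# Synthetic data: few rows, many features, only the first feature matters
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(60, 20))
y_demo = X_demo[:, 0] + rng.normal(0, 1, size=60)

# Candidate regularization strengths on a log scale
alphas = np.logspace(-3, 3, 7)

# One row of scores per alpha, one column per cross-validation fold
train_scores, valid_scores = validation_curve(
    Ridge(), X_demo, y_demo,
    param_name='alpha', param_range=alphas, cv=5,
)

# Pick the alpha with the highest average validation score
mean_valid = valid_scores.mean(axis=1)
best_alpha = alphas[np.argmax(mean_valid)]
print(best_alpha)
```

A large gap between the training and validation scores at a given alpha suggests overfitting at that setting.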

## III. Bias, Variance and Polynomial Regression

### Warm Ups

*Type the given code into the cell below*

---

**Setup**

```python
X = df[['median_income']]
y = df['median_house_value']
X_train, X_test, y_train, y_test = train_test_split(X, y)
```

**Add polynomial features to model:**
```python
model = make_pipeline(PolynomialFeatures(2), LinearRegression())
model.fit(X_train, y_train)
```

**Predict and plot using model** 
```python
plt.scatter(X_test, y_test, color='gray')  # actual values for the test rows
plt.scatter(X_test, model.predict(X_test))
```

**Use a validation curve to test different polynomial degrees**
```python
model = ValidationCurve(model, 
                        param_name='polynomialfeatures__degree', 
                        param_range=range(1, 5))

model.fit(X, y)
```

### Exercises
---

**1. Let's say I have two columns** `x` **and** `y`. **If I add polynomial features up to degree 3, what will my columns look like? How many will I have?**

**2. Create an instance of** `PolynomialFeatures(2)` **called** `poly`

**Call** `.fit_transform()` **on X_train to see the transformed data**

**3. Create a pipeline with** `PolynomialFeatures(3)` **and a LinearRegression model**


**4. Create a scatter plot of the model's predictions and the actual values of** `median_house_value`

**5. Retrain the model using all the columns, instead of just** `median_income`

**What's the model's R2 score?**

**6. Use a validation curve to test polynomial degrees up to 5.**

**At what degree does the model start to overfit?**