# Section 7. Machine Learning Basics Practice

#### Instructor: Pierre Biscaye


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

%matplotlib inline

### Challenge 1: More exploratory data analysis

Load the `auto-mpg.csv` data. Using this dataset, create the following plots, or examine the following distributions, to better understand your data:

1. A histogram of the displacement.
2. A histogram of the horsepower.
3. A histogram of the weight.
4. A histogram of the acceleration.
5. What are the unique model years, and their counts?
6. What are the unique origin values, and their counts?
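
As a rough sketch of the two kinds of summaries requested above, the snippet below builds a small made-up DataFrame, draws one histogram, and tabulates unique values with their counts. The column names `weight` and `origin` are assumed to match the CSV, and the values are invented for illustration.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up values for illustration; replace with pd.read_csv on the real file.
toy = pd.DataFrame({
    "weight": [3504, 3693, 2372, 2130, 4341, 1835],
    "origin": [1, 1, 3, 2, 1, 2],
})

# Histogram of a continuous column.
fig, ax = plt.subplots()
ax.hist(toy["weight"], bins=5)
ax.set_xlabel("weight")
ax.set_ylabel("count")

# Unique values and their counts for a discrete column, sorted by value.
origin_counts = toy["origin"].value_counts().sort_index()
print(origin_counts)
```

The same two patterns should carry over to the other columns listed above.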


In [None]:
# Your code

### Challenge 2: Mean Absolute Error

Another commonly used metric in regression is the **Mean Absolute Error (MAE)**. As the name suggests, this can be calculated by taking the mean of the absolute errors. 
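
To make the definition concrete, here is a minimal sketch that computes the MAE by hand and checks it against scikit-learn. The arrays `y_true` and `y_pred` are toy values chosen for illustration, not results from the auto-mpg model.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Toy values for illustration; they are not from auto-mpg.csv.
y_true = np.array([18.0, 15.0, 26.0, 32.0])
y_pred = np.array([20.0, 14.0, 23.0, 32.0])

# MAE by hand: the mean of the absolute differences.
mae_manual = np.mean(np.abs(y_true - y_pred))

# The same quantity from scikit-learn.
mae_sklearn = mean_absolute_error(y_true, y_pred)

print(mae_manual, mae_sklearn)
```

Unlike the mean squared error, the MAE is in the same units as the target, so here it reads directly as an average miss in MPG.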

Follow the steps in the Section 7a notebook to model `mpg` with the training data variables. Using this trained model, calculate the mean absolute error on the training and test data. We've imported the MAE for you below:

In [None]:
from sklearn.metrics import mean_absolute_error as mae
# Your code



### Challenge 3: Feature Engineering

You might notice that the `origin` variable in the auto-mpg dataset has only three values. So, it's really a categorical variable, where each sample has one of three origins. But so far we've treated it like a continuous variable. 

How can we properly treat this variable as categorical? This is a question of preprocessing and **feature engineering**.

What we can do is replace the `origin` feature with two binary variables. The first tells us whether origin is equal to 2. The second tells us whether origin is equal to 3. If both are false, that means origin is equal to 1.

By fitting a linear regression with these two binary features rather than treating `origin` as continuous, we can get a better sense of how origin affects MPG.
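
The encoding described above can be sketched on a toy frame as follows. The new column names `origin_2` and `origin_3` are hypothetical choices, and the toy values stand in for the real dataset.

```python
import pandas as pd

# Toy frame for illustration; the real data comes from auto-mpg.csv.
toy = pd.DataFrame({"origin": [1, 2, 3, 1, 2]})

# Hypothetical column names for the two indicator features.
toy["origin_2"] = (toy["origin"] == 2).astype(int)
toy["origin_3"] = (toy["origin"] == 3).astype(int)

# origin == 1 is the baseline category: both indicators are 0.
features = toy.drop(columns="origin")
print(features)
```

Leaving out an indicator for origin 1 avoids perfect collinearity with the intercept. Each coefficient then reads as a shift in predicted MPG relative to origin 1.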

Create two new binary features corresponding to origin, and then recreate the training and test data. Then, fit a linear model to the new data. What do you find about the performance and new coefficients?


In [None]:
# Your code

See if you can create some features to improve the training fit without sacrificing the ability to generalize to a good test fit!

In [None]:

# Your code

### Challenge 4: Nearest Neighbors

Modify the nearest neighbors tuning function from class to save the test RMSE for each value of $K$. Run the function on all integers from 1 to 20. Store the results and plot them. Identify the $K$ with the lowest test RMSE and plot that point in a different color.
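
The tuning function from class is not reproduced here, so the following is only a sketch of the general pattern on synthetic data. The helper name `knn_test_rmse` and the generated data are assumptions made for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic data for illustration only.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=300)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

def knn_test_rmse(k_values):
    """Hypothetical helper: fit one model per K and return the test RMSEs."""
    rmses = []
    for k in k_values:
        model = KNeighborsRegressor(n_neighbors=int(k)).fit(X_train, y_train)
        residuals = y_test - model.predict(X_test)
        rmses.append(np.sqrt(np.mean(residuals ** 2)))
    return np.array(rmses)

k_values = np.arange(1, 21)
rmses = knn_test_rmse(k_values)
best = int(np.argmin(rmses))  # position of the lowest test RMSE

# Plot all results, then overlay the minimum in a different color.
fig, ax = plt.subplots()
ax.plot(k_values, rmses, marker="o", label="Test RMSE")
ax.scatter(k_values[best], rmses[best], color="red", zorder=3,
           label=f"Minimum (K={k_values[best]})")
ax.set_xlabel("K")
ax.set_ylabel("Test RMSE")
ax.legend()
```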

In [None]:
# Your code

### Challenge 5: Benchmarking

Re-run the ordinary least squares prediction model on the data using `LinearRegression`. Then, create a new ridge regression where the `alpha` penalty is set equal to zero. How do the performances of these models compare to each other? How do they compare to a ridge regression model with `alpha` set to 5? Be sure to compare both the training performances and test performances. Which one performs better with the training data? Which one generalizes better to the test data?
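
One way to build intuition before running this on the real data is a quick check on synthetic data, sketched below. The generated features and coefficients are invented, and the conclusions about test performance may differ on the auto-mpg data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# Synthetic, well-conditioned data for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

ols = LinearRegression().fit(X, y)
ridge0 = Ridge(alpha=0).fit(X, y)
ridge5 = Ridge(alpha=5).fit(X, y)

# With no penalty, ridge should reproduce the least squares coefficients.
print(np.allclose(ols.coef_, ridge0.coef_))

# A positive penalty shrinks the coefficient vector toward zero.
print(np.linalg.norm(ridge5.coef_) < np.linalg.norm(ols.coef_))

# Training R^2 for each model.
print(ols.score(X, y), ridge0.score(X, y), ridge5.score(X, y))
```

Because ordinary least squares minimizes the training error, a penalized model cannot beat it on the training set. Whether it generalizes better is the question the test set answers.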

In [1]:
from sklearn.linear_model import Ridge
# Your code
# Create models

# Fit models

# Run predictions

# Evaluate models

### Challenge 6: Lasso

Write a loop to perform a lasso regression over a large range of penalty hyperparameters. For each model, store the number of features with nonzero coefficients and the test $R^2$. Then plot how these values change as the penalty increases. Identify the hyperparameter value at which no features have nonzero coefficients. Identify the hyperparameter value with the best test performance.
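
The loop might look roughly like the sketch below, shown on synthetic data where only two of five features matter. The penalty grid and the variable names are assumptions, and plotting is left out.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split

# Synthetic data for illustration: only the first two features affect y.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=300)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Assumed grid: 30 penalties spaced evenly on a log scale.
alphas = np.logspace(-3, 1, 30)
n_nonzero, test_r2 = [], []
for alpha in alphas:
    model = Lasso(alpha=alpha, max_iter=10_000).fit(X_train, y_train)
    n_nonzero.append(int(np.sum(model.coef_ != 0)))
    test_r2.append(model.score(X_test, y_test))

n_nonzero = np.array(n_nonzero)
test_r2 = np.array(test_r2)

# Penalty with the best test R^2.
best_alpha = alphas[np.argmax(test_r2)]

# Smallest penalty on the grid at which every coefficient is zero, if any.
zeroed = alphas[n_nonzero == 0]
first_all_zero = zeroed.min() if zeroed.size else None
print(best_alpha, first_all_zero)
```

A log-spaced grid is used because the interesting changes in sparsity tend to happen across orders of magnitude of the penalty.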

In [None]:
# Your code

### Challenge 7: Order of Preprocessing

When preprocessing the penguin data, we performed the following steps in order:

1. Null values for categorical data
2. One-hot encoding
3. Imputation for continuous data
4. Normalization

Now, consider changing the order of the steps in the following ways. What effect might each change have on the algorithms? Try modifying the code from notebook 7c to find out!

- One-hot encoding before null values for categorical data
- Normalization before imputation for continuous data

**Bonus:** Are there any other switches in order that might affect preprocessing?
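
As a starting point for the second switch, the toy sketch below applies mean imputation and standardization in both orders to a single column. It assumes `SimpleImputer` and `StandardScaler`, which may not be the exact tools used in notebook 7c.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# One toy column with a missing value.
X = np.array([[1.0], [2.0], [3.0], [np.nan]])

# Order A: impute with the mean, then standardize.
imputed = SimpleImputer(strategy="mean").fit_transform(X)
order_a = StandardScaler().fit_transform(imputed)

# Order B: standardize, then impute with the mean.
# StandardScaler ignores NaNs when fitting and leaves them in place.
scaled = StandardScaler().fit_transform(X)
order_b = SimpleImputer(strategy="mean").fit_transform(scaled)

print(order_a.ravel())
print(order_b.ravel())
```

In both orders the missing entry ends up at the column mean, but the observed values are scaled differently. In order A, the imputed value is included when the standard deviation is computed, which shrinks it.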


Your answer

### Challenge 8: Decision Tree Classification

Follow the steps in notebook 7d on classification to estimate a decision tree model that predicts penguin species.

First, estimate the model adding `culmen_depth_mm` as a feature, and set `max_depth=2`.

Then, estimate this same model setting `max_depth=3`.

Then, estimate this same model adding `flipper_length_mm` as a feature, with `max_depth=3`.

How do the training and test performance of the models change? What do you conclude about the effects of feature selection and depth choice?

Visualize the two decision trees to see how increasing the depth changed the split decisions. Do you understand the decision rules?
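
The comparison and the visualization can be sketched as below. The iris dataset bundled with scikit-learn stands in for the penguin data here, so the features, splits, and scores are illustrative only.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, plot_tree

# Stand-in data: iris instead of the penguin dataset from notebook 7d.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Fit one tree per depth and record training and test accuracy.
trees = {}
for depth in (2, 3):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    trees[depth] = tree
    print(depth, tree.score(X_train, y_train), tree.score(X_test, y_test))

# Draw the two trees side by side to compare their splits.
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
for ax, (depth, tree) in zip(axes, trees.items()):
    plot_tree(tree, ax=ax, filled=True)
    ax.set_title(f"max_depth={depth}")
```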

In [None]:

# Your code

### Challenge 9: Decision Tree Pruning

We've looked at some approaches to pre-pruning. Now fit a decision tree to predict penguin species using the default hyperparameters, and use cost complexity post-pruning to prune the tree. You can look up specific code for how to do this online.

After identifying a range of complexity penalty alphas, estimate the decision tree for each possible alpha value. Which alpha value has the best validation performance in terms of accuracy? How does the accuracy compare to what you were finding in Challenge 8?
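
A minimal sketch of this workflow uses scikit-learn's `cost_complexity_pruning_path`. The iris dataset again stands in for the penguin data, and the split sizes are arbitrary assumptions.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in data: iris instead of the penguin dataset.
X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Candidate penalties from the pruning path of a fully grown tree.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(
    X_train, y_train
)
# Clip guards against tiny negative values from floating point error.
ccp_alphas = np.clip(path.ccp_alphas, 0, None)

# Refit one tree per alpha and score it on the validation set.
results = []
for alpha in ccp_alphas:
    tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha)
    tree.fit(X_train, y_train)
    results.append((alpha, tree.get_n_leaves(), tree.score(X_val, y_val)))

# Pick the alpha with the highest validation accuracy.
best_alpha, best_leaves, best_acc = max(results, key=lambda r: r[2])
print(best_alpha, best_leaves, best_acc)
```

Larger alphas prune more aggressively, so the number of leaves falls along the path while validation accuracy often rises before it drops.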

In [None]:
# Your code