# Section 6. Machine Learning Basics Practice

#### Instructor: Pierre Biscaye

The purpose of this notebook is to give you opportunities and challenges to practice applying the skills developed in the other notebooks.

Most of this notebook is derived from UC Berkeley D-Lab's Python Machine Learning [course](https://github.com/dlab-berkeley/Python-Machine-Learning).

In [None]:
# Initial packages import, modify as needed

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

### Challenge 1: More exploratory data analysis

Load the `auto-mpg.csv` data. Using this dataset, create the following plots, or examine the following distributions, to better understand your data:

1. A histogram of the displacement.
2. A histogram of the horsepower.
3. A histogram of the weight.
4. A histogram of the acceleration.
5. What are the unique model years, and their counts?
6. What are the unique origin values, and their counts?


In [None]:
# Your code

### Challenge 2: Mean Absolute Error

Another commonly used metric in regression is the **Mean Absolute Error (MAE)**. As the name suggests, this can be calculated by taking the mean of the absolute errors. 
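As a quick illustration of the definition, here is a minimal sketch that computes the MAE by hand and checks it against scikit-learn. The numbers are toy values made up for this example, not taken from the auto-mpg data.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Toy values chosen for illustration only (not from the auto-mpg data)
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

# MAE by hand: the mean of the absolute errors
mae_manual = np.mean(np.abs(y_true - y_pred))

# The same quantity computed by scikit-learn
mae_sklearn = mean_absolute_error(y_true, y_pred)

print(mae_manual, mae_sklearn)
```

Unlike the mean squared error, the MAE is in the same units as the outcome, so it can be read directly as a typical size of the prediction error.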

Follow the steps in the Section 6a notebook to model `mpg` using an OLS regression with the training data variables. Using this trained model, calculate the mean absolute error on the training and test data. We've imported the MAE for you below.

In [None]:
from sklearn.metrics import mean_absolute_error as mae
# Your code



### Challenge 3: Feature Engineering

You might notice that the `origin` variable in the auto-mpg dataset has only three values. So, it's really a categorical variable, where each sample has one of three origins. But so far we've treated it like a continuous variable. 

How can we properly treat this variable as categorical? This is a question of preprocessing and **feature engineering**.

What we can do is replace the `origin` feature with two binary variables. The first tells us whether origin is equal to 2. The second tells us whether origin is equal to 3. If both are false, that means origin is equal to 1.
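The encoding described above can be sketched as follows. The small data frame is a hypothetical stand-in for the auto-mpg data, and the column names `origin_2` and `origin_3` are assumptions made for this example.

```python
import pandas as pd

# Hypothetical miniature stand-in for the auto-mpg data
df = pd.DataFrame({"mpg": [18.0, 26.0, 31.0, 24.0], "origin": [1, 2, 3, 1]})

# One binary column per non-baseline origin value; origin == 1 is the baseline
df["origin_2"] = (df["origin"] == 2).astype(int)
df["origin_3"] = (df["origin"] == 3).astype(int)

# Drop the original column so it is no longer treated as continuous
df_encoded = df.drop(columns="origin")
print(df_encoded)
```

A similar result can likely be obtained with `pd.get_dummies(df, columns=["origin"], drop_first=True)`, which also leaves out the first category as the baseline.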

By fitting a linear regression with these two binary features rather than treating `origin` as continuous, we can get a better sense for how the origin impacts the MPG.

Create two new binary features corresponding to origin, and then recreate the training and test data. Then, fit a linear model to the new data. What do you find about the performance? Compare the $R^2$ in this model to the original model which treated `origin` as continuous. How did the coefficients change?


In [None]:
# Your code

See if you can create some features to improve the training fit without sacrificing the ability to generalize to a good test fit!

In [None]:

# Your code

### Challenge 4: Nearest Neighbors

Modify the nearest neighbors tuning function from class to save the Test RMSE for each value of $K$. Run the function on all integers from 1 to 20. Store the results and plot them. Identify the minimum value and plot it in a different color. 
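One possible shape for such a function is sketched below. It runs on synthetic data so that it is self-contained; the helper name `knn_test_rmse` is hypothetical and is not the function from class, which you should adapt instead.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic regression data, used only so the sketch runs on its own
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=200)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1
)

def knn_test_rmse(k_values):
    """Hypothetical helper: return the test RMSE for each K in k_values."""
    rmses = []
    for k in k_values:
        model = KNeighborsRegressor(n_neighbors=k).fit(X_train, y_train)
        mse = mean_squared_error(y_test, model.predict(X_test))
        rmses.append(np.sqrt(mse))
    return np.array(rmses)

k_values = list(range(1, 21))
rmses = knn_test_rmse(k_values)

# K with the lowest test RMSE
best_k = k_values[int(np.argmin(rmses))]
print(best_k, rmses.min())
```

The arrays `k_values` and `rmses` can then be passed to `plt.plot`, with the point at `best_k` drawn separately (for example with `plt.scatter`) to give it a different color.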

In [None]:
# Your code

### Challenge 5: Benchmarking

Re-run the ordinary least squares prediction model on the data using `LinearRegression`. Then, create a new ridge regression where the `alpha` penalty is set equal to zero. How do the performances of these models compare to each other? How do they compare to a ridge regression model with `alpha` set to 5? Be sure to compare both the training performances and test performances. Which one performs better with the training data? Which one generalizes better to the test data?
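A minimal sketch of how the three models could be set up and scored side by side is shown here. It uses synthetic data and an assumed dictionary layout, so the numbers it prints say nothing about the auto-mpg results; swap in the training and test sets from the earlier notebooks.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split

# Synthetic data, used only so the sketch runs on its own
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 1.5 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1
)

# Create models
models = {
    "ols": LinearRegression(),
    "ridge_0": Ridge(alpha=0),
    "ridge_5": Ridge(alpha=5),
}

# Fit each model and store (training R^2, test R^2)
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = (model.score(X_train, y_train), model.score(X_test, y_test))

for name, (train_r2, test_r2) in scores.items():
    print(f"{name}: train R^2 = {train_r2:.4f}, test R^2 = {test_r2:.4f}")
```

Comparing `models["ols"].coef_` with `models["ridge_0"].coef_` is a useful check on what a zero penalty means.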

In [1]:
from sklearn.linear_model import Ridge
# Your code
# Create models

# Fit models

# Run predictions

# Evaluate models

### Challenge 6: Lasso

Write a loop to perform a lasso regression over a large range of penalty hyperparameters. For each model, store the count of features with nonzero coefficients and the test $R^2$. Then plot how these values change as the penalty increases. Identify the smallest hyperparameter value at which every coefficient is zero. Identify the hyperparameter value with the best test performance.
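The loop could be organized roughly as below. This is a sketch on synthetic data; the penalty grid `np.logspace(-3, 1, 20)` is an assumption, and a different range may be needed for the course data.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split

# Synthetic data: only the first two of six features carry signal
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 6))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=1.0, size=300)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1
)

alphas = np.logspace(-3, 1, 20)  # assumed penalty grid
n_nonzero, test_r2 = [], []
for alpha in alphas:
    model = Lasso(alpha=alpha, max_iter=10_000).fit(X_train, y_train)
    n_nonzero.append(int(np.sum(model.coef_ != 0)))  # count of nonzero coefficients
    test_r2.append(model.score(X_test, y_test))      # test R^2

# Penalty with the best test R^2
best_alpha = alphas[int(np.argmax(test_r2))]

# Smallest penalty on the grid at which every coefficient is zero (None if never)
zero_alphas = [a for a, n in zip(alphas, n_nonzero) if n == 0]
first_all_zero = zero_alphas[0] if zero_alphas else None
print(best_alpha, first_all_zero)
```

Because the grid is logarithmic, plotting `n_nonzero` and `test_r2` against `alphas` usually reads best with a log-scaled x-axis (`plt.xscale("log")`).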

In [None]:
# Your code

### Challenge 7: Order of Preprocessing

In the preprocessing of penguin data we did the following steps: 

1) Null values for categorical data
2) Dummy encoding
3) Imputation for continuous data
4) Normalization

Now, consider that we change the order of the steps in the following ways. What effect might that have on the algorithms? Try changing the code from notebook 6c and trying it out!

- Dummy Encoding before Null Values for categorical data
- Normalization before Imputation for continuous data

**Bonus:** Are there any other switches in order that might affect preprocessing?


In [None]:
# Your code

Your answer

### Challenge 8: Decision Tree Classification

Follow the steps in notebook 6d on classification to estimate a decision tree model that predicts penguin species.

First, estimate the model adding `culmen_depth_mm` as a feature, and set `max_depth=2`.

Then, estimate this same model setting max_depth=3. 

Then, estimate this same model adding flipper_length_mm as a feature, with max_depth=3.

How do the training and test performance of the models change? What do you conclude about the effects of feature selection and depth choice?

Visualize the original and the final new decision tree to see how adding features and layers changed the split decisions. Do you understand the decision protocols?

In [None]:

# Your code

### Challenge 9: Decision Tree Pruning

We've looked at some approaches to pre-pruning. Now fit a decision tree to predict penguin species using the default hyperparameters, and use cost complexity post-pruning to prune the tree. You can look up specific code for how to do this online.

After identifying a range of complexity penalty alphas, estimate the decision tree for each possible alpha value. Which alpha value has the best validation performance in terms of accuracy? How does the accuracy compare to what you were finding in Challenge 8?
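As a starting point, the general pattern with scikit-learn's `cost_complexity_pruning_path` is sketched here. It uses the built-in iris data as a stand-in for the penguin data, and the split sizes and random seeds are assumptions made for this example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Iris is a stand-in here; replace with the penguin features and species
X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)

# Candidate complexity penalties from the fully grown tree's pruning path
path = DecisionTreeClassifier(random_state=1).cost_complexity_pruning_path(
    X_train, y_train
)
# Clip guards against tiny negative values caused by floating-point error
ccp_alphas = np.clip(path.ccp_alphas, 0, None)

# Refit one tree per alpha and record validation accuracy
val_scores = []
for ccp_alpha in ccp_alphas:
    tree = DecisionTreeClassifier(random_state=1, ccp_alpha=float(ccp_alpha))
    tree.fit(X_train, y_train)
    val_scores.append(tree.score(X_val, y_val))

best_alpha = ccp_alphas[int(np.argmax(val_scores))]
print(best_alpha, max(val_scores))
```

The largest alpha on the path prunes the tree down to a single node, so validation accuracy typically drops sharply at the end of the range.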

In [None]:
# Your code

### Challenge 10: Classification Model Comparisons

Work through the challenges in the Section 6e notebook. As a further challenge, run Support Vector Machines and XGBoost using appropriate parameter tuning. Create a table summarizing and comparing the results across models. 

In [None]:
# Your code

### Challenge 11: Cluster stability

We discussed at the end of the Section 6f notebook that if clusters change a lot when randomly sampling subsets of the data, they are likely not detecting real patterns. Data scientists are not just interested in whether clusters can emerge, but whether those clusters are reliable.

K-Means will always find clusters, even in a cloud of completely random dots. A professional data scientist must show that their clusters are stable, meaning they don't change drastically if a few data points are removed.

You will test the stability of your $k=3$ K-Means model with the Spotify data and the same features we used in the 6f notebook, by comparing the results of the full dataset against a 90% random sample. 

Steps:
1. Run K-Means on your full scaled dataset (X_scaled) with $k=3$ and random_state=1. Save these labels as labels_full.
2. Create a Subsample: Create a new variable X_sub that contains a random 90% sample of your original data. Hint: Use X_scaled.sample(frac=0.9, random_state=123).
3. Subsample Run: Run K-Means (same $k$ and same random_state) on X_sub. Save these labels as labels_sub.
4. The Alignment Check: Select only the rows in labels_full that exist in your 90% sample. Use a Confusion Matrix to compare the "Full" labels vs the "Subsample" labels for those overlapping rows. Discuss the results. Hint: consider the possibility of "label switching". Calculate a "stability score" using the [Adjusted Rand Score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.adjusted_rand_score.html).
5. Run this ten times with different random seeds and different subsamples, and calculate a mean stability score. Discuss.
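Steps 1 to 4 can be sketched as below. The sketch uses well-separated synthetic blobs in place of the Spotify data, and the column names `f1` and `f2` and the setting `n_init=10` are assumptions made for this example.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score, confusion_matrix

# Synthetic stand-in for the scaled Spotify features
X, _ = make_blobs(
    n_samples=300, centers=[[0, 0], [6, 6], [-6, 6]], cluster_std=1.0, random_state=0
)
X_scaled = pd.DataFrame(X, columns=["f1", "f2"])

# Step 1: full-data run, keeping the row index so labels can be matched later
labels_full = pd.Series(
    KMeans(n_clusters=3, random_state=1, n_init=10).fit_predict(X_scaled),
    index=X_scaled.index,
)

# Steps 2 and 3: 90% subsample and a second run with the same settings
X_sub = X_scaled.sample(frac=0.9, random_state=123)
labels_sub = KMeans(n_clusters=3, random_state=1, n_init=10).fit_predict(X_sub)

# Step 4: compare labels on the overlapping rows only
labels_full_overlap = labels_full.loc[X_sub.index]
cm = confusion_matrix(labels_full_overlap, labels_sub)
ari = adjusted_rand_score(labels_full_overlap, labels_sub)
print(cm)
print(ari)
```

The Adjusted Rand Score does not depend on how the clusters are numbered, so it is unaffected by label switching, whereas the confusion matrix may show the large counts off the diagonal when the same groups receive different numbers in the two runs.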

In [None]:
# Your code

### Challenge 12: Principal components

We were not sure how to interpret the "generosity" variable. Re-run the PCA analysis we did on the World Happiness data after removing the generosity feature. Discuss how the results differ.

In [None]:
# Your code

### Challenge 13: Predicting Diamond Cut

Suppose we were interested in predicting the cut of a diamond, rather than its price. This is a categorical variable. Using logistic regression but still keeping just numeric variables in the features dataset, repeat the process in the Section 6g notebook to test the potential time benefits of using PCA for this kind of analysis.

The code below could be helpful to get started:

In [None]:
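# # Imports needed by the lines below
# from sklearn.linear_model import LogisticRegression
# from sklearn.model_selection import KFold, cross_val_score
# # Initialize k-fold with 5 splits
# kf = KFold(n_splits=5)
# # Initialize model
# lr = LogisticRegression(random_state=1, max_iter=300)
# # Run cross validation (Xs: scaled numeric features, y: diamond cut)
# baseline_score = cross_val_score(lr, Xs, y, cv=kf, scoring="accuracy").mean().round(3)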


### Challenge 14: t-SNE perplexity

Using the code in the Section 6g notebook, test how changing perplexity can change what the clusters look like. Make a figure with 6 subplots showing how the clusters change with increasing perplexity.

In [None]:
# Your code