# Problem Session 4

The problems in this notebook cover the content from our Regression lectures, including:
- Regularization
- Principal Component Analysis
- Categorical Variables and Interactions
- Pipelines

In [41]:
## We first load in packages we will need
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("whitegrid")

#### 1. Practice creating mock data and fitting models to it

Creating your own fake data and fitting models to that data is a good way to practice.  It is nice because you have access to the "ground truth" when you make your own data.

Another more practical usage of simulation is parametric bootstrapping, which we will cover in a few lectures.

It is also *very common* to need to mock up some data during an interview.

##### a.

Use `np.linspace` to create an array of `100` equally spaced values between $0$ and $5$.  Store this in a variable called `x`.

Simulate $$y = x + x^2 + \epsilon \textrm{ where }\epsilon \sim \mathcal{N}(\mu = 0,\sigma = 10)$$ using `np.random.randn`.  Store this in a variable called `y`.
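
As a rough sketch of the simulation described above: `np.random.randn` draws from a standard normal $\mathcal{N}(0, 1)$, so multiplying by $\sigma = 10$ gives noise with the requested standard deviation. The helper names `n`, `sigma`, and `eps` are only illustrative.

```python
import numpy as np

n = 100
x = np.linspace(0, 5, n)          # 100 equally spaced values from 0 to 5 (endpoints included)

sigma = 10
eps = sigma * np.random.randn(n)  # randn gives N(0, 1); scaling by sigma gives N(0, sigma)

y = x + x**2 + eps                # ground truth x + x^2 plus noise
```

Because no seed is set, each run produces a different `y`, which is what lets you see the variation in fitted models later on.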

##### b.

Make a scatterplot of $y$ against $x$ using `plt`.  Also plot the "ground truth" relationship. 
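
One possible way to draw this, assuming the same simulated data as in part a. The figure size, transparency, and labels are arbitrary choices, not requirements.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 5, 100)
y_true = x + x**2                       # noiseless "ground truth" curve
y = y_true + 10 * np.random.randn(100)  # noisy observations

fig, ax = plt.subplots(figsize=(8, 5))
ax.scatter(x, y, alpha=0.6, label="simulated data")
ax.plot(x, y_true, color="black", label="ground truth")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.legend()
```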

##### c.

Fit three different models to this data:

1. The OLS fit of a degree 10 polynomial.
2. The ridge regression fit of a degree 10 polynomial using the default `alpha = 1`.  Be sure to use StandardScaler as the first step in the pipeline!
3. A pipeline using PCA:  scale -> polynomial transform of degree 10 -> PCA with 5 components -> OLS linear regression.

In each case we are fitting a degree 10 polynomial which should **over-fit**.  Our hope is that regularization using either Ridge or PCA will tame that a bit.
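
If you get stuck, here is a minimal sketch of how the three pipelines could be assembled. The step names (`"scale"`, `"poly"`, `"pca"`, and so on) and the choice `include_bias=False` are assumptions made for illustration; other setups are equally valid.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

x = np.linspace(0, 5, 100)
y = x + x**2 + 10 * np.random.randn(100)
X = x.reshape(-1, 1)  # scikit-learn expects a 2D array of features

# 1. Plain OLS on degree 10 polynomial features
pipe_ols = Pipeline([
    ("poly", PolynomialFeatures(degree=10, include_bias=False)),
    ("linear", LinearRegression()),
])

# 2. Ridge regression, scaling first
pipe_ridge = Pipeline([
    ("scale", StandardScaler()),
    ("poly", PolynomialFeatures(degree=10, include_bias=False)),
    ("ridge", Ridge(alpha=1)),
])

# 3. scale -> polynomial features -> PCA with 5 components -> OLS
pipe_pca = Pipeline([
    ("scale", StandardScaler()),
    ("poly", PolynomialFeatures(degree=10, include_bias=False)),
    ("pca", PCA(n_components=5)),
    ("linear", LinearRegression()),
])

for pipe in (pipe_ols, pipe_ridge, pipe_pca):
    pipe.fit(X, y)
```

Calling `pipe.predict(X)` on each fitted pipeline gives the curves you will need for part d.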

In [None]:
# First import whatever you need

# Then instantiate the pipelines
pipe_ols = 
pipe_ridge = 
pipe_pca = 

# Finally fit the pipelines


##### d.

Graph all 3 fit models together with the ground truth.  Which is the best approximation of the ground truth?

You may want to rerun all cells above this one a few times to see the variation in fitted models.

In [47]:
# Click "Execute above cells" a few times to see the variation. 

Even if you only managed to get through this section it is a profitable use of a problem session!  Depending on the speed of your group you may be able to also tackle the next section.  If not, treat it as homework!

#### 2. The diamonds dataset

We introduce a new "classic" dataset.  Our task is to predict the price of diamonds.

* price: Price in US dollars.
* carat: Weight of the diamond.
* cut: Cut quality (ordered worst to best).
* color: Color of the diamond (ordered best to worst).
* clarity: Clarity of the diamond (ordered worst to best).
* x: Length in mm.
* y: Width in mm.
* z: Depth in mm.
* depth: Total depth percentage: 100 * z / mean(x, y)
* table: Width of the top of the diamond relative to the widest point.

Homepage: https://ggplot2.tidyverse.org/reference/diamonds.html

In [48]:
df = pd.read_csv('../../data/diamonds.csv')

In [None]:
df

For the sake of time we will restrict ourselves to just one categorical feature (`cut`) and one continuous feature (`carat`) in our modeling.  This is only being done for pedagogical purposes!  In a real situation you would want to carefully explore all of the data you have available.

In [50]:
df = df[['cut', 'carat', 'price']]

##### a.

Make a train/test split with 20% of data held aside as the test set.
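
A small sketch of the split on a hypothetical ten-row frame (`toy`), so it runs without the CSV. The values and the `random_state` are made up. In the notebook you would split `df` instead; later cells refer to the training portion as `df_train`.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the diamonds data
toy = pd.DataFrame({
    "cut": ["Ideal", "Premium", "Good", "Fair", "Ideal",
            "Premium", "Ideal", "Good", "Premium", "Ideal"],
    "carat": [0.3, 0.7, 0.9, 1.2, 0.4, 1.0, 0.5, 0.8, 1.1, 0.3],
    "price": [500, 2500, 3500, 4000, 800, 5000, 1500, 3000, 6000, 450],
})

# test_size=0.2 holds out 20% of the rows; random_state makes the split reproducible
toy_train, toy_test = train_test_split(toy, test_size=0.2, shuffle=True, random_state=216)
```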


##### b. 

What percentage of samples belongs to each level of the `cut` feature?
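
One way to get these percentages is `value_counts` with `normalize=True`, sketched here on a hypothetical toy column rather than the real training set.

```python
import pandas as pd

# Hypothetical toy data: 4 Ideal, 3 Premium, 2 Good, 1 Fair
toy = pd.DataFrame({"cut": ["Ideal"] * 4 + ["Premium"] * 3 + ["Good"] * 2 + ["Fair"]})

# normalize=True returns proportions; multiply by 100 for percentages
cut_pct = toy["cut"].value_counts(normalize=True) * 100
```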

##### c. 

Look at the distribution of price at each level of the `cut` feature.  Do you notice anything strange or unexpected?

##### d. 

One thing which might be a bit confusing is that the cut quality does not seem to be a very good indicator of price.  Why might that be?

Sometimes this happens when two predictors which each have a positive **causal** impact on the outcome are negatively correlated with each other.  In other words, it might be that **all else being equal** a higher quality cut will increase the price, and a larger carat will increase the price, but higher quality cuts are negatively correlated with the size in carats.

Use the `groupby` and `describe` methods to look at some summary statistics of carat size sorted by cut quality.

You should see that the "Fair" quality also has the largest mean carat size, while "Ideal" quality has the smallest. I am not a domain expert, but this could be due to jewelers needing to cut away more of the original stone to produce better cuts.  This would be something to consult a jeweler about.
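
The `groupby` / `describe` combination mentioned above can be sketched as follows. The rows are hypothetical and only chosen to mimic the pattern; run the same call on your training set to see the real numbers.

```python
import pandas as pd

# Hypothetical toy rows, not the real diamonds data
toy = pd.DataFrame({
    "cut": ["Fair", "Fair", "Good", "Good", "Ideal", "Ideal"],
    "carat": [1.2, 1.0, 0.9, 0.7, 0.5, 0.3],
})

# One row of summary statistics (count, mean, std, quartiles, ...) per cut level
carat_by_cut = toy.groupby("cut")["carat"].describe()

# Sorting by the mean makes the ordering of cut levels easy to read off
carat_by_cut = carat_by_cut.sort_values("mean", ascending=False)
```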

##### e.

Graph price against carat with color coded by cut quality.

##### f.

The relationship you obtained above does not look linear.  Graph the log of the price against the log of the carat size.  This should look substantially more linear!
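
The reason a log-log plot can straighten things out: if price follows a power law, $\text{price} \approx c \cdot \text{carat}^k$, then $\log(\text{price}) \approx \log c + k \log(\text{carat})$, which is linear in $\log(\text{carat})$. A minimal sketch of adding the two columns, using a hypothetical toy frame in place of `df_train`:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df_train
toy_train = pd.DataFrame({
    "carat": [0.3, 0.5, 1.0, 1.5, 2.0],
    "price": [450.0, 1500.0, 5000.0, 10000.0, 16000.0],
})

# assign returns a new frame, which sidesteps pandas warnings about
# writing into a slice of another DataFrame
toy_train = toy_train.assign(
    log_carat=np.log(toy_train["carat"]),
    log_price=np.log(toy_train["price"]),
)
```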

In [None]:
df_train['log_carat'] = 
df_train['log_price'] = 

##### g.

We do not have the ability to **experimentally** adjust `cut` and `carat` independently to see the impact on price, but we can still use **statistical control**.

We will run a linear regression of `log_price` against `cut` and `log_carat`.  Do better cuts contribute to higher prices when controlling for carat?

In [58]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler, FunctionTransformer
from sklearn.linear_model import LinearRegression

In [None]:
# Discuss what you think preprocessor does with your team.  Can you test that it does what you think it should?
preprocessor = ColumnTransformer(
    transformers=[
        ("cat", OneHotEncoder(), ['cut']),
        ('identity', FunctionTransformer(func = None), ['log_carat'])
        ])

# Write a pipeline which first uses preprocessor and then uses LinearRegression(fit_intercept = False). 
# Name the two steps 'preprocess' and 'linear' so that the lookup code further down works.
# Why do I not want to fit the intercept term?
model = 

# Fit it on the training set using the 'cut' and 'log_carat' features (in that order).


# It is a bit difficult to access the feature names of one part of a pipeline, so I have done it for you.
feature_names = model['preprocess'].transformers_[0][1].get_feature_names_out()

cut_adjustments = {feature_name: float(model['linear'].coef_[i]) for i,feature_name in enumerate(feature_names)}

cut_adjustments_sorted = dict(sorted(cut_adjustments.items(), key=lambda item: item[1]))

cut_adjustments_sorted

##### h. Evaluating residuals

Make a plot of residuals against predicted values.  Discuss the implications for your model.
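
A sketch of a residual plot on simulated data, so that it runs on its own. The coefficients `8.0` and `1.7`, the noise level, and the name `toy_model` are all hypothetical; in the notebook you would use your fitted pipeline's predictions on the training set instead.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
log_carat = rng.uniform(-1.5, 1.0, size=200)
log_price = 8.0 + 1.7 * log_carat + rng.normal(scale=0.2, size=200)  # made-up relationship

X = log_carat.reshape(-1, 1)
toy_model = LinearRegression().fit(X, log_price)

pred = toy_model.predict(X)
resid = log_price - pred  # residual = observed - predicted

fig, ax = plt.subplots(figsize=(8, 5))
ax.scatter(pred, resid, alpha=0.4)
ax.axhline(0, color="black", linestyle="--")  # reference line at zero residual
ax.set_xlabel("predicted log price")
ax.set_ylabel("residual")
```

For a well-specified model the points should scatter evenly around the zero line; curvature or a funnel shape suggests a missed nonlinearity or non-constant variance.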

##### i. Quantifying model performance

Let's use [mean absolute percentage error](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_absolute_percentage_error.html) and [mean absolute error](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_absolute_error.html) as our performance metrics.  How does our model perform on the training set?
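
A short sketch of both metrics on made-up numbers. It assumes you want the errors on the dollar scale, in which case predictions of `log_price` need to be exponentiated first; that is a modeling choice, not something the metrics require. Note that scikit-learn's `mean_absolute_percentage_error` returns a fraction rather than a percentage.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_absolute_percentage_error

# Hypothetical true prices and model predictions made on the log scale
price_true = np.array([500.0, 1000.0, 2000.0, 4000.0])
log_price_pred = np.log([550.0, 900.0, 2000.0, 5000.0])

price_pred = np.exp(log_price_pred)  # back to dollars

mae = mean_absolute_error(price_true, price_pred)               # in dollars
mape = mean_absolute_percentage_error(price_true, price_pred)   # as a fraction
```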

Note:  We have not done any cross validation to compare model performance in this problem session because we have only considered one model.