***

# Practical activity: Regression

***

- In this activity you will explore and analyze a dataset of car fuel economy, measured in miles per gallon (MPG), for different brands and types of cars
- We will use data from the [UCI](https://archive.ics.uci.edu/ml/) repository 
- We will use the following libraries: 
    - Python [pandas](https://pandas.pydata.org) to handle data tables
    - matplotlib for visualization
    - [PyMC3](https://docs.pymc.io/) and scikit-learn for the modeling

### Instructions
1. Follow the steps in this notebook, complete the activities and answer the questions marked with a **Q**
1. Pre-process the data and inspect it
1. Train a linear model to predict MPG as a function of other relevant features

In [None]:
import numpy as np
import pandas as pd
import pymc3 as pm
from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
%matplotlib notebook
import matplotlib.pyplot as plt

***
### Getting the data
- Use the following block to obtain the data
- If you don't have `wget`, follow the link, download the data manually and put it in the `data` folder

In [None]:
!wget -nc https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data -P data

*** 
### Pre-process the data

- Import the table as a pandas dataframe and explore it
- **Q:** How many features and samples are in the table?
- **Q:** Which variables are continuous and which are categorical?

In this case
- MPG (Miles per gallon) is the variable we want to model (dependent variable) 
- car_name is the index column
- In the original table (auto-mpg.data) there are missing values expressed as "?" that are converted to NaN by pandas
    - **Q:** What features have missing values?
    - **Q:** Give two cars with missing values
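
The questions above can be approached with a few pandas calls. The following is a minimal sketch on a made-up table, not on the real data: the `toy` dataframe, its values and its car names are hypothetical and only mimic the layout of `auto-mpg.data`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table; names and values are invented for illustration
toy = pd.DataFrame(
    {"MPG": [18.0, 25.0, 21.0, 30.5],
     "horsepower": [130.0, np.nan, 95.0, np.nan],
     "weight": [3504.0, 2046.0, 2587.0, 1835.0]},
    index=pd.Index(["car a", "car b", "car c", "car d"], name="car_name"),
)

# (number of samples, number of columns) and the type of each column
print(toy.shape)
print(toy.dtypes)

# Number of missing entries (NaN) per column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Rows (cars) that have at least one missing entry
rows_with_missing = toy[toy.isna().any(axis=1)]
print(rows_with_missing.index.tolist())
```

The same calls should work on `df` once it has been loaded in the next cell.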

In [None]:
# Use help(pd.read_table) to understand the parameters of read_table

# sep=r"\s+" splits on any run of whitespace (delim_whitespace is deprecated in recent pandas)
df = pd.read_table("data/auto-mpg.data", sep=r"\s+", index_col="car_name",  na_values="?",
                   names= ["MPG", "cylinders", "displacement", "horsepower", 
                           "weight", "acceleration", "model year", "origin", "car_name"])
df

# You can grab a particular column as: df["MPG"]
# You can obtain a numpy array from it as: df["MPG"].values

***

### Data inspection

- Inspect the histogram of MPG
- **Q:** Is MPG normal/Gaussian distributed?
- **Q:** Compute and report the mean, standard deviation and skewness (statistical moments) of MPG 
    - You can use `np.mean()` and `np.std()`
    - You can use `scipy.stats.skew()` for the skewness 
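
As a reference for what these functions compute, here is a small numpy-only sketch of the three moments. The helper `sample_moments` and the array `data` are hypothetical; the skewness is the biased estimator (third central moment divided by the cubed standard deviation), which should match the default behaviour of `scipy.stats.skew()`.

```python
import numpy as np

def sample_moments(values):
    """Return mean, standard deviation and skewness of a 1D array."""
    values = np.asarray(values, dtype=float)
    mean = np.mean(values)
    std = np.std(values)  # population standard deviation (ddof=0)
    # Third central moment normalized by std^3
    skewness = np.mean((values - mean) ** 3) / std ** 3
    return mean, std, skewness

# Made-up data with one large value on the right
data = np.array([1.0, 2.0, 3.0, 4.0, 10.0])
mean, std, skewness = sample_moments(data)
print("mean: %0.3f, std: %0.3f, skewness: %0.3f" % (mean, std, skewness))
```

A positive skewness indicates a longer tail to the right, while a Gaussian sample should give a value close to zero.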

In [None]:
plt.close('all'); fig, ax  = plt.subplots(figsize=(7, 4))
ax.hist(df['MPG']); ax.set_xlabel("MPG");

- Plot MPG as a function of weight, acceleration and horsepower
- **Q:** Describe qualitatively the type of relation between MPG and the other variables (proportional, inversely proportional, linear, polynomial, exponential, etc)
- **Q:** Compute the Pearson correlation coefficient $r$ between MPG and each variable
    - You can use `np.corrcoef(x, y)[0, 1]`
- **Q:** Which feature has the highest correlation with MPG? Which one has the lowest correlation?
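
Note that `np.corrcoef` returns the Pearson coefficient $r$ (with sign), and that it returns NaN if either array contains NaN. The sketch below handles both points on made-up arrays; `pearson_r`, `x_toy` and `y_toy` are hypothetical names.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation using only the pairs where both values are present."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    mask = ~(np.isnan(x) | np.isnan(y))
    return np.corrcoef(x[mask], y[mask])[0, 1]

# Made-up data: y decreases linearly with x, and one x value is missing
x_toy = np.array([1.0, 2.0, np.nan, 4.0, 5.0])
y_toy = np.array([10.0, 8.0, 7.0, 4.0, 2.0])

r = pearson_r(x_toy, y_toy)
print("r: %0.6f, r^2: %0.6f" % (r, r**2))
```

The sign of $r$ tells the direction of the relation, while $r^2$ is the fraction of variance explained by a straight-line fit.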

In [None]:
# np.corrcoef returns the Pearson coefficient r (square it to obtain r^2)
print("r: %0.6f" %(np.corrcoef(df["weight"].values, df["MPG"].values)[0, 1]))
# The following might be useful
# mask = ~np.isnan(df["horsepower"].values)

plt.close('all'); df.plot.scatter("weight", "MPG", figsize=(7, 4)); 
# Note that this is equivalent to
# plt.close('all'); fig, ax  = plt.subplots(figsize=(7, 4))
# ax.scatter(df["weight"].values, df["MPG"].values); 

### Single independent variable regression

- **Q:** Find the parameters of a simple linear model to predict MPG given weight using MLE assuming a Gaussian likelihood
- **Q:** Propose a polynomial basis of a given degree and find a better regressor for MPG given weight
- In both cases: Plot and study the residuals (errors) between data and your model
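
Under a Gaussian likelihood with constant variance, the MLE of the parameters coincides with the least squares solution, so a polynomial regressor can be fitted by solving a linear system on a polynomial design matrix. The sketch below does this with numpy on synthetic data; `fit_polynomial`, `x_toy` and `y_toy` are hypothetical, and the input is already scaled to $[-1, 1]$. Raw weights are in the thousands, so rescaling them before raising to powers is likely to help numerically.

```python
import numpy as np

# Synthetic quadratic data with Gaussian noise (not the car dataset)
rng = np.random.default_rng(0)
x_toy = np.linspace(-1.0, 1.0, 50)
y_toy = 1.0 - 2.0 * x_toy + 3.0 * x_toy**2 + rng.normal(scale=0.1, size=x_toy.size)

def fit_polynomial(x, y, degree):
    """Least squares fit of a polynomial; returns parameters and residuals."""
    # Design matrix with columns 1, x, x^2, ..., x^degree
    X = np.vander(x, degree + 1, increasing=True)
    theta, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ theta
    return theta, residuals

mse_by_degree = {}
for degree in (1, 2):
    theta, residuals = fit_polynomial(x_toy, y_toy, degree)
    mse_by_degree[degree] = np.mean(residuals**2)
    print("degree %d: MSE = %0.4f" % (degree, mse_by_degree[degree]))
```

Plotting `residuals` against `x_toy` for each degree shows whether a systematic trend remains, which suggests that the model is still too simple.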

In [None]:
x, y = df["weight"].values, df["MPG"].values
# OLS solution: with alpha=0 the ridge penalty vanishes and Ridge reduces to ordinary least squares
# (the `normalize` argument was removed in recent scikit-learn releases and is not needed here)
regressor = make_pipeline(PolynomialFeatures(1), Ridge(alpha=0.))
regressor.fit(x.reshape(-1, 1), y)
# The fit above is equivalent to either of the following
# X = np.stack((np.ones_like(x), x)).T # Dimension Nx2
# param, SSE, rank, singval = np.linalg.lstsq(X, y, rcond=None)
# param = np.dot(np.linalg.inv(np.dot(X.T, X)), np.dot(X.T, y))
print("theta0: "+repr(regressor.steps[1][1].intercept_))
print("thetaj: "+repr(regressor.steps[1][1].coef_[1:]))
print("MSE: "+repr(np.mean((y - regressor.predict(x.reshape(-1, 1)))**2)))
plt.close('all'); fig, ax = plt.subplots(figsize=(7, 4))
ax.plot(df["weight"].values, df["MPG"].values, '.', label='Data')
ax.set_xlabel('weight'); ax.set_ylabel('MPG'); x_eval = np.linspace(1500, 5100, 100)
ax.plot(x_eval, regressor.predict(x_eval.reshape(-1, 1)), linewidth=4, alpha=0.5, label="Model")
plt.legend(loc=0);

- **Q:** Find the parameters of a linear model to predict MPG from weight using the Bayesian approach assuming a Gaussian likelihood and Gaussian prior
- **Q:** Extend using the best polynomial found in the previous step

$$
p(y|X, \theta) =  \mathcal{N}(X\theta, I\sigma^2)
$$
$$
p(\theta) =  \mathcal{N}(0, I\sigma_0^2)
$$
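
If the noise level $\sigma$ is treated as known, this likelihood and prior give a Gaussian posterior in closed form:

$$
p(\theta|X, y) = \mathcal{N}(\mu_N, \Sigma_N), \quad \Sigma_N = \left(\frac{1}{\sigma^2} X^T X + \frac{1}{\sigma_0^2} I\right)^{-1}, \quad \mu_N = \frac{1}{\sigma^2} \Sigma_N X^T y
$$

The sketch below evaluates these expressions on synthetic data as a sanity check for the sampling-based solution. `gaussian_posterior` and the toy arrays are hypothetical, and fixing $\sigma$ is an assumption made here for simplicity; a full model would normally place a prior on it as well.

```python
import numpy as np

def gaussian_posterior(X, y, sigma, sigma0):
    """Posterior mean and covariance of theta for known noise std sigma."""
    n_params = X.shape[1]
    precision = X.T @ X / sigma**2 + np.eye(n_params) / sigma0**2
    cov = np.linalg.inv(precision)
    mean = cov @ X.T @ y / sigma**2
    return mean, cov

# Synthetic straight-line data (not the car dataset)
rng = np.random.default_rng(1)
x_toy = np.linspace(-1.0, 1.0, 40)
X_toy = np.stack((np.ones_like(x_toy), x_toy)).T  # Dimension Nx2
y_toy = 0.5 + 2.0 * x_toy + rng.normal(scale=0.2, size=x_toy.size)

# A very wide prior should recover the least squares solution
mean_wide, cov_wide = gaussian_posterior(X_toy, y_toy, sigma=0.2, sigma0=1e3)
# A narrow prior shrinks the parameters towards zero
mean_tight, _ = gaussian_posterior(X_toy, y_toy, sigma=0.2, sigma0=0.05)
ols, _, _, _ = np.linalg.lstsq(X_toy, y_toy, rcond=None)

print("OLS:         ", ols)
print("wide prior:  ", mean_wide)
print("narrow prior:", mean_tight)
```

The posterior mean has the same form as a ridge regression solution with penalty $\sigma^2/\sigma_0^2$, which is why a narrower prior acts as stronger regularization.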

References:
- https://docs.pymc.io/notebooks/GLM-linear.html
- https://www.chrisstucchio.com/blog/2017/bayesian_linear_regression.html

In [None]:
x, y = df["weight"].values, df["MPG"].values
with pm.Model() as my_linear_model:
    # Priors
    theta0 = pm.Normal('theta0', 0, sd=20)
    theta1 = pm.Normal('theta1', 0, sd=20)
    sigma = pm.HalfCauchy('sigma_noise', beta=10, testval=1.)
    # Likelihood
    likelihood = pm.Normal('y', mu=theta0 + theta1 * x, sd=sigma, observed=y)
    # Trace
    trace = pm.sample(draws=5000, tune=1000, init='advi', n_init=30000, 
                      cores=4, chains=2, live_plot=False)

In [None]:
pm.summary(trace)

In [None]:
plt.close('all'); pm.traceplot(trace, figsize=(8, 6));

In [None]:
plt.close('all'); fig, ax = plt.subplots(figsize=(7, 4))
ax.plot(df["weight"].values, df["MPG"].values, '.', label='Data')
ax.set_xlabel('weight'); ax.set_ylabel('MPG');
pm.plot_posterior_predictive_glm(trace, samples=100, label='Posterior predictive lines', c='r', alpha=0.2,
                                 lm= lambda x, sample: sample['theta0'] + sample['theta1'] *x,
                                 eval=np.linspace(1500, 5100, 100))
plt.legend(loc=0);

### Multiple independent variables

- **Q:** Find the parameters of a linear model to predict MPG from weight and acceleration assuming a Gaussian likelihood. Use both the frequentist and the Bayesian approach
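
For the frequentist part, one possible approach is to standardize each feature, solve the least squares problem and map the parameters back to the original units. The sketch below illustrates this on synthetic data; all names and the generating coefficients are hypothetical and are not estimates from the car dataset. Standardizing is a choice made here because the two features have very different scales, and it may also help the sampler in the Bayesian version.

```python
import numpy as np

# Synthetic data with two features on very different scales
rng = np.random.default_rng(2)
n = 200
weight_like = rng.uniform(1500.0, 5000.0, size=n)
accel_like = rng.uniform(8.0, 25.0, size=n)
y_toy = 45.0 - 0.007 * weight_like + 0.2 * accel_like + rng.normal(scale=1.0, size=n)

# Standardize the features: zero mean and unit standard deviation
features = np.stack((weight_like, accel_like)).T  # Dimension Nx2
mu, sd = features.mean(axis=0), features.std(axis=0)
Z = (features - mu) / sd

# Least squares on the design matrix [1, z1, z2]
X = np.hstack((np.ones((n, 1)), Z))
theta_std, _, _, _ = np.linalg.lstsq(X, y_toy, rcond=None)

# Map the parameters back to the original units
slopes = theta_std[1:] / sd
intercept = theta_std[0] - np.sum(slopes * mu)
print("intercept: %0.3f, slopes: %s" % (intercept, slopes))
```

The Bayesian version can follow the same structure as the single-variable model, with one additional coefficient and prior per feature.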

Follow-up: http://twiecki.github.io/blog/2013/08/12/bayesian-glms-1/