In [None]:
%matplotlib notebook
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
import matplotlib.pyplot as plt

# Practical activity

In this activity you will explore and analyze a dataset of car fuel efficiency, measured in miles per gallon (MPG), for different brands and types of cars.

We will use data from the [UCI](https://archive.ics.uci.edu/ml/) repository.

We will use the following libraries: 

- Python [pandas](https://pandas.pydata.org) to handle data tables
- matplotlib for visualization
- Scikit-learn for the modeling

## Instructions

1. Follow the steps in this notebook, complete the activities and answer the questions marked with a **Q**.
1. Pre-process the data and inspect it.
1. Train a linear model to predict MPG as a function of other relevant features.
1. Work in groups of two.
1. Upload this notebook with your answers to GitHub using a private repository.

## Pre-process the data

- Import the table as a pandas dataframe and explore it
- **Q:** How many features and samples are in the table?
- **Q:** Which variables are continuous and which are categorical?

In this dataset:

- MPG (Miles per gallon) is the variable we want to model (dependent variable) 
- car_name is the index column
- In the original table (auto-mpg.data) missing values are written as "?"; pandas converts them to NaN because we pass `na_values="?"`
    - **Q:** What features have missing values?
    - **Q:** Give two cars with missing values
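
As a hedged illustration of how `na_values="?"` behaves and how missing entries can be located, here is a minimal sketch on a tiny table. The rows and car names below are invented for this example (they are not taken from auto-mpg.data), and only a subset of the columns is used.

```python
import io
import pandas as pd

# Toy rows invented for illustration only; "?" marks a missing entry.
raw = io.StringIO(
    '20.0 4 100.0 ? "toy car a"\n'
    '30.0 4 90.0 70.0 "toy car b"\n'
    '15.0 8 ? 150.0 "toy car c"\n'
)
toy = pd.read_csv(
    raw, sep=r"\s+", na_values="?", index_col="car_name",
    names=["MPG", "cylinders", "displacement", "horsepower", "car_name"],
)

print(toy.shape)    # (number of samples, number of columns excluding the index)
print(toy.dtypes)   # columns containing NaN are stored as float

# Number of missing values in each column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Rows (cars) that have at least one missing value
rows_with_missing = toy[toy.isna().any(axis=1)]
print(rows_with_missing)
```

The same calls (`shape`, `dtypes`, `isna()`) should apply unchanged to the full dataframe once it is loaded.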

In [None]:
# Use help(pd.read_csv) to understand the parameters of read_csv

# sep=r"\s+" splits on runs of whitespace (delim_whitespace is deprecated in recent pandas)
df = pd.read_csv("data/auto-mpg.data", sep=r"\s+", index_col="car_name", na_values="?",
                 names=["MPG", "cylinders", "displacement", "horsepower",
                        "weight", "acceleration", "model year", "origin", "car_name"])
df.head()

# You can grab a particular column as: df["MPG"]
# You can obtain a numpy array from it as: df["MPG"].values

## Data inspection

- Inspect the histogram of MPG
- **Q:** Is MPG normally (Gaussian) distributed? Why?
- **Q:** Compute and report the mean, standard deviation and skewness of MPG (these are derived from its first three statistical moments)
    - You can use `np.mean()` and `np.std()`
    - You can use `scipy.stats.skew()` for the skewness 
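
A minimal sketch of these three statistics, computed on a synthetic right-skewed sample rather than on the real MPG column (the gamma distribution and its parameters are arbitrary choices made for this example). It also checks `scipy.stats.skew()` against the definition of skewness as the third standardized moment.

```python
import numpy as np
from scipy import stats

# Synthetic, right-skewed sample used as a stand-in for a real column
rng = np.random.default_rng(0)
sample = rng.gamma(shape=2.0, scale=5.0, size=5000)

mean = np.mean(sample)
std = np.std(sample)           # ddof=0 by default; pass ddof=1 for the unbiased variance estimate
skewness = stats.skew(sample)  # biased estimator by default (bias=True)

# Skewness by hand: third standardized moment
manual_skew = np.mean(((sample - mean) / std) ** 3)

print(f"mean={mean:.3f}  std={std:.3f}  skew={skewness:.3f}  manual={manual_skew:.3f}")
```

A symmetric distribution such as the Gaussian has zero skewness, so a clearly non-zero value is one piece of evidence against normality.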

- Plot MPG as a function of weight, acceleration and horsepower
- **Q:** Describe qualitatively the type of relation between MPG and the other variables (proportional, inversely proportional, linear, polynomial, exponential, etc)
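- **Q:** Compute the Pearson correlation coefficient $r$ and its square $r^2$ for each case (you can use `np.corrcoef(x, y)`, which returns the $2 \times 2$ matrix of $r$ values; the off-diagonal entry is the one you need)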
- **Q:** Which feature has the highest correlation with MPG? Which one has the lowest correlation?
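
One practical detail: `np.corrcoef` returns NaN if either input contains a missing value, so incomplete rows have to be masked out first. The sketch below shows this on two short made-up arrays (the numbers are not from the dataset).

```python
import numpy as np

# Made-up values; x has one missing entry
x = np.array([1.0, 2.0, np.nan, 4.0, 5.0])
y = np.array([9.0, 7.5, 6.0, 4.0, 2.0])

# With a NaN present the result is NaN
r_with_nan = np.corrcoef(x, y)[0, 1]
print(r_with_nan)

# Keep only the positions where both arrays are valid
mask = ~np.isnan(x) & ~np.isnan(y)
r = np.corrcoef(x[mask], y[mask])[0, 1]  # off-diagonal entry is r
r_squared = r ** 2

print(f"r={r:.3f}  r^2={r_squared:.3f}")
```

The sign of $r$ tells you the direction of the relation, while $r^2$ only measures its strength, so it is worth reporting both.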

## Single independent variable regression

- **Q:** Find the parameters of a simple linear model to predict MPG given weight using MLE assuming a Gaussian likelihood
- **Q:** Propose a polynomial basis of a given degree and find the best regressor for MPG given weight
- **Q:** Find the MAP parameters of a linear model to predict MPG from weight assuming a Gaussian likelihood and Gaussian prior (ridge regression)
- **Q:** Extend the ridge regression model using the best polynomial basis found above
- In all cases: Plot and study the residuals (errors) between data and your model
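
The steps above can be sketched on synthetic data as follows. This is only an outline under stated assumptions: `x` and `y` are generated here as stand-ins for weight and MPG, the degree (2) and the regularization strength (`alpha=0.1`) are arbitrary placeholders you are expected to tune, and the `StandardScaler` step is an addition not required by the activity (it keeps the ridge penalty comparable across polynomial terms). With Gaussian noise the MLE solution coincides with ordinary least squares, and adding a zero-mean Gaussian prior on the weights yields ridge regression.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in data (NOT the real dataset): a decreasing, slightly curved trend
rng = np.random.default_rng(0)
x = rng.uniform(1.5, 5.0, size=200)
y = 60.0 - 16.0 * x + 1.5 * x**2 + rng.normal(0.0, 1.0, size=200)
X = x.reshape(-1, 1)  # scikit-learn expects a 2D feature matrix

# MLE with Gaussian likelihood = ordinary least squares
mle = LinearRegression().fit(X, y)

# Same solution from the least-squares problem with an explicit bias column
design = np.column_stack([np.ones_like(x), x])
w_hat, *_ = np.linalg.lstsq(design, y, rcond=None)

# MAP with Gaussian prior = ridge regression, here on a polynomial basis
map_poly = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),  # degree is a placeholder
    StandardScaler(),                                  # assumption: scale before penalizing
    Ridge(alpha=0.1),                                  # alpha is a placeholder
).fit(X, y)

# Residuals between data and each model
res_mle = y - mle.predict(X)
res_map = y - map_poly.predict(X)
mse_mle = np.mean(res_mle**2)
mse_map = np.mean(res_map**2)
print(f"MSE linear MLE: {mse_mle:.3f}   MSE polynomial ridge: {mse_map:.3f}")
```

Plotting `res_mle` and `res_map` against `x` (for example with `plt.scatter`) shows whether a systematic pattern remains; residuals that look like structureless noise around zero suggest the model has captured the trend.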