# Linear Regression Exercise

## Use linear regression on the *bodyfat* dataset

We are going to follow the same pipeline as in the lab session, but now with a simple dataset.

- [ ] Visualize your dataset. Does it have anything strange?
- [ ] Split your dataset into train and test sets.
- [ ] Design a preprocessing step for your dataset and apply it to your partitions.
- [ ] Train a Linear Regression model.
- [ ] Train a Ridge Regression model with cross-validation.  
- [ ] Train a Lasso Regression model with cross-validation.  
- [ ] Compare your three models using cross-validation metrics and by inspecting their weights. Do they show strong differences?
- [ ] Compute the generalization performance of the best model.

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns

pd.set_option("display.precision", 3)
np.set_printoptions(precision=3)

In [None]:
bodyfat_data = pd.read_csv(
    "bodyfatdata.txt", sep=r"\s+", names=["triceps", "thigh", "midarm", "bodyfat"]
)

N = bodyfat_data.shape[0]
bodyfat_data.describe()
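
The checklist steps from splitting to model comparison could be sketched roughly as follows. This is a minimal sketch rather than a reference solution: the helper name `compare_linear_models`, the 80/20 split, the alpha grid, the 5-fold cross-validation and the choice of R² as metric are all assumptions made for illustration. Scaling is placed inside a pipeline so that each cross-validation fold is standardized using only its own training part.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LassoCV, LinearRegression, RidgeCV
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def compare_linear_models(X, y, cv=5, random_state=42):
    """Hypothetical helper: fit OLS, Ridge and Lasso, report CV R^2 and weights."""
    # Hold out a test set that is only used once, for the selected model.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=random_state
    )
    alphas = np.logspace(-3, 2, 30)  # assumed regularization grid
    models = {
        "linear": LinearRegression(),
        "ridge": RidgeCV(alphas=alphas),
        "lasso": LassoCV(alphas=alphas, cv=cv, max_iter=10_000),
    }
    rows, fitted = {}, {}
    for name, model in models.items():
        pipe = make_pipeline(StandardScaler(), model)
        # Cross-validated R^2 on the training partition only.
        scores = cross_val_score(pipe, X_train, y_train, cv=cv, scoring="r2")
        pipe.fit(X_train, y_train)
        fitted[name] = pipe
        # pipe[-1] is the fitted regressor; its weights refer to scaled features.
        weights = dict(zip(X.columns, pipe[-1].coef_))
        rows[name] = {"cv_r2": scores.mean(), **weights}
    summary = pd.DataFrame(rows).T
    best = summary["cv_r2"].idxmax()
    # Generalization estimate of the model with the best CV score.
    test_r2 = fitted[best].score(X_test, y_test)
    return summary, best, test_r2
```

Assuming the cell above has been run, it could be called as `compare_linear_models(bodyfat_data.drop(columns="bodyfat"), bodyfat_data["bodyfat"])`. With a dataset this small, the fold scores are likely to be noisy, so differences between models should be read with care.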

## Advanced Exercise: Try to improve the results of the lab session

Some points from the lab session could be improved.
- [ ] Linear regression is strongly affected by outliers. Design a strategy for removing outliers. Does it improve the validation metrics with respect to the best ones from the lab?
- [ ] We have (almost) completely ignored our missing values. Design a strategy for handling missing values. Does it improve the validation metrics with respect to the best ones from the lab?
- [ ] Some variables do not look Gaussian. Transforming them could improve your model performance. Does this improve the validation metrics and/or the generalization of the best model?
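
As one possible starting point for the outlier and transformation items, the sketch below flags rows outside the 1.5 × IQR fences and applies `log1p` to strongly right-skewed, non-negative columns. The helper names, the factor `k=1.5` and the skewness threshold of 1.0 are assumptions for illustration, not the strategy the lab prescribes. The statistics should be computed on the training partition only.

```python
import numpy as np
import pandas as pd


def iqr_inlier_mask(train, columns, k=1.5):
    """Hypothetical helper: True for rows within [Q1 - k*IQR, Q3 + k*IQR] on every column."""
    q1 = train[columns].quantile(0.25)
    q3 = train[columns].quantile(0.75)
    iqr = q3 - q1
    above_lower = train[columns].ge(q1 - k * iqr, axis="columns")
    below_upper = train[columns].le(q3 + k * iqr, axis="columns")
    # Missing values are not counted as outliers; they are left for imputation.
    return ((above_lower & below_upper) | train[columns].isna()).all(axis=1)


def log_transform_skewed(train, columns, threshold=1.0):
    """Hypothetical helper: log1p non-negative columns whose skewness exceeds threshold."""
    out = train.copy()
    transformed = []
    for col in columns:
        if out[col].min() >= 0 and out[col].skew() > threshold:
            out[col] = np.log1p(out[col])
            transformed.append(col)
    return out, transformed
```

The mask can be used as `train[iqr_inlier_mask(train, numeric_columns)]`, where `numeric_columns` is whatever list of numeric features you choose. The columns returned in `transformed` must receive the same `log1p` on the validation and test partitions.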

In [None]:
life_expentancy_data = pd.read_csv("Life_Expectancy_Data.csv")
# We remove spaces and symbols to avoid problems with statsmodels GLM
life_expentancy_data.columns = [
    c.lower().strip().replace(" ", "_").replace("/", "_").replace("-", "_")
    for c in life_expentancy_data.columns
]

# We change the type of categorical variables into category
categorical_columns = list(
    life_expentancy_data.dtypes[life_expentancy_data.dtypes == "O"].index.values
)
for column in categorical_columns:
    life_expentancy_data[column] = life_expentancy_data[column].astype("category")

life_expentancy_data.head()
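
For the missing-values item, a simple baseline is to measure how much is missing per column and fill numeric gaps with the training median. This is a minimal sketch under stated assumptions: the helper names are hypothetical, median imputation is only one of several reasonable strategies, and categorical columns are left untouched.

```python
import pandas as pd


def missing_report(df):
    """Hypothetical helper: fraction of missing values per column, largest first."""
    return df.isna().mean().sort_values(ascending=False)


def impute_with_train_median(train, test):
    """Hypothetical helper: fill numeric NaNs with medians learned on the training split."""
    numeric_cols = train.select_dtypes(include="number").columns
    # Medians come from the training data only, to avoid leaking test information.
    medians = train[numeric_cols].median()
    train_out, test_out = train.copy(), test.copy()
    train_out[numeric_cols] = train[numeric_cols].fillna(medians)
    test_out[numeric_cols] = test[numeric_cols].fillna(medians)
    return train_out, test_out, medians
```

Calling `missing_report(life_expentancy_data)` first shows which columns are affected. Columns with a very high missing fraction may be better dropped than imputed.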