**NOTE: This notebook is written for the Google Colab platform. However, it can also be run (possibly with minor modifications) as a standard Jupyter notebook.**



In [None]:
#@title -- Installation of Packages -- { display-mode: "form" }
import sys
!{sys.executable} -m pip install git+https://github.com/michalgregor/class_utils.git

In [None]:
#@title -- Import of Necessary Packages -- { display-mode: "form" }
from sklearn.preprocessing import StandardScaler, OrdinalEncoder, KBinsDiscretizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import (mean_squared_error,
                             mean_absolute_error)
from sklearn.linear_model import LinearRegression
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from class_utils import corr_heatmap, error_histogram
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

In [None]:
#@title -- Downloading Data -- { display-mode: "form" }
from class_utils.download import download_file_maybe_extract
download_file_maybe_extract("https://www.dropbox.com/s/8s0ivlo9yshhxkn/winequality.zip?dl=1", directory="data")

# also create a directory for storing any outputs
import os
os.makedirs("output", exist_ok=True)

### Linear Regression and Wine Quality

In this example, we will apply linear regression to a dataset describing the quality of white wine.

We will load the dataset from a CSV file:



In [None]:
df = pd.read_csv("data/winequality-white.csv")
df.head()

#### Does the Dataset Contain Linear Relationships?

To find out whether the dataset contains linear relationships that we can model using linear regression, we will display the correlation matrix. A strong positive correlation indicates a clear linear relationship in which both variables increase together. A strong negative correlation also indicates a linear relationship, except that one variable decreases as the other increases. Some elements of the correlation matrix are white: this means that the correlation was not statistically significant, so the corresponding values are not informative.
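To make the meaning of the correlation coefficient concrete, here is a minimal NumPy sketch on synthetic data (not the wine dataset). The helper `pearson_r` and all variable names are illustrative assumptions. It shows a value close to +1 for a rising linear relationship, a value close to -1 for a falling one, and a value close to 0 for a relationship that is strong but not linear.

```python
import numpy as np

# Deterministic, symmetric inputs so that the quadratic case is exact.
x = np.linspace(-3.0, 3.0, 501)
rng = np.random.default_rng(0)
noise = 0.1 * rng.normal(size=x.shape)

y_increasing = 2.0 * x + noise    # linear, rising with x
y_decreasing = -2.0 * x + noise   # linear, falling with x
y_quadratic = x ** 2              # strong dependence, but not linear


def pearson_r(a, b):
    """Pearson correlation coefficient between two 1-D arrays."""
    return float(np.corrcoef(a, b)[0, 1])


r_increasing = pearson_r(x, y_increasing)
r_decreasing = pearson_r(x, y_decreasing)
r_quadratic = pearson_r(x, y_quadratic)

print(f"increasing: {r_increasing:+.3f}")
print(f"decreasing: {r_decreasing:+.3f}")
print(f"quadratic:  {r_quadratic:+.3f}")
```

The quadratic case is a reminder that a correlation near zero rules out a linear relationship only, not a relationship in general.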



In [None]:
plt.figure(figsize=(10, 8))
corr_heatmap(df)
plt.savefig("output/wine_corr_matrix.pdf", bbox_inches="tight", pad_inches=0)

As the plot shows, there is a relatively strong correlation between the variables `density` and `residual sugar`. Both are also more weakly correlated with several other variables. We can therefore attempt to predict `density` from all the other variables using linear regression.

If we wanted to predict wine quality, linear regression would probably not be our best bet: the only correlation that has any strength is that with the amount of alcohol.

### Preprocessing

Let us now split the dataset into training and test sets, stratifying by `density`, and apply the standard preprocessing.
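Stratifying by a continuous column requires discretizing it first, which is why the split uses `KBinsDiscretizer`. The following NumPy-only sketch shows quantile binning into six bins on synthetic values (a stand-in, not the actual `density` column). It is assumed here to mirror the default `quantile` strategy of `KBinsDiscretizer` in spirit, not in every detail.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for a continuous target column (not the wine data).
values = rng.normal(loc=0.994, scale=0.003, size=600)

n_bins = 6
# Bin edges at the 0, 1/6, ..., 1 quantiles, so each bin holds roughly
# the same number of samples.
edges = np.quantile(values, np.linspace(0.0, 1.0, n_bins + 1))
# Assign each value a bin index in 0..n_bins-1 using the inner edges only.
bin_index = np.digitize(values, edges[1:-1])

counts = np.bincount(bin_index, minlength=n_bins)
print(counts)
```

Bin labels of this kind are what gets passed as `stratify` to `train_test_split`, so that the training and test sets cover the range of the target in similar proportions.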



In [None]:
#@title -- Dataset Splitting: df_train, df_test -- { display-mode: "form" }
kbins = KBinsDiscretizer(6, encode='ordinal')
y_stratify = kbins.fit_transform(df[['density']])
df_train, df_test = train_test_split(df, stratify=y_stratify,
                                 test_size=0.3, random_state=4)

In [None]:
df.head()

---
### Task 1: Column Types

**List the categorical and numeric columns that should be used as inputs below.**

---


In [None]:
categorical_inputs = [           ]  # ----

numeric_inputs = [               ]  # ----

output = ['density']

In [None]:
#@title -- Our Standard Preprocessing: X_train, Y_train, X_test, Y_test -- { display-mode: "form" }
input_preproc = make_column_transformer(
    (make_pipeline(
        SimpleImputer(strategy="most_frequent"),
        OrdinalEncoder()),
     categorical_inputs),
    
    (make_pipeline(
        SimpleImputer(),
        StandardScaler()),
     numeric_inputs)
)

X_train = input_preproc.fit_transform(df_train)
Y_train = df_train[output].values

X_test = input_preproc.transform(df_test)
Y_test = df_test[output].values
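
The preprocessing above calls `fit_transform` on the training data but only `transform` on the test data. A minimal NumPy sketch of that pattern for standardization, using synthetic single-feature data (all names here are illustrative assumptions): the mean and standard deviation are estimated on the training set only and then reused for the test set.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic single-feature data (hypothetical, not the wine dataset).
x_train = rng.normal(loc=5.0, scale=2.0, size=(200, 1))
x_test = rng.normal(loc=5.0, scale=2.0, size=(50, 1))

# "fit": estimate the statistics on the training data only
mean = x_train.mean(axis=0)
std = x_train.std(axis=0)  # population standard deviation (ddof=0)

# "transform": apply the same statistics to both sets
x_train_scaled = (x_train - mean) / std
x_test_scaled = (x_test - mean) / std

print("train:", x_train_scaled.mean(), x_train_scaled.std())
print("test: ", x_test_scaled.mean(), x_test_scaled.std())
```

The scaled training data has zero mean and unit variance by construction. The scaled test data only approximately so, which is expected: no information from the test set leaks into the preprocessing.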

#### Parameter Fitting

We will use the training data to fit the linear model:



In [None]:
model = LinearRegression()
model = model.fit(X_train, Y_train)
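
`LinearRegression` fits its parameters by ordinary least squares. As a rough illustration of what that means, the sketch below solves the same kind of problem with `np.linalg.lstsq` on synthetic data. The true weights, the noise level and all variable names are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic regression problem with known parameters (hypothetical).
X = rng.normal(size=(300, 2))
true_w = np.array([1.5, -0.7])
true_b = 0.3
y = X @ true_w + true_b + 0.01 * rng.normal(size=300)

# Append a column of ones so the intercept is learned as an extra weight.
X_aug = np.hstack([X, np.ones((X.shape[0], 1))])

# Find the parameters minimizing the sum of squared residuals
# ||X_aug @ params - y||^2.
params, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
w_hat, b_hat = params[:-1], params[-1]

print("weights:  ", w_hat)
print("intercept:", b_hat)
```

With this little noise, the estimated weights and intercept land very close to the values used to generate the data.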

#### Testing

We evaluate the model on the test data:



In [None]:
#@title -- Testing -- { display-mode: "form" }
# predictions for the test inputs (y_test holds predictions, Y_test the true targets)
y_test = model.predict(X_test)

# we compute and display the MSE and the MAE
mse = mean_squared_error(Y_test, y_test)
print("MSE = {}".format(mse))

mae = mean_absolute_error(Y_test, y_test)
print("MAE = {}".format(mae))

plt.figure(figsize=(8, 6))
error_histogram(Y_test, y_test, Y_fit_scaling=Y_train)
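
For reference, the two metrics reported above can be written out by hand. This is a small NumPy sketch with made-up numbers, only meant to show the definitions: MSE averages the squared errors, MAE averages the absolute errors.

```python
import numpy as np

# Tiny hand-made example (hypothetical values).
y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.0, 2.0, 5.0])

errors = y_true - y_pred           # [0, 0, -2]
mse = np.mean(errors ** 2)         # (0 + 0 + 4) / 3
mae = np.mean(np.abs(errors))      # (0 + 0 + 2) / 3

print("MSE = {}".format(mse))
print("MAE = {}".format(mae))
```

Because the errors are squared, a single large error affects the MSE more strongly than the MAE.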

#### Using Only `residual sugar` as Input



In [None]:
categorical_inputs = []
numeric_inputs = ["residual sugar"]
output = ['density']

In [None]:
#@title -- Our Standard Preprocessing: X_train, Y_train, X_test, Y_test -- { display-mode: "form" }
input_preproc = make_column_transformer(
    (make_pipeline(
        SimpleImputer(strategy="most_frequent"),
        OrdinalEncoder()),
     categorical_inputs),
    
    (make_pipeline(
        SimpleImputer(),
        StandardScaler()),
     numeric_inputs)
)

X_train = input_preproc.fit_transform(df_train)
Y_train = df_train[output].values

X_test = input_preproc.transform(df_test)
Y_test = df_test[output].values

In [None]:
model = LinearRegression()
model = model.fit(X_train, Y_train)

In [None]:
#@title -- Testing -- { display-mode: "form" }
# predictions for the test inputs (y_test holds predictions, Y_test the true targets)
y_test = model.predict(X_test)

# we compute and display the MSE and the MAE
mse = mean_squared_error(Y_test, y_test)
print("MSE = {}".format(mse))

mae = mean_absolute_error(Y_test, y_test)
print("MAE = {}".format(mae))

plt.figure(figsize=(8, 6))
error_histogram(Y_test, y_test, Y_fit_scaling=Y_train)

As we can see, the errors are noticeably larger in this case. It seems that the other columns carry information about density that cannot be extracted from the `residual sugar` column alone.
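The same effect can be reproduced on a toy problem. In the sketch below, which uses synthetic data and a hypothetical helper `fit_and_score`, the target depends on two inputs. A least-squares model that sees only one of them cannot account for the contribution of the other, so its test error is higher.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
# The target depends on both inputs, plus a little noise.
y = 1.0 * x1 + 0.5 * x2 + 0.1 * rng.normal(size=n)

# Simple holdout split: first 700 samples for fitting, the rest for testing.
train, test = slice(0, 700), slice(700, n)


def fit_and_score(features):
    """Fit least squares on the given feature columns, return the test MSE."""
    X = np.column_stack(features + [np.ones(n)])  # ones column = intercept
    params, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
    residuals = y[test] - X[test] @ params
    return float(np.mean(residuals ** 2))


mse_one = fit_and_score([x1])
mse_both = fit_and_score([x1, x2])

print("MSE with x1 only:   {}".format(mse_one))
print("MSE with x1 and x2: {}".format(mse_both))
```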

