# Red wine - Regression problem

In [None]:
import pandas as pd
import numpy as np 

In [None]:
df = pd.read_csv('/content/winequality_red.csv', sep=';')
df

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.4,0.700,0.00,1.9,0.076,11.0,34.0,0.99780,3.51,0.56,9.4,5
1,7.8,0.880,0.00,2.6,0.098,25.0,67.0,0.99680,3.20,0.68,9.8,5
2,7.8,0.760,0.04,2.3,0.092,15.0,54.0,0.99700,3.26,0.65,9.8,5
3,11.2,0.280,0.56,1.9,0.075,17.0,60.0,0.99800,3.16,0.58,9.8,6
4,7.4,0.700,0.00,1.9,0.076,11.0,34.0,0.99780,3.51,0.56,9.4,5
...,...,...,...,...,...,...,...,...,...,...,...,...
1594,6.2,0.600,0.08,2.0,0.090,32.0,44.0,0.99490,3.45,0.58,10.5,5
1595,5.9,0.550,0.10,2.2,0.062,39.0,51.0,0.99512,3.52,0.76,11.2,6
1596,6.3,0.510,0.13,2.3,0.076,29.0,40.0,0.99574,3.42,0.75,11.0,6
1597,5.9,0.645,0.12,2.0,0.075,32.0,44.0,0.99547,3.57,0.71,10.2,5


This can be treated as a regression problem, where the outcome variable is `quality`.




In [None]:
from sklearn.linear_model import LinearRegression

`sklearn` expects two inputs:

*   `X`, the dataframe with the predictors.
*   `y`, the outcome variable (here a pandas Series, i.e. a single column).



In [None]:
X = df.drop('quality', axis=1)  # axis=1 tells pandas that 'quality' is a column (and not a row)
y = df['quality']

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=45)
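To make the effect of `test_size=0.2` concrete, here is a minimal sketch on a made-up ten-row table. The names `toy`, `X_toy`, `y_toy` and the values are hypothetical stand-ins, not the wine data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for the wine table (10 rows, 2 predictors).
toy = pd.DataFrame({
    "alcohol": np.linspace(9.0, 13.5, 10),
    "pH": np.linspace(3.0, 3.9, 10),
    "quality": [5, 5, 6, 5, 6, 6, 7, 6, 7, 7],
})
X_toy = toy.drop("quality", axis=1)
y_toy = toy["quality"]

# 20% of the rows go to the test set, the remaining 80% to the training set.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=45
)

print(X_tr.shape, X_te.shape)  # (8, 2) (2, 2)
```

Fixing `random_state` makes the split reproducible: rerunning the cell gives the same rows in each part.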

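As can be seen in the table above, the predictors have quite different scales. This may negatively impact the predictive performance of a model, especially a regularized or distance-based one; for plain linear regression it mainly makes the coefficients hard to compare.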
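One way to check the scales is to summarize each column. The sketch below uses only the first four rows of three columns, copied from the table printed above, so the numbers illustrate the idea rather than describe the full dataset. In the notebook itself, `X_train.describe()` would give a fuller summary.

```python
import pandas as pd

# First four rows of three columns, copied from the table printed above.
head = pd.DataFrame({
    "chlorides": [0.076, 0.098, 0.092, 0.075],
    "total sulfur dioxide": [34.0, 67.0, 54.0, 60.0],
    "density": [0.9978, 0.9968, 0.9970, 0.9980],
})

# One row per statistic, one column per predictor.
summary = head.agg(["min", "max", "mean"])
print(summary)
```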

To solve this problem, we can standardize all predictors so that they are on a similar scale.

For that, we need the `StandardScaler` class.

In [None]:
from sklearn.preprocessing import StandardScaler
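As a quick illustration of what `StandardScaler` does, the sketch below standardizes two columns taken from the first four rows of the table above (`free sulfur dioxide` and `chlorides`). The names `X_demo` and `scaler_demo` are made up for this example.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Column 0: free sulfur dioxide, column 1: chlorides (first four rows of the table).
X_demo = np.array([
    [11.0, 0.076],
    [25.0, 0.098],
    [15.0, 0.092],
    [17.0, 0.075],
])

scaler_demo = StandardScaler()
X_scaled = scaler_demo.fit_transform(X_demo)  # (x - column mean) / column std

print(scaler_demo.mean_)      # column means learned during fit
print(X_scaled.mean(axis=0))  # approximately 0 for each column
print(X_scaled.std(axis=0))   # approximately 1 for each column
```

After scaling, both columns have mean 0 and standard deviation 1, even though the raw values differ by two orders of magnitude.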

`sklearn` lets us group sequential steps in a pipeline.

The class is `Pipeline`.

In [None]:
from sklearn.pipeline import Pipeline

`'scale'` below is just the name of the step.

In [None]:
scaler = Pipeline([
    ('scale', StandardScaler())
])

We now need to apply the scaler to the predictor columns.

For that, we need the `ColumnTransformer` class.

In [None]:
from sklearn.compose import ColumnTransformer

preprocessor = ColumnTransformer([
    ('scale2', scaler, X.columns.to_list())  # 'scale2' is the name given to the scaler pipeline inside the ColumnTransformer;
    # this name can be whatever we like
],
remainder='passthrough'  # columns NOT listed above are passed through unchanged (every column is listed here, so this has no effect)
)
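The role of `remainder` is easier to see when only some columns are listed. The following sketch is an illustration on a made-up two-column frame (the names `demo`, `ct_pass` and `ct_drop` are hypothetical): only `alcohol` is scaled, and `remainder` decides what happens to `pH`.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Two columns with values copied from the first four rows of the table above.
demo = pd.DataFrame({
    "alcohol": [9.4, 9.8, 9.8, 9.8],
    "pH": [3.51, 3.20, 3.26, 3.16],
})

# Scale only 'alcohol'; keep the unlisted column ('pH') unchanged.
ct_pass = ColumnTransformer(
    [("scale_alcohol", StandardScaler(), ["alcohol"])],
    remainder="passthrough",
)
out_pass = ct_pass.fit_transform(demo)

# Same transformer, but with the default remainder='drop': 'pH' is discarded.
ct_drop = ColumnTransformer(
    [("scale_alcohol", StandardScaler(), ["alcohol"])],
    remainder="drop",
)
out_drop = ct_drop.fit_transform(demo)

print(out_pass.shape)  # (4, 2): scaled alcohol, then untouched pH
print(out_drop.shape)  # (4, 1): scaled alcohol only
```

The transformed columns come first in the output and the passed-through columns follow them.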

Now, everything is ready to apply the linear regression model. 

We will use a pipeline with two steps:

*   preprocessing step
*   linear regression



In the steps of a pipeline, the strings in quotes are step names, and they can be whatever we want.

In [None]:
pipe = Pipeline([
    ('pre', preprocessor),
    ('lm', LinearRegression())
    ])

pipe.fit(X_train, y_train)
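The step names are what let us reach inside a fitted pipeline afterwards. Below is a minimal sketch on synthetic data (the data and the names `X_demo`, `y_demo`, `demo_pipe` are assumptions made for illustration) that retrieves the fitted regression through its step name.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic predictors on three very different scales (not the wine data).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3)) * np.array([1.0, 10.0, 100.0])
y_demo = 5.0 + X_demo @ np.array([0.5, -0.02, 0.003])

demo_pipe = Pipeline([
    ("pre", StandardScaler()),
    ("lm", LinearRegression()),
])
demo_pipe.fit(X_demo, y_demo)

# Access the fitted steps by the names chosen above.
lm = demo_pipe.named_steps["lm"]
print(lm.coef_)       # one coefficient per standardized predictor
print(lm.intercept_)  # equals the mean of y_demo, since the predictors are centered
```

With the notebook's pipeline the same idea would read `pipe.named_steps['lm'].coef_`. Because the predictors are standardized, each coefficient is the change in the prediction per one standard deviation of that predictor.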

In [None]:
y_preds = pipe.predict(X_train)

In [None]:
y_preds

array([6.91899754, 6.16561289, 5.17680958, ..., 5.21043574, 5.11569022,
       6.35907895])
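The predictions above are computed on the training set, so they do not show how the model behaves on unseen data. A common next step is to compare an error metric on the training and test sets. The sketch below shows one way to do this with RMSE and R², on synthetic data so that it runs on its own; the data and the names `X_syn`, `y_syn`, `syn_pipe` are hypothetical and not part of the notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data (not the wine data): 200 rows, 4 predictors.
rng = np.random.default_rng(45)
X_syn = rng.normal(size=(200, 4))
y_syn = 5.5 + X_syn @ np.array([0.4, -0.3, 0.2, 0.0]) + rng.normal(scale=0.5, size=200)

Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=45
)

syn_pipe = Pipeline([
    ("pre", StandardScaler()),
    ("lm", LinearRegression()),
])
syn_pipe.fit(Xs_train, ys_train)

# Compute the same metrics on both parts of the split.
metrics = {}
for name, X_part, y_part in [("train", Xs_train, ys_train), ("test", Xs_test, ys_test)]:
    preds = syn_pipe.predict(X_part)
    rmse = np.sqrt(mean_squared_error(y_part, preds))
    metrics[name] = {"rmse": rmse, "r2": r2_score(y_part, preds)}
    print(f"{name}: RMSE={rmse:.3f}, R2={metrics[name]['r2']:.3f}")
```

In the notebook, the same loop could be run with `pipe`, `X_train`, `y_train`, `X_test` and `y_test`. A test error much larger than the training error would suggest overfitting.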