# Comparison of 3 regression methods on 3 different datasets: Wine Quality

In this work I use the following methods:
- K-nearest neighbors (KNN)
- Linear regression
- Random forest

I use the following dataset:
- [Wine Quality](https://www.kaggle.com/datasets/rajyellow46/wine-quality)
    - target variable: wine quality

## Loading the required libraries

In [64]:
import random
import os
from joblib import dump, load

import pandas as pd
from ydata_profiling import ProfileReport
import numpy as np

import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

import sklearn
from sklearn.compose import make_column_transformer, make_column_selector
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import TargetEncoder, StandardScaler, OneHotEncoder
from sklearn.pipeline import make_pipeline

from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor

from sklearn.model_selection import train_test_split, GridSearchCV, KFold
from sklearn.metrics import r2_score, mean_absolute_error, mean_absolute_percentage_error, mean_squared_error

import mlflow

## Constants

In [3]:
random_state = 0

np.random.seed(random_state)
os.environ["PYTHONHASHSEED"] = str(random_state)
random.seed(random_state)

In [4]:
sklearn.set_config(transform_output="pandas")

## Loading the data

In [5]:
wine_quality = pd.read_csv("../data/winequalityN.csv")

### Brief exploratory data analysis

In [6]:
# ProfileReport(wine_quality, title="Profiling Report for Wine quality dataset").to_file("../data/wine_quality_EDA.html")

## Preprocessing

In [17]:
wine_quality.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6497 entries, 0 to 6496
Data columns (total 13 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   type                  6497 non-null   object 
 1   fixed acidity         6487 non-null   float64
 2   volatile acidity      6489 non-null   float64
 3   citric acid           6494 non-null   float64
 4   residual sugar        6495 non-null   float64
 5   chlorides             6495 non-null   float64
 6   free sulfur dioxide   6497 non-null   float64
 7   total sulfur dioxide  6497 non-null   float64
 8   density               6497 non-null   float64
 9   pH                    6488 non-null   float64
 10  sulphates             6493 non-null   float64
 11  alcohol               6497 non-null   float64
 12  quality               6497 non-null   int64  
dtypes: float64(11), int64(1), object(1)
memory usage: 660.0+ KB


In [7]:
wine_quality.sample(5)

| | type | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5316 | red | 11.9 | 0.38 | 0.51 | 2.0 | 0.121 | 7.0 | 20.0 | 0.9996 | 3.24 | 0.76 | 10.4 | 6 |
| 5210 | red | 9.0 | 0.46 | 0.31 | 2.8 | 0.093 | 19.0 | 98.0 | 0.99815 | 3.32 | 0.63 | 9.5 | 6 |
| 3518 | white | 7.5 | 0.2 | 0.41 | 1.2 | 0.05 | 26.0 | 131.0 | 0.99133 | 3.19 | 0.52 | 11.1 | 5 |
| 1622 | white | 6.5 | 0.44 | 0.49 | 7.7 | 0.045 | 16.0 | 169.0 | 0.9957 | 3.11 | 0.37 | 8.7 | 6 |
| 2443 | white | 6.6 | 0.32 | 0.33 | 2.5 | 0.052 | 40.0 | 219.5 | 0.99316 | 3.15 | 0.6 | 10.0 | 5 |


In [9]:
wine_quality.isna().sum().sort_values(ascending=False)

fixed acidity           10
pH                       9
volatile acidity         8
sulphates                4
citric acid              3
residual sugar           2
chlorides                2
type                     0
free sulfur dioxide      0
total sulfur dioxide     0
density                  0
alcohol                  0
quality                  0
dtype: int64

In [23]:
X = wine_quality.drop(columns=["quality"])
y = wine_quality["quality"]

In [24]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, shuffle=True, random_state=random_state)

In [28]:
X_train.columns

Index(['type', 'fixed acidity', 'volatile acidity', 'citric acid',
       'residual sugar', 'chlorides', 'free sulfur dioxide',
       'total sulfur dioxide', 'density', 'pH', 'sulphates', 'alcohol'],
      dtype='object')

In [36]:
imputer = make_column_transformer((SimpleImputer(strategy="median"), make_column_selector(dtype_include=np.number)),
                                  remainder=SimpleImputer(strategy="most_frequent"), 
                                  n_jobs=-1, 
                                  verbose=True, 
                                  verbose_feature_names_out=False,
                                  )

In [41]:
encoder = make_column_transformer((OneHotEncoder(sparse_output=False), ["type"]), 
                                  remainder="passthrough",
                                  n_jobs=-1, 
                                  verbose=True, 
                                  verbose_feature_names_out=False,
                                  )


In [42]:
preprocessing_pipe = make_pipeline(imputer, encoder, verbose=True)
preprocessing_pipe

In [43]:
X_train_preprocessed = preprocessing_pipe.fit_transform(X_train)

[Pipeline]  (step 1 of 2) Processing columntransformer-1, total=   2.0s
[Pipeline]  (step 2 of 2) Processing columntransformer-2, total=   0.9s


## Modeling

In [48]:
knn_params = {"n_neighbors": [5, 25, 50],
                "weights": ["uniform", "distance"],
                "leaf_size": [20, 30, 50],
                "p": [1, 2],
                }

random_forest_params = {"n_estimators": [50, 100, 200],
                          # "criterion": ["squared_error", "absolute_error"],
                          "max_depth": [None, 3, 4, 5],
                          "max_features": [None, "sqrt", "log2"],
                          }

When exhaustively searching the parameter grid for the best combination of hyperparameters, we use cross-validation: every candidate is fitted and scored on each fold, and candidates are ranked by their mean validation score.

[<img src="../img/grid_search_cross_validation.png" alt="drawing" width="400"/>](../img/grid_search_cross_validation.png)
Source: https://scikit-learn.org/stable/modules/cross_validation.html
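
To make the mechanics of grid search with cross-validation concrete, the sketch below reimplements the idea in plain Python on synthetic one-dimensional data. It is a simplified stand-in, not scikit-learn's implementation: the helper names (`kfold_indices`, `knn_predict`, `r2`, `grid_search_cv`), the toy data, and the small grid are all assumptions made for illustration.

```python
import random
from itertools import product
from statistics import mean


def kfold_indices(n_samples, n_splits, seed):
    """Shuffle the sample indices once and deal them into n_splits folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::n_splits] for i in range(n_splits)]


def knn_predict(train_x, train_y, x, n_neighbors):
    """Predict the mean target of the n_neighbors closest 1-D training points."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))[:n_neighbors]
    return mean(train_y[i] for i in nearest)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_bar = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def grid_search_cv(x, y, param_grid, n_splits=5, seed=0):
    """Score every parameter combination by its mean R2 over the folds."""
    folds = kfold_indices(len(x), n_splits, seed)
    names = list(param_grid)
    results = []
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        scores = []
        for held_out in folds:
            held = set(held_out)
            train_idx = [i for i in range(len(x)) if i not in held]
            train_x = [x[i] for i in train_idx]
            train_y = [y[i] for i in train_idx]
            preds = [knn_predict(train_x, train_y, x[i], **params) for i in held_out]
            scores.append(r2([y[i] for i in held_out], preds))
        results.append((mean(scores), params))
    best_score, best_params = max(results, key=lambda r: r[0])
    return best_params, best_score, results


# Synthetic data (an assumption for this sketch): a noisy straight line.
rng = random.Random(0)
x = [i / 10 for i in range(60)]
y = [2 * xi + rng.gauss(0, 1) for xi in x]

param_grid = {"n_neighbors": [1, 5, 15]}
best_params, best_score, results = grid_search_cv(x, y, param_grid, n_splits=5, seed=0)
print(best_params, round(best_score, 3))
```

The same folds are reused for every candidate, so the scores are directly comparable. Passing one shared `KFold` object to both searches in this notebook serves the same purpose.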

In [49]:
folds = KFold(n_splits=5, shuffle=True, random_state=random_state)

### Linear regression

In [50]:
linreg = LinearRegression(n_jobs=-1)

In [51]:
linreg.fit(X_train_preprocessed, y_train)

In [52]:
linreg.score(X_train_preprocessed, y_train)

0.2904916695297072

### K-nearest neighbors

In [53]:
search_knn = GridSearchCV(estimator=KNeighborsRegressor(n_jobs=-1),
                   param_grid=knn_params,
                   scoring="r2",
                   n_jobs=-1,
                   refit=True,
                   cv=folds,
                   return_train_score=True,
                   verbose=3,
                   )

In [54]:
%%time
search_knn.fit(X_train_preprocessed, y_train)

Fitting 5 folds for each of 36 candidates, totalling 180 fits
CPU times: total: 391 ms
Wall time: 6.92 s


In [55]:
dump(search_knn, "../models/search_knn_wine_quality.joblib")

['../models/search_knn_wine_quality.joblib']

In [56]:
# search_knn = load("../models/search_knn_wine_quality.joblib")

In [57]:
pd.DataFrame(search_knn.cv_results_)

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_leaf_size,param_n_neighbors,param_p,param_weights,params,split0_test_score,...,mean_test_score,std_test_score,rank_test_score,split0_train_score,split1_train_score,split2_train_score,split3_train_score,split4_train_score,mean_train_score,std_train_score
0,0.069826,0.02453807,0.064205,0.019834,20,5,1,uniform,"{'leaf_size': 20, 'n_neighbors': 5, 'p': 1, 'w...",0.157494,...,0.173435,0.039268,25,0.467343,0.459091,0.468477,0.47793,0.455403,0.465649,0.00787
1,0.018075,0.005110243,0.040412,0.015336,20,5,1,distance,"{'leaf_size': 20, 'n_neighbors': 5, 'p': 1, 'w...",0.316415,...,0.311955,0.046971,13,1.0,1.0,1.0,1.0,1.0,1.0,0.0
2,0.018751,0.006249953,0.037528,0.007631,20,5,2,uniform,"{'leaf_size': 20, 'n_neighbors': 5, 'p': 2, 'w...",0.150895,...,0.148702,0.039298,31,0.453536,0.448299,0.463586,0.460139,0.437239,0.45256,0.009304
3,0.027569,0.01717708,0.062868,0.022207,20,5,2,distance,"{'leaf_size': 20, 'n_neighbors': 5, 'p': 2, 'w...",0.300573,...,0.283966,0.049275,16,1.0,1.0,1.0,1.0,1.0,1.0,0.0
4,0.024419,0.006515661,0.060786,0.033393,20,25,1,uniform,"{'leaf_size': 20, 'n_neighbors': 25, 'p': 1, '...",0.192349,...,0.199672,0.02421,20,0.260991,0.266195,0.264547,0.276809,0.252267,0.264162,0.007947
5,0.0256,0.01433368,0.095,0.026929,20,25,1,distance,"{'leaf_size': 20, 'n_neighbors': 25, 'p': 1, '...",0.399636,...,0.386723,0.032604,1,1.0,1.0,1.0,1.0,1.0,1.0,0.0
6,0.0228,0.002785637,0.067399,0.030355,20,25,2,uniform,"{'leaf_size': 20, 'n_neighbors': 25, 'p': 2, '...",0.14611,...,0.153568,0.027148,28,0.222947,0.215949,0.223595,0.232647,0.212123,0.221452,0.007063
7,0.021599,0.001356185,0.044001,0.011899,20,25,2,distance,"{'leaf_size': 20, 'n_neighbors': 25, 'p': 2, '...",0.362631,...,0.351221,0.032887,7,1.0,1.0,1.0,1.0,1.0,1.0,0.0
8,0.025523,0.0105513,0.05981,0.012055,20,50,1,uniform,"{'leaf_size': 20, 'n_neighbors': 50, 'p': 1, '...",0.164075,...,0.177409,0.022266,22,0.213283,0.211099,0.208508,0.218238,0.205006,0.211227,0.004461
9,0.0203,0.005751312,0.064811,0.020937,20,50,1,distance,"{'leaf_size': 20, 'n_neighbors': 50, 'p': 1, '...",0.390163,...,0.38016,0.029079,4,1.0,1.0,1.0,1.0,1.0,1.0,0.0
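
The training scores of exactly 1.0 for `weights="distance"` in the table above are expected rather than suspicious: when a training point is scored, it is its own nearest neighbor at distance zero, so its own target determines the prediction. The sketch below shows this with a simplified one-dimensional inverse-distance KNN written in plain Python. The function name and the toy data are assumptions for illustration, and it assumes there are no duplicate feature values with different targets.

```python
def knn_distance_weighted(train_x, train_y, x, n_neighbors):
    """Inverse-distance weighted KNN prediction for a single 1-D query point."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))[:n_neighbors]
    distances = [abs(train_x[i] - x) for i in nearest]
    # A zero distance would give an infinite weight, so exact matches decide alone.
    if any(d == 0 for d in distances):
        exact = [train_y[i] for i, d in zip(nearest, distances) if d == 0]
        return sum(exact) / len(exact)
    weights = [1 / d for d in distances]
    return sum(w * train_y[i] for w, i in zip(weights, nearest)) / sum(weights)


# Toy training data (hypothetical values).
train_x = [0.0, 1.0, 2.0, 3.0, 4.0]
train_y = [5, 6, 5, 7, 6]

# Predicting the training points reproduces their targets exactly.
train_preds = [knn_distance_weighted(train_x, train_y, x, 3) for x in train_x]
print(train_preds)

# An unseen point gets a genuine weighted average of its neighbors.
new_pred = knn_distance_weighted(train_x, train_y, 1.5, 3)
print(new_pred)
```

For this reason the cross-validated `mean_test_score` column, not `mean_train_score`, is the meaningful basis for comparing the distance-weighted candidates.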


### Random forest

In [58]:
search_random_forest = GridSearchCV(estimator=RandomForestRegressor(random_state=random_state),
                                    param_grid=random_forest_params,
                                    scoring="r2",
                                    n_jobs=-1,
                                    refit=True,
                                    cv=folds,
                                    return_train_score=True,
                                    verbose=3,
                                    )

In [59]:
%%time
search_random_forest.fit(X_train_preprocessed, y_train)

Fitting 5 folds for each of 36 candidates, totalling 180 fits
CPU times: total: 5.53 s
Wall time: 1min 48s


In [60]:
dump(search_random_forest, "../models/search_random_forest_wine_quality.joblib")

['../models/search_random_forest_wine_quality.joblib']

In [15]:
# search_random_forest = load("../models/search_random_forest_wine_quality.joblib")

In [61]:
pd.DataFrame(search_random_forest.cv_results_)

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_max_depth,param_max_features,param_n_estimators,params,split0_test_score,split1_test_score,...,mean_test_score,std_test_score,rank_test_score,split0_train_score,split1_train_score,split2_train_score,split3_train_score,split4_train_score,mean_train_score,std_train_score
0,7.84617,0.315238,0.162332,0.149425,,,50,"{'max_depth': None, 'max_features': None, 'n_e...",0.524629,0.519787,...,0.494867,0.029056,9,0.924482,0.926193,0.926265,0.929021,0.925797,0.926352,0.001481
1,18.919651,0.397345,0.076003,0.011773,,,100,"{'max_depth': None, 'max_features': None, 'n_e...",0.524074,0.527712,...,0.499923,0.027076,8,0.92839,0.928173,0.930022,0.931254,0.928646,0.929297,0.001172
2,36.901868,0.700976,0.142211,0.010378,,,200,"{'max_depth': None, 'max_features': None, 'n_e...",0.526658,0.530983,...,0.504185,0.026009,7,0.930577,0.930457,0.932339,0.932426,0.931537,0.931467,0.000836
3,2.673916,0.063558,0.040626,0.007655,,sqrt,50,"{'max_depth': None, 'max_features': 'sqrt', 'n...",0.521738,0.5415,...,0.513726,0.019023,5,0.926927,0.926355,0.928265,0.930248,0.926515,0.927662,0.001457
4,5.410361,0.099975,0.071876,0.007656,,sqrt,100,"{'max_depth': None, 'max_features': 'sqrt', 'n...",0.528939,0.54661,...,0.516975,0.021228,3,0.930092,0.929365,0.932074,0.932768,0.930741,0.931008,0.001253
5,11.155184,0.332765,0.161254,0.042424,,sqrt,200,"{'max_depth': None, 'max_features': 'sqrt', 'n...",0.529189,0.548528,...,0.519451,0.019682,1,0.932615,0.931809,0.934238,0.933806,0.933702,0.933234,0.000891
6,3.023378,0.301075,0.050613,0.010907,,log2,50,"{'max_depth': None, 'max_features': 'log2', 'n...",0.521738,0.5415,...,0.513726,0.019023,5,0.926927,0.926355,0.928265,0.930248,0.926515,0.927662,0.001457
7,5.35752,0.151877,0.078126,0.009882,,log2,100,"{'max_depth': None, 'max_features': 'log2', 'n...",0.528939,0.54661,...,0.516975,0.021228,3,0.930092,0.929365,0.932074,0.932768,0.930741,0.931008,0.001253
8,10.417827,0.064719,0.140625,2e-06,,log2,200,"{'max_depth': None, 'max_features': 'log2', 'n...",0.529189,0.548528,...,0.519451,0.019682,1,0.932615,0.931809,0.934238,0.933806,0.933702,0.933234,0.000891
9,1.115772,0.02126,0.01875,0.00625,3.0,,50,"{'max_depth': 3, 'max_features': None, 'n_esti...",0.277751,0.2823,...,0.277797,0.010569,30,0.29873,0.295226,0.295886,0.300949,0.290942,0.296346,0.003392


## Analysis

At this stage I have the tuned, final models. They are now compared on the test set using the following metrics:

- MAE
- MSE
- MAPE
- RMSE
- R2
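
As a quick reference for how the five metrics listed above are defined, the sketch below computes them by hand in plain Python. The four target and prediction values are made up for illustration and are not taken from the wine data. Note that scikit-learn's `mean_absolute_percentage_error` returns a fraction rather than a percentage, and this sketch follows the same convention.

```python
import math

# Hypothetical true quality scores and model predictions.
y_true = [5, 6, 7, 5]
y_pred = [5.5, 6.0, 6.0, 4.5]

n = len(y_true)
errors = [t - p for t, p in zip(y_true, y_pred)]

mae = sum(abs(e) for e in errors) / n                              # mean absolute error
mse = sum(e ** 2 for e in errors) / n                              # mean squared error
rmse = math.sqrt(mse)                                              # root of the MSE, in target units
mape = sum(abs(e) / abs(t) for e, t in zip(errors, y_true)) / n    # mean relative error, as a fraction

y_bar = sum(y_true) / n
ss_res = sum(e ** 2 for e in errors)                               # residual sum of squares
ss_tot = sum((t - y_bar) ** 2 for t in y_true)                     # total sum of squares
r2 = 1 - ss_res / ss_tot                                           # share of variance explained

print({"mae": mae, "mse": mse, "mape": mape, "rmse": rmse, "r2": r2})
```

MAE and RMSE are expressed in the units of the target (quality points), which makes them the easiest to interpret here. R2 compares the model against always predicting the mean.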

In [62]:
final_models = [("linear_regression", linreg), ("knn", search_knn.best_estimator_), ("random_forest", search_random_forest.best_estimator_)]

In [65]:
mlflow.set_tracking_uri(uri="http://127.0.0.1:8080")
mlflow.set_experiment("Wine_quality")

for model_name, model in final_models:
    params = model.get_params()  # log each model's own parameters, not always the random forest's
    
    y_pred = model.predict(preprocessing_pipe.transform(X_test))
    
    mae = mean_absolute_error(y_test, y_pred)
    mse = mean_squared_error(y_test, y_pred)
    mape = mean_absolute_percentage_error(y_test, y_pred)
    rmse = np.sqrt(mse)
    r2 = r2_score(y_test, y_pred)
    
    metrics = {"mae": mae, "mse": mse, "mape": mape, "rmse": rmse, "r2": r2}
    
    with mlflow.start_run(run_name=model_name) as run:
        mlflow.log_params(params)
    
        mlflow.log_metrics(metrics)
        
        model_info = mlflow.sklearn.log_model(
            sk_model=model,
            artifact_path="wine_quality",
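            # The models were fitted on preprocessed features, so the input example must be preprocessed too.
            input_example=preprocessing_pipe.transform(X_test.head()),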
        )


