# Comparison of 3 regression methods on 3 different datasets: Abalone age

In this work I use the following methods:
- K-nearest neighbors (KNN)
- Linear regression
- Random forest

This notebook uses the following dataset:
-  [Abalone](https://www.kaggle.com/datasets/rodolfomendes/abalone-dataset)
    - target variable: `Rings`

## Loading the required libraries

In [2]:
import random
import os
from joblib import dump, load

import pandas as pd
from ydata_profiling import ProfileReport
import numpy as np

import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

import sklearn
from sklearn.compose import make_column_transformer, make_column_selector
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import TargetEncoder, StandardScaler, OneHotEncoder
from sklearn.pipeline import make_pipeline

from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor

from sklearn.model_selection import train_test_split, GridSearchCV, KFold
from sklearn.metrics import r2_score


## Constants

In [3]:
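random_state = 0

# Seed NumPy and the standard-library RNG for reproducibility.
np.random.seed(random_state)
# Note: PYTHONHASHSEED is read at interpreter startup, so setting it here
# only affects subprocesses started later, not the current process.
os.environ["PYTHONHASHSEED"] = str(random_state)
random.seed(random_state)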

In [4]:
sklearn.set_config(transform_output="pandas")

## Loading the data

In [5]:
abalone = pd.read_csv("../data/abalone.csv")

### Brief exploratory data analysis

In [6]:
# ProfileReport(abalone, title="Profiling Report for Abalone dataset").to_file("../data/abalone_EDA.html")

## Preprocessing

In [7]:
abalone.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4177 entries, 0 to 4176
Data columns (total 9 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Sex             4177 non-null   object 
 1   Length          4177 non-null   float64
 2   Diameter        4177 non-null   float64
 3   Height          4177 non-null   float64
 4   Whole weight    4177 non-null   float64
 5   Shucked weight  4177 non-null   float64
 6   Viscera weight  4177 non-null   float64
 7   Shell weight    4177 non-null   float64
 8   Rings           4177 non-null   int64  
dtypes: float64(7), int64(1), object(1)
memory usage: 293.8+ KB


In [10]:
abalone.sample(5)

| | Sex | Length | Diameter | Height | Whole weight | Shucked weight | Viscera weight | Shell weight | Rings |
|---|---|---|---|---|---|---|---|---|---|
| 3972 | M | 0.415 | 0.315 | 0.12 | 0.4015 | 0.199 | 0.087 | 0.097 | 8 |
| 1091 | I | 0.45 | 0.33 | 0.11 | 0.3685 | 0.16 | 0.0885 | 0.102 | 6 |
| 52 | M | 0.485 | 0.36 | 0.13 | 0.5415 | 0.2595 | 0.096 | 0.16 | 10 |
| 2063 | M | 0.525 | 0.385 | 0.1 | 0.5115 | 0.246 | 0.1005 | 0.1455 | 8 |
| 3223 | M | 0.52 | 0.415 | 0.175 | 0.753 | 0.258 | 0.171 | 0.255 | 8 |


In [11]:
abalone.isna().sum().sort_values(ascending=False)

Sex               0
Length            0
Diameter          0
Height            0
Whole weight      0
Shucked weight    0
Viscera weight    0
Shell weight      0
Rings             0
dtype: int64

In [12]:
X = abalone.drop(columns=["Rings"])
y = abalone["Rings"]

In [13]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, shuffle=True, random_state=random_state)

In [14]:
encoder = make_column_transformer((OneHotEncoder(sparse_output=False), ["Sex"]), 
                                  remainder="passthrough",
                                  n_jobs=-1, 
                                  verbose=True, 
                                  verbose_feature_names_out=False,
                                  )


In [15]:
X_train_preprocessed = encoder.fit_transform(X_train)

## Modeling

In [16]:
knn_params = {
    "n_neighbors": [5, 25, 50],
    "weights": ["uniform", "distance"],
    "leaf_size": [20, 30, 50],
    "p": [1, 2],
}

random_forest_params = {
    "n_estimators": [50, 100, 200],
    # "criterion": ["squared_error", "absolute_error"],
    "max_depth": [None, 3, 4, 5],
    "max_features": [None, "sqrt", "log2"],
}
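
The number of candidates in each grid is the product of the number of values per parameter. A quick way to check this is scikit-learn's `ParameterGrid`; in the sketch below the two dictionaries are copied from the cell above under the hypothetical names `knn_grid` and `rf_grid`.

```python
from sklearn.model_selection import ParameterGrid

knn_grid = {
    "n_neighbors": [5, 25, 50],
    "weights": ["uniform", "distance"],
    "leaf_size": [20, 30, 50],
    "p": [1, 2],
}
rf_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [None, 3, 4, 5],
    "max_features": [None, "sqrt", "log2"],
}

n_knn = len(ParameterGrid(knn_grid))  # 3 * 2 * 3 * 2 = 36
n_rf = len(ParameterGrid(rf_grid))    # 3 * 4 * 3 = 36
print(n_knn, n_rf)
```

With 5 folds, each grid therefore requires 36 * 5 = 180 fits. Note that, according to the scikit-learn documentation, `leaf_size` influences the speed and memory use of the tree-based neighbor search, so varying it probably changes runtime more than it changes scores.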

In the exhaustive grid search for the best combination of hyperparameters, we use cross-validation.

[<img src="../img/grid_search_cross_validation.png" alt="Grid search with cross-validation" width="400"/>](../img/grid_search_cross_validation.png)
Source: https://scikit-learn.org/stable/modules/cross_validation.html

In [17]:
folds = KFold(n_splits=5, shuffle=True, random_state=random_state)
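
To make the splitting concrete, the sketch below shows how a `KFold` object with the same settings partitions 10 hypothetical samples: every sample lands in the validation part of exactly one fold. The `toy_*` names and the sample count are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold

toy_X = np.arange(10).reshape(-1, 1)  # 10 hypothetical samples
toy_folds = KFold(n_splits=5, shuffle=True, random_state=0)

validation_indices = []
fold_sizes = []
for fold_number, (train_idx, valid_idx) in enumerate(toy_folds.split(toy_X)):
    print(f"fold {fold_number}: train={train_idx}, validation={valid_idx}")
    validation_indices.extend(valid_idx.tolist())
    fold_sizes.append(len(valid_idx))
```

Because `random_state` is fixed, the same partition is reused for every candidate in both grid searches, which keeps their scores comparable.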

### Linear regression

In [18]:
linreg = LinearRegression(n_jobs=-1)

In [19]:
linreg.fit(X_train_preprocessed, y_train)

In [20]:
# R^2 on the training data (not cross-validated)
linreg.score(X_train_preprocessed, y_train)

0.5365659904594149
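
The value above is computed on the training data, while the grid searches report cross-validated scores. One way to put linear regression on the same footing is `cross_val_score` with the same fold definition. The sketch below uses synthetic stand-in data from `make_regression` (an assumption for illustration, not the abalone data).

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (an assumption for illustration; not the abalone data).
toy_X, toy_y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
toy_folds = KFold(n_splits=5, shuffle=True, random_state=0)

# One R^2 value per fold, each computed on data the model did not see during fitting.
cv_scores = cross_val_score(LinearRegression(), toy_X, toy_y, scoring="r2", cv=toy_folds)
print(cv_scores.mean(), cv_scores.std())
```

In this notebook the analogous call would be `cross_val_score(linreg, X_train_preprocessed, y_train, scoring="r2", cv=folds)`.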

### K-nearest neighbors

In [21]:
search_knn = GridSearchCV(
    estimator=KNeighborsRegressor(n_jobs=-1),
    param_grid=knn_params,
    scoring="r2",
    n_jobs=-1,
    refit=True,
    cv=folds,
    return_train_score=True,
    verbose=3,
)

In [22]:
%%time
search_knn.fit(X_train_preprocessed, y_train)

Fitting 5 folds for each of 36 candidates, totalling 180 fits
CPU times: total: 781 ms
Wall time: 17.6 s


In [23]:
dump(search_knn, "../models/search_knn_abalone.joblib")

['../models/search_knn_abalone.joblib']

In [24]:
# search_knn = load("../models/search_knn_abalone.joblib")

In [25]:
pd.DataFrame(search_knn.cv_results_)

| | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_leaf_size | param_n_neighbors | param_p | param_weights | params | split0_test_score | ... | mean_test_score | std_test_score | rank_test_score | split0_train_score | split1_train_score | split2_train_score | split3_train_score | split4_train_score | mean_train_score | std_train_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.078873 | 0.034499 | 0.256651 | 0.107112 | 20 | 5 | 1 | uniform | {'leaf_size': 20, 'n_neighbors': 5, 'p': 1, 'w... | 0.533313 | ... | 0.493883 | 0.026155 | 31 | 0.674734 | 0.678449 | 0.674293 | 0.683059 | 0.680594 | 0.678226 | 0.003367 |
| 1 | 0.024495 | 0.019824 | 0.101088 | 0.095737 | 20 | 5 | 1 | distance | {'leaf_size': 20, 'n_neighbors': 5, 'p': 1, 'w... | 0.533517 | ... | 0.495286 | 0.026309 | 28 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 |
| 2 | 0.013744 | 0.002357 | 0.050353 | 0.029915 | 20 | 5 | 2 | uniform | {'leaf_size': 20, 'n_neighbors': 5, 'p': 2, 'w... | 0.530188 | ... | 0.5038 | 0.019981 | 22 | 0.682383 | 0.680086 | 0.678892 | 0.679133 | 0.685701 | 0.681239 | 0.002549 |
| 3 | 0.030574 | 0.019205 | 0.085398 | 0.032234 | 20 | 5 | 2 | distance | {'leaf_size': 20, 'n_neighbors': 5, 'p': 2, 'w... | 0.531408 | ... | 0.504775 | 0.020875 | 19 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 |
| 4 | 0.013632 | 0.00185 | 0.062798 | 0.024487 | 20 | 25 | 1 | uniform | {'leaf_size': 20, 'n_neighbors': 25, 'p': 1, '... | 0.5199 | ... | 0.519285 | 0.011283 | 11 | 0.561586 | 0.560842 | 0.564465 | 0.557876 | 0.557247 | 0.560403 | 0.002624 |
| 5 | 0.01897 | 0.00183 | 0.085281 | 0.025641 | 20 | 25 | 1 | distance | {'leaf_size': 20, 'n_neighbors': 25, 'p': 1, '... | 0.529055 | ... | 0.526014 | 0.009321 | 5 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 |
| 6 | 0.031955 | 0.007542 | 0.130461 | 0.023859 | 20 | 25 | 2 | uniform | {'leaf_size': 20, 'n_neighbors': 25, 'p': 2, '... | 0.525379 | ... | 0.524532 | 0.009155 | 7 | 0.564599 | 0.564253 | 0.567191 | 0.562209 | 0.563477 | 0.564346 | 0.001642 |
| 7 | 0.06592 | 0.020858 | 0.116541 | 0.039473 | 20 | 25 | 2 | distance | {'leaf_size': 20, 'n_neighbors': 25, 'p': 2, '... | 0.533728 | ... | 0.531093 | 0.007826 | 1 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 |
| 8 | 0.04795 | 0.017123 | 0.107542 | 0.015723 | 20 | 50 | 1 | uniform | {'leaf_size': 20, 'n_neighbors': 50, 'p': 1, '... | 0.489506 | ... | 0.492402 | 0.013441 | 35 | 0.517634 | 0.517909 | 0.520361 | 0.511128 | 0.509297 | 0.515266 | 0.004273 |
| 9 | 0.052864 | 0.015648 | 0.17779 | 0.07618 | 20 | 50 | 1 | distance | {'leaf_size': 20, 'n_neighbors': 50, 'p': 1, '... | 0.504903 | ... | 0.506391 | 0.012101 | 17 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 |
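
The training score of 1.0 in every `weights="distance"` row is expected and does not indicate a perfect model: when a training point is predicted, it is its own neighbor at distance zero and receives all the weight. The sketch below reproduces this on hypothetical random data where the target is pure noise; the `toy_*` names and the data are assumptions for illustration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
toy_X = rng.normal(size=(50, 2))  # hypothetical features
toy_y = rng.normal(size=50)       # pure-noise target, nothing to learn

uniform_knn = KNeighborsRegressor(n_neighbors=5, weights="uniform").fit(toy_X, toy_y)
distance_knn = KNeighborsRegressor(n_neighbors=5, weights="distance").fit(toy_X, toy_y)

# Scores on the very data the models were fitted on.
uniform_train_r2 = uniform_knn.score(toy_X, toy_y)
distance_train_r2 = distance_knn.score(toy_X, toy_y)
print(uniform_train_r2, distance_train_r2)
```

For this reason the train scores of distance-weighted candidates carry no information about overfitting; only the test-fold scores are useful for comparing them.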


### Random forest

In [26]:
search_random_forest = GridSearchCV(
    estimator=RandomForestRegressor(random_state=random_state),
    param_grid=random_forest_params,
    scoring="r2",
    n_jobs=-1,
    refit=True,
    cv=folds,
    return_train_score=True,
    verbose=3,
)

In [27]:
%%time
search_random_forest.fit(X_train_preprocessed, y_train)

Fitting 5 folds for each of 36 candidates, totalling 180 fits
CPU times: total: 5.16 s
Wall time: 1min 36s


In [28]:
dump(search_random_forest, "../models/search_random_forest_knn_abalone.joblib")

['../models/search_random_forest_knn_abalone.joblib']

In [29]:
# search_random_forest = load("../models/search_random_forest_knn_abalone.joblib")

In [30]:
pd.DataFrame(search_random_forest.cv_results_)

| | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_max_depth | param_max_features | param_n_estimators | params | split0_test_score | split1_test_score | ... | mean_test_score | std_test_score | rank_test_score | split0_train_score | split1_train_score | split2_train_score | split3_train_score | split4_train_score | mean_train_score | std_train_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 6.187398 | 0.059237 | 0.037487 | 0.006149 | | | 50 | {'max_depth': None, 'max_features': None, 'n_e... | 0.53239 | 0.538431 | ... | 0.530921 | 0.012133 | 10 | 0.933468 | 0.933779 | 0.933313 | 0.933284 | 0.932604 | 0.93329 | 0.000385 |
| 1 | 12.05261 | 0.144723 | 0.067072 | 0.008704 | | | 100 | {'max_depth': None, 'max_features': None, 'n_e... | 0.540768 | 0.541636 | ... | 0.535888 | 0.014445 | 8 | 0.935962 | 0.93593 | 0.935201 | 0.937381 | 0.93608 | 0.936111 | 0.000707 |
| 2 | 23.529486 | 0.303783 | 0.136363 | 0.022068 | | | 200 | {'max_depth': None, 'max_features': None, 'n_e... | 0.542426 | 0.541523 | ... | 0.538898 | 0.013341 | 7 | 0.937455 | 0.937038 | 0.937062 | 0.937365 | 0.938165 | 0.937417 | 0.000408 |
| 3 | 2.350078 | 0.127389 | 0.034067 | 0.005007 | | sqrt | 50 | {'max_depth': None, 'max_features': 'sqrt', 'n... | 0.561096 | 0.534493 | ... | 0.542535 | 0.010232 | 5 | 0.935276 | 0.934643 | 0.933307 | 0.934844 | 0.934768 | 0.934568 | 0.000665 |
| 4 | 4.641658 | 0.058329 | 0.061066 | 0.007507 | | sqrt | 100 | {'max_depth': None, 'max_features': 'sqrt', 'n... | 0.56084 | 0.541886 | ... | 0.546165 | 0.009344 | 3 | 0.937444 | 0.937028 | 0.936364 | 0.937677 | 0.937765 | 0.937255 | 0.000514 |
| 5 | 10.217048 | 0.417408 | 0.13848 | 0.005472 | | sqrt | 200 | {'max_depth': None, 'max_features': 'sqrt', 'n... | 0.564209 | 0.542555 | ... | 0.549812 | 0.009169 | 1 | 0.939282 | 0.938476 | 0.937978 | 0.938946 | 0.939012 | 0.938739 | 0.00046 |
| 6 | 2.681572 | 0.08312 | 0.04507 | 0.007106 | | log2 | 50 | {'max_depth': None, 'max_features': 'log2', 'n... | 0.561096 | 0.534493 | ... | 0.542535 | 0.010232 | 5 | 0.935276 | 0.934643 | 0.933307 | 0.934844 | 0.934768 | 0.934568 | 0.000665 |
| 7 | 5.435568 | 0.087698 | 0.076424 | 0.006886 | | log2 | 100 | {'max_depth': None, 'max_features': 'log2', 'n... | 0.56084 | 0.541886 | ... | 0.546165 | 0.009344 | 3 | 0.937444 | 0.937028 | 0.936364 | 0.937677 | 0.937765 | 0.937255 | 0.000514 |
| 8 | 9.527132 | 0.11758 | 0.147541 | 0.025762 | | log2 | 200 | {'max_depth': None, 'max_features': 'log2', 'n... | 0.564209 | 0.542555 | ... | 0.549812 | 0.009169 | 1 | 0.939282 | 0.938476 | 0.937978 | 0.938946 | 0.939012 | 0.938739 | 0.00046 |
| 9 | 1.065831 | 0.031548 | 0.021041 | 0.003327 | 3.0 | | 50 | {'max_depth': 3, 'max_features': None, 'n_esti... | 0.440963 | 0.427173 | ... | 0.448539 | 0.020716 | 30 | 0.467454 | 0.478548 | 0.476881 | 0.47342 | 0.46778 | 0.472816 | 0.004558 |
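
The full `cv_results_` frame is wide, so it can help to sort it by rank and keep only a few columns, and to read the winner directly from `best_params_` and `best_score_`. The sketch below shows this on a deliberately tiny search over synthetic data; the data, the reduced grid and the `toy_*` names are assumptions for illustration.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, KFold

# Synthetic stand-in data (an assumption for illustration; not the abalone data).
toy_X, toy_y = make_regression(n_samples=120, n_features=4, noise=5.0, random_state=0)

toy_search = GridSearchCV(
    estimator=RandomForestRegressor(n_estimators=10, random_state=0),
    param_grid={"max_depth": [2, None]},
    scoring="r2",
    cv=KFold(n_splits=3, shuffle=True, random_state=0),
    return_train_score=True,
)
toy_search.fit(toy_X, toy_y)

# Best candidates first, restricted to the columns needed for a comparison.
summary = (
    pd.DataFrame(toy_search.cv_results_)
    .sort_values("rank_test_score")
    [["params", "mean_test_score", "std_test_score", "mean_train_score"]]
)
print(toy_search.best_params_, toy_search.best_score_)
print(summary)
```

Placing `mean_train_score` next to `mean_test_score` also makes a large gap between the two, as in the unrestricted-depth rows above, easy to spot.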


## Analysis

At this point I have the tuned, final models. They will now be compared using the following metrics:

- MAE
- MAPE
- R2
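
All three metrics are available in `sklearn.metrics`. The sketch below computes them for a handful of hypothetical ring counts and predictions; the numbers are made up for illustration and are not model outputs.

```python
from sklearn.metrics import (
    mean_absolute_error,
    mean_absolute_percentage_error,
    r2_score,
)

y_true = [10, 8, 12, 5]  # hypothetical ring counts
y_pred = [9, 8, 14, 6]   # hypothetical predictions

mae = mean_absolute_error(y_true, y_pred)               # mean of |y_true - y_pred|
mape = mean_absolute_percentage_error(y_true, y_pred)   # a fraction, not a percentage
r2 = r2_score(y_true, y_pred)                           # 1 - SS_res / SS_tot
print(mae, mape, r2)
```

For the final comparison, the same calls would be applied to `y_test` and each model's predictions on the encoded test split.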

In [4]:
# TODO: choose the final metrics