# Porównanie 3 metod regresji na podstawie 3 różnych zbiorów danych

W niniejszej pracy wykorzystuję metody:
- K najbliższych sąsiądów (KNN)
- Regresji liniowej
- Lasu losowego

Używam następujących zbiorów danych:
- [Melbourne Housing](https://www.kaggle.com/datasets/dansbecker/melbourne-housing-snapshot) 
    - zmienna objaśniana: cena domu
- [Wine Quality](https://www.kaggle.com/datasets/rajyellow46/wine-quality)
    - zmienna objaśniana: jakość wina
-  [Abalon](https://www.kaggle.com/datasets/rodolfomendes/abalone-dataset)
    - zmienna objaśniana: wiek skamieliny

## Załadowanie potrzebnych bibliotek

In [21]:
import random
import os

import pandas as pd
from ydata_profiling import ProfileReport
import numpy as np

import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

import sklearn
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import TargetEncoder, StandardScaler, OneHotEncoder
from sklearn.pipeline import make_pipeline

from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor

from sklearn.model_selection import train_test_split, GridSearchCV


## Stałe

In [4]:
random_state = 0

np.random.seed(random_state)
os.environ["PYTHONHASHSEED"] = str(random_state)
random.seed(random_state)

In [9]:
sklearn.set_config(transform_output="pandas")

## Wczytanie danych

In [5]:
housing = pd.read_csv("../data/melb_data.csv")
wine_quality = pd.read_csv("../data/winequalityN.csv")
abalone = pd.read_csv("../data/abalone.csv")

In [6]:
datasets = {"housing": housing, 
            "wine_quality": wine_quality, 
            "abalone": abalone,
            }

### Krótka analiza eksploracyjna danych

In [None]:
for name, dataset in datasets.items():
    ProfileReport(dataset, title=f"Profiling Report for {name.capitalize()} dataset").to_file(f"../data/{name}_EDA.html")

## Preprocessing

### Housing

In [7]:
housing.sample(5)

Unnamed: 0,Suburb,Address,Rooms,Type,Price,Method,SellerG,Date,Distance,Postcode,...,Bathroom,Car,Landsize,BuildingArea,YearBuilt,CouncilArea,Lattitude,Longtitude,Regionname,Propertycount
8505,Williamstown,44 Electra St,4,h,2165000.0,SP,Greg,6/05/2017,8.0,3016.0,...,2.0,2.0,450.0,190.0,1910.0,Hobsons Bay,-37.861,144.8985,Western Metropolitan,6380.0
5523,Seddon,80 Gamon St,2,h,815000.0,S,Chisholm,30/07/2016,6.6,3011.0,...,1.0,0.0,172.0,81.0,1900.0,Maribyrnong,-37.81,144.8896,Western Metropolitan,2417.0
12852,Sunshine North,6 Melton Av,3,h,610000.0,SP,Sweeney,16/09/2017,10.5,3020.0,...,1.0,1.0,581.0,,,,-37.7674,144.82421,Western Metropolitan,4217.0
4818,Prahran,16 Park Rd,3,t,1245000.0,PI,Marshall,6/08/2016,4.5,3181.0,...,2.0,1.0,128.0,134.0,2000.0,Stonnington,-37.8526,145.0071,Southern Metropolitan,7717.0
12812,Pascoe Vale,13 Yorkshire St,3,h,1160000.0,S,Nelson,16/09/2017,8.5,3044.0,...,2.0,2.0,480.0,,,,-37.72523,144.94567,Northern Metropolitan,7485.0


In [12]:
X = housing.drop(columns="Price")
y = housing["Price"]

In [13]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, shuffle=True, random_state=random_state)

In [14]:
imputer_housing = ColumnTransformer([("mode_imputer", SimpleImputer(strategy="most_frequent"), ["YearBuilt", "CouncilArea"]),
                                ("mean_imputer", SimpleImputer(strategy="mean"), ["BuildingArea"]),
                                ],
                               remainder="passthrough",
                               n_jobs=-1,
                               verbose=True,
                               verbose_feature_names_out=False,
                               )

In [15]:
encoder_housing = ColumnTransformer([("target_encoder", TargetEncoder(categories="auto", target_type="continuous", random_state=random_state), ["SellerG", "Postcode", "Regionname", "CouncilArea", "Postcode", "Suburb"]),
                                ("one_hot_encoder", OneHotEncoder(sparse_output=False), ["Type", "Method"]),
                                     ("drop", "drop", ["Address", "Lattitude", "Longtitude", "Date"])
                                ],
                               remainder="passthrough",
                               n_jobs=-1,
                               verbose=True,
                               verbose_feature_names_out=False,
                               )

In [34]:
preprocessing_housing = make_pipeline(imputer_housing, encoder_housing, verbose=True)
preprocessing_housing

In [36]:
preprocessing_housing.fit_transform(X_train, y_train)

[Pipeline]  (step 1 of 2) Processing columntransformer-1, total=   2.1s
[Pipeline]  (step 2 of 2) Processing columntransformer-2, total=   1.0s


Unnamed: 0,SellerG,Postcode,Regionname,CouncilArea,Postcode.1,Type_h,Type_t,Type_u,Method_PI,Method_S,...,BuildingArea,Suburb,Rooms,Date,Distance,Bedroom2,Bathroom,Car,Landsize,Propertycount
3857,1.041916e+06,1.593424e+06,1.370599e+06,1.307007e+06,1.593424e+06,1.0,0.0,0.0,0.0,1.0,...,152.495671,Malvern East,5,15/10/2016,11.2,5.0,2.0,2.0,639.0,8801.0
5862,7.845643e+05,7.863326e+05,1.365564e+06,1.121814e+06,7.863326e+05,0.0,0.0,1.0,0.0,0.0,...,105.000000,St Kilda,2,26/07/2016,6.1,2.0,2.0,1.0,0.0,13240.0
10164,1.019539e+06,8.805224e+05,8.998646e+05,1.018565e+06,8.805224e+05,1.0,0.0,0.0,0.0,0.0,...,152.495671,Brunswick West,3,27/05/2017,5.2,3.0,1.0,3.0,484.0,7082.0
8843,6.014845e+05,5.295923e+05,8.998646e+05,5.541723e+05,5.295923e+05,1.0,0.0,0.0,1.0,0.0,...,181.000000,Jacana,4,1/07/2017,14.0,4.0,1.0,1.0,692.0,851.0
12389,8.796497e+05,6.852531e+05,8.998646e+05,1.018565e+06,6.852531e+05,1.0,0.0,0.0,0.0,1.0,...,152.495671,Reservoir,3,3/09/2017,12.0,3.0,1.0,2.0,741.0,21650.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
13123,1.159248e+06,1.034137e+06,8.910190e+05,1.013013e+06,1.034137e+06,1.0,0.0,0.0,0.0,0.0,...,152.495671,Brunswick,3,23/09/2017,5.2,3.0,1.0,2.0,212.0,11918.0
3264,1.114186e+06,7.014500e+05,1.108870e+06,9.543597e+05,7.014500e+05,1.0,0.0,0.0,0.0,1.0,...,101.000000,Heidelberg Heights,3,25/02/2017,10.5,3.0,1.0,1.0,748.0,2947.0
9845,8.847678e+05,9.192777e+05,9.026117e+05,1.025348e+06,9.192777e+05,1.0,0.0,0.0,1.0,0.0,...,255.000000,Coburg,4,24/06/2017,6.7,4.0,2.0,2.0,441.0,11204.0
10799,7.228054e+05,6.852531e+05,8.998646e+05,9.208275e+05,6.852531e+05,1.0,0.0,0.0,0.0,1.0,...,152.495671,Reservoir,3,8/07/2017,12.0,3.0,1.0,1.0,606.0,21650.0


### Wine quality

In [None]:
imputer = SimpleImputer()
ct = ColumnTransformer()

### Abalone

In [None]:
imputer = SimpleImputer()
ct = ColumnTransformer()

In [19]:
# i teraz dla każdego datasetu

# Modelowanie

In [28]:
linear_regression = LinearRegression(n_jobs=-1)
knn = KNeighborsRegressor(n_jobs=-1)
random_forest = RandomForestRegressor(n_jobs=-1, random_state=random_state)

In [1]:
knn_params = {
    "n_neighbors": None,
    "weights": ["uniform", "distance"],
    "leaf_size": None,
    "p": [1, 2],
    # there could be other distance metrics but let's skip them
}

random_forest_params = {
    "n_estimator": None,
    "criterion": ["squared_error", "absolute_error"],
    "max_depth": [None, 3, 4, 5],
    "min_samples_split": None,
    "min_samples_leaf": None,
    "max_features": None,
    "bootstrap": None,
}

Przy wyczerpującym przeszukiwania siatki parametrów w celu znalezienia najlepszej kombinacji parametrów użyjemy walidacji krzyżowej .

[<img src="../img/grid_search_cross_validation.png" alt="drawing" width="400"/>]("../img/grid_search_cross_validation.png")
źródło: https://scikit-learn.org/stable/modules/cross_validation.html

In [10]:
# Możemy zrobić jeszcze poszukiwanie najlepszych hiperparametrów do modelu
grid_search_results = GridSearchCV()

## Analiza

Na tym tym posiadam już dostrojone, finalne modele. Teraz zostaną one porównane wg następujących metryk:

- MAE
- MAPE
- R2

In [4]:
# TODO: wybrać finalne metryki