Let us assess how the metrics change when polynomial features are added, and apply regularized linear models.

# Importing libraries and data

Launch a server via:
```bash
mlflow server --host 127.0.0.1 --port 8080
```

In [11]:
import sys
from pathlib import Path

import pandas as pd
import numpy as np

from sklearn.preprocessing import PolynomialFeatures

# from sklearn.feature_selection import RFE, SelectKBest, f_regression

from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import Ridge, Lasso, LassoCV, ElasticNetCV

root_folder = '../'
sys.path.append(root_folder)

from src.models import train_model, predict_model
from src.utils import get_dict

train_path = Path(root_folder, 'data', 'processed', '2.0_train.csv')
test_path = Path(root_folder, 'data', 'processed', '2.0_test.csv')

experiment_name = 'Housing cost'

Load the datasets:

In [2]:
train = pd.read_csv(train_path, index_col=0)
train.info()
X_train, y_train = train_model.get_X_y(train, target_name='log_target')
print()
print()

test = pd.read_csv(test_path, index_col=0)
test.info()
X_test, y_test = train_model.get_X_y(test, target_name='log_target')

<class 'pandas.core.frame.DataFrame'>
Index: 264639 entries, 0 to 264638
Data columns (total 41 columns):
 #   Column                           Non-Null Count   Dtype  
---  ------                           --------------   -----  
 0   baths                            264639 non-null  float64
 1   fireplace                        264639 non-null  bool   
 2   beds                             264639 non-null  float64
 3   stories                          264639 non-null  float64
 4   private_pool                     264639 non-null  bool   
 5   parking_count                    264639 non-null  float64
 6   central_heating                  264639 non-null  bool   
 7   central_cooling                  264639 non-null  bool   
 8   log_target                       264639 non-null  float64
 9   log_sqft                         264639 non-null  float64
 10  log_lotsize                      264639 non-null  float64
 11  updated_years                    264639 non-null  float64
 12  school_

## Ridge

Construct a `ColumnTransformer`:
- `float64` features go through `float_transformer`, which consists of `StandardScaler` followed by `PolynomialFeatures` (degree 3)
- The remaining features (mostly bool) go through `minmax_transformer`

Next comes a `SimpleImputer`, in case NaNs show up when the model is used later.
There are no missing values at the moment.

The final step is the Ridge model with L2 regularization.
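
The project helper `train_model.make_pipeline` is not shown in this notebook. As a rough sketch, the structure described above could be written with plain scikit-learn as follows. The synthetic data and the names `X_demo`, `y_demo` and `demo_pipe` are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, PolynomialFeatures, StandardScaler

rng = np.random.default_rng(0)
n = 200
# Synthetic stand-in for the housing data (values are made up).
X_demo = pd.DataFrame({
    "baths": rng.normal(2, 1, n),
    "beds": rng.normal(3, 1, n),
    "log_sqft": rng.normal(7, 0.5, n),
    "fireplace": rng.random(n) > 0.5,
})
y_demo = 0.3 * X_demo["log_sqft"] + 0.1 * X_demo["baths"] + rng.normal(0, 0.1, n)

float_cols = list(X_demo.columns[X_demo.dtypes == "float64"])
other_cols = [c for c in X_demo.columns if c not in float_cols]

# Scale first, then expand to all monomials up to degree 3.
float_transformer = Pipeline([
    ("scaler", StandardScaler()),
    ("polynom", PolynomialFeatures(3)),
])

demo_pipe = Pipeline([
    ("column_transformer", ColumnTransformer([
        ("float_transformer", float_transformer, float_cols),
        ("minmax_transformer", MinMaxScaler(), other_cols),
    ])),
    ("imputer", SimpleImputer(missing_values=np.nan, strategy="median")),
    ("regressor", Ridge()),
])

demo_pipe.fit(X_demo, y_demo)

# Number of columns that reach the regressor: 20 polynomial terms + 1 bool.
n_features = demo_pipe[:-1].transform(X_demo).shape[1]
print(n_features)
```

One design point worth checking: as far as I know, `PolynomialFeatures` rejects NaN input, so an imputer placed after the column transformer would probably not protect the float columns. Moving the imputation before the polynomial expansion may be safer if missing values are expected.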

In [3]:
imputer_params = get_dict(
    missing_values=np.nan,
    strategy='median'
)

float_columns = list(X_train.columns[X_train.dtypes == 'float64'])

min_max_columns = list(
    set(X_train.columns) - set(float_columns)
)

float_transformer = Pipeline(
    [
        ('scaler', StandardScaler()),
        ('polynom', PolynomialFeatures(3))
    ]
)

transformers = [
    ('float_transformer', float_transformer, float_columns),
    ('minmax_transformer', MinMaxScaler(), min_max_columns)
]

pipe_elements = [
    ('column_transformer', ColumnTransformer, transformers),
    ('imputer', SimpleImputer, imputer_params),
    ('regressor', Ridge)
]

pipe, pipe_params = train_model.make_pipeline(pipe_elements)
display(pipe)

# Conduct fitting and cross-validation metrics estimation
cv_metrics = predict_model.cross_validate_pipe(
    pipe=pipe,
    X=X_train,
    y=y_train,
    njobs=3
)

pipe.fit(X_train, y_train)
metrics = predict_model.get_train_test_metrics(
    pipe,
    X_train=X_train,
    X_test=X_test,
    y_train=y_train,
    y_test=y_test,
)
metrics = metrics | cv_metrics

| metric | cv_train | cv_validation |
|---|---|---|
| mape_log | 0.033 | 0.033 |
| r2_log | 0.587 | 0.585 |


| metric | train | test |
|---|---|---|
| mape | 0.473 | 0.475 |
| r2 | 0.464 | 0.446 |
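
The implementation of `predict_model.cross_validate_pipe` is not shown here. A table with `cv_train` and `cv_validation` columns like the one above could be produced roughly as in this sketch, which uses scikit-learn's `cross_validate` on synthetic data. The data, the 5-fold setting and the name `cv_table` are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_validate

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 4))
# Target on a log-like scale (mean around 12), made up for the example.
y_demo = 12 + X_demo @ np.array([0.4, -0.2, 0.1, 0.0]) + rng.normal(0, 0.2, 300)

scoring = {
    "mape_log": "neg_mean_absolute_percentage_error",
    "r2_log": "r2",
}
cv_result = cross_validate(
    Ridge(), X_demo, y_demo,
    cv=5, scoring=scoring, return_train_score=True, n_jobs=1,
)

# scikit-learn reports MAPE negated (higher is better), so flip the sign back.
cv_table = pd.DataFrame({
    "cv_train": {
        "mape_log": -cv_result["train_mape_log"].mean(),
        "r2_log": cv_result["train_r2_log"].mean(),
    },
    "cv_validation": {
        "mape_log": -cv_result["test_mape_log"].mean(),
        "r2_log": cv_result["test_r2_log"].mean(),
    },
}).round(3)
print(cv_table)
```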


In [4]:
model_info = predict_model.log_pipe_mlflow(
    pipe_name='ridge-pf',
    training_info='Ridge regression with PF and StandardTransform',
    X=X_train,
    pipe=pipe,
    pipe_params=pipe_params,
    metrics=metrics,
    experiment_name=experiment_name,
)

Successfully registered model 'ridge-pf'.
2024/04/24 21:22:22 INFO mlflow.store.model_registry.abstract_store: Waiting up to 300 seconds for model version to finish creation. Model name: ridge-pf, version 1
Created version '1' of model 'ridge-pf'.


This gives a lower `MAPE` than both baselines: 47.5% versus 49.2% for the decision tree and 53.1% for LinearRegression.

$R^2 = 0.446$ is better than linear regression (0.39) but worse than the decision tree (0.492).
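
The cross-validated `mape_log` (0.033) and the train/test `mape` (about 0.47) differ by an order of magnitude. A likely reason is that the first is measured on the log target while the second is measured after converting back to prices. The sketch below illustrates the effect on synthetic data. It assumes a natural-log target inverted with `np.exp`, which may not match what `get_train_test_metrics` actually does.

```python
import numpy as np
from sklearn.metrics import mean_absolute_percentage_error, r2_score

rng = np.random.default_rng(0)
# Synthetic log-prices and predictions with additive error on the log scale.
y_log_true = rng.normal(12.5, 0.8, 5000)
y_log_pred = y_log_true + rng.normal(0, 0.4, 5000)

# Metrics on the log scale: errors are small relative to values near 12.
mape_log = mean_absolute_percentage_error(y_log_true, y_log_pred)
r2_log = r2_score(y_log_true, y_log_pred)

# ASSUMPTION: the target is a natural log, so np.exp restores the original scale.
y_true = np.exp(y_log_true)
y_pred = np.exp(y_log_pred)
mape = mean_absolute_percentage_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)

print(f"log scale:      MAPE={mape_log:.3f}  R2={r2_log:.3f}")
print(f"original scale: MAPE={mape:.3f}  R2={r2:.3f}")
```

An additive error on the log scale becomes a multiplicative error on the price scale, so the same predictions look much worse in percentage terms after the inverse transform.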

## Lasso

Now try Lasso (L1 regularization).

In [5]:
imputer_params = get_dict(
    missing_values=np.nan,
    strategy='median'
)

float_columns = list(X_train.columns[X_train.dtypes == 'float64'])

min_max_columns = list(
    set(X_train.columns) - set(float_columns)
)

float_transformer = Pipeline(
    [
        ('scaler', StandardScaler()),
        ('polynom', PolynomialFeatures(3))
    ]
)

transformers = [
    ('float_transformer', float_transformer, float_columns),
    ('minmax_transformer', MinMaxScaler(), min_max_columns)
]

pipe_elements = [
    ('column_transformer', ColumnTransformer, transformers),
    ('imputer', SimpleImputer, imputer_params),
    ('regressor', Lasso)
]

pipe, pipe_params = train_model.make_pipeline(pipe_elements)
display(pipe)

# Conduct fitting and cross-validation metrics estimation
cv_metrics = predict_model.cross_validate_pipe(
    pipe=pipe,
    X=X_train,
    y=y_train,
    njobs=3
)

pipe.fit(X_train, y_train)
metrics = predict_model.get_train_test_metrics(
    pipe,
    X_train=X_train,
    X_test=X_test,
    y_train=y_train,
    y_test=y_test,
)
metrics = metrics | cv_metrics

| metric | cv_train | cv_validation |
|---|---|---|
| mape_log | 0.053 | 0.053 |
| r2_log | 0.0 | -0.0 |


| metric | train | test |
|---|---|---|
| mape | 0.894 | 0.886 |
| r2 | -0.085 | -0.086 |


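The results are poor. A cross-validated $R^2$ of 0 means the model does no better than predicting the mean, which most likely happens because the L1 penalty at its default strength shrinks every coefficient to zero. Let us try `LassoCV`, which tunes its regularization strength by cross-validation.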
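
A plausible explanation for the zero $R^2$ is that the default `alpha=1.0` of `Lasso` is too strong for a target with small variance, such as a log price. The sketch below reproduces this behaviour on synthetic data and shows `LassoCV` picking a much smaller `alpha`. The data and coefficients are made up for the example.

```python
import numpy as np
from sklearn.linear_model import Lasso, LassoCV

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 3))
# Small-variance target around 12, loosely mimicking a log-price.
y_demo = 12 + 0.3 * X_demo[:, 0] - 0.2 * X_demo[:, 1] + rng.normal(0, 0.1, 500)

# Default Lasso (alpha=1.0): the penalty outweighs every feature's signal.
default_lasso = Lasso().fit(X_demo, y_demo)
r2_default = default_lasso.score(X_demo, y_demo)
print("default coef:", default_lasso.coef_, "R2:", round(r2_default, 3))

# LassoCV searches a grid of alphas and keeps the one with the best CV error.
lasso_cv = LassoCV(cv=5, random_state=0).fit(X_demo, y_demo)
r2_cv = lasso_cv.score(X_demo, y_demo)
print("chosen alpha:", lasso_cv.alpha_, "R2:", round(r2_cv, 3))
```

With all coefficients at zero the model predicts the training mean for every row, which gives an $R^2$ of 0 on the data it was fit on.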

In [8]:
imputer_params = get_dict(
    missing_values=np.nan,
    strategy='median'
)

float_columns = list(X_train.columns[X_train.dtypes == 'float64'])

min_max_columns = list(
    set(X_train.columns) - set(float_columns)
)

float_transformer = Pipeline(
    [
        ('scaler', StandardScaler()),
        ('polynom', PolynomialFeatures(3))
    ]
)

transformers = [
    ('float_transformer', float_transformer, float_columns),
    ('minmax_transformer', MinMaxScaler(), min_max_columns)
]

pipe_elements = [
    ('column_transformer', ColumnTransformer, transformers),
    ('imputer', SimpleImputer, imputer_params),
    ('regressor', LassoCV)
]

pipe, pipe_params = train_model.make_pipeline(pipe_elements)
display(pipe)

pipe.fit(X_train, y_train)
metrics = predict_model.get_train_test_metrics(
    pipe,
    X_train=X_train,
    X_test=X_test,
    y_train=y_train,
    y_test=y_test,
)
metrics = metrics # | cv_metrics

| metric | train | test |
|---|---|---|
| mape | 0.477 | 0.478 |
| r2 | 0.449 | 0.434 |


In [9]:
model_info = predict_model.log_pipe_mlflow(
    pipe_name='lasso-pf',
    training_info='Lasso regression with PF and StandardTransform',
    X=X_train,
    pipe=pipe,
    pipe_params=pipe_params,
    metrics=metrics,
    experiment_name=experiment_name,
)

Successfully registered model 'lasso-pf'.
2024/04/24 21:40:30 INFO mlflow.store.model_registry.abstract_store: Waiting up to 300 seconds for model version to finish creation. Model name: lasso-pf, version 1
Created version '1' of model 'lasso-pf'.


The results are slightly worse than with Ridge.

## ElasticNet

Let us look at a combination of the first two models, with parameters selected by cross-validation.
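
Note that `ElasticNetCV` with default arguments keeps `l1_ratio` fixed at 0.5 and tunes only `alpha`. To also search over the mix between the L1 and L2 penalties, a list of candidates can be passed, as in this sketch on synthetic data. The grid `l1_grid` and the data are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(400, 5))
true_coef = np.array([0.3, -0.2, 0.0, 0.0, 0.1])
y_demo = 12 + X_demo @ true_coef + rng.normal(0, 0.1, 400)

# Candidate mixes: values near 0 lean toward Ridge, 1.0 is pure Lasso.
l1_grid = [0.1, 0.5, 0.9, 1.0]
enet = ElasticNetCV(l1_ratio=l1_grid, cv=5).fit(X_demo, y_demo)

print("chosen alpha:   ", enet.alpha_)
print("chosen l1_ratio:", enet.l1_ratio_)
print("train R2:       ", round(enet.score(X_demo, y_demo), 3))
```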

In [12]:
imputer_params = get_dict(
    missing_values=np.nan,
    strategy='median'
)

float_columns = list(X_train.columns[X_train.dtypes == 'float64'])

min_max_columns = list(
    set(X_train.columns) - set(float_columns)
)

float_transformer = Pipeline(
    [
        ('scaler', StandardScaler()),
        ('polynom', PolynomialFeatures(3))
    ]
)

transformers = [
    ('float_transformer', float_transformer, float_columns),
    ('minmax_transformer', MinMaxScaler(), min_max_columns)
]

pipe_elements = [
    ('column_transformer', ColumnTransformer, transformers),
    ('imputer', SimpleImputer, imputer_params),
    ('regressor', ElasticNetCV)
]

pipe, pipe_params = train_model.make_pipeline(pipe_elements)
display(pipe)

pipe.fit(X_train, y_train)
metrics = predict_model.get_train_test_metrics(
    pipe,
    X_train=X_train,
    X_test=X_test,
    y_train=y_train,
    y_test=y_test,
)
metrics = metrics # | cv_metrics

| metric | train | test |
|---|---|---|
| mape | 0.477 | 0.478 |
| r2 | 0.447 | 0.431 |


In [13]:
model_info = predict_model.log_pipe_mlflow(
    pipe_name='elasticnet-pf',
    training_info='ElasticNet regression with PF and StandardTransform',
    X=X_train,
    pipe=pipe,
    pipe_params=pipe_params,
    metrics=metrics,
    experiment_name=experiment_name,
)

Successfully registered model 'elasticnet-pf'.
2024/04/24 21:45:48 INFO mlflow.store.model_registry.abstract_store: Waiting up to 300 seconds for model version to finish creation. Model name: elasticnet-pf, version 1
Created version '1' of model 'elasticnet-pf'.


The test metrics did not improve over Ridge.

## Conclusions on using PolynomialFeatures

Combining PolynomialFeatures with L2 regularization (Ridge) gave the best MAPE of all models tried, and an $R^2$ second only to the decision tree.

Metrics of the best model (Ridge) on the final test set:
- **MAPE**: $0.475$ (an error of $47.5\%$, versus 49.2% for the decision tree and 53.1% for LinearRegression)
- **$R^2$**: $0.446$ (better than linear regression (0.39) but worse than the decision tree (0.492))