# AHP Basic Modeling
**Preprocess**
- I drop 4 features with missing values more than 80%
- I use pipeline to preprocess the data, including feature engineering (tho kinda useless for ensemble tree models), impute missing value using new class (None) for categorical data and median value for numeric data
- Yeah, I guess I did overlook for categorical data encoded in int64 data types, but for now, let them be (there's not much difference since not many variables have missing values)
- I use StandardScaler for scaling numeric features (how about RobustScaler?) (I don't have to do it for tree models tho)
- I use OrdinalEncoder for categorical data (better than one hot, less demanding in complexity)

**Modeling**
- As for modeling, I use 4 linear models, 1 neighbour model (what does it called?), and 4 tree models
- In here, we will do basic model with and without feature engineering and compare both of them.
- I use 5-fold cross validation for evaluating the model
- For evaluation metrics, I use MSLE (root mean squared log error) as a main metric (which I make from scratch, cause for some reason the default result in NaN), and RMSE (root mean squared error) as a helper metric to understand the model better
- I guess that's all for today's briefing.



In [None]:
import os
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from sklearn.feature_selection import mutual_info_regression

# models
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.neighbors import KNeighborsRegressor

from sklearn.ensemble import RandomForestRegressor, HistGradientBoostingRegressor
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor

# 
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import KFold
# from imblearn.pipeline import Pipeline as imbpipeline
from sklearn.pipeline import Pipeline
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, StandardScaler, FunctionTransformer, OrdinalEncoder, OneHotEncoder, RobustScaler
from sklearn.metrics import mean_squared_error, make_scorer, root_mean_squared_log_error, root_mean_squared_error




In [2]:
df = pd.read_csv('../data/train.csv')
df.drop(columns=['Id'], inplace=True)
df.head()

Unnamed: 0,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,Inside,...,0,,,,0,2,2008,WD,Normal,208500
1,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,FR2,...,0,,,,0,5,2007,WD,Normal,181500
2,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,Inside,...,0,,,,0,9,2008,WD,Normal,223500
3,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,Corner,...,0,,,,0,2,2006,WD,Abnorml,140000
4,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,FR2,...,0,,,,0,12,2008,WD,Normal,250000


In [3]:
y = df['SalePrice']
X = df.drop('SalePrice', axis=1)

# remove high missing cols
X.drop('PoolQC MiscFeature Alley Fence'.split(), axis=1, errors='ignore')

num_var = X.select_dtypes(include='number').columns
ord_var = X.select_dtypes(include='object').columns

In [44]:
# Feature engineering from kaggle's learn
# rooms' spaciousness, outside area, building type x ground area
def feature_eng(df):
  X = df.copy()

  outside = "WoodDeckSF OpenPorchSF EnclosedPorch 3SsnPorch ScreenPorch".split()

  X["Spaciousness"] = (X["1stFlrSF"]+X["2ndFlrSF"])/X["TotRmsAbvGrd"].replace(0, np.nan)
  X["TotalOutsideSF"] = X[outside].sum(axis=1)
  X["PorchTypes"] = X[outside].gt(0).sum(axis=1)
  
  # must use custom transformer: GroupMeanEncoder
  # X["MedNhbdArea"] = X.groupby("Neighborhood")["GrLivArea"].transform("median")

  # it's prone to error, and should use uhm. 
  X3 = pd.get_dummies(X.BldgType, prefix="Bldg", )
  X3 = X3.mul(X.GrLivArea, axis=0)
  
  return pd.concat([X, X3], axis=1)

feature_eng = FunctionTransformer(feature_eng)

In [None]:
# Preprocess for numerical, ordinal, and nominal features
# Well, not really tho. 
num_pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])
ord_pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='constant', fill_value='None')),
    ('encoder', OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1))
])

nom_pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='constant', fill_value='None')),
    ('encoder', OneHotEncoder(handle_unknown='ignore'))
])

preprocess = ColumnTransformer(transformers=[
    ('num', num_pipe, num_var),
    ('ord', ord_pipe, ord_var)
    # ('nom', nom_pipe, nom_var)
])


# full_pipe = Pipeline([
#     ('eng', feature_eng),
#     ('pre', preprocess)
# ])


In [6]:
# custom scorer
def rmse_log(yt, yp):
  lt = np.log1p(yt)
  lp = np.log1p(yp)
  return np.sqrt(np.mean((lt-lp)**2))

rmse_log_scorer = make_scorer(rmse_log, greater_is_better=False)

In [8]:
# Define a list of regression models
basic_models = [
    LinearRegression(),
    Ridge(random_state=42),
    Lasso(random_state=42),
    ElasticNet(random_state=42),
    KNeighborsRegressor(),
    RandomForestRegressor(random_state=42),
    HistGradientBoostingRegressor(random_state=42),
    XGBRegressor(random_state=42),
    LGBMRegressor(random_state=42),
]

In [53]:
# Modeling function
kf = KFold(n_splits=5, shuffle=True, random_state=42)

def modeling(models=basic_models, eng=False):
    res = []
    for model in models:
        if eng:
            full_pipe = Pipeline([
                ('eng', feature_eng),
                ('pre', preprocess),
                ('reg', model)
            ])
            print("featuring!")
        else: 
            full_pipe = Pipeline([
                ('pre', preprocess),
                ('reg', model)
            ])
            
        print(model)      
        grid_search = GridSearchCV(estimator=full_pipe, param_grid = {}, cv=kf, 
                                    scoring = {'rmse':'neg_root_mean_squared_error',
                                                'msle2':'neg_root_mean_squared_log_error',
                                                'msle1': rmse_log_scorer}, 
                                    refit = 'msle2', verbose=0, n_jobs=-1)
            
        grid_search.fit(X, y)

        # return grid_search.predict(X)
            
        result = {
                'model': type(model).__name__,
                'msle1': -grid_search.cv_results_['mean_test_msle1'][0],
                'rmse': -grid_search.cv_results_['mean_test_rmse'][0],
                'msle2': -grid_search.cv_results_['mean_test_msle2'][0],
                'time': grid_search.cv_results_['mean_fit_time'][0],
                'params': model.get_params()
            }
            
        res.append(result)
    res = pd.DataFrame(res)
    res.set_index('model', inplace=True)
    return pd.DataFrame(res), grid_search.cv_results_

In [40]:
# First modeling (51.9s)
res_basic, cv_basic = modeling()

LinearRegression()




Ridge(random_state=42)
Lasso(random_state=42)


  model = cd_fast.enet_coordinate_descent(


ElasticNet(random_state=42)
KNeighborsRegressor()
RandomForestRegressor(random_state=42)
HistGradientBoostingRegressor(random_state=42)
XGBRegressor(base_score=None, booster=None, callbacks=None,
             colsample_bylevel=None, colsample_bynode=None,
             colsample_bytree=None, device=None, early_stopping_rounds=None,
             enable_categorical=False, eval_metric=None, feature_types=None,
             feature_weights=None, gamma=None, grow_policy=None,
             importance_type=None, interaction_constraints=None,
             learning_rate=None, max_bin=None, max_cat_threshold=None,
             max_cat_to_onehot=None, max_delta_step=None, max_depth=None,
             max_leaves=None, min_child_weight=None, missing=nan,
             monotone_constraints=None, multi_strategy=None, n_estimators=None,
             n_jobs=None, num_parallel_tree=None, ...)
LGBMRegressor(random_state=42)
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing w

In [54]:
# Second modeling (39.7s)
res_eng, cv_eng = modeling(eng=True)

featuring!
LinearRegression()




featuring!
Ridge(random_state=42)
featuring!
Lasso(random_state=42)


  model = cd_fast.enet_coordinate_descent(


featuring!
ElasticNet(random_state=42)
featuring!
KNeighborsRegressor()
featuring!
RandomForestRegressor(random_state=42)
featuring!
HistGradientBoostingRegressor(random_state=42)
featuring!
XGBRegressor(base_score=None, booster=None, callbacks=None,
             colsample_bylevel=None, colsample_bynode=None,
             colsample_bytree=None, device=None, early_stopping_rounds=None,
             enable_categorical=False, eval_metric=None, feature_types=None,
             feature_weights=None, gamma=None, grow_policy=None,
             importance_type=None, interaction_constraints=None,
             learning_rate=None, max_bin=None, max_cat_threshold=None,
             max_cat_to_onehot=None, max_delta_step=None, max_depth=None,
             max_leaves=None, min_child_weight=None, missing=nan,
             monotone_constraints=None, multi_strategy=None, n_estimators=None,
             n_jobs=None, num_parallel_tree=None, ...)
featuring!
LGBMRegressor(random_state=42)
[LightGBM] [Info]

In [55]:
display(res_basic.sort_values('msle1', ascending=True), 
        res_eng.sort_values('msle1', ascending=True))

Unnamed: 0_level_0,msle1,rmse,msle2,time,params
model,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
LGBMRegressor,0.138314,29202.882384,0.138314,0.929727,"{'boosting_type': 'gbdt', 'class_weight': None..."
HistGradientBoostingRegressor,0.138626,29178.515451,0.138626,0.947151,"{'categorical_features': 'from_dtype', 'early_..."
XGBRegressor,0.141867,31259.18125,0.141867,0.754835,"{'objective': 'reg:squarederror', 'base_score'..."
RandomForestRegressor,0.146925,30268.857053,0.146925,4.556262,"{'bootstrap': True, 'ccp_alpha': 0.0, 'criteri..."
ElasticNet,0.152255,34939.789808,0.152255,0.143919,"{'alpha': 1.0, 'copy_X': True, 'fit_intercept'..."
Lasso,0.17431,39338.024485,,0.113706,"{'alpha': 1.0, 'copy_X': True, 'fit_intercept'..."
LinearRegression,0.174396,39354.855969,,0.098189,"{'copy_X': True, 'fit_intercept': True, 'n_job..."
Ridge,0.196038,37968.769895,0.196038,0.089308,"{'alpha': 1.0, 'copy_X': True, 'fit_intercept'..."
KNeighborsRegressor,0.197946,41285.438737,0.197946,0.060095,"{'algorithm': 'auto', 'leaf_size': 30, 'metric..."


Unnamed: 0_level_0,msle1,rmse,msle2,time,params
model,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
LGBMRegressor,0.138314,29202.882384,0.138314,0.587,"{'boosting_type': 'gbdt', 'class_weight': None..."
HistGradientBoostingRegressor,0.138626,29178.515451,0.138626,0.993028,"{'categorical_features': 'from_dtype', 'early_..."
XGBRegressor,0.141867,31259.18125,0.141867,0.978966,"{'objective': 'reg:squarederror', 'base_score'..."
RandomForestRegressor,0.146925,30268.857053,0.146925,4.372874,"{'bootstrap': True, 'ccp_alpha': 0.0, 'criteri..."
ElasticNet,0.152255,34939.789808,0.152255,0.075942,"{'alpha': 1.0, 'copy_X': True, 'fit_intercept'..."
Lasso,0.17431,39338.024485,,0.129799,"{'alpha': 1.0, 'copy_X': True, 'fit_intercept'..."
LinearRegression,0.174396,39354.855969,,0.098413,"{'copy_X': True, 'fit_intercept': True, 'n_job..."
Ridge,0.196038,37968.769895,0.196038,0.066914,"{'alpha': 1.0, 'copy_X': True, 'fit_intercept'..."
KNeighborsRegressor,0.197946,41285.438737,0.197946,0.054513,"{'algorithm': 'auto', 'leaf_size': 30, 'metric..."


**Insight**
- Feature engineering is useless, even in linear model (?)
- Maybe, because the one I implemented isn't that important, while the other features already have a great effect
- Even a likely important new feature is a useless? whoah, i'm surprised LOL. 
- Anyway, the best 4 are tree models, followed by 4 linear models, and the last one is KN regression
- Elastic net, a linear regression model with L1 & L2 penalty got a pretty good result following the tree models
-  The best model is LGBM Regressor, with 0.138 MSLE and 29203 RMSE, so the mean prediction error is around 29k (note that minimum saleprice is ~35k, median saleprice is ~163k) which isn't really good prediction in average (~18% error, based on a median price), let alone for a cheaper house. 


In [None]:
from sklearn.metrics import mean_squared_error
