# Libraries and Dataset Initialization
First, we import the Python libraries we will be using:
NumPy, pandas, Seaborn, Matplotlib, scikit-learn, SciPy, XGBoost and Prince.

Prince is used for MCA and FAMD and may need to be installed first.

In [None]:
!pip install prince
!pip install -U scipy

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import prince
import scipy
import sklearn
import xgboost
from sklearn import preprocessing
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LassoCV
from sklearn.linear_model import RidgeCV
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import KFold, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import FunctionTransformer
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import GridSearchCV
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from scipy.special import boxcox1p
from scipy import stats
from math import ceil
from scipy.stats import probplot
from sklearn.model_selection import GridSearchCV, RepeatedKFold, cross_val_score
from sklearn.metrics import make_scorer
from sklearn.linear_model import LinearRegression, Lasso, Ridge, ElasticNet, LogisticRegression
from sklearn.svm import LinearSVR, SVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from xgboost import XGBRegressor
from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.preprocessing import power_transform
from sklearn.feature_selection import RFECV
from sklearn.ensemble import AdaBoostRegressor
from sklearn import metrics
from numpy import *

The following cell mounts Google Drive so that we can work with files stored there.

Note: you will have to sign in and grant access in a pop-up window.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

We need the train.csv file, which is found under https://drive.google.com/file/d/1_PLO-CF84thdpb-dMlE8y24nEyH-g7qh/view?usp=sharing.

To read it, we take the file ID from the URL and append it to /uc?id= to get a direct download link, which can be loaded into a pandas dataframe.

In [None]:
url='https://drive.google.com/file/d/1_PLO-CF84thdpb-dMlE8y24nEyH-g7qh/view?usp=sharing'
url='https://drive.google.com/uc?id=' + url.split('/')[-2]
init_df = pd.read_csv(url)

If you are running this locally, skip the Drive-related cells and use the code block below instead.

In [None]:
#init_df = pd.read_csv('train.csv')

Once the dataset has been imported, we save a copy of it in the df variable (dataframe).

In [None]:
# df is a copy of the initial df so we don't have to redownload many times
df = init_df.copy()

## **General Setup**

Now that we have our dataset, let's set up the roadmap of what we want to achieve in this notebook. The goal here is to test different methods of preparing this data on different models. Let's quickly go over the preparation techniques and models.

# Data Preparation
Data preparation means adjusting the raw data so that it improves the performance of our models. It focuses on the following aspects:
> **Preprocessing**

> In this notebook, preprocessing refers to the basic steps we apply to every dataset to make the data usable for the models: ordinal encoding, one-hot encoding of nominal features, and handling missing data.

> **Feature Engineering**

>This is an additional preprocessing step where we manually adjust features to improve our dataset. This is done by transforming numerical features with Box-Cox, log and Yeo-Johnson transformations, removing multicollinear features, engineering new features and removing outliers.


> **Feature Selection**

>This is an additional preprocessing step where we use L1 regularization recursively, so that features with little predictive power are regularized out. It is an automated alternative to manual feature engineering.

> **Dimensionality Reduction**

>Dimensionality reduction means lowering the number of features to reduce the chance of overfitting. This is done with several methods that all reduce features according to similar principles: PCA for numeric data, MCA for nominal data and FAMD for a mix of both.

# Models
Our models are the actual prediction algorithms, which learn from the prepared dataset and then predict SalePrice for the validation/test set. The models are:
> LinearRegression, RidgeCV, LassoCV, AdaBoostRegressor, XGBoost

# Structure
In this notebook, we want to test the combinations of data preparation techniques with the different models. We first introduce the different preprocessed datasets and the functions that produce them.
After this, we introduce the models and, for each model, determine the best dataset with respect to feature engineering and dimensionality reduction.
Finally, we test the combinations of the best preprocessed datasets with all the models.
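
The testing loop described above can be sketched as follows. This is only an illustration on synthetic data, not the notebook's evaluation code: the `datasets` and `models` dictionaries, their keys and the use of 5-fold cross-validation are assumptions made for the example.

```python
from itertools import product

from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

# Two synthetic stand-ins for differently prepared datasets (hypothetical names)
datasets = {
    "prep_a": make_regression(n_samples=120, n_features=8, noise=5.0, random_state=0),
    "prep_b": make_regression(n_samples=120, n_features=4, noise=5.0, random_state=1),
}
models = {"linear": LinearRegression(), "ridge": Ridge(alpha=1.0)}

# Score every (dataset, model) pair with 5-fold cross-validated R^2
results = {}
for (data_name, (X, y)), (model_name, model) in product(datasets.items(), models.items()):
    results[(data_name, model_name)] = cross_val_score(model, X, y, cv=5).mean()

# Report the best model for each dataset
for data_name in datasets:
    best = max(models, key=lambda m: results[(data_name, m)])
    print(data_name, best, round(results[(data_name, best)], 3))
```

Keeping the scores in one dictionary keyed by (dataset, model) makes it easy to look up the best preparation for a given model afterwards.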



## Data Preprocessing

# Preprocessing
Let's define 3 categories into which we can separate the features in our dataframe. This is useful for preprocessing the data and might be useful later on as well.


> Nominal data: Data classified into different categories without ranking.

> Ordinal data: Data classified into different categories with ranking.

> Numeric data: Data expressed as real-valued numbers.


Not usable: The pool features have been deemed unusable (clarified in the code comments).
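
To illustrate how the three categories are treated differently, here is a small sketch on a made-up dataframe. The rows and the `quality_rank` mapping are chosen for illustration only: the nominal column is one-hot encoded, the ordinal column is mapped to ranked integers, and the numeric column is left unchanged.

```python
import pandas as pd

toy = pd.DataFrame({
    "Street": ["Pave", "Grvl", "Pave"],   # nominal: categories without ranking
    "ExterQual": ["TA", "Ex", "Fa"],      # ordinal: Fa < TA < Ex
    "LotArea": [8450, 9600, 11250],       # numeric
})

# Nominal -> one indicator column per category
nominal = pd.get_dummies(toy[["Street"]])

# Ordinal -> integers that preserve the ranking
quality_rank = {"Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}
ordinal = toy["ExterQual"].map(quality_rank)

encoded = pd.concat([toy[["LotArea"]], nominal, ordinal], axis=1)
print(encoded)
```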


In [None]:
# First divide data into the numerical or categorical, so we can preprocess it

# List of nominal and ordinal columns extracted from data_description.txt
nom_col = ['MSSubClass', 'MSZoning','Street','Alley','LandContour','LotConfig','Neighborhood','Condition1',
           'Condition2','BldgType','HouseStyle','RoofStyle','RoofMatl','Exterior1st','Exterior2nd',
           'BsmtFinType1','BsmtFinType2','Electrical','FireplaceQu','PavedDrive','Fence','SaleCondition',
           'MasVnrType','Foundation','Heating','CentralAir','GarageType','SaleType','MiscFeature',
           # Month sold and Year should be considered nominal, as they can have 
           # an effect on sale price (price fluctuations or whatever), but not
           # on an ordinal scale
           'MoSold' , 'YrSold'
           ]

ord_col = ['LotShape','Utilities','LandSlope','ExterQual','ExterCond','BsmtQual','BsmtCond','BsmtExposure',
           'HeatingQC','KitchenQual','Functional',
           'GarageFinish','GarageQual','GarageCond','PoolQC','OverallQual','OverallCond',
           # The following are added as ordinal features because they have a
           # very low number of discrete categories, and it intuitively makes
           # sense to consider 2 fullbaths better than 1.
           'BsmtFullBath', 'BsmtHalfBath','FullBath','HalfBath','BedroomAbvGr',
           'KitchenAbvGr','TotRmsAbvGrd','Fireplaces','GarageCars'
           ]

# The following are not used and are left out on purpose. They contain only
# 7 measurements, and far too many NaNs to easily be used.
# Alternatively, they could be one-hot encoded so that abundance of NaN or '0'
# will not impact the regression algorithms as negatively.
# Another alternative could be to create a new nominal feature of whether a
# house has a pool or not, or [no pool, average pool, nice pool].
not_usable = ['PoolQC' , 'PoolArea',]

# List of all columns
allcat = list(df.columns[1:])

# List of numerical columns
num_col = [x for x in allcat if x not in nom_col+ord_col]
print(num_col)

Now that we have 3 categories into which we can classify our features, we come to the data preprocessing stage.

Data preprocessing can be done in several ways, so let's first set up some functions which can help us with it.

Impute: 
>Imputes (estimates) a value for each missing entry in the data. This is only done for LotFrontage. 

Fixnan: 
>Fixes a few special cases before the remaining NaNs are filled: missing GarageYrBlt values are set to YearBuilt, rows with missing Electrical or masonry veneer values are dropped, and MasVnrArea is converted to float.

Fillnan: 
>Deals with the remaining missing data (NaN) by replacing it with 'None' for our nominal and ordinal features and 0 for the numerical features.

PreProcess: 
> Applies fixnan and fillnan and splits the data into the categories defined earlier. The not_usable features are removed later, in drop_features.

> Then we apply one-hot encoding to the nominal features and manually map the ordinal features to ensure that the rankings for each feature are correctly covered.

>Finally we merge the features together again to create a full preprocessed dataframe.
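
As a rough illustration of what multivariate imputation does, the sketch below fills two missing values in a toy column from a correlated second column. The column names and numbers are invented for the example, and default `IterativeImputer` settings are assumed rather than the settings used in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables IterativeImputer)
from sklearn.impute import IterativeImputer

# Toy data: 'frontage' roughly follows 'area', with two missing entries
toy = pd.DataFrame({
    "area": [50.0, 60.0, 70.0, 80.0, 90.0, 100.0],
    "frontage": [5.1, 6.0, np.nan, 8.2, np.nan, 9.9],
})

imp = IterativeImputer(random_state=0)
# Passing index=toy.index keeps the original row labels on the result
filled = pd.DataFrame(imp.fit_transform(toy), columns=toy.columns, index=toy.index)
print(filled)
```

Observed values are left as they are; only the NaN entries are replaced by estimates derived from the other column.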






In [None]:
def impute(input):
    # Only used for LotFrontage to avoid too much imputation causing inaccuracy
    imp = IterativeImputer(n_nearest_features=None, imputation_order='ascending')
    imp.fit(input)
    output = pd.DataFrame(imp.transform(input), columns = input.columns)
    return output


def fixnan(input):
    # Fix the garage year built - Inputting 0 would put it very far below all
    # other houses, so instead if there is no garage, simply input the year
    # the house was built in the garage year built column.
    input.loc[input.GarageYrBlt.isnull(),'GarageYrBlt'] = input.loc[input.GarageYrBlt.isnull(),'YearBuilt']

    # Drop the single row with NaN in 'Electrical' since this is a missing 
    # measurement. Additionally, drop the rows related to the masonry that
    # contain NaN values, as these are also missing measurements, and there are
    # only a few of them, less than 10 rows.
    input.dropna(subset = ['Electrical','MasVnrType','MasVnrArea'], inplace=True)

    # Fix MasVnrArea as it is for some reason an object type
    input.MasVnrArea = input.MasVnrArea.astype(str).astype(float)

    return input


def fillnan(input, remlf=1):
    # Fill NaNs in the categorical columns with the placeholder 'None'
    for col in nom_col + ord_col:
        input[col] = input[col].fillna('None')

    # Remove LotFrontage from list of columns to fill. We don't want to fill 
    # this, as it is a continuous numerical measurement we can fill with 
    # multivariate imputation instead, to benefit our model's accuracy
    col_num_nan = num_col.copy()
    if remlf == 1 and 'LotFrontage' in col_num_nan:
        col_num_nan.remove('LotFrontage')
    # Fill with the number 0 (not the string '0') so the columns stay numeric
    for col in col_num_nan:
        input[col] = input[col].fillna(0)
    return input


def preprocess(df, remlf=1):
    # NOTE: the result of this drop is not assigned, so the not_usable columns
    # stay in the dataframe here; drop_features removes them later on.
    df.drop(not_usable, axis=1)
    df = fixnan(df)
    df = fillnan(df, remlf)

    # Split nominal, ordinal, and numerical columns
    df_nom = df[nom_col]
    df_ord = df[ord_col].copy()
    df_num = df.drop(list(nom_col+ord_col), axis=1)

    # One-hot encode nominal columns
    df_nom = pd.get_dummies(data=df_nom)
    df_MSSubClass = pd.get_dummies(data=df.MSSubClass, prefix='MSSubClass', prefix_sep='_')

    # Map ordinal columns. Codes that appear on more than one scale
    # ('Gd', 'Mod', 'Sev') are mapped per column first, so the frame-wide
    # replacements below cannot give them the rank of another feature.
    df_ord['BsmtExposure'] = df_ord['BsmtExposure'].replace({'NA' : 0, 'No' : 0, 'Mn' : 1, 'Av' : 2, 'Gd' : 3})
    df_ord['LandSlope'] = df_ord['LandSlope'].replace({'Gtl' : 0, 'Mod' : 1, 'Sev' : 2})
    df_ord['Functional'] = df_ord['Functional'].replace({'Sal' : 0, 'Sev' : 1, 'Maj2' : 2, 'Maj1' : 3, 'Mod' : 4, 'Min2' : 5, 'Min1' : 6, 'Typ' : 7})
    df_ord = df_ord.replace({'NA' : 0, 'Unf' : 1, 'LwQ' : 2, 'Rec' : 3, 'BLQ' : 4, 'ALQ' : 5, 'GLQ' : 6})
    df_ord = df_ord.replace({'NA' : 0, 'None' : 0,  'Po' : 1, 'Fa' : 2, 'TA' : 3, 'Gd' : 4, 'Ex' : 5})
    df_ord = df_ord.replace({'IR3' : 0, 'IR2' : 1, 'IR1' : 2, 'Reg' : 3})
    df_ord = df_ord.replace({'Low' : 0, 'HLS' : 1, 'Bnk' : 2, 'Lvl' : 3})
    df_ord = df_ord.replace({'NA' : 0, 'Unf' : 1, 'RFn' : 2, 'Fin' : 3})
    df_ord = df_ord.replace({'N' : 0, 'P' : 1, 'Y' : 2})
    df_ord = df_ord.replace({'ELO' : 0, 'NoSeWa' : 1, 'NoSewr' : 2, 'AllPub' : 3})

    # Merge nominal, ordinal, and numerical back together
    df = pd.concat([df_num, df_MSSubClass, df_nom, df_ord], axis=1)
    df = df.drop("MSSubClass", axis=1)

    return df


Now that we have these functions defined, let's create a preprocessed dataset ppr_df. This will be the basis for the rest of our Data Preprocessing, as well as being 1 of the 8 datasets we want to test. 

In [None]:
# Preprocess, leave LotFrontage with NaN (to impute later)
pp_df = preprocess(df)
# Impute missing LotFrontage values
ppr_df = impute(pp_df)
# Dataset which only removes NaNs and labels columns as ordinal/nominal data and int/float data for the Auto Sklearn dataset.
ppr_auto_df = preprocess(df, remlf=0)
ppr_auto_df['LotFrontage'] = ppr_auto_df['LotFrontage'].astype('int')

# Feature Engineering

For feature engineering we want to perform a few steps to adjust the dataset. Let's write a function for each of these steps.

Drop_Features
>We remove one feature from every pair of highly correlated features to prevent multicollinearity. Multicollinearity occurs when 2 or more independent variables are highly correlated with one another, which negatively affects the model's ability to identify the most important features.
>We also remove features where a single value accounts for more than 96% of the entries.

Remove_Outliers
>We remove outliers for LotFrontage, LotArea, BsmtFinSF1, TotalBsmtSF and GrLivArea to ensure that the model is not affected too much by extreme values.

Add_Features
>We add some new features based on what we know about other features, like calculating TotalPorch by adding OpenPorchSF, EnclosedPorch and ScreenPorch.
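
The two checks behind Drop_Features can be sketched on synthetic data as follows. The correlation threshold of 0.8 and the generated columns are assumptions for this example; the notebook itself lists the correlated pairs by hand.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
area = rng.normal(1500, 300, n)
toy = pd.DataFrame({
    "GrLivArea": area,
    "TotRmsAbvGrd": area / 250 + rng.normal(0, 0.3, n),  # strongly tied to GrLivArea
    "LotArea": rng.normal(9000, 2000, n),                # unrelated to the others
    "PoolArea": np.where(np.arange(n) < 198, 0, 500),    # 99% zeros
})

# 1) Pairs of features whose absolute Pearson correlation exceeds a threshold
corr = toy.corr().abs()
threshold = 0.8  # assumed cut-off for "highly correlated"
pairs = [
    (a, b)
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if corr.loc[a, b] > threshold
]

# 2) Features where a single value covers more than 96% of the rows
dominant = [
    col for col in toy.columns
    if toy[col].value_counts(normalize=True).iloc[0] > 0.96
]
print(pairs, dominant)
```

From each reported pair only one feature would be kept, and the features in `dominant` would be dropped.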

In [None]:
#Feature Engineering
def drop_features(df):
    #Unusable features
    df.drop(not_usable, axis=1, inplace=True)

      
    # Features that are highly related
    '''
    - GarageYrBlt and YearBuilt
    - TotRmsAbvGrd and GrLivArea
    - 1stFlrSF and TotalBsmtSF
    - GarageArea and GarageCars
    We remove one of every pair to prevent the problem of multicollinearity. Multicollinearity is when 2 or more
    independent variables are highly correlated with another. This negatively affects the model's ability to identify
    the most important features'''
    df.drop(['GarageYrBlt','TotRmsAbvGrd','1stFlrSF','GarageCars'], axis=1, inplace=True)

    # These features have no linear relation with SalePrice at all
    df.drop(['MoSold','YrSold'], axis=1, inplace=True)


    # Features that have mostly just one value:
    # if more than 96% of the rows share a single value, we drop the feature
    all_features = list(df.columns[1:])
    overfit_cat = []
    for i in all_features:
        counts = df[i].value_counts()
        # value_counts sorts descending, so the first entry is the most common value
        most_common = counts.iloc[0]
        if most_common / len(df) * 100 > 96:
            overfit_cat.append(i)

    df.drop(overfit_cat, axis=1, inplace=True)

    return df

def remove_outliers(df):
    df.drop(df[df['LotFrontage'] > 200].index, inplace=True)
    df.drop(df[df['LotArea'] > 100000].index, inplace=True)
    df.drop(df[df['BsmtFinSF1'] > 4000].index, inplace=True)
    df.drop(df[df['TotalBsmtSF'] > 5000].index, inplace=True)
    df.drop(df[df['GrLivArea'] > 4000].index, inplace=True)

    return df



def add_features(df):

    #Sum of features
    df['TotalLot'] = df['LotFrontage'] + df['LotArea']
    df['TotalBsmtFin'] = df['BsmtFinSF1'] + df['BsmtFinSF2']
    df['TotalBath'] = df['FullBath'] + df['HalfBath']
    df['TotalPorch'] = df['OpenPorchSF'] + df['EnclosedPorch'] + df['ScreenPorch']

    #Binary columns for features that indicate presence
    colum = ['MasVnrArea','TotalBsmtFin','TotalBsmtSF','2ndFlrSF','WoodDeckSF','TotalPorch']

    for col in colum:
        col_name = col+'_bin'
        df[col_name] = df[col].apply(lambda x: 1 if x > 0 else 0)

    #The following additional features cause the model to overfit, so maybe we should only apply it on specific features
    #Applying it on numeric features only makes the model weaker

    #Cross product of all categories
    allcat = list(df.columns[1:])
    '''
    for i in range(len(allcat)):
        cat_a = allcat[i]
        for j in range(i + 1, len(allcat)):
            cat_b = allcat[j]
            col_name = f'{cat_a} * {cat_b}'
            df[col_name] = df[df.columns[i]] * df[df.columns[j]]
    '''
    #Square of each category
    '''
    for i in range(len(allcat)):
        cat = allcat[i]
        col_name = f'{cat}^2'
        df[col_name] = df[df.columns[i]] ** 2
    '''

    #Third power of each category
    '''
    for i in range(len(allcat)):
        cat = allcat[i]
        col_name = f'{cat}^3'
        df[col_name] = df[df.columns[i]] ** 3
    '''
    return df



Apart from these steps, we can also transform the numerical features with Box-Cox and Yeo-Johnson transformations. These produce the datasets bc_df and yj_df respectively.
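
To show what these transformations do to a skewed feature, here is a small sketch on synthetic log-normal data. The sample and its size are made up; the fixed lambda of 0.25 mirrors the value used in the cell below.

```python
import numpy as np
import pandas as pd
from scipy.special import boxcox1p
from sklearn.preprocessing import power_transform

rng = np.random.default_rng(0)
# Log-normal samples are strongly right-skewed, like many area features
x = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=1000))

skew_raw = x.skew()
# Box-Cox of 1 + x with a fixed lambda
skew_bc = pd.Series(boxcox1p(x, 0.25)).skew()
# Yeo-Johnson with lambda fitted by maximum likelihood (output is also standardized)
yj = power_transform(x.to_frame(), method="yeo-johnson")
skew_yj = pd.Series(yj[:, 0]).skew()

print(f"raw: {skew_raw:.2f}, box-cox(1+x): {skew_bc:.2f}, yeo-johnson: {skew_yj:.2f}")
```

Note that `power_transform` standardizes its output by default, so a Yeo-Johnson-transformed column is on a different scale than the original.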

In [None]:
# Box-Cox transform all skewed numerical features except SalePrice,
# as it is almost exactly a log-normal dist, so it gets log-transformed
bc_df = ppr_df.copy()
skewlist = []
for i in bc_df[num_col]:
    if abs(bc_df[i].skew()) > 0.5:
        skewlist.append(i)

for v in skewlist:
    if v == 'SalePrice':
        bc_df[v] = np.log1p(bc_df[v])
    else:
        # Box-Cox transform of 1 + x with a fixed lambda of 0.25
        bc_df[v] = boxcox1p(bc_df[v], 0.25)

# Find all columns that are skewed and yeo-johnson transform all of them:
skewlist = []
for i in df[num_col]:
    if abs(df[i].skew()) > 0.5:
        skewlist.append(i)
yj_df = pd.DataFrame(power_transform(df[skewlist], method='yeo-johnson'), columns=skewlist)

We now have 2 transformed datasets and 3 functions for manual feature engineering, but we do not yet know what works best. So let's apply every combination of the three functions to ppr_df and bc_df. This leaves us with 2x8 datasets, from which we will later determine the best one for each model. Since the Yeo-Johnson dataset already has a reduced set of features, we will not apply drop features, add features or remove outliers to it.

>ppr_df: basis preprocessed dataset

*   ppr_dr_df: ppr with drop features
*   ppr_ro_df: ppr with remove outliers
*   ppr_af_df: ppr with add features
*   ppr_dr_ro_df: ppr with drop features and remove outliers
*   ppr_dr_af_df: ppr with drop features and add features
*   ppr_ro_af_df: ppr with remove outliers and add features
*   ppr_dr_ro_af_df: ppr with drop features, remove outliers and add features


>bc_df: basis preprocessed dataset with Box-Cox transformation

*   bc_dr_df: bc with drop features
*   bc_ro_df: bc with remove outliers
*   bc_af_df: bc with add features
*   bc_dr_ro_df: bc with drop features and remove outliers
*   bc_dr_af_df: bc with drop features and add features
*   bc_ro_af_df: bc with remove outliers and add features
*   bc_dr_ro_af_df: bc with drop features, remove outliers and add features

>yj_df: basis preprocessed dataset with Yeo-Johnson transformation
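
The eight variants per base dataset correspond to all subsets of the three steps. As a hedged alternative to writing each one out by hand, they could be generated with a loop. The helper `build_variants` and the three toy step functions below are hypothetical stand-ins, not the notebook's functions.

```python
from itertools import combinations

import pandas as pd

def build_variants(base_df, steps, prefix):
    """Return {name: dataframe} for every subset of steps, applied in the given order."""
    variants = {}
    for r in range(len(steps) + 1):
        for combo in combinations(steps, r):
            out = base_df.copy()  # work on a copy so the base dataset is never modified
            for _, func in combo:
                out = func(out)
            name = "_".join([prefix, *(tag for tag, _ in combo), "df"])
            variants[name] = out
    return variants

# Hypothetical stand-ins for drop_features, remove_outliers and add_features
steps = [
    ("dr", lambda d: d.drop(columns=["b"])),
    ("ro", lambda d: d[d["a"] < 100]),
    ("af", lambda d: d.assign(a_bin=(d["a"] > 0).astype(int))),
]
toy = pd.DataFrame({"a": [1, 2, 500], "b": [0, 0, 1]})
variants = build_variants(toy, steps, "ppr")
print(sorted(variants))
```

Copying the base dataframe before each combination matters here, because the step functions in this notebook modify their input in place.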

In [None]:
#Preprocessed

#Drop Features
ppr_dr_df = ppr_df.copy()
ppr_dr_df = drop_features(ppr_dr_df)
#Remove Outliers
ppr_ro_df = ppr_df.copy()
ppr_ro_df = remove_outliers(ppr_ro_df)
#Add Features
ppr_af_df = ppr_df.copy()
ppr_af_df = add_features(ppr_af_df)
#Drop Features + Remove Outliers
ppr_dr_ro_df = ppr_dr_df.copy()
ppr_dr_ro_df = remove_outliers(ppr_dr_ro_df)
#Drop Features + Add Features
ppr_dr_af_df = ppr_dr_df.copy()
ppr_dr_af_df = add_features(ppr_dr_af_df)
#Remove Outliers + Add Features
ppr_ro_af_df = ppr_df.copy()
ppr_ro_af_df = remove_outliers(ppr_ro_af_df)
ppr_ro_af_df = add_features(ppr_ro_af_df)
#Drop Features + Remove Outliers + Add Features
ppr_dr_ro_af_df = ppr_dr_df.copy()
ppr_dr_ro_af_df = remove_outliers(ppr_dr_ro_af_df)
ppr_dr_ro_af_df = add_features(ppr_dr_ro_af_df)


#Box Cox DF

#Drop Features
bc_dr_df = bc_df.copy()
bc_dr_df = drop_features(bc_dr_df)
#Remove Outliers
bc_ro_df = bc_df.copy()
bc_ro_df = remove_outliers(bc_ro_df)
#Add Features
bc_af_df = bc_df.copy()
bc_af_df = add_features(bc_af_df)
#Drop Features + Remove Outliers
bc_dr_ro_df = bc_dr_df.copy()
bc_dr_ro_df = remove_outliers(bc_dr_ro_df)
#Drop Features + Add Features
bc_dr_af_df = bc_dr_df.copy()
bc_dr_af_df = add_features(bc_dr_af_df)
#Remove Outliers + Add Features
bc_ro_af_df = bc_df.copy()
bc_ro_af_df = remove_outliers(bc_ro_af_df)
bc_ro_af_df = add_features(bc_ro_af_df)
#Drop Features + Remove Outliers + Add Features
bc_dr_ro_af_df = bc_dr_df.copy()
bc_dr_ro_af_df = remove_outliers(bc_dr_ro_af_df)
bc_dr_ro_af_df = add_features(bc_dr_ro_af_df)



# Feature Selection
Recursive Feature Elimination 

Feature selection is an ‘automated’ alternative to feature engineering. By using L1 regularization recursively, features with little effect on the predictive power are regularized out, i.e. assigned a weight of 0. 

For this we use a different preprocessed dataset, produced by the preprocessing variant without the impute function.

This will be the feature selection dataframe fs_df.
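
The idea can be tried out on synthetic data first. The sketch below is an illustration under assumed settings (`alpha=1.0`, 5-fold cross-validation, features standardized beforehand), not the configuration used for fs_df.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFECV
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Synthetic data: 10 features, only 3 of which influence the target
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=1.0, random_state=0)
X = StandardScaler().fit_transform(X)

# L1 regularization on its own already pushes weak coefficients to exactly 0
lasso = Lasso(alpha=1.0).fit(X, y)
print("non-zero Lasso coefficients:", int((lasso.coef_ != 0).sum()))

# RFECV repeatedly drops the feature with the smallest importance and keeps
# the feature count that gives the best cross-validated score
selector = RFECV(Lasso(alpha=1.0), step=1, cv=5).fit(X, y)
print("features kept by RFECV:", selector.n_features_)
print("selection mask:", selector.support_)
```

`selector.support_` is a boolean mask over the input columns, which can be used to drop the eliminated features from a dataframe.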

In [None]:
def split(df, val_frac=0.2, seed=1):
    # Validation set is a sample of val_frac size. Default is 0.2, or 20%. 
    # Seed is either 0 for no seed, 1 for the default seed (200), or any other integer as seed.
    if seed == 0:
        val = df.sample(frac=val_frac)
    elif seed == 1: 
        val = df.sample(frac=val_frac, random_state=200)
    else:
        val = df.sample(frac=val_frac, random_state=seed)
    # training is set acquired by dropping the validation set
    train = df.drop(val.index)
    return train, val

In [None]:
def feature_selection(df):
  train, val = split(df)
  # Lasso's normalize argument was deprecated in scikit-learn 1.0 and removed
  # in 1.2, so this call needs an older version; newer versions need a
  # scaling step in a pipeline instead.
  selector = RFECV(Lasso(normalize=True), step=1, n_jobs=6)
  X = train.drop(['SalePrice'], axis=1)
  selector = selector.fit(X, train.SalePrice)
  # list of all the features that have been selected
  selected = selector.get_feature_names_out().tolist()
  selected.extend(["SalePrice"])
  # list of all features
  features = train.columns
  # list of features that have been removed
  removed = [i for i in features if i not in selected]
  # fs_df is the feature-selected dataframe, built from the dataframe passed in
  fs_df = df.drop(removed, axis=1)
  return fs_df

In [None]:
# Get new dataframe
fs_init_df = init_df.copy()
# Preprocess, fill NaN, fill also LotFrontage (no imputing)
fs_init_df = preprocess(fs_init_df, remlf=0)

In [None]:
fs_df = feature_selection(fs_init_df)

## If you need a dataframe of just the selected features for your code:

In [None]:
# Backup of the 'Removed' list:
_removed_ = ['MSSubClass_20','MSSubClass_45','MSSubClass_50','MSSubClass_90','MSZoning_RL',
'Street_Pave','Alley_None','Alley_Pave','LandContour_Lvl','LotConfig_Inside',
'Neighborhood_Gilbert','Neighborhood_NAmes','Neighborhood_OldTown',
'Condition1_Artery','Condition2_Feedr','BldgType_Duplex','RoofStyle_Gable',
'RoofMatl_ClyTile', 'RoofMatl_CompShg', 'RoofMatl_Metal', 'Exterior1st_CBlock',
'Exterior1st_HdBoard', 'Exterior2nd_AsbShng', 'Exterior2nd_MetalSd', 'Exterior2nd_Other',
'Exterior2nd_Plywood', 'BsmtFinType1_BLQ', 'BsmtFinType2_LwQ','Electrical_FuseF',
'FireplaceQu_Gd','PavedDrive_Y','Fence_MnWw', 'Fence_None','SaleCondition_Normal',
'SaleCondition_Partial','MasVnrType_None','Foundation_PConc','Heating_Grav',
'CentralAir_Y','GarageType_BuiltIn','SaleType_COD','MiscFeature_None','GarageCond']

df = init_df.copy()
df = preprocess(df, remlf=0)
fs_df = df.drop(_removed_, axis=1)

# Dimensionality Reduction

With dimensionality reduction we aim to reduce the number of features by creating new features based on the correlation between the existing features. This is mainly done with PCA (Principal Component Analysis) for numeric features, MCA (Multiple Correspondence Analysis) for categorical features and FAMD (Factor Analysis of Mixed Data) for a mix of both. 

To make use of these three forms of dimensionality reduction, let's create functions that take a dataset and the model we want to run on it.

**Principal Component Analysis (PCA)**

To apply PCA, we use scikit-learn's PCA class and allow for 3 different inputs:

1. The model. Although PCA is still a way of preprocessing the data, dimensionality reduction is the last step we apply to our dataset, so we can pass in a model to test it with and get a result directly. Furthermore, using a model also allows us to experiment to find the right number of features, as explained in 3.

2. The dataset to which we want to apply PCA.  

3. The number of features kept. The fewer features we keep, the less accurate the model tends to become, but the variance is also reduced, which makes the model less prone to overfitting. The goal is to find the best number of features. We start from the number of features available in the dataset and use n_components in the scikit-learn PCA class to reduce it by 1 per step. The validation scores are plotted to find the best one.

To begin, let's go over the function that will make the magic happen: pca_evaluation.

The goal of this function is to perform PCA with different values of the hyperparameter n_components and store each result in a list as (model score, n_components) so that we can evaluate it. 

Step by step: we first separate SalePrice into y and the features into X, and split both into an 80% training set and a 20% validation set with random_state 200.

First, we run a baseline test: how well does our model perform without PCA? This score is stored in pca_scores together with the full number of features in the dataset. The number of features is an input variable, as each of our datasets has a different number of features due to preprocessing.

Then we introduce PCA. Scikit-learn's PCA takes n_components as a parameter, which determines how many components the PCA should produce. Since we want to reduce dimensionality, we take steps of -1 from the number of features in the dataset down to 1. 

Each model score is stored in pca_scores. The scores are shown in a scatterplot for easy viewing and printed for direct readout. The y-axis of the plot is limited to the range between 0 and 1, as that's the only relevant range. Furthermore, the best score and its number of components are returned.
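
A possible variant of this evaluation loop is sketched below on synthetic data. It is not the notebook's function: it differs in that PCA is fitted inside a scikit-learn pipeline on the training split only, so the validation rows do not influence the components. The data, the model and the number of features are assumptions for the example.

```python
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

X, y = make_regression(n_samples=300, n_features=12, n_informative=6,
                       noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=200)

pca_scores = []
for n in range(X.shape[1], 0, -1):
    # The pipeline fits PCA on the training split, then the model on the components
    pipe = make_pipeline(PCA(n_components=n), LinearRegression())
    pipe.fit(X_train, y_train)
    pca_scores.append((pipe.score(X_test, y_test), n))

# Tuples compare by their first element, so max returns the best-scoring entry
best_score, best_n = max(pca_scores)
print(f"best validation R^2 {best_score:.3f} with {best_n} components")
```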

In [None]:
def pca_evaluation(model, df, features):
  x_col = df.drop(['Id', 'SalePrice'], axis=1)
  X = x_col.values.reshape(-1, len(x_col.columns))
  y = df['SalePrice'].values

  pca_scores = [] 

  X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=200)

  model.fit(X_train, y_train)
  pca_scores.append((model.score(X_test, y_test), features))
  
  for i in range((features-1), 0, -1):
    pca = PCA(n_components = i)
    X_pca = pca.fit_transform(X)
    X_train_pca, X_test_pca, y_train, y_test = train_test_split(X_pca, y, test_size=0.2, random_state=200)
    
    model.fit(X_train_pca, y_train)
    pca_scores.append((model.score(X_test_pca, y_test),i))

  # Unpack the (score, n_components) pairs into plot coordinates
  x_coord = [n for _, n in pca_scores]
  y_coord = [score for score, _ in pca_scores]
  plt.scatter(x_coord,y_coord)
  plt.ylim(0, 1)
  print(pca_scores)
  # Tuples compare by their first element, so this returns the best score
  return(max(pca_scores))

# PCA function

While the pca_evaluation function can be used to find the optimal number of features, the pca function can be used to run a model with PCA.

In [None]:
def pca(model, df, features):
    x_col = df.drop(['Id', 'SalePrice'], axis=1)
    X = x_col.values.reshape(-1, len(x_col.columns))
    y = df['SalePrice'].values


    pca = PCA(n_components = features)
    X_pca = pca.fit_transform(X)
    X_train_pca, X_test_pca, y_train, y_test = train_test_split(X_pca, y, test_size=0.2, random_state=200)

    model.fit(X_train_pca, y_train)
    print('Training Results:')
    try:
        print(f'Mean squared log error loss on training: {sklearn.metrics.mean_squared_log_error(model.predict(X_train_pca), y_train)}')
    except ValueError:
        # mean_squared_log_error is undefined for negative values; skip it then
        pass
    print(model.score(X_train_pca, y_train))
    print("Validation Results")
    print(model.score(X_test_pca, y_test))
    try:
        print(f'Mean squared log error loss on validation: {sklearn.metrics.mean_squared_log_error(model.predict(X_test_pca), y_test)}')
    except:
        pass

    plt.scatter(model.predict(X_train_pca), y_train, marker = "s",  c = "blue", label = "Training")
    plt.scatter(model.predict(X_test_pca), y_test, marker = "s",  c = "green", label = "Validation")
    plt.plot([10, 13.5], [10, 13.5], color = 'red')
    plt.title("Linear regression")
    plt.xlabel("Predicted values")
    plt.ylabel("Real values")
    plt.legend(loc = "upper left")
    print()
    print("Standard error:{:>10} {}".format(" ", std_err(model.predict(X_test_pca), y_test)))
    print()


#  MCA for nominal columns

MCA is used for the nominal features. To separate these from the numerical features, we introduce a separate_num_nom function, which separates the binary (one-hot encoded) nominal features from the numerical features.

Then we have an mca function which returns an MCA-processed dataset.
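
The separation rule can be illustrated on a toy dataframe. The vectorized check below is an assumed alternative formulation of the same idea (a column is binary when it contains exactly the values 0 and 1); the column names and values are invented.

```python
import pandas as pd

toy = pd.DataFrame({
    "Street_Pave": [1, 0, 1, 1],        # one-hot indicator
    "LotArea": [8450, 9600, 11250, 0],  # numeric
    "OverallQual": [7, 6, 7, 8],        # ordinal codes with more than two levels
    "AllZero": [0, 0, 0, 0],            # constant, so not treated as binary
})

# A column counts as binary when it contains exactly the two values 0 and 1
binary_mask = toy.isin([0, 1]).all() & (toy.nunique() == 2)
nom = binary_mask[binary_mask].index.tolist()
num = binary_mask[~binary_mask].index.tolist()
print(nom, num)
```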

In [None]:
def is_binary(series, allow_na=False):
    if allow_na:
        # Drop NaNs on a copy so the original dataframe column is not modified
        series = series.dropna()
    return sorted(series.unique()) == [0, 1]

def separate_num_nom(df):
  num = []
  nom = []
  for col in df.columns:
      if not is_binary(df[col]):
          num.append(col)
      else:
          nom.append(col)
  return num, nom

In [None]:
def mca(df, n=2):
    mca = prince.MCA(
        n_components=n,
        n_iter=10,
        copy=True,
        check_input=True,
        engine='auto',
        random_state=42
        )
    num, nom = separate_num_nom(df)
    mca_df = mca.fit_transform(df[nom])
    mca_df['SalePrice'] = df['SalePrice']
    return mca_df

In [None]:
mca_df = mca(ppr_df)

## Applying FAMD dimensionality reduction to all columns

In [None]:
def Famd(model, df): 
    famd = prince.FAMD(
        n_components=10,
        n_iter=3,
        copy=True,
        check_input=True,
        engine='auto',
        random_state=42
    )
    famd_df = famd.fit_transform(df.drop(['SalePrice'], axis=1))

    famd_df['SalePrice'] = df['SalePrice']

    df = famd_df

    drop, target = ['SalePrice'], ['SalePrice']


    # Validation set is a sample of 20%. Seed is used to get consistent results while testing
    val = df.sample(frac=0.2,random_state=200)
    # training is set acquired by dropping the validation set
    train = df.drop(val.index)

    x_col = train.drop(drop, axis=1)
    X = x_col.values.reshape(-1, len(x_col.columns))
    y = train[target].values

    reg = model.fit(X, y)

    print("Training results")
    print(reg.score(X, y))
    try:
        print(f'Mean squared log error loss on training: {sklearn.metrics.mean_squared_log_error(reg.predict(X), y)}')
    except:
        pass
    valx_col = val.drop(drop, axis=1)
    valx = valx_col.values.reshape(-1, len(valx_col.columns))
    valy = val[target].values

    print()
    print("Validation results")
    print(reg.score(valx, valy))
    try:
        print(f'Mean squared log error loss on validation: {sklearn.metrics.mean_squared_log_error(reg.predict(valx), valy)}')
    except:
        pass

    plt.scatter(reg.predict(X), y, marker = "s",  c = "blue", label = "Training")
    plt.scatter(reg.predict(valx), valy, marker = "s",  c = "green", label = "Validation")
    plt.plot([10, 13.5], [10, 13.5], color = 'red')
    plt.title(type(model).__name__)
    plt.xlabel("Predicted values")
    plt.ylabel("Real values")
    plt.legend(loc = "upper left")
    print()
    print("Standard error:{:>10} {}".format(" ", std_err(valy.ravel(), reg.predict(valx).ravel())))
    print()


In [None]:
famd_df = fillnan(df, remlf=0)

# Models
Now that we have all our preprocessing functions set up, let's take a look at the models we are going to work with.
> LinearRegression

>Standard linear model with no regularization. Used to test feature engineering, selection and dimensionality reduction, as it ensures features selected/created aren’t “regularized out”

> RidgeCV

>Cross-validated L2-regularized regression model. Partly handles collinearity, because the L2 penalty shrinks the weights of multicollinear or weakly correlated variables.

>LassoCV

>Cross-validated L1-regularized regression model. In theory it deals with collinearity and weak correlations entirely, by effectively selecting the best features to use in the regression: bad features are regularized out by being given a weight of 0.

>AdaBoost

>Ensemble model that trains a set of models in sequence, with each new model concentrating on the samples its predecessors predicted poorly.

>XGBoost

>Similar to AdaBoost, but more flexible.
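
The difference between the L1 and L2 penalties described above can be seen on toy data. This is a minimal sketch, assuming synthetic data and a fixed `alpha` (the notebook itself uses `RidgeCV` and `LassoCV`, which choose the penalty strength by cross-validation); all names and values here are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
n_samples, n_features = 200, 20
X = rng.normal(size=(n_samples, n_features))
# Only the first two features carry signal; the other 18 are pure noise.
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=n_samples)

lasso = Lasso(alpha=0.5).fit(X, y)   # L1 penalty: can set weights exactly to 0
ridge = Ridge(alpha=0.5).fit(X, y)   # L2 penalty: shrinks weights, rarely to exactly 0

n_zero_lasso = int(np.sum(lasso.coef_ == 0))
n_zero_ridge = int(np.sum(ridge.coef_ == 0))
print("Lasso weights equal to 0:", n_zero_lasso)
print("Ridge weights equal to 0:", n_zero_ridge)
```

Counting exact zeros in `coef_` is a quick way to see how many features a fitted L1 model has "regularized out".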

To run these models, let's create a regressor function that does the train/validation split of the data for us and gives us a score for the model's performance on the provided dataset.


In [None]:
def regressor(model, df, drop=['SalePrice'], target=['SalePrice']):
    # Validation set is a sample of 20%. Seed is used to get consistent results while testing
    val = df.sample(frac=0.2,random_state=200)
    # training is set acquired by dropping the validation set
    train = df.drop(val.index)

    x_col = train.drop(drop, axis=1)
    X = x_col.values.reshape(-1, len(x_col.columns))
    y = ravel(train[target].values)

    reg = model.fit(X, y)

    valx_col = val.drop(drop, axis=1)
    valx = valx_col.values.reshape(-1, len(valx_col.columns))
    valy = ravel(val[target].values)

    print("Training results")
    print(reg.score(X, y))
    try:
      print(f'Mean squared log error loss on training: {sklearn.metrics.mean_squared_log_error(reg.predict(X), y)}')
    except:
        pass
    print("Validation results")
    print(reg.score(valx, valy))
    try:
      print(f'Mean squared log error loss on validation: {sklearn.metrics.mean_squared_log_error(reg.predict(valx), valy)}')
    except:
        pass
    plt.scatter(reg.predict(X), y, marker = "s",  c = "blue", label = "Training")
    plt.scatter(reg.predict(valx), valy, marker = "s",  c = "green", label = "Validation")
    plt.plot([10, 13.5], [10, 13.5], color = 'red')
    plt.title(type(model).__name__)
    plt.xlabel("Predicted values")
    plt.ylabel("Real values")
    plt.legend(loc = "upper left")
    print()
    print("Standard error:{:>10} {}".format(" ", std_err(valy, reg.predict(valx))))
    print()
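
The split used in `regressor` (sample 20% for validation, drop those rows to get the training set) can be checked in isolation. A small sketch on a made-up frame, assuming the dataframe has a unique index; `toy_df` and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for one of the notebook's datasets (values are made up).
toy_df = pd.DataFrame({
    "GrLivArea": np.arange(100, 200),
    "SalePrice": np.linspace(11.0, 13.0, 100),
})

val = toy_df.sample(frac=0.2, random_state=200)   # 20% of the rows for validation
train = toy_df.drop(val.index)                    # all remaining rows for training

overlap = train.index.intersection(val.index)
print(len(train), len(val), len(overlap))
```

Note that `df.drop(val.index)` removes rows by index label, so if a dataframe has duplicate index labels (for example after a `concat` without `ignore_index=True`), more rows than intended would be dropped.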

In [None]:
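def std_err(y_true, y_pred):
    # Bootstrap standard error of the mean per-sample squared log error
    error = []
    for true, pred in zip(y_true, y_pred):
        # mean_squared_log_error rejects negative inputs, so shift both values
        # by the same amount whenever one of them is negative.
        # Note: a log error is not exactly shift-invariant, so the shifted
        # error is only an approximation of the original one.
        if true < 0:
            true = abs(true)
            diff = true * 2
            pred = pred + diff
        if pred < 0:
            pred = abs(pred)
            diff = pred * 2
            true = true + diff

        error.append(sklearn.metrics.mean_squared_log_error([true], [pred]))

    # Local name differs from the function name to avoid shadowing it
    se = scipy.stats.bootstrap((error,), np.mean, method='basic').standard_error
    return se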
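
To make clearer what a bootstrap standard error measures, here is a hand-rolled sketch in plain NumPy. It is not what `scipy.stats.bootstrap` runs internally line for line, only the same idea: resample the errors with replacement, take the mean of each resample, and report the spread of those means. The `errors` array is made up.

```python
import numpy as np

rng = np.random.default_rng(0)
# Made-up per-sample errors standing in for the `error` list in std_err.
errors = rng.exponential(scale=0.02, size=500)

n_resamples = 2000
# Draw resamples with replacement and record the mean of each one.
idx = rng.integers(0, len(errors), size=(n_resamples, len(errors)))
boot_means = errors[idx].mean(axis=1)

# Bootstrap standard error: standard deviation of the resampled means.
boot_se = boot_means.std(ddof=1)
# Textbook standard error of the mean, for comparison.
analytic_se = errors.std(ddof=1) / np.sqrt(len(errors))
print(boot_se, analytic_se)
```

For a simple statistic like the mean, the two numbers should land close to each other, which is a handy sanity check on the bootstrap result.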

In [None]:
def getarr(df, drop=['SalePrice'], target=['SalePrice']):
    # Gets the feature matrix X and target vector y from a dataframe to feed a model
    x_col = df.drop(drop, axis=1)
    X = x_col.values.reshape(-1, len(x_col.columns))
    y = ravel(df[target].values)
    return X, y

# Optimal dimensionality reduction values

Now that we have a regressor function and have defined our models, we need to finalize our datasets. This is important because we want four dataframes to test with: one for preprocessing, one for feature engineering, one for feature selection, and one for dimensionality reduction.

However, as it stands now, these are the datasets we have:
>Preprocessing: ppr_df

>Feature Engineering: 16 different datasets

>Feature Selection: fs_df

>Dimensionality Reduction: datasets first need to be run through the models to determine the best number of features for PCA, and the result then needs to be compared to FAMD and MCA to see which of the three is best.

So let's finalize this right now.


**Feature Engineering**

For feature engineering, we have 16 different datasets: ppr_df adjusted into bc_df and yj_df, and then the drop_features, add_features and remove_outliers variants.

During testing, yj_df turned out to return only errors for all the models, and the bc_df variants return errors for AdaBoost, so we are not using those.
This leaves us with running all remaining datasets through the models and then picking the best dataset for each model. We record both training and validation results, but the final choice is the dataset with the best validation score.
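
The selection step described above can be sketched compactly with dictionaries. This is an illustration on synthetic data, not the notebook's actual run: `toy_eval`, `toy_datasets` and `toy_models` are hypothetical stand-ins for `eval_reg`, the dataset list and the model list.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression, Ridge

def toy_eval(model, df, target="SalePrice"):
    """Hypothetical stand-in for eval_reg: returns (train R^2, validation R^2)."""
    val = df.sample(frac=0.2, random_state=200)
    train = df.drop(val.index)
    X, y = train.drop(columns=target).to_numpy(), train[target].to_numpy()
    Xv, yv = val.drop(columns=target).to_numpy(), val[target].to_numpy()
    fitted = model.fit(X, y)
    return fitted.score(X, y), fitted.score(Xv, yv)

rng = np.random.default_rng(0)
x = rng.normal(size=200)
noise = rng.normal(size=200)
price = 2.0 * x + rng.normal(scale=0.1, size=200)
toy_datasets = {
    "informative_df": pd.DataFrame({"x": x, "SalePrice": price}),
    "noise_df": pd.DataFrame({"x": noise, "SalePrice": price}),
}
toy_models = {"LinearRegression": LinearRegression(), "Ridge": Ridge()}

best = {}
for model_name, model in toy_models.items():
    # Validation score of this model on every dataset, keyed by dataset name.
    scores = {name: toy_eval(model, df)[1] for name, df in toy_datasets.items()}
    best[model_name] = max(scores, key=scores.get)
print(best)
```

Keying the scores by dataset name keeps each score attached to its dataset, so no separate counters are needed.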

In [None]:
def eval_reg(model, df, drop=['SalePrice'], target=['SalePrice']):
    # Validation set is a sample of 20%. Seed is used to get consistent results while testing
    val = df.sample(frac=0.2,random_state=200)
    # training is set acquired by dropping the validation set
    train = df.drop(val.index)

    x_col = train.drop(drop, axis=1)
    X = x_col.values.reshape(-1, len(x_col.columns))
    y = ravel(train[target].values)

    reg = model.fit(X, y)

    valx_col = val.drop(drop, axis=1)
    valx = valx_col.values.reshape(-1, len(valx_col.columns))
    valy = ravel(val[target].values)

    
    train_res = reg.score(X, y) 
    val_res = reg.score(valx, valy)
    
    return(train_res, val_res)
    

In [None]:
Linear_Regression_train = []
Linear_Regression_val = []
RidgeCV_train = []
RidgeCV_val = []
LassoCV_train = []
LassoCV_val = []
Adaboost_train = []
Adaboost_val = []
XGboost_train = []
XGboost_val = []

datasets = [ ppr_dr_df, ppr_ro_df, ppr_af_df, ppr_dr_ro_df, ppr_dr_af_df, ppr_ro_af_df, ppr_dr_ro_af_df, bc_df, bc_dr_df, bc_ro_df,bc_af_df, bc_dr_ro_df, bc_dr_af_df, bc_ro_af_df, bc_dr_ro_af_df]
dataset_name = ['ppr_dr_df', 'ppr_ro_df', 'ppr_af_df', 'ppr_dr_ro_df', 'ppr_dr_af_df', 'ppr_ro_af_df', 'ppr_dr_ro_af_df', 'bc_df', 'bc_dr_df', 'bc_ro_df','bc_af_df', 'bc_dr_ro_df', 'bc_dr_af_df', 'bc_ro_af_df', 'bc_dr_ro_af_df']
models = [LinearRegression(), RidgeCV(), LassoCV(), AdaBoostRegressor(n_estimators= 100, learning_rate = 1), XGBRegressor(booster='gbtree', objective='reg:squarederror')]
model_count = 0
dataset_count = 0 
for model in models:
  model_count += 1 
  if model_count == 4:
      del datasets[7:15]
  for dataset in datasets: 
    name = dataset_name[dataset_count]
    dataset_count += 1
    train_res, val_res = eval_reg(model, dataset)
    if model_count == 1:
      Linear_Regression_train.append((train_res, name))
      Linear_Regression_val.append((val_res, name))
    if model_count == 2:
      RidgeCV_train.append((train_res, name))
      RidgeCV_val.append((val_res, name))
    if model_count == 3:
      LassoCV_train.append((train_res, name))
      LassoCV_val.append((val_res,name))
    if model_count == 4:
      Adaboost_train.append((train_res, name))
      Adaboost_val.append((val_res, name))
    if model_count == 5:
      XGboost_train.append((train_res, name))
      XGboost_val.append((val_res, name))
  dataset_count = 0

print('Linear Regression:')
print(max(Linear_Regression_train))
print(max(Linear_Regression_val))
print('RidgeCV:')
print(max(RidgeCV_train))
print(max(RidgeCV_val))
print('LassoCV:')
print(max(LassoCV_train))
print(max(LassoCV_val))
print('AdaBoost:')
print(max(Adaboost_train))
print(max(Adaboost_val))
print('XGBoost:')
print(max(XGboost_train))
print(max(XGboost_val))



As we can see this leaves us with the following
> Preprocessing: ppr_df

> Feature Engineering
>*   Linear Regression: ppr_dr_ro_af_df
>*   RidgeCV: ppr_ro_af_df
>*   LassoCV: ppr_dr_ro_af_df
>*   AdaBoost: ppr_ro_df
>*   XGBoost : ppr_ro_af_df

> Feature Selection: fs_df

So now let's take a look at Dimensionality Reduction



**Dimensionality Reduction**

For dimensionality reduction we have three methods: PCA, MCA and FAMD.

To choose which one we want to use for each model, let's run them and take the best performer.

We start with PCA to find the optimal number of features for each model.
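
As a reminder of what such a search does, here is a minimal sketch on synthetic data. `best_n_components` is a hypothetical helper written for this illustration; it is not the notebook's `pca_evaluation`, which may differ in its details.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

def best_n_components(model, X, y, max_components):
    """Hypothetical helper: return (n, score) with the best validation R^2."""
    X_tr, X_val, y_tr, y_val = train_test_split(
        X, y, test_size=0.2, random_state=200)
    best_n, best_score = None, -np.inf
    for n in range(1, max_components + 1):
        pca = PCA(n_components=n).fit(X_tr)       # fit PCA on training rows only
        model.fit(pca.transform(X_tr), y_tr)
        score = model.score(pca.transform(X_val), y_val)
        if score > best_score:
            best_n, best_score = n, score
    return best_n, best_score

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 8))
y = X @ rng.normal(size=8) + rng.normal(scale=0.1, size=300)
n, score = best_n_components(LinearRegression(), X, y, max_components=8)
print(n, round(score, 3))
```

Fitting the PCA on the training rows only keeps information from the validation rows out of the projection.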


In [None]:
models = [LinearRegression(), RidgeCV(), LassoCV(), AdaBoostRegressor(n_estimators= 100, learning_rate = 1), XGBRegressor(booster='gbtree', objective='reg:squarederror')]
model_count = 0
for model in models:
  max_score = pca_evaluation(model, ppr_df, 262)
  if model_count == 0:
    print('Linear Regression: ', max_score)
  if model_count == 1:
    print('RidgeCV: ', max_score)  
  if model_count == 2:
    print('LassoCV: ', max_score) 
  if model_count == 3:
    print('AdaBoost: ', max_score) 
  if model_count == 4:
    print('XGBoost: ', max_score)
  model_count += 1

If you don't want to run this (as Adaboost can make this take over 20 minutes), here are the results:

*   Linear Regression: 134 
*   RidgeCV: 134
*   LassoCV: 262
*   AdaBoost: 262
*   XGBoost: 262









**Linear Regression**

In [None]:
pca(LinearRegression(), ppr_df, 134)

In [None]:
regressor(LinearRegression(), mca_df)

In [None]:
Famd(LinearRegression(), famd_df)

As we can see PCA has the best results for Linear Regression with 134 features

**RidgeCV**

In [None]:
pca(RidgeCV(), ppr_df, 134)

In [None]:
regressor(RidgeCV(), mca_df)

In [None]:
Famd(RidgeCV(), famd_df)

Same results as with Linear Regression: PCA with 134 features is best.

**LassoCV**

In [None]:
pca(LassoCV(), ppr_df, 262)

In [None]:
regressor(LassoCV(), mca_df)

In [None]:
Famd(LassoCV(), famd_df)

Here we see that LassoCV has best validation results with PCA and 262 features

**AdaBoost**

In [None]:
pca(AdaBoostRegressor(n_estimators= 100, learning_rate = 1), ppr_df, 262)

In [None]:
regressor(AdaBoostRegressor(n_estimators= 100, learning_rate = 1), mca_df)

In [None]:
Famd(AdaBoostRegressor(n_estimators= 100, learning_rate = 1), famd_df)

For AdaBoost we find that MCA has the best validation results

**XGBoost**

In [None]:
pca(XGBRegressor(booster='gbtree', objective='reg:squarederror'), ppr_df, 262)

In [None]:
regressor(XGBRegressor(booster='gbtree', objective='reg:squarederror'), mca_df)

In [None]:
Famd(XGBRegressor(booster='gbtree', objective='reg:squarederror'), famd_df)

As we can see PCA works the best for XGBoost with 262 features

Alright, so now we have the following datasets
> Preprocessing: ppr_df

> Feature Engineering
>*   Linear Regression: ppr_dr_ro_af_df
>*   RidgeCV: ppr_ro_af_df
>*   LassoCV: ppr_dr_ro_af_df
>*   AdaBoost: ppr_ro_df
>*   XGBoost : ppr_ro_af_df

> Feature Selection: fs_df

> Dimensionality Reduction
>*   Linear Regression: PCA, 134
>*   RidgeCV: PCA, 134
>*   LassoCV: PCA, 262
>*   AdaBoost: MCA
>*   XGBoost: PCA, 262

Now we need to create our 8 datasets per model, which look as follows:

* Dataset 1: Preprocessed only (ppr_df)

* Dataset 2: Preprocessed + Feature engineering (depending on model)

* Dataset 3: Preprocessed + Feature Selection (fs_df) 

* Dataset 4: Preprocessed + Dimensionality Reduction (depending on model)

* Dataset 5: Preprocessed + Feature Engineering + Selection (using feature selection function on FE model) 

* Dataset 6: Preprocessed + Feature selection + Dimensionality Reduction (Putting fs_df through best Dim Red function)

* Dataset 7: Preprocessed + Feature Engineering + Dimensionality Reduction (putting FE model through  best Dim Red function)

* Dataset 8: Preprocessed + Feature Engineering + Feature Selection + Dimensionality Reduction (putting dataset 5 through best Dim Red function)

# **Model 1: Linear Regression**


## Dataset 1: Preprocessed Only

In [None]:
regressor(LinearRegression(), ppr_df)

## Dataset 2: Preprocessed + Feature Engineering

In [None]:
regressor(LinearRegression(), ppr_dr_ro_af_df)

## Dataset 3: Preprocessed + Feature Selection

In [None]:
regressor(LinearRegression(), fs_df)

## Dataset 4: Preprocessed + Dimensionality Reduction

In [None]:
pca(LinearRegression(), ppr_df, 134)

## Dataset 5: Preprocessed + Feature Engineering + Feature Selection

In [None]:
fe_fs_df = feature_selection(ppr_dr_ro_af_df)

In [None]:
regressor(LinearRegression(), fe_fs_df)

## Dataset 6: Preprocessed + Feature Selection + Dimensionality Reduction

For this we need to test the 3 different Dim Red methods again to see which one applies best with the fs_df. As it turns out, it's PCA with 165 features

In [None]:
pca_evaluation(LinearRegression(), fs_df, 219)  # optional: an earlier run found that 165 returns the best results

In [None]:
pca(LinearRegression(), fs_df, 165)

In [None]:
mca_fs_df = mca(fs_df)
regressor(LinearRegression(), mca_fs_df)

In [None]:
Famd(LinearRegression(), fs_df)

## Dataset 7: Preprocessed + Feature Engineering + Dimensionality Reduction

For this we need to test the 3 different Dim Red methods again to see which one applies best with the ppr_dr_ro_af_df. As it turns out, it's PCA with 101 features


In [None]:
pca_evaluation(LinearRegression(), ppr_dr_ro_af_df, 130)  # optional: an earlier run found that 101 returns the best results

In [None]:
pca(LinearRegression(), ppr_dr_ro_af_df, 101)

In [None]:
mca_ppr_dr_ro_af_df = mca(ppr_dr_ro_af_df)
regressor(LinearRegression(), mca_ppr_dr_ro_af_df)

In [None]:
Famd(LinearRegression(), ppr_dr_ro_af_df)

## Dataset 8: Preprocessed + Feature Engineering + Feature Selection + Dimensionality Reduction

For this we need to test the 3 different Dim Red methods again to see which one applies best with the fe_fs_df. As it turns out, it's PCA with 128 features


In [None]:
pca_evaluation(LinearRegression(), fe_fs_df, 250)  # optional: an earlier run found that 128 returns the best results

In [None]:
pca(LinearRegression(), fe_fs_df, 128)

In [None]:
mca_fe_fs_df = mca(fe_fs_df)
regressor(LinearRegression(), mca_fe_fs_df)

In [None]:
Famd(LinearRegression(), fe_fs_df)

# **Model 2: RidgeCV**


## Dataset 1: Preprocessed Only

In [None]:
regressor(RidgeCV(), ppr_df)

## Dataset 2: Preprocessed + Feature Engineering

In [None]:
regressor(RidgeCV(), ppr_ro_af_df)

## Dataset 3: Preprocessed + Feature Selection

In [None]:
regressor(RidgeCV(), fs_df)

## Dataset 4: Preprocessed + Dimensionality Reduction

In [None]:
pca(RidgeCV(), ppr_df, 134)

## Dataset 5: Preprocessed + Feature Engineering + Feature Selection
Note: while RidgeCV had a different best-performing feature engineering dataset than Linear Regression and LassoCV (ppr_ro_af_df instead of ppr_dr_ro_af_df), we combine the full feature engineering dataset with feature selection here, as that seems more complete and it doesn't crash.

In [None]:
regressor(RidgeCV(), fe_fs_df)

## Dataset 6: Preprocessed + Feature Selection + Dimensionality Reduction

For this we need to test the 3 different Dim Red methods again to see which one applies best with the fs_df. As it turns out, it's PCA with 202 features

In [None]:
pca_evaluation(RidgeCV(), fs_df, 219)  # optional: an earlier run found that 202 returns the best results

In [None]:
pca(RidgeCV(), fs_df, 202)

In [None]:
mca_fs_df = mca(fs_df)
regressor(RidgeCV(), mca_fs_df)

In [None]:
Famd(RidgeCV(), fs_df)

## Dataset 7: Preprocessed + Feature Engineering + Dimensionality Reduction

For this we need to test the 3 different Dim Red methods again to see which one applies best with the ppr_ro_af_df. As it turns out, it's PCA with 129 features


In [None]:
pca_evaluation(RidgeCV(), ppr_ro_af_df, 130)  # optional: an earlier run found that 129 returns the best results

In [None]:
pca(RidgeCV(),  ppr_ro_af_df, 129)

In [None]:
mca_ppr_ro_af_df= mca(ppr_ro_af_df)
regressor(RidgeCV(), mca_ppr_ro_af_df)

In [None]:
Famd(RidgeCV(), ppr_dr_ro_af_df)

## Dataset 8: Preprocessed + Feature Engineering + Feature Selection + Dimensionality Reduction

For this we need to test the 3 different Dim Red methods again to see which one applies best with the fe_fs_df. As it turns out, it's PCA with 210 features


In [None]:
pca_evaluation(RidgeCV(), fe_fs_df, 250)  # optional: an earlier run found that 210 returns the best results

In [None]:
pca(RidgeCV(), fe_fs_df, 210)

In [None]:
mca_fe_fs_df = mca(fe_fs_df)
regressor(RidgeCV(), mca_fe_fs_df)

In [None]:
Famd(RidgeCV(), fe_fs_df)

# **Model 3: LassoCV**


## Dataset 1: Preprocessed Only

In [None]:
regressor(LassoCV(), ppr_df)

## Dataset 2: Preprocessed + Feature Engineering

In [None]:
regressor(LassoCV(), ppr_dr_ro_af_df)

## Dataset 3: Preprocessed + Feature Selection

In [None]:
regressor(LassoCV(), fs_df)

## Dataset 4: Preprocessed + Dimensionality Reduction

In [None]:
pca(LassoCV(), ppr_df, 262)

## Dataset 5: Preprocessed + Feature Engineering + Feature Selection

In [None]:
fe_fs_df = feature_selection(ppr_dr_ro_af_df)

In [None]:
regressor(LassoCV(), fe_fs_df)

## Dataset 6: Preprocessed + Feature Selection + Dimensionality Reduction

For this we need to test the 3 different Dim Red methods again to see which one applies best with the fs_df. As it turns out, it's FAMD

In [None]:
pca_evaluation(LassoCV(), fs_df, 219)  # optional: an earlier run found that 219 returns the best results

In [None]:
pca(LassoCV(), fs_df, 219)

In [None]:
mca_fs_df = mca(fs_df)
regressor(LassoCV(), mca_fs_df)

In [None]:
Famd(LassoCV(), fs_df)

## Dataset 7: Preprocessed + Feature Engineering + Dimensionality Reduction

For this we need to test the 3 different Dim Red methods again to see which one applies best with the ppr_dr_ro_af_df. As it turns out, it's PCA with 130 features


In [None]:
pca_evaluation(LassoCV(), ppr_dr_ro_af_df, 130)  # optional: an earlier run found that 130 returns the best results

In [None]:
pca(LassoCV(), ppr_dr_ro_af_df, 128)

In [None]:
mca_ppr_dr_ro_af_df = mca(ppr_dr_ro_af_df)
regressor(LassoCV(), mca_ppr_dr_ro_af_df)

In [None]:
Famd(LassoCV(), ppr_dr_ro_af_df)

## Dataset 8: Preprocessed + Feature Engineering + Feature Selection + Dimensionality Reduction

For this we need to test the 3 different Dim Red methods again to see which one applies best with the fe_fs_df. As it turns out, it's FAMD


In [None]:
pca_evaluation(LassoCV(), fe_fs_df, 250)  # optional: an earlier run found that 250 returns the best results

In [None]:
pca(LassoCV(), fe_fs_df, 249)

In [None]:
mca_fe_fs_df = mca(fe_fs_df)
regressor(LassoCV(), mca_fe_fs_df)

In [None]:
Famd(LassoCV(), fe_fs_df)

# **Model 4: AdaBoost**


## Dataset 1: Preprocessed Only

We haven't run this yet, so the results given in the text for this model are incomplete.

In [None]:
regressor(AdaBoostRegressor(n_estimators= 100, learning_rate = 1), ppr_df)

## Dataset 2: Preprocessed + Feature Engineering

In [None]:
regressor(AdaBoostRegressor(n_estimators= 100, learning_rate = 1), ppr_dr_ro_af_df)

## Dataset 3: Preprocessed + Feature Selection

In [None]:
regressor(AdaBoostRegressor(n_estimators= 100, learning_rate = 1), fs_df)

## Dataset 4: Preprocessed + Dimensionality Reduction

In [None]:
regressor(AdaBoostRegressor(n_estimators= 100, learning_rate = 1), mca_df)

## Dataset 5: Preprocessed + Feature Engineering + Feature Selection

In [None]:
fe_fs_df = feature_selection(ppr_dr_ro_af_df)

In [None]:
regressor(AdaBoostRegressor(n_estimators= 100, learning_rate = 1), fe_fs_df)

## Dataset 6: Preprocessed + Feature Selection + Dimensionality Reduction

For this we need to test the 3 different Dim Red methods again to see which one applies best with the fs_df. As it turns out, it's FAMD

In [None]:
pca_evaluation(AdaBoostRegressor(n_estimators= 100, learning_rate = 1), fs_df, 219)  # optional: an earlier run found that 219 returns the best results

In [None]:
pca(AdaBoostRegressor(n_estimators= 100, learning_rate = 1), fs_df, 219)

In [None]:
mca_fs_df = mca(fs_df)
regressor(AdaBoostRegressor(n_estimators= 100, learning_rate = 1), mca_fs_df)

In [None]:
Famd(AdaBoostRegressor(n_estimators= 100, learning_rate = 1), fs_df)

## Dataset 7: Preprocessed + Feature Engineering + Dimensionality Reduction

For this we need to test the 3 different Dim Red methods again to see which one applies best with the ppr_dr_ro_af_df. As it turns out, it's PCA with 128 features


In [None]:
pca_evaluation(AdaBoostRegressor(n_estimators= 100, learning_rate = 1), ppr_dr_ro_af_df, 128)  # optional: an earlier run found that 130 returns the best results

In [None]:
pca(AdaBoostRegressor(n_estimators= 100, learning_rate = 1), ppr_dr_ro_af_df, 128)

In [None]:
mca_ppr_dr_ro_af_df = mca(ppr_dr_ro_af_df)
regressor(AdaBoostRegressor(n_estimators= 100, learning_rate = 1), mca_ppr_dr_ro_af_df)

In [None]:
Famd(AdaBoostRegressor(n_estimators= 100, learning_rate = 1), ppr_dr_ro_af_df)

## Dataset 8: Preprocessed + Feature Engineering + Feature Selection + Dimensionality Reduction

For this we need to test the 3 different Dim Red methods again to see which one applies best with the fe_fs_df. As it turns out, it's PCA with 249 features


In [None]:
pca_evaluation(AdaBoostRegressor(n_estimators= 100, learning_rate = 1), fe_fs_df, 250)  # optional: an earlier run found that 250 returns the best results

In [None]:
pca(AdaBoostRegressor(n_estimators= 100, learning_rate = 1), fe_fs_df, 249)

In [None]:
mca_fe_fs_df = mca(fe_fs_df)
regressor(AdaBoostRegressor(n_estimators= 100, learning_rate = 1), mca_fe_fs_df)

In [None]:
Famd(AdaBoostRegressor(n_estimators= 100, learning_rate = 1), fe_fs_df)

# **Model 5: XGBoost**



## Dataset 1: Preprocessed Only

In [None]:
regressor(XGBRegressor(booster='gbtree', objective='reg:squarederror'), ppr_df)

## Dataset 2: Preprocessed + Feature Engineering

In [None]:
regressor(XGBRegressor(booster='gbtree', objective='reg:squarederror'), ppr_ro_af_df)

## Dataset 3: Preprocessed + Feature Selection

In [None]:
regressor(XGBRegressor(booster='gbtree', objective='reg:squarederror'), fs_df)

## Dataset 4: Preprocessed + Dimensionality Reduction

In [None]:
pca(XGBRegressor(booster='gbtree', objective='reg:squarederror'), ppr_df, 262)

## Dataset 5: Preprocessed + Feature Engineering + Feature Selection

In [None]:
fe_fs_df = feature_selection(ppr_dr_ro_af_df)

In [None]:
regressor(XGBRegressor(booster='gbtree', objective='reg:squarederror'), fe_fs_df)

## Dataset 6: Preprocessed + Feature Selection + Dimensionality Reduction

For this we need to test the 3 different Dim Red methods again to see which one applies best with the fs_df. As it turns out, it's PCA with 171 features

In [None]:
pca_evaluation(XGBRegressor(booster='gbtree', objective='reg:squarederror'), fs_df, 219)  # optional: an earlier run found that 171 returns the best results

In [None]:
pca(XGBRegressor(booster='gbtree', objective='reg:squarederror'), fs_df, 171)

In [None]:
mca_fs_df = mca(fs_df)
regressor(XGBRegressor(booster='gbtree', objective='reg:squarederror'), mca_fs_df)

In [None]:
Famd(XGBRegressor(booster='gbtree', objective='reg:squarederror'), fs_df)

## Dataset 7: Preprocessed + Feature Engineering + Dimensionality Reduction

For this we need to test the 3 different Dim Red methods again to see which one applies best with the ppr_dr_ro_af_df. As it turns out, it's PCA with 128 features


In [None]:
pca_evaluation(XGBRegressor(booster='gbtree', objective='reg:squarederror'), ppr_dr_ro_af_df, 128)  # optional: an earlier run found that 128 returns the best results

In [None]:
pca(XGBRegressor(booster='gbtree', objective='reg:squarederror'), ppr_dr_ro_af_df, 128)

In [None]:
mca_ppr_dr_ro_af_df = mca(ppr_dr_ro_af_df)
regressor(XGBRegressor(booster='gbtree', objective='reg:squarederror'), mca_ppr_dr_ro_af_df)

In [None]:
Famd(XGBRegressor(booster='gbtree', objective='reg:squarederror'), ppr_dr_ro_af_df)

## Dataset 8: Preprocessed + Feature Engineering + Feature Selection + Dimensionality Reduction

In [None]:
pca(XGBRegressor(booster='gbtree', objective='reg:squarederror'), fe_fs_df, 128)

# **Results**

Here's a table with the results we find for the validation scores.  

In [None]:
results= [[0.864049256666222, 0.882993788411652, 0.57307520746678, 0.88675026673289, 0.562444946440133, 0.570734722689366, 0.885061238332195, 0.572579970161637 ],
          [0.929792769342271,0.889912655333249, 0.564117682743432, 0.887531961995074, 0.560001523399702, 0.56555931546625,  0.887031269852025, 0.562285091841353 ],
          [0.79497251266111, 0.795964920104688, 0.385353999499943, 0.792961241847281, 0.385353503248167, 0.459591797446816, 0.791585588001044, 0.413759820106853  ],
          [0.013888888888888888, 0.013888888888888888,0.013793103448275862,0.006944444444444444,0.013793103448275862,  0.013793103448275862, 0.010380622837370242, 0.020618556701030927],
          [0.8874144692240529,0.8939285305481581, 0.6756744674642111,0.8188901089366438, 0.6403642344872913, 0.6728817403925524, 0.8448092641012978,0.7164264936286864   ]]
index= ['PP', 'PP+FE', 'PP+FS', 'PP+DR', 'PP+FE+FS', 'PP+FS+DR', 'PP+FE+DR', 'PP+FS+FE+DR']
dfs = ['Linear Regression','RidgeCV','LassoCV', 'AdaBoost', 'XGBoost']
pd.DataFrame(results, columns = index, index = dfs)
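
To read a table like this more easily, one could pull out the best-scoring dataset for each model. A minimal sketch using `DataFrame.idxmax`; the scores below are made up for illustration and are not the results above.

```python
import pandas as pd

# Made-up scores for illustration only; they are not the notebook's results.
toy_scores = pd.DataFrame(
    [[0.80, 0.85, 0.60],
     [0.90, 0.88, 0.55]],
    index=["model_a", "model_b"],
    columns=["PP", "PP+FE", "PP+FS"],
)

best_dataset = toy_scores.idxmax(axis=1)   # column label of each row's maximum
best_score = toy_scores.max(axis=1)        # the maximum value itself
summary = pd.DataFrame({"best_dataset": best_dataset, "score": best_score})
print(summary)
```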

# **Auto SkLearn Comparison**

Since we use Auto-sklearn as a comparison at the end, and installing auto-sklearn causes an error in the fs_df creation, we install auto-sklearn here at the end.

Note that the runtime must be restarted for the installation to complete, so some code cells are duplicated below; you can run those without having to go back through the notebook.

In [None]:
!pip3 install auto-sklearn

# Duplicate functions
These are code cells we've seen before, but need now after restarting the runtime. It's recommended to run these and then collapse the header


In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import prince
import scipy
import sklearn
from sklearn import preprocessing
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LassoCV
from sklearn.linear_model import RidgeCV
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LassoCV
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import KFold, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import FunctionTransformer
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import GridSearchCV
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from scipy.special import boxcox1p
from scipy import stats
from math import ceil
from scipy.stats import probplot
from sklearn.model_selection import GridSearchCV, RepeatedKFold, cross_val_score
from sklearn.metrics import make_scorer
from sklearn.linear_model import LinearRegression, Lasso, Ridge, ElasticNet, LogisticRegression
from sklearn.svm import LinearSVR, SVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from xgboost import XGBRegressor
from sklearn.preprocessing import PolynomialFeatures
from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.preprocessing import power_transform
from sklearn.feature_selection import RFECV
from sklearn.ensemble import AdaBoostClassifier, AdaBoostRegressor
from sklearn import metrics
from numpy import *

In [None]:
url='https://drive.google.com/file/d/1_PLO-CF84thdpb-dMlE8y24nEyH-g7qh/view?usp=sharing'
url='https://drive.google.com/uc?id=' + url.split('/')[-2]
init_df = pd.read_csv(url)
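
The URL rewrite in the cell above takes the file id out of the Drive sharing link and builds a direct-download link from it. A small sketch of that step as a function, with no network access; `drive_direct_url` is a hypothetical helper name, and it assumes the link has the `.../file/d/<id>/view?...` shape used here.

```python
def drive_direct_url(share_url):
    """Turn a Drive '.../file/d/<id>/view?...' link into a direct-download URL."""
    # The file id is the second-to-last segment when splitting on '/'.
    file_id = share_url.split('/')[-2]
    return 'https://drive.google.com/uc?id=' + file_id

share_url = 'https://drive.google.com/file/d/1_PLO-CF84thdpb-dMlE8y24nEyH-g7qh/view?usp=sharing'
direct_url = drive_direct_url(share_url)
print(direct_url)
```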

In [None]:
# df is a copy of the initial df so we don't have to redownload many times
df = init_df.copy()

In [None]:
# First divide data into the numerical or categorical, so we can preprocess it

# List of nominal and ordinal columns extracted from data_description.txt
nom_col = ['MSSubClass', 'MSZoning','Street','Alley','LandContour','LotConfig','Neighborhood','Condition1',
           'Condition2','BldgType','HouseStyle','RoofStyle','RoofMatl','Exterior1st','Exterior2nd',
           'BsmtFinType1','BsmtFinType2','Electrical','FireplaceQu','PavedDrive','Fence','SaleCondition',
           'MasVnrType','Foundation','Heating','CentralAir','GarageType','SaleType','MiscFeature',
           # Month sold and Year should be considered nominal, as they can have 
           # an effect on sale price (price fluctuations or whatever), but not
           # on an ordinal scale
           'MoSold' , 'YrSold'
           ]

ord_col = ['LotShape','Utilities','LandSlope','ExterQual','ExterCond','BsmtQual','BsmtCond','BsmtExposure',
           'HeatingQC','KitchenQual','Functional',
           'GarageFinish','GarageQual','GarageCond','PoolQC','OverallQual','OverallCond',
           # The following are added as ordinal features because they have a
           # very low number of discrete categories, and it intuitively makes
           # sense to consider 2 fullbaths better than 1.
           'BsmtFullBath', 'BsmtHalfBath','FullBath','HalfBath','BedroomAbvGr',
           'KitchenAbvGr','TotRmsAbvGrd','Fireplaces','GarageCars'
           ]

# The following are not used and are left out on purpose. They contain only
# 7 measurements, and far too many NaNs to easily be used.
# Alternatively, they could be one-hot encoded so that abundance of NaN or '0'
# will not impact the regression algorithms as negatively.
# Another alternative could be to create a new nominal feature of whether a
# house has a pool or not, or [no pool, average pool, nice pool].
not_usable = ['PoolQC' , 'PoolArea',]

# List of all columns
allcat = list(df.columns[1:])

# List of numerical columns
num_col = [x for x in allcat if x not in nom_col+ord_col]

In [None]:
def impute(input):
    # Only used for LotFrontage, to avoid too much imputation causing inaccuracy
    imp = IterativeImputer(n_nearest_features=None, imputation_order='ascending')
    imp.fit(input)
    output = pd.DataFrame(imp.transform(input), columns = input.columns)
    return output


def fixnan(input):
    # Fix the garage year built - Inputting 0 would put it very far below all
    # other houses, so instead if there is no garage, simply input the year
    # the house was built in the garage year built column.
    input.loc[input.GarageYrBlt.isnull(),'GarageYrBlt'] = input.loc[input.GarageYrBlt.isnull(),'YearBuilt']

    # Drop the single row with NaN in 'Electrical' since this is a missing 
    # measurement. Additionally, drop the rows related to the masonry that
    # contain NaN values, as these are also missing measurements, and there are
    # only a few of them, less than 10 rows.
    input.dropna(subset = ['Electrical','MasVnrType','MasVnrArea'], inplace=True)

    # Fix MasVnrArea as it is for some reason an object type
    input.MasVnrArea = input.MasVnrArea.astype(str).astype(float)

    return input


def fillnan(input, remlf=1):
    # Fill NaNs in the categorical columns with the label 'None'
    for col in nom_col + ord_col:
        input[col] = input[col].fillna('None')

    # Remove LotFrontage from the list of columns to fill. We don't want to
    # fill this, as it is a continuous numerical measurement we can fill with
    # multivariate imputation instead, to benefit our model's accuracy
    col_num_nan = num_col.copy()
    if remlf == 1 and 'LotFrontage' in col_num_nan:
        col_num_nan.remove('LotFrontage')
    # Fill the remaining numerical columns with the number 0 (not the string '0')
    for col in col_num_nan:
        input[col] = input[col].fillna(0)
    return input


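def preprocess(df, remlf=1):
    # NOTE: drop() returns a new frame and the result is not assigned here, so
    # the not_usable columns are still present in df after this line. They are
    # also part of num_col, which fillnan() iterates over.
    df.drop(not_usable, axis=1)
    df = fixnan(df)
    df = fillnan(df, remlf)

    # Split nominal, ordinal, and numerical columns
    df_nom = df[nom_col]
    # Copy so the ordinal mappings below do not write into a slice of df
    df_ord = df[ord_col].copy()
    df_num = df.drop(list(nom_col+ord_col), axis=1)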

    # One-hot encode nominal columns
    df_nom = pd.get_dummies(data=df_nom)
    df_MSSubClass = pd.get_dummies(data=df.MSSubClass, prefix='MSSubClass', prefix_sep='_')

    # Map ordinal columns. Apart from BsmtExposure, replace() runs on the whole
    # frame, so a label shared by several columns takes the first mapping listed.
    df_ord['BsmtExposure'] = df_ord['BsmtExposure'].replace({'NA' : 0, 'No' : 0, 'Mn' : 1, 'Av' : 2, 'Gd' : 3})
    df_ord = df_ord.replace({'NA' : 0, 'Unf' : 1, 'LwQ' : 2, 'Rec' : 3, 'BLQ' : 4, 'ALQ' : 5, 'GLQ' : 6})
    df_ord = df_ord.replace({'NA' : 0, 'None' : 0,  'Po' : 1, 'Fa' : 2, 'TA' : 3, 'Gd' : 4, 'Ex' : 5})
    df_ord = df_ord.replace({'IR3' : 0, 'IR2' : 1, 'IR1' : 2, 'Reg' : 3})
    df_ord = df_ord.replace({'Low' : 0, 'HLS' : 1, 'Bnk' : 2, 'Lvl' : 3})
    df_ord = df_ord.replace({'Gtl' : 0, 'Mod' : 1, 'Sev' : 2})
    # 'Sev' and 'Mod' were already mapped by the previous line, so only the
    # remaining labels are affected here.
    df_ord = df_ord.replace({'Sal' : 0, 'Sev' : 1, 'Maj2' : 2, 'Maj1' : 3, 'Mod' : 4, 'Min2' : 5, 'Min1' : 6, 'Typ' : 7})
    df_ord = df_ord.replace({'NA' : 0, 'Unf' : 1, 'RFn' : 2, 'Fin' : 3})
    df_ord = df_ord.replace({'N' : 0, 'P' : 1, 'Y' : 2})
    df_ord = df_ord.replace({'ELO' : 0, 'NoSeWa' : 1, 'NoSewr' : 2, 'AllPub' : 3})

    # Merge nominal, ordinal, and numerical back together
    df = pd.concat([df_num, df_MSSubClass, df_nom, df_ord], axis=1)
    df = df.drop("MSSubClass", axis=1)



    return df
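
Because `DataFrame.replace` with a flat dictionary is applied to every column, labels that occur in more than one column (such as 'Mod' and 'Sev') are mapped by whichever line comes first. One way around this is to keep one mapping per column. The sketch below shows the idea on a made-up frame; the helper `map_ordinals`, the dictionary `ordinal_maps`, and the two column names are assumptions for illustration and are not part of the preprocessing above.

```python
import pandas as pd

# One mapping per column, so identical labels can get different codes
ordinal_maps = {
    'LandSlope': {'Gtl': 0, 'Mod': 1, 'Sev': 2},
    'Functional': {'Sal': 0, 'Sev': 1, 'Maj2': 2, 'Maj1': 3,
                   'Mod': 4, 'Min2': 5, 'Min1': 6, 'Typ': 7},
}

def map_ordinals(frame, maps):
    """Return a copy of frame with each listed column mapped by its own dict."""
    out = frame.copy()
    for col, mapping in maps.items():
        # Series.map leaves NaN for labels that are missing from the mapping
        out[col] = out[col].map(mapping)
    return out

toy = pd.DataFrame({'LandSlope': ['Gtl', 'Mod', 'Sev'],
                    'Functional': ['Typ', 'Mod', 'Sev']})
mapped = map_ordinals(toy, ordinal_maps)
print(mapped)
```

Labels missing from a mapping become NaN here, which makes typos in the label names easy to spot with `isna().sum()`.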


In [None]:
# Dataset for Auto Sklearn: only removes NaNs and labels the columns as ordinal/nominal data and int/float data.
ppr_auto_df = preprocess(df, remlf=0)
ppr_auto_df['LotFrontage'] = ppr_auto_df['LotFrontage'].astype('int')

In [None]:
def split(df, val_frac=0.2, seed=1):
    # Validation set is a sample of val_frac size. Default is 0.2, or 20%.
    # Seed is either 0 for no seed, 1 for the default seed (200), or any other integer as seed.
    if seed == 0:
        val = df.sample(frac=val_frac)
    elif seed == 1:
        val = df.sample(frac=val_frac, random_state=200)
    else:
        val = df.sample(frac=val_frac, random_state=seed)
    # training is set acquired by dropping the validation set
    train = df.drop(val.index)
    return train, val
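
As a quick check of how the seed affects the split, the sketch below samples a made-up frame twice with the same `random_state`, which is the mechanism `split` relies on. The frame `toy` and its values are assumptions for illustration.

```python
import pandas as pd

toy = pd.DataFrame({'x': range(10)})

# Sampling twice with the same seed selects the same rows
val_a = toy.sample(frac=0.2, random_state=200)
val_b = toy.sample(frac=0.2, random_state=200)

# The training set is whatever is left after removing the validation rows
train = toy.drop(val_a.index)

print(val_a.index.equals(val_b.index))
print(len(train), len(val_a))
```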

# Auto Sklearn Continued

To run the Auto Sklearn model we need the "autoskl_reg.pkl" file, which can be found in the MachineLearning/Project/Files folder.

The path may be different for you, as this folder will be on your shared drives. Click the file icon on the left side of Google Colab to open the Colab file directory.

Follow content/drive/Shareddrives, search for "autoskl_reg.pkl", and use the three dots at the end of its entry to copy the path. Then paste it over the path in the cell that loads the model.

If you have trouble accessing the shared drives, you can always download the "autoskl_reg.pkl" file, upload it to your own drive, and follow the same steps to get its path.
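
Instead of copying the path by hand, the file can also be searched for in code. This is a minimal sketch that assumes the drive is already mounted; the helper `find_file` is hypothetical, and the search root `/content/drive` is an assumption about where the drive is mounted in your session.

```python
from pathlib import Path

def find_file(root, name):
    """Return the first path below root whose file name equals name, or None."""
    root = Path(root)
    if not root.exists():
        return None
    # rglob walks all subdirectories; next() stops at the first match
    return next(root.rglob(name), None)

# Assumed mount point; adjust it if your drive is mounted elsewhere
path = find_file('/content/drive', 'autoskl_reg.pkl')
print(path)
```

If a path is printed, it can be passed to `open(...)` in place of the bare file name. Searching a large drive this way can be slow, so narrowing the root to the shared drive folder may help.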

In [None]:
from sklearn.metrics import mean_squared_log_error
from sklearn.metrics import r2_score
import pickle

In [None]:
# prepare df for model
df = ppr_auto_df.copy()

for col in num_col:
    df[col] = df[col].astype('int')
for col in df.columns:
    if col not in num_col:
        df[col] = df[col].astype('category')
#df = df.drop(["MiscFeature_TenC"], axis=1)

train, val = split(df, 0.2, 210)
valx = val.drop(['SalePrice'], axis=1)
valy = val.SalePrice

In [None]:
# load auto-sklearn model
with open('autoskl_reg.pkl', 'rb') as f:
    autoskl_regressor = pickle.load(f)

In [None]:
# predict
y_true = valy
y_pred = autoskl_regressor.predict(valx)
print('auto-sklearn regressor MSLE:', mean_squared_log_error(y_true, y_pred))
print()
print('auto-sklearn regressor validation R2:', r2_score(y_true, y_pred))
print()
# End points of the reference line y = x (perfect predictions)
qmin = y_true.min()
qmax = y_true.max()

plt.scatter(y_pred, y_true, marker="o", c="blue", label="Validation samples")
plt.plot([qmin, qmax], [qmin, qmax], color='red', label="Perfect prediction")
plt.title("Auto-sklearn Regressor")
plt.xlabel("Predicted values")
plt.ylabel("Real values")
plt.legend(loc="upper left")
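
For reference, scikit-learn's `mean_squared_log_error` averages the squared differences of `log(1 + y)` between the true and predicted values. The sketch below checks this on made-up numbers; the arrays `y_true_toy` and `y_pred_toy` are assumptions for illustration and are unrelated to the validation set.

```python
import numpy as np
from sklearn.metrics import mean_squared_log_error

# Made-up sale prices and predictions
y_true_toy = np.array([100000.0, 200000.0, 150000.0])
y_pred_toy = np.array([110000.0, 180000.0, 150000.0])

# Manual MSLE: mean of squared differences of log(1 + y)
manual = np.mean((np.log1p(y_true_toy) - np.log1p(y_pred_toy)) ** 2)
print(manual, mean_squared_log_error(y_true_toy, y_pred_toy))
```

Because the error is computed on a log scale, it reflects relative rather than absolute differences, so an error of 10,000 matters more on a cheap house than on an expensive one.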

In [None]:
# conf_int is defined in a cell further down; run that cell first
conf_int(y_true, y_pred)

In [None]:
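# Baselines: predict the mean or the median sale price for every validation row
base_mean = np.full(len(y_true), ppr_df.SalePrice.mean())
base_median = np.full(len(y_true), ppr_df.SalePrice.median())

print('mean baseline MSLE:', mean_squared_log_error(y_true, base_mean))
print('median baseline MSLE:', mean_squared_log_error(y_true, base_median))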

In [None]:
import statsmodels.stats.api as sms

In [None]:
import scipy

In [None]:
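# Per-sample squared log errors; their mean is the MSLE.
# (multioutput='raw_values' would return one value per output column, not per sample.)
a = (np.log1p(y_true) - np.log1p(y_pred)) ** 2
# t-based confidence interval for the mean (95% by default)
sms.DescrStatsW(a).tconfint_mean()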

In [None]:
print(mean_squared_log_error(y_true, y_pred))

In [None]:
def sample_errors(y_true, y_pred):
    # Squared log error of every single prediction
    return [mean_squared_log_error([true], [pred]) for true, pred in zip(y_true, y_pred)]


def std_err(y_true, y_pred):
    # Bootstrap standard error of the mean squared log error
    error = sample_errors(y_true, y_pred)
    return scipy.stats.bootstrap((error,), np.mean, method='basic').standard_error


def conf_int(y_true, y_pred):
    # Bootstrap confidence interval (95% by default) of the mean squared log error
    error = sample_errors(y_true, y_pred)
    return scipy.stats.bootstrap((error,), np.mean, method='basic').confidence_interval
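
To see what the bootstrap returns, the sketch below applies `scipy.stats.bootstrap` to made-up per-sample errors. The exponential toy data and the variable names are assumptions for illustration; the real errors come from the validation predictions.

```python
import numpy as np
from scipy import stats

# Made-up per-sample errors standing in for squared log errors
rng = np.random.default_rng(0)
errors = rng.exponential(scale=0.02, size=200)

# Resample the errors and compute the mean of each resample
res = stats.bootstrap((errors,), np.mean, method='basic')

print(res.confidence_interval)  # named tuple with fields low and high
print(res.standard_error)       # bootstrap estimate of the standard error
```

The resampling is random, so the interval changes slightly between runs unless a seed is passed to `bootstrap`.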

# Data set with bare minimum preprocessing


In [None]:
def min_prep(df):
    df.MasVnrArea = df.MasVnrArea.astype(str).astype(float)
    df = fillnan(df, 0)
    # Split nominal, ordinal, and numerical columns
    df_nom = df[nom_col]
    # Copy so the ordinal mappings below do not write into a slice of df
    df_ord = df[ord_col].copy()
    df_num = df.drop(list(nom_col+ord_col), axis=1)

    # One-hot encode nominal columns
    df_nom = pd.get_dummies(data=df_nom)
    df_MSSubClass = pd.get_dummies(data=df.MSSubClass, prefix='MSSubClass', prefix_sep='_')

    # Map ordinal columns. Apart from BsmtExposure, replace() runs on the whole
    # frame, so a label shared by several columns takes the first mapping listed.
    df_ord['BsmtExposure'] = df_ord['BsmtExposure'].replace({'NA' : 0, 'No' : 0, 'Mn' : 1, 'Av' : 2, 'Gd' : 3})
    df_ord = df_ord.replace({'NA' : 0, 'Unf' : 1, 'LwQ' : 2, 'Rec' : 3, 'BLQ' : 4, 'ALQ' : 5, 'GLQ' : 6})
    df_ord = df_ord.replace({'NA' : 0, 'None' : 0,  'Po' : 1, 'Fa' : 2, 'TA' : 3, 'Gd' : 4, 'Ex' : 5})
    df_ord = df_ord.replace({'IR3' : 0, 'IR2' : 1, 'IR1' : 2, 'Reg' : 3})
    df_ord = df_ord.replace({'Low' : 0, 'HLS' : 1, 'Bnk' : 2, 'Lvl' : 3})
    df_ord = df_ord.replace({'Gtl' : 0, 'Mod' : 1, 'Sev' : 2})
    # 'Sev' and 'Mod' were already mapped by the previous line, so only the
    # remaining labels are affected here.
    df_ord = df_ord.replace({'Sal' : 0, 'Sev' : 1, 'Maj2' : 2, 'Maj1' : 3, 'Mod' : 4, 'Min2' : 5, 'Min1' : 6, 'Typ' : 7})
    df_ord = df_ord.replace({'NA' : 0, 'Unf' : 1, 'RFn' : 2, 'Fin' : 3})
    df_ord = df_ord.replace({'N' : 0, 'P' : 1, 'Y' : 2})
    df_ord = df_ord.replace({'ELO' : 0, 'NoSeWa' : 1, 'NoSewr' : 2, 'AllPub' : 3})

    # Merge nominal, ordinal, and numerical back together
    df = pd.concat([df_num, df_MSSubClass, df_nom, df_ord], axis=1)
    df = df.drop("MSSubClass", axis=1)
    return df

In [None]:
df = init_df.copy()

min_df = min_prep(df)

In [None]:
min_df.isna().sum()

In [None]:
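# regressor is defined in the next cell; run that cell first.
# It expects a list of models and a matching list of column names.
regressor([LinearRegression()], min_df, ["Linear Regression"])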

In [None]:
def regressor(models, df, cols, drop=['SalePrice'], target=['SalePrice']):
    # Validation set is a sample of 20%. Seed is used to get consistent results while testing
    val = df.sample(frac=0.2, random_state=200)
    # Training set is acquired by dropping the validation set
    train = df.drop(val.index)

    # Feature matrices and flat target vectors for both sets
    X = train.drop(drop, axis=1).values
    y = np.ravel(train[target].values)
    valx = val.drop(drop, axis=1).values
    valy = np.ravel(val[target].values)

    idx = ["Train r2", "val r2", "train log", "val log", "val std err"]
    train_r2_score = []
    val_r2_result = []
    train_log_score = []
    val_log_score = []
    val_std_err = []

    for model in models:
        reg = model.fit(X, y)
        val_pred = reg.predict(valx)
        train_r2_score.append(reg.score(X, y))
        val_r2_result.append(reg.score(valx, valy))
        # Metric arguments are (y_true, y_pred)
        train_log_score.append(sklearn.metrics.mean_squared_log_error(y, reg.predict(X)))
        val_log_score.append(sklearn.metrics.mean_squared_log_error(valy, val_pred))
        val_std_err.append(std_err(valy, val_pred))
        print(str(model), conf_int(valy, val_pred))

    # One column per model, one row per metric
    output = pd.DataFrame([train_r2_score, val_r2_result, train_log_score, val_log_score, val_std_err], index=idx, columns=cols)
    return output

In [None]:
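# AdaBoostRegressor is not part of the imports at the top of the notebook
from sklearn.ensemble import AdaBoostRegressor

models = [LinearRegression(), 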
          RidgeCV(), 
          LassoCV(), 
          AdaBoostRegressor(n_estimators= 100, learning_rate = 1), 
          XGBRegressor(booster='gbtree', objective='reg:squarederror')
          ]

columns = ["Linear Regression", 
           "RidgeCV", 
           "LassoCV",
           "AdaBoost",
           "XGBoost"
           ]

In [None]:
out = regressor(models, min_df, columns)

In [None]:
out

In [None]:
xgb = regressor([XGBRegressor(booster='gbtree', objective='reg:squarederror')], min_df, ["XGBoost"])

In [None]:
xgb