# **Modelling and Evaluation version 1**

## Objectives

*   Fit and evaluate a regression model to predict sale price based on house attributes.

## Inputs

* outputs/datasets/collection/house_prices_records.csv
* Instructions on which variables to use for data cleaning and feature engineering, found in the respective notebooks.

## Outputs

* Train set (features and target)
* Test set (features and target)
* ML pipeline to predict sale price
* Feature importance plot

## Conclusions
* The complete pipeline for data cleaning, feature engineering, modelling and evaluation has been defined in this notebook.
* The best-performing model for predicting house sale price is ExtraTreesRegressor with the following parameters: 'model__n_estimators': [200], 'model__max_features': [0.3], 'model__max_depth': [None] and 'model__min_samples_split': [5].
* The most important features that are used to train the model are ['GrLivArea', '1stFlrSF', 'GarageArea', 'YearRemodAdd', 'KitchenQual', 'YearBuilt'].


---

# Change working directory

* We assume the notebooks are stored in a subfolder, so when running a notebook in the editor you will need to change the working directory.

We need to change the working directory from its current folder to its parent folder
* We access the current directory with os.getcwd()

In [1]:
import os
current_dir = os.getcwd()
current_dir

'/workspace/heritage-housing-issues/jupyter_notebooks'

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

'/workspace/heritage-housing-issues'

# Load Data

In [4]:
import numpy as np
import pandas as pd
df = pd.read_csv("outputs/datasets/collection/house_prices_records.csv")
print(df.shape)
df.head()

(1460, 24)


Unnamed: 0,1stFlrSF,2ndFlrSF,BedroomAbvGr,BsmtExposure,BsmtFinSF1,BsmtFinType1,BsmtUnfSF,EnclosedPorch,GarageArea,GarageFinish,...,LotFrontage,MasVnrArea,OpenPorchSF,OverallCond,OverallQual,TotalBsmtSF,WoodDeckSF,YearBuilt,YearRemodAdd,SalePrice
0,856,854.0,3.0,No,706,GLQ,150,0.0,548,RFn,...,65.0,196.0,61,5,7,856,0.0,2003,2003,208500
1,1262,0.0,3.0,Gd,978,ALQ,284,,460,RFn,...,80.0,0.0,0,8,6,1262,,1976,1976,181500
2,920,866.0,3.0,Mn,486,GLQ,434,0.0,608,RFn,...,68.0,162.0,42,5,7,920,,2001,2002,223500
3,961,,,No,216,ALQ,540,,642,Unf,...,60.0,0.0,35,5,7,756,,1915,1970,140000
4,1145,,4.0,Av,655,GLQ,490,0.0,836,RFn,...,84.0,350.0,84,5,8,1145,,2000,2000,250000


# ML Pipeline: Regressor

## Create ML pipeline

Custom code for ML pipeline taken and modified from CI's Churnometer project.

In [5]:
from sklearn.pipeline import Pipeline

# Data Cleaning
from feature_engine.imputation import MeanMedianImputer, CategoricalImputer

# Feature Engineering
from feature_engine.encoding import OrdinalEncoder
from feature_engine.selection import SmartCorrelatedSelection, DropFeatures
from feature_engine import transformation as vt

# Feature Scaling
from sklearn.preprocessing import StandardScaler

# Feature Selection
from sklearn.feature_selection import SelectFromModel

# ML algorithms
from sklearn.tree import DecisionTreeRegressor
from xgboost import XGBRegressor
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import AdaBoostRegressor
from sklearn.ensemble import ExtraTreesRegressor

def PipelineOptimization(model):
    pipeline_base = Pipeline([

        # Data Cleaning - see Data Cleaning Notebook
        ("DropFeatures", DropFeatures(features_to_drop=['EnclosedPorch', 'WoodDeckSF', 'GarageYrBlt'])),

        ("CategoricalImputation", CategoricalImputer(imputation_method='missing',fill_value='Unf', 
                                                        variables=['GarageFinish','BsmtFinType1'])),
        
        ("MedianImputation", MeanMedianImputer(imputation_method='median', 
                                                variables=['LotFrontage', '2ndFlrSF', 'MasVnrArea'])),

        ("MeanImputation", MeanMedianImputer(imputation_method='mean', variables='BedroomAbvGr')),

        # Feature Engineering - see Feature Engineering Notebook
        ("OrdinalCategoricalEncoder", OrdinalEncoder(encoding_method='arbitrary', 
                                                     variables=['BsmtExposure', 'BsmtFinType1', 'GarageFinish', 'KitchenQual'])),

        ("LogTransformer", vt.LogTransformer(variables=['LotArea', 'LotFrontage'])),

        ("PowerTransformer", vt.PowerTransformer(variables=['BsmtUnfSF', 'OpenPorchSF'])),

        ("YeoJohnsonTransformer", vt.YeoJohnsonTransformer(variables='TotalBsmtSF')),
                                      
        ("SmartCorrelatedSelection", SmartCorrelatedSelection(variables=None, method="spearman", threshold=0.6, 
                                                                selection_method="variance")),

        # Feature Scaling - Standardize features by removing the mean and scaling to unit variance.
        ("scaler", StandardScaler()),

        # Feature Selection - Meta-transformer for selecting features based on importance weights.
        ("feat_selection", SelectFromModel(model)),

        # ML Algorithms
        ("model", model),
    ])

    return pipeline_base



Custom class for hyperparameter optimisation, taken and modified from CI's Churnometer project.

In [None]:
from sklearn.model_selection import GridSearchCV


class HyperparameterOptimizationSearch:

    def __init__(self, models, params):
        self.models = models
        self.params = params
        self.keys = models.keys()
        self.grid_searches = {}

    def fit(self, X, y, cv, n_jobs, verbose=1, scoring=None, refit=False):
        for key in self.keys:
            print(f"\nRunning GridSearchCV for {key} \n")
            model = PipelineOptimization(self.models[key])

            params = self.params[key]
            gs = GridSearchCV(model, params, cv=cv, n_jobs=n_jobs,
                              verbose=verbose, scoring=scoring)
            gs.fit(X, y)
            self.grid_searches[key] = gs

    def score_summary(self, sort_by='mean_score'):
        def row(key, scores, params):
            d = {
                'estimator': key,
                'min_score': min(scores),
                'max_score': max(scores),
                'mean_score': np.mean(scores),
                'std_score': np.std(scores),
            }
            return pd.Series({**params, **d})

        rows = []
        for k in self.grid_searches:
            params = self.grid_searches[k].cv_results_['params']
            scores = []
            for i in range(self.grid_searches[k].cv):
                key = "split{}_test_score".format(i)
                r = self.grid_searches[k].cv_results_[key]
                scores.append(r.reshape(len(params), 1))

            all_scores = np.hstack(scores)
            for p, s in zip(params, all_scores):
                rows.append((row(k, s, p)))

        df = pd.concat(rows, axis=1).T.sort_values([sort_by], ascending=False)

        columns = ['estimator', 'min_score',
                   'mean_score', 'max_score', 'std_score']
        columns = columns + [c for c in df.columns if c not in columns]

        return df[columns], self.grid_searches

## Split Train and Test Sets

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    df.drop(['SalePrice'], axis=1),
    df['SalePrice'],
    test_size=0.2,
    random_state=0
)

print("* Train set:", X_train.shape, y_train.shape,
      "\n* Test set:",  X_test.shape, y_test.shape)

## Grid Search CV - Sklearn

### Use default hyperparameters to find most suitable algorithm

In [None]:
models_quick_search = {
    'LinearRegression': LinearRegression(),
    "DecisionTreeRegressor": DecisionTreeRegressor(random_state=0),
    "RandomForestRegressor": RandomForestRegressor(random_state=0),
    "ExtraTreesRegressor": ExtraTreesRegressor(random_state=0),
    "AdaBoostRegressor": AdaBoostRegressor(random_state=0),
    "GradientBoostingRegressor": GradientBoostingRegressor(random_state=0),
    "XGBRegressor": XGBRegressor(random_state=0),
}

params_quick_search = {
    'LinearRegression': {},
    "DecisionTreeRegressor": {},
    "RandomForestRegressor": {},
    "ExtraTreesRegressor": {},
    "AdaBoostRegressor": {},
    "GradientBoostingRegressor": {},
    "XGBRegressor": {},
}

Do a hyperparameter optimisation search using default hyperparameters

In [None]:
search = HyperparameterOptimizationSearch(models=models_quick_search, params=params_quick_search)
search.fit(X_train, y_train, scoring='r2', n_jobs=-1, cv=5)

Check results

In [None]:
grid_search_summary, grid_search_pipelines = search.score_summary(sort_by='mean_score')
grid_search_summary

* Initial assessment identifies 'RandomForestRegressor' as the best regression model, with a mean R2 score of 0.797. On average across the five cross-validation folds, the model explains ~80% of the variance in the target variable. Its R2 scores also have a relatively low spread (std_score) of 0.0559, indicating that performance varies little between folds.
* ExtraTreesRegressor and GradientBoostingRegressor also have mean R2 scores > 0.75, and could be investigated further using hyperparameter optimisation.
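
The mean and spread quoted above are summary statistics over the per-fold R2 scores. A minimal sketch of that calculation follows, using synthetic data from `make_regression` rather than the house price records, so the numbers will differ from those reported here:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic regression data; an assumption for illustration only.
X_demo, y_demo = make_regression(
    n_samples=300, n_features=8, noise=10.0, random_state=0)

# One R2 score per fold, mirroring the cv=5 search above.
fold_scores = cross_val_score(
    RandomForestRegressor(n_estimators=50, random_state=0),
    X_demo, y_demo, cv=5, scoring="r2",
)

mean_r2 = np.mean(fold_scores)  # average explanatory power across folds
std_r2 = np.std(fold_scores)    # spread of the scores between folds
print(f"mean R2: {mean_r2:.3f}, std: {std_r2:.3f}")
```

A small standard deviation relative to the mean suggests the score does not depend heavily on which fold was held out.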

### Do an extensive search on the most suitable model to find the best hyperparameter configuration.

Define model and parameters for extensive search
* Three machine learning models were tested with hyperparameter optimisation, and their performance was recorded in a spreadsheet. Initial hyperparameter choices were based on those recommended in CI's feature engine module. The most promising results were obtained with ExtraTreesRegressor, so further effort went into tuning this model, with hyperparameters chosen on the basis of documentation and further reading (see 'Rationale for hyperparameter choices' below).
* More than nine parameters were tested for ExtraTreesRegressor (some of which resulted in errors), with 3-6 values in each case.
* Hyperparameter optimisation is documented here (select hyperparameter worksheet): [hyperparameter optimisation spreadsheet](https://docs.google.com/spreadsheets/d/1HL8qL_EjlPuzbkr0mxfrzLsB_4yTtz3urhVFmWGFUQc/edit?usp=sharing)

The following model and hyperparameters were selected as offering the best balance between generalisation performance on unseen data and the potential for overfitting.

In [None]:
# defining model parameters for a more extensive search

models_search = {
    # "RandomForestRegressor": RandomForestRegressor(random_state=0),
    "ExtraTreesRegressor": ExtraTreesRegressor(random_state=0),
    # "GradientBoostingRegressor": GradientBoostingRegressor(random_state=0),
}

params_search = {

  "ExtraTreesRegressor":{'model__n_estimators': [200],
                          'model__max_features': [0.3],
                          'model__max_depth': [None],
                          'model__min_samples_split': [5],
  }

    # "ExtraTreesRegressor":{'model__n_estimators': [10, 20, 50, 100, 200], # default=100
    #                       'model__max_features': [0.3, 1.0, None], # default=1.0
    #                       'model__max_depth': [None, 2, 10], # default=None
    #                       'model__min_samples_split': [2, 5, 10], # default=2
    #                       'model__bootstrap': [True], # default=False
    #                       'model__oob_score': [True], # default=False
    #                       'model__min_samples_leaf': [1, 2, 3, 5], # default=1
    #                       'model__min_weight_fraction_leaf': [x / 10 for x in range(0, 6)], # default=0.0
    #                       'model__max_leaf_nodes': [2, 5, 21], # default=None

    #                         }
  
    # "GradientBoostingRegressor":{'model__n_estimators': [100,50,140],
    #                               'model__learning_rate':[0.1, 0.01, 0.001],
    #                               'model__max_depth': [3,15, None],
    #                               'model__min_samples_split': [2,50],
    #                               'model__min_samples_leaf': [1,50],
    #                               'model__max_leaf_nodes': [None,50],}


    #  "RandomForestRegressor":{'model__n_estimators': [100, 50, 140],
    #                           'model__max_depth': [None, 4 , 15],
    #                           'model__min_samples_split': [2, 50],
    #                           'model__min_samples_leaf': [1, 50],
    #                           'model__max_leaf_nodes': [None, 50],
    #                           }

}


Rationale for hyperparameter choices:

* ExtraTreesRegressor
    * n_estimators and max_features are the main parameters to adjust when using ExtraTreesRegressor.
    * n_estimators - the number of trees in the forest. More trees are generally better, but a balance had to be struck with computing time, and results stop improving significantly beyond a critical number of trees. 200 was the best choice on balance.
    * max_features - the size of the random subsets of features to consider when splitting a node. "The default value of max_features=1.0 is equivalent to bagged trees and more randomness can be achieved by setting smaller values (e.g. 0.3 is a typical default in the literature)." [Reference scikit-learn](https://scikit-learn.org/stable/modules/ensemble.html#forest). 0.3 also gave the best results in this case.
    * max_depth - the maximum depth of the trees. "Good results are often achieved when setting max_depth=None in combination with min_samples_split=2 (i.e., when fully developing the trees)." [Reference scikit-learn](https://scikit-learn.org/stable/modules/ensemble.html#forest). Using a value of None gave the best results in this case.
    * min_samples_split - the minimum number of samples required to split an internal node. The suggestion above (a value of 2) was tried, but the best results were obtained with a value of 5.
    * bootstrap and oob_score - these parameters determine whether bootstrap samples are used and whether an out-of-bag score is computed. "The default strategy for extra-trees is to use the whole dataset (bootstrap=False). When using bootstrap sampling the generalisation error can be estimated on the left out or out-of-bag samples. This can be enabled by setting oob_score=True." [Reference scikit-learn](https://scikit-learn.org/stable/modules/ensemble.html#forest). Performance was tested with and without bootstrap samples. High R2 scores were achieved in both cases, but bootstrap seemed to increase overfitting (shown by an extremely high R2train and a greater difference between R2train and R2test). It was decided to use the default value of False for the bootstrap parameter.
    * Other parameters were tested, including 'min_samples_leaf', 'min_weight_fraction_leaf', 'max_leaf_nodes' and 'criterion', but the best performance in terms of model fit and computing time was achieved when the default values were used.

    The following references were used for hyperparameter optimisation:
    * [GradientBoostingRegressor documentation](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingRegressor.html)
    * [RandomForestRegressor documentation](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html#sklearn-ensemble-randomforestregressor)
    * [ExtraTreesRegressor documentation](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesRegressor.html)
    * [Hyperparameter Tuning the Random Forest in Python](https://towardsdatascience.com/hyperparameter-tuning-the-random-forest-in-python-using-scikit-learn-28d2aa77dd74)
    * [How Extra trees classification and regression algorithm works](https://pro.arcgis.com/en/pro-app/latest/tool-reference/geoai/how-extra-tree-classification-and-regression-works.htm#:~:text=The%20extra%20trees%20algorithm%2C%20like,selected%20randomly%20for%20each%20tree.)
    * [Ensembles: Gradient boosting, random forests, bagging, voting, stacking](https://scikit-learn.org/stable/modules/ensemble.html#forest)
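
The bootstrap comparison described above can be sketched as follows. This is an illustration on synthetic data, not the experiment recorded in the spreadsheet, and the helper name `train_test_gap` is hypothetical:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic data stands in for the house price records (assumption).
X_demo, y_demo = make_regression(
    n_samples=400, n_features=10, noise=15.0, random_state=0)
Xb_train, Xb_test, yb_train, yb_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0)


def train_test_gap(model):
    """Fit the model and return (R2 train, R2 test, difference)."""
    model.fit(Xb_train, yb_train)
    r2_train = r2_score(yb_train, model.predict(Xb_train))
    r2_test = r2_score(yb_test, model.predict(Xb_test))
    return r2_train, r2_test, r2_train - r2_test


# Hyperparameters selected in this notebook, shared by both variants.
shared = dict(n_estimators=200, max_features=0.3, max_depth=None,
              min_samples_split=5, random_state=0)

# Default strategy: every tree sees the whole training set.
no_bootstrap = ExtraTreesRegressor(bootstrap=False, **shared)
# Bootstrap sampling, with the out-of-bag estimate enabled.
with_bootstrap = ExtraTreesRegressor(bootstrap=True, oob_score=True, **shared)

results = {
    "bootstrap=False": train_test_gap(no_bootstrap),
    "bootstrap=True": train_test_gap(with_bootstrap),
}
for name, (r2_train, r2_test, gap) in results.items():
    print(f"{name}: R2 train={r2_train:.3f}, R2 test={r2_test:.3f}, gap={gap:.3f}")

# oob_score_ only exists when bootstrap=True and oob_score=True.
print("OOB R2 estimate:", round(with_bootstrap.oob_score_, 3))
```

Which variant shows the larger gap depends on the data, so the result on synthetic data should not be read as confirming or contradicting the finding above.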

Extensive GridSearch CV

In [None]:
search = HyperparameterOptimizationSearch(models=models_search, params=params_search)
search.fit(X_train, y_train, scoring = 'r2', n_jobs=-1, cv=5)

Check results

In [None]:
grid_search_summary, grid_search_pipelines = search.score_summary(sort_by='mean_score')
grid_search_summary

Check the best model

In [None]:
best_model = grid_search_summary.iloc[0, 0]
best_model

Parameters for best model

In [None]:
grid_search_pipelines[best_model].best_params_

Define the best regressor, based on the extensive grid search

In [None]:
best_regressor_pipeline = grid_search_pipelines[best_model].best_estimator_
best_regressor_pipeline

## Assess feature importance

* Through data cleaning and feature engineering, the original set of 24 variables was reduced to 18.
* The following code narrows the 18 remaining features further, selecting only those deemed most important for predicting the target by the ExtraTreesRegressor model with the best parameters (as defined above).
* It creates a new DataFrame containing only the most important features, returns them as a list, and displays them on a bar plot.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')

# after data cleaning and feature engineering, the features may have changed
# how many data cleaning and feature engineering steps does your pipeline have?
data_cleaning_feat_eng_steps = 9 
columns_after_data_cleaning_feat_eng = (Pipeline(best_regressor_pipeline.steps[:data_cleaning_feat_eng_steps])
                                        .transform(X_train)
                                        .columns)

best_features = columns_after_data_cleaning_feat_eng[best_regressor_pipeline['feat_selection'].get_support(
)].to_list()

# create DataFrame to display feature importance
df_feature_importance = (pd.DataFrame(data={
    'Feature': columns_after_data_cleaning_feat_eng[best_regressor_pipeline['feat_selection'].get_support()],
    'Importance': best_regressor_pipeline['model'].feature_importances_})
    .sort_values(by='Importance', ascending=False)
)

# Most important features statement and plot
print(f"* These are the {len(best_features)} most important features in descending order. "
      f"The model was trained on them: \n{df_feature_importance['Feature'].to_list()}")

df_feature_importance.plot(kind='bar', x='Feature', y='Importance')
plt.show()
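
To make the masking step above easier to follow, here is a small self-contained sketch of how `SelectFromModel.get_support()` lines up with the fitted model's `feature_importances_`. It uses a synthetic DataFrame with placeholder column names (`f0` to `f7`) and a shortened pipeline, so it only approximates the notebook's pipeline:

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.feature_selection import SelectFromModel
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features with hypothetical names f0..f7 (assumption).
X_arr, y_sel = make_regression(n_samples=300, n_features=8, n_informative=3,
                               noise=5.0, random_state=0)
X_sel = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(8)])

model = ExtraTreesRegressor(n_estimators=100, random_state=0)
demo_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("feat_selection", SelectFromModel(model)),
    ("model", model),
])
demo_pipeline.fit(X_sel, y_sel)

# Boolean mask over the columns that entered the selection step.
mask = demo_pipeline["feat_selection"].get_support()
selected = X_sel.columns[mask].to_list()

# The final model was fitted on the selected columns only, so its
# importances align one-to-one with `selected`.
importance_demo = (
    pd.DataFrame({
        "Feature": selected,
        "Importance": demo_pipeline["model"].feature_importances_,
    })
    .sort_values(by="Importance", ascending=False)
)
print(importance_demo)
```

The column names are taken before the scaler because `StandardScaler` returns a NumPy array by default, which is the same reason the notebook reads the columns from the first nine pipeline steps only.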


## Evaluate on Train and Test Sets

In [None]:
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
import numpy as np


def regression_performance(X_train, y_train, X_test, y_test, pipeline):
    print("Model Evaluation \n")
    print("* Train Set")
    regression_evaluation(X_train, y_train, pipeline)
    print("* Test Set")
    regression_evaluation(X_test, y_test, pipeline)


def regression_evaluation(X, y, pipeline):
    prediction = pipeline.predict(X)
    # Built-in round() works whether the metric returns a Python float or a NumPy float
    print('R2 Score:', round(r2_score(y, prediction), 3))
    print('Mean Absolute Error:', round(mean_absolute_error(y, prediction), 3))
    print('Mean Squared Error:', round(mean_squared_error(y, prediction), 3))
    print('Root Mean Squared Error:', round(
        np.sqrt(mean_squared_error(y, prediction)), 3))
    print("\n")


def regression_evaluation_plots(X_train, y_train, X_test, y_test, pipeline, alpha_scatter=0.5):
    pred_train = pipeline.predict(X_train)
    pred_test = pipeline.predict(X_test)

    fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(12, 6))
    sns.scatterplot(x=y_train, y=pred_train, alpha=alpha_scatter, ax=axes[0])
    sns.lineplot(x=y_train, y=y_train, color='red', ax=axes[0])
    axes[0].set_xlabel("Actual")
    axes[0].set_ylabel("Predictions")
    axes[0].set_title("Train Set")

    sns.scatterplot(x=y_test, y=pred_test, alpha=alpha_scatter, ax=axes[1])
    sns.lineplot(x=y_test, y=y_test, color='red', ax=axes[1])
    axes[1].set_xlabel("Actual")
    axes[1].set_ylabel("Predictions")
    axes[1].set_title("Test Set")

    plt.show()

Evaluate performance

In [None]:
regression_performance(X_train, y_train, X_test, y_test, best_regressor_pipeline)
regression_evaluation_plots(X_train, y_train, X_test, y_test, best_regressor_pipeline)

## Summary


* In this notebook, the data cleaning and feature engineering steps were performed on the data.
* A quick grid search was used to narrow down the algorithms that predict the target best. 
* Hyperparameter optimisation was performed on the top three algorithms, which allowed the best model and combination of parameters to be selected.
* The top three algorithms all generalised well on the test set, with R2 scores > 0.75, meeting the threshold agreed with the client.
* The table below summarises the results for the top three algorithms after hyperparameter optimisation:

|Estimator|R2_train|R2_test|R2_train - R2_test|Outcome|
|:----|:----|:----|:----|:----|
|RandomForestRegressor|0.970|0.794|0.176|Model generalises reasonably well on unseen data. Possible overfitting.|
|ExtraTreesRegressor|0.943|0.825|0.118|Best generalisation on test set. Best balance between generalisation performance and potential overfitting.|
|GradientBoostingRegressor|0.873|0.771|0.102|Model generalises reasonably well on unseen data. Least overfitting of models tested.|

* Based on this data, ExtraTreesRegressor was chosen as the best model for sale price prediction. Its R2 scores for the train and test sets are 0.943 and 0.825 respectively. The difference between the two is relatively low at 0.118, which matters because a small gap between train and test scores indicates limited overfitting.
* Since ExtraTreesRegressor generalises well on unseen data and achieves R2 scores greater than 0.75, we are satisfied with its performance and do not deem it necessary to explore other methods such as PCA (Principal Component Analysis) regression. Additionally, given the satisfactory results, there is no need to iterate on data cleaning and feature engineering stages. However, given more time, it would be interesting to determine whether results could be improved with further investigation, in particular limiting the potential for overfitting.
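
The overfitting comparison above reduces to subtracting the test R2 from the train R2 for each estimator. As a quick check of the arithmetic, using the scores from the summary table:

```python
# R2 scores copied from the summary table: (train, test).
r2_scores = {
    "RandomForestRegressor": (0.970, 0.794),
    "ExtraTreesRegressor": (0.943, 0.825),
    "GradientBoostingRegressor": (0.873, 0.771),
}

# Train-test gap per estimator; a larger gap suggests more overfitting.
gaps = {name: round(train - test, 3)
        for name, (train, test) in r2_scores.items()}

best_test = max(r2_scores, key=lambda name: r2_scores[name][1])
smallest_gap = min(gaps, key=gaps.get)

print(gaps)
print("Highest test R2:", best_test)
print("Smallest gap:", smallest_gap)
```

ExtraTreesRegressor has the highest test score while GradientBoostingRegressor has the smallest gap, which is why the choice is described as a balance rather than a clear win on both counts.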

## Best ML Pipeline for Modelling

In [None]:
def PipelineOptimization(model):
    pipeline_base = Pipeline([

        # Data Cleaning - see Data Cleaning Notebook
        ("DropFeatures", DropFeatures(features_to_drop=['EnclosedPorch', 'WoodDeckSF', 'GarageYrBlt'])),

        ("CategoricalImputation", CategoricalImputer(imputation_method='missing',fill_value='Unf', 
                                                        variables=['GarageFinish','BsmtFinType1'])),
        
        ("MedianImputation", MeanMedianImputer(imputation_method='median', 
                                                variables=['LotFrontage', '2ndFlrSF', 'MasVnrArea'])),

        ("MeanImputation", MeanMedianImputer(imputation_method='mean', variables='BedroomAbvGr')),

        # Feature Engineering - see Feature Engineering Notebook
        ("OrdinalCategoricalEncoder", OrdinalEncoder(encoding_method='arbitrary', 
                                                     variables=['BsmtExposure', 'BsmtFinType1', 'GarageFinish', 'KitchenQual'])),

        ("LogTransformer", vt.LogTransformer(variables=['LotArea', 'LotFrontage'])),

        ("PowerTransformer", vt.PowerTransformer(variables=['BsmtUnfSF', 'OpenPorchSF'])),

        ("YeoJohnsonTransformer", vt.YeoJohnsonTransformer(variables='TotalBsmtSF')),
                                      
        ("SmartCorrelatedSelection", SmartCorrelatedSelection(variables=None, method="spearman", threshold=0.6, 
                                                                selection_method="variance")),

        # Feature Scaling - Standardize features by removing the mean and scaling to unit variance.
        ("scaler", StandardScaler()),

        # Feature Selection - Meta-transformer for selecting features based on importance weights.
        ("feat_selection", SelectFromModel(model)),

        # ML Algorithms
        ("model", model),
    ])

    return pipeline_base

In [None]:
models_search = {
    "ExtraTreesRegressor": ExtraTreesRegressor(random_state=0),
}

params_search = {
    "ExtraTreesRegressor":{'model__max_depth':[None],
                            'model__max_features':[0.3],
                            'model__min_samples_split':[5],
                            'model__n_estimators':[200]
                            }
}


In [None]:
search = HyperparameterOptimizationSearch(models=models_search, params=params_search)
search.fit(X_train, y_train, scoring = 'r2', n_jobs=-1, cv=5)

In [None]:
grid_search_summary, grid_search_pipelines = search.score_summary(sort_by='mean_score')
grid_search_summary

In [None]:
best_model = grid_search_summary.iloc[0,0]
best_model

In [None]:
# Defining the best pipeline

pipeline_regression = grid_search_pipelines[best_model].best_estimator_
pipeline_regression

In [None]:
regression_performance(X_train, y_train, X_test, y_test, pipeline_regression)
regression_evaluation_plots(X_train, y_train, X_test, y_test, pipeline_regression)

---

# Push Files to Repo

Generate the following files:
* Train Set
* Test Set
* Modelling Pipeline
* Feature importance Plot

In [None]:
import joblib
import os

version = 'v1'
file_path = f'outputs/ml_pipeline/predict_sale_price/{version}'

try:
  os.makedirs(name=file_path)
except Exception as e:
  print(e)
    


## Train Set: Features and Target

In [None]:
X_train.head()

In [None]:
X_train.to_csv(f"{file_path}/X_train.csv", index=False)

In [None]:
y_train.head()

In [None]:
y_train.to_csv(f"{file_path}/y_train.csv", index=False)

## Test Set: Features and Target

In [None]:
X_test.head()

In [None]:
X_test.to_csv(f"{file_path}/X_test.csv", index=False)

In [None]:
y_test.head()

In [None]:
y_test.to_csv(f"{file_path}/y_test.csv", index=False)

## Modelling Pipeline

In [None]:
pipeline_regression

In [None]:

joblib.dump(value=pipeline_regression, filename=f"{file_path}/regression_pipeline.pkl")
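
Once saved, the pipeline can be reloaded with `joblib.load` and used for prediction without refitting. The sketch below demonstrates the round trip on a small stand-in pipeline written to a temporary directory; the file name `demo_pipeline.pkl` and the synthetic data are assumptions, not artefacts of this project:

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Small stand-in pipeline fitted on synthetic data (assumption).
X_io, y_io = make_regression(n_samples=100, n_features=5, random_state=0)
fitted = Pipeline([
    ("scaler", StandardScaler()),
    ("model", ExtraTreesRegressor(n_estimators=20, random_state=0)),
]).fit(X_io, y_io)

with tempfile.TemporaryDirectory() as tmp_dir:
    # 'demo_pipeline.pkl' is a hypothetical file name.
    path = os.path.join(tmp_dir, "demo_pipeline.pkl")
    joblib.dump(value=fitted, filename=path)
    reloaded = joblib.load(path)

# The reloaded pipeline should reproduce the original predictions.
same_predictions = np.allclose(fitted.predict(X_io), reloaded.predict(X_io))
print("Predictions match after reload:", same_predictions)
```

Pickled pipelines are generally only safe to load with the same library versions that produced them, so it is worth recording the scikit-learn and feature-engine versions alongside the saved file.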

## Feature Importance Plot

In [None]:
df_feature_importance.plot(kind='bar', x='Feature', y='Importance')
plt.show()

In [None]:
df_feature_importance.plot(kind='bar',x='Feature',y='Importance')
plt.savefig(f'{file_path}/features_importance.png', bbox_inches='tight')

In [None]:
def regression_evaluation_plots(X_train, y_train, X_test, y_test, pipeline, alpha_scatter=0.5):
    pred_train = pipeline.predict(X_train)
    pred_test = pipeline.predict(X_test)

    fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(12, 6))
    sns.scatterplot(x=y_train, y=pred_train, alpha=alpha_scatter, ax=axes[0])
    sns.lineplot(x=y_train, y=y_train, color='red', ax=axes[0])
    axes[0].set_xlabel("Actual")
    axes[0].set_ylabel("Predictions")
    axes[0].set_title("Train Set")

    sns.scatterplot(x=y_test, y=pred_test, alpha=alpha_scatter, ax=axes[1])
    sns.lineplot(x=y_test, y=y_test, color='red', ax=axes[1])
    axes[1].set_xlabel("Actual")
    axes[1].set_ylabel("Predictions")
    axes[1].set_title("Test Set")

    # plt.show()
    plt.savefig(f'{file_path}/model_performance_evaluation.png', bbox_inches='tight')

regression_evaluation_plots(X_train, y_train, X_test, y_test, pipeline_regression)

## Conclusions

* The complete pipeline for data cleaning, feature engineering, modelling and evaluation has been defined in this notebook.
* The best-performing model for predicting house sale price is ExtraTreesRegressor with the following parameters: 'model__n_estimators': [200], 'model__max_features': [0.3], 'model__max_depth': [None] and 'model__min_samples_split': [5].
* The most important features that are used to train the model are ['GrLivArea', '1stFlrSF', 'GarageArea', 'YearRemodAdd', 'KitchenQual', 'YearBuilt'].

## Next Steps

* Build and save a second simplified version of the best ML pipeline.
