# Advanced Regression Techniques - Ames, IA Housing Data

The purpose of this project is to predict the price of housing in the city of Ames, IA using real estate data and three regression models
(Decision Tree, Random Forest, XGBoost Regression).

Additionally, features such as Lot Size, Number of Bedrooms and the Age of the House will be considered.

Root Mean Squared Error (RMSE) is the metric used to measure how closely each model's predicted real estate prices match the observed prices. According to Wikipedia, RMSE is defined as follows:

The root-mean-square deviation (**RMSD**) or root-mean-square error (**RMSE**) is a frequently used measure of the differences between values (sample or population values) predicted by a model or an estimator and the values observed. The **RMSD** represents the square root of the second sample moment of the differences between predicted values and observed values or the quadratic mean of these differences. These deviations are called residuals when the calculations are performed over the data sample that was used for estimation and are called errors (or prediction errors) when computed out-of-sample. The **RMSD** serves to aggregate the magnitudes of the errors in predictions for various data points into a single measure of predictive power. **RMSD** is a measure of accuracy, to compare forecasting errors of different models for a particular dataset and not between datasets, as it is scale-dependent.[1]


**RMSD** is non-negative, and a value of 0 (almost never achieved in practice) would indicate a perfect fit to the data. In general, a **lower RMSD** is better than a higher one. However, comparisons across different types of data would be invalid because the measure is dependent on the scale of the numbers used.

**RMSD** is the square root of the average of squared errors. The effect of each error on **RMSD** is proportional to the size of the squared error; thus larger errors have a disproportionately large effect on **RMSD**. Consequently, **RMSD** is sensitive to outliers.
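
As a rough illustration of the definition above, the sketch below computes RMSE by hand for two sets of predictions. The helper name `rmse` and the prices are made up for this example; they are not taken from the Ames data. Both prediction sets have the same total absolute error (30,000), but the one with a single large miss gets the higher RMSE, which is the outlier sensitivity described above.

```python
import math

def rmse(observed, predicted):
    """Square root of the mean squared difference between two equal-length sequences."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    squared_errors = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical sale prices (USD), invented for illustration only
observed = [200_000, 150_000, 300_000]
close_preds = [210_000, 140_000, 310_000]    # every prediction is off by 10,000
outlier_preds = [200_000, 150_000, 330_000]  # one prediction is off by 30,000

print(rmse(observed, close_preds))    # three equal misses
print(rmse(observed, outlier_preds))  # one large miss, larger RMSE
```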

# Data Description

In [None]:
"""
Data Description:

       NA	No Garage
              
       GarageCond: Garage condition

       Ex	Excellent
       Gd	Good
       TA	Typical/Average
       Fa	Fair
       Po	Poor
       NA	No Garage
              
       PavedDrive: Paved driveway

       Y	Paved 
       P	Partial Pavement
       N	Dirt/Gravel
              
       WoodDeckSF: Wood deck area in square feet

       OpenPorchSF: Open porch area in square feet

       EnclosedPorch: Enclosed porch area in square feet

       3SsnPorch: Three season porch area in square feet

       ScreenPorch: Screen porch area in square feet

       PoolArea: Pool area in square feet

       PoolQC: Pool quality
              
       Ex	Excellent
       Gd	Good
       TA	Average/Typical
       Fa	Fair
       NA	No Pool
              
       Fence: Fence quality
              
       GdPrv	Good Privacy
       MnPrv	Minimum Privacy
       GdWo	Good Wood
       MnWw	Minimum Wood/Wire
       NA	No Fence

       MiscFeature: Miscellaneous feature not covered in other categories
              
       Elev	Elevator
       Gar2	2nd Garage (if not described in garage section)
       Othr	Other
       Shed	Shed (over 100 SF)
       TenC	Tennis Court
       NA	None
              
       MiscVal: $Value of miscellaneous feature

       MoSold: Month Sold (MM)

       YrSold: Year Sold (YYYY)

       SaleType: Type of sale
              
       WD 	Warranty Deed - Conventional
       CWD	Warranty Deed - Cash
       VWD	Warranty Deed - VA Loan
       New	Home just constructed and sold
       COD	Court Officer Deed/Estate
       Con	Contract 15% Down payment regular terms
       ConLw	Contract Low Down payment and low interest
       ConLI	Contract Low Interest
       ConLD	Contract Low Down
       Oth	Other
              
       SaleCondition: Condition of sale

       Normal	Normal Sale
       Abnorml	Abnormal Sale -  trade, foreclosure, short sale
       AdjLand	Adjoining Land Purchase
       Alloca	Allocation - two linked properties with separate deeds, typically condo with a garage unit	
       Family	Sale between family members
       Partial	Home was not completed when last assessed (associated with New Homes)
       
"""

# Import libraries

In [None]:
try:
    %reload_ext autotime
except Exception:
    %pip install ipython-autotime
    %load_ext autotime

In [None]:
import os  # needed here because this cell runs before the main import cell

for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
import numpy as np
import pandas as pd

import plotly.express as px
import plotly.offline

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

import os
  
# import sklearn machine learning libraries
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.preprocessing import RobustScaler

from sklearn.metrics import accuracy_score
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV

from sklearn.feature_selection import RFE, RFECV

%pip install shap
import shap

%pip install xgboost
%pip install category_encoders

# XGBoost ML libraries
import math
from xgboost import XGBRegressor
from xgboost import plot_importance

# Formatting options
pd.options.display.float_format = '{:,.3f}'.format
pd.set_option('display.max_columns', None)

# Data Loading and EDA

In [None]:
# import the training dataset

url = 'https://github.com/patty-olanterns/RealEstateAmesIA/blob/main/train.csv?raw=true'
df = pd.read_csv(url, low_memory=True)
df.head()

In [None]:
# check the data structure of the dataset
df.info()

In [None]:
# Drop the Id column
df.drop('Id', axis=1, inplace=True)

In [None]:
object_cols = ['MSSubClass', 'MSZoning'] 
df[object_cols] = df[object_cols].astype('object')

# Convert all int and float64 to float32
num_cols = df.select_dtypes(exclude=['object']).columns
df[num_cols] = df[num_cols].apply(pd.to_numeric, errors='coerce', downcast='float')

Check the number of attributes with Missing Values

In [None]:
# Check for null data and show the 20 columns with the most missing values
df.isnull().sum(axis=0).sort_values(ascending=False).head(20)

19 of the 80 attributes have missing values. These cells can be filled in with imputation
or other methods, or the affected rows or columns can be deleted.
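
The two options mentioned above (imputing versus deleting) can be sketched on a small toy frame. The values below are invented; only the column names mirror the Ames data. This is a minimal sketch, not the preprocessing used later in the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame, invented for illustration; column names mirror the Ames data
toy = pd.DataFrame({
    'LotFrontage': [65.0, np.nan, 80.0, 70.0],
    'GarageType': ['Attchd', np.nan, 'Detchd', 'Attchd'],
})

# Count missing values per column
missing_counts = toy.isnull().sum()

# Option 1: impute (median for the numeric column, most frequent value for the categorical one)
imputed = toy.copy()
imputed['LotFrontage'] = SimpleImputer(strategy='median').fit_transform(toy[['LotFrontage']]).ravel()
imputed['GarageType'] = SimpleImputer(strategy='most_frequent').fit_transform(toy[['GarageType']]).ravel()

# Option 2: drop every row that has at least one missing value
dropped = toy.dropna()

print(missing_counts)
print(imputed)
print(dropped.shape)
```

Imputing keeps all rows at the cost of inventing values; dropping keeps only observed values at the cost of losing rows.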

# Is there a correlation (+ or -) between specific house features and the SalePrice?

Example:
 - Does a renovation increase the SalePrice?
 - By how much?
 - How much of an impact does the GarageSize(1 car, 2 car) have on the SalePrice?
   

# Data Analysis

In [None]:
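# Display the 10 features most positively correlated with the target (SalePrice).
# head(11) is used because SalePrice itself ranks first with a correlation of 1.
# Only numeric columns are used; object columns cannot be correlated
df.select_dtypes(include='number').corrwith(df['SalePrice']).sort_values(ascending=False).head(11)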
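
To make the sign of a correlation concrete, here is a small sketch on invented numbers (the column names are borrowed from the dataset, the values are not). A feature that rises with price gets a correlation near +1, and one that falls as price rises gets a correlation near -1.

```python
import pandas as pd

# Invented values for illustration only
toy = pd.DataFrame({
    'SalePrice': [100.0, 150.0, 200.0, 250.0],
    'GrLivArea': [1000.0, 1400.0, 2100.0, 2400.0],  # grows with price
    'Age': [40.0, 30.0, 20.0, 10.0],                # shrinks as price grows
    'Neighborhood': ['A', 'B', 'A', 'B'],           # non-numeric, excluded below
})

numeric = toy.select_dtypes(include='number')
corr = numeric.corrwith(numeric['SalePrice']).sort_values(ascending=False)
print(corr)
```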

In [None]:
# Plot the data using a heat map (numeric columns only)
corr_vals = df.select_dtypes(include='number').corr()

# Check correlation between SalePrice and attributes
plt.rcParams['figure.figsize'] = 25, 25
plot_map = sns.heatmap(corr_vals,annot=True,fmt=".2f",cmap='coolwarm')

## Plot AveragePriceOfHome by Neighborhood

In [None]:
# Plot average price by neighborhood
a = pd.DataFrame(df.groupby('Neighborhood')['SalePrice'].mean().sort_values(ascending=True))
a.plot.barh(figsize = (8,5))
plt.xlabel('Price (USD)')
plt.title('Average Price of Home by Neighborhood')

## What factors should we consider for SalePrice? 

Factors to consider: 

- Year of Renovations (YearRemodAdd)
- Overall Quality
- Kitchen
- Roof
- Number of years since last Renovation ('YearsSinceReno')
- Age of House (YearBuilt)
- Total Square Footage (TotalSF)
- Lot Size
- Area of town (crime, race, earnings, etc.)
- Condition of House (HouseQuality)
- Basement quality etc..

In [None]:
# Combine 1st floor, 2nd floor and total basement sqft (1stFlrSF, 2ndFlrSF, TotalBsmtSF)
df['TotalSF'] = df['1stFlrSF'] + df['2ndFlrSF'] + df['TotalBsmtSF']
df.info()

In [None]:
# Determine outliers for TotalSF
fig = px.histogram(df, x='TotalSF', 
                   marginal='box',
             histnorm = 'percent',
             title='TotalSF Histogram')

fig.show()

Filter out all outlier values with an absolute z-score of 3 or more

In [None]:
# Filter all columns based on a single column (TotalSF)

from scipy import stats
# Take the absolute value of the z-score first, then compare against the threshold
df_filtered = df[np.abs(stats.zscore(df['TotalSF'])) < 3]

print("Old Shape: ", df.shape)
print("New Shape: ", df_filtered.shape)
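
The z-score filter can be checked on a small synthetic array. The numbers below are invented, and the z-score is computed by hand with the population standard deviation, which is assumed here to match the default of `scipy.stats.zscore`.

```python
import numpy as np

# Synthetic TotalSF-like values (invented): 19 typical homes and one extreme value
total_sf = np.array([1500.0 + 50.0 * i for i in range(19)] + [12000.0])

# z-score: distance from the mean in units of standard deviation
z = (total_sf - total_sf.mean()) / total_sf.std()

# Take the absolute value first, then compare with the threshold
keep = np.abs(z) < 3
filtered = total_sf[keep]

print("Old size:", total_sf.size)
print("New size:", filtered.size)
```

Note the order of operations: `np.abs(z) < 3` removes extreme values on both sides of the mean, whereas `np.abs(z < 3)` only compares the signed z-score and would keep very small values.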

# Preprocessing

- Features and target selection
- Train-Test Split
- Numeric/Category Pipeline Setup
  - Define numerical and categorical columns in training data
  - Replace null numeric values with SimpleImputer()
  - Replace null categorical values (string, object, bool) with most frequent values by column
  - Encode each categorical value as unique category using OneHotEncoder()
  - Setup preprocessor ColumnTransformer Pipeline with numeric and category transformers as steps
    in process

## Features and Target selection

In [None]:
features = [x for x in df.columns if x not in ['SalePrice']]
X = df[features] # Feature matrix (predictors)
y = df['SalePrice'] # Target variable

In [None]:
len(features)

## Train-Test Split

In [None]:
# Train-test split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,random_state=42)

## Numeric and Category Pipeline setup 
 - As configured in this notebook, XGBoost can ONLY interpret numeric values!
 - In order to use both category and numeric values, all category values
   must be encoded using OneHotEncoder. Each distinct category value becomes its own binary (0/1) column
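
A minimal sketch of what one-hot encoding does to a single categorical column is shown here. The `PavedDrive` codes come from the data description; the unseen value `'X'` is made up to show the effect of `handle_unknown='ignore'`.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy values for one categorical column (PavedDrive codes from the data description)
train = np.array([['Y'], ['N'], ['P'], ['Y']], dtype=object)
new = np.array([['N'], ['X']], dtype=object)  # 'X' is a made-up category not seen in training

encoder = OneHotEncoder(handle_unknown='ignore')
encoder.fit(train)

# One output column per category learned during fit, in sorted order
encoded = encoder.transform(new).toarray()

print(encoder.categories_[0])
print(encoded)
```

With `handle_unknown='ignore'`, a category that was not present during fitting is encoded as a row of zeros instead of raising an error, which matters when the test split contains values the training split never saw.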

In [None]:

# Split the data up in to numerical data (int and float) and categorical 
# data (objects, names, words etc.)
num_cols = [cname for cname in X_train.columns 
            if X_train[cname].dtype == "float32"]

category_cols = [cname for cname in X_train.columns 
                 if X_train[cname].nunique() < 22 and 
                 X_train[cname].dtype == "object"]

# SimpleImputer replaces null cell values with the mean, median, most frequent
# or a fixed value. With strategy='constant' and the default fill_value,
# numeric nulls are filled with 0
numerical_transformer = SimpleImputer(strategy='constant')

# The same process can be applied to categorical values (strings, objects, etc.)
# and automated using the Pipeline function
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Transform all data in columns using the preprocessor and ColumnTransformer function
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, num_cols),
        ('cat', categorical_transformer, category_cols)
])

# Create Root Mean Squared Log Error function (RMSLE)

RMSLE (Root Mean Squared Log Error) is used for this dataset to compare the predicted
prices with the observed prices. scikit-learn's built-in `mean_squared_log_error` works on
log(1 + y), so a small helper that takes the RMSE of the plain logarithms is defined here.

In [None]:
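def root_mean_squared_log_error(y_valid, y_preds):
    """Return the RMSE between log(y_valid) and log(y_preds)."""
    y_valid = np.asarray(y_valid, dtype=float)
    y_preds = np.asarray(y_preds, dtype=float)
    if y_valid.shape != y_preds.shape:
        raise ValueError('y_valid and y_preds must have the same length')
    # Take the square root explicitly; the squared=False argument is deprecated
    # in newer scikit-learn releases
    return np.sqrt(mean_squared_error(np.log(y_valid), np.log(y_preds)))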
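
One reason a log-based error suits house prices is that it measures relative rather than absolute error. The sketch below uses a hypothetical stand-alone helper named `rmsle` and invented prices: a 10% overestimate produces the same error for a cheap home as for an expensive one, even though the dollar difference is ten times larger.

```python
import math

def rmsle(observed, predicted):
    """RMSE of the differences between log(predicted) and log(observed)."""
    log_diffs = [math.log(p) - math.log(o) for o, p in zip(observed, predicted)]
    return math.sqrt(sum(d ** 2 for d in log_diffs) / len(log_diffs))

# Invented prices: both predictions are 10% too high
cheap = rmsle([100_000], [110_000])          # off by 10,000
expensive = rmsle([1_000_000], [1_100_000])  # off by 100,000

print(cheap, expensive)
```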

# Model Selection

## Model 1: DecisionTree

In [None]:
# Import model from sklearn
from sklearn.tree import DecisionTreeRegressor

tree_model = DecisionTreeRegressor(random_state=42)

# Setup a Pipeline processing function
tree_clf = Pipeline(steps=[('preprocessor',preprocessor),
                           ('tree_model',tree_model)
                          ])


# Fit the training dataset to the model
tree_clf.fit(X_train,y_train)

# Set tree_preds to the test feature data (X_test)
tree_preds = tree_clf.predict(X_test)

# Print the RMSLE results
print('RMSLE:', root_mean_squared_log_error(y_test,tree_preds))

## Model II: Random Forest regressor

In [None]:
from sklearn.ensemble import RandomForestRegressor

rf_model = RandomForestRegressor(random_state=42)

rand_clf = Pipeline(steps=[('preprocessor', preprocessor),
                           ('rf',rf_model)
                           ])

# Fit the training data to the model
rand_clf.fit(X_train, y_train)

rand_preds = rand_clf.predict(X_test)

print('RMSLE:', root_mean_squared_log_error(y_test, rand_preds))

## Model III: XGB Regressor

### Run Pipeline process and fit train-test data

In [None]:
xgb_model = XGBRegressor(n_estimators=1000,
                         max_depth=5, min_child_weight=1, 
                         gamma=0, 
                         booster='gbtree', 
                         learning_rate=0.02, 
                         objective='reg:squarederror', 
                         random_state=42)

# Run Pipeline
xgb_clf = Pipeline(steps=[('preprocessor', preprocessor),
                          ('xgb_model', xgb_model)
                          ])

# Fit the model
xgb_clf.fit(X_train, y_train, xgb_model__verbose=False)

## Determine feature importance in XGBoost model

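- Extract the gain-based importance score for each feature in the model
- Tree-based models such as XGBoost do not have beta coefficients; a higher gain score means that splits on that feature contributed more to reducing the error in the predicted Sale Price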

In [None]:
model = xgb_clf.named_steps['xgb_model']

feature_important = model.get_booster().get_score(importance_type='gain')
keys = list(feature_important.keys())
values = list(feature_important.values())

data = pd.DataFrame(data=values, index=keys, columns=["score"]).sort_values(by = "score", ascending=False)
data.nlargest(10, columns="score").plot(kind='barh', figsize = (20,10)) ## plot top 10 features

In [None]:
# Importance scores keyed by XGBoost's internal feature names (f0, f1, ...)
feature_important = model.get_booster().get_score(importance_type='gain')

data = pd.DataFrame({'FEATURE_IMPORTANCE_KEY': list(feature_important.keys()),
                     'SCORE': list(feature_important.values())})

# The model sees the preprocessor's output, so f<i> refers to the i-th transformed
# column (one-hot columns included), not the i-th column of X_train.
# get_feature_names_out() needs a reasonably recent scikit-learn release
feature_names = xgb_clf.named_steps['preprocessor'].get_feature_names_out()

df_feat = data.copy()
df_feat['FEATURE_IMPORTANCE_KEY'] = df_feat['FEATURE_IMPORTANCE_KEY'].str.replace('f', '', regex=False).astype('int32')
df_feat['FEATURE'] = feature_names[df_feat['FEATURE_IMPORTANCE_KEY'].to_numpy()]
df_feat['SCORE'] = df_feat['SCORE'] / 10**6  # rescale the gain values for display
df_feat = df_feat.sort_values(by='SCORE',ascending=False).reset_index(drop=True)
df_feat = df_feat.head(10)

# Display top 10 features for XGBoost model
fig = px.bar(df_feat,
             x='SCORE',
             y='FEATURE',
             hover_data=['FEATURE',
                        'SCORE'],
            title='XGBoost Regression Model: Feature Importance Score - Importance Type = Gain')
fig.show()

### Display predicted values (RMSLE) in XGBoost model

In [None]:
xgb_preds = xgb_clf.predict(X_test)

print('RMSLE:', root_mean_squared_log_error(y_test, xgb_preds))
print('\n')


Compare model values

In [None]:
print('Decision Tree RMSLE:', root_mean_squared_log_error(y_test, tree_preds))
print('Random Forest RMSLE:', root_mean_squared_log_error(y_test, rand_preds))
print('XGBoost Regressor RMSLE:', root_mean_squared_log_error(y_test, xgb_preds))


The XGBoost model performed the best out of the three models as 
it had the lowest RMSLE score

### Run GridSearchCV (CrossValidation) to select best parameters

Use GridSearchCV (a cross-validated grid search) to determine the best
hyperparameters for the XGBoost model.

In [None]:
# Keys must match the pattern <pipeline step>__<parameter name>
param_grid={"xgb_model__learning_rate": (0.05, 0.10, 0.15),
                        "xgb_model__max_depth": [6],
                        "xgb_model__min_child_weight": [1],
                        "xgb_model__gamma":[0.0, 0.1, 0.2],
                        "xgb_model__colsample_bytree":[ 0.3, 0.4]}

# scoring=None falls back to the estimator's default score (R^2 for regressors)
grid = GridSearchCV(xgb_clf, 
            cv=3, param_grid=param_grid, 
            scoring=None, verbose=True, n_jobs=-1)

"""
grid = GridSearchCV(xgb_clf,
                    param_grid=param_grid,
                    n_jobs=-1,
                    cv=3,
                    scoring='accuracy')

"""

grid.fit(X_train, y_train)
print('\n All results:')
print(grid.cv_results_)
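
For reference, here is a self-contained sketch of a grid search over a Pipeline on synthetic data. It swaps in a `DecisionTreeRegressor` and a `StandardScaler` so that it runs with scikit-learn alone; the data, step names and grid values are all invented for illustration. It also uses `scoring='neg_root_mean_squared_error'` so that the reported score is an error in the target's units rather than R^2.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data, invented for illustration
rng = np.random.default_rng(42)
X_toy = rng.uniform(0, 10, size=(120, 3))
y_toy = 3.0 * X_toy[:, 0] + rng.normal(0, 0.5, size=120)

toy_pipe = Pipeline(steps=[('scaler', StandardScaler()),
                           ('tree', DecisionTreeRegressor(random_state=42))])

# Keys follow the pattern <step name>__<parameter name>
toy_grid = {'tree__max_depth': [2, 4, 6],
            'tree__min_samples_leaf': [1, 5]}

search = GridSearchCV(toy_pipe, param_grid=toy_grid, cv=3,
                      scoring='neg_root_mean_squared_error')
search.fit(X_toy, y_toy)

print(search.best_params_)
print(-search.best_score_)  # flip the sign to read the score as an RMSE
```

scikit-learn scorers follow a "higher is better" convention, which is why error metrics are exposed in negated form.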

In [None]:
print('\n Best estimator:')
print(grid.best_estimator_)

In [None]:
print('\n Best score:')
# With scoring=None, best_score_ is the mean cross-validated R^2 of the best estimator
print(grid.best_score_)

In [None]:
print('\n Best parameters:')
print(grid.best_params_)

In [None]:
print('\n Feature Importances:')
feat_array = grid.best_estimator_.named_steps["xgb_model"].feature_importances_
df_feat = pd.DataFrame(feat_array.reshape(feat_array.shape), columns=['FEAT_IMPOR_SCORE']).sort_values(by='FEAT_IMPOR_SCORE',ascending=False).reset_index(drop=True)
df_feat.head(10)

## Model IV: Run high performance XGB regressor model

In [None]:
hp_model = XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=0.6, gamma=0.5, gpu_id=-1,
             importance_type='gain', interaction_constraints='',
             learning_rate=0.02, max_delta_step=0, max_depth=4,
             min_child_weight=1, monotone_constraints='()',
             n_estimators=1000, n_jobs=0, num_parallel_tree=1, random_state=42,
             reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=0.8,
             tree_method='exact', validate_parameters=1, verbosity=None)

hp_clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('hp_model', hp_model)
                     ])

hp_clf.fit(X_train, y_train, hp_model__verbose=False)

hp_preds = hp_clf.predict(X_test)

print('High Performance XGB Regressor RMSLE:', root_mean_squared_log_error(y_test, hp_preds))

# Final model Setup

Feature Engineering

In [None]:
X.columns.to_list()

In [None]:
print(set(X['SaleCondition']))

In [None]:
print(X['YearBuilt'].head())
print('\n')
print(X['YearRemodAdd'].head())

In [None]:
print(set(X['YrSold']))
print(set(X['MoSold']))

In [None]:
print(set(X['ExterQual']))
print(set(X['ExterCond']))

In [None]:
print(set(X['YearBuilt']))
print('\n')
print(set(X['OverallQual']))

In [None]:
print(set(X['BedroomAbvGr']))
print(set(X['FullBath']))
print(set(X['HalfBath']))

### Feature notes

Based on the features in the columns, a few things stand out:
    
  - Subtracting YearBuilt from YearRemodAdd gives the number of years between construction
    and the most recent renovation (a renovation adds value to the house).
  - Lot geometry can be approximated by dividing LotArea by LotFrontage. If the lot is a good shape, it'll sell better. If it's a strange shape, then it may be less
    likely to sell.
  - Location in Ames? Is there high crime in the area? What is the income level in the neighborhood? Is it close to the downtown area or accessible to shopping/university/transit/major road networks?
   
  - Features to combine:
       - YrSold and MoSold
       - Condition1 and Condition2
       - ExterQual and ExterCond
       - YearBuilt and OverallQual
       - Is there a finished basement?
       - Finished basement sqft
       
    

In [None]:
# Make a copy of the features (X values)
X_feat_eng = X.copy()

In [None]:
# Create the combined features
# Years between original construction and the most recent remodel
X_feat_eng['YearsSinceReno'] = X_feat_eng['YearRemodAdd'] - X_feat_eng['YearBuilt']
# Note: if the data already has a LotShape column, this replaces it; rows with a
# missing LotFrontage produce NaN here
X_feat_eng['LotShape'] = X_feat_eng['LotArea'] / X_feat_eng['LotFrontage']
X_feat_eng['LandTopo'] = X_feat_eng['LandSlope'] + '_' + X_feat_eng['LandContour']
X_feat_eng['ValueRating'] = X_feat_eng['YearBuilt'] * X_feat_eng['OverallQual']
X_feat_eng['FinishedBsmt'] = X_feat_eng['BsmtFinSF1'] > 0
X_feat_eng['GarageVal'] = X_feat_eng['YearBuilt'] * X_feat_eng['GarageCars']
# Note: this replaces the original MiscVal column ($ value of miscellaneous feature)
X_feat_eng['MiscVal'] = X_feat_eng['Fireplaces'] + X_feat_eng['OverallQual']  
X_feat_eng = X_feat_eng.drop(columns=['GarageCars'])


In [None]:
# Split the data up in to numerical data (int and float) and categorical 
# data (objects, names, words etc.)
feat_num_cols = [cname for cname in X_feat_eng.columns 
            if X_feat_eng[cname].dtype in ['float32']]

feat_category_cols = [cname for cname in X_feat_eng.columns 
                 if X_feat_eng[cname].nunique() < 22 and 
                 X_feat_eng[cname].dtype in ['object', 'bool']]

# SimpleImputer replaces null cell values with the mean, median, most frequent
# or a fixed value. With strategy='constant' and the default fill_value,
# numeric nulls are filled with 0
feat_numerical_transformer = SimpleImputer(strategy='constant')

# The same process can be applied to categorical values (strings, objects, etc.)
# and automated using the Pipeline function
feat_categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Transform all data in columns using the preprocessor and ColumnTransformer function
feature_preprocessor = ColumnTransformer(
    transformers=[
        ('num', feat_numerical_transformer, feat_num_cols),
        ('cat', feat_categorical_transformer, feat_category_cols)
])


### Run the final Feature model (XGBRegressor)

In [None]:
feature_model = XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=0.6, gamma=0.0, gpu_id=-1,
             importance_type='gain', interaction_constraints='',
             learning_rate=0.02, max_delta_step=0, max_depth=4,
             min_child_weight=0.0, monotone_constraints='()',
             n_estimators=1250, n_jobs=0, num_parallel_tree=1, random_state=0,
             reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=0.8,
             tree_method='exact', validate_parameters=1, verbosity=None)


feature_clf = Pipeline(steps=[('feature_preprocessor', feature_preprocessor),
                                ('feature_model', feature_model)           
                                ])
# Perform train-test split
feature_X_train, feature_X_valid, feature_y_train, feature_y_valid = train_test_split(X_feat_eng, y, random_state=42)

# Fit the training dataset
feature_clf.fit(feature_X_train, feature_y_train)

### Check feature importance

In [None]:
model = feature_clf.named_steps['feature_model']

# Importance scores keyed by XGBoost's internal feature names (f0, f1, ...)
feature_important = model.get_booster().get_score(importance_type='gain')

data = pd.DataFrame({'FEATURE_IMPORTANCE_KEY': list(feature_important.keys()),
                     'SCORE': list(feature_important.values())})

# f<i> refers to the i-th column produced by the feature preprocessor (one-hot
# columns included), not the i-th column of the raw training data.
# get_feature_names_out() needs a reasonably recent scikit-learn release
feature_names = feature_clf.named_steps['feature_preprocessor'].get_feature_names_out()

df_feat = data.copy()
df_feat['FEATURE_IMPORTANCE_KEY'] = df_feat['FEATURE_IMPORTANCE_KEY'].str.replace('f', '', regex=False).astype('int32')
df_feat['FEATURE'] = feature_names[df_feat['FEATURE_IMPORTANCE_KEY'].to_numpy()]
df_feat['SCORE'] = df_feat['SCORE'] / 10**6  # rescale the gain values for display
df_feat = df_feat.sort_values(by='SCORE',ascending=False).reset_index(drop=True)
df_feat = df_feat.head(10)

# Display top 10 features for XGBoost model
fig = px.bar(df_feat,
             x='SCORE',
             y='FEATURE',
             hover_data=['FEATURE',
                        'SCORE'],
            title='XGBoost Final Regression Model: Feature Importance Score - Importance Type = Gain')

fig.show()

In [None]:

# Feature predictions using validation feature data (test)
feature_preds = feature_clf.predict(feature_X_valid)

print('Final XGBRegressor Model RMSLE:', root_mean_squared_log_error(feature_y_valid, feature_preds))