# Challenge
In this module, we learned how to approach and solve regression problems using linear regression models. Throughout the module, you worked on a house price dataset from Kaggle. In this challenge, you will keep working on this dataset.

The scenario
The housing market is one of the most crucial parts of the economy for every country. Purchasing a home is one of the primary ways to build wealth and savings for people. In this respect, predicting prices in the housing market is a very central topic in economic and financial circles.

The house price dataset from Kaggle includes several features of the houses along with their sale prices at the time they are sold. So far, in this module, you built and implemented some models using this dataset.

In this challenge, you are required to improve your model with respect to its prediction performance.

To complete this challenge, submit a Jupyter notebook containing your solutions to the following tasks.

Steps
1. Load the houseprices data from Thinkful's database.

2. Do data cleaning, exploratory data analysis, and feature engineering. You can use your previous work in this module. But make sure that your work is satisfactory.

3. Now, split your data into train and test sets where 20% of the data resides in the test set.

4. Build several linear regression models including Lasso, Ridge, or ElasticNet and train them in the training set. Use k-fold cross-validation to select the best hyperparameters if your models include one!

5. Evaluate your best model on the test set.

6. So far, you have only used the features in the dataset. However, house prices can be affected by many factors like economic activity and the interest rates at the time they are sold. So, try to find some useful factors that are not included in the dataset. Integrate these factors into your model and assess the prediction performance of your model. Discuss the implications of adding these external variables into your model.


### 1. Load the houseprices data from Thinkful's database.

In [375]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
import statsmodels.api as sm
from statsmodels.tools.eval_measures import mse, rmse
from sqlalchemy import create_engine

import math
import seaborn as sns

import scipy.stats as stats

# Display preferences.
%matplotlib inline
pd.options.display.float_format = '{:.3f}'.format

import warnings
warnings.filterwarnings(action="ignore")

In [376]:
postgres_user = 'dsbc_student'
postgres_pw = '7*.8G9QH21'
postgres_host = '142.93.121.174'
postgres_port = '5432'
postgres_db = 'houseprices'

In [None]:
engine = create_engine('postgresql://{}:{}@{}:{}/{}'.format(
    postgres_user, postgres_pw, postgres_host, postgres_port, postgres_db))
df = pd.read_sql_query('select * from houseprices',con=engine)

# no need for an open connection, as we're only doing a single query
engine.dispose()




### 2. Do data cleaning, exploratory data analysis, and feature engineering. You can use your previous work in this module. But make sure that your work is satisfactory.


In [None]:
# Drop every column that contains missing values (we don't need that many features).
df_drop = df.dropna(axis=1)

# In this regression task, keep only the numeric features (drop object columns).
df_drop = df_drop.select_dtypes(exclude='object')

In [None]:
# check correlation using heatmap
corr_matrix = np.abs(df_drop.corr())
plt.figure(figsize=(20,20))
sns.heatmap(corr_matrix, square=True, annot=True, linewidths=.5)
plt.show()

##### When the correlation between a feature and the target variable is too low, that feature is unlikely to help explain the target variable in our linear regression model, so we can drop it (the code below keeps features whose absolute correlation with the target exceeds 0.3).
We would also like to avoid collinearity between features, since it can make the coefficient estimates unstable, so we drop one feature from each highly correlated pair.

# algorithm:

- 1. Find feature pairs whose correlation is higher than a certain threshold (here, > 0.5).
- 2. Within each such pair, drop the feature that has the lower correlation coefficient (cc) with the target.
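
The two steps above can be sketched as a small standalone function. This is a minimal restatement on synthetic data, not the notebook's exact loop: the function name `prune_features`, its parameter names, and the demo columns `a` to `d` are all made up for illustration.

```python
import numpy as np
import pandas as pd


def prune_features(df, target, min_target_corr=0.3, max_pair_corr=0.5):
    """Return the feature names that survive both correlation filters."""
    corr = df.corr().abs()
    target_corr = corr[target].drop(target)
    # Keep features correlated enough with the target, weakest first.
    kept = list(target_corr[target_corr > min_target_corr].sort_values().index)
    while True:
        offender = None
        for name in kept:  # weakest remaining feature is checked first
            others = corr.loc[name, [k for k in kept if k != name]]
            if (others > max_pair_corr).any():
                offender = name
                break
        if offender is None:
            return kept
        # Drop the weakest feature that is part of a collinear pair.
        kept.remove(offender)


# Synthetic demo data (assumption: these columns are not from the dataset).
rng = np.random.default_rng(0)
n = 500
a = rng.normal(size=n)
b = a + 0.5 * rng.normal(size=n)  # strongly collinear with a
c = rng.normal(size=n)
d = rng.normal(size=n)            # unrelated to the target
target = 2 * a + 1.5 * c + 0.5 * rng.normal(size=n)
demo = pd.DataFrame({"a": a, "b": b, "c": c, "d": d, "target": target})

kept = prune_features(demo, "target")
print(kept)
```

Because pairwise correlations do not change when another column is dropped, the sketch computes the correlation matrix once instead of recomputing it after every drop.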

In [None]:
df_drop_2 = df_drop.copy()

corr_matrix = np.abs(df_drop_2.corr())
col_index_list = []
for (index, val) in enumerate(corr_matrix.iloc[:,-1]):
#     print (index, val)
    if val > 0.3 :
        col_index_list.append(index)
col_index_list

df_drop_2 = df_drop_2.iloc[:, col_index_list].copy()

corr_matrix = np.abs(df_drop_2.corr())
plt.figure(figsize=(10,10))
sns.heatmap(corr_matrix, square=True, annot=True, linewidths=.5)
plt.show()

In [None]:
# sort columns by the correlation coefficient between each feature and the target,
# so that the algorithm is easier to apply
df_drop_3 = df_drop_2.copy()
corr_matrix = np.abs(df_drop_3.corr())
s_sort = corr_matrix.iloc[:,-1].sort_values()
df_drop_3 = df_drop_3.loc[:,s_sort.index]

corr_matrix = np.abs(df_drop_3.corr())
plt.figure(figsize=(10,10))
sns.heatmap(corr_matrix, square=True, annot=True, linewidths=.5)
plt.show()

In [None]:
# drop columns while the collinearity between any two features is higher than the threshold;
# because the columns are already sorted, the feature with the lower cc with the target is dropped first

df_drop_4 = df_drop_3.copy()
corr_thres = 0.5
count = 15  # safety cap on the number of passes of the outer loop


corr_matrix = np.abs(df_drop_4.corr())
corr_matrix.iloc[:-1,:-1][np.abs(corr_matrix)==1] = -10
while((corr_matrix.iloc[:-1,:-1]>corr_thres).any(axis=None)   and count >0):
    col_current = 0
    while col_current < len(corr_matrix):        
        if (corr_matrix.iloc[col_current,:-1]>corr_thres).any():
#             print("current col: ", col_current)
            df_drop_4.drop(columns=df_drop_4.columns[col_current], axis=1, inplace=True)
            corr_matrix = np.abs(df_drop_4.corr())
            corr_matrix.iloc[:-1,:-1][np.abs(corr_matrix)==1] = -10  
            col_current = len(corr_matrix)
        else:        
            col_current +=1
#             print("current col: ", col_current)
                      
    count -=1
#     print("count is: ", count)
#     print(df_drop_4.columns)

# check the cc again
corr_matrix = np.abs(df_drop_4.corr())
plt.figure(figsize=(10,10))
sns.heatmap(corr_matrix, square=True, annot=True, linewidths=.5)
plt.show()

### We will use the remaining variables as the features to feed into the linear regression model


In [None]:
# pick the features
# X = df[['openporchsf', 'wooddecksf', 'fireplaces', 'overallqual']]
X = df_drop_4
X = X.drop(columns=['saleprice'])
Y = df['saleprice']
# Y = np.log1p(Y)

In [None]:
X.head()
# X.columns

### 3. Now, split your data into train and test sets where 20% of the data resides in the test set.

In [None]:
# split
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = 0.2, random_state = 465)

print("The number of observations in training set is {}".format(X_train.shape[0]))
print("The number of observations in test set is {}".format(X_test.shape[0]))

### 4. Build several linear regression models including Lasso, Ridge, or ElasticNet and train them in the training set. Use k-fold cross-validation to select the best hyperparameters if your models include one!
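
As a conceptual aid, k-fold selection of a penalty strength can be written out by hand. The sketch below uses a closed-form ridge fit in plain NumPy on synthetic data. It only illustrates the idea behind the `*CV` estimators used in the cells that follow; scikit-learn's internals differ, and the helper names `ridge_fit` and `kfold_select_alpha` are hypothetical.

```python
import numpy as np


def ridge_fit(X, y, alpha):
    """Closed-form ridge fit with an unpenalized intercept."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    coef = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)
    return coef, y_mean - x_mean @ coef


def kfold_select_alpha(X, y, alphas, k=5, seed=0):
    """Return the alpha with the lowest mean validation MSE over k folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    mean_mse = []
    for alpha in alphas:
        fold_mse = []
        for i in range(k):
            val_idx = folds[i]
            train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
            coef, intercept = ridge_fit(X[train_idx], y[train_idx], alpha)
            preds = X[val_idx] @ coef + intercept
            fold_mse.append(np.mean((y[val_idx] - preds) ** 2))
        mean_mse.append(float(np.mean(fold_mse)))
    return alphas[int(np.argmin(mean_mse))], mean_mse


# Synthetic demo data (assumption: not the house price data).
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(200, 3))
y_demo = 10 + X_demo @ np.array([3.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=200)

alpha_grid = [1e-3, 1e-1, 1e1, 1e5]
best_alpha, cv_mse = kfold_select_alpha(X_demo, y_demo, alpha_grid)
print(best_alpha, cv_mse)
```

Each candidate alpha is scored only on folds the model was not fitted on, which is why the test set can stay untouched until the final evaluation.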

In [None]:
from sklearn.linear_model import LinearRegression, LassoCV, RidgeCV, ElasticNetCV

In [None]:
# start with OLS    

def lrm_myfun(X_train, X_test, y_train, y_test): 
    lrm = LinearRegression()

    lrm.fit(X_train, y_train)


    # We are making predictions here
    y_preds_train = lrm.predict(X_train)
    y_preds_test = lrm.predict(X_test)


    print("-----OLS Training set statistics-----")
    print("constant: {}, coefficient: {}".format(lrm.intercept_, lrm.coef_))
    print("R-squared of the model in training set is: {}".format(lrm.score(X_train, y_train)))
    print("-----OLS Test set statistics-----")
    print("R-squared of the model in test set is: {}".format(lrm.score(X_test, y_test)))
    print("Mean absolute error of the prediction is: {}".format(mean_absolute_error(y_test, y_preds_test)))
    print("Mean squared error of the prediction is: {}".format(mse(y_test, y_preds_test)))
    print("Root mean squared error of the prediction is: {}".format(rmse(y_test, y_preds_test)))
    print("Mean absolute percentage error of the prediction is: {}".format(np.mean(np.abs((y_test - y_preds_test) / y_test)) * 100))

    # make a scatter plot to compare predicted and actual values

    plt.scatter(y_test, y_preds_test)
    plt.plot(y_test, y_test, color="red")
    plt.xlabel("true values")
    plt.ylabel("predicted values")
    plt.title("OLS: true and predicted values")
    plt.show()

    
    
lrm_myfun(X_train, X_test, y_train, y_test)

In [None]:
# lasso regression

# making a list of alphas(lambdas)
alphas = [np.power(10.0,p) for p in np.arange(-10,40,1)]

# split
# X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = 0.2, random_state = 465)

def lasso_myfun(X_train, X_test, y_train, y_test, alphas):

    lasso_cv = LassoCV(alphas=alphas, cv=5)

    lasso_cv.fit(X_train, y_train)
    
    # We are making predictions here
    y_preds_train = lasso_cv.predict(X_train)
    y_preds_test = lasso_cv.predict(X_test)

    # note: lasso_cv.alpha_ returns the best alpha
    print("-----Lasso Training set statistics-----")
    print("constant: {}, coefficient: {}".format(lasso_cv.intercept_, lasso_cv.coef_))
    print("Best alpha value is: {}".format(lasso_cv.alpha_))
    print("R-squared of the model in training set is: {}".format(lasso_cv.score(X_train, y_train)))
    print("-----Lasso Test set statistics-----")
    print("R-squared of the model in test set is: {}".format(lasso_cv.score(X_test, y_test)))
    print("Mean absolute error of the prediction is: {}".format(mean_absolute_error(y_test, y_preds_test)))
    print("Mean squared error of the prediction is: {}".format(mse(y_test, y_preds_test)))
    print("Root mean squared error of the prediction is: {}".format(rmse(y_test, y_preds_test)))
    print("Mean absolute percentage error of the prediction is: {}".format(np.mean(np.abs((y_test - y_preds_test) / y_test)) * 100))



    # make a scatter plot to compare predicted and actual values

    plt.scatter(y_test, y_preds_test)
    plt.plot(y_test, y_test, color="red")
    plt.xlabel("true values")
    plt.ylabel("predicted values")
    plt.title("Lasso: true and predicted values")
    plt.show()
    
lasso_myfun(X_train, X_test, y_train, y_test, alphas)

In [None]:
# Ridge regression + cross-validation (CV)


def ridge_myfun(X_train, X_test, y_train, y_test, alphas):
    ridge_cv = RidgeCV(alphas=alphas, cv=5)

    ridge_cv.fit(X_train, y_train)
    

    # We are making predictions here
    y_preds_train = ridge_cv.predict(X_train)
    y_preds_test = ridge_cv.predict(X_test)

    # note: ridge_cv.alpha_ returns the best alpha
    print("-----Ridge Training set statistics-----")
    print("constant: {}, coefficient: {}".format(ridge_cv.intercept_, ridge_cv.coef_))
    print("Best alpha value is: {}".format(ridge_cv.alpha_))
    print("R-squared of the model in training set is: {}".format(ridge_cv.score(X_train, y_train)))
    print("-----Ridge Test set statistics-----")
    print("R-squared of the model in test set is: {}".format(ridge_cv.score(X_test, y_test)))
    print("Mean absolute error of the prediction is: {}".format(mean_absolute_error(y_test, y_preds_test)))
    print("Mean squared error of the prediction is: {}".format(mse(y_test, y_preds_test)))
    print("Root mean squared error of the prediction is: {}".format(rmse(y_test, y_preds_test)))
    print("Mean absolute percentage error of the prediction is: {}".format(np.mean(np.abs((y_test - y_preds_test) / y_test)) * 100))

    # make a scatter plot to compare predicted and actual values

    plt.scatter(y_test, y_preds_test)
    plt.plot(y_test, y_test, color="red")
    plt.xlabel("true values")
    plt.ylabel("predicted values")
    plt.title("Ridge: true and predicted values")
    plt.show()
    
ridge_myfun(X_train, X_test, y_train, y_test, alphas)


In [None]:
# ElasticNet


def elasticnet_myfun(X_train, X_test, y_train, y_test, alphas):
    elasticnet_cv = ElasticNetCV(alphas=alphas, cv=5)

    elasticnet_cv.fit(X_train, y_train)
    

    # We are making predictions here
    y_preds_train = elasticnet_cv.predict(X_train)
    y_preds_test = elasticnet_cv.predict(X_test)

    # note: elasticnet_cv.alpha_ returns the best alpha
    print("-----ElasticNet Training set statistics-----")
    print("constant: {}, coefficient: {}".format(elasticnet_cv.intercept_, elasticnet_cv.coef_))
    print("Best alpha value is: {}".format(elasticnet_cv.alpha_))
    print("R-squared of the model in training set is: {}".format(elasticnet_cv.score(X_train, y_train)))
    print("-----ElasticNet Test set statistics-----")
    print("R-squared of the model in test set is: {}".format(elasticnet_cv.score(X_test, y_test)))
    print("Mean absolute error of the prediction is: {}".format(mean_absolute_error(y_test, y_preds_test)))
    print("Mean squared error of the prediction is: {}".format(mse(y_test, y_preds_test)))
    print("Root mean squared error of the prediction is: {}".format(rmse(y_test, y_preds_test)))
    print("Mean absolute percentage error of the prediction is: {}".format(np.mean(np.abs((y_test - y_preds_test) / y_test)) * 100))

    # make a scatter plot to compare predicted and actual values

    plt.scatter(y_test, y_preds_test)
    plt.plot(y_test, y_test, color="red")
    plt.xlabel("true values")
    plt.ylabel("predicted values")
    plt.title("ElasticNet: true and predicted values")
    plt.show()

elasticnet_myfun(X_train, X_test, y_train, y_test, alphas)

### 5. Evaluate your best model on the test set.
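
To compare the models side by side, the four per-model functions above could be folded into one helper that returns a table of metrics. The sketch below assumes scikit-learn is available and runs on synthetic data. The name `evaluate_models`, the column names of the result, and the shorter alpha grid are illustrative choices, not part of the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import ElasticNetCV, LassoCV, LinearRegression, RidgeCV
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split


def evaluate_models(X_train, X_test, y_train, y_test, alphas):
    """Fit each model on the training set and collect its metrics."""
    models = {
        "OLS": LinearRegression(),
        "Lasso": LassoCV(alphas=alphas, cv=5),
        "Ridge": RidgeCV(alphas=alphas, cv=5),
        "ElasticNet": ElasticNetCV(alphas=alphas, cv=5),
    }
    rows = []
    for name, model in models.items():
        model.fit(X_train, y_train)
        preds = model.predict(X_test)
        rows.append({
            "model": name,
            # OLS has no penalty, so it has no alpha_ attribute.
            "best_alpha": getattr(model, "alpha_", np.nan),
            "r2_train": model.score(X_train, y_train),
            "r2_test": r2_score(y_test, preds),
            "mae": mean_absolute_error(y_test, preds),
            "rmse": np.sqrt(mean_squared_error(y_test, preds)),
            "mape_pct": np.mean(np.abs((y_test - preds) / y_test)) * 100,
        })
    return pd.DataFrame(rows).set_index("model")


# Synthetic demo data (assumption: not the house price data).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 3))
y_demo = 100 + X_demo @ np.array([3.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=300)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.2, random_state=465)

summary = evaluate_models(Xtr, Xte, ytr, yte, alphas=np.logspace(-3, 3, 13))
print(summary)
```

Returning a DataFrame rather than printing inside each function makes it easy to sort the models by any metric before picking the best one.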

In [None]:
# summarize the results

# pick the features
X = df[['openporchsf', 'wooddecksf', 'fireplaces', 'overallqual']]
Y = df['saleprice']

# split
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = 0.2, random_state = 465)

# making a list of alphas(lambdas)
alphas = [np.power(10.0,p) for p in np.arange(-10,40,1)]

lrm_myfun(X_train, X_test, y_train, y_test)
lasso_myfun(X_train, X_test, y_train, y_test, alphas)
ridge_myfun(X_train, X_test, y_train, y_test, alphas)
elasticnet_myfun(X_train, X_test, y_train, y_test, alphas)

### evaluation summary

- As the results show, with 'openporchsf', 'wooddecksf', 'fireplaces', and 'overallqual' as the features and 'saleprice' as the target, Ridge regression is the best-performing model (more specifically, with alpha = 10.0).
- Among other statistics, the Ridge regression model has the lowest MAPE value (21.427533081169933%) on the test set and the highest R-squared score (0.6413897187301367 on the test set, 0.6810531255539183 on the training set).
- The estimated coefficients (including the constant term) are: constant: -81984.59555811743, coefficient: [   87.78736864    80.28724956 19299.6517318  39315.75092182]
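
For reference, the MAPE figure quoted above is the mean of the absolute errors divided by the true values, expressed as a percentage. A minimal sketch with three made-up prices (not taken from the dataset) shows the arithmetic; the function name `mape` is just a convenience wrapper around the expression used in the cells above.

```python
import numpy as np


def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100


# Made-up values for illustration: errors of 10%, 10% and 0%.
demo_true = np.array([100_000.0, 200_000.0, 400_000.0])
demo_pred = np.array([110_000.0, 180_000.0, 400_000.0])
demo_mape = mape(demo_true, demo_pred)
print(demo_mape)
```

Because each error is scaled by the true price, MAPE weights a 10,000 miss on a cheap house more heavily than the same miss on an expensive one, which RMSE does not.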

#### 5b extra comment, iterative step, feature engineering

- Before we jump into searching for new features to include in the model, we can make better use of the features we already have.
- For instance, the target variable has a skewed distribution. We can try to correct that with a log1p transformation (i.e. log1p(x) = log(1 + x)) and hope that it improves the performance.
- And it does!

### notes:
- The illustration uses the ElasticNet model.
- Note that, in the evaluation part, we use the transformed data to calculate the R-squared value, but the inverse-transformed data (original scale) to calculate MAPE (mean absolute percentage error).
- Also note that the new model, fitted on the transformed target, has two very low coefficient values, associated with 'openporchsf' and 'wooddecksf'. When those two features are excluded from the model, MAPE worsens by less than one percentage point. This is good news when a less complex model is preferred over one with slightly better predictive capability.
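
The effect of the transformation, and the reason the errors are reported after inverting it, can be checked on a small synthetic example. The sketch below uses a lognormal sample as a stand-in for a right-skewed price distribution. The sample parameters and the helper `skewness` are assumptions made for illustration only.

```python
import numpy as np


def skewness(values):
    """Sample skewness: the third standardized moment."""
    values = np.asarray(values, dtype=float)
    centered = values - values.mean()
    return np.mean(centered ** 3) / np.mean(centered ** 2) ** 1.5


# Synthetic right-skewed "prices" (assumption: not the real sale prices).
rng = np.random.default_rng(0)
prices = rng.lognormal(mean=12.0, sigma=0.4, size=1000)

log_prices = np.log1p(prices)      # transformed target used for fitting
recovered = np.expm1(log_prices)   # inverse transform back to the original scale

skew_raw = skewness(prices)
skew_log = skewness(log_prices)
print(skew_raw, skew_log)
```

`np.expm1` is the exact inverse of `np.log1p`, so predictions made on the log scale can be mapped back to the original scale before computing MAE, RMSE, or MAPE.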

In [None]:
### modification (feature engineering)

Y = df['saleprice']
sns.histplot(Y, kde=True)

print(stats.describe(Y))

In [None]:
# pick the features
X = df[['openporchsf', 'wooddecksf', 'fireplaces', 'overallqual']]
Y = df['saleprice']
# log1p transforming Y
Y = np.log1p(Y)

# split
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = 0.2, random_state = 465)

print("The number of observations in training set is {}".format(X_train.shape[0]))
print("The number of observations in test set is {}".format(X_test.shape[0]))

# ElasticNet
elasticnet_cv = ElasticNetCV(alphas=alphas, cv=5)

elasticnet_cv.fit(X_train, y_train)
print("constant: {}, coefficient: {}".format(elasticnet_cv.intercept_, elasticnet_cv.coef_))

# We are making predictions here
y_preds_train = elasticnet_cv.predict(X_train)
y_preds_test = elasticnet_cv.predict(X_test)

print("Best alpha value is: {}".format(elasticnet_cv.alpha_))
print("R-squared of the model in training set is: {}".format(elasticnet_cv.score(X_train, y_train)))
print("-----Test set statistics-----")
print("R-squared of the model in test set is: {}".format(elasticnet_cv.score(X_test, y_test)))

y_train_invtrans = np.expm1(y_train)
y_test_invtrans = np.expm1(y_test)
y_preds_train_invtrans = np.expm1(y_preds_train)
y_preds_test_invtrans = np.expm1(y_preds_test)


print("-----Test set statistics (compare with original)-----")
print("Mean absolute error of the prediction is: {}".format(mean_absolute_error(y_test_invtrans, y_preds_test_invtrans)))
print("Mean squared error of the prediction is: {}".format(mse(y_test_invtrans, y_preds_test_invtrans)))
print("Root mean squared error of the prediction is: {}".format(rmse(y_test_invtrans, y_preds_test_invtrans)))
print("Mean absolute percentage error of the prediction is: {}".format(np.mean(np.abs((y_test_invtrans - y_preds_test_invtrans) / y_test_invtrans)) * 100))

plt.figure(figsize=(10,5))
plt.subplot(1,2,1)
plt.scatter(y_test_invtrans, y_preds_test_invtrans)
plt.plot(y_test_invtrans, y_test_invtrans, color="red")
plt.xlabel("true values")
plt.ylabel("predicted values")
plt.title("ElasticNet: true and predicted values (original)")


plt.subplot(1,2,2)
plt.scatter(y_test, y_preds_test)
plt.plot(y_test, y_test, color="red")
plt.xlabel("true values")
plt.ylabel("predicted values")
plt.title("ElasticNet: true and predicted values (transformed)")


plt.show()

In [None]:
# trial 2: keep only two features

# pick the features
X = df[['fireplaces', 'overallqual']]
Y = df['saleprice']
Y = np.log1p(Y)

# split
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = 0.2, random_state = 465)

print("The number of observations in training set is {}".format(X_train.shape[0]))
print("The number of observations in test set is {}".format(X_test.shape[0]))

# ElasticNet
elasticnet_cv = ElasticNetCV(alphas=alphas, cv=5)

elasticnet_cv.fit(X_train, y_train)
print("constant: {}, coefficient: {}".format(elasticnet_cv.intercept_, elasticnet_cv.coef_))

# We are making predictions here
y_preds_train = elasticnet_cv.predict(X_train)
y_preds_test = elasticnet_cv.predict(X_test)

print("Best alpha value is: {}".format(elasticnet_cv.alpha_))
print("R-squared of the model in training set is: {}".format(elasticnet_cv.score(X_train, y_train)))
print("-----Test set statistics-----")
print("R-squared of the model in test set is: {}".format(elasticnet_cv.score(X_test, y_test)))

y_train_invtrans = np.expm1(y_train)
y_test_invtrans = np.expm1(y_test)
y_preds_train_invtrans = np.expm1(y_preds_train)
y_preds_test_invtrans = np.expm1(y_preds_test)


print("-----Test set statistics (compare with original)-----")
print("Mean absolute error of the prediction is: {}".format(mean_absolute_error(y_test_invtrans, y_preds_test_invtrans)))
print("Mean squared error of the prediction is: {}".format(mse(y_test_invtrans, y_preds_test_invtrans)))
print("Root mean squared error of the prediction is: {}".format(rmse(y_test_invtrans, y_preds_test_invtrans)))
print("Mean absolute percentage error of the prediction is: {}".format(np.mean(np.abs((y_test_invtrans - y_preds_test_invtrans) / y_test_invtrans)) * 100))

plt.figure(figsize=(10,5))
plt.subplot(1,2,1)
plt.scatter(y_test_invtrans, y_preds_test_invtrans)
plt.plot(y_test_invtrans, y_test_invtrans, color="red")
plt.xlabel("true values")
plt.ylabel("predicted values")
plt.title("ElasticNet: true and predicted values (original)")


plt.subplot(1,2,2)
plt.scatter(y_test, y_preds_test)
plt.plot(y_test, y_test, color="red")
plt.xlabel("true values")
plt.ylabel("predicted values")
plt.title("ElasticNet: true and predicted values (transformed)")


plt.show()

### 6. So far, you have only used the features in the dataset. However, house prices can be affected by many factors like economic activity and the interest rates at the time they are sold. So, try to find some useful factors that are not included in the dataset. Integrate these factors into your model and assess the prediction performance of your model. Discuss the implications of adding these external variables into your model.

### potential solution
- Interest rates can be an important factor in house sale prices. But for individual buyers, what matters is the mortgage interest rate one can obtain, rather than the Federal Reserve interest rate. Note that the personal mortgage interest rate can be considered an interaction of variables related to house type (e.g. townhouse, condominium, single-family), credit, profile, property use, and so on.
- While it is very hard to extract the personal mortgage interest rate for each sale, we can focus on some key factors that contribute to it. Here, let's use "quality" and "house size". Note that "quality" can be indicated by the "OverallQual" and/or "OverallCond" variables, and "house size" can be calculated as "totalbsmtsf" + "firstflrsf" + "secondflrsf".
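
For comparison, if a market-level rate series were available, it could be attached to each sale by its sale date. The sketch below is purely illustrative: the column names `yrsold` and `mosold` are assumed to exist in the dataset, the name `mortgage_rate` is hypothetical, and every number in both tables is an invented placeholder, not real data.

```python
import pandas as pd

# Placeholder sales rows; `yrsold` and `mosold` are assumed column names.
sales = pd.DataFrame({
    "yrsold": [2007, 2007, 2008, 2008],
    "mosold": [5, 11, 5, 5],
    "saleprice": [200000, 150000, 180000, 210000],
})

# Placeholder rate table: one invented rate per (year, month).
rates = pd.DataFrame({
    "yrsold": [2007, 2007, 2008],
    "mosold": [5, 11, 5],
    "mortgage_rate": [1.0, 2.0, 3.0],
})

# Left join keeps every sale; validate guards against duplicate rate rows.
merged = sales.merge(rates, on=["yrsold", "mosold"], how="left",
                     validate="many_to_one")
print(merged)
```

A variable joined this way is identical for all houses sold in the same month, so it can only explain price variation over time, not differences between houses sold in the same period.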

In [None]:
# # new features
# X['totalsf'] = df['totalbsmtsf'] + df['firstflrsf'] + df['secondflrsf']
# X['int_over_sf'] = X['totalsf'] * X['overallqual']

# take a copy so that adding a column does not write to a slice of df
X = df[['fireplaces', 'overallqual']].copy()
X['int_over_sf'] = (df['totalbsmtsf'] + df['firstflrsf'] + df['secondflrsf']) * df['overallqual']

In [None]:
X.head()

In [None]:

# reuse the feature matrix X built above (fireplaces, overallqual, int_over_sf)
Y = df['saleprice']
Y = np.log1p(Y)

# split
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = 0.2, random_state = 465)

print("The number of observations in training set is {}".format(X_train.shape[0]))
print("The number of observations in test set is {}".format(X_test.shape[0]))

# ElasticNet
elasticnet_cv = ElasticNetCV(alphas=alphas, cv=5)

elasticnet_cv.fit(X_train, y_train)
print("constant: {}, coefficient: {}".format(elasticnet_cv.intercept_, elasticnet_cv.coef_))

# We are making predictions here
y_preds_train = elasticnet_cv.predict(X_train)
y_preds_test = elasticnet_cv.predict(X_test)

print("Best alpha value is: {}".format(elasticnet_cv.alpha_))
print("R-squared of the model in training set is: {}".format(elasticnet_cv.score(X_train, y_train)))
print("-----Test set statistics-----")
print("R-squared of the model in test set is: {}".format(elasticnet_cv.score(X_test, y_test)))

y_train_invtrans = np.expm1(y_train)
y_test_invtrans = np.expm1(y_test)
y_preds_train_invtrans = np.expm1(y_preds_train)
y_preds_test_invtrans = np.expm1(y_preds_test)


print("-----Test set statistics (compare with original)-----")
print("Mean absolute error of the prediction is: {}".format(mean_absolute_error(y_test_invtrans, y_preds_test_invtrans)))
print("Mean squared error of the prediction is: {}".format(mse(y_test_invtrans, y_preds_test_invtrans)))
print("Root mean squared error of the prediction is: {}".format(rmse(y_test_invtrans, y_preds_test_invtrans)))
print("Mean absolute percentage error of the prediction is: {}".format(np.mean(np.abs((y_test_invtrans - y_preds_test_invtrans) / y_test_invtrans)) * 100))

plt.figure(figsize=(10,5))
plt.subplot(1,2,1)
plt.scatter(y_test_invtrans, y_preds_test_invtrans)
plt.plot(y_test_invtrans, y_test_invtrans, color="red")
plt.xlabel("true values")
plt.ylabel("predicted values")
plt.title("ElasticNet: true and predicted values (original)")


plt.subplot(1,2,2)
plt.scatter(y_test, y_preds_test)
plt.plot(y_test, y_test, color="red")
plt.xlabel("true values")
plt.ylabel("predicted values")
plt.title("ElasticNet: true and predicted values (transformed)")


plt.show()

### summary

- After including the new feature "int_over_sf", the performance improved (i.e. MAPE = 15.923998640777748, down from 17.968835375052954).
- The final model has three features: "fireplaces", "overallqual", and "int_over_sf". The former two come directly from the dataset, while the last one is an interaction variable with which we try to model the effect of the personal interest rate.
- Note that the "int_over_sf" feature is calculated from features we already have in the dataset (i.e. X['int_over_sf'] = (df['totalbsmtsf'] + df['firstflrsf'] + df['secondflrsf']) * df['overallqual']). Including such an interaction variable improved the model performance, which indicates that it is sometimes worth exploring the data we already have before feeding "new" data into the model.