# Model Training, Evaluation and Conclusions

The data used will be the cleaned and featurised data from [Notebook 2](02_Preprocessing_and_Feature_Engineering.ipynb).
Candidate combinations of features were also defined in Notebook 2 and saved as a dataframe [here](../features/features.csv).

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, Lasso, LassoCV, Ridge, RidgeCV, ElasticNet, ElasticNetCV
from sklearn.compose import ColumnTransformer, TransformedTargetRegressor
from sklearn.metrics import mean_squared_error

## Load Data

In [2]:
df = pd.read_csv('../datasets/f_train.csv')

In [3]:
fea_df = pd.read_csv('../features/features.csv')

In [4]:
df.head(2)

Unnamed: 0,Lot Frontage,Lot Area,Overall Qual,Overall Cond,Year Built,Year Remod/Add,Mas Vnr Area,Total Bsmt SF,1st Flr SF,2nd Flr SF,...,Attached or BuiltIn Garage,Finished Garage,Fully Paved Drive,New Sale,Has Alley Access,Total SF,1.5P Gr Liv Area,1.5P Total SF,P3 Overall Qual,SalePrice
0,0.0,13517,6,8,1976,2005,289.0,725.0,725,754,...,1,1,1,0,0,2204.0,56879.040419,103470.699543,216,130500
1,43.0,11492,7,5,1996,1997,132.0,913.0,913,1209,...,1,1,1,0,0,3035.0,97750.29334,167200.681443,343,220000


In [5]:
df.shape

(2049, 57)

In [6]:
# df.isna().sum().sum()

In [7]:
# df.dtypes.unique()

In [8]:
# df.columns

In [9]:
#target variable
tv = 'SalePrice'

In [10]:
fea_df.head()

Unnamed: 0,features_max,features_max_powerless,features_drop_weak,features_lite,features_min
0,Lot Frontage,Lot Frontage,Lot Frontage,TotRms AbvGrd,P3 Overall Qual
1,Lot Area,Lot Area,Lot Area,Mas Vnr Area,1.5P Total SF
2,Overall Cond,Overall Qual,Year Built,Bath Log,
3,Year Built,Overall Cond,Year Remod/Add,Garage Area,
4,Year Remod/Add,Year Built,Mas Vnr Area,Year Built,


In [11]:
fea = {}
for col in fea_df.columns:
    fea[col] = fea_df[col].dropna().to_list()    

In [12]:
fea['features_min']

['P3 Overall Qual', '1.5P Total SF']

In [13]:
fea.keys()

dict_keys(['features_max', 'features_max_powerless', 'features_drop_weak', 'features_lite', 'features_min'])

In [14]:
binary_vars = list(df.columns[df.isin([0,1]).all()])

Model Results DataFrame

In [15]:
results = pd.DataFrame(columns=['model_group', 'model_type', 'model_params',
                               'train_R2', 'train_RMSE', 'val_R2', 'val_RMSE', 'comments'])

## Train-Val Split
Separate out a holdout validation (Val) set from the start, so that every model and combination of features is evaluated fairly on the same unseen data. The term 'Val' is used instead of 'test' to avoid confusion with 'test.csv'.

In [16]:
X = df.drop(columns=tv)

In [17]:
y = df[tv]

In [18]:
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=42, test_size=0.2)

In [19]:
# keep references to the full train/val frames; X_train and X_val are re-bound to column subsets for each feature set
X_train_full, X_val_full = X_train, X_val

## Models

Each set of features will be modelled. Linear Regression, Lasso, Ridge and ElasticNet will be evaluated.

### Baselines

Key metric will be Root Mean Squared Error (RMSE).


#### Baselines from simplistic set
In [Notebook 3a](03a_Model_Baselines.ipynb), which used a tiny set of simplistic, untransformed features, predicting the mean price for all entries gave a baseline RMSE of 77081.23, and predicting the median price gave an RMSE of 79754.72. The best model in that scenario was Lasso (which outperformed Ridge and regular Linear Regression), with an RMSE of 38009.028.

Benchmark RMSE based on predicting the mean Sale Price for all entries:

In [20]:
just_y_means = np.full(y_val.shape, np.mean(y_val)) 
(mean_squared_error(y_val, just_y_means))**0.5

77233.34908246441

Benchmark RMSE based on predicting the median Sale Price for all entries:

In [21]:
just_y_meds = np.full(y_val.shape, np.median(y_val)) 
(mean_squared_error(y_val, just_y_meds))**0.5

79543.79434433526
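
The two benchmarks above take the mean and median from the validation targets themselves. A stricter variant, sketched below on the assumption that a baseline should only see training data, fits scikit-learn's `DummyRegressor` on the training targets and scores it on the validation targets. The tiny arrays are made-up illustration values, not the housing data, and `baseline_rmse` is a hypothetical helper.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error

# Made-up prices for illustration only (not taken from f_train.csv).
y_train_demo = np.array([100_000.0, 200_000.0, 300_000.0])
y_val_demo = np.array([150_000.0, 250_000.0])

# DummyRegressor ignores the features, so a placeholder column is enough.
X_train_demo = np.zeros((len(y_train_demo), 1))
X_val_demo = np.zeros((len(y_val_demo), 1))

def baseline_rmse(strategy):
    """RMSE on the validation targets of a constant learned from the training targets."""
    dummy = DummyRegressor(strategy=strategy).fit(X_train_demo, y_train_demo)
    return mean_squared_error(y_val_demo, dummy.predict(X_val_demo)) ** 0.5

mean_baseline = baseline_rmse("mean")
median_baseline = baseline_rmse("median")
print(mean_baseline, median_baseline)
```

With a random 80/20 split, the training and validation means are usually close, so this variant would be expected to give a similar figure to the one above.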

### 1. Models with features_max

Referred to as 'Model Group 1'.

Get Model Group 1 features.

In [22]:
features1 = fea['features_max']

In [23]:
bin_features1 = [f for f in features1 if f in binary_vars]
num_features1 = [f for f in features1 if f not in binary_vars]

In [24]:
X_train, X_val = X_train_full[features1], X_val_full[features1]

#### Model Group 1 Linear Regression 

In [25]:
linreg = LinearRegression()

linreg.fit(X_train, y_train)
linreg.score(X_train, y_train)

0.9173287383408084

In [26]:
mean_squared_error(y_train, linreg.predict(X_train), squared=False)

22935.16674801194

The same RMSE can also be obtained by taking the square root of the MSE while leaving the default squared=True. Writing '\** 0.5' makes it more explicit that this is the root of the MSE.

In [27]:
(mean_squared_error(y_train, linreg.predict(X_train)))**0.5

22935.16674801194

Model Group 1 Linear Regression performance with 5-fold cross-validation.

In [28]:
cross_val_score(linreg, X_train, y_train, cv=5).mean()

0.9040010255376176

In [29]:
(- cross_val_score(linreg, X_train, y_train, cv=5, scoring='neg_mean_squared_error').mean()) **0.5

24482.037007948464
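
The figure above is the square root of the mean fold MSE. scikit-learn also provides a `neg_root_mean_squared_error` scorer, which averages the per-fold RMSE instead; the two summaries are close but not identical. A small sketch on synthetic data (the `make_regression` dataset is an assumption for illustration, not the housing data):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression problem, for illustration only.
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=20.0, random_state=42)
model = LinearRegression()

# Option A: average the fold MSEs, then take the square root (as done in the cell above).
fold_mse = -cross_val_score(model, X_demo, y_demo, cv=5, scoring="neg_mean_squared_error")
rmse_of_mean_mse = fold_mse.mean() ** 0.5

# Option B: take the RMSE of each fold, then average.
fold_rmse = -cross_val_score(model, X_demo, y_demo, cv=5, scoring="neg_root_mean_squared_error")
mean_of_fold_rmse = fold_rmse.mean()

print(rmse_of_mean_mse, mean_of_fold_rmse)
```

Option A is never smaller than option B, since the square root of an average is at least the average of the square roots. Either is reasonable as long as the same one is used for every model being compared.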

Model Group 1 Linear Regression performance on validation set.

In [30]:
linreg.score(X_val, y_val)

0.9057719543483982

In [31]:
(mean_squared_error(y_val, linreg.predict(X_val)))**0.5

23708.002232701274

In [32]:
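# Note: train_R2 is the plain (non-CV) training score from In [25],
# while train_RMSE is the 5-fold cross-validated RMSE from In [29].
results = results.append({'model_group':1, 'model_type':'Linear Regression', 
                'train_R2': round(0.9173287383408084, 2), 
                'train_RMSE': round(24482.037007948464, 2),
                'val_R2': round(0.9057719543483982, 2), 
                'val_RMSE': round(23708.002232701274, 2)},
              ignore_index=True)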
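
`DataFrame.append` was deprecated in pandas 1.4 and removed in pandas 2.0, so the cell above only runs on older pandas versions. One possible replacement, sketched here with hypothetical names (`rows`, `results_demo`), is to collect each result as a dict in a list and build the frame once:

```python
import pandas as pd

columns = ['model_group', 'model_type', 'model_params',
           'train_R2', 'train_RMSE', 'val_R2', 'val_RMSE', 'comments']

# Collect one dict per fitted model; keys that are left out become NaN.
rows = []
rows.append({'model_group': 1, 'model_type': 'Linear Regression',
             'train_R2': 0.92, 'train_RMSE': 24482.04,
             'val_R2': 0.91, 'val_RMSE': 23708.0})
rows.append({'model_group': 1, 'model_type': 'Lasso', 'model_params': 'alpha=67.09',
             'train_R2': 0.91, 'train_RMSE': 23275.27,
             'val_R2': 0.9, 'val_RMSE': 24193.28})

# Build the results table in one go, with a fixed column order.
results_demo = pd.DataFrame(rows, columns=columns)
print(results_demo)
```

Building the frame once from a list also avoids copying the whole table on every append. `pd.concat([results, pd.DataFrame([row])], ignore_index=True)` is the closer drop-in equivalent if the table must grow cell by cell.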

In [33]:
dict(zip(features1, linreg.coef_))

{'Lot Frontage': 51.0674719437568,
 'Lot Area': 1.2552876731352764,
 'Overall Cond': 6435.986344416999,
 'Year Built': 299.2231032084188,
 'Year Remod/Add': 17.056164865735006,
 'Mas Vnr Area': 23.364921888473564,
 'Total Bsmt SF': -58.85301835420593,
 '1st Flr SF': -72.53866405577125,
 '2nd Flr SF': -73.06485811418278,
 'Low Qual Fin SF': -109.9890175341086,
 'Bedroom AbvGr': -3689.088035972781,
 'Kitchen AbvGr': -13954.06243454526,
 'TotRms AbvGrd': 1813.6838685933803,
 'Fireplaces': 5794.140459045523,
 'Garage Area': 16.91332836529297,
 'Wood Deck SF': 16.757908900635954,
 'Open Porch SF': 13.643361782413457,
 'Enclosed Porch': 16.98311347523247,
 '3Ssn Porch': 43.48751431797177,
 'Screen Porch': 78.8895494105575,
 'Pool Area': -6.864800501527277,
 'Misc Val': 0.9158201968797073,
 'Mo Sold': 85.57406791269308,
 'Yr Sold': 266.914439457163,
 'After 1999': 5096.147921644307,
 'PID 9': 5025.1067124229885,
 'Bath Log': 13155.046810444197,
 'Floating Village': 7810.474435011966,
 'Regula

#### Model Group 1 Lasso Regression 

In [34]:
numeric_transformer = Pipeline(steps=[('scaler', StandardScaler())])

preprocessor = ColumnTransformer(transformers=[
    ('bin', 'passthrough', bin_features1),
    ('num', numeric_transformer, num_features1)])

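# NOTE: l_alphas is defined but never passed to LassoCV below,
# which therefore searches its default alpha grid.
l_alphas = np.arange(0.0001, 0.50, 0.0025)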

lasso_cv = Pipeline(steps=[('preprocessor', preprocessor),
                       ('estimator', LassoCV())])

lasso_cv.fit(X_train, y_train)
lasso_cv.score(X_train, y_train)

0.9148587069884249

In [35]:
(mean_squared_error(y_train, lasso_cv.predict(X_train)))**0.5

23275.270658087706

In [36]:
lasso_cv['estimator'].alpha_

67.08901348430173

In [37]:
lasso_cv['estimator'].alphas_

array([6.70890135e+04, 6.25674385e+04, 5.83506026e+04, 5.44179672e+04,
       5.07503784e+04, 4.73299727e+04, 4.41400910e+04, 4.11651966e+04,
       3.83908001e+04, 3.58033887e+04, 3.33903601e+04, 3.11399616e+04,
       2.90412324e+04, 2.70839505e+04, 2.52585829e+04, 2.35562390e+04,
       2.19686273e+04, 2.04880153e+04, 1.91071917e+04, 1.78194309e+04,
       1.66184609e+04, 1.54984323e+04, 1.44538898e+04, 1.34797460e+04,
       1.25712562e+04, 1.17239956e+04, 1.09338376e+04, 1.01969335e+04,
       9.50969432e+03, 8.86877275e+03, 8.27104715e+03, 7.71360626e+03,
       7.19373501e+03, 6.70890135e+03, 6.25674385e+03, 5.83506026e+03,
       5.44179672e+03, 5.07503784e+03, 4.73299727e+03, 4.41400910e+03,
       4.11651966e+03, 3.83908001e+03, 3.58033887e+03, 3.33903601e+03,
       3.11399616e+03, 2.90412324e+03, 2.70839505e+03, 2.52585829e+03,
       2.35562390e+03, 2.19686273e+03, 2.04880153e+03, 1.91071917e+03,
       1.78194309e+03, 1.66184609e+03, 1.54984323e+03, 1.44538898e+03,
      

Model Group 1 Lasso Regression performance on validation set.

In [38]:
lasso_cv.score(X_val, y_val)

0.9018749587365382

In [39]:
(mean_squared_error(y_val, lasso_cv.predict(X_val)))**0.5

24193.28234326319

In [40]:
results = results.append({'model_group':1, 'model_type':'Lasso',
                          'model_params': f'alpha={round(67.08901348430173, 2)}',
                          'train_R2': round(0.9148587069884249, 2), 
                          'train_RMSE': round(23275.270658087706, 2),
                          'val_R2': round(0.9018749587365382, 2), 
                          'val_RMSE': round(24193.28234326319, 2)},
                          ignore_index=True)

Analysing the Lasso Coefficients

In [41]:
# bin_features1

In [42]:
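# The coefficients follow the ColumnTransformer output order (binary columns first, then numeric),
# not the order of features1.
# dict(zip(bin_features1 + num_features1, lasso_cv['estimator'].coef_))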

The Lasso coefficients indicate the relative weight of each feature in predicting the target after the Lasso penalty has been applied. Features with a coefficient of 0 can be interpreted as not important in the optimal Lasso Regression model. Here, information on very specific porch areas and on the area of the second floor appears not to be useful. This makes sense, as there is already information about porch area, and the second floor area is already included in the total area. 
The penalisation of the '1.5P Gr Liv Area' feature weight is likely due to its collinearity with the '1.5P Total SF' feature. 
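
When reading coefficients out of a pipeline like the one above, note that the `ColumnTransformer` emits its columns in transformer order (binary passthrough first, then scaled numeric), which can differ from the column order of the input frame. The sketch below uses made-up columns (`area`, `has_garage`, `age`) and a fixed `alpha` as assumptions for illustration. It aligns names with coefficients and lists the features that Lasso has set to zero.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Lasso
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Made-up data: only the binary column actually drives the target.
rng = np.random.default_rng(0)
n = 200
X_demo = pd.DataFrame({
    'area': rng.normal(1500, 300, n),
    'has_garage': rng.integers(0, 2, n),
    'age': rng.uniform(0, 50, n),
})
y_demo = 10 * X_demo['has_garage'] + rng.normal(0, 0.1, n)

bin_cols = ['has_garage']
num_cols = ['area', 'age']

preprocessor = ColumnTransformer(transformers=[
    ('bin', 'passthrough', bin_cols),
    ('num', StandardScaler(), num_cols)])
model = Pipeline(steps=[('preprocessor', preprocessor),
                        ('estimator', Lasso(alpha=0.01))])
model.fit(X_demo, y_demo)

# Names must follow the transformer order, not X_demo.columns.
ordered_names = bin_cols + num_cols
coefs = pd.Series(model['estimator'].coef_, index=ordered_names)
zeroed = coefs[coefs == 0].index.tolist()
print(coefs)
print(zeroed)
```

Zipping the coefficients with `X_demo.columns` instead would attach the large `has_garage` coefficient to `area`.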

#### Model Group 1 Ridge Regression 

In [43]:
numeric_transformer = Pipeline(steps=[('scaler', StandardScaler())])

preprocessor = ColumnTransformer(transformers=[
    ('bin', 'passthrough', bin_features1),
    ('num', numeric_transformer, num_features1)])

ridge_cv = Pipeline(steps=[('preprocessor', preprocessor),
                       ('estimator', RidgeCV())])

ridge_cv.fit(X_train, y_train)
ridge_cv.score(X_train, y_train)

0.9162954469831734

In [44]:
(mean_squared_error(y_train, ridge_cv.predict(X_train)))**0.5

23078.05266121885

In [45]:
ridge_cv['estimator'].alpha_

1.0

Model Group 1 Ridge Regression performance on validation set.

In [46]:
ridge_cv.score(X_val, y_val)

0.9044848727687812

In [47]:
(mean_squared_error(y_val, ridge_cv.predict(X_val)))**0.5

23869.369469851747

In [48]:
results = results.append({'model_group':1, 'model_type':'Ridge',
                          'model_params': f'alpha=1.0',
                          'train_R2': round(0.9162954469831734, 2), 
                          'train_RMSE': round(23078.05266121885, 2),
                          'val_R2': round(0.9044848727687812, 2), 
                          'val_RMSE': round(23869.369469851747, 2)},
                          ignore_index=True)

#### Model Group 1 ElasticNet Regression 

In [49]:
numeric_transformer = Pipeline(steps=[('scaler', StandardScaler())])

preprocessor = ColumnTransformer(transformers=[
    ('bin', 'passthrough', bin_features1),
    ('num', numeric_transformer, num_features1)])

en_cv = Pipeline(steps=[('preprocessor', preprocessor),
                       ('estimator', ElasticNetCV())])

en_cv.fit(X_train, y_train)
en_cv.score(X_train, y_train)

0.17881367543866655

In [50]:
(mean_squared_error(y_train, en_cv.predict(X_train)))**0.5

72284.5486147868

In [51]:
en_cv['estimator'].alpha_

134.1780269686034

In [52]:
en_cv['estimator'].l1_ratio_

0.5

Model Group 1 ElasticNet performance on validation set.

In [53]:
en_cv.score(X_val, y_val)

0.1783088767355988

In [54]:
(mean_squared_error(y_val, en_cv.predict(X_val)))**0.5

70009.85292315512
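
The ElasticNet scores above are far below those of the other Model Group 1 models. The selected alpha (134.18) is exactly twice the Lasso alpha (67.09), which is what one would expect if both searches stopped at the smallest alpha on their default grids. One possible remedy, not run in this notebook, is to pass an explicit alpha grid reaching smaller values, together with several `l1_ratio` candidates. A minimal sketch on synthetic data (the dataset and grid values are assumptions for illustration):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNetCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression problem, for illustration only.
X_demo, y_demo = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

# Candidate mixing ratios (1.0 is pure Lasso) and a log-spaced alpha grid.
l1_ratios = [0.1, 0.5, 0.9, 1.0]
alphas = np.logspace(-3, 2, 30)

en_demo = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('estimator', ElasticNetCV(l1_ratio=l1_ratios, alphas=alphas, cv=5, max_iter=10000))])
en_demo.fit(X_demo, y_demo)

best = en_demo['estimator']
print(best.alpha_, best.l1_ratio_, en_demo.score(X_demo, y_demo))
```

If the selected alpha still sits at the edge of the grid, the grid should be extended further in that direction before drawing conclusions.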

#### Model Group 1 Analysis

In [55]:
results['features'] = len(fea['features_max'])

In [56]:
results['feature_set'] = 'features_max'

In [57]:
results[results['model_group']==1]

Unnamed: 0,model_group,model_type,model_params,train_R2,train_RMSE,val_R2,val_RMSE,comments,features,feature_set
0,1,Linear Regression,,0.92,24482.04,0.91,23708.0,,53,features_max
1,1,Lasso,alpha=67.09,0.91,23275.27,0.9,24193.28,,53,features_max
2,1,Ridge,alpha=1.0,0.92,23078.05,0.9,23869.37,,53,features_max


#### Streamlining the Metric Generation for Subsequent Feature Sets

Having gone through the model training and evaluation for the first feature set step by step, a function will now be created to streamline the process for subsequent feature sets.

In [58]:
"""
model_specs = {
    'model_group': 1,
    'model_type': 'Linear Regression', # or Lasso or Ridge
    'feature_set': 'features_max'
}
"""

In [59]:
def model_results(model_specs, max_iter=1000):
    """
    Takes in model specification (dict) containing model group, 
    model type and feature set to use.
    Max_iter can be set to quickly resolve cases where Lasso does not converge.
    Returns dict of model R2 (R-Squared) and RMSE on both train and validation sets.
    """
    result = dict(model_specs)  # copy, so the caller's dict is not mutated
    
    model_group = model_specs['model_group']
    model_type = model_specs['model_type']
    feature_set = model_specs['feature_set']
    features = fea[feature_set]
    result['features'] = len(features)
    
    bin_features = [f for f in features if f in binary_vars]
    num_features = [f for f in features if f not in binary_vars]
    X_train, X_val = X_train_full[features], X_val_full[features]
    
    if model_type == 'Linear Regression':  
        linreg = LinearRegression()
        linreg.fit(X_train, y_train)
        
        result['comments'] = f'Non-CV R2 on train set is {round(linreg.score(X_train, y_train), 2)}.'
        cv_train_R2 = cross_val_score(linreg, X_train, y_train, cv=5).mean()
        result['train_R2'] = round(cv_train_R2, 2)
        cv_train_RMSE = (- cross_val_score(linreg, X_train, y_train, 
                                           cv=5, scoring='neg_mean_squared_error').mean()) **0.5
        result['train_RMSE'] = round(cv_train_RMSE, 2)
        result['val_R2'] = round(linreg.score(X_val, y_val), 2)
        result['val_RMSE'] = round((mean_squared_error(y_val, linreg.predict(X_val)))**0.5, 2)
        
    elif model_type == 'Lasso' or model_type == 'Ridge':
        
        if model_type == 'Lasso':
            estimator = LassoCV(max_iter=max_iter)
            result['comments'] = f'max_iter = {max_iter}'
        else:
            estimator = RidgeCV()
        
        numeric_transformer = Pipeline(steps=[('scaler', StandardScaler())])
        preprocessor = ColumnTransformer(transformers=[
            ('bin', 'passthrough', bin_features),
            ('num', numeric_transformer, num_features)])
        model_cv = Pipeline(steps=[('preprocessor', preprocessor),
                               ('estimator', estimator)])
        model_cv.fit(X_train, y_train)
        
        best_alpha = model_cv['estimator'].alpha_
        result['model_params'] = f'alpha={round(best_alpha,2)}'        
        result['train_R2'] = round(model_cv.score(X_train, y_train), 2)
        result['train_RMSE'] = round((mean_squared_error(y_train, model_cv.predict(X_train)))**0.5, 
                                     2)
        result['val_R2'] = round(model_cv.score(X_val, y_val), 2)
        result['val_RMSE'] = round((mean_squared_error(y_val, model_cv.predict(X_val)))**0.5, 
                                     2)
        
    else:
        raise ValueError("model_type must be 'Linear Regression', 'Lasso' or 'Ridge'.")
    
    return result    

### 2. Models with features_max_powerless

Model Group 2

#### Model Group 2 Linear Regression

In [60]:
model_specs_2 = {
    'model_group': 2,
    'model_type': 'Linear Regression', 
    'feature_set': 'features_max_powerless'
}

In [61]:
results = results.append(model_results(model_specs_2), ignore_index=True)

In [62]:
results

Unnamed: 0,model_group,model_type,model_params,train_R2,train_RMSE,val_R2,val_RMSE,comments,features,feature_set
0,1,Linear Regression,,0.92,24482.04,0.91,23708.0,,53,features_max
1,1,Lasso,alpha=67.09,0.91,23275.27,0.9,24193.28,,53,features_max
2,1,Ridge,alpha=1.0,0.92,23078.05,0.9,23869.37,,53,features_max
3,2,Linear Regression,,0.89,26621.22,0.88,26424.24,Non-CV R2 on train set is 0.9.,53,features_max_powerless


#### Model Group 2 Lasso Regression

In [63]:
model_specs_2L = {
    'model_group': 2,
    'model_type': 'Lasso', 
    'feature_set': 'features_max_powerless'
}

In [64]:
results = results.append(model_results(model_specs_2L, max_iter=10000), ignore_index=True)

In [65]:
results

Unnamed: 0,model_group,model_type,model_params,train_R2,train_RMSE,val_R2,val_RMSE,comments,features,feature_set
0,1,Linear Regression,,0.92,24482.04,0.91,23708.0,,53,features_max
1,1,Lasso,alpha=67.09,0.91,23275.27,0.9,24193.28,,53,features_max
2,1,Ridge,alpha=1.0,0.92,23078.05,0.9,23869.37,,53,features_max
3,2,Linear Regression,,0.89,26621.22,0.88,26424.24,Non-CV R2 on train set is 0.9.,53,features_max_powerless
4,2,Lasso,alpha=65.87,0.9,25427.73,0.88,26713.18,max_iter = 10000,53,features_max_powerless


#### Model Group 2 Ridge Regression

In [66]:
model_specs_2R = {
    'model_group': 2,
    'model_type': 'Ridge', 
    'feature_set': 'features_max_powerless'
}

In [67]:
model_results(model_specs_2R)

{'model_group': 2,
 'model_type': 'Ridge',
 'feature_set': 'features_max_powerless',
 'features': 53,
 'model_params': 'alpha=10.0',
 'train_R2': 0.9,
 'train_RMSE': 25405.1,
 'val_R2': 0.88,
 'val_RMSE': 26575.5}

In [68]:
results = results.append(model_results(model_specs_2R), ignore_index=True)

#### Model Group 2 Analysis

In [69]:
results[results['model_group']==1]

Unnamed: 0,model_group,model_type,model_params,train_R2,train_RMSE,val_R2,val_RMSE,comments,features,feature_set
0,1,Linear Regression,,0.92,24482.04,0.91,23708.0,,53,features_max
1,1,Lasso,alpha=67.09,0.91,23275.27,0.9,24193.28,,53,features_max
2,1,Ridge,alpha=1.0,0.92,23078.05,0.9,23869.37,,53,features_max


In [70]:
results[results['model_group']==2]

Unnamed: 0,model_group,model_type,model_params,train_R2,train_RMSE,val_R2,val_RMSE,comments,features,feature_set
3,2,Linear Regression,,0.89,26621.22,0.88,26424.24,Non-CV R2 on train set is 0.9.,53,features_max_powerless
4,2,Lasso,alpha=65.87,0.9,25427.73,0.88,26713.18,max_iter = 10000,53,features_max_powerless
5,2,Ridge,alpha=10.0,0.9,25405.1,0.88,26575.5,,53,features_max_powerless


### 3. Models with features_drop_weak
Model Group 3

#### Model Group 3 Linear Regression

In [71]:
model_specs_3 = {
    'model_group': 3,
    'model_type': 'Linear Regression', 
    'feature_set': 'features_drop_weak'
}

In [72]:
results = results.append(model_results(model_specs_3), ignore_index=True)

#### Model Group 3 Lasso Regression

In [73]:
model_specs_3L = {
    'model_group': 3,
    'model_type': 'Lasso', 
    'feature_set': 'features_drop_weak'
}

In [74]:
results = results.append(model_results(model_specs_3L), ignore_index=True)

#### Model Group 3 Ridge Regression

In [75]:
model_specs_3R = {
    'model_group': 3,
    'model_type': 'Ridge', 
    'feature_set': 'features_drop_weak'
}

In [76]:
results = results.append(model_results(model_specs_3R), ignore_index=True)

#### Model Group 3 Analysis

In [77]:
results

Unnamed: 0,model_group,model_type,model_params,train_R2,train_RMSE,val_R2,val_RMSE,comments,features,feature_set
0,1,Linear Regression,,0.92,24482.04,0.91,23708.0,,53,features_max
1,1,Lasso,alpha=67.09,0.91,23275.27,0.9,24193.28,,53,features_max
2,1,Ridge,alpha=1.0,0.92,23078.05,0.9,23869.37,,53,features_max
3,2,Linear Regression,,0.89,26621.22,0.88,26424.24,Non-CV R2 on train set is 0.9.,53,features_max_powerless
4,2,Lasso,alpha=65.87,0.9,25427.73,0.88,26713.18,max_iter = 10000,53,features_max_powerless
5,2,Ridge,alpha=10.0,0.9,25405.1,0.88,26575.5,,53,features_max_powerless
6,3,Linear Regression,,0.9,25433.04,0.9,24795.49,Non-CV R2 on train set is 0.91.,45,features_drop_weak
7,3,Lasso,alpha=117.24,0.91,24342.21,0.89,25283.04,max_iter = 1000,45,features_drop_weak
8,3,Ridge,alpha=1.0,0.91,24068.17,0.9,24910.0,,45,features_drop_weak


### 4. Models with features_lite
Model Group 4

#### Model Group 4 Linear Regression

In [78]:
fea

{'features_max': ['Lot Frontage',
  'Lot Area',
  'Overall Cond',
  'Year Built',
  'Year Remod/Add',
  'Mas Vnr Area',
  'Total Bsmt SF',
  '1st Flr SF',
  '2nd Flr SF',
  'Low Qual Fin SF',
  'Bedroom AbvGr',
  'Kitchen AbvGr',
  'TotRms AbvGrd',
  'Fireplaces',
  'Garage Area',
  'Wood Deck SF',
  'Open Porch SF',
  'Enclosed Porch',
  '3Ssn Porch',
  'Screen Porch',
  'Pool Area',
  'Misc Val',
  'Mo Sold',
  'Yr Sold',
  'After 1999',
  'PID 9',
  'Bath Log',
  'Floating Village',
  'Regular Lot Shape',
  'Hillside',
  'CulDSac',
  'NH1',
  'NH2',
  'NH3',
  'NH4',
  '1 Story',
  'Hip Roof',
  'Stone Vnr',
  'Has Vnr',
  'Exter Qual Num',
  'PConc Foundation',
  'Has Central Air',
  'Excellent Heating',
  'Kitchen Qual Num',
  'Fireplace Qu Num',
  'Attached or BuiltIn Garage',
  'Finished Garage',
  'Fully Paved Drive',
  'New Sale',
  'Has Alley Access',
  '1.5P Gr Liv Area',
  '1.5P Total SF',
  'P3 Overall Qual'],
 'features_max_powerless': ['Lot Frontage',
  'Lot Area',
  'Ov

In [79]:
fea['features_lite']

['TotRms AbvGrd',
 'Mas Vnr Area',
 'Bath Log',
 'Garage Area',
 'Year Built',
 'PConc Foundation',
 'NH4',
 'Fireplaces',
 'Finished Garage',
 '1.5P Gr Liv Area',
 'Has Vnr',
 '1.5P Total SF',
 'Excellent Heating',
 'Kitchen Qual Num',
 'Fireplace Qu Num',
 'NH2',
 'Exter Qual Num',
 'Year Remod/Add',
 'Attached or BuiltIn Garage',
 'After 1999',
 'Overall Qual',
 '1st Flr SF',
 'Total SF',
 'P3 Overall Qual']

In [80]:
model_specs_4 = {
    'model_group': 4,
    'model_type': 'Linear Regression', 
    'feature_set': 'features_lite'
}

In [81]:
results = results.append(model_results(model_specs_4), ignore_index=True)

#### Model Group 4 Lasso Regression

In [82]:
model_specs_4L = {
    'model_group': 4,
    'model_type': 'Lasso', 
    'feature_set': 'features_lite'
}

In [83]:
results = results.append(model_results(model_specs_4L), ignore_index=True)

#### Model Group 4 Ridge Regression

In [84]:
model_specs_4R = {
    'model_group': 4,
    'model_type': 'Ridge', 
    'feature_set': 'features_lite'
}

In [85]:
results = results.append(model_results(model_specs_4R), ignore_index=True)

#### Model Group 4 Analysis

In [86]:
results

Unnamed: 0,model_group,model_type,model_params,train_R2,train_RMSE,val_R2,val_RMSE,comments,features,feature_set
0,1,Linear Regression,,0.92,24482.04,0.91,23708.0,,53,features_max
1,1,Lasso,alpha=67.09,0.91,23275.27,0.9,24193.28,,53,features_max
2,1,Ridge,alpha=1.0,0.92,23078.05,0.9,23869.37,,53,features_max
3,2,Linear Regression,,0.89,26621.22,0.88,26424.24,Non-CV R2 on train set is 0.9.,53,features_max_powerless
4,2,Lasso,alpha=65.87,0.9,25427.73,0.88,26713.18,max_iter = 10000,53,features_max_powerless
5,2,Ridge,alpha=10.0,0.9,25405.1,0.88,26575.5,,53,features_max_powerless
6,3,Linear Regression,,0.9,25433.04,0.9,24795.49,Non-CV R2 on train set is 0.91.,45,features_drop_weak
7,3,Lasso,alpha=117.24,0.91,24342.21,0.89,25283.04,max_iter = 1000,45,features_drop_weak
8,3,Ridge,alpha=1.0,0.91,24068.17,0.9,24910.0,,45,features_drop_weak
9,4,Linear Regression,,0.88,27353.21,0.89,25224.94,Non-CV R2 on train set is 0.89.,24,features_lite


### 5. Models with features_min
Model Group 5

#### Model Group 5 Linear Regression

In [87]:
model_specs_5 = {
    'model_group': 5,
    'model_type': 'Linear Regression', 
    'feature_set': 'features_min'
}

In [88]:
results = results.append(model_results(model_specs_5), ignore_index=True)

#### Model Group 5 Lasso Regression

In [89]:
model_specs_5L = {
    'model_group': 5,
    'model_type': 'Lasso', 
    'feature_set': 'features_min'
}

In [90]:
results = results.append(model_results(model_specs_5L), ignore_index=True)

#### Model Group 5 Ridge Regression

In [91]:
model_specs_5R = {
    'model_group': 5,
    'model_type': 'Ridge', 
    'feature_set': 'features_min'
}

In [92]:
results = results.append(model_results(model_specs_5R), ignore_index=True)

#### Model Group 5 Analysis

In [93]:
results

Unnamed: 0,model_group,model_type,model_params,train_R2,train_RMSE,val_R2,val_RMSE,comments,features,feature_set
0,1,Linear Regression,,0.92,24482.04,0.91,23708.0,,53,features_max
1,1,Lasso,alpha=67.09,0.91,23275.27,0.9,24193.28,,53,features_max
2,1,Ridge,alpha=1.0,0.92,23078.05,0.9,23869.37,,53,features_max
3,2,Linear Regression,,0.89,26621.22,0.88,26424.24,Non-CV R2 on train set is 0.9.,53,features_max_powerless
4,2,Lasso,alpha=65.87,0.9,25427.73,0.88,26713.18,max_iter = 10000,53,features_max_powerless
5,2,Ridge,alpha=10.0,0.9,25405.1,0.88,26575.5,,53,features_max_powerless
6,3,Linear Regression,,0.9,25433.04,0.9,24795.49,Non-CV R2 on train set is 0.91.,45,features_drop_weak
7,3,Lasso,alpha=117.24,0.91,24342.21,0.89,25283.04,max_iter = 1000,45,features_drop_weak
8,3,Ridge,alpha=1.0,0.91,24068.17,0.9,24910.0,,45,features_drop_weak
9,4,Linear Regression,,0.88,27353.21,0.89,25224.94,Non-CV R2 on train set is 0.89.,24,features_lite


When the feature set is drastically reduced to just 2 features, performance is significantly lower than for the models with more features, even though these 2 features are extremely strongly correlated with the target.

The baseline is an RMSE of 77233.35, obtained by predicting the mean sale price for every entry. All the models evaluated beat this baseline.

### Fine Tuning of Best Models

Model Group 1 appears to have the best performance, particularly the Linear Regression and Ridge models.
A finer grid of alphas between 0.1 and 10 will be tried on the Ridge model to see if performance improves.
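
The ranking behind the choice of Model Group 1 can be reproduced by sorting on validation RMSE. The sketch below copies a handful of rows from the results table into a small frame (the name `summary` is hypothetical), rather than using the notebook's `results` variable:

```python
import pandas as pd

# Validation metrics copied from the Model Group 1 and 3 rows of the results table.
summary = pd.DataFrame([
    {'model_group': 1, 'model_type': 'Linear Regression', 'val_R2': 0.91, 'val_RMSE': 23708.00},
    {'model_group': 1, 'model_type': 'Lasso', 'val_R2': 0.90, 'val_RMSE': 24193.28},
    {'model_group': 1, 'model_type': 'Ridge', 'val_R2': 0.90, 'val_RMSE': 23869.37},
    {'model_group': 3, 'model_type': 'Linear Regression', 'val_R2': 0.90, 'val_RMSE': 24795.49},
    {'model_group': 3, 'model_type': 'Ridge', 'val_R2': 0.90, 'val_RMSE': 24910.00},
])

# Lower validation RMSE is better, so sort ascending.
ranked = summary.sort_values('val_RMSE').reset_index(drop=True)
print(ranked.head(3))
```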

In [94]:
X_train, X_val = X_train_full[features1], X_val_full[features1]

In [95]:
numeric_transformer = Pipeline(steps=[('scaler', StandardScaler())])

preprocessor = ColumnTransformer(transformers=[
    ('bin', 'passthrough', bin_features1),
    ('num', numeric_transformer, num_features1)])

ridge_cv_more_alphas = Pipeline(steps=[('preprocessor', preprocessor),
                       ('estimator', RidgeCV(alphas=np.linspace(0.1,10,500)))])

ridge_cv_more_alphas.fit(X_train, y_train)
ridge_cv_more_alphas.score(X_train, y_train)

0.9162509248125673

In [96]:
(mean_squared_error(y_train, ridge_cv_more_alphas.predict(X_train)))**0.5

23084.18941464612

In [97]:
ridge_cv_more_alphas['estimator'].alpha_

1.0919839679358718

Model Group 1 Ridge Regression performance on validation set.

In [98]:
ridge_cv_more_alphas.score(X_val, y_val)

0.9043533999320057

In [99]:
(mean_squared_error(y_val, ridge_cv_more_alphas.predict(X_val)))**0.5

23885.791447481966

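Trying a finer grid of alpha values does not improve the best performing Ridge model: the selected alpha only moves from 1.0 to about 1.09, and the validation RMSE is essentially unchanged (23885.79 versus 23869.37).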
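
The grid used above stays between 0.1 and 10, the same span as the `RidgeCV` defaults of (0.1, 1.0, 10.0). A search over a genuinely wider range would typically use log-spaced values covering several orders of magnitude. A minimal sketch on synthetic data (the dataset and the bounds of `wide_alphas` are assumptions for illustration):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression problem, for illustration only.
X_demo, y_demo = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

# 50 log-spaced candidates from 0.001 to 1000.
wide_alphas = np.logspace(-3, 3, 50)

ridge_wide = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('estimator', RidgeCV(alphas=wide_alphas))])
ridge_wide.fit(X_demo, y_demo)

best_alpha = ridge_wide['estimator'].alpha_
# A best alpha on the boundary suggests the grid should be extended.
at_edge = bool(np.isclose(best_alpha, wide_alphas.min())
               or np.isclose(best_alpha, wide_alphas.max()))
print(best_alpha, at_edge)
```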

## Log Transformation of Target Variable

During EDA, it was observed that certain predictors fit the target more tightly on a logarithmic scale. This makes sense because the target has an extremely wide spread of values, and extremely large prices tend to lie further from the line of best fit against other predictors. A logarithmic transformation of the target is one method that may soften the effect of these extreme values.
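
The effect of the transformation on the spread of the target can be checked with a skewness statistic. The sketch below uses made-up log-normally distributed prices (an assumption for illustration, not the SalePrice column) and compares `scipy.stats.skew` before and after taking logs:

```python
import numpy as np
from scipy import stats

# Made-up right-skewed "prices", for illustration only.
rng = np.random.default_rng(42)
prices_demo = rng.lognormal(mean=12.0, sigma=0.4, size=2000)

raw_skew = stats.skew(prices_demo)          # long right tail gives positive skew
log_skew = stats.skew(np.log(prices_demo))  # close to 0 for log-normal data
print(raw_skew, log_skew)
```

A skewness much closer to zero after the transformation indicates that the extreme high values pull less on the fit.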

A model that takes the log transformation of the target variable, using the first feature set and plain Linear Regression.

In [100]:
linreg_log_y = TransformedTargetRegressor(regressor=LinearRegression(),
                                        func=np.log,
                                        inverse_func=np.exp)

In [101]:
linreg_log_y.fit(X_train, y_train)

TransformedTargetRegressor(func=<ufunc 'log'>, inverse_func=<ufunc 'exp'>,
                           regressor=LinearRegression())

In [102]:
linreg_log_y.score(X_train, y_train)

0.9298951848380516

In [103]:
(mean_squared_error(y_train, linreg_log_y.predict(X_train)))**0.5

21120.225143907315

In [104]:
(- cross_val_score(linreg_log_y, X_train, y_train, cv=5, scoring='neg_mean_squared_error').mean()) **0.5

22813.101555980727

In [105]:
linreg_log_y.score(X_val, y_val)

0.8998437567930891

In [106]:
(mean_squared_error(y_val, linreg_log_y.predict(X_val)))**0.5

24442.40189197985

In [107]:
results = results.append({
    'comments': 'Logarithmic transformation of target variable.',
    'model_group': 1,
    'model_type': 'Linear Regression',
    'train_R2': round(0.9298951848380516, 2),
    'train_RMSE': round(22813.101555980727, 2),
    'val_R2': round(0.8998437567930891,2),
    'val_RMSE': round(24442.40189197985,2),
    'features': len(fea['features_max']),
    'feature_set': 'features_max'
    
}, ignore_index=True)

In [108]:
results

Unnamed: 0,model_group,model_type,model_params,train_R2,train_RMSE,val_R2,val_RMSE,comments,features,feature_set
0,1,Linear Regression,,0.92,24482.04,0.91,23708.0,,53,features_max
1,1,Lasso,alpha=67.09,0.91,23275.27,0.9,24193.28,,53,features_max
2,1,Ridge,alpha=1.0,0.92,23078.05,0.9,23869.37,,53,features_max
3,2,Linear Regression,,0.89,26621.22,0.88,26424.24,Non-CV R2 on train set is 0.9.,53,features_max_powerless
4,2,Lasso,alpha=65.87,0.9,25427.73,0.88,26713.18,max_iter = 10000,53,features_max_powerless
5,2,Ridge,alpha=10.0,0.9,25405.1,0.88,26575.5,,53,features_max_powerless
6,3,Linear Regression,,0.9,25433.04,0.9,24795.49,Non-CV R2 on train set is 0.91.,45,features_drop_weak
7,3,Lasso,alpha=117.24,0.91,24342.21,0.89,25283.04,max_iter = 1000,45,features_drop_weak
8,3,Ridge,alpha=1.0,0.91,24068.17,0.9,24910.0,,45,features_drop_weak
9,4,Linear Regression,,0.88,27353.21,0.89,25224.94,Non-CV R2 on train set is 0.89.,24,features_lite


Logarithmic transformation of target variable for ridge regression on first feature set.

In [109]:
numeric_transformer = Pipeline(steps=[('scaler', StandardScaler())])

preprocessor = ColumnTransformer(transformers=[
    ('bin', 'passthrough', bin_features1),
    ('num', numeric_transformer, num_features1)])

ridge = Pipeline(steps=[('preprocessor', preprocessor),
                       ('estimator', RidgeCV())])

ridge_log_y = TransformedTargetRegressor(regressor=ridge,
                                        func=np.log,
                                        inverse_func=np.exp)

ridge_log_y.fit(X_train, y_train)
ridge_log_y.score(X_train, y_train)

0.9293065837229739

In [110]:
(mean_squared_error(y_train, ridge_log_y.predict(X_train)))**0.5

21208.702686279965

In [111]:
ridge_log_y.score(X_val, y_val)

0.8970654444734278

In [112]:
(mean_squared_error(y_val, ridge_log_y.predict(X_val)))**0.5

24779.096352319517

In [113]:
results = results.append({
    'comments': 'Logarithmic transformation of target variable.',
    'model_group': 1,
    'model_type': 'Ridge',
    'train_R2': round(0.9293065837229739, 2),
    'train_RMSE': round(21208.702686279965, 2),
    'val_R2': round(0.8970654444734278,2),
    'val_RMSE': round(24779.096352319517,2),
    'features': len(fea['features_max']),
    'feature_set': 'features_max'
    
}, ignore_index=True)

In [114]:
results = results.fillna('')

In [115]:
# print(results.to_markdown())

|    |   model_group | model_type        | model_params   |   train_R2 |   train_RMSE |   val_R2 |   val_RMSE | comments                                       |   features | feature_set            |
|---:|--------------:|:------------------|:---------------|-----------:|-------------:|---------:|-----------:|:-----------------------------------------------|-----------:|:-----------------------|
|  0 |             1 | Linear Regression |                |       0.92 |      24482   |     0.91 |    23708   |                                                |         53 | features_max           |
|  1 |             1 | Lasso             | alpha=67.09    |       0.91 |      23275.3 |     0.9  |    24193.3 |                                                |         53 | features_max           |
|  2 |             1 | Ridge             | alpha=1.0      |       0.92 |      23078   |     0.9  |    23869.4 |                                                |         53 | features_max           |
|  3 |             2 | Linear Regression |                |       0.89 |      26621.2 |     0.88 |    26424.2 | Non-CV R2 on train set is 0.9.                 |         53 | features_max_powerless |
|  4 |             2 | Lasso             | alpha=65.87    |       0.9  |      25427.7 |     0.88 |    26713.2 | max_iter = 10000                               |         53 | features_max_powerless |
|  5 |             2 | Ridge             | alpha=10.0     |       0.9  |      25405.1 |     0.88 |    26575.5 |                                                |         53 | features_max_powerless |
|  6 |             3 | Linear Regression |                |       0.9  |      25433   |     0.9  |    24795.5 | Non-CV R2 on train set is 0.91.                |         45 | features_drop_weak     |
|  7 |             3 | Lasso             | alpha=117.24   |       0.91 |      24342.2 |     0.89 |    25283   | max_iter = 1000                                |         45 | features_drop_weak     |
|  8 |             3 | Ridge             | alpha=1.0      |       0.91 |      24068.2 |     0.9  |    24910   |                                                |         45 | features_drop_weak     |
|  9 |             4 | Linear Regression |                |       0.88 |      27353.2 |     0.89 |    25224.9 | Non-CV R2 on train set is 0.89.                |         24 | features_lite          |
| 10 |             4 | Lasso             | alpha=67.09    |       0.89 |      26751.6 |     0.89 |    25425   | max_iter = 1000                                |         24 | features_lite          |
| 11 |             4 | Ridge             | alpha=1.0      |       0.89 |      26714.6 |     0.89 |    25297.3 |                                                |         24 | features_lite          |
| 12 |             5 | Linear Regression |                |       0.83 |      32166.2 |     0.84 |    30618.2 | Non-CV R2 on train set is 0.84.                |          2 | features_min           |
| 13 |             5 | Lasso             | alpha=71.94    |       0.84 |      31964.5 |     0.84 |    30622.3 | max_iter = 1000                                |          2 | features_min           |
| 14 |             5 | Ridge             | alpha=1.0      |       0.84 |      31964.4 |     0.84 |    30619.6 |                                                |          2 | features_min           |
| 15 |             1 | Linear Regression |                |       0.93 |      22813.1 |     0.9  |    24442.4 | Logarithmic transformation of target variable. |         53 | features_max           |
| 16 |             1 | Ridge             |                |       0.93 |      21208.7 |     0.9  |    24779.1 | Logarithmic transformation of target variable. |         53 | features_max           |

Overall, the logarithmic transformation improves performance on the train set relative to the equivalent models without it (for ridge, train RMSE falls from about 23,078 to 21,209). However, it performs worse on the validation set (for ridge, validation RMSE rises from about 23,869 to 24,779), which suggests that the log transformation is causing some overfitting.
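To make explicit what `TransformedTargetRegressor` does with `func=np.log` and `inverse_func=np.exp`, here is a minimal sketch on synthetic data (the arrays `X_demo` and `y_demo` are made up for illustration and are not the housing data). It fits the regressor on `log(y)` and maps predictions back with `exp`, which is equivalent to doing the two steps by hand:

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import Ridge

# Synthetic, strictly positive target (illustrative only).
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(100, 3))
y_demo = np.exp(1.0 + X_demo @ np.array([0.3, -0.2, 0.1])
                + rng.normal(scale=0.05, size=100))

# Wrapped version: the target is log-transformed internally before fitting.
wrapped = TransformedTargetRegressor(regressor=Ridge(alpha=1.0),
                                     func=np.log,
                                     inverse_func=np.exp)
wrapped.fit(X_demo, y_demo)

# Manual version: fit on log(y), then exponentiate the predictions.
manual = Ridge(alpha=1.0).fit(X_demo, np.log(y_demo))
manual_pred = np.exp(manual.predict(X_demo))

same = np.allclose(wrapped.predict(X_demo), manual_pred)
print(same)
```

Because the model minimises squared error on the log scale, it is not guaranteed to minimise RMSE on the original price scale, which is one possible reason the train and validation RMSE move in different directions.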

## Train Model with Combined Dataset

Now, the two most promising models are trained on the entire training set (train + val) to prepare them for prediction on the test set.

Naturally, we would expect the error to go down overall compared to the models above, since more training data mainly helps to reduce variance. What is crucial to look out for is whether these models have a tendency to overfit this larger training set. Comparing the cross-validated score with the non-cross-validated score is an indicator of this.
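The comparison described above can be wrapped in a small helper. This is only a sketch on synthetic data: the function name `train_vs_cv_rmse` and the arrays `X_demo` and `y_demo` are hypothetical and not part of the notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score


def train_vs_cv_rmse(model, X, y, cv=5):
    """Return (in-sample RMSE, cross-validated RMSE) for an estimator."""
    # cross_val_score clones the estimator, so the original stays unfitted here.
    cv_mse = -cross_val_score(model, X, y, cv=cv,
                              scoring='neg_mean_squared_error').mean()
    model.fit(X, y)
    train_mse = mean_squared_error(y, model.predict(X))
    return train_mse ** 0.5, cv_mse ** 0.5


# Synthetic data with many noise features, so some overfitting is expected.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(60, 20))
y_demo = X_demo[:, 0] + rng.normal(scale=1.0, size=60)

train_rmse, cv_rmse = train_vs_cv_rmse(LinearRegression(), X_demo, y_demo)
print(round(train_rmse, 3), round(cv_rmse, 3))
```

A cross-validated RMSE well above the in-sample RMSE is the warning sign to look for in the cells that follow.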

### Linear Regression Model for Combined Dataset
Selection: Model Group 1

In [116]:
X1 = X[features1]

In [117]:
linreg_final = LinearRegression()

In [118]:
linreg_final.fit(X1, y)

LinearRegression()

In [119]:
(mean_squared_error(y, linreg_final.predict(X1)))**0.5

22883.38685996629

In [120]:
(- cross_val_score(linreg_final, X1, y, cv=5, scoring='neg_mean_squared_error').mean()) **0.5

24126.50674943843

The cross-validated RMSE (about 24,127) is roughly 5% higher than the non-cross-validated RMSE (about 22,883) for the final linear regression model. This indicates that the model might not generalise to unseen data quite as well as its training error suggests.

### Ridge Model for Combined Dataset
Selection: Model Group 1

In [121]:
numeric_transformer = Pipeline(steps=[('scaler', StandardScaler())])

preprocessor = ColumnTransformer(transformers=[
    ('bin', 'passthrough', bin_features1),
    ('num', numeric_transformer, num_features1)])

ridge_final = Pipeline(steps=[('preprocessor', preprocessor),
                       ('estimator', Ridge(alpha=1.0))])

In [122]:
ridge_final.fit(X1, y)

Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('bin', 'passthrough',
                                                  ['After 1999', 'PID 9',
                                                   'Floating Village',
                                                   'Regular Lot Shape',
                                                   'Hillside', 'CulDSac', 'NH1',
                                                   'NH2', 'NH3', 'NH4',
                                                   '1 Story', 'Hip Roof',
                                                   'Stone Vnr', 'Has Vnr',
                                                   'PConc Foundation',
                                                   'Has Central Air',
                                                   'Excellent Heating',
                                                   'Attached or BuiltIn Garage',
                                                   'Finished Garage',
            

In [123]:
(mean_squared_error(y, ridge_final.predict(X1)))**0.5

22995.79669003899

In [124]:
(- cross_val_score(ridge_final, X1, y, cv=5, scoring='neg_mean_squared_error').mean()) **0.5

24264.449630813175

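The final ridge model performs slightly worse than the plain linear regression model above on both scores. The gap between its cross-validated and non-cross-validated RMSE (about 1,269) is essentially the same as that of the linear regression model (about 1,243), so the two models appear to generalise similarly to unseen data. The ridge model is the one carried forward for interpretation and prediction below.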
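The size of each model's gap can be made explicit with a little arithmetic on the RMSE values reported in the cells above. This is a plain-Python sketch; the dictionary name `reported` is just an illustrative container for those four numbers.

```python
# (non-cross-validated RMSE, cross-validated RMSE) as reported above.
reported = {
    'linear regression': (22883.38685996629, 24126.50674943843),
    'ridge': (22995.79669003899, 24264.449630813175),
}

gaps = {}
for name, (train_rmse, cv_rmse) in reported.items():
    gap = cv_rmse - train_rmse
    gaps[name] = gap
    print(f'{name}: gap = {gap:.2f} ({gap / train_rmse:.2%} of train RMSE)')
```

The gap comes out at roughly 1,243 for linear regression and 1,269 for ridge, so on this measure the two models are very close.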

### Interpretation of Ridge Final Model Coefficients

In [125]:
dict(zip(features1, ridge_final['estimator'].coef_))

{'Lot Frontage': 3452.7781429888764,
 'Lot Area': 4594.659798879528,
 'Overall Cond': 9010.989107847758,
 'Year Built': -2239.919807259114,
 'Year Remod/Add': 16531.61295882386,
 'Mas Vnr Area': 7522.53886432506,
 'Total Bsmt SF': -45174.13254774184,
 '1st Flr SF': -38907.594803949134,
 '2nd Flr SF': -37510.61588826987,
 'Low Qual Fin SF': -16285.952417901193,
 'Bedroom AbvGr': -3160.200648775165,
 'Kitchen AbvGr': 5626.5902160422365,
 'TotRms AbvGrd': 8135.846674883454,
 'Fireplaces': -5800.119015966088,
 'Garage Area': 4326.712870587705,
 'Wood Deck SF': 829.8882126802082,
 'Open Porch SF': 2926.1376383082093,
 'Enclosed Porch': 2604.970979630671,
 '3Ssn Porch': 174.3553132959988,
 'Screen Porch': 6934.863865230692,
 'Pool Area': 14749.628453031411,
 'Misc Val': -5525.811315753386,
 'Mo Sold': 2286.6045041027946,
 'Yr Sold': 5847.387041772138,
 'After 1999': 7175.4546558107595,
 'PID 9': 8531.282495863774,
 'Bath Log': 534.7842575731994,
 'Floating Village': 4290.696903325964,
 'Regu

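The magnitude of a regression coefficient gives an indication of the importance of that feature in predicting sale price. Note that the numeric features are standardised while the binary features are passed through unscaled, so magnitudes are only roughly comparable between the two groups.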
Below, the coefficients are sorted in descending order of magnitude.
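When pairing feature names with `coef_`, it is worth keeping in mind that a `ColumnTransformer` concatenates its outputs in the order the transformers are listed. In the pipeline above that means the pass-through binary columns come first and the scaled numeric columns second, which is not necessarily the original column order. A minimal sketch on synthetic data (the column names `area`, `is_new` and `age` are made up for illustration) shows one way to keep names and coefficients aligned:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: the binary column sits between two numeric columns.
rng = np.random.default_rng(0)
n = 200
X_demo = pd.DataFrame({
    'area': rng.normal(1500, 300, n),   # numeric
    'is_new': rng.integers(0, 2, n),    # binary
    'age': rng.normal(30, 10, n),       # numeric
})
y_demo = (100 * X_demo['area'] + 20000 * X_demo['is_new']
          - 500 * X_demo['age'] + rng.normal(0, 1000, n))

bin_cols = ['is_new']
num_cols = ['area', 'age']

pre = ColumnTransformer(transformers=[
    ('bin', 'passthrough', bin_cols),
    ('num', StandardScaler(), num_cols)])
model = Pipeline(steps=[('preprocessor', pre),
                        ('estimator', Ridge(alpha=1.0))])
model.fit(X_demo, y_demo)

# Name order must follow the transformer order, not X_demo.columns.
coefs = dict(zip(bin_cols + num_cols, model['estimator'].coef_))
print(coefs)
```

In this notebook the analogous name list would be `bin_features1 + num_features1`, assuming both are plain Python lists. If `features1` is ordered differently, the name-to-coefficient pairing in the cells here would need re-checking.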

In [126]:
# dict(sorted(zip(features1, np.abs(ridge_final['estimator'].coef_)), key = lambda x:x[1], reverse=True))

In [127]:
sorted_coefs = dict(sorted(
    zip(features1, [round(c, 2) for c in ridge_final['estimator'].coef_]),
    key=lambda x: np.abs(x[1]),
    reverse=True))

In [128]:
sorted_coefs

{'1.5P Total SF': 68050.26,
 'Total Bsmt SF': -45174.13,
 '1st Flr SF': -38907.59,
 '2nd Flr SF': -37510.62,
 'CulDSac': -30230.2,
 'Hillside': -25995.79,
 'Regular Lot Shape': -24515.89,
 'P3 Overall Qual': 19130.79,
 'Year Remod/Add': 16531.61,
 'Low Qual Fin SF': -16285.95,
 'Pool Area': 14749.63,
 '1.5P Gr Liv Area': 9849.04,
 'Overall Cond': 9010.99,
 'PID 9': 8531.28,
 'TotRms AbvGrd': 8135.85,
 'Mas Vnr Area': 7522.54,
 'After 1999': 7175.45,
 'Screen Porch': 6934.86,
 'Yr Sold': 5847.39,
 'Fireplaces': -5800.12,
 'Kitchen AbvGr': 5626.59,
 'Misc Val': -5525.81,
 'NH1': -5507.44,
 'New Sale': 5163.87,
 'Hip Roof': 4793.83,
 'Lot Area': 4594.66,
 'Finished Garage': 4328.88,
 'Garage Area': 4326.71,
 'Has Central Air': 4318.93,
 'Floating Village': 4290.7,
 '1 Story': 3720.73,
 'Lot Frontage': 3452.78,
 'Bedroom AbvGr': -3160.2,
 'NH3': -3092.52,
 'NH2': -2971.61,
 'NH4': 2944.13,
 'Open Porch SF': 2926.14,
 'Enclosed Porch': 2604.97,
 'Mo Sold': 2286.6,
 'Year Built': -2239.92,
 

The obvious candidates that are confirmed to be highly predictive of sale price are the total area ('1.5P Total SF') and the quality of the house finishing ('P3 Overall Qual').

Quality of the house finishing ranks among the top two positive predictors. This is unsurprising, as houses with higher quality finishings are more desirable and costlier, for the practical reason that finishings cost money too. Another unsurprising relation confirmed is that houses with larger pools cost more.

As expected, the year last renovated or constructed ('Year Remod/Add') is an important factor as well: newer or newly renovated houses are more desirable. Likewise, having a large amount of low quality finished area has a negative effect on sale price, likely because this indicates a house in poor shape.

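Oddly, the basement and individual floor areas are negatively related to price. A likely explanation is collinearity: 'Total SF' is the sum of the basement, first floor and second floor areas, and '1.5P Total SF' is that total raised to the power of 1.5, so these negative coefficients probably offset part of the large positive coefficient on '1.5P Total SF' rather than showing that extra floor area lowers the price. The basement relation could also be due to the fact that more basement space might imply less above ground space.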
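The relationship between the area features can be checked by hand against the two rows shown by `df.head(2)` at the top of this notebook. The sketch below uses plain Python; the dictionary keys are shorthand labels introduced here, not column names from the dataset.

```python
# Values copied from the df.head(2) output near the top of the notebook.
rows = [
    {'bsmt': 725.0, 'first': 725, 'second': 754,
     'total_sf': 2204.0, 'total_sf_1_5p': 103470.699543},
    {'bsmt': 913.0, 'first': 913, 'second': 1209,
     'total_sf': 3035.0, 'total_sf_1_5p': 167200.681443},
]

checks = []
for r in rows:
    # Total SF is the sum of the basement and the two floor areas.
    sum_ok = r['bsmt'] + r['first'] + r['second'] == r['total_sf']
    # 1.5P Total SF is Total SF raised to the power of 1.5.
    power_ok = abs(r['total_sf'] ** 1.5 - r['total_sf_1_5p']) < 1e-2
    checks.append((sum_ok, power_ok))

print(checks)
```

Since one feature is a monotonic function of the sum of three others, their individual ridge coefficients are hard to interpret in isolation.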

Surprisingly, being on a hill and/or at a cul de sac has a strong negative effect on price. One would expect the view from a hill and the privacy of a cul de sac to make a house worth more, but perhaps such locations come with inconveniences, such as ease of access. It is also likely that there are simply many expensive properties that are not located at a cul de sac or on a hill, which causes this negative association.

Another interesting insight is that the regularity of the lot shape is negatively associated with price. There might be some underlying pattern about regular lot shapes; perhaps more generic, lower priced houses tend to be on regular lots.

Interestingly, neighborhoods were not a very strong predictor of pricing, but this could be a limitation of how the project tried to segment the neighborhoods.

Another interesting observation is that kitchen quality is not that predictive of pricing. This is likely because buyers choose to renovate the kitchen after purchasing, or perhaps because kitchens just do not vary that much in quality.

In [129]:
coef_df = pd.DataFrame.from_dict(sorted_coefs, orient='index', columns=['ridge coef'])

In [130]:
coef_df

Unnamed: 0,ridge coef
1.5P Total SF,68050.26
Total Bsmt SF,-45174.13
1st Flr SF,-38907.59
2nd Flr SF,-37510.62
CulDSac,-30230.2
Hillside,-25995.79
Regular Lot Shape,-24515.89
P3 Overall Qual,19130.79
Year Remod/Add,16531.61
Low Qual Fin SF,-16285.95


In [131]:
# coef_df.to_csv('../features/ridge_coef.csv')

## Kaggle Predictions

### Load Transformed Test Data

This data is prepared in [Notebook 2a](02a_Preprocessing_Test_Set.ipynb).

In [132]:
df_test = pd.read_csv('../datasets/f_test.csv')

In [133]:
df_test.set_index('Id', inplace=True)

### Make Predictions on Kaggle Test Set

In [134]:
df_test.head()

Id,Lot Frontage,Lot Area,Overall Qual,Overall Cond,Year Built,Year Remod/Add,Mas Vnr Area,Total Bsmt SF,1st Flr SF,2nd Flr SF,...,Attached or BuiltIn Garage,Finished Garage,Fully Paved Drive,New Sale,Has Alley Access,Total SF,1.5P Gr Liv Area,1.5P Total SF,After 1999,P3 Overall Qual
2658,69.0,9142,6,8,1910,1950,0.0,1020.0,908,1020,...,0,0,1,0,1,2948.0,84656.545831,160063.098158,0,216
2718,0.0,9662,5,4,1977,1977,0.0,1967.0,1967,0,...,1,1,1,0,0,3934.0,87238.168613,246746.802419,0,125
2414,58.0,17104,7,5,2006,2006,0.0,654.0,664,832,...,1,1,1,1,0,2150.0,57862.526181,99691.398827,1,343
1989,60.0,8520,5,6,1923,2006,0.0,968.0,968,0,...,0,0,0,0,0,1936.0,30117.092024,85184.0,0,125
625,0.0,9500,6,5,1963,1963,247.0,1394.0,1394,0,...,1,1,1,0,0,2788.0,52046.815311,147210.624182,0,216


In [135]:
df_test['SalePrice'] = ridge_final.predict(df_test[features1])

In [136]:
submission = df_test[['SalePrice']]

In [137]:
submission.head(10)

Id,SalePrice
2658,147010.619551
2718,176220.617244
2414,212364.126487
1989,97824.060097
625,173701.651906
333,95257.546822
1327,104698.205443
858,154388.60013
95,185339.704741
1568,168916.847434


In [138]:
# submission.to_csv('../kaggledata/final_sub.csv')

### Kaggle RMSE Score
Public Score: 24727.75

This score is about 2% higher than the cross-validated RMSE of the ridge model on the training data (about 24,264), which indicates that this model generalises decently to unseen data in predicting prices.

## Conclusion and Recommendations

1. The key features that, based on the Ames housing data, add most value to a home are its total area, high finishing quality, and year last renovated or built. This makes sense as larger, better finished, and newer homes would garner higher prices. 


2. Having a lot of poorly finished areas can hurt the price of a house considerably. This can be an opportunity for buyers to get a bargain and then spend some money to renovate.


3. Likewise, sellers would do well to make sure their houses are touched up before going on the market if they wish to fetch a good price. Renovating to get a higher quality house finishing might be worth the capital for the potential increase in sale price.


4. In Ames, Neighborhood Class 3 (NH3), comprising Bloomington, Somerset, Timberland, and Veenker, could prove to be a good investment. Despite their relatively high average finishing quality, these neighborhoods are negatively associated with sale price compared to the most expensive tier of neighborhoods. The data is about a decade old, so this might no longer be the case today.


5. Although pool size featured as strongly positively predictive for the Ames properties, this might not always hold true. For example, in hot areas such as California and Nevada, where a pool is a necessity rather than a luxury, the correlation might not be so strong. The reverse could be true of cities where pools are an expensive luxury due to scarcity of land, such as New York City or Singapore, where pool size could be highly indicative of a high property price.


6. Other attributes of the Ames data that might not generalise to other cities, apart from the obvious neighborhood and parcel IDs, are location attributes such as being on a hill. In some cities, being situated on a gradient could be associated with being in an expensive neighborhood (e.g. The Peak in Hong Kong or Mulholland Drive in Los Angeles). If this prediction task were applied to other cities, some additional work would be needed to segment the neighborhoods. We could use an approach similar to this project's by segmenting on quality of house finishings and price per square foot. Location attributes such as land contours could still be relevant, but would be interpreted differently.
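As a rough illustration of the segmentation idea in point 6, the sketch below groups homes by neighborhood and bins the neighborhoods into tiers by median price per square foot. It is not the segmentation used in this project: the data is a toy example, and the column names (`neighborhood`, `sale_price`, `total_sf`, `overall_qual`) and tier labels are hypothetical.

```python
import pandas as pd

# Toy data for four made-up neighborhoods (illustrative only).
homes = pd.DataFrame({
    'neighborhood': ['A', 'A', 'B', 'B', 'C', 'C', 'D', 'D'],
    'sale_price': [100000, 120000, 200000, 220000,
                   300000, 330000, 400000, 420000],
    'total_sf': [1000, 1100, 1500, 1600, 1800, 2000, 2000, 2100],
    'overall_qual': [4, 5, 5, 6, 7, 7, 8, 9],
})
homes['price_per_sf'] = homes['sale_price'] / homes['total_sf']

# One row per neighborhood: typical price per square foot and quality.
summary = homes.groupby('neighborhood').agg(
    median_ppsf=('price_per_sf', 'median'),
    mean_qual=('overall_qual', 'mean'))

# Split neighborhoods into two equally sized tiers by price per square foot.
summary['tier'] = pd.qcut(summary['median_ppsf'], q=2,
                          labels=['NH_low', 'NH_high'])
print(summary)
```

With real data, the number of tiers and whether to bin on quality as well as price per square foot would be judgment calls to validate against held-out data.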