# 3 Testing of Different Regression Permutations
In this notebook, we work through a series of logical, iterative steps to find a good model for predicting the SalePrice in the Ames housing dataset.

Our general 'modus operandi' will be as follows:
1. Test the general model with all (relevant) features, and compare the usage of Ridge, Lasso, and vanilla linear regression.
2. From the first test, we will determine which of the 3 has the best performance (most likely Lasso).
3. We will test subsequent models with the best estimator strategy only to try and find the model with the best performance.
4. Once we have settled on something, we will test the other estimators again.

In this project, we will focus mostly on the impact specific variables have on the accuracy of our model, and on whether segments of the Ames dataset suggest that customers behave differently when certain conditions are met. This conjecture is based on the economic concept of **market segmentation**: some groups of customers/buyers have different considerations and incentives when making a purchase, and thus have to be marketed to individually and differently.

#### Packages

In [1]:
# Basic Packages
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import missingno as msno

# Sklearn modules
from sklearn.linear_model import LinearRegression, LassoCV, RidgeCV
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import r2_score, mean_squared_error

#### Functions

In [2]:
# Dummify Nominal Data
def dummify_nom(NA,features_df,exclusion_list):
    try:
        # 1. 'withNA': keeps every dummy column (drop_first = False)
        #    Note: as written, the filter below selects nominal fields with no missing data,
        #    which is the opposite of what the label suggests
        if NA == 'withNA':
            drop = False
            tmp = feat_summ[(feat_summ['ftypes'] == 'Nominal') & # Nominal data
                            ((feat_summ['train_null%'] == 0)&(feat_summ['test_null%'] == 0)) # No missing data
                            ].index 
        # 2. 'withoutNA': drops the first dummy column (drop_first = True)
        #    Note: as written, the filter below selects nominal fields that have missing data
        elif NA == 'withoutNA':
            drop = True
            tmp = feat_summ[(feat_summ['ftypes'] == 'Nominal') & # Nominal data
                            ((feat_summ['train_null%'] > 0)|(feat_summ['test_null%'] > 0)) # Has missing data
                            ].index 

        tmp = [i for i in tmp if i not in exclusion_list]
        
        # Get dummies on nominal data, and apply it to features_df
        return pd.get_dummies(features_df, columns = tmp, drop_first = drop)
    except Exception as err:
        # Report the underlying error instead of silently swallowing it
        print(f'did not perform get_dummies on nominal data: {err}')

In [3]:
# Add zero-filled dummy columns to each set for any columns that are present only in the other
def dummify_row_match(train_set,test_set):
    train_set.loc[:,[i for i in test_set.columns if i not in train_set.columns]] = 0
    test_set.loc[:,[i for i in train_set.columns if i not in test_set.columns]] = 0

In [4]:
def drop_row_match(train_set,test_set):
    # Columns present in one set but missing from the other
    test_drop = [i for i in test_set.columns if i not in train_set.columns]
    train_drop = [i for i in train_set.columns if i not in test_set.columns]
    
    # Drop them so that both sets share the same columns
    train_set_output = train_set.drop(train_drop, axis = 1)
    test_set_output = test_set.drop(test_drop, axis = 1)
    
    return train_set_output, test_set_output

In [5]:
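# Function for imputation and scaling
# The train, test and hidden sets are all required; the kaggleset flag is currently unused
def basic_modelprep(imputer, X_train_tar, X_test_tar, X_hidden_tar, kaggleset = False):
    # Impute Data (fit on the train set only, then apply to the other sets)
    if imputer == 'simple':  
        imp = SimpleImputer(strategy = 'mean')
        X_train_tmp = pd.DataFrame(imp.fit_transform(X_train_tar))
        X_test_tmp = pd.DataFrame(imp.transform(X_test_tar))
        X_hidden_tmp = pd.DataFrame(imp.transform(X_hidden_tar))
    else:
        # Fail early instead of hitting a NameError further down
        raise ValueError(f'Unsupported imputer: {imputer}')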

    # Scaling the data
    ss = StandardScaler()
    X_train_tmp = ss.fit_transform(X_train_tmp)
    X_test_tmp = ss.transform(X_test_tmp)
    X_hidden_tmp = ss.transform(X_hidden_tmp)
    
    return X_train_tmp, X_test_tmp, X_hidden_tmp

## 3.1 Regularisation with All Features, using Lasso and Ridge
We will use a standard regularised regression with all features included as our benchmark for base model performance, and tweak our approach as we move forward from there. This approach follows the "let the model do the work" mindset: we trust our regularisation and imputation algorithms to sift through the correlations and find the best predictors.
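One caveat with this setup is that `basic_modelprep` fits the imputer and scaler on the whole training split before `cross_val_score` runs, so each CV fold is scaled using statistics that include its own validation rows. A minimal sketch of an alternative, assuming synthetic data (`X_demo` and `y_demo` are illustrative stand-ins, not the Ames data), wraps the steps in a scikit-learn pipeline so they are refit inside each fold:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LassoCV
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix (assumption: not the Ames data)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 10))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=200)
X_demo[rng.random(X_demo.shape) < 0.05] = np.nan  # sprinkle in missing values

# Imputer and scaler are refit on the training part of every CV fold
pipe = make_pipeline(SimpleImputer(strategy='mean'),
                     StandardScaler(),
                     LassoCV(cv=5))
pipe_scores = cross_val_score(pipe, X_demo, y_demo, cv=5)
print(pipe_scores.mean())
```

Whether this changes the scores materially on the Ames data is not tested here; the sketch only shows the mechanics.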

In [6]:
train_df = pd.read_csv('datasets/train_ord.csv', index_col = 'Unnamed: 0')
test_df = pd.read_csv('datasets/test_ord.csv', index_col = 'Unnamed: 0')
feat_summ = pd.read_csv('datasets/feature_summary.csv', index_col = 'Unnamed: 0')

In [7]:
train_df.head()

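| Unnamed: 0 | Id | PID | MS SubClass | MS Zoning | Lot Frontage | Lot Area | Street | Alley | Lot Shape | Land Contour | ... | Screen Porch | Pool Area | Pool QC | Fence | Misc Feature | Misc Val | Mo Sold | Yr Sold | Sale Type | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 109 | 533352170 | 60 | RL | NaN | 13517 | Pave | NaN | 2 | Lvl | ... | 0 | 0 | 0 | 0 | NaN | 0 | 3 | 2010 | WD | 130500 |
| 1 | 544 | 531379050 | 60 | RL | 43.0 | 11492 | Pave | NaN | 2 | Lvl | ... | 0 | 0 | 0 | 0 | NaN | 0 | 4 | 2009 | WD | 220000 |
| 2 | 153 | 535304180 | 20 | RL | 68.0 | 7922 | Pave | NaN | 3 | Lvl | ... | 0 | 0 | 0 | 0 | NaN | 0 | 1 | 2010 | WD | 109000 |
| 3 | 318 | 916386060 | 60 | RL | 73.0 | 9802 | Pave | NaN | 3 | Lvl | ... | 0 | 0 | 0 | 0 | NaN | 0 | 4 | 2010 | WD | 174000 |
| 4 | 255 | 906425045 | 50 | RL | 82.0 | 14235 | Pave | NaN | 2 | Lvl | ... | 0 | 0 | 0 | 0 | NaN | 0 | 3 | 2010 | WD | 138500 |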


In [8]:
# Assigning X features, remove y and IDs from the list
excl_list = ['SalePrice','Id','PID']
features = [i for i in train_df.columns if i not in excl_list]

In [9]:
# Assigning X and y variables
X = train_df[features]
y = train_df['SalePrice']

In [10]:
X = train_df[features] # Re-select the raw features; dummifying an already-dummified X raises an error
X = dummify_nom('withNA',X,excl_list)
X = dummify_nom('withoutNA',X,excl_list)

In [11]:
X_hidden = test_df[features][:]
X_hidden = dummify_nom('withNA',X_hidden,excl_list)
X_hidden = dummify_nom('withoutNA',X_hidden,excl_list)

In [12]:
# Train/Test Split
# Keeping a fixed random state so as to maintain consistency between model runs
X_train, X_test, y_train, y_test = train_test_split(X,y, random_state = 42)

In [13]:
# Duplicating out train/test sets for this regression step
X_train_1 = X_train[:]
X_test_1 = X_test[:]

In [14]:
# Hidden features (X) for kaggle submission
X_hidden_1 = X_hidden[:]

# Remove any columns that do not exist in either the test set or the train set
X_train_1, X_hidden_1 = drop_row_match(X_train_1, X_hidden_1)
X_train_1, X_test_1 = drop_row_match(X_train_1,X_test_1)

In [15]:
# Streamlined imputation
X_train_1, X_test_1, X_hidden_1 = basic_modelprep(imputer = 'simple', 
                                                  X_train_tar = X_train_1, 
                                                  X_test_tar = X_test_1,
                                                  X_hidden_tar = X_hidden_1)

In [16]:
# Instantiate Regression models
lr = LinearRegression()
lasso = LassoCV(n_alphas = 200)
ridge = RidgeCV(alphas = np.linspace(0.1,10,200))

### 3.1.1 Linear Regression

In [17]:
# Cross Val Score for standard linear regression
lr_score = cross_val_score(lr,X_train_1,y_train,cv=5)
lr_score.mean()

-1.9164062693935097e+20

Clearly, the simple linear regression is abysmally poor at predicting anything here, so we will skip evaluating a vanilla linear regression from now on.
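A plausible explanation for a score this extreme (an assumption, not verified against the Ames data here) is collinearity among the dummy columns: a full set of dummies for one feature always sums to one, so several full sets are linearly dependent, and unregularised least squares has no stable solution. The toy example below, with hypothetical values, shows the rank deficiency that full dummy sets introduce and how `drop_first` removes it:

```python
import numpy as np
import pandas as pd

# Toy nominal features (hypothetical values for illustration only)
toy = pd.DataFrame({'zoning': ['RL', 'RM', 'RL', 'FV', 'RM', 'FV'],
                    'street': ['Pave', 'Grvl', 'Grvl', 'Pave', 'Pave', 'Grvl']})

full = pd.get_dummies(toy, drop_first=False).astype(float)
reduced = pd.get_dummies(toy, drop_first=True).astype(float)

# Each full dummy set sums to 1, so the two sets are linearly dependent
rank_full = np.linalg.matrix_rank(full.to_numpy())
rank_reduced = np.linalg.matrix_rank(reduced.to_numpy())
print(full.shape[1], rank_full)        # rank is lower than the column count
print(reduced.shape[1], rank_reduced)  # rank equals the column count
```

Lasso and Ridge penalise large coefficients, which is one reason they can remain usable on the same design matrix.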

### 3.1.2 Linear Regression with Lasso Regularisation (Run 1)

In [18]:
# LassoCV Score
lasso_score = cross_val_score(lasso,X_train_1,y_train,cv=5)
lasso_score.mean()

0.8232714217330204

In [19]:
# LassoCV train vs test scores
lasso.fit(X_train_1,y_train)
print(lasso.score(X_train_1,y_train))
print(lasso.score(X_test_1,y_test))

0.8856330124618806
0.8896044014377833


In [20]:
# LassoCV Root Mean Squared Error
y_pred = lasso.predict(X_test_1)
mean_squared_error(y_test,y_pred, squared = False)

26035.201272306058

In [21]:
# Pushing kaggle .csv submissions
# y_pred results
y_pred_kaggle = lasso.predict(X_hidden_1)
y_312_pred_kaggle = y_pred_kaggle # Use for later

# Setting up kaggle submission .csv
kaggle_exp = pd.DataFrame({'Id':test_df['Id'],
                           'SalePrice': y_pred_kaggle})
kaggle_312_exp = kaggle_exp[:] # Use for later
kaggle_exp.to_csv('Subm/lr_3_1_1_lasso_all_features.csv', index = False)

### 3.1.3 Linear Regression with Ridge Regularisation (Run 2)

In [22]:
# RidgeCV Score
ridge_score = cross_val_score(ridge,X_train_1,y_train,cv=5)
ridge_score.mean()

0.7798111325614392

In [23]:
# RidgeCV train vs test scores
ridge.fit(X_train_1,y_train)
print(ridge.score(X_train_1,y_train))
print(ridge.score(X_test_1,y_test))

0.9089554587068819
0.8704728499744766


In [24]:
# RidgeCV Root Mean Squared Error
y_pred = ridge.predict(X_test_1)
mean_squared_error(y_test,y_pred, squared = False)

28201.062114810604

In [25]:
# Pushing kaggle .csv submissions
# y_pred results
y_pred_kaggle = ridge.predict(X_hidden_1)

# Setting up kaggle submission .csv
kaggle_exp = pd.DataFrame({'Id':test_df['Id'],
                           'SalePrice': y_pred_kaggle})
kaggle_exp.to_csv('Subm/lr_3_1_2_ridge_all_features.csv', index = False)

### 3.1.4 Conclusions
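The summary of results is as follows ($R^2$; cross-validated mean for linear regression, test set score for the regularised models):
1. Linear Regression (R2): $-1.9 \times 10^{20}$
2. LassoCV Regression (R2): $0.89$
3. RidgeCV Regression (R2): $0.87$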

Comparing RMSE values:
1. LassoCV (RMSE): $26,035.20$
2. RidgeCV (RMSE): $28,201.06$

From the above results, Lasso seems to be slightly better in this respect, though the two methods aren't that dissimilar in output. As this is just the first iteration, the RMSE sits at a hefty ~26k.
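For reference, the RMSE reported in these cells is the square root of the mean squared error, expressed in the same units as SalePrice. The sketch below computes it with `np.sqrt`, which avoids relying on the `squared` argument of `mean_squared_error` (deprecated in newer scikit-learn releases). The arrays are made-up numbers, not notebook outputs:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up prices for illustration (not taken from the Ames data)
y_true_demo = np.array([200000.0, 150000.0, 320000.0, 275000.0])
y_pred_demo = np.array([210000.0, 140000.0, 300000.0, 280000.0])

# RMSE = sqrt(mean((y_true - y_pred)^2))
rmse_demo = np.sqrt(mean_squared_error(y_true_demo, y_pred_demo))
print(round(rmse_demo, 2))  # 12500.0
```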

Also note that each re-run of the codebook will yield different results.

### 3.1.5 Impact of Removing Fields with High Nulls (Run 3)
In our data cleaning and EDA section, we identified some fields, namely 'Alley', 'Pool QC', 'Fence', and 'Misc Feature', which had very high null counts. The below section will test the effect of such fields being included in the analysis.

In [26]:
excl_list = ['SalePrice','Id','PID','Alley','Pool QC','Fence','Misc Feature']
excl_list_mod = [j for j in X_train.columns for i in excl_list if i in j]  
features = [i for i in X_train.columns if i not in excl_list_mod]
X_train_2 = X_train[features]
X_test_2 = X_test[features]

In [27]:
features_hidden = [i for i in features if i in X_hidden.columns]
X_hidden_2 = X_hidden[features_hidden]

# Remove any columns that do not exist in either the test set or the train set
X_train_2, X_hidden_2 = drop_row_match(X_train_2, X_hidden_2)
X_train_2, X_test_2 = drop_row_match(X_train_2,X_test_2)

In [28]:
X_train_2,X_test_2,X_hidden_2 = basic_modelprep('simple',X_train_2,X_test_2, X_hidden_2)

In [29]:
lasso = LassoCV(n_alphas = 200) # instantiate
lasso_score = cross_val_score(lasso,X_train_2,y_train,cv=5)
lasso_score.mean()

0.8286941584958025

In [30]:
lasso.fit(X_train_2,y_train)
print(lasso.score(X_train_2,y_train))
print(lasso.score(X_test_2,y_test))    

0.8869334530211209
0.8953275180946924


In [31]:
# LassoCV Root Mean Squared Error
y_pred = lasso.predict(X_test_2)
mean_squared_error(y_test,y_pred, squared = False)

25351.363424510546

The above results show that removing the high null fields yields only a marginal change (test RMSE of 25,351 versus 26,035).

In [32]:
# Pushing kaggle .csv submissions
# y_pred results
y_pred_kaggle = lasso.predict(X_hidden_2)

# Setting up kaggle submission .csv
kaggle_exp = pd.DataFrame({'Id':test_df['Id'],
                           'SalePrice': y_pred_kaggle})
kaggle_exp.to_csv('Subm/lr_3_1_3_lasso_remove_high_null.csv', index = False)

## 3.2 Regression with only features with $>50\%$ correlation with SalePrice (Run 4)
As we start to narrow down regression tactics, one obvious choice is to choose only those fields strongly correlated to the SalePrice. 
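The selection rule used here can be sketched on a toy frame: compute each column's correlation with the target once, then keep the columns whose absolute correlation exceeds 0.5, so that strongly negative relationships are retained as well. The column names and generated values below are hypothetical:

```python
import numpy as np
import pandas as pd

# Toy data: 'area' and 'age' drive the price, 'noise' does not (hypothetical columns)
rng = np.random.default_rng(0)
toy = pd.DataFrame({'area': rng.normal(1500, 300, 300),
                    'noise': rng.normal(0, 1, 300),
                    'age': rng.normal(30, 10, 300)})
toy['SalePrice'] = (100 * toy['area'] - 3000 * toy['age']
                    + rng.normal(0, 10000, 300))

# Correlation of every column with the target, computed once
corr_with_price = toy.corr()['SalePrice'].drop('SalePrice')

# Keep features whose absolute correlation exceeds the threshold
selected = corr_with_price[corr_with_price.abs() > 0.5].index.tolist()
print(selected)
```

Computing the correlations on the training split only would keep the held-out rows from influencing feature selection.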

In [33]:
# Dummify train_df and get all the fields with high correlation to SalePrice
train_df_dummy = train_df[:]
train_df_dummy = dummify_nom('withNA',train_df_dummy,['Id','PID'])
train_df_dummy = dummify_nom('withoutNA',train_df_dummy,['Id','PID'])
corr_with_price = train_df_dummy.corr()['SalePrice'] # Compute the correlation matrix once
high_corr_list = corr_with_price[corr_with_price.abs() > 0.5].index # Correlation above 50% or below -50%
high_corr_list = [i for i in high_corr_list if i != 'SalePrice']

In [34]:
# Modifying our X_train and X_test to fit the confines of the problem
X_train_3 = X_train[high_corr_list]
X_test_3 = X_test[high_corr_list]
X_hidden_3 = X_hidden[high_corr_list]

In [35]:
X_train_3, X_test_3, X_hidden_3 = basic_modelprep('simple',X_train_3,X_test_3,X_hidden_3)

In [36]:
lasso = LassoCV(n_alphas = 200) # instantiate
lasso_score = cross_val_score(lasso,X_train_3,y_train,cv=5)
lasso_score.mean()

0.7871675077111492

In [37]:
lasso.fit(X_train_3,y_train)
print(lasso.score(X_train_3,y_train))
print(lasso.score(X_test_3,y_test))    

0.8056102360231014
0.8609787542615951


In [38]:
# LassoCV Root Mean Squared Error
y_pred = lasso.predict(X_test_3)
mean_squared_error(y_test,y_pred, squared = False)

29216.329113569987

In [39]:
# Pushing kaggle .csv submissions
# y_pred results
y_pred_kaggle = lasso.predict(X_hidden_3)

# Setting up kaggle submission .csv
kaggle_exp = pd.DataFrame({'Id':test_df['Id'],
                           'SalePrice': y_pred_kaggle})
kaggle_exp.to_csv('Subm/lr_3_1_4_lasso_high_corr.csv', index = False)

### 3.2.1 Removing high null fields again from 3.2 (Run 5)

In [40]:
excl_list = ['SalePrice','Id','PID','Alley','Pool QC','Fence','Misc Feature']
excl_list_mod = [j for j in X_train[high_corr_list].columns for i in excl_list if i in j]  
features = [i for i in X_train[high_corr_list] if i not in excl_list_mod]

X_train_4 = X_train[high_corr_list][features]
X_test_4 = X_test[high_corr_list][features]
X_hidden_4 = X_hidden[high_corr_list][features]

In [41]:
X_train_4, X_test_4, X_hidden_4 = basic_modelprep('simple',X_train_4, X_test_4, X_hidden_4)

In [42]:
lasso = LassoCV(n_alphas = 200) # instantiate
lasso_score = cross_val_score(lasso,X_train_4,y_train,cv=5)
lasso_score.mean()

0.7871675077111492

In [43]:
lasso.fit(X_train_4,y_train)
print(lasso.score(X_train_4,y_train))
print(lasso.score(X_test_4,y_test))    

0.8056102360231014
0.8609787542615951


In [44]:
# LassoCV Root Mean Squared Error
y_pred = lasso.predict(X_test_4)
mean_squared_error(y_test,y_pred, squared = False)

29216.329113569987

In [45]:
# Pushing kaggle .csv submissions
# y_pred results
y_pred_kaggle = lasso.predict(X_hidden_4)

# Setting up kaggle submission .csv
kaggle_exp = pd.DataFrame({'Id':test_df['Id'],
                           'SalePrice': y_pred_kaggle})
kaggle_exp.to_csv('Subm/lr_3_1_5_lasso_high_corr_remove_high_null.csv', index = False)

**Conclusions**  
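From the above results, it is apparent that focusing entirely on the highly correlated features does not naturally translate into a more accurate model. In this case, performance actually worsens: the test RMSE rises to around 29,000, up from around 26,000 in the all-features Lasso run. Runs 4 and 5 also produce identical scores, which suggests that none of the high null fields were among the highly correlated features to begin with.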

## 3.3 Looking into key features as metrics for segmentation
In economics, rarely will you find market demand and supply that is consistent across various market segments. Be it by socio-economic class, geography, or other social factors, customer behaviour may vary wildly and in ways that may not be very easy to predict.  

In application, this would mean we can assume that these market segments would have different relationships with the price (which is our metric for market equilibrium, so to speak), and thus would have different correlation/regressive characteristics.

In this section, we will look at identifying some categorical fields to use as proxies to 'split' our dataset up, and generate separate models to predict their movements.

The methodology for selecting fields can be found in the EDA portion of this project.

We will be looking mainly at the following choice fields:
1. Lot Shape - Irregular (0) and Regular (1);
2. Overall Quality - 1-5 (0), 6-7 (1), 8-10 (2).
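The segmentation idea can be sketched on synthetic data: two segments with different price-per-area slopes are fitted once with a single pooled model and once with one model per segment. All names and numbers here are illustrative assumptions, not results from the Ames data:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
n = 400
demo = pd.DataFrame({'segment': rng.integers(0, 2, n),
                     'area': rng.uniform(500, 3000, n)})
# Assumed behaviour: segment 1 pays far more per unit of area than segment 0
slope = np.where(demo['segment'] == 1, 150.0, 50.0)
demo['price'] = slope * demo['area'] + rng.normal(0, 5000, n)

# One pooled model across both segments
pooled = LinearRegression().fit(demo[['area']], demo['price'])
pooled_r2 = pooled.score(demo[['area']], demo['price'])

# One model per segment, stored in a dict keyed by segment label
segment_models = {}
segment_r2 = {}
for label, part in demo.groupby('segment'):
    model = LinearRegression().fit(part[['area']], part['price'])
    segment_models[label] = model
    segment_r2[label] = model.score(part[['area']], part['price'])

print(pooled_r2, segment_r2)
```

The gap between pooled and per-segment fit exists here by construction; on real data each segment also has fewer rows to learn from, which can work against it.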

### 3.3.1 Lot Shape (Run 6)

In [46]:
# Function for lot shape conversion
def lot_shape_conv(df):
    return [1 if i == 3 else 0 for i in df['Lot Shape']]

In [47]:
# Convert Lot Shape into 0 (Irregular) and 1 (Regular)
X_ls = X.copy() # Explicit copy, so the assignment below cannot touch X
X_hidden_ls = X_hidden.copy()
if 3 in list(X['Lot Shape']): # Make sure the code does not execute if already transformed
    X_ls['Lot Shape'] = lot_shape_conv(X_ls)
    X_hidden_ls['Lot Shape'] = lot_shape_conv(X_hidden_ls)

In [48]:
# Split the X and y into the Lot Shape subsets
# Lot Shape = 0 (Irregular)
X0 = X_ls[X_ls['Lot Shape'] == 0]
y0 = y[X0.index]
X0_hidden = X_hidden_ls[X_hidden_ls['Lot Shape'] == 0]

#Lot Shape = 1 (Regular)
X1 = X_ls[X_ls['Lot Shape'] == 1]
y1 = y[X1.index]
X1_hidden = X_hidden_ls[X_hidden_ls['Lot Shape'] == 1]

In [49]:
# Train-Test Split for both subsets
X0_train, X0_test, y0_train, y0_test = train_test_split(X0,y0,random_state = 42)
X1_train, X1_test, y1_train, y1_test = train_test_split(X1,y1,random_state = 42)

#### 3.3.1.1 Lot Shape = 0; Regression (Lasso Reg)

In [50]:
# Remove any columns that do not exist in either the test set or the train set
X0_train, X0_hidden = drop_row_match(X0_train, X0_hidden)
X0_train, X0_test = drop_row_match(X0_train,X0_test)

# Simple Imputation + Standard Scaler
X0_train, X0_test, X0_hidden = basic_modelprep('simple',X0_train,X0_test,X0_hidden)

lasso = LassoCV(n_alphas = 200) # instantiate
lasso_score = cross_val_score(lasso,X0_train,y0_train,cv=5)
lasso_score.mean()

0.8685281612187381

In [51]:
lasso.fit(X0_train,y0_train)
print(lasso.score(X0_train,y0_train))
print(lasso.score(X0_test,y0_test)) 
# This is an abysmally bad score!

0.9059705492030885
0.24571922493554366


In [52]:
# LassoCV Root Mean Squared Error
y0_pred = lasso.predict(X0_test)
mean_squared_error(y0_test,y0_pred, squared = False)

70627.2540937239

In [53]:
# Kaggle y-pred result
y0_pred_kaggle = lasso.predict(X0_hidden)

#### 3.3.1.2 Lot Shape = 1, Regression (Lasso Reg)

In [54]:
# Remove any columns that do not exist in either the test set or the train set
X1_train, X1_hidden = drop_row_match(X1_train, X1_hidden)
X1_train, X1_test = drop_row_match(X1_train,X1_test)

# Simple Imputation + Standard Scaler
X1_train, X1_test, X1_hidden = basic_modelprep('simple',X1_train,X1_test,X1_hidden)

lasso = LassoCV(n_alphas = 200) # instantiate
lasso_score = cross_val_score(lasso,X1_train,y1_train,cv=5)
lasso_score.mean()

0.9085348625083224

In [55]:
lasso.fit(X1_train,y1_train)
print(lasso.score(X1_train,y1_train))
print(lasso.score(X1_test,y1_test)) 
# This is a really good score!

0.9341604340943529
0.8886041685265025


In [56]:
# LassoCV Root Mean Squared Error
y1_pred = lasso.predict(X1_test)
mean_squared_error(y1_test,y1_pred, squared = False)

18962.411674414358

In [57]:
# Kaggle y-pred result
y1_pred_kaggle = lasso.predict(X1_hidden)

#### 3.3.1.3 Combining the predictions into a .csv to submit to Kaggle

In [58]:
# This is a roundabout way of preserving the Id mapping for the SalePrice
X0_tmp = X_hidden_ls[X_hidden_ls['Lot Shape'] == 0][['Lot Shape']]
X0_tmp['Id'] = test_df['Id']
X0_tmp = X0_tmp[['Id']]
X0_tmp['SalePrice'] = y0_pred_kaggle

X1_tmp = X_hidden_ls[X_hidden_ls['Lot Shape'] == 1][['Lot Shape']]
X1_tmp['Id'] = test_df['Id']
X1_tmp = X1_tmp[['Id']]
X1_tmp['SalePrice'] = y1_pred_kaggle

X_tmp = pd.concat([X0_tmp,X1_tmp]).sort_index()

In [59]:
# Pushing kaggle .csv submissions
# Setting up kaggle submission .csv
kaggle_exp = X_tmp
kaggle_exp.to_csv('Subm/lr_3_3_1_Groupby_LotShape.csv', index = False)

#### Conclusion to Lot Shape Groupings
Summarising the findings from this iteration:
1. The portion of the set that is irregular (grouping IR3/IR2/IR1 classifications into a single group) has quite poor accuracy, giving us a 0.91/0.25 split in $R^2$ between the train and test sets.
2. Conversely, the model predicts the regular lot shape segment well (test $R^2$ of 0.89).
3. Overall (from Kaggle), the composite segmented model produces **a respectable RMSE of 21,556 (private score).**

This score is likely the result of very good predictions from the Lot Shape = 1 side of the set, diluted by some inaccurate predictions from the Lot Shape = 0 subset.

As an experiment, we will substitute the y-predictions for Lot Shape = 0 with the y-pred from 3.1.2 (Lasso Regularisation benchmark), which was our best score before this, and see how that changes the score.

In [60]:
# Redoing the X_tmp setup, but replacing the y-pred data with that from 3.1.2
X0_tmp = X_hidden_ls[X_hidden_ls['Lot Shape'] == 0][['Lot Shape']]
X0_tmp['Id'] = test_df['Id']
X0_tmp = X0_tmp[['Id']]
X0_tmp['SalePrice'] = kaggle_312_exp['SalePrice'] # From 3.1.2

X1_tmp = X_hidden_ls[X_hidden_ls['Lot Shape'] == 1][['Lot Shape']]
X1_tmp['Id'] = test_df['Id']
X1_tmp = X1_tmp[['Id']]
X1_tmp['SalePrice'] = y1_pred_kaggle

X_tmp = pd.concat([X0_tmp,X1_tmp]).sort_index()

In [61]:
# Pushing kaggle .csv submissions
# Setting up kaggle submission .csv
kaggle_exp = X_tmp
kaggle_exp.to_csv('Subm/lr_3_3_1_Groupby_LotShape_subwprev.csv', index = False)

Peculiarly, the substitution does not improve the score, and even worsens it a little (private score of 22,134). This effect could be revisited and studied further.

### 3.3.2 Overall Quality (Run 7)


In [62]:
# Convert X and y variables (and X_hidden) to new scale factors
X_OQ = X.copy() # Explicit copy, so the assignment below cannot touch X
X_hidden_OQ = X_hidden.copy()

def OQ_conv(df):
    return [2 if i > 7 else 0 if i < 6 else 1 for i in df['Overall Qual']]

X_OQ['Overall Qual'] = OQ_conv(X_OQ)
X_hidden_OQ['Overall Qual'] = OQ_conv(X_hidden_OQ)

In [63]:
# Split into subsets
def OQ_split(df,rank):
    return df[df['Overall Qual'] == rank]

# Xi_OQ and Xi_hidden_OQ sets
for i in [0,1,2]:
    globals()['X'+str(i)+'_OQ'] = OQ_split(X_OQ,i) # Convert X subsets
    globals()['X'+str(i)+'_hidden_OQ'] = OQ_split(X_hidden_OQ,i) # Convert X_hidden subsets

#yi_OQ sets
for i in [0,1,2]:
    globals()['y'+str(i)+'_OQ'] = y[globals()['X'+str(i)+'_OQ'].index] # Split out y variables as well

In [64]:
# Train-Test Split for all subsets
for i in [0,1,2]:
    (globals()['X'+str(i)+'_train'], 
    globals()['X'+str(i)+'_test'], 
    globals()['y'+str(i)+'_train'], 
    globals()['y'+str(i)+'_test']) = train_test_split(globals()['X'+str(i)+'_OQ'],
                                                     globals()['y'+str(i)+'_OQ'],
                                                     random_state = 42)

# There are now 3 sets of Xi_train and Xi_test, and yi_train and yi_test sets, for each of the 3 segments.

#### 3.3.2.0 Section Functions

In [65]:
def standard_modifiers(rank,str1 = None):
    str1 = '_' + str1 if str1 is not None else ''
    # Remove any columns that do not exist in either the test set or the train set
    (globals()['X'+str(rank)+'_train'], 
     globals()['X'+str(rank)+'_hidden'+str1]) = drop_row_match(globals()['X'+str(rank)+'_train'], 
                                                               globals()['X'+str(rank)+'_hidden'+str1])
    (globals()['X'+str(rank)+'_train'], 
     globals()['X'+str(rank)+'_test']) = drop_row_match(globals()['X'+str(rank)+'_train'], 
                                                        globals()['X'+str(rank)+'_test'])

    # Simple Imputation + Standard Scaler
    (globals()['X'+str(rank)+'_train'], 
     globals()['X'+str(rank)+'_test'], 
     globals()['X'+str(rank)+'_hidden'+str1]) = basic_modelprep('simple',
                                                                globals()['X'+str(rank)+'_train'],
                                                                globals()['X'+str(rank)+'_test'],
                                                                globals()['X'+str(rank)+'_hidden'+str1])

In [66]:
def get_R2(rank):
    lasso.fit(globals()['X'+str(rank)+'_train'],globals()['y'+str(rank)+'_train'])
    print(lasso.score(globals()['X'+str(rank)+'_train'],globals()['y'+str(rank)+'_train']))
    print(lasso.score(globals()['X'+str(rank)+'_test'],globals()['y'+str(rank)+'_test'])) 

In [67]:
# LassoCV Root Mean Squared Error
def get_RMSE(rank):
    globals()['y'+str(rank)+'_pred'] = lasso.predict(globals()['X'+str(rank)+'_test'])
    return mean_squared_error(globals()['y'+str(rank)+'_test'],globals()['y'+str(rank)+'_pred'], squared = False)

#### 3.3.2.1 Overall Qual = 0 (Ordinal rank 1-5)

In [68]:
# X0_train, X0_test, X0_hidden_OQ
# Remove columns that do not exist in each of the 3 sets, and imputes and scales data
standard_modifiers(rank = 0, str1 = 'OQ')

In [69]:
lasso = LassoCV(n_alphas = 200) # instantiate
lasso_score = cross_val_score(lasso,X0_train,y0_train,cv=5)
lasso_score.mean()

0.7822995551584105

In [70]:
# Getting R^2 score for train and test sets
get_R2(0)

0.8487943069293179
0.7864742583961891


In [71]:
get_RMSE(0)

14550.191139758532

In [72]:
# Kaggle y-pred result
y0_pred_kaggle = lasso.predict(X0_hidden_OQ)
len(y0_pred_kaggle)

344

#### 3.3.2.2 Overall Qual = 1 (Ordinal rank 6-7)

In [73]:
# X1_train, X1_test, X1_hidden_OQ
# Remove columns that do not exist in each of the 3 sets, and imputes and scales data
standard_modifiers(rank = 1, str1 = 'OQ')

In [74]:
lasso = LassoCV(n_alphas = 200, max_iter = 3000) # instantiate, increased iterations due to error
lasso_score = cross_val_score(lasso,X1_train,y1_train,cv=5)
lasso_score.mean()

0.8230411279437366

In [75]:
get_R2(1)

0.8809606040844872
0.8135164314313986


In [76]:
get_RMSE(1)

18713.366659566753

In [77]:
# Kaggle y-pred result
y1_pred_kaggle = lasso.predict(X1_hidden_OQ)
len(y1_pred_kaggle)

397

#### 3.3.2.3 Overall Qual = 2 (Ordinal rank 8-10)

In [78]:
# X2_train, X2_test, X2_hidden_OQ
# Remove columns that do not exist in each of the 3 sets, and imputes and scales data
standard_modifiers(rank = 2, str1 = 'OQ')

In [79]:
lasso = LassoCV(n_alphas = 200, max_iter = 3000) # instantiate, increased iterations due to error
lasso_score = cross_val_score(lasso,X2_train,y2_train,cv=5)
lasso_score.mean()

-0.13492870102478205

In [80]:
get_R2(2)

0.7926988117432899
0.6885227477034966


In [81]:
get_RMSE(2)

42278.99240218937

In [82]:
# Kaggle y-pred result
y2_pred_kaggle = lasso.predict(X2_hidden_OQ)
len(y2_pred_kaggle)

137

#### 3.3.2.4 Combine Kaggle Result

In [83]:
# Streamline formatting of data for concatenating
def format_kaggle_df(rank):
    globals()['X'+str(rank)+'_tmp'] = X_hidden_OQ[X_hidden_OQ['Overall Qual'] == rank][['Overall Qual']]
    globals()['X'+str(rank)+'_tmp']['Id'] = test_df['Id']
    globals()['X'+str(rank)+'_tmp'] = globals()['X'+str(rank)+'_tmp'][['Id']]
    globals()['X'+str(rank)+'_tmp']['SalePrice'] = globals()['y'+str(rank)+'_pred_kaggle']

In [84]:
# Format export for each step
for i in [0,1,2]:
    format_kaggle_df(i)

# Combine data and export to .csv
X_tmp = pd.concat([X0_tmp,X1_tmp,X2_tmp]).sort_index()
kaggle_exp = X_tmp
kaggle_exp.to_csv('Subm/lr_3_3_2_OverallQual(1to5_6to7_8to10).csv', index = False)

#### Some notes on the result
RMSE values for each of the subsets:
1. Test RMSE(Overall Qual = 0): 14,550
2. Test RMSE(Overall Qual = 1): 18,713
3. Test RMSE(Overall Qual = 2): 42,279

The results partly support our conjecture: the buckets with more data performed better on the test set than bucket 2 (_8-10 score ordinal data_). Bucket 2 also had a negative mean cross-validation $R^2$ (-0.13), which points to unstable fits on that small subset. On Kaggle, the RMSE (_private score_) is around 21,493, which roughly lines up with the average performance of the 3 subsets.

**Possible reasons for such a disparity could be:**
1. The lack of enough training data for bucket 2;
2. Perhaps also the correlation landscape for each subset is different as well.
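As a rough sanity check on the "average performance" remark, segment RMSEs can be pooled by weighting each squared RMSE by its row count. The sketch below uses the test-split RMSEs listed above and, as an assumption, the hidden-set bucket sizes (344, 397, 137) as weights. The true per-bucket errors on the Kaggle set are unknown, so this is only indicative:

```python
import numpy as np

# Test-split RMSE per Overall Qual bucket (from the runs above)
segment_rmse = np.array([14550.19, 18713.37, 42278.99])
# Assumption: weight by the number of hidden-set rows in each bucket
segment_n = np.array([344, 397, 137])

# Pooled RMSE = sqrt(sum(n_i * rmse_i^2) / sum(n_i))
pooled_rmse = np.sqrt(np.sum(segment_n * segment_rmse ** 2) / segment_n.sum())
print(round(pooled_rmse))
```

Under these assumptions the pooled figure comes out at roughly 22,800, in the same ballpark as the private score.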

Let's look at the Overall Quality set again, but only take the top 20 correlated terms for each set.

### 3.3.3 Overall Quality, only top 20 correlated features (Run 8)

In [85]:
# Taking the dataframes from 3.3.2, Xi_OQ, yi_OQ, and Xi_hidden_OQ, which are already dummified and split
# Step 1: Get the top 20 correlated features for each segment
OQ_corrlist = pd.read_csv('datasets/3_3_3_corrlist.csv', index_col = 'Unnamed: 0')
OQ_corrlist.shape

(20, 3)

In [86]:
# Convert X and y variables (and X_hidden) to new scale factors
X_OQ = X.copy() # Explicit copy, so the assignment below cannot touch X
X_hidden_OQ = X_hidden.copy()

# Reusing OQ_conv from 3.3.2
X_OQ['Overall Qual'] = OQ_conv(X_OQ)
X_hidden_OQ['Overall Qual'] = OQ_conv(X_hidden_OQ)

In [87]:
# Reusing OQ_split from 3.3.2
# Xi_OQ and Xi_hidden_OQ sets
for i in [0,1,2]:
    globals()['X'+str(i)+'_OQ'] = OQ_split(X_OQ,i) # Convert X subsets
    globals()['X'+str(i)+'_hidden_OQ'] = OQ_split(X_hidden_OQ,i) # Convert X_hidden subsets

#yi_OQ sets
for i in [0,1,2]:
    globals()['y'+str(i)+'_OQ'] = y[globals()['X'+str(i)+'_OQ'].index] # Split out y variables as well

In [88]:
# Selecting only top 20 correlated features
for i in [0,1,2]:
    corr_list = [j for j in list(OQ_corrlist['OQ'+str(i)]) 
                 if (j in globals()['X'+str(i)+'_OQ'].columns)&(j in globals()['X'+str(i)+'_hidden_OQ'].columns)]
    globals()['X'+str(i)+'_OQ2'] = globals()['X'+str(i)+'_OQ'][corr_list]
    globals()['X'+str(i)+'_hidden_OQ2'] = globals()['X'+str(i)+'_hidden_OQ'][corr_list]

In [89]:
# Train-Test Split for all subsets
for i in [0,1,2]:
    (globals()['X'+str(i)+'_train'], 
    globals()['X'+str(i)+'_test'], 
    globals()['y'+str(i)+'_train'], 
    globals()['y'+str(i)+'_test']) = train_test_split(globals()['X'+str(i)+'_OQ2'],
                                                     globals()['y'+str(i)+'_OQ'],
                                                     random_state = 42)

#### 3.3.3.1 Overall Qual = 0, 20 Top Corr Features

In [90]:
standard_modifiers(0,'OQ2')

In [91]:
lasso = LassoCV(n_alphas = 200) # instantiate
lasso_score = cross_val_score(lasso,X0_train,y0_train,cv=5)
lasso_score.mean()

0.6975762042546677

In [92]:
# Getting R^2 score for train and test sets
get_R2(0)

0.7282923297563675
0.7024081489967064


In [93]:
# Get test RMSE score
get_RMSE(0)

17177.26819078559

In [94]:
# Kaggle y-pred result
y0_pred_kaggle = lasso.predict(X0_hidden_OQ2)
len(y0_pred_kaggle)

344

#### 3.3.3.2 Overall Qual = 1, 20 Top Corr Features

In [95]:
standard_modifiers(1,'OQ2')

In [96]:
lasso = LassoCV(n_alphas = 200) # instantiate
lasso_score = cross_val_score(lasso,X1_train,y1_train,cv=5)
lasso_score.mean()

0.7709888719477063

In [97]:
# Getting R^2 score for train and test sets
get_R2(1)

0.795370880062981
0.761482481360553


In [98]:
# Get test RMSE score
get_RMSE(1)

21163.70906303993

In [99]:
# Kaggle y-pred result
y1_pred_kaggle = lasso.predict(X1_hidden_OQ2)
len(y1_pred_kaggle)

397

#### 3.3.3.3 Overall Qual = 2, 20 Top Corr Features

In [100]:
standard_modifiers(2,'OQ2')

In [101]:
lasso = LassoCV(n_alphas = 200, max_iter = 3000) # instantiate
lasso_score = cross_val_score(lasso,X2_train,y2_train,cv=5)
lasso_score.mean()

-0.18738343687492737

In [102]:
# Getting R^2 score for train and test sets
get_R2(2)

0.4795727947429209
0.3869937058565498


In [103]:
# Get test RMSE score
get_RMSE(2)

59312.18564548424

In [104]:
# Kaggle y-pred result
y2_pred_kaggle = lasso.predict(X2_hidden_OQ2)
len(y2_pred_kaggle)

137

#### 3.3.3.4 Combine Kaggle Result

In [105]:
# Format export for each step
for i in [0,1,2]:
    format_kaggle_df(i)

# Combine data and export to .csv
X_tmp = pd.concat([X0_tmp,X1_tmp,X2_tmp]).sort_index()
kaggle_exp = X_tmp
kaggle_exp.to_csv('Subm/lr_3_3_2_OverallQual(1to5_6to7_8to10)_20topcorr.csv', index = False)

#### 3.3.3.5 Conclusions on Overall Quality
1. Compared with the segmented model using all features (3.3.2), keeping only the 20 most correlated features per segment worsens the predictions, driving up the test RMSE in every segment (17,177 vs 14,550; 21,164 vs 18,713; 59,312 vs 42,279).
2. Although it is not shown in the notebook (due to time constraints), the performance of the model improves as the number of variables is increased (in order of correlation to the particular segment).
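The sweep described in point 2 could be set up along the following lines. This is a sketch on synthetic data in which every feature has a true effect, so the upward trend is there by construction; it illustrates the mechanics only and says nothing about the Ames results. For brevity the ranking is done on the full sample, which a stricter version would do inside each fold:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LassoCV
from sklearn.model_selection import cross_val_score

# Synthetic data: 30 features, each with a true effect of decreasing size (assumption)
rng = np.random.default_rng(7)
X_syn = pd.DataFrame(rng.normal(size=(300, 30)),
                     columns=[f'f{i}' for i in range(30)])
y_syn = X_syn.to_numpy() @ np.linspace(3.0, 0.1, 30) + rng.normal(size=300)

# Rank features by absolute correlation with the target
ranked = (X_syn.corrwith(pd.Series(y_syn))
          .abs()
          .sort_values(ascending=False)
          .index)

# Cross-validated R^2 as more top-ranked features are admitted
sweep = {}
for k in (5, 10, 20, 30):
    cols = list(ranked[:k])
    sweep[k] = cross_val_score(LassoCV(cv=5), X_syn[cols], y_syn, cv=5).mean()
print(sweep)
```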

## 4 Conclusions and Limitations
The results from our optimisations seem to suggest that market segments do have differing wants, and thus segmented data reacts to various factors and variables differently. 

Segmented models do show promise as good predictors, but more tuning is required to improve their performance further.

**Final (best) run:**  
LassoCV run on Overall Quality segmentation, run 7  
Kaggle RMSE score: 21,493


#### Limitations
1. There was not enough time to do a full exploration of all the features within the dataset. With more tweaking, it is very likely that some of the features unexplored would have impacted the performance of the model to a larger degree.
2. I chose not to focus on too many things when trying to optimise and explore the effects of the variables on the overall model, such as noise and (direct) inter-correlation analysis. These things ultimately have an impact on the accuracy of the model, and clearing the noise would definitely improve the performance as well.

## 5 Future Improvements
Due to time constraints, a lot of optimisations were not implemented. The following can be improved on should there come a time to further optimise the resultant models.

1. Keep the naming of the models consistent and unique, allowing for the trained models to be called easily later on when needed.
2. Some of the algorithms and transformations can be streamlined into more general, universal functions, with a little standardisation and tweaking.
3. Use a cleaner naming convention for the runs and model variables, allowing for more powerful and generalised functions to be leveraged on.
4. Creating an algorithm to sift through variables, perhaps 1-by-1, and their significance to each subset, might yield a more optimal result in most cases. There is a good chance that the variance seen in the performance of each segment (e.g. OQ = 0, 1, and 2) is in part due to a lot of variables creating noise and obscuring proper accuracy.
5. Consider removing fields that may cause more noise than good for the model performance.
6. Hyperparameter tuning could also be used to further optimise the models to get the best results.
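As one possible direction for points 1 to 3, the repeated score/fit/RMSE cells could be folded into a single helper that returns its metrics together with the fitted model, with results stored in a dict keyed by run name rather than in `globals()`. The function name `evaluate_run`, the `runs` dict and the synthetic data below are all hypothetical:

```python
import numpy as np
from sklearn.linear_model import LassoCV, RidgeCV
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score, train_test_split


def evaluate_run(model, X_train, X_test, y_train, y_test, cv=5):
    """Return CV R^2, train/test R^2 and test RMSE for one estimator."""
    cv_r2 = cross_val_score(model, X_train, y_train, cv=cv).mean()
    model.fit(X_train, y_train)
    rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
    return {'model': model,
            'cv_r2': cv_r2,
            'train_r2': model.score(X_train, y_train),
            'test_r2': model.score(X_test, y_test),
            'test_rmse': rmse}


# Synthetic regression problem standing in for a prepared feature matrix
rng = np.random.default_rng(3)
X_all = rng.normal(size=(300, 8))
y_all = X_all @ np.arange(1.0, 9.0) + rng.normal(size=300)
X_tr, X_te, y_tr, y_te = train_test_split(X_all, y_all, random_state=42)

# Results keyed by a unique run name, so fitted models stay retrievable
runs = {'lasso_all': evaluate_run(LassoCV(cv=5), X_tr, X_te, y_tr, y_te),
        'ridge_all': evaluate_run(RidgeCV(alphas=np.linspace(0.1, 10, 200)),
                                  X_tr, X_te, y_tr, y_te)}
for name, res in runs.items():
    print(name, round(res['test_r2'], 3), round(res['test_rmse'], 3))
```

A fitted model can then be recalled later with, for example, `runs['lasso_all']['model']`.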