## 03: Kaggle Submissions

|Project Notebooks|
|--------|
|[00_Problem_Statement_and_Project_Overview](00_Problem_Statement_and_Project_Overview.ipynb)|
|[01_EDA_and_Cleaning](01_EDA_and_Cleaning.ipynb)|
|[02_Preprocessing_FeatureEngineering_Modeling](02_Preprocessing_FeatureEngineering_Modeling.ipynb)|
|[03_Kaggle_Submissions](04_Kaggle_Submissions.ipynb)|


This project notebook will be used primarily for producing Ames Housing Dataset `saleprice` modeling.

The training data cleaning and exploratory discussion is documented in [01_EDA_and_Cleaning](01_EDA_and_Cleaning.ipynb).

Using the cleaned dataset, I will produce one set of predictions of `saleprice` using predictors in the testing dataset and multiple linear regression

The output of this notebook should be 1 .CSV file containing `Id` and `SalePrice` fields, with `SalePrice` predicted according to our trained model.


Begin by importing the necessary tools:

In [1]:
import numpy as np, pandas as pd, matplotlib.pyplot as plt, seaborn as sns

from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn import metrics

---

## Preparing the `test.csv` dataset

We'll have to make some of the column name, field re-casting, and missing-value adjustments we made to the `train.csv` data for the data to be congruent enough to run a prediction after the model is fit. The steps that must be taken are detailed below:

1. Make column names lowercase and replace spaces with underscores.
1. Drop columns which were considered extraneous or collinear with another variable.
1. Deal with missing values, applying treatments as similar as possible to those applied in the training set.
1. Recast column types to match what was done for the training set.

In [2]:
train_orig = pd.read_csv('../datasets/train.csv')
train_clean = pd.read_csv('../datasets/cleaned/ames_train_clean.csv')
test = pd.read_csv('../datasets/test.csv') # doesn't match what we did in the training set yet

#after being converted to categorical variables and exported to a csv, some of the training
#fields have reverted to their original types... this should amend that:
train_clean[['id','ms_subclass','overall_qual','overall_cond','year_built','year_remod/add','mo_sold','yr_sold']] = \
    train_clean[['id','ms_subclass','overall_qual','overall_cond','year_built','year_remod/add','mo_sold','yr_sold']].astype(str)


In [3]:
#get the original field name, determine what the updated name should be, and
orig_names = {x:[x.lower().replace(' ','_'), train_orig[x].dtype] for x in train_orig.columns}

#get the new names in the cleaned training set:
new_names = {column:train_clean[column].dtype for column in train_clean.columns}

#drop the following columns from test.csv:
test_to_drop = []
for column in orig_names:
    if orig_names[column][0] not in new_names:
        test_to_drop.append(column)
test.drop(columns=test_to_drop,inplace=True)

In [4]:
#update the column names
test.columns = [orig_names.get(x)[0] for x in test.columns]

In [5]:
#get the shape:
trows, tcols = test.shape
print(f'rows {trows}',f'cols {tcols}')

rows 878 cols 72


In [7]:
#check the columns with nulls
test_null_cols = {column: [null_sum] for column, null_sum in list(zip(test.columns,test.isnull().sum())) if null_sum > 0}
pd.DataFrame.from_dict(test_null_cols,'index',columns=['count of null obvs']).sort_values('count of null obvs', ascending = False)

Unnamed: 0,count of null obvs
pool_qc,874
misc_feature,837
alley,820
fence,706
fireplace_qu,422
lot_frontage,160
garage_yr_blt,45
garage_finish,45
garage_qual,45
garage_cond,45


The code cell below is a series of steps taken to deal with NAs in the test set as similarly as we did in the training dataset.

In [8]:
def fill_na_none(df,column):
    df[column] = df[column].fillna('None')

columns_to_fill_na_none = ['pool_qc',
                           'misc_feature',
                           'alley',
                           'fence',
                           'fireplace_qu',
                           'garage_yr_blt',
                           'garage_finish',
                           'garage_qual',
                           'garage_cond',
                           'bsmt_exposure',
                           'bsmtfin_type_1',
                           'bsmtfin_type_2',
                           'bsmt_cond',
                           'bsmt_qual',
                           'mas_vnr_type',
                           'electrical']

#property index 764 has a non-null 'garage_type', but missing data in the other garage columns would indicate they do not have a garage.
#make property 764's garage_type = 'None' and 
test.loc[764,'garage_type'] = 'None'
columns_to_fill_na_none.append('garage_type')

for col in columns_to_fill_na_none:
    fill_na_none(test,col)
    
#lastly, make the one null in 'mas_vnr_area' = 0.0
test.loc[865,'mas_vnr_area'] = 0.0

In [9]:
#recheck the columns with nulls
test_null_cols = {column: [null_sum] for column, null_sum in list(zip(test.columns,test.isnull().sum())) if null_sum > 0}
pd.DataFrame.from_dict(test_null_cols,'index',columns=['count of null obvs']).sort_values('count of null obvs', ascending = False)

Unnamed: 0,count of null obvs
lot_frontage,160


All that's left is to use KNNImputer to fill the missing values in `lot_frontage`:

In [10]:
#import and instantiate
from sklearn.impute import KNNImputer
knn_imp = KNNImputer()

#create a dataframe with imputed values for `lot_frontage`
imputed = pd.DataFrame(knn_imp.fit_transform(test.select_dtypes(include=np.number)),
                       columns=test.select_dtypes(include=np.number).columns)

In [11]:
#compare original vs. imputed mean and variance
test.lot_frontage.mean(),  imputed.lot_frontage.mean(), test.lot_frontage.var(), imputed.lot_frontage.var()

(69.54596100278552, 70.16309794988611, 553.8465596749076, 505.5483744282521)

In [12]:
#overwrite the original lot_frontage NAs with the imputed lot frontages
test['lot_frontage'] = imputed['lot_frontage']

In [13]:
#confirm no nulls remain:
test.isnull().sum().sum()

0

---

Lastly, in the training data, we recasted some columns' types. The code cell below should cast all the same columns to matching types:

In [14]:
#cast discrete vars as objects
test[['id','ms_subclass','overall_qual','overall_cond','year_built','year_remod/add','mo_sold','yr_sold']] = \
    test[['id','ms_subclass','overall_qual','overall_cond','year_built','year_remod/add','mo_sold','yr_sold']].astype(str)

Check now that all the column types between clean train match that of clean test:

In [15]:
test.shape, train_clean.shape

((878, 72), (2045, 73))

We expect that there is one fewer column (`saleprice`) in the test set.

In [16]:
list(train_clean.columns[:-1]) == list(test.columns)

True

Check the dtypes too.

In [17]:
for column in test.columns:
    if test[column].dtype != 'O' and test[column].dtype != train_clean[column].dtype:
        test[column] = test[column].astype(train_clean[column].dtype)
        
list(train_clean.dtypes[:-1]) == list(test.dtypes)

True

---

## Strategy for Preprocessing

I'll be using LASSO regression to assist with feature selection. As exhibited in [01_EDA_and_Cleaning](01_EDA_and_Cleaning.ipynb), it's somewhat apparent which continuous variable should be included in predicing homes' `saleprice`. The regularized regression we do here will largely help us select which categorical variables to keep.

This strategy will proceed roughly as follows:

* One-hot encode the categorical variables
* Set up a train-test split to split our training dataset into training and testing data for validation
* Scale the variables (per the prerequsite to using regularized models)
* Use GridSearchCV to attune our model to the best lambda (error scalar term)
* Determine which variables contribute meaningfully to predicting `saleprice` and drop the ones that don't.

Once we have the variables we like, we'll train all three models using the identified predictors to predict `saleprice` in the test dataset.

In [18]:
#### Save a copy of our cleaned test data in case we want 
#### to try different models or feature selections later.
test.to_csv('../datasets/cleaned/ames_test_clean.csv',index=False)

---

#### One-Hot Encode the Categorical Variables

The test.csv is now clean. But there may still be adjustments we need to make AFTER we've made our variable selection.


In [19]:
#save a list of the categorical columns:
categorical_columns = list(train_clean.select_dtypes(include='object').columns)

#don't include the id column for this:
categorical_columns.remove('id')

In [20]:
train_clean = pd.get_dummies(train_clean, columns=categorical_columns,drop_first=True)

In [21]:
train_clean.shape

(2045, 565)

---

#### Apply a train-test split

Now to split our data into training and testing sets. We'll use pipelines to make this process a little easier.

In [22]:
X = train_clean.drop(columns=['saleprice'])
y = train_clean['saleprice']

In [23]:
#apply train-test_split:
X_train, X_test, y_train, y_test = train_test_split(X,y,random_state=42)

#### Use `Pipeline` and `GridSearchCV` to optimze hyperparameters for `Lasso`:

Inspired by Lesson 4.05-hyperparameters-gridsearch-and-pipelines

In [32]:
#Instantiate Pipelines
pipe_lasso_gs = Pipeline([
    ('ss',StandardScaler()),
    ('Lasso',Lasso())
])

pipe_lasso_params = {
    'Lasso__random_state': [42],
    'Lasso__alpha': [.001, .01, .1],
    'Lasso__max_iter': [20000, 50000],
    'Lasso__tol': [.001],
}

#Instantiate the GridSearch Cross Validation:
gs_lasso = GridSearchCV(pipe_lasso_gs, pipe_lasso_params, cv=3)

#Run the gridsearch:
gs_lasso.fit(X_train,y_train)

  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(


GridSearchCV(cv=3,
             estimator=Pipeline(steps=[('ss', StandardScaler()),
                                       ('Lasso', Lasso())]),
             param_grid={'Lasso__alpha': [0.001, 0.01, 0.1],
                         'Lasso__max_iter': [20000, 50000],
                         'Lasso__random_state': [42], 'Lasso__tol': [0.001]})

The Gridsearch results in a lot of ConvergenceWarnings. Should this be a concern? Is this an inappropriate application of the GridsearchCV?

I commented out the actual fit on the gridsearch because it is pretty machine intensive and takes several minutes to run. In the cell below, I will include the best alpha and best max_iter I got during this process.

In [31]:
gs_lasso.best_params_

{'Lasso__alpha': 599.4842503189421,
 'Lasso__max_iter': 20000,
 'Lasso__random_state': 42,
 'Lasso__tol': 0.001}

In [27]:
#If we ignore the convergence warnings, we can extract the best alpha value used as our optimized hyperparameter
gbest_alpha = gs_lasso.best_params_['Lasso__alpha']
gbest_miter = gs_lasso.best_params_['Lasso__max_iter']

#gbest_alpha = 599.4842503189421
#gbest_miter = 5000
print(gbest_alpha,gbest_miter)

599.4842503189421 5000


In [25]:
#Here was the best scoore the gridsearch got:
#gs_lasso.best_score_

#my output:
#0.9053456636743066

In [26]:
#And here's how the alpha-optimzed lasso's R2 in predicting training vs testing split comes out: 
#gs_lasso.score(X_test,y_test)
#my output:
#0.9163464598246794

So we have the hyperparameters we should use to tune our Lasso regression. Let's set it up so that we can finally extract the variables we actually want to keep.

In [27]:
#Instantiate new scaler and lasso:
ss = StandardScaler()
Lasso_reg = Lasso(alpha=gbest_alpha,max_iter=gbest_miter)

In [28]:
#Transform predictors:
Z_train = ss.fit_transform(X_train)
Z_test = ss.transform(X_test)

In [29]:
#Fit the model:
Lasso_reg.fit(Z_train,y_train)

Lasso(alpha=599.4842503189421, max_iter=5000)

In [30]:
#Evaluate:
Lasso_reg.score(Z_train,y_train),Lasso_reg.score(Z_test,y_test)

(0.939640883186035, 0.9163464598246794)

In [31]:
#Use these variables for feature selection:
best_lasso_coefficients = list(zip(X.columns,Lasso_reg.coef_))

In [32]:
#Store the variables which have coefficients of 0 as a result of the fit model above:
vars_to_drop = [x[0] for x in best_lasso_coefficients if x[1] == 0]

In [33]:
#Determine which of our training dummified data are left after removing:
selected_features = [x for x in list(train_clean.columns) if x not in vars_to_drop]

So we have the variables we want to end up with, but we need to take a few steps before we can run regression using them on the test dataset. We will need to proceed as follows:

* Dummify the test data
* Add anything from the selected features which is not in the test data. They should all be set to 0 if the dummified column was not already in the test dummy columns.
* Remove everything (except `id`) from the train and test features which are not among the selected features

In [34]:
#Dummify the test data:
test = pd.get_dummies(test,columns=categorical_columns)

#Identify what is not in test that is in the selected features:
test_dummy_cols = list(test.columns)
in_selected_not_in_test = [x for x in selected_features if x not in test.columns] 
in_selected_not_in_test.remove('saleprice') #make sure saleprice doesn't make it either

for column in in_selected_not_in_test:
    test[column] = 0
    
#Identify what is still remaining in test which is not in the selected features:
in_test_not_in_selected = [x for x in test.columns if x not in selected_features]

#we'll be using the list as an argument for columns to drop. make sure we don't lose id:
in_test_not_in_selected.remove('id')

#remove the columns:
test.drop(columns=in_test_not_in_selected,inplace=True)

#Do the same for the training set:
in_train_not_in_selected = [x for x in train_clean.columns if x not in selected_features]

#we'll be using the list as an argument for columns to drop. make sure we don't lose id:
in_train_not_in_selected.remove('id')

#remove the columns:
train_clean.drop(columns=in_train_not_in_selected,inplace=True)

Now we have selected features. Last thing we need to ensure is that our column orders are also correct:

In [35]:
#check that our columns match now (minus just saleprice, present in train and not in test):
train_clean.shape, test.shape

((2045, 174), (878, 173))

In [36]:
list(train_clean.columns[:-1]) == list(test.columns)

False

In [37]:
final_column_order = list(train_clean.columns)

In [38]:
final_column_order.remove('saleprice')

In [39]:
test = test[[x for x in final_column_order]]

In [40]:
final_column_order == list(test.columns)

True

Now that our columns are congruent both in the training and testing datasets, we can finally set fit three a standard multiple linear regression model on our training data to try to predict price in the testing data.

After that, we'll export the .csv

In [41]:
#Define our training and testing sets:
X_train = train_clean.drop(columns=['saleprice'])
y_train = train_clean['saleprice']

X_test = test

In [42]:
#Instantiate a linear regression model:
lr = LinearRegression()

#Fit the model:
lr.fit(X_train,y_train)

#Predict:
y_pred = lr.predict(X_test)

In [43]:
metrics.mean_squared_error(y_train,lr.predict(X_train),squared=False)

19021.892858470175

In [44]:
#add the predicted price to our test dataframe:
test['saleprice'] = y_pred

In [45]:
test.columns

Index(['id', 'lot_frontage', 'lot_area', 'mas_vnr_area', 'total_bsmt_sf',
       'low_qual_fin_sf', 'gr_liv_area', 'bsmt_full_bath', 'full_bath',
       'half_bath',
       ...
       'misc_feature_TenC', 'mo_sold_2', 'mo_sold_3', 'mo_sold_5', 'mo_sold_7',
       'yr_sold_2009', 'sale_type_Con', 'sale_type_New', 'sale_type_Oth',
       'saleprice'],
      dtype='object', length=174)

In [46]:
kaggle_output = test[['id','saleprice']]

In [47]:
kaggle_output.columns = ['Id','SalePrice']

In [48]:
kaggle_output

Unnamed: 0,Id,SalePrice
0,2658,128457.996398
1,2718,149023.181081
2,2414,212358.362929
3,1989,108947.249635
4,625,185710.625196
...,...,...
873,1662,188595.991679
874,1234,211971.405362
875,1373,126885.776624
876,1672,117133.407394


In [49]:
#export the csv!
kaggle_output.to_csv('../datasets/kaggle_submissions/attempt1_mls.csv',index = False)

---

# Successive Attempts

Everything done before this point culminated in the first submission. Now, I'll try to tune the model and perhaps even try other regression model variants.

In [50]:
#reassert the train and test data:

#Define our training and testing sets:
X_train = train_clean.drop(columns=['saleprice'])
y_train = train_clean['saleprice']

X_test = test # will be using the predictor values in this dataframe to predict saleprice

### Set up benchmarking envionment

Benchmarking can only be done on the training data.

Set up a train-test split on the training data:

In [51]:
X_benchmark, X_benchmark_test, y_benchmark, y_benchmark_test = train_test_split(X_train,y_train,random_state=24)

In [52]:
scores = {}

In [53]:
# define a function to add benchmark results to the scores dictionary:

def add_benchmark(model, y_true, y_pred):
    scores[model] = {'r2': metrics.r2_score(y_true,y_pred),'rmse': metrics.mean_squared_error(y_true,y_pred,squared=False)}

In [69]:
test_preds = {}
# define a function to create predictions using X_test

def add_test_preds(model,model_name):
    prediction = np.array(model.predict(X_test))
    test_preds[model_name] = prediction

Set the baseline score, using the mean of the training *y*.

In [56]:
y_bench_test_bar = y_benchmark_test.mean()

add_benchmark('baseline',y_benchmark_test,[y_bench_test_bar for val in y_benchmark_test])

Though we did it in the previous section, let's just make sure we also include a benchmark for the standard multiple linear regression:

In [70]:
# Instantiate the model:
lr = LinearRegression()
lr.fit(X_benchmark, y_benchmark)

#add the benchmark to the scores:
add_benchmark('standard MLR',y_benchmark_test,lr.predict(X_benchmark_test))

#add the test prediction:
add_test_preds(lr,'standard MLR')

ValueError: matmul: Input operand 1 has a mismatch in its core dimension 0, with gufunc signature (n?,k),(k,m?)->(n?,m?) (size 173 is different from 174)