---

# Part 5: Kaggle Submissions

---

## Notebook Summary

This notebook reviews two Kaggle submissions of predictive models for home sale prices. The submissions may build loosely on the model iterations from my model tuning process, but because they focus exclusively on predictive power rather than on answering the problem statement, the models will not look exactly like the previous iterations. In this notebook the reader will find:

* Kaggle Submission 1
* Kaggle Submission 2

---

## Kaggle Submission 1

I will start by importing the libraries for EDA and linear regression, then load the data files.

In [55]:
import pandas as pd
import numpy as np

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn import metrics
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.pipeline import Pipeline

In [56]:
homes_train = pd.read_csv('../datasets/train_cleaned.csv')
homes_test = pd.read_csv('../datasets/test.csv')
homes_train_org = pd.read_csv('../datasets/train.csv')

In [57]:
homes_train.head()

Unnamed: 0,Id,PID,MS SubClass,MS Zoning,Lot Frontage,Lot Area,Street,Alley,Lot Shape,Land Contour,...,Screen Porch,Pool Area,Pool QC,Fence,Misc Feature,Misc Val,Mo Sold,Yr Sold,Sale Type,SalePrice
0,109,533352170,60,RL,0.0,13517,Pave,,IR1,Lvl,...,0,0,,,,0,3,2010,WD,130500
1,544,531379050,60,RL,43.0,11492,Pave,,IR1,Lvl,...,0,0,,,,0,4,2009,WD,220000
2,153,535304180,20,RL,68.0,7922,Pave,,Reg,Lvl,...,0,0,,,,0,1,2010,WD,109000
3,318,916386060,60,RL,73.0,9802,Pave,,Reg,Lvl,...,0,0,,,,0,4,2010,WD,174000
4,255,906425045,50,RL,82.0,14235,Pave,,IR1,Lvl,...,0,0,,,,0,3,2010,WD,138500


Next, I will create a data cleaning function that follows the same cleaning process I used in Part 1. I will also create a dummify function that converts the categorical columns into dummy variables, as in Part 1.

In [58]:
def clean_data(homes):
    # Fill missing values in numeric columns with 0
    # (assigning back avoids the deprecated chained inplace fillna)
    for col in homes.select_dtypes('number').columns:
        homes[col] = homes[col].fillna(0)

    # Any columns that still contain nulls are categorical; fill them with 'None'
    for col in homes.columns[homes.isnull().any()]:
        homes[col] = homes[col].fillna('None')

    # These columns hold numeric codes that are really categories
    homes['MS SubClass'] = homes['MS SubClass'].astype(str)
    homes['Mo Sold'] = homes['Mo Sold'].astype(str)

    return homes

In [59]:
def dummify(homes):
    cols_objs = homes.select_dtypes('object').columns
    homes = pd.get_dummies(data = homes, columns = cols_objs, drop_first = True)
    
    return homes
    

In [60]:
homes_train = clean_data(homes_train)

homes_train.head()

Unnamed: 0,Id,PID,MS SubClass,MS Zoning,Lot Frontage,Lot Area,Street,Alley,Lot Shape,Land Contour,...,Screen Porch,Pool Area,Pool QC,Fence,Misc Feature,Misc Val,Mo Sold,Yr Sold,Sale Type,SalePrice
0,109,533352170,60,RL,0.0,13517,Pave,,IR1,Lvl,...,0,0,,,,0,3,2010,WD,130500
1,544,531379050,60,RL,43.0,11492,Pave,,IR1,Lvl,...,0,0,,,,0,4,2009,WD,220000
2,153,535304180,20,RL,68.0,7922,Pave,,Reg,Lvl,...,0,0,,,,0,1,2010,WD,109000
3,318,916386060,60,RL,73.0,9802,Pave,,Reg,Lvl,...,0,0,,,,0,4,2010,WD,174000
4,255,906425045,50,RL,82.0,14235,Pave,,IR1,Lvl,...,0,0,,,,0,3,2010,WD,138500


In [61]:
homes_train_dummies = dummify(homes_train)

homes_train_dummies.head()

Unnamed: 0,Id,PID,Lot Frontage,Lot Area,Overall Qual,Overall Cond,Year Built,Year Remod/Add,Mas Vnr Area,BsmtFin SF 1,...,Mo Sold_8,Mo Sold_9,Sale Type_CWD,Sale Type_Con,Sale Type_ConLD,Sale Type_ConLI,Sale Type_ConLw,Sale Type_New,Sale Type_Oth,Sale Type_WD
0,109,533352170,0.0,13517,6,8,1976,2005,289.0,533.0,...,0,0,0,0,0,0,0,0,0,1
1,544,531379050,43.0,11492,7,5,1996,1997,132.0,637.0,...,0,0,0,0,0,0,0,0,0,1
2,153,535304180,68.0,7922,5,7,1953,2007,0.0,731.0,...,0,0,0,0,0,0,0,0,0,1
3,318,916386060,73.0,9802,5,5,2006,2007,0.0,0.0,...,0,0,0,0,0,0,0,0,0,1
4,255,906425045,82.0,14235,6,8,1900,1993,0.0,0.0,...,0,0,0,0,0,0,0,0,0,1


For this first Kaggle submission, I will focus on the numeric variables that I identified in Part 1 as having the highest positive correlations with sale price (0.5 or above), and build a basic model on those features.

In [62]:
corr = homes_train.corr(numeric_only = True)

highest_corr = corr['SalePrice'][corr['SalePrice'] >= 0.5].sort_values(ascending = False)

highest_corr

SalePrice         1.000000
Overall Qual      0.800207
Gr Liv Area       0.697038
Garage Area       0.649897
Garage Cars       0.647781
Total Bsmt SF     0.629303
1st Flr SF        0.618486
Year Built        0.571849
Year Remod/Add    0.550370
Full Bath         0.537969
TotRms AbvGrd     0.504014
Mas Vnr Area      0.503579
Name: SalePrice, dtype: float64

In [63]:
X = homes_train[highest_corr.index].drop(columns = 'SalePrice')
y = homes_train['SalePrice']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42)

lr = LinearRegression()
lr.fit(X_train, y_train)
print(lr.score(X_train, y_train))
print(lr.score(X_test, y_test))
cross_val_score(lr, X_train, y_train).mean()



0.7823634774211838
0.8399541825260143


0.7602871395015003
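
The scores above are R² values, the default `score` for `LinearRegression`, so they are unitless. As a rough sketch, the hold-out error can also be reported in dollars as a root mean squared error. The data below is synthetic, and the names `X_demo`, `y_demo`, and `lr_demo` are assumptions for illustration rather than the notebook's own variables:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the feature matrix and sale prices (illustrative only)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 3))
noise = rng.normal(scale=10_000, size=200)
y_demo = 180_000 + X_demo @ np.array([30_000.0, 15_000.0, 5_000.0]) + noise

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

lr_demo = LinearRegression().fit(X_tr, y_tr)

# RMSE is expressed in the same units as the target (dollars here)
rmse = np.sqrt(mean_squared_error(y_te, lr_demo.predict(X_te)))
r2 = lr_demo.score(X_te, y_te)

print(f'Hold-out R^2: {r2:.3f}')
print(f'Hold-out RMSE: {rmse:,.0f}')
```

With the notebook's data, the same two lines could be applied to `lr`, `X_test`, and `y_test`.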

In [64]:
homes_test.head()

Unnamed: 0,Id,PID,MS SubClass,MS Zoning,Lot Frontage,Lot Area,Street,Alley,Lot Shape,Land Contour,...,3Ssn Porch,Screen Porch,Pool Area,Pool QC,Fence,Misc Feature,Misc Val,Mo Sold,Yr Sold,Sale Type
0,2658,902301120,190,RM,69.0,9142,Pave,Grvl,Reg,Lvl,...,0,0,0,,,,0,4,2006,WD
1,2718,905108090,90,RL,,9662,Pave,,IR1,Lvl,...,0,0,0,,,,0,8,2006,WD
2,2414,528218130,60,RL,58.0,17104,Pave,,IR1,Lvl,...,0,0,0,,,,0,9,2006,New
3,1989,902207150,30,RM,60.0,8520,Pave,,Reg,Lvl,...,0,0,0,,,,0,7,2007,WD
4,625,535105100,20,RL,,9500,Pave,,IR1,Lvl,...,0,185,0,,,,0,7,2009,WD


In [65]:
homes_test = clean_data(homes_test)
homes_test_dummies = dummify(homes_test)

homes_test_dummies.head()

Unnamed: 0,Id,PID,Lot Frontage,Lot Area,Overall Qual,Overall Cond,Year Built,Year Remod/Add,Mas Vnr Area,BsmtFin SF 1,...,Mo Sold_9,Sale Type_CWD,Sale Type_Con,Sale Type_ConLD,Sale Type_ConLI,Sale Type_ConLw,Sale Type_New,Sale Type_Oth,Sale Type_VWD,Sale Type_WD
0,2658,902301120,69.0,9142,6,8,1910,1950,0.0,0,...,0,0,0,0,0,0,0,0,0,1
1,2718,905108090,0.0,9662,5,4,1977,1977,0.0,0,...,0,0,0,0,0,0,0,0,0,1
2,2414,528218130,58.0,17104,7,5,2006,2006,0.0,554,...,1,0,0,0,0,0,1,0,0,0
3,1989,902207150,60.0,8520,5,6,1923,2006,0.0,0,...,0,0,0,0,0,0,0,0,0,1
4,625,535105100,0.0,9500,6,5,1963,1963,247.0,609,...,0,0,0,0,0,0,0,0,0,1


In [66]:
X_submit = homes_test[X.columns]

homes_test['SalePrice'] = lr.predict(X_submit)

In [67]:
homes_test

Unnamed: 0,Id,PID,MS SubClass,MS Zoning,Lot Frontage,Lot Area,Street,Alley,Lot Shape,Land Contour,...,Screen Porch,Pool Area,Pool QC,Fence,Misc Feature,Misc Val,Mo Sold,Yr Sold,Sale Type,SalePrice
0,2658,902301120,190,RM,69.0,9142,Pave,Grvl,Reg,Lvl,...,0,0,,,,0,4,2006,WD,156295.446268
1,2718,905108090,90,RL,0.0,9662,Pave,,IR1,Lvl,...,0,0,,,,0,8,2006,WD,204015.453479
2,2414,528218130,60,RL,58.0,17104,Pave,,IR1,Lvl,...,0,0,,,,0,9,2006,New,196789.013534
3,1989,902207150,30,RM,60.0,8520,Pave,,Reg,Lvl,...,0,0,,,,0,7,2007,WD,130160.093280
4,625,535105100,20,RL,0.0,9500,Pave,,IR1,Lvl,...,185,0,,,,0,7,2009,WD,184056.272056
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
873,1662,527377110,60,RL,80.0,8000,Pave,,Reg,Lvl,...,0,0,,,,0,11,2007,WD,187920.040945
874,1234,535126140,60,RL,90.0,14670,Pave,,Reg,Lvl,...,0,0,,MnPrv,,0,8,2008,WD,219547.130692
875,1373,904100040,20,RL,55.0,8250,Pave,,Reg,Lvl,...,0,0,,,,0,8,2008,WD,127505.555237
876,1672,527425140,20,RL,60.0,9000,Pave,,Reg,Lvl,...,0,0,,GdWo,,0,5,2007,WD,101879.756491


In [68]:
submit_1 = homes_test[['Id', 'SalePrice']]

submit_1

Unnamed: 0,Id,SalePrice
0,2658,156295.446268
1,2718,204015.453479
2,2414,196789.013534
3,1989,130160.093280
4,625,184056.272056
...,...,...
873,1662,187920.040945
874,1234,219547.130692
875,1373,127505.555237
876,1672,101879.756491


In [69]:
submit_1.to_csv('../datasets/kaggle_submit_1.csv', index = False)
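
Before uploading, it may be worth confirming that the file has the expected shape. The sketch below assumes the submission needs exactly the `Id` and `SalePrice` columns used above, with one row per test home. The helper name `check_submission` and the toy frame are hypothetical:

```python
import pandas as pd

def check_submission(submission, expected_rows):
    """Return a list of problems found in a submission frame (empty if none)."""
    problems = []
    if list(submission.columns) != ['Id', 'SalePrice']:
        problems.append('columns should be exactly Id and SalePrice')
        return problems
    if len(submission) != expected_rows:
        problems.append(f'expected {expected_rows} rows, found {len(submission)}')
    if submission['Id'].duplicated().any():
        problems.append('duplicate Id values')
    if submission['SalePrice'].isnull().any():
        problems.append('missing predictions')
    # A linear model can extrapolate to prices at or below zero
    if (submission['SalePrice'] <= 0).any():
        problems.append('non-positive predicted prices')
    return problems

# Toy frame standing in for submit_1
demo = pd.DataFrame({'Id': [2658, 2718, 2414],
                     'SalePrice': [156295.4, 204015.5, 196789.0]})
print(check_submission(demo, expected_rows=3))
```

For the real file, the call would be something like `check_submission(submit_1, expected_rows=len(homes_test))`.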

---

## Kaggle Submission 2

For the second Kaggle submission, I will use the modified dataset that I saved at the end of Model in my Part 3 notebook.

In [70]:
kaggle_2_train = pd.read_csv('../datasets/kaggle_train_submission.csv')

kaggle_2_train.head()

Unnamed: 0,Overall Qual,Gr Liv Area,Year Built,Neighborhood_NridgHt,BsmtFin SF 1,Neighborhood_StoneBr,Lot Area,Garage Area,TotRms AbvGrd,Neighborhood_NoRidge,...,Mas Vnr Area,1st Flr SF,Open Porch SF,Total Bsmt SF,Low Qual Fin SF,Exter Qual_TA,Misc Val,Kitchen Qual_Gd,Kitchen Qual_TA,SalePrice
0,6,1479,1976,0,533.0,0,13517,475.0,6,0,...,289.0,725,44,725.0,0,0,0,1,0,130500
1,7,2122,1996,0,637.0,0,11492,559.0,8,0,...,132.0,913,74,913.0,0,0,0,1,0,220000
2,5,1057,1953,0,731.0,0,7922,246.0,5,0,...,0.0,1057,52,1057.0,0,1,0,1,0,109000
3,5,1444,2006,0,0.0,0,9802,400.0,7,0,...,0.0,744,0,384.0,0,1,0,0,1,174000
4,6,1445,1900,0,0.0,0,14235,484.0,6,0,...,0.0,831,59,676.0,0,1,0,0,1,138500


In [71]:
X = kaggle_2_train.drop(columns = ['SalePrice'])
y = kaggle_2_train['SalePrice']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42)
    
ss_pipe = Pipeline([
        ('sc', StandardScaler()),
        ('lr', LinearRegression())
    ])
    
ss_pipe.fit(X_train, y_train)
    
# Run cross-validation once and report both the mean and the per-fold scores
cv_scores = cross_val_score(ss_pipe, X_train, y_train)

print(f'Mean Cross-Validation Score: {cv_scores.mean()}')
print(f'Cross-Validation Scores: {cv_scores}')
print(f'Training Score: {ss_pipe.score(X_train, y_train)}')
print(f'Test Score: {ss_pipe.score(X_test, y_test)}')

Mean Cross-Validation Score: 0.7887791215189661
Cross-Validation Scores: [0.85596435 0.83420385 0.85651961 0.81177475 0.58543306]
Training Score: 0.8359322339767182
Test Score: 0.8782602013559124
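
The fifth fold scores noticeably lower than the others (about 0.59 against 0.81 to 0.86). For a regressor, `cross_val_score` defaults to five consecutive, unshuffled folds, so a single fold can be dominated by a few unusual rows. One possible way to check whether the spread depends on row order is to repeat the run with shuffled folds. The sketch below uses synthetic data, and the `_demo` names are assumptions for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (illustrative only)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 4))
y_demo = X_demo @ np.array([3.0, 2.0, 1.0, 0.5]) + rng.normal(scale=0.5, size=300)

pipe_demo = Pipeline([
    ('sc', StandardScaler()),
    ('lr', LinearRegression())
])

# Default behaviour: 5 consecutive, unshuffled folds
default_scores = cross_val_score(pipe_demo, X_demo, y_demo)

# Same number of folds, but rows are shuffled before splitting
shuffled_cv = KFold(n_splits=5, shuffle=True, random_state=42)
shuffled_scores = cross_val_score(pipe_demo, X_demo, y_demo, cv=shuffled_cv)

print(default_scores.round(3))
print(shuffled_scores.round(3))
```

If one fold still scores far below the rest after shuffling, the cause is more likely a handful of hard-to-predict homes than the ordering of the rows.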


In [72]:
homes_test = pd.read_csv('../datasets/test.csv')

In [73]:
homes_test = clean_data(homes_test)

homes_test_dummies = dummify(homes_test)

homes_test_dummies

Unnamed: 0,Id,PID,Lot Frontage,Lot Area,Overall Qual,Overall Cond,Year Built,Year Remod/Add,Mas Vnr Area,BsmtFin SF 1,...,Mo Sold_9,Sale Type_CWD,Sale Type_Con,Sale Type_ConLD,Sale Type_ConLI,Sale Type_ConLw,Sale Type_New,Sale Type_Oth,Sale Type_VWD,Sale Type_WD
0,2658,902301120,69.0,9142,6,8,1910,1950,0.0,0,...,0,0,0,0,0,0,0,0,0,1
1,2718,905108090,0.0,9662,5,4,1977,1977,0.0,0,...,0,0,0,0,0,0,0,0,0,1
2,2414,528218130,58.0,17104,7,5,2006,2006,0.0,554,...,1,0,0,0,0,0,1,0,0,0
3,1989,902207150,60.0,8520,5,6,1923,2006,0.0,0,...,0,0,0,0,0,0,0,0,0,1
4,625,535105100,0.0,9500,6,5,1963,1963,247.0,609,...,0,0,0,0,0,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
873,1662,527377110,80.0,8000,6,6,1974,1974,0.0,931,...,0,0,0,0,0,0,0,0,0,1
874,1234,535126140,90.0,14670,6,7,1966,1999,410.0,575,...,0,0,0,0,0,0,0,0,0,1
875,1373,904100040,55.0,8250,5,5,1968,1968,0.0,250,...,0,0,0,0,0,0,0,0,0,1
876,1672,527425140,60.0,9000,4,6,1971,1971,0.0,616,...,0,0,0,0,0,0,0,0,0,1
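
Because the training and test sets are dummified separately, their columns can differ: the test output above includes `Sale Type_VWD`, which does not appear in the training dummies shown earlier. Selecting the training columns from the test frame raises a `KeyError` if any of them are missing there. One common way to guard against this is `DataFrame.reindex`, sketched here on toy frames (the frames and their values are illustrative, not the notebook's data):

```python
import pandas as pd

# Toy frames: the test set has a category ('VWD') never seen in training
train_demo = pd.DataFrame({'Lot Area': [9000, 12000, 8000],
                           'Sale Type': ['WD', 'New', 'COD']})
test_demo = pd.DataFrame({'Lot Area': [9500, 7000, 10000],
                          'Sale Type': ['WD', 'VWD', 'COD']})

train_dummies = pd.get_dummies(train_demo, columns=['Sale Type'], drop_first=True)
test_dummies = pd.get_dummies(test_demo, columns=['Sale Type'], drop_first=True)

# Keep exactly the training columns, in the same order:
# columns missing from the test set are added and filled with 0,
# columns that exist only in the test set are dropped
aligned = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)

print(list(train_dummies.columns))
print(list(test_dummies.columns))
print(list(aligned.columns))
```

In the notebook, the equivalent call would be `homes_test_dummies.reindex(columns = X.columns, fill_value = 0)`. A test row whose category was unseen in training ends up with zeros in every dummy column for that feature, so the model treats it as the baseline category.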


In [74]:
X_submit = homes_test_dummies[X.columns]

homes_test_dummies['SalePrice'] = ss_pipe.predict(X_submit)

homes_test_dummies

Unnamed: 0,Id,PID,Lot Frontage,Lot Area,Overall Qual,Overall Cond,Year Built,Year Remod/Add,Mas Vnr Area,BsmtFin SF 1,...,Sale Type_CWD,Sale Type_Con,Sale Type_ConLD,Sale Type_ConLI,Sale Type_ConLw,Sale Type_New,Sale Type_Oth,Sale Type_VWD,Sale Type_WD,SalePrice
0,2658,902301120,69.0,9142,6,8,1910,1950,0.0,0,...,0,0,0,0,0,0,0,0,1,186070.724594
1,2718,905108090,0.0,9662,5,4,1977,1977,0.0,0,...,0,0,0,0,0,0,0,0,1,175154.576806
2,2414,528218130,58.0,17104,7,5,2006,2006,0.0,554,...,0,0,0,0,0,1,0,0,0,212840.501379
3,1989,902207150,60.0,8520,5,6,1923,2006,0.0,0,...,0,0,0,0,0,0,0,0,1,107838.805211
4,625,535105100,0.0,9500,6,5,1963,1963,247.0,609,...,0,0,0,0,0,0,0,0,1,167371.474523
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
873,1662,527377110,80.0,8000,6,6,1974,1974,0.0,931,...,0,0,0,0,0,0,0,0,1,191056.260044
874,1234,535126140,90.0,14670,6,7,1966,1999,410.0,575,...,0,0,0,0,0,0,0,0,1,215199.322380
875,1373,904100040,55.0,8250,5,5,1968,1968,0.0,250,...,0,0,0,0,0,0,0,0,1,130317.933801
876,1672,527425140,60.0,9000,4,6,1971,1971,0.0,616,...,0,0,0,0,0,0,0,0,1,106553.082025


In [75]:
submit_2 = homes_test_dummies[['Id', 'SalePrice']]

submit_2

Unnamed: 0,Id,SalePrice
0,2658,186070.724594
1,2718,175154.576806
2,2414,212840.501379
3,1989,107838.805211
4,625,167371.474523
...,...,...
873,1662,191056.260044
874,1234,215199.322380
875,1373,130317.933801
876,1672,106553.082025


In [76]:
submit_2.to_csv('../datasets/kaggle_submit_2.csv', index = False)