# Notebook Intro:

In this notebook, I combine the dummy columns selected in Notebook 3a with the continuous features whose absolute correlation (Pearson r) with SalePrice is above 0.2, and fit a linear regression model on the combined features.

I first do a train/test split on the training data and fit the linear regression model to the train split. I check that the model is reasonable with a cross-validation score and by scoring it on the test split.

Finally, I prepare the test data the same way as the training data (creating the dummy columns and keeping only the continuous features used in training), apply the model that I fit on the train split to it, and export the predictions as **Prediction 2**.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LinearRegression

In [2]:
# import cleaned training data
filepath = '../datasets/interim_files/train_clean.csv'

df = pd.read_csv(filepath)

In [3]:
# import dummy columns with r2 over .2 (selected in Notebook 3a)
filepath = '../datasets/interim_files/dummiesbest.csv'

df_dummies = pd.read_csv(filepath)

In [4]:
# continuous features - per the data dictionary
continuous_features = ['Lot Frontage','Lot Area','Mas Vnr Area','BsmtFin SF 1','BsmtFin SF 2','Bsmt Unf SF','Total Bsmt SF','1st Flr SF','2nd Flr SF','Low Qual Fin SF','Gr Liv Area','Garage Area','Wood Deck SF','Open Porch SF','Enclosed Porch','3Ssn Porch','Screen Porch','Pool Area','Misc Val']

# nominal features - per the data dictionary
nominal_features = ['PID','MS SubClass','MS Zoning','Street','Alley','Land Contour','Lot Config','Neighborhood','Condition 1','Condition 2','Bldg Type','House Style','Roof Style','Roof Matl','Exterior 1st','Exterior 2nd','Mas Vnr Type','Foundation','Heating','Central Air','Garage Type','Misc Feature','Sale Type']

# discrete features 
discrete_features = ['Year Built','Year Remod/Add','Bsmt Full Bath','Bsmt Half Bath','Full Bath','Half Bath','Bedroom AbvGr','Kitchen AbvGr','TotRms AbvGrd','Fireplaces','Garage Yr Blt','Garage Cars','Mo Sold','Yr Sold']

# Ordinal Features
ordinal_features = ['Lot Shape','Utilities','Land Slope','Overall Qual','Overall Cond','Exter Qual','Exter Cond','Bsmt Qual','Bsmt Cond','Bsmt Exposure','BsmtFin Type 1','BsmtFin Type 2','Heating QC','Electrical','Kitchen Qual','Functional','Fireplace Qu','Garage Finish','Garage Qual','Garage Cond','Paved Drive','Pool QC','Fence']

# look at continuous variables

In [5]:
df[continuous_features]

Unnamed: 0,Lot Frontage,Lot Area,Mas Vnr Area,BsmtFin SF 1,BsmtFin SF 2,Bsmt Unf SF,Total Bsmt SF,1st Flr SF,2nd Flr SF,Low Qual Fin SF,Gr Liv Area,Garage Area,Wood Deck SF,Open Porch SF,Enclosed Porch,3Ssn Porch,Screen Porch,Pool Area,Misc Val
0,0.0,13517,289.0,533.0,0.0,192.0,725.0,725,754,0,1479,475.0,0,44,0,0,0,0,0
1,43.0,11492,132.0,637.0,0.0,276.0,913.0,913,1209,0,2122,559.0,0,74,0,0,0,0,0
2,68.0,7922,0.0,731.0,0.0,326.0,1057.0,1057,0,0,1057,246.0,0,52,0,0,0,0,0
3,73.0,9802,0.0,0.0,0.0,384.0,384.0,744,700,0,1444,400.0,100,0,0,0,0,0,0
4,82.0,14235,0.0,0.0,0.0,676.0,676.0,831,614,0,1445,484.0,0,59,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2021,79.0,11449,0.0,1011.0,0.0,873.0,1884.0,1728,0,0,1728,520.0,0,276,0,0,0,0,0
2022,0.0,12342,0.0,262.0,0.0,599.0,861.0,861,0,0,861,539.0,158,0,0,0,0,0,0
2023,57.0,7558,0.0,0.0,0.0,896.0,896.0,1172,741,0,1913,342.0,0,0,0,0,0,0,0
2024,80.0,10400,0.0,155.0,750.0,295.0,1200.0,1200,0,0,1200,294.0,0,189,140,0,0,0,0


In [6]:
continuous_features_with_sale = pd.concat([df[continuous_features],df['SalePrice']], axis = 1)
np.abs(continuous_features_with_sale.corr()['SalePrice'])

Lot Frontage       0.179545
Lot Area           0.295845
Mas Vnr Area       0.512699
BsmtFin SF 1       0.424380
BsmtFin SF 2       0.019387
Bsmt Unf SF        0.191275
Total Bsmt SF      0.631975
1st Flr SF         0.623523
2nd Flr SF         0.250946
Low Qual Fin SF    0.041153
Gr Liv Area        0.699026
Garage Area        0.648661
Wood Deck SF       0.328729
Open Porch SF      0.325625
Enclosed Porch     0.138048
3Ssn Porch         0.049865
Screen Porch       0.137802
Pool Area          0.023749
Misc Val           0.006787
SalePrice          1.000000
Name: SalePrice, dtype: float64

In [7]:
continuous_features_with_sale.corr()['SalePrice'][np.abs(continuous_features_with_sale.corr()['SalePrice']) >.2]

Lot Area         0.295845
Mas Vnr Area     0.512699
BsmtFin SF 1     0.424380
Total Bsmt SF    0.631975
1st Flr SF       0.623523
2nd Flr SF       0.250946
Gr Liv Area      0.699026
Garage Area      0.648661
Wood Deck SF     0.328729
Open Porch SF    0.325625
SalePrice        1.000000
Name: SalePrice, dtype: float64

In [8]:
# keep continuous features whose absolute correlation with SalePrice is above .2
corr_with_sale = continuous_features_with_sale.corr()['SalePrice']
final_features_continuous = corr_with_sale[np.abs(corr_with_sale) > .2]
final_features_continuous.index

Index(['Lot Area', 'Mas Vnr Area', 'BsmtFin SF 1', 'Total Bsmt SF',
       '1st Flr SF', '2nd Flr SF', 'Gr Liv Area', 'Garage Area',
       'Wood Deck SF', 'Open Porch SF', 'SalePrice'],
      dtype='object')
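
The selection above can be sketched on a small made-up frame. The frame `toy`, its columns `area` and `noise`, and all values are hypothetical and only illustrate the thresholding step; dropping the target before filtering is one possible variation, not what the cells above do.

```python
import pandas as pd

# Toy data standing in for the housing frame (values are made up).
toy = pd.DataFrame({
    "area":      [50, 60, 70, 80, 90, 100],
    "noise":     [3, 1, 2, 1, 3, 2],
    "SalePrice": [100, 125, 135, 165, 180, 195],
})

# Compute the correlations with the target once, then drop the target itself
# so it never ends up in the feature list.
corr_with_target = toy.corr()["SalePrice"].drop("SalePrice")

# Keep the features whose absolute Pearson r exceeds the threshold.
threshold = 0.2
selected = corr_with_target[corr_with_target.abs() > threshold].index.tolist()
print(selected)
```

Dropping `SalePrice` from the correlation series up front would avoid having to remove it again later with `drop(columns=...)` or `index[:-1]`.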

In [9]:
df_continuous = df[final_features_continuous.index].copy()
df_continuous.drop(columns = 'SalePrice', inplace=True)

# combine nominal and continuous features

In [10]:
X = pd.concat([df_dummies,df_continuous], axis=1)
y = df['SalePrice']

In [11]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42)

In [12]:
lr = LinearRegression()

In [13]:
cross_val_score(lr, X_train, y_train).mean()

0.7468310385066939
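
For reference, in recent scikit-learn versions `cross_val_score` with no `cv` or `scoring` argument uses 5-fold cross-validation and, for a regressor, the estimator's own `score` method, which is R2 for `LinearRegression`. A minimal sketch on synthetic data that spells those defaults out; `X_toy`, `y_toy`, and the coefficients are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression problem (not the housing data).
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 3))
y_toy = X_toy @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)

# Same call as above, with the defaults written out explicitly.
scores = cross_val_score(LinearRegression(), X_toy, y_toy, cv=5, scoring="r2")

# One R2 value per fold; the notebook reports their mean.
print(scores.shape, scores.mean())
```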

In [14]:
lr.fit(X_train, y_train)

LinearRegression()

In [15]:
lr.score(X_train, y_train), lr.score(X_test, y_test)

(0.7913951844578868, 0.8511567811004915)
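
Both numbers are R2 values, since `LinearRegression.score` returns the coefficient of determination. A small sketch, on made-up data, of how that value can be reproduced by hand as 1 - SS_res / SS_tot; `X_toy` and `y_toy` are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Tiny made-up data set with a roughly linear relationship.
X_toy = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y_toy = np.array([1.1, 1.9, 3.2, 3.9, 5.1])

model = LinearRegression().fit(X_toy, y_toy)
pred = model.predict(X_toy)

# R2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = ((y_toy - pred) ** 2).sum()
ss_tot = ((y_toy - y_toy.mean()) ** 2).sum()
r2_manual = 1 - ss_res / ss_tot

print(r2_manual, model.score(X_toy, y_toy))
```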

# apply model to test data

In [16]:
# import cleaned test info
filepath = '../datasets/interim_files/test_clean.csv'

testdata = pd.read_csv(filepath)
testdata.head()

Unnamed: 0,Id,PID,MS SubClass,MS Zoning,Lot Frontage,Lot Area,Street,Alley,Lot Shape,Land Contour,...,3Ssn Porch,Screen Porch,Pool Area,Pool QC,Fence,Misc Feature,Misc Val,Mo Sold,Yr Sold,Sale Type
0,2658,902301120,190,RM,69.0,9142,Pave,Grvl,Reg,Lvl,...,0,0,0,none,none,none,0,4,2006,WD
1,2718,905108090,90,RL,0.0,9662,Pave,none,IR1,Lvl,...,0,0,0,none,none,none,0,8,2006,WD
2,2414,528218130,60,RL,58.0,17104,Pave,none,IR1,Lvl,...,0,0,0,none,none,none,0,9,2006,New
3,1989,902207150,30,RM,60.0,8520,Pave,none,Reg,Lvl,...,0,0,0,none,none,none,0,7,2007,WD
4,625,535105100,20,RL,0.0,9500,Pave,none,IR1,Lvl,...,0,185,0,none,none,none,0,7,2009,WD


In [17]:
testdata.shape

(878, 80)

In [21]:
# columns used in dummies (all the nominal columns that had relevant dummy values) - from notebook 3a
combined_columns = ['MS SubClass', 'MS Zoning', 'Land Contour', 'Neighborhood',
       'House Style', 'Roof Style', 'Exterior 1st', 'Exterior 2nd',
       'Mas Vnr Type', 'Foundation', 'Central Air', 'Garage Type',
       'Sale Type']

In [22]:
# get dummy columns from categorical variables
df_long_testdata = pd.get_dummies(testdata, columns = combined_columns)

# keep only the dummy columns the model was trained on, in the same order
df_to_review_testdata = df_long_testdata[df_dummies.columns]
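
Selecting with `df_long_testdata[df_dummies.columns]` works here, but it would raise a `KeyError` if a category seen in training never appeared in the test data. A hedged alternative, sketched on made-up data, is `reindex` with `fill_value=0`; the `color` column and the list `train_dummy_columns` are hypothetical stand-ins.

```python
import pandas as pd

# Dummy columns the model was trained on (hypothetical names).
train_dummy_columns = ["color_red", "color_blue"]

# Test data in which "blue" never occurs and "green" is new.
test = pd.DataFrame({"color": ["red", "green", "red"]})
test_dummies = pd.get_dummies(test, columns=["color"], dtype=int)

# reindex keeps the training columns in training order, fills columns that
# are missing from the test data with 0, and drops unseen ones.
aligned = test_dummies.reindex(columns=train_dummy_columns, fill_value=0)
print(aligned)
```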

In [23]:
# get continuous columns
# ([:-1] drops the last index entry, SalePrice, which the test data does not have)
df_continuous_testdata = testdata[final_features_continuous.index[:-1]]

In [24]:
df_continuous_testdata.shape

(878, 10)

In [25]:
df_to_review_testdata.shape

(878, 26)

In [26]:
X_testdata = pd.concat([df_to_review_testdata,df_continuous_testdata], axis=1)

In [27]:
X_testdata.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 878 entries, 0 to 877
Data columns (total 36 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   MS SubClass_60        878 non-null    uint8  
 1   MS SubClass_30        878 non-null    uint8  
 2   MS Zoning_RM          878 non-null    uint8  
 3   MS Zoning_RL          878 non-null    uint8  
 4   Land Contour_HLS      878 non-null    uint8  
 5   Neighborhood_NridgHt  878 non-null    uint8  
 6   Neighborhood_NoRidge  878 non-null    uint8  
 7   Neighborhood_StoneBr  878 non-null    uint8  
 8   Neighborhood_OldTown  878 non-null    uint8  
 9   House Style_2Story    878 non-null    uint8  
 10  Roof Style_Hip        878 non-null    uint8  
 11  Roof Style_Gable      878 non-null    uint8  
 12  Exterior 1st_VinylSd  878 non-null    uint8  
 13  Exterior 2nd_VinylSd  878 non-null    uint8  
 14  Mas Vnr Type_None     878 non-null    uint8  
 15  Mas Vnr Type_Stone    8

In [28]:
# replace any remaining missing values with 0 so the model can predict
X_testdata.fillna(0, inplace=True)
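
Before filling, it can be useful to see which columns actually contain missing values. A minimal sketch on a made-up frame; `toy` and its values are hypothetical and only the column names are borrowed from the data.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value (made-up numbers).
toy = pd.DataFrame({
    "Garage Area": [400.0, np.nan, 520.0],
    "Lot Area":    [9000, 8500, 10000],
})

# Count missing values per column and keep only the columns that have any.
missing_counts = toy.isna().sum()
missing_counts = missing_counts[missing_counts > 0]
print(missing_counts)

# Fill with 0, as in the cell above, without modifying the original frame.
filled = toy.fillna(0)
```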

In [29]:
y_pred = lr.predict(X_testdata)

In [30]:
y_pred_df = pd.DataFrame(y_pred)
y_pred_df = y_pred_df.rename(columns={0:'SalePrice'})
y_pred_df.head()

Unnamed: 0,SalePrice
0,128715.393784
1,198432.037676
2,202574.713609
3,96085.845483
4,172913.776976


In [31]:
final_preds = pd.concat([testdata['Id'],y_pred_df], axis=1 )
final_preds.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 878 entries, 0 to 877
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Id         878 non-null    int64  
 1   SalePrice  878 non-null    float64
dtypes: float64(1), int64(1)
memory usage: 13.8 KB
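
`pd.concat(..., axis=1)` aligns rows on the index, which works here because both frames share the same `RangeIndex`. A sketch of building the same two-column frame directly from arrays, which avoids relying on index alignment; the prediction values below are illustrative and `submission` is a hypothetical name.

```python
import numpy as np
import pandas as pd

# A few Ids with illustrative prediction values.
ids = pd.Series([2658, 2718, 2414], name="Id")
preds = np.array([128715.4, 198432.0, 202574.7])

# Build the frame positionally so row order, not index labels, pairs
# each Id with its prediction.
submission = pd.DataFrame({"Id": ids.to_numpy(), "SalePrice": preds})

# Basic sanity checks before exporting.
assert len(submission) == len(ids)
assert submission["SalePrice"].notna().all()
print(submission)
```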


In [33]:
# export predictions
filepath = '../datasets/submissions/prediction2.csv'

final_preds.to_csv(filepath, index=False)

# Notebook Summary:

In this notebook, I combined the 26 dummy columns selected in Notebook 3a with the 10 continuous features whose absolute correlation with SalePrice is above 0.2, and fit a linear regression model on the resulting 36 features.

I first did a train/test split on the training data. The mean cross-validation R2 on the train split was about 0.747, and the fitted model scored an R2 of about 0.791 on the train split and 0.851 on the test split.

Finally, I prepared the test data the same way as the training data (creating the dummy columns and keeping only the continuous features used in training), applied the model that I fit on the train split to it, and exported the predictions as **Prediction 2**.