# Notebook Intro:

In this notebook, I use my cleaned training data and the five features that notebook 2a identified as having the highest R² values with respect to Sale Price to build a linear regression model.

I first do a train/test split on the training data, then fit the linear regression model to the train split. I check that the model is reasonable with a cross-validation score and by scoring it on the test split.

Finally, I apply the linear regression model that I fit on the train split of the training data to the test data and export the predictions as **Prediction 1**.
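
Notebook 2a is not reproduced here, so the following is only a sketch of how a "best R²" ranking could be computed. It assumes each feature is scored on its own against the target; for a single-feature linear regression with an intercept, R² equals the squared Pearson correlation. The toy data, the column names, and the helper `rank_by_r2` are all made up for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a cleaned training frame (values are made up).
rng = np.random.default_rng(0)
n = 500
toy = pd.DataFrame({
    "strong": rng.normal(size=n),
    "weak": rng.normal(size=n),
    "noise": rng.normal(size=n),
})
toy["target"] = 5.0 * toy["strong"] + 2.0 * toy["weak"] + rng.normal(size=n)


def rank_by_r2(frame, target, k):
    """Hypothetical helper: top-k numeric columns by squared correlation with target."""
    corr = frame.select_dtypes(include="number").corr()[target].drop(target)
    return (corr ** 2).sort_values(ascending=False).head(k)


top = rank_by_r2(toy, "target", k=2)
print(top)
```

Ranking features one at a time ignores overlap between them (for example, two garage measurements carry much of the same information), so a list built this way is a starting point rather than an optimal feature set.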

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, cross_val_score

In [2]:
# import cleaned info
filepath = '../datasets/interim_files/train_clean.csv'

df = pd.read_csv(filepath)
df.head()

Unnamed: 0,Id,PID,MS SubClass,MS Zoning,Lot Frontage,Lot Area,Street,Alley,Lot Shape,Land Contour,...,Screen Porch,Pool Area,Pool QC,Fence,Misc Feature,Misc Val,Mo Sold,Yr Sold,Sale Type,SalePrice
0,109,533352170,60,RL,0.0,13517,Pave,none,IR1,Lvl,...,0,0,none,none,none,0,3,2010,WD,130500
1,544,531379050,60,RL,43.0,11492,Pave,none,IR1,Lvl,...,0,0,none,none,none,0,4,2009,WD,220000
2,153,535304180,20,RL,68.0,7922,Pave,none,Reg,Lvl,...,0,0,none,none,none,0,1,2010,WD,109000
3,318,916386060,60,RL,73.0,9802,Pave,none,Reg,Lvl,...,0,0,none,none,none,0,4,2010,WD,174000
4,255,906425045,50,RL,82.0,14235,Pave,none,IR1,Lvl,...,0,0,none,none,none,0,3,2010,WD,138500


In [3]:
# Create feature matrix X and target vector y

features = ['Overall Qual','Gr Liv Area','Garage Area','Garage Cars','Total Bsmt SF']

X = df[features]
y = df['SalePrice']

In [4]:
# Train-Test-Split

X_train, X_test, y_train, y_test = train_test_split(X,y, random_state=42)
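
The call above does not pass `test_size`, so scikit-learn falls back to its default of holding out 25% of the rows. A small sketch on made-up arrays (not the housing data) shows the resulting sizes:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy arrays: 100 rows, 2 features (values are arbitrary).
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.arange(100)

# No test_size given: 25% of rows go to the test split by default.
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, random_state=42)
print(X_tr.shape, X_te.shape)

# An explicit test_size makes the split ratio visible to the reader.
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)
print(X_tr2.shape, X_te2.shape)
```

Fixing `random_state` keeps the split reproducible between runs, which is what makes the scores below comparable if the notebook is re-executed.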

In [5]:
# Instantiate Model

lr = LinearRegression()

In [6]:
# cross val score

cross_val_score(lr,X_train,y_train).mean()

0.7393194085418239
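
With no `cv` or `scoring` argument, `cross_val_score` uses 5 folds and the estimator's own `score` method, which is R² for `LinearRegression`. The mean hides how much the folds disagree, so it can help to look at the individual values too. A minimal sketch on synthetic data (the coefficients and noise level are assumptions chosen only for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression problem with five features (not the housing data).
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(200, 5))
true_coefs = np.array([3.0, 2.0, 1.0, 0.5, 0.0])
y_toy = X_toy @ true_coefs + rng.normal(scale=0.5, size=200)

# cv=5 is spelled out here; it matches the default used above.
scores = cross_val_score(LinearRegression(), X_toy, y_toy, cv=5)

print(scores)                        # one R² value per fold
print(scores.mean(), scores.std())   # average and spread across folds
```

A large spread between folds would suggest the mean score depends heavily on which rows land in which fold.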

In [7]:
# cross-val score is acceptable, so fit the model

lr.fit(X_train,y_train)

LinearRegression()

In [8]:
# train score

lr.score(X_train,y_train)

0.7611896164421184

In [9]:
# similar to cross val score, so keep model

In [10]:
lr.score(X_test, y_test)

0.8161173326229041
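
The value returned by `lr.score` is R², which is unitless. As a sketch of what it measures, and of a companion metric in the target's own units (RMSE), both can be computed by hand with NumPy. The four prices below are invented, and `r2_manual` and `rmse` are hypothetical helper names:

```python
import numpy as np


def r2_manual(y_true, y_pred):
    """R² = 1 - (residual sum of squares) / (total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Invented sale prices and predictions, each off by exactly 10,000.
y_actual = [100_000, 150_000, 200_000, 250_000]
y_hat = [110_000, 140_000, 210_000, 240_000]

print(r2_manual(y_actual, y_hat))  # 0.968
print(rmse(y_actual, y_hat))       # 10000.0
```

Reporting RMSE next to R² gives a sense of how far off a typical prediction is in dollars.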

In [None]:
# test score (0.816) is no worse than the train score (0.761), so apply the model to the test data

## Apply model to test data

In [13]:
# import cleaned test info
filepath = '../datasets/interim_files/test_clean.csv'

testdata = pd.read_csv(filepath)
testdata.head()

Unnamed: 0,Id,PID,MS SubClass,MS Zoning,Lot Frontage,Lot Area,Street,Alley,Lot Shape,Land Contour,...,3Ssn Porch,Screen Porch,Pool Area,Pool QC,Fence,Misc Feature,Misc Val,Mo Sold,Yr Sold,Sale Type
0,2658,902301120,190,RM,69.0,9142,Pave,Grvl,Reg,Lvl,...,0,0,0,none,none,none,0,4,2006,WD
1,2718,905108090,90,RL,0.0,9662,Pave,none,IR1,Lvl,...,0,0,0,none,none,none,0,8,2006,WD
2,2414,528218130,60,RL,58.0,17104,Pave,none,IR1,Lvl,...,0,0,0,none,none,none,0,9,2006,New
3,1989,902207150,30,RM,60.0,8520,Pave,none,Reg,Lvl,...,0,0,0,none,none,none,0,7,2007,WD
4,625,535105100,20,RL,0.0,9500,Pave,none,IR1,Lvl,...,0,185,0,none,none,none,0,7,2009,WD


In [14]:
testdata.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 878 entries, 0 to 877
Data columns (total 80 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Id               878 non-null    int64  
 1   PID              878 non-null    int64  
 2   MS SubClass      878 non-null    int64  
 3   MS Zoning        878 non-null    object 
 4   Lot Frontage     878 non-null    float64
 5   Lot Area         878 non-null    int64  
 6   Street           878 non-null    object 
 7   Alley            878 non-null    object 
 8   Lot Shape        878 non-null    object 
 9   Land Contour     878 non-null    object 
 10  Utilities        878 non-null    object 
 11  Lot Config       878 non-null    object 
 12  Land Slope       878 non-null    object 
 13  Neighborhood     878 non-null    object 
 14  Condition 1      878 non-null    object 
 15  Condition 2      878 non-null    object 
 16  Bldg Type        878 non-null    object 
 17  House Style     

In [15]:
# Create feature matrix for test data

features = ['Overall Qual','Gr Liv Area','Garage Area','Garage Cars','Total Bsmt SF']

X_testdata = testdata[features]

In [16]:
# verify no nulls in the dataset
X_testdata.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 878 entries, 0 to 877
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   Overall Qual   878 non-null    int64
 1   Gr Liv Area    878 non-null    int64
 2   Garage Area    878 non-null    int64
 3   Garage Cars    878 non-null    int64
 4   Total Bsmt SF  878 non-null    int64
dtypes: int64(5)
memory usage: 34.4 KB
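
Reading the `info()` output works for five columns, but the same check can also be made to fail loudly. A minimal sketch, assuming a hypothetical helper `check_model_inputs` and a tiny made-up frame:

```python
import numpy as np
import pandas as pd


def check_model_inputs(frame, feature_names):
    """Hypothetical helper: return the feature columns, or raise if any are missing or null."""
    missing_cols = [c for c in feature_names if c not in frame.columns]
    if missing_cols:
        raise KeyError(f"missing feature columns: {missing_cols}")
    null_counts = frame[feature_names].isna().sum()
    if null_counts.any():
        raise ValueError(f"null values found:\n{null_counts[null_counts > 0]}")
    return frame[feature_names]


# Made-up rows using two of the feature names from this notebook.
toy_features = ["Overall Qual", "Gr Liv Area"]
clean = pd.DataFrame({"Overall Qual": [6, 7], "Gr Liv Area": [1500, 2100]})
dirty = pd.DataFrame({"Overall Qual": [6, np.nan], "Gr Liv Area": [1500, 2100]})

X_checked = check_model_inputs(clean, toy_features)  # passes silently

try:
    check_model_inputs(dirty, toy_features)
except ValueError as err:
    print("rejected:", err)
```

`LinearRegression.predict` raises on NaN input anyway, but checking up front gives a clearer message about which column is at fault.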


In [17]:
# Make Predictions

y_pred = lr.predict(X_testdata)

In [18]:
# wrap the prediction array in a DataFrame with a named column
y_pred_df = pd.DataFrame(y_pred, columns=['SalePrice'])
y_pred_df.head()

Unnamed: 0,SalePrice
0,185045.932553
1,205632.928113
2,192630.997718
3,132549.349863
4,187666.092798


In [19]:
final_preds = pd.concat([testdata['Id'],y_pred_df], axis=1 )
final_preds.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 878 entries, 0 to 877
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Id         878 non-null    int64  
 1   SalePrice  878 non-null    float64
dtypes: float64(1), int64(1)
memory usage: 13.8 KB
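
`pd.concat(..., axis=1)` lines rows up by index label, not by position. It works above because both objects carry the same 0 to 877 `RangeIndex`. As a sketch of what could go wrong if the test frame had been filtered first, and of a positional alternative, using a few made-up rows:

```python
import numpy as np
import pandas as pd

# Toy frame whose index has a gap after filtering (labels 0 and 2 remain).
filtered = pd.DataFrame({"Id": [2658, 2718, 2414]}).iloc[[0, 2]]
toy_preds = np.array([185045.9, 192631.0])  # one prediction per remaining row

# Label-based alignment: index {0, 2} vs {0, 1} produces 3 rows with NaNs.
misaligned = pd.concat(
    [filtered["Id"], pd.DataFrame({"SalePrice": toy_preds})], axis=1
)
print(misaligned)

# Positional pairing: strip the index from Id before building the frame.
aligned = pd.DataFrame({"Id": filtered["Id"].to_numpy(), "SalePrice": toy_preds})
print(aligned)
```

Checking that the output has the expected row count and no nulls, as the `info()` call above does, catches this kind of mismatch before export.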


In [20]:
# export predictions
filepath = '../datasets/submissions/prediction1.csv'

final_preds.to_csv(filepath, index=False)

# Notebook Summary of Work:

In this notebook, I built a linear regression model from my cleaned training data using the five features that notebook 2a identified as having the highest R² values with respect to Sale Price: Overall Qual, Gr Liv Area, Garage Area, Garage Cars, and Total Bsmt SF.

After a train/test split of the training data, the model had a mean cross-validation R² of about 0.739 on the train split, a train R² of about 0.761, and a test R² of about 0.816.

Finally, I applied the model fit on the train split to the cleaned test data (878 rows) and exported the predictions as **Prediction 1**.