Time for some Preprocessing
In this section, I will split my data into training and test sets, scale the continuous variables, and build a couple of models to find the best baseline model, upon which I will build my final model.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

In [2]:
df = pd.read_csv('../capstone2-housing/documents/final_housing_df.csv', index_col=0)
f_pairs = pd.read_csv('../capstone2-housing/documents/feature_pairs_df.csv')
ft_pairs = pd.read_csv('../capstone2-housing/documents/feature_target_df.csv')

I will begin by performing a train/test split with a test size of 25%.

In [3]:
X = df.drop('SalePrice', axis=1)
y = df.SalePrice

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=123)

In [4]:
print(X_train.shape, X_test.shape)
print(y_train.shape, y_test.shape)

(1095, 601) (365, 601)
(1095,) (365,)


Train test split is done! Time to scale my data.  
I have decided to scale only the continuous variables (as there is not much point in scaling dummy variables). Therefore, before scaling my data, I will create a mask: a list of the columns with more than two unique values in the training set, which I treat as continuous. The same list will be used for both the train and test X sets.

In [5]:
mask = [col for col in X_train if len(pd.unique(X_train[col])) > 2]
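To make the selection rule concrete, here is a minimal sketch on a made-up frame. The frame `toy` and its column names are hypothetical and only illustrate the "more than two unique values" test used above.

```python
import pandas as pd

# Hypothetical toy frame: two continuous columns and one 0/1 dummy column.
toy = pd.DataFrame({
    "lot_area": [8450, 9600, 11250, 9550],
    "year_built": [2003, 1976, 2001, 1915],
    "street_paved": [1, 1, 0, 1],
})

# Same rule as the mask above: keep columns with more than two distinct values.
toy_mask = [col for col in toy if len(pd.unique(toy[col])) > 2]
print(toy_mask)  # ['lot_area', 'year_built']
```

Note that this rule is a heuristic: an ordinal column with three or more levels would also be picked up, and a continuous column that happens to take only two values in the training set would be skipped.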

I will create a standard scaler to standardize the continuous values of the training set.

In [6]:
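columns = X_train[mask].columns
# ColumnTransformer outputs the scaled columns first, then the passthrough ones;
# reorder both frames the same way so the column labels reused below stay aligned.
col_order = list(columns) + [c for c in X_train.columns if c not in columns]
X_train, X_test = X_train[col_order], X_test[col_order]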

ct = ColumnTransformer([
        ('scaler', StandardScaler(), columns)
    ], remainder='passthrough')

scaled_X_train = ct.fit_transform(X_train)
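One detail worth checking at this step: ColumnTransformer returns the transformed columns first and appends the remainder='passthrough' columns after them, so the output order matches the input order only when every scaled column already precedes every passthrough column. Below is a minimal sketch on hypothetical toy data (the names `toy_train`, `toy_test`, `is_new`, `area`, and `age` are made up for illustration). It also shows that test rows are scaled with the mean and standard deviation learned from the training rows.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data; the dummy column deliberately comes first.
toy_train = pd.DataFrame({
    "is_new": [0, 1, 0, 1],
    "area": [10.0, 20.0, 30.0, 40.0],
    "age": [5.0, 3.0, 8.0, 1.0],
})
# One test row sitting exactly at the training means (area=25, age=4.25).
toy_test = pd.DataFrame({"is_new": [1], "area": [25.0], "age": [4.25]})
scaled_cols = ["area", "age"]

toy_ct = ColumnTransformer(
    [("scaler", StandardScaler(), scaled_cols)],
    remainder="passthrough",
)
train_out = toy_ct.fit_transform(toy_train)  # learns mean/std from training rows only
test_out = toy_ct.transform(toy_test)        # reuses the training mean/std

# Output order: scaled columns first, then the passthrough remainder.
ordered_cols = scaled_cols + [c for c in toy_train.columns if c not in scaled_cols]
toy_train_scaled = pd.DataFrame(train_out, columns=ordered_cols, index=toy_train.index)
toy_test_scaled = pd.DataFrame(test_out, columns=ordered_cols, index=toy_test.index)

print(toy_train_scaled.columns.tolist())  # ['area', 'age', 'is_new']
print(toy_test_scaled)
```

Labeling the output with the input's original column order would, in this toy case, attach the name `is_new` to the scaled `area` values, which is why the labels are rebuilt in the transformer's order.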

In [7]:
X_train = pd.DataFrame(scaled_X_train, columns = X_train.columns)

Perfect! Now I can build a linear regression model.

In [8]:
model1 = LinearRegression()
model1.fit(X_train, y_train)
model1.score(X_train, y_train)

0.9564626574843584

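This model gives me an R² score of 0.956 on the training set (for a regressor, score() returns the coefficient of determination, not an accuracy). That's quite high and, with 601 features for only 1,095 training rows, is most likely due to overfitting. Let's see how it performs on the test set.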

In [9]:
scaled_X_test = ct.transform(X_test)
X_test = pd.DataFrame(scaled_X_test, columns = X_test.columns)
model1.score(X_test, y_test)

-5.942855744917795e+18
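
For context on reading this number: for a regressor, score() returns the coefficient of determination, R² = 1 - SS_res / SS_tot. It equals 1 for perfect predictions, 0 for always predicting the mean of y, and it has no lower bound. Here is a minimal NumPy sketch of that formula; the function name `r_squared` and the toy values are hypothetical and not part of this notebook.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # squared error of the predictions
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # squared error of predicting the mean
    return 1.0 - ss_res / ss_tot

y_true = [100.0, 200.0, 300.0]  # hypothetical targets

print(r_squared(y_true, [100.0, 200.0, 300.0]))  # perfect predictions -> 1.0
print(r_squared(y_true, [200.0, 200.0, 200.0]))  # always the mean -> 0.0
print(r_squared(y_true, [1e6, -1e6, 1e6]))       # wildly wrong predictions -> hugely negative
```

A value on the order of -10^18 therefore suggests that at least some test predictions are off by many orders of magnitude. One plausible cause, not verified here, is unstable coefficients from fitting an unregularized linear model on hundreds of sparse, nearly collinear dummy columns.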