Time for some Preprocessing
In this section, I will split my data into training and test sets, scale the continuous variables, and build a couple of models to find the best baseline model, on which I will build my final model.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

In [2]:
df = pd.read_csv('../capstone2-housing/documents/final_housing_df.csv', index_col=0)
f_pairs = pd.read_csv('../capstone2-housing/documents/feature_pairs_df.csv')
ft_pairs = pd.read_csv('../capstone2-housing/documents/feature_target_df.csv')

I will begin by performing a train/test split with a test size of 25%.

In [3]:
X = df.drop('SalePrice', axis=1)
y = df.SalePrice

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=123)

Now that the train/test split is done, it's time to scale my data.
I have decided to scale only the continuous variables, as there is not much point in scaling dummy variables. Before scaling, I will therefore move the dummy variables (any column with at most two unique values) into a separate dataframe.

In [4]:
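# Names of the columns with at most two unique values (treated as dummies)
train_mask = [col for col in X_train if len(pd.unique(X_train[col])) <= 2]

X_train_dummies = X_train[train_mask]
# Everything else is treated as continuous
X_train_contin = X_train.drop(columns=train_mask)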

In [5]:
X_train_dummies.info()
X_train_contin.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1095 entries, 1446 to 1389
Columns: 573 entries, KitchenAbvGr to SaleCondition_Partial
dtypes: int64(573)
memory usage: 4.8 MB
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1095 entries, 1446 to 1389
Data columns (total 28 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   LotFrontage    1095 non-null   int64
 1   LotArea        1095 non-null   int64
 2   MasVnrArea     1095 non-null   int64
 3   BsmtFinSF1     1095 non-null   int64
 4   BsmtFinSF2     1095 non-null   int64
 5   BsmtUnfSF      1095 non-null   int64
 6   TotalBsmtSF    1095 non-null   int64
 7   1stFlrSF       1095 non-null   int64
 8   2ndFlrSF       1095 non-null   int64
 9   LowQualFinSF   1095 non-null   int64
 10  GrLivArea      1095 non-null   int64
 11  BsmtFullBath   1095 non-null   int64
 12  BsmtHalfBath   1095 non-null   int64
 13  FullBath       1095 non-null   int64
 14  HalfBath       1095 non-null   int64

In [6]:
# TODO: the test set needs the same split. It should reuse train_mask rather
# than a mask recomputed on X_test, so that both sets keep identical columns.

#X_test_dummies = X_test[train_mask]
#X_test_contin = X_test.drop(columns=train_mask)
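
To illustrate why the column split is worth deriving from the training rows only, here is a minimal sketch on toy data. The frames `X_train_toy` and `X_test_toy` and their columns are hypothetical stand-ins, not the housing data. It shows that the "at most two unique values" rule can classify a column differently on a small test sample.

```python
import pandas as pd

# Toy stand-ins for X_train / X_test (hypothetical data, not the housing set).
X_train_toy = pd.DataFrame(
    {"area": [50, 80, 120, 65], "rooms": [2, 3, 4, 3], "has_garage": [0, 1, 1, 0]},
    index=[7, 2, 9, 4],
)
X_test_toy = pd.DataFrame(
    {"area": [70, 95, 110], "rooms": [3, 3, 3], "has_garage": [1, 0, 1]},
    index=[1, 5, 8],
)

# Same rule as the notebook: columns with at most two distinct values.
train_mask = [c for c in X_train_toy if len(pd.unique(X_train_toy[c])) <= 2]

# Recomputing the rule on the test rows gives a different list here,
# because "rooms" happens to take a single value in this test sample.
test_mask = [c for c in X_test_toy if len(pd.unique(X_test_toy[c])) <= 2]

# Reusing train_mask keeps the test split on the same columns as the train split.
X_test_dummies = X_test_toy[train_mask]
X_test_contin = X_test_toy.drop(columns=train_mask)

print("train rule:", train_mask)
print("test rule: ", test_mask)
```

With the training list reused, `X_test_contin` has the same columns as the continuous training frame, which is what a scaler fitted on the training data expects.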

Now that I have divided my X_train set into dummy and continuous variables, I will create a standard scaler and use it to standardize the continuous training set.

In [7]:
scaler = StandardScaler()
X_contin_transformed = scaler.fit_transform(X_train_contin)
print(X_contin_transformed)

[[-0.00696566  1.65439997  0.49143639 ... -0.07333341 -0.10879978
   1.60926176]
 [-0.89604872 -0.09339394 -0.56990559 ... -0.07333341 -0.10879978
   0.85845578]
 [-0.00696566 -0.03679453 -0.56990559 ... -0.07333341 -0.10879978
   0.85845578]
 ...
 [-0.00696566  1.09456708 -0.56990559 ... -0.07333341 -0.10879978
  -1.39396214]
 [ 0.03982819 -0.19364381 -0.56990559 ... -0.07333341 -0.10879978
   0.85845578]
 [-0.42811027 -0.44896771 -0.56990559 ... -0.07333341 -0.10879978
  -0.64315617]]
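
As an alternative to splitting and recombining the frames by hand, scikit-learn's `ColumnTransformer` can scale a chosen list of columns and pass the rest through unchanged. The sketch below is not what this notebook does; it runs on synthetic data, and the column names, coefficients, and variable names (`X_toy`, `y_toy`, `preprocess`, `model`) are assumptions made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the housing data (hypothetical columns and values).
rng = np.random.default_rng(0)
X_toy = pd.DataFrame({
    "area": rng.normal(1500, 400, size=200),
    "lot": rng.normal(9000, 2500, size=200),
    "has_garage": rng.integers(0, 2, size=200),
})
y_toy = 100 * X_toy["area"] + 5 * X_toy["lot"] + 20000 * X_toy["has_garage"]

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=123
)

# Column lists are derived from the training rows only.
dummy_cols = [c for c in X_tr if len(pd.unique(X_tr[c])) <= 2]
contin_cols = [c for c in X_tr if c not in dummy_cols]

# Scale the continuous columns; pass the dummy columns through unchanged.
preprocess = ColumnTransformer(
    [("scale", StandardScaler(), contin_cols)],
    remainder="passthrough",
)
model = make_pipeline(preprocess, LinearRegression())

model.fit(X_tr, y_tr)               # scaler statistics come from the training rows
r2_test = model.score(X_te, y_te)   # test rows are scaled with those same statistics
print(f"test R^2: {r2_test:.3f}")
```

The pipeline fits the scaler on the training rows and reuses its mean and standard deviation on the test rows, so the two sets are always transformed the same way.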


Now that I have my transformed X_train_contin data set, I will combine it with the X_train_dummies set.

In [8]:
df = pd.DataFrame(X_contin_transformed, columns=X_train_contin.columns)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1095 entries, 0 to 1094
Data columns (total 28 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   LotFrontage    1095 non-null   float64
 1   LotArea        1095 non-null   float64
 2   MasVnrArea     1095 non-null   float64
 3   BsmtFinSF1     1095 non-null   float64
 4   BsmtFinSF2     1095 non-null   float64
 5   BsmtUnfSF      1095 non-null   float64
 6   TotalBsmtSF    1095 non-null   float64
 7   1stFlrSF       1095 non-null   float64
 8   2ndFlrSF       1095 non-null   float64
 9   LowQualFinSF   1095 non-null   float64
 10  GrLivArea      1095 non-null   float64
 11  BsmtFullBath   1095 non-null   float64
 12  BsmtHalfBath   1095 non-null   float64
 13  FullBath       1095 non-null   float64
 14  HalfBath       1095 non-null   float64
 15  BedroomAbvGr   1095 non-null   float64
 16  TotRmsAbvGrd   1095 non-null   float64
 17  Fireplaces     1095 non-null   float64
 18  GarageCa

In [9]:
X_train_ready = pd.concat([df, X_train_dummies], axis=1)
X_train_ready.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1368 entries, 0 to 1459
Columns: 601 entries, LotFrontage to SaleCondition_Partial
dtypes: float64(601)
memory usage: 6.3 MB


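The columns are all there (28 scaled + 573 dummies = 601), but the row count is off: X_train_ready has 1368 rows where X_train has 1095, and every column has become float64. The scaled dataframe was built with a fresh RangeIndex (0 to 1094), so pd.concat matched rows by label against the original index of X_train_dummies and filled the mismatches with NaN. Passing index=X_train_contin.index when building the scaled dataframe keeps the rows aligned, and this needs fixing before I build a linear regression model.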
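
Because `pd.concat(..., axis=1)` aligns rows by index label, the scaled frame has to carry the same index as the dummy frame. Here is a minimal sketch on toy data of the difference; the frames `contin` and `dummies` are hypothetical stand-ins for X_train_contin and X_train_dummies.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy frames with a shuffled, non-contiguous index, like the one that
# train_test_split leaves behind (hypothetical data).
contin = pd.DataFrame({"area": [50.0, 80.0, 120.0]}, index=[7, 2, 9])
dummies = pd.DataFrame({"has_garage": [0, 1, 1]}, index=[7, 2, 9])

# fit_transform returns a plain NumPy array, so the index is lost here.
scaled = StandardScaler().fit_transform(contin)

# Default RangeIndex (0, 1, 2): concat takes the union of labels and pads with NaN.
misaligned = pd.concat(
    [pd.DataFrame(scaled, columns=contin.columns), dummies], axis=1
)

# Reusing the original index keeps one row per observation.
scaled_df = pd.DataFrame(scaled, columns=contin.columns, index=contin.index)
aligned = pd.concat([scaled_df, dummies], axis=1)

print(misaligned.shape, aligned.shape)
```

A quick check such as `aligned.shape[0] == len(contin)` or `aligned.isna().sum().sum() == 0` after the concat would catch this kind of mismatch early.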