# Model building
## With main variables

In [1]:
# === System imports ===
import sys

# Add the parent directories to the import path so local modules can be found
sys.path.append("../../")

# === Third-party import ===

# To handle datasets
import pandas as pd

# To display only a limited number of rows of the dataframe in the whole notebook
pd.options.display.max_rows = 20

# === Local imports ===
import utils

root = utils.get_project_root()

## Load and split train and test sets

In the notebook [2.1.feature_engineering.ipynb](../2.1.feature_engineering.ipynb), <br>
we engineered the dataframes `X_train` and `X_test`, keeping as variables those that have the most impact on the target variable `SalePrice`.<br>
We saved these two dataframes. Let's import them.

In [6]:
X_train = pd.read_csv(filepath_or_buffer=f'{root}/datasets/outputs/with_main_variables/x_train.csv', index_col=0)
X_test = pd.read_csv(filepath_or_buffer=f'{root}/datasets/outputs/with_main_variables/x_test.csv', index_col=0)

print(f'X_train shape: {X_train.shape}')
print(f'X_test shape: {X_test.shape}') 

X_train shape: (1226, 24)
X_test shape: (146, 24)


In [7]:
X_train.head()

Unnamed: 0,SalePrice,GrLivArea,GarageArea,TotalBsmtSF,OverallQual_5,OverallQual_6,OverallQual_7,OverallQual_8,OverallQual_9,OverallQual_Rare,...,TotRmsAbvGrd_4,TotRmsAbvGrd_5,TotRmsAbvGrd_6,TotRmsAbvGrd_7,TotRmsAbvGrd_8,TotRmsAbvGrd_9,TotRmsAbvGrd_Rare,OverallQual_3,OverallQual_4,TotRmsAbvGrd_12
930,12.21106,7.290293,610,1466,0,0,0,1,0,0,...,0,0,0,1,0,0,0,0,0,0
656,11.887931,6.959399,312,1053,1,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
45,12.675764,7.468513,576,1752,0,0,0,0,1,0,...,0,0,1,0,0,0,0,0,0,0
1348,12.278393,7.309212,514,1482,0,0,1,0,0,0,...,0,1,0,0,0,0,0,0,0,0
55,12.103486,7.261927,576,1425,0,1,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0


In [8]:
X_test.head()

Unnamed: 0,SalePrice,GrLivArea,GarageArea,TotalBsmtSF,OverallQual_3,OverallQual_4,OverallQual_5,OverallQual_6,OverallQual_7,OverallQual_8,...,TotRmsAbvGrd_12,TotRmsAbvGrd_4,TotRmsAbvGrd_5,TotRmsAbvGrd_6,TotRmsAbvGrd_7,TotRmsAbvGrd_8,TotRmsAbvGrd_9,TotRmsAbvGrd_Rare,OverallQual_Rare,TotRmsAbvGrd_3
529,12.209188,7.830028,484,2035,0,0,0,1,0,0,...,0,0,0,0,0,0,1,0,0,0
491,11.798104,7.363914,240,806,0,0,0,1,0,0,...,0,0,1,0,0,0,0,0,0,0
459,11.608236,7.092574,352,709,0,0,1,0,0,0,...,0,0,1,0,0,0,0,0,0,0
279,12.165251,7.611842,505,1160,0,0,0,0,1,0,...,0,0,0,0,0,1,0,0,0,0
655,11.385092,6.995766,264,525,0,0,0,1,0,0,...,0,0,0,1,0,0,0,0,0,0
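
The two previews above list their dummy columns in a different order, and the truncated output does not confirm that both frames contain exactly the same set of columns. Most estimators expect train and test features to share the same columns in the same order, so a quick check can help. The sketch below uses small made-up frames rather than the project's data. `align_to_train` is a hypothetical helper, and filling absent columns with 0 is an assumption that only makes sense for one-hot indicator columns.

```python
import pandas as pd


def align_to_train(train: pd.DataFrame, test: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of `test` with the columns of `train`, in the same order.

    Hypothetical helper: columns absent from `test` are filled with 0
    (an assumption that suits one-hot dummies only); columns that exist
    only in `test` are dropped.
    """
    missing = train.columns.difference(test.columns)
    extra = test.columns.difference(train.columns)
    if len(missing) or len(extra):
        print(f"missing in test: {list(missing)}; only in test: {list(extra)}")
    return test.reindex(columns=train.columns, fill_value=0)


# Toy frames standing in for X_train / X_test (values are illustrative)
toy_train = pd.DataFrame(
    {"GrLivArea": [7.29, 6.96], "OverallQual_5": [0, 1], "OverallQual_Rare": [0, 0]}
)
toy_test = pd.DataFrame(
    {"OverallQual_3": [0, 1], "OverallQual_5": [1, 0], "GrLivArea": [7.83, 7.36]}
)

toy_test_aligned = align_to_train(toy_train, toy_test)
```

After alignment, `toy_test_aligned.columns` matches `toy_train.columns` exactly, which is the property a fitted model relies on at prediction time.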


Let's remove the target `SalePrice` from `X_train` and `X_test` and store it in `y_train` and `y_test`, respectively.

In [9]:
target = 'SalePrice'

y_train = X_train[target]
X_train.drop(target, axis=1, inplace=True)

y_test = X_test[target]
X_test.drop(target, axis=1, inplace=True)


print('Shapes:')
for df_name, df in [('X_train', X_train), ('X_test', X_test), ('y_train', y_train), ('y_test', y_test)]:
    print(f'{df_name}:  {df.shape}')

Shapes:
X_train:  (1226, 23)
X_test:  (146, 23)
y_train:  (1226,)
y_test:  (146,)
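
The split above drops the target in place, so re-running that cell raises a `KeyError` once `SalePrice` is gone from the frames. A non-mutating variant can be sketched as follows. `split_target` is a hypothetical helper, not part of the project's `utils`, and the toy frame only stands in for the real data.

```python
import pandas as pd


def split_target(df: pd.DataFrame, target: str):
    """Return (features, target) without modifying `df` (hypothetical helper)."""
    features = df.drop(columns=target)  # new frame; `df` keeps its target column
    return features, df[target]


# Toy frame standing in for the loaded data (values are illustrative)
toy = pd.DataFrame(
    {"SalePrice": [12.21, 11.89], "GrLivArea": [7.29, 6.96], "GarageArea": [610, 312]}
)

X_toy, y_toy = split_target(toy, "SalePrice")
print(X_toy.shape, y_toy.shape)
```

Because the original frame is left untouched, the cell can be executed repeatedly with the same result.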


## ML models

Let's try different linear models.
We'