
# Feature Engineering for House Prices Prediction

## 1. Introduction
In this notebook, we will perform **Feature Engineering**, which includes:
1. Handling missing values.
2. Transforming and encoding categorical variables.
3. Creating new features from existing ones.
4. Applying feature scaling to numerical features.
5. Saving the cleaned dataset for model building.


In [9]:
# Import necessary libraries for feature engineering
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import joblib

# Show all columns when displaying DataFrames
pd.set_option('display.max_columns', None)


## 2. Handling Missing Values
We will impute missing values with the median for numerical columns and with the mode for categorical columns. Before imputing, we add a binary `<feature>_nan` indicator column for each affected feature, so the information about which values were originally missing is preserved.
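The strategy can be sketched on a tiny frame before applying it to the full dataset. The values below are hypothetical and are not taken from `train.csv`; only the column names are borrowed from it.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values, used only for illustration)
toy = pd.DataFrame({
    "LotFrontage": [60.0, np.nan, 80.0, 70.0],
    "GarageType": ["Attchd", None, "Detchd", "Attchd"],
})

for col in ["LotFrontage", "GarageType"]:
    # Flag rows that were missing before imputation
    toy[col + "_nan"] = toy[col].isnull().astype(int)
    if pd.api.types.is_numeric_dtype(toy[col]):
        fill_value = toy[col].median()   # median ignores NaN by default
    else:
        fill_value = toy[col].mode()[0]  # most frequent category
    # Assign the result back instead of using inplace=True on a column selection
    toy[col] = toy[col].fillna(fill_value)

print(toy)
```

The indicator column is created first because the missingness information is lost once `fillna` has run.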


In [10]:

# Load the training dataset
train = pd.read_csv('train.csv')

# Display the shape of the train dataset
print(f"Train dataset has {train.shape[0]} rows and {train.shape[1]} columns")

Train dataset has 1460 rows and 81 columns


In [11]:

# List of numerical features with missing values
numerical_with_nan = [feature for feature in train.columns if train[feature].isnull().sum() > 0 and train[feature].dtypes != 'O']

# Impute missing values for numerical features with the median
for feature in numerical_with_nan:
    median_value = train[feature].median()
    train[feature + '_nan'] = np.where(train[feature].isnull(), 1, 0)  # Binary indicator: 1 where the value was missing
    # Assign back instead of fillna(..., inplace=True) on a column selection,
    # which pandas warns about and which stops working in pandas 3.0
    train[feature] = train[feature].fillna(median_value)

# List of categorical features with missing values
categorical_with_nan = [feature for feature in train.columns if train[feature].isnull().sum() > 0 and train[feature].dtypes == 'O']

# Impute missing values with the mode for categorical variables
for feature in categorical_with_nan:
    mode_value = train[feature].mode()[0]
    train[feature + '_nan'] = np.where(train[feature].isnull(), 1, 0)  # Binary indicator: 1 where the value was missing
    train[feature] = train[feature].fillna(mode_value)

# Check if all missing values have been handled
print(train[numerical_with_nan].isnull().sum(), train[categorical_with_nan].isnull().sum())


LotFrontage    0
MasVnrArea     0
GarageYrBlt    0
dtype: int64 Alley           0
MasVnrType      0
BsmtQual        0
BsmtCond        0
BsmtExposure    0
BsmtFinType1    0
BsmtFinType2    0
Electrical      0
FireplaceQu     0
GarageType      0
GarageFinish    0
GarageQual      0
GarageCond      0
PoolQC          0
Fence           0
MiscFeature     0
dtype: int64




## 3. Encoding Categorical Variables
We will use one-hot encoding to transform categorical features into numerical format.
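As a small illustration of what `pd.get_dummies` with `drop_first=True` produces, here is a sketch on a toy frame. The rows are hypothetical; `dtype=int` is an optional choice made here to get 0/1 columns instead of booleans.

```python
import pandas as pd

# Toy frame with two categorical columns (hypothetical rows)
toy = pd.DataFrame({
    "Street": ["Pave", "Grvl", "Pave"],
    "LotShape": ["Reg", "IR1", "IR2"],
})

# drop_first=True drops the first (alphabetically sorted) level of each column,
# so that level becomes the implicit baseline
encoded = pd.get_dummies(toy, columns=["Street", "LotShape"], drop_first=True, dtype=int)

print(encoded.columns.tolist())
print(encoded)
```

The second row (`Grvl`, `IR1`) is encoded as all zeros because both of its categories are the dropped baseline levels.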


In [12]:

# Apply one-hot encoding to categorical variables
categorical_features = [feature for feature in train.columns if train[feature].dtype == 'O']

train_encoded = pd.get_dummies(train, columns=categorical_features, drop_first=True)

# Preview the encoded dataset
train_encoded.head()


Unnamed: 0,Id,MSSubClass,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,TotRmsAbvGrd,Fireplaces,GarageYrBlt,GarageCars,GarageArea,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold,SalePrice,LotFrontage_nan,MasVnrArea_nan,GarageYrBlt_nan,Alley_nan,MasVnrType_nan,BsmtQual_nan,BsmtCond_nan,BsmtExposure_nan,BsmtFinType1_nan,BsmtFinType2_nan,Electrical_nan,FireplaceQu_nan,GarageType_nan,GarageFinish_nan,GarageQual_nan,GarageCond_nan,PoolQC_nan,Fence_nan,MiscFeature_nan,MSZoning_FV,MSZoning_RH,MSZoning_RL,MSZoning_RM,Street_Pave,Alley_Pave,LotShape_IR2,LotShape_IR3,LotShape_Reg,LandContour_HLS,LandContour_Low,LandContour_Lvl,Utilities_NoSeWa,LotConfig_CulDSac,LotConfig_FR2,LotConfig_FR3,LotConfig_Inside,LandSlope_Mod,LandSlope_Sev,Neighborhood_Blueste,Neighborhood_BrDale,Neighborhood_BrkSide,Neighborhood_ClearCr,Neighborhood_CollgCr,Neighborhood_Crawfor,Neighborhood_Edwards,Neighborhood_Gilbert,Neighborhood_IDOTRR,Neighborhood_MeadowV,Neighborhood_Mitchel,Neighborhood_NAmes,Neighborhood_NPkVill,Neighborhood_NWAmes,Neighborhood_NoRidge,Neighborhood_NridgHt,Neighborhood_OldTown,Neighborhood_SWISU,Neighborhood_Sawyer,Neighborhood_SawyerW,Neighborhood_Somerst,Neighborhood_StoneBr,Neighborhood_Timber,Neighborhood_Veenker,Condition1_Feedr,Condition1_Norm,Condition1_PosA,Condition1_PosN,Condition1_RRAe,Condition1_RRAn,Condition1_RRNe,Condition1_RRNn,Condition2_Feedr,Condition2_Norm,Condition2_PosA,Condition2_PosN,Condition2_RRAe,Condition2_RRAn,Condition2_RRNn,BldgType_2fmCon,BldgType_Duplex,BldgType_Twnhs,BldgType_TwnhsE,HouseStyle_1.5Unf,HouseStyle_1Story,HouseStyle_2.5Fin,HouseStyle_2.5Unf,HouseStyle_2Story,HouseStyle_SFoyer,HouseStyle_SLvl,RoofStyle_Gable,RoofStyle_Gambrel,RoofStyle_Hip,RoofStyle_Mansard,RoofStyle_Shed,RoofMatl_CompShg,RoofMatl_Membran,RoofMatl_Metal,RoofMatl_Roll,RoofMatl_Tar&Grv,RoofMatl_WdShake,RoofMatl_WdShngl,Exterior1st_AsphShn,Exterior1st_BrkComm,Exterior1st_BrkFace,Exterior1st_CBlock,Exterior1st_CemntBd,Exterior1st_HdBoard,Exterior1st_ImStucc,Exterior1st_MetalSd,Exterior1st_Plywood,Exterior1st_Stone,Exterior1st_Stucco,Exterior1st_VinylSd,Exterior1st_Wd Sdng,Exterior1st_WdShing,Exterior2nd_AsphShn,Exterior2nd_Brk Cmn,Exterior2nd_BrkFace,Exterior2nd_CBlock,Exterior2nd_CmentBd,Exterior2nd_HdBoard,Exterior2nd_ImStucc,Exterior2nd_MetalSd,Exterior2nd_Other,Exterior2nd_Plywood,Exterior2nd_Stone,Exterior2nd_Stucco,Exterior2nd_VinylSd,Exterior2nd_Wd Sdng,Exterior2nd_Wd Shng,MasVnrType_BrkFace,MasVnrType_Stone,ExterQual_Fa,ExterQual_Gd,ExterQual_TA,ExterCond_Fa,ExterCond_Gd,ExterCond_Po,ExterCond_TA,Foundation_CBlock,Foundation_PConc,Foundation_Slab,Foundation_Stone,Foundation_Wood,BsmtQual_Fa,BsmtQual_Gd,BsmtQual_TA,BsmtCond_Gd,BsmtCond_Po,BsmtCond_TA,BsmtExposure_Gd,BsmtExposure_Mn,BsmtExposure_No,BsmtFinType1_BLQ,BsmtFinType1_GLQ,BsmtFinType1_LwQ,BsmtFinType1_Rec,BsmtFinType1_Unf,BsmtFinType2_BLQ,BsmtFinType2_GLQ,BsmtFinType2_LwQ,BsmtFinType2_Rec,BsmtFinType2_Unf,Heating_GasA,Heating_GasW,Heating_Grav,Heating_OthW,Heating_Wall,HeatingQC_Fa,HeatingQC_Gd,HeatingQC_Po,HeatingQC_TA,CentralAir_Y,Electrical_FuseF,Electrical_FuseP,Electrical_Mix,Electrical_SBrkr,KitchenQual_Fa,KitchenQual_Gd,KitchenQual_TA,Functional_Maj2,Functional_Min1,Functional_Min2,Functional_Mod,Functional_Sev,Functional_Typ,FireplaceQu_Fa,FireplaceQu_Gd,FireplaceQu_Po,FireplaceQu_TA,GarageType_Attchd,GarageType_Basment,GarageType_BuiltIn,GarageType_CarPort,GarageType_Detchd,GarageFinish_RFn,GarageFinish_Unf,GarageQual_Fa,GarageQual_Gd,GarageQual_Po,GarageQual_TA,GarageCond_Fa,GarageCond_Gd,GarageCond_Po,GarageCond_TA,PavedDrive_P,PavedDrive_Y,PoolQC_Fa,PoolQC_Gd,Fence_GdWo,Fence_MnPrv,Fence_MnWw,MiscFeature_Othr,MiscFeature_Shed,MiscFeature_TenC,SaleType_CWD,SaleType_Con,SaleType_ConLD,SaleType_ConLI,SaleType_ConLw,SaleType_New,SaleType_Oth,SaleType_WD,SaleCondition_AdjLand,SaleCondition_Alloca,SaleCondition_Family,SaleCondition_Normal,SaleCondition_Partial
0,1,60,65.0,8450,7,5,2003,2003,196.0,706,0,150,856,856,854,0,1710,1,0,2,1,3,1,8,0,2003.0,2,548,0,61,0,0,0,0,0,2,2008,208500,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,1,1,1,False,False,True,False,True,False,False,False,True,False,False,True,False,False,False,False,True,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,True,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,True,False,False,True,False,False,False,False,True,False,True,False,False,False,False,True,False,False,False,True,False,False,True,False,True,False,False,False,False,False,False,False,True,True,False,False,False,False,False,False,False,False,True,False,False,False,True,False,True,False,False,False,False,False,False,True,False,True,False,False,True,False,False,False,False,True,False,False,False,False,True,False,False,False,True,False,True,False,True,False,True,False,False,True,False,False,False,False,False,False,False,False,True,False,False,False,True,False
1,2,20,80.0,9600,6,8,1976,1976,0.0,978,0,284,1262,1262,0,0,1262,0,1,2,0,3,1,6,1,1976.0,2,460,298,0,0,0,0,0,0,5,2007,181500,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,1,1,1,False,False,True,False,True,False,False,False,True,False,False,True,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,True,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,True,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,True,False,False,False,True,False,False,False,True,True,False,False,False,False,False,True,False,False,False,True,True,False,False,False,False,False,False,False,False,False,False,False,True,True,False,False,False,False,False,False,False,False,True,False,False,False,True,False,False,True,False,False,False,False,False,True,False,False,False,True,True,False,False,False,False,True,False,False,False,False,True,False,False,False,True,False,True,False,True,False,True,False,False,True,False,False,False,False,False,False,False,False,True,False,False,False,True,False
2,3,60,68.0,11250,7,5,2001,2002,162.0,486,0,434,920,920,866,0,1786,1,0,2,1,3,1,6,1,2001.0,2,608,0,42,0,0,0,0,0,9,2008,223500,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,False,False,True,False,True,False,False,False,False,False,False,True,False,False,False,False,True,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,True,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,True,False,False,True,False,False,False,False,True,False,True,False,False,False,False,True,False,False,False,True,False,True,False,False,True,False,False,False,False,False,False,False,True,True,False,False,False,False,False,False,False,False,True,False,False,False,True,False,True,False,False,False,False,False,False,True,False,False,False,True,True,False,False,False,False,True,False,False,False,False,True,False,False,False,True,False,True,False,True,False,True,False,False,True,False,False,False,False,False,False,False,False,True,False,False,False,True,False
3,4,70,60.0,9550,7,5,1915,1970,0.0,216,0,540,756,961,756,0,1717,1,0,1,0,3,1,7,1,1998.0,3,642,0,35,272,0,0,0,0,2,2006,140000,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,1,1,1,False,False,True,False,True,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,True,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,True,False,False,False,True,False,False,False,True,False,False,False,False,False,False,False,True,True,False,False,False,False,True,False,False,False,False,False,False,False,False,False,True,True,False,False,False,False,False,True,False,False,True,False,False,False,True,False,True,False,False,False,False,False,False,True,False,True,False,False,False,False,False,False,True,False,True,False,False,False,True,False,False,False,True,False,True,False,True,False,True,False,False,True,False,False,False,False,False,False,False,False,True,False,False,False,False,False
4,5,60,84.0,14260,8,5,2000,2000,350.0,655,0,490,1145,1145,1053,0,2198,1,0,2,1,4,1,9,1,2000.0,3,836,192,84,0,0,0,0,0,12,2008,250000,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,False,False,True,False,True,False,False,False,False,False,False,True,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,True,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,True,False,False,True,False,False,False,False,True,False,True,False,False,False,False,True,False,False,False,True,False,False,False,False,True,False,False,False,False,False,False,False,True,True,False,False,False,False,False,False,False,False,True,False,False,False,True,False,True,False,False,False,False,False,False,True,False,False,False,True,True,False,False,False,False,True,False,False,False,False,True,False,False,False,True,False,True,False,True,False,True,False,False,True,False,False,False,False,False,False,False,False,True,False,False,False,True,False



## 4. Creating New Features
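We will create new features that could help improve model performance. Here we add two: `TotalSF`, the combined square footage of the basement, first floor, and second floor, and `Age`, the number of years between construction and sale. Interaction features between existing variables are another option.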


In [13]:

# Create a new feature 'TotalSF' that sums up several area-related features
train_encoded['TotalSF'] = train_encoded['TotalBsmtSF'] + train_encoded['1stFlrSF'] + train_encoded['2ndFlrSF']

# Create a new feature 'Age' that represents the age of the house
train_encoded['Age'] = train_encoded['YrSold'] - train_encoded['YearBuilt']

# Preview new features
train_encoded[['TotalSF', 'Age']].head()


Unnamed: 0,TotalSF,Age
0,2566,5
1,2524,31
2,2706,7
3,2473,91
4,3343,8
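As an illustration of an interaction feature, the sketch below multiplies overall quality by living area. The column name `Qual_x_LivArea` is hypothetical and is not part of this notebook's pipeline; the input values are the first three rows of `OverallQual` and `GrLivArea` from the encoded preview in section 3.

```python
import pandas as pd

# First three rows of OverallQual and GrLivArea from the encoded preview
toy = pd.DataFrame({
    "OverallQual": [7, 6, 7],
    "GrLivArea": [1710, 1262, 1786],
})

# Hypothetical interaction feature: quality-weighted living area
toy["Qual_x_LivArea"] = toy["OverallQual"] * toy["GrLivArea"]

print(toy)
```

Whether such a feature helps depends on the model; tree-based models can often learn the interaction themselves, while linear models cannot.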



## 5. Feature Scaling
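Scaling numerical features matters for algorithms that are sensitive to feature magnitude, such as distance-based methods (KNN), regularized linear models (Ridge, Lasso), and models trained with gradient descent. Plain least-squares linear regression does not rely on distances, but scaling still makes its coefficients easier to compare. We use `StandardScaler`, which centers each feature to zero mean and scales it to unit variance.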
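Standardization replaces each value $x$ with $z = (x - \mu) / \sigma$, where $\mu$ and $\sigma$ are the mean and standard deviation of that feature. The following sketch, using a handful of illustrative values rather than the full dataset, checks that `StandardScaler` matches this formula and shows how the fitted statistics are reused on new data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# A few illustrative living-area values (not the full dataset)
x = np.array([[1710.0], [1262.0], [1786.0], [1717.0], [2198.0]])

scaler = StandardScaler()
scaled = scaler.fit_transform(x)

# Manual z-score; np.std uses ddof=0 by default, which is what StandardScaler uses
manual = (x - x.mean(axis=0)) / x.std(axis=0)
print(np.allclose(scaled, manual))

# New data must be transformed with the statistics learned during fit,
# not re-fitted, so that training and inference share the same scale
new_scaled = scaler.transform(np.array([[1500.0]]))
print(new_scaled)
```

Because the mean and standard deviation are learned from the data passed to `fit`, the fitted scaler would need to be kept (for example with `joblib.dump`) if the same transformation is to be applied to a test set later.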


In [14]:

from sklearn.preprocessing import StandardScaler

# List of features that need scaling
numerical_features = ['LotArea', 'GrLivArea', 'TotalSF', 'Age']

scaler = StandardScaler()
train_encoded[numerical_features] = scaler.fit_transform(train_encoded[numerical_features])

# Preview scaled features
train_encoded[numerical_features].head()


Unnamed: 0,LotArea,GrLivArea,TotalSF,Age
0,-0.207142,0.370333,-0.001277,-1.043259
1,-0.091886,-0.482512,-0.052407,-0.183465
2,0.07348,0.515013,0.169157,-0.977121
3,-0.096897,0.383659,-0.114493,1.800676
4,0.375148,1.299326,0.944631,-0.944052



## 6. Saving the Cleaned and Engineered Dataset
Once all the feature engineering steps are complete, we will save the cleaned dataset as a CSV file to be used in the model building phase, along with the list of its column names.


In [15]:

# Save the processed dataset to a CSV file
train_encoded.to_csv('cleaned_train_data.csv', index=False)
print("Cleaned and engineered dataset saved to 'cleaned_train_data.csv'.")

# Save the column names of the processed training dataset
train_columns = train_encoded.columns.tolist()
joblib.dump(train_columns, 'train_columns.pkl')

print("Training columns saved as 'train_columns.pkl'.")


Cleaned and engineered dataset saved to 'cleaned_train_data.csv'.
Training columns saved as 'train_columns.pkl'.
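One possible use of the saved column list is to align a separately encoded dataset to the training layout, since one-hot encoding new data can produce a different set of columns. The sketch below is an assumption about how the file might be used, not a step from this notebook; the toy frames are hypothetical and the file is written to a temporary directory.

```python
import os
import tempfile

import joblib
import pandas as pd

# Hypothetical encoded training frame
train_encoded = pd.DataFrame({
    "LotArea": [8450, 9600],
    "Street_Pave": [1, 1],
    "LotShape_Reg": [1, 0],
})

# Save and reload the column list, mirroring the step above
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "train_columns.pkl")
    joblib.dump(train_encoded.columns.tolist(), path)
    train_columns = joblib.load(path)

# Hypothetical new data whose encoding yields a different column set
new_encoded = pd.DataFrame({"LotArea": [11250], "LotShape_IR2": [1]})

# Keep only training columns, in training order; absent dummies are filled with 0
aligned = new_encoded.reindex(columns=train_columns, fill_value=0)
print(aligned)
```

`reindex` drops columns unseen during training (`LotShape_IR2` here) and adds the missing ones as zeros, so the result has exactly the columns a model trained on `train_encoded` expects.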



## 7. Summary and Next Steps
In this notebook, we performed feature engineering: we handled missing values, encoded categorical variables, created new features, and scaled numerical features. The cleaned dataset was saved as 'cleaned_train_data.csv', and its column names as 'train_columns.pkl'.

In the next notebook, we will proceed to **Model Building**, where we will use this cleaned dataset to train our machine learning models.
