
#### Feature Engineering with the Ames Housing Dataset

Feature engineering is the process of transforming raw data into meaningful features that improve the predictive power of a model. This involves selecting, transforming, and creating new features to capture patterns in the data, ultimately leading to more accurate predictions. In this section, we will demonstrate key feature engineering techniques using the **Ames Housing dataset**, a popular dataset for predicting house prices.

---

##### Why Use the Ames Housing Dataset?

The Ames Housing dataset is ideal for showcasing feature engineering due to its rich mix of numerical and categorical features. It includes data on various aspects of houses, such as their size, age, and neighborhood, making it a realistic dataset for regression tasks like predicting house prices.


##### Loading the Ames Housing Dataset

The Ames Housing dataset can be loaded with the `fetch_openml` function from `sklearn.datasets`, which downloads it from OpenML (here under the name `house_prices`).


In [1]:
from sklearn.datasets import fetch_openml
import pandas as pd

# Load the Ames Housing dataset
data = fetch_openml(name="house_prices", as_frame=True)
housing = data.frame

# Display the first few rows
print(housing.head())

   Id  MSSubClass MSZoning  LotFrontage  LotArea Street Alley LotShape  \
0   1          60       RL         65.0     8450   Pave   NaN      Reg   
1   2          20       RL         80.0     9600   Pave   NaN      Reg   
2   3          60       RL         68.0    11250   Pave   NaN      IR1   
3   4          70       RL         60.0     9550   Pave   NaN      IR1   
4   5          60       RL         84.0    14260   Pave   NaN      IR1   

  LandContour Utilities  ... PoolArea PoolQC Fence MiscFeature MiscVal MoSold  \
0         Lvl    AllPub  ...        0    NaN   NaN         NaN       0      2   
1         Lvl    AllPub  ...        0    NaN   NaN         NaN       0      5   
2         Lvl    AllPub  ...        0    NaN   NaN         NaN       0      9   
3         Lvl    AllPub  ...        0    NaN   NaN         NaN       0      2   
4         Lvl    AllPub  ...        0    NaN   NaN         NaN       0     12   

  YrSold  SaleType  SaleCondition  SalePrice  
0   2008        WD   

### 1. Normalization and Scaling

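Scaling puts numerical features on a comparable scale. This matters for models that rely on distances or gradient-based optimization, such as k-nearest neighbors, support vector machines, and regularized linear models; tree-based models are largely insensitive to feature scale.

- **Normalization**: Scales each feature to a fixed range, typically [0, 1].
- **Standardization**: Centers each feature at zero and scales it to a standard deviation of one.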
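
To make the two definitions concrete, here is a minimal sketch that applies both formulas by hand with NumPy to the five `LotArea` values printed above. The variable names are illustrative, and the results differ from a scaler fitted on the full column because the minimum, maximum, mean, and standard deviation here come from five rows only.

```python
import numpy as np

# The first five LotArea values from the dataset preview above
lot_area = np.array([8450.0, 9600.0, 11250.0, 9550.0, 14260.0])

# Min-max normalization: (x - min) / (max - min), so values fall in [0, 1]
normalized = (lot_area - lot_area.min()) / (lot_area.max() - lot_area.min())

# Standardization: (x - mean) / std, so the result has mean 0 and std 1.
# np.std uses the population standard deviation by default (ddof=0).
standardized = (lot_area - lot_area.mean()) / lot_area.std()

print(normalized)
print(standardized)
```

`MinMaxScaler` and `StandardScaler` apply the same formulas column by column, and they remember the fitted statistics so the identical transformation can be reused on new data.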


In [2]:
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Select numerical features
numerical_features = ['LotArea', 'GrLivArea', 'TotalBsmtSF']

# Apply Min-Max scaling; this overwrites the original columns in place,
# so later sections work with the scaled values
scaler = MinMaxScaler()
housing[numerical_features] = scaler.fit_transform(housing[numerical_features])

print(housing[numerical_features].head())

    LotArea  GrLivArea  TotalBsmtSF
0  0.033420   0.259231     0.140098
1  0.038795   0.174830     0.206547
2  0.046507   0.273549     0.150573
3  0.038561   0.260550     0.123732
4  0.060576   0.351168     0.187398


### 2. Encoding Categorical Variables

Categorical features like `Neighborhood` and `HouseStyle` need to be converted into numerical formats for machine learning models.

- **One-Hot Encoding**: Converts each category into a binary column.
- **Target Encoding**: Encodes categories based on their mean target value.
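
Target encoding is not demonstrated in the cells of this section, so here is a minimal sketch of the idea using plain pandas. The toy prices are hypothetical, and the column name `Neighborhood_TE` is an illustrative choice. In practice the per-category means should be computed on the training split only, otherwise information about the target leaks into the features.

```python
import pandas as pd

# Toy data with hypothetical prices; the column names mirror the dataset
toy = pd.DataFrame({
    "Neighborhood": ["StoneBr", "StoneBr", "Timber", "Timber", "Veenker"],
    "SalePrice": [200000, 220000, 180000, 190000, 250000],
})

# Mean target value per category
neighborhood_means = toy.groupby("Neighborhood")["SalePrice"].mean()

# Replace each category with the mean target value of its group
toy["Neighborhood_TE"] = toy["Neighborhood"].map(neighborhood_means)

print(toy)
```

Unlike one-hot encoding, this produces a single numeric column regardless of how many categories the feature has.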


In [4]:
from sklearn.preprocessing import OneHotEncoder

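# One-hot encode categorical features; drop='first' omits one category per
# feature so the encoded columns are not perfectly collinear
categorical_features = ['Neighborhood', 'HouseStyle']
encoder = OneHotEncoder(sparse_output=False, drop='first')
encoded_features = pd.DataFrame(encoder.fit_transform(housing[categorical_features]),
                                columns=encoder.get_feature_names_out(categorical_features),
                                index=housing.index)  # keep rows aligned for the concat below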

# Concatenate the encoded features with the dataset
housing = pd.concat([housing, encoded_features], axis=1)

# Drop the original categorical columns
housing = housing.drop(columns=categorical_features)

print(housing.head())

   Id  MSSubClass MSZoning  LotFrontage   LotArea Street Alley LotShape  \
0   1          60       RL         65.0  0.033420   Pave   NaN      Reg   
1   2          20       RL         80.0  0.038795   Pave   NaN      Reg   
2   3          60       RL         68.0  0.046507   Pave   NaN      IR1   
3   4          70       RL         60.0  0.038561   Pave   NaN      IR1   
4   5          60       RL         84.0  0.060576   Pave   NaN      IR1   

  LandContour Utilities  ... Neighborhood_StoneBr Neighborhood_Timber  \
0         Lvl    AllPub  ...                  0.0                 0.0   
1         Lvl    AllPub  ...                  0.0                 0.0   
2         Lvl    AllPub  ...                  0.0                 0.0   
3         Lvl    AllPub  ...                  0.0                 0.0   
4         Lvl    AllPub  ...                  0.0                 0.0   

  Neighborhood_Veenker HouseStyle_1.5Unf HouseStyle_1Story  HouseStyle_2.5Fin  \
0                  0.0       

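### 3. Creating Interaction Terms

Interaction terms combine two or more features, typically as a product, so the model can capture effects that depend on both. The cell below also derives the age of the house at sale, which is a difference of two columns rather than a true interaction.


In [5]:
# Interaction: overall quality times living area (GrLivArea was min-max scaled in Section 1)
housing['OverallQual_GrLivArea'] = housing['OverallQual'] * housing['GrLivArea']
# Derived feature: age of the house in years at the time of sale
housing['YearBuilt_Age'] = housing['YrSold'] - housing['YearBuilt']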

print(housing[['OverallQual_GrLivArea', 'YearBuilt_Age']].head())

   OverallQual_GrLivArea  YearBuilt_Age
0               1.814619              5
1               1.048983             31
2               1.914846              7
3               1.823851             91
4               2.809344              8
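
When several columns are candidates, the pairwise products can be generated in a loop. The following is a small sketch with `itertools.combinations`; the values are hypothetical and the `_x_` naming scheme is an illustrative assumption, not a convention used elsewhere in this section.

```python
from itertools import combinations

import pandas as pd

# Toy rows with hypothetical values; the column names mirror the dataset
toy = pd.DataFrame({
    "OverallQual": [7, 6, 7],
    "GrLivArea": [1710, 1262, 1786],
    "TotalBsmtSF": [856, 1262, 920],
})

base_columns = ["OverallQual", "GrLivArea", "TotalBsmtSF"]

# Create one product column for every unordered pair of base columns
for left, right in combinations(base_columns, 2):
    toy[f"{left}_x_{right}"] = toy[left] * toy[right]

print(toy.columns.tolist())
```

The number of pairs grows quadratically with the number of base columns, so it usually makes sense to restrict the loop to features where an interaction is plausible.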


### 4. Polynomial Features

Polynomial transformation helps capture non-linear relationships by creating squared or higher-order terms for numerical features.


In [6]:
from sklearn.preprocessing import PolynomialFeatures

# Apply polynomial transformation
poly = PolynomialFeatures(degree=2, include_bias=False)
poly_features = poly.fit_transform(housing[['GrLivArea']])

# Convert to DataFrame, taking the column names from the transformer
poly_features_df = pd.DataFrame(poly_features,
                                columns=poly.get_feature_names_out(['GrLivArea']),
                                index=housing.index)

# Add only the squared term; GrLivArea is already in the dataset, and
# concatenating it again would create a duplicate column
housing['GrLivArea^2'] = poly_features_df['GrLivArea^2']

print(housing[['GrLivArea', 'GrLivArea^2']].head())

   GrLivArea  GrLivArea^2
0   0.259231     0.067201
1   0.174830     0.030566
2   0.273549     0.074829
3   0.260550     0.067886
4   0.351168     0.123319


### 5. Binning

Binning groups continuous values into discrete intervals. This can reduce the influence of noise and outliers, at the cost of discarding variation within each bin.


In [8]:
# Bin LotArea into four equal-width intervals
housing['LotArea_Bin'] = pd.cut(housing['LotArea'], bins=4, labels=['Small', 'Medium', 'Large', 'Very Large'])

print(housing['LotArea_Bin'].value_counts())

LotArea_Bin
Small         1453
Medium           3
Large            2
Very Large       2
Name: count, dtype: int64
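
The counts above are very uneven because `pd.cut` with an integer `bins` argument creates equal-width intervals, and a few very large lots stretch the range. One possible alternative is quantile-based binning with `pd.qcut`, sketched below on synthetic right-skewed data. The log-normal sample is an assumption standing in for `LotArea`, not the real column.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic right-skewed values standing in for LotArea (not real data)
lot_area = pd.Series(rng.lognormal(mean=9.0, sigma=0.5, size=1000))

labels = ["Small", "Medium", "Large", "Very Large"]

# Equal-width bins: the range is split into four intervals of the same width
equal_width = pd.cut(lot_area, bins=4, labels=labels)

# Equal-frequency bins: the edges are placed at the quartiles
equal_freq = pd.qcut(lot_area, q=4, labels=labels)

print(equal_width.value_counts())
print(equal_freq.value_counts())
```

Quantile bins keep every category populated, but their edges depend on the sample, so the edges learned on training data should be reused when binning new data.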


### 6. Log Transformation

Log transformation reduces right skew in strictly positive, heavy-tailed variables such as sale prices.


In [9]:
import numpy as np

# Apply the natural log to the SalePrice column (the prediction target)
housing['Log_SalePrice'] = np.log(housing['SalePrice'])

print(housing[['SalePrice', 'Log_SalePrice']].head())

   SalePrice  Log_SalePrice
0     208500      12.247694
1     181500      12.109011
2     223500      12.317167
3     140000      11.849398
4     250000      12.429216
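
The effect on skewness can be checked numerically. The sketch below uses a synthetic log-normal sample as a stand-in for prices (an assumption, not the real `SalePrice` column). It uses `np.log1p`, which computes log(1 + x) and therefore also works for columns that contain zeros, together with its inverse `np.expm1`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic right-skewed prices (not real data)
prices = pd.Series(rng.lognormal(mean=12.0, sigma=0.4, size=1000))

# log1p computes log(1 + x), which is also defined when x is zero
log_prices = np.log1p(prices)

# Sample skewness before and after the transformation
skew_before = prices.skew()
skew_after = log_prices.skew()
print(skew_before, skew_after)

# expm1 inverts log1p, mapping values back to the original scale
recovered = np.expm1(log_prices)
```

If a model is trained on the log of the target, its predictions are on the log scale and need the inverse transformation before they can be read as prices.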


### 7. Time-Based Features

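Extract time-related features from columns like `YrSold` and `MoSold` to capture time-based patterns such as seasonality.


In [10]:
# SoldYear copies YrSold; SoldSeason groups MoSold into calendar quarters
# labeled by season (months 1-3 = 'Winter', ..., months 10-12 = 'Fall')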
housing['SoldYear'] = housing['YrSold']
housing['SoldSeason'] = pd.cut(housing['MoSold'], bins=[0, 3, 6, 9, 12], labels=['Winter', 'Spring', 'Summer', 'Fall'])

print(housing[['SoldYear', 'SoldSeason']].head())

   SoldYear SoldSeason
0      2008     Winter
1      2007     Spring
2      2008     Summer
3      2006     Winter
4      2008       Fall
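
The quarter-based grouping above labels March as winter and December as fall. If meteorological seasons are preferred, an explicit month-to-season mapping is one option. The sketch below assumes Northern Hemisphere seasons, and the dictionary name `month_to_season` is illustrative.

```python
import pandas as pd

# Meteorological seasons for the Northern Hemisphere (an assumed convention)
month_to_season = {
    12: "Winter", 1: "Winter", 2: "Winter",
    3: "Spring", 4: "Spring", 5: "Spring",
    6: "Summer", 7: "Summer", 8: "Summer",
    9: "Fall", 10: "Fall", 11: "Fall",
}

# MoSold values of the first five rows in the dataset preview
mo_sold = pd.Series([2, 5, 9, 2, 12])

# Look up the season for each month number
sold_season = mo_sold.map(month_to_season)
print(sold_season.tolist())
```

With this mapping, a September sale falls in fall rather than summer and a December sale falls in winter.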
