# 1. Introduction to Feature Engineering


## 1.1 About this tutorial
The original document is available [here](https://www.kaggle.com/code/ryanholbrook/what-is-feature-engineering).

In this course you'll learn about one of the most important steps on the way to building a great machine learning model: **feature engineering**. You'll learn how to:

- determine which features are the most important with mutual information
- invent new features in several real-world problem domains
- encode high-cardinality categorical columns with a target encoding
- create segmentation features with k-means clustering
- decompose a dataset's variation into features with principal component analysis

The hands-on exercises build up to a complete notebook that applies all of these techniques to make a submission to the House Prices Getting Started competition. After completing this course, you'll have several ideas that you can use to further improve your performance.


## 1.2 The Goal of Feature Engineering

The goal of feature engineering is simply to make your data better suited to the problem at hand.

Consider "apparent temperature" measures like the heat index and the wind chill. These quantities attempt to measure the perceived temperature to humans based on air temperature, humidity, and wind speed, things which we can measure directly. You could think of an apparent temperature as the result of a kind of feature engineering, an attempt to make the observed data more relevant to what we actually care about: how it actually feels outside!
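
As a rough illustration of this idea, the sketch below derives a wind chill feature from two directly measured quantities. It assumes the US National Weather Service wind chill formula (temperature in °F, wind speed in mph, intended for temperatures at or below 50 °F and wind speeds above 3 mph). The formula and the function name `wind_chill_f` are assumptions added for illustration and are not part of the original tutorial.

```python
def wind_chill_f(temp_f: float, wind_mph: float) -> float:
    """Apparent temperature (°F) from air temperature (°F) and wind speed (mph).

    Assumption: NWS wind chill formula, intended for temp_f <= 50 and
    wind_mph > 3. Outside that range the air temperature is returned unchanged.
    """
    if temp_f > 50 or wind_mph <= 3:
        return temp_f
    v = wind_mph ** 0.16
    return 35.74 + 0.6215 * temp_f - 35.75 * v + 0.4275 * temp_f * v


# Two raw measurements become one feature that is closer to "how it feels".
readings = [(30.0, 10.0), (30.0, 25.0), (60.0, 10.0)]  # (temp_f, wind_mph)
features = [wind_chill_f(t, w) for t, w in readings]
print(features)
```

The engineered value combines the raw columns into the quantity we actually care about, so a model no longer has to discover that combination on its own.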

You might perform feature engineering to:

- improve a model's predictive performance
- reduce computational or data needs
- improve interpretability of the results

## 1.3 A Guiding Principle of Feature Engineering

For a feature to be useful, it must have a relationship to the target that your model is able to learn. Linear models, for instance, are only able to learn linear relationships. So, when using a linear model, your goal is to transform the features to make their relationship to the target linear.

The key idea here is that a transformation you apply to a feature becomes in essence a part of the model itself. Say you were trying to predict the Price of square plots of land from the Length of one side. Fitting a linear model directly to Length gives poor results: the relationship is not linear.

*Figure: A linear model fits poorly with only Length as a feature.*

If we square the Length feature to get Area, however, we create a linear relationship. Adding Area to the feature set means this linear model can now fit a parabola. Squaring a feature, in other words, gives the linear model the ability to fit squared features.
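
The Length and Area example can be sketched numerically. The data below is synthetic and hypothetical (prices proportional to area plus noise), and `linear_fit_r2` is a helper name introduced only for this illustration. For simplicity the second fit uses Area on its own rather than adding it alongside Length.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: square plots whose price is proportional to area, plus noise.
length = rng.uniform(1.0, 10.0, size=200)
price = 3.0 * length**2 + rng.normal(0.0, 2.0, size=200)


def linear_fit_r2(x: np.ndarray, y: np.ndarray) -> float:
    """Fit y = a*x + b by least squares and return the R^2 of the fit."""
    slope, intercept = np.polyfit(x, y, deg=1)
    residuals = y - (slope * x + intercept)
    return 1.0 - residuals.var() / y.var()


r2_length = linear_fit_r2(length, price)     # straight line through a curve
r2_area = linear_fit_r2(length**2, price)    # same model, engineered feature

print(f"R^2 using Length: {r2_length:.3f}")
print(f"R^2 using Area:   {r2_area:.3f}")
```

The model is the same straight-line fit in both cases; only the feature changed, which is why the transformation can be thought of as part of the model.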


The example above shows why there can be such a **high return on time invested in feature engineering**. Whatever relationships your model can't learn, you can provide yourself through transformations. As you develop your feature set, think about what information your model could use to achieve its best performance.

## 1.4 A Concrete Example

The example below demonstrates how adding a few synthetic features to a dataset can improve the predictive performance of a random forest model.

### 1.4.1 The Source Data
The dataset `../../data/concrete.csv` contains a variety of concrete formulations and the resulting product's compressive strength, which is a measure of how much load that kind of concrete can bear.


In [5]:
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

In [6]:
# configuration: dataset path and name of the target column
source_path = "../../data/concrete.csv"

label_col = "CompressiveStrength"

In [3]:
df = pd.read_csv(source_path)

In [4]:
df.head()

Unnamed: 0,Cement,BlastFurnaceSlag,FlyAsh,Water,Superplasticizer,CoarseAggregate,FineAggregate,Age,CompressiveStrength
0,540.0,0.0,0.0,162.0,2.5,1040.0,676.0,28,79.99
1,540.0,0.0,0.0,162.0,2.5,1055.0,676.0,28,61.89
2,332.5,142.5,0.0,228.0,0.0,932.0,594.0,270,40.27
3,332.5,142.5,0.0,228.0,0.0,932.0,594.0,365,41.05
4,198.6,132.4,0.0,192.0,0.0,978.4,825.5,360,44.3


In the rows above, you can see the various ingredients that go into each variety of concrete.


### 1.4.2 The model

The objective of the model that we will train is to predict a concrete's compressive strength given its formulation.

### 1.4.3 Build a baseline (Train a model without feature engineering)

We'll first establish a baseline by training the model on the raw features. This will help us determine whether our new features are actually an improvement over the raw features alone.

Establishing baselines like this is **good practice** at the start of the feature engineering process. A baseline score can help you decide whether your new features are worth keeping, or whether you should discard them and possibly try something else.


In [19]:
def train_and_eval(features, label):
    """
    Train a random forest on the given features and label, evaluate it with
    5-fold cross-validation, and print the mean absolute error (MAE).

    :param features: DataFrame of input features
    :param label: Series holding the target values
    :return: None (the score is only printed)
    """
    # train the model and score it with cross-validation
    model = RandomForestRegressor(criterion="absolute_error", random_state=0)
    score = cross_val_score(model, features, label, cv=5, scoring="neg_mean_absolute_error")
    # scikit-learn reports the negated MAE, so flip the sign
    score = -1 * score.mean()
    # note: the printed label is the same for every feature set passed in
    print(f"MAE Baseline Score: {score:.4}")

In [16]:
# prepare training data
X = df.copy()
y = X.pop(label_col)

In [18]:
X.head()

Unnamed: 0,Cement,BlastFurnaceSlag,FlyAsh,Water,Superplasticizer,CoarseAggregate,FineAggregate,Age
0,540.0,0.0,0.0,162.0,2.5,1040.0,676.0,28
1,540.0,0.0,0.0,162.0,2.5,1055.0,676.0,28
2,332.5,142.5,0.0,228.0,0.0,932.0,594.0,270
3,332.5,142.5,0.0,228.0,0.0,932.0,594.0,365
4,198.6,132.4,0.0,192.0,0.0,978.4,825.5,360


In [20]:
train_and_eval(X,y)


MAE Baseline Score: 8.232


### 1.4.4 First attempt

You might know that the **ratio of ingredients** in a recipe is usually a better predictor of how the recipe turns out than their **absolute amounts**. We might reason then that ratios of the features above would be a good predictor of CompressiveStrength.

Let's first check the raw features.

In [11]:
# copy so that adding columns to X_1 does not also modify X
X_1 = X.copy()
X_1.head()

Unnamed: 0,Cement,BlastFurnaceSlag,FlyAsh,Water,Superplasticizer,CoarseAggregate,FineAggregate,Age
0,540.0,0.0,0.0,162.0,2.5,1040.0,676.0,28
1,540.0,0.0,0.0,162.0,2.5,1055.0,676.0,28
2,332.5,142.5,0.0,228.0,0.0,932.0,594.0,270
3,332.5,142.5,0.0,228.0,0.0,932.0,594.0,365
4,198.6,132.4,0.0,192.0,0.0,978.4,825.5,360


In [10]:
y.head()

0    79.99
1    61.89
2    40.27
3    41.05
4    44.30
Name: CompressiveStrength, dtype: float64

The cell below adds three new ratio features to the dataset:

- `FCRatio`: fine aggregate to coarse aggregate
- `AggCmtRatio`: total aggregate (coarse plus fine) to cement
- `WtrCmtRatio`: water to cement

In [12]:
# Create synthetic features
# fine-to-coarse aggregate ratio
X_1["FCRatio"] = X_1["FineAggregate"] / X_1["CoarseAggregate"]

# total aggregate to cement ratio
X_1["AggCmtRatio"] = (X_1["CoarseAggregate"] + X_1["FineAggregate"]) / X_1["Cement"]

# water to cement ratio
X_1["WtrCmtRatio"] = X_1["Water"] / X_1["Cement"]

In [13]:
X_1.head()

Unnamed: 0,Cement,BlastFurnaceSlag,FlyAsh,Water,Superplasticizer,CoarseAggregate,FineAggregate,Age,FCRatio,AggCmtRatio,WtrCmtRatio
0,540.0,0.0,0.0,162.0,2.5,1040.0,676.0,28,0.65,3.177778,0.3
1,540.0,0.0,0.0,162.0,2.5,1055.0,676.0,28,0.640758,3.205556,0.3
2,332.5,142.5,0.0,228.0,0.0,932.0,594.0,270,0.637339,4.589474,0.685714
3,332.5,142.5,0.0,228.0,0.0,932.0,594.0,365,0.637339,4.589474,0.685714
4,198.6,132.4,0.0,192.0,0.0,978.4,825.5,360,0.843724,9.083082,0.966767
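
The ratios above divide by `Cement` and `CoarseAggregate`. In the five rows shown these are all positive, but in other data a zero denominator would produce `inf`, which can break model fitting. Below is a minimal sketch of one possible guard. The helper `safe_ratio` and the toy values are hypothetical and not part of the original notebook.

```python
import numpy as np
import pandas as pd


def safe_ratio(numerator: pd.Series, denominator: pd.Series) -> pd.Series:
    """Element-wise ratio that yields NaN instead of inf when the denominator is 0."""
    return numerator / denominator.replace(0, np.nan)


# Toy data (hypothetical values), including one zero denominator.
toy = pd.DataFrame({"Water": [162.0, 228.0, 192.0], "Cement": [540.0, 332.5, 0.0]})
toy["WtrCmtRatio"] = safe_ratio(toy["Water"], toy["Cement"])
print(toy)
```

Whether `NaN` values are accepted depends on the estimator and library version, so such rows may still need to be imputed or dropped before training.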


In [21]:
train_and_eval(X_1,y)


MAE Baseline Score: 7.948


Notice that the MAE decreased from 8.232 to 7.948, so the model's performance improved slightly. This is evidence that the new ratio features exposed important information to the model that it wasn't detecting before.
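
To compare feature sets as numbers rather than by reading printed output, a variant of `train_and_eval` could return the score. The sketch below is only an illustration under stated assumptions: `cv_mae` is a hypothetical helper, the data is synthetic (the target depends on a ratio by construction, so it says nothing about real concrete), and the forest uses the default squared-error criterion with fewer trees to keep it fast, which differs from the `criterion="absolute_error"` setting used in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score


def cv_mae(features: pd.DataFrame, label: pd.Series) -> float:
    """Return the mean 5-fold cross-validated MAE of a random forest."""
    model = RandomForestRegressor(n_estimators=50, random_state=0)
    scores = cross_val_score(model, features, label, cv=5, scoring="neg_mean_absolute_error")
    return float(-scores.mean())


# Synthetic stand-in for the concrete data: the target is driven by a ratio.
rng = np.random.default_rng(0)
raw = pd.DataFrame({
    "Water": rng.uniform(120.0, 250.0, size=300),
    "Cement": rng.uniform(100.0, 540.0, size=300),
})
target = pd.Series(40.0 * raw["Water"] / raw["Cement"] + rng.normal(0.0, 0.5, size=300))

engineered = raw.copy()  # copy so the raw frame stays untouched
engineered["WtrCmtRatio"] = engineered["Water"] / engineered["Cement"]

baseline_mae = cv_mae(raw, target)
engineered_mae = cv_mae(engineered, target)
print(f"baseline: {baseline_mae:.3f}, with ratio: {engineered_mae:.3f}")
```

Returning the score makes it straightforward to keep a new feature only when it lowers the cross-validated error relative to the baseline.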