# Introduction


**What?** Feature engineering (FE): introducing ratio features



# What is the goal of feature engineering?


- You might perform feature engineering to:
    - improve a model's predictive performance
    - reduce computational or data needs
    - improve interpretability of the results



# Import modules

In [1]:
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Import the dataset


- To illustrate these ideas we'll see how adding a few synthetic features to a dataset can improve the predictive performance of a random forest model.
- The Concrete dataset contains a variety of concrete formulations and the resulting product's *compressive strength*, which is a measure of how much load that kind of concrete can bear. 
- The task for this dataset is to predict a concrete's compressive strength given its formulation.



In [2]:
df = pd.read_excel("../DATASETS/Concrete_Data.xls")
df.head()

| | Cement | BlastFurnaceSlag | FlyAsh | Water | Superplasticizer | CoarseAggregate | FineAggregate | Age | CompressiveStrength |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 540.0 | 0.0 | 0.0 | 162.0 | 2.5 | 1040.0 | 676.0 | 28 | 79.986111 |
| 1 | 540.0 | 0.0 | 0.0 | 162.0 | 2.5 | 1055.0 | 676.0 | 28 | 61.887366 |
| 2 | 332.5 | 142.5 | 0.0 | 228.0 | 0.0 | 932.0 | 594.0 | 270 | 40.269535 |
| 3 | 332.5 | 142.5 | 0.0 | 228.0 | 0.0 | 932.0 | 594.0 | 365 | 41.05278 |
| 4 | 198.6 | 132.4 | 0.0 | 192.0 | 0.0 | 978.4 | 825.5 | 360 | 44.296075 |


# Establish baseline model


- We'll first establish a baseline by training the model on the original, un-augmented dataset.
- Comparing later scores against this baseline tells us whether the new features are actually useful.



In [3]:
X = df.copy()
y = X.pop("CompressiveStrength")

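# Train and score baseline model
# Note: criterion="mae" was removed in scikit-learn 1.2; "absolute_error" is the current name
baseline = RandomForestRegressor(criterion="absolute_error", random_state=0)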
baseline_score = cross_val_score(
    baseline, X, y, cv=5, scoring="neg_mean_absolute_error"
)
baseline_score = -1 * baseline_score.mean()

print(f"MAE Baseline Score: {baseline_score:.4}")

MAE Baseline Score: 8.397
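
- The score above is the mean absolute error (MAE): the average absolute gap between predicted and actual strength, so lower is better.
- scikit-learn's `neg_mean_absolute_error` scorer reports the negated MAE so that higher is always better, which is why the cell multiplies by `-1`.
- A minimal sketch of that arithmetic follows. The `y_true` and `y_pred` values and the `mean_absolute_error` helper are made up for illustration; they are not taken from the Concrete dataset or from the model.

```python
# Hypothetical actual strengths and predictions, for illustration only
y_true = [80.0, 62.0, 40.0, 41.0, 44.0]
y_pred = [72.0, 65.0, 45.0, 38.0, 50.0]


def mean_absolute_error(actual, predicted):
    """Average of the absolute differences between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


mae = mean_absolute_error(y_true, y_pred)

# A "neg_*" scorer would report the negated error (higher is better) ...
neg_mae = -mae
# ... so multiplying by -1 recovers the usual, positive MAE
recovered = -1 * neg_mae

print(f"MAE: {mae}, negated score: {neg_mae}, recovered MAE: {recovered}")
```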


# Feature engineering


- The *ratio* of ingredients in a recipe is usually a better predictor of how the recipe turns out than their absolute amounts. 
- We might reason then that ratios of the features above would be a good predictor of `CompressiveStrength`.
- The cell below adds three new ratio features to the dataset. 



In [4]:
X = df.copy()
y = X.pop("CompressiveStrength")

# Create synthetic features
X["FCRatio"] = X["FineAggregate"] / X["CoarseAggregate"]
X["AggCmtRatio"] = (X["CoarseAggregate"] + X["FineAggregate"]) / X["Cement"]
X["WtrCmtRatio"] = X["Water"] / X["Cement"]

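# Train and score model on dataset with additional ratio features
# Note: criterion="mae" was removed in scikit-learn 1.2; "absolute_error" is the current name
model = RandomForestRegressor(criterion="absolute_error", random_state=0)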
score = cross_val_score(
    model, X, y, cv=5, scoring="neg_mean_absolute_error"
)
score = -1 * score.mean()

print(f"MAE Score with Ratio Features: {score:.4}")

MAE Score with Ratio Features: 8.01



- Sure enough, performance improved: the MAE dropped from 8.397 to 8.01.
- This suggests that the new ratio features exposed information the model wasn't picking up from the raw quantities alone.
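
- One way to build intuition for why a ratio can help is a toy experiment where the outcome depends only on a ratio of two columns.
- The sketch below uses invented synthetic data (not the Concrete dataset) and a hand-written `pearson` helper. Linear correlation is only a crude stand-in for what a random forest can learn, so treat this as an illustration rather than an explanation of the result above.

```python
import math
import random

random.seed(0)

# Synthetic "recipes": strength depends only on the water-to-cement ratio.
# All numbers here are invented for illustration.
n = 500
cement = [random.uniform(100, 500) for _ in range(n)]
water = [random.uniform(120, 250) for _ in range(n)]
ratio = [w / c for w, c in zip(water, cement)]
strength = [60 - 20 * r + random.gauss(0, 1) for r in ratio]


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Strength of the linear relationship between each column and the target
r_cement = abs(pearson(cement, strength))
r_water = abs(pearson(water, strength))
r_ratio = abs(pearson(ratio, strength))

print(f"|r| cement:       {r_cement:.2f}")
print(f"|r| water:        {r_water:.2f}")
print(f"|r| water/cement: {r_ratio:.2f}")
```

- In this toy setup the ratio column tracks the target more closely than either raw column, because the relationship was built in by construction. On real data the benefit has to be checked empirically, as the cross-validated scores above do.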



# References


- https://www.kaggle.com/ryanholbrook/what-is-feature-engineering
- https://www.kaggle.com/sinamhd9/concrete-comprehensive-strength 

