# Kaggle Feature Engineering micro-course
- Better features make better models. 
- Discover how to get the most out of your data
- https://www.kaggle.com/learn/feature-engineering

## What Is Feature Engineering
- Learn the steps and principles of creating better features
1. determine which features are the most important with *mutual information*
2. invent new features in several real-world problem domains
3. encode high-cardinality categoricals with a *target encoding*
4. create segmentation features with *k-means clustering*
5. decompose a dataset's variation into features with *principal component analysis*

## The Goal of Feature Engineering
- make your data better suited to the problem at hand
1. improve a model's predictive performance
2. reduce computational or data needs
3. improve interpretability of the results

## A Guiding Principle of Feature Engineering
- For a feature to be useful, it must have a relationship to the target that your model is able to learn.
- Linear models, for instance, are only able to learn linear relationships.
- So, when using a linear model, your goal is to transform the features to make their relationship to the target linear.
- The key idea here is that a transformation you apply to a feature becomes, in essence, part of the model itself.
- Example: when predicting the price of square plots of land from the length of one side, the relationship between price and length is not linear. Squaring the length gives the area, and the relationship between price and area is linear.
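
A minimal sketch of the land-price example, assuming made-up data: the price per unit of area, the noise level, and all variable names are illustrative and not from the course. The same linear model is fit once on `length` and once on the squared feature `area`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic square plots: side length in metres (illustrative values only)
rng = np.random.default_rng(0)
length = rng.uniform(10, 100, size=200)
area = length ** 2

# Assumed pricing rule: price proportional to area, plus some noise
price = 50.0 * area + rng.normal(0, 5_000, size=200)

# Same linear model, two different input features
X_length = length.reshape(-1, 1)
X_area = area.reshape(-1, 1)
r2_length = LinearRegression().fit(X_length, price).score(X_length, price)
r2_area = LinearRegression().fit(X_area, price).score(X_area, price)

print(f"R^2 using length: {r2_length:.3f}")
print(f"R^2 using area:   {r2_area:.3f}")
```

The squaring step does work that the linear model cannot do on its own, which is why the transformation is effectively part of the model.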

## Example - Concrete Formulations
- Adding 'synthetic' features can improve the predictive performance of a model.

In [31]:
import zipfile as zfm
import pandas as pd
pd.__version__

'2.0.1'

In [32]:
#url = 'https://github.com/jmonti-gh/Datasets/blob/\
#c790af2d1885dcd63baea8b5a6f9dc8c1b8a1531/Concrete_Data.xls'
# url = 'https://github.com/jmonti-gh/Datasets/blob/main/test.csv'
# df = pd.read_csv(url)
# df

In [46]:
zipfile = 'files/ConcreteCompressiveStrength.zip'
dataset = 'Concrete_Data.xls'

with zfm.ZipFile(zipfile) as zf:
    df_xls = pd.read_excel(zf.open(dataset))

df = pd.read_csv('files/concrete.csv')

print(df.shape)
display(df_xls.iloc[[0, 9, -9, -1]])
display(df.iloc[[0, 9, -9, -1]])

### read_excel: install xlrd (for .xls); install openpyxl (for .xlsx)
# https://stackoverflow.com/questions/48066517/python-pandas-pd-read-excel-giving-importerror-install-xlrd-0-9-0-for-excel

(1030, 9)


| | Cement (component 1)(kg in a m^3 mixture) | Blast Furnace Slag (component 2)(kg in a m^3 mixture) | Fly Ash (component 3)(kg in a m^3 mixture) | Water (component 4)(kg in a m^3 mixture) | Superplasticizer (component 5)(kg in a m^3 mixture) | Coarse Aggregate (component 6)(kg in a m^3 mixture) | Fine Aggregate (component 7)(kg in a m^3 mixture) | Age (day) | Concrete compressive strength(MPa, megapascals) |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 540.0 | 0.0 | 0.0 | 162.0 | 2.5 | 1040.0 | 676.0 | 28 | 79.986111 |
| 9 | 475.0 | 0.0 | 0.0 | 228.0 | 0.0 | 932.0 | 594.0 | 28 | 39.28979 |
| 1021 | 298.2 | 0.0 | 107.0 | 209.7 | 11.1 | 879.6 | 744.2 | 28 | 31.875165 |
| 1029 | 260.9 | 100.5 | 78.3 | 200.6 | 8.6 | 864.5 | 761.5 | 28 | 32.401235 |


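| | cement | slag | ash | water | superplastic | coarseagg | fineagg | age | strength |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 540.0 | 0.0 | 0.0 | 162.0 | 2.5 | 1040.0 | 676.0 | 28 | 79.99 |
| 9 | 475.0 | 0.0 | 0.0 | 228.0 | 0.0 | 932.0 | 594.0 | 28 | 39.29 |
| 1021 | 298.2 | 0.0 | 107.0 | 209.7 | 11.1 | 879.6 | 744.2 | 28 | 31.88 |
| 1029 | 260.9 | 100.5 | 78.3 | 200.6 | 8.6 | 864.5 | 761.5 | 28 | 32.4 |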


The Concrete dataset contains a variety of concrete formulations and the resulting product's compressive strength, which is a measure of how much load that kind of concrete can bear. The task for this dataset is to predict a concrete's compressive strength given its formulation.

In [35]:
# libraries necessary to build and evaluate the model
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

The DataFrame lists the various ingredients going into each variety of concrete. We'll see in a moment how adding synthetic features derived from these can help a model learn important relationships among them.

We'll first establish a baseline by training the model on the un-augmented dataset. This will help us determine whether our new features are actually useful.

Establishing baselines like this is good practice at the start of the feature engineering process. A baseline score can help you decide whether your new features are worth keeping, or whether you should discard them and possibly try something else.

In [36]:
df_xls.columns

Index(['Cement (component 1)(kg in a m^3 mixture)',
       'Blast Furnace Slag (component 2)(kg in a m^3 mixture)',
       'Fly Ash (component 3)(kg in a m^3 mixture)',
       'Water  (component 4)(kg in a m^3 mixture)',
       'Superplasticizer (component 5)(kg in a m^3 mixture)',
       'Coarse Aggregate  (component 6)(kg in a m^3 mixture)',
       'Fine Aggregate (component 7)(kg in a m^3 mixture)', 'Age (day)',
       'Concrete compressive strength(MPa, megapascals) '],
      dtype='object')

In [58]:
# Separate the target (the last column, "strength") from the features
tdf = df.columns[-1]
X = df.copy()
y = X.pop(tdf)
y



0       79.99
1       61.89
2       40.27
3       41.05
4       44.30
        ...  
1025    44.28
1026    31.18
1027    23.70
1028    32.77
1029    32.40
Name: strength, Length: 1030, dtype: float64

In [59]:
# Train and score baseline model
baseline = RandomForestRegressor(criterion="absolute_error", random_state=0)
baseline_score = cross_val_score(
    baseline, X, y, cv=5, scoring="neg_mean_absolute_error"
)
baseline_score = -1 * baseline_score.mean()

print(f"MAE Baseline Score: {baseline_score:.4}")

MAE Baseline Score: 8.232


If you ever cook at home, you might know that the ratio of ingredients in a recipe is usually a better predictor of how the recipe turns out than their absolute amounts. We might reason then that ratios of the features above would be a good predictor of compressive strength (the `strength` column).

The cell below adds three new ratio features to the dataset.

In [61]:
df.columns

Index(['cement', 'slag', 'ash', 'water', 'superplastic', 'coarseagg',
       'fineagg', 'age', 'strength'],
      dtype='object')

In [62]:
X = df.copy()
y = X.pop("strength")

# Create synthetic features
X["FCRatio"] = X["fineagg"] / X["coarseagg"]
X["AggCmtRatio"] = (X["coarseagg"] + X["fineagg"]) / X["cement"]
X["WtrCmtRatio"] = X["water"] / X["cement"]

# Train and score model on dataset with additional ratio features
model = RandomForestRegressor(criterion="absolute_error", random_state=0)
score = cross_val_score(
    model, X, y, cv=5, scoring="neg_mean_absolute_error"
)
score = -1 * score.mean()

print(f"MAE Score with Ratio Features: {score:.4}")

MAE Score with Ratio Features: 7.948


And sure enough, performance improved! This is evidence that these new ratio features exposed important information to the model that it wasn't detecting before.
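
To put the two scores in perspective, the improvement can be expressed relative to the baseline. A small arithmetic sketch using the two MAE values printed above (the variable names are illustrative):

```python
# MAE values copied from the two runs reported above
baseline_mae = 8.232
ratio_mae = 7.948

absolute_gain = baseline_mae - ratio_mae
relative_gain = absolute_gain / baseline_mae

print(f"Absolute MAE reduction: {absolute_gain:.3f}")
print(f"Relative MAE reduction: {relative_gain:.1%}")
```

The reduction works out to roughly 3 percent of the baseline error. Whether a gain of that size justifies keeping the extra features depends on the application.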
### Continue
We've seen that engineering new features can improve model performance. But how do you identify features in the dataset that might be useful to combine? __Discover useful features__ with mutual information.
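
As a preview, and only as a sketch on synthetic data (the feature names, sizes, and noise level are made up), scikit-learn's `mutual_info_regression` can rank features by how much information they carry about the target, including relationships that are not linear.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

# Synthetic data: the target depends non-linearly on "signal" and not at all on "noise"
rng = np.random.default_rng(0)
signal = rng.uniform(-1, 1, size=500)
noise = rng.uniform(-1, 1, size=500)
y_mi = signal ** 2 + rng.normal(0, 0.05, size=500)

X_mi = np.column_stack([signal, noise])
mi_scores = mutual_info_regression(X_mi, y_mi, random_state=0)

# A higher score means the feature tells us more about the target
for name, mi in zip(["signal", "noise"], mi_scores):
    print(f"{name}: {mi:.3f}")
```

In this setup the target is a quadratic function of `signal`, so a plain linear correlation would be close to zero even though the feature is highly informative.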