# Kaggle Feature Engineering Micro-Course
- Better features make better models. 
- Discover how to get the most out of your data
- https://www.kaggle.com/learn/feature-engineering

## 1.- What Is Feature Engineering
- Learn the steps and principles of creating better features

1. determine which features are the most important with *mutual information*
2. invent new features in several real-world problem domains
3. encode high-cardinality categoricals with a *target encoding*
4. create segmentation features with *k-means clustering*
5. decompose a dataset's variation into features with *principal component analysis*

### The Goal of Feature Engineering
- make your data better suited to the problem at hand

1. improve a model's predictive performance
2. reduce computational or data needs
3. improve interpretability of the results

### A Guiding Principle of Feature Engineering

- For a feature to be useful, it must have a relationship to the target that your model is able to learn.
- Linear models, for instance, are only able to learn linear relationships.
- So, when using a linear model, your goal is to transform the features to make their relationship to the target linear.
- The key idea here is that a transformation you apply to a feature becomes, in essence, part of the model itself.
- Example: when predicting the Price of square plots of land from the Length of one side, the relationship between Price and Length is not linear. It becomes linear only if we transform Length into Area (Length squared), because the plots are squares.
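
The square-plot example can be checked numerically. The sketch below is not part of the course: it uses made-up data (the price per unit of area and the noise level are arbitrary assumptions) and fits the same `LinearRegression` once on Length and once on Area to compare how well each feature lets a linear model explain Price.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Hypothetical data: side length of square plots, price proportional to area
length = rng.uniform(10, 100, size=200)
price = 50.0 * length**2 + rng.normal(0, 5_000, size=200)

# Same linear model, fitted on the raw feature and on the transformed one
X_length = length.reshape(-1, 1)
X_area = (length**2).reshape(-1, 1)

r2_length = LinearRegression().fit(X_length, price).score(X_length, price)
r2_area = LinearRegression().fit(X_area, price).score(X_area, price)

print(f"R^2 using Length: {r2_length:.3f}")
print(f"R^2 using Area:   {r2_area:.3f}")
```

The model itself never changes; only the feature does. Squaring Length moves the non-linearity out of the relationship and into the feature, which is what "the transformation becomes part of the model" means in practice.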

### Example - Concrete Formulations
- Adding 'synthetic' features can improve the predictive performance of a model.

In [1]:
import zipfile as zfm
import pandas as pd
pd.__version__

'1.5.3'

In [2]:
#url = 'https://github.com/jmonti-gh/Datasets/blob/\
#c790af2d1885dcd63baea8b5a6f9dc8c1b8a1531/Concrete_Data.xls'
# url = 'https://github.com/jmonti-gh/Datasets/blob/main/test.csv'
# df = pd.read_csv(url)
# df

In [3]:
zipfile = 'files/ConcreteCompressiveStrength.zip'
dataset = 'Concrete_Data.xls'

with zfm.ZipFile(zipfile) as zf:
    df_xls = pd.read_excel(zf.open(dataset))

df = pd.read_csv('files/concrete.csv')

print(df.shape)
display(df_xls.iloc[[0, 9, -9, -1]])
display(df.iloc[[0, 9, -9, -1]])

### read_excel: install xlrd (for xls); install openpyxl (for xlsx)
# https://stackoverflow.com/questions/48066517/python-pandas-pd-read-excel-giving-importerror-install-xlrd-0-9-0-for-excel

(1030, 9)


| | Cement (component 1)(kg in a m^3 mixture) | Blast Furnace Slag (component 2)(kg in a m^3 mixture) | Fly Ash (component 3)(kg in a m^3 mixture) | Water (component 4)(kg in a m^3 mixture) | Superplasticizer (component 5)(kg in a m^3 mixture) | Coarse Aggregate (component 6)(kg in a m^3 mixture) | Fine Aggregate (component 7)(kg in a m^3 mixture) | Age (day) | Concrete compressive strength(MPa, megapascals) |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 540.0 | 0.0 | 0.0 | 162.0 | 2.5 | 1040.0 | 676.0 | 28 | 79.986111 |
| 9 | 475.0 | 0.0 | 0.0 | 228.0 | 0.0 | 932.0 | 594.0 | 28 | 39.28979 |
| 1021 | 298.2 | 0.0 | 107.0 | 209.7 | 11.1 | 879.6 | 744.2 | 28 | 31.875165 |
| 1029 | 260.9 | 100.5 | 78.3 | 200.6 | 8.6 | 864.5 | 761.5 | 28 | 32.401235 |


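| | cement | slag | ash | water | superplastic | coarseagg | fineagg | age | strength |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 540.0 | 0.0 | 0.0 | 162.0 | 2.5 | 1040.0 | 676.0 | 28 | 79.99 |
| 9 | 475.0 | 0.0 | 0.0 | 228.0 | 0.0 | 932.0 | 594.0 | 28 | 39.29 |
| 1021 | 298.2 | 0.0 | 107.0 | 209.7 | 11.1 | 879.6 | 744.2 | 28 | 31.88 |
| 1029 | 260.9 | 100.5 | 78.3 | 200.6 | 8.6 | 864.5 | 761.5 | 28 | 32.4 |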


The Concrete dataset contains a variety of concrete formulations and the resulting product's compressive strength, which is a measure of how much load that kind of concrete can bear. The task for this dataset is to predict a concrete's compressive strength given its formulation.

In [4]:
# libraries necessary to build and evaluate the model
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

You can see in the df the various ingredients going into each variety of concrete. We'll see in a moment how adding some additional synthetic features derived from these can help a model to learn important relationships among them.

We'll first establish a baseline by training the model on the un-augmented dataset. This will help us determine whether our new features are actually useful.

Establishing baselines like this is good practice at the start of the feature engineering process. A baseline score can help you decide whether your new features are worth keeping, or whether you should discard them and possibly try something else.

In [5]:
df.columns

Index(['cement', 'slag', 'ash', 'water', 'superplastic', 'coarseagg',
       'fineagg', 'age', 'strength'],
      dtype='object')

In [6]:
target = df_xls.columns[-1]
tdf = df.columns[-1]
X = df.copy()
y = X.pop(tdf)



In [7]:
# Train and score baseline model
baseline = RandomForestRegressor(criterion="absolute_error", random_state=0)
baseline_score = cross_val_score(
    baseline, X, y, cv=5, scoring="neg_mean_absolute_error"
)
baseline_score = -1 * baseline_score.mean()

print(f"MAE Baseline Score: {baseline_score:.4}")

MAE Baseline Score: 8.232


If you ever cook at home, you might know that the ratio of ingredients in a recipe is usually a better predictor of how the recipe turns out than their absolute amounts. We might reason then that ratios of the features above would be a good predictor of compressive strength (the `strength` column).

The cell below adds three new ratio features to the dataset.

In [8]:
df.columns

Index(['cement', 'slag', 'ash', 'water', 'superplastic', 'coarseagg',
       'fineagg', 'age', 'strength'],
      dtype='object')

In [9]:
X = df.copy()
y = X.pop("strength")

# Create synthetic features
X["FCRatio"] = X["fineagg"] / X["coarseagg"]
X["AggCmtRatio"] = (X["coarseagg"] + X["fineagg"]) / X["cement"]
X["WtrCmtRatio"] = X["water"] / X["cement"]

# Train and score model on dataset with additional ratio features
model = RandomForestRegressor(criterion="absolute_error", random_state=0)
score = cross_val_score(
    model, X, y, cv=5, scoring="neg_mean_absolute_error"
)
score = -1 * score.mean()

print(f"MAE Score with Ratio Features: {score:.4}")

MAE Score with Ratio Features: 7.948


And sure enough, performance improved! This is evidence that these new ratio features exposed important information to the model that it wasn't detecting before.
#### Continue
We've seen that engineering new features can improve model performance. But how do you identify features in the dataset that might be useful to combine? __Discover useful features__ with mutual information.

## 2.- Mutual Information
- Locate features with the most potential.

### Intro
- Encountering a new dataset for the first time can feel overwhelming.

1. A great first step is to construct a ranking with a __feature utility metric__ (a function measuring associations between a feature and the target).
2. Then you can choose a smaller set of the most useful features to develop initially and have more confidence that your time will be well spent.
3. The metric we'll use is called "mutual information".

#### Mutual information
- Mutual information is a lot like correlation in that it measures a relationship between two quantities. The advantage of mutual information is that it can detect any kind of relationship, while correlation only detects linear relationships.
- Mutual information is a great general-purpose metric and especially useful at the start of feature development when you might not know what model you'd like to use yet. It is:
  1. easy to use and interpret,
  2. computationally efficient,
  3. theoretically well-founded,
  4. resistant to overfitting, and
  5. able to detect any kind of relationship.
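
The difference between correlation and mutual information is easy to see on synthetic data. The sketch below is an illustration added to these notes, not course material: it assumes a made-up target that depends on `x` through a symmetric curve (`y = x**2` plus a little noise) and compares the Pearson correlation with the score from scikit-learn's `mutual_info_regression`.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)

# Synthetic data: y depends on x, but through a symmetric, non-linear curve
x = rng.uniform(-1, 1, size=1000)
y = x**2 + rng.normal(0, 0.02, size=1000)

# Pearson correlation only measures the linear part of the relationship
corr = np.corrcoef(x, y)[0, 1]

# Mutual information measures how much knowing x narrows down y
mi = mutual_info_regression(x.reshape(-1, 1), y, random_state=0)[0]

print(f"Pearson correlation: {corr:.3f}")
print(f"Mutual information:  {mi:.3f}")
```

The correlation should come out close to zero even though `y` is almost completely determined by `x`, while the MI score should be clearly positive.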

### Mutual Information and What It Measures

- Mutual information describes relationships in terms of uncertainty.
- The __mutual information__ (MI) between two quantities is a measure of the extent to which knowledge of one quantity reduces uncertainty about the other.
- If you knew the value of a feature, how much more confident would you be about the target?
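
In the usual information-theoretic notation (added here for reference, not taken from the course), with $H$ denoting entropy as the measure of uncertainty, this idea can be written for two discrete quantities $X$ and $Y$ as:

$$
I(X; Y) = H(Y) - H(Y \mid X) = \sum_{x} \sum_{y} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)}
$$

The first form reads as "the uncertainty about $Y$ minus the uncertainty that remains once $X$ is known". The second form shows that $I(X; Y) = 0$ exactly when $p(x, y) = p(x)\,p(y)$ for all values, that is, when the two quantities are independent.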

### Interpreting Mutual Information Scores
- The least possible mutual information between quantities is 0.0. When MI is zero, the quantities are independent: neither can tell you anything about the other.
- MI values above 2.0 or so are uncommon (MI is a logarithmic quantity, so it increases very slowly).

1. MI can help you to understand the relative potential of a feature as a predictor of the target, considered by itself.
2. It's possible for a feature to be very informative when interacting with other features, but not so informative all alone. MI can't detect interactions between features. It is a univariate metric.
3. The actual usefulness of a feature depends on the model you use it with. A feature is only useful to the extent that its relationship with the target is one your model can learn. Just because a feature has a high MI score doesn't mean your model will be able to do anything with that information. You may need to transform the feature first to expose the association.
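
The second point, that MI is univariate and misses interactions, can be illustrated with a toy XOR target. This is a sketch on made-up binary data, not an example from the course. It uses `sklearn.metrics.mutual_info_score`, which computes MI between two discrete label arrays; the names `a`, `b`, and `pair` are invented for the illustration.

```python
import numpy as np
from sklearn.metrics import mutual_info_score

# Balanced toy data: every combination of two binary features, repeated
a = np.tile([0, 0, 1, 1], 50)
b = np.tile([0, 1, 0, 1], 50)
target = a ^ b  # XOR: the target depends only on the combination of a and b

# Scored one at a time, neither feature says anything about the target
mi_a = mutual_info_score(a, target)
mi_b = mutual_info_score(b, target)

# Encoding the pair (a, b) as a single combined feature reveals the dependence
pair = 2 * a + b
mi_pair = mutual_info_score(pair, target)

print(f"MI(a, target)      = {mi_a:.3f}")
print(f"MI(b, target)      = {mi_b:.3f}")
print(f"MI((a, b), target) = {mi_pair:.3f}")
```

Each feature alone scores zero, yet the two together determine the target completely. A ranking based only on per-feature MI would discard both, which is why low-scoring features can still be worth combining.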

### Example - 1985 Automobiles

The Automobile dataset consists of 193 cars from the 1985 model year. The goal for this dataset is to predict a car's price (the target) from 23 of the car's features, such as make, body_style, and horsepower. In this example, we'll rank the features with mutual information and investigate the results by data visualization.

In [10]:
### Import necessary libraries
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import requests
import zipfile as zfm

In [11]:
### Load 'Automobile' dataset
import io

ro = 'jmonti-gh'                  # repo_owner
rn = 'Datasets'                   # repo_name
zipfln = 'Automobile_data.zip'
dataset = 'Automobile_data.csv'

url = f'https://raw.githubusercontent.com/{ro}/{rn}/main/{zipfln}'

r = requests.get(url)
r.raise_for_status()  # stop here if the download failed

with zfm.ZipFile(io.BytesIO(r.content)) as zf:
    print(zf.namelist())
    df = pd.read_csv(zf.open(dataset))

print(df.shape)
df.iloc[[0, 9, -9, -1]]

['Automobile_data.csv']
(205, 26)


| | symboling | normalized-losses | make | fuel-type | aspiration | num-of-doors | body-style | drive-wheels | engine-location | wheel-base | ... | engine-size | fuel-system | bore | stroke | compression-ratio | horsepower | peak-rpm | city-mpg | highway-mpg | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | ? | alfa-romero | gas | std | two | convertible | rwd | front | 88.6 | ... | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 13495 |
| 9 | 0 | ? | audi | gas | turbo | two | hatchback | 4wd | front | 99.5 | ... | 131 | mpfi | 3.13 | 3.4 | 7.0 | 160 | 5500 | 16 | 22 | ? |
| 196 | -2 | 103 | volvo | gas | std | four | sedan | rwd | front | 104.3 | ... | 141 | mpfi | 3.78 | 3.15 | 9.5 | 114 | 5400 | 24 | 28 | 15985 |
| 204 | -1 | 95 | volvo | gas | turbo | four | sedan | rwd | front | 109.1 | ... | 141 | mpfi | 3.78 | 3.15 | 9.5 | 114 | 5400 | 19 | 25 | 22625 |


In [12]:
print(df.columns)
df.drop('normalized-losses', axis=1, inplace=True)  # this column is not in the tutorial's dataset
print(df.shape)
df.iloc[[0, 9, -9, -1]]

Index(['symboling', 'normalized-losses', 'make', 'fuel-type', 'aspiration',
       'num-of-doors', 'body-style', 'drive-wheels', 'engine-location',
       'wheel-base', 'length', 'width', 'height', 'curb-weight', 'engine-type',
       'num-of-cylinders', 'engine-size', 'fuel-system', 'bore', 'stroke',
       'compression-ratio', 'horsepower', 'peak-rpm', 'city-mpg',
       'highway-mpg', 'price'],
      dtype='object')
(205, 25)


| | symboling | make | fuel-type | aspiration | num-of-doors | body-style | drive-wheels | engine-location | wheel-base | length | ... | engine-size | fuel-system | bore | stroke | compression-ratio | horsepower | peak-rpm | city-mpg | highway-mpg | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | alfa-romero | gas | std | two | convertible | rwd | front | 88.6 | 168.8 | ... | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 13495 |
| 9 | 0 | audi | gas | turbo | two | hatchback | 4wd | front | 99.5 | 178.2 | ... | 131 | mpfi | 3.13 | 3.4 | 7.0 | 160 | 5500 | 16 | 22 | ? |
| 196 | -2 | volvo | gas | std | four | sedan | rwd | front | 104.3 | 188.8 | ... | 141 | mpfi | 3.78 | 3.15 | 9.5 | 114 | 5400 | 24 | 28 | 15985 |
| 204 | -1 | volvo | gas | turbo | four | sedan | rwd | front | 109.1 | 188.8 | ... | 141 | mpfi | 3.78 | 3.15 | 9.5 | 114 | 5400 | 19 | 25 | 22625 |


The scikit-learn algorithm for MI treats discrete features differently from continuous features. Consequently, you need to tell it which are which. As a rule of thumb, anything that must have a float dtype is not discrete. Categoricals (object or categorical dtype) can be treated as discrete by giving them a label encoding. (You can review label encodings in the Categorical Variables lesson.)

In [13]:
X = df.copy()
y = X.pop("price")

# Label encoding for categoricals
for colname in X.select_dtypes("object"):
    X[colname], _ = X[colname].factorize()

# All discrete features should now have integer dtypes
# (double-check this before using MI!)
discrete_features = X.dtypes == int

In [17]:
print(X.shape)
X.columns
X.info()

(205, 24)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 205 entries, 0 to 204
Data columns (total 24 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   symboling          205 non-null    int64  
 1   make               205 non-null    int64  
 2   fuel-type          205 non-null    int64  
 3   aspiration         205 non-null    int64  
 4   num-of-doors       205 non-null    int64  
 5   body-style         205 non-null    int64  
 6   drive-wheels       205 non-null    int64  
 7   engine-location    205 non-null    int64  
 8   wheel-base         205 non-null    float64
 9   length             205 non-null    float64
 10  width              205 non-null    float64
 11  height             205 non-null    float64
 12  curb-weight        205 non-null    int64  
 13  engine-type        205 non-null    int64  
 14  num-of-cylinders   205 non-null    int64  
 15  engine-size        205 non-null    int64  
 16  fuel-system     

Scikit-learn has two mutual information metrics in its feature_selection module: one for real-valued targets (mutual_info_regression) and one for categorical targets (mutual_info_classif). Our target, price, is real-valued. The next cell computes the MI scores for our features and wraps them up in a nice dataframe.

In [14]:
from sklearn.feature_selection import mutual_info_regression

def make_mi_scores(X, y, discrete_features):
    mi_scores = mutual_info_regression(X, y,
                                       discrete_features=discrete_features)
    mi_scores = pd.Series(mi_scores, name="MI Scores", index=X.columns)
    mi_scores = mi_scores.sort_values(ascending=False)
    return mi_scores

mi_scores = make_mi_scores(X, y, discrete_features)
mi_scores[::3]  # show a few features with their MI scores

ValueError: could not convert string to float: '?'
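
The `ValueError` most likely comes from the `?` placeholders that this raw copy of the dataset uses for missing values (one is visible in the `price` column of row 9 above): `mutual_info_regression` needs numeric input. The sketch below shows one possible clean-up on a small synthetic stand-in, not on the real file. The frame `raw`, its three columns, and the positions of the `?` entries are all invented for the illustration.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
n = 200

# Hypothetical stand-in for the raw data: numbers stored as strings
horsepower = rng.uniform(50, 250, size=n).round(1)
price = (100 * horsepower + rng.normal(0, 1500, size=n)).round(0)
raw = pd.DataFrame({
    "make": rng.choice(["audi", "bmw", "volvo"], size=n),
    "horsepower": horsepower.astype(str),
    "price": price.astype(str),
})
# Mimic the raw file: a few values are recorded as '?'
raw.loc[[3, 50], "horsepower"] = "?"
raw.loc[[9, 120], "price"] = "?"

# 1. Turn the placeholder into a real missing value and drop incomplete rows
clean = raw.replace("?", np.nan).dropna()

# 2. Convert columns that are numeric in meaning but were read as strings
for col in ["horsepower", "price"]:
    clean[col] = pd.to_numeric(clean[col])

X = clean.copy()
y = X.pop("price")

# 3. Label-encode whatever is still non-numeric (the true categoricals)
for colname in X.select_dtypes(exclude="number"):
    X[colname], _ = X[colname].factorize()

# Integer columns are treated as discrete, float columns as continuous
discrete_features = np.array(
    [pd.api.types.is_integer_dtype(dtype) for dtype in X.dtypes]
)

mi_scores = pd.Series(
    mutual_info_regression(
        X, y, discrete_features=discrete_features, random_state=0
    ),
    name="MI Scores",
    index=X.columns,
).sort_values(ascending=False)
print(mi_scores)
```

On the real data the same three steps would be applied to `df` before splitting off `price`. Passing `na_values="?"` to `pd.read_csv` is an alternative that marks the placeholders as missing at load time, so the numeric columns are parsed as numbers from the start.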