# Google Ventilator Feature Importance with LOFO

![](https://raw.githubusercontent.com/aerdem4/lofo-importance/master/docs/lofo_logo.png)

**LOFO** (Leave One Feature Out) Importance calculates the importances of a set of features based on **a metric of choice**, for **a model of choice**, by **iteratively removing each feature from the set**, and **evaluating the performance** of the model, with **a validation scheme of choice**, based on the chosen metric.

LOFO first evaluates the performance of the model with all the input features included, then iteratively removes one feature at a time, retrains the model, and evaluates its performance on a validation set. The importance of a feature is the drop in validation score caused by removing it. The mean and standard deviation (across the folds) of each feature's importance are then reported.
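
The procedure above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the lofo-importance implementation: the toy k-nearest-neighbour model, the helper names (`knn_predict`, `cv_scores`, `lofo_importance`) and the synthetic data are all made up for this example.

```python
import random
import statistics


def knn_predict(train_X, train_y, test_X, k=3):
    """Toy k-nearest-neighbour regressor standing in for 'a model of choice'."""
    preds = []
    for row in test_X:
        # Rank training rows by squared Euclidean distance to the query row.
        nearest = sorted(
            range(len(train_X)),
            key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], row)),
        )[:k]
        preds.append(sum(train_y[i] for i in nearest) / k)
    return preds


def neg_mae(y_true, y_pred):
    """Negative mean absolute error, so that higher is better."""
    return -sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def cv_scores(X, y, cols, folds):
    """Score the model on every fold using only the columns listed in `cols`."""
    scores = []
    for train_idx, valid_idx in folds:
        tr_X = [[X[i][c] for c in cols] for i in train_idx]
        va_X = [[X[i][c] for c in cols] for i in valid_idx]
        tr_y = [y[i] for i in train_idx]
        va_y = [y[i] for i in valid_idx]
        scores.append(neg_mae(va_y, knn_predict(tr_X, tr_y, va_X)))
    return scores


def lofo_importance(X, y, feature_names, folds):
    """Return {feature: (mean, std)} of the per-fold score drop when it is left out."""
    all_cols = list(range(len(feature_names)))
    base = cv_scores(X, y, all_cols, folds)
    result = {}
    for drop in all_cols:
        kept = [c for c in all_cols if c != drop]
        reduced = cv_scores(X, y, kept, folds)
        # Per-fold importance = score with the feature minus score without it.
        diffs = [b - r for b, r in zip(base, reduced)]
        result[feature_names[drop]] = (statistics.mean(diffs), statistics.pstdev(diffs))
    return result


# Synthetic data: the target depends on "signal" only; "noise" is irrelevant.
rng = random.Random(0)
X = [[rng.uniform(0, 10), rng.uniform(0, 10)] for _ in range(120)]
y = [3.0 * row[0] for row in X]

# Simple 4-fold split: every 4th row goes to the validation fold.
n = len(X)
folds = [
    ([i for i in range(n) if i % 4 != f], [i for i in range(n) if i % 4 == f])
    for f in range(4)
]

importance = lofo_importance(X, y, ["signal", "noise"], folds)
for name, (mean, std) in importance.items():
    print(f"{name}: {mean:.3f} +/- {std:.3f}")
```

In this toy setup the irrelevant `noise` column distorts the neighbour distances, so leaving it out improves the score and its importance comes out negative, while `signal` gets a clearly positive importance.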

While other feature importance methods usually calculate how much a feature is used by the model, LOFO estimates how much a feature can make a difference by itself given that we have the other features. Here are some advantages of LOFO:
* It generalises well to unseen test sets since it uses a validation scheme.
* It is model agnostic.
* It gives negative importance to features that hurt performance upon inclusion.
* It can group features, which is especially useful for high-dimensional features such as TF-IDF or one-hot-encoded (OHE) features. It is also good practice to group highly correlated features to avoid misleading results.
* It can automatically group highly correlated features to avoid underestimating their importance.
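
To make the idea of correlation-based grouping concrete, here is a minimal greedy sketch. It is an assumption about how such grouping could work, not a description of what lofo-importance does internally; `pearson`, `group_correlated` and the toy columns are hypothetical names for this example.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def group_correlated(columns, threshold=0.9):
    """Greedy grouping: a feature joins the first group whose first member
    it correlates with (in absolute value) above `threshold`."""
    groups = []
    for name in columns:
        for group in groups:
            if abs(pearson(columns[name], columns[group[0]])) > threshold:
                group.append(name)
                break
        else:
            # No existing group is similar enough, so start a new one.
            groups.append([name])
    return groups


columns = {
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "a_scaled": [2.1, 3.9, 6.2, 8.0, 9.9],  # almost a linear copy of "a"
    "b": [5.0, 1.0, 4.0, 2.0, 3.0],
}
groups = group_correlated(columns, threshold=0.9)
print(groups)
```

Removing a whole group at once matters because dropping only one of two near-duplicate features leaves the model with the other, which would make both look unimportant.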

https://github.com/aerdem4/lofo-importance

In [1]:
!pip install lofo-importance

Collecting lofo-importance
  Downloading lofo_importance-0.3.1-py3-none-any.whl (11 kB)
Installing collected packages: lofo-importance
Successfully installed lofo-importance-0.3.1





In [30]:
import numpy as np
import pandas as pd
from tqdm import tqdm
import os, sys
import torch

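# Local directory holding the competition data
PATH = 'e:\\Krivenko\\Kaggle\\2021\\New20211005\\'


# os.path.join avoids the mixed "\/" separator produced by f"{PATH}/train.csv"
df = pd.read_csv(os.path.join(PATH, "train.csv"))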
print(df.shape)
df.head()

(6036000, 8)


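|   | id | breath_id | R | C | time_step | u_in | u_out | pressure |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 20 | 50 | 0.0 | 0.083334 | 0 | 5.837492 |
| 1 | 2 | 1 | 20 | 50 | 0.033652 | 18.383041 | 0 | 5.907794 |
| 2 | 3 | 1 | 20 | 50 | 0.067514 | 22.509278 | 0 | 7.876254 |
| 3 | 4 | 1 | 20 | 50 | 0.101542 | 22.808822 | 0 | 11.742872 |
| 4 | 5 | 1 | 20 | 50 | 0.135756 | 25.35585 | 0 | 12.234987 |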


In [31]:
def engineer_features(df):
    """Add per-breath features to df in place and return it."""
    # Per-breath aggregates of u_in, broadcast to every row of the breath
    df["u_in_sum"] = df.groupby("breath_id")["u_in"].transform("sum")
    df["u_in_cumsum"] = df.groupby("breath_id")["u_in"].cumsum()
    df["u_in_std"] = df.groupby("breath_id")["u_in"].transform("std")
    df["u_in_min"] = df.groupby("breath_id")["u_in"].transform("min")
    df["u_in_max"] = df.groupby("breath_id")["u_in"].transform("max")
    # u_in still to come in the breath after the current row
    df["u_in_cumsum_reverse"] = df["u_in_sum"] - df["u_in_cumsum"]

    df["u_in_first"] = df.groupby("breath_id")["u_in"].transform("first")
    df["u_in_last"] = df.groupby("breath_id")["u_in"].transform("last")

    # Previous/next u_in within the breath (NaN at the first/last row of each breath)
    df["u_in_lag1"] = df.groupby("breath_id")["u_in"].shift(1)
    df["u_in_lead1"] = df.groupby("breath_id")["u_in"].shift(-1)
    df["u_in_lag1_diff"] = df["u_in"] - df["u_in_lag1"]
    df["u_in_lead1_diff"] = df["u_in"] - df["u_in_lead1"]

    df["area"] = df["time_step"] * df["u_in"]

    df["u_out_sum"] = df.groupby("breath_id")["u_out"].transform("sum")

    # Time elapsed since the previous row of the breath (NaN at the first row)
    df["time_passed"] = df.groupby("breath_id")["time_step"].diff()

    return df


df = engineer_features(df)

In [32]:
in_df = df[df["u_out"] == 0].reset_index(drop=True)
in_df.shape

(2290968, 23)

In [33]:
from lofo import Dataset, LOFOImportance, plot_importance
from sklearn.model_selection import GroupKFold


cv = list(GroupKFold(n_splits=4).split(in_df, in_df["pressure"], groups=in_df["breath_id"]))

features = ["time_step", "u_in", "R", "C",
            "u_in_sum", "u_in_cumsum", "u_in_std", "u_in_min", "u_in_max", "u_in_cumsum_reverse",
            "u_in_lead1", "u_in_lag1", "u_in_lag1_diff", "u_in_lead1_diff",
            "u_out_sum", "time_passed", "u_in_first", "u_in_last", "area"]

ds = Dataset(in_df, target="pressure", features=features,
    feature_groups=None,
    auto_group_threshold=0.9
)

Automatically grouped features by correlation:
1 ['u_in_max', 'u_in_std', 'u_in_sum']
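
The `GroupKFold` split above keeps all rows of a given `breath_id` in the same fold, so no breath is seen in both training and validation. A minimal pure-Python sketch of that idea follows; the round-robin assignment and the `group_k_fold` helper are simplifications made up for this example (scikit-learn's `GroupKFold` also balances fold sizes).

```python
def group_k_fold(groups, n_splits):
    """Yield (train_idx, valid_idx) pairs where each distinct group
    lands in exactly one validation fold (round-robin assignment)."""
    unique = sorted(set(groups))
    fold_of = {g: i % n_splits for i, g in enumerate(unique)}
    for fold in range(n_splits):
        valid = [i for i, g in enumerate(groups) if fold_of[g] == fold]
        train = [i for i, g in enumerate(groups) if fold_of[g] != fold]
        yield train, valid


# Toy breath ids: four breaths with three rows each
breath_ids = [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4]
splits = list(group_k_fold(breath_ids, n_splits=4))

for train_idx, valid_idx in splits:
    train_groups = {breath_ids[i] for i in train_idx}
    valid_groups = {breath_ids[i] for i in valid_idx}
    # A breath never appears on both sides of a split.
    print(sorted(train_groups), sorted(valid_groups))
```

Without grouping, rows from the same breath would leak between training and validation, and per-breath features such as `u_in_sum` would make the validation score look better than it is.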


In [None]:
lofo_imp = LOFOImportance(ds, cv=cv, scoring="neg_mean_absolute_error")

importance_df = lofo_imp.get_importance()
importance_df


In [None]:
plot_importance(importance_df, figsize=(8, 8))