# eli5

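- eli5 is a widely used library for computing Permutation Feature Importance
- The permutation technique randomly shuffles the values of one feature, deliberately turning that feature into noise
- If model performance drops sharply after the shuffle, the model depends on that feature, so the feature is important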
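
The idea in the list above can be sketched without any library. This is a minimal illustration, not eli5's implementation: `permutation_importance_sketch`, `predict`, and the toy data are hypothetical names and values introduced here. It assumes a model exposed as a plain `predict(row)` function and measures importance as the increase in mean squared error after shuffling one column, which corresponds to the decrease in a `neg_mean_squared_error` score.

```python
import random
import statistics


def mse(y_true, y_pred):
    """Mean squared error between two equally long sequences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def permutation_importance_sketch(predict, X, y, n_iter=5, seed=42):
    """Return {column index: (mean, std)} of the MSE increase after shuffling."""
    rng = random.Random(seed)
    baseline = mse(y, [predict(row) for row in X])
    result = {}
    for j in range(len(X[0])):
        increases = []
        for _ in range(n_iter):
            # Shuffle only column j; every other column keeps its values.
            column = [row[j] for row in X]
            rng.shuffle(column)
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            permuted = mse(y, [predict(row) for row in X_perm])
            increases.append(permuted - baseline)
        result[j] = (statistics.mean(increases), statistics.pstdev(increases))
    return result


# Toy data (hypothetical): the target depends on feature 0 only.
data_rng = random.Random(0)
X = [[data_rng.uniform(0, 10), data_rng.uniform(0, 10)] for _ in range(200)]
y = [3.0 * row[0] for row in X]


def predict(row):
    """Stand-in for a fitted model that uses feature 0 and ignores feature 1."""
    return 3.0 * row[0]


importances = permutation_importance_sketch(predict, X, y)
for j, (mean, std) in importances.items():
    print(f"feature {j}: {mean:.4f} ± {std:.4f}")
```

Shuffling feature 0 breaks its link to the target and the error grows, while shuffling feature 1 changes nothing because the stand-in model never reads it.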

***

## Example

In [53]:
import eli5
from eli5.sklearn import PermutationImportance
import pandas as pd
from sklearn.model_selection import cross_val_score, KFold, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier
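# Note: load_boston was removed in scikit-learn 1.2, so this import requires an older version
from sklearn.datasets import load_boston, load_breast_cancer
# LGBMClassifier is needed for the classification example
from lightgbm import LGBMRegressor, LGBMClassifier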

### 1) Regression

In [None]:
boston = load_boston()

In [19]:
X = pd.DataFrame(boston['data'], columns = boston['feature_names'])

In [20]:
y = boston['target']

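In [21]:
lgbm = LGBMRegressor(random_state = 42)
# PermutationImportance defaults to cv='prefit', which expects an already fitted estimator
lgbm.fit(X, y)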

In [23]:
PI = PermutationImportance(lgbm, scoring = 'neg_mean_squared_error', random_state = 42)

In [24]:
PI.fit(X, y)

PermutationImportance(estimator=LGBMRegressor(random_state=42), random_state=42,
                      scoring='neg_mean_squared_error')

In [29]:
eli5.show_weights(PI, top = 100, feature_names = X.columns.tolist())

| Weight | Feature |
| --- | --- |
| 55.8891 ± 8.9536 | LSTAT |
| 30.9394 ± 1.2243 | RM |
| 7.5107 ± 0.7677 | DIS |
| 5.7804 ± 0.5563 | NOX |
| 3.8557 ± 0.6172 | AGE |
| 2.7903 ± 0.1717 | CRIM |
| 1.5908 ± 0.1627 | PTRATIO |
| 1.3812 ± 0.2214 | TAX |
| 1.2234 ± 0.2134 | B |
| 0.5704 ± 0.1167 | INDUS |


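`show_weights` lists the top n features first. A weight can be positive or negative, and the larger the value, the more important the feature. A negative weight means the score did not get worse when the feature was shuffled; it got better.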
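
To make the sign concrete, here is a small sketch of how a single weight could be derived from scores. The helper `permutation_weight` and all score values are hypothetical, chosen only for illustration. It assumes the weight is the mean drop in score (baseline minus shuffled) over several shuffles; the spread that eli5 prints after `±` may be scaled differently from the plain standard deviation used here.

```python
import statistics


def permutation_weight(baseline_score, permuted_scores):
    """Mean and standard deviation of the score drop caused by shuffling."""
    drops = [baseline_score - s for s in permuted_scores]
    return statistics.mean(drops), statistics.pstdev(drops)


# Hypothetical scores where higher is better (for example accuracy or F1).
useful = permutation_weight(0.90, [0.80, 0.82, 0.78])   # score falls after shuffling
harmful = permutation_weight(0.90, [0.91, 0.92, 0.90])  # score does not fall

print(f"useful feature:  {useful[0]:+.4f} ± {useful[1]:.4f}")
print(f"harmful feature: {harmful[0]:+.4f} ± {harmful[1]:.4f}")
```

The first feature gets a positive weight because the model scores worse without it. The second gets a negative weight, which suggests the model was not relying on it in a helpful way.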

In [31]:
X['target'] = y

In [33]:
data = X.copy()

In [34]:
data.corr()

|  | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | target |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| CRIM | 1.0 | -0.200469 | 0.406583 | -0.055892 | 0.420972 | -0.219247 | 0.352734 | -0.37967 | 0.625505 | 0.582764 | 0.289946 | -0.385064 | 0.455621 | -0.388305 |
| ZN | -0.200469 | 1.0 | -0.533828 | -0.042697 | -0.516604 | 0.311991 | -0.569537 | 0.664408 | -0.311948 | -0.314563 | -0.391679 | 0.17552 | -0.412995 | 0.360445 |
| INDUS | 0.406583 | -0.533828 | 1.0 | 0.062938 | 0.763651 | -0.391676 | 0.644779 | -0.708027 | 0.595129 | 0.72076 | 0.383248 | -0.356977 | 0.6038 | -0.483725 |
| CHAS | -0.055892 | -0.042697 | 0.062938 | 1.0 | 0.091203 | 0.091251 | 0.086518 | -0.099176 | -0.007368 | -0.035587 | -0.121515 | 0.048788 | -0.053929 | 0.17526 |
| NOX | 0.420972 | -0.516604 | 0.763651 | 0.091203 | 1.0 | -0.302188 | 0.73147 | -0.76923 | 0.611441 | 0.668023 | 0.188933 | -0.380051 | 0.590879 | -0.427321 |
| RM | -0.219247 | 0.311991 | -0.391676 | 0.091251 | -0.302188 | 1.0 | -0.240265 | 0.205246 | -0.209847 | -0.292048 | -0.355501 | 0.128069 | -0.613808 | 0.69536 |
| AGE | 0.352734 | -0.569537 | 0.644779 | 0.086518 | 0.73147 | -0.240265 | 1.0 | -0.747881 | 0.456022 | 0.506456 | 0.261515 | -0.273534 | 0.602339 | -0.376955 |
| DIS | -0.37967 | 0.664408 | -0.708027 | -0.099176 | -0.76923 | 0.205246 | -0.747881 | 1.0 | -0.494588 | -0.534432 | -0.232471 | 0.291512 | -0.496996 | 0.249929 |
| RAD | 0.625505 | -0.311948 | 0.595129 | -0.007368 | 0.611441 | -0.209847 | 0.456022 | -0.494588 | 1.0 | 0.910228 | 0.464741 | -0.444413 | 0.488676 | -0.381626 |
| TAX | 0.582764 | -0.314563 | 0.72076 | -0.035587 | 0.668023 | -0.292048 | 0.506456 | -0.534432 | 0.910228 | 1.0 | 0.460853 | -0.441808 | 0.543993 | -0.468536 |


Looking at the correlations with the target, LSTAT has the strongest correlation with it, and eli5 likewise ranks LSTAT as the most important feature.
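
The agreement noted above need not hold in general, because Pearson correlation only captures linear relationships while permutation importance reflects whatever the model has learned. The following sketch uses hypothetical toy data and a stand-in `predict` function (neither comes from the notebook) to show a feature whose correlation with the target is zero yet whose permutation importance is clearly positive.

```python
import random


def pearson(a, b):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


def mse(y_true, y_pred):
    """Mean squared error between two equally long sequences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def predict(value):
    """Stand-in for a fitted model that has learned y = x ** 2."""
    return value ** 2


# Toy data (hypothetical): x is symmetric around zero and y = x ** 2.
x = [-2.0, -1.0, 0.0, 1.0, 2.0] * 20
y = [v ** 2 for v in x]

corr = pearson(x, y)

# Permutation importance of x: increase in MSE after shuffling x once.
rng = random.Random(0)
x_shuffled = x[:]
rng.shuffle(x_shuffled)
baseline = mse(y, [predict(v) for v in x])
permuted = mse(y, [predict(v) for v in x_shuffled])
importance = permuted - baseline

print(f"correlation with target: {corr:.4f}")
print(f"permutation importance:  {importance:.4f}")
```

For that reason, a correlation table is best read as a sanity check on the ranking rather than as a substitute for it.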

***
### 2) Classification

In [40]:
bc = load_breast_cancer()

In [41]:
data = pd.DataFrame(bc['data'], columns = bc['feature_names'])

In [42]:
data['target'] = bc.target

In [47]:
X = data.drop('target', axis = 1)
y = data.target

In [60]:
lgbm = LGBMClassifier(random_state = 42)

In [61]:
lgbm.fit(X, y)

LGBMClassifier(random_state=42)

In [62]:
PI = PermutationImportance(lgbm, scoring = 'f1', random_state = 42)

In [63]:
PI.fit(X, y)

PermutationImportance(estimator=LGBMClassifier(random_state=42),
                      random_state=42, scoring='f1')

In [64]:
eli5.show_weights(PI, top = 15, feature_names = X.columns.tolist())

| Weight | Feature |
| --- | --- |
| 0.0135 ± 0.0029 | worst area |
| 0.0107 ± 0.0029 | worst concave points |
| 0.0050 ± 0.0046 | worst texture |
| 0.0014 ± 0.0000 | area error |
| 0.0011 ± 0.0021 | mean texture |
| 0.0006 ± 0.0014 | worst concavity |
| 0 ± 0.0000 | mean concavity |
| 0 ± 0.0000 | mean concave points |
| 0 ± 0.0000 | mean compactness |
| 0 ± 0.0000 | mean area |


In [67]:
data.corr()['target'].sort_values()

worst concave points      -0.793566
worst perimeter           -0.782914
mean concave points       -0.776614
worst radius              -0.776454
mean perimeter            -0.742636
worst area                -0.733825
mean radius               -0.730029
mean area                 -0.708984
mean concavity            -0.696360
worst concavity           -0.659610
mean compactness          -0.596534
worst compactness         -0.590998
radius error              -0.567134
perimeter error           -0.556141
area error                -0.548236
worst texture             -0.456903
worst smoothness          -0.421465
worst symmetry            -0.416294
mean texture              -0.415185
concave points error      -0.408042
mean smoothness           -0.358560
mean symmetry             -0.330499
worst fractal dimension   -0.323872
compactness error         -0.292999
concavity error           -0.253730
fractal dimension error   -0.077972
symmetry error             0.006522
texture error              0