ITMO_FS was developed by students and staff of the Machine Learning Lab at ITMO University. It is implemented in Python and is compatible with scikit-learn, which is regarded as the de facto standard tool for data analysis. Its feature selectors accept the same parameters:

data: array-like (2-D list, pandas.DataFrame, numpy.array);  
targets: array-like (1-D list, pandas.Series, numpy.array).

The library supports all the classical approaches to feature selection: filters, wrappers, and embedded methods. They include algorithms such as filters based on Spearman and Pearson correlations, the Fit Criterion, QPFS, the hill climbing filter, and others.

Let's look at some of the methods this library provides.

In [1]:
!pip install ITMO_FS
import ITMO_FS 
import numpy as np

Collecting ITMO_FS
  Downloading ITMO_FS-0.3.3-py3-none-any.whl (70 kB)
Collecting qpsolvers
  Downloading qpsolvers-1.8.1-py3-none-any.whl (36 kB)
Collecting quadprog>=0.1.8
  Downloading quadprog-0.1.11.tar.gz (121 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
    Preparing wheel metadata ... done
Building wheels for collected packages: quadprog
  Building wheel for quadprog (PEP 517) ... done
  Created wheel for quadprog: filename=quadprog-0.1.11-cp37-cp37m-linux_x86_64.whl size=290750 sha256=8db7b79340167bd8325337b9997fe10979467a043ec5edfe3928653bcc0ba967
  Stored in directory: /root/.cache/pip/wheels/4a/4e/d7/41034ea11aeef1266df3cae546116cb6094e955c41ae3e2589
Successfully built quadprog
Installing collected packages: quadprog, qpsolvers, ITMO-FS
Successfully instal

In [2]:
from sklearn.datasets import make_classification
from sklearn.preprocessing import KBinsDiscretizer
from ITMO_FS.filters.univariate import select_best_percentage, select_k_best
from ITMO_FS.filters.univariate import UnivariateFilter
from ITMO_FS.filters.univariate import f_ratio_measure, pearson_corr
from ITMO_FS.filters.univariate import VDM
from ITMO_FS.filters.multivariate import MRMR
from ITMO_FS.filters.multivariate import DISRWithMassive
from ITMO_FS.filters.unsupervised.trace_ratio_laplacian import TraceRatioLaplacian

x, y = make_classification(1000, 100, n_informative = 10, n_redundant = 30, n_repeated = 10, shuffle = False)

# Univariate filter methods

[Value Difference Metric](https://www.jair.org/index.php/jair/article/view/10182/24168#page=6) (VDM) is a distance metric used in some nearest-neighbor algorithms. It can be used to reduce the influence that features weakly related to the target variable have on classification performance.

In [3]:
x_1 = np.array([[0, 0, 0, 0],
              [1, 0, 1, 1],
              [1, 0, 0, 2]])
y_1 = np.array([0,
              1,
              1])
vdm = VDM()
vdm.run(x_1, y_1)

array([[0.        , 4.35355339, 4.        ],
       [4.5       , 0.        , 0.5       ],
       [4.        , 0.35355339, 0.        ]])
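To make the metric more concrete, here is a minimal pure-Python sketch of the plain, unweighted VDM with exponent q = 2: the distance between two samples is the sum, over features and classes, of |P(c | a_f) - P(c | b_f)|^q. The helper `vdm_distances` is a hypothetical name introduced for illustration; it is not the ITMO_FS implementation, so its numbers need not match the library output shown above.

```python
from collections import Counter, defaultdict


def vdm_distances(rows, labels, q=2):
    """Pairwise unweighted VDM distances between samples (illustrative sketch)."""
    classes = sorted(set(labels))
    n_features = len(rows[0])

    # cond[f][v][c] = P(class = c | feature f has value v), estimated by counting.
    cond = []
    for f in range(n_features):
        counts = defaultdict(Counter)
        for row, label in zip(rows, labels):
            counts[row[f]][label] += 1
        cond.append({
            value: {c: cnt[c] / sum(cnt.values()) for c in classes}
            for value, cnt in counts.items()
        })

    def dist(a, b):
        # Sum over features and classes of |P(c | a_f) - P(c | b_f)| ** q.
        return sum(
            abs(cond[f][a[f]][c] - cond[f][b[f]][c]) ** q
            for f in range(n_features)
            for c in classes
        )

    return [[dist(a, b) for b in rows] for a in rows]


# Same toy data as in the VDM cell above.
x_1 = [[0, 0, 0, 0],
       [1, 0, 1, 1],
       [1, 0, 0, 2]]
y_1 = [0, 1, 1]
distances = vdm_distances(x_1, y_1)
```

In this sketch the matrix is symmetric with a zero diagonal, and samples that share class-predictive feature values end up close to each other.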

# Measures for univariate filters

For this group of methods it is enough to pass the input data and its class labels. The group contains 10 methods:


1. filters.univariate.**f_ratio_measure**(X, y)	- Calculates Fisher score for features.
2. filters.univariate.**gini_index**(X, y)	- Gini index is a measure of statistical dispersion.
3. filters.univariate.**su_measure**(X, y)	- SU is a correlation measure between the features and the class, calculated via the formula SU(X,Y) = 2 * I(X,Y) / (H(X) + H(Y))
4. filters.univariate.**spearman_corr**(X, y)	- Calculates spearman correlation for each feature.
5. filters.univariate.**pearson_corr**(X, y)	- Calculates pearson correlation for each feature.
6. filters.univariate.**fechner_corr**(X, y)	- Calculates Sample sign correlation (Fechner correlation) for each feature.
7. filters.univariate.**kendall_corr**(X, y)	- Calculates Kendall rank correlation (Kendall's tau) for each feature.
8. filters.univariate.**reliefF_measure**(X, y[, …])	- Counts ReliefF measure for each feature
9. filters.univariate.**chi2_measure**(X, y)	- Calculates score for the test chi-squared statistic from X.
10. filters.univariate.**information_gain**(X, y)	- Calculates mutual information for each feature by the formula I(X,Y) = H(X) - H(X|Y)
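The entropy-based measures in this list can be illustrated with a short pure-Python sketch for discrete features. It uses the identity I(X,Y) = H(X) - H(X|Y) = H(X) + H(Y) - H(X,Y) and the SU formula quoted above. The function names are hypothetical and only demonstrate the formulas; they are not the library's implementation.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def mutual_information(x, y):
    """I(X,Y) = H(X) - H(X|Y), computed as H(X) + H(Y) - H(X,Y)."""
    return entropy(x) + entropy(y) - entropy(list(zip(x, y)))


def symmetric_uncertainty(x, y):
    """SU(X,Y) = 2 * I(X,Y) / (H(X) + H(Y)); taken as 0 if both entropies are 0."""
    denom = entropy(x) + entropy(y)
    return 2 * mutual_information(x, y) / denom if denom > 0 else 0.0


labels = [0, 0, 1, 1]
copy_of_labels = [0, 0, 1, 1]  # fully determines the class
unrelated = [0, 1, 0, 1]       # statistically independent of the class

su_copy = symmetric_uncertainty(copy_of_labels, labels)
su_unrelated = symmetric_uncertainty(unrelated, labels)
```

A feature that fully determines the class gets SU = 1, while a feature independent of the class gets SU = 0, which is what makes SU convenient as a normalized relevance score.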



In [4]:
scores = f_ratio_measure(x, y)
scores1 = pearson_corr(x, y)

print(scores)
print(scores1)

[1.12646725e-01 1.02636539e-05 8.26286562e-02 4.35007377e-02
 5.80353033e-02 1.68999384e-03 3.75561595e-06 9.46477364e-02
 8.05012962e-04 4.55744633e-04 4.08304356e-03 1.43474056e-02
 9.44156464e-02 7.00700767e-02 3.53148641e-02 3.28103435e-03
 7.90283150e-02 5.02598321e-03 5.62442832e-02 1.95039528e-02
 3.90874407e-02 1.80455733e-03 7.41054789e-02 1.13548353e-02
 3.43347970e-04 3.50288142e-02 1.14063349e-01 3.64412950e-03
 3.59740959e-06 5.71517115e-02 4.44869516e-03 5.40203735e-04
 1.62135389e-02 1.94036310e-03 3.41507454e-04 3.04308281e-03
 1.38896621e-02 7.25888243e-02 1.16805669e-02 5.96178417e-03
 9.46477364e-02 9.46477364e-02 5.40203735e-04 9.46477364e-02
 4.55744633e-04 3.50288142e-02 1.16805669e-02 3.41507454e-04
 4.55744633e-04 3.64412950e-03 7.33302613e-07 1.16196068e-04
 2.49707555e-03 2.14779023e-03 1.03578443e-02 6.78886839e-03
 1.46441347e-03 2.31318087e-03 7.71785060e-07 2.15967377e-03
 8.41655331e-05 1.64234468e-03 9.91381441e-04 1.50447374e-07
 7.02420651e-04 1.111648

# Cutting rules for univariate filters

1. filters.univariate.**select_best_by_value**(value)	
2. filters.univariate.**select_worst_by_value**(value)	
3. filters.univariate.**select_k_best**(k)	
4. filters.univariate.**select_k_worst**(k)	
5. filters.univariate.**select_best_percentage**(percent)	
6. filters.univariate.**select_worst_percentage**(percent)	

In [5]:
ufilterBest = UnivariateFilter(f_ratio_measure, select_k_best(10))
ufilterBest.fit(x, y)

ufilterBestPercentage= UnivariateFilter(f_ratio_measure, select_best_percentage(0.1))
ufilterBestPercentage.fit(x, y)

print(ufilterBest.selected_features)
print(ufilterBestPercentage.selected_features)

[26, 0, 7, 40, 41, 43, 12, 2, 16, 22]
[0, 2, 3, 4, 7, 11, 12, 13, 14, 16, 18, 19, 20, 22, 25, 26, 29, 32, 36, 37, 38, 40, 41, 43, 45, 46]
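Conceptually, a cutting rule takes per-feature scores and returns the features to keep. The sketch below shows one possible reading of the "k best" and "best percentage" rules in plain Python. The helpers `k_best` and `best_percentage` are hypothetical, and the conventions used by ITMO_FS itself (for example, how the percentage argument is interpreted) may differ.

```python
def k_best(scores, k):
    """Return the keys of the k highest-scoring features, best first."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


def best_percentage(scores, fraction):
    """Return the top `fraction` (between 0 and 1) of features, ranked by score."""
    return k_best(scores, int(len(scores) * fraction))


# Hypothetical scores: feature index -> measure value.
scores = {0: 0.1, 1: 0.9, 2: 0.5, 3: 0.3}
top_two = k_best(scores, 2)
top_half = best_percentage(scores, 0.5)
```

Separating the measure (how features are scored) from the cutting rule (how many are kept) is what lets `UnivariateFilter` combine any measure with any rule.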


# Multivariate filter methods

The following filters are available:
1. filters.multivariate.**DISRWithMassive**([…]) - Creates DISR (Double Input Symmetric Relevance) feature selection filter based on kASSI criterion for feature selection which aims at maximizing the mutual information avoiding, meanwhile, large multivariate density estimation.
2. filters.multivariate.**FCBFDiscreteFilter**() -	Creates FCBF (Fast Correlation Based filter) feature selection filter based on mutual information criteria for data with discrete features. This filter finds the best set of features by searching, at each step, for the feature that provides the most information about the classification problem on the given dataset, and then eliminating features that are less relevant than redundant.
3. filters.multivariate.**STIR**([n_features_to_keep]) -	Feature selection using STIR algorithm.
4. filters.multivariate.**TraceRatioFisher**(…) -	Creates TraceRatio (similarity-based) feature selection filter performed in a supervised way, i.e. the Fisher version.
5. filters.multivariate.**MIMAGA**(mim_size, …)	

Example: Double Input Symmetric Relevance

In [6]:
# Separate names keep the x, y dataset from In [2] intact for later cells.
X_disr = np.array([[1, 2, 3, 3, 1], [2, 2, 3, 3, 2], [1, 3, 3, 1, 3], [3, 1, 3, 1, 4], [4, 4, 3, 1, 5]])
y_disr = np.array([1, 2, 3, 4, 5])
disr = DISRWithMassive(3)
print(disr.fit_transform(X_disr, y_disr))

[[1 2 1]
 [2 2 2]
 [1 3 3]
 [3 1 4]
 [4 4 5]]


# Measures for multivariate filters

1. filters.multivariate.**MIM**(selected_features, …)	- Mutual Information Maximization feature scoring criterion.
1. filters.multivariate.**MRMR**(selected_features, …)	- Minimum-Redundancy Maximum-Relevance feature scoring criterion.
1. filters.multivariate.**JMI**(selected_features, …)	- Joint Mutual Information feature scoring criterion.
1. filters.multivariate.**CIFE**(selected_features, …)	- Conditional Infomax Feature Extraction feature scoring criterion.
1. filters.multivariate.**MIFS**(selected_features, …)	- Mutual Information Feature Selection feature scoring criterion.
1. filters.multivariate.**CMIM**(selected_features, …)	- Conditional Mutual Info Maximisation feature scoring criterion.
1. filters.multivariate.**ICAP**(selected_features, …)	- Interaction Capping feature scoring criterion.
1. filters.multivariate.**DCSF**(selected_features, …)	- Dynamic change of selected feature with the class scoring criterion.
1. filters.multivariate.**CFR**(selected_features, …)	- The criterion of CFR maximizes the correlation and minimizes the redundancy.
1. filters.multivariate.**MRI**(selected_features, …)	- Max-Relevance and Max-Independence feature scoring criteria.
1. filters.multivariate.**IWFS**(selected_features, …)	- Interaction Weight base feature scoring criteria.
1. filters.multivariate.**generalizedCriteria**(…)	- This feature scoring criterion is a linear combination of relevance, redundancy, and conditional dependency. Given the set of already selected features and the set of remaining features on dataset X with labels y, it selects the next feature.

Example of using Minimum-Redundancy Maximum-Relevance (MRMR)

In [10]:
est = KBinsDiscretizer(n_bins=10, encode='ordinal')  # 'ordinal' encodes each bin identifier as an integer value
data, target = np.array(x), np.array(y)
est.fit(data)
data = est.transform(data)
selected_features = [1, 2]
other_features = [i for i in range(0, data.shape[1]) if i not in selected_features]
print(MRMR(np.array(selected_features), np.array(other_features), data, target))

[-0.55825854 -0.56093276 -0.50623821 -0.34710248 -0.46306443 -0.46125139
 -0.47954247 -0.51589996 -0.48515566 -0.35819657 -0.37960375 -0.47230652
 -0.43401912 -0.39273988 -0.47125926 -0.46103338 -0.43915902 -0.49617427]
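The idea behind the MRMR criterion can be sketched in a few lines: each candidate feature is scored by its relevance to the target minus its average redundancy with the already selected features, both measured by mutual information. The code below is a simplified illustration for discrete data, assuming at least one feature is already selected. `mrmr_scores` is a hypothetical helper, not the ITMO_FS function, and its values are not expected to match the output above.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def mutual_information(a, b):
    """I(A,B) = H(A) + H(B) - H(A,B) for discrete sequences."""
    return entropy(a) + entropy(b) - entropy(list(zip(a, b)))


def mrmr_scores(columns, selected, candidates, target):
    """Relevance to the target minus mean redundancy with selected features.

    Assumes `selected` is non-empty.
    """
    scores = {}
    for f in candidates:
        relevance = mutual_information(columns[f], target)
        redundancy = sum(
            mutual_information(columns[f], columns[s]) for s in selected
        ) / len(selected)
        scores[f] = relevance - redundancy
    return scores


columns = {
    0: [0, 0, 1, 1],  # already selected
    1: [0, 0, 1, 1],  # duplicate of feature 0: relevant but fully redundant
    2: [0, 1, 0, 1],  # equally relevant and independent of feature 0
}
target = [0, 1, 2, 3]
scores = mrmr_scores(columns, selected=[0], candidates=[1, 2], target=target)
```

Both candidates carry one bit of information about the target, but the duplicate is penalized for repeating what feature 0 already provides, so feature 2 receives the higher score.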


# Unsupervised filter methods

Example: TraceRatio

In [8]:
tracer = TraceRatioLaplacian(10)
print(tracer.run(x, y)[0])

[ 5 53  3 89 74 71 63 58 70 88]


# Sparse filter methods

1. filters.sparse.**MCFS**(d[, k, p, scheme, sigma]) -	Performs the Unsupervised Feature Selection for Multi-Cluster Data algorithm.
1. filters.sparse.**NDFS**(p[, c, k, alpha, beta, …]) -	Performs the Nonnegative Discriminative Feature Selection algorithm.
1. filters.sparse.**RFS**(p[, gamma, …])	- Performs the Robust Feature Selection via Joint L2,1-Norms Minimization algorithm.
1. filters.sparse.**SPEC**(p[, k, gamma, sigma, …]) -	Performs the Spectral Feature Selection algorithm.
1. filters.sparse.**UDFS**(p[, c, k, gamma, l, …]) -	Performs the Unsupervised Discriminative Feature Selection algorithm.

# Ensemble methods

1. ensembles.measure_based.**WeightBased**(filters)
1. ensembles.model_based.**BestSum**(models, …)
1. ensembles.ranking_based.**Mixed**(filters) - Performs feature selection based on several filters

# Embedded methods


1. embedded.**MOS**([model, loss, seed])	- Performs Minimizing Overlapping Selection under SMOTE (MOSS) or under No-Sampling (MOSNS) algorithm.

# Hybrid methods

1. hybrid.**FilterWrapperHybrid**(filter_, wrapper)	
1. hybrid.**Melif**(filter_ensemble[, scorer, verbose])	

# Wrapper methods

1. wrappers.deterministic.**AddDelWrapper**(…[, …])	- Creates add-del feature wrapper
1. wrappers.deterministic.**BackwardSelection**(…)	- Backward Selection removes one feature at a time until the number of features to be removed is reached.
1. wrappers.deterministic.**RecursiveElimination**(…)	- Performs a recursive feature elimination until the required number of features is reached.
1. wrappers.deterministic.**SequentialForwardSelection**(…)	- Sequentially adds features that maximize the classifying function when combined with the features already used.
1. wrappers.deterministic.**qpfs_wrapper**(X, y, alpha)	- Performs Quadratic Programming Feature Selection algorithm.
1. wrappers.randomized.**HillClimbingWrapper**(…)	
1. wrappers.randomized.**SimulatedAnnealing**(…)	- Performs feature selection using simulated annealing
1. wrappers.randomized.**TPhMGWO**([wolfNumber, …])	- Performs Grey Wolf optimization with Two-Phase Mutation

Example of [Backward Selection](https://itmo-fs.readthedocs.io/en/latest/generated/ITMO_FS.wrappers.deterministic.BackwardSelection.html?highlight=wrappers.deterministic.BackwardSelection): the algorithm removes one feature at a time until the number of removed features reaches the specified value. It returns the names of the remaining features.

In [9]:
from ITMO_FS.wrappers import BackwardSelection
from sklearn.linear_model import LogisticRegression

x, y = make_classification(n_samples=100, n_features=20, n_informative=4, n_redundant=0, shuffle=False)
data, target = np.array(x), np.array(y)

model = BackwardSelection(LogisticRegression(), 15, 'f1_macro')
model.fit(data, target)
print(model.selected_features)

[ 0  2  9 11 13]
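The greedy loop behind backward selection can be sketched independently of any particular model. In the hypothetical helper below, `score_fn` stands in for whatever quality estimate the wrapper uses (for instance, a cross-validated `f1_macro` of the estimator trained on the given feature subset). This is an assumption made for illustration and not a description of the library's internals.

```python
def backward_selection(features, score_fn, n_remove):
    """Greedily drop n_remove features, one at a time.

    At each step, remove the feature whose absence gives the highest score.
    """
    remaining = list(features)
    for _ in range(n_remove):
        # Try dropping each feature and keep the drop that scores best.
        to_drop = max(
            remaining,
            key=lambda f: score_fn([g for g in remaining if g != f]),
        )
        remaining.remove(to_drop)
    return remaining


# Toy scorer (hypothetical): each feature contributes a fixed amount.
weights = {0: 3.0, 1: -1.0, 2: 2.0, 3: -2.0}


def toy_score(subset):
    return sum(weights[f] for f in subset)


kept = backward_selection([0, 1, 2, 3], toy_score, n_remove=2)
```

With this toy scorer the two harmful features (3, then 1) are removed and features 0 and 2 remain. With a real model each step requires refitting once per remaining feature, which is why wrappers are usually much slower than filters.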


# Sources
API description - https://itmo-fs.readthedocs.io/en/latest/api.html  
Habr article - https://habr.com/ru/company/spbifmo/blog/516194/

# Review

## Передреев Дмитрий

I went through the tutorial. Running all the code cells in order raised an error, because in the second cell we defined x, y



```
x, y = make_classification(1000, 100, n_informative = 10, n_redundant = 30, n_repeated = 10, shuffle = False)
```

and in the third cell these values were overwritten to demonstrate VDM.



```
x = np.array([[0, 0, 0, 0],
              [1, 0, 1, 1],
              [1, 0, 0, 2]])
y = np.array([0,
              1,
              1])
```

As a result, the cells that depended on the originally defined x, y failed with a dimension error. Fixed.

Added a short description of VDM and a Backward Selection example.


## Мартын Евгений
I went through the tutorial. Added a description of the string constant (as a comment).