ITMO_FS was developed by students and staff of the Machine Learning Lab at ITMO University. It is implemented in Python and is compatible with scikit-learn, the de facto standard tool for data analysis. Its feature selectors accept the same parameters:

data: array-like (2-D list, pandas.DataFrame, numpy.array);  
targets: array-like (1-D list, pandas.Series, numpy.array).

The library supports all classical approaches to feature selection: filters, wrappers, and embedded methods. These include algorithms such as filters based on Spearman and Pearson correlation, the Fit Criterion, QPFS, the hill climbing filter, and others.

Let us look at some of the methods this library provides.

In [1]:
!pip install ITMO_FS
import ITMO_FS 
import numpy as np

In [37]:
from sklearn.datasets import make_classification
from sklearn.preprocessing import KBinsDiscretizer
from ITMO_FS.filters.univariate import select_best_percentage, select_k_best
from ITMO_FS.filters.univariate import UnivariateFilter
from ITMO_FS.filters.univariate import f_ratio_measure, pearson_corr
from ITMO_FS.filters.univariate import VDM
from ITMO_FS.filters.multivariate import MRMR
from ITMO_FS.filters.multivariate import DISRWithMassive
from ITMO_FS.filters.unsupervised.trace_ratio_laplacian import TraceRatioLaplacian

x, y = make_classification(1000, 100, n_informative = 10, n_redundant = 30, n_repeated = 10, shuffle = False)

# Univariate filter methods

Value Difference Metric

In [32]:
x = np.array([[0, 0, 0, 0],
              [1, 0, 1, 1],
              [1, 0, 0, 2]])
y = np.array([0,
              1,
              1])
vdm = VDM()
vdm.run(x, y)

array([[0.        , 4.35355339, 4.        ],
       [4.5       , 0.        , 0.5       ],
       [4.        , 0.35355339, 0.        ]])
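
For orientation, the textbook Value Difference Metric can be sketched in a few lines of NumPy. The helper name `vdm_distances` and the exponent `q=2` are assumptions made for this illustration; this is not the ITMO_FS implementation. The library output above is not symmetric, so it evidently differs from this textbook form in some details.

```python
import numpy as np

def vdm_distances(x, y, q=2):
    """Textbook VDM: pairwise sample distances from class-conditional value frequencies."""
    x, y = np.asarray(x), np.asarray(y)
    classes = np.unique(y)
    n_samples, n_features = x.shape
    dist = np.zeros((n_samples, n_samples))
    for j in range(n_features):
        column = x[:, j]
        # P(class | feature value) for every distinct value of feature j
        cond = {
            v: np.array([np.mean(y[column == v] == c) for c in classes])
            for v in np.unique(column)
        }
        probs = np.array([cond[v] for v in column])  # shape (n_samples, n_classes)
        # sum over classes of |P(c | a) - P(c | b)|^q for every pair of samples
        diff = np.abs(probs[:, None, :] - probs[None, :, :]) ** q
        dist += diff.sum(axis=2)
    return dist

x = np.array([[0, 0, 0, 0],
              [1, 0, 1, 1],
              [1, 0, 0, 2]])
y = np.array([0, 1, 1])
print(vdm_distances(x, y))
```

On this toy data the sketch gives distances 4.5, 4.0 and 0.5 for the sample pairs (0, 1), (0, 2) and (1, 2), values that also appear in the library output above.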

# Measures for univariate filters

For this group of methods it is enough to pass the input data and the class labels. The group contains 10 methods:


1. filters.univariate.**f_ratio_measure**(X, y)	- Calculates the Fisher score for each feature.
2. filters.univariate.**gini_index**(X, y)	- Calculates the Gini index, a measure of statistical dispersion.
3. filters.univariate.**su_measure**(X, y)	- Calculates symmetric uncertainty (SU), a correlation measure between the features and the class, via the formula SU(X,Y) = 2 * I(X;Y) / (H(X) + H(Y)).
4. filters.univariate.**spearman_corr**(X, y)	- Calculates Spearman correlation for each feature.
5. filters.univariate.**pearson_corr**(X, y)	- Calculates Pearson correlation for each feature.
6. filters.univariate.**fechner_corr**(X, y)	- Calculates sample sign correlation (Fechner correlation) for each feature.
7. filters.univariate.**kendall_corr**(X, y)	- Calculates Kendall rank correlation for each feature.
8. filters.univariate.**reliefF_measure**(X, y[, …])	- Computes the ReliefF measure for each feature.
9. filters.univariate.**chi2_measure**(X, y)	- Calculates the chi-squared test statistic for each feature of X.
10. filters.univariate.**information_gain**(X, y)	- Calculates mutual information for each feature by the formula I(X;Y) = H(X) - H(X|Y).
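
The two information-theoretic measures in this list can be written out directly from entropies. The sketch below uses only the standard library and works on discrete values. The function names are hypothetical helpers for this illustration, and the library may handle edge cases or logarithm bases differently.

```python
from collections import Counter
from math import log2

def entropy(values):
    """Shannon entropy H(V) in bits of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())

def conditional_entropy(x, y):
    """H(X|Y): entropy of x within each group of equal y, weighted by group size."""
    n = len(x)
    groups = {}
    for xi, yi in zip(x, y):
        groups.setdefault(yi, []).append(xi)
    return sum(len(g) / n * entropy(g) for g in groups.values())

def info_gain(x, y):
    """Mutual information I(X;Y) = H(X) - H(X|Y)."""
    return entropy(x) - conditional_entropy(x, y)

def symmetric_uncertainty(x, y):
    """SU(X,Y) = 2 * I(X;Y) / (H(X) + H(Y)); taken as 0 when both entropies are 0."""
    denom = entropy(x) + entropy(y)
    return 2 * info_gain(x, y) / denom if denom > 0 else 0.0

labels = [0, 0, 1, 1]
feature = [0, 0, 1, 1]   # determines the label exactly
noise = [0, 1, 0, 1]     # independent of the label
print(info_gain(feature, labels), symmetric_uncertainty(feature, labels))
print(info_gain(noise, labels), symmetric_uncertainty(noise, labels))
```

A feature that determines the label gets an information gain of 1 bit and SU of 1, while an independent feature gets 0 for both. SU normalizes mutual information to the range [0, 1], which makes scores comparable across features with different numbers of distinct values.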



In [19]:
scores = f_ratio_measure(x, y)
scores1 = pearson_corr(x, y)

print(scores)
print(scores1)

[ 8.05398488e-02  2.13346518e-04  3.25891040e-01  9.49682748e-04
  4.32314534e-02  1.18651380e-03  1.90152601e-03  4.01719683e-04
  1.11866934e-04  7.57029056e-04  3.78730110e-01  1.00005502e-02
  5.65204598e-02  4.57917012e-02  2.80749117e-02  1.91162761e-02
  5.70774761e-02  1.35133915e-02  1.06354745e-02  7.35320193e-04
  1.65662497e-03  3.60693230e-02  5.46651776e-03  2.42480030e-02
  1.27767226e-01  2.25308688e-03  2.56475343e-03  9.54718806e-03
  1.51391025e-01  1.78511449e-03  8.57983783e-03  1.37487835e-03
  6.05477389e-02  8.90651472e-02  1.62906279e-02  5.50502626e-02
  3.42524694e-02  7.11629606e-02  1.76246231e-01  5.23193375e-03
  5.65204598e-02  3.25891040e-01  1.00005502e-02  2.13346518e-04
  1.11866934e-04  1.11866934e-04  1.78511449e-03  4.57917012e-02
  5.70774761e-02  1.65662497e-03  7.27146840e-04  9.35169160e-04
  1.62337349e-03  3.39813971e-04  1.23564437e-03  6.16315949e-07
  1.49062313e-04  3.58300984e-05  2.24140695e-04  4.56096929e-03
  5.52473605e-04  1.31153
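
To make the meaning of these scores concrete, here is a minimal NumPy sketch of two of the measures. `pearson_scores` and `fisher_scores` are hypothetical names. The Fisher score follows the textbook definition (between-class over within-class variance), which may differ in detail from what `f_ratio_measure` computes.

```python
import numpy as np

def pearson_scores(x, y):
    """Pearson correlation between every column of x and the target y."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean(axis=0)
    yc = y - y.mean()
    denom = np.sqrt((xc ** 2).sum(axis=0) * (yc ** 2).sum())
    # constant columns have zero variance; report 0 instead of dividing by zero
    return np.divide(xc.T @ yc, denom, out=np.zeros(x.shape[1]), where=denom > 0)

def fisher_scores(x, y):
    """Textbook Fisher score: between-class over within-class variance per feature."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y)
    overall_mean = x.mean(axis=0)
    between = np.zeros(x.shape[1])
    within = np.zeros(x.shape[1])
    for c in np.unique(y):
        xc = x[y == c]
        between += len(xc) * (xc.mean(axis=0) - overall_mean) ** 2
        within += len(xc) * xc.var(axis=0)
    # simplification: zero within-class variance yields score 0, not infinity
    return np.divide(between, within, out=np.zeros(x.shape[1]), where=within > 0)

rng = np.random.default_rng(0)
y = np.repeat([0, 1], 50)
x = np.column_stack([y + 0.1 * rng.normal(size=100),  # informative feature
                     rng.normal(size=100)])           # pure noise
print(pearson_scores(x, y))
print(fisher_scores(x, y))
```

In both cases a larger magnitude means a feature that separates the classes better, which is what the cutting rules later rely on.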

# Cutting rules for univariate filters

1. filters.univariate.**select_best_by_value**(value)	
2. filters.univariate.**select_worst_by_value**(value)	
3. filters.univariate.**select_k_best**(k)	
4. filters.univariate.**select_k_worst**(k)	
5. filters.univariate.**select_best_percentage**(percent)	
6. filters.univariate.**select_worst_percentage**(percent)	

In [25]:
ufilterBest = UnivariateFilter(f_ratio_measure, select_k_best(10))
ufilterBest.fit(x, y)

ufilterBestPercentage = UnivariateFilter(f_ratio_measure, select_best_percentage(0.1))
ufilterBestPercentage.fit(x, y)

print(ufilterBest.selected_features)
print(ufilterBestPercentage.selected_features)

[26, 36, 11, 42, 10, 22, 34, 23, 46, 32]
[2, 4, 5, 8, 10, 11, 13, 20, 21, 22, 23, 26, 31, 32, 34, 36, 38, 39, 42, 45, 46, 47, 49]
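
Conceptually, a cutting rule is a function that turns per-feature scores into a list of selected indices. The plain-Python sketch below illustrates the idea with hypothetical helpers (`k_best`, `best_percentage`, `best_by_value`). It assumes that scores arrive as a list, that the percentage is a fraction between 0 and 1, and that the threshold is inclusive; the library's own conventions may differ.

```python
def k_best(k):
    """Return a rule that keeps the k highest-scoring feature indices."""
    def rule(scores):
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        return ranked[:k]
    return rule

def best_percentage(percent):
    """Return a rule that keeps the top `percent` fraction (0..1) of features."""
    def rule(scores):
        k = int(len(scores) * percent)
        return k_best(k)(scores)
    return rule

def best_by_value(threshold):
    """Return a rule that keeps features whose score is at least `threshold`."""
    def rule(scores):
        return [i for i, s in enumerate(scores) if s >= threshold]
    return rule

scores = [0.08, 0.0002, 0.33, 0.0009, 0.04]
print(k_best(2)(scores))             # [2, 0]
print(best_percentage(0.4)(scores))  # [2, 0]
print(best_by_value(0.05)(scores))   # [0, 2]
```

Separating the scoring measure from the cutting rule is what lets a filter such as `UnivariateFilter` combine any measure with any rule.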


# Multivariate filter methods

The following filters are available:
1. filters.multivariate.**DISRWithMassive**([…]) - Creates a DISR (Double Input Symmetric Relevance) feature selection filter based on the kASSI criterion, which aims at maximizing the mutual information while avoiding large multivariate density estimation.
2. filters.multivariate.**FCBFDiscreteFilter**() -	Creates an FCBF (Fast Correlation Based Filter) feature selection filter based on mutual information criteria for data with discrete features. At each step this filter searches for the feature that provides the most information about the classification problem on the given dataset and then eliminates features that are less relevant than redundant.
3. filters.multivariate.**STIR**([n_features_to_keep]) -	Feature selection using the STIR algorithm.
4. filters.multivariate.**TraceRatioFisher**(…) -	Creates a TraceRatio (similarity-based) feature selection filter performed in a supervised way, i.e. the Fisher version.
5. filters.multivariate.**MIMAGA**(mim_size, …)	

Double Input Symmetric Relevance example

In [35]:
X = np.array([[1, 2, 3, 3, 1],[2, 2, 3, 3, 2], [1, 3, 3, 1, 3],[3, 1, 3, 1, 4],[4, 4, 3, 1, 5]])
y = np.array([1, 2, 3, 4, 5])
disr = DISRWithMassive(3)
print(disr.fit_transform(X, y))

[[1 2 1]
 [2 2 2]
 [1 3 3]
 [3 1 4]
 [4 4 5]]


# Measures for multivariate filters

1. filters.multivariate.**MIM**(selected_features, …)	- Mutual Information Maximization feature scoring criterion.
1. filters.multivariate.**MRMR**(selected_features, …)	- Minimum-Redundancy Maximum-Relevance feature scoring criterion.
1. filters.multivariate.**JMI**(selected_features, …)	- Joint Mutual Information feature scoring criterion.
1. filters.multivariate.**CIFE**(selected_features, …)	- Conditional Infomax Feature Extraction feature scoring criterion.
1. filters.multivariate.**MIFS**(selected_features, …)	- Mutual Information Feature Selection feature scoring criterion.
1. filters.multivariate.**CMIM**(selected_features, …)	- Conditional Mutual Info Maximisation feature scoring criterion.
1. filters.multivariate.**ICAP**(selected_features, …)	- Interaction Capping feature scoring criterion.
1. filters.multivariate.**DCSF**(selected_features, …)	- Dynamic change of selected feature with the class scoring criterion.
1. filters.multivariate.**CFR**(selected_features, …)	- The criterion of CFR maximizes the correlation and minimizes the redundancy.
1. filters.multivariate.**MRI**(selected_features, …)	- Max-Relevance and Max-Independence feature scoring criterion.
1. filters.multivariate.**IWFS**(selected_features, …)	- Interaction Weight based feature scoring criterion.
1. filters.multivariate.**generalizedCriteria**(…)	- Feature scoring criterion that is a linear combination of relevance, redundancy, and conditional dependency terms. Given the set of already selected features and the set of remaining features of dataset X with labels y, it selects the next feature.

Example of using Minimum-Redundancy Maximum-Relevance (MRMR)

In [29]:
est = KBinsDiscretizer(n_bins=10, encode='ordinal')
data, target = np.array(x), np.array(y)
est.fit(data)
data = est.transform(data)
selected_features = [1, 2]
other_features = [i for i in range(0, data.shape[1]) if i not in selected_features]
print(MRMR(np.array(selected_features), np.array(other_features), data, target))

[-0.03228359 -0.04376638 -0.06259611 -0.07149853 -0.02633139 -0.05679009
 -0.06189949 -0.05359309 -0.20235912 -0.08544793 -0.1447178  -0.12556481
 -0.0462733  -0.06021178 -0.06523317 -0.06859147 -0.11886985 -0.12936973
 -0.13170482 -0.09606468 -0.13410027 -0.07378414 -0.0897233  -0.04749364
 -0.20146248 -0.13938036 -0.09502407 -0.024849   -0.1728354  -0.06459796
 -0.05022433 -0.18661539 -0.04922168 -0.09198459 -0.12394811 -0.18030134
 -0.10034338 -0.09610374 -0.10034338 -0.13938036 -0.04749364 -0.024849
 -0.20235912 -0.06859147 -0.20146248 -0.04922168 -1.1639283  -0.06259611
 -0.03448612 -0.03470784 -0.04285411 -0.04167731 -0.03735661 -0.04291407
 -0.03899624 -0.03273649 -0.04911312 -0.03423712 -0.02940827 -0.03351563
 -0.0403405  -0.03889953 -0.03952306 -0.03973018 -0.02569273 -0.03588849
 -0.04035212 -0.03326063 -0.02973029 -0.03043174 -0.03088164 -0.03391277
 -0.02536295 -0.04209314 -0.03003628 -0.03181613 -0.03272788 -0.03011073
 -0.03510675 -0.03092084 -0.04312557 -0.03259453 -0.0
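
The idea behind the MRMR criterion can be sketched independently of the library: each candidate feature is scored by its mutual information with the labels minus its average mutual information with the features already selected. The code below is a minimal NumPy illustration on discrete data. `mutual_info` and `mrmr_scores` are hypothetical helpers, and the library's `MRMR` may estimate or normalize these quantities differently.

```python
import numpy as np

def mutual_info(a, b):
    """Mutual information I(A;B) in nats between two discrete 1-D arrays."""
    _, a_idx = np.unique(a, return_inverse=True)
    _, b_idx = np.unique(b, return_inverse=True)
    joint = np.zeros((a_idx.max() + 1, b_idx.max() + 1))
    np.add.at(joint, (a_idx, b_idx), 1)      # contingency table of co-occurrences
    joint /= joint.sum()
    outer = joint.sum(axis=1, keepdims=True) * joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    return float(np.sum(joint[nz] * np.log(joint[nz] / outer[nz])))

def mrmr_scores(selected, candidates, x, y):
    """Relevance I(f;y) minus mean redundancy with the already selected features."""
    scores = []
    for f in candidates:
        relevance = mutual_info(x[:, f], y)
        redundancy = np.mean([mutual_info(x[:, f], x[:, s]) for s in selected])
        scores.append(relevance - redundancy)
    return np.array(scores)

s = np.array([0, 0, 1, 1])
b = np.array([0, 1, 0, 1])
y = 2 * s + b                      # four classes determined by the two bits
x = np.column_stack([s, s, b])     # feature 1 duplicates feature 0
print(mrmr_scores([0], [1, 2], x, y))
```

With feature 0 already selected, its duplicate (feature 1) scores 0 because its relevance is cancelled by redundancy, while the independent bit (feature 2) keeps its full relevance of ln 2.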

# Unsupervised filter methods

TraceRatio example

In [38]:
tracer = TraceRatioLaplacian(10)
print(tracer.run(x, y)[0])

[11 89 80 76 60 69 67 87  1 20]


# Sparse filter methods

1. filters.sparse.**MCFS**(d[, k, p, scheme, sigma]) -	Performs the Unsupervised Feature Selection for Multi-Cluster Data algorithm.
1. filters.sparse.**NDFS**(p[, c, k, alpha, beta, …]) -	Performs the Nonnegative Discriminative Feature Selection algorithm.
1. filters.sparse.**RFS**(p[, gamma, …])	- Performs the Robust Feature Selection via Joint L2,1-Norms Minimization algorithm.
1. filters.sparse.**SPEC**(p[, k, gamma, sigma, …]) -	Performs the Spectral Feature Selection algorithm.
1. filters.sparse.**UDFS**(p[, c, k, gamma, l, …]) -	Performs the Unsupervised Discriminative Feature Selection algorithm.

# Ensemble methods

1. ensembles.measure_based.**WeightBased**(filters)
1. ensembles.model_based.**BestSum**(models, …)
1. ensembles.ranking_based.**Mixed**(filters) - Performs feature selection based on several filters

# Embedded methods


1. embedded.**MOS**([model, loss, seed])	- Performs Minimizing Overlapping Selection under SMOTE (MOSS) or under No-Sampling (MOSNS) algorithm.

# Hybrid methods

1. hybrid.**FilterWrapperHybrid**(filter_, wrapper)	
1. hybrid.**Melif**(filter_ensemble[, scorer, verbose])	

# Wrapper methods

1. wrappers.deterministic.**AddDelWrapper**(…[, …])	- Creates add-del feature wrapper
1. wrappers.deterministic.**BackwardSelection**(…)	- Backward Selection removes one feature at a time until the number of features to be removed is reached.
1. wrappers.deterministic.**RecursiveElimination**(…)	- Performs a recursive feature elimination until the required number of features is reached.
1. wrappers.deterministic.**SequentialForwardSelection**(…)	- Sequentially adds the features that maximise the classifying function when combined with the features already used.
1. wrappers.deterministic.**qpfs_wrapper**(X, y, alpha)	- Performs Quadratic Programming Feature Selection algorithm.
1. wrappers.randomized.**HillClimbingWrapper**(…)	
1. wrappers.randomized.**SimulatedAnnealing**(…)	- Performs feature selection using simulated annealing
1. wrappers.randomized.**TPhMGWO**([wolfNumber, …])	- Performs Grey Wolf optimization with Two-Phase Mutation
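
As an illustration of what a deterministic wrapper does, the sketch below implements greedy sequential forward selection in NumPy. It is not the library's `SequentialForwardSelection`: the function names are hypothetical, and the in-sample R² of a least-squares fit stands in for the cross-validated model score a real wrapper would typically use.

```python
import numpy as np

def r2_score_lstsq(x, y):
    """Score a feature subset by the R^2 of an ordinary least-squares fit with intercept."""
    design = np.column_stack([np.ones(len(x)), x])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residual = y - design @ coef
    return 1.0 - residual.var() / y.var()

def forward_selection(x, y, n_features, score=r2_score_lstsq):
    """Greedily add the feature that most improves `score` until n_features are chosen."""
    selected, remaining = [], list(range(x.shape[1]))
    while remaining and len(selected) < n_features:
        # evaluate every remaining feature together with the ones already chosen
        best = max(remaining, key=lambda f: score(x[:, selected + [f]], y))
        selected.append(best)
        remaining.remove(best)
    return selected

rng = np.random.default_rng(0)
x = rng.normal(size=(200, 6))
y = 3.0 * x[:, 1] - 2.0 * x[:, 4] + 0.1 * rng.normal(size=200)
print(forward_selection(x, y, 2))
```

Unlike a filter, the wrapper refits the model for every candidate subset, which is why wrappers are usually more accurate but much more expensive than the filters above.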

# Sources
API description - https://itmo-fs.readthedocs.io/en/latest/api.html  
Habr article - https://habr.com/ru/company/spbifmo/blog/516194/