# Partially Supervised Feature Selection with Regularized Linear Models


## Summary

### Feature selection methods overview

This item is based on the first paper.

**Goals of feature selection**

Typical scenarios involve a few tens of samples but thousands of dimensions, as in microarray data.

1. To avoid overfitting and improve model performance: prediction performance in the case of supervised classification, and better cluster detection in unsupervised scenarios.

2. To provide more efficient models.

3. To gain a deeper insight into the underlying processes that generated the data. Excess dimensionality makes the data harder to understand.

The problem amounts to finding the optimal model parameters for the optimal feature subset. The model parameters therefore depend on the selected features, so the search for the feature subset must be more or less tightly coupled with the estimation of the model parameters.

From least (no coupling) to most coupled computation, there are three strategies:

1. Filter techniques. A two-step process: first the filtering, then the training of the model. Filters take into account only the properties of the data and, in some cases, a certain amount of prior knowledge. They are therefore independent of the classification method. In its simplest (univariate) form, a filter ignores dependencies between features.

    Examples: Euclidean distance, t-test, information gain, Markov blanket filter.

2. Wrapper methods. Once a candidate subset of features is selected, the classification model is evaluated by training and testing it on that subset. This is iterated over an ensemble of candidate subsets, and the model (with its feature subset) that reaches the best accuracy is selected.

    It is very important to build a good search algorithm over subsets, in order to reduce the number of subsets that must be modeled. These methods depend on the classifier, they model feature dependencies, and they risk getting trapped in a local optimum. Randomized techniques bypass this problem to some extent.

    Examples: sequential forward selection (SFS), sequential backward elimination, simulated annealing, randomized hill climbing, genetic algorithms.

3. Embedded methods. The search for the optimal subset of features is built into the classifier. They have the advantage of including the interaction with the classification model, while being far less computationally intensive than wrapper methods.

    Examples: decision trees, weighted naive Bayes, feature selection using the weight vector of an SVM.
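
As an illustration of the simplest strategy, here is a minimal sketch of a univariate filter that ranks features by the absolute value of Welch's t-statistic between the two classes. The function names `welch_t` and `filter_rank` and the toy data are assumptions made for this example; they do not come from the paper.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t-statistic between two samples (each needs at least 2 values)."""
    diff = mean(a) - mean(b)
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    if se == 0:
        # Both samples are constant: no difference scores 0, any difference
        # is treated as perfectly separating.
        return 0.0 if diff == 0 else float("inf") * (1 if diff > 0 else -1)
    return diff / se


def filter_rank(X, y, k):
    """Return the indices of the k features with the largest |t| between classes."""
    n_features = len(X[0])
    scores = []
    for j in range(n_features):
        pos = [row[j] for row, label in zip(X, y) if label == 1]
        neg = [row[j] for row, label in zip(X, y) if label == -1]
        scores.append(abs(welch_t(pos, neg)))
    # The ranking uses the data only; no classifier is trained at this stage.
    return sorted(range(n_features), key=lambda j: scores[j], reverse=True)[:k]


# Toy data: 4 samples, 3 features; only feature 0 separates the classes well.
X = [[5.1, 0.2, 1.0],
     [4.9, 0.1, 1.2],
     [1.0, 0.3, 1.1],
     [1.2, 0.2, 0.9]]
y = [1, 1, -1, -1]
print(filter_rank(X, y, k=1))  # [0]
```

Because each feature is scored on its own, this filter is fast but blind to feature dependencies, which is the trade-off described for univariate filters.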
    
### AROM methods

### L1-AROM

The acronym AROM stands for Approximation of the zeRO-norm Minimization.

The problem is to obtain a linear predictor that uses as few independent variables (features) as possible without loss of accuracy.

Given $m$ examples $x_i \in \mathbb{R}^n$ and the corresponding class labels $y_i \in \{\pm 1\}$ with $i = 1, \dots, m$, a linear model $g(x)$ predicts the class of any point $x \in \mathbb{R}^n$ as follows:

$$g(x) = \mathrm{sign}(w \cdot x + b) \qquad (1)$$

Feature selection is closely related to a specific form of regularization of this decision function to enforce sparsity of the weight vector $w$. Weston et al. [25] study in particular the zero-norm minimization subject to linear margin constraints:

$$\min_{w} \|w\|_0 \quad \text{subject to} \quad y_i(w \cdot x_i + b) \geq 1 \qquad (2)$$

where $\|w\|_0 = \mathrm{card}\{w_j \mid w_j \neq 0\}$ and card is the set cardinality. Since problem (2) is NP-hard, a log l1-norm minimization is proposed instead:

$$\min_{w} \sum_{j=1}^{n} \ln(|w_j| + \epsilon) \quad \text{subject to} \quad y_i(w \cdot x_i + b) \geq 1 \qquad (3)$$

where $0 < \epsilon \ll 1$ is added to smooth the objective when some $|w_j|$ vanishes. The natural logarithm in the objective facilitates parameter estimation with a simple gradient descent procedure. The resulting algorithm, l1-AROM, iteratively optimizes the l1-norm of $w$ with rescaled inputs.

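
To make the objectives concrete, the following sketch evaluates the linear decision rule, the zero-norm, and the smoothed log surrogate on two weight vectors that both satisfy the margin constraints on a toy dataset. The helper names (`predict`, `zero_norm`, `log_l1`, `is_feasible`) and the data are assumptions for illustration, and the code only evaluates the objectives; it does not solve the optimization problem.

```python
from math import log


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def predict(w, b, x):
    """Linear decision rule g(x) = sign(w . x + b); a score of 0 is mapped to +1."""
    return 1 if dot(w, x) + b >= 0 else -1


def zero_norm(w):
    """Number of non-zero weights, i.e. the number of features actually used."""
    return sum(1 for wj in w if wj != 0)


def log_l1(w, eps=1e-6):
    """Smoothed surrogate: sum over j of ln(|w_j| + eps)."""
    return sum(log(abs(wj) + eps) for wj in w)


def is_feasible(w, b, X, y):
    """Check the margin constraints y_i (w . x_i + b) >= 1 for every example."""
    return all(yi * (dot(w, xi) + b) >= 1 for xi, yi in zip(X, y))


# Toy data with two redundant features: either one is enough to separate the classes.
X = [[2.0, 2.0], [-2.0, -2.0]]
y = [1, -1]

w_dense = [0.25, 0.25]   # uses both features
w_sparse = [0.5, 0.0]    # uses only the first feature

for w in (w_dense, w_sparse):
    print(is_feasible(w, 0.0, X, y), zero_norm(w), log_l1(w))
```

Both vectors are feasible, but the sparse one has the smaller zero-norm and also a much smaller log surrogate value, because the term ln(0 + eps) is strongly negative. This is the sense in which minimizing the log objective pushes weights towards zero.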
**Goal**

Classification of microarray data: few tens of samples but several thousand dimensions (genes).

**Key differential strategy**

The key idea is partial supervision on the dimensions of a feature selection procedure. For instance, in the case of microarray data classification, a molecular biologist may know or guess that some genes are likely to be more discriminant.

The technique uses this prior knowledge to guide feature selection, while remaining flexible enough to let the final selection depart from it if necessary to optimize the classification objective.
 
Support vector machines (SVMs) are particularly convenient to classify high dimensional data with only a few samples. In their simplest form, SVMs simply reduce to maximal margin hyperplanes in the input space.
Feature selection based on SVMs is a type of embedded method.

A good set of features is ideally highly stable with respect to sampling variation. In the context of biomarker selection from microarray data, high stability means that different sub-samples of patients lead to very similar sets of biomarkers. This is motivated by the assumption that the biological process explaining the outcome is common among different patients.
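
One simple way to quantify this notion of stability is the average pairwise overlap between the feature sets selected on different sub-samples. The sketch below uses the Jaccard index as the overlap measure; this choice, the function names, and the gene labels are assumptions for illustration and are not taken from the text above.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard index |A and B| / |A or B| between two collections of feature names."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def selection_stability(selections):
    """Average pairwise Jaccard index over all pairs of selected feature sets."""
    pairs = list(combinations(selections, 2))
    if not pairs:
        raise ValueError("need at least two selections to measure stability")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Hypothetical gene lists selected on three different sub-samples of patients.
runs = [["g1", "g2", "g3"],
        ["g1", "g2", "g4"],
        ["g1", "g2", "g3"]]
stability = selection_stability(runs)
print(stability)
```

A value of 1 means every sub-sample produced exactly the same list, while values near 0 mean the lists barely overlap.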

**Method**

The authors propose a new feature selection method based on regularized linear models. This approach makes use of a partial supervision on the features a priori assumed to be more relevant. The method naturally extends the AROM methods due to Weston et al. (2003). Several experiments on microarray data sets show that the partial supervision largely improves the stability of the selected gene lists with respect to variation in data sampling.


### L2-AROM

The l2-AROM method further approximates this optimization by replacing the l1-norm by the l2-norm. Even though such an approximation may result in a less sparse solution, it is very efficient in practice when m ≪ n. Indeed, a dual formulation may be used and the final algorithm boils down to a linear SVM estimation with iterative rescaling of the inputs. A standard SVM solver can be iteratively called on properly rescaled inputs. A smooth feature selection occurs during this iterative process since the weight coefficients along some dimensions progressively drop below the machine precision while other dimensions become more significant. A final ranking on the absolute values of each dimension can be used to obtain a fixed number of features.
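
The iterative rescaling described above can be sketched as a short loop around a pluggable solver. In this sketch, `solve` is a hypothetical callable standing in for the linear SVM solver, and the multiplicative update `w = w * w'` follows the usual description of AROM rather than anything stated explicitly in these notes. The stand-in solver used in the example (`mean_diff_solver`) is not an SVM; it only exists so that the loop runs without external libraries.

```python
def arom_rescaling(X, y, solve, n_iter=10):
    """Iteratively rescale the inputs by the current weights and re-solve.

    solve(X_scaled, y) must return one weight per feature for the rescaled
    problem (for l2-AROM this would be a linear SVM solver; assumption).
    """
    w = [1.0] * len(X[0])  # initial weight vector of ones
    for _ in range(n_iter):
        # Rescale each input component-wise by the current weights.
        X_scaled = [[xj * wj for xj, wj in zip(row, w)] for row in X]
        w_prime = solve(X_scaled, y)
        # Multiplicative update: weights of weak dimensions shrink towards zero.
        w = [wj * wpj for wj, wpj in zip(w, w_prime)]
    return w


def top_k(w, k):
    """Final ranking: indices of the k weights with the largest absolute value."""
    return sorted(range(len(w)), key=lambda j: abs(w[j]), reverse=True)[:k]


def mean_diff_solver(X_scaled, y):
    """Stand-in solver (NOT an SVM): class-mean difference scaled to max |w'| = 1."""
    n = len(X_scaled[0])
    pos = [row for row, label in zip(X_scaled, y) if label == 1]
    neg = [row for row, label in zip(X_scaled, y) if label == -1]
    diff = [sum(r[j] for r in pos) / len(pos) - sum(r[j] for r in neg) / len(neg)
            for j in range(n)]
    scale = max(abs(d) for d in diff) or 1.0
    return [d / scale for d in diff]


# Feature 0 separates the classes strongly, feature 1 only weakly.
X = [[2.0, 0.5], [-2.0, -0.5]]
y = [1, -1]
w = arom_rescaling(X, y, mean_diff_solver, n_iter=3)
print(w, top_k(w, 1))
```

After a few iterations the weight of the weak feature has collapsed towards zero while the strong feature keeps its weight, which mirrors the smooth feature selection described above. With a real linear SVM in place of the stand-in, the same loop and the final `top_k` ranking would apply.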

**Prior knowledge**

Whenever some knowledge on the relative importance of each feature is available (either from actual prior knowledge or from a related dataset), the l1-AROM objective can be modified by adding a prior relevance vector $\beta = [\beta_1, \dots, \beta_n]^t$ defined over the input dimensions. Let $\beta_j > 0$ denote the relative prior relevance of the $j$th feature, the higher its value the more relevant the corresponding feature is a priori assumed. In practice, only a few dimensions can be assumed more relevant (e.g. $\beta_j > 1$) while the vast majority of remaining dimensions are not favored (e.g. $\beta_j = 1$). Section 5 further discusses the practical definition of $\beta$. In contrast with semi-supervised learning, this is a form of partial supervision (PS) on the relevant dimensions rather than the labels.

The optimization problem of PS-l1-AROM is defined to penalize less the dimensions which are assumed a priori more relevant:

$$\min_{w} \sum_{j=1}^{n} \frac{1}{\beta_j} \ln(|w_j| + \epsilon) \quad \text{subject to} \quad y_i(w \cdot x_i + b) \geq 1 \qquad (4)$$

It was recently shown how problem (4) can be reformulated as an iterated l1-norm optimization with margin constraints on rescaled inputs [26]:
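
$$\min_{w'} \sum_{j=1}^{n} |w'_j| \quad \text{subject to} \quad y_i(w' \cdot (x_i \ast w_k \ast \beta) + b) \geq 1 \qquad (5)$$

where $\ast$ denotes the component-wise product and the initial weight vector is defined as $w_0 = [1, \dots, 1]^t$. At iteration $k + 1$, problem (5) is solved given the previous weight vector $w_k$ and the fixed relevance vector $\beta$, and the process is iterated until convergence.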
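
The step from problem (4) to an iterated l1-norm problem on rescaled inputs is not spelled out in these notes. One standard way to see it, sketched here as an assumption rather than quoted from the paper, is to linearize the concave log objective around the current iterate $w_k$:

$$\sum_{j=1}^{n} \frac{1}{\beta_j} \ln(|w_j| + \epsilon) \;\approx\; \text{const} + \sum_{j=1}^{n} \frac{|w_j|}{\beta_j \, (|w_{k,j}| + \epsilon)}$$

so each iteration solves a weighted l1-norm problem under the same margin constraints. Neglecting $\epsilon$ and applying the change of variables $w_j = w'_j \, \beta_j \, w_{k,j}$ (for $w_{k,j} \neq 0$) gives

$$\frac{|w_j|}{\beta_j \, |w_{k,j}|} = |w'_j|, \qquad w \cdot x_i = \sum_{j=1}^{n} w'_j \, \beta_j \, w_{k,j} \, x_{i,j} = w' \cdot (x_i \ast w_k \ast \beta)$$

The weighted problem in $w$ thus becomes an unweighted l1-norm problem in $w'$ on inputs rescaled by $w_k \ast \beta$, and under this change of variables the next iterate is $w_{k+1} = w_k \ast w' \ast \beta$. A larger $\beta_j$ inflates the $j$th rescaled input, so the same contribution to the margin costs a smaller $|w'_j|$. This is how the dimensions assumed more relevant end up being penalized less.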





## Project planning


# Outputs

```bash
%%bash
jupyter nbconvert --to=latex --template=~/report.tplx feature_selection_linear_models.ipynb 1> /dev/null
pdflatex -shell-escape feature_selection_linear_models 1> /dev/null
jupyter nbconvert --to html_with_toclenvs feature_selection_linear_models.ipynb 1> /dev/null
```

```
[NbConvertApp] Converting notebook feature_selection_linear_models.ipynb to latex
[NbConvertApp] Writing 34075 bytes to feature_selection_linear_models.tex
```
