# Partially Supervised Feature Selection with Regularized Linear Models


## Feature selection methods overview

This item is based con the first paper.

**Goals of feature selection**

Scenarios related to few tens of samples but thousands dimensions: microarray data, 

1. To avoid overfiting and improve model performance, prediction performance in the case of supervised classification and better cluster detection in unsupervised scenarios.

2. To provide more efficient models

3. To gain a deeper insight into the underlying processes that generated the data. The excess of dimensionality difficult the understanding.

The problem is related to find the optimal model parameters for the optimal feature subset. So, the model parameters becomes dependent of the features selected and need to be computed more or less coupled with the guessing of model parameters.

From less (zero) to more coupled computation, we have three strategies:

1. Filter techniques. Two step process, first the filtering, then the training of the model. Take into account only the properties of the data and in some cases a certain amount of prior knowledge. Therefore it's independent of the classification method. In its most simplest form ignores dependences on the data (univariate).

    Examples: Euclidean distance, i-test Information gain, Markov blanket filter

2. Wrapper methods. Once selected a candidate subset of features, the classification model is evaluated by training and testing the model. This is iterated over a ensemble of candidate subsets, and the model (with his feature subsets) selected is the model with the best accuracy. 
    
    It's very important to construct a good searching algorithm of subsets, in order to reduce the number of sets to model with. This methods are dependent of the classifier, model feature dependencies and have the risk to be bind to a local optima. With randomizing techniques this problem is bypassed to some extent. 
    
    Examples: Sequential forward selection (SFS) , Sequential backward elimination, Simulated annealing, Randomized hill climbing, Genetic algorithms.

3. Embedded methods. The search of the optimal subset of features is built into the classifier. Have the advantage that they include the interaction with the classification model, while at the same time being far less computationally intensive than wrapper methods.

    Examples: Decision trees Weighted naive Bayes, Feature selection using the weight vector of SVM, AROM
    
### AROM methods

The acronym derives from *Approximation of Minimization zeRO-norm*

The problem is obtain a linear predictor $h$, minimizing the number of independent variables (features) without loss of accuracy:

$$h(\mathbf{x}) = sign(\mathbf{w} \cdot \mathbf{x} + b)$$

for $n$ samples $x_i \in \mathbb{R}^n$ and $m$ labels $y_i \in \{\pm1\}$.

The accuracy constraint requires correspondence of sign 

$sign(y_i) \cdot sign(h_i) > 0$ or in other form $y_i \cdot h_i = 1$

or less restrictive, enabling $\mathbf{w}$ to scale freely $y_i \cdot h_i \ge 1$

so 

$$y_i(\mathbf{w} \cdot \mathbf{x} + b) \ge 1$$

The minimization is done with a norm defined over the vectorial space of $\mathbf{w}$. One approach is to minimize the zero-norm, that is, the number of components of the vector (number of non null $w_i$). But it's know to be a NP-Hard problem.

It's more adequate compute over a 1-norm or a 2-norm. In the second paper, the author deduce a suitable form for the function that could be minimized, taken into account the former constraint:

$$\displaystyle\sum_{j=1}^n ln(|w_j| + \epsilon)$$

The term $\epsilon$ is included to protect from zero values inside logarithm.

AROM methods are therefore feature selection embedded methods.

**l1-AROM** and **l2-AROM** (in this case by means of a 2-norm minimization) algorithms optimize this algorithm by iterative rescaling of inputs and doing a smooth feature selection since the weight coefficients along some dimensions progressively drop below the machine precision while other dimensions become more significant.

### AROM semi-supervised

Third and Fourth papers explore a improvement of these previous described methods.

**Goal**

Classification of microarray data: few tens of samples against several thousand dimensions (genes).

**Key differential strategy**

Extend AROM methods by means of partial supervision on the dimensions of a feature selection procedure. The technique proposes to use of prior knowledge to guide feature selection, but flexible enough to let the final selection depart from it if necessary to optimize the classification objective.

The preferential features are previously selected from similar datasets in large microarray databases because it's known that different sub-samples of patients lead to very similar sets of biomarkers, as expected if we are aware that the biological process explaining the outcome is common among different patients.

This datasets are called source datasets and we expect that the prediction for a similar feature vector is the same than the prediction for this vector in our dataset (the target).

*In third paper prior knowledge is incorporated by biological information*

So, if we have some knowledge on the relative importance of each feature (either from actual prior knowledge or from a related dataset), the supervised AROM objective can be modified by adding a prior relevance vector $\beta = [\beta_1,...,\beta_n]$  defined over the $n$ dimensions and where $\beta_j >0$ is the prior relevance of the $j$ feature.

So in this case, the function to minimize in the case of 1-norm is:

$$\displaystyle\sum_{j=1}^n \frac{1}{\beta_j} ln(|w_j| + \epsilon)$$


## L2-AROM
Describe how the provided implementation of L2-AROM works. See [2, 3, 4] for specific details. Next, implement a variable ranking approach based on the PS-L2-AROM method, as described in [4], using the provided implementation of L2-AROM.

You should introduce the possibility in the previous implementation to specify the initial value of of the scaling vector z. By default this vector should be equal to a vector with all components equal to one. By increasing or reducing these values, one should be able to favor, or make more difficult the selection of specific features. This will lead to the method PS-L2-AROM, in which some sort of prior-knowledge about the importance of each feature can be considered.

### Implementation

**SVM**

Suppose some given data points each belong to one of two classes, and the goal is to decide which class a new data point will be in. In the case of support-vector machines, a data point is viewed as a {\displaystyle p} p-dimensional vector (a list of {\displaystyle p} p numbers), and we want to know whether we can separate such points with a {\displaystyle (p-1)} (p-1)-dimensional hyperplane. This is called a linear classifier. There are many hyperplanes that might classify the data. One reasonable choice as the best hyperplane is the one that represents the largest separation, or margin, between the two classes. So **we choose the hyperplane so that the distance from it to the nearest data point on each side is maximized**. If such a hyperplane exists, it is known as the maximum-margin hyperplane and the linear classifier it defines is known as a maximum-margin classifier.

But often the target are not linearly separable in that space. For this reason,  the original finite-dimensional space be mapped into a much higher-dimensional space, making the separation viable in the new space. To keep the computational load reasonable, the mappings used by SVM schemes are designed to ensure that dot products of pairs of input data vectors may be computed easily in terms of the variables in the original space, by defining them in terms of a kernel function {\displaystyle k(x,y)} {\displaystyle k(x,y)} selected to suit the problem.[5] 

The hyperplanes in the higher-dimensional space are defined as the set of points whose dot product with a vector in that space is constant, where such a set of vector is an orthogonal (and thus minimal) set of vectors that defines a hyperplane. The vectors defining the hyperplanes can be chosen to be linear combinations with parameters {\displaystyle \alpha _{i}} \alpha _{i} of images of feature vectors {\displaystyle x_{i}} x_{i} that occur in the data base.[clarification needed] With this choice of a hyperplane, the points {\displaystyle x} x in the feature space that are mapped into the hyperplane are defined by the relation {\displaystyle \textstyle \sum _{i}\alpha _{i}k(x_{i},x)={\text{constant}}.} {\displaystyle \textstyle \sum _{i}\alpha _{i}k(x_{i},x)={\text{constant}}.} Note that if {\displaystyle k(x,y)} {\displaystyle k(x,y)} becomes small as {\displaystyle y} y grows further away from {\displaystyle x} x, each term in the sum measures the degree of closeness of the test point {\displaystyle x} x to the corresponding data base point {\displaystyle x_{i}} x_{i}. In this way, the sum of kernels above can be used to measure the relative nearness of each test point to the data points originating in one or the other of the sets to be discriminated. Note the fact that the set of points {\displaystyle x} x mapped into any hyperplane can be quite convoluted as a result, allowing much more complex discrimination between sets that are not convex at all in the original space.

**RFE**

Given an external estimator that assigns weights to features (e.g., the coefficients of a linear model), recursive feature elimination (RFE) is to select features by recursively considering smaller and smaller sets of features. First, the estimator is trained on the initial set of features and the importance of each feature is obtained either through a coef_ attribute or through a feature_importances_ attribute. Then, the least important features are pruned from current set of features.That procedure is recursively repeated on the pruned set until the desired number of features to select is eventually reached.

RFECV performs RFE in a cross-validation loop to find the optimal number of features.


**L2-AROM**

In the present work we rely on another embedded selection method with linear models, called l1-AROM [25]. This specific choice is motivated by the possibil- ity to extend this approach in a simple yet efficient way to perform transfer learning by biasing the optimization procedure towards certain dimensions. We proposed recently such a partially supervised (PS) extension [26] but the favored dimensions were then defined from prior knowledge. In the context of microarray data, molecular biologists may indeed sometimes guess that a few genes should be considered a priori more relevant. In the present work, we do not use such prior knowledge but rather related datasets, hence performing inductive trans- fer learning at the feature level. The additional benefits are a fully automated feature selection procedure and the possibility to choose the number of features to be transferred independently of some expert knowledge. A practical approx- imation of this technique reduces to learn linear SVMs with iterative rescaling of the inputs. The rescaling factors depend here on previously selected features from existing datasets.

At step k = 0, initialize wk = β Iterate until convergence:
1 minw ||w||2
subject to: yi (w · (xi ∗ wk ) + b) ≥ 1
2 Let (w ̄ ) be the solution, set wk+1 ← wk ∗w ̄ ∗β
 
**SVC**
C-Support Vector Classification.
The implementation is based on libsvm. The fit time complexity is more than quadratic with the number of samples which makes it hard to scale to dataset with more than a couple of 10000 samples.
The multiclass support is handled according to a one-vs-one scheme.
For details on the precise mathematical formulation of the provided kernel functions and how gamma, coef0 and degree affect each other, see the corresponding section in the narrative documentation: Kernel functions.


A support vector machine constructs a hyper-plane or set of hyper-planes in a high or infinite dimensional space, which can be used for classification, regression or other tasks. Intuitively, a good separation is achieved by the hyper-plane that has the largest distance to the nearest training data points of any class (so-called functional margin), since in general the larger the margin the lower the generalization error of the classifier.



The l2-AROM method further approximates this optimization by replacing the l1-norm by the l2-norm. Even though such an approximation may result in a less sparse solution, it is very efficient in practice when m ≪ n. Indeed, a dual formulation may be used and the final algorithm boils down to a linear SVM estimation with iterative rescaling of the inputs. 

**A standard SVM solver can be iteratively called on properly rescaled inputs. A smooth feature selection occurs during this iterative process since the weight coefficients along some dimensions progressively drop below the machine precision while other dimensions become more significant. A final ranking on the absolute values of each dimension can be used to obtain a fixed number of features.**


**T-test**

Assuming it to be a binary classification problem, where each sample can be classified either into class C1 or class C2, t-Statistics helps us to evaluate that whether the values of a particular feature for class C1 is significantly different from values of same feature for class C2. If this holds, then the feature can helps us to better differentiate our data.

e.g. Does the salary of a person impact his chances to get a loan ? Here we will calculate mean and variance of the following observations separately :

Salaries of individuals when the loan was approved
Salaries of individuals when the loan was not approved
and then we will use t-statistics to check whether these two samples are significantly different or not.

t- Statistics is computed using:


where 𝑢𝑖𝑗 denotes mean of ith feature 𝑋𝑖 for class 𝐶𝑗and 𝑠𝑖𝑔𝑚𝑎𝑖𝑗denotes Standard Deviation of ith feature 𝑋𝑖 for class 𝐶𝑗 . The class index is denoted by j i.e. j =1 or j=2.

After calculating the values of t-Statistic for each feature, we sort these values in descending order in order to select the important the feature.


In [114]:
import numpy as np
import pandas as pd
#from sklearn import preprocessing
from sklearn.model_selection import train_test_split
#from sklearn.metrics import confusion_matrix
from sklearn.svm import SVC
#from sklearn.feature_selection import RFE

import warnings
warnings.filterwarnings("ignore")

##
# X is numpy array witht the data (rows are data instances)
# Y is a numpy vector with the class labels (-1 or 1)
# C is the regularization coefficient of the SVM
# threshold is the threshold value to drop features in L2AROM

# 1. At step k = 0, initialize z = (1, ..., 1) 
# 2. Iterate until convergence:
    # 1.1 minw ||w||2 Subject to: yi (w · (xi ∗ z ) + b) ≥ 1 
    # 1.2 Let (w) be the solution, set w_new ← z ∗ w

#
#Relevance vector β
#Prior relevance of feature j encoded in βj .
#The more (a priori) relevant feature j, the higher βj. If no information on j, βj = 1.

def variable_ranking(X, Y, C = 1, b=None, threshold=1e-10, feature_len=10):
    """
    """

    # Copy X to modify it later

    final_X = X.copy()
    print("First final_X", final_X)
    # Initialice w_k = (1,....,1)
    # 0. At step k = 0, initialize z = (1, ..., 1) /b
    z = b.copy()
    print("Z", z)
    # Number of attributes

    length = z.shape[0]
    print("#Features", length)

    # Array that stores the elimination order, being the higher number the first attribute 
    # that is eliminated and 1 the last one

    elimination_order = np.zeros(length, dtype = int)
    print("#Elimination order", elimination_order)
    original_feature_indices = np.arange(0, length, dtype = int)
    print("#Init feature indices", original_feature_indices)
    clf = SVC(kernel = "linear", C = C, random_state = 0)
    print(clf)
    iter_without_dropping = 0
    n_removed_features = 0
    
    # 2. Iterate until convergence:
    while iter_without_dropping < 20 and length > feature_len:

        # Fit the SVC and compute z
        print("ones", np.ones(X.shape[ 0 ]))
        # xi ∗ z
        print("outer", np.outer(np.ones(X.shape[ 0 ]), z)) 
        print("final_X", final_X * np.outer(np.ones(X.shape[ 0 ]), z))
        # 2.1 minw ||w||2 Subject to: yi (w · (xi ∗ z ) + b) ≥ 1 
        clf.fit(final_X * np.outer(np.ones(X.shape[ 0 ]), z), Y)
        # w = coef_
        print("Coefs", clf.coef_)
        #2.2 Let (w=coef_) be the solution, set z_new ← z ∗ w * b
        z *= np.abs(clf.coef_[0])*b # In absolute value
#         print("------Z", z)
#         print("------B", b)
#         print("------ZB", z*b)
        #clf devuelve los coeficientes w y con ellos escalamos los w
        #aprovechamos para calzarnos los z con un coeficiente pequeño
        n_features_to_drop = np.sum(z < threshold)
        
        if n_features_to_drop == 0:
            iter_without_dropping += 1
        else:
            iter_without_dropping = 0
            print("@@Z to remove", z, z[ z < threshold ])
            remove_order = np.argsort(z[ z < threshold ])
            print("@@@Threshold", z < threshold)
            print("@@@Remove order", remove_order)
            print("@@@Elimination order", original_feature_indices[ z < threshold ])
            print("@@@Elimination order 2", original_feature_indices[ z < threshold ][ remove_order ])
            elimination_order[ original_feature_indices[ z < threshold ][ remove_order ] ] = \
                np.arange(0, n_features_to_drop) + n_removed_features + 1
            print("@@@Elimination order", original_feature_indices[ z < threshold ])
            print("@@@Elimination order 2", original_feature_indices[ z < threshold ][ remove_order ])
            print("@@@@elimination_order 3", elimination_order)
            n_removed_features += n_features_to_drop
            length -= n_features_to_drop
        
            # Delete from X, z and original_features the selected attributes 

            final_X = final_X[ :, z >= threshold ]
            original_feature_indices = original_feature_indices[ z >= threshold ]
            b = b[ z >= threshold ]
            z = z[ z >= threshold ]

    # We remove all remaining features

    if length > 0:
            remove_order = np.argsort(z)
            elimination_order[ original_feature_indices[ remove_order ] ] = \
                np.arange(0, length) + n_removed_features + 1
    print(elimination_order)
    print(np.argsort(elimination_order))
    #ranking of features for more to less significance
    return np.argsort(-elimination_order)  # So array starts at 0 (python indexing)


In [64]:
import sys
NROWS = sys.maxsize
#NROWS = 10

df_chandran = pd.read_csv('./data/chandran.csv', sep=',', 
                     header=0, nrows = NROWS)
display(df_chandran.head())


Unnamed: 0,X100_g_at,X1000_at,X1001_at,X1002_f_at,X1003_s_at,X1004_at,X1005_at,X1006_at,X1007_s_at,X1008_f_at,...,AFFX.ThrX.5_at,AFFX.ThrX.M_at,AFFX.TrpnX.3_at,AFFX.TrpnX.5_at,AFFX.TrpnX.M_at,AFFX.YEL002c.WBP1_at,AFFX.YEL018w._at,AFFX.YEL021w.URA3_at,AFFX.YEL024w.RIP1_at,Y
0,7.234793,6.494211,4.853264,3.527822,5.575283,5.630715,7.070994,3.586507,8.607721,8.37615,...,4.124928,3.130851,2.983105,3.286748,3.632831,3.200749,3.157482,3.57255,3.201209,1
1,6.967237,6.632175,4.32049,3.53503,5.50527,5.173343,7.826527,3.470474,6.871599,8.732676,...,4.089809,3.030838,2.710369,3.204168,3.721313,3.080551,2.90875,2.980353,3.264706,1
2,7.026961,6.510959,4.267634,3.387379,5.906008,5.321219,7.857653,3.292397,7.521978,8.636165,...,3.693827,2.755653,2.526112,3.25425,3.362329,2.862432,3.0482,3.247433,3.06189,1
3,7.123875,6.1559,4.114608,3.380995,5.891499,5.602339,8.285221,3.636381,8.148127,8.472201,...,4.345752,3.122182,2.65612,3.530544,3.515947,3.026449,3.231532,3.762868,3.354885,1
4,7.182206,6.237578,4.194653,3.380361,5.511587,5.383889,8.941296,3.331588,8.257033,8.700136,...,4.01699,2.956002,2.622684,3.263552,3.606437,3.035578,2.938062,3.156967,3.055146,1


In [330]:
import sys
NROWS = sys.maxsize
#NROWS = 10
SELECT_SAMPLES = 4
SELECT_FEATURES_INI = 100
SELECT_FEATURES_FIN = 110
TEST_SIZE = 0.10
PATH_DATA = './data'
RANDOM_STATE = 0

def load_df(file):
    """
    Load sample files
    """
    df = pd.read_csv('./data' + '/' + file, sep=',', header=0, nrows = NROWS)
    return df

def normalize_feature_names(features):
    """
    Normalize the names of the features in order to select the common features
    """
    features_new = []
    for idx, feature in enumerate(features):
        feature_new = feature.replace('/', '@').replace('-', '@').replace('_', '@').replace('.', '@')
        if feature[0] == "X":
            features_new.append(feature_new[1:])
        else:
            features_new.append(feature_new)
    return features_new

def intersect_features(df_samples):
    """
    """
    features_0 = df_samples[0].columns
    features_1 = df_samples[1].columns
    features_2 = df_samples[2].columns
    norm_features_0 = np.array(normalize_feature_names(features_0))
    norm_features_1 = np.array(normalize_feature_names(features_1))
    norm_features_2 = np.array(normalize_feature_names(features_2))
    intersect_0 = np.array([], dtype=int)
    intersect_1 = np.array([], dtype=int)
    intersect_2 = np.array([], dtype=int)
    for idx_0, feature_0 in enumerate(norm_features_0):
        idx_1 = np.where(norm_features_1 == feature_0)
        idx_2 = np.where(norm_features_2 == feature_0)
        if idx_1[0] and idx_2[0]:
            intersect_0 = np.append(intersect_0, idx_0)
            intersect_1 = np.append(intersect_1, idx_1[0])
            intersect_2 = np.append(intersect_2, idx_2[0])
        else:
            print("UnMatch", idx_0, feature_0)
    print(intersect_0.shape)
    print(intersect_0)
    print(intersect_1)
    print(intersect_2)
    df_samples_norm = []
    df_samples_norm.append(df_samples[0].iloc[:, intersect_0])
    df_samples_norm.append(df_samples[1].iloc[:, intersect_1])
    df_samples_norm.append(df_samples[2].iloc[:, intersect_2])
    return df_samples_norm

def split_data(df):
    """
    Split data
    """
    labels = np.concatenate([df_chandran.iloc[:SELECT_SAMPLES, -1], 
                             df_chandran.iloc[-SELECT_SAMPLES:, -1]], axis = 0)

    features = np.concatenate([df_chandran.iloc[:SELECT_SAMPLES, SELECT_FEATURES_INI:SELECT_FEATURES_FIN], 
                               df_chandran.iloc[-SELECT_SAMPLES:, SELECT_FEATURES_INI:SELECT_FEATURES_FIN]], axis = 0)


    X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=TEST_SIZE, random_state=RANDOM_STATE)

#     print(X_train)
#     print(y_train)
    print(X_train.shape, y_train.shape)
    return X_train, X_test, y_train, y_test

df_samples = []
for file in ['chandran.csv','singh.csv', 'welsh.csv']:
    df_samples.append(load_df(file))

df_samples_norm = intersect_features(df_samples)

# Standardize the 0 label as -1 in dataset 1
mask = df_samples_norm[1]["Y"] == 0
df_samples_norm[1].loc[mask, "Y"] = -1

for i in range(3):
    display(df_samples_norm[i].head()) 
    
#     X_train = []
#     X_test = []
#     y_train = [] 
#     y_test = [] 
#     idx = 0
#     X_tr, X_t, y_tr, y_t = split_data(df_samples[idx])
#     X_train.append(X_tr)
#     X_test.append(X_t)
#     y_train.append(y_tr)
#     y_test.append(y_t)
#     features = df_samples[idx].columns[SELECT_FEATURES_INI:SELECT_FEATURES_FIN].values
#     print(features)
#     idx += 1

UnMatch 0 100@g@at
UnMatch 651 160020@at
UnMatch 652 160021@r@at
UnMatch 653 160022@at
UnMatch 654 160023@at
UnMatch 655 160024@at
UnMatch 656 160025@at
UnMatch 657 160026@at
UnMatch 658 160027@s@at
UnMatch 659 160028@s@at
UnMatch 660 160029@at
UnMatch 661 160030@at
UnMatch 662 160031@at
UnMatch 663 160032@at
UnMatch 664 160033@s@at
UnMatch 665 160034@s@at
UnMatch 666 160035@at
UnMatch 667 160036@at
UnMatch 668 160037@at
UnMatch 669 160038@s@at
UnMatch 670 160039@at
UnMatch 671 160040@at
UnMatch 672 160041@at
UnMatch 673 160042@s@at
UnMatch 674 160043@at
UnMatch 675 160044@g@at
UnMatch 12610 AFFX@MurIL2@at
(12599,)
[    1     2     3 ... 12623 12624 12625]
[    1     2     3 ... 12623 12624 12625]
[11737 11738 11739 ...    66    65 12626]


Unnamed: 0,X1000_at,X1001_at,X1002_f_at,X1003_s_at,X1004_at,X1005_at,X1006_at,X1007_s_at,X1008_f_at,X1009_at,...,AFFX.ThrX.5_at,AFFX.ThrX.M_at,AFFX.TrpnX.3_at,AFFX.TrpnX.5_at,AFFX.TrpnX.M_at,AFFX.YEL002c.WBP1_at,AFFX.YEL018w._at,AFFX.YEL021w.URA3_at,AFFX.YEL024w.RIP1_at,Y
0,6.494211,4.853264,3.527822,5.575283,5.630715,7.070994,3.586507,8.607721,8.37615,8.000115,...,4.124928,3.130851,2.983105,3.286748,3.632831,3.200749,3.157482,3.57255,3.201209,1
1,6.632175,4.32049,3.53503,5.50527,5.173343,7.826527,3.470474,6.871599,8.732676,7.820364,...,4.089809,3.030838,2.710369,3.204168,3.721313,3.080551,2.90875,2.980353,3.264706,1
2,6.510959,4.267634,3.387379,5.906008,5.321219,7.857653,3.292397,7.521978,8.636165,7.880739,...,3.693827,2.755653,2.526112,3.25425,3.362329,2.862432,3.0482,3.247433,3.06189,1
3,6.1559,4.114608,3.380995,5.891499,5.602339,8.285221,3.636381,8.148127,8.472201,7.894054,...,4.345752,3.122182,2.65612,3.530544,3.515947,3.026449,3.231532,3.762868,3.354885,1
4,6.237578,4.194653,3.380361,5.511587,5.383889,8.941296,3.331588,8.257033,8.700136,8.103976,...,4.01699,2.956002,2.622684,3.263552,3.606437,3.035578,2.938062,3.156967,3.055146,1


Unnamed: 0,1000_at,1001_at,1002_f_at,1003_s_at,1004_at,1005_at,1006_at,1007_s_at,1008_f_at,1009_at,...,AFFX-ThrX-5_at,AFFX-ThrX-M_at,AFFX-TrpnX-3_at,AFFX-TrpnX-5_at,AFFX-TrpnX-M_at,AFFX-YEL002c/WBP1_at,AFFX-YEL018w/_at,AFFX-YEL021w/URA3_at,AFFX-YEL024w/RIP1_at,Y
0,7.391657,3.812922,3.453385,6.070151,5.527153,5.812353,3.167275,7.354981,9.419909,7.697655,...,3.770583,2.884436,2.730025,3.126168,2.870161,3.08221,2.747289,3.226588,3.480196,-1
1,7.32905,3.958028,3.407226,5.921265,5.376464,7.303408,3.108708,7.391872,10.539579,8.544981,...,3.190759,2.460119,2.696578,2.675271,2.940032,3.126269,3.013745,3.517859,3.428752,1
2,7.664007,3.783702,3.152019,5.452293,5.111794,7.207638,3.07736,7.488371,6.833428,8.448252,...,3.325183,2.603014,2.469759,2.615746,2.510172,2.730814,2.613696,2.823436,3.049716,-1
3,7.469634,4.004581,3.34117,6.070925,5.296108,8.744059,3.117104,7.203028,10.400557,7.185107,...,3.625057,2.765521,2.681757,3.310741,3.197177,3.414182,3.193867,3.353537,3.567482,-1
4,7.322408,4.242724,3.489324,6.141657,5.62839,6.82537,3.794904,7.403024,10.240322,7.163157,...,3.698067,3.026876,2.69167,3.23603,3.003906,3.081497,2.963307,3.47205,3.598103,1


Unnamed: 0,X1000_at,X1001_at,X1002_f_at,X1003_s_at,X1004_at,X1005_at,X1006_at,X1007_s_at,X1008_f_at,X1009_at,...,AFFX.ThrX.5_at,AFFX.ThrX.M_at,AFFX.TrpnX.3_at,AFFX.TrpnX.5_at,AFFX.TrpnX.M_at,AFFX.YEL002c.WBP1_at,AFFX.YEL018w._at,AFFX.YEL021w.URA3_at,AFFX.YEL024w.RIP1_at,Y
0,269,46,68,-11,-67,2059,43,1425,1073,2317,...,873,613,-2,41,-26,5,-15,75,16,1
1,245,45,18,-15,44,1885,25,1313,1521,2029,...,908,700,-8,13,-26,14,-9,137,5,1
2,310,24,93,-30,-8,652,-52,1648,1561,1287,...,1450,1102,9,27,-7,34,-4,67,16,1
3,328,13,41,16,38,2536,-35,1006,1307,2921,...,853,604,-4,8,-35,-4,-15,70,18,1
4,359,40,13,-1,42,3845,-18,1009,1156,1924,...,931,664,-3,21,-16,5,1,43,12,1


### Ranking model

In [117]:
##
# X is numpy array witht the data (rows are data instances)
# Y is a numpy vector with the class labels (-1 or 1)
# C is the regularization coefficient of the SVM
# threshold is the threshold value to drop features in L2AROM

print("Shape", X_train.shape)
# b = np.ones(X_train.shape[1])
b = np.array([1, 1, 1, 1, 1, 10, 1, 1, 1, 1], dtype=float)
variable_ranking(X_train, y_train, C = 1, b=b, threshold=1e-10, feature_len=5)

Shape (5, 10)
First final_X [[3.04267208 7.19191304 7.82334842 2.51988026 4.43116838 2.28238102
  6.20866444 3.97344996 2.58739816 3.38964881]
 [2.95738953 7.14597925 8.02878747 3.09612524 4.88298697 2.4350044
  5.69849878 4.77639802 2.7831269  5.01608014]
 [2.96428904 7.18547986 7.83712188 3.21338741 4.88506596 2.41277747
  5.65483307 4.73698059 2.75456921 3.76447161]
 [3.43724992 7.60105699 7.93612435 2.9981392  5.01502853 2.5739303
  5.82747709 4.93601557 2.86524365 4.73670692]
 [3.19159297 7.44015753 8.01133704 2.9925223  5.12199766 2.2825365
  5.78955862 5.44155007 2.69530392 3.59354046]]
Z [ 1.  1.  1.  1.  1. 10.  1.  1.  1.  1.]
#Features 10
#Elimination order [0 0 0 0 0 0 0 0 0 0]
#Init feature indices [0 1 2 3 4 5 6 7 8 9]
SVC(C=1, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
  kernel='linear', max_iter=-1, probability=False, random_state=0,
  shrinking=True, tol=0.001, verbose=False)
ones [1. 1. 1. 1. 1.]
o

array([0, 5, 1, 9, 3, 6, 7, 4, 8, 2])

In [22]:
%%bash
ls data
wc -l data/chandran.csv
#head -n 1 data/chandran.csv
#array([0, 9, 1, 3, 6, 7, 4, 8, 2, 5])

chandran.csv
singh.csv
welsh.csv
     105 data/chandran.csv


In [180]:
# We need to nake this with t-test
from sklearn.feature_selection import GenericUnivariateSelect, chi2
from sklearn.preprocessing import StandardScaler, Normalizer, MinMaxScaler, MaxAbsScaler
from scipy import stats

def ttest(X, y):
    """
    Score statistic function for transformer GenericUnivariate
    """
    t, p = stats.ttest_ind(X[y==1] , X[y==-1])
    print('t',t, 'p', p)
    return t,p

#X, y = load_breast_cancer(return_X_y=True)
def select_features(X, y, num_features=3):
    """
    Select the features with the 
    """
    print(X.shape)
    transformer = GenericUnivariateSelect(ttest, 'k_best', param=num_features)
    X_new = transformer.fit_transform(X, y)
    print(X_new.shape)
    feature_indexes = SELECT_FEATURES_INI + np.argsort(transformer.scores_)[::-1][0:num_features]
    return feature_indexes

for i in range(3):
    display(df_samples_norm[i].head()) 

# First version
#     X_train = []
#     X_test = []
#     y_train = [] 
#     y_test = [] 
#     idx = 0
#     X_tr, X_t, y_tr, y_t = split_data(df_samples[idx])
#     X_train.append(X_tr)
#     X_test.append(X_t)
#     y_train.append(y_tr)
#     y_test.append(y_t)
#     features = df_samples[idx].columns[SELECT_FEATURES_INI:SELECT_FEATURES_FIN].values
#     print(features)
#     idx += 1

X_train_scaled = StandardScaler().fit_transform(X_train)
print(X_train_scaled)
indexes = select_features(X_train_scaled, y_train, num_features=5)
print(indexes)

features = df_chandran.columns[SELECT_FEATURES_INI:SELECT_FEATURES_FIN].values
print(features)
best_features = df_chandran.columns[indexes].values
print(best_features)

[[-0.42147589 -0.68077729 -1.21921664 -1.88058457 -1.85117099 -1.05603962
   1.89860388 -1.6935514  -1.61454494 -1.09149417]
 [-0.89463816 -0.93920353  1.18930025  0.55941249  0.06680582  0.34616531
  -0.69917435  0.00745521  0.49600344  1.40729302]
 [-0.8563585  -0.71697077 -1.05773997  1.05593632  0.07563114  0.14195869
  -0.92152144 -0.07604871  0.18806506 -0.51563069]
 [ 1.76770977  1.62109074  0.10294062  0.14450971  0.62732422  1.62252674
  -0.04241309  0.34559726  1.3814706   0.97807463]
 [ 0.40476279  0.71586084  0.98471574  0.12072605  1.08140982 -1.05461113
  -0.23549501  1.41654763 -0.45099415 -0.77824279]]
(5, 10)
t [-1.77057815 -1.5894688   0.09316163  1.51944707  0.10088885  0.35222012
 -1.52838138 -0.04852196  0.50375244  0.6769448 ] p [0.17476488 0.2101723  0.93164814 0.22596742 0.92600332 0.74796361
 0.22388075 0.96434991 0.6491003  0.54696565]
(5, 5)
[103 109 108 105 104]
['X1090_f_at' 'X1091_at' 'X1092_at' 'X1093_at' 'X1094_g_at' 'X1095_s_at'
 'X1096_g_at' 'X1097_s_a

In [145]:
print(y_train)
print( [ y_train==-1])


[-1  1  1 -1 -1]
[array([ True, False, False,  True,  True])]


In [171]:
# T-test

## Import the packages
import numpy as np
from scipy import stats


## Define 2 random distributions
#Sample Size
N = 10
np.random.seed(0)
#Gaussian distributed data with mean = 2 and var = 1
a = np.random.randn(N) + 2
#Gaussian distributed data with with mean = 0 and var = 1
b = np.random.randn(N)


t2, p2 = stats.ttest_ind(X_train, y_train)
print("t = " + str(t2))
print("p = " + str(p2))

t2, p2 = stats.ttest_ind(X_train[y_train==1] , X_train[y_train==-1])
print("t = " + str(t2))
print("p = " + str(p2))

print(X_train.shape)
print(X_train)
print(y_train)
a = X_train[y_train==1].T
b = X_train[y_train==-1].T

print(a[0])
print(b[0])
# print(y_train)
# print(X_train[y_train==1])

# print()
# print(X_train[y_train==-1])

t2, p2 = stats.ttest_ind(a[1] , b[1])
print("t = " + str(t2))
print("p = " + str(p2))

t = [ 6.66235409 15.08939652 16.52736128  6.2786939  10.05689064  5.2693547
 12.08027011  9.1448739   5.9687117   7.31128142]
p = [1.58762463e-04 3.67991267e-07 1.81320448e-07 2.38199665e-04
 8.13605052e-06 7.55884801e-04 2.03763261e-06 1.64760146e-05
 3.34888705e-04 8.29781945e-05]
t = [-1.77057815 -1.5894688   0.09316163  1.51944707  0.10088885  0.35222012
 -1.52838138 -0.04852196  0.50375244  0.6769448 ]
p = [0.17476488 0.2101723  0.93164814 0.22596742 0.92600332 0.74796361
 0.22388075 0.96434991 0.6491003  0.54696565]
(5, 10)
[[3.04267208 7.19191304 7.82334842 2.51988026 4.43116838 2.28238102
  6.20866444 3.97344996 2.58739816 3.38964881]
 [2.95738953 7.14597925 8.02878747 3.09612524 4.88298697 2.4350044
  5.69849878 4.77639802 2.7831269  5.01608014]
 [2.96428904 7.18547986 7.83712188 3.21338741 4.88506596 2.41277747
  5.65483307 4.73698059 2.75456921 3.76447161]
 [3.43724992 7.60105699 7.93612435 2.9981392  5.01502853 2.5739303
  5.82747709 4.93601557 2.86524365 4.73670692]
 [3.19

# Outputs

In [14]:
%%bash
jupyter nbconvert --to=latex --template=~/report.tplx feature_selection_linear_models.ipynb 1> /dev/null
pdflatex -shell-escape feature_selection_linear_models 1> /dev/null
jupyter nbconvert --to html_with_toclenvs feature_selection_linear_models.ipynb 1> /dev/null

[NbConvertApp] Converting notebook feature_selection_linear_models.ipynb to latex
[NbConvertApp] Writing 32105 bytes to feature_selection_linear_models.tex
[NbConvertApp] Converting notebook feature_selection_linear_models.ipynb to html_with_toclenvs
[NbConvertApp] Writing 285913 bytes to feature_selection_linear_models.html
