# Feature Selection

This notebook tries out different feature selection strategies: several from scikit-learn, plus a few neural approaches.

## Dependencies

In [None]:
from collections import Counter

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

## Preprocess Dataset

Download the Dorothea dataset into the parent `data` directory, unless the archive is already there.

In [None]:
!if [ ! -f ../data/dorothea.zip ]; then wget -P ../data/ https://archive.ics.uci.edu/static/public/169/dorothea.zip && unzip ../data/dorothea.zip -d ../data/; fi

In [None]:
class DataPreprocessor:
    def __init__(self, features_file, targets_file):
        self.features_file = features_file
        self.targets_file = targets_file
        self.df = None
    
    def _preprocess(self):
        data = []
        with open(self.features_file, 'r') as f:
            for line in f:
                active_features = line.strip().split()
                data.append(pd.Series({int(feature): 1 for feature in active_features}))
        features = pd.concat(data, axis=1).T.fillna(0).sort_index(axis=1)
        targets = pd.read_csv(self.targets_file, header=None, names=["target"])
        self.df = pd.concat([features, targets], axis=1)
    
    def __call__(self):
        self._preprocess()
        return self.df

In [None]:
wrangler = DataPreprocessor('../data/DOROTHEA/dorothea_train.data', '../data/DOROTHEA/dorothea_train.labels')

In [None]:
df = wrangler()

In [None]:
df.head()

In [None]:
X, y = df.drop('target', axis=1), df['target']

## Feature Data Types

In [None]:
feature_number_unique_values = {column: X[column].unique() for column in X.columns}

In [None]:
unique_values_by_columns = [value for _, value in feature_number_unique_values.items()]
tuples = [tuple(np.sort(arr)) for arr in unique_values_by_columns]

# Count the frequencies
frequency_table = Counter(tuples)

# Convert Counter to DataFrame (kept under its own name so the dataset in `df` is not overwritten)
frequency_df = pd.DataFrame.from_records(list(frequency_table.items()), columns=['Array', 'Frequency'])

print(frequency_df)


So each feature is binary, taking the values 0 and 1, and every feature contains at least one of each.

Let's look at the target `y`.

In [None]:
y.unique()

The drug discovery target `y` is also binary, taking values `-1` and `1`.

# Feature Selection Algorithms

## Sweeping Univariate Feature Selection

Selecting features using univariate statistical tests of the relationship between each feature and the target variable.

### Only Univariate Feature Based

#### Variance Threshold

Let's see a histogram of the variance of each feature.

In [None]:
variances = X.var()

plt.hist(variances, bins='auto', log=True)
plt.title('Histogram of Variances')
plt.xlabel('Variance')
plt.ylabel('Frequency')
plt.show()

In [None]:
log_variances = np.log(variances + 1e-9)

plt.hist(log_variances, bins='auto', log=True)
plt.title('Histogram of Log Variances')
plt.xlabel('Log Variance')
plt.ylabel('Frequency')
plt.show()

In [None]:
plt.hist(variances, bins='auto', density=True, cumulative=True, histtype='step', alpha=0.8)
plt.title('CDF of Variances')
plt.xlabel('Variance')
plt.ylabel('Cumulative Probability')
plt.grid(True)

# Calculate the median of variances
median = np.median(variances)

# Plot the median as a dotted line
plt.axvline(median, color='b', linestyle='dotted', linewidth=2, label=f'Median Variance: {median:.2f}')

plt.legend()
plt.show()

In [None]:
plt.boxplot(variances)
plt.title('Boxplot of Variances')
plt.ylabel('Variance')
plt.grid(True)
plt.show()

In [None]:
plt.hist(log_variances, bins='auto', density=True, cumulative=True, histtype='step', alpha=0.8)
plt.title('CDF of Log Variances')
plt.xlabel('Log Variance')
plt.ylabel('Cumulative Probability')
plt.grid(True)

# Calculate the median of log_variances
median = np.median(log_variances)

# Plot the median as a dotted thick blue line
plt.axvline(median, color='b', linestyle='dotted', linewidth=2, label=f'Median Log Variance: {median:.2f}')

plt.legend()
plt.show()

In [None]:
plt.boxplot(log_variances)
plt.title('Boxplot of Log Variances')
plt.ylabel('Log Variance')
plt.grid(True)
plt.show()

In [None]:
variance_cutoff = 0.01
selector = VarianceThreshold(threshold=variance_cutoff)
selected_features = selector.fit_transform(X)
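
For a binary feature that is active in a fraction `p` of the rows, the population variance is `p * (1 - p)`, so a cutoff of 0.01 roughly amounts to dropping features that are active in fewer than about 1% (or more than about 99%) of the samples. The sketch below checks this on a small synthetic matrix; `X_demo`, the column names, and the activation rates are illustrative assumptions, not the Dorothea data. Note that `VarianceThreshold` uses the population variance (`ddof=0`), while `DataFrame.var()` defaults to the sample variance (`ddof=1`), so the two can differ slightly.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Hypothetical toy data: 1000 rows, three binary features with known activation rates.
n_rows = 1000
X_demo = pd.DataFrame({
    "rare": np.r_[np.ones(5), np.zeros(n_rows - 5)],          # p = 0.005
    "uncommon": np.r_[np.ones(50), np.zeros(n_rows - 50)],    # p = 0.05
    "balanced": np.r_[np.ones(500), np.zeros(n_rows - 500)],  # p = 0.5
})

# Population variance of a binary column equals p * (1 - p).
p = X_demo.mean()
bernoulli_variance = p * (1 - p)

demo_selector = VarianceThreshold(threshold=0.01)
demo_selector.fit(X_demo)

# get_support() returns a boolean mask over the input columns.
kept_columns = X_demo.columns[demo_selector.get_support()].tolist()
print(bernoulli_variance.round(4).to_dict())
print("kept:", kept_columns)
```

Only the `rare` column falls below the cutoff here, since 0.005 * 0.995 is about 0.005.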

### Model Based Feature Selection: Univariate Feature Target

Classification-based univariate feature selection using `SelectKBest` and `SelectPercentile`.
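
As a minimal sketch of how these two selectors could be applied to binary features with labels in `{-1, 1}`, the example below scores each feature with the chi-squared test. The choice of `chi2` as the score function, the synthetic data, and the values of `k` and `percentile` are assumptions made for illustration; the notebook does not fix any of them.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, SelectPercentile, chi2

rng = np.random.default_rng(0)

# Hypothetical toy data: 200 samples, 20 binary features, labels in {-1, 1}.
n_samples, n_features = 200, 20
y_demo = np.where(rng.random(n_samples) < 0.5, 1, -1)
X_demo = rng.integers(0, 2, size=(n_samples, n_features))
# Make the first two features informative: exact 0/1 copies of the label.
X_demo[:, 0] = (y_demo == 1).astype(int)
X_demo[:, 1] = (y_demo == 1).astype(int)

# Keep the 2 highest-scoring features, or the top 10% of features by score.
k_best = SelectKBest(score_func=chi2, k=2).fit(X_demo, y_demo)
top_percentile = SelectPercentile(score_func=chi2, percentile=10).fit(X_demo, y_demo)

k_best_idx = k_best.get_support(indices=True)
percentile_idx = top_percentile.get_support(indices=True)
print("SelectKBest:", k_best_idx)
print("SelectPercentile:", percentile_idx)
```

`chi2` requires non-negative feature values, which holds for 0/1 features; other score functions such as `mutual_info_classif` could be swapped in the same way.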

### Recursive Feature Elimination
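
One possible starting point for this section is scikit-learn's `RFE`, which repeatedly fits an estimator and discards the weakest features. The sketch below is only an illustration under assumptions: the logistic regression estimator, the synthetic data, and the number of features to keep are all hypothetical choices rather than settings taken from this notebook.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical toy data: 300 samples, 10 binary features, labels in {-1, 1}.
X_demo = rng.integers(0, 2, size=(300, 10))
# Only feature 0 carries signal: the label is fully determined by it.
y_demo = np.where(X_demo[:, 0] == 1, 1, -1)

# Drop the feature with the smallest coefficient magnitude, one at a time,
# until three features remain.
rfe = RFE(estimator=LogisticRegression(), n_features_to_select=3, step=1)
rfe.fit(X_demo, y_demo)

selected_idx = np.flatnonzero(rfe.support_)
print("selected:", selected_idx)
print("ranking:", rfe.ranking_)  # rank 1 marks the selected features
```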

### SelectFromModel

#### L1 based 

#### Tree based

### Sequential Feature Selection

### Pipeline Feature Selection

### Neural Learning

#### MLP gradient based feature selection for each class (or relative regression from a baseline)

#### Relative gradient 


#### Permutation based feature selection

#### AutoEncoder (relu activations as importance of first layer)

#### Gradient based method

#### Regularization of Neural network

#### RFE with Neural Network

#### TabTransformer based selection (if possible)