# Feature Selection

This notebook tries out several feature selection strategies, using scikit-learn as well as a few neural approaches.

## Dataset

Download the DOROTHEA dataset archive into the `data` directory one level above this notebook.

!wget -P ../data/ https://archive.ics.uci.edu/static/public/169/dorothea.zip
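
The loading cells read from `../data/DOROTHEA/`, so the downloaded archive has to be unpacked first. Here is a minimal sketch using the standard-library `zipfile` module. The helper name `extract_archive` is hypothetical, and the internal layout of the archive is an assumption that has not been checked here, so adjust the paths if the files end up somewhere else.

```python
import zipfile
from pathlib import Path


def extract_archive(zip_path, dest_dir):
    """Extract every member of zip_path into dest_dir and return the sorted member names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(dest)
        return sorted(zf.namelist())


# Hypothetical usage: the paths mirror the wget command above.
archive = Path("../data/dorothea.zip")
if archive.exists():
    # Printing the member names makes it easy to confirm where the files landed.
    print(extract_archive(archive, "../data/"))
```

If the printed member names show a nested archive or a different folder name, extract or move the files so that they match the paths used in the loading cells.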

In [None]:
import pandas as pd

In [None]:

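# Each line of the file lists the indices of the active (non-zero) features for one sample.
data = []

with open('../data/DOROTHEA/dorothea_train.data', 'r') as f:
    for line in f:
        active_features = line.strip().split()
        # Store the sample as a Series mapping feature index -> 1.
        data.append(pd.Series({int(feature): 1 for feature in active_features}))

# One row per sample; features absent from a row are inactive, so fill them with 0.
features = pd.concat(data, axis=1).T.fillna(0).sort_index(axis=1)

features.head()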
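
Building one `pd.Series` per line and concatenating them works, but it produces a dense frame even though most entries are zero. As an alternative sketch, the same file format (one line per sample, whitespace-separated indices of active features) can be parsed into a SciPy CSR matrix. The helper name `load_sparse_binary` and the small in-memory sample are made up for illustration. Unlike the cell above, this sketch keeps a column for every index up to the largest one seen, including indices that are never active.

```python
import io

import numpy as np
from scipy.sparse import csr_matrix


def load_sparse_binary(lines):
    """Parse lines of whitespace-separated active feature indices into a binary CSR matrix."""
    indices, indptr = [], [0]
    for line in lines:
        # Deduplicate and sort the active indices of this sample.
        active = sorted({int(tok) for tok in line.split()})
        indices.extend(active)
        indptr.append(len(indices))
    # Column j corresponds to raw feature index j.
    n_cols = max(indices) + 1 if indices else 0
    data = np.ones(len(indices), dtype=np.int8)
    return csr_matrix((data, indices, indptr), shape=(len(indptr) - 1, n_cols))


# Tiny in-memory stand-in for a data file; the third sample has no active features.
sample = io.StringIO("1 3\n2\n\n1 2 3\n")
X_sparse = load_sparse_binary(sample)
print(X_sparse.toarray())
```

Many scikit-learn selectors, including `VarianceThreshold`, accept sparse input, so a matrix like this can be passed to them without converting it to a dense array first.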

In [None]:
targets = pd.read_csv('../data/DOROTHEA/dorothea_train.labels', header=None)

In [None]:
df = pd.concat([features, targets], axis=1)

In [None]:
df.head()

In [None]:
df.rename(columns={0: 'target'}, inplace=True)

In [None]:
df.head()

In [None]:
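class DataPreprocessor:
    """Load DOROTHEA-style sparse features and labels into one DataFrame."""

    def __init__(self, features_file, targets_file):
        self.features_file = features_file
        self.targets_file = targets_file
        self.df = None

    def _preprocess(self):
        # Each line lists the indices of the active (non-zero) features for one sample.
        data = []
        with open(self.features_file, 'r') as f:
            for line in f:
                active_features = line.strip().split()
                data.append(pd.Series({int(feature): 1 for feature in active_features}))
        # One row per sample; features absent from a row are inactive, so fill them with 0.
        features = pd.concat(data, axis=1).T.fillna(0).sort_index(axis=1)
        targets = pd.read_csv(self.targets_file, header=None, names=["target"])
        self.df = pd.concat([features, targets], axis=1)

    def __call__(self):
        """Run the preprocessing and return the combined DataFrame."""
        self._preprocess()
        return self.df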

In [None]:
wrangler = DataPreprocessor('../data/DOROTHEA/dorothea_train.data', '../data/DOROTHEA/dorothea_train.labels')

In [None]:
df = wrangler()

In [None]:
df.head()

In [None]:
X, y = df.drop('target', axis=1), df['target']

# Feature Selection Algorithms

## Univariate Feature Selection

Univariate feature selection picks features using univariate statistical tests of the relationship between each feature and the target variable.
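
As one illustration of such a test, here is a minimal sketch using scikit-learn's `SelectKBest` with the chi-squared score, which is applicable to non-negative features such as 0/1 indicators. The toy arrays `X_toy` and `y_toy` are made up for this example, and chi-squared is only one possible choice of score function.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

# Toy binary data: 8 samples, 3 features, labels in {-1, +1}.
y_toy = np.array([1, 1, 1, 1, -1, -1, -1, -1])
X_toy = np.column_stack([
    (y_toy == 1).astype(int),            # active exactly when the label is +1
    np.array([1, 0, 1, 0, 1, 0, 1, 0]),  # equally active in both classes
    np.array([1, 1, 0, 0, 1, 1, 0, 0]),  # equally active in both classes
])

# Score each feature independently against the target and keep the single best one.
selector = SelectKBest(score_func=chi2, k=1)
X_selected = selector.fit_transform(X_toy, y_toy)

print(selector.scores_)                    # one chi-squared score per feature
print(selector.get_support(indices=True))  # indices of the selected features
```

Because every feature is scored on its own, this kind of selection is cheap but cannot detect features that are only informative in combination with others.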

### Variance Threshold

Before choosing a threshold, plot a histogram of the per-feature variances to see how they are distributed.

In [None]:
import matplotlib.pyplot as plt

# calculate variances
variances = X.var()

# plot histogram
plt.hist(variances, bins='auto')
plt.title('Histogram of Variances')
plt.xlabel('Variance')
plt.ylabel('Frequency')
plt.show()

In [None]:
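from sklearn.feature_selection import VarianceThreshold

# For a 0/1 feature that is active in a fraction p of samples, the variance is p * (1 - p).
# This threshold drops features that take the same value in more than 80% of samples.
sel = VarianceThreshold(threshold=(.8 * (1 - .8)))
sel.fit_transform(X)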
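
To see what this threshold does on a small case, here is a sketch on a toy boolean matrix. The names `X_demo` and `demo_sel` are made up for illustration. Note that `VarianceThreshold` uses the population variance, while pandas `DataFrame.var()` defaults to the sample variance (`ddof=1`), so the values in the histogram above differ slightly from the ones the selector compares against the threshold.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Toy boolean data: 6 samples, 3 features.
X_demo = np.array([
    [0, 0, 1],
    [0, 1, 0],
    [1, 0, 0],
    [0, 1, 1],
    [0, 1, 0],
    [0, 1, 1],
])

# Keep only features whose variance exceeds 0.8 * (1 - 0.8) = 0.16.
demo_sel = VarianceThreshold(threshold=0.8 * (1 - 0.8))
X_reduced = demo_sel.fit_transform(X_demo)

print(demo_sel.variances_)     # population variance of each column
print(demo_sel.get_support())  # boolean mask of the columns that were kept
print(X_reduced.shape)
```

The first column is zero in five of six samples, so its variance (5/36, about 0.139) falls below the threshold and it is dropped. The other two columns are kept.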