# AutoEncoder Feature Selection
In this notebook, we explore a deep learning approach to feature (band) selection for hyperspectral data using concrete autoencoders. We use the concrete_autoencoder library to train an autoencoder whose selector layer learns to pick the K input features that best reconstruct the full spectrum, that is, the K features that minimize the reconstruction error. We apply this method to a real-world hyperspectral dataset and keep the selected features together with the regression target, so that the selection can then be evaluated on a downstream prediction task.

**Link to the original code**: [Link](https://github.com/mfbalin/Concrete-Autoencoders)

**Link to the paper**: [Link](https://arxiv.org/abs/1901.09346)

Autoencoders are a type of neural network that learns to reconstruct input data from a compressed representation, also known as the latent space. They consist of two main components: an encoder that maps the input data into the latent space, and a decoder that reconstructs the input data from the latent representation.
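To make the encoder/decoder idea concrete, here is a minimal sketch of a *linear* autoencoder fitted in closed form with an SVD instead of gradient descent. It is not the network used later in this notebook; the toy data, the helper name `linear_autoencoder`, and the latent size `k` are assumptions chosen for illustration.

```python
import numpy as np

def linear_autoencoder(X, k):
    """Return (encode, decode) functions for a k-dimensional linear autoencoder."""
    mean = X.mean(axis=0)
    # Rows of Vt are the principal directions of the centered data;
    # the first k of them serve as the (tied) encoder/decoder weights.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    W = Vt[:k].T  # shape (n_features, k)

    def encode(data):
        # input -> latent space
        return (data - mean) @ W

    def decode(latent):
        # latent space -> reconstruction of the input
        return latent @ W.T + mean

    return encode, decode

rng = np.random.default_rng(0)
# Toy data: 100 samples of 20 correlated features driven by 3 hidden factors.
factors = rng.normal(size=(100, 3))
X = factors @ rng.normal(size=(3, 20)) + 0.01 * rng.normal(size=(100, 20))

encode, decode = linear_autoencoder(X, k=3)
Z = encode(X)          # compressed representation, shape (100, 3)
X_hat = decode(Z)      # reconstruction, shape (100, 20)
reconstruction_mse = float(np.mean((X - X_hat) ** 2))
```

Because the toy data really has only three underlying factors, three latent dimensions are enough to reconstruct it almost perfectly; a nonlinear autoencoder generalizes the same idea with trainable layers.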

Autoencoders can be used for feature selection because the compressed representation has to retain the information needed to reconstruct the input. A concrete autoencoder, the variant used in this notebook, replaces the encoder with a selector layer of K units. During training, each unit converges towards a single input feature, so after training we can keep the K selected input features and discard the rest.
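The selection mechanism described in the linked paper relies on a temperature-controlled relaxation of a one-hot choice (the Concrete, or Gumbel-softmax, distribution). The NumPy sketch below illustrates that idea only; it is not the library's implementation, and the function name `concrete_sample`, the random logits, and the two temperatures are assumptions. The "temperature" and "mean max of probabilities" values printed in the training log further down appear to track related quantities.

```python
import numpy as np

def concrete_sample(logits, temperature, rng):
    """Draw relaxed one-hot selection weights, one row per selector unit."""
    # Gumbel(0, 1) noise turns the softmax into a random sample.
    u = rng.uniform(low=1e-10, high=1.0, size=logits.shape)
    gumbel = -np.log(-np.log(u))
    scaled = (logits + gumbel) / temperature
    scaled = scaled - scaled.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scaled)
    return weights / weights.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
n_features, k = 204, 20  # same sizes as in this notebook
logits = rng.normal(scale=0.1, size=(k, n_features))  # hypothetical selector parameters

hot = concrete_sample(logits, temperature=10.0, rng=rng)   # soft mixture of many features
cold = concrete_sample(logits, temperature=0.1, rng=rng)   # close to a one-hot choice

x = rng.normal(size=n_features)   # one hypothetical input spectrum
selected = cold @ x               # K values, each dominated by a single input feature

mean_max_hot = float(hot.max(axis=1).mean())
mean_max_cold = float(cold.max(axis=1).mean())
```

At a high temperature every selector unit spreads its weight over many features, which keeps the layer differentiable; as the temperature is lowered, each unit concentrates on one feature, which is what makes the result usable as a feature selection.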

**Important note** 🔥🔥: This notebook addresses **regression problems only**. The linked GitHub repository includes an example for a classification problem.

In [None]:
# Install the library before importing it
!pip install concrete-autoencoder

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from concrete_autoencoder import ConcreteAutoencoderFeatureSelector
from keras.layers import Dense, Dropout, LeakyReLU



In [None]:
def load_data(csv_path, feature_col_start, feature_col_end, target_col):
    """
    Load a CSV file into a pandas DataFrame, drop rows containing NaN, and keep only the feature and target columns.

    Parameters:
        csv_path (str): Path to the CSV file to load.
        feature_col_start, feature_col_end (int): Start (inclusive) and end (exclusive) column positions of the features.
        target_col (str): Name of the column to use as target.

    Returns:
        new_df (pd.DataFrame): The feature columns followed by the target column.
    """
    # Load CSV into a pandas DataFrame
    df = pd.read_csv(csv_path)

    # Drop rows that contain NaN values
    df = df.dropna()

    # Select the feature columns by position; .copy() avoids SettingWithCopyWarning
    new_df = df.iloc[:, feature_col_start:feature_col_end].copy()
    new_df[target_col] = df[target_col]

    return new_df
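
The column-slicing logic can be tried on a small in-memory CSV before pointing it at a real file. The sketch below is illustrative only: the CSV text and its column names are made up, and the slice `2:-2` stands in for the `16:-2` used later. Taking an explicit `.copy()` before adding the target column is one way to avoid pandas' SettingWithCopyWarning.

```python
import io
import pandas as pd

# Hypothetical file: two metadata columns, three bands, the target, one trailing column.
csv_text = """id,site,b1,b2,b3,A,note
0,x,0.1,0.2,0.3,2.0,ok
1,y,0.4,,0.6,1.5,ok
2,z,0.7,0.8,0.9,3.0,ok
"""

df = pd.read_csv(io.StringIO(csv_text)).dropna()  # the row with the missing b2 is dropped
features = df.iloc[:, 2:-2].copy()                # positions 2 .. -3 -> b1, b2, b3
features["A"] = df["A"]                           # append the regression target
```

Note that `dropna()` keeps the original row labels, so the index of the result can have gaps (here 0 and 2).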

In [None]:
def split_data(df, target_col, test_size=0.3, random_state=42):
    """
    Splits the input DataFrame into training and testing sets.
    
    Parameters:
    -----------
    df (pandas DataFrame): The input DataFrame containing the features and target variable.
    target_col (str): The name of the target column in the DataFrame.
    test_size (float, optional): The proportion of the data to use for testing (default=0.3).
    random_state (int, optional): The random seed to use for the train-test split (default=42).
        
    Returns:
    --------
    X_train (pandas DataFrame): The training set features.     
    X_test (pandas DataFrame): The testing set features.        
    y_train (pandas Series): The training set target variable.
    y_test (pandas Series): The testing set target variable.
    """
    # Extract the features and target variable from the DataFrame
    X = df.drop(columns=[target_col])
    y = df[target_col]
    
    # Split the data into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=random_state)
    
    # Return the training and testing sets
    return X_train, X_test, y_train, y_test

In [None]:
def sort_df_by_important_features(df, X_train, X_test, target_col, top_K_bands):
    """
    Selects the top K features of the dataframe using the Concrete Autoencoder Feature Selector.

    Args:
        df (pd.DataFrame): The original dataframe (feature columns followed by the target column).
        X_train (pd.DataFrame or np.ndarray): The training features used to fit the selector.
        X_test (pd.DataFrame or np.ndarray): The held-out features used for validation during fitting.
        target_col (str): The name of the target column in the dataframe.
        top_K_bands (int): The number of features to select.

    Returns:
        pd.DataFrame: A new dataframe containing only the K selected features and the target column.
    """
    # the reconstruction must have one output unit per input feature (204 in the example below)
    n_features = X_train.shape[1]

    def f(x):
        # decoder: maps the K selected features back to the full input
        x = Dense(180)(x)  # apply a fully connected layer with 180 units
        x = LeakyReLU(0.2)(x)  # apply a leaky ReLU activation function
        x = Dropout(0.1)(x)  # apply dropout regularization
        x = Dense(180)(x)  # apply another fully connected layer with 180 units
        x = LeakyReLU(0.2)(x)  # apply another leaky ReLU activation function
        x = Dropout(0.1)(x)  # apply another dropout regularization
        x = Dense(n_features)(x)  # final fully connected layer, one unit per input feature
        return x

    def g(x):
        # NOTE: g is not used anywhere below; only f is passed to the selector
        x = Dense(180)(x)  # apply a fully connected layer with 180 units
        x = LeakyReLU(0.2)(x)  # apply a leaky ReLU activation function
        x = Dropout(0.1)(x)  # apply dropout regularization
        x = Dense(1800)(x)  # apply another fully connected layer with 1800 units
        x = LeakyReLU(0.2)(x)  # apply another leaky ReLU activation function
        x = Dropout(0.1)(x)  # apply another dropout regularization
        x = Dense(10)(x)  # apply a final fully connected layer with 10 units
        x = LeakyReLU(1)(x)  # apply a final leaky ReLU activation function
        return x

    # create the selector with K features, f as the decoder, and a starting budget of 25 epochs
    # (the training log in the example below shows further runs with 50, 100, 200 and 400 epochs)
    selector = ConcreteAutoencoderFeatureSelector(K=top_K_bands, output_function=f, num_epochs=25)
    # train as an autoencoder: the inputs are also the reconstruction targets; the test set is used for validation
    selector.fit(X_train, X_train, X_test, X_test)
    # get the column positions of the K selected features
    bands_list = selector.get_support(indices=True).tolist()
    # keep only the selected features plus the target column; .copy() avoids SettingWithCopyWarning
    sorted_df = df[df.columns[bands_list]].copy()
    sorted_df[target_col] = df[target_col]
    return sorted_df


## Example

In [None]:
# Define input parameters
csv_path = '/content/data.csv'
feature_idx_i, feature_idx_f = 16, -2  # start (inclusive) and end (exclusive) column positions of the features
target_col = 'A'  # label column (regression target)

# Number of features to get:
top_K_bands = 20

In [None]:
# Load data
data = load_data(csv_path, feature_idx_i, feature_idx_f, target_col)
data.head()



Unnamed: 0,397.32,400.2,403.09,405.97,408.85,411.74,414.63,417.52,420.4,423.29,...,978.88,981.96,985.05,988.13,991.22,994.31,997.4,1000.49,1003.58,A
0,0.179808,0.152106,0.129191,0.115715,0.107613,0.102074,0.101501,0.099727,0.096248,0.096929,...,0.458213,0.464172,0.45852,0.462214,0.467727,0.467549,0.466043,0.471523,0.447471,2.01727
1,0.221156,0.186298,0.160032,0.146194,0.136323,0.128331,0.124891,0.12185,0.116359,0.114495,...,0.71797,0.717748,0.722268,0.726763,0.738159,0.741649,0.739217,0.762054,0.622104,1.872474
2,0.221893,0.185626,0.164002,0.154074,0.146511,0.137888,0.133002,0.13092,0.128935,0.126446,...,0.670528,0.675308,0.669332,0.689363,0.685825,0.698885,0.689815,0.705207,0.580815,2.043818
3,0.162126,0.129779,0.104428,0.089685,0.080833,0.075142,0.068085,0.063978,0.058188,0.054447,...,0.57067,0.574177,0.580435,0.579218,0.582644,0.592902,0.597743,0.609343,0.480618,2.123489
4,0.206857,0.164631,0.137415,0.118823,0.102912,0.09785,0.090029,0.084146,0.07765,0.072445,...,0.602451,0.609186,0.624415,0.62275,0.633371,0.64097,0.649146,0.659158,0.5361,2.122085


In [None]:
X_train, X_test, y_train, y_test = split_data(data, target_col, test_size=0.3, random_state=42)
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((429, 204), (184, 204), (429,), (184,))
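
The shapes above can be checked by hand. The sketch below assumes that scikit-learn rounds the test-set size up when `test_size` is a fraction and assigns the remaining rows to the training set, which is consistent with the 429/184 split shown.

```python
import math

n_samples = 429 + 184   # rows left after dropna(), taken from the shapes above
test_size = 0.3

n_test = math.ceil(test_size * n_samples)   # 0.3 * 613 = 183.9, rounded up
n_train = n_samples - n_test                # the remainder goes to training
```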

In [None]:
sorted_df = sort_df_by_important_features(data, X_train, X_test, target_col, top_K_bands)
sorted_df



Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_1 (InputLayer)        [(None, 204)]             0         
                                                                 
 concrete_select (ConcreteSe  (None, 20)               4081      
 lect)                                                           
                                                                 
 dense (Dense)               (None, 180)               3780      
                                                                 
 leaky_re_lu (LeakyReLU)     (None, 180)               0         
                                                                 
 dropout (Dropout)           (None, 180)               0         
                                                                 
 dense_1 (Dense)             (None, 180)               32580     
                                                             



mean max of probabilities: 0.0059950966 - temperature 10.0
Epoch 1/50
mean max of probabilities: 0.0059970096 - temperature 8.754307
Epoch 2/50
mean max of probabilities: 0.0059968373 - temperature 7.6637883
Epoch 3/50
mean max of probabilities: 0.005995857 - temperature 6.709115
Epoch 4/50
mean max of probabilities: 0.005996452 - temperature 5.8733654
Epoch 5/50
mean max of probabilities: 0.005996219 - temperature 5.1417246
Epoch 6/50
mean max of probabilities: 0.0059959195 - temperature 4.5012255
Epoch 7/50
mean max of probabilities: 0.0059988596 - temperature 3.940511
Epoch 8/50
mean max of probabilities: 0.006001671 - temperature 3.4496443
Epoch 9/50
mean max of probabilities: 0.0060096537 - temperature 3.0199244
Epoch 10/50
mean max of probabilities: 0.0060204617 - temperature 2.643735
Epoch 11/50
mean max of probabilities: 0.0060348557 - temperature 2.314407
Epoch 12/50
mean max of probabilities: 0.006043675 - temperature 2.0261035
Epoch 13/50
mean max of probabilities: 0.0060538



None
mean max of probabilities: 0.006016708 - temperature 10.0
Epoch 1/100
mean max of probabilities: 0.00602386 - temperature 9.356454
Epoch 2/100
mean max of probabilities: 0.0060253525 - temperature 8.754322
Epoch 3/100
mean max of probabilities: 0.0060266717 - temperature 8.190941
Epoch 4/100
mean max of probabilities: 0.0060276412 - temperature 7.6638165
Epoch 5/100
mean max of probabilities: 0.0060287677 - temperature 7.170615
Epoch 6/100
mean max of probabilities: 0.006030131 - temperature 6.709152
Epoch 7/100
mean max of probabilities: 0.0060319216 - temperature 6.2773857
Epoch 8/100
mean max of probabilities: 0.0060342527 - temperature 5.873406
Epoch 9/100
mean max of probabilities: 0.006036726 - temperature 5.495424
Epoch 10/100
mean max of probabilities: 0.006039918 - temperature 5.1417685
Epoch 11/100
mean max of probabilities: 0.006045205 - temperature 4.810873
Epoch 12/100
mean max of probabilities: 0.006052129 - temperature 4.5012703
Epoch 13/100
mean max of probabilitie



None
mean max of probabilities: 0.0059922365 - temperature 10.0
Epoch 1/200
mean max of probabilities: 0.0059942612 - temperature 9.672865
Epoch 2/200
mean max of probabilities: 0.0059896046 - temperature 9.356432
Epoch 3/200
mean max of probabilities: 0.005988942 - temperature 9.050355
Epoch 4/200
mean max of probabilities: 0.0059885215 - temperature 8.754288
Epoch 5/200
mean max of probabilities: 0.005987818 - temperature 8.467904
Epoch 6/200
mean max of probabilities: 0.005986646 - temperature 8.190891
Epoch 7/200
mean max of probabilities: 0.005986704 - temperature 7.9229407
Epoch 8/200
mean max of probabilities: 0.005987694 - temperature 7.663756
Epoch 9/200
mean max of probabilities: 0.0059908433 - temperature 7.4130487
Epoch 10/200
mean max of probabilities: 0.0059937234 - temperature 7.170543
Epoch 11/200
mean max of probabilities: 0.0059984745 - temperature 6.9359717
Epoch 12/200
mean max of probabilities: 0.0060021016 - temperature 6.709073
Epoch 13/200
mean max of probabilit



mean max of probabilities: 0.0060061766 - temperature 10.0
Epoch 1/400
mean max of probabilities: 0.0060085123 - temperature 9.835086
Epoch 2/400
mean max of probabilities: 0.006008164 - temperature 9.672891
Epoch 3/400
mean max of probabilities: 0.006008107 - temperature 9.5133705
Epoch 4/400
mean max of probabilities: 0.006008015 - temperature 9.356482
Epoch 5/400
mean max of probabilities: 0.006007933 - temperature 9.20218
Epoch 6/400
mean max of probabilities: 0.0060088434 - temperature 9.050424
Epoch 7/400
mean max of probabilities: 0.006009791 - temperature 8.901169
Epoch 8/400
mean max of probabilities: 0.006010634 - temperature 8.7543745
Epoch 9/400
mean max of probabilities: 0.0060117436 - temperature 8.610001
Epoch 10/400
mean max of probabilities: 0.006014514 - temperature 8.468009
Epoch 11/400
mean max of probabilities: 0.0060171965 - temperature 8.328361
Epoch 12/400
mean max of probabilities: 0.0060216 - temperature 8.191013
Epoch 13/400
mean max of probabilities: 0.00602
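
The temperatures in the log above can be used to check how the annealing behaves. The sketch below copies the first few logged values from the 50-epoch and 100-epoch runs and computes the ratio between consecutive epochs; it only analyzes the printed numbers and makes no claim about how the library computes the schedule internally.

```python
import numpy as np

# First logged temperatures of the 50-epoch and 100-epoch runs (copied from the log above).
temps_50 = np.array([10.0, 8.754307, 7.6637883, 6.709115, 5.8733654])
temps_100 = np.array([10.0, 9.356454, 8.754322, 8.190941, 7.6638165])

# A constant ratio between consecutive epochs indicates geometric (exponential) decay.
ratio_50 = temps_50[1:] / temps_50[:-1]
ratio_100 = temps_100[1:] / temps_100[:-1]

decay_50 = float(ratio_50.mean())
decay_100 = float(ratio_100.mean())

# If doubling the epoch budget spreads the same decay over twice as many epochs,
# two epochs of the 100-epoch run should cool as much as one epoch of the 50-epoch run.
two_epochs_of_100 = decay_100 ** 2
```

In these logged values the per-epoch ratio is constant within each run, and the temperature after two epochs of the 100-epoch run matches the temperature after one epoch of the 50-epoch run, so a longer run cools more slowly but appears to follow the same overall schedule.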



Unnamed: 0,877.69,720.54,819.95,795.74,759.56,771.61,783.67,720.54.1,813.89,783.67.1,...,798.77,750.54,874.64,741.53,829.04,871.6,826.01,753.55,819.95.1,A
0,0.579423,0.292235,0.583753,0.562910,0.536819,0.549620,0.558424,0.292235,0.573157,0.558424,...,0.563612,0.517486,0.580026,0.473878,0.581986,0.582378,0.584543,0.525307,0.583753,2.017270
1,0.885729,0.449958,0.894158,0.898464,0.864115,0.882618,0.890052,0.449958,0.897237,0.890052,...,0.898274,0.826603,0.884617,0.753038,0.896991,0.887789,0.896801,0.842405,0.894158,1.872474
2,0.855833,0.433595,0.864074,0.866420,0.828052,0.848288,0.856138,0.433595,0.857534,0.856138,...,0.868170,0.791010,0.853694,0.720103,0.865504,0.857715,0.867874,0.805933,0.864074,2.043818
3,0.748112,0.328464,0.753514,0.754643,0.719879,0.739644,0.746800,0.328464,0.754795,0.746800,...,0.756333,0.684226,0.747620,0.611263,0.755369,0.748048,0.757952,0.700228,0.753514,2.123489
4,0.801877,0.369840,0.815905,0.821306,0.789060,0.807974,0.814888,0.369840,0.820912,0.814888,...,0.822021,0.748561,0.802887,0.671791,0.817139,0.803576,0.821392,0.767407,0.815905,2.122085
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
612,0.786732,0.333123,0.796363,0.791217,0.752083,0.774240,0.785829,0.333123,0.796099,0.785829,...,0.793076,0.710342,0.786948,0.629900,0.793902,0.787840,0.795306,0.728559,0.796363,4.258127
613,0.364014,0.129281,0.346632,0.338186,0.310817,0.326364,0.332254,0.129281,0.344831,0.332254,...,0.340059,0.289555,0.362873,0.251838,0.350013,0.361193,0.348754,0.298084,0.346632,1.826188
614,0.679027,0.354364,0.669369,0.659740,0.624200,0.644844,0.651237,0.354364,0.661221,0.651237,...,0.661552,0.595233,0.676018,0.543744,0.669436,0.678512,0.668166,0.609054,0.669369,0.933424
615,0.532599,0.181478,0.537261,0.514193,0.478409,0.498795,0.508198,0.181478,0.531458,0.508198,...,0.515659,0.452294,0.531816,0.395480,0.526283,0.534131,0.530676,0.464050,0.537261,2.009618
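
Some columns in the output above carry identical values (for example 720.54 and 720.54.1, or 783.67 and 783.67.1), which suggests that more than one of the K selector units settled on the same band. If distinct bands are wanted, the selected indices can be deduplicated while preserving their order. This is a minimal sketch; the helper name `unique_in_order` and the index values are hypothetical.

```python
def unique_in_order(indices):
    """Remove repeated indices while keeping the first occurrence of each."""
    # dict keys are unique and preserve insertion order
    return list(dict.fromkeys(indices))

# Hypothetical selector output in which two bands were picked twice.
bands_list = [160, 108, 141, 108, 141, 133]
distinct_bands = unique_in_order(bands_list)
```

The deduplicated list can be shorter than K, so the number of distinct bands actually selected is worth reporting alongside K.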
