# AutoEncoder Feature Extraction
Hyperspectral images are usually high-dimensional and contain a large amount of redundant information. Therefore, feature extraction is essential for reducing the dimensionality of the data and selecting the most relevant features.

This notebook shows how to extract features from hyperspectral images using an autoencoder. An autoencoder is a type of neural network that learns to encode high-dimensional data into a lower-dimensional representation and to reconstruct the input from that representation. We use the encoder part to generate new features that capture the essential information contained in the hyperspectral images.
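As a rough formalization of the idea above (the symbols are introduced here for illustration and are not notation used elsewhere in the notebook): write $f_\theta$ for the encoder, $g_\phi$ for the decoder, $d$ for the number of input bands and $k < d$ for the size of the latent representation. With a mean squared error loss, training looks for parameters that minimize the reconstruction error over the $n$ training samples:

$$
f_\theta : \mathbb{R}^{d} \to \mathbb{R}^{k}, \qquad
g_\phi : \mathbb{R}^{k} \to \mathbb{R}^{d}, \qquad
\min_{\theta,\phi} \; \frac{1}{n} \sum_{i=1}^{n} \frac{1}{d} \left\lVert x_i - g_\phi\big(f_\theta(x_i)\big) \right\rVert_2^{2}
$$

The new features for a sample $x$ are then the $k$ values $f_\theta(x)$. In the example further down, $d = 204$ and $k = 16$.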

**Feature selection** and **feature extraction** are techniques used in machine learning to reduce the dimensionality of input data. Feature selection keeps a subset of the original features, ranked by a relevance criterion such as correlation with the target variable, while feature extraction transforms the original features into a new, smaller set of features. Feature extraction can be more effective when the original features are highly correlated or when there are nonlinear relationships between the features and the target variable. Both techniques can improve the performance of machine learning models.
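The difference between the two approaches can be seen on a small synthetic example. The sketch below is not part of the notebook's pipeline: it uses scikit-learn's `SelectKBest` as a stand-in for feature selection and `PCA` as a stand-in for (linear) feature extraction, and all data and variable names are made up for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic data (assumption): 10 features, only columns 2 and 7 drive the target
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 10))
y_demo = 3.0 * X_demo[:, 2] - 2.0 * X_demo[:, 7] + rng.normal(scale=0.1, size=200)

# Feature selection: keep 2 of the ORIGINAL columns, scored against the target
selector = SelectKBest(score_func=f_regression, k=2)
X_selected = selector.fit_transform(X_demo, y_demo)
kept_columns = selector.get_support(indices=True)

# Feature extraction: build 2 NEW columns as combinations of all original ones
pca = PCA(n_components=2)
X_extracted = pca.fit_transform(X_demo)

print("selected original columns:", kept_columns)
print("selected shape:", X_selected.shape, "extracted shape:", X_extracted.shape)
```

The selected columns are unchanged copies of input columns, whereas the extracted columns do not correspond to any single original feature. An autoencoder plays the same role as `PCA` here, but can also capture nonlinear structure.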

Here we use an autoencoder for feature extraction. Note that a separate notebook deals with using an autoencoder for feature selection.

## Usage

1. Use the functions and the example below to load a dataset, split it into train/test sets, and get a new dataset made of the extracted features.

2. The function *autoencoder_features()* is based on a basic architecture. You may change it.





In [2]:
import pandas as pd
import tensorflow as tf
import matplotlib.pyplot as plt
from tensorflow.keras import Sequential
from sklearn.model_selection import train_test_split
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Dense, LeakyReLU, Dropout

In [3]:
def load_data(csv_path, feature_col_start, feature_col_end, target_col):
    """
    Load a CSV file into a Pandas DataFrame, drop rows with NaN values, and keep the feature and target columns.

    Parameters:
        csv_path (str): Path to the CSV file to load.
        feature_col_start, feature_col_end (ints): Start (inclusive) and end (exclusive) column positions of the features.
        target_col (str): Name (label) of the column to use as target.

    Returns:
        new_df: A DataFrame containing the feature columns plus the target column.
    """
    # Load CSV into a Pandas DataFrame
    df = pd.read_csv(csv_path)

    # Drop rows that contain NaN values
    df = df.dropna()

    # Copy the feature columns so that adding the target column does not
    # trigger pandas' SettingWithCopyWarning
    new_df = df[df.columns[feature_col_start: feature_col_end]].copy()
    new_df[target_col] = df[target_col]

    return new_df
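
The column range is a positional slice, so a negative end position counts from the last column, as in the example further down. The sketch below repeats the same steps as `load_data` on a tiny in-memory CSV; the file contents and column names are invented for illustration.

```python
import io

import pandas as pd

# Hypothetical CSV: one id column, three band columns, the target "A", one extra column
csv_text = """id,b1,b2,b3,A,note
0,0.1,0.2,0.3,2.0,x
1,0.4,0.5,0.6,2.1,y
2,0.7,,0.9,2.2,z
"""

demo_df = pd.read_csv(io.StringIO(csv_text)).dropna()  # row 2 has a missing b2 value

# Positions 1 up to (but excluding) the second-to-last column: b1, b2, b3
demo_feature_cols = demo_df.columns[1:-2]
demo_new_df = demo_df[demo_feature_cols].copy()
demo_new_df["A"] = demo_df["A"]

print(demo_new_df)
```

Here the slice `1:-2` skips the `id` column at the start and the last two columns (`A` and `note`) at the end; the target is then added back by name.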

In [4]:
def split_data(df, target_col, test_size=0.3, random_state=42):
    """
    Splits the input DataFrame into training and testing sets.
    
    Parameters:
    -----------
    df (pandas DataFrame): The input DataFrame containing the features and target variable.
    target_col (str): The name of the target column in the DataFrame.
    test_size (float, optional): The proportion of the data to use for testing (default=0.3).
    random_state (int, optional): The random seed to use for the train-test split (default=42).
        
    Returns:
    --------
    X_train (pandas DataFrame): The training set features.     
    X_test (pandas DataFrame): The testing set features.        
    y_train (pandas Series): The training set target variable.
    y_test (pandas Series): The testing set target variable.
    """
    # Extract the features and target variable from the DataFrame
    X = df.drop(columns=[target_col])
    y = df[target_col]
    
    # Split the data into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=random_state)
    
    # Return the training and testing sets
    return X_train, X_test, y_train, y_test

In [5]:
def autoencoder_features(df,X_train,X_test,
                             target_col,
                             loss="mse",optimizer="adam",epochs=50, batch_size=32, validation_split=0.25):
    """
    Train an autoencoder on X_train and return a DataFrame with the new features
    that its encoder generates for every row of df.
    Note that you may change the network architecture.

    Parameters:
    -----------
    df (pandas DataFrame): The input DataFrame containing the features and target variable.
    X_train (pandas DataFrame): The training set features.
    X_test (pandas DataFrame): The testing set features (currently unused).
    target_col (str): The name of the target column in the DataFrame.
    loss, optimizer, epochs, batch_size, validation_split: hyperparameters of the model.

    Returns:
    --------
    new_features_df (pandas DataFrame): DataFrame with new features generated by an Autoencoder model.
    """
    # Keep only the feature columns; the target is not needed by the autoencoder
    X = df.drop(target_col, axis=1)

    # implementation of the autoencoder model
    # (named `inputs` to avoid shadowing the built-in `input`)
    inputs = Input(shape=X_train.shape[1:])
    enc = Dense(64)(inputs)
    enc = LeakyReLU()(enc)
    enc = Dense(32)(enc)
    enc = LeakyReLU()(enc)
    # latent space with tanh
    latent_space = Dense(16, activation="tanh")(enc)

    dec = Dense(32)(latent_space)
    dec = LeakyReLU()(dec)
    dec = Dense(64)(dec)
    dec = LeakyReLU()(dec)

    dec = Dense(units=X_train.shape[1], activation="relu")(dec)
    # init model
    autoencoder = Model(inputs, dec)
    # compile model
    autoencoder.compile(optimizer=optimizer, metrics=["mse"], loss=loss)
    # train model to reconstruct its own input
    autoencoder.fit(X_train, X_train, epochs=epochs, batch_size=batch_size, validation_split=validation_split)
    # the encoder shares the trained layers up to the latent space
    encoder = Model(inputs, latent_space)
    # generate new features using the encoder
    new_features = encoder.predict(X)
    # create a DataFrame with the new features, keeping the row index of df
    new_features_df = pd.DataFrame(new_features, index=X.index,
                                   columns=[f'feature_{i}' for i in range(new_features.shape[1])])

    return new_features_df
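
Since the architecture is meant to be changed, one possible way to make the layer sizes configurable is sketched below. The helper name `build_autoencoder` and its parameters `hidden_units` and `latent_dim` are hypothetical and not part of this notebook; the layer types, activations and loss mirror the function above, and the data is random and only serves to show the shapes.

```python
import numpy as np
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Dense, LeakyReLU


def build_autoencoder(n_features, hidden_units=(64, 32), latent_dim=16):
    """Hypothetical helper: symmetric autoencoder with configurable layer sizes."""
    inputs = Input(shape=(n_features,))
    x = inputs
    for units in hidden_units:            # encoder layers
        x = Dense(units)(x)
        x = LeakyReLU()(x)
    latent = Dense(latent_dim, activation="tanh")(x)
    x = latent
    for units in reversed(hidden_units):  # decoder mirrors the encoder
        x = Dense(units)(x)
        x = LeakyReLU()(x)
    outputs = Dense(n_features, activation="relu")(x)

    autoencoder = Model(inputs, outputs)
    encoder = Model(inputs, latent)       # shares weights with the autoencoder
    autoencoder.compile(optimizer="adam", loss="mse")
    return autoencoder, encoder


# Random stand-in data (assumption): 20 "bands" with values in [0, 1)
rng = np.random.default_rng(0)
X_train_demo = rng.random((80, 20)).astype("float32")
X_test_demo = rng.random((20, 20)).astype("float32")

demo_autoencoder, demo_encoder = build_autoencoder(20, hidden_units=(12,), latent_dim=4)
demo_autoencoder.fit(X_train_demo, X_train_demo, epochs=2, batch_size=16, verbose=0)

# Encode the train and test sets separately with the encoder fitted on train only
train_codes = demo_encoder.predict(X_train_demo, verbose=0)
test_codes = demo_encoder.predict(X_test_demo, verbose=0)
print(train_codes.shape, test_codes.shape)
```

Returning the encoder as a separate model makes it possible to transform the train and test sets independently, which may be convenient when the extracted features feed a downstream model.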

## Example

In [6]:
# Define input parameters
csv_path = '/content/data.csv'
feature_idx_i, feature_idx_f = 16, -2  # start and end column positions of the features
target_col = 'A'  # label column (regression)

In [7]:
# Load data
data = load_data(csv_path, feature_idx_i,feature_idx_f, target_col)
data.head()



| | 397.32 | 400.2 | 403.09 | 405.97 | 408.85 | 411.74 | 414.63 | 417.52 | 420.4 | 423.29 | ... | 978.88 | 981.96 | 985.05 | 988.13 | 991.22 | 994.31 | 997.4 | 1000.49 | 1003.58 | A |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.179808 | 0.152106 | 0.129191 | 0.115715 | 0.107613 | 0.102074 | 0.101501 | 0.099727 | 0.096248 | 0.096929 | ... | 0.458213 | 0.464172 | 0.45852 | 0.462214 | 0.467727 | 0.467549 | 0.466043 | 0.471523 | 0.447471 | 2.01727 |
| 1 | 0.221156 | 0.186298 | 0.160032 | 0.146194 | 0.136323 | 0.128331 | 0.124891 | 0.12185 | 0.116359 | 0.114495 | ... | 0.71797 | 0.717748 | 0.722268 | 0.726763 | 0.738159 | 0.741649 | 0.739217 | 0.762054 | 0.622104 | 1.872474 |
| 2 | 0.221893 | 0.185626 | 0.164002 | 0.154074 | 0.146511 | 0.137888 | 0.133002 | 0.13092 | 0.128935 | 0.126446 | ... | 0.670528 | 0.675308 | 0.669332 | 0.689363 | 0.685825 | 0.698885 | 0.689815 | 0.705207 | 0.580815 | 2.043818 |
| 3 | 0.162126 | 0.129779 | 0.104428 | 0.089685 | 0.080833 | 0.075142 | 0.068085 | 0.063978 | 0.058188 | 0.054447 | ... | 0.57067 | 0.574177 | 0.580435 | 0.579218 | 0.582644 | 0.592902 | 0.597743 | 0.609343 | 0.480618 | 2.123489 |
| 4 | 0.206857 | 0.164631 | 0.137415 | 0.118823 | 0.102912 | 0.09785 | 0.090029 | 0.084146 | 0.07765 | 0.072445 | ... | 0.602451 | 0.609186 | 0.624415 | 0.62275 | 0.633371 | 0.64097 | 0.649146 | 0.659158 | 0.5361 | 2.122085 |


In [8]:
X_train, X_test, y_train, y_test = split_data(data, target_col, test_size=0.3, random_state=42)
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((429, 204), (184, 204), (429,), (184,))

In [10]:
df_features = autoencoder_features(data,X_train,X_test,
                             target_col,
                             loss="mse",optimizer="adam",epochs=50, batch_size=32, validation_split=0.25)

Epoch 1/50
...
Epoch 50/50


In [11]:
df_features

| | feature_0 | feature_1 | feature_2 | feature_3 | feature_4 | feature_5 | feature_6 | feature_7 | feature_8 | feature_9 | feature_10 | feature_11 | feature_12 | feature_13 | feature_14 | feature_15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.226710 | 0.367379 | -0.453775 | -0.243337 | -0.379570 | 0.337105 | -0.258442 | -0.096716 | 0.249617 | 0.161003 | 0.261766 | -0.237449 | -0.425229 | -0.161112 | -0.234261 | -0.012255 |
| 1 | -0.392412 | 0.511957 | -0.595188 | -0.408742 | -0.518194 | 0.472658 | -0.407552 | -0.267419 | 0.411955 | 0.280689 | 0.442423 | -0.402998 | -0.538262 | -0.311342 | -0.356958 | -0.109674 |
| 2 | -0.363270 | 0.514233 | -0.609711 | -0.379850 | -0.529868 | 0.466887 | -0.395367 | -0.213084 | 0.396524 | 0.261208 | 0.404896 | -0.392904 | -0.568253 | -0.274818 | -0.347737 | -0.041594 |
| 3 | -0.367679 | 0.340414 | -0.403759 | -0.367892 | -0.332654 | 0.357605 | -0.325814 | -0.322821 | 0.376223 | 0.203409 | 0.430623 | -0.296424 | -0.280870 | -0.294212 | -0.271220 | -0.271321 |
| 4 | -0.392967 | 0.387700 | -0.458593 | -0.391254 | -0.385409 | 0.390036 | -0.353436 | -0.334596 | 0.399288 | 0.224156 | 0.448335 | -0.336498 | -0.339386 | -0.307971 | -0.292036 | -0.255892 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 608 | -0.368498 | 0.383428 | -0.457780 | -0.383372 | -0.376442 | 0.392817 | -0.358337 | -0.324440 | 0.378109 | 0.215161 | 0.441282 | -0.316237 | -0.335153 | -0.291110 | -0.282997 | -0.248043 |
| 609 | -0.154761 | 0.179125 | -0.194564 | -0.152249 | -0.139075 | 0.187430 | -0.128001 | -0.085523 | 0.136962 | 0.094085 | 0.190587 | -0.087723 | -0.152354 | -0.111654 | -0.121720 | -0.135936 |
| 610 | -0.292893 | 0.428690 | -0.482035 | -0.303205 | -0.418158 | 0.386230 | -0.297801 | -0.159050 | 0.278662 | 0.238146 | 0.325819 | -0.298994 | -0.462407 | -0.236915 | -0.282680 | -0.065743 |
| 611 | -0.255581 | 0.203358 | -0.252723 | -0.250916 | -0.190387 | 0.236246 | -0.209040 | -0.225981 | 0.263876 | 0.106747 | 0.317439 | -0.162832 | -0.133378 | -0.191363 | -0.162778 | -0.239692 |
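
The extracted features can then replace the original bands as input to a downstream model. The sketch below is one possible way to do this and is not part of the notebook: it uses random stand-in data with the same 16 `feature_i` columns and a synthetic target named `A`, and it picks scikit-learn's `Ridge` regressor purely as an example.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Stand-in for df_features (assumption): 16 latent features in [-1, 1]
rng = np.random.default_rng(0)
demo_features = pd.DataFrame(
    rng.uniform(-1, 1, size=(100, 16)),
    columns=[f"feature_{i}" for i in range(16)],
)
# Synthetic regression target that depends on one of the features
demo_target = pd.Series(
    2.0 + 0.5 * demo_features["feature_0"] + rng.normal(0, 0.05, size=100),
    name="A",
)

# Same split settings as in the example above
F_train, F_test, t_train, t_test = train_test_split(
    demo_features, demo_target, test_size=0.3, random_state=42
)

# Fit a simple regressor on the extracted features and score it on held-out rows
regressor = Ridge(alpha=1.0).fit(F_train, t_train)
r2_test = regressor.score(F_test, t_test)
print("R^2 on the held-out rows:", round(r2_test, 3))
```

With the real data, `demo_features` would be `df_features` and `demo_target` would be `data[target_col]`, provided both share the same row index.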
