# Time Series Based Feature Selection for HSI

This notebook code performs feature selection on hyperspectral images using a time series approach. Hyperspectral images contain a large number of spectral bands, making it difficult to analyze and extract meaningful information. Therefore, feature selection is necessary to reduce the dimensionality of the data while retaining relevant information.

The approach used in this notebook is based on the featurize package in Python, which calculates a set of statistical features from time series data. These features include amplitude, maximum, median, skewness, and many others. By applying featurize to each spectral band as a time series, a set of features is generated for each band, reducing the dimensionality of the data.

The output of the code is a set of selected features for each spectral band, which can be used for further analysis and ML regression/Classification of hyperspectral images.

In [1]:
import numpy as np
import pandas as pd

In [2]:
def load_data(csv_path, feature_col_start, feature_col_end, target_col):
    """
    Load a CSV file into a Pandas DataFrame,drop Nan, and separate the feature and target columns.

    Parameters:
        csv_path (str): Path to the CSV file to load.
        feature_col_start, feature_col_end, (ints): Range of column indices to use as features.
        target_col (str or int): Name or index of the column to use as target.

    Returns:
        new_df: A df containing the features + labels DataFrame.
    """
    # Load CSV into a Pandas DataFrame
    df = pd.read_csv(csv_path)

    # drop nan
    df = df.dropna()

    # Extract the feature and target columns
    new_df = df[df.columns[feature_col_start: feature_col_end]]
    new_df[target_col] = df[target_col]

    return new_df

In [7]:
def time_serise_feature_extraction(df, target_col):
    """
    Extract time series features from a DataFrame with hyperspectral data.

    Parameters
    ----------
    df : pandas DataFrame
        The input DataFrame with hyperspectral data.
    target_col : str
        The name of the target column in the DataFrame.

    Returns
    -------
    pandas DataFrame
        The output DataFrame with computed features and the target column.
    """
    # Save the target data 
    target_col_data = df[target_col].values

    # Get the hyperspectral data columns
    data_cols = [col for col in df.columns if col != target_col]

    df = df[data_cols]

    # calculate means
    means = df.mean(axis=1).values

    # calculate medians
    medians = df.median(axis=1).values

    # calculate standard deviations
    stds = df.std(axis=1).values

    # calculate percent of data beyond 1 std
    percent_beyond_1_std = df.apply(lambda x: np.sum(np.abs(x - x.mean()) > x.std()) / len(x), axis=1)
    percent_beyond_1_std = percent_beyond_1_std.values

    # calculate amplitudes
    amplitudes = df.apply(lambda row: np.ptp(row), axis=1)
    amplitudes = amplitudes.values

    # calculate max values
    maxs = df.max(axis=1).values

    # calculate min values
    mins = df.max(axis=1).values

    # calculate max slopes
    max_slopes = df.apply(lambda row: np.max(np.abs(np.diff(row))), axis=1)
    max_slopes = max_slopes.values

    # calculate median absolute deviations (MAD)
    mads = df.apply(lambda row: np.median(np.abs(row - np.median(row))), axis=1)
    mads = mads.values

    # calculate percent close to median
    percent_close_to_median = df.apply(lambda x: np.sum(np.abs(x - np.median(x)) < 0.5 * np.median(x)) / len(x) * 100, axis=1)
    percent_close_to_median = percent_close_to_median.values

    # calculate skewness
    skewness = df.apply(lambda x: x.skew(), axis=1)
    skewness = skewness.values

    # calculate flux percentile
    flux_percentile = df.quantile(q=0.9, axis=1)

    # calculate percent difference in flux percentile
    percent_difference = flux_percentile.pct_change().fillna(0)
    percent_difference = percent_difference.values

    # define the weights as a numpy array of the column names
    wavelengths = np.array(df.columns)

    # convert the column names to floats if necessary
    if wavelengths.dtype == 'object':
        wavelengths = wavelengths.astype(float)

    # calculate the weighted average for each row in the DataFrame
    weighted_average = df.apply(lambda row: np.average(row, weights=wavelengths), axis=1)
    weighted_average = weighted_average.values

    # create a new DataFrame to store the parameter values for each row
    parameters_df = pd.DataFrame({
        'Mean': means,
        'Median': medians,
        'Std': stds,
        'Percent_Beyond_Std': percent_beyond_1_std,
        'Amplitude': amplitudes,
        'Max': maxs,
        'Min': mins,
        'Max_Slope': max_slopes,
        'MAD': mads,
        'Percent_Close_to_Median': percent_close_to_median,
        'Skew': skewness,
        'Flux_Percentile': flux_percentile.values,
        'Percent_Difference_Flux_Percentile': percent_difference,
        'Weighted_Average': weighted_average
    })
    # Add the target column
    parameters_df[target_col] = target_col_data
    
    return parameters_df


## Example

In [3]:
# Define input parameters
csv_path = '/content/data.csv'
feature_idx_i,feature_idx_f = 16,-2
target_col = 'A'

In [4]:
# Load data
data = load_data(csv_path, feature_idx_i,feature_idx_f, target_col)
# Add ID column to the procces
# data['ID'] = data.index
# y = data[target_col]
# data = data.drop(target_col,1)
data.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  new_df[target_col] = df[target_col]


Unnamed: 0,397.32,400.2,403.09,405.97,408.85,411.74,414.63,417.52,420.4,423.29,...,978.88,981.96,985.05,988.13,991.22,994.31,997.4,1000.49,1003.58,A
0,0.179808,0.152106,0.129191,0.115715,0.107613,0.102074,0.101501,0.099727,0.096248,0.096929,...,0.458213,0.464172,0.45852,0.462214,0.467727,0.467549,0.466043,0.471523,0.447471,2.01727
1,0.221156,0.186298,0.160032,0.146194,0.136323,0.128331,0.124891,0.12185,0.116359,0.114495,...,0.71797,0.717748,0.722268,0.726763,0.738159,0.741649,0.739217,0.762054,0.622104,1.872474
2,0.221893,0.185626,0.164002,0.154074,0.146511,0.137888,0.133002,0.13092,0.128935,0.126446,...,0.670528,0.675308,0.669332,0.689363,0.685825,0.698885,0.689815,0.705207,0.580815,2.043818
3,0.162126,0.129779,0.104428,0.089685,0.080833,0.075142,0.068085,0.063978,0.058188,0.054447,...,0.57067,0.574177,0.580435,0.579218,0.582644,0.592902,0.597743,0.609343,0.480618,2.123489
4,0.206857,0.164631,0.137415,0.118823,0.102912,0.09785,0.090029,0.084146,0.07765,0.072445,...,0.602451,0.609186,0.624415,0.62275,0.633371,0.64097,0.649146,0.659158,0.5361,2.122085


In [8]:
ts_df = time_serise_feature_extraction(data, target_col)
ts_df

Unnamed: 0,Mean,Median,Std,Percent_Beyond_Std,Amplitude,Max,Min,Max_Slope,MAD,Percent_Close_to_Median,Skew,Flux_Percentile,Percent_Difference_Flux_Percentile,Weighted_Average,A
0,0.302653,0.139873,0.214796,0.406863,0.504694,0.587329,0.587329,0.035579,0.056398,52.450980,0.228641,0.577487,0.000000,0.348604,2.017270
1,0.453233,0.190473,0.350251,0.436275,0.816050,0.899437,0.899437,0.139949,0.105816,46.078431,0.214852,0.890146,0.541413,0.527716,1.872474
2,0.445331,0.195120,0.325547,0.421569,0.769176,0.868950,0.868950,0.124392,0.093269,52.450980,0.233944,0.857434,-0.036749,0.514058,2.043818
3,0.348508,0.108503,0.316918,0.357843,0.729252,0.758612,0.758612,0.128726,0.078513,19.607843,0.236127,0.750027,-0.125266,0.415496,2.123489
4,0.382094,0.136445,0.337898,0.411765,0.788610,0.824365,0.824365,0.123058,0.100058,20.588235,0.233535,0.812310,0.083040,0.452947,2.122085
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
608,0.373296,0.138763,0.325593,0.333333,0.753457,0.798247,0.798247,0.104593,0.088250,20.588235,0.251307,0.789044,-0.041203,0.442041,4.258127
609,0.170274,0.060697,0.153192,0.387255,0.351165,0.367975,0.367975,0.048880,0.043340,19.607843,0.233930,0.360856,-0.542667,0.204143,1.826188
610,0.348698,0.168809,0.269712,0.549020,0.628439,0.682615,0.682615,0.085302,0.112209,27.941176,0.170638,0.675234,0.871200,0.407893,0.933424
611,0.234915,0.060162,0.224646,0.308824,0.523731,0.537261,0.537261,0.044733,0.045961,20.588235,0.276560,0.529901,-0.215233,0.282119,2.009618
