## NDSI

When using a hyperspectral camera, we typically compare the signal recorded by the camera (i.e., reflected radiation at different wavelengths) with an independent measurement we made of the same samples, in order to determine the correlation and the mathematical relationship between the two.

Since each pixel in the camera image contains reflected radiation at 204 different wavelengths, we need to process the camera signal and determine which wavelengths correlate with the values of the Y parameter. One common method is to calculate the normalized difference between two wavelengths for every possible pair of wavelengths, using the formula:
 
$$\frac{a-b}{a+b}$$

where **a** and **b** are the reflectance values at two different wavelengths.
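
As a small worked illustration of the formula, the sketch below computes the normalized difference for every pair of bands in a single pixel. The reflectance values and band names are made up for the example, and `normalized_difference` is a hypothetical helper, not a function from this notebook.

```python
from itertools import combinations
from math import comb

# Hypothetical reflectance values for one pixel at three wavelengths (nm).
reflectance = {"450": 0.10, "650": 0.30, "850": 0.50}


def normalized_difference(a, b):
    """Return (a - b) / (a + b)."""
    return (a - b) / (a + b)


# One index value per unordered pair of bands.
pair_values = {
    (w1, w2): normalized_difference(reflectance[w1], reflectance[w2])
    for w1, w2 in combinations(reflectance, 2)
}

for pair, value in pair_values.items():
    print(pair, round(value, 3))

# Number of unordered pairs for a camera with 204 bands.
n_pairs_204 = comb(204, 2)
print(n_pairs_204)
```

With 204 bands the number of pairs is 204 × 203 / 2 = 20706, which matches the progress-bar total shown in the example further down.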

NDSI can be useful as a feature engineering technique in hyperspectral image analysis, helping to find the band pairs that correlate with the labels.
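
To make the feature engineering idea concrete, here is a minimal sketch that adds the NDSI of one chosen band pair as a new column. The `spectra` table, its values, and the helper `add_ndsi_feature` are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical reflectance table: one row per sample, one column per band.
spectra = pd.DataFrame({
    "450.0": [0.10, 0.20, 0.15],
    "850.0": [0.50, 0.40, 0.45],
})


def add_ndsi_feature(df, band1, band2):
    """Return a copy of df with an extra column holding the NDSI of two bands."""
    out = df.copy()
    out[f"NDSI_{band1}_{band2}"] = (df[band1] - df[band2]) / (df[band1] + df[band2])
    return out


features = add_ndsi_feature(spectra, "450.0", "850.0")
print(features)
```

The new column could then be passed to a regression model alongside, or instead of, the raw bands.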

Of course, the formula can be changed to any other mathematical combination. The code is written in a modular way so that it is easy to change it.
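
One possible way to swap the formula is to pass it in as a function. The sketch below assumes a hypothetical `index_func` parameter; the functions in this notebook hard-code the normalized difference, so this is only an illustration of the idea, with `simple_ratio` as an example alternative.

```python
import numpy as np


def normalized_difference(a, b):
    """(a - b) / (a + b), the formula used in this notebook."""
    return (a - b) / (a + b)


def simple_ratio(a, b):
    """An alternative two-band combination: a / b."""
    return a / b


def band_index(a, b, index_func=normalized_difference):
    """Apply any two-band formula to arrays of reflectance values."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return index_func(a, b)


a = [0.2, 0.4]
b = [0.1, 0.4]
nd = band_index(a, b)                             # normalized difference
ratio = band_index(a, b, index_func=simple_ratio)  # simple ratio
print(nd, ratio)
```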

* **The function NDSI** calculates the Normalized Difference Spectral Index (NDSI) for all possible pairs of bands in `df` and the correlation of each NDSI with the target column (`target_col`). The function returns a DataFrame containing the band pair, the Spearman correlation and p-value between the NDSI and the target column, and the absolute value of the correlation. The resulting DataFrame is sorted by the absolute value of the correlation in descending order.

* **The function NDSI_pearson** is the same as the NDSI function
but calculates the **Pearson correlation instead of the Spearman correlation**.
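
The choice between the two functions matters when the relationship is monotonic but not linear. The following sketch uses synthetic data (not from this notebook) to show that Spearman, which works on ranks, reports a perfect correlation where Pearson does not.

```python
import numpy as np
from scipy import stats

# Synthetic, strictly increasing but strongly non-linear relationship.
x = np.arange(1, 11, dtype=float)
y = np.exp(x)

spearman_corr, spearman_p = stats.spearmanr(x, y)
pearson_corr, pearson_p = stats.pearsonr(x, y)

print("Spearman:", spearman_corr)
print("Pearson: ", pearson_corr)
```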

**Important note**🔥🔥! This notebook addresses **regression problems only**

In [3]:
import pandas as pd
import numpy as np
from scipy import stats
import itertools
from itertools import combinations
from tqdm.notebook import tqdm

In [4]:
def load_data(csv_path, feature_col_start, feature_col_end, target_col):
    """
    Load a CSV file into a Pandas DataFrame, drop rows containing NaN, and keep only the feature and target columns.

    Parameters:
        csv_path (str): Path to the CSV file to load.
        feature_col_start, feature_col_end (int): Range of column indices to use as features (end is exclusive).
        target_col (str): Label of the column to use as target.

    Returns:
        new_df: A DataFrame containing the feature columns plus the target column.
    """
    # Load CSV into a Pandas DataFrame
    df = pd.read_csv(csv_path)

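    # Drop rows that contain NaN values
    df = df.dropna()

    # Extract the feature and target columns; .copy() avoids pandas'
    # SettingWithCopyWarning when the target column is added below
    new_df = df[df.columns[feature_col_start: feature_col_end]].copy()
    new_df[target_col] = df[target_col]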

    return new_df

In [2]:
def NDSI(df, target_col):
    '''
    Calculate the NDSI for every pair of bands and its Spearman correlation with the target.

    Parameters:
        df: a pandas DataFrame whose columns are the bands plus the target column; rows are samples.
        target_col: label of the target column in df.

    Returns:
        A DataFrame containing the band pair, the Spearman correlation and p-value between the NDSI
        and the target, and the absolute value of the correlation, sorted by it in descending order.
    '''
    # Extract the labels column
    y = df[target_col].values
    # delete y column from features df
    df = df.drop(target_col, axis = 1)
    # Convert columns names to str
    df.columns = df.columns.map(str)
    bands_list = df.columns

    # All possible pairs of columns
    all_pairs = list(itertools.combinations(bands_list, 2))

    # Calculate the NDSI
    corrs = np.zeros(len(all_pairs))  # array for filling with correlation values
    pvals = np.zeros(len(all_pairs))  # array for filling with p values

    # Use tqdm to show the progress bar
    for index, pair in tqdm(enumerate(all_pairs), total=len(all_pairs), desc="Calculating NDSI"):
        a = df[pair[0]].values
        b = df[pair[1]].values
        Norm_index = (a-b)/(a+b)
        # Spearman correlation and p value
        corr, pval = stats.spearmanr(Norm_index, y)
        corrs[index] = corr
        pvals[index] = pval

    # Convert to DataFrame
    col1 = [tple[0] for tple in all_pairs]  # column of the first wavelength
    col2 = [tple[1] for tple in all_pairs]  # column of the second wavelength
    index_col = [f"{tple[0]},{tple[1]}" for tple in all_pairs]  # index column
    data = {'band1': col1, "band2": col2, 'Spearman_Corr': corrs, 'p_value': pvals}
    df_results = pd.DataFrame(data=data, index=index_col)
    df_results["Abs_Spearman_Corr"] = df_results["Spearman_Corr"].abs()
    return df_results.sort_values('Abs_Spearman_Corr',ascending=False)

In [9]:
def NDSI_pearson(df, target_col):
    '''
    Same as the NDSI() function, but calculates the Pearson correlation
    instead of the Spearman correlation.
    '''
    # Extract the labels column
    y = df[target_col].values
    # delete y column from features df
    df = df.drop(target_col, axis = 1)
    # Convert columns names to str
    df.columns = df.columns.map(str)
    bands_list = df.columns

    # All possible pairs of columns
    all_pairs = list(itertools.combinations(bands_list, 2))

    # Calculate the NDSI
    corrs = np.zeros(len(all_pairs))  # array for filling with correlation values
    pvals = np.zeros(len(all_pairs))  # array for filling with p values

    # Use tqdm to show the progress bar
    for index, pair in tqdm(enumerate(all_pairs), total=len(all_pairs), desc="Calculating NDSI"):
        a = df[pair[0]].values
        b = df[pair[1]].values
        Norm_index = (a-b)/(a+b)
        # Pearson correlation and p value
        corr, pval = stats.pearsonr(Norm_index, y)
        corrs[index] = corr
        pvals[index] = pval

    # Convert to DataFrame
    col1 = [tple[0] for tple in all_pairs]  # column of the first wavelength
    col2 = [tple[1] for tple in all_pairs]  # column of the second wavelength
    index_col = [f"{tple[0]},{tple[1]}" for tple in all_pairs]  # index column
    data = {'band1': col1, "band2": col2, 'Pearson_Corr': corrs, 'p_value': pvals}
    df_results = pd.DataFrame(data=data, index=index_col)
    df_results["Abs_Pearson_Corr"] = df_results["Pearson_Corr"].abs()
    return df_results.sort_values('Abs_Pearson_Corr',ascending=False)
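
Since the two functions above differ only in the correlation call, they could be merged into one. The sketch below is an assumption about how that might look, not the notebook's own code: `ndsi_correlations` and its `method` parameter are hypothetical names, and the demo data is random rather than real spectra.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from scipy import stats


def ndsi_correlations(df, target_col, method="spearman"):
    """NDSI for all band pairs and its correlation with the target column."""
    corr_funcs = {"spearman": stats.spearmanr, "pearson": stats.pearsonr}
    if method not in corr_funcs:
        raise ValueError(f"method must be one of {list(corr_funcs)}")
    corr_func = corr_funcs[method]

    y = df[target_col].to_numpy()
    bands = df.drop(columns=target_col)
    bands.columns = bands.columns.map(str)

    rows = []
    for band1, band2 in combinations(bands.columns, 2):
        a = bands[band1].to_numpy()
        b = bands[band2].to_numpy()
        index_values = (a - b) / (a + b)
        corr, pval = corr_func(index_values, y)
        rows.append((band1, band2, corr, pval))

    corr_name = f"{method.capitalize()}_Corr"
    results = pd.DataFrame(rows, columns=["band1", "band2", corr_name, "p_value"])
    results.index = results["band1"] + "," + results["band2"]
    results[f"Abs_{corr_name}"] = results[corr_name].abs()
    return results.sort_values(f"Abs_{corr_name}", ascending=False)


# Tiny synthetic demo: 30 samples, 4 bands, random target (not real data).
rng = np.random.default_rng(0)
demo = pd.DataFrame(
    rng.uniform(0.1, 0.9, size=(30, 4)),
    columns=[400.0, 500.0, 600.0, 700.0],
)
demo["A"] = rng.normal(size=30)
demo_results = ndsi_correlations(demo, "A", method="pearson")
print(demo_results)
```

The output columns follow the same naming as the two original functions (`Spearman_Corr` or `Pearson_Corr`, `p_value`, and the `Abs_` column).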

## Example

In [5]:
# Define input parameters
csv_path = '/content/data.csv'
feature_idx_i, feature_idx_f = 16, -2  # start and end (exclusive) column indices of the features
target_col = 'A'  # label column (regression)

In [6]:
# Load data
data = load_data(csv_path, feature_idx_i,feature_idx_f, target_col)
data.head()



| | 397.32 | 400.2 | 403.09 | 405.97 | 408.85 | 411.74 | 414.63 | 417.52 | 420.4 | 423.29 | ... | 978.88 | 981.96 | 985.05 | 988.13 | 991.22 | 994.31 | 997.4 | 1000.49 | 1003.58 | A |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.179808 | 0.152106 | 0.129191 | 0.115715 | 0.107613 | 0.102074 | 0.101501 | 0.099727 | 0.096248 | 0.096929 | ... | 0.458213 | 0.464172 | 0.45852 | 0.462214 | 0.467727 | 0.467549 | 0.466043 | 0.471523 | 0.447471 | 2.01727 |
| 1 | 0.221156 | 0.186298 | 0.160032 | 0.146194 | 0.136323 | 0.128331 | 0.124891 | 0.12185 | 0.116359 | 0.114495 | ... | 0.71797 | 0.717748 | 0.722268 | 0.726763 | 0.738159 | 0.741649 | 0.739217 | 0.762054 | 0.622104 | 1.872474 |
| 2 | 0.221893 | 0.185626 | 0.164002 | 0.154074 | 0.146511 | 0.137888 | 0.133002 | 0.13092 | 0.128935 | 0.126446 | ... | 0.670528 | 0.675308 | 0.669332 | 0.689363 | 0.685825 | 0.698885 | 0.689815 | 0.705207 | 0.580815 | 2.043818 |
| 3 | 0.162126 | 0.129779 | 0.104428 | 0.089685 | 0.080833 | 0.075142 | 0.068085 | 0.063978 | 0.058188 | 0.054447 | ... | 0.57067 | 0.574177 | 0.580435 | 0.579218 | 0.582644 | 0.592902 | 0.597743 | 0.609343 | 0.480618 | 2.123489 |
| 4 | 0.206857 | 0.164631 | 0.137415 | 0.118823 | 0.102912 | 0.09785 | 0.090029 | 0.084146 | 0.07765 | 0.072445 | ... | 0.602451 | 0.609186 | 0.624415 | 0.62275 | 0.633371 | 0.64097 | 0.649146 | 0.659158 | 0.5361 | 2.122085 |


In [7]:
df_NDSI = NDSI(data,'A')

Calculating NDSI:   0%|          | 0/20706 [00:00<?, ?it/s]

In [8]:
df_NDSI

| | band1 | band2 | Spearman_Corr | p_value | Abs_Spearman_Corr |
|---|---|---|---|---|---|
| 397.32,975.79 | 397.32 | 975.79 | -0.445571 | 3.118820e-31 | 0.445571 |
| 397.32,985.05 | 397.32 | 985.05 | -0.445262 | 3.465633e-31 | 0.445262 |
| 397.32,960.4 | 397.32 | 960.4 | -0.444744 | 4.135920e-31 | 0.444744 |
| 397.32,966.55 | 397.32 | 966.55 | -0.444406 | 4.641126e-31 | 0.444406 |
| 400.2,975.79 | 400.2 | 975.79 | -0.444404 | 4.644916e-31 | 0.444404 |
| ... | ... | ... | ... | ... | ... |
| 657.87,699.6 | 657.87 | 699.6 | -0.000056 | 9.988970e-01 | 0.000056 |
| 437.76,539.75 | 437.76 | 539.75 | -0.000047 | 9.990788e-01 | 0.000047 |
| 572.07,708.57 | 572.07 | 708.57 | 0.000043 | 9.991527e-01 | 0.000043 |
| 847.25,883.79 | 847.25 | 883.79 | -0.000039 | 9.992400e-01 | 0.000039 |


Now the same with Pearson correlation:

In [10]:
df_NDSI_pearson = NDSI_pearson(data,'A')

Calculating NDSI:   0%|          | 0/20706 [00:00<?, ?it/s]

In [11]:
df_NDSI_pearson

| | band1 | band2 | Pearson_Corr | p_value | Abs_Pearson_Corr |
|---|---|---|---|---|---|
| 397.32,975.79 | 397.32 | 975.79 | -0.460933 | 1.425293e-33 | 0.460933 |
| 397.32,966.55 | 397.32 | 966.55 | -0.460910 | 1.437424e-33 | 0.460910 |
| 397.32,960.4 | 397.32 | 960.4 | -0.460477 | 1.679556e-33 | 0.460477 |
| 400.2,975.79 | 400.2 | 975.79 | -0.459759 | 2.172725e-33 | 0.459759 |
| 397.32,963.47 | 397.32 | 963.47 | -0.459503 | 2.381089e-33 | 0.459503 |
| ... | ... | ... | ... | ... | ... |
| 513.4,583.85 | 513.4 | 583.85 | 0.000131 | 9.974206e-01 | 0.000131 |
| 747.54,841.18 | 747.54 | 841.18 | -0.000077 | 9.984776e-01 | 0.000077 |
| 859.42,874.64 | 859.42 | 874.64 | 0.000075 | 9.985298e-01 | 0.000075 |
| 472.59,580.9 | 472.59 | 580.9 | 0.000060 | 9.988192e-01 | 0.000060 |
