# Identifying the Optimal Wavelength for Studying my Wine Library

Following on from [Investigating Shortening Runtimes](2023-03-14_investigating-shortening-runtimes.ipynb).

The question of which wavelength is optimal, that is, which wavelength contains the most information, has been on my mind for a while. Rather than setting my instrument to record traces at a range of wavelengths and bloating my data directory, I can just pull the spectrum and that one optimal wavelength. But how do you define 'optimal', or 'most information'? I will define it as the wavelength with the highest ratio of average peak height to average baseline gradient, that is, the tallest peaks over the flattest baseline. The idea is that any variation in the signal should come from sample-specific peaks and not from random background noise or baseline drift.

## Set up Environment

In [None]:
%load_ext autoreload
%autoreload 2

import sys

import os

import pandas as pd

import numpy as np

from scipy.signal import find_peaks

pd.options.plotting.backend = "plotly"

import plotly.graph_objs as go

from plotly.subplots import make_subplots

from sklearn.preprocessing import MinMaxScaler

from pybaselines import Baseline

# adds root dir 'wine_analyis_hplc_uv' to path.

sys.path.insert(0, os.path.abspath(os.path.join(os.getcwd(), "../")))

from agilette import agilette_core as ag

lib = ag.Agilette("/Users/jonathan/0_jono_data").library

In [None]:
lib_df = lib.data_table()

As in the preceding notebook, I will use the latest De Bertoli Cab Merlot sample `2023-03-07_DEBERTOLI_CS_001.D`.

In [None]:
run = lib.single_runs[lib_df.loc[3].run_name]

In [None]:
run.load_spectrum()

In [None]:
run.spectrum.line_plot()

Before going further, for these calculations, it is necessary to scale all the data.

In [None]:
scaler = MinMaxScaler()

scaled_df = scaler.fit_transform(run.spectrum.uv_data.values)

scaled_uv_data = pd.DataFrame(
    scaled_df, columns=run.spectrum.uv_data.columns, index=run.spectrum.uv_data.index
)
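
As a rough illustration of what the scaling step does, the sketch below reproduces min-max scaling with plain pandas on a made-up frame (the `toy` data and its values are hypothetical). With its default `feature_range` of (0, 1), `MinMaxScaler` scales each column independently, so every wavelength's own maximum becomes 1, and because `mins` is still a column at this point, the time axis is rescaled as well.

```python
import pandas as pd

# Hypothetical toy data: a time column plus two wavelength columns.
toy = pd.DataFrame(
    {
        "mins": [0.0, 1.0, 2.0, 3.0],
        "254": [10.0, 20.0, 30.0, 50.0],
        "280": [1.0, 1.5, 2.0, 3.0],
    }
)

# Min-max scaling: each column is mapped to [0, 1] using its own min and max.
toy_scaled = (toy - toy.min()) / (toy.max() - toy.min())

print(toy_scaled)
```

One consequence worth keeping in mind is that absolute intensity differences between wavelengths are removed by this step, so later peak heights are relative to each wavelength's own tallest point.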

## Calculating Average Baseline Gradient

First fit the baseline for each wavelength of the spectrum.

### Calculate Baselines

In [None]:
def baseline_calculator(column):
    baseline_fitter = Baseline(column.index)
    baseline_y = baseline_fitter.iasls(column.values)

    return baseline_y[0]


scaled_uv_data = scaled_uv_data.set_index("mins")

baselines = scaled_uv_data.apply(baseline_calculator)

### Calculate Average Gradient

As an example of what we are trying to achieve, let's plot the 254 nm chromatogram with its fitted baseline.

In [None]:
baseline_254 = baselines["254"]

fig_1 = go.Figure()

chrom_trace = go.Scatter(
    x=scaled_uv_data["254"].index.values, y=scaled_uv_data["254"].values, name="chrom"
)

baseline_trace = go.Scatter(x=baseline_254.index, y=baseline_254, name="baseline")
fig_1.add_trace(chrom_trace)

fig_1.add_trace(baseline_trace)

fig_1.show()

To calculate the average gradient we can simply use `np.gradient()` which returns a numpy array, then take the mean of that array:

In [None]:
def calc_av_grad(column):
    grad = np.gradient(column)
    return np.mean(grad)


av_baseline_grad = baselines.apply(calc_av_grad)

av_baseline_grad.plot()

So, as expected, the further you get from the methanol cutoff, the lower the average baseline gradient.
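
One caveat about `np.mean(np.gradient(...))` is that the gradients are signed, so a baseline that rises and then falls back can average out to nearly zero even though it is far from flat. The sketch below compares the signed mean with the mean of the absolute gradient on two made-up baselines. The function names and data are hypothetical, and the absolute-value variant is only a possible alternative, not the calculation used in this notebook.

```python
import numpy as np

x = np.linspace(0.0, 1.0, 101)

# Hypothetical baselines: one drifts up and back down, one rises steadily.
hump = np.sin(np.pi * x)  # starts and ends near 0
ramp = 0.5 * x            # steady drift from 0 to 0.5


def mean_signed_grad(y):
    # Same calculation as calc_av_grad: positive and negative slopes cancel.
    return np.mean(np.gradient(y))


def mean_abs_grad(y):
    # Alternative: average slope magnitude, so rises and falls both count.
    return np.mean(np.abs(np.gradient(y)))


signed_hump, abs_hump = mean_signed_grad(hump), mean_abs_grad(hump)
signed_ramp, abs_ramp = mean_signed_grad(ramp), mean_abs_grad(ramp)

print(f"hump: signed={signed_hump:.2e}, abs={abs_hump:.2e}")
print(f"ramp: signed={signed_ramp:.2e}, abs={abs_ramp:.2e}")
```

For the steady ramp both measures agree, while for the hump only the absolute version reports any fluctuation.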

The average baseline gradient is now calculated, so on to the average peak height.

## Find Peak Maxima

### Correct the Baseline

In [None]:
def baseline_correction(column):
    baseline_fitter = Baseline(x_data=column.index.values)
    baseline_y = baseline_fitter.iasls(column.values)[0]

    corrected_column = column - baseline_y
    return corrected_column

In [None]:
baseline_corrected_data = scaled_uv_data.apply(baseline_correction)

baseline_corrected_data[190].plot()

### Get Peak Height Values

To get the peak height values, we can use `scipy.signal.find_peaks`. A height of 0.05 (on the scaled, baseline-corrected data) will be the minimum requirement, and all other settings will be left at their defaults.

In [None]:
def peak_finder(column):
    peaks = find_peaks(column, height=0.05)

    # peaks[0] holds the integer positions of the peak maxima.

    peaks_x = column.index[peaks[0]]

    # peaks[1] is a dict with information about the peaks, including peak heights.
    peaks_y = peaks[1]["peak_heights"]
    return peaks_x, peaks_y


found_peaks = baseline_corrected_data.apply(peak_finder)

To verify that the peak finder algorithm worked as expected, let's plot the peaks.

In [None]:
peak_finder_190_fig = go.Figure()

chromatogram_trace_190 = go.Scatter(
    x=baseline_corrected_data[190].index,
    y=baseline_corrected_data[190].values,
    name="190 nm",
    mode="lines",
)

peak_trace = go.Scatter(
    x=found_peaks[190][0], y=found_peaks[190][1], name="peaks", mode="markers"
)

peak_finder_190_fig.add_traces([chromatogram_trace_190, peak_trace])

peak_finder_190_fig.show()

And now to calculate the average peak height for each wavelength:

In [None]:
def peak_av_calc(column):
    return column[1].mean()


av_peak_height = found_peaks.apply(peak_av_calc)
av_peak_height.plot()

Now that we have the average peak heights and the average baseline gradients, I just need to find the wavelength with the highest ratio of average peak height to average baseline gradient.

In [None]:
peak_height_baseline_grad_ratio = av_peak_height / av_baseline_grad
peak_height_baseline_grad_ratio
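
To pick out the winning wavelength programmatically, `Series.idxmax()` can be applied to the ratio. The sketch below uses invented numbers (the wavelengths, heights and gradients are hypothetical) and also shows why the sign of the gradient matters: if the average gradient is negative at some wavelength, the signed ratio goes negative there and that wavelength can never win, so taking the magnitude first may be a safer choice. That is an assumption on my part, not something this notebook does.

```python
import pandas as pd

# Hypothetical per-wavelength summaries (values invented for illustration).
wavelengths = pd.Index([210, 254, 280], name="nm")
av_height = pd.Series([0.30, 0.25, 0.20], index=wavelengths)
av_grad = pd.Series([2e-4, -5e-5, 1e-4], index=wavelengths)

# Signed ratio: the negative gradient at 254 nm makes its ratio negative.
signed_ratio = av_height / av_grad

# Magnitude-based ratio: only the size of the drift counts.
abs_ratio = av_height / av_grad.abs()

print("best by signed ratio:", signed_ratio.idxmax())
print("best by magnitude ratio:", abs_ratio.idxmax())
```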

In [None]:
result_df = pd.DataFrame(
    {
        "av_heights": av_peak_height,
        "av_baseline_grad": av_baseline_grad,
        "ratio": peak_height_baseline_grad_ratio,
    }
)

In [None]:
print(run)

In [None]:
found_peaks

In [None]:
def found_peaks_extractor(column):
    return len(column[1])


number_of_peaks_per_nm = found_peaks.apply(found_peaks_extractor)

In [None]:
subplot_names = list(result_df.columns)

subplot_names.append("")

subplot_names.append("# peaks per nm")

fig = make_subplots(rows=2, cols=3, subplot_titles=subplot_names)

for idx, column in enumerate(result_df.columns):
    fig.add_trace(
        go.Scatter(x=result_df.index, y=result_df[column], mode="lines", name=column),
        row=1,
        col=idx + 1,
    )

# The layout only needs setting once, so it sits outside the loop.
fig.update_layout(title=run.name, showlegend=False)

fig.add_trace(
    go.Scatter(x=number_of_peaks_per_nm.index, y=number_of_peaks_per_nm.values),
    row=2,
    col=2,
)
fig.show()

Interesting result. 262 nm appears to be the best for the current settings, but I get the feeling that my peak detection needs more nuance. My current process is:

1. MinMax scale the whole data set.
2. Calculate the baseline.
3. Calculate the average baseline gradient.
4. Subtract the baseline from the signals.
5. Find the peaks and calculate the average peak height.
6. Calculate the ratio.
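
The process above can be sketched end to end on synthetic data. This is a minimal sketch under stated assumptions and not the notebook's actual pipeline: the two-wavelength data set is made up, `endpoint_baseline` is a hypothetical stand-in that draws a straight line between the first and last points instead of using the `iasls` fit, and the gradient is averaged by magnitude rather than signed.

```python
import numpy as np
import pandas as pd
from scipy.signal import find_peaks

t = np.linspace(0.0, 10.0, 1001)
peaks = np.exp(-((t - 3.0) ** 2) / 0.02) + 0.5 * np.exp(-((t - 7.0) ** 2) / 0.02)

# Hypothetical data: identical peaks, but "210" has ten times the baseline drift.
raw = pd.DataFrame(
    {"210": peaks + 0.05 * t, "254": peaks + 0.005 * t},
    index=pd.Index(t, name="mins"),
)

# Min-max scale each wavelength column to [0, 1].
scaled = (raw - raw.min()) / (raw.max() - raw.min())


def endpoint_baseline(col):
    # Stand-in for iasls (assumption): straight line from first to last point.
    line = np.linspace(col.iloc[0], col.iloc[-1], len(col))
    return pd.Series(line, index=col.index)


# Fit the baselines and average the magnitude of their gradient.
baselines = scaled.apply(endpoint_baseline)
av_grad = baselines.apply(lambda col: np.mean(np.abs(np.gradient(col.values))))

# Subtract the baselines, find the peaks and average their heights.
corrected = scaled - baselines
av_height = corrected.apply(
    lambda col: find_peaks(col.values, height=0.05)[1]["peak_heights"].mean()
)

# Ratio of average peak height to average baseline gradient.
ratio = av_height / av_grad
best = ratio.idxmax()

print(ratio)
print("best wavelength:", best)
```

With this toy input the low-drift `"254"` column comes out on top, which is the behaviour the ratio is meant to capture.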


I will continue this line of investigation in [investigating-optimal-wavelength-for-all-runs](2023-03-15_investigating-optimal-wavelength-for-all-runs.ipynb).