Continuing from [Identifying Optimal Wavelength](2023-03-14_identifying_optimal_wavelength.ipynb), I will use the methods developed there to aggregate the results for all 2.5% avantor runs so far. Presumably the same approach should work for the other methods as well.

## Set up Environment

In [None]:
%load_ext autoreload
%autoreload 2

import sys
import os
import pandas as pd
import numpy as np
from scipy.signal import find_peaks

pd.options.plotting.backend = "plotly"
from sklearn.preprocessing import MinMaxScaler
from pybaselines import Baseline

# adds root dir 'wine_analyis_hplc_uv' to path.

sys.path.insert(0, os.path.abspath(os.path.join(os.getcwd(), "../")))

from agilette import agilette_core as ag

lib = ag.Agilette("/Users/jonathan/0_jono_data").library

In [None]:
lib_df = lib.data_table()
lib_df.head()

As in the preceding notebook, I will use the latest De Bertoli Cab Merlot sample `2023-03-07_DEBERTOLI_CS_001.D`.

## Planning the Experiment

The way to do this is to stay within a DataFrame environment.

1. Form a DF of: run name | uv_data object.
2. For each run: scale, baseline-adjust, calculate the average baseline gradient and the peak heights, then take their ratio.
3. Plot the maxima of the above values for each run. Probably drop wavelengths above 380 nm.

## Filtering Runs

In [None]:
runs = lib_df[
    (lib_df["method"].str.contains("2_1*"))
    & ~(lib_df["sample_name"].str.contains("uracil*"))
    & ~(lib_df["uv_files"].apply(len) == 0)
]

runs.head()
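
A note on the filter above: `Series.str.contains` treats its pattern as a regular expression by default, so a trailing `*` is a quantifier rather than a glob wildcard. The sketch below illustrates the difference on made-up method names (the names are hypothetical and not taken from the library).

```python
import pandas as pd

# Hypothetical method names, for illustration only.
methods = pd.Series(["0_2_1_avantor", "0_2_2_avantor", "1_0_other"])

# Default regex=True: "2_1*" means "2_" followed by zero or more "1",
# so any string containing "2_" matches.
as_regex = methods.str.contains("2_1*")

# regex=False: plain substring match on "2_1".
as_literal = methods.str.contains("2_1", regex=False)

print(pd.DataFrame({"method": methods, "regex": as_regex, "literal": as_literal}))
```

If only methods containing the literal text `2_1` are wanted, the `regex=False` form is probably the safer choice.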

## Prepare the Data

### Assemble the runs_uv_data DF.

In [None]:
all_data = lib.all_data()


def uv_data_extractor(run_name):
    # Look up the run's data directory by name and load its UV spectrum.
    data_dir = all_data[run_name]
    uv_data = data_dir.load_spectrum().uv_data

    return uv_data


uv_data_series = runs["run_name"].apply(uv_data_extractor)

In [None]:
# Copy so that adding a column does not write into a slice of lib_df.
runs_uv = runs.copy()

runs_uv["uv_data"] = uv_data_series

runs_uv = runs_uv.drop(
    ["uv_files", "sequence", "ch_files", "sample_name", "desc"], axis=1
)

In [None]:
runs_uv.head()

### Scale

In [None]:
runs_uv["uv_data"].iloc[0].columns

In [None]:
runs_uv["uv_data"].iloc[0]

In [None]:
# manually iterating over each row in the top level df then each column in the uv_data df and applying fit_transform()

# scaler = MinMaxScaler()

# for idx, row in runs_uv.iterrows():
#     for column in row['uv_data']:
#         scaled_column = scaler.fit_transform(row['uv_data'][column])

In [None]:
runs_uv.loc[3, "uv_data"]

In [None]:
scaler = MinMaxScaler()

# How do I access an individual dataframe.

uv_df_1 = runs_uv.loc[3, "uv_data"]

# test applying minmaxscaler to a single dataframe.

scaled_uv = scaler.fit_transform(uv_df_1)
scaled_uv_df = pd.DataFrame(scaled_uv, columns=uv_df_1.columns, index=uv_df_1.index)
scaled_uv_df

In [None]:
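# Apply the same scaling to every run's uv_data DataFrame.


def df_scaler(uv_df):
    # fit_transform returns a NumPy array, so rebuild the DataFrame
    # with the original column labels and index.
    scaled = scaler.fit_transform(uv_df)

    scaled_df = pd.DataFrame(scaled, columns=uv_df.columns, index=uv_df.index)

    return scaled_df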


runs_uv["scaled_uv_data"] = runs_uv["uv_data"].apply(df_scaler)
runs_uv["scaled_uv_data"][10]

Scaling has been achieved successfully.
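
For reference, with its default `feature_range` of (0, 1), `MinMaxScaler` maps each column independently to (x - min) / (max - min). The sketch below reproduces that arithmetic in plain pandas on a made-up frame (the values and column names are hypothetical), rather than calling sklearn as above.

```python
import pandas as pd

# Hypothetical stand-in for one run's uv_data: rows are time points,
# columns are wavelengths.
uv = pd.DataFrame({"190": [0.0, 5.0, 10.0], "254": [2.0, 4.0, 3.0]})

# Column-wise min-max scaling to [0, 1]; pandas broadcasts the
# per-column min and max across the rows.
scaled = (uv - uv.min()) / (uv.max() - uv.min())

print(scaled)
```

Because every column is scaled on its own, each wavelength ends up with a maximum of 1, so relative intensities between wavelengths are not preserved. Any non-signal column that is present, such as `mins`, is rescaled as well.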

### Baseline Correct

In [None]:
runs_uv.reset_index(drop=True)[0:3]

In [None]:
# runs_uv_test = runs_uv.reset_index(drop = True)[0:1]


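def baseline_calculator(column):
    # Fit an IASLS baseline to one wavelength column, using the row index as x.
    baseline_fitter = Baseline(column.index)
    # iasls() returns (baseline, params); only the baseline array is needed.
    baseline_y = baseline_fitter.iasls(column.values)

    return baseline_y[0]


def get_cols(column):
    # `column` is one run's scaled DataFrame; fit a baseline to each of its columns.
    baseline = column.apply(baseline_calculator)
    return baseline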


runs_uv["scaled_baselines"] = runs_uv["scaled_uv_data"].apply(get_cols)

runs_uv["scaled_baselines"][10]

# runs_uv_test['scaled_baselines'] = runs_uv['scaled_uv_data'].apply(lambda col: col.apply(baseline_calculator))
# runs_uv['scaled_baselines'][10]

In [None]:
runs_uv["scaled_baselines"][10] - runs_uv["scaled_uv_data"][10]

In [None]:
runs_uv["scaled_baselines"][10].drop("mins", axis=1).plot()

## Baseline Adjustment

In [None]:
runs_uv["baseline_adjusted_uv_data"] = (
    runs_uv["scaled_uv_data"] - runs_uv["scaled_baselines"]
)

In [None]:
runs_uv.loc[10, "baseline_adjusted_uv_data"].plot()

## Average Baseline Gradient

The gradient is calculated for each wavelength in each run, so the data for each run should have the form nm | average gradient. The top-level df can hold a column called `av_grads`, each cell of which contains a df in that format.


In [None]:
runs_uv.set_index("run_name").loc[:, "scaled_baselines"]

For each row, I want to access the scaled_baselines DataFrame, calculate the average gradient for each wavelength in it, then return a DF of the calculated average gradients.
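
As a rough check on what an average gradient measures, the sketch below applies `np.gradient` (which assumes unit spacing between samples when no spacing is passed) to two made-up baselines. The data and the mean absolute gradient variant are my own assumptions for illustration, not part of the method above.

```python
import numpy as np
import pandas as pd

# Hypothetical baselines: a steady drift and a symmetric hump.
baselines = pd.DataFrame(
    {"drift": [0.0, 2.0, 4.0, 6.0, 8.0], "hump": [0.0, 1.0, 2.0, 1.0, 0.0]}
)

# Signed mean gradient: rises and falls cancel each other out.
signed = baselines.apply(lambda col: np.mean(np.gradient(col.to_numpy())))

# Mean absolute gradient: how much the baseline moves overall.
absolute = baselines.apply(lambda col: np.mean(np.abs(np.gradient(col.to_numpy()))))

print(pd.DataFrame({"signed": signed, "absolute": absolute}))
```

The hump has a signed mean gradient of zero even though it is far from flat, so the signed mean mostly reflects net drift between the start and end of the run.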

In [None]:
runs_uv = runs_uv.drop_duplicates(subset="run_name")
runs_uv

In [None]:
runs_uv.set_index("run_name").loc[:, "scaled_baselines"]

In [None]:
runs_uv = runs_uv.set_index("run_name")

In [None]:
def calc_av_grad(scaled_baseline_df):
    av_grads = scaled_baseline_df.set_index("mins").apply(
        lambda column: np.mean(np.gradient(column))
    )

    av_grads = pd.DataFrame(av_grads, columns=["av_grads"])

    return av_grads


runs_uv["av_grads"] = runs_uv["scaled_baselines"].apply(func=calc_av_grad)

In [None]:
runs_uv["av_grads"]

## Peak Heights

In [None]:
runs_uv.columns

In [None]:
runs_uv["scaled_uv_data"].head(3)

1. Start with a pandas Series of DataFrames containing scaled UV data.
2. For each column in the UV DataFrame, find the peaks.
3. Return the peak heights and their x values as a DataFrame whose structure is nm: x_values | y_values.
4. Put that back into the main df with index run : wavelength df.
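
Steps 2 and 3 can be sketched for a single wavelength column as follows. The signal values, the time index, and the column name are made up for illustration; only the `find_peaks` call with `height=0.05` mirrors what is used in this notebook.

```python
import numpy as np
import pandas as pd
from scipy.signal import find_peaks

# Hypothetical scaled signal for one wavelength, indexed by time in minutes.
signal = pd.Series(
    [0.0, 0.2, 1.0, 0.2, 0.0, 0.02, 0.0, 0.5, 0.1],
    index=np.arange(9) * 0.5,
    name="254",
)

# find_peaks returns positional indices plus a dict of properties;
# "peak_heights" is only present because height= was given.
peak_idx, props = find_peaks(signal.to_numpy(), height=0.05)

peaks = pd.DataFrame(
    {
        "x_values": signal.index[peak_idx].to_numpy(),
        "y_values": props["peak_heights"],
    }
)
print(peaks)
```

The small bump of 0.02 is a local maximum but falls below the height threshold, so only two peaks are kept.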

So the overall structure is:

run : peaks_wavelength_df : wavelength : x_values, y_values.

How do I simulate that structure?

In [None]:
import pandas as pd

# Create an empty DataFrame with the desired columns
df = pd.DataFrame(columns=["A", "B"])

# Create another DataFrame object to store as an element of the first DataFrame
df2 = pd.DataFrame({"C": [7, 8, 9], "D": [10, 11, 12]})

# Set the value of a specific row and column to the second DataFrame object
df.loc[0, "B"] = df2

# Print the resulting DataFrame
print(df)

In [None]:
x_values = [1, 2, 3]
y_values = [4, 5, 6]
index = [0, 1, 2]

wavelength = pd.DataFrame({"x": x_values, "y": y_values}, index=index)

# Wrap the inner DataFrame in a list so it is stored as a single object cell.
wavelengths_df = pd.DataFrame({"wavelengths": [wavelength]})
wavelengths_df

In [None]:
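from pandas.api.types import is_scalar


def peak_finder(nm):
    # find_peaks returns (indices of the peak maxima, dict of peak properties).
    peaks = find_peaks(nm.values, height=0.05)

    # peaks[0] holds the positional indices of the maxima; map them to index labels.
    peaks_x = nm.index[peaks[0]].values

    # peaks[1] is a dict with information about the peaks, including peak heights.
    peaks_y = peaks[1]["peak_heights"]

    # Both results are arrays with one entry per peak, not scalars.
    assert not is_scalar(peaks_x), "peaks_x should be an array, not a scalar"
    assert not is_scalar(peaks_y), "peaks_y should be an array, not a scalar"

    return pd.Series({"x_values": peaks_x, "y_values": peaks_y})


# Assumption: apply peak_finder to every wavelength column of each run and store
# the result (columns: wavelengths; rows: x_values, y_values) in a `peaks` column.
runs_uv["peaks"] = runs_uv["scaled_uv_data"].apply(
    lambda uv_df: uv_df.set_index("mins").apply(peak_finder)
)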

In [None]:
# runs_uv is indexed by run_name, so use .iloc for positional access.
runs_uv["peaks"].iloc[0].T

In [None]:
runs_uv.loc[runs_uv.index[0], "peaks"].T.loc["190"]

## Peak Height to Baseline Gradient Ratio

In [None]:
runs_uv.columns

In [None]:
# ratio = av_peak_height / av_grads
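
The ratio has not been computed yet. A minimal sketch of how it might look for one run is given below, assuming per-wavelength Series named `av_peak_height` and `av_grads` (the values are made up, and taking the absolute gradient is my assumption, not something decided above).

```python
import pandas as pd

# Hypothetical per-wavelength summaries for one run.
av_peak_height = pd.Series({"190": 0.30, "254": 0.60, "280": 0.45})
av_grads = pd.Series({"190": 0.010, "254": 0.002, "280": 0.003})

# Use the gradient magnitude so a falling baseline does not flip the sign.
ratio = av_peak_height / av_grads.abs()

# Wavelength with the largest peak height per unit of baseline drift.
best_nm = ratio.idxmax()
print(ratio)
print(best_nm)
```

A gradient of exactly zero would produce an infinite ratio, so flat baselines may need special handling before comparing wavelengths.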