Continuing from [Identifying Optimal Wavelength](2023-03-14_identifying_optimal_wavelength.ipynb), I will use the methods developed there to aggregate the results for all 2.5% Avantor runs so far. Presumably the same approach should also work for runs acquired with other methods.

## Set up Environment

In [None]:
%load_ext autoreload
%autoreload 2

import os
import sys

import numpy as np
import pandas as pd
import plotly.graph_objs as go
from plotly.subplots import make_subplots
from pybaselines import Baseline
from scipy.signal import find_peaks
from sklearn.preprocessing import MinMaxScaler

pd.options.plotting.backend = 'plotly'

# adds root dir 'wine_analyis_hplc_uv' to path.
sys.path.insert(0, os.path.abspath(os.path.join(os.getcwd(), '../')))

from agilette import agilette_core as ag

lib = ag.Agilette('/Users/jonathan/0_jono_data').library

In [None]:
lib_df = lib.data_table()
lib_df.head()

As in the preceding notebook, I will use the latest De Bertoli Cab Merlot sample `2023-03-07_DEBERTOLI_CS_001.D`.

## Planning the Experiment

The way to do this is to stay within a DataFrame environment.

1. Form a DF of:
run name | uv_data object.
2. For each run: scale, baseline adjust, calculate the average baseline gradient and the peak heights, then take their ratio.
3. Plot the maxima of the above values for each run. Probably drop wavelengths above 380 nm.
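
Step 2 can be sketched for a single wavelength on a synthetic trace. This is an illustration only: the helper name `peak_to_gradient_ratio`, the use of `find_peaks` with default settings, and the direction of the ratio (mean peak height divided by mean baseline gradient) are assumptions, since the plan does not pin them down.

```python
import numpy as np
from scipy.signal import find_peaks


def peak_to_gradient_ratio(signal, baseline):
    """Hypothetical helper: mean peak height divided by mean baseline gradient."""
    corrected = signal - baseline                 # baseline-adjusted trace
    peak_idx, _ = find_peaks(corrected)           # indices of local maxima
    mean_peak_height = corrected[peak_idx].mean()
    mean_gradient = np.mean(np.gradient(baseline))
    return mean_peak_height / mean_gradient


# Synthetic trace: a slowly rising baseline with two peaks of height 1.0 and 0.5.
x = np.arange(11, dtype=float)
baseline = 0.01 * x
peaks = np.zeros_like(x)
peaks[3] = 1.0
peaks[7] = 0.5
signal = baseline + peaks

ratio = peak_to_gradient_ratio(signal, baseline)
print(ratio)
```

A perfectly flat baseline has a mean gradient of zero, so a real implementation would need to guard against division by zero.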

## Filtering Runs

In [None]:
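# str.contains treats its pattern as a regex by default, where '*' is a quantifier
# rather than a glob wildcard, so match plain substrings instead.
runs = lib_df[(lib_df['method'].str.contains("2_1", regex = False)) & ~(lib_df['sample_name'].str.contains("uracil", regex = False)) & ~(lib_df['uv_files'].apply(len)==0)]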

runs.head()
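
Note that `str.contains` interprets its pattern as a regular expression unless `regex=False` is passed, so a glob-style pattern such as `"2_1*"` means `2_` followed by zero or more `1`s. A minimal sketch on a made-up table shows the difference; the method and sample names below are invented for illustration and are not taken from the library.

```python
import pandas as pd

# Hypothetical stand-in for lib_df; every name here is invented.
toy = pd.DataFrame({
    "method": ["AVANTOR_2_1.M", "AVANTOR_2_5.M", "OTHER_44.M"],
    "sample_name": ["wine_a", "uracil_std", "wine_b"],
})

# As a regex, "2_1*" matches "2_" followed by zero or more "1"s,
# so the "2_5" method matches as well.
regex_hits = toy["method"].str.contains("2_1*")

# With regex=False the pattern is treated as a plain substring.
literal_hits = toy["method"].str.contains("2_1", regex=False)

print(regex_hits.tolist())
print(literal_hits.tolist())
```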

## Prepare the Data

### Assemble the runs_uv_data DF.

In [None]:
all_data = lib.all_data()

def uv_data_extractor(run_name):
    # Look up the run's data directory by name and load its UV spectrum.
    data_dir = all_data[run_name]
    uv_data = data_dir.load_spectrum().uv_data

    return uv_data

uv_data_series = runs['run_name'].apply(uv_data_extractor)

In [None]:
# Copy so that adding columns does not write into a filtered view of lib_df.
runs_uv = runs.copy()

runs_uv['uv_data'] = uv_data_series

runs_uv = runs_uv.drop(['uv_files', 'sequence', 'ch_files', 'sample_name', 'desc'], axis = 1)

In [None]:
runs_uv.head()

### Scale

In [None]:
runs_uv['uv_data'].iloc[0].columns

In [None]:
runs_uv['uv_data'].iloc[0]

In [None]:
# manually iterating over each row in the top level df then each column in the uv_data df and applying fit_transform()

# scaler = MinMaxScaler()

# for idx, row in runs_uv.iterrows():
#     for column in row['uv_data']:
#         scaled_column = scaler.fit_transform(row['uv_data'][column])

In [None]:
runs_uv.loc[3,'uv_data']

In [None]:
scaler = MinMaxScaler()

# How do I access an individual dataframe.

uv_df_1 = runs_uv.loc[3,'uv_data']

# test applying minmaxscaler to a single dataframe.

scaled_uv = scaler.fit_transform(uv_df_1)
scaled_uv_df = pd.DataFrame(scaled_uv, columns = uv_df_1.columns, index = uv_df_1.index)
scaled_uv_df

In [None]:
# Apply the scaler to every run's uv_data DataFrame.

def df_scaler(df):
    # fit_transform returns a NumPy array, so rebuild the DataFrame with the original labels.
    scaled = scaler.fit_transform(df)

    return pd.DataFrame(scaled, columns = df.columns, index = df.index)

runs_uv['scaled_uv_data'] = runs_uv['uv_data'].apply(df_scaler)
runs_uv['scaled_uv_data'][10]


Scaling has been achieved successfully.
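
For reference, `MinMaxScaler` with its default `feature_range=(0, 1)` rescales each column independently as (x - min) / (max - min). A small sketch with plain pandas reproduces that arithmetic; the column labels and values are invented for illustration.

```python
import pandas as pd

# Invented values: two wavelength columns with different intensity ranges.
toy_uv = pd.DataFrame({"254": [0.0, 5.0, 10.0], "280": [2.0, 3.0, 4.0]})

# Column-wise min-max scaling, the same arithmetic MinMaxScaler applies by default.
scaled = (toy_uv - toy_uv.min()) / (toy_uv.max() - toy_uv.min())

print(scaled)
```

Because each column is scaled on its own, relative intensities between wavelengths are not preserved after this step, and any non-signal column passed through the scaler is rescaled to [0, 1] as well.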

### Baseline Correct

In [None]:
runs_uv.reset_index(drop = True)[0:3]

In [None]:
#runs_uv_test = runs_uv.reset_index(drop = True)[0:1]

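def baseline_calculator(column):
    # Fit an IASLS baseline to one column, using the row index as the x-axis.
    baseline_fitter = Baseline(column.index)
    # iasls returns (baseline, params); keep only the baseline array.
    baseline, _ = baseline_fitter.iasls(column.values)

    return baseline

def get_cols(df):
    # Apply the baseline fit to every column of a run's scaled DataFrame.
    return df.apply(baseline_calculator)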

runs_uv['scaled_baselines'] = runs_uv['scaled_uv_data'].apply(get_cols)

runs_uv['scaled_baselines'][10]

# runs_uv_test['scaled_baselines'] = runs_uv['scaled_uv_data'].apply(lambda col: col.apply(baseline_calculator))
# runs_uv['scaled_baselines'][10]

In [None]:
runs_uv['scaled_baselines'][10] - runs_uv['scaled_uv_data'][10]

In [None]:
runs_uv['scaled_baselines'][10].drop('mins', axis = 1).plot()

## Baseline Adjustment

In [None]:
runs_uv['baseline_adjusted_uv_data'] = runs_uv['scaled_uv_data'] - runs_uv['scaled_baselines']

In [None]:
runs_uv.loc[10, 'baseline_adjusted_uv_data'].plot()
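
For a single run, the adjustment is an element-wise DataFrame subtraction aligned on index and column labels. The sketch below uses an invented linear drift and a single invented peak to show the drift being removed; none of the values come from the real runs.

```python
import numpy as np
import pandas as pd

t = np.arange(5)

# Invented data: a linear drift plus a single peak at row 2.
baseline = pd.DataFrame({"254": 0.5 * t}, index=t)
peak = pd.DataFrame({"254": [0.0, 0.0, 1.0, 0.0, 0.0]}, index=t)
signal = baseline + peak

# Subtracting the fitted baseline leaves only the peak.
adjusted = signal - baseline

print(adjusted)
```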

## Average Baseline Gradient

The gradient is calculated for each wavelength of each run, so the data for each run should have the format nm | average gradient. The top-level DF can hold a column called `av_baseline_grads`, each cell of which contains data in that format.


In [None]:
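def calc_av_grad(column):
    # Mean of the point-to-point gradient of one wavelength's baseline.
    grad = np.gradient(column)
    return np.mean(grad)

# For each run, drop the time column and reduce every wavelength to its average
# gradient, giving a Series with nm as index and averages as values.
runs_uv['av_baseline_grads'] = runs_uv['scaled_baselines'].apply(lambda df: df.drop('mins', axis = 1).apply(calc_av_grad))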

runs_uv.loc[10, 'av_baseline_grads']
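
As a sanity check of the nm | average gradient format, the same reduction can be run on a toy baseline table. The values are invented: one flat baseline and one that rises by 0.5 per row.

```python
import numpy as np
import pandas as pd

# Invented baselines for two wavelengths plus a time column.
toy_baseline = pd.DataFrame({
    "mins": [0.0, 1.0, 2.0, 3.0],
    "254": [1.0, 1.0, 1.0, 1.0],
    "280": [0.0, 0.5, 1.0, 1.5],
})

# Drop the time column, then average np.gradient down each wavelength column.
av_grads = toy_baseline.drop("mins", axis=1).apply(
    lambda col: np.mean(np.gradient(col.to_numpy()))
)

print(av_grads)
```

`np.gradient` assumes unit spacing between rows unless sample positions are supplied, so the result here is a per-row gradient rather than a per-minute one.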