Refer to [2023-03-08_logbook](/Users/jonathan/001_obsidian_vault/2023-03-08_logbook.md).

# Investigating Wavelength of Absorbance Maxima Outside of Mobile Phase Region of Wines

The purpose of this notebook is to:

- [x] Isolate the sequences and runs which used the Avantor column.
- [ ] Identify the wavelength region of 'minimal' baseline.
- [ ] Identify the single wavelength with maximal absorbance within that region.
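
The last two goals can be tried on a toy frame before touching real data. This is only a minimal sketch. It assumes the UV data is a DataFrame with one row per time point and one column per wavelength in nm, and that 'minimal' baseline means a low mean absolute absorbance over the first few rows, before anything elutes. `quiet_region_maximum`, `baseline_rows` and `baseline_tol` are hypothetical names, not part of Agilette.

```python
import pandas as pd


def quiet_region_maximum(uv_df, baseline_rows, baseline_tol):
    """Return (quiet wavelengths, quiet wavelength with the largest absorbance)."""
    # Mean absolute absorbance per wavelength over the assumed baseline window.
    baseline = uv_df.iloc[:baseline_rows].abs().mean()
    # Wavelengths whose baseline stays at or below the chosen tolerance.
    quiet = baseline[baseline <= baseline_tol].index
    # Highest absorbance reached at each quiet wavelength over the whole run.
    peak_per_wavelength = uv_df[quiet].max()
    return quiet.tolist(), peak_per_wavelength.idxmax()


# Toy data (made up): 210 nm has a high baseline, the other wavelengths are quiet.
demo_uv = pd.DataFrame({
    210: [50.0, 52.0, 400.0, 60.0],
    230: [0.1, 0.2, 20.0, 0.3],
    250: [0.2, 0.1, 100.0, 0.5],
    280: [0.3, 0.2, 300.0, 0.4],
})

quiet_nm, best_nm = quiet_region_maximum(demo_uv, baseline_rows=2, baseline_tol=1.0)
print(quiet_nm, best_nm)
```

The tolerance and the length of the baseline window are free choices here and would need tuning against real chromatograms.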

First I need to set up the environment.

## Set Up Environment

In [None]:
%load_ext autoreload
%autoreload 2

import sys

import os

# adds root dir 'wine_analysis_hplc_uv' to path.

sys.path.insert(0, os.path.abspath(os.path.join(os.getcwd(), '../')))

from agilette import agilette_core as ag

lib = ag.Agilette('/Users/jonathan/0_jono_data').library

## My Library

In [None]:
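# Lowercase every string cell so later matching is case-insensitive; other values pass through.
lib_df = lib.data_table().applymap(lambda x: x.lower() if isinstance(x, str) else x)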
lib_df.head()
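
A note on the lowercasing step: `DataFrame.applymap` is deprecated from pandas 2.1 onwards in favour of `DataFrame.map`. A version-independent way to do the same thing is to map each column in turn. The helper below is a minimal sketch on made-up data; `lowercase_strings` is a hypothetical name.

```python
import pandas as pd


def lowercase_strings(df):
    """Lowercase string cells column by column, leaving other values untouched."""
    return df.apply(
        lambda col: col.map(lambda v: v.lower() if isinstance(v, str) else v)
    )


# Made-up table with one text column and one numeric column.
demo_table = pd.DataFrame({"names": ["ABC_Avantor", "Def"], "vial": [1, 2]})
cleaned = lowercase_strings(demo_table)
print(cleaned)
```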

In [None]:
lib_df.describe()

## Selecting Avantor Column Runs

In [None]:
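# na=False treats rows with a missing method as non-matches instead of breaking the boolean mask.
avantor_runs = lib_df[lib_df['method'].str.contains('avantor', na=False)]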
avantor_runs.head()

In [None]:
avantor_runs.info()

In [None]:
avantor_runs.describe()

Looking good so far! There are 63 runs to play with, but apparently only 42 are unique. Interesting. What does that mean?

## Duplicate Runs?

In [None]:
avantor_runs_dup_df = avantor_runs[avantor_runs.duplicated(subset = ['names'], keep = False)]
avantor_runs_dup_df.head()

Duplicates in the names column are occurring simply because I'm running the same sample multiple times.
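
To see how many times each sample was injected, rather than just which rows are duplicated, `value_counts` on the names column is enough. A minimal sketch on made-up data (the sample names and timestamps are invented):

```python
import pandas as pd

# Made-up run table: one row per injection, indexed by acquisition time.
demo_runs = pd.DataFrame(
    {"names": ["wine_a", "wine_b", "wine_a", "wine_c", "wine_a"]},
    index=pd.to_datetime([
        "2023-02-15 10:00:00",
        "2023-02-15 11:00:00",
        "2023-02-23 12:00:00",
        "2023-02-23 13:00:00",
        "2023-02-24 09:00:00",
    ]),
)

# Number of injections per sample name, then only the samples run more than once.
injections_per_sample = demo_runs["names"].value_counts()
repeated_samples = injections_per_sample[injections_per_sample > 1]

print(f"{len(demo_runs)} runs, {demo_runs['names'].nunique()} unique names")
print(repeated_samples)
```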

## Approaching Baseline Analysis

Now, there is the question of baseline analysis. Let's begin as usual by selecting one data set, creating a process for that one, then generalising to all. Let's do 2021-debortoli-cabernet-merlot_avantor. First question: how do we interface between the data table and the rest of Agilette? Feed the path back in?

In [None]:
the_run = avantor_runs[(avantor_runs['names'] == "2021-debortoli-cabernet-merlot_avantor") & (avantor_runs.index == '2023-02-23 12:21:12')]['path']

the_run.values
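
The selection above returns a Series, so a single path still has to be pulled out of it before it can be fed back in. A minimal sketch of that step on made-up data, assuming the path column holds `pathlib.Path` objects and that the run directory name is what gets used as a lookup key (the paths are invented):

```python
from pathlib import Path

import pandas as pd

# Made-up table in the same shape as the run table: datetime index, names, path.
demo_table = pd.DataFrame(
    {
        "names": ["wine_a", "wine_a", "wine_b"],
        "path": [
            Path("/data/seq1/0001.D"),
            Path("/data/seq2/0007.D"),
            Path("/data/seq2/0008.D"),
        ],
    },
    index=pd.to_datetime([
        "2023-02-15 10:00:00",
        "2023-02-23 12:21:12",
        "2023-02-23 13:00:00",
    ]),
)

# Same name plus acquisition time narrows a repeated sample down to one injection.
mask = (demo_table["names"] == "wine_a") & (
    demo_table.index == pd.Timestamp("2023-02-23 12:21:12")
)
matches = demo_table.loc[mask, "path"]

# Fail loudly if the selection is ambiguous or empty.
if len(matches) != 1:
    raise ValueError(f"expected exactly one run, found {len(matches)}")

run_path = matches.iloc[0]
run_key = run_path.name  # directory name, e.g. '0007.D'
print(run_key)
```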

In [None]:
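# Look up the Run_Dir for the selected run (assumes lib.all_data_files is keyed by run directory name).
wine_run_dir = lib.all_data_files[the_run.values[0].name]
wine_uv_data = wine_run_dir.get_uv_data()
wine_uv_data.extract_uv_data()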

In [None]:
wine_uv_data.uv_data.head()

In [None]:
wine_uv_data.line_plot()

Weirdly, the plot appears to skip 216-220 nm. Is another dataset the same? Rather than creating another plot, we can pull the column names.

## Investigating a 3d Plot Bug

In [None]:
start_index_214 = wine_uv_data.uv_data.columns.to_list().index(214)

next_nm = wine_uv_data.uv_data.columns[start_index_214 + 1]

for x in range(0,5):
    print(wine_uv_data.uv_data.columns[start_index_214 + x])

The wavelength columns are definitely in the UV dataframe. Why aren't they plotting? Maybe try plotting directly on the uv_data first.
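
Rather than eyeballing five columns, the whole column index can be checked for holes. The helper below is a minimal sketch, assuming the column labels are numeric wavelengths (as the `.index(214)` lookup suggests) and that the expected spacing is 2 nm, which is my assumption. `wavelength_gaps` is a hypothetical name.

```python
import numpy as np


def wavelength_gaps(columns, step_nm):
    """Return (last_present, next_present) pairs where neighbours are more than step_nm apart."""
    wavelengths = np.sort(np.asarray(list(columns), dtype=float))
    # Positions where the jump to the next wavelength exceeds the expected spacing.
    jump_positions = np.flatnonzero(np.diff(wavelengths) > step_nm)
    return [(float(wavelengths[i]), float(wavelengths[i + 1])) for i in jump_positions]


# Made-up column sets: one complete, one missing 216-220 nm.
complete_columns = [210, 212, 214, 216, 218, 220, 222]
holed_columns = [210, 212, 214, 222]

gaps_complete = wavelength_gaps(complete_columns, step_nm=2)
gaps_holed = wavelength_gaps(holed_columns, step_nm=2)
print(gaps_complete)
print(gaps_holed)
```

Running this on the columns of the frame and on whatever the plot method actually receives would show at which stage the wavelengths go missing.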


In [None]:
from scripts.core_scripts.hplc_dad_plots import plot_3d_line

plot_3d_line(wine_uv_data.uv_data)

So it works OK when plotted directly, which means the problem arises somewhere between extracting the UV data and passing it to the plot method in the UV_Data class. However, on inspection there doesn't appear to be anything that could cause that, so I'll just ignore it for now.

Now to functionalise the above to be able to iterate over the library.

In [None]:
# The list of runs to iterate over

run_path_list = avantor_runs['path'].tolist()
run_path_list


In [None]:
len(run_path_list)

## Extracting the UV data for all Avantor runs

In [None]:
the_run = avantor_runs[(avantor_runs['names'] == "2021-debortoli-cabernet-merlot_avantor") & (avantor_runs.index == '2023-02-23 12:21:12')]['path']

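def wine_uv_extractor(run_path):
    """
    For the given run directory name, extract the uv_data for further analysis.
    """
    wine_run_dir = lib.all_data_files[run_path]

    # Extract in place, as in the single-run cell above, then return the UV data object.
    wine_uv_data = wine_run_dir.get_uv_data()
    wine_uv_data.extract_uv_data()

    return wine_uv_data

# Keep the extracted data (same order as run_path_list) instead of discarding it.
uv_data_list = [wine_uv_extractor(x.name) for x in run_path_list]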
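
When iterating over every run, one bad directory should not stop the whole batch. The sketch below collects results and failures separately. `extract_all` is a hypothetical helper, and `demo_extractor` is a stand-in for `wine_uv_extractor` so the example runs without the library.

```python
def extract_all(run_keys, extractor):
    """Apply `extractor` to every key, returning (results, failures)."""
    results, failures = {}, {}
    for key in run_keys:
        try:
            results[key] = extractor(key)
        except Exception as err:
            # Keep going, but remember which run failed and why.
            failures[key] = repr(err)
    return results, failures


# Stand-in data store and extractor (made up for the example).
demo_store = {"0001.D": "uv-a", "0002.D": "uv-b"}


def demo_extractor(key):
    return demo_store[key]


results, failures = extract_all(["0001.D", "0002.D", "0003.D"], demo_extractor)
print(results)
print(failures)
```

Note that keying the results by run directory name only works if those names are unique, which is exactly the question raised in the next section.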


## Building a List of all .D Files Available

I've hit a problem here. I am trying to form a combined dict of single-run data dir objects and sequence data dir objects. The core object is Run_Dir.

In Sequence, the structure is as follows:

`Sequence.data_files = {'run_dir_name' : Run_Dir}`

One approach is implemented in the code below.

In [None]:
import re

seq_run_dict = {}

# Track keys across all sequences; resetting this inside the loop would hide every duplicate.
run_key_list = []

for sequence_key in lib.sequences.keys():
    #print(sequence_key)
    for run_key in lib.sequences[sequence_key].data_files.keys():

        #print(f"\t{run_key}")

        if run_key not in run_key_list:
            run_key_list.append(run_key)
        else:
            print(f"\t{run_key} is a duplicate")

        # Handle duplicates by adding a counter if there isn't one already, or incrementing the existing one. The problem is that a common name is 0001, for example. How do I add a counter to that?

        # If the key ends in the pattern _[0-9], increment that number by 1, else add _1 to the end of the key.

        if re.search(r'_[0-9]$', run_key):
            key_prefix, suffix = run_key.rsplit('_',1)
            new_suffix = str(int(suffix) + 1)
            new_key = f'{key_prefix}_{new_suffix}'

            print(new_key)
        else:
            new_key = f'{run_key}_1'

        run = lib.sequences[sequence_key].data_files[run_key]
        # NOTE: new_key depends only on run_key, so runs sharing a name in different sequences overwrite each other here.
        seq_run_dict[new_key] = run


In [None]:
len(seq_run_dict)
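
For comparison, here is a minimal sketch of a counter scheme that never overwrites an existing entry: a key is kept as-is the first time it is seen, and later collisions take the first free `_1`, `_2`, ... suffix. This is a different scheme from the cell above (which appends `_1` to every key), and `unique_key` and the demo data are hypothetical.

```python
def unique_key(key, taken):
    """Return `key` if unused, otherwise `key_1`, `key_2`, ... (first free suffix)."""
    if key not in taken:
        return key
    counter = 1
    while f"{key}_{counter}" in taken:
        counter += 1
    return f"{key}_{counter}"


# Made-up stand-in for {sequence name: {run dir name: Run_Dir}}.
demo_sequences = {
    "seq_a": {"0001.D": "run a1", "0002.D": "run a2"},
    "seq_b": {"0001.D": "run b1"},
    "seq_c": {"0001.D": "run c1"},
}

combined = {}
for sequence_name, data_files in demo_sequences.items():
    for run_key, run in data_files.items():
        # Checking against the combined dict itself guarantees nothing is replaced.
        combined[unique_key(run_key, combined)] = run

print(list(combined))
```

With this scheme the length of the combined dict always equals the total number of runs across all sequences.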

## Is Dict Implementation Deleting Duplicate Named Runs?

Note: this line of inquiry has led me to question whether using dicts as a fundamental data structure is causing duplicates to be ignored. I will investigate this by using Unix find and grep in the following manner:

`find . -type d -name "*.D" -not -name ".DS_Store" | grep -c ".D$"`

The result is 110 directories.

This was verified by checking the paths of each file counted, and noting that the paths were indeed unique.

`find . -type d -name "*.D" -not -name ".DS_Store" -print | tee >(wc -l)`
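
The same count can be cross-checked from Python with `pathlib`, which also makes it easy to see how many directories share a name while having distinct paths. A minimal sketch on a temporary, made-up directory tree; `find_run_dirs` is a hypothetical helper.

```python
import tempfile
from pathlib import Path


def find_run_dirs(root):
    """Return every directory under `root` whose name ends in '.D', sorted by path."""
    return sorted(
        p for p in Path(root).rglob("*") if p.is_dir() and p.name.endswith(".D")
    )


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Two sequences that both contain a run called 0001.D, plus one single run.
    (root / "seq1.sequence" / "0001.D").mkdir(parents=True)
    (root / "seq2.sequence" / "0001.D").mkdir(parents=True)
    (root / "single_run.D").mkdir()
    # A plain file ending in .D must not be counted.
    (root / "notes.D").write_text("a file, not a directory")

    run_dirs = find_run_dirs(root)
    run_dir_count = len(run_dirs)
    all_paths_unique = len(set(run_dirs)) == run_dir_count
    same_name_count = sum(p.name == "0001.D" for p in run_dirs)

print(run_dir_count, all_paths_unique, same_name_count)
```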

Considering that the paths are unique, and that the sequence data dirs are constructed WITHIN the sequence object, I think I can safely say that no duplicates are being lost. It is however a messy problem, one that needs to be solved later.

In [None]:
len(lib.single_runs.keys())

In [None]:
total_len = len(seq_run_dict) + len(lib.single_runs.keys())
total_len

Seems like it's 13 runs short. Sounds suspiciously like it's missing one of the initial wine runs.
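
A shortfall like this is easiest to pin down with a set difference between what is on disk and what was loaded. A minimal sketch with made-up identifiers, assuming each run can be identified by a sequence-qualified string:

```python
# Made-up identifiers: what the filesystem holds versus what ended up in the dicts.
on_disk = {"seq_a/0001.D", "seq_a/0002.D", "seq_b/0001.D", "single/0091.D"}
loaded = {"seq_a/0001.D", "seq_a/0002.D", "single/0091.D"}

# Runs present on disk but never loaded, and loaded entries with no directory behind them.
missing = sorted(on_disk - loaded)
unexpected = sorted(loaded - on_disk)

print(missing)
print(unexpected)
```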

## Looking for Missing Runs

In [None]:
print(seq_run_dict['0091_1'].acq_date)

That's the original one. What about the run from, what, two weeks later?

In [None]:
for x in lib.sequences.keys():
    print(x)

It appears to be duplicating MOST of the sequences.

In [None]:
for data_file in lib.sequences['2023-02-15_WINES_2023-02-15_15-19-53.sequence'].data_files:
    print(data_file)

## Removing 'Empty' Sequences

These 3 are empty:

- `2023-02-07_WINES.sequence`: `2023-02-07-WINES.S`, `METHODS.REG`
- `2023-02-15_WINES_2023-02-15_15-08-23.sequence`: `2023-02-15_WINES.S`, `METHODS.REG`
- `2023-02-15_WINES_2023-02-15_15-18-24.sequence`

Remove them.

I have run '/Users/jonathan/wine_analysis_hplc_uv/scripts/remove_empty_sequences.py' to remove the empty sequences. Now what about the rest?
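
For reference, a check for 'empty' sequences could look something like the sketch below. It is not the contents of `remove_empty_sequences.py`. It assumes 'empty' means a `.sequence` directory with no `.D` run directory anywhere inside it, and it only lists candidates rather than deleting anything. `empty_sequences` is a hypothetical name.

```python
import tempfile
from pathlib import Path


def empty_sequences(root):
    """Return '.sequence' directories under `root` that contain no '.D' run directory."""
    empties = []
    for seq_dir in sorted(Path(root).rglob("*")):
        if not (seq_dir.is_dir() and seq_dir.name.endswith(".sequence")):
            continue
        has_run = any(
            p.is_dir() and p.name.endswith(".D") for p in seq_dir.rglob("*")
        )
        if not has_run:
            empties.append(seq_dir)
    return empties


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # One sequence with a run, one holding only a method registry file.
    (root / "full.sequence" / "0001.D").mkdir(parents=True)
    (root / "empty.sequence").mkdir()
    (root / "empty.sequence" / "METHODS.REG").write_text("")

    empty_names = [p.name for p in empty_sequences(root)]

print(empty_names)
```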

In [None]:
for x in lib.sequences.keys():
    print(x)

## Cleaning up the Avantor Data

In [None]:
for sequence_key in sorted(lib.sequences.keys(), reverse = True):
    print(lib.sequences[sequence_key])

Now, we can drop '2023-02-07-WINES.sequence' because that's the run which didn't pump mobile phase through.

In [None]:
avantor_runs = avantor_runs[avantor_runs['sequence'] != '2023-02-07-WINES.sequence']

# Only the last expression in a cell is rendered, so the styled table goes last.
avantor_runs.style.set_properties(**{'max-height': '200px', 'max-width': '800px', 'overflow': 'scroll'})