# Investigating Data Collection Progress Thus Far

i.e. as of 2023-03-30.

Refer to [2023-03-30_logbook](../../001_obsidian_vault/2023-03-31_logbook.md).

## 1. Which Runs Are Appropriate for Analysis

Criteria:

1. Were run on the Avantor 10 cm column.
2. Contain UV spectra.
3. Are not uracil, acetone or coffee runs.

In [None]:
%load_ext autoreload
%autoreload 2
import sys
sys.path.append('../')
from pathlib import Path
import pandas as pd 

from agilette.modules.library import Library


In [None]:
lib = Library('/Users/jonathan/0_jono_data')
meta_df = lib.metadata_table

Let's try selecting only for rows that contain UV data.


In [None]:
# Keep Avantor runs that have UV data, excluding uracil, coffee and 'lor' runs.
avantor_df = meta_df[
    (meta_df['uv_filenames'] != '')
    & (meta_df['acq_method'].str.contains('AVANTOR'))
    & ~(meta_df['name'].str.contains('uracil'))
    & ~(meta_df['name'].str.contains('coffee'))
    & ~(meta_df['name'].str.contains('lor'))
]

avantor_df.shape

In [None]:
avantor_df.head()

In [None]:
avantor_df.describe()

70 unique samples doesn't seem too bad. There should be more to pull sitting in the instrument as well. It appears that this filter is legitimate, as there is UV data for every remaining row.

Now, one sticking point is that we don't have the wine names in this table. To get them we will need to load the sample tracker as a df and merge the two on the sample ID, which is 'name' in this dataframe. A further complicating factor is that the names are sometimes not consistent with the sample tracker, for example `2021-debortoli-cabernet-merlot_avantor`. I don't believe it was ever added, and it makes up 7/107 of the runs. Let's first load `sample_tracker`, then compare their contents.
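As a rough sketch of that merge, the toy example below uses made-up ids and wine names (not the real tables) and assumes the run table keys on `name` while the tracker keys on `id`. Passing `indicator=True` to `merge` adds a `_merge` column, which is one way to spot runs that have no tracker entry.

```python
import pandas as pd

# Toy stand-ins for the two tables; the ids and wine names are invented.
runs = pd.DataFrame({"name": ["a01", "a02", "x99"], "acq_method": ["AVANTOR"] * 3})
tracker = pd.DataFrame(
    {"id": ["a01", "a02", "a03"], "wine": ["shiraz", "merlot", "riesling"]}
)

# A left join keeps every run; indicator=True adds a `_merge` column that
# records whether each run found a match in the tracker.
merged = runs.merge(tracker, how="left", left_on="name", right_on="id", indicator=True)

# Runs with no tracker entry are flagged 'left_only'.
unmatched = merged.loc[merged["_merge"] == "left_only", "name"].tolist()
print(unmatched)
```

Anything listed in `unmatched` would need renaming, or adding to the tracker, before the wine names can be attached.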

In [None]:
def get_sample_tracker() -> pd.DataFrame:

    from google_sheets_api import get_sheets_values_as_df

    sheet_id = '15S2wm8t6ol2MRwTzgKTjlTcUgaStNlA22wJmFYhcwAY'
    path_to_creds = '/Users/jonathan/wine_analysis_hplc_uv/credentials_tokens'

    tracker_df = get_sheets_values_as_df(sheet_id, "sample_tracker!A1:H200",path_to_creds)

    return tracker_df

tracker_df = get_sample_tracker()
tracker_df.head()

## Making id/name Column the Same Prior to Join

To join `sample_tracker` with `metadata_table` I need to do the following:

1. Ensure they are the same datatype.
2. Clear whitespace.
3. Sort.
4. Drop duplicates.
5. Compare using `equals()`. If they are not equal, continue with the steps below.

If the columns are not equal:

1. Find common elements using `isin()`.
2. Filter on `isin()`.
3. Compare the filtered columns using `equals()`.
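The steps above can be sketched on two made-up id columns as follows. `normalise_ids` is a hypothetical helper, not part of `agilette`. The index is reset because `Series.equals` compares the index as well as the values, and `astype(str)` would turn a missing value into the string `'nan'`, so missing ids are best dealt with beforehand.

```python
import pandas as pd


def normalise_ids(ids: pd.Series) -> pd.Series:
    """Cast to str, strip whitespace, drop duplicates, sort and reset the index."""
    return (
        ids.astype(str)
        .str.strip()
        .drop_duplicates()
        .sort_values()
        .reset_index(drop=True)
    )


# Invented ids standing in for the two id columns.
left = normalise_ids(pd.Series([" a02", "a01", "a01", "x99"]))
right = normalise_ids(pd.Series(["a01", "a02 ", "a03"]))

# The columns differ ('x99' vs 'a03'), so fall through to the isin() steps.
print(left.equals(right))

# Keep only the ids present in both columns, then compare again.
left_common = left[left.isin(right)].reset_index(drop=True)
right_common = right[right.isin(left)].reset_index(drop=True)
print(left_common.equals(right_common))
```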

Credits: ChatGPT.

I don't quite know how `filter` or `isin` work, so let's check them out.

## `isin`

`pd.DataFrame.isin` takes `values` as its argument, which can be any iterable, Series, DataFrame or dict, and returns a DataFrame of booleans depending on matches.

For the given `values`, if any of the elements match an element of the DataFrame, a value of True is marked in the output DataFrame at that coordinate.

The idea is to return a mask that can be applied to the original DataFrame.
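A small illustration on invented data: `DataFrame.isin` returns a cell-by-cell mask, while `Series.isin` on a single column gives a row mask, which is probably the more useful form for comparing id columns.

```python
import pandas as pd

# Invented data for illustration only.
df = pd.DataFrame(
    {"id": ["a01", "a02", "x99"], "method": ["AVANTOR", "OTHER", "AVANTOR"]}
)

# DataFrame.isin checks every cell and returns a boolean DataFrame of the same shape.
cell_mask = df.isin(["a01", "AVANTOR"])

# Series.isin on one column gives a row mask that can index the DataFrame.
row_mask = df["id"].isin(["a01", "a02"])
subset = df[row_mask]
print(subset)
```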

## `filter`

`pd.DataFrame.filter` subsets the DataFrame's rows or columns by their index labels, not by their contents. The `items` keyword argument selects exact labels, `like` keeps labels that contain a given substring, and `regex` keeps labels that match a regular expression, which allows a kind of pseudo-fuzzy matching on names.
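A quick sketch of the three selection modes on a one-row toy frame. The column names echo the metadata table, but the values are made up. For a DataFrame, `filter` acts on the column labels by default.

```python
import pandas as pd

# One-row toy frame; the values are invented.
df = pd.DataFrame(
    {"acq_date": ["2023-03-01"], "acq_method": ["AVANTOR"], "uv_filenames": ["x.UV"]}
)

by_items = df.filter(items=["acq_date", "uv_filenames"])  # exact labels
by_like = df.filter(like="acq")  # labels containing the substring 'acq'
by_regex = df.filter(regex=r"^uv_")  # labels matching a regular expression

print(list(by_like.columns))
```

Since `filter` only looks at labels, it is not the tool for selecting rows by the values in the id column; a boolean mask from `isin` does that.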



### Clean up the 'id' columns

In [None]:
avantor_df.head()

In [None]:
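# 1. Rename avantor 'name' to 'id' for consistency.
avantor_df = avantor_df.rename({'name' : 'id'}, axis = 1)

# 2. Clean the column up, sort and drop duplicates.

def dataframe_cleaner(df : pd.DataFrame, col_name : str) -> pd.DataFrame:
    """Strip whitespace from col_name, then sort by it and drop duplicate values."""
    df = df.copy()  # work on a copy so the caller's frame is not mutated
    df[col_name] = df[col_name].str.strip()
    # sort_values and drop_duplicates return whole DataFrames, so chain them
    # rather than assigning their results to a single column.
    return df.sort_values(col_name).drop_duplicates(col_name)

avantor_df = avantor_df.pipe(dataframe_cleaner, col_name='id')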
# avantor_df['id'] = avantor_df['id'].str.strip()
# avantor_df = avantor_df.sort_values('id').drop_duplicates('id')

avantor_df.head()

There are a number of NaNs in `avantor_df`, which is surprising. Presumably, if there is no name in the XML file, it is recorded as NaN when the dataframe is formed. How many are there?


In [None]:
avantor_df.info()

In [None]:
avantor_df[avantor_df['id'].isna()].describe()

So 31 of the 101 rows have names that are NaN, spread all the way through the dataset.
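For a direct count rather than reading it off `describe()`, something like the following would do. It is shown here on a made-up column, not on `avantor_df`.

```python
import numpy as np
import pandas as pd

# Invented ids with two missing values.
df = pd.DataFrame({"id": ["a01", np.nan, "a02", np.nan]})

# isna() marks missing entries; summing the boolean mask counts them.
n_missing = int(df["id"].isna().sum())
print(f"{n_missing} of {len(df)} ids are missing")
```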

I need an overview to get what's going on. A visualisation would be useful.

In [None]:
avantor_df_date_sort = avantor_df.sort_values('acq_date')

In [None]:
import plotly.express as px

avantor_df_date_sort['id'] = avantor_df_date_sort['id'].fillna('none')

fig = px.scatter(avantor_df_date_sort, x = 'acq_date', y = 'id')

# show the figure
fig.show()

That's a pretty useful representation. Now we can start renaming the id rows.

From this point, we will be disconnecting from the source material. This is not a problem, as the filepaths to the data files are kept in the table, but any further work should be done on this table directly rather than by reloading from the source material.

This introduces the question of how to integrate new data in. A question that will be answered later!

For now, let's get the ids reconciled.

## Aligning IDs Across Tables

In [None]:
pd.options.display.max_rows = 105
tracker_df['id']