# Investigating Data Collection Progress Thus Far

i.e. 2023-03-30.

Refer to [2023-03-30_logbook](../../001_obsidian_vault/2023-03-31_logbook.md).

## 1. What Runs are Appropriate for Analysis.

Criteria: 

1. Were run on the Avantor 10 cm column.
2. Contain UV spectra.
3. Are not uracil, acetone or coffee runs.

In [None]:
%load_ext autoreload
%autoreload 2
import sys

from pathlib import Path
import pandas as pd
import numpy as np

from agilette.modules.library import Library


In [None]:
lib = Library('/Users/jonathan/0_jono_data')
df = lib.metadata_table

Let's try selecting only for rows that contain UV data.


## Selecting UV Wine Runs

In [None]:
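def metadata_df_avantor_wine_subset(df : pd.DataFrame) -> pd.DataFrame:
    """Keep Avantor-method runs that have UV data, excluding 'uracil', 'coffee' and 'lor' runs."""
    has_uv = df['uv_filenames'] != ''
    is_avantor = df['acq_method'].str.contains('AVANTOR')
    # str.contains treats the pattern as a regex, so '|' means "any of these substrings".
    is_excluded = df['name'].str.contains('uracil|coffee|lor')

    return df[has_uv & is_avantor & ~is_excluded]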

avantor_df = metadata_df_avantor_wine_subset(lib.metadata_table)

avantor_df.head()

In [None]:
avantor_df.describe()

70 unique samples doesn't seem too bad. There should be more to pull sitting in the instrument as well. It appears that this filter is legitimate, as there is UV data for every remaining row.

Now, one sticking point is that we don't have the wine names in this table. To get them we will need to load the sample tracker as a df and merge the two on the sample ID, which is 'name' in this dataframe. A further complicating factor is that the names are sometimes not consistent with the sample tracker, for example `2021-debortoli-cabernet-merlot_avantor`. I don't believe it was ever added, and it makes up 7/107 of the runs. Let's first load sample_tracker then compare their contents.
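One way to see which runs have no counterpart in the tracker is a left merge with `indicator=True`. The sketch below uses two made-up toy tables (the `wine` column and all values are assumptions for illustration, not the real data):

```python
import pandas as pd

# Toy stand-ins for the run metadata and the sample tracker.
runs = pd.DataFrame({"name": ["01", "02", "2021-debortoli-cabernet-merlot_avantor"]})
tracker = pd.DataFrame({"id": ["01", "02", "03"], "wine": ["wine a", "wine b", "wine c"]})

# A left merge keeps every run; indicator=True adds a `_merge` column
# recording whether a matching tracker row was found.
merged = runs.merge(tracker, left_on="name", right_on="id", how="left", indicator=True)

# Runs marked "left_only" have no tracker entry and need reconciling.
unmatched = merged.loc[merged["_merge"] == "left_only", "name"].tolist()
print(unmatched)
```

The same pattern with `how="outer"` would also reveal tracker entries that have no runs yet.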

## Importing Sample Tracker Table

In [None]:
def get_sample_tracker() -> pd.DataFrame:

    from google_sheets_api import get_sheets_values_as_df

    sheet_id = '15S2wm8t6ol2MRwTzgKTjlTcUgaStNlA22wJmFYhcwAY'
    path_to_creds = '/Users/jonathan/wine_analysis_hplc_uv/credentials_tokens'

    tracker_df = get_sheets_values_as_df(sheet_id, "sample_tracker!A1:H200", path_to_creds)

    return tracker_df

tracker_df = get_sample_tracker()
# 2023-04-03-09-01 reading the local file today because I don't have internet.
tracker_df = tracker_df.replace("", np.nan)
tracker_df.head()

## Making id/name Column the Same Prior to Join

To join `tracker_df` with `avantor_df` I need to do the following:
1. Ensure they are the same datatype.
2. Clear whitespace.
3. Sort.
4. Drop duplicates.
5. Compare using `equals()`. If they are not equal, continue with the steps below.

If the columns are not equal:
1. Find common elements using `isin()`.
2. Filter on `isin()`.
3. Compare filtered columns using `equals()`.

Credits: ChatGPT.
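The steps above can be sketched on two made-up id columns. `normalised_ids` is a hypothetical helper written for this illustration, not part of the notebook's pipeline:

```python
import pandas as pd

def normalised_ids(s: pd.Series) -> pd.Series:
    """Cast to string, strip whitespace, drop duplicates and sort."""
    return (
        s.astype(str)
        .str.strip()
        .drop_duplicates()
        .sort_values()
        .reset_index(drop=True)  # equals() also compares the index
    )

left = normalised_ids(pd.Series([" 01", "02", "03", "03"]))
right = normalised_ids(pd.Series(["02", "03 ", "04"]))

# Step 5: are the two id columns already identical?
identical = left.equals(right)

# If not, keep only the ids present on both sides and compare again.
left_common = left[left.isin(right)].reset_index(drop=True)
right_common = right[right.isin(left)].reset_index(drop=True)
common_equal = left_common.equals(right_common)

print(identical, common_equal)
```

Note that `astype(str)` turns missing values into the string `'nan'`, so real data with gaps would need them dropped first.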

I don't quite know how `filter` or `isin` work, so let's check them out.

## `isin`

`pd.DataFrame.isin` takes `values` as its argument, which can be any iterable, Series, DataFrame or dict, and returns a DataFrame of booleans depending on matches.

For the given `values`, if any of the elements match an element of the DataFrame, a value of True is marked in the output DataFrame at that coordinate.

The idea is to return a mask that can be applied to the original DataFrame.
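A small sketch on a toy DataFrame (values are made up) showing both the cell-wise DataFrame version and the row-wise Series version, which is the one used for filtering rows:

```python
import pandas as pd

df = pd.DataFrame({"id": ["01", "02", "z3"], "name": ["a", "b", "c"]})

# DataFrame.isin checks every cell against the given values.
cell_mask = df.isin(["02", "c"])

# Series.isin gives one boolean per row, which can index the DataFrame directly.
row_mask = df["id"].isin(["02", "z3"])
subset = df[row_mask]

print(cell_mask)
print(subset)
```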

## `filter`

`pd.DataFrame.filter` subsets the DataFrame's rows or columns by label rather than by content. The `items` keyword selects exact labels, `like` keeps labels that contain a given substring, and `regex` keeps labels that match a regular expression, which gives a pseudo-fuzzy match on column or row names.
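A small sketch on a toy DataFrame (the column names are borrowed from this notebook, the values are made up), showing that `filter` selects by label and never looks at cell contents:

```python
import pandas as pd

df = pd.DataFrame(
    {"acq_date": ["2023-02-14"], "acq_method": ["AVANTOR"], "uv_filenames": ["x.UV"]}
)

# Exact column labels.
by_items = df.filter(items=["acq_date", "uv_filenames"])

# Column labels containing the substring "acq".
by_like = df.filter(like="acq")

# Column labels matching a regular expression.
by_regex = df.filter(regex=r"^uv_")

print(list(by_items.columns), list(by_like.columns), list(by_regex.columns))
```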



### Clean up the 'id' columns

In [None]:
# clean the columns up, sort and drop duplicates

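def dataframe_cleaner(df : pd.DataFrame, col_name : str) -> pd.DataFrame:
    """Strip whitespace from `col_name`, then sort on it and drop duplicate values."""
    df = df.copy()  # work on a copy so a filtered slice passed in is not modified in place
    df[col_name] = df[col_name].str.strip()
    df = df.sort_values(col_name)
    df = df.drop_duplicates(col_name)
    return df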
    
# todo: add subsetter to pipeline.
avantor_df = (avantor_df.pipe(dataframe_cleaner, col_name='id'))
tracker_df = (tracker_df.pipe(dataframe_cleaner, col_name='id'))

tracker_df.head()

To form the link between `tracker_df` and `avantor_df`, I need to have an understanding of what happened when, and draw a link between sample dates, wine names, and ids. First, though, we can reconcile the formats of the id numbers. In tracker_df they have the format "DD" where D is a digit, but in avantor_df they are either empty, "DD", "00DD", or other. What methods are there for finding patterns in the strings of a column?

## Identifying Patterns in Strings in a Column



In [None]:
#pd.set_option('display.max_rows', None)

for idx, row in tracker_df.iterrows():
    print(row['id'], end=', ')

`tracker_df['id']` has pattern "D", "DD", or "zD".
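Rather than reading the printed ids by eye, one option is to reduce each id to its "shape" and count the shapes. This is a sketch on made-up ids; it writes any letter as `a` and any digit as `D`, so the "zD" pattern above would show up as `aD`:

```python
import pandas as pd

ids = pd.Series(["3", "12", "z3", "0012", "NC", None])

# Letters are replaced first so that the "D" placeholders are not
# themselves rewritten by the letter pattern.
shape = (
    ids.str.replace(r"[A-Za-z]", "a", regex=True)
    .str.replace(r"\d", "D", regex=True)
)

# One row per distinct shape, including missing ids.
print(shape.value_counts(dropna=False))
```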

In [None]:
for idx, row in avantor_df.iterrows():
    print(row['id'], end=', ')

`avantor_df` starts with DDDD, goes to DD, then "string", DD, "string". Tbh there should be more DD after those, so it looks like I'm missing files? Anyway. To rectify this, we need to:
1. [x] Pad tracker_df to be DD.
2. [x] Drop the first and last digits on the first 19 rows of avantor_df.
3. Replace the strings with DD.

## 1. Padding Sample Tracker ID Column

In [None]:
### 1 Pad Tracker_DF to be DD. Can use `pd.Series.str.pad`

def pad_id(df : pd.DataFrame) -> pd.Series:
    # Left-pad ids to two characters with zeros; returns the padded 'id' column.
    return df['id'].str.pad(2, fillchar='0')

tracker_df['id'] = pad_id(tracker_df)
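As a quick check of what the padding does, here is a sketch on made-up ids. `str.pad` pads on the left by default, and for ids like these `str.zfill` gives the same result:

```python
import pandas as pd

ids = pd.Series(["3", "12", "z3"])

# Pad to a width of 2 with zeros on the left (the default side).
padded = ids.str.pad(2, fillchar="0")

# zfill is a shorthand for zero-padding on the left.
zfilled = ids.str.zfill(2)

print(padded.tolist(), zfilled.tolist())
```

Ids that are already two or more characters long, including non-numeric ones such as "z3", are left unchanged.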

### 2. Slicing 4-Digit IDs in `avantor_df`

Subset the first 19 rows of avantor_df (identify that group, probably by acq_date), then drop the first and last digits of the id through slicing.

In [None]:
four_digit_ids = avantor_df[avantor_df['id'].str.len() ==4].sort_values('acq_date')

# ranges from the 02-14 13:18:27 to 02-16 12:45:35. Is this the same as all values within that range?

avantor_four_digit_id_range = avantor_df[(avantor_df['acq_date'] >= four_digit_ids.loc[four_digit_ids.index[0],'acq_date']) & (avantor_df['acq_date'] <= four_digit_ids.loc[four_digit_ids.index[-1],'acq_date'])].sort_values('acq_date')

avantor_four_digit_id_range.equals(four_digit_ids)
# confirmed that for those date ranges, all entries were four digit ids.

Now to perform the slice.

In [None]:
avantor_df['id'] = avantor_df['id'].apply(lambda x : x[2:4] if len(x)==4 else x)

display(avantor_df[avantor_df['id'].str.len()==4])

### 3. Replace the Strings with DD

This is the most difficult. First, find all ids which are names rather than numbers.

In [None]:
digit_mask = avantor_df['id'].str.isdigit()
# Rows whose id is not purely digits, i.e. the ones entered as names.
avantor_df[~digit_mask].head()

In [None]:
tracker_df[tracker_df['name'].str.contains('crawford')]

After checking "sample_tracker" for the 5 samples with string ids, I discovered that these 5 had never been logged. This is because, at the time, they were just screening samples run when I was worried that the column was damaged. Refer to [2023-02-21_logbook](file:///Users/jonathan/001_obsidian_vault/mres_logbook/2023-02-22_logbook.md).

Simplest solution will be to enter them with new IDs.

Note: can't update sample tracker until I get some internet. In the meantime just add them directly to the df. The names are:

In [None]:
for idx, row in avantor_df[~(avantor_df['id'].str.isdigit())].iterrows():
    print(row['id'], end = ", ")

Because `tracker_df['id']` contains both alphabetic strings and digit strings, I cannot sort it properly. This is a secondary motivation for replacing the strings with integers. "z3" and "NC" are the main offenders, except that "NC" is not present in `tracker_df`, only `avantor_df`. "z3" will be replaced now with "00", as it was the first wine added to the library.
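The sorting problem is easy to reproduce with plain Python strings. The ids below are made up for illustration:

```python
ids = ["10", "9", "z3", "2", "NC"]

# Strings sort character by character, so "10" lands before "2" and the
# letter ids end up after all of the digits.
lexicographic = sorted(ids)

# Restricting to digit ids and zero-padding them makes string order
# agree with numeric order.
digit_ids = [i for i in ids if i.isdigit()]
padded = sorted(i.zfill(2) for i in digit_ids)

# Alternatively, sort on the integer value without changing the strings.
numeric = sorted(digit_ids, key=int)

print(lexicographic, padded, numeric)
```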

In [None]:
zema_mask = tracker_df['name'] == "zema estate 'family selection' cabernet sauvignon"

tracker_df.loc[zema_mask,['id']] = '00'
avantor_df.loc[avantor_df['id'] == 'z3', ['id']] = '00'

In this manner, "z3" has been replaced by "00" in both `tracker_df` and `avantor_df`. The next step is to identify what the next available ID number is:

In [None]:
display(avantor_df['id'].describe())
display(tracker_df['id'].describe())

In [None]:
import dtale

dtale.show(tracker_df)

In [None]:
# Fill with a string rather than the integer 0 so that .str methods keep working on this column.
tracker_df['id'] = tracker_df['id'].fillna('0')

In [None]:
# Show only the rows that still contain at least one missing value.
tracker_df[tracker_df.isna().any(axis=1)]

In [None]:
display(tracker_df[tracker_df['id'].str.isdigit()])

So the next id available is 72.
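Instead of reading the largest id off the displayed table, it could be computed. This is a minimal sketch on made-up ids; `next_free_id` is a hypothetical helper that assumes at least one all-digit id exists and simply returns the largest one plus one:

```python
import pandas as pd

def next_free_id(ids: pd.Series) -> int:
    """Return one more than the largest all-digit id, ignoring other ids."""
    as_text = ids.dropna().astype(str)
    digit_ids = as_text[as_text.str.isdigit()]
    return int(digit_ids.astype(int).max()) + 1

example = pd.Series(["00", "01", "17", "NC", None, "z3"])
print(next_free_id(example))
```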

In [None]:
avantor_df[~(avantor_df['id'].str.isdigit())].sort_values('acq_date')

Now I need to:

1. add wines to sample_tracker.
2. rename wines in avantor_df.

The fields I need in tracker_df are: 'id', 'vintage', 'name', 'sample_date', 'open_date'.

Sample date is acq_date.

1. [x] Make a new df from `avantor_df` with the wines that do not have a digit id. In that df get:
    1. acq_date.
    2. id.
2. [x] Add a column with the proper name based on fuzzy searching on tracker_df 'name'.
3. [x] Add a col 'id_new' with range >72.
4. [ ] Add 'name', 'id', 'acq_date' to tracker_df, concat vertically.
5. [ ] Rename id in avantor_df with id_new.
    
Fuzzy match function detailed [here](2023-03-28-joining-cellartracker-metadata.ipynb).
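The linked notebook's function is not reproduced here. As a stand-in, the sketch below shows the general idea using only the standard library's `difflib`; `best_match`, its `cutoff` default and the example query strings are assumptions made for illustration:

```python
from difflib import get_close_matches

tracker_names = [
    "crawford river cabernets",
    "stoney rise pinot noir",
    "matias riccitelli hey malbec",
]

def best_match(query, candidates, cutoff=0.3):
    """Return the candidate most similar to `query`, or None if none passes `cutoff`."""
    matches = get_close_matches(query.lower(), candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None

for run_name in ["stoney-rise-pinotnoir", "crawford", "qqqq"]:
    print(run_name, "->", best_match(run_name, tracker_names))
```

A low cutoff is forgiving of abbreviated run names but can produce false matches, so the results still need checking by eye.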

In [None]:
string_ids = avantor_df[~(avantor_df['id'].str.isdigit())][['acq_date', 'id']]
string_ids

In [None]:
string_ids = string_ids.rename({'id' : 'exp_id'}, axis =1)

In [None]:
tracker_df[tracker_df['name'].str.contains('stoney')]

In [None]:
crawford = [2018, "crawford river cabernets", 73]
hey_malbec = [2020, "matias riccitelli hey malbec", 74]
stoney_rise_pn = [2021, "stoney rise pinot noir", 75]
koerner_nielluccio = [2021, 'koerner wine nielluccio sangiovese', 76]
debortoli = [2021, 'De Bortoli Sacred Hill Cabernet Merlot', 77]
nocturne  = [2021, 'Nocturne Cabernet', 78]

wines_to_add = pd.DataFrame([crawford, hey_malbec, koerner_nielluccio,stoney_rise_pn, debortoli, nocturne], columns = ['vintage', 'name', 'id'])
wines_to_add

In [None]:
string_ids

In [None]:
string_ids = string_ids.rename({'id':'exp_id'}, axis = 1)
string_ids['new_id'] = [77,73,73,77,74,76,76,78,75]
string_ids

In [None]:
merge_df = pd.merge(string_ids, wines_to_add, left_on = 'new_id', right_on = 'id')
merge_id = merge_df.drop('id', axis =1)

In [None]:
tracker_df = tracker_df.drop('Unnamed: 0', axis =1)
tracker_df.tail()

In [None]:
merge_id = merge_id.rename({"new_id":"id"}, axis = 1)
merge_id

In [None]:
tracker_df = tracker_df.rename({'new_id':'id'}, axis =1)

In [None]:
print(tracker_df.columns)
print(merge_id.columns)

In [None]:
# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement for stacking frames.
help(pd.concat)

In [None]:
merge_id['acq_date'] = merge_id['acq_date'].dt.strftime('%Y-%m-%d')

In [None]:
tracker_df = tracker_df.rename({"sample_date":"acq_date"}, axis =1)

# DataFrame.append was removed in pandas 2.0; pd.concat performs the same vertical join.
tracker_df = pd.concat([tracker_df, merge_id[['id', 'vintage', 'name', 'acq_date']]], ignore_index=True)


In [None]:
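# Merge only the id mapping so that acq_date is not duplicated with _x/_y suffixes,
# and cast new_id to str so the 'id' column stays a column of strings.
id_map = string_ids[['exp_id', 'new_id']].astype({'new_id': str})
merge_df_2 = pd.merge(avantor_df, id_map, left_on='id', right_on='exp_id', how='left')
# Use the new id where one exists, otherwise keep the original id.
merge_df_2['id'] = merge_df_2['new_id'].fillna(merge_df_2['id'])
merge_df_2 = merge_df_2.drop(columns=['new_id', 'exp_id'])
merge_df_2.tail()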
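For a small mapping like this one, a dictionary with `Series.map` is an alternative to merging. This is a sketch on a toy frame; the old id strings and the mapping are made up for illustration:

```python
import pandas as pd

runs = pd.DataFrame({"id": ["01", "old-name-a", "old-name-b", "02"]})

# Hypothetical mapping from the old string ids to new numeric ids.
id_map = {"old-name-a": "73", "old-name-b": "78"}

# map() returns NaN for ids missing from the dict, so fall back to the original id.
runs["id"] = runs["id"].map(id_map).fillna(runs["id"])

print(runs["id"].tolist())
```

Unlike a merge, this cannot add rows or extra columns, which makes the result easier to check.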