# Description of Dataset Thus Far

See [2023-04-05_logbook](/Users/jonathan/001_obsidian_vault/mres_logbook/2023-04-05_logbook.md) for task description.

## 1. Load the Data

In [None]:
%load_ext autoreload
%autoreload 2
import sys

from pathlib import Path
import pandas as pd

from agilette.modules.library import Library

In [None]:
runs = "/Users/jonathan/0_jono_data/"

lib = Library(runs)

df = lib.metadata_table

In [None]:
df

In [None]:
df['acq_date'].max()

In [None]:
df.columns

In [None]:
df[df['acq_date']>pd.to_datetime('2023-01-01', format = '%Y-%m-%d')].describe(datetime_is_numeric=True, include='all')

So the 2023 dataset extends from 2023-01-23 to 2023-04-05. What are the major divisions within it? The primary split is Halo versus Avantor.

## Number of Halo and Avantor Runs

In [None]:
# Clean up string columns. Applying the pandas string methods to the 'path' column replaces the posix paths with NaN, so the workaround is to exclude 'path' and lowercase only the remaining columns.
df[df.columns.drop('path')] = df[df.columns.drop('path')].apply(lambda x : x.str.lower() if pd.api.types.is_string_dtype(x) else x)
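
The NaN paths are most likely a side effect of how the `.str` accessor treats non-string elements: anything that is not a `str` (such as a `PosixPath`) comes back as NaN. Below is a minimal sketch of an element-wise alternative that would leave the `path` column in place. The toy frame and its values are made up, and `lower_strings` is an illustrative helper, not part of agilette.

```python
from pathlib import Path

import pandas as pd
from pandas.api.types import is_datetime64_any_dtype, is_numeric_dtype


def lower_strings(frame: pd.DataFrame) -> pd.DataFrame:
    """Lowercase str elements only; leave every other value untouched."""
    out = frame.copy()
    for col in out.columns:
        # numeric and datetime columns cannot hold strings, so skip them
        if is_numeric_dtype(out[col]) or is_datetime64_any_dtype(out[col]):
            continue
        out[col] = out[col].map(lambda v: v.lower() if isinstance(v, str) else v)
    return out


# toy data (hypothetical values, not taken from the real library)
toy = pd.DataFrame(
    {
        "acq_method": ["AVANTOR-52min", "HALO-44min"],
        "path": [Path("/data/Run_A.D"), Path("/data/Run_B.D")],
        "acq_date": pd.to_datetime(["2023-02-07", "2023-03-16"]),
    }
)

cleaned = lower_strings(toy)
print(cleaned)
```

Because the check is done per element rather than per column, there is no need to drop `path` before cleaning.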

### Avantor Runs

In [None]:
avantor_df = df[(df['acq_date'] > '2023-01-01') & (df['acq_method'].str.contains('avantor'))]
avantor_df.describe(datetime_is_numeric=True, include='all')

### Timeline of All Runs with Sequences as Intervals

In [None]:
import plotly.graph_objects as go


def library_timeline(df: pd.DataFrame) -> go.Figure:
    f = go.Figure()

    df = df.fillna('None')

    library_trace = go.Scatter(x = df['acq_date'], y = df['id'], mode = 'markers+text', text=df['id'], textposition='top right')

    f.add_trace(library_trace)

    sequences_start_finish_times = df[df['program_type']=='sequence'].groupby('sequence_name').agg({'acq_date':['min','max']})

    sequences_start_finish_times.apply(lambda row : f.add_shape(x0 = row[('acq_date', 'min')], x1= row[('acq_date', 'max')], y0 = 0, y1=120), axis =1)

    return f

f = library_timeline(avantor_df)
f.show()

There are 197 Avantor runs, spanning 2023-02-07 to 2023-04-05.

### Halo Runs

In [None]:
halo_df = df[(df['acq_date'] > '2023-01-01') & (df['acq_method'].str.contains('halo'))]
halo_df.describe(datetime_is_numeric=True, include='all')

Doesn't seem right, but it doesn't matter.

## Number of Avantor Single Runs and Sequences

In [None]:
avantor_df.groupby('program_type').describe(datetime_is_numeric=True, include = 'all')

So far there have been 165 runs as sequences, and 32 as single runs.

### Avantor Single Runs

In [None]:
single_runs = avantor_df[avantor_df['program_type']=='single run']
single_runs.describe()

### Avantor Sequences

In [None]:
avantor_sequence_groups = avantor_df[avantor_df.program_type=='sequence'].groupby('sequence_name')
avantor_sequence_groups.size()

All of the sequences with size of 1 can be declared to be defunct, or deletable.

### Removing 1 Run Sequences

Which ones only have 1 run?

In [None]:
avantor_sequence_groups.filter(lambda x: len(x) == 1).groupby('sequence_name').groups.keys()

The failed Uracil runs and two runs from mid March with the shifting baseline problem. Can drop.

In [None]:
avantor_sequence_groups = avantor_sequence_groups.filter(lambda x: len(x) > 1).groupby('sequence_name')
avantor_sequence_groups.size()
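
As an alternative to filtering and then regrouping, the same cut can be made on the flat frame by broadcasting each sequence's size back onto its rows. This is a sketch on toy data with made-up sequence names, not the real library.

```python
import pandas as pd

# toy data: seq_2 has a single run and should be removed
toy = pd.DataFrame(
    {
        "id": ["a", "b", "c", "d", "e", "f"],
        "sequence_name": ["seq_1", "seq_1", "seq_1", "seq_2", "seq_3", "seq_3"],
    }
)

# number of runs in each row's sequence, aligned to the original index
runs_per_sequence = toy.groupby("sequence_name")["id"].transform("size")

# keep only rows that belong to sequences with more than one run
multi_run = toy[runs_per_sequence > 1]
print(multi_run.groupby("sequence_name").size())
```

The result stays an ordinary DataFrame, so it can be regrouped later only when a groupby is actually needed.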

## Assessing Two Run Sequence '2023-03-24_wines_2023-03-24_13-17-09'

Now what about that sequence with 2 runs?

In [None]:
avantor_sequence_groups.filter(lambda x: len(x) == 2).groupby('sequence_name').head()

Looks legit. Keep it in the dataset.

So number of sequences is:

In [None]:
avantor_sequence_groups.size().shape

### Dropping 44min runs

The majority of my dataset is on a 52 minute gradient run. I hypothesize that 52 min chromatograms should be compatible with 44 min ones, but I haven't proven this yet. In the meantime, I would like to eliminate 44 min runs from the dataset if they do not contain any wines that appear nowhere else.

In [None]:
avantor_sequence_groups.get_group('2023-03-16_red-wines-44min_2023-03-16_12-08-23')

In [None]:
id_groups_44min_2023_03_16 = avantor_df.loc[avantor_df['id'].isin(avantor_sequence_groups.get_group('2023-03-16_red-wines-44min_2023-03-16_12-08-23')['id'])].groupby(['id','sequence_name'])

size = id_groups_44min_2023_03_16.size()
size

So yes, it appears that '2023-03-16_red-wines-44min_2023-03-16_12-08-23' is a repeat, and that all those wines were included in "2023-03-14_wines_2023-03-14_19-49-27" and "2023-03-15_wine_dups_2023-03-15_22-17-47". Should drop these two repeat sequences and continue.
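
The "does this sequence contain uniques?" check can also be written as a set difference. Below is a minimal sketch on toy data; `unique_to_sequence` is a hypothetical helper, and the ids and sequence names are made up.

```python
import pandas as pd

# toy data: w4 only ever appears in the 44 min sequence
toy = pd.DataFrame(
    {
        "id": ["w1", "w2", "w3", "w1", "w2", "w3", "w4"],
        "sequence_name": [
            "wines", "wines", "dups",
            "wines-44min", "wines-44min", "wines-44min", "wines-44min",
        ],
    }
)


def unique_to_sequence(frame: pd.DataFrame, sequence_name: str) -> set:
    """Return the ids that occur in `sequence_name` and in no other sequence."""
    in_sequence = frame["sequence_name"] == sequence_name
    ids_in_sequence = set(frame.loc[in_sequence, "id"])
    ids_elsewhere = set(frame.loc[~in_sequence, "id"])
    return ids_in_sequence - ids_elsewhere


print(unique_to_sequence(toy, "wines-44min"))
```

An empty set would mean every wine in the sequence is covered elsewhere, so the sequence could be dropped without losing a sample.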

After this, it would be a good idea to make a brief summary of each sequence.

In [None]:
avantor_sequence_groups.size().index
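
One possible shape for the per-sequence summary is a single named aggregation. This is a sketch on toy data: the column names `id`, `sequence_name` and `acq_date` follow the metadata table used above, and the values are made up.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "id": ["w1", "w2", "w2", "w3", "w4"],
        "sequence_name": ["seq_a", "seq_a", "seq_a", "seq_b", "seq_b"],
        "acq_date": pd.to_datetime(
            [
                "2023-03-14 19:49",
                "2023-03-14 20:45",
                "2023-03-14 21:41",
                "2023-03-24 13:17",
                "2023-03-24 14:13",
            ]
        ),
    }
)

# one row per sequence: run count, distinct wines, first and last acquisition
summary = toy.groupby("sequence_name").agg(
    n_runs=("id", "size"),
    n_wines=("id", "nunique"),
    start=("acq_date", "min"),
    end=("acq_date", "max"),
)
summary["duration"] = summary["end"] - summary["start"]
print(summary)
```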

### Duplicates

There are two duplicate sequences that I know of:
- '2023-03-15_wine_dups_2023-03-15_22-17-47'
- '2023-03-16_random_wines_repeat_44min_run_2023-03-16_17-30-55'

## Applying the Sequence Cleanup to `avantor_df`.

In [None]:
# drop sequences with only 1 run, duplicates, 44 min runs, and acetone runs.

avantor_sequence_groups = avantor_df.groupby('sequence_name')

# sequences that contain a single run
one_run_sequences = avantor_sequence_groups.size().loc[lambda s: s == 1].index.tolist()

# sequences flagged by name as duplicates, repeats, 44 min runs, or acetone runs
name_flagged = avantor_df['sequence_name'].str.contains('dups|repeat|44min|acetone', na=False)
flagged_sequences = avantor_df.loc[name_flagged, 'sequence_name'].unique().tolist()

sequences_to_drop = one_run_sequences + flagged_sequences

print(sequences_to_drop)

avantor_df = avantor_df[~avantor_df['sequence_name'].isin(sequences_to_drop)]

# drop single runs.

avantor_df = avantor_df[avantor_df.program_type == 'sequence']

avantor_df.groupby('sequence_name').size()

In [None]:
len(avantor_df['id'].unique())

So according to my defined criteria, there are 78 unique wines sampled and appropriate for further analysis. Let's show the timeline again.

In [None]:
f2 = library_timeline(avantor_df)
f2.show()

## Description of Wines Thus Far.

Now that I have a better idea of the timeline, I need to start looking at what my samples actually are. Three things to work out:

1. Group samples by variety.
2. Number of sample repeats.
3. Of sample repeats, how many repeats, over what lengths of time?

To do this, I will need to join the metadata table with the tracker table, and also join the tracker table with a CellarTracker metadata table. Lots of work.
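
A rough sketch of how the two joins and the three points above might fit together. All three tables here are made-up stand-ins, and the column names `wine_key` and `variety` are assumptions, since the real tracker and CellarTracker schemas are not shown in this notebook.

```python
import pandas as pd

# hypothetical stand-ins for the three tables; column names are assumptions
metadata = pd.DataFrame(
    {
        "id": ["w1", "w1", "w2", "w3", "w3", "w3"],
        "acq_date": pd.to_datetime(
            ["2023-02-07", "2023-03-14", "2023-02-10",
             "2023-02-15", "2023-03-01", "2023-04-05"]
        ),
    }
)
tracker = pd.DataFrame({"id": ["w1", "w2", "w3"], "wine_key": ["k1", "k2", "k3"]})
cellartracker = pd.DataFrame(
    {"wine_key": ["k1", "k2", "k3"], "variety": ["shiraz", "pinot noir", "shiraz"]}
)

# validate= raises if a key is unexpectedly duplicated on the lookup side
samples = (
    metadata
    .merge(tracker, on="id", how="left", validate="many_to_one")
    .merge(cellartracker, on="wine_key", how="left", validate="many_to_one")
)

# 1. number of distinct samples per variety
wines_per_variety = samples.groupby("variety")["id"].nunique()

# 2 and 3. runs per sample, and the time between first and last run
repeats = samples.groupby("id").agg(
    n_runs=("acq_date", "size"),
    first_run=("acq_date", "min"),
    last_run=("acq_date", "max"),
)
repeats["span"] = repeats["last_run"] - repeats["first_run"]
repeats = repeats[repeats["n_runs"] > 1]

print(wines_per_variety)
print(repeats)
```

Left joins keep every run even when a sample is missing from the tracker, so unmatched rows would show up as NaN in `variety` rather than silently disappearing.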