
This notebook is now defunct as the samples and problems it refers to have been rectified within the core database. Thus, it is preserved purely as a reference for the code.

Premise: on pivoting the whole dataset, the resulting table is twice as long as it should be. We need to find out which sample is responsible, and why this is happening.

In [None]:
%load_ext autoreload
%autoreload 2

from wine_analysis_hplc_uv import definitions
from wine_analysis_hplc_uv.db_methods import get_data, pivot_wine_data
import pytest
import pandas as pd
import duckdb as db
import logging
import seaborn as sns

logger = logging.getLogger(__name__)
pd.options.display.width = None
pd.options.display.show_dimensions = True
pd.options.display.max_colwidth = 50
pd.options.display.max_rows = 20
pd.options.display.max_columns = 15
pd.options.display.colheader_justify = "left"

In [None]:
con = db.connect(definitions.DB_PATH)

get_data.get_wine_data(con, wavelength=("254",))

df = pivot_wine_data.pivot_wine_data(con)

In [None]:
(
    df.stack(0)
    .reset_index()
    .set_index(["detection", "samplecode", "wine", "i"])
    .unstack(["detection", "samplecode", "wine"])
    .loc[:, "mins"]
    .isna()
    .sum()
    .to_frame("na_count")
    .query("na_count<7000 & na_count>1")
)

The offending entry is samplecode 98, 2020 barone ricasoli chianti classico rocca di montegrossi.
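
The NaN count works as a length check because pivoting pads every shorter sample with NaN up to the length of the longest one. A minimal sketch of the idea on made-up data (the sample codes `a`, `b`, `c` and their time values are hypothetical, not taken from the database):

```python
import pandas as pd

# Hypothetical long-format data: samples "a" and "b" have 4 points, "c" has 8.
long_df = pd.DataFrame(
    {
        "samplecode": ["a"] * 4 + ["b"] * 4 + ["c"] * 8,
        "mins": [0.0, 0.1, 0.2, 0.3] * 2 + [0.05 * k for k in range(8)],
    }
)

# Row number within each sample, playing the role of the "i" index level.
long_df["i"] = long_df.groupby("samplecode").cumcount()

# One column per sample; shorter samples are padded with NaN at the bottom.
wide = long_df.pivot(index="i", columns="samplecode", values="mins")

# The sample whose NaN count differs from the rest is the odd one out.
na_count = wide.isna().sum()
print(na_count)
```

Here the wide table is as long as the longest sample, so the normal samples carry the padding and the oversized one has none.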

Let's investigate it further.

In [None]:
get_data.get_wine_data(con, samplecode=("98",), wavelength=("254",))
df_98 = pivot_wine_data.pivot_wine_data(con).pipe(
    lambda df: df.set_axis(df.columns.droplevel("samplecode"), axis=1)
)

In [None]:
(
    df_98
    .pipe(lambda df: df if display(df.describe()) is None else df)
    .pipe(lambda df: df if df.loc[:, "mins"].plot.line() else df)
    .pipe(lambda df: df if df.plot.line(x="mins", y="value") else df)
    .pipe(
        lambda df: df if display(df.loc[:, "mins"].value_counts().sort_index()) else df
    )
)

So it genuinely appears that sample 98 somehow has twice as many observation points as any other wine. We can confirm this by calculating the sampling frequency.

Frequency: $f = 1/(t_2 - t_1)$, where $t_1$ and $t_2$ are consecutive observation times.
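
As a rough illustration of that formula, here is a sketch on made-up time axes: one hypothetical sample recorded at 150 points per minute and one at double that rate. The sample codes and time values are assumptions for the example only. Because the time column is in minutes, the result is in points per minute, and dividing by 60 gives Hz.

```python
import pandas as pd

# Hypothetical time axes (in minutes): "97" at 150 points/min, "98" at 300 points/min.
times_97 = [k / 150 for k in range(5)]
times_98 = [k / 300 for k in range(9)]

df = pd.DataFrame(
    {
        "samplecode": ["97"] * len(times_97) + ["98"] * len(times_98),
        "mins": times_97 + times_98,
    }
)

# f = 1 / (t2 - t1), computed within each sample so the first row of each is NaN.
df["f"] = (1 / df.groupby("samplecode")["mins"].diff()).round(2)

# Typical frequency per sample, in points per minute and in Hz.
median_f = df.groupby("samplecode")["f"].median()
median_hz = median_f / 60
print(median_f)
```

Rounding before aggregating absorbs floating point noise in the time differences.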

In [None]:
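f_df = (
    df.stack("samplecode")
    # sampling frequency per sample: reciprocal of the step between consecutive rows
    .assign(f=lambda df: 1 / df.groupby("samplecode")["mins"].diff(1))
    .assign(f=lambda df: df["f"].round(2))
    # print the frequency distribution, then pass the frame through unchanged
    .pipe(lambda df: df if print(df.loc[:, "f"].value_counts()) is None else df)
)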


It appears that the standard calculated frequency is 150 (observations per minute, since `mins` is in minutes), which would make sense if the wine that has double the observations has double that value. Next we will observe the rolling average frequency for a window of 2:
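
A minimal sketch of a per-sample rolling mean on made-up frequency values (the numbers are hypothetical). The point to note is that `groupby(...).rolling(...)` returns a result indexed by group key plus the original row label, so the group level has to be dropped before the result can be assigned back as a column.

```python
import pandas as pd

# Hypothetical per-row frequencies for two samples.
df = pd.DataFrame(
    {
        "samplecode": ["97", "97", "97", "98", "98", "98"],
        "f": [150.0, 152.0, 148.0, 300.0, 298.0, 302.0],
    }
)

rolled = (
    df.groupby("samplecode")["f"]
    .rolling(2)
    .mean()
    # drop the group key so the index matches df's original row labels
    .reset_index(level=0, drop=True)
)
df["rolling_f"] = rolled
print(df)
```

The first row of each sample is NaN because the window never reaches across a sample boundary.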

In [None]:
import matplotlib.pyplot as plt

rf_df = (
    f_df.reset_index()
    .assign(
        rollingfu=lambda df: df.groupby("samplecode")["f"]
        .rolling(2)
        .mean()
        .reset_index(level=0, drop=True)
    )
    .pipe(
        lambda df: df
        if df.groupby(["samplecode", "wine"])["rollingfu"].mean().plot.hist()
        else df
    )
);

As we can see, a mean frequency above 150 min$^{-1}$ is well above the norm. Judging from this result, I would say that something went wrong with the recordings of these two samples. Allowing for a small amount of error, they are the only two wines with a moving average frequency significantly higher than 150.

To simplify things, I will just remove them from the dataset.

As a final check, I have noted that sample 98 was part of a sequence that included:

- 97 d'arenberg the wild pixie
- 98 barone ricasoli chianti classico rocca di montegrossi
- 99 domaine des lises crozes-hermitage

Comparing the three could illuminate the cause of the frequency discrepancy.

In [None]:
(
    rf_df
        # exact match on the three codes; a regex alternation would also match e.g. "197"
        .loc[lambda df: df['samplecode'].isin(["97", "98", "99"])]
        .groupby(['samplecode','wine'])['rollingfu']
        .mean()
        .plot.bar()
)

Nope, evidently it's just that one.
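
A note on selecting a handful of sample codes: a regex alternation such as `97|98|99` matches substrings, so it can also pick up longer codes that merely contain those digits, whereas exact membership with `isin` does not. A small sketch with hypothetical codes:

```python
import pandas as pd

# Hypothetical sample codes, stored as strings.
codes = pd.Series(["97", "98", "99", "197", "980"])

# Substring regex match: also selects "197" and "980".
by_regex = codes[codes.str.contains("97|98|99", regex=True)]

# Exact membership: selects only the three intended codes.
by_isin = codes[codes.isin(["97", "98", "99"])]

print(by_regex.tolist())
print(by_isin.tolist())
```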

The final check will be to inspect the spectrum-chromatogram table to see if something went wrong in the pivot.

In [None]:
con.sql("""--sql
        CREATE OR REPLACE TEMPORARY TABLE temp
        AS
        SELECT
            detection, samplecode, CONCAT(st.vintage, st.name) as wine, cs.mins, cs.value, cs.id
        FROM
        c_sample_tracker st
        INNER JOIN
        c_chemstation_metadata chm
        ON
        chm.join_samplecode = st.samplecode
        INNER JOIN
        chromatogram_spectra cs
        ON
        chm.id = cs.id
        WHERE
        st.samplecode='97'
        AND cs.wavelength=254
        ;
        """)
con.sql(
        """--sql
            CREATE OR REPLACE TEMPORARY TABLE ptemp AS
            SELECT *
            FROM (
                    PIVOT (
                        SELECT
                            rowcount,
                            wine,
                            samplecode,
                            id,
                            mins,
                            value,
                            detection,
                        FROM (
                                SELECT
                                    wine,
                                    id,
                                    detection,
                                    samplecode,
                                    mins,
                                    value,
                                    ROW_NUMBER() OVER (
                                        PARTITION BY samplecode
                                        ORDER BY mins
                                    ) AS rowcount
                                FROM temp
                            )
                    ) ON samplecode
                    USING
                        FIRST(detection) as detection,
                        FIRST(wine) as wine,
                        FIRST(value) as value,
                        FIRST(mins) as mins,
                        FIRST(id) as id
                )
                ORDER BY rowcount

            """
    )
(
    con.sql("SELECT * FROM ptemp").df()
    .pipe(
        lambda df: df.set_axis(
            pd.MultiIndex.from_tuples(
                [tuple(c.split("_")) for c in df.columns],
                names=["samplecode", "vars"],
            ),
            axis=1,
        )
        .rename_axis('i')
        .droplevel(0, axis=1)
    )
    .pipe(
        lambda df: 
            sns.lineplot(df, x='mins',y='value')
    )
)

In [None]:
get_data.get_wine_data(con, samplecode=('97','98','99'), wavelength=('254',))
seq_df = pivot_wine_data.pivot_wine_data(con)

In [None]:
(
    seq_df
    .pipe(lambda df: df if display(df) is None else df)
    .stack(0)
    .pipe(lambda df: df if display(df.unstack('samplecode').shape) is None else df)
    .pipe(
        lambda df: df if sns.lineplot(df, x='mins',y='value',hue='wine') else df
    )
    .pipe(lambda df: df if display(df
                                   .groupby('wine')['mins']
                                   .agg(['min','max'])
                                   ) is None else df)
)

Well, that sheds no more light, but it does indicate that the run for 98 is 10 minutes longer than those of the adjacent samples.

In [None]:
con.sql("describe c_chemstation_metadata")

In [None]:
con.sql("SELECT ch_samplecode, acq_method FROM c_chemstation_metadata where ch_samplecode IN ('96','97','98','99')")

There we have it. That sample was recorded with an incorrect method, and so its data differs from that of its fellows. I will remove it from the dataset and mark it as such in sampletracker.

In [None]:
con.sql("SELECT chm.ch_samplecode, chm.acq_method, chm.id, chm.acq_date, chm.desc FROM c_chemstation_metadata chm where ch_samplecode IN ('72')").df()

In [None]:
con.sql("SELECT * FROM (SELECT ch_samplecode, COUNT(ch_samplecode) as count FROM c_chemstation_metadata GROUP BY ch_samplecode) WHERE count>1")

Regarding 72: first, this indicates the importance of relying on the id as the ultimate primary key. Secondly, it looks as though two samples have been incorrectly labelled as 72 during the initial ETL. Let's have a look at 72 in the raw table.
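
The duplicate check can be sketched in isolation as follows. This uses the standard library's `sqlite3` as a stand-in for DuckDB so that it runs anywhere, and the table name `metadata`, its rows and the `id-*` values are all hypothetical. The pattern is `GROUP BY` on the label with `HAVING COUNT(*) > 1`, followed by a lookup of the unique ids behind the duplicated label.

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Hypothetical metadata table: id is unique, ch_samplecode is a human-entered label.
con.execute("CREATE TABLE metadata (id TEXT PRIMARY KEY, ch_samplecode TEXT)")
con.executemany(
    "INSERT INTO metadata VALUES (?, ?)",
    [("id-1", "71"), ("id-2", "72"), ("id-3", "72"), ("id-4", "73")],
)

# Labels that occur more than once, i.e. labels that cannot serve as a key.
duplicates = con.execute(
    """
    SELECT ch_samplecode, COUNT(*) AS n
    FROM metadata
    GROUP BY ch_samplecode
    HAVING COUNT(*) > 1
    """
).fetchall()

# The ids behind a duplicated label are what actually distinguish the recordings.
ids_for_72 = [
    row[0]
    for row in con.execute(
        "SELECT id FROM metadata WHERE ch_samplecode = ? ORDER BY id", ("72",)
    )
]
print(duplicates, ids_for_72)
```

Filtering or dropping by id rather than by sample code then removes exactly one recording, even when the label is shared.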

In [None]:
con.sql("describe chemstation_metadata")

In [None]:
con.sql("SELECT rchm.notebook, rchm.date, rchm.method, rchm.desc, rchm.id FROM chemstation_metadata rchm where rchm.notebook LIKE '%debor%'")

There we go. `6d8a370a-9f40-460d-acba-99fd4c287ad8` is the id of the HALO recording, and should be excluded from the dataset. Easy. Just drop both.