# Analysis of Signals by Sub-Groups

## Dev Log

2024-06-18 15:34:40 - Outline. After a bit of experimentation I have an approximate outline for the first chapter. The theme is describing the datasets along progressively finer subgroupings within each detection method, similar to the metadata analysis. First divide the sample by detection method, then take the following measurements: number of peaks above 5% relative height, maximum amplitude, number of peaks per time quartile (the time axis divided into four), and average AUC per quartile. Next, observe the distributions of these measurements across color and varietal. This is done for each detection method, after which the distributions are compared between detection methods. Finally, summarize the results, justify them, and conclude. The procedure therefore breaks down into the following:

1. preprocessing
2. measurement acquisition
3. statistical analysis
4. discussion
5. conclusion
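
As a rough sketch of the measurement acquisition step, the four per-signal measurements in the outline could be computed as below. The function name `signal_measurements`, the strict-local-maximum peak definition, and the equal-width time windows are all assumptions made for illustration; the pipeline may well end up using a dedicated peak finder instead. The per-quartile AUC here is for one signal, and the averaging would happen afterwards across the samples in a group.

```python
import numpy as np


def signal_measurements(mins, absorbance, rel_height=0.05):
    """Hypothetical helper: per-signal measurements listed in the outline."""
    t = np.asarray(mins, dtype=float)
    y = np.asarray(absorbance, dtype=float)

    # Assumption: a peak is a strict local maximum whose height is at least
    # `rel_height` times the signal maximum.
    is_peak = np.zeros(y.size, dtype=bool)
    is_peak[1:-1] = (y[1:-1] > y[:-2]) & (y[1:-1] > y[2:])
    is_peak &= y >= rel_height * y.max()

    # Assumption: "time quartile" means four equal-width windows over the run.
    edges = np.linspace(t.min(), t.max(), 5)
    quartile = np.clip(np.searchsorted(edges, t, side="right") - 1, 0, 3)

    peaks_per_quartile = [int(is_peak[quartile == q].sum()) for q in range(4)]

    # Trapezoidal area of each segment, assigned to the window of its left end.
    seg_area = 0.5 * (y[1:] + y[:-1]) * np.diff(t)
    seg_quartile = quartile[:-1]
    auc_per_quartile = [float(seg_area[seg_quartile == q].sum()) for q in range(4)]

    return {
        "n_peaks": int(is_peak.sum()),
        "max_amplitude": float(y.max()),
        "peaks_per_quartile": peaks_per_quartile,
        "auc_per_quartile": auc_per_quartile,
    }
```

Applied per `sample_num`, the returned dictionaries could be stacked into the tables listed in the TODO below.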

Development of the pipeline will be completed using the 30-sample dataset developed earlier; once it's done we'll include the total dataset.

Outlier handling will be required: when calculating distribution statistics we will need to be specific about how outliers are treated. The actual outlier analysis will be moved to the appendix in the final publication.
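
To make "being specific" concrete, one common convention is Tukey's fences (values more than 1.5 × IQR beyond the quartiles are flagged). This log has not settled on a rule, so the sketch below is only a candidate; `tukey_fences` and `flag_outliers` are hypothetical names and `k=1.5` is the conventional default, not a decided parameter.

```python
import numpy as np


def tukey_fences(values, k=1.5):
    """Return the (lower, upper) fences: quartiles extended by k * IQR."""
    v = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(v, [25, 75])
    iqr = q3 - q1
    return float(q1 - k * iqr), float(q3 + k * iqr)


def flag_outliers(values, k=1.5):
    """Boolean mask that is True where a value lies outside the fences."""
    v = np.asarray(values, dtype=float)
    lower, upper = tukey_fences(v, k)
    return (v < lower) | (v > upper)
```

Whichever rule is chosen, reporting distribution statistics both with and without the flagged values would keep the main text and the appendix consistent.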


TODO:
- [x] sample retrieval
- [ ] preprocessing
  - [ ] noise removal
  - [ ] sharpening and smoothing
- [ ] measurement acquisition
  - [ ] produce the following tables
    - [ ] number of peaks above 5% rel height
    - [ ] maximum amplitude
    - [ ] peaks by time quartile
    - [ ] auc average per quartile
- [ ] statistical analysis
  - [ ] raw
    - [ ] by color
    - [ ] by varietal
  - [ ] cuprac
    - [ ] by color
    - [ ] by varietal
  - [ ] distribution comparison between detection methods
- [ ] discussion
- [ ] conclusion

In [None]:
# | echo: false
# imports
%reload_ext autoreload
%autoreload 2

import duckdb as db
from wine_analysis_hplc_uv import definitions
from wine_analysis_hplc_uv.chapter_one import polars_extension, get_samples, baselines
import polars as pl
import plotnine as p9

## Sample Retrieval

In [None]:
with db.connect(definitions.DB_PATH) as con:
    sampling = get_samples.get_samples(
        con=con, detection="raw", n_samples=30, distinct_wine=True
    )

In [None]:
(
    p9.ggplot(sampling, p9.aes(x="mins", y="absorbance", color="sample_num"))
    + p9.geom_line()
    + p9.ggtitle("Random Sampling n = 30 of 'Raw' Samples at 256nm")
)

As we can see in the above figure, the AUC.. Next we will calculate the distribution.

## Preprocessing

### Baselines

In [None]:
# the return value is unpacked into the fitted results and an info dict
results, infodict = baselines.calculate_baselines(
    df=sampling, grp_col="sample_num", y_col="absorbance"
)

# display the results bound above
results

In [None]:
# plot the samples


def plot_correction_overlay(df: pl.DataFrame, grp: str) -> None:
    """
    produce a plot of the original signal, baseline and corrected signal for each group in 'grp'
    """

    p = (
        p9.ggplot(
            df,
        )
        + p9.geom_line(p9.aes(x="mins", y="absorbance"), color="blue", alpha=0.5)
        + p9.geom_line(p9.aes(x="mins", y="baseline"), color="red")
        + p9.geom_line(p9.aes(x="mins", y="corrected"), color="black")
        + p9.facet_wrap(facets=grp)
        + p9.theme(figure_size=(16, 8))
        + p9.ggtitle("Base Signal, Fitted Baseline and Corrected Signal by Sample")
    )

    display(p)


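# the query expects a DataFrame named `sampled`, carrying `baseline` and
# `corrected` columns, to be in scope for duckdb to resolve
plot_correction_overlay(
    df=db.sql("select * from sampled where mins < 30 order by sample_num, mins")
    .pl()
    .pipe(polars_extension.to_enum, "sample_num"),
    grp="sample_num",
)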

## Statistical Analysis

### Raw

#### By Color

#### By Varietal

### CUPRAC

#### By Color

#### By Varietal

### Comparison of Distributions Between Detection Methods

## Discussion


## Conclusion