---
title: Wine Library Statistics Breakdown By Detection Method
toc: true
toc-depth: 5
echo: false
documentclass: report
page-layout: full
format:
  pdf:
    documentclass: article
    papersize: A4
geometry:
  - showframe
fig-width: 10
fig-dpi: 300
---

#| output: false
#| echo: false
TODO:
    - [ ] organise graphics into grouped bar plots where appropriate.
    - [ ] check sorting of all tables.
    - [ ] descriptive paragraphs of each category distribution
    - [ ] proofread descriptive paragraphs
    - [ ] debug cross-referencing
    - [ ] final check of arrangements, appropriate pagebreaks where possible.

\pagebreak

In [None]:
%reload_ext autoreload
%autoreload 2
from wine_analysis_hplc_uv.library_eda.lib_eda import lib_eda
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

df = pd.read_excel("../df.xlsx", dtype=object)
df = df.rename({"vintage_ct": "vintage"}, axis=1)
df["wine"] = df["vintage"] + " " + df["name_ct"]
df = df[~df["wine"].isna()]

In [None]:
def two_grouped_col_df(df):
    # Split the table into two side-by-side halves. When the row count is odd,
    # the left half receives the extra row.
    split_at = (len(df) + 1) // 2

    df1 = df.iloc[:split_at].reset_index(drop=True)
    df2 = df.iloc[split_at:].reset_index(drop=True)
    concat_df = pd.concat([df1, df2], axis=1)
    return concat_df

In [None]:
def display_summary_tbl(df):
    out_df = df.copy()
    # Round with the Series method; apply(np.round, 3) passes 3 to the wrong parameter.
    out_df["prop"] = out_df["prop"].astype(float).round(3)
    out_df = out_df.rename({"prop": "prop (%)"}, axis=1)
    return out_df

In [None]:
cuprac_df = df[df["detection"] == "cuprac"]
uv_df = df[df["detection"] == "raw"]

cuprac_summary_df = lib_eda.summary_table(cuprac_df)
raw_summary_df = lib_eda.summary_table(uv_df)

## Breakdown By Detection Type

The dataset consists of spectral signals recorded both directly (raw UV) and from CUPRAC-derivatized samples, as described in Table 1:

To judge the usefulness of each dataset, it helps to look at the number of individual wines (count_individs), the number of wines with repetitions (count_reps), the number of wines without repetitions (count_unrep), and the total number of repeats (size_repeats), as given in Tables 2 and 3.

As it currently stands, the CUPRAC dataset consists of 66 total samples and 59 individual wines, with many more singly sampled wines (52) than wines with repetitions (7).
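
The statistics named above can be derived from a table of sample rows with a single group-by on the wine name. The block below is a minimal sketch on toy data, not the implementation of `lib_eda.summary_table`, which is not shown in this notebook. The function name `repetition_summary`, the toy wine names, and the reading of `size_repeats` as "rows belonging to repeated wines" are assumptions made for illustration.

```python
import pandas as pd


def repetition_summary(samples: pd.DataFrame, wine_col: str = "wine") -> pd.Series:
    """Count individual, repeated and unrepeated wines in a sample table."""
    # Number of sample rows recorded for each distinct wine.
    runs_per_wine = samples.groupby(wine_col).size()
    repeated = runs_per_wine[runs_per_wine > 1]
    return pd.Series(
        {
            "count_individs": len(runs_per_wine),
            "count_reps": len(repeated),
            "count_unrep": int((runs_per_wine == 1).sum()),
            # Assumption: size_repeats counts every row of a repeated wine.
            "size_repeats": int(repeated.sum()),
            "total": len(samples),
        },
        name="n",
    )


# Hypothetical sample table: wine A run twice, B once, C three times.
toy = pd.DataFrame(
    {"wine": ["2019 A", "2019 A", "2020 B", "2021 C", "2021 C", "2021 C"]}
)
summary = repetition_summary(toy)
print(summary)
```

Under this reading, `count_reps + count_unrep` always equals `count_individs`, which matches the CUPRAC figures quoted above (7 + 52 = 59).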

In [None]:
# | tbl-cap: CUPRAC and Raw UV subpopulation counts including number of individual samples, members that have repetitions, members that have no repetition, and total number of repetitions
concat_df = pd.concat([raw_summary_df, cuprac_summary_df])
concat_df = concat_df.reset_index(names=["detection", "statistic"])
concat_df.pivot(index="statistic", columns="detection", values="n")

In [None]:
# | fig-cap: Grouped bar plot of counts of individual samples, repeated samples, unrepeated samples, size of repeat subpopulation and total subpopulation size

x = 0.4
figsizetuple = (10, 5)
fontsize_ = 12 * x * 1.5
figsize_ = tuple([param * x for param in figsizetuple])

pop_descrip_fig, pop_descrip_ax = plt.subplots(1, figsize=figsize_)
barplot = sns.barplot(
    x="statistic", y="n", hue="detection", data=concat_df, ax=pop_descrip_ax, width=0.5
)
pop_descrip_ax.xaxis.label.set_fontsize(fontsize_)
pop_descrip_ax.yaxis.label.set_fontsize(fontsize_)
pop_descrip_ax.tick_params(axis="x", labelsize=fontsize_, labelrotation=45)
pop_descrip_ax.tick_params(axis="y", labelsize=fontsize_)
legend = pop_descrip_ax.legend(fontsize=fontsize_)

\pagebreak

## Variety


TODO:
- [x] graphics
  - [x] combine cuprac and uv "more than one" bar plots
- [ ] descriptive paragraph:
  - [x] describe total number of varieties in each dataset
  - [x] identify most represented varieties
- [ ] tables
  - [ ] make two column table function with parameters df, left length, right length
  - [ ] apply created function to all tables longer than 1/3 a page

In [None]:
# | output: false
# Count unique wines per varietal for each detection method, dropping the
# cumulative/proportion columns and the "total" summary row.
cuprac_variety_df = (
    lib_eda.num_unique_wines_by_detect_by_variety(cuprac_df)
    .drop(["cumsum", "prop", "cumsum_prop"], axis=1)
    .drop("total")
    .reset_index()
)
uv_variety_df = (
    lib_eda.num_unique_wines_by_detect_by_variety(uv_df)
    .drop(["cumsum", "prop", "cumsum_prop"], axis=1)
    .drop("total")
    .reset_index()
)

The CUPRAC dataset consists of 59 unique wines spread across varietals with varying levels of representation. The most represented varietal is Pinot Noir (7), followed by Chardonnay (6), Shiraz (6), Red Blends (3), and Nebbiolo (3). Six varietals are represented twice, while 22 have only one representative sample.

The Raw UV dataset consists of 34 unique varietals, also with varying levels of representation. The most represented varietal is Shiraz (9), followed by Chardonnay (7), Pinot Noir (7), and Riesling (5). A further 6 varietals are represented three times, 5 twice, and 18 once.
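
Statements of the form "N varietals are represented once" are a count of counts: first the number of unique wines per varietal, then the number of varietals at each level of representation. The block below is a minimal sketch on toy data. The column names `wine` and `varietal` and the toy rows are assumptions for illustration and do not reproduce the output of `lib_eda`.

```python
import pandas as pd

# Hypothetical sample table with one row per unique wine.
toy = pd.DataFrame(
    {
        "wine": ["w1", "w2", "w3", "w4", "w5", "w6", "w7"],
        "varietal": [
            "shiraz", "shiraz", "shiraz",
            "chardonnay", "chardonnay",
            "riesling",
            "gamay",
        ],
    }
)

# Unique wines per varietal, largest first.
wines_per_varietal = (
    toy.groupby("varietal")["wine"].nunique().sort_values(ascending=False)
)

# Number of varietals that share each level of representation.
representation = wines_per_varietal.value_counts().sort_index(ascending=False)

print(wines_per_varietal)
print(representation)
```

The sum of `wines_per_varietal` gives the number of unique wines, while its length gives the number of varietals, which keeps the two quantities from being confused in the text.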

In [None]:
# | output: false
# concat the df

pd.options.display.max_rows = 100

concat_variety_df = pd.concat([uv_variety_df, cuprac_variety_df]).fillna(0)
pivot_concat_variety_df = concat_variety_df.pivot(index='varietal',values='count',columns='detection')

sort_pivot_concat_variety_df = (
    pivot_concat_variety_df
    .assign(total=lambda x : x.sum(axis=1)).sort_values('total', ascending=False)
    .drop('total', axis = 1)
)

In [None]:
#| tbl-cap: Counts of unique wines by varietal sorted by total count across both detection methods
# TODO:
# - [ ] add categorical sorting via varietal color as secondary to total column sort
two_col_concat_variety_df = two_grouped_col_df(sort_pivot_concat_variety_df.reset_index().fillna(0))
two_col_concat_variety_df.fillna("").style.format(precision=0).hide()

In [None]:
# filter out any varietals that are only present on one detection type
varietal_both_detect_df = (
    concat_variety_df.groupby('varietal').filter(lambda x : len(x)>1)
    .pivot(index='varietal',values='count',columns='detection')
    .assign(total=lambda x : x.sum(axis =1))
    .sort_values('total', ascending=False)
)
varietal_both_detect_df

In [None]:
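# Reverse the pivot into long form so the counts can be drawn as a grouped bar plot.
reverse_pivot = (
    varietal_both_detect_df.drop("total", axis=1)
    .reset_index()
    .melt(
        id_vars="varietal",
        value_vars=["cuprac", "raw"],
        var_name="detection",
        value_name="count",
    )
)
reverse_pivot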

In [None]:
# | fig-cap: Grouped bar plot of counts of individual wines by varietal by detection method where the varietal was detected with both CUPRAC and raw UV at least once.
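varietal_both_detect_fig, varietal_both_detect_ax = plt.subplots(1)
# Melt the wide table back to long form: one row per (varietal, detection) pair.
varietal_both_barplot = sns.barplot(
    data=varietal_both_detect_df.drop("total", axis=1)
    .reset_index()
    .melt(id_vars="varietal", var_name="detection", value_name="count"),
    x="count",
    y="varietal",
    hue="detection",
    orient="h",
    ax=varietal_both_detect_ax,
)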

\pagebreak

## Type


In [None]:
cuprac_type_df = lib_eda.num_unique_wines_by_detect_by_type(cuprac_df)
cuprac_type_df["detect"] = "cuprac"

uv_type_df = lib_eda.num_unique_wines_by_detect_by_type(uv_df)
uv_type_df["detect"] = "raw"

type_concat_df = pd.concat([cuprac_type_df, uv_type_df])
type_concat_df = type_concat_df.reset_index()

In [None]:
# | output: false
cuprac_type_df.sort_values("count", ascending=False)

The following wine types are present within the dataset: 'white - sparkling', 'rosé - sparkling', 'white', 'orange', 'rosé', 'red', and 'white - sweet/dessert'. These definitions were taken from [CellarTracker](https://www.cellartracker.com/), from which the sample metadata was sourced directly.

The CUPRAC dataset contains the following wine types (in order of category size): 35 'red', 15 'white', 4 'rosé', 2 'white - sparkling', 2 'orange', and 1 'rosé - sparkling'. The UV dataset contains 46 'red', 19 'white', 3 'white - sparkling', 3 'rosé', 2 'orange', and 1 'white - sweet/dessert'. See @tbl-type-concat and @fig-type-group-barplot.

In [None]:
# | tbl-cap: Comparison of counts of samples of wine type by detection method
# | label: tbl-type-concat
sorted_type_concat_df = groupby_sort_df(
    type_concat_df, group_col="type", x_col="count", y_col="detect"
)
sorted_type_concat_df.pivot(columns="detect", values="count").fillna(0).style.format(
    precision=0
)

In [None]:
# | fig-cap: Grouped bar plot of counts of samples categorized by wine type for raw and CUPRAC detections.
# | label: fig-type-group-barplot
type_fig, type_ax = plt.subplots(1)
type_barplot = sns.barplot(
    data=type_concat_df, ax=type_ax, orient="h", x="count", y="type", hue="detect"
)


\pagebreak


## Country


In [None]:
# | output: false
cuprac_country_df = lib_eda.num_unique_wines_by_detect_by_country(cuprac_df)
uv_country_df = lib_eda.num_unique_wines_by_detect_by_country(uv_df)
concat_country_df = pd.concat([cuprac_country_df, uv_country_df]).reset_index(drop=True)
concat_country_df = concat_country_df.set_index("country")
concat_country_df["totals"] = concat_country_df.groupby("country")["count"].sum()
concat_country_df.sort_values("totals", ascending=False)

Overall, the dataset contains samples from Australia (71), Italy (32), France (19), Argentina (5), New Zealand (3), USA (2), and Spain (1).

Within the CUPRAC dataset there are 29 samples from Australia, 18 from Italy, 7 from France, 3 from New Zealand, 1 from Argentina, and 1 from the USA.

The raw UV dataset, on the other hand, contains 42 wines from Australia, 14 from Italy, 12 from France, 4 from Argentina, 1 from Spain, and 1 from the USA.
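
The overall figures are the per-country sums of the two detection methods, which can be checked directly. The block below is a small sketch that uses only the counts quoted in the text; the variable names are illustrative and it does not depend on `lib_eda`.

```python
import pandas as pd

# Per-country counts as stated in the text for each detection method.
cuprac = {"Australia": 29, "Italy": 18, "France": 7, "New Zealand": 3, "Argentina": 1, "USA": 1}
raw = {"Australia": 42, "Italy": 14, "France": 12, "Argentina": 4, "Spain": 1, "USA": 1}

country_counts = (
    pd.DataFrame({"cuprac": cuprac, "raw": raw})
    .fillna(0)  # a country absent from one method counts as zero
    .astype(int)
    .assign(total=lambda d: d["cuprac"] + d["raw"])
    .sort_values("total", ascending=False)
)
print(country_counts)
```

The column sums (59 for CUPRAC, 74 for raw UV) also agree with the totals implied by the wine type breakdown in the previous section.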

In [None]:
# pivot the table for display purposes

pivot_concat_country_df = (
    concat_country_df.reset_index()
    .pivot(index="country", columns="detection", values="count")
    .fillna(0)
)
pivot_concat_country_df.style.format(precision=0)

In [None]:
# | fig-cap: Grouped bar plot of counts of samples categorized by wine country of origin for raw and CUPRAC detections.
# | label: fig-country-group-barplot
plot_concat_country_df = concat_country_df.reset_index()
sns.barplot(
    data=plot_concat_country_df, orient="h", y="country", x="count", hue="detection"
)