In [1]:
%load_ext watermark
import pandas as pd
import numpy as np
from typing import Type, Optional, Callable
from typing import List, Dict, Union, Tuple
from myst_nb import glue

# from review_methods_tests import collect_vitals, find_missing, find_missing_loc_dates
# from review_methods_tests import make_a_summary

import matplotlib.pyplot as plt
import matplotlib as mpl
import matplotlib.colors
from matplotlib.colors import LinearSegmentedColormap, ListedColormap

import setvariables as conf_
import reportclass as r_class

# Cumulative reports

Cumulative reports display the test statistic of sample results for the elements of a geographic or administrative region, visualized as heat maps. The starting point for a cumulative report is a valid `ReportClass` object. The finest granularity of the results is the municipal level, the lowest recognized administrative unit.
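
As a rough illustration of the idea, and not the `reportclass` implementation, the sketch below builds a small cumulative table with pandas. It assumes the test statistic is the median pcs/m per object and region, which the text does not specify. The column names `sample_id`, `city`, `code` and `pcs_m` and all values are made up for the example.

```python
import pandas as pd

# hypothetical survey results: one row per sample and object code
data = pd.DataFrame({
    "sample_id": ["s1", "s1", "s2", "s2", "s3", "s3"],
    "city": ["Biel", "Biel", "Biel", "Biel", "Thun", "Thun"],
    "code": ["G27", "G30", "G27", "G30", "G27", "G30"],
    "pcs_m": [0.5, 0.1, 1.5, 0.3, 0.2, 0.4],
})

# assumption: the test statistic is the median pcs/m of the samples in each region
cumulative = data.pivot_table(
    index="code", columns="city", values="pcs_m", aggfunc="median"
)
print(cumulative)
```

A table of this shape (objects as rows, regions as columns) can then be shaded cell by cell to produce a heat map, for instance with the pandas `Styler.background_gradient` method.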

```{note}
The GPT assistant is being trained to accept the following methods and commands as keyword arguments, so that different clients have access to a standardized output for tabular data.
```

## A top level description

A short and detailed summary of the report can be created by synthesising three tables from the `ReportClass`.

```python
header = a_report.a_short_description
components = a_report.the_number_of_attributes_in_a_feature('feature_type')
comp_summary = a_report.summarize_feature_labels(feature='feature_type')
```

In [2]:
# starting data, can be MySQL or NoSQL calls
# the three methods accept Callables, as long
# as the output is a pd.DataFrame
c_l = r_class.language_maps()
surveys = r_class.collect_survey_data_for_report()
codes, beaches, land_cover, land_use, streets, river_intersect_lakes = r_class.collect_env_data_for_report()

survey_data = surveys.copy() # .merge(beaches['canton'], left_on='slug', right_index=True, validate='many_to_one')

# temporal and geographic boundaries
# user defined input
boundaries = dict(canton='Bern', language='fr', start_date="2015-11-01", end_date="2021-12-31")

# the report_data method takes the boundaries and returns the top level of the report,
# the language and two data frames from the same date range. w_df includes only the surveys
# that meet the criteria in boundaries, w_di includes all the data from the date range.
top_label, language, w_df, w_di = r_class.report_data(boundaries, survey_data.copy(), beaches, codes)

# the language map is included with the module
a_report = r_class.ReportClass(w_df, boundaries, top_label, 'fr', c_l)



In [3]:
# default arguments that define the most common objects
# this assumes that the columns quantity and fail rate exist
mc_criteria_one = {
    'column': 'quantity',
    'val': 5
}

mc_criteria_two = {
    'column': 'fail rate',
    'val': 0.6
}
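
The two dictionaries only name a column and a value; how `ReportClass` applies them is not shown on this page. One plausible reading, offered here purely as an assumption, is that an object counts as "most common" if it is among the top `val` objects by quantity or if its fail rate is at least `val`. The sketch below applies that reading to a made-up table. `select_most_common` is a hypothetical helper and not part of `reportclass`.

```python
import pandas as pd

# hypothetical per-object totals; "fail rate" is taken here to be the share
# of samples in which the object was found at least once
totals = pd.DataFrame(
    {
        "quantity": [900, 450, 300, 120, 80, 40, 10],
        "fail rate": [0.9, 0.7, 0.4, 0.65, 0.3, 0.2, 0.8],
    },
    index=["G27", "G30", "G67", "G95", "G112", "G200", "G21"],
)

criteria_one = {"column": "quantity", "val": 5}
criteria_two = {"column": "fail rate", "val": 0.6}


def select_most_common(df, top_n, threshold):
    """Hypothetical helper: keep rows in the top n of one column
    OR at/above a threshold in another column."""
    ranks = df[top_n["column"]].rank(ascending=False, method="min")
    in_top_n = ranks <= top_n["val"]
    above = df[threshold["column"]] >= threshold["val"]
    return df[in_top_n | above].sort_values(top_n["column"], ascending=False)


selected = select_most_common(totals, criteria_one, criteria_two)
print(selected.index.tolist())
```

Under this reading, an object with a small total quantity is still selected when it turns up in most samples.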


In [4]:
most_common, w_mc = a_report.most_common

header = a_report.a_short_description
glue('header', header.style.set_table_styles(), display=False)

components = a_report.the_number_of_attributes_in_a_feature('feature_type')
glue('components', components.style.set_table_styles(), display=False)

comp_summary = a_report.summarize_feature_labels(feature='feature_type')
glue('comp_summary', comp_summary.style.set_table_styles().format(**conf_.format_kwargs), display=False)

### Synthesising table data

Calling `a_report.a_short_description`, `a_report.the_number_of_attributes_in_a_feature` and `a_report.summarize_feature_labels` provides the data for a top level description and a comparison of the sample total pcs/m of different attributes or features in the data.

::::{grid}
:::{grid-item-card}
:columns: 1 1 5 5
`a_report.a_short_description`
^^^
{glue}`header`



:::
:::{grid-item-card}
:columns: 1 1 7 7
`a_report.the_number_of_attributes_in_a_feature`
^^^
{glue}`components`

:::

:::{grid-item-card}
:columns: 1 1 5 6
`a_report.summarize_feature_labels`
^^^
{glue}`comp_summary`

:::

:::{grid-item-card}
:columns: 1 1 5 5
Text generated from summary tables
^^^
There were `13'759` objects identified between `2015-11-01` and `2021-12-31` in the `canton` of `Bern`. In total, `196` samples were recorded: `99` at `lakes`, `96` at `rivers` and `1` at `parks`. The `lakes` samples were recorded in `14` `cities` and the `rivers` samples in `14`. The `alpes` had only `one` sample, and its result is very close to what would be expected from a `lake` sample. The `rivers` have the `lowest` pcs/m.
:::
::::


```{note} Defining summary texts

Summary texts can be composed in many different ways, and there is a role for LLMs in this part of the report: at the very least, helping to find other locations that fall within the same range of values and providing context for either the geographic or administrative boundaries.

The person establishing the report can manage the content creation from the LLM or there can be a shared prompt for all reports.

```
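
Without an LLM, a summary sentence can also be filled in from a table with a plain template. The following is a minimal sketch under stated assumptions: the DataFrame is a hypothetical stand-in for the output of `summarize_feature_labels` (the sample counts are borrowed from the text card above), and `describe_samples` is an illustrative helper, not part of `reportclass`.

```python
import pandas as pd

# hypothetical stand-in for a summary table: samples per feature type
comp_summary = pd.DataFrame(
    {"samples": [99, 96, 1]},
    index=["lakes", "rivers", "parks"],
)

template = "In total, {total} samples were recorded: {parts}."


def describe_samples(summary: pd.DataFrame) -> str:
    """Turn the sample counts per feature type into one sentence."""
    parts = ", ".join(
        f"{count} at {label}" for label, count in summary["samples"].items()
    )
    return template.format(total=int(summary["samples"].sum()), parts=parts)


summary_text = describe_samples(comp_summary)
print(summary_text)
```

Keeping the template separate from the data makes it possible to share one wording across all reports, or to hand the same table to an LLM prompt instead.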

### Features

The features are defined by the `setvariables.geo_h` variable, a list of the different boundaries present within the data. For example, a survey location can belong to a river, lake or park. The administrative boundaries are ordered by hierarchy in `geo_h`; the default is canton, city.

Therefore, any subset of the data will contain a different set of features. The attribute `a_report.available_features` lists the boundaries that exist in the selected data.
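
One way to picture this, as a sketch only: a feature is available when its column exists in the selected data and holds at least one value. The list `candidate_features` below is a hypothetical stand-in for `setvariables.geo_h`, and the data is invented.

```python
import pandas as pd

# hypothetical boundary columns, in the spirit of setvariables.geo_h
candidate_features = ["parent_boundary", "feature_type", "feature_name", "canton", "city"]

# made-up selection of survey data; it has no parent_boundary column
survey_subset = pd.DataFrame({
    "feature_type": ["l", "l", "r"],
    "feature_name": ["bielersee", "thunersee", "aare"],
    "canton": ["Bern", "Bern", "Bern"],
    "city": ["Biel", "Thun", "Bern"],
    "pcs_m": [1.2, 0.4, 0.3],
})

# a feature is "available" if the column exists and holds at least one value
available = [
    column for column in candidate_features
    if column in survey_subset.columns and survey_subset[column].notna().any()
]
print(available)
```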

In [5]:
# a summary of the different features and boundaries in a report
a_report.available_features

### Feature names

The feature names are the labels for the features in the report data. Subreports can be generated for each feature. The prior example is bielersee. In IQAASL the selection was `{'feature_type':'l'}` for most of the report. The selection `{'feature_type':'r'}` was the last section of each survey area report and `{'feature_type':'p'}` had its own chapter (The Alps and Jura).

```python
my_labels = a_report.feature_labels()
```
The name and type of each feature available in the report can be accessed by calling `a_report.feature_labels()`. The `.keys()` method lists the available feature types; `.values()` lists the feature names and the city names for all locations with samples.
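
The cells below index the result as `my_labels['l']['feature_name']`, which suggests a dictionary keyed by feature type. A minimal sketch of how such a structure could be assembled with pandas follows; the nested shape, including the `city` key, is inferred from the description above and the data is invented.

```python
import pandas as pd

# made-up survey locations
locations = pd.DataFrame({
    "feature_type": ["l", "l", "l", "r"],
    "feature_name": ["bielersee", "bielersee", "thunersee", "aare"],
    "city": ["Biel", "Nidau", "Thun", "Bern"],
})

# assumed shape: {feature_type: {"feature_name": [...], "city": [...]}}
labels = {
    f_type: {
        "feature_name": sorted(group["feature_name"].unique()),
        "city": sorted(group["city"].unique()),
    }
    for f_type, group in locations.groupby("feature_type")
}

print(list(labels.keys()))
print(labels["l"]["feature_name"])
```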

In [6]:
# the lakes 
my_labels = a_report.feature_labels()
print(my_labels['l']['feature_name'])

In [7]:
# in the same way, the names of the parks and the cities in those parks can be identified
print(my_labels['p']['feature_name'])

In [8]:
# the same for rivers
print(my_labels['r']['feature_name'])

## Comparing results

Once the objects of interest are identified (criteria), they can be compared across the different feature types and labels.

```python
# the functions live in the reportclass module, imported above as r_class
t = r_class.a_cumulative_report(w_df[w_df.code.isin(most_common.index)], feature_name='feature_type', object_column='code')
r_class.translated_and_style_for_display(t, a_report.lang_maps[a_report.language], a_report.language, gradient=True)
``` 
For example, the most common objects are found at different densities depending on the feature type.

In [9]:
t = r_class.a_cumulative_report(w_df[w_df.code.isin(most_common.index)], feature_name='feature_type', object_column='code')
r_class.translated_and_style_for_display(t, a_report.lang_maps[a_report.language], a_report.language, gradient=True)

### Alternate object groups `groupname`

If the data has another column that labels objects, it can be used to aggregate results for each sample id. Here we consider `groupname`: each group contains more than one object and represents a use case.

```python
# the functions live in the reportclass module, imported above as r_class
t = r_class.a_cumulative_report(w_df, feature_name='feature_type', object_column='groupname')
r_class.translated_and_style_for_display(t, a_report.lang_maps[a_report.language], a_report.language, gradient=True)
``` 
For example, the different use cases are found at different densities depending on the feature type.

In [10]:
t = r_class.a_cumulative_report(w_df, feature_name='feature_type', object_column='groupname')
r_class.translated_and_style_for_display(t, a_report.lang_maps[a_report.language], a_report.language, gradient=True)
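
Aggregating by `groupname` implies two steps: first combine the objects of a group within each sample, then compute the statistic across samples for each feature. The sketch below shows those two steps with plain pandas. It is not the code behind `a_cumulative_report`; the median is an assumed statistic, and the column names and values are invented.

```python
import pandas as pd

# made-up records: three samples, three object codes in two groups
records = pd.DataFrame({
    "sample_id": ["s1"] * 3 + ["s2"] * 3 + ["s3"] * 3,
    "feature_type": ["l"] * 3 + ["r"] * 3 + ["l"] * 3,
    "code": ["G27", "G25", "G30"] * 3,
    "groupname": ["tobacco", "tobacco", "food and drink"] * 3,
    "pcs_m": [0.5, 0.25, 0.5, 0.25, 0.0, 0.75, 0.25, 0.0, 1.0],
})

# step 1: one value per sample and group (sum of the member objects)
per_sample = records.groupby(
    ["sample_id", "feature_type", "groupname"], as_index=False
)["pcs_m"].sum()

# step 2: statistic across samples for each feature; the median is an assumption
by_group = per_sample.pivot_table(
    index="groupname", columns="feature_type", values="pcs_m", aggfunc="median"
)
print(by_group)
```

Summing within the sample first matters: taking the median over individual object rows would describe a typical object, not a typical sample total for the group.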

### By Survey area or `parent_boundary`

There are two parent boundaries in Bern, the _Alpes and Jura_ and the _Aare_ river basin.

In [11]:
t = r_class.a_cumulative_report(w_df[w_df.code.isin(most_common.index)], feature_name='parent_boundary', object_column='code')

r_class.translated_and_style_for_display(t, a_report.lang_maps[a_report.language], a_report.language, gradient=True)

### By `feature_name`:

There are many different features (rivers, lakes and parks) in Bern.

In [12]:
t = r_class.a_cumulative_report(w_df[w_df.code.isin(most_common.index)], feature_name='feature_name', object_column='code')
r_class.translated_and_style_for_display(t, a_report.lang_maps[a_report.language], a_report.language, gradient=True)

### By `city`:

In [13]:
t = r_class.a_cumulative_report(w_df[w_df.code.isin(most_common.index)], feature_name='city', object_column='code')
r_class.translated_and_style_for_display(t, a_report.lang_maps[a_report.language], a_report.language, gradient=True)

```{note}
The available features are column names of the survey data. They represent the different geographic or administrative boundaries in the selected report data.

* `parent_boundary` is a geographic boundary such as a river basin or a category such as mountains
* `feature_type` designates whether the location is at a river, lake or park
* `feature_name` is the name of the river, lake or park
```