In [1]:
%load_ext watermark

import pandas as pd
import numpy as np
import datetime as dt 
from statsmodels.distributions.empirical_distribution import ECDF
from scipy.stats import beta
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import Markdown as md
from myst_nb import glue

dfCodes = pd.read_csv("resources/codes_with_group_names_2015.csv")
dfBeaches = pd.read_csv("resources/beaches_with_land_use_rates.csv")
dfBeaches = dfBeaches[dfBeaches.slug != "clean-up-event-test"]
dfBeaches.set_index("slug", inplace=True)

all_data = pd.read_csv("resources/u_all_data.csv")
all_data["date"] = pd.to_datetime(all_data["date"], format="%Y-%m-%d")

# import regional labels. labels are used
# to identify the regional priors
lac_leman_regions = pd.read_csv("resources/lac_leman_regions.csv")

# map to code descriptions
dfCodes.set_index("code", inplace=True)
dfCodes.loc["Gcaps", ["material", "description", "groupname"]] = ["Plastic", "Plastic bottle lids", "food and drink"]
code_d = dfCodes["description"]

# map to material descriptions
mat_d = dfCodes["material"]

In [2]:
# this defines the css rules for the note-book table displays
header_row = {'selector': 'th:nth-child(1)', 'props': f'background-color: #FFF; text-align:right'}
even_rows = {"selector": 'tr:nth-child(even)', 'props': f'background-color: rgba(139, 69, 19, 0.08);'}
odd_rows = {'selector': 'tr:nth-child(odd)', 'props': 'background: #FFF;'}
table_font = {'selector': 'tr', 'props': 'font-size: 10px;'}
table_data = {'selector': 'td', 'props': 'padding: 12px;'}
table_caption = {'selector': 'caption', 'props': 'font-size: 14px; font-style: italic; caption-side: bottom; text-align: left; margin-top: 10px'}
table_css_styles = [even_rows, odd_rows, table_font, header_row, table_caption]


table_large_data = {'selector': 'tr', 'props': 'font-size: 14px; padding: 12px;'}
table_large_font = [even_rows, odd_rows, table_large_data, header_row, table_caption]

# Estimating values

__Note to colleagues:__

This concerns the data from the federal report and a subset of data gathered in 2022. Since the publication of IQAASL in December 2021 there have been additional data collection events on Lac Léman:

1. SWE team
2. Association for the Sauvegarde du Léman [ASL](https://asleman.org/?lang=en) completed 100 beach litter surveys in 2022

__Previous associated work, general guidance, big picture:__

1. Common sense guidance:
   1. The data should be considered as a reasonable estimate of the minimum amount of trash on the ground at the time of the survey.
   2. The shoreline is the point where objects abandoned on land enter the water and where objects that are in the water get deposited on the beach.
   3. It is necessary to consider the data as a whole. There are many sources of variance. We have treated litter density between sampling groups and the covariance of litter density with topographical features.
      1. There are differences between the sampling groups.
      2. There are differences between sampling locations.
      3. There are differences in detectability and appearance for items of the same code that are due to the effects of decomposition.
      4. Surveyors are volunteers and have different levels of experience or physical constraints that limit what will actually be collected and counted.
     
```{admonition} If we are using the data to estimate the environmental condition then the response should answer the question:

__What and how much are the volunteers likely to find?__
      
_This is the most honest answer that can be derived from the data. How well the counts perform over time is part of what we are discussing here._
```

2. Application:
   1. __Environmental assessment:__ Where this fits in the environmental assessment process is not clear. For example, when the health of the lake is considered are the findings for litter-data considered alongside bathing water quality? The EU adopted the principle of precaution when the suggested threshold of beach litter was set to 20 pieces of trash for 100 meters [Beach litter thresholds](https://mcc.jrc.ec.europa.eu/main/dev.py?N=41&O=454).
   2. __Stakeholder impact:__ Despite the differences in the surveying groups [Section one - differences between groups](diff-groups) we see that the group with the most samples has the greatest range in the IQR. However, the median value is still close to the others indicating that we have not yet found a maximum but there is some consensus between the groups where the median might be for the lake.
   3. __Parameters of interest:__ Considerable effort has been put into exploring the covariance between litter objects and topographical features. There are many positive covariates among objects and topographical features [Near or far](https://hammerdirt-analyst.github.io/landuse/titlepage.html), suggesting that certain features may be used as parameter values when constructing models. We test this hypothesis when constructing predictions for municipal locations.

3. Sampling strategy: 
   1. Consistent with 1 and given the context in which the samples were collected [Summary test and training data](data-context) and because of the work in 2.3 we can see the benefits in sampling many different locations. We have uncertainty about where the 'median' is based on the spread of the sampling statistic but we are not concerned about the geographic spread.
   2. The experience with the students demonstrates the importance of small actions. On their own they do not say much about the lake. However, in relation to and combined with the observations with the other groups we have a better idea of what we might find in general and some specific information about Saint Sulpice [Section one - differences between groups](diff-groups).


__Six year sampling period__

The timing of the newest samples, seven years after the first samples were recorded, could be interpreted as the beginning of a new six year sampling period that started in January 2022. The Joint Research Center (JRC) at the EU suggests a six year sampling period with preferably ~ 40 surveys in that time, for each beach that is being monitored [Beach litter thresholds](https://mcc.jrc.ec.europa.eu/main/dev.py?N=41&O=454). If these conditions are met a baseline value can be established for the location in question. The baseline value, using this method, is the median value of the surveys for the time period.

There are over 250 samples from 38 different locations on the lake in the initial six year period. There are no locations that have 40 surveys, therefore the method described previously would not be appropriate for any single location, but it is more than enough for the lake. 
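The baseline rule described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the JRC reference implementation: the survey values, the location names and the helper `baseline_median` are hypothetical, and the lower limit used for the lake-wide call is chosen only so the toy data produces a value.

```python
import pandas as pd

# Hypothetical survey totals in pieces per meter, for illustration only
surveys = pd.DataFrame({
    "location": ["beach-a"] * 5 + ["beach-b"] * 3,
    "pcs/m": [0.5, 1.2, 0.8, 2.0, 1.0, 3.0, 0.4, 1.6],
})

def baseline_median(values, min_surveys=40):
    """Return the median of the surveys, or None if there are too few surveys.

    The default of 40 follows the ~40 surveys suggested in the text;
    the function itself is a hypothetical helper.
    """
    values = pd.Series(values).dropna()
    if len(values) < min_surveys:
        return None
    return float(values.median())

# No single location reaches 40 surveys, so no location gets a baseline
per_location = {
    name: baseline_median(group["pcs/m"])
    for name, group in surveys.groupby("location")
}

# Pooling all locations gives enough samples (limit lowered for the toy data)
lake_wide = baseline_median(surveys["pcs/m"], min_surveys=5)
```

With real data the same call on the pooled lake samples would use the default limit, which is the situation described above: too few surveys at any one beach, enough for the lake as a whole.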

## Research questions

__For the lake and Saint Sulpice:__

1. Given the data from 2022, is there an increase, decrease or no change in the expected survey results given the consolidated results from 2015 - 2021?
2. Given the median value for the objects of interest in 2021, what is the chance that a survey in 2022 will exceed this value?
3. How do the results from 2022 change the expected survey results going forward?

## Practical applications

Investments have been made to either prevent or remove litter from the public space. The investments are made with the intention of reducing litter in the environment. The answers to the research questions should help evaluate the return on investment (ROI) from previous projects and provide insights for projects going forward. 

1. __Did the investment result in a net decline in litter?__
2. __What objects were particularly affected?__
3. __How does the municipality compare to the rest of the lake?__
4. __Where are areas that are in need of the most investment?__

## Constraints

The assessment method must produce information that directly answers the research question and can be put to practical application immediately. The data produced should reduce the effort required to produce more specific models.  

1. There must be a method to check results integrated into the process.
   1. There must be another method that given the same data produces approximately the same results
2. The basic calculation should be as simple as possible.
   1. By this we mean the definition of the basic calculation should result from a text-book or similar.
   2. The preferred level is Maturité Federal or level one calculus
   3. The basic calculation should be executable on a spreadsheet
3. The method must be scalable
   1. There should be a path to backend server operations
   2. Output formatting should take ML operations into consideration
4. Discarding or disregarding data is highly discouraged.

## Definitions
1. __threshold:__ The pieces of trash per meter of interest. A float value between 0 and 9.99. This represents between 0 and 999 pieces of trash for every 100 meters. Survey values of individual objects rarely exceeded this range.
2. __object-code:__ Connects the survey data to information about the category of the object counted. This contains information like material type or intended use. Groups of _object-codes_ can be used to define sources or origins.
3. __frequency:__ The frequency of exceeding a threshold is the number of times that a threshold was exceeded (k) divided by the number of samples taken (n) or k/n.
4. __bounding-hex:__ A hexagon inscribed in a circle of r=1500 m with the survey location at the center
5. __dry-land:__ The portion of a bounding hex that is not covered by water.
6. __land-cover:__ The topographical features within a bounding hex that are common to most survey locations. Land-cover features can occupy up to 100% of available dry land. A bounding hex contains at least one land-cover feature.
7. __land-use:__ The topographical features within a bounding hex that are superimposed over the land cover. Land-use features occupy between 0 - 10% of the available dry-land. A bounding hex may or may not contain a land-use feature.
8. __event:__ The action of picking up a certain number of pieces of trash, identifying them and counting them
9. __probability:__ The conditional probability $\theta$ that the number of _events_ will exceed a threshold for a given _object-code_ under the defined conditions of the _bounding-hex_ 
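The definition of _frequency_ (k/n) can be illustrated with a short sketch. The sample values and the helper `exceedance_frequency` are hypothetical, and we assume here that "exceeding" means strictly greater than the threshold.

```python
import numpy as np

# Hypothetical survey results in pieces per meter for one object-code
pcs_m = np.array([0.0, 0.02, 0.10, 0.35, 0.50, 1.20])

def exceedance_frequency(samples, threshold):
    """Return k, n and k/n, where k counts samples strictly above the threshold."""
    samples = np.asarray(samples, dtype=float)
    n = samples.size
    k = int((samples > threshold).sum())
    return k, n, k / n

# Threshold of 0.10 pcs/m, that is 10 pieces for every 100 meters
k, n, freq = exceedance_frequency(pcs_m, 0.10)
```

Repeating this calculation for every threshold between 0 and 9.99 gives the set of (k, n) pairs that the rest of the method is built on.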

(assumptions)=
## Assumptions

1. Locations that have similar environmental conditions will yield similar survey results
2. There is an exchange of material (trash) between the beach and body of water
3. Following from two, the material recovered at the beach is a result of the assumed exchange
4. The type of activities adjacent to the survey location are an indicator of the trash that will be found there
5. Following from four and three, the local environmental conditions are an indicator of the local contribution to the mix of objects at the beach
6. Surveys are not 100% accurate
   1. Some objects will be misidentified
   2. Not all objects will be found
   3. There will be inaccuracies in object counts or data entry 
7. Following one through six: __the survey results are a reasonable estimate of the minimum number of objects that were present at the time the survey was completed__

## Test data, training data and objects of interest

```{warning}
This report does not include surveys from the French side of the lake. This is because the original six years was done using the set of topographical data from Swiss geo admin. To include French locations in this model, regional authorities/experts need to define the appropriate map layers or data sources from France and then a correspondence needs to be established between the two sources.
```
* __Training data:__ All the survey records on or before May 31, 2021
* __Test data:__ All the survey records after May 31, 2021
* __objects of interest:__ The object(s) for which further information is requested. Identified by the object-code.
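The partition into training and test data amounts to a comparison against one cut-off date. A minimal sketch, assuming a table with a `date` column as described in this section; the records and the name `END_TRAINING` are made up for illustration.

```python
import numpy as np
import pandas as pd

END_TRAINING = pd.Timestamp("2021-05-31")

# Hypothetical survey records, for illustration only
records = pd.DataFrame({
    "loc_date": ["a-2020", "b-2021", "c-2022"],
    "date": pd.to_datetime(["2020-07-01", "2021-05-31", "2022-03-15"]),
    "pcs/m": [1.2, 0.4, 2.1],
})

# On or before the cut-off date is training, strictly after is testing
is_training = records["date"] <= END_TRAINING
records["Project"] = np.where(is_training, "Training", "Testing")

training = records[is_training]
testing = records[~is_training]
```

A record dated exactly May 31, 2021 falls in the training data, which matches the "on or before" wording of the definition.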

The test and training data are the set of all data collected in Switzerland using the protocol defined in _the guide_ ([Guidance on Monitoring Marine Litter in European Seas](https://publications.jrc.ec.europa.eu/repository/handle/JRC83985)). More specifically: the test and training data are the results of object (trash) counts from individual survey locations over a delimited length or surface area that is bordered on one side by a lake. _The guide_ suggests a standard length of 100 m of shoreline; this was encouraged but not considered a criterion for exclusion. The minimum recorded survey length in the training data is 5 m, in the test data it is 18 m. The width is measured from the water line to the high-water mark or the physical limits of the beach itself. For the purposes of this study the only minimum length or width for a survey to be valid is that which is imposed by the data itself.

### The training data

The training data was collected by a variety of organizations over a six year period. In the first sampling campaign (MCBP: 2015-2016) the data from Lake Geneva is primarily from the south part of the lake and collected by two people. Residents of the area would know the region as the _Haut Lac_ with most of the samples coming from the _Riviera_ (agglomeration of Vevey, La Tour-de-Peilz and Montreux). In the second sampling campaign (SLR: 2017-2018) the samples were collected by volunteers from the WWF ([WWF](https://www.wwf.ch/de)). The range extended from the _Haut Lac_ to Gland, including survey locations in Lausanne. 

The last survey campaign (IQAASL: 2020 - 2021) collected samples from each major region of the lake monthly at fixed locations, other locations were added spontaneously. When the results of SLR were compared to IQAASL a decrease in the number of objects associated with food and tobacco use was considered probable. However, it was unclear if that decline was due to the pandemic restrictions of 2020, [Conclusion SLR v/s IQAASL](https://hammerdirt-analyst.github.io/IQAASL-End-0f-Sampling-2021/slr-iqaasl.html#fazit). 

### The test data

The test data is a combination of the data collected by the ASL and the SWE team. Plastock is a project run by the ASL between January and December 2022. They conducted 95 beach litter surveys, from 25 different locations ([plastock](https://hammerdirt-analyst.github.io/plastock/)). The data was analyzed in partnership with the project manager from the ASL to determine suitability for this study. The protocol for plastock was based on the national survey protocol ([IQAASL](https://hammerdirt-analyst.github.io/IQAASL-End-0f-Sampling-2021/land_use_correlation.html)); the collection and identification was completed by volunteers.

The survey dimensions in 2022 (test data) were on average longer (69 m v/s 48 m) and covered a larger surface area (430 m² v/s 209 m²) than the training data. There are a total of 245 samples in the training data; this is all the data collected in the first six year sampling period. There are 99 samples in the test data: 95 samples from the ASL and 4 samples from SWE. The test and training data are described by seven columns: *loc_date (location and date)*, _location_, _date_, _day of year (doy)_, _project (testing or training)_, _code (object code)_, _pcs/m (pieces per meter)_. 

(objects-of-interest)=
### The objects of interest

From the 2021 report there are 230 object-codes that can be attributed to each one of the 384 surveys. Some objects were found and counted only once, such as paint brushes (G166); others were found in 87% of all samples, such as cigarette ends (G27). The 15 most abundant objects from Lake Geneva identified in IQAASL account for 75% of all the objects counted that year [Lake Geneva IQAASL](https://hammerdirt-analyst.github.io/IQAASL-End-0f-Sampling-2021/lac-leman.html), table one. There are some exceptions that must be eliminated, and explained:

1. Nurdles or injection molding pellets were not counted prior to 2020
2. Plastock was focussed on plastic objects

The codes of interest are selected from the 15 most abundant objects from the federal report of 2021 AND specific objects that were counted less often but are relatively easy to identify. Furthermore, we only consider the objects that were also identified in the testing data. 

A surveyor is likely to encounter common objects in the various states of fragmentation or decomposition. Objects that are easy to identify or to describe have a better chance of being placed under the correct object code. For example, a cigarette end is immediately recognizable. Fragmented or otherwise degraded objects are challenging: determining whether or not a plastic bottle cap comes from a beverage or a chemical container can be difficult when all the labeling is removed or eroded. From the original data the following object-codes were aggregated into groups:

1. Gfoam: Fragmented expanded polystyrene, object-codes: G81, G82, G83
2. Gfrags: Fragmented plastics, object-codes: G78, G79, G80, G75, G76, G77
3. Gcaps: Plastic bottle lids and lid rings, object-codes: G21, G23, G24
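The aggregation listed above can be expressed as a mapping from the original object-codes to the group code, followed by a sum per survey. This is a sketch with made-up counts; the column names follow the ones used in this report, and `code_to_group` is a hypothetical helper name.

```python
import pandas as pd

# Group definitions taken from the list above
code_groups = {
    "Gfoam": ["G81", "G82", "G83"],
    "Gfrags": ["G78", "G79", "G80", "G75", "G76", "G77"],
    "Gcaps": ["G21", "G23", "G24"],
}

# Invert to: original code -> group code
code_to_group = {
    code: group for group, codes in code_groups.items() for code in codes
}

# Hypothetical survey counts, for illustration only
counts = pd.DataFrame({
    "loc_date": ["s1", "s1", "s1", "s1", "s2"],
    "code": ["G81", "G82", "G27", "G21", "G79"],
    "quantity": [3, 2, 10, 1, 4],
})

# Codes that belong to a group are relabeled, all other codes are kept as they are
counts["code"] = counts["code"].map(lambda c: code_to_group.get(c, c))

# One row per survey and code after aggregation
aggregated = counts.groupby(["loc_date", "code"], as_index=False)["quantity"].sum()
```

In the toy data the two polystyrene records of survey `s1` collapse into a single `Gfoam` row, while the cigarette ends (G27) are left untouched.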

Note that aggregating object codes into groups is a common strategy. When evaluating litter densities in the marine environment, _Single Use Plastics_ (SUP's) is a common group that contains objects like plastic bottles or disposable food containers. Another common group is _fishing gear_ [Beach litter thresholds](https://mcc.jrc.ec.europa.eu/main/dev.py?N=41&O=454). There are 16 objects of interest for this initial study (including the three aggregated groups). These objects represent different use-cases and sources. It is these use cases we will evaluate.

1. __Personal hygiene (Ph)__, spatial source: diffuse, toilets, water treatment facilities
   1. G95: cotton swabs
2. __Personal consumption (Pc)__, spatial source: local to survey location, abandoned within 1 500 m of the survey
   1. G30: Snack wrappers
   2. Gcaps: drink bottles, caps and lid rings
   3. G10: To go containers
   4. G25: Tobacco related, not cigarettes
   5. G27: cigarette ends
   6. G35: Straws and stirrers
   7. G31: Lollypop sticks
   8. G32: Toys, party favors
   9. G33: Lids for to go drinks
3. __Industrial/professional (Ip)__, spatial source: diffuse and local, transported to survey location or professional activities within 1 500 meters of survey location 
   1. G67: Plastic sheeting
   2. G89: Construction plastics
   3. Gfoam: Fragmented expanded polystyrene
4. __Unknown (Unk)__, spatial source: diffuse and local, transported to survey location or professional activities within 1 500 meters of survey location 
   1. Gfrags: Fragmented plastics
5. __Recreation/sports (Rc)__, spatial source: diffuse, transported to survey location, it is illegal to discharge firearms on the lakeshore
   1. G70: Shotgun shells

In [3]:
cois = cities_of_interest = ['Saint-Sulpice (VD)', 'Saint Gingolph', 'Genéve', 'Cully', 'Vevey']
toi = trash_of_interest = ['Gfrags', 'G30', 'G27', 'Gfoam', 'G95', 'G144', 'G98','Gcaps', 'G67', 'G35', 'G89', 'G31', 'G32', 'G33', 'G25', 'G70', 'G10']
some_quants = [.03, .25, .48, .5, .52, .75, .97]
end_training_date = "2021-05-31"
begin_training_date = "2015-11-15"

use_groups =  {
    'Personal hygiene':['G95', 'G100'],
    'Personal consumption':[
    'G30', 'Gcaps', 'G10', 'G25', 'G27', 'G35', 'G31', 'G32', 'G33'],
    'Industrial/professional': ['G67', 'G89', 'Gfoam'],
    'Unknown':['Gfrags'],
    'Recreation/sports': ['G70']
}

use_groups_i =  {
    'G95':'Personal hygiene',
    'G100':'Personal hygiene', 
    'G30':'Personal consumption',
    'Gcaps':'Personal consumption',
    'G10':'Personal consumption',
    'G25':'Personal consumption',
    'G27':'Personal consumption',
    'G35':'Personal consumption',
    'G31':'Personal consumption',
    'G32':'Personal consumption',
    'G33':'Personal consumption',
    'G144':'Personal hygiene',
    'G98': 'Personal hygiene',
    'G67':'Industrial/professional',
    'G89':'Industrial/professional',
    'Gfoam':'Industrial/professional',
    'Gfrags':'Unknown',
    'G70':'Recreation/sports'
}

abbrev_use_g = {'Unknown':'Unk','Personal consumption':'Pc', 'Personal hygiene': 'Ph',    'Recreation/sports': 'Rc', 'Industrial/professional':'Ip'}

lake = 'lac-leman'

not_these = ['amphion', 'anthy', 'excenevex', 'lugrin', 'meillerie', 'saint-disdille', 'tougues']

In [4]:
def prior_distributions(prior_data: pd.DataFrame = None, start: str = None, end: str = None,
                        xrange: np.array = None, uninformed_prior: np.array = None):
    data_args = {
        'start':start,
        'end':end,
        'data': prior_data,
    }
    prior_pcs = period_pieces(*data_args.values())         

    # get n and k for the prior data
    prior_k, prior_notk, prior_k_n_minus_k = period_k_and_n(prior_pcs, xrange)
   
    # make the likelihood parameters
    lhx = list(zip(prior_k, prior_notk))

    # make the prior distribution
    p_ui, prior_bmean = make_expected(lhx, uninformed_prior, xrange)

    # the uninformed beta approximation of the prior data
    prior_beta = [period_beta(x) for x in prior_k_n_minus_k]
    p_beta= [x.mean() for x in prior_beta]

    results=pd.DataFrame({"x":xrange, "p":p_ui})
    results["pn"] = results.p/results.p.sum()
    
    return np.array(p_ui), np.array(p_beta), prior_k_n_minus_k, results, prior_pcs

def posterior_distribution(lh_data: pd.DataFrame = None, start: str = None, end: str = None, code: str = None,
                           informed_prior: np.array = None, un_informed: np.array = None):
                               
    
    data_args = {
        'start': start,
        'end':end,
        'data': lh_data,   
        }

    period_all = period_pieces(*data_args.values())
    
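    # NOTE: xrange is not a parameter of this function, it is read from the
    # enclosing (module) scope and must be defined before this is called
    pall_k, pall_notk, pall_k_n_minus_k = period_k_and_n(period_all, xrange)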
    
    lh_and_informed = np.array(pall_k_n_minus_k) + np.array(informed_prior)
    lhx = list(zip(pall_k, pall_notk))        
    
    probi, probi_beta = make_expected(pall_k_n_minus_k, np.array(informed_prior), xrange)
    grid_prox, grid_prox_beta = make_expected(pall_k_n_minus_k, un_informed, xrange)
    
    # beta distribution 
    pall_beta = [period_beta((x[0]+1, x[1]+1)) for x in pall_k_n_minus_k]
    pall_bmean = [x.mean() for x in pall_beta]
    return np.array(probi), np.array(grid_prox), pall_bmean, period_all
                               
def training_testing_compare(lh_pcs, pcs, post_quants, prior_quants):
    
    total_training = len(pcs) + len(lh_pcs)
    prior_weight = len(pcs)/total_training
    lh_weight = len(lh_pcs)/total_training

    number_of_samples = {"Training": len(pcs), "Testing": len(lh_pcs)}
    weights = {"Training":prior_weight, "Testing": lh_weight}
    observed_median = {"Training":np.median(pcs), "Testing": np.median(lh_pcs)}
    observed_average = {"Training":np.mean(pcs), "Testing": np.mean(lh_pcs)}
    observed_25 = {"Training": prior_quants[1], "Testing":post_quants[1]}
    observed_75 = {"Training": prior_quants[5], "Testing":post_quants[5]}
    index = ["weight all samples", "Number of samples", "Median", "Average", "25th percentile", "75th percentile"]
    components = [weights, number_of_samples, observed_median, observed_average, observed_25, observed_75]
    unks_sum_table = pd.DataFrame(components, index=index).style.format(precision=2).set_table_styles(table_large_font)
    styled = unks_sum_table.format(formatter="{:.0f}", subset=pd.IndexSlice[['Number of samples'], :])
    
    return styled

def predicted_summary(lh_pcs, pcs, prior_quants, median_2024):
    

    predicted = ((lh_pcs <= prior_quants[5])&(lh_pcs >= prior_quants[1])).sum()/len(lh_pcs)
    predicted_94 = ((lh_pcs <= prior_quants[-1])&(lh_pcs >= prior_quants[0])).sum()/len(lh_pcs)
    past_present_future = {
        "Median 2021": np.median(pcs), 
        "Median 2022": np.median(lh_pcs), 
        "Expected sampling median 2024":median_2024,
        "% 2022 in 50% IQR  predicted": predicted,
        "% 2022 in 94% IQR  predicted": predicted_94,
    }
        
    
    ppf = pd.DataFrame(past_present_future, index=["pcs/m"]).T

    return ppf


def make_results_df(prior_df, lh_c, source=None, source_norm=None):
    prior_df[source] = lh_c
    prior_df[source_norm] = prior_df[source]/prior_df[source].sum()

    return prior_df

(data-context)=
### Summary test and training data

Another way to look at this collection of observations is that each sampling group collected the data for reasons that were specific to that group. The protocol provided a framework for ensuring consistency and a pathway to interpreting the results. However, this does not mean that each group interpreted the protocol in the same manner, nor does it mean that all objects collected were counted. By limiting analysis to specific object-codes, those that appear most often and/or those that are easily identified, uncertainty is reduced by leveraging frequency of occurrence and domain experience.

In [5]:
asl_beaches = pd.read_csv("resources/u_asl_beaches.csv")
plastock = pd.read_csv("resources/u_pstk.csv")
plastock["date"] = pd.to_datetime(plastock["date"])

pstock = plastock.groupby(["loc_date", "location", "city", "date", "doy", "Project", "code"], as_index=False).agg({"pcs/m": 'sum', 'quantity':'sum'})
pstockx = pstock.groupby(["loc_date","city" ,"date", "doy", "Project"], as_index=False).agg({"pcs/m": 'sum', 'quantity':'sum'})

# partition the data on the end of the training period
date_mask =  (all_data["date"] <= end_training_date)
prior_d = all_data[date_mask].copy()
prior_d = prior_d[prior_d.water_name_slug == lake]
prior_d.rename(columns={"pcs_m":"pcs/m"}, inplace=True)
prior_d["Project"] = "Training"
prior_dt = prior_d[["loc_date", "location", "city" ,"date", "doy", "Project", "code", "pcs/m", "quantity"]].groupby(["loc_date", "location", "city", "date", "doy", "Project", "code"], as_index=False).agg({"pcs/m": 'sum', 'quantity':'sum'})
prior_dx = prior_dt[prior_dt.code.isin(pstock.code.unique())].groupby(["loc_date", "city", "date", "doy", "Project"], as_index=False).agg({"pcs/m": 'sum', 'quantity':'sum'})

combined = pd.concat([prior_dt, pstock])
combinedx = pd.concat([prior_dx, pstockx])

sup_after = pd.read_csv("resources/sup_after.csv")
sup_after = sup_after[sup_after.code.isin(combined.code.unique())]
sup_afterx = sup_after[sup_after.code.isin(toi)].groupby(["loc_date", "date", "city", "Project"], as_index=False).agg({"pcs/m": 'sum', 'quantity':'sum'})

# add those values to the plastock data and samples prior to end training date
cbdi = pd.concat([combined, sup_after[["loc_date", "location", "city", "date", "doy", "Project", "code", "pcs/m"]]], ignore_index=True, axis=0)
cbdx = pd.concat([combinedx, sup_afterx], ignore_index=True)

cbd = cbdi[cbdi.code.isin(toi)]

column_names_groups = {v:k for k,v in abbrev_use_g.items()}
code_groups = list(column_names_groups.keys())

In [6]:
summ_data = cbd.copy()
summ_data = summ_data[~summ_data.location.isin(not_these)]
summ_data["use group"] = summ_data.code.map(lambda x: use_groups_i[x])


summ_data["ug"] = summ_data["use group"].apply(lambda x: abbrev_use_g[x])
summ_data[summ_data["use group"] == 'Personal consumption'].code.unique()
summ_data["date"] = pd.to_datetime(summ_data["date"], format="%Y-%m-%d")

sd_x = summ_data.groupby(["loc_date", "date", "city", "Project", "doy"], as_index=False).agg({"pcs/m": 'sum', 'quantity':'sum'})


trg = summ_data[summ_data.Project == "Training"].copy()
tst = summ_data[summ_data.Project == "Testing"].copy()
trg_c, tst_c = trg.city.nunique(), tst.city.nunique()
trg_lc, tst_lc = trg.location.nunique(), tst.location.nunique()
trg_q, tst_q = trg.quantity.sum(), tst.quantity.sum()

data_magnitude = [
    {"Training":trg_c, "Testing":tst_c},
    {"Training":trg_lc, "Testing":tst_lc},
    {"Training":trg_q, "Testing":tst_q}
    
]

cities_set = list(set([*trg.city.unique(), *tst.city.unique()]))
n_ind_cities = len(cities_set)

caption = f'The number of different locations and cities for the data. Note that there are {n_ind_cities} different municipalities in all.'

data_summ_q = pd.DataFrame(data_magnitude, index=["Number of cities", "Number of locations", "Total objects"]).astype('int')
data_summ_q = data_summ_q.style.format(formatter="{:,}").set_table_styles(table_large_font).set_caption(caption)
styled = data_summ_q.format(formatter="{:,}", subset=pd.IndexSlice[['Total objects'], :])
glue("data-summ-q", styled, display=False)

In [7]:
# all the data by date
the_99th_percentile = np.quantile(sd_x['pcs/m'].values, .99)
px = 1/plt.rcParams['figure.dpi']  # pixel in inches
fig, ax = plt.subplots(figsize=(600*px,500*px))

sns.scatterplot(data=sd_x, x='date', y='pcs/m', hue='Project', hue_order=["Training", "Testing"],ax=ax)

ax.set_ylim(-1, the_99th_percentile)
ax.set_xlabel("")
glue("testing_training_chrono", fig, display=False)
plt.close()

In [8]:
# all the data day of year
fig, ax = plt.subplots(figsize=(600*px, 500*px))

sns.scatterplot(data=sd_x, x='doy', y='pcs/m', hue='Project',  hue_order=["Training", "Testing"],ax=ax)
ax.set_ylim(-1, the_99th_percentile)
ax.set_xlabel("Day of the year")
glue('testing_training_doy', fig, display=False)
plt.close()

In [9]:
testing_vals= sd_x[sd_x.Project == "Testing"]['pcs/m'].values
training_vals = sd_x[sd_x.Project == "Training"]['pcs/m'].values


train_quantiles = np.quantile(training_vals, some_quants)
test_quantiles = np.quantile(testing_vals, some_quants)

training_testing_summary = training_testing_compare(testing_vals, training_vals, test_quantiles, train_quantiles)
caption = "The observed values from the training and testing data. Note that the testing data is only 22% of all the data. This is because we are only in the first year of a six year sampling period."
sum_table = training_testing_summary.set_caption(caption)
sum_table.format(formatter="{:.0f}", subset=pd.IndexSlice[['Number of samples'], :])
glue("data-summary", sum_table, display=False)

In [10]:
fig, ax = plt.subplots(figsize=(600*px, 500*px))

sns.ecdfplot(data=sd_x, x='pcs/m', hue='Project',  hue_order=["Training", "Testing"],ax=ax)
ax.set_xlim(-1, the_99th_percentile)
ax.set_ylabel("Cumulative probability")
glue('testing_training_cumulative', fig, display=False)
plt.close()

|Figure 1, Table 1 | Table 2, Figure 3|
|:-----------------------:|:---------------------:|
|{glue:}`testing_training_chrono` |{glue}`data-summary`|
|{glue:}`data-summ-q`|{glue}`testing_training_doy`|

## Methods

The research questions and practical applications are inquiring about expected results at the municipal level. There are records for 25 municipalities on Lake Geneva, some only have one sample in the entire sampling period. The negative binomial distribution was used to model expected survey results at the river basin and national level [Estimating baselines IQAASL](https://www.plagespropres.ch/baselines.html#calculating-baselines). Here we are abandoning the assumption that the data has a particular shape and solving the binomial portion of the negative binomial distribution at each interval on the set of numbers from 0 - 10, every 0.01.

We use conditional probability because of the assumptions of our model. In this sense we are following trends from the conservationists and wildlife biologists. Both fields have a rich history of treating observations from the field that originate from citizen science projects and/or difficult field sampling conditions. Beach litter data collection is one such program. ([summarizing bird counts](https://www.fs.usda.gov/psw/publications/documents/psw_gtr191/psw_gtr191_0762-0770_sauer.pdf), [estimating age of survival](https://besjournals.onlinelibrary.wiley.com/doi/full/10.1111/2041-210X.14077), [estimating tick abundance](https://hal.science/hal-02637100/)). The magnitude of the exchange between the water source and the beach is yet another variable that is for the most part unknown, except that which can be interpreted from the survey data ([Identifying Marine Sources of Beached Plastics](https://agupubs.onlinelibrary.wiley.com/doi/full/10.1029/2021GL097214)). In summary, there are many sources of variance, only one of which is the sampling error.

The applied method would best be classified as Empirical Bayes, in the sense that the prior is derived from the data ([Bayesian Filtering and Smoothing](https://users.aalto.fi/~ssarkka/pub/cup_book_online_20131111.pdf), [Empirical Bayes methods in classical and Bayesian inference](https://link.springer.com/article/10.1007/s40300-014-0044-1)). However, we share the concerns of Davidson-Pilon ([Bayesian methods for hackers](https://dataorigami.net/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers/#contents)) about double counting and eliminate it as part of the formulation of the prior. This is possible because of the number of different locations and regions that were sampled during the sampling period. The basis for this method was originally explored in the Swiss federal report and then again with much more rigor in [Near or far](https://hammerdirt-analyst.github.io/landuse/titlepage.html).

```{note}
Model types or analytical labels are only important if the context is understood. Empirical Bayes means we are building a probabilistic model that uses existing data to establish those probabilities.
```

### Grid Approximation

Grid approximations are made from a series of point estimates calculated using Bayes' theorem. Conditional probability and Bayes' theorem make it possible to measure the magnitude of an unknown parameter as long as the conditions can be quantified. There is no assumption about the underlying relationships between variables except that a relationship exists. The relationship is defined in the notation _the probability of a given b_, or $p(a|b)$ ([Statistical Machine Learning](https://www.stat.cmu.edu/~larry/=sml/Bayes.pdf), [Bayes' theorem](https://en.wikipedia.org/wiki/Bayes%27_theorem), [conditional probability](https://en.wikipedia.org/wiki/Conditional_probability)).

The proposed model only asks whether or not a threshold has been exceeded. This is a binary variable. Therefore each step in the grid can be modeled using the binomial distribution ([Think Bayes 2](https://allendowney.github.io/ThinkBayes2/chap02.html), [Bayes Rules! An Introduction to Applied Bayesian Modeling](https://www.bayesrulesbook.com/chapter-6.html)). The prior data can be introduced and the integral can be solved analytically by using the conjugate prior of the binomial ([Bayesian methods for hackers](https://dataorigami.net/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers/#contents), [Prior Probabilities](https://bayes.wustl.edu/etj/articles/prior.pdf)). The grid we are covering is relatively small, 1000 points, but it does represent real values in pieces of trash per meter between 0 and 10. This accounts for 99% of the data in the previous section.

```{note}
A grid approximation is not the same as a hierarchical model. The grid approximation is a way to explore the suitability of the method with the data, in relation to the research question, without the overhead of developing a full model. Because of this compromise, a grid approximation is less accurate than a hierarchical model.
```

### Conditional probability

Conditional probability is the measure of the probability of an event occurring given that another event has occurred ([Wikipedia](https://en.wikipedia.org/wiki/Conditional_probability)). For this study the event under consideration is whether or not a threshold was exceeded. The probability of that event is noted $p(\theta)$; the probability of $\theta$ given a condition or set of conditions is $p(\theta | condition(s))$.
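
As a small illustration, $p(\theta)$ and $p(\theta | condition)$ can both be read as relative frequencies. The sketch below assumes a handful of made-up records and uses the region as the condition; the names `records`, `p_exceeds` and `p_exceeds_given` are hypothetical.

```python
# hypothetical records: (region, pcs/m); values are illustrative only
records = [
    ("region-a", 0.4), ("region-a", 1.8), ("region-a", 2.2),
    ("region-b", 0.1), ("region-b", 0.3), ("region-b", 1.1),
]
threshold = 1.0


def p_exceeds(rows, x):
    """Proportion of rows with a result greater than or equal to x."""
    return sum(1 for _, y in rows if y >= x) / len(rows)


def p_exceeds_given(rows, x, region):
    """The same proportion, restricted to the rows from one region."""
    subset = [row for row in rows if row[0] == region]
    return p_exceeds(subset, x)


p_theta = p_exceeds(records, threshold)
p_theta_a = p_exceeds_given(records, threshold, "region-a")
p_theta_b = p_exceeds_given(records, threshold, "region-b")
```

In this toy data the unconditioned probability sits between the two conditioned values, which is the reason for conditioning in the first place.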

__The general case__

Using the formal definition of conditional probability, let a = $\theta$ and b = _event data_. The probability of an event given a condition is:

$$ 
\begin{align}
p(a | b) =&\ \frac{p(a \cap b)}{p(b)},\ 0.0 \lt a \lt 1,\ 0.0 \lt b \lt 1 \tag{1}\\[10pt]
\text{and}\\
p(b | a) =&\ \frac{p(b \cap a)}{p(a)},\ p(a \cap b) =\ p(b \cap a) \tag{2}\\[10pt]
\text{therefore}\\
p(b | a) =&\ \frac{p(b \cap a)}{p(a)}\ =\ \frac{p(a \cap b)}{p(a)} \tag{3}\\[10pt]
p(a)p(b|a) =&\ p(a \cap b) \tag{4}\\[10pt]
\text{finally}\\
p(a | b) =&\ \frac{p(a)p(b|a)}{p(b)} \tag{5}\\[10pt]
\text{total probability = evidence} =&\ p(b) =\ p(b|a)p(a) + p(b|a^{\prime})p(a^{\prime}) \tag{6}\\[11pt]
\text{posterior} =&\ \frac{\text{likelihood} \times \text{prior}}{\text{evidence}}\tag{7}\\[10pt]
\end{align}
$$

This is Bayes' theorem. It tells us how the probability of event _a_ is conditioned on event _b_. If the sample space can be defined by $a$ and $a^{\prime}$, then the total probability (6) is the sum of (the likelihood * the prior) and (the likelihood under the complement * the prior of the complement). This means that we consider only two possible results: $y \ge x$ or $y \lt x$, where x is a threshold value and y is a survey result in pieces per meter (pcs/m). To use Bayes' theorem we need to assign values to the likelihood and prior and carry out the math in (7).
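
Equations (5) and (6) can be carried out by hand for a two-outcome sample space. The sketch below uses illustrative probabilities that are not taken from the survey data; `posterior` is a hypothetical helper name.

```python
def posterior(prior, likelihood, likelihood_complement):
    """p(a|b) from (5), with the evidence p(b) expanded as in (6).

    prior:                 p(a)
    likelihood:            p(b|a)
    likelihood_complement: p(b|a')
    """
    evidence = likelihood * prior + likelihood_complement * (1 - prior)
    return likelihood * prior / evidence


# illustrative numbers only: p(a) = 0.5, p(b|a) = 0.8, p(b|a') = 0.2
p_a_given_b = posterior(0.5, 0.8, 0.2)

# the same calculation for the complement a'
p_not_a_given_b = posterior(0.5, 0.2, 0.8)
```

Because $a$ and $a^{\prime}$ cover the whole sample space, the two posteriors sum to one.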

$$
\begin{align}
\text{prior} =& \frac{\Gamma(a + b)}{\Gamma(a)\Gamma(b)} * \theta^{a-1} (1-\theta)^{b-1}, \ \ \text{beta distribution, } a,\ b \text{ set from the prior counts } k,\ n_{pr}-k \tag{8} \\[12pt]
\text{likelihood} =& {n\choose s}*\theta^{s}(1-\theta)^{n-s},\ \ \text{binomial distribution} \tag{9} \\[12pt]
\text{prior*likelihood} =& K*\theta^{s+a-1}*(1-\theta)^{n-s+b-1},\ \ K={n\choose s}\frac{\Gamma(a + b)}{\Gamma(a)\Gamma(b)} \tag{10} \\[12pt]
\text{evidence} =& K\int_{0}^1 \theta^{s+a-1}*(1-\theta)^{n-s+b-1} d\theta\ = K\frac{\Gamma(s+a)\Gamma(n - s + b)}{\Gamma(a + b + n)} \tag{11} \\[12pt]
\end{align}
$$
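
The closed form of the evidence rests on a standard identity: the integral of $\theta^{\alpha-1}(1-\theta)^{\beta-1}$ over the interval from 0 to 1 equals $\Gamma(\alpha)\Gamma(\beta)/\Gamma(\alpha+\beta)$. A quick numerical check of that identity, assuming illustrative values for $a, b, s, n$ (the function names are hypothetical and the midpoint rule is just one simple way to integrate):

```python
import math


def beta_kernel_integral(alpha, beta_, steps=100_000):
    """Midpoint-rule estimate of the integral over [0, 1] of
    theta**(alpha - 1) * (1 - theta)**(beta_ - 1)."""
    h = 1.0 / steps
    total = 0.0
    for i in range(steps):
        t = (i + 0.5) * h
        total += t ** (alpha - 1) * (1 - t) ** (beta_ - 1)
    return total * h


def beta_function(alpha, beta_):
    """Closed form: Gamma(alpha) * Gamma(beta) / Gamma(alpha + beta)."""
    return math.gamma(alpha) * math.gamma(beta_) / math.gamma(alpha + beta_)


# illustrative values: prior a=2, b=3; data s=4 exceedances in n=10 samples
a, b, s, n = 2, 3, 4, 10
numeric = beta_kernel_integral(s + a, n - s + b)
closed = beta_function(s + a, n - s + b)
print(numeric, closed)
```

The two numbers agree to many decimal places, which is why no numerical integration is needed in the model itself.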

```{important}

The beta distribution (8) is the conjugate prior of the binomial distribution (9) ([Statistical Machine Learning](https://www.stat.cmu.edu/~larry/=sml/Bayes.pdf), [Conjugate prior](https://en.wikipedia.org/wiki/Conjugate_prior)). This means that the posterior distribution can be solved analytically. The evidence (11) is an integral; it can be solved by recognizing that the integrand is the kernel of a beta distribution, and a beta distribution integrates to one.

Therefore we can write the solution of Bayes theorem using the beta binomial conjugate model as:

$$
\begin{align}
p(a|b) \sim & \ Beta(s+a, n+b-s) \tag{12} \\[12pt]
\end{align}
$$

The mean (average) of the beta distribution above is $\frac{a + s}{a + b + n}$, which is the value that is calculated for each interval in the grid approximation.

```
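
A minimal sketch of the posterior mean from (12) applied across a grid, assuming one pair of prior coefficients and one pair of likelihood counts per threshold. The three-point grid and the counts are invented for illustration, and `grid_posterior_means` is a hypothetical name, not one of the notebook functions.

```python
def grid_posterior_means(prior_counts, lh_counts):
    """Posterior mean (a + s) / (a + b + n) at each grid point.

    prior_counts: list of (a, b) pairs, one per threshold
    lh_counts:    list of (s, n - s) pairs, one per threshold
    """
    means = []
    for (a, b), (s, f) in zip(prior_counts, lh_counts):
        # n = s + f, so the denominator is a + b + n
        means.append((a + s) / (a + b + s + f))
    return means


# illustrative three-point grid: low, middle and high threshold
prior_counts = [(10, 0), (6, 4), (1, 9)]
lh_counts = [(5, 0), (3, 2), (0, 5)]
means = grid_posterior_means(prior_counts, lh_counts)
print(means)
```

The result is one probability of exceedance per threshold, which is the curve plotted in the results.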

### The priors

The priors used in this model are _subjective_. This is by definition, because they come from the data ([Bayesian Filtering and Smoothing](https://users.aalto.fi/~ssarkka/pub/cup_book_online_20131111.pdf)). The subjective bias is caused by the use of data from similar experiments that were carried out under different conditions at similar locations. However, we omit the location(s) of interest from the prior data. Instead we rely on other locations that have a similar land-use configuration to the locations of interest.

This use of the prior fits well with the initial assumptions of the model and the previous work using Spearman's $\rho$ to identify covariates between objects and topographical features ([Near or far](https://hammerdirt-analyst.github.io/landuse/titlepage.html)). This increases the amount of available data for any single location. A distance component is used to capture the locations closest to the locations of interest. Regional denominations also work well. In this model the three regions of the lake are all considered subsets of prior data ([Empirical Bayes methods in classical and Bayesian inference](https://link.springer.com/article/10.1007/s40300-014-0044-1)).
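
For readers unfamiliar with Spearman's $\rho$, it is the Pearson correlation of the ranks of two variables. The sketch below is a plain-Python version with made-up values for a land-use share and survey results; it is not the code used in the earlier work, and all names are hypothetical.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend j over the run of tied values
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman_rho(x, y):
    """Spearman's rho: the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# hypothetical values: share of buildings in the hex and pcs/m at five locations
buildings = [0.10, 0.25, 0.40, 0.55, 0.80]
pcs_m = [0.3, 0.9, 0.7, 2.1, 3.5]
rho = spearman_rho(buildings, pcs_m)
```

Because only the ranks matter, a single extreme survey result does not dominate the coefficient, which suits skewed beach litter data.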

#### The informed prior

An informed prior is a collection of results from different locations that have similar magnitudes of specific land-use attributes. An informed prior does not contain survey results from the components of the likelihood. It is located at least within the same river basin and most often on the same body of water as the likelihood component. This can be made even more explicit:

1. informed prior: The probability of x described by a subset of the data that was observed at a date on or before the maximum date of the likelihood data being evaluated.
2. The initial evaluation should be a covariance test such as Spearman's $\rho$ or a comparable test
   1. The subset of data is related to the observed data either geographically or by a measurable attribute
      1. example of geographic relationship: same lake or river basin
      2. example of measurable attribute: similar land use configuration
         1. samples within 5% of each other on the magnitude scale for buildings
         2. the samples with the lowest percentile ranking for infrastructure
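
The conditions above can be sketched as a filter over survey records. This is an illustration under stated assumptions, not the selection code used in the notebook: the records are invented, the region stands in for the geographic relationship, both the geographic and the attribute condition are required, and "within 5%" is read as an absolute difference of 0.05 on the 0 to 1 magnitude scale.

```python
from datetime import date

# hypothetical records: one row per survey
surveys = [
    {"location": "loc-a", "region": "east", "buildings": 0.42, "date": date(2022, 5, 1)},
    {"location": "loc-b", "region": "east", "buildings": 0.45, "date": date(2021, 3, 1)},
    {"location": "loc-c", "region": "east", "buildings": 0.90, "date": date(2020, 7, 15)},
    {"location": "loc-d", "region": "west", "buildings": 0.40, "date": date(2021, 6, 10)},
    {"location": "loc-e", "region": "east", "buildings": 0.40, "date": date(2023, 1, 20)},
]


def informed_prior_records(records, lh_locations, attribute, tolerance=0.05):
    """Select the records that qualify as prior data for the likelihood locations."""
    lh = [r for r in records if r["location"] in lh_locations]
    max_date = max(r["date"] for r in lh)
    regions = {r["region"] for r in lh}
    magnitudes = [r[attribute] for r in lh]

    selected = []
    for r in records:
        if r["location"] in lh_locations:
            continue  # the likelihood locations are never part of the prior
        if r["region"] not in regions or r["date"] > max_date:
            continue  # wrong region, or observed after the likelihood data
        if any(abs(r[attribute] - m) <= tolerance for m in magnitudes):
            selected.append(r)
    return selected


prior_rows = informed_prior_records(surveys, {"loc-a"}, "buildings")
print([r["location"] for r in prior_rows])
```

Keeping the likelihood locations out of the selection is what prevents the same observations from being counted twice.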

It may be beneficial to note that this prior satisfies the condition of testability. Even though we do not know exactly what the prior distribution will be at any moment, the conditions imposed give us an expected result based on experience accrued elsewhere under similar conditions. In that sense we remain consistent with our assumptions, and testable in the sense that the priors are quantifiable ([Prior Probabilities](https://bayes.wustl.edu/etj/articles/prior.pdf)).

The set of available data is small, and choices or statements about the value of the $prior$ have considerable weight in reference to the posterior distribution. Having multiple possible values for the $prior$ is consistent with Bayesian analysis ([Gelman prior distribution](http://www.stat.columbia.edu/~gelman/research/published/p039-_o.pdf)).

#### The uninformed prior

The uninformed prior is the initial prior we used, which captures the expectations of most beach litter surveyors: _you can find anything and if you do it long enough you will_. The uninformed prior is the distribution such that every value of x on the described interval has an equal chance of occurring: 0.5 or 1/2 or even 50%. We use this prior when there is no source of prior data. For example, for Lake Geneva there is no other source of comparable data in the river basin.

These values are exchanged in the form of the coefficients $k, n-k$ in (8). The value that is in the grid is the mean of the beta distribution of the probability of exceeding the threshold at that point. The grid has 1000 points. Having multiple possible values for the $prior$ is analogous to looking at the problem from different angles or changing some baseline assumptions ([Our assumptions](assumptions)).
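
A short sketch of what the uninformed prior does at a single grid point, assuming it is encoded as Beta(1, 1), as the `(1, 1)` tuples in the code cells suggest. The helper names are hypothetical.

```python
def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)


def with_uniform_prior(k, n):
    """Beta parameters after k exceedances in n samples, starting from Beta(1, 1)."""
    return k + 1, n - k + 1


# before any data: the mean of Beta(1, 1)
prior_mean = beta_mean(1, 1)

# four samples, none of which exceeded the threshold
params = with_uniform_prior(0, 4)
expected = beta_mean(*params)
```

Even when no sample exceeded the threshold the expected probability stays above zero, which is in line with the surveyors' expectation quoted above.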

#### The measured land-use attributes

```{important}
There are slight changes in the way the land-use variables are handled and described with respect to [Near or far](https://hammerdirt-analyst.github.io/landuse/titlepage.html). Notably the infrastructure and recreation variables are scaled separately from the land-cover variables.

The land-use attributes are described here in detail. The descriptions are taken from the maps themselves, which are easy to integrate into other geo maps from the Swiss admin system. Here we are letting go of some of the control and limiting our variables to a set of choices. Those choices were derived by experts, who are certainly better at classifying land use than we are.
```
The informed priors in this study are assembled by considering the survey results from locations that have similar environmental conditions. The connection between measurable land-use attributes and survey results was illustrated in the Swiss national survey ([IQAASL](https://hammerdirt-analyst.github.io/IQAASL-End-0f-Sampling-2021/land_use_correlation.html)) and explored in depth, including early versions of the proposed model, in [Near or far](https://hammerdirt-analyst.github.io/landuse/titlepage.html). The source of the land-use data is [swissTLMRegio](https://www.swisstopo.admin.ch/de/geodata/landscape/tlmregio.html). The details of extracting the data and defining the boundary conditions can be found in [New map data](https://hammerdirt-analyst.github.io/landuse/hex-3000-m.html).

__Land-cover__

These measured land-use attributes are the labeled polygons from the map layer _Landcover_ defined in ([swissTLMRegio product information](https://www.swisstopo.admin.ch/fr/geodata/landscape/tlm3d.html#dokumente)). They are extracted using vector overlay techniques in QGIS ([QGIS](https://qgis.org/en/site/)). The overlay is a hexagon grid; each hex is 3000 m across, circumscribed by a circle of r = 1500 m. The survey location is at the center of the hex. The magnitude of the land-use variable is the portion of the total dry surface area for any particular land-use attribute. Areas of the hex that are not defined with a land-use attribute in this map layer are labeled _undefined_ and processed like any other land-use attribute. The land-cover variables of interest are:

1. Buildings: built up, urbanized
2. Woods: not a park, harvesting of trees may be active
3. Vineyards: does not include any other type of agriculture
4. Orchards: not vineyards
5. Undefined: areas of the map with no predefined label
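
A rough sketch of how the magnitude of a land-cover variable could be computed for one hex, assuming the labeled areas and the water surface are already known in m². The areas below are invented, and subtracting the water surface to obtain the dry surface is an assumption about the processing; the actual overlay is done in QGIS.

```python
import math


def hexagon_area(circumradius):
    """Area of a regular hexagon from the radius of its circumscribed circle."""
    return 3 * math.sqrt(3) / 2 * circumradius ** 2


def land_cover_shares(areas_m2, water_m2, hex_area_m2):
    """Share of the dry surface per land-cover label.

    Dry surface without a label is reported as 'undefined'.
    """
    dry = hex_area_m2 - water_m2
    shares = {label: area / dry for label, area in areas_m2.items()}
    shares["undefined"] = (dry - sum(areas_m2.values())) / dry
    return shares


hex_area = hexagon_area(1500)  # r = 1500 m

# hypothetical labeled areas in m² for one hex
labeled = {"buildings": 1_200_000, "woods": 800_000, "vineyards": 300_000}
shares = land_cover_shares(labeled, water_m2=2_000_000, hex_area_m2=hex_area)
print(shares)
```

The shares sum to one by construction, so the undefined portion is treated exactly like any other attribute.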

__Land-use__

Land-use variables are the labeled polygons from the _Freizeitareal_ and _Nutzungsareal_ map layers, defined in ([swissTLMRegio product information](https://www.swisstopo.admin.ch/fr/geodata/landscape/tlm3d.html#dokumente)). Both layers represent areas used for specific activities. Freizeitareal identifies areas used for recreational purposes and Nutzungsareal represents areas such as hospitals, cemeteries, historical sites or incineration plants. As a ratio of the available dry land in a hex, these features are relatively small (less than 10% of the total dry land). For identified features within a bounding hex the magnitude in meters² of these variables is scaled between 0 and 1; thus the scaled value represents the size of the feature in relation to all other measured values for that feature from all other hexagons.

1. Recreation: parks, sports fields, attractions
2. Infrastructure: schools, hospitals, cemeteries, power plants
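
One common way to scale a variable between 0 and 1 is min-max scaling. The sketch below assumes that this is the scaling meant above, which the text does not state explicitly; the areas are invented and `min_max_scale` is a hypothetical name.

```python
def min_max_scale(values):
    """Scale values to the interval 0 to 1 relative to the smallest and largest value."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # all hexagons have the same magnitude, nothing to compare
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# hypothetical recreation areas in m² for four hexagons
recreation_m2 = [0, 12_000, 45_000, 90_000]
scaled = min_max_scale(recreation_m2)
print(scaled)
```

Scaling within the feature keeps small but relevant areas comparable between hexagons, which a share of the total dry land would not.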

__Streets and roads__

Streets and roads are the labeled polylines from the _TLM Strasse_ map layer defined in ([swissTLMRegio product information](https://www.swisstopo.admin.ch/fr/geodata/landscape/tlm3d.html#dokumente)). All polylines from the map layer within a bounding hex are merged (dissolved in QGIS commands) and the combined length of the polylines, in meters, is the magnitude of the variable for the bounding hex.
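
A simplified sketch of the length calculation, assuming the polylines are already clipped to the hex and given as lists of (x, y) coordinates in meters. The coordinates are invented, and the dissolve step that QGIS performs on overlapping segments is not reproduced here.

```python
import math


def polyline_length(points):
    """Length of one polyline: the sum of its segment lengths."""
    return sum(math.dist(p, q) for p, q in zip(points, points[1:]))


def total_street_length(polylines):
    """Combined length of all polylines within one hex."""
    return sum(polyline_length(line) for line in polylines)


# hypothetical polylines within one hex, coordinates in meters
streets = [
    [(0, 0), (300, 400)],            # one straight segment
    [(0, 0), (0, 100), (100, 100)],  # two segments
]
street_magnitude = total_street_length(streets)
print(street_magnitude)
```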

__Covariance of land-use variables__

The training data and the testing data come from the same lake. The locations surveyed in 2022 have different coefficients than those surveyed in 2021. Note how different the covariance with cities is to that with woods and undefined land.

In [11]:
# plastock_land_use_cover = pd.read_csv("resources/u_land_cover_use_ps.csv")
# land_cover_use = pd.read_csv("resources/u_land_cover_use_iq.csv")
# land_cover_use.rename(columns={"area":"magnitude"}, inplace=True)


# bind = pd.IntervalIndex.from_tuples([(-1, 0), (0.000001, .049999), (.05, .099999), (.1, .149999),(.15, .199999), (.2, .249999), (.25, .299999), (.3, .349999),
#         (.35, .399999), (.4, .449999), (.45, .499999),(.5, .549999), (.55, .599999), (.6, .649999), (.65, .699999), (.7, .749999), (.75, .799999),
#         (.8, .849999), (.85, .899999), (.9, .949999), (.95, 1)])

# land_use_agg = pd.read_csv('resources/u_land_use_cover_streets_rivers.csv')

# lu_scaled = land_use_agg.pivot(columns="use", values="scaled", index="slug").fillna(0)
# lu_magnitude = land_use_agg.pivot(columns="use", values="magnitude", index="slug").fillna(0)

# lucopy = land_use_agg.pivot(columns="use", values="binned", index="slug").fillna(0)
# adlu = all_data.merge(lu_scaled, left_on="location", right_index=True, validate="many_to_one", how="outer")

# columns = [x for x in lu_magnitude.columns if x not in ["Geroell", "Stausee", "See"]]

In [12]:
def sampler_from_multinomial(normed, xrange, nsamples):
    
    choose = np.random.default_rng()
    nunique = np.unique(normed)
    norm_nunique = nunique/np.sum(nunique)
    found = choose.multinomial(1, pvals=norm_nunique, size=nsamples)
    ft = found.sum(axis=0)
    samples = []
    for i, asum in enumerate(ft):
        if asum == 0:
            samples += [0]
        else:
            choices = np.where(normed == nunique[i])
            samps = choose.choice(choices[0], size=asum)
            samples.extend(xrange[samps])

    return samples, nunique, norm_nunique, ft

def period_pieces(start, end, data):
    # the results in pieces per meter for one code from a subset of data
    date_mask = (data["date"] >= start) & (data["date"] <= end)
    period_one = data[date_mask]
    pone_pcs = period_one.pcs_m.values

    return pone_pcs

def period_k_and_n(data, xrange, add_one=False):

    pone_k = [(data >= x).sum() for x in xrange]
    pone_notk = [(data < x).sum() for x in xrange]

    if add_one:
        # if the use is for beta dist. This is the same
        # as multiplying the likelihood * uninformed prior (0.5) or beta(1,1)
        pone_k_n_minus_k = [(x+1, len(data) - x+1) for x in pone_k]
    else:
        pone_k_n_minus_k = [(x, len(data) - x) for x in pone_k]
        
    

    return np.array(pone_k), np.array(pone_notk), np.array(pone_k_n_minus_k)

def period_beta(k):
    
         
    return beta(*k)
        

def current_possible_prior_locations(landuse, locations, attribute):    

    # identify the magnitude(s) of the attribute of interest from the
    # locations in the current data there may be more than one, in this 
    # example we use all the possible magnitudes for the attribute
    # locations = data[data.city == city].location.unique()

    # magnitudes for the attribute from all the locations in the municipality
    moa = magnitude_of_attribute = landuse.loc[locations][attribute].unique().astype('int')

    # identify locations that have the same attribute by magnitude of attribute
    possible_locations = landuse[landuse[attribute].isin(moa)].index

    # remove the locations that are in the likelihood function
    prior_locations = [x for x in possible_locations if x not in locations]

    return locations, possible_locations, prior_locations


def make_expected(lh_tuple, prior_tuple, xrange):
    # the posterior mean (a + s)/(a + b + n) of the beta-binomial
    # model in (12), evaluated at each point on the grid
    res = []
    betas = []
    for i in np.arange(len(xrange)):
        # prior coefficients
        alpha = prior_tuple[i][0]
        betai = prior_tuple[i][1]
        # likelihood counts: exceeded, not exceeded
        success = lh_tuple[i][0]
        failure = lh_tuple[i][1]
        n = success + failure
        numerator = alpha + success
        denominator = alpha + betai + n
        # the shape parameters of the beta distribution must be positive
        if numerator == 0:
            numerator = 1
        # Beta(a + s, b + n - s) as in (12)
        abeta = beta(numerator, max(betai + failure, 1)).mean()
        betas.append(abeta)
        # keep the expected value below one
        if numerator >= denominator:
            numerator = denominator-1

        expected = numerator/denominator
        res.append(expected)
    return np.array(res), np.array(betas)

In [13]:
comb_lu_agg = pd.read_csv("resources/u_comb_lu_cover_street_rivers.csv")

lu_scaled = comb_lu_agg.pivot(columns="use", values="scaled", index="slug").fillna(0)

lu_magnitude = comb_lu_agg.pivot(columns="use", values="magnitude", index="slug").fillna(0)

lu_binned = comb_lu_agg.pivot(columns="use", values="binned", index="slug").fillna(0)

# not_these = ['amphion', 'anthy', 'excenevex', 'lugrin', 'meillerie', 'saint-disdille', 'tougues']
merge_locations = cbd.location.unique()
cbdu = cbd[~cbd.location.isin(not_these)].merge(lu_scaled[lu_scaled.index.isin(merge_locations )], left_on="location", right_index=True, validate="many_to_one", how="outer")

cbdu["use group"] = cbdu.code.map(lambda x: use_groups_i[x])

cbdu["ug"] = cbdu["use group"].apply(lambda x: abbrev_use_g[x])
cbdu[cbdu["use group"] == 'Personal consumption'].code.unique()
cbdu["date"] = pd.to_datetime(cbdu["date"], format="%Y-%m-%d")

# cbdu["city"] = cbdu.location.apply(lambda x: full_city_map[x])
cbdu.rename(columns={"pcs/m":"pcs_m"}, inplace=True)

In [14]:
tst_locs = cbdu[(cbdu.Project == 'Testing')].location.unique()
tr_locs = cbdu[(cbdu.Project == 'Training')].location.unique()

attribute_columns = [x for x in lu_scaled.columns if x not in ["Geroell", "Stausee", "See", "Sumpf", "Stadtzentr", "Fels"]]
english_column_names = {
    "Obstanlage":"Orchards",
    "Reben": "Vineyards",
    "Siedl": "Buildings",
    "Strasse": "Streets",
    "Wald": "Woods",
    "infrastructure":"Infrastructure",
    "recreation":"Recreation",
    "undefined":"Undefined"
}

trc = lu_scaled.loc[tr_locs][attribute_columns]
tst = lu_scaled.loc[tst_locs][attribute_columns]

trc.rename(columns=english_column_names, inplace=True)
tst.rename(columns=english_column_names, inplace=True)

corr_tst = tst.corr()
corr_trc = trc.corr()

mask_tr = np.triu(np.ones_like(corr_trc, dtype=bool))
mask_ts = np.triu(np.ones_like(corr_tst, dtype=bool))
fig, ax = plt.subplots()

sns.heatmap(corr_trc, mask=mask_tr, cmap="YlOrBr", ax=ax)


ax.set_ylabel("")
ax.set_xlabel("")
ax.set_title("Training data", loc="left")
plt.tight_layout()

glue("corr_training", fig, display=False)
plt.close()

fig, ax = plt.subplots()

sns.heatmap(corr_tst, mask=mask_ts, cmap="YlOrBr", ax=ax)

ax.set_ylabel("")
ax.set_xlabel("")
ax.set_title("Testing data", loc="left")
plt.tight_layout()

glue("corr_testing", fig, display=False)
plt.close()

|   Figure 3             |     Figure 4                  |
|:----------------------:|:-----------------------------:|
|{glue:}`corr_training`| {glue:}`corr_testing`    | 
|Correlation of land-use variables from the prior data. | Correlation of land-use variables from the likelihood data|

## Results Lake Geneva

```{admonition} Given the data from 2022, is there an increase, decrease or no change in the expected survey results given the consolidated results from 2015 - 2021?

The average number of objects counted and identified per meter is expected to decline going into 2024. However, the reduction is minimal (Figure 5) and not equally spread between all object groups or survey locations. This means that these changes will not be readily observable, and the sampling distribution for 2024 will be very close to 2021 (Table 4, Figure 6). The __combined daily total__ is the sum of the objects of interest per sample. In this case we are concerned with the objects listed in the [Objects of interest](objects-of-interest) section.
```
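
The expected values for 2024 in this notebook are drawn from the grid, with each grid value weighted by its posterior probability of exceedance. A minimal sketch of that sampling step, assuming the standard library's `random.choices` in place of the NumPy generator used in the code cells, and a made-up exceedance curve; `sample_from_grid` is a hypothetical name.

```python
import random
import statistics


def sample_from_grid(grid, weights, size, seed=None):
    """Draw grid values with probability proportional to their weights."""
    rng = random.Random(seed)
    return rng.choices(grid, weights=weights, k=size)


# thresholds from 0.00 to 9.99 pcs/m
grid = [i / 100 for i in range(1000)]

# hypothetical exceedance curve: the probability falls as the threshold rises
p_exceed = [1 / (1 + x) ** 2 for x in grid]

draws = sample_from_grid(grid, p_exceed, size=1000, seed=42)
median_draw = statistics.median(draws)
print(median_draw)
```

`random.choices` normalizes the weights itself, so the curve does not need to sum to one before sampling.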

In [15]:
# not_these = ['amphion', 'anthy', 'excenevex', 'lugrin', 'meillerie', 'saint-disdille', 'tougues']
g_resa = cbdu[~cbdu.location.isin(not_these)].copy()
g_res = g_resa.groupby(['loc_date', 'date','location', 'city', 'Project', 'ug'], as_index=False).agg({'pcs_m':'sum', 'quantity':'sum'})
g_res.rename(columns={"ug":"code"}, inplace=True)
index_range = (0.0, 10)

xrange =  np.arange(*index_range, step=.01)

# define the uninformed prior as either a float
# or coefficients for the beta dist
uninformed_prior = np.array([0.5 for x in xrange])
uninformed_tuple = np.array([(1,1) for x in xrange])

# the data is aggregated on loc_date for both sets
train_dt = g_resa[g_resa.Project == "Training"].groupby('loc_date', as_index=False).agg({"pcs_m":"sum", "quantity":"sum"})
test_dt = g_resa[g_resa.Project == "Testing"].groupby('loc_date', as_index=False).agg({"pcs_m":"sum", "quantity":"sum"})

# get the values or arrays of interest
train_dt_vals = train_dt.pcs_m.values
test_dt_vals = test_dt.pcs_m.values

# get k, and n-minus k
t_dt, t_dt_notk, t_dt_n_minus_k = period_k_and_n(train_dt_vals, xrange)

# make the likelihood parameters
lhx = list(zip(t_dt, t_dt_notk))

# make the prior distribution
p_ui, p_beta = make_expected(lhx, uninformed_tuple, xrange)

# the uninformed beta approximation of the prior data
prior_beta = [period_beta(x) for x in t_dt_n_minus_k]
prior_bmean = [x.mean() for x in prior_beta]

test_dt, test_dt_notk, test_dt_n_minus_k = period_k_and_n(test_dt_vals, xrange)

# make the likelihood parameters
lhx = list(zip(test_dt, test_dt_notk))

# make the prior distribution
test_ui, t_beta = make_expected(lhx, uninformed_tuple, xrange)

# the uninformed beta approximation of the prior data
test_beta = [period_beta(x) for x in test_dt_n_minus_k]
test_bmean = [x.mean() for x in test_beta]

# posterior
aprior = test_dt_n_minus_k + t_dt_n_minus_k
outlook_2024, _ = make_expected(test_dt_n_minus_k , t_dt_n_minus_k, xrange)
results=pd.DataFrame({"x":xrange, "p":p_ui})
results["pn"] = results.p/results.p.sum()

results["Informed"] = outlook_2024
results["Ip_n"] = results.Informed/results.Informed.sum()
results["Uninformed post"] = test_ui

# the quantiles from the observed data
prior_quants = np.quantile(train_dt_vals, some_quants)
post_quants = np.quantile(test_dt_vals, some_quants)

Ip_n = outlook_2024/np.sum(outlook_2024)

# samples from posterior 
choose = np.random.default_rng()
sim_2024 = choose.choice(xrange, 1000, p=Ip_n)
median_2024 = np.median(sim_2024)

# observed in relation to predicted
ppf = predicted_summary(test_dt_vals, train_dt_vals, prior_quants, median_2024)
caption="Previous and expected median values of the combined daily totals of the Unknown objects group"

ppf_d = ppf.style.format(precision=2).set_caption(caption).set_table_styles(table_large_font)
glue("comb-2024-meds", ppf_d, display=False)

# comparing training to testing
unks_sum_table = training_testing_compare(test_dt_vals, train_dt_vals, post_quants, prior_quants)
caption = "The observed values from the training and testing data."
sum_table = unks_sum_table.set_caption(caption)
glue("comb-summary", sum_table, display=False)

In [16]:
fig, ax = plt.subplots()

# points
median_prior = np.median(train_dt_vals)
median_lh = np.median(test_dt_vals)
median_post = results[results["Informed"].between(.49, .51)]["x"].values[0]
quants_2024 = np.quantile(sim_2024, some_quants)

# posterior
ax.plot(xrange, outlook_2024, c="magenta", linewidth=4,linestyle=':', zorder=20, label='Outlook 2024')
ax.plot([median_post], [.5],  c="black", markersize=6, marker="x", zorder=27, label="Expected median 2024")
ax.plot(xrange, test_ui, c="darkslategrey",  linestyle=':', linewidth=3, zorder=11, alpha=0.5,  label='2022')
ax.plot([median_lh], [.5], c="blue", markersize=5, marker="o", zorder=25, label="2022 median")
ax.plot(xrange, p_ui, c="cornflowerblue", linestyle=':',  linewidth=4, zorder=11, alpha=0.5, label="2021")
ax.plot([median_prior], [.5], c="red", markersize=5, marker="o", zorder=25, label="2021 median")

# 50% IQR
ax.axvspan(xmin=prior_quants[1], xmax=prior_quants[5], ymin=0.25, ymax=0.75, facecolor='cornflowerblue',  edgecolor='cornflowerblue', zorder=13, alpha=0.2, label="IQR - 2021")
ax.axvspan(xmin=post_quants[1], xmax=post_quants[5], ymin=0.25, ymax=0.75, facecolor='darkslategrey', edgecolor='darkslategrey', linestyle="-.", zorder=13, alpha=0.2, label="IQR - 2022")
ax.axvspan(xmin=quants_2024[1], xmax=quants_2024[5], ymin=0.25, ymax=0.75, facecolor='black',  edgecolor='black',  zorder=13, alpha=0.2, label="IQR - 2024")

ax.set_xlabel('pcs/m')
ax.set_xlim(-.1, 10)
ax.set_ylabel('probability')
h, l = ax.get_legend_handles_labels()
ax.legend(h, l, bbox_to_anchor=(0,1.05), loc="lower left", ncol=2 )
glue('comb-outlook-2024', fig, display=False)
plt.close()

### Previous and expected survey totals

In [17]:
fig, ax = plt.subplots()

unin_2022 = results["Uninformed post"].values

sns.ecdfplot([*train_dt_vals, *test_dt_vals], label="Observed 2015 - 2022",  c="black", stat="proportion", ax=ax, zorder=10)
sns.ecdfplot(train_dt_vals, ax=ax, label="Observed 2015 - 2021",color="cornflowerblue", stat="proportion", zorder=10)
sns.ecdfplot(test_dt_vals, ax=ax, label="Observed 2022", color="darkslategrey", stat="proportion", zorder=10)

ax.plot(xrange, (1-outlook_2024), linestyle=':', c="magenta", linewidth=3,label="Expected 2024", zorder=11)

sns.histplot(train_dt_vals, ax=ax, label="Observed 2021", color="cornflowerblue", alpha=0.5, zorder=0, stat="probability")
sns.histplot(sim_2024, ax=ax, label="Expected 2024", color="black", zorder=2, alpha=0.5, edgecolor="magenta", linewidth=1.2, stat="probability")
sns.histplot(test_dt_vals, ax=ax, label="Observed 2022", color="darkslategrey", alpha=0.5, stat="probability", zorder=0)

ax.set_xlabel('pcs/m')
ax.set_xlim(-.1, 10)
ax.set_ylabel('probability')

ax.legend(bbox_to_anchor=(0,1.05), loc="lower left", ncol=2 )
glue('comb-predicted_samples', fig, display=False)
plt.close()

|     Figure 5, Table 3    |     Table 4, Figure 6        | 
|:------------------------:|:----------------------------:|
|{glue:}`comb-outlook-2024` | {glue:}`comb-2024-meds`|
|{glue:}`comb-summary` | {glue:}`comb-predicted_samples`|
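
The expected median marked in Figure 5 is read from the grid as a value whose probability of exceedance is close to 0.5. The code cells select the first grid value with a probability between 0.49 and 0.51. The sketch below takes the closest value instead, an assumption made so that the helper always returns a result; `grid_median` and the numbers are hypothetical.

```python
def grid_median(grid, p_exceed):
    """Grid value whose probability of exceedance is closest to 0.5."""
    i = min(range(len(grid)), key=lambda j: abs(p_exceed[j] - 0.5))
    return grid[i]


# illustrative five-point grid in pcs/m
grid = [0.0, 0.5, 1.0, 1.5, 2.0]
p_exceed = [0.95, 0.80, 0.52, 0.30, 0.10]
expected_median = grid_median(grid, p_exceed)
print(expected_median)
```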



In [18]:
# select the code, city and attribute

code_index = 0
city_index = 0
attribute_index = 2

this_code =  code_groups[code_index]
this_attribute = attribute_columns[attribute_index]
start, end = "2015-11-15", "2021-05-31"

# define the prior, likelihood data and likelihood locations
prior_data = g_res[(g_res.Project == 'Training')&(g_res.code == this_code)]
lh_data = g_res[(g_res.Project == 'Testing')&(g_res.code == this_code)]
lh_locations = lh_data.location.unique()

regions = lac_leman_regions[~lac_leman_regions.slug.isin(not_these)].copy()
lh_regions = regions[regions.slug.isin(lh_locations)].alabel.unique()
regional_locations = regions[regions.alabel.isin(lh_regions)].slug.unique()
land_use_data_of_interest = lu_binned.loc[regional_locations]

locations, possible_locations, prior_locations = current_possible_prior_locations(land_use_data_of_interest, regional_locations, this_attribute)

prior_args = {
    'prior_data':prior_data[prior_data.location.isin(regional_locations)],
    'start': start,
    'end': end,
    'xrange':xrange,
    'uninformed_prior': uninformed_tuple,
}

    

# grid approximation of the prior
grid_prior, beta_prior, prior_k_n, prior_df, pcs = prior_distributions(**prior_args)

posterior_args = {
    'lh_data':lh_data,
    'start': "2022-01-01",
    'end': "2022-12-31",
    'un_informed': uninformed_tuple,
    'code':this_code,
    'informed_prior': prior_k_n
}

# grid approximation of posterior
informed, uninformed, beta_p, lh_pcs = posterior_distribution(**posterior_args)

# the quantiles from the observed data
prior_quants = np.quantile(pcs, some_quants)
post_quants = np.quantile(lh_pcs, some_quants)

# data frame with normalized results
post_df = make_results_df(prior_df.copy(), informed, source="Informed post", source_norm="Ip_n")
post_df = make_results_df(post_df, uninformed, source="Uninformed post", source_norm="Un_n")

# samples from posterior 
choose = np.random.default_rng()
sim_2024 = choose.choice(xrange, len(lh_pcs), p=post_df["Ip_n"].values)
median_2024 = post_df.loc[post_df["Informed post"].between(.49, .51)].index.values.mean()/100

summaries = []
dfs = []
# observed in relation to predicted
ppf = predicted_summary(lh_pcs, pcs, prior_quants, median_2024)
summaries.append(ppf)
dfs.append(post_df.rename(columns={"Informed post":"2024", "x":"pcs/m"}))
caption="Previous and expected median values of the combined daily totals of the Unknown objects group"

ppf_d = ppf.style.format(precision=2).set_caption(caption).set_table_styles(table_large_font)
glue("unk-2024-meds", ppf_d, display=False)

# comparing training to testing
unks_sum_table = training_testing_compare(lh_pcs, pcs, post_quants, prior_quants)
caption = "The observed values from the training and testing data."
sum_table = unks_sum_table.set_caption(caption)
glue("unk-summary", sum_table, display=False)

In [19]:
# bayes rules in python
# apie = [0, 0.2, 0.4, 0.6, 0.8, 1]
# alength = np.linspace(start=0, stop=1, num=6)
# prior_tuples = [(2,2) for _ in alength]
# prior = [beta.pdf(alength[i], *x).mean() for i, x in enumerate(prior_tuples)]
# likelihood = [binom.pmf(9, 10, x) for x in alength]
# print(np.round(prior, 2))
# print(np.round(likelihood, 2)

In [20]:
fig, ax = plt.subplots()

unin_2022 = post_df["Uninformed post"].values

sns.ecdfplot([*pcs, *lh_pcs], label="Observed 2015 - 2022",  c="black", stat="proportion", ax=ax, zorder=10)
sns.ecdfplot(pcs, ax=ax, label="Observed 2015 - 2021",color="cornflowerblue", stat="proportion", zorder=10)
sns.ecdfplot(lh_pcs, ax=ax, label="Observed 2022", color="darkslategrey", stat="proportion", zorder=10)

ax.plot(xrange, (1-informed), linestyle=':', c="magenta", linewidth=3,label="Expected 2024", zorder=11)

sns.histplot(pcs, ax=ax, label="Observed 2021", color="cornflowerblue", alpha=0.5, zorder=0, stat="probability")
sns.histplot(sim_2024, ax=ax, label="Expected 2024", color="black", zorder=2, alpha=0.5, edgecolor="magenta", linewidth=1.2, stat="probability")
sns.histplot(lh_pcs, ax=ax, label="Observed 2022", color="darkslategrey", alpha=0.5, stat="probability", zorder=0)

ax.set_xlabel('pcs/m')
ax.set_xlim(-.1, 3)
ax.set_ylabel('probability')
# h, l = ax.get_legend_handles_labels()
ax.legend(bbox_to_anchor=(0,1.05), loc="lower left", ncol=2 )
glue('gfrags-predicted_samples', fig, display=False)
plt.close()

In [21]:
fig, ax = plt.subplots()

# points
median_prior = np.median(pcs)
median_lh = np.median(lh_pcs)
median_post = post_df[post_df["Informed post"].between(.49, .51)]["x"].values[0]
quants_2024 = np.quantile(sim_2024, some_quants)

# posterior
ax.plot(xrange, informed, c="magenta", linewidth=3,linestyle=':', zorder=20, label='Outlook 2024')
ax.plot([median_post], [.5],  c="black", markersize=6, marker="x", zorder=27, label="Expected median 2024")
ax.plot(xrange, uninformed, c="darkslategrey",  linestyle=':', linewidth=3, zorder=11, alpha=0.5,  label='2022')
ax.plot([median_lh], [.5], c="blue", markersize=5, marker="o", zorder=25, label="2022 median")
ax.plot(xrange, grid_prior,  c="cornflowerblue", linestyle=':',  linewidth=4, zorder=11, alpha=0.5, label="2021")
ax.plot([median_prior], [.5], c="red", markersize=5, marker="o", zorder=25, label="2021 median")

# 50% IQR
ax.axvspan(xmin=prior_quants[1], xmax=prior_quants[5], ymin=0.25, ymax=0.75, facecolor='cornflowerblue', zorder=13, alpha=0.2, label="IQR - 2021")
ax.axvspan(xmin=post_quants[1], xmax=post_quants[5], ymin=0.25, ymax=0.75, facecolor='darkslategrey', zorder=13, alpha=0.2, label="IQR - 2022")
ax.axvspan(xmin=quants_2024[1], xmax=quants_2024[5], ymin=0.25, ymax=0.75, facecolor='black', zorder=13, alpha=0.2, label="IQR - 2024")

ax.set_xlabel('pcs/m')
ax.set_xlim(-.1, 3)
ax.set_ylabel('probability')
h, l = ax.get_legend_handles_labels()
ax.legend(h[:-2], l[:-2], bbox_to_anchor=(0,1.05), loc="lower left", ncol=2 )
glue('gfrags-outlook-2024', fig, display=False)
plt.close()

In [22]:
code_index = 1
city_index = 0
attribute_index = 2

this_code =  code_groups[code_index]
this_attribute = attribute_columns[attribute_index]
start, end = "2015-11-15", "2021-05-31"
# some_quants = [.03, .25, .48, .5, .52, .75, .97]

# define the prior, likelihood data and likelihood locations
prior_data = g_res[(g_res.Project == 'Training')&(g_res.code == this_code)]
lh_data = g_res[(g_res.Project == 'Testing')&(g_res.code == this_code)]
lh_locations = lh_data.location.unique()

regions = lac_leman_regions[~lac_leman_regions.slug.isin(not_these)].copy()
lh_regions = regions[regions.slug.isin(lh_locations)].alabel.unique()
regional_locations = regions[regions.alabel.isin(lh_regions)].slug.unique()
land_use_data_of_interest = lu_binned.loc[regional_locations]

locations, possible_locations, prior_locations = current_possible_prior_locations(land_use_data_of_interest, regional_locations, this_attribute)

# grid approximation of the prior
prior_args = {
    'prior_data':prior_data[prior_data.location.isin(regional_locations)],
    'start': start,
    'end': end,
    'xrange':xrange,
    'uninformed_prior': uninformed_tuple,
}

grid_prior, beta_prior, prior_k_n, prior_df, pcs = prior_distributions(**prior_args)

posterior_args = {
    'lh_data':lh_data,
    'start': "2022-01-01",
    'end': "2022-12-31",
    'un_informed': uninformed_tuple,
    'code':this_code,
    'informed_prior': prior_k_n
}

# grid approximation of posterior
informed, uninformed, beta_p, lh_pcs = posterior_distribution(**posterior_args)

# the quantiles from the observed data
prior_quants = np.quantile(pcs, some_quants)
post_quants = np.quantile(lh_pcs, some_quants)

# data frame with normalized results
post_df = make_results_df(prior_df.copy(), informed, source="Informed post", source_norm="Ip_n")
post_df = make_results_df(post_df, uninformed, source="Uninformed post", source_norm="Un_n")

# samples from posterior 
choose = np.random.default_rng()
sim_2024 = choose.choice(xrange, 1000, p=post_df["Ip_n"].values)
median_2024 = post_df.loc[post_df["Informed post"].between(.49, .51)].index.values.mean()/100

# observed in relation to predicted
ppfi = predicted_summary(lh_pcs, pcs, prior_quants, median_2024 )
caption="Previous and expected median values of the identified Personal consumption group"

ppf_di = ppfi.style.format(precision=2).set_caption(caption).set_table_styles(table_large_font)
glue("pc-2024-meds", ppf_di, display=False)

# comparing training to testing
pc_sum_table = training_testing_compare(lh_pcs, pcs, post_quants, prior_quants)
caption = "The observed values from the training and testing data."
pc_table = pc_sum_table.set_caption(caption)
glue("pc-summary", pc_table, display=False)

In [23]:
fig, ax = plt.subplots()

unin_2022 = post_df["Uninformed post"].values

# hists = pd.DataFrame({"Observed 2021": pcs, "Expected 2024": sim_2024, "Observed 2022": lh_pcs}, index=xrange)

sns.ecdfplot([*pcs, *lh_pcs], label="Observed 2015 - 2022",  c="black", stat="proportion", ax=ax, zorder=10)
sns.ecdfplot(pcs, ax=ax, label="Observed 2015 - 2021",color="cornflowerblue", stat="proportion", zorder=10)
sns.ecdfplot(lh_pcs, ax=ax, label="Observed 2022", color="darkslategrey", stat="proportion", zorder=10)

ax.plot(xrange, (1-informed), linestyle=':', c="magenta", linewidth=3,label="Expected 2024", zorder=11)

sns.histplot(pcs, ax=ax, label="Observed 2021", color="cornflowerblue", alpha=0.5, zorder=0, stat="probability")
sns.histplot(sim_2024, ax=ax, label="Expected 2024", color="black", zorder=2, alpha=0.5, edgecolor="magenta", linewidth=1, stat="probability")
sns.histplot(lh_pcs, ax=ax, label="Observed 2022", color="darkslategrey", alpha=0.5, stat="probability", zorder=0)

ax.set_xlabel('pcs/m')
ax.set_xlim(-.1, 3)
ax.set_ylabel('probability')
# h, l = ax.get_legend_handles_labels()
ax.legend(bbox_to_anchor=(0,1.05), loc="lower left", ncol=2 )
glue('pc-predicted_samples', fig, display=False)
plt.close()

In [24]:
fig, ax = plt.subplots()

# points
median_prior = np.median(pcs)
median_lh = np.median(lh_pcs)
median_post = post_df[post_df["Informed post"].between(.49, .51)]["x"].values[0]
quants_2024 = np.quantile(sim_2024, some_quants)

# posterior
ax.plot(xrange, informed, c="magenta", linewidth=3,linestyle=':', zorder=20, label='Outlook 2024')
ax.plot([median_post], [.5],  c="black", markersize=6, marker="x", zorder=27, label="Expected median 2024")
ax.plot(xrange, uninformed, c="darkslategrey",  linestyle=':', linewidth=3, zorder=11, alpha=0.5,  label='2022')
ax.plot([median_lh], [.5], c="blue", markersize=5, marker="o", zorder=25, label="2022 median")
ax.plot(xrange, grid_prior, c="cornflowerblue", linestyle=':',  linewidth=4, zorder=11, alpha=0.5, label="2021")
ax.plot([median_prior], [.5], c="red", markersize=5, marker="o", zorder=25, label="2021 median")

# 50% IQR
ax.axvspan(xmin=prior_quants[1], xmax=prior_quants[5], ymin=0.25, ymax=0.75, facecolor='cornflowerblue', zorder=13, alpha=0.2, label="IQR - 2021")
ax.axvspan(xmin=post_quants[1], xmax=post_quants[5], ymin=0.25, ymax=0.75, facecolor='darkslategrey', zorder=13, alpha=0.2, label="IQR - 2022")
ax.axvspan(xmin=quants_2024[1], xmax=quants_2024[5], ymin=0.25, ymax=0.75, facecolor='black', zorder=13, alpha=0.2, label="IQR - 2024")
ax.plot(xrange, beta_prior, c="lightsteelblue",  linewidth=3, zorder=10, alpha=0.5, label="Outlook-beta 2021")
ax.plot(xrange, beta_p, c="lightsteelblue",  linewidth=3, zorder=10, alpha=0.5, label="Outlook-beta 2022")

ax.set_xlabel('pcs/m')
ax.set_xlim(-.1, 3)
ax.set_ylabel('probability')
h, l = ax.get_legend_handles_labels()
ax.legend(h[:-2], l[:-2], bbox_to_anchor=(0,1.05), loc="lower left", ncol=2 )
glue('pc-outlook-2024', fig, display=False)
plt.close()

<!-- |                        |                               |
|------------------------|-------------------------------|
|{glue:}`personal-consumption-2024`| {glue:}`pc-2024-meds`| 
 -->

In [25]:
# select the code, city and attribute
code_index = 1
city_index = 0
attribute_index = 2

this_code = code_groups[code_index]
this_attribute = attribute_columns[attribute_index]
start, end = "2015-11-15", "2021-05-31"

prior_data = g_res[(g_res.Project == 'Training')&(g_res.code == this_code)]
lh_data = g_res[(g_res.Project == 'Testing')&(g_res.code == this_code)]
lh_locations = lh_data.location.unique()

lh_vals = lh_data["pcs_m"].values
post_quants = np.quantile(lh_vals, some_quants)
prior_vals = g_res[g_res.code == this_code]["pcs_m"].values
prior_quants = np.quantile(prior_vals, [.03, .25, .48, .5, .52, .75, .97])

prior_args = {
    'prior_data':prior_data,
    'start': start,
    'end': end,
    'xrange':xrange,
    'uninformed_prior': uninformed_tuple,
   
}

grid_prior, beta_prior, prior_k_n, prior_df, pcs = prior_distributions(**prior_args)

posterior_args = {
    'lh_data':lh_data,
    'start': "2022-01-01",
    'end': "2022-12-31",
    'un_informed': uninformed_tuple,
    'informed_prior': prior_k_n
}


informed, uninformed, beta_p, lh_pcs = posterior_distribution(**posterior_args)
post_df = make_results_df(prior_df, informed, source="Informed post", source_norm="Ip_n")
post_df = make_results_df(post_df, uninformed, source="Uninformed post", source_norm="Un_n")

results_df = pd.DataFrame({"pcs/m":xrange, "2024":informed})
results_df["2024 pmf"] = results_df["2024"]/results_df["2024"].sum()
median_2024 = results_df.loc[results_df["2024"].between(.49, .51), "pcs/m"].mean()

ppf = predicted_summary(lh_pcs, pcs, prior_quants, median_2024)
summaries.append(ppf)
dfs.append(results_df)
ph_past_present_future = {"Median 2021": np.median(prior_data.pcs_m.values), "Median 2022": np.median(lh_data.pcs_m.values), "Expected median 2024":median_2024}

In [26]:
code_index = 2

this_code = code_groups[code_index]
attribute_index = 2
this_attribute = attribute_columns[attribute_index]
start, end = "2015-11-15", "2021-05-31"

prior_data = g_res[(g_res.Project == 'Training')&(g_res.code == this_code)]
lh_data = g_res[(g_res.Project == 'Testing')&(g_res.code == this_code)]
lh_locations = lh_data.location.unique()

lh_vals = lh_data["pcs_m"].values
post_quants = np.quantile(lh_vals, some_quants)
prior_vals = g_res[g_res.code == this_code]["pcs_m"].values
prior_quants = np.quantile(prior_vals, [.03, .25, .48, .5, .52, .75, .97])

prior_args = {
    'prior_data':prior_data,
    'start': start,
    'end': end,
    'xrange':xrange,
    'uninformed_prior': uninformed_tuple,

}

grid_prior, beta_prior, prior_k_n, prior_df, pcs = prior_distributions(**prior_args)

posterior_args = {
    'lh_data':lh_data,
    'start': "2022-01-01",
    'end': "2022-12-31",
    'un_informed': uninformed_tuple,
    'informed_prior': prior_k_n
}


informed, uninformed, beta_p, lh_pcs = posterior_distribution(**posterior_args)
post_df = make_results_df(prior_df, informed, source="Informed post", source_norm="Ip_n")
post_df = make_results_df(post_df, uninformed, source="Uninformed post", source_norm="Un_n")

results_df = pd.DataFrame({"pcs/m":xrange, "2024":informed})
results_df["2024 pmf"] = results_df["2024"]/results_df["2024"].sum()
median_2024 = results_df.loc[results_df["2024"].between(.49, .51), "pcs/m"].mean()

ppf = predicted_summary(lh_pcs, pcs, prior_quants, median_2024)
summaries.append(ppf)
dfs.append(results_df)
ip_past_present_future = {"Median 2021": np.median(prior_data.pcs_m.values), "Median 2022": np.median(lh_data.pcs_m.values), "Expected median 2024":median_2024}

In [27]:
code_index = 3

this_code = code_groups[code_index]
attribute_index = 2
this_attribute = attribute_columns[attribute_index]
start, end = "2015-11-15", "2021-05-31"

prior_data = g_res[(g_res.Project == 'Training')&(g_res.code == this_code)]
lh_data = g_res[(g_res.Project == 'Testing')&(g_res.code == this_code)]
lh_locations = lh_data.location.unique()

lh_vals = lh_data["pcs_m"].values
post_quants = np.quantile(lh_vals, some_quants)
prior_vals = g_res[g_res.code == this_code]["pcs_m"].values
prior_quants = np.quantile(prior_vals, [.03, .25, .48, .5, .52, .75, .97])

prior_args = {
    'prior_data':prior_data,
    'start': start,
    'end': end,
    'xrange':xrange,
    'uninformed_prior': uninformed_tuple,    
}

grid_prior, beta_prior, prior_k_n, prior_df, pcs = prior_distributions(**prior_args)

posterior_args = {
    'lh_data':lh_data,
    'start': "2022-01-01",
    'end': "2022-12-31",
    'un_informed': uninformed_tuple,
    'informed_prior': prior_k_n
}


informed, uninformed, beta_p,lh_pcs = posterior_distribution(**posterior_args)
post_df = make_results_df(prior_df, informed, source="Informed post", source_norm="Ip_n")
post_df = make_results_df(post_df, uninformed, source="Uninformed post", source_norm="Un_n")

results_df = pd.DataFrame({"pcs/m":xrange, "2024":informed})
results_df["2024 pmf"] = results_df["2024"]/results_df["2024"].sum()
median_2024 = results_df.loc[results_df["2024"].between(.45, .55), "pcs/m"].mean()

ppf = predicted_summary(lh_pcs, pcs, prior_quants, median_2024)
summaries.append(ppf)
dfs.append(results_df)

In [28]:
code_index = 4

this_code = code_groups[code_index]
attribute_index = 2
this_attribute = attribute_columns[attribute_index]
start, end = "2015-11-15", "2021-05-31"

prior_data = g_res[(g_res.Project == 'Training')&(g_res.code == this_code)]
lh_data = g_res[(g_res.Project == 'Testing')&(g_res.code == this_code)]
lh_locations = lh_data.location.unique()

lh_vals = lh_data["pcs_m"].values
post_quants = np.quantile(lh_vals, some_quants)
prior_vals = g_res[g_res.code == this_code]["pcs_m"].values
prior_quants = np.quantile(prior_vals, some_quants)

prior_args = {
    'prior_data':prior_data,
    'start': start,
    'end': end,
    'xrange':xrange,
    'uninformed_prior': uninformed_tuple,    
}

grid_prior, beta_prior, prior_k_n, prior_df, pcs = prior_distributions(**prior_args)

posterior_args = {
    'lh_data':lh_data,
    'start': "2022-01-01",
    'end': "2022-12-31",
    'un_informed': uninformed_tuple,
    'informed_prior': prior_k_n
}


informed, uninformed, beta_p,lh_pcs = posterior_distribution(**posterior_args)
post_df = make_results_df(prior_df, informed, source="Informed post", source_norm="Ip_n")
post_df = make_results_df(post_df, uninformed, source="Uninformed post", source_norm="Un_n")

results_df = pd.DataFrame({"pcs/m":xrange, "2024":informed})
results_df["2024 pmf"] = results_df["2024"]/results_df["2024"].sum()
median_2024 = results_df.loc[results_df["2024"].between(.45, .55), "pcs/m"].mean()

ppf = predicted_summary(lh_pcs, pcs, prior_quants, median_2024)
summaries.append(ppf)
dfs.append(results_df)

sum_dict = {x:summaries[i]["pcs/m"].values for i,x  in enumerate(code_groups)}
previsions = pd.DataFrame(sum_dict, index=summaries[0].index.values)

In [29]:
previous_medians = previsions.loc["Median 2021"].round(2).values
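# use np.isclose rather than == so float rounding on the grid does not drop the match
chance_of_exceeding = [x.loc[np.isclose(x["pcs/m"], previous_medians[i]), "2024"].mean() for i, x in enumerate(dfs)]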
response_question_2_lake = pd.DataFrame([chance_of_exceeding], columns=code_groups, index=["P"])
response_question_2_lake.rename(columns=column_names_groups, inplace=True)
rq2=response_question_2_lake.T

caption = "The probability that a survey in 2024 will exceed the median (50%) from 2021"

styled = rq2.style.format(precision=2).set_table_styles(table_large_font).set_caption(caption)
glue('reply-question-2', styled, display=False)

### Previous and expected survey results of objects of interest

```{admonition} Given the median value for the objects of interest in 2021, what is the chance that a survey in 2024 will exceed this value?

|Table 5|
|:------------------------|
|{glue:}`reply-question-2`|
|There were more fragmented plastics identified per meter in 2022 than in 2021 (Figure 7, Table 6). Most municipalities will experience either a slight increase or no change at all. Locations that have historically high counts of these objects will see the greatest increases. This was hypothesised in the 2021 report, and the results from the testing data lend support to that assessment. Table 7 and Figure 8 show how similar the expected values for 2024 are to those from 2021.|
```
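
The probabilities in Table 5 are read from the 2024 curve at the 2021 median. The sketch below shows one way such a value could be obtained, assuming the curve holds the posterior probability of exceeding each grid value under a beta-binomial model. The function `exceedance_posterior`, the uniform `Beta(1, 1)` prior and the simulated samples are illustrative assumptions and do not reproduce the notebook's `posterior_distribution`.

```python
import numpy as np


def exceedance_posterior(prior_samples, new_samples, grid, a=1.0, b=1.0):
    """Posterior mean of P(sample > x) for every x in grid.

    Assumption: at each grid value the count of samples exceeding x is
    binomial, with a Beta(a, b) prior on the exceedance probability.
    """
    combined = np.concatenate([np.asarray(prior_samples), np.asarray(new_samples)])
    # k[i] is the number of samples strictly greater than grid[i]
    k = (combined[None, :] > grid[:, None]).sum(axis=1)
    n = combined.size
    # mean of the Beta(a + k, b + n - k) posterior
    return (a + k) / (a + b + n)


# hypothetical data, for illustration only
rng = np.random.default_rng(42)
grid = np.round(np.arange(0, 3.01, 0.01), 2)
samples_2021 = rng.gamma(shape=1.5, scale=0.30, size=100)
samples_2022 = rng.gamma(shape=1.5, scale=0.35, size=40)

p_exceed = exceedance_posterior(samples_2021, samples_2022, grid)

# look up the curve at the grid value nearest to the previous median
median_2021 = np.median(samples_2021)
nearest = np.abs(grid - median_2021).argmin()
p_exceed_median = p_exceed[nearest]
print(f"P(survey > {grid[nearest]:.2f} pcs/m) is about {p_exceed_median:.2f}")
```

Looking up the nearest grid value avoids relying on exact floating-point equality between a rounded median and the grid.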

#### Expected results fragmented plastics

|       Figure 7, Table 6    | Table 7, Figure 8         |
|:------------------------:|:----------------------------:|
|{glue:}`gfrags-outlook-2024`| {glue:}`unk-2024-meds`    | 
|{glue:}`unk-summary`| {glue:}`gfrags-predicted_samples` |


#### Expected results personal consumption

__Example: Personal consumption.__ Personal consumption objects are those likely to be used at the beach by visitors, including food and tobacco items. Fewer personal consumption objects were identified per meter in 2022 than in 2021 (Figure 9, Table 8). This effect should be most noticeable in communities that have active prevention programs (Table 9, Figure 10).

|      Figure 9, Table 8 |   Table 9, Figure 10     |
|:----------------------:|:------------------------:|
|{glue:}`pc-outlook-2024`| {glue:}`pc-2024-meds`    | 
|{glue:}`pc-summary`| {glue:}`pc-predicted_samples` |


#### Summary of expected survey results Lake Geneva

In [30]:
caption = "Table 10: Previous and expected results and the percent of observed samples from 2022 that were included in either the 50% interval (IQR) or the 94% interval of the predicted values."

rq3 = previsions.style.format(precision=2).set_table_styles(table_large_font).set_caption(caption)

glue("lake-rq3", rq3, display=False)

```{admonition} How do the results from 2022 change the expected survey results going forward?
With the exception of fragmented plastics, we expect reported beach litter densities to be lower in 2024. The greatest improvement will concern objects of personal consumption; there is evidence of a decline that started in 2018 [Summary comparison 2018 - 2021](https://hammerdirt-analyst.github.io/IQAASL-End-0f-Sampling-2021/slr-iqaasl.html). These objects are often the main focus of anti-litter campaigns. The expected low values for the Industrial group may be a case of mistaken identity. We base this on the fact that fragmented plastics were 18% of the total in 2021 and rose to 50% in 2022. This usually happens because inexperienced surveyors tend not to differentiate between fragmented items, so some objects that would otherwise be classified as Industrial or Professional are placed under the more general category.

There are several item types that can be readily identified as Industrial or Professional:
1. Conduit: PVC or ABS fragments
2. Plastic concrete form stops
3. Plastic safety barrier fragments
4. Pallet angles
5. Industrial sheeting: heavy-gauge plastic for covering
6. Pheromone baits
6. Pheromone baits

The expected decline in Personal hygiene products is encouraging. However, we have no independent evidence to support it. For example, we do not know whether sales of plastic cotton swabs have declined or whether a ban on these products, like the one in France, has been proposed. We also note that no _plastic tampon applicators_ were identified in 2022. This may also be a case of lack of experience, given that _plastic tampon applicators_ were found in 40% of the samples in 2021 [Finding one object](https://hammerdirt-analyst.github.io/finding-one-object/chance_of_an_encounter.html#results-and-discussion).

|Table 10|
|:---------:|
|{glue:}`lake-rq3`|
|__Legend:__ Unk = Unknown group, PC = personal consumption, Ph = personal hygiene, Rc = recreation, Ip = industrial professional|
```
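
The caption for Table 10 refers to the percent of observed 2022 samples that fall inside the 50% and 94% intervals of the predicted values. A minimal sketch of that calculation follows, assuming the predicted values are draws from a normalized grid of probabilities. The helper `coverage`, the grid weights and the simulated observations are hypothetical and are not taken from `predicted_summary`.

```python
import numpy as np


def coverage(observed, predicted, lower_q, upper_q):
    """Share of observed values inside the [lower_q, upper_q] quantile
    interval of the predicted values."""
    low, high = np.quantile(predicted, [lower_q, upper_q])
    observed = np.asarray(observed)
    return np.mean((observed >= low) & (observed <= high))


rng = np.random.default_rng(7)
grid = np.linspace(0, 3, 301)

# hypothetical predicted probabilities on the grid, normalized to sum to 1
weights = np.exp(-grid / 0.5)
pmf = weights / weights.sum()

# draw predicted survey results from the grid, then compare with observations
predicted_2024 = rng.choice(grid, size=1000, p=pmf)
observed_2022 = rng.exponential(scale=0.5, size=50)  # illustrative data

in_50 = coverage(observed_2022, predicted_2024, 0.25, 0.75)
in_94 = coverage(observed_2022, predicted_2024, 0.03, 0.97)
print(f"within 50% interval: {in_50:.0%}, within 94% interval: {in_94:.0%}")
```

Because the 94% interval contains the 50% interval, the second share can never be smaller than the first, which is a quick sanity check on the table.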

## Results Saint Sulpice

In [31]:
city =  'Saint-Sulpice (VD)'
start, end = "2015-11-15", "2021-05-31"
g_resa = cbdu[~cbdu.location.isin(not_these)].copy()
g_resa = g_resa.groupby(['loc_date', 'date','location', 'city', 'Project', 'ug'], as_index=False).agg({'pcs_m':'sum', 'quantity':'sum'})
g_resa.rename(columns={"ug":"code"}, inplace=True)
g_resadt = g_resa.groupby(['loc_date', 'date','location', 'city', 'Project'], as_index=False).agg({'pcs_m':'sum', 'quantity':'sum'})

# define the prior, likelihood data and likelihood locations
prior_data = g_resadt[(g_resadt.Project == 'Training')&(g_resadt.city != city)]
lh_data = g_resadt[(g_resadt.city == city)].copy()
lh_locations = lh_data.location.unique()

regions = lac_leman_regions[~lac_leman_regions.slug.isin(not_these)].copy()
lh_regions = regions[regions.slug.isin(lh_locations)].alabel.unique()
regional_locations = regions[regions.alabel.isin(lh_regions)].slug.unique()
land_use_data_of_interest = lu_binned.loc[regional_locations]

locations, possible_locations, prior_locations = current_possible_prior_locations(land_use_data_of_interest, lh_locations, this_attribute)

prior_args = {
    'prior_data':prior_data[prior_data.location.isin(prior_locations)],
    'start': start,
    'end': end,
    'xrange':xrange,
    'uninformed_prior': uninformed_tuple,
}
# grid approximation of the prior
grid_prior, beta_prior, prior_k_n, prior_df, pcs = prior_distributions(**prior_args)

posterior_args = {
    'lh_data':lh_data,
    'start': start,
    'end': "2022-12-31",
    'un_informed': uninformed_tuple,
    'code':this_code,
    'informed_prior': prior_k_n
}

# grid approximation of posterior
informed, uninformed, beta_p, lh_pcs = posterior_distribution(**posterior_args)

prior_df["Informed"] = informed
prior_df["Ip_n"] = prior_df.Informed/prior_df.Informed.sum()
prior_df["Uninformed post"] = uninformed

# the quantiles from the observed data
prior_quants = np.quantile(pcs, some_quants)
post_quants = np.quantile(lh_pcs, some_quants)

# samples from the posterior
choose = np.random.default_rng()
sim_2024, nunique, norm_nunique, ft = sampler_from_multinomial(prior_df["Ip_n"].values, xrange, 200)
x_median_2024 = prior_df.loc[prior_df["Informed"].between(.45, .55), "x"].mean()

# observed in relation to predicted
ppf = predicted_summary(lh_pcs, pcs, prior_quants, x_median_2024)
caption="Previous and expected median values of the combined daily totals of the Unknown objects group"

ppf_d = ppf.style.format(precision=2).set_caption(caption).set_table_styles(table_large_font)
glue("ssp-2024-meds", ppf_d, display=False)

# comparing training to testing
unks_sum_table = training_testing_compare(lh_pcs, pcs, post_quants, prior_quants)
caption = "The observed values from the training and testing data."
sum_table = unks_sum_table.set_caption(caption)
glue("ssp-summary", sum_table, display=False)

In [32]:
fig, ax = plt.subplots()

# points
median_prior = np.median(pcs)
median_lh = np.median(lh_pcs)
quants_2024 = np.quantile(sim_2024, some_quants)

# posterior
ax.plot(xrange, informed, c="magenta", linewidth=3,linestyle=':', zorder=20, label='Outlook 2024')
ax.plot([x_median_2024], [.5],  c="black", markersize=6, marker="x", zorder=27, label="Expected median 2024")
ax.plot(xrange, uninformed, c="darkslategrey",  linestyle=':', linewidth=3, zorder=11, alpha=0.5,  label='2022')
ax.plot([median_lh], [.5], c="blue", markersize=5, marker="o", zorder=25, label="2022 median")
ax.plot(xrange, grid_prior, c="cornflowerblue", linestyle=':',  linewidth=4, zorder=11, alpha=0.5, label="2021")
ax.plot([median_prior], [.5], c="red", markersize=5, marker="o", zorder=25, label="2021 median")

# 50% IQR
ax.axvspan(xmin=prior_quants[1], xmax=prior_quants[5], ymin=0.25, ymax=0.75, facecolor='cornflowerblue', zorder=13, alpha=0.2, label="IQR - 2021")
ax.axvspan(xmin=post_quants[1], xmax=post_quants[5], ymin=0.25, ymax=0.75, facecolor='darkslategrey', zorder=13, alpha=0.2, label="IQR - 2022")
ax.axvspan(xmin=quants_2024[1], xmax=quants_2024[5], ymin=0.25, ymax=0.75, facecolor='black', zorder=15, alpha=0.2, label="IQR - 2024")

ax.set_xlabel('pcs/m')
ax.set_xlim(-.1, 10)
ax.set_ylim(-.01, 1)
ax.set_ylabel('probability')
h, l = ax.get_legend_handles_labels()
ax.legend(h, l, bbox_to_anchor=(0,1.05), loc="lower left", ncol=2 )
glue('ssp-outlook-2024', fig, display=False)
plt.close()

In [33]:
fig, ax = plt.subplots()

unin_2022 = results["Uninformed post"].values

# hists = pd.DataFrame({"Observed 2021": pcs, "Expected 2024": sim_2024, "Observed 2022": lh_pcs}, index=xrange)

sns.ecdfplot([*lh_pcs, *pcs], label="Observed 2015 - 2022",  c="black", stat="proportion", ax=ax, zorder=10)
sns.ecdfplot(pcs, ax=ax, label="Observed 2015 - 2021",color="cornflowerblue", stat="proportion", zorder=10)
sns.ecdfplot(lh_pcs, ax=ax, label="Observed 2022", color="darkslategrey", stat="proportion", zorder=10)
# sns.ecdfplot(sim_2024, ax=ax, label="Observed 2022", color="magenta", stat="proportion", zorder=11)

# ax.plot(xrange, grid_prior, linestyle=':', c="blue", linewidth=4,label="Grid prior", zorder=11)
ax.plot(xrange, (1-informed), linestyle=':', c="magenta", linewidth=3,label="Expected 2024", zorder=11)
# ax.plot(xrange, unin_2022, linestyle=':', c="purple", linewidth=4, label="Grid uninformed 2022", zorder=11)
# ax.hlines(y=0.5, xmin=0, xmax=2, color="r", linestyle="-.", zorder=15)

# sns.histplot(lh_pcs, ax=ax, label="observed 2022", stat="probability")
sns.histplot(pcs, ax=ax, label="Observed 2021", color="cornflowerblue", alpha=0.5, zorder=0, stat="probability")
sns.histplot(sim_2024, ax=ax, label="Expected 2024", color="black", zorder=2, alpha=0.5, edgecolor="magenta", linewidth=1, stat="probability")
sns.histplot(lh_pcs, ax=ax, label="Observed 2022", color="darkslategrey", alpha=0.5, stat="probability", zorder=0)


ax.set_xlabel('pcs/m')
ax.set_xlim(-.1, 10)
ax.set_ylabel('probability')
# h, l = ax.get_legend_handles_labels()
ax.legend(bbox_to_anchor=(0,1.05), loc="lower left", ncol=2 )
glue('ssp-predicted_samples', fig, display=False)
plt.close()

### Previous and expected survey totals

```{admonition} Given the data from 2022, is there an increase, decrease or no change in the expected survey results given the consolidated results from 2015 - 2021?

The average number of objects counted and identified per meter at Saint Sulpice is expected to decline going into 2024 (Figure 11, Table 12). The difference between 2021 and 2022 is minimal (Table 11). If we consider the estimated median as a point estimate of the probable distribution for 2024 (Table 12), then there will be no noticeable change in 2024 for the residents of Saint Sulpice (Figure 12).
```
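
The expected median used above is a point read from the posterior curve. The following sketch shows one way to locate it, assuming the curve gives the probability of exceeding each grid value and is non-increasing. The function `median_from_exceedance` and the example curve are hypothetical. The sketch takes the first crossing of 0.5 instead of averaging grid values inside a window such as 0.45 to 0.55, which can be empty on a coarse grid.

```python
import numpy as np


def median_from_exceedance(grid, p_exceed):
    """Smallest grid value at which P(X > x) has dropped to 0.5 or below.

    Returns nan if the curve never reaches 0.5 on the grid.
    """
    grid = np.asarray(grid, dtype=float)
    p_exceed = np.asarray(p_exceed, dtype=float)
    below = p_exceed <= 0.5
    if not below.any():
        return np.nan
    # argmax on a boolean array returns the index of the first True
    return grid[np.argmax(below)]


grid = np.round(np.arange(0, 10.01, 0.01), 2)
# hypothetical exceedance curve with a median near 1.2 pcs/m
p_exceed = np.exp(-np.log(2) * grid / 1.2)

x_median = median_from_exceedance(grid, p_exceed)
print(f"expected median: {x_median:.2f} pcs/m")
```

The result is limited by the grid resolution, so a finer grid gives a finer estimate of the median.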

|     Figure 11, Table 11  |    Table 12, Figure 12       | 
|:------------------------:|:----------------------------:|
|{glue:}`ssp-outlook-2024` | {glue:}`ssp-2024-meds`|
|{glue:}`ssp-summary` | {glue:}`ssp-predicted_samples`|



In [34]:
code_index = 0
city_index = 0
attribute_index = 2

this_code =  code_groups[code_index]
this_attribute = attribute_columns[attribute_index]
this_city = cois[city_index]

# define the prior, likelihood data and likelihood locations
prior_data = g_resa[(g_resa.code == this_code)&(g_resa.city != city)&(g_resa.Project == 'Training')]
lh_data = g_resa[(g_resa.code == this_code)&(g_resa.city == city)]
lh_locations = lh_data.location.unique()

regions = lac_leman_regions[~lac_leman_regions.slug.isin(not_these)].copy()
lh_regions = regions[regions.slug.isin(lh_locations)].alabel.unique()
regional_locations = regions[regions.alabel.isin(lh_regions)].slug.unique()
land_use_data_of_interest = lu_binned.loc[regional_locations]

locations, possible_locations, prior_locations = current_possible_prior_locations(land_use_data_of_interest, lh_locations, this_attribute)

prior_args = {
    'prior_data':prior_data[prior_data.location.isin(prior_locations)],
    'start': start,
    'end': end,
    'xrange':xrange,
    'uninformed_prior': uninformed_tuple,
}
# grid approximation of the prior
grid_prior, beta_prior, prior_k_n, prior_df, pcs = prior_distributions(**prior_args)

posterior_args = {
    'lh_data':lh_data,
    'start': start,
    'end': "2022-12-31",
    'un_informed': uninformed_tuple,
    'code':this_code,
    'informed_prior': prior_k_n
}

# grid approximation of posterior
informed, uninformed, beta_p, lh_pcs = posterior_distribution(**posterior_args)

# the quantiles from the observed data
prior_quants = np.quantile(pcs, some_quants)
post_quants = np.quantile(lh_pcs, some_quants)

# data frame with normalized results
post_df = make_results_df(prior_df.copy(), informed, source="Informed post", source_norm="Ip_n")
post_df = make_results_df(post_df, uninformed, source="Uninformed post", source_norm="Un_n")

# samples from posterior 
sim_2024 = sampler_from_multinomial(post_df["Ip_n"].values, xrange, len(pcs) + len(lh_pcs))
x_median_2024 = post_df.loc[post_df["Informed post"].between(.48, .52), "x"].mean()

summaries = []
dfs = []
# observed in relation to predicted
ppf = predicted_summary(lh_pcs, pcs, prior_quants, x_median_2024)
summaries.append(ppf)
dfs.append(post_df.rename(columns={"Informed post":"2024", "x":"pcs/m"}))
caption="Previous and expected median values of the combined daily totals of the Unknown objects group"

ppf_d = ppf.style.format(precision=2).set_caption(caption).set_table_styles(table_large_font)
glue("ssp-unk-2024-meds", ppf_d, display=False)

# comparing training to testing
unks_sum_table = training_testing_compare(lh_pcs, pcs, post_quants, prior_quants)
caption = "The observed values from the training and testing data."
sum_table = unks_sum_table.set_caption(caption)
glue("ssp-unk-summary", sum_table, display=False)

In [35]:
fig, ax = plt.subplots()

unin_2022 = post_df["Uninformed post"].values

sns.ecdfplot([*pcs, *lh_pcs], label="Observed 2015 - 2022",  c="black", stat="proportion", ax=ax, zorder=10)
sns.ecdfplot(pcs, ax=ax, label="Observed 2015 - 2021",color="cornflowerblue", stat="proportion", zorder=10)
sns.ecdfplot(lh_pcs, ax=ax, label="Observed 2022", color="darkslategrey", stat="proportion", zorder=10)

ax.plot(xrange, (1-informed), linestyle=':', c="magenta", linewidth=3,label="Expected 2024", zorder=11)

sns.histplot(pcs, ax=ax, label="Observed 2021", color="cornflowerblue", alpha=0.5, zorder=0, stat="probability")
sns.histplot(sim_2024[0], ax=ax, label="Expected 2024", color="black", zorder=2, alpha=0.5, edgecolor="magenta", linewidth=1.2, stat="probability")
sns.histplot(lh_pcs, ax=ax, label="Observed 2022", color="darkslategrey", alpha=0.5, stat="probability", zorder=0)

ax.set_xlabel('pcs/m')
ax.set_xlim(-.1, 3)
ax.set_ylabel('probability')

ax.legend(bbox_to_anchor=(0,1.05), loc="lower left", ncol=2 )
glue('ssp-gfrags-predicted_samples', fig, display=False)
plt.close()

In [36]:
fig, ax = plt.subplots()

# points
median_prior = np.median(pcs)
median_lh = np.median(lh_pcs)
# median_post = post_df[post_df["Informed post"].between(.49, .51)]["x"].mean()
quants_2024 = np.quantile(sim_2024[0], some_quants)

# posterior
ax.plot(xrange, informed, c="magenta", linewidth=3,linestyle=':', zorder=20, label='Outlook 2024')
ax.plot([x_median_2024], [.5],  c="black", markersize=6, marker="x", zorder=27, label="Expected median 2024")
ax.plot(xrange, uninformed, c="darkslategrey",  linestyle=':', linewidth=3, zorder=11, alpha=0.5,  label='2022')
ax.plot([median_lh], [.5], c="blue", markersize=5, marker="o", zorder=25, label="2022 median")
ax.plot(xrange, grid_prior, c="cornflowerblue", linestyle=':',  linewidth=4, zorder=11, alpha=0.5, label="2021")
ax.plot([median_prior], [.5], c="red", markersize=5, marker="o", zorder=25, label="2021 median")

# 50% IQR
ax.axvspan(xmin=prior_quants[1], xmax=prior_quants[5], ymin=0.25, ymax=0.75, facecolor='cornflowerblue', zorder=13, alpha=0.2, label="IQR - 2021")
ax.axvspan(xmin=post_quants[1], xmax=post_quants[5], ymin=0.25, ymax=0.75, facecolor='darkslategrey', zorder=13, alpha=0.2, label="IQR - 2022")
ax.axvspan(xmin=quants_2024[1], xmax=quants_2024[5], ymin=0.25, ymax=0.75, facecolor='black', zorder=13, alpha=0.2, label="IQR - 2024")
ax.plot(xrange, beta_prior, c="lightsteelblue",  linewidth=3, zorder=10, alpha=0.5, label="Outlook-beta 2021")
ax.plot(xrange, beta_p, c="lightsteelblue",  linewidth=3, zorder=10, alpha=0.5, label="Outlook-beta 2022")

ax.set_xlabel('pcs/m')
ax.set_xlim(-.1, 3)
ax.set_ylabel('probability')
h, l = ax.get_legend_handles_labels()
ax.legend(h[:-2], l[:-2], bbox_to_anchor=(0,1.05), loc="lower left", ncol=2 )
glue('ssp-gfrags-outlook-2024', fig, display=False)
plt.close()

In [37]:
code_index = 1

this_code =  code_groups[code_index]

prior_data = g_resa[(g_resa.code == this_code)&(g_resa.city != city)&(g_resa.Project == 'Training')]
lh_data = g_resa[(g_resa.code == this_code)&(g_resa.city == city)]

prior_args = {
    'prior_data':prior_data[prior_data.location.isin(prior_locations)],
    'start': start,
    'end': end,
    'xrange':xrange,
    'uninformed_prior': uninformed_tuple,
}
# grid approximation of the prior
grid_prior, beta_prior, prior_k_n, prior_df, pcs = prior_distributions(**prior_args)

posterior_args = {
    'lh_data':lh_data,
    'start': start,
    'end': "2022-12-31",
    'un_informed': uninformed_tuple,
    'code':this_code,
    'informed_prior': prior_k_n
}

# grid approximation of posterior
informed, uninformed, beta_p, lh_pcs = posterior_distribution(**posterior_args)

# the quantiles from the observed data
prior_quants = np.quantile(pcs, some_quants)
post_quants = np.quantile(lh_pcs, some_quants)

# data frame with normalized results
post_df = make_results_df(prior_df.copy(), informed, source="Informed post", source_norm="Ip_n")
post_df = make_results_df(post_df, uninformed, source="Uninformed post", source_norm="Un_n")

# samples from posterior 
sim_2024 = sampler_from_multinomial(post_df["Ip_n"].values, xrange, len(pcs) + len(lh_pcs))
x_median_2024 = post_df.loc[post_df["Informed post"].between(.48, .52), "x"].mean()

# observed in relation to predicted
ppf = predicted_summary(lh_pcs, pcs, prior_quants, x_median_2024)
summaries.append(ppf)
dfs.append(post_df.rename(columns={"Informed post":"2024", "x":"pcs/m"}))

In [38]:
code_index = 2

this_code =  code_groups[code_index]

prior_data = g_resa[(g_resa.code == this_code)&(g_resa.city != city)&(g_resa.Project == 'Training')]
lh_data = g_resa[(g_resa.code == this_code)&(g_resa.city == city)]

prior_args = {
    'prior_data':prior_data[prior_data.location.isin(prior_locations)],
    'start': start,
    'end': end,
    'xrange':xrange,
    'uninformed_prior': uninformed_tuple,
}
# grid approximation of the prior
grid_prior, beta_prior, prior_k_n, prior_df, pcs = prior_distributions(**prior_args)

posterior_args = {
    'lh_data':lh_data,
    'start': start,
    'end': "2022-12-31",
    'un_informed': uninformed_tuple,
    'code':this_code,
    'informed_prior': prior_k_n
}

# grid approximation of posterior
informed, uninformed, beta_p, lh_pcs = posterior_distribution(**posterior_args)

# the quantiles from the observed data
prior_quants = np.quantile(pcs, some_quants)
post_quants = np.quantile(lh_pcs, some_quants)

# data frame with normalized results
post_df = make_results_df(prior_df.copy(), informed, source="Informed post", source_norm="Ip_n")
post_df = make_results_df(post_df, uninformed, source="Uninformed post", source_norm="Un_n")

# samples from posterior 
sim_2024 = sampler_from_multinomial(post_df["Ip_n"].values, xrange, len(pcs) + len(lh_pcs))
x_median_2024 = post_df.loc[post_df["Informed post"].between(.48, .52), "x"].mean()

# observed in relation to predicted
ppf = predicted_summary(lh_pcs, pcs, prior_quants, x_median_2024)
summaries.append(ppf)
dfs.append(post_df.rename(columns={"Informed post":"2024", "x":"pcs/m"}))

In [39]:
code_index = 3

this_code =  code_groups[code_index]

# define the prior, likelihood data and likelihood locations
prior_data = g_resa[(g_resa.code == this_code)&(g_resa.city != city)&(g_resa.Project == 'Training')]
lh_data = g_resa[(g_resa.code == this_code)&(g_resa.city == city)]

prior_args = {
    'prior_data':prior_data[prior_data.location.isin(prior_locations)],
    'start': start,
    'end': end,
    'xrange':xrange,
    'uninformed_prior': uninformed_tuple,
}
# grid approximation of the prior
grid_prior, beta_prior, prior_k_n, prior_df, pcs = prior_distributions(**prior_args)

posterior_args = {
    'lh_data':lh_data,
    'start': start,
    'end': "2022-12-31",
    'un_informed': uninformed_tuple,
    'code':this_code,
    'informed_prior': prior_k_n
}

# grid approximation of posterior
informed, uninformed, beta_p, lh_pcs = posterior_distribution(**posterior_args)

# the quantiles from the observed data
prior_quants = np.quantile(pcs, some_quants)
post_quants = np.quantile(lh_pcs, some_quants)

# data frame with normalized results
post_df = make_results_df(prior_df.copy(), informed, source="Informed post", source_norm="Ip_n")
post_df = make_results_df(post_df, uninformed, source="Uninformed post", source_norm="Un_n")

# samples from posterior 
sim_2024 = sampler_from_multinomial(post_df["Ip_n"].values, xrange, len(pcs) + len(lh_pcs))
uniques = post_df["Informed post"].unique()
the_middle = int(np.floor(len(uniques)/2))
x_median_2024 = post_df.loc[post_df["Informed post"] == uniques[the_middle], "x"].mean()

# observed in relation to predicted
ppf = predicted_summary(lh_pcs, pcs, prior_quants, x_median_2024)
summaries.append(ppf)
dfs.append(post_df.rename(columns={"Informed post":"2024", "x":"pcs/m"}))

In [40]:
code_index = 4

this_code =  code_groups[code_index]

# define the prior, likelihood data and likelihood locations
prior_data = g_resa[(g_resa.code == this_code)&(g_resa.city != city)&(g_resa.Project == 'Training')]
lh_data = g_resa[(g_resa.code == this_code)&(g_resa.city == city)]

prior_args = {
    'prior_data':prior_data[prior_data.location.isin(prior_locations)],
    'start': start,
    'end': end,
    'xrange':xrange,
    'uninformed_prior': uninformed_tuple,
}
# grid approximation of the prior
grid_prior, beta_prior, prior_k_n, prior_df, pcs = prior_distributions(**prior_args)

posterior_args = {
    'lh_data':lh_data,
    'start': start,
    'end': "2022-12-31",
    'un_informed': uninformed_tuple,
    'code':this_code,
    'informed_prior': prior_k_n
}

# grid approximation of posterior
informed, uninformed, beta_p, lh_pcs = posterior_distribution(**posterior_args)

# the quantiles from the observed data
prior_quants = np.quantile(pcs, some_quants)
post_quants = np.quantile(lh_pcs, some_quants)

# data frame with normalized results
post_df = make_results_df(prior_df.copy(), informed, source="Informed post", source_norm="Ip_n")
post_df = make_results_df(post_df, uninformed, source="Uninformed post", source_norm="Un_n")

# samples from posterior 
# sample size matches the other object groups: prior samples plus observed samples
sim_2024 = sampler_from_multinomial(post_df["Ip_n"].values, xrange, len(pcs) + len(lh_pcs))
x_median_2024 = post_df.loc[post_df["Informed post"].between(.48, .52), "x"].mean()

# observed in relation to predicted
ppf = predicted_summary(lh_pcs, pcs, prior_quants, x_median_2024)
summaries.append(ppf)
dfs.append(post_df.rename(columns={"Informed post":"2024", "x":"pcs/m"}))

# rc_past_present_future = {"Median 2021": np.median(prior_data.pcs_m.values), "Median 2022": np.median(lh_data.pcs_m.values), "Expected median 2024":median_2024}
sum_dict = {x:summaries[i]["pcs/m"].values for i,x  in enumerate(code_groups)}
previsions = pd.DataFrame(sum_dict, index=summaries[0].index.values)

In [41]:
# dfs[0].rename(columns={"Informed post":"2024", "x":"pcs/m"}, inplace=True)
previous_medians = previsions.loc["Median 2021"].round(2).values
# print(previous_medians)
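# np.isclose avoids a missed match caused by float rounding between the grid and the rounded median
chance_of_exceeding = [x.loc[np.isclose(x["pcs/m"], previous_medians[i]), "2024"].mean() for i,x in enumerate(dfs)]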
# print(chance_of_exceeding)
response_question_2_lake = pd.DataFrame([chance_of_exceeding], columns=code_groups, index=["P"])
response_question_2_lake.rename(columns=column_names_groups, inplace=True)
rq2=response_question_2_lake.T

caption = "The probability that a survey in 2024 will exceed the median (50%) from 2021"

styled = rq2.style.format(precision=2).set_table_styles(table_large_font).set_caption(caption)
glue('ssp-reply-question-2', styled, display=False)

### Previous and expected survey results of objects of interest

```{admonition} Given the median value for the objects of interest in 2021, what is the chance that a survey in 2024 will exceed this value?

|Table 13|
|:------------------------|
|{glue:}`ssp-reply-question-2`|
|Contrary to the trend for the lake, we expect beach litter density to increase at Saint Sulpice. This holds for all object groups.|
```
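
The table above reads the probability from the predicted curve at the 2021 median. A minimal sketch of that kind of lookup follows, assuming the curve is stored as exceedance probabilities over a grid of pcs/m values. The function name `lookup_on_grid` and the example curve are hypothetical and are not taken from the notebook. The sketch uses the nearest grid point rather than exact equality, so a rounded median still matches a grid coordinate that carries floating-point error.

```python
import numpy as np


def lookup_on_grid(x_grid, curve, value):
    """Return the curve value at the grid point closest to `value`.

    Hypothetical helper: a nearest-point lookup avoids relying on exact
    float equality between a rounded median and the grid coordinates.
    """
    x_grid = np.asarray(x_grid, dtype=float)
    curve = np.asarray(curve, dtype=float)
    if x_grid.shape != curve.shape:
        raise ValueError("x_grid and curve must have the same shape")
    # index of the grid coordinate with the smallest absolute distance
    nearest = np.abs(x_grid - value).argmin()
    return float(curve[nearest])


# Illustrative values only: a made-up, decreasing exceedance curve,
# not survey data.
x_grid = np.linspace(0, 3, 301)
example_curve = np.exp(-2 * x_grid)

p_exceed = lookup_on_grid(x_grid, example_curve, value=0.5)
print(f"Chance of exceeding 0.5 pcs/m on the example curve: {p_exceed:.2f}")
```

With the notebook's data frames, the inputs would presumably be the `pcs/m` and `2024` columns converted with `.to_numpy()`.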

#### Expected results fragmented plastics

|   Figure 12, Table 14  |   Table 15, Figure 13         |
|:----------------------:|:-------------------------------:|
|{glue:}`ssp-gfrags-outlook-2024`| {glue:}`ssp-unk-2024-meds`    | 
|{glue:}`ssp-unk-summary`| {glue:}`ssp-gfrags-predicted_samples` |

### Summary of expected survey results Saint Sulpice

In [42]:
caption = "Previous and expected results, and the percentage of observed samples from 2022 that fall within either the 50% interquartile range (IQR) or the 94% interval of the predicted values."

rq3 = previsions.style.format(precision=2).set_caption(caption).set_table_styles(table_large_font)

glue("ssp-lake-rq3", rq3, display=False)

```{admonition} How do the results from 2022 change the expected survey results going forward?
Beach litter density at Saint Sulpice is expected to increase. Given the data from 2022, we expect survey results to rise above the 2021 levels and the density at Saint Sulpice to be greater than the expected value for the lake in general.


| Table 16  |
|:-----------:|
|{glue:}`ssp-lake-rq3`|

__Legend:__ Unk = Unknown group, PC = personal consumption, Ph = personal hygiene, Rc = recreation, Ip = industrial professional
```

## Discussion

Potential points of discussion

1. The lake samples show a decline but the results at Saint Sulpice show an increase. Could this be because the lake-level samples came from one sampling group and the Saint Sulpice samples from another?
2. Fragmented plastics account for 40% of the objects in 2022, but historically the share is around 20% for the lake. Why is there such a difference?
3. Is there other evidence to support the expected decline of personal consumption products found on the beach?
4. Were there changes to the water treatment facilities (or processes) that would support an anticipated decline in personal hygiene products in general but an increase at Saint Sulpice? For example, no tampon applicators were identified in 2022, yet they were identified in 40% of samples from 2015-2021.
5. What does the presence of shotgun shells on the beach at Saint Sulpice say about hydrological transport mechanisms in the lake?


## Conclusions

The grid approximations function as we expected. The flat areas in the predicted curves demonstrate the need for a more comprehensive model. However, the observed samples in 2022 are within the 94% HDI of the predictions. In most cases the predictions account for 80% - 100% of the observed samples.
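
The share of observed samples that falls inside a predicted interval can be checked with a few lines of NumPy. The sketch below is an illustration, not the notebook's `predicted_summary`: the function `interval_coverage` is hypothetical, the data are simulated, and it uses a central (equal-tailed) 94% interval as a stand-in for the HDI, which can differ from the HDI when the distribution is skewed.

```python
import numpy as np


def interval_coverage(observed, predicted, mass=0.94):
    """Fraction of observed values inside the central `mass` interval
    of the predicted samples, plus the interval bounds.

    Hypothetical helper: an equal-tailed interval is used here in place
    of a highest density interval.
    """
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    if observed.size == 0 or predicted.size == 0:
        raise ValueError("observed and predicted must not be empty")
    tail = (1 - mass) / 2
    low, high = np.quantile(predicted, [tail, 1 - tail])
    inside = (observed >= low) & (observed <= high)
    return float(inside.mean()), (float(low), float(high))


# Simulated stand-ins for predicted and observed pcs/m values
# (assumption: these are not survey results).
rng = np.random.default_rng(0)
predicted = rng.gamma(shape=2.0, scale=0.3, size=1000)
observed = rng.gamma(shape=2.0, scale=0.3, size=40)

coverage, bounds = interval_coverage(observed, predicted)
print(f"{coverage:.0%} of observed samples fall within {bounds[0]:.2f} - {bounds[1]:.2f} pcs/m")
```

Passing `mass=0.5` gives the coverage of the interquartile range of the predicted samples.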

### Next steps

### Participating organizations

### Financial disclosure

This project is unfunded.

In [43]:
today = dt.datetime.now().date().strftime("%d/%m/%Y")
where = "Biel, CH"

my_block = f"""

This script updated {today} in {where}

> \u2764\ufe0f what you do everyday

*analyst at hammerdirt*
"""

md(my_block)



This script updated 28/08/2023 in Biel, CH

> ❤️ what you do everyday

*analyst at hammerdirt*


In [44]:
%watermark --iversions -b -r

Git repo: https://github.com/hammerdirt-analyst/patelmanuscript.git

Git branch: newsummary

pandas    : 2.0.2
seaborn   : 0.12.2
matplotlib: 3.7.1
numpy     : 1.24.3

