In [1]:
%load_ext watermark

import pandas as pd
import numpy as np
import datetime as dt 
from statsmodels.distributions.empirical_distribution import ECDF
from scipy.stats import beta
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import Markdown as md
from myst_nb import glue

from bisect import bisect
from bisect import bisect_left, bisect_right

dfCodes = pd.read_csv("resources/data/u_codes.csv")
dfBeaches = pd.read_csv("resources/data/u_beaches.csv")
dfBeaches.set_index("slug", inplace=True)

all_data = pd.read_csv("resources/data/u_all_data.csv")
all_data = all_data[all_data.river_bassin != 'les-alpes'].copy()
all_data["date"] = pd.to_datetime(all_data["date"], format="%Y-%m-%d")

# import regional labels. labels are used
# to identify the regional priors
lac_leman_regions = pd.read_csv("resources/data/lac_leman_regions.csv")

# map to code descriptions
dfCodes.set_index("code", inplace=True)
dfCodes.loc["Gcaps", ["material", "description", "groupname"]] = ["Plastic", "Plastic bottle lids", "food and drink"]

In [2]:
# this defines the css rules for the note-book table displays
header_row = {'selector': 'th:nth-child(1)', 'props': 'background-color: #FFF; text-align:right'}
even_rows = {"selector": 'tr:nth-child(even)', 'props': 'background-color: rgba(139, 69, 19, 0.08);'}
odd_rows = {'selector': 'tr:nth-child(odd)', 'props': 'background: #FFF;'}
table_font = {'selector': 'tr', 'props': 'font-size: 10px;'}
table_data = {'selector': 'td', 'props': 'padding: 12px;'}
table_caption = {'selector': 'caption', 'props': 'font-size: 14px; font-style: italic; caption-side: bottom; text-align: left; margin-top: 10px'}
table_css_styles = [even_rows, odd_rows, table_font, header_row, table_caption]


table_large_data = {'selector': 'tr', 'props': 'font-size: 14px; padding: 12px;'}
table_large_font = [even_rows, odd_rows, table_large_data, header_row, table_caption]

# Testing 2023 predictions

## The solid waste experience

This is the seventh year that the Solid Waste Team from the EPFL has collected beach litter samples. In the maritime environment, people have been measuring beach litter for decades. There is a standard protocol ([Guidance on Monitoring Marine Litter in European Seas](https://publications.jrc.ec.europa.eu/repository/handle/JRC83985)) for the EU area and threshold values for good environmental standing ([Beach litter thresholds](https://mcc.jrc.ec.europa.eu/main/dev.py?N=41&O=454)).

In Switzerland we started monitoring shoreline trash in 2015. At the time it was not obvious to most observers (except for Prof Ludwig) why this might be of interest. However, by 2016 the EU realized that monitoring trash flows in rivers and lakes ([monitoring trash in rivers](https://mcc.jrc.ec.europa.eu/documents/201703034325.pdf)) might be a good way to monitor flows into the oceans. All the while, conservationists and biologists have raised concerns about the presence of plastics and diminishing biodiversity. The threshold established by the EU is based on the precautionary principle: _the health effects are unknown, it is prudent to reduce contact with plastics when possible_ ([Beach litter thresholds](https://mcc.jrc.ec.europa.eu/main/dev.py?N=41&O=454)).

### Observations and interpretations

A beach litter survey is a detailed observation of the quantity and type of objects found at a beach. The observation is further defined by the time and place at which it occurred. The location of a beach litter survey can be described numerically using a topographical map and some common overlay techniques in QGIS.

The information gathered from the map is part of the _conditions_ that describe a particular survey location. When beach litter surveys are considered in terms of their shared attributes, we can use very simple techniques to find correlations between the conditions and the amount of trash found. For example, we can use Spearman's rank correlation coefficient to quickly identify topographical attributes where specific objects tend to accumulate. We wrote an article about it: [Near or far](https://hammerdirt-analyst.github.io/landuse/titlepage.html).
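
As a rough illustration of the rank correlation mentioned above, the sketch below applies `scipy.stats.spearmanr` to two short arrays. The values and the variable names `pct_buildings` and `pcs_per_meter` are hypothetical and chosen only to show the call; they are not taken from the survey data or the land-use tables.

```python
import numpy as np
from scipy.stats import spearmanr

# Hypothetical values for illustration only: the share of land covered by
# buildings around each survey location, and the pieces per meter of one
# object found at that location.
pct_buildings = np.array([0.05, 0.10, 0.22, 0.35, 0.48, 0.61, 0.74, 0.90])
pcs_per_meter = np.array([0.02, 0.10, 0.08, 0.30, 0.25, 0.60, 0.55, 1.20])

# Spearman's rho is computed on the ranks of the values, so it measures
# how consistently one variable increases with the other.
rho, p_value = spearmanr(pct_buildings, pcs_per_meter)

print(f"rho = {rho:.2f}, p = {p_value:.3f}")
```

A value of rho close to 1 would suggest that the object tends to accumulate where the attribute is large. With only a handful of locations, the p-value should be read with caution.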

### A unique problem and a unique solution

Trash in the environment is a unique problem. In general we know how an object becomes litter: either on purpose or by accident, people create the conditions that increase the chance that an end-of-lifecycle object will evade the waste recovery system. Resources are employed to change the behavior of people and therefore improve the chance that an end-of-lifecycle object will be appropriately discarded _(need reference)_.

There are public services dedicated to collecting inappropriately discarded items. Beach litter surveys are the observed result of the difference between the amount of litter produced and the effect of the systems in place to reduce it, regardless of how that litter was produced or which measures are in place to prevent it. Therefore this environmental assessment relies on individual observations. We can look to ornithologists and botanists for examples of how to interpret this kind of data.

```{admonition} Assessing the environment:

__What and how much are the volunteers likely to find?__
      
_This is the most honest answer that can be derived from the data._
```

There are 336 observations from 66 locations that describe the conditions under which 73,000 items were found on the 145 km shoreline of Lake Geneva. Although the surveys cover only a small portion of the lake shore, this is a good number of samples for a six-year period. It would be difficult to find a comparable stretch of coastline anywhere in the world with as many samples over a similar period. We can use these data to form an opinion of what we might find on October 5th.

```{admonition} Assessing the environment:

We cannot tell you how much there is. __Only how much you are likely to find.__

The difference between the two statements is a philosophical discussion. In practice it may be hard to make such a distinction.
```


### Reducing dimensionality: find the most common

There are 228 different categories of objects. We are interested in what we might find and how likely we are to find it. Therefore we limit the search to items that were previously identified in at least 50% of the surveys AND/OR objects that are distinctive (easy to identify). This accounts for 74% of all objects previously recorded.
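
The selection rule described above can be sketched with pandas. The data frame below is hypothetical (including the code `G999`, which is invented for illustration), and the column names `loc_date`, `code` and `quantity` are assumed to match the survey data. The sketch computes the share of surveys in which each code was found and keeps the codes found in at least 50% of them.

```python
import pandas as pd

# Hypothetical survey records: one row per object code per survey.
records = pd.DataFrame({
    "loc_date": ["a", "a", "a", "b", "b", "b", "c", "c", "c", "d", "d", "d"],
    "code": ["G27", "G95", "G999"] * 4,
    "quantity": [12, 0, 0, 3, 4, 0, 7, 1, 1, 0, 2, 0],
})

n_surveys = records["loc_date"].nunique()

# A code counts as found in a survey when at least one piece was recorded.
found = records["quantity"] > 0
fail_rate = found.groupby(records["code"]).sum() / n_surveys

# Keep the codes found in at least half of the surveys.
common = fail_rate[fail_rate >= 0.5].index.tolist()

# Share of all recorded objects that the selected codes account for.
selected_total = records.loc[records["code"].isin(common), "quantity"].sum()
share = selected_total / records["quantity"].sum()

print(common, round(share, 2))
```

Distinctive objects that fall below the 50% threshold would have to be added to the list by hand, since that part of the rule is a judgment call and not a computation.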

In [3]:
def prior_distributions(prior_data: pd.DataFrame = None, start: str = None, end: str = None,
                        xrange: np.array = None, uninformed_prior: np.array = None):
    data_args = {
        'start':start,
        'end':end,
        'data': prior_data,
    }
    prior_pcs = period_pieces(*data_args.values())         

    # get n and k for the prior data
    prior_k, prior_notk, prior_k_n_minus_k = period_k_and_n(prior_pcs, xrange)
   
    # make the likelihood parameters
    lhx = list(zip(prior_k, prior_notk))

    # make the prior distribution
    p_ui, prior_bmean = make_expected(lhx, uninformed_prior, xrange)

    # the uninformed beta approximation of the prior data
    prior_beta = [period_beta(x) for x in prior_k_n_minus_k]
    p_beta= [x.mean() for x in prior_beta]

    results=pd.DataFrame({"x":xrange, "p":p_ui})
    results["pn"] = results.p/results.p.sum()
    
    return np.array(p_ui), np.array(p_beta), prior_k_n_minus_k, results, prior_pcs

def posterior_distribution(lh_data: pd.DataFrame = None, start: str = None, end: str = None,
                           informed_prior: np.array = None, un_informed: np.array = None):
                               
    
    data_args = {
        'start': start,
        'end':end,
        'data': lh_data,   
        }

    period_all = period_pieces(*data_args.values())
    
    pall_k, pall_notk, pall_k_n_minus_k = period_k_and_n(period_all, xrange)
    
    lh_and_informed = np.array(pall_k_n_minus_k) + np.array(informed_prior)
    lhx = list(zip(pall_k, pall_notk))        
    
    probi, probi_beta = make_expected(pall_k_n_minus_k, np.array(informed_prior), xrange)
    grid_prox, grid_prox_beta = make_expected(pall_k_n_minus_k, un_informed, xrange)
    
    # beta distribution 
    pall_beta = [period_beta((x[0]+1, x[1]+1)) for x in pall_k_n_minus_k]
    pall_bmean = [x.mean() for x in pall_beta]
    return np.array(probi), np.array(grid_prox), pall_bmean, period_all
                               
def training_testing_compare(lh_pcs, pcs, post_quants, prior_quants):
    
    total_training = len(pcs) + len(lh_pcs)
    prior_weight = len(pcs)/total_training
    lh_weight = len(lh_pcs)/total_training

    number_of_samples = {"before may 2021": len(pcs), "after may 2021": len(lh_pcs)}
    weights = {"before may 2021":prior_weight, "after may 2021": lh_weight}
    observed_median = {"before may 2021":np.median(pcs), "after may 2021": np.median(lh_pcs)}
    observed_average = {"before may 2021":np.mean(pcs), "after may 2021": np.mean(lh_pcs)}
    observed_25 = {"before may 2021": prior_quants[1], "after may 2021":post_quants[1]}
    observed_75 = {"before may 2021": prior_quants[5], "after may 2021":post_quants[5]}
    index = ["weight all samples", "Number of samples", "Median", "Average", "25th percentile", "75th percentile"]
    components = [weights, number_of_samples, observed_median, observed_average, observed_25, observed_75]
    unks_sum_table = pd.DataFrame(components, index=index).style.format(precision=2).set_table_styles(table_large_font)
    styled = unks_sum_table.format(formatter="{:.0f}", subset=pd.IndexSlice[['Number of samples'], :])
    
    return styled

def predicted_summary(lh_pcs, pcs, prior_quants, median_2024):
    

    predicted = ((lh_pcs <= prior_quants[5])&(lh_pcs >= prior_quants[1])).sum()/len(lh_pcs)
    predicted_94 = ((lh_pcs <= prior_quants[-1])&(lh_pcs >= prior_quants[0])).sum()/len(lh_pcs)
    past_present_future = {
        "Median 2021": np.median(pcs), 
        "Median 2022": np.median(lh_pcs), 
        "Expected sampling median 2024":median_2024,
        "% 2022 in 50% IQR  predicted": predicted,
        "% 2022 in 94% IQR  predicted": predicted_94,
    }
        
    
    ppf = pd.DataFrame(past_present_future, index=["pcs/m"]).T

    return ppf


def make_results_df(prior_df, lh_c, source=None, source_norm=None):
    
    prior_df[source] = lh_c
    prior_df[source_norm] = prior_df[source]/prior_df[source].sum()

    return prior_df

def data_profile(all_data):
    date_min = all_data["date"].min()
    date_max = all_data["date"].max()

    if "location" in all_data.columns:
        nlocations = all_data.location.nunique()
    else:
        nlocations = all_data.slug.nunique()
    ncodes = all_data.code.nunique()
    ncities = all_data.city.nunique()
    quantity = all_data.quantity.sum()
    nsamples = all_data.loc_date.nunique()

    a_profile = dict(
        start = date_min,
        end = date_max,
        nlocations = nlocations,
        ncodes = ncodes,
        ncities = ncities,
        quantity = quantity,
        nsamples = nsamples
    )

    return a_profile

def a_fail_rate(x, total_number_of_samples):
    return x["fail"].sum()/total_number_of_samples


def the_most_abundant(x):
    # total quantity per code, largest first. The grouped result is a
    # Series, so sort_values does not take a column name
    t = x.groupby("code").quantity.sum().sort_values(ascending=False)
    return t

In [4]:
use_groups =  {
    'Personal hygiene':['G95', 'G96'],
    'Personal consumption':['G30', 'Gcaps', 'G27'],
    'Industrial/professional': ['G67', 'G89', 'G112'],
    'Unknown':['Gfrags', 'Gfoam'],
    'Recreation/sports': ['G70', 'G32'],
    
}

use_groups_i =  {
    'G95':'Personal hygiene',
    'G96': 'Personal hygiene',
    'G30':'Personal consumption',
    'Gcaps':'Personal consumption',
    'G27':'Personal consumption',
    'G67':'Industrial/professional',
    'G89':'Industrial/professional',
    'G112': 'Industrial/professional',
    'Gfoam':'Unknown',
    'Gfrags':'Unknown',
    'G70':'Recreation/sports',
    'G32':'Recreation/sports',
}

abbrev_use_g = {'Unknown':'Unk','Personal consumption':'Pc', 'Personal hygiene': 'Ph',    'Recreation/sports': 'Rc', 'Industrial/professional':'Ip'}
toi = list(use_groups_i.keys())
cbdi = pd.read_csv("resources/data/u_pstk_iqaasl_all.csv")
not_these = ['amphion', 'anthy', 'excenevex', 'lugrin', 'meillerie', 'saint-disdille', 'tougues']
cbd = cbdi[~cbdi.slug.isin(not_these)]
ssp = cbd[(cbd.city == 'Saint-Sulpice (VD)')].copy()
ssp['quantity'] = 0

ssp.to_csv('resources/data/swt_all.csv', index=False)

In [5]:

cbd = cbd[cbd.code.isin(toi)].copy()
cbd["fail"] = cbd.quantity > 0


cbd.loc[cbd.Project == "Testing", "Project"] = "after may 2021"
cbd.loc[cbd.Project == "Training", "Project"] = "before may 2021"

column_names_groups = {v:k for k,v in abbrev_use_g.items()}
code_groups = list(column_names_groups.keys())

cois = cities_of_interest = ['Saint-Sulpice (VD)', 'Saint Gingolph', 'Genéve', 'Cully', 'Vevey']

some_quants = [.03, .25, .48, .5, .52, .75, .97]
end_training_date = "2021-05-31"
begin_training_date = "2015-11-15"
codes_of_interest = cbd.groupby(["code"], as_index=False).agg({"quantity":"sum", "pcs/m":"mean", "fail": "sum"})
codes_of_interest["fail rate"] = (codes_of_interest.fail/cbd.loc_date.nunique()).round(2)
code_d = dfCodes["description"]
codes_of_interest["object"] = codes_of_interest.code.apply(lambda x: code_d.loc[x])
codes_of_interest = codes_of_interest[["code", "object", "pcs/m", "quantity", "fail rate"]]
codes_of_interest.set_index(["code", "object"], inplace=True, drop=True)
codes_of_interest["quantity"] = codes_of_interest.quantity.astype("int")
codes_of_interest["% of total"] = (codes_of_interest.quantity/cbdi.quantity.sum()).round(2)
codes_of_interest.index.name = None
caption = "Table 1: The objects of interest. The average pcs/m per sample for each object. The fail rate is the % of all samples that the object appeared in." 
codes_of_interest.style.format(precision=2).set_table_styles(table_large_font).set_caption(caption)

| code | object | pcs/m | quantity | fail rate | % of total |
|:-----|:-------|------:|---------:|----------:|-----------:|
| G112 | Industrial pellets (nurdles) | 0.16 | 2686 | 0.22 | 0.02 |
| G27 | Cigarette filters | 1.12 | 16458 | 0.85 | 0.15 |
| G30 | Food wrappers; candy, snacks | 0.54 | 6767 | 0.86 | 0.06 |
| G32 | Toys and party favors | 0.05 | 606 | 0.48 | 0.01 |
| G67 | Industrial sheeting | 0.3 | 3356 | 0.57 | 0.03 |
| G70 | Shotgun cartridges | 0.08 | 1030 | 0.48 | 0.01 |
| G89 | Plastic construction waste | 0.14 | 1970 | 0.51 | 0.02 |
| G95 | Cotton bud/swab sticks | 0.39 | 4777 | 0.74 | 0.04 |
| G96 | Sanitary pads /panty liners/tampons and applicators | 0.04 | 373 | 0.29 | 0.0 |
| Gcaps | Plastic bottle lids | 0.31 | 3953 | 0.84 | 0.04 |


### Assessing the environment

{Download}`Download the form </resources/figures/survey_estimates.pdf>`

The goal of today's exercise (2023) is to determine how well our previous experiences inform us about the present. This is a simple process. There are four steps:

1. Start with your current understanding of the problem, consult the data here and form an opinion of how many of each item in the previous section you might find in 100 meters of shoreline. For example, you might think 200 cigarette ends per 100 m is a likely amount.
2. Use the provided form and note your estimate for each item in red ink. Put your name and the name of the beach on the form.
3. At the end of the litter survey, note what you found for each item.
4. After the survey, compare what was found to what we thought we might find and to the amount predicted by the model that was explained in the previous section.
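
Estimates on the form are per 100 meters, while the tables in this notebook use pieces per meter, so a unit conversion is needed before anything can be compared. The sketch below shows that arithmetic. The two helper functions are hypothetical names introduced for illustration. The estimate of 200 per 100 m is the example from step 1, and the count of 60 on 16.5 m of shoreline is the value recorded for cigarette filters at one of the beaches later in this notebook.

```python
def per_100m_to_pcs_m(n_per_100m: float) -> float:
    """Convert an estimate given per 100 m of shoreline to pieces per meter."""
    return n_per_100m / 100


def count_to_pcs_m(count: int, length_m: float) -> float:
    """Convert a survey count to pieces per meter for a beach of length_m."""
    return count / length_m


# The example estimate from step 1: 200 cigarette ends per 100 m.
estimate_pcs_m = per_100m_to_pcs_m(200)

# A survey count of 60 pieces on 16.5 m of shoreline.
found_pcs_m = count_to_pcs_m(60, 16.5)

# Positive values mean the estimate was too low.
difference = found_pcs_m - estimate_pcs_m

print(f"estimated {estimate_pcs_m:.2f}, found {found_pcs_m:.2f}, difference {difference:.2f}")
```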

### Semester project

The semester project (if you choose to do it) is about documenting the process of updating the models and accessing data. It could be a narrated screencast, something that next year's class will consult. For those who are interested in data science or application development, we would be using Python, R, Git and Anaconda.

Specifically we would be adding survey results from this year's experience:
1. The results for Gfoam
2. The results for Plage de Pélican

However, if you have done a data science course or have some experience with application development, you might find this an easy project that will allow you to demonstrate those skills and some creativity. Those who know how to use Git and Anaconda will find this fairly easy.

(data-context)=
## Summary of previous results

__Lake Geneva sample totals__

The total pcs/m for all surveys is given in figure 1 and figure 2. Samples after May 2021 are considered separately because this is a new six-year sampling period for the lake. The distribution of the sample totals is given in table 2.

```{figure} resources/figures/lac-leman_city_labels.jpeg
---
name: lac_leman_cities
---
Previous survey results from Lake Geneva
```

In [6]:
summ_data = cbd.copy()

summ_data["use group"] = summ_data.code.map(lambda x: use_groups_i[x])

summ_data["ug"] = summ_data["use group"].apply(lambda x: abbrev_use_g[x])
summ_data[summ_data["use group"] == 'Personal consumption'].code.unique()
summ_data["date"] = pd.to_datetime(summ_data["date"], format="%Y-%m-%d")

sd_x = summ_data.groupby(["loc_date", "date", "city", "Project", "doy"], as_index=False).agg({"pcs/m": 'sum', 'quantity':'sum'})
sd_x_sp = sd_x[sd_x.city == 'Saint-Sulpice (VD)'].groupby(["loc_date", "date", "city", "Project", "doy"], as_index=False).agg({"pcs/m": 'sum', 'quantity':'sum'})

trg = summ_data[summ_data.Project == "before may 2021"].copy()
tst = summ_data[summ_data.Project == "after may 2021"].copy()
trg_c, tst_c = trg.city.nunique(), tst.city.nunique()
trg_lc, tst_lc = trg.slug.nunique(), tst.slug.nunique()
trg_q, tst_q = trg.quantity.sum(), tst.quantity.sum()

data_magnitude = [
    {"before may 2021":trg_c, "after may 2021":tst_c},
    {"before may 2021":trg_lc, "after may 2021":tst_lc},
    {"before may 2021":trg_q, "after may 2021":tst_q}
    
]

cities_set = list(set([*trg.city.unique(), *tst.city.unique()]))
n_ind_cities = len(cities_set)

caption = f'The number of different locations and cities for the data. Note that there are {n_ind_cities} different municipalities in all.'

data_summ_q = pd.DataFrame(data_magnitude, index=["Number of cities", "Number of locations", "Total objects"]).astype('int')
data_summ_q = data_summ_q.style.format(formatter="{:,}").set_table_styles(table_large_font).set_caption(caption)
styled = data_summ_q.format(formatter="{:,}", subset=pd.IndexSlice[['Total objects'], :])
glue("data-summ-q3", styled, display=False)

In [7]:
# all the data by date
the_99th_percentile = np.quantile(sd_x['pcs/m'].values, .99)
px = 1/plt.rcParams['figure.dpi']  # pixel in inches
fig, ax = plt.subplots(figsize=(600*px,500*px))

sns.scatterplot(data=sd_x, x='date', y='pcs/m',ax=ax, color="dodgerblue", alpha=0.6,label="lac léman")
sns.scatterplot(data=sd_x_sp, x='date', y='pcs/m', color="magenta", label="solid-waste-team", ax=ax)

ax.set_ylim(-1, the_99th_percentile)
ax.legend(loc="upper left")
ax.set_xlabel("")
glue("testing_training_chrono_2", fig, display=False)
plt.close()

In [8]:
# all the data day of year
fig, ax = plt.subplots(figsize=(600*px, 500*px))

sns.scatterplot(data=sd_x, x='doy', y='pcs/m', ax=ax, color="dodgerblue", alpha=0.6,label="lac léman")
sns.scatterplot(data=sd_x_sp, x='doy', y='pcs/m', color="magenta", label="solid-waste-team", ax=ax)
ax.set_ylim(-1, the_99th_percentile)
ax.set_xlabel("Day of the year")
glue('testing_training_doy_2', fig, display=False)
plt.close()

In [9]:
testing_vals= sd_x[sd_x.Project == "after may 2021"]['pcs/m'].values
training_vals = sd_x[sd_x.Project == "before may 2021"]['pcs/m'].values


train_quantiles = np.quantile(training_vals, some_quants)
test_quantiles = np.quantile(testing_vals, some_quants)

training_testing_summary = training_testing_compare(testing_vals, training_vals, test_quantiles, train_quantiles)
caption = "The observed values from the training and testing data. Note that the testing data is only 22% of all the data. This is because we are only in the first year of a six-year sampling period."
sum_table = training_testing_summary.set_caption(caption)
sum_table.format(formatter="{:.0f}", subset=pd.IndexSlice[['Number of samples'], :])
glue("data-summary_2", sum_table, display=False)

|Figure 1, Table 2 | Table 3, Figure 2|
|:-----------------------:|:---------------------:|
|{glue:}`testing_training_chrono_2` |{glue}`data-summary_2`|
|{glue:}`data-summ-q3`|{glue}`testing_training_doy_2`|

In [10]:
def sampler_from_multinomial(normed, xrange, nsamples):
    
    choose = np.random.default_rng()
    nunique = np.unique(normed)
    norm_nunique = nunique/np.sum(nunique)
    found = choose.multinomial(1, pvals=norm_nunique, size=nsamples)
    ft = found.sum(axis=0)
    samples = []
    for i, asum in enumerate(ft):
        if asum == 0:
            samples += [0]
        else:
            choices = np.where(normed == nunique[i])
            samps = choose.choice(choices[0], size=asum)
            samples.extend(xrange[samps])

    return samples, nunique, norm_nunique, ft

def period_pieces(start, end, data):
    # the results in pieces per meter for one code from a subset of data
    date_mask = (data["date"] >= start) & (data["date"] <= end)
    period_one = data[date_mask]
    pone_pcs = period_one.pcs_m.values

    return pone_pcs

def period_k_and_n(data, xrange, add_one=False):

    pone_k = [(data >= x).sum() for x in xrange]
    pone_notk = [(data < x).sum() for x in xrange]

    if add_one:
        # if the use is for beta dist. This is the same
        # as multiplying the likelihood * uninformed prior (0.5) or beta(1,1)
        pone_k_n_minus_k = [(x+1, len(data) - x+1) for x in pone_k]
    else:
        pone_k_n_minus_k = [(x, len(data) - x) for x in pone_k]
        
    

    return np.array(pone_k), np.array(pone_notk), np.array(pone_k_n_minus_k)

def period_beta(k):
    
         
    return beta(*k)
        

def current_possible_prior_locations(landuse, locations, attribute):    

    # identify the magnitude(s) of the attribute of interest from the
    # locations in the current data there may be more than one, in this 
    # example we use all the possible magnitudes for the attribute
    # locations = data[data.city == city].location.unique()

    # magnitudes for the attribute from all the locations in the municipality
    moa = magnitude_of_attribute = landuse.loc[locations][attribute].unique().astype('int')

    # identify locations that have the same attribute by magnitude of attribute
    possible_locations = landuse[landuse[attribute].isin(moa)].index

    # remove the locations that are in the likelihood function
    prior_locations = [x for x in possible_locations if x not in locations]

    return locations, possible_locations, prior_locations


def make_expected(lh_tuple, prior_tuple, xrange):
    res = []
    betas=[]
    # print(lh_tuple, prior_tuple)
    for i in np.arange(len(xrange)):
        alpha = prior_tuple[i][0]
        betai = prior_tuple[i][1]
        success = lh_tuple[i][0]
        n = lh_tuple[i][1] + lh_tuple[i][0] 
        numerator = alpha + success
        denominator = alpha + betai + n
        if numerator == 0:
            numerator = 1
        abeta = beta(numerator, (betai + lh_tuple[i][1] + lh_tuple[i][0])).mean()
        betas.append(abeta)
        # print(alpha, betai, success, numerator, n, denominator)
        if numerator >= denominator:
            numerator = denominator-1
            
        expected = numerator/denominator
        res.append(expected)
    return np.array(res), np.array(betas) 

In [11]:
an_xrange = np.arange(0, 11)


In [12]:
comb_lu_agg = pd.read_csv("resources/data/u_comb_lu_cover_street_rivers.csv")

lu_scaled = comb_lu_agg.pivot(columns="use", values="scaled", index="slug").fillna(0)

lu_magnitude = comb_lu_agg.pivot(columns="use", values="magnitude", index="slug").fillna(0)

lu_binned = comb_lu_agg.pivot(columns="use", values="binned", index="slug").fillna(0)

# not_these = ['amphion', 'anthy', 'excenevex', 'lugrin', 'meillerie', 'saint-disdille', 'tougues']
merge_locations = cbd.slug.unique()
cbdu = cbd[~cbd.slug.isin(not_these)].merge(lu_scaled[lu_scaled.index.isin(merge_locations )], left_on="slug", right_index=True, validate="many_to_one", how="outer")

cbdu["use group"] = cbdu.code.map(lambda x: use_groups_i[x])

cbdu["ug"] = cbdu["use group"].apply(lambda x: abbrev_use_g[x])
cbdu[cbdu["use group"] == 'Personal consumption'].code.unique()
cbdu["date"] = pd.to_datetime(cbdu["date"], format="%Y-%m-%d")

attribute_columns = [x for x in lu_scaled.columns if x not in ["Geroell", "Stausee", "See", "Sumpf", "Stadtzentr", "Fels"]]
work_columns = [x for x in cbdu.columns if x not in ["Geroell", "Stausee", "See", "Sumpf", "Stadtzentr", "Fels"]]
cbdu = cbdu[work_columns].copy()
cbdu.rename(columns={"pcs/m":"pcs_m"}, inplace=True)


## Expected survey results Saint Sulpice

### Predicted values using empirical Bayes method

The method proposed in chapter two produced the following expected survey results for October 5, 2023 at Saint Sulpice:

<!-- |     Figure 11, Table 11  |    Table 12, Figure 12       | 
|:------------------------:|:----------------------------:|
|{glue:}`ssp-outlook-2024` | {glue:}`ssp-2024-meds`|
|{glue:}`ssp-summary` | {glue:}`ssp-predicted_samples`| -->



In [13]:
city =  'Saint-Sulpice (VD)'
start, end = "2015-11-15", "2021-05-31"
index_range = (0.0, 10)
xrange =  np.arange(*index_range, step=.01)
uninformed_tuple = np.array([(1,1) for x in xrange])


g_resa = cbdu.copy()
g_resa = g_resa.groupby(['loc_date', 'date','slug', 'city', 'Project', 'code'], as_index=False).agg({'pcs_m':'sum', 'quantity':'sum'})
g_resadt = g_resa.groupby(['loc_date', 'date','slug', 'city', 'Project'], as_index=False).agg({'pcs_m':'sum', 'quantity':'sum'})

# define the prior, likelihood data and likelihood locations
posterior_df = pd.DataFrame(index=xrange)
predictions = {}

for code in toi:
    
    # code_index = 1
    city_index = 0
    attribute_index = 2
    
    this_code =  code
    this_attribute = attribute_columns[attribute_index]
    this_city = cois[city_index]
    
    prior_data = g_resa[(g_resa.code == this_code)&(g_resa.city != city)&(g_resa.Project =="before may 2021")]
    lh_data = g_resa[(g_resa.code == this_code)&(g_resa.city == city)]

    # here the locations from Saint Sulpice are identified
    lh_locations = lh_data.slug.unique()

    # remove any location with no land-use data
    regions = lac_leman_regions[~lac_leman_regions.slug.isin(not_these)].copy()
    # identify the region of interest
    lh_regions = regions[regions.slug.isin(lh_locations)].alabel.unique()
    # retrieve any other survey locations in the region
    regional_locations = regions[regions.alabel.isin(lh_regions)].slug.unique()
    # retrieve the land use values of all locations in the region
    land_use_data_of_interest = lu_binned.loc[regional_locations]

    # the locations from Saint Sulpice as well as regional locations are passed
    # to the current_possible_prior_locations method. The land use values are
    # compared and the locations with similar land use values are identified.
    locations, possible_locations, prior_locations = current_possible_prior_locations(land_use_data_of_interest, lh_locations, this_attribute)
    
    prior_args = {
        'prior_data':prior_data[prior_data.slug.isin(prior_locations)],
        'start': start,
        'end': end,
        'xrange':xrange,
        'uninformed_prior': uninformed_tuple,
    }
    # grid approximation of the prior
    grid_prior, beta_prior, prior_k_n, prior_df, pcs = prior_distributions(**prior_args)
    
    posterior_args = {
        'lh_data':lh_data,
        'start': start,
        'end': "2022-12-31",
        'un_informed': uninformed_tuple,
        'informed_prior': prior_k_n
    }
    
    # grid approximation of posterior
    informed, uninformed, beta_p, lh_pcs = posterior_distribution(**posterior_args)
    
    # the quantiles from the observed data
    prior_quants = np.quantile(pcs, some_quants)
    post_quants = np.quantile(lh_pcs, some_quants)
    
    # data frame with normalized results
    post_df = make_results_df(prior_df.copy(), informed, source="Informed post", source_norm="Ip_n")
    post_df = make_results_df(post_df, uninformed, source="Uninformed post", source_norm="Un_n")
    
    # samples from posterior 
    sim_2024 = sampler_from_multinomial(post_df["Ip_n"].values, xrange, len(pcs) + len(lh_pcs))
    sim_quants = np.quantile(sim_2024[0], some_quants)
    
    
    predictions.update({this_code:sim_quants})
    posterior_df[this_code]=informed

index = ['{:.0%}'.format(x) for x in some_quants]
pred_quants = pd.DataFrame(predictions, index=index)

In [14]:
objects = ["G112", "G27", "G30", "G32", "G67", "G70", "G95", "G96", "Gcaps", "Gfrags"]
pred_quants = pred_quants[objects]
caption = "Table 4: The 94% probability interval of the objects of interest for Saint Sulpice. The median value is used for the predictions"
pred_quants.style.format(precision=2).set_table_styles(table_large_font).set_caption(caption)

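|     | G112 | G27 | G30 | G32 | G67 | G70 | G95 | G96 | Gcaps | Gfrags |
|:----|-----:|----:|----:|----:|----:|----:|----:|----:|------:|-------:|
| 3% | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 0.0 | 0.0 | 0.04 | 0.01 | 0.03 | 0.0 | 0.0 | 0.0 | 0.04 | 0.0 |
| 48% | 0.04 | 0.36 | 0.18 | 0.05 | 0.15 | 0.0 | 0.32 | 0.04 | 0.14 | 0.57 |
| 50% | 0.06 | 0.42 | 0.27 | 0.05 | 0.17 | 0.0 | 0.33 | 0.05 | 0.14 | 0.59 |
| 52% | 0.07 | 0.45 | 0.3 | 0.05 | 0.18 | 0.0 | 0.34 | 0.07 | 0.14 | 0.62 |
| 75% | 0.62 | 0.91 | 0.56 | 0.06 | 0.31 | 0.04 | 0.43 | 0.12 | 0.21 | 0.86 |
| 97% | 1.09 | 1.43 | 1.92 | 5.65 | 0.87 | 0.14 | 1.03 | 0.44 | 0.37 | 4.5 |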


Recall that the previous results from Saint Sulpice are not used to make the predictions. Only locations in the same region with similar land-use characteristics are used.
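
The following is a minimal sketch of the kind of conjugate update that a grid of beta-binomial estimates relies on. It is not the notebook's exact implementation. For each threshold on the grid, the regional samples supply the prior counts of results at or above the threshold, the local samples supply the likelihood, and the posterior mean is the expected probability of meeting or exceeding that threshold. The function names, the sample values and the choice to start from a uniform Beta(1, 1) are assumptions made for illustration.

```python
import numpy as np


def exceedance_counts(samples, thresholds):
    """For each threshold, count samples at or above it (k) and below it (n - k)."""
    samples = np.asarray(samples)
    k = np.array([(samples >= x).sum() for x in thresholds])
    return k, len(samples) - k


def posterior_mean(prior_samples, local_samples, thresholds):
    """Mean of the Beta posterior for P(sample >= threshold) at each threshold."""
    prior_k, prior_not_k = exceedance_counts(prior_samples, thresholds)
    k, not_k = exceedance_counts(local_samples, thresholds)
    # Assumption: begin with Beta(1, 1), add the regional counts as the
    # informed prior, then add the local counts as the likelihood.
    alpha = 1 + prior_k + k
    beta_ = 1 + prior_not_k + not_k
    return alpha / (alpha + beta_)


thresholds = np.array([0.0, 0.5, 1.0, 2.0])
regional = [0.1, 0.4, 0.6, 1.2, 2.5]  # hypothetical pcs/m from similar locations
local = [0.3, 0.9, 1.1]               # hypothetical pcs/m from the beach of interest

p_exceed = posterior_mean(regional, local, thresholds)
print(dict(zip(thresholds.tolist(), p_exceed.round(2).tolist())))
```

Because the prior and the likelihood enter as simple counts, locations with many samples weigh more heavily on the result than locations with few.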

### Estimates from participants

After a classroom discussion and a review of the previous years' results (but not the predicted results), the participants estimated how many of each item of interest they expected to find.

In [15]:
length_p = 49.3

estimated_p =[
    [.16, 1.12, .54, .05, .30, .08, .39, .04, .31, 1.34],
    [6, 3, .6, .1, .4, .03, .8, 1, 2, 1.34],
    [.4, 1.5, .3, .1, 1.1, .01, .5, .2, .4, 2]    
]

def make_rows(estimated, objects):
    rows = []
    for row in estimated:
        row = {objects[i]: x for i,x in enumerate(row)}
        rows.append(row)
    return rows

found_p = [4, 51, 7, 2, 0, 0, 12, 1, 12, 266]
found_pm = [x/length_p for x in found_p]

pierrette_rows = make_rows(estimated_p, objects)
pierrette = pd.DataFrame(pierrette_rows)

In [16]:
found_pel = [4, 51, 7, 2, 0, 0, 12, 1, 12, 266]
found_pelm = [x/length_p for x in found_pel]

In [17]:
estimated_td = [
    [.16, 1.12, .54, .05, .30, .08, .39, .04, .31, 1.34],
    [.15, .57,.24,.08,.05,.05,.16,.1,.14, .22],
    [.05, .8, .2, .02, .15, .04, .3, .01, .25, 1.2],
    [.1,.35,.15,.03, .05, .01, .06, .1, .15, .5],
    [.07, .15, .05, .02, .01, 0, .04, .01, .08, .12],
    [.15, .5, .3, .03, .01, 0, .05, .05, .2, .2]
]

length_td=16.5

found_td = [11, 60, 9, 3, 0, 0,13, 1, 3, 179]
found_tdm = np.array([x/length_td for x in found_td])

tiger_duck_rows = make_rows(estimated_td, objects)
tiger_duck = pd.DataFrame(tiger_duck_rows)

found = pd.DataFrame([found_tdm, found_pm], columns=objects)
fmelted = pd.melt(found, value_vars=found.columns)
fmelted["source"] = "found"

combined = pd.concat([tiger_duck, pierrette])
caption = "Table 5: The estimated amount in pcs/meter for each object that the participants expected to find."
combined.reset_index(inplace=True, drop=True)
combined.style.format(precision=2).set_table_styles(table_large_font).set_caption(caption)

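|     | G112 | G27 | G30 | G32 | G67 | G70 | G95 | G96 | Gcaps | Gfrags |
|:----|-----:|----:|----:|----:|----:|----:|----:|----:|------:|-------:|
| 0 | 0.16 | 1.12 | 0.54 | 0.05 | 0.3 | 0.08 | 0.39 | 0.04 | 0.31 | 1.34 |
| 1 | 0.15 | 0.57 | 0.24 | 0.08 | 0.05 | 0.05 | 0.16 | 0.1 | 0.14 | 0.22 |
| 2 | 0.05 | 0.8 | 0.2 | 0.02 | 0.15 | 0.04 | 0.3 | 0.01 | 0.25 | 1.2 |
| 3 | 0.1 | 0.35 | 0.15 | 0.03 | 0.05 | 0.01 | 0.06 | 0.1 | 0.15 | 0.5 |
| 4 | 0.07 | 0.15 | 0.05 | 0.02 | 0.01 | 0.0 | 0.04 | 0.01 | 0.08 | 0.12 |
| 5 | 0.15 | 0.5 | 0.3 | 0.03 | 0.01 | 0.0 | 0.05 | 0.05 | 0.2 | 0.2 |
| 6 | 0.16 | 1.12 | 0.54 | 0.05 | 0.3 | 0.08 | 0.39 | 0.04 | 0.31 | 1.34 |
| 7 | 6.0 | 3.0 | 0.6 | 0.1 | 0.4 | 0.03 | 0.8 | 1.0 | 2.0 | 1.34 |
| 8 | 0.4 | 1.5 | 0.3 | 0.1 | 1.1 | 0.01 | 0.5 | 0.2 | 0.4 | 2.0 |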


## Survey results October 5, 2023 Saint Sulpice

After the participants completed the forms, surveys were conducted at three beaches within the city limits of Saint Sulpice. The forms were returned for only two of the beaches.

In [18]:
caption="Table 6: The survey results of the objects of interest on October 5, 2023 in pieces per meter"
found_display = found.copy()
found_display.loc[0, "beach"] = "tiger-duck-beach"
found_display.loc[1, "beach"] = "parc-des-pierrettes"
found_display.set_index("beach", inplace=True, drop=True)
found_display.index.name = None
found_display.style.format(precision=2).set_table_styles(table_large_font).set_caption(caption)

Unnamed: 0,G112,G27,G30,G32,G67,G70,G95,G96,Gcaps,Gfrags
tiger-duck-beach,0.67,3.64,0.55,0.18,0.0,0.0,0.79,0.06,0.18,10.85
parc-des-pierrettes,0.08,1.03,0.14,0.04,0.0,0.0,0.24,0.02,0.24,5.4


## Results: Estimated, found and predicted

Both the participants and the model appear to have underestimated the amount of plastic fragments. Recall that the participants were given the cumulative results for these objects (table 1).

In [20]:

combined.reset_index(inplace=True, drop=True)
comb_long = pd.melt(combined, value_vars=combined.columns)
comb_long["source"] = "estimated"

predicted_median = list(zip(objects,pred_quants.loc["50%"]))
pmelted = pd.DataFrame(predicted_median, columns=["variable", "value"])
pmelted["source"] = "predicted"

predict_estimate_found = pd.concat([fmelted, pmelted, comb_long])

fig, ax = plt.subplots()

data_one = predict_estimate_found[predict_estimate_found.source == "estimated"]
data_two = predict_estimate_found[predict_estimate_found.source == "found"]
data_three = predict_estimate_found[predict_estimate_found.source == "predicted"]
sns.scatterplot(data=data_one, x="variable", y="value", color="magenta", zorder=0,ax=ax, label="estimated")
sns.scatterplot(data=data_two, x="variable", y="value", color="dodgerblue", marker="X", s=60, zorder=2,ax=ax, label="found")
sns.scatterplot(data=data_three, x="variable", y="value", color="black", zorder=3,ax=ax,label="predicted")
ax.set_xlabel("")
ax.set_ylabel("pcs/m")
ax.legend()
glue('estimated-found-predicted-2023', fig, display=False)
plt.close()

|Figure 3|
|:------------:|
|{glue:}`estimated-found-predicted-2023`|
|Figure 3: _The estimated, observed and predicted results for the objects of interest, Saint Sulpice October 5, 2023_|
 

On average, the model's predictions were closer to the observed values than the participants' estimates for every object (table 7). In total, 17/20 observed results fell within the 96% probability interval predicted by the model and 8/20 fell within the 50% probability interval (Annex: table 8). The objects that were not within the 96% interval are:
1. Fragmented plastics, (both surveys)
2. Cigarette ends
3. Toys and party favors
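
The interval check behind these counts can be sketched as follows. This is a minimal illustration, not the notebook's implementation: the quantile values are hypothetical placeholders (the model's quantiles are stored in `pred_quants`), and `within_interval` is a helper introduced here. The two observed values are the tiger-duck-beach results for G112 and G27 from table 6.

```python
def within_interval(observed, lower, upper):
    """Return True if the observed value lies inside [lower, upper]."""
    return lower <= observed <= upper

# HYPOTHETICAL quantiles for illustration only, not the model output
quantiles = {"2%": 0.0, "25%": 0.1, "50%": 0.3, "75%": 0.7, "98%": 2.0}

# observed pcs/m at tiger-duck-beach for G112 and G27 (table 6)
observed = [0.67, 3.64]

# True when the observation falls inside the central 96% interval
flags = [within_interval(x, quantiles["2%"], quantiles["98%"]) for x in observed]

# share of observations covered by the interval
coverage = sum(flags) / len(flags)
print(flags, coverage)
```

Applied to all 20 observations, the same ratio gives the 17/20 reported above.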

Figure 4 shows how close the predictions and the estimates were to the observed values.

In [21]:
def calculate_the_difference_estimated_found(combined, found_values, objects):
    # for each object, the distance between every estimate and the observed value
    diffs = []
    for i, o in enumerate(objects):
        diff = combined[o] - found_values[i]
        # the root of the squared difference is the absolute difference
        diff_sq = (diff**2)**.5
        diffs.append(diff_sq)
    return diffs
difference_estimated_found_td = calculate_the_difference_estimated_found(tiger_duck, found_tdm, objects)
difference_estimated_found_pr = calculate_the_difference_estimated_found(pierrette, found_pm, objects)
td_diff = pd.DataFrame(difference_estimated_found_td).T
pier_diff = pd.DataFrame(difference_estimated_found_pr).T
combined_diffs = pd.concat([td_diff, pier_diff])
c_diff_estimated = pd.melt(combined_diffs, value_vars=combined_diffs.columns)
c_diff_estimated["source"] = "difference² of estimated"

In [22]:
predicted_median = pred_quants.loc["50%"].values
diff_1 = []
diff_2 = []
for i, o in enumerate(objects):    
    diffp = predicted_median[i] - found_tdm[i]
    diffpr = predicted_median[i] - found_pm[i]
    diff_sq = (np.array([diffp, diffpr])**2)**.5
    diff_1.append(diff_sq[0])
    diff_2.append(diff_sq[1])


pred_diffs = pd.DataFrame([diff_1, diff_2], columns=objects)
p_diff_predicted = pd.melt(pred_diffs, value_vars=pred_diffs.columns)
p_diff_predicted["source"] = "difference² of predicted"

In [23]:
results = pd.concat([c_diff_estimated, p_diff_predicted])

fig, ax = plt.subplots()

ax.plot(objects, [0]*len(objects), label="zero")
sns.scatterplot(data=results, x="variable", y="value", ax=ax, hue="source", palette=["magenta", "black"])
ax.set_xlabel("")
ax.set_ylabel("pcs/m")
ax.legend()
glue('diference-estimated-found-predicted-2023', fig, display=False)
plt.close()

|Figure 4|
|:------------:|
|{glue:}`diference-estimated-found-predicted-2023`|
|Figure 4: _The absolute difference (root of the squared difference) between the observed values and the estimated and predicted values, Saint Sulpice October 5, 2023_|
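
The quantity plotted in figure 4, the root of the squared difference, is the same as the absolute difference. A minimal sketch, using the first participant's estimates for tiger-duck-beach (table 5, row 0) and the counts for the first three objects (G112, G27, G30); the variable names are chosen here for illustration:

```python
import numpy as np

length_td = 16.5  # length of the tiger-duck-beach survey in meters

# counts for G112, G27 and G30 at tiger-duck-beach
found_counts = np.array([11, 60, 9])
found_per_meter = found_counts / length_td

# one participant's estimates in pcs/m for the same objects (table 5, row 0)
one_estimate = np.array([0.16, 1.12, 0.54])

# the expression used in the notebook
root_of_square = ((one_estimate - found_per_meter) ** 2) ** 0.5

# the equivalent absolute difference
absolute = np.abs(one_estimate - found_per_meter)

print(np.allclose(root_of_square, absolute))
```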

In [24]:
r = results.groupby(["source", "variable"], as_index=False).value.mean()
r.rename(columns={"value":"average"}, inplace=True)
rp = r.pivot(index="variable", columns="source")
rp.index.name = None
caption = "Table 7: The average difference between what was found and the estimates of the participants and what was predicted using the empirical Bayes method"
rp.style.set_table_styles(table_large_font).format(precision=2).set_caption(caption)

Unnamed: 0_level_0,average,average
source,difference² of estimated,difference² of predicted
G112,1.07,0.32
G27,2.32,1.92
G30,0.31,0.2
G32,0.11,0.07
G67,0.26,0.17
G70,0.03,0.0
G95,0.52,0.27
G96,0.15,0.02
Gcaps,0.26,0.07
Gfrags,8.11,7.53


In [25]:
def are_the_observed_values_within_the_hdi(found, objects, predicted, index=0):
    place = []
    for i, o in enumerate(objects):
        f = found.loc[index, o]
        p = predicted[o].values
        ip = bisect_left(p, f)
        if ip >= 0 and ip <= 6:
            # is within 96% HDI
            within_96 = True
        else:
            within_96 = False                  
        if ip >= 1 and ip <= 5:
            # is within 50% HDI           
            within_50 = True
        else:          
            within_50 = False    
            
        place.append([o, within_96, within_50])
    return place

simple_results_td = are_the_observed_values_within_the_hdi(found, objects, pred_quants)
td_predicted_results = pd.DataFrame(simple_results_td, columns=["object","within 96% HDI", "within 50% HDI"])

simple_results_p = are_the_observed_values_within_the_hdi(found, objects, pred_quants, index=1)
p_predicted_results = pd.DataFrame(simple_results_p, columns=["object","within 96% HDI", "within 50% HDI"])

all_predicted_results = pd.concat([td_predicted_results, p_predicted_results])

## Discussion

Three surveys were completed; two are reflected in this report. Neither the participants' estimates nor the survey results were returned for the third survey. On location, the participants were shown examples of the objects of interest. The limits of the survey area were defined and the survey was conducted in small groups. The objects found on the beach were separated and counted on location. Identifying fragmented plastics and foams, or differentiating between them, remains difficult for new participants. This is due in part to time constraints and in part to a lack of experience. Often, what initially appears to be an unidentifiable piece of plastic can be placed in a more precise category with reasonable certainty. New participants either do not have the time to consider other possibilities or are simply unaware of the original use of the item in question.

Many participants used the previous aggregated survey results to estimate the expected values. This reasonable strategy produced estimates that were very close to the predicted survey results. This suggests that previous survey results can serve as an indicator for expected results as long as the objects have been identified correctly and consistently in the past. Yet, from table 7 it is clear that predictions are more accurate when a formal method is used. 
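
The comparison can be checked directly from the values reported in table 7. The sketch below copies the two columns of that table into a data frame; the column names `estimated` and `predicted` are shortened here for illustration.

```python
import pandas as pd

# average absolute differences in pcs/m, copied from table 7
table_7 = pd.DataFrame(
    {
        "estimated": [1.07, 2.32, 0.31, 0.11, 0.26, 0.03, 0.52, 0.15, 0.26, 8.11],
        "predicted": [0.32, 1.92, 0.20, 0.07, 0.17, 0.00, 0.27, 0.02, 0.07, 7.53],
    },
    index=["G112", "G27", "G30", "G32", "G67", "G70", "G95", "G96", "Gcaps", "Gfrags"],
)

# is the model closer than the participants for every object?
model_closer = bool((table_7["predicted"] < table_7["estimated"]).all())

# average of each column over the ten objects
mean_estimated = table_7["estimated"].mean()
mean_predicted = table_7["predicted"].mean()

print(model_closer, round(mean_estimated, 2), round(mean_predicted, 2))
```

Most of the remaining error in both columns comes from fragmented plastics (Gfrags) and cigarette ends (G27).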

## Conclusions

From this experience we conclude that the expected values in table 4 do represent probable beach-litter densities in the region. 

### Next steps

Build a hierarchical model.

## Annex

### The accuracy of the predictions in relation to what was found

In [26]:
caption = "Table 8: Whether the observed value fell within the predicted interval"
all_predicted_results.style.set_table_styles(table_large_font).set_caption(caption)

Unnamed: 0,object,within 96% HDI,within 50% HDI
0,G112,True,False
1,G27,False,False
2,G30,True,True
3,G32,True,False
4,G67,True,False
5,G70,True,False
6,G95,True,False
7,G96,True,True
8,Gcaps,True,True
9,Gfrags,False,False


In [27]:
alr = all_predicted_results[["within 96% HDI", "within 50% HDI"]].sum()
alr/len(all_predicted_results)

within 96% HDI    0.85
within 50% HDI    0.40
dtype: float64

In [28]:
alr

within 96% HDI    17
within 50% HDI     8
dtype: int64

In [29]:
today = dt.datetime.now().date().strftime("%d/%m/%Y")
where = "Biel, CH"

my_block = f"""

This script updated {today} in {where}

\u2764\ufe0f __what you do everyday:__ *analyst at hammerdirt*
"""

md(my_block)



This script updated 22/08/2024 in Biel, CH

❤️ __what you do everyday:__ *analyst at hammerdirt*


In [30]:
%watermark --iversions -b -r

Git repo: https://github.com/hammerdirt-analyst/solid-waste-team.git

Git branch: main

numpy     : 1.26.4
seaborn   : 0.13.2
matplotlib: 3.8.4
pandas    : 2.2.2

