In [1]:
%load_ext watermark
import pandas as pd
import numpy as np

# Testing data models

The methods used in the version of the federal report were tested, but there was no specific set of validation criteria defined beforehand. Tests were done as the work progressed, which wasted a lot of time.

Here we test the land use and survey data models.

1. Is the land use data complete for each survey location?
2. Does the survey data aggregate correctly to sample level?
   * What happens to objects with a quantity of zero?
   * Aggregating to cantonal, municipal or survey area
     * Are all locations included?
     * Are lakes and rivers distinguished?
3. Does the aggregated data for IQAASL match the federal report?
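
The first check can be sketched as a set difference between the locations in the survey data and the locations in the land use data. This is a minimal sketch on toy data: the location names are invented, and the assumption that the land use table carries a `slug` column is ours, not confirmed by the files loaded below.

```python
import numpy as np
import pandas as pd

# toy survey records: one row per object code per sample
toy_surveys = pd.DataFrame({
    "slug": ["loc-a", "loc-a", "loc-b", "loc-c"],
    "code": ["G27", "G30", "G27", "G27"],
    "quantity": [3, 0, 1, 2],
})

# toy land use records, assumed to share the location key (slug)
toy_land_use = pd.DataFrame({
    "slug": ["loc-a", "loc-b"],
    "use": ["forest", "urban"],
})

# locations present in the surveys but absent from the land use data
missing_land_use = np.setdiff1d(
    toy_surveys.slug.unique(), toy_land_use.slug.unique()
)
print(missing_land_use)
```

An empty result would mean that every survey location has a land use record. In this toy example `loc-c` is reported as missing.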

### Gfoams, Gfrags, Gcaps

These are aggregate groups. It is difficult to infer how well a participant differentiates between size or use of the following codes.

1. Gfrags: G79, G78, G75
2. Gfoams: G81, G82, G76
3. Gcaps: G21, G22, G23, G24

These aggregate groups are used when comparing values between sampling campaigns.
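
As an illustration of how the groups are applied, the sketch below maps member codes to their group and sums the quantities per sample. The group membership is taken from the list above; the sample values are invented, and the use of `Series.replace` is one possible approach, not necessarily the one used later in this notebook.

```python
import pandas as pd

# group membership as listed above
aggregate_groups = {
    "Gfrags": ["G79", "G78", "G75"],
    "Gfoams": ["G81", "G82", "G76"],
    "Gcaps": ["G21", "G22", "G23", "G24"],
}

# invert to a member code -> group lookup
code_to_group = {
    child: group
    for group, children in aggregate_groups.items()
    for child in children
}

# toy sample: hypothetical quantities for one survey
toy_sample = pd.DataFrame({
    "loc_date": ["s1"] * 4,
    "code": ["G79", "G75", "G21", "G27"],
    "quantity": [2, 5, 1, 4],
})

# replace member codes with their group, leave other codes unchanged
toy_sample["code"] = toy_sample["code"].replace(code_to_group)

# sum quantities so each group appears once per sample
grouped = toy_sample.groupby(["loc_date", "code"], as_index=False)["quantity"].sum()
print(grouped)
```

The total quantity is the same before and after grouping, which is the property to verify when the groups are applied to the real data.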

### Sampling campaigns

The dates of the sampling campaigns are expanded to include the surveys that happened between large organized campaigns. The start and end dates are defined below.

__Attention!!__ The codes used for each survey campaign are different. Different groups organized and conducted surveys using the MLW protocol. The data was then sent to us.

__MCBP:__ November 2015 - November 2016. The initial sampling campaign. Fragmented plastics (Gfrags/G79/G78/G76) were not sorted by size. All unidentified hard plastic items were classified in this manner.

* start_date = 2015-11-15
* end_date = 2017-03-31

__SLR:__ April 2017 - May 2018. Sampling campaign by the WWF. Objects less than 2.5 cm were not counted.

* start_date = 2017-04-01
* end_date = 2020-03-31

__IQAASL:__ April 2020 - May 2021. Sampling campaign mandated by the Swiss Confederation. Additional codes were added for regional objects.

* start_date = 2020-04-01
* end_date = 2021-05-31

__Plastock (not added yet):__ January 2022 - December 2022. Sampling campaign from the Association pour la Sauvegarde du Léman. Not all objects were counted; only a limited number of objects were identified.
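
The start and end dates above can be used to label each survey with its campaign. The following is a minimal sketch under our own assumptions: `label_campaign` and the `"unassigned"` label are hypothetical names, and the test dates are chosen only to exercise the period boundaries.

```python
import pandas as pd

# start and end dates as defined above (Plastock is not included)
campaign_dates = {
    "mcbp": ("2015-11-15", "2017-03-31"),
    "slr": ("2017-04-01", "2020-03-31"),
    "iqaasl": ("2020-04-01", "2021-05-31"),
}

def label_campaign(dates: pd.Series, periods: dict) -> pd.Series:
    """Return the campaign name for each date, 'unassigned' if none matches."""
    dates = pd.to_datetime(dates)
    labels = pd.Series(["unassigned"] * len(dates), index=dates.index)
    for name, (start, end) in periods.items():
        # between() includes both the start and the end date
        in_period = dates.between(pd.Timestamp(start), pd.Timestamp(end))
        labels.loc[in_period] = name
    return labels

toy_dates = pd.Series(["2015-11-23", "2017-04-01", "2021-05-31", "2022-10-06"])
print(label_campaign(toy_dates, campaign_dates).tolist())
```

Because the periods do not overlap, each date receives at most one label. Dates after the end of IQAASL stay unassigned until a later period is defined.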

### Feature type

The feature type is a label that applies to the general conditions of use for the location and other locations in the region.

* r: rivers: surveys on river banks
* l: lake: surveys on the lake shore
* p: parks: surveys in recreational areas

### Parent boundary

Designates the larger geographic region of the survey location. For lakes and rivers it is the name of the catchment area or river basin. For parks it is the type of park, for example Les Alpes. Recall that each feature has a name; for example, Alpes Lépontines is the name of a feature in the geographic region of _Les Alpes_.

### Language

The code descriptions are available in three languages:

* en: English
* fr: French
* de: German

In [2]:
def collect_vitals(data):
    total = data.quantity.sum()
    median = data.pcs_m.median()
    samples = data.loc_date.nunique()
    ncodes = data.code.nunique()
    nlocations = data.slug.nunique()
    nbodies = data.feature_name.nunique()
    ncities = data.city.nunique()
    min_date = data["date"].min()
    max_date = data["date"].max()
    
    return total, median, samples, ncodes, nlocations, nbodies, ncities, min_date, max_date

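def find_missing(more_than, less_than):
    # values in more_than that are not in less_than (sorted, unique)
    return np.setdiff1d(more_than, less_than)

def find_missing_loc_dates(done, dtwo):
    # samples (loc_date) present in done but absent from dtwo
    locs_one = done.loc_date.unique()
    locs_two = dtwo.loc_date.unique()
    return find_missing(locs_one, locs_two)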

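def use_gfrags_gfoams_gcaps(data, codes, columns=("Gfoams", "Gfrags", "Gcaps")):
    # replace each member code with its parent group code
    # note: data is modified in place and the same object is returned
    for col in columns:
        change = codes.loc[codes.parent_code == col].index
        data.loc[data.code.isin(change), "code"] = col

    return data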

def make_a_summary(vitals, add_summary_name=False):

    a_summary = f"""
    Number of objects: {vitals[0]}
    
    Median pieces/meter: {vitals[1]}
    
    Number of samples: {vitals[2]}
    
    Number of unique codes: {vitals[3]}
    
    Number of sample locations: {vitals[4]}
    
    Number of features: {vitals[5]}
    
    Number of cities: {vitals[6]}
    
    Start date: {vitals[7]}
    
    End date: {vitals[8]}
    
    """

    if add_summary_name:
        a_summary = f"""
        Summary name = {add_summary_name}

        {a_summary}
        """
        
    return a_summary

def combine_survey_files(list_of_files):
    # read each survey file and stack the results into one data frame
    files = []
    for afile in list_of_files:
        files.append(pd.read_csv(afile))
    return pd.concat(files)

def indexed_feature_data(file, index: str = "code"):
    df = pd.read_csv(file)
    df.set_index(index, drop=True, inplace=True)
    return df

# period dates
period_dates = {
    "mcbp":["2015-11-15", "2017-03-31"],
    "slr": ["2017-04-01", "2020-03-31"],
    "iqaasl": ["2020-04-01", "2021-05-31"],
    "2022": ["2021-06-01", "2022-12-01"]
}
code_cols = ['material', 'description', 'source', 'parent_code', 'single_use', 'groupname']

group_by_columns = [
    'loc_date', 
    'date', 
    'feature_name', 
    'slug',     
    'parent_boundary',
    'length',
    'groupname',
    'city',
    'code', 
]
agg_this = {
    "quantity":"sum",
    "pcs_m": "sum"
}

survey_data = [
    "data/end_process/after_may_2021.csv",
    "data/end_process/iqaasl.csv",
    "data/end_process/mcbp.csv",
    "data/end_process/slr.csv",
]

code_data =  "data/end_process/codes.csv"
beach_data = "data/end_process/beaches.csv"
land_cover_data = "data/end_process/land_cover.csv"
land_use_data = "data/end_process/land_use.csv"
street_data = "data/end_process/streets.csv"
intersection_attributes = "data/end_process/river_intersect_lakes.csv"
surveys = combine_survey_files(survey_data)
codes = indexed_feature_data(code_data, index="code")
beaches = indexed_feature_data(beach_data, index="slug")
land_cover = pd.read_csv(land_cover_data)
land_use = pd.read_csv(land_use_data)
streets = pd.read_csv(street_data)
river_intersect_lakes = pd.read_csv(intersection_attributes)

In [3]:
# assign the new code values to the results
g_frag_foam = use_gfrags_gfoams_gcaps(surveys, codes)

# keep the records with a quantity greater than zero, plus all records for the aggregate groups
gthan_zero = g_frag_foam[(g_frag_foam.quantity > 0) | (g_frag_foam.code.isin(["Gfrags", "Gfoams", "Gcaps"]))].copy()

# keep the records with a quantity of zero for codes that are not aggregate groups
t_t0 = g_frag_foam[(g_frag_foam.quantity == 0) & (~g_frag_foam.code.isin(["Gfrags", "Gfoams", "Gcaps"]))].copy()

# sum quantity and pcs_m so that each code appears once per sample
t_ti = gthan_zero.groupby(group_by_columns, as_index=False).agg(agg_this)

# recombine the zero-quantity records with the aggregated records
t_th = pd.concat([t_t0, t_ti])

In [4]:
vitals_t = collect_vitals(t_th)
print(make_a_summary(vitals_t))


    Number of objects: 196842
    
    Median pieces/meter: 0.0
    
    Number of samples: 1352
    
    Number of unique codes: 227
    
    Number of sample locations: 245
    
    Number of features: 59
    
    Number of cities: 142
    
    Start date: 2015-11-23
    
    End date: 2022-10-06
    
    


In [5]:
vitals_o = collect_vitals(surveys)
print(make_a_summary(vitals_o))


    Number of objects: 196842
    
    Median pieces/meter: 0.0
    
    Number of samples: 1352
    
    Number of unique codes: 227
    
    Number of sample locations: 245
    
    Number of features: 59
    
    Number of cities: 142
    
    Start date: 2015-11-23
    
    End date: 2022-10-06
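
The two summaries above are compared by eye. A programmatic comparison could look like the following sketch on toy data. The `vitals` helper is hypothetical and stands in for `collect_vitals`, and the records are invented to include a repeated code and a zero quantity.

```python
import pandas as pd

# toy records: two samples, one repeated code and one zero-quantity record
before = pd.DataFrame({
    "loc_date": ["s1", "s1", "s1", "s2"],
    "code": ["Gfrags", "Gfrags", "G27", "G27"],
    "quantity": [2, 3, 0, 4],
})

# aggregate to one row per sample and code
after = before.groupby(["loc_date", "code"], as_index=False)["quantity"].sum()

def vitals(df: pd.DataFrame) -> dict:
    """Hypothetical helper: values that the aggregation should preserve."""
    return {
        "total": int(df.quantity.sum()),
        "samples": df.loc_date.nunique(),
        "codes": df.code.nunique(),
    }

# names of the values that changed, empty when nothing was lost
changed = [k for k, v in vitals(before).items() if vitals(after)[k] != v]
assert not changed, f"aggregation changed: {changed}"
```

An assertion of this kind fails loudly if a record is dropped, which is harder to miss than a difference between two printed summaries.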
    
    


In [6]:
%watermark -a hammerdirt-analyst -co --iversions

Author: hammerdirt-analyst

conda environment: cantonal_report

pandas: 2.0.3
numpy : 1.25.2

