In [2]:
%load_ext watermark

import pandas as pd
import numpy as np

# Review of previous data (in progress)

## Notes

After the end of the IQAASL project the data was stored separately and other surveys were conducted. The results of those surveys were stored in separate places, and the codes used for the different projects were not all the same.

Here we standardize the data, identify locations or records that cannot be verified, and define the set of codes used since 2015. The results will be used to define the data model for the next iteration.

This work considers the results from experiences with the Solid-Waste-Team, Wageningen University & Research, the previous collaboration with OFEV and current projects with the ASL.

### Eliminated survey locations

The following locations were either duplicated or the length of the shoreline could not be verified.

```python

not_these = [
    'sihlsee_einsiedeln_schilligerllacherl',
    'schiffenensee_duedingen_hirschij',
    'lac-leman-hammerdirt',
    'thur_schoenenberg_schaera',
    'katzenbach_zuerich_sanesim',
    'inn_pradella_kohlt',
    'emme_luterbach_huggenbergerk',
    'lotschebach_bern_scheurerk',
    'mammern-swisslitterreport',
    'berlingen-swisslitterreport'    
]
```
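A minimal sketch of how such a list might be applied, assuming the survey records are held in a DataFrame with a `slug` column, as in the rest of this notebook. The toy frame `surveys`, its values, and the shortened list `excluded` are illustrative only.

```python
import pandas as pd

# Illustrative subset of the excluded locations listed above.
excluded = ["lac-leman-hammerdirt", "inn_pradella_kohlt"]

# Toy records (hypothetical values); the real data has one row per location, date and code.
surveys = pd.DataFrame({
    "slug": ["tiger-duck-beach", "lac-leman-hammerdirt", "inn_pradella_kohlt"],
    "quantity": [9, 4, 2],
})

# Keep only the rows whose location is not in the excluded list.
kept = surveys[~surveys["slug"].isin(excluded)].copy()
print(kept["slug"].tolist())
```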

### Eliminated codes

The codes G909, G910, G911 and G912 were eliminated. The recorded value was placed under the parent code:

1. G909, G910 => G74
2. G911 => G81
3. G912 => G82
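The reassignment could be sketched as follows, assuming records with `loc_date`, `code` and `quantity` columns. The frame `records` and its values are hypothetical; only the mapping comes from the list above.

```python
import pandas as pd

# Mapping taken from the list above: eliminated code -> parent code.
to_parent = {"G909": "G74", "G910": "G74", "G911": "G81", "G912": "G82"}

# Toy records for a single sample (hypothetical values).
records = pd.DataFrame({
    "loc_date": ["s1"] * 4,
    "code": ["G909", "G910", "G74", "G911"],
    "quantity": [2, 3, 1, 5],
})

# Replace the eliminated codes and leave all other codes unchanged.
records["code"] = records["code"].replace(to_parent)

# Sum the quantities that now share a code within the same sample.
merged = records.groupby(["loc_date", "code"], as_index=False)["quantity"].sum()
print(merged)
```

Summing after the replacement keeps one row per sample and code, so the total number of objects is unchanged.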

### Gfoams, Gfrags, Gcaps

These are aggregate groups. It is difficult to infer how well a participant differentiated between the sizes or uses of the objects recorded under the following codes, so the codes are combined.

1. Gfrags: G79, G78, G75
2. Gfoams: G81, G82, G76
3. Gcaps: G21, G22, G23, G24

These aggregate groups are used when comparing values between sampling campaigns.
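As an illustration, the group membership listed above can be turned into a code-to-group lookup. This is a sketch with an explicit dictionary; the notebook itself derives the groups from the `parent_code` column of the codes table. The names `groups`, `code_to_group` and the sample series are assumptions made for the example.

```python
import pandas as pd

# Group membership as listed above.
groups = {
    "Gfrags": ["G79", "G78", "G75"],
    "Gfoams": ["G81", "G82", "G76"],
    "Gcaps": ["G21", "G22", "G23", "G24"],
}

# Invert to a code -> group lookup.
code_to_group = {code: group for group, codes in groups.items() for code in codes}

codes = pd.Series(["G78", "G82", "G21", "G27"])

# Codes outside the groups keep their own label.
grouped = codes.map(code_to_group).fillna(codes)
print(grouped.tolist())
```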

### Sampling campaigns

The dates of the sampling campaigns are expanded to include the surveys that happened between large organized campaigns. The start and end dates are defined below.

__Attention!!__ The codes used for each survey campaign are different. Different groups organized and conducted surveys using the MLW protocol. The data was then sent to us.

__MCBP:__ November 2015 - November 2016. The initial sampling campaign. Fragmented plastics (Gfrags/G79/G78/G76) were not sorted by size. All unidentified hard plastic items were classified in this manner.

* start_date = 2015-11-15
* end_date = 2017-03-31

__SLR:__ April 2017 - May 2018. Sampling campaign by the WWF. Objects less than 2.5 cm were not counted.

* start_date = 2017-04-01
* end_date = 2020-03-31

__IQAASL:__ April 2020 - May 2021. Sampling campaign mandated by the Swiss confederation. Additional codes were added for regional objects.

* start_date = 2020-04-01
* end_date = 2021-05-31

__Plastock (not added yet):__ January 2022 - December 2022. Sampling campaign from the Association pour la Sauvegarde du Léman. Not all objects were counted; the surveyors only identified a limited number of objects.
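A sketch of how a record could be assigned to a campaign from its date, using the expanded start and end dates given above. The helper `label_campaign` and the sample dates are hypothetical and not part of the notebook.

```python
import pandas as pd

# Campaign windows as defined above (Plastock is not included yet).
campaigns = {
    "MCBP": ("2015-11-15", "2017-03-31"),
    "SLR": ("2017-04-01", "2020-03-31"),
    "IQAASL": ("2020-04-01", "2021-05-31"),
}

def label_campaign(dates: pd.Series) -> pd.Series:
    """Return the campaign name for each date; dates outside all windows stay unlabelled."""
    dates = pd.to_datetime(dates)
    labels = pd.Series([None] * len(dates), index=dates.index, dtype="object")
    for name, (start, end) in campaigns.items():
        # between() is inclusive on both ends by default
        in_window = dates.between(pd.Timestamp(start), pd.Timestamp(end))
        labels.loc[in_window] = name
    return labels

sample_dates = pd.Series(["2015-11-23", "2018-06-15", "2021-05-29", "2021-10-07"])
labels = label_campaign(sample_dates)
print(labels.tolist())
```

The second date falls after the organized SLR campaign but inside its expanded window, so it is labelled SLR. The last date is after the IQAASL end date and is left unlabelled.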

### Feature type

The feature type is a label that describes the general conditions of use for the location and other locations in the region.

* r: rivers: surveys on river banks
* l: lake: surveys on the lake shore
* p: parcs: surveys in recreational areas

### Parent boundary

Designates the larger geographic region of the survey location. For lakes and rivers it is the name of the catchment area or river basin. For parcs it is the type of park, e.g. _Les Alpes_. Recall that each feature has a name: for example, Alpes Lépontines is the name of a feature in the geographic region of _Les Alpes_.

### Language

The code descriptions are available in three languages:

* en: English
* fr: French
* de: German

### quantity

In [3]:
def collect_vitals(data):
    total = data.quantity.sum()
    median = data.pcs_m.median()
    samples = data.loc_date.nunique()
    ncodes = data.code.nunique()
    nlocations = data.slug.nunique()
    nbodies = data.feature_name.nunique()
    ncities = data.city.nunique()
    min_date = data["date"].min()
    max_date = data["date"].max()
    
    return total, median, samples, ncodes, nlocations, nbodies, ncities, min_date, max_date

def find_missing(more_than, less_than):
    # values that are in more_than but not in less_than
    return np.setdiff1d(more_than, less_than)

def find_missing_loc_dates(done, dtwo):
    # samples (loc_date) that are in done but not in dtwo
    locs_one = done.loc_date.unique()
    locs_two = dtwo.loc_date.unique()
    return find_missing(locs_one, locs_two)

def aggregate_gcaps_gfoams_gfrags(data, codes, columns=("Gfoams", "Gfrags", "Gcaps")):
    # replace each child code with its aggregate group
    # note: data is modified in place and also returned
    for col in columns:
        change = codes.loc[codes.parent_code == col].index
        data.loc[data.code.isin(change), "code"] = col
        
    return data

def make_a_summary(vitals, add_summary_name=False):

    a_summary = f"""
    Number of objects: {vitals[0]}
    
    Median pieces/meter: {vitals[1]}
    
    Number of samples: {vitals[2]}
    
    Number of unique codes: {vitals[3]}
    
    Number of sample locations: {vitals[4]}
    
    Number of features: {vitals[5]}
    
    Number of cities: {vitals[6]}
    
    Start date: {vitals[7]}
    
    End date: {vitals[8]}
    
    """

    if add_summary_name:
        a_summary = f"""
        Summary name = {add_summary_name}

        {a_summary}
        """
        
    return a_summary

code_cols = ['material', 'description', 'source', 'parent_code', 'single_use', 'groupname']
a_cols = [
    'loc_date',
    'date',
    'water_name_slug',
    'location',
    'code',
    'pcs_m',
    'quantity',
    'river_bassin',
    'length', 
    'groupname',
    'city'
]

c_cols = [
    'loc_date',
    'location',
    'date',
    'water_name_slug',     
    'river_bassin',
    'length', 
    'groupname',
    'city',
    'code',
]

group_by_columns = [
    'loc_date', 
    'date', 
    'feature_name', 
    'slug',     
    'parent_boundary',
    'length',
    'groupname',
    'city',
    'code', 
]
agg_this = {
    "quantity":"sum",
    "pcs_m": "sum"
}

not_these = [
    'sihlsee_einsiedeln_schilligerllacherl',
    'schiffenensee_duedingen_hirschij',
    'lac-leman-hammerdirt',
    'thur_schoenenberg_schaera',
    'katzenbach_zuerich_sanesim',
    'inn_pradella_kohlt',
    'emme_luterbach_huggenbergerk',
    'lotschebach_bern_scheurerk',
    'mammern-swisslitterreport',
    'berlingen-swisslitterreport'   
]

dfCodes = pd.read_csv("data/end_process/codes.csv")
dfCodes.set_index("code", drop=True, inplace=True)


data_source = [
    "data/end_process/after_may_2021.csv",
    "data/end_process/iqaasl.csv",
    "data/end_process/mcbp.csv",
    "data/end_process/slr.csv",
]

def combine_survey_files(list_of_files):

    files = []
    for afile in list_of_files:
        files.append(pd.read_csv(afile))
    return pd.concat(files)

test_this = combine_survey_files(data_source)

## Survey data

In [4]:
new_column_names = {
    "water_name_slug":"feature_name",
    "river_bassin":"parent_boundary",
    "location":"slug"
}

test_this.rename(columns=new_column_names, inplace=True)
test_this["length"] = test_this["length"].astype("int")
test_this.head()

Unnamed: 0,loc_date,date,feature_name,slug,code,pcs_m,quantity,parent_boundary,length,groupname,city
0,"('tiger-duck-beach', '2021-10-07')",2021-10-07,lac-leman,tiger-duck-beach,G1,0.0,0,rhone,30,food and drink,Saint-Sulpice (VD)
1,"('tiger-duck-beach', '2021-10-07')",2021-10-07,lac-leman,tiger-duck-beach,G10,0.07,2,rhone,30,food and drink,Saint-Sulpice (VD)
2,"('tiger-duck-beach', '2021-10-07')",2021-10-07,lac-leman,tiger-duck-beach,G100,0.3,9,rhone,30,waste water,Saint-Sulpice (VD)
3,"('tiger-duck-beach', '2021-10-07')",2021-10-07,lac-leman,tiger-duck-beach,G101,0.0,0,rhone,30,personal items,Saint-Sulpice (VD)
4,"('tiger-duck-beach', '2021-10-07')",2021-10-07,lac-leman,tiger-duck-beach,G102,0.0,0,rhone,30,personal items,Saint-Sulpice (VD)


In [5]:
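# aggregate to object totals per sample (loc_date) and code
# split the data into two parts: records with a quantity of zero are set aside,
# the remaining records are relabeled with the aggregate groups and summed
# the two parts are concatenated after the groupby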

t_t0 = test_this[test_this.quantity == 0].copy()
t_t = test_this[test_this.quantity > 0].copy()
t_t = aggregate_gcaps_gfoams_gfrags(t_t, dfCodes)


t_ti = t_t.groupby(group_by_columns, as_index=False).agg(agg_this)

t_th = pd.concat([t_t0, t_ti])
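The split, relabel, group and concatenate steps can be checked on a small example. This is a sketch on toy data with hypothetical values and a hard-coded relabeling of two codes; it only illustrates that the total number of objects should survive the aggregation.

```python
import pandas as pd

# Toy records: two codes in one sample that belong to the same aggregate group,
# plus one record with a quantity of zero (hypothetical values).
records = pd.DataFrame({
    "loc_date": ["s1", "s1", "s1"],
    "code": ["G78", "G79", "G10"],
    "quantity": [3, 4, 0],
    "pcs_m": [0.10, 0.13, 0.0],
})

zeros = records[records["quantity"] == 0].copy()
found = records[records["quantity"] > 0].copy()

# Relabel the grouped codes, then sum within sample and code.
found["code"] = found["code"].replace({"G78": "Gfrags", "G79": "Gfrags"})
found = found.groupby(["loc_date", "code"], as_index=False).agg(
    {"quantity": "sum", "pcs_m": "sum"}
)

combined = pd.concat([zeros, found], ignore_index=True)

# The total number of objects should be unchanged by the aggregation.
assert combined["quantity"].sum() == records["quantity"].sum()
print(combined)
```

Because the zero-quantity records are set aside before relabeling, they keep their original codes in the combined result.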

In [6]:
test_this.info()

<class 'pandas.core.frame.DataFrame'>
Index: 318478 entries, 0 to 200785
Data columns (total 11 columns):
 #   Column           Non-Null Count   Dtype  
---  ------           --------------   -----  
 0   loc_date         318478 non-null  object 
 1   date             318478 non-null  object 
 2   feature_name     318478 non-null  object 
 3   slug             318478 non-null  object 
 4   code             318478 non-null  object 
 5   pcs_m            318478 non-null  float64
 6   quantity         318478 non-null  int64  
 7   parent_boundary  318478 non-null  object 
 8   length           318478 non-null  int64  
 9   groupname        318478 non-null  object 
 10  city             318478 non-null  object 
dtypes: float64(1), int64(2), object(8)
memory usage: 29.2+ MB


### Summary all data

In [7]:
vitals_all = collect_vitals(test_this)
print(make_a_summary(vitals_all))


    Number of objects: 196842
    
    Median pieces/meter: 0.0
    
    Number of samples: 1352
    
    Number of unique codes: 239
    
    Number of sample locations: 245
    
    Number of features: 59
    
    Number of cities: 142
    
    Start date: 2015-11-23
    
    End date: 2022-10-06
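
The median above is taken over individual records (one row per sample and code), most of which have a quantity of zero. A sketch of the difference between a record-level and a sample-level median, on toy data with hypothetical values:

```python
import pandas as pd

# Toy records: two samples, most codes not found (hypothetical values).
records = pd.DataFrame({
    "loc_date": ["s1", "s1", "s1", "s2", "s2", "s2"],
    "code": ["G1", "G10", "G100"] * 2,
    "pcs_m": [0.0, 0.0, 0.3, 0.0, 0.0, 0.5],
})

# Median over individual records: dominated by the zero counts.
record_median = records["pcs_m"].median()

# Median over samples: sum pcs_m within each sample first.
sample_totals = records.groupby("loc_date")["pcs_m"].sum()
sample_median = sample_totals.median()

print(record_median, round(sample_median, 3))
```

Which of the two is appropriate depends on the question being asked; the sample-level value describes the total found per meter in a typical survey.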
    
    


### Summary aggregate

In [8]:
vitals_aggregate = collect_vitals(t_th)
print(make_a_summary(vitals_aggregate))


    Number of objects: 196842
    
    Median pieces/meter: 0.0
    
    Number of samples: 1352
    
    Number of unique codes: 239
    
    Number of sample locations: 245
    
    Number of features: 59
    
    Number of cities: 142
    
    Start date: 2015-11-23
    
    End date: 2022-10-06
    
    


### Summary MCBP

start_date = 2015-11-15

end_date = 2017-03-31

In [9]:
mcbp = test_this[(test_this["date"] >= "2015-11-15")&(test_this["date"] <= "2017-03-31")].copy()
mcbp.to_csv("data/end_process/mcbp.csv", index=False)
vitals_mcbp = collect_vitals(mcbp)
print(make_a_summary(vitals_mcbp, add_summary_name="MCBP"))


        Summary name = MCBP

        
    Number of objects: 35837
    
    Median pieces/meter: 0.0
    
    Number of samples: 94
    
    Number of unique codes: 235
    
    Number of sample locations: 21
    
    Number of features: 1
    
    Number of cities: 9
    
    Start date: 2015-11-23
    
    End date: 2017-03-20
    
    
        


### Summary SLR

start_date = 2017-04-01

end_date = 2020-03-31

In [10]:
slr = test_this[(test_this["date"] >= "2017-04-01")&(test_this["date"] <= "2020-03-31")].copy()
slr.to_csv("data/end_process/slr.csv", index=False)
vitals_slr = collect_vitals(slr)
print(make_a_summary(vitals_slr, add_summary_name="SLR"))


        Summary name = SLR

        
    Number of objects: 96851
    
    Median pieces/meter: 0.0
    
    Number of samples: 853
    
    Number of unique codes: 235
    
    Number of sample locations: 114
    
    Number of features: 38
    
    Number of cities: 79
    
    Start date: 2017-04-02
    
    End date: 2020-03-20
    
    
        


### Summary IQAASL

start_date = 2020-04-01

end_date = 2021-05-31


In [11]:
iqaasl = test_this[(test_this["date"] >= "2020-04-01")&(test_this["date"] <= "2021-05-31")].copy()
iqaasl.to_csv("data/end_process/iqaasl.csv", index=False)

the_rest = test_this[(test_this["date"] > "2021-05-31")]
the_rest.to_csv("data/end_process/after_may_2021.csv", index=False)
vitals_iqaasl = collect_vitals(iqaasl)
print(make_a_summary(vitals_iqaasl, add_summary_name="IQAASL"))


        Summary name = IQAASL

        
    Number of objects: 54773
    
    Median pieces/meter: 0.0
    
    Number of samples: 387
    
    Number of unique codes: 238
    
    Number of sample locations: 149
    
    Number of features: 34
    
    Number of cities: 83
    
    Start date: 2020-04-09
    
    End date: 2021-05-29
    
    
        


### Summary parent-boundary

In [12]:
parent_boundary = "aare"

aare = test_this[test_this.parent_boundary == parent_boundary].copy()
vitals_aare = collect_vitals(aare)
print(make_a_summary(vitals_aare, add_summary_name="Aare"))


        Summary name = Aare

        
    Number of objects: 33446
    
    Median pieces/meter: 0.0
    
    Number of samples: 363
    
    Number of unique codes: 235
    
    Number of sample locations: 70
    
    Number of features: 15
    
    Number of cities: 48
    
    Start date: 2017-04-02
    
    End date: 2021-04-23
    
    
        


## Location data

In [13]:
beaches = pd.read_csv("data/end_process/beaches.csv")
beaches.head()

Unnamed: 0,slug,latitude,longitude,country,feature_type,display_feature_name,city_slug,feature_name,city,parent_boundary
0,aabach,47.220989,8.940365,CH,l,Zürichsee,schmerikon,zurichsee,Schmerikon,linth
1,aare-limmatspitz,47.50106,8.237371,CH,r,Aare,gebenstorf,aare,Gebenstorf,aare
2,aare-port,47.11617,7.26955,CH,r,Nidau-Büren-Kanal,port,aarenidau-buren-kanal,Port,aare
3,aare-solothurn-lido-strand,47.196949,7.521643,CH,r,Aare,solothurn,aare,Solothurn,aare
4,aare_bern_caveltin,46.923579,7.473319,CH,r,Aare,muri-bei-bern,aare,Muri bei Bern,aare


### Summary by feature type

In [14]:
by_feature = t_th.merge(beaches[["slug", "feature_type"]], how="outer", on="slug", validate="many_to_one")
by_feature.groupby("feature_type").agg({"quantity":"sum", "pcs_m":"mean"})

feature_type,quantity,pcs_m
l,146980,0.023955
p,7776,0.120091
r,42086,0.006742
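
The merge above uses `how="outer"`, so a location without survey records, or a survey slug without a location record, would show up as a row with missing values. One way to check for this is the `indicator` parameter of `DataFrame.merge`. The sketch below uses toy frames with hypothetical values.

```python
import pandas as pd

# Toy survey records and location table (hypothetical values).
surveys = pd.DataFrame({
    "slug": ["aabach", "aabach", "aare-port", "unknown-beach"],
    "quantity": [3, 1, 2, 5],
})
locations = pd.DataFrame({
    "slug": ["aabach", "aare-port", "aare-limmatspitz"],
    "feature_type": ["l", "r", "r"],
})

checked = surveys.merge(
    locations, how="outer", on="slug", validate="many_to_one", indicator=True
)

# Survey slugs with no matching location record.
no_location = checked.loc[checked["_merge"] == "left_only", "slug"].unique()
# Locations that have no survey records.
no_surveys = checked.loc[checked["_merge"] == "right_only", "slug"].unique()

print(list(no_location), list(no_surveys))
```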


In [15]:
r_summary = by_feature[by_feature.feature_type == "r"].copy()
vitals_r = collect_vitals(r_summary)
print(make_a_summary(vitals_r, add_summary_name="Rivers"))


        Summary name = Rivers

        
    Number of objects: 42086
    
    Median pieces/meter: 0.0
    
    Number of samples: 575
    
    Number of unique codes: 238
    
    Number of sample locations: 99
    
    Number of features: 36
    
    Number of cities: 69
    
    Start date: 2017-04-02
    
    End date: 2021-05-06
    
    
        


In [16]:
l_summary = by_feature[by_feature.feature_type == "l"].copy()
vitals_l = collect_vitals(l_summary)
print(make_a_summary(vitals_l, add_summary_name="lake"))


        Summary name = lake

        
    Number of objects: 146980
    
    Median pieces/meter: 0.0
    
    Number of samples: 757
    
    Number of unique codes: 239
    
    Number of sample locations: 126
    
    Number of features: 16
    
    Number of cities: 67
    
    Start date: 2015-11-23
    
    End date: 2022-10-06
    
    
        


In [17]:
p_summary = by_feature[by_feature.feature_type == "p"].copy()
vitals_p = collect_vitals(p_summary)
print(make_a_summary(vitals_p, add_summary_name="parcs"))


        Summary name = parcs

        
    Number of objects: 7776
    
    Median pieces/meter: 0.0
    
    Number of samples: 20
    
    Number of unique codes: 231
    
    Number of sample locations: 20
    
    Number of features: 7
    
    Number of cities: 18
    
    Start date: 2021-04-24
    
    End date: 2021-08-28
    
    
        


In [18]:
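# lake + river totals, element by element
# the median and the number of unique codes cannot be summed in a meaningful way
t = np.array(vitals_l[:-2]) + np.array(vitals_r[:-2])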
list(t)

[189066.0, 0.0, 1332.0, 477.0, 225.0, 52.0, 136.0]

## Codes

In [19]:
dfCodes.head()

code,material,en,source,parent_code,single_use,groupname,fr,de
G708,Metal,Batons de ski,Usagers,G199,False,recreation,Batons de ski,Skistöcke
G712,Cloth,Gants de ski,Usagers,G135,False,recreation,Gants de ski,Skihandschuhe
G902,Cloth,"Mask medical, cloth",Personal hygiene,G145,False,personal items,"Masque médical, tissu réutilisable","Medizinische Masken, Stoff"
G917,Glass,Terracotta balls,Utility items,G210,False,unclassified,Boules de terre cuite,Blähton
G921,Glass,Ceramic tile and pieces,Construction,G204,False,infrastructure,Carreaux et pièces de céramique,Keramikfliesen und Bruchstücke


In [20]:
# language

dfCodes.loc["G79", "fr"]

'Plastiques fragmentés x > 25mm'

In [21]:
dfCodes.loc["G79", "de"]

'Objekte aus Kunststoff 2,5 - 50 cm'

In [22]:
dfCodes.rename(columns={"description":"en"}, inplace=True)

In [23]:
dfCodes.loc["G79", "en"]

'Plastic pieces 2.5cm - 50cm'

In [24]:
%watermark -a hammerdirt-analyst -co --iversions

Author: hammerdirt-analyst

conda environment: cantonal_report

numpy : 1.25.2
pandas: 2.0.3

