In [None]:
In this notebook, I clean and transform data on household waste from SSB (Statistisk sentralbyrå, Statistics Norway).
The data span 2015 to 2024 and are available for all municipalities in Norway.
Because our survey data, Norsk medborgerpanel, from which I extract the independent variables, record only the county each respondent resides in,
I use counties instead of municipalities as the entity in the panel data.
The measurement unit is tonnes.

The household waste dataset records total collected waste, which is the sum of residual waste (restavfall) and separated waste (utsortert avfall).
Separated waste is then broken down into 16 waste streams, including paper, glass, plastics, metals, electronics, food and wet organics, wood, garden waste, and hazardous waste.
Each stream is further divided by waste treatment method: recycling, incineration, landfilling, biogas production, composting, and other.
It is worth noting that not all waste streams are suited for recycling, for example food and wet organics, wood, and garden waste.
Using a combined recycling rate across all waste streams as a measure of how good each county is at recycling would therefore be inappropriate, because it depends heavily on the composition of the collected waste.
Another point to consider is that almost all of the paper, glass, plastics, and metals that are sorted out are recycled, which puts the recycling rates for these streams close to 100%.
This of course overstates recycling in reality, because a share of these materials is not sorted and ends up in residual waste, which is primarily incinerated in Norway.
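
To make the composition argument concrete, here is a minimal sketch with invented numbers. The two counties, the two streams, and all tonnages are hypothetical and are not taken from the SSB data. Both counties recycle the same share of their sorted paper, yet their combined rates differ only because the mix of streams differs.

```python
import pandas as pd

# Hypothetical tonnages for two made-up counties (illustration only).
streams = pd.DataFrame(
    {
        "county": ["A", "A", "B", "B"],
        "stream": ["paper", "food", "paper", "food"],
        "collected": [100.0, 300.0, 300.0, 100.0],
        "recycled": [95.0, 0.0, 285.0, 0.0],
    }
)

# Stream-level rate: identical for paper in both counties.
streams["rate"] = streams["recycled"] / streams["collected"]

# Combined rate per county: driven by how much of the total is paper.
totals = streams.groupby("county")[["collected", "recycled"]].sum()
combined_rate = totals["recycled"] / totals["collected"]
print(combined_rate)
```

In this toy case the county that collects mostly food waste looks much worse on the combined rate, even though it handles paper exactly as well as the other county.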

...



In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [26]:
#import xlsx data file 
df = pd.read_excel("ssb_avfall.xlsx", sheet_name="data")
df.head()

Unnamed: 0.1,Unnamed: 0,Unnamed: 1,Unnamed: 2,Unnamed: 3,2015,2016,2017,2018,2019,2020,2021,2022,2023,2024
0,EAK The whole country,0 In total,in total,In total,2288135.5,2276919.6,2254846.6,2240096.0,2274906.3,2418459,2335812,2123818,2079634,2122764
1,,,recycling,In total,527411.3,535704.4,547123.2,545616.8,561718.4,631840,634584,586187,570586,588369
2,,,incineration,In total,1322783.6,1322482.2,1282927.9,1245144.3,1233566.3,1278844,1225980,1095301,1050310,1050466
3,,,landfilling,In total,67891.0,64578.9,71617.5,88539.3,99448.7,126055,85100,72972,71263,74477
4,,,biogas production,In total,93538.3,104580.9,108028.7,120089.5,160785.8,157460,163792,157247,170210,183711


In [27]:
# name the first 4 columns of the dataframe
df.columns = ['region', 'waste_type', 'treatment_method', 'collection_method'] + list(df.columns[4:])

# collection_method distinguishes pickup from drop-off; this is not relevant for this research, so I drop the column.
df = df.drop(columns=['collection_method'])

df.head(20)

Unnamed: 0,region,waste_type,treatment_method,2015,2016,2017,2018,2019,2020,2021,2022,2023,2024
0,EAK The whole country,0 In total,in total,2288135.5,2276919.6,2254846.6,2240096,2274906.3,2418459,2335812,2123818,2079634,2122764
1,,,recycling,527411.3,535704.4,547123.2,545616.8,561718.4,631840,634584,586187,570586,588369
2,,,incineration,1322783.6,1322482.2,1282927.9,1245144.3,1233566.3,1278844,1225980,1095301,1050310,1050466
3,,,landfilling,67891,64578.9,71617.5,88539.3,99448.7,126055,85100,72972,71263,74477
4,,,biogas production,93538.3,104580.9,108028.7,120089.5,160785.8,157460,163792,157247,170210,183711
5,,,composting,245745,227495.4,224428.4,225738.6,203892.7,204786,211766,184764,200882,211233
6,,,other,30766.3,22077.8,20720.8,14967.4,15494.4,19474,14590,27347,16383,14508
7,,1 Residual waste,in total,985822.5,977103.8,943061.2,926351.3,901263.1,958339,918937,876707,850827,841004
8,,,recycling,.,.,.,.,.,.,.,.,.,.
9,,,incineration,958731.3,956287.9,916666.9,904845.2,879968.6,920672,898810,849309,828342,818122


In [29]:
# The NaN values in region and waste_type are not missing values but blank cells that mean "same as above", so I forward-fill them.
df['region'] = df['region'].ffill()
df['waste_type'] = df['waste_type'].ffill()

df.head(20)

Unnamed: 0,region,waste_type,treatment_method,2015,2016,2017,2018,2019,2020,2021,2022,2023,2024
0,EAK The whole country,0 In total,in total,2288135.5,2276919.6,2254846.6,2240096,2274906.3,2418459,2335812,2123818,2079634,2122764
1,EAK The whole country,0 In total,recycling,527411.3,535704.4,547123.2,545616.8,561718.4,631840,634584,586187,570586,588369
2,EAK The whole country,0 In total,incineration,1322783.6,1322482.2,1282927.9,1245144.3,1233566.3,1278844,1225980,1095301,1050310,1050466
3,EAK The whole country,0 In total,landfilling,67891,64578.9,71617.5,88539.3,99448.7,126055,85100,72972,71263,74477
4,EAK The whole country,0 In total,biogas production,93538.3,104580.9,108028.7,120089.5,160785.8,157460,163792,157247,170210,183711
5,EAK The whole country,0 In total,composting,245745,227495.4,224428.4,225738.6,203892.7,204786,211766,184764,200882,211233
6,EAK The whole country,0 In total,other,30766.3,22077.8,20720.8,14967.4,15494.4,19474,14590,27347,16383,14508
7,EAK The whole country,1 Residual waste,in total,985822.5,977103.8,943061.2,926351.3,901263.1,958339,918937,876707,850827,841004
8,EAK The whole country,1 Residual waste,recycling,.,.,.,.,.,.,.,.,.,.
9,EAK The whole country,1 Residual waste,incineration,958731.3,956287.9,916666.9,904845.2,879968.6,920672,898810,849309,828342,818122
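
As a small check of the forward fill above, here is a sketch on a made-up frame (the region names and numbers are invented). It assumes, as in the preview above, that every region block starts with its own `waste_type` label, so filling never carries a label across a region boundary. Only the label columns are filled; gaps in the year columns are left alone.

```python
import numpy as np
import pandas as pd

# Made-up frame mimicking the "blank means same as above" layout.
toy = pd.DataFrame(
    {
        "region": ["County X", np.nan, np.nan, "County Y", np.nan],
        "waste_type": ["0 In total", np.nan, "1 Residual waste", "0 In total", np.nan],
        "treatment_method": ["in total", "recycling", "in total", "in total", "recycling"],
        "2015": [10.0, 4.0, 6.0, 20.0, np.nan],
    }
)

# Forward-fill only the label columns, never the value columns.
label_cols = ["region", "waste_type"]
toy[label_cols] = toy[label_cols].ffill()
print(toy)
```

Restricting `ffill` to the label columns matters because a gap in a year column is a genuine gap, not a repeated value.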


In [None]:
# the dots in the year columns are a bit tricky:
# some dots mean the value is 0, and some mean the value is not applicable because of the county reform.
# check the unique values of region
df['region'].unique()

array(['EAK The whole country', 'EAKUO The whole country except Oslo',
       'EKA31 Østfold', 'EKA32 Akershus', 'EKA30 Viken (2020-2023)',
       'EKA01 Østfold (-2019)', 'EKA02 Akershus (-2019)', 'EKA03 Oslo',
       'EKA34 Innlandet', 'EKA04 Hedmark (-2019)',
       'EKA05 Oppland (-2019)', 'EKA33 Buskerud',
       'EKA06 Buskerud (-2019)', 'EKA39 Vestfold', 'EKA40 Telemark',
       'EKA38 Vestfold og Telemark (2020-2023)', 'EKA07 Vestfold (-2019)',
       'EKA08 Telemark (-2019)', 'EKA42 Agder',
       'EKA09 Aust-Agder (-2019)', 'EKA10 Vest-Agder (-2019)',
       'EKA11 Rogaland', 'EKA46 Vestland', 'EKA12 Hordaland (-2019)',
       'EKA14 Sogn og Fjordane (-2019)', 'EKA15 Møre og Romsdal',
       'EKA50 Trøndelag - Trööndelage', 'EKA16 Sør-Trøndelag (-2017)',
       'EKA17 Nord-Trøndelag (-2017)', 'EKA18 Nordland - Nordlánnda',
       'EKA55 Troms - Romsa - Tromssa',
       'EKA56 Finnmark - Finnmárku - Finmarkku',
       'EKA54 Troms og Finnmark Romsa ja Finnmárku (2020-2023)',
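
One possible way to separate the two meanings of the dots is sketched below on invented values. It relies on two assumptions that are mine, not SSB's: a dot inside the period printed in the region label, such as `(-2019)` or `(2020-2023)`, is read as 0, and a dot outside that period is read as not applicable. Labels without a period suffix cannot be classified this way, so their dots are left as NaN to be decided later. The helper `labelled_period` is hypothetical.

```python
import re
import pandas as pd

# Invented values; only the region labels follow the pattern seen above.
toy = pd.DataFrame(
    {
        "region": ["EKA01 Østfold (-2019)", "EKA30 Viken (2020-2023)", "EKA03 Oslo"],
        "2019": [0.5, ".", "."],
        "2020": [".", ".", 1.5],
    }
)
year_cols = ["2019", "2020"]


def labelled_period(label):
    """Return (start, end) from a '(-2019)' or '(2020-2023)' suffix, else None."""
    m = re.search(r"\((\d{4})?-(\d{4})\)\s*$", label)
    if m is None:
        return None
    start = int(m.group(1)) if m.group(1) else None
    return start, int(m.group(2))


# Remember where the dots were, then turn every dot into NaN.
was_dot = toy[year_cols].eq(".")
values = toy[year_cols].apply(pd.to_numeric, errors="coerce")

# Assumption: a dot inside the labelled period means 0.
for idx, label in toy["region"].items():
    period = labelled_period(label)
    if period is None:
        continue  # no suffix: leave dots as NaN (undecided)
    start, end = period
    for col in year_cols:
        year = int(col)
        inside = (start is None or year >= start) and year <= end
        if was_dot.at[idx, col] and inside:
            values.at[idx, col] = 0.0

toy[year_cols] = values
print(toy)
```

If the year columns are read from Excel as integers rather than strings, `year_cols` would need to list integers instead; `int(col)` works in both cases.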
 

In [None]:
# OK, I clearly need to combine some counties, but I need to handle the dots first in order to use fillna or combine_first

# select only the relevant waste types and treatment methods

# to start, I want to look at packaging waste, which is the sum of plastic, paper, glass, and metal packaging waste.
# I also want to look at food waste and residual waste. For these two types, I am interested in the total amount collected.
# I will keep total waste as well, just for reference.
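
Once the dots are NaN, merged counties could be combined roughly as sketched below. This is an alternative to `combine_first` and not necessarily the approach the notebook ends up using. The mapping `county_map` and all tonnages are hypothetical; only the three region labels come from the list above. The sketch maps old and new labels to one target county and sums within it, with `min_count=1` so that a year with no data at all stays NaN instead of becoming 0.

```python
import numpy as np
import pandas as pd

# Invented tonnages for one waste type and one treatment method.
toy = pd.DataFrame(
    {
        "region": [
            "EKA50 Trøndelag - Trööndelage",
            "EKA16 Sør-Trøndelag (-2017)",
            "EKA17 Nord-Trøndelag (-2017)",
        ],
        "waste_type": ["1 Residual waste"] * 3,
        "treatment_method": ["in total"] * 3,
        "2017": [np.nan, 60.0, 40.0],
        "2018": [95.0, np.nan, np.nan],
    }
)

# Hypothetical mapping from SSB region labels to the county used in the panel.
county_map = {
    "EKA50 Trøndelag - Trööndelage": "Trøndelag",
    "EKA16 Sør-Trøndelag (-2017)": "Trøndelag",
    "EKA17 Nord-Trøndelag (-2017)": "Trøndelag",
}
toy["county"] = toy["region"].map(county_map)

# Sum old and new regions into one row per county, waste type, and method.
combined = (
    toy.groupby(["county", "waste_type", "treatment_method"])[["2017", "2018"]]
    .sum(min_count=1)
    .reset_index()
)
print(combined)
```

Summing only works when old regions merge into a larger one. Periods where a merged region such as Viken has to be split back into its parts cannot be recovered this way and would need a separate decision.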