# Annual Energy Savings from Recycled Materials in Singapore

## Project Goals
The goal of this project is to analyze the total garbage collection and recycling rate in Singapore, and to determine the amount of energy saved from recycling.

In this analysis, we will answer questions such as:
1. How much energy was saved per year? In which year was this amount the highest? The lowest? 
2. What is the trend in energy savings from recycling in Singapore from 2003 to 2022?
3. What was the greatest source of energy savings from recycling in 2022, and how has this changed over time?

For more information about how recycling can save energy, please refer here: https://greentumble.com/how-does-recycling-save-energy

## Data
- Recycled energy data for 2003 to 2016 is a CSV file taken from the reference for this project, [kingabzpro](https://github.com/kingabzpro/Annual-Recycled-Energy-Saved-in-Singapore/tree/main/Data)
- Recycled energy data for 2017 to 2021 is taken from the Waste and Recycling Statistics [document](https://www.nea.gov.sg/docs/default-source/default-document-library/waste-and-recycling-statistics-2017-to-2021.pdf) on the NEA website. The data has been extracted to an Excel file.
- Recycled energy data for 2022 is taken from the [Waste Statistics and Overall Recycling NEA webpage](https://www.nea.gov.sg/our-services/waste-management/waste-statistics-and-overall-recycling)

**Data Dictionary**

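|Variable|Description|
|-----|-----|
|waste_type|Type of waste material|
|total_waste_generated_tonne|Total waste generated, in tonnes|
|total_waste_recycled_tonne|Total waste recycled, in tonnes|
|recycling_rate|Share of generated waste that was recycled, as a fraction between 0 and 1|
|waste_disposed_of_tonne|Total waste disposed of, in tonnes|
|year|Year of the record|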

## Table of Contents
1. Data Acquisition
2. Data Cleaning and Pre-processing
3. Data Exploration and Visualization
4. Conclusions

***

## 1. Data Acquisition

#### Import Libraries

In [345]:
import pandas as pd
import re
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

import requests
from bs4 import BeautifulSoup

import sqlite3

#### 2003-2016: Import Data from `.csv`

In [346]:
# 2003 - 2016
df_03to16 = pd.read_csv('data/waste-and-recycling-statistics-2003-to-2016.csv')

In [347]:
df_03to16.head()

Unnamed: 0,waste_type,waste_disposed_of_tonne,total_waste_recycled_tonne,total_waste_generated_tonne,recycling_rate,year
0,Food,679900,111100.0,791000,0.14,2016
1,Paper/Cardboard,576000,607100.0,1183100,0.51,2016
2,Plastics,762700,59500.0,822200,0.07,2016
3,C&D,9700,1585700.0,1595400,0.99,2016
4,Horticultural waste,111500,209000.0,320500,0.65,2016


#### 2017-2021: Import Data from `.xlsx`

In [348]:
# 2017-2021
sheets = ['2017', '2018', '2019', '2020', '2021']

df_17to21_list = []
for sheet in sheets:
    df = pd.read_excel('data/waste-and-recycling-statistics-2017-to-2021.xlsx', sheet_name=sheet)
    df = df.rename(columns=df.iloc[0]).loc[1:]
    df['year'] = sheet
    df_17to21_list.append(df)
    
df_17to21 = pd.concat(df_17to21_list, axis=0)

In [349]:
df_17to21.head()

Unnamed: 0,Waste Type,Total Generated\n('000 tonnes),Total Recycled\n('000 tonnes),Recycling Rate,Total Disposed\n('000 tonnes),year
1,C&D,1609,1600,99%,9,2017
2,Ferrous metal,1379,1371,99%,8,2017
3,Paper/Cardboard,1145,569,50%,576,2017
4,Plastics,815,52,6%,763,2017
5,Food,810,133,16%,677,2017
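
The header promotion used in the loop above can be sketched on a small in-memory frame. This is a minimal illustration, not the notebook's exact code: `promote_first_row` is a hypothetical helper, the column labels are shortened, and only two rows from the 2017 output are reused.

```python
import pandas as pd

# Hypothetical stand-in for one worksheet whose real header sits in the first data row
raw = pd.DataFrame([
    ["Waste Type", "Total Recycled", "Recycling Rate"],
    ["C&D", 1600, "99%"],
    ["Plastics", 52, "6%"],
])

def promote_first_row(frame: pd.DataFrame, year: int) -> pd.DataFrame:
    """Use the first row as column names, drop it from the data, and tag the year."""
    out = frame.iloc[1:].copy().reset_index(drop=True)
    out.columns = frame.iloc[0].tolist()
    out["year"] = year  # stored as an integer so it matches the 2003-2016 data
    return out

sheet_2017 = promote_first_row(raw, 2017)
print(sheet_2017)
```

Storing the year as an integer here avoids mixing strings and integers once the yearly frames are concatenated.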


#### 2022: Scrape Data with BeautifulSoup

In [6]:
# 2022
url = 'https://www.nea.gov.sg/our-services/waste-management/waste-statistics-and-overall-recycling'
response = requests.get(url, timeout=30)
response.raise_for_status()  # stop here on a non-200 response instead of failing later

soup = BeautifulSoup(response.content, "html.parser")
table = soup.find('table')
# one list of cell strings per table row (lists, not generator objects)
data = [[cell.get_text(strip=True) for cell in row.find_all('td')] for row in table.find_all('tr')]

df_22 = pd.DataFrame(data)

ConnectionError: HTTPSConnectionPool(host='www.nea.gov.sg', port=443): Max retries exceeded with url: /our-services/waste-management/waste-statistics-and-overall-recycling (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x0000029A1B30F670>: Failed to establish a new connection: [Errno 11001] getaddrinfo failed'))

In [None]:
df_22
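
Because the live request failed, the row and cell extraction can be tried offline. The sketch below swaps BeautifulSoup for the standard library's `html.parser` so it runs without network access or extra packages. `SAMPLE_HTML` and its values are made up for illustration and do not reflect the structure or content of the NEA page.

```python
from html.parser import HTMLParser

# Made-up table; the real NEA page layout is not known here
SAMPLE_HTML = """
<table>
  <tr><th>Waste Type</th><th>Recycling Rate</th></tr>
  <tr><td>Material A</td><td>50%</td></tr>
  <tr><td>Material B</td><td>6%</td></tr>
</table>
"""

class TableRows(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

parser = TableRows()
parser.feed(SAMPLE_HTML)
header, *records = parser.rows
print(header)
print(records)
```

The resulting `records` list has the same shape (one list of strings per row) that `pd.DataFrame` expects.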

***

## 2. Data Cleaning and Pre-processing

### Cleaning `df_03to16`

In [350]:
df_03to16.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 225 entries, 0 to 224
Data columns (total 6 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   waste_type                   225 non-null    object 
 1   waste_disposed_of_tonne      225 non-null    int64  
 2   total_waste_recycled_tonne   225 non-null    float64
 3   total_waste_generated_tonne  225 non-null    int64  
 4   recycling_rate               225 non-null    float64
 5   year                         225 non-null    int64  
dtypes: float64(2), int64(3), object(1)
memory usage: 10.7+ KB


In [351]:
# change data types waste_disposed_of_tonne,total_waste_generated_tonne to float
dtype= {'waste_disposed_of_tonne': 'float64', 
        'total_waste_generated_tonne': 'float64'}

df_03to16 = df_03to16.astype(dtype)

In [352]:
# check missing values
df_03to16.isna().sum()

waste_type                     0
waste_disposed_of_tonne        0
total_waste_recycled_tonne     0
total_waste_generated_tonne    0
recycling_rate                 0
year                           0
dtype: int64

In [353]:
# reorder columns
df_03to16 = df_03to16[['waste_type',
                       'total_waste_generated_tonne',
                       'total_waste_recycled_tonne',
                       'recycling_rate',
                       'waste_disposed_of_tonne',
                       'year']]

In [354]:
# check update
df_03to16.reset_index(drop=True).head(1)

Unnamed: 0,waste_type,total_waste_generated_tonne,total_waste_recycled_tonne,recycling_rate,waste_disposed_of_tonne,year
0,Food,791000.0,111100.0,0.14,679900.0,2016


### Cleaning `df_17to21`

In [355]:
df_17to21.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 75 entries, 1 to 15
Data columns (total 6 columns):
 #   Column                         Non-Null Count  Dtype 
---  ------                         --------------  ----- 
 0   Waste Type                     75 non-null     object
 1   Total Generated
('000 tonnes)  75 non-null     object
 2   Total Recycled
('000 tonnes)   75 non-null     object
 3   Recycling Rate                 75 non-null     object
 4   Total Disposed
('000 tonnes)   75 non-null     object
 5   year                           75 non-null     object
dtypes: object(6)
memory usage: 4.1+ KB


In [356]:
# rename columns by position to match df_03to16 (both frames share the same column order)
df_17to21.columns = df_03to16.columns.tolist()

In [357]:
df_17to21.head(1)

Unnamed: 0,waste_type,total_waste_generated_tonne,total_waste_recycled_tonne,recycling_rate,waste_disposed_of_tonne,year
1,C&D,1609,1600,99%,9,2017


In [358]:
# remove thousands separators and percent signs only, so any decimal points survive
cols = ['total_waste_generated_tonne','total_waste_recycled_tonne','recycling_rate','waste_disposed_of_tonne']
df_17to21[cols] = df_17to21[cols].replace(r'[,%]', '', regex=True)

In [359]:
# update data types (year is cast to int so it matches df_03to16, where it is int64)
dtype = {'total_waste_generated_tonne':'float64', 'total_waste_recycled_tonne':'float64', 'waste_disposed_of_tonne':'float64',
        'recycling_rate':'float64', 'year':'int64'}
df_17to21 = df_17to21.astype(dtype)

In [360]:
cols = ['total_waste_generated_tonne','total_waste_recycled_tonne','waste_disposed_of_tonne']
df_17to21[cols] = df_17to21[cols] * 1000
df_17to21['recycling_rate'] = df_17to21['recycling_rate'] / 100

In [361]:
df_17to21.reset_index(drop=True).head(1)

Unnamed: 0,waste_type,total_waste_generated_tonne,total_waste_recycled_tonne,recycling_rate,waste_disposed_of_tonne,year
0,C&D,1609000.0,1600000.0,0.99,9000.0,2017
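
The cleaning steps for the 2017-2021 data (strip separators, cast to float, rescale) can be checked on a tiny frame. This is a sketch under the assumption that the raw cells arrive as strings such as `'1,609'` and `'99%'`; the exact string forms in the Excel file may differ.

```python
import pandas as pd

# Assumed raw string forms; figures reuse the 2017 C&D and Plastics rows
raw = pd.DataFrame({
    "total_waste_generated_tonne": ["1,609", "815"],
    "recycling_rate": ["99%", "6%"],
})

# remove thousands separators and percent signs, then convert to numbers
clean = raw.replace(r"[,%]", "", regex=True).astype("float64")

# '000 tonnes -> tonnes, and percent -> fraction
clean["total_waste_generated_tonne"] = clean["total_waste_generated_tonne"] * 1000
clean["recycling_rate"] = clean["recycling_rate"] / 100

print(clean)
```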


### Cleaning `df_22`

### Putting it all together

In [362]:
df0 = pd.concat([df_03to16, df_17to21], ignore_index=True)

In [363]:
# strip punctuation, split run-together capitalised words, collapse whitespace, then lowercase
df0['waste_type'] = df0['waste_type'].str.replace(r'[^A-Za-z0-9\s]+', '', regex=True) \
                                     .apply(lambda x: ' '.join((' '.join(re.findall('[a-zA-Z][^A-Z]*', x))).split())) \
                                     .str.lower()
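
The label normalization above is easier to follow as a standalone function. The sketch below applies the same two regular expressions to plain strings; `normalize_waste_type` is a hypothetical helper name, and the sample labels come from the outputs shown earlier.

```python
import re

def normalize_waste_type(label: str) -> str:
    """Strip punctuation, split at capital letters, collapse whitespace, lowercase."""
    no_punct = re.sub(r"[^A-Za-z0-9\s]+", "", label)
    # each piece starts at a letter and runs until the next capital letter
    pieces = re.findall(r"[a-zA-Z][^A-Z]*", no_punct)
    return " ".join(" ".join(pieces).split()).lower()

for label in ["Paper/Cardboard", "C&D", "Ferrous metal"]:
    print(label, "->", normalize_waste_type(label))
```

Note that `C&D` becomes `c d`, which is why a lone `c` token appears after stopword removal.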


In [451]:
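# needs the NLTK tokenizer, stopword and WordNet data; fetch them once with nltk.download(...) if missing
wnl = WordNetLemmatizer()
stop = set(stopwords.words('english'))

# tokenize each label, drop English stopwords, and lemmatize the remaining words as nouns
df0['token'] = df0['waste_type'].apply(word_tokenize) \
                                .apply(lambda row: [str(wnl.lemmatize(word,pos='n')) for word in row if word not in stop]) 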

In [462]:
mats = df0['token'].value_counts().index.tolist()
mats

[['textile', 'leather'],
 ['paper', 'cardboard'],
 ['scrap', 'tyre'],
 ['used', 'slag'],
 ['nonferrous', 'metal'],
 ['ferrous', 'metal'],
 ['plastic'],
 ['glass'],
 ['horticultural', 'waste'],
 ['total'],
 ['others', 'stone', 'ceramic', 'rubber', 'etc'],
 ['construction', 'debris'],
 ['food', 'waste'],
 ['sludge'],
 ['wood', 'timber'],
 ['food'],
 ['wood'],
 ['ash', 'sludge'],
 ['c'],
 ['overall'],
 ['horticultural'],
 ['others', 'stone', 'ceramic', 'etc'],
 ['construction', 'demolition', 'c'],
 ['others']]
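
Several token lists above look like naming variants of the same material (for example `['food']` and `['food', 'waste']`). One possible way to consolidate them is a lookup table keyed on the token tuple. The groupings in `CANONICAL` below are assumptions made for illustration, not decisions taken in the source, and should be reviewed against the NEA category definitions.

```python
# Assumed groupings (hypothetical); unmapped token lists keep their own joined name
CANONICAL = {
    ("food", "waste"): "food",
    ("food",): "food",
    ("wood", "timber"): "wood",
    ("wood",): "wood",
    ("horticultural", "waste"): "horticultural",
    ("horticultural",): "horticultural",
    ("construction", "debris"): "construction and demolition",
    ("construction", "demolition", "c"): "construction and demolition",
    ("c",): "construction and demolition",
    ("total",): "total",
    ("overall",): "total",
}

def canonical_material(tokens):
    """Map a token list to one canonical label, falling back to the joined tokens."""
    key = tuple(tokens)
    return CANONICAL.get(key, " ".join(key))

print(canonical_material(["food", "waste"]))
print(canonical_material(["c"]))
print(canonical_material(["glass"]))
```

With the notebook's frame this could be applied as `df0['token'].apply(canonical_material)`.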

In [481]:
df1_list = []
for mat in mats:
    # copy the filtered rows so the new column is not written to a view of df0
    temp_df = df0[df0['token'].map(tuple) == tuple(mat)].copy()
    
    temp_df['word'] = temp_df["token"].apply(lambda x: list(set(mat).intersection(x)))
    #df["query_match"] = df["word"].apply(lambda x: 'True' if x else 'False')
    df1_list.append(temp_df)


In [491]:
df1_list[22]

Unnamed: 0,waste_type,total_waste_generated_tonne,total_waste_recycled_tonne,recycling_rate,waste_disposed_of_tonne,year,token,word,query_match
273,construction demolition c d,825000.0,822000.0,0.99,3000.0,2020,"[construction, demolition, c]","[construction, demolition, c]",
288,construction demolition c d,1013000.0,1011000.0,0.99,2000.0,2021,"[construction, demolition, c]","[construction, demolition, c]",


In [None]:
# 1 column to check the math of recycling rate?
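
A check column for the recycling rate could look like the sketch below. It recomputes the rate as recycled divided by generated and flags rows that differ from the reported value by more than a tolerance. The column names `computed_rate` and `rate_matches` and the tolerance of 0.01 are assumptions; the three rows are the 2016 figures shown earlier.

```python
import pandas as pd

sample = pd.DataFrame({
    "waste_type": ["Food", "Paper/Cardboard", "Plastics"],
    "total_waste_generated_tonne": [791000.0, 1183100.0, 822200.0],
    "total_waste_recycled_tonne": [111100.0, 607100.0, 59500.0],
    "recycling_rate": [0.14, 0.51, 0.07],
})

# recompute the rate from the tonnage columns
sample["computed_rate"] = (
    sample["total_waste_recycled_tonne"] / sample["total_waste_generated_tonne"]
)

# the reported rate is rounded to two decimals, so allow a small tolerance (assumed value)
TOLERANCE = 0.01
sample["rate_matches"] = (
    (sample["computed_rate"] - sample["recycling_rate"]).abs() <= TOLERANCE
)

print(sample[["waste_type", "recycling_rate", "computed_rate", "rate_matches"]])
```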

***

## 3. Data Exploration and Visualization

In [None]:
# create new database
conn=sqlite3.connect('mydb.db')

In [None]:
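# use pandas `.to_sql` to create a table 'recycling' from the combined dataframe df0
# (df0 is assumed to be the intended source; `df` only holds the last Excel sheet)
# the list-valued 'token' column is dropped because SQLite cannot store Python lists
df0.drop(columns=['token'], errors='ignore').to_sql(name='recycling', con=conn, if_exists='replace', index=False)
conn.commit()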

In [None]:
# connect to database
%load_ext sql
%sql sqlite:///mydb.db

In [None]:
# start querying!

In [None]:
%%sql
-- recycling rate of individual waste types per year
In [None]:
%%sql
-- total energy saved per year
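
Total energy saved per year needs a per-material conversion factor (energy saved per tonne recycled), which this notebook does not yet define. The sketch below shows one way to structure the calculation by joining a hypothetical `energy_factor` table. The factor values are placeholders chosen only to keep the arithmetic easy to follow; they are not real energy figures and must be replaced with sourced values.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE recycling (waste_type TEXT, total_waste_recycled_tonne REAL, year INTEGER)"
)
conn.execute("CREATE TABLE energy_factor (waste_type TEXT, kwh_per_tonne REAL)")

# recycled tonnage taken from the 2016 and 2017 outputs above
conn.executemany(
    "INSERT INTO recycling VALUES (?, ?, ?)",
    [
        ("plastics", 59500.0, 2016),
        ("paper cardboard", 607100.0, 2016),
        ("plastics", 52000.0, 2017),
        ("paper cardboard", 569000.0, 2017),
    ],
)
# PLACEHOLDER factors (hypothetical, not real data)
conn.executemany(
    "INSERT INTO energy_factor VALUES (?, ?)",
    [("plastics", 2.0), ("paper cardboard", 1.0)],
)

# energy saved = recycled tonnes x factor, summed over materials for each year
query = """
SELECT r.year,
       SUM(r.total_waste_recycled_tonne * f.kwh_per_tonne) AS energy_saved_kwh
FROM recycling AS r
JOIN energy_factor AS f ON f.waste_type = r.waste_type
GROUP BY r.year
ORDER BY r.year
"""
energy_by_year = conn.execute(query).fetchall()
for year, energy in energy_by_year:
    print(year, energy)
conn.close()
```

An inner join is used so that materials without a factor are left out of the total rather than counted as zero.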

***

## 4. Conclusions

***