SDG Labelling Data Preparation
==============================

Cleaning of data scraped from [Partnerships for the SDGs](https://sustainabledevelopment.un.org/partnership/browse/) and [RELX Group SDG Resource Centre](https://sdgresources.relx.com/articles).

In [1]:
%load_ext line_profiler
%load_ext autoreload
%autoreload 2

In [160]:
import os
import re
import ast
import json
import string

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from collections import defaultdict, Counter
from datetime import datetime, date

from analysis.src.data.readnwrite import get_data_dir
from analysis.src.data.data_utilities import flatten, eval_column, grouper

pd.options.display.max_columns = 99

In [5]:
%matplotlib inline

# Paths
# Get the top path
data_path = get_data_dir()

# Create the path for external data
ext_data = os.path.join(data_path, 'external')
# Raw data
raw_data = os.path.join(data_path, 'raw')
# And processed data
proc_data = os.path.join(data_path, 'processed')
# And interim data
inter_data = os.path.join(data_path, 'interim')
# And figures
fig_path = os.path.join(data_path, 'figures')

# Get date for saving files
today = datetime.utcnow()

today_str = "_".join([str(x) for x in [today.year,today.month,today.day]])

## 1. Load Data

Each of the two scraped sites has its own raw dataset, so there are two files to load.

In [15]:
partnernship_df = pd.read_csv(os.path.join(raw_data, 'sdg_partnership_projects_scraped.csv'))
relx_df = pd.read_json(os.path.join(raw_data, 'sdg_relx_articles.json'))

In [11]:
partnernship_df.head(2)

Unnamed: 0,content,goals,project_number,project_url,timeframe,title
0,\n\n\nThe European PVC industry commits to add...,"['1', '3', '4', '5', '6', '7', '8', '9', '12',...",91,http://www.vinylplus.eu/,Time-frame: 2011-06-22 - 2020-12-31,VinylPlus
1,"\n\n\nMore than one billion people worldwide, ...",['7'],93,,Time-frame: 2012 - 2015-12-31,Min-E Access: Minimum Electricity Access


In [12]:
relx_df.head(2)

Unnamed: 0,article_url,authors,citation,content,publisher,sdg_goals,tags,title
0,https://www.sciencedirect.com/science/article/...,"Gerald G. Singh, Andrés M. Cisneros-Montemayor...","Marine Policy Volume 93, July 2018, Pages 223-231",Achieving the United Nations’ 17 Sustainable D...,Elsevier,[14],"[Oceans & Seas, Small Island Developing States]",A rapid assessment of co-benefits and trade-of...
1,https://www.sciencedirect.com/science/article/...,"Anthony Y.Ku, Johnathan Loudis, Steven J. Duclos",Sustainable Materials and Technologies Volume ...,As the technologies we use as a society have a...,Elsevier,[9],"[Chemicals and waste, Industry, Supply chain, ...",The impact of technological innovation on crit...


## 2. Cleaning

### 2.1 Partnership Data

#### Goals

In [151]:
goals_partner = eval_column(partnernship_df, 'goals')
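
`eval_column` is a project helper whose source is not shown in this notebook. The `goals` column is read from CSV as strings such as `"['7']"`, so the helper presumably turns each string back into a list. A minimal sketch of that idea, assuming it amounts to `ast.literal_eval` applied per row (the name `eval_column_sketch` is hypothetical):

```python
import ast

def eval_column_sketch(values):
    """Parse each stringified Python literal back into a real object.

    Hypothetical stand-in for the project's eval_column helper, whose
    actual implementation is not shown in the notebook.
    """
    return [ast.literal_eval(v) for v in values]

raw_goals = ["['1', '3', '4']", "['7']"]
parsed = eval_column_sketch(raw_goals)
```

`ast.literal_eval` only accepts literals (lists, strings, numbers and so on), which makes it a safer choice than `eval` for scraped text.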

In [152]:
goals_binary_partner = []
for gp in goals_partner:
    goals_binary = np.zeros(17).astype('int8')
    for i in gp:
        goals_binary[int(i) - 1] = 1
    goals_binary_partner.append(goals_binary)

ohe_goals_partner = pd.DataFrame(goals_binary_partner)
ohe_goals_partner.columns = ['goal_{}'.format(i + 1) for i in range(17)]

In [153]:
ohe_goals_partner.head(1)

Unnamed: 0,goal_1,goal_2,goal_3,goal_4,goal_5,goal_6,goal_7,goal_8,goal_9,goal_10,goal_11,goal_12,goal_13,goal_14,goal_15,goal_16,goal_17
0,1,0,1,1,1,1,1,1,1,0,0,1,1,0,0,0,1
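
The same one-hot loop is used for both datasets, so it could be factored into one function. The sketch below is plain Python rather than the notebook's NumPy version, and the names `goals_to_binary` and `N_GOALS` are illustrative. The partnership goal list is read back from the one-hot row shown above. The range check is an addition: without it a goal of `0` would index position `-1` and silently set `goal_17`.

```python
N_GOALS = 17  # the SDGs are numbered 1 to 17

def goals_to_binary(goals, n_goals=N_GOALS):
    """Return a 0/1 list with a 1 at position g - 1 for each goal g."""
    row = [0] * n_goals
    for g in goals:
        idx = int(g) - 1  # goals may arrive as strings or ints
        if not 0 <= idx < n_goals:
            raise ValueError("goal out of range: {!r}".format(g))
        row[idx] = 1
    return row

# Goals of the first partnership project (strings) and first RELX article (ints)
partner_row = goals_to_binary(['1', '3', '4', '5', '6', '7', '8', '9', '12', '13', '17'])
relx_row = goals_to_binary([14])
```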


#### Content

In [35]:
content_partnership = list(partnernship_df['content'].values)

In [50]:
content_partnership[0]

'\n\n\nThe European PVC industry commits to address five key challenges:- Work towards the more efficient management of PVC throughout its life cycle.- Help to ensure that persistent organic compounds do not accumulate in nature and that other emissions are reduced. - Review the use of PVC additives and move towards more sustainable additives systems.- Help to minimize climate impacts through reducing energy and raw material use, potentially endeavoring to switch to renewable sources and promoting sustainable innovation.- Continue to build sustainability awareness across the value chain and external stakeholders.\n \n\n \n\n\n\n\n\n\nIn line with the Agenda 21 chapter 30 Strengthening the role of business and industry, the VinylPlus Voluntary Commitment has been developed bottom up with an open process of stakeholder dialogue. Five key sustainable development challenges have been identified for PVC, based on The Natural Step System Conditions for a Sustainable Society. VinylPlus consid


The main cleaning needed here is replacing newlines, tabs and carriage returns with spaces, then collapsing the repeated spaces left behind.

In [55]:
content_partnership = [cp.replace('\n', ' ').replace('\t', ' ').replace('\r', ' ') for cp in content_partnership]
content_partnership = [re.sub(' +', ' ', cp).strip() for cp in content_partnership]
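
The two passes can also be written as a single regular expression. This is a sketch of an alternative, not the notebook's method, and `normalise_whitespace` is an illustrative name. Note one difference: for `str` patterns `\s` matches any Unicode whitespace (including `\xa0`), whereas the cell above only touches `\n`, `\t`, `\r` and spaces.

```python
import re

def normalise_whitespace(text):
    """Collapse every run of whitespace into one space and trim the ends."""
    return re.sub(r'\s+', ' ', text).strip()

sample = '\n\n\nThe European PVC industry commits\t to address\r\n five key challenges:'
cleaned = normalise_whitespace(sample)
```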

#### Date

In [66]:
tf = partnernship_df['timeframe'].values[0]

In [138]:
# This doesn't work for every row...
# I stopped trying when I found the date '1/2///2/0/1/9'...

def parse_timeframe(tf):
    # Expected form: 'Time-frame: <start> - <end>'
    start, end = tf.split(' - ')
    start = start.split(': ')[1]

    # Make sure a space follows each comma
    start = start.replace(',', ', ')
    end = end.replace(',', ', ')

    # Expand two-digit years in slash-separated dates (assumes the 2000s)
    if '/' in start:
        start = start.split('/')
        if len(start[-1]) == 2:
            start[-1] = '20' + start[-1]
        start = '/'.join(start)
    if '/' in end:
        end = end.split('/')
        if len(end[-1]) == 2:
            end[-1] = '20' + end[-1]
        end = '/'.join(end)

    # Open-ended projects get a placeholder end date of 2030-01-01
    if end in ('ongoing', '-'):
        end = date(year=2030, month=1, day=1)
    else:
        end = pd.to_datetime(end).date()
    start = pd.to_datetime(start).date()
    return start, end
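
Since some timeframes are malformed, one option is a parser that returns `None` for anything it cannot read instead of raising. The sketch below uses only the standard library and handles just the two formats visible in the sample rows (`YYYY-MM-DD` and a bare year). The names `parse_date` and `parse_timeframe_safe` are hypothetical, and slash-separated dates are deliberately left unparsed because their day/month order is not known from the data shown.

```python
from datetime import date, datetime

# Formats tried in order; only those seen in the sample rows are included.
_FORMATS = ("%Y-%m-%d", "%Y")

def parse_date(text):
    """Return a date for a recognised format, otherwise None."""
    text = text.strip()
    for fmt in _FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    return None

def parse_timeframe_safe(tf):
    """Split 'Time-frame: <start> - <end>' and parse each side leniently."""
    body = tf.split(": ", 1)[-1]
    start_text, _, end_text = body.partition(" - ")
    return parse_date(start_text), parse_date(end_text)

full = parse_timeframe_safe("Time-frame: 2011-06-22 - 2020-12-31")
year_only = parse_timeframe_safe("Time-frame: 2012 - 2015-12-31")
broken = parse_timeframe_safe("Time-frame: 2015-01-01 - 1/2///2/0/1/9")
```

A bare year resolves to 1 January of that year, and rows with a `None` on either side can be counted or inspected before deciding how to treat them.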

### 2.2 RELX Data

#### Goals

In [142]:
goals_relx = relx_df['sdg_goals'].values

In [149]:
goals_binary_relx = []
for gp in goals_relx:
    goals_binary = np.zeros(17).astype('int8')
    for i in gp:
        goals_binary[int(i) - 1] = 1
    goals_binary_relx.append(goals_binary)

ohe_goals_relx = pd.DataFrame(goals_binary_relx)
ohe_goals_relx.columns = ['goal_{}'.format(i + 1) for i in range(17)]

In [150]:
ohe_goals_relx.head(1)

Unnamed: 0,goal_1,goal_2,goal_3,goal_4,goal_5,goal_6,goal_7,goal_8,goal_9,goal_10,goal_11,goal_12,goal_13,goal_14,goal_15,goal_16,goal_17
0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0


#### Content

In [172]:
content_relx = relx_df['content'].values
content_relx[0]

"Achieving the United Nations’ 17\xa0Sustainable Development\xa0Goals (SDGs) results in many ecological, social, and economic consequences that are inter-related. Understanding relationships between sustainability goals and determining their interactions can help prioritize effective and efficient policy options. This paper presents a framework that integrates existing knowledge from literature and expert opinions to rapidly assess the relationships between one SDG goal and another. Specifically, given the important role of the oceans in the world's social-ecological systems, this study focuses on how SDG 14 (Life Below Water), and the targets within that goal, contributes to other SDG goals. This framework differentiates relationships based on compatibility (co-benefit, trade-off, neutral), the optional nature of achieving one goal in attaining another, and whether these relationships are context dependent. The results from applying this framework indicate that oceans SDG targets are 

There are a fair number of non-ASCII characters here, such as the non-breaking space `\xa0` and the curly apostrophe. Let's replace them with spaces.

In [173]:
content_relx = [re.sub(r'[^\x00-\x7f]',r' ', cr) for cr in content_relx]
content_relx = [re.sub(' +', ' ', cr).strip() for cr in content_relx]
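
Replacing every non-ASCII character with a space also breaks accented words apart (a name like "Andrés" becomes "Andr s"). If that matters downstream, a possible alternative is to decompose characters first so that base letters survive. This is a sketch under that assumption, with `to_ascii` as an illustrative name. It is not what the cell above does.

```python
import re
import unicodedata

def to_ascii(text):
    """Keep base letters of accented characters, drop what has no ASCII form."""
    # NFKD splits accented letters into letter + combining mark and maps
    # compatibility characters such as the non-breaking space to a plain space.
    decomposed = unicodedata.normalize('NFKD', text)
    ascii_only = decomposed.encode('ascii', 'ignore').decode('ascii')
    return re.sub(r' +', ' ', ascii_only).strip()

example = to_ascii('Andr\u00e9s\u00a0M. Cisneros-Montemayor')
```

The trade-off is that characters with no decomposition, such as curly quotes or dashes, are removed outright rather than replaced by a space, so two words joined only by such a character would be merged.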

## 3. Joining and Exporting

In [199]:
partner_clean_df = pd.DataFrame({'content': content_partnership,
                                 'source': 'un_sdg_partnerships'})
relx_clean_df = pd.DataFrame({'content': content_relx,
                              'source': 'relx'})

partner_clean_df = partner_clean_df.join(ohe_goals_partner)
relx_clean_df = relx_clean_df.join(ohe_goals_relx)

In [204]:
# Both frames are indexed from 0, so reset the index to avoid duplicate labels
clean_df = pd.concat([partner_clean_df, relx_clean_df], ignore_index=True)

In [226]:
print("Number of projects:", len(clean_df))

Number of projects: 2228


In [227]:
print("Number of projects for each goal:")
for c in clean_df.columns:
    if 'goal_' in c:
        print('{:7} {:>5}'.format(c, sum(clean_df[c])))

Number of projects for each goal:
goal_1    192
goal_2    178
goal_3    250
goal_4    514
goal_5    345
goal_6    142
goal_7    356
goal_8    530
goal_9    132
goal_10   104
goal_11   182
goal_12   176
goal_13   275
goal_14   186
goal_15   138
goal_16   133
goal_17   353


In [229]:
clean_df.to_csv(os.path.join(inter_data, 'sdg_projects_and_goals.csv'), index=False)