# Scraping UN PRI Transparency Reports

The goal of this book is to get the Transparency Reports of ~1,800 companies from https://www.unpri.org/signatories/transparency-reports-2019/4506.article?adredir=1&adredir=1, which are kept in several different pages for each company, into a single dataframe. There is no API so we must web scrape.

In [2]:
from bs4 import BeautifulSoup
import requests
from requests.adapters import HTTPAdapter
from requests.packages.urllib3.util.retry import Retry
import re
import time
import pandas as pd
import numpy as np
from collections import Counter 

In [3]:
pd.set_option('display.max_columns', 200)

Entities and associated urls copied and pasted from html of https://www.unpri.org/signatories/transparency-reports-2019/4506.article?adredir=1&adredir=1

In [66]:
entities = pd.read_csv('./datasets/PRI_urls.csv')

UN PRI doesn't like one IP address visiting too many sites within a short period. It blocks scrapes somewhere between 300 - 600 entities, so I've split the 1,800+ entities into six groups of 300. 

The entity id that UN PRI uses is used as the url page strucuture so keeping a list of the entity name and its 'id' is helpful for webscraping.

In [68]:
entityid_list = entities['id'].tolist()
entityid_shard1 = entityid_list[0:300]
entityid_shard2 = entityid_list[300:600]
entityid_shard3 = entityid_list[600:900]
entityid_shard4 = entityid_list[900:1200]
entityid_shard5 = entityid_list[1200:1500]
entityid_shard6 = entityid_list[1500:-1]

In [69]:
entityname_list = entities['name'].tolist()
entityname_shard1 = entityname_list[0:300]
entityname_shard2 = entityname_list[300:600]
entityname_shard3 = entityname_list[600:900]
entityname_shard4 = entityname_list[900:1200]
entityname_shard5 = entityname_list[1200:1500]
entityname_shard6 = entityname_list[1500:-1]

In [70]:
entitydict_shard1 = dict(zip(entityname_shard1, entityid_shard1))
entitydict_shard2 = dict(zip(entityname_shard2, entityid_shard2))
entitydict_shard3 = dict(zip(entityname_shard3, entityid_shard3))
entitydict_shard4 = dict(zip(entityname_shard4, entityid_shard4))
entitydict_shard5 = dict(zip(entityname_shard5, entityid_shard5))
entitydict_shard6 = dict(zip(entityname_shard6, entityid_shard6))

We're writing to a dictionary because we'll get non-normal responses for each entity based on this survey's structure. We will make a list of dictionaries to bypass the IP address defense.

In [71]:
list_of_dicts = []

for i in range(1,7):
    list_of_dicts.append(eval("entitydict_shard" + str(i)))

I've decided to scrape seven report pages for each company. These reports are signified by a report suffix.

In [72]:
page_suffixes = ['/79894dbc337a40828d895f9402aa63de/html/2/?lang=en&a=1',
                 '/d0cc681dfa4d45dca3d70f04bc27d284/html/2/?lang=en&a=1',
                 '/bf735de92be04caa8c32fcbc25cbdd2c/html/2/?lang=en&a=1',
                 '/b8be094467a0406ead601634b02a60c6/html/2/?lang=en&a=1',
                 '/57749b1a39a14fe6942aabb90698b3c1/html/2/?lang=en&a=1',
                 '/8f2ede8902574ce5afc919af9e05c4e0/html/2/?lang=en&a=1',
                 '/b2a82182cc14473b90b72f6bb504fae0/html/2/?lang=en&a=1']

In [73]:
page_prefix = 'https://reporting.unpri.org/surveys/PRI-reporting-framework-2019/'

Web scraping function attempts to collect questions and responses, with a few features: 

Retry and Adapter are intra-loop defense mechanisms against UN PRI blocking my IP address. This dynamically backs off for a specific amount of time if there's an error and tries to reconnect 3 times thereafter.

Questions are containted in 'question-blocks' which have various response outputs: checked box, checked radio, text, url, and others. By converting any response type that's "unchecked" to zeroes, we simplify the responses to affirmative and negative.

Because the response form is dynamic based on the respondent's responses to other questions, the output must be saved to a dictionary, which doesn't require a fixed size.

In [74]:
def question_grabber(url):
    question_list = []
    response_list = []
    retry = Retry(connect=3, backoff_factor=0.5)
    adapter = HTTPAdapter(max_retries=retry)
    session.mount('http://', adapter)
    session.mount('https://', adapter)
    html = session.get(url)
    html.text
    soup = BeautifulSoup(html.text, 'html.parser')
    blocks = soup.findAll(class_='question-block')
    for i in blocks:
        if 'unchecked' in str(i):
            response_list.append(0)
        elif ''
        elif 'response text_TC' in str(i):
            response_list.append(1)
        elif 'response url' in str(i):
            response_list.append(1)
        else:
            response_list.append(1)
    for title in blocks:
        question_list.append(title.text.strip())
    return dict(zip(question_list, response_list))

I used three levels of loops. The first is the 6 "shards", then the 1,800+ companies, and then their six pages. In between shards, the loop sleeps for 60 seconds to fend off the UN PRI defesne. In the company loop, all responses are saved to a dictionary with the format of question ('title') to response. Then the entity dictionary containing entity names as keys will be updated with new values of this embedded new dictionary of question to response. 

#### CAUTION: This loop takes several hours to run (ran it overnight)!

It must scrape over 10,000 pages!

In [76]:
for i in range(len(list_of_dicts)):
    for j in range(len(eval('entityid_shard' + str(i+1)))):
        big_responses = {}
        session = requests.Session()
        for k in range(len(page_suffixes)):
            page_contents = question_grabber(page_prefix + eval('entityid_shard' + str(i+1))[j] + page_suffixes[k])
            big_responses.update(page_contents)
            value_to_update = big_responses
        key_to_update = list(eval('entitydict_shard' + str(i+1)).keys())[j]
        entitydict_shard1[key_to_update] = value_to_update
    time.sleep(60)

In [15]:
df = pd.DataFrame.from_dict(entitydict_shard1, orient='index')

In [14]:
df.shape

### Basic Feature Selection

#### Firm-Specific Questions (large quantity of N/As)

The dynamic nature of the questionnaire mean that we have a number of columns that are not relevant for many entities. So, if the number of na's for a given column amounted to 60% or more of the length of the column, then the entire column was dropped. 

In [102]:
thin_df = df.dropna(thresh=len(df) * .6, axis=1)

In [103]:
thin_df.shape

(1602, 142)

In [104]:
thin_df.to_csv(r'C:\Users\michael.amenta\DAT_MA\datasets\survey_test2.csv')

In [116]:
pd.set_option('display.max_columns', 200)

In [120]:
thin_df.describe().to_clipboard()

In Excel, I renamed all columns with funny characters removed with Find & Replace.

In [132]:
thin_df.columns = ["We address ESG incorporation.","We do not do ESG incorporation.","Organisational Overview","Strategy and Governance","Closing module","None","Yes","Policy setting out your overall approach","Formalised guidelines on environmental factors","Formalised guidelines on social factors","Formalised guidelines on corporate governance factors","Fiduciary (or equivalent) duties","Asset class-specific RI guidelines","Sector specific RI guidelines","Screening / exclusions policy","Other, specify (1)","Other, specify(2)","Applicable policies cover all AUM","Applicable policies cover a majority of AUM","Applicable policies cover a minority of AUM","Your organisation’s definition of ESG and/or responsible investment and it’s relation to investments","Your investment objectives that take ESG factors/real economy influence into account","Time horizon of your investment","Governance structure of organisational ESG responsibilities","ESG incorporation approaches","Active ownership approaches","Reporting","Climate change","Understanding and incorporating client / beneficiary sustainability preferences","Other RI considerations, specify (1)","Other RI considerations, specify (2)","No","I confirm I have read and understood the Accountability tab for SG 01","URL","Attachment (will be made public)","We do not publicly disclose our investment policy documents","Attachment","We do not publicly disclose any investment policy components","Board members or trustees","Oversight/accountability for responsible investment","Implementation of responsible investment","No oversight/accountability or implementation responsibility for responsible investment","Internal Roles (triggers other options)","Chief Executive Officer (CEO), Chief Investment Officer (CIO), Chief Operating Officer (COO), Investment Committee","Other Chief-level staff or head of department, specify","Portfolio managers","Investment analysts","Dedicated responsible investment staff","Investor relations","Other role, specify (1)","Other role, specify (2)","External managers or service providers","I confirm I have read and understood the Accountability tab for SG 07","Principles for Responsible Investment","Basic Moderate Advanced","Asian Corporate Governance Association","Australian Council of Superannuation Investors","AFIC – La Commission ESG","BVCA – Responsible Investment Advisory Board","CDP Climate Change","CDP Forests","CDP Water","CFA Institute Centre for Financial Market Integrity","Code for Responsible Investment in SA (CRISA)","Code for Responsible Finance in the 21st Century","Council of Institutional Investors (CII)","Eumedion","Extractive Industries Transparency Initiative (EITI)","ESG Research Australia","Invest Europe Responsible Investment Roundtable","Global Investors Governance Network (GIGN)","Global Impact Investing Network (GIIN)","Global Real Estate Sustainability Benchmark (GRESB)","Green Bond Principles","Institutional Investors Group on Climate Change (IIGCC)","Interfaith Center on Corporate Responsibility (ICCR)","International Corporate Governance Network (ICGN)","Investor Group on Climate Change, Australia/New Zealand (IGCC)","International Integrated Reporting Council (IIRC)","Investor Network on Climate Risk (INCR)/CERES","Local Authority Pension Fund Forum","Principles for Sustainable Insurance","Regional or National Social Investment Forums (e.g. UKSIF, Eurosif, ASRIA, RIAA), specify","Responsible Finance Principles in Inclusive Finance","Shareholder Association for Research and Education (Share)","United Nations Environmental Program Finance Initiative (UNEP FI)","United Nations Global Compact","Other collaborative organisation/initiative, specify","Provided or supported education or training programmes (this includes peer to peer RI support) Your education or training may be for clients, investment managers, actuaries, broker/dealers, investment consultants, legal advisers etc.)","Quarterly or more frequently","Biannually","Annually","Less frequently than annually","Ad hoc","Other","Provided financial support for academic or industry research on responsible investment","Provided input and/or collaborated with academia on RI related work","Encouraged better transparency and disclosure of responsible investment practices across the investment industry","Spoke publicly at events and conferences to promote responsible investment","Wrote and published in-house research papers on responsible investment","Encouraged the adoption of the PRI","Responded to RI related consultations by non-governmental organisations (OECD, FSB etc.)","Wrote and published articles on responsible investment in the media","A member of PRI advisory committees/ working groups, specify","On the Board of, or officially advising, other RI organisations (e.g. local SIFs)","Other, specify","We do not disclose to either clients/beneficiaries or the public.","We disclose to clients/beneficiaries only.","We disclose to the public","Quarterly or more frequently Biannually Annually Less frequently than annually Ad-hoc/when requested","Third party assurance over selected responses from this year’s PRI Transparency Report","Third party assurance over data points from other sources that have subsequently been used in your PRI responses this year","Third party assurance or audit of the correct implementation of RI processes (that have been reported to the PRI this year)","Internal audit of the correct implementation of RI processes and/or accuracy of RI data (that have been reported to the PRI this year)","Internal verification of responses before submission to the PRI (e.g. by the CEO or the board)","Whole PRI Transparency Report has been internally verified","Selected data has been internally verified","None of the above","Whole PRI Transparency Report was assured last year","Selected data was assured in last year’s PRI Transparency Report","We did not assure last year's PRI Transparency report","None of the above, we were in our preparation year and did not report last year.","We adhere to an RI certification or labelling scheme","We carry out independent/third party assurance over a whole public report (such as a sustainability report) extracts of which are included in this year’s PRI Transparency Report","ESG audit of holdings","Whole PRI Transparency Report will be assured","Selected data will be assured","We do not plan to assure this year's PRI Transparency report","CEO or other Chief-Level staff","The Board","Investment Committee","Compliance Function","RI/ESG Team","Investment Teams","Legal Department","Other (specify)","We engage with companies on ESG factors via our staff, collaborations or service providers.","We do not engage directly and do not require external managers to engage with companies on ESG factors.","We cast our (proxy) votes directly or via dedicated voting providers","We do not cast our (proxy) votes directly and do not require external managers to vote on our behalf","Engagement policy","(Proxy) voting policy"]

#### Low Variance

I analyzed the columns in Excel to find ones that had 95% or more of 0's or 1's and dropped these columns in the following snippet.

In [128]:
thinner_df = thin_df.drop(["Investment Committee","Compliance Function","RI/ESG Team","Investment Teams","Legal Department","Other (specify)","We engage with companies on ESG factors via our staff, collaborations or service providers.","We do not engage directly and do not require external managers to engage with companies on ESG factors.","We cast our (proxy) votes directly or via dedicated voting providers","We do not cast our (proxy) votes directly and do not require external managers to vote on our behalf","Engagement policy","(Proxy) voting policy","We address ESG incorporation.","We do not do ESG incorporation.","Organisational Overview","Strategy and Governance","Closing module","None","Yes"],axis=1)

In [129]:
thinner_df.shape

(1602, 123)

In [130]:
thinner_df.to_csv(r'C:\Users\michael.amenta\DAT_MA\datasets\survey_test3.csv')

My data is now saved to a csv where it can be re-imported in the next step.