# 1.1_text_collection

This notebook walks through the text collection process. The high-level sequence is as follows:

1. Obtain a list of companies to scrape reports for
2. Obtain a bank of URLs for each report using a web scraper
3. Extract raw text from the PDFs and place it in a dataframe

### Package Imports

In [49]:
# general imports
import pandas as pd
import numpy as np
import capstone_utility_functions as cuf
import seaborn as sns
import matplotlib.pyplot as plt

# webscraping imports
from bs4 import BeautifulSoup
from urllib.request import urlopen
import requests

# misc imports
import joblib
import re
import io 
from PyPDF2 import PdfReader

The following URLs are the three separate landing pages for NYSE, NASDAQ and LSE. Iterating through these landing pages lets us grab the company names and tickers listed on each one. This gives a list of companies which we know exist on ResponsibilityReports.com, and for which PDF URLs can then be extracted.

### ESG Report Collection

In [50]:
#NYSE url
'https://www.responsibilityreports.com/Companies?exch=1'
#NASDAQ url
'https://www.responsibilityreports.com/Companies?exch=2'
#LSE url
'https://www.responsibilityreports.com/Companies?exch=9'

exchange_urls = ['https://www.responsibilityreports.com/Companies?exch=1',
                 'https://www.responsibilityreports.com/Companies?exch=2',
                 'https://www.responsibilityreports.com/Companies?exch=9']
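
As an illustration of the first step, company names can be pulled out of a page by collecting the text of every `span` with class `companyName`, the class the scraper cell looks up on the company search page. The sketch below is not the code behind `cuf.get_exchange_companylist`, which is not shown in this notebook. It assumes the exchange landing pages use similar markup, it swaps BeautifulSoup for the standard-library `html.parser` so it runs without any installs or network access, and the sample HTML, `CompanyNameParser` and `extract_company_names` are made up for the example.

```python
from html.parser import HTMLParser


class CompanyNameParser(HTMLParser):
    """Collect the text inside every <span class="companyName"> element."""

    def __init__(self):
        super().__init__()
        self.names = []
        self._in_name = False

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get('class') or '').split()
        if tag == 'span' and 'companyName' in classes:
            self._in_name = True
            self.names.append('')

    def handle_data(self, data):
        # Only keep text that sits inside a companyName span.
        if self._in_name:
            self.names[-1] += data

    def handle_endtag(self, tag):
        if tag == 'span':
            self._in_name = False


def extract_company_names(html):
    """Hypothetical helper: return the company names found in an HTML string."""
    parser = CompanyNameParser()
    parser.feed(html)
    return [name.strip() for name in parser.names]


# Made-up HTML, assumed to resemble the structure of a landing page.
sample_html = '''
<ul>
  <li><span class="companyName"><a href="/Company/3m-corporation">3M Corporation</a></span></li>
  <li><span class="companyName"><a href="/Company/abb-ltd">ABB Ltd</a></span></li>
</ul>
'''
print(extract_company_names(sample_html))
```

The parser does not handle nested `span` elements, which is acceptable for a sketch but would need care on real pages.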

The cell below is the web scraper. It can take a long time to run (30+ minutes), so loading the pickles further down is advised instead.

In [51]:
# #initialise lists and dictionaries
# missing_ticker = []
# report_bank = {}
# tick_bank = {}
# #initialise counter to see how far through for loop we are
# count = 0
# #iterate through list of relevant urls for each exchange.
# for exchange in exchange_urls:
#     #Obtain list of company names for exchange X
#     company_names = cuf.get_exchange_companylist(exchange)
#     #obtain list of url-friendly company names
#     company_names_url_compatible = cuf.get_exchange_company_list_url_compatible(company_names)
#     #Iterate through list of company names one at a time to ultimately get ESG report url
#     for name,url_name in zip(company_names,company_names_url_compatible) :
#         try:
#             #obtain html from first landing page of specific company 
#             soup1 = cuf.html_parse('https://www.responsibilityreports.com/Companies?search='+ url_name)
#             a_href=soup1.find_all('span', class_ = 'companyName')
#             links = soup1.find_all('a', href=True)
#         except Exception as e:
#             print(e)
#             print(name)
#         #form url for second landing page where reports are found for a specific company 
#         for link in links:
#             url_ext= link['href']
#             #obtain all labels for href tags to find the url extension
#             label = link.text
#             #finding the label which contains the company name 
#             if name in label:
#                 search_2_extension = url_ext
#                 # building the url for the page which contains all reports 
#                 search_2_url = 'https://www.responsibilityreports.com' + search_2_extension
#                 break
#             else:
#                 search_2_url = None

#         #obtain html from second landing page
#         try:
#             soup2 = cuf.html_parse(search_2_url)
#         except Exception as e:
#             print(f'No link for {name}')
#             print(e)
#             continue
        
#         #initialise list where  middle part of url (eg NASDAQ_AAPL_) will be stored, to ultimately append to final pdf url.
#         middle = []

#         for a in soup2.find_all('a', href=True):
#             href = a['href']
#             #regex to find the exchange symbol and company ticker from within the html soup eg NASDAQ_AAPL_
#             exchange_pattern = r'\b[A-Z]{0,10}_[A-Z]{0,10}_'
#             #regex to find the year of the latest report from the html soup 
#             latest_year_pattern = r'\d{4}'
#             #search the html for the relevant exchange and ticker
#             found_exchange = re.findall(exchange_pattern,href)
#             #find the year of the latest annual report from the html
#             found_year = re.findall(latest_year_pattern,href)
#             if len(found_exchange)>0:
#                 middle.append(found_exchange)
#                 break
#         #obtaining the first character that follows 'Company/' in the url
#         pattern = r'Company\/(.)'
#         #used below to add the archive sub-folder to report_url (e.g. '3' for 3M Corporation)
#         extract = re.findall(pattern,search_2_url)
#         try:
#             #checking if the most recent report is 2021 or not. If not the latest, 
#             #then the base url changes to account for the Archive.
#             if int(found_year[0]) < 2021:
#                 report_url_base = 'https://www.responsibilityreports.com/HostedData/ResponsibilityReports/PDF/'
#                 report_url_final = report_url_base + middle[0][0] + '2021.pdf'
#             else:
#                 report_url_base = 'https://www.responsibilityreports.com/HostedData/ResponsibilityReportArchive/'
#                 report_url_final = report_url_base + extract[0].lower() +'/'+ middle[0][0] + '2021.pdf'

#         except:
#             continue
#         #adding ticker to dictionary
#         tick_pattern = r'_(.+)_'
#         tick = re.findall(tick_pattern,middle[0][0])
#         tick_bank[name] = tick
#         #appending report url and company name to report_bank dictionary
#         report_bank[name] = report_url_final
#         count += 1
#         print(count)
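
The two regular expressions that do most of the work in the scraper can be tried in isolation. The sketch below reuses `exchange_pattern` and `tick_pattern` from the cell above; the `parse_report_href` helper and the sample href are hypothetical, with the href modelled on the URLs stored in `report_bank`.

```python
import re

# Same patterns as in the scraper cell.
EXCHANGE_PATTERN = r'\b[A-Z]{0,10}_[A-Z]{0,10}_'
TICK_PATTERN = r'_(.+)_'


def parse_report_href(href):
    """Return (prefix, ticker) for a report href, or None if no prefix is found."""
    found = re.findall(EXCHANGE_PATTERN, href)
    if not found:
        return None
    prefix = found[0]                              # e.g. 'NYSE_ABT_'
    ticker = re.findall(TICK_PATTERN, prefix)[0]   # text between the underscores
    return prefix, ticker


# Hypothetical href, shaped like the report links shown in report_bank.
sample_href = '/HostedData/ResponsibilityReports/PDF/NYSE_ABT_2021.pdf'
print(parse_report_href(sample_href))
```

Returning `None` when nothing matches makes the "no report link found" case explicit, rather than relying on an `IndexError` being caught later.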

In [52]:
#pickling the report_bank dictionary to ensure that the above scraper does not need to be run again (30min+)
#report_bank_dict = joblib.dump(report_bank, '../data/report_bank_dict.pkl')
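
The "run once, then load the pickle" pattern used throughout this notebook can be wrapped in a small helper. This is only a sketch: it uses the standard-library `pickle` module in place of `joblib` so that it is self-contained, and `load_or_compute` and `slow_scrape` are hypothetical names standing in for the notebook's own cells.

```python
import os
import pickle
import tempfile


def load_or_compute(path, compute_fn):
    """Return the pickled object at `path`, computing and saving it first if absent."""
    if os.path.exists(path):
        with open(path, 'rb') as f:
            return pickle.load(f)
    result = compute_fn()
    with open(path, 'wb') as f:
        pickle.dump(result, f)
    return result


# Stand-in for the slow scraper; records how many times it actually runs.
calls = []

def slow_scrape():
    calls.append(1)
    return {'3M Corporation': 'NYSE_MMM_'}


# Demonstrate on a throwaway directory rather than ../data.
with tempfile.TemporaryDirectory() as tmp:
    cache = os.path.join(tmp, 'report_bank_dict.pkl')
    first = load_or_compute(cache, slow_scrape)    # computes and writes the file
    second = load_or_compute(cache, slow_scrape)   # reads the file back

print(first == second, len(calls))
```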

In [53]:
tick_bank_list = joblib.load('../data/tick_bank_dict.pkl')
len(tick_bank_list)

1255

Successfully retrieved urls for 1255 pdfs. The company name and report link are stored in the report_bank dictionary below.

In [54]:
report_bank = joblib.load('../data/report_bank_dict.pkl')

In [55]:
report_bank

{'3M Corporation': 'https://www.responsibilityreports.com/HostedData/ResponsibilityReportArchive/3/NYSE_MMM_2021.pdf',
 'ABB Ltd': 'https://www.responsibilityreports.com/HostedData/ResponsibilityReportArchive/a/NYSE_ABB_2021.pdf',
 'Abbott Laboratories': 'https://www.responsibilityreports.com/HostedData/ResponsibilityReports/PDF/NYSE_ABT_2021.pdf',
 'Abbvie Inc': 'https://www.responsibilityreports.com/HostedData/ResponsibilityReports/PDF/NYSE_ABBV_2021.pdf',
 'ABM Industries, Inc.': 'https://www.responsibilityreports.com/HostedData/ResponsibilityReports/PDF/NYSE_ABM_2021.pdf',
 'Acadia Realty Trust': 'https://www.responsibilityreports.com/HostedData/ResponsibilityReports/PDF/NYSE_AKR_2021.pdf',
 'Accenture plc': 'https://www.responsibilityreports.com/HostedData/ResponsibilityReports/PDF/NYSE_ACN_2021.pdf',
 'ACCO Brands': 'https://www.responsibilityreports.com/HostedData/ResponsibilityReports/PDF/NYSE_ACCO_2021.pdf',
 'Acuity Brands, Inc.': 'https://www.responsibilityreports.com/Hosted

In [56]:
#make a dataframe with reports and urls

report_bank_df_1255 = pd.DataFrame(columns=['name'], data=report_bank.keys())

In [57]:
report_bank_df_1255['ticker'] = tick_bank_list

In [58]:
report_bank_df_1255['url'] = report_bank.values()

In [59]:
report_bank_df_1255

| | name | ticker | url |
|---|---|---|---|
| 0 | 3M Corporation | MMM | https://www.responsibilityreports.com/HostedDa... |
| 1 | ABB Ltd | ABB | https://www.responsibilityreports.com/HostedDa... |
| 2 | Abbott Laboratories | ABT | https://www.responsibilityreports.com/HostedDa... |
| 3 | Abbvie Inc | ABBV | https://www.responsibilityreports.com/HostedDa... |
| 4 | ABM Industries, Inc. | ABM | https://www.responsibilityreports.com/HostedDa... |
| ... | ... | ... | ... |
| 1250 | Whitbread plc | WTB | https://www.responsibilityreports.com/HostedDa... |
| 1251 | Wm Morrison Supermarkets plc | MRW | https://www.responsibilityreports.com/HostedDa... |
| 1252 | WPP PLC | WPP | https://www.responsibilityreports.com/HostedDa... |
| 1253 | WSP Group plc | WSH | https://www.responsibilityreports.com/HostedDa... |


In [60]:
df_report_bank = joblib.dump(report_bank_df_1255,'../data/df_report_bank.pkl')

We now have a dataframe which contains 1255 companies and their associated 2021 ESG report links. 

### Raw Text Extraction

The raw text needs to be extracted from the PDFs so that it can be used in the rest of the project. Two different packages were trialled for this: `PyPDF2` and `pdftotext`. The latter was preferred, as it was much better at retaining a paragraph structure and order similar to the original PDF.

The following code is commented out because of its lengthy run time. Loading the pickle below is advised instead.

In [61]:
# df = joblib.load('../data/df_report_bank.pkl')

In [62]:
# df_li = []
# data = []
# dict = {}
# tick_ind = 0
# c = 0

# #iterate through the urls for each report
# for i in df['url']:
#    try:
#       url = i
#       file_list = []
#       tick = df.iloc[tick_ind,1]
#       sector = df.iloc[tick_ind,0]  # column 0 of df is 'name'; fills the middle slot of each tuple below

#       #convert pdf to raw txt
#       raw_text = cuf.pdf_to_raw_text(url)

#       #split text into paragraphs
#       raw_text = raw_text.split('\n\n')

#       # replacing any new line characters within a string with a space
#       raw_text = [x.replace('\n',' ') for x in raw_text]

#       # remove any strings which have less than 10 words (headers etc and thus just noise)
#       raw_text = [x for x in raw_text if len(x.split())>10]

#       #iterate through the paragraphs and add the ticker, sector and paragraph to a list 'data'.
#       for para in raw_text:
#          res = (tick,sector,para)
#          data.append(res)

#       print(c)
#       c+=1
#       tick_ind +=1
#    except Exception as e:
#       print(e)
#       print(f'Error with {i}')
#       tick_ind +=1
#       c+=1
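
The paragraph-splitting steps in the cell above can be tried on a small string without downloading any PDFs. The sketch below applies the same three operations (split on blank lines, flatten newlines inside a paragraph, drop fragments of 10 words or fewer); the `split_into_paragraphs` name and the sample text are made up for illustration.

```python
def split_into_paragraphs(raw_text, min_words=10):
    """Split on blank lines, flatten in-paragraph newlines, drop short fragments."""
    paragraphs = raw_text.split('\n\n')
    paragraphs = [p.replace('\n', ' ') for p in paragraphs]
    # Keep only paragraphs with more than `min_words` words (headers etc. are noise).
    return [p for p in paragraphs if len(p.split()) > min_words]


# Made-up text: a header, one real paragraph wrapped over two lines, a page footer.
sample = (
    'Our Approach\n\n'
    'We set science-based targets for our operations in 2021\n'
    'and report progress against them every year in this document.\n\n'
    'Page 3'
)
for para in split_into_paragraphs(sample):
    print(para)
```

Only the middle paragraph survives, since the header and footer fall below the word threshold.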

In [63]:
# joblib.dump(data,'../data/raw_text_data_1200.pkl')

In [64]:
data = joblib.load('../data/raw_text_data_1200.pkl')

The data in the list now needs to be converted to a dataframe. 

In [65]:
# unpack the ticker (first item) and paragraph (third item) of each tuple into a two-column dataframe
df_raw = pd.DataFrame({
    'ticker': [tup[0] for tup in data],
    'raw_paras': [tup[2] for tup in data],
})

df_raw.nunique()

ticker         1056
raw_paras    344074
dtype: int64

In [66]:
df_raw

| | ticker | raw_paras |
|---|---|---|
| 0 | MMM | Growing our business by enabling action and im... |
| 1 | MMM | Contents Our leadership Message from our CEO ... |
| 2 | MMM | Environmental, health, and safety management ... |
| 3 | MMM | Mike Roman The last year has brought extraordi... |
| 4 | MMM | The pandemic has reinforced the importance of ... |
| ... | ... | ... |
| 371195 | FIVE | • The planet: Our approach and 2021 highlights... |
| 371196 | FIVE | Goal 15. Protect, restore and promote sustaina... |
| 371197 | FIVE | • 15.2 By 2020, promote the implementation of ... |
| 371198 | FIVE | • The planet: Reduced climate impact and energ... |


The successfully extracted reports (1,056 unique tickers in `df_raw`) have now been split by paragraph, with each row in the above dataframe holding one raw paragraph.

### Cleaning Text

The raw paragraphs now need to be cleaned. The process shown below is the refined version, arrived at iteratively by working out which cleaning stages needed to be included. The cleaning process is condensed into a single function, `my_cleaner()`, in capstone_utility_functions.py. The main stages are summarised below:

* remove `\n` characters within lines
* split apart camel case
* remove any numbers
* lowercase all words
* remove any punctuation
* remove any unicode characters
* split the string into individual tokens
* remove stopwords
* remove tokens whose part of speech is not a noun
* lemmatize tokens
* stem tokens
* remove any tokens shorter than a given length
* remove any tokens which are in a predefined list of unwanted words

Note that all options default to `True`, apart from stemming, lemmatization and part-of-speech removal.
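
To make the stages above concrete, here is a rough standard-library sketch of the simpler ones. It is not the `my_cleaner()` implementation: part-of-speech filtering, lemmatization and stemming are left out, "unicode" is assumed to mean non-ASCII characters, the stopword set is a tiny made-up list, and `simple_cleaner` with its parameters is a hypothetical name.

```python
import re
import string

# Tiny illustrative stopword set; a real cleaner would use a fuller list.
STOPWORDS = {'a', 'and', 'the', 'to', 'of', 'in', 'our', 'is', 'this'}

# Translation table mapping every punctuation character to a space.
PUNCT_TABLE = str.maketrans(string.punctuation, ' ' * len(string.punctuation))


def simple_cleaner(text, min_len=3, noisy_words=()):
    """Apply the simpler cleaning stages and return a list of tokens."""
    text = text.replace('\n', ' ')                      # remove newlines
    text = re.sub(r'(?<=[a-z])(?=[A-Z])', ' ', text)    # split camelCase
    text = re.sub(r'\d+', ' ', text)                    # remove numbers
    text = text.lower()                                 # lowercase
    text = text.translate(PUNCT_TABLE)                  # remove punctuation
    text = text.encode('ascii', 'ignore').decode('ascii')  # drop non-ASCII characters
    tokens = text.split()                               # tokenise on whitespace
    return [
        t for t in tokens
        if t not in STOPWORDS and len(t) >= min_len and t not in noisy_words
    ]


sample = 'ourSupplyChain emitted 1200 tonnes of CO2, \n down 5% in 2021.'
print(simple_cleaner(sample))
print(simple_cleaner(sample, noisy_words=('tonnes',)))
```

Running the cheap string operations before tokenising keeps the token filter at the end a single pass.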

An example test string is shown below to quickly demonstrate what the function does.


In [67]:
test = 'thisIS a 12Messy 3435 test and .ad.* need to see \n how the cleaning PROCESS works and splitsCamelCase ifNeeds'
cuf.my_cleaner(test,3, lemmatization=True,nouns=True)

['test', 'need', 'process', 'split', 'case', 'need']

We now need to apply this to the dataframe containing raw paragraphs. Note the below block is commented out as it takes a while to run. It is recommended you load the pickle below. 

In [68]:
# df_paras = df_raw.copy()
# # list of noisy words identified through the iterative process
# noisy = ['clients','west','percent','letter','sustainability','report','appendix','management','team','social','corporate','responsibility','business','pages','introduction']

# # feed the raw paragraphs into the custom cleaner. Note that tokens less than 3 letters long are treated as noise and removed.
# df_paras['clean_paras'] = [cuf.my_cleaner(x,3,noisy_words = noisy, newLineRemove = True,lemmatization=True,nouns = True,) for x in df_paras.raw_paras]
# df_paras

In [69]:
# joblib.dump(df_paras, '../data/clean_paras_df.pkl')

In [70]:
df_paras = joblib.load('../data/clean_paras_df.pkl')

In [71]:
df_paras

| | ticker | raw_paras | clean_paras |
|---|---|---|---|
| 0 | MMM | Growing our business by enabling action and im... | [action, impact, humanity, challenge] |
| 1 | MMM | Contents Our leadership Message from our CEO ... | [content, leadership, message, message] |
| 2 | MMM | Environmental, health, and safety management ... | [health, safety, material, energy, supplier] |
| 3 | MMM | Mike Roman The last year has brought extraordi... | [mike, roman, year, challenge, time, recommit,... |
| 4 | MMM | The pandemic has reinforced the importance of ... | [pandemic, importance, science, challenge] |
| ... | ... | ... | ... |
| 371195 | FIVE | • The planet: Our approach and 2021 highlights... | [planet, approach, highlight, packaging, impac... |
| 371196 | FIVE | Goal 15. Protect, restore and promote sustaina... | [goal, protect, restore, promote, forest, dese... |
| 371197 | FIVE | • 15.2 By 2020, promote the implementation of ... | [implementation, type, halt, deforestation, re... |
| 371198 | FIVE | • The planet: Reduced climate impact and energ... | [planet, impact, energy, efficiency, consumption] |


We now have the extracted reports (1,056 unique tickers) split by paragraph, with each paragraph cleaned and tokenised. EDA is the next stage of the pipeline.