[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/gov-analysis/ENData-Tech/blob/main/Applying_ESG_BERT_on_sustainability_reports.ipynb)

In [None]:
!pip install tika

Collecting tika
  Downloading tika-2.6.0.tar.gz (27 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: tika
  Building wheel for tika (setup.py) ... done
  Created wheel for tika: filename=tika-2.6.0-py3-none-any.whl size=32621 sha256=b4ac941406f79954e9dd240d9f85617b2c5b26098b2c1d11b6fbf35f9f040385
  Stored in directory: /root/.cache/pip/wheels/5f/71/c7/b757709531121b1700cffda5b6b0d4aad095fb507ec84316d0
Successfully built tika
Installing collected packages: tika
Successfully installed tika-2.6.0


In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from transformers import pipeline
from tika import parser
import re
import pandas as pd

tokenizer = AutoTokenizer.from_pretrained("nbroad/ESG-BERT")

model = AutoModelForSequenceClassification.from_pretrained("nbroad/ESG-BERT")

# Create the text-classification pipeline; truncation=True cuts inputs that exceed the model's maximum sequence length
classifier = pipeline('text-classification', model=model, tokenizer=tokenizer, truncation=True)


In [None]:
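# Helper class that extracts the text of a PDF (local path or URL) with Apache Tika
class PDFParser:
    def __init__(self, file_path):
        self.file_path = file_path
        self.raw = parser.from_file(self.file_path)
        # Tika returns None for 'content' when no text can be extracted
        self.text = self.raw.get('content') or ''

    def get_text(self):
        """Return the raw extracted text."""
        return self.text

    def get_text_clean(self):
        """Return the text with every whitespace run (newlines included) collapsed to one space."""
        return re.sub(r'\s+', ' ', self.text)

    def get_text_clean_list(self):
        """Split the cleaned text on periods (a naive split that also breaks on decimals and abbreviations)."""
        return self.get_text_clean().split('.')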
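
Splitting on every period also cuts at decimal points ("1.5 percent") and leaves empty or very short fragments, which are then classified like real sentences. Below is a minimal sketch of a somewhat more careful splitter that uses only the standard library. The function name `split_sentences` and the `min_chars` threshold are illustrative assumptions, not part of the original notebook, and the sample text is made up.

```python
import re

def split_sentences(text, min_chars=20):
    """Split text into sentence-like strings, dropping very short fragments."""
    # Collapse whitespace first, as get_text_clean() does
    text = re.sub(r'\s+', ' ', text).strip()
    # Split only where '.', '!' or '?' is followed by whitespace,
    # so decimals such as "1.5" stay intact (abbreviations like "p. 4" still split)
    parts = re.split(r'(?<=[.!?])\s+', text)
    # Keep only fragments long enough to be meaningful sentences
    return [p.strip() for p in parts if len(p.strip()) >= min_chars]

# Made-up sample text for illustration
sample = "Emissions fell by 1.5 percent in 2023. See p. 4.\nWe aim to reach net zero by 2050!"
print(split_sentences(sample))
```

Using a splitter like this in place of `get_text_clean_list()` would change the sentence counts reported in the cells that follow, so the two approaches should not be mixed when comparing reports.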

In [None]:
# Get report from responsibilityreports.com
mcdonalds_url = "https://www.responsibilityreports.com/Click/2534"
pp = PDFParser(mcdonalds_url)
sentences = pp.get_text_clean_list()

print(f"The McDonalds CSR report has {len(sentences):,d} sentences")


2024-06-22 04:16:40,705 [MainThread  ] [INFO ]  Retrieving https://www.responsibilityreports.com/Click/2534 to /tmp/click-2534.
INFO:tika.tika:Retrieving https://www.responsibilityreports.com/Click/2534 to /tmp/click-2534.


The McDonalds CSR report has 2,438 sentences


In [None]:
sentences

[' 2022–2023 Our Purpose & Impact Report McDonald’s Corporation Impact Report 2022–2023 Our Purpose & Impact Report McDonald’s Corporation Our purpose is to feed and foster communities',
 ' As the leading global foodservice retailer, we believe it’s our responsibility to make a positive impact on the world',
 ' We’re driving that impact by living our purpose',
 ' The actions we continue to take today across our food, people, communities and our planet will help contribute to building a better business and a more trusted brand for generations to come',
 ' One of these actions is reporting on our environmental and social activities',
 ' McDonald’s Corporation Purpose & Impact Report 2022–2023 2Our Planet Food Quality & Sourcing Jobs, Inclusion & Empowerment Community Connection SASB Index Introduction What’s Inside Introduction McDonald’s is the global leading foodservice retailer, with more than 40,000 locations in over 100 countries helping feed millions of customers every day',
 ' Our

In [None]:
result = classifier(sentences)
df = pd.DataFrame(result)
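
Classifying several thousand sentences in one call gives no intermediate results, and the resulting DataFrame no longer shows which sentence produced which label. Below is a sketch of a chunked loop that keeps each sentence next to its prediction. It assumes only that `classifier` accepts a list of strings and returns one `{'label', 'score'}` dict per string, as the pipeline above does. The names `classify_in_chunks`, `chunk_size`, and the stand-in `fake_classifier` are hypothetical and exist only so the example runs without downloading the model.

```python
def classify_in_chunks(classifier, sentences, chunk_size=64):
    """Run classifier over sentences chunk by chunk and return one flat list of rows."""
    # Skip empty or whitespace-only fragments left behind by the period split
    kept = [s.strip() for s in sentences if s and s.strip()]
    results = []
    for start in range(0, len(kept), chunk_size):
        chunk = kept[start:start + chunk_size]
        # Each call returns a list with one prediction dict per input string
        for sentence, pred in zip(chunk, classifier(chunk)):
            results.append({"sentence": sentence, **pred})
    return results

# Stand-in for the real pipeline (hypothetical, for illustration only)
def fake_classifier(batch):
    return [{"label": "Energy_Management", "score": 0.5} for _ in batch]

rows = classify_in_chunks(fake_classifier, [" We cut energy use", "", " Water matters "], chunk_size=1)
print(rows)
```

With the real pipeline, `pd.DataFrame(classify_in_chunks(classifier, sentences))` would give a table with `sentence`, `label`, and `score` columns. Because empty fragments are dropped, the row count can be lower than the sentence count printed earlier.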

In [None]:
# Mean classifier confidence per predicted label, highest first
df.groupby(['label']).mean().sort_values('score', ascending=False)

label,score
Waste_And_Hazardous_Materials_Management,0.891394
Labor_Practices,0.786746
Supply_Chain_Management,0.779445
Physical_Impacts_Of_Climate_Change,0.763709
Critical_Incident_Risk_Management,0.750916
Water_And_Wastewater_Management,0.749063
Product_Quality_And_Safety,0.74867
Employee_Engagement_Inclusion_And_Diversity,0.736018
Human_Rights_And_Community_Relations,0.704285
Product_Design_And_Lifecycle_Management,0.698856
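
The table above averages the confidence of the sentences assigned to each label. It does not show how many sentences received that label, so a label backed by a single confident sentence can rank first. Below is a sketch that reports both numbers, working on a list of dicts in the shape the pipeline returns. The helper `summarize_labels` and the three sample predictions are made up for illustration and are not taken from any report.

```python
from collections import defaultdict

def summarize_labels(predictions):
    """Return (label, count, share of sentences, mean score) tuples, most frequent label first."""
    scores = defaultdict(list)
    for pred in predictions:
        scores[pred["label"]].append(pred["score"])
    total = sum(len(v) for v in scores.values())
    rows = [
        (label, len(v), len(v) / total, sum(v) / len(v))
        for label, v in scores.items()
    ]
    # Sort by count (descending), then by mean score (descending)
    return sorted(rows, key=lambda r: (-r[1], -r[3]))

# Illustrative predictions only
sample_predictions = [
    {"label": "Labor_Practices", "score": 0.5},
    {"label": "Labor_Practices", "score": 1.0},
    {"label": "Air_Quality", "score": 1.0},
]
for label, count, share, mean_score in summarize_labels(sample_predictions):
    print(f"{label}: n={count}, share={share:.2f}, mean score={mean_score:.2f}")
```

In this toy input `Air_Quality` has the higher mean score but `Labor_Practices` covers twice as many sentences, which is the kind of difference a mean-only table hides.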


In [None]:
# Wrap the workflow above in a function so that other companies' reports can be scored the same way
def run_classifier(url):
    pp = PDFParser(url)
    sentences = pp.get_text_clean_list()
    print(f"The CSR report has {len(sentences):,d} sentences")
    result = classifier(sentences)
    return pd.DataFrame(result)

In [None]:
# Let's try to look at Amazon
amzn = run_classifier("https://www.responsibilityreports.com/Click/2015")

2024-06-22 04:32:06,634 [MainThread  ] [INFO ]  Retrieving https://www.responsibilityreports.com/Click/2015 to /tmp/click-2015.
INFO:tika.tika:Retrieving https://www.responsibilityreports.com/Click/2015 to /tmp/click-2015.


The CSR report has 2,567 sentences


In [None]:
amzn.groupby(['label']).mean().sort_values('score', ascending = False)

label,score
Water_And_Wastewater_Management,0.944439
Waste_And_Hazardous_Materials_Management,0.822303
Supply_Chain_Management,0.82229
Energy_Management,0.790863
Physical_Impacts_Of_Climate_Change,0.787895
Business_Ethics,0.744366
Labor_Practices,0.738375
Employee_Engagement_Inclusion_And_Diversity,0.730735
Human_Rights_And_Community_Relations,0.726501
Ecological_Impacts,0.67996


In [None]:
# Let's look at another company from a different sector - Newmont Mining
nm = run_classifier("https://www.responsibilityreports.com/Click/1772")


2024-06-22 04:43:17,303 [MainThread  ] [INFO ]  Retrieving https://www.responsibilityreports.com/Click/1772 to /tmp/click-1772.
INFO:tika.tika:Retrieving https://www.responsibilityreports.com/Click/1772 to /tmp/click-1772.


The CSR report has 12,897 sentences


In [None]:
nm.groupby(['label']).mean().sort_values('score', ascending = False)

label,score
Water_And_Wastewater_Management,0.915708
Air_Quality,0.816752
Employee_Health_And_Safety,0.780301
Physical_Impacts_Of_Climate_Change,0.775978
Human_Rights_And_Community_Relations,0.773394
Supply_Chain_Management,0.709297
Ecological_Impacts,0.699983
GHG_Emissions,0.693242
Labor_Practices,0.644533
Waste_And_Hazardous_Materials_Management,0.636957


In [None]:
# Let's look at Nvidia
nvidia = run_classifier("https://www.responsibilityreports.com/Click/1532")

2024-06-22 05:08:14,022 [MainThread  ] [INFO ]  Retrieving https://www.responsibilityreports.com/Click/1532 to /tmp/click-1532.
INFO:tika.tika:Retrieving https://www.responsibilityreports.com/Click/1532 to /tmp/click-1532.


The CSR report has 1,621 sentences


In [None]:
nvidia.groupby(['label']).mean().sort_values('score', ascending = False)

label,score
Water_And_Wastewater_Management,0.951785
Waste_And_Hazardous_Materials_Management,0.82711
Critical_Incident_Risk_Management,0.825755
GHG_Emissions,0.773908
Physical_Impacts_Of_Climate_Change,0.772111
Labor_Practices,0.754507
Employee_Engagement_Inclusion_And_Diversity,0.725637
Energy_Management,0.724615
Air_Quality,0.717293
Business_Ethics,0.706091
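
The four per-company tables can be lined up by label to make the comparison easier to read. Below is a sketch that uses a few of the mean scores printed above, rounded to three decimals. `side_by_side` is a hypothetical helper, and a label that does not appear in a company's printed output is shown as `None` rather than guessed.

```python
def side_by_side(company_scores):
    """Turn {company: {label: score}} into {label: {company: score or None}}."""
    labels = sorted({label for scores in company_scores.values() for label in scores})
    return {
        label: {company: scores.get(label) for company, scores in company_scores.items()}
        for label in labels
    }

# Mean scores copied from the outputs above (rounded); absent labels are simply left out
company_scores = {
    "McDonald's": {"Water_And_Wastewater_Management": 0.749,
                   "Waste_And_Hazardous_Materials_Management": 0.891},
    "Amazon": {"Water_And_Wastewater_Management": 0.944,
               "Waste_And_Hazardous_Materials_Management": 0.822,
               "Energy_Management": 0.791},
    "Newmont": {"Water_And_Wastewater_Management": 0.916,
                "Waste_And_Hazardous_Materials_Management": 0.637},
    "Nvidia": {"Water_And_Wastewater_Management": 0.952,
               "Waste_And_Hazardous_Materials_Management": 0.827,
               "Energy_Management": 0.725},
}

table = side_by_side(company_scores)
for label, row in table.items():
    print(label, row)
```

These figures are mean confidences for the sentences assigned to each label, so a higher value means the classifier was more certain about those sentences, not that the report devotes more space to the topic.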
