# Sustainability reports & NLP  


## Concepts
#### Corporate Social Responsibility Reports (CSR)
A corporate social responsibility (CSR) report is an internal and external facing document companies use to communicate CSR efforts around environmental, ethical, philanthropic, and economic impacts on the environment and community.    

#### Natural Language Processing (NLP)
Natural language processing (NLP) is a field of linguistics and machine learning that deals with natural (i.e., human) languages. The goal is to "understand" the unstructured text data and produce something new. Examples of NLP tasks are language translation, text summarization, and sentiment analysis.  


#### Zero-Shot Learning (ZSL)
Human languages are really complex, so it is impossible to train classifiers on every single phrase. Zero-shot learning (ZSL) models allow classification of text into categories unseen by the model during training. These methods work by combining the observed/seen and the non-observed/unseen categories through auxiliary information, which encodes properties of objects.    

Other common uses for zero-shot learning models are images and videos. And the uses keep growing, such as activity recognition from sensors.


In [9]:
# Imports
import re
import string
from collections import defaultdict
import pandas as pd
import tika
tika.initVM()
from tika import parser
import nltk
import torch
from transformers import pipeline  # Hugging Face

pd.set_option("display.max_colwidth", None)

## Parsing CSR PDFs
A non-trivial portion of classifying CSR reports is converting them to a computer-readable format. Companies publish their CSR reports as PDFs, which are notoriously hard to read. Our goal is to extract text as a list of sentences.  

We will be doing very simple parsing of a PDF report using the package tika to extract the text, regular expressions to filter and join the text, and NLTK to split the text into sentences.  

This is by no means the best way to do it, but it's relatively simple and gets the job done well enough for our purposes. Text cleaning is task-specific, so you need to consider what is sufficient for your problem. 

In [10]:
class parsePDF:
    def __init__(self, url):
        self.url = url
    
    def extract_contents(self):
        """ Extract a pdf's contents using tika. """
        pdf = parser.from_file(self.url)
        self.text = pdf["content"]
        return self.text
        
    
    def clean_text(self):
        """ Extract & clean sentences from raw text of pdf. """
        # Remove non ASCII characters
        printables = set(string.printable)
        self.text = "".join(filter(lambda x: x in printables, self.text))

        # Replace tabs with spaces
        self.text = re.sub(r"\t+", r" ", self.text)

        # Aggregate lines where the sentence wraps
        # Also, lines in CAPITALS is counted as a header
        fragments = []
        prev = ""
        for line in re.split(r"\n+", self.text):
            if line.isupper():
                prev = "."  # skip it
            elif line and (line.startswith(" ") or line[0].islower()
                  or not prev.endswith(".")):
                prev = f"{prev} {line}"  # make into one line
            else:
                fragments.append(prev)
                prev = line
        fragments.append(prev)

        # Clean the lines into sentences
        sentences = []
        for line in fragments:
            # Use regular expressions to clean text
            url_str = (r"((http|https)\:\/\/)?[a-zA-Z0-9\.\/\?\:@\-_=#]+\."
                       r"([a-zA-Z]){2,6}([a-zA-Z0-9\.\&\/\?\:@\-_=#])*")
            line = re.sub(url_str, r" ", line)  # URLs
            line = re.sub(r"^\s?\d+(.*)$", r"\1", line)  # headers
            line = re.sub(r"\d{5,}", r" ", line)  # figures
            line = re.sub(r"\.+", ".", line)  # multiple periods
            
            line = line.strip()  # leading & trailing spaces
            line = re.sub(r"\s+", " ", line)  # multiple spaces
            line = re.sub(r"\s?([,:;\.])", r"\1", line)  # punctuation spaces
            line = re.sub(r"\s?-\s?", "-", line)  # split-line words

            # Use nltk to split the line into sentences
            for sentence in nltk.sent_tokenize(line):
                s = str(sentence).strip().lower()  # lower case
                # Exclude tables of contents and short sentences
                if "table of contents" not in s and len(s) > 5:
                    sentences.append(s)
        return sentences

##### Example: McDonald's
Here, we're pulling McDonalds' most recent CSR report from [responsibilityreports.com](https://www.responsibilityreports.com/Company/mcdonalds-corporation). We will extract and parse the text in order to move on to classifying it using zero shot learning.

In [11]:
mcdonalds_url = "https://www.responsibilityreports.com/Click/2534"
pp = parsePDF(mcdonalds_url)
pp.extract_contents()
sentences = pp.clean_text()

print(f"The McDonalds CSR report has {len(sentences):,d} sentences")

2023-08-22 10:55:28,095 [MainThread  ] [INFO ]  Retrieving https://www.responsibilityreports.com/Click/2534 to C:\Users\USER\AppData\Local\Temp/click-2534.


The McDonalds CSR report has 288 sentences


## Zero-Shot Learning
Zero-shot learning models are extremely helpful when you want to classify text on very specific labels and don't have labeled data. Labeled data can be difficult, expensive, and tedious to acquire, so zero-shot learning provides a quick way to get a classification without specialized data and additional model training.   

We are going to define industry-specific ESG categories and ask our model to classify each sentence in our CSR report. We will get a "score" that shows how confident the model is that that label applies. A score of 1.0 means that that sentence is definitely about that topic. Conversely, a score of 0.0 means that the sentence definitely doesn't relate to that topic.  

The downside to zero-shot learning is that it is extremely slow compared to models trained on specific labels. It basically has to compute "what it means to be that label" then it has to check if your sentence "is that label."

In [12]:
class ZeroShotClassifier:

    def create_zsl_model(self, model_name):
        """ Create the zero-shot learning model. """
        self.model = pipeline("zero-shot-classification", model=model_name)
    
        
    def classify_text(self, text, categories):
        """
        Classify text(s) to the pre-defined categories using a
        zero-shot classification model and return the raw results.
        """
        # Classify text using the zero-shot transformers model
        hypothesis_template = "This text is about {}."
        result = self.model(text, categories, multi_label=True,
                            hypothesis_template=hypothesis_template)
        return result

    
    def text_labels(self, text, category_dict, cutoff=None):
        """
        Classify a text into the pre-defined categories. If cutoff
        is defined, return only those entries where the score > cutoff
        """
        # Run the model on our categories
        categories = list(category_dict.keys())
        result = (self.classify_text(text, categories))
        
        # Format as a pandas dataframe and add ESG label
        df = pd.DataFrame(result).explode(["labels", "scores"])
        df["ESG"] = df.labels.map(category_dict)
    
        # If a cutoff is provided, filter the dataframe
        if cutoff:
            df = df[df.scores.gt(cutoff)].copy()
        return df.reset_index(drop=True)

##### Pre-Define Labels
The labels chosen below are based on categories and topics used by ESG scoring companies.  
We define the plain-english version, which is what will be searched by the zero-shot learning model, as well as the general "ESG" label.  

Because of how zero-shot learning models work, inference time will increase linearly with the number of labels you define. Therefore, it is necessary to consider which labels you really want and how much time is acceptable for text classification.

In [13]:
# Define categories we want to classify
esg_categories = {
  "emissions": "E",
  "natural resources": "E",
  "pollution": "E",
  "diversity and inclusion": "S",
  "philanthropy": "S",
  "health and safety": "S",
  "training and education": "S",
  "transparancy": "G",
  "corporate compliance": "G",
  "board accountability": "G",
"community engagement": "S",  
  "data privacy": "G",  
  "product safety": "E",
  "renewable energy": "E",  
  "corporate ethics": "G",  
  "waste management": "E",
}

##### Getting Text Classification
Now, all we have to do is define the model and make predictions. The architecture of the model can be chosen from any text-classification model on [Hugging Face](https://huggingface.co/models).  


In [14]:
# Define and Create the zero-shot learning model
model_name = "MoritzLaurer/DeBERTa-v3-base-mnli-fever-anli" 
ZSC = ZeroShotClassifier()
ZSC.create_zsl_model(model_name)
    # Note: the warning is expected, so ignore it



In [15]:
# Classify all the sentences in the report
    # Note: this takes a while
classified = ZSC.text_labels(sentences, esg_categories)
classified.sample(n=20)  # display 20 random records


Unnamed: 0,sequence,labels,scores,ESG
697,"following our 2020 responsible sourcing goals being substantially achieved, we continue to work with our suppliers on what is outlined in those commitments, evaluating ongoing progress.",community engagement,0.000612,S
738,we have achieved a 7.8%7 reduction in supply chain ghgemissions intensity compared to 2015 figures.,corporate compliance,0.267969,G
2244,were aiming for a 90% reduction in virgin fossil fuel-based plastic used to make happy meal toys by the end of 2025.,emissions,0.548571,E
4486,footnotes 1.,corporate ethics,0.195198,G
67,"showing up for our communities ray kroc used to say, none of us is as good as all of us a phrase that serves as a constant reminder of mcdonalds impact on the world when we leverage thecollective strength of our system.",philanthropy,0.44926,S
4378,the classification of media and production companies and content creators as diverse-owned suppliers is determined by both self-certification and third party certification.,philanthropy,5.6e-05,S
1603,"set in 2018 and approved by the science based targets initiative (sbti), our current targets aim to reduce restaurant and office emissions by 36% by 2030 from a 2015 baseline, and supply chain emissions intensity by 31% over the same period.",board accountability,0.081077,G
2655,"we have seen overall compliance improve and, by the end of 2021, more than 4,600 facilities were actively participating in the program.",pollution,0.000235,E
3818,we/the company: mcdonalds corporation anditsmajority-owned subsidiaries worldwide.,waste management,0.027402,E
2332,"ourapproach, based on global best practices, is a critical part of mcdonalds sustainability journey and purpose to feed and foster community.",training and education,0.000173,S


In [16]:
# Look at an example of "E" classified sentences:
E_sentences = classified[classified.scores.gt(0.8) & classified.ESG.eq("E")].copy()
E_sentences.head(10)

Unnamed: 0,sequence,labels,scores,ESG
272,helping protect our planet earning the trust of our people and customers by doing what we say weregoing to do has always been key to building a strong brand and a lasting legacy.,natural resources,0.996699,E
320,"thats why, in 2021, we set an ambition to achieve net zero emissions by 2050.",emissions,0.998425,E
321,"thats why, in 2021, we set an ambition to achieve net zero emissions by 2050.",pollution,0.819794,E
336,"were prioritizing action on the largest elements of our carbon footprint from restaurant energy use to packaging and waste, and the sourcing of key ingredients for our menu.",emissions,0.957858,E
337,"were prioritizing action on the largest elements of our carbon footprint from restaurant energy use to packaging and waste, and the sourcing of key ingredients for our menu.",waste management,0.922814,E
352,meaningful change also requires us to find alternative and sustainable solutions to help protect the worlds natural resources and the communities that rely on them.,natural resources,0.996081,E
385,"weare committed to partnering with our suppliers around the world to scale innovative practices, fromresponsible sourcing and regenerative agriculture, towidespread reuse andrecycling programs.",waste management,0.849208,E
433,"further details about mcdonalds strategy, goals and progress can be found at a message from our ceo chris kempczinski president & ceo, mcdonalds corporation purpose & impact progress summary introduction food quality & sourcing our planet jobs, inclusion & empowerment community connection 2 food quality & sourcing at mcdonalds, our purpose is to feed and foster communities.",natural resources,0.806335,E
641,beef antibiotic pilots have been conducted in 10 key beef sourcing markets.5 these markets represented over 80% of our global beef supply chain in 2021.,product safety,0.988897,E
672,"responsible sourcing page 7 in 2014, we set global goals forsustainable sourcing of ourpriority ingredients1 thosewhere we can have the greatest impact.",natural resources,0.967698,E
