# Sustainability reports & NLP  
Thursday, March 17, 2022

In this hackathon learning session, we will walk through an example of parsing a pdf sustainability report and classifying it using one-shot learning in Python.  

---

## Concepts
#### Corporate Social Responsibility Reports (CSR)
A corporate social responsibility (CSR) report is an internal and external facing document companies use to communicate CSR efforts around environmental, ethical, philanthropic, and economic impacts on the environment and community.    

While they are not required in any sense, over 90% of S&P 500 index companies publish them anually [[source]](https://www.ga-institute.com/index.php?id=9128). As there's not a standard reporting process, the quantity and quality of information disclosed is up to the PR department at each company. The reports can be anywhere from 30 to 200+ pages.  

CSR reports are available to the public on a company's website or on [www.responsibilityreports.com/Company](www.responsibilityreports.com).


#### Natural Language Processing (NLP)
Natural language processing (NLP) is a field of linguistics and machine learning that deals with natural (i.e., human) languages. The goal is to "understand" the unstructured text data and produce something new. Examples of NLP tasks are language translation, text summarization, and sentiment analysis.  


#### Zero-Shot Learning (ZSL)
Human languages are really complex, so it is impossible to train classifiers on every single phrase. Zero-shot learning (ZSL) models allow classification of text into categories unseen by the model during training. These methods work by combining the observed/seen and the non-observed/unseen categories through auxiliary information, which encodes properties of objects.    

Other common uses for zero-shot learning models are images and videos. And the uses keep growing, such as activity recognition from sensors.

We will use NLP and ZSL to analyze a CSR report in order to classify each sentence as one of several categories relating to ESG.  

---

In [1]:
# Imports
import re
import string
from collections import defaultdict
import pandas as pd
from tika import parser
import nltk
import torch
from transformers import pipeline  # Hugging Face

pd.set_option("display.max_colwidth", None)

## Parsing CSR PDFs
A non-trivial portion of classifying CSR reports is converting them to a computer-readable format. Companies publish their CSR reports as PDFs, which are notoriously hard to read. Our goal is to extract text as a list of sentences.  

We will be doing very simple parsing of a PDF report using the package tika to extract the text, regular expressions to filter and join the text, and NLTK to split the text into sentences.  

This is by no means the best way to do it, but it's relatively simple and gets the job done well enough for our purposes. Text cleaning is task-specific, so you need to consider what is sufficient for your problem. 

In [2]:
class parsePDF:
    def __init__(self, url):
        self.url = url
    
    def extract_contents(self):
        """ Extract a pdf's contents using tika. """
        pdf = parser.from_file(self.url)
        self.text = pdf["content"]
        return self.text
        
    
    def clean_text(self):
        """ Extract & clean sentences from raw text of pdf. """
        # Remove non ASCII characters
        printables = set(string.printable)
        self.text = "".join(filter(lambda x: x in printables, self.text))

        # Replace tabs with spaces
        self.text = re.sub(r"\t+", r" ", self.text)

        # Aggregate lines where the sentence wraps
        # Also, lines in CAPITALS is counted as a header
        fragments = []
        prev = ""
        for line in re.split(r"\n+", self.text):
            if line.isupper():
                prev = "."  # skip it
            elif line and (line.startswith(" ") or line[0].islower()
                  or not prev.endswith(".")):
                prev = f"{prev} {line}"  # make into one line
            else:
                fragments.append(prev)
                prev = line
        fragments.append(prev)

        # Clean the lines into sentences
        sentences = []
        for line in fragments:
            # Use regular expressions to clean text
            url_str = (r"((http|https)\:\/\/)?[a-zA-Z0-9\.\/\?\:@\-_=#]+\."
                       r"([a-zA-Z]){2,6}([a-zA-Z0-9\.\&\/\?\:@\-_=#])*")
            line = re.sub(url_str, r" ", line)  # URLs
            line = re.sub(r"^\s?\d+(.*)$", r"\1", line)  # headers
            line = re.sub(r"\d{5,}", r" ", line)  # figures
            line = re.sub(r"\.+", ".", line)  # multiple periods
            
            line = line.strip()  # leading & trailing spaces
            line = re.sub(r"\s+", " ", line)  # multiple spaces
            line = re.sub(r"\s?([,:;\.])", r"\1", line)  # punctuation spaces
            line = re.sub(r"\s?-\s?", "-", line)  # split-line words

            # Use nltk to split the line into sentences
            for sentence in nltk.sent_tokenize(line):
                s = str(sentence).strip().lower()  # lower case
                # Exclude tables of contents and short sentences
                if "table of contents" not in s and len(s) > 5:
                    sentences.append(s)
        return sentences

##### Example: McDonald's
Here, we're pulling McDonalds' most recent CSR report from [responsibilityreports.com](https://www.responsibilityreports.com/Company/mcdonalds-corporation). We will extract and parse the text in order to move on to classifying it using zero shot learning.

In [3]:
mcdonalds_url = "https://www.responsibilityreports.com/Click/2534"
pp = parsePDF(mcdonalds_url)
pp.extract_contents()
sentences = pp.clean_text()

print(f"The McDonalds CSR report has {len(sentences):,d} sentences")

2022-03-16 15:46:16,450 [MainThread  ] [INFO ]  Retrieving https://www.responsibilityreports.com/Click/2534 to /var/folders/8k/zkkj1v6n7gbd6tf9pdj8cw1w0000gq/T/click-2534.


The McDonalds CSR report has 275 sentences


## Zero-Shot Learning
Zero-shot learning models are extremely helpful when you want to classify text on very specific labels and don't have labeled data. Labeled data can be difficult, expensive, and tedious to acquire, so zero-shot learning provides a quick way to get a classification without specialized data and additional model training.   

We are going to define industry-specific ESG categories and ask our model to classify each sentence in our CSR report. We will get a "score" that shows how confident the model is that that label applies. A score of 1.0 means that that sentence is definitely about that topic. Conversely, a score of 0.0 means that the sentence definitely doesn't relate to that topic.  

The downside to zero-shot learning is that it is extremely slow compared to models trained on specific labels. It basically has to compute "what it means to be that label" then it has to check if your sentence "is that label."

In [4]:
class ZeroShotClassifier:

    def create_zsl_model(self, model_name):
        """ Create the zero-shot learning model. """
        self.model = pipeline("zero-shot-classification", model=model_name)
    
        
    def classify_text(self, text, categories):
        """
        Classify text(s) to the pre-defined categories using a
        zero-shot classification model and return the raw results.
        """
        # Classify text using the zero-shot transformers model
        hypothesis_template = "This text is about {}."
        result = self.model(text, categories, multi_label=True,
                            hypothesis_template=hypothesis_template)
        return result

    
    def text_labels(self, text, category_dict, cutoff=None):
        """
        Classify a text into the pre-defined categories. If cutoff
        is defined, return only those entries where the score > cutoff
        """
        # Run the model on our categories
        categories = list(category_dict.keys())
        result = (self.classify_text(text, categories))
        
        # Format as a pandas dataframe and add ESG label
        df = pd.DataFrame(result).explode(["labels", "scores"])
        df["ESG"] = df.labels.map(category_dict)
    
        # If a cutoff is provided, filter the dataframe
        if cutoff:
            df = df[df.scores.gt(cutoff)].copy()
        return df.reset_index(drop=True)

##### Pre-Define Labels
The labels chosen below are based on categories and topics used by ESG scoring companies.  
We define the plain-english version, which is what will be searched by the zero-shot learning model, as well as the general "ESG" label.  

Because of how zero-shot learning models work, inference time will increase linearly with the number of labels you define. Therefore, it is necessary to consider which labels you really want and how much time is acceptable for text classification.

In [5]:
# Define categories we want to classify
esg_categories = {
  "emissions": "E",
  "natural resources": "E",
  "pollution": "E",
  "diversity and inclusion": "S",
  "philanthropy": "S",
  "health and safety": "S",
  "training and education": "S",
  "transparancy": "G",
  "corporate compliance": "G",
  "board accountability": "G"}

##### Getting Text Classification
Now, all we have to do is define the model and make predictions. The architecture of the model can be chosen from any text-classification model on [Hugging Face](https://huggingface.co/models).  

Here, we choose to use the extra large version of the DeBERTa model, as maintained by Microsoft. A larger model (generally) gives better performance but is much slower.

In [6]:
# Define and Create the zero-shot learning model
model_name = "microsoft/deberta-v2-xlarge-mnli" 
    # a smaller version: "microsoft/deberta-base-mnli"
ZSC = ZeroShotClassifier()
ZSC.create_zsl_model(model_name)
    # Note: the warning is expected, so ignore it

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [7]:
# Classify all the sentences in the report
    # Note: this takes a while
classified = ZSC.text_labels(sentences, esg_categories)
classified.sample(n=20)  # display 20 random records

Unnamed: 0,sequence,labels,scores,ESG
175,our customers want to see that the mcdonalds they visit locally matches how we act globally.,natural resources,0.018979,E
1714,"across europe, mcdonalds and its franchisees partnered with organizations and local food banks to donate surplus ingredients to families.",board accountability,0.003851,G
2670,"as one of the worlds largest restaurant companies, we have a responsibility to ensure long-term, sustainable value creation for shareholders while taking action on some of the worlds most pressing social and environmental challenges.",natural resources,0.295828,E
2646,mcdonalds continues to proactively make changes to restaurant operations and office settings based on the expert guidance of health authorities.,board accountability,0.003085,G
1902,"5.5m in france, 1.5 million was raised in-restaurant through customer donations in 2019, while restaurants mobilized to donate more than 4million to rmhc.",health and safety,0.119588,S
1837,"ronald mcdonald care mobile program provides medical, dental and healthcare resources to children and families in underserved communities around the world.",natural resources,0.000999,E
2601,"informed by rainn, the nations largest anti-sexual violence organization, the policy contains clear language on workplace conduct, manager responsibilities, employee resources and the investigation process.",transparancy,0.19148,G
933,we have more work to do but im confident we can continue to work with experts and learn from families to find areas where our system has the best opportunity to create positive and meaningful change.,health and safety,0.05889,S
1347,"were also collaborating with target, cargill and the nature conservancy to support a five-year $8.5 million project in nebraska, a key state for both beef and cattle feed production.",health and safety,0.001917,S
1494,"in europe, our renewable energy purchases in 2019 covered over 6,500 restaurants worth of electricity across 11 markets.",pollution,0.017052,E


In [8]:
# Look at an example of "E" classified sentences:
E_sentences = classified[classified.scores.gt(0.8) & classified.ESG.eq("E")].copy()
E_sentences.head(10)

Unnamed: 0,sequence,labels,scores,ESG
0,"feeding and fostering communities feeding and fostering communities mcdonalds purpose & impact summary report there when people need us most in a difficult year, mcdonalds showed up for its communities accelerating circular solutions how we are reimagining packaging our food journey sourcing quality ingredients while helping people, animals andthe planet thrive whats inside 04 foodquality&sourcing 04 our food journey 05 helping coffee communities build resilience 06 offering choices that kidsand parents love: byalistair macrow 07 our planet 07 reimagining packaging 09 taking action onclimate change: q&a with francescadebiase 10 what if a restaurant could generate all its own power from renewable energy?",natural resources,0.98839,E
80,"we have continued our investment into sustainable packaging innovation, renewable energy and regenerative farming solutions to help drive action on climate change.",natural resources,0.973544,E
81,"we have continued our investment into sustainable packaging innovation, renewable energy and regenerative farming solutions to help drive action on climate change.",emissions,0.942017,E
100,"our planet: we are partnering with our franchisees, suppliers and farmers to protect our planet by finding innovative ways to keep waste out of nature and drive climate action.",natural resources,0.986246,E
101,"our planet: we are partnering with our franchisees, suppliers and farmers to protect our planet by finding innovative ways to keep waste out of nature and drive climate action.",emissions,0.943047,E
102,"our planet: we are partnering with our franchisees, suppliers and farmers to protect our planet by finding innovative ways to keep waste out of nature and drive climate action.",pollution,0.858435,E
220,"our planet: we are partnering with our franchisees, suppliers and farmers to protect our planet by finding innovative ways to keep waste out of nature and drive climate action.",natural resources,0.986246,E
221,"our planet: we are partnering with our franchisees, suppliers and farmers to protect our planet by finding innovative ways to keep waste out of nature and drive climate action.",emissions,0.943047,E
222,"our planet: we are partnering with our franchisees, suppliers and farmers to protect our planet by finding innovative ways to keep waste out of nature and drive climate action.",pollution,0.858435,E
292,"our values serve we put our customers and people first inclusion we open our doors toeveryone integrity we do the rightthing community we are goodneighbors family we get better together our values guide us to always put our customers and people first, and ensure we open our doors to everyone.",natural resources,0.884307,E
