# Sustainability reports & NLP  


## Concepts
#### Corporate Social Responsibility Reports (CSR)
A corporate social responsibility (CSR) report is an internal and external facing document companies use to communicate CSR efforts around environmental, ethical, philanthropic, and economic impacts on the environment and community.    

#### Natural Language Processing (NLP)
Natural language processing (NLP) is a field of linguistics and machine learning that deals with natural (i.e., human) languages. The goal is to "understand" the unstructured text data and produce something new. Examples of NLP tasks are language translation, text summarization, and sentiment analysis.  


#### Zero-Shot Learning (ZSL)
Human languages are really complex, so it is impossible to train classifiers on every single phrase. Zero-shot learning (ZSL) models allow classification of text into categories unseen by the model during training. These methods work by combining the observed/seen and the non-observed/unseen categories through auxiliary information, which encodes properties of objects.    

Other common uses for zero-shot learning models are images and videos. And the uses keep growing, such as activity recognition from sensors.


In [1]:
# Imports
import re
import string
from collections import defaultdict
import pandas as pd
import tika
import warnings
tika.initVM()
from tika import parser
import nltk
import torch
from transformers import pipeline  # Hugging Face

# Set pandas display options
pd.set_option("display.max_colwidth", None) 

# Ignore warnings
warnings.filterwarnings("ignore")


## Parsing CSR PDFs
A non-trivial portion of classifying CSR reports is converting them to a computer-readable format. Companies publish their CSR reports as PDFs, which are notoriously hard to read. Our goal is to extract text as a list of sentences.  

We will be doing very simple parsing of a PDF report using the package tika to extract the text, regular expressions to filter and join the text, and NLTK to split the text into sentences.  

This is by no means the best way to do it, but it's relatively simple and gets the job done well enough for our purposes. Text cleaning is task-specific, so you need to consider what is sufficient for your problem. 

In [2]:
class parsePDF:
    def __init__(self, url):
        self.url = url
    
    def extract_contents(self):
        """ Extract a pdf's contents using tika. """
        pdf = parser.from_file(self.url)
        self.text = pdf["content"]
        return self.text
        
    
    def clean_text(self):
        """ Extract & clean sentences from raw text of pdf. """
        # Remove non ASCII characters
        printables = set(string.printable)
        self.text = "".join(filter(lambda x: x in printables, self.text))

        # Replace tabs with spaces
        self.text = re.sub(r"\t+", r" ", self.text)

        # Aggregate lines where the sentence wraps
        # Also, lines in CAPITALS is counted as a header
        fragments = []
        prev = ""
        for line in re.split(r"\n+", self.text):
            if line.isupper():
                prev = "."  # skip it
            elif line and (line.startswith(" ") or line[0].islower()
                  or not prev.endswith(".")):
                prev = f"{prev} {line}"  # make into one line
            else:
                fragments.append(prev)
                prev = line
        fragments.append(prev)

        # Clean the lines into sentences
        sentences = []
        for line in fragments:
            # Use regular expressions to clean text
            url_str = (r"((http|https)\:\/\/)?[a-zA-Z0-9\.\/\?\:@\-_=#]+\."
                       r"([a-zA-Z]){2,6}([a-zA-Z0-9\.\&\/\?\:@\-_=#])*")
            line = re.sub(url_str, r" ", line)  # URLs
            line = re.sub(r"^\s?\d+(.*)$", r"\1", line)  # headers
            line = re.sub(r"\d{5,}", r" ", line)  # figures
            line = re.sub(r"\.+", ".", line)  # multiple periods
            
            line = line.strip()  # leading & trailing spaces
            line = re.sub(r"\s+", " ", line)  # multiple spaces
            line = re.sub(r"\s?([,:;\.])", r"\1", line)  # punctuation spaces
            line = re.sub(r"\s?-\s?", "-", line)  # split-line words

            # Use nltk to split the line into sentences
            for sentence in nltk.sent_tokenize(line):
                s = str(sentence).strip().lower()  # lower case
                # Exclude tables of contents and short sentences
                if "table of contents" not in s and len(s) > 5:
                    sentences.append(s)
        return sentences

##### Example: vermeg
Here, we're pulling vermeg' 2022 CSR report. We will extract and parse the text in order to move on to classifying it using zero shot learning.

In [3]:
vermeg_url = "https://www.vermeg.com/wp-content/uploads/2020/08/CSR-Report_VERMEG_2022.pdf"
pp = parsePDF(vermeg_url)
pp.extract_contents()
sentences = pp.clean_text()

print(f"The vermeg CSR report has {len(sentences):,d} sentences")

2024-05-10 21:49:34,611 [MainThread  ] [INFO ]  Retrieving https://www.vermeg.com/wp-content/uploads/2020/08/CSR-Report_VERMEG_2022.pdf to C:\Users\USER\AppData\Local\Temp/wp-content-uploads-2020-08-csr-report_vermeg_2022.pdf.
2024-05-10 21:49:45,241 [MainThread  ] [WARNI]  Failed to see startup log message; retrying...
2024-05-10 21:49:50,281 [MainThread  ] [WARNI]  Failed to see startup log message; retrying...


The vermeg CSR report has 336 sentences


## Zero-Shot Learning
Zero-shot learning models are extremely helpful when you want to classify text on very specific labels and don't have labeled data. Labeled data can be difficult, expensive, and tedious to acquire, so zero-shot learning provides a quick way to get a classification without specialized data and additional model training.   

We are going to define industry-specific ESG categories and ask our model to classify each sentence in our CSR report. We will get a "score" that shows how confident the model is that that label applies. A score of 1.0 means that that sentence is definitely about that topic. Conversely, a score of 0.0 means that the sentence definitely doesn't relate to that topic.  

The downside to zero-shot learning is that it is extremely slow compared to models trained on specific labels. It basically has to compute "what it means to be that label" then it has to check if your sentence "is that label."

In [4]:
class ZeroShotClassifier:

    def create_zsl_model(self, model_name):
        """ Create the zero-shot learning model. """
        self.model = pipeline("zero-shot-classification", model=model_name)
    
        
    def classify_text(self, text, categories):
        """
        Classify text(s) to the pre-defined categories using a
        zero-shot classification model and return the raw results.
        """
        # Classify text using the zero-shot transformers model
        hypothesis_template = "This text is about {}."
        result = self.model(text, categories, multi_label=True,
                            hypothesis_template=hypothesis_template)
        return result

    
    def text_labels(self, text, category_dict, cutoff=None):
        """
        Classify a text into the pre-defined categories. If cutoff
        is defined, return only those entries where the score > cutoff
        """
        # Run the model on our categories
        categories = list(category_dict.keys())
        result = (self.classify_text(text, categories))
        
        # Format as a pandas dataframe and add ESG label
        df = pd.DataFrame(result).explode(["labels", "scores"])
        df["ESG"] = df.labels.map(category_dict)
    
        # If a cutoff is provided, filter the dataframe
        if cutoff:
            df = df[df.scores.gt(cutoff)].copy()
        return df.reset_index(drop=True)

##### Pre-Define Labels
The labels chosen below are based on categories and topics used by ESG scoring companies.  
We define the plain-english version, which is what will be searched by the zero-shot learning model, as well as the general "ESG" label.  

Because of how zero-shot learning models work, inference time will increase linearly with the number of labels you define. Therefore, it is necessary to consider which labels you really want and how much time is acceptable for text classification.

In [9]:
# Define categories we want to classify
esg_categories = {
    "emissions": "E",
    "pollution": "E",
    "diversity and inclusion": "S",
    "health and safety": "S",
    "training and education": "S",
    "transparancy": "G",
    "board accountability": "G",
    "data privacy": "G",  
    "corporate ethics": "G",  

}

##### Getting Text Classification
Now, all we have to do is define the model and make predictions. The architecture of the model can be chosen from any text-classification model on [Hugging Face](https://huggingface.co/models).  


In [6]:
# Define and Create the zero-shot learning model
model_name = "MoritzLaurer/DeBERTa-v3-base-mnli-fever-anli" 
ZSC = ZeroShotClassifier()
ZSC.create_zsl_model(model_name)
    # Note: the warning is expected, so ignore it

Downloading model.safetensors:   0%|          | 0.00/369M [00:00<?, ?B/s]

Downloading tokenizer_config.json:   0%|          | 0.00/1.28k [00:00<?, ?B/s]

Downloading spm.model:   0%|          | 0.00/2.46M [00:00<?, ?B/s]

Downloading tokenizer.json:   0%|          | 0.00/8.66M [00:00<?, ?B/s]

Downloading added_tokens.json:   0%|          | 0.00/23.0 [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/286 [00:00<?, ?B/s]

In [10]:
# Classify all the sentences in the report
    # Note: this takes a while
classified = ZSC.text_labels(sentences, esg_categories)
classified.sample(n=20)  # display 20 random records


Unnamed: 0,sequence,labels,scores,ESG
389,"mousser jerbi, the groups coo and myriam sanhaji, the groups csro, have deliberately chosen to ask the staff for their views and ensure that they want to follow the path taken by the companys founder.",corporate ethics,0.7642,G
1165,"in addition, leaders are considering to update the materiality matrix by interviewing external stakeholders in the coming years.",pollution,0.002301,E
55,"also, we reviewed our target carbon neutrality objectives to align with the post covid world.",board accountability,0.306766,G
1417,"anonymous whistleblowing process implemented in the internal system with public access from vermegs website ( ) to allow any internal or external stakeholders raising alerts on suspected wrongdoing (bribery, fraud or other criminal activity, miscarriages of justice, health and safety risks, damage to the environment, breach of legal or professional obligations, discrimination, managerial practices, labor rights, etc.)",pollution,0.025648,E
1963,"new trendy sports such as paddle, cross fit, pilate, climbing, etc.",transparancy,0.009291,G
2803,suppliers and subcontractors who minimize the waste generated).,data privacy,0.001359,G
278,"urging vermeg stakeholders to read & sign its csr policies handbook and ethics policy describing its code of conduct to achieve its commitment to sustainable development, vermeg has implemented policies, procedures and controls in the organization demonstrating concrete proofs as: all day-to-day activities and efficient services of the group company are in line with the chart and principles of corporate social responsibility (csr) all stakeholders (investors, suppliers, business partners, etc.)",data privacy,6.9e-05,G
1658,"in 2022, global emissions increased to reach 2533 teq co2 (uncertainty around 19%) againt 1560 teq co2 in 2021, but 3190 teq co2 in 2019.",transparancy,0.271825,G
2930,"key indicators for sustainable impact improvement as a key player in the financial industry, our mission is to offer the best solutions available to advise and support individuals, businesses and institutions in the development of their projects and to ensure a positive long-term impact on the business, social and environmental world around us.",training and education,0.000382,S
2461,"indeed today, whatever the field in which the company operates, it has become vital to meet the expectations resulting from the international standards for sustainable development in its specific business area.",data privacy,0.007402,G


In [13]:
# Look at an example of "E" classified sentences:
E_sentences = classified[classified.scores.gt(0.8) & classified.ESG.eq("E")].copy()
E_sentences.head(10)

Unnamed: 0,sequence,labels,scores,ESG
54,"also, we reviewed our target carbon neutrality objectives to align with the post covid world.",emissions,0.954821,E
1449,"minimizing our environment impact although its environmental impact is naturally low due to the primarily intellectual nature of its services, vermeg continues to carry out an annual carbon assessment to measure its greenhouse gas (ghg) emissions over its financial year and constantly questions its working methods to reduce the environmental footprint of all its activities and limit waste.",emissions,0.989547,E
1494,provide shuttle buses to limit the emissions from staff home/work travel by mutualizing transport morning and night for tunis offices.,emissions,0.996736,E
1495,provide shuttle buses to limit the emissions from staff home/work travel by mutualizing transport morning and night for tunis offices.,pollution,0.952134,E
1512,reduce diesel cars till banishing them by replacing company cars exclusively with electric or hybrid vehicles if possible in the countries where vermeg operates.,emissions,0.978075,E
1513,reduce diesel cars till banishing them by replacing company cars exclusively with electric or hybrid vehicles if possible in the countries where vermeg operates.,pollution,0.913205,E
1629,"since 2016, the carbon assessment is carried out annually including all countries where vermeg has offices: belgium, france, luxembourg, tunisia, united kingdom, united states, singapore and hong kong.",emissions,0.993789,E
1638,"for this carbon footprint till 2020 financial year, the initial approach were limited to considering 5 sources (excluding home/work commuting).",emissions,0.976301,E
1656,"in 2022, global emissions increased to reach 2533 teq co2 (uncertainty around 19%) againt 1560 teq co2 in 2021, but 3190 teq co2 in 2019.",emissions,0.997338,E
1657,"in 2022, global emissions increased to reach 2533 teq co2 (uncertainty around 19%) againt 1560 teq co2 in 2021, but 3190 teq co2 in 2019.",pollution,0.939917,E
