# Efficient ESG evaluation by NLP

This demo shows how natural language processing (NLP) can streamline checklist-based ESG evaluation.

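One effective way to evaluate ESG is to prepare a list of ESG-related questions and count how many of them receive a satisfactory answer. For E (Environment), example questions include "Is there a committee that monitors climate change risk?" and "Is there a process for identifying climate change risks and opportunities?"  
The demo shows that a model can use NLP to extract the passages that answer such questions.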
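
The counting idea described above can be sketched in a few lines. This is a minimal illustration only: the `checklist_score` function and the True/False judgments are hypothetical and are not part of the demo code.

```python
from typing import Dict


def checklist_score(results: Dict[str, bool]) -> int:
    """Count the checklist questions that received a satisfactory answer."""
    return sum(1 for satisfied in results.values() if satisfied)


# Hypothetical judgments for the two example questions (not real evaluation results).
results = {
    "Is there a committee that monitors climate change risk?": True,
    "Is there a process for identifying climate change risks and opportunities?": False,
}
score = checklist_score(results)
print(f"{score} / {len(results)} questions satisfied")
```

The rest of the demo is about filling in those True/False values automatically by locating the answering passage in the report.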


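The demo proceeds as follows.

1. Extract: read the text from a PDF file (an integrated report).
2. Preprocess: clean the text so that it is easier to analyze.
3. Retrieve: extract the sections related to the question.
4. Answer: extract the answer to the question from the related sections.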

## 1. Extract

We read the PDF file. First, set up the directories.

In [1]:
import os
import sys
from pathlib import Path
import numpy as np
import pandas as pd
import requests


def set_root():
    root = os.path.join(os.path.realpath("."), "../")
    if root not in sys.path:
        sys.path.append(root)
    return root

ROOT_DIR = Path(set_root())
DATA_DIR = ROOT_DIR / "data"

Extract the text from the integrated report. The demo uses Toyota's 2019 integrated report.

In [2]:
from evaluator.data.pdf_reader import PDFReader

In [3]:
file_path = DATA_DIR / "raw/2019_001_annual_en.pdf"
reader = PDFReader()
df = reader.read_to_frame(file_path)

Display the result of reading the PDF.

* page: page number
* order: section number within the page (counted in order of appearance)


In [4]:
df.head(5)

| | page | order | content |
|---|---|---|---|
| 0 | 0 | 0 | Page 1 |
| 1 | 0 | 1 | Annual Report \nAnnual Report 2019\nFiscal yea... |
| 2 | 1 | 0 | Page 1 |
| 3 | 1 | 1 | Table of Contents |
| 4 | 1 | 2 | 1 \nTable of Contents\n2 \nMessage from the Pr... |


## 2. Preprocess

The text read from the PDF contains various kinds of noise, so we preprocess it.


In [5]:
preprocessed = reader.preprocess_frame(df)
preprocessed.head(5)

| | page | order | content |
|---|---|---|---|
| 0 | 0 | 0 | page 1 |
| 1 | 0 | 1 | annual report annual report 2019fiscal year en... |
| 2 | 1 | 1 | table of contents |
| 3 | 1 | 2 | 1 table of contents2 message from the presiden... |
| 4 | 1 | 3 | the annual report 2019 is intended to communic... |
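
The implementation of `reader.preprocess_frame` is not shown in this notebook. Judging only from the output above (lowercased text, line breaks removed), a rough approximation might look like the sketch below. `simple_preprocess` is a hypothetical stand-in, and the real method appears to do more, such as dropping some rows.

```python
import pandas as pd


def simple_preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Rough guess at the cleaning step: lowercase, drop newlines, tidy spaces."""
    out = df.copy()
    out["content"] = (
        out["content"]
        .str.lower()                            # normalize case
        .str.replace("\n", "", regex=False)     # remove line breaks left by the PDF layout
        .str.replace(r"\s+", " ", regex=True)   # collapse runs of whitespace
        .str.strip()
    )
    # Drop rows that end up empty after cleaning.
    return out[out["content"] != ""]


sample = pd.DataFrame({
    "page": [0, 0],
    "order": [0, 1],
    "content": ["Page 1", "Annual Report \nAnnual Report 2019\nFiscal year"],
})
cleaned = simple_preprocess(sample)
print(cleaned["content"].tolist())
```

Removing line breaks without inserting a space reproduces fused tokens such as `2019fiscal` seen in the output; replacing them with a space instead would likely give cleaner input for the later steps.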


Exclude the sections that do not contain a sentence.

In [6]:
import re


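# Raw string avoids invalid-escape warnings; a letter followed by "." or ";" is treated as a sentence end.
has_sentence = re.compile(r"(•)?\s?[A-Za-z](\s)?(\.|;)")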
preprocessed = preprocessed[preprocessed["content"].apply(lambda s: re.search(has_sentence, s) is not None)]
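
To see what this filter keeps, the same pattern can be tried on a few strings. The example strings below are made up for illustration and are not taken from the report.

```python
import re

# Same pattern as in the cell above: a letter, optional whitespace, then "." or ";".
has_sentence = re.compile(r"(•)?\s?[A-Za-z](\s)?(\.|;)")

examples = [
    "table of contents",                                  # heading, no sentence end
    "page 1",                                             # page label
    "the sustainability meeting reviews major risks.",    # ends with a period
]
flags = [has_sentence.search(s) is not None for s in examples]
for text, keep in zip(examples, flags):
    print(keep, text)
```

The pattern is a heuristic: an abbreviation such as `fig. 1` also matches, so some non-sentence sections can slip through.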

In [7]:
print(f"Rows are decreased from {len(df)} to {len(preprocessed)}")

Rows are decreased from 747 to 189


In [8]:
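# Note: assign() returns a new DataFrame; the result is not stored here,
# so the output below has no "length" column.
preprocessed.assign(length=preprocessed["content"].str.len())
preprocessed.head(5)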

| | page | order | content |
|---|---|---|---|
| 4 | 1 | 3 | the annual report 2019 is intended to communic... |
| 5 | 1 | 4 | about the pdfthis file is an interactive pdf a... |
| 9 | 1 | 8 | icons found in each section link to related pa... |
| 16 | 1 | 15 | toyota’s reports and publications\* toyota als... |
| 23 | 2 | 2 | reforming our company to become a “mobility co... |


## 3. Retrieve

Extract the sections that are related to a checklist question.  
In this part, we simply extract the sections that contain keywords appearing in the question.

In [9]:
# Question C2.1 from the CDP questionnaire
question = "Does your organization have a process for identifying, assessing, and responding to climate-related risks and opportunities ?"
question = question.lower()
language = "en"

In [10]:
from spacy.util import get_lang_class


class Parser():

    def __init__(self, lang):
        self.lang = lang
        self.parser = get_lang_class(self.lang)()
    
    def parse(self, text):
        return self.parser(text)


Extract keywords from the evaluation question.

In [11]:
parser = Parser(language)
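# Keep non-stop-word tokens and skip tokens that start with punctuation (raw string avoids invalid escapes).
question_words = [str(t) for t in parser.parse(question) if not t.is_stop and not re.match(r"['.?/,\-]", t.text)]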
question_words

['organization',
 'process',
 'identifying',
 'assessing',
 'responding',
 'climate',
 'related',
 'risks',
 'opportunities']

For each section in the document, count how many times the keywords occur.

In [12]:
def count_keyword_match(parser, keywords, text):
    tokens = parser.parse(text)
    count = 0
    _keywords = [k for k in keywords]
    for t in tokens:
        if str(t).lower() in _keywords:
            count += 1
    return count


counted = preprocessed.assign(
    keyword_match=preprocessed["content"].apply(
        lambda s: count_keyword_match(parser, question_words, s)))

In [13]:
matched = counted[counted["keyword_match"] > 0]
matched.sort_values(by=["keyword_match"], ascending=False).head(5)

| | page | order | content | keyword_match |
|---|---|---|---|---|
| 424 | 37 | 3 | organization and structuretoyota has appointed... | 9 |
| 392 | 34 | 4 | making over the decades has been made possible... | 4 |
| 418 | 36 | 5 | initiatives related to persons with disabiliti... | 3 |
| 277 | 25 | 7 | sustainability meetingreceives reports and del... | 2 |
| 136 | 12 | 13 | royalty-free licenses to 23,740 patents relate... | 2 |


As expected, this retrieves the sections that a keyword search would surface.
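
Note that `count_keyword_match` counts every matching token, so a section that repeats one keyword many times can outrank a section that mentions several different keywords once. One possible variation, shown here only as a sketch and not as the method used in the demo, is to score the fraction of distinct keywords that appear. The `keyword_coverage` function is hypothetical, and it uses plain whitespace splitting instead of the spaCy tokenizer to stay self-contained.

```python
from typing import Iterable


def keyword_coverage(keywords: Iterable[str], text: str) -> float:
    """Fraction of distinct keywords that appear at least once in text."""
    wanted = {k.lower() for k in keywords}
    if not wanted:
        return 0.0
    # Simplified tokenization: split on whitespace and strip common punctuation.
    tokens = {t.strip(".,;:()?!\"'").lower() for t in text.split()}
    return len(wanted & tokens) / len(wanted)


keywords = ["climate", "risks", "opportunities"]
coverage = keyword_coverage(keywords, "risks, risks and more risks related to climate change")
print(round(coverage, 3))
```

Here the repeated word "risks" is counted once, so the section covers two of the three keywords.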

## 4. Answer

### 4.1 Use Question Answering Model

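From the sections narrowed down in the Retrieve step, we extract the passage that answers the question.

To extract the answer, we use an NLP question answering technique. We use a model pre-trained on a Wikipedia-based question answering dataset called [SQuAD](https://rajpurkar.github.io/SQuAD-explorer/). Ideally the model should be fine-tuned (transfer learning) on a dataset of ESG questions and answers, but here we use it without any additional training.

The result of a manual check by a human is shown below.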
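
The `answer` helper imported below lives in this repository and its implementation is not shown here. As a rough sketch of what an extractive QA step can look like, the following uses the Hugging Face `transformers` question-answering pipeline. `build_qa` and `answer_all` are hypothetical names, and whether the repository's helper works this way is an assumption.

```python
from typing import Callable, Dict, List, Tuple


def build_qa(model_name: str = "distilbert-base-uncased-distilled-squad"):
    """Load a question-answering pipeline (downloads the model on first use)."""
    from transformers import pipeline  # imported lazily because it is a heavy dependency

    return pipeline("question-answering", model=model_name)


def answer_all(qa: Callable[..., Dict], pairs: List[Tuple[str, str]]) -> List[Dict]:
    """Run the QA callable on each (question, context) pair.

    Each result is expected to be a dict with "score", "start", "end", "answer".
    """
    return [qa(question=question, context=context) for question, context in pairs]


# Usage (requires transformers, a backend such as PyTorch, and network access):
# qa = build_qa()
# answers = answer_all(qa, [("who makes the report?", "toyota publishes the annual report.")])
```

Passing the QA callable in as an argument keeps `answer_all` easy to test with a stub and makes it simple to swap in a fine-tuned model later.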
![image](./images/answer.PNG)

In [14]:
from evaluator.models.question_answer import answer

Extract the answer spans.

In [15]:
question_context = matched["content"].apply(lambda s: (question.lower(), s)).tolist()
answers = answer("distilbert-base-uncased-distilled-squad", question_context)

Loading pretrained model...
Prepair the tokenizer...
Set the pipeline.
Answer start.


100%|██████████| 48/48 [00:10<00:00,  4.47it/s]


In [16]:
pd.DataFrame(answers).head(5)

| | score | start | end | answer |
|---|---|---|---|---|
| 0 | 0.092548 | 66 | 72 | toyota |
| 1 | 0.134468 | 123 | 154 | requires an internet connection |
| 2 | 0.012953 | 173 | 251 | developing people message from the cfo capital... |
| 3 | 0.000197 | 402 | 459 | joint venture related to the town development ... |
| 4 | 0.000305 | 992 | 996 | tnga |


Answers are extracted, but many of them do not make sense.  
Without fine-tuning on ESG data, the model may simply not work well.

### 4.2 Use Feature Representation

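Instead of direct question answering, we try extracting sentences that are close to the evaluation question (check the item if such a sentence exists, and leave it unchecked otherwise).  
In place of the keyword-based extraction above, we use a method that takes the meaning of the sentence into account a little more. Specifically, we use a method that Google Search adopted recently.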

* [BERT](https://www.blog.google/products/search/search-language-understanding-bert/)

First, split the sections into sentences.

In [17]:
sentences = []
for i, row in matched.iterrows():
    c = row["content"]
    for j, s in enumerate(c.replace("•", ".").replace(";", ".").split(".")):
        sentences.append({
            "page": row["page"],
            "section_order": row["order"],
            "sentence_order": j,
            "sentence": s,
            "length": len(s)
        })

sentences = pd.DataFrame(sentences)
sentences.head(5)

| | page | section_order | sentence_order | sentence | length |
|---|---|---|---|---|---|
| 0 | 1 | 3 | 0 | the annual report 2019 is intended to communic... | 209 |
| 1 | 1 | 3 | 1 | more detailed information on toyota’s esg-rel... | 112 |
| 2 | 1 | 3 | 2 | (published december 2019) | 25 |
| 3 | 1 | 8 | 0 | icons found in each section link to related pa... | 120 |
| 4 | 1 | 8 | 1 | \* requires an internet connection | 33 |


Convert the sentences into vector representations (BERT representations).

In [18]:
from evaluator.models.encoder import encode

In [19]:
model_name = "bert-base-uncased"

In [20]:
embeddings = encode(model_name, sentences["sentence"].values.tolist())

Loading pretrained model...


Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.predictions.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight', 'cls.predictions.decoder.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Prepair the tokenizer...
Set the pipeline.
Inference start.


100%|██████████| 52/52 [00:53<00:00,  1.03s/it]


In [21]:
embeddings.shape

(520, 768)
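
The `encode` helper is part of this repository and is not shown here. The shape above is consistent with one 768-dimensional vector per sentence, which matches the hidden size of `bert-base-uncased`. One common way to get such a vector, sketched below, is to average the token vectors of the last hidden layer while ignoring padding. `mean_pool` and `encode_sentences` are hypothetical names, and the pooling strategy is an assumption; the repository may use a different one, such as the `[CLS]` vector.

```python
import numpy as np


def mean_pool(hidden: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Average token vectors per sentence, ignoring padded positions.

    hidden: (batch, tokens, dim) token vectors
    mask:   (batch, tokens) with 1 for real tokens and 0 for padding
    """
    weights = mask[..., None].astype(hidden.dtype)
    summed = (hidden * weights).sum(axis=1)
    counts = np.clip(weights.sum(axis=1), 1.0, None)  # avoid division by zero
    return summed / counts


def encode_sentences(model_name: str, sentences: list) -> np.ndarray:
    """Encode sentences with a Hugging Face model and mean pooling (assumed strategy)."""
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state
    return mean_pool(hidden.numpy(), batch["attention_mask"].numpy())


# Small check of the pooling alone: the padded third token is ignored.
toy_hidden = np.array([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
toy_mask = np.array([[1, 1, 0]])
pooled = mean_pool(toy_hidden, toy_mask)
print(pooled)
```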

Extract the sentences in the document whose vector representation is close to that of the evaluation question.

In [22]:
query = encode(model_name, "Climate Change impact including CO2 / GHG emissions. Policy or commitment statement".lower())
query = np.reshape(query, (1, -1))

Loading pretrained model...




Prepair the tokenizer...
Set the pipeline.
Inference start.


In [23]:
from sklearn.metrics.pairwise import cosine_similarity


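# Note: despite its name, `distance` holds cosine similarity (higher means closer).
distance = cosine_similarity(query, embeddings)
# Negate before sorting so the largest similarities come first (hence the negative values below).
np.sort(-distance).flatten()[:10]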

array([-0.78335362, -0.7653984 , -0.74546632, -0.74509025, -0.74274237,
       -0.72449028, -0.71190212, -0.70885764, -0.70613514, -0.69764027])
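
For reference, cosine similarity is the dot product of two vectors after each is scaled to unit length, and the ranking is an argsort on the negated scores. The following NumPy sketch shows the same computation on toy vectors; `cosine_sim` and `top_k` are illustrative names, not functions from the demo.

```python
import numpy as np


def cosine_sim(query: np.ndarray, matrix: np.ndarray) -> np.ndarray:
    """Cosine similarity between one query vector and each row of matrix."""
    q = query / np.linalg.norm(query)
    m = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    return m @ q


def top_k(scores: np.ndarray, k: int) -> np.ndarray:
    """Indices of the k highest scores, best first."""
    return np.argsort(-scores)[:k]


# Toy 2-dimensional "embeddings" standing in for the 768-dimensional BERT vectors.
matrix = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
query = np.array([1.0, 0.0])
scores = cosine_sim(query, matrix)
best = top_k(scores, 2)
print(scores, best)
```

Because the vectors are normalized, sentence length in tokens does not directly scale the score; only the direction of the embedding matters.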

Display the top 10 sentences closest to the question.

In [24]:
pd.set_option("display.max_colwidth", None)  # None disables truncation; passing -1 is deprecated
sentences.assign(distance=distance.flatten()).iloc[np.argsort(-distance).flatten()].head(10)


| | page | section_order | sentence_order | sentence | length | distance |
|---|---|---|---|---|---|---|
| 271 | 25 | 1 | 7 | the agree-ment set the long-term goal of limiting global warming to well below 2°c compared with pre-industrial levels and calls for reaching net zero anthropogenic emissions of co2 and other greenhouse gases during the second half of the 21st century | 252 | 0.783354 |
| 125 | 12 | 2 | 2 | the first is regulations on co2 emissions and fuel efficiency | 63 | 0.765398 |
| 127 | 12 | 2 | 4 | under this framework, which is increas-ingly being adopted by countries worldwide, the required level of cuts in co2 emissions rises each year | 143 | 0.745466 |
| 269 | 25 | 1 | 5 | to help achieve the paris agreement goal of keeping global warming below 2°c,\* we are promoting initiatives under the toyota environmental challenge 2050 | 154 | 0.74509 |
| 517 | 47 | 1 | 17 | and (xiv) the impact of natural calamities including the negative effect on toyota’s vehicle production and sales | 114 | 0.742742 |
| 123 | 12 | 2 | 0 | regulations are being tightened, along with new government policies, to combat global warming | 95 | 0.72449 |
| 146 | 13 | 1 | 4 | if these initiatives accelerate the development of electrified vehicles at other companies, we will have helped hasten the reduction of co2 emissions | 151 | 0.711902 |
| 278 | 25 | 7 | 0 | sustainability meetingreceives reports and deliberates on important manage-ment issues related to enhancing competitiveness and addressing risks over the long term in light of internal and external changes, primarily in environmental, social, and governance areas | 263 | 0.708858 |
| 449 | 37 | 3 | 7 | at the same time, the sustainability meeting reviews and reports on major current risk items in order to promote preventive action | 132 | 0.706135 |
| 132 | 12 | 2 | 9 | this government policy basically aims to increase the number of vehicles on the road with zero co2 emis-sions | 110 | 0.69764 |


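Reasonably relevant sentences appear to be retrieved. However, sentences such as "the agree-ment set the long-term" point to a problem: we cannot tell whether the "Policy" in question is Toyota's own or a global one.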