## Check ESG evaluation automatically

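This notebook examines whether the checklist used for ESG evaluation can be checked automatically.

1. Upload: upload and read the PDF file (integrated report)
2. Preprocess: clean up the extracted text so that it is easier to analyze
3. Retrieval: extract the relevant passages
4. Answering: perform the check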

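Conclusion

* Extracting the passages related to an evaluation item is feasible, and it looks promising for reducing the workload.
  * Even keyword search is reasonably useful.
  * Recent natural language processing techniques (converting text into vectors and searching over them) can also extract passages that appear relevant.
* Automating the "check" itself still leaves points to work out.
  * For example, the decision rule: whether or not to tick an item when N or more related passages are found, and so on.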
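
The decision rule mentioned in the last bullet could be sketched as follows. This is only an illustration: `should_check` and its `min_hits` parameter (the "N" above) are hypothetical names, and a suitable value of N has not been determined.

```python
def should_check(n_related, min_hits=1):
    """Return True when at least `min_hits` related passages were found.

    `min_hits` is a hypothetical tuning parameter (the "N" in the text).
    """
    if min_hits < 1:
        raise ValueError("min_hits must be at least 1")
    return n_related >= min_hits


# Illustrative counts only, not results from the report.
for n_related in (0, 2, 5):
    print(n_related, should_check(n_related, min_hits=3))
```

A plain count threshold is easy to explain to evaluators, but it says nothing about the quality of each passage, which is why the rule is still listed as an open point.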

Next Step

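1. Verify how much the workload can be reduced by extraction alone
2. Build a system for carrying out step 1
3. Set up the organization and mechanism for collecting the source data (documents) in the first place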


### 0. Preparation

Import the libraries and apply the settings needed to run the program.


In [1]:
import os
import sys
import numpy as np
import pandas as pd

In [2]:
def set_root():
    root = os.path.join(os.path.realpath("."), "../")
    if root not in sys.path:
        sys.path.append(root)
    return root

ROOT_DIR = set_root()
DATA_DIR = os.path.join(ROOT_DIR, "data")

### 1. Upload

(Note: in the actual system this step will be a PDF upload, but for now the file is fetched from the web, so it is a download.)


In [3]:
# Toyota's integrated report (Annual Report 2019)
url = "https://global.toyota/pages/global_toyota/ir/library/annual/2019_001_annual_en.pdf"

In [4]:
from chariot.storage import Storage
from evaluator.data.pdf_reader import PDFReader

In [5]:
storage = Storage(DATA_DIR)

file_name = os.path.basename(url)
file_path = storage.download(url, f"raw/{file_name}")
reader = PDFReader()
df = reader.read_to_frame(file_path)





Display the result of reading the PDF.

* page: page number
* order: section number within the page (counted from the top in order of appearance)


In [6]:
df.head(5)

Unnamed: 0,page,order,content
0,0,0,Page 1
1,0,1,Annual Report \nAnnual Report 2019\nFiscal yea...
2,1,0,Page 1
3,1,1,Table of Contents
4,1,2,1 \nTable of Contents\n2 \nMessage from the Pr...


### 2. Preprocess

The text read from the PDF contains various kinds of noise, so we preprocess it to make it easier to handle.


In [7]:
preprocessed = reader.preprocess_frame(df)
preprocessed.head(5)

Unnamed: 0,page,order,content
0,0,0,page 1
1,0,1,annual report annual report 2019fiscal year en...
2,1,1,table of contents
3,1,2,1 table of contents2 message from the presiden...
4,1,3,the annual report 2019 is intended to communic...


Exclude sections that do not contain a sentence.

In [8]:
import re


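# A section counts as containing a sentence if a letter is followed
# (optionally after one whitespace character) by a period or semicolon.
has_sentence = re.compile(r"(•)?\s?[A-Za-z](\s)?(\.|;)")
preprocessed = preprocessed[preprocessed["content"].apply(lambda s: has_sentence.search(s) is not None)]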

In [9]:
print(f"Rows are decreased from {len(df)} to {len(preprocessed)}")

Rows are decreased from 747 to 189
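
To make the filter concrete, the sketch below applies the same pattern to a few made-up strings. The sample texts and the helper name `contains_sentence` are illustrative assumptions, not data from the report.

```python
import re

# Same pattern as in the cell above, written as a raw string.
has_sentence = re.compile(r"(•)?\s?[A-Za-z](\s)?(\.|;)")


def contains_sentence(text):
    """True if a letter is followed by a period or semicolon somewhere in `text`."""
    return has_sentence.search(text) is not None


# Made-up examples for illustration.
samples = [
    "table of contents",                      # heading: no period or semicolon
    "page 1",                                 # page label
    "we aim to reduce co2 emissions.",        # ends with a period
    "• hybrid vehicles; fuel cell vehicles",  # bullet item with a semicolon
]

for text in samples:
    print(repr(text), contains_sentence(text))
```

Because the pattern only looks for a letter before a period or semicolon, abbreviations such as "e.g." also count as sentences, so the filter is a rough heuristic rather than a sentence detector.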


In [10]:
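# Note: assign() returns a new DataFrame. The result is not stored here,
# so the table below has no "length" column.
preprocessed.assign(length=preprocessed["content"].apply(lambda s: len(s)))
preprocessed.head(5)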

Unnamed: 0,page,order,content
4,1,3,the annual report 2019 is intended to communic...
5,1,4,about the pdfthis file is an interactive pdf a...
9,1,8,icons found in each section link to related pa...
16,1,15,toyota’s reports and publications* toyota als...
23,2,2,reforming our company to become a “mobility co...


### 3. Retrieval

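Extract the sections related to the evaluation item.  
There are many possible methods, but here we simply extract the sections that contain keywords from the evaluation item's question. When I tried the task by hand, I usually started by searching for keywords such as "CO2".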


In [11]:
question = "Climate Change impact including CO2 / GHG emissions. Policy or commitment statement"
question = question.lower()
language = "en"

In [12]:
from spacy.util import get_lang_class


class Parser():

    def __init__(self, lang):
        self.lang = lang
        self.parser = get_lang_class(self.lang)()
    
    def parse(self, text):
        return self.parser(text)

Extract keywords from the evaluation item's question.

In [13]:
parser = Parser(language)
# Drop stop words and tokens that start with punctuation (' . ? / ,).
question_words = [t.lemma_ for t in parser.parse(question) if not t.is_stop and not re.match(r"['.?/,]", t.text)]
question_words

['climate',
 'change',
 'impact',
 'including',
 'co2',
 'ghg',
 'emissions',
 'policy',
 'commitment',
 'statement']

For each section of the document, count how many keyword occurrences it contains.

In [14]:
def count_keyword_match(parser, keywords, text):
    # Count the tokens in `text` whose lemma is one of the keywords.
    keyword_set = set(keywords)
    return sum(1 for t in parser.parse(text) if t.lemma_ in keyword_set)


counted = preprocessed.assign(
    keyword_match=preprocessed["content"].apply(
        lambda s: count_keyword_match(parser, question_words, s)))

In [15]:
matched = counted[counted["keyword_match"] > 0]
matched.sort_values(by=["keyword_match"], ascending=False).head(5)

Unnamed: 0,page,order,content,keyword_match
126,12,2,"regulations are being tightened, along with n...",11
384,33,24,global average co2 emissions from new vehicles...,7
125,12,1,speeding the popularization of electrified veh...,7
562,47,1,this report contains forward-looking statement...,6
365,33,1,one means of decarbonization regarded as holdi...,5


As expected, the sections that a keyword search should hit are retrieved.

### 4. Answering

#### 4.1 Use Question Answering Model

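From the sections narrowed down by keyword search, extract the specific passage that answers the question. If the extraction succeeds, could the item be ticked?  
(=> The actual check results are not available at the moment, so the results cannot be verified.)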

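To extract the answer passage, we use a question answering technique from natural language processing. A model trained on SQuAD, a question answering dataset built from Wikipedia, can reportedly [exceed human accuracy](https://rajpurkar.github.io/SQuAD-explorer/) on that benchmark. There is no question answering dataset for ESG, however, so we apply a model trained on SQuAD to ESG as-is.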

When a human performs the check, the following passage is selected:
![image](./images/answer.PNG)

In [16]:
from evaluator.models.question_answer import answer

In [17]:
# Climate Change impact including CO2 / GHG emissions. Policy or commitment statement
asking = "What policy or commitment does company have for climate change impact including CO2 / GHG emissions ?"

Extract the answer passages.

In [18]:
question_context = matched["content"].apply(lambda s: (asking.lower(), s)).tolist()
answers = answer("distilbert-base-uncased-distilled-squad", question_context)

Loading pretrained model...
Prepair the tokenizer...



Set the pipeline.
Answer start.




In [19]:
pd.DataFrame(answers).head(5)

Unnamed: 0,score,start,end,answer
0,0.623756,275,286,"situa-tion,"
1,0.352997,1036,1061,"corporate culture reform,"
2,0.36926,208,239,cfo capital policytechnological
3,0.056642,234,303,continue to hone our competitiveness in the re...
4,0.039247,388,418,toyota new global architecture


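Answers are extracted, but many of them do not make sense.  
At this point it is unclear whether the method itself is at fault or whether it would work with training data (question and answer pairs would have to be created).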
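
The internals of `evaluator.models.question_answer.answer` are not shown in this notebook. As a rough sketch, and assuming it wraps the Hugging Face `transformers` question-answering pipeline, the step could look like the code below. The helper names `load_qa`, `answer_pairs`, and `confident`, and the `min_score` cutoff, are hypothetical. The sample data reuses the five rows of the table above instead of running the model.

```python
def load_qa(model_name="distilbert-base-uncased-distilled-squad"):
    """Build a question-answering pipeline (assumes `transformers` is installed)."""
    from transformers import pipeline  # lazy import: only needed for real inference
    return pipeline("question-answering", model=model_name, tokenizer=model_name)


def answer_pairs(qa, pairs):
    """Apply `qa` to each (question, context) pair; one result dict per pair."""
    return [qa(question=question, context=context) for question, context in pairs]


def confident(answers, min_score=0.5):
    """Keep answers scoring at least `min_score` (a hypothetical cutoff)."""
    return [a for a in answers if a["score"] >= min_score]


# With a real model: answers = answer_pairs(load_qa(), question_context)
# Here the first five rows of the table above stand in for model output.
answers = [
    {"score": 0.623756, "answer": "situa-tion,"},
    {"score": 0.352997, "answer": "corporate culture reform,"},
    {"score": 0.36926, "answer": "cfo capital policytechnological"},
    {"score": 0.056642, "answer": "continue to hone our competitiveness in the re..."},
    {"score": 0.039247, "answer": "toyota new global architecture"},
]
print([a["answer"] for a in confident(answers)])
```

With this cutoff only the top-scoring row survives, and that answer is itself not meaningful, so a score threshold alone does not seem sufficient for deciding whether to tick an item.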

#### 4.2 Use Feature Representation

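Instead of direct question answering, try extracting sentences that are close to the evaluation question (tick the item if such a sentence exists, leave it unticked otherwise).  
In place of the keyword extraction used earlier, we use a method that takes sentence meaning into account somewhat more: specifically, the one recently adopted in Google Search.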

* [BERT](https://www.blog.google/products/search/search-language-understanding-bert/)

First, split the sections into sentences.

In [20]:
sentences = []
for i, row in matched.iterrows():
    c = row["content"]
    for j, s in enumerate(c.replace("•", ".").replace(";", ".").split(".")):
        sentences.append({
            "page": row["page"],
            "section_order": row["order"],
            "sentence_order": j,
            "sentence": s,
            "length": len(s)
        })

sentences = pd.DataFrame(sentences)
sentences.head(5)

Unnamed: 0,page,section_order,sentence_order,sentence,length
0,2,2,0,reforming our company to become a “mobility co...,143
1,2,2,1,in light of technolog-ical innovations in “ca...,118
2,2,2,2,"given this situa-tion, we must transform our ...",103
3,2,2,3,,0
4,4,2,0,unwavering commitment to the development of ou...,130
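
Row 3 of the table above is an empty sentence, which a trailing period produces when splitting on ".". A minimal variant that drops such empty pieces might look like this. `split_sentences` is a hypothetical helper, and the sample text is made up.

```python
def split_sentences(text, delimiters=("•", ";")):
    """Split a section on periods, treating bullets and semicolons as periods."""
    for delimiter in delimiters:
        text = text.replace(delimiter, ".")
    # Strip whitespace and drop the empty pieces left by consecutive or trailing periods.
    return [piece.strip() for piece in text.split(".") if piece.strip()]


# Made-up section text for illustration.
section = "we must transform our business. • reduce co2 emissions; expand hevs."
print(split_sentences(section))
```

Splitting on every period still breaks abbreviations and decimal numbers, so a proper sentence segmenter would be a safer choice if sentence boundaries matter.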


Convert the sentences into vector representations (BERT representations).

In [21]:
from evaluator.features.encoder import encode

In [22]:
model_name = "bert-base-uncased"

In [23]:
embeddings = encode(model_name, sentences["sentence"].values.tolist())

Loading pretrained model...
Prepair the tokenizer...



Set the pipeline.
Inference start.




In [24]:
embeddings.shape

(625, 768)

Extract the sentences in the document whose vector representation is close to that of the evaluation question.

In [25]:
query = encode(model_name, "Climate Change impact including CO2 / GHG emissions. Policy or commitment statement".lower())
query = np.reshape(query, (1, -1))

Loading pretrained model...
Prepair the tokenizer...
Set the pipeline.
Inference start.


In [26]:
from sklearn.metrics.pairwise import cosine_similarity


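# Note: cosine_similarity returns similarities (higher means closer), despite the variable name.
distance = cosine_similarity(query, embeddings)
# Negate before sorting so that the largest similarities come first.
np.sort(-distance).flatten()[:10]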

array([-0.8077492 , -0.76283298, -0.74784697, -0.74546636, -0.70889445,
       -0.6956646 , -0.69559286, -0.69314511, -0.67987736, -0.66801694])
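
For reference, the ranking relies on cosine similarity, which for vectors $u$ and $v$ is $u \cdot v / (\lVert u \rVert \lVert v \rVert)$. The sketch below computes it in plain Python on toy 2-dimensional vectors. The helper names `cosine` and `top_k` are hypothetical, and the vectors are not BERT embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_k(query, vectors, k=10):
    """Return (index, similarity) for the k vectors most similar to `query`."""
    scored = [(cosine(query, v), i) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # highest similarity first
    return [(i, score) for score, i in scored[:k]]


# Toy vectors for illustration only.
query = [1.0, 0.0]
vectors = [[0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]
print(top_k(query, vectors, k=2))
```

Cosine similarity ignores vector length, which is why `[2.0, 0.0]` ranks first here even though it is not equal to the query.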

Show the top 10 sentences closest to the question.

In [27]:
# None disables column truncation (-1 is deprecated since pandas 1.0).
pd.set_option("display.max_colwidth", None)
sentences.assign(distance=distance.flatten()).iloc[np.argsort(-distance).flatten()].head(10)

Unnamed: 0,page,section_order,sentence_order,sentence,length,distance
430,33,24,3,"the average co2 emissions (g-co2/km) of new vehicles in each year, based on fuel efficiency values (co2 emissions) certified by the respective national authorities2020 target: 22%",182,0.807749
267,25,1,7,the agree-ment set the long-term goal of limiting global warming to well below 2°c compared with pre-industrial levels and calls for reaching net zero anthropogenic emissions of co2 and other greenhouse gases during the second half of the 21st century,252,0.762833
427,33,24,0,"global average co2 emissions from new vehicles reduction rate versus 2010 (japan, u",83,0.747847
140,12,2,4,"under this framework, which is increas-ingly being adopted by countries worldwide, the required level of cuts in co2 emissions rises each year",143,0.745466
123,12,1,1,"as part of the toyota environmental challenge 2050, launched in 2015, we set for ourselves the new vehicle zero co2 emissions challenge, under which we aim to reduce by 90% toyota’s global average new vehicle co2 emissions during operation by 2050, compared with the 2010 level",278,0.708894
402,32,2,4,"in 2019, toyota became a signatory to the task force on climate-related financial disclosures (tcfd) recom-mendations",118,0.695665
143,12,2,7,"the second trend entails regulations for zero emis-sion vehicles (zevs), which have come into effect in some parts of the united states and canada, and regulations for new energy vehicles (nevs) in china",205,0.695593
409,33,1,4,our objective is to achieve zero co2 emissions at our plants all over the world by 2050,88,0.693145
124,12,1,2,"since launching the prius hybrid elec-tric vehicle (hev) in 1997, toyota has sold approxi-mately 14 million electrified vehicles around the world (as of july 2019), helping to cut co2 emissions by more than an estimated 113 million tons",237,0.679877
95,10,4,4,"in light of these circumstances, toyota is working with a wide range of partners to contribute to commu-nities by utilizing fuel cell technologies, which emit no co2 or other air pollutants",191,0.668017


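The extracted sentences seem reasonably relevant. However, sentences such as "the agree-ment set the long-term" suggest a problem: it can be impossible to tell whether a "Policy" is Toyota's own or a global one.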
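
One possible, untested way to approach this ambiguity is to keep only the sentences that mention the company or use a first-person subject. The marker list and the `mentions_company` helper below are hypothetical, and the two sample sentences are excerpts from the table above.

```python
import re

# Hypothetical markers suggesting that the company itself is the subject.
COMPANY_MARKERS = ("toyota", "we", "our")


def mentions_company(sentence, markers=COMPANY_MARKERS):
    """True if any marker appears as a whole word in `sentence`."""
    words = set(re.findall(r"[a-z]+", sentence.lower()))
    return any(marker in words for marker in markers)


# Excerpts from the table above.
sentences = [
    "our objective is to achieve zero co2 emissions at our plants all over the world by 2050",
    "the agree-ment set the long-term goal of limiting global warming to well below 2°c",
]

for sentence in sentences:
    print(mentions_company(sentence), sentence)
```

This is only a surface heuristic: a sentence can mention Toyota while describing an external regulation, so it would need to be checked against real evaluation results before being relied on.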