## Check ESG evaluation automatically

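Verify whether the checklist used for ESG evaluation can be checked automatically.

1. Upload: upload and read the PDF file (integrated report)
2. Preprocess: clean up the extracted text so it is easier to analyze
3. Retrieval: extract the relevant passages
4. Answering: run the check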

### 0. Preparation

Import the libraries and apply the settings needed to run the program.


In [3]:
import os
import sys
import numpy as np
import pandas as pd

In [4]:
def set_root():
    root = os.path.join(os.path.realpath("."), "../")
    if root not in sys.path:
        sys.path.append(root)
    return root

ROOT_DIR = set_root()
DATA_DIR = os.path.join(ROOT_DIR, "data")

### 1. Upload

(Note: in the actual system this step will be a PDF upload, but for now the file is fetched from a URL, so it is a download.)


In [5]:
# Toyota's integrated report
url = "https://global.toyota/pages/global_toyota/ir/library/annual/2019_001_annual_en.pdf"

In [6]:
from chariot.storage import Storage
from evaluator.data.pdf_reader import PDFReader

In [8]:
storage = Storage(DATA_DIR)

file_name = os.path.basename(url)
file_path = storage.download(url, f"raw/{file_name}")
reader = PDFReader()
df = reader.read_to_frame(file_path)

Display the PDF reading result

* page: page number
* order: section number within the page (counted from the top in order of appearance)


In [10]:
df.head(5)

Unnamed: 0,page,order,content
0,1,1,
1,1,2,Annual \nReport
2,1,3,Annual Report 2019\nFiscal year ended March 31...
3,1,4,2019
4,1,5,


In [11]:
reader.stop()

### 2. Preprocess

The raw PDF extraction contains various kinds of noise, so it is preprocessed to make it easier to work with.


In [12]:
preprocessed = reader.preprocess_frame(df)
preprocessed.head(5)

Unnamed: 0,page,order,content
0,1,2,annual report
1,2,2,table of contents
2,2,3,1 table of contents 2 message from the preside...
3,2,4,5 recent initiatives 6 organization
4,2,5,7 making ever-better cars: continuing to hone...
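
The internals of `reader.preprocess_frame` are not shown in this notebook. Judging only from the output above (lowercased text, line breaks collapsed, empty rows gone), a minimal sketch of that kind of cleanup might look like the following. `normalize_text` and `drop_empty` are hypothetical names, not part of `evaluator`, and the real implementation may do more.

```python
import re


def normalize_text(text):
    """Lowercase and collapse every whitespace run (including newlines) to one space."""
    return re.sub(r"\s+", " ", text).strip().lower()


def drop_empty(rows):
    """Keep only rows whose normalized content is non-empty."""
    cleaned = []
    for row in rows:
        content = normalize_text(row["content"])
        if content:
            cleaned.append({**row, "content": content})
    return cleaned


# Two rows shaped like the raw reader output shown earlier.
rows = [
    {"page": 1, "order": 1, "content": ""},
    {"page": 1, "order": 2, "content": "Annual \nReport"},
]
print(drop_empty(rows))
```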


Exclude sections that do not contain a sentence

In [29]:
import re


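# A letter, optionally followed by one whitespace character, then a period.
# Raw string so that "\s" and "\." are not treated as string escapes.
has_sentence = re.compile(r"[A-Za-z]\s?\.")
preprocessed = preprocessed[preprocessed["content"].apply(lambda s: has_sentence.search(s) is not None)]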

In [30]:
print(f"Rows decreased from {len(df)} to {len(preprocessed)}")

Rows decreased from 2195 to 193
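
To make the filter concrete, the sketch below applies the same pattern to a few short strings, mostly taken from the tables in this notebook. It is only a heuristic: an abbreviation such as `co.` also matches, so some non-sentences can slip through.

```python
import re

# A letter, optionally one whitespace character, then a period.
has_sentence = re.compile(r"[A-Za-z]\s?\.")

samples = [
    "table of contents",                    # no period: dropped
    "* requires an internet connection.",   # letter before the period: kept
    "5 recent initiatives 6 organization",  # no period: dropped
    "2019.",                                # digit before the period: dropped
]
kept = [s for s in samples if has_sentence.search(s)]
print(kept)
```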


In [41]:
# assign() returns a new frame, so the result has to be stored.
preprocessed = preprocessed.assign(length=preprocessed["content"].str.len())
preprocessed.head(5)

Unnamed: 0,page,order,content,length
14,2,17,porate value and the ways that it is contribut...,121
18,2,21,this file is an interactive pdf and can be nav...,91
21,2,25,of the report and to relevant web pages and pd...,70
22,2,26,* requires an internet connection.,34
28,2,34,* toyota also publishes information on busine...,156


### 3. Retrieval


Extract the sections related to the question.


In [50]:
question = "Climate Change impact including CO2 / GHG emissions. Policy or commitment statement"
question = question.lower()
language = "en"

In [56]:
from spacy.util import get_lang_class


class Parser():

    def __init__(self, lang):
        self.lang = lang
        self.parser = get_lang_class(self.lang)()
    
    def parse(self, text):
        return self.parser(text)

Extract keywords from the question

In [69]:
parser = Parser(language)
# Drop stop words and tokens that start with punctuation (' . ? / ,).
question_words = [t.lemma_ for t in parser.parse(question) if not t.is_stop and not re.match(r"['.?/,]", t.text)]
question_words

['climate',
 'change',
 'impact',
 'including',
 'co2',
 'ghg',
 'emissions',
 'policy',
 'commitment',
 'statement']

Count the keyword occurrences in each section

In [74]:
def count_keyword_match(parser, keywords, text):
    """Count the tokens in text whose lemma is one of the keywords."""
    keyword_set = set(keywords)
    return sum(1 for t in parser.parse(text) if t.lemma_ in keyword_set)


counted = preprocessed.assign(
    keyword_match=preprocessed["content"].apply(
        lambda s: count_keyword_match(parser, question_words, s)))

In [77]:
matched = counted[counted["keyword_match"] > 1]
matched.head(5)

Unnamed: 0,page,order,content,length,keyword_match
192,11,10,cut annual greenhouse gas emissions by 465 ton...,850,2
209,13,6,two major trends in automobile-related environ...,526,4
210,13,7,year. in order to improve corporate average co...,1599,7
240,15,27,to achieve even higher levels of battery durab...,1548,2
592,46,6,lifestyles are on the threshold of profound ch...,1196,3
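
`count_keyword_match` counts every token that hits a keyword, so a section that repeats one keyword several times scores as high as one that covers several different keywords. The sketch below contrasts the two ways of counting. It uses a plain regex tokenizer instead of spaCy to stay self-contained, applies no lemmatization, and runs on a made-up sample sentence; `tokenize`, `total_matches`, and `distinct_matches` are hypothetical helpers.

```python
import re


def tokenize(text):
    """Split lowercased text into alphanumeric tokens (no lemmatization)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def total_matches(keywords, text):
    """Count every token that is a keyword (repeats count each time)."""
    keyword_set = set(keywords)
    return sum(1 for token in tokenize(text) if token in keyword_set)


def distinct_matches(keywords, text):
    """Count how many different keywords appear at least once."""
    return len(set(keywords) & set(tokenize(text)))


keywords = ["climate", "change", "co2", "ghg", "emissions", "policy"]
# Made-up sample text, for illustration only.
text = "regulations on co2 emissions: emissions targets and emissions policy"
print(total_matches(keywords, text), distinct_matches(keywords, text))
```

Filtering on distinct keywords would favor sections that touch several aspects of the question; whether that works better on this report is untested here.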


### 4. Answering

#### 4.1 Use Question Answering Model

From the sections narrowed down by keyword search, extract the specific passages relevant to the answer.

In [78]:
from evaluator.models.question_answer import answer

In [79]:
# Climate Change impact including CO2 / GHG emissions. Policy or commitment statement
asking = "What policy or commitment does company have for climate change impact including CO2 / GHG emissions ?"

Extract the answer spans

In [84]:
question_context = matched["content"].apply(lambda s: (asking.lower(), s)).tolist()
answers = answer("distilbert-base-uncased-distilled-squad", question_context)

Loading pretrained model...
Prepair the tokenizer...



Set the pipeline.
Answer start.




In [86]:
pd.DataFrame(answers)

Unnamed: 0,score,start,end,answer
0,0.220558,673,679,toyota
1,0.154751,156,173,fuel efficiency.
2,0.038653,988,998,government
3,3e-05,752,791,toyota has put into place a struc- ture
4,0.110417,1037,1052,estab- lishment


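The results are underwhelming (the model is trained on Wikipedia-based data, so its accuracy on this kind of report may not be very high).
That said, adding training data might improve them (on its original benchmark, this kind of model is [more accurate than humans](https://rajpurkar.github.io/SQuAD-explorer/)).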
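
One simple way to read the table above is to sort the spans by `score` and discard those below a cutoff. The sketch below does this with the five rows shown; the `0.1` threshold is an arbitrary assumption for illustration, not a tuned value.

```python
# Scores and answers copied from the table above.
answers = [
    {"score": 0.220558, "answer": "toyota"},
    {"score": 0.154751, "answer": "fuel efficiency."},
    {"score": 0.038653, "answer": "government"},
    {"score": 3e-05, "answer": "toyota has put into place a struc- ture"},
    {"score": 0.110417, "answer": "estab- lishment"},
]

MIN_SCORE = 0.1  # hypothetical cutoff, chosen only for this example

# Highest score first, then keep only the spans at or above the cutoff.
ranked = sorted(answers, key=lambda a: a["score"], reverse=True)
confident = [a for a in ranked if a["score"] >= MIN_SCORE]

for a in confident:
    print(f"{a['score']:.3f}  {a['answer']}")
```

Even the surviving spans are short fragments rather than policy statements, which matches the assessment that the extractive model alone is not sufficient here.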

#### 4.2 Use Feature Representation

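Convert both the evaluation item and the sentences into vector representations, then extract the sentences closest to the evaluation item.
The conversion to vectors uses the method adopted in Google Search.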

* [BERT](https://www.blog.google/products/search/search-language-understanding-bert/)

Split sections into sentences

In [90]:
sentences = []
for i, row in matched.iterrows():
    c = row["content"]
    for j, s in enumerate(c.split(".")):
        sentences.append({
            "page": row["page"],
            "section_order": row["order"],
            "sentence_order": j,
            "sentence": s,
            "length": len(s)
        })

sentences = pd.DataFrame(sentences)
sentences.head(5)

Unnamed: 0,page,section_order,sentence_order,sentence,length
0,11,10,0,cut annual greenhouse gas emissions by 465 ton...,139
1,11,10,1,72 tons,7
2,11,10,2,"under the project, the fuel cell system of t...",247
3,11,10,3,we plan to deploy 10 of these trucks to haul...,140
4,11,10,4,by adapting its fuel cell technologies to fr...,312
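
Splitting on every `.` also cuts abbreviations and numbers apart, which produces very short fragments such as `72 tons` above. A possible refinement, sketched below, splits only on a period followed by whitespace and drops fragments under a minimum length. `split_sentences` and its `min_length` default are hypothetical, the sample text is assembled for illustration, and this is not what the notebook ran.

```python
import re


def split_sentences(text, min_length=10):
    """Split on a period followed by whitespace; drop fragments shorter than min_length."""
    parts = re.split(r"(?<=\.)\s+", text.strip())
    return [p for p in parts if len(p) >= min_length]


# Illustrative sample: "co., ltd." has no whitespace right after the first period,
# so it is not split there.
text = "toyota partners with byd co., ltd. 72 tons. we plan to deploy 10 of these trucks."
for sentence in split_sentences(text):
    print(sentence)
```

An abbreviation followed by a space in the middle of a sentence (for example `ltd. on ...`) would still be split, so this only reduces the noise rather than removing it.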


Create vector representations (BERT representations) of the sentences

In [91]:
from evaluator.features.encoder import encode

In [92]:
model_name = "bert-base-uncased"

In [93]:
embeddings = encode(model_name, sentences["sentence"].values.tolist())

Loading pretrained model...
Prepair the tokenizer...



Set the pipeline.
Inference start.




In [94]:
embeddings.shape

(42, 768)

Extract the sentences closest to the vector representation of the evaluation item.

In [95]:
query = encode(model_name, "Climate Change impact including CO2 / GHG emissions. Policy or commitment statement".lower())
query = np.reshape(query, (1, -1))

Loading pretrained model...
Prepair the tokenizer...
Set the pipeline.
Inference start.


In [96]:
from sklearn.metrics.pairwise import cosine_similarity


distance = cosine_similarity(query, embeddings)
np.sort(-distance).flatten()[:10]

array([-0.80772909, -0.75797244, -0.74253654, -0.73154611, -0.67573654,
       -0.57177246, -0.55729908, -0.55063786, -0.53485747, -0.53485747])
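
Negating the similarities before `np.sort` is a trick to get descending order, which is why the array above is negative. A dependency-free sketch of the same ranking follows, with toy 3-dimensional vectors standing in for the 768-dimensional BERT embeddings; `cosine` and `top_k` are hypothetical helper names.

```python
import math


def cosine(u, v):
    """Cosine similarity: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top_k(query, vectors, k):
    """Return (index, similarity) pairs for the k vectors most similar to query."""
    scored = [(i, cosine(query, v)) for i, v in enumerate(vectors)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


query = [1.0, 0.0, 0.0]
vectors = [
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [2.0, 0.0, 0.0],  # same direction, different length
    [1.0, 1.0, 0.0],  # 45 degrees away from the query
]
for index, similarity in top_k(query, vectors, k=2):
    print(index, round(similarity, 3))
```

Because cosine similarity ignores vector length, the second vector ranks first even though it is twice as long as the query.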

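Show the top 10 sentences closest to the evaluation item (the `distance` column holds cosine similarity, so higher means closer)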

In [97]:
pd.set_option("display.max_colwidth", None)  # show full text; -1 is deprecated since pandas 1.0
sentences.assign(distance=distance.flatten()).iloc[np.argsort(-distance).flatten()].head(10)

Unnamed: 0,page,section_order,sentence_order,sentence,length,distance
7,13,6,1,the first is regulations on co2 emissions and fuel efficiency,65,0.807729
13,13,7,3,"the second trend entails regulations for zero emis- sion vehicles (zevs), which have come into effect in some parts of the united states and canada, and regulations for new energy vehicles (nevs) in china",209,0.757972
31,15,27,11,"in preparing for the spread of bevs and to win over customers to our bevs, we have a long list of initia- tives to follow through on, including developing vehi- cles, ensuring the stable supply of batteries, improving the durability of batteries, and preparing for the reuse of older batteries",299,0.742537
12,13,7,2,"with regard to co2 regulations in europe, for example, toyota led the industry in meeting 2017 regulatory values, and, although the current-generation prius satisfies 2025 regulatory values, it is challenging for suvs and other types of relatively heavy vehicles, even hybrid models, to clear this regulatory hurdle, necessitating the greater prolif- eration of phevs, bevs, and fcevs",392,0.731546
36,46,6,2,"from here on out, informa- tion links will connect all items and services that sup- port our daily lives",105,0.675737
19,13,7,9,"with a strong vision,",22,0.571772
28,15,27,8,"(catl), byd co",15,0.557299
20,15,27,0,to achieve even higher levels of battery durability in the bevs we plan to launch in 2020,90,0.550638
29,15,27,9,", ltd",5,0.534857
27,15,27,7,", ltd",5,0.534857


The second sentence seems arguably relevant.
I would like to verify this with more data and more evaluation items.