## Check ESG evaluation automatically

This notebook examines whether the checklist used for ESG evaluation can be checked automatically.

1. Upload: upload and read the PDF file (integrated report)
2. Preprocess: clean up the extracted text so that it is easier to analyze
3. Retrieval: extract the relevant passages
4. Answering: perform the check

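Conclusions

* Extracting the passages related to an evaluation item is feasible.
  * At this point, keyword search performs better.
  * Recent NLP techniques (converting text into vectors and searching over them) do not yet give good accuracy here.
* Actually "checking" an item is hard.
  * To begin with, the decision condition is hard to define (for example, whether to tick the item when N or more related passages are found).

Next, we want to verify how useful keywords are in practice, and whether NLP techniques work once a reasonable amount of training data has been created.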
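
To make the "N or more related passages" condition concrete, here is a minimal sketch of such a decision rule. The function name `should_check` and both thresholds are hypothetical; the notebook does not settle on actual values, and the counts below are made up for illustration.

```python
def should_check(match_counts, min_sections=3, min_matches=2):
    """Return True when enough sections mention enough keywords.

    min_sections and min_matches are hypothetical thresholds, not values
    chosen or validated in this notebook.
    """
    strong = [c for c in match_counts if c >= min_matches]
    return len(strong) >= min_sections


# Keyword-match counts per section (made-up numbers for illustration).
print(should_check([4, 3, 2, 1]))
print(should_check([1, 0, 2]))
```

Even in this toy form, the outcome depends entirely on two arbitrary numbers, which is exactly why the condition is hard to set without labeled check results.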

### 0. Preparation

Import the libraries and apply the settings needed to run the program.


In [1]:
import os
import sys
import numpy as np
import pandas as pd

In [2]:
def set_root():
    root = os.path.join(os.path.realpath("."), "../")
    if root not in sys.path:
        sys.path.append(root)
    return root

ROOT_DIR = set_root()
DATA_DIR = os.path.join(ROOT_DIR, "data")

### 1. Upload

(Note: in the actual system this step would be a PDF upload, but for now the file is fetched from the web, so it is a download.)


In [3]:
# Toyota's integrated report
url = "https://global.toyota/pages/global_toyota/ir/library/annual/2019_001_annual_en.pdf"

In [4]:
from chariot.storage import Storage
from evaluator.data.pdf_reader import PDFReader

In [5]:
storage = Storage(DATA_DIR)

file_name = os.path.basename(url)
file_path = storage.download(url, f"raw/{file_name}")
reader = PDFReader()
df = reader.read_to_frame(file_path)





Display the result of reading the PDF.

* page: page number
* order: section number within the page (counted from the top in order of appearance)


In [6]:
df.head(5)

Unnamed: 0,page,order,content
0,0,0,Page 1
1,0,1,Annual Report \nAnnual Report 2019\nFiscal yea...
2,1,0,Page 1
3,1,1,Table of Contents
4,1,2,1 \nTable of Contents\n2 \nMessage from the Pr...


### 2. Preprocess

The text read from the PDF contains various kinds of noise, so preprocess it to make it easier to handle.


In [7]:
preprocessed = reader.preprocess_frame(df)
preprocessed.head(5)

Unnamed: 0,page,order,content
0,0,0,page 1
1,0,1,annual report annual report 2019fiscal year en...
2,1,0,page 1
3,1,1,table of contents
4,1,2,1 table of contents2 message from the presiden...
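
The implementation of `reader.preprocess_frame` is not shown in this notebook. As a rough illustration of the kind of cleanup visible in the output above (lowercasing and flattening line breaks), a sketch might look like the following. `normalize_text` is a hypothetical helper, not the project's code, and unlike the output above it joins wrapped lines with a space.

```python
import re


def normalize_text(text):
    """Lowercase and flatten line breaks (illustrative, not PDFReader's code)."""
    text = text.lower()
    text = re.sub(r"\s*\n\s*", " ", text)  # join wrapped lines with a space
    text = re.sub(r"[ \t]+", " ", text)    # collapse repeated spaces
    return text.strip()


print(normalize_text("Annual Report \nAnnual Report 2019\nFiscal year"))
```

Joining with a space avoids glued tokens such as `2019fiscal`, which would otherwise hurt keyword matching later.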


Exclude sections that do not contain a sentence.

In [8]:
import re


# A letter followed by a period (optionally separated by one whitespace) is treated as a sentence end.
has_sentence = re.compile(r"[A-Za-z]\s?\.")
preprocessed = preprocessed[preprocessed["content"].apply(lambda s: has_sentence.search(s) is not None)]

In [9]:
print(f"Rows are decreased from {len(df)} to {len(preprocessed)}")

Rows are decreased from 747 to 189
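
To see what this heuristic keeps and drops, the same pattern can be tried on a few strings. The sample strings are made up for illustration.

```python
import re

has_sentence = re.compile(r"[A-Za-z]\s?\.")

samples = [
    "table of contents",
    "page 1",
    "this file is an interactive pdf.",
    "fiscal year ended march 31, 2019",
    "see p. 12",
]
# Keep only strings where a letter is followed by a period.
kept = [s for s in samples if has_sentence.search(s)]
print(kept)
```

Headings and page labels are dropped, but so is any text that ends in a digit without a period, and abbreviations such as `p.` are enough to keep a fragment. The filter is a cheap approximation rather than real sentence detection.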


In [10]:
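# Note: assign() returns a new DataFrame; the result is not stored here,
# so the frame shown below has no "length" column.
preprocessed.assign(length=preprocessed["content"].apply(lambda s: len(s)))
preprocessed.head(5)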

Unnamed: 0,page,order,content
5,1,3,the annual report 2019 is intended to communic...
6,1,4,about the pdfthis file is an interactive pdf a...
10,1,8,icons found in each section link to related pa...
17,1,15,toyota’s reports and publications* toyota als...
26,2,2,reforming our company to become a “mobility co...


### 3. Retrieval

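Extract the sections related to the evaluation item.  
There are many possible methods, but here we simply extract the sections that contain keywords from the evaluation item's question. When I did the task by hand, I usually started by searching for keywords such as "CO2".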


In [11]:
question = "Climate Change impact including CO2 / GHG emissions. Policy or commitment statement"
question = question.lower()
language = "en"

In [12]:
from spacy.util import get_lang_class


class Parser():

    def __init__(self, lang):
        self.lang = lang
        self.parser = get_lang_class(self.lang)()
    
    def parse(self, text):
        return self.parser(text)

Extract keywords from the evaluation item's question.

In [13]:
parser = Parser(language)
question_words = [t.lemma_ for t in parser.parse(question) if not t.is_stop and not re.match(r"['.?/,]", t.text)]
question_words

['climate',
 'change',
 'impact',
 'including',
 'co2',
 'ghg',
 'emissions',
 'policy',
 'commitment',
 'statement']

For each section in the document, count the number of keyword occurrences.

In [14]:
def count_keyword_match(parser, keywords, text):
    # Count the tokens in `text` whose lemma is one of the question keywords.
    keyword_set = set(keywords)
    return sum(1 for t in parser.parse(text) if t.lemma_ in keyword_set)


counted = preprocessed.assign(
    keyword_match=preprocessed["content"].apply(
        lambda s: count_keyword_match(parser, question_words, s)))

In [15]:
matched = counted[counted["keyword_match"] > 0]
matched.sort_values(by=["keyword_match"], ascending=False).head(5)

Unnamed: 0,page,order,content,keyword_match
157,12,2,"regulations are being tightened, along with n...",11
486,33,24,global average co2 emissions from new vehicles...,7
156,12,1,speeding the popularization of electrified veh...,7
741,47,1,this report contains forward-looking statement...,6
463,33,1,one means of decarbonization regarded as holdi...,5


As expected, the sections that a keyword search would hit are retrieved.

### 4. Answering

#### 4.1 Use Question Answering Model

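From the sections narrowed down by keyword search, extract the specific passage that answers the item. Could the item be ticked whenever extraction succeeds?  
(=> The actual check results are not available at the moment, so the outcome cannot be verified.)

To extract the answer passage, we use an NLP question-answering method. A model trained on SQuAD, a question-answering dataset built from Wikipedia, can nominally [outperform humans](https://rajpurkar.github.io/SQuAD-explorer/) on that benchmark. No ESG question-answering dataset exists, however, so we apply a model trained on SQuAD to ESG text as-is.

When a human performs the check, the relevant passage is the following: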
![image](./images/answer.PNG)

In [16]:
from evaluator.models.question_answer import answer

In [17]:
# Climate Change impact including CO2 / GHG emissions. Policy or commitment statement
asking = "What policy or commitment does company have for climate change impact including CO2 / GHG emissions ?"

Extract answer passages.

In [18]:
question_context = matched["content"].apply(lambda s: (asking.lower(), s)).tolist()
answers = answer("distilbert-base-uncased-distilled-squad", question_context)

Loading pretrained model...
Prepair the tokenizer...
Set the pipeline.
Answer start.

In [19]:
pd.DataFrame(answers).head(5)

Unnamed: 0,score,start,end,answer
0,0.623756,275,286,"situa-tion,"
1,0.352997,1036,1061,"corporate culture reform,"
2,0.36926,208,239,cfo capital policytechnological
3,0.056642,234,303,continue to hone our competitiveness in the re...
4,0.039247,388,418,toyota new global architecture


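Answers are extracted, but many of them do not make sense.  
At this point it is unclear whether the method itself is at fault or whether it would work given training data (question-answer pairs would have to be created).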
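
One way to explore the "tick the item when extraction succeeds" idea is to keep only spans whose score clears a cut-off. The sketch below is illustrative only. `confident_answers`, `fake_qa`, and `min_score` are hypothetical names, and the stand-in model exists so the example runs without downloading weights. Any callable that accepts `question=` and `context=` and returns a dict with `score` and `answer` keys should fit, which is the convention used by the Hugging Face `transformers` question-answering pipeline.

```python
def confident_answers(qa, question, contexts, min_score=0.5):
    """Run an extractive-QA callable over contexts and keep confident spans.

    `qa` is any callable taking question= and context= keyword arguments and
    returning a dict with "score" and "answer" keys. min_score is a
    hypothetical cut-off, not a tuned value.
    """
    kept = []
    for context in contexts:
        result = qa(question=question, context=context)
        if result["score"] >= min_score and result["answer"].strip():
            kept.append(result)
    return kept


def fake_qa(question, context):
    """Stand-in model for illustration: 'answers' with the first word."""
    words = context.split()
    return {"score": 0.9 if "co2" in context else 0.1,
            "answer": words[0] if words else ""}


results = confident_answers(fake_qa, "what policy?", ["co2 targets", "board members"])
print(results)
```

Note that in the table above the highest-scoring answer (`situa-tion,`) is not meaningful, so a score threshold alone is unlikely to be a sufficient check condition.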

#### 4.2 Use Feature Representation

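Instead of direct question answering, try extracting sentences that are close to the evaluation question (tick the item if such a sentence exists, otherwise leave it unticked).  
In place of the keyword extraction above, use a method that takes sentence meaning into account a little more: specifically, the method recently adopted in Google Search.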

* [BERT](https://www.blog.google/products/search/search-language-understanding-bert/)

First, split the sections into sentences.

In [20]:
sentences = []
for i, row in matched.iterrows():
    c = row["content"]
    for j, s in enumerate(c.split(".")):
        sentences.append({
            "page": row["page"],
            "section_order": row["order"],
            "sentence_order": j,
            "sentence": s,
            "length": len(s)
        })

sentences = pd.DataFrame(sentences)
sentences.head(5)

Unnamed: 0,page,section_order,sentence_order,sentence,length
0,2,2,0,reforming our company to become a “mobility co...,143
1,2,2,1,in light of technolog-ical innovations in “ca...,118
2,2,2,2,"given this situa-tion, we must transform our ...",103
3,2,2,3,,0
4,4,2,0,unwavering commitment to the development of ou...,130


Convert each sentence into a vector representation (BERT representation).

In [21]:
from evaluator.features.encoder import encode

In [22]:
model_name = "bert-base-uncased"

In [23]:
embeddings = encode(model_name, sentences["sentence"].values.tolist())

Loading pretrained model...
Prepair the tokenizer...
Set the pipeline.
Inference start.

100%|██████████| 60/60 [01:37<00:00,  1.62s/it]


In [24]:
embeddings.shape

(599, 768)
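
How `encode` reduces BERT's per-token vectors to a single 768-dimensional vector per sentence is not shown in this notebook. One common choice is mean pooling over the non-padding tokens; the sketch below illustrates that technique on tiny made-up arrays and is an assumption, not a description of `evaluator.features.encoder`.

```python
import numpy as np


def mean_pool(token_vectors, attention_mask):
    """Average token vectors per sentence, ignoring padded positions.

    token_vectors: (batch, seq_len, dim); attention_mask: (batch, seq_len),
    with 1 for real tokens and 0 for padding.
    """
    mask = attention_mask[:, :, None].astype(token_vectors.dtype)
    summed = (token_vectors * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1.0, None)  # avoid division by zero
    return summed / counts


tokens = np.array([[[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]])  # last token is padding
mask = np.array([[1, 1, 0]])
pooled = mean_pool(tokens, mask)
print(pooled.shape, pooled)
```

The pooling strategy matters for retrieval quality, so it is worth confirming what `encode` actually does before drawing conclusions from the similarity scores.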

Extract the sentences in the document whose vector representations are close to that of the evaluation question.

In [25]:
query = encode(model_name, "Climate Change impact including CO2 / GHG emissions. Policy or commitment statement".lower())
query = np.reshape(query, (1, -1))

Loading pretrained model...
Prepair the tokenizer...
Set the pipeline.
Inference start.


In [26]:
from sklearn.metrics.pairwise import cosine_similarity


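# Note: despite its name, `distance` holds cosine similarity (higher = closer).
distance = cosine_similarity(query, embeddings)
# Negate before sorting so that the most similar sentences come first.
np.sort(-distance).flatten()[:10]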

array([-0.79577988, -0.77521259, -0.76369322, -0.75728184, -0.75303959,
       -0.75146863, -0.74910982, -0.74798706, -0.74662738, -0.74573172])
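
For reference, cosine similarity between a query vector and each row of an embedding matrix is the dot product of the L2-normalized vectors. The sketch below computes it with plain NumPy and returns the top-k rows. `top_k_by_cosine` is a hypothetical helper, and it takes a 1-D query rather than the `(1, -1)` shape used above.

```python
import numpy as np


def top_k_by_cosine(query, embeddings, k=3):
    """Return (indices, similarities) of the k rows most similar to query."""
    q = query / np.linalg.norm(query)
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = e @ q                       # cosine similarity per row
    order = np.argsort(-sims)[:k]      # highest similarity first
    return order, sims[order]


embeddings = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
query = np.array([1.0, 0.0])
idx, sims = top_k_by_cosine(query, embeddings, k=2)
print(idx, sims)
```

Returning positive similarities alongside the indices avoids the negated values seen in the output above.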

Show the top 10 sentences closest to the question.

In [27]:
pd.set_option("display.max_colwidth", None)  # show full text; -1 is deprecated since pandas 1.0
sentences.assign(distance=distance.flatten()).iloc[np.argsort(-distance).flatten()].head(10)

Unnamed: 0,page,section_order,sentence_order,sentence,length,distance
419,33,24,2,", europe, china)(%)• the average co2 emissions (g-co2/km) of new vehicles in each year, based on fuel efficiency values (co2 emissions) certified by the respective national authorities2020 target: 22%",202,0.79578
49,8,9,5,"in addition to global changes, such as the shift from sedans to suvs, region-specific customer preferenc-es are in constant flux",129,0.775213
596,47,1,4,"dollar, the euro, the australian dollar, the russian ruble, the canadian dollar and the british pound, and interest rates fluctuations; (iii) changes in funding environment in financial markets and increased competition in the financial services industry; (iv) toyota’s ability to market and distribute effectively; (v) toyota’s ability to realize production efficiencies and to implement capital expenditures at the levels and times planned by management; (vi) changes in the laws, regulations and government policies in the markets in which toyota operates that affect toyota’s automotive operations, par-ticularly laws, regulations and government policies relating to vehicle safety including remedial measures such as recalls, trade, environmental protection, vehicle emissions and vehicle fuel economy, as well as changes in laws, regulations and government policies that affect toyota’s other operations, including the outcome of current and future litigation and other legal proceed-ings, government proceedings and investigations; (vii) political and economic instability in the markets in which toyota operates; (viii) toyota’s ability to timely develop and achieve market acceptance of new products that meet customer demand; (ix) any damage to toyota’s brand image; (x) toyota’s reliance on various suppliers for the provision of supplies; (xi) increases in prices of raw materials; (xii) toyota’s reliance on various digital and information technologies; (xiii) fuel shortages or interruptions in electricity, transportation systems, labor strikes, work stop-pages or other interruptions to, or difficulties in, the employment of labor in the major markets where toyota purchases materials, components and supplies for the production of its products or where its products are produced, distributed or sold; and (xiv) the impact of natural calamities including the negative effect on toyota’s vehicle production and sales",1935,0.763693
336,30,6,0,"board of directors and audit & supervisory board members (as of october 1, 2019)apr",83,0.757282
126,12,1,4,"5 million electrified vehi-cles, including at least 4",53,0.75304
389,30,6,53,2013 chairman of tmc (to present)akio toyodapositions and areas of responsibility: chief executive officer chief branding officerapr,134,0.751469
341,30,6,5,"2015 vice-minister of ministry of economy, trade and industryjul",65,0.74911
364,30,6,28,2017 managing executive officer of sumitomo mitsui banking corporation (to present)jun,87,0.747987
18,4,2,14,"december 2019akio toyodapresident, member of the board of directorstoyota motor corporation",91,0.746627
417,33,24,0,"global average co2 emissions from new vehicles reduction rate versus 2010 (japan, u",83,0.745732


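Many of the sentences have no clear connection to the question. Some relevant-looking passages, such as "global average co2 emissions from new vehicles reduction rate versus 2010", are retrieved, but sentences with nearly the same score (distance) vary widely in whether they are relevant.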