## Search by Sentence Representation

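This notebook matches evaluation items against sentences in an integrated report using vector representations, and checks how well sentences useful for each evaluation item can be extracted.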

* Representation
  * [BERT](https://www.blog.google/products/search/search-language-understanding-bert/)
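
As a rough illustration of the matching idea above, the sketch below ranks a few toy vectors by cosine similarity to a query vector. The function name `rank_by_cosine` and the 2-dimensional vectors are hypothetical and chosen only for readability; the notebook itself works with 768-dimensional BERT vectors.

```python
import numpy as np


def rank_by_cosine(query, candidates):
    """Return candidate indices sorted by descending cosine similarity, plus the similarities."""
    query = np.asarray(query, dtype=float)
    candidates = np.asarray(candidates, dtype=float)
    # Normalize to unit length so the dot product equals the cosine similarity.
    q = query / np.linalg.norm(query)
    c = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    sims = c @ q
    # Negating the scores makes argsort return the most similar items first.
    order = np.argsort(-sims)
    return order, sims


# Toy data (made up): one query and three "sentence" vectors.
toy_query = [1.0, 0.0]
toy_sentences = [[0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]
order, sims = rank_by_cosine(toy_query, toy_sentences)
```

Because cosine similarity ignores vector length, the third toy vector, which points in the same direction as the query, ranks first even though its magnitude differs.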


### Preparation

In [1]:
import os
import sys
import numpy as np
import pandas as pd

In [2]:
def set_root():
    """Add the parent of the working directory to sys.path and return it."""
    root = os.path.join(os.path.realpath("."), "../")
    if root not in sys.path:
        sys.path.append(root)
    return root

ROOT_DIR = set_root()
DATA_DIR = os.path.join(ROOT_DIR, "data")

### Download Report

In [3]:
url = "https://global.toyota/pages/global_toyota/ir/library/annual/2019_001_annual_en.pdf"

In [4]:
from chariot.storage import Storage
from evaluator.data.pdf_reader import PDFReader

In [5]:
storage = Storage(DATA_DIR)

file_name = os.path.basename(url)
file_path = storage.download(url, f"raw/{file_name}")
reader = PDFReader()
df = reader.read_to_frame(file_path)

df.head(5)

| | page | order | content |
|---|---|---|---|
| 0 | 1 | 1 | |
| 1 | 1 | 2 | Annual \nReport |
| 2 | 1 | 3 | Annual Report 2019\nFiscal year ended March 31... |
| 3 | 1 | 4 | 2019 |
| 4 | 1 | 5 | |


In [6]:
df = reader.preprocess_frame(df)
df.head(5)

| | page | order | content |
|---|---|---|---|
| 0 | 1 | 2 | annual report |
| 1 | 2 | 2 | table of contents |
| 2 | 2 | 3 | 1 table of contents 2 message from the preside... |
| 3 | 2 | 4 | 5 recent initiatives 6 organization |
| 4 | 2 | 5 | 7 making ever-better cars: continuing to hone... |


In [7]:
len(df)

625

In [8]:
reader.stop()

### Calculate Sentence Representation

In [9]:
from evaluator.features.encoder import encode

In [10]:
model_name = "bert-base-uncased"

In [11]:
embeddings = encode(model_name, df["content"].values.tolist())

Loading pretrained model...
Prepair the tokenizer...


Set the pipeline.
Inference start.


100%|██████████████████████████████████████████████████████████████████████████████████| 63/63 [01:16<00:00,  1.22s/it]


In [12]:
embeddings.shape

(625, 768)
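
The notebook does not show how `encode` turns BERT's per-token outputs into a single 768-dimensional vector per sentence. One common choice is masked mean pooling, sketched below on toy data. This is an assumption for illustration, not the confirmed behavior of `evaluator.features.encoder.encode`; the name `masked_mean_pool` and the toy arrays are hypothetical.

```python
import numpy as np


def masked_mean_pool(token_vectors, attention_mask):
    """Average token vectors per sentence, ignoring padded positions.

    token_vectors: array of shape (batch, tokens, hidden)
    attention_mask: array of shape (batch, tokens), 1 for real tokens, 0 for padding
    """
    token_vectors = np.asarray(token_vectors, dtype=float)
    mask = np.asarray(attention_mask, dtype=float)[..., None]  # (batch, tokens, 1)
    summed = (token_vectors * mask).sum(axis=1)
    # Guard against division by zero for an all-padding row.
    counts = np.clip(mask.sum(axis=1), 1.0, None)
    return summed / counts


# Toy data (made up): 2 sentences, 3 token slots, hidden size 2.
toy_tokens = np.array([
    [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],  # last slot is padding
    [[0.0, 2.0], [2.0, 0.0], [4.0, 4.0]],
])
toy_mask = np.array([[1, 1, 0], [1, 1, 1]])
pooled = masked_mean_pool(toy_tokens, toy_mask)
```

With this kind of pooling, the output has one row per input sentence and one column per hidden unit, which is consistent with the `(625, 768)` shape above for 625 rows and `bert-base-uncased`.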

### Query by Evaluation Item

In [13]:
query = encode(model_name, "Climate Change impact including CO2 / GHG emissions. Policy or commitment statement".lower())
query = np.reshape(query, (1, -1))

Loading pretrained model...
Prepair the tokenizer...
Set the pipeline.
Inference start.


In [23]:
from sklearn.metrics.pairwise import cosine_similarity


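# Note: despite the variable name, this is a similarity (higher means closer).
distance = cosine_similarity(query, embeddings)
# Negating before sorting gives descending order, so the printed values are negated similarities.
np.sort(-distance).flatten()[:10]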

array([-0.806189  , -0.79814149, -0.79535683, -0.77796453, -0.77390577,
       -0.77082201, -0.77050985, -0.7694012 , -0.76785613, -0.76517533])
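
The values above are negative only because the similarities were negated to sort in descending order. A minimal sketch of a helper that returns the top scores with their original sign, along with the matching row indices, might look like this. The name `top_k_scores` and the toy similarity row are hypothetical.

```python
import numpy as np


def top_k_scores(similarity, k=10):
    """Return (indices, scores) of the k highest similarities, highest first."""
    flat = np.asarray(similarity, dtype=float).ravel()
    k = min(k, flat.size)
    # Sort on the negated scores to get descending order, then index the
    # original array so the returned scores keep their sign.
    top_idx = np.argsort(-flat)[:k]
    return top_idx, flat[top_idx]


# Toy data (made up): a 1 x 4 similarity row, shaped like cosine_similarity's output.
toy_similarity = np.array([[0.2, 0.9, 0.5, 0.7]])
idx, scores = top_k_scores(toy_similarity, k=3)
```

The returned indices can be passed to `df.iloc[...]` to look up the matching rows, which is the same lookup the next cell performs with `np.argsort(-distance).flatten()`.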

In [27]:
# None disables column-width truncation; passing -1 is deprecated since pandas 1.0.
pd.set_option("display.max_colwidth", None)
df.assign(distance=distance.flatten()).iloc[np.argsort(-distance).flatten()].head(10)

| | page | order | content | distance |
|---|---|---|---|---|
| 520 | 38 | 45 | web risk management (sustainability data book 2019, p. 108) | 0.806189 |
| 489 | 34 | 43 | co2 reduction effect of 13.53 million hevs | 0.798141 |
| 471 | 33 | 11 | • make annual global sales of more than 5.5 million electrified vehicles, including more than 1 million zero-emission vehicles (bevs and fcevs) reduce global average co2 emissions in g-co2/km from new vehicles by 35% or more compared to 2010 levels (may vary depending on market conditions) | 0.795357 |
| 29 | 2 | 35 | investors https://global.toyota/en/ir/ | 0.777965 |
| 572 | 43 | 44 | kanban method adopted (1963)  nummi, a joint corporation with gm, established in the u.s. (1984) | 0.773906 |
| 575 | 43 | 49 | losses (fy 2009) | 0.770822 |
| 281 | 19 | 16 | (e-care, agent, etc.)authentication ota update of | 0.77051 |
| 519 | 38 | 44 | form 20-f for the year ended march 31, 2019web | 0.769401 |
| 582 | 44 | 18 | depreciation expenses (note 3) (billions of yen) 1,032.0 812.3 732.9 727.3 775.9 806.2 885.1 893.2 964.4 984.8 | 0.767856 |
| 620 | 48 | 23 | acceptance of new products that meet customer demand; (ix) any damage to toyota’s brand | 0.765175 |
