# Comparing BERT and PubMedBERT on Biomedical Named Entity Recognition

Note: This notebook is a bonus for the Japanese edition. Using a GPU is recommended.

## Setup

### Installation

In [1]:
!pip install -Uq spacy[transformers]==3.2.1

### Downloading the dataset

We use the BC5CDR dataset for biomedical named entity recognition. There are two entity types to recognize: Chemical and Disease.

https://github.com/cambridgeltl/MTL-Bioinformatics-2016/tree/master/data

In [3]:
!wget https://raw.githubusercontent.com/cambridgeltl/MTL-Bioinformatics-2016/master/data/BC5CDR-IOB/train.tsv
!wget https://raw.githubusercontent.com/cambridgeltl/MTL-Bioinformatics-2016/master/data/BC5CDR-IOB/devel.tsv
!wget https://raw.githubusercontent.com/cambridgeltl/MTL-Bioinformatics-2016/master/data/BC5CDR-IOB/test.tsv

In [5]:
!head train.tsv

Selegiline	B-Chemical
-	O
induced	O
postural	B-Disease
hypotension	I-Disease
in	O
Parkinson	B-Disease
'	I-Disease
s	I-Disease
disease	I-Disease
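
The `head` output shows one token and one IOB tag per line, separated by a tab. As a rough sketch of how to read this format, the snippet below groups lines into sentences and collects the entity spans. It assumes that sentences are separated by blank lines and that every non-blank line is exactly `token<TAB>tag`; the helper names `read_iob` and `iob_to_spans` are made up for this illustration.

```python
from collections import Counter


def read_iob(lines):
    """Group `token<TAB>tag` lines into sentences; a blank line ends a sentence."""
    sentences, current = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            if current:
                sentences.append(current)
                current = []
            continue
        token, tag = line.split("\t")
        current.append((token, tag))
    if current:
        sentences.append(current)
    return sentences


def iob_to_spans(sentence):
    """Return (entity_text, label) pairs from one IOB-tagged sentence."""
    spans, tokens, label = [], [], None
    for token, tag in sentence:
        # An I- tag with the same label continues the current entity.
        if tag.startswith("I-") and label == tag[2:]:
            tokens.append(token)
            continue
        # Anything else closes the current entity, if there is one.
        if tokens:
            spans.append((" ".join(tokens), label))
            tokens, label = [], None
        if tag != "O":
            tokens, label = [token], tag[2:]
    if tokens:
        spans.append((" ".join(tokens), label))
    return spans


# The ten lines printed by `head train.tsv` above.
sample = [
    "Selegiline\tB-Chemical", "-\tO", "induced\tO",
    "postural\tB-Disease", "hypotension\tI-Disease", "in\tO",
    "Parkinson\tB-Disease", "'\tI-Disease", "s\tI-Disease", "disease\tI-Disease",
]
sentences = read_iob(sample)
spans = [span for sent in sentences for span in iob_to_spans(sent)]
label_counts = Counter(label for _, label in spans)
print(spans)
print(label_counts)
```

To run it on the real file, pass an open file handle instead of `sample`, for example `read_iob(open("train.tsv", encoding="utf-8"))`.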


### Converting the dataset

Convert the downloaded dataset to spaCy's format with the `spacy convert` command.

In [6]:
!mkdir corpus
!python3 -m spacy convert "train.tsv" corpus -c ner -n 10
!python3 -m spacy convert "test.tsv" corpus -c ner -n 10
!python3 -m spacy convert "devel.tsv" corpus -c ner -n 10

ℹ Auto-detected token-per-line NER format
ℹ Grouping every 10 sentences into a document.
✔ Generated output file (456 documents): corpus/train.spacy
ℹ Auto-detected token-per-line NER format
ℹ Grouping every 10 sentences into a document.
✔ Generated output file (480 documents): corpus/test.spacy
ℹ Auto-detected token-per-line NER format
ℹ Grouping every 10 sentences into a document.
✔ Generated output file (459 documents): corpus/devel.spacy
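
The `-n 10` option is what produces the "Grouping every 10 sentences into a document" message. The snippet below is only an illustration of that grouping, not spaCy's converter code. It assumes that any leftover sentences form a final, shorter document, and `group_sentences` is a hypothetical helper name.

```python
def group_sentences(sentences, n_sents=10):
    """Split a list of sentences into consecutive chunks of at most n_sents."""
    return [sentences[i:i + n_sents] for i in range(0, len(sentences), n_sents)]


# 25 dummy sentences end up in three documents.
sentences = [f"sentence {i}" for i in range(25)]
docs = group_sentences(sentences, n_sents=10)
print([len(doc) for doc in docs])  # → [10, 10, 5]
```

Grouping sentences this way gives the transformer longer inputs than single sentences would, which fits the windowed span getter used in the config.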


### Creating the config file

Create the config file. In the `[components.transformer.model]` section, `name` is set to `bert-base-uncased`.

https://spacy.io/usage/training

In [7]:
%%writefile base_config.cfg
# This is an auto-generated partial config. To use it with 'spacy train'
# you can run spacy init fill-config to auto-fill all default settings:
# python -m spacy init fill-config ./base_config.cfg ./config.cfg
[paths]
train = null
dev = null
# Referenced by ${paths.vectors} in the [initialize] section below
vectors = null

[system]
gpu_allocator = "pytorch"

[nlp]
lang = "en"
pipeline = ["transformer","ner"]
batch_size = 128

[components]

[components.transformer]
factory = "transformer"

[components.transformer.model]
@architectures = "spacy-transformers.TransformerModel.v3"
name = "bert-base-uncased"
tokenizer_config = {"use_fast": true}

[components.transformer.model.get_spans]
@span_getters = "spacy-transformers.strided_spans.v1"
window = 128
stride = 96

[components.ner]
factory = "ner"

[components.ner.model]
@architectures = "spacy.TransitionBasedParser.v2"
state_type = "ner"
extra_state_tokens = false
hidden_width = 64
maxout_pieces = 2
use_upper = false
nO = null

[components.ner.model.tok2vec]
@architectures = "spacy-transformers.TransformerListener.v1"
grad_factor = 1.0

[components.ner.model.tok2vec.pooling]
@layers = "reduce_mean.v1"

[corpora]

[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}
max_length = 0

[corpora.dev]
@readers = "spacy.Corpus.v1"
path = ${paths.dev}
max_length = 0

[training]
accumulate_gradient = 3
dev_corpus = "corpora.dev"
train_corpus = "corpora.train"

[training.optimizer]
@optimizers = "Adam.v1"

[training.optimizer.learn_rate]
@schedules = "warmup_linear.v1"
warmup_steps = 250
total_steps = 20000
initial_rate = 5e-5

[training.batcher]
@batchers = "spacy.batch_by_padded.v1"
discard_oversize = true
size = 2000
buffer = 256

[initialize]
vectors = ${paths.vectors}

Writing base_config.cfg
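
The `[components.transformer.model.get_spans]` block sets `window = 128` and `stride = 96`, so long documents are cut into overlapping windows before they are passed to the transformer. The sketch below shows the idea on plain token offsets. It is a simplified illustration, not the spacy-transformers implementation, which may differ in edge cases; `strided_windows` is a hypothetical helper name.

```python
def strided_windows(n_tokens, window=128, stride=96):
    """Return (start, end) token offsets of overlapping windows over a document."""
    spans = []
    start = 0
    while start < n_tokens:
        end = min(start + window, n_tokens)
        spans.append((start, end))
        if end == n_tokens:
            # The last window reached the end of the document.
            break
        start += stride
    return spans


# A 300-token document is covered by three windows.
print(strided_windows(300))  # → [(0, 128), (96, 224), (192, 300)]
```

With these values, neighbouring windows share `window - stride = 32` tokens, so tokens near a window boundary are seen with context on both sides.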


In [8]:
!python -m spacy init fill-config base_config.cfg config.cfg

✔ Auto-filled config with all values
✔ Saved config
config.cfg
You can now add your data and train your pipeline:
python -m spacy train config.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy
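
The `[training.optimizer.learn_rate]` block in the config uses the `warmup_linear.v1` schedule with `warmup_steps = 250`, `total_steps = 20000`, and `initial_rate = 5e-5`. As we understand it, the rate rises linearly from zero to `initial_rate` during warm-up and then decays linearly to zero at `total_steps`. The function below is a sketch under that assumption. `warmup_linear_rate` is a hypothetical name, and Thinc's own implementation remains the reference.

```python
def warmup_linear_rate(step, initial_rate=5e-5, warmup_steps=250, total_steps=20000):
    """Learning rate at a given step: linear warm-up, then linear decay to zero."""
    if step < warmup_steps:
        factor = step / max(1, warmup_steps)
    else:
        factor = max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))
    return initial_rate * factor


# Start, mid warm-up, peak, halfway through the decay, and the final step.
for step in (0, 125, 250, 10125, 20000):
    print(step, warmup_linear_rate(step))
```

Despite its name, `initial_rate` acts as the peak rate here: it is reached at the end of warm-up, not at step 0.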


## BERT

### Training the model

First, let's train a model with BERT.

In [6]:
!python -m spacy train config.cfg \
         --output=./model \
         --paths.train corpus/train.spacy \
         --paths.dev corpus/devel.spacy \
         --training.patience 1000 \
         --gpu-id 0 

✔ Created output directory: model
ℹ Saving to output directory: model
ℹ Using GPU: 0
[2022-02-02 22:17:06,034] [INFO] Set up nlp object from config
[2022-02-02 22:17:06,047] [INFO] Pipeline: ['transformer', 'ner']
[2022-02-02 22:17:06,052] [INFO] Created vocabulary
[2022-02-02 22:17:06,054] [INFO] Finished initializing nlp object
Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight', 'cls.predictions.bias', 'cls.

### Evaluating the model

In [9]:
!python -m spacy evaluate model/model-best corpus/test.spacy --gpu-id 0

ℹ Using GPU: 0

TOK     -    
NER P   85.24
NER R   87.15
NER F   86.19
SPEED   3530 

               P       R       F
Disease    81.13   83.70   82.40
Chemical   88.67   89.99   89.33



## PubMedBERT

### Training the model

Next, let's train with PubMedBERT. The config file stays the same; the model name is overridden with a command-line option.

https://huggingface.co/microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext
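
The override option is the dotted path to a config value: `--components.transformer.model.name` targets the `name` key in the `[components.transformer.model]` section. The snippet below only illustrates that mapping on a plain nested dictionary. It is not how spaCy applies overrides internally, and `apply_override` is a hypothetical helper.

```python
def apply_override(config, dotted_key, value):
    """Set config["a"]["b"]["c"] = value for the dotted key "a.b.c" (in place)."""
    *sections, key = dotted_key.split(".")
    node = config
    for section in sections:
        node = node.setdefault(section, {})
    node[key] = value
    return config


# A tiny stand-in for the relevant part of config.cfg.
config = {"components": {"transformer": {"model": {"name": "bert-base-uncased"}}}}
apply_override(
    config,
    "components.transformer.model.name",
    "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext",
)
print(config["components"]["transformer"]["model"]["name"])
```

Because only this one value changes, both training runs share every other setting in `config.cfg`, which keeps the comparison between the two models fair.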

In [11]:
!python -m spacy train config.cfg \
        --output=./pubmed \
        --paths.train corpus/train.spacy \
        --paths.dev corpus/devel.spacy \
        --gpu-id 0 \
        --training.patience 1000 \
        --components.transformer.model.name microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext

ℹ Saving to output directory: pubmed
ℹ Using GPU: 0
[2022-02-03 01:13:00,639] [INFO] Set up nlp object from config
[2022-02-03 01:13:00,653] [INFO] Pipeline: ['transformer', 'ner']
[2022-02-03 01:13:00,658] [INFO] Created vocabulary
[2022-02-03 01:13:00,659] [INFO] Finished initializing nlp object
Some weights of the model checkpoint at microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext were not used when initializing BertModel: ['cls.seq_relationship.bias', 'cls.predictions.decoder.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight

### Evaluating the model

In [12]:
!python -m spacy evaluate pubmed/model-best corpus/test.spacy --gpu-id 0

ℹ Using GPU: 0

TOK     -    
NER P   89.91
NER R   90.43
NER F   90.17
SPEED   4176 

               P       R       F
Disease    85.06   87.12   86.07
Chemical   94.04   93.15   93.59
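
To put the two evaluations side by side, the sketch below hard-codes the test-set F-scores printed in this notebook and prints the differences. Each model was trained once here, so the gaps describe this single run and not a statistically tested result.

```python
# F-scores copied from the two `spacy evaluate` outputs in this notebook.
bert = {"NER F": 86.19, "Disease F": 82.40, "Chemical F": 89.33}
pubmedbert = {"NER F": 90.17, "Disease F": 86.07, "Chemical F": 93.59}

# Positive values mean PubMedBERT scored higher than BERT.
deltas = {name: round(pubmedbert[name] - bert[name], 2) for name in bert}
for name, delta in deltas.items():
    print(
        f"{name:<11} BERT {bert[name]:.2f}  "
        f"PubMedBERT {pubmedbert[name]:.2f}  diff {delta:+.2f}"
    )
```

In this run, PubMedBERT is ahead by 3.98 points in overall F, 3.67 points for Disease, and 4.26 points for Chemical.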

