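# Text Classification with PubMedBERT

This notebook shows how to classify text with PubMedBERT, a model pretrained on a large corpus of medical-domain text. On medical tasks, it can be expected to outperform a BERT model pretrained on Wikipedia or news articles.

Translator's note: The original notebook used BioBERT. It has been replaced here with PubMedBERT, which performs better according to the PubMedBERT paper.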

- [Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing](https://arxiv.org/abs/2007.15779)


## Setup

### Installing packages

In [1]:
!pip install -q pandas==1.1.5 tensorflow==2.6.0 transformers==4.10.2 scikit-learn==0.23.2


### Imports

In [2]:
import tensorflow as tf
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from transformers import BertTokenizerFast
from transformers import TFAutoModelForSequenceClassification
tf.get_logger().setLevel('ERROR')

### Downloading and uploading the data

The dataset for this notebook must be downloaded from the Kaggle page below. Sign up for Kaggle, download the data, and upload `mtsamples.csv`.

- [Medical Transcriptions](https://www.kaggle.com/tboyle10/medicaltranscriptions)

In [3]:
from google.colab import files

uploaded = files.upload()

Saving mtsamples.csv to mtsamples.csv


### Loading the data

In [4]:
df = pd.read_csv("mtsamples.csv")
df.head()

Unnamed: 0.1,Unnamed: 0,description,medical_specialty,sample_name,transcription,keywords
0,0,A 23-year-old white female presents with comp...,Allergy / Immunology,Allergic Rhinitis,"SUBJECTIVE:, This 23-year-old white female pr...","allergy / immunology, allergic rhinitis, aller..."
1,1,Consult for laparoscopic gastric bypass.,Bariatrics,Laparoscopic Gastric Bypass Consult - 2,"PAST MEDICAL HISTORY:, He has difficulty climb...","bariatrics, laparoscopic gastric bypass, weigh..."
2,2,Consult for laparoscopic gastric bypass.,Bariatrics,Laparoscopic Gastric Bypass Consult - 1,"HISTORY OF PRESENT ILLNESS: , I have seen ABC ...","bariatrics, laparoscopic gastric bypass, heart..."
3,3,2-D M-Mode. Doppler.,Cardiovascular / Pulmonary,2-D Echocardiogram - 1,"2-D M-MODE: , ,1. Left atrial enlargement wit...","cardiovascular / pulmonary, 2-d m-mode, dopple..."
4,4,2-D Echocardiogram,Cardiovascular / Pulmonary,2-D Echocardiogram - 2,1. The left ventricular cavity size and wall ...,"cardiovascular / pulmonary, 2-d, doppler, echo..."


In [5]:
df["medical_specialty"].value_counts()

 Surgery                          1103
 Consult - History and Phy.        516
 Cardiovascular / Pulmonary        372
 Orthopedic                        355
 Radiology                         273
 General Medicine                  259
 Gastroenterology                  230
 Neurology                         223
 SOAP / Chart / Progress Notes     166
 Obstetrics / Gynecology           160
 Urology                           158
 Discharge Summary                 108
 ENT - Otolaryngology               98
 Neurosurgery                       94
 Hematology - Oncology              90
 Ophthalmology                      83
 Nephrology                         81
 Emergency Room Reports             75
 Pediatrics - Neonatal              70
 Pain Management                    62
 Psychiatry / Psychology            53
 Office Notes                       51
 Podiatry                           47
 Dermatology                        29
 Dentistry                          27
 Cosmetic / Plastic Surge

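Using this dataset, we will train a model that predicts `medical_specialty` from `description`. The dataset is highly imbalanced, so dropping the classes with few examples is one option. Since this is a demo, we use it as is.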
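If you did want to drop the rare classes, the following is a minimal sketch of one way to do it with pandas. The helper name `drop_rare_classes`, the threshold `min_count`, and the toy data are assumptions made for illustration; they are not part of the original notebook.

```python
import pandas as pd


def drop_rare_classes(frame, label_col, min_count):
    """Keep only rows whose label occurs at least `min_count` times."""
    # Map every row's label to the number of rows that share that label.
    counts = frame[label_col].map(frame[label_col].value_counts())
    return frame[counts >= min_count].reset_index(drop=True)


# Toy data standing in for mtsamples.csv (values are made up).
toy = pd.DataFrame({
    "description": ["a", "b", "c", "d", "e", "f"],
    "medical_specialty": [
        "Surgery", "Surgery", "Surgery",
        "Radiology", "Radiology",
        "Dentistry",
    ],
})

filtered = drop_rare_classes(toy, "medical_specialty", min_count=2)
print(filtered["medical_specialty"].value_counts())
```

In the notebook this would be called as `drop_rare_classes(df, "medical_specialty", min_count=...)` before label encoding, with a threshold chosen by inspecting the class counts above.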

## Preprocessing

Use `LabelEncoder` to convert the label strings to integers.

In [6]:
le = LabelEncoder()
df["medical_specialty"] = le.fit_transform(df["medical_specialty"])
df.head()

Unnamed: 0.1,Unnamed: 0,description,medical_specialty,sample_name,transcription,keywords
0,0,A 23-year-old white female presents with comp...,0,Allergic Rhinitis,"SUBJECTIVE:, This 23-year-old white female pr...","allergy / immunology, allergic rhinitis, aller..."
1,1,Consult for laparoscopic gastric bypass.,2,Laparoscopic Gastric Bypass Consult - 2,"PAST MEDICAL HISTORY:, He has difficulty climb...","bariatrics, laparoscopic gastric bypass, weigh..."
2,2,Consult for laparoscopic gastric bypass.,2,Laparoscopic Gastric Bypass Consult - 1,"HISTORY OF PRESENT ILLNESS: , I have seen ABC ...","bariatrics, laparoscopic gastric bypass, heart..."
3,3,2-D M-Mode. Doppler.,3,2-D Echocardiogram - 1,"2-D M-MODE: , ,1. Left atrial enlargement wit...","cardiovascular / pulmonary, 2-d m-mode, dopple..."
4,4,2-D Echocardiogram,3,2-D Echocardiogram - 2,1. The left ventricular cavity size and wall ...,"cardiovascular / pulmonary, 2-d, doppler, echo..."
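As a rough illustration of what the encoding does: `LabelEncoder` assigns each distinct label its position in the sorted list of labels, and `le.classes_` and `le.inverse_transform` let you map the integers back to strings. The sketch below mimics that mapping in plain Python. It is not scikit-learn's implementation, and the label values are a small made-up subset.

```python
def fit_label_mapping(labels):
    """Return the sorted distinct labels and a label -> integer id mapping."""
    classes = sorted(set(labels))
    to_id = {label: i for i, label in enumerate(classes)}
    return classes, to_id


# Hypothetical labels, used only to show the mapping.
labels = ["Surgery", "Radiology", "Surgery", "Neurology"]

classes, to_id = fit_label_mapping(labels)
encoded = [to_id[label] for label in labels]   # like le.transform(labels)
decoded = [classes[i] for i in encoded]        # like le.inverse_transform(encoded)

print(classes)
print(encoded)
print(decoded)
```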


Split the dataset into training, validation, and test sets (80%, 10%, and 10% of the data).

In [7]:
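# Hold out 20% of the data; the remaining 80% is used for training.
x_train, x_test, y_train, y_test = train_test_split(
    list(df["description"].values),
    list(df["medical_specialty"].values),
    test_size=0.2,
    random_state=2021
)

# Split the held-out 20% evenly into validation and test sets.
x_valid, x_test, y_valid, y_test = train_test_split(
    x_test,
    y_test,
    test_size=0.5,
    random_state=2021
)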
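The two-step split yields roughly an 80/10/10 partition. The arithmetic can be sketched as follows, assuming (to my understanding of scikit-learn's behaviour) that a fractional `test_size` is rounded up to a whole number of samples and the rest goes to the other part. The helper `split_sizes` and the sample count are hypothetical.

```python
import math


def split_sizes(n_samples, test_fraction):
    """Return (n_first_part, n_second_part) for a fractional split."""
    # Assumption: the second part is rounded up, the first part gets the rest.
    n_second = math.ceil(test_fraction * n_samples)
    return n_samples - n_second, n_second


n_total = 1000  # hypothetical dataset size, not the size of mtsamples.csv

n_train, n_holdout = split_sizes(n_total, 0.2)   # first split: train vs. held-out
n_valid, n_test = split_sizes(n_holdout, 0.5)    # second split: valid vs. test

print(n_train, n_valid, n_test)
```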

Instantiate the tokenizer with the `BertTokenizerFast.from_pretrained` method.

In [8]:
model_name = "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext"
tokenizer = BertTokenizerFast.from_pretrained(model_name)


Pass the texts to the tokenizer to encode them.

In [9]:
train_encodings = tokenizer(x_train, truncation=True, padding=True)
val_encodings = tokenizer(x_valid, truncation=True, padding=True)
test_encodings = tokenizer(x_test, truncation=True, padding=True)

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
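The warning above means that `truncation=True` has no effect here, because the tokenizer loaded for this checkpoint does not define a maximum length. The descriptions are short, so this is harmless in this notebook. Passing an explicit `max_length` (for example 512, the usual BERT position limit) in the tokenizer call would make truncation effective. The plain-Python sketch below illustrates what truncation to `max_length` combined with padding to the longest sequence produces. It is a simplification rather than the tokenizer's actual algorithm (a real tokenizer keeps the final `[SEP]` token when truncating), and the token ids are made up.

```python
def pad_and_truncate(batch, max_length, pad_id=0):
    """Cut each sequence to `max_length`, then pad all to the longest one."""
    truncated = [ids[:max_length] for ids in batch]
    longest = max(len(ids) for ids in truncated)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in truncated]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [
        [1] * len(ids) + [0] * (longest - len(ids)) for ids in truncated
    ]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Hypothetical token ids for two texts of different lengths.
batch = [[101, 7, 8, 9, 102], [101, 7, 102]]
enc = pad_and_truncate(batch, max_length=4)

print(enc["input_ids"])
print(enc["attention_mask"])
```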


Create the datasets by passing the encoded inputs and the labels to the `from_tensor_slices` method of `tf.data.Dataset`.

In [10]:
train_dataset = tf.data.Dataset.from_tensor_slices((
    dict(train_encodings),
    y_train
))
val_dataset = tf.data.Dataset.from_tensor_slices((
    dict(val_encodings),
    y_valid
))
test_dataset = tf.data.Dataset.from_tensor_slices((
    dict(test_encodings),
    y_test
))

In [11]:
# Shuffle only the training data; evaluation results do not depend on sample order.
train_dataset = train_dataset.shuffle(len(train_dataset)).batch(8)
val_dataset = val_dataset.batch(8)
test_dataset = test_dataset.batch(8)

## Training the model

Create a classification model with `TFAutoModelForSequenceClassification`, then call the Keras `fit` method to train it.

In [12]:
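# from_pt=True converts the PyTorch checkpoint into TensorFlow weights.
model = TFAutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=len(le.classes_),
    from_pt=True
)

# Stop once the validation loss has not improved for 2 epochs,
# then restore the weights from the best epoch.
callbacks = [
    tf.keras.callbacks.EarlyStopping(
        patience=2,
        restore_best_weights=True
    ),
]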

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=5e-5),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=tf.metrics.SparseCategoricalAccuracy(),
)

model.fit(train_dataset, validation_data=val_dataset, epochs=10, callbacks=callbacks)


All PyTorch model weights were used when initializing TFBertForSequenceClassification.

Some weights or buffers of the TF 2.0 model TFBertForSequenceClassification were not initialized from the PyTorch model and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10


<keras.callbacks.History at 0x7f74c0445b50>
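Training stopped after 4 of the 10 requested epochs because of the `EarlyStopping` callback, which monitors the validation loss by default. The per-epoch metrics are not preserved in the output above, so the following sketch uses made-up loss values to show how `patience=2` leads to this kind of early stop. It is a simplified model of the patience logic, not the Keras implementation.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return (epoch at which training stops, epoch with the best loss)."""
    best = float("inf")
    best_epoch = 0
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(val_losses), best_epoch


# Hypothetical validation losses for up to 10 epochs.
val_losses = [2.0, 1.8, 1.9, 1.85, 1.7, 1.6, 1.5, 1.4, 1.3, 1.2]
stopped_epoch, best_epoch = early_stopping_epoch(val_losses, patience=2)

print(stopped_epoch, best_epoch)
```

With `restore_best_weights=True`, the model evaluated afterwards carries the weights from the best epoch rather than from the last one.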

In [13]:
_, acc = model.evaluate(test_dataset)
acc



0.35199999809265137

The accuracy is quite low. Preprocessing and hyperparameter tuning would probably improve it somewhat.

Next, let's try a BERT model pretrained on Wikipedia and BookCorpus.
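One way to put an accuracy figure in context on an imbalanced dataset is to compare it with a majority-class baseline, that is, the accuracy obtained by always predicting the most frequent training label. The sketch below is a minimal version of that check. The helper name and the toy labels are assumptions for illustration, and no baseline value for this dataset is claimed here.

```python
from collections import Counter


def majority_baseline_accuracy(train_labels, test_labels):
    """Accuracy of always predicting the most frequent training label."""
    majority_label, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(1 for y in test_labels if y == majority_label)
    return majority_label, hits / len(test_labels)


# Toy integer labels standing in for y_train and y_test.
toy_train = [0, 0, 0, 1, 1, 2]
toy_test = [0, 1, 0, 2]

label, baseline = majority_baseline_accuracy(toy_train, toy_test)
print(label, baseline)
```

In the notebook this would be called as `majority_baseline_accuracy(y_train, y_test)`.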

In [14]:
model_name = "bert-base-uncased"
tokenizer = BertTokenizerFast.from_pretrained(model_name)

train_encodings = tokenizer(x_train, truncation=True, padding=True)
val_encodings = tokenizer(x_valid, truncation=True, padding=True)
test_encodings = tokenizer(x_test, truncation=True, padding=True)

train_dataset = tf.data.Dataset.from_tensor_slices((
    dict(train_encodings),
    y_train
))
val_dataset = tf.data.Dataset.from_tensor_slices((
    dict(val_encodings),
    y_valid
))
test_dataset = tf.data.Dataset.from_tensor_slices((
    dict(test_encodings),
    y_test
))

# Shuffle only the training data; evaluation results do not depend on sample order.
train_dataset = train_dataset.shuffle(len(train_dataset)).batch(8)
val_dataset = val_dataset.batch(8)
test_dataset = test_dataset.batch(8)

model = TFAutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=len(le.classes_),
    from_pt=True
)

callbacks = [
    tf.keras.callbacks.EarlyStopping(
        patience=2,
        restore_best_weights=True
    ),
]

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=5e-5),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=tf.metrics.SparseCategoricalAccuracy(),
)

model.fit(train_dataset, validation_data=val_dataset, epochs=10, callbacks=callbacks)

_, acc = model.evaluate(test_dataset)
acc


All PyTorch model weights were used when initializing TFBertForSequenceClassification.

Some weights or buffers of the TF 2.0 model TFBertForSequenceClassification were not initialized from the PyTorch model and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10


0.2619999945163727