# BERTを用いた意図検出

このノートブックでは、BERTを用いて意図検出をします。データセットとしてはATISを使います。なお、実行はGPU環境ですることを推奨します。

## 準備

### パッケージのインストール

In [1]:
!pip install -q tensorflow==2.6.0 transformers==4.10.2 scikit-learn==0.22.2.post1

[K     |████████████████████████████████| 2.8 MB 11.5 MB/s 
[K     |████████████████████████████████| 636 kB 34.4 MB/s 
[K     |████████████████████████████████| 52 kB 1.4 MB/s 
[K     |████████████████████████████████| 895 kB 40.4 MB/s 
[K     |████████████████████████████████| 3.3 MB 35.3 MB/s 
[?25h

### インポート


In [2]:
import tensorflow as tf
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from transformers import BertTokenizerFast
from transformers import TFAutoModelForSequenceClassification
tf.get_logger().setLevel('ERROR')

### データのアップロード

まずはデータをアップロードします。ノートブックと同じ階層にDataフォルダがあり、その下にdata2フォルダがあります。以下の2つのファイルをアップロードしましょう。Colabでない場合は、読み込むときに正しいパスを指定してください。

- atis.train.w-intent.iob
- atis.test.w-intent.iob

In [3]:
from google.colab import files

uploaded = files.upload()

Saving atis.test.w-intent.iob to atis.test.w-intent.iob
Saving atis.train.w-intent.iob to atis.train.w-intent.iob


In [4]:
!head atis.train.w-intent.iob

BOS i want to fly from boston at 838 am and arrive in denver at 1110 in the morning EOS	 O O O O O O B-fromloc.city_name O B-depart_time.time I-depart_time.time O O O B-toloc.city_name O B-arrive_time.time O O B-arrive_time.period_of_day atis_flight
BOS what flights are available from pittsburgh to baltimore on thursday morning EOS	O O O O O O B-fromloc.city_name O B-toloc.city_name O B-depart_date.day_name B-depart_time.period_of_day atis_flight
BOS what is the arrival time in san francisco for the 755 am flight leaving washington EOS	O O O O B-flight_time I-flight_time O B-fromloc.city_name I-fromloc.city_name O O B-depart_time.time I-depart_time.time O O B-fromloc.city_name atis_flight_time
BOS cheapest airfare from tacoma to orlando EOS	O B-cost_relative O O B-fromloc.city_name O B-toloc.city_name atis_airfare
BOS round trip fares from pittsburgh to philadelphia under 1000 dollars EOS	O B-round_trip I-round_trip O O B-fromloc.city_name O B-toloc.city_name B-cost_relative B-fare_amo

In [5]:
!head atis.test.w-intent.iob

BOS O
i O
would O
like O
to O
find O
a O
flight O
from O
charlotte B-fromloc.city_name


## データの読み込み

In [6]:
train_data_path = "atis.train.w-intent.iob"
test_data_path = "atis.test.w-intent.iob"

### 学習データ

In [7]:
def load_train_data(filename, remove_validation=True):
    sents, labels, intents = [], [], []
    with open(filename, encoding="utf-8") as f:
        for line in f:
            words, labs = [i.split(' ') for i in line.strip().split('\t')]
            if remove_validation and "#" in labs[-1]:
                continue
            sents.append(words[1:-1])
            labels.append(labs[1:-1])
            intents.append(labs[-1])
    return sents, labels, intents

In [8]:
train_texts, _, train_labels = load_train_data(train_data_path)
print("Number of training sentences :", len(train_texts))
print("Number of unique intents :", len(set(train_labels)))

for i in zip(train_texts[:5], train_labels[:5]):
    print(i)

Number of training sentences : 4952
Number of unique intents : 17
(['i', 'want', 'to', 'fly', 'from', 'boston', 'at', '838', 'am', 'and', 'arrive', 'in', 'denver', 'at', '1110', 'in', 'the', 'morning'], 'atis_flight')
(['what', 'flights', 'are', 'available', 'from', 'pittsburgh', 'to', 'baltimore', 'on', 'thursday', 'morning'], 'atis_flight')
(['what', 'is', 'the', 'arrival', 'time', 'in', 'san', 'francisco', 'for', 'the', '755', 'am', 'flight', 'leaving', 'washington'], 'atis_flight_time')
(['cheapest', 'airfare', 'from', 'tacoma', 'to', 'orlando'], 'atis_airfare')
(['round', 'trip', 'fares', 'from', 'pittsburgh', 'to', 'philadelphia', 'under', '1000', 'dollars'], 'atis_airfare')


### テストデータ

In [9]:
def load_test_data(filename, remove_validation=True):
    sents, labels, intents = [], [], []
    with open(filename, encoding="utf-8") as f:
        words, tags = [], []
        for line in f:
            line = line.strip()
            if line:
                word, tag = line.split()
                words.append(word)
                tags.append(tag)
            else:
                if not (remove_validation and "#" in tags[-1]):
                    sents.append(words[1: -1])
                    labels.append(tags[1: -1])
                    intents.append(tags[-1])
                words, tags = [], []
    return sents, labels, intents

In [10]:
test_texts, _, test_labels  = load_test_data(test_data_path)
new_labels = set(test_labels) - set(train_labels)
# テストデータにだけ出現するラベルを除去
vals = []
for i in range(len(test_labels)):
    if test_labels[i] in new_labels:
        print(test_labels[i])
        vals.append(i)
for i in vals[::-1]:
    test_labels.pop(i)
    test_texts.pop(i)

print("Number of testing sentences :", len(test_texts))
print("Number of unique intents :", len(set(test_labels)))

for i in zip(test_texts[:5], test_labels[:5]):
    print(i)

atis_day_name
atis_day_name
Number of testing sentences : 876
Number of unique intents : 15
(['i', 'would', 'like', 'to', 'find', 'a', 'flight', 'from', 'charlotte', 'to', 'las', 'vegas', 'that', 'makes', 'a', 'stop', 'in', 'st.', 'louis'], 'atis_flight')
(['on', 'april', 'first', 'i', 'need', 'a', 'ticket', 'from', 'tacoma', 'to', 'san', 'jose', 'departing', 'before', '7', 'am'], 'atis_airfare')
(['on', 'april', 'first', 'i', 'need', 'a', 'flight', 'going', 'from', 'phoenix', 'to', 'san', 'diego'], 'atis_flight')
(['i', 'would', 'like', 'a', 'flight', 'traveling', 'one', 'way', 'from', 'phoenix', 'to', 'san', 'diego', 'on', 'april', 'first'], 'atis_flight')
(['i', 'would', 'like', 'a', 'flight', 'from', 'orlando', 'to', 'salt', 'lake', 'city', 'for', 'april', 'first', 'on', 'delta', 'airlines'], 'atis_flight')


In [11]:
le = LabelEncoder()
le.fit(train_labels)
train_labels = le.transform(train_labels)
test_labels = le.transform(test_labels)
test_labels[:10]

array([9, 2, 9, 9, 9, 9, 9, 9, 9, 9])

In [12]:
train_texts, val_texts, train_labels, val_labels = train_test_split(train_texts, train_labels, test_size=.2)

## 前処理

テキストをモデルに与える前に、前処理をします。入力をトークン化（トークンを事前学習済みモデルの語彙の対応するIDに変換することを含む）し、モデルが期待するフォーマットにするとともに、パディングと切り詰めをしましょう。

これらの作業を行うために、`BERTTokenizerFast.from_pretrained`メソッドでトークナイザーをインスタンス化します。

In [13]:
tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')

Downloading:   0%|          | 0.00/232k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/466k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/570 [00:00<?, ?B/s]

トークナイザーにテキストを与えて、エンコーディングしましょう。

In [14]:
train_encodings = tokenizer(train_texts, truncation=True, padding=True, is_split_into_words=True)
val_encodings = tokenizer(val_texts, truncation=True, padding=True, is_split_into_words=True)
test_encodings = tokenizer(test_texts, truncation=True, padding=True, is_split_into_words=True)

`tf.data.Dataset`の`from_tensor_slices`メソッドにラベルとエンコーディングした入力を与えて、データセットを作成しましょう。

In [15]:
train_dataset = tf.data.Dataset.from_tensor_slices((
    dict(train_encodings),
    train_labels
))
val_dataset = tf.data.Dataset.from_tensor_slices((
    dict(val_encodings),
    val_labels
))
test_dataset = tf.data.Dataset.from_tensor_slices((
    dict(test_encodings),
    test_labels
))

In [16]:
train_dataset = train_dataset.shuffle(len(train_dataset)).batch(8)
val_dataset = val_dataset.shuffle(len(val_dataset)).batch(8)
test_dataset = test_dataset.shuffle(len(test_dataset)).batch(8)

## モデルの学習

`TFAutoModelForSequenceClassification`を使って分類用のモデルを作成したら、Kerasの`fit`メソッドを呼び出して学習しましょう。

In [31]:
model = TFAutoModelForSequenceClassification.from_pretrained(
    "bert-base-cased",
    num_labels=len(le.classes_)
)

filepath = "model/"
callbacks = [
    tf.keras.callbacks.EarlyStopping(patience=2),
    tf.keras.callbacks.ModelCheckpoint(
        filepath=filepath,
        save_best_only=True,
        save_weights_only=True
    ),
]

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=5e-5),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=tf.metrics.SparseCategoricalAccuracy(),
)

model.fit(train_dataset, validation_data=val_dataset, epochs=10, callbacks=callbacks)

All model checkpoint layers were used when initializing TFBertForSequenceClassification.

Some layers of TFBertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10


<keras.callbacks.History at 0x7fb24090e4d0>

## モデルの評価

In [32]:
model.load_weights(filepath)

<tensorflow.python.training.tracking.util.CheckpointLoadStatus at 0x7fb22abb37d0>

In [33]:
_, acc = model.evaluate(test_dataset)



In [34]:
print(f'Test accuracy: {acc}')

Test accuracy: 0.9726027250289917
