<a href="https://colab.research.google.com/github/hiya906/my-machine-learning/blob/master/BERT_SST.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

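# **How to use**

1.   Click 'Sign in' at the top right.
2.   Click 'Open in playground mode' at the top left.


※ Each cell can be run with the run button at its top-left corner.

※ If a dialog saying 'Warning: This notebook was not authored by Google.' appears while running, check 'Reset all runtimes before running' and then choose 'Run anyway'.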

# 1. Preparation

- Mount Google Drive
- Copy files
- Install required packages

In [0]:
BASE_DIR = '/content/drive/My Drive/AI/SDS_BERT'

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [0]:
!cp -r /content/drive/"My Drive"/AI/SDS_BERT/. ./

In [5]:
!pip install funcsigs

Collecting funcsigs
  Downloading https://files.pythonhosted.org/packages/69/cb/f5be453359271714c01b9bd06126eaf2e368f1fddfff30818754b5ac2328/funcsigs-1.0.2-py2.py3-none-any.whl
Installing collected packages: funcsigs
Successfully installed funcsigs-1.0.2


# 2. Download and preprocess the model and data

- Download the pretrained BERT model
- Download the dataset
- Preprocess the data

In [4]:
!sh bert_pretrained_models/download_model.sh

--2019-07-23 07:38:03--  https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-12_H-768_A-12.zip
Resolving storage.googleapis.com (storage.googleapis.com)... 74.125.141.128, 2607:f8b0:400c:c06::80
Connecting to storage.googleapis.com (storage.googleapis.com)|74.125.141.128|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 407727028 (389M) [application/zip]
Saving to: ‘bert_pretrained_models/uncased_L-12_H-768_A-12.zip.1’


2019-07-23 07:38:05 (214 MB/s) - ‘bert_pretrained_models/uncased_L-12_H-768_A-12.zip.1’ saved [407727028/407727028]

Archive:  bert_pretrained_models/uncased_L-12_H-768_A-12.zip
replace bert_pretrained_models/uncased_L-12_H-768_A-12/bert_model.ckpt.meta? [y]es, [n]o, [A]ll, [N]one, [r]ename: y
  inflating: bert_pretrained_models/uncased_L-12_H-768_A-12/bert_model.ckpt.meta  
replace bert_pretrained_models/uncased_L-12_H-768_A-12/bert_model.ckpt.data-00000-of-00001? [y]es, [n]o, [A]ll, [N]one, [r]ename: y
  inflating: bert_pretrained_

In [5]:
# Choose one of: CoLA, SST, MRPC, MNLI, XNLI
!python data/download_glue_data.py --tasks=SST

Downloading and extracting SST...
	Completed!


In [6]:
# Choose one of: CoLA, SST, MRPC, MNLI, XNLI
!python prepare_data.py --task=SST

W0723 07:38:46.132593 139924813449088 deprecation_wrapper.py:119] From /content/texar/core/layers.py:629: The name tf.layers.Layer is deprecated. Please use tf.compat.v1.layers.Layer instead.

W0723 07:38:46.132965 139924813449088 deprecation_wrapper.py:119] From /content/texar/core/layers.py:682: The name tf.layers.MaxPooling1D is deprecated. Please use tf.compat.v1.layers.MaxPooling1D instead.

W0723 07:38:46.133079 139924813449088 deprecation_wrapper.py:119] From /content/texar/core/layers.py:683: The name tf.layers.AveragePooling1D is deprecated. Please use tf.compat.v1.layers.AveragePooling1D instead.

W0723 07:38:46.133286 139924813449088 deprecation_wrapper.py:119] From /content/texar/core/layers.py:1174: The name tf.layers.Conv1D is deprecated. Please use tf.compat.v1.layers.Conv1D instead.

W0723 07:38:46.133417 139924813449088 deprecation_wrapper.py:119] From /content/texar/core/layers.py:1175: The name tf.layers.Conv2D is deprecated. Please use tf.compat.v1.layers.Conv2D ins

# 3. Training

## 3.1. Training configuration

- Define and modify the training configuration


In [18]:
# Choose one of: CoLA, SST-2, MRPC, MNLI, XNLI
import config_func
config_func.modify_task_dir('./data/SST-2')

I0723 07:43:13.524230 140617375025024 config_func.py:24] config_data.py has been updated


In [1]:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import os
import math
import numpy as np
import importlib
import tensorflow as tf
import texar as tx

from utils import model_utils

# pylint: disable=invalid-name, too-many-locals, too-many-statements

flags = tf.flags

FLAGS = flags.FLAGS

FLAGS.remove_flag_values(FLAGS.flag_values_dict())

flags.DEFINE_string('f', '', 'kernel')
flags.DEFINE_bool("logtostderr", False, "Should only log to stderr?")
flags.DEFINE_string('stderrthreshold', 'fatal', 'log messages at this level, or more severe, to stderr in addition to the logfile.')
flags.DEFINE_boolean('showprefixforinfo', True,
                     'If False, do not prepend prefix to info messages '
                     'when it\'s logged to stderr, '
                     '--verbosity is set to INFO level, '
                     'and python logging is used.')
flags.DEFINE_boolean('alsologtostderr',
                     False,
                     'also log to stderr?', allow_override_cpp=True)

flags.DEFINE_string(
    "config_bert_pretrain", 'uncased_L-12_H-768_A-12',
    "The architecture of pre-trained BERT model to use.")
flags.DEFINE_string(
    "config_format_bert", "json",
    "The configuration format. Set to 'json' if the BERT config file is in "
    "the same format of the official BERT config file. Set to 'texar' if the "
    "BERT config file is in Texar format.")
flags.DEFINE_string(
    "config_downstream", "config_classifier",
    "Configuration of the downstream part of the model and optimization.")
flags.DEFINE_string(
    "config_data", "config_data",
    "The dataset config.")
flags.DEFINE_string(
    "output_dir", "output/",
    "The output directory where the model checkpoints will be written.")
flags.DEFINE_string(
    "checkpoint", None,
    "Path to a model checkpoint (including bert modules) to restore from.")
flags.DEFINE_bool("do_train", True, "Whether to run training.")
flags.DEFINE_bool("do_eval", False, "Whether to run eval on the dev set.")
flags.DEFINE_bool("do_test", False, "Whether to run test on the test set.")
flags.DEFINE_bool("distributed", False, "Whether to run in distributed mode.")

config_data = importlib.import_module('config_data')
config_downstream = importlib.import_module("config_classifier")

W0723 07:44:19.604691 140056818276224 deprecation_wrapper.py:119] From /content/texar/core/layers.py:629: The name tf.layers.Layer is deprecated. Please use tf.compat.v1.layers.Layer instead.

W0723 07:44:19.606336 140056818276224 deprecation_wrapper.py:119] From /content/texar/core/layers.py:682: The name tf.layers.MaxPooling1D is deprecated. Please use tf.compat.v1.layers.MaxPooling1D instead.

W0723 07:44:19.607319 140056818276224 deprecation_wrapper.py:119] From /content/texar/core/layers.py:683: The name tf.layers.AveragePooling1D is deprecated. Please use tf.compat.v1.layers.AveragePooling1D instead.

W0723 07:44:19.621073 140056818276224 deprecation_wrapper.py:119] From /content/texar/core/layers.py:1174: The name tf.layers.Conv1D is deprecated. Please use tf.compat.v1.layers.Conv1D instead.

W0723 07:44:19.623011 140056818276224 deprecation_wrapper.py:119] From /content/texar/core/layers.py:1175: The name tf.layers.Conv2D is deprecated. Please use tf.compat.v1.layers.Conv2D ins

## 3.2. Loading the model

- Read the BERT configuration

In [0]:
# LOAD MODEL
tf.logging.set_verbosity(tf.logging.INFO)

bert_pretrain_dir = os.path.join('bert_pretrained_models', FLAGS.config_bert_pretrain)
# Loads BERT model configuration
if FLAGS.config_format_bert == "json":
    # Converts the official BERT JSON config into Texar's format
    bert_config = model_utils.transform_bert_to_texar_config(
        os.path.join(bert_pretrain_dir, 'bert_config.json'))
elif FLAGS.config_format_bert == 'texar':
    bert_config = importlib.import_module(
        'bert_config_lib.config_model_%s' % FLAGS.config_bert_pretrain)
else:
    raise ValueError('Unknown config_format_bert.')

## 3.3. Loading the data

- Read and prepare the data

In [3]:
# Loads data
num_classes = config_data.num_classes
num_train_data = config_data.num_train_data

train_dataset = tx.data.TFRecordData(hparams=config_data.train_hparam)
eval_dataset = tx.data.TFRecordData(hparams=config_data.eval_hparam)
test_dataset = tx.data.TFRecordData(hparams=config_data.test_hparam)

iterator = tx.data.FeedableDataIterator({
    'train': train_dataset, 'eval': eval_dataset, 'test': test_dataset})
batch = iterator.get_next()
input_ids = batch["input_ids"]
segment_ids = batch["segment_ids"]
batch_size = tf.shape(input_ids)[0]
# Number of non-padding tokens per example (token id 0 is treated as padding)
input_length = tf.reduce_sum(1 - tf.to_int32(tf.equal(input_ids, 0)), axis=1)

W0723 07:44:24.011317 140056818276224 deprecation_wrapper.py:119] From /content/texar/data/data_decoders.py:603: The name tf.FixedLenFeature is deprecated. Please use tf.io.FixedLenFeature instead.

W0723 07:44:24.012494 140056818276224 deprecation_wrapper.py:119] From /content/texar/data/data_decoders.py:610: The name tf.parse_single_example is deprecated. Please use tf.io.parse_single_example instead.

W0723 07:44:24.035303 140056818276224 deprecation.py:323] From /content/texar/data/data/data_base.py:161: DatasetV1.output_shapes (from tensorflow.python.data.ops.dataset_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.compat.v1.data.get_output_shapes(dataset)`.
W0723 07:44:24.036144 140056818276224 deprecation.py:323] From /content/texar/data/data/data_base.py:162: padded_batch_and_drop_remainder (from tensorflow.contrib.data.python.ops.batching) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.dat
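
The loading cell above derives `input_length` by counting, for each row of `input_ids`, the tokens that are not equal to 0. The following is a minimal pure-Python sketch of that counting, assuming that 0 is the padding id; the function name `sequence_lengths` and the token ids are made up for illustration and are not part of the notebook.

```python
def sequence_lengths(input_ids, pad_id=0):
    """Counts non-padding tokens per row.

    Mirrors tf.reduce_sum(1 - tf.to_int32(tf.equal(input_ids, 0)), axis=1)
    under the assumption that pad_id marks padding positions.
    """
    return [sum(1 for token in row if token != pad_id) for row in input_ids]


# Hypothetical batch of two padded sequences (ids are illustrative only).
batch_ids = [
    [101, 2023, 2003, 102, 0, 0],
    [101, 2307, 102, 0, 0, 0],
]
lengths = sequence_lengths(batch_ids)
print(lengths)  # [4, 3]
```

These per-example lengths are what the encoder receives as its second argument, so that padded positions can be masked out.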

## 3.4. Completing BERT

- Implement the BERT model using Texar

In [4]:
# Builds BERT
with tf.variable_scope('bert'):
    # Word embedding
    embedder = tx.modules.WordEmbedder(
        vocab_size=30522, hparams=bert_config.embed)
    word_embeds = embedder(input_ids)

    # Segment embedding for each type of tokens
    segment_embedder = tx.modules.WordEmbedder(
        vocab_size=bert_config.type_vocab_size,
        hparams=bert_config.segment_embed)
    segment_embeds = segment_embedder(segment_ids)

    # Position embedding
    position_embedder = tx.modules.PositionEmbedder(
        position_size=bert_config.position_size,
        hparams=bert_config.position_embed)
    seq_length = tf.ones([batch_size], tf.int32) * tf.shape(input_ids)[1]
    pos_embeds = position_embedder(sequence_length=seq_length)
    
    # ============= Aggregates embeddings =============
    
    input_embeds = word_embeds + segment_embeds + pos_embeds
    
    # =================================================

    # The BERT model (a TransformerEncoder)
    # input_embeds: (batch, seq_len, emb_dim)
    # output      : (batch, seq_len, emb_dim)
    encoder = tx.modules.TransformerEncoder(hparams=bert_config.encoder)
    output = encoder(input_embeds, input_length)

    # Builds layers for downstream classification, which is also
    # initialized with BERT pre-trained checkpoint.
    with tf.variable_scope("pooler"):
        # Uses the projection of the 1st-step hidden vector of BERT output
        # as the representation of the sentence

        # ============= Get sentence embedding (CLS) =============
        # CLS_hidden : First vector from contextual embedding
        CLS_hidden = tf.squeeze(output[:, 0:1, :], axis=1)
    
        # ========================================================
        
        # Output
        bert_sent_output = tf.layers.dense(CLS_hidden, config_downstream.hidden_dim, activation=tf.tanh)

W0723 07:44:26.081236 140056818276224 deprecation_wrapper.py:119] From /content/texar/module_base.py:72: The name tf.make_template is deprecated. Please use tf.compat.v1.make_template instead.

W0723 07:44:26.092065 140056818276224 deprecation.py:506] From /usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/init_ops.py:1251: calling VarianceScaling.__init__ (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version.
Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor
W0723 07:44:26.279551 140056818276224 deprecation_wrapper.py:119] From /usr/lib/python3.6/pydoc.py:1595: The name tf.layers.Dense is deprecated. Please use tf.compat.v1.layers.Dense instead.

W0723 07:44:26.288738 140056818276224 deprecation_wrapper.py:119] From /content/texar/core/layers.py:598: The name tf.layers.Layer is deprecated. Please use tf.compat.v1.layers.Layer instead.

W0723 07:44:26.423269 14
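
The two exercise steps in this cell are the element-wise sum of the word, segment, and position embeddings, and the selection of the first hidden vector as the sentence representation. The sketch below illustrates both on plain nested lists. It is an assumption-laden toy: the helper names and numbers are hypothetical, and it skips the Transformer encoder entirely, applying the first-token selection directly to the summed embeddings.

```python
def add_embeddings(word, segment, position):
    """Element-wise sum of three (seq_len, emb_dim) nested lists."""
    return [
        [w + s + p for w, s, p in zip(w_row, s_row, p_row)]
        for w_row, s_row, p_row in zip(word, segment, position)
    ]


def first_token_vector(sequence_output):
    """Returns the vector at position 0, the slot of the [CLS] token."""
    return sequence_output[0]


# Toy values for one example: seq_len = 3, emb_dim = 2 (hypothetical numbers).
word_embeds = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
segment_embeds = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]
pos_embeds = [[0.0, 1.0], [1.0, 0.0], [2.0, 2.0]]

input_embeds = add_embeddings(word_embeds, segment_embeds, pos_embeds)
cls_hidden = first_token_vector(input_embeds)
print(input_embeds)  # [[1.5, 3.5], [4.5, 4.5], [7.5, 8.5]]
print(cls_hidden)    # [1.5, 3.5]
```

In the notebook the same selection is written as `tf.squeeze(output[:, 0:1, :], axis=1)`, which keeps the batch dimension and drops the length-1 sequence dimension.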

## 3.5. Implementing the classification layer

- Add a classification layer on top of the contextual embedding ('bert_sent_output')

In [5]:
# Adds the final classification layer

# Hint: USE "tf.layers.dense(dense_input, num_classes)"
# output X W --> logits
logits = tf.layers.dense(bert_sent_output, num_classes)

# Argmax over last dimension
preds = tf.argmax(logits, axis=-1, output_type=tf.int32)

# Accuracy
accu = tx.evals.accuracy(batch['label_ids'], preds)

W0723 07:44:33.298686 140056818276224 deprecation.py:323] From /content/texar/evals/metrics.py:29: to_float (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.cast` instead.
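
The classification head above is a linear projection to `num_classes` logits, followed by an argmax and an accuracy computation. A minimal sketch of those three steps in plain Python is given here; the helper names, weights, and inputs are hypothetical and only serve to make the arithmetic visible.

```python
def dense(features, class_weights, bias):
    """Linear layer: one logit per class, logit_c = features . w_c + b_c."""
    return [
        sum(f * w for f, w in zip(features, weights)) + b
        for weights, b in zip(class_weights, bias)
    ]


def argmax(values):
    """Index of the largest value (first one wins on ties)."""
    return max(range(len(values)), key=lambda i: values[i])


def accuracy(labels, predictions):
    """Fraction of positions where prediction equals label."""
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)


# Hypothetical 2-dimensional sentence vectors and a 2-class head.
sentence_vectors = [[1.0, 2.0], [2.0, -1.0], [0.0, 1.0], [-1.0, -1.0]]
class_weights = [[0.5, -0.5], [0.25, 0.5]]
bias = [0.0, 0.1]
labels = [1, 0, 0, 0]

logits = [dense(v, class_weights, bias) for v in sentence_vectors]
preds = [argmax(row) for row in logits]
accu = accuracy(labels, preds)
print(preds)  # [1, 0, 1, 0]
print(accu)   # 0.75
```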


## Answers for 3.4 and 3.5

In [13]:
# Builds BERT
with tf.variable_scope('bert'):
    # Word embedding
    embedder = tx.modules.WordEmbedder(
        vocab_size=30522, hparams=bert_config.embed)
    word_embeds = embedder(input_ids)

    # Segment embedding for each type of tokens
    segment_embedder = tx.modules.WordEmbedder(
        vocab_size=bert_config.type_vocab_size,
        hparams=bert_config.segment_embed)
    segment_embeds = segment_embedder(segment_ids)

    # Position embedding
    position_embedder = tx.modules.PositionEmbedder(
        position_size=bert_config.position_size,
        hparams=bert_config.position_embed)
    seq_length = tf.ones([batch_size], tf.int32) * tf.shape(input_ids)[1]
    pos_embeds = position_embedder(sequence_length=seq_length)

    # Aggregates embeddings
    input_embeds = word_embeds + segment_embeds + pos_embeds

    # The BERT model (a TransformerEncoder)
    # input_embeds: (batch, seq_len, emb_dim)
    # output      : (batch, seq_len, emb_dim)
    encoder = tx.modules.TransformerEncoder(hparams=bert_config.encoder)
    output = encoder(input_embeds, input_length)

    # Builds layers for downstream classification, which is also
    # initialized with BERT pre-trained checkpoint.
    with tf.variable_scope("pooler"):
        # Uses the projection of the 1st-step hidden vector of BERT output
        # as the representation of the sentence
        
        # CLS_hidden : First vector from contextual embedding
        CLS_hidden = tf.squeeze(output[:, 0:1, :], axis=1)
        
        # Output
        bert_sent_output = tf.layers.dense(CLS_hidden, config_downstream.hidden_dim, activation=tf.tanh)
        
# Adds the final classification layer

# output X W --> logits
# num_classes = number of target classes
logits = tf.layers.dense(bert_sent_output, num_classes)

# Argmax over last dimension
preds = tf.argmax(logits, axis=-1, output_type=tf.int32)

accu = tx.evals.accuracy(batch['label_ids'], preds)

ValueError: ignored

## 3.6. Loss and optimizer setup

- Define the loss and optimizer
- Load the pretrained BERT weights into the model
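
The loss used in this section is sparse softmax cross-entropy: for each example it is the negative log-probability that the softmax over the logits assigns to the integer label, averaged over the batch. A minimal pure-Python sketch follows, assuming unit weights for every example; the function is a hand-written illustration, not the TensorFlow implementation.

```python
import math


def sparse_softmax_cross_entropy(labels, logits):
    """Mean over the batch of -log(softmax(logits)[label])."""
    losses = []
    for label, row in zip(labels, logits):
        shift = max(row)  # subtract the max for numerical stability
        log_sum_exp = shift + math.log(sum(math.exp(v - shift) for v in row))
        losses.append(log_sum_exp - row[label])
    return sum(losses) / len(losses)


# With identical logits for two classes, each example costs log(2).
uniform_loss = sparse_softmax_cross_entropy([0, 1], [[0.0, 0.0], [0.0, 0.0]])
# A confident, correct prediction costs close to zero.
confident_loss = sparse_softmax_cross_entropy([1], [[-10.0, 10.0]])
print(uniform_loss)
print(confident_loss)
```

A freshly initialized two-class head tends to start near log(2), about 0.693, which is a useful sanity check when reading the first training log lines.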

In [6]:
# Optimization
loss = tf.losses.sparse_softmax_cross_entropy(labels=batch["label_ids"], logits=logits)
global_step = tf.Variable(0, trainable=False)

# Learning rate
static_lr = config_downstream.lr['static_lr']
num_train_steps = int(num_train_data / config_data.train_batch_size * config_data.max_train_epoch)
num_warmup_steps = int(num_train_steps * config_data.warmup_proportion)
# lr is a Tensor
lr = model_utils.get_lr(global_step, num_train_steps, num_warmup_steps, static_lr)

# Optimize
opt = tx.core.get_optimizer(
    global_step=global_step,
    learning_rate=lr,
    hparams=config_downstream.opt
)

train_op = tf.contrib.layers.optimize_loss(
        loss=loss,
        global_step=global_step,
        learning_rate=None,
        optimizer=opt)

W0723 07:44:39.067096 140056818276224 deprecation.py:323] From /usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/losses/losses_impl.py:121: add_dispatch_support.<locals>.wrapper (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
W0723 07:44:39.085576 140056818276224 deprecation_wrapper.py:119] From /content/utils/model_utils.py:86: The name tf.train.polynomial_decay is deprecated. Please use tf.compat.v1.train.polynomial_decay instead.

W0723 07:44:39.093857 140056818276224 deprecation.py:323] From /usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/optimizer_v2/learning_rate_schedule.py:409: div (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Deprecated in favor of operator or tf.math.divide.
W0723 07:44:39.111694 140056818276224 deprecation_wrapper.py:119] 
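
The learning rate in the cell above comes from `model_utils.get_lr`, whose source is not shown in this notebook. The sketch below assumes it follows the original BERT recipe: a linear warmup over the first `num_warmup_steps` steps, then a linear (power 1 polynomial) decay to zero at `num_train_steps`. The function name and all numeric values are hypothetical.

```python
def warmup_linear_decay_lr(step, num_train_steps, num_warmup_steps, base_lr):
    """Assumed schedule: linear warmup, then linear decay to zero."""
    if num_warmup_steps > 0 and step < num_warmup_steps:
        return base_lr * step / num_warmup_steps
    remaining = max(num_train_steps - step, 0)
    return base_lr * remaining / num_train_steps


# Hypothetical configuration values (not taken from config_data.py).
num_train_data = 1000
train_batch_size = 10
max_train_epoch = 2
warmup_proportion = 0.1
base_lr = 2e-5

# Same step arithmetic as in the notebook cell.
num_train_steps = int(num_train_data / train_batch_size * max_train_epoch)
num_warmup_steps = int(num_train_steps * warmup_proportion)

for step in (0, 10, 20, 100, 200):
    lr = warmup_linear_decay_lr(step, num_train_steps, num_warmup_steps, base_lr)
    print(step, lr)
```

With these numbers there are 200 training steps and 20 warmup steps; the rate rises from 0 during warmup and reaches 0 again at the final step.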

In [7]:
# Loads pretrained BERT model parameters
init_checkpoint = os.path.join(bert_pretrain_dir, 'bert_model.ckpt')
model_utils.init_bert_checkpoint(init_checkpoint)

session_config = tf.ConfigProto()

sess = tf.Session(config=session_config)

sess.run(tf.global_variables_initializer())
sess.run(tf.local_variables_initializer())
sess.run(tf.tables_initializer())

saver = tf.train.Saver()
if FLAGS.checkpoint:
    saver.restore(sess, FLAGS.checkpoint)
    
iterator.initialize_dataset(sess)

W0723 07:44:53.948428 140056818276224 deprecation_wrapper.py:119] From /content/utils/model_utils.py:187: The name tf.train.init_from_checkpoint is deprecated. Please use tf.compat.v1.train.init_from_checkpoint instead.



## 3.7. Defining the evaluation function

In [0]:
def _eval_epoch(sess):
    """Evaluates on the dev set.
    """
    iterator.restart_dataset(sess, 'eval')

    cum_acc = 0.0
    cum_loss = 0.0
    nsamples = 0
    fetches = {
        'accu': accu,
        'loss': loss,
        'batch_size': batch_size,
    }
    while True:
        try:
            feed_dict = {
                iterator.handle: iterator.get_handle(sess, 'eval'),
                tx.context.global_mode(): tf.estimator.ModeKeys.EVAL,
            }
            rets = sess.run(fetches, feed_dict)

            cum_acc += rets['accu'] * rets['batch_size']
            cum_loss += rets['loss'] * rets['batch_size']
            nsamples += rets['batch_size']
        except tf.errors.OutOfRangeError:
            break
    tf.logging.info('eval accu: {}; loss: {}; nsamples: {}'.format(
        cum_acc / nsamples, cum_loss / nsamples, nsamples))
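
`_eval_epoch` multiplies each batch's accuracy and loss by its batch size before summing, then divides by the total number of samples. This matters when the last batch is smaller than the others. A small sketch with made-up numbers (the helper name `weighted_mean` is hypothetical) shows the difference from a plain average over batches:

```python
def weighted_mean(batch_values, batch_sizes):
    """Per-sample mean from per-batch means, weighting by batch size."""
    total = sum(v * n for v, n in zip(batch_values, batch_sizes))
    return total / sum(batch_sizes)


# Hypothetical per-batch accuracies; the last batch holds only 2 samples.
batch_accuracies = [1.0, 0.5, 0.0]
batch_sizes = [4, 4, 2]

weighted = weighted_mean(batch_accuracies, batch_sizes)
naive = sum(batch_accuracies) / len(batch_accuracies)
print(weighted)  # 0.6
print(naive)     # 0.5
```

The weighted value equals the fraction of correctly classified samples (6 of 10), whereas the unweighted average over-counts the small final batch.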

## 3.8. Start training

In [9]:
for i in range(config_data.max_train_epoch):
    # Restart the training dataset at the beginning of each epoch
    iterator.restart_dataset(sess, 'train')
    fetches = {
            'train_op': train_op,
            'loss': loss,
            'batch_size': batch_size,
            'step': global_step
        }
    
    while True:
        try:
            feed_dict = {
                iterator.handle: iterator.get_handle(sess, 'train'),
                tx.global_mode(): tf.estimator.ModeKeys.TRAIN,
            }
            rets = sess.run(fetches, feed_dict)
            step = rets['step']

            dis_steps = config_data.display_steps
            if dis_steps > 0 and step % dis_steps == 0:
                tf.logging.info('step:%d; loss:%f' % (step, rets['loss']))

            eval_steps = config_data.eval_steps
            if eval_steps > 0 and step % eval_steps == 0:
                _eval_epoch(sess)

        except tf.errors.OutOfRangeError:
            break
# Save the fine-tuned model once training finishes
saver.save(sess, os.path.join(FLAGS.output_dir, 'model.ckpt'))

I0723 07:45:57.263368 140056818276224 <ipython-input-9-b2ea9810eb12>:22] step:50; loss:0.612986
I0723 07:46:39.917129 140056818276224 <ipython-input-9-b2ea9810eb12>:22] step:100; loss:0.328315
I0723 07:46:49.315532 140056818276224 <ipython-input-8-7129b0313312>:28] eval accu: 0.8658256880733946; loss: 0.32551266106033544; nsamples: 872


KeyboardInterrupt: ignored

## 3.9. Checking predictions on the test data

In [0]:
# Read the test data and predict

import csv
test_file = os.path.join(config_data.tfrecord_data_dir, 'test.tsv')
lines = []
with tf.gfile.Open(test_file, "r") as f:
    reader = csv.reader(f, delimiter="\t")
    for i, line in enumerate(reader):
        if i == 0:
            continue
        if len(line) > 3:
            lines.append(line[-2:])
        else:
            lines.append(line[-1])

iterator.restart_dataset(sess, 'test')

_all_preds = []
while True:
    try:
        feed_dict = {
            iterator.handle: iterator.get_handle(sess, 'test'),
            tx.context.global_mode(): tf.estimator.ModeKeys.PREDICT,
        }
        _preds = sess.run(preds, feed_dict=feed_dict)
        _all_preds.extend(_preds.tolist())
    except tf.errors.OutOfRangeError:
        break
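
The first half of the cell above reads `test.tsv`, skips the header row, and keeps the last column for single-sentence tasks or the last two columns for sentence-pair tasks. The following self-contained sketch reproduces that logic on in-memory text. The function name and the sample rows are hypothetical, and `quoting=csv.QUOTE_NONE` is an added choice (not in the notebook) so that quote characters inside a sentence are kept literally.

```python
import csv
import io


def read_test_sentences(handle):
    """Returns the sentence (or sentence pair) from each non-header row."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = []
    for i, line in enumerate(reader):
        if i == 0:
            continue  # skip the header row
        if len(line) > 3:
            rows.append(line[-2:])  # sentence-pair task: last two columns
        else:
            rows.append(line[-1])   # single-sentence task: last column
    return rows


# Hypothetical file contents for a single-sentence and a pair task.
single = io.StringIO('index\tsentence\n0\ta "fine" film .\n1\tnot good .\n')
pair = io.StringIO('index\tid1\tid2\ts1\ts2\n0\t11\t12\tfirst one\tsecond one\n')

single_rows = read_test_sentences(single)
pair_rows = read_test_sentences(pair)
print(single_rows)  # ['a "fine" film .', 'not good .']
print(pair_rows)    # [['first one', 'second one']]
```

Because single-sentence rows are stored as strings and pair rows as lists, the printing cell can tell the two task types apart with `isinstance(lines[i], list)`.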

In [12]:
# Number of test samples to print
NUM_PRINT_TEST = 40

for i in range(min(NUM_PRINT_TEST, len(lines))):
    if isinstance(lines[i], list):
        # Sentence-pair task
        print('Sentence 1: ', lines[i][0])
        print('Sentence 2: ', lines[i][1])
        print('Prediction: ', _all_preds[i])
    else:
        # Single-sentence task
        print('Sentence  : ', lines[i])
        print('Prediction: ', _all_preds[i])

    print()

Sentence  :  uneasy mishmash of styles and genres .
Prediction:  0

Sentence  :  this film 's relationship to actual tension is the same as what christmas-tree flocking in a spray can is to actual snow : a poor -- if durable -- imitation .
Prediction:  0

Sentence  :  by the end of no such thing the audience , like beatrice , has a watchful affection for the monster .
Prediction:  1

Sentence  :  director rob marshall went out gunning to make a great one .
Prediction:  1

Sentence  :  lathan and diggs have considerable personal charm , and their screen rapport makes the old story seem new .
Prediction:  1

Sentence  :  a well-made and often lovely depiction of the mysteries of friendship .
Prediction:  1

Sentence  :  none of this violates the letter of behan 's book , but missing is its spirit , its ribald , full-throated humor .
Prediction:  0

Sentence  :  although it bangs a very cliched drum at times , this crowd-pleaser 's fresh dialogue , energetic music , and good-natured spunk