# Playing with pytorch-pretrained-BERT for Japanese language

In this notebook, we show how to use [pytorch-pretrained-BERT](https://github.com/huggingface/pytorch-pretrained-BERT) for the Japanese language. We use the pretrained BERT model from Kyoto University. This model expects input that has already been segmented into words with Juman++, so the default step that splits CJK ideographs into single characters must be disabled. According to the guideline, you need to comment out the following line in the file `pytorch-pretrained-BERT/tokenization.py`:

    # text = self._tokenize_chinese_chars(text)

I changed `tokenization.py` so that it can be used for Japanese and for other languages as well. The modification adds the argument `kyoto_bert=[True|False]`. You just need to install the package from the forked repo [https://github.com/minhpqn/pytorch-pretrained-BERT](https://github.com/minhpqn/pytorch-pretrained-BERT).
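
To illustrate why the `_tokenize_chinese_chars` step matters, here is a minimal sketch. It assumes that the step puts whitespace around every CJK ideograph. The helper names `is_cjk_ideograph` and `split_cjk_chars` are hypothetical, and the code point ranges are a simplified subset, not the library's exact list.

```python
def is_cjk_ideograph(ch):
    """Return True for characters in the main CJK ideograph blocks (simplified)."""
    cp = ord(ch)
    return (0x4E00 <= cp <= 0x9FFF) or (0x3400 <= cp <= 0x4DBF) or (0xF900 <= cp <= 0xFAFF)


def split_cjk_chars(text):
    """Surround every CJK ideograph with spaces (simplified stand-in, hypothetical helper)."""
    out = []
    for ch in text:
        if is_cjk_ideograph(ch):
            out.append(" " + ch + " ")
        else:
            out.append(ch)
    return "".join(out)


segmented = "数学 の 定義"  # words already separated, e.g. by Juman++
print(segmented.split())                   # word-level tokens are kept
print(split_cjk_chars(segmented).split())  # kanji words fall apart into single characters
```

With the step enabled, a word such as 数学 would no longer reach the vocabulary lookup as one token, which is why it has to be skipped for word-segmented input.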

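You also need to tokenize the text with [Juman++](https://github.com/ku-nlp/jumanpp) before passing it to the BERT model. The code below calls Juman++ through the `pyknp` package.

You need to install pytorch-pretrained-BERT (the fork above, which provides the `kyoto_bert` argument) in order to run the notebook.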

## Load pretrained BERT model for Japanese

We now load the [pretrained BERT model](http://nlp.ist.i.kyoto-u.ac.jp/index.php?BERT%E6%97%A5%E6%9C%AC%E8%AA%9EPretrained%E3%83%A2%E3%83%87%E3%83%AB) published by Kyoto University.

Note that we have to set `do_lower_case=False`; otherwise, we may lose the voiced consonant marks (nigori) of Japanese characters.
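
The loss of nigori can be reproduced with the standard library alone. This is a sketch under the assumption that lowercasing in the BERT tokenizer is paired with an accent-stripping step that applies NFD normalization and drops combining marks. The function name `strip_accents` is made up for this example.

```python
import unicodedata


def strip_accents(text):
    """Decompose characters (NFD) and drop combining marks (Unicode category Mn)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


for word in ["が", "で", "学問", "ついて"]:
    # Voiced kana lose their mark; kanji and unvoiced kana are unchanged
    print(word, "->", strip_accents(word))
```

Under NFD, が decomposes into か plus a combining voiced sound mark, so removing combining marks turns が into か and で into て.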

In [1]:
import torch
from pytorch_pretrained_bert import BasicTokenizer, BertTokenizer, BertModel, BertForMaskedLM
from pyknp import Juman

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)

import logging
logging.basicConfig(level=logging.INFO)

jumanpp = Juman()

def tokenize(text):
    result = jumanpp.analysis(text)
    tokens = []
    for mrph in result.mrph_list():
        tokens.append(mrph.midasi)
    return ' '.join(tokens)
        
path_to_pretrained_model = '/Users/minhpqn/workspace/Japanese_L-12_H-768_A-12_E-30_BPE'

# Load pre-trained model tokenizer (vocabulary)
tokenizer = BertTokenizer.from_pretrained(path_to_pretrained_model, kyoto_bert=True, do_lower_case=False)

# Tokenized input
text = "数学の最も普通の定義としては、「数および図形についての学問」というものがある。"
text = tokenize(text)
text = '[CLS] ' + text
tokenized_text = tokenizer.tokenize(text)
print(tokenized_text)

# Mask a token that we will try to predict back with `BertForMaskedLM`
masked_index = 8
tokenized_text[masked_index] = '[MASK]'

print(tokenized_text)

indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
tokens_tensor = torch.tensor([indexed_tokens])
tokens_tensor = tokens_tensor.to(device)

print(indexed_tokens)
print(tokens_tensor.size())
print(tokens_tensor)

INFO:pytorch_pretrained_bert.tokenization:loading vocabulary file /Users/minhpqn/workspace/Japanese_L-12_H-768_A-12_E-30_BPE/vocab.txt


cpu
['[CLS]', '数学', 'の', '最も', '普通の', '定義', 'と', 'して', 'は', '、', '「', '数', 'および', '図形', 'に', 'ついて', 'の', '学問', '」', 'と', 'いう', 'もの', 'が', 'ある', '。']
['[CLS]', '数学', 'の', '最も', '普通の', '定義', 'と', 'して', '[MASK]', '、', '「', '数', 'および', '図形', 'に', 'ついて', 'の', '学問', '」', 'と', 'いう', 'もの', 'が', 'ある', '。']
[2, 2938, 5, 476, 7078, 1315, 12, 19, 4, 6, 24, 145, 186, 17201, 8, 130, 5, 5476, 25, 12, 56, 60, 11, 38, 7]
torch.Size([1, 25])
tensor([[    2,  2938,     5,   476,  7078,  1315,    12,    19,     4,     6,
            24,   145,   186, 17201,     8,   130,     5,  5476,    25,    12,
            56,    60,    11,    38,     7]])
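
The masked position above is hard-coded as `masked_index = 8`. As a small sketch, the index can also be looked up from the token itself. The helper `mask_first` is hypothetical and not part of pytorch-pretrained-BERT; the token list is copied from the output above.

```python
def mask_first(tokens, target, mask_token="[MASK]"):
    """Return a copy of `tokens` with the first `target` replaced by the mask,
    together with the index that was masked."""
    index = tokens.index(target)  # raises ValueError if target is absent
    masked = list(tokens)
    masked[index] = mask_token
    return masked, index


tokens = ['[CLS]', '数学', 'の', '最も', '普通の', '定義', 'と', 'して', 'は', '、',
          '「', '数', 'および', '図形', 'に', 'ついて', 'の', '学問', '」', 'と',
          'いう', 'もの', 'が', 'ある', '。']
masked, masked_index = mask_first(tokens, 'は')
print(masked_index)
print(masked)
```

Looking the index up this way keeps the mask on the intended word if the tokenization changes.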


Let's see how to use ```BertModel``` to get hidden states.

In [2]:
# Load pre-trained model (weights)
model = BertModel.from_pretrained(path_to_pretrained_model, cache_dir=None)
model.eval()
model.to(device)

INFO:pytorch_pretrained_bert.modeling:loading archive file /Users/minhpqn/workspace/Japanese_L-12_H-768_A-12_E-30_BPE
INFO:pytorch_pretrained_bert.modeling:Model config {
  "attention_probs_dropout_prob": 0.1,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "max_position_embeddings": 512,
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "type_vocab_size": 2,
  "vocab_size": 32006
}



BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(32006, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): BertLayerNorm()
    (dropout): Dropout(p=0.1)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0): BertLayer(
        (attention): BertAttention(
          (self): BertSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): BertLayerNorm()
            (dropout): Dropout(p=0.1)
          )
        )
        (intermediate): BertIntermediate(
          (dense): Linear(in_features=768, out_features=3072, bias=

Get hidden states of the input

In [3]:
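# Get the last encoder layer and the pooled output
with torch.no_grad():
    encoded_layers, hidden = model(tokens_tensor, output_all_encoded_layers=False)
# With output_all_encoded_layers=False, encoded_layers is a single tensor of shape
# [batch, seq_len, hidden_size], so its length is the batch size (1 here)
assert len(encoded_layers) == 1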
print(encoded_layers[0].size())
print(hidden.size())
print(encoded_layers[0][-1,:].size())
print(hidden)
print(encoded_layers[0][-1,:])


torch.Size([25, 768])
torch.Size([1, 768])
torch.Size([768])
tensor([[-2.6145e-01,  1.6846e-01, -9.9489e-01,  1.6224e-01,  6.5779e-02,
         -1.6521e-01, -1.0453e-01, -6.4537e-02, -8.9345e-01, -4.5911e-01,
         -2.0780e-01,  9.0893e-01,  7.4370e-02,  9.9050e-01,  3.2349e-01,
         -6.2441e-02,  7.7569e-02, -2.4102e-01, -9.9853e-01, -1.8675e-02,
          9.1260e-01, -2.4392e-01, -8.3436e-01,  1.3213e-01, -7.2599e-01,
          1.0340e-01, -6.5891e-02,  3.3977e-01,  1.7579e-02,  9.6078e-01,
         -3.5215e-01,  2.1684e-02, -1.7530e-01, -9.3593e-01,  2.3440e-01,
         -2.9097e-02, -9.0766e-01, -4.3840e-02, -1.9927e-01, -4.9560e-01,
         -1.8500e-02,  1.0180e-01,  1.4831e-01,  3.8736e-03,  6.5150e-01,
         -1.8352e-01,  5.3526e-02, -1.1046e-01,  8.6650e-01, -9.1555e-02,
          6.6434e-01,  1.7088e-01,  1.1503e-02, -9.9135e-01,  7.7298e-01,
          1.3503e-01,  8.0489e-02, -2.3020e-02, -8.9470e-02,  9.9903e-01,
          5.9107e-02,  4.3549e-02, -1.0930e-01,  4.

Now we get the representations of multiple sentences at once.

In [4]:
from keras.preprocessing.sequence import pad_sequences

texts = ["私は２０歳で学生です。", "数学の最も普通の定義としては、「数および図形についての学問」というものがある。"]
texts = list(map(tokenize, texts))
tokenized_texts = list(map(lambda t: ['[CLS]'] + tokenizer.tokenize(t), texts))

print(tokenized_texts)

indexed_tokens = list(map(tokenizer.convert_tokens_to_ids, tokenized_texts))
maxlen = max([len(s) for s in indexed_tokens])

padded_tokens = pad_sequences(indexed_tokens, maxlen=maxlen, truncating="post", padding="post", dtype="int")
print(padded_tokens)

tokens_tensor = torch.tensor(padded_tokens)
tokens_tensor = tokens_tensor.to(device)

print(tokens_tensor.size())
print(tokens_tensor)

Using TensorFlow backend.


[['[CLS]', '私', 'は', '２０', '歳', 'で', '学生', 'です', '。'], ['[CLS]', '数学', 'の', '最も', '普通の', '定義', 'と', 'して', 'は', '、', '「', '数', 'および', '図形', 'に', 'ついて', 'の', '学問', '」', 'と', 'いう', 'もの', 'が', 'ある', '。']]
[[    2  1038     9   183   205    13  1098  3338     7     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0]
 [    2  2938     5   476  7078  1315    12    19     9     6    24   145
    186 17201     8   130     5  5476    25    12    56    60    11    38
      7]]
torch.Size([2, 25])
tensor([[    2,  1038,     9,   183,   205,    13,  1098,  3338,     7,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0],
        [    2,  2938,     5,   476,  7078,  1315,    12,    19,     9,     6,
            24,   145,   186, 17201,     8,   130,     5,  5476,    25,    12,
            56,    60,    11,    38,     7]])
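
The padded positions above are filled with id 0. The following sketch pads the same two id sequences in plain Python and also builds a 0/1 attention mask. The helper `pad_batch` is hypothetical. Passing the mask to the model through an `attention_mask` argument is an assumption about how one would use it; the calls in this notebook do not pass one.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad id sequences to a common length and build a 0/1 attention mask."""
    max_len = max(len(seq) for seq in sequences)
    padded, mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        padded.append(list(seq) + [pad_id] * n_pad)
        mask.append([1] * len(seq) + [0] * n_pad)
    return padded, mask


# Token ids copied from the output above
batch = [
    [2, 1038, 9, 183, 205, 13, 1098, 3338, 7],
    [2, 2938, 5, 476, 7078, 1315, 12, 19, 9, 6, 24, 145, 186, 17201, 8, 130,
     5, 5476, 25, 12, 56, 60, 11, 38, 7],
]
padded, attention_mask = pad_batch(batch)
print([len(row) for row in padded])
print([sum(row) for row in attention_mask])  # number of real tokens per sentence

# Assumed usage (not run here):
# model(torch.tensor(padded), attention_mask=torch.tensor(attention_mask))
```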


In [5]:
with torch.no_grad():
    encoded_layers, hidden = model(tokens_tensor, output_all_encoded_layers=True)
print(encoded_layers[-2].size())
print(hidden.size())

torch.Size([2, 25, 768])
torch.Size([2, 768])


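Let's see some outputs. We average the token vectors of the second-to-last layer in two ways, with `torch.mean` and with a hand-written `avg` function, and check that the results agree.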

In [8]:
def avg(src_h):
    # Mean over the sequence dimension, computed as a batched matrix product
    batch_size = src_h.size(0)
    src_length = src_h.size(1)
    # Uniform weights 1/src_length, created on the same device and dtype as the input
    avg_weights = torch.ones(batch_size, 1, src_length, device=src_h.device, dtype=src_h.dtype) / src_length
    src_h_t = torch.bmm(avg_weights, src_h)
    src_h_t = src_h_t[:,0,:]
    return src_h_t

print( torch.mean(encoded_layers[-2], 1) )
print( avg(encoded_layers[-2]) )

tensor([[ 0.8578, -0.3807, -0.3330,  ..., -0.4755,  0.5379, -0.8880],
        [ 0.0526, -0.4467, -0.4227,  ...,  0.5322, -0.0773,  0.0536]])
tensor([[ 0.8578, -0.3807, -0.3330,  ..., -0.4755,  0.5379, -0.8880],
        [ 0.0526, -0.4467, -0.4227,  ...,  0.5322, -0.0773,  0.0536]])
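
Both averages above run over all 25 positions, so for the shorter sentence they include the padded positions. A minimal sketch of an average that skips padding is given below in plain Python. The vectors are made-up toy values and `masked_mean` is a hypothetical helper.

```python
def masked_mean(vectors, mask):
    """Average only the token vectors whose mask entry is 1."""
    kept = [vec for vec, keep in zip(vectors, mask) if keep]
    dim = len(vectors[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


# Toy sentence: two real tokens followed by two padding positions (made-up values)
vectors = [[1.0, 2.0], [3.0, 4.0], [10.0, 10.0], [10.0, 10.0]]
mask = [1, 1, 0, 0]

plain_mean = [sum(col) / len(vectors) for col in zip(*vectors)]
print(plain_mean)                  # padding positions pull the average away
print(masked_mean(vectors, mask))  # only the two real tokens contribute
```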


Now, we use ```BertForMaskedLM``` to predict tokens.

In [4]:
# Load pre-trained model (weights)
model = BertForMaskedLM.from_pretrained(path_to_pretrained_model, cache_dir=None)
model.eval()
model.to(device)

INFO:pytorch_pretrained_bert.modeling:loading archive file /Users/minhpqn/workspace/Japanese_L-12_H-768_A-12_E-30_BPE
INFO:pytorch_pretrained_bert.modeling:Model config {
  "attention_probs_dropout_prob": 0.1,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "max_position_embeddings": 512,
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "type_vocab_size": 2,
  "vocab_size": 32006
}

INFO:pytorch_pretrained_bert.modeling:Weights from pretrained model not used in BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']


BertForMaskedLM(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(32006, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): BertLayerNorm()
      (dropout): Dropout(p=0.1)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): BertLayerNorm()
              (dropout): Dropout(p=0.1)
            )
          )
          (intermediate): BertInterm

In [5]:
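# Rebuild the masked single-sentence input from the first cell, because
# tokens_tensor was overwritten by the two-sentence batch above
indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
tokens_tensor = torch.tensor([indexed_tokens]).to(device)

with torch.no_grad():
    predictions = model(tokens_tensor)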

predicted_index = torch.argmax(predictions[0, masked_index]).item()
predicted_token = tokenizer.convert_ids_to_tokens([predicted_index])[0]
print(predicted_token)

は
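
The cell above keeps only the single best token. To look at several candidates, the scores at the masked position can be ranked. The sketch below does this in plain Python on a hypothetical five-token vocabulary with made-up scores; `top_k` is an illustrative helper, not a library function.

```python
import heapq
import math


def top_k(logits, id_to_token, k=3):
    """Return the k highest-scoring tokens with their softmax probabilities."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    best = heapq.nlargest(k, range(len(logits)), key=lambda i: logits[i])
    return [(id_to_token[i], exps[i] / total) for i in best]


# Hypothetical vocabulary and scores, for illustration only
id_to_token = ['は', 'が', 'も', 'を', 'に']
logits = [4.0, 2.5, 2.0, -1.0, 0.5]

for token, prob in top_k(logits, id_to_token):
    print(token, round(prob, 3))
```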


## Using BERT with SentencePiece for Japanese text

We now use BERT with SentencePiece for Japanese. The point is that this pretrained BERT model uses SentencePiece for subword tokenization. The model is available at [https://github.com/yoheikikuta/bert-japanese](https://github.com/yoheikikuta/bert-japanese).

You need to install [sentencepiece](https://github.com/google/sentencepiece) in order to use that pretrained model.

The pretrained BERT model in [https://github.com/yoheikikuta/bert-japanese](https://github.com/yoheikikuta/bert-japanese) can only be used directly with the TensorFlow implementation of BERT. To use it with the PyTorch implementation, you need to convert the TensorFlow model into a format that is compatible with PyTorch. I followed the instructions in [https://github.com/huggingface/pytorch-pretrained-BERT#Command-line-interface](https://github.com/huggingface/pytorch-pretrained-BERT#Command-line-interface) to convert the model into a PyTorch saved model file.

In [28]:
import torch
from pytorch_pretrained_bert import BasicTokenizer, BertTokenizer, BertModel, BertForMaskedLM

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)
        
path_to_pretrained_model = '/Users/minhpqn/nlp/data/japanese/bert/bert-wiki-ja'
# path_to_pretrained_model = '/Users/minhpqn/nlp/data/japanese/bert/multi_cased_L-12_H-768_A-12'

model = BertModel.from_pretrained(path_to_pretrained_model, cache_dir=None)
model.eval()
model.to(device)  # keep the model on the same device as the input tensors

INFO:pytorch_pretrained_bert.modeling:loading archive file /Users/minhpqn/nlp/data/japanese/bert/bert-wiki-ja
INFO:pytorch_pretrained_bert.modeling:Model config {
  "attention_probs_dropout_prob": 0.1,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "max_position_embeddings": 512,
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "type_vocab_size": 2,
  "vocab_size": 32000
}



cpu


BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(32000, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): BertLayerNorm()
    (dropout): Dropout(p=0.1)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0): BertLayer(
        (attention): BertAttention(
          (self): BertSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): BertLayerNorm()
            (dropout): Dropout(p=0.1)
          )
        )
        (intermediate): BertIntermediate(
          (dense): Linear(in_features=768, out_features=3072, bias=

Now we create a tokenizer using sentencepiece. We use the module `tokenization_sentencepiece.py` from [https://github.com/yoheikikuta/bert-japanese](https://github.com/yoheikikuta/bert-japanese).

In [29]:
import os
import tokenization_sentencepiece as tokenization

model_file = os.path.join(path_to_pretrained_model, 'wiki-ja.model')
vocab_file = os.path.join(path_to_pretrained_model, 'wiki-ja.vocab')

tokenizer = tokenization.FullTokenizer(model_file=model_file, vocab_file=vocab_file, do_lower_case=False)
text = "数学の最も普通の定義としては、「数および図形についての学問」というものがある。"
tokenized_text = tokenizer.tokenize(text)
tokenized_text.insert(0,'[CLS]')
print(tokenized_text)

masked_index = 8
tokenized_text[masked_index] = '[MASK]'
print(tokenized_text)

indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
tokens_tensor = torch.tensor([indexed_tokens])
tokens_tensor = tokens_tensor.to(device)

print(indexed_tokens)
print(tokens_tensor.size())

# Let's see how to use BertModel to get hidden states.
# Predict hidden states features for each layer
with torch.no_grad():
    encoded_layers, hidden = model(tokens_tensor, output_all_encoded_layers=True)
assert len(encoded_layers) == 12
print(encoded_layers[0].size())
print(hidden)

Loaded a trained SentencePiece model.
['[CLS]', '▁', '数学', 'の', '最も', '普通', 'の定義', 'としては', '、「', '数', 'および', '図形', 'についての', '学問', '」', 'という', 'ものがある', '。']
['[CLS]', '▁', '数学', 'の', '最も', '普通', 'の定義', 'としては', '[MASK]', '数', 'および', '図形', 'についての', '学問', '」', 'という', 'ものがある', '。']
[4, 9, 4560, 10, 1016, 2334, 9070, 512, 6, 181, 144, 24031, 3226, 6078, 21, 49, 5805, 8]
torch.Size([1, 18])
torch.Size([1, 18, 768])
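
The `▁` piece in the output above is the marker SentencePiece uses for a word boundary (whitespace). As a small sketch, the pieces can be joined back into the original sentence by concatenating them and turning the marker into a space. `detokenize` is a hypothetical helper; the pieces are copied from the output above without `[CLS]`.

```python
def detokenize(pieces, space_marker="▁"):
    """Join SentencePiece pieces and turn the word-boundary marker back into spaces."""
    return "".join(pieces).replace(space_marker, " ").strip()


pieces = ['▁', '数学', 'の', '最も', '普通', 'の定義', 'としては', '、「', '数',
          'および', '図形', 'についての', '学問', '」', 'という', 'ものがある', '。']
print(detokenize(pieces))
```

Unlike the Juman++ pipeline, no separate word segmenter is needed here, and pieces such as `の定義` or `ものがある` can span several words.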


Next, we output only the last encoded layer.

In [32]:
with torch.no_grad():
    encoded_layers, _ = model(tokens_tensor, output_all_encoded_layers=False)
print(encoded_layers[0].size())

torch.Size([18, 768])


Now, we use BertForMaskedLM to predict tokens.

In [24]:
# Load pre-trained model (weights)
model = BertForMaskedLM.from_pretrained(path_to_pretrained_model, cache_dir=None)
model.eval()
model.to(device)

INFO:pytorch_pretrained_bert.modeling:loading archive file /Users/minhpqn/nlp/data/japanese/bert/bert-wiki-ja
INFO:pytorch_pretrained_bert.modeling:Model config {
  "attention_probs_dropout_prob": 0.1,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "max_position_embeddings": 512,
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "type_vocab_size": 2,
  "vocab_size": 32000
}

INFO:pytorch_pretrained_bert.modeling:Weights from pretrained model not used in BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']


BertForMaskedLM(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(32000, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): BertLayerNorm()
      (dropout): Dropout(p=0.1)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): BertLayerNorm()
              (dropout): Dropout(p=0.1)
            )
          )
          (intermediate): BertInterm

In [25]:
with torch.no_grad():
    predictions = model(tokens_tensor)

predicted_index = torch.argmax(predictions[0, masked_index]).item()
predicted_token = tokenizer.convert_ids_to_tokens([predicted_index])[0]
print(predicted_token)

、「


## References

- BERT with SentencePiece for Japanese text. [https://github.com/yoheikikuta/bert-japanese](https://github.com/yoheikikuta/bert-japanese)
- [pytorch-pretrained-BERT](https://github.com/huggingface/pytorch-pretrained-BERT)