# Using BERT for Japanese

In this notebook, we try out the new Japanese BERT model in the `transformers` library.

Let's start with Japanese tokenization in BERT.

In [1]:
import torch
from transformers import BertJapaneseTokenizer, BertModel, BertTokenizer, MecabTokenizer

In [8]:
tokenizer = BertJapaneseTokenizer.from_pretrained('bert-base-japanese', word_tokenizer_type='mecab', do_lower_case=False)

In [16]:
tokenizer.build_inputs_with_special_tokens(tokenizer.convert_tokens_to_ids(tokenizer.tokenize("こんにちは、世界。こんばんは、世界。")))

[2, 10350, 25746, 28450, 6, 324, 8, 10350, 14815, 28450, 6, 324, 8, 3]

In [17]:
tokenizer.convert_tokens_to_ids(tokenizer.tokenize("こんにちは、世界。こんばんは、世界。"))

[10350, 25746, 28450, 6, 324, 8, 10350, 14815, 28450, 6, 324, 8]
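
Comparing the two results above, the version with special tokens differs only by an id `2` at the front and an id `3` at the end, which is the usual `[CLS] ... [SEP]` wrapping for a single sentence. The sketch below reproduces that step by hand. The helper name `add_special_tokens` is hypothetical, and the two ids are simply read off the outputs above rather than looked up in the vocabulary file.

```python
# Ids read off the outputs above (assumed to be [CLS] and [SEP]).
CLS_ID = 2
SEP_ID = 3


def add_special_tokens(token_ids):
    """Wrap a single sentence as [CLS] ids... [SEP] (hypothetical helper)."""
    return [CLS_ID] + list(token_ids) + [SEP_ID]


plain_ids = [10350, 25746, 28450, 6, 324, 8, 10350, 14815, 28450, 6, 324, 8]
with_special = add_special_tokens(plain_ids)
print(with_special)
```

The printed list should match the output of `build_inputs_with_special_tokens` shown above.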

In [3]:
# tokenize input
text = "数学の最も普通の定義としては、「数および図形についての学問」というものがある。"
tokenized_text = tokenizer.tokenize(text)
print(tokenized_text)

['数学', 'の', '最も', '普通', 'の', '定義', 'として', 'は', '、', '「', '数', 'および', '図形', 'について', 'の', '学問', '」', 'という', 'もの', 'が', 'ある', '。']


In [4]:
print(tokenizer.tokenize("他主要言語圏においても同様である。"))

['他', '主要', '言語', '圏', 'において', 'も', '同様', 'で', 'ある', '。']


In [5]:
text = "こんにちは、世界。\nこんばんは、世界。"
tokenized_text = tokenizer.tokenize(text)
print(tokenized_text)

['こん', '##にち', '##は', '、', '世界', '。', 'こん', '##ばん', '##は', '、', '世界', '。']
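
The `##` prefix marks a piece that continues the previous one inside the same word. A minimal sketch of the greedy longest-match-first splitting that WordPiece-style tokenizers apply to each word is shown below. The function `wordpiece` and the tiny `toy_vocab` are assumptions for illustration only; the real tokenizer uses the full vocabulary shipped with the model.

```python
def wordpiece(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry "##"
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]  # nothing matched: treat the whole word as unknown
        pieces.append(piece)
        start = end
    return pieces


# Toy vocabulary (hypothetical, not the real vocab file).
toy_vocab = {"こん", "##にち", "##ばん", "##は", "世界", "、", "。"}

print(wordpiece("こんにちは", toy_vocab))
print(wordpiece("こんばんは", toy_vocab))
print(wordpiece("世界", toy_vocab))
```

With this toy vocabulary, the word-level token `こんにちは` is split into the same three pieces as in the output above.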


In [6]:
print(tokenizer.max_len_single_sentence)

510
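
The value 510 is consistent with a maximum input length of 512 minus the two special tokens (`[CLS]` and `[SEP]`) added around a single sentence. The sketch below shows that arithmetic and a simple truncation to the resulting budget. The constant `MODEL_MAX_LEN = 512` and the helper `truncate_single_sentence` are assumptions for illustration, not part of the library.

```python
MODEL_MAX_LEN = 512      # assumed maximum sequence length of the model
NUM_SPECIAL_TOKENS = 2   # [CLS] and [SEP] for a single sentence


def truncate_single_sentence(token_ids, max_len=MODEL_MAX_LEN,
                             num_special=NUM_SPECIAL_TOKENS):
    """Keep only as many ids as fit once the special tokens are added."""
    budget = max_len - num_special
    return list(token_ids[:budget])


print(MODEL_MAX_LEN - NUM_SPECIAL_TOKENS)
long_ids = list(range(1000))  # stand-in for a very long sentence
print(len(truncate_single_sentence(long_ids)))
```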


In [5]:
text = tokenizer.encode("こんにちは、世界。\nこんばんは、世界。", add_special_tokens=False)
text

[10350, 25746, 28450, 6, 324, 8, 10350, 14815, 28450, 6, 324, 8]

In [6]:
text = tokenizer.encode("こんにちは、世界。\nこんばんは、世界。", add_special_tokens=True)
text

[2, 10350, 25746, 28450, 6, 324, 8, 10350, 14815, 28450, 6, 324, 8, 3]

In [7]:
text = "こんにちは、世界。"
tokenized_text = tokenizer.tokenize(text)
print(tokenized_text)
input_ids = tokenizer.convert_tokens_to_ids(tokenized_text)
print(input_ids)

['こん', '##にち', '##は', '、', '世界', '。']
[10350, 25746, 28450, 6, 324, 8]


Next, we segment the text into words with MeCab ourselves and then apply BERT's subword tokenization to the result.

In [8]:
text = "こんにちは、世界。"
mecab_tokenizer = MecabTokenizer()
text_ = mecab_tokenizer.tokenize(text)
text_ = ' '.join(text_)
text_ = '[CLS] ' + text_ + ' [SEP]'
text_

'[CLS] こんにちは 、 世界 。 [SEP]'

In [9]:
tokenizer = BertJapaneseTokenizer.from_pretrained('bert-base-japanese', word_tokenizer_type='basic')
tokenized_text = tokenizer.tokenize(text_)
print(tokenized_text)

['[CLS]', 'こん', '##にち', '##は', '、', '世界', '。', '[SEP]']


First, load the pre-trained Japanese BERT model and switch it to evaluation mode.

In [None]:
model = BertModel.from_pretrained('bert-base-japanese')
model.eval()

Now convert the tokens to ids and wrap them in a tensor so that the model can encode the sentence.

In [None]:
input_ids = tokenizer.convert_tokens_to_ids(tokenized_text)
input_ids

In [None]:
tokens_tensor = torch.tensor([input_ids])
tokens_tensor.shape

Let's add some custom marker tokens to the input text and register them with the tokenizer.

In [None]:
text = '[CLS] [<S>] こんにちは [<S>] 、 [<O>] 世界 [<O>] 。 [SEP]'
tokenizer.add_tokens(['[<S>]', '[<O>]'])

In [None]:
tokenized_text = tokenizer.tokenize(text)
tokenized_text

In [None]:
input_ids = tokenizer.convert_tokens_to_ids(tokenized_text)
input_ids

In [None]:
model.resize_token_embeddings(len(tokenizer))
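
Conceptually, adding tokens gives them fresh ids past the end of the existing vocabulary, so the embedding matrix needs extra rows before the model can look those ids up. The toy sketch below mimics that bookkeeping with a plain dict and nested lists. The helpers `add_tokens` and `resize_embeddings`, the five-entry vocabulary, and the random initialisation are all assumptions for illustration and do not reproduce the library's internals.

```python
import random


def add_tokens(vocab, new_tokens):
    """Append unseen tokens to a token->id dict; return how many were added."""
    added = 0
    for token in new_tokens:
        if token not in vocab:
            vocab[token] = len(vocab)  # next free id
            added += 1
    return added


def resize_embeddings(embeddings, new_size, dim, seed=0):
    """Copy existing rows and append randomly initialised rows up to new_size."""
    rng = random.Random(seed)
    resized = [row[:] for row in embeddings[:new_size]]
    while len(resized) < new_size:
        resized.append([rng.gauss(0.0, 0.02) for _ in range(dim)])
    return resized


# Toy vocabulary and embedding table (hypothetical sizes).
toy_vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3, "世界": 4}
toy_embeddings = [[0.0] * 4 for _ in toy_vocab]

num_added = add_tokens(toy_vocab, ["[<S>]", "[<O>]"])
toy_embeddings = resize_embeddings(toy_embeddings, len(toy_vocab), dim=4)
print(num_added, toy_vocab["[<S>]"], toy_vocab["[<O>]"], len(toy_embeddings))
```

Without the resize step, the new ids would point past the last row of the embedding table.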

In [None]:
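# Compute the last-layer hidden states and the pooled output
tokens_tensor = torch.tensor([input_ids])  # rebuild from the ids that include the added tokens
with torch.no_grad():
    outputs = model(tokens_tensor)
# Index the outputs so this works whether the model returns a tuple or an output object
encoded_layers, hidden = outputs[0], outputs[1]
assert len(encoded_layers) == 1  # batch size is 1
print(encoded_layers[0].size())        # (sequence length, hidden size)
print(hidden.size())                   # pooled output: (1, hidden size)
print(encoded_layers[0][-1,:].size())  # vector of the last token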

## Trying the BERT model from Kyoto University

In order to use the new version of `transformers`, we need to download the new version of the pre-trained Japanese BERT model [here](http://nlp.ist.i.kyoto-u.ac.jp/index.php?BERT%E6%97%A5%E6%9C%AC%E8%AA%9EPretrained%E3%83%A2%E3%83%87%E3%83%AB). We do not need to comment out the line `# text = self._tokenize_chinese_chars(text)`.

In [None]:
from pyknp import Juman
jumanpp = Juman()

def tokenize(text):
    result = jumanpp.analysis(text)
    tokens = []
    for mrph in result.mrph_list():
        tokens.append(mrph.midasi)
    return ' '.join(tokens)
        
path_to_pretrained_model = '/Users/minhpqn/nlp/data/japanese/bert/kyodai_bert/Japanese_L-12_H-768_A-12_E-30_BPE_transformers'

# Load pre-trained model tokenizer (vocabulary)
tokenizer = BertTokenizer.from_pretrained(path_to_pretrained_model, do_lower_case=False)

# Tokenized input
text = "数学の最も普通の定義としては、「数および図形についての学問」というものがある。"
text = tokenize(text)
text = '[CLS] ' + text + ' [SEP]'
tokenized_text = tokenizer.tokenize(text)
print(tokenized_text)

indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
tokens_tensor = torch.tensor([indexed_tokens])

print(indexed_tokens)
print(tokens_tensor.size())
print(tokens_tensor)

In [None]:
# Load pre-trained model (weights)
model = BertModel.from_pretrained(path_to_pretrained_model, cache_dir=None)
model.eval()

In [None]:
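# Compute the last-layer hidden states and the pooled output
with torch.no_grad():
    outputs = model(tokens_tensor)
# Index the outputs so this works whether the model returns a tuple or an output object
encoded_layers, hidden = outputs[0], outputs[1]
assert len(encoded_layers) == 1  # batch size is 1
print(encoded_layers[0].size())        # (sequence length, hidden size)
print(hidden.size())                   # pooled output: (1, hidden size)
print(encoded_layers[0][-1,:].size())  # vector of the last token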

In [None]:
# Vector at the first position ([CLS]) for each sequence in the batch
encoded_layers[:, 0]
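
The model output has the layout (batch, sequence length, hidden size), so indexing with `[:, 0]` picks the first position (the `[CLS]` token) of every sequence, while `[0][-1, :]` picks the last token of the first sequence. The sketch below shows the same selections on a tiny nested list standing in for the tensor. The values and the helper names `first_token_vectors` and `last_token_vector` are made up for illustration.

```python
# Stand-in for a (batch, seq_len, hidden) tensor: 1 sequence, 3 tokens, 2 features.
toy_output = [
    [
        [0.1, 0.2],  # position 0: [CLS]
        [0.3, 0.4],  # position 1
        [0.5, 0.6],  # position 2: [SEP]
    ]
]


def first_token_vectors(batch):
    """Equivalent of tensor[:, 0]: first-position vector of every sequence."""
    return [sequence[0] for sequence in batch]


def last_token_vector(batch, index=0):
    """Equivalent of tensor[index][-1, :]: last-position vector of one sequence."""
    return batch[index][-1]


print(first_token_vectors(toy_output))
print(last_token_vector(toy_output))
```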