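### <strong>Topic:
Beer Review Rating Prediction - Tokenizer
### <strong>Description:
This exercise continues our work on the beer review dataset. Recall that after the preliminary data analysis <br />
last time, we set the maximum length limit for the BERT model to 255. That conclusion was reached without <br />
a tokenizer, however. Once a tokenizer is actually applied, the number of tokens in each review may differ from our earlier estimate, <br />
so the main goal of this assignment is to use the BERT tokenizer to verify whether the earlier conclusion needs to be revised.
### <strong>Tasks
1. Create the tokenizer used by English BERT, for use in the next task's analysis and in later training.
2. Use the tokenizer to generate the tokens for each review, and evaluate whether the maximum length limit obtained last time is reasonable.

#### <strong>Hint: Don't forget to add [SEP] and [CLS]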
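
As a minimal sketch of the check asked for in task 2, the snippet below counts tokens per review and adds two positions for `[CLS]` and `[SEP]`, then reports what fraction of reviews fit under a candidate limit. The helper names `sequence_lengths` and `coverage` are hypothetical, and a whitespace split stands in for the real tokenizer so the example runs without downloading a model.

```python
from typing import Callable, Iterable, List

NUM_SPECIAL_TOKENS = 2  # one [CLS] at the start, one [SEP] at the end


def sequence_lengths(texts: Iterable[str],
                     tokenize: Callable[[str], List[str]]) -> List[int]:
    """Token count per text, including the [CLS] and [SEP] positions."""
    return [len(tokenize(text)) + NUM_SPECIAL_TOKENS for text in texts]


def coverage(lengths: List[int], max_len: int) -> float:
    """Fraction of sequences that fit within max_len without truncation."""
    if not lengths:
        return 0.0
    return sum(length <= max_len for length in lengths) / len(lengths)


# Stand-in tokenizer (whitespace split), an assumption for illustration only.
# In the notebook, pass TOKENIZER.tokenize instead.
sample_reviews = [
    "Pours a clouded gold with a thin white head.",
    "Overall a nice drinker indeed.",
]
lengths = sequence_lengths(sample_reviews, str.split)
print(lengths)
print(coverage(lengths, max_len=8))
```

With the real tokenizer, the call would look like `sequence_lengths(TRAIN['review/text'], TOKENIZER.tokenize)`. WordPiece usually produces more tokens than a whitespace split, so the lengths from the earlier analysis are likely to be underestimates.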

In [1]:
import torch
import transformers
import numpy as np
import pandas as pd
from torch.utils.data import Dataset, DataLoader
from torch import nn, optim
from transformers import BertModel, BertTokenizer
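# transformers.AdamW is deprecated; the PyTorch optimizer is used in its place
from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup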

In [2]:
TRAIN = pd.read_json("train_set.json")
TEST = pd.read_json("test_set.json")

In [3]:
PRE_TRAINED_MODEL_NAME = 'bert-base-cased'
TOKENIZER = BertTokenizer.from_pretrained(PRE_TRAINED_MODEL_NAME)

In [4]:
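text = TRAIN['review/text'][0]
# Split the first review into WordPiece tokens ([CLS]/[SEP] are not added at this step)
tokens = TOKENIZER.tokenize(text)
# Map each token to its vocabulary id
tokens_id = TOKENIZER.convert_tokens_to_ids(tokens)
print(text)
print(tokens[:50])
print(tokens_id[:50])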

Pours a clouded gold with a thin white head. Nose is quite floral with a larger amount of spices added. Definitely a spice forward fragrance. Flavor has an odd burn that hits on the first sip. After it fades it seems like a dirty vanilla aftertaste. Perhaps this is the absinthe? Regardless of that, I get a quite spiced tone on the tongue. Almost feel a little heat from it. I think that my inexperienced palate on these spices is contributing to my ignorance of what precisely they are. Overall a nice drinker indeed.
['Po', '##urs', 'a', 'cloud', '##ed', 'gold', 'with', 'a', 'thin', 'white', 'head', '.', 'No', '##se', 'is', 'quite', 'floral', 'with', 'a', 'larger', 'amount', 'of', 'spices', 'added', '.', 'De', '##fin', '##ite', '##ly', 'a', 's', '##pice', 'forward', 'f', '##rag', '##rance', '.', 'F', '##lav', '##or', 'has', 'an', 'odd', 'burn', 'that', 'hits', 'on', 'the', 'first', 'sip']
[18959, 7719, 170, 7180, 1174, 2284, 1114, 170, 4240, 1653, 1246, 119, 1302, 2217, 1110, 2385, 22504,
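
The output above splits `Pours` into `Po` and `##urs`, which is why token counts exceed word counts. The following is a simplified sketch of the greedy longest-match-first idea behind WordPiece, not the library's actual implementation. The function `wordpiece` and the six-entry `TOY_VOCAB` are hypothetical; the ids are copied from the output above.

```python
from typing import Dict, List


def wordpiece(word: str, vocab: Dict[str, int], unk: str = "[UNK]") -> List[str]:
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces: List[str] = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it appears in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece matched, so the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


# Toy vocabulary (assumption); the real bert-base-cased vocabulary is far larger.
TOY_VOCAB = {"Po": 18959, "##urs": 7719, "a": 170,
             "cloud": 7180, "##ed": 1174, "gold": 2284}

for word in ["Pours", "a", "clouded", "gold", "zzz"]:
    print(word, wordpiece(word, TOY_VOCAB))
```

The real tokenizer also separates punctuation before this step, which the sketch leaves out.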

In [5]:
print(TOKENIZER.sep_token, TOKENIZER.sep_token_id)
print(TOKENIZER.cls_token, TOKENIZER.cls_token_id)
print(TOKENIZER.pad_token, TOKENIZER.pad_token_id)
print(TOKENIZER.unk_token, TOKENIZER.unk_token_id)

[SEP] 102
[CLS] 101
[PAD] 0
[UNK] 100
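
To make the role of these special-token ids concrete, here is a hand-rolled sketch of how a padded input and its attention mask can be assembled: wrap the ids in `[CLS]`/`[SEP]`, truncate from the end, then pad on the right. The function `build_inputs` is hypothetical and only approximates what the tokenizer does for a single sequence; it assumes `max_len` is at least 2.

```python
from typing import List, Tuple

CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # ids printed by the cell above


def build_inputs(token_ids: List[int], max_len: int) -> Tuple[List[int], List[int]]:
    """Wrap ids in [CLS]/[SEP], truncate, then right-pad to max_len."""
    body = token_ids[: max_len - 2]  # leave room for the two special tokens
    input_ids = [CLS_ID] + body + [SEP_ID]
    attention_mask = [1] * len(input_ids)  # 1 marks real tokens
    padding = max_len - len(input_ids)
    # 0 in the mask marks [PAD] positions the model should ignore
    return input_ids + [PAD_ID] * padding, attention_mask + [0] * padding


ids, mask = build_inputs([18959, 7719, 170], max_len=8)
print(ids)
print(mask)
```

Because the two special tokens occupy positions of their own, a review only fits untruncated when it has at most `max_len - 2` WordPiece tokens.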


In [6]:
MAX_SEQ_LEN = 256

In [7]:
encoding = TOKENIZER.encode_plus(
  text,
  max_length=MAX_SEQ_LEN,
  add_special_tokens=True,
  return_token_type_ids=False,
  padding='max_length',  # replaces the deprecated pad_to_max_length=True
  return_attention_mask=True,
  truncation=True,
  return_tensors='pt',
)

print(encoding["input_ids"][0])
print(encoding["attention_mask"])

tensor([  101, 18959,  7719,   170,  7180,  1174,  2284,  1114,   170,  4240,
         1653,  1246,   119,  1302,  2217,  1110,  2385, 22504,  1114,   170,
         2610,  2971,  1104, 25133,  1896,   119,  3177, 16598,  3150,  1193,
          170,   188, 15633,  1977,   175, 20484, 10555,   119,   143,  9516,
         1766,  1144,  1126,  5849,  6790,  1115,  4919,  1113,  1103,  1148,
        11456,   119,  1258,  1122, 15854,  1116,  1122,  3093,  1176,   170,
         7320,  3498,  5878,  1170, 10401,  1566,   119,  5203,  1142,  1110,
         1103,   170,  4832, 10879,  4638,   136, 20498,  1104,  1115,   117,
          146,  1243,   170,  2385,   188, 15633,  1181,  3586,  1113,  1103,
         3661,   119,  8774,  1631,   170,  1376,  3208,  1121,  1122,   119,
          146,  1341,  1115,  1139,  1107, 11708,  3365, 28118,   185,  5971,
         1566,  1113,  1292, 25133,  1110,  7773,  1106,  1139, 21326,  1104,
         1184, 11228,  1152,  1132,   119,  8007,   170,  3505, 



In [8]:
print(TOKENIZER.convert_ids_to_tokens(encoding["input_ids"][0]))

['[CLS]', 'Po', '##urs', 'a', 'cloud', '##ed', 'gold', 'with', 'a', 'thin', 'white', 'head', '.', 'No', '##se', 'is', 'quite', 'floral', 'with', 'a', 'larger', 'amount', 'of', 'spices', 'added', '.', 'De', '##fin', '##ite', '##ly', 'a', 's', '##pice', 'forward', 'f', '##rag', '##rance', '.', 'F', '##lav', '##or', 'has', 'an', 'odd', 'burn', 'that', 'hits', 'on', 'the', 'first', 'sip', '.', 'After', 'it', 'fade', '##s', 'it', 'seems', 'like', 'a', 'dirty', 'van', '##illa', 'after', '##tas', '##te', '.', 'Perhaps', 'this', 'is', 'the', 'a', '##bs', '##int', '##he', '?', 'Regardless', 'of', 'that', ',', 'I', 'get', 'a', 'quite', 's', '##pice', '##d', 'tone', 'on', 'the', 'tongue', '.', 'Almost', 'feel', 'a', 'little', 'heat', 'from', 'it', '.', 'I', 'think', 'that', 'my', 'in', '##ex', '##per', '##ienced', 'p', '##ala', '##te', 'on', 'these', 'spices', 'is', 'contributing', 'to', 'my', 'ignorance', 'of', 'what', 'precisely', 'they', 'are', '.', 'Overall', 'a', 'nice', 'drink', '##er', 'in