# Tokenizer and Embedding
<br>
This notebook reviews and practices the tokenizer and embedding of BERT.<br>
We will go over this content using an existing library implementation.<br> <br>
The source code below is from <br> https://www.youtube.com/watch?v=zJW57aCBCTk&t=948s <br> <br>
I also referred to the official Hugging Face Tokenizers API documentation, <br> https://huggingface.co/docs/tokenizers/python/latest/

### bert-base-uncased
BERT comes in two sizes, 'bert-base' and 'bert-large'.<br>
The larger model generally performs better but is heavier, so we use 'bert-base' for demonstration in this notebook.<br>
'cased' and 'uncased' indicate whether capitalization is preserved: the 'uncased' model lowercases the text before tokenizing, while the 'cased' model keeps the original case.
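
To make the 'uncased' idea concrete, here is a minimal sketch in plain Python of the kind of normalization an uncased tokenizer applies before looking anything up in its vocabulary. The function name `uncased_normalize` is made up for illustration and is not part of the transformers library; it only approximates the real preprocessing (lowercasing plus accent stripping).

```python
import unicodedata

def uncased_normalize(text):
    """Hypothetical helper: lowercase the text and strip accent marks."""
    lowered = text.lower()
    # NFD splits an accented character into a base character plus a combining mark.
    decomposed = unicodedata.normalize("NFD", lowered)
    # Drop the combining marks (Unicode category 'Mn'), keeping the base characters.
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

print(uncased_normalize("Héllo World"))  # hello world
```

A 'cased' model would skip this step, so "Hello" and "hello" could map to different tokens.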

In [4]:
import torch
import transformers
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')


# Vocabulary.txt
We will first take a look at all the tokens in BERT's vocabulary.<br>
The vocabulary of the BERT tokenizer is built with WordPiece.
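
As a rough illustration of how a WordPiece vocabulary is used at tokenization time, the sketch below splits a single word by greedy longest-match-first lookup. The function `wordpiece_tokenize` and the tiny `toy_vocab` are assumptions made for this example, not the library's code or BERT's real vocabulary.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word into WordPiece-style tokens."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                # Pieces that continue a word carry the '##' prefix.
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits at this position: the whole word becomes unknown.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces

# Toy vocabulary, invented for this example only.
toy_vocab = {"play", "##ing", "##ed", "un", "##play", "##able"}

print(wordpiece_tokenize("playing", toy_vocab))     # ['play', '##ing']
print(wordpiece_tokenize("unplayable", toy_vocab))  # ['un', '##play', '##able']
print(wordpiece_tokenize("xyz", toy_vocab))         # ['[UNK]']
```

The real tokenizer also normalizes the text and splits it into words first; this sketch covers only the subword step.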

##### UnicodeEncodeError: 'cp949' codec can't encode character '\xa2' in position 0: illegal multibyte sequence<br>
<br>
If the error above occurs, it is because the data you are writing contains characters that the target file's default encoding (here cp949) cannot represent.<br>
In that case, pass -1 (the default buffering) and 'utf-8' (the encoding) as the third and fourth arguments to open().

In [5]:
with open("vocabulary.txt","w",-1,'utf-8') as f:
    for token in tokenizer.vocab.keys():
        f.write(token+'\n')
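
The same fix can also be written with a keyword argument, which may be easier to read than the positional form. The sketch below is a small self-contained check: it shows that the character from the error message cannot be encoded with cp949 but round-trips through a UTF-8 file. The file name `vocab_demo.txt` and the temporary directory are chosen only for this example.

```python
import os
import tempfile

token = "\xa2"  # the cent sign from the error message

# Encoding this character with cp949 raises UnicodeEncodeError.
try:
    token.encode("cp949")
    cp949_ok = True
except UnicodeEncodeError:
    cp949_ok = False

# UTF-8 can represent any Unicode character, so the write succeeds.
path = os.path.join(tempfile.mkdtemp(), "vocab_demo.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write(token + "\n")

# Read the file back with the same encoding to confirm the round trip.
with open(path, encoding="utf-8") as f:
    read_back = f.read().strip()

print("cp949 can encode it:", cp949_ok)
print("UTF-8 round trip matches:", read_back == token)
```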

In [6]:
one_chars=[]
one_chars_hashes=[]

for token in tokenizer.vocab.keys():
    if len(token)==1:
        one_chars.append(token)
        
    elif len(token)==3 and token[0:2]=='##':
        one_chars_hashes.append(token)

In [7]:
print('Number of single character tokens : ',len(one_chars),'\n')

for i in range(0,len(one_chars),40):
    end = min(i+40,len(one_chars))
    print(' '.join(one_chars[i:end]))

Number of single character tokens :  997 

! " # $ % & ' ( ) * + , - . / 0 1 2 3 4 5 6 7 8 9 : ; < = > ? @ [ \ ] ^ _ ` a b
c d e f g h i j k l m n o p q r s t u v w x y z { | } ~ ¡ ¢ £ ¤ ¥ ¦ § ¨ © ª « ¬
® ° ± ² ³ ´ µ ¶ · ¹ º » ¼ ½ ¾ ¿ × ß æ ð ÷ ø þ đ ħ ı ł ŋ œ ƒ ɐ ɑ ɒ ɔ ɕ ə ɛ ɡ ɣ ɨ
ɪ ɫ ɬ ɯ ɲ ɴ ɹ ɾ ʀ ʁ ʂ ʃ ʉ ʊ ʋ ʌ ʎ ʐ ʑ ʒ ʔ ʰ ʲ ʳ ʷ ʸ ʻ ʼ ʾ ʿ ˈ ː ˡ ˢ ˣ ˤ α β γ δ
ε ζ η θ ι κ λ μ ν ξ ο π ρ ς σ τ υ φ χ ψ ω а б в г д е ж з и к л м н о п р с т у
ф х ц ч ш щ ъ ы ь э ю я ђ є і ј љ њ ћ ӏ ա բ գ դ ե թ ի լ կ հ մ յ ն ո պ ս վ տ ր ւ
ք ־ א ב ג ד ה ו ז ח ט י ך כ ל ם מ ן נ ס ע ף פ ץ צ ק ר ש ת ، ء ا ب ة ت ث ج ح خ د
ذ ر ز س ش ص ض ط ظ ع غ ـ ف ق ك ل م ن ه و ى ي ٹ پ چ ک گ ں ھ ہ ی ے अ आ उ ए क ख ग च
ज ट ड ण त थ द ध न प ब भ म य र ल व श ष स ह ा ि ी ो । ॥ ং অ আ ই উ এ ও ক খ গ চ ছ জ
ট ড ণ ত থ দ ধ ন প ব ভ ম য র ল শ ষ স হ া ি ী ে க ச ட த ந ன ப ம ய ர ல ள வ ா ி ு ே
ை ನ ರ ಾ ක ය ර ල ව ා ก ง ต ท น พ ม ย ร ล ว ส อ า เ ་ ། ག ང ད ན པ བ མ འ ར ལ ས မ ა
ბ გ დ ე ვ თ ი კ ლ მ ნ ო რ ს ტ უ ᄀ ᄂ ᄃ ᄅ ᄆ ᄇ ᄉ ᄊ ᄋ ᄌ ᄎ ᄏ ᄐ ᄑ ᄒ ᅡ ᅢ ᅥ ᅦ ᅧ ᅩ ᅪ ᅭ

In [8]:
print('Number of single character tokens with hashes : ',len(one_chars_hashes),'\n')

tokens = [token.replace('##','') for token in one_chars_hashes]

for i in range(0,len(tokens),40):
    end = min(i+40,len(tokens))
    print(' '.join(tokens[i:end]))

Number of single character tokens with hashes :  997 

s a e i n o d r y t l m u h k c g p 2 z 1 b 3 f 4 6 7 x v 8 5 9 0 w j q ° ₂ а и
² ₃ ı ₁ ⁺ ½ о ه ي α е د ن ν ø р ₄ ₀ ر я ³ ι ł н ᵢ ₙ ß ة ς م − т ː ل ь к ♭ η ی в
ا × ¹ ы ה ɛ л ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~ ¡
¢ £ ¤ ¥ ¦ § ¨ © ª « ¬ ® ± ´ µ ¶ · º » ¼ ¾ ¿ æ ð ÷ þ đ ħ ŋ œ ƒ ɐ ɑ ɒ ɔ ɕ ə ɡ ɣ ɨ
ɪ ɫ ɬ ɯ ɲ ɴ ɹ ɾ ʀ ʁ ʂ ʃ ʉ ʊ ʋ ʌ ʎ ʐ ʑ ʒ ʔ ʰ ʲ ʳ ʷ ʸ ʻ ʼ ʾ ʿ ˈ ˡ ˢ ˣ ˤ β γ δ ε ζ
θ κ λ μ ξ ο π ρ σ τ υ φ χ ψ ω б г д ж з м п с у ф х ц ч ш щ ъ э ю ђ є і ј љ њ ћ
ӏ ա բ գ դ ե թ ի լ կ հ մ յ ն ո պ ս վ տ ր ւ ք ־ א ב ג ד ו ז ח ט י ך כ ל ם מ ן נ ס
ע ף פ ץ צ ק ר ש ת ، ء ب ت ث ج ح خ ذ ز س ش ص ض ط ظ ع غ ـ ف ق ك و ى ٹ پ چ ک گ ں ھ
ہ ے अ आ उ ए क ख ग च ज ट ड ण त थ द ध न प ब भ म य र ल व श ष स ह ा ि ी ो । ॥ ং অ আ
ই উ এ ও ক খ গ চ ছ জ ট ড ণ ত থ দ ধ ন প ব ভ ম য র ল শ ষ স হ া ি ী ে க ச ட த ந ன ப
ம ய ர ல ள வ ா ி ு ே ை ನ ರ ಾ ක ය ර ල ව ා ก ง ต ท น พ ม ย ร ล ว ส อ า เ ་ ། ག ང ད
ན པ བ མ འ ར ལ ས မ ა ბ გ დ ე ვ თ ი კ ლ მ ნ ო რ ს ტ უ ᄀ ᄂ ᄃ ᄅ ᄆ ᄇ ᄉ

In [9]:
print('Are the two sets identical?',set(one_chars)==set(tokens))

Are the two sets identical? True


### Some frequently misspelled words

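Perhaps Google removed misspelled words from the vocabulary? Note that a missing word is not necessarily misspelled: a correctly spelled word can also be absent and get split into subwords instead.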

In [10]:
print('Is misspelled in vocabulary?','misspelled' in tokenizer.vocab) # Right One
print('Is mispelled in vocabulary?','mispelled' in tokenizer.vocab) # Wrong One

Is misspelled in vocabulary? False
Is mispelled in vocabulary? False


In [11]:
print('Is government in vocabulary?','government' in tokenizer.vocab) # Right One
print('Is goverment in vocabulary?','goverment' in tokenizer.vocab) # Wrong One

Is government in vocabulary? True
Is goverment in vocabulary? False


In [12]:
print('Is beginning in vocabulary?','beginning' in tokenizer.vocab) # Right One
print('Is begining in vocabulary?','begining' in tokenizer.vocab) # Wrong One

Is beginning in vocabulary? True
Is begining in vocabulary? False


### How about contractions?

It seems the vocabulary doesn't include words that contain symbols such as apostrophes.

In [13]:
# The uncased vocabulary holds lowercase tokens only, so look up the lowercase forms.
print("Is can't in vocabulary?","can't" in tokenizer.vocab)
print("Is cant in vocabulary?","cant" in tokenizer.vocab)

Is can't in vocabulary? False
Is cant in vocabulary? False
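
One likely reason contractions are absent is that punctuation is split off into separate tokens before the vocabulary lookup happens. The sketch below approximates that step with a regular expression. The helper `split_on_punctuation` is hypothetical and simpler than the library's real pre-tokenizer (for instance, it treats the underscore as a word character).

```python
import re

def split_on_punctuation(text):
    """Hypothetical helper: lowercase, then split words and punctuation apart."""
    # \w+ matches a run of word characters; [^\w\s] matches one punctuation character.
    return re.findall(r"\w+|[^\w\s]", text.lower())

print(split_on_punctuation("I can't go."))  # ['i', 'can', "'", 't', 'go', '.']
```

With this kind of splitting, "can't" never reaches the vocabulary as one string, so it does not need its own entry.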


### Multi-Character Subwords
We saw that for single characters, the vocabulary contains both the standalone character and its ## version.<br>
This is not true for multi-character subwords!

In [14]:
print("Is ly in vocabulary?","ly" in tokenizer.vocab)
print("Is ##ly in vocabulary?","##ly" in tokenizer.vocab)

Is ly in vocabulary? False
Is ##ly in vocabulary? True
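
The ## prefix marks a piece that continues the previous token rather than starting a new word, which is why a suffix like "##ly" can exist without a standalone "ly". As a small sketch, the function below glues such pieces back into whole words. `merge_wordpieces` is a name invented for this example, not a transformers API.

```python
def merge_wordpieces(pieces):
    """Hypothetical helper: join '##' continuation pieces onto the preceding token."""
    words = []
    for piece in pieces:
        if piece.startswith("##") and words:
            # Continuation piece: append it to the previous word without the prefix.
            words[-1] += piece[2:]
        else:
            # A piece without the prefix starts a new word.
            words.append(piece)
    return words

print(merge_wordpieces(["quick", "##ly", "walk", "##ing"]))  # ['quickly', 'walking']
```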


### What about the Multilingual Tokenizer?
<br>
Let's take a look at another tokenizer.

In [15]:
multi_tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-cased')


In [16]:
with open("multivocabulary.txt","w",-1,'utf-8') as f:
    for token in multi_tokenizer.vocab.keys():
        f.write(token+'\n')

### Using the Tokenizer from the Hugging Face transformers library
<br>
BertTokenizer inherits from the PreTrainedTokenizer class in the transformers library.<br>

In [None]:
text1 = 

# Various Tokenizer Modules
<br>
Besides the BertTokenizer from the Hugging Face library, which implements WordPiece, there are various other modules and algorithms for tokenization.<br>
<br>
For example, the li