
In [0]:
import os
project_path = "/content/drive/My Drive/machine-translation/pretrain_language_model"
os.chdir(project_path)

In [0]:
! pip install transformers==2.7.0
from transformers import AutoTokenizer

In [3]:
tokenizer_path_en = "tokenizer_data_en"
! tree tokenizer_data_en

tokenizer_data_en
├── config.json
├── merges.txt
├── special_tokens_map.json
├── tokenizer_config.json
└── vocab.json

0 directories, 5 files


In [5]:
tokenizer_path_fr = "tokenizer_data_fr"
! tree tokenizer_data_fr

tokenizer_data_fr
├── config.json
├── merges.txt
├── special_tokens_map.json
├── tokenizer_config.json
└── vocab.json

0 directories, 5 files


### **Load/Restore Tokenizer**

In [0]:
# Make sure each path contains the five files listed above.
tokenizer_en = AutoTokenizer.from_pretrained(tokenizer_path_en)

tokenizer_fr = AutoTokenizer.from_pretrained(tokenizer_path_fr)

### How to use the tokenizer


*   To tokenize a sentence, use `tokens = tokenizer.tokenize(sentence)`.
*   To encode a sentence as integers, use `encoded_sequence = tokenizer.encode(sentence)`. Note that this also adds the start and end tokens, `<s>` and `</s>`, to the encoded output.
*   To decode (untokenize) an encoded sequence back to text, use `tokenizer.decode(encoded_sequence, skip_special_tokens=True)`.
*   To pad, use Keras's `pad_sequences` and pass `value=tokenizer.pad_token_id` so that the tokenizer-specific pad token is used.


The following examples show how to use this tokenizer.



In [12]:
text = "Montreal is a great city".strip()
tokenizer_en.tokenize(text)

['M', 'ont', 'real', 'Ġis', 'Ġa', 'Ġgreat', 'Ġcity']

Capitalized and lowercased inputs produce different tokens, so it is up to the user to decide how to normalize case before passing text to the tokenizer.

In [13]:
text = "Montreal is a great city".strip().lower()
tokenizer_en.tokenize(text)

['mont', 'real', 'Ġis', 'Ġa', 'Ġgreat', 'Ġcity']
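In the token lists above, `Ġ` appears to mark tokens that begin with a space, which is the usual convention in byte-level BPE vocabularies. The following is a minimal sketch, assuming plain ASCII input, of rebuilding the text from those tokens by hand. `join_bpe_tokens` is a hypothetical helper written for illustration and is not part of `transformers`; in practice `tokenizer.decode` is the safer route.

```python
def join_bpe_tokens(tokens):
    """Rebuild text from byte-level BPE tokens (hypothetical helper, ASCII only)."""
    # "Ġ" (U+0120) stands in for a leading space in the vocabulary.
    return "".join(tokens).replace("Ġ", " ")


# Token lists copied from the two tokenize() outputs above.
cased = ['M', 'ont', 'real', 'Ġis', 'Ġa', 'Ġgreat', 'Ġcity']
lowered = ['mont', 'real', 'Ġis', 'Ġa', 'Ġgreat', 'Ġcity']

print(join_bpe_tokens(cased))
print(join_bpe_tokens(lowered))
```

Both token lists map back to the same words; only the case of the first word, and therefore its split into sub-word pieces, differs.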

In [24]:
encoded_seq = tokenizer_en.encode(text)
encoded_seq

[1, 18325, 306, 263, 805, 3195, 2]

In [26]:
# decode sequence back!
tokenizer_en.decode(encoded_seq, skip_special_tokens=False)

'<s> montreal is a great city</s>'

In [28]:
tokenizer_en.decode(encoded_seq, skip_special_tokens=True)

' montreal is a great city'

In [30]:
tokens = tokenizer_en.encode_plus(text)
tokens

{'attention_mask': [1, 1, 1, 1, 1, 1, 1],
 'input_ids': [1, 18325, 306, 263, 805, 3195, 2]}

In [31]:
tokens["input_ids"]

[1, 18325, 306, 263, 805, 3195, 2]

In [33]:
tokenizer_en.get_special_tokens_mask(encoded_seq, already_has_special_tokens=True)

[1, 0, 0, 0, 0, 0, 1]

In [36]:
from tensorflow.keras.preprocessing.sequence import pad_sequences
# pad sequences!
padded_seq = pad_sequences([tokens["input_ids"]], padding='post', value=tokenizer_en.pad_token_id, maxlen=15)
padded_seq[0]

array([    1, 18325,   306,   263,   805,  3195,     2,     0,     0,
           0,     0,     0,     0,     0,     0], dtype=int32)

In [37]:
tokenizer_en.get_special_tokens_mask(padded_seq[0], already_has_special_tokens=True)

[1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]
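Note that the mask above flags only `<s>` and `</s>`; the padded positions are still 0. If padding also needs to be excluded, one option is to build a separate padding mask and combine the two. This is a sketch in plain Python that assumes the pad id is 0, as the padded output above suggests, and that no real token shares that id. The variable names are illustrative.

```python
PAD_ID = 0  # assumption: equals tokenizer_en.pad_token_id, per the padded output above

# Values copied from the padded sequence and special-tokens mask shown above.
padded = [1, 18325, 306, 263, 805, 3195, 2, 0, 0, 0, 0, 0, 0, 0, 0]
special_mask = [1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]

# 1 where the position holds a non-pad token, 0 where it holds padding.
attention_mask = [int(token_id != PAD_ID) for token_id in padded]

# 1 only for content tokens: neither padding nor <s> / </s>.
content_mask = [a & (1 - s) for a, s in zip(attention_mask, special_mask)]

print(attention_mask)
print(content_mask)
```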

#### Un-tokenize inputs

In [38]:
tokenizer_en.decode(padded_seq[0], skip_special_tokens=False)

'<s> montreal is a great city</s><pad><pad><pad><pad><pad><pad><pad><pad>'

In [39]:
tokenizer_en.decode(padded_seq[0], skip_special_tokens=True)

' montreal is a great city'

In [40]:
# encode batch in one go!
text1 = "Montreal is a great city".strip()
text2 = "California has good weather".strip()

texts = [text1, text2]
tokenizer_en.batch_encode_plus(texts)

{'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
  [1, 1, 1, 1, 1, 1, 1, 1, 1]],
 'input_ids': [[1, 225, 49, 2096, 9317, 306, 263, 805, 3195, 2],
  [1, 225, 39, 289, 8955, 407, 793, 5872, 2]]}
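The two encoded sequences above have different lengths (10 and 9), so they cannot be stacked into one array as they are. A minimal sketch of post-padding both `input_ids` and `attention_mask` to the longest sequence follows. `pad_batch` is a hypothetical helper, not a `transformers` function, and the pad id of 0 is an assumption based on the earlier padded output; Keras's `pad_sequences` with `padding='post'` would serve the same purpose for `input_ids`.

```python
def pad_batch(batch, pad_id):
    """Post-pad input_ids and attention_mask to the longest sequence (hypothetical helper)."""
    max_len = max(len(ids) for ids in batch["input_ids"])
    input_ids = [ids + [pad_id] * (max_len - len(ids)) for ids in batch["input_ids"]]
    # Padded positions get 0 in the attention mask so a model can ignore them.
    attention_mask = [m + [0] * (max_len - len(m)) for m in batch["attention_mask"]]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Values copied from the batch_encode_plus output above.
batch = {
    "attention_mask": [[1] * 10, [1] * 9],
    "input_ids": [
        [1, 225, 49, 2096, 9317, 306, 263, 805, 3195, 2],
        [1, 225, 39, 289, 8955, 407, 793, 5872, 2],
    ],
}

padded_batch = pad_batch(batch, pad_id=0)  # assumption: pad id is 0
print(padded_batch["input_ids"])
print(padded_batch["attention_mask"])
```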