
In [2]:
import os
project_path = "/home/alex/Translation-Team08-IFT6759"
os.chdir(project_path)

In [3]:
! pip install transformers===2.7.0 
from transformers import AutoTokenizer

Collecting transformers===2.7.0
  Downloading transformers-2.7.0-py3-none-any.whl (544 kB)
     |████████████████████████████████| 544 kB 2.3 MB/s eta 0:00:01
Collecting dataclasses; python_version < "3.7"
  Using cached dataclasses-0.7-py3-none-any.whl (18 kB)
Installing collected packages: dataclasses, transformers
  Attempting uninstall: transformers
    Found existing installation: transformers 2.6.0
    Uninstalling transformers-2.6.0:
      Successfully uninstalled transformers-2.6.0
Successfully installed dataclasses-0.7 transformers-2.7.0


In [6]:
tokenizer_path_en = "tokenizer_data_en"
! tree tokenizer_data_en

tokenizer_data_en
├── config.json
├── merges.txt
├── special_tokens_map.json
├── tokenizer_config.json
└── vocab.json

0 directories, 5 files


In [7]:
tokenizer_path_fr = "tokenizer_data_fr"
! tree tokenizer_data_fr

tokenizer_data_fr
├── config.json
├── merges.txt
├── special_tokens_map.json
├── tokenizer_config.json
└── vocab.json

0 directories, 5 files


### **Load/Restore Tokenizer**

In [8]:
# Make sure each path contains the five files listed above.
tokenizer_en = AutoTokenizer.from_pretrained(tokenizer_path_en)

tokenizer_fr = AutoTokenizer.from_pretrained(tokenizer_path_fr)
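
Since loading depends on those five files being present, a quick check before calling `from_pretrained` can make a missing file easier to spot. The sketch below is only an illustration: `missing_tokenizer_files` and `REQUIRED_FILES` are hypothetical names, and the file list is simply copied from the `tree` output above.

```python
import os

# File names taken from the `tree` listings above.
REQUIRED_FILES = (
    "config.json",
    "merges.txt",
    "special_tokens_map.json",
    "tokenizer_config.json",
    "vocab.json",
)

def missing_tokenizer_files(path, required=REQUIRED_FILES):
    """Return the required file names that are absent from `path` (hypothetical helper)."""
    return [name for name in required if not os.path.isfile(os.path.join(path, name))]

# Report problems before calling AutoTokenizer.from_pretrained.
for path in ("tokenizer_data_en", "tokenizer_data_fr"):
    missing = missing_tokenizer_files(path)
    if missing:
        print(f"{path}: missing {missing}")
```

An empty list means the directory has everything the listing shows.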

### How to use Tokenizer  


*   To tokenize a sentence, use `tokens = tokenizer.tokenize(sentence)`.
*   To encode a sentence as integers, use `encoded_sequence = tokenizer.encode(sentence)`. Note that this also adds the start and end tokens, `<s>` and `</s>`, to the encoded output.
*   To decode (untokenize) a sequence, use `tokenizer.decode(encoded_sequence, skip_special_tokens=True)`.
*   To pad, we use Keras's `pad_sequences`. Pass `value=tokenizer.pad_token_id` so that the tokenizer-specific pad token is used.


The following examples show how to use the tokenizer.



In [10]:
text = "Montreal is a great city.".strip()
tokenizer_en.tokenize(text)

['M', 'ont', 'real', 'Ġis', 'Ġa', 'Ġgreat', 'Ġcity', '.']

Cased and lowercased inputs tokenize differently (compare the output above with the one below), so it is up to the user to decide how to normalize the input before passing it to the tokenizer.

In [11]:
text = "Montreal is a great city.".strip().lower()
tokenizer_en.tokenize(text)

['mont', 'real', 'Ġis', 'Ġa', 'Ġgreat', 'Ġcity', '.']

In [12]:
encoded_seq = tokenizer_en.encode(text)
encoded_seq

[1, 18304, 306, 263, 803, 3194, 18, 2]

In [13]:
# decode sequence back!
tokenizer_en.decode(encoded_seq, skip_special_tokens=False)

'<s> montreal is a great city.</s>'

In [14]:
tokenizer_en.decode(encoded_seq, skip_special_tokens=True)

' montreal is a great city.'

In [15]:
tokens = tokenizer_en.encode_plus(text)
tokens

{'input_ids': [1, 18304, 306, 263, 803, 3194, 18, 2],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1]}

In [16]:
tokens["input_ids"]

[1, 18304, 306, 263, 803, 3194, 18, 2]

In [17]:
tokenizer_en.get_special_tokens_mask(encoded_seq, already_has_special_tokens=True)

[1, 0, 0, 0, 0, 0, 0, 1]

In [18]:
from tensorflow.keras.preprocessing.sequence import pad_sequences
# pad sequences!
padded_seq = pad_sequences([tokens["input_ids"]], padding='post', value=tokenizer_en.pad_token_id, maxlen=15)
padded_seq[0]

array([    1, 18304,   306,   263,   803,  3194,    18,     2,     0,
           0,     0,     0,     0,     0,     0], dtype=int32)
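
For readers without TensorFlow at hand, the padding step can be sketched in plain Python. This is not Keras's implementation, only an approximation of `padding='post'`; `pad_post` is a hypothetical name, and `pad_id=0` is inferred from the zeros in the output above. One detail worth knowing: `pad_sequences` truncates from the start by default (`truncating='pre'`), which would drop `<s>` when a sequence is longer than `maxlen`, whereas this sketch cuts from the end.

```python
def pad_post(sequences, maxlen, pad_id):
    """Pad (or cut) each list of token ids on the right to exactly `maxlen` entries."""
    padded = []
    for seq in sequences:
        seq = list(seq)[:maxlen]  # cut from the end if the sequence is too long
        padded.append(seq + [pad_id] * (maxlen - len(seq)))
    return padded

ids = [1, 18304, 306, 263, 803, 3194, 18, 2]  # encoded sentence from the cells above
print(pad_post([ids], maxlen=15, pad_id=0)[0])
# [1, 18304, 306, 263, 803, 3194, 18, 2, 0, 0, 0, 0, 0, 0, 0]
```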

In [19]:
tokenizer_en.get_special_tokens_mask(padded_seq[0], already_has_special_tokens=True)

[1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0]
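
Note that the mask above flags `<s>` and `</s>` but leaves the padded positions at 0, so on its own it does not separate real content from padding. A minimal sketch of combining it with a pad check follows; `content_mask` is a hypothetical helper, and the pad id of 0 is again read off the padded output rather than taken from the tokenizer.

```python
def content_mask(ids, special_mask, pad_id):
    """1 for real content tokens, 0 for special tokens and padding."""
    return [int(flag == 0 and tok != pad_id) for tok, flag in zip(ids, special_mask)]

padded = [1, 18304, 306, 263, 803, 3194, 18, 2, 0, 0, 0, 0, 0, 0, 0]
special = [1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0]
print(content_mask(padded, special, pad_id=0))
# [0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
```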

#### Un-tokenize inputs

In [20]:
tokenizer_en.decode(padded_seq[0], skip_special_tokens=False)

'<s> montreal is a great city.</s><pad><pad><pad><pad><pad><pad><pad>'

In [21]:
tokenizer_en.decode(padded_seq[0], skip_special_tokens=True)

' montreal is a great city.'

In [22]:
# encode batch in one go!
text1 = "Montreal is a great city".strip()
text2 = "California has good weather".strip()

texts = [text1, text2]
tokenizer_en.batch_encode_plus(texts)

{'input_ids': [[1, 225, 49, 2095, 9337, 306, 263, 803, 3194, 2],
  [1, 225, 39, 289, 8974, 407, 793, 5869, 2]],
 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
  [1, 1, 1, 1, 1, 1, 1, 1, 1]]}
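
The two encoded sentences have different lengths (10 and 9 ids), so the batch is ragged and cannot be turned into a rectangular array as-is. One way to handle this, sketched below under the assumption that the pad id is 0, is to right-pad both `input_ids` and `attention_mask` to the longest sequence in the batch. `pad_batch` is a hypothetical helper, not part of the tokenizer API.

```python
def pad_batch(batch, pad_id):
    """Right-pad every list in a batch_encode_plus-style dict to the longest sequence."""
    longest = max(len(ids) for ids in batch["input_ids"])
    return {
        "input_ids": [ids + [pad_id] * (longest - len(ids)) for ids in batch["input_ids"]],
        # Padding positions get attention 0 so that a model can ignore them.
        "attention_mask": [m + [0] * (longest - len(m)) for m in batch["attention_mask"]],
    }

# Values copied from the output above.
batch = {
    "input_ids": [[1, 225, 49, 2095, 9337, 306, 263, 803, 3194, 2],
                  [1, 225, 39, 289, 8974, 407, 793, 5869, 2]],
    "attention_mask": [[1] * 10, [1] * 9],
}
padded = pad_batch(batch, pad_id=0)
print(padded["input_ids"][1])
# [1, 225, 39, 289, 8974, 407, 793, 5869, 2, 0]
print(padded["attention_mask"][1])
# [1, 1, 1, 1, 1, 1, 1, 1, 1, 0]
```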

## Alex's Test cases

### English tokenizer

In [51]:
train_lang1 = "but if you do this it 's quick"   # train.lang1 format
unaligned_en = "But if you do this, it's quick." # unaligned.en format

In [52]:
print(tokenizer_en.tokenize(train_lang1))
print(tokenizer_en.tokenize(unaligned_en))

['but', 'Ġif', 'Ġyou', 'Ġdo', 'Ġthis', 'Ġit', "Ġ'", 's', 'Ġquick']
['B', 'ut', 'Ġif', 'Ġyou', 'Ġdo', 'Ġthis', ',', 'Ġit', "'s", 'Ġquick', '.']


In [53]:
print(tokenizer_en.decode(tokenizer_en.encode(train_lang1)))
print(tokenizer_en.decode(tokenizer_en.encode(unaligned_en)))

<s> but if you do this it's quick</s>
<s> But if you do this, it's quick.</s>


Notes:
* Capitalization needs to be removed from "unaligned.en" before tokenization, so that it matches the "train.lang1" format
* Punctuation needs to be removed from "unaligned.en" before tokenization, for the same reason
* Decoding joins punctuation back onto the preceding word (it's vs. it 's)
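
The first two notes can be sketched as a small normalization function. This is only a guess at the "train.lang1" conventions based on the single example above (lowercase, no punctuation, `'s` split off); `normalize_en` is a hypothetical helper, and other contractions (for example "don't") may need different handling in the real data.

```python
import re

def normalize_en(line):
    """Lowercase, split off apostrophe clitics, and drop other punctuation (hypothetical helper)."""
    line = line.lower()
    line = re.sub(r"(\w)'(\w)", r"\1 '\2", line)  # it's -> it 's
    line = re.sub(r"[^\w\s']", " ", line)         # replace , . ? ! etc. with a space
    return re.sub(r"\s+", " ", line).strip()      # collapse the leftover whitespace

print(normalize_en("But if you do this, it's quick."))
# but if you do this it 's quick
```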

### French tokenizer

In [54]:
train_lang2 = "Alors , où en sommes - nous ?" # train.lang2 format
unaligned_fr = "Alors, où en sommes-nous?"    # unaligned.fr format

print(tokenizer_fr.tokenize(train_lang2))   
print(tokenizer_fr.tokenize(unaligned_fr)) 

['Alors', 'Ġ,', 'ĠoÃ¹', 'Ġen', 'Ġsommes', 'Ġ-', 'Ġnous', 'Ġ?']
['Alors', ',', 'ĠoÃ¹', 'Ġen', 'Ġsommes', '-', 'nous', '?']
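
The `Ġ` prefixes and the `Ã¹` in place of `ù` are not corruption. They are what byte-level BPE token strings look like when printed: each UTF-8 byte is shown as one printable character, with a space shown as `Ġ`. The sketch below rebuilds the byte-to-character table in the GPT-2 style and inverts it. It assumes the saved tokenizers use that scheme, which the token strings suggest but the notebook does not state; `tokens_to_text` is a hypothetical helper.

```python
def bytes_to_unicode():
    """Byte -> printable character table in the style of GPT-2 byte-level BPE."""
    keep = (list(range(0x21, 0x7F))      # '!' .. '~'
            + list(range(0xA1, 0xAD))    # '¡' .. '¬'
            + list(range(0xAE, 0x100)))  # '®' .. 'ÿ'
    table, shift = {}, 0
    for b in range(256):
        if b in keep:
            table[b] = chr(b)            # printable bytes map to themselves
        else:
            table[b] = chr(256 + shift)  # the rest are shifted past U+00FF
            shift += 1
    return table

CHAR_TO_BYTE = {ch: b for b, ch in bytes_to_unicode().items()}

def tokens_to_text(tokens):
    """Turn byte-level BPE token strings back into readable text."""
    raw = bytes(CHAR_TO_BYTE[ch] for ch in "".join(tokens))
    return raw.decode("utf-8", errors="replace")

tokens = ['Alors', 'Ġ,', 'ĠoÃ¹', 'Ġen', 'Ġsommes', 'Ġ-', 'Ġnous', 'Ġ?']
print(tokens_to_text(tokens))
# Alors , où en sommes - nous ?
```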


In [55]:
print(tokenizer_fr.decode(tokenizer_fr.encode(train_lang2)))
print(tokenizer_fr.decode(tokenizer_fr.encode(unaligned_fr)))

<s> Alors, où en sommes - nous?</s>
<s> Alors, où en sommes-nous?</s>


Notes:
* The tokenizer joins punctuation to the preceding word when decoding (nous? vs. nous ?)
* Punctuation in unaligned.fr needs to be separated from words before fitting the tokenizer
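
Separating punctuation could look roughly like the sketch below. It is tuned only to the examples in this notebook: `split_punctuation` is a hypothetical helper, and it would also split hyphenated words and decimal numbers, which may or may not match the real "train.lang2" conventions.

```python
import re

def split_punctuation(line):
    """Put spaces around punctuation and after elision apostrophes (hypothetical helper)."""
    line = re.sub(r"(\w')(\w)", r"\1 \2", line)  # J'ai -> J' ai
    line = re.sub(r"([^\w\s'])", r" \1 ", line)  # sommes-nous? -> sommes - nous ?
    return re.sub(r"\s+", " ", line).strip()     # collapse the extra spaces

print(split_punctuation("Alors, où en sommes-nous?"))
# Alors , où en sommes - nous ?
print(split_punctuation("J'ai?"))
# J' ai ?
```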

In [57]:
print(tokenizer_fr.encode("Alors"))
print(tokenizer_fr.encode("alors"))

[1, 10706, 2]
[1, 1282, 2]


In [63]:
train_lang2 = "J' ai ?" # train.lang2 format
unaligned_fr = "J'ai?"  # unaligned.fr format

print(tokenizer_fr.tokenize(train_lang2))   
print(tokenizer_fr.tokenize(unaligned_fr))
print(tokenizer_fr.decode(tokenizer_fr.encode(train_lang2)))
print(tokenizer_fr.decode(tokenizer_fr.encode(unaligned_fr)))

['J', "'", 'Ġai', 'Ġ?']
['J', "'", 'ai', '?']
<s> J' ai?</s>
<s> J'ai?</s>


Notes:
* Capitalization changes the tokens produced ("Alors" and "alors" above encode to different ids)

## Add/remove capitalization special character

In [208]:
import re

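def add_cap_tokens(tokens):
    """Replace each word-initial capital letter with '@ ' plus its lowercase form."""
    output = tokens.strip()
    # Start offsets of every uppercase letter that begins a word.
    positions = [m.start() for m in re.finditer(r'\b[A-ZÀ-ÖÙ-Ý]', output)]

    # Edit from right to left so that earlier offsets stay valid.
    for idx in reversed(positions):
        output = output[:idx] + '@ ' + output[idx].lower() + output[(idx+1):]

    return output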

In [209]:
print(add_cap_tokens("Test Words"))
print(add_cap_tokens("test"))
print(add_cap_tokens("T T T TT"))
print(add_cap_tokens("   T   T     T    TT    "))
print(add_cap_tokens("État"))
print(add_cap_tokens("A Z À Ö Ù Ý"))
print(add_cap_tokens("AZÀÖÙÝ"))

@ test @ words
test
@ t @ t @ t @ tT
@ t   @ t     @ t    @ tT
@ état
@ a @ z @ à @ ö @ ù @ ý
@ aZÀÖÙÝ
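
The character class `[A-ZÀ-ÖÙ-Ý]` covers unaccented capitals and most accented Latin-1 ones, but not every capital that can appear in French text, for example `Œ` (U+0152). A possible variant, sketched here as an assumption rather than as the notebook's method, lets `str.isupper()` decide instead of listing ranges. `add_cap_tokens_unicode` is a hypothetical name.

```python
import re

def add_cap_tokens_unicode(text, marker="@"):
    """Variant of add_cap_tokens that marks any word-initial uppercase letter."""
    def mark(match):
        ch = match.group()
        return f"{marker} {ch.lower()}" if ch.isupper() else ch
    # \b\w matches the first character of every word; `mark` leaves non-capitals alone.
    return re.sub(r"\b\w", mark, text.strip())

print(add_cap_tokens_unicode("Test Words"))  # @ test @ words
print(add_cap_tokens_unicode("Œuvre"))       # @ œuvre
```

As in the original function, only the first letter of a word is marked, so "TT" still becomes "@ tT".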


In [210]:
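def remove_cap_tokens(tokens):
    """Undo add_cap_tokens: drop each '@ ' marker and capitalize the character after it."""
    # Collapse runs of spaces so that at most one space follows each marker.
    output = re.sub(' +', ' ', tokens).strip()
    positions = [m.start() for m in re.finditer(r'@', output)]

    # Edit from right to left so that earlier offsets stay valid.
    for idx in reversed(positions):
        # Marker at the end of the sequence: nothing left to capitalize, so drop it.
        if idx + 2 >= len(output):
            output = output[:idx]
            continue
        output = output[:idx] + output[(idx+2)].upper() + output[(idx+3):]

    return output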

In [211]:
print(remove_cap_tokens("@ test @ words"))
print(remove_cap_tokens("test"))
print(remove_cap_tokens("@ t @ t @ t @ tT"))
print(remove_cap_tokens("@ t @ t @ t @ tT"))
print(remove_cap_tokens("@ état"))
print(remove_cap_tokens("@ a @ z @ à @ ö @ ù @ ý"))
print(remove_cap_tokens("@ aZÀÖÙÝ"))
print(remove_cap_tokens("@ aZÀÖÙÝ <CAP> "))
print(remove_cap_tokens("@ @ @ . @ aZÀÖÙÝ @ @"))
print(remove_cap_tokens("@   aZÀÖÙÝ @    "))
print(remove_cap_tokens("a @ @ @ @ @ @   . a"))

Test Words
test
T T T TT
T T T TT
État
A Z À Ö Ù Ý
AZÀÖÙÝ
AZÀÖÙÝ <CAP>
. AZÀÖÙÝ 
AZÀÖÙÝ 
a . a
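
Because `add_cap_tokens` and `remove_cap_tokens` are meant to be inverses, a round-trip check is a cheap way to find inputs where they are not. The sketch below uses compact `re.sub` restatements of the two functions (`add_cap` and `remove_cap` are hypothetical names, not the notebook's implementations) so that it runs on its own. It also shows one limitation: a literal `@` in the text is treated as a marker. The original `remove_cap_tokens` matches every `@` as well, so it is presumably affected in the same way.

```python
import re

CAP = "@"  # marker character used in this notebook

def add_cap(text):
    """Mark word-initial capitals, using the same character class as add_cap_tokens."""
    return re.sub(r"\b[A-ZÀ-ÖÙ-Ý]", lambda m: CAP + " " + m.group().lower(), text.strip())

def remove_cap(text):
    """Capitalize the character after each marker, then drop leftover markers."""
    text = re.sub(r" +", " ", text).strip()
    text = re.sub(r"@ (\w)", lambda m: m.group(1).upper(), text)
    return re.sub(r"@ ?", "", text).strip()

def round_trips(text):
    return remove_cap(add_cap(text)) == text

for sample in ["Test Words", "Alors , où en sommes - nous ?", "État", "mail me at a@b.c"]:
    print(round_trips(sample), repr(sample))
# True 'Test Words'
# True 'Alors , où en sommes - nous ?'
# True 'État'
# False 'mail me at a@b.c'
```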


In [212]:
tokenizer_en.add_tokens(['@'])
tokenizer_fr.add_tokens(['@'])

print(tokenizer_en.encode(add_cap_tokens("test words")))
print(tokenizer_en.encode(add_cap_tokens("Test Words")))

print(remove_cap_tokens(tokenizer_en.decode(tokenizer_en.encode(add_cap_tokens("test words")), skip_special_tokens=True)))
print(remove_cap_tokens(tokenizer_en.decode(tokenizer_en.encode(add_cap_tokens("Test Words")), skip_special_tokens=True)))

[1, 2294, 1738, 2]
[1, 225, 36, 2294, 225, 36, 1738, 2]
test words
Test Words


### Pre-Process Files Before Training Tokenizer

In [213]:
file_with_caps = "/home/alex/Translation-Team08-IFT6759/data/train.fr.tokenized/unaligned.fr"
with open(file_with_caps, encoding="utf-8") as f:
    lines = [line.rstrip('\n') for line in f]

In [214]:
output_file = "/home/alex/Translation-Team08-IFT6759/data/train.fr.tokenized/unaligned.fr.CAP"
with open(output_file, "w", encoding="utf-8") as f:
    for line in lines:
        f.write(add_cap_tokens(line) + '\n')

In [215]:
file_with_caps = "/home/alex/Translation-Team08-IFT6759/data/train.lang2"
with open(file_with_caps, encoding="utf-8") as f:
    lines = [line.rstrip('\n') for line in f]

In [216]:
output_file = "/home/alex/Translation-Team08-IFT6759/data/train.lang2.CAP"
with open(output_file, "w", encoding="utf-8") as f:
    for line in lines:
        f.write(add_cap_tokens(line) + '\n')
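
The read-then-write pattern in the cells above can be folded into one function that streams the file instead of holding every line in memory. This is a sketch under assumptions: `convert_file` is a hypothetical helper, the files are assumed to be UTF-8, and the paths in the usage comment are placeholders.

```python
def convert_file(src, dst, transform):
    """Write transform(line) for every line of `src` to `dst`, one line at a time."""
    with open(src, encoding="utf-8") as fin, open(dst, "w", encoding="utf-8") as fout:
        for line in fin:
            fout.write(transform(line.rstrip("\n")) + "\n")

# Example usage (placeholder paths; add_cap_tokens is the function defined earlier):
# convert_file("data/train.lang2", "data/train.lang2.CAP", add_cap_tokens)
```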

## Un-concatenate punctuation

Problem: decoding joins punctuation to the preceding word, so the output no longer matches the train.lang2 format

In [155]:
train_lang2 = "Alors , où en sommes - nous ?" # train.lang2 format

print(tokenizer_fr.tokenize(train_lang2))   
print(tokenizer_fr.decode(tokenizer_fr.encode(train_lang2)))

['Alors', 'Ġ,', 'ĠoÃ¹', 'Ġen', 'Ġsommes', 'Ġ-', 'Ġnous', 'Ġ?']
<s> Alors, où en sommes - nous?</s>


Solution: skip the tokenizer's clean-up step by passing `clean_up_tokenization_spaces=False` to `decode`

In [158]:
train_lang2 = "Alors , où en sommes - nous ?" # train.lang2 format

output = tokenizer_fr.decode(tokenizer_fr.encode(train_lang2), clean_up_tokenization_spaces=False, skip_special_tokens=True).strip()
print(output)
assert output == train_lang2

Alors , où en sommes - nous ?
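
To see why the clean-up step changes the text, it helps to picture it as a list of plain string replacements applied after decoding. The exact rule list inside `transformers` is not shown in this notebook, so the subset below is an assumption chosen to reproduce the outputs seen earlier. `clean_up` and `CLEANUP_RULES` are hypothetical names.

```python
# Illustrative subset of clean-up style replacements (assumed, not copied from the library).
CLEANUP_RULES = [(" .", "."), (" ?", "?"), (" !", "!"), (" ,", ","), (" 's", "'s")]

def clean_up(text):
    """Join punctuation and the 's clitic to the preceding word."""
    for old, new in CLEANUP_RULES:
        text = text.replace(old, new)
    return text

print(clean_up("Alors , où en sommes - nous ?"))
# Alors, où en sommes - nous?
print(clean_up("but if you do this it 's quick"))
# but if you do this it's quick
```

Since replacements of this kind are not reversible (the original spacing is lost), turning the step off is simpler than trying to undo it afterwards.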


TODO:
* Add processed versions of the unaligned en/fr files