# Faster than training from scratch 
# Fine-tuning the English GPT-2 in any language with Hugging Face and fastai v2 

> Tutorial on how to use fastai v2 over Hugging Face's Transformers and Tokenizers libraries to fine-tune an English pre-trained transformer-based language model (GPT-2) to any language other than English

This notebook is based on the work of Pierre Guillou (https://www.linkedin.com/in/pierreguillou)

Other resources used:
---


- Medium post: [Faster than training from scratch - Fine-tuning the English GPT-2 in any language with Hugging Face and fastai v2 (practical case with Portuguese)](https://medium.com/@pierre_guillou/faster-than-training-from-scratch-fine-tuning-the-english-gpt-2-in-any-language-with-hugging-f2ec05c98787)
- Fast notebook: [finetuning-English-GPT2-any-language-Portuguese-HuggingFace-fastaiv2_FAST.ipynb](https://github.com/piegu/fastai-projects/blob/master/finetuning-English-GPT2-any-language-Portuguese-HuggingFace-fastaiv2_FAST.ipynb)
- Hugging Face model page of [GPorTuguese-2](https://huggingface.co/pierreguillou/gpt2-small-portuguese): a language model for Portuguese text generation (and more NLP tasks...)
- Other Medium posts in the GPT-2 series:
  - [NLP & fastai | GPT-2](https://medium.com/@pierre_guillou/nlp-fastai-gpt-2-16ee145a4a28)
  - [Byte-level BPE, an universal tokenizer but...](https://medium.com/@pierre_guillou/byte-level-bpe-an-universal-tokenizer-but-aff932332ffe)

In [None]:
#start by mounting google drive
from google.colab import drive, files
drive.mount('/content/gdrive', force_remount=True)

Mounted at /content/gdrive


In [None]:
# fastai v2 and fastcore must be installed first
!pip install -q git+https://github.com/fastai/fastai
!pip install -q git+https://github.com/fastai/fastcore
!pip install -q iterative-stratification

  Building wheel for fastai (setup.py) ... done
ERROR: torchtext 0.9.0 has requirement torch==1.8.0, but you'll have torch 1.7.1 which is incompatible.
  Building wheel for fastcore (setup.py) ... done


In [None]:
cd /content/gdrive/MyDrive/fastai

/content/gdrive/MyDrive/fastai


In [None]:
from nlputilsfastai import *  # modified .py file: its import now reads "from fastai.basics import *" (was fastai2)

In [None]:
# !pip install fastcore==1.3.8

Collecting fastcore==1.3.8
  Downloading https://files.pythonhosted.org/packages/26/53/d79c0f942f8bb44903108462541130b53fc7b4d744b1b5df9127b0b524d6/fastcore-1.3.8-py3-none-any.whl (48kB)
Installing collected packages: fastcore
  Found existing installation: fastcore 1.3.20
    Uninstalling fastcore-1.3.20:
      Successfully uninstalled fastcore-1.3.20
Successfully installed fastcore-1.3.8


# 1. Installing required libraries and mounting Google Drive

In [None]:
#start by mounting google drive
from google.colab import drive, files
drive.mount('/content/gdrive', force_remount=True)

In [2]:
%%time
# fastai v2 and fastcore must be installed first
!pip install -q git+https://github.com/fastai/fastai
!pip install -q git+https://github.com/fastai/fastcore
!pip install -q iterative-stratification

  Building wheel for fastai (setup.py) ... done
  Building wheel for fastcore (setup.py) ... done
CPU times: user 121 ms, sys: 34.4 ms, total: 156 ms
Wall time: 50.3 s


# 2. Initialization

In [3]:
cd /content/gdrive/MyDrive/fastai

/content/gdrive/MyDrive/fastai


In [4]:
# from fastai2.text.all import *
# from nlputils_fastai2 import * 

from fastai.text.all import *
from nlputilsfastai import * 

%reload_ext autoreload
%autoreload 2
%matplotlib inline

In [5]:
gpu = 0
torch.cuda.set_device(gpu)
print(f'cuda device: {torch.cuda.current_device()}')
print(f'cuda device name: {torch.cuda.get_device_name(gpu)}')

cuda device: 0
cuda device name: Tesla K80


In [6]:
!nvidia-smi

Fri Mar 19 09:41:44 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.56       Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   74C    P8    33W / 149W |      3MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

Load the standard snippet to prevent random disconnects.
This cell runs JS code that automatically reconnects to the runtime.

In [7]:
import IPython
from google.colab import output

display(IPython.display.Javascript('''
 function ClickConnect(){
   btn = document.querySelector("colab-connect-button")
   if (btn != null){
     console.log("Click colab-connect-button"); 
     btn.click() 
     }
   
   btn = document.getElementById('ok')
   if (btn != null){
     console.log("Click reconnect"); 
     btn.click() 
     }
  }
  
setInterval(ClickConnect,60000)
'''))

print("Done.")

<IPython.core.display.Javascript object>

Done.


In [8]:
# Get config of fastai2 paths
config = Config()
config.d

{'archive_path': '/root/.fastai/archive',
 'data_path': '/root/.fastai/data',
 'model_path': '/root/.fastai/models',
 'storage_path': '/tmp',
 'version': 2}

This will create a `{lang}wiki` folder, containing a `{lang}wiki` text file with the Wikipedia contents (for other languages, replace `{lang}` with the appropriate code from the [list of Wikipedias](https://meta.wikimedia.org/wiki/List_of_Wikipedias)).

In [9]:
# setup new path_data and create the corresponding folder
lang = 'pl'
name = f'{lang}wiki'
data_path = config['data_path']
path_data = data_path/name
path_data.mkdir(exist_ok=True, parents=True)

In [10]:
cd /content/gdrive/MyDrive/fastai

/content/gdrive/MyDrive/fastai


In [11]:
data_path, path_data

(Path('/root/.fastai/data'), Path('/root/.fastai/data/plwiki'))

# 3. Loading the previously prepared scraped Wikipedia file (~1 GB) for the chosen language
The file was prepared with a separate notebook: [wiki download](https://github.com/len-sla/other/blob/main/wiki_download.ipynb)

In [12]:
!cp /content/gdrive/MyDrive/fastai/all_texts_plwiki.csv  /root/.fastai/data/plwiki
!cp /content/gdrive/MyDrive/fastai/all_texts_plwiki.txt  /root/.fastai/data/plwiki

In [13]:
!du -hs {'/content/gdrive/MyDrive/fastai/all_texts_plwiki.csv'}

1.1G	/content/gdrive/MyDrive/fastai/all_texts_plwiki.csv


In [14]:
df = pd.read_csv('/content/gdrive/MyDrive/fastai/all_texts_plwiki.csv')
df.head()

Unnamed: 0,text
0,"Henry Wager Halleck (ur. 16 stycznia 1815, zm. 9 stycznia 1872) – amerykański wojskowy, naukowiec i prawnik, oficer United States Army.\n\n, znany pod – obraźliwym później – przydomkiem „Old Brains”, brał czynny udział w dziele przyłączenia Kalifornii jako stanu. Z powodzeniem praktykował jako prawnik i deweloper. Na początku wojny secesyjnej, był naczelnym dowódcą Armii Unii na zachodnim teatrze działań, a jednocześnie – przez prawie dwa lata – głównodowodzącym wszystkich armii USA. „Awansował” na szefa sztabu armii, gdy generał-porucznik Ulysses Grant, były podkomendny Hallecka na zachod..."
1,"Kościół Najświętszej Marii Panny (""in summo"") w Poznaniu – zabytkowy gotycki kościół na Ostrowie Tumskim wraz z resztkami wczesnopiastowskiego palatium.\n\nW dzisiejszym kształcie powstał w połowie XV wieku, jednak jego historia rozpoczyna się około 965 roku, gdy po przybyciu Dobrawy wzniesiono na Ostrowie Tumskim kaplicę zamkową. W dokumentach kościół Najświętszej Marii Panny pod swoim dzisiejszym wezwaniem pojawia się po raz pierwszy w 1247. \n\nWedług najnowszych badań prawdopodobnie pod prezbiterium znajdują się fundamenty rotundy pełniącej funkcję kaplicy, pewnym jest natomiast istnie..."
2,"Gieorgij Andriejewicz Mołczanow (ros. Георгий Андреевич Молчанов, ur. 3 kwietnia 1897 w Charkowie, zm. 9 października 1937 w miejscu egzekucji Kommunarka) – funkcjonariusz radzieckiej policji politycznej, komisarz bezpieczeństwa państwowego II rangi, ludowy komisarz spraw wewnętrznych Białoruskiej SRR (1936-1937).\n\nUrodzony w rodzinie rosyjskiej. Do 1917 uczył się w szkole handlowej w Charkowie, od listopada 1917 do czerwca 1918 był żołnierzem i członkiem sztabu Głównodowodzącego Wojsk Południa Rosji Antonowa-Owsiejenki, później pracował w sztabie Frontu Wschodniego. \n\nOd grudnia 1917 ..."
3,"José Manuel Durão Barroso (wym. []; ur. 23 marca 1956 w Lizbonie) – portugalski polityk, prawnik i nauczyciel akademicki. W latach 1992–1995 minister spraw zagranicznych w rządzie Aníbal Cavaco Silvy, od 1999 do 2004 przewodniczący Partii Socjaldemokratycznej. Premier Portugalii od 6 kwietnia 2002 do 17 lipca 2004. Od 22 listopada 2004 do 31 października 2014 przewodniczący Komisji Europejskiej.\n\nUkończył prawo na Uniwersytecie Lizbońskim, a także studia europejskie na Uniwersytecie Genewskim, na którym uzyskał również magisterium w zakresie nauk politycznych. Pracował jako nauczyciel ak..."
4,"Laodika I (gr. ""Λαοδίκη"", ""Laodíkē"") (zm. po 242 p.n.e.) – córka Achajosa Starszego z dynastii Seleucydów, brata Antiocha I Sotera, pierwsza żona brata stryjecznego Antiocha II Theosa, króla państwa Seleucydów, syna Antiocha I Sotera.\n\nW czasie II wojny syryjskiej (258-248 p.n.e.) jej mąż Antioch II Theos, jako sprzymierzeniec Macedonii walczył przeciwko Egiptowi. W wyniku tej wojny Antioch II zawarł porozumienie z królem Egiptu Ptolemeuszem II Filadelfem w r. 250 p.n.e. Miał się wyprzeć żony Laodiki I i wspólnych z nią dzieci, a poślubić jego córkę Berenikę oraz zdeklarować się uczynić ..."
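
`pd.read_csv` loads the whole 1.1 GB file into memory. When only a quick look at the first rows is needed, the file can be streamed instead. The following is a minimal sketch using only the standard library; the helper name `peek_csv` and the small demo file are hypothetical stand-ins for `all_texts_plwiki.csv`.

```python
import csv
import itertools
import os
import tempfile


def peek_csv(path, n=5):
    """Return the header and the first n rows of a CSV file without reading it all."""
    # Long articles can exceed the csv module's default field size limit
    # (131072 characters), so raise the limit before reading.
    csv.field_size_limit(2**31 - 1)
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = next(reader)
        rows = list(itertools.islice(reader, n))
    return header, rows


# Demo on a small temporary file (a stand-in for the real file).
demo_path = os.path.join(tempfile.mkdtemp(), "demo_plwiki.csv")
with open(demo_path, "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["", "text"])
    for i in range(100):
        writer.writerow([i, f"Artykuł numer {i}.\n\nDruga linia."])

header, rows = peek_csv(demo_path, n=3)
print(header)
print(len(rows), rows[0][1][:20])
```

Only the requested rows are parsed, so the cost does not depend on the total file size.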


# 4. Copying the ready-made Polish tokenizer

In [15]:
%%time
!pip install transformers
!pip freeze | grep transformers

transformers==4.4.2
CPU times: user 14.3 ms, sys: 119 ms, total: 133 ms
Wall time: 4.05 s


In [27]:
%%time
from transformers import GPT2TokenizerFast

pretrained_weights = 'gpt2'
tokenizer_en = GPT2TokenizerFast.from_pretrained(pretrained_weights)

CPU times: user 136 ms, sys: 9.13 ms, total: 146 ms
Wall time: 3.41 s


In [28]:
# To silence the warning about the missing pad_token (GPT2TokenizerFast), run the following code
# source: https://github.com/huggingface/transformers/issues/2648#issuecomment-616177044
tokenizer_en.pad_token = tokenizer_en.eos_token

In [29]:
# source: https://huggingface.co/transformers/_modules/transformers/tokenization_utils_fast.html

print('---------- vocab ----------')
print()

print('vocab_files_names:',tokenizer_en.vocab_files_names)
print()

for k,v in tokenizer_en.pretrained_vocab_files_map.items():
    print(k)
    for kk,vv in v.items():
        print('- ',kk,':',vv)
    print()
    
print('vocab_size:',tokenizer_en.vocab_size)
print()
#print(tokenizer_en.get_vocab())

import itertools  # make the dependency explicit

num = 50
print(f'First {num} items of the vocab: {dict(itertools.islice(tokenizer_en.get_vocab().items(), num))}')

---------- vocab ----------

vocab_files_names: {'vocab_file': 'vocab.json', 'merges_file': 'merges.txt', 'tokenizer_file': 'tokenizer.json'}

vocab_file
-  gpt2 : https://huggingface.co/gpt2/resolve/main/vocab.json
-  gpt2-medium : https://huggingface.co/gpt2-medium/resolve/main/vocab.json
-  gpt2-large : https://huggingface.co/gpt2-large/resolve/main/vocab.json
-  gpt2-xl : https://huggingface.co/gpt2-xl/resolve/main/vocab.json
-  distilgpt2 : https://huggingface.co/distilgpt2/resolve/main/vocab.json

merges_file
-  gpt2 : https://huggingface.co/gpt2/resolve/main/merges.txt
-  gpt2-medium : https://huggingface.co/gpt2-medium/resolve/main/merges.txt
-  gpt2-large : https://huggingface.co/gpt2-large/resolve/main/merges.txt
-  gpt2-xl : https://huggingface.co/gpt2-xl/resolve/main/merges.txt
-  distilgpt2 : https://huggingface.co/distilgpt2/resolve/main/merges.txt

tokenizer_file
-  gpt2 : https://huggingface.co/gpt2/resolve/main/tokenizer.json
-  gpt2-medium : https://huggingface.co/gpt

In [30]:
!pip install tokenizers
!pip freeze | grep tokenizers

tokenizers==0.10.1


In [31]:
# Create the directory for the tokenizer files (no-op if it already exists)
ByteLevelBPE_tokenizer_pl_rep = 'ByteLevelBPE_tokenizer_pl'
path_to_ByteLevelBPE_tokenizer_pl_rep = path_data/ByteLevelBPE_tokenizer_pl_rep
path_to_ByteLevelBPE_tokenizer_pl_rep.mkdir(exist_ok=True, parents=True)
# ByteLevelBPE_tokenizer_pl.save_model(str(path_to_ByteLevelBPE_tokenizer_pl_rep))

In [32]:
ls /root/.fastai/data/plwiki -all

total 2302132
drwxr-xr-x 3 root root       4096 Mar 19 09:07 ./
drwxr-xr-x 3 root root       4096 Mar 19 08:57 ../
-rw------- 1 root root 1101183658 Mar 19 09:41 all_texts_plwiki.csv
-rw------- 1 root root 1098323868 Mar 19 09:42 all_texts_plwiki.txt
drwxr-xr-x 2 root root       4096 Mar 19 08:59 ByteLevelBPE_tokenizer_pl/
-rw-r--r-- 1 root root    1216559 Mar 19 09:05 different_tokens_list.pl
-rw-r--r-- 1 root root    1640303 Mar 19 09:07 idxs_train.pl
-rw-r--r-- 1 root root     410351 Mar 19 09:07 idxs_val.pl
-rw-r--r-- 1 root root  154390264 Mar 19 09:05 new_wte_wgts.pl
-rw-r--r-- 1 root root     182831 Mar 19 09:05 same_tokens_list.pl


In [33]:
# Copy the previously created Polish tokenizer (saves the ~30 min needed to train it)
!cp  /content/gdrive/MyDrive/fastai/vocab.json  /root/.fastai/data/plwiki/ByteLevelBPE_tokenizer_pl
!cp  /content/gdrive/MyDrive/fastai/merges.txt  /root/.fastai/data/plwiki/ByteLevelBPE_tokenizer_pl

In [34]:
from tokenizers.implementations import ByteLevelBPETokenizer
ByteLevelBPE_tokenizer_pl = ByteLevelBPETokenizer(
    "/root/.fastai/data/plwiki/ByteLevelBPE_tokenizer_pl/vocab.json",
    "/root/.fastai/data/plwiki/ByteLevelBPE_tokenizer_pl/merges.txt",
)

Testing if it is working

In [35]:
# Get vocab as a list
ByteLevelBPE_tokenizer_pl_vocab = ByteLevelBPE_tokenizer_pl.get_vocab() 
ByteLevelBPE_tokenizer_pl_vocab_ls = [k for k, v in sorted(ByteLevelBPE_tokenizer_pl_vocab.items(), key=lambda item: item[1])]
len(ByteLevelBPE_tokenizer_pl_vocab_ls),ByteLevelBPE_tokenizer_pl_vocab_ls[:5]

(50257, ['<|endoftext|>', '!', '"', '#', '$'])

In [36]:
text = "Taki mały tekst dla sprawdzenia ."
output = ByteLevelBPE_tokenizer_pl.encode(text)
print('\n splitting by tokens\n ')
print(output.ids,)
print(output.tokens)
print(output.offsets)

back_to_text = ByteLevelBPE_tokenizer_pl.decode(ByteLevelBPE_tokenizer_pl.encode(text).ids)

print('\ninput text:', text)
print('tokens ids:', output.ids)
print('back to text:', back_to_text)


 splitting by tokens
 
[5565, 335, 10120, 7591, 624, 1877, 1054, 4461]
['Ta', 'ki', 'ĠmaÅĤy', 'Ġtekst', 'Ġdla', 'Ġspraw', 'dzenia', 'Ġ.']
[(0, 2), (2, 4), (4, 9), (9, 15), (15, 19), (19, 25), (25, 31), (31, 33)]

input text: Taki mały tekst dla sprawdzenia .
tokens ids: [5565, 335, 10120, 7591, 624, 1877, 1054, 4461]
back to text: Taki mały tekst dla sprawdzenia .
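
The tokens printed above look garbled (`ĠmaÅĤy` stands for ` mały`) because byte-level BPE works on UTF-8 bytes and displays every byte as a printable character. The sketch below rebuilds the byte-to-character table following the scheme of the original GPT-2 encoder and maps the printed tokens back to text. It is an illustration, not the `tokenizers` library's implementation, and the helper name `tokens_to_text` is hypothetical.

```python
def bytes_to_unicode():
    """Byte -> printable character table in the style of GPT-2 byte-level BPE."""
    # Bytes that are already printable keep their own character.
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    # Every remaining byte (space, control bytes, ...) is shifted past U+00FF.
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))


byte_encoder = bytes_to_unicode()
byte_decoder = {ch: b for b, ch in byte_encoder.items()}


def tokens_to_text(tokens):
    """Map display characters back to raw bytes and decode them as UTF-8."""
    raw = bytes(byte_decoder[ch] for tok in tokens for ch in tok)
    return raw.decode("utf-8")


# Tokens copied from the output above.
tokens = ['Ta', 'ki', 'ĠmaÅĤy', 'Ġtekst', 'Ġdla', 'Ġspraw', 'dzenia', 'Ġ.']
print(repr(byte_encoder[ord(" ")]))  # the space byte is displayed as 'Ġ'
print(tokens_to_text(tokens))
```

In this table the letter `ł` (UTF-8 bytes `0xC5 0x82`) is displayed as the two characters `Å` and `Ĥ`, which is why Polish tokens look unfamiliar in the vocabulary listing.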


<!-- so this is where I am at the moment -->

# 5. Create a fastai tokenizer and update the embeddings matrix of the GPT-2 English pre-trained model

Now let's see how we can use fastai v2 to fine-tune this model on the Polish Wikipedia, using all the fastai v2 training utilities.

We will follow these 2 steps:

- 5.1) **GPT2TokenizerFast (imported GPT-2 tokenizer) --> fastai Tokenizer**: to process the data to train a model, we need to build a fastai tokenizer from the GPT-2 tokenizer with the Polish vocab.
- 5.2) **Change vocab embeddings (wte matrix) in the GPT-2 pre-trained model to adapt to the Polish vocab**: as the vocab embedding matrix (wte) of the pre-trained GPT-2 model corresponds to the English vocabulary, we'll keep the embedding vectors of the tokens common to the English and Polish vocabs.

 First, we import all the text utilities:

In [38]:
from fastai.text.all import *

#### 5.1 GPT2TokenizerFast (imported GPT-2 tokenizer) --> fastai Tokenizer

*(text from Sylvain Gugger's Transformers tutorial)* To process this data to train a model, we need to build a `Transform` that will be applied lazily. In a fastai `Transform` you can define:
- an `encodes` method that is applied when you call the transform (a bit like the `forward` method in an `nn.Module`)
- a `decodes` method that is applied when you call the [decode](https://huggingface.co/transformers/main_classes/tokenizer.html#transformers.PreTrainedTokenizer.decode) method of the transform, if you need to decode anything for showing purposes (like converting ids to a text here)
- a `setups` method that sets some inner state of the `Transform` (not needed here)

In [39]:
class TransformersTokenizer(Transform):
    def __init__(self, tokenizer): self.tokenizer = tokenizer
    def encodes(self, x): 
        toks = self.tokenizer.tokenize(x)
        return tensor(self.tokenizer.convert_tokens_to_ids(toks))
    def decodes(self, x): return TitledStr(self.tokenizer.decode(x.cpu().numpy()))

Two comments on the code above:
- in `encodes` we don't use the [tokenizer.encode](https://huggingface.co/transformers/main_classes/tokenizer.html#transformers.PreTrainedTokenizer.encode) method since it does some additional preprocessing for the model after tokenizing and numericalizing (the part that threw a warning before). Here we don't need any post-processing, so it's fine to skip it and use the [tokenizer.tokenize](https://huggingface.co/transformers/main_classes/tokenizer.html#transformers.PreTrainedTokenizer.tokenize) method followed by the [tokenizer.convert_tokens_to_ids](https://huggingface.co/transformers/main_classes/tokenizer.html#transformers.PreTrainedTokenizer.convert_tokens_to_ids) one.
- in `decodes` we return a `TitledStr` object and not just a plain string. That's a fastai class that adds a `show` method to the string, which will allow us to use all the fastai show methods.
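
The two-step pattern used in `encodes` (split into tokens, then look each token up in the vocabulary) and its inverse in `decodes` can be illustrated without any library. This toy sketch is neither fastai's `Transform` nor a Hugging Face tokenizer: the classes `ToyTokenizer` and `ToyTransform`, the whitespace splitting and the vocabulary are all made up for illustration.

```python
class ToyTokenizer:
    """Whitespace tokenizer with a fixed vocabulary (illustration only)."""

    def __init__(self, vocab, unk_token="<unk>"):
        self.vocab = vocab
        self.ids_to_tokens = {i: t for t, i in vocab.items()}
        self.unk_id = vocab[unk_token]

    def tokenize(self, text):
        # Step 1: text -> list of token strings
        return text.split()

    def convert_tokens_to_ids(self, tokens):
        # Step 2: token strings -> integer ids (unknown tokens map to <unk>)
        return [self.vocab.get(t, self.unk_id) for t in tokens]

    def decode(self, ids):
        # Inverse direction: ids -> text
        return " ".join(self.ids_to_tokens[i] for i in ids)


class ToyTransform:
    """Mimics the encodes/decodes contract described above."""

    def __init__(self, tokenizer):
        self.tokenizer = tokenizer

    def encodes(self, x):
        toks = self.tokenizer.tokenize(x)
        return self.tokenizer.convert_tokens_to_ids(toks)

    def decodes(self, ids):
        return self.tokenizer.decode(ids)


vocab = {"<unk>": 0, "Nie": 1, "masz": 2, "racji": 3}
tfm = ToyTransform(ToyTokenizer(vocab))
ids = tfm.encodes("Nie masz racji")
print(ids)
print(tfm.decodes(ids))
```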

##### Tokenizers

ENGLISH

In [40]:
%%time
# Load the GPT2 tokenizer in English
from transformers import GPT2TokenizerFast, GPT2LMHeadModel
pretrained_weights = 'gpt2'
tokenizer_en = GPT2TokenizerFast.from_pretrained(pretrained_weights)
model_en = GPT2LMHeadModel.from_pretrained(pretrained_weights)

# To silence the warning about the missing pad_token (GPT2TokenizerFast), run the following code
# source: https://github.com/huggingface/transformers/issues/2648#issuecomment-616177044
tokenizer_en.pad_token = tokenizer_en.eos_token

CPU times: user 5.84 s, sys: 396 ms, total: 6.23 s
Wall time: 10.9 s


POLISH

In [41]:
# Get the path to the ByteLevelBPE_tokenizer_pl config files
ByteLevelBPE_tokenizer_pl_rep = 'ByteLevelBPE_tokenizer_pl'
path_to_ByteLevelBPE_tokenizer_pl_rep = path_data/ByteLevelBPE_tokenizer_pl_rep

# Load a GPT2TokenizerFast tokenizer from the tokenizer_pl config files
tokenizer_pl = GPT2TokenizerFast.from_pretrained(
    str(path_to_ByteLevelBPE_tokenizer_pl_rep), 
    pad_token='<|endoftext|>')

# Set the maximum sequence length to 1024
tokenizer_pl.model_max_length = 1024

##### Test

tokenizer_fastai_en

In [43]:
# Test of the class TransformersTokenizer of fastai with tokenizer_en
tokenizer_fastai_en = TransformersTokenizer(tokenizer_en)
text = "Nie masz racji."
tokens_ids = tokenizer_fastai_en.encodes(text)
tokens = tokenizer_fastai_en.tokenizer.convert_ids_to_tokens(tokens_ids)

print('input text:',TitledStr(text))
print('text tokens:',TitledStr(tokens))
print('text tokens_ids:',TitledStr(tokens_ids))
print('output text:',TitledStr(tokenizer_fastai_en.decodes(tokens_ids)))

input text: Nie masz racji.
text tokens: ['N', 'ie', 'Ġmas', 'z', 'Ġrac', 'ji', '.']
text tokens_ids: tensor([   45,   494, 12422,    89,  3444,  7285,    13])
output text: Nie masz racji.


tokenizer_fastai_pl

In [44]:
# Test of the class TransformersTokenizer of fastai with tokenizer_pl
tokenizer_fastai_pl = TransformersTokenizer(tokenizer_pl)
text = "Maybe, you're right"
tokens_ids = tokenizer_fastai_pl.encodes(text)
tokens = tokenizer_fastai_pl.tokenizer.convert_ids_to_tokens(tokens_ids)

print('input text:',TitledStr(text))
print('text tokens:',TitledStr(tokens))
print('text tokens_ids:',TitledStr(tokens_ids))
print('output text:',TitledStr(tokenizer_fastai_pl.decodes(tokens_ids)))

input text: Maybe, you're right
text tokens: ['Ma', 'y', 'be', ',', 'Ġyou', "'", 're', 'Ġri', 'ght']
text tokens_ids: tensor([ 2945,    89,  1355,    12, 37025,     7,   299, 23035,  3767])
output text: Maybe, you're right
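
Both tests apply a tokenizer to text from the other language, and in both cases the words are split into several short pieces. A rough way to quantify this is the number of tokens per whitespace-separated word. The sketch below computes it from the token lists printed above. The helper `tokens_per_word` is hypothetical, and two short sentences are far too little data to draw conclusions from.

```python
def tokens_per_word(text, tokens):
    """Average number of tokens produced per whitespace-separated word."""
    n_words = len(text.split())
    return len(tokens) / n_words


# Token lists copied from the two test outputs above.
en_tok_on_pl = ['N', 'ie', 'Ġmas', 'z', 'Ġrac', 'ji', '.']
pl_tok_on_en = ['Ma', 'y', 'be', ',', 'Ġyou', "'", 're', 'Ġri', 'ght']

ratio_en_on_pl = tokens_per_word("Nie masz racji.", en_tok_on_pl)
ratio_pl_on_en = tokens_per_word("Maybe, you're right", pl_tok_on_en)
print(f"English tokenizer on Polish text: {ratio_en_on_pl:.2f} tokens/word")
print(f"Polish tokenizer on English text: {ratio_pl_on_en:.2f} tokens/word")
```

Running the same measurement on a larger sample with each tokenizer on its own language would give a fairer comparison.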


#### 5.2 Change vocab embeddings (wte matrix) in the GPT-2 pre-trained model to adapt to the Polish vocab

In [33]:
# import model if needed
from transformers import GPT2LMHeadModel
pretrained_weights = 'gpt2'
model_en = GPT2LMHeadModel.from_pretrained(pretrained_weights)

##### Check vocabs size

In [34]:
tokenizer_fastai_en = TransformersTokenizer(tokenizer_en)
old_vocab_size = tokenizer_fastai_en.tokenizer.vocab_size

tokenizer_fastai_pl = TransformersTokenizer(tokenizer_pl)
new_vocab_size = tokenizer_fastai_pl.tokenizer.vocab_size

print('old_vocab_size --> {}, new_vocab_size --> {}, difference --> {}'.format(old_vocab_size,new_vocab_size,old_vocab_size-new_vocab_size))

old_vocab_size --> 50257, new_vocab_size --> 50257, difference --> 0


##### Check vocabs

In [35]:
tokenizer_fastai_vocab_en = tokenizer_fastai_en.tokenizer.get_vocab()
tokenizer_fastai_vocab_ls_en = [k for k, v in sorted(tokenizer_fastai_vocab_en.items(), key=lambda item: item[1])]
len(tokenizer_fastai_vocab_ls_en),tokenizer_fastai_vocab_ls_en[:10]

(50257, ['!', '"', '#', '$', '%', '&', "'", '(', ')', '*'])

In [36]:
tokenizer_fastai_vocab_pl = tokenizer_fastai_pl.tokenizer.get_vocab() 
tokenizer_fastai_vocab_ls_pl = [k for k, v in sorted(tokenizer_fastai_vocab_pl.items(), key=lambda item: item[1])]
len(tokenizer_fastai_vocab_ls_pl),tokenizer_fastai_vocab_ls_pl[:10]

(50257, ['<|endoftext|>', '!', '"', '#', '$', '%', '&', "'", '(', ')'])

##### Changing vocabs and the vocab embeddings matrix (i.e., setting up the new embeddings matrix)

In [37]:
# Check the current weights of wte and lm_head, and whether wte == lm_head
tens_a = model_en.transformer.wte.weight
tens_b = model_en.lm_head.weight
model_en.transformer.wte.weight,model_en.lm_head.weight,torch.all(tens_a.eq(tens_b))

(Parameter containing:
 tensor([[-0.1101, -0.0393,  0.0331,  ..., -0.1364,  0.0151,  0.0453],
         [ 0.0403, -0.0486,  0.0462,  ...,  0.0861,  0.0025,  0.0432],
         [-0.1275,  0.0479,  0.1841,  ...,  0.0899, -0.1297, -0.0879],
         ...,
         [-0.0445, -0.0548,  0.0123,  ...,  0.1044,  0.0978, -0.0695],
         [ 0.1860,  0.0167,  0.0461,  ..., -0.0963,  0.0785, -0.0225],
         [ 0.0514, -0.0277,  0.0499,  ...,  0.0070,  0.1552,  0.1207]],
        requires_grad=True), Parameter containing:
 tensor([[-0.1101, -0.0393,  0.0331,  ..., -0.1364,  0.0151,  0.0453],
         [ 0.0403, -0.0486,  0.0462,  ...,  0.0861,  0.0025,  0.0432],
         [-0.1275,  0.0479,  0.1841,  ...,  0.0899, -0.1297, -0.0879],
         ...,
         [-0.0445, -0.0548,  0.0123,  ...,  0.1044,  0.0978, -0.0695],
         [ 0.1860,  0.0167,  0.0461,  ..., -0.0963,  0.0785, -0.0225],
         [ 0.0514, -0.0277,  0.0499,  ...,  0.0070,  0.1552,  0.1207]],
        requires_grad=True), tensor(True))

In [38]:
# Get weights of the old wte
old_wgts = model_en.transformer.get_input_embeddings().weight.clone().detach()

# Get the mean embedding vector of the old wte
wgts_m = old_wgts.mean(0)

# Initialize vocab size and weights of the new wte
new_vocab_size = tokenizer_fastai_pl.tokenizer.vocab_size
new_wgts = old_wgts.new_zeros(new_vocab_size,old_wgts.size(1))

In [39]:
path_data

Path('/root/.fastai/data/plwiki')

**Save**

In [40]:
# Build the new wte, keeping the embedding vectors of the tokens common to the 2 vocabs
# A token present in the new vocab but not in the old one gets the mean embedding vector of the old wte
old_vocab = tokenizer_fastai_en.tokenizer.get_vocab()
new_vocab = tokenizer_fastai_pl.tokenizer.get_vocab()
same_tokens_list = list()
different_tokens_list = list()
    
for w,idx_new in new_vocab.items():    
    idx_old = old_vocab.get(w, -1)
    if idx_old>=0:
        new_wgts[idx_new] = old_wgts[idx_old]
        same_tokens_list.append((w,idx_new))
    else:
        new_wgts[idx_new] = wgts_m
        different_tokens_list.append((w,idx_new))

# setup in model the new wte
new_wte = nn.Embedding(new_vocab_size,old_wgts.size(1))
#new_wte.weight.data.normal_(mean=0.0, std=model.config.initializer_range)
new_wte.weight.data = new_wgts
model_en.transformer.set_input_embeddings(new_wte)
print(f'Polish wte matrix setup done!\n\nWe kept {len(same_tokens_list)} embedding vectors from the English one.\nWe did not keep {len(different_tokens_list)} embedding vectors from the English one (instead, we used the old wte mean vector).\n')

# Check identical tokens between the 2 vocabs               
num = 15
print(f'{num} first tokens IN common between the 2 vocabs:\n{same_tokens_list[:num]}\n')
print(f'{num} first tokens NOT in common between the 2 vocabs:\n{different_tokens_list[:num]}')

# save new_wgts
torch.save(new_wgts, path_data/'new_wte_wgts.pl')
# save same_tokens_list and different_tokens_list
torch.save(same_tokens_list, path_data/'same_tokens_list.pl')
torch.save(different_tokens_list, path_data/'different_tokens_list.pl')

Polish wte matrix setup done!

We kept 7725 embedding vectors from the English one.
We did not keep 42532 embedding vectors from the English one (instead, we used the old wte mean vector).

15 first tokens IN common between the 2 vocabs:
[('ĠJud', 22904), ('ĠSab', 42367), ('ĠAnge', 5618), ('1', 17), ('ĠTin', 38533), ('ĠCook', 40773), ('ĠOne', 12435), ('Ġsale', 19760), ('ĠRun', 28577), ('Ã©n', 20218), ('ras', 7778), ('ĠEth', 40490), ('ĠEk', 4341), ('arn', 43204), ('ĠFin', 5592)]

15 first tokens NOT in common between the 2 vocabs:
[('udio', 16969), ('ĠgaÅĤÄħ', 19241), ('ĠCechÄħ', 39503), ('ĠOlgi', 48836), ('ĠTrzebi', 25840), ('szyÄĩ', 7428), ('Ġewangelickiej', 36500), ('Ġpriorytet', 41018), ('ĠBrooklynie', 49683), ('ĠÅļwiatowej', 17951), ('ĠuczestniczÄħ', 24890), ('ĠkursÃ³w', 21245), ('ĠBost', 14634), ('zachodniego', 41008), ('ĠZiemiÄħ', 48258)]
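
The remapping logic of the cell above can be checked on a toy example. The sketch below uses plain Python lists instead of tensors; the three-token vocabularies and the weight values are made up. Note that a shared token usually has a different id in each vocab, which is why the lookup goes through the token string. The last line checks that the two counts reported above add up to the vocabulary size.

```python
# Toy "English" vocabulary and its embedding rows (made-up values).
old_vocab = {"a": 0, "b": 1, "c": 2}
old_wgts = [[1.0, 0.0], [0.0, 2.0], [5.0, 4.0]]

# Toy "Polish" vocabulary: shares "a" and "b" (at different ids) and adds "ł".
new_vocab = {"b": 0, "ł": 1, "a": 2}

# Mean embedding vector of the old matrix (column-wise mean).
wgts_m = [sum(col) / len(old_wgts) for col in zip(*old_wgts)]

new_wgts = [None] * len(new_vocab)
same_tokens, different_tokens = [], []
for tok, idx_new in new_vocab.items():
    idx_old = old_vocab.get(tok, -1)
    if idx_old >= 0:
        # Token exists in both vocabs: copy its old embedding row.
        new_wgts[idx_new] = list(old_wgts[idx_old])
        same_tokens.append((tok, idx_new))
    else:
        # New token: start from the mean of all old embeddings.
        new_wgts[idx_new] = list(wgts_m)
        different_tokens.append((tok, idx_new))

print(new_wgts)
print(same_tokens, different_tokens)

# Sanity check on the counts printed above: kept + replaced == vocab size.
assert 7725 + 42532 == 50257
```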


In [41]:
ls  -all '/root/.fastai/data/plwiki'

total 2300124
drwxr-xr-x 3 root root       4096 Mar 19 09:05 ./
drwxr-xr-x 3 root root       4096 Mar 19 08:57 ../
-rw------- 1 root root 1101183658 Mar 19 08:57 all_texts_plwiki.csv
-rw------- 1 root root 1098323868 Mar 19 08:58 all_texts_plwiki.txt
drwxr-xr-x 2 root root       4096 Mar 19 08:59 ByteLevelBPE_tokenizer_pl/
-rw-r--r-- 1 root root    1216559 Mar 19 09:05 different_tokens_list.pl
-rw-r--r-- 1 root root  154390264 Mar 19 09:05 new_wte_wgts.pl
-rw-r--r-- 1 root root     182831 Mar 19 09:05 same_tokens_list.pl


In [45]:
!cp /root/.fastai/data/plwiki/new_wte_wgts.pl  /content/gdrive/MyDrive/fastai
!cp /root/.fastai/data/plwiki/different_tokens_list.pl /content/gdrive/MyDrive/fastai
!cp /root/.fastai/data/plwiki/same_tokens_list.pl /content/gdrive/MyDrive/fastai