# Faster than training from scratch 
# Fine-tuning the English GPT-2 in any language with Hugging Face and fastai v2 

> Tutorial on how to use fastai v2 over Hugging Face's Transformers and Tokenizers libraries to fine-tune an English pre-trained transformer-based language model (GPT-2) to any language other than English

This notebook is based on the work of [Pierre Guillou](https://www.linkedin.com/in/pierreguillou).

## Other resources used

- Post on Medium: [Faster than training from scratch - Fine-tuning the English GPT-2 in any language with Hugging Face and fastai v2 (practical case with Portuguese)](https://medium.com/@pierre_guillou/faster-than-training-from-scratch-fine-tuning-the-english-gpt-2-in-any-language-with-hugging-f2ec05c98787)
- Fast notebook: [finetuning-English-GPT2-any-language-Portuguese-HuggingFace-fastaiv2_FAST.ipynb](https://github.com/piegu/fastai-projects/blob/master/finetuning-English-GPT2-any-language-Portuguese-HuggingFace-fastaiv2_FAST.ipynb)
- Hugging Face model page of [GPorTuguese-2](https://huggingface.co/pierreguillou/gpt2-small-portuguese): a language model for Portuguese text generation (and more NLP tasks...)
- Other posts on Medium from the GPT-2 series:
  - [NLP & fastai | GPT-2](https://medium.com/@pierre_guillou/nlp-fastai-gpt-2-16ee145a4a28)
  - [Byte-level BPE, an universal tokenizer but...](https://medium.com/@pierre_guillou/byte-level-bpe-an-universal-tokenizer-but-aff932332ffe)

## Overview

In this tutorial, instead of training from scratch, we will see how to fine-tune an English pre-trained [transformer](https://arxiv.org/abs/1706.03762)-based language model to any other language in just over a day, on one GPU and with a little more than 1GB of training data.

As a practical case, we fine-tune to Portuguese the [English pre-trained GPT-2](https://github.com/openai/gpt-2) by wrapping the [Transformers](https://github.com/huggingface/transformers) and [Tokenizers](https://github.com/huggingface/tokenizers) libraries of Hugging Face into [fastai v2](https://github.com/fastai/fastai2). We thus create a new language model: [GPorTuguese-2](https://huggingface.co/pierreguillou/gpt2-small-portuguese), a language model for Portuguese text generation (and more NLP tasks...).

![The 3 main steps of fine-tuning the English GPT-2 to Portuguese with Hugging Face and fastai v2 (image edited from fast.ai NLP)](images/GPT2_tf_ft_approach.png "The 3 main steps of fine-tuning the English GPT-2 to Portuguese with Hugging Face and fastai v2 (image edited from fast.ai NLP)")

## Acknowledgment

This tutorial was made possible thanks to the computing power of the [AI Lab](https://www.linkedin.com/company/ailab-unb/) (University of Brasilia) to which I am attached as an Associate Researcher in NLP and the participation of its directors in the definition of the NLP strategy, Professors [Fabricio Ataides Braz](https://www.linkedin.com/in/fabricio-braz-b356457/) and [Nilton Correia da Silva](https://www.linkedin.com/in/nilton-silva-6097853/). Thank you so much!

And special thanks to Sylvain Gugger for his [tutorial on Transformers and fastai v2](https://dev.fast.ai/tutorial.transformers) which is the basis of this tutorial.

## Table of contents

- [Overview](#Overview)
- [Texts generated by GPorTuguese-2 on Covid-19, Netflix, Artificial Intelligence and... unicorns](#Texts-generated-by-GPorTuguese-2-on-Covid-19-Netflix-Artificial-Intelligence-and...-unicorns)
- [Acknowledgment](#Acknowledgment)
- [Post, notebooks, Web App and model download](#Post,-notebooks,-Web-App-and-model-download)
- [Results](#Results)
- [About the need for language models not just in English... and how to do it in real life](#About-the-need-for-language-models-not-just-in-English...-and-how-to-do-it-in-real-life)
  - [(option 1) Fast pipeline to localize any transformer-based model to any language](#(option-1)-Fast-pipeline-to-localize-any-transformer-based-model-to-any-language)
  - [(option 2) Fine-tuning of an existing pre-trained model](#(option-2)-Fine-tuning-of-an-existing-pre-trained-model)
- [Why using fastai v2 over Hugging Face libraries to fine-tune a pre-trained transformer-based language model?](#Why-using-fastai-v2-over-Hugging-Face-libraries-to-fine-tune-a-pre-trained-transformer-based-language-model?)
  - [Tokenizers and Transformers from Hugging Face](#Tokenizers-and-Transformers-from-Hugging-Face)
  - [fastai v2](#fastai-v2)
- [About the choice of GPT-2](#About-the-choice-of-GPT-2)
- [References](#References)
  - [GPT-2](#GPT-2)
  - [Datasets in Portuguese](#Datasets-in-Portuguese)
  - [Hugging Face](#Hugging-Face)
  - [Pytorch, fastai & Transformers (Hugging Face)](#Pytorch,-fastai-&-Transformers-(Hugging-Face))
- [Main coding steps to fine-tune a Hugging Face language model with fastai v2](#Main-coding-steps-to-fine-tune-a-Hugging-Face-language-model-with-fastai-v2)
  - [1. Initialization](#Initialization)
  - [2. Download Wikipedia in Portuguese](#2.-Download-Wikipedia-in-Portuguese)
  - [3. Download a GPT-2 English pre-trained model and train a GPT-2 tokenizer with a vocab in Portuguese](#3.-Download-a-GPT-2-English-pre-trained-model-and-train-a-GPT-2-tokenizer-with-a-vocab-in-Portuguese)
  - [4. Create a fastai tokenizer and update the embeddings matrix of the GPT-2 English pre-trained model](#4.-Create-a-fastai-tokenizer-and-update-the-embeddings-matrix-of-the-GPT-2-English-pre-trained-model)
    - [4.1 GPT2TokenizerFast (imported GPT-2 tokenizer) --> fastai Tokenizer](#4.1-GPT2TokenizerFast-(imported-GPT-2-tokenizer)--->-fastai-Tokenizer)
    - [4.2 Change vocab embeddings (wte matrix) in the GPT-2 pre-trained model to adapt to the Portuguese vocab](#4.2-Change-vocab-embeddings-(wte-matrix)-in-the-GPT-2-pre-trained-model-to-adapt-to-the-Portuguese-vocab)
  - [5. Create fastai v2 Datasets and Dataloaders](#5.-Create-fastai-v2-Datasets-and-Dataloaders)
    - [5.1 fastai v2 Datasets](#5.1-fastai-v2-Datasets)
      - [Visualize Data](#Visualize-Data)
      - [Sample (this allows us to quickly test our code)](#Sample-(this-allows-us-to-quickly-test-our-code))
      - [All data](#All-data)
      - [Check datasets](#Check-datasets)
    - [5.2 fastai v2 Dataloaders](#5.2-fastai-v2-Dataloaders)
- [Model sharing and uploading in the Hugging Face model hub](#Model-sharing-and-uploading-in-the-Hugging-Face-model-hub)
- [Text Generation by our Portuguese GPT-2](#Text-Generation-by-our-Portuguese-GPT-2)
  - [Famous OpenAI generated text about unicorns](#Famous-OpenAI-generated-text-about-unicorns)
  - [Text generation techniques](#Text-generation-techniques)
    - [Text Generation techniques](#Text-generation-techniques)
      - [(Use case 1) Top-k sampling](#(Use-case-1)-Top-k-sampling)
      - [(Use case 2) Top-p (nucleus) sampling](#(Use-case-2)-Top-p-(nucleus)-sampling)
    - [Text n°1 | Famous OpenAI generated text about unicorns](#Text-n°1-|-Famous-OpenAI-generated-text-about-unicorns)
    - [Text n°2 | Recent text on the coronavirus disease (Covid-19)](#Text-n°2-|-Recent-text-on-the-coronavirus-disease-(Covid-19))

## Results

The original author reports: In a little more than a day (we used only one NVIDIA V100 32GB GPU; with Distributed Data Parallel (DDP) training on just 2 GPUs, we could have cut this time by a factor of three, to about 10 hours), **we got a loss of 3.17, an accuracy of 37.99% and a perplexity of 23.76** (see the validation results table below and the explanations about perplexity at the end of the paragraph). Happy!

```
+------------+------+----------+------------+----------+-----------+
|   after    | loss | accuracy | perplexity |   time   | cumulative|
| ... epochs |      |   (%)    |            | by epoch |    time   |
+------------+------+----------+------------+----------+-----------+
|      0     | 9.95 |    9.90  |  20950.94  | 00:00:00 | 00:00:00  |
|      1     | 3.64 |   32.52  |     38.12  |  5:48:31 |  5:48:31  |
|      2     | 3.30 |   36.29  |     27.16  |  5:38:18 | 11:26:49  |
|      3     | 3.21 |   37.46  |     24.71  |  6:20:51 | 17:47:40  |
|      4     | 3.19 |   37.74  |     24.21  |  6:06:29 | 23:54:09  |
|      5     | 3.17 |   37.99  |     23.76  |  6:16:22 | 30:10:31  |
+------------+------+----------+------------+----------+-----------+
                 Fine-tuning of GPT-2 into Portuguese
               Table of training and validation results
```

![Validation loss and accuracy of pre-trained English GPT-2 of Hugging Face fine-tuned to Portuguese by fastai v2](images/gpt2_loss_acc_finetuned_fastaiv2.png "Validation loss and accuracy of pre-trained English GPT-2 of Hugging Face fine-tuned to Portuguese by fastai v2")

**After a huge gain at the end of the first epoch (see the validation results graph above), the validation accuracy continues to improve until the end of training, but more slowly** (it gets close to 40%, which is considered a good performance for a language model; see the notebooks [nn-vietnamese.ipynb](https://github.com/fastai/course-nlp/blob/master/nn-vietnamese.ipynb) and [nn-turkish.ipynb](https://github.com/fastai/course-nlp/blob/master/nn-turkish.ipynb) from Jeremy Howard of fastai).

To read more about these results, read the post ["Faster than training from scratch - Fine-tuning the English GPT-2 in any language with Hugging Face and fastai v2 (practical case with Portuguese)"](https://medium.com/@pierre_guillou/faster-than-training-from-scratch-fine-tuning-the-english-gpt-2-in-any-language-with-hugging-f2ec05c98787).
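
As a quick sanity check on the results table: perplexity is the exponential of the cross-entropy loss (measured in nats). The minimal sketch below recomputes it from the loss values copied from the table. Small differences from the reported perplexities are expected, because the losses shown in the table are rounded to two decimals.

```python
import math

# (epoch, validation loss, reported perplexity), copied from the results table.
results = [
    (0, 9.95, 20950.94),
    (1, 3.64, 38.12),
    (2, 3.30, 27.16),
    (3, 3.21, 24.71),
    (4, 3.19, 24.21),
    (5, 3.17, 23.76),
]


def perplexity(loss: float) -> float:
    """Perplexity of a language model whose mean cross-entropy loss (in nats) is `loss`."""
    return math.exp(loss)


for epoch, loss, reported in results:
    recomputed = perplexity(loss)
    rel_diff = abs(recomputed - reported) / reported
    print(f"epoch {epoch}: exp({loss:.2f}) = {recomputed:.2f} "
          f"(reported {reported:.2f}, relative difference {rel_diff:.2%})")
```

Read this way, a perplexity of about 24 means that the fine-tuned model is, on average, as uncertain about the next token as if it had to choose uniformly among roughly 24 candidates.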

## About the need for language models not just in English... and how to do it in real life

Even if English is today the most spoken language in the world, **the world is multilingual**. It is therefore necessary to have **natural language models trained in all existing languages**, and not just in English, since these models constitute the essential basis for the training of models capable of performing a particular task in linguistics (classification, Q&A, synthesis, entity searches, etc.).

However, while it is extremely simple and free to download a language model trained in English, in particular via the [Transformers library](https://huggingface.co/transformers/) of Hugging Face, it is often much more difficult to find online a model trained in another language.

### (option 1) Fast pipeline to localize any transformer-based model to any language

The easiest way to get these language-specific language models would be to **use a pipeline of existing pre-trained transformer-based models** like the following one:

![pipeline of existing pre-trained transformer-based models with a translator one at the input and output (image edited from fast.ai NLP)](images/trans_tf.png "pipeline of existing pre-trained transformer-based models with a translator one at the input and output (image edited from fast.ai NLP)")

For example, to obtain a Portuguese GPT-2, we could download from the [Transformers](https://github.com/huggingface/transformers) library of Hugging Face the [OpenAI GPT-2 pre-trained in English](https://huggingface.co/transformers/model_doc/gpt2.html) and the [MarianMT](https://huggingface.co/transformers/model_doc/marian.html) translator (we could also use [BART](https://huggingface.co/transformers/model_doc/bart.html) or [T5](https://huggingface.co/transformers/model_doc/t5.html) for the translation) in order to create the following pipeline:
```
                (input) Portuguese to English (MarianMT) 
                          >> English pre-trained language model (GPT-2) 
                                    >> (output) English to Portuguese (MarianMT)
```

So, for free and with only a few lines of code, we can get any language model in any language, and even any task-oriented NLP model (classification, Q&A, synthesis, entity searches, etc.) using the same pipeline. Not bad!

You will find the code of this pipeline and examples of use for text generation in the post "**Fast pipeline to localize any transformer-based model to any language**".
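
Independently of that post, the data flow of this pipeline can be sketched in a few lines. This is only an illustration under stated assumptions, not the author's code: `make_localized_model` and `build_hf_portuguese_gpt2` are hypothetical helper names, and the MarianMT checkpoint names (`Helsinki-NLP/opus-mt-ROMANCE-en`, `Helsinki-NLP/opus-mt-en-ROMANCE` with the `>>pt<<` target prefix) are assumptions that should be checked on the Hugging Face model hub. The demo at the end uses toy stand-ins so that it runs offline.

```python
from typing import Callable

TextFn = Callable[[str], str]


def make_localized_model(to_english: TextFn, english_model: TextFn, from_english: TextFn) -> TextFn:
    """Chain translator -> English model -> translator into one text-to-text function."""
    def run(text: str) -> str:
        return from_english(english_model(to_english(text)))
    return run


def build_hf_portuguese_gpt2(max_length: int = 50) -> TextFn:
    """Wire the pipeline with Hugging Face models (downloads weights when called).

    The checkpoint names below are assumptions, not taken from the tutorial.
    """
    from transformers import GPT2LMHeadModel, GPT2TokenizerFast, MarianMTModel, MarianTokenizer

    def marian(name: str, prefix: str = "") -> TextFn:
        tok = MarianTokenizer.from_pretrained(name)
        model = MarianMTModel.from_pretrained(name)

        def translate(text: str) -> str:
            batch = tok([prefix + text], return_tensors="pt", padding=True)
            return tok.batch_decode(model.generate(**batch), skip_special_tokens=True)[0]
        return translate

    gpt2_tok = GPT2TokenizerFast.from_pretrained("gpt2")
    gpt2 = GPT2LMHeadModel.from_pretrained("gpt2")

    def generate(text: str) -> str:
        ids = gpt2_tok(text, return_tensors="pt").input_ids
        out = gpt2.generate(ids, max_length=max_length, do_sample=True, top_k=50,
                            pad_token_id=gpt2_tok.eos_token_id)
        return gpt2_tok.decode(out[0], skip_special_tokens=True)

    return make_localized_model(
        marian("Helsinki-NLP/opus-mt-ROMANCE-en"),                      # Portuguese -> English
        generate,                                                       # English GPT-2
        marian("Helsinki-NLP/opus-mt-en-ROMANCE", prefix=">>pt<< "),    # English -> Portuguese
    )


# Offline demo with toy stand-ins, just to show how the three stages compose.
toy = make_localized_model(
    to_english=lambda s: s.replace("Olá", "Hello"),
    english_model=lambda s: s + " world",
    from_english=lambda s: s.replace("Hello", "Olá").replace("world", "mundo"),
)
print(toy("Olá"))
```

Keeping the three stages as plain functions makes it easy to swap the translator (BART or T5 instead of MarianMT, for example) without touching the rest.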

However, **the problem with this simple solution is that we depend on the quality of training of 2 pre-trained NLP models, which greatly increases the risk of losing the linguistic singularities and nuances of the desired language**.

### (option 2) Fine-tuning of an existing pre-trained model

Therefore, it often becomes necessary to train one's own language model.

Nevertheless, training from scratch a powerful language model like [GPT-2](https://github.com/openai/gpt-2) or [GPT-3](https://github.com/openai/gpt-3) of OpenAI, [BART](https://arxiv.org/abs/1910.13461) of Facebook or [T5](https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html) of Google requires tens or even hundreds of GB of text, which is difficult or impossible to find, and gigantic computing power that only a few companies in the world have. For example,
- [GPT-2 Extra-Large](https://openai.com/blog/gpt-2-1-5b-release/) (1.5 billion parameters) was trained on 40GB of WebText on [32 Cloud TPU v3](https://twitter.com/teradimich/status/1096232184875233280) for 1 week ([cost of 43,008 dollars](https://twitter.com/Smerity/status/1096268294942674946))
- [CamemBERT, the BERT in French, was trained on 38GB of raw text on 256 GPUs (32 GB Tesla V100) for 1 day](https://github.com/huggingface/transformers/issues/1356#issuecomment-561691234)
- [RoBERTa was pre-trained for 24 hours on 1,024 (full size, 32GB) V100s](https://github.com/huggingface/transformers/issues/1356#issuecomment-536187777)... and that is without even mentioning T5, or GPT-3, whose [computational cost was estimated at 4.6 million dollars](https://lambdalabs.com/blog/demystifying-gpt-3/)! ("*We are waiting for OpenAI to reveal more details about the training infrastructure and model implementation. But to put things into perspective, GPT-3 175B model required 3.14E23 FLOPS of computing for training. Even at theoretical 28 TFLOPS for V100 and lowest 3 year reserved cloud pricing we could find, this will take 355 GPU-years and cost 4.6M dollars for a single training run.*")
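
The orders of magnitude quoted in this list can be re-derived with a few lines of arithmetic. The sketch below uses only the figures cited above (reading the GPT-2 cost as 43,008 dollars) and assumes a 365-day year. The hourly prices it prints are values implied by those figures, not numbers stated by the sources.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600
HOURS_PER_YEAR = 365 * 24

# GPT-2 Extra-Large: 32 Cloud TPU v3 for 1 week, quoted cost of 43,008 dollars.
gpt2_device_hours = 32 * 7 * 24
gpt2_cost_per_device_hour = 43_008 / gpt2_device_hours

# GPT-3 175B: figures from the Lambda Labs quote.
gpt3_total_flop = 3.14e23         # total training compute
v100_flop_per_second = 28e12      # theoretical 28 TFLOPS for one V100
gpt3_gpu_years = gpt3_total_flop / v100_flop_per_second / SECONDS_PER_YEAR
gpt3_cost_per_gpu_hour = 4.6e6 / (gpt3_gpu_years * HOURS_PER_YEAR)

print(f"GPT-2 XL: {gpt2_device_hours} TPU-hours, implied ${gpt2_cost_per_device_hour:.2f} per TPU-hour")
print(f"GPT-3: {gpt3_gpu_years:.1f} V100-years, implied ${gpt3_cost_per_gpu_hour:.2f} per GPU-hour")
```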

![NLP models through time, with their number of parameters (Image credit: TensorFlow blog)](images/NLPmodels.png "NLP models through time, with their number of parameters (Image credit: TensorFlow blog)")

Thus, as it is easy to download a few GB of text from an online language corpus ([Wikipedia](https://dumps.wikimedia.org/), [OSCAR](https://oscar-corpus.com/), [Common Crawl](https://commoncrawl.org/) for example) and rent an NVIDIA V100 GPU for $1.24 an hour ([GCP](https://cloud.google.com/), [AWS](https://aws.amazon.com/), [Azure](https://azure.microsoft.com/) for example), **it is more realistic for the majority of people and organizations wishing to use a language model other than English to fine-tune on a few GB of text a model already pre-trained in English** (i.e. fine-tuning a model obtained by Transfer Learning) using Deep Learning frameworks such as [TensorFlow](https://www.tensorflow.org/)+[Keras](https://keras.io/) or [PyTorch](https://pytorch.org/)+[fastai](https://dev.fast.ai/).
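
For comparison, here is a rough estimate of what the fine-tuning run of this tutorial would cost at that hourly price. It is a sketch under simple assumptions: the whole cumulative training time from the results table (30:10:31) is billed on a single V100 at $1.24 an hour, and storage, data preparation and idle time are ignored. `hms_to_hours` is a hypothetical helper written for this example.

```python
def hms_to_hours(hms: str) -> float:
    """Convert an 'H:MM:SS' duration into a number of hours."""
    h, m, s = (int(part) for part in hms.split(":"))
    return h + m / 60 + s / 3600


V100_DOLLARS_PER_HOUR = 1.24        # hourly price quoted in the text
total_training_time = "30:10:31"    # cumulative time after 5 epochs (results table)

hours = hms_to_hours(total_training_time)
cost = hours * V100_DOLLARS_PER_HOUR
print(f"{hours:.2f} GPU-hours x ${V100_DOLLARS_PER_HOUR}/h = ${cost:.2f}")
```

Under these assumptions the whole run costs a few tens of dollars, which is the gap the paragraph above points at when it contrasts fine-tuning with training from scratch.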

This tutorial shows how to implement this second option, and you will find examples of use for text generation in the paragraph [Text Generation by our Portuguese GPT-2](#Text-Generation-by-our-Portuguese-GPT-2) at the end of this tutorial.

## Why using fastai v2 over Hugging Face libraries to fine-tune a pre-trained transformer-based language model?

### Tokenizers and Transformers from Hugging Face

The [Tokenizers](https://github.com/huggingface/tokenizers) and [Transformers](https://huggingface.co/transformers/) libraries from [Hugging Face](https://huggingface.co/) are today **among the most up-to-date NLP (Natural Language Processing) libraries** used all over the world.

Let's copy and paste the most important information from the [Transformers documentation](https://huggingface.co/transformers/):

> Transformers (formerly known as pytorch-transformers and pytorch-pretrained-bert) provides general-purpose architectures (BERT, GPT-2, RoBERTa, XLM, DistilBert, XLNet...) for Natural Language Understanding (NLU) and Natural Language Generation (NLG) with over 32+ [pretrained models](https://huggingface.co/transformers/pretrained_models.html) in 100+ languages and deep interoperability between TensorFlow 2.0 and PyTorch.

> The library was designed with [two strong goals](https://huggingface.co/transformers/quickstart.html#philosophy) in mind:
> - be as easy and fast to use as possible
> - provide state-of-the-art models with performances as close as possible to the original models

> The library is build around [three types of classes for each model](https://huggingface.co/transformers/quickstart.html#main-concepts):
> - **model classes** like `BertModel` which are 20+ PyTorch models (`torch.nn.Modules`) that work with the pretrained weights provided in the library. In TF2, these are `tf.keras.Model`.
> - **configuration classes** which store all the parameters required to build a model, like `BertConfig`. You don’t always need to instantiate these your-self. In particular, if you are using a pretrained model without any modification, creating the model will automatically take care of instantiating the configuration (which is part of the model).
> - **tokenizer classes** which store the vocabulary for each model and provide methods for encoding/decoding strings in a list of token embeddings indices to be fed to a model, like `BertTokenizer`.

> All these classes can be instantiated from pretrained instances: `from_pretrained()` let you instantiate a model/configuration/tokenizer from a pretrained version either provided by the library itself or stored locally (or on a server) by the user.

### fastai v2

However, as written in the [Philosophy](https://huggingface.co/transformers/quickstart.html#philosophy) paragraph of the [Quickstart](https://huggingface.co/transformers/quickstart.html) page:
> the Transformers library is NOT a modular toolbox of building blocks for neural nets. If you want to extend/build-upon the library, just use regular Python/PyTorch modules and inherit from the base classes of the library to reuse functionalities like model loading/saving.

Therefore, despite the ready-to-run Python scripts published by Hugging Face (for example, [run_language_modeling.py](https://github.com/huggingface/transformers/blob/master/examples/language-modeling/run_language_modeling.py) for fine-tuning the library models for language modeling on a text file (GPT, GPT-2, BERT, RoBERTa)), **when it becomes necessary to fine-tune a pre-trained model to another language and/or another task, we need to use regular Python/PyTorch modules in order to apply Transfer Learning and modern fine-tuning techniques, in particular if we do not have a huge new training dataset**.

Here is a non-exhaustive list of these fine-tuning techniques based on Transfer Learning:
- **Learning rate finder** (a method that helps find the best learning rate to train the model)
- **Mixed precision training** (some of the operations are done in FP16 and others in FP32 in order to speed up the training)
- **Gradual unfreezing** (layer groups are defined so that we can choose which layers are trained)
- **1cycle policy** (the 1cycle policy was introduced by Leslie N. Smith et al. in [Super-Convergence: Very Fast Training of Neural Networks Using Large Learning Rates](https://arxiv.org/abs/1708.07120). It schedules the learning rate with a cosine annealing)
- **Differential learning rates** (a specific learning rate is set for each layer group)
- **Distributed training** (the training is distributed over several GPUs in order to speed it up)

Since **fastai v2 provides all of these powerful fine-tuning techniques**, it is a prime candidate library for training transformer-based language models pre-trained with the Tokenizers and Transformers libraries of Hugging Face.
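
To make two of the techniques listed above more concrete, here is a minimal pure-Python sketch of a 1cycle learning rate schedule with cosine annealing and of differential learning rates spread over layer groups. It is an illustration, not fastai's implementation: the function names are hypothetical, and the default values (`div=25`, `div_final=1e5`, `pct_start=0.25`) are assumptions chosen to resemble what fastai's `fit_one_cycle` uses.

```python
import math


def cos_anneal(start: float, end: float, pos: float) -> float:
    """Cosine interpolation from `start` (pos=0) to `end` (pos=1)."""
    return end + (start - end) / 2 * (1 + math.cos(math.pi * pos))


def one_cycle_lr(pct: float, lr_max: float, div: float = 25.0,
                 div_final: float = 1e5, pct_start: float = 0.25) -> float:
    """Learning rate at training progress `pct` (0 = first step, 1 = last step).

    Warm-up from lr_max/div to lr_max, then annealing down to lr_max/div_final.
    """
    if pct < pct_start:
        return cos_anneal(lr_max / div, lr_max, pct / pct_start)
    return cos_anneal(lr_max, lr_max / div_final, (pct - pct_start) / (1 - pct_start))


def layer_group_lrs(lr_low: float, lr_high: float, n_groups: int) -> list:
    """One learning rate per layer group, geometrically spaced.

    The earliest layers get lr_low, the last group (closest to the output) gets lr_high.
    """
    if n_groups == 1:
        return [lr_high]
    ratio = (lr_high / lr_low) ** (1 / (n_groups - 1))
    return [lr_low * ratio ** i for i in range(n_groups)]


for pct in (0.0, 0.125, 0.25, 0.5, 0.75, 1.0):
    print(f"progress {pct:5.3f}: lr = {one_cycle_lr(pct, lr_max=1e-3):.2e}")
print("lrs per layer group:", layer_group_lrs(1e-5, 1e-3, n_groups=3))
```

The idea behind the second function is that the early layers of a pre-trained model already encode general features and should move less than the last layers, which have to adapt the most to the new data.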

## About the choice of GPT-2

In order to demonstrate the feasibility of fine-tuning Hugging Face models via fastai v2, we had to choose an emblematic model of the [Transformer revolution](https://arxiv.org/abs/1706.03762) in NLP since 2017.

Thus, between the GPT-2 and [BERT](https://github.com/google-research/bert) models, we chose the GPT-2 model because it strongly influenced minds beyond the circle of Deep Learning specialists in early 2019 by [writing texts of a quality level close to that of humans](https://openai.com/blog/better-language-models/#samples). Although it has since been "exceeded" in number of parameters and performance by more recent models like BART, T5 and of course GPT-3 (175 billion parameters!), it remains a reference and a model used in research and applications.

For those who want to better understand how GPT-2 works, read the following posts:
- [The Illustrated GPT-2 (Visualizing Transformer Language Models)](http://jalammar.github.io/illustrated-gpt2/)
- [NLP & fastai | GPT-2](https://medium.com/@pierre_guillou/nlp-fastai-gpt-2-16ee145a4a28)

**About the version of GPT-2**

There are 4 versions of the GPT-2 model: small, medium, large and XL (look at the [transformers documentation](https://huggingface.co/transformers/pretrained_models.html) for more details). Here, **we use the small version**, the one with the smallest number of weights (124 million, not 117 million as written in the original paper), but you can change the model used by changing the content of `pretrained_weights` (if it's not a GPT-2 model, you'll of course need to change the classes used for the model and the tokenizer).
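
The figure of 124 million weights can be checked by counting the parameters of the architecture by hand. The sketch below does this from the hyperparameters of the small GPT-2 configuration (vocabulary of 50,257 tokens, 1,024 positions, embedding size 768, 12 layers); these values are not given in the text above and come from the published GPT-2 small configuration. The output projection is assumed to share its weights with the token embedding matrix, so it is not counted twice.

```python
def gpt2_parameter_count(vocab_size: int, n_positions: int, n_embd: int, n_layer: int) -> int:
    """Number of trainable parameters of a GPT-2 style decoder (tied output weights)."""
    wte = vocab_size * n_embd                       # token embeddings
    wpe = n_positions * n_embd                      # position embeddings
    layer_norm = 2 * n_embd                         # scale + bias
    attention = (n_embd * 3 * n_embd + 3 * n_embd)  # query/key/value projection
    attention += n_embd * n_embd + n_embd           # attention output projection
    mlp = (n_embd * 4 * n_embd + 4 * n_embd)        # expansion to 4 * n_embd
    mlp += 4 * n_embd * n_embd + n_embd             # projection back to n_embd
    block = 2 * layer_norm + attention + mlp        # one transformer block
    return wte + wpe + n_layer * block + layer_norm  # + final layer norm


small = gpt2_parameter_count(vocab_size=50257, n_positions=1024, n_embd=768, n_layer=12)
print(f"GPT-2 small: {small:,} parameters")
```

Note that the token embedding matrix alone (50,257 x 768 weights) accounts for almost a third of the total, which is why adapting it to a new vocabulary is a central step of this tutorial.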

**More about GPT-2**

Source: https://huggingface.co/transformers/model_doc/gpt2.html

> OpenAI GPT-2 model was proposed in [Language Models are Unsupervised Multitask Learners](https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf) by Alec Radford*, Jeffrey Wu*, Rewon Child, David Luan, Dario Amodei** and Ilya Sutskever**. It’s a causal (unidirectional) transformer pre-trained using language modeling on a very large corpus of ~40 GB of text data.

> The abstract from the paper is the following: *GPT-2 is a large transformer-based language model with 1.5 billion parameters, trained on a dataset[1] of 8 million web pages. GPT-2 is trained with a simple objective: predict the next word, given all of the previous words within some text. The diversity of the dataset causes this simple goal to contain naturally occurring demonstrations of many tasks across diverse domains. GPT-2 is a direct scale-up of GPT, with more than 10X the parameters and trained on more than 10X the amount of data.*

> Tips:
> - GPT-2 is a model with absolute position embeddings so it’s usually advised to pad the inputs on the right rather than the left.
> - GPT-2 was trained with a causal language modeling (CLM) objective and is therefore powerful at predicting the next token in a sequence. Leveraging this feature allows GPT-2 to generate syntactically coherent text as it can be observed in the run_generation.py example script.
> - The PyTorch models can take the past as input, which is the previously computed key/value attention pairs. Using this past value prevents the model from re-computing pre-computed values in the context of text generation. See [reusing the past in generative models](https://huggingface.co/transformers/quickstart.html#using-the-past) for more information on the usage of this argument.

> [Write With Transformer](https://transformer.huggingface.co/doc/gpt2-large) is a webapp created and hosted by Hugging Face showcasing the generative capabilities of several models. GPT-2 is one of them and is available in five different sizes: small, medium, large, xl and a distilled version of the small checkpoint: distilgpt-2.

> The original code can be found [here](https://openai.com/blog/better-language-models/).

## References

### GPT-2

- Understanding
  - [Better Language Models and Their Implications](https://openai.com/blog/better-language-models/) (OpenAI, 02/14/2019)
  - [The Illustrated GPT-2 (Visualizing Transformer Language Models)](http://jalammar.github.io/illustrated-gpt2/)
  - [The Annotated GPT-2](https://amaarora.github.io/2020/02/18/annotatedGPT2.html)
  - [Understanding the GPT-2 Source Code](https://medium.com/analytics-vidhya/understanding-the-gpt-2-source-code-part-1-4481328ee10b)
  - [How To Make Custom AI-Generated Text With GPT-2](https://minimaxir.com/2019/09/howto-gpt2/)
- Online Apps
  - [Write With Transformer (distilgpt2-small, gpt2small, gpt2medium, gpt2large)](https://transformer.huggingface.co/doc/gpt2-large)
  - [Write With DistilGPT-2](https://transformer.huggingface.co/model/distil-gpt2)
  - [Generate custom text from an AI using GPT-2 (using the 117M default model)](https://minimaxir.com/apps/gpt2-small/)
  - [Allen GPT2 Large Demo](https://demo.allennlp.org/next-token-lm?text=AllenNLP%20is)
- Other papers: [The Annotated Transformer](https://nlp.seas.harvard.edu/2018/04/03/attention.html), [Layer Normalization](https://arxiv.org/abs/1607.06450)

### Datasets in Portuguese

- Wikipedia
  - (fastai): code from [Vietnamese ULMFiT from scratch](https://github.com/fastai/course-nlp/blob/master/nn-vietnamese.ipynb)
  - (Hugging Face): [code from nlp](https://huggingface.co/nlp/viewer/?dataset=wikipedia&config=20200501.pt)
- [OSCAR corpus](https://traces1.inria.fr/oscar/): code from [Find a Dataset](https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/01_how_to_train.ipynb#scrollTo=oK7PPVm2XBgr)

### Hugging Face

- Dataset
  - [nlp](https://github.com/huggingface/nlp)
  - [Colab tutorial](https://colab.research.google.com/github/huggingface/nlp/blob/master/notebooks/Overview.ipynb)
  - [Online dataset explorer](https://huggingface.co/nlp/viewer)
- Tokenizers
  - [Tokenizers](https://github.com/huggingface/tokenizers) (github)
  - Source code
    - [Source code for transformers.tokenization_gpt2](https://huggingface.co/transformers/_modules/transformers/tokenization_gpt2.html)
    - [Source code for transformers.tokenization_utils_base](https://huggingface.co/transformers/_modules/transformers/tokenization_utils_base.html)
    - [Source code for transformers.tokenization_utils](https://huggingface.co/transformers/_modules/transformers/tokenization_utils.html)
    - [Source code for transformers.tokenization_utils_fast](https://huggingface.co/transformers/_modules/transformers/tokenization_utils_fast.html)
    - [classmethod from_pretrained()](https://huggingface.co/transformers/main_classes/tokenizer.html#transformers.PreTrainedTokenizer.from_pretrained): Instantiate a PreTrainedTokenizer (or a derived class) from a predefined tokenizer.
  - [Hugging Face Tutorials - Training Tokenizer](https://www.kaggle.com/funtowiczmo/hugging-face-tutorials-training-tokenizer)
  - [Hugging Face Introduces Tokenizers](https://medium.com/dair-ai/hugging-face-introduces-tokenizers-d792482db360)
  - How to train a new language model from scratch using Transformers and Tokenizers (05/15/2020): [blog post](https://huggingface.co/blog/how-to-train) & [colab notebook](https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/01_how_to_train.ipynb)
  - [HuggingFace Tokenizers Cheat Sheet](https://www.kaggle.com/debanga/huggingface-tokenizers-cheat-sheet)
  - [Tokenizers: How machines read](https://blog.floydhub.com/tokenization-nlp/) (01/28/2020)
  - [Byte Pair Encoding](https://leimao.github.io/blog/Byte-Pair-Encoding/) (07/19/2019)
  - [What is a tokenizer?](https://docs.rs/tokenizers/0.10.1/tokenizers/#what-is-a-tokenizer)
- Transformers
  - [Transformers](https://huggingface.co/transformers/) from Hugging Face & [Transformers github](https://github.com/huggingface/transformers)
  - [Glossary](https://huggingface.co/transformers/glossary.html)
  - [OpenAI GPT2](https://huggingface.co/transformers/model_doc/gpt2.html#openai-gpt2)
  - Source code
    - [Source code for transformers.modeling_gpt2](https://huggingface.co/transformers/_modules/transformers/modeling_gpt2.html)
    - [Source code for transformers.configuration_gpt2](https://huggingface.co/transformers/_modules/transformers/configuration_gpt2.html)
  - [DistilBERT](https://medium.com/huggingface/distilbert-8cf3380435b5), [DistilGPT2](https://huggingface.co/distilgpt2) & [Download Model: distilgpt2](https://huggingface.co/distilgpt2)
  - [Train a GPT-2 Text-Generating Model w/ GPU For Free](https://colab.research.google.com/drive/1VLG8e7YSEwypxU-noRNhsv5dW4NfTGce#scrollTo=H7LoMj4GA4n_) (colab notebook, 11/10/2019)
  - How to generate text: using different decoding methods for language generation with Transformers (03/18/2020, Hugging Face): [blog post](https://huggingface.co/blog/how-to-generate) and [colab notebook](https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/02_how_to_generate.ipynb) 

### Pytorch, fastai & Transformers (Hugging Face)

- [Sequence-to-Sequence Modeling with nn.Transformer and TorchText](https://pytorch.org/tutorials/beginner/transformer_tutorial.html#sequence-to-sequence-modeling-with-nn-transformer-and-torchtext)
- [Fastai v2](https://dev.fast.ai) (Deep Learning library on PyTorch) & [Hugging face](https://huggingface.co/)
- [blurr](https://ohmeow.github.io/blurr/): a library that integrates huggingface transformers with version 2 of the fastai framework
- fastai v2
  - Integration of the GPT2 model into fastai v2: code from [Tutorial - Transformers](https://dev.fast.ai/tutorial.transformers) and [10_nlp.ipynb](https://github.com/fastai/fastbook/blob/master/10_nlp.ipynb) (how to fine-tune an NLP model with fastai v2)
  - FastHugs
    - [FastHugs in the fastai forum](https://forums.fast.ai/t/fasthugs-fastai-v2-and-huggingface-transformers/63681)
    - [FastHugs: Language Modelling with Tranformers and Fastai](https://www.ntentional.com/nlp/transformers/training%20technique/classification/2020/04/24/fasthugs_language_model.html) (04/24/2020, fastai v2)
    - [FastHugs: Sequence Classification with Transformers and Fastai](https://www.ntentional.com/nlp/training%20technique/classification/2020/04/17/fasthugs_seq_classification.html) (04/17/2020, fastai v2)
- fastai v1
  - [A Tutorial to Fine-Tuning BERT with Fast AI](http://mlexplained.com/2019/05/13/a-tutorial-to-fine-tuning-bert-with-fast-ai/) (05/15/2019, fastai v1)
  - [Fastai integration with BERT: Multi-label text classification identifying toxicity in texts](https://medium.com/@abhikjha/fastai-integration-with-bert-a0a66b1cecbe) (07/17/2019, fastai v1)
  - [When Pytorch-transformers meets Fastai (w/ Google Colab)](https://towardsdatascience.com/best-of-two-worlds-pytorch-transformers-meets-fastai-5fd51ef34b0f) (08/26/2019, fastai v1)
  - [Using RoBERTa with Fastai for NLP](https://medium.com/analytics-vidhya/using-roberta-with-fastai-for-nlp-7ed3fed21f6c) (09/02/2019, fastai v1)
  - [RoBERTa with Fastai](https://www.kaggle.com/abhikjha/roberta-with-fastai) (11/14/2019, fastai v1)
  - [Fastai with 🤗Transformers (BERT, RoBERTa, XLNet, XLM, DistilBERT)](https://towardsdatascience.com/fastai-with-transformers-bert-roberta-xlnet-xlm-distilbert-4f41ee18ecb2) (11/27/2019, fastai v1): A tutorial to implement state-of-the-art NLP models with Fastai for Sentiment Analysis ([notebook](https://www.kaggle.com/maroberti/fastai-with-transformers-bert-roberta))
  - [RoBERTa (fastai, HuggingFace 🤗Transformers)](https://www.kaggle.com/melissarajaram/roberta-fastai-huggingface-transformers/execution) (01/17/2020, fastai v1)

## Main coding steps to fine-tune a Hugging Face language model with fastai v2

The 6 main steps detailed below can be summarized in 3 main ones:

1. **Initialization & download** (download of Portuguese Wikipedia and GPT-2 English pre-trained model and tokenizer)
2. **GPT-2 tokenizer with a Portuguese vocab** (train a GPT-2 tokenizer with a vocab in Portuguese, wrap it into a fastai v2 tokenizer and update the embedding matrix of the GPT-2 English pre-trained model according to the new Portuguese vocab, keeping the embedding vectors of the tokens common to the English and Portuguese vocabs)
3. **Fine-tune on Portuguese Wikipedia the GPT-2 model with fastai v2 training functionalities**

In [None]:
# A bit of extra setup (drive mount, paths, etc.)

In [None]:
#start by mounting google drive
from google.colab import drive, files
drive.mount('/content/gdrive', force_remount=True)

Mounted at /content/gdrive


In [None]:
# fastai v2 and its dependencies need to be installed first
!pip install -q git+https://github.com/fastai/fastai
!pip install -q git+https://github.com/fastai/fastcore
!pip install -q iterative-stratification

     |████████████████████████████████| 61kB 3.0MB/s 
     |████████████████████████████████| 12.8MB 326kB/s 
     |████████████████████████████████| 776.8MB 21kB/s 
  Building wheel for fastai (setup.py) ... done
ERROR: torchtext 0.9.0 has requirement torch==1.8.0, but you'll have torch 1.7.1 which is incompatible.
  Building wheel for fastcore (setup.py) ... done


In [None]:
cd /content/gdrive/MyDrive/fastai

/content/gdrive/MyDrive/fastai


In [None]:
from nlputilsfastai import *  # modified helper file: it now uses `from fastai.basics import *` (was fastai2)

In [None]:
# !pip install fastcore==1.3.8

Collecting fastcore==1.3.8
  Downloading https://files.pythonhosted.org/packages/26/53/d79c0f942f8bb44903108462541130b53fc7b4d744b1b5df9127b0b524d6/fastcore-1.3.8-py3-none-any.whl (48kB)
     |████████████████████████████████| 51kB 5.8MB/s 
Installing collected packages: fastcore
  Found existing installation: fastcore 1.3.20
    Uninstalling fastcore-1.3.20:
      Successfully uninstalled fastcore-1.3.20
Successfully installed fastcore-1.3.8


# 1. Installing required libraries and mounting Google Drive

In [None]:
#start by mounting google drive
from google.colab import drive, files
drive.mount('/content/gdrive', force_remount=True)

Mounted at /content/gdrive


In [None]:
%%time
# fastai v2 and its dependencies need to be installed first
!pip install -q git+https://github.com/fastai/fastai
!pip install -q git+https://github.com/fastai/fastcore
!pip install -q iterative-stratification
!pip install --upgrade tables

     |████████████████████████████████| 61kB 3.3MB/s 
     |████████████████████████████████| 12.8MB 251kB/s 
     |████████████████████████████████| 776.8MB 22kB/s 
  Building wheel for fastai (setup.py) ... done
ERROR: torchtext 0.9.0 has requirement torch==1.8.0, but you'll have torch 1.7.1 which is incompatible.
  Building wheel for fastcore (setup.py) ... done
Collecting tables
  Downloading https://files.pythonhosted.org/packages/0f/cb/4097be890a773af95343389faa8c283b0d9ff606f144227a548461dcbdd5/tables-3.6.1-cp37-cp37m-manylinux1_x86_64.whl (4.3MB)
     |████████████████████████████████| 4.3MB 5.6MB/s 
Installing collected packages: tables
  Found existing installation: tables 3.4.4
    Uninstalling tables-3.4.4:
      Successfully uninstalled tables-3.4.4
Successfully installed tables-3.6.1
CPU times: user 1.43 s, sys: 391 ms, total: 1.82 s
Wall time: 3min 34s


# 2. Initialization

In [None]:
cd /content/gdrive/MyDrive/fastai

/content/gdrive/MyDrive/fastai


In [None]:
# from fastai2.text.all import *
# from nlputils_fastai2 import * 

from fastai.text.all import *
from nlputilsfastai import * 

%reload_ext autoreload
%autoreload 2
%matplotlib inline

In [None]:
gpu = 0
torch.cuda.set_device(gpu)
print(f'cuda device: {torch.cuda.current_device()}')
print(f'cuda device name: {torch.cuda.get_device_name(gpu)}')

cuda device: 0
cuda device name: Tesla K80


In [None]:
!nvidia-smi

Sun Mar 21 13:23:20 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.56       Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   44C    P8    31W / 149W |      3MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

Load the standard snippet to prevent random disconnects.
This cell runs JS code to automatically reconnect to the runtime.

In [None]:
import IPython
from google.colab import output

display(IPython.display.Javascript('''
 function ClickConnect(){
   btn = document.querySelector("colab-connect-button")
   if (btn != null){
     console.log("Click colab-connect-button"); 
     btn.click() 
     }
   
   btn = document.getElementById('ok')
   if (btn != null){
     console.log("Click reconnect"); 
     btn.click() 
     }
  }
  
setInterval(ClickConnect,60000)
'''))

print("Done.")

<IPython.core.display.Javascript object>

Done.


In [None]:
# Get config of fastai2 paths
config = Config()
config.d

{'archive_path': '/root/.fastai/archive',
 'data_path': '/root/.fastai/data',
 'model_path': '/root/.fastai/models',
 'storage_path': '/tmp',
 'version': 2}

This will create a `{lang}wiki` folder, containing a `{lang}wiki` text file with the wikipedia contents (for other languages, replace `{lang}` with the appropriate code from the [list of wikipedias](https://meta.wikimedia.org/wiki/List_of_Wikipedias)).

In [None]:
# setup new path_data and create the corresponding folder
lang = 'pl'
name = f'{lang}wiki'
data_path = config['data_path']
path_data = data_path/name
path_data.mkdir(exist_ok=True, parents=True)

In [None]:
cd /content/gdrive/MyDrive/fastai

/content/gdrive/MyDrive/fastai


In [None]:
data_path, path_data

(Path('/root/.fastai/data'), Path('/root/.fastai/data/plwiki'))

# 3. Loading the previously prepared scraped wiki file (~1 GB) for a particular language
Another notebook was used for that purpose: [wiki download](https://github.com/len-sla/other/blob/main/wiki_download.ipynb)

In [None]:
!cp /content/gdrive/MyDrive/fastai/all_texts_plwiki.csv  /root/.fastai/data/plwiki
!cp /content/gdrive/MyDrive/fastai/all_texts_plwiki.txt  /root/.fastai/data/plwiki

In [None]:
!du -hs {'/content/gdrive/MyDrive/fastai/all_texts_plwiki.csv'}

1.1G	/content/gdrive/MyDrive/fastai/all_texts_plwiki.csv


In [None]:
df = pd.read_csv('/content/gdrive/MyDrive/fastai/all_texts_plwiki.csv')
df.head()

Unnamed: 0,text
0,"Henry Wager Halleck (ur. 16 stycznia 1815, zm. 9 stycznia 1872) – amerykański wojskowy, naukowiec i prawnik, oficer United States Army.\n\n, znany pod – obraźliwym później – przydomkiem „Old Brains”, brał czynny udział w dziele przyłączenia Kalifornii jako stanu. Z powodzeniem praktykował jako prawnik i deweloper. Na początku wojny secesyjnej, był naczelnym dowódcą Armii Unii na zachodnim teatrze działań, a jednocześnie – przez prawie dwa lata – głównodowodzącym wszystkich armii USA. „Awansował” na szefa sztabu armii, gdy generał-porucznik Ulysses Grant, były podkomendny Hallecka na zachod..."
1,"Kościół Najświętszej Marii Panny (""in summo"") w Poznaniu – zabytkowy gotycki kościół na Ostrowie Tumskim wraz z resztkami wczesnopiastowskiego palatium.\n\nW dzisiejszym kształcie powstał w połowie XV wieku, jednak jego historia rozpoczyna się około 965 roku, gdy po przybyciu Dobrawy wzniesiono na Ostrowie Tumskim kaplicę zamkową. W dokumentach kościół Najświętszej Marii Panny pod swoim dzisiejszym wezwaniem pojawia się po raz pierwszy w 1247. \n\nWedług najnowszych badań prawdopodobnie pod prezbiterium znajdują się fundamenty rotundy pełniącej funkcję kaplicy, pewnym jest natomiast istnie..."
2,"Gieorgij Andriejewicz Mołczanow (ros. Георгий Андреевич Молчанов, ur. 3 kwietnia 1897 w Charkowie, zm. 9 października 1937 w miejscu egzekucji Kommunarka) – funkcjonariusz radzieckiej policji politycznej, komisarz bezpieczeństwa państwowego II rangi, ludowy komisarz spraw wewnętrznych Białoruskiej SRR (1936-1937).\n\nUrodzony w rodzinie rosyjskiej. Do 1917 uczył się w szkole handlowej w Charkowie, od listopada 1917 do czerwca 1918 był żołnierzem i członkiem sztabu Głównodowodzącego Wojsk Południa Rosji Antonowa-Owsiejenki, później pracował w sztabie Frontu Wschodniego. \n\nOd grudnia 1917 ..."
3,"José Manuel Durão Barroso (wym. []; ur. 23 marca 1956 w Lizbonie) – portugalski polityk, prawnik i nauczyciel akademicki. W latach 1992–1995 minister spraw zagranicznych w rządzie Aníbal Cavaco Silvy, od 1999 do 2004 przewodniczący Partii Socjaldemokratycznej. Premier Portugalii od 6 kwietnia 2002 do 17 lipca 2004. Od 22 listopada 2004 do 31 października 2014 przewodniczący Komisji Europejskiej.\n\nUkończył prawo na Uniwersytecie Lizbońskim, a także studia europejskie na Uniwersytecie Genewskim, na którym uzyskał również magisterium w zakresie nauk politycznych. Pracował jako nauczyciel ak..."
4,"Laodika I (gr. ""Λαοδίκη"", ""Laodíkē"") (zm. po 242 p.n.e.) – córka Achajosa Starszego z dynastii Seleucydów, brata Antiocha I Sotera, pierwsza żona brata stryjecznego Antiocha II Theosa, króla państwa Seleucydów, syna Antiocha I Sotera.\n\nW czasie II wojny syryjskiej (258-248 p.n.e.) jej mąż Antioch II Theos, jako sprzymierzeniec Macedonii walczył przeciwko Egiptowi. W wyniku tej wojny Antioch II zawarł porozumienie z królem Egiptu Ptolemeuszem II Filadelfem w r. 250 p.n.e. Miał się wyprzeć żony Laodiki I i wspólnych z nią dzieci, a poślubić jego córkę Berenikę oraz zdeklarować się uczynić ..."


# 4. Copying the ready Polish tokenizer

In [None]:
%%time
!pip install transformers
!pip freeze | grep transformers

Collecting transformers
  Downloading https://files.pythonhosted.org/packages/ed/d5/f4157a376b8a79489a76ce6cfe147f4f3be1e029b7144fa7b8432e8acb26/transformers-4.4.2-py3-none-any.whl (2.0MB)
     |████████████████████████████████| 2.0MB 4.2MB/s 
Collecting sacremoses
  Downloading https://files.pythonhosted.org/packages/7d/34/09d19aff26edcc8eb2a01bed8e98f13a1537005d31e95233fd48216eed10/sacremoses-0.0.43.tar.gz (883kB)
     |████████████████████████████████| 890kB 38.8MB/s 
Collecting tokenizers<0.11,>=0.10.1
  Downloading https://files.pythonhosted.org/packages/71/23/2ddc317b2121117bf34dd00f5b0de194158f2a44ee2bf5e47c7166878a97/tokenizers-0.10.1-cp37-cp37m-manylinux2010_x86_64.whl (3.2MB)
     |████████████████████████████████| 3.2MB 39.6MB/s 
Building wheels for collected packages: sacremoses
  Building wheel for sacremoses (setup.py) ... done
  Created wheel for sacremoses: filename=sacremoses-0.0.43-cp37-none-any.whl size=893262 sha256=f5a5b523d9f

In [None]:
%%time
from transformers import GPT2TokenizerFast

pretrained_weights = 'gpt2'
tokenizer_en = GPT2TokenizerFast.from_pretrained(pretrained_weights)

CPU times: user 665 ms, sys: 123 ms, total: 788 ms
Wall time: 9.07 s


In [None]:
# To correct the warning about token_pad (GPT2TokenizerFast), run the following code
# source: https://github.com/huggingface/transformers/issues/2648#issuecomment-616177044
tokenizer_en.pad_token = tokenizer_en.eos_token

In [None]:
# source: https://huggingface.co/transformers/_modules/transformers/tokenization_utils_fast.html

print('---------- vocab ----------')
print()

print('vocab_files_names:',tokenizer_en.vocab_files_names)
print()

for k,v in tokenizer_en.pretrained_vocab_files_map.items():
    print(k)
    for kk,vv in v.items():
        print('- ',kk,':',vv)
    print()
    
print('vocab_size:',tokenizer_en.vocab_size)
print()
#print(tokenizer_en.get_vocab())

num = 50
print(f'First {num} items of the vocab: {dict(itertools.islice(tokenizer_en.get_vocab().items(), num))}')

---------- vocab ----------

vocab_files_names: {'vocab_file': 'vocab.json', 'merges_file': 'merges.txt', 'tokenizer_file': 'tokenizer.json'}

vocab_file
-  gpt2 : https://huggingface.co/gpt2/resolve/main/vocab.json
-  gpt2-medium : https://huggingface.co/gpt2-medium/resolve/main/vocab.json
-  gpt2-large : https://huggingface.co/gpt2-large/resolve/main/vocab.json
-  gpt2-xl : https://huggingface.co/gpt2-xl/resolve/main/vocab.json
-  distilgpt2 : https://huggingface.co/distilgpt2/resolve/main/vocab.json

merges_file
-  gpt2 : https://huggingface.co/gpt2/resolve/main/merges.txt
-  gpt2-medium : https://huggingface.co/gpt2-medium/resolve/main/merges.txt
-  gpt2-large : https://huggingface.co/gpt2-large/resolve/main/merges.txt
-  gpt2-xl : https://huggingface.co/gpt2-xl/resolve/main/merges.txt
-  distilgpt2 : https://huggingface.co/distilgpt2/resolve/main/merges.txt

tokenizer_file
-  gpt2 : https://huggingface.co/gpt2/resolve/main/tokenizer.json
-  gpt2-medium : https://huggingface.co/gpt

In [None]:
!pip install tokenizers
!pip freeze | grep tokenizers

tokenizers==0.10.1


In [None]:
# creating  directory for tokenizer
ByteLevelBPE_tokenizer_pl_rep = 'ByteLevelBPE_tokenizer_pl'
path_to_ByteLevelBPE_tokenizer_pl_rep = path_data/ByteLevelBPE_tokenizer_pl_rep
if not (path_to_ByteLevelBPE_tokenizer_pl_rep).exists():
    path_to_ByteLevelBPE_tokenizer_pl_rep.mkdir(exist_ok=True, parents=True)
# ByteLevelBPE_tokenizer_pl.save_model(str(path_to_ByteLevelBPE_tokenizer_pl_rep))

In [None]:
ls /root/.fastai/data/plwiki -all

total 2147980
drwxr-xr-x 3 root root       4096 Mar 21 13:25 ./
drwxr-xr-x 3 root root       4096 Mar 21 13:23 ../
-rw------- 1 root root 1101183658 Mar 21 13:23 all_texts_plwiki.csv
-rw------- 1 root root 1098323868 Mar 21 13:24 all_texts_plwiki.txt
drwxr-xr-x 2 root root       4096 Mar 21 13:25 ByteLevelBPE_tokenizer_pl/


In [None]:
# copy the previously created Polish tokenizer files (saves the ~30 min needed to prepare them)
!cp  /content/gdrive/MyDrive/fastai/vocab.json  /root/.fastai/data/plwiki/ByteLevelBPE_tokenizer_pl
!cp  /content/gdrive/MyDrive/fastai/merges.txt  /root/.fastai/data/plwiki/ByteLevelBPE_tokenizer_pl

In [None]:
from tokenizers.implementations import ByteLevelBPETokenizer
ByteLevelBPE_tokenizer_pl = ByteLevelBPETokenizer(
    "/root/.fastai/data/plwiki/ByteLevelBPE_tokenizer_pl/vocab.json",
    "/root/.fastai/data/plwiki/ByteLevelBPE_tokenizer_pl/merges.txt",
)

Testing if it is working

In [None]:
# Get vocab as a list
ByteLevelBPE_tokenizer_pl_vocab = ByteLevelBPE_tokenizer_pl.get_vocab() 
ByteLevelBPE_tokenizer_pl_vocab_ls = [k for k, v in sorted(ByteLevelBPE_tokenizer_pl_vocab.items(), key=lambda item: item[1])]
len(ByteLevelBPE_tokenizer_pl_vocab_ls),ByteLevelBPE_tokenizer_pl_vocab_ls[:5]

(50257, ['<|endoftext|>', '!', '"', '#', '$'])

In [None]:
text = "Taki mały tekst dla sprawdzenia ."
output = ByteLevelBPE_tokenizer_pl.encode(text)
print('\n splitting by tokens\n ')
print(output.ids,)
print(output.tokens)
print(output.offsets)

back_to_text = ByteLevelBPE_tokenizer_pl.decode(ByteLevelBPE_tokenizer_pl.encode(text).ids)

print('\ninput text:', text)
print('tokens ids:', output.ids)
print('back to text:', back_to_text)


 splitting by tokens
 
[5565, 335, 10120, 7591, 624, 1877, 1054, 4461]
['Ta', 'ki', 'ĠmaÅĤy', 'Ġtekst', 'Ġdla', 'Ġspraw', 'dzenia', 'Ġ.']
[(0, 2), (2, 4), (4, 9), (9, 15), (15, 19), (19, 25), (25, 31), (31, 33)]

input text: Taki mały tekst dla sprawdzenia .
tokens ids: [5565, 335, 10120, 7591, 624, 1877, 1054, 4461]
back to text: Taki mały tekst dla sprawdzenia .
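
The tokens printed above look garbled (`ĠmaÅĤy` for ` mały`) because a byte-level BPE tokenizer works on UTF-8 bytes and shows every byte as one printable character. The sketch below rebuilds that byte-to-character table the way the original GPT-2 code defines it; the helper names `to_byte_level` and `from_byte_level` are made up for this illustration and are not part of the Tokenizers library.

```python
def bytes_to_unicode():
    """Map every byte (0-255) to a printable unicode character, GPT-2 style."""
    # Bytes that are already printable keep their own character.
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    # The remaining bytes (space, control characters, ...) are shifted above 255.
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))


BYTE_TO_CHAR = bytes_to_unicode()
CHAR_TO_BYTE = {c: b for b, c in BYTE_TO_CHAR.items()}


def to_byte_level(text):
    """Hypothetical helper: show `text` the way a byte-level BPE vocab stores it."""
    return "".join(BYTE_TO_CHAR[b] for b in text.encode("utf-8"))


def from_byte_level(token):
    """Hypothetical helper: inverse of `to_byte_level`."""
    return bytes(CHAR_TO_BYTE[c] for c in token).decode("utf-8")


shown = to_byte_level(" mały")
print(shown)                   # the space and the two bytes of "ł" are remapped
print(from_byte_level(shown))  # round trip back to the original text
```

This is why decoding the ids gives back the exact input text even though the individual tokens are hard to read.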


<!-- so this is where I am at the moment -->

# 5. Create a fastai tokenizer and update the embeddings matrix of the GPT-2 English pre-trained model

Now let's see how we can use fastai v2 to fine-tune this model on Wikipedia in Polish, using all the fastai v2 training utilities.

We will follow these 2 steps:

- 4.1) **GPT2TokenizerFast (imported GPT-2 tokenizer) --> fastai Tokenizer**: to process the data to train a model, we need to build a fastai tokenizer from the GPT-2 tokenizer with a vocab in Polish.
- 4.2) **Change vocab embeddings (wte matrix) in the GPT-2 pre-trained model to adapt to the Polish vocab**: as the vocab embedding matrix (wte) of the pre-trained GPT-2 model corresponds to the English vocabulary, we'll keep the embedding vectors of the tokens common to the English and Polish vocabs.
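
The second step can be pictured with a toy example. The sketch below uses plain Python lists instead of the real `wte` tensor; the tiny vocabularies, the vectors and the helper name `transfer_embeddings` are all made up. Starting unseen tokens from the mean of the old vectors is one reasonable choice assumed here, not necessarily what the code of this notebook does.

```python
def transfer_embeddings(old_vocab, old_vectors, new_vocab):
    """Build embedding rows for `new_vocab`, reusing rows of tokens also found in `old_vocab`.

    old_vocab / new_vocab: dict mapping token -> row index
    old_vectors: list of rows (lists of floats), indexed like old_vocab
    """
    dim = len(old_vectors[0])
    # Assumption: tokens missing from the old vocab start from the mean old vector.
    mean = [sum(row[d] for row in old_vectors) / len(old_vectors) for d in range(dim)]
    new_vectors = [list(mean) for _ in range(len(new_vocab))]
    common = []
    for token, new_idx in new_vocab.items():
        old_idx = old_vocab.get(token)
        if old_idx is not None:
            # Common token: keep the pre-trained vector, at its new row index.
            new_vectors[new_idx] = list(old_vectors[old_idx])
            common.append(token)
    return new_vectors, common


# Toy data (hypothetical): 3 "old" tokens with 2-dimensional embeddings.
old_vocab = {"<|endoftext|>": 0, "Ġthe": 1, "Ġdla": 2}
old_vectors = [[0.0, 0.0], [1.0, 3.0], [2.0, 6.0]]
new_vocab = {"<|endoftext|>": 0, "Ġdla": 1, "Ġtekst": 2}

new_vectors, common = transfer_embeddings(old_vocab, old_vectors, new_vocab)
print(common)
print(new_vectors)
```

Note that a common token usually sits at a different row in the two vocabularies, so the copy has to go through the token string, not through the index.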

 First, we import all the text utilities:

In [None]:
from fastai.text.all import *

#### 4.1 GPT2TokenizerFast (imported GPT-2 tokenizer) --> fastai Tokenizer

*(text from Sylvain Gugger Transformers Tutorial)* To process this data to train a model, we need to build a `Transform` that will be applied lazily. In a fastai `Transform` you can define:
- an `encodes` method that is applied when you call the transform (a bit like the `forward` method in a `nn.Module`)
- a `decodes` method that is applied when you call the [decode](https://huggingface.co/transformers/main_classes/tokenizer.html#transformers.PreTrainedTokenizer.decode) method of the transform, if you need to decode anything for showing purposes (like converting ids to a text here)
- a `setups` method that sets some inner state of the `Transform` (not needed here)

In [None]:
class TransformersTokenizer(Transform):
    def __init__(self, tokenizer): self.tokenizer = tokenizer
    def encodes(self, x): 
        toks = self.tokenizer.tokenize(x)
        return tensor(self.tokenizer.convert_tokens_to_ids(toks))
    def decodes(self, x): return TitledStr(self.tokenizer.decode(x.cpu().numpy()))

Two comments on the code above:
- in `encodes` we don't use the [tokenizer.encode](https://huggingface.co/transformers/main_classes/tokenizer.html#transformers.PreTrainedTokenizer.encode) method since it does some additional preprocessing for the model after tokenizing and numericalizing (the part throwing a warning before). Here we don't need any post-processing so it's fine to skip it and we use the [tokenizer.tokenize](https://huggingface.co/transformers/main_classes/tokenizer.html#transformers.PreTrainedTokenizer.tokenize) method followed by the [tokenizer.convert_tokens_to_ids](https://huggingface.co/transformers/main_classes/tokenizer.html#transformers.PreTrainedTokenizer.convert_tokens_to_ids) one.
- in `decodes` we return a `TitledStr` object and not just a plain string. That's a fastai class that adds a `show` method to the string, which will allow us to use all the fastai show methods.
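
To make the `encodes`/`decodes` contract concrete without installing anything, here is a plain-Python stand-in. It is not a fastai `Transform`: the class name `ToyTokenizerTransform`, the whitespace tokenizer and the small vocab are made up, and only the two-step shape (tokenize, then numericalize, and back) mirrors the class above.

```python
class ToyTokenizerTransform:
    """Plain-Python illustration of an encodes/decodes pair (not fastai code)."""

    def __init__(self, vocab):
        self.vocab = vocab                               # token -> id
        self.itos = {i: t for t, i in vocab.items()}     # id -> token

    def encodes(self, text):
        toks = text.split()                              # "tokenize"
        return [self.vocab[t] for t in toks]             # "convert tokens to ids"

    def decodes(self, ids):
        return " ".join(self.itos[i] for i in ids)       # ids back to text, for display

    # Calling the object encodes; `decode` is the display direction.
    def __call__(self, text):
        return self.encodes(text)

    def decode(self, ids):
        return self.decodes(ids)


tfm = ToyTokenizerTransform({"Taki": 0, "mały": 1, "tekst": 2})
ids = tfm("Taki mały tekst")
print(ids)
print(tfm.decode(ids))
```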

##### Tokenizers

ENGLISH

In [None]:
%%time
# Load the GPT2 tokenizer in English
from transformers import GPT2TokenizerFast, GPT2LMHeadModel
pretrained_weights = 'gpt2'
tokenizer_en = GPT2TokenizerFast.from_pretrained(pretrained_weights)
model_en = GPT2LMHeadModel.from_pretrained(pretrained_weights)

# To correct the warning about token_pad (GPT2TokenizerFast), run the following code
# source: https://github.com/huggingface/transformers/issues/2648#issuecomment-616177044
tokenizer_en.pad_token = tokenizer_en.eos_token

CPU times: user 18.4 s, sys: 2.41 s, total: 20.8 s
Wall time: 32.7 s


POLISH

In [None]:
# Get the path to the ByteLevelBPE_tokenizer_pl config files
ByteLevelBPE_tokenizer_pl_rep = 'ByteLevelBPE_tokenizer_pl'
path_to_ByteLevelBPE_tokenizer_pl_rep = path_data/ByteLevelBPE_tokenizer_pl_rep

# import the pre-trained GPT2TokenizerFast tokenizer with the tokenizer_pl config files
tokenizer_pl = GPT2TokenizerFast.from_pretrained(
    str(path_to_ByteLevelBPE_tokenizer_pl_rep), 
    pad_token='<|endoftext|>')

# Set the maximum sequence length to 1024
tokenizer_pl.model_max_length = 1024

In [None]:
tokenizer_pl.model_max_length = 1024

##### Sample (this allows us to quickly test our code)

- train: 80%
- val = 20%

In [None]:
df_sample = df[:1000]

num = int(0.8*len(df_sample))

# shuffle the indices without replacement so that train and val do not overlap
idxs = np.random.permutation(len(df_sample))
idxs_train = idxs[:num]
idxs_val = idxs[num:]
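
As a standalone illustration of the 80/20 split, here is a standard-library sketch; the function name `split_indices` and the fixed seed are assumptions made for the example. Shuffling the full range of indices (a permutation) guarantees that every row lands in exactly one of the two sets, which drawing indices with replacement would not.

```python
import random


def split_indices(n_items, train_pct=0.8, seed=42):
    """Return disjoint (train, val) index lists covering range(n_items)."""
    idxs = list(range(n_items))
    random.Random(seed).shuffle(idxs)   # permutation: no index is repeated
    num = int(train_pct * n_items)
    return idxs[:num], idxs[num:]


sample_train, sample_val = split_indices(1000)
print(len(sample_train), len(sample_val))
print(set(sample_train).isdisjoint(sample_val))
```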

We gather all texts in one numpy array (since it will be easier to use this way with fastai):

In [None]:
%%time
all_texts = np.concatenate([df_sample.iloc[idxs_train].text.values, df_sample.iloc[idxs_val].text.values])

CPU times: user 2.63 ms, sys: 0 ns, total: 2.63 ms
Wall time: 5.8 ms


In [None]:
%%time
splits = [list(idxs_train), list(idxs_val)]
tls = TfmdLists(all_texts, TransformersTokenizer(tokenizer_pl), splits=splits, dl_type=LMDataLoader)

CPU times: user 8.48 ms, sys: 858 µs, total: 9.34 ms
Wall time: 10.4 ms


We specify `dl_type=LMDataLoader` for when we will convert this `TfmdLists` to `DataLoaders`: we will use an `LMDataLoader` since we have a language modeling problem, not the usual fastai `TfmdDL`.
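
What makes a language-modeling data loader different can be sketched in a few lines: the tokenized texts are concatenated into one stream, the stream is cut into windows of `seq_len` tokens, and the target of each window is the same window shifted by one token. The function `lm_batches` below is a simplified illustration written for this tutorial; the real `LMDataLoader` does more (batching, for example), none of which is reproduced here.

```python
def lm_batches(token_streams, seq_len):
    """Concatenate tokenized texts and cut them into (input, target) pairs."""
    stream = [tok for toks in token_streams for tok in toks]
    pairs = []
    for start in range(0, len(stream) - 1, seq_len):
        x = stream[start:start + seq_len]
        y = stream[start + 1:start + seq_len + 1]   # input shifted by one token
        if len(x) == seq_len and len(y) == seq_len:  # drop the incomplete tail
            pairs.append((x, y))
    return pairs


# Three toy "texts" of different lengths, already converted to ids.
toy_texts = [[1, 2, 3, 4, 5], [6, 7, 8], [9, 10, 11, 12]]
toy_pairs = lm_batches(toy_texts, seq_len=4)
for x, y in toy_pairs:
    print(x, "->", y)
```

Because the windows are cut from the concatenated stream, the length of an individual text does not have to match the sequence length.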

##### All data

- train: 80%
- val = 20%

In [None]:
num = int(0.8*len(df))

# shuffle the indices without replacement so that train and val do not overlap
idxs = np.random.permutation(len(df))
idxs_train = idxs[:num]
idxs_val = idxs[num:]

# save idxs train and valid
torch.save(idxs_train, path_data/'idxs_train.pl')
torch.save(idxs_val, path_data/'idxs_val.pl')

SAVING

In [None]:
!cp /root/.fastai/data/plwiki/idxs_train.pl  /content/gdrive/MyDrive/fastai
!cp /root/.fastai/data/plwiki/idxs_val.pl  /content/gdrive/MyDrive/fastai

LOADING

In [None]:
!cp /content/gdrive/MyDrive/fastai/idxs_train.pl  /root/.fastai/data/plwiki
!cp /content/gdrive/MyDrive/fastai/idxs_val.pl  /root/.fastai/data/plwiki

In [None]:
# load idxs train and valid
idxs_train = torch.load(path_data/'idxs_train.pl')
idxs_val = torch.load(path_data/'idxs_val.pl')

We gather all texts in one numpy array (since it will be easier to use this way with fastai):

In [None]:
%%time
all_texts = np.concatenate([df.iloc[idxs_train].text.values, df.iloc[idxs_val].text.values])

CPU times: user 28.3 ms, sys: 44.6 ms, total: 72.9 ms
Wall time: 72.9 ms


In [None]:
%%time
splits = [list(idxs_train), list(idxs_val)]
tls = TfmdLists(all_texts, TransformersTokenizer(tokenizer_pl), splits=splits, dl_type=LMDataLoader)

Token indices sequence length is longer than the specified maximum sequence length for this model (2174 > 1024). Running this sequence through the model will result in indexing errors


CPU times: user 91.2 ms, sys: 9.93 ms, total: 101 ms
Wall time: 101 ms


We specify `dl_type=LMDataLoader` for when we will convert this `TfmdLists` to `DataLoaders`: we will use an `LMDataLoader` since we have a language modeling problem, not the usual fastai `TfmdDL`.

##### Check datasets

In a `TfmdLists` you can access the elements of the training or validation set quite easily:

In [None]:
tls.train[0],tls.valid[0]

(tensor([39020,   685,  2526,  ...,   859,  9016,    12]),
 tensor([   28, 19903,    30, 15583, 19903,    30,   199, 18704,  2944,   562,
           441, 15587, 11590,  2446,  7100, 25190,  4910, 24809,  5189, 18436,
            14,  1978,   524,   830, 11590,  2446,  4910, 24809,   365,  8171,
          5142,   389,  3921,  2601,   409,   604,     2,   343,  8500,   497,
         18135,   260,  1465,  7060,   332,  2197,  1279,  1878,    14,  5272,
         32828,   315,  5130,  4612,   332,  1441,  1279,  1878,    14, 11276,
          1943, 32958,  8645,   389,    39,   789,  1312, 17592,   713,  9446,
          3372,   289,   357,  1900,   522,    14,    51,    14, 19489,   538,
            14, 19489,  6862,   389, 18704,  2944,   562,     2,   311, 15838,
          9584,  1878,   388,    14, 44448,  2208,   734,  6702,   902,  8717,
         42250,   332,  5631,   260,  1878,    14,  9291, 11299,   263,  5418,
          1103,   389,    48,  3217, 15482,  1549,     2,   474, 25027, 

They are not the same. We can see that their shapes are different:

In [None]:
tls.tfms(tls.train.items[0]).shape, tls.tfms(tls.valid.items[0]).shape

(torch.Size([468]), torch.Size([402]))

And we can have a look at both decodes using `show_at`:

In [None]:
show_at(tls.train, 0)

In [None]:
# cp -R /content/gdrive/'My Drive'/fastai/data/plwiki/ /root/.fastai/data/   # restore the data from Drive

#### 5.2 fastai v2 Dataloaders

*(text from Sylvain Gugger Transformers Tutorial)* The fastai v2 library expects the data to be assembled in a `DataLoaders` object (something that has a training and validation dataloader). We can get one by using the `dataloaders` method. We just have to specify a batch size and a sequence length. 

Since the GPT-2 model was trained with sequences of size 1024, we use this sequence length (it's a stateless model, so it will change the perplexity if we use less).

In [None]:
# %%time
# bs,sl = 6,1024
# dls = tls.dataloaders(bs=bs, seq_len=sl)

Token indices sequence length is longer than the specified maximum sequence length for this model (4097 > 1024). Running this sequence through the model will result in indexing errors


CPU times: user 13min 39s, sys: 1min 35s, total: 15min 14s
Wall time: 15min 15s


To avoid problems like the one above and running out of GPU RAM, the batch size needs to be decreased (here from 6 to 2):

In [None]:
%%time
bs,sl = 2,1024
dls = tls.dataloaders(bs=bs, seq_len=sl)

CPU times: user 8min 5s, sys: 9.01 s, total: 8min 14s
Wall time: 8min 20s


It went well with 2 x 1024.

# 6.2 Learner

*(text from Sylvain Gugger Transformers Tutorial)* Now, we are ready to create our `Learner`, which is a fastai object grouping data, model and loss function and handles model training or inference. Since we are in a language model setting, we pass accuracy and perplexity as metrics, and we need to use the callback we just defined. Lastly, we use mixed precision to save every bit of memory we can (and if you have a modern GPU, it will also make training faster).

In [None]:
# Learner: basic class for handling the training loop
# source: https://dev.fast.ai/learner#Learner
# note: `splitter` and `DropOutput` must already be defined at this point
learn = Learner(dls, model_en, loss_func=CrossEntropyLossFlat(),
                splitter = splitter,
                cbs=[DropOutput], 
                metrics=[accuracy, Perplexity()]).to_fp16()

In [None]:
# Check the number of parameters groups and the hyperparameters values
learn.create_opt()
print(f'number of parameters groups: {len(learn.opt.param_groups)}')

# ... and the list of learning rates (before they are updated by the optimizer in fit_one_cycle())
for i,h in enumerate(learn.opt.hypers):
    print(i,h)

number of parameters groups: 4
0 {'wd': 0.01, 'sqr_mom': 0.99, 'lr': 0.001, 'mom': 0.9, 'eps': 1e-05}
1 {'wd': 0.01, 'sqr_mom': 0.99, 'lr': 0.001, 'mom': 0.9, 'eps': 1e-05}
2 {'wd': 0.01, 'sqr_mom': 0.99, 'lr': 0.001, 'mom': 0.9, 'eps': 1e-05}
3 {'wd': 0.01, 'sqr_mom': 0.99, 'lr': 0.001, 'mom': 0.9, 'eps': 1e-05}


- Loss = 9.95
- accuracy = 0.099
- perplexity = 20950.94

In [None]:
%%time
# loss, accuracy, Perplexity() of validation dataset
learn.validate()

CPU times: user 1h 22min 15s, sys: 53min 26s, total: 2h 15min 42s
Wall time: 2h 15min 37s


(#3) [9.495806694030762,0.07362030446529388,13303.822265625]

For the 1 GB file, the results are:

In [None]:
# %%time
# # loss, accuracy, Perplexity() of validation dataset
# learn.validate()

CPU times: user 2h 28min 1s, sys: 1h 53min 5s, total: 4h 21min 7s
Wall time: 4h 20min 59s


(#3) [9.487800598144531,0.0741734430193901,13197.736328125]
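
The third number returned by `learn.validate()` is tied to the first one: perplexity is the exponential of the average cross-entropy loss. A quick check with the two results printed above (the dictionary labels are only descriptive names chosen for this example):

```python
import math

# (loss, reported perplexity) copied from the two validate() outputs above
results = {
    "first result": (9.495806694030762, 13303.822265625),
    "1 GB file": (9.487800598144531, 13197.736328125),
}

for name, (loss, reported_ppl) in results.items():
    ppl = math.exp(loss)
    print(f"{name}: exp({loss:.4f}) = {ppl:.1f} (reported: {reported_ppl:.1f})")
```

The small differences come from floating-point precision. Before any fine-tuning, the English model is as uncertain on this text as if it were choosing uniformly among roughly 13,000 tokens at each step.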

Now that we have a `Learner`, we will use during training all the **fine-tuning techniques** seen for classification model training (see the notebook [10_nlp.ipynb](https://github.com/fastai/fastbook/blob/master/10_nlp.ipynb) about "NLP Deep Dive: RNNs") to take advantage of the **Transfer Learning** of the GPT-2 pre-trained embeddings and model from Hugging Face Transformers:
- **learning rate finder** (method that helps finding the best learning rate to train the model)
- **Mixed precision training** (some of the operations will be done in FP16, others in FP32 in order to speed up the training)
- **gradual unfreezing** (the model has 4 layers groups created by our method `splitter` : the embedding one and the 3 groups of 4 decoder blocks each)
- **1cycle policy** with the method [fit_one_cycle()](https://dev.fast.ai/callback.schedule#Learner.fit_one_cycle) (The 1cycle policy was introduced by Leslie N. Smith et al. in [Super-Convergence: Very Fast Training of Neural Networks Using Large Learning Rates](https://arxiv.org/abs/1708.07120). It schedules the learning rate with a cosine annealing from `lr_max/div` to `lr_max` then `lr_max/div_final` (pass an array to `lr_max` if you want to use differential learning rates) and the momentum with cosine annealing according to the values in `moms`. The first phase takes `pct_start` of the training. You can optionally pass additional `cbs` and `reset_opt`.)
- **differential learning rates** (each layer group has a different learning rate: the biggest one for the embeddings group, and the smallest one for the first 4 decoder blocks)
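
The learning-rate part of the 1cycle policy described in the list above can be sketched in pure Python. This is an illustration, not fastai's implementation: the function names are made up, and the defaults `div=25`, `div_final=1e5` and `pct_start=0.25` are assumed to match the ones used by `fit_one_cycle()`.

```python
import math


def cos_anneal(start, end, pos):
    """Cosine interpolation from `start` (pos=0) to `end` (pos=1)."""
    return end + (start - end) / 2 * (math.cos(math.pi * pos) + 1)


def one_cycle_lr(pct, lr_max, div=25.0, div_final=1e5, pct_start=0.25):
    """Learning rate at training progress `pct` in [0, 1]."""
    if pct < pct_start:
        # warm-up phase: lr_max/div -> lr_max
        return cos_anneal(lr_max / div, lr_max, pct / pct_start)
    # annealing phase: lr_max -> lr_max/div_final
    return cos_anneal(lr_max, lr_max / div_final, (pct - pct_start) / (1 - pct_start))


for pct in (0.0, 0.125, 0.25, 0.625, 1.0):
    print(f"{pct:5.3f} -> {one_cycle_lr(pct, lr_max=1e-3):.2e}")
```

With differential learning rates, the same schedule shape would be applied to each layer group with its own `lr_max`.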

##### 6.2.1 Freeze all layers but the last layers group (do not freeze `wte`, `wpe` embeddings matrices and last `LayerNorm`)

In [None]:
learn.freeze()
learn.summary()

GPT2LMHeadModel (Input shape: 2)
Layer (type)         Output Shape         Param #    Trainable 
                     2 x 1024 x 768      
Embedding                                 38597376   True      
Embedding                                 786432     True      
Dropout                                                        
LayerNorm                                 1536       True      
____________________________________________________________________________
                     2 x 1024 x 2304     
Conv1D                                    1771776    False     
Conv1D                                    590592     False     
Dropout                                                        
Dropout                                                        
LayerNorm                                 1536       True      
____________________________________________________________________________
                     2 x 1024 x 3072     
Conv1D                                    23623

The `learn.summary()` method gives almost the right numbers. In fact, it counts the weights of the wte matrix (vocab embeddings) twice, because they are shared with the output linear layer.

The real numbers are:
- Total params: 163,037,184 - 38,597,376 = **124,439,808** (about 124 million)
- Total trainable params: 77,982,720 - 38,597,376 = **39,385,344** (about 39 million)
- Total non-trainable params: **85,054,464** (about 85 million)
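
The corrected counts can be re-derived from numbers that already appear in this notebook: the vocab size (50257), the embedding width (768) and the maximum sequence length (1024). The variable names below are chosen for this example only.

```python
vocab_size, n_embd, n_positions = 50257, 768, 1024

wte = vocab_size * n_embd    # vocab embeddings, counted twice by the summary
wpe = n_positions * n_embd   # position embeddings

total_reported, trainable_reported = 163_037_184, 77_982_720

total = total_reported - wte
trainable = trainable_reported - wte
non_trainable = total - trainable

print(f"wte: {wte:,}   wpe: {wpe:,}")
print(f"total: {total:,}   trainable: {trainable:,}   non-trainable: {non_trainable:,}")
```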

SAVE (first time)

In [None]:
learn.save(path_data/'GPT2_pl_before_lr_find_bs_sl_2_1024')
!cp  /root/.fastai/data/plwiki/GPT2_pl_before_lr_find_bs_sl_2_1024.pth  /content/gdrive/MyDrive/fastai

LOAD

In [None]:
!cp /content/gdrive/MyDrive/fastai/GPT2_pl_before_lr_find_bs_sl_2_1024.pth /root/.fastai/data/plwiki/
