### Introduction

This notebook analyzes the code released with https://paperswithcode.com/paper/language-models-are-unsupervised-multitask

It was inspired by https://github.com/huggingface/pytorch-pretrained-BERT
and https://colab.research.google.com/github/pytorch/pytorch.github.io/blob/master/assets/hub/huggingface_pytorch-pretrained-bert_gpt.ipynb#scrollTo=wcW-1zmUsuff

It tries to use the four models listed below for text generation rather than for language modeling.

The models used are:

1.   Google's BERT model
2.   OpenAI's GPT model
3.   Google/CMU's Transformer-XL model
4.   OpenAI's GPT-2 model




In [12]:
import shutil
!pip install regex ftfy pytorch-pretrained-bert
!git clone https://github.com/numediart/Text-Generation.git
!git clone https://github.com/huggingface/pytorch-pretrained-BERT
shutil.move("pytorch-pretrained-BERT/examples", "examples")
shutil.rmtree("pytorch-pretrained-BERT")

fatal: destination path 'Text-Generation' already exists and is not an empty directory.
fatal: destination path 'pytorch-pretrained-BERT' already exists and is not an empty directory.


Error: ignored

### .1 Using BERT Model

BERT is not built to generate free-running text. At most, it can fill in one or a few masked words in a well-formed sentence.

See http://mayhewsw.github.io/2019/01/16/can-bert-generate-text/
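
To make the fill-in-the-blank idea concrete, here is a minimal sketch that does not load BERT at all. It only shows the two steps involved: replacing a word with a mask token, then ranking candidate words by score. The function names `mask_word` and `rank_candidates` and the values in `toy_scores` are assumptions made up for illustration; a real model would produce one score per vocabulary entry.

```python
def mask_word(tokens, word, mask_token="[MASK]"):
    """Replace the first occurrence of `word` with the mask token."""
    masked = list(tokens)
    index = masked.index(word)  # raises ValueError if the word is absent
    masked[index] = mask_token
    return masked, index


def rank_candidates(scores, k=3):
    """Return the k highest-scoring candidate words, best first."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


tokens = "all my friends were coming at the party .".split()
masked, position = mask_word(tokens, "friends")

# Hypothetical scores for the masked position (not real model output).
toy_scores = {"parents": 2.1, "friends": 1.9, "kids": 1.4, "things": 0.2}

print(" ".join(masked))
print(rank_candidates(toy_scores, k=3))
```

The best-ranked word plays the role of the predicted token, and the remaining ranked words correspond to the other options a script could list.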

In [27]:
!python Text-Generation/bert.py --text "All my friends were coming at the party." --mask "friends"

Original: All my friends were coming at the party.
Masked: all my [MASK] were coming at the party .
Predicted token: ['parents']
Other options:
['friends']
['kids']
['people']
['they']
['mom']
['classes']
['girls']
['thoughts']
['things']
['own']


### .2 Using OpenAI-GPT Model

This model can generate text. The result is still far from human writing, but it works.

The seed passed with "--text" conditions the generation. Generation is deterministic: the same seed reproduces the same output. "--tokens_to_generate" expects an integer, the number of tokens to output (which is not the same as the number of words).
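
One way to get the reproducibility described above is greedy decoding, where the most likely next token is always chosen. The sketch below assumes that strategy; whether `openai.py` works exactly this way is not confirmed here. The lookup table `NEXT_TOKEN_SCORES` is hypothetical toy data standing in for a language model, and `greedy_generate` is an illustrative name.

```python
# Toy next-token table standing in for a language model (hypothetical data).
NEXT_TOKEN_SCORES = {
    "give": {"this": 0.9, "it": 0.1},
    "this": {"a": 0.7, "one": 0.3},
    "a": {"try": 0.6, "little": 0.4},
    "try": {".": 0.8, "again": 0.2},
    ".": {"give": 0.6, "this": 0.4},
}


def greedy_generate(seed_tokens, tokens_to_generate):
    """Append the highest-scoring next token, one token at a time."""
    output = list(seed_tokens)
    for _ in range(tokens_to_generate):
        candidates = NEXT_TOKEN_SCORES.get(output[-1])
        if not candidates:
            break  # no known continuation for this token
        output.append(max(candidates, key=candidates.get))
    return output


first = greedy_generate(["give"], 6)
second = greedy_generate(["give"], 6)
print(" ".join(first))
print(first == second)  # same seed, same output
```

Because nothing is sampled, the same seed always yields the same continuation. The count passed in is a number of tokens, so punctuation marks count toward it as well.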

In [17]:
!python Text-Generation/openai.py --text "Give this a little try." --tokens_to_generate 40

100% 815973/815973 [00:00<00:00, 859357.79B/s]
100% 458495/458495 [00:00<00:00, 612091.36B/s]
100% 478750579/478750579 [00:39<00:00, 12062529.25B/s]
100% 273/273 [00:00<00:00, 191063.74B/s]
give this a little try . " 
 " i 'm not sure i can . " 
 " you can . " 
 " i do n't know . " 
 " you can . " 
 " i do n't know . " 
 


### .3.1 Using OpenAI GPT-2 Model

This model can also generate text, but a few experiments showed that its outputs are far less interesting than those of the GPT model.
The outputs regularly end up looping over the same sentences. See below.
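
Looping of this kind can be spotted automatically by counting repeated word n-grams in a generated text. The following is a minimal sketch of that idea; the helper name `repeated_ngrams`, the n-gram size and the sample string are all assumptions chosen for illustration, not part of the notebook's scripts.

```python
from collections import Counter


def repeated_ngrams(text, n=3, min_count=2):
    """Return word n-grams that occur at least `min_count` times."""
    words = text.split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    counts = Counter(ngrams)
    return {gram: count for gram, count in counts.items() if count >= min_count}


# Made-up sample that loops over the same two sentences.
sample = "i do n't know . you can . i do n't know . you can ."

repeats = repeated_ngrams(sample, n=3)
print(len(repeats), "distinct 3-grams occur more than once")
```

A high number of repeated n-grams relative to the text length suggests the generation has fallen into a loop.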

In [29]:
!python Text-Generation/gpt2.py --text "Maybe this will work" --tokens_to_generate 120

Maybe this will work for you.

The first thing you need to do is to create a new file called "config.json" in your project's root directory.

In this file, you'll need to add the following line to your .bashrc :

{ "name": "config.json", "version": "1.0", "version_id": "1", "version_name": "config.json", "version_name_id": "1", "version_name_name": "config.json", "version_name_name_id": "


### .3.2 Using provided example code for GPT-2

The example script 'pytorch-pretrained-BERT/examples/run_gpt2.py' lets you enter a seed interactively. Add the "--unconditional" parameter to generate without entering a seed.

"--nsamples" sets how many samples are shown for one seed.

"--length" sets the length of each sample.

Generation is non-deterministic.
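
The non-determinism comes from sampling the next token instead of always taking the most likely one. Below is a minimal sketch of sampling with a temperature and an optional top-k cutoff, written with the standard library only. It is not the code of `run_gpt2.py`: the function `sample_next`, the values in `toy_logits`, and the convention that `top_k=0` keeps every candidate are assumptions for illustration.

```python
import math
import random


def sample_next(logits, temperature=1.0, top_k=0, rng=random):
    """Sample one token from a dict mapping token -> logit.

    top_k=0 keeps every candidate (assumed convention for this sketch).
    """
    items = sorted(logits.items(), key=lambda kv: kv[1], reverse=True)
    if top_k > 0:
        items = items[:top_k]  # keep only the k best candidates
    scaled = [value / temperature for _, value in items]
    peak = max(scaled)
    # Softmax numerators, shifted by the peak for numerical stability.
    weights = [math.exp(s - peak) for s in scaled]
    tokens = [token for token, _ in items]
    return rng.choices(tokens, weights=weights, k=1)[0]


toy_logits = {"the": 2.0, "a": 1.5, "cat": 0.5, "zebra": -1.0}  # hypothetical

rng = random.Random(0)  # a seeded generator makes one run repeatable
samples = [sample_next(toy_logits, rng=rng) for _ in range(5)]
print(samples)
print(sample_next(toy_logits, top_k=1))  # with top_k=1 only the best token remains
```

Lower temperatures concentrate probability on the best candidates, and a small top-k removes unlikely tokens entirely, so both settings trade variety for stability.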

In [19]:
!python examples/run_gpt2.py --unconditional --nsamples 2 --length 40

Namespace(batch_size=-1, length=40, model_name_or_path='gpt2', nsamples=2, seed=0, temperature=1.0, top_k=0, unconditional=True)
07/08/2019 13:07:20 - INFO - pytorch_pretrained_bert.tokenization_gpt2 -   loading vocabulary file https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json from cache at /root/.pytorch_pretrained_bert/f2808208f9bec2320371a9f5f891c184ae0b674ef866b79c58177067d15732dd.1512018be4ba4e8726e41b9145129dc30651ea4fec86aa61f4b9f40bf94eac71
07/08/2019 13:07:20 - INFO - pytorch_pretrained_bert.tokenization_gpt2 -   loading merges file https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt from cache at /root/.pytorch_pretrained_bert/d629f792e430b3c76a1291bb2766b0a047e36fae0588f9dbc1ae51decdff691b.70bec105b4158ed9a1747fea67a43f5dee97855c64d62b6ec3742f4cfdb5feda
07/08/2019 13:07:21 - INFO - pytorch_pretrained_bert.modeling_gpt2 -   loading weights file https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-pytorch_model.bin from cache at /root/.pyt

### .3.3 Adapting example code for GPT model

In [21]:
!python Text-Generation/openai_huggingface_example.py --unconditional --length 50

07/08/2019 13:08:40 - INFO - pytorch_pretrained_bert.tokenization_openai -   loading vocabulary file https://s3.amazonaws.com/models.huggingface.co/bert/openai-gpt-vocab.json from cache at /root/.pytorch_pretrained_bert/4ab93d0cd78ae80e746c27c9cd34e90b470abdabe0590c9ec742df61625ba310.b9628f6fe5519626534b82ce7ec72b22ce0ae79550325f45c604a25c0ad87fd6
07/08/2019 13:08:40 - INFO - pytorch_pretrained_bert.tokenization_openai -   loading merges file https://s3.amazonaws.com/models.huggingface.co/bert/openai-gpt-merges.txt from cache at /root/.pytorch_pretrained_bert/0f8de0dbd6a2bb6bde7d758f4c120dd6dd20b46f2bf0a47bc899c89f46532fde.20808570f9a3169212a577f819c845330da870aeb14c40f7319819fce10c3b76
07/08/2019 13:08:43 - INFO - pytorch_pretrained_bert.modeling_openai -   loading weights file https://s3.amazonaws.com/models.huggingface.co/bert/openai-gpt-pytorch_model.bin from cache at /root/.pytorch_pretrained_bert/e45ee1afb14c5d77c946e66cb0fa70073a77882097a1a2cefd51fd24b172355e.e7ee3fcd07c695a4c9f

### .4 Using Transformer-XL Model

The model was trained on WikiText-103, which is built from Wikipedia articles, so its output tends to follow that style.

The code was copied and modified from https://github.com/kimiyoung/transformer-xl/issues/49#issuecomment-472212730.


In [30]:
!python Text-Generation/transformer_xl.py --text "First world war" --tokens_to_generate 200

INFO:pytorch_pretrained_bert.tokenization_transfo_xl:loading vocabulary file https://s3.amazonaws.com/models.huggingface.co/bert/transfo-xl-wt103-vocab.bin from cache at /root/.pytorch_pretrained_bert/b24cb708726fd43cbf1a382da9ed3908263e4fb8a156f9e0a4f45b7540c69caa.a6a9c41b856e5c31c9f125dd6a7ed4b833fbcefda148b627871d4171b25cffd1
INFO:pytorch_pretrained_bert.modeling_transfo_xl:loading weights file https://s3.amazonaws.com/models.huggingface.co/bert/transfo-xl-wt103-pytorch_model.bin from cache at /root/.pytorch_pretrained_bert/12642ff7d0279757d8356bfd86a729d9697018a0c93ad042de1d0d2cc17fd57b.e9704971f27275ec067a00a67e6a5f0b05b4306b3f714a96e9f763d8fb612671
INFO:pytorch_pretrained_bert.modeling_transfo_xl:loading configuration file https://s3.amazonaws.com/models.huggingface.co/bert/transfo-xl-wt103-config.json from cache at /root/.pytorch_pretrained_bert/a6dfd6a3896b3ae4c1a3c5f26ff1f1827c26c15b679de9212a04060eaf1237df.aef76fb1064c932cd6a2a2be3f23ebbfa5f9b6e29e8e87b571c45b4a5d5d1b90
INFO:

### .5 Fine-tuning Data

If you want to customize your text generation, you can fine-tune a GPT model.

The code was taken from 'pytorch-pretrained-BERT/examples/run_openai_gpt.py' and customized to accept one-class data stored as a UTF-8 encoded txt file.

Upload a txt file with one sample per line (for example, one line = one paragraph), then edit the command line to match your file name and save directory.

Once training is done, go back to .3.3 and pass your save directory name as the model path to start generation.
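
The one-sample-per-line layout described above could be produced with a small helper like the one below. This is a sketch under assumptions: the function `write_training_file`, the file name `my_dataset.txt` and the example paragraphs are hypothetical, and the helper simply collapses line breaks inside a paragraph so that each sample stays on a single line.

```python
import os
import tempfile


def write_training_file(paragraphs, path):
    """Write one sample per line as UTF-8, skipping empty samples."""
    count = 0
    with open(path, "w", encoding="utf-8") as handle:
        for paragraph in paragraphs:
            # Collapse newlines and runs of whitespace into single spaces.
            line = " ".join(paragraph.split())
            if line:
                handle.write(line + "\n")
                count += 1
    return count


paragraphs = [
    "First paragraph,\nspread over two lines.",
    "   ",  # empty sample, skipped
    "Second paragraph with an accent: café.",
]

# Hypothetical file name inside a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "my_dataset.txt")
written = write_training_file(paragraphs, path)
print(written, "samples written to", path)
```

The resulting file can then be passed as the value of "--train_dataset".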


In [23]:
!python Text-Generation/fine_tuning_openai.py --do_train --output_dir train_clarke --train_dataset Text-Generation/FineTuning-example.txt

Namespace(do_eval=False, do_train=True, eval_batch_size=16, eval_dataset='', learning_rate=6.25e-05, lm_coef=0.9, lr_schedule='warmup_linear', max_grad_norm=1, model_name='openai-gpt', n_valid=374, num_train_epochs=3, output_dir='train_clarke', seed=42, server_ip='', server_port='', train_batch_size=8, train_dataset='Text-Generation/FineTuning-example.txt', warmup_proportion=0.002, weight_decay=0.01)
07/08/2019 13:22:21 - INFO - __main__ -   device: cuda, n_gpu 1
07/08/2019 13:22:23 - INFO - pytorch_pretrained_bert.tokenization_openai -   loading vocabulary file https://s3.amazonaws.com/models.huggingface.co/bert/openai-gpt-vocab.json from cache at /root/.pytorch_pretrained_bert/4ab93d0cd78ae80e746c27c9cd34e90b470abdabe0590c9ec742df61625ba310.b9628f6fe5519626534b82ce7ec72b22ce0ae79550325f45c604a25c0ad87fd6
07/08/2019 13:22:23 - INFO - pytorch_pretrained_bert.tokenization_openai -   loading merges file https://s3.amazonaws.com/models.huggingface.co/bert/openai-gpt-merges.txt from cache 

In [26]:
!python Text-Generation/openai_huggingface_example.py --model_name_or_path "train_clarke"

07/08/2019 13:35:22 - INFO - pytorch_pretrained_bert.tokenization_openai -   loading special tokens file train_clarke/special_tokens.txt
07/08/2019 13:35:22 - INFO - pytorch_pretrained_bert.tokenization_openai -   loading vocabulary file train_clarke/vocab.json
07/08/2019 13:35:22 - INFO - pytorch_pretrained_bert.tokenization_openai -   loading merges file train_clarke/merges.txt
07/08/2019 13:35:23 - INFO - pytorch_pretrained_bert.tokenization_openai -   Special tokens {'_start_': 40478, '_end_': 40479}
07/08/2019 13:35:23 - INFO - pytorch_pretrained_bert.modeling_openai -   loading weights file train_clarke/pytorch_model.bin
07/08/2019 13:35:23 - INFO - pytorch_pretrained_bert.modeling_openai -   loading configuration file train_clarke/config.json
07/08/2019 13:35:23 - INFO - pytorch_pretrained_bert.modeling_openai -   Model config {
  "afn": "gelu",
  "attn_pdrop": 0.1,
  "embd_pdrop": 0.1,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "n_ctx": 512,
  "n_embd": 768,
