<a href="https://colab.research.google.com/github/programminghumanity/AITextGenerator/blob/master/nlg_aitextgen_chekhov4_20200705.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#  aitextgen — Train a GPT-2 Text-Generating Model w/ GPU

by [Max Woolf](https://minimaxir.com)

*Last updated: Jul 5th, 2020*

Retrain an advanced text-generating neural network on any text dataset **for free on a GPU using Colaboratory** with `aitextgen`!

For more about `aitextgen`, you can visit [this GitHub repository](https://github.com/minimaxir/aitextgen) or [read the documentation](https://docs.aitextgen.io/).


To get started:

1. Copy this notebook to your Google Drive to keep it and save your changes. (File -> Save a Copy in Drive)
2. Run the cells below:


In [1]:
# Freeze versions of dependencies for now
!pip install transformers==2.9.1

!pip install -q aitextgen

import logging
logging.basicConfig(
        format="%(asctime)s — %(levelname)s — %(name)s — %(message)s",
        datefmt="%m/%d/%Y %H:%M:%S",
        level=logging.INFO
    )

from aitextgen import aitextgen
from aitextgen.colab import mount_gdrive, copy_file_from_gdrive

Collecting transformers==2.9.1

07/05/2020 19:04:19 — INFO — transformers.file_utils — PyTorch version 1.5.1+cu101 available.
07/05/2020 19:04:20 — INFO — transformers.file_utils — TensorFlow version 2.2.0 available.
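
Since the first cell pins `transformers` to 2.9.1, it can be handy to confirm that the pin actually took effect in the current runtime. The snippet below is only a sketch: `check_pin` is a hypothetical helper (not part of `aitextgen`), and it assumes a Python version that ships `importlib.metadata` (3.8 or later).

```python
from importlib import metadata


def check_pin(package: str, expected: str) -> str:
    """Compare the installed version of `package` with the pinned one."""
    try:
        installed = metadata.version(package)
    except metadata.PackageNotFoundError:
        return f"{package} is not installed"
    if installed == expected:
        return f"{package}=={installed} matches the pin"
    return f"{package}=={installed} differs from the pinned {expected}"


# 2.9.1 is the version frozen in the install cell above.
print(check_pin("transformers", "2.9.1"))
```

If the message reports a different version, restarting the runtime and rerunning the install cell is usually the simplest fix.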


## GPU

Colaboratory uses an Nvidia P4, an Nvidia T4, or an Nvidia P100 GPU. For finetuning GPT-2 124M, any of these GPUs will be fine, but for text generation, a T4 or a P100 is ideal since they have more VRAM.

You can verify which GPU is active by running the cell below. If you want to try for a different GPU, go to **Runtime -> Factory Reset Runtime**.

In [2]:
!nvidia-smi

Sun Jul  5 19:04:21 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.36.06    Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   34C    P0    26W / 250W |      0MiB / 16280MiB |      0%      Default |
|                               |                      |                 ERR! |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces
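
If you would rather read the GPU name and VRAM from Python than scan the table by eye, a sketch like the following may help. It assumes `nvidia-smi` is on the `PATH` and uses its `--query-gpu` CSV output; `parse_gpu_csv` and `list_gpus` are hypothetical helper names, not part of `aitextgen`.

```python
import shutil
import subprocess


def parse_gpu_csv(csv_text: str) -> list:
    """Parse 'name, memory' CSV rows into (name, memory_in_MiB) tuples."""
    gpus = []
    for line in csv_text.strip().splitlines():
        # Split on the last comma so a comma inside the name cannot break parsing.
        name, memory = [field.strip() for field in line.rsplit(",", 1)]
        gpus.append((name, int(memory.split()[0])))
    return gpus


def list_gpus() -> list:
    """Return the GPUs reported by nvidia-smi, or [] if it is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
        capture_output=True, text=True, check=False,
    )
    return parse_gpu_csv(result.stdout) if result.returncode == 0 else []


for name, memory_mib in list_gpus():
    print(f"{name}: {memory_mib} MiB of VRAM")
```

On a machine without a GPU the loop simply prints nothing.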

## Loading GPT-2

If you're retraining a model on new text, you need to download and load the GPT-2 model into the GPU. 

There are several sizes of GPT-2, but currently aitextgen only works with the smallest one:

* `124M` (default): the "small" model, 500MB on disk.

The next cell downloads it from Google's servers and saves it in the Colaboratory VM. If the model has already been downloaded, running this cell will reload it.

In [3]:
ai = aitextgen(tf_gpt2="124M", to_gpu=True)

07/05/2020 19:04:31 — INFO — aitextgen — Downloading the 124M GPT-2 TensorFlow weights/config from Google's servers



07/05/2020 19:04:35 — INFO — aitextgen — Converting the 124M GPT-2 TensorFlow weights to PyTorch.



Save PyTorch model to aitextgen/pytorch_model.bin


07/05/2020 19:04:39 — INFO — aitextgen — Loading 124M GPT-2 model from /aitextgen.


Save configuration file to aitextgen/config.json


07/05/2020 19:04:43 — INFO — aitextgen — Using the default GPT-2 Tokenizer.


## Mounting Google Drive

The best way to get your training text into the Colaboratory VM, and to get the trained model *out* of Colaboratory, is to route it through Google Drive *first*.

Running this cell (which will only work in Colaboratory) will mount your personal Google Drive in the VM, which later cells can use to get data in/out. (It will ask for an auth code; that auth is not saved anywhere.)

In [4]:
mount_gdrive()

Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3aietf%3awg%3aoauth%3a2.0%3aoob&response_type=code&scope=email%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdocs.test%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive.photos.readonly%20https%3a%2f%2fwww.googleapis.com%2fauth%2fpeopleapi.readonly

Enter your authorization code:
··········
Mounted at /content/drive


## Uploading a Text File to be Trained to Colaboratory

In the Colaboratory Notebook sidebar on the left of the screen, select *Files*. From there you can upload files:

![alt text](https://i.imgur.com/w3wvHhR.png)

Upload **any smaller text file** (for example, [a text file of Shakespeare plays](https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt)) and update the file name in the cell below, then run the cell.

In [5]:
file_name = "chekhov_4plays_all_dialog_only.txt"

If your text file is large (>10MB), it is recommended to upload that file to Google Drive first, then copy that file from Google Drive to the Colaboratory VM.

Additionally, you may want to consider [compressing the dataset to a cache first](https://docs.aitextgen.io/dataset/) on your local computer, then uploading the resulting `dataset_cache.tar.gz` and setting the `file_name` in the previous cell to that.
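
The 10MB rule of thumb above is easy to check on your local computer before uploading anything. This is a minimal sketch: `suggest_upload_route` and its return labels are hypothetical, and the threshold simply encodes the ">10MB" guidance from this section.

```python
import os

LARGE_FILE_BYTES = 10 * 1024 * 1024  # the ">10MB" rule of thumb from the text


def suggest_upload_route(path: str) -> str:
    """Suggest how to get a text file into the Colaboratory VM."""
    if os.path.getsize(path) > LARGE_FILE_BYTES:
        return "google-drive"   # upload to Drive first, then copy into the VM
    return "files-sidebar"      # small enough to upload directly


file_name = "chekhov_4plays_all_dialog_only.txt"  # same name as in the cell above
if os.path.exists(file_name):
    print(file_name, "->", suggest_upload_route(file_name))
```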

In [7]:
!pwd

/content


In [9]:
copy_file_from_gdrive(file_name)

## Finetune GPT-2

The next cell will start the actual finetuning of GPT-2 in aitextgen. It runs for `num_steps`, and a progress bar will appear to show training progress, current loss (the lower the better the model), and average loss (to give a sense of the loss trajectory).

The model will be saved every `save_every` steps in `trained_model` by default, and when training completes. If you mounted your Google Drive, the model will _also_ be saved there in a unique folder.

The training might time out after 4ish hours; if you did not mount to Google Drive, make sure you end training and save the results so you don't lose them! (If this happens frequently, you may want to consider using [Colab Pro](https://colab.research.google.com/signup).)

Important parameters for `train()`:

- **`line_by_line`**: Set this to `True` if the input text file is a single-column CSV, with one record per row. aitextgen will automatically process it optimally.
- **`from_cache`**: If you compressed your dataset locally (as noted in the previous section) and are using that cache file, set this to `True`.
- **`num_steps`**: Number of steps to train the model for.
- **`generate_every`**: Interval of steps to generate example text from the model; good for qualitatively validating training.
- **`save_every`**: Interval of steps to save the model: the model will be saved in the VM to `/trained_model`.
- **`save_gdrive`**: Set this to `True` to copy the model to a unique folder in your Google Drive, if you have mounted it in the earlier cells.

These other `train()` parameters are also useful, but you likely do not need to change them:

- **`learning_rate`**: Learning rate of the model training.
- **`batch_size`**: Batch size of the model training; setting it too high will cause the GPU to go OOM.
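
To get a feel for how `num_steps`, `generate_every`, and `save_every` interact, the sketch below lists the steps at which samples and checkpoints would be expected. It is an illustration, not aitextgen's internal logic: `training_schedule` is a hypothetical helper, and treating an interval of 0 as "never" is an assumption.

```python
def training_schedule(num_steps: int, generate_every: int, save_every: int) -> dict:
    """List the steps at which samples are generated and checkpoints saved."""
    def multiples(interval: int) -> list:
        # An interval of 0 is treated here as "never" (an assumption).
        return list(range(interval, num_steps + 1, interval)) if interval > 0 else []

    return {"generate": multiples(generate_every), "save": multiples(save_every)}


# Same values as the training cell in this notebook.
schedule = training_schedule(num_steps=5000, generate_every=500, save_every=1000)
print("samples at steps:    ", schedule["generate"])
print("checkpoints at steps:", schedule["save"])
```

With these settings every checkpoint step is also a sample step, which is why the training log shows a save message and a sample together at step 1,000.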

In [10]:
ai.train(file_name,
         line_by_line=False,
         from_cache=False,
         num_steps=5000,
         generate_every=500,
         save_every=1000,
         save_gdrive=True,
         learning_rate=1e-4,
         batch_size=1, 
         )


07/05/2020 19:15:01 — INFO — aitextgen.TokenDataset — Encoding 9,810 sets of tokens from chekhov_4plays_all_dialog_only.txt.
GPU available: True, used: True
07/05/2020 19:15:01 — INFO — lightning — GPU available: True, used: True
TPU available: False, using: 0 TPU cores
07/05/2020 19:15:01 — INFO — lightning — TPU available: False, using: 0 TPU cores
CUDA_VISIBLE_DEVICES: [0]
07/05/2020 19:15:01 — INFO — lightning — CUDA_VISIBLE_DEVICES: [0]





500 steps reached: generating sample texts.
Bobby’s

out in the trunk of his jacket, and Bobby’s
showing a bunch of new things....

CHEBUTIKIN. [Kissing her brother] My brother’s a
psychological curiosity. He’s a good-natured man, but he
doesn’t like everybody, and he’s awfully shy... but he really
likes us, and we’re all in it for one thing, and that
is to be sure, to be sure that he won’t deceive anybody....

KULIGIN. [Cries] Bobby’s a funny man, but he’s not afraid of us,
and he’s well-used to us.

SOLENI. I’m not going away. I’m here to-day, I’m
not bothering any one, I’m not bothering anybody. I’m not going away, I’m
going to meet somebody, it’s all settled and just... and
I’m so happy, so happy.

KULIGIN. [C
1,000 steps reached: saving model to /trained_model
1,000 steps reached: generating sample texts.

   “Oh my darling--” [She embraces him] We shall see each other again,
and will--shall--

          I shall meet again, and shall be near you.
          

07/05/2020 19:35:05 — INFO — aitextgen — Saving trained model pytorch_model.bin to /trained_model


. The whole thing is worth one’s
seventy-five thousand roubles.

MASHA. Fifty, did you say?

TROFIMOV. Why not?

MASHA. Fifty.

TROFIMOV. Why did I say fifty? I am going to
take charge of the place.

MASHA. Fifty, did you say?

TROFIMOV. Why not?

MASHA. I don’t remember.

TROFIMOV. In the first place I meant to say “Evstigney,” but Evstigney Deriganedov
now has been severely criticised for his conduct. See, he’s
at his age and disposition, and in the second place I seem to have
gone through a great deal by indulging in my reading. I am,
of course, thirty-five years old. But say what you will, Evstigney is
a clever man and very good company; people will have to
listen to his voice and gesture. I must admit that his works are
not as good as Tolstoi, but they are so fresh and easy to get



You're done! Feel free to go to the **Generate Text From The Trained Model** section to generate text based on your retrained model.


## Load a Trained Model

Running the next cell will copy the `pytorch_model.bin` and the `config.json` file from the specified folder in Google Drive into the Colaboratory VM. (If no `from_folder` is specified, it assumes the two files are located at the root level of your Google Drive.)

In [11]:
from_folder = None

for file in ["pytorch_model.bin", "config.json"]:
  if from_folder:
    copy_file_from_gdrive(file, from_folder)
  else:
    copy_file_from_gdrive(file)

FileNotFoundError: ignored
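
A `FileNotFoundError` like the one above usually means one of the two files is not where the copy expects it. A small pre-check can make that explicit before copying. This is a sketch under assumptions: `missing_model_files` is a hypothetical helper, and `/content/drive/My Drive` is assumed to be where the root of your Drive appears after mounting.

```python
from pathlib import Path

REQUIRED_FILES = ("pytorch_model.bin", "config.json")


def missing_model_files(folder: str) -> list:
    """Return the required model files that are absent from `folder`."""
    root = Path(folder)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]


# Assumed location of the Drive root inside the VM; adjust if yours differs.
drive_root = "/content/drive/My Drive"
missing = missing_model_files(drive_root)
if missing:
    print("Missing from", drive_root, "->", ", ".join(missing))
else:
    print("Both model files found.")
```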

The next cell will allow you to load the retrained model + metadata necessary to generate text.

In [12]:
ai = aitextgen(model="./trained_model/pytorch_model.bin", config="./trained_model/config.json", to_gpu=True)

07/05/2020 19:53:51 — INFO — aitextgen — Loading GPT-2 model from provided ./trained_model/pytorch_model.bin.
07/05/2020 19:53:55 — INFO — aitextgen — Using the default GPT-2 Tokenizer.


## Generate Text From The Trained Model

After you've trained the model or loaded a retrained model from checkpoint, you can now generate text. `generate()` without any parameters generates a single text from the loaded model to the console.

In [14]:
ai.generate()


I have been sucked off my feet; on two legs at once! I try to
walk again. [He straightens SORIN’S collar] Your hair and beard are all on
end. Oughtn’t you to have them trimmed?

SORIN. [Smoothing his beard] They are the tragedy of my existence. Even
when I was young I always looked as if I were drunk, and all. Women have
never liked me. [Sitting down] Why is my sister out of temper?

TREPLIEFF. Why? Because she is jealous and bored. [Sitting down beside
SORIN] She is not acting this evening, but Nina is, and so she has set
herself against me, and against the performance of the play, and against
the play itself, which she hates without ever having read it.

SORIN. [Laughing] Does she, really?

TREPLIEFF. Yes, she is furious because Nina is going to have a
success on this little stage. [Looking at his watch] My mother is a
psychological curiosity. Without doubt brilliant and talented, capable
of


If you're creating an API based on your model and need to pass the generated text elsewhere, you can do `text = ai.generate_one()`.

You can also pass in a `prompt` to the generate function to force the text to start with a given character sequence and generate text from there (good if you add an indicator when the text starts).

You can also generate multiple texts at a time by specifying `n`. You can pass a `batch_size` to generate multiple samples in parallel, giving a massive speedup (in Colaboratory, set a maximum of 50 for `batch_size` to avoid going OOM).

Other optional-but-helpful parameters for `ai.generate()` and friends:

* **`max_length`**: Number of tokens to generate (default 256; you can generate up to 1024 tokens with GPT-2, but it will be _much_ slower)
* **`temperature`**: The higher the temperature, the crazier the text (default 0.7, recommended to keep between 0.7 and 1.0)
* **`top_k`**: Limits the generated guesses to the top *k* guesses (default 0 which disables the behavior; if the generated output is super crazy, you may want to set `top_k=40`)
* **`top_p`**: Nucleus sampling: limits the generated guesses to a cumulative probability. (gets good results on a dataset with `top_p=0.9`)
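
To build some intuition for what `temperature`, `top_k`, and `top_p` do, here is a toy sketch on a made-up four-token vocabulary. It illustrates the general sampling ideas in plain Python; it is not the code aitextgen or `transformers` actually runs, and all function names and the example scores are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; lower temperature sharpens them."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def renormalize(probs, keep):
    """Zero out every index not in `keep` and rescale the rest to sum to 1."""
    total = sum(probs[i] for i in keep)
    return [p / total if i in keep else 0.0 for i, p in enumerate(probs)]


def top_k_filter(probs, k):
    """Keep only the k most likely tokens (k <= 0 disables the filter)."""
    if k <= 0:
        return list(probs)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return renormalize(probs, set(ranked[:k]))


def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = set(), 0.0
    for i in ranked:
        keep.add(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return renormalize(probs, keep)


logits = [2.0, 1.0, 0.0, -1.0]  # toy scores for a four-token vocabulary
for t in (0.7, 1.0):
    print(f"temperature={t}:", [round(x, 3) for x in softmax(logits, t)])
print("top_k=2:  ", [round(x, 3) for x in top_k_filter(softmax(logits), 2)])
print("top_p=0.9:", [round(x, 3) for x in top_p_filter(softmax(logits), 0.9)])
```

Lower temperatures pile probability onto the most likely token, while `top_k` and `top_p` remove the unlikely tail before sampling, which is why they tame "crazy" output.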

In [16]:
ai.generate(n=5,
            batch_size=5,
            prompt="ROMEO:",
            max_length=512,
            temperature=0.7,
            top_p=0.9)

ROMEO: [Seizes his hand] Dearest!

NINA. Be quiet! Here they come.

TRIGORIN. They are coming!

NINA. [Shrugging her shoulders] I must go. Good-bye.

He goes out through the centre door on the left, dressed in a long coat with
a cape, and carrying his hat and cane.

TRIGORIN. I am going to spend the evening here. In a moment.

NINA. Good-bye. [She and MEDVIEDENKO go out.]

MEDVIEDENKO. [Kissing him kisses his forehead] Good-bye, doctor.

MEDVIEDENKO. [Kissing him his forehead] Good-bye, old man.

MEDVIEDENKO. What a wind!

MASHA. Yes. I’m tired of winter. I’ve already forgotten what summer’s
like.

MEDVIEDENKO. It’s coming out, I see. We’re going to Moscow.

MASHA. No, it won’t come out. Look, the eight was on the two of
spades. [The lightning flashes] There it is! I’ve missed the first one, and it
was so dark that I couldn’t see the second. Good-bye,
old man. [The lightning flashes again] Goodbye, old man!

MEDVIEDENKO. It isn’t as if I was ever in Moscow. I was born there, an

In [15]:
ai.generate(n=5,
            batch_size=5,
            prompt="ROMEO:",
            max_length=512,
            temperature=1.0,
            top_p=0.9)

ROMEO: [Looks in through the door on the left] There it is!... red!

ARKADINA. Where is it? I hid it in the cellar.

DORN. Yes, there was one such thing in the cellar.

ARKADINA. I am going to look for it.

DORN. Where is it?

ARKADINA. I am going to look for it. [She and TRIGORIN go out.]

TRIGORIN. Well, it is time to begin. I am going to spend the evening. Good-bye.

ARKADINA. [Frightened] Peter! [She tries to support him] Goodbye, all! [She kisses his
hands] Good-bye, all! [She tries to leave the room.]

DORN. [Looking through the pages of a book] Page 121, lines 11 and
12; here it is. [He kisses ARKADINA and MEDVIEDENKO IVANOVNA] Have your picture taken,
Andrey.

ARKADINA. Good-bye, all! [She and MEDVIEDENKO go out.]

SHAMRAEFF. [Kissing MASHA] Good-bye, all! [He kisses NINA and
PAULINA] The gander cackles; I am getting excited.

PAULINA. Come, let us begin. Don’t let us waste time, we shall soon be
called to supper.

SHAMRAEFF, MASHA, and DORN sit down at the card-table.


In [17]:
ai.generate(n=5,
            batch_size=5,
            prompt="ROMEO:",
            max_length=512,
            temperature=1.0,
            top_p=0.7)

ROMEO: [Goes to the cupboard and stands in the corner] What a
rogue.

SEREBRAKOFF. He hadn’t touched a drop for two years, and now he suddenly goes
and gets drunk....

VOICE AT THE DOOR. Ermolai Alexeyevitch!

LOPAKHIN. [Angry] Decayed gentleman!

SEREBRAKOFF. Yes, I am a decayed gentleman, and I’m proud of it!

VOICE AT THE DOOR. [Declaims] “You’re drunk, old man!”

LOPAKHIN. Not that I am angry with you.

VOICE AT THE DOOR. You’re old man!

LOPAKHIN. Not that I am angry with you. [Exit slowly.]

VOICE AT THE DOOR. [Kisses LUBOV ANDREYEVNA’S hand] Your room, my windows....

LUBOV. [Shouts] Ooh!

ANDREYEVNA. [To GAEV] Ooh!

GAEV. [Confused] That’s enough, that’s enough, Luba.

VARYA. [Weeps] But I told you, Peter, to wait till to-morrow.

LUBOV. My Grisha... my boy... Grisha... my son.

VARYA. What are we to do, little mother? It’s the will of God.

TROFIMOV. [Softly, through his tears] It’s all right, it’s all right.

LUBOV. [Still weeping] My boy’s dead; he was drowned. Why? 

In [18]:
ai.generate(n=5,
            batch_size=5,
            prompt="ROMEO:",
            max_length=1024,
            temperature=0.7,
            top_p=0.7)

ROMEO: [To SONIA] Sonia, hand me that bottle on the
table.

SONIA. Here it is. [Goes into the drawing-room with OLGA.]

[Shouts are heard. ANDREY and FERAPONT come in.]

ANDREY. [Kissing IRINA] Sonia!

FERAPONT. Documents to sign....

ANDREY. [Kissing IRINA] That’s what you want.

IRINA. That’s what you gave me.

FERAPONT. [Kissing him] Thank you.

ANDREY. I am happy. We’ll sign the agreement. [Combs his beard.]

FERAPONT. That’s what you want. [Going into the drawing-room, to the
dining-room] Under these new circumstances I shall sign....

ANDREY. [Kissing his wife’s hand] Under these new circumstances I shall sign.

[Exit FERAPONT]

ANDREY. [Kissing his wife’s hand] Under these new circumstances I shall sign.

[Retires to backbench.]

TUZENBACH. [Laughs] I didn’t know you were here, I only lost my memory.
[Wipes his forehead] Bobby gave me that bottle on the day of your wedding.
I couldn’t understand it at all, and you couldn’t understand it either.

TUZENBACH. She’s forgotte

For bulk generation, you can generate a large number of texts to a file and sort out the samples locally on your computer. The next cell will generate `num_files` files, each with `n` texts and whatever other parameters you would pass to `generate()`. The files can then be downloaded from the Files sidebar!

You can rerun the cells as many times as you want for even more generated texts!

In [19]:
num_files = 5

for _ in range(num_files):
  ai.generate_to_file(n=1000,
                     batch_size=50,
                     prompt="ROMEO:",
                     max_length=256,
                     temperature=1.0,
                     top_p=0.9)

07/05/2020 19:59:53 — INFO — aitextgen — Generating 1,000 texts to ATG_20200705_195953_97350876.txt



07/05/2020 20:02:08 — INFO — aitextgen — Generating 1,000 texts to ATG_20200705_200208_20249326.txt






07/05/2020 20:04:23 — INFO — aitextgen — Generating 1,000 texts to ATG_20200705_200423_58099902.txt






07/05/2020 20:06:38 — INFO — aitextgen — Generating 1,000 texts to ATG_20200705_200638_51884887.txt






07/05/2020 20:08:53 — INFO — aitextgen — Generating 1,000 texts to ATG_20200705_200853_54538677.txt





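
Once the generated files are downloaded, sorting out the samples locally could look something like this. It is a sketch with loud assumptions: the delimiter between samples is assumed to be a line of 20 `=` characters (open one of your files to confirm and adjust), the helper names are hypothetical, and the example text is made up rather than read from a real `ATG_*.txt` file.

```python
def split_samples(text: str, delimiter: str = "=" * 20) -> list:
    """Split a generated file's contents into individual, non-empty samples."""
    return [chunk.strip() for chunk in text.split(delimiter) if chunk.strip()]


def unique_longest_first(samples: list) -> list:
    """Drop exact duplicates, then order samples from longest to shortest."""
    return sorted(set(samples), key=lambda s: (-len(s), s))


# Hypothetical file contents; a real run would read one of the downloaded files.
raw = (
    "ROMEO: Good-bye.\n" + "=" * 20 + "\n"
    "ROMEO: What a wind!\n" + "=" * 20 + "\n"
    "ROMEO: Good-bye.\n"
)
for sample in unique_longest_first(split_samples(raw)):
    print(repr(sample))
```

Deduplicating first is a cheap way to trim bulk output, since low temperatures tend to produce repeated samples.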




# LICENSE

MIT License

Copyright (c) 2020 Max Woolf

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.