<a href="https://colab.research.google.com/github/hodeld/gcolab/blob/main/Train_and_Generate_Text_%E2%80%93_GPT_Neo_w_GPU_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#  aitextgen — Train a GPT-2 (or GPT Neo) Text-Generating Model w/ GPU

by Damian

Last updated: 10/17/22


Original by [Max Woolf](https://minimaxir.com)

*Last updated: May 16th, 2021 (aitextgen v0.5.2)*

Retrain an advanced text-generating neural network on any text dataset **for free on a GPU in Colaboratory** with `aitextgen`!

For more about `aitextgen`, you can visit [this GitHub repository](https://github.com/minimaxir/aitextgen) or [read the documentation](https://docs.aitextgen.io/).


To get started:

1. Copy this notebook to your Google Drive to keep it and save your changes. (File -> Save a Copy in Drive)
2. Run the cells below:


In [None]:
!pip install -q aitextgen

import logging
logging.basicConfig(
        format="%(asctime)s — %(levelname)s — %(name)s — %(message)s",
        datefmt="%m/%d/%Y %H:%M:%S",
        level=logging.INFO
    )

from aitextgen import aitextgen
from aitextgen.colab import mount_gdrive, copy_file_from_gdrive
from aitextgen.TokenDataset import TokenDataset

from datetime import datetime
import os
import pandas as pd


## GPU

Colaboratory uses a Nvidia P4, an Nvidia T4, an Nvidia P100, or an Nvidia V100. For finetuning GPT-2 124M, any of these GPUs will be fine, but for text generation, a T4 or a P100 is ideal since they have more VRAM. **If you receive a T4 or a V100 GPU, you can enable `fp16=True` during training for faster/more memory efficient training.**

You can verify which GPU is active by running the cell below. If you want to try for a different GPU, go to **Runtime -> Factory Reset Runtime**.
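
As a small illustration of the `fp16` guidance above, the helper below decides from a GPU name string whether half precision should be switched on. `should_use_fp16` is a hypothetical helper, not part of aitextgen, and the name matching only encodes the T4/V100 rule stated in this section.

```python
def should_use_fp16(gpu_name: str) -> bool:
    """Return True for the GPUs this notebook recommends fp16 on (T4, V100)."""
    name = gpu_name.upper()
    return any(model in name for model in ("T4", "V100"))


# Example names in the style reported by `nvidia-smi` (illustrative strings).
for gpu in ("Tesla T4", "Tesla V100-SXM2-16GB", "Tesla P100-PCIE-16GB", "Tesla P4"):
    print(f"{gpu}: fp16={should_use_fp16(gpu)}")
```

In a GPU runtime, the name could be read with `torch.cuda.get_device_name(0)` and the result passed on as the `fp16` argument when training.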

In [None]:
!nvidia-smi

Mon Oct 31 02:01:11 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   54C    P8    10W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

## Loading GPT-2 or GPT Neo

If you're retraining a model on new text, you need to download and load the GPT-2 model into the GPU. 

There are several sizes of GPT-2:

* `124M` (default): the "small" model, 500MB on disk.
* `355M`: the "medium" model, 1.5GB on disk.
* `774M`: the "large" model, 3GB on disk.

You can also finetune a GPT Neo model instead, which is better suited to longer texts and whose base model was trained on more recent data:

* `125M`: Analogous to the GPT-2 124M model.
* `350M`: Analogous to the GPT-2 355M model.

The next cell downloads the model and saves it in the Colaboratory VM. If the model has already been downloaded, running this cell will reload it.
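
One way to keep the model choice in a single place is a small lookup that produces the keyword arguments for the `aitextgen(...)` constructor. This is only a sketch: `model_kwargs` and the two dictionaries are hypothetical names, and the GPT Neo table contains only the `EleutherAI/gpt-neo-125M` id that this notebook itself uses.

```python
# Sizes on disk as listed above; keys are the values accepted by `tf_gpt2`.
GPT2_SIZES = {"124M": "500MB", "355M": "1.5GB", "774M": "3GB"}
# Only the GPT Neo id used in this notebook; other ids are deliberately not guessed.
GPT_NEO_MODELS = {"125M": "EleutherAI/gpt-neo-125M"}


def model_kwargs(family: str, size: str, to_gpu: bool = True) -> dict:
    """Build constructor keyword arguments for a GPT-2 or GPT Neo model."""
    if family == "gpt2":
        if size not in GPT2_SIZES:
            raise ValueError(f"unknown GPT-2 size: {size}")
        return {"tf_gpt2": size, "to_gpu": to_gpu}
    if family == "gpt-neo":
        if size not in GPT_NEO_MODELS:
            raise ValueError(f"no GPT Neo id registered for size: {size}")
        return {"model": GPT_NEO_MODELS[size], "to_gpu": to_gpu}
    raise ValueError(f"unknown model family: {family}")


print(model_kwargs("gpt2", "124M"))
print(model_kwargs("gpt-neo", "125M"))
```

With aitextgen imported, the result would be used as `ai = aitextgen(**model_kwargs("gpt-neo", "125M"))`.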

In [None]:
# To use GPT-2 instead, uncomment the next line and comment out the GPT Neo line below it.
#ai = aitextgen(tf_gpt2="124M", to_gpu=True)
ai = aitextgen(model="EleutherAI/gpt-neo-125M", to_gpu=True)

INFO:aitextgen:Downloading EleutherAI/gpt-neo-125M model to /aitextgen.


INFO:aitextgen:Using the tokenizer for EleutherAI/gpt-neo-125M.

loading file vocab.json from cache at aitextgen/models--EleutherAI--gpt-neo-125M/snapshots/324e21bd3b56dfecba4308ab6ec147b588df23af/vocab.json
loading file merges.txt from cache at aitextgen/models--EleutherAI--gpt-neo-125M/snapshots/324e21bd3b56dfecba4308ab6ec147b588df23af/merges.txt
loading file tokenizer.json from cache at None
loading file added_tokens.json from cache at None
loading file special_tokens_map.json from cache at aitextgen/models--EleutherAI--gpt-neo-125M/snapshots/324e21bd3b56dfecba4308ab6ec147b588df23af/special_tokens_map.json
loading file tokenizer_config.json from cache at aitextgen/models--EleutherAI--gpt-neo-125M/snapshots/324e21bd3b56dfecba4308ab6ec147b588df23af/tokenizer_config.json
INFO:aitextgen:GPTNeo loaded with 125M parameters.


## Mounting Google Drive

The best way to get input text to-be-trained into the Colaboratory VM, and to get the trained model *out* of Colaboratory, is to route it through Google Drive *first*.

Running this cell (which will only work in Colaboratory) will mount your personal Google Drive in the VM, which later cells can use to get data in/out. (it will ask for an auth code; that auth is not saved anywhere)

In [None]:
mount_gdrive()

Mounted at /content/drive
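
Because the mount only works in Colaboratory, later cells can guard against a missing Drive. The sketch below picks an output root depending on whether the Drive folder exists; `resolve_output_root` and the local `output` fallback are assumptions made for illustration, not aitextgen functions.

```python
import os


def resolve_output_root(drive_root="/content/drive/MyDrive", fallback="output"):
    """Return the Drive folder when it is mounted, else a local folder in the VM."""
    return drive_root if os.path.isdir(drive_root) else fallback


root = resolve_output_root()
print("Writing results under:", root)
```

Files written to the fallback folder live only in the VM and are lost when the runtime is reset.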


## Loop: Generate text and use output as training data


After you've trained the model or loaded a retrained model from checkpoint, you can now generate text.

**If you just trained a model**, you'll get much faster training performance if you reload the model; the next cell will reload the model you just trained from the `trained_model` folder.

### Generate text – parameters

If you're creating an API based on your model and need to pass the generated text elsewhere, you can do `text = ai.generate_one()`.

You can also pass in a `prompt` to the generate function to force the text to start with a given character sequence and generate text from there (good if you add an indicator when the text starts).

You can also generate multiple texts at a time by specifying `n`. You can pass a `batch_size` to generate multiple samples in parallel, giving a massive speedup (in Colaboratory, set a maximum of 50 for `batch_size` to avoid going OOM).

Other optional-but-helpful parameters for `ai.generate()` and friends:

*  **`min_length`**: The minimum length of the generated text: if the text is shorter than this value after cleanup, aitextgen will generate another one.
*  **`max_length`**: Number of tokens to generate (default 256, you can generate up to 1024 tokens with GPT-2 and 2048 with GPT Neo)
* **`temperature`**: The higher the temperature, the crazier the text (default 0.7, recommended to keep between 0.7 and 1.0)
* **`top_k`**: Limits the generated guesses to the top *k* guesses (default 0 which disables the behavior; if the generated output is super crazy, you may want to set `top_k=40`)
* **`top_p`**: Nucleus sampling: limits the generated guesses to a cumulative probability. (gets good results on a dataset with `top_p=0.9`)
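
To make `temperature`, `top_k` and `top_p` concrete, here is a toy, pure-Python sketch of how the three settings reshape a next-token distribution. It follows the textbook definitions and is not the code that aitextgen or its underlying libraries actually run; `sampling_distribution` and the example logits are made up for illustration.

```python
import math


def sampling_distribution(logits, temperature=0.7, top_k=0, top_p=1.0):
    """Return {token_index: probability} after temperature, top-k and top-p filtering."""
    # Temperature: divide the logits before the softmax; lower values sharpen the distribution.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]
    total = sum(weights)
    probs = [w / total for w in weights]

    # Rank token indices from most to least likely.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)

    # top_k: keep only the k most likely tokens (0 disables the filter).
    if top_k > 0:
        ranked = ranked[:top_k]

    # top_p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Renormalize so the surviving probabilities sum to 1.
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


logits = [2.0, 1.0, 0.0, -1.0]  # toy scores for a four-token vocabulary
print(sampling_distribution(logits, temperature=1.0, top_p=0.9))
print(sampling_distribution(logits, temperature=1.0, top_k=2))
print(sampling_distribution(logits, temperature=0.5))
```

The first call drops the least likely token, the second keeps only the two most likely ones, and the third shows the lower temperature concentrating probability on the top token.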

In [None]:
# generation settings
nr_texts = 5  # passed as `n`
batch_size = 5  # 1000
#prompt = 'A woman, a man, and a third person'  # "ROMEO:"
max_length = 200  # 256
temperature = 0.7  # 1.0; must be strictly positive
top_p = 0.9
seed = 77  # for reproducibility
repetition_penalty = 1.1  # If greater than 1.0, penalizes repetition in a text to avoid infinite loops.

## Train data
The next cell will start the actual finetuning of GPT-2 (or GPT Neo) in aitextgen. It runs for `num_steps`, and a progress bar will appear to show training progress, current loss (the lower the better the model), and average loss (to give a sense of the loss trajectory).

The model will be saved every `save_every` steps in `trained_model` by default, and when training completes. If you mounted your Google Drive, the model will _also_ be saved there in a unique folder.

The training might time out after 4ish hours; if you did not mount to Google Drive, make sure you end training and save the results so you don't lose them! (if this happens frequently, you may want to consider using [Colab Pro](https://colab.research.google.com/signup))

Important parameters for `train()`:

- **`line_by_line`**: Set this to `True` if the input text file is a single-column CSV, with one record per row. aitextgen will automatically process it optimally.
- **`from_cache`**: If you compressed your dataset locally (as noted in the previous section) and are using that cache file, set this to `True`.
- **`num_steps`**: Number of steps to train the model for.
- **`generate_every`**: Interval of steps to generate example text from the model; good for qualitatively validating training.
- **`save_every`**: Interval of steps to save the model: the model will be saved in the VM to `/trained_model`.
- **`save_gdrive`**: Set this to `True` to copy the model to a unique folder in your Google Drive, if you have mounted it in the earlier cells
- **`fp16`**: Enables half-precision training for faster/more memory-efficient training. Only works on a T4 or V100 GPU.

Here are other important parameters for `train()` that are useful but you likely do not need to change.

- **`learning_rate`**: Learning rate of the model training.
- **`batch_size`**: Batch size of the model training; setting it too high will cause the GPU to go OOM. (if using `fp16`, you can increase the batch size more safely)
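
The parameters listed above can be collected in one object and expanded into keyword arguments, which keeps a run's settings together. This is a minimal sketch: `TrainConfig` is a hypothetical helper, its default values are placeholders chosen for illustration, and the checks are basic sanity rules rather than anything aitextgen enforces.

```python
from dataclasses import dataclass, asdict


@dataclass
class TrainConfig:
    """Training settings named after the `train()` parameters described above."""
    num_steps: int = 10
    learning_rate: float = 1e-5
    batch_size: int = 1
    fp16: bool = False
    save_gdrive: bool = True
    line_by_line: bool = False
    from_cache: bool = False

    def __post_init__(self):
        # Fail early on values that cannot describe a valid run.
        if self.num_steps < 1:
            raise ValueError("num_steps must be at least 1")
        if self.batch_size < 1:
            raise ValueError("batch_size must be at least 1")
        if self.learning_rate <= 0:
            raise ValueError("learning_rate must be positive")


config = TrainConfig(num_steps=3000, fp16=True, batch_size=2)
train_kwargs = asdict(config)
print(train_kwargs)
```

With a model `ai` and a dataset `data` already defined, the settings would be passed as `ai.train(train_data=data, **train_kwargs)`.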

In [None]:
# train parameters
block_size = 32 # default 1024
num_steps = 10 #  3000
line_by_line=False
from_cache=False
#generate_every=1000
#save_every=1000
save_gdrive=True
learning_rate=1e-5 #1e-3,
fp16=False
batch_size=1

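The following cell loops through the generate / train steps: each iteration generates texts with the current model, writes them to a file, and then finetunes the model on that file.

Important parameter for the loop:

- **`num_loops`**: Number of generate / train iterations.

To do:
- Make efficient use of the generated text.
- Determine the optimal amount of text per loop.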
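
The control flow of such a generate / train loop can be sketched in a form that runs without a GPU. Here `run_loops`, `fake_generate` and `fake_train` are hypothetical stand-ins introduced for illustration; in the notebook, the two callables would wrap `ai.generate_to_file(...)` and `ai.train(...)`.

```python
import os
import tempfile


def run_loops(generate_fn, train_fn, output_dir, num_loops=3):
    """Alternate generation and training, feeding each output file back in."""
    os.makedirs(output_dir, exist_ok=True)
    history = []
    for k in range(num_loops):
        text_path = os.path.join(output_dir, f"generated_{k}.txt")
        model_dir = os.path.join(output_dir, f"trained_model_{k}")
        generate_fn(text_path)          # write samples from the current model
        train_fn(text_path, model_dir)  # finetune on exactly those samples
        history.append((text_path, model_dir))
    return history


# Stand-ins so the loop structure can be checked without a model.
def fake_generate(path):
    with open(path, "w") as f:
        f.write("sample text<|endoftext|>")


trained_on = []


def fake_train(path, model_dir):
    trained_on.append(path)


with tempfile.TemporaryDirectory() as tmp:
    history = run_loops(fake_generate, fake_train, tmp, num_loops=3)

print(len(history), "loops completed")
```

Passing the steps in as callables keeps the file naming and ordering testable separately from the expensive model calls.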


In [None]:
input_main = '/content/drive/MyDrive/22_DRG/DRG_BiasEvolution/data/toxic_prompts'

fname = 'least_tox_txt'  #'most_tox_txt' 'least_tox_txt'
fp = os.path.join(input_main, f'{fname}.json')
prompts = pd.read_json(fp)['text']
print(prompts[:5])

16073    PALM SPRINGS, Calif. — For years — starting wi...
11422    But as American novelist William Faulkner wrot...
18498    Spoonhour was diagnosed in 2010 with idiopathi...
8678     The document in question — sent by the Democra...
13762    One afternoon in July 1932, Francis Pasqua, Da...
Name: text, dtype: object


In [None]:
num_loops = 3
nr_texts = 50  # 500
run_nr = 1
date_str = datetime.utcnow().strftime('%Y%m%d_%H%M')
main_dir = '/content/drive/MyDrive/22_DRG/DRG_BiasEvolution/gcolab/output'
run_name = f'{date_str}_{fname}'
output_dir = os.path.join(main_dir, run_name)

# Create the run folder on Google Drive if it does not exist yet.
os.makedirs(output_dir, exist_ok=True)

for k in range(num_loops):
  # generate: write nr_texts samples from the current model to a file
  print('loop:', k)
  fp = os.path.join(output_dir, f"{date_str}_{seed}_{k}.txt")
  ai.generate_to_file(
      max_length=max_length,
      temperature=temperature,
      top_p=top_p,
      seed=seed,
      repetition_penalty=repetition_penalty,
      n=nr_texts,
      batch_size=nr_texts,  # to generate in parallel
      destination_path=fp,
      sample_delim='<|endoftext|>',
  )

  # train: finetune the model on the texts it just generated
  data = TokenDataset(fp, block_size=block_size)
  model_fp = os.path.join(output_dir, f'trained_model_{k}')
  print('train loop:', k)
  ai.train(
      train_data=data,
      line_by_line=line_by_line,
      from_cache=from_cache,
      num_steps=num_steps,
      #generate_every=generate_every,
      #save_every=save_every,
      save_gdrive=save_gdrive,
      learning_rate=learning_rate,
      fp16=fp16,
      batch_size=batch_size,
      output_dir=model_fp,
  )





INFO:aitextgen:Generating 50 texts to /content/drive/MyDrive/22_DRG/DRG_BiasEvolution/gcolab/output/20221031_0207_least_tox_txt/20221031_0207_77_0.txt


loop: 0


  0%|          | 0/50 [00:00<?, ?it/s]

A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.

  0%|          | 0/682 [00:00<?, ?it/s]

INFO:aitextgen.TokenDataset:Encoding 682 sets of tokens from /content/drive/MyDrive/22_DRG/DRG_BiasEvolution/gcolab/output/20221031_0207_least_tox_txt/20221031_0207_77_0.txt.


train loop: 0


  f"Setting `Trainer(gpus={gpus!r})` is deprecated in v1.7 and will be removed"
INFO:pytorch_lightning.utilities.rank_zero:GPU available: True (cuda), used: True
INFO:pytorch_lightning.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO:pytorch_lightning.utilities.rank_zero:IPU available: False, using: 0 IPUs
INFO:pytorch_lightning.utilities.rank_zero:HPU available: False, using: 0 HPUs
  f"The `Callback.{hook}` hook was deprecated in v1.6 and"
INFO:pytorch_lightning.accelerators.cuda:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


  0%|          | 0/10 [00:00<?, ?it/s]

INFO:pytorch_lightning.utilities.rank_zero:`Trainer.fit` stopped: `max_steps=10` reached.
INFO:aitextgen:Saving trained model pytorch_model.bin to //content/drive/MyDrive/22_DRG/DRG_BiasEvolution/gcolab/output/20221031_0207_least_tox_txt
INFO:aitextgen:Generating 50 texts to /content/drive/MyDrive/22_DRG/DRG_BiasEvolution/gcolab/output/20221031_0207_least_tox_txt/20221031_0207_77_1.txt


loop: 1


  0%|          | 0/50 [00:00<?, ?it/s]

A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.

  0%|          | 0/949 [00:00<?, ?it/s]

INFO:aitextgen.TokenDataset:Encoding 949 sets of tokens from /content/drive/MyDrive/22_DRG/DRG_BiasEvolution/gcolab/output/20221031_0207_least_tox_txt/20221031_0207_77_1.txt.


train loop: 1


INFO:pytorch_lightning.utilities.rank_zero:GPU available: True (cuda), used: True
INFO:pytorch_lightning.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO:pytorch_lightning.utilities.rank_zero:IPU available: False, using: 0 IPUs
INFO:pytorch_lightning.utilities.rank_zero:HPU available: False, using: 0 HPUs
INFO:pytorch_lightning.accelerators.cuda:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


  0%|          | 0/10 [00:00<?, ?it/s]

INFO:pytorch_lightning.utilities.rank_zero:`Trainer.fit` stopped: `max_steps=10` reached.
INFO:aitextgen:Saving trained model pytorch_model.bin to //content/drive/MyDrive/22_DRG/DRG_BiasEvolution/gcolab/output/20221031_0207_least_tox_txt
INFO:aitextgen:Generating 50 texts to /content/drive/MyDrive/22_DRG/DRG_BiasEvolution/gcolab/output/20221031_0207_least_tox_txt/20221031_0207_77_2.txt


loop: 2


  0%|          | 0/50 [00:00<?, ?it/s]

A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.

  0%|          | 0/847 [00:00<?, ?it/s]

INFO:aitextgen.TokenDataset:Encoding 847 sets of tokens from /content/drive/MyDrive/22_DRG/DRG_BiasEvolution/gcolab/output/20221031_0207_least_tox_txt/20221031_0207_77_2.txt.


train loop: 2


INFO:pytorch_lightning.utilities.rank_zero:GPU available: True (cuda), used: True
INFO:pytorch_lightning.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO:pytorch_lightning.utilities.rank_zero:IPU available: False, using: 0 IPUs
INFO:pytorch_lightning.utilities.rank_zero:HPU available: False, using: 0 HPUs
INFO:pytorch_lightning.accelerators.cuda:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


  0%|          | 0/10 [00:00<?, ?it/s]

INFO:pytorch_lightning.utilities.rank_zero:`Trainer.fit` stopped: `max_steps=10` reached.
INFO:aitextgen:Saving trained model pytorch_model.bin to //content/drive/MyDrive/22_DRG/DRG_BiasEvolution/gcolab/output/20221031_0207_least_tox_txt


In [None]:
ai.generate(prompt='Romeo:')

# LICENSE

MIT License

Copyright (c) 2020-2021 Max Woolf

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.