# finetuning gpt-2 for *ghosts of data past (wt)* locally
by [zeno gries](https://zenogries.com)

*last updated 3/24/2022*

using `aitextgen` ([github repository](https://github.com/minimaxir/aitextgen), [documentation](https://docs.aitextgen.io/))



### installing and loading libraries

In [1]:
import os
from aitextgen import aitextgen

os.environ["TOKENIZERS_PARALLELISM"] = "false"

### checking for gpu

to check whether the gpu is working and how much vram is available:

In [2]:
!nvidia-smi

Thu Sep 22 13:13:24 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.141.03   Driver Version: 470.141.03   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  NVIDIA GeForce ...  Off  | 00000000:05:00.0  On |                  N/A |
|  0%   45C    P5    32W / 170W |   1003MiB / 12050MiB |     25%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces
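
the same numbers can also be read from python, which is handy for checking free vram before loading a model. this is a minimal sketch, assuming `nvidia-smi` is on the path. `parse_gpu_memory` and `query_gpus` are hypothetical helper names and not part of aitextgen.

```python
import shutil
import subprocess


def parse_gpu_memory(csv_text):
    """Parse 'name, total, used' rows (values in MiB) from nvidia-smi csv output."""
    gpus = []
    for line in csv_text.strip().splitlines():
        # split from the right so a comma inside the gpu name cannot break parsing
        name, total, used = [field.strip() for field in line.rsplit(",", 2)]
        gpus.append({
            "name": name,
            "total_mib": int(total),
            "used_mib": int(used),
            "free_mib": int(total) - int(used),
        })
    return gpus


def query_gpus():
    """Return parsed gpu info, or an empty list if nvidia-smi is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=name,memory.total,memory.used",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=False)
    if result.returncode != 0:
        return []
    return parse_gpu_memory(result.stdout)


for gpu in query_gpus():
    print(f"{gpu['name']}: {gpu['free_mib']} MiB free of {gpu['total_mib']} MiB")
```

for the table above (1003MiB used of 12050MiB), this would report 11047 MiB free.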

### loading the model

a choice of models:

* english:
  * **`distilgpt2`**: smallest (and probably fastest) gpt2 model with 82m parameters. 
  * **`gpt2`**: 124m parameters. standard gpt-2 implementation (there are also **`gpt2-medium`**, **`gpt2-large`**, and **`gpt2-xl`**).
  * **`EleutherAI/gpt-neo-125M`**: 125m parameters. gpt-neo is newer and accepts longer texts (up to 2048 tokens, versus 1024 for gpt-2).
* german:
  * **`dbmdz/german-gpt2`**: unknown number of parameters. most popular german model.

In [3]:
ai = aitextgen(model='gpt2-large', to_gpu=True)

Downloading: 100%|██████████| 1.42G/1.42G [03:29<00:00, 7.27MB/s]


alternatively, to continue finetuning a model that has already been finetuned, load it from its folder:

In [21]:
directory = os.path.join('trained_model_03_1000')
ai = aitextgen(model_folder=directory, to_gpu=True)
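
before loading, it can help to check that the folder really contains a saved model. this is a small sketch, assuming the folder should hold `pytorch_model.bin` and `config.json` (the two files aitextgen writes when saving, as far as its documentation describes). `missing_model_files` is a hypothetical helper, not an aitextgen function.

```python
import os

# assumption: these are the files a saved aitextgen model folder contains
REQUIRED_FILES = ("pytorch_model.bin", "config.json")


def missing_model_files(folder):
    """Return the required files that are absent from `folder`."""
    return [name for name in REQUIRED_FILES
            if not os.path.isfile(os.path.join(folder, name))]


directory = "trained_model_03_1000"
missing = missing_model_files(directory)
if missing:
    print(f"{directory} is missing: {', '.join(missing)}")
else:
    print(f"{directory} looks loadable")
```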

### set a text (or csv) file for finetuning

In [4]:
file_path = os.path.join('..', 'for_training.txt')

### finetune gpt-2

the next cell will start the actual finetuning. it runs for `num_steps`.

the model will be saved every `save_every` steps in `trained_model` by default, and when training completes.

important parameters for `train()`:

- **`line_by_line`**: set this to `True` if the input text file is a single-column CSV, with one record per row. aitextgen will automatically process it optimally.
- **`from_cache`**: if you compressed your dataset into a cache file beforehand and are training from that file, set this to `True`.
- **`num_steps`**: number of steps to train the model for.
- **`generate_every`**: interval of steps to generate example text from the model; good for qualitatively validating training.
- **`save_every`**: interval of steps at which to save the model. when running locally, the model is saved to the `trained_model` folder.
- **`save_gdrive`**: only relevant on google colab. set this to `True` to copy the model to a unique folder in your google drive, if you have mounted it.
- **`fp16`**: enables half-precision training, which is faster and more memory-efficient. only works on a t4 or v100 gpu.

other parameters for `train()` that are useful, but that you likely do not need to change:

- **`learning_rate`**: learning rate of the model training.
- **`batch_size`**: batch size of the model training. setting it too high will cause the gpu to run out of memory (oom). if using `fp16`, you can increase the batch size more safely.
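
to see how `num_steps`, `save_every`, and `generate_every` interact, here is a small sketch. it assumes an interval fires whenever the step count is a multiple of it, and that an interval of 0 disables it. `training_schedule` is a hypothetical helper, not part of aitextgen.

```python
def training_schedule(num_steps, save_every, generate_every):
    """Return (save_steps, sample_steps) reached during one training run."""
    def multiples(interval):
        # assumption: an interval of 0 (or less) switches the feature off
        if interval <= 0:
            return []
        return list(range(interval, num_steps + 1, interval))

    return multiples(save_every), multiples(generate_every)


saves, samples = training_schedule(num_steps=800, save_every=500, generate_every=500)
print("intermediate saves at steps:", saves)
print("sample texts at steps:", samples)
```

with 800 steps and both intervals at 500, only step 500 triggers a save and a sample. the remaining 300 steps are only covered by the final save when training completes.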

In [5]:
ai.train(file_path,
         line_by_line=False,
         from_cache=False,
         num_steps=800,
         generate_every=500,
         save_every=500,
         learning_rate=1e-3,
         fp16=False,
         batch_size=1)

100%|██████████| 1103/1103 [00:00<00:00, 29330.98it/s]
  rank_zero_deprecation(
  rank_zero_deprecation(
  rank_zero_deprecation(
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
  rank_zero_deprecation(
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


  0%|          | 0/800 [00:00<?, ?it/s]

  rank_zero_deprecation(


RuntimeError: CUDA out of memory. Tried to allocate 64.00 MiB (GPU 0; 11.77 GiB total capacity; 9.13 GiB already allocated; 65.94 MiB free; 9.15 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
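
this out-of-memory error is consistent with a rough back-of-the-envelope estimate. the sketch below assumes fp32 training with an adam-style optimizer, so about 16 bytes per parameter (4 for the weight, 4 for the gradient, 8 for two optimizer moments), and it ignores activations entirely. the parameter counts for `gpt2-medium`, `gpt2-large`, and `gpt2-xl` are the commonly cited sizes and do not come from this notebook.

```python
# parameter counts in millions; the last three are commonly cited values (assumption)
PARAMS_MILLIONS = {
    "distilgpt2": 82,
    "gpt2": 124,
    "gpt2-medium": 355,
    "gpt2-large": 774,
    "gpt2-xl": 1558,
}

# assumption: fp32 weight + fp32 gradient + two fp32 optimizer moments
BYTES_PER_PARAM = 16


def training_memory_gib(params_millions, bytes_per_param=BYTES_PER_PARAM):
    """Rough memory for weights, gradients, and optimizer state, in GiB."""
    return params_millions * 1e6 * bytes_per_param / 2**30


GPU_GIB = 11.77  # total capacity reported in the error message

for name, millions in PARAMS_MILLIONS.items():
    need = training_memory_gib(millions)
    headroom = GPU_GIB - need
    print(f"{name:12s} needs ~{need:5.1f} GiB, leaving {headroom:6.1f} GiB for activations")
```

by this estimate `gpt2-large` leaves almost no room for activations on a 12 gb card, so a smaller model such as `gpt2` or `gpt2-medium` is the more realistic choice here.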

### test the model

testing the model with a custom prompt

In [3]:
prompt = '[artist] Are you still there?'

parameters for generation:

* **`n`**: number of texts generated.
* **`max_length`**: maximum length of the generated text (default: 200; for gpt-2, the maximum is 1024; for gpt neo, the maximum is 2048)
* **`prompt`**: prompt that starts the generated text and is included in the generated text.
* **`temperature`**: controls the randomness ("craziness") of the text; higher values give more varied output (default: 0.7)
* **`top_k`**: if nonzero, limits sampling to the k most likely tokens. (default: 0)
* **`top_p`**: if nonzero, limits sampling to the smallest set of most likely tokens whose cumulative probability reaches `top_p`
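
the effect of `temperature`, `top_k`, and `top_p` can be sketched on a toy distribution. this is a simplified pure-python illustration, not aitextgen's or transformers' actual code: the logits are made up, and the function names are hypothetical.

```python
import math


def apply_temperature(logits, temperature):
    """Softmax over logits divided by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def filter_top_k_top_p(probs, top_k=0, top_p=0.0):
    """Return the indices of tokens that survive top-k, then top-p filtering."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k > 0:
        order = order[:top_k]
    if top_p > 0.0:
        # renormalize over the tokens that survived top-k
        mass = sum(probs[i] for i in order)
        kept, cumulative = [], 0.0
        for i in order:
            kept.append(i)
            cumulative += probs[i] / mass
            if cumulative >= top_p:
                break
        order = kept
    return order


toy_logits = [2.0, 1.0, 0.5, 0.0, -1.0]  # made-up scores for five tokens
probs = apply_temperature(toy_logits, temperature=0.7)
print("probabilities:", [round(p, 3) for p in probs])
print("kept with top_k=6, top_p=0.6:", filter_top_k_top_p(probs, top_k=6, top_p=0.6))
```

with these made-up logits, `top_p=0.6` leaves only the single most likely token, because that token alone already carries more than 60% of the probability. settings this tight make the output close to deterministic.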

enabling the following parameters may slow down generation.

* **`num_beams`**: if greater than 1, executes beam search for cleaner text.
* **`repetition_penalty`**: if greater than 1.0, penalizes repetition in a text to avoid infinite loops.
* **`length_penalty`**: changes how strongly text length counts in the beam search score, so it only matters when `num_beams` is greater than 1.
* **`no_repeat_ngram_size`**: if greater than 0, no sequence of that many tokens (n-gram) may appear twice in the generated text.
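
the idea behind `no_repeat_ngram_size` can be sketched as follows: before each new token is chosen, every token that would complete an already-seen n-gram is banned. this is a simplified illustration that uses words instead of real token ids; `banned_next_tokens` is a hypothetical helper, not a library function.

```python
def banned_next_tokens(tokens, n):
    """Tokens that would complete an n-gram already present in `tokens`."""
    if n <= 0 or len(tokens) < n:
        return set()
    # the last n-1 tokens are the start of the n-gram being built
    prefix = tuple(tokens[len(tokens) - (n - 1):]) if n > 1 else ()
    banned = set()
    for i in range(len(tokens) - n + 1):
        if tuple(tokens[i:i + n - 1]) == prefix:
            banned.add(tokens[i + n - 1])
    return banned


history = ["i'm", "still", "here.", "i'm", "still"]
print(banned_next_tokens(history, n=3))
```

here the text so far ends in "i'm still", which was followed by "here." earlier, so "here." is banned as the next token and the loop is broken.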


In [4]:
ai.generate_one(prompt=prompt,
                max_length=256,
                temperature=0.7,
                top_p=0.6,
                top_k=6)

"[artist] Are you still there? [artist] I'm still here. [artist] I'm still here. [artist] I'm still here. [artist] I'm still here. [artist] I'm still here. [artist] I'm still here. [artist] I'm still here. [artist] I'm still here. [artist] I'm still here. [artist] I'm still here. [artist] I'm still here. [artist] I'm still here. [artist] I'm still here. [artist] I'm still here. [artist] I'm still here. [artist] I'm still here. [artist] I'm still here. [artist] I'm still here. [artist] I'm still here. [artist] I'm still here. [artist] I'm still here. [artist] I'm still here. [artist] I'm still here. [artist] I'm still here. [artist] I'm still here. [artist] I'm still here. [artist] I'm still here. [artist] I'm still here. [artist] I'm still here. [artist] I'm still here. [artist] I'm still here."