# finetuning gpt-2 for *ghosts of data past (wt)* locally
by [zeno gries](https://zenogries.com)

*last updated 3/24/2022*

using `aitextgen` ([github repository](https://github.com/minimaxir/aitextgen), [documentation](https://docs.aitextgen.io/))



### installing and loading libraries

In [7]:
import os
from aitextgen import aitextgen

# turn off parallelism in the huggingface tokenizers library to avoid its fork warnings
os.environ["TOKENIZERS_PARALLELISM"] = "false"

### checking for gpu

check that the gpu is working and how much vram is available:

In [2]:
!nvidia-smi

Wed Jul 13 18:16:42 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.129.06   Driver Version: 470.129.06   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  NVIDIA GeForce ...  Off  | 00000000:08:00.0  On |                  N/A |
|  0%   41C    P5    18W / 170W |    645MiB / 12050MiB |     31%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces
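
the same check can be done from python, which is handy when a later cell should depend on the free vram. this is a minimal sketch that assumes `nvidia-smi` is on the `PATH`; the helper names `parse_gpu_rows` and `query_gpus` are made up for illustration and are not part of aitextgen.

```python
import shutil
import subprocess

# nvidia-smi can print selected fields as plain CSV (values in MiB with nounits)
QUERY = ["nvidia-smi",
         "--query-gpu=name,memory.total,memory.used",
         "--format=csv,noheader,nounits"]


def parse_gpu_rows(text):
    """Turn 'name, total, used' CSV rows into dicts with memory values in MiB."""
    gpus = []
    for line in text.strip().splitlines():
        try:
            # split from the right so a comma inside the name would not matter
            name, total, used = [part.strip() for part in line.rsplit(",", 2)]
            total_mib, used_mib = int(total), int(used)
        except ValueError:
            continue  # skip rows that do not have the expected shape
        gpus.append({"name": name,
                     "total_mib": total_mib,
                     "used_mib": used_mib,
                     "free_mib": total_mib - used_mib})
    return gpus


def query_gpus():
    """Return one dict per gpu, or an empty list if nvidia-smi is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(QUERY, capture_output=True, text=True, check=False)
    if result.returncode != 0:
        return []
    return parse_gpu_rows(result.stdout)


for gpu in query_gpus():
    print(f"{gpu['name']}: {gpu['free_mib']} MiB free of {gpu['total_mib']} MiB")
```

an empty result means that no gpu was found this way; it says nothing about why.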

### loading the model

a choice of models:

* english:
  * **`distilgpt2`**: smallest (and probably fastest) gpt2 model with 82m parameters. 
  * **`gpt2`**: 124m parameters. standard gpt-2 implementation (there are also **`gpt2-medium`**, **`gpt2-large`** and **`gpt2-xl`**).
  * **`EleutherAI/gpt-neo-125M`**: 125m parameters. gpt-neo seems to be newer and better suited for longer texts.
* german:
  * **`dbmdz/german-gpt2`**: unknown number of parameters. most popular german model.

In [8]:
ai = aitextgen(model='EleutherAI/gpt-neo-125M', to_gpu=True)

loading file https://huggingface.co/EleutherAI/gpt-neo-125M/resolve/main/vocab.json from cache at aitextgen/08c00c4159e921d4c941ac75732643373aba509d9b352a82bbbb043a94058d98.a552555fdda56a1c7c9a285bccfd44ac8e4b9e26c8c9b307831b3ea3ac782b45
loading file https://huggingface.co/EleutherAI/gpt-neo-125M/resolve/main/merges.txt from cache at aitextgen/12305762709d884a770efe7b0c68a7f4bc918da44e956058d43da0d12f7bea20.5d12962c5ee615a4c803841266e9c3be9a691a924f72d395d3a6c6c81157788b
loading file https://huggingface.co/EleutherAI/gpt-neo-125M/resolve/main/tokenizer.json from cache at None
loading file https://huggingface.co/EleutherAI/gpt-neo-125M/resolve/main/added_tokens.json from cache at None
loading file https://huggingface.co/EleutherAI/gpt-neo-125M/resolve/main/special_tokens_map.json from cache at aitextgen/6c3239a63aaf46ec7625b38abfe41fc2ce0b25f90800aefe6526256340d4ab6d.2b8bf81243d08385c806171bc7ced6d2a0dcc7f896ca637f4e777418f7f0cc3c
loading file https://huggingface.co/EleutherAI/gpt-neo-1

alternatively, if a model has already been finetuned and finetuning should continue, load it from its folder instead.

In [5]:
directory = os.path.join('trained_model')
ai = aitextgen(model_folder=directory, to_gpu=True)
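
before resuming, it can help to check that the folder really contains a saved model. the sketch below assumes that a saved model consists of `pytorch_model.bin` (the file named in the training log) and `config.json`; `has_saved_model` is a hypothetical helper, not an aitextgen function.

```python
import os

# assumption: these two files make up a saved model folder
REQUIRED_FILES = ("pytorch_model.bin", "config.json")


def has_saved_model(folder):
    """True if every required model file exists in `folder`."""
    return all(os.path.isfile(os.path.join(folder, name))
               for name in REQUIRED_FILES)


directory = "trained_model"
if has_saved_model(directory):
    print(f"found a saved model in {directory!r}; finetuning can resume from it")
else:
    print(f"no saved model in {directory!r}; load a pretrained model instead")
```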

### set a text (or csv) file for finetuning

In [10]:
file_path = os.path.join('..', 'for_training.txt')

### finetune gpt-2

the next cell will start the actual finetuning. it runs for `num_steps`.

the model will be saved every `save_every` steps in `trained_model` by default, and when training completes.

important parameters for `train()`:

- **`line_by_line`**: set this to `True` if the input text file is a single-column CSV, with one record per row. aitextgen will automatically process it optimally.
- **`from_cache`**: if you compressed your dataset locally beforehand and are using that cache file, set this to `True`.
- **`num_steps`**: number of steps to train the model for.
- **`generate_every`**: interval of steps to generate example text from the model; good for qualitatively validating training.
- **`save_every`**: interval of steps to save the model; the model is saved to the `trained_model` folder.
- **`save_gdrive`**: set this to `True` to copy the model to a unique folder in your google drive. this only makes sense in google colab with a mounted drive, not when training locally.
- **`fp16`**: enables half-precision training for faster/more memory-efficient training. only works on a t4 or v100 gpu.

here are other important parameters for `train()` that are useful but you likely do not need to change.

- **`learning_rate`**: learning rate of the model training.
- **`batch_size`**: batch size of the model training; setting it too high will cause the gpu to go oom. (if using `fp16`, you can increase the batch size more safely)
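
to get a feeling for how `num_steps`, `save_every`, `generate_every` and `batch_size` interact, the arithmetic can be written down as a small sketch. `milestone_steps` is a hypothetical helper; it assumes that saving and sampling happen at every multiple of the interval (as the training log suggests) and that each step processes exactly one batch.

```python
def milestone_steps(num_steps, every):
    """Steps at which an every-n-steps action (saving, sampling) would fire."""
    if every <= 0:
        # assumption for this sketch: a non-positive interval means "never"
        return []
    return list(range(every, num_steps + 1, every))


num_steps, batch_size = 700, 1

save_steps = milestone_steps(num_steps, every=100)
sample_steps = milestone_steps(num_steps, every=100)

# with one batch per step, this is how many training samples the model sees
samples_seen = num_steps * batch_size

print("model saved at steps:", save_steps)
print("sample texts at steps:", sample_steps)
print("training samples seen:", samples_seen)
```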

In [11]:
# 500 - 700 steps seem to be a good range
ai.train(file_path,
         line_by_line=False,
         from_cache=False,
         num_steps=700,
         generate_every=100,
         save_every=100,
         learning_rate=1e-3,
         fp16=False,
         batch_size=1)

100%|██████████| 1086/1086 [00:00<00:00, 29242.11it/s]
pytorch_model.bin already exists in /trained_model and will be overwritten!
  rank_zero_deprecation(
  rank_zero_deprecation(
  rank_zero_deprecation(
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
  rank_zero_deprecation(
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


  0%|          | 0/700 [00:00<?, ?it/s]

  rank_zero_deprecation(


100 steps reached: saving model to /trained_model
100 steps reached: generating sample texts.
                                                                                               
[scientist] Well it's so hard to believe that I've been wrong in my perspective.
[scientist] I know it well, it's a lot of small, not being but a little afraid.
[artist] Fear is not really conrete.
[scientist] I'm afraid of the dark.
[scientist] I Fear is not really conrete.
[scientist] I can see it every now and then, too. I can hear an airplane flying outside.
[artist] I was told not to ask too much right now.
[scientist] I'm feeling sleepy. I'm feeling a bit breathless.
[artist] I just realised that I stopped breathinging for a moment.
[scientist] Exciting not knowing where we are heading.
[scientist] I am here. I feel impatient about it. Concerned, but relieved to be still trying.
[scientist] I am g

### test the model

testing the model with a custom prompt

In [1]:
prompt = '[artist] I can\'t believe'
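
the prompt follows the same `[speaker] text` layout as the sample texts printed during training. assuming that every turn sits on its own line in this layout, generated text could be split into turns with a sketch like the following; `parse_dialogue` is a hypothetical helper and lines without a leading tag are simply dropped.

```python
import re

# a turn is a line that starts with a tag in square brackets, e.g. "[artist]"
TURN = re.compile(r"^\[(\w+)\]\s*(.*)$")


def parse_dialogue(text):
    """Split generated text into (speaker, line) tuples, one per tagged line."""
    turns = []
    for line in text.splitlines():
        match = TURN.match(line.strip())
        if match:
            turns.append((match.group(1), match.group(2)))
    return turns


sample = "[artist] I can't believe\n[scientist] I am here.\n[scientist]"
for speaker, line in parse_dialogue(sample):
    print(f"{speaker:>10}: {line}")
```

a tag with nothing after it yields an empty line for that speaker, which makes such turns easy to filter out afterwards.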

parameters for generation:

* **`n`**: number of texts generated.
* **`max_length`**: maximum length of the generated text (default: 200; for gpt-2, the maximum is 1024; for gpt neo, the maximum is 2048)
* **`prompt`**: prompt that starts the generated text and is included in the generated text.
* **`temperature`**: controls the "craziness" of the text (default: 0.7)
* **`top_k`**: if nonzero, limits the sampled tokens to the top k values. (default: 0)
* **`top_p`**: if nonzero, limits the sampled tokens to the smallest set of most likely tokens whose cumulative probability reaches `top_p` (nucleus sampling).
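
what `temperature`, `top_k` and `top_p` do to the next-token distribution can be illustrated without any model. this is a simplified sketch in plain python, not the code that aitextgen or transformers actually run; `softmax` and `filter_tokens` are names chosen for this example.

```python
import math


def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; a lower temperature sharpens them."""
    scaled = [x / temperature for x in logits]
    highest = max(scaled)
    exps = [math.exp(x - highest) for x in scaled]  # shift for numeric stability
    total = sum(exps)
    return [e / total for e in exps]


def filter_tokens(probs, top_k=0, top_p=0.0):
    """Return {token index: renormalized probability} of the surviving tokens."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k > 0:
        # keep only the k most likely tokens
        ranked = ranked[:top_k]
    if 0.0 < top_p < 1.0:
        # keep the smallest prefix whose cumulative probability reaches top_p
        mass = sum(probs[i] for i in ranked)
        kept, cumulative = [], 0.0
        for i in ranked:
            kept.append(i)
            cumulative += probs[i] / mass
            if cumulative >= top_p:
                break
        ranked = kept
    total = sum(probs[i] for i in ranked)
    return {i: probs[i] / total for i in ranked}


logits = [2.0, 1.0, 0.0, -1.0]  # made-up scores for four candidate tokens
probs = softmax(logits, temperature=0.7)
print(filter_tokens(probs, top_p=0.7))
```

the next token is then sampled from the surviving tokens only, so a lower `top_p` or `top_k` makes the text more predictable.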

enabling the following parameters may slow down generation.

* **`num_beams`**: if greater than 1, executes beam search for cleaner text.
* **`repetition_penalty`**: if greater than 1.0, penalizes repetition in a text to avoid infinite loops.
* **`length_penalty`**: if greater than 1.0, penalizes text proportional to the length
* **`no_repeat_ngram_size`**: if greater than 0, no sequence of that many tokens may appear twice in the generated text.
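
the idea behind `no_repeat_ngram_size` can be sketched as follows: look at the text generated so far and forbid every next token that would complete an n-gram that already occurred. the sketch uses words instead of token ids for readability, and `banned_next_tokens` is a hypothetical helper rather than a library function.

```python
def banned_next_tokens(tokens, ngram_size):
    """Tokens that would complete an n-gram already present in `tokens`."""
    if ngram_size <= 0 or len(tokens) < ngram_size:
        return set()
    # the last n-1 tokens are the start of the n-gram that would come next
    if ngram_size > 1:
        prefix = tuple(tokens[len(tokens) - (ngram_size - 1):])
    else:
        prefix = ()
    banned = set()
    for start in range(len(tokens) - ngram_size + 1):
        ngram = tokens[start:start + ngram_size]
        if tuple(ngram[:-1]) == prefix:
            banned.add(ngram[-1])
    return banned


generated = ["I'm", "not", "sure", ".", "I'm", "not"]
print(banned_next_tokens(generated, ngram_size=3))
```

here the text ends in the first two words of a three-word sequence that was already generated, so the word that would complete the repetition is excluded from sampling.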


In [6]:
ai.generate_one(prompt=prompt,
                max_length=256,
                temperature=0.7,
                top_p=0.7)

"[artist] I can't believe that there?\n[scientist] I am here.\n[scientist] [image] I'm not sure about.\n[scientist] What are here.\n[scientist] I am here.\n[scientist] I'm not either.\n[scientist] I am not sure.\n[scientist] I can sleep, not sure about.\n[scientist] I'm not sure.\n[scientist] What are here.\n[scientist] No, too much.\n[scientist] [image] I'm not having a ghost.\n[scientist]\n[scientist] I'm just have seen one?\n[scientist] I'm not really have seen one I'm alone, too much more questions.\n[scientist] I have seen one?\n[scientist] I'm not sure.\n[scientist] [image]\n[scientist] I'm not sure.\n[scientist] I'm not having a ghost.\n[scientist] Yes, tell me.\n[scientist] I think so...\n[scientist] Are you there?\n[scientist] I'm not having a ghost,"