# finetuning gpt-2 for *ghosts of data past (wt)* locally
by [zeno gries](https://zenogries.com)

*last updated 3/24/2022*

using `aitextgen` ([github repository](https://github.com/minimaxir/aitextgen), [documentation](https://docs.aitextgen.io/))



### installing and loading libraries

In [1]:
import os

# set this before importing aitextgen so the tokenizers library sees it from the start
os.environ["TOKENIZERS_PARALLELISM"] = "false"

from aitextgen import aitextgen

### checking for gpu

to check that the gpu is working and how much vram is available

In [2]:
!nvidia-smi

Fri Apr  1 07:41:25 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.103.01   Driver Version: 470.103.01   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  NVIDIA GeForce ...  Off  | 00000000:08:00.0  On |                  N/A |
|  0%   37C    P8    19W / 170W |    618MiB / 12050MiB |     69%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces
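
the same check can be done from python. this is a minimal sketch, assuming pytorch is installed (aitextgen depends on it); `gpu_summary` is a hypothetical helper name, not part of aitextgen.

```python
def gpu_summary():
    """Return basic GPU facts via PyTorch, or a note if PyTorch is missing."""
    try:
        import torch  # assumption: installed as a dependency of aitextgen
    except ImportError:
        return {"torch_installed": False, "cuda_available": False}

    info = {"torch_installed": True, "cuda_available": torch.cuda.is_available()}
    if info["cuda_available"]:
        props = torch.cuda.get_device_properties(0)  # first visible gpu
        info["name"] = props.name
        info["total_vram_mib"] = props.total_memory // (1024 ** 2)
    return info


print(gpu_summary())
```

unlike `nvidia-smi`, this only reports total vram, not how much is currently in use by other processes.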

### loading the model

a choice of models:

* english:
  * **`distilgpt2`**: smallest (and probably fastest) gpt2 model with 82m parameters. 
  * **`gpt2`**: 124m parameters. the standard gpt-2 model (there are also **`gpt2-medium`**, **`gpt2-large`**, **`gpt2-xl`**).
  * **`EleutherAI/gpt-neo-125M`**: 125m parameters. gpt-neo is newer and seems better suited for longer texts (maximum length of 2048 instead of 1024).
* german:
  * **`dbmdz/german-gpt2`**: unknown number of parameters. most popular german model.
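
where the number of parameters is unknown, it can be counted once the model is loaded. a small sketch, assuming the loaded model is a regular pytorch module; `count_parameters` and `format_millions` are hypothetical helpers.

```python
def count_parameters(model):
    """Sum the element counts of all parameter tensors of a PyTorch module."""
    return sum(p.numel() for p in model.parameters())


def format_millions(n):
    """Format a raw count in the style used in the list above, e.g. '82m'."""
    return f"{n / 1e6:.0f}m"


# usage, assuming `ai` was created with aitextgen(...) and exposes the
# underlying pytorch module as `ai.model`:
# print(format_millions(count_parameters(ai.model)))
```

the result can differ slightly from published figures, depending on how shared weights are counted.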

In [3]:
ai = aitextgen(model='distilgpt2', to_gpu=True)

Downloading: 100%|██████████| 762/762 [00:00<00:00, 1.49MB/s]
Downloading: 100%|██████████| 336M/336M [00:49<00:00, 7.15MB/s] 


alternatively, to continue finetuning a model that has already been finetuned, load it from its folder instead.

In [None]:
directory = os.path.join('..', 'models', 'MODEL_NAME')
ai = aitextgen(model_folder=directory, to_gpu=True)
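
before loading, it can help to check that the folder is complete. a minimal sketch, assuming the folder was written by aitextgen and should contain `pytorch_model.bin` and `config.json`; `missing_model_files` is a hypothetical helper.

```python
from pathlib import Path

# assumption: these are the two files aitextgen expects in a model folder
EXPECTED_FILES = ("pytorch_model.bin", "config.json")


def missing_model_files(folder):
    """Return the expected files that are absent from `folder`."""
    folder = Path(folder)
    return [name for name in EXPECTED_FILES if not (folder / name).is_file()]


# usage with the `directory` variable from the cell above:
# print(missing_model_files(directory))  # an empty list means nothing is missing
```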

### set a text (or csv) file for finetuning

In [4]:
file_path = os.path.join('..', 'parsed', 'parsed.txt')
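
a quick look at the file before training catches an empty or wrongly encoded dataset early. this is a sketch that assumes a utf-8 text file; `describe_text_file` is a hypothetical helper.

```python
from pathlib import Path


def describe_text_file(path):
    """Return size and line counts for a UTF-8 text file."""
    path = Path(path)
    text = path.read_text(encoding="utf-8")  # raises if the file is not utf-8
    lines = text.splitlines()
    return {
        "size_kib": round(path.stat().st_size / 1024, 1),
        "lines": len(lines),
        "non_empty_lines": sum(1 for line in lines if line.strip()),
        "characters": len(text),
    }


# usage with the `file_path` variable from the cell above:
# print(describe_text_file(file_path))
```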

### finetune gpt-2

the next cell starts the actual finetuning. it runs for `num_steps` steps.

the model will be saved every `save_every` steps in `trained_model` by default, and when training completes.

important parameters for `train()`:

- **`line_by_line`**: set this to `True` if the input text file is a single-column CSV, with one record per row. aitextgen will automatically process it optimally.
- **`from_cache`**: if you have already compressed your dataset into a cache file and are training from that file, set this to `True`.
- **`num_steps`**: number of steps to train the model for.
- **`generate_every`**: interval of steps to generate example text from the model; good for qualitatively validating training.
- **`save_every`**: interval of steps to save the model; the model is saved to the `trained_model` folder.
- **`save_gdrive`**: set this to `True` to copy the model to a unique folder in your google drive. this only applies on google colab with a mounted drive, so leave it off when training locally.
- **`fp16`**: enables half-precision training for faster/more memory-efficient training. only works on a t4 or v100 gpu.

other parameters for `train()` that are useful to know but that you likely do not need to change:

- **`learning_rate`**: learning rate of the model training.
- **`batch_size`**: batch size of the model training; setting it too high will cause the gpu to go oom. (if using `fp16`, you can increase the batch size more safely)
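
a rough sketch of when the intervals above fire during a run. it assumes aitextgen triggers on every multiple of the interval and that an interval of 0 disables the action; `training_schedule` is a hypothetical helper, not part of aitextgen.

```python
def training_schedule(num_steps, generate_every, save_every):
    """Return the steps at which sample text is generated and the model is saved."""

    def multiples(interval):
        if interval <= 0:  # assumption: 0 turns the action off
            return []
        return list(range(interval, num_steps + 1, interval))

    return multiples(generate_every), multiples(save_every)


generate_steps, save_steps = training_schedule(
    num_steps=500, generate_every=100, save_every=500
)
print(generate_steps)  # [100, 200, 300, 400, 500]
print(save_steps)      # [500]
```

with these values the only intermediate save coincides with the end of training, so a crash halfway through would lose the run. a smaller `save_every` trades disk writes for safety.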

In [5]:
ai.train(file_path,
         line_by_line=False,
         from_cache=False,
         num_steps=500,
         generate_every=100,
         save_every=500,
         learning_rate=1e-3,
         fp16=False,
         batch_size=3)

100%|██████████| 651/651 [00:00<00:00, 64213.63it/s]
  rank_zero_deprecation(
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


  0%|          | 0/500 [00:00<?, ?it/s]
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Loss: 4.450 — Avg: 4.106 — GPU Mem: 10765 MB:   4%|▍         | 20/500 [00:08<03:16,  2.45it/s]
Loss: 1.640 — Avg: 3.841 — GPU Mem: 10755 MB:   8%|▊         | 40/500 [00:15<02:57,  2.59it/s]
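
if the training output is captured as text, the loss values can be pulled out of the progress lines for plotting or comparison. a minimal sketch, assuming the line format shown in the output above; `parse_progress` is a hypothetical helper.

```python
import re

# matches e.g. "Loss: 4.450 <dash> Avg: 4.106 <dash> GPU Mem: 10765 MB";
# \S+ stands in for the dash character between the fields
PROGRESS = re.compile(
    r"Loss: (?P<loss>\d+\.\d+) \S+ Avg: (?P<avg>\d+\.\d+) \S+ GPU Mem: (?P<mem>\d+) MB"
)


def parse_progress(log_text):
    """Return (loss, average loss, gpu memory in MB) for every progress line."""
    return [
        (float(m["loss"]), float(m["avg"]), int(m["mem"]))
        for m in PROGRESS.finditer(log_text)
    ]


sample = "Loss: 4.450 \u2014 Avg: 4.106 \u2014 GPU Mem: 10765 MB:   4%"
print(parse_progress(sample))  # [(4.45, 4.106, 10765)]
```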

### test the model

testing the model with a custom prompt

In [6]:
prompt = "Dear Zeno,\nI don't think"

parameters for generation:

* **`n`**: number of texts generated.
* **`max_length`**: maximum length of the generated text (default: 200; for gpt-2, the maximum is 1024; for gpt neo, the maximum is 2048)
* **`prompt`**: prompt that starts the generated text and is included in the generated text.
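* **`temperature`**: controls how random the sampling is; higher values give more surprising text, lower values more predictable text (default: 0.7)
* **`top_k`**: if nonzero, limits the sampled tokens to the top k values. (default: 0)
* **`top_p`**: if nonzero, limits sampling to the smallest set of most likely tokens whose cumulative probability reaches `top_p` (nucleus sampling)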
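
to make `temperature` and `top_p` more concrete, here is a pure-python sketch of both ideas on a toy distribution. it illustrates the general technique, not the code path aitextgen or transformers actually run; both function names are hypothetical.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores into probabilities; lower temperature sharpens them."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the most likely tokens until their cumulative probability
    reaches top_p, then renormalize. Returns {token_index: probability}."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}


probs = [0.5, 0.3, 0.15, 0.05]
print(sorted(top_p_filter(probs, top_p=0.9)))  # [0, 1, 2]
```

with `top_p=0.9` the least likely token (index 3) can never be sampled, which is why the setting trims the odd tail of the distribution without making the text fully predictable.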

enabling the following parameters may slow down generation.

* **`num_beams`**: if greater than 1, executes beam search for cleaner text.
* **`repetition_penalty`**: if greater than 1.0, penalizes repetition in a text to avoid infinite loops.
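* **`length_penalty`**: adjusts how strongly the length of a candidate text counts when candidates are scored; only has an effect together with beam search (`num_beams` greater than 1).
* **`no_repeat_ngram_size`**: if greater than 0, no sequence of this many tokens may appear more than once in the generated text.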
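
to see what `no_repeat_ngram_size` guards against, the sketch below finds repeated n-grams in a piece of text. it works on whitespace-separated words for readability, whereas the model works on tokens; `repeated_ngrams` is a hypothetical helper.

```python
from collections import Counter


def repeated_ngrams(tokens, n):
    """Return every n-gram that occurs more than once, with its count."""
    counts = Counter(
        tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
    )
    return {gram: count for gram, count in counts.items() if count > 1}


words = "the ghost saw the ghost in the hall".split()
print(repeated_ngrams(words, 2))  # {('the', 'ghost'): 2}
```

running this on generated samples gives a quick signal for whether a repetition setting is needed at all.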


In [7]:
ai.generate_one(prompt=prompt,
                max_length=256,
                temperature=1.0,
                top_p=0.9)

'Dear Zeno,\nI don\'t think I have seen a ghost like you may have. This is because I am somewhat unclear of what a ghost is, or what phenomena should (still) count as ghosts. The more rigid the definition, the less likely I\'ll agree to have seen one: if it must be an actual dead person\'s spirit lingering, I have no reason to believe that I have seen one (or that anyone has seen one). If its more of a weirdly personal apparition that cannot be explained, then I\'ll believe others have seen one (but not me). And if it\'s more of a surprising/spontaneous feeling of connection to someone long gone "as if they were there", then I can say that I have. Seeing old pictures of someone with yourself in it without remembering that moment yourself feels like seeing a "ghost", as it\'s a gap in your own stream of consciousness but remembered by technology.\nBest from afar,\nHendrik\n\n\nHi Hendrik,\nso those friends that you made online back in the day, are friends that  you meet in person now or
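
the sample above runs on into a second letter. a small sketch for cutting the output into separate letters, assuming letters are separated by two blank lines (`\n\n\n`) as in this sample; `split_letters` is a hypothetical helper.

```python
def split_letters(generated, separator="\n\n\n"):
    """Split generated text into individual letters, dropping empty parts."""
    return [part.strip() for part in generated.split(separator) if part.strip()]


# usage with the cell above:
# letters = split_letters(ai.generate_one(prompt=prompt, max_length=256))
# print(letters[0])  # the letter that starts with the prompt
```

the last element may be cut off mid-sentence when the text reaches `max_length`.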