<a href="https://colab.research.google.com/github/papayapeter/ghosts-of-data/blob/master/colab/aitextgen_finetune.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# finetuning gpt-2 for *ghosts of data past (wt)*
by [zeno gries](https://zenogries.com)

*last updated 2/20/2022*

using `aitextgen` ([github repository](https://github.com/minimaxir/aitextgen), [documentation](https://docs.aitextgen.io/))



### installing and loading libraries

In [None]:
!pip install -q aitextgen

from aitextgen import aitextgen
from aitextgen.colab import mount_gdrive, copy_file_from_gdrive

### checking for gpu

this checks whether the notebook is connected to a gpu and how much vram is available.

In [None]:
!nvidia-smi
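
the same check can be run from python, for example to stop early before loading a model. this is only a minimal sketch using the standard library: the helper name `gpu_summary` is made up for this example, and it assumes the installed `nvidia-smi` supports the `--query-gpu` option.

```python
import shutil
import subprocess


def gpu_summary():
    """Return a list of 'name, total memory' strings, or [] if no gpu is visible."""
    # nvidia-smi is only on the PATH when an nvidia driver is installed
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
        capture_output=True,
        text=True,
    )
    # a non-zero exit code usually means the tool is present but no gpu is attached
    if result.returncode != 0:
        return []
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


print(gpu_summary() or "no gpu found; switch the runtime type to gpu")
```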

### mounting google drive

In [None]:
mount_gdrive()

### loading the model

a choice of models:

* english:
  * **`distilgpt2`**: smallest (and probably fastest) gpt2 model with 82m parameters. 
  * **`gpt2`**: 124m parameters. standard gpt-2 model (there are also **`gpt2-medium`**, **`gpt2-large`** and **`gpt2-xl`**).
  * **`EleutherAI/gpt-neo-125M`**: 125m parameters. gpt-neo seems to be newer and better suited for longer texts.
* german:
  * **`dbmdz/german-gpt2`**: unknown number of parameters. most popular german model.

In [None]:
ai = aitextgen(model='distilgpt2', to_gpu=True)

alternatively, to continue finetuning a model that has already been finetuned, load it from google drive instead.

In [None]:
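# drive folder that holds the finetuned model;
# leave as None if the files sit at the top level of the drive
from_folder = None

# copy the weights and the config into the current working directory
for file in ["pytorch_model.bin", "config.json"]:
  copy_file_from_gdrive(file, from_folder)

# load the copied model from the current directory
ai = aitextgen(model_folder=".", to_gpu=True)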
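
if one of the two files did not get copied, loading will fail, so it can help to verify the folder first. a minimal sketch, assuming only the two file names used in the cell above; the helper `missing_model_files` is hypothetical and not part of `aitextgen`.

```python
from pathlib import Path

# file names taken from the copy loop above
REQUIRED_FILES = ("pytorch_model.bin", "config.json")


def missing_model_files(folder=".", required=REQUIRED_FILES):
    """Return the required file names that are not present in `folder`."""
    base = Path(folder)
    return [name for name in required if not (base / name).is_file()]


missing = missing_model_files(".")
if missing:
    print("missing before loading:", ", ".join(missing))
else:
    print("all model files present")
```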

### upload a file for training

upload it through the file tab on the left and then set the file path.

In [6]:
file_path = "dataset.txt"

alternatively (if the dataset is larger than 10mb), upload it to google drive first and then copy it over.

In [None]:
from_folder = None

copy_file_from_gdrive(file_path, from_folder)
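
to decide between the two upload routes, the file size can be checked on your own machine before uploading. a minimal sketch: the helper `should_use_gdrive` is hypothetical, and the 10mb threshold is simply the one mentioned above, read here as 10 * 1024 * 1024 bytes.

```python
from pathlib import Path

UPLOAD_LIMIT_MB = 10  # threshold taken from the note above


def should_use_gdrive(path, limit_mb=UPLOAD_LIMIT_MB):
    """Return True if the file at `path` is larger than `limit_mb` megabytes."""
    size_mb = Path(path).stat().st_size / (1024 * 1024)
    return size_mb > limit_mb


dataset = Path("dataset.txt")
if dataset.is_file():
    print("use google drive" if should_use_gdrive(dataset) else "direct upload is fine")
else:
    print("dataset.txt not found in the current directory")
```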

### finetune gpt-2

the next cell will start the actual finetuning. it runs for `num_steps`.

the model will be saved every `save_every` steps in `trained_model` by default, and when training completes. If you mounted your Google Drive, the model will *also* be saved there in a unique folder.

the training might time out after roughly 4 hours; if you did not mount Google Drive, make sure you end training and save the results so you don't lose them! (if this happens frequently, you may want to consider using [Colab Pro](https://colab.research.google.com/signup))

important parameters for `train()`:

- **`line_by_line`**: set this to `True` if the input text file is a single-column CSV, with one record per row. aitextgen will automatically process it optimally.
- **`from_cache`**: if you compressed your dataset locally into a cache file and are using that file, set this to `True`.
- **`num_steps`**: number of steps to train the model for.
- **`generate_every`**: interval of steps to generate example text from the model; good for qualitatively validating training.
- **`save_every`**: interval of steps to save the model: the model will be saved in the vm to `/trained_model`.
- **`save_gdrive`**: set this to `True` to copy the model to a unique folder in your google drive, if you have mounted it in the earlier cells
- **`fp16`**: enables half-precision training for faster/more memory-efficient training. Only works on a t4 or v100 gpu.

here are other important parameters for `train()` that are useful but you likely do not need to change.

- **`learning_rate`**: learning rate of the model training.
- **`batch_size`**: batch size of the model training; setting it too high will cause the gpu to go oom. (if using `fp16`, you can increase the batch size more safely)

In [None]:
ai.train(file_path,
         line_by_line=False,
         from_cache=False,
         num_steps=500,
         generate_every=100,
         save_every=500,
         save_gdrive=True,
         learning_rate=1e-3,
         fp16=False,
         batch_size=3)
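
to see what the interval parameters mean for a given run, the steps at which samples and checkpoints occur can be listed up front. this is a sketch under an assumption, not a description of `aitextgen` internals: it assumes an action fires whenever the step count is a multiple of its interval, and that a non-positive interval disables it. `trigger_steps` is a hypothetical helper.

```python
def trigger_steps(num_steps, every):
    """Steps in 1..num_steps at which an action with interval `every` fires."""
    if every <= 0:  # assumed: a non-positive interval disables the action
        return []
    return list(range(every, num_steps + 1, every))


# values from the train() call above
num_steps, generate_every, save_every = 500, 100, 500
print("samples at:", trigger_steps(num_steps, generate_every))
print("saves at:  ", trigger_steps(num_steps, save_every))
```

under that assumption, the values used above produce five sample texts but only one checkpoint, at the very end. for long runs that might hit the colab timeout, a smaller `save_every` would leave intermediate checkpoints to continue from.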

### test the model

testing the model with a custom prompt

In [13]:
prompt = 'Dear Zeno,\nI don\'t think'

parameters for generation:

* **`n`**: number of texts generated.
* **`max_length`**: maximum length of the generated text (default: 200; for gpt-2, the maximum is 1024; for gpt neo, the maximum is 2048)
* **`prompt`**: prompt that starts the generated text and is included in the generated text.
* **`temperature`**: controls the "craziness" of the text (default: 0.7)
* **`top_k`**: if nonzero, limits the sampled tokens to the top k values. (default: 0)
* **`top_p`**: if nonzero, limits the sampled tokens to the smallest set of most likely tokens whose cumulative probability reaches `top_p` (nucleus sampling).
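
how `temperature`, `top_k` and `top_p` interact is easier to see on a toy example. the following is a simplified sketch in plain python over a made-up four-word vocabulary. it is not the code that `aitextgen` or the underlying model library runs, and the function name `filtered_distribution` and the logit values are invented for illustration.

```python
import math


def _normalise(pairs):
    """Rescale (token, probability) pairs so the probabilities sum to 1."""
    total = sum(prob for _, prob in pairs)
    return [(tok, prob / total) for tok, prob in pairs]


def filtered_distribution(logits, temperature=0.7, top_k=0, top_p=0.0):
    """Return {token: probability} after temperature, top-k and top-p filtering."""
    # temperature rescales the logits: below 1.0 sharpens, above 1.0 flattens
    scaled = {tok: value / temperature for tok, value in logits.items()}
    peak = max(scaled.values())  # subtracting the peak keeps exp() stable
    weights = [(tok, math.exp(value - peak)) for tok, value in scaled.items()]
    ranked = sorted(_normalise(weights), key=lambda item: item[1], reverse=True)

    # top_k keeps only the k most likely tokens
    if top_k > 0:
        ranked = _normalise(ranked[:top_k])

    # top_p keeps the smallest prefix whose cumulative probability reaches top_p
    if 0.0 < top_p < 1.0:
        kept, cumulative = [], 0.0
        for tok, prob in ranked:
            kept.append((tok, prob))
            cumulative += prob
            if cumulative >= top_p:
                break
        ranked = kept

    return dict(_normalise(ranked))


toy_logits = {"think": 2.0, "know": 1.0, "care": 0.5, "banana": -1.0}
print(filtered_distribution(toy_logits, temperature=1.0, top_p=0.9))
```

with these toy values, `top_p=0.9` drops only the unlikely word `banana`, and the next word would then be sampled from the three that remain.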

enabling the following parameters may slow down generation.

* **`num_beams`**: if greater than 1, executes beam search for cleaner text.
* **`repetition_penalty`**: if greater than 1.0, penalizes repetition in a text to avoid infinite loops.
* **`length_penalty`**: if greater than 1.0, penalizes text proportional to the length
* **`no_repeat_ngram_size`**: if greater than 0, no sequence of that many tokens may appear more than once in the generated text.
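
the idea behind `no_repeat_ngram_size` can be sketched as follows: before each new token is chosen, every token that would complete an already seen n-gram is ruled out. this toy version works on words instead of token ids, and `banned_next_tokens` is a hypothetical helper written for this example, not a library function.

```python
def banned_next_tokens(tokens, ngram_size):
    """Tokens that would complete an n-gram already present in `tokens`."""
    if ngram_size <= 0 or len(tokens) < ngram_size:
        return set()
    # the last n-1 tokens form the start of the n-gram about to be completed
    prefix = tuple(tokens[len(tokens) - ngram_size + 1:])
    banned = set()
    for start in range(len(tokens) - ngram_size + 1):
        ngram = tuple(tokens[start:start + ngram_size])
        # an earlier n-gram with the same start forbids its final token
        if ngram[:-1] == prefix:
            banned.add(ngram[-1])
    return banned


words = "the ghost of the".split()
print(banned_next_tokens(words, 2))
```

here the text ends in `the`, and the pair `the ghost` already occurred, so `ghost` is not allowed as the next word.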


In [None]:
ai.generate_one(prompt=prompt,
                max_length=256,
                temperature=1.0,
                top_p=0.9)