<a href="https://colab.research.google.com/github/papayapeter/ghosts-of-data/blob/master/aitextgen_finetune.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# finetuning gpt-2 for *ghosts of data past (wt)*
by [zeno gries](https://zenogries.com)

*last updated 2/20/2022*

using `aitextgen` ([github repository](https://github.com/minimaxir/aitextgen), [documentation](https://docs.aitextgen.io/))



### installing and loading libraries

In [2]:
!pip install -q aitextgen

from aitextgen import aitextgen
from aitextgen.colab import mount_gdrive, copy_file_from_gdrive

### checking for gpu

check whether the notebook is connected to a gpu and how much vram it has.

In [3]:
!nvidia-smi

Mon Feb 21 16:51:24 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   37C    P8    10W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces
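
the same check can be done from python, which is handy when later cells should react to a missing gpu. this is a minimal sketch: `gpu_summary` is a hypothetical helper (not part of aitextgen), and it assumes the `nvidia-smi` command line tool is the only source of gpu information.

```python
import shutil
import subprocess

def gpu_summary():
    """return one 'name, total memory' string per gpu, or [] if none is visible."""
    # nvidia-smi is missing when the runtime has no nvidia driver installed
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=False,
    )
    # a non-zero exit code usually means the driver could not reach a gpu
    if result.returncode != 0:
        return []
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]

gpus = gpu_summary()
print(gpus if gpus else "no gpu found; switch the runtime type to gpu")
```

an empty list means the runtime type should be changed before loading a model with `to_gpu=True`.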

### mounting google drive

In [4]:
mount_gdrive()

Mounted at /content/drive


### loading the model

a choice of models:

* english:
  * `distilgpt2`: smallest (and probably fastest) gpt2 model with 82m parameters. 
  * `gpt2`: 124m parameters. standard gpt-2 implementation (there is also `gpt2-medium`, `gpt2-large`, `gpt2-xl`).
  * `EleutherAI/gpt-neo-125M`: 125m parameters. gpt-neo is newer and seems better suited for longer texts.
* german:
  * `dbmdz/german-gpt2`: unknown number of parameters. most popular german model.
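
the choices above can be kept in a small lookup table so the model name is picked in one place. this is only a sketch: `MODELS` and `smallest_model` are hypothetical names, and the parameter counts are simply the ones listed above.

```python
# parameter counts in millions as listed above; None where the count is unknown
MODELS = {
    "english": {
        "distilgpt2": 82,
        "gpt2": 124,
        "EleutherAI/gpt-neo-125M": 125,
    },
    "german": {
        "dbmdz/german-gpt2": None,
    },
}

def smallest_model(language):
    """return the model with the fewest known parameters for a language."""
    candidates = MODELS[language]
    known = {name: size for name, size in candidates.items() if size is not None}
    # fall back to all listed models when no parameter count is known
    pool = known or candidates
    return min(pool, key=lambda name: pool[name] or 0)

model_name = smallest_model("english")
print(model_name)
```

the resulting name can then be passed as the `model` argument when loading the model.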

In [5]:
ai = aitextgen(model='distilgpt2', to_gpu=True)

alternatively, if a model has already been finetuned, but finetuning should continue, you can load it from the google drive.

In [None]:
from_folder = None

for file in ["pytorch_model.bin", "config.json"]:
  copy_file_from_gdrive(file, from_folder)

ai = aitextgen(model_folder=".", to_gpu=True)
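
before loading from a folder it can help to confirm that both files actually arrived. a minimal sketch, assuming the two file names used in the cell above are all that is required; `missing_model_files` is a hypothetical helper, not an aitextgen function.

```python
from pathlib import Path

# the two files copied from google drive in the cell above
REQUIRED_FILES = ("pytorch_model.bin", "config.json")

def missing_model_files(folder="."):
    """return the required model files that are not present in folder."""
    return [name for name in REQUIRED_FILES if not (Path(folder) / name).is_file()]

missing = missing_model_files(".")
if missing:
    print("missing before loading:", ", ".join(missing))
else:
    print("all model files found")
```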

### upload a file for training

upload it through the file tab on the left and then set the file path.

In [6]:
file_path = "dataset.txt"

alternatively (if the dataset is larger than 10mb) upload it to the google drive first and then copy it over.

In [None]:
from_folder = None

copy_file_from_gdrive(file_path, from_folder)
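
which route to take depends only on the file size, so it can be checked on your own machine before uploading. a minimal sketch, assuming "10mb" means 10 × 1024 × 1024 bytes; `should_use_gdrive` and `UPLOAD_LIMIT_MB` are hypothetical names.

```python
import os

UPLOAD_LIMIT_MB = 10  # threshold mentioned above; the exact unit is an assumption

def should_use_gdrive(path, limit_mb=UPLOAD_LIMIT_MB):
    """return True if the file is larger than the direct-upload threshold."""
    size_mb = os.path.getsize(path) / (1024 * 1024)
    return size_mb > limit_mb

file_path = "dataset.txt"
if os.path.exists(file_path):
    route = "google drive" if should_use_gdrive(file_path) else "file tab"
    print(f"{file_path}: upload via {route}")
else:
    print(f"{file_path} not found")
```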

### finetune gpt-2

the next cell will start the actual finetuning. it runs for `num_steps`.

the model will be saved every `save_every` steps in `trained_model` by default, and when training completes. If you mounted your Google Drive, the model will *also* be saved there in a unique folder.

the training might time out after 4ish hours; if you did not mount to Google Drive, make sure you end training and save the results so you don't lose them! (if this happens frequently, you may want to consider using [Colab Pro](https://colab.research.google.com/signup))

important parameters for `train()`:

- **`line_by_line`**: set this to `True` if the input text file is a single-column CSV, with one record per row. aitextgen will automatically process it optimally.
- **`from_cache`**: if you compressed your dataset locally beforehand and are using that cache file, set this to `True`.
- **`num_steps`**: number of steps to train the model for.
- **`generate_every`**: interval of steps to generate example text from the model; good for qualitatively validating training.
- **`save_every`**: interval of steps to save the model: the model will be saved in the vm to `/trained_model`.
- **`save_gdrive`**: set this to `True` to copy the model to a unique folder in your google drive, if you have mounted it in the earlier cells
- **`fp16`**: enables half-precision training for faster/more memory-efficient training. Only works on a t4 or v100 gpu.

other parameters for `train()` that are useful but that you likely do not need to change:

- **`learning_rate`**: learning rate of the model training.
- **`batch_size`**: batch size of the model training; setting it too high will cause the gpu to go oom. (if using `fp16`, you can increase the batch size more safely)
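
to see how `num_steps`, `generate_every` and `save_every` interact, the steps at which samples and checkpoints appear can be listed up front. this sketch assumes an action fires whenever the step count is a multiple of its interval; `steps_where_triggered` is a hypothetical helper, and the values are the ones used in this notebook's training call.

```python
def steps_where_triggered(num_steps, every):
    """return the training steps that are multiples of the given interval."""
    return list(range(every, num_steps + 1, every))

num_steps = 500
sample_steps = steps_where_triggered(num_steps, every=100)  # generate_every
save_steps = steps_where_triggered(num_steps, every=500)    # save_every

print("samples at steps:", sample_steps)  # [100, 200, 300, 400, 500]
print("saves at steps:", save_steps)      # [500]
```

with `save_every` equal to `num_steps` there is only one checkpoint, so a timeout before the last step would lose the run; a smaller interval trades some time for safety.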

In [12]:
ai.train(file_path,
         line_by_line=False,
         from_cache=False,
         num_steps=500,
         generate_every=100,
         save_every=500,
         save_gdrive=True,
         learning_rate=1e-3,
         fp16=False,
         batch_size=3)

  0%|          | 0/651 [00:00<?, ?it/s]

pytorch_model.bin already exists in /trained_model and will be overwritten!
  f"Setting `Trainer(checkpoint_callback={checkpoint_callback})` is deprecated in v1.5 and will "
  f"Setting `Trainer(progress_bar_refresh_rate={progress_bar_refresh_rate})` is deprecated in v1.5 and"
  "Setting `Trainer(weights_summary=None)` is deprecated in v1.5 and will be removed"
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


  0%|          | 0/500 [00:00<?, ?it/s]

  "`trainer.progress_bar_dict` is deprecated in v1.5 and will be removed in v1.7."


100 steps reached: generating sample texts.
eno

Hi Hendrik,
Incidentally, what do you think about (ai driven) therapy apps?
Zeno


Dear Zeno,
If I am correct (at least in my doubt), and we were not healthier but just sick in a different way, making help more accessible cannot be a concern but must be an obligation (which refocuses the question back to whether AI is actually the solution here, or potentially exploitative). We now have the theories and technology to help people more readily, sustainably, and accurately. Might as well use it.
Best from afar,
Hendrik

Hi Hendrik,
I am sure Derrida has or had (unless he is still with us) quite more to  say on the subject. I don't at the moment. I am however in the lucky  position of not having to explain myself too much. As I said, for me it  is more of a mind game right now.
Rgds,
Zeno

Dear Zeno,
The other reason why I do not share the concern of improved healthcare being a rat race is that this argument only looks at the collect

### test the model

testing the model with a custom prompt

In [13]:
prompt = 'Dear Zeno,\nI don\'t think'

parameters for generation:

* `n`: number of texts generated.
* `max_length`: maximum length of the generated text (default: 200; for gpt-2, the maximum is 1024; for gpt neo, the maximum is 2048)
* `prompt`: prompt that starts the generated text and is included in the generated text.
* `temperature`: controls the "craziness" of the text (default: 0.7)
* `top_k`: if nonzero, limits the sampled tokens to the top k values. (default: 0)
* `top_p`: if nonzero, limits the sampled tokens to the smallest set whose cumulative probability reaches `top_p` (nucleus sampling).

enabling the following parameters may slow down generation.

* `num_beams`: if greater than 1, executes beam search for cleaner text.
* `repetition_penalty`: if greater than 1.0, penalizes repetition in a text to avoid infinite loops.
* `length_penalty`: if greater than 1.0, penalizes text proportional to the length
* `no_repeat_ngram_size`: token length to avoid repeating given phrases.
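
to make `temperature`, `top_k` and `top_p` more concrete, here is a simplified sketch in plain python of how they reshape a next-token distribution. it is not the code aitextgen runs internally; the function names and the toy numbers are made up for illustration.

```python
import math

def softmax_with_temperature(logits, temperature=0.7):
    """turn raw scores into probabilities; lower temperature sharpens them."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def filter_top_k_top_p(probs, top_k=0, top_p=0.0):
    """return {token index: renormalized probability} for the surviving tokens."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k > 0:
        # keep only the k most likely tokens
        ranked = ranked[:top_k]
    if 0.0 < top_p < 1.0:
        # keep the smallest prefix whose cumulative probability reaches top_p
        mass = sum(probs[i] for i in ranked)
        kept, cumulative = [], 0.0
        for i in ranked:
            kept.append(i)
            cumulative += probs[i] / mass
            if cumulative >= top_p:
                break
        ranked = kept
    total = sum(probs[i] for i in ranked)
    return {i: probs[i] / total for i in ranked}

toy_probs = [0.5, 0.3, 0.15, 0.05]
print(sorted(filter_top_k_top_p(toy_probs, top_p=0.9)))  # [0, 1, 2]
print(sorted(filter_top_k_top_p(toy_probs, top_k=2)))    # [0, 1]
```

in the toy example, `top_p=0.9` drops only the least likely token, while `top_k=2` keeps just the two most likely ones; sampling then happens among the survivors.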


In [15]:
ai.generate_one(prompt=prompt,
                max_length=256,
                temperature=1.0,
                top_p=0.9)

'Dear Zeno,\nI don\'t think you are particularly callous. The frustration comes from the the the fact that psychotherapy had a  huge potential digital presence.  However, many psychologists understood themselves as social media presence  is more intentional, and thereby there are more opportunities to "create" a presence that\'s less authentic. Yet, as you also ask, what even is authenticity? If all my friends were to meet each other at a party, I am not convinced they would recognize the same Hendrik. With some people, I am more witty, political, and earnest. With others, I am more silly. With others yet again, I am softer and understanding, depending on mood, chemistry, and the general vibe.\nBest from afar,\nHendrik\n\n\nHi Hendrik,\nI think it is rather impossible to have an "authentic" online presence.  Contrary to how we may phrase it, one cannot "be" online, right? The  only thing users can do is leave traces that reference their existence  and those are, like most stories I thi