<a href="https://colab.research.google.com/github/laurenku/Diary-of-Florence-H/blob/main/Florence_H_aitextgen_%E2%80%94_Train_a_GPT_2_Text_Generating_Model_w_GPU_(Allison's_copy_for_CAtN).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#  aitextgen — Train a GPT-2 Text-Generating Model w/ GPU

by [Max Woolf](https://minimaxir.com)

*Last updated: Jul 5th, 2020*

(This is Allison Parrish's copy of the notebook, which incorporates [this fix](https://github.com/minimaxir/aitextgen/issues/58))

Retrain an advanced text-generating neural network on any text dataset **for free on a GPU using Colaboratory** with `aitextgen`!

For more about `aitextgen`, you can visit [this GitHub repository](https://github.com/minimaxir/aitextgen) or [read the documentation](https://docs.aitextgen.io/).


To get started:

1. Copy this notebook to your Google Drive to keep it and save your changes. (File -> Save a Copy in Drive)
2. Run the cells below:


In [None]:
# Freeze versions of dependencies for now
!pip3 install pytorch-lightning==0.7.6
!pip3 install transformers==2.9.1
!pip3 install fire==0.3.0

!pip install -q aitextgen

import logging
logging.basicConfig(
        format="%(asctime)s — %(levelname)s — %(name)s — %(message)s",
        datefmt="%m/%d/%Y %H:%M:%S",
        level=logging.INFO
    )

from aitextgen import aitextgen
from aitextgen.colab import mount_gdrive, copy_file_from_gdrive

Collecting pytorch-lightning==0.7.6
  Downloading pytorch_lightning-0.7.6-py3-none-any.whl (248 kB)

05/04/2022 17:18:23 — INFO — numexpr.utils — NumExpr defaulting to 2 threads.


## GPU

Colaboratory uses an Nvidia P4, an Nvidia T4, or an Nvidia P100 GPU. For finetuning GPT-2 124M, any of these GPUs will be fine, but for text generation, a T4 or a P100 is ideal since they have more VRAM.

You can verify which GPU is active by running the cell below. If you want to try for a different GPU, go to **Runtime -> Factory Reset Runtime**.

In [None]:
!nvidia-smi

Wed May  4 17:18:23 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   38C    P8     9W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces
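The same check can be done from Python, which is handy if you want later cells to react to the GPU you were assigned. This is a minimal sketch: the helper name `active_gpu` is made up for illustration, and it assumes `nvidia-smi` is on the path (as it is on Colaboratory GPU runtimes).

```python
import shutil
import subprocess

def active_gpu():
    """Return 'name, total memory' for the first GPU, or None if it cannot be queried."""
    # nvidia-smi is missing on CPU-only runtimes
    if shutil.which("nvidia-smi") is None:
        return None
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0 or not result.stdout.strip():
        return None
    # One line per GPU; Colaboratory normally exposes a single device
    return result.stdout.strip().splitlines()[0]

print(active_gpu() or "No GPU detected")
```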

## Loading GPT-2

If you're retraining a model on new text, you need to download and load the GPT-2 model into the GPU. 

There are several sizes of GPT-2, but aitextgen currently works only with the smallest one:

* `124M` (default): the "small" model, 500MB on disk.

The next cell downloads it from Google's servers and saves it in the Colaboratory VM. If the model has already been downloaded, running this cell will reload it.

In [None]:
ai = aitextgen(tf_gpt2="124M", to_gpu=True)

05/04/2022 17:18:23 — INFO — aitextgen — Downloading the 124M GPT-2 TensorFlow weights/config from Google's servers


Fetching checkpoint:   0%|          | 0.00/77.0 [00:00<?, ?it/s]

Fetching hparams.json:   0%|          | 0.00/90.0 [00:00<?, ?it/s]

Fetching model.ckpt.data-00000-of-00001:   0%|          | 0.00/498M [00:00<?, ?it/s]

Fetching model.ckpt.index:   0%|          | 0.00/5.21k [00:00<?, ?it/s]

Fetching model.ckpt.meta:   0%|          | 0.00/471k [00:00<?, ?it/s]

05/04/2022 17:18:41 — INFO — aitextgen — Converting the 124M GPT-2 TensorFlow weights to PyTorch.
Converting TensorFlow checkpoint from /content/aitextgen/124M
Loading TF weight model/h0/attn/c_attn/b with shape [2304]
Loading TF weight model/h0/attn/c_attn/w with shape [1, 768, 2304]
Loading TF weight model/h0/attn/c_proj/b with shape [768]
Loading TF weight model/h0/attn/c_proj/w with shape [1, 768, 768]
Loading TF weight model/h0/ln_1/b with shape [768]
Loading TF weight model/h0/ln_1/g with shape [768]
Loading TF weight model/h0/ln_2/b with shape [768]
Loading TF weight model/h0/ln_2/g with shape [768]
Loading TF weight model/h0/mlp/c_fc/b with shape [3072]
Loading TF weight model/h0/mlp/c_fc/w with shape [1, 768, 3072]
Loading TF weight model/h0/mlp/c_proj/b with shape [768]
Loading TF weight model/h0/mlp/c_proj/w with shape [1, 3072, 768]
Loading TF weight model/h1/attn/c_attn/b with shape [2304]
Loading TF weight model/h1/attn/c_attn/w with shape [1, 768, 2304]
Loading TF weight

Save PyTorch model to aitextgen/pytorch_model.bin


05/04/2022 17:18:47 — INFO — aitextgen — Loading 124M GPT-2 model from /aitextgen.


Save configuration file to aitextgen/config.json


05/04/2022 17:18:49 — INFO — aitextgen — GPT2 loaded with 124M parameters.
05/04/2022 17:18:49 — INFO — aitextgen — Using the default GPT-2 Tokenizer.


## Mounting Google Drive

The best way to get your training text into the Colaboratory VM, and to get the trained model *out* of Colaboratory, is to route both through Google Drive *first*.

Running this cell (which works only in Colaboratory) mounts your personal Google Drive in the VM so that later cells can move data in and out. (It will ask for an auth code; that auth is not saved anywhere.)

In [None]:
mount_gdrive()

Mounted at /content/drive


## Uploading a Text File to be Trained to Colaboratory

In the Colaboratory Notebook sidebar on the left of the screen, select *Files*. From there you can upload files:

![alt text](https://i.imgur.com/w3wvHhR.png)

Upload **any smaller text file** (for example, [a text file of Shakespeare plays](https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt)) and update the file name in the cell below, then run the cell.

In [None]:
file_name = "diary_dataset.txt"

If your text file is large (>10MB), it is recommended to upload that file to Google Drive first, then copy that file from Google Drive to the Colaboratory VM.

Additionally, you may want to consider [compressing the dataset to a cache first](https://docs.aitextgen.io/dataset/) on your local computer, then uploading the resulting `dataset_cache.tar.gz` and setting the `file_name` in the previous cell to that.

In [None]:
copy_file_from_gdrive(file_name)
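To decide which upload route applies to a given file, you can compare its size against the 10MB guideline mentioned above. The helper below is a hypothetical convenience, not part of aitextgen; it uses only the standard library.

```python
import os

LARGE_FILE_MB = 10  # threshold taken from the guideline above

def should_route_through_gdrive(path, threshold_mb=LARGE_FILE_MB):
    """Return True if the file at `path` is larger than `threshold_mb` megabytes."""
    size_mb = os.path.getsize(path) / (1024 * 1024)
    return size_mb > threshold_mb
```

Run locally, `should_route_through_gdrive("diary_dataset.txt")` would tell you whether to upload the file to Google Drive first or to drop it straight into the Files sidebar.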

## Finetune GPT-2

The next cell starts the actual finetuning of GPT-2 in aitextgen. It runs for `num_steps` steps, and a progress bar shows training progress, the current loss (the lower the loss, the better the model), and the average loss (to give a sense of the loss trajectory).
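To see why an average loss is easier to read than the raw per-step loss, here is a small sketch that smooths a noisy sequence with an exponential moving average. The smoothing factor and the loss values are made up for illustration, and the exact averaging aitextgen applies internally may differ.

```python
def smoothed_losses(losses, smoothing=0.9):
    """Return the exponential moving average of `losses`, one value per step."""
    average = None
    history = []
    for loss in losses:
        # The first step has no history, so the average starts at the first loss
        if average is None:
            average = loss
        else:
            average = smoothing * average + (1 - smoothing) * loss
        history.append(average)
    return history

noisy = [3.2, 2.9, 3.4, 2.7, 2.8, 2.5, 2.9, 2.3]  # made-up per-step losses
for step, (current, average) in enumerate(zip(noisy, smoothed_losses(noisy)), start=1):
    print(f"step {step}: loss={current:.2f} avg_loss={average:.2f}")
```

The raw loss jumps up and down from step to step, while the averaged value moves slowly and shows the overall direction.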

The model will be saved every `save_every` steps in `trained_model` by default, and when training completes. If you mounted your Google Drive, the model will _also_ be saved there in a unique folder.

The training might time out after 4ish hours; if you did not mount Google Drive, make sure you end training and save the results so you don't lose them! (If this happens frequently, you may want to consider using [Colab Pro](https://colab.research.google.com/signup).)

Important parameters for `train()`:

- **`line_by_line`**: Set this to `True` if the input text file is a single-column CSV, with one record per row. aitextgen will automatically process it optimally.
- **`from_cache`**: If you compressed your dataset locally (as noted in the previous section) and are using that cache file, set this to `True`.
- **`num_steps`**: Number of steps to train the model for.
- **`generate_every`**: Interval of steps to generate example text from the model; good for qualitatively validating training.
- **`save_every`**: Interval of steps to save the model: the model will be saved in the VM to `/trained_model`.
- **`save_gdrive`**: Set this to `True` to copy the model to a unique folder in your Google Drive, if you have mounted it in the earlier cells.

Here are other important parameters for `train()` that are useful but you likely do not need to change.

- **`learning_rate`**: Learning rate of the model training.
- **`batch_size`**: Batch size of the model training; setting it too high will cause the GPU to go OOM.
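The interval parameters are easier to reason about with concrete numbers. The sketch below lists the steps at which sample text and checkpoints would appear, assuming each event fires whenever the step count is a multiple of its interval and that a final save happens when training completes. `training_schedule` is a hypothetical helper, not an aitextgen function.

```python
def training_schedule(num_steps, generate_every, save_every):
    """Return (steps that generate samples, steps that save the model)."""
    steps = range(1, num_steps + 1)
    generate_at = [s for s in steps if generate_every and s % generate_every == 0]
    save_at = [s for s in steps if save_every and s % save_every == 0]
    # The model is also saved once training completes
    if num_steps not in save_at:
        save_at.append(num_steps)
    return generate_at, save_at

generate_at, save_at = training_schedule(num_steps=250, generate_every=50, save_every=1000)
print(generate_at)  # [50, 100, 150, 200, 250]
print(save_at)      # [250]
```

With `save_every` larger than `num_steps`, no intermediate checkpoint is written, so a timeout partway through the run would lose all progress; lowering `save_every` trades a little time for safety.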

In [None]:
ai.train(file_name,
         line_by_line=False,
         from_cache=False,
         num_steps=250,
         generate_every=50,
         save_every=1000,
         save_gdrive=True,
         learning_rate=1e-4,
         batch_size=1, 
         )

05/04/2022 17:19:17 — INFO — aitextgen — Loading text from diary_dataset.txt with generation length of 1024.


  0%|          | 0/439 [00:00<?, ?it/s]

05/04/2022 17:19:17 — INFO — aitextgen.TokenDataset — Encoding 439 sets of tokens from diary_dataset.txt.
05/04/2022 17:19:17 — INFO — torch.distributed.nn.jit.instantiator — Created a temporary directory at /tmp/tmp_ttsuluk
05/04/2022 17:19:17 — INFO — torch.distributed.nn.jit.instantiator — Writing /tmp/tmp_ttsuluk/_remote_module_non_sriptable.py
  f"Setting `Trainer(checkpoint_callback={checkpoint_callback})` is deprecated in v1.5 and will "
  f"Setting `Trainer(progress_bar_refresh_rate={progress_bar_refresh_rate})` is deprecated in v1.5 and"
  "Setting `Trainer(weights_summary=None)` is deprecated in v1.5 and will be removed"
05/04/2022 17:19:17 — INFO — pytorch_lightning.utilities.rank_zero — GPU available: True, used: True
05/04/2022 17:19:17 — INFO — pytorch_lightning.utilities.rank_zero — TPU available: False, using: 0 TPU cores
05/04/2022 17:19:17 — INFO — pytorch_lightning.utilities.rank_zero — IPU available: False, using: 0 IPUs
05/04/2022 17:19:17 — INFO — pytorch_lightnin

  0%|          | 0/250 [00:00<?, ?it/s]

  "`trainer.progress_bar_dict` is deprecated in v1.5 and will be removed in v1.7."


**50 steps reached: generating sample texts.**

I am currently on a website for my 3rd birthday, so I'm looking for some support. I have a few questions about how to get in touch with my guest and get a reply before I go to the office. (It will be up soon.) I should also check to see if I can get in touch with her through email. I will try to get in touch as soon as possible. (I feel really desperate.) I'm pretty much the only person on that website that is not on my list (not because I'm not on my list). So I would also like to get in touch with someone as soon as possible so I can get in touch on the situation. I need your help with final exam questions. (I'm trying to get back to my email as soon as possible.) Anyways, I appreciate it! I will try to get in touch with my advisor as soon as possible. (I'm not sure what to expect.) I will also try to get in touch with my personal advisor as soon as possible. (I'm looking for someone to help with my final exam questions.) I'm also l

05/04/2022 17:21:09 — INFO — aitextgen — Saving trained model pytorch_model.bin to /trained_model


You're done! Feel free to go to the **Generate Text From The Trained Model** section to generate text based on your retrained model.


## Load a Trained Model

Running the next cell will copy the `pytorch_model.bin` and `config.json` files from the specified folder in Google Drive into the Colaboratory VM. (If no `from_folder` is specified, it assumes the two files are located at the root level of your Google Drive.)

In [None]:
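# Name of the Google Drive folder holding the model files.
# It must be a string (or None to use the root of your Google Drive).
from_folder = "aitextgen"

# Copy both model files from Google Drive into the Colaboratory VM
for file in ["pytorch_model.bin", "config.json"]:
  if from_folder:
    copy_file_from_gdrive(file, from_folder)
  else:
    copy_file_from_gdrive(file)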

The next cell will allow you to load the retrained model + metadata necessary to generate text.

In [None]:
ai = aitextgen(model="pytorch_model.bin", config="config.json", to_gpu=True)
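If loading fails, a quick check that both files actually reached the VM can narrow down the problem. This is a minimal sketch using only the standard library; `missing_model_files` is a hypothetical helper, and it assumes the files were copied into the current working directory.

```python
import os

REQUIRED_FILES = ["pytorch_model.bin", "config.json"]

def missing_model_files(folder="."):
    """Return the names of required model files that are not present in `folder`."""
    return [
        name for name in REQUIRED_FILES
        if not os.path.isfile(os.path.join(folder, name))
    ]

missing = missing_model_files()
if missing:
    print("Missing files:", ", ".join(missing))
else:
    print("Both model files are present.")
```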

## Generate Text From The Trained Model

After you've trained the model or loaded a retrained model from checkpoint, you can now generate text. `generate()` without any parameters generates a single text from the loaded model to the console.

In [None]:
ai.generate()

. This is how I got my internship at the University of Pittsburgh in 2003. I started with Matthew in March 2003 and since then he has provided invaluable support and flexibility to my projects. I have also had a difficult time getting my professors to meet and discuss options with Mark. Because I’ve elected to have my final exam extended on Monday, I will have to wait until Monday December 14th to see if he will be able to prepare for that exam. Because the regular exam time for that class is Monday December 13th, this is time sensitive and I want to keep an eye on the status of the final exam.
Thank you,
Lauren

Monday December 13th, 2021, 3:48 PM
Hi Professor Offner,
Thank you for your understanding and I would like to confirm an incomplete for 15-150. Do you have an approximate window in mind for when the exam time is scheduled?
Thank you,
Lauren

Tuesday December 6th, 2021, 5:48 PM
Hi Mark,
Thank you so much for your support and flexibility this week, I wanted to let yo

If you're creating an API based on your model and need to pass the generated text elsewhere, you can use `text = ai.generate_one()`, which returns the text as a string instead of printing it.

You can also pass in a `prompt` to the generate function to force the text to start with a given character sequence and generate text from there (good if you add an indicator when the text starts).

You can also generate multiple texts at a time by specifying `n`. You can pass a `batch_size` to generate multiple samples in parallel, giving a massive speedup (in Colaboratory, set a maximum of 50 for `batch_size` to avoid going OOM).

Other optional-but-helpful parameters for `ai.generate()` and friends:

* **`max_length`**: Number of tokens to generate (default 256, you can generate up to 1024 tokens with GPT-2, but it will be _much_ slower)
* **`temperature`**: The higher the temperature, the crazier the text (default 0.7, recommended to keep between 0.7 and 1.0)
* **`top_k`**: Limits the generated guesses to the top *k* guesses (default 0 which disables the behavior; if the generated output is super crazy, you may want to set `top_k=40`)
* **`top_p`**: Nucleus sampling: limits the generated guesses to a cumulative probability. (gets good results on a dataset with `top_p=0.9`)
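The effect of `temperature`, `top_k`, and `top_p` can be illustrated on a toy distribution. The sketch below is pure Python with made-up scores for a four-token vocabulary; it shows the general technique rather than the code the underlying library runs, which operates on tensors over the full vocabulary.

```python
import math

def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; a higher temperature flattens them."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def _renormalise(probs, keep):
    """Zero out tokens not in `keep` and rescale the rest to sum to 1."""
    kept = [p if i in keep else 0.0 for i, p in enumerate(probs)]
    total = sum(kept)
    return [p / total for p in kept]

def top_k_filter(probs, k):
    """Keep only the k most likely tokens; k=0 disables the filter."""
    if k <= 0:
        return list(probs)
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return _renormalise(probs, set(order[:k]))

def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = set(), 0.0
    for i in order:
        keep.add(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return _renormalise(probs, keep)

logits = [2.0, 1.0, 0.0, -1.0]  # made-up scores for illustration
print(softmax(logits, temperature=0.7))      # sharper: the top token dominates
print(softmax(logits, temperature=1.0))      # flatter: more variety
print(top_k_filter(softmax(logits), k=2))    # only two candidates remain
print(top_p_filter(softmax(logits), p=0.9))  # low-probability tail is dropped
```

Lower temperatures and tighter `top_k`/`top_p` settings all push sampling toward the most likely tokens, which is why they tame output that reads as too random.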

In [None]:
ai.generate(n=5,
            batch_size=5,
            prompt="Friday, April 29th, 2022, night",
            max_length=768,
            temperature=1.0,
            top_p=0.9)

**Friday, April 29th, 2022, night** before school started. She says she was visiting her room on Valentine's Day with her ex-boyfriend Mark Cato and they had sex more or less only kissed in general, but she says that he didn't touch her until after the break up had been made. She says he then came to her room and kissed me, which is very uncomfortable. She says he kissed me just to get lost and think about it. She says he then kissed me again just to get to the last word. I don't think I'll give him the room or the door any extra love. When he came to talk to me, he assumed that I was weird for not engaging with him or that he didn't like me. I think he's going to ask me questions about my sexuality in general, or why do I feel this way about him? I know he doesn't like that he thinks I'm weird for not engaging with him, why is that not the point? I think he thinks that I'm just a question mark because he doesn't know me or that I don't feel that way about him. I know how he sees m

The code in the following cell generates one output, and then formats it nicely with Python's `textwrap` module:

In [None]:
import textwrap

prompt = "This is the most painful thing I've ever experienced. This morning feels like a month ago."

# Generate a single text and return it as a string instead of printing it
out = ai.generate_one(
            prompt=prompt,
            max_length=768,
            temperature=2.0,
            top_p=0.5)

# Wrap each line to 60 columns; stop if the prompt reappears after the first line
for i, line in enumerate(out.split("\n")):
    if i > 0 and prompt in line:
        break
    print(textwrap.fill(line, 60))
    print()
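The formatting logic can be tried without a model by moving it into a function and feeding it a sample string. This is a sketch only: `format_generated` is a hypothetical helper, and the sample text is invented for the demonstration.

```python
import textwrap

def format_generated(text, prompt, width=60):
    """Wrap each line of `text` to `width` columns, stopping if the prompt repeats."""
    paragraphs = []
    for i, line in enumerate(text.split("\n")):
        # A repeated prompt after the first line signals the start of a new entry
        if i > 0 and prompt in line:
            break
        paragraphs.append(textwrap.fill(line, width))
    return paragraphs

sample = (
    "Dear diary, today was long.\n"
    "It rained all afternoon and I stayed in.\n"
    "Dear diary, today was long."
)
for paragraph in format_generated(sample, prompt="Dear diary"):
    print(paragraph)
    print()
```

Returning a list instead of printing directly makes the result easy to write to a file or pass to another function.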

For bulk generation, you can generate a large number of texts to a file and sort out the samples locally on your computer. The next cell will generate `num_files` files, each with `n` texts and whatever other parameters you would pass to `generate()`. The files can then be downloaded from the Files sidebar!

You can rerun the cells as many times as you want for even more generated texts!

In [None]:
num_files = 5

for _ in range(num_files):
  ai.generate_to_file(n=1000,
                     batch_size=50,
                     prompt="ROMEO:",
                     max_length=256,
                     temperature=1.0,
                     top_p=0.9)

# LICENSE

MIT License

Copyright (c) 2020 Max Woolf

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.