<a href="https://colab.research.google.com/github/pszemraj/ai-msgbot/blob/main/notebooks/colab-notebooks/GPT2_general_conv_textgen_deepspeed.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##  aitextgen — Train a GPT-2 (or GPT Neo) Text-Generating Model w/ GPU

This notebook is based on the original tutorial from `aitextgen`. **What sets it apart from other versions is that it installs a _branch_ of aitextgen that has been updated recently. DeepSpeed seems to work on this one.**

- For more about `aitextgen`, you can visit [this GitHub repository](https://github.com/minimaxir/aitextgen) or [read the documentation](https://docs.aitextgen.io/).
- For `ai-msgbot` (which uses `aitextgen` for chatbot-esque purposes), you can find the project repo [here](https://github.com/pszemraj/ai-msgbot).


_updates made by [Peter](https://peterszemraj.ch/)_



---

### GPU

Colaboratory assigns an Nvidia K80, P100, V100, or A100. For finetuning GPT-2 124M, any of these GPUs will be fine, but for text generation, a GPU with more VRAM is preferable.

- In theory: **If you receive a T4 or a V100 GPU, you can enable `fp16=True` during training for faster/more memory efficient training.**
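As a rough sketch of that rule of thumb, the check below decides whether to request `fp16` based on the GPU name that `nvidia-smi` reports. The helper names `gpu_name` and `should_use_fp16`, and the approach of matching on the name string, are assumptions for illustration and are not part of aitextgen.

```python
import shutil
import subprocess


def gpu_name():
    """Return the first GPU name reported by nvidia-smi, or None if unavailable."""
    if shutil.which("nvidia-smi") is None:
        return None
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
        capture_output=True,
        text=True,
    )
    lines = result.stdout.strip().splitlines()
    return lines[0].strip() if result.returncode == 0 and lines else None


def should_use_fp16(name):
    """Hypothetical helper: True only for the GPUs this notebook lists for fp16."""
    if not name:
        return False
    return any(tag in name.upper() for tag in ("T4", "V100"))


use_fp16 = should_use_fp16(gpu_name())
print(f"fp16 suggested: {use_fp16}")
```

The resulting `use_fp16` flag could then be passed as `fp16=use_fp16` when training.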

In [None]:
# !sudo apt-get purge nvidia*
# !sudo add-apt-repository ppa:graphics-drivers/ppa
# !sudo apt-get update
# !ubuntu-drivers devices
# !sudo apt-get install nvidia-driver-460

In [None]:
!nvidia-smi

Wed Dec 15 03:18:10 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.44       Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla V100-SXM2...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   52C    P0    56W / 300W |      0MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [None]:
!nvcc --version


nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Mon_Oct_12_20:09:46_PDT_2020
Cuda compilation tools, release 11.1, V11.1.105
Build cuda_11.1.TC455_06.29190527_0


### formatting

In [None]:
from IPython.display import HTML, display
# colab formatting
def set_css():
    display(
        HTML(
            """
  <style>
    pre {
        white-space: pre-wrap;
    }
  </style>
  """
        )
    )

get_ipython().events.register("pre_run_cell", set_css)

## setup

In [None]:
# update torch in case of using an A100 GPU
# may take 5+ minutes

# !pip3 install -q torch==1.9.1+cu111 torchtext torchaudio==0.9.1 -f https://download.pytorch.org/whl/torch_stable.html -q

!pip3 install torch==1.10.0+cu113 torchvision==0.11.1+cu113 -f https://download.pytorch.org/whl/cu113/torch_stable.html -q
!pip install https://storage.googleapis.com/jax-releases/cuda111/jaxlib-0.1.71+cuda111-cp37-none-manylinux2010_x86_64.whl -q

# see this issue https://github.com/googlecolab/colabtools/issues/2452

In [None]:
# !pip install -q aitextgen
!pip install https://github.com/minimaxir/aitextgen/archive/trainer.zip -q
# pip install https://github.com/user/repository/archive/branch.zip
!pip install -q deepspeed
import logging

logging.basicConfig(
    format="%(asctime)s — %(levelname)s — %(name)s — %(message)s",
    datefmt="%m/%d/%Y %H:%M:%S",
    level=logging.INFO,
)

from aitextgen import aitextgen
from aitextgen.colab import mount_gdrive, copy_file_from_gdrive

In [None]:
mount_gdrive()


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


### Loading GPT-2 or GPT Neo


- A common use case is *continuing* to fine-tune a model that was pretrained and then fine-tuned a little, but needs more fine-tuning for accuracy/saliency reasons or because Google cut off the runtime early.
    - In this case, set `load_from_folder` to `True` and point `load_folder_dir` to where the model checkpoint is on your Google Drive.
- **The section below, from the original tutorial, describes loading a new/pretrained model.**

> If you're retraining a model on new text, you need to download and load the GPT-2 model into the GPU. 

> There are several sizes of GPT-2:

    * `124M` (default): the "small" model, 500MB on disk.
    * `355M`: the "medium" model, 1.5GB on disk.
    * `774M`: the "large" model, 3GB on disk.

> You can also finetune a GPT Neo model instead ([_or any textgen GPT-architecture model on huggingface_](https://huggingface.co/models?pipeline_tag=text-generation)), which is more suitable for longer texts and the base model has more recent data:

* `125M`: Analogous to the GPT-2 124M model. (355M parameter model was removed)
*  `EleutherAI/gpt-neo-1.3B` : 1.3 billion parameter model. Have yet to see this train on Colab without crashing

**the largest model trained on Colab on "standard settings" is [GPT-Neo 1.3B](https://huggingface.co/EleutherAI/gpt-neo-1.3B) as of Nov 27**

> The next cell downloads the model and saves it in the Colaboratory VM. If the model has already been downloaded, running this cell will reload it.


In [None]:
model_size = "774M" #@param ["355M", "774M"]
load_from_folder = False #@param {type:"boolean"}
load_folder_dir = "/content/drive/MyDrive/Programming/ai-msgbot/ur-model-to-load" #@param {type:"string"}
custom_model = "gpt2-large" #@param {type:"string"}


In [None]:
if load_from_folder:
    ai = aitextgen(model_folder=load_folder_dir, to_gpu=True,
                   gradient_checkpointing=True)
elif len(custom_model) > 0:
    model_size = custom_model.split('/')[-1]
    ai = aitextgen(model=custom_model,
                   to_gpu=True,
                   to_fp16=True,
                   gradient_checkpointing=True)
else:
    ai = aitextgen(tf_gpt2=model_size, to_gpu=True,
                   to_fp16=True,
                   gradient_checkpointing=True)

12/15/2021 03:18:33 — INFO — aitextgen — Loading gpt2-large model from /aitextgen.
12/15/2021 03:18:45 — INFO — aitextgen — GPT2 loaded with 774M parameters.
12/15/2021 03:18:45 — INFO — aitextgen — Gradient checkpointing enabled for model training.
12/15/2021 03:18:45 — INFO — aitextgen — Using the default GPT-2 Tokenizer.


### load training data


- <font color="orange"> Combine any other data from a "standard" conversational dataset and do a final pass of training (with different data). </font>


```

WoW
https://github.com/pszemraj/ai-msgbot/raw/main/conversation-data/wizard-of-wikipedia/ScriptParse-wow-train-kilt.txt

Daily Dialogues
https://github.com/pszemraj/ai-msgbot/raw/main/conversation-data/Daily-Dialogues/daily_dialogue_augment.txt

```
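One way to combine several conversation files before a final training pass is to concatenate them into a single text file. The sketch below assumes each source has already been downloaded as a local plain-text file in the same speaker format; the function name `combine_text_files` and the file names in the usage comment are hypothetical.

```python
from pathlib import Path


def combine_text_files(sources, destination):
    """Concatenate text files into one, separating them with a blank line."""
    parts = []
    for src in sources:
        text = Path(src).read_text(encoding="utf-8", errors="ignore")
        # Drop trailing newlines so each file is followed by exactly one blank line.
        parts.append(text.rstrip("\n"))
    Path(destination).write_text("\n\n".join(parts) + "\n", encoding="utf-8")
    return str(destination)


# Example (hypothetical local file names):
# combined = combine_text_files(["wow.txt", "daily_dialogue.txt"], "combined_script.txt")
```

The combined file can then be used in place of a single downloaded script.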


In [None]:
dl_link = "https://www.dropbox.com/s/b54gwbxl4e0p4sn/ScriptParse-wow-train-kilt_3.txt?dl=1" #@param {type:"string"}


In [None]:
# download the training text file
from urllib import request
from os.path import join
import os
vm_wd = os.getcwd()
local_name = join(vm_wd, "training_script.txt")
request.urlretrieve(dl_link, local_name)


('/content/training_script.txt', <http.client.HTTPMessage at 0x7fec38e1fa50>)

In [None]:
# adjust names in script if needed  
import pprint as pp
from os.path import basename

def update_script_names(local_name, spkr_from="speaker a", 
                        spkr_to="person alpha",
                        resp_from="speaker b", resp_to="person beta",
                        verbose=False):
    """
    update_script_names - if the textfile script has different names for the 
    speaker/responder than desired (i.e. it is a group conversation, and the 
    chatbot is just supposed to simulate 1:1) this function can be used to 
    standardize
    """

    with open(local_name, 'r', encoding='utf-8', errors='ignore') as fi:
        orig_lines = fi.readlines()

    from tqdm.auto import tqdm

    upd_lines = []

    for line in tqdm(orig_lines, total=len(orig_lines), 
                     desc="replacing speaker names"):
        
        fixline = line.replace(spkr_from, spkr_to)
        fixline = fixline.replace(resp_from, resp_to)
        upd_lines.append(fixline)

    local_namev2 = join(vm_wd, "V2-rename-" + basename(local_name))

    with open(local_namev2, 'w', encoding='utf-8', errors='ignore') as fo:
        fo.writelines(upd_lines)

    if verbose: pp.pprint(upd_lines[:10])
    # return filepath
    return local_namev2


def preview_script(file_path, num_lines:int=20):
    """print the first num_lines lines of the text file at file_path"""
    with open(file_path, 'r', encoding='utf-8', errors='ignore') as fi:
        script_lines = fi.readlines()

    print(f"A preview of the first {num_lines} lines of {file_path} is: \n")
    pp.pprint(script_lines[:num_lines])

In [None]:
local_name = update_script_names(local_name)

file_name = local_name # update if using fn above
preview_script(file_name)

replacing speaker names:   0%|          | 0/262550 [00:00<?, ?it/s]

A preview of the first 20 lines of /content/V2-rename-training_script.txt is: 

['that is true. i think i am pretty good at visualization. are you?\n',
 '\n',
 'person beta:\n',
 'yeah i would say so. it is fun to talk about because not everyone has the '
 "same interpretations, you can't really control it down to one.\n",
 '\n',
 'person alpha:\n',
 'i really enjoy a cappuccino, with all that warm steamed milk and double '
 "espresso...it really starts off my day. do you like cappuccino's?\n",
 '\n',
 'person beta:\n',
 "love them. though now it's almond milk for me! and having a cappucino in "
 'italy is pretty darn close to heaven ;-)\n',
 '\n',
 'person beta:\n',
 "well, i've never had one in italy, but would love it! i like cappuccinos "
 'with cream too, maybe a dab of cinnamon as well.\n',
 '\n',
 'person alpha:\n',
 'i like the color blue, i even named my dog blue!\n',
 '\n',
 'person beta:\n',
 "that's pretty cool. i like blue from blue's clues, haha. blue is one of the "
 'th

## Train / Finetune GPT-2

The next cell will start the actual finetuning of GPT-2 in aitextgen. It runs for `num_steps`, and a progress bar will appear to show training progress, current loss (the lower the better the model), and average loss (to give a sense on loss trajectory).

The model will be saved every `save_every` steps in `trained_model` by default, and when training completes. If you mounted your Google Drive, the model will _also_ be saved there in a unique folder.

The training might time out after 4ish hours; if you did not mount to Google Drive, make sure you end training and save the results so you don't lose them! (if this happens frequently, you may want to consider using [Colab Pro](https://colab.research.google.com/signup))
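If a run is cut off, training can be resumed from whatever was last saved to Drive. The sketch below looks for the most recently modified sub-folder of a base directory that contains both `pytorch_model.bin` and `config.json`. The helper name `latest_model_folder` is hypothetical, and it only inspects direct sub-folders, so nested checkpoint folders would need an extra level of search.

```python
import os


def latest_model_folder(base_dir, required=("pytorch_model.bin", "config.json")):
    """Return the newest sub-folder of base_dir containing all required files."""
    candidates = []
    for entry in os.scandir(base_dir):
        if entry.is_dir() and all(
            os.path.isfile(os.path.join(entry.path, name)) for name in required
        ):
            # Sort key is the folder's modification time.
            candidates.append((entry.stat().st_mtime, entry.path))
    if not candidates:
        return None
    return max(candidates)[1]
```

The returned path could then be used as `load_folder_dir` with `load_from_folder = True`.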

Important parameters for `train()`:

- **`line_by_line`**: Set this to `True` if the input text file is a single-column CSV, with one record per row. aitextgen will automatically process it optimally.
- **`from_cache`**: If you compressed your dataset locally (as noted in the previous section) and are using that cache file, set this to `True`.
- **`num_steps`**: Number of steps to train the model for.
- **`generate_every`**: Interval of steps to generate example text from the model; good for qualitatively validating training.
- **`save_every`**: Interval of steps to save the model: the model will be saved in the VM to `/trained_model`.
- **`save_gdrive`**: Set this to `True` to copy the model to a unique folder in your Google Drive, if you have mounted it in the earlier cells
- **`fp16`**: Enables half-precision training for faster/more memory-efficient training. Only works on a T4 or V100 GPU.

Here are other important parameters for `train()` that are useful but you likely do not need to change **unless the model size grows beyond 1B params**

- **`learning_rate`**: Learning rate of the model training.
- **`batch_size`**: Batch size of the model training; setting it too high will cause the GPU to go OOM. (if using `fp16`, you can increase the batch size more safely)
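To get a feel for how `num_steps`, `save_every`, and `generate_every` interact, the arithmetic below counts how many periodic saves and sample generations a run would trigger. It assumes an action fires whenever the step count is a positive multiple of its interval, which is an assumption about the trainer rather than something verified here; `schedule_summary` is a hypothetical helper.

```python
def schedule_summary(num_steps, save_every, generate_every):
    """Count periodic saves and sample generations over a training run."""
    return {
        "saves": num_steps // save_every if save_every > 0 else 0,
        "generations": num_steps // generate_every if generate_every > 0 else 0,
    }


# Values taken from the training cell in this notebook.
print(schedule_summary(num_steps=50000, save_every=1000, generate_every=1500))
# → {'saves': 50, 'generations': 33}
```

With a large model each save writes several GB, so the save count is worth checking against free Drive space before starting.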

In [None]:
base_dir = "/content/drive/MyDrive/Programming/ai-msgbot" #@param {type:"string"}
# update to yours

In [None]:
import gc, os
from os.path import join
from datetime import datetime

def get_timestamp():
    return datetime.now().strftime("%b-%d-%Y_t-%H")

temp_gpu_path = join(base_dir, 
                     "GPT2-conversational-{sz}-{dt}".format(sz=model_size,
                                                            dt=get_timestamp(),
                                                            )
                     )
os.makedirs(temp_gpu_path, exist_ok=True)
gc.collect()

124

### example outputs

- this cell had its outputs cleared before notebook was posted.
- the below is an example of what the outputs during training should look like.

```
/usr/local/lib/python3.7/dist-packages/pytorch_lightning/trainer/trainer.py:1819: LightningDeprecationWarning: `trainer.progress_bar_dict` is deprecated in v1.5 and will be removed in v1.7. Use `ProgressBarBase.get_metrics` instead.
  "`trainer.progress_bar_dict` is deprecated in v1.5 and will be removed in v1.7."
1,000 steps reached: saving model to //content/drive/MyDrive/Programming/ai-msgbot/GPT2-conversational-774M-Nov-25-2021_t-04
1,500 steps reached: generating sample texts.
/usr/local/lib/python3.7/dist-packages/torch/utils/checkpoint.py:25: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
  warnings.warn("None of the inputs have requires_grad=True. Gradients will be None")
==========


person alpha:
do you enjoy the movie?

person beta:
yes, i love the film is based in george r. r. martin scotland

person alpha:
oh cool, how does it get to be one?

person beta:
well it was a break between a single.

person alpha:
i really like the show but i don't know much about it, i do not know much about it

person beta:
i know there are many episodes for the show, and the show has to be the highest grosss in the show though.

person alpha:
i've been a new game of thrones, or twice a year

```


```
(train_data: str | TokenDataset, output_dir: str = "trained_model", fp16: bool = False, fp16_opt_level: str = "O1", n_gpu: int = - 1, tpu_cores: int = 0, max_grad_norm: float = 0.5, gradient_accumulation_steps: int = 1, seed: int | None = None, learning_rate: float = 0.001, weight_decay: float = 0.05, adam_epsilon: float = 1e-8, warmup_steps: int = 0, num_steps: int = 5000, save_every: int = 1000, generate_every: int = 1000, n_generate: int = 1, loggers: List | None = None, batch_size: int = 1, num_workers: int | None = None, benchmark: bool = True, avg_loss_smoothing: float = 0.01, save_gdrive: bool = False, run_id: str = <Expression>, progress_bar_refresh_rate: int = 20, freeze_layers: bool = False, num_layers_freeze: int | None = None, use_deepspeed: bool = False, **kwargs) -> None
```


In [None]:
from aitextgen.TokenDataset import TokenDataset

data = TokenDataset(file_name)

# try ai.train_pt() later

  0%|          | 0/262550 [00:00<?, ?it/s]

12/15/2021 03:18:54 — INFO — aitextgen.TokenDataset — Encoding 262,550 sets of tokens from /content/V2-rename-training_script.txt.


### t r a i n

In [None]:
# DO NOT USE WARMUP STEPS
# ai.train_pt()
ai.train(
    train_data=data,
    output_dir=temp_gpu_path,  # where it saves during "save_every"
    line_by_line=False,  # set to True only for single-column CSV input
    from_cache=False,
    num_steps=50000,  # takes about 5 hours on 16 gb v100 GPU for 75000
    generate_every=1500,
    max_grad_norm=0.5,
    save_every=1000,
    gradient_accumulation_steps=1,
    save_gdrive=False,  # this is an "automated" save which is worse than current method (IMO)
    learning_rate=1e-4,
    # learning_rate=1e-3,
    use_deepspeed=True,
    # fp16=True,  # may be relevant to set to false (even if available) for "final" training
    # fp16_opt_level="O2",  # different types of FP16 are possible
    batch_size=1,  # if pushing model_size you probably want to leave this at 1
    freeze_layers=True,  # freeze some layers instead of updating weights on ALL layers
    num_layers_freeze=28,  # look up how many layers your model has
)

PyTorch: setting up devices
max_steps is given, it will override any value given in num_train_epochs
***** Running training *****
  Num examples = 2288881
  Num Epochs = 1
  Instantaneous batch size per device = 1
  Total train batch size (w. parallel, distributed & accumulation) = 1
  Gradient Accumulation steps = 1
  Total optimization steps = 50000


  0%|          | 0/50000 [00:00<?, ?it/s]

Saving model checkpoint to /content/drive/MyDrive/Programming/ai-msgbot/GPT2-conversational-gpt2-large-Dec-15-2021_t-03/checkpoint-500


{'loss': 1.4697, 'learning_rate': 0.0001, 'epoch': 0.0}


In [None]:
save_path = join(base_dir, 
                     "FINAL-GPT2-conv-{sz}-{dt}".format(sz=model_size,
                                                        dt=get_timestamp(),
                                                        )
                     )


In [None]:
import os
os.makedirs(save_path, exist_ok=True)
ai.save(save_path)

print(f'saved! {get_timestamp()}')


---

## Use a Trained Model for Generation

If you already have a trained model from this notebook, running the next cell will copy the `pytorch_model.bin` and `config.json` files from the specified folder in Google Drive into the Colaboratory VM. (If no `from_folder` is specified, it assumes the two files are located at the root level of your Google Drive.)

In [None]:
!nvidia-smi

In [None]:
mount_gdrive()


In [None]:
# best model thus far @ 1.3B parameters and tuned for 50k steps
# from_folder = "/content/drive/MyDrive/Programming/AI_peter/GPT-Neo-1B-V1"

from_folder = save_path

if len(from_folder) > 2:

    for file in ["pytorch_model.bin", "config.json"]:
        if from_folder:
            copy_file_from_gdrive(file, from_folder)
        else:
            copy_file_from_gdrive(file)

    ai = aitextgen(model_folder=from_folder, to_gpu=True)
else:
    ai = aitextgen(model_folder=".", to_gpu=True)


### Generate Text


`generate()` without any parameters generates a single text from the loaded model to the console.

In [None]:
ai.generate(n=3, max_length=256, 
            temperature=1.0, top_p=0.9)

In [None]:
ai.generate(prompt="these days, it always seems like ", temperature=1,
            min_length=10, batch_size=20)

If you're creating an API based on your model and need to pass the generated text elsewhere, you can do `text = ai.generate_one()`
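A minimal sketch of such a wrapper is below. It assumes the model was trained on the `person alpha:` / `person beta:` script format previewed earlier and that `generate_one()` returns the prompt followed by the continuation. `build_prompt`, `first_reply`, and `get_reply` are hypothetical names, and the truncation logic is one possible design rather than how `ai-msgbot` does it.

```python
def build_prompt(message, speaker="person alpha", responder="person beta"):
    """Format a user message the way the training script is laid out."""
    return f"{speaker}:\n{message}\n\n{responder}:\n"


def first_reply(generated, prompt, speaker="person alpha", responder="person beta"):
    """Keep only the responder's first turn from the generated text."""
    body = generated[len(prompt):] if generated.startswith(prompt) else generated
    # Cut at the next speaker label, if the model kept the conversation going.
    cut = len(body)
    for label in (f"{speaker}:", f"{responder}:"):
        idx = body.find(label)
        if idx != -1:
            cut = min(cut, idx)
    return body[:cut].strip()


def get_reply(ai, message, **generate_kwargs):
    """Generate one response with an object that exposes generate_one()."""
    prompt = build_prompt(message)
    generated = ai.generate_one(prompt=prompt, **generate_kwargs)
    return first_reply(generated, prompt)
```

Usage would look like `get_reply(ai, "how are you doing?", max_length=128, temperature=0.8)`.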

You can also pass in a `prompt` to the generate function to force the text to start with a given character sequence and generate text from there (good if you add an indicator when the text starts).

You can also generate multiple texts at a time by specifying `n`. You can pass a `batch_size` to generate multiple samples in parallel, giving a massive speedup (in Colaboratory, set a maximum of 50 for `batch_size` to avoid going OOM).

Other optional-but-helpful parameters for `ai.generate()` and friends:

* **`min_length`**: The minimum length of the generated text: if the text is shorter than this value after cleanup, aitextgen will generate another one.
* **`max_length`**: Number of tokens to generate (default 256, you can generate up to 1024 tokens with GPT-2 and 2048 with GPT Neo)
* **`temperature`**: The higher the temperature, the crazier the text (default 0.7, recommended to keep between 0.7 and 1.0)
* **`top_k`**: Limits the generated guesses to the top *k* guesses (default 0 which disables the behavior; if the generated output is super crazy, you may want to set `top_k=40`)
* **`top_p`**: Nucleus sampling: limits the generated guesses to a cumulative probability. (gets good results on a dataset with `top_p=0.9`)
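To make these knobs concrete, here is a small pure-Python sketch of how temperature, `top_k`, and `top_p` reshape a next-token distribution. It illustrates the general sampling techniques on a toy vocabulary, not the actual implementation inside aitextgen or transformers; `filter_distribution` and the logits are made up for the example.

```python
import math


def filter_distribution(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Return {token: probability} after temperature, top-k, and top-p filtering."""
    # Temperature: divide logits before softmax; values above 1 flatten the distribution.
    scaled = {tok: val / temperature for tok, val in logits.items()}
    peak = max(scaled.values())
    exps = {tok: math.exp(val - peak) for tok, val in scaled.items()}
    total = sum(exps.values())
    ranked = sorted(((p / total, tok) for tok, p in exps.items()), reverse=True)

    # Top-k: keep only the k most likely tokens (0 disables the filter).
    if top_k > 0:
        ranked = ranked[:top_k]

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for prob, tok in ranked:
        kept.append((prob, tok))
        cumulative += prob
        if cumulative >= top_p:
            break

    # Renormalize so the surviving probabilities sum to 1.
    norm = sum(prob for prob, _ in kept)
    return {tok: prob / norm for prob, tok in kept}


toy_logits = {"the": 2.0, "a": 1.0, "cat": 0.5, "zebra": -1.0}
print(filter_distribution(toy_logits, temperature=1.0, top_p=0.9))
print(filter_distribution(toy_logits, temperature=0.7, top_k=2))
```

In the first call the unlikely token `zebra` falls outside the 0.9 nucleus and is dropped; in the second only the two most likely tokens survive.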

In [None]:
ai.generate(
    n=3, batch_size=25, prompt="i just", max_length=256, 
    temperature=1.0, top_p=0.9
)

For bulk generation, you can generate a large amount of texts to a file and sort out the samples locally on your computer. The next cell will generate `num_files` files, each with `n` texts and whatever other parameters you would pass to `generate()`. The files can then be downloaded from the Files sidebar!

You can rerun the cells as many times as you want for even more generated texts!
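Once a generated file is downloaded, the samples in it can be split apart for sorting. The sketch below assumes samples are separated by a line of `=` characters, like the `==========` line that precedes the sample in the example training output. The exact delimiter that `generate_to_file()` writes is an assumption here, so `split_samples` (a hypothetical helper) treats any line made only of five or more `=` as a separator.

```python
import re


def split_samples(text, min_delim_len=5):
    """Split generated text into samples on lines consisting only of '=' characters."""
    pattern = re.compile(rf"^={{{min_delim_len},}}\s*$", flags=re.MULTILINE)
    return [chunk.strip() for chunk in pattern.split(text) if chunk.strip()]


# Example (hypothetical file name):
# with open("gpt-model-textgen_file-6.txt", encoding="utf-8") as f:
#     samples = split_samples(f.read())
```

Each returned string is one generated conversation, ready to be filtered or ranked.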

In [None]:
save_loc = join(base_dir, "generated_text_out")

os.makedirs(save_loc, exist_ok=True)

In [None]:
p_list = [
    ["person alpha:\n", "how are you doing?\n", "\n", "person beta:\n"],
    ["person alpha:\n", "hello there!\n", "\n", "person beta:\n"],
    ["person alpha:\n", "why does it always seem that "],
    ["person beta:\n"],
]


prompts = ["".join(line) for line in p_list]

In [None]:
from datetime import datetime
import pprint as pp

ds_date_time = datetime.now().strftime("%m.%d.%Y")

base_header = "gpt-model-textgen-{}".format(ds_date_time)
prompt_IDs = [base_header + "_file-{}.txt".format(i+1) for i in range(5, len(prompts)+11)]

prompt_mng = {}
for pid, text in zip(prompt_IDs, prompts):
    prompt_mng[pid] = text
pp.pprint(prompt_mng)

In [None]:
from os.path import join

for pfile, my_prompt in prompt_mng.items():
    ai.generate_to_file(
        n=50,
        batch_size=5,
        prompt=my_prompt,
        max_length=512,
        temperature=0.8,
        top_p=0.9,
        destination_path=join(save_loc, pfile)
    )
