<a href="https://colab.research.google.com/github/gu-ma/hgk-ml-workshop/blob/main/notebooks/train_gpt2(aitextgen).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Train a GPT-2 (or GPT Neo) Text-Generating Model w/ GPU

Original Colab notebook by [Max Woolf](https://minimaxir.com)

Retrain an advanced text-generating neural network on any text dataset **for free on a GPU in Colaboratory** with `aitextgen`!

For more about `aitextgen`, you can visit [this GitHub repository](https://github.com/minimaxir/aitextgen) or [read the documentation](https://docs.aitextgen.io/).

## Setup

In [None]:
# !pip install -q aitextgen
!pip install git+https://github.com/llimllib/aitextgen@fix_tpu_available 

import os
import shutil
from pathlib import Path

import logging
logging.basicConfig(
        format="%(asctime)s — %(levelname)s — %(name)s — %(message)s",
        datefmt="%m/%d/%Y %H:%M:%S",
        level=logging.INFO
    )

from aitextgen import aitextgen
from aitextgen.colab import mount_gdrive, copy_file_from_gdrive, copy_file_to_gdrive

## Connect Google Drive

In [None]:
from google.colab import drive
drive.mount('/content/drive')

# 1. Try some models

In [None]:
# @title Choose and Load model

# @markdown Select a model to load
model = "huggingartists/eminem" #@param ["striki-ai/william-shakespeare-poetry","huggingartists/eminem","pranavpsv/gpt2-genre-story-generator","fabianmmueller/deep-haiku-gpt-2"] {allow-input: true}

# @markdown If you feel adventurous, you can also try other models [from here](https://huggingface.co/models?library=transformers,pytorch&language=en&other=gpt2&p=1&sort=downloads); just use the suffix of the URL. For example, for huggingface.co/__huggingartists/rihanna__ enter `huggingartists/rihanna`.

ai = aitextgen(model=model, to_gpu=True)

print(f"---\nRead the model card for info on how to generate text https://huggingface.co/{model}")

In [None]:
# @title Generate some samples

# @markdown How the text should start
prompt = "I am a"  # @param {type:"string"}

# @markdown Number of samples to generate
n = 4  # @param {type:"integer"}

# @markdown Length of the generated text
max_length = 64  # @param {type:"integer"}

# @markdown The higher the temperature, the crazier the text (default 0.7, recommended to keep between 0.7 and 1.0)
temperature = 1.0  # @param {type:"number"}

# @markdown Nucleus sampling: limits the generated guesses to a cumulative probability. (gets good results on a dataset with top_p=0.9)
top_p = 0.9  # @param {type:"number"}

ai.generate(
    n=n, prompt=prompt, max_length=max_length, temperature=temperature, top_p=top_p
)
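
To make the `temperature` and `top_p` settings above more concrete, here is a minimal, library-free sketch of how they reshape a next-token distribution. The toy scores and the helper names `apply_temperature` and `nucleus` are made up for illustration; this is not aitextgen's actual implementation.

```python
import math

def apply_temperature(logits, temperature):
    """Softmax over logits divided by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def nucleus(probs, top_p):
    """Keep the smallest set of most likely tokens whose mass reaches top_p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return {i: probs[i] / mass for i in kept}

toy_logits = [2.0, 1.0, 0.5, -1.0]  # hypothetical scores for four tokens
cool = apply_temperature(toy_logits, 0.7)  # sharper: the favourite gets more mass
warm = apply_temperature(toy_logits, 1.0)  # flatter: unlikely tokens get more mass
print(cool[0] > warm[0])
print(sorted(nucleus(warm, 0.9)))  # indices of the tokens that survive top_p=0.9
```

Lower temperatures concentrate probability on the likeliest token, while `top_p` simply drops the unlikely tail before sampling.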


# 2. Finetune GPT-2

First upload a file to Colab and set the path. 

If your text file is large (>10MB), it is recommended to upload that file to Google Drive first, then copy that file from Google Drive to the Colaboratory VM.

Additionally, you may want to consider [compressing the dataset to a cache first](https://docs.aitextgen.io/dataset/) on your local computer, then uploading the resulting `dataset_cache.tar.gz` and setting `file_name` in the next cell to that.
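
If you are unsure which route to take, a quick size check along these lines may help. It is only a sketch: the helper name `should_upload_via_drive` is hypothetical, and the 10 MB limit is the rule of thumb mentioned above.

```python
import os
import tempfile

SIZE_LIMIT_MB = 10  # rule-of-thumb threshold from the text above

def should_upload_via_drive(path, limit_mb=SIZE_LIMIT_MB):
    """Return True when the file is larger than limit_mb megabytes."""
    size_mb = os.path.getsize(path) / (1024 * 1024)
    return size_mb > limit_mb

# Demo on a small throwaway file; point the function at your own dataset instead
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as handle:
    handle.write("sample text\n")
    demo_path = handle.name

small_file_needs_drive = should_upload_via_drive(demo_path)
os.remove(demo_path)
print(small_file_needs_drive)
```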

In [None]:
file_name = "warpeace_input.txt"

The next cell will start the actual finetuning of GPT-2 in aitextgen. It runs for `num_steps`, and a progress bar will appear to show training progress, current loss (the lower, the better the model), and average loss (to give a sense of the loss trajectory).

The model will be saved every `save_every` steps in `trained_model` by default, and when training completes. If you mounted your Google Drive, the model will _also_ be saved there in a unique folder.
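
As a rough illustration of what that means for the values used in the training cell, the sketch below lists the steps at which a save would happen. It assumes a save fires whenever the step counter is a multiple of `save_every`; the helper `checkpoint_steps` is hypothetical and not part of aitextgen.

```python
def checkpoint_steps(num_steps, every):
    """Steps at which a periodic action fires; a value of 0 disables it."""
    if every <= 0:
        return []
    return list(range(every, num_steps + 1, every))

num_steps, save_every = 3000, 1000  # values used in the training cell
save_steps = checkpoint_steps(num_steps, save_every)
print(save_steps)
```

Under that assumption, a session that dies between two saves loses at most `save_every - 1` steps of work, which is worth keeping in mind when picking the interval.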

The training might time out after about 4 hours; if you did not mount Google Drive, make sure you end training and save the results so you don't lose them! (If this happens frequently, you may want to consider using [Colab Pro](https://colab.research.google.com/signup).)

Important parameters for `train()`:

- **`line_by_line`**: Set this to `True` if the input text file is a single-column CSV, with one record per row. aitextgen will automatically process it optimally.
- **`from_cache`**: If you compressed your dataset locally (as noted in the previous section) and are using that cache file, set this to `True`.
- **`num_steps`**: Number of steps to train the model for.
- **`generate_every`**: Interval of steps to generate example text from the model; good for qualitatively validating training.
- **`save_every`**: Interval of steps to save the model: the model will be saved in the VM to the `trained_model` folder.
- **`save_gdrive`**: Set this to `True` to copy the model to a unique folder in your Google Drive, if you have mounted it in the earlier cells.
- **`fp16`**: Enables half-precision training for faster/more memory-efficient training. Only works on a T4 or V100 GPU.
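
For `line_by_line`, it may help to see what "a single-column CSV, with one record per row" looks like in practice. The sketch below parses a small made-up example with Python's `csv` module; the header row and the sample records are assumptions of this illustration, not requirements stated by aitextgen.

```python
import csv
import io

# Hypothetical single-column CSV: quotes protect commas and line breaks inside a record
raw = (
    "text\n"
    '"First record, with a comma"\n'
    '"Second record\nspanning two lines"\n'
    "Third record\n"
)

rows = list(csv.reader(io.StringIO(raw)))
header, records = rows[0], [row[0] for row in rows[1:]]
print(len(records))
```

Each row is one complete text, even when it contains commas or line breaks, which is what distinguishes this layout from a plain text file.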

Here are other important parameters for `train()` that are useful but you likely do not need to change.

- **`learning_rate`**: Learning rate of the model training.
- **`batch_size`**: Batch size of the model training; setting it too high will cause the GPU to go OOM. (if using `fp16`, you can increase the batch size more safely)
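
Since `fp16` depends on which GPU Colab assigned to you, a check like the following could be run before training. It is a sketch that assumes the `nvidia-smi` tool is on the path (it returns `None` otherwise); the helper `gpu_name` is hypothetical.

```python
import shutil
import subprocess

def gpu_name():
    """Return the GPU name reported by nvidia-smi, or None when it is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return None
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        return None
    lines = result.stdout.strip().splitlines()
    return lines[0] if lines else None

name = gpu_name()
# fp16 is described above as working only on T4 or V100 cards
use_fp16 = name is not None and any(tag in name for tag in ("T4", "V100"))
print(name, use_fp16)
```

The resulting `use_fp16` value could then be passed as `fp16=use_fp16` in the training call.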

In [None]:
ai.train(
    file_name,
    line_by_line=False,
    from_cache=False,
    num_steps=3000,
    generate_every=1000,
    save_every=1000,
    save_gdrive=False,
    learning_rate=1e-3,
    fp16=False,
    batch_size=1,
)
model_folder = "trained_model"

You're done! Feel free to go to the **Generate Text From The Trained Model** section to generate text based on your retrained model. But first it might be a good idea to copy your model to Google Drive ⬇️

In [None]:
#@title Upload model to Google Drive

gdrive_output_dir = "/content/drive/MyDrive/AI/hgk_workshop" #@param {type:"string"}

model_name = "my-gpt2-model" #@param {type:"string"}

dst = Path(gdrive_output_dir) / Path(model_name)

src = '/content/trained_model/'

# Create dst dir (and any missing parent folders) if it does not exist
dst.mkdir(parents=True, exist_ok=True)

# Copy the trained model files to Google Drive
print('Copying files to Google Drive, this could take some time 😴')
shutil.copytree(src, dst, dirs_exist_ok=True)
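
After a large copy it can be reassuring to confirm that everything arrived. The sketch below compares two folders by file name and size; the helper `missing_in_destination` is hypothetical, and the demo uses throwaway folders standing in for `/content/trained_model` and the Drive folder.

```python
import shutil
import tempfile
from pathlib import Path

def missing_in_destination(src, dst):
    """Relative paths of files under src that are absent, or differ in size, under dst."""
    src, dst = Path(src), Path(dst)
    missing = []
    for item in sorted(src.rglob("*")):
        if item.is_file():
            target = dst / item.relative_to(src)
            if not target.is_file() or target.stat().st_size != item.stat().st_size:
                missing.append(str(item.relative_to(src)))
    return missing

with tempfile.TemporaryDirectory() as tmp:
    demo_src, demo_dst = Path(tmp) / "trained_model", Path(tmp) / "drive_copy"
    demo_src.mkdir()
    (demo_src / "config.json").write_text("{}")
    (demo_src / "pytorch_model.bin").write_bytes(b"\x00" * 16)
    shutil.copytree(demo_src, demo_dst, dirs_exist_ok=True)
    after_full_copy = missing_in_destination(demo_src, demo_dst)
    (demo_dst / "config.json").unlink()  # simulate an incomplete copy
    after_deleting_one = missing_in_destination(demo_src, demo_dst)

print(after_full_copy, after_deleting_one)
```

In the notebook you would call `missing_in_destination(src, dst)` with the two paths from the cell above and expect an empty list.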


# 3. Load a Trained Model

If you already have a trained model from this notebook, running the next cell will copy the `pytorch_model.bin` and `config.json` files from the specified folder in Google Drive into the Colaboratory VM.

In [None]:
#@title Download model from Google Drive

# @markdown Path to your model on google drive. Right click your directory and choose "copy path" then paste it below
gdrive_model_dir = "/content/drive/MyDrive/AI/hgk_workshop/my-gpt2-model" #@param {type:"string"}

# Strip any trailing slash so the last path component is the model name
(gdrive_path, model_name) = os.path.split(gdrive_model_dir.rstrip('/'))

src = gdrive_model_dir

dst = f'/content/{model_name}/'
model_folder = dst

# Create local dst dir if it does not exist
os.makedirs(dst, exist_ok=True)

# Copy the model files from Google Drive to the Colab VM
print('Copying files from Google Drive')
shutil.copytree(src, dst, dirs_exist_ok=True)
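
Before loading, you may want to confirm that the two files named above actually made it into the local folder. A minimal sketch, with a hypothetical helper `missing_model_files` and a throwaway folder for the demo:

```python
import tempfile
from pathlib import Path

REQUIRED_FILES = ("pytorch_model.bin", "config.json")  # the two files named above

def missing_model_files(folder, required=REQUIRED_FILES):
    """Return the required file names that are not present in folder."""
    folder = Path(folder)
    return [name for name in required if not (folder / name).is_file()]

# Demo: a folder that only contains config.json
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "config.json").write_text("{}")
    demo_missing = missing_model_files(tmp)

print(demo_missing)
```

In the notebook, `missing_model_files(model_folder)` returning an empty list suggests the folder is ready to be loaded.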

## Generate Text From The Trained Model

**If you just trained a model**, you'll get much faster generation performance if you reload the model; the next cell will reload it from `model_folder` (the `trained_model` folder if you just finished training).

In [None]:
ai = aitextgen(model_folder=model_folder, to_gpu=True)

`generate()` without any parameters generates a single text from the loaded model to the console.

In [None]:
ai.generate()

You can also pass in a `prompt` to the generate function to force the text to start with a given character sequence and generate text from there (good if you add an indicator when the text starts).

You can also generate multiple texts at a time by specifying `n`. You can pass a `batch_size` to generate multiple samples in parallel, giving a massive speedup (in Colaboratory, set a maximum of 50 for `batch_size` to avoid going OOM).

Other optional-but-helpful parameters for `ai.generate()` and friends:

* **`min_length`**: The minimum length of the generated text: if the text is shorter than this value after cleanup, aitextgen will generate another one.
* **`max_length`**: Number of tokens to generate (default 256, you can generate up to 1024 tokens with GPT-2 and 2048 with GPT Neo)
* **`temperature`**: The higher the temperature, the crazier the text (default 0.7, recommended to keep between 0.7 and 1.0)
* **`top_k`**: Limits the generated guesses to the top *k* guesses (default 0 which disables the behavior; if the generated output is super crazy, you may want to set `top_k=40`)
* **`top_p`**: Nucleus sampling: limits the generated guesses to a cumulative probability. (gets good results on a dataset with `top_p=0.9`)
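
To illustrate the `top_k` option from the list above, here is a small library-free sketch of the filtering idea. The probabilities and the helper `top_k_filter` are invented for the example and do not reflect aitextgen's internals.

```python
def top_k_filter(probs, k):
    """Keep the k most likely tokens and renormalize; k=0 leaves the distribution untouched."""
    if k <= 0 or k >= len(probs):
        return dict(enumerate(probs))
    best = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    mass = sum(probs[i] for i in best)
    return {i: probs[i] / mass for i in best}

toy_probs = [0.5, 0.25, 0.15, 0.1]  # hypothetical next-token probabilities
filtered = top_k_filter(toy_probs, 2)
print(sorted(filtered))  # indices of the tokens that remain candidates
```

With `k=2` only the two likeliest tokens can be sampled, which is why a moderate `top_k` tends to tame very erratic output.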

In [None]:
ai.generate(
    n=5,
    prompt="A man",
    max_length=128,
    temperature=1.0,
    top_p=0.9
)

For bulk generation, you can generate a large number of texts to a file and sort out the samples locally on your computer. The next cell will generate `num_files` files, each with `n` texts and whatever other parameters you would pass to `generate()`. The files can then be downloaded from the Files sidebar!

You can rerun the cells as many times as you want for even more generated texts!

In [None]:
num_files = 5

for _ in range(num_files):
    ai.generate_to_file(
        n=10,
        prompt="A man",
        max_length=256,
        temperature=1.0,
        top_p=0.9
    )
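
Once downloaded, the files can be split back into individual samples for sorting. The sketch below assumes samples are separated by a line of twenty `=` characters; check one of your generated files and adjust `SAMPLE_DELIM` if the separator differs. The helper `read_samples` and the demo file are hypothetical.

```python
import tempfile
from pathlib import Path

SAMPLE_DELIM = "=" * 20 + "\n"  # assumption: the separator between samples in your files

def read_samples(path, delim=SAMPLE_DELIM):
    """Split a generated file into a list of non-empty samples."""
    text = Path(path).read_text(encoding="utf-8")
    return [chunk.strip() for chunk in text.split(delim) if chunk.strip()]

# Demo on a throwaway file with two made-up samples
with tempfile.TemporaryDirectory() as tmp:
    demo_file = Path(tmp) / "demo_samples.txt"
    demo_file.write_text(
        "A man walked in.\n" + SAMPLE_DELIM + "A man sat down.\n" + SAMPLE_DELIM,
        encoding="utf-8",
    )
    samples = read_samples(demo_file)

print(len(samples))
```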

Zip the generated text files, then delete them.

In [None]:
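# Match only aitextgen's generated files (named ATG_*.txt by default) so that
# other .txt files, such as the training data, are not zipped or deleted
! zip samples.zip ATG_*.txt
! rm ATG_*.txt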