## Fine-tune a pre-trained model with task-specific data (Amazon ads) 

Generative Pre-trained Transformer (GPT) models such as OpenAI's GPT-2 and GPT-3 or EleutherAI's gpt-neo-1.3B are trained on large amounts of general text. They handle many tasks reasonably well with zero-shot learning, that is, with no task-specific examples.

They also support one-shot and few-shot learning, where you provide one or a few examples in the prompt. The model's weights are not updated; it conditions on these examples to generate similar output. I have not found this approach to generate ad text well, especially for Amazon ads.
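As a rough illustration of few-shot prompting, the sketch below assembles a prompt from a couple of example pairs and a new product name. The `build_few_shot_prompt` helper, the `Product:`/`Ad:` labels, and the example ads are all hypothetical; the resulting string would be passed to a model as-is, with no weight updates.

```python
def build_few_shot_prompt(examples, product):
    """Join (product, ad) example pairs and a new product into one prompt."""
    lines = []
    for example_product, example_ad in examples:
        lines.append(f"Product: {example_product}")
        lines.append(f"Ad: {example_ad}")
        lines.append("")  # blank line between examples
    # Leave the final "Ad:" open so the model completes it.
    lines.append(f"Product: {product}")
    lines.append("Ad:")
    return "\n".join(lines)


# Hypothetical examples, for illustration only.
examples = [
    ("wireless earbuds", "Cut the cord. All-day sound in your pocket."),
    ("fitness tracker", "Every step counts. Track them all."),
]
prompt = build_few_shot_prompt(examples, "smart TV")
print(prompt)
```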

Fine-tuning involves training on a supervised dataset specific to the desired task. It starts from the weights of a pre-trained model and updates them with gradient steps after every batch, just as in ordinary neural network training.
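To make the idea concrete, here is a toy sketch rather than anything aitextgen does internally: a one-parameter model `y = w * x` starts from a stand-in "pre-trained" weight and is adjusted on task data with mini-batch gradient descent. The function name, learning rate, and data are assumptions chosen for illustration.

```python
def fine_tune(w, data, learning_rate=0.05, batch_size=2, epochs=50):
    """Mini-batch gradient descent on y = w * x with squared-error loss."""
    for _ in range(epochs):
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            # Gradient of mean((w*x - y)^2) with respect to w.
            grad = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
            w -= learning_rate * grad  # one update per batch
    return w


pretrained_w = 1.0  # stand-in for pre-trained weights
task_data = [(x, 3.0 * x) for x in (1.0, 2.0, 3.0, 4.0)]  # toy task: y = 3x
tuned_w = fine_tune(pretrained_w, task_data)
print(round(tuned_w, 3))
```

The weight starts at the pre-trained value and moves toward the value that fits the task data, which is the same pattern a large model follows across millions of weights.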

Fine-tuning is a form of transfer learning.


For fine-tuning, I used aitextgen, a Python tool for text-based AI training and generation using [OpenAI's](https://openai.com/) [GPT-2](https://openai.com/blog/better-language-models/) and [EleutherAI's](https://www.eleuther.ai/) [GPT Neo/GPT-3](https://github.com/EleutherAI/gpt-neo) architecture.

aitextgen is a Python package that leverages [PyTorch](https://pytorch.org/), [Hugging Face Transformers](https://github.com/huggingface/transformers) and [pytorch-lightning](https://github.com/PyTorchLightning/pytorch-lightning) with specific optimizations for text generation using GPT-2.

For more about `aitextgen`, visit [the GitHub repository](https://github.com/minimaxir/aitextgen) or [read the documentation](https://docs.aitextgen.io/).



In [1]:
# Mount google drive
from google.colab import drive
drive.mount('/content/drive')

%cd drive/MyDrive/Colab\ Notebooks/nlg/code

Mounted at /content/drive
/content/drive/MyDrive/Colab Notebooks/nlg/code


In [2]:
!pip install -q aitextgen

import logging
logging.basicConfig(
        format="%(asctime)s — %(levelname)s — %(name)s — %(message)s",
        datefmt="%m/%d/%Y %H:%M:%S",
        level=logging.INFO
    )

from aitextgen import aitextgen
from aitextgen.colab import mount_gdrive, copy_file_from_gdrive


05/19/2022 08:11:35 — INFO — numexpr.utils — NumExpr defaulting to 2 threads.


## GPU

Colaboratory assigns one of several Nvidia GPUs, such as a K80, P4, T4, P100, or V100. For fine-tuning GPT-2 124M, any of these GPUs will be fine, but for text generation, a T4 or a P100 is ideal since they have more VRAM. **If you receive a T4 or a V100 GPU, you can enable `fp16=True` during training for faster/more memory efficient training.**

You can verify which GPU is active by running the cell below. If you want to try for a different GPU, go to **Runtime -> Factory Reset Runtime**.

In [None]:
!nvidia-smi

Wed May 18 08:00:11 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   72C    P8    33W / 149W |      0MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
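The fp16 advice above can be turned into a small check. This is a sketch under the assumption that you already have the GPU name as a string (for example, copied from the `nvidia-smi` output); `supports_fp16` and `FP16_GPUS` are hypothetical names, not part of aitextgen.

```python
FP16_GPUS = ("T4", "V100")  # GPUs the text above names as fp16-capable


def supports_fp16(gpu_name):
    """Return True if the GPU name mentions one of the fp16-capable models."""
    return any(model in gpu_name.upper() for model in FP16_GPUS)


print(supports_fp16("Tesla K80"))  # the GPU reported in the output above
print(supports_fp16("Tesla T4"))
```

The result could then be passed as the `fp16` argument to `train()`.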

## Loading GPT-2 or GPT Neo

To retrain a model on new text, we need to download and load the GPT model into the GPU. 

There are several sizes of GPT-2:

* `124M` (default): the "small" model, 500MB on disk.
* `355M`: the "medium" model, 1.5GB on disk.
* `774M`: the "large" model, 3GB on disk.

We can also fine-tune a GPT Neo model instead. GPT Neo is more suitable for longer texts, and its base model was trained on more recent data:

* `125M`: Analogous to the GPT-2 124M model.
* `350M`: Analogous to the GPT-2 355M model.
* `1.3B`: The largest of the three sizes listed here.

Note: The 350M model is no longer available, and the 1.3B model runs out of memory (OOM) on the free Colab tier.
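A back-of-the-envelope estimate suggests why the 1.3B model is a tight fit. The sketch below assumes fp32 training with an Adam-style optimizer, which needs roughly 16 bytes per parameter (weights, gradients, and two optimizer moments); that figure and the `training_memory_gib` helper are assumptions, and activations are not counted.

```python
def training_memory_gib(num_params, bytes_per_param=16):
    """Rough fp32 training footprint: weights + gradients + two Adam moments.

    16 bytes per parameter is an assumption (4 x 4-byte floats); activations
    and framework overhead are not counted, so real usage is higher.
    """
    return num_params * bytes_per_param / 1024**3


for name, params in [("125M", 125e6), ("1.3B", 1.3e9)]:
    print(f"{name}: ~{training_memory_gib(params):.1f} GiB")
```

Under these assumptions the 1.3B model alone would need more than the 11441MiB reported by `nvidia-smi` for the K80 above, while the 125M model fits comfortably.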

Download the model and save it in the Colaboratory VM. If the model has already been downloaded, running this cell will reload it.

In [None]:
# Use the EleutherAI/gpt-neo-125M model

#ai = aitextgen(tf_gpt2="355M", to_gpu=True)
#ai = aitextgen(tf_gpt2="124M", to_gpu=True)

ai = aitextgen(model="EleutherAI/gpt-neo-125M", to_gpu=True)

#ai = aitextgen(model="EleutherAI/gpt-neo-350M") # 350M model is no longer available
#ai = aitextgen(model="EleutherAI/gpt-neo-1.3B", to_gpu=True) # OOM



05/18/2022 08:05:18 — INFO — aitextgen — Downloading EleutherAI/gpt-neo-125M model to /aitextgen.
05/18/2022 08:05:37 — INFO — aitextgen — Using the tokenizer for EleutherAI/gpt-neo-125M.
05/18/2022 08:05:41 — INFO — aitextgen — GPTNeo loaded with 125M parameters.


## Mounting Google Drive



In [None]:
#mount_gdrive()

#### Use the Amazon Ads data file from Google Drive

In [None]:
file_name = "/content/drive/MyDrive/Colab Notebooks/nlg/data/amazon_tv_wearable_12k_holidays.csv"



In [None]:
# copy_file_from_gdrive(file_name) # don't need to do this

## Finetune GPT model

The next cell starts the actual fine-tuning of the GPT model in aitextgen. It runs for `num_steps` steps, and a progress bar shows training progress, the current loss (lower is better), and the average loss (to give a sense of the loss trajectory).

The model will be saved every `save_every` steps in `trained_model` by default, and when training completes. If you mounted your Google Drive, the model will _also_ be saved there in a unique folder.

On Colab, training may time out after about four hours.

Important parameters for `train()`:

- **`line_by_line`**: Set this to `True` if the input text file is a single-column CSV, with one record per row. aitextgen will automatically process it optimally.
- **`from_cache`**: If you compressed your dataset locally and are using that cache file, set this to `True`.
- **`num_steps`**: Number of steps to train the model for.
- **`generate_every`**: Interval of steps to generate example text from the model; good for qualitatively validating training.
- **`save_every`**: Interval of steps to save the model: the model will be saved in the VM to `/trained_model`.
- **`save_gdrive`**: Set this to `True` to copy the model to a unique folder in your Google Drive, if you have mounted it in the earlier cells.
- **`fp16`**: Enables half-precision training for faster/more memory-efficient training. Only works on a T4 or V100 GPU.
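Since `line_by_line=True` expects a single-column CSV with one record per row, a minimal sketch of producing such a file with Python's standard `csv` module is shown here. The ad texts, the `ads.csv` file name, and the `text` header row are assumptions for illustration; check how your aitextgen version treats the first line before relying on the header.

```python
import csv
import os
import tempfile

# Hypothetical ad texts, for illustration only.
ads = [
    "55-Inch 4K Smart TV, Holiday Deal",
    'Fitness Tracker with "Heart Rate" Monitor, Black',
]

path = os.path.join(tempfile.mkdtemp(), "ads.csv")
with open(path, "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["text"])  # header row (an assumption, see the note above)
    for ad in ads:
        writer.writerow([ad])  # one record per row, one column

# Read the file back to confirm commas and quotes survive the round trip.
with open(path, newline="", encoding="utf-8") as f:
    rows = list(csv.reader(f))
print(rows)
```

Using `csv.writer` rather than joining strings by hand keeps commas and quotes inside an ad from splitting it into several columns.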

Here are other important parameters for `train()` that are useful but you likely do not need to change.

- **`learning_rate`**: Learning rate of the model training.
- **`batch_size`**: Batch size of the model training; setting it too high will cause the GPU to go OOM. (if using `fp16`, you can increase the batch size more safely)
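The interval parameters can be read as simple step arithmetic. The sketch below lists the steps at which an interval such as `save_every` or `generate_every` would trigger; `interval_steps` is a hypothetical helper, and whether aitextgen also triggers on the final step (in addition to saving when training completes) is not confirmed here.

```python
def interval_steps(num_steps, every):
    """Steps in 1..num_steps that are multiples of `every`."""
    return [step for step in range(1, num_steps + 1) if step % every == 0]


# Values used in the training cell below.
print(interval_steps(3000, 1000))
```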

In [None]:
ai.train(file_name,
         line_by_line=True,
         from_cache=False,
         num_steps=3000,
         generate_every=1000,
         save_every=1000,
         save_gdrive=True,
         learning_rate=1e-3,
         fp16=False,
         batch_size=1, 
         )

05/18/2022 08:17:44 — INFO — aitextgen — Loading text from /content/drive/MyDrive/Colab Notebooks/nlg/data/amazon_tv_wearable_12k_holidays.csv with generation length of 2048.


  0%|          | 0/13070 [00:00<?, ?it/s]

05/18/2022 08:17:44 — INFO — aitextgen.TokenDataset — Encoding 13,070 rows from /content/drive/MyDrive/Colab Notebooks/nlg/data/amazon_tv_wearable_12k_holidays.csv.
05/18/2022 08:17:46 — INFO — torch.distributed.nn.jit.instantiator — Created a temporary directory at /tmp/tmppgmp2i6q
05/18/2022 08:17:46 — INFO — torch.distributed.nn.jit.instantiator — Writing /tmp/tmppgmp2i6q/_remote_module_non_sriptable.py
  f"Setting `Trainer(checkpoint_callback={checkpoint_callback})` is deprecated in v1.5 and will "
  f"Setting `Trainer(progress_bar_refresh_rate={progress_bar_refresh_rate})` is deprecated in v1.5 and"
  "Setting `Trainer(weights_summary=None)` is deprecated in v1.5 and will be removed"
05/18/2022 08:17:46 — INFO — pytorch_lightning.utilities.rank_zero — GPU available: True, used: True
05/18/2022 08:17:46 — INFO — pytorch_lightning.utilities.rank_zero — TPU available: False, using: 0 TPU cores
05/18/2022 08:17:46 — INFO — pytorch_lightning.utilities.rank_zero — IPU available: False, 

  0%|          | 0/3000 [00:00<?, ?it/s]

  "`trainer.progress_bar_dict` is deprecated in v1.5 and will be removed in v1.7."


1,000 steps reached: saving model to /trained_model
1,000 steps reached: generating sample texts.
mm HDTV - Digital Amplifier HDTV Antenna, HDTV Antenna with 120 Miles Long Range - Best Present for UHF/VHF/VHF (AT-414B)
2,000 steps reached: saving model to /trained_model
2,000 steps reached: generating sample texts.
/XS/XR/XS Max/X/8/8Plus/7/7Plus/6/6Plus (Silver)


You're done! Feel free to go to the **Generate Text From The Trained Model** section to generate text based on your retrained model.


## Load a Trained Model

If you already have a trained model from this notebook, running the next cell will copy the `pytorch_model.bin` and `config.json` files from the specified folder in Google Drive into the Colaboratory VM. (If no `from_folder` is specified, it assumes the two files are located at the root level of your Google Drive.)

In [3]:
from_folder = "/content/drive/MyDrive/Colab Notebooks/nlg/code/trained_model"

for file in ["pytorch_model.bin", "config.json"]:
  if from_folder:
    copy_file_from_gdrive(file, from_folder)
  else:
    copy_file_from_gdrive(file)
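Before loading, it can help to confirm that both files are present. This is a minimal sketch using only the standard library; `missing_model_files` and `REQUIRED_FILES` are hypothetical names, not part of aitextgen, and the demo runs against a throwaway folder rather than the real model folder.

```python
from pathlib import Path
import tempfile

REQUIRED_FILES = ("pytorch_model.bin", "config.json")


def missing_model_files(folder):
    """Return the required model files that are absent from `folder`."""
    return [name for name in REQUIRED_FILES if not (Path(folder) / name).is_file()]


# Demo on a throwaway folder that only contains config.json.
demo_folder = Path(tempfile.mkdtemp())
(demo_folder / "config.json").write_text("{}")
print(missing_model_files(demo_folder))
```

An empty list means both files are in place and the folder can be passed as `model_folder`.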

Now load the retrained model + metadata necessary to generate text.

In [4]:
ai = aitextgen(model_folder=".", to_gpu=True)

05/19/2022 08:11:55 — INFO — aitextgen — Loading model from provided weights and config in /..
05/19/2022 08:12:01 — INFO — aitextgen — GPTNeo loaded with 125M parameters.
05/19/2022 08:12:01 — INFO — aitextgen — Using the default GPT-2 Tokenizer.


## Generate Text From The Trained Model

After you've trained the model or loaded a retrained model from checkpoint, you can now generate text.

**If you just trained a model**, you'll get much faster generation performance if you reload the model; the next cell will reload the model you just trained from the `trained_model` folder.

In [5]:
ai = aitextgen(model_folder="trained_model", to_gpu=True)

05/19/2022 08:12:20 — INFO — aitextgen — Loading model from provided weights and config in /trained_model.
05/19/2022 08:12:23 — INFO — aitextgen — GPTNeo loaded with 125M parameters.
05/19/2022 08:12:23 — INFO — aitextgen — Using the default GPT-2 Tokenizer.


In [6]:
# Generate files needed to upload the model to HuggingFace
ai.save_for_upload("AdTextGenerator")

tokenizer config file saved in AdTextGenerator/tokenizer_config.json
Special tokens file saved in AdTextGenerator/special_tokens_map.json


`generate()` without any parameters generates a single text from the loaded model and prints it to the console.

In [7]:
ai.generate()

mm Wristband Compatible with Samsung Galaxy Fit E 42mm 46mm, Gear S3 Classic/Frontier, Galaxy Fit E5 Smart Watch Fitness [Small Size]


If you're creating an API based on your model and need to pass the generated text elsewhere, you can do `text = ai.generate_one()`.

You can also pass in a `prompt` to the generate function to force the text to start with a given character sequence and generate text from there (good if you add an indicator when the text starts).

You can also generate multiple texts at a time by specifying `n`. You can pass a `batch_size` to generate multiple samples in parallel, giving a massive speedup (in Colaboratory, set a maximum of 50 for `batch_size` to avoid going OOM).

Other optional-but-helpful parameters for `ai.generate()` and friends:

*  **`min_length`**: The minimum length of the generated text: if the text is shorter than this value after cleanup, aitextgen will generate another one.
*  **`max_length`**: Number of tokens to generate (default 256, you can generate up to 1024 tokens with GPT-2 and 2048 with GPT Neo)
* **`temperature`**: The higher the temperature, the crazier the text (default 0.7, recommended to keep between 0.7 and 1.0)
* **`top_k`**: Limits the generated guesses to the top *k* guesses (default 0 which disables the behavior; if the generated output is super crazy, you may want to set `top_k=40`)
* **`top_p`**: Nucleus sampling: limits the generated guesses to a cumulative probability. (gets good results on a dataset with `top_p=0.9`)
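To show what these sampling parameters do, the sketch below applies temperature, top-k, and top-p filtering to a made-up set of next-token scores. It is an illustration in plain Python, not the code aitextgen or Transformers actually runs; the function name, the order of the filters, and the toy logits are assumptions.

```python
import math


def filter_distribution(logits, temperature=0.7, top_k=0, top_p=1.0):
    """Apply temperature, then top-k, then top-p filtering to token logits.

    `logits` maps token -> raw score; returns token -> renormalized probability.
    """
    # Temperature: divide logits before softmax; lower values sharpen the distribution.
    scaled = {tok: score / temperature for tok, score in logits.items()}
    peak = max(scaled.values())
    exps = {tok: math.exp(s - peak) for tok, s in scaled.items()}
    total = sum(exps.values())
    ranked = sorted(((e / total, tok) for tok, e in exps.items()), reverse=True)

    # Top-k: keep only the k most likely tokens (0 disables the filter).
    if top_k > 0:
        ranked = ranked[:top_k]

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for prob, tok in ranked:
        kept.append((prob, tok))
        cumulative += prob
        if cumulative >= top_p:
            break

    norm = sum(prob for prob, _ in kept)
    return {tok: prob / norm for prob, tok in kept}


# Toy logits for four candidate next tokens (made-up numbers).
toy_logits = {"TV": 2.0, "Player": 1.0, "Watch": 0.0, "Strap": -1.0}
print(filter_distribution(toy_logits, temperature=1.0, top_k=2))
print(filter_distribution(toy_logits, temperature=1.0, top_p=0.9))
```

A token is then drawn at random from the filtered distribution, so tighter filters and lower temperatures give more predictable text.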

In [8]:
temperatures = [1.0, 0.7]

for temp in temperatures:
  print('temperature:', temp)
  ai.generate(n=5,
            batch_size=5,
            prompt="Last minute deals on Sony",
            max_length=100,
            temperature=temp,
            top_p=0.9)
  print("\n")

temperature: 1.0
Last minute deals on Sony Blu-Ray DVD Player BDP-BX58 BDP-BX510 BDP-BX58 BDP-BX510 BDP-BX58 PDP-BX510 PDP-BDX1600
Last minute deals on Sony Blu-Ray DVD Player DVP-SR201P DVP-SR201P DVP-SR201P DVP-SR201P DVP-SR210P DVP-SR405P DVP-SR405P DVP-SR405P DVP-SR405P DVP-SR405P DVP-SR405P DVP-SR405P DVP-SR405P DVP-SR405P DVP-SR510
Last minute deals on Sony LCD TV, Smart TV: UN40U8500, UN40X6200/ UN40X6200/ UN40X8065/ UN40X8065/ UN40X8065/ UN40X8070/ UN40X8075/ UN40X8075/ UN48X80/ UN48X80/ UN48X80/ UN50X80/ UN50X80/ UN50X80/ UN55X80
Last minute deals on Sony DVD Player, HDMI, USB, VGA, AV Input, and Audio Input
Last minute deals on Sony Blu-Ray DVD Player, HD1080P Support BDPK3345


temperature: 0.7
Last minute deals on Sony Blu-Ray DVD Player BDP-XV5100 BDP-XV5900 BDP-XV5100/XAA BDP-XV5100/XAA XBR5500A
Last minute deals on Sony Blu-Ray, DVD and AV, USB HDMI, Solid Stainless Steel Strap with Magnet Lock, Built-in PAL RCA RCA

For bulk generation, you can generate a large number of texts to a file and sort out the samples locally on your computer. The next cell will generate `num_files` files, each with `n` texts and whatever other parameters you would pass to `generate()`. The files can then be downloaded from the Files sidebar!

You can rerun the cells as many times as you want for even more generated texts!

In [9]:
num_files = 5

for _ in range(num_files):
  ai.generate_to_file(n=100,
                     batch_size=50,
                     prompt="AmazonBasics:",
                     max_length=256,
                     temperature=1.0,
                     top_p=0.9)

05/19/2022 08:14:20 — INFO — aitextgen — Generating 1,000 texts to ATG_20220519_081420_33788193.txt


  0%|          | 0/1000 [00:00<?, ?it/s]

05/19/2022 08:15:15 — INFO — aitextgen — Generating 1,000 texts to ATG_20220519_081515_37675417.txt


  0%|          | 0/1000 [00:00<?, ?it/s]

05/19/2022 08:16:01 — INFO — aitextgen — Generating 1,000 texts to ATG_20220519_081601_34059022.txt


  0%|          | 0/1000 [00:00<?, ?it/s]

05/19/2022 08:16:53 — INFO — aitextgen — Generating 1,000 texts to ATG_20220519_081653_28100189.txt


  0%|          | 0/1000 [00:00<?, ?it/s]

05/19/2022 08:17:40 — INFO — aitextgen — Generating 1,000 texts to ATG_20220519_081740_72095916.txt


  0%|          | 0/1000 [00:00<?, ?it/s]
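Sorting out the samples locally can start with splitting a generated file and dropping duplicates and near-empty texts. The sketch below assumes samples are separated by a line of twenty `=` characters; that separator, the `unique_samples` helper, the `min_chars` threshold, and the sample contents are assumptions, so check one of your generated files before relying on it.

```python
SAMPLE_DELIM = "=" * 20  # assumed separator line between samples


def unique_samples(raw_text, delim=SAMPLE_DELIM, min_chars=20):
    """Split a generated file into samples, dropping short texts and duplicates."""
    seen, kept = set(), []
    for chunk in raw_text.split(delim):
        text = chunk.strip()
        if len(text) >= min_chars and text not in seen:
            seen.add(text)
            kept.append(text)
    return kept


# Made-up file contents standing in for one generated file.
raw = (
    "AmazonBasics: HDMI Cable, 6 Feet\n" + SAMPLE_DELIM + "\n"
    "AmazonBasics: HDMI Cable, 6 Feet\n" + SAMPLE_DELIM + "\n"
    "AmazonBasics:\n" + SAMPLE_DELIM + "\n"
    "AmazonBasics: Tilting TV Wall Mount for 37-80 Inch TVs\n"
)
print(unique_samples(raw))
```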

# End of notebook