<a href="https://colab.research.google.com/github/mholmeslinder/ai_rtist_bot/blob/master/new_ai_rtists_GPT_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Part 2: Finetuning GPT-2 on our dataset and spitting out 'new artists'

### This is totally based on the awesome [GPT-2-Simple](https://github.com/minimaxir/gpt-2-simple) and its associated Colab.

In [Part 1](https://github.com/mholmeslinder/ai_rtist_bot/blob/master/bot_ml_repo/new_ai_rtists_dataset.ipynb), we used the Last.fm API to collect our dataset - a list of ~10,000 names of real musical artists. In this Colab, we'll be retraining [OpenAI's GPT-2](https://openai.com/blog/better-language-models/) on that data and getting it to spit out some fictional 'new artist names', which we will run through our ["cleaner"](https://github.com/mholmeslinder/ai_rtist_bot/blob/master/bot_ml_repo/cleaner.py), removing any items that duplicate our existing dataset. 

At that point, we'll have a list of 'new artist names', which we can use for whatever purpose we want!
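
The real cleaner is the `cleaner.py` script linked above. The block below is only a minimal sketch of the idea, not that script. The function name `remove_known_artists` and the case-insensitive, whitespace-trimmed comparison are assumptions made for illustration.

```python
def remove_known_artists(generated, known):
    """Return generated names that are not in `known`, keeping first-seen order.

    Hypothetical helper: the comparison ignores case and surrounding
    whitespace, which is an assumption and may differ from cleaner.py.
    """
    seen = {name.strip().casefold() for name in known}
    fresh = []
    for name in generated:
        key = name.strip().casefold()
        if key and key not in seen:
            fresh.append(name.strip())
            seen.add(key)  # also drops repeats within the generated list
    return fresh


known = ["Blues Traveler", "James Brown", "Weezer"]
generated = ["Scott Lambert", "blues traveler", "Tejal Yann", "Scott Lambert", ""]
print(remove_known_artists(generated, known))
# → ['Scott Lambert', 'Tejal Yann']
```

Using a set for the lookup keeps the check fast even with ~10,000 known names.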

## Imports.

In [1]:
%tensorflow_version 1.x
!pip install -q gpt-2-simple
import gpt_2_simple as gpt2
from datetime import datetime
from google.colab import files

  Building wheel for gpt-2-simple (setup.py) ... done
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.



## GPU

Here's what Max Woolf says in his [brilliant Colab](https://colab.research.google.com/drive/1VLG8e7YSEwypxU-noRNhsv5dW4NfTGce):

"Colaboratory uses either a Nvidia T4 GPU or an Nvidia K80 GPU. The T4 is slightly faster than the old K80 for training GPT-2, and has more memory allowing you to train the larger GPT-2 models and generate more text.

You can verify which GPU is active by running the cell below." 

In [2]:
!nvidia-smi

Sat Mar  7 01:08:25 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.59       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|   0  Tesla P4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   42C    P8     7W /  75W |      0MiB /  7611MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|  No ru

## Download GPT-2
We'll be training on GPT-2's Medium (355M) model. If you'd like more info about the different models of GPT-2 that OpenAI released, you can check out the GPT-2-Simple links in the cells above or on [OpenAI's site](https://openai.com/blog/gpt-2-1-5b-release/).

In [3]:
gpt2.download_gpt2(model_name="355M")

Fetching checkpoint: 1.05Mit [00:00, 299Mit/s]                                                      
Fetching encoder.json: 1.05Mit [00:00, 100Mit/s]                                                    
Fetching hparams.json: 1.05Mit [00:00, 608Mit/s]                                                    
Fetching model.ckpt.data-00000-of-00001: 1.42Git [00:05, 278Mit/s]                                  
Fetching model.ckpt.index: 1.05Mit [00:00, 279Mit/s]                                                
Fetching model.ckpt.meta: 1.05Mit [00:00, 114Mit/s]                                                 
Fetching vocab.bpe: 1.05Mit [00:00, 167Mit/s]                                                       


## Dataset File

In the aforementioned totally awesome Colab, Max Woolf advises mounting Drive as a way to upload datasets for our model to crunch on. This is an excellent idea, and I highly recommend it for most Colab purposes. 

Since, for the purposes of documenting our [Twitter Bot](https://twitter.com/new_ai_rtists), we only need one dataset, we'll instead run the following code to grab it from our [Github repo](https://github.com/mholmeslinder/ai_rtist_bot).

In [4]:
!git clone https://github.com/mholmeslinder/ai_rtist_bot
file_name = "ai_rtist_bot/data/artist_names.txt"

Cloning into 'ai_rtist_bot'...
remote: Enumerating objects: 70, done.
remote: Counting objects: 100% (70/70), done.
remote: Compressing objects: 100% (42/42), done.
remote: Total 70 (delta 28), reused 63 (delta 21), pack-reused 0
Unpacking objects: 100% (70/70), done.


## Finetune GPT-2

Here, I'll again quote Max Woolf's awesome Colab:

"The next cell will start the actual finetuning of GPT-2. It creates a persistent TensorFlow session which stores the training config, then runs the training for the specified number of `steps`. (to have the finetuning run indefinitely, set `steps = -1`)

The model checkpoints will be saved in `/checkpoint/run1` by default. The checkpoints are saved every 500 steps (can be changed) and when the cell is stopped.

The training might time out after 4ish hours; make sure you end training and save the results so you don't lose them!

**IMPORTANT NOTE:** If you want to rerun this cell, **restart the VM first** (Runtime -> Restart Runtime). You will need to rerun imports but not recopy files.

Other optional-but-helpful parameters for `gpt2.finetune`:


*  **`restore_from`**: Set to `fresh` to start training from the base GPT-2, or set to `latest` to restart training from an existing checkpoint.
* **`sample_every`**: Number of steps to print example output
* **`print_every`**: Number of steps to print training progress.
* **`learning_rate`**:  Learning rate for the training. (default `1e-4`, can lower to `1e-5` if you have <1MB input data)
*  **`run_name`**: subfolder within `checkpoint` to save the model. This is useful if you want to work with multiple models (will also need to specify  `run_name` when loading the model)
* **`overwrite`**: Set to `True` if you want to continue finetuning an existing model (w/ `restore_from='latest'`) without creating duplicate copies."

In [5]:
sess = gpt2.start_tf_sess()

gpt2.finetune(sess,
              dataset=file_name,
              model_name='355M',
              steps=1000,
              restore_from='latest',
              run_name='run3',
              print_every=10,
              sample_every=200,
              save_every=500
              )

Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
Instructions for updating:
Please use tensorflow.python.ops.op_selector.get_backward_walk_ops.
Loading checkpoint models/355M/model.ckpt
INFO:tensorflow:Restoring parameters from models/355M/model.ckpt


  0%|          | 0/1 [00:00<?, ?it/s]

Loading dataset...


100%|██████████| 1/1 [00:00<00:00,  1.81it/s]


dataset has 43571 tokens
Training...
[10 | 32.03] loss=3.75 avg=3.75
[20 | 53.97] loss=4.24 avg=4.00
[30 | 75.94] loss=3.61 avg=3.87
[40 | 97.99] loss=3.27 avg=3.72
[50 | 120.07] loss=3.45 avg=3.66
[60 | 142.13] loss=2.55 avg=3.47
[70 | 164.25] loss=4.12 avg=3.57
[80 | 186.40] loss=3.94 avg=3.61
[90 | 208.48] loss=2.64 avg=3.50
[100 | 230.57] loss=2.47 avg=3.39
[110 | 252.71] loss=2.62 avg=3.32
[120 | 274.83] loss=2.75 avg=3.27
[130 | 296.93] loss=2.81 avg=3.23
[140 | 319.06] loss=2.23 avg=3.16
[150 | 341.20] loss=2.29 avg=3.09
[160 | 363.36] loss=3.20 avg=3.10
[170 | 385.47] loss=1.95 avg=3.03
[180 | 407.64] loss=2.29 avg=2.98
[190 | 429.74] loss=2.76 avg=2.97
[200 | 451.88] loss=2.59 avg=2.95

Bobby Hebb
Mariah Montage
Lorne Balfe
The Magician
The Kooks
Bobby Wilson
Caleb Belkin
Mobb Deep
Kelsey Lu
Jadakiss
The Dap-Kings
Yvonne Elliman
Clara Nunes
Jawbreaker
Maggie Lindemann
Blues Traveler
M. Ward
The Magician
Pierce Fulton
Lorena
Fela Kuti
Gang Starr
Aurora A.M.
Bobby McFerrin
NCT 1

## Model Checkpoints
 
Once the above finetuning is done, you'll find the finetuned checkpoint in the file navigator to the left, under `checkpoint/run3` (the `run_name` we passed to `gpt2.finetune`); `models/355M` holds the base model we downloaded earlier. Since this is just a demonstration of creating the model for `new_ai_rtists`, we'll skip the steps on saving and loading checkpoints to Drive. As always, see the original Colab for more info.

## Generate Text From The Trained Model

After you've trained the model, you can now generate text. `generate` generates a single text from the loaded model.



In [6]:
gpt2.generate(sess, run_name='run3')

Scott Lambert
Blues Traveler
James Brown
Jony
Groove Armada
Static & Rock
Nio Garcia
Dayseeker
James Brown
BD
Ylvis
Tim Buckley
Belly
Kenny G
Cigarettes After Sex
La Femme
Thekla
Jaden
Bakermat
The Victor
The Secret Sisters
AXA
Martha Reeves & The Vandellas
BB King
Marcus King
George Ezra
Dwayne Johnson
BTS
Weezer
Young Thug
LIL UZI VERT
Arcade Fire
Ellie Goulding
Black Eyed Peas
Lewis Capaldi
Niara
Dance Gavin Dance
Nick Cave
Wings
Formosa
Oliver Évilo
S3RL
The Ink Spots
S3R3
Slade
Curbi
Måns Zelmerlöw
Night Moves
Flobots
Arctic Monkeys
Hayley Williams
Death in June
Wintergatan
Édith Piaf
Tep No
Jonny Greenwood
Beach Goons
Chloe x Halle
Tegan and Sara
Zero 7
BLACK MIDI
Tomaso Giovanni Cesari
Easy Life
Juice Newton
Pusha T
A$AP Twelvyy
Glass Candy
Blanck Mass
Tejal Yann
Eli.
Mike Williams
Ed Sheeran
Blanck Mass
The Lumineers
Johann Sebastian Bach
Lou Reed
Blues Saraceno
My Morning Jacket
Sugabababababababababababababaco
Monolord
Yello
Paul Weller
Conor Oberst
Weller
Johann Sebastian Ba

You can pass in a `prefix` to the generate function to force the text to start with a given character sequence and generate text from there (good if you add an indicator when the text starts).

You can also generate multiple texts at a time by specifying `nsamples`. Unique to GPT-2, you can pass a `batch_size` to generate multiple samples in parallel, giving a massive speedup (in Colaboratory, set a maximum of 20 for `batch_size`).

Other optional-but-helpful parameters for `gpt2.generate` and friends:

*  **`length`**: Number of tokens to generate (default 1023, the maximum)
* **`temperature`**: The higher the temperature, the crazier the text (default 0.7, recommended to keep between 0.7 and 1.0)
* **`top_k`**: Limits the generated guesses to the top *k* guesses (default 0 which disables the behavior; if the generated output is super crazy, you may want to set `top_k=40`)
* **`top_p`**: Nucleus sampling: limits the generated guesses to a cumulative probability. (gets good results on a dataset with `top_p=0.9`)
* **`truncate`**: Truncates the input text until a given sequence, excluding that sequence (e.g. if `truncate='<|endoftext|>'`, the returned text will include everything before the first `<|endoftext|>`). It may be useful to combine this with a smaller `length` if the input texts are short.
*  **`include_prefix`**: If using `truncate` and `include_prefix=False`, the specified `prefix` will not be included in the returned text.
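
To get a feel for what `temperature`, `top_k`, and `top_p` do, here is a toy sampler in plain Python. It is an illustration of the general technique only, not gpt-2-simple's implementation. The function name `sample_token` and the made-up logits are assumptions for the example.

```python
import math
import random


def sample_token(logits, temperature=0.7, top_k=0, top_p=1.0, rng=random):
    """Pick one token from a {token: logit} dict (toy illustration only)."""
    # Temperature: divide logits before the softmax; lower values sharpen
    # the distribution, higher values flatten it ("crazier" text).
    scaled = {tok: logit / temperature for tok, logit in logits.items()}
    peak = max(scaled.values())
    weights = {tok: math.exp(v - peak) for tok, v in scaled.items()}
    total = sum(weights.values())
    ranked = sorted(((tok, w / total) for tok, w in weights.items()),
                    key=lambda pair: pair[1], reverse=True)

    # top_k: keep only the k most likely tokens (0 disables the filter).
    if top_k > 0:
        ranked = ranked[:top_k]

    # top_p: keep the smallest set of top tokens whose cumulative
    # probability reaches top_p (nucleus sampling).
    kept, cumulative = [], 0.0
    for tok, prob in ranked:
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= top_p:
            break

    tokens = [tok for tok, _ in kept]
    probs = [prob for _, prob in kept]
    return rng.choices(tokens, weights=probs, k=1)[0]


# Made-up logits for four candidate next tokens.
toy_logits = {"The": 2.0, "Blues": 1.0, "Sugababa": 0.1, "AXA": -1.0}
rng = random.Random(0)
print(sample_token(toy_logits, top_k=1, rng=rng))  # → The
print({sample_token(toy_logits, top_p=0.9, rng=rng) for _ in range(200)})
```

With `top_k=1` only the single most likely token survives, so the output is always the same. With `top_p=0.9` the two unlikely tokens are cut off, which is the kind of filtering that helps when the generated output gets "super crazy".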


For bulk generation, you can generate a large amount of text to a file and sort out the samples locally on your computer. The next cell will write the generated text to a file with a unique timestamp in its name.

You can rerun the cells as many times as you want for even more generated texts!

In [0]:
gen_file = 'gpt2_gentext_{:%Y%m%d_%H%M%S}.txt'.format(datetime.utcnow())

gpt2.generate_to_file(sess,
                      destination_path=gen_file,
                      length=1023,
                      temperature=0.7,
                      nsamples=100,
                      batch_size=20,
                      run_name='run3'
                      )

In [0]:
# may have to run twice to get file to download
files.download(gen_file)
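
Once you have the file, you'll want to turn it into a plain list of candidate names. Here is a minimal sketch, assuming the samples in the file are separated by lines made up entirely of `=` characters (which I believe is gpt-2-simple's default delimiter; check your own file). The function name `names_from_samples` is hypothetical. Since a sample can stop mid-name (like `Johann Sebastian Ba` in the output above), the sketch drops the last line of each sample by default.

```python
def names_from_samples(text, drop_last_line=True):
    """Split bulk-generated text into a de-duplicated list of names."""
    names, seen = [], set()
    sample = []

    def flush():
        # Optionally discard the final line, which may be cut off mid-name.
        lines = sample[:-1] if drop_last_line else sample
        for line in lines:
            if line not in seen:
                seen.add(line)
                names.append(line)
        sample.clear()

    for raw in text.splitlines():
        line = raw.strip()
        if line and set(line) == {"="}:  # separator between samples (assumed format)
            flush()
        elif line:
            sample.append(line)
    flush()  # the final sample may not be followed by a separator
    return names


text = (
    "Lou Reed\nMonolord\nJohann Sebastian Ba\n"
    "====================\n"
    "Yello\nLou Reed\nPaul Well\n"
)
print(names_from_samples(text))
# → ['Lou Reed', 'Monolord', 'Yello']
```

To use it on the real output, read the file first, e.g. `names_from_samples(open(gen_file, encoding='utf-8').read())`. Note that dropping the last line also throws away a complete name whenever a sample happens to end cleanly; pass `drop_last_line=False` if you'd rather keep everything and weed out the fragments by hand.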

## Conclusion

Just following along and running the cells in this Colab, you'll produce a GPT-2 model, checkpoint, AND generated text(s) using the exact methodology I did to create `new_ai_rtists`. Of course, it should be pretty easy to adapt all of this to whatever text-based project you want, so **GO NUTS!**

## NOTE
You could make a pretty good argument that, since GPT-2 was trained on internet data, it's more likely to spit out artist names that already belong to real artists, whether or not they're in our initial dataset. It's my intention to replicate this whole process using a 'fresh' text-generation RNN and see how the results compare. 

So, be on the lookout for updated documentation!