# Fine-tune a pre-trained GPT-2 for Wine Review Generation

Retrain GPT-2, an advanced text-generating neural network, on a wine review corpus to generate short (2-sentence) reviews.

Visit the [MsSionSommelier GitHub repository](https://github.com/jayozer/MsSionSommelier) for project information!



In [1]:
%tensorflow_version 1.x
!pip install -q gpt-2-simple
import gpt_2_simple as gpt2
from datetime import datetime
from google.colab import files

TensorFlow 1.x selected.
  Building wheel for gpt-2-simple (setup.py) ... done
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.



In [11]:
!nvidia-smi # check which GPU is assigned

Sun Dec  6 19:55:23 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.45.01    Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   66C    P0    42W / 250W |   8587MiB / 16280MiB |      0%      Default |
|                               |                      |                 ERR! |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

## Downloading GPT-2

Since I am retraining a model on new text, I need to download GPT-2 first. I initially ran `124M` and then `355M` to increase accuracy. In my opinion both worked like a champ.

There are four released sizes of GPT-2:

* `124M` (default): the "small" model, 500MB on disk.
* `355M`: the "medium" model, 1.5GB on disk.
* `774M` and `1558M`: the "large" and "extra large" models, which cannot currently be finetuned with Colaboratory but can be used to generate text from the pretrained model.

**Keeping some of the instructions in case**: Larger models have more knowledge, but take longer to finetune and longer to generate text. You can specify which base model to use by changing `model_name` in the cells below. The next cell downloads it from Google Cloud Storage and saves it in the Colaboratory VM at `/models/<model_name>`. This model isn't permanently saved in the Colaboratory VM; you'll have to redownload it if you want to retrain it at a later time.
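
Because the base model has to be fetched again whenever the VM is reset, a small guard can skip the download when the folder is already there. This is only a sketch: `ensure_model` is a hypothetical helper, and it assumes the download function accepts `model_dir=` and `model_name=` keywords, as `gpt2.download_gpt2` is understood to do.

```python
import os

def ensure_model(model_name, download_fn, model_dir="models"):
    """Download the base model only if its folder is missing.

    download_fn is any callable taking model_dir= and model_name= keywords;
    in this notebook that would be gpt2.download_gpt2 (assumed signature).
    Returns True if a download was triggered, False otherwise.
    """
    target = os.path.join(model_dir, model_name)
    if os.path.isdir(target):
        return False  # folder already present, nothing to download
    download_fn(model_dir=model_dir, model_name=model_name)
    return True
```

In the notebook this would be called as `ensure_model("355M", gpt2.download_gpt2)`.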

In [9]:
gpt2.download_gpt2(model_name="355M")

Fetching checkpoint: 1.05Mit [00:00, 683Mit/s]                                                      
Fetching encoder.json: 1.05Mit [00:00, 89.2Mit/s]                                                   
Fetching hparams.json: 1.05Mit [00:00, 732Mit/s]                                                    
Fetching model.ckpt.data-00000-of-00001: 1.42Git [00:11, 125Mit/s]                                  
Fetching model.ckpt.index: 1.05Mit [00:00, 572Mit/s]                                                
Fetching model.ckpt.meta: 1.05Mit [00:00, 110Mit/s]                                                 
Fetching vocab.bpe: 1.05Mit [00:00, 182Mit/s]                                                       


#### Mounting Google Drive

In [10]:
gpt2.mount_gdrive()

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [3]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## Uploading the wine corpus training text to Colaboratory

Google Drive is mounted at `/content/drive`; the notebook's working directory is `/content`.

In [11]:
file_name = "wine_review_corpus.txt"

Since my text file is 29.1MB (> 10MB), I uploaded the corpus txt to Google Drive first, then copied that file from Google Drive to the Colaboratory VM.
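
The size check behind that decision can be sketched in a few lines. The 10MB threshold is the one mentioned above; `size_mb` and `should_go_through_drive` are hypothetical helper names, not part of `gpt-2-simple`.

```python
import os

def size_mb(path):
    """Return the file size in mebibytes."""
    return os.path.getsize(path) / (1024 * 1024)

def should_go_through_drive(path, limit_mb=10):
    """True when the file exceeds the limit and is better staged via Google Drive."""
    return size_mb(path) > limit_mb
```

For example, `should_go_through_drive("wine_review_corpus.txt")` would tell me whether to stage the corpus on Drive before copying it into the VM.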

In [12]:
gpt2.copy_file_from_gdrive(file_name)

## Finetune GPT-2

The next cell will start the actual finetuning of GPT-2. It creates a persistent TensorFlow session which stores the training config, then runs the training for the specified number of `steps`. (To have the finetuning run indefinitely, set `steps = -1`.)

The model checkpoints are saved in `/checkpoint/<run_name>` (here `/checkpoint/run355M`). With `save_every=500` as set in the cell below, a checkpoint is written every 500 steps and when the cell is stopped.

The training might time out after about 4 hours; make sure you end training and save the results so you don't lose them!

**IMPORTANT NOTE:** To rerun this cell, **restart the VM first** (Runtime -> Restart Runtime). You will need to rerun imports but not recopy files.

Other optional-but-helpful parameters for `gpt2.finetune`:

* **`restore_from`**: Set to `fresh` to start training from the base GPT-2, or set to `latest` to restart training from an existing checkpoint.
* **`sample_every`**: Number of steps to print example output.
* **`print_every`**: Number of steps to print training progress.
* **`learning_rate`**: Learning rate for the training. (default `1e-4`, can lower to `1e-5` if you have <1MB input data)
* **`run_name`**: Subfolder within `checkpoint` to save the model. This is useful if you want to work with multiple models (will also need to specify `run_name` when loading the model).
* **`overwrite`**: Set to `True` if you want to continue finetuning an existing model (w/ `restore_from='latest'`) without creating duplicate copies.

I did not further experiment with additional parameters.

In [13]:
sess = gpt2.start_tf_sess()

gpt2.finetune(sess,
              dataset=file_name,
              model_name='355M',
              steps=1000,
              restore_from='fresh',
              run_name='run355M',
              print_every=10,
              sample_every=200,
              save_every=500
              )

Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
Instructions for updating:
Please use tensorflow.python.ops.op_selector.get_backward_walk_ops.
Loading checkpoint models/355M/model.ckpt
INFO:tensorflow:Restoring parameters from models/355M/model.ckpt


  0%|          | 0/1 [00:00<?, ?it/s]

Loading dataset...


100%|██████████| 1/1 [00:26<00:00, 26.82s/it]


dataset has 6794543 tokens
Training...
[10 | 17.01] loss=3.42 avg=3.42
[20 | 25.81] loss=3.53 avg=3.47
[30 | 34.61] loss=3.12 avg=3.35
[40 | 43.42] loss=3.22 avg=3.32
[50 | 52.22] loss=3.15 avg=3.28
[60 | 61.02] loss=3.33 avg=3.29
[70 | 69.82] loss=3.19 avg=3.28
[80 | 78.62] loss=3.11 avg=3.26
[90 | 87.41] loss=3.22 avg=3.25
[100 | 96.21] loss=3.05 avg=3.23
[110 | 105.05] loss=2.94 avg=3.20
[120 | 113.85] loss=2.58 avg=3.15
[130 | 122.64] loss=3.01 avg=3.14
[140 | 131.44] loss=3.16 avg=3.14
[150 | 140.23] loss=3.17 avg=3.14
[160 | 149.03] loss=2.67 avg=3.11
[170 | 157.83] loss=3.12 avg=3.11
[180 | 166.62] loss=2.89 avg=3.10
[190 | 175.42] loss=3.22 avg=3.10
[200 | 184.21] loss=2.94 avg=3.09
 wine flavors of vanilla and cinnamon. the texture is long and slightly chewy. with its fruity and almost tropical flavors, this will turn into more of a medium-bodied chardonnay, but for now, it's a good bottle-name, and worth adding to the cellar. aromas for this wine are almost all white-flowered
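
The progress lines above appear to follow the pattern `[step | elapsed seconds] loss=... avg=...`. Assuming that reading is right, a small parser can turn the log into numbers and estimate how long a run will take. This is a sketch; `parse_log` and `seconds_per_step` are hypothetical helpers, and the sample lines are copied from the output above.

```python
import re

# Matches lines such as "[10 | 17.01] loss=3.42 avg=3.42"
LOG_LINE = re.compile(r"\[(\d+) \| ([\d.]+)\] loss=([\d.]+) avg=([\d.]+)")

def parse_log(text):
    """Return a list of (step, elapsed_seconds, loss, avg_loss) tuples."""
    rows = []
    for line in text.splitlines():
        m = LOG_LINE.match(line.strip())
        if m:
            rows.append((int(m.group(1)), float(m.group(2)),
                         float(m.group(3)), float(m.group(4))))
    return rows

def seconds_per_step(rows):
    """Average wall-clock seconds per step between the first and last row."""
    first, last = rows[0], rows[-1]
    return (last[1] - first[1]) / (last[0] - first[0])

sample = """[10 | 17.01] loss=3.42 avg=3.42
[100 | 96.21] loss=3.05 avg=3.23
[200 | 184.21] loss=2.94 avg=3.09"""

rows = parse_log(sample)
rate = seconds_per_step(rows)
print(f"{rate:.2f} s/step, roughly {rate * 1000 / 60:.0f} minutes per 1000 steps")
```

An estimate like this helps judge whether a planned number of `steps` fits inside the Colab session limit.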

After the model is trained, I copied the checkpoint folder to my Drive. When Colab is restarted, all variables, including the model results, are lost unless they have been saved to Drive.

In [14]:
gpt2.copy_checkpoint_to_gdrive(run_name='run355M')
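
To confirm which training step actually made it into the checkpoint folder, the file names can be inspected. The sketch below assumes checkpoint files are named like `model-1000.index` (consistent with the `checkpoint/run355M/model-1000` path printed when the model is loaded later); `latest_checkpoint_step` is a hypothetical helper.

```python
import os
import re

def latest_checkpoint_step(run_dir):
    """Return the highest step number among files named 'model-<step>.*', or None."""
    steps = set()
    for name in os.listdir(run_dir):
        m = re.match(r"model-(\d+)\.", name)
        if m:
            steps.add(int(m.group(1)))
    return max(steps) if steps else None
```

For this run, `latest_checkpoint_step("checkpoint/run355M")` should report the last saved step.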

The retrained model is ready. It was extremely painless; Google Colab Pro works like a champ. Next is generating text with the model retrained on the domain-specific wine corpus.

## Load a saved Model Checkpoint from Gdrive to VM

Copy the `.tar` checkpoint file from your Drive back into the Colaboratory VM. This is very useful when switching between models. Whenever I change a model I have to restart the runtime, so when I want to generate text I need to start from this point.
One thing to note: after I ran `copy_checkpoint_to_gdrive` above, I had to run `copy_checkpoint_from_gdrive` below before the checkpoint showed up in my Drive. The step below loads the checkpoint from Drive into the VM.

In [15]:
#gpt2.copy_checkpoint_from_gdrive(run_name='run124M')
gpt2.copy_checkpoint_from_gdrive(run_name='run355M')

The next cell will allow you to load the retrained model checkpoint + metadata necessary to generate text.

**IMPORTANT NOTE:** If you want to rerun this cell, **restart the VM first** (Runtime -> Restart Runtime). You will need to rerun imports but not recopy files.

In [1]:
# restart runtime first
%tensorflow_version 1.x
!pip install -q gpt-2-simple
import gpt_2_simple as gpt2
from datetime import datetime
from google.colab import files

TensorFlow 1.x selected.
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.



In [2]:
# sess = gpt2.start_tf_sess()
# gpt2.load_gpt2(sess, run_name='run124M')

sess2 = gpt2.start_tf_sess()
gpt2.load_gpt2(sess2, run_name='run355M')

Loading checkpoint checkpoint/run355M/model-1000
INFO:tensorflow:Restoring parameters from checkpoint/run355M/model-1000


## Generate Text From The Trained Model

After you've trained the model or loaded a retrained model from checkpoint, you can now generate text. `generate` generates a single text from the loaded model.

In [5]:
# gpt2.generate(sess, run_name='run124M')
gpt2.generate(sess2, run_name='run355M')

keeps the aromas fresh with further decanting. this is a richly textured cabernet, with ripe fruit, spice and herbs aromas and flavors that are reminiscent of a sb with a little more depth and length. the mouthfeel is soft and plush with a firm, fine-grained tannins. the wine is ready to drink. this smells of ripe black currant, licorice, leather and a touch of baking spice. the palate is tight and tight, with a deep core of tannin and acidity. the finish is spicy and peppery. drink through 2017. this is a lively, vibrant and full-bodied wine with a combination of fruit and barrel flavors that add a creamy texture to the rich blackberry and black currant fruit. it has a firm, tannic structure and a finish that is spicy and peppery. drink through 2018. this is a fresh and fruity wine from the highland area of the cabernet region, with notes of citrus, apple and toast. it is lightly structured, with a slightly soft structure. the wine is ready to drink. this is a ripe, structured, full-b

There are a few options that I would like to note here. By default `generate` prints the text rather than returning it; if I were creating an API based on the model I just created and needed to pass the generated text elsewhere, I could use `text = gpt2.generate(sess, return_as_list=True)[0]`.

I also pass a `prefix` of `This wine is` to the generate function to force the text to start with a given character sequence and generate text from there. I think this brings the additional cohesiveness I have been striving for.

I also generated 5 texts at a time by specifying `nsamples`. Unique to GPT-2, you can pass a `batch_size` to generate multiple samples in parallel, giving a massive speedup; I also set this to 5. (In Colaboratory, set a maximum of 20 for `batch_size`.)
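
As far as I understand, `gpt-2-simple` expects `nsamples` to be an exact multiple of `batch_size`. Under that assumption, a helper can pick the largest batch size that divides `nsamples` while staying within the Colaboratory cap of 20 mentioned above. `pick_batch_size` is a hypothetical name, not a library function.

```python
def pick_batch_size(nsamples, max_batch=20):
    """Largest divisor of nsamples that does not exceed max_batch.

    Assumes nsamples >= 1; falls back to 1 when no larger divisor fits.
    """
    for size in range(min(nsamples, max_batch), 0, -1):
        if nsamples % size == 0:
            return size

for n in (5, 30, 100):
    print(n, "samples -> batch_size", pick_batch_size(n))
```

The result can be passed straight to `generate` as `batch_size=pick_batch_size(nsamples)`.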

Other optional-but-helpful parameters for `gpt2.generate`:

* **`length`**: Number of tokens to generate (default 1023, the maximum)
* **`temperature`**: The higher the temperature, the crazier the text (default 0.7, recommended to keep between 0.7 and 1.0)
* **`top_k`**: Limits the generated guesses to the top *k* guesses (default 0 which disables the behavior; if the generated output is super crazy, you may want to set `top_k=40`)
* **`top_p`**: Nucleus sampling: limits the generated guesses to a cumulative probability. (gets good results on a dataset with `top_p=0.9`)
* **`truncate`**: Truncates the input text until a given sequence, excluding that sequence (e.g. if `truncate='<|endoftext|>'`, the returned text will include everything before the first `<|endoftext|>`). It may be useful to combine this with a smaller `length` if the input texts are short.
* **`include_prefix`**: If using `truncate` and `include_prefix=False`, the specified `prefix` will not be included in the returned text.
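
To make `temperature` and `top_p` less abstract, here is a toy sketch of what they do to a next-token distribution. It is an illustration in plain Python on made-up logits, not the code `gpt-2-simple` runs internally; `apply_temperature` and `top_p_filter` are hypothetical names.

```python
import math

def apply_temperature(logits, temperature):
    """Softmax over logits divided by temperature; lower values sharpen the distribution."""
    scaled = [l / temperature for l in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, top_p):
    """Keep the most likely tokens until their cumulative probability reaches top_p.

    Returns {token_index: renormalized_probability} for the kept tokens.
    """
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}

logits = [2.0, 1.0, 0.0, -1.0]  # made-up scores for four candidate tokens
for t in (1.0, 0.7):
    probs = apply_temperature(logits, t)
    print(f"temperature={t}: tokens kept at top_p=0.9 ->", sorted(top_p_filter(probs, 0.9)))
```

With these toy logits, lowering the temperature concentrates probability on the top token, so fewer candidates survive the `top_p` cut, which matches the intuition that lower settings give more conservative text.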


I think the key one here is `include_prefix`. I would still get the desired effect of `This wine is ` but without the repetitive nature of the string.
Also, I set `length` to 45 tokens since reviews average between 40 and 50 words (tokens are not exactly words, so this is approximate).
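
Because a fixed `length` usually cuts the last sentence off mid-way, the samples can be trimmed afterwards to the first two complete sentences. This is a rough sketch with a hypothetical helper, `first_sentences`; it splits on `.`, `!` or `?` followed by whitespace, so abbreviations such as `u.s.` will be treated as sentence ends.

```python
import re

def first_sentences(text, n=2):
    """Return the first n complete sentences of text, dropping a trailing fragment."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    complete = [p for p in parts if p and p[-1] in ".!?"]
    return " ".join(complete[:n])

sample = ("This wine is made from merlot, cabernet sauvignon, petit verdot and petit verdot. "
          "it has a firm structure, with the fruit, wood and acidity of the variety. it is a wine")
print(first_sentences(sample))
```

The sample string is one of the generated reviews shown below; the dangling "it is a wine" fragment is discarded.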

In [7]:
!pwd

/content


In [11]:
#model_name='124M'
model_name='355M'

# I had folder path issues, which I solved by passing parameter names explicitly, namely model_name=model_name.

In [12]:
gpt2.generate(sess2,
              model_name=model_name,
              prefix="This wine is",
              length=45,
              temperature=0.7,
              top_p=0.9,
              nsamples=5,
              batch_size=5
              )

Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
This wine is made from merlot, cabernet sauvignon, petit verdot and petit verdot. it has a firm structure, with the fruit, wood and acidity of the variety. it is a wine
This wine is big and bold, with concentrated blackberry, black currant and cola flavors. it's dry and firm, with a firm, dark tannic structure. a small amount of cabernet sauvignon gives
This wine is a blend of cabernet sauvignon, merlot and petit verdot. it's rich in black cherry and blackberry jam, oak, and a good, solid tannin structure. drink now.
This wine is a blend of cabernet sauvignon, merlot and petit verdot. it's a juicy wine with soft tannins, a ripe black cherry flavor and a light touch of acidity. it's
This wine is made from the cabernet sauvignon and merlot varieties. it has a strongly tannic character that is characteristic of the variety. it is very dry and shows intense acidity. the wine is spicy and


In [13]:
gpt2.generate(sess2,
              model_name=model_name,
              length=45,
              temperature=0.9,
              prefix="This wine is",
              nsamples=5,
              batch_size=5
              )

This wine is exclusive to the national u.s. auction house of dan sommers. it's one of the price points that mark the high-end of merlot. crisp acidity and bright fruit flavors are beautifully interrupted by a
This wine is crisp and fresh, with crisp lemon-lime acidity and easier-and-bitter tannins. with mulled pomegranate, baking spice and braces of basil and fennel on the long, vel
This wine is dry, delicate and contemporary, with a creamy, supple mouthfeel and a pleasant fruity aftertaste. the following is a variety that is frequently used in this region formerly known as the loire valley. the wine
This wine is made from vines from the arinto cannard grape. the amount of strong, red fruits is great, along with plenty of spice and tannins. the finish is dry and clean, with a note of chocolate. it
This wine is a straight line between rosé and petit rouge. it starts off with crisp citrus flavors covered in caramel and honey, backed with firm acidity. it finishes with a rich, round textu

**Bulk generation**
Generate text directly to a file and sort out the samples locally on your computer. The code below generates a text file with a timestamp in its name.

In [16]:
gen_file = 'gpt2_gentext_{:%Y%m%d_%H%M%S}.txt'.format(datetime.utcnow())

gpt2.generate_to_file(sess2,
                      model_name=model_name,
                      destination_path=gen_file,
                      length=45,
                      temperature=0.7,
                      top_p=0.9,
                      prefix="This wine is",
                      nsamples=100,
                      batch_size=20
                      )

In [17]:
# may have to run twice to get file to download
files.download(gen_file)
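
Once the file is downloaded, sorting out the samples locally can start with splitting and de-duplicating them. The sketch below assumes samples in the file are separated by a line of 20 `=` characters, which I believe is the default delimiter; adjust `delim` if the file looks different. `split_samples` is a hypothetical helper.

```python
def split_samples(text, delim="=" * 20):
    """Split a generated file into unique, non-empty samples, keeping first-seen order."""
    samples = []
    seen = set()
    for chunk in text.split(delim):
        sample = chunk.strip()
        if sample and sample not in seen:
            seen.add(sample)
            samples.append(sample)
    return samples

# Tiny in-memory stand-in for the contents of a generated file.
demo = ("This wine is dry.\n" + "=" * 20 + "\n"
        "This wine is bold.\n" + "=" * 20 + "\n"
        "This wine is dry.\n" + "=" * 20 + "\n")
print(split_samples(demo))
```

On a real file this would be used as `split_samples(open(path, encoding="utf-8").read())`, where `path` is the downloaded file.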

In [None]:
!kill -9 -1 # force-kill the Colab runtime