<a href="https://colab.research.google.com/github/honzasvasek/free-butoh-teacher/blob/master/Copy_of_Train_a_GPT_2_Text_Generating_Model_w_GPU.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Creating body-centric Butoh-Fu with GPT-2

Remix from a codelab by [Max Woolf](http://minimaxir.com)


For more about `gpt-2-simple`, you can visit [this GitHub repository](https://github.com/minimaxir/gpt-2-simple). You can also read Max Woolf's [blog post](https://minimaxir.com/2019/09/howto-gpt2/) for more information on how to use this notebook!


To get started:

1. Copy this notebook to your Google Drive to keep it and save your changes. (File -> Save a Copy in Drive)
2. Make sure you're running the notebook in Google Chrome.
3. Run the cells below:


In [None]:
%tensorflow_version 1.x
!pip install -q gpt-2-simple
import gpt_2_simple as gpt2
from datetime import datetime
from google.colab import files

TensorFlow 1.x selected.
  Building wheel for gpt-2-simple (setup.py) ... done
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.



## GPU

Colaboratory uses either an Nvidia T4 GPU or an Nvidia K80 GPU. The T4 is slightly faster than the older K80 for training GPT-2 and has more memory, which lets you train the larger GPT-2 models and generate more text.

You can verify which GPU is active by running the cell below.

In [None]:
!nvidia-smi

Sat Sep  5 12:47:45 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.66       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   48C    P8    10W /  70W |      0MiB / 15079MiB |      0%      Default |
|                               |                      |                 ERR! |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

## Downloading GPT-2

If you're retraining a model on new text, you need to download the GPT-2 model first. 

There are four released sizes of GPT-2:

* `124M` (default): the "small" model, 500MB on disk.
* `355M`: the "medium" model, 1.5GB on disk.
* `774M`: the "large" model, cannot currently be finetuned with Colaboratory but can be used to generate text from the pretrained model (see later in Notebook)
* `1558M`: the "extra large", true model. Will not work if a K80 GPU is attached to the notebook. (like `774M`, it cannot be finetuned).

Larger models have more knowledge, but take longer to finetune and longer to generate text. You can specify which base model to use by changing `model_name` in the cells below.
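
The constraints in the list above can be written down as a small lookup, which helps you fail early before a long download. This is only a sketch: `MODEL_LIMITS` and `check_model` are hypothetical names, not part of `gpt-2-simple`, and the rules simply restate what this notebook says about Colaboratory.

```python
# Hypothetical helper; it only restates this notebook's claims about Colaboratory.
MODEL_LIMITS = {
    "124M": {"finetune": True, "needs_t4": False},
    "355M": {"finetune": True, "needs_t4": False},
    "774M": {"finetune": False, "needs_t4": False},
    "1558M": {"finetune": False, "needs_t4": True},
}


def check_model(model_name, finetune=False, gpu_name="Tesla T4"):
    """Return a list of problems (empty if the combination should work)."""
    if model_name not in MODEL_LIMITS:
        return ["unknown model size: " + model_name]
    limits = MODEL_LIMITS[model_name]
    problems = []
    if finetune and not limits["finetune"]:
        problems.append(model_name + " cannot be finetuned in Colaboratory")
    if limits["needs_t4"] and "K80" in gpu_name:
        problems.append(model_name + " will not work on a K80 GPU")
    return problems


print(check_model("355M", finetune=True))
print(check_model("1558M", finetune=True, gpu_name="Tesla K80"))
```

Pass the GPU name reported by `!nvidia-smi` as `gpu_name` to check the combination you actually have.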

The next cell downloads it from Google Cloud Storage and saves it in the Colaboratory VM at `/models/<model_name>`.

This model isn't permanently saved in the Colaboratory VM; you'll have to redownload it if you want to retrain it at a later time.

In [None]:
gpt2.download_gpt2(model_name="774M")

Fetching checkpoint: 1.05Mit [00:00, 240Mit/s]                                                      
Fetching encoder.json: 1.05Mit [00:00, 107Mit/s]                                                    
Fetching hparams.json: 1.05Mit [00:00, 295Mit/s]                                                    
Fetching model.ckpt.data-00000-of-00001: 3.10Git [00:46, 66.4Mit/s]                                 
Fetching model.ckpt.index: 1.05Mit [00:00, 271Mit/s]                                                
Fetching model.ckpt.meta: 2.10Mit [00:00, 193Mit/s]                                                 
Fetching vocab.bpe: 1.05Mit [00:00, 143Mit/s]                                                       
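
To avoid downloading a model that is already on the VM, you could check for the files named in the log above. A minimal sketch, assuming the model lives in a `models/<model_name>` folder relative to the working directory; `model_is_downloaded` is a hypothetical helper, not a `gpt-2-simple` function.

```python
import os

# File names taken from the download log above.
MODEL_FILES = [
    "checkpoint",
    "encoder.json",
    "hparams.json",
    "model.ckpt.data-00000-of-00001",
    "model.ckpt.index",
    "model.ckpt.meta",
    "vocab.bpe",
]


def model_is_downloaded(model_name, model_dir="models"):
    """True if every expected model file exists under model_dir/model_name."""
    folder = os.path.join(model_dir, model_name)
    return all(os.path.isfile(os.path.join(folder, name)) for name in MODEL_FILES)


if not model_is_downloaded("774M"):
    print("774M is not on this VM yet; run gpt2.download_gpt2(model_name='774M').")
```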


## Mounting Google Drive

The best way to get input text to-be-trained into the Colaboratory VM, and to get the trained model *out* of Colaboratory, is to route it through Google Drive *first*.

Running this cell (which will only work in Colaboratory) will mount your personal Google Drive in the VM, which later cells can use to get data in and out. It will ask for an auth code; that code is not saved anywhere.

In [None]:
gpt2.mount_gdrive()

Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3aietf%3awg%3aoauth%3a2.0%3aoob&scope=email%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdocs.test%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive.photos.readonly%20https%3a%2f%2fwww.googleapis.com%2fauth%2fpeopleapi.readonly&response_type=code

Enter your authorization code:
··········
Mounted at /content/drive
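
Later cells that copy files to or from Drive depend on this mount. A small check like the one below can confirm it before you start a long run. It is a sketch that assumes the mount point is `/content/drive`, as printed above; `drive_is_mounted` is a hypothetical helper name.

```python
import os


def drive_is_mounted(mount_point="/content/drive"):
    """True if the Drive mount point exists as a directory."""
    return os.path.isdir(mount_point)


if not drive_is_mounted():
    print("Google Drive is not mounted; run gpt2.mount_gdrive() first.")
```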


## Uploading a Text File to be Trained to Colaboratory

In the Colaboratory Notebook sidebar on the left of the screen, select *Files*. From there you can upload files:

![alt text](https://i.imgur.com/TGcZT4h.png)

Upload **any smaller text file**  (<10 MB) and update the file name in the cell below, then run the cell.

In [None]:
file_name = "butoh-fu.txt"

If your text file is larger than 10MB, it is recommended to upload that file to Google Drive first, then copy that file from Google Drive to the Colaboratory VM.

In [None]:
gpt2.copy_file_from_gdrive(file_name)
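
The 10 MB rule of thumb above can be checked in code once the file is somewhere you can read it. This is a sketch only: `needs_drive_upload` is a hypothetical helper, and the limit is the figure quoted in this notebook, not a hard constraint of the library.

```python
import os


def needs_drive_upload(path, limit_mb=10):
    """True if the file at path is larger than limit_mb megabytes."""
    size_mb = os.path.getsize(path) / (1024 * 1024)
    return size_mb > limit_mb


file_name = "butoh-fu.txt"  # same name as in the cell above
if os.path.exists(file_name):
    if needs_drive_upload(file_name):
        print("Large file: route it through Google Drive.")
    else:
        print("Small file: uploading through the sidebar is fine.")
else:
    print(file_name, "was not found in the working directory.")
```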

## Finetune GPT-2

The next cell will start the actual finetuning of GPT-2. It creates a persistent TensorFlow session which stores the training config, then runs the training for the specified number of `steps`. (to have the finetuning run indefinitely, set `steps = -1`)

The model checkpoints will be saved in `/checkpoint/run1` by default. The checkpoints are saved every 500 steps (can be changed) and when the cell is stopped.

The training might time out after 4ish hours; make sure you end training and save the results so you don't lose them!

**IMPORTANT NOTE:** If you want to rerun this cell, **restart the VM first** (Runtime -> Restart Runtime). You will need to rerun imports but not recopy files.

Other optional-but-helpful parameters for `gpt2.finetune`:


* **`restore_from`**: Set to `fresh` to start training from the base GPT-2, or set to `latest` to restart training from an existing checkpoint.
* **`sample_every`**: Number of steps between printed example outputs.
* **`print_every`**: Number of steps between printed training progress updates.
* **`learning_rate`**: Learning rate for the training (default `1e-4`; can be lowered to `1e-5` if you have <1MB of input data).
* **`run_name`**: Subfolder within `checkpoint` in which to save the model. This is useful if you want to work with multiple models (you will also need to specify `run_name` when loading the model).
* **`overwrite`**: Set to `True` if you want to continue finetuning an existing model (w/ `restore_from='latest'`) without creating duplicate copies.
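
The learning-rate advice in the list above can be turned into a one-line rule. The helper below is a hypothetical sketch that encodes only that rule of thumb (`1e-4` by default, `1e-5` for less than 1 MB of input); it is not something `gpt-2-simple` provides.

```python
def suggest_learning_rate(dataset_bytes):
    """Rule of thumb from the list above: 1e-5 under 1 MB, otherwise 1e-4."""
    one_mb = 1024 * 1024
    return 1e-5 if dataset_bytes < one_mb else 1e-4


print(suggest_learning_rate(200 * 1024))       # a small dataset
print(suggest_learning_rate(5 * 1024 * 1024))  # a larger dataset
```

In the notebook you could pass `os.path.getsize(file_name)` as `dataset_bytes` and feed the result to the `learning_rate` argument.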

In [None]:
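sess = gpt2.start_tf_sess()

gpt2.finetune(sess,
              dataset=file_name,
              model_name='355M',      # must already be downloaded with gpt2.download_gpt2()
              steps=50,               # total training steps (-1 runs until interrupted)
              restore_from='latest',  # resume from the latest checkpoint of this run
              run_name='run1',
              print_every=10,
              sample_every=50,
              save_every=50,
              learning_rate=1e-5
              )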

After the model is trained, you can copy the checkpoint folder to your own Google Drive.

If you want to download it to your personal computer, it's strongly recommended that you copy it to Google Drive first, then download it from there. The checkpoint folder is copied as a `.tar` archive; you can download it and extract it locally.

In [None]:
gpt2.copy_checkpoint_to_gdrive(run_name='run1')

You're done! Feel free to go to the **Generate Text From The Trained Model** section to generate text based on your retrained model.

## Load a Trained Model Checkpoint

Running the next cell will copy the `.tar` checkpoint file from your Google Drive into the Colaboratory VM.

In [None]:
gpt2.copy_checkpoint_from_gdrive(run_name='run1')

The next cell will allow you to load the retrained model checkpoint + metadata necessary to generate text.

**IMPORTANT NOTE:** If you want to rerun this cell, **restart the VM first** (Runtime -> Restart Runtime). You will need to rerun imports but not recopy files.

In [None]:
sess = gpt2.start_tf_sess()
gpt2.load_gpt2(sess, model_name='774M')

Loading pretrained model models/774M/model.ckpt
INFO:tensorflow:Restoring parameters from models/774M/model.ckpt


## Generate Text From The Trained Model

After you've trained the model or loaded a retrained model from checkpoint, you can now generate text. `generate` generates a single text from the loaded model.

In [None]:
gpt2.generate(sess, model_name='774M')  # same model as loaded above; for a finetuned checkpoint, pass run_name='run1' instead

If you're creating an API based on your model and need to pass the generated text elsewhere, you can do `text = gpt2.generate(sess, return_as_list=True)[0]`

You can also pass in a `prefix` to the generate function to force the text to start with a given character sequence and generate text from there (good if you add an indicator when the text starts).

You can also generate multiple texts at a time by specifying `nsamples`. Unique to GPT-2, you can pass a `batch_size` to generate multiple samples in parallel, giving a massive speedup (in Colaboratory, set a maximum of 20 for `batch_size`).
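
When picking `batch_size`, it is safest to choose a value that divides `nsamples` evenly. The sketch below assumes that requirement (treat it as an assumption and check the `gpt-2-simple` source if in doubt) and uses the Colaboratory ceiling of 20 mentioned above. `pick_batch_size` is a hypothetical helper.

```python
def pick_batch_size(nsamples, max_batch=20):
    """Largest batch size <= max_batch that divides nsamples evenly."""
    for size in range(min(nsamples, max_batch), 0, -1):
        if nsamples % size == 0:
            return size
    return 1


for n in (5, 30, 100):
    print(n, "samples -> batch_size", pick_batch_size(n))
```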

Other optional-but-helpful parameters for `gpt2.generate` and friends:

*  **`length`**: Number of tokens to generate (default 1023, the maximum)
* **`temperature`**: The higher the temperature, the crazier the text (default 0.7, recommended to keep between 0.7 and 1.0)
* **`top_k`**: Limits the generated guesses to the top *k* guesses (default 0 which disables the behavior; if the generated output is super crazy, you may want to set `top_k=40`)
* **`top_p`**: Nucleus sampling: limits the generated guesses to a cumulative probability. (gets good results on a dataset with `top_p=0.9`)
* **`truncate`**: Truncates the input text until a given sequence, excluding that sequence (e.g. if `truncate='<|endoftext|>'`, the returned text will include everything before the first `<|endoftext|>`). It may be useful to combine this with a smaller `length` if the input texts are short.
*  **`include_prefix`**: If using `truncate` and `include_prefix=False`, the specified `prefix` will not be included in the returned text.
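
To build some intuition for `temperature`, `top_k` and `top_p`, here is a toy illustration in plain Python. It is not how `gpt-2-simple` works internally (the library samples from the model's logits inside TensorFlow); the token names, logits and helper functions are made up for the example, and `top_p` is applied to the probabilities as they stand after any `top_k` cut.

```python
import math
import random


def filter_distribution(logits, temperature=0.7, top_k=0, top_p=0.0):
    """Return {token: probability} after temperature, top-k and top-p filtering."""
    # Temperature: divide the logits before the softmax; higher values flatten it.
    scaled = {tok: value / temperature for tok, value in logits.items()}
    biggest = max(scaled.values())
    exps = {tok: math.exp(value - biggest) for tok, value in scaled.items()}
    total = sum(exps.values())
    probs = sorted(((tok, e / total) for tok, e in exps.items()),
                   key=lambda item: item[1], reverse=True)
    # Top-k: keep only the k most likely tokens (0 disables the filter).
    if top_k > 0:
        probs = probs[:top_k]
    # Top-p: keep the smallest set whose cumulative probability reaches top_p.
    if top_p > 0.0:
        kept, cumulative = [], 0.0
        for tok, p in probs:
            kept.append((tok, p))
            cumulative += p
            if cumulative >= top_p:
                break
        probs = kept
    # Renormalise so the surviving probabilities sum to 1.
    total = sum(p for _, p in probs)
    return {tok: p / total for tok, p in probs}


def sample(dist):
    """Draw one token according to the filtered probabilities."""
    tokens = list(dist)
    return random.choices(tokens, weights=[dist[t] for t in tokens], k=1)[0]


toy_logits = {"spine": 2.0, "hands": 1.0, "feet": 0.5, "tail": -1.0}
print(filter_distribution(toy_logits, temperature=0.7, top_k=2))
print(sample(filter_distribution(toy_logits, temperature=1.0, top_p=0.9)))
```

Raising `temperature` makes the unlikely tokens more probable, while `top_k` and `top_p` remove the tail entirely, which is why they help when the output is too erratic.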

In [8]:
yourBodyPart = [
    'your body', 'your tailbone', 'your hands', 'your feet', 'your legs', 'your arms',
    'your head', 'your torso', 'your knees', 'your hips', 'your wrists', 'your elbows',
    'your shoulders', 'your spine', 'your imaginary tail', 'your fingers', 'your toes',
    'your footsoles', 'your ankles', 'your lower legs', 'your kneecaps', 'your upper legs',
    'your forearms', 'your upper arms', 'your buttocks', 'your belly', 'your heels',
    'your neck', 'your hipjoints', 'your bellybutton', 'your chest', 'your chestbone',
    'your breast', 'your collarbones', 'your upper spine', 'your middle spine',
    'your lower spine', 'your sacrum', 'your lower back', 'your middle back',
    'your upper back', 'your shoulderblades', 'the palms of your hands',
    'the tops of your hands', 'your left hand', 'your right hand', 'your left foot',
    'your right foot', 'your left leg', 'your right leg', 'your left arm', 'your right arm',
    'your left knee', 'your right knee', 'your left hip', 'your right hip',
    'your left wrist', 'your right wrist', 'your left elbow', 'your right elbow',
    'your left shoulder', 'your right shoulder', 'the fingers of your left hand',
    'the fingers of your right hand', 'the toes of your left foot',
    'the toes of your right foot', 'your left footsole', 'your right footsole',
    'your left ankle', 'your right ankle', 'your left lower leg', 'your right lower leg',
    'your left kneecap', 'your right kneecap', 'your left upper leg', 'your right upper leg',
    'your left forearm', 'your right forearm', 'your left upper arm', 'your right upper arm',
    'your left buttock', 'your right buttock', 'your left heel', 'your right heel',
    'the top of your left foot', 'the top of your right foot', 'your left hipjoint',
    'your right hipjoint', 'your left breast', 'your right breast', 'your left collarbone',
    'your right collarbone', 'your left shoulderblade', 'your right shoulderblade',
    'the palm of your left hand', 'the palm of your right hand',
    'the top of your left hand', 'the top of your right hand', 'your anus', 'your nipples',
    'your left nipple', 'your right nipple', 'your internal organs', 'your bladder',
    'your intestines', 'your kidneys', 'your stomach', 'your left kidney',
    'your right kidney', 'your left lung', 'your right lung', 'your left brain',
    'your right brain', 'your gallbladder', 'your liver', 'your lungs', 'your heart',
    'your brain', 'your genitals', 'the backs of your lower legs',
    'the fronts of your lower legs', 'the backs of your knees',
    'the backs of your upper legs', 'the fronts of your upper legs', 'the tops of your feet',
    'the back of your left lower leg', 'the back of your right lower leg',
    'the front of your left lower leg', 'the front of your right lower leg',
    'the back of your left knee', 'the back of your right knee',
    'the back of your left upper leg', 'the back of your right upper leg',
    'the front of your left upper leg', 'the front of your right upper leg',
]
 
# parts and seed you can fill in as you like. They are not printed but provide
# a context. They can be multi sentence, but cannot contain a newline '\n' 
import random
 
seed='You are in a very strange place. You are powerful. You are hiding your secret.'
 
parts = ['Butoh is about becoming un-human. '
         'It is about connecting to the whole universe. ',
         'Slow beginning, you get into a trance. ',
         'Your trance is getting stronger. ',
         'Slow beginning, you get into a trance. ',
         #'Gradual speeding up adventure, your ego is gone. ',
         'Surprising end, connecting to the rhizome. ']
        
opening = 'You are a horny ghost running after your secret desire.\nYou smell'
 
actions = ['', '', '', '', '', 'Notice ', 'Move ']
          
 
 
index = 0
####
# print(opening)
####
subseed=''
textbuf=120 # size of the parts.
for part in parts: 
  if index != 0:
    opening =  random.sample(actions, 1)[0] + random.sample(yourBodyPart, 1)[0]
  #opening =  ivana[index] #+ random.sample(actions, 1)[0] + random.sample(yourBodyPart, 1)[0].capitalize()
  mylist = gpt2.generate(sess,
              length=textbuf,
              temperature=0.8,
              prefix=seed + ' ' + subseed + ' ' + part + "Butoh-fu:\n" + opening.capitalize(),
              return_as_list=True,
              include_prefix=True,
              nsamples=1,
              model_name='774M',
              truncate="<|endoftext|>"
              )
  index = index + 1
  mylines=mylist[0].splitlines()
  if len(mylines) > 0: mylines.pop(0)
  if len(mylines) > 1: mylines.pop()
  for myline in mylines:
    subseed = subseed + myline.replace('Butoh-fu:','').replace('butoh-fu:','\n') + ' '
  subseed = subseed + '- '
print(subseed.replace('  ',' ').replace('. ','.\n').replace('- ','\n').replace('? ','?\n').replace('! ','!\n'))

You are a horny ghost running after your secret desire.

Your shoulderblades end up spinning faster and faster.
You feel your back getting stretched out.
Your velvet crotch is hard and slippery.
Your belt is getting wet too.
You feel your rope getting wet.
As much as you want to "rap" yourself, you can't.
Your back muscles still aren't broken.
You've got to restrain yourself for a while.
Your sex organs are standing tall.
You can't resist your inner passion.
It's like you have a lust for a white guy as well.
You can't take this.
You know what's about to happen.
It's going to be so sexy 
Your nippels are throbbing.
You can't take it anymore.
You reach out.
You push yourself as hard as you can.
You can't take it any more.
You want the orgasm to come now.
You want to squeeze it out of your pussy.
You can't take it any more.

The orgasm comes.
Oh, oh, oh.
You're going to cum.
Your mind starts to fill with ecstasy.
You're going to cum for the first time.
You're going to cum with your secret

For bulk generation, you can generate a large amount of text to a file and sort out the samples locally on your computer. The next cell writes the generated text to a file with a unique timestamp in its name.

You can rerun the cells as many times as you want for even more generated texts!

In [None]:
gen_file = 'gpt2_gentext_{:%Y%m%d_%H%M%S}.txt'.format(datetime.utcnow())

gpt2.generate_to_file(sess,
                      destination_path=gen_file,
                      length=500,
                      temperature=0.7,
                      nsamples=100,
                      batch_size=20
                      )

In [None]:
# may have to run twice to get file to download
files.download(gen_file)
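
Once the file is on your computer, sorting out the samples can start with splitting it into individual texts. The sketch below assumes the samples are separated by a line of 20 `=` characters, which is believed to be the default separator written by `generate_to_file`; check your own file and adjust `delimiter` if it differs. `split_samples` is a hypothetical helper, and `bulk` is a toy stand-in for the file contents.

```python
def split_samples(text, delimiter="=" * 20):
    """Split bulk output into individual samples, dropping empty ones."""
    pieces = (piece.strip() for piece in text.split(delimiter))
    return [piece for piece in pieces if piece]


# Toy stand-in for the contents of a generated file.
bulk = "first sample\n" + "=" * 20 + "\nsecond, longer sample\n" + "=" * 20 + "\n"

samples = split_samples(bulk)
# Sort locally, for example longest first.
for text in sorted(samples, key=len, reverse=True):
    print(repr(text))
```

For a real file, read it with `open(path, encoding="utf-8").read()` and pass the result to `split_samples`.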

## Generate Text From The Pretrained Model

If you want to generate text from the pretrained model, not a finetuned model, pass `model_name` to `gpt2.load_gpt2()` and `gpt2.generate()`.

This is currently the only way to generate text from the 774M or 1558M models with this notebook.

In [None]:
model_name = "774M"

gpt2.download_gpt2(model_name=model_name)

Fetching checkpoint: 1.05Mit [00:00, 354Mit/s]                                                      
Fetching encoder.json: 1.05Mit [00:00, 131Mit/s]                                                    
Fetching hparams.json: 1.05Mit [00:00, 279Mit/s]                                                    
Fetching model.ckpt.data-00000-of-00001: 3.10Git [00:23, 131Mit/s]                                  
Fetching model.ckpt.index: 1.05Mit [00:00, 380Mit/s]                                                
Fetching model.ckpt.meta: 2.10Mit [00:00, 226Mit/s]                                                 
Fetching vocab.bpe: 1.05Mit [00:00, 199Mit/s]                                                       


In [None]:
sess = gpt2.start_tf_sess()

gpt2.load_gpt2(sess, model_name=model_name)

W0828 18:37:58.571830 139905369159552 deprecation.py:323] From /usr/local/lib/python3.6/dist-packages/tensorflow/python/training/saver.py:1276: checkpoint_exists (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating:
Use standard file APIs to check for files with this prefix.


Loading pretrained model models/774M/model.ckpt


In [None]:
gpt2.generate(sess,
              model_name=model_name,
              prefix="The secret of life is",
              length=100,
              temperature=0.7,
              top_p=0.9,
              nsamples=5,
              batch_size=5
              )

# Etcetera

If the notebook has errors (e.g. GPU Sync Fail), force-kill the Colaboratory virtual machine and restart it with the command below:

In [None]:
!kill -9 -1

# LICENSE

MIT License

Copyright (c) 2019 Max Woolf

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.