In [None]:
!nvidia-smi && free -h && python --version

In [None]:
# GDRIVE
# Mount Google Drive first; the folder cannot be listed until the mount exists.
from google.colab import drive
drive.mount('/content/drive', force_remount=True)
!ls "/content/drive/MyDrive/COLAB/"

In [None]:
# DEPENDENCIES
!pip install -q aitextgen

from aitextgen import aitextgen
from aitextgen.colab import mount_gdrive, copy_file_from_gdrive
from IPython.display import HTML, display

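# Wrap long lines in cell output so generated text stays readable.
def set_css():
    display(HTML('''<style>pre { white-space: pre-wrap; }</style>'''))

# Re-apply the style before every cell runs.
get_ipython().events.register('pre_run_cell', set_css)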

## Loading the GPT model

Now we have to download the language model. This can take a while.

In [None]:
import torch
torch.cuda.empty_cache()

%cd "/content"
!rm -rf aitextgen

#model_selection="gpt-neo-125M" # GPT-Neo Pytorch, requires 12GB VRAM
model_selection="355M" # Stock GPT-2 TF, requires 16GB VRAM
#model_selection="124M" # Stock GPT-2 TF, requires 8GB VRAM

# 1. Do this first only if you have a local base model...
!ln -s "/content/drive/MyDrive/COLAB/ExperimentalWriting/base_models/$model_selection/aitextgen"

# 2. Do this next (for local or remote base models)...
# 2.1. GPT-Neo Pytorch...
#ai = aitextgen(model="EleutherAI/" + model_selection, to_gpu=True) # gpt-neo-125M, gpt-neo-1.3B, gpt-neo-2.7B, gpt-neo-6B

# 2.2. Stock GPT-2 TF...
ai = aitextgen(tf_gpt2=model_selection, to_gpu=True) # 124M, 355M, 774M, 1558M

# 2.3. Other custom Pytorch...
#ai = aitextgen(model=model_selection, to_gpu=True) # gpt2-medium

## Uploading a Text File for Training

* **Upload** any text file, for example, a .txt file of your own writing, or a [pre-compiled corpus](https://discord.com/channels/928996422509027408/928996422509027414/931150499430924288). The file must be in UTF-8 format.
* **Update** the file name in the cell below to *exactly* match the name of the .txt file as it appears in the sidebar folder, then run the cell.
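
If you are unsure whether your file really is UTF-8, a quick check before training can save a failed run. The following is a minimal sketch using only the Python standard library; the helper names `is_utf8` and `convert_to_utf8` are hypothetical (they are not part of aitextgen), and the `latin-1` default is only an assumption about what a non-UTF-8 file might be encoded in.

```python
from pathlib import Path

def is_utf8(path):
    """Return True if the file at `path` decodes cleanly as UTF-8."""
    try:
        Path(path).read_bytes().decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

def convert_to_utf8(src, dst, source_encoding="latin-1"):
    """Re-encode `src` (assumed to be in `source_encoding`) as UTF-8 at `dst`."""
    text = Path(src).read_bytes().decode(source_encoding)
    Path(dst).write_bytes(text.encode("utf-8"))
```

For example, `is_utf8("input.txt")` could be run after the next cell has created `input.txt`. If it returns `False`, re-saving the file as UTF-8 in a text editor is usually the safest fix, because guessing the source encoding can silently garble characters.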

In [None]:
%cd "/content"
!rm -f "input.txt"  # -f: do not fail if the link does not exist yet
!ln -s "/content/drive/MyDrive/COLAB/ExperimentalWriting/corpora/nick_corpus_combo.txt" input.txt
file_name = "input.txt"

## Finetuning the language model
The next cell will start the actual finetuning of the language model. It trains the model to the **voice and concepts** of the supplied text. 

Tuning runs for `num_steps` steps; **500** is a good number for experimenting. Longer training lets the model blend your words more thoroughly with the words it already knows.

A progress bar will appear to show training progress.

***To begin with, just leave the settings as they are! As you learn and grow, you can mess around!***


---



Important parameters for `train()`:

- The model will be saved every `save_every` steps in `trained_model` by default, and when training completes.
- **`line_by_line`**: Set this to `True` if the input text file is a single-column CSV, with one record per row. aitextgen will automatically process it optimally.
- **`from_cache`**: If you compressed your dataset locally and are using that cache file, set this to `True`.
- **`num_steps`**: Number of steps to train the model for.
- **`generate_every`**: Interval of steps to generate example text from the model; good for qualitatively validating training.
- **`save_every`**: Interval of steps to save the model; the model is saved on the VM to the `trained_model` folder (under `/content` in this notebook).
- **`save_gdrive`**: Set this to `True` to copy the model to a unique folder in your Google Drive, if you have mounted it in the earlier cells.
- **`fp16`**: Enables half-precision training for faster/more memory-efficient training. Only works on a T4 or V100 GPU.
- **`learning_rate`**: Learning rate of the model training. Only change this if you know what you're doing.
- **`batch_size`**: Batch size of the model training; setting it too high will cause the GPU to run out of memory (OOM). If using `fp16`, you can increase the batch size more safely.

Afterward, we reload the model so things are properly set up. 

Running `generate()` without any parameters generates a single text from the loaded model, to test that everything is in order.
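
To see how `num_steps`, `generate_every` and `save_every` interact, it can help to list the steps at which each action would fire. This is a small sketch under the assumption that an action triggers whenever the step count is a multiple of its interval; `schedule` is a hypothetical helper, not an aitextgen function.

```python
def schedule(num_steps, every):
    """Steps at which an action fires if it runs every `every` steps (0 disables it)."""
    if every <= 0:
        return []
    return list(range(every, num_steps + 1, every))

# Values taken from the training cell in this notebook.
num_steps, generate_every, save_every = 500, 100, 500
print("sample text at steps:", schedule(num_steps, generate_every))
print("checkpoint at steps:", schedule(num_steps, save_every))
```

With the settings used here, sample text would appear five times during training, and the only intermediate checkpoint would coincide with the end of the run.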

In [None]:
ai.train(file_name,
         line_by_line=False,
         from_cache=False,
         num_steps=500,
         generate_every=100,
         save_every=500,
         save_gdrive=False,
         learning_rate=1e-3,
         fp16=False,
         batch_size=1, 
         )

ai = aitextgen(model_folder="trained_model", to_gpu=True)

ai.generate()

## Generate Text From The Trained Model

Now it gets interesting! You can pass a `prompt` to the generate function to force the text to start with a given phrase.

**Be sure to never end the prompt with a space!**
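
The reason is tokenization: GPT-2 style tokenizers usually attach a space to the start of the following word, so a trailing space in the prompt becomes an odd token of its own. A minimal guard, assuming you want to keep intentional newlines at the end of a prompt, could look like this (`clean_prompt` is a hypothetical helper name):

```python
def clean_prompt(prompt):
    """Strip trailing spaces only; trailing newlines are left untouched."""
    return prompt.rstrip(" ")

print(repr(clean_prompt("MARY: Get to battle stations! ")))
```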

You can also generate multiple texts at a time by specifying `n`. You can pass a `batch_size` to generate multiple samples in parallel, giving a massive speedup (in Colaboratory, a `batch_size` of 50 will make you run out of memory).

The following parameters define what you get back from the model:

* **`min_length`**: The minimum length of the generated text, in tokens.
* **`max_length`**: Number of tokens to generate. Default is 256. You can generate up to 1024 tokens with GPT-2 and up to 2048 with GPT-Neo.
* **`temperature`**: The higher the temperature, the crazier the text. Default is 0.7, recommended to keep between 0.7 and 1.0.
* **`top_k`**: Filters the generated guesses to the top *k* guesses. Default is 0, which disables the behavior. If the generated output has semantic or syntactic issues, you may want to set `top_k=40` or use `top_p` sampling (see below).
* **`top_p`**: Another way of filtering bad results, called _nucleus sampling_. Limits the generated guesses to a cumulative probability. Gets good results on a dataset with `top_p=0.9`. If `top_p` as well as `top_k` are specified, only `top_p` is used.
* **`repetition_penalty`**: Set this to more than 1 to avoid repeating words in the text.
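
As a rough illustration of what these settings do to the model's next-token scores, here is a minimal pure-Python sketch. It is not the actual implementation used by aitextgen or the underlying library; the function names are hypothetical, the four-token distribution is made up, and the repetition penalty shown is one commonly used scheme, assumed here for illustration.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; higher temperature flattens them."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in scaled]
    total = sum(weights)
    return [w / total for w in weights]

def top_k_indices(probs, k):
    """Indices of the k most likely tokens (k <= 0 keeps every token)."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return ranked if k <= 0 else ranked[:k]

def top_p_indices(probs, p):
    """Smallest set of most likely tokens whose probabilities sum to at least p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return kept

def apply_repetition_penalty(logits, seen, penalty):
    """Make already-generated tokens (indices in `seen`) less likely when penalty > 1."""
    out = list(logits)
    for i in seen:
        # Dividing a positive score or multiplying a negative one both lower it.
        out[i] = out[i] / penalty if out[i] > 0 else out[i] * penalty
    return out

def sample(probs, kept, rng=random):
    """Draw one token index from `kept`, weighted by its probability."""
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]

# Made-up distribution over a four-token vocabulary.
probs = [0.5, 0.3, 0.15, 0.05]
print(top_k_indices(probs, 2))    # keeps the two most likely tokens
print(top_p_indices(probs, 0.9))  # keeps tokens until 90% of the mass is covered
```

Note how `top_k` always keeps a fixed number of candidates, while `top_p` keeps more candidates when the model is unsure and fewer when one token dominates. That adaptivity is why nucleus sampling often tolerates higher temperatures.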

## Suggestions
* gpt-neo-125M: temperature 1.35, top_p 0.97, repetition_penalty 1.1


In [None]:
#@title Text Generation Parameters
prompt = "MARY: Get to battle stations!" #@param {type:"string"}
temperature = 1.35 #@param {type:"slider", min:0, max:1.5, step:0.05}
top_p = 0.97 #@param {type:"slider", min:0.8, max:0.99, step:0.01}
repetition_penalty = 1.1 #@param {type:"slider", min:1, max:1.5, step:0.1}
batch_size = 3 #@param {type:"slider", min:1, max:3, step:1}
max_length = 256 #@param {type:"slider", min:256, max:2048, step:256}

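# Pass every form value on to the model, including the repetition penalty.
ai.generate(n=batch_size,
            batch_size=batch_size,
            prompt=prompt,
            max_length=max_length,
            temperature=temperature,
            top_p=top_p,
            repetition_penalty=repetition_penalty)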

In [None]:
# OUTPUT

%cd /content/

# Saves the cached base model
#!cp -r aitextgen /content/drive/MyDrive/COLAB/ExperimentalWriting/

# Saves the training results
!cp -r trained_model /content/drive/MyDrive/COLAB/ExperimentalWriting/

# LICENSE

MIT License

Original work copyright (c) 2020-2021 Max Woolf

Modified work copyright (c) 2022 Martin Pichlmair

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.