<a href="https://colab.research.google.com/github/jonbaer/googlecolab/blob/master/creative_writing_gpt2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Cloning the creative writing code base

This copies the required code and data onto the Colab notebook instance.

In [0]:
!rm -rf /content/creative-writing-with-gpt2
!git clone https://github.com/ADGEfficiency/creative-writing-with-gpt2

Cloning into 'creative-writing-with-gpt2'...
remote: Enumerating objects: 104, done.
remote: Counting objects: 100% (104/104), done.
remote: Compressing objects: 100% (79/79), done.
remote: Total 104 (delta 30), reused 96 (delta 22), pack-reused 0
Receiving objects: 100% (104/104), 30.10 MiB | 19.73 MiB/s, done.
Resolving deltas: 100% (30/30), done.


# Installing required packages

In [0]:
!pip install -q -r /content/creative-writing-with-gpt2/requirements.txt
print(' ')
print('finished installing packages')

import os 
os.makedirs('/content/creative-writing-with-gpt2/models', exist_ok=True)

# silence TensorFlow warnings (the env var must be set before tensorflow is imported)
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'

import tensorflow as tf
tf.get_logger().setLevel('ERROR')

  Building wheel for regex (setup.py) ... done
  Building wheel for sacremoses (setup.py) ... done
 
finished installing packages




# Available raw data

The creative writing code base has a few clean datasets included, which can be used for fine-tuning.  You can see the text using the *Files* browser on the right.

In [0]:
print('Available datasets:')
!ls /content/creative-writing-with-gpt2/data

Available datasets:
alan-watts  bible  hemingway  meditations  tolkien
asimov	    harry  mahabarta  plato
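
The same listing can be produced from Python, which makes it easier to check that a dataset is ready for fine-tuning. The sketch below assumes every usable dataset is a sub-folder of `data` containing a `clean.txt` file, which is the layout the fine-tuning cell reads from; the helper name `list_datasets` is hypothetical and not part of the code base.

```python
from pathlib import Path


def list_datasets(data_dir):
    """Return the names of sub-folders of data_dir that contain a clean.txt file."""
    root = Path(data_dir)
    if not root.is_dir():
        # Nothing cloned yet (or wrong path): report no datasets instead of failing.
        return []
    return sorted(p.name for p in root.iterdir() if (p / 'clean.txt').is_file())


if __name__ == '__main__':
    # Path used throughout this notebook; adjust if the repo was cloned elsewhere.
    for name in list_datasets('/content/creative-writing-with-gpt2/data'):
        print(name)
```

A folder without a `clean.txt` is skipped, so a half-uploaded dataset does not show up in the list.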


# Downloading pretrained models

Because the pretrained models are large, I've made them available as shared links on my Google Drive.

The code below will download them to this instance of the Colab notebook.

In [0]:
from google_drive_downloader import GoogleDriveDownloader as gdd

%cd /content/creative-writing-with-gpt2/
from models import models, download_pretrained_model
%cd /content/

print(' ')
print('Available pre-finetuned models:')
for k in models.keys():
  print(k)

/content/creative-writing-with-gpt2
/content
 
Available pre-finetuned models:
alan-watts
bible
harry
meditations
tolkien
asimov
hemingway


In [0]:
download_pretrained_model('tolkien', prefix='/content/creative-writing-with-gpt2')
download_pretrained_model('bible', prefix='/content/creative-writing-with-gpt2')
download_pretrained_model('harry', prefix='/content/creative-writing-with-gpt2')
download_pretrained_model('asimov', prefix='/content/creative-writing-with-gpt2')
download_pretrained_model('meditations', prefix='/content/creative-writing-with-gpt2')

!ls /content/creative-writing-with-gpt2/models

downloading tolkien
Downloading 1-0lq9cGClSqcvcI3WqGkxdmAdoWrhD4e into /content/creative-writing-with-gpt2/models/tolkien.zip... Done.
Unzipping...Done.
downloading bible
Downloading 1x8SQgqZyLGRdHRV6BUIHEPxZuWUCyhRc into /content/creative-writing-with-gpt2/models/bible.zip... Done.
Unzipping...Done.
downloading harry
Downloading 1-3iQhw89Biyv1QMf4o2BEahoPX9g3fNd into /content/creative-writing-with-gpt2/models/harry.zip... Done.
Unzipping...Done.
downloading asimov
Downloading 1yg4bORU_KpV4h_aVnbMaekulK6ShpCS1 into /content/creative-writing-with-gpt2/models/asimov.zip... Done.
Unzipping...Done.
downloading meditations
Downloading 1-9TiibA0_dqD7dqyJnBNBrZnLuegAa_E into /content/creative-writing-with-gpt2/models/meditations.zip... Done.
Unzipping...Done.
asimov	    bible      harry	  meditations	   tolkien
asimov.zip  bible.zip  harry.zip  meditations.zip  tolkien.zip
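
Re-running the download cell fetches every archive again. A minimal sketch of one way to skip models that are already unzipped is shown below; it assumes a model counts as present when a folder with its name exists under `models`, and the helper name `models_to_download` is hypothetical.

```python
import os


def models_to_download(names, models_dir):
    """Return the names that do not yet have an unzipped folder in models_dir."""
    return [n for n in names if not os.path.isdir(os.path.join(models_dir, n))]


# Intended use inside the notebook, after the import cell above has run:
#
#   prefix = '/content/creative-writing-with-gpt2'
#   for name in models_to_download(['tolkien', 'bible'], prefix + '/models'):
#       download_pretrained_model(name, prefix=prefix)
```

The check only looks for the folder, so a partially extracted model would still be treated as present.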


## Using a finetuned model

This can be run after either downloading a model or training your own.

In [0]:
import os

os.environ['MODELNAME'] = 'tolkien'  # change this to use a different model
!python /content/creative-writing-with-gpt2/run_generation.py \
--model_type=gpt2 \
--model_name_or_path="/content/creative-writing-with-gpt2/models/$MODELNAME" \
--length=50  # number of tokens to generate

12/14/2019 17:05:54 - INFO - transformers.tokenization_utils -   Model name '/content/creative-writing-with-gpt2/models/tolkien' not found in model shortcut name list (gpt2, gpt2-medium, gpt2-large, distilgpt2). Assuming '/content/creative-writing-with-gpt2/models/tolkien' is a path or url to a directory containing tokenizer files.
12/14/2019 17:05:54 - INFO - transformers.tokeni
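
The generation command can also be assembled in Python instead of going through the `MODELNAME` environment variable. The sketch below uses only the flags from the cell above; the helper name `generation_command` and the `REPO` constant are assumptions made for illustration.

```python
import subprocess
import sys

# Location of the cloned code base on the Colab instance.
REPO = '/content/creative-writing-with-gpt2'


def generation_command(model_name, length=50, repo=REPO):
    """Build the argument list for run_generation.py for one fine-tuned model."""
    return [
        sys.executable,
        f'{repo}/run_generation.py',
        '--model_type=gpt2',
        f'--model_name_or_path={repo}/models/{model_name}',
        f'--length={length}',
    ]


if __name__ == '__main__':
    cmd = generation_command('tolkien')
    print(' '.join(cmd))
    # On the Colab instance, uncomment the next line to run the script:
    # subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems if a model name ever contains spaces.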

## Finetuning your own model

Here we take the original GPT-2 model and fine-tune it on a dataset of our choosing.

After training, the model is zipped up; you can use the *Files* tab on the right to see it.

In [0]:
os.environ['MODELNAME'] = 'meditations'  # change this to use a different dataset
os.makedirs('/content/creative-writing-with-gpt2/models/{}'.format(os.environ['MODELNAME']), exist_ok=True)

# --model_type must be a key of MODEL_CLASSES in run_lm_finetuning.py; for GPT-2 that key is gpt2
!python /content/creative-writing-with-gpt2/run_lm_finetuning.py \
  --output_dir="/content/creative-writing-with-gpt2/models/$MODELNAME" \
  --model_type=gpt2 \
  --model_name_or_path=gpt2 \
  --do_train \
  --train_data_file="/content/creative-writing-with-gpt2/data/$MODELNAME/clean.txt" \
  --num_train_epochs=4 \
  --overwrite_output_dir \
  --per_gpu_train_batch_size=1 \
  --save_steps=10000

# zip the trained model so it can be downloaded or copied to Google Drive
%cd /content/creative-writing-with-gpt2/models/
!zip -r "$MODELNAME.zip" "$MODELNAME"
%cd /content/

/content/creative-writing-with-gpt2/models
upda
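
The zipping step can also be done without changing directories, using the standard library. This is a sketch of an equivalent to the `zip -r` call above, not the code base's own method; the helper name `zip_model` is hypothetical.

```python
import shutil
from pathlib import Path


def zip_model(models_dir, model_name):
    """Create <models_dir>/<model_name>.zip containing the <model_name>/ folder."""
    models_dir = Path(models_dir)
    if not (models_dir / model_name).is_dir():
        raise FileNotFoundError(f'no trained model folder at {models_dir / model_name}')
    # base_name is given without the extension; make_archive appends ".zip".
    archive = shutil.make_archive(
        str(models_dir / model_name),
        'zip',
        root_dir=str(models_dir),
        base_dir=model_name,
    )
    return Path(archive)


# Intended use on the Colab instance:
#   zip_model('/content/creative-writing-with-gpt2/models', 'meditations')
```

Files inside the archive keep the model name as their top-level folder, matching the layout of the downloaded archives once they are unzipped.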

# Mounting your Google Drive

After training your model you might want to save the `.zip` file - an easy way to do this is to transfer it to your Google Drive.

After running the cell below, you can transfer files using the *Files* explorer on the right.

You can also use your Google Drive to transfer datasets to finetune on, by putting a `clean.txt` file into the `data` folder (e.g. `creative-writing-with-gpt2/data/my_author/clean.txt`).

In [0]:
from google.colab import drive

drive.mount('/content/drive')
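
Once the drive is mounted, the transfers can also be scripted rather than done by hand in the *Files* explorer. The sketch below is one possible approach: the helper names `save_to_drive` and `stage_dataset` are hypothetical, and the Drive folder shown in the usage comments (`/content/drive/My Drive/gpt2-models`) is an assumption, since the exact path under the mount point depends on the Colab version.

```python
import shutil
from pathlib import Path


def save_to_drive(archive_path, drive_dir):
    """Copy a model archive into a folder on the mounted drive and return its new path."""
    drive_dir = Path(drive_dir)
    drive_dir.mkdir(parents=True, exist_ok=True)
    target = drive_dir / Path(archive_path).name
    shutil.copy2(str(archive_path), str(target))
    return target


def stage_dataset(text_file, data_dir, author):
    """Copy a text file to <data_dir>/<author>/clean.txt, where fine-tuning expects it."""
    target = Path(data_dir) / author / 'clean.txt'
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(str(text_file), str(target))
    return target


# Intended use on the Colab instance (paths under the mount point are assumptions):
#   save_to_drive('/content/creative-writing-with-gpt2/models/meditations.zip',
#                 '/content/drive/My Drive/gpt2-models')
#   stage_dataset('/content/drive/My Drive/my_author.txt',
#                 '/content/creative-writing-with-gpt2/data', 'my_author')
```

After `stage_dataset` runs, setting `MODELNAME` to the author name in the fine-tuning cell points training at the new `clean.txt`.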