<a href="https://colab.research.google.com/github/martysteer/omftm/blob/master/Gwern_GPT_2_Gutenberg_Poetry_Generator.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Our Mutual Friend the Machine
### [Being Human Festival](https://beinghumanfestival.org/event/our-mutual-friend-the-machine/), November 2019

Run the cells below to set up the environment. You (currently) need to connect to martin.steer@sas.ac.uk's Google Drive, because the model files, metadata and logs are all kept there.

This notebook uses [Gwern's Gutenberg Poetry model](https://www.gwern.net/GPT-2), which was trained on the [Gutenberg Poetry Corpus](https://doi.org/10.3389/fdigh.2018.00005) using [OpenAI's GPT-2 Transformer](https://openai.com/blog/better-language-models/) architecture for Natural Language Generation (NLG). It is based on [Max Woolf's Google Colab notebook](https://minimaxir.com/2019/09/howto-gpt2/).

The Gutenberg metadata has been filtered to 19th-century authors and volumes (in the code below, authors whose recorded death year is before 1900).

Enjoy!

[Digital Humanities @SAS](https://www.sas.ac.uk/projects-and-initiatives/digital-humanities)


##  Libraries and setup


In [0]:
%tensorflow_version 1.x
!pip install -q gpt-2-simple
import gpt_2_simple as gpt2
from datetime import datetime
from google.colab import files

import re
import json
import random

In [0]:
!nvidia-smi

Wed Nov 20 18:51:28 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.50       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   41C    P0    27W / 250W |      0MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|  No ru

In [0]:
gpt2.mount_gdrive()

Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3aietf%3awg%3aoauth%3a2.0%3aoob&response_type=code&scope=email%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdocs.test%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive.photos.readonly%20https%3a%2f%2fwww.googleapis.com%2fauth%2fpeopleapi.readonly

Enter your authorization code:
··········
Mounted at /content/drive


In [0]:
# Extract the metadata for 19th-century authors.

!cp  ./drive/My\ Drive/gutenbermetadata.json .

with open('gutenbermetadata.json') as f:
  metadata = json.load(f)


# Save a reference list of 19th century authors
ninteethnc = {}
for item in metadata:
    if 'Author Death' in item:
        death = item['Author Death']
        # Keep the record only if the first death year is an integer before 1900.
        if len(death) > 0 and isinstance(death[0], int) and death[0] < 1900:
            ninteethnc[int(item['Num'])] = item
            
with open('ninteenth-century-gutenberg-authors.txt', 'w') as f:
    for gid, item in ninteethnc.items():
        print(item['Num'], item['Author'][0], file=f)


# Save a reference list of 19th century authors and their works
auths = {}
for aid, item in ninteethnc.items():
    auths.setdefault(item['Author'][0],[]).append(item)
    
with open('ninteenth-century-gutenberg.txt', 'w') as f:
    for aid, items in auths.items():
        print(aid, file=f)
        for item in items:
            print(item['Num'], item['Title'][0], file=f)
        print('', file=f)
        print('', file=f)


def gidsummary(gid):
  auth = ninteethnc[gid]
  out = ['=========', auth['Title'][0], auth['Author'][0]]
  if 'LoC Class' in auth.keys() and len(auth['LoC Class']):
    out += [auth['LoC Class'][0]]
  return out
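
To illustrate the filter in the cell above, here is a minimal sketch on made-up records. It assumes the metadata is a list of dicts whose `Author Death` field is a list that may be missing, empty, or hold a non-integer value. The sample records and the helper name `died_before` are hypothetical and are not part of the notebook.

```python
# Hypothetical records in the shape the cell above expects.
sample_metadata = [
    {'Num': '8209', 'Author': ['John Keats'], 'Author Death': [1821]},
    {'Num': '1', 'Author': ['Someone Later'], 'Author Death': [1950]},
    {'Num': '2', 'Author': ['Unknown Date'], 'Author Death': []},
    {'Num': '3', 'Author': ['No Field']},
    {'Num': '4', 'Author': ['Bad Value'], 'Author Death': ['?']},
]


def died_before(records, year=1900):
    """Return {gutenberg id: record} for authors whose first death year is an int < year."""
    kept = {}
    for item in records:
        death = item.get('Author Death', [])
        if death and isinstance(death[0], int) and death[0] < year:
            kept[int(item['Num'])] = item
    return kept


filtered = died_before(sample_metadata)
print(sorted(filtered))  # only the record with death year 1821 is kept
```

Records with a missing, empty, or non-integer death year are dropped silently, so an author can be absent from the reference lists simply because the metadata is incomplete.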

Set up the logging. All generated poems and the input parameters used to generate them are logged to Google Drive.

In [0]:
import logging
import uuid

# Each time you run the notebook we generate a UUID and use it in the log file name.
notebookuuid = str(uuid.uuid4())

logging.basicConfig(filename='./drive/My Drive/logs/log.' + notebookuuid + '.txt',
                    filemode='a',
                    format='%(asctime)s,%(msecs)d %(name)s %(levelname)s %(message)s',
                    datefmt='%H:%M:%S',
                    level=logging.INFO)

gpt2poetrylogger = logging.getLogger('gpt-2-poetry' + notebookuuid) 
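
The per-run log file can be tried out away from Google Drive. The sketch below is an illustration, not the notebook's own setup: it assumes a temporary directory in place of `./drive/My Drive/logs/`, and it attaches a `FileHandler` to a named logger rather than calling `logging.basicConfig`. The names `demo_logger`, `log_dir` and `log_path` are hypothetical.

```python
import logging
import os
import tempfile
import uuid

# Assumption: a temporary directory stands in for './drive/My Drive/logs/'.
log_dir = tempfile.mkdtemp()
run_id = str(uuid.uuid4())
log_path = os.path.join(log_dir, 'log.' + run_id + '.txt')

# Attach a file handler to a named logger so the root logger is left untouched.
demo_logger = logging.getLogger('gpt-2-poetry-demo.' + run_id)
demo_logger.setLevel(logging.INFO)
handler = logging.FileHandler(log_path, mode='a')
handler.setFormatter(logging.Formatter(
    '%(asctime)s,%(msecs)d %(name)s %(levelname)s %(message)s',
    datefmt='%H:%M:%S'))
demo_logger.addHandler(handler)

demo_logger.info('Seedline: Seaside as evening, raining')

# Read the file back to confirm the record was written.
with open(log_path) as f:
    logged = f.read()
print(logged)

# Detach and close the handler so the file is released.
demo_logger.removeHandler(handler)
handler.close()
```

One reason to consider an explicit handler: `logging.basicConfig` is documented to do nothing when the root logger already has handlers, so in a hosted environment it is worth checking that the log file is actually being written.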

In [0]:
def formatGPToutput(samplelist, numPoems=10, logger=None):
  '''
  Format the output: strip the Gutenberg ID from the start of each line
  and look up the metadata for that ID.
  '''
  if not logger:
    logger = logging.getLogger('gpt-2-poetry' + notebookuuid)
  
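  # Sample at most as many poems as were generated; random.sample raises otherwise.
  for poem in random.sample(samplelist, min(numPoems, len(samplelist))):
      # Each line is expected to look like "<gutenberg id>|<text>".
      alllines = re.findall(r'([0-9]+)?\|?(.*)', poem)
      gids = [g[0] for g in alllines if g[0]]
      # Use the first ID found; None if the sample carries no ID at all.
      gid = int(gids[0]) if gids else None
      lines = [l[1] for l in alllines if l[1]]

      logger.info('Gutenberg ID: ' + str(gid))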
      
      # If it is 19th C, print poem, else skip.
      # The metadata will
      if gid in ninteethnc.keys():
        logger.info(gidsummary(gid))
        print('\n'.join(gidsummary(gid)))
        print('---')
        print('\n'.join(lines), '...')
        print()
      else:
        logger.info('Not C19th')
        
      # Log all poems
      logger.info(lines)
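
To make the parsing step concrete, the sketch below runs the same regular expression on a made-up sample in the `<gutenberg id>|<text>` shape. The sample text is invented for illustration and is not real model output.

```python
import re

# Invented sample: two lines carry an ID prefix, the last one does not.
sample = "24280|first line of verse\n24280|second line of verse\nthird line, no prefix"

# Optional digits, optional pipe, then the rest of the line.
matches = re.findall(r'([0-9]+)?\|?(.*)', sample)

# Empty matches (for example at line ends) are filtered out on both sides.
ids = [m[0] for m in matches if m[0]]
lines = [m[1] for m in matches if m[1]]

print(ids)
print(lines)

# Caveat: a line that starts with digits is read as an ID even without a pipe.
print(re.findall(r'([0-9]+)?\|?(.*)', "1820 was a year")[0])
```

The last call shows a limitation of the pattern: leading digits in a line of verse are taken as a Gutenberg ID, so a stricter pattern may be preferable if the generated text often starts lines with numbers.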

In [0]:
def generatePoetry(seedLine, authorid=None, numPoems=10, poemlength=100, logger=None):
  '''
  Generate poems from a seed line, and log the parameters which
  were used to generate the poetry.
  '''
  if not logger:
    logger = logging.getLogger('gpt-2-poetry' + notebookuuid)
  
  logger.info('========================')
  logger.info('Running generatePoetry()')
  logger.info('Seedline: ' + seedLine)
  logger.info('Author GID: ' + str(authorid))  # authorid may be None, an int or a str
  logger.info('Poem length: '+ str(poemlength))
  logger.info('Num Poems: '+ str(numPoems))
  
  if authorid:
    seedLine = str(authorid) + '|' + seedLine
    
  samples = gpt2.generate(sess,
              length=poemlength,
              temperature=0.95,
              run_name=run_name,
              prefix=seedLine,
              return_as_list=True,
              nsamples=10,  # always 10, but the format function will limit
              batch_size=5,
              top_k=40, 
  #               top_p=0.4
              )
  return formatGPToutput(samples, numPoems, logger)
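
The prefix handed to the model is the seed line with an optional `<gutenberg id>|` in front. The sketch below isolates that step. `build_prefix` is a hypothetical helper that mirrors the `if authorid:` branch above, and the dropdown string is copied from the form cell further down.

```python
def build_prefix(seed_line, author_id=None):
    """Hypothetical helper: prepend '<id>|' to the seed line when an ID is given."""
    if author_id:
        return str(author_id) + '|' + seed_line
    return seed_line


# The form cell stores the choice as "<id> <author>, <title>"; the ID is the first token.
dropdown_value = "24280 John Keats, Endymion A Poetic Romance"
author_id = dropdown_value.split(' ')[0]

print(build_prefix("Seaside as evening, raining", author_id))  # ID given as a string
print(build_prefix("Seaside as evening, raining", 24280))      # ID given as an int
print(build_prefix("Seaside as evening, raining"))             # no ID, seed line unchanged
```

Because the ID goes through `str()`, it can be passed as a string taken from the form or as an integer typed by hand.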

## Load a Trained Model Checkpoint

Running the next cell will copy the `.tar` checkpoint file from Google Drive into the Colaboratory VM.

In [0]:
# Copy the poetry prefix model built by Gwern Branwen
poetryprefix = "345M_poetry"
# poetryprefix = "gwern"
run_name = poetryprefix

In [0]:
gpt2.copy_checkpoint_from_gdrive(run_name=run_name)

The next cell will allow you to load the retrained model checkpoint + metadata necessary to generate text.

**IMPORTANT NOTE:** If you want to rerun this cell, **restart the VM first** (Runtime -> Restart Runtime). You will need to rerun imports but not recopy files.

In [0]:
sess = gpt2.start_tf_sess()
gpt2.load_gpt2(sess, run_name=run_name)

Loading checkpoint checkpoint/345M_poetry/model-815326
INFO:tensorflow:Restoring parameters from checkpoint/345M_poetry/model-815326


## GPT-2 19th Century Literature poem generator
Choose from the list of 19th-century Project Gutenberg poets and generate some poetry! Try using the last line of one poem as the starting line of the next, and experiment with different styles.

Copy and paste the ones you like into a [Google Doc](https://drive.google.com/drive/u/0/my-drive), edit them, clean them up a little, change some words here and there... and ask us to print them out!

In [0]:
#@markdown ---
#@markdown ### Configure your poem:
Poem_start = "Seaside as evening, raining" #@param {type:"string"}
Poem_length = "100" #@param ['20','50','100','150','200','300','500','1000']
Number_of_poems = "10" #@param ['1', '2', '5', '10', '20']
In_the_style_of = "51120 John Keats, Po\xE8mes et Po\xE9sies Traduction pr\xE9c\xE9d\xE9e d'une \xE9tude par Paul Gallimard" #@param ["4212 Matthew Arnold, Culture and Anarchy", "5159 Matthew Arnold, Celtic Literature", "27739 Matthew Arnold, Poetical Works of Matthew Arnold", "4253 Robert Browning, Dramatic Romances", "16376 Robert Browning, Browning's Shorter Poems", "18343 Robert Browning, The Pied Piper of Hamelin", "50954 Robert Browning, The Complete Poetic and Dramatic Works of, Cambridge Edition", "11 Lewis Carroll, Alice's Adventures in Wonderland", "651 Lewis Carroll, Phantasmagoria and Other Poems", "4763 Lewis Carroll, The Game of Logic", "28696 Lewis Carroll, Symbolic Logic", "33582 Lewis Carroll, Rhyme? And Reason?", "8209 John Keats, Poems 1817", "23684 John Keats, Keats: Poems Published in 1820", "24280 John Keats, Endymion A Poetic Romance", "51120 John Keats, Poèmes et Poésies Traduction précédée d'une étude par Paul Gallimard", "982 Edward Lear, The Book of Nonsense", "13649 Edward Lear, Laughable Lyrics", "20113 Edward Lear, Nonsense Drolleries The Owl & The Pussy-Cat—The Duck & The Kangaroo.", "34906 Edward Lear, The Jumblies, and Other Nonsense Verses", "4654 Percy Bysshe Shelley, The Daemon of the World", "4696 Percy Bysshe Shelley, The Witch of Atlas", "4800 Percy Bysshe Shelley, The Complete Poetical Works of", "5428 Percy Bysshe Shelley, A Defence of Poetry and Other Essays", "1322 Walt Whitman, Leaves of Grass", "8388 Walt Whitman, Poems By", "8801 Walt Whitman, Drum Taps", "8813 Walt Whitman, Complete Prose Works Specimen Days and Collect, November Boughs and Goodbye My Fancy", "27494 Walt Whitman, The Patriotic Poems of", "47846 Walt Whitman, Poèmes de"]





generatePoetry(Poem_start, 
               authorid=In_the_style_of.split(' ')[0], 
               poemlength=int(Poem_length), 
               numPoems=int(Number_of_poems))

Poèmes et Poésies	Traduction précédée d'une étude par Paul Gallimard
John Keats
---
Seaside as evening, raining tears.
The rich and rare the Muse has found
Of varied talents, in thy mind;
But never can the Muse of gold
Delight her slave with one contralto's lay.
But I have heard them sung by others' voice,
And by my master's, and the muse's;
And though with every instrument
They make the praise ...

Poèmes et Poésies	Traduction précédée d'une étude par Paul Gallimard
John Keats
---
Seaside as evening, raining clouds,--
And, as the clouds decline, the tempests roar.
These fowls must be of rare and potent breed.
I hate that bird of all the fowls divine.
But though such fowls are so rare and potent;
Yet will I not so much shun as despise;
While the world with its ogling crew contend.
In ...

Poèmes et Poésies	Traduction précédée d'une étude par Paul Gallimard
John Keats
---
Seaside as evening, raining pearls,
'Mid the wild, glad hills of May.
So there they stayed, l

### Other code examples

In [0]:
# You can seed the next verse using multiple lines from the prior verse, to help carry some of the context forward.
# Or you can paste multiline text in (a whole verse for example) to experiment.

firstLine = '''How Nehru cuts a mango --
Your method, Nusrat begum,
must be improved-
if you want to have it all.
Balance the mango on it's heel
'''
authorid = 24280  # John Keats, Endymion A Poetic Romance

generatePoetry(firstLine, authorid=authorid, poemlength=200, numPoems=10)
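
One way to carry context forward is to cut the last few lines out of a poem you liked and use them as the next seed. The sketch below assumes the poem is available as a plain string. `last_lines` is a hypothetical helper, and the trailing newline is an assumption that follows the multiline example above, where the seed ends on a line break.

```python
def last_lines(poem_text, n=2):
    """Hypothetical helper: return the last n non-empty lines, ending with a newline."""
    lines = [line for line in poem_text.splitlines() if line.strip()]
    tail = lines[-n:] if n > 0 else []
    return '\n'.join(tail) + '\n'


# Two lines taken from one of the generated poems shown earlier.
previous = '''Seaside as evening, raining pearls,
'Mid the wild, glad hills of May.
'''

seed = last_lines(previous, n=1)
print(repr(seed))

# The seed could then be passed on, for example:
# generatePoetry(seed, authorid=24280, poemlength=200, numPoems=10)
```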


# LICENSE

MIT License

Copyright (c) 2019 Max Woolf

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.