Using Allison Parrish's Project Gutenberg Poetry Corpus in order to generate Poetry

    The first step is to download the corpus using the following code:

In [1]:
!curl -O http://static.decontextualize.com/gutenberg-poetry-v001.ndjson.gz

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
  0 52.2M    0 13606    0     0  18294      0  0:49:57 --:--:--  0:49:57 18312
  0 52.2M    0  214k    0     0   119k      0  0:07:29  0:00:01  0:07:28  119k
  1 52.2M    1  600k    0     0   212k      0  0:04:11  0:00:02  0:04:09  212k
  1 52.2M    1  864k    0     0   225k      0  0:03:57  0:00:03  0:03:54  225k
  2 52.2M    2 1354k    0     0   282k      0  0:03:09  0:00:04  0:03:05  282k
  3 52.2M    3 1717k    0     0   296k      0  0:03:00  0:00:05  0:02:55  338k
  3 52.2M    3 1986k    0     0   300k      0  0:02:58  0:00:06  0:02:52  369k
  4 52.2M    4 2440k    0     0   313k      0  0:02:50  0:00:07  0:02:43  370k
  5 52.2M    5 3000k    0     0   348k      0  0:02:33  0:00:08  0:02:25  446k
  5 52.2M    5 3170k    0     0   322k      0  0:02

This file contains three million lines of poetry stored in a gzipped newline deliminated JSON format with a JSON object on each line. 
The followng code uses Python's gzip library to open the file and store each line of poetry as JSON objects as a list called poetry_lines.

In [2]:
import gzip, json
poetry_lines = []
for line in gzip.open("gutenberg-poetry-v001.ndjson.gz"):
    poetry_lines.append(json.loads(line.strip()))

We can look at a random sample of petry lines:

In [3]:
import random

In [4]:
random.sample(poetry_lines, 8)

[{'s': 'straight and swift across the sea through all its course, to',
  'gid': '1997'},
 {'s': 'Of mortal members, subject to decay,', 'gid': '228'},
 {'s': 'affection awaits the sun, fixedly looking till the dawn may',
  'gid': '1997'},
 {'s': 'What from this day I shall be,', 'gid': '1304'},
 {'s': 'When he heard the owls at midnight,', 'gid': '1365'},
 {'s': 'Over the ringing battle of dauntless men,', 'gid': '658'},
 {'s': 'Now am I come where many a plaining voice', 'gid': '1005'},
 {'s': 'His right hand will shield thee then.', 'gid': '1365'}]

Allison Parrish stored these lines of poetry in JSON format using 's' as the key with the lines of poetry, and 'gid' as the key containing the Project Gutenberg ID of the file which allows us to look up the title of the book of poetry and the author.

Markov Chain Text Generation:

    In order to generate a poem, we will use Markov chain text generation which uses statistical information on word co-occurrence based on the source text we are providing. In this case, this model will be built upon the three million lines of poetry stored in poetry_lines.

In Python, we will use the Markovify library in order to build and generate from Markov chain models.
Install with:

In [5]:
!pip install markovify



Import it:

In [6]:
import markovify

We will use a Markov chain to generate new lines of poetry from the Gutenberg Poetry Corpus deisgned by Allison Parrish. 
Since Markkovify requires text to be passed in as a string, we then create a large string with a sample of the poetry lines separated buy new lines:

In [7]:
big_poem = "\n".join([line['s'] for line in random.sample(poetry_lines, 250000)])

The sample can be of any size, but larger samples may take longer to run.

Build the model:

In [8]:
model = markovify.NewlineText(big_poem)

Then generate some lines:

In [9]:
for i in range(14):
    print(model.make_sentence())

To loose the rein,
Of prisons where they affirm that they forget, so let them perish,
To purchase his own cross.
Ere they closed their eyes shall see them riding down the curtain.
Then do I delay my mother's knee. Was I offered in that sole
The sky and star,
Who is my confidence,
this last of the government to double all its throb intense
waiting for one should him wrong.
She, proudly, thinning in the vast love, and her husband entered.
Which showed thee the
Through the roof rattles with the Indian Government, always keen to please,
Though I own the kindness done to much nye were.
Precise in dealing, foes to you, Father Malloy,


We can further create a poem about a specific subject by filtering the poetry lines for lines contianing a specific word. For example, we can find each line which contains the word "god" and do so using a regular expression that finds the string "god" between teo word boundaries, without respect to case:

In [12]:
import re
god_lines = [line['s'] for line in poetry_lines if re.search(r'\bgod\b', line['s'], re.I)]
len(god_lines)

28266

Using these lines containing the word "god" at least once, we can then train another Markov Chain Model to create a poem seemingly focused on this subject.

In [19]:
god_big_poem = "\n".join([line['s'] for line in random.sample(poetry_lines, 28000)])

Build the model:

In [20]:
god_model = markovify.NewlineText(god_big_poem)

Generate some lines of poetry:

In [21]:
for i in range(14):
    print(god_model.make_sentence())

Lurks in each plan to guide my fighting arm.
Right, law, and industry gave way to Beth'lem and as the angels adore,
illustration of the mouth, fatigued by the clear ideas acquired by our loss we may not come to the widowed diadem promoted
Which shall the Lover and his kinsmen, in a whirlwind: all
Siegfried Sassoon is an authentic document.
To come forth and see.
The harpies are not here to sleep;
To wit, a relation from that cheek wherewith he is sleeping in their fall receives:
In the very soul
and on the hill;
As dry as the badde,
And sat and guided with nice care the helm, the rowers on the desert sand.
I read the starry Heaven:
On Sunday to church you go, each with her veil conceals the coming throng?--a singular sight,
