Using an Edgar Allan Poe Poetry Corpus to generate new poetry.

The first step is to download the corpus file, poetry_poe.ndjson.

This file contains over 13,000 lines of poetry stored in newline-delimited JSON format, with one JSON object per line.
The following code opens the file and stores each line of poetry as a JSON object in a list called poetry_lines.

In [8]:
import json

# Each line of the file holds one JSON object; parse them into a list.
poetry_lines = []
with open("poetry_poe.ndjson") as f:
    for line in f:
        poetry_lines.append(json.loads(line.strip()))

len(poetry_lines)

13455
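
To make the file format concrete, here is a minimal sketch that parses two newline-delimited JSON records from an in-memory string instead of the file. The two records reuse values from the sample output shown in this notebook; the names ndjson_text and records are only illustrative.

```python
import json

# Two records in newline-delimited JSON: one complete JSON object per line.
# The values are copied from the sample output in this notebook.
ndjson_text = (
    '{"s": "Never on our lips before;", "gid": "2151"}\n'
    '{"s": "Us to the field againe.", "gid": "10031"}\n'
)

# Parse each non-empty line on its own, as the loading loop does for the file.
records = [json.loads(line) for line in ndjson_text.splitlines() if line.strip()]

print(len(records))
print(records[0]["s"])
```

Because every line is a complete JSON object, the file can be read one line at a time without loading it all into memory first.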

We can look at a random sample of poetry lines:

In [3]:
import random

In [4]:
random.sample(poetry_lines, 8)

[{'s': '"employed by us for several months as critic and subeditor.... He',
  'gid': '10031'},
 {'s': 'Never on our lips before;', 'gid': '2151'},
 {'s': 'Us to the field againe.', 'gid': '10031'},
 {'s': 'Cas. Why, sir, the Earl Politian.', 'gid': '2151'},
 {'s': 'canvas, an intensity of intolerable awe, no shadow of which felt',
  'gid': '932'},
 {'s': 'Yet let me not be misapprehended. The undue, earnest, and morbid',
  'gid': '2148'},
 {'s': 'intense mental collectedness and concentration to which I have',
  'gid': '932'},
 {'s': 'From the wild energy of wanton haste', 'gid': '2151'}]

Allison Parrish stored these lines as JSON objects in which the 's' key holds the line of text and the 'gid' key holds the Project Gutenberg ID of the source file. The ID lets us look up the title and author of the book the line came from.
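
As a rough illustration of how the 'gid' key can be used, the sketch below counts how many lines come from each Project Gutenberg file. It runs on a few records copied from the sample output above; sample_lines and lines_per_gid are illustrative names, not part of the corpus.

```python
from collections import Counter

# A few records copied from the sample output above.
sample_lines = [
    {"s": "Never on our lips before;", "gid": "2151"},
    {"s": "Us to the field againe.", "gid": "10031"},
    {"s": "Cas. Why, sir, the Earl Politian.", "gid": "2151"},
    {"s": "From the wild energy of wanton haste", "gid": "2151"},
]

# Count how many lines come from each Project Gutenberg file.
lines_per_gid = Counter(record["gid"] for record in sample_lines)

for gid, count in lines_per_gid.most_common():
    print(gid, count)
```

Passing poetry_lines in place of sample_lines should give the same breakdown for the whole corpus.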

Markov Chain Text Generation:

To generate a poem, we will use Markov chain text generation, which relies on statistical information about word co-occurrence in the source text we provide. In this case, the model will be built from the 13,455 lines stored in poetry_lines.

What are Markov Chains?

A Markov chain is a stochastic process that models sequences of events in which the probability of each event depends only on the previous event. In our poetry generation, the probability of each word in the poem depends on the word (or few words) that come before it. The model picks each next word according to how often it followed the current word in the source text, and repeating this step word by word produces a chain of words.

The visual below represents this process:
![image.png](attachment:image.png)
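
The idea can also be sketched in a few lines of plain Python. This is a simplified first-order chain written for illustration only, not how Markovify is implemented; build_chain, generate, and toy_lines are hypothetical names, and the two toy lines are taken from the corpus output shown later in this notebook.

```python
import random
from collections import defaultdict


def build_chain(lines):
    """Map each word to the list of words observed right after it."""
    chain = defaultdict(list)
    for line in lines:
        words = line.split()
        for current_word, next_word in zip(words, words[1:]):
            chain[current_word].append(next_word)
    return chain


def generate(chain, start, max_words=8, rng=random):
    """Walk the chain from `start`, choosing each next word at random."""
    words = [start]
    # Stop at max_words or when the current word has no recorded successor.
    while len(words) < max_words and words[-1] in chain:
        words.append(rng.choice(chain[words[-1]]))
    return " ".join(words)


toy_lines = [
    "In visions of the dark night",
    "Are where thy dark eye glances,",
]
toy_chain = build_chain(toy_lines)
print(generate(toy_chain, "In", rng=random.Random(0)))
```

Because a follower appears in the list once for every time it was observed, picking uniformly from the list reproduces the observed frequencies. In this toy chain, "dark" is followed by either "night" or "eye", so a walk starting at "In" can cross from the first line into the second.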

In Python, we will use the Markovify library to build Markov chain models and generate text from them.
Install with:

In [5]:
!pip install markovify



Import it:

In [6]:
import markovify

We will use a Markov chain to generate new lines of poetry from the Gutenberg Poetry Corpus designed by Allison Parrish.
Since Markovify requires text to be passed in as a string, we create one large string from a sample of the poetry lines, separated by newlines:

In [9]:
big_poem = "\n".join([line['s'] for line in random.sample(poetry_lines, 13455)])

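This cell samples 13,455 lines, which is every line in poetry_lines, so the whole corpus is used in shuffled order. A smaller sample also works and builds faster; larger samples take longer to run.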

Build the model:

In [10]:
model = markovify.NewlineText(big_poem)

Then generate some lines:

In [11]:
for i in range(14):
    print(model.make_sentence())

Attend thee ever; and I even welcomed his presence as the Count Castiglione never
There was a very
This and more I sat divining, with my mother's milk I did lie,
And thus too, it happened, perhaps, that more of Sin,
Well!--I will think of it--I will not raise a hand all thy melody
Walked in beauty at my chamber door,
Seem'd earthly in the year,
On that side now, and now pulling therewith sturdily, he so cracked, and
And each separate dying ember wrought its ghost upon the ear, in Eyraco ,
I stand beneath the tamarind tree?
furnished here for evermore.
Ghastly grim and ancient Raven wandering from the wide and rigid bier low lies thy love,
In climes of the pale-faced moon.
No more, my lord, than I have been a beauteous dame beyond the sea!


We can also create a poem about a specific subject by filtering the poetry lines for those containing a specific word. For example, we can find each line that contains the word "dark" using a regular expression that matches the string "dark" between two word boundaries, ignoring case. (The variable names below use a god_ prefix, but the pattern searches for "dark".)

In [21]:
import re
god_lines = [line['s'] for line in poetry_lines if re.search(r'\bdark\b', line['s'], re.I)]
len(god_lines)

35
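
The effect of the word boundaries and the case-insensitive flag in the pattern above can be checked on a few strings. In this small sketch the first candidate comes from the corpus, while the other two are made-up examples chosen to show a capitalised match and a non-match.

```python
import re

# Same pattern as above: "dark" as a whole word, ignoring case.
pattern = re.compile(r"\bdark\b", re.IGNORECASE)

candidates = [
    "In visions of the dark night",        # matches: whole word
    "Dark draperies hung upon the walls",  # matches: case is ignored
    "The room darkened at dusk",           # no match: part of a longer word
]

matches = [text for text in candidates if pattern.search(text)]
print(matches)
```

Without the \b anchors, the third string would also match, and words such as "darkened" or "darkness" would be pulled into the filtered list.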

In [22]:
random.sample(god_lines, 8)

['Here sate he with his love--his dark eye bent',
 'ignorant errors of the dark ages of the church.--_Dr.',
 'In visions of the dark night',
 'In visions of the dark night',
 'of the gloomy furniture of the room--of the dark and tattered',
 'Here sate he with his love--his dark eye bent',
 'Are where thy dark eye glances,',
 '"Even this division," said I, "leaves me still in the dark."']

Using these lines, each of which contains the word "dark" at least once, we can then train another Markov chain model to create a poem seemingly focused on this subject.

In [23]:
god_big_poem = "\n".join([line for line in random.sample(god_lines, 34)])

Build the model:

In [24]:
god_model = markovify.NewlineText(god_big_poem)

Generate some lines of poetry:

In [25]:
for i in range(5):
    print(god_model.make_sentence())

In visions of the gloomy furniture of the dark arch,
In visions of the dark the silent stream--
In visions of the gloomy furniture of the vaulted and fretted ceiling. Dark draperies
the recesses of the dark night
In visions of the gloomy furniture of the dark the silent stream--
