# Machine Learning Project: Writing Poetry with RNNs

*by Abigail Rictor, due May 14, 2019*

## Introduction

For this project I decided to explore the idea of writing using a neural network. I am very interested in linguistic structure and how natural language can be broken down and "understood" by machines. This has influenced my previous academic and personal projects. Most relevant are a series of Markov algorithms I have written to probabilistically generate different types of text, ranging from tweets to poems. I spent a good deal of last semester working on a Markov based web server which generates poetry based on a 10,000+ element dataset taken from poetryfoundation.org. For this I've developed ways of pre-processing that data to eliminate strange characters as well as methods for post-processing in order to end up with the highest quality content, and that has informed the way I pre- and post- process in this project.

Here I am using the same dataset and feeding it to a recurrent neural network which generates new content character by character. A recurrent neural network functions on the principle of remembering state by recycling outputs as inputs. This is useful here because each character we generate relies on the characters generated before it, allowing our output to have some consistency throughout.

I was able to generate poems after training on that dataset for different amounts of data, numbers of iterations, and using different batch sizes. Comparing these results to each other gives a more clear idea of the meanings and practical uses of these variables. I also compare methods by looking at these poems side by side with some of the results from my Markov writer with attention to quality, form, and the types of mistakes they make.

## Methods

In [24]:
# Importing external libraries
import numpy as np
import json
import random
import torch

from ipywidgets import IntProgress
from ipywidgets import Dropdown
from ipywidgets import Button
from IPython.display import display

import CharRNN as crnn #See GitHub https://github.com/albertlai431/Machine-Learning/tree/master/Text%20Generation

The code below defines a method to read in the text, and maps characters to integers. I am reading in data stored in my local directory which I found by crawling poetryfoundation.org and ripping JSON formatted files for each poem. During this process I replace some unicode characters with ones which will be more recognizable in this context.

In [45]:
def readData(num_poems=5000, all=False):
    poems = []
    f = IntProgress(min=0, max=num_poems, description= "Reading data...")
    display(f)
    text = ""
    
    poem = random.randint(1,10216)
    for i in range(1,num_poems+1):
        if(all):
            poem = i
        else:
            poems.append(poem)
            while(poem in poems): 
                poem = random.randint(1, 10215)
        text += "\n".join(json.load(open("./poems/"+ str(poem) + ".json"))['text'])
        f.value += 1
    f.close()
    print("Data read complete.")
    
    text.replace(u"\u0092", "'").replace("”", "\"").replace("“", "\"").replace(u"\2019\ufeff", "\'")
    
    return text

Below, I make a call to read the data and then set up two dictionaries which map characters to integers and integers to characters for simple conversion between the characters inputted and outputted and the integers we choose here to represent them in the context of the program. This transforms our data into something potentially mathematically understandable, but isn't enough to define relationships between any given characters. After passing this in for the model to be trained on, the numbers are encoded as one-hot vectors, which allow complex relationships between individual characters to define their own output using vector math.

An example of the code to set up and train the recurrent neural network is also below, though I've run it with many other configurations than these. It's calling a file I've formatted and imported which takes much of its code from Shakespeare.py in the github I've listed as a reference.

In [46]:
text = readData(all=True)

IntProgress(value=0, description='Reading data...', max=5000)

Data read complete.


In [None]:
# encoding the text and map each character to an integer and vice versa

# We create two dictionaries:
# 1. int2char, which maps integers to characters
# 2. char2int, which maps characters to integers
chars = tuple(set(text))
int2char = dict(enumerate(chars))
char2int = {ch: ii for ii, ch in int2char.items()}

# Encode the text
encoded = np.array([char2int[ch] for ch in text])

# Define the net
n_hidden=512
n_layers=2
net = crnn.CharRNN(chars, n_hidden, n_layers)

# Declaring the hyperparameters
batch_size = 128
seq_length = 100
n_epochs = 30 # start smaller if you are just testing initial behavior

# train the model
epochs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 30, 50]

for e in epochs:
    crnn.train(net, encoded, epochs=e, batch_size=batch_size, seq_length=seq_length, lr=0.001, print_every=500000)
    # Saving the model
    model_name = 'rnn_'+str(e)+'_epoch.net'

#     checkpoint = {'n_hidden': net.n_hidden,
#                   'n_layers': net.n_layers,
#                   'state_dict': net.state_dict(),
#                   'tokens': net.chars}

    with open(model_name, 'wb') as f:
        torch.save(net, f)
    

Training on GPU!


IntProgress(value=0, description='Training...', max=492)

Finished training.


IntProgress(value=0, description='Training...', max=984)

Finished training.


IntProgress(value=0, description='Training...', max=1476)

Finished training.


IntProgress(value=0, description='Training...', max=1968)

Finished training.


IntProgress(value=0, description='Training...', max=2460)

Finished training.


IntProgress(value=0, description='Training...', max=2952)

Finished training.


IntProgress(value=0, description='Training...', max=3444)

In [42]:
print(crnn.sample(net, 500, prime="The", top_k=20))

The wast'l,!
Nor frustes cern,
in I redee the frounts the cool unstlien
of say a tunce aflears,
and with his, deal pingert faping to timntich
the wrelised in twher of migher timute ou clamy, of mirllot geats
I came the muster ambirt to the clace clurios
somerep of a thoussing your, exs stur sostlat
That rise oul for the oughing gost smt eid
The rost rode?
Than brears, in the shosibpos, the best obflounthel.
Ald by his means a bore and lin’s bost oad
I suld me hed
wound so ear youlsay, from sweel,
Br


In [43]:
def load(b):
    print(str(dropdown.value) + " epochs")
    with open('rnn_'+ dropdown.value +'_epoch.net', 'rb') as f:
        load_net = torch.load(f)
        load_net.eval()
        print(crnn.sample(load_net, 500, prime="The", top_k=20))

epochs = ['1', '5', '20']
dropdown = Dropdown(options=epochs, value='5', description='Number:', disabled=False)
button = Button(description='Click me', disabled=False, button_style='success', tooltip='Click me', icon='check')
button.on_click(load)
display(dropdown, button)


Dropdown(description='Number:', index=1, options=('1', '5', '20'), value='5')

Button(button_style='success', description='Click me', icon='check', style=ButtonStyle(), tooltip='Click me')

5 epochs


AttributeError: 'dict' object has no attribute 'eval'

20 epochs
Their smaty
carace torning are I grus
Fearte mear on stiff pack ew ase spor its
Told refail ham fools in thing
To ligount plead ruar of blomip ops,
And where panple nowe liped feem by teed
wirk rumonared tratl covers,
Labe praotirn ent nod begightred my wrombening
of ald the cell-the pimes from hear
s
byom hourses, and aspiront, so ditce is my bus, ap unsueded,
Furnt as sasp froores.
The bay,
but trey lake buruy. So wepel’ins ciudinger of whish
bealicly
all blew
With rynogieves gabfis, a turth, it b


Next, I sample a string of characters from the network by feeding it a "prime" which will appear at the beginning of the output and influence what is chosen next. I experimented with some post-processing by iterating through an outputted string and checking for each word (separated by spaces or punctuation) in a dictionary object filled with nearly 500,000 english words. This is loaded from a .json file found on a github listed in the references. I added common contractions to the version of the file found there. When a made-up word is encountered, all text up to that point will be used as the prime in a new sample from the network. This runs until the poem is appropriately long, then cuts off any excess text after the last instance of ending punctuation. This avoids outputting writing which ends in the middle of an idea.

In [None]:
def rawSample(prime):
    return crnn.sample(net, 500, prime=prime, top_k=20)

dictionary = json.load(open("./words_dictionary.json")) #resource to check for real words
def checkDictionary(word):
    try:
        dictionary[word]
        return True
    except:
        return False

def write(): 
    poem = crnn.sample(net, 500, prime="The", top_k=20)
    poem_arr = poem.split(" ")
    for word in poem_arr: 
        if not checkDictionary(word.lower()):
            poem = crnn.sample(net, 500, prime=poem[:poem.rfind(word)], top_k=20)
        if len(poem)>1000:
            break
    #iterate through words in poem and check if they exist in dictionary
    
    
    last_punctuation = max(poem.rfind('.'), poem.rfind('!'), poem.rfind('?'))
    poem = poem[:last_punctuation+1]
    print(poem)
write()



In [None]:
def cleanFile(filename):
    f = open(filename, "r")
    poem = f.read()
    return poem.replace("\n", "<br>")
print("| Examples |")
epochs = [10, 20]
for e in epochs:
    poem = cleanFile("epoch"+str(e))
    print("| 5000 poems, "+ str(e) +" epochs |")
    print("| <p align=\"left\">" + poem +"</p> |") 

Steps I took.  Resources I used, such as code from the class, on-line resources, research articles, books [Goodfellow, et al., 2016], ....

Say in detail what each team member did.

## Results

| Passage from Beowulf, Old English | Poem generated after 1 epoch (5000 poems) |
|-----------------------------------|--------------------------------|
| <p align="left">Hwaet. We Gardena in geardagum, <br>beodcyninga, brym gefrunon, <br>hu oa aebelingas ellen fremedon. <br>Oft Scyld Scefing sceabena breatum, <br>monegum maegbum, meodosetla ofteah, <br>egsode eorlas. Syooan aerest wearo <br>feasceaft funden, he baes frofre gebad, <br>weox under wolcnum, weoromyndum bah, <br>oobaet him aeghwylc bara ymbsittendra <br>ofer hronrade hyran scolde, <br>gomban gyldan. baet waes god cyning. <br>oaem eafera waes aefter cenned, <br>geong in geardum, bone god sende <br>folce to frofre; fyrenoearfe ongeat <br>be hie aer drugon aldorlease <br>lange hwile. Him baes liffrea, <br>wuldres wealdend, woroldare forgeaf; <br>Beowulf waes breme blaed wide sprang, <br>Scyldes eafera Scedelandum in. </p> | <p align="left">Theis of alr ared ress bit wine nunt epey fotin,<br>wis cand oo thouk ifond risht<br>thel darine, to severy<br>To praarto on.<br>Whe goon woar worl.<br>Whe lanss pirg, inst gon allold,<br>Hode ame heive.<br>Anorecl lint.<br>Denave parlet,<br>Wores, ing dand froake thad a if nifh pad thove made<br>Andry fweang,<br>Trat my sant<br>derred Ind cleen-wars bipd het or tunle, cany<br>we dacl seire thip nattod foy in haln shew<br> I sov a badl rase op the rar lagle, sith es frentes<br>brabs. Yome tea glaoner wet on ninns.<br>Huchey<br>I bocens me goocs<br>in wo</p>

I experimented by testing different number of epochs when training the network. Because the intention of this program is to generate new content, there aren't specifically accurate values I can aim to achieve. One metric I ended up looking at was the percentage of existing English words used in a given poem. Early in training, the network uses mostly gibberish (which, funnily enough, could probably be mistaken for a poem in old english). By searching for individual words in each poem in a dictionary-dictionary with over 400,000 English words that I loaded from a .JSON file I found online, I was able to quickly assess the number of real and fake words in a string.


While setting things up, I often ran the program witha  smaller subset of the data, only 500 poems, and was able to see how few real words it would use with such a small dataset.


| 500 poems | 5000 poems |
|------------------------|----------------|
| 500 poems, 10 epochs | 5000 poems, 10 epochs |
|<p align="left">The<br> and that in a chostluce didss in thoughors hipcerias flases  <br>the tulk straaloniss harlery dings lingunt    <br>I having the each me us no could and srid    <br>And can wans bained siclan drod with he tome to but of erond meron whoud,.    <br>Wheon cordow pras righs.,    <br>At thourtandare out four the inistir thar the bets    <br>he sirnored, aln has basing gaib ghay a dimy badn of mepin,    <br>avly about it als bemcore olromlend wa coued as the end.</p> | <p align="left">There who pulses.<br>I askon her world not ripped her;<br>Exwidence so in the longers of death, and chances<br>To all, feared a spilling nouth,<br>Beneath the bad it, at week in the time<br>And you hour the clouds,<br>From the mees singled by peagle down<br>On a ty the sealed of make,<br>Nor dropped shorter with<br>A kip dry legs by blossoms<br>Wherein of shiff yet a squack dirte.<br>The time didn’t said<br>Definive a temption poison.<br>Only happened me so screamed.<br>No morlight he doesn't have finds<br>All that he was sudding it me!<br>Fear t</p> |
| 500 poems, 20 epochs | 5000 poems, 20 epochs |
| <p align="left">The lingare  <br>Houch batch.  <br>I kune pare,  <br>the forgar,  <br>than it wond tipen, when the glot  <br>Any yeorering laskned,  <br>By creess. Sand.  <br>Saartwas suntery at sernod, on ourland floger barling pillion  <br>of mesing to’th.  <br>I’ve agould hay he.</p> | <p align="left">The woman said then, then<br>is that his heavings and stars<br>in his crowns, within the black cappling,<br>the minerries of it! of their things,<br>who had scan death, is weared<br>they’d been watching the pans and snares.It was a cordipating,<br>All familiar than of the silence of his dark.<br>The name of these muscles trade<br>The satures of leather double-cofse<br>From the teachings of the tack<br>Till he goes blowing to heaven with tood and toss.<br>Now, not I will not see:<br>Through the third grassy windows begin to read the bl</p> |

| More Examples using 5000 poems|
|----------|
| 5000 poems, 3 epochs |
| <p align="left">The souls.<br>To swasting, I heap on paber<br>And treeples you parted, is a swear,<br>And the esconderssed time we have gold,<br>Nizely through at the douch from the roses that I<br>while the tall-not hund, cames lade<br>Blank.<br>Still some drive me touches again; to start?<br>I’fr see the great, for a shefred poop to do hap him lafe.All the and bunsmore the firy of my crops<br>O’ceaning two chose-lecken wells and live,<br>And the pood spolls rake cities or city<br>Women her not for a doning telk flower,<br>Sainting skin.<br>The invany </p> |
| 5000 poems, 15 epochs |
| <p align="left">The wild dream is so is?”<br>She was white, at the great stone?<br>If you understand, except that he loves<br>the shadow in the but song<br>of new hand to crash cornile from table on the white sky,<br>his rings of talle trees, sharpering.<br>Daped I met count through the fields,<br>she said, “It took them.<br>Nearry is all you did not make<br>as the dry deprides. You’re only thought?<br>Some shadow didn’t binst burning death.<br>To think a harm, eyelish is to warm or go in his sweet arms<br>cross personal bodies.I cut a lantup on all </p> |
| 5000 poems, 50 epochs |
| <p align="left">The hawks were glassed in water,<br>my head still dining.<br>As disclauming he<br>flowed and walking<br>to the price this wile wall<br>because purpose we have to rest.<br>White archivest trocker,<br>the long-sounding remain began standing<br>someone has not speaking gold around<br>his exorvise.<br>Their baby days crowded at a yellow<br>old; accident who can’t finish.<br>There are wings, even with me<br>as if I see in the little ran<br>night wandering like a scattered dish.<br>Shriveling for one land so line—<br>is that trace of a would be already</p> |

In [None]:
percents = []
epochs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 30, 50]

for e in epochs:
    f = open("epoch"+str(e), "r")
    poem = f.read()
    poem = poem.replace("\n", " ").replace(".", "").replace(",", "").replace("!", "").replace("?", "")
    real = 0
    poem_list = poem.split(" ")
    for word in poem_list:
        if(checkDictionary(word)):
            real+=1
    percents.append(real/(len(poem_list)))

import matplotlib.pyplot as plt

plt.figure(figsize=(15, 5))
plt.title("Real Words Generated Across Epochs")
plt.plot(epochs, percents)
plt.xticks(epochs, epochs, horizontalalignment='center')
plt.xlabel('Number of Epochs')
plt.ylabel('Percent Existing Words')
plt.grid(True)

This graph can be taken with a grain of salt, because there are of course probabilistic factors affecting the number of real words used every time the network is sampled, but we can notice how as the program is trained, it uses real words more consistently. It learns the rules of the english languages, and even the words that don't exist might look a lot like ones that do, or ones that real people would make up.

## Conclusions

Because my project incorporated a type of network not covered in detail in class, I feel like I learned a lot from this process. Also, since I wasn't working with as familiar code, I spent a lot of time reviewing things that we did cover in class so that I could learn to understand and recognize them in different contexts. 

Much of the early time spent on this project, I was reformatting code to be more understandable. I added widgets to track training and data reading. I also moved the CharRNN class to its own file, along with some related methods for encoding, training, and sampling. This helped me keep track of the elements I needed to be actively changing, as well as format it in a way that highlights my project above the code I've borrowed from other sources.

Some of the difficulties I had with this project include understanding certain pytorch values I'm not familiar with and fully understanding code that I did not write. A lot of machine learning still feels like magic to me, because the math is abstracted from me and it's hard to understand how the output is actually generated.

My timeline reflected the one I submitted in my proposal with decent accuracy, though perhaps bumped up by a few days at the beginning because I underestimated my workload upon my return from travelling.

### References

* https://github.com/albertlai431/Machine-Learning/tree/master/Text%20Generation
* https://github.com/dwyl/english-words
* https://poetryfoundation.org

[Goodfellow, et al., 2016] Ian Goodfellow and Yoshua Bengio and Aaron Courville, [Deep Learning](http://www.deeplearningbook.org), MIT Press. 2014.

Your report for a single person team should contain approximately 2,000 to 5,000 words, in markdown cells.  You can count words by running the following python code in your report directory.  Projects with two people, for example, should contain 4,000 to 8,000 words.

In [None]:
import io
from IPython.nbformat import current
import glob
nbfile = glob.glob('RictorProjectReport.ipynb')
if len(nbfile) > 1:
    print('More than one ipynb file. Using the first one.  nbfile=', nbfile)
with io.open(nbfile[0], 'r', encoding='utf-8') as f:
    nb = current.read(f, 'json')
word_count = 0
for cell in nb.worksheets[0].cells:
    if cell.cell_type == "markdown":
        word_count += len(cell['source'].replace('#', '').lstrip().split(' '))
print('Word count for file', nbfile[0], 'is', word_count)