# CS155 Project 3: Shakespearean Sonnets

In [122]:
import os
import random
import re
import nltk
from nltk.corpus import cmudict
from HMM_Project3 import unsupervised_HMM
from nltk.tokenize import TweetTokenizer
tknzr = TweetTokenizer()

## Preprocessing:

#### Initial Attempt:
- Process the text line by line
- Remove lines containing poem numbering (1, 2, 3, etc.)
- Convert all words to lowercase
- Use TweetTokenizer to separate words (retains apostrophes and hyphens)
- Remove all punctuation

In [89]:
def preprocess_init(text):
    # Convert text to dataset.
    lines = text.split('\n')

    obs_counter = 0
    obs = []
    obs_map = {}

    for line in lines:
        # Separate the line into tokens using TweetTokenizer
        sentence = tknzr.tokenize(line)
        # Skip empty lines and poem-numbering lines
        if sentence != [] and not sentence[0].isnumeric():
            obs_elem = []
            punct = ".',':;!?()"

            for word in sentence:
                # Drop standalone punctuation tokens (substring test against punct)
                if word not in punct:
                    # Turn to lowercase
                    word = word.lower()
                    if word not in obs_map:
                        # Add unique words to the observations map.
                        obs_map[word] = obs_counter
                        obs_counter += 1

                    # Add the encoded word.
                    obs_elem.append(obs_map[word])

            # Add the encoded sequence.
            obs.append(obs_elem)

    return obs, obs_map

In [90]:
with open(os.path.join(os.getcwd(), 'data/shakespeare.txt')) as f:
    text = f.read()
obs, obs_map = preprocess_init(text)

## Unsupervised Learning and Poetry Generation with HMMs:

If we split the data into training and test sets to choose the number of hidden states by validation, we could not guarantee that every word appears in the training set, since some words occur only once across all of the poems. Instead, we generate sample poems and subjectively judge the best number of states, as suggested on Piazza.
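
To make the concern concrete, the sketch below counts how many vocabulary entries occur exactly once in an encoded dataset shaped like `obs`. The helper name `count_singletons` and the toy data are illustrative assumptions, not part of the project code.

```python
from collections import Counter


def count_singletons(encoded_lines):
    """Return (number of word ids seen exactly once, vocabulary size)."""
    counts = Counter(word_id for line in encoded_lines for word_id in line)
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons, len(counts)


# Toy encoded corpus (hypothetical); the real input would be `obs`.
toy_obs = [[0, 1, 2], [0, 3, 1], [4, 0]]
singletons, vocab_size = count_singletons(toy_obs)
print(f"{singletons} of {vocab_size} words appear only once")
```

Any word counted here that fell into a held-out split would never be seen during training, so the model would have no emission probability for it.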

In [92]:
def obs_map_reverser(obs_map):
    obs_map_r = {}

    for key in obs_map:
        obs_map_r[obs_map[key]] = key

    return obs_map_r

#### Initial Attempt:
- Determine the number of words in each line by sampling at random from the line lengths observed in the sonnets.
- End every line with a comma except the final line, which ends with a period.
- Use the characteristic 14-line sonnet structure.

In [101]:
# Generate array of all Shakespeare line lengths (in number of words)
line_lens = [len(i) for i in obs]

In [99]:
def generate_poem_init(hmm, obs_map, line_lens):
    # Get reverse map.
    obs_map_r = obs_map_reverser(obs_map)
    
    poem = ""
    
    for i in range(14):
        # Get desired line length:
        n_words = random.choice(line_lens)
        emission, states = hmm.generate_emission(n_words)
        # Decode word indices back to words (idx avoids reusing the loop variable i)
        sentence = [obs_map_r[idx] for idx in emission]
        
        formatted = ' '.join(sentence).capitalize()
        if i < 13:
            formatted += ",\n"
        else:
            formatted += "."
        
        poem += formatted

    return poem

Generate poems with 2, 4, 8, and 16 hidden states to assess coherence:

In [105]:
hmm2 = unsupervised_HMM(obs, 2, 100)
print('\nSample Poem:\n====================')
print(generate_poem_init(hmm2, obs_map, line_lens))


Sample Poem:
Of climbed so same shop and they place from,
Walls of each what my her state be,
Dear me posterity within in ) love's thy,
Poet white verse thy the gone dote my,
The me to that within heaven dispense i,
From me or it beated look papers sharpened,
Like set here's all insults adder's jacks that esteeming,
For upon the all of the to fool,
Are or course the rest swear your o makes,
That bett'ring a thee beauty me for poor,
Kind are anew tyrant worth lose ocean change should i,
You more put my as which strive,
And seeking and then and that,
Truth the watchman nor his for use they.


In [106]:
hmm4 = unsupervised_HMM(obs, 4, 100)
print('\nSample Poem:\n====================')
print(generate_poem_init(hmm4, obs_map, line_lens))


Sample Poem:
Fair hide share beauty's tongue and of will,
Both with fear are his that,
Of is were but what others hate thy lest with,
In but dear thee mistress hath,
His sweetest send thy to votary and cast,
Long give and wand'ring still whereto to eyes cherubins,
To gave methinks my argument side my is enmity,
Vexed o was on i and you to,
Thy quite eyes your to paying indeed conspire which,
Memory thou laid impute niggard my ushers happy and,
Particulars power and tell aspect flatter his full none,
Heaven the sight saucy state a pattern beauties sweet hung left,
And thee mine your buds,
No frown this my black to this tell best.


In [112]:
hmm8 = unsupervised_HMM(obs, 8, 100)
print('\nSample Poem:\n====================')
print(generate_poem_init(hmm8, obs_map, line_lens))


Sample Poem:
Like found in importune the with this waste,
Title a as a interest beauty's sickness pent thee,
Made by in up in my pierced in strong,
Better burthen leaves with clouds so excellent,
Be gainst at much so perish who of,
See that nymphs let unfair of,
All jacks those grief's moment yours touches work,
Nor it think if thee for,
That or we have thy straight is thy,
Or and but boast my slave or,
Bars breast both sun dead thy in fears,
Despise the every dear-purchased decay of my,
Dateless all the expiate sweets upon that form,
Her appetite life that find me seal from self.


In [120]:
hmm16 = unsupervised_HMM(obs, 16, 100)
print('\nSample Poem:\n====================')
print(generate_poem_init(hmm16, obs_map, line_lens))


Sample Poem:
Merits had they thought acquaintance in boast bud,
If that them again from kind,
And and much to fair sounds renewed from,
Of on hast golden might their for their store,
Morning then sing when nor respect and jaws,
Or best use the music world is,
Gems tables dull broke holds me it mind when suffered,
Some why for pace than and majesty prisoner,
My no bonds from my,
Am see courses hath the sweet moon as,
And and mother it there upon ill,
Taste called live i use death's heart with fortune,
World soil of my brave treasure and thy tears,
Brave sweet grief untold eye to seeming why heaven lost.


All of the poems are largely nonsensical, but the poems generated with 8 and 16 hidden states read as noticeably more grammatical than the others. Because those two are about equally grammatical and both remain thematically incoherent, we use 8 hidden states for further generation and improvements, since the smaller model takes less time to train.

## Additional Goals:

We modified the following preprocessing and generation functions to try to capture these aspects of actual Shakespearean sonnets:

#### Rhyme
We implement the *abab cdcd efef gg* rhyme scheme by building a dictionary of all rhyming line-end pairs during preprocessing. At generation time, we seed each pair of rhyming lines with a randomly chosen pair from the dictionary and generate each line in reverse, from its last word back to its first.
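
As a rough illustration of the pairing step, the sketch below extracts rhyming pairs from the end words of one 14-line sonnet and records them in both directions. The function names `rhyme_pairs` and `add_to_rhyme_dict`, and the choice of sets as dictionary values, are assumptions made for this example rather than the notebook's actual implementation.

```python
def rhyme_pairs(end_words):
    """Pair the end words of one 14-line sonnet by the abab cdcd efef gg scheme."""
    if len(end_words) != 14:
        return []  # skip sonnets with an irregular number of lines
    pairs = []
    # Three quatrains: within each, line k rhymes with line k + 2.
    for start in (0, 4, 8):
        pairs.append((end_words[start], end_words[start + 2]))
        pairs.append((end_words[start + 1], end_words[start + 3]))
    # Final couplet: the last two lines rhyme with each other.
    pairs.append((end_words[12], end_words[13]))
    return pairs


def add_to_rhyme_dict(rhyme_dict, pairs):
    """Record every rhyme in both directions (hypothetical dict layout)."""
    for a, b in pairs:
        rhyme_dict.setdefault(a, set()).add(b)
        rhyme_dict.setdefault(b, set()).add(a)
    return rhyme_dict


# End words of a 14-line sonnet, used here only as illustration.
end_words = ["increase", "die", "decease", "memory",
             "eyes", "fuel", "lies", "cruel",
             "ornament", "spring", "content", "niggarding",
             "be", "thee"]
example_dict = add_to_rhyme_dict({}, rhyme_pairs(end_words))
print(example_dict["die"])
```

With a dictionary like this, a line can be seeded with its final word and the rest generated backwards, which is one plausible reason for producing each line in reverse.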

#### Syllable Count (10)
We enforce the 10-syllable count by tracking the running syllable total as we generate an emission and restricting the candidate words near the end of the line so that the total comes to exactly 10.
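
The constraint can be sketched independently of the HMM. In the example below, `build_line` and the toy `syllables` table (one integer count per word) are hypothetical, and uniform sampling stands in for sampling from the model's emission distribution.

```python
import random


def build_line(syllables, target=10, rng=random):
    """Pick words until their syllable counts sum to exactly `target`."""
    line, total = [], 0
    while total < target:
        remaining = target - total
        # Restrict the candidates to words that still fit in the line.
        candidates = [w for w, n in syllables.items() if n <= remaining]
        if not candidates:
            raise ValueError("no word fits the remaining syllables")
        word = rng.choice(candidates)
        line.append(word)
        total += syllables[word]
    return line


# Toy syllable table (assumed counts, for illustration only).
syllables = {"love": 1, "thy": 1, "beauty": 2, "summer": 2, "eternal": 3}
line = build_line(syllables)
print(" ".join(line), sum(syllables[w] for w in line))
```

Because the candidate list shrinks as the line fills up, the total can never overshoot the target; in an HMM-based generator the same filter would be applied to the words the current state can emit.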

#### Punctuation
We allow intermediate commas in the emission. Since end-of-line punctuation has more to do with poetic structure than with the preceding word, we generate it separately: for each line number, we build a distribution of the punctuation that usually ends that line and sample from it. The exception is the final line, which always ends with a period.
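
A minimal sketch of the per-line punctuation distribution follows. It keeps one list of observed marks per line position, so that `random.choice` samples each mark in proportion to how often it was seen. The names `collect_end_punct` and `sample_end_punct`, the fallback comma, and the three-line toy poems are assumptions for illustration.

```python
import random


def collect_end_punct(poems, n_lines=14):
    """Collect the closing punctuation mark seen at each line position."""
    # The final line is excluded because it always ends with a period.
    punct_lists = {k: [] for k in range(n_lines - 1)}
    for poem in poems:
        for k, line in enumerate(poem[:n_lines - 1]):
            last = line.rstrip()[-1:]
            if last and last in ",.;:!?":
                punct_lists[k].append(last)
    return punct_lists


def sample_end_punct(punct_lists, line_index, rng=random):
    """Sample a mark for a line from its empirical distribution."""
    marks = punct_lists[line_index]
    # Fall back to a comma if no mark was observed at this position.
    return rng.choice(marks) if marks else ","


# Toy three-line "poems" (hypothetical) to keep the example short.
poems = [["first line,", "second line:", "third line."],
         ["another one,", "and then;", "the end."]]
punct_lists = collect_end_punct(poems, n_lines=3)
print(punct_lists)
print(sample_end_punct(punct_lists, 1))
```

Storing the raw marks in lists, rather than normalized probabilities, keeps the sampling step to a single `random.choice` call.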

To implement these goals, we first need to parse the syllable counts:

In [123]:
def parse_syll_text(syll_text):
    # Convert syllable text to dictionary
    lines = [line.split() for line in syll_text.split('\n') if line.split()]

    syll_dict = {}

    for line in lines:
        word = re.sub(r'[^\w]', '', line[0]).lower()
        syll_dict[word] = line[1:]

    return syll_dict

In [124]:
with open(os.path.join(os.getcwd(), 'data/Syllable_dictionary.txt')) as f:
    syll_text = f.read()
syll_dict = parse_syll_text(syll_text)

Now, we need to initialize the rhyming and end-of-line punctuation dictionaries:

In [129]:
rhyme_dict = {}
punct_dict = {k: [] for k in range(13)}