# Spamku 
The objective is to write haiku about Spam (à la Hormel) based on a corpus of "spamku" from the "SPAM Haiku Archive Home Page - MIT" at http://web.mit.edu/jync/www/spam/. The archive holds about 20,000 haiku, but I used only the 9,000 or so that were curated into various "best of" categories.

Words are chosen with three criteria in mind: part of speech, syllable count, and the preceding word.  First, every line in the corpus was tagged for parts of speech, and the resulting part-of-speech patterns were put into a list.  Duplicates are allowed, so common patterns are chosen more often.  A part-of-speech pattern is chosen at random from this list.  Second, a syllable pattern is chosen from a dictionary that holds every syllable pattern observed for that part-of-speech pattern.  Third, each word is drawn from a Markov chain keyed on the previous word (the first word of a line has no predecessor, so it is chosen at random).  If the word doesn't match the part of speech, another is drawn until one does.  That word is then checked against the syllable pattern.  If it doesn't match, the process begins again until a word is found that matches both the part of speech and the syllable count.
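
The selection procedure described above can be sketched on toy data. This is only an illustration, not the notebook's implementation: the `toy_chain`, `toy_pos`, and `toy_syllables` tables and the `pick_word` helper are hypothetical stand-ins for the Markov chain, the tagger, and the syllable lookup defined later.

```python
import random

# Hypothetical toy tables standing in for the Markov chain, the POS tagger,
# and the syllable lookup used later in the notebook.
toy_chain = {"pink": ["spam", "meat", "salty"], "salty": ["spam", "gel"]}
toy_pos = {"spam": "NN", "meat": "NN", "gel": "NN", "salty": "JJ", "pink": "JJ"}
toy_syllables = {"spam": 1, "meat": 1, "gel": 1, "salty": 2, "pink": 1}

def pick_word(prev_word, pos, syllables, rng, max_tries=200):
    """Draw followers of prev_word until one matches both pos and syllables."""
    for _ in range(max_tries):
        candidate = rng.choice(toy_chain[prev_word])
        if toy_pos[candidate] == pos and toy_syllables[candidate] == syllables:
            return candidate
    return None  # give up; the caller decides on a fallback

rng = random.Random(0)
word = pick_word("pink", "JJ", 2, rng)
print(word)
```

Capping the number of tries matters because some combinations can never be satisfied: here no follower of "salty" is an adjective, so asking for one returns `None` instead of looping forever.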

### Import Libraries, Set Up Variables, and Define Helper Functions

Import the nltk libraries.

Set up dictionaries for storing what is learned from the corpus.  Define a couple of helper functions.

In [1]:
# https://www.nltk.org/
import nltk

# http://www.nltk.org/book/ch05.html
from nltk import pos_tag

nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('cmudict')
from nltk.corpus import cmudict

# https://docs.python.org/3/library/os.html
import os

# https://docs.python.org/3/library/pickle.html
import pickle

# https://docs.python.org/3/library/collections.html
from collections import defaultdict


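# dictionary mapping each part of speech to the list of corpus words with that tag
pos_dict=defaultdict(list)
# CMU Pronouncing Dictionary: maps a word to its list of pronunciations (phoneme lists)
pronunciation_dict = cmudict.dict()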

# https://docs.python.org/3/library/random.html
import random


import string

def clean_word(word):
    word= word.translate(str.maketrans('', '', string.punctuation))
    word=word.lower()
    return word

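def get_syllable_count(word):
    # hard-coded special case for 'spamku'
    if word == 'spamku':
        return 2
    else:
        # vowel phonemes end in a stress digit; count them in the first pronunciation
        # raises KeyError when the word is not in the pronunciation dictionary
        syl=[ x for x in pronunciation_dict[word][0] if x[-1].isdigit()]
        return len(syl)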



[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Nate\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\Nate\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package cmudict to
[nltk_data]     C:\Users\Nate\AppData\Roaming\nltk_data...
[nltk_data]   Package cmudict is already up-to-date!


### Read in the Corpus as a Text String and as a List of Three-Line Haiku

The corpus is read twice: once as a single string, and once as a list in which each item is a complete three-line haiku.  This file contains 8,893 spamku.


In [2]:
# text holds the whole corpus as one long string; newlines are replaced by spaces
filename="corpus/spamku.txt"
text = ""
with open(filename, mode = 'r') as file:
    for line in file:
        text += line.strip()+' '


c=0
l=[]
# haiku_list holds the haiku in their original form, three lines per item
haiku_list=[]
with open(filename, mode = 'r') as file:
    for line in file:
        c+=1
        l.append(line)
        if c==3:
            haiku_list.append(l)
            c=0
            l=[]

##### The first 3 spamku:

In [3]:
haiku_list[:3]

[['See the pretty SPAM\n',
  'Flying through the bedroom door\n',
  'Heading towards your face.\n'],
 ['Ears, snouts and innards,\n',
  'A homogeneous mass--\n',
  'Pass another slice.\n'],
 ['Pink tender morsel,\n',
  'Glistening with salty gel.\n',
  'What the hell is it?\n']]
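
The grouping step can be tried without the corpus file. The sketch below uses an in-memory stand-in for the file and a hypothetical helper, `group_into_haiku`, which is not part of the notebook. Like the loop above, it assumes every three consecutive lines form one haiku, and it silently drops a trailing group of fewer than three lines.

```python
import io

def group_into_haiku(lines, size=3):
    """Group consecutive lines into lists of `size`; drop an incomplete tail."""
    haiku, current = [], []
    for line in lines:
        current.append(line)
        if len(current) == size:
            haiku.append(current)
            current = []
    return haiku

# Seven lines: two complete groups, and a leftover line that is discarded.
sample = io.StringIO("a\nb\nc\nd\ne\nf\ng\n")
groups = group_into_haiku(sample)
print(len(groups))
```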

### Read in pickle files 

pos_dict is a dictionary with individual parts of speech as keys; each value is a list of all words from our spamku corpus that match that part of speech.  It is filtered so that only words our syllable count function can handle are included.

pos_patterns is a dictionary with part-of-speech patterns as keys and the number of times each pattern occurred in the corpus as values.  Each pattern comes from one line of a haiku.
For example: 'JJ NN NN' : 140.  It's not really used except as an interesting statistic.

Instead I use pos_patterns5 and pos_patterns7, which are just lists of part-of-speech patterns, separated by whether they occurred in a five- or seven-syllable line.  The lists allow duplicates, so common patterns are chosen more frequently.

pattern_syl5 and pattern_syl7 are dictionaries with part-of-speech patterns as keys.  The value is a list containing all syllable counts associated with the pattern.  For example: 
pattern_syl7 \[ 'VBG IN NN NN ' \] = \[ \[ 3, 1, 2, 1 \], \[ 2, 2, 1, 2 \], \[ 2, 1, 2, 2 \] \]

Some of these dictionaries take a while to generate on my i5-3320M computer, so I pickled them.  See the end of the notebook for how I generated them (those cells consist of commented-out code).
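
To make the shapes concrete, here is a small sketch with made-up miniature versions of two of these structures. The names `toy_pos_patterns5` and `toy_pattern_syl5` and their contents are invented for illustration; the real pickled data is much larger. It also shows why keeping duplicates in the list works as frequency weighting.

```python
import random
from collections import Counter, defaultdict

# Hypothetical miniature versions of the pickled structures.
toy_pos_patterns5 = ["JJ NN NN ", "JJ NN NN ", "JJ NN NN ", "DT NN VBZ "]
toy_pattern_syl5 = defaultdict(list)
toy_pattern_syl5["JJ NN NN "] += [[2, 2, 1], [1, 2, 2], [2, 1, 2]]
toy_pattern_syl5["DT NN VBZ "] += [[1, 2, 2]]

rng = random.Random(1)

# A pattern listed three times is drawn about three times as often.
draws = Counter(rng.choice(toy_pos_patterns5) for _ in range(10_000))

# Choosing a line skeleton: first the pattern, then one of its syllable lists.
pattern = rng.choice(toy_pos_patterns5)
syllables = rng.choice(toy_pattern_syl5[pattern])
print(draws.most_common(), pattern, syllables)
```

Every syllable list stored under a five-syllable pattern sums to five, so any skeleton drawn this way fits the line it is meant for.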



In [4]:
def read_pickle(filename):
    # load one pickled object from disk; the file is closed automatically
    with open(filename, 'rb') as file:
        return pickle.load(file)

pos_patterns = read_pickle("pos_patterns.p")
pos_dict = read_pickle("pos_dict.p")
pos_patterns5 = read_pickle("pos_patterns5.p")
pos_patterns7 = read_pickle("pos_patterns7.p")
pattern_syl5 = read_pickle("pattern_syl5.p")
pattern_syl7 = read_pickle("pattern_syl7.p")



### Perform Markov Chain Analysis on Sample Poems

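Record, for every word in the corpus, the list of words that follow it.  Duplicates are kept, so a random choice from a word's list favors its most likely followers.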

In [5]:
def make_markov_dict(text, filename):
    
    #text is a long string, filename is a file to save rejects to
    f = open(filename, 'a')
    # Tokenize the text by word, though including punctuation
    words = text.split(' ')
    
    # Initialize a default dictionary to hold all of the words and next words
    markov_dict = defaultdict(list)
    
    # Create a zipped list of all of the word pairs and put them in word: list of next words format
    for current_word, next_word in zip(words[0:-1], words[1:]):
        try:
            # get_syllable_count raises KeyError when a word has no pronunciation entry
            if get_syllable_count(clean_word(current_word)):
                if get_syllable_count(clean_word(next_word)):
                    markov_dict[clean_word(current_word)].append(clean_word(next_word) )    
        except KeyError:
            # saves words without syllable count to a file.  I can manually add these later
            print(clean_word(current_word), file=f)
    # Convert the default dict back into a dictionary
    markov_dict = dict(markov_dict)
    f.close()
    return markov_dict

mc=make_markov_dict(text, "words_without_syllable_count.txt")


In [13]:
len(mc)

11389
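
A stripped-down version of the same idea, run on one short made-up sentence, shows what the dictionary looks like. This sketch leaves out the punctuation cleaning and the syllable-count filter that `make_markov_dict` applies, and `build_followers` is a hypothetical name.

```python
from collections import defaultdict

def build_followers(text):
    """Map each word to the list of words observed immediately after it."""
    words = text.lower().split()
    followers = defaultdict(list)
    # Pair every word with its successor; the last word has no successor.
    for current_word, next_word in zip(words, words[1:]):
        followers[current_word].append(next_word)
    return dict(followers)

chain = build_followers("pink spam pink meat pink spam in a can")
print(chain["pink"])  # ['spam', 'meat', 'spam']
```

Because "spam" appears twice in the list for "pink", `random.choice(chain["pink"])` returns it twice as often as "meat". The last word of the text, "can", never becomes a key, which is one way a lookup on the previous word can fail.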

### Define functions to count syllables for each word

Testing with 3 words

In [15]:
def get_syllable_count(word):
    # print(pronunciation_dict[word])
    if word == 'spamku':
        return 2
    else:
        syl=[ x for x in pronunciation_dict[word][0] if x[-1].isdigit()]
        return len(syl)

print(get_syllable_count('pregnant'))
print(get_syllable_count('sausage'))
print(get_syllable_count('ribs'))


2
2
1
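
The counting rule can be checked without downloading the dictionary. In CMUdict-style pronunciations each vowel phoneme carries a stress digit (0, 1, or 2), so counting phonemes that end in a digit gives the syllable count. The pronunciation below is written by hand for illustration rather than read from `pronunciation_dict`, and `count_syllables` is a hypothetical helper.

```python
def count_syllables(phonemes):
    """Count phonemes that end in a stress digit, i.e. the vowel sounds."""
    return sum(1 for phoneme in phonemes if phoneme[-1].isdigit())

# Hand-written pronunciation in the CMUdict style (assumed for illustration,
# not read from the dictionary).
sausage = ["S", "AO1", "S", "IH0", "JH"]
print(count_syllables(sausage))  # 2
```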


### Define function to return part of speech for each word

Testing with 3 words

In [16]:
def get_pos_label(word):
    # pos_tag expects a list of tokens, so word is a one-item list such as ['spam']
    return pos_tag(word)[0][1]

print(get_pos_label(['capicola']))
print(get_pos_label(['pastrami']))
print(get_pos_label(['andouille']))


NN
NN
NN


#### Create functions to choose POS pattern and syllable structure

In [91]:
import random
       
def get_word(pos, count, prevword):
    # Only the first word of a line is drawn by part of speech; later words are
    # drawn from the Markov chain and checked for syllable count only.
    word = None  # defined up front so the error message below can always print it
    for i in range(2000): #tries up to 2000 times to get a good word
        try:
            word = random.choice(mc[clean_word(prevword)]) if prevword else random.choice(pos_dict[pos])
            if get_syllable_count(clean_word(word)) == count:
                return clean_word(word)
        except (KeyError, IndexError): # prevword has no entry in the markov chain, or the candidate list is empty
            print("mc error from prevword: ", prevword)
            print('nate' , word, pos , count)
            return "ham"
    return 'spam' #if I can't get a word in 2000 tries

def make_line(line_pattern, line_syl):
    l=""
    prev_word=False
    #print("make line: ", line_pattern, line_syl)
    #print("length of  line pattern: ", len(line_pattern.split()), "len of line syl: ", len(line_syl))
    for pos, count in zip(line_pattern.split(), line_syl):
        #print("I'm here:", pos, count)
        word = get_word(pos,count, prev_word)
        l += word + ' '
        prev_word = word
    return l

#This function establishes the parts of speech patterns and syllable counts for the haiku
# it returns a list of three [pattern, syllable counts] pairs, for example:
# [['NN DT NN NN NN ', [2, 1, 1, 1]], ['NN NN TO DT NN ', [2, 1, 1, 1, 2]], ['NN DT NN MD ', [2, 1, 1, 1]]]
def get_line_info():
    line_info =[]
    for i in range(3):
        if i != 1:
            tempPat=random.choice(pos_patterns5)
            line_info.append([tempPat, random.choice(pattern_syl5[tempPat])])
        else:
            tempPat=random.choice(pos_patterns7)
            line_info.append([tempPat, random.choice(pattern_syl7[tempPat])])
    return line_info

#This function will print out several haiku
def generate_spamku(num_wanted):
    try:
        for i in range(num_wanted):
            line_info = get_line_info()
            for line in line_info:
                print(make_line(line[0], line[1]))
            print()
    except Exception:
        print("\nError.  Restarting...\n")
        # i haiku were finished before the error, so generate the remaining ones
        generate_spamku(num_wanted-i)
              

generate_spamku(5)   


blue can coat your face 
with it reminds me turkey 
enlightened in the 

wild west virginia 
down at birth control spittle 
your lunch prepare eat 

eat it take it looks 
it delight ah mismanaged 
really got it for 

the litter next to 
the sun has it is always 
gaseous clouds and 

but there they always 
things we part the sign of night 
but man dressed spam 



### Functions used to create pickled dictionaries

In [None]:
# # Make pos patterns from haiku
# pos_patterns =  defaultdict(list)
# pos_patterns_markov_data = []
# haiku_pat = []
# for haiku in haiku_list:
#     for line in haiku:
#         l=""
#         line = line.strip()
#         for word in line.split():
#             word = clean_word(word)
#             try:
#                 l += pos_tag([word])[0][1]+" "
#             except:
#                 #I discovered that word is "" sometimes, which raises an error
#                 #print("error when building l", word)
#                 pass
#         haiku_pat.append(l)
#         if l in pos_patterns:
#             pos_patterns[l] += 1
#         else:
#             pos_patterns[l] = 1
#     pos_patterns_markov_data.append(l)

In [14]:
# # Make pos pattern - syllable information dict
# pos_patterns5 =  []
# pos_patterns7 =  []
# pattern_syl5 = defaultdict(list)
# pattern_syl7 = defaultdict(list)
# #pos_patterns_syl_markov_data_syl = []
# for haiku in haiku_list:
#     for line in haiku:
        
#         l=""
#         syl_pattern = []
#         line = line.strip()
#         for word in line.split():
#             word = clean_word(word)
#             try:
#                 l += pos_tag([word])[0][1]+" "
#                 syl_pattern.append(get_syllable_count(word))
#             except:
#                 #I discovered that word is "" sometimes, which raises an error
#                 #print("error when building l", word)
#                 pass
#         if sum(syl_pattern) == 5:
#             pattern_syl5[l].append(syl_pattern)
#             pos_patterns5.append(l)
#         if sum(syl_pattern) == 7:
#             pattern_syl7[l].append(syl_pattern)
#             pos_patterns7.append(l)


In [31]:
# # Make pos_dict from spamku

# import string
# import sys
# c=0
# pos_dict=defaultdict(list)
# for word in text.split(' '):
#     try:
#         c+=1
#         word = clean_word(word) 
#         if c % 5000 == 0:
#             print(c, word)
#         tok = pos_tag([word])
#         # This line is here to throw an error if we don't have a syllable count for this word
#         syl = get_syllable_count(word)
#         pos_dict[tok[0][1]].append(tok[0][0])
#     except:
#         #print("SYL ERROR", tok, word)
#         continue
# print('***************  Done  ***************')

5000 ive
10000 to
15000 bill
20000 western
25000 below
30000 deadly
35000 ate
40000 godzilla
45000 was
50000 their
55000 the
60000 rises
65000 love
70000 were
75000 stock
80000 stronger
85000 pan
90000 a
95000 an
100000 i
105000 ammo
110000 blend
***************  Done  ***************


In [32]:
# write each structure to its own pickle file, closing the file after each dump
for name, data in [("pos_dict.p", pos_dict), ("pos_patterns.p", pos_patterns),
                   ("pos_patterns5.p", pos_patterns5), ("pos_patterns7.p", pos_patterns7),
                   ("pattern_syl5.p", pattern_syl5), ("pattern_syl7.p", pattern_syl7)]:
    with open(name, "wb") as file:
        pickle.dump(data, file)


In [12]:
##code to remove duplicates from the rejected words list
# f=open("rejected_words.py", 'r')
# rejects=[]
# for word in f:
#     word = clean_word(word)
#     rejects.append(word)
# rejects=set(rejects)
# f.close()
# f=open("clean_rejects.txt", 'a')
# for word in rejects:
#     f.write(word.strip()+':  ,\n')
# f.close()


In [None]:
## legacy code

# pattern_prob = []
# for k,v in pos_patterns.items():
#     for i in range(v):
#         try:
#             if v != "":
#                 pattern_prob.append(k)
#         except:
#             print("error: ", k, v)
# num_diff_patterns = len(pattern_prob)

# syllable_lengths = {
#     'VB': 1,
#     'DT': 1,
#     'RB': 2,
#     'NN': 5,
#     'VBG': 3,
#     'IN': 1,
#     'NNS': 4,
#     'PRP$': 1,
#     'CC': 1,
#     'JJ': 2,
#     'WP': 1,
#     'VBZ': 1,
#     'PRP': 1, 
#     'VBN': 2,
#     'JJS': 2,
#     'MD': 1, 
#     'TO': 1,
#     'VBD': 2,
#     'VBP': 2,
#     'WRB': 1,
#     'CD': 1, #cardinal digit, could be more
#     'RBR': 2,
#     'JJR': 2,
#     'WDT': 1,
#     'WP$': 1
# }


# # Maybe do a probability analysis on these?
# # CC coordinating conjunction
# # CD cardinal digit
# # DT determiner
# # EX existential there (like: “there is” … think of it like “there exists”)
# # FW foreign word
# # IN preposition/subordinating conjunction
# # JJ adjective ‘big’
# # JJR adjective, comparative ‘bigger’
# # JJS adjective, superlative ‘biggest’
# # LS list marker 1)
# # MD modal could, will
# # NN noun, singular ‘desk’
# # NNS noun plural ‘desks’
# # NNP proper noun, singular ‘Harrison’
# # NNPS proper noun, plural ‘Americans’
# # PDT predeterminer ‘all the kids’
# # POS possessive ending parent’s
# # PRP personal pronoun I, he, she
# # PRP$ possessive pronoun my, his, hers
# # RB adverb very, silently,
# # RBR adverb, comparative better
# # RBS adverb, superlative best
# # RP particle give up
# # TO, to go ‘to’ the store.
# # UH interjection, errrrrrrrm
# # VB verb, base form take
# # VBD verb, past tense took
# # VBG verb, gerund/present participle taking
# # VBN verb, past participle taken
# # VBP verb, sing. present, non-3d take
# # VBZ verb, 3rd person sing. present takes
# # WDT wh-determiner which
# # WP wh-pronoun who, what
# # WP$ possessive wh-pronoun whose
# # WRB wh-adverb where, when

# def choose_target_syllable_count(pattern, number_of_syllables_wanted):
#     syllable_counts = []
#     possible = False
#     pattern_len = len(pattern.split())
#     max_per_word = number_of_syllables_wanted - pattern_len
#     for i in range(len(pattern.split())):
#         while possible == False:
#             x = randint(1, syllable_lengths[pattern.split()[i]])
#             print(x)
#             if x <= max_per_word:
#                 possible = True
#         syllable_counts.append(x)
#         number_of_syllables_wanted -= x
#         pattern_len -= 1
#         max_per_word = number_of_syllables_wanted - pattern_len
#         print('***********************', syllable_counts)
#     return syllable_counts

# #pass in poem, pattern, 
                    
# def build_line(line, pattern, max_length):
#     if len(pattern) == 0:
#         return line
#     else:
#         pos = pattern[-1]
#     try:
#         new_max_length = max_length
#         #print(max_length)
#         word = random.choice(pos_dict[pos])
#         new_max_length = max_length - get_syllable_count(word)
#         if new_max_length >= 0:
#             line = word + ' ' + line
#         else:
#             new_max_length = max_length
#     except:
#         print("funny word: ", word)        
#     #if max_length >= 0:
#     pattern=pattern[:-1]
#     return build_line(line, pattern, new_max_length)

## word1.capitalize()        
        
# line1text=""
# word=""
# c=0
# for pos in line1:
#     word=""
#     possible = False
#     try:
#         while possible == False:
#             word = random.choice(pos_dict[pos])
#             if get_syllable_count(word) == l1[c]:
#                 line1text += word + " "
#                 c += 1
#                 possible = True
#     except:
#         print("Unexpected error when adding to dict:", sys.exc_info()[0])
#         print("pos:", pos, l1[c], word)
    