# Extracting Haiku(ish things) from Wikipedia

This is an experiment in found poetry, done mostly to try out some Python libraries I haven't used before (Allison Parrish's pronouncing and CLiPS's pattern).

First I need some text to work with.

In [1]:
from pattern.web import (Wikipedia, WikipediaArticle)

enwiki = Wikipedia('en')

raw_text = enwiki.search('cat').string

Ok, I have some text. It's got a lot of garbage and punctuation I don't care about, though, so let's get rid of that and turn it into a list of words.

In [2]:
import re

# A word is a run of uppercase or lowercase letters; apostrophes and hyphens split words
words_re = re.compile(r"[A-Za-z]+")

def words(text):
    return [word.lower() for word in words_re.findall(text)]
    
# For example:
print words("Some! garbage[0] text I can't   use.very.well,?hUh")

['some', 'garbage', 'text', 'i', 'can', 't', 'use', 'very', 'well', 'huh']
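
Splitting "can't" into "can" and "t" is a side effect of matching letters only. If that bothers you, here is a minimal sketch of a variant that keeps contractions together. The names `contraction_words_re` and `words_keeping_contractions` are my own and aren't used anywhere else in this notebook; everything below still relies on the letters-only `words()`, so its output would differ with this version.

```python
import re

# Letters, optionally followed by apostrophe-plus-letters groups (can't, it's).
# Hyphens still split words, as before.
contraction_words_re = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)*")

def words_keeping_contractions(text):
    return [word.lower() for word in contraction_words_re.findall(text)]

sample = words_keeping_contractions("Some! garbage[0] text I can't   use.very.well,?hUh")
# sample == ['some', 'garbage', 'text', 'i', "can't", 'use', 'very', 'well', 'huh']
```

Whether this helps depends on the pronunciation lookup having entries for contractions, which I haven't checked here.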


In [3]:
# Now, the article:
article_words = words(raw_text)
# truncated for readability
print article_words[0:5]

[u'this', u'article', u'is', u'about', u'the']


Haiku follow a fairly strict convention of syllable counts; the classic pattern is 5-7-5. So I'll need to be able to figure out how many syllables a word has. For that, I'll use pronouncing.

In [4]:
import pronouncing

def syllables(word):
    try:
        # just uses the first pronunciation listed for the word
        return pronouncing.syllable_count(pronouncing.phones_for_word(word)[0])
    except IndexError:
        # no known pronunciation: return 17 (a whole haiku's worth of syllables)
        # so the word can never fit in a line
        return 17

print syllables("pronouncing")

3
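
For the curious, here is a rough sketch of how a syllable count can be read off a pronunciation string. It assumes CMU Pronouncing Dictionary style phones, where every vowel phone ends in a stress digit (0, 1 or 2), so counting those digits counts syllables. The function name `count_syllables_in_phones` is hypothetical and the phone strings are typed in by hand; this shows the idea and is not the library's actual code.

```python
def count_syllables_in_phones(phones):
    """Count phones that end in a stress digit; each one marks a vowel."""
    return sum(1 for phone in phones.split() if phone[-1].isdigit())

cat = count_syllables_in_phones("K AE1 T")        # one vowel phone: AE1
about = count_syllables_in_phones("AH0 B AW1 T")  # two vowel phones: AH0, AW1
```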


Now comes the tricky part. Given a list of words, how can I determine whether some subset of that list can form a valid haiku?

Well, first I'll need to be able to tell if a set of words could form a line of five or seven syllables.

In [5]:
def get_line(word_list, syllable_goal):
    line = []
    syllable_count = 0
    for word in word_list:
        line.append(word)
        syllable_count += syllables(word)
        if syllable_count == syllable_goal:
            return line
        elif syllable_count > syllable_goal:
            # failed; can't make a line of the right length
            return None
    # ran out of words before reaching the goal
    return None

print get_line(words("I can make five here"), 5)
print get_line(words("...but here it's impossible."), 5)

print get_line(words("here is a seven-count line"), 7)

['i', 'can', 'make', 'five', 'here']
None
['here', 'is', 'a', 'seven', 'count', 'line']
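
To poke at the edge cases without installing pronouncing, here is a minimal sketch of the same prefix check with the syllable counter passed in as a parameter. `TOY_COUNTS`, `toy_syllables` and `get_line_with` are stand-ins I made up for illustration; unknown words get the same out-of-range value of 17 that `syllables()` returns.

```python
# Hypothetical hand-made syllable table, standing in for the pronouncing lookup.
TOY_COUNTS = {"i": 1, "can": 1, "make": 1, "five": 1, "here": 1,
              "impossible": 4, "seven": 2}

def toy_syllables(word):
    # Unknown words get 17, which overshoots any 5- or 7-syllable goal.
    return TOY_COUNTS.get(word, 17)

def get_line_with(word_list, syllable_goal, count_syllables=toy_syllables):
    """Return the prefix of word_list that totals syllable_goal, else None."""
    line = []
    total = 0
    for word in word_list:
        line.append(word)
        total += count_syllables(word)
        if total == syllable_goal:
            return line
        if total > syllable_goal:
            return None  # a word straddles the line break
    return None  # ran out of words before reaching the goal

exact = get_line_with(["i", "can", "make", "five", "here"], 5)
straddle = get_line_with(["i", "can", "make", "impossible"], 5)  # 3 + 4 overshoots
too_short = get_line_with(["i", "can"], 5)                       # only 2 syllables
unknown = get_line_with(["i", "zyzzyva", "can"], 5)              # 1 + 17 overshoots
```

So there are three ways to fail: a word straddles the line break, the words run out early, or a word is missing from the table.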


So, what is a haiku? A five-syllable line, followed by a seven-syllable line, followed by a five-syllable line. We can try to extract a haiku from a list of words by looking for those lines, in order.

In [6]:
def attempt_haiku(word_list):
    haiku = []
    remaining_words = list(word_list) # copy word list to avoid mutating it
    for length in [5, 7, 5]:
        line = get_line(remaining_words, length)
        if line is None:
            return None # failure
        else: # found a line
            haiku.append(line)
            remaining_words = remaining_words[len(line):] # discard the words we're already using
    return haiku

print "\nReal haiku attempt:\n"
print attempt_haiku(words("""
Start spirit; behold
the skull. A living head loved
earth. My bones resign
""")) # haiku from here: https://www.poets.org/poetsorg/poem/lines-skull

article_haiku = attempt_haiku(article_words)
print "\nArticle about cats attempt:\n"
print article_haiku

# Wow, first try! Let's make it prettier.

def print_haiku(haiku):
    print "\n" # let it breathe
    for line in haiku:
        print ' '.join(line)
    print "\n"

print_haiku(article_haiku)


Real haiku attempt:

[['start', 'spirit', 'behold'], ['the', 'skull', 'a', 'living', 'head', 'loved'], ['earth', 'my', 'bones', 'resign']]

Article about cats attempt:

[[u'this', u'article', u'is'], [u'about', u'the', u'cat', u'species', u'that'], [u'is', u'commonly', u'kept']]


this article is
about the cat species that
is commonly kept




I got lucky this time, but this isn't enough for real searching: attempt_haiku only finds a haiku that starts at the very first word it's given. To find haiku deeper in the article, we can try again with the first word dropped, over and over, until we find something or the article ends. In fact, it'd be nice to find _all_ of the haiku that we can in the article (no overlaps, though -- you have to have standards in poetry).
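
Before the real thing, here is a self-contained sketch of that scan on a toy "article". Everything in it is made up for illustration: `ONE_SYLLABLE`, `toy_syllables`, `match_haiku` and `scan_for_haiku` are hypothetical names, and the toy counter just treats a fixed set of words as one syllable and everything else as unknown (17). It walks an index through the list rather than re-slicing it, and reports where each haiku starts.

```python
# Toy syllable rule: these words count as one syllable, anything else as 17.
ONE_SYLLABLE = set(
    "the old cat sleeps here in warm sun all day long and dreams of mice now".split()
)

def toy_syllables(word):
    return 1 if word in ONE_SYLLABLE else 17

def match_haiku(words, start, count_syllables):
    """Try to read 5-7-5 lines beginning exactly at words[start]."""
    lines, pos = [], start
    for goal in (5, 7, 5):
        line, total = [], 0
        while total < goal and pos < len(words):
            total += count_syllables(words[pos])
            line.append(words[pos])
            pos += 1
        if total != goal:
            return None  # overshot the goal, or ran out of words
        lines.append(line)
    return lines

def scan_for_haiku(words, count_syllables=toy_syllables):
    """Yield (start_index, haiku) pairs, never reusing a word in two haiku."""
    start = 0
    while start < len(words):
        haiku = match_haiku(words, start, count_syllables)
        if haiku is None:
            start += 1  # drop the first word and try again
        else:
            yield start, haiku
            start += sum(len(line) for line in haiku)  # skip the words we used

toy_article = ("zzz the old cat sleeps here in the warm sun all day long "
               "and dreams of mice now zzz").split()
found = list(scan_for_haiku(toy_article))
# One haiku, starting at index 1 (right after the first unknown word).
```

Jumping past the words of each haiku found is what rules out overlaps.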

In [7]:
def get_all_haiku(article_words):
    all_haiku = []
    while len(article_words) > 0:
        haiku = attempt_haiku(article_words)
        if haiku:
            print_haiku(haiku)
            # ugh, why can't I just call haiku.flatten(), Python?
            haiku_word_count = len([word for line in haiku for word in line])
            all_haiku.append(haiku)
            article_words = article_words[haiku_word_count:] # drop the words we used
        else:
            article_words = article_words[1:]
    return all_haiku
            
# Uncomment this to get all the haiku! (GitHub won't display it otherwise)
# print get_all_haiku(article_words) 



this article is
about the cat species that
is commonly kept




typically furry
carnivorous mammal they
are often called house




as indoor pets or
simply cats when there is no
need to distinguish




and felines cats are
often valued by humans
for companionship




ability to
hunt vermin there are more than
cat breeds different




associations
proclaim different numbers
according to their




a strong flexible
body quick reflexes sharp
retractable claws




and predatory
ecological niche cats
can hear sounds too faint




in frequency for
human ears such as those made
by mice and other




small animals they
can see in near darkness like
most other mammals




have poorer color
vision and a better sense
of smell than humans




cats despite being
solitary hunters are
a social species




and grunting as well
as cat pheromones and types
of cat specific




have a high breeding
rate under controlled breeding
they can be bred and




shown as registered
pedigree pets a hobby
known as