# Purpose
For a given English word, this nugget explores how to count its syllables. This matters because haiku rely on syllable counts to structure their lines. <br> <br>

Sections:
* 0.0 Imports
* 1.0 Utilize NLTK

# 0.0 Imports 


In [1]:
#nltk chosen over spacy due to problems installing spacy
import nltk

# stopwords are common words that carry little meaning on their own
# cmudict - Carnegie Mellon Pronouncing Dictionary - will answer the majority of syllable questions
from nltk.corpus import stopwords, cmudict

# 1.0 Utilize NLTK 
We need a basic understanding of nltk in order to properly leverage it.

### 1.1 Find Parts of Speech 

Example text

In [2]:
tweet = "The quick brown fox jumps over the lazy dog"

Interpreting the parts of speech

* CC coordinating conjunction
* CD cardinal digit
* DT determiner
* EX existential there (like: "there is" ... think of it like "there exists")
* FW foreign word
* IN preposition/subordinating conjunction
* JJ adjective 'big'
* JJR adjective, comparative 'bigger'
* JJS adjective, superlative 'biggest'
* LS list marker 1)
* MD modal could, will
* NN noun, singular 'desk'
* NNS noun plural 'desks'
* NNP proper noun, singular 'Harrison'
* NNPS proper noun, plural 'Americans'
* PDT predeterminer 'all the kids'
* POS possessive ending parent's
* PRP personal pronoun I, he, she
* PRP$ possessive pronoun my, his, hers
* RB adverb very, silently,
* RBR adverb, comparative better
* RBS adverb, superlative best
* RP particle give up
* TO to go 'to' the store.
* UH interjection errrrrrrrm
* VB verb, base form take
* VBD verb, past tense took
* VBG verb, gerund/present participle taking
* VBN verb, past participle taken
* VBP verb, sing. present, non-3rd person take
* VBZ verb, 3rd person sing. present takes
* WDT wh-determiner which
* WP wh-pronoun who, what
* WP\$ possessive wh-pronoun whose
* WRB wh-adverb where, when

Get Part of Speech

In [3]:
tokenize = nltk.word_tokenize(tweet.lower())
nltk.pos_tag(tokenize)

[('the', 'DT'),
 ('quick', 'JJ'),
 ('brown', 'NN'),
 ('fox', 'NN'),
 ('jumps', 'VBZ'),
 ('over', 'IN'),
 ('the', 'DT'),
 ('lazy', 'JJ'),
 ('dog', 'NN')]
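
As a small illustration of how the tags can be used, the sketch below filters the tagged tokens by tag prefix (for example, every tag starting with `NN` is some kind of noun). The tagged list is copied from the output above so the example runs without nltk; the helper name `words_with_tag_prefix` is hypothetical and not part of nltk.

```python
# Tagged tokens copied from the pos_tag output above (hardcoded so that
# this sketch runs without nltk or its data files).
tagged = [('the', 'DT'), ('quick', 'JJ'), ('brown', 'NN'), ('fox', 'NN'),
          ('jumps', 'VBZ'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'),
          ('dog', 'NN')]


def words_with_tag_prefix(tagged_tokens, prefix):
    """Hypothetical helper: keep words whose tag starts with `prefix`."""
    return [word for word, tag in tagged_tokens if tag.startswith(prefix)]


nouns = words_with_tag_prefix(tagged, 'NN')       # NN, NNS, NNP, NNPS
adjectives = words_with_tag_prefix(tagged, 'JJ')  # JJ, JJR, JJS
verbs = words_with_tag_prefix(tagged, 'VB')       # VB, VBD, VBG, VBN, VBP, VBZ
print(nouns, adjectives, verbs)
```

Note that 'brown' lands in the noun list because the tagger labeled it `NN` in the output above; the filter only reflects whatever the tagger produced.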

### 1.2 Indicate Stop Words 

Flag each token as a stop word (True) or not (False)

In [4]:
[(x, x in stopwords.words('english')) for x in tokenize]

[('the', True),
 ('quick', False),
 ('brown', False),
 ('fox', False),
 ('jumps', False),
 ('over', True),
 ('the', True),
 ('lazy', False),
 ('dog', False)]
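
`stopwords.words('english')` returns a list, so calling it inside the comprehension rebuilds and scans that list for every token. A minimal sketch of one alternative is to build a set once and reuse it. The short `stand_in_stop_words` list below is a hypothetical placeholder so the example runs without the nltk corpus; in the notebook it would presumably be replaced by `stopwords.words('english')`.

```python
# Hypothetical stand-in for stopwords.words('english'); the real list is
# much longer. Used here only so the sketch runs without nltk data.
stand_in_stop_words = ['the', 'over', 'a', 'an', 'and']

# Build the set once: membership tests on a set are O(1) on average.
stop_set = set(stand_in_stop_words)

tokens = "the quick brown fox jumps over the lazy dog".split()

# Same (token, is_stop_word) pairs as the cell above.
flags = [(tok, tok in stop_set) for tok in tokens]

# Tokens that remain after dropping stop words.
content_words = [tok for tok in tokens if tok not in stop_set]
print(flags)
print(content_words)
```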

### 1.3 Indicate Syllables

For common words, find the number of syllables (based on pronunciation)

In [5]:
d = cmudict.dict()

Each line of the CMU dictionary file consists of an uppercase word, a counter (for alternative pronunciations), and a transcription. Vowels are marked for stress (1 = primary, 2 = secondary, 0 = no stress). <br>
Lookup keys must be lowercase. <br>
In cases of multiple pronunciations, take the first list (most common pronunciation)

In [6]:
#simple word
d['i']

[['AY1']]

In [7]:
#single pronunciation
d['multiple']

[['M', 'AH1', 'L', 'T', 'AH0', 'P', 'AH0', 'L']]

In [8]:
#multiple pronunciations
d['washington']

[['W', 'AA1', 'SH', 'IH0', 'NG', 'T', 'AH0', 'N'],
 ['W', 'AO1', 'SH', 'IH0', 'NG', 'T', 'AH0', 'N']]
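
To make the stress markers concrete, the sketch below counts the phonemes that end in a stress digit for each pronunciation. The two lists are copied from the output above so the example runs without nltk; `count_stressed_vowels` is a hypothetical helper name.

```python
# Both pronunciations of 'washington', copied from the output above.
washington = [
    ['W', 'AA1', 'SH', 'IH0', 'NG', 'T', 'AH0', 'N'],
    ['W', 'AO1', 'SH', 'IH0', 'NG', 'T', 'AH0', 'N'],
]


def count_stressed_vowels(phonemes):
    """Count phonemes ending in a stress digit (0, 1 or 2).

    Only vowel phonemes carry a stress digit, so this count is the
    number of syllables in the pronunciation.
    """
    return sum(1 for p in phonemes if p[-1].isdigit())


counts = [count_stressed_vowels(p) for p in washington]
first_count = counts[0]  # the notebook's convention: use the first entry
print(counts, first_count)
```

Here both variants differ only in the first vowel, so they give the same count; that will not hold for every word, which is why a fixed rule such as "take the first" is needed.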

In [9]:
#test from twitter (will need to use the camel case in hashtags to split them into separate words)
#twitter = #KaiHotDebutDay
sampleHashTag = d['kai'][0] + d['hot'][0] + d['debut'][0] + d['day'][0]
sampleHashTag

['K', 'AY1', 'HH', 'AA1', 'T', 'D', 'EY0', 'B', 'Y', 'UW1', 'D', 'EY1']
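
The camel-case split mentioned in the comment above could be sketched with a regular expression. This is an illustration under assumptions, not the notebook's implementation: `split_hashtag` is a hypothetical helper, and the small `pronunciations` dict is hardcoded from the phonemes in the output above so the example runs without nltk.

```python
import re

# First pronunciation of each word, split out of the combined output above.
pronunciations = {
    'kai': ['K', 'AY1'],
    'hot': ['HH', 'AA1', 'T'],
    'debut': ['D', 'EY0', 'B', 'Y', 'UW1'],
    'day': ['D', 'EY1'],
}


def split_hashtag(tag):
    """Hypothetical helper: split a camel-case hashtag into words.

    Matches a capital followed by lowercase letters, a run of lowercase
    letters, or a run of digits. All-caps acronyms would be split letter
    by letter, which this sketch does not try to handle.
    """
    return re.findall(r'[A-Z][a-z]*|[a-z]+|\d+', tag.lstrip('#'))


words = split_hashtag('#KaiHotDebutDay')

# Count phonemes that end in a stress digit across all words.
syllables = sum(
    1
    for w in words
    for phoneme in pronunciations[w.lower()]
    if phoneme[-1].isdigit()
)
print(words, syllables)
```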

Every vowel phoneme carries a stress digit (0, 1, or 2), so the number of digits in a pronunciation is the number of syllables in the word

In [10]:
def SyllableCount(word, cmuDict=d):
    '''For a given word, return the number of syllables (999 if the word is unknown)'''

    #clean word (cmudict keys are lowercase)
    word = word.lower()

    #if word is in the dict, then proceed
    if word in cmuDict:

        #join the phonemes of the first pronunciation into one string
        dList = "".join(cmuDict[word][0])

        #each vowel phoneme ends in a stress digit, so the digit count is the syllable count
        dCount = dList.count("0") + dList.count("1") + dList.count("2")

        #return
        return dCount

    #indicate unknown word
    else:
        #return 999 to indicate an error (a high number makes accidental use easy to spot)
        return 999

In [11]:
SyllableCount('Washington')

3
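
Since haiku are structured by line, the per-word count would eventually be summed over a line. The sketch below shows one possible way to do that while collecting unknown words separately, rather than letting a sentinel such as 999 flow into the total. It is an assumption about how the function might be used, not part of the notebook; `line_syllables` and the small `stand_in_dict` (entries copied from the outputs above) are hypothetical.

```python
# Stand-in for cmudict.dict(), limited to entries shown earlier in this
# notebook so the sketch runs without nltk.
stand_in_dict = {
    'i': [['AY1']],
    'multiple': [['M', 'AH1', 'L', 'T', 'AH0', 'P', 'AH0', 'L']],
    'washington': [['W', 'AA1', 'SH', 'IH0', 'NG', 'T', 'AH0', 'N'],
                   ['W', 'AO1', 'SH', 'IH0', 'NG', 'T', 'AH0', 'N']],
}


def line_syllables(line, pron_dict):
    """Hypothetical helper: return (syllable total, unknown words) for a line."""
    total = 0
    unknown = []
    for word in line.lower().split():
        prons = pron_dict.get(word)
        if prons is None:
            # Track the word instead of adding a sentinel to the total.
            unknown.append(word)
        else:
            # First pronunciation; count phonemes ending in a stress digit.
            total += sum(1 for p in prons[0] if p[-1].isdigit())
    return total, unknown


total, unknown = line_syllables("I multiple Washington zzyzx", stand_in_dict)
print(total, unknown)
```

A caller can then treat a line as countable only when the unknown list is empty.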

# END 