# Code Challenge

## CMU Pronouncing Dictionary

English spelling is an unreliable guide to pronunciation: there is no general way to know how a word should be pronounced given only its spelling. However, in the course of creative language generation, we often need to know how a particular word would be pronounced if read aloud. We might use this information for a number of creative and poetic purposes, such as automated rhyming and assonance, or generating text that conforms to a particular meter.

Fortunately, the powers that be (i.e., DARPA and "member companies of the Carnegie Mellon Sphinx Speech Consortium") have gifted us with the CMU Pronouncing Dictionary. The CMU Pronouncing Dictionary is a plain-text, computer-readable database that maps English words to their pronunciations. It's an incredible boon to poets and researchers alike.

Visit the home page of the dictionary (http://www.speech.cs.cmu.edu/cgi-bin/cmudict), or download the dictionary itself (http://svn.code.sf.net/p/cmusphinx/code/trunk/cmudict/cmudict-0.7b).

## Importing Dataset

### File format

But before we start working with libraries, I want to explain a little bit about how the dictionary itself is structured, so you can parse it on your own if you need to!

The dictionary is a plain-text file. Each line of the file has a word and its pronunciation, separated by two spaces. Here's a sample line:

CARNEGIE  K AA1 R N EH0 G IY0

This is the entry for the word CARNEGIE, which has a pronunciation of K AA1 R N EH0 G IY0 (for more on what the characters in the pronunciation mean, see below).

Occasionally, one word will have several pronunciations associated with it. In those cases, the dictionary has an entry for each possible pronunciation, with a parenthesized number that increments for each subsequent entry:

ADULT  AH0 D AH1 L T
ADULT(1)  AE1 D AH0 L T


Additionally, there are some lines (at the beginning of the file) that begin with a semicolon (;). These are comments and should be ignored.
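
The format described above can be parsed with a few lines of plain Python. The following is a minimal sketch, not the dictionary's official tooling: the function name `parse_cmudict_lines` and the `VARIANT_SUFFIX` pattern are names made up for this example. It follows the rule stated above and treats any line starting with a semicolon as a comment; if the file also contains real entries that start with a semicolon, that check would need to be narrower.

```python
import re
from collections import defaultdict

# Hypothetical helper pattern: matches a variant marker such as "(1)"
# at the end of a headword.
VARIANT_SUFFIX = re.compile(r"\(\d+\)$")


def parse_cmudict_lines(lines):
    """Map each headword to a list of pronunciations (each a list of phonemes)."""
    pronunciations = defaultdict(list)
    for line in lines:
        line = line.strip()
        # Skip blank lines and comment lines (assumed to start with a semicolon).
        if not line or line.startswith(";"):
            continue
        # The first whitespace-separated token is the word; the rest are phonemes.
        word, *phonemes = line.split()
        # Group ADULT and ADULT(1) under the same headword.
        base_word = VARIANT_SUFFIX.sub("", word)
        pronunciations[base_word].append(phonemes)
    return dict(pronunciations)


sample = [
    ";;; comment line",
    "CARNEGIE  K AA1 R N EH0 G IY0",
    "ADULT  AH0 D AH1 L T",
    "ADULT(1)  AE1 D AH0 L T",
]
entries = parse_cmudict_lines(sample)
print(entries["ADULT"])
# [['AH0', 'D', 'AH1', 'L', 'T'], ['AE1', 'D', 'AH0', 'L', 'T']]
```

Grouping the variants under one headword is a design choice for this sketch; keeping `ADULT(1)` as its own key would work just as well if the variant number matters.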



In [167]:
import pandas as pd

In [168]:
url = 'http://svn.code.sf.net/p/cmusphinx/code/trunk/cmudict/cmudict-0.7b'
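# Alternative: read a local copy instead of downloading.
# words = pd.read_fwf('cmudict-0.7b.txt', header=None)
# Skip the first 56 lines and read each remaining line (up to 100 characters) as a single column.
words = pd.read_fwf(url, header=None, skiprows=56,
                    encoding='latin-1', colspecs=[(0, 100)])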


In [169]:
words.sample(10)

| | 0 |
|---|---|
| 108017 | SEPARATE(1) S EH1 P ER0 IH0 T |
| 89318 | PATINKIN P AH0 T IH1 NG K IH0 N |
| 111231 | SLOANE S L OW1 N |
| 84720 | NUCKLES N AH1 K AH0 L Z |
| 126738 | VIABILITY V AY0 AH0 B IH1 L AH0 T IY0 |
| 5052 | APPROXIMATION AH0 P R AA2 K S AH0 M EY1 SH AH0 N |
| 72815 | MAGRATH M AH0 G R AE1 TH |
| 39036 | EUGENIA Y UW2 JH IY1 N IY0 AH0 |
| 80177 | MOODY M UW1 D IY0 |
| 32490 | DISFAVOR D IH0 S F EY1 V ER0 |


## Adjusting the imported data

In [170]:
words_only = list(words[0])

column_1 = []  # headwords
column_2 = []  # pronunciations

for string in words_only:
    parts = string.split()
    column_1.append(parts[0])
    column_2.append(" ".join(parts[1:]))

In [171]:
words = pd.DataFrame(column_1, columns=['word'])
words['config'] = column_2

In [172]:
words.sample(5)

| | word | config |
|---|---|---|
| 56097 | HORNBOSTEL | HH AO1 R N B AH0 S T AH0 L |
| 48041 | GLIMMERS | G L IH1 M ER0 Z |
| 104309 | SAITTA | S EY1 T AH0 |
| 115687 | STRICKEN | S T R IH1 K AH0 N |
| 31527 | DICTATION | D IH0 K T EY1 SH AH0 N |
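
The word/pronunciation split done with the loop above can also be expressed with pandas string methods. This is a sketch of one possible alternative, not a replacement for the original cell: the frame `raw` is a hypothetical stand-in for the single-column result of `read_fwf`, filled with three rows from the sample output. Unlike the loop, it does not normalize whitespace inside the pronunciation, and a line without a pronunciation would leave a missing value in `config`.

```python
import pandas as pd

# Hypothetical stand-in for the single-column frame returned by read_fwf.
raw = pd.DataFrame({0: [
    "HORNBOSTEL HH AO1 R N B AH0 S T AH0 L",
    "GLIMMERS G L IH1 M ER0 Z",
    "SAITTA S EY1 T AH0",
]})

# Split each line once, at the first run of whitespace:
# the word goes to the first column, the phonemes to the second.
split = raw[0].str.split(n=1, expand=True)
split.columns = ["word", "config"]
print(split)
```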


## Counting syllables

To count the number of syllables in a word, you need only count its vowels. Because every vowel phoneme in the dictionary ends in a stress digit (0, 1, or 2), you can simply count how many of those digits occur in the pronunciation.

In [173]:
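def syllable_count(pronunciation):
    # One stress digit (0, 1, or 2) per vowel phoneme, so counting digits counts syllables.
    return sum(pronunciation.count(i) for i in '012')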

In [174]:
words['syllables'] = words['config'].apply(syllable_count)

In [175]:
words.sample(10)

| | word | config | syllables |
|---|---|---|---|
| 23618 | COMPRISES(1) | K AH0 M P R AY1 Z IH0 Z | 3 |
| 97558 | RATHERT | R AE1 TH ER0 T | 2 |
| 41062 | FEATHERLY | F EH1 DH ER0 L IY0 | 3 |
| 110969 | SLABS | S L AE1 B Z | 1 |
| 126407 | VENSEL | V EH1 N S AH0 L | 2 |
| 85135 | OARLOCK | AO1 R L AA2 K | 2 |
| 7483 | BACKES | B AE1 K S | 1 |
| 32028 | DIOCESE(1) | D AY1 AH0 S AH0 S | 3 |
| 82676 | NEIS | N IY1 Z | 1 |
| 66476 | KUMBAYA | K UH2 M B AY0 Y AH1 | 3 |
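
As a quick sanity check of the digit-counting idea, the sketch below compares it with a slightly stricter variant that counts phonemes ending in a stress digit. The second function, `syllable_count_by_phoneme`, is a name introduced only for this example; the pronunciations are taken from entries shown earlier in this notebook.

```python
def syllable_count(pronunciation):
    # Digit-counting approach: one stress digit per vowel phoneme.
    return sum(pronunciation.count(digit) for digit in "012")


def syllable_count_by_phoneme(pronunciation):
    # Hypothetical alternative: count phonemes whose last character is a stress digit.
    return sum(1 for phoneme in pronunciation.split() if phoneme[-1] in "012")


for config in ["K AA1 R N EH0 G IY0", "AH0 D AH1 L T", "S L AE1 B Z"]:
    # Both approaches should agree on well-formed pronunciations.
    assert syllable_count(config) == syllable_count_by_phoneme(config)
    print(config, "->", syllable_count(config))
# K AA1 R N EH0 G IY0 -> 3
# AH0 D AH1 L T -> 2
# S L AE1 B Z -> 1
```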


## Count words by syllables

We can use pandas `groupby` to count how many entries have each number of syllables.

In [176]:
words[['syllables','word']].groupby('syllables').count()

| syllables | word |
|---|---|
| 1 | 16935 |
| 2 | 61493 |
| 3 | 37105 |
| 4 | 13474 |
| 5 | 3882 |
| 6 | 815 |
| 7 | 130 |
| 8 | 16 |
| 9 | 3 |
| 10 | 1 |
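
Because alternate pronunciations such as `ADULT(1)` are stored as separate rows, a grouped count tallies each of them. If distinct headwords are wanted instead, one option is to strip the parenthesized suffix and drop duplicates before counting. The sketch below assumes a frame shaped like `words`; the frame `toy` and the column `base_word` are hypothetical names used only for illustration, with rows taken from entries shown earlier.

```python
import pandas as pd

# Hypothetical miniature of the `words` frame (word and syllable count only).
toy = pd.DataFrame({
    "word": ["CARNEGIE", "ADULT", "ADULT(1)", "SLABS"],
    "syllables": [3, 2, 2, 1],
})

# Remove a trailing variant marker such as "(1)" to recover the headword.
toy["base_word"] = toy["word"].str.replace(r"\(\d+\)$", "", regex=True)

# Keep one row per (headword, syllable count) pair, then count per syllable count.
distinct = toy.drop_duplicates(subset=["base_word", "syllables"])
counts = distinct.groupby("syllables")["base_word"].count()
print(counts)
```

A headword whose variants differ in syllable count would still appear once under each count, which seems the more useful behavior here.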
