# CMU Pronouncing Dictionary

The CMU Pronouncing Dictionary (CMU dict) is a pronunciation dictionary for North American English. It contains more than 134,000 words and their pronunciations. CMU dict uses ARPAbet, a phonetic transcription alphabet built from ASCII symbols, which makes it easier to work with in Python than the IPA. In addition, CMU dict is included in NLTK, so we can simply import it into our program.

In this notebook we will learn how to print the pronunciation of an English sentence. Then we will learn how to find and print words that can be pronounced in more than one way.

## Import Libraries and Modules

Let's start by importing all the libraries and modules we will need. We need NLTK to work with the CMU Pronouncing Dictionary. For tokenization we import `word_tokenize` from NLTK. To normalize a sentence and prepare it for searching we use the regular expression module `re` and the `string` module. Lastly, we import `defaultdict` to make working with dictionaries easier.

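**Note** that NLTK has to be installed; `re`, `string` and `collections` are part of Python's standard library and need no installation. To use CMU dict in NLTK you also have to download it once, for example with `nltk.download('cmudict')`. NLTK's tokenizer likewise needs the punkt data, which `nltk.download('punkt')` fetches (recent NLTK releases may ask for `punkt_tab` instead).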

In [13]:
import nltk, re, string
from nltk.tokenize import word_tokenize
from collections import defaultdict

## CMU Dict Structure

Let's start by saving the CMU dict entries into a variable so we can easily access them.

In [2]:
entries = nltk.corpus.cmudict.entries()

Before we start programming, we should know what an entry in the dictionary looks like. Let's print the entry for the word _dog_.

In [70]:
print(entries[33330])

('dog', ['D', 'AO1', 'G'])


As we can see, each entry is saved as a tuple. The first item is a lower-case English word and the second item is a list of phones ("the pronunciation"). Every entry returned by `entries()` has this same two-item shape, so we can rely on `entry[0]` being the word and `entry[1]` being its phones.
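
Because an entry is a plain two-item tuple, it can be unpacked directly, and looking it up by word is more robust than relying on a hard-coded position such as 33330, which may differ between dictionary versions. A minimal sketch, assuming a small hand-made `sample_entries` list stands in for the real `entries` (the list and the helper name `first_entry` are illustrative, not part of NLTK):

```python
# Hypothetical stand-in for nltk.corpus.cmudict.entries(); same (word, phones) shape.
sample_entries = [
    ('dog', ['D', 'AO1', 'G']),
    ('will', ['W', 'IH1', 'L']),
    ('will', ['W', 'AH0', 'L']),
]

def first_entry(entries, word):
    """Return the first (word, phones) tuple for `word`, or None if it is missing."""
    return next((entry for entry in entries if entry[0] == word), None)

word, phones = first_entry(sample_entries, 'dog')  # unpack the tuple
print(word, phones)  # dog ['D', 'AO1', 'G']
```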

## Pronunciation of an English Sentence

Now we can start programming. We will write a simple program that takes an English sentence and prints the pronunciation of each word in it. First we save our sentence in a variable.

In [109]:
sent = 'I will meet you tomorrow.'

Then we will create a function to normalize the sentence. The CMU dict stores individual lower-case words, so the sentence has to be brought into the same form before we can search for its words.

First we get rid of upper-case letters. Then we remove punctuation with `re` and `string`. Finally we tokenize the text to get separate words. The easiest way to do this is to use NLTK's tokenizer.

The result is a list of normalized words, which the function returns.

In [110]:
def normalize(sentence):
    """ Remove upper-case letters and punctuation, then tokenize the text """
    
    sentence = sentence.lower() #lower-case everything
    regex = re.compile('[%s]' % re.escape(string.punctuation)) #regex matching any punctuation mark
    sentence = regex.sub('', sentence) #remove punctuation from the string
    sentence = word_tokenize(sentence) #tokenize the sentence into a list of words
    
    return sentence

In [111]:
sent = normalize(sent) #save the tokenized sentence into a variable
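
To see what these steps produce, they can be sketched with the standard library alone. This is a variant under stated assumptions rather than the notebook's function: `normalize_simple` is a hypothetical name, and it splits on whitespace instead of calling `word_tokenize`, which is only expected to match for simple sentences like the one above.

```python
import re
import string

# Compile once: a character class matching any ASCII punctuation mark.
PUNCT_RE = re.compile('[%s]' % re.escape(string.punctuation))

def normalize_simple(sentence):
    """Lower-case, strip punctuation, and split on whitespace (no NLTK needed)."""
    return PUNCT_RE.sub('', sentence.lower()).split()

print(normalize_simple('I will meet you tomorrow.'))
# ['i', 'will', 'meet', 'you', 'tomorrow']
print(normalize_simple("Don't go."))
# ['dont', 'go']  (the apostrophe is removed too)
```

Note that stripping every punctuation mark also removes apostrophes, so a contraction such as "don't" becomes "dont" and would no longer match a dictionary entry spelled with an apostrophe.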

Now we can find each word's pronunciation in the CMU dict. We will create another function called `pronunciation`.

We first prepare an empty list in which to collect the pronunciations. Then we compare the words of our sentence with the entries of the CMU dict, and whenever a word matches an entry we save that entry's pronunciation in the list.

When appending, we use `join` to turn the pronunciation into a single string. As we remember, the pronunciation of the word _dog_ consists of three phones stored separately in a list. Joining them into one string means we do not have to deal with indexing nested lists, and printing the final result becomes easier.

In ARPAbet, phones are written with a space between them. We keep that space in our code, too.

In [100]:
def pronunciation(sent):
    """ For each word in a sentence find its pronunciation in the CMU dict """
    sent_pron = [] #an empty list for the pronunciations
    
    for tok in sent: #go through the sentence first so the result keeps its word order
        for entry in entries:
            if entry[0] == tok:
                sent_pron.append(' '.join(entry[1])) #save the pronunciation as a string
    
    return sent_pron

In [113]:
pron_result = pronunciation(sent) #save the pronunciation into a variable

In [114]:
' | '.join(pron_result) #show the pronunciation of the sentence, words separated by |

'AY1 | W IH1 L | W AH0 L | M IY1 T | Y UW1 | T AH0 M AA1 R OW2 | T UW0 M AA1 R OW2'

We have now successfully printed the pronunciation of the words in our sentence. Each phone is separated by a space and each word by a vertical bar. The digit attached to each vowel marks its stress. If you are interested in the ARPAbet transcription and the phones it uses, refer to [ARPAbet](https://en.wikipedia.org/wiki/ARPABET) and the [CMU dict webpage](http://www.speech.cs.cmu.edu/cgi-bin/cmudict?in=lead&stress=-s).
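
The stress digits follow a fixed convention: 0 means no stress, 1 primary stress and 2 secondary stress. A minimal sketch of reading these markers from one of the strings above (`stress_pattern` is a hypothetical helper, not an NLTK function):

```python
def stress_pattern(pron):
    """Collect the stress digits of a space-separated ARPAbet string, in order."""
    return [int(phone[-1]) for phone in pron.split() if phone[-1].isdigit()]

print(stress_pattern('T AH0 M AA1 R OW2'))  # [0, 1, 2]
print(stress_pattern('M IY1 T'))            # [1]
```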

As we can see, we got more pronunciations than there are words in the original sentence. This is because a word can be pronounced in more than one way: here both _will_ and _tomorrow_ have two entries. We can now look for words with more than one pronunciation and print only those.
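
Scanning the whole dictionary once per token works, but it repeats the same pass many times. One possible alternative is to build a word-to-pronunciations index once and then look each token up directly, which also keeps the sentence order. The sketch below assumes a small hand-made `sample_entries` list in place of the real `entries`; `build_index` and `pronounce` are illustrative names.

```python
from collections import defaultdict

# Hypothetical stand-in for nltk.corpus.cmudict.entries().
sample_entries = [
    ('i', ['AY1']),
    ('meet', ['M', 'IY1', 'T']),
    ('will', ['W', 'IH1', 'L']),
    ('will', ['W', 'AH0', 'L']),
    ('you', ['Y', 'UW1']),
]

def build_index(entries):
    """Map each word to the list of its pronunciations (as strings)."""
    index = defaultdict(list)
    for word, phones in entries:
        index[word].append(' '.join(phones))
    return index

def pronounce(tokens, index):
    """Return all pronunciations of the tokens, in sentence order."""
    result = []
    for tok in tokens:
        result.extend(index.get(tok, []))  # unknown words are skipped
    return result

index = build_index(sample_entries)
print(' | '.join(pronounce(['i', 'will', 'meet', 'you'], index)))
# AY1 | W IH1 L | W AH0 L | M IY1 T | Y UW1
```

NLTK also exposes the dictionary in a similar shape through `nltk.corpus.cmudict.dict()`, which maps each word to a list of phone lists.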

## Words that Can Be Pronounced Differently

Let's now look for words we can pronounce differently. First we will inspect only the words in our sentence. We search the CMU dict for words that appear more than once and save all their possible pronunciations. For this type of task we will work with a dictionary data structure, and to make programming easier we will use `defaultdict`.

We can try a different sentence this time.

In [124]:
sent2 = 'I will record a record and then will send it to you.'

In [125]:
sent2 = normalize(sent2) #normalize the sentence

The sentence above contains repeated words. In this task we do not print the sentence as a whole, so we can get rid of them. The easiest way to remove duplicates is to convert the list to a set and then back to a list (so we drop the duplicates but keep the same data structure to work with).

In [126]:
sent2 = list(set(sent2)) #convert a list to a set and back to a list
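
A set does not remember the order of its items, so the words may come back shuffled. If the original order matters, one possible alternative is `dict.fromkeys`, sketched here on the tokens of the sentence above (the variable names are illustrative):

```python
tokens = ['i', 'will', 'record', 'a', 'record', 'and', 'then',
          'will', 'send', 'it', 'to', 'you']

# dict keys are unique and keep insertion order (Python 3.7+)
unique_tokens = list(dict.fromkeys(tokens))
print(unique_tokens)
# ['i', 'will', 'record', 'a', 'and', 'then', 'send', 'it', 'to', 'you']
```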

We will prepare a dictionary to store the words in, using `defaultdict`.

Then we iterate over the CMU dict and search for all the words in our sentence. We store them in the dictionary, where the keys are the words and the values are lists of their pronunciations.

Thanks to `defaultdict` our code stays short and simple: we do not have to check whether a key already exists before appending to its list.

In [136]:
def sent2_dict(sent2):
    """ Create a dictionary where words are keys and lists of pronunciations are values """
    sent_matches = defaultdict(list)
    
    for tok in sent2:
        for entry in entries:
            if entry[0] == tok:
                sent_matches[tok].append(entry[1])
                
    return sent_matches

In [137]:
sent_pron = sent2_dict(sent2)
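
To make the benefit of `defaultdict` concrete, the sketch below builds the same mapping twice, once with a plain dictionary and an explicit check, and once with `defaultdict(list)`. The `pairs` list is a hand-made sample in the shape of CMU dict entries, not the real data.

```python
from collections import defaultdict

# Hypothetical sample in the (word, phones) shape of the CMU dict entries.
pairs = [('a', ['AH0']), ('a', ['EY1']),
         ('it', ['IH1', 'T']), ('it', ['IH0', 'T'])]

plain = {}
for word, phones in pairs:
    if word not in plain:      # the condition defaultdict spares us
        plain[word] = []
    plain[word].append(phones)

auto = defaultdict(list)
for word, phones in pairs:
    auto[word].append(phones)  # a missing key starts as an empty list

print(plain == auto)  # True
```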

Now we can check whether our sentence contains words with more than one pronunciation. Because the pronunciations are stored in a list in each dictionary value, the easiest thing to do is to iterate over the dictionary and print the length of each list.

In [140]:
for key, val in sent_pron.items():
    print(key, len(val))

then 1
i 1
you 1
to 3
will 2
and 2
a 2
it 2
record 3
send 1


As we can see, there are 6 words we can pronounce differently: _to_, _will_, _and_, _a_, _it_ and _record_. Let's create another dictionary, store these words in it and print them with their pronunciations.

In [141]:
def more_pron_dict(sent_pron):
    """ Create a new dictionary and store the words we can pronounce differently there """
    pron = {}
    
    for key, val in sent_pron.items():
        if len(val) > 1:
            pron[key] = val
            
    return pron

In [142]:
diff_pron = more_pron_dict(sent_pron)

In [143]:
diff_pron

{'to': [['T', 'UW1'], ['T', 'IH0'], ['T', 'AH0']],
 'will': [['W', 'IH1', 'L'], ['W', 'AH0', 'L']],
 'and': [['AH0', 'N', 'D'], ['AE1', 'N', 'D']],
 'a': [['AH0'], ['EY1']],
 'it': [['IH1', 'T'], ['IH0', 'T']],
 'record': [['R', 'AH0', 'K', 'AO1', 'R', 'D'],
  ['R', 'EH1', 'K', 'ER0', 'D'],
  ['R', 'IH0', 'K', 'AO1', 'R', 'D']]}
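
The same filtering step can also be written as a dictionary comprehension. A minimal sketch on a hand-made sample (`more_pron` and `sample_pron` are illustrative names, not part of the notebook):

```python
# Hypothetical sample in the shape returned by sent2_dict.
sample_pron = {
    'a': [['AH0'], ['EY1']],
    'i': [['AY1']],
    'will': [['W', 'IH1', 'L'], ['W', 'AH0', 'L']],
}

def more_pron(sent_pron):
    """Keep only the words that have more than one pronunciation."""
    return {word: prons for word, prons in sent_pron.items() if len(prons) > 1}

print(sorted(more_pron(sample_pron)))  # ['a', 'will']
```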

We could leave the output as it is or make it prettier.

To make it prettier, we first concatenate the separate phones of each pronunciation into a string. We create another function for this task.

In [144]:
def pron_no_list(diff_pron):
    """ Create a new dictionary. In values store a list of strings with phones """
    pron_nl = defaultdict(list)

    for key, val in diff_pron.items():
        for words in val:
            pron_nl[key].append(' '.join(w for w in words))
    
    return pron_nl

In [145]:
diff_pron_nl = pron_no_list(diff_pron)

In [146]:
diff_pron_nl

defaultdict(list,
            {'to': ['T UW1', 'T IH0', 'T AH0'],
             'will': ['W IH1 L', 'W AH0 L'],
             'and': ['AH0 N D', 'AE1 N D'],
             'a': ['AH0', 'EY1'],
             'it': ['IH1 T', 'IH0 T'],
             'record': ['R AH0 K AO1 R D',
              'R EH1 K ER0 D',
              'R IH0 K AO1 R D']})

Then we will just print the dictionary.

In [147]:
for key, val in diff_pron_nl.items():
    print(key + ': ' + ' | '.join(w for w in val))

to: T UW1 | T IH0 | T AH0
will: W IH1 L | W AH0 L
and: AH0 N D | AE1 N D
a: AH0 | EY1
it: IH1 T | IH0 T
record: R AH0 K AO1 R D | R EH1 K ER0 D | R IH0 K AO1 R D
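
The joining and printing steps can also be combined, so that no intermediate dictionary of strings is needed. The sketch below assumes its input has the same shape as `diff_pron` (words mapped to lists of phone lists); `format_pronunciations` and `sample` are hypothetical names.

```python
def format_pronunciations(pron_dict):
    """Return one 'word: pron | pron' line per word."""
    lines = []
    for word, prons in pron_dict.items():
        # join phones with spaces, then pronunciations with ' | '
        joined = ' | '.join(' '.join(phones) for phones in prons)
        lines.append(f'{word}: {joined}')
    return lines

sample = {'a': [['AH0'], ['EY1']], 'it': [['IH1', 'T'], ['IH0', 'T']]}
print('\n'.join(format_pronunciations(sample)))
# a: AH0 | EY1
# it: IH1 T | IH0 T
```

Returning the lines instead of printing them inside the function keeps the formatting reusable, for example when the result should be written to a file.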
