Heteronym to Phoneme Parser

Fast parsing of English Heteronyms to Phonemes using contextual part-of-speech.

Provides the ability to convert heteronym graphemes to their phonetic pronunciations.

Designed to be used in conjunction with other fixed grapheme-to-phoneme dictionaries such as CMUdict

This package also offers a combined Grapheme-to-Phoneme dictionary, combining the functionality of fixed lookups handled by CMUdict and context-based parsing as offered by this module.

Usage

1. Combined Grapheme-to-Phoneme dictionary

The CMUDictExt class combines a pipeline for context-based heteronym parsing to phonemes and a fixed dictionary lookup replacement using the CMU Pronouncing Dictionary.

Example:

from h2p_parser.cmudictext import CMUDictExt

CMUDictExt = CMUDictExt()

# Parsing replacements for a line. This can be one or more sentences.
line = CMUDictExt.convert("The cat read the book. It was a good book to read.")
# -> "{DH AH0} {K AE1 T} {R EH1 D} {DH AH0} {B UH1 K}. {IH1 T} {W AA1 Z} {AH0} {G UH1 D} {B UH1 K} {T UW1} {R IY1 D}."

Additional optional parameters are available when defining a CMUDictExt instance:

Parameter	Type	Default Value	Description
`cmu_dict_path`	`str`	`None`	Path to a custom CMUDict file in `.txt` format
`h2p_dict_path`	`str`	`None`	Path to a custom H2p Dictionary file in `.json` format. See the example.json for the expected format.
`cmu_multi_mode`	`int`	`0`	Default selection index for CMUDict entries with multiple pronunciations as donated by the `(1)` or `(n)` format
`process_numbers`	`bool`	`True`	Toggles conversion of some numbers and symbols to their spoken pronunciation forms. See numbers.py for details on what is covered.
`phoneme_brackets`	`bool`	`True`	Surrounds phonetic words with curly brackets i.e. `{R IY1 D}`
`unresolved_mode`	`str`	`keep`	Unresolved word resolution modes: `keep` - Keeps the text-form word in the output. `remove` - Removes the text-form word from the output. `drop` - Returns the line as `None` if any word is unresolved.

2. Heteronym-to-Phoneme parsing only

To use only the core heteronym-to-phoneme parsing functions, without fixed dictionary support, use H2p class.

Example:

from h2p_parser.h2p import H2p

h2p = H2p(preload=True) # preload flag improves first-inference performance

# checking if a line contains a heteronym
state = h2p.contains_het("There are no heteronyms in this line.")
# -> False

# replacing a single line
line = h2p.replace_het("I read the book. It was a good book to read.")
# -> "I {R EH1 D} the book. It was a good book to {R IY1 D}."

# replacing a list of lines
lines = h2p.replace_het_list(["I read the book. It was a good book to read.",
                              "Don't just give the gift; present the present.",
                              "If you were to reject the product, it would be a reject."])
# -> ["I {R EH1 D} the book. It was a good book to {R IY1 D}.",
#     "Don't just give the gift; {P R IY0 Z EH1 N T} the {P R EH1 Z AH0 N T}.",
#     "If you were to {R IH0 JH EH1 K T} the product, it would be a {R IY1 JH EH0 K T}."]

Note: Depending on your performance requirements, there is a speed improvement for processing large text line batches by using replace_het_list() with a list of all text lines, instead of making repeated calls to replace_het(). See the performance section for more details and guidelines for optimizations.

License

The code in this project is released under Apache License 2.0.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly