A simple IPA tokeniser, as simple as in:
>>> from ipatok import tokenise
>>> tokenise('ˈtiːt͡ʃə')
['t', 'iː', 't͡ʃ', 'ə']
>>> tokenise('ʃːjeq͡χːʼjer')
['ʃː', 'j', 'e', 'q͡χːʼ', 'j', 'e', 'r']
tokenise(string, strict=False, replace=False, diphtongs=False, tones=False,
unknown=False, merge=None) takes an IPA string and returns a list of tokens.
A token usually consists of a single letter together with its accompanying
diacritics. If two letters are connected by a tie bar, they are also considered
a single token. Except for length markers, suprasegmentals are excluded from
the output. Whitespace is also ignored. The function accepts the following
keyword arguments:
- strict: if set to True, the function ensures that string complies with the IPA spec (the 2015 revision); a ValueError is raised if it does not. If set to False (the default), the role of non-IPA characters is guessed based on their Unicode category (cf. the pitfalls section below).
- replace: if set to True, the function replaces some common substitutes with their IPA-compliant counterparts, e.g. g → ɡ, ɫ → l̴, ʦ → t͡s. Refer to ipatok/data/replacements.tsv for a full list. If both strict and replace are set to True, replacing is done before checking for spec compliance.
- diphtongs: if set to True, the function groups together non-syllabic vowels with their syllabic neighbours (e.g. aɪ̯ would form a single token). If set to False (the default), vowels are not tokenised together unless there is a connecting tie bar.
- tones: if set to True, tone and word accents are included in the output (accent markers as diacritics and Chao letters as separate tokens). If set to False (the default), these are ignored.
- unknown: if set to True, the output includes (as separate tokens) symbols that cannot be classified as letters, diacritics or suprasegmentals (e.g. $). If set to False (the default), such symbols are ignored. It has no effect if strict is set to True.
- merge: expects a str, str → bool function to be applied to each pair of consecutive tokens; those for which the output is True are merged together. You can use this to, e.g., plug in your own diphthong detection algorithm:

>>> tokenise(string, diphtongs=False, merge=custom_func)
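As an illustration of what custom_func might look like, here is a minimal sketch of a merge predicate. The name merge_diphthong, the deliberately partial VOWELS set, and the rule itself are assumptions made for this example; none of them are part of ipatok.

```python
import unicodedata

# Combining inverted breve below, the IPA non-syllabic diacritic.
NON_SYLLABIC = '\u032f'

# Assumption: a partial set of IPA vowel letters, enough for illustration.
VOWELS = set('aeiouyɐɑɒæɔɘəɛɜɞɤɨɪɯɵøœɶʉʊʌʏ')


def is_vowel_token(token):
    """Check whether the token's first character, after canonical
    decomposition, is one of the vowel letters listed above."""
    base = unicodedata.normalize('NFD', token)[:1]
    return base in VOWELS


def merge_diphthong(left, right):
    """Return True if two consecutive tokens should be merged: both must be
    vowels and exactly one of them must carry the non-syllabic diacritic."""
    if not (is_vowel_token(left) and is_vowel_token(right)):
        return False
    return (NON_SYLLABIC in left) != (NON_SYLLABIC in right)
```

A predicate like this could then be passed as tokenise(string, merge=merge_diphthong).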
tokenize is an alias for tokenise.
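Since strict=True raises a ValueError on non-compliant input, one possible pattern is to try strict tokenisation first and fall back to the lenient mode. The helper below is a hypothetical sketch and not part of ipatok; it takes the tokenise function as an argument so that it can be tried out without the package installed.

```python
def tokenise_with_fallback(tokenise, string):
    """
    Try strict tokenisation first; fall back to lenient mode on failure.

    Returns a (tokens, was_strict) tuple. The tokenise argument is assumed
    to behave like the tokenise function described above.
    """
    try:
        # replace=True swaps in IPA-compliant counterparts before the check
        return tokenise(string, strict=True, replace=True), True
    except ValueError:
        # lenient mode; unknown=True keeps unclassifiable symbols visible
        tokens = tokenise(string, strict=False, replace=True, unknown=True)
        return tokens, False
```

After from ipatok import tokenise, this would be called as tokenise_with_fallback(tokenise, string).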
replace_digits_with_chao(string, inverse=False) takes an IPA string and
replaces the digits 1-5 (also in superscript) with Chao tone letters. If
inverse=True, smaller digits are converted into higher tones; otherwise,
they are converted into lower tones (the default). Equal consecutive digits
are collapsed into a single Chao letter (e.g. 55 → ˥).
>>> tokenise(replace_digits_with_chao('ɕia⁵¹ɕyɛ²¹⁴'), tones=True)
['ɕ', 'i', 'a', '˥˩', 'ɕ', 'y', 'ɛ', '˨˩˦']
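To make the conversion rules concrete, here is a stand-alone sketch of the behaviour described above. It is an illustrative re-implementation under the stated rules, not the library's own code; the name digits_to_chao and the lookup tables are assumptions of this example.

```python
import itertools

# Chao tone letters ordered from lowest to highest: ˩ ˨ ˧ ˦ ˥
CHAO = '\u02e9\u02e8\u02e7\u02e6\u02e5'

# Superscript digits 1-5 mapped to their plain counterparts.
SUPERSCRIPT = {
    '\u00b9': '1', '\u00b2': '2', '\u00b3': '3',
    '\u2074': '4', '\u2075': '5',
}


def digits_to_chao(string, inverse=False):
    """Replace the digits 1-5 (plain or superscript) with Chao letters."""
    chars = [SUPERSCRIPT.get(char, char) for char in string]
    output = []

    # groupby yields one group per run of equal consecutive characters
    for char, group in itertools.groupby(chars):
        if char in '12345':
            index = int(char) - 1
            if inverse:
                index = 4 - index  # smaller digits become higher tones
            output.append(CHAO[index])  # a run collapses into one letter
        else:
            output.extend(group)  # other characters are kept as they are

    return ''.join(output)
```

By default 5 maps to the highest tone letter, which agrees with the ⁵¹ → ˥˩ example above.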
pitfalls
With strict=True, each symbol is looked up in the spec and there is no
ambiguity as to how the input should be tokenised.
With strict=False, IPA symbols are still handled correctly. A non-IPA symbol
is treated as follows:
- if it is a non-modifier letter (e.g. Γ), it is considered a consonant;
- if it is a modifier (e.g. ˀ) or a combining mark (e.g. ə̇), it is considered a diacritic;
- if it is a modifier tone letter (e.g. ꜍), it is considered a tone symbol;
- if it is neither of those, it is considered an unknown symbol.
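The rules above can be approximated with the standard unicodedata module. This is only a sketch of the idea: the function name guess_role, the use of the Modifier Tone Letters block (U+A700 to U+A71F) to detect tone symbols, and the exact category tests are assumptions of this example, and ipatok's actual checks may differ.

```python
import unicodedata


def guess_role(char):
    """Guess the role of a single non-IPA character from its Unicode data."""
    # Assumption: tone symbols are those in the Modifier Tone Letters block.
    if 0xA700 <= ord(char) <= 0xA71F:
        return 'tone'

    category = unicodedata.category(char)

    # Lm is a modifier letter; Mn, Mc and Me are combining marks.
    if category == 'Lm' or category.startswith('M'):
        return 'diacritic'

    # Any remaining letter category (Lu, Ll, Lt, Lo) is a non-modifier letter.
    if category.startswith('L'):
        return 'consonant'

    return 'unknown'
```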
Regardless of the value of strict, whitespace characters and underscores
are considered to be word boundaries: symbols separated by these characters
are never grouped into the same token, even though the boundary characters
themselves are not included in the output.
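One way to picture the boundary rule is as splitting the input into chunks before any grouping takes place. A minimal sketch with the standard re module follows; split_words is a hypothetical helper for illustration, and ipatok may handle boundaries differently internally.

```python
import re


def split_words(string):
    """Split a string on runs of whitespace and underscores, dropping the
    boundary characters and any empty chunks."""
    return [chunk for chunk in re.split(r'[\s_]+', string) if chunk]
```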
This is a standard Python 3 package without dependencies. It is offered at the Cheese Shop, so you can install it with pip:
pip install ipatok
or, alternatively, you can clone this repo (safe to delete afterwards) and do:
python setup.py test
python setup.py install
Of course, this could be happening within a virtualenv/venv as well.
other IPA packages
- lingpy is a historical linguistics suite that includes an ipa2tokens function.
- ipapy is a package for working with IPA strings.
- ipalint provides a command-line tool for checking IPA datasets for errors and inconsistencies.
- asjp provides functions for converting between IPA and ASJP.
MIT. Do as you please and praise the snake gods.