An R package for phonological analysis
For a more comprehensive vignette, visit my
website. The package requires R >= 4.1.
- getFeat() and getPhon() to work with distinctive features
- ipa() phonemically transcribes words (real or not) in Portuguese, French or Spanish
- ipa_pt() offers a more detailed transcription for Portuguese
- maxent() implements a MaxEnt grammar given a tableau object
- nhg() and plotNhg() can be used to generate and visualize probabilities for candidates in a Noisy Harmonic Grammar
- sonDisp() calculates the sonority dispersion of a given demisyllable. meanSonDisp() calculates the average dispersion for a given word (or vector of words)
- wug_pt() generates hypothetical words in Portuguese
- biGram_pt() calculates bigram probabilities for a given word
- plotVowels() generates vowel trapezoids
- plotSon() plots the sonority profile of a given word
- psl contains the Portuguese Stress Lexicon
- pt_lex contains a simplified version of psl
- stopwords_pt, stopwords_fr and stopwords_sp contain stopwords in Portuguese, French and Spanish
The function getFeat() requires a set of phonemes ph and a language
lg. It outputs the minimal matrix of distinctive features for ph
given the phonemic inventory of lg. Five languages are supported:
English, French, Italian, Portuguese, and Spanish. You can also use lg
to provide your own phonemic inventory as a vector. Here are some
examples.
library(Fonology)
getFeat(ph = c("i", "u"), lg = "English")
#> [1] "+hi" "+tense"
getFeat(ph = c("i", "u"), lg = "French")
#> [1] "Not a natural class in this language."
getFeat(ph = c("i", "y", "u"), lg = "French")
#> [1] "+syl" "+hi"
getFeat(ph = c("p", "b"), lg = "Portuguese")
#> [1] "-son" "-cont" "+lab"
getFeat(ph = c("k", "g"), lg = "Italian")
#> [1] "+cons" "+back"
The function getPhon() requires a feature matrix ft and a language
lg. It outputs the set of phonemes represented by ft given the
phonemic inventory of lg. The languages supported are the same as
those supported by getFeat(), and you can again use lg to provide
your own phonemic inventory as a vector.
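To see why a set such as i, u can be a natural class in one inventory but not in another, it may help to sketch the logic by hand. The base R snippet below is only an illustration: the feature table feats, its feature values, and the helpers specs_of(), get_phon_sketch() and is_natural_class() are all assumptions made for this example, not the package's internals or its feature system.

```r
# Toy feature table (illustrative values only, not the package's feature set).
feats <- rbind(
  i = c(hi = "+", back = "-", round = "-"),
  y = c(hi = "+", back = "-", round = "+"),
  u = c(hi = "+", back = "+", round = "+"),
  e = c(hi = "-", back = "-", round = "-"),
  o = c(hi = "-", back = "+", round = "+"),
  a = c(hi = "-", back = "+", round = "-")
)

# Feature specifications of one phoneme, e.g. "+hi" "-back" "-round".
specs_of <- function(p) paste0(feats[p, ], colnames(feats))

# Phonemes in `inventory` that match every specification in `ft`.
get_phon_sketch <- function(ft, inventory) {
  keep <- vapply(inventory, function(p) all(ft %in% specs_of(p)), logical(1))
  inventory[keep]
}

# A set is treated as a natural class if the features shared by all of its
# members pick out exactly that set in the inventory.
is_natural_class <- function(ph, inventory) {
  shared <- Reduce(intersect, lapply(ph, specs_of))
  length(shared) > 0 && setequal(get_phon_sketch(shared, inventory), ph)
}

with_y <- rownames(feats)          # inventory that includes /y/
without_y <- setdiff(with_y, "y")  # inventory that lacks /y/

is_natural_class(c("i", "u"), without_y)
is_natural_class(c("i", "u"), with_y)
is_natural_class(c("i", "y", "u"), with_y)
```

In this toy setting, i and u only share +hi, and +hi also picks out y whenever y is in the inventory. That is the same reasoning behind the French result shown earlier, where i, u is rejected but i, y, u is accepted.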
getPhon(ft = c("+syl", "+hi"), lg = "French")
#> [1] "u" "i" "y"
getPhon(ft = c("-DR", "-cont", "-son"), lg = "English")
#> [1] "t" "d" "b" "k" "g" "p"
getPhon(ft = c("-son", "+vce"), lg = "Spanish")
#> [1] "z" "d" "b" "ʝ" "g" "v"
The function ipa() takes a word (or a vector with multiple words,
real or not) in Portuguese, French or Spanish in its orthographic form
and returns its phonemic (i.e., broad) transcription. The accuracy of
grapheme-to-phoneme conversion is at least 80% for all three languages.
Narrow transcription is available for Portuguese (based on Brazilian
Portuguese), which includes secondary stress—this can be generated by
adding narrow = T to the function. Run ipa_pt_test(),
ipa_fr_test() and ipa_sp_test() for sample words in each language.
By default, ipa() assumes that lg = "Portuguese" (or lg = "pt")
and narrow = F.
ipa("atletico")
#> [1] "a.tle.ˈti.ko"
ipa("cantalo", narrow = T)
#> [1] "kãn.ˈta.lʊ"
ipa("antidepressivo", narrow = T)
#> [1] "ˌãn.t͡ʃi.ˌde.pɾe.ˈsi.vʊ"
ipa("feris")
#> [1] "fe.ˈris"
ipa("mejorado", lg = "sp")
#> [1] "me.xo.ˈɾa.do"
ipa("nuevos", lg = "sp")
#> [1] "nu.ˈe.bos"
ipa("informatique", lg = "fr")
#> [1] "ɛ̃.fɔʁ.ma.tik"
ipa("acheter", lg = "fr")
#> [1] "a.ʃə.te"
A more detailed function, ipa_pt(), is available for Portuguese only.
In it, stress is assigned based on two scenarios. First, real words
(non-verbs) have their stress assignment derived from the Portuguese
Stress Lexicon (Garcia, 2014), provided the word is listed there.
Second, nonce words follow the general patterns of Portuguese stress
as well as probabilistic tendencies shown in my work (Garcia, 2017a,
2017b, 2019). As a result, a nonce word may have antepenultimate
stress under the right conditions based on lexical statistics in the
language. Likewise, words with other so-called exceptional stress
patterns are also generated probabilistically (e.g., LH] words with
penultimate stress). Stress and weight are also used to apply both
spondaic and dactylic lowering to narrow transcriptions, following work
such as Wetzels (2007). Secondary stress is provided when narrow = T.
In the function ipa(), stress is not probabilistic (and therefore
not variable): it merely follows the orthography as well as the typical
stress rules in Portuguese (and Spanish).
There are several assumptions about surface-forms when narrow = T
(Portuguese only). Most of these assumptions can (and probably will) be
adjusted as the package improves its accuracy and coverage.
Diphthongization, for example, is sensitive to phonotactics. A word such
as CV.ˈV.CV will be narrowly transcribed as ˈCGV.CV, except when the
initial consonant is an (allophonic) affricate, which seems to lower
the probability of diphthongization based on my judgement.
Diphthongization is not applied if the onset is complex. Needless to
say, these assumptions are based on a particular dialect of Brazilian
Portuguese, and I do not expect all of them to seamlessly apply to other
dialects (although some assumptions are more easily generalizable than
others).
Narrow transcription also includes (final) vowel reduction, voicing
assimilation, l-vocalization, vowel devoicing, palatalization, and
epenthesis in sC clusters and other consonant sequences that are
expected to be repaired on surface forms (e.g., kt, gn). Examples
can be generated with the function ipa_pt_test().
If you plan to tokenize texts and create a table with individual columns
for stress and syllables, you can use some simple additional helper
functions. For example, getWeight() will take a syllabified word and
return its weight profile (e.g., getWeight("kon.to") will return
HL). The function getStress() (see footnote 1) will return the stress
position of a given word (up to preantepenultimate stress). The word must
already be stressed, but the symbol used can be specified in the function
(argument stress). Finally, countSyl() will return the number of syllables in
a given string, and getSyl() will extract a particular syllable from a
string. For example,
getSyl(word = "kom-pu-ta-doɾ", pos = 3, syl = "-") will take the
antepenultimate syllable of the string in question. The default symbol
for syllabification is the period.
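For readers who want to see what these helpers amount to, here is a rough base R sketch of the same ideas. The names split_syl(), count_syl_sketch(), get_syl_sketch() and get_weight_sketch() are hypothetical stand-ins, not the package's code. Two assumptions are worth flagging: pos is counted from the right edge of the word (inferred from the antepenultimate example above), and a syllable counts as heavy simply when it does not end in one of a handful of plain vowels, which ignores diphthongs and nasal vowels.

```r
# Hypothetical base-R stand-ins; the package functions may differ in detail.

# Split a syllabified string on the syllable separator.
split_syl <- function(word, syl = ".") {
  strsplit(word, syl, fixed = TRUE)[[1]]
}

# Number of syllables in a syllabified string.
count_syl_sketch <- function(word, syl = ".") {
  length(split_syl(word, syl))
}

# Extract a syllable, counting from the right edge (assumption):
# pos = 1 is the final syllable, pos = 3 the antepenultimate one.
get_syl_sketch <- function(word, pos = 1, syl = ".") {
  parts <- split_syl(word, syl)
  idx <- length(parts) - pos + 1
  if (idx < 1) return(NA_character_)
  parts[idx]
}

# Weight profile: "L" if the syllable ends in a plain vowel, "H" otherwise.
# This is a deliberate simplification (no diphthongs, no nasal vowels).
get_weight_sketch <- function(word, syl = ".") {
  parts <- split_syl(word, syl)
  parts <- gsub("ˈ", "", parts, fixed = TRUE)  # drop primary stress marks
  parts <- gsub("ˌ", "", parts, fixed = TRUE)  # drop secondary stress marks
  last <- substr(parts, nchar(parts), nchar(parts))
  vowels <- c("a", "e", "i", "o", "u", "ɛ", "ɔ")
  paste(ifelse(last %in% vowels, "L", "H"), collapse = "")
}

count_syl_sketch("kom-pu-ta-doɾ", syl = "-")
get_syl_sketch("kom-pu-ta-doɾ", pos = 3, syl = "-")
get_weight_sketch("kon.to")
```

With these definitions, the last call returns HL, matching the getWeight() example in the text.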
The function maxent() learns weights given a tableau object containing
inputs, outputs, constraints, violations and observations (see
documentation). The function returns a list with different objects,
including learned weights, BIC value and predicted probabilities for
each output. If the reader wishes to pursue a comprehensive MaxEnt
analysis, I strongly recommend the maxent.ot package, which is
dedicated exclusively to MaxEnt grammars (Mayer et al., 2024). Here’s an
example of maxent() in action.
maxent_data <- tibble::tibble(
input = rep(c("pad", "tab", "bid", "dog", "pok"), each = 2),
output = c("pad", "pat", "tab", "tap", "bid", "bit", "dog", "dok", "pog", "pok"),
ident_vce = c(0, 1, 0, 1, 0, 1, 0, 1, 1, 0),
no_vce_final = c(1, 0, 1, 0, 1, 0, 1, 0, 1, 0),
obs = c(5, 15, 10, 20, 12, 18, 12, 17, 4, 8)
)
maxent(tableau = maxent_data)
#> $predictions
#> # A tibble: 10 × 12
#> input output ident_vce no_vce_final obs harmony max_h exp_h Z obs_prob pred_prob error
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 pad pad 0 1 5 0.639 0.639 1 2.79 0.25 0.358 -1.08e- 1
#> 2 pad pat 1 0 15 0.0541 0.639 1.79 2.79 0.75 0.642 1.08e- 1
#> 3 tab tab 0 1 10 0.639 0.639 1 2.79 0.333 0.358 -2.45e- 2
#> 4 tab tap 1 0 20 0.0541 0.639 1.79 2.79 0.667 0.642 2.45e- 2
#> 5 bid bid 0 1 12 0.639 0.639 1 2.79 0.4 0.358 4.22e- 2
#> 6 bid bit 1 0 18 0.0541 0.639 1.79 2.79 0.6 0.642 -4.22e- 2
#> 7 dog dog 0 1 12 0.639 0.639 1 2.79 0.414 0.358 5.60e- 2
#> 8 dog dok 1 0 17 0.0541 0.639 1.79 2.79 0.586 0.642 -5.60e- 2
#> 9 pok pog 1 1 4 0.693 0.693 1 3.00 0.333 0.333 2.04e-11
#> 10 pok pok 0 0 8 0 0.693 2.00 3.00 0.667 0.667 -2.04e-11
#>
#> $weights
#> ident_vce no_vce_final
#> 0.05410679 0.63904039
#>
#> $log_likelihood
#> [1] -78.72152
#>
#> $log_likelihood_norm
#> [1] -7.872152
#>
#> $bic
#> [1] -152.8379
Finally, a couple of functions are dedicated to Noisy Harmonic Grammars.
These are pedagogical tools that can be used to demonstrate how
probabilities are generated given constraint weights and violation
profiles for different candidates. The function nhg() takes a tableau
object and returns predicted probabilities given n simulations. The
user can also set the standard deviation for the noise used. The
function plotNhg() can be helpful to visualize how different standard
deviations affect probabilities over candidates after 100 simulations.
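The general idea can be sketched in a few lines of base R. This is a minimal illustration, not how nhg() is implemented: nhg_sketch() is a hypothetical helper, and it assumes that Gaussian noise is added to every constraint weight at each evaluation and that the candidate with the highest (least negative) harmony wins. The weights are copied from the maxent() output above, and the violation profiles are those of the input pad. For reference, the last lines recompute the MaxEnt probabilities for the same two candidates.

```r
set.seed(42)  # reproducible noise

# Weights copied from the maxent() output above.
w <- c(ident_vce = 0.05410679, no_vce_final = 0.63904039)

# Violation profiles for input "pad": rows are candidates, columns constraints.
viol <- rbind(
  pad = c(ident_vce = 0, no_vce_final = 1),
  pat = c(ident_vce = 1, no_vce_final = 0)
)

# Hypothetical helper: proportion of wins per candidate over n noisy evaluations.
nhg_sketch <- function(viol, w, n = 1000, sd = 1) {
  wins <- replicate(n, {
    noisy_w <- w + rnorm(length(w), mean = 0, sd = sd)  # perturb each weight
    harmony <- -as.vector(viol %*% noisy_w)             # penalties lower harmony
    which.max(harmony)                                  # index of the winner
  })
  setNames(tabulate(wins, nbins = nrow(viol)) / n, rownames(viol))
}

nhg_probs <- nhg_sketch(viol, w, n = 1000, sd = 1)
nhg_probs

# MaxEnt for comparison: probability proportional to exp(-weighted violations).
maxent_probs <- exp(-viol %*% w)[, 1]
maxent_probs <- maxent_probs / sum(maxent_probs)
round(maxent_probs, 3)
```

The rounded MaxEnt values correspond to the pred_prob column for pad and pat in the tableau above. Rerunning nhg_sketch() with a smaller or larger sd shows how the amount of noise pushes the simulated probabilities toward categorical or toward chance outcomes.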
There are three functions in the package to analyze sonority. First,
demi(word = ..., d = ...) extracts either the first (d = 1, the
default) or second (d = 2) demisyllables of a given (syllabified) word
(or vector of words). Second, sonDisp(demi = ...) calculates the
sonority dispersion score of a given demisyllable, based on Clements
(1990)—see also Parker (2011). Note that this metric does not
differentiate sequences that respect the sonority sequencing principle
(SSP) from those that don’t, i.e., pla and lpa will have the same
score. For that reason, a third function exists,
ssp(demi = ..., d = ...), which evaluates whether a given demisyllable
respects (1) or doesn’t respect (0) the SSP. In the example below,
the dispersion score of the first demisyllable in the penult syllable is
calculated—ssp() isn’t relevant here, since all words in Portuguese
respect the SSP.
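As background for the package example that follows, the arithmetic behind the dispersion score can be sketched by hand. The snippet assumes the formula from Clements (1990), where the score is the sum of 1/d^2 over every pair of segments in the demisyllable and d is their distance on a sonority scale. The five-point scale, the small segment list, and the helpers son_disp_sketch() and ssp_sketch() are assumptions for illustration only. In particular, the SSP check here requires a strict rise in sonority and only covers initial demisyllables.

```r
# Assumed five-point scale: obstruent < nasal < liquid < glide < vowel.
son_scale <- c(
  p = 0, b = 0, t = 0, d = 0, k = 0, g = 0, f = 0, v = 0, s = 0, z = 0,
  m = 1, n = 1,
  l = 2, r = 2,
  j = 3, w = 3,
  a = 4, e = 4, i = 4, o = 4, u = 4
)

# Sonority values of the segments in a demisyllable.
son_of <- function(demi) unname(son_scale[strsplit(demi, "")[[1]]])

# Dispersion: sum of inverse squared sonority distances over all segment pairs.
# Pairs with identical sonority would yield Inf under this formula.
son_disp_sketch <- function(demi) {
  son <- son_of(demi)
  if (length(son) < 2) return(NA_real_)  # no pairs to compare
  d <- combn(son, 2, FUN = function(pair) abs(diff(pair)))
  sum(1 / d^2)
}

# SSP for an initial demisyllable: 1 if sonority rises strictly, else 0.
ssp_sketch <- function(demi) {
  as.integer(all(diff(son_of(demi)) > 0))
}

round(sapply(c("to", "tri", "pli"), son_disp_sketch), 2)
c(pla = son_disp_sketch("pla"), lpa = son_disp_sketch("lpa"))
c(pla = ssp_sketch("pla"), lpa = ssp_sketch("lpa"))
```

Under these assumptions, pla and lpa receive the same dispersion score because the pairwise distances are identical, and only the SSP check tells them apart.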
library(dplyr) # provides tibble(), rowwise() and mutate()
example = tibble(word = c("partolo", "metrilpo", "vanplidos"))
example = example |>
rowwise() |>
mutate(ipa = ipa(word),
syl2 = getSyl(word = ipa, pos = 2),
demi1 = demi(word = syl2, d = 1),
disp = sonDisp(demi = demi1),
SSP = ssp(demi = demi1, d = 1))
example
#> # A tibble: 3 × 6
#> # Rowwise:
#> word ipa syl2 demi1 disp SSP
#> <chr> <chr> <chr> <chr> <dbl> <dbl>
#> 1 partolo par.ˈto.lo to to 0.06 1
#> 2 metrilpo me.ˈtril.po tril tri 0.56 1
#> 3 vanplidos vam.ˈpli.dos pli pli 0.56 1
You may also want to calculate the average sonority dispersion for whole
words with the function meanSonDisp(). If your words of interest are
possible or real Portuguese words, they can be entered in their
orthographic form. Otherwise, they need to be phonemically transcribed
and syllabified. In this scenario, use phonemic = T.
meanSonDisp(word = c("partolo", "metrilpo", "vanplidos"))
#> [1] 1.53
The function plotVowels() creates a vowel trapezoid using ggplot2
and also returns the LaTeX code to create the same trapezoid using the
vowel package. Available languages: Arabic, French, English, Dutch,
German, Hindi, Italian, Japanese, Korean, Mandarin, Portuguese, Spanish,
Swahili, Russian, Talian, Thai, and Vietnamese.
The function biGram_pt() returns the log bigram probability for a
possible word in Portuguese. The string must use broad phonemic
transcription, with no syllabification or stress. The reference used to
calculate probabilities is the Portuguese Stress Lexicon.
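To make the notion of a log bigram probability concrete, here is a small base R sketch. It is not the package's implementation: the three-word toy_lexicon is made up and merely stands in for the Portuguese Stress Lexicon, and bigrams() and log_bigram_sketch() are hypothetical helpers. The sketch also assumes that word edges are marked with # and that unseen bigrams receive a probability of zero (no smoothing). Whether biGram_pt() makes the same choices is not something this example can confirm.

```r
# Made-up toy lexicon standing in for a real reference lexicon.
toy_lexicon <- c("pata", "tapa", "pato")

# Bigrams of a word, with "#" marking both word edges (assumption).
bigrams <- function(w) {
  ch <- c("#", strsplit(w, "")[[1]], "#")
  paste0(head(ch, -1), tail(ch, -1))
}

# Log probability of a word as the sum of log P(second | first) over bigrams.
log_bigram_sketch <- function(word, lexicon) {
  all_bg <- unlist(lapply(lexicon, bigrams))
  bg_n <- table(all_bg)                     # bigram counts
  first_n <- table(substr(all_bg, 1, 1))    # counts of each first element
  bg <- bigrams(word)
  if (!all(bg %in% names(bg_n))) return(-Inf)  # unseen bigram, no smoothing
  cond_p <- as.numeric(bg_n[bg]) / as.numeric(first_n[substr(bg, 1, 1)])
  sum(log(cond_p))
}

log_bigram_sketch("pata", toy_lexicon)
log_bigram_sketch("tob", toy_lexicon)
```

Because each conditional probability is at most 1, the log probability is never positive, and longer words tend to receive lower values simply because more terms are summed.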
Two additional functions can be used to explore bigrams: nGramTbl()
generates a tibble with phonotactic bigrams from a given text, and
plotnGrams() creates a plot for inputs generated with nGramTbl().
The function wug_pt() generates a hypothetical word in Portuguese.
Note that this function is meant to be used to get you started with
nonce words. You will most likely want to make adjustments based on
phonotactic preferences. The function already takes care of some OCP
effects and it also prohibits more than one onset cluster per word,
since that’s relatively rare in Portuguese. Still, there will certainly
be other sequences that sound less natural. The function is not too
strict because you may have a wide range of variables in mind as you
create novel words. Finally, if you wish to include palatalization, set
palatalization = T—if you do that, bear in mind that biGram_pt()
won’t work as it requires phonemic transcription without syllabification
or stress.
set.seed(1)
wug_pt(profile = "LHL")
#> [1] "dra.ˈbur.me"
# Let's create a table with 5 nonce words and their bigram probability
set.seed(1)
tibble(word = character(5)) |>
mutate(word = wug_pt("LHL", n = 5),
bigram = word |> biGram_pt())
#> # A tibble: 5 × 2
#> word bigram
#> <chr> <dbl>
#> 1 dra.ˈbur.me -119.
#> 2 ze.ˈfran.ka -85.6
#> 3 be.ˈʒan.tre -84.8
#> 4 ʒa.ˈgran.fe -87.6
#> 5 me.ˈxes.vro -101.
Additional functions include monthsAge() and meanAge(), both of
which can be used to convert and analyze ages following the format
yy;mm, commonly used in first language acquisition studies. It’s a
good idea to check out the index of functions (?Fonology) to take a
look at the complete list of functions available.
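As a rough idea of what working with the yy;mm format involves, the base R sketch below converts ages to months and back. The helpers months_age_sketch() and format_age_sketch() are hypothetical, and the accepted input format, the rounding of the mean, and the output format are assumptions rather than a description of monthsAge() or meanAge().

```r
# Hypothetical helper: "yy;mm" strings to age in months.
months_age_sketch <- function(age, sep = ";") {
  parts <- strsplit(age, sep, fixed = TRUE)
  vapply(parts, function(p) as.integer(p[1]) * 12L + as.integer(p[2]), integer(1))
}

# Hypothetical helper: months (rounded to the nearest month) back to "y;mm".
format_age_sketch <- function(months) {
  months <- round(months)
  sprintf("%d;%02d", as.integer(months %/% 12), as.integer(months %% 12))
}

ages <- c("2;06", "3;00", "1;11")
in_months <- months_age_sketch(ages)
in_months
format_age_sketch(mean(in_months))
```

Averaging in months and converting back afterwards avoids treating the part after the semicolon as a decimal fraction, which is a common source of error with this format.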
Parts of this project have benefited from funding from the ENVOL program at Université Laval and from the Social Sciences and Humanities Research Council of Canada (SSHRC). Different undergraduate research assistants at Université Laval have worked on the Spanish and French grapheme-to-phoneme conversion functions: Nicolas C. Bustos, Emmy Dumont, and Linda Wong. Matéo Levesque implemented comprehensive regular expressions for French transcription.
- Clements, G. N. (1990). The role of the sonority cycle in core syllabification. In J. Kingston & M. E. Beckman (Eds.), Papers in laboratory phonology I: Between the grammar and physics of speech (pp. 283–333). Cambridge University Press.
- Garcia, G. D. (2014). Portuguese Stress Lexicon. Available at gdgarcia.ca/psl.html.
- Garcia, G. D. (2017a). Weight effects on stress: Lexicon and grammar. PhD thesis, McGill University. https://doi.org/10.31219/osf.io/bt8hk
- Garcia, G. D. (2017b). Weight gradience and stress in Portuguese. Phonology, 34(1), 41–79. https://doi.org/10.1017/S0952675717000033
- Garcia, G. D. (2019). When lexical statistics and the grammar conflict: Learning and repairing weight effects on stress. Language, 95(4), 612–641. https://doi.org/10.1353/lan.2019.0068
- Mayer, C., Tan, A., & Zuraw, K. R. (2024). Introducing maxent.ot: An R package for Maximum Entropy constraint grammars. Phonological Data and Analysis, 6(4), 1–44. https://doi.org/10.3765/pda.v6art4.88
- Parker, S. (2011). Sonority. In M. van Oostendorp, C. J. Ewen, E. Hume, & K. Rice (Eds.), The Blackwell companion to phonology (pp. 1160–1184). Wiley Online Library. https://doi.org/10.1002/9781444335262.wbctp0049
- Wetzels, L. (2007). Primary word stress in Brazilian Portuguese and the weight parameter. Journal of Portuguese Linguistics, 6(1), 9–58. https://doi.org/10.5334/jpl.144
Footnotes
1. Functions without _pt, _fr or _sp are language-independent.
