Pronunciation information #27

jmccrae · 2020-07-30T09:57:42Z

We are looking to add some pronunciation information to English WordNet, and it would be good to add this as a schema extension. As I see it, we would need the following information:

The actual form
The notation scheme (e.g., IPA)
The dialect, encoded with an ISO 3166 code
Further notes. Free text describing the pronunciation in more detail

As such, I would suggest something like the following:

<LexicalEntry id="ewn-transport-n">
  <Lemma writtenForm="transport" partOfSpeech="n">
    <Pronunciation notation="ipa" dialect="GB" notes="RP">/tɹænzˈpɔːt/</Pronunciation>
    <Pronunciation notation="ipa" dialect="GB" notes="RP">/tɹɑːnˈspɔːt/</Pronunciation>
    <Pronunciation notation="ipa" dialect="US" notes="GenAM">/tɹænzˈpɔɹt/</Pronunciation>
  </Lemma>
  <Sense>...</Sense>
</LexicalEntry>
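
A minimal sketch of how a tool might read the markup proposed above, using Python's standard `xml.etree.ElementTree`. The attribute names (`notation`, `dialect`, `notes`) are the ones from this proposal, not a finalized schema, and the helper name `read_pronunciations` is hypothetical. The `<Sense>` placeholder is left out so the sample parses on its own.

```python
import xml.etree.ElementTree as ET

# The proposed markup, with every Pronunciation element properly closed.
SAMPLE = """
<LexicalEntry id="ewn-transport-n">
  <Lemma writtenForm="transport" partOfSpeech="n">
    <Pronunciation notation="ipa" dialect="GB" notes="RP">/tɹænzˈpɔːt/</Pronunciation>
    <Pronunciation notation="ipa" dialect="GB" notes="RP">/tɹɑːnˈspɔːt/</Pronunciation>
    <Pronunciation notation="ipa" dialect="US" notes="GenAM">/tɹænzˈpɔɹt/</Pronunciation>
  </Lemma>
</LexicalEntry>
"""


def read_pronunciations(xml_text):
    """Return (writtenForm, dialect, notes, notation, transcription) tuples."""
    entry = ET.fromstring(xml_text.strip())
    rows = []
    for lemma in entry.iter("Lemma"):
        for pron in lemma.findall("Pronunciation"):
            rows.append((
                lemma.get("writtenForm"),
                pron.get("dialect"),
                pron.get("notes"),
                pron.get("notation"),
                (pron.text or "").strip(),
            ))
    return rows


pronunciations = read_pronunciations(SAMPLE)
for row in pronunciations:
    print(row)
```

Because the transcriptions are element text rather than attributes, a lemma can carry any number of `Pronunciation` children without changing the `Lemma` attributes.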

goodmami · 2020-07-30T13:27:09Z

Looks good in general. A few things:

1. I would maybe go for a more general word for "dialect" to avoid political controversies. Maybe "variety"?
2. Instead of a country code in "dialect" (or whatever we call it) and some specialization under "notes", can we combine them into one bcp-47 tag? This saves an attribute and it dovetails as a specialization of the lexicon's language attribute. E.g., en-GB-x-RP.
3. Do we need the / in the transcription? I can't imagine a use for phonetic transcription ([...] vs /.../), so perhaps we can assume it is a phonemic transcription and drop the / characters?
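
To illustrate the second point, here is a rough sketch of how a combined tag such as `en-GB-x-RP` could be taken apart again. It is deliberately simplified: it assumes the shape `language[-REGION][-x-private...]` and ignores the script, variant and extension subtags that full BCP 47 allows. The function names `split_variety` and `specializes` are hypothetical.

```python
def split_variety(tag):
    """Split a tag like 'en-GB-x-RP' into language, region and private-use parts.

    Simplified on purpose: only handles language[-REGION][-x-private...].
    """
    main, _, private = tag.partition("-x-")
    subtags = main.split("-")
    language = subtags[0].lower()
    region = None
    for sub in subtags[1:]:
        # A region is two letters (e.g. GB) or three digits (e.g. 419).
        if (len(sub) == 2 and sub.isalpha()) or (len(sub) == 3 and sub.isdigit()):
            region = sub.upper()
    return {
        "language": language,
        "region": region,
        "private": private.split("-") if private else [],
    }


def specializes(variety, lexicon_language):
    """True if the variety tag has the same primary language as the lexicon."""
    lexicon_primary = lexicon_language.split("-")[0].lower()
    return split_variety(variety)["language"] == lexicon_primary


print(split_variety("en-GB-x-RP"))
print(specializes("en-GB-x-RP", "en"))
```

The single tag carries both the country code and the "RP"-style specialization, and the check against the lexicon's language attribute becomes a comparison of the primary subtag.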

jmccrae · 2020-07-30T15:37:17Z

Hi.

1. Yes, this is a good point, variety is better than dialect
2. We could do this. I guess that this would mean duplicating the language code, but this is okay
3. I would guess that some would prefer a phonemic transcription. We could drop the / and have an attribute for phonemic transcriptions?

fcbond · 2020-07-31T05:57:24Z

I agree with Michael's suggestions.


-- Francis Bond <http://www3.ntu.edu.sg/home/fcbond/> Division of Linguistics and Multilingual Studies Nanyang Technological University

lmorgadodacosta · 2020-07-31T06:55:27Z

Hi there,

We have been working/discussing this exact topic a tiny bit for Kristang -- we are hoping to provide IPA and voice recordings for individual lemmas soon.

The problem I'd like to raise here is that Kristang shows a lot of metathesis in certain consonant clusters, e.g. ‘-dr-’ (kodrah and kordah for ‘to wake up’). Within the context of revitalization, as we want people to start using a single spelling, we have decided it would be best to cluster these as "Forms" under a single lemma (i.e. the canonical form). However, these internal forms do have different pronunciations.

Up to this point we were happy to use the Tag element (available to both Forms and Lemmas) and come up with our own "category" notation. But I think including such a Pronunciation element is definitely an improvement. However, I would like to see what you all think about:

making "Pronunciation" available to both Lemmas and Forms
adding an explicit attribute to a sound file path

goodmami · 2020-07-31T08:20:53Z

> making "Pronunciation" available to both Lemmas and Forms

I think this makes sense. Actually, in a new Python-based wordnet module I'm working on, all lemmas are just forms anyway, so doing something different for lemmas and forms would be more trouble than doing the same thing (but this, at least, is just a selfish reason).

> adding an explicit attribute to a sound file path

I'm less enthused about this, but if we're adding logos (#3), then it's not breaking new ground to link to external files. However, shouldn't this be a URL instead of a file path? If a file path, then absolute paths won't work, and we'd need some kind of resource directory such that the paths are relative to this directory, or something. This sounds like over-engineering.

Better, perhaps, would be that your application provides a mapping from local paths into the ids of the wordnet. The trouble is that lemmas/forms do not have their own ids, so it would have to be linked to the LexicalEntry, then to the writtenForm under that entry (are forms guaranteed to be unique under a lexical entry?).

Another issue is when you want multiple audio files for the same lemma/form (e.g., from multiple speakers). It doesn't seem like an attribute for a file path or URL would easily scale to multiple files.
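
As a rough sketch of the application-side alternative described above: the index below lives outside the wordnet file and is keyed by the LexicalEntry id plus the writtenForm, since forms have no ids of their own. Every identifier and URL here is hypothetical (the entry id `example-kodrah-v` and the `example.org` addresses are placeholders), and it assumes written forms are unique within an entry, which is exactly the open question raised above.

```python
from collections import defaultdict

# (LexicalEntry id, writtenForm) -> list of audio URLs, kept by the application.
audio_index = defaultdict(list)


def add_recording(entry_id, written_form, url):
    """Register one more recording for a form; several per form are allowed."""
    audio_index[(entry_id, written_form)].append(url)


def recordings_for(entry_id, written_form):
    """Return all known recordings for a form (empty list if none)."""
    return list(audio_index.get((entry_id, written_form), []))


# Hypothetical data: two speakers for one form, one speaker for the other.
add_recording("example-kodrah-v", "kodrah", "https://example.org/audio/kodrah-1.ogg")
add_recording("example-kodrah-v", "kodrah", "https://example.org/audio/kodrah-2.ogg")
add_recording("example-kodrah-v", "kordah", "https://example.org/audio/kordah-1.ogg")

print(recordings_for("example-kodrah-v", "kodrah"))
```

A list per key handles multiple recordings without any change to the schema, at the cost of keeping the mapping in sync with the wordnet by hand.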

1313ou · 2020-07-31T08:40:09Z

Implications: IPA symbols are not ASCII, so all tools must handle UTF8 (or whatever charset is defined as desired)

jmccrae · 2020-07-31T09:11:12Z

Yes, I had intended this to be available for Forms as well as Lemmas
We could add a URL to the sound file if available; this would be useful for some projects even if others never use it
I think UTF-8 is already required by the serialization. If someone wants a strict ASCII file, they will have to use an ASCII-based transcription scheme

lmorgadodacosta · 2020-07-31T09:16:26Z

Sorry, by sound file path I definitely meant URL or some public URI.
I do kinda see the problem raised by Mike for multiple recordings of the same lemma/form... But if pronunciations are multiple elements, then you could just provide multiple Pronunciation elements. Individual projects could then use the notes attribute to keep information about the speaker, if necessary (e.g. male/female). But I would give a quick link to something that is very meaningful under the element we're discussing.
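
A small sketch of the "multiple Pronunciation elements" idea, one element per recording. The `audio` attribute name is an assumption made for illustration (nothing in this thread fixes its name), the URLs are placeholders, and the transcription text is left out because no IPA for these forms is given here.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: one Pronunciation per recording, speaker info in notes.
SAMPLE = """
<Form writtenForm="kordah">
  <Pronunciation notation="ipa" notes="speaker A" audio="https://example.org/audio/kordah-a.ogg"/>
  <Pronunciation notation="ipa" notes="speaker B" audio="https://example.org/audio/kordah-b.ogg"/>
</Form>
"""


def recordings(form_xml):
    """Return (notes, audio URL) pairs for every Pronunciation under a Form."""
    form = ET.fromstring(form_xml.strip())
    return [(p.get("notes"), p.get("audio")) for p in form.findall("Pronunciation")]


speaker_recordings = recordings(SAMPLE)
for notes, url in speaker_recordings:
    print(notes, url)
```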

1313ou · 2020-07-31T09:17:52Z

To my knowledge the current state of EWN does not use characters that require coding outside ASCII, so the current files are both ASCII and UTF8. So the relevant tests are still to come.

goodmami · 2020-08-24T06:32:30Z

> To my knowledge the current state of EWN does not use characters that require coding outside ASCII, so the current files are both ASCII and UTF8. So the relevant tests are still to come.

I was surprised by this and thought that surely things like jalapeño and résumé would have the diacritics in EWN, even if only as alternative forms, but found nothing but ascii throughout the whole file. In any case, there are non-English wordnets with plenty of non-ascii forms, so it would be unfortunate if any tools assumed wordnets to be ascii-only.
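
For anyone wanting to check their own tooling, a quick sketch of the kind of test meant here. The word list is made up for illustration and is not taken from any wordnet file; `non_ascii_forms` and `survives_utf8_round_trip` are hypothetical helper names. `str.isascii()` needs Python 3.7 or later.

```python
def non_ascii_forms(forms):
    """Return the forms that contain at least one non-ASCII character."""
    return [form for form in forms if not form.isascii()]


def survives_utf8_round_trip(text):
    """True if encoding to UTF-8 and decoding again gives the same string."""
    return text.encode("utf-8").decode("utf-8") == text


# Illustrative forms only, plus one of the IPA transcriptions from this thread.
forms = ["resume", "résumé", "jalapeno", "jalapeño"]
print(non_ascii_forms(forms))
print(survives_utf8_round_trip("/tɹænzˈpɔːt/"))

# A tool that assumes ASCII fails loudly on IPA:
try:
    "/tɹænzˈpɔːt/".encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
print(ascii_ok)
```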

goodmami · 2020-09-01T12:44:09Z

Returning to the issue of marking a transcription as phonemic or phonetic... What do people think of keeping the IPA delimiters (/.../ or [...]) in the actual transcription? When I suggested dropping the delimiters (see (3) in my comment above), I assumed we only cared about phonemic transcriptions. Since the attribute may be irrelevant for non-IPA transcriptions (or implicit given the notation attribute), perhaps the phonemic attribute is a bad idea. Furthermore, the IPA delimiters are shorter and may be clearer for someone familiar with IPA.
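
If the delimiters stay in the transcription, a consumer can still recover the phonemic/phonetic distinction with very little code. The sketch below is one possible reading of that convention, not an agreed rule; `classify_transcription` is a hypothetical name, and the phonetic and undelimited examples are made up for illustration.

```python
def classify_transcription(text):
    """Return (kind, bare transcription) based on IPA delimiters.

    kind is "phonemic" for /.../, "phonetic" for [...], and None when the
    string carries no delimiters (e.g. a non-IPA notation).
    """
    text = text.strip()
    if len(text) >= 2 and text[0] == "/" and text[-1] == "/":
        return "phonemic", text[1:-1]
    if len(text) >= 2 and text[0] == "[" and text[-1] == "]":
        return "phonetic", text[1:-1]
    return None, text


print(classify_transcription("/tɹænzˈpɔːt/"))
print(classify_transcription("[tʰɹænzˈpɔːt]"))
print(classify_transcription("transport"))
```

With this approach no extra attribute is needed, and undelimited strings simply pass through unclassified.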

jmccrae added the enhancement label Jul 30, 2020

jmccrae added this to the v1.1 milestone Jul 30, 2020

jmccrae linked a pull request Aug 7, 2020 that will close this issue

Implement Pronunciation. Fix #27 #30

Merged

goodmami mentioned this issue Sep 1, 2020

Implement Pronunciation. Fix #27 #30

Merged

jmccrae added a commit that referenced this issue Dec 15, 2020

Merge pull request #30 from globalwordnet/issue-27

dc5fe59

Implement Pronunciation. Fix #27

jmccrae closed this as completed in 1a43eb4 Apr 20, 2021
