Pronunciation information #27

Closed
jmccrae opened this issue Jul 30, 2020 · 11 comments · Fixed by #30

@jmccrae
Member

jmccrae commented Jul 30, 2020

We are looking to add some pronunciation information to English WordNet, and it would be good to add this as a schema extension. As I see it, we would need the following information:

  • The actual form
  • The notation scheme (e.g., IPA)
  • The dialect, encoded with an ISO 3166 code
  • Further notes: free text describing the pronunciation in more detail

As such, I would suggest something like the following:

<LexicalEntry id="ewn-transport-n">
  <Lemma writtenForm="transport" partOfSpeech="n">
    <Pronunciation notation="ipa" dialect="GB" notes="RP">/tɹænzˈpɔːt/</Pronunciation>
    <Pronunciation notation="ipa" dialect="GB" notes="RP">/tɹɑːnˈspɔːt/</Pronunciation>
    <Pronunciation notation="ipa" dialect="US" notes="GenAM">/tɹænzˈpɔɹt/</Pronunciation>
  </Lemma>
  <Sense>...</Sense>
</LexicalEntry>
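
A minimal sketch of how a tool might read the proposed elements, using Python's standard `xml.etree.ElementTree`. The element and attribute names are taken from the proposal above; the helper name `read_pronunciations` is hypothetical, and the sample is trimmed to the lemma only.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the proposed markup (the Sense element is left out).
SAMPLE = """\
<LexicalEntry id="ewn-transport-n">
  <Lemma writtenForm="transport" partOfSpeech="n">
    <Pronunciation notation="ipa" dialect="GB" notes="RP">/tɹænzˈpɔːt/</Pronunciation>
    <Pronunciation notation="ipa" dialect="GB" notes="RP">/tɹɑːnˈspɔːt/</Pronunciation>
    <Pronunciation notation="ipa" dialect="US" notes="GenAM">/tɹænzˈpɔɹt/</Pronunciation>
  </Lemma>
</LexicalEntry>
"""


def read_pronunciations(xml_text):
    """Return (written form, dialect, notes, transcription) tuples."""
    entry = ET.fromstring(xml_text)
    rows = []
    for lemma in entry.iter("Lemma"):
        for pron in lemma.findall("Pronunciation"):
            rows.append(
                (lemma.get("writtenForm"), pron.get("dialect"),
                 pron.get("notes"), pron.text)
            )
    return rows


pronunciations = read_pronunciations(SAMPLE)
for row in pronunciations:
    print(row)
```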
@jmccrae jmccrae added this to the v1.1 milestone Jul 30, 2020
@goodmami
Member

Looks good in general. A few things:

  1. I would maybe go for a more general word for "dialect" to avoid political controversies. Maybe "variety"?
  2. Instead of a country code in "dialect" (or whatever we call it) and some specialization under "notes", can we combine them into one BCP 47 tag? This saves an attribute and it dovetails as a specialization of the lexicon's language attribute. E.g., en-GB-x-RP.
  3. Do we need the / in the transcription? I can't imagine a use for phonetic transcription ([...] vs /.../), so perhaps we can assume it is a phonemic transcription and drop the / characters?
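
To illustrate point 2, here is a rough sketch of how a combined tag such as en-GB-x-RP could be taken apart again. It is deliberately simplified and is not a full BCP 47 parser: it only looks for a language subtag, an optional region, and an optional private-use part after `-x-`. The function name `split_variety_tag` is hypothetical.

```python
def split_variety_tag(tag):
    """Split a tag like 'en-GB-x-RP' into language, region and private-use parts.

    Simplified on purpose: script, variant and extension subtags are ignored.
    """
    head, _, private = tag.partition("-x-")
    subtags = head.split("-")
    language = subtags[0].lower()
    # A region is either two letters (e.g. GB) or three digits (e.g. 419).
    region = next(
        (s.upper() for s in subtags[1:]
         if (len(s) == 2 and s.isalpha()) or (len(s) == 3 and s.isdigit())),
        None,
    )
    return {
        "language": language,
        "region": region,
        "private": private.split("-") if private else [],
    }


parts = split_variety_tag("en-GB-x-RP")
print(parts)
```

Because the language subtag comes first, a tool could check that `parts["language"]` matches the lexicon's language attribute before accepting the tag.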

@jmccrae
Member Author

jmccrae commented Jul 30, 2020

Hi.

  1. yes, this is a good point, variety is better than dialect
  2. we could do this. I guess that this would mean duplicating the language code, but this is okay
  3. I would guess that some would prefer a phonemic transcription. We could drop the / and have an attribute for phonemic transcriptions?

@fcbond
Member

fcbond commented Jul 31, 2020 via email

@lmorgadodacosta

Hi there,

We have been working/discussing this exact topic a tiny bit for Kristang -- we are hoping to provide IPA and voice recordings for individual lemmas soon.

The problem I'd like to raise here is that Kristang shows a lot of metathesis in certain consonant clusters, e.g. ‘-dr-’ (kodrah and kordah for ‘to wake up’). Within the context of revitalization, as we want people to start using a single spelling, we have decided it would be best to cluster these as "Forms" under a single lemma (i.e. the canonical form). However, these internal forms do have different pronunciations.

Up to this point we were happy to use the Tag element (available to both Forms and Lemmas) and come up with our own "category" notation. But I think including such a Pronunciation element is definitely an improvement. However, I would like to see what you all think about:

  1. making "Pronunciation" available to both Lemmas and Forms
  2. adding an explicit attribute to a sound file path

@goodmami
Member

  1. making "Pronunciation" available to both Lemmas and Forms

I think this makes sense. Actually, in a new Python-based wordnet module I'm working on, all lemmas are just forms anyway, so doing something different for lemmas and forms would be more trouble than doing the same thing (but this, at least, is just a selfish reason).

  2. adding an explicit attribute to a sound file path

I'm less enthused about this, but if we're adding logos (#3), then it's not breaking new ground to link to external files. However, shouldn't this be a URL instead of a file path? If a file path, then absolute paths won't work, and we'd need some kind of resource directory such that the paths are relative to this directory, or something. This sounds like over-engineering.

Better, perhaps, would be that your application provides a mapping from local paths into the ids of the wordnet. The trouble is that lemmas/forms do not have their own ids, so it would have to be linked to the LexicalEntry, then to the writtenForm under that entry (are forms guaranteed to be unique under a lexical entry?).

Another issue is when you want multiple audio files for the same lemma/form (e.g., from multiple speakers). It doesn't seem like an attribute for a file path or URL would easily scale to multiple files.
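
As a sketch of the application-side alternative described above, the mapping could be keyed by the LexicalEntry id plus the written form, with a list of recordings per key so that several speakers fit naturally. This assumes written forms are unique under an entry, which is exactly the open question raised above. The URLs and the name `add_recording` are hypothetical.

```python
from collections import defaultdict

# Forms have no ids of their own, so the key pairs the entry id
# with the written form. Each key maps to any number of recordings.
audio_index = defaultdict(list)


def add_recording(entry_id, written_form, url):
    """Register one recording URL for a lemma or form."""
    audio_index[(entry_id, written_form)].append(url)


# Placeholder URLs for illustration only.
add_recording("ewn-transport-n", "transport",
              "https://example.org/audio/transport-speaker1.ogg")
add_recording("ewn-transport-n", "transport",
              "https://example.org/audio/transport-speaker2.ogg")

recordings = audio_index[("ewn-transport-n", "transport")]
print(recordings)
```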

@1313ou

1313ou commented Jul 31, 2020

Implications: IPA symbols are not ASCII, so all tools must handle UTF8 (or whatever charset is defined as desired)

@jmccrae
Member Author

jmccrae commented Jul 31, 2020

  1. Yes, I had intended this to be available for Forms as well as Lemmas
  2. We could add a URL to the sound file if one is available. This would be useful for some projects even if others never use it
  3. I think UTF-8 is already required by the serialization. If someone wants a strict ASCII file they will have to use an ASCII based transcription scheme
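
A small sketch of the encoding point in item 3: an IPA transcription survives a UTF-8 round trip but cannot be written as strict ASCII. The helper name `is_ascii_safe` is hypothetical.

```python
def is_ascii_safe(text):
    """Return True if the text can be stored in a strict ASCII file."""
    try:
        text.encode("ascii")
    except UnicodeEncodeError:
        return False
    return True


transcription = "tɹænzˈpɔːt"

# UTF-8 encodes and decodes the IPA symbols without loss.
round_trip = transcription.encode("utf-8").decode("utf-8")

print(round_trip == transcription)
print(is_ascii_safe(transcription))
print(is_ascii_safe("transport"))
```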

@lmorgadodacosta

Sorry, by sound file path I definitely meant URL or some public URI.
I do kinda see the problem raised by Mike for multiple recordings of the same lemma/form... But if pronunciations are multiple elements, then you could just provide multiple Pronunciation elements. Individual projects could then use the notes attribute to keep information about the speaker, if necessary (e.g. male/female). But I would give a quick link to something that is very meaningful under the element we're discussing.
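
A rough sketch of that idea: one Pronunciation element per recording, with the speaker kept in notes. The attribute name `audio` is an assumption made for illustration (the thread has not settled on a name), and the URLs and speaker labels are placeholders.

```python
import xml.etree.ElementTree as ET

# (speaker label, recording URL) pairs; all values are placeholders.
speakers = [
    ("speaker-1", "https://example.org/audio/transport-1.ogg"),
    ("speaker-2", "https://example.org/audio/transport-2.ogg"),
]

lemma = ET.Element("Lemma", {"writtenForm": "transport", "partOfSpeech": "n"})
for speaker, url in speakers:
    # "audio" is a hypothetical attribute name for the sound file URL.
    pron = ET.SubElement(
        lemma, "Pronunciation",
        {"notation": "ipa", "notes": speaker, "audio": url},
    )
    pron.text = "/tɹænzˈpɔːt/"

xml_text = ET.tostring(lemma, encoding="unicode")
print(xml_text)
```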

@1313ou

1313ou commented Jul 31, 2020

To my knowledge the current state of EWN does not use characters that require coding outside ASCII, so the current files are both ASCII and UTF8. So the relevant tests are still to come.

@jmccrae jmccrae linked a pull request Aug 7, 2020 that will close this issue
@goodmami
Member

To my knowledge the current state of EWN does not use characters that require coding outside ASCII, so the current files are both ASCII and UTF8. So the relevant tests are still to come.

I was surprised by this and thought that surely things like jalapeño and résumé would have the diacritics in EWN, even if only as alternative forms, but found nothing but ascii throughout the whole file. In any case, there are non-English wordnets with plenty of non-ascii forms, so it would be unfortunate if any tools assumed wordnets to be ascii-only.

@goodmami
Member

goodmami commented Sep 1, 2020

Returning to the issue of marking a transcription as phonemic or phonetic... What do people think of keeping the IPA delimiters (/../ or [...]) in the actual transcription? When I suggested dropping the delimiters (See (3) in my comment above), I assumed we only cared about phonemic transcriptions. Since for non-IPA transcription the attribute may be irrelevant (or implicit given the notation attribute), perhaps the phonemic attribute is a bad idea. Furthermore, the IPA delimiters are shorter and may be clearer for someone familiar with IPA.
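
If the delimiters stay in the text, a tool could recover the phonemic/phonetic distinction without a separate attribute. A minimal sketch, with a hypothetical function name and with the bracketed string used only to show the delimiters, not as a real phonetic transcription:

```python
def classify_transcription(text):
    """Return (kind, bare transcription) based on IPA delimiters.

    /.../ is read as phonemic, [...] as phonetic, anything else as unmarked.
    """
    text = text.strip()
    if len(text) >= 2 and text.startswith("/") and text.endswith("/"):
        return "phonemic", text[1:-1]
    if len(text) >= 2 and text.startswith("[") and text.endswith("]"):
        return "phonetic", text[1:-1]
    return "unmarked", text


print(classify_transcription("/tɹænzˈpɔːt/"))
print(classify_transcription("[tɹænzˈpɔːt]"))
print(classify_transcription("tɹænzˈpɔːt"))
```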

jmccrae added a commit that referenced this issue Dec 15, 2020

5 participants