Add fields for text-to-speech #49

Merged
merged 7 commits into from
Mar 24, 2020

Conversation

LeoFrachet
Contributor

@LeoFrachet LeoFrachet commented Feb 16, 2017

Add text-to-speech fields for almost all *_name and *_headsign fields (exception is feed_info.feed_publisher_name), defined as:

Text-to-speech field - The field should contain the same information as its parent field (to which it falls back if it is empty). It is meant to be read aloud by text-to-speech engines, so abbreviations should either be spelled out ("St" should be written as "Street" or "Saint"; "Elizabeth I" should be "Elizabeth the first") or kept as-is when they are meant to be spoken as abbreviations ("JFK Airport" is said abbreviated).

The goal is to be able to be specific with text-to-speech tools (i.e. to avoid "5 Dr" being read as "five doctors") without having to remove all abbreviations from the stop_name.

Fields created are:

  • agency.tts_agency_name;
  • stops.tts_stop_name;
  • routes.tts_route_short_name;
  • routes.tts_route_long_name;
  • trips.tts_trip_headsign;
  • trips.tts_trip_short_name;
  • stop_times.tts_stop_headsign.
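The fallback rule in the definition above can be sketched from a consumer's point of view. This is only an illustration in Python: the sample rows and the helper name spoken_stop_name are made up for the example, and only the tts_stop_name / stop_name column names come from this proposal.

```python
import csv
import io


def spoken_stop_name(row: dict) -> str:
    """Return the text to hand to a TTS engine for one stops.txt row.

    Uses tts_stop_name when it is present and non-empty, otherwise
    falls back to the parent field stop_name.
    """
    tts = (row.get("tts_stop_name") or "").strip()
    return tts if tts else row["stop_name"]


# Hypothetical stops.txt content, for illustration only.
SAMPLE_STOPS = """stop_id,stop_name,tts_stop_name
1,Worcester St,Worcester Street
2,JFK Airport,
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_STOPS)))
spoken = [spoken_stop_name(r) for r in rows]
```

The same helper also works for feeds that do not have the column at all, since a missing tts_stop_name is treated like an empty one.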

Discussion about allowing SSML

Different options have been discussed, but the decision is to move forward with the text-only fields for now, and to open another PR about SSML later if needed. For future reference, the last option discussed was to add another field containing an ssml_id, defined in an SSML file with the relevant data, stored in the GTFS.

[Edited 2017-02-16 09:25 EST: add "Elizabeth the first" example]
[Edited 2017-02-16 13:41 EST: add discussion about name of field]
[Edited 2017-02-16 13:44 EST: add discussion about SSML]
[Edited 2017-02-20 09:25 EST: update discussion about SSML]
[Edited 2017-02-24 14:27 EST: update the votes]
[Edited 2017-02-24 15:28 EST: update the field names]
[Edited 2017-02-24 15:29 EST: close the conversation about SSML]
[Edited 2019-11-25 17:40 EST: update the field names(fix)]
[Edited 2019-11-26 15:30 EST: vote open]

@LeoFrachet
Contributor Author

Question: Should we say something about Roman Numerals? Aka about the fact that "Elizabeth I" should be written as "Elizabeth one" or "Elizabeth 1" in the readable field?

@abyrd

abyrd commented Feb 16, 2017

@leofr I would say roman numerals are an excellent example of why these fields are useful, and should be included in the text. But the readable name for "Elizabeth I" should be "Elizabeth the first" completely spelled out.

@LeoFrachet
Contributor Author

PR updated to include roman numeral.

@barbeau
Collaborator

barbeau commented Feb 16, 2017

Huge +1 for this proposal - this would help solve a number of issues with accessibility in mobile apps and for Voice UX such as Amazon Alexa.

For abbreviations, letters should be followed by periods. For example, in Tampa we have the "University Area Transit Center", which is currently abbreviated in headsigns as "North to UATC". OneBusAway Alexa currently reads this as "u-ackt". If the text is changed to "North to U.A.T.C.", Alexa properly says each letter.
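As a rough illustration of the "North to UATC" case, a producer could generate the dotted form from a list of known initialisms. This is a sketch only: the function name dot_letters and the idea of an agency-maintained set of initialisms are assumptions, and whether dotted letters are read correctly depends on the TTS engine and language.

```python
def dot_letters(text: str, initialisms: set) -> str:
    """Rewrite known initialisms so each letter is followed by a period.

    Only whole, space-separated words found in `initialisms` are changed;
    everything else is passed through untouched.
    """
    words = [
        "".join(f"{ch}." for ch in word) if word in initialisms else word
        for word in text.split(" ")
    ]
    return " ".join(words)


# Hypothetical agency-maintained list of initialisms.
headsign = dot_letters("North to UATC", {"UATC"})
```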

On this note, in my experience good text for Text-to-Speech (TTS) isn't always obvious. So, to help producers create good quality data for these fields, ideally I'd like to see us point to resources, specifically:

  1. Voice simulator website(s) - This would allow producers to paste in text, hit a button, and hear what that sounds like through a TTS engine. Ideally we'd suggest several major TTS engines (e.g., Google, Amazon, Apple).
  2. Best practices guide(s) for TTS - Documentation that shows DO's and DON'Ts for creating text for TTS. Again, ideally we'd suggest best practices for several from major TTS engines (e.g., Google, Amazon, Apple).

For Amazon Alexa, here are the above resources:

  1. Alexa Voice Simulator - Available within the Alexa Developer Console - unfortunately you need to create a developer account and create a sample skill to access it - https://developer.amazon.com/.
  2. Alexa - Using Text-to-Speech Effectively - https://developer.amazon.com/public/solutions/alexa/alexa-skills-kit/docs/alexa-skills-kit-voice-design-best-practices#using-text-to-speech-effectively

I've been looking for Google/Android and Apple/iPhone resources for the above, but haven't found anything yet. I'll update if I do.

Also, I've been having an internal debate on the name of these fields. I'm not sure if readable_* really communicates the contents of the field well. Maybe speech_* or tts_* or audio_*, if TTS/speech is the primary use case? Or maybe I'm over thinking it.

@leofr Do you have any producers in mind for this?

@abyrd

abyrd commented Feb 16, 2017

Agreed @barbeau I can certainly see why this is useful. However, things like following letters with periods seem very specific to a single language, and chasing the heuristics and behavior of specific speech platforms seems inadvisable. How commonly used is SSML? It seems well-suited to this application, and Amazon Alexa (for example) appears to understand it. https://developer.amazon.com/public/solutions/alexa/alexa-skills-kit/docs/speech-synthesis-markup-language-ssml-reference

Maybe we should just say this field can contain plain text or SSML, and that any angle-bracketed tags should be stripped out by consumers that don't support SSML.
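For a consumer without SSML support, stripping the tags could look roughly like the following Python sketch. The function name strip_ssml is hypothetical, and the sketch assumes the field value is a well-formed XML fragment; anything that does not parse is returned unchanged.

```python
import xml.etree.ElementTree as ET


def strip_ssml(value: str) -> str:
    """Return the plain text of a field value that may contain SSML tags."""
    if "<" not in value:
        return value  # plain text, nothing to strip
    try:
        # Wrap the value so that fragments with several top-level nodes parse.
        root = ET.fromstring(f"<wrapper>{value}</wrapper>")
    except ET.ParseError:
        return value  # not well-formed markup; leave it unchanged
    text = "".join(root.itertext())
    return " ".join(text.split())  # collapse whitespace left by removed tags


plain = strip_ssml(
    '<speak><phoneme alphabet="ipa" ph="ˈwʊstər">Worcester</phoneme> Station</speak>'
)
```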

I also agree that readable_ may not be the best prefix. Let's consider other options. speech_ and tts_ seem good to me. Or if we find that SSML is widely supported, maybe even ssml_.

@tesobota

Any benefit to migrating this concept to the existing proposal for the translations.txt file? As best I understand, this would eliminate the need for multiple "readable*.txt" files if any usage of the text "UATC" (in stop name, headsign, route name, etc.) were consistently 'translated' as "University Area Transit Center" (assuming an appropriate language tag could be designated for text-to-speech usage, the way 'fr' designates the French language).
https://developers.google.com/transit/gtfs/reference/gtfs-extensions

@abyrd

abyrd commented Feb 16, 2017

Another data point: Chrome TTS supports SSML: https://developer.chrome.com/apps/tts
I don't see any Apple speech APIs that support markup though.

@abyrd

abyrd commented Feb 16, 2017

@tesobota I think the proposal here is to add optional fields prefixed with readable_ not entirely new txt files.

@LeoFrachet
Contributor Author

LeoFrachet commented Feb 16, 2017

About field name:

There are indeed conversations about the naming of those fields. I do not have any strong opinion on that. I'll add the list to the PR so that everybody can vote.

About putting it in another file like the proposal for translations.txt:

If there is support for this option it is doable, but:

  • I'm not sure it would be worth it for stop_name and route_*_name, since they appear only once in the GTFS.
  • You can have multiple translations for a field, but you can have only one speakable version.
  • IMHO it could only be worth it for headsigns, to avoid duplicating data. But I tend to think it would be odd to factor out the speakable version of the headsign without factoring out the headsign itself, since there is a one-to-one relationship between them.

But if there is strong support for that, it is doable.

About SSML:

That could be done. It should indeed be optional, since we do not want to scare off small agencies or small GTFS consumers.

But:

  • Usually, in XML files, you put a link to the schema at the top. Would that information be put somewhere in the GTFS? Or would it be directly in the spec?
  • We have to keep in mind that putting XML data into a CSV field requires us to escape it, which means removing the newlines and doubling the double quotes. It will probably make the CSV less human-readable.

So I think we really have to weigh the pros and cons of mixing two different file formats. It's giving me the heebie-jeebies, I have to admit. IMHO it raises the question of putting it into a separate, purely XML file.
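To make the escaping concern concrete, here is a small Python sketch that writes one SSML value into a CSV row with the standard csv module and reads it back. The column name stop_name_ssml is only one of the names floated in this thread, not an adopted field, and the row itself is invented.

```python
import csv
import io

ssml = '<speak><phoneme alphabet="ipa" ph="ˈwʊstər">Worcester</phoneme> Station</speak>'

# Write a hypothetical stops.txt row containing the SSML value.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["stop_id", "stop_name", "stop_name_ssml"])
writer.writerow(["43", "Worcester Station", ssml])
csv_text = buffer.getvalue()

# The writer wraps the field in quotes and doubles every inner double quote,
# which is what makes the raw file harder to read and edit by hand.
print(csv_text)

# A CSV-aware reader recovers the original SSML unchanged.
rows = list(csv.reader(io.StringIO(csv_text)))
```

The round trip is lossless when both sides use a real CSV library; the risk is mostly with hand-edited files and naive split-on-comma parsers.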

@abyrd

abyrd commented Feb 16, 2017

On SSML: I think this can be explained pretty simply. People should not put angle brackets (or any other special symbols for that matter) in the text, only letters spelling out words, unless they know what SSML is and intentionally want to use it.

It seems to me that XML can be included in CSV with no problem. The main characters that would need to be escaped are CR/LF and commas, and line breaks are just whitespace that should be stripped out.

I guess the schema tags would be embedded right in the field, or we could assume a schema... I don't have a good answer for that one. But I'm a pretty firm believer that text to speech is always going to sound kind of awful if people don't use IPA or markup to give it more hints. Any serious use of text to speech deserves a little effort of this kind.

@abyrd

abyrd commented Feb 16, 2017

Oh yeah, of course the quotes in XML would be a mess.

@barbeau
Collaborator

barbeau commented Feb 16, 2017

Amazon Alexa definitely supports SSML (or at least a subset of it), as does the Google Assistant. I don't believe that Siri, Android TalkBack or Apple VoiceOver currently support SSML.

I'd suggest we figure out a way to support both plain text and SSML. SSML is the "right" way to do this, although like @leofr I get a bad feeling about SSML in a CSV file. Yes, it's possible, but messy and prone to errors - it definitely makes it less human-readable/editable. My feeling is that adoption of SSML by producers would be much lower and take much longer than a simpler plain text version of the field, both in terms of tooling needed to configure/output SSML to GTFS as well as the agency's data entry of the SSML information. In this case, I don't want perfect to be the enemy of good - even a plain text readable field would be better than the current state. So, I think allowing plain text as a first step is good, and that may encourage more rapid adoption and in turn more producers to eventually adopt SSML.

One way to support both would be to overload the same field, and if the field contains <speak>...</speak> it's SSML, if not it's plain text. Another option would be to have two fields, like tts_plain_text_* and tts_ssml_*.
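The "overloaded field" option could be detected on the consumer side roughly as follows. This is a sketch under the assumption that an SSML value is always wrapped in a single <speak> element; the function name classify_tts_value is hypothetical.

```python
import re

# Matches a value that starts with <speak> (optionally with attributes)
# and ends with </speak>.
_SPEAK_WRAPPER = re.compile(r"^<speak(\s[^>]*)?>.*</speak>$", re.DOTALL)


def classify_tts_value(value: str) -> str:
    """Return "ssml" if the value is wrapped in <speak>...</speak>, else "plain"."""
    return "ssml" if _SPEAK_WRAPPER.match(value.strip()) else "plain"


kinds = [
    classify_tts_value("North to U.A.T.C."),
    classify_tts_value("<speak>North to U.A.T.C.</speak>"),
]
```

With two separate fields (tts_plain_text_* and tts_ssml_*) no such sniffing would be needed, at the cost of an extra column.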

@DaveBarker

At first I thought that SSML in the fields would be great, but it has so much overhead and is designed to do so much more than we really need, I'm not sure it's the best solution to the problem, especially on a field-by-field basis.

Another approach would be to include plain text without abbreviations in the readable_ fields, and then include a link in feed_info to a Pronunciation Lexicon Specification (PLS) document. Such a document defines the pronunciation of certain words.

An example of this would be the pronunciation of the city or street of "Worcester" in Massachusetts (rhymes with rooster):
stop_name: Worcester St
readable_stop_name: Worcester Street
Excerpt from PLS:

  <lexeme>
    <grapheme>Worcester</grapheme>
    <phoneme>ˈwʊstər</phoneme>
  </lexeme>

By providing a stand-alone document we give agencies the ability to edit all the spoken text fields without having to use a markup language, and then (if they decide it's beneficial) to add a separate file that addresses pronunciation issues that arise.

Compared to SSML we might lose the ability to specify voice, reading speed, emphasis on particular words, and duration of a pause. We might also be unable to distinguish between different pronunciations of the same spelling within the same GTFS feed.

@barbeau
Collaborator

barbeau commented Feb 17, 2017

@DaveBarker I like the plain text + optional PLS solution - I think this would be adopted much faster by producers and consumers than SSML.

@abyrd

abyrd commented Feb 18, 2017

I would not say I'm "in favor" of embedded XML tags as is written in the description section of this ticket. With a few days' critical distance, the idea of embedding XML in CSV does seem kind of impractical or absurd to me. The core idea was that while spelling out words in detail would help with speech synthesis, it's stopping short of allowing really reliable automatic pronunciation. Maybe the ideal would be allowing international phonetic alphabet (IPA) unicode symbols but strangely most TTS systems don't seem to support that.

I think @barbeau is right that adoption of SSML would be very low. Whatever advanced methods we might allow for specifying pronunciation, plain text would still be a fallback, so why not start with that now and perhaps extend some day in the future.

@LeoFrachet
Contributor Author

LeoFrachet commented Feb 20, 2017

About the optional PLS, I think it is a good idea, but IMHO it shouldn't be a link; the PLS should be embedded in the GTFS zip for archiving purposes. The goal is that the zipped GTFS contains all the data it needs to be self-contained. But indeed, I would rather have another non-CSV file in the zip (in this case, an XML file) than escaped XML data in a CSV field.

I agree with @abyrd, maybe we could start with a plain text field (like stop_name_tts), and if we see a broad adoption and the need for something more, then we'll open the discussion again and add PLS or SSML on the side.

But if @DaveBarker says that he would use the PLS right away, as a GTFS producer, then we could start with plain text fields (like stop_name_tts) + PLS in the zip (text_to_speech.pls).

In the future, we could then talk about another field, which would be either inline SSML (stop_name_ssml) or an id (stop_name_ssml_id) pointing to an entity in an SSML file (names.ssml).

(@abyrd: sorry for having said you were "in favor" of inline SSML. I'm going to update the first message. I'm trying to keep it as a tl;dr for people who join the conversation and don't want to read everything.)

@barbeau
Collaborator

barbeau commented Feb 22, 2017

I agree with @leofr that the GTFS zip should contain the PLS file. Otherwise consumers are going to need to monitor more than one link to determine if something changed and they need to update their data.

I think we should start looking for producers for this proposal. If we can find one that's interested in putting together a PLS document (and a consumer that's able to consume it) then we can include that in the spec. If not, I propose that we just adopt the plain text field for now (assuming we get the producers and consumers), and we can revisit the PLS document when the need arises for a producer.

@leofr Did you have any producers in mind? I can reach out to a few as well.

@DaveBarker Would you be interested in producing the plain text? How about the PLS?

@DaveBarker

The MBTA could try adding speakable_ fields and a PLS file to our GTFS feed. It's not something we can prioritize at the moment, though. We can do it some time in the next 3 months. I expect us to focus the speakable_ field on numbers and specific cases. We use abbreviations on our stop names like St and Rd but I'm still not convinced that any TTS systems have trouble with any of the abbreviations we use (and I've used some that are quite basic), so I don't expect us to produce alternate versions of all our names with abbreviations spelled out.

Regarding the position of the PLS file, I'd still advocate for it to be a URL, and not a PLS file actually stored within GTFS. A PLS file for an agency can be used by clients that don't interact with the GTFS file, such as a screen reader for the agency's website. So it should be available outside of GTFS, and as an agency I'd prefer to host it in only one location.

@LeoFrachet
Contributor Author

My concern on hosting it on the website is the backward compatibility.

Example:
If you have a stop named "Worcester" with an odd pronunciation ˈwʊstər and then it is renamed "Elizabeth II", so that "Worcester" no longer appears anywhere in your current GTFS, are you gonna remove the "Worcester" pronunciation from the PLS?

But I agree that it is an edge case.

So I have the feeling we currently tend to agree on plain text field + PLS. With two options: URL or within the GTFS.

And I will ask in the GTFS-Slack if other GTFS producers want to be early adopters.

@barbeau
Collaborator

barbeau commented Feb 22, 2017

I'm interested in being a consumer for this with OneBusAway Android and Alexa. As a data point, it doesn't look like I'd be able to easily use the PLS, as neither platform directly supports it.

@barbeau
Collaborator

barbeau commented Feb 22, 2017

After digging a bit more, there will be some limits on how practical PLS will be for consumers. From what I can tell, neither Apple VoiceOver for iPhone nor Android TalkBack supports it (or SSML), and considering PLS dates back to 2008, I'm guessing we won't see much further adoption. So, these platforms would most likely be constrained to just the GTFS plain text readable fields.

Amazon Alexa and Google Assistant support SSML, but not PLS, so for these platforms the consumer would need to convert PLS to SSML.

So this PLS:

  <lexeme>
    <grapheme>Worcester</grapheme>
    <phoneme>ˈwʊstər</phoneme>
  </lexeme>

...would need to be converted to this SSML:

<speak>
    I love visiting <phoneme alphabet="ipa" ph="ˈwʊstər">Worcester</phoneme>. 
</speak> 

This isn't too horrible, but definitely requires additional logic in the app to build the dictionary of graphemes/phonemes and pre or post-process the text from the GTFS fields before it's fed to the TTS engine.
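A minimal Python sketch of that extra logic, assuming a simplified lexicon without the XML namespace and attributes a real PLS document declares, and assuming words are separated by single spaces. The function names load_lexicon and to_ssml are made up for the example.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

# Simplified, namespace-free lexicon used only for this illustration.
PLS_SAMPLE = """<lexicon>
  <lexeme>
    <grapheme>Worcester</grapheme>
    <phoneme>ˈwʊstər</phoneme>
  </lexeme>
</lexicon>"""


def load_lexicon(pls_xml: str) -> dict:
    """Build a {grapheme: IPA phoneme} dictionary from the lexicon."""
    lexicon = {}
    for lexeme in ET.fromstring(pls_xml).iter("lexeme"):
        phoneme = lexeme.findtext("phoneme")
        for grapheme in lexeme.findall("grapheme"):
            if grapheme.text and phoneme:
                lexicon[grapheme.text] = phoneme
    return lexicon


def to_ssml(text: str, lexicon: dict) -> str:
    """Wrap every word found in the lexicon in an SSML <phoneme> tag."""
    parts = []
    for word in text.split(" "):
        if word in lexicon:
            ph = quoteattr(lexicon[word])  # adds the surrounding quotes
            parts.append(f'<phoneme alphabet="ipa" ph={ph}>{escape(word)}</phoneme>')
        else:
            parts.append(escape(word))
    return "<speak>" + " ".join(parts) + "</speak>"


ssml = to_ssml("Worcester Station", load_lexicon(PLS_SAMPLE))
```

Words with punctuation attached (for example "Worcester.") would not match in this sketch; a real consumer would need proper tokenization.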

@LeoFrachet
Contributor Author

LeoFrachet commented Feb 22, 2017

@barbeau If you are saying that PLS is a language that is at the end of its lifetime, then maybe we could/should use SSML directly instead.

What about using a combination of plain text field (stop_name_tts) and SSML ids (stop_name_ssml_id), linking to a gtfsSpeakable.ssml being something like:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE [... todo ...]>
<gtfsSpeakables version="1.0">
  <gtfsSpeakable stopNameSsmlId="43">
    <speak>
      <phoneme alphabet="ipa" ph="ˈwʊstər">Worcester</phoneme> Station
    </speak> 
  </gtfsSpeakable>
  <gtfsSpeakable stopNameSsmlId="44">
    <speak>
      <phoneme alphabet="ipa" ph="ˈdʊstər">Dorcester</phoneme> Station
    </speak> 
  </gtfsSpeakable>
</gtfsSpeakables>

[Edited 2017-02-22 18:11 EST: update format of XML proposal]
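How a consumer might resolve such ids can be sketched in Python as follows. Everything here is hypothetical and mirrors the draft above rather than any adopted format: the gtfsSpeakables element names, the stop_name_ssml_id and stop_name_tts columns, and the helper names.

```python
import xml.etree.ElementTree as ET

# Trimmed-down copy of the proposed side file, without the DOCTYPE.
SPEAKABLES = """<gtfsSpeakables version="1.0">
  <gtfsSpeakable stopNameSsmlId="43">
    <speak>
      <phoneme alphabet="ipa" ph="ˈwʊstər">Worcester</phoneme> Station
    </speak>
  </gtfsSpeakable>
</gtfsSpeakables>"""


def load_speakables(xml_text: str) -> dict:
    """Map each stopNameSsmlId to the serialized <speak> element it contains."""
    table = {}
    for entry in ET.fromstring(xml_text).iter("gtfsSpeakable"):
        speak = entry.find("speak")
        if speak is not None:
            table[entry.get("stopNameSsmlId")] = ET.tostring(
                speak, encoding="unicode"
            ).strip()
    return table


def speech_for_stop(row: dict, speakables: dict) -> str:
    """Prefer SSML by id, then the plain-text TTS field, then stop_name."""
    ssml = speakables.get(row.get("stop_name_ssml_id", ""))
    if ssml:
        return ssml
    return row.get("stop_name_tts") or row["stop_name"]


table = load_speakables(SPEAKABLES)
with_ssml = speech_for_stop(
    {"stop_name": "Worcester Station", "stop_name_ssml_id": "43"}, table
)
plain = speech_for_stop(
    {"stop_name": "Main St", "stop_name_tts": "Main Street", "stop_name_ssml_id": ""},
    table,
)
```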

@barbeau
Collaborator

barbeau commented Feb 22, 2017

Yeah, I'm thinking that directly providing SSML probably makes more sense, as consumers would need to convert to that anyway.

@leofr IDs in the GTFS with a mapping to elements in a separate SSML file that contains the actual SSML would work. The only other option I can think of off the top of my head would be to embed the <phoneme alphabet="ipa" ph="ˈdʊstər">Dorcester</phoneme> Station directly into a stop_name_ssml field in the GTFS file, although that comes with all the caveats we've previously mentioned about XML in a CSV. (An aside for this - I'm thinking we probably only need the text that's within the <speak> tags (not the entire XML file) - it looks like that is what both Alexa and Google Assistant need as input. But, we still have the XML/CSV issues).

@DaveBarker Would @leofr's proposal above with IDs in GTFS field mapping to separate SSML file work for you?

@DaveBarker

> @DaveBarker Would @leofr's proposal above with IDs in GTFS field mapping to separate SSML file work for you?

Yes, that does look promising and we could produce it. As I'm only going by research into these existing standards and not direct experience with them, I can't say much more than that at this point.

@abyrd

abyrd commented Feb 23, 2017

I think the "readable" fields should be a separate proposal from the SSML/IPA/PLS mapping. It seems like we're getting ahead of ourselves with the latter debate. While the readable fields are likely to be adopted and used rapidly, I think it's much less likely that we'll see detailed pronunciation information commonly used, and it's still not clear what the proper format is.

In the event that references to external files are the chosen method, I would be in favor of that file being included directly inside the GTFS, not using external web links which can get out of sync with the data in the feed.

This does all seem needlessly complex to me though... I'm not sure why TTS solutions can't just read IPA mixed with plain text. The ideal solution would be to just allow "dʊstər station" in the tts-readable field. Can someone try feeding this raw unicode text into Alexa / Google to see what happens?

@gcamp
Contributor

gcamp commented Nov 26, 2019

Shouldn't change anything significant but would be nice to fix the conflict

+1 from Transit

@mgilligan

mgilligan commented Nov 26, 2019

+1, adding tts_stop_name. It is the only field with both a producer/consumer.

@timMillet
Contributor

conflict resolved (cc @gcamp)

@paulswartz
Contributor

+1

@ibi-group-team
Contributor

+1 (for IBI Group)

@flocsy
Contributor

flocsy commented Dec 2, 2019

+1 (Moovit)

@timMillet
Contributor

timMillet commented Dec 4, 2019

The voting period has ended and the TextToSpeech fields are now adopted!

5 votes in favor:

No abstentions and vetos.

@mgilligan

It may be too late but I am -1 for adding anything other than tts_stop_name. Without a producer, the other fields should never be added.

@LeoFrachet
Contributor Author

Should we allow fields other than tts_stop_name to be adopted, since they are not produced by anybody currently? It's a good question and we should discuss it.

I see three options.

Option 1: We adopt them all

Con: It's going against the process.
Pro: They are all working on the same pattern, so we adopt the pattern in all its occurrences, just like we adopt enum fields without requiring all the enum values to be used.

Option 2: We adopt only the fields which are used

Pro: It's what is described in the process
Con: It's slowly creating "shadow" specifications: if somebody needs to produce a tts_route_long_name and checks the spec, they will see that such a field "does not exist" and will not produce it, even if we have all agreed that it's a good idea (and even if Transit app already implemented its consumption). Even in the best-case scenario where they know about the proposal, it will require opening a vote every time another field is used, which will clutter the spec discussion.

Option 3: Using the "EXPERIMENTAL" flag used in GTFS-rt

Con: It will clutter the spec.
Pro: It will clearly state that if somebody wants to do it, this would be the way to do it. It keeps everything in the same place. Maybe we could put a time limit (1 or 2 years), and review at that point whether anybody is producing and consuming them, and if nobody is, consider dropping them.

I'm personally in favor of Option 3 in situations like the current one, where we agreed on a pattern. @mgilligan & all, what do you think we should do?

@barbeau
Collaborator

barbeau commented Dec 9, 2019

+1 for Option 3

@gcamp
Contributor

gcamp commented Dec 9, 2019

+1 for option 3

@ibi-group-team
Contributor

+1 for option 3 (for IBI Group)

@LeoFrachet
Contributor Author

Ok, so I assume the process is now to open a PR for a change to the process to officially allow EXPERIMENTAL for GTFS static, with a defined scope. @timMillet Could you work on that?

@LeoFrachet
Contributor Author

LeoFrachet commented Mar 24, 2020

We tried to come back with a proposal for defining experimental fields, but it's triggering a long list of questions (How long do they stay there? In which case do we adopt such fields?...), so the easiest seems to just adopt tts_stop_name. The vote already passed so we'll just add it.

@googlebot
Collaborator

All (the pull request submitter and all commit authors) CLAs are signed, but one or more commits were authored or co-authored by someone other than the pull request submitter.

We need to confirm that all authors are ok with their commits being contributed to this project. Please have them confirm that by leaving a comment that contains only @googlebot I consent. in this pull request.

Note to project maintainer: There may be cases where the author cannot leave a comment, or the comment is not properly detected as consent. In those cases, you can manually confirm consent of the commit author(s), and set the cla label to yes (if enabled on your project).

ℹ️ Googlers: Go here for more info.

@timMillet
Contributor

@googlebot I consent.

@googlebot
Collaborator

CLAs look good, thanks!

ℹ️ Googlers: Go here for more info.

@timMillet timMillet merged commit f14b910 into google:master Mar 24, 2020
@stevenmwhite
Contributor

We're (GMV Syncromatics) planning to add support as a producer for tts route info in the coming months.

This would be available for all agencies at https://gtfs-directory.syncromatics.com

We will follow the proposal as listed in this PR before the route-based fields were removed — unless anyone has issues with that proposal?

Once implemented, will open a new PR to add the fields.

@timMillet
Contributor

Good news! All good with that proposal!
Let me know if you'd like any help.

@laurentg

FYI we (Mecatran) support tts_* fields in our internal toolkit, as both producer and consumer. Fields most often used are route tts_short_name and route tts_long_name.
