Add fields for text-to-speech #49
Question: Should we say something about Roman Numerals? Aka about the fact that "Elizabeth I" should be written as "Elizabeth one" or "Elizabeth 1" in the readable field? |
@leofr I would say roman numerals are an excellent example of why these fields are useful, and should be included in the text. But the readable name for "Elizabeth I" should be "Elizabeth the first" completely spelled out. |
PR updated to include roman numeral. |
Huge +1 for this proposal - this would help solve a number of issues with accessibility in mobile apps and for Voice UX such as Amazon Alexa. For abbreviations, letters should be followed by periods. For example, in Tampa we have the "University Area Transit Center", which is currently abbreviated in headsigns as "North to UATC". OneBusAway Alexa currently reads this as "u-ackt". If the text is changed to "North to U.A.T.C.", Alexa properly says each letter. On this note, in my experience good text for Text-to-Speech (TTS) isn't always obvious. So, to help producers create good quality data for these fields, ideally I'd like to see us point to resources, specifically:
For Amazon Alexa, here are the above resources:
I've been looking for Google/Android and Apple/iPhone resources for the above, but haven't found anything yet. I'll update if I do. Also, I've been having an internal debate on the name of these fields. I'm not sure if @leofr Do you have any producers in mind for this? |
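As a rough illustration of the "letters followed by periods" workaround described above, a producer could post-process headsigns like this. This is only a sketch: the function name `dot_initialisms` and the list of initialisms are hypothetical, and whether a given TTS engine then spells out each letter is platform-specific behavior that would need to be checked per engine.

```python
import re

def dot_initialisms(text: str, initialisms: set[str]) -> str:
    """Rewrite known all-caps initialisms as dotted letters, e.g. UATC -> U.A.T.C."""
    def repl(match: re.Match) -> str:
        word = match.group(0)
        # Only touch words the producer has explicitly listed as initialisms.
        if word in initialisms:
            return ".".join(word) + "."
        return word
    # Candidate words: runs of two or more capital letters on word boundaries.
    return re.sub(r"\b[A-Z]{2,}\b", repl, text)

print(dot_initialisms("North to UATC", {"UATC"}))
```

An explicit allow-list is used here so that all-caps words that should be read as words are left alone.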
Agreed @barbeau I can certainly see why this is useful. However, things like following letters with periods seem very specific to a single language, and chasing the heuristics and behavior of specific speech platforms seems inadvisable. How commonly used is SSML? It seems well-suited to this application, and Amazon Alexa (for example) appears to understand it. https://developer.amazon.com/public/solutions/alexa/alexa-skills-kit/docs/speech-synthesis-markup-language-ssml-reference Maybe we should just say this field can contain plain text or SSML, and that any angle-bracketed tags should be stripped out by consumers that don't support SSML. I also agree that |
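To make the "strip angle-bracketed tags" fallback concrete, here is a minimal sketch of what a consumer without SSML support might do. The function name `strip_ssml` is hypothetical, and the regex approach assumes that plain-text values never contain literal angle brackets.

```python
import re
from html import unescape

# Matches any angle-bracketed tag such as <speak>, </speak>, or <break time="1s"/>.
_TAG = re.compile(r"<[^>]+>")

def strip_ssml(value: str) -> str:
    """Return plain text for consumers that do not support SSML."""
    text = _TAG.sub(" ", value)    # drop tags, leaving a space so words do not fuse
    text = unescape(text)          # turn entities such as &amp; back into characters
    return " ".join(text.split())  # collapse runs of whitespace

print(strip_ssml('<speak>North to <say-as interpret-as="characters">UATC</say-as></speak>'))
```

Values that are already plain text pass through unchanged, so a consumer could apply this to every field without first checking whether it holds SSML.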
Is there any benefit to migrating this concept to the existing proposal for the translations.txt file? As best I understand, this would eliminate the need for multiple "readable*.txt" files if any usage of the text "UATC" (in stop name, headsign, route name, etc.) were consistently 'translated' as "University Area Transit Center" (assuming an appropriate language tag could be designated for text-to-speech usage, just as 'fr' is the designator for French). |
Another data point: Chrome TTS supports SSML: https://developer.chrome.com/apps/tts |
@tesobota I think the proposal here is to add optional fields prefixed with |
About the field name: There are indeed conversations about the naming of those fields. I do not have a strong opinion on that. I'll add the list to the PR so that everybody can vote. About putting it in another file like the proposal for
|
On SSML: I think this can be explained pretty simply. People should not put angle brackets (or any other special symbols for that matter) in the text, only letters spelling out words, unless they know what SSML is and intentionally want to use it. It seems to me that XML can be included in CSV with no problem. The main characters that would need to be escaped are CR/LF and commas, and line breaks are just whitespace that should be stripped out. I guess the schema tags would be embedded right in the field, or we could assume a schema... I don't have a good answer for that one. But I'm a pretty firm believer that text to speech is always going to sound kind of awful if people don't use IPA or markup to give it more hints. Any serious use of text to speech deserves a little effort of this kind. |
Oh yeah, of course the quotes in XML would be a mess. |
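For what it's worth, here is a small sketch of what the quoting looks like in practice, using Python's standard csv module. The column name `tts_stop_name` and the SSML value are placeholders for illustration only. A CSV writer doubles every quote inside the field, which is valid but does show why hand-editing such a file would be error-prone.

```python
import csv
import io

# Hypothetical SSML value containing both double quotes and a comma.
ssml = '<speak>Elizabeth <say-as interpret-as="ordinal">1</say-as>, Queen</speak>'

buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["stop_id", "tts_stop_name"])
writer.writerow(["S1", ssml])

raw = buffer.getvalue()
print(raw)  # the SSML field is wrapped in quotes and its inner quotes are doubled

# Reading the text back with a CSV parser recovers the original markup.
rows = list(csv.reader(io.StringIO(raw)))
print(rows[1][1] == ssml)
```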
Amazon Alexa definitely supports SSML (or at least a subset of it), as does the Google Assistant. I don't believe that Siri, Android TalkBack or Apple VoiceOver currently support SSML. I'd suggest we figure out a way to support both plain text and SSML. SSML is the "right" way to do this, although like @leofr I get a bad feeling about SSML in a CSV file. Yes, it's possible, but messy and prone to errors - it definitely makes it less human-readable/editable. My feeling is that adoption of SSML by producers would be much lower and take much longer than a simpler plain text version of the field, both in terms of tooling needed to configure/output SSML to GTFS as well as the agency's data entry of the SSML information. In this case, I don't want perfect to be the enemy of good - even a plain text readable field would be better than the current state. So, I think allowing plain text as a first step is good, and that may encourage more rapid adoption and in turn more producers to eventually adopt SSML. One way to support both would be to overload the same field, and if the field contains |
At first I thought that SSML in the fields would be great, but it has so much overhead and is designed to do so much more than we really need, I'm not sure it's the best solution to the problem, especially on a field-by-field basis. Another approach would be to include plain text without abbreviations in the readable_ fields, and then include a link in feed_info to a Pronunciation Lexicon Specification (PLS) document. Such a document defines the pronunciation of certain words. An example of this would be the pronunciation of the city or street of "Worcester" in Massachusetts (rhymes with rooster):
By providing a stand-alone document we give agencies the ability to edit all the spoken text fields without having to use a markup language, and then (if they decide it's beneficial) to add a separate file that addresses pronunciation issues that arise. Compared to SSML we might lose the ability to specify voice, reading speed, emphasis on particular words, and duration of a pause. We might also be unable to distinguish between different pronunciations of the same spelling within the same GTFS feed. |
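To give a feel for the stand-alone lexicon idea, here is a minimal sketch that loads a PLS document into a lookup table. The lexicon content, the IPA transcription for "Worcester", and the function name `load_lexicon` are illustrative assumptions and are not taken from any real feed.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the W3C Pronunciation Lexicon Specification 1.0.
PLS_NS = "http://www.w3.org/2005/01/pronunciation-lexicon"

# Hypothetical lexicon; the IPA string is an illustrative guess.
PLS_DOC = """<lexicon version="1.0"
         xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
         alphabet="ipa" xml:lang="en-US">
  <lexeme>
    <grapheme>Worcester</grapheme>
    <phoneme>ˈwʊstər</phoneme>
  </lexeme>
</lexicon>"""

def load_lexicon(pls_xml: str) -> dict[str, str]:
    """Map each grapheme (written form) to its phoneme string."""
    ns = {"pls": PLS_NS}
    root = ET.fromstring(pls_xml)
    lexicon: dict[str, str] = {}
    for lexeme in root.findall("pls:lexeme", ns):
        phoneme = lexeme.findtext("pls:phoneme", namespaces=ns)
        # A lexeme may list several spellings that share one pronunciation.
        for grapheme in lexeme.findall("pls:grapheme", ns):
            if grapheme.text and phoneme:
                lexicon[grapheme.text] = phoneme
    return lexicon

print(load_lexicon(PLS_DOC))
```

Because the lookup is keyed on spelling alone, this also shows the limitation mentioned above: two different pronunciations of the same spelling cannot be told apart.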
@DaveBarker I like the plain text + optional PLS solution - I think this would be adopted much faster by producers and consumers than SSML. |
I would not say I'm "in favor" of embedded XML tags as is written in the description section of this ticket. With a few days' critical distance, the idea of embedding XML in CSV does seem kind of impractical or absurd to me. The core idea was that while spelling out words in detail would help with speech synthesis, it's stopping short of allowing really reliable automatic pronunciation. Maybe the ideal would be allowing international phonetic alphabet (IPA) unicode symbols but strangely most TTS systems don't seem to support that. I think @barbeau is right that adoption of SSML would be very low. Whatever advanced methods we might allow for specifying pronunciation, plain text would still be a fallback, so why not start with that now and perhaps extend some day in the future. |
About the optional PLS, I think it is a good idea, but IMHO it shouldn't be a link; the PLS should be embedded in the zip of the GTFS for archiving purposes. The goal is that the zipped GTFS contains all the data it needs to be self-meaningful. But indeed, I would rather have another non-CSV file in the zip (in this case, an XML file) than escaped XML data in a CSV field. I agree with @abyrd, maybe we could start with a plain text field (like But if @DaveBarker says that he would use the PLS right away, as a GTFS producer, then we could start with plain text fields (like In the future, we would then discuss another field which would be either inline SSML ( (@abyrd: sorry to have said you were "in favor" of inline SSML. I'm going to update the first message. I'm trying to keep it as a |
I agree with @leofr that the GTFS zip should contain the PLS file. Otherwise consumers are going to need to monitor more than one link to determine if something changed and they need to update their data. I think we should start looking for producers for this proposal. If we can find one that's interested in putting together a PLS document (and a consumer that's able to consume it) then we can include that in the spec. If not, I propose that we just adopt the plain text field for now (assuming we get the producers and consumers), and we can revisit the PLS document when the need arises for a producer. @leofr Did you have any producers in mind? I can reach out to a few as well. @DaveBarker Would you be interested in producing the plain text? How about the PLS? |
The MBTA could try adding speakable_ fields and a PML file to our GTFS feed. It's not something we can prioritize at the moment, though. We can do it some time in the next 3 months. I expect us to focus the speakable_ field on numbers and specific cases. We use abbreviations on our stop names like St and Rd but I'm still not convinced that any TTS systems have trouble with any of the abbreviations we use (and I've used some that are quite basic), so I don't expect us to produce alternate versions of all our names with abbreviations spelled out. Regarding the position of the PML file I'd still advocate for it to be a URL, and not a PML actually stored within GTFS. A PML file for an agency can be used by clients that don't interact with the GTFS file, such as a screen reader for the agency's website. So it should be available outside of GTFS, and as an agency I'd prefer to host it in only one location. |
My concern on hosting it on the website is the backward compatibility. Example: But I agree that it is an edge case. So I have the feeling we currently tend to agree on plain text field + PML. With two options: URL or within the GTFS. And I will ask in the GTFS-Slack if other GTFS producers want to be early adopters. |
I'm interested in being a consumer for this with OneBusAway Android and Alexa. As a data point, it doesn't look like I'd be able to easily use the PML, as neither platform directly supports it. |
After digging a bit more, there will be some limits on how practical PML will be for consumers. From what I can tell, neither Apple VoiceOver for iPhone nor Android TalkBack supports it (or SSML), and considering PML dates back to 2008, I'm guessing we won't see much further adoption. So, these platforms would most likely be constrained to just the GTFS plain text readable fields. Amazon Alexa and Google Assistant support SSML, but not PML, so for these platforms the consumer would need to convert PML to SSML. So this PML:
...would need to be converted to this SSML:
This isn't too horrible, but definitely requires additional logic in the app to build the dictionary of graphemes/phonemes and pre or post-process the text from the GTFS fields before it's fed to the TTS engine. |
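The extra consumer-side logic described above might look roughly like the following. This is a sketch only: the grapheme/phoneme dictionary is assumed to have been built from the lexicon file already, the IPA string is an illustrative guess, and `to_ssml` is a hypothetical helper name.

```python
import re
from xml.sax.saxutils import escape, quoteattr

def to_ssml(text: str, lexicon: dict[str, str]) -> str:
    """Wrap every known grapheme in an SSML <phoneme> element."""
    if not lexicon:
        return "<speak>" + escape(text) + "</speak>"
    # Try longer graphemes first so multi-word entries win over their prefixes.
    words = sorted(lexicon, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(w) for w in words) + r")\b")
    parts = []
    last = 0
    for match in pattern.finditer(text):
        parts.append(escape(text[last:match.start()]))  # plain text before the match
        word = match.group(1)
        parts.append(
            f'<phoneme alphabet="ipa" ph={quoteattr(lexicon[word])}>'
            f"{escape(word)}</phoneme>"
        )
        last = match.end()
    parts.append(escape(text[last:]))  # remaining plain text after the last match
    return "<speak>" + "".join(parts) + "</speak>"

print(to_ssml("Worcester St & Main", {"Worcester": "ˈwʊstər"}))
```

The plain-text segments are XML-escaped so that characters such as "&" in a stop name do not produce invalid SSML.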
@barbeau If you are saying that PML is a language that is at the end of its lifetime, then maybe we could/should use SSML directly instead. What about using a combination of plain text field (
[Edited 2017-02-22 18:11 EST: update format of XML proposal] |
Yeah, I'm thinking that directly providing SSML probably makes more sense, as consumers would need to convert to that anyway. @leofr IDs in the GTFS with a mapping to elements in a separate SSML file that contains the actual SSML would work. The only other option I can think of off the top of my head would be to embed the @DaveBarker Would @leofr's proposal above with IDs in GTFS field mapping to separate SSML file work for you? |
Yes, that does look promising and we could produce it. As I'm only going by research into these existing standards and not direct experience with them, I can't say much more than that at this point. |
I think the "readable" fields should be a separate proposal from the SSML/IPA/PML mapping. It seems like we're getting ahead of ourselves with the latter debate. While the readable fields are likely to be adopted and used rapidly, I think it's much less likely that we'll see detailed pronunciation information commonly used, and it's still not clear what the proper format is. In the event that references to external files are the chosen method, I would be in favor of that file being included directly inside the GTFS, not using external web links which can get out of sync with the data in the feed. This does all seem needlessly complex to me though... I'm not sure why TTS solutions can't just read IPA mixed with plain text. The ideal solution would be to just allow "dʊstər station" in the tts-readable field. Can someone try feeding this raw unicode text into Alexa / Google to see what happens? |
Shouldn't change anything significant, but it would be nice to fix the conflict. +1 from Transit |
+1, adding |
conflict resolved (cc @gcamp) |
+1 |
+1 (for IBI Group) |
+1 (Moovit) |
The voting period has ended and the TextToSpeech fields are now adopted! 5 votes in favor:
No abstentions or vetoes. |
It may be too late but I am -1 for adding anything other than |
Should we allow other fields that
I see three options.
Option 1: We adopt them all
Con: It's going against the process.
Option 2: We adopt only the fields which are used
Pro: It's what is described in the process
Option 3: Using the "EXPERIMENTAL" flag used in GTFS-rt
Con: It will clutter the spec.
I'm personally in favor of Option 3 in situations like the current one, where we agreed on a pattern. @mgilligan & all, what do you think we should do? |
+1 for Option 3 |
+1 for option 3 |
+1 for option 3 (for IBI Group) |
OK, so I assume the process is now to open a PR to change the process to officially allow EXPERIMENTAL for GTFS static, with a defined scope. @timMillet Could you work on that? |
We tried to come back with a proposal for defining experimental fields, but it's triggering a long list of questions (How long do they stay there? In which case do we adopt such fields?...), so the easiest seems to just adopt |
All (the pull request submitter and all commit authors) CLAs are signed, but one or more commits were authored or co-authored by someone other than the pull request submitter. We need to confirm that all authors are ok with their commits being contributed to this project. Please have them confirm that by leaving a comment that contains only Note to project maintainer: There may be cases where the author cannot leave a comment, or the comment is not properly detected as consent. In those cases, you can manually confirm consent of the commit author(s), and set the ℹ️ Googlers: Go here for more info. |
@googlebot I consent. |
CLAs look good, thanks! ℹ️ Googlers: Go here for more info. |
We're (GMV Syncromatics) planning to add support as a producer for tts route info in the coming months. This would be available for all agencies at https://gtfs-directory.syncromatics.com We will follow the proposal as listed in this PR before the route-based fields were removed — unless anyone has issues with that proposal? Once implemented, will open a new PR to add the fields. |
Good news! All good with that proposal! |
FYI we (Mecatran) support |
Add text-to-speech fields for almost all *_name and *_headsign fields (exception is feed_info.feed_publisher_name), defined as:
The goal is to be able to be specific with text-to-speech tools (aka avoiding "5 Dr" being read as "five doctors") without having to remove all abbreviations in the stop_name.
Fields created are:
agency.tts_agency_name;
stops.tts_stop_name;
routes.tts_route_short_name;
routes.tts_route_long_name;
trips.tts_trip_headsign;
trips.tts_trip_short_name;
stop_times.tts_stop_headsign.
Discussion about allowing SSML
Different options have been discussed, but the decision is to move forward with the text-only fields for now, and to open another PR later if needed about SSML. For future reference, the last option discussed was to add another field containing an ssml_id, defined in an SSML file with the relevant data, stored in the GTFS.
[Edited 2017-02-16 09:25 EST: add "Elizabeth the first" example]
[Edited 2017-02-16 13:41 EST: add discussion about name of field]
[Edited 2017-02-16 13:44 EST: add discussion about SSML]
[Edited 2017-02-20 09:25 EST: update discussion about SSML]
[Edited 2017-02-24 14:27 EST: update the votes]
[Edited 2017-02-24 15:28 EST: update the field names]
[Edited 2017-02-24 15:29 EST: close the conversation about SSML]
[Edited 2019-11-25 17:40 EST: update the field names(fix)]
[Edited 2019-11-26 15:30 EST: vote open]
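For consumers, the tts_ fields described above could be read with a simple fallback to the regular name when no spoken form is provided. This is a minimal sketch under stated assumptions: the sample rows are made up, `spoken_stop_names` is a hypothetical helper, and a real consumer would read stops.txt from the feed zip rather than from a string.

```python
import csv
import io

# Made-up sample of stops.txt with the optional tts_stop_name column.
STOPS_TXT = """stop_id,stop_name,tts_stop_name
S1,Elizabeth I,Elizabeth the first
S2,Main St,
"""

def spoken_stop_names(stops_csv: str) -> dict[str, str]:
    """Return the text to hand to a TTS engine for each stop_id."""
    names: dict[str, str] = {}
    for row in csv.DictReader(io.StringIO(stops_csv)):
        # The column is optional, so it may be missing entirely or left empty.
        tts = (row.get("tts_stop_name") or "").strip()
        names[row["stop_id"]] = tts or row["stop_name"]
    return names

print(spoken_stop_names(STOPS_TXT))
```

The same pattern would apply to the other tts_ fields, each falling back to the field it shadows.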