-
Notifications
You must be signed in to change notification settings - Fork 377
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
MSC2516: Add a new message type for voice messages #2516
Conversation
For reference: a previous attempt at defining voice messages (written before we had MSCs). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for your contribution. It seems like a fairly straightforward proposal.
As a general nit-picky comment, your lines are wrapped inconsistently -- some are short, and some are long. Please try to wrap them at ~80 characters.
typo Co-Authored-By: Hubert Chathi <hubert@uhoreg.ca>
I still have not completely figured out if it is best to strictly follow WhatsApp's example here, even from a UX perspective. I am going to collect some more examples on how different apps handle this. I'm also not sure where the semantics should be documented, i.e. how a voice message is expected to be recorded or rendered or auto-downloaded. That should be somewhat standardised across clients. Is the Client-Server-API spec the right place for that? |
I searched around a bit and included links to related issues in the proposal. |
Included a few comments from the PR
Added links.
Let's try this again... For reference: a previous attempt at defining voice messages (written before we had MSCs): #310 |
In general, we try not to mandate specifics of how clients must render content. I think it's OK to make suggestions, but since there are a variety of different clients using different interfaces, some UI may not make sense for some clients. (For example, if we said that a voice message must have a play button that the user could click on, this would not make sense for a text-based client that doesn't use the mouse.) But in general, feel free to just put these UI suggestions in the MSC for now, and we'll worry about where the right place to put them when it's time to put it in the spec itself. |
Nico (@deepbluev7:neko.dev) offered:
This is not how the spec currently reads (the example for m.audio is a song), but it would only require a minor spec change. At least Riot Web sends an m.audio event whenever an audio file is sent. You can't easily send an audio file as m.file right now. |
Added sentence about what it means to be a voice message. & used shorter line breaks
It would definitely make sense to roll the m.typing question from #310 into this one. |
IMO Flag is the way to go, it's still audio file and introducing new event type is not necessary, especially that you said those would basically be the same. |
I think a specific voice audio type is important to let bridges do their magic, so voice messages appear as native in both ends of a bridge. It might even require renaming or reencoding the file on some cases. |
Notice that both Whatsapp and Telegram distinguish between " read" (I saw I have a voice message) and "heard" (I listened to it). I think it is an important feature to be added here too. |
@@ -0,0 +1,66 @@ | |||
# Add a seperate message type for voice messages | |||
|
|||
In the matrix spec right now, there is a message type `m.audio` for audio files. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
link to https://matrix.org/docs/spec/client_server/r0.6.1#m-audio would be helpful here.
I don't find it useful in Telegram. I guess it would be fine if there's a way to ignore the listen state in the client. I find it really annoying in Telegram that the @ can't be dismissed without listening. |
Notifications control is certainly something we'd take a look at in another MSC. This MSC should cause the voice message to just be an unread message rather than a notification (by default) though. |
The point is a frequently don't listen to voice messages, and there would have to be a way to dismiss without listening. Of course if it's out of scope just ignore me. But overall I just don't really care about that stuff. |
That would be more of a client UX problem than a spec issue. The spec just defines how to throw the voice message over the network - it's up to the clients to decide how best to render that (with whatever dismiss functionality, etc) |
I propose to introduce a new message type `m.voice` with the same | ||
contents as `m.audio`. | ||
Voice messages MUST be OGG files, Opus encoded. Other files can be | ||
sent as `m.audio`or `m.file`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another thought here: can we add something about audio format, ie sample rate / channel count here? I think not mandating one is probably fine, but if so we should make it clear that clients should expect anything.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fwiw the more I think about this the more I draw a conclusion that the spec shouldn't care about a mandated set of values, but it should probably recommend some sane defaults (for those who just want to whack in some libraries and call it good). Clients expecting anything is fairly on-par with the latest directions of Matrix, anyhow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@uhoreg offers: | ||
> Auto-downloading of files (if clients follow WhatsApp's example) sounds | ||
like it could be a security issue. (e.g. DoS by using up users' bandwidth, | ||
could cause malicious content to be automatically downloaded) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this is a concern.
- The client should consider the size before deciding to download or cut off the download past an "acceptable size" (AKA cache the prefix of the file) if bandwidth usage is a concern. This can be applied to all media content.
- Most client will already auto-download things like images, so if the payload can be triggered just by downloading this adds no new attack surface. (Assuming that the file isn't decoded before play is pressed).
|
||
I propose to introduce a new message type `m.voice` with the same | ||
contents as `m.audio`. | ||
Voice messages MUST be OGG files, Opus encoded. Other files can be |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm starting to wonder about the choice of ogg here: it looks like webm might be edging out ogg as a more widely supported container format, plus all the common browsers can produce it natively (chrome can't mux into ogg). Would be good to at least note why we picked ogg.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would also prefer something more easily accessible on client platforms, which indeed seems to be WebM, at least for the web browser case. (I assume for native platforms, the various containers are roughly identical in implementation complexity.)
What's the reason for selecting Ogg? Could we use WebM instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd really like us to stick with opus for several reasons, some of which are already discussed in #matrix-spec (and I thought translated here, but apparently not):
- File size is small in nearly every case.
- Playback is supported on all modern environments - anything not modern is unlikely to be using voice messages (IoT) or will be unable to do other things anyways.
- It's what all the other platforms do, making it easy/trivial for compatibility. If we had to transcode like bridges already do for video then we've somewhat failed to maintain our interoperability flag. (Video is transcoded because remote networks use insane formats, but opus/ogg isn't that insane)
...and a couple more less important reasons that hurt to type on a phone, but that's the gist of it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be clear: I'm just questioning the container format, not the codec, ie. Opus in WebM, so the file size would be basically the same. Ogg does seem to be what other messaging platforms use though, so yeah, bridges would have to remux.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah, fair enough. We could probably get away with webm though I think we'd also have to have a good reason for going against the grain, imo
This could be solved by having clients handle auto-download responsibly, | ||
e.g. only auto-download voice messages from trusted contacts. | ||
|
||
## Unstable prefix |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ftr Element Web is going to use org.matrix.experimental.*
to see how this MSC can benefit from something like #1767
Co-authored-by: Richard van der Hoff <1389908+richvdh@users.noreply.github.com>
@@ -0,0 +1,66 @@ | |||
# Add a separate message type for voice messages |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Having worked with this for a while now, it seems it would in fact be best to go with an extensible events format on top of m.audio
. This might block the MSC behind extensible events, but the intention would be to land with an m.voice
event type that has content
containing m.message
, m.audio
, and m.voice
. As a fallback, implementations would use a regular m.room.message
event for msgtype: "m.audio"
and "m.voice": {}
in the content
(org.matrix.msc2516.voice
during unstable implementation).
Does this sound sane? If it's too different from what you're comfortable with, let me know and I can open a new MSC to describe it in detail.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've since split it out: #3245
What's the status here? Looks like something has already been implemented and released for Element Android. Is there more to be done for the spec? |
*Matrix devs. The MSC is done from the perspective of the foundation. |
I'm closing this MSC, it is superceded by #3245. I'm confident that the extensible events approach works at least as well as the one I proposed. I started this one almost a year ago, and soon felt in over my head. I am not involved with matrix development at all, so I didn't really know how to move this one forward. I'm glad that we're doing voice messages right, and I hope this MSC helped more than it delayed things. It sure was an interesting experience for me to dabble in the MSC process, so thanks for sticking with me! :) |
Thank you for the MSC :) As a first MSC, this one is well on the better side - it definitely helped lead towards an answer of some kind. |
Rendered
superceded by #3245
Signed-off by Ludwig Bald, hallo at ludwigbald.com