Updated: Data Model and Relay (Post Interim Updates) #95
suhasHere wants to merge 1 commit into moq-wg:main from suhasHere:dm-relays-interim
Conversation
huitema
left a comment
I entered a few comments, but I think this is a good addition to the spec.
| * A single group is mapped to the entire track, thus spanning its lifetime. Each object maps to a slice of that.
| * Each audio frame is mapped to a group. In this grouping, each group has a single audio frame as the object.
I am concerned that these examples are too generic, and miss the important property of considering a group boundary as a synchronization point. Please add something about groups and synchronization points.
Could we get away with sync points (specified in metadata), so we would NOT need the groups concept. Seems a nice simplification (if possible)
Sync Points enable multiple functionalities
- Join in at the right boundary
- Implement necessary congestion response at the group boundaries in relays
- Cache strategies across group boundaries
Sync Points are necessary for a new client to join at the right point.
Sorry, I think you understood the opposite of what I wanted to say. Let me rephrase:
Can we just define sync points in MOQ, and just use that to indicate group boundaries. If yes, then perhaps we would NOT need to define groups, and that seems like a nice simplification.
Groups provide a generic framework for anyone building things like sync points. I feel that at the MoQ transport layer, we need to build tools with properties on which applications can innovate. Groups as defined today are not necessarily complicated. I would like to understand the concern further.
| Different applications will organize the user experience in different ways. For example, a conferencing application will let participants send and receive audio and video streams from each other, as well as other media streams, such as maybe a demonstration video, or sharing of a participant's computer screen. The number of active media streams in a conference will often vary over time, as new participants "get the floor" or start sharing screens. A broadcast application may provide a set of video streams presenting different views of an event, the corresponding sound tracks, and perhaps a running commentary. A virtual reality application will have its own set of media streams related to photorealistic rendering and mapped textures. In some cases, audio streams will be available in several languages, or subtitle streams in different languages may complement the original videos and audio streams.
| A Composition is a collection of multiple media tracks that may or may not belong to a single emission, and thus may not be scoped to a single origin.
| The only assumption required by the MoQ transport is that users can select the tracks that they want to receive, so they can subscribe to these tracks using MoQ. If a media stream is available in multiple formats or multiple languages, we expect that the catalog will provide sufficient information to let subscribers choose and request
| the appropriate track. We will discuss later how subscribers who encounter congestion could, for example, unsubscribe from the high definition track of a video media and subscribe instead to a lower definition track, or maybe decide to forgo the video experience and simply receive the audio track.
I would prefer to say "subscribers encountering congestion" than using "who", which implies that subscribers are human beings.
This seems to describe client-side ABR; are we thinking about server-side ABR too?
Perhaps we should allow both ABR types in the protocol (more complex, but more flexible too)
Client side ABR is pull based. Server side ABR is push based. To support push-based, we need to extend the current pub/sub model.
@VMatrix1900 you are totally right! Then my question is: What mode do we want MOQ to support on delivery side? push (server -> consumer/player ), pull ( server <- consumer / player), both?
As much as I'd like to have relays more involved in media flows, I feel it's a can of worms. We don't want to end up with relays eventually needing to look like an SFU, MCU, or Mixer.
As of today, Relays are pure store and forward engines, plus some authz stuff for keeping things flowing. That scales and distributes well.
The more application logic goes in, the more we would be asked for things like that. For the video conferencing example, I can imagine an SFU talking to a relay to do smart switching (active speaker), while relays don't need to care about any such details in the payload header. Server-side ABR is a very similar use case, and it would be nice to move that job to an appropriate application server.
My 2 cents
I think we need to define the protocol to work for simple relays (store and forward) as @suhasHere suggests, but not preclude extensions that could do "smarter" things at relays like server-side ABR.
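To make the pull-based (client-side) ABR idea above concrete, here is a minimal sketch of a subscriber switching its video subscription under congestion: unsubscribe from the high-definition track, subscribe to a lower one, or forgo video entirely and keep audio. The `session` API and the 20% headroom factor are hypothetical illustrations, not anything from the draft.

```python
# Sketch of client-side (pull-based) ABR over a pub/sub transport.
# All names (session.subscribe/unsubscribe, current_video) are hypothetical.

def adapt(session, measured_kbps, variants):
    """Pick the highest-bitrate variant that fits the measured throughput.

    variants: list of (track_id, kbps), sorted by descending bitrate.
    Returns the chosen track id, or None if video is dropped entirely.
    """
    for track_id, kbps in variants:
        if kbps <= measured_kbps * 0.8:  # keep ~20% headroom (arbitrary)
            if session.current_video != track_id:
                session.unsubscribe(session.current_video)
                session.subscribe(track_id)
                session.current_video = track_id
            return track_id
    # Nothing fits: forgo the video experience, keep the audio track.
    if session.current_video is not None:
        session.unsubscribe(session.current_video)
        session.current_video = None
    return None
```

Note that all the switching logic lives in the client; the relay stays a pure store-and-forward engine, as argued above.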
| Some media publications clearly separate how the content is uploaded to a "content management center" (CMS) and then how that content is broadcast to subscribers. In that model, subscribers can use the MoQ transport to obtain media streams from the CMS acting as "origin", while uploading the content could use an entirely different "ingress" system. Some other media experiences are more symmetric. For example, in a video conference, participants may publish their own video and audio tracks. These tracks will be "published" by the participants, acting as publishers.
| Publishers MAY be configured to publish the objects to a relays based on the application configuration and topology. Publishing a track through the relay starts with a "publish" transaction that describes the track identifier. That transaction will have to be authorized by the origin, using mechanisms similar to authorizing subscriptions.
... objects to relays ... (delete "a")
|
Thank you for writing this up! This looks promising. I'll try to finish my review comments on Monday, but before that, I wanted to ask for a few things that should simplify reviewing this (as the PR is quite large as it currently is):
@vasilvv on splitting into multiple PRs: I am happy to do so, as that was how it was done originally. But there was confusion about terminology between the PRs, and I heard from a few reviewers that it was hard to go back and forth between them.
kixelated
left a comment
My high level feedback is that everything is a suggestion. You could do this, or you could do that, or you could do this, or you could do that. Nothing is concrete and there are no useful properties or assurances.
I really think the relay text should be split into a separate document. It outlines a possible architecture, a possible way of encoding IDs, a possible way of building the application, etc. None of these really have any impact on the transport protocol and it suffers from CAN syndrome.
The data model stuff has some potential but again, there needs to be stronger guarantees. For example the concept of a GROUP absolutely needs to have a property otherwise it should not exist. Saying "the application can use groups for whatever it wants" is meaningless.
| ## Media Streams, Tracks, Objects, Emissions
| *A Warp broadcast* is a collection of multiple media tracks produced by a single origin. When subscribing to a broadcast, a peer has the option of subscribing to one, many, or all media tracks within the broadcast.
| When discussing the user experience, we focus on media streams, such as, for example, the view of a participant in a live broadcast or a video conference. However, the media transport over QUIC does not directly operate on the "abstract" view of a participant, but on an encoding of that view, suitable for transport over the Internet. In what follows, we call that encoding a "Track".
Not a fan of this paragraph, especially for non-media folks.
It should be more to the point, something like: "A track is an individual media encoding. The application synchronizes and renders multiple tracks. ex. audio/video, multiple participants, etc."
It also uses the "media stream" terminology which sounds like it's another term for "track". It's fair to disambiguate ("track is known as XYZ in RFC ABC"), but the terminology is quite fragmented.
Same comment goes for the above paragraph.
Agree, we should remove all occurrences of media stream.
| A track is a transform of a media stream using a specific encoding process, a set of parameters for that encoding, and possibly an encryption process. The MoQ transport is designed to transport tracks.
| As an example, consider a scenario where `example.org` hosts a simple live stream that anyone can subscribe to. That live stream would be a single Warp broadcast identified by the URL `https://example.org/livestream`. In the simplest implementation, it would provide only two media tracks, one with audio and one with video. In more complicated scenarios, it could provide multiple video formats of different levels of video quality; those tracks would be variants of each other. Note that the track IDs are opaque on the Warp level; if the player has not received the description of media tracks out of band in advance, it would have to request the broadcast description first.
| Tracks are identified within MoQ transport by their TrackIds, which can be encoded in one of the following ways:
No encoding in the object model...
Maybe it's a poor choice of words. The intent was to say "the application can choose to group objects in different ways for a given track".
I think he means to say that this section should offer only concepts and not specify anything about how these concepts are transmitted on the wire.
Yeah sorry, I should have elaborated, I meant what @afrind said.
I think the object model should be a summary of the protocol. Here are some high level components and how they interact. Specifics should go into their own sections further down the document, like encoding should be near the bottom in the messages section.
Agreed. I am happy to make that change once we get to concepts in the object model
| The binary content of a track is composed of a set of objects. The decomposition of the track into objects is arbitrary. For real time applications, an object will often correspond to a unit of capture, such as, for example, the encoding of a single video frame, but different applications may well group several such units together, or follow whatever arrangement makes sense for the application.
| The objects that compose a given track are organized as a series of "groups", each containing a series of objects. The scope and granularity of the grouping of objects is application defined and controlled. Some examples of how this grouping might be defined:
What are the properties of a group? It can't be this vague otherwise it's useless; literally just some metadata attached by the application.
If we follow the GoP example, then objects within a group can only depend on prior objects in a group. Warp uses streams to go a step further; objects within a group depend on all prior objects within a group.
This is in line with @huitema's comment as well. I will add text to explain the group properties.
My comment is along the same line -- what are the basic properties of an "object" and a "group"? Is an object atomic -- that is to say, it is only useful if the entire object is present, and a partial object has no value? Objects have metadata -- can groups also have metadata? Are groups atomic -- eg: can a receiver use a group that is missing its tail, or missing a section in the middle?
There's also a question below about "cacheability". To say something is cacheable to me also implies that it has a name or identifier - is this true of objects? Groups?
Who decides what a group is? Is it purely the publisher's decision, or can relays "regroup" objects? Another way to phrase it - are groups end-to-end or hop-by-hop? I think end-to-end but it might be good to make that explicit.
Do groups have a fixed length or can they be "appended" over time?
It would be possible to show something like the following, or add some examples. IDK for others, but for me it is difficult to follow these vague specs:
Assuming streamId, server, intention, and params are in the WT session, we could do something like:
StreamID: Unique in the app domain: 12345
intention: Ingest or delivery
params: vDesiredBufferSize, aDesiredBufferSize, rewindMs, etc
trackId/groupId/elementId
trackId: h264360p2Mbps, aac32k
groupId: Elements that depend on each other? : 1 OR 2 Or... [GUID] (ex: GOPS)
elementId: 1, 2, or [GUID] (Ex: frames)
PS: Ideally groupId and elementId would be monotonically increasing, providing a simple way to find the next item to send/fetch (implementations should NOT rely on that; gaps can happen)
WT session:
https://[HOST]:[PORT]/[APPID]/[streamID]?[params]
example:
https://fblive.com:4433/moq-ingest/12345
https://fblive.com:4433/moq-delivery/12345?vj=2000&aj=2000&rw=0
My comment is along the same line -- what are the basic properties of an "object" and a "group"? Is an object atomic -- that is to say, it is only useful if the entire object is present, and a partial object has no value? Objects have metadata -- can groups also have metadata? Are groups atomic -- eg: can a receiver use a group that is missing its tail, or missing a section in the middle?
[Suhas] Objects correspond to an encoded and encrypted video frame. Yes, if the entire p-frame or idr-frame is not received (say we lost some packets, though I am not sure how that would happen if each video frame is sent on a stream), it is still usable, depending on the decoder. Some HW decoders choke, but most of the SW decoders I have tried can deal with it to some decent extent.
[Suhas] "can a receiver use a group that is missing its tail, or missing a section in the middle?" --> this is purely a user experience choice. Applications can mark objects within a group with relative priorities (say a 60hz p-frame is less important than a 30hz p-frame in a 2-temporal-layer group). Relays can choose to drop the less important ones to satisfy the congestion response. Since groups represent a sync point, with the IDR frame being object-0, clients can render video to a certain quality if the IDR is received, but the experience will be impacted if there are drops regardless. If each object in a group is sent in its own stream and certain streams are closed based on application decision (like RUSH), yes, one can expect certain objects to be missing from the group. OTOH, if a group is mapped to a QUIC stream and all the objects in the current group (GOP) are sent on that QUIC stream (like WARP), one would get the atomic property. This is the application's choice, and we shouldn't mandate a decision for the application. I should be able to use WARP mode or RUSH mode, and the transport should provide me the necessary tools.
There's also a question below about "cacheability". To say something is cacheable to me also implies that it has a name or identifier - is this true of objects? Groups?
[Suhas] MoQ Objects are cached. GroupId, TrackId, and other header metadata provide the materials to key the object lookup in the cache.
Who decides what a group is? Is it purely the publisher's decision, or can relays "regroup" objects? Another way to phrase it - are groups end-to-end or hop-by-hop? I think end-to-end but it might be good to make that explicit.
Do groups have a fixed length or can they be "appended" over time?
[Suhas] Groups for video will be IDR transitions, so the GroupId increases on every IDR generation, with the ObjectId inside each group starting from 0 and increasing until the next IDR. Depending on the size of the IDR interval, objects carry the same GroupId under a given IDR.
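The GroupId/ObjectId numbering described above (GroupId incremented at every IDR, ObjectId restarting at 0 within each group) can be sketched as follows. The frame representation and field names are illustrative assumptions, not anything from the spec.

```python
# Sketch of publisher-side group/object numbering for video, per the
# description above: an IDR frame starts a new group and is always
# object 0 of that group. Frame tuples are a hypothetical representation.

def number_frames(frames):
    """frames: iterable of (frame_type, payload) tuples.
    Yields (group_id, object_id, payload)."""
    group_id, object_id = -1, 0
    for frame_type, payload in frames:
        if frame_type == "IDR":
            group_id += 1   # new sync point starts a new group
            object_id = 0   # IDR is always object 0 of its group
        yield group_id, object_id, payload
        object_id += 1
```

A relay or cache can then key objects by (TrackId, GroupId, ObjectId), as noted above.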
I appreciate the clear distinction that frame == OBJECT.
But of course I would strongly advocate keeping OBJECT agnostic to the media fragmentation. It's some bytes in a media container and that's all the relay needs to know. The underlying media container is responsible for delimiting frame boundaries.
Forcing media to be fragmented at frame boundaries hurts backwards compatibility with HLS/DASH (the edge must parse) and explodes the number of streams required. The only benefit is that an OBJECT has an explicit size (atomic). I don't think this is a useful property, especially considering that streams have unbounded sizes, and certainly not a reason to narrow the functionality of the protocol.
Personally, I think dropping non-reference frames is a bait (temporal scalability). Everybody has the idea at some point (myself included) but I don't think it stacks up logically or experimentally. The idea is that you encode media at a X% higher bitrate to compensate for the lower quality, so you can drop up to Y% of the bitrate during congestion to degrade the frame rate rather than drop the tail. Citation needed: it's something like a 10% higher bitrate so 20% of the bits are "nicer" to drop during a rare congestion event (1% depending on the viewer?).
High quality encodings will never use non-reference frames, and this is increasingly true with newer codecs: it can only hurt the encoding efficiency to prevent a reference. Real-time latency encodings don't use non-reference frames either, because they introduce latency (ex. WebRTC) or aren't worth the trade-off.
Dropping the tail of the GROUP is the only option in both extremes, and that encompasses virtually all use-cases today. The only exception is hardware encoders that use fixed GoP structures, but consider that an unintended performance side effect rather than intentional behavior.
Finally, this design won't even work for the other forms of SVC. We definitely want to support dropping the frame rate (temporal scalability), but we also want to support stuff like dropping the resolution or quality instead. This can only be done by representing SVC layers at the transport level. Although I also think SVC falls into the same "bait" category...
But of course I would strongly advocate keeping OBJECT agnostic to the media fragmentation
I am a bit surprised. This would fall into the same category as your objection about groups being agnostic to media fragmentation. I think we need to be very clear for the app builders, to give them direction.
How about this as a compromise:
OBJECT: One or more frames, no holes, has a property that specifies if it can be tail dropped
GROUP: One or more ordered OBJECTs. If a receiver has (complete) OBJECTs 1...N in a group, they are guaranteed to be able to decode N + 1. If they are missing some, then it's application defined. We'll define some metadata for relays to be able to make decisions in the face of congestion.
TRACK: One or more ordered GROUPs
This does give multiple ways to send the same thing, but perhaps given that we want the same base protocol to accomplish a variety of use cases, and allow for experimentation as we iterate, maybe that's an ok tradeoff?
I am a bit surprised. This would fall into the same category as your objection about groups being agnostic to media fragmentation. I think we need to be very clear for the app builders, to give them direction.
GROUP had no properties. It wasn't actually possible for a relay to do anything based on how it was defined. It just needs some very basic properties, like "ordered".
OBJECT: One or more frames, no holes, has a property that specifies if it can be tail dropped
GROUP: One or more ordered OBJECTs. If a receiver has (complete) OBJECTs 1...N in a group, they are guaranteed to be able to decode N + 1. If they are missing some, then it's application defined. We'll define some metadata for relays to be able to make decisions in the face of congestion.
TRACK: One or more ordered GROUPs
That works, with some small nits:
- OBJECT: I wouldn't even mention frames. It's an "ordered media fragment". The size may be unknown and the contents could be partially decodable when the tail is dropped. Very similar properties to a QUIC stream.
- GROUP: That definition works for now but will need to be massaged for Christian's intra group priorities. The fact that N+1 depends on 1...N is the important bit for a relay.
- TRACK: Sure.
I am still trying to understand why a MoQ OBJECT needs the ability to span one or more video frames. That is the role of a group, and groups can be partially decodable when their tail is dropped.
By leaving the MoQ OBJECT undefined or unspecified, it is very confusing. I strongly recommend that we not mix grouping of things at different levels, and we must keep these properties separate.
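The OBJECT/GROUP/TRACK compromise discussed in this thread can be written down as a small data model. This is only a sketch of the proposed properties (ordered objects, group boundaries as sync points, the "objects 0..N suffice to decode N+1" guarantee); every field name here is an assumption for illustration.

```python
# Illustrative data model for the OBJECT/GROUP/TRACK compromise above.
# Field names are hypothetical, not wire format.

from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class MoqObject:
    object_id: int                 # sequential within its group, starting at 0
    payload: bytes                 # opaque media fragment; contents are app-defined
    tail_droppable: bool = False   # may a relay drop this under congestion?

@dataclass
class Group:
    # Proposed guarantee: holding complete objects 0..N of a group is
    # enough to decode object N+1; a group never depends on earlier groups.
    group_id: int                  # sequential within the track
    objects: List[MoqObject] = field(default_factory=list)

@dataclass
class Track:
    track_id: str                  # opaque identifier, resolved via the catalog
    groups: List[Group] = field(default_factory=list)

def join_point(track: Track) -> Optional[Tuple[int, int]]:
    """A late joiner starts at object 0 of the newest group, since a
    group boundary is the synchronization point."""
    if not track.groups:
        return None
    return (track.groups[-1].group_id, 0)
```

This also captures why groups matter to relays: join-at-boundary and congestion response only need the group structure, not the media payload.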
| * Each video Group of Pictures (GOP) is mapped to a group. The group would hold multiple objects, each holding one encoded video frame.
| * Each video frame boundary is mapped to a group. There would be a single object in each group, containing a single encoded video frame.
Why would you do this?
Maybe I wasn't clear here. I was suggesting something like RUSH, where each object is in its own stream. Agreed, the text as-is doesn't mean that.
RUSH would still indicate that OBJECTs are in the same group, even if they're sent over separate streams. Otherwise startup would not work; the most recent object is not decodable.
yes .. that was my intent too. I do see the confusion.
| * Each audio frame is mapped to a group. In this grouping, each group has a single audio frame as the object.
| Each group is identified by an integer `GroupId`, which always starts at 0 and increases sequentially at the original media publisher. Each Object is identified by a sequentially increasing integer, called `ObjectId`, starting at 0.
This level of detail should go elsewhere IMO.
yes, it might be good in the message structure section
| of the requested track, verifying that the relay is willing to act on behalf of that
| origin, and then asking that origin whether the subscriber is
| authorized to access the specified content. This requires that the "origin" can
| be obtained by parsing the track identifier. We thus assume that the track identifier
Punt this to the application. There's no reason to introduce the concept of origin ID at all, and especially not dictate that the origin ID should be parsed out of the track ID. There's so many different ways to determine the origin and it seems completely out of scope.
Just to simplify, we could use a similar process as DRM: the content can be encrypted, and the application can get the decryption key out of band.
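For reference, the relay-side subscribe authorization the quoted text describes (parse the origin out of the track identifier, check the relay is willing to act for that origin, then ask the origin about the subscriber) could look roughly like this. The origin-prefixed identifier encoding and the relay/origin API are assumptions for the sketch, and, as the surrounding comments note, this may well be punted to the application.

```python
# Sketch of the relay-side subscribe authorization flow described above.
# The "/"-prefixed identifier layout and the relay API are hypothetical.

def authorize_subscribe(relay, subscriber, track_id):
    """Return True if the subscriber may receive track_id via this relay."""
    origin = track_id.split("/", 1)[0]         # assumes origin-prefixed track IDs
    if origin not in relay.served_origins:     # relay willing to act for origin?
        return False
    # Ask the origin whether this subscriber is authorized for this content.
    return relay.ask_origin(origin, subscriber, track_id)
```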
| | Origin ID | Emission reference | Track reference |
| +-----------+---------------------+-----------------+
| <----- Emission identifier ------>
| <---------------- Track identifier ----------------->
I don't understand if this is meant to be an example or something authoritative. You can just say the origin CAN be encoded in the Emission ID or Track ID.
I also don't see why the Emission ID and the Track ID need to have the same prefix. They should be separate fields like in the current draft (ex. Track ID is scoped to Broadcast ID).
A trackId need not be scoped to an emission. A Composition can serve tracks from different emitters and serve its own too. The protocol shouldn't constrain it.
What's the difference between a composition and emission then? I thought compositions were tracks from different origins/emitters.
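The three-part, prefix-structured identifier shown in the quoted diagram can be illustrated with a tiny parser. The "/"-separated encoding is purely an assumption for this sketch; as the comments above point out, the draft does not (and perhaps should not) fix such an encoding.

```python
# Illustrative parsing of the three-part identifier layout from the
# quoted diagram: Origin ID / Emission reference / Track reference,
# where each identifier is a prefix of the next. Encoding is assumed.

def split_track_id(track_id: str):
    origin, emission_ref, track_ref = track_id.split("/", 2)
    return {
        "origin": origin,
        "emission_id": f"{origin}/{emission_ref}",  # origin + emission reference
        "track_id": track_id,                       # full three-part identifier
    }
```

With this layout, a relay can derive the origin (for authorization) and the emission (for wildcard subscriptions) from the track identifier alone, at the privacy cost noted below.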
| all identifiers starting with the specified broadcast reference.
| Structuring the track identifier as a three-part identifier may reduce the number of
| transactions between relays and origin, but it also carries a small privacy risk, as the relays can now track which users subscribe to which broadcast ID.
What? Where did broadcast ID come from (I thought it was renamed to emission). How could the edge not know about this, especially if it was required to perform authentication as specified above?
| ## Relay - Publisher Interactions
| Some media publications clearly separate how the content is uploaded to a "content management center" (CMS) and then how that content is broadcast to subscribers. In that model, subscribers can use the MoQ transport to obtain media streams from the CMS acting as "origin", while uploading the content could use an entirely different "ingress" system. Some other media experiences are more symmetric. For example, in a video conference, participants may publish their own video and audio tracks. These tracks will be "published" by the participants, acting as publishers.
Maybe you should just make a separate "possible architectures" document. I don't see how anything in this section is relevant to the transport protocol. The same goes for some of the previous sections; advice on how to encode IDs just seems wildly out of scope.
+1
At this point I think we should be writing the smallest amount possible in the transport document.
Non-normative supporting text should be fine to have, to set up the context.
I agree we need some amount of context to define the transport protocol, but in general, we should be trying to minimize it, especially now as we're just trying to get a baseline document we can adopt.
| 1. "Catalog" message carries enough authorization information for the Relays to authorize the tracks advertised.
| 2. Relays forward the received Catalog message towards the Origin, to complete the authorization process. This may involve sending the messages directly to the Origin or possibly traversing another Relay.
I thought the publisher was the origin? Origin is a synonym for "source" after all.
I think you're absolutely trying to impose an architecture. Broadcasters push to some central hub which then starts the fanout process. However, I think that misses a huge number of use cases, and it's really not clear why the transport protocol even cares about the existence of this architecture.
I thought the publisher was the origin? Origin is a synonym for "source" after all.
that is one possibility, but not the only one.
Can you define "origin" then? The term just appears out of nowhere.
Relays and protocol interactions need to be part of the base protocol spec. If not, there isn't a way to specify the relay behavior when it gets MoQ messages, and the protocol might just fail to work, or work incorrectly.
I do agree there is a mix of high-level/logical concepts and the atoms that need to be defined. If we can get agreement on the concrete things that work for most of the use-cases, I am sure we can fix the logical things and wording around them. Those high-level concepts are added to set the stage for anyone coming in from different application groups to get the context on why we have the concrete things.
afrind
left a comment
Thanks for taking a stab here.
High level feedback:
- We should separate object model and relays into different PRs. Yes, there's overlap to some degree, in that Relay may depend on Object Model, but we have to get the object model right first, and keeping the PR focused will allow faster review and iteration.
- While we're figuring out the model, let's leave as much wire encoding details out and keep it high level. We can follow up with any necessary wire changes.
| A track is a transform of a media stream using a specific encoding process, a set of parameters for that encoding, and possibly an encryption process. The MoQ transport is designed to transport tracks. | ||
|
|
||
| As an example, consider a scenario where `example.org` hosts a simple live stream that anyone can subscribe to. That live stream would be a single Warp broadcast identified by the URL `https://example.org/livestream`. In the simplest implementation, it would provide only two media tracks, one with audio and one with video. In more complicated scenarios, it could provide multiple video formats of different levels of video quality; those tracks would be variants of each other. Note that the track IDs are opaque on the Warp level; if the player has not received the description of media tracks out of band in advance, it would have to request the broadcast description first. | ||
| Tracks are identified within MoQ transport by their TrackIds, which can be encoded in one of the following ways |
There was a problem hiding this comment.
I think he means to say that this section should offer only concepts and not specify anything about how these concepts are transmitted on the wire.
|
|
||
| The binary content of a track is composed of a set of objects. The decomposition of the track into objects is arbitrary. For real time applications, an object will often correspond to an unit of capture, such as for example the encoding of a single video frame, but different applications may well group several such units together, or follow whatever arrangement makes sense for the application. | ||
|
|
||
| The objects that compose a given track are organized as a series of "groups", each containing a series of objects. The scope and granularity of the grouping of objects is application defined and controlled. Some examples of how this grouping might be defined: |
There was a problem hiding this comment.
My comment is along the same line -- what are the basic properties of an "object" and a "group"? Is an object atomic -- that is to say, it is only useful if the entire object is present, and a partial object has no value? Objects have metadata -- can groups also have metadata? Are groups atomic -- eg: can a receiver use a group that is missing its tail, or missing a section in the middle?
There's also a question below about "cacheability". To say something is cacheable to me also implies that it has a name or identifier - is this true of objects? Groups?
Who decides what a group is? Is it purely the publisher's decision, or can relays "regroup" objects? Another way to phrase it - are groups end-to-end or hop-by-hop? I think end-to-end but it might be good to make that explicit.
Do groups have a fixed length or can they be "appended" over time?
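The track / group / object hierarchy discussed in the quoted text and the review questions above can be sketched as a small data model. This is purely illustrative; the class names (`Track`, `Group`, `MoqObject`) and fields are assumptions, not wire-format definitions from the draft.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class MoqObject:
    object_id: int      # sequential within its group, starting at 0
    payload: bytes      # e.g. one encoded video frame

@dataclass
class Group:
    group_id: int       # sequential within the track, starting at 0
    objects: List[MoqObject] = field(default_factory=list)

@dataclass
class Track:
    track_id: str
    groups: List[Group] = field(default_factory=list)

# One-GOP-per-group mapping: each group holds the frames of one GOP.
track = Track("video-hi")
gop = Group(group_id=0)
gop.objects.append(MoqObject(object_id=0, payload=b"<keyframe>"))
gop.objects.append(MoqObject(object_id=1, payload=b"<delta-frame>"))
track.groups.append(gop)
```

Under this reading, the group boundary (a new `Group`) is a natural candidate for the synchronization point raised in the comments.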
| ### Emissions | ||
| An Emission is a logical concept and represents a grouping of tracks from a single Origin (aka Emitter). Emissions are semantically equivalent to a broadcast in live streaming, a conference in interactive conferencing, or a similar grouping.
Yes perhaps just define and use emitter (as the one sending the tracks) since Origin has a lot of weight behind it.
| Different applications will organize the user experience in different ways. For example, a conferencing application will let participants send and receive audio and video streams from each other, as well as other media streams, such as a demonstration video or a share of a participant's computer screen. The number of active media streams in a conference will often vary over time, as new participants "get the floor" or start sharing screens. A broadcast application may provide a set of video streams presenting different views of an event, the corresponding sound tracks, and perhaps a running commentary. A virtual reality application will have its own set of media streams related to photorealistic rendering and mapped textures. In some cases, audio streams will be available in several languages, or subtitle streams in different languages may complement the original video and audio streams.
| A Composition is a collection of multiple media tracks that may or may not belong to a single emission, and thus may not be scoped to a single origin.
Suggested:
A Composition is a collection of media tracks from one or more emissions.
If I understood properly, it seems an Emission is a collection of tracks from the same emitter and a Composition is the same from different emitters.
Just a thought: it seems to me we are trying to design an application system more than a media transport protocol. Could composition be something at the app level? Meaning we deal with emissions (similar to a transport stream), and somebody can design a compositor that ingests N emissions and creates another one (an emission) as the result of compositing the inputs.
Emission is not a transport-level construct either. We are talking about delivering tracks, and they come from either an emission or a composition. I wonder why one is fine but not the other.
I think Emission is a transport level concept: A collection of tracks, scoped to a WT session, defined by a catalog.
I beg to differ. "A collection of tracks from one or more emitters, scoped to a WT Session, as defined by a catalog" is a fine definition too.
I feel we should deal with MoQ Tracks at the transport level, and the rest is the application's choice of how to put things together and where.
I think there's value in a transport level construct grouping tracks into an emission/broadcast -- for example, it affords a simple mechanism to say "I'm not watching this anymore" and cancel all related track subscriptions. See also #98, which proposes the scope of a WT session to be a single emission/broadcast.
But it seems there's still disagreement so perhaps a good point of discussion for the interim.
The transport needs to be aware of the emission/broadcast (or maybe composition?) so it is able to prioritize based on delivery order. The delivery order is scoped to a single emission and is not comparable between emissions.
| ### Catalog and track selection | ||
| The MoQ transport tries to not make assumptions about the user experience, or the number and type of media streams handled by an application. We simply assume that the users will receive a "catalog" describing the composition. For some applications, the content of this catalog will be established at the very beginning of the session. In other cases, the catalog will have to be updated by a stream of events as new media streams get added or removed from the media experience.
I would avoid language like "tries not to" or using the subject "we".
I'm also confused about the scope of a catalog. Is the catalog scoped to a composition (which can contain tracks from multiple emissions) or is it scoped to a single emission? The word "session" also appears here, and the relationship between session and emission is not clear.
If a catalog defines tracks from different emitters, it is scoped to a composition. Isn't it?
I prefer a catalog being scoped to a single emission (and hence by definition a single emitter). We can say that a resource that defines a composition is out of scope for moq, at least for now? In a web analogy, an HTML page defines a composition of resources from potentially different entities, but the HTTP specifications don't include many details about that works.
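Since the draft deliberately leaves the catalog format open, a minimal sketch may help ground the discussion. The dict shape, field names, and `track_ids` helper below are all hypothetical, assuming a catalog scoped to a single emission as suggested in the comment above.

```python
# Hypothetical catalog for one emission; the draft defines no wire format,
# so this structure is purely illustrative.
catalog = {
    "emission": "https://example.org/livestream",
    "tracks": [
        {"track_id": "audio-aac", "media": "audio"},
        {"track_id": "video-hi", "media": "video", "bitrate_kbps": 2000},
        {"track_id": "video-lo", "media": "video", "bitrate_kbps": 500},
    ],
}

def track_ids(cat):
    """Track IDs a subscriber could subscribe to after reading the catalog."""
    return [t["track_id"] for t in cat["tracks"]]
```

A catalog update stream, as mentioned in the quoted text, would then amend the `tracks` list over time.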
| Subscribers interact with "MoQ Relays" by sending a "subscribe" command for the desired content, identified by its `SubscriptionId`. We expect that they will in most cases receive the published objects through relays, much like readers of web pages often receive those pages through a content distribution network. A `SubscriptionId` typically identifies objects belonging to a desired track or an emission, for example. Relays forward the published objects to the subscribers matching the `SubscriptionId` in the subscribe request.
| However, Relays MUST be willing to act on behalf of the subscriptions before they can |
Do you mean 'on behalf of the subscribers' rather than subscriptions?
| However, Relays MUST be willing to act on behalf of the subscriptions before they can
| forward the media, which implies that the subscriptions MUST be authorized. If a relay decides to allow the subscription, it will also have to find out how to provide the desired content.
Why do we have to say that relays MUST authorize subscriptions? An HTTP cache "really ought to" authorize CDN requests, but it doesn't have to per any specification, as far as I know.
There was a problem hiding this comment.
@afrind maybe I am missing something here. If the subscriptions are not authorized, how would the relay trust to participate in the delivery?
There was a problem hiding this comment.
Of course authorization is important. I think we should not discuss it in this PR, as it's fairly orthogonal to the primary transport concepts (how to move media from one place to another). I was making the HTTP analogy -- I'm not sure but I suspect the transport RFCs that define how to move bits or define intermediaries aren't opinionated about authorization. And you can imagine relays that are public and don't authorize, etc.
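The subscribe flow under discussion (authorize first, then record the subscription) can be sketched as below. The `Relay` class, the `authorize` policy hook, and the method names are assumptions for illustration; the draft does not define a relay API.

```python
# Sketch of subscribe handling: authorize, then record the subscription
# so future publishes matching the SubscriptionId can be forwarded.
class Relay:
    def __init__(self, authorize):
        self.authorize = authorize      # app-defined policy hook (may accept all)
        self.subscribers = {}           # SubscriptionId -> set of sessions

    def on_subscribe(self, subscription_id, session):
        if not self.authorize(subscription_id, session):
            return False                # reject unauthorized subscription
        self.subscribers.setdefault(subscription_id, set()).add(session)
        return True
```

A "public" relay, as contemplated in the comment above, would simply pass an `authorize` hook that always returns True.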
have to find out how to provide the desired content
We can probably remove this sentence?
| ## Relay - Publisher Interactions | ||
| Some media publications clearly separate how the content is uploaded to a "content management center" (CMS) and then how that content is broadcast to subscribers. In that model, subscribers can use the MoQ transport to obtain media streams from the CMS acting as "origin", while uploading the content could use an entirely different "ingress" system. Some other media experiences are more symmetric. For example, in a video conference, participants may publish their own video and audio tracks. These tracks will be "published" by the participants, acting as publishers.
+1
At this point I think we should be writing the smallest amount possible in the transport document.
| "Catalog" message contents themselves are opaque to the Relays, other than the information needed to authorize the message. Once the Catalog message is authorized, the Relay would be willing to participate in forwarding the published media objects whose track identifier matches the ones listed in the Catalog message.
| The Relay keeps an outgoing queue of objects to be sent to each subscriber, and objects are sent in strict priority/delivery order. Relays MAY cache some of the information for a short period of time; the time cached may depend on the application and on local cache policies.
Details like keeping a queue are really implementation specific and can be left out here.
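Agreed that the queue is implementation detail; still, for readers who want the intuition, one possible reading of "strict priority/delivery order" is a per-subscriber heap keyed by priority with arrival order as a tie-breaker. This is a sketch under that assumption, not something the draft specifies.

```python
import heapq

# Per-subscriber send queue ordered by (priority, arrival sequence).
# Lower priority value is sent first; field names are illustrative.
class SendQueue:
    def __init__(self):
        self._heap = []
        self._seq = 0   # preserves arrival order within a priority level

    def push(self, priority, obj):
        heapq.heappush(self._heap, (priority, self._seq, obj))
        self._seq += 1

    def pop(self):
        return heapq.heappop(self._heap)[2]
```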
jordicenzano
left a comment
Thanks for working on this. I'm sorry but I'm very late to the party here, and I'm very new to standards, so take my comments with a grain of salt.
In general I would love to (wish list):
- Add more block diagrams to focus the problem to tackle and set the right context for the reader
- Clarify what use cases (architectures) we want to tackle: ingest live edge, delivery live edge (to millions), rewind? vod? Highlights? video conference (1 to thousands)? All of them?
- Less "text" and more specific low level descriptions with enough details to enable any developer to start building a POC
| The objects that compose a given track are organized as a series of "groups", each containing a series of objects. The scope and granularity of the grouping of objects is application defined and controlled. Some examples of how this grouping might be defined: |
It would be possible to show something like the following, or add some examples. I don't know about others, but for me it is difficult to follow vague specs like those:
Assuming streamId, server, intention, and params are in the WT session, we could do something like:
StreamID: Unique in the app domain: 12345
intention: Ingest or delivery
params: vDesiredBufferSize, aDesiredBufferSize, rewindMs, etc
trackId/groupId/elementId
trackId: h264360p2Mbps, aac32k
groupId: Elements that depend on each other? : 1 OR 2 Or... [GUID] (ex: GOPS)
elementId: 1, 2, or [GUID] (Ex: frames)
PS: Ideally groupId and elementId would be monotonically increasing, providing a simple way to find the next item to send/fetch (implementations should NOT rely on that; gaps can happen)
WT session:
https://[HOST]:[PORT]/[APPID]/[streamID]?[params]
example:
https://fblive.com:4433/moq-ingest/12345
https://fblive.com:4433/moq-delivery/12345?vj=2000&aj=2000&rw=0
| * Each video Group of Pictures (GOP) is mapped to a group. The group would hold multiple objects, each holding one encoded video frame. | ||
| * Each video frame boundary is mapped to a group. There would be a single object in each group, containing a single encoded video frame. |
| Each group is identified by an integer `GroupId`, which always starts at 0 and increases sequentially at the original media publisher. Each Object is identified by a sequentially increasing integer, called `ObjectId`, starting at 0.
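The numbering rule just quoted can be sketched as publisher-side state. This sketch assumes `ObjectId` restarts at 0 within each new group, which the quoted text leaves slightly ambiguous; the class and method names are illustrative only.

```python
# Publisher-side GroupId/ObjectId assignment: GroupId starts at 0 and
# increases sequentially; ObjectId is assumed to reset per group.
class Numbering:
    def __init__(self):
        self.group_id = -1   # becomes 0 on the first next_group()
        self.object_id = 0

    def next_group(self):
        self.group_id += 1
        self.object_id = 0

    def next_object(self):
        oid = self.object_id
        self.object_id += 1
        return (self.group_id, oid)
```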
| Objects represent the single addressable, cacheable unit within the MoQ architecture. They carry an associated header/metadata that is authenticated (but not end-to-end encrypted) and contains priority/delivery order, time to live, and other information aiding the caching/forwarding decisions at the Relays. Objects are not always expected to be fully available, and thus relays have to be able to convey partial objects. Objects may also not be fully decodable by themselves; the object and application context shall provide the necessary prerequisites if that is the case.
What header/metadata is needed for the relay? Just IMHO if we want to do ingest / egress relay with live edge, rewind, and highlights / vod:
- 'Cache-Control': max-age=AA, // Indicates server to cache this data for AA seconds (except init segments)
- 'TrackID': mediaType,
- 'Timestamp': timestamp, // PTS in time scale
- 'Duration': duration, // Duration in time scale (very nice to have)
- 'Type': chunkType, // key, delta, init (perhaps in trackID?)
- 'Seq-Id': seqId, // Unique and monotonically increasing inside specific media type track
- 'First-Frame-Clk': firstFrameClkms, // EPOCH ms when the 1st sample in that element was captured (Optional)
- Timescale: ts timescale (or we could put this in some header, but I think that complicates a bit the protocol)
PS: Perhaps some groupId too? (if groups are needed)
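To make the discussion concrete, here is one possible object header carrying the forwarding hints the quoted text names (priority/delivery order, time to live) plus the addressing fields. All field names are assumptions for illustration; the draft does not define this structure.

```python
from dataclasses import dataclass

# Illustrative object header: the relay-visible metadata named in the
# text. Fields beyond priority and TTL echo the review comment above
# and are assumptions, not part of the draft.
@dataclass
class ObjectHeader:
    track_id: str
    group_id: int
    object_id: int
    priority: int       # delivery order hint for relay queues
    ttl_seconds: int    # how long a relay may cache this object
```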
| * A single group is mapped to the entire track, thus spanning its lifetime. Each object mapping to a slice of that. | ||
| * Each audio frame is mapped to a group. In this grouping, each group has a single audio frame as the object. | ||
Could we get away with sync points (specified in metadata), so we would NOT need the groups concept. Seems a nice simplification (if possible)
| For cases where the subscriptions are successfully validated, the Relay proceeds to save the subscription information by maintaining the mapping from the `SubscriptionId`s to the list of subscribers. This will enable Relays to forward ongoing publishes (live or from cache) to the subscribers, if available, and also forward all future publishes, until the subscription ceases to exist. Relays make such forwarding and/or caching decisions
| based on matching the identifiers in the object's header against the list of subscribers.
| Subscriptions received can be aggregated at the Relays. When a relay receives a publish request with data, it will forward it both towards the Origin and to any clients or relays that have a matching subscription. This "short circuit" of distribution by a relay, before the data has even reached the Origin servers, provides significant latency reduction for clients closer to the relay.
When a relay receives a publish request with data, it will forward it both towards the Origin and to any clients or relays that have a matching subscriptions
Is this a privacy risk? Can I send a publish request and then receive data from any client connecting?
The expected flow is that both publishes and subscribes need to be authorized. So the answer is no, and one would get only the media objects corresponding to the subscribed tracks.
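The "short circuit" behavior discussed above can be sketched in a few lines: on publish, forward the object upstream toward the Origin and simultaneously to local matching subscribers. `Session`, `on_publish`, and the `send` method are hypothetical names for illustration.

```python
# Minimal sketch of relay short-circuit forwarding.
class Session:
    def __init__(self):
        self.received = []

    def send(self, obj):
        self.received.append(obj)

def on_publish(subscribers, subscription_id, obj, upstream):
    """Forward a published object both upstream and to matching subscribers."""
    upstream.send(obj)                           # toward the Origin
    for session in subscribers.get(subscription_id, ()):
        session.send(obj)                        # local delivery, no Origin round trip
```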
| Some media publications clearly separate how the content is uploaded to a "content management center" (CMS) and then how that content is broadcast to subscribers. In that model, subscribers can use the MoQ transport to obtain media streams from the CMS acting as "origin", while uploading the content could use an entirely different "ingress" system. Some other media experience are more symmetric. For example, in a video conference, participants may publish their own video and audio tracks. These tracks will be "published" by the participants, acting as publishers. |
Do you mean "Content Management System" (CMS)?
Also, are we losing focus here? What do CMS interactions have to do with a media transport protocol?
CMS has nothing to do with the transport protocol. This is setting up text for anyone reading the protocol spec coming from a different application background (ingest & distribution vs. conferencing). These systems have different ways the MoQ session gets set up. Here we are setting up the context before moving into specifics.
| 2. Relays forward the received Catalog message towards the Origin, for completing the authorization process. This may involve sending the messages directly to the Origin or possibly traversing another Relay.
| "Catalog" message contents themselves are opaque to the Relays, other than the information needed to authorize the message. Once the Catalog message is authorized, the Relay would be willing to participate in forwarding the published media objects whose track identifier matches the ones listed in the Catalog message.
I'm sure I'm missing something, but in the case where we want to do server-side ABR, the relay needs to know track metadata (bitrates, timing, etc.) in order to push the right object at the right time. How will relays know that if "Catalog message contents themselves are opaque to the Relays"?
| The Relay keeps an outgoing queue of objects to be sent to each subscriber, and objects are sent in strict priority/delivery order. Relays MAY cache some of the information for a short period of time; the time cached may depend on the application and on local cache policies.
Are we just considering the live edge case? What about rewind or highlights / VOD? Are they out of scope?
| ## Relay Discovery and Failover | ||
| Relays are discovered via application defined ways that are out of scope of this |
GOAway related: That message should also carry some information (last Id or timing or ...) if we want to do a (potentially) seamless transition
I apologize for not being more careful about indicating the hat I'm wearing while reviewing PRs. Please consider my comments on this PR as an individual.
Has this been overtaken by events and can be closed in favor of newer PRs?
This PR combines #69 and #67.
Data Model has been updated to keep tracks as the central concept for Moq Transport to reflect the direction of discussions at the interim
Reflects some of the points discussed in the identifiers email thread started by Ted Hardie (https://mailarchive.ietf.org/arch/msg/moq/UqG0nPGOB3lZVzBaKox2MK9TeZY/)
Merged relay PR into a single PR since data model and relays depend on each other and keeping it separate added more confusion.
Co-authored by Christian Huitema and Will Law (from #69), with thanks for input from interim participants and reviewers of the original PRs (kixelated, Vmatrix1900, vasilvv, wilaw, fluffy, afrind, xfdy, gwendalsimon, specerDawkins)
Thanks