
RTC API Proposal

CJ - Cullen's comments are in emphasis and start with CJ. Feel free to remove them, but it seemed easier to put them inline here

The WhatWG proposal for real-time media streams presents many fine ideas, as does the extension to the Streams API presented in that document (as proposed to the W3C audio working group). This proposal builds on those two documents to present an API for media capture and transmission in web browsers.

The primary motivations for this document are:

  1. Some use-cases are not satisfied by either of the earlier proposals.
  2. Some aspects of earlier proposals are amenable to simplification, and others may present unique implementation challenges, which this proposal takes into account.
  3. Firefox already supports a rich Audio API for manipulating streams and we would like to ensure that subsequent work on video and real time communication plays well with other media APIs.

Use cases

For purposes of designing this API, we present the following use-cases. We omit the use-cases that do not pertain to the RTC working group (such as local-only media capture or audio processing), but it suffices to state that from an implementation perspective, it is important to consider all media-related APIs for coherence, and that the API proposed in this document does take those use-cases into account, even though they are not presented here.

  • Simple video and voice calling site
  • Broadcasting real time video & audio streams
  • Browser based MMORPG that enables player voice communication (push to talk)
  • Video conferencing between 3 or more users (potentially on different sites)
  • [Fill in more use cases from IETF document]

API Specification

The API proposed in this section is intended to be the baseline that should be provided by the browser and to give web applications the maximum amount of flexibility. Some use-cases (such as a simple video chat application) may be fulfilled by a simpler API more intuitive to web developers; however, it is hoped that such an API may be built on top of the proposed baseline. We do not preclude that a simpler API be specified by the working group, but suggest that it be mandatory for browsers to implement the following specification to ensure that all targeted use-cases are satisfied.

We split the specification into three distinct portions for clarity: definition of media streams, obtaining device access, and establishing peer connections. Implementation of all three is required for an end-to-end solution satisfying all the targeted use-cases.

Media streams

A media stream is an abstraction over a particular window of time-coded audio or video data (or both). The time-codes are in the stream's own internal timeline. The internal timeline can have any base offset but always advances at the same rate as real time. Media streams are not seekable in any direction. CJ This is not a huge deal for me, but I find it weird that a single track could have both audio and video. Take video with stereo audio. I think of this as three tracks: a video track, a left audio track, and a right audio track. Also, if we had a high-res and a low-res version of the same video, we could model this as two tracks. If we had right and left images for 3D video, two tracks. The tracks may all be coming from the same file or container with the coded information, as obviously there are coding advantages to joint coding of highly correlated information. But from the API point of view, and the way users see them, separate tracks make this clear. I'm not really worked up about how we do this, but I think we need a clear strategy so that when something a bit weird, like DTMF (my canonical example of weirdness), comes along we will know whether it goes in an existing track or a new track.

interface MediaStream {
    readonly attribute DOMString label;
    readonly attribute double currentTime;

    MediaStreamTrack[] tracks;
    MediaStreamRecorder record(in DOMString type);
    void stop();

    const unsigned short LIVE = 1;
    const unsigned short BLOCKED = 2;
    const unsigned short ENDED = 3;
    readonly attribute unsigned short readyState;

    attribute Function onReadyStateChange;   
    ProcessedMediaStream createProcessor(in optional Worker worker);
}

When the readyState of a media stream is LIVE, the window is advancing in real time. When the state is BLOCKED, the stream does not advance (the user-agent may replace its output with silence until the stream is LIVE again), and ENDED implies that no further data will be received on the stream.
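To make the state machine concrete, here is a minimal sketch of reacting to state changes; the mediaStream variable is assumed to have been obtained elsewhere (for example via navigator.getMediaStream, described later in this document).

// Sketch: log transitions between the three readyState values.
mediaStream.onReadyStateChange = function() {
    switch (mediaStream.readyState) {
        case mediaStream.LIVE:    // window is advancing in real time
            console.log("stream is live");
            break;
        case mediaStream.BLOCKED: // stream paused; output may be silence
            console.log("stream is blocked");
            break;
        case mediaStream.ENDED:   // no further data will arrive
            console.log("stream has ended");
            break;
    }
};

Every stream has an associated set of tracks: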

CJ - I'm pretty sure there is a typo in the RFC 8421 number

 interface MediaStreamTrack {
     readonly attribute MediaStream stream;
     readonly attribute boolean audio;
     readonly attribute boolean video;
     
     readonly attribute DOMString type; // RFC 8421 
     readonly attribute DOMString label;

     attribute Function onTypeChanged;
     attribute DOMString[] supportedTypes;
     attribute MediaStreamTrackHint hint; **CJ - perhaps hints instead of hint**

     readonly attribute double volume; **CJ - we need to be clear on what volume means and what its units are here. The voice people might think that volume refers to the average level of sound in this stream, not the gain we want applied**
     void setVolume(in double volume, in optional double startTime, in optional double duration);

     const unsigned short ENABLED = 1;
     const unsigned short DISABLED = 2;
     attribute unsigned short state;

     readonly attribute MediaBuffer buffer;
 };

The audio attribute is set if the stream carries audio data, and the video attribute is set if it carries video data (if both are set, the audio and video data are carried in the same media buffer, depending on the encapsulation format). MediaBuffer allows web applications to access the underlying media data:

interface MediaBuffer {
    readonly attribute DOMString type; // RFC 8421 **CJ typo**
    Object getBufferData(args); // codec specific (may return the next Ogg packet in the stream, for example)
};

CJ Do we want to add a way to get a sequence number in the above? With some codecs you can't reconstruct the stream without knowing something about the ordering of the packets, and packets will sometimes arrive out of order.
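As a rough illustration, a web application might poll the buffer for codec packets. This is a hypothetical sketch: the zero-argument call to getBufferData and the 20 ms polling interval are assumptions for illustration only, since the real arguments are codec-specific and not yet defined.

// Hypothetical sketch: drain packets from a track's buffer.
var buffer = track.buffer; // track is a MediaStreamTrack obtained elsewhere
function poll() {
    var packet = buffer.getBufferData(); // e.g. the next Ogg packet, or null
    if (packet) {
        // hand the packet to application code (a JS decoder, recorder, etc.)
    }
    setTimeout(poll, 20); // poll at roughly packet intervals
}
poll();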

The programmer can also provide "hints" to the MediaStreamTrack as to the kind of data it is carrying. The MediaStreamTrack's type may then change to accommodate the provided hints, and if this is done, the onTypeChanged event handler will be called. CJ I think these hints are going in the right direction of something simple enough that people could use - I'm sure lots of work will be needed, but let's remind everyone "less is more"

interface MediaStreamTrackHint {
    attribute boolean isMusic;
    attribute boolean isSpokenVoice;

    unsigned short AUDIO_BROADBAND = 1; **CJ - upon reflection, I don't like these; they produce a sharp cutoff and it is hard to know which is right. Lately I have been more of a fan of an algorithm that can be parameterized into a gradual change than a few modes - let's think more about what the user of the API knows, and what they want to accomplish.**
    unsigned short AUDIO_NARROWBAND = 2;
    attribute unsigned short audioBand;

    attribute unsigned long videoWidth;
    attribute unsigned long videoHeight;
    attribute unsigned long videoFrameRate; **CJ - worth noting we are seeing more cameras support 72 fps or more**

    unsigned short VIDEO_SLOW_MOVING = 1;
    unsigned short VIDEO_FAST_MOVING = 2;
    attribute unsigned short videoType;

    attribute double percentageCPU;
    attribute double percentageGPU;
};
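For example, an application sending slide-share style video (large frames, little motion) might hint as follows. This is a sketch against the interface above; the specific values are illustrative.

// Sketch: hint that this track carries high-resolution, slow-moving video.
var track = stream.tracks[0]; // stream obtained elsewhere
track.onTypeChanged = function() {
    console.log("track type renegotiated to " + track.type);
};
track.hint.videoWidth = 1280;
track.hint.videoHeight = 720;
track.hint.videoFrameRate = 5; // slides rarely change
track.hint.videoType = MediaStreamTrackHint.VIDEO_SLOW_MOVING;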

Streams can be associated with existing HTML media elements such as <video> and <audio>, and video streams with <canvas>. Each of these tags may serve as either the input or output for a media stream, by setting or getting the stream attribute as appropriate.

partial interface HTMLMediaElement {
    attribute MediaStream stream;
};
partial interface HTMLCanvasElement {
    attribute MediaStream stream;
};
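As a minimal sketch of this symmetry (the element ids below are hypothetical): reading the stream attribute of a canvas turns its contents into a video source, and assigning to the stream attribute of a video element renders the stream.

// Sketch: route a canvas's output into a video element for display.
var canvasStream = document.getElementById("sourceCanvas").stream; // canvas as source
document.getElementById("viewer").stream = canvasStream;           // video as sink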

Streams can be recorded to files, which can then be accessed via the DOM File APIs:

interface MediaStreamRecorder {
    readonly attribute MediaStream stream;
    void getRecordedData(in Function onsuccess, in Function onerror);
    void stop();
};
function onsuccess(DOMString type, DOMFile file);
function onerror(DOMString error);

The type argument passed to the onsuccess callback is a string as defined in RFC 8421. (This is the same format as the type attribute in MediaBuffer.)
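Putting the pieces together, a sketch of recording a few seconds of a stream and reading back the resulting file might look like the following; the "audio/ogg" type string and the five-second duration are illustrative assumptions.

// Sketch: record a stream briefly, then fetch the data as a DOMFile.
var recorder = stream.record("audio/ogg"); // stream obtained elsewhere
setTimeout(function() {
    recorder.stop();
    recorder.getRecordedData(
        function(type, file) {
            // file can now be read with the DOM File APIs
            console.log("recorded data of type " + type);
        },
        function(error) {
            console.log("recording failed: " + error);
        });
}, 5000);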

Device access

MediaStreams can be obtained from <video>, <audio> and <canvas> elements; but they can also be obtained from a user's local media devices such as a webcam or microphone: CJ I think we might want to introduce the idea of a stream sink and source. If we have a camera that flows to a canvas, we need to be clear whether we model this as a stream for the camera and another stream object for the canvas, linked together, or as one stream object that gets data from the camera and sends data to the canvas. I think I prefer the model where a stream is a logical flow of data: one or more devices or other streams can push data into it, and it can push data out to one or more devices and streams. I can be convinced either way - we just need to have a clear model

interface NavigatorMedia {
    void getMediaStream(in boolean video, in boolean audio, in Function onsuccess, in optional Function onerror);
};
Navigator implements NavigatorMedia;

function onsuccess(MediaStream stream);

const unsigned short PERMISSION_DENIED = 1;
const unsigned short RESOURCE_BUSY = 2;
const unsigned short RESOURCE_UNAVAILABLE = 3;
function onerror(unsigned short errorCode);

The caller may set the values of 'audio' and 'video' to true if they require those inputs. CJ imagine I am using a sound card with a stereo input, but I want the left channel, or the right, or I want the mono stream of the two mixed left and right. We need to be able to indicate this type of stuff - perhaps we could reuse hints here CJ I think we need a way to discover all the cameras, microphones, speakers, and output displays, for the case where we can access more than one monitor. Not sure if this is DAP or WEBRTC or what, but it seems like we need to fit in with that. If either of the requested inputs was not available, the success callback is still called; thus the application must check the type attribute of the resulting tracks in the stream handed to it to verify whether the stream contains only audio, only video, or both. If hardware to fulfil the request is unavailable, the error callback is invoked with RESOURCE_UNAVAILABLE; but if hardware is available and is currently being used by another application, RESOURCE_BUSY is returned. Additionally, the user-agent may choose to let the user select a local file to act as the source of the media stream in place of real hardware.
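A sketch of the checking described above; the error-code names are the constants defined earlier, and the fallback behaviour shown is only illustrative.

// Sketch: request both inputs, then verify what was actually granted.
navigator.getMediaStream(true, true, function(stream) {
    var hasAudio = false, hasVideo = false;
    for (var i = 0; i < stream.tracks.length; i++) {
        if (stream.tracks[i].audio) hasAudio = true;
        if (stream.tracks[i].video) hasVideo = true;
    }
    if (!hasVideo) {
        // no camera was available; proceed with an audio-only experience
    }
}, function(errorCode) {
    if (errorCode == RESOURCE_BUSY) {
        // another application holds the device; ask the user to retry
    }
});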

Peer connections

A peer connection provides a UDP channel of communication between two user-agents.

constructor PeerConnection(DOMString config, Function sendSignal, optional DOMString negotiationServerURN)
interface PeerConnection {
    void receivedSignal(DOMString msg);

    const unsigned short LISTENING = 1;
    const unsigned short OPENING = 2;
    const unsigned short INCOMING = 3;
    const unsigned short ACTIVE = 4;
    const unsigned short CLOSED = 5;
    readonly attribute unsigned short readyState;

    void addLocalStream(in MediaStream stream);
    void removeLocalStream(in MediaStream stream);
    readonly attribute MediaStream[] localStreams;
    readonly attribute MediaStream[] remoteStreams;

    void open();
    void accept();
    void close();
    void send(in DOMString text);

    attribute Function onMessage;
    attribute Function onRemoteStreamAdded;
    attribute Function onRemoteStreamRemoved;
    attribute Function onReadyStateChange;
};
Window implements PeerConnection;

The configuration string gives the address of a STUN or TURN server used to establish the connection. sendSignal is a function provided by the caller which will be called when the user-agent needs to transport an out-of-band signalling message to the remote peer. When a message is received from the remote peer via this channel, it must be passed to the user-agent by calling receivedSignal(). The ordering of messages is important.
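Since ordering matters, the signalling glue should deliver messages in the order the user-agent produced them. Below is a sketch assuming a hypothetical XHR relay at a /signal endpoint and an illustrative STUN server address; queuing the sends preserves ordering even though the transport is asynchronous.

// Sketch: an order-preserving out-of-band signalling channel.
var sendQueue = [];
var sending = false;
function sendSignal(msg) {
    sendQueue.push(msg);
    if (!sending) flush();
}
function flush() {
    sending = true;
    var xhr = new XMLHttpRequest();
    xhr.open("POST", "/signal", true); // hypothetical relay endpoint
    xhr.onreadystatechange = function() {
        if (xhr.readyState != 4) return;
        if (sendQueue.length > 0) flush(); // send the next queued message
        else sending = false;
    };
    xhr.send(sendQueue.shift());
}
var conn = new PeerConnection("STUN stun.example.net:3478", sendSignal);
// Messages arriving from the remote peer are handed back, in order, via:
//   conn.receivedSignal(msg);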

CJ Proposed text to add around here: TODO

Examples

Simple Video Call

Simple video calling between two users A and B. A is making a call to B:

// User-agent A executes
<video id="localPreview"/><video id="remoteView"/>
<script>
navigator.getMediaStream(true, true, function(stream) {
    document.getElementById("localPreview").stream = stream;
    var conn = new PeerConnection("STUN foobar.net:3476", sendToB); **CJ I'd like to move the STUN <space> domain:port to just be a URI like stun:foobar.net:3478**

    function sendToB(msg) { /* send via XHR to B */ }
    function gotFromB(msg) { conn.receivedSignal(msg); }

    conn.addLocalStream(stream);
    conn.onRemoteStreamAdded = function(remoteStream) {
        document.getElementById("remoteView").stream = remoteStream;
    };

    conn.open();
});
</script>

CJ How does the camera get connected up so it displays on localPreview? Is this video with no audio? I like the conn.open - it makes it much easier to implement all this under the covers

// User-agent B executes
<video id="localPreview"/><video id="remoteView"/>
<script>
navigator.getMediaStream(true, true, function(stream) {
    document.getElementById("localPreview").stream = stream;
    var conn = new PeerConnection("STUN foobar.net:3476", sendToA);

    function sendToA(msg) { /* send via XHR to A */ }
    function gotFromA(msg) { conn.receivedSignal(msg); }

    conn.addLocalStream(stream);
    conn.onRemoteStreamAdded = function(remoteStream) {
        document.getElementById("remoteView").stream = remoteStream;
    };
    conn.onReadyStateChange = function() {
        if (conn.readyState == conn.INCOMING) conn.accept();
    };
});
</script>

CJ - Again, I like the conn.accept()

Simulcast Video

Broadcasting real-time video & audio streams:

CJ small nit. Let's try not to use the terms broadcast or multicast, as IETF people will think we are going to send one packet and have it received by multiple people, and that is not going to work with our security model. Perhaps simulcast would work as a word. BTW... Imagine the use case of a live feed of a soccer game on something like youtube. My browser could act as a client and get the feed. But then my browser could, using the stuff we are doing here, act as a server, and other nearby peers that wanted the feed could get it from me. You could get the efficiency of bittorrent for distributing live video running in the browser. Some of our customers want this ... like really want it. I used to chair the IETF WG working on this but gave it up to have more time to do it in browsers

CJ connecting is backwards ... I think we want the server to be doing the accepts and have the clients initiate the connection to the server. I also think we need to be able to send different media to each client. You might not do that in this case, but you would in the case of a conference server.

// This code runs on the "server". Some other part of the web page magically paints the game to the canvas
<canvas id="hockeyGame"/>
<script>
function sendToPeer(msg) { /* out of band send */ }
function gotFromPeer(msg) { conn.receivedSignal(msg); /* out of band receive */ }

var conn = navigator.createPeerConnection("TURNS example.org", sendToPeer);
conn.addLocalStream(document.getElementById("hockeyGame").stream);
conn.open(); 
</script>

CJ - I'd propose changing this to something like the following. Note that for this to work, the msg passed out of band needs to have addressing information, because we will be getting messages from multiple clients and we need to send the response to the right one. This also adds a listen() function. I don't think the listen can be implicit on creation of conn, or you have a race condition on setting up the handler that will do the accept.

// This code runs on the "server". Some other part of the web page magically paints the game to the canvas
<canvas id="hockeyGame"/>
<script>
function sendToPeer(msg) { /* out of band send */ }
function gotFromPeer(msg) { conn.receivedSignal(msg); /* out of band receive */ }

var conn = navigator.createPeerConnection("turns:example.org", sendToPeer);
conn.addLocalStream(document.getElementById("hockeyGame").stream);

conn.onReadyStateChange = function() {
    if (conn.readyState == conn.INCOMING) conn.accept();
};
conn.listen(); 
</script>

CJ I wrote this using onReadyStateChange, but I think we need a different callback that passes the msg that initiated the incoming state, so that if two clients connect at the same time, we can accept the right one. I'm thinking something like:

conn.onIncoming = function(msg) {
        conn.accept(msg);
    };
conn.listen(); 

CJ the following is the original client code; after that I will put the proposed code

// All clients subscribing to the broadcast run this code.
// TURN server does the job of initiating onRemoteStreamAdded for every client?
<video id="gameStream"/>
<script>
    function sendToPeer(msg) { /* out of band send */ }
    function gotFromPeer(msg) { conn.receivedSignal(msg); /* out of band receive */ }

    var conn = navigator.createPeerConnection("TURNS example.org", sendToPeer);
    conn.onRemoteStreamAdded = function(stream) {
        document.getElementById("gameStream").stream = stream;
    };

    conn.accept(); // You can also call accept() if readyState is not INCOMING
    // Implies that when the transition from LISTENING -> INCOMING is made, simply accept
</script>

CJ proposed code for client

// All clients subscribing to the broadcast run this code.
// TURN server does the job of initiating onRemoteStreamAdded for every client?
<video id="gameStream"/>
<script>
    function sendToPeer(msg) { /* out of band send */ }
    function gotFromPeer(msg) { conn.receivedSignal(msg); /* out of band receive */ }

    var conn = navigator.createPeerConnection("turns:example.org", sendToPeer);
    conn.onRemoteStreamAdded = function(stream) {
        document.getElementById("gameStream").stream = stream;
    };

    conn.open(); 
</script>

MMORPG

Browser based MMORPG that enables player voice communication (push to talk):

**CJ Here is what I imagine as the big picture for this one. Bob, Cindy, and Dean are already playing Angry Programmers and Alice decides to join the game. The game will provide out-of-band msg delivery from Alice to the other players and tells Alice about the other three players. The other three players already have connections between them, but their conn objects are listening for new connections. Alice forms a conn object and does an open to each of Bob, Cindy, and Dean. Note the same conn object is used for all three, or it is going to be hard to sort out all the mixing on this. I don't know how to set up the media processing, but there should be some way to do it. And Alice would like Bob's audio to be processed so he sounds like the chipmunks, while Dean is always too loud so he gets a -6dB reduction. I think a story like that could be done if we can pass one of the msg objects into the open, and the game gives us msg objects we can use to reach Bob, Cindy, and Dean. The code would also need to listen for incoming connections much like the above stuff. We might be able to do this with some code something like the following (note: below this CJ proposed code is the original code) **

// All players
<button id="ptt"/>
<audio id="otherPlayers"/>
<script>
var mixer;
var worker = new Worker("muxer.js");
var players = ... // this is an array of objects provided by the server, used to connect to the other players
navigator.getMediaStream(true, false, function(stream) {
    function sendToPeer(msg) { /* out of band send */ }
    function gotFromPeer(msg) { conn.receivedSignal(msg); /* out of band receive */ }

    var conn = navigator.createPeerConnection("stuns:game-server.net");
    conn.addLocalStream(stream);

    conn.onIncoming = function(msg) {
        conn.accept(msg);
    };
    conn.onRemoteStreamAdded = function(remoteStream) {
        if (!mixer) mixer = remoteStream.createProcessor(worker); // StreamProcessor API TBD
        else mixer.addInput(remoteStream);
    };

    conn.listen(); 
    for( i=0; i<players.length; i++) {
         conn.open( players[i] );
    }

    var streaming = false;
    document.getElementById("ptt").onclick = {
        if (!streaming) {
            streaming = true;
            stream.readyState = stream.LIVE;
        } else {
            streaming = false;
            stream.readyState = stream.BLOCKED;
        }
    };
});
document.getElementById("otherPlayers").stream = mixer.outputStream;

**CJ - the original code is below **

// All players
<button id="ptt"/>
<audio id="otherPlayers"/>
<script>
var mixer;
var worker = new Worker("muxer.js");
navigator.getMediaStream(true, false, function(stream) {
    function sendToPeer(msg) { /* out of band send */ }
    function gotFromPeer(msg) { conn.receivedSignal(msg); /* out of band receive */ }

    var conn = navigator.createPeerConnection("STUNS game-server.net:3345");
    conn.addLocalStream(stream);

    conn.onRemoteStreamAdded = function(remoteStream) {
        if (!mixer) mixer = remoteStream.createProcessor(worker); // StreamProcessor API TBD
        else mixer.addInput(remoteStream);
    };
    conn.accept();

    var streaming = false;
    document.getElementById("ptt").onclick = {
        if (!streaming) {
            streaming = true;
            stream.readyState = stream.LIVE;
        } else {
            streaming = false;
            stream.readyState = stream.BLOCKED;
        }
    };
});
document.getElementById("otherPlayers").stream = mixer.outputStream;