5337a1c May 18, 2016
@rsuderman @dcower @acolwell

Spherical Video RFC

Note: This metadata scheme is superseded by the Spherical Video V2 metadata specification.

This document describes an open metadata scheme by which Matroska-like and MP4 multimedia containers may accommodate spherical video. Comments are welcome on the spatial-media-discuss mailing list or by filing an issue on GitHub.

Metadata Format

Two kinds of metadata are needed to represent various characteristics of a spherical video: Global and Local metadata. Global metadata is stored in an XML format, namespaced as


<rdf:SphericalVideo xmlns:rdf=""

Local metadata is stored either as metadata tracks or along with the video frames (see Local Metadata below).

Global Metadata

Global Metadata is metadata that applies to the file or track as a whole. It is stored in the container as defined in the following sections.


Global XML metadata is stored using Matroska/WebM's "Tags" mechanism, having the following structure:


In MP4 files, spherical video metadata is stored in a uniquely-identified moov.trak.uuid box to avoid collisions with other potential metadata. This box shall carry the UUID value ffcc8263-f855-4a93-8814-587a02521fdd. The XML metadata itself is written within the uuid leaf as a UTF-8 string.

  • moov
    • ...
    • trak
      • uuid[ffcc8263-f855-4a93-8814-587a02521fdd]
    • ...
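
As an illustration of the box layout above, the following minimal sketch serializes an XML string into a uuid box using the generic ISO BMFF box layout (32-bit big-endian size, the four-character type uuid, the 16-byte extended type, then the payload). The helper name build_spherical_uuid_box and the placeholder XML are hypothetical and not part of this scheme; inserting the box into moov.trak and updating the sizes of the parent boxes is not shown.

```python
import struct
import uuid

# UUID value given in this document for the spherical metadata box.
SPHERICAL_UUID = uuid.UUID("ffcc8263-f855-4a93-8814-587a02521fdd")


def build_spherical_uuid_box(xml_text: str) -> bytes:
    """Serialize XML metadata as an ISO BMFF 'uuid' box (hypothetical helper)."""
    payload = xml_text.encode("utf-8")
    # 4 bytes size + 4 bytes type + 16 bytes extended type + payload.
    size = 4 + 4 + 16 + len(payload)
    return struct.pack(">I4s", size, b"uuid") + SPHERICAL_UUID.bytes + payload


# Placeholder payload only; a real file would carry the full XML metadata.
box = build_spherical_uuid_box("<placeholder/>")
print(len(box), box[4:8])
```

The size field counts the whole box, header included, so a reader can skip the box without understanding its payload.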

Allowed Global Metadata Elements

| Name | Description | Type | Required | Default | V1.0 Requirements |
| --- | --- | --- | --- | --- | --- |
| Spherical | Flag indicating if the video is a spherical video. | Boolean | Yes | - | Must be true. |
| Stitched | Flag indicating if the video is stitched. | Boolean | Yes | - | Must be true. |
| StitchingSoftware | Software used to stitch the spherical video. | String | Yes | - | |
| ProjectionType | Projection type used in the video frames. | String | Yes | - | Must be equirectangular. |
| StereoMode | Description of stereoscopic 3D layout. | String | No | mono | Must be mono, left-right, or top-bottom. |
| SourceCount | Number of cameras used to create the spherical video. | Integer | No | - | |
| InitialViewHeadingDegrees | The heading angle of the initial view in degrees. | Integer | No | 0 | |
| InitialViewPitchDegrees | The pitch angle of the initial view in degrees. | Integer | No | 0 | |
| InitialViewRollDegrees | The roll angle of the initial view in degrees. | Integer | No | 0 | |
| Timestamp | Epoch timestamp of when the first frame in the video was recorded. | Integer | No | - | |
| FullPanoWidthPixels | Width of the encoded video frame in pixels. | Integer | No | See Stereo Mode. | |
| FullPanoHeightPixels | Height of the encoded video frame in pixels. | Integer | No | See Stereo Mode. | |
| CroppedAreaImageWidthPixels | Width of the video frame to display (e.g. cropping). | Integer | No | See Stereo Mode. | |
| CroppedAreaImageHeightPixels | Height of the video frame to display (e.g. cropping). | Integer | No | See Stereo Mode. | |
| CroppedAreaLeftPixels | Column where the left edge of the image was cropped from the full sized panorama. | Integer | No | 0 | |
| CroppedAreaTopPixels | Row where the top edge of the image was cropped from the full sized panorama. | Integer | No | 0 | |
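
The V1.0 requirements listed above can be checked mechanically. The sketch below assumes the XML has already been parsed into a Python dict keyed by element name, with Boolean elements converted to bool; the function name check_v1_global_metadata and the wording of the messages are hypothetical.

```python
# Elements marked "Required: Yes" in the table above.
REQUIRED_ELEMENTS = ("Spherical", "Stitched", "StitchingSoftware", "ProjectionType")
V1_STEREO_MODES = {"mono", "left-right", "top-bottom"}


def check_v1_global_metadata(meta: dict) -> list:
    """Return a list of problems; an empty list means the dict passes (sketch)."""
    problems = []
    for name in REQUIRED_ELEMENTS:
        if name not in meta:
            problems.append(f"missing required element: {name}")
    for name in ("Spherical", "Stitched"):
        if name in meta and meta[name] is not True:
            problems.append(f"{name} must be true")
    if "ProjectionType" in meta and meta["ProjectionType"] != "equirectangular":
        problems.append("ProjectionType must be equirectangular")
    # StereoMode is optional and defaults to mono.
    if meta.get("StereoMode", "mono") not in V1_STEREO_MODES:
        problems.append("StereoMode must be mono, left-right, or top-bottom")
    return problems


sample = {
    "Spherical": True,
    "Stitched": True,
    "StitchingSoftware": "OpenCV for Windows v2.4.9",
    "ProjectionType": "equirectangular",
}
print(check_v1_global_metadata(sample))
```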

Stereo Mode

The SEI Frame Packing Arrangement and the StereoMode tag for Matroska/WebM video files can be used to describe the left/right frame layout. To include non-h264 MPEG-4 files, an additional StereoMode tag is provided; when present, it overrides the native stereo configuration. The supported StereoMode values are shown below with the corresponding native values.

| Name | Description | Equivalent MKV Value | Equivalent h264 SEI FPI |
| --- | --- | --- | --- |
| mono | Whole frame contains a single mono view. | 0 | - |
| left-right | Left half contains the left eye while the right half contains the right eye. | 1 | 3 |
| top-bottom | The top half contains the left eye and the bottom half contains the right eye. | 3 | 4 |
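
The override rule and the value mapping above can be sketched as a small lookup. The function name resolve_stereo_mode and its parameters are hypothetical; the sketch assumes the caller has already extracted the native values from the container or bitstream, and it falls back to mono when nothing is signalled.

```python
# Native values from the table above, mapped to StereoMode names.
MKV_STEREO_MODE = {0: "mono", 1: "left-right", 3: "top-bottom"}
H264_SEI_FRAME_PACKING = {3: "left-right", 4: "top-bottom"}


def resolve_stereo_mode(tag=None, mkv_value=None, sei_value=None):
    """Pick the effective StereoMode; the explicit tag overrides native values."""
    if tag is not None:
        return tag
    if mkv_value is not None:
        if mkv_value not in MKV_STEREO_MODE:
            raise ValueError(f"unsupported MKV StereoMode value: {mkv_value}")
        return MKV_STEREO_MODE[mkv_value]
    if sei_value is not None:
        if sei_value not in H264_SEI_FRAME_PACKING:
            raise ValueError(f"unsupported SEI frame packing value: {sei_value}")
        return H264_SEI_FRAME_PACKING[sei_value]
    return "mono"  # default when no layout is signalled


print(resolve_stereo_mode(mkv_value=3))
print(resolve_stereo_mode(tag="left-right", sei_value=4))
```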

Cropping, initial view, and projection properties are shared across the left/right eyes. Each video frame is first divided into the left/right eye regions; cropping and view information is then applied to each region as if it were a separate video frame. Default cropping information varies with the StereoMode tag as shown below.

| Name | mono | left-right | top-bottom |
| --- | --- | --- | --- |
| CroppedAreaImageWidthPixels | Container Width. | Half Container Width. | Container Width. |
| CroppedAreaImageHeightPixels | Container Height. | Container Height. | Half Container Height. |
| FullPanoWidthPixels | Container Width. | Half Container Width. | Container Width. |
| FullPanoHeightPixels | Container Height. | Container Height. | Half Container Height. |
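
A minimal sketch of the per-eye split and the default cropping values described above. The function names eye_regions and default_crop are hypothetical, and integer division for odd container dimensions is an assumption that this document does not address.

```python
def eye_regions(stereo_mode: str, width: int, height: int) -> dict:
    """Return {eye: (left, top, width, height)} in container pixels (sketch)."""
    if stereo_mode == "mono":
        return {"mono": (0, 0, width, height)}
    if stereo_mode == "left-right":
        half = width // 2  # assumption: round down for odd widths
        return {"left": (0, 0, half, height), "right": (half, 0, half, height)}
    if stereo_mode == "top-bottom":
        half = height // 2  # assumption: round down for odd heights
        return {"left": (0, 0, width, half), "right": (0, half, width, half)}
    raise ValueError(f"unsupported StereoMode: {stereo_mode}")


def default_crop(stereo_mode: str, width: int, height: int) -> dict:
    """Default cropping values for one eye region, per the table above."""
    _, _, w, h = next(iter(eye_regions(stereo_mode, width, height).values()))
    return {
        "CroppedAreaImageWidthPixels": w,
        "CroppedAreaImageHeightPixels": h,
        "FullPanoWidthPixels": w,
        "FullPanoHeightPixels": h,
        "CroppedAreaLeftPixels": 0,
        "CroppedAreaTopPixels": 0,
    }


print(default_crop("top-bottom", 3840, 3840))
```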

Initial View

The default initial viewport is set such that the frame center occurs at the view center. A diagram of the rotation model for an equirectangular projection is shown below.

     -180           0           180
   90 +-------------+-------------+
      |             |             |
P     |             |      o>     |
i     |             ^             |
t   0 +-------------X-------------+
c     |             |             |
h     |             |             |
      |             |             |
  -90 +-------------+-------------+

X  - the default camera center
^  - the default up vector
o  - the image center for a pitch of 45 and a heading of 90
>  - the up vector for a rotation of 90 degrees.
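
To make the diagram concrete, the sketch below maps a heading and pitch to the pixel at the view center in an equirectangular frame. It assumes an uncropped full panorama, heading in [-180, 180] increasing left to right, and pitch in [-90, 90] with 90 at the top row, as drawn above; the function name view_center_to_pixel is hypothetical and roll is ignored.

```python
def view_center_to_pixel(heading_deg: float, pitch_deg: float,
                         width: int, height: int) -> tuple:
    """Map (heading, pitch) to (x, y) in an equirectangular frame (sketch)."""
    x = (heading_deg + 180.0) / 360.0 * width   # -180 -> left edge, 180 -> right edge
    y = (90.0 - pitch_deg) / 180.0 * height     # 90 -> top row, -90 -> bottom row
    return x, y


# Default view (X in the diagram) and the "o" example from the legend.
print(view_center_to_pixel(0, 0, 3840, 1920))
print(view_center_to_pixel(90, 45, 3840, 1920))
```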

Local Metadata

Version 1 supports the following Local Metadata:

  • GPS (latitude, longitude, altitude)
  • Director's Cut (viewport for each frame)
  • Hotspot (plain text, including HTML)

Two Types of Local Metadata

There are two types of local metadata: (1) strictly per-frame and (2) arbitrary local metadata (perhaps sampled at certain intervals -- in other words, not strictly per-frame). Both types of local metadata may be used concurrently, depending on the author's needs and available metadata granularity.

Specification of Strictly Per-Frame Metadata

In this case, metadata content is stored at a frame-level accuracy: there is one chunk of metadata content for every frame.


In Matroska/WebM, metadata content goes into the BlockAdditional element of the Block to which the metadata belongs.


For H.264 streams, metadata content is carried using the user data unregistered SEI message syntax from ISO/IEC 14496-10:2005 (see D.1.6). This is an SEI message of payload type 5.

Specifications of Local Metadata Sampled at Intervals

In this case, the metadata content is not available at frame-level accuracy, but is instead sampled at a certain time interval.


Metadata content should be stored as a separate metadata track. The metadata track entry must have the following values for specified fields:

  • Track type: 0x21 (WebVTT Metadata)

Each metadata chunk must be stored as either Blocks or SimpleBlocks, per the Matroska specification, with the exception that no lacing is permitted. Metadata blocks should always be key frames and must be indicated accordingly in the flags -- depending on whether it's a SimpleBlock or a Block.
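
As a rough illustration of the SimpleBlock case, the sketch below builds the body of a SimpleBlock with the keyframe flag set and the lacing bits cleared. It follows the general Matroska SimpleBlock layout (track number as an EBML variable-length integer, a signed 16-bit relative timecode, one flags byte) and is not taken from this document. The helper name build_simple_block_body is hypothetical, only one-byte track numbers are handled, and the enclosing EBML element ID and size are not written.

```python
import struct

KEYFRAME_FLAG = 0x80  # SimpleBlock flags: keyframe bit
LACING_MASK = 0x06    # SimpleBlock flags: lacing bits (zero means no lacing)


def build_simple_block_body(track_number: int, relative_timecode: int,
                            payload: bytes) -> bytes:
    """Build a SimpleBlock body for one metadata chunk (hypothetical helper)."""
    if not 1 <= track_number <= 126:
        raise ValueError("sketch only handles one-byte track numbers")
    # One-byte EBML variable-length integer: marker bit plus 7-bit value.
    header = bytes([0x80 | track_number])
    # Timecode relative to the cluster, signed 16-bit big-endian.
    header += struct.pack(">h", relative_timecode)
    flags = KEYFRAME_FLAG  # lacing bits stay zero: no lacing permitted
    return header + bytes([flags]) + payload


body = build_simple_block_body(2, 40, b"\x00\x01")
print(body.hex())
```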


Create a track with ComponentSubType set to "meta" for Timed Text Metadata. The box structure is as follows:

  • mdia
    • mdhd
    • hdlr
    • minf
      • nmhd
      • dinf
        • dref
          • url
      • stbl
        • stsd
          • gpsd
        • stts
        • stsc
        • stsz
        • stco

Local Metadata Specification

Local metadata is stored as a binary stream.

Header bits (0 to 31, numbered as in the layout diagram below): flags indicating which metadata are present.

  • Bit 0: Flag for GPS data
  • Bit 1: Flag for Director's Cut data
  • Bit 2: Flag for Hotspot data
  • Bits 3 to 31: Reserved for future use

Depending on which flags are set, the actual metadata will follow in the exact order of the flags.

Each block of local metadata is preceded by the length of that block. This allows parsers to skip metadata blocks they do not need.


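 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|g|d|h|                        Reserved                         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                 GPS Data Length (if Bit 0 set)                |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
|                    GPS Data (if Bit 0 set)                    |
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|           Director's Cut Data Length (if Bit 1 set)           |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
|              Director's Cut Data (if Bit 1 set)               |
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|               Hotspot Data Length (if Bit 2 set)              |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
|                  Hotspot Data (if Bit 2 set)                  |
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+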

The binary format for representing GPS data can be found in the MPEG3.1 GPS-V Spec. The exact specifications for Director's Cut and Hotspot data are yet to be determined.
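
The length-prefixed layout above can be sketched as a pack/parse pair. Several details are assumptions, since this document does not fix them: big-endian byte order, 32-bit length fields, lengths counted in bytes, and the GPS flag occupying the most significant bit of the header (the leftmost position in the diagram). The function names and the dictionary keys are hypothetical, and each block's payload is treated as opaque bytes.

```python
import struct

# Flags in header order; index 0 is assumed to be the most significant bit.
FLAG_ORDER = ("gps", "directors_cut", "hotspot")


def pack_local_metadata(blocks: dict) -> bytes:
    """Serialize {name: bytes} into header + length-prefixed blocks (sketch)."""
    flags = 0
    body = b""
    for index, name in enumerate(FLAG_ORDER):
        data = blocks.get(name)
        if data is None:
            continue
        flags |= 1 << (31 - index)
        body += struct.pack(">I", len(data)) + data
    return struct.pack(">I", flags) + body


def parse_local_metadata(buf: bytes) -> dict:
    """Inverse of pack_local_metadata; blocks appear in flag order."""
    (flags,) = struct.unpack_from(">I", buf, 0)
    offset = 4
    blocks = {}
    for index, name in enumerate(FLAG_ORDER):
        if not flags & (1 << (31 - index)):
            continue
        (length,) = struct.unpack_from(">I", buf, offset)
        offset += 4
        if offset + length > len(buf):
            raise ValueError(f"truncated {name} block")
        # A parser not interested in this block could simply skip `length` bytes.
        blocks[name] = buf[offset:offset + length]
        offset += length
    return blocks


packed = pack_local_metadata({"gps": b"\x01\x02", "hotspot": b"<b>hi</b>"})
print(parse_local_metadata(packed))
```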


Audio can be stored in the following two ways, and metadata is needed to signal which format is used and how many streams exist. For example, the four channels of the ambisonic B format might be stored and compressed as two stereo streams.

  • Stereo, 2 channels, compressed as AAC
  • 5.1, 6 channels, compressed as AAC


See the Matroska specification for the StereoMode values.


See ISO/IEC 14496-10 for the SEI message.
See ISO/IEC 14496-12 for ISO BMFF.

Appendix 1 - Global Metadata Sample

    OpenCV for Windows v2.4.9

Appendix 2 - Matroska/WebM Local Metadata Track Sample

<SimpleBlock> (
  <Binary Local Metadata>