getViewerPose()'s use of XRSpace doesn't quite specify how views work #565
Fundamentally, the XRSpace is a position and an orientation, one that may change every frame.
However, an XRSpace has no inherent concept of a "view". When in immersive mode, aside from eye-level spaces, it doesn't quite make sense to request view data for the space, since there's nothing telling you where the eyes are.
For example, what should
Similarly, when there are multiple views, what exactly is originOffset affecting? Can I apply it by simply premultiplying the offset to the view matrices, or is there some deconstruction of pose data that I must first do?
It may be worth explicitly modeling this as a mathematical thing based off of pose information obtained from the device.
Okay, so from reading the Chromium source (which is in part written by @toji, so I assume it follows the intent of the spec):
I think the missing piece was understanding that
I think we should explicitly mention this in the spec, by noting:
I'd love to take a crack at writing spec text if y'all think this would be useful. I'll probably wait until I finish writing the code for Servo (and figuring out the precise matrix math involved).
@Manishearth, thanks for opening these discussions - I also think that this is somewhat underspecified in the current spec, especially since two people reading the spec may each end up with their own plausible interpretation which ends up being incompatible.
As per my comments on issue #567, I think it helps to clearly describe the poses and transforms as changing coordinates from one space to another space. Here's how I understand it, please let me know if I got it wrong.
Let's say you're on a 6DoF headset and the low-level VR API returns coordinates in tracking space, where the floor center of your play area is (0, 0, 0), and taking a step backwards puts your headset somewhere around (0, 1.6, 1). If you've requested a stationary/standing or bounded reference space, the user agent can treat tracking space and world space as equivalent (ignoring originOffset for now), and getViewerPose would return a pose with a position of (0, 1.6, 1), representing a transform from headset-relative coordinates to world space coordinates.
The per-eye poses would have a small offset from that, i.e. (-0.03, 1.6, 1) for the left eye, and possibly also a small rotation. (The Pimax 5k+ headset has angled screens which need to be represented by such a rotation; this doesn't currently work right in Chrome.) These are basically transforms from eye space to world space. Feeding the eye space origin point (0, 0, 0) into the eye's rigidTransform matrix gives you the eye position in world space.
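To make that "feed the origin into the matrix" step concrete, here's a small sketch in plain JavaScript. The column-major 4x4 layout matches what XRRigidTransform.matrix uses; the pose values are the hypothetical numbers from the example above:

```javascript
// Apply a column-major 4x4 rigid transform to a point, as you would with
// XRRigidTransform.matrix. Transforming the eye-space origin (0, 0, 0)
// yields the eye position in world space.
function transformPoint(m, [x, y, z]) {
  return [
    m[0] * x + m[4] * y + m[8]  * z + m[12],
    m[1] * x + m[5] * y + m[9]  * z + m[13],
    m[2] * x + m[6] * y + m[10] * z + m[14],
  ];
}

// Identity rotation with translation (-0.03, 1.6, 1): the left-eye pose
// from the example above (column-major, translation in elements 12..14).
const leftEyePose = [
  1, 0, 0, 0,
  0, 1, 0, 0,
  0, 0, 1, 0,
  -0.03, 1.6, 1, 1,
];

console.log(transformPoint(leftEyePose, [0, 0, 0])); // → [-0.03, 1.6, 1]
```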
If you have a 3DoF headset, the low-level VR API would internally return coordinates near (0, 0, 0) with a neck model applied. If you request eye-level reference space, the user agent would return those as-is. If you request stationary/standing reference space, the user agent should still return headset poses around (0, 1.6, 0), so it adds a tracking space to standing space transform that applies an assumed floor offset:
Conversely, if you have a 6DoF headset and request eye-level poses, the low-level VR APIs would still track 6DoF position, but there'd be a center point inside the tracked area that's treated as the eye-level origin. For example, in SteamVR you mark this point by using the "reset seated origin" function.
Result would be something like this:
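A translation-only sketch of that idea (hypothetical numbers; a real implementation would also handle the rotation of the seated origin):

```javascript
// Hypothetical sketch: a 6DoF implementation can derive eye-level poses by
// re-expressing tracked positions relative to a calibrated seated origin,
// e.g. the point marked via SteamVR's "reset seated origin".
const seatedOrigin = [0.5, 1.25, -0.25]; // in tracking-space coordinates

function eyeLevelFromTracking(p) {
  return p.map((v, i) => v - seatedOrigin[i]);
}

console.log(eyeLevelFromTracking([0.5, 1.5, 0.25])); // → [0, 0.25, 0.5]
```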
(This is just an example. If the low-level API natively supports an eye-level tracking equivalent, the implementation should use that directly since the native API can apply additional logic useful for seated experiences, for example hiding the SteamVR chaperone while your head is near the seated origin.)
originOffset is a transform from world space to tracking space (see issue #567), so you'd get something like this when combining it on a 3DoF headset:
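A translation-only sketch of that combination (hypothetical values; note that the direction of originOffset was debated and corrected further down in this thread, so treat the sign convention here as an assumption):

```javascript
// Hypothetical sketch: combining a 3DoF neck-model pose, an assumed floor
// offset, and originOffset, using translations only. Real implementations
// compose full 4x4 rigid transforms instead.
const EYE_HEIGHT = 1.6;          // assumed floor offset, meters
const originOffset = [0, 0, -1]; // app-supplied; assumed world -> tracking

function combinedPosition(neckModelPos) {
  const standing = [
    neckModelPos[0],
    neckModelPos[1] + EYE_HEIGHT,
    neckModelPos[2],
  ];
  // applying the inverse of a pure-translation originOffset subtracts it
  return standing.map((v, i) => v - originOffset[i]);
}

console.log(combinedPosition([0, 0, 0])); // → [0, 1.6, 1]
```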
If you want a view matrix, you want a transform that goes from world space to eye space, so you'd apply the inverse of the eye pose's rigidTransform. Specifically, this view matrix would transform a world point at the eye position to (0, 0, 0).
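For a rigid transform M = [R | t], the inverse is [Rᵀ | -Rᵀt], so the view matrix can be computed without a general 4x4 inverse. A sketch (column-major layout as in XRRigidTransform.matrix; the eye pose values are hypothetical):

```javascript
// Invert a rigid transform: transpose the rotation, then negate and rotate
// the translation. Column-major 4x4, as XRRigidTransform.matrix uses.
function invertRigid(m) {
  const out = [
    m[0], m[4], m[8],  0,
    m[1], m[5], m[9],  0,
    m[2], m[6], m[10], 0,
    0,    0,    0,     1,
  ];
  // translation column becomes -R^T t
  out[12] = -(out[0] * m[12] + out[4] * m[13] + out[8]  * m[14]);
  out[13] = -(out[1] * m[12] + out[5] * m[13] + out[9]  * m[14]);
  out[14] = -(out[2] * m[12] + out[6] * m[13] + out[10] * m[14]);
  return out;
}

function transformPoint(m, [x, y, z]) {
  return [
    m[0] * x + m[4] * y + m[8]  * z + m[12],
    m[1] * x + m[5] * y + m[9]  * z + m[13],
    m[2] * x + m[6] * y + m[10] * z + m[14],
  ];
}

// Hypothetical eye pose: identity rotation, translation (-0.03, 1.6, 1).
const eyePose = [
  1, 0, 0, 0,
  0, 1, 0, 0,
  0, 0, 1, 0,
  -0.03, 1.6, 1, 1,
];
const viewMatrix = invertRigid(eyePose);

// The view matrix maps the world-space eye position to the origin:
console.log(transformPoint(viewMatrix, [-0.03, 1.6, 1])); // → [0, 0, 0]
```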
The mental model I find most productive for viewers, input sources, reference spaces and all other
For a given WebXR session, there is only one viewer in the physical world - if it's a stereo viewer, that viewer is composed of two views, each positioned and oriented themselves within the physical world. When you call
I'm not sure I follow the analogy of "mounting" the XR device on the space here. To me, that implies that the device is rigidly locked to the space in question.
The stationary eye-level space is not a space that moves with the user each frame. Instead, it is a coordinate space that, once established, stays fixed in the physical world, with its origin at the location of the user's eyes when the space was first created. When you call
This mental model also provides a natural definition for
Either way, the resulting
Agreed that we should be sure all of this is specified very exactly! However, we should take pains to avoid any mention of "tracking space origin" or other such implementation details in the spec that presume something about the manner in which the underlying XR platform is built.
For example, some native XR APIs have a single eye-level tracking space origin, with all poses expressed relative to that origin. However, other systems like Windows Mixed Reality and HoloLens allow users to walk around large unbounded areas - there is no canonical tracking origin there in which all coordinates are expressed. Instead, users make bounded or unbounded reference spaces as needed, and then ask for head poses, hand poses, etc. relative to one of those spaces.
Each well-known reference space type in WebXR is defined by how its origin is positioned and oriented relative to the physical world when created and how it adjusts over time. The design goal has been to choose definitions that can behave in an observably consistent manner for app developers across different UAs that span disparate tracking technologies. It is up to the UA to manifest the defined contract of that reference space using whatever APIs are exposed by its underlying XR platform.
@thetuvix I agree that the spec shouldn't refer to internal details of the browser implementation or underlying low-level VR APIs. My previous comment was from an implementor's point of view since @Manishearth was asking about how the different types of reference spaces relate to each other in terms of transforms, and that is an implementation detail that's not directly exposed to users of the WebXR API.
I think we're in agreement that a "space" basically consists of an origin and unit axis vectors that correspond to locations in the real world, and different spaces can be related to each other with XRRigidTransforms.
However, we do need some additional terminology to explain how things work. Currently, an
I've been roughly following @toji's terminology from his drawings in issue #477, calling the "without originOffset" reference space "tracking space", and the transformed reference space "virtual world space". In this sense "tracking space" is an actual concept that's part of the WebXR API and not just an implementation detail, but I'm very much open for suggestions for alternate terminology.
Here's a proposal for the spec to make XRRigidTransform a bit more precise - would people agree with something like this?
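As a strawman for what "more precise" could look like, here's a sketch of the matrix attribute defined numerically as a column-major translate(position) * rotate(orientation), built from the orientation quaternion (x, y, z, w). The function name is hypothetical:

```javascript
// Sketch: XRRigidTransform.matrix as a column-major 4x4 built from the
// orientation quaternion and position, i.e. translation applied after the
// rotation. Standard quaternion-to-rotation-matrix expansion.
function rigidTransformMatrix(position, q) {
  const { x, y, z, w } = q;
  return [
    1 - 2 * (y * y + z * z), 2 * (x * y + z * w),     2 * (x * z - y * w),     0,
    2 * (x * y - z * w),     1 - 2 * (x * x + z * z), 2 * (y * z + x * w),     0,
    2 * (x * z + y * w),     2 * (y * z - x * w),     1 - 2 * (x * x + y * y), 0,
    position.x,              position.y,              position.z,              1,
  ];
}

// Identity orientation: result is an identity rotation with the position
// in elements 12..14.
const m = rigidTransformMatrix({ x: 0, y: 1.6, z: 0 }, { x: 0, y: 0, z: 0, w: 1 });
```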
Using this terminology,
The viewer pose's transform is the XRRigidTransform
An eye view pose's transform is the XRRigidTransform
Sigh, this is confusing, I got originOffset backwards while trying to explain it. I've edited it, here's the corrected paragraph.
Using this terminology,
The thing I'm grappling with here is "what happens when you do
So initially I thought this was true, however then:
Overall it seems like reference spaces are not just coordinate systems, but rather have additional magic on how they affect the viewer when used in
It seems like this mental model is incomplete given the abilities reference spaces have (and the term "reference space" may also be misleading here). They seem to have the ability to affect
I think some concrete questions that might clear things up are:
(Apologies in advance in case I made mistakes in this answer, it's easy to get signs or transform directions wrong. It's intended to match my proposed clarifications along these lines in https://github.com/immersive-web/webxr/pull/569/files - change "diff settings" to "split" to ensure the long lines from the index.bs file don't get truncated.)
I think "the device is sitting on the pose of the given space" is a confusing way to put things. A pose is essentially a transform between two spaces. A pose of an object corresponds to a transform from its object-attached XRSpace to a reference space or other destination space, and provides a way to get coordinate values in the destination space.
So you can do
It cannot affect the viewer, no amount of math you do in the implementation will move the viewer around unless you have a haptic suit or motion platform that can physically grab them and move them around ;-)
Instead, what's happening is that the origin of the position-disabled reference space follows your head position (but not orientation) as you move around. Imagine the origin of that space stuck to the bridge of your nose, but its XYZ axes keep pointing in the same world directions (i.e. Z=north, Y=up) even when you move your head. So the viewer pose in relation to a position-disabled ref space has just the rotation component; the position is zero. If you get a controller pose in that space, the position would be relative to that point between your eyes.
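In code, the controller case looks roughly like this (hypothetical positions in meters; translation only, since the space's axes stay world-aligned):

```javascript
// Sketch: in a position-disabled reference space, a controller's position
// comes out relative to the tracked head position, while the axes stay
// world-aligned. The head position itself is never exposed in poses.
const headPosition = [0, 1.5, 0];          // tracked internally
const controllerWorld = [0.25, 1.25, -0.5]; // controller in world space

const controllerInSpace = controllerWorld.map((c, i) => c - headPosition[i]);
console.log(controllerInSpace); // → [0.25, -0.25, -0.5]
```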
No, the identity reference space when used for querying viewer poses is effectively a coordinate system glued to the bridge of your nose that moves along with your head, so that +X always points towards your right eye, assuming you haven't changed originOffset. The transform from that to your viewer space is always an identity matrix, hence the name. If you use originOffset, its inverse gets applied to poses you query from it. You'll never see any actually measured headset movement or rotations in those poses, only what you set yourself via
I think you had it right initially, there's no extra magic. getPose(viewerSpace, referenceSpace) and getViewerPose(referenceSpace) are the same thing, and your descriptions of identity and position-disabled match what I had described above. Where do you see a mismatch?
eye-level is a fixed origin at a given point in space, i.e. your head's resting position when sitting in your gaming chair. If you lean 30cm to the left and have a 6DoF headset, you'll get a position of (-0.3, 0, 0) or so (in meters) in the eye-level reference space, and exactly (0, 0, 0) for position-disabled and identity. If you also tilt your head, you'll get a corresponding orientation for eye-level and position-disabled poses, but the identity reference space pose will not change at all in response to head movements.
If you have a 3DoF headset, the eye-level pose would have a smaller position change in response to head movement based on a neck model (it can't detect leaning), while position-disabled and identity would behave the same as a 6DoF headset.
Exactly the same as getViewerPose().
Does that help?
Thank you, this helps a bunch!
Not quite, though,
(It seems like your description matches my perception of
Ah, this gets to the core of my confusion; this isn't at all clear from the spec or the spatial tracking explainer
The name "identity" for a reference space strongly evokes an image of a reference space at rest at (0, 0, 0) (in some stationary reference space). This is further confusing for
Given the (to me) "natural" definition of what identity and position-disabled do, and given that these reference spaces are defined with how
Regarding the specced definition of "stationary", we currently have
This seems like an inconsistent set of definitions.
And the definition of
This is also pretty confusing and inconsistent, and is part of what made it hard for me to understand what identity/position-disabled did.
This may belong in a separate issue, but it's all very closely related, so I'll leave the comment here; let me know if I should split this out. It does raise some more pressing concerns: while the main topic of this thread is largely due to the spec being unclear, here we have the spec making false statements, which should definitely be fixed.
I guess we can do a bunch of things here:
Ah, it wasn't clear to me that this is what you meant. It's a bit more complicated. getViewerPose returns an XRViewerPose. This is-a XRPose, and as such contains a
However, in addition to the XRPose transform, XRViewerPose also has a views array, and each view in that array has its own transform corresponding to a specific view. For a simple HMD each of those corresponds to an eye space with its corresponding offset and forward direction, but it could also be more complicated such as a display with angled screens or multiple screens per eye.
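A simplified sketch of that structure (names and offsets are hypothetical; a real XRViewerPose exposes this via its views array, and each view's transform is a full rigid transform, not just a translation):

```javascript
// Sketch: per-view positions derived from a viewer pose plus fixed eye
// offsets, assuming an identity head rotation for brevity.
const viewerPosition = [0, 1.6, 0];
const eyeOffsets = { left: [-0.03, 0, 0], right: [0.03, 0, 0] };

const views = Object.entries(eyeOffsets).map(([eye, offset]) => ({
  eye,
  position: viewerPosition.map((v, i) => v + offset[i]),
}));

console.log(views[0]); // → { eye: 'left', position: [-0.03, 1.6, 0] }
```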
The name just means that querying viewer poses against it will always get you an identity transform. Maybe it would have been more consistent to call it position-and-orientation-disabled, but that would be clunky. It can't work the way you'd prefer since it's specifically intended for situations with no tracking, including inline mode where there's no hardware at all, so it doesn't know anything about stationary reference spaces.
Yes, the "stationary" spaces are intended for mostly-stationary users, though this isn't actually enforced (you can walk freely within the tracked area). And yes, the position-disabled space is a special case.
No, eye-level is a normal tracking space, it's just intended to give systems flexibility to do the right thing depending on hardware capabilities. On a 6DoF headset, it's very similar to floor-level, just with a different static origin, but floor-level and eye-level would typically be related by a simple transform that doesn't change when you move around. Only an explicit "reset seated orientation" or similar would change that transform between them. Imagine the coordinate axes for floor-level being stuck on the middle of the play area, and the eye-level coordinate axes could either be right above that, or be placed where the resting head position is in a gaming chair off to the side.
On a 3DoF headset, eye-level would have an origin that moves along with the base of your neck according to the inverse of the neck model, so the origin isn't static if you do translation movement such as leaning, but that is uninteresting since the application has no way of telling that it's not static. It's basically equivalent to the 6DoF version of eye-level as long as you keep your torso in place and just tilt your head.
You're right that this is inconsistent, and the spec shouldn't say that reference spaces are stationary without clarifying these exceptions.
I'm in favor of clarifying definitions, but I think they fit the use cases pretty well, so I think it would be reasonable to stick with the current names.
The spec does say that already.
I think that would be even more confusing since it's what you use when you can't track the viewer...
I think my proposed changes should help with some of these; for example, they explicitly define matrix entries and clarify how pose positions relate to coordinate values. Please let me know what you think.
Overall I think this is one of the reasons I kept getting confused, a lot of space things are expressed in terms of how they behave in
Yeah, as I said below it feels like naming the space after what you're supposed to use it for as opposed to what it is seems a bit weird.
One potential "fix" is to just flatten the types and get rid of XRStationaryReferenceSpace entirely, folding everything into XRReferenceSpace directly.
Another might be to clarify that the
I meant that it doesn't track any real entity, so it's a bit confusing. It's not wrong per se; a ghost entity positioned where your head was at t=0 is still an "entity".
Ah, the way I'm looking at these is a bit inverted -- I look at these spaces as behaving the same for 3DoF and 6DoF; however, in 3DoF the application gets no positional data, so it pretends the headset doesn't change position (ignoring neck modeling). I do think this might be a better way to spec this if we need to (eventually it would probably be useful to have notes in the spec explaining what happens for 3DoF devices), since it unifies the behavior of all spaces across devices, and you simply have to define how the viewer is tracked for various kinds of devices.
(I don't think this is a priority, but once we clarify all the other stuff I may take a crack at clarifying this.)
It does, thank you for that! The individual spaces still need more definitions, but we can do that separately (I might try to write a PR for it once yours lands)
I think it's a bit more real than that. The spec says the origin is "near the user's head at the time of creation", but doesn't specifically require a t=0 snapshot. An approach as in SteamVR should be compliant also, where the seated origin is a calibrated spot chosen by the user to match their preferred seated location.
I think we're in agreement here. One way of looking at it is that the restricted spaces discard some information and effectively ignore that part of the original headset pose. For example,
If you want to get fancy, you could consider
Oh, yeah, to be clear, my view of this isn't incompatible with yours: either the definition of the spaces changes based on headset limitations, or what we look at as the coordinates of the "viewer" when defining spaces changes based on limitations.
I was suggesting to use the latter as it lets us consolidate all the device differences (inline, 3dof, 6dof) into a single definition of "the tracked viewer position", which can then be used by all the other XRSpaces. But that's a minor thing.
A piece of this story is #565, where we're discussing if