Surface MediaPipe Iris model for the web #2526

Closed
badlogic opened this issue Sep 9, 2021 · 9 comments
Assignees: kostyaby
Labels: legacy:face geometry (Face mesh geometry library) · platform:web (web related) · type:feature (Enhancement in the New Functionality or Request for a New Solution)

Comments

@badlogic

badlogic commented Sep 9, 2021

System information (Please provide as much relevant information as possible)

  • MediaPipe Solution (you are using): MediaPipe Face Mesh JavaScript
  • Programming language: TypeScript
  • Are you willing to contribute it (Yes/No): Yes

Describe the feature and the current behavior/state:
I'm currently using the MediaPipe Face Mesh JavaScript API for a web-based virtual puppeteering application. I am successfully able to derive a head pose from the face geometry data. I am also able to use mouth and eye landmarks to drive parameters of a puppet, although the accuracy of the mouth landmarks requires heavy post-processing to be useful.

Eyes are a very important part of conveying emotion. As such, the application must be able to track the iris position as well as the open/close state of the eyelids. Sadly, the current eye landmarks do not include iris data, and the open/close state cannot be reliably derived from the eye contour landmarks, if at all, depending on head orientation.

I'm using the landmarks of the metric face geometry mesh (e.g. results.multiFaceGeometry[0].getMesh().getVertexBufferList()). I assumed the vertices in this mesh are invariant with respect to head orientation. However, they are not: the relative distances between landmarks change with head orientation even in a neutral pose. Here is the metric face geometry mesh at various head orientations, illustrating that the mesh is not head-orientation invariant.

Screen_Recording_2021-09-09_at_14.49.37.1.mp4
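For reference, this is roughly how I read those vertices today (a minimal sketch; the 5-floats-per-vertex XYZ + UV layout and the loose typing of the geometry object are assumptions on my part):

```typescript
// Minimal sketch of reading metric-space vertices from the Face Mesh results.
// Assumption: each vertex occupies 5 floats (XYZ position followed by UV).
import { Results } from '@mediapipe/face_mesh';

function getMetricVertices(results: Results): Array<[number, number, number]> {
  // multiFaceGeometry is only populated when enableFaceGeometry is set.
  const geometry = (results as any).multiFaceGeometry?.[0];
  if (!geometry) return [];
  const buffer: number[] = geometry.getMesh().getVertexBufferList();
  const vertices: Array<[number, number, number]> = [];
  for (let i = 0; i < buffer.length; i += 5) {
    // Keep only the XYZ position of each vertex, dropping the UV pair.
    vertices.push([buffer[i], buffer[i + 1], buffer[i + 2]]);
  }
  return vertices;
}
```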

Without a head-orientation-invariant face mesh, it is difficult or impossible to establish a neutral pose to compare the current pose against and to calculate puppeteering parameters, such as how closed an eye is, expressed in the range [0, 1]. While I can statistically treat the eyelid distance value to detect eye blinking/winking/closing for an arbitrary but fixed head orientation (a rough sketch of such a heuristic is shown below), the tracking falls apart as soon as the user turns their head.
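To make the problem concrete, here is the kind of eyelid heuristic I mean. The landmark indices (159/145 for the upper/lower lid of one eye, 33/133 for its corners) and the ratio bounds are assumptions for illustration; normalizing by the eye width reduces, but does not remove, the dependence on head orientation:

```typescript
import { NormalizedLandmark } from '@mediapipe/face_mesh';

// Sketch of a simple eye-openness heuristic. The indices below are the
// commonly cited Face Mesh indices for one eye (an assumption); the
// 0.1/0.3 ratio bounds are rough guesses that need per-user calibration
// against a neutral pose.
function eyeOpenness(landmarks: NormalizedLandmark[]): number {
  const dist = (a: NormalizedLandmark, b: NormalizedLandmark) =>
    Math.hypot(a.x - b.x, a.y - b.y, (a.z ?? 0) - (b.z ?? 0));
  const lidGap = dist(landmarks[159], landmarks[145]);  // upper vs. lower lid
  const eyeWidth = dist(landmarks[33], landmarks[133]); // eye corners
  const ratio = lidGap / eyeWidth;
  // Clamp the normalized ratio into [0, 1].
  return Math.min(1, Math.max(0, (ratio - 0.1) / (0.3 - 0.1)));
}
```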

The eyelid landmarks also never fully close in this model, and the left and right eye are linked to each other. In the video below, only a single eye is closed at a time, yet the eye contour landmarks move for both eyes.

Screen.Recording.2021-09-09.at.15.01.24.mp4

For the iris position, I currently resort to image post-processing, detecting the iris/pupil inside the bounding box of the eye contour landmarks via contrast enhancement and a simple sliding-window histogram approach. The results are convincing enough under a wide range of lighting conditions. However, the additional computation contributes significantly to the overall processing time, which isn't ideal, especially in mobile web browsers.

Screen.Recording.2021-09-09.at.14.58.52.mp4
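Roughly, that fallback looks like the sketch below (simplified; the real pipeline also applies contrast enhancement first, and the window size here is an arbitrary assumption). It just picks the darkest window inside the eye's bounding box as the pupil/iris center:

```typescript
// Simplified sketch of the image-based iris fallback: slide a small window
// over the grayscale eye region and pick the darkest one as the pupil center.
function findIrisCenter(
  image: ImageData,
  roi: { x: number; y: number; width: number; height: number },
  windowSize = 8
): { x: number; y: number } {
  let best = { x: roi.x, y: roi.y };
  let bestSum = Number.POSITIVE_INFINITY;
  for (let wy = roi.y; wy <= roi.y + roi.height - windowSize; wy++) {
    for (let wx = roi.x; wx <= roi.x + roi.width - windowSize; wx++) {
      let sum = 0;
      for (let dy = 0; dy < windowSize; dy++) {
        for (let dx = 0; dx < windowSize; dx++) {
          const i = ((wy + dy) * image.width + (wx + dx)) * 4;
          // Luma approximation from the RGBA pixel data.
          sum += 0.299 * image.data[i] + 0.587 * image.data[i + 1] + 0.114 * image.data[i + 2];
        }
      }
      if (sum < bestSum) {
        bestSum = sum;
        best = { x: wx + windowSize / 2, y: wy + windowSize / 2 };
      }
    }
  }
  return best;
}
```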

From the MediaPipe Iris web demo, it appears that all or most of these issues can be solved by its model. Sadly, the iris model is not available for use on the web through the TypeScript/JavaScript API.

Will this change the current api? How?
This change would be non-breaking and would consist of additional configuration parameters as well as additional data in the Results object.

Who will benefit with this feature?
Anyone trying to use MediaPipe Face Mesh for facial expression detection and tracking.

Please specify the use cases for this feature:
Virtual puppeteering via blend shapes.

Any Other info:
There is a separate MediaPipe Face Landmarks Detection package from the TensorFlow team. It does contain the iris model; however, its performance in both accuracy and runtime speed is worse than that of the JavaScript package provided by MediaPipe itself. It's also very confusing to have two similarly named packages, one provided by TensorFlow and one provided by MediaPipe. I understand the MediaPipe models and pipelines differ from those in the TensorFlow package.

Apple's ARKit does have dedicated blend shape support, which would be ideal to have in MediaPipe Face Mesh for any facial expression detection and tracking.

@badlogic added the type:feature label Sep 9, 2021
@sgowroji added the platform:web, legacy:face geometry, and stat:awaiting googler labels Sep 10, 2021
@sgowroji assigned kostyaby and unassigned sgowroji Sep 10, 2021
@kostyaby

Hey @badlogic,

Thanks for reaching out! I see 2 separate problems mentioned in your request:

  1. No MediaPipe Iris JS API
  2. Existing MediaPipe Face Geometry JS API doesn't provide enough stability for deriving ARKit-like eye blendshapes

Regarding (1), I CC'ed @chuoling and @mhays-google for visibility of this request, and perhaps to comment on a timeline.

Regarding (2), I can share that the same face geometry logic, plus (probably) a better face mesh tracking model, is what drives the ARKit-like eye blendshapes for AR Puppets in the Google Duo app (coverage; you can check it out in the app). Yes, Face Mesh "normalization" via Face Geometry is not perfect, but it should get you into the ballpark for solving your problem. A separate question is whether the released MediaPipe Face Mesh tracking model is good enough for AR puppeteering. I wasn't the person who wrote the Google Duo puppet heuristic, so it's hard for me to share specifics. I CC'ed @ivan-grishchenko; he should have more to say on this topic.

@mhays-google
Contributor

Hi! We're planning on releasing Iris data points in both our face_mesh and holistic solutions around October.

@mattrossman

It's also very confusing to have two similarly named packages, one provided by TensorFlow and one provided by MediaPipe. I understand the MediaPipe models and pipelines differ from those in the TensorFlow package.

Could someone clarify what the difference is between the various distributions?

From what I can tell, the original @tensorflow-models/facemesh package (now deprecated) corresponded to the model from this paper: Real-time Facial Surface Geometry from Monocular Video on Mobile GPUs

The subsequent release of @tensorflow-models/face-landmarks-detection appears to repackage the old model, with the option to opt in to a higher-fidelity (but less performant) model for iris tracking using the advances from this paper: Attention Mesh: High-fidelity Face Mesh Prediction in Real-time. It's unclear whether this model also includes the improvements to eye/lip tracking from that paper.

Today I learned there is also the @mediapipe/face_mesh package from this repo, which doesn't include iris tracking. That confuses me, since this package was published more recently than the TensorFlow one. Is it just the same as the original facemesh package? Which distribution are developers advised to use?

Lastly, like @badlogic, I am working on a blend shape puppeteering project and would appreciate guidance on how to achieve this, or better yet, built-in support for some standard blend shapes. AR puppeteering was demonstrated as a use case in the Attention Mesh paper, but since the mesh is not invariant to head orientation, as this issue points out, it's unclear how that was actually implemented. Despite my attempts at normalization, the model often exhibits undesired blend shape activation when the user turns their head.

@badlogic
Author

badlogic commented Sep 15, 2021 via email

@tyrmullen
Collaborator

Traditionally for MediaPipe, we say "facemesh" to refer to retrieving face landmarks, while iris detection is a secondary refinement ML model that can optionally be applied afterwards. For reference, see the graph for iris tracking on top of face landmarks (visualization and live web demo here: https://viz.mediapipe.dev/demo/iris_tracking, as mentioned by @badlogic); that demo is a bit older, but it is probably still the best reference.

The @tensorflow packages are part of TF.js, so while they may use MediaPipe models, our team has usually been less involved with those ports, and I'm unable to comment in much detail there, although hopefully that trend is changing now and in the near future.

The @mediapipe/facemesh package will contain the latest open-sourced MediaPipe models for face landmarks, as well as the MediaPipe-recommended pre- and post-processing pipelines (usually just what you'd find in the corresponding graphs under our modules/ directory). It is a standalone JS API initially created specifically for face landmarks, requiring minimal setup or extra code (we term these single-purpose turnkey offerings "Solutions APIs"), but it was therefore not designed to handle more complicated alternative use cases.
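For anyone comparing the packages, here is a minimal usage sketch of that Solutions-style API. The option names follow the published @mediapipe/face_mesh package; the CDN path in locateFile and the specific option values are assumptions:

```typescript
// Minimal sketch of using the MediaPipe Face Mesh JS Solution API.
import { FaceMesh, Results } from '@mediapipe/face_mesh';

const faceMesh = new FaceMesh({
  // Assumed CDN location for the solution's assets; adjust as needed.
  locateFile: (file) => `https://cdn.jsdelivr.net/npm/@mediapipe/face_mesh/${file}`,
});

faceMesh.setOptions({
  maxNumFaces: 1,
  enableFaceGeometry: true, // needed for results.multiFaceGeometry
  minDetectionConfidence: 0.5,
  minTrackingConfidence: 0.5,
});

faceMesh.onResults((results: Results) => {
  const landmarks = results.multiFaceLandmarks?.[0];
  // ...drive puppeteering parameters from landmarks / face geometry here.
});

// Feed video frames to the solution in a render loop.
const video = document.querySelector('video')!;
async function loop() {
  await faceMesh.send({ image: video });
  requestAnimationFrame(loop);
}
loop();
```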

Note that we have a sibling module "iris_landmark" which can be used for iris tracking refinements to the face landmarks, but there is no corresponding MediaPipe JS Solution API for it yet, nor has it been integrated into facemesh (see @mhays-google's comment above for an ETA).

Unfortunately, I don't believe any lip refinement code or models have been open-sourced as of yet either (nor do I know of any plans to do so).

And as for blend shapes, @ivan-grishchenko will have to weigh in.

@sgowroji added the stat:awaiting response label and removed the stat:awaiting googler label Sep 16, 2021
@google-ml-butler

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you.

@google-ml-butler

Closing as stale. Please reopen if you'd like to work on this further.

@mattrossman

Here's the response I got from Ivan when inquiring about computing blend shapes:

Trying to get blend shapes with simple heuristics can be challenging, especially for more complex ones like moving eyebrows.

We used two approaches: NN and fitting.

The NN approach is simple:

  • Generate a synthetic dataset with expressions (we used the Basel model and rigged it with https://www.polywink.com/) to get blend shape ground truth
  • Bootstrap it with the Face Mesh model to get landmarks
  • Train a small classification NN (a few fully connected layers) to predict blend shapes from landmarks.

We rely on the NN to learn different rotations and camera parameters by itself.

The fitting approach is based on using some 3DMM (e.g. the same Basel model rigged with Polywink) and running an optimization that tries to fit it to the predicted Face Mesh landmarks. So basically you want to find 3DMM blend shapes plus translation/scale/rotation such that, after you project the model onto the 2D image surface (here you need to know or assume camera parameters), the landmarks match.

Another big problem that we didn't solve in the end, since we want our approach to work single-shot, is distinguishing between face shapes and blend shapes. E.g. did the user close their eyes by 50%, or is that their neutral state? AFAIK Apple solves this by detecting your face shape as a separate step (they have a depth sensor to make it more accurate and need to run it only once, and they also have identity recognition, so after detecting your face shape once they can memorize it).

So I'd say the fitting approach is easier to start with and easier to debug and fine-tune, while the NN approach can handle more complex blend shapes and can more easily accommodate different camera parameters and face rotations.

I've been considering something similar to the NN approach outlined here: generating a dataset that maps input images to blend shape outputs using a pre-rigged ReadyPlayerMe avatar and Blender's Python API. However, this may be too time-consuming for the scope of my project, since I'm not very experienced with TensorFlow. Hopefully a future release of this solution can include automatic blend shape computation.
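For concreteness, here is a rough sketch of what that small classification NN could look like in TensorFlow.js. The 468-landmark input, the layer sizes, and the 52-value blend shape output are assumptions for illustration, not the configuration the MediaPipe team used:

```typescript
import * as tf from '@tensorflow/tfjs';

// Assumed dimensions: 468 Face Mesh landmarks in, 52 ARKit-style blend
// shape weights out. The layer sizes are placeholders.
const NUM_LANDMARKS = 468;
const NUM_BLENDSHAPES = 52;

function buildBlendShapeModel(): tf.LayersModel {
  const model = tf.sequential();
  model.add(tf.layers.dense({
    inputShape: [NUM_LANDMARKS * 3], // flattened x/y/z per landmark
    units: 256,
    activation: 'relu',
  }));
  model.add(tf.layers.dense({ units: 128, activation: 'relu' }));
  // Sigmoid keeps each predicted blend shape weight in [0, 1].
  model.add(tf.layers.dense({ units: NUM_BLENDSHAPES, activation: 'sigmoid' }));
  model.compile({ optimizer: 'adam', loss: 'meanSquaredError' });
  return model;
}

// Training would pair landmarks bootstrapped from the synthetic renders with
// the ground-truth blend shape weights exported from the rig.
```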

@fa18-rcs-040

Hi MediaPipe community, I am Muhammad Adnan from Pakistan, and I am doing research on iris landmark datasets. I need the dataset used by the MediaPipe iris landmarks module. Could you please help by providing that dataset so I can proceed with my research? I would be very thankful for this favor.
