Surface MediaPipe Iris model for the web #2526

Closed
badlogic opened this issue Sep 9, 2021 · 9 comments
Assignees: kostyaby
Labels: legacy:face geometry (Face mesh geometry library) · platform:web (web related) · type:feature (Enhancement in the New Functionality or Request for a New Solution)

Comments

@badlogic

badlogic commented Sep 9, 2021

System information (Please provide as much relevant information as possible)

  • MediaPipe Solution (you are using): MediaPipe Face Mesh JavaScript
  • Programming language: TypeScript
  • Are you willing to contribute it (Yes/No): Yes

Describe the feature and the current behavior/state:
I'm currently using the MediaPipe Face Mesh JavaScript API for a web-based virtual puppeteering application. I am successfully able to derive a head pose from the face geometry data. I am also able to use mouth and eye landmarks to drive parameters of a puppet, although the accuracy of the mouth landmarks requires heavy post-processing to be useful.

Eyes are a very important part of conveying emotion. As such, the application must be able to track the iris position as well as the open/close state of the eyelids. Sadly, the current eye landmarks do not include iris data, and the open/close state cannot be reliably derived from the eye contour landmarks, if at all, depending on head orientation.

I'm using the landmarks of the metric face geometry mesh (e.g. results.multiFaceGeometry[0].getMesh().getVertexBufferList()). I assumed the vertices in this mesh are invariant with respect to head orientation. However, they are not: the relative distances between landmarks change with head orientation even in a neutral pose. Here is the metric face geometry mesh at various head orientations, illustrating that the mesh is not head-orientation invariant.

Screen_Recording_2021-09-09_at_14.49.37.1.mp4
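For reference, this is roughly how I read those vertices today (a minimal sketch; the 5-floats-per-vertex XYZ + UV layout and the loose typing of the geometry object are assumptions on my part):

```typescript
// Minimal sketch of reading metric-space vertices from the Face Mesh results.
// Assumption: each vertex occupies 5 floats (XYZ position followed by UV).
import { Results } from '@mediapipe/face_mesh';

function getMetricVertices(results: Results): Array<[number, number, number]> {
  // multiFaceGeometry is only populated when enableFaceGeometry is set.
  const geometry = (results as any).multiFaceGeometry?.[0];
  if (!geometry) return [];
  const buffer: number[] = geometry.getMesh().getVertexBufferList();
  const vertices: Array<[number, number, number]> = [];
  for (let i = 0; i < buffer.length; i += 5) {
    // Keep only the XYZ position of each vertex, dropping the UV pair.
    vertices.push([buffer[i], buffer[i + 1], buffer[i + 2]]);
  }
  return vertices;
}
```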

Without a head-orientation-invariant face mesh, it is difficult or impossible to establish a neutral pose to compare the current pose against and to calculate puppeteering parameters, such as how closed an eye is, expressed in the range [0, 1]. While I can statistically treat the eyelid distance value to detect eye blinking/winking/closing for an arbitrary but fixed head orientation (a rough sketch of such a heuristic is shown below), the tracking falls apart as soon as the user turns their head.
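To make the problem concrete, here is the kind of eyelid heuristic I mean. The landmark indices (159/145 for the upper/lower lid of one eye, 33/133 for its corners) and the ratio bounds are assumptions for illustration; normalizing by the eye width reduces, but does not remove, the dependence on head orientation:

```typescript
import { NormalizedLandmark } from '@mediapipe/face_mesh';

// Sketch of a simple eye-openness heuristic. The indices below are the
// commonly cited Face Mesh indices for one eye (an assumption); the
// 0.1/0.3 ratio bounds are rough guesses that need per-user calibration
// against a neutral pose.
function eyeOpenness(landmarks: NormalizedLandmark[]): number {
  const dist = (a: NormalizedLandmark, b: NormalizedLandmark) =>
    Math.hypot(a.x - b.x, a.y - b.y, (a.z ?? 0) - (b.z ?? 0));
  const lidGap = dist(landmarks[159], landmarks[145]);  // upper vs. lower lid
  const eyeWidth = dist(landmarks[33], landmarks[133]); // eye corners
  const ratio = lidGap / eyeWidth;
  // Clamp the normalized ratio into [0, 1].
  return Math.min(1, Math.max(0, (ratio - 0.1) / (0.3 - 0.1)));
}
```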

The eyelid landmarks also never fully close in this model, and the left and right eye are linked to each other. In the video below, only a single eye is closed at a time, yet the eye contour landmarks move for both eyes.

Screen.Recording.2021-09-09.at.15.01.24.mp4

For the iris position, I currently resort to image post-processing, detecting the iris/pupil inside the bounding box of the eye contour landmarks via contrast enhancement and a simple sliding-window histogram approach. The results are convincing enough under a wide range of lighting conditions. However, the additional computation contributes significantly to the overall processing time, which isn't ideal, especially in mobile web browsers.

Screen.Recording.2021-09-09.at.14.58.52.mp4
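Roughly, that fallback looks like the sketch below (simplified; the real pipeline also applies contrast enhancement first, and the window size here is an arbitrary assumption). It just picks the darkest window inside the eye's bounding box as the pupil/iris center:

```typescript
// Simplified sketch of the image-based iris fallback: slide a small window
// over the grayscale eye region and pick the darkest one as the pupil center.
function findIrisCenter(
  image: ImageData,
  roi: { x: number; y: number; width: number; height: number },
  windowSize = 8
): { x: number; y: number } {
  let best = { x: roi.x, y: roi.y };
  let bestSum = Number.POSITIVE_INFINITY;
  for (let wy = roi.y; wy <= roi.y + roi.height - windowSize; wy++) {
    for (let wx = roi.x; wx <= roi.x + roi.width - windowSize; wx++) {
      let sum = 0;
      for (let dy = 0; dy < windowSize; dy++) {
        for (let dx = 0; dx < windowSize; dx++) {
          const i = ((wy + dy) * image.width + (wx + dx)) * 4;
          // Luma approximation from the RGBA pixel data.
          sum += 0.299 * image.data[i] + 0.587 * image.data[i + 1] + 0.114 * image.data[i + 2];
        }
      }
      if (sum < bestSum) {
        bestSum = sum;
        best = { x: wx + windowSize / 2, y: wy + windowSize / 2 };
      }
    }
  }
  return best;
}
```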

From the MediaPipe Iris web demo, it appears that all or most of these issues can be solved by its model. Sadly, the iris model is not available for use on the web through the TypeScript/JavaScript API.

Will this change the current api? How?
This change would be non-breaking and would consist of additional configuration parameters as well as additional data in the Results object.

Who will benefit with this feature?
Anyone trying to use MediaPipe Face Mesh for facial expression detection and tracking.

Please specify the use cases for this feature:
Virtual puppeteering via blend shapes.

Any Other info:
There is a separate MediaPipe Face Landmarks Detection package from the TensorFlow team. It does contain the iris model; however, its performance in both accuracy and runtime speed is worse than that of the JavaScript package provided by MediaPipe itself. It's also very confusing to have two similarly named packages, one provided by TensorFlow and one provided by MediaPipe. I understand the MediaPipe models and pipelines differ from those in the TensorFlow package.

Apple's ARKit does have dedicated blend shape support, which would be ideal to have in MediaPipe Face Mesh for any facial expression detection and tracking.

@badlogic added the type:feature label Sep 9, 2021
@sgowroji added the platform:web, legacy:face geometry, and stat:awaiting googler labels Sep 10, 2021
@sgowroji assigned kostyaby and unassigned sgowroji Sep 10, 2021
@kostyaby

Hey @badlogic,

Thanks for reaching out! I see 2 separate problems mentioned in your request:

  1. No MediaPipe Iris JS API
  2. Existing MediaPipe Face Geometry JS API doesn't provide enough stability for deriving ARKit-like eye blendshapes

Regarding (1), I CC'ed @chuoling and @mhays-google for visibility of this request, and perhaps to comment on a timeline.

Regarding (2), I can share that the same face geometry logic, plus (probably) a better face mesh tracking model, is what drives the ARKit-like eye blendshapes for AR Puppets in the Google Duo app (coverage; you can check it out in the app). Yes, Face Mesh "normalization" via Face Geometry is not perfect, but it should get you into the ballpark for solving your problem. A separate question is whether the released MediaPipe Face Mesh tracking model is good enough for AR puppeteering. I wasn't the person who wrote the Google Duo puppet heuristic, so it's hard for me to share specifics. I CC'ed @ivan-grishchenko; he should have more to say on this topic.

@mhays-google
Contributor

Hi! We're planning on releasing Iris data points in both our face_mesh and holistic solutions around October.

@mattrossman

It's also very confusing to have two similarly named packages, one provided by TensorFlow and one provided by MediaPipe. I understand the MediaPipe models and pipelines differ from those in the TensorFlow package.

Could someone clarify what the difference is between the various distributions?

From what I can tell, the original @tensorflow-models/facemesh package (now deprecated) corresponded to the model from this paper: Real-time Facial Surface Geometry from Monocular Video on Mobile GPUs

The subsequent release of @tensorflow-models/face-landmarks-detection appears to repackage the old model, with the option to opt in to a higher-fidelity (but less performant) model for iris tracking using the advances from this paper: Attention Mesh: High-fidelity Face Mesh Prediction in Real-time. It's unclear whether this model also includes the improvements to eye/lip tracking from that paper.

Today I learned there is also the @mediapipe/face_mesh package from this repo, which doesn't include iris tracking. That confuses me, since this package was published more recently than the TensorFlow one. Is it just the same as the original facemesh package? Which distribution are developers advised to use?

Lastly, like @badlogic, I am working on a blend shape puppeteering project and would appreciate guidance on how to achieve this, or better yet, built-in support for some standard blend shapes. AR puppeteering was demonstrated as a use case in the Attention Mesh paper, but since the mesh is not invariant to head orientation, as this issue points out, it's unclear how that was actually implemented. Despite my attempts at normalization, the model often exhibits undesired blend shape activation when the user turns their head.

@badlogic
Author

badlogic commented Sep 15, 2021 via email

@tyrmullen
Collaborator

Traditionally for MediaPipe, we say "facemesh" to refer to retrieving face landmarks, while iris detection is a secondary refinement ML model that can optionally be applied afterwards. For reference, see the graph for iris tracking on top of face landmarks (visualization and live web demo here: https://viz.mediapipe.dev/demo/iris_tracking, as mentioned by @badlogic); that demo is a bit older, but it is probably still the best reference.

The @tensorflow packages are part of TF.js, so while they may use MediaPipe models, our team has usually been less involved with those ports, and I'm unable to comment in much detail there, although hopefully that trend is changing now and in the near future.

The @mediapipe/facemesh package will contain the latest open-sourced MediaPipe models for face landmarks, as well as the MediaPipe-recommended pre- and post-processing pipelines (usually just what you'd find in the corresponding graphs under our modules/ directory). It is a standalone JS API initially created specifically for face landmarks, requiring minimal setup or extra code (we term these single-purpose turnkey offerings "Solutions APIs"), but it was therefore not designed to handle more complicated alternative use cases.
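For anyone comparing the packages, here is a minimal usage sketch of that Solutions-style API. The option names follow the published @mediapipe/face_mesh package; the CDN path in locateFile and the specific option values are assumptions:

```typescript
// Minimal sketch of using the MediaPipe Face Mesh JS Solution API.
import { FaceMesh, Results } from '@mediapipe/face_mesh';

const faceMesh = new FaceMesh({
  // Assumed CDN location for the solution's assets; adjust as needed.
  locateFile: (file) => `https://cdn.jsdelivr.net/npm/@mediapipe/face_mesh/${file}`,
});

faceMesh.setOptions({
  maxNumFaces: 1,
  enableFaceGeometry: true, // needed for results.multiFaceGeometry
  minDetectionConfidence: 0.5,
  minTrackingConfidence: 0.5,
});

faceMesh.onResults((results: Results) => {
  const landmarks = results.multiFaceLandmarks?.[0];
  // ...drive puppeteering parameters from landmarks / face geometry here.
});

// Feed video frames to the solution in a render loop.
const video = document.querySelector('video')!;
async function loop() {
  await faceMesh.send({ image: video });
  requestAnimationFrame(loop);
}
loop();
```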

Note that we have a sibling module "iris_landmark" which can be used for iris tracking refinements to the face landmarks, but there is no corresponding MediaPipe JS Solution API for it yet, nor has it been integrated into facemesh (see @mhays-google's comment above for an ETA).

Unfortunately, I don't believe any lip refinement code or models have been open-sourced as of yet either (nor do I know of any plans to do so).

And as for blend shapes, @ivan-grishchenko will have to weigh in.

@sgowroji added the stat:awaiting response label and removed the stat:awaiting googler label Sep 16, 2021
@google-ml-butler

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you.

@google-ml-butler

Closing as stale. Please reopen if you'd like to work on this further.

@mattrossman

Here's the response I got from Ivan when inquiring about computing blend shapes:

Trying to get blend shapes with simple heuristics can be challenging, especially for more complex ones like moving eyebrows.

We used two approaches: NN and fitting.

The NN approach is simple:

  • Generate a synthetic dataset with expressions (we used the Basel model and rigged it with https://www.polywink.com/) to get blend shape ground truth
  • Bootstrap it with the Face Mesh model to get landmarks
  • Train a small classification NN (a few fully connected layers) to predict blend shapes from landmarks.

We rely on the NN to learn different rotations and camera parameters by itself.

The fitting approach is based on using some 3DMM (e.g. the same Basel model rigged with Polywink) and running an optimization that tries to fit it to the predicted Face Mesh landmarks. So basically you want to find 3DMM blend shapes plus translation/scale/rotation such that, after you project the model onto the 2D image surface (here you need to know or assume camera parameters), the landmarks match.

Another big problem that we didn't solve in the end, since we want our approach to work single-shot, is distinguishing between face shapes and blend shapes. E.g. did the user close their eyes by 50%, or is that their neutral state? AFAIK Apple solves this by detecting your face shape as a separate step (they have a depth sensor to make it more accurate and need to run it only once, and they also have identity recognition, so after detecting your face shape once they can memorize it).

So I'd say the fitting approach is easier to start with and easier to debug and fine-tune, while the NN approach can handle more complex blend shapes and can more easily accommodate different camera parameters and face rotations.

I've been considering something similar to the NN approach outlined here: generating a dataset that maps input images to blend shape outputs using a pre-rigged ReadyPlayerMe avatar and Blender's Python API. However, this may be too time-consuming for the scope of my project, since I'm not very experienced with TensorFlow. Hopefully a future release of this solution can include automatic blend shape computation.
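For concreteness, here is a rough sketch of what that small classification NN could look like in TensorFlow.js. The 468-landmark input, the layer sizes, and the 52-value blend shape output are assumptions for illustration, not the configuration the MediaPipe team used:

```typescript
import * as tf from '@tensorflow/tfjs';

// Assumed dimensions: 468 Face Mesh landmarks in, 52 ARKit-style blend
// shape weights out. The layer sizes are placeholders.
const NUM_LANDMARKS = 468;
const NUM_BLENDSHAPES = 52;

function buildBlendShapeModel(): tf.LayersModel {
  const model = tf.sequential();
  model.add(tf.layers.dense({
    inputShape: [NUM_LANDMARKS * 3], // flattened x/y/z per landmark
    units: 256,
    activation: 'relu',
  }));
  model.add(tf.layers.dense({ units: 128, activation: 'relu' }));
  // Sigmoid keeps each predicted blend shape weight in [0, 1].
  model.add(tf.layers.dense({ units: NUM_BLENDSHAPES, activation: 'sigmoid' }));
  model.compile({ optimizer: 'adam', loss: 'meanSquaredError' });
  return model;
}

// Training would pair landmarks bootstrapped from the synthetic renders with
// the ground-truth blend shape weights exported from the rig.
```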

@fa18-rcs-040

Hi MediaPipe community, I am Muhammad Adnan from Pakistan, and I am doing research on iris landmark datasets. I need the dataset used by the MediaPipe iris landmarks module. Could you please help by providing that dataset so I can proceed with my research? I would be very thankful for this favor.
