Pose estimation refers to computer vision techniques that detect human figures in images and video, so that one can determine, for example, where someone’s elbow, shoulder, or foot shows up in an image. PoseNet does not recognize who is in an image; it simply estimates where key body joints are.
This repo contains a set of PoseNet models that are quantized and optimized for use on Coral's Edge TPU, together with some example code that shows how to run them on a camera stream.
Why PoseNet?
Pose estimation has many uses, from interactive installations that react to the body to augmented reality, animation, fitness uses, and more. We hope the accessibility of this model inspires more developers and makers to experiment and apply pose detection to their own unique projects, to demonstrate how machine learning can be deployed in ways that are anonymous and private.
How does it work?
At a high level, pose estimation happens in two phases:
1. An input RGB image is fed through a convolutional neural network. In our case this is a MobileNet V1 architecture. Instead of a classification head, however, there is a specialized head which produces a set of heatmaps (one for each kind of keypoint) and some offset maps. This step runs on the Edge TPU. The results are then fed into step 2.
2. A special multi-pose decoding algorithm is used to decode poses, pose confidence scores, keypoint positions, and keypoint confidence scores. Note that, unlike in the TensorFlow.js version, we have created a custom op in TensorFlow Lite and appended it to the network graph itself. This custom op does the decoding (on the CPU) as a post-processing step. The advantage is that we don't have to deal with the heatmaps directly: when we call this network through the Coral Python API, we simply get a series of keypoints from the network.
If you're interested in the gory details of the decoding algorithm and how PoseNet works under the hood, I recommend you take a look at the original research paper or this Medium post, which describes the raw heatmaps produced by the convolutional model.
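To make the role of the heatmaps and offset maps more concrete, here is a minimal sketch of the simpler single-pose decoding idea: pick the highest-scoring grid cell for each keypoint, scale it by the output stride, and refine it with the offset vector. This is not the multi-pose algorithm implemented by the custom op in this repo. The function name, the nested-list layout, and the ordering of the offset channels (y offsets first, then x offsets) are assumptions made for illustration, and the scores are assumed to already lie between 0.0 and 1.0.

```python
def decode_single_pose(heatmaps, offsets, output_stride):
    """Illustrative single-pose decoder (hypothetical helper, not the repo's op).

    heatmaps[y][x][k]     score of keypoint k at grid cell (y, x)
    offsets[y][x][k]      assumed y offset in pixels for keypoint k
    offsets[y][x][k + K]  assumed x offset in pixels for keypoint k
    Returns a list of (y_pixels, x_pixels, score), one entry per keypoint.
    """
    height = len(heatmaps)
    width = len(heatmaps[0])
    num_keypoints = len(heatmaps[0][0])
    keypoints = []
    for k in range(num_keypoints):
        # Find the grid cell with the highest score for this keypoint.
        best_score, best_y, best_x = max(
            (heatmaps[y][x][k], y, x)
            for y in range(height)
            for x in range(width)
        )
        # Coarse cell position scaled to pixels, refined by the offset vector.
        cell_offsets = offsets[best_y][best_x]
        y_px = best_y * output_stride + cell_offsets[k]
        x_px = best_x * output_stride + cell_offsets[k + num_keypoints]
        keypoints.append((y_px, x_px, best_score))
    return keypoints
```

With the models in this repo you never need to write this yourself, since the decoding is already part of the network graph.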
Pose: at the highest level, PoseNet will return a pose object that contains a list of keypoints and an instance-level confidence score for each detected person.
Keypoint: a part of a person’s pose that is estimated, such as the nose, right ear, left knee, right foot, etc. It contains both a position and a keypoint confidence score. PoseNet currently detects 17 keypoints illustrated in the following diagram:
Keypoint Confidence Score: this determines the confidence that an estimated keypoint position is accurate. It ranges between 0.0 and 1.0. It can be used to hide keypoints that are not deemed strong enough.
Keypoint Position: 2D x and y coordinates in the original input image where a keypoint has been detected.
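The concepts above can be sketched as two small data classes plus a filter that hides weak keypoints. The class and function names here are hypothetical stand-ins that mirror the attribute names used in the PoseEngine example later in this README (`pose.score`, `pose.keypoints`, `keypoint.yx`, `keypoint.score`); they are not the classes from pose_engine.py. The 0.4 pose threshold is taken from that example, while the 0.5 keypoint threshold is an arbitrary illustrative choice.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Keypoint:
    yx: Tuple[float, float]  # (y, x) position in input-image pixels
    score: float             # keypoint confidence, 0.0 to 1.0


@dataclass
class Pose:
    keypoints: Dict[str, Keypoint]  # keyed by label, e.g. 'nose'
    score: float                    # instance-level confidence


def visible_keypoints(poses: List[Pose],
                      min_pose_score: float = 0.4,
                      min_keypoint_score: float = 0.5) -> List[Dict[str, Keypoint]]:
    """For each confident pose, keep only the keypoints above the threshold."""
    result = []
    for pose in poses:
        if pose.score < min_pose_score:
            continue  # skip people the model is unsure about
        kept = {label: kp for label, kp in pose.keypoints.items()
                if kp.score >= min_keypoint_score}
        result.append(kept)
    return result
```

Filtering at both levels is a common pattern: the pose score rejects spurious detections, and the keypoint score hides individual joints that are occluded or out of frame.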
Examples in this repo
NOTE: PoseNet relies on the latest Coral API (2.11.1) - please update your system before running these examples. For more information on updating see:
To install all the requirements, simply run
A minimal example that simply downloads an image, and prints the pose keypoints.
A camera example that streams the camera image through posenet and draws the pose on top as an overlay. This is a great first example to run to familiarize yourself with the network and its outputs.
Run a simple demo like this:
If the camera and monitor are both facing you, consider adding the --mirror flag:
python3 pose_camera.py --mirror
In this repo we have included 3 PoseNet model files for different input resolutions. The larger resolutions are slower, of course, but allow a wider field of view and let poses that are further away be processed correctly.
    posenet_mobilenet_v1_075_721_1281_quant_decoder_edgetpu.tflite
    posenet_mobilenet_v1_075_481_641_quant_decoder_edgetpu.tflite
    posenet_mobilenet_v1_075_353_481_quant_decoder_edgetpu.tflite
You can change the camera resolution by using the --res parameter:
    python3 pose_camera.py --res 480x360   # fast but low res
    python3 pose_camera.py --res 640x480   # default
    python3 pose_camera.py --res 1280x720  # slower but high res
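As a rough sketch of how a resolution setting could be tied to one of the three model files, the lookup below pairs each `--res` value with a model in order of size. That pairing, the helper names, and the `models` directory default (borrowed from the PoseEngine example path later in this README) are assumptions for illustration; the actual selection logic lives in pose_camera.py. The second helper reads the input size out of a model file name, assuming the two numbers before `_quant` are height and width, which matches the 481_641 model being fed a 641x481 image in that example.

```python
import re

# Assumed pairing of --res values to model files, ordered by size.
MODEL_FOR_RES = {
    '480x360': 'posenet_mobilenet_v1_075_353_481_quant_decoder_edgetpu.tflite',
    '640x480': 'posenet_mobilenet_v1_075_481_641_quant_decoder_edgetpu.tflite',
    '1280x720': 'posenet_mobilenet_v1_075_721_1281_quant_decoder_edgetpu.tflite',
}


def model_for_resolution(res, model_dir='models'):
    """Return the model path for a resolution string such as '640x480'."""
    try:
        filename = MODEL_FOR_RES[res]
    except KeyError:
        raise ValueError('Unsupported resolution: %r' % res)
    return '%s/%s' % (model_dir, filename)


def input_size_from_model_name(path):
    """Parse (height, width) from a name like ..._481_641_quant_..."""
    match = re.search(r'_(\d+)_(\d+)_quant', path)
    if match is None:
        raise ValueError('No input size found in %r' % path)
    return int(match.group(1)), int(match.group(2))
```

For example, `input_size_from_model_name(model_for_resolution('640x480'))` yields the height and width to which an image would be resized before inference.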
A fun little app that demonstrates how Coral and PoseNet can be used to analyze human behavior in an anonymous and privacy-preserving way.
PoseNet converts an image of a human into a mere skeleton which captures its position and movement over time, but discards any precisely identifying features and the original camera image. Because Coral devices run all the image analysis locally, the actual image is never streamed anywhere and is immediately discarded. The poses can be safely stored or analysed.
For example, a store owner may want to study the behavior of customers as they move through the store, in order to optimize flow and improve product placement. A museum may want to track which areas are busiest at which times, so that it can advise visitors on which exhibits currently have the shortest waiting times.
With Coral this is possible without recording anybody's image directly or streaming data to a cloud service - instead the images are immediately discarded.
The anonymizer is a small app that demonstrates this in a fun way. To use the anonymizer, set up your camera in a sturdy position. Launch the app and walk out of the image. This demo waits until no one is in the frame, then stores the 'background' image. Now, step back in. You'll see your current pose overlaid on a static image of the background.
(If the camera and monitor are both facing you, consider adding the --mirror flag.)
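The background-capture behavior described above can be sketched as a tiny state machine. This is a simplified illustration, not the anonymizer's actual code: the class name, the `update` method, and the 0.4 score threshold are hypothetical, and the frame is treated as an opaque object.

```python
class BackgroundCapture:
    """Stores the first frame in which no confident pose is detected."""

    def __init__(self, min_pose_score=0.4):
        self.min_pose_score = min_pose_score
        self.background = None  # nothing stored until the scene is empty

    def update(self, frame, pose_scores):
        """Feed one frame and the scores of the poses detected in it.

        Returns the stored background to draw skeletons on, or None while
        still waiting for everyone to leave the frame.
        """
        scene_empty = all(score < self.min_pose_score for score in pose_scores)
        if self.background is None and scene_empty:
            self.background = frame
        return self.background
```

Once a background is held, later camera frames are only used for pose detection; the skeletons are drawn over the stored image, so a person's actual picture never reaches the display.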
This demo allows people to control musical synthesizers with their arms. Up to 3 people are each assigned a different instrument and octave, and control the pitch with their right wrists and the volume with their left wrists.
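One way to picture the wrist-to-sound mapping is the sketch below, which turns the vertical wrist positions into a MIDI note and a MIDI velocity (0 to 127). It is only an illustration under stated assumptions: the function name, the chromatic one-octave range, and the choice that a raised wrist means higher pitch or louder volume are all hypothetical and may differ from the demo's actual mapping.

```python
def wrist_to_midi(right_wrist_y, left_wrist_y, frame_height,
                  base_note=60, note_range=12):
    """Map wrist heights to (midi_note, midi_velocity).

    base_note=60 is middle C; a different base_note per person would
    place each player in a different octave.
    """
    def height_fraction(y):
        # Image y grows downward, so invert it: a raised wrist gives a
        # value near 1.0. Clamp in case the wrist leaves the frame.
        fraction = 1.0 - y / frame_height
        return min(1.0, max(0.0, fraction))

    note = base_note + round(height_fraction(right_wrist_y) * note_range)
    velocity = round(height_fraction(left_wrist_y) * 127)
    return note, velocity
```

For example, `wrist_to_midi(0, 480, 480)` describes a right wrist at the top of a 480-pixel-high frame and a left wrist at the bottom, giving the highest note in the range at zero volume.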
You'll need to install FluidSynth and a General Midi SoundFont:
    apt install fluidsynth fluid-soundfont-gm
    pip3 install pyfluidsynth
Now you can run the demo like this:
The PoseEngine class
The PoseEngine class (defined in
pose_engine.py) allows easy access
to the PoseNet network from Python, using the EdgeTPU API.
You simply initialize the class with the location of the model
file and then call DetectPosesInImage, passing a numpy object that contains the
image. The numpy object should be in uint8, [Y,X,RGB] format.
A minimal example might be:
    import numpy as np
    from PIL import Image
    from pose_engine import PoseEngine

    pil_image = Image.open('couple.jpg')
    # resize() returns a new image, so keep the result.
    # The 481_641 model expects a 641x481 (width x height) input.
    pil_image = pil_image.resize((641, 481), Image.NEAREST)
    engine = PoseEngine('models/posenet_mobilenet_v1_075_481_641_quant_decoder_edgetpu.tflite')
    poses, inference_time = engine.DetectPosesInImage(np.uint8(pil_image))
    print('Inference time: %.fms' % inference_time)

    for pose in poses:
        # Skip low-confidence detections.
        if pose.score < 0.4:
            continue
        print('\nPose Score: ', pose.score)
        for label, keypoint in pose.keypoints.items():
            # keypoint.yx holds (y, x), so index 1 is x and index 0 is y.
            print(' %-20s x=%-4d y=%-4d score=%.1f' %
                  (label, keypoint.yx[1], keypoint.yx[0], keypoint.score))
To try this, run
And you should see an output like this:
    Pose Score: 0.61885977
     nose                 x=210  y=152  score=1.0
     left eye             x=224  y=138  score=1.0
     right eye            x=199  y=136  score=1.0
     left ear             x=245  y=135  score=1.0
     right ear            x=183  y=129  score=0.8
     left shoulder        x=268  y=168  score=0.8
     right shoulder       x=161  y=172  score=1.0
     left elbow           x=282  y=255  score=0.6
     right elbow          x=154  y=254  score=0.9
     left wrist           x=236  y=333  score=0.7
     right wrist          x=163  y=301  score=0.6
     left hip             x=323  y=181  score=0.2
     right hip            x=191  y=251  score=0.0
     left knee            x=343  y=84   score=0.8
     right knee           x=162  y=295  score=0.0
     left ankle           x=318  y=174  score=0.1
     right ankle          x=167  y=309  score=0.0