## Pose Estimation using TensorFlow.js

![alt text](https://cdn-images-1.medium.com/max/1200/1*BKZqEPtvM-6xwarhABZQnA.gif "Logo Title Text 1")

- ML used to require a lot of time and money to get started
- Configurations, dependencies, hardware costs, lots of headaches
- But now anyone can train and test ML models in the browser using TensorFlow.js
- Even Python takes more setup (Jupyter notebooks, NumPy, scikit-learn, pandas, etc.)
- ML in the browser means
##### Privacy - Data is local, none leaves the client's device. Much safer.
##### Wide distribution - JavaScript has one of the widest install bases of any language and framework.
##### Distributed Computing - Leverage client-side data from many users to help train a model

![alt text](https://s3-ap-south-1.amazonaws.com/av-blog-media/wp-content/uploads/2018/04/0V4HYbZt28PHZZ3aD.png "Logo Title Text 1")

## What does the pipeline look like? 

###### Example - Using a pretrained model for classification (3 Steps) 

#### Step 1 - Load Model

- First, we'll need to import two files that define the structure of the model (model file) & its trained weights (weights manifest)

![alt text](https://cdn-images-1.medium.com/max/1600/1*m_QUz768MxAK4KXnvVL2nQ.jpeg "Logo Title Text 1")
```javascript
import * as tfc from '@tensorflow/tfjs-core';
import { loadFrozenModel } from '@tensorflow/tfjs-converter';
import { IMAGENET_CLASSES } from './imagenet_classes';

const MODEL_URL = '/models/mobilenet/optimized_model.pb';
const WEIGHTS_URL = '/models/mobilenet/weights_manifest.json';
const INPUT_NODE_NAME = 'input';
const OUTPUT_NODE_NAME = 'MobilenetV1/Predictions/Reshape_1';
const PREPROCESS_DIVISOR = tfc.scalar(255 / 2);

export default class MobileNet {
  // fetch the frozen graph (model file) and its trained weights (weights manifest)
  async load() {
    this.model = await loadFrozenModel(MODEL_URL, WEIGHTS_URL);
  }
  // the preprocess() and predict() methods from the next steps belong here
}
```

#### Step 2 - Preprocessing


- A neural network has a fixed input definition, so we'll need some preprocessing to get our input into the right shape
- First, we'll convert the pixels to a TensorFlow.js input tensor
- Then, we'll crop the input if we only want to use part of the image
- Lastly, we'll add a batch dimension at axis 0 (batch size 1), since we only want to run inference on one image

```javascript
preprocess(source) {
  console.log('input size:', this.model.input.shape); // [224,224,3]
  // tidy disposes of the intermediate tensors created inside the callback
  return tfc.tidy(() => {
    const input = tfc.fromPixels(source);
    // crop the image to match the input size of mobilenet
    // this is 224x224 px with 3 channel (RGB) color data
    // get a square from the middle of the image
    // resizing is done by the html5 canvas
    const croppedImage = MobileNet.cropImage(input);
    // mobilenet expects a batched input - build a [1, 224, 224, 3] tensor
    const batchedImage = croppedImage.expandDims(0);
    // normalization of the pixel color channel values
    // instead of 0-255 we get values between -1 and 1
    return batchedImage.toFloat().div(tfc.scalar(127.5)).sub(tfc.scalar(1));
  });
}
```
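
To make the crop and normalization arithmetic concrete, here is a minimal sketch in plain JavaScript with no TensorFlow.js dependency. The helpers `centerCropBox` and `normalizePixel` are hypothetical names used only for illustration (they are not part of the MobileNet class), and the divisor 127.5 (255 / 2) is assumed so that 0 maps to -1 and 255 maps to 1.

```javascript
// Hypothetical helper: the largest centered square inside a width x height image.
function centerCropBox(width, height) {
  const size = Math.min(width, height);
  return {
    x: Math.floor((width - size) / 2),
    y: Math.floor((height - size) / 2),
    size,
  };
}

// Hypothetical helper: map a color channel value from 0..255 to -1..1.
function normalizePixel(value) {
  return value / 127.5 - 1;
}

const box = centerCropBox(640, 480);
console.log(box); // { x: 80, y: 0, size: 480 }
console.log([0, 127.5, 255].map(normalizePixel)); // [ -1, 0, 1 ]
```

The tensor version does the same thing for every pixel at once; the square would then still need resizing to 224x224 before it matches the model input.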

#### Step 3 - Inference

- Running inference on an input (i.e. predicting its class) takes just two lines of code

```javascript
predict(source) {
  const processedInput = this.preprocess(source);
  return this.model.predict(processedInput);
}
```
- The result (prediction) will be a dictionary containing the probabilities for each class. 
- The max value within the dictionary is the most likely class
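
Picking the most likely classes can be sketched as below. This is an illustration in plain JavaScript, assuming the probabilities have already been read out of the prediction into a plain array (for a tensor, something like `await prediction.data()`). The function `topK`, the labels, and the probability values are made up for the example; in the pipeline above the labels would come from `IMAGENET_CLASSES`.

```javascript
// Return the k most probable classes as { index, label, probability } objects.
function topK(probabilities, labels, k) {
  return Array.from(probabilities)
    .map((probability, index) => ({ index, label: labels[index], probability }))
    .sort((a, b) => b.probability - a.probability)
    .slice(0, k);
}

// Made-up values for illustration only.
const exampleLabels = ['cat', 'dog', 'bird', 'fish'];
const exampleProbabilities = [0.1, 0.6, 0.25, 0.05];

const best = topK(exampleProbabilities, exampleLabels, 2);
console.log(best.map((c) => c.label)); // [ 'dog', 'bird' ]
```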

##### Good resources

- JS docs https://js.tensorflow.org/tutorials/core-concepts.html 
- My video https://www.youtube.com/watch?v=Nc8kZABv-KE
- Shiffman's series https://www.youtube.com/watch?v=Qt3ZABW5lD0

### What is pose estimation?

![alt text](http://www.ee.cuhk.edu.hk/~wyang/images/pose-cvpr2016.jpg "Logo Title Text 1")

- It's a computer vision technique that detects human figures in images and video
- The algorithm simply estimates where key body joints are, not who is in the image
- This has many uses
##### Interactive installations that react to the body
##### Augmented reality
##### Animation
##### Fitness

![alt text](https://cse.sc.edu/~fan23/projects/cvpr15/overview.png "Logo Title Text 1")

- Most open source pose detection systems require specialized hardware and/or cameras, as well as quite a bit of system setup.
- With PoseNet running on TensorFlow.js, anyone with a decent webcam-equipped desktop or phone can experience it in the browser
- JavaScript developers can tinker with and use this technology with just a few lines of code

### PoseNet

- PoseNet can be used to estimate either a single pose or multiple poses
- The single-person pose detector is faster and simpler but requires only one subject present in the image
- At a high level, pose estimation happens in two phases
- First, an input RGB image is fed through a convolutional neural network.
- Second, either a single-pose or multi-pose decoding algorithm is used to decode poses, pose confidence scores, keypoint positions, and keypoint confidence scores from the model outputs
- A Pose: at the highest level, PoseNet will return a pose object that contains a list of keypoints and an instance-level confidence score for each detected person.

![alt text](https://cdn-images-1.medium.com/max/1600/1*3bg3CO1b4yxqgrjsGaSwBw.png "Logo Title Text 1")

- A Pose confidence score determines the overall confidence in the estimation of a pose
- It ranges between 0.0 and 1.0
- It can be used to hide poses that are not deemed strong enough.
- A Keypoint is a part of a person’s pose that is estimated, such as the nose, right ear, left knee, right foot, etc. 
- It contains both a position and a keypoint confidence score. 
- PoseNet currently detects 17 keypoints illustrated in the following diagram:

![alt text](https://cdn-images-1.medium.com/max/1600/1*7qDyLpIT-3s4ylULsrnz8A.png "Logo Title Text 1")

- A Keypoint Confidence Score determines the confidence that an estimated keypoint position is accurate.
- It ranges between 0.0 and 1.0. It can be used to hide keypoints that are not deemed strong enough.
- A Keypoint Position is 2D x and y coordinates in the original input image where a keypoint has been detected.

![alt text](https://cdn-images-1.medium.com/max/1600/1*SpWPwprVuNYhXs44iTSODg.png "Logo Title Text 1")
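
As a rough illustration of how the two confidence scores can be used to hide weak results, here is a minimal sketch. The object layout (`score`, `keypoints`, `part`, `position`) is an assumption for the example that mirrors the description above, the values are made up, and `confidentKeypoints` is a hypothetical helper, not a PoseNet function.

```javascript
// Made-up pose for illustration: one pose score plus per-keypoint scores.
const examplePose = {
  score: 0.8,
  keypoints: [
    { part: 'nose', score: 0.95, position: { x: 120, y: 80 } },
    { part: 'leftEye', score: 0.9, position: { x: 128, y: 72 } },
    { part: 'leftKnee', score: 0.2, position: { x: 110, y: 300 } },
  ],
};

// Hypothetical helper: drop the whole pose if it is too weak,
// otherwise keep only the keypoints that clear the per-part threshold.
function confidentKeypoints(pose, minPoseScore, minPartScore) {
  if (pose.score < minPoseScore) return [];
  return pose.keypoints.filter((kp) => kp.score >= minPartScore);
}

const visible = confidentKeypoints(examplePose, 0.5, 0.5);
console.log(visible.map((kp) => kp.part)); // [ 'nose', 'leftEye' ]
```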

### Single Person Pose Estimation

- The single-pose estimation algorithm is the simpler and faster of the two
- Its ideal use case is when there is only one person centered in an input image or video
- The inputs for the single-pose estimation algorithm are as follows:

- Input image element: An HTML element that contains an image to predict poses for, such as a video or image tag.
- Image scale factor: What to scale the image by before feeding it through the network.
- Flip horizontal: Whether the poses should be flipped/mirrored horizontally
- Output stride: This parameter affects the height and width of the layers in the neural network

- The outputs are as follows:

- A pose, containing both a pose confidence score and an array of 17 keypoints.
- Each keypoint contains a keypoint position and a keypoint confidence score.
- All the keypoint positions have x and y coordinates in the input image space, and can be mapped directly onto the image

- The PoseNet model is image size invariant, which means it can predict pose positions in the same scale as the original image regardless of whether the image is downscaled. 
-  The output stride determines how much we’re scaling down the output relative to the input image size. 

![alt text](https://cdn-images-1.medium.com/max/1600/1*zXXwR16kprAWLPIOKCrXLw.png "Logo Title Text 1")

- When the output stride is set to 8 or 16, the amount of input striding in the layers is reduced to create a larger output resolution
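
The effect of the output stride on resolution can be sketched numerically. The formula used here, resolution = ((input image size - 1) / output stride) + 1, is an assumption taken from the PoseNet announcement rather than from this document, and `outputResolution` is a hypothetical helper name.

```javascript
// Assumed relationship: resolution = ((inputImageSize - 1) / outputStride) + 1
function outputResolution(inputImageSize, outputStride) {
  return (inputImageSize - 1) / outputStride + 1;
}

// A 225 px wide input at the three strides discussed above.
console.log(outputResolution(225, 32)); // 8
console.log(outputResolution(225, 16)); // 15
console.log(outputResolution(225, 8)); // 29
```

A smaller stride gives a larger output resolution, which tends to be more accurate but slower.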


- When PoseNet processes an image, what is in fact returned is a heatmap along with offset vectors that can be decoded to find high confidence areas in the image that correspond to pose keypoints

![alt text](https://cdn-images-1.medium.com/max/1600/1*mcaovEoLBt_Aj0lwv1-xtA.png "Logo Title Text 1")

- Both of these outputs are 3D tensors with a height and width that we’ll refer to as the resolution. 
- Each heatmap is a 3D tensor of size resolution x resolution x 17, since 17 is the number of keypoints detected by PoseNet.
- Each offset vector is a 3D tensor of size resolution x resolution x 34, where 34 is the number of keypoints * 2. 

- After the image is fed through the model, we perform a few calculations to estimate the pose from the outputs. 
- The single-pose estimation algorithm, for example, returns a pose confidence score and an array of keypoints (indexed by part ID), each with a confidence score and x, y position.
- To get the keypoints of the pose, a sigmoid activation is applied to the heatmap to get the scores.

```
scores = heatmap.sigmoid()
```

- `argmax2d` is then applied to the keypoint confidence scores to get the x and y index in the heatmap with the highest score for each part, which is essentially where the part is most likely to exist. This produces a tensor of size 17x2, with each row being the y and x index in the heatmap with the highest score for each part.

```
heatmapPositions = scores.argmax(y, x)
```

- The offset vector for each part is retrieved by getting the x and y from the offsets corresponding to the x and y index in the heatmap for that part. 

```
offsetVector = [offsets.get(y, x, k), offsets.get(y, x, 17 + k)]
```


- To get the keypoint, each part’s heatmap x and y are multiplied by the output stride then added to their corresponding offset vector, which is in the same scale as the original image.

```
keypointPositions = heatmapPositions * outputStride + offsetVectors
```

- Finally, each keypoint confidence score is the confidence score of its heatmap position. The pose confidence score is the mean of the scores of the keypoints.
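
The decoding steps above can be put together in a small sketch. This is plain JavaScript on nested arrays rather than tensors, and it is an illustration under assumptions, not PoseNet's actual implementation: `decodeSinglePose` is a hypothetical name, the heatmap is indexed as `[y][x][part]`, and the offsets store the y offset of part k in channel k and its x offset in channel `numParts + k`, following the pseudocode above.

```javascript
const sigmoid = (v) => 1 / (1 + Math.exp(-v));

// heatmap: [resolution][resolution][numParts] raw model outputs
// offsets: [resolution][resolution][2 * numParts] (y offsets first, then x offsets)
function decodeSinglePose(heatmap, offsets, outputStride) {
  const numParts = heatmap[0][0].length;
  const keypoints = [];
  for (let k = 0; k < numParts; k++) {
    // argmax2d: find the heatmap cell with the highest score for part k
    let best = { y: 0, x: 0, score: -Infinity };
    heatmap.forEach((row, y) => row.forEach((cell, x) => {
      const score = sigmoid(cell[k]);
      if (score > best.score) best = { y, x, score };
    }));
    // look up the offset vector stored at that cell
    const offsetY = offsets[best.y][best.x][k];
    const offsetX = offsets[best.y][best.x][numParts + k];
    // scale the heatmap index back to image space, then add the offset
    keypoints.push({
      part: k,
      score: best.score,
      position: {
        y: best.y * outputStride + offsetY,
        x: best.x * outputStride + offsetX,
      },
    });
  }
  // the pose score is the mean of the keypoint scores
  const score = keypoints.reduce((sum, kp) => sum + kp.score, 0) / numParts;
  return { score, keypoints };
}

// Tiny made-up example: resolution 2, a single part, output stride 16.
const exampleHeatmap = [[[-2], [0]], [[3], [-1]]];
const exampleOffsets = [[[0, 0], [0, 0]], [[4, 5], [0, 0]]];
const decoded = decodeSinglePose(exampleHeatmap, exampleOffsets, 16);
console.log(decoded.keypoints[0].position); // { y: 20, x: 5 }
```

In the example the highest score sits at heatmap cell (y = 1, x = 0), so the keypoint lands at y = 1 * 16 + 4 and x = 0 * 16 + 5 in image space.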

### Multi Person Pose Estimation

![alt text](https://cdn-images-1.medium.com/max/1600/1*EZOqbMLkIwBgyxrKLuQTHA.png "Logo Title Text 1")

- The multi-person algorithm is slower, but if multiple people appear in a picture, their detected keypoints are less likely to be associated with the wrong pose.
- Moreover, an attractive property of this algorithm is that performance is not affected by the number of persons in the input image. Whether there are 15 persons to detect or 5, the computation time will be the same.

Let’s review the inputs:

- Input image element — Same as single-pose estimation
- Image scale factor — Same as single-pose estimation
- Flip horizontal — Same as single-pose estimation
- Output stride — Same as single-pose estimation
- Maximum pose detections — An integer. Defaults to 5. The maximum number of poses to detect.
- Pose confidence score threshold — 0.0 to 1.0. Defaults to 0.5. At a high level, this controls the minimum confidence score of poses that are returned.
- Non-maximum suppression (NMS) radius — A number in pixels. At a high level, this controls the minimum distance between poses that are returned. This value defaults to 20, which is probably fine for most cases. It should be increased/decreased as a way to filter out less accurate poses but only if tweaking the pose confidence score is not good enough.
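
To give a feel for how the last three parameters interact, here is a deliberately simplified greedy sketch. It is not PoseNet's actual multi-pose decoder: `selectRoots` is a hypothetical function, and the candidate list (root points with a score) is an assumed input shape with made-up values.

```javascript
// candidates: [{ x, y, score }] hypothetical root-part candidates in image space
function selectRoots(candidates, maxPoseDetections, scoreThreshold, nmsRadius) {
  const kept = [];
  const sorted = [...candidates].sort((a, b) => b.score - a.score);
  for (const candidate of sorted) {
    if (kept.length >= maxPoseDetections) break;
    // the list is sorted, so every remaining candidate is below the threshold too
    if (candidate.score < scoreThreshold) break;
    // suppress candidates that fall within nmsRadius of an already kept root
    const tooClose = kept.some(
      (k) => Math.hypot(k.x - candidate.x, k.y - candidate.y) <= nmsRadius
    );
    if (!tooClose) kept.push(candidate);
  }
  return kept;
}

const exampleCandidates = [
  { x: 100, y: 100, score: 0.9 },
  { x: 105, y: 102, score: 0.8 }, // within 20 px of the first one, suppressed
  { x: 300, y: 120, score: 0.7 },
  { x: 500, y: 400, score: 0.3 }, // below the 0.5 threshold, dropped
];
const roots = selectRoots(exampleCandidates, 5, 0.5, 20);
console.log(roots.length); // 2
```

Raising the NMS radius merges nearby detections more aggressively, while raising the score threshold drops weaker ones first.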

Let’s review the outputs:

- A promise that resolves with an array of poses.
- Each pose contains the same information as described in the single-person estimation algorithm.

This short code block shows how to use the multi-pose estimation algorithm:

```javascript
import * as posenet from '@tensorflow-models/posenet';

const imageScaleFactor = 0.50;
const flipHorizontal = false;
const outputStride = 16;
// get up to 5 poses
const maxPoseDetections = 5;
// minimum confidence of the root part of a pose
const scoreThreshold = 0.5;
// minimum distance in pixels between the root parts of poses
const nmsRadius = 20;
const imageElement = document.getElementById('cat');
// load posenet (the awaits below need an async function or a module with top-level await)
const net = await posenet.load();
const poses = await net.estimateMultiplePoses(
  imageElement, imageScaleFactor, flipHorizontal, outputStride,
  maxPoseDetections, scoreThreshold, nmsRadius);
```