First, let's load the JSON file that describes the human pose task.  It is in COCO format: it is the category descriptor pulled from the annotations file, modified slightly to add a neck keypoint.  We will use this task description to create a topology tensor, an intermediate data structure that describes the part linkages and which channels in the part affinity field each linkage corresponds to.

In [1]:
import json
import trt_pose.coco
import tensorrt as trt

with open('human_pose.json', 'r') as f:
    human_pose = json.load(f)

topology = trt_pose.coco.coco_category_to_topology(human_pose)
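
To make the topology tensor more concrete, here is a minimal pure-Python sketch of the kind of table it holds. It assumes one row per link of the form `[paf_x_channel, paf_y_channel, part_a, part_b]`, with COCO's 1-based skeleton indices shifted to 0-based. The function name `skeleton_to_topology` and the toy skeleton are hypothetical; the real `trt_pose.coco.coco_category_to_topology` returns a tensor and may differ in detail.

```python
def skeleton_to_topology(skeleton):
    """Build one row per link: [paf_x_channel, paf_y_channel, part_a, part_b].

    Assumed layout for illustration only; skeleton entries are 1-based
    keypoint index pairs, as in a COCO category descriptor.
    """
    topology = []
    for k, (part_a, part_b) in enumerate(skeleton):
        # Link k owns two consecutive part affinity field channels (x and y),
        # and the keypoint indices are converted to 0-based.
        topology.append([2 * k, 2 * k + 1, part_a - 1, part_b - 1])
    return topology


# Hypothetical three-keypoint chain: keypoint 1 -> 2 -> 3.
toy_skeleton = [[1, 2], [2, 3]]
toy_topology = skeleton_to_topology(toy_skeleton)
print(toy_topology)
```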

Next, we'll load our model.  Each model takes at least two parameters, *cmap_channels* and *paf_channels*, corresponding to the number of heatmap channels
and part affinity field channels.  The number of part affinity field channels is 2x the number of links, because each link has one channel for the x component
and one for the y component of its vector field.

In [2]:
import trt_pose.models

num_parts = len(human_pose['keypoints'])
num_links = len(human_pose['skeleton'])

model = trt_pose.models.resnet18_baseline_att(num_parts, 2 * num_links).cuda().eval()



Next, let's load the model weights.  You will need to download these according to the table in the README.

In [3]:
import torch

MODEL_WEIGHTS = 'resnet18_baseline_att_224x224_A_epoch_249.pth'

model.load_state_dict(torch.load(MODEL_WEIGHTS))

<All keys matched successfully>

In order to optimize with TensorRT using the python library *torch2trt* we'll also need to create some example data.  The dimensions
of this data should match the dimensions that the network was trained with.  Since we're using the resnet18 variant that was trained on
an input resolution of 224x224, we set the width and height to these dimensions.

In [4]:
WIDTH = 224
HEIGHT = 224

data = torch.zeros((1, 3, HEIGHT, WIDTH)).cuda()

Next, we'll use [torch2trt](https://github.com/NVIDIA-AI-IOT/torch2trt) to optimize the model.  We'll enable fp16_mode to allow optimizations to use half precision (FP16).

In [5]:
from torch2trt import torch2trt, tensorrt_converter
# #Define a custom converter for Conv2d
# @tensorrt_converter('torch.nn.Conv2d.forward')
# def convert_conv2d(ctx):
#     module = ctx.method_args[0]
#     input = ctx.method_args[1]
#     output = ctx.method_return

#     kernel = module.weight.detach().cpu().numpy()
#     bias = module.bias.detach().cpu().numpy() if module.bias is not None else None

#     # Debug prints to inspect the parameters
#     print("Converting Conv2d Layer:")
#     print("Input TRT: ", input._trt)
#     print("Output channels: ", module.out_channels)
#     print("Kernel shape: ", kernel.shape)
#     print("Bias shape: ", bias.shape if bias is not None else None)
#     print("Stride: ", module.stride)
#     print("Padding: ", module.padding)
#     print("Dilation: ", module.dilation)
#     print("Groups: ", module.groups)

#     # Create TensorRT layer
#     layer = ctx.network.add_convolution_nd(
#         input=input._trt,
#         num_output_maps=module.out_channels,
#         kernel_shape=kernel.shape[2:],
#         kernel=trt.Weights(kernel),
#         bias=trt.Weights(bias) if bias is not None else None
#     )

#     if layer is None:
#         raise RuntimeError("Failed to create convolution layer in TensorRT")

#     layer.stride_nd = tuple(module.stride)
#     layer.padding_nd = tuple(module.padding)
#     layer.dilation_nd = tuple(module.dilation)
#     layer.num_groups = module.groups
    
#     output._trt = layer.get_output(0)

# # Define a custom converter for ConvTranspose2d
# @tensorrt_converter('torch.nn.ConvTranspose2d.forward')
# def convert_conv_transpose2d(ctx):
#     module = ctx.method_args[0]
#     input = ctx.method_args[1]
#     output = ctx.method_return

#     kernel = module.weight.detach().cpu().numpy()
#     bias = module.bias.detach().cpu().numpy() if module.bias is not None else None

#     # Debug prints to inspect the parameters
#     print("Converting ConvTranspose2d Layer:")
#     print("Input TRT: ", input._trt)
#     print("Output channels: ", module.out_channels)
#     print("Kernel shape: ", kernel.shape)
#     print("Bias shape: ", bias.shape if bias is not None else None)
#     print("Stride: ", module.stride)
#     print("Padding: ", module.padding)
#     print("Dilation: ", module.dilation)
#     print("Groups: ", module.groups)

#     # Create TensorRT layer
#     layer = ctx.network.add_deconvolution_nd(
#         input=input._trt,
#         num_output_maps=module.out_channels,
#         kernel_shape=kernel.shape[2:],
#         kernel=trt.Weights(kernel),
#         bias=trt.Weights(bias) if bias is not None else None
#     )

#     if layer is None:
#         raise RuntimeError("Failed to create deconvolution layer in TensorRT")

#     layer.stride_nd = tuple(module.stride)
#     layer.padding_nd = tuple(module.padding)
#     layer.dilation_nd = tuple(module.dilation)
#     layer.num_groups = module.groups
    
#     output._trt = layer.get_output(0)


In [6]:
from custom_converters import convert_conv_transpose2d
model_trt = torch2trt(model, [data], fp16_mode=True, max_workspace_size=1<<25)

[06/04/2024-14:55:53] [TRT] [E] 3: 1.cmap_up.0:0:DECONVOLUTION:GPU:kernel weights has count 2097152 but 4194304 was expected
[06/04/2024-14:55:53] [TRT] [E] 4: 1.cmap_up.0:0:DECONVOLUTION:GPU: count of 2097152 weights in kernel, but kernel dimensions (4,4) with 512 input channels, 512 output channels and 1 groups were specified. Expected Weights count is 512 * 4*4 * 512 / 1 = 4194304
[06/04/2024-14:55:53] [TRT] [E] 4: [graphShapeAnalyzer.cpp::needTypeAndDimensions::2276] Error Code 4: Internal Error (1.cmap_up.0:0:DECONVOLUTION:GPU: output shape can not be computed)
[06/04/2024-14:55:53] [TRT] [E] 3: [network.cpp::addScaleNd::1151] Error Code 3: API Usage Error (Parameter check failed at: optimizer/api/network.cpp::addScaleNd::1151, condition: qdqScale || basicScale )


AttributeError: 'NoneType' object has no attribute 'get_output'
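
The numbers in the log above can be checked by hand. The sketch below reproduces the count TensorRT says it expected and shows one combination of dimensions that yields the count it actually received. The helper `deconv_weight_count` is hypothetical, and the 256 output channels are an assumption chosen because they match the reported count, not something the log states. It relies on PyTorch storing `ConvTranspose2d` weights with shape `(in_channels, out_channels // groups, kH, kW)`.

```python
def deconv_weight_count(in_channels, out_channels, kernel_hw, groups=1):
    """Number of elements in a transposed-convolution kernel."""
    k_h, k_w = kernel_hw
    return in_channels * (out_channels // groups) * k_h * k_w


# What TensorRT expected for the layer as it was described to it:
# 512 input channels, 512 output channels, 4x4 kernel, 1 group.
expected = deconv_weight_count(512, 512, (4, 4))

# Assumption: a 512 -> 256 layer with a 4x4 kernel gives the supplied count.
supplied = deconv_weight_count(512, 256, (4, 4))

print(expected, supplied)
```

If that reading is right, the layer description handed to TensorRT declared twice as many output channels as the weight array contains, which would be worth checking in the converter imported from `custom_converters`.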

The optimized model can be saved so that we do not need to perform the optimization again; we can just load the saved model.  Note that TensorRT applies device-specific optimizations, so an optimized model can only be used on similar platforms.

In [None]:
OPTIMIZED_MODEL = 'resnet18_baseline_att_224x224_A_epoch_249_trt_custom.pth'

torch.save(model_trt.state_dict(), OPTIMIZED_MODEL)

We could then load the saved model using *torch2trt* as follows.

In [None]:
from torch2trt import TRTModule

model_trt = TRTModule()
model_trt.load_state_dict(torch.load(OPTIMIZED_MODEL))
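
The save and load steps can be combined into a "build once, then reuse" pattern. The sketch below shows the idea with plain text files so that it runs anywhere; `load_or_build` and its callback parameters are hypothetical names. In this notebook, `build` would wrap the `torch2trt(...)` call, `save` would wrap `torch.save(model_trt.state_dict(), path)`, and `load` would create a `TRTModule` and call `load_state_dict`.

```python
import os
import tempfile


def load_or_build(path, build, save, load):
    """Return load(path) if the file exists; otherwise build, save, and return."""
    if os.path.exists(path):
        return load(path)
    obj = build()
    save(obj, path)
    return obj


# Stand-in callbacks that use a text file instead of a TensorRT engine.
def save_text(obj, path):
    with open(path, "w") as f:
        f.write(obj)


def load_text(path):
    with open(path) as f:
        return f.read()


builds = []  # records how many times the expensive step actually ran


def build_text():
    builds.append(1)
    return "engine"


cache_path = os.path.join(tempfile.mkdtemp(), "cached.txt")
first = load_or_build(cache_path, build_text, save_text, load_text)
second = load_or_build(cache_path, build_text, save_text, load_text)
print(first, second, len(builds))
```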

We can benchmark the model's throughput in frames per second (FPS) with the following code.

In [None]:
import time

torch.cuda.current_stream().synchronize()  # finish pending GPU work before starting the timer
t0 = time.time()
for i in range(50):
    y = model_trt(data)
torch.cuda.current_stream().synchronize()
t1 = time.time()

print(50.0 / (t1 - t0))
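
The same measurement can be wrapped in a small reusable helper. The sketch below is one possible way to do it and is not part of trt_pose: `measure_fps`, its `warmup` parameter, and the `sync` callback are hypothetical names. It adds a few untimed warm-up calls and uses `time.perf_counter`; for a GPU model you would pass something like `torch.cuda.current_stream().synchronize` as `sync`.

```python
import time


def measure_fps(fn, iterations=50, warmup=5, sync=None):
    """Return calls per second for fn(), excluding the warm-up calls."""
    for _ in range(warmup):
        fn()
    if sync is not None:
        sync()  # make sure warm-up work has finished before timing starts
    t0 = time.perf_counter()
    for _ in range(iterations):
        fn()
    if sync is not None:
        sync()  # wait for queued work so the elapsed time covers all calls
    t1 = time.perf_counter()
    return iterations / (t1 - t0)


# CPU-only stand-in workload so the example runs without a GPU.
calls = []
fps = measure_fps(lambda: calls.append(sum(range(1000))), iterations=50, warmup=5)
print(fps)
```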

Next, let's define a function that will preprocess the image, which is originally in BGR8 / HWC format.

In [None]:
import cv2
import torchvision.transforms as transforms
import PIL.Image
import torch

mean = torch.Tensor([0.485, 0.456, 0.406]).cuda()
std = torch.Tensor([0.229, 0.224, 0.225]).cuda()
device = torch.device('cuda')

def preprocess(image):
    # Convert a BGR8 / HWC frame into a normalized 1x3xHxW CUDA tensor
    image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    image = cv2.resize(image, (WIDTH, HEIGHT))  # Ensure the frame is 224x224
    image = PIL.Image.fromarray(image)
    image = transforms.functional.to_tensor(image).to(device)
    image.sub_(mean[:, None, None]).div_(std[:, None, None])
    return image[None, ...]
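
To see what the normalization does numerically, here is a pure-Python sketch for a single pixel. It mirrors the steps in `preprocess` (BGR to RGB, scaling to [0, 1] as `to_tensor` does for 8-bit images, then subtracting the mean and dividing by the standard deviation) using the same constants. `normalize_pixel` is a hypothetical helper for illustration only.

```python
MEAN = (0.485, 0.456, 0.406)  # per-channel mean, RGB order
STD = (0.229, 0.224, 0.225)   # per-channel standard deviation, RGB order


def normalize_pixel(bgr):
    """Normalize one 8-bit BGR pixel the way preprocess() treats each pixel."""
    b, g, r = bgr
    rgb = (r, g, b)  # reorder BGR -> RGB
    # Scale to [0, 1], then standardize each channel.
    return tuple((v / 255.0 - m) / s for v, m, s in zip(rgb, MEAN, STD))


black = normalize_pixel((0, 0, 0))
white = normalize_pixel((255, 255, 255))
pure_blue = normalize_pixel((255, 0, 0))  # blue is the first channel in BGR
print(black, white, pure_blue)
```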

Next, we'll define two callable classes that will be used to parse the objects from the neural network, as well as draw the parsed objects on an image.

In [None]:
from trt_pose.draw_objects import DrawObjects
from trt_pose.parse_objects import ParseObjects

parse_objects = ParseObjects(topology)
draw_objects = DrawObjects(topology)

On NVIDIA Jetson, you can use the [jetcam](https://github.com/NVIDIA-AI-IOT/jetcam) package to create an easy-to-use camera that produces images in BGR8/HWC format.

The code below instead reads frames from the default webcam with OpenCV's `cv2.VideoCapture`, which also returns BGR8/HWC images.  You may need to adapt it to your camera.

In [None]:
import time

cap = cv2.VideoCapture(0)  # Open the default webcam

def execute():
    ret, frame = cap.read()
    if not ret:
        return False  # no frame available; stop the loop

    data = preprocess(frame)
    cmap, paf = model_trt(data)
    cmap, paf = cmap.detach().cpu(), paf.detach().cpu()
    counts, objects, peaks = parse_objects(cmap, paf)
    draw_objects(frame, counts, objects, peaks)
    
    cv2.imshow('Pose Estimation', frame)
    if cv2.waitKey(1) & 0xFF == ord('q'):
        return False
    return True

while True:
    if not execute():
        break

cap.release()
cv2.destroyAllWindows()


Next, we'll create a widget which will be used to display the camera feed with visualizations.

Finally, we'll define the main execution loop.  This will perform the following steps:

1.  Preprocess the camera image
2.  Execute the neural network
3.  Parse the objects from the neural network output
4.  Draw the objects onto the camera image
5.  Convert the image to JPEG format and stream to the display widget
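
The five steps above can be sketched as a single function that chains one callable per stage. This is only an outline under stated assumptions: `process_frame` and its parameters are hypothetical names, and the stand-in lambdas exist just to make the example runnable. In the notebook the stages would be `preprocess`, `model_trt`, `parse_objects`, `draw_objects`, and a JPEG encoder such as OpenCV's `cv2.imencode('.jpg', frame)`.

```python
def process_frame(frame, preprocess, infer, parse, draw, encode):
    """Run one frame through the five stages and return the encoded image."""
    data = preprocess(frame)                    # 1. preprocess the camera image
    cmap, paf = infer(data)                     # 2. execute the neural network
    counts, objects, peaks = parse(cmap, paf)   # 3. parse objects from the output
    draw(frame, counts, objects, peaks)         # 4. draw onto the frame (in place)
    return encode(frame)                        # 5. encode for display


# Stand-in stages operating on strings, for illustration only.
drawn = []
result = process_frame(
    frame="frame",
    preprocess=lambda f: f.upper(),
    infer=lambda d: (d + "-cmap", d + "-paf"),
    parse=lambda cmap, paf: (1, [cmap], [paf]),
    draw=lambda f, c, o, p: drawn.append((f, c, o, p)),
    encode=lambda f: f.encode("ascii"),
)
print(result)
```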