
Deep Single Image Camera Calibration by Heatmap Regression to Recover Fisheye Images Under Manhattan World Assumption

Nobuhiko Wakai¹,*, Satoshi Sato¹, Yasunori Ishii¹, Takayoshi Yamashita²
¹ Panasonic Holdings, ² Chubu University
* wakai.nobuhiko[at]jp.panasonic.com

IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2024

The Japanese version of this project page is available [here].

Abstract

A Manhattan world lying along cuboid buildings is useful for camera angle estimation. However, accurate and robust angle estimation from fisheye images in the Manhattan world has remained an open challenge because general scene images tend to lack constraints such as lines, arcs, and vanishing points. To achieve higher accuracy and robustness, we propose a learning-based calibration method that uses heatmap regression, which is similar to pose estimation using keypoints, to detect the directions of labeled image coordinates. Simultaneously, our two estimators recover the rotation and remove fisheye distortion by remapping from a general scene image. Without considering vanishing-point constraints, we find that additional points for learning-based methods can be defined. To compensate for the lack of vanishing points in images, we introduce auxiliary diagonal points that have the optimal 3D arrangement of spatial uniformity. Extensive experiments demonstrated that our method outperforms conventional methods on large-scale datasets and with off-the-shelf cameras.

Background

Camera calibration is used in various computer vision tasks to recover camera rotation and fisheye distortion. However, conventional geometry-based calibration methods need a calibration object, such as a plane or a cube. To address this problem, we use learning-based calibration methods called "Deep Single Image Camera Calibration." Image-based angle estimation under the Manhattan world assumption [12] suits miniaturized and lightweight designs for cars, drones, and robots. However, accurate and robust angle estimation has remained an open challenge because general scene images tend to lack constraints such as lines, arcs, and vanishing points (VPs).

Contributions

  • We propose a heatmap-based VP estimator for recovering the rotation from a single image to achieve higher accuracy and robustness than geometry-based methods using arc detectors.

  • We introduce auxiliary diagonal points (ADPs) with an optimal 3D arrangement based on the spatial uniformity of regular octahedron groups to address the lack of VPs in an image.

Proposed method


Fig. 1: Our network estimates the extrinsics and intrinsics in a Manhattan world from a single image. The estimated camera parameters are used to fully recover images by remapping them while distinguishing the front and side directions on the basis of the Manhattan world. Cyan, magenta, and yellow lines indicate the three orthogonal planes of the Manhattan frame in each image. The input image is generated from [38]. This figure corresponds to Figure 1 in our CVPR 2024 paper.

Auxiliary diagonal points


Fig. 2: Coordinates of VPs and ADPs in a Manhattan world. The labels of the VPs and ADPs correspond to those described in Table 1 of our paper. This figure corresponds to Figure 3 in our CVPR 2024 paper.

Without imposing the VP constraint that lines converge at a VP, we can define points in various directions, such as the vertices of a polyhedron. However, there is an unavoidable trade-off between the strength of the constraints and the ease of training, and this trade-off depends on the arrangement and the number of point directions. To address this problem of arrangement and number, we define additional VP-related points, called ADPs, on the basis of spatial symmetry; see Figure 2.
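For concreteness, the sketch below enumerates one plausible set of the 13 directions shown in Figure 2: the five VP directions along the Manhattan axes used on this page (front, left, right, top, and bottom; back labels are not used) and the eight diagonal cube-corner directions associated with the regular-octahedron symmetry. The axis convention and the naming here are illustrative assumptions, not the paper's exact labeling.

```python
import numpy as np

def manhattan_vp_adp_directions():
    """Unit vectors for 5 VPs and 8 ADPs in a Manhattan world.

    Axis convention (an assumption for illustration): x = right,
    y = down, z = forward (optical axis). The back VP is omitted,
    matching the labels used on this page.
    """
    vps = {
        "front":  np.array([0.0, 0.0, 1.0]),
        "left":   np.array([-1.0, 0.0, 0.0]),
        "right":  np.array([1.0, 0.0, 0.0]),
        "top":    np.array([0.0, -1.0, 0.0]),
        "bottom": np.array([0.0, 1.0, 0.0]),
    }
    # ADPs: the 8 diagonal directions (+-1, +-1, +-1)/sqrt(3), i.e., the
    # cube-corner directions permuted by the regular-octahedron symmetry.
    adps = {}
    for sx in (-1.0, 1.0):
        for sy in (-1.0, 1.0):
            for sz in (-1.0, 1.0):
                adps[f"adp({sx:+.0f},{sy:+.0f},{sz:+.0f})"] = (
                    np.array([sx, sy, sz]) / np.sqrt(3.0)
                )
    return {**vps, **adps}  # 13 unit vectors in total

if __name__ == "__main__":
    dirs = manhattan_vp_adp_directions()
    assert len(dirs) == 13
    print({k: v.round(3).tolist() for k, v in dirs.items()})
```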

Network architecture

We found that VP estimation in images corresponds to single-person human pose estimation [2] in terms of detecting labeled image coordinates. We therefore propose a heatmap-regression network, called the "VP estimator," that detects VPs and ADPs (VP/ADPs) in the image and is less prone to the degradation associated with directly regressing rotation angles. For the intrinsics, we use Wakai et al.'s calibration network [53] without its tilt and roll angle regressors; we call this the "distortion estimator." Our network therefore consists of the two estimators shown in Figure 1 and requires only a single fisheye image.
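For intuition, here is a minimal PyTorch sketch of this two-estimator layout: a heatmap head that outputs one channel per VP/ADP (13 in total) and a regression head for the intrinsics (focal length f and a distortion coefficient k1). The tiny convolutional backbones and layer widths are placeholders for illustration only; they do not reproduce the HRNet-W32 backbone or Wakai et al.'s distortion estimator [53] used in the paper.

```python
import torch
import torch.nn as nn

class VPEstimator(nn.Module):
    """Heatmap-regression head: one heatmap per VP/ADP (13 channels).

    A toy convolutional backbone stands in for HRNet-W32.
    """
    def __init__(self, num_points: int = 13):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.head = nn.Conv2d(64, num_points, kernel_size=1)

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        return self.head(self.backbone(image))  # (B, 13, H/4, W/4)

class DistortionEstimator(nn.Module):
    """Regresses the intrinsics (focal length f, distortion k1) from the image."""
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.fc = nn.Linear(32, 2)  # [f, k1]

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        return self.fc(self.backbone(image))

if __name__ == "__main__":
    x = torch.randn(1, 3, 256, 256)        # a single fisheye image
    heatmaps = VPEstimator()(x)            # VP/ADP heatmaps
    intrinsics = DistortionEstimator()(x)  # f and k1
    print(heatmaps.shape, intrinsics.shape)
```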

Training and inference


Fig. 3: Calibration pipeline for inference. The intrinsics are estimated by the distortion estimator. The detected VP/ADPs are projected onto the unit sphere by backprojection through the camera model, and the extrinsics are calculated from this fitting. The input fisheye image is generated from [38]. This figure corresponds to Figure 4 in our CVPR 2024 paper.

Using the generated fisheye images with ground-truth camera parameters and VP/ADP labels, we train our two estimators independently. Figure 3 shows our calibration pipeline for inference. First, we obtain the image coordinates of the VP/ADPs from the VP estimator and the intrinsics from the distortion estimator, as shown in Figure 3. Second, the detected VP/ADPs are projected onto a unit sphere in world coordinates using backprojection, which accounts for lens distortion through the focal length and a distortion coefficient. Finally, we convert the 3D VP/ADPs to the extrinsics, that is, the pan, tilt, and roll angles, by solving the absolute orientation problem [55].
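The sketch below walks through this inference pipeline numerically. For illustration it assumes a simple one-coefficient fisheye model r = f(η + k1·η³), where η is the incidence angle and r the image radius, as a stand-in for the generic camera model of [53]; the Newton-based backprojection, the SVD (Kabsch) solution of the absolute orientation problem, and the Euler-angle convention are standard techniques, not the paper's exact formulation.

```python
import numpy as np

def backproject(points_px, f, k1, cx, cy):
    """Lift detected VP/ADP pixel coordinates onto the unit sphere.

    Assumes the simple fisheye model r = f * (eta + k1 * eta**3), where eta
    is the incidence angle from the optical axis (z) and r the image radius.
    """
    rays = []
    for u, v in points_px:
        du, dv = u - cx, v - cy
        r = np.hypot(du, dv)
        eta = r / max(f, 1e-9)                       # initial guess
        for _ in range(20):                          # Newton iterations
            residual = f * (eta + k1 * eta**3) - r
            eta -= residual / (f * (1.0 + 3.0 * k1 * eta**2))
        phi = np.arctan2(dv, du)
        rays.append([np.sin(eta) * np.cos(phi),
                     np.sin(eta) * np.sin(phi),
                     np.cos(eta)])
    return np.asarray(rays)

def fit_rotation(camera_rays, world_dirs):
    """Absolute orientation (Kabsch/SVD): R with camera_rays[i] ~ R @ world_dirs[i]."""
    H = world_dirs.T @ camera_rays               # 3x3 correlation of direction pairs
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))       # keep det(R) = +1
    return Vt.T @ np.diag([1.0, 1.0, d]) @ U.T

def pan_tilt_roll(R):
    """Z-Y-X Euler decomposition in degrees (the paper's convention may differ)."""
    pan = np.degrees(np.arctan2(R[1, 0], R[0, 0]))
    tilt = np.degrees(np.arcsin(np.clip(-R[2, 0], -1.0, 1.0)))
    roll = np.degrees(np.arctan2(R[2, 1], R[2, 2]))
    return pan, tilt, roll

if __name__ == "__main__":
    # Placeholder intrinsics (from the distortion estimator) and hypothetical
    # VP detections (from the VP estimator) for three of the 13 points.
    f, k1, cx, cy = 300.0, 0.05, 128.0, 128.0
    detections = [(130.0, 126.0), (322.0, 131.0), (127.0, 12.0)]
    world = np.array([[0.0, 0.0, 1.0],    # front
                      [1.0, 0.0, 0.0],    # right
                      [0.0, -1.0, 0.0]])  # top
    rays = backproject(detections, f, k1, cx, cy)
    R = fit_rotation(rays, world)
    print("pan/tilt/roll [deg]:", pan_tilt_roll(R))
```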

Experiments

We used three large-scale datasets of outdoor panoramas: the StreetLearn dataset [38], the SP360 dataset [9], and the HoliCity dataset [64]. From StreetLearn, we used the Manhattan 2019 subset (SL-MH) and the Pittsburgh 2019 subset (SL-PB). Following the dataset generation and capture procedure of [53], we generated fisheye images from the panoramic images using the generic camera model with ground-truth camera parameters, and we captured outdoor images in Kyoto, Japan, with six off-the-shelf fisheye cameras. Note that we removed label ambiguity and did not use back labels.
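As a rough illustration of this generation step, the following sketch remaps an equirectangular panorama to a fisheye image using the same simple one-coefficient fisheye model as in the previous sketch. The fixed forward viewing direction, output size, and nearest-neighbor sampling are simplifications, not the exact procedure of [53].

```python
import numpy as np

def panorama_to_fisheye(pano, f, k1, out_size=512):
    """Remap an equirectangular panorama (H, W, 3) to a square fisheye image.

    Uses the simple fisheye model r = f * (eta + k1 * eta**3) with a fixed
    forward viewing direction and nearest-neighbor sampling.
    """
    H, W = pano.shape[:2]
    c = out_size / 2.0
    v, u = np.mgrid[0:out_size, 0:out_size].astype(np.float64)
    du, dv = u - c, v - c
    r = np.hypot(du, dv)

    # Invert r = f * (eta + k1 * eta**3) for eta with a few Newton steps.
    eta = r / f
    for _ in range(10):
        eta -= (f * (eta + k1 * eta**3) - r) / (f * (1.0 + 3.0 * k1 * eta**2))
    phi = np.arctan2(dv, du)

    # Ray in camera coordinates (x right, y down, z forward).
    x = np.sin(eta) * np.cos(phi)
    y = np.sin(eta) * np.sin(phi)
    z = np.cos(eta)

    # Spherical coordinates -> equirectangular pixel positions.
    lon = np.arctan2(x, z)                      # [-pi, pi]
    lat = np.arcsin(np.clip(y, -1.0, 1.0))      # [-pi/2, pi/2]
    src_u = ((lon / (2.0 * np.pi) + 0.5) * (W - 1)).astype(int)
    src_v = ((lat / np.pi + 0.5) * (H - 1)).astype(int)

    out = pano[src_v, src_u]
    out[r > c] = 0                              # mask pixels outside the image circle
    return out

if __name__ == "__main__":
    pano = (np.random.rand(512, 1024, 3) * 255).astype(np.uint8)  # dummy panorama
    print(panorama_to_fisheye(pano, f=160.0, k1=0.05).shape)
```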

Vanishing point estimation


Fig. 4: Qualitative results of VP/ADP detection using the proposed VP estimator on the SL-MH test set. The VP estimator outputs five VP heatmaps and eight ADP heatmaps, one for each VP/ADP. This figure corresponds to Figure 5 in our CVPR 2024 paper.

Results of the cross-domain evaluation for our VP estimator using HRNet-W32

Keypoint metrics (AP, AP50, AP75, AR, AR50, AR75, PCK): higher is better (↑). Mean distance error [pixel] (front, left, right, top, bottom, VP, ADP, All): lower is better (↓).

| Train | Test | AP | AP50 | AP75 | AR | AR50 | AR75 | PCK | front | left | right | top | bottom | VP¹ | ADP¹ | All¹ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SL-MH | SL-MH | 0.99 | 0.99 | 0.99 | 0.97 | 0.98 | 0.98 | 0.99 | 2.67 | 2.90 | 2.52 | 1.90 | 1.72 | 2.39 | 3.64 | 3.10 |
| SL-MH | SL-PB | 0.98 | 0.99 | 0.99 | 0.96 | 0.97 | 0.97 | 0.98 | 3.51 | 3.50 | 3.11 | 2.34 | 2.02 | 2.97 | 4.52 | 3.85 |
| SL-MH | SP360 | 0.85 | 0.94 | 0.90 | 0.79 | 0.87 | 0.83 | 0.83 | 6.55 | 7.42 | 6.18 | 5.34 | 11.77 | 7.44 | 14.95 | 11.57 |
| SL-MH | HoliCity | 0.80 | 0.92 | 0.86 | 0.72 | 0.83 | 0.78 | 0.77 | 9.73 | 12.27 | 9.75 | 8.54 | 6.60 | 9.47 | 17.92 | 14.11 |

¹ VP denotes all 5 VPs; ADP denotes all 8 ADPs; All denotes all 13 points consisting of 5 VPs and 8 ADPs
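For reference, the sketch below computes the two simplest quantities in this table, the mean distance error in pixels and a PCK-style score, from predicted and ground-truth VP/ADP image coordinates. The PCK threshold here is an arbitrary placeholder; the exact threshold and the COCO-style AP/AR protocol follow the paper's evaluation setup.

```python
import numpy as np

def mean_distance_error(pred, gt):
    """Mean Euclidean distance [pixel] between predicted and GT VP/ADPs.

    pred, gt: arrays of shape (N, 2) holding (u, v) image coordinates.
    """
    return float(np.mean(np.linalg.norm(pred - gt, axis=1)))

def pck(pred, gt, threshold_px):
    """Percentage of correct keypoints within `threshold_px` pixels.

    The threshold is a placeholder; the paper's PCK definition may
    normalize it differently.
    """
    dists = np.linalg.norm(pred - gt, axis=1)
    return float(np.mean(dists <= threshold_px))

if __name__ == "__main__":
    gt = np.array([[100.0, 120.0], [240.0, 60.0]])
    pred = gt + np.array([[2.0, -1.0], [4.0, 3.0]])
    print(mean_distance_error(pred, gt), pck(pred, gt, threshold_px=5.0))
```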

Parameter and reprojection errors

Comparison of the absolute parameter errors and reprojection errors on the SL-MH test set

The pan φ, tilt θ, roll ψ, f, and k1 columns report mean absolute errors. A dash (–) indicates that the corresponding value is not applicable to the method.

| Method | Backbone³ | Pan φ¹ | Tilt θ¹ | Roll ψ¹ | f¹ | k1¹ | REPE¹ | Executable rate¹ | Mean fps² | #Params | GFLOPs³ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| López-Antequera et al. [33] CVPR'19 | DenseNet-161 | – | 27.60 | 44.90 | 2.32 | – | 81.99 | 100.0 | 36.4 | 27.4M | 7.2 |
| Wakai and Yamashita [52] ICCVW'21 | DenseNet-161 | – | 10.70 | 14.97 | 2.73 | – | 30.02 | 100.0 | 33.0 | 26.9M | 7.2 |
| Wakai et al. [53] ECCV'22 | DenseNet-161 | – | 4.13 | 5.21 | 0.34 | 0.021 | 7.39 | 100.0 | 25.4 | 27.4M | 7.2 |
| Pritts et al. [41] CVPR'18 | – | 25.35 | 42.52 | 18.54 | – | – | – | 96.7 | 0.044 | – | – |
| Lochman et al. [32] WACV'21 | – | 22.36 | 44.42 | 33.20 | 6.09 | – | – | 59.1 | 0.016 | – | – |
| Ours w/o ADPs (5 points)³ | HRNet-W32 | 19.38 | 13.54 | 21.65 | 0.34 | 0.020 | 28.90 | 100.0 | 12.7 | 53.5M | 14.5 |
| Ours w/o VPs (8 points) | HRNet-W32 | 10.54 | 11.01 | 8.11 | 0.34 | 0.020 | 19.70 | 100.0 | 12.6 | 53.5M | 14.5 |
| Ours (13 points) | HRNet-W32 | 2.20 | 3.15 | 3.00 | 0.34 | 0.020 | 5.50 | 100.0 | 12.3 | 53.5M | 14.5 |
| Ours (13 points) | HRNet-W48 | 2.19 | 3.10 | 2.88 | 0.34 | 0.020 | 5.34 | 100.0 | 12.2 | 86.9M | 22.1 |

¹ Units: pan φ, tilt θ, and roll ψ [deg]; f [mm]; k1 [dimensionless]; REPE [pixel]; executable rate [%]
² Implementations: López-Antequera et al. [33], Wakai and Yamashita [52], Wakai et al. [53], and ours use PyTorch [40]; Pritts et al. [41] and Lochman et al. [32] use The MathWorks MATLAB
³ "(· points)" is the number of VP/ADPs used by the VP estimator; backbones refer to the VP estimator; rotation estimation in Figure 3 is not included in GFLOPs

Qualitative results on synthetic images


Fig. 5: Qualitative results on the test sets. (a) Results of conventional methods. From left to right: input images, ground truth (GT), and the results of López-Antequera et al. [33], Wakai and Yamashita [52], Wakai et al. [53], Pritts et al. [41], and Lochman et al. [32]. (b) Results of our method. From left to right: input images, GT, and the results of our method using HRNet-W32 in a Manhattan world. This figure corresponds to Figure 6 in our CVPR 2024 paper.

Qualitative results on the off-the-shelf camera images


Fig. 6: Qualitative results for images from off-the-shelf cameras. From top to bottom: input images, the results of the compared method (front- and side-direction images obtained by Lochman et al. [32]), and the results of our method using HRNet-W32 (front- and side-direction images). The identifiers (IDs) correspond to the camera IDs used in [53], and the projection names are shown below the IDs. This figure corresponds to Figure 7 in our CVPR 2024 paper.

The descriptions on this project page are based on our CVPR 2024 paper.

Links

@INPROCEEDINGS{Wakai_2024_CVPR,
    author    = {Wakai, Nobuhiko and Sato, Satoshi and Ishii, Yasunori and Yamashita, Takayoshi},
    title     = {Deep Single Image Camera Calibration by Heatmap Regression to Recover Fisheye Images Under Manhattan World Assumption},
    booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    pages     = {11884-11894},
    year      = {2024}
}
  • Related projects
    • Nobuhiko Wakai and Takayoshi Yamashita. Deep Single Fisheye Image Camera Calibration for Over 180-degree Projection of Field of View. In International Conference on Computer Vision Workshop (ICCVW), pages 1174–1183, 2021. [paper]
    • Nobuhiko Wakai, Satoshi Sato, Yasunori Ishii, and Takayoshi Yamashita. Rethinking Generic Camera Models for Deep Single Image Camera Calibration to Recover Rotation and Fisheye Distortion. In European Conference on Computer Vision (ECCV), volume 13678, pages 679–698, 2022. [paper] [project]
  • Press release [press]

References

  • [2] M. Andriluka, L. Pishchulin, P. Gehler, and B. Schiele. 2D human pose estimation: New benchmark and state of the art analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3686–3693, 2014.
  • [12] J. M. Coughlan and A. L. Yuille. Manhattan world: Compass direction from a single image by Bayesian inference. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 941–947, 1999.
  • [32] Y. Lochman, O. Dobosevych, R. Hryniv, and J. Pritts. Minimal solvers for single-view lens-distorted camera auto-calibration. In Proceedings of the IEEE Winter Conference on Applications of Computer Vision (WACV), pages 2886–2895, 2021.
  • [33] M. López-Antequera, R. Marí, P. Gargallo, Y. Kuang, J. Gonzalez-Jimenez, and G. Haro. Deep single image camera calibration with radial distortion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 11809–11817, 2019.
  • [38] P. Mirowski, A. Banki-Horvath, K. Anderson, D. Teplyashin, K. M. Hermann, M. Malinowski, M. K. Grimes, K. Simonyan, K. Kavukcuoglu, A. Zisserman, and R. Hadsell. The StreetLearn environment and dataset. arXiv preprint arXiv:1903.01292, 2019.
  • [40] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Köpf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala. PyTorch: An imperative style, high-performance deep learning library. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), pages 8024–8035, 2019.
  • [41] J. Pritts, Z. Kukelova, V. Larsson, and O. Chum. Radially-distorted conjugate translations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 1993–2001, 2018.
  • [42] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet large scale visual recognition challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.
  • [52] N. Wakai and T. Yamashita. Deep single fisheye image camera calibration for over 180-degree projection of field of view. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), pages 1174–1183, 2021.
  • [53] N. Wakai, S. Sato, Y. Ishii, and T. Yamashita. Rethinking generic camera models for deep single image camera calibration to recover rotation and fisheye distortion. In Proceedings of the European Conference on Computer Vision (ECCV), pages 679–698, 2022.
  • [55] Z. Wang and A. Jepson. A new closed-form solution for absolute orientation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 129–134, 1994.
