*Exploratory Data Analysis*

# Understanding the Renderings from Virtual Cameras

This notebook visualizes the camera poses used for training and for novel view generation, and then summarizes the mathematical model of the virtual camera.

In [1]:
import matplotlib as mpl
from mpl_toolkits.mplot3d import Axes3D
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.patches import FancyArrowPatch, FancyArrow, ArrowStyle
from mpl_toolkits.mplot3d import proj3d
from matplotlib.lines import Line2D
import torch
import itertools

from run_dnerf_helpers import get_rays
from utils import load_deepdeform_data, load_owndataset_data
from utils import Arrow3D, draw_transformed, draw_cam, draw_ray

%matplotlib notebook
%load_ext autoreload
%autoreload 2

Load the data (here the custom `bottle` scene via `load_owndataset_data`; the DeepDeform loader is commented out)

In [2]:
scene_name = "bottle"
render_pose_type = "spherical"

# images, depth_maps, poses, times, render_poses, render_times, hwff, i_split = load_deepdeform_data(f"./data/{scene_name}", True, 1, render_pose_type=render_pose_type)
images, depth_maps, poses, times, render_poses, render_times, hwff, i_split = load_owndataset_data(f"./data/{scene_name}", True, 1, render_pose_type=render_pose_type)

print(f'Loaded {scene_name}', images.shape, render_poses.shape, hwff)

Scene Object Depth: 0.35
[Info] Data scaling factor: 0.7470703125
Loaded bottle (92, 480, 360, 3) torch.Size([101, 4, 4]) [480, 360, 400.0368957519532, 400.0368957519531]


Get the rays for the first novel-view pose (code from `run_dnerf_helpers.get_rays`):

In [3]:
render_pose = render_poses[0]
print("First novel camera pose to be rendered:\n", render_pose)
print("First camera pose of the training images:\n", poses[0])

# Implemented in get_rays(H, W, focal, c2w) function

c2w = render_pose[:3]
H, W, focal_x, focal_y = hwff

# Create coordinates for each pixel in the camera coordinate system
i, j = torch.meshgrid(torch.linspace(0, W-1, W), torch.linspace(0, H-1, H), indexing='ij')      # each of shape [W, H] = [360, 480]
i = i.t()           # pixel coordinates in X-dir, shape [H, W] = [480, 360]
j = j.t()           # pixel coordinates in Y-dir, shape [H, W] = [480, 360]
# The ray directions in the camera coordinate system.
# Center the X- and Y-coordinates to the image center and scale by focal length. The rays go in the negative Z direction.
dirs = torch.stack([(i-W*.5)/focal_x, -(j-H*.5)/focal_y, -torch.ones_like(i)], -1)      # shape [H, W, 3] = [480, 360, 3]

# Rotate ray directions from camera frame to the world frame
rays_d = torch.sum(dirs[..., np.newaxis, :] * c2w[:3,:3], -1)  # matrix-vector product per pixel, equivalent to: [c2w[:3,:3] @ dir for dir in dirs]
# Translate camera frame's origin to the world frame. It is the origin of all rays.
rays_o = c2w[:3,-1].expand(rays_d.shape)

print("Ray directions shape:", rays_d.shape)
print("Ray origins shape:", rays_o.shape)
print("Ray direction for pixel [0,0] is", rays_d[0,0].tolist(), "with origin at", rays_o[0,0].tolist())

First novel camera pose to be rendered:
 tensor([[1.0000, 0.0000, 0.0000, 0.0000],
        [0.0000, 1.0000, 0.0000, 0.0000],
        [0.0000, 0.0000, 1.0000, 0.4685],
        [0.0000, 0.0000, 0.0000, 1.0000]])
First camera pose of the training images:
 [[ 1.0000000e+00 -1.2096720e-08 -1.3030231e-08  2.2006795e-08]
 [ 1.2096720e-08  1.0000000e+00 -1.3962775e-08 -1.1433786e-07]
 [ 1.3030231e-08  1.3962775e-08  1.0000000e+00  4.6849683e-01]
 [ 0.0000000e+00  0.0000000e+00  0.0000000e+00  1.0000000e+00]]
Ray directions shape: torch.Size([480, 360, 3])
Ray origins shape: torch.Size([480, 360, 3])
Ray direction for pixel [0,0] is [-0.4499585032463074, 0.5999446511268616, -1.0] with origin at [0.0, 0.0, 0.468496710062027]
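
As a sanity check, the printed direction for pixel [0,0] can be reproduced by hand. The following is a minimal NumPy sketch, not the project's `get_rays`: the helper name `pixel_ray` is hypothetical, the intrinsics and the pose are copied from the outputs above, and a single focal value is assumed for both axes because the two printed values agree to within float precision.

```python
import numpy as np

def pixel_ray(col, row, H, W, focal_x, focal_y, c2w):
    """Return (origin, direction) of the ray through pixel (col, row) in world coordinates."""
    # Direction in the camera frame: x to the right, y up, camera looks along -Z.
    d_cam = np.array([(col - W * 0.5) / focal_x,
                      -(row - H * 0.5) / focal_y,
                      -1.0])
    # Rotate into the world frame; the ray origin is the camera position (last column of c2w).
    d_world = c2w[:3, :3] @ d_cam
    origin = c2w[:3, 3].copy()
    return origin, d_world

# Values taken from the notebook outputs above (assumed here, not re-loaded from disk).
H, W, focal = 480, 360, 400.0368957519531
c2w = np.eye(4)
c2w[2, 3] = 0.4685

origin, direction = pixel_ray(0, 0, H, W, focal, focal, c2w)
print("origin:", origin)
print("direction:", direction)
```

Note that the direction is not normalized: its Z component is always -1 in the camera frame, so the X and Y components are the tangents of the viewing angles.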


Plot the coordinate frames and the rays.

In [4]:
fig = plt.figure(figsize=(10, 10))
ax1 = fig.add_subplot(111, projection='3d')

# xlim = [-3, 3]
# ylim = [-1, 1]
# zlim = [0, 6]
xlim = [-1.5, 1.5]
ylim = [-1.5, 1.5]
zlim = [-1, 1]
ax1.set_xlabel('X')
ax1.set_xlim(*xlim)
ax1.set_ylabel('Y')
ax1.set_ylim(*ylim)
ax1.set_zlabel('Z')
ax1.set_zlim(*zlim)
ax1.set_box_aspect((xlim[1]-xlim[0], ylim[1]-ylim[0], zlim[1]-zlim[0]))       # -> a length of 1 looks the same along every axis

# The world coordinate system
arrow_prop_dict = dict(mutation_scale=20, arrowstyle='simple', shrinkA=0, shrinkB=0)
ax1.add_artist(Arrow3D([0, 1], [0, 0], [0, 0], **arrow_prop_dict, color='r'))
ax1.add_artist(Arrow3D([0, 0], [0, 1], [0, 0], **arrow_prop_dict, color='b'))
ax1.add_artist(Arrow3D([0, 0], [0, 0], [0, 1], **arrow_prop_dict, color='g'))
ax1.text(-.1, -.1, 0.0, r'$0$')
ax1.text(1.1, 0, 0, r'$x$')
ax1.text(0, 1.1, 0, r'$y$')
ax1.text(0, 0, 1.1, r'$z$')

# Draw novel camera coordinate frames
# new_os = []
# for pose in render_poses[::2]:
#     rcx, rcy, rcz, new_o = draw_transformed(pose, ax1, linestyle="--")
#     new_os.append(new_o)
# ax1.plot([n[0] for n in new_os], [n[1] for n in new_os], [n[2] for n in new_os])

# Training camera frames
for pose in poses[1::5]:
    rcx, rcy, rcz, new_o = draw_transformed(pose, ax1, linestyle="-")

# Draw the training camera coordinate frame
tcx, tcy, tcz, _ = draw_transformed(poses[0], ax1, arrowstyle='simple', axes_len=0.7, linewidth=1.5, mutation_scale=20, edgecolor="black")

lgnd1 = plt.legend(handles=[tcx, tcy, tcz], 
           labels=["X", "Y", "Z"], 
           title="Training camera pose", loc=1)
plt.legend(handles=[Line2D([0], [0], color='r', ls="--"), 
                    Line2D([0], [0], color='b', ls="--"), 
                    Line2D([0], [0], color='g', ls="--"), 
                    Line2D([0], [0], color='black', ls="-"),
                    Line2D([0], [0], color='grey', ls="--")], 
           labels=["X", "Y", "Z", "Rays of 0th camera", "0th camera image"], 
           title="Novel view cameras", loc=2)
plt.gca().add_artist(lgnd1)


draw_cam(rays_o, rays_d, ax1)       # rays_o and rays_d are already in world-coordinates

fig.suptitle("Virtual Cameras in the Global Coordinate System")
fig.tight_layout()

plt.show()

[Figure: interactive 3D plot, "Virtual Cameras in the Global Coordinate System"]

## Theory: The Virtual Camera Model

A camera is a mapping between the 3D world and a 2D image:
$$[X,Y,Z]^T\rightarrow[x,y]^T$$
where $\mathbf{X}=[X,Y,Z]^T$ is a 3D world point and $\mathbf{x}=[x,y]^T$ is the corresponding 2D point on the image. This perspective projection is modeled by the ideal pinhole camera, and the mapping is given by the camera matrix $P\in\mathbb{R}^{3\times4}$:

$$
\begin{aligned}
    P &= \overbrace{K}^{\text{Intrinsic Matrix}} \times \overbrace{[R \mid \mathbf{t}]}^{\text{Extrinsic Matrix}}\\
    &=\overbrace{
        \underbrace{\left(\begin{array}{ccc}1 & 0 & x_{0} \\ 0 & 1 & y_{0} \\ 0 & 0 & 1\end{array}\right)}_{\text{ 2D Translation }} \times \underbrace{\left(\begin{array}{ccc}f_{x} & 0 & 0 \\ 0 & f_{y} & 0 \\ 0 & 0 & 1\end{array}\right)}_{\text {2D Scaling }} \times \underbrace{\left(\begin{array}{ccc}1 & s/f_{x} & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1\end{array}\right)}_{\text {2D Shear }}
    }^{\text{Intrinsic Matrix}} \times
    \overbrace{
        \underbrace{(I \mid \mathbf{t})}_{\text {3D Translation }} \times 
        \underbrace{\left(\begin{array}{c|c}R & 0 \\ \hline 0 & 1\end{array}\right)}_{\text {3D Rotation }}
    }^{\text{Extrinsic Matrix}}
\end{aligned}
$$

where $f_x$ and $f_y$ are the focal lengths (in pixels) and $x_0$ and $y_0$ are the principal point offset, i.e. the location of the principal point relative to the film's origin. The principal point is where the principal axis, the line through the pinhole perpendicular to the image plane, meets the image plane. The axis skew $s$ causes shear distortion in the projected image and is usually zero. In a true pinhole camera, $f_x$ and $f_y$ have the same value, but in practice they can differ, for example due to flaws in the camera's optics or sensor; the resulting image then has non-square pixels. Finally, the extrinsic matrix $[R \mid \mathbf{t}]$ describes the camera's rotation and translation relative to the world by mapping world coordinates into the camera frame.
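
The factorization can be checked numerically. The sketch below is illustrative only: the function names `intrinsic_matrix`, `camera_matrix` and `project` are hypothetical, the numeric values are made up (a principal point at the center of a 360x480 image is assumed), and the camera is assumed to look along +Z as in the referenced articles, unlike the ray code earlier in this notebook, which looks along -Z.

```python
import numpy as np

def intrinsic_matrix(fx, fy, x0, y0, s=0.0):
    """Compose K as (2D translation) @ (2D scaling) @ (2D shear)."""
    translation = np.array([[1.0, 0.0, x0],
                            [0.0, 1.0, y0],
                            [0.0, 0.0, 1.0]])
    scaling = np.array([[fx, 0.0, 0.0],
                        [0.0, fy, 0.0],
                        [0.0, 0.0, 1.0]])
    shear = np.array([[1.0, s / fx, 0.0],
                      [0.0, 1.0, 0.0],
                      [0.0, 0.0, 1.0]])
    return translation @ scaling @ shear

def camera_matrix(K, R, t):
    """Build the 3x4 camera matrix P = K [R | t]."""
    return K @ np.hstack([R, np.asarray(t, dtype=float).reshape(3, 1)])

def project(P, X):
    """Project a 3D world point X to pixel coordinates via homogeneous division."""
    x_h = P @ np.append(X, 1.0)
    return x_h[:2] / x_h[2]

# Illustrative values: identity extrinsics, principal point at the image center.
K = intrinsic_matrix(fx=400.0, fy=400.0, x0=180.0, y0=240.0)
P = camera_matrix(K, R=np.eye(3), t=np.zeros(3))
pixel = project(P, np.array([0.1, -0.2, 2.0]))

print(K)      # upper-triangular: [[fx, s, x0], [0, fy, y0], [0, 0, 1]]
print(pixel)
```

Multiplying the three factors yields the familiar upper-triangular form of $K$, with $s$ in the top-middle entry, so the decomposition is purely a way of reading the entries as translation, scaling and shear.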

The image on the pinhole camera's film depicts a mirrored version of reality. Using a "virtual image" instead of the film fixes this. The virtual image has the same properties as the film image, but unlike the true image, the virtual image appears in front of the camera, and the projected image is unflipped, as shown in the following figure.

<figure>
    <div style="text-align: center;">
        <img src="./media/intrinsic-frustum-no-box.png" width="200"/>
        <p align="center">
            <b>The pinhole camera's flipped projection and the unflipped virtual camera's image.</b>
        </p>
    </div>
</figure>
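
The flip can be reproduced with a few lines of arithmetic. This is a small sketch under stated assumptions: the pinhole sits at the origin, the camera looks along +Z (for illustration only; the ray code earlier in this notebook looks along -Z), the film lies at $z=-f$ and the virtual image at $z=+f$. The helper `project_onto_plane` and the numbers are hypothetical.

```python
def project_onto_plane(point, plane_z):
    """Intersect the line through the pinhole (origin) and `point` with the plane z = plane_z."""
    x, y, z = point
    scale = plane_z / z
    return (x * scale, y * scale)

f = 1.0                      # focal length in arbitrary units (illustrative)
X = (0.5, 0.25, 2.0)         # a point in front of the camera (Z > 0)

film = project_onto_plane(X, -f)       # physical film, behind the pinhole
virtual = project_onto_plane(X, +f)    # virtual image, in front of the pinhole

print("film:   ", film)
print("virtual:", virtual)
```

Both projections have the same magnitude, but the film coordinates carry the opposite sign in x and y, which is the mirrored image described above.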

Removing the true image leaves only the "viewing frustum" representation of the pinhole camera. The pinhole has been replaced by the tip of the pyramid-shaped "visibility cone", and the film is now represented by the virtual image plane.
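
The opening angle of the frustum follows from the focal length and the image size. The sketch below assumes the principal point lies at the image center and reuses the `H`, `W` and focal values printed when the data was loaded; the helper name `field_of_view_deg` is hypothetical.

```python
import math

def field_of_view_deg(size_px, focal_px):
    """Full opening angle (in degrees) of the frustum along one image axis."""
    return math.degrees(2.0 * math.atan(size_px / (2.0 * focal_px)))

# Values copied from the loader output earlier in this notebook.
H, W, focal = 480, 360, 400.0368957519531

fov_x = field_of_view_deg(W, focal)
fov_y = field_of_view_deg(H, focal)
print(f"horizontal FOV: {fov_x:.1f} deg, vertical FOV: {fov_y:.1f} deg")
```

A longer focal length (in pixels) narrows the frustum, while a larger image at the same focal length widens it.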

#### References 
- [Dissecting the Camera Matrix, Part 3: The Intrinsic Matrix](https://ksimek.github.io/2013/08/13/intrinsic/)
- [16-385 Computer Vision (Kris Kitani) - Carnegie Mellon University](https://www.cs.cmu.edu/~16385/s17/Slides/11.1_Camera_matrix.pdf)