In [1]:
from IPython.display import HTML  # public import path; IPython.core.display is deprecated for this

HTML("""
<style>
.rendered_html {
    font-family: Monaco, monospace;
    font-size: 12px;
}
</style>
""")

### Neural Field Representation:

These notes are mostly drawn from [this paper](https://arxiv.org/pdf/2107.04004).

We proceed to consider the modelling and manipulation of
dynamical systems with extremely high **degrees of freedom (DoF)**
using only visual observations, such as manipulating a container of
liquid with ice cubes floating in it (i.e., fluid-body interactions).
In such a scenario, accurately estimating the full state of
the particle set is challenging when the only inputs are RGB
images. Moreover, it is unclear how to use *keypoints* (typically a sparse set of
points attached to semantically meaningful parts of an object) in
this context, since the fluid, with its extremely high
DoF, continuously changes shape during interactions.

$?$ A central question in model learning for robotic manipulation is
how to establish the state representation for learning the dynamics
model. The ideal representation should readily capture the environmental dynamics, show a strong 3D understanding of the objects in the scene, and be applicable to a wide range of object sets, including
rigid or deformable objects and fluids. $?$

Some methods, instead of estimating the state of the environment, *learn dynamics in the latent space*. However, the
majority of these methods learn dynamics models using 2D convolutional neural networks and a reconstruction loss, so they suffer from the same issue as predicting dynamics in image space: their learned
representations lack equivariance to 3D transformations. **Time contrastive networks** aim to learn viewpoint-invariant representations from multi-view inputs, but they do not require detailed modelling of the 3D content.

Here, we propose embedding **neural radiance fields**
into an **autoencoder framework**, enabling tractable inference of the
3D-structure-aware scene state for dynamic environments. By also
enforcing a **time contrastive loss** on the estimated states, we ensure
that the learned state representations are viewpoint-invariant. We
then train a **dynamics model** that predicts the evolution of the state
space conditioned on the input action, enabling control in the learned
state space.

#### More on NeRF itself:

Neural Radiance Fields (NeRF) is a neural network architecture used for novel view synthesis and 3D scene reconstruction in computer vision. It represents a scene as a **continuous volumetric function** and allows generating highly realistic images of that scene from arbitrary viewpoints.

At its core, NeRF is a fully connected neural network (usually a multi-layer perceptron, or MLP) that takes a 3D coordinate $\mathbf{x} = (x, y, z)$ and a 2D viewing direction $\mathbf{d} = (\theta, \phi)$, and outputs:
- Color $\mathbf{c} = (r, g, b)$
- Volume density $\sigma \in \mathbb{R}^{+}$

This models the radiance field of the scene—essentially, how light and color behave at every point in space when viewed from a specific direction.

How It Works (Under the Hood):

NeRF uses volumetric rendering to synthesize an image. The process includes:
1.	Ray Casting: For each pixel in the target image, cast a ray from the camera through that pixel into 3D space.
2.	Sampling: Sample multiple points along each ray (stratified sampling).
3.	MLP Evaluation: For each sampled 3D point, feed ($\mathbf{x}, \mathbf{d}$) into the network to get ($\mathbf{c}, \sigma$).
4.	Rendering Equation: Accumulate the colors and densities using the volume rendering integral:
$$C(\mathbf{r}) = \int_{t_n}^{t_f} T(t) \sigma(t) \mathbf{c}(t) \, dt$$
where $T(t) = \exp\left(-\int_{t_n}^{t} \sigma(s) \, ds\right)$ is the transmittance (i.e., how much light reaches the point without being absorbed).
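In practice the integral is approximated by a sum over the sampled points. The following is a minimal NumPy sketch of steps 2 and 4 for a single ray, not the paper's implementation: the function names `stratified_samples` and `render_ray` are made up for illustration, and the last sample interval simply reuses the previous spacing (an assumption).

```python
import numpy as np

def stratified_samples(t_near, t_far, n_samples, rng):
    """Draw one sample uniformly at random from each of n_samples equal bins."""
    edges = np.linspace(t_near, t_far, n_samples + 1)
    return edges[:-1] + rng.random(n_samples) * (edges[1:] - edges[:-1])

def render_ray(sigmas, colors, t_vals):
    """Discrete approximation of C(r) = int T(t) sigma(t) c(t) dt along one ray.

    sigmas: (N,) densities, colors: (N, 3) RGB values, t_vals: (N,) sorted depths.
    """
    # Spacing between adjacent samples; the last interval reuses the
    # previous spacing (assumption made for this sketch).
    deltas = np.diff(t_vals)
    deltas = np.append(deltas, deltas[-1])
    # Opacity of each segment: alpha_i = 1 - exp(-sigma_i * delta_i).
    alphas = 1.0 - np.exp(-sigmas * deltas)
    # Transmittance T_i = prod_{j < i} (1 - alpha_j): light surviving up to sample i.
    trans = np.concatenate([[1.0], np.cumprod(1.0 - alphas)[:-1]])
    weights = trans * alphas
    # Weighted sum of colours along the ray.
    return (weights[:, None] * colors).sum(axis=0), weights
```

With zero density everywhere the ray accumulates no colour, and with a very high density at the first sample the rendered colour collapses to that sample's colour, which matches the role of $T(t)$ as the fraction of light that has not yet been absorbed.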

#### More on Time-contrastive Network:

A Time-Contrastive Network (TCN) is a self-supervised learning approach introduced to learn useful representations from temporal signals (like videos or audio) without requiring labels. The main idea is to use time as a source of supervision — hence the name time-contrastive.

#### NeRF + AutoEncoding + time-contrastive loss + dynamic model

Here, we propose embedding neural radiance fields (NeRF)
into an autoencoder framework, enabling tractable inference of the
3D-structure-aware scene state for dynamic environments. By also
enforcing a time contrastive loss on the estimated states, we ensure
that the learned state representations are viewpoint-invariant. We
then train a dynamics model that predicts the evolution of the state
space conditioned on the input action, enabling control in the learned
state space.

Autoencoding framework:

$$x \xrightarrow{\text{Encoder}} z \xrightarrow{\text{Decoder}} \hat{x} \approx x$$

The approach can be summarized as follows:
1. We extend an autoencoding framework with a neural radiance field rendering module and time contrastive learning, enabling the learning of 3D-aware scene representations for dynamics modelling and control purely from visual observations.
2. By incorporating the autodecoder mechanism at test time, our framework can modify the learned representation and accomplish control tasks with the goal specified from camera viewpoints outside the training distribution.
3. We are the first to augment neural radiance fields with a time-invariant dynamics model, supporting future prediction and novel view synthesis across a wide range of environments with different object types.

#### 3D neural fields for dynamic modelling:
The model has three parts:
1. an encoder that maps the input images into a latent state representation,
2. a decoder that generates an observation image under a certain viewpoint based on the state representation, and
3. a dynamics model that predicts the future state representations based on the current state and the input action.

<img src=./images/nerf-autoencoder-time-contrastive-loss.png width=650>

#### Neural radiance field for dynamic scenes:

To enable $f_{NeRF}$ to model dynamic scenes, we
learn an encoding function $f_{enc}$ that maps the visual observations
to a feature representation $z$ for each time step and learn the volumetric radiance field decoding function based on $z$.
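To make the conditioning on $z$ concrete, here is a small NumPy sketch of a radiance field decoder that takes the scene latent as an extra input. It is only an illustration under several assumptions: the names `init_mlp` and `latent_conditioned_field` are hypothetical, the weights are random rather than trained, the latent is simply concatenated to the point and direction, and the positional encoding and the separate density/colour branches used by real NeRF models are left out.

```python
import numpy as np

def init_mlp(sizes, rng):
    """Random (untrained) weights for a fully connected network; illustration only."""
    return [(rng.normal(0.0, 0.1, size=(m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def latent_conditioned_field(params, x, d, z):
    """Map points x (N, 3), view directions d (N, 2) and latent z (D,) to (rgb, sigma)."""
    n_points = x.shape[0]
    # Repeat the scene latent for every query point and concatenate it
    # with the point coordinates and the viewing direction.
    z_tiled = np.broadcast_to(z, (n_points, z.shape[0]))
    h = np.concatenate([x, d, z_tiled], axis=-1)
    for W, b in params[:-1]:
        h = np.maximum(h @ W + b, 0.0)           # ReLU hidden layers
    W, b = params[-1]
    out = h @ W + b
    rgb = 1.0 / (1.0 + np.exp(-out[:, :3]))      # sigmoid keeps colours in [0, 1]
    sigma = np.maximum(out[:, 3], 0.0)           # ReLU keeps density non-negative
    return rgb, sigma

rng = np.random.default_rng(0)
latent_dim = 8                                    # hypothetical latent size
params = init_mlp([3 + 2 + latent_dim, 32, 32, 4], rng)
x = rng.uniform(-1.0, 1.0, size=(5, 3))           # 3D query points
d = rng.uniform(0.0, np.pi, size=(5, 2))          # (theta, phi) per point
z = rng.normal(size=latent_dim)                   # stands in for the encoder output
rgb, sigma = latent_conditioned_field(params, x, d, z)
```

Because $z$ changes at every time step, the same decoder weights can describe a scene that moves: a different latent yields a different radiance field at the same query points.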

Time contrastive loss in this architecture:

**Time contrastive learning:** To make the image encoder viewpoint-invariant, we regularise the feature representation of each image $v^i_t$ using a **multi-view time contrastive loss (TCN)**. The TCN loss pulls together the features of images taken from different viewpoints at the same time step and pushes apart the features of images from different time steps. More specifically, given a time step $t$, we randomly select one image $I^i_t$ as the anchor and extract its image feature $v^i_t$ using the image encoder. We then randomly select one positive image $I^j_t$ from the same time step but a different camera viewpoint, and one negative image $I^i_{t'}$ from a different time step $t'$ but the same viewpoint. We use the same image encoder to extract their image features $v^j_t$ and $v^i_{t'}$, and minimise the following time contrastive loss:

$$L_{tc} = \max(\|v^i_t - v^j_t\|_2^2 - \|v^i_t - v^i_{t'}\|_2^2 + \alpha, 0),$$

where $\alpha$ is a hyperparameter denoting the margin between the positive and negative pairs.
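The loss is a standard triplet margin loss on squared Euclidean distances. A minimal NumPy sketch follows, where the anchor is a feature at some time step, the positive is the feature from another viewpoint at the same time step, and the negative is the feature from the same viewpoint at a different time step. The function name and the default `margin=1.0` are assumptions for illustration, not values taken from the paper.

```python
import numpy as np

def time_contrastive_loss(anchor, positive, negative, margin=1.0):
    """Triplet margin loss on squared L2 distances (margin value is an assumption)."""
    d_pos = np.sum((anchor - positive) ** 2, axis=-1)  # same time, other viewpoint
    d_neg = np.sum((anchor - negative) ** 2, axis=-1)  # other time, same viewpoint
    return np.maximum(d_pos - d_neg + margin, 0.0)

anchor = np.array([0.0, 0.0])
positive = np.array([0.1, 0.0])   # close to the anchor
negative = np.array([3.0, 0.0])   # far from the anchor
loss_good = time_contrastive_loss(anchor, positive, negative)  # margin already satisfied
loss_bad = time_contrastive_loss(anchor, negative, positive)   # roles swapped
```

The loss is zero once the negative is farther from the anchor than the positive by at least the margin, so well-separated triplets stop contributing to the gradient.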

#### Online planning:

Given a sequence of actions, our model can iteratively predict a sequence of latent state representations.
The goal of planning is to find the action sequence that minimises the distance between the predicted future representation and the goal representation at time $T$.
The latent-space dynamics model can then be used for downstream
**closed-loop control** tasks via online planning with model-predictive control (**MPC**).
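As a rough illustration of planning in the latent space, the sketch below uses random shooting, a simple sampling-based optimiser chosen here for brevity and not necessarily the one used in the paper. `toy_dynamics` is a hypothetical stand-in for the learned dynamics model, and the action bounds, horizon and candidate count are arbitrary assumptions.

```python
import numpy as np

def toy_dynamics(z, a):
    """Hypothetical stand-in for the learned latent dynamics: z_{t+1} = z_t + 0.5 * a_t."""
    return z + 0.5 * a

def rollout(dynamics, z0, actions):
    """Iteratively predict the final latent state after applying a sequence of actions."""
    z = z0
    for a in actions:
        z = dynamics(z, a)
    return z

def plan_random_shooting(dynamics, z0, z_goal, horizon, n_candidates, action_dim, rng):
    """Sample action sequences and keep the one whose final state is closest to the goal."""
    candidates = rng.uniform(-1.0, 1.0, size=(n_candidates, horizon, action_dim))
    costs = np.array([np.sum((rollout(dynamics, z0, seq) - z_goal) ** 2)
                      for seq in candidates])
    best = int(np.argmin(costs))
    return candidates[best], costs[best]

# Closed-loop control: replan at every step and execute only the first action.
rng = np.random.default_rng(0)
z = np.zeros(2)                       # current latent state (toy example)
z_goal = np.array([1.0, -1.0])        # goal latent state (toy example)
for step in range(10):
    actions, _ = plan_random_shooting(toy_dynamics, z, z_goal, horizon=5,
                                      n_candidates=256, action_dim=2, rng=rng)
    z = toy_dynamics(z, actions[0])
```

Replanning after every executed action is what makes the controller closed-loop: errors in the predicted states are corrected at the next step instead of accumulating over the whole horizon.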