# Unsupervised Learning of 3D Structure from Images

* Psygrammer / DGM: Part 1 - DeepMind paper review [1]
* 김무성

# Contents
* Abstract
* 1 Introduction
* 2 Conditional Generative Models
    - 2.1 Architectures
* 3 Experiments
    - 3.1 Generating volumes
    - 3.2 Probabilistic volume completion and denoising
    - 3.3 Conditional volume generation
    - 3.4 Performance benchmarking
    - 3.5 Multi-view training
    - 3.6 Single-view training
* 4 Discussion

#### See also
* [3] (slide) Unsupervised Learning of 3D Structure from Images - https://www.cs.toronto.edu/~duvenaud/courses/csc2541/slides/unsupervised-3d.pdf

# 1 Introduction

<img src="figures/cap1.png" width=500 />

# 2 Conditional Generative Models
* 2.1 Architectures

<img src="figures/cap2.png" width=800 />

## 2.1 Architectures

We denote the set of all parameters of this generative model as $\theta = \{\theta_r, \theta_w, \theta_s, \theta_p\}$.

<img src="figures/cap3.png" width=800 />

<img src="figures/cap4.png" width=800 />

#### 3D → 3D projection (identity)

See figure 3 (left).

#### 3D → 2D neural projection (learned)

See figure 3 (middle).

#### 3D → 2D OpenGL projection (fixed)

See figure 3 (right).

# 3 Experiments
* 3.1 Generating volumes
* 3.2 Probabilistic volume completion and denoising
* 3.3 Conditional volume generation
* 3.4 Performance benchmarking
* 3.5 Multi-view training
* 3.6 Single-view training

#### See also
* [2] (video) Unsupervised Learning of 3D Structure from Images - https://www.youtube.com/watch?v=stvDAGQwL5c

We demonstrate the ability of our model to learn and exploit 3D scene representations in five challenging tasks.

#### Necker cubes

##### See also
* [4] (video) The Necker Cube - https://www.youtube.com/watch?v=fEN8YAXdOak
* [5] (wikipedia) The Necker Cube - https://en.wikipedia.org/wiki/Necker_cube

This is the simplest dataset we use and consists of 40 × 40 × 40 volumes with a 10 × 10 × 10 wire-frame cube drawn at a random orientation at the center of the volume.
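A minimal NumPy sketch of how such a volume could be produced, assuming the cube's edges are sampled as points, rotated by a random rotation matrix, and rasterised into the nearest voxel. The function names `random_rotation` and `necker_cube_volume`, the QR-based rotation sampling, and the sampling density are illustrative assumptions, not the paper's data pipeline.

```python
import itertools
import numpy as np

def random_rotation(rng):
    """Draw a random 3D rotation via QR decomposition of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.normal(size=(3, 3)))
    q = q * np.sign(np.diag(r))      # fix column signs for a unique factorisation
    if np.linalg.det(q) < 0:         # make it a proper rotation (det = +1)
        q[:, 0] = -q[:, 0]
    return q

def necker_cube_volume(rng, size=40, edge=10, samples_per_edge=50):
    """Rasterise a randomly oriented wire-frame cube into a binary volume."""
    half = edge / 2.0
    corners = np.array(list(itertools.product([-half, half], repeat=3)))
    t = np.linspace(0.0, 1.0, samples_per_edge)[:, None]
    points = []
    for a, b in itertools.combinations(range(8), 2):
        # Two corners share an edge when they differ in exactly one coordinate.
        if np.sum(corners[a] != corners[b]) == 1:
            points.append((1 - t) * corners[a] + t * corners[b])
    points = np.concatenate(points) @ random_rotation(rng).T
    idx = np.rint(points + (size - 1) / 2.0).astype(int)  # centre in the volume
    volume = np.zeros((size, size, size), dtype=np.uint8)
    volume[idx[:, 0], idx[:, 1], idx[:, 2]] = 1
    return volume

volume = necker_cube_volume(np.random.default_rng(0))
```

Because the cube's corners lie at most about 8.7 voxels from the centre, every rasterised point stays inside the 40 × 40 × 40 grid for any orientation.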

#### Primitives

* The volumetric primitives are of size 30 × 30 × 30. 
* Each volume contains a simple solid geometric primitive (e.g., cube, sphere, pyramid, cylinder, capsule or ellipsoid) that undergoes random translations ([0, 20] pixels) and rotations ([−π, π] radians).
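As a rough illustration of the two bullets above, the sketch below voxelises a solid ellipsoid or box under a random rotation and translation. The rotation axis (z), the way the translation is turned into a centre position, the default half-extents, and the name `primitive_volume` are all assumptions made for this example; only the 30 × 30 × 30 size and the two sampling ranges come from the text.

```python
import numpy as np

def primitive_volume(rng, kind="ellipsoid", size=30, half_extent=(5.0, 3.0, 3.0)):
    """Voxelise a solid box or ellipsoid under a random rotation and translation."""
    angle = rng.uniform(-np.pi, np.pi)          # rotation range from the text
    shift = rng.uniform(0.0, 20.0, size=3)      # translation range from the text
    c, s = np.cos(angle), np.sin(angle)
    rot_z = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])  # assumed axis: z
    centre = 5.0 + shift     # assumed offset so the shape stays mostly inside the grid

    axes = [np.arange(size, dtype=float)] * 3
    grid = np.stack(np.meshgrid(*axes, indexing="ij"), axis=-1)  # (size, size, size, 3)
    local = (grid - centre) @ rot_z             # voxel centres in the object frame
    scaled = local / np.asarray(half_extent)
    if kind == "ellipsoid":
        inside = np.sum(scaled ** 2, axis=-1) <= 1.0
    elif kind == "box":
        inside = np.all(np.abs(scaled) <= 1.0, axis=-1)
    else:
        raise ValueError(f"unknown primitive: {kind}")
    return inside.astype(np.uint8)

ellipsoid = primitive_volume(np.random.default_rng(0), kind="ellipsoid")
box = primitive_volume(np.random.default_rng(0), kind="box")
```

Evaluating an implicit inside/outside test on the voxel grid avoids resampling artefacts, which is why the rotation is applied to the coordinates rather than to an already rasterised shape.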

#### MNIST3D

* We extended the MNIST dataset [22] to create a 30 × 30 × 30 volumetric dataset by extruding the MNIST images.
* The resulting dataset has the same number of images as MNIST. 
* The data is then augmented with random translations ([0, 20] pixels) and rotations ([−π, π] radians) that are procedurally applied during training.
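The extrusion step can be pictured as repeating a binarised image along a new depth axis. The sketch below assumes a 28 × 28 input, a threshold of 0.5, and an extrusion depth of 6 voxels; these values and the name `extrude_image` are hypothetical, since the notes do not state them. The random translations and rotations would be applied afterwards and are not shown.

```python
import numpy as np

def extrude_image(image, size=30, depth=6, threshold=0.5):
    """Turn a 2D grayscale image into a binary volume by repeating it along depth."""
    h, w = image.shape
    mask = (image > threshold).astype(np.uint8)
    volume = np.zeros((size, size, size), dtype=np.uint8)
    # Centre the extruded slab inside the volume.
    top, left, front = (size - h) // 2, (size - w) // 2, (size - depth) // 2
    volume[top:top + h, left:left + w, front:front + depth] = mask[:, :, None]
    return volume

# Stand-in for an MNIST digit: a filled rectangle on a 28 x 28 canvas.
digit = np.zeros((28, 28))
digit[10:18, 12:16] = 1.0
digit_volume = extrude_image(digit)
```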

#### ShapeNet

##### See also
* [6] ShapeNet - http://shapenet.cs.stanford.edu/

The ShapeNet dataset is a large dataset of 3D meshes of objects. 
* We experiment with a 40-class subset of the dataset, commonly referred to as ShapeNet40. 
* We render each mesh as a binary 30 × 30 × 30 volume.

#### NN 

For all experiments we used LSTMs with 300 hidden neurons and 10 latent variables per generation step. 
* The context encoder $f_c(c, s_{t-1})$ was varied for each task.
* For image inputs we used convolutions and standard spatial transformers, and for volumes we used volumetric convolutions and VSTs (volumetric spatial transformers).
* For the class-conditional experiments, the context $c$ is a one-hot encoding of the class.
* As meshes are much lower-dimensional than volumes, we set the number of steps to $T = 1$ when working with this representation.
* We used the Adam optimizer for all experiments.

## 3.1 Generating volumes

* When ground-truth volumes are available we can directly train the model using the identity projection operator (see section 2.1). 
* We explore the performance of our model by training on several datasets.

<img src="figures/cap5.png" width=800 />

## 3.2 Probabilistic volume completion and denoising

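* We test the ability of the model to impute missing data in 3D volumes.
* For volume completion, we use an unconditional volumetric model and alternate between inference and generation, feeding the result of one into the other.
* This procedure simulates a Markov chain and samples from the correct distribution.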
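One way to picture such a chain is to alternate an inference step and a generation step while re-imposing the voxels that were actually observed. The sketch below is only an illustration: `infer` and `generate` are hypothetical callables standing in for a trained model, clamping the observed voxels after every step is an assumption, and the mirror-symmetry "model" is a toy chosen so the example runs without any training.

```python
import numpy as np

def complete_volume(observed, mask, infer, generate, rng, steps=10):
    """Alternate inference and generation, keeping the observed voxels fixed.

    observed: binary volume; mask: 1 where the voxel was actually observed.
    infer / generate: hypothetical callables standing in for the trained model.
    """
    # Start the chain from random values in the missing region.
    current = np.where(mask == 1, observed, rng.integers(0, 2, size=observed.shape))
    for _ in range(steps):
        z = infer(current)                       # volume -> latent code
        probs = generate(z)                      # latent code -> voxel probabilities
        sample = (rng.random(observed.shape) < probs).astype(observed.dtype)
        current = np.where(mask == 1, observed, sample)   # clamp known voxels
    return current

# Toy stand-in model: it predicts every voxel from its mirror image along axis 0.
def toy_infer(volume):
    return volume.astype(float)

def toy_generate(z):
    return z[::-1]

full = np.zeros((8, 8, 8), dtype=np.uint8)
full[2:6, 3:5, 3:5] = 1                  # a block that is symmetric along axis 0
mask = np.zeros_like(full)
mask[:4] = 1                             # only the first half is observed
completed = complete_volume(full * mask, mask, toy_infer, toy_generate,
                            np.random.default_rng(0))
```

With the toy model the occluded half is filled in from the mirror image of the visible half; with a real model, each pass would instead draw a fresh sample, so repeated runs would give different plausible completions.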

<img src="figures/cap6.png" width=800 />

## 3.3 Conditional volume generation

The models can also be trained with context representing the class of the object, allowing for class-conditional generation. We train a class-conditional model on ShapeNet and show multiple samples for 10 of the 40 classes in figure 7.

<img src="figures/cap8.png" width=800 />

We can also form conditional models using a single view of 2D contexts. Our results, shown in figure 8, indicate that the model generates plausible shapes that match the constraints provided by the context and captures the multi-modality of the posterior.

<img src="figures/cap9.png" width=800 />

## 3.4 Performance benchmarking

* We quantify the performance of the model by computing likelihood scores, varying the number of conditioning views and the number of inference steps in the model. 
* Figure 6 indicates that the number of generation steps is a very important factor for performance. 
* Additional context views generally improve the model’s performance but the effect is relatively small. 

<img src="figures/cap7.png" width=800 />

## 3.5 Multi-view training

* In most practical applications, ground-truth volumes are not available for training. 
* Instead, data is captured as a collection of images (e.g., from a multi-camera rig or a moving robot). 
* To accommodate this fact, we extend the generative model with a <font color="red">projection operator</font> that <font color="blue">maps the internal volumetric representation $h_T$ to a 2D image $\hat{x}$</font>. 
    - This map imitates a ‘camera’ in that it first applies an affine transformation to the volumetric representation, and then flattens the result using a convolutional network.
    - The parameters of this projection operator are trained jointly with the rest of the model.
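The two stages of this 'camera' (affine transform, then flatten) can be sketched in NumPy as follows. This is a simplified stand-in, not the learned operator: nearest-neighbour resampling replaces differentiable interpolation, a fixed max over the depth axis replaces the convolutional network, and the names `affine_resample` and `project` are hypothetical.

```python
import numpy as np

def affine_resample(volume, matrix, offset):
    """Nearest-neighbour resampling: output[p] = volume[matrix @ p + offset]."""
    size = np.array(volume.shape)
    axes = [np.arange(n) for n in volume.shape]
    out_coords = np.stack(np.meshgrid(*axes, indexing="ij"), axis=-1)  # (D, H, W, 3)
    src = np.rint(out_coords @ matrix.T + offset).astype(int)
    valid = np.all((src >= 0) & (src < size), axis=-1)   # source voxel inside the grid
    src = np.clip(src, 0, size - 1)
    sampled = volume[src[..., 0], src[..., 1], src[..., 2]]
    return np.where(valid, sampled, 0)

def project(volume, matrix, offset):
    """'Camera': affine transform, then flatten along the last (depth) axis."""
    transformed = affine_resample(volume, matrix, offset)
    # Stand-in for the learned convolutional flattening: a fixed max over depth.
    return transformed.max(axis=-1)

rng = np.random.default_rng(0)
vol = (rng.random((6, 6, 6)) > 0.8).astype(np.uint8)

front_view = project(vol, np.eye(3), np.zeros(3))
# Swapping the last two axes makes the camera look along a different direction.
swap = np.array([[1.0, 0.0, 0.0], [0.0, 0.0, 1.0], [0.0, 1.0, 0.0]])
side_view = project(vol, swap, np.zeros(3))
```

In the model itself both stages have trainable parameters, so the viewpoint and the flattening are learned jointly with the rest of the network rather than fixed as they are here.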

#### from fixed camera locations

* In this experiment <font color="red">we train the model to learn to reproduce an image of the object</font> <font color="blue">given one or more views of it from fixed camera locations</font>. 
* It is the model’s responsibility to infer the volumetric representation as well as the camera’s position relative to the volume.

We train a model that conditions on 3 fixed context views to reproduce 10 simultaneous random views of an object.
* After training, we can sample a 3D representation given the context, and render it from arbitrary camera angles.

<img src="figures/cap10.png" width=800 />

## 3.6 Single-view training

Finally, we consider a mesh-based 3D representation and demonstrate the feasibility of training our models with a fully-fledged, black-box renderer in the loop. 
* Such renderers (e.g. OpenGL) accurately capture the relationship between a 3D representation and its 2D rendering out of the box. 
* This image is a complex function of the objects’ colors, materials and textures, positions of lights, and that of other objects. 
* By building this knowledge into the model we give hints for learning and constrain its hidden representation.

<font color="red">We consider again the Primitives dataset, however now we only have access to 2D images of the objects at training time</font>.

<img src="figures/cap11.png" width=800 />

# 4 Discussion

# References
* [1] (Paper) Unsupervised Learning of 3D Structure from Images - https://arxiv.org/abs/1607.00662
* [2] (video) Unsupervised Learning of 3D Structure from Images - https://www.youtube.com/watch?v=stvDAGQwL5c
* [3] (slide) Unsupervised Learning of 3D Structure from Images - https://www.cs.toronto.edu/~duvenaud/courses/csc2541/slides/unsupervised-3d.pdf
* [4] (video) The Necker Cube - https://www.youtube.com/watch?v=fEN8YAXdOak
* [5] (wikipedia) The Necker Cube - https://en.wikipedia.org/wiki/Necker_cube
* [6] ShapeNet - http://shapenet.cs.stanford.edu/