# Configurable 3D Scene Synthesis and 2D Image Rendering with Per-Pixel Ground Truth using Stochastic Grammars

[[PDF](https://arxiv.org/pdf/1704.00112.pdf)]

## Table of Contents
- [Abstract](#abstract)
- [1 Introduction](#introduction)
- [2 Representation and Formulation](#2-representation-and-formulation)

## Abstract
The paper proposes a systematic learning-based approach to generating synthetic 3D scenes and photorealistic 2D images, with associated ground truth information, for training, benchmarking, and diagnosing learning-based computer vision and robotics algorithms.

## Introduction
![fig1](./fig/fig1.png)
Figure 1: (a) An example automatically-generated 3D bedroom scene, rendered as a photorealistic RGB image, along with its (b) per-pixel ground truth (from top) surface normal, depth, and object identity images. (c) Another synthesized bedroom scene. Synthesized scenes include fine details: objects (e.g., duvet and pillows on beds) and their textures are changeable by sampling the physical parameters of materials (reflectance, roughness, glossiness, etc.), and illumination parameters are sampled from continuous spaces of possible positions, intensities, and colors. (d)–(g) Rendered images of four other example synthetic indoor scenes: (d) bedroom, (e) bathroom, (f) study, (g) gym.

Learning-based pipeline:
- The sampling algorithm uses a stochastic grammar model, represented by an attributed Spatial And-Or Graph (S-AOG), to combine hierarchical compositions and contextual constraints when generating 3D scenes
- State-of-the-art physics-based rendering (PBR) yields photorealistic synthetic images and enables an unlimited variety of environmental conditions and attributes

2D images are rendered from 3D scenes containing ground truth information.

### 1.1 Related Work
- Synthetic image datasets
- 3D layout synthesis
- Image synthesis
- Stochastic scene grammar models
- Domain adaptation

### 1.2 Contributions
1. The first work to introduce, for the purpose of indoor scene understanding, a learning-based configurable pipeline for generating photorealistic images of indoor scenes with perfect per-pixel ground truth
2. Propose S-AOG for scene generation
3. The first paper to provide comprehensive diagnostics with respect to algorithm stability and sensitivity to certain scene attributes
4. Demonstrate the effectiveness of the synthesized scene dataset by advancing the state of the art in the prediction of surface normals and depth from RGB images

## 2 Representation and Formulation
### 2.1 Representation: Attributed Spatial And-Or Graph
**Capabilities:**
- Compositional hierarchy: <br>
  (i) an indoor scene can be categorized into different indoor settings (e.g., bedroom) <br>
  (ii) furniture can be decomposed into functional groups (e.g., a "work" group consists of a desk and a chair)

- Contextual relations: <br>
  (i) furniture pieces and walls <br>
  (ii) among furniture pieces <br>
  (iii) supported and supporting objects <br>
  (iv) objects of a functional pair

**Representation:**
![fig2](./fig/fig2.png)
Fig. 2: Scene grammar as an attributed S-AOG. The terminal nodes of the S-AOG are attributed with internal attributes (sizes) and external attributes (positions and orientations). A supported object node is combined by an address terminal node and a regular terminal node, indicating that the object is supported by the furniture pointed to by the address node. If the value of the address node is null, the object is situated on the floor. Contextual relations are defined between walls and furniture, among different furniture pieces, between supported objects and supporting furniture, and for functional groups.

**Definitions:**
- S-AOG: $\mathscr{G} = \langle S, V, R, P, E \rangle$, where <br>
  $S$: root node <br>
  $V$: vertex set including non-terminal and terminal nodes, $V_{NT} \cup V_T$ <br>
  $R$: production rules <br>
  $P$: probability model <br>
  $E$: contextual relations represented as horizontal links between nodes in the same layer

- Non-terminal Nodes: $V_{NT} = V^{And} \cup V^{Or} \cup V^{Set}$

- Production Rules for non-terminal nodes: <br>
  (i) And rules for $v \in V^{And}$: $v \rightarrow u_1 \cdot u_2 \cdot ... \cdot u_{n(v)}$ <br>
  (ii) Or rules for $v \in V^{Or}$: $v \rightarrow u_1 | u_2 | ... | u_{n(v)}$ with $\rho_1 | \rho_2 | ... | \rho_{n(v)}$ <br>
  (iii) Set rules for $v \in V^{Set}$: $v \rightarrow (nil | u^1_1 | u^2_1 | ...) ... (nil | u^1_{n(v)} | u^2_{n(v)} | ...)$ with $(\rho_{1,0} | \rho_{1,1} | \rho_{1,2} | ...) ... (\rho_{n(v),0} | \rho_{n(v),1} | \rho_{n(v),2} | ...)$ <br>
  where $u^k_i$ denotes the case that object $u_i$ appears $k$ times with probability $\rho_{i,k}$.

- Terminal nodes: <br>
  (i) regular $v \in V^r_T$: with internal attributes $A_{in}$ (size) and external attributes $A_{ex}$ (position and orientation) <br>
  (ii) address $v \in V^a_T$: points to a regular terminal node and takes values in the set $V^r_T \cup \{\text{nil}\}$

- Contextual Relations: $E = E_w \cup E_f \cup E_o \cup E_g$ <br>
  (i) $E_w$: relations between furniture pieces and walls <br>
  (ii) $E_f$: relations among furniture pieces <br>
  (iii) $E_o$: relations between supported and supporting objects <br>
  (iv) $E_g$: relations between objects of a functional pair <br>
  Accordingly, the cliques formed in the terminal layer may be divided into four subsets: $C = C_w \cup C_f \cup C_o \cup C_g$.

- Parse Tree: A hierarchical parse tree $pt$ instantiates the S-AOG by selecting one child node for each Or-node and determining the state of each child node for each Set-node. A parse graph $pg$ consists of a parse tree $pt$ and a number of contextual relations $E_{pt}$ on the parse tree: $pg = (pt, E_{pt})$.
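
The production rules and the parse tree definition can be made concrete with a small sampler. This is a minimal sketch, not the paper's implementation: the `Node` class, the `sample_parse_tree` and `leaves` helpers, the toy grammar, and every probability value are hypothetical and chosen only for illustration. It covers the hierarchical part of the grammar only and ignores attributes and contextual relations.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Node:
    """One S-AOG vertex (hypothetical structure, for illustration only)."""
    name: str
    kind: str  # "and", "or", "set", or "terminal"
    children: list = field(default_factory=list)
    # Or-node: one branching probability rho_i per child.
    # Set-node: per child i, a list [rho_{i,0}, rho_{i,1}, ...] over copy counts.
    probs: list = field(default_factory=list)


def sample_parse_tree(node, rng):
    """Instantiate the grammar below `node` as a (name, kind, subtrees) tuple."""
    if node.kind == "terminal":
        return (node.name, node.kind, [])
    if node.kind == "and":  # And rule: expand every child
        subtrees = [sample_parse_tree(c, rng) for c in node.children]
        return (node.name, node.kind, subtrees)
    if node.kind == "or":  # Or rule: select exactly one child
        child = rng.choices(node.children, weights=node.probs, k=1)[0]
        return (node.name, node.kind, [sample_parse_tree(child, rng)])
    if node.kind == "set":  # Set rule: draw a copy count per child (0 means nil)
        subtrees = []
        for child, count_probs in zip(node.children, node.probs):
            count = rng.choices(range(len(count_probs)), weights=count_probs, k=1)[0]
            subtrees.extend(sample_parse_tree(child, rng) for _ in range(count))
        return (node.name, node.kind, subtrees)
    raise ValueError(f"unknown node kind: {node.kind}")


def leaves(tree):
    """Collect the terminal names of a sampled parse tree, left to right."""
    name, kind, subtrees = tree
    if kind == "terminal":
        return [name]
    return [leaf for sub in subtrees for leaf in leaves(sub)]


# Toy grammar; all structure and numbers below are made up.
bed = Node("bed", "terminal")
desk = Node("desk", "terminal")
chair = Node("chair", "terminal")
work = Node("work_group", "and", [desk, chair])
# Bedroom: always one bed, and zero or one "work" group with equal probability.
bedroom = Node("bedroom", "set", [bed, work], [[0.0, 1.0], [0.5, 0.5]])
# Study: one or two "work" groups with equal probability.
study = Node("study", "set", [work], [[0.0, 0.5, 0.5]])
scene = Node("scene", "or", [bedroom, study], [0.7, 0.3])

tree = sample_parse_tree(scene, random.Random(0))
print(leaves(tree))
```

Because the "work" group is an And-node, every sampled desk comes with a chair, while the Set-node decides how many such groups appear.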

![fig3](./fig/fig3.png)
Fig. 3: (a) A simplified example of a parse graph of a bedroom. The terminal nodes of the parse graph form an MRF in the bottom layer. Cliques are formed by the contextual relations projected to the bottom layer. (b)–(e) give an example of the four types of cliques, which represent different contextual relations.
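
As a rough illustration of how terminal nodes could be grouped into the four clique sets $C_w$, $C_f$, $C_o$, and $C_g$, the sketch below builds them from a flat list of terminals. The `Terminal` class, the `build_cliques` helper, and the `wall_of` mapping are hypothetical. Pairing every two furniture pieces for $C_f$ and assigning one wall per furniture piece for $C_w$ are simplifying assumptions, not rules stated in the notes above.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import Optional


@dataclass(frozen=True)
class Terminal:
    """A terminal node of a parse graph (hypothetical fields, for illustration)."""
    name: str
    is_furniture: bool
    address: Optional[str] = None  # supporting furniture; None stands for nil (on the floor)
    group: Optional[str] = None    # functional group label, if any


def build_cliques(terminals, wall_of):
    """Group terminal nodes into the four clique sets C_w, C_f, C_o, C_g."""
    furniture = [t for t in terminals if t.is_furniture]
    return {
        # C_w: each furniture piece with its wall (wall_of is an assumed input)
        "w": [(t.name, wall_of[t.name]) for t in furniture],
        # C_f: every unordered pair of furniture pieces (an assumption)
        "f": [(a.name, b.name) for a, b in combinations(furniture, 2)],
        # C_o: supported object with the furniture its address node points to
        "o": [(t.name, t.address) for t in terminals if t.address is not None],
        # C_g: pairs of objects that share a functional group
        "g": [(a.name, b.name) for a, b in combinations(terminals, 2)
              if a.group is not None and a.group == b.group],
    }


# Made-up bedroom: a pillow on the bed, a laptop on the desk, desk and chair as a "work" group.
terminals = [
    Terminal("bed", True),
    Terminal("desk", True, group="work"),
    Terminal("chair", True, group="work"),
    Terminal("pillow", False, address="bed"),
    Terminal("laptop", False, address="desk"),
]
wall_of = {"bed": "wall_0", "desk": "wall_1", "chair": "wall_1"}
cliques = build_cliques(terminals, wall_of)
print(cliques["o"])
```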