# P2: Dramatic Data!


## Table of Contents

1. Introduction
2. Preliminaries
3. Software setup
4. Grading rubric
5. Submission guidelines
6. Tips, tricks and hints

## 1. Introduction

In this project, you will design and implement a deep neural network capable of segmenting PeAR racing windows.
A sample input and output are shown below:

![Segmentation Image](./assets/segsample.jpg)

This project has three parts:
- Part 1 - Synthetic dataset generation
- Part 2 - Semantic segmentation
- Part 3 - Instance segmentation (identify different instances of the window)

### Part 1 - Dataset generation
No dataset has been provided for this project. You are required to generate a synthetic dataset for training your network. The network must generalize to the real-world video feed given in `test_video.mp4`. For more details, please refer to section 2.1.

### Part 2 - Semantic segmentation
You will design a custom CNN for image segmentation and then run inference on frames from the provided video file (`test_video.mp4`). Your task is to display the inferred segmentation output frames side by side with the original input frames and render the combined frames as a new video.

Please note that `test_video.mp4` contains multiple windows. Your network is expected to segment all the windows present in each input frame.
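
The side-by-side layout described above can be sketched with NumPy alone. This is a minimal illustration, not part of the starter code: the function name `side_by_side` is hypothetical, and it assumes the frame is an `(H, W, 3)` `uint8` array and the mask is an `(H, W)` array at the same resolution.

```python
import numpy as np


def side_by_side(frame: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Place a colour frame (H, W, 3) next to its binary mask (H, W)."""
    if frame.shape[:2] != mask.shape:
        raise ValueError("mask must have the same height and width as the frame")
    # Turn the single-channel mask into a white-on-black 3-channel image.
    mask_u8 = (mask > 0).astype(np.uint8) * 255
    mask_rgb = np.repeat(mask_u8[:, :, None], 3, axis=2)
    # Concatenate along the width axis: the result is (H, 2 * W, 3).
    return np.concatenate([frame.astype(np.uint8), mask_rgb], axis=1)
```

Each combined frame can then be passed to whichever video writer you prefer. Note that the output is twice as wide as the input, so the writer must be configured with the combined frame size.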

### Part 3 - Instance segmentation

This section is more open-ended. You will need to perform instance segmentation, which can be achieved either by applying classical techniques on the segmentation output obtained in Part 2, or by using deep learning methods. No starter code or specific instructions are provided for this section, allowing you the freedom to choose your approach. [More details here](https://www.ibm.com/topics/instance-segmentation).
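
As one example of the classical route, connected-component labeling can split a binary mask from Part 2 into separate instances. The sketch below is only one possible approach, written in plain Python for clarity (it is slow on full-resolution frames); the function name `label_instances` is hypothetical. It assumes windows do not touch or overlap in the mask, since touching regions would be merged into one instance.

```python
from collections import deque


def label_instances(mask):
    """Label 4-connected foreground regions of a binary mask.

    `mask` is a 2D sequence of 0/1 values. Returns (labels, count), where
    `labels` has the same size as `mask`, 0 marks background, and
    1..count are instance ids.
    """
    height, width = len(mask), len(mask[0])
    labels = [[0] * width for _ in range(height)]
    count = 0
    for r in range(height):
        for c in range(width):
            if not mask[r][c] or labels[r][c]:
                continue  # background, or already assigned to an instance
            count += 1
            labels[r][c] = count
            queue = deque([(r, c)])
            # Breadth-first flood fill over the 4-neighbourhood.
            while queue:
                y, x = queue.popleft()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and mask[ny][nx] and not labels[ny][nx]):
                        labels[ny][nx] = count
                        queue.append((ny, nx))
    return labels, count
```

Small spurious regions in the predicted mask will also receive their own label, so a minimum-area filter on each component is a natural follow-up step.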


**Please review the submission guidelines and grading rubric before starting your work.**

## 2. Preliminaries

### 2.1. Dataset generation and sim2real:

Your goal is to build a lightweight neural network with good accuracy. To accomplish this, you will have to generate windows in simulation (Blender) from different views, backgrounds, and scenarios, and train your network on them. Be sure to maintain the aspect ratio of the window accurately when generating data so that the network generalizes well to the real world (the real windows are printed with no scaling).

The specifications of the window are as follows. The window is almost a square with a checkerboard pattern on the edges. Each checkerboard pattern (highlighted in blue in the figure below) is a grid of $7 \times 6$ squares (X or horizontal by Y or vertical). The distance from the checkerboard pattern to either edge is the same as the side of one checkerboard square (yellow highlight in the figure below). The WPI logo will always be centered vertically and the PeAR logo will be centered horizontally (however, these can be absent or replaced by other logos). The window will always be the same WPI crimson color (`rgb(172, 43, 55)`) but can appear different under different lighting. The thickness of the window is negligible compared to its height and width. There can be multiple windows in the frame. Also, each window is X-Y symmetric.

Your data can randomize the viewpoint (or window pose), window lighting, window design with respect to logos, noise, and the background, among others. A high-resolution `PNG` image and the associated `PPTX` file used to generate the windows are given in the starter package in the `assets` folder.

<div class="fig fighighlight">
  <img src="./assets/WindowWithAnnotation.png" width="100%">
  <div class="figcaption">
    The Window used in the experiments with annotation. Blue indicates checkerboard pattern. All the checkerboard squares are of the size of the yellow square as shown.
  </div>
  <div style="clear:both;"></div>
</div>

<br>

You will need to generate the following:

1. **Realistic RGB Images:** These images should contain the provided window in various settings.
2. **Segmentation Masks:** Create binary images that indicate the presence of the window in the RGB images. The Cycles rendering engine in Blender can output ID masks for you. A reference Blender file (`ObjectIndex.blend`) is provided in the `assets` folder to illustrate this.

![Sample Data](./assets/sampledata.png)
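
If you render one ID mask per window, the binary semantic mask for Part 2 is simply their union. The sketch below assumes each mask has already been loaded as an `(H, W)` array with values in `[0, 1]`; the function name `binary_window_mask` and the `0.5` threshold are illustrative choices, not part of the starter code.

```python
import numpy as np


def binary_window_mask(id_masks, threshold=0.5):
    """Merge per-window ID masks (each H x W, values in [0, 1]) into one binary mask."""
    stacked = np.stack([np.asarray(m, dtype=np.float32) for m in id_masks])
    # A pixel belongs to the window class if any per-window mask covers it.
    return (stacked.max(axis=0) >= threshold).astype(np.uint8)
```

The threshold is there because rendered masks can contain intermediate values along edges, for example when anti-aliasing is enabled. Keeping the individual masks around is also useful as instance labels for Part 3.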

To ensure your network generalizes well from simulation to the real world, make sure to incorporate the following variations in your dataset:
1. **Camera location and orientation:** Vary the position and angle of the camera relative to the window.
1. **Background:** Add background images. A few images are provided in `./part1/environment` for your reference.
1. **Lighting:** Simulate various lighting conditions.
1. **Occlusion:** Add objects that block the window.
1. **Multiple Windows:** Show scenes with more than one window.
1. **Noise:** Introduce Gaussian noise to the image.
1. **Blur:** Add blur to the image to simulate out-of-focus conditions.
1. **Color Jitter:** Adjust color properties to simulate diverse scenarios.

You will receive full credit for `Part 1` if your dataset images are properly augmented with the factors mentioned above.
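
As a rough illustration of the camera variation above, the sketch below samples a camera position on a spherical shell in front of a window. Everything about the scene layout is an assumption: the window is taken to be centred at the origin and facing the +Y axis, and the function name, distance range, and angle limit are hypothetical values you should tune for your own scene.

```python
import math
import random


def sample_camera_position(rng, min_dist=1.0, max_dist=5.0, max_angle_deg=60.0):
    """Sample (x, y, z) for a camera in front of a window at the origin.

    Assumes the window faces +Y. Yaw rotates about the vertical Z axis and
    pitch raises or lowers the camera; both are limited to +/- max_angle_deg.
    """
    dist = rng.uniform(min_dist, max_dist)
    yaw = math.radians(rng.uniform(-max_angle_deg, max_angle_deg))
    pitch = math.radians(rng.uniform(-max_angle_deg, max_angle_deg))
    x = dist * math.cos(pitch) * math.sin(yaw)
    y = dist * math.cos(pitch) * math.cos(yaw)
    z = dist * math.sin(pitch)
    return x, y, z


# Example: draw a reproducible position for one render.
position = sample_camera_position(random.Random(42))
```

The camera still has to be rotated to look at the window after it is moved. In Blender, a Track To constraint targeting the window is one way to do that.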

You are free to use any GUI or command-line workflow. The following tools may be helpful for automating the data generation:
1. [Blender script](https://docs.blender.org/api/current/info_quickstart.html) - Dataset generation (randomizing the placement of camera/windows, automated rendering, etc).
2. [BlenderSynth (beta)](https://github.com/OllieBoyne/BlenderSynth) - Dataset generation.
3. [BlenderProc](https://github.com/DLR-RM/BlenderProc) - Dataset generation.
4. [TorchVision](https://pytorch.org/vision/stable/generated/torchvision.transforms.GaussianBlur.html) - Data augmentation (noise, blur,...) on-the-fly during training.

**Recommended Rendering Settings:**

In case you don’t know where these settings are, don’t worry! You are free to use any settings you like. 

1. Rendering Engine - Cycles (with GPU compute enabled)
2. Max Samples - 4
3. Noise Threshold - 0.1
4. Denoise - OptiX
5. Final Render / Persistent Data - Enabled (this gives a 10x speedup, but Blender may crash). Try rendering a few hundred images at a time.
6. Render resolution - 640x360

**Sample Dataset Images**

![Sample Dataset](./assets/sampleDataset.png)

You’ll need a total of approximately 50,000 augmented images for effective training. We generated around 5,000 images using Blender, and then applied augmentations such as noise, blur, and color jitter during the training process using torchvision.
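
To make the idea of on-the-fly augmentation concrete, here is a small NumPy sketch of noise and color jitter. It is not the pipeline we used (that relied on torchvision transforms); the function name `augment` and all parameter ranges are hypothetical. Because these operations only change pixel values and never move pixels, the segmentation mask can be reused unchanged.

```python
import numpy as np


def augment(image, rng, noise_std=10.0, brightness=0.2, contrast=0.2):
    """Apply random photometric augmentation to a uint8 image (H, W, 3)."""
    img = image.astype(np.float32)
    mean = img.mean()
    # Color jitter: random contrast (scale around the mean) and brightness (offset).
    scale = 1.0 + rng.uniform(-contrast, contrast)
    offset = 255.0 * rng.uniform(-brightness, brightness)
    img = (img - mean) * scale + mean + offset
    # Additive Gaussian noise with a random standard deviation up to noise_std.
    sigma = rng.uniform(0.0, noise_std)
    img = img + rng.normal(0.0, sigma, size=img.shape)
    return np.clip(img, 0, 255).astype(np.uint8)


# Example: a fresh random variant of the same render on every call.
rng = np.random.default_rng(0)
sample = augment(np.full((360, 640, 3), 128, dtype=np.uint8), rng)
```

Applying a new random augmentation each time an image is loaded means that a few thousand renders yield many more distinct training samples over the course of training.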

### 2.2 Segmentation network in PyTorch:

Architectures similar to U-Net are recommended as a good starting point.
To keep the design simple and to speed up inference, you are free to rescale the image to a smaller square format (such as 256x256). Please make sure you scale the segmented output back to the original dimensions.
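
Scaling the predicted mask back to the original frame size can be done with nearest-neighbor sampling, which keeps the mask strictly binary. The helper below is a minimal NumPy sketch with a hypothetical name; library resize functions with a nearest-neighbor mode would serve the same purpose.

```python
import numpy as np


def resize_nearest(mask: np.ndarray, out_height: int, out_width: int) -> np.ndarray:
    """Resize a 2D mask with nearest-neighbor sampling (labels stay unchanged)."""
    in_height, in_width = mask.shape
    # For each output pixel centre, find the source row/column that contains it.
    rows = ((np.arange(out_height) + 0.5) * in_height / out_height).astype(np.intp)
    cols = ((np.arange(out_width) + 0.5) * in_width / out_width).astype(np.intp)
    rows = np.minimum(rows, in_height - 1)
    cols = np.minimum(cols, in_width - 1)
    return mask[rows[:, None], cols[None, :]]


# Example: a 256x256 prediction scaled back to a 640x360 frame.
prediction = np.zeros((256, 256), dtype=np.uint8)
full_size = resize_nearest(prediction, 360, 640)
```

Because the square network input does not preserve the 16:9 aspect ratio of the frame, the same non-uniform scaling must be undone on the way back, which is what resizing to the exact original height and width does.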

Here is a helpful article on [image segmentation with PyTorch](https://towardsdatascience.com/efficient-image-segmentation-using-pytorch-part-1-89e8297a0923).
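
For orientation, a very small U-Net-style network built only from basic PyTorch layers might look like the sketch below. It is an illustrative starting point, not a reference solution: the class name `TinyUNet`, the two-level depth, and the channel width `base=16` are all arbitrary assumptions, and the input height and width must be divisible by 4.

```python
import torch
from torch import nn


def conv_block(in_ch, out_ch):
    """Two 3x3 convolutions, each followed by batch norm and ReLU."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )


class TinyUNet(nn.Module):
    def __init__(self, base=16):
        super().__init__()
        self.enc1 = conv_block(3, base)
        self.enc2 = conv_block(base, base * 2)
        self.bottleneck = conv_block(base * 2, base * 4)
        self.pool = nn.MaxPool2d(2)
        self.up2 = nn.ConvTranspose2d(base * 4, base * 2, kernel_size=2, stride=2)
        self.dec2 = conv_block(base * 4, base * 2)
        self.up1 = nn.ConvTranspose2d(base * 2, base, kernel_size=2, stride=2)
        self.dec1 = conv_block(base * 2, base)
        self.head = nn.Conv2d(base, 1, kernel_size=1)

    def forward(self, x):
        e1 = self.enc1(x)                    # full resolution
        e2 = self.enc2(self.pool(e1))        # 1/2 resolution
        b = self.bottleneck(self.pool(e2))   # 1/4 resolution
        # Upsample and concatenate the matching encoder features (skip connections).
        d2 = self.dec2(torch.cat([self.up2(b), e2], dim=1))
        d1 = self.dec1(torch.cat([self.up1(d2), e1], dim=1))
        return self.head(d1)                 # one-channel logits, same size as input
```

The network returns raw logits rather than probabilities, so it pairs naturally with a loss such as `nn.BCEWithLogitsLoss`; apply a sigmoid and a threshold at inference time to obtain the binary mask.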

## 3. Software setup

### Part 1
No sample code for data generation is provided. However, a few sample background images can be found in the `part1/environment` folder, and the window image is available in the `assets/` folder. Please add any code you write to the `part1` folder.

### Part 2
A sample network training pipeline (and wandb logging code) is provided for reference. You are free to edit it as you like.
You are expected to fix any issues you face while running that pipeline. Please use [wandb](https://docs.wandb.ai/tutorials/experiments) or TensorBoard to visualize the training process.
<!-- ![Sample Wandb Webpage](./assets/wandb.png) -->

<img src="./assets/wandb.png" alt="Sample Wandb Image" width="75%">



## 4. Grading rubric

- Part 1: 40
- Part 2: 40
- Part 3: 20

- For RBE474X: Part 1 + Part 2 = 100% of the grade (80/80).
- For RBE595-A01-SP: You are expected to implement Parts 1-3 to receive full credit (100/100).

## 5. Submission guidelines

### 5.1 Report

Please include the following in your report:

1. If you are using Blender, screenshots of your Blender GUI Window.
2. Sample images and labels (segmentation mask) from the dataset.
3. Network architecture diagram.
4. Explain the loss function you used for training.
5. Explain any failure cases you encountered.
6. Tabulate the hyperparameters you used for training (learning rate, optimizer, etc.).
7. Include the inference results for a few sample images from your validation set and the provided video.
8. Plot the training and validation loss curves per epoch. Explain any observations.

### 5.2 Video

Part 2: Save the video as `part2.mp4`. Please use H.264 encoding for the video.

Part 3: Save the video as `part3.mp4`. Please use H.264 encoding for the video.
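
If your video writer does not produce H.264 directly, one option is to re-encode the result with `ffmpeg`. The sketch below only builds and runs the command; it assumes `ffmpeg` is installed with the `libx264` encoder available, and the input file name `raw_part2.mp4` and both function names are hypothetical.

```python
import subprocess


def h264_command(src, dst):
    """Build an ffmpeg command that re-encodes `src` to H.264 at `dst`."""
    return [
        "ffmpeg", "-y",          # overwrite dst if it already exists
        "-i", str(src),          # input video
        "-c:v", "libx264",       # H.264 video encoder
        "-pix_fmt", "yuv420p",   # pixel format most players can decode
        str(dst),
    ]


def reencode(src, dst):
    """Run the re-encode and raise if ffmpeg reports an error."""
    subprocess.run(h264_command(src, dst), check=True)
```

For example, `reencode("raw_part2.mp4", "part2.mp4")` would produce the file named in the guideline above.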

### 5.3 Rules

1. You can choose any network architecture, but you must design it yourself. Prebuilt networks or pre-existing implementations (e.g., those found in libraries like PyTorch's torchvision) are not allowed.
2. You are allowed to use basic PyTorch layers, such as convolutional layers, transposed convolutions, pooling layers, activation functions, dropout, and batch normalization.
3. Do not upload any logs, model checkpoints (`.pth` files), data, or the `wandb` folder. Doing so will result in zero credit.
4. Do not upload your dataset to Canvas! Doing so will result in zero credit.
5. The report must be written in LaTeX using the IEEEtran format from previous projects.


### 5.4 Folder structure
Your submission on ELMS/Canvas must be a ``zip`` file following the naming convention ``GroupGROUPNUM_p2.zip``. For example, if your group number is ``4``, the submission file should be named ``Group4_p2.zip``. The `GROUPNUM` can be found on Canvas. The file **must have the following directory structure**. Do not change the files to run the code. You can have any helper functions in sub-folders as you wish; be sure to reference them using relative paths, and if your code takes command-line arguments, make sure they have default values too. Please provide detailed instructions on how to run your code in the ``README.md`` file.
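
As an illustration of command-line arguments with default values and relative paths, a training script could define its options as sketched below. The argument names and defaults here are hypothetical and are not taken from the provided `train.py`.

```python
import argparse
from pathlib import Path


def build_parser():
    """Create a parser whose every option has a usable default."""
    parser = argparse.ArgumentParser(description="Train the segmentation network.")
    # Hypothetical options: every one can be omitted on the command line.
    parser.add_argument("--data-dir", type=Path, default=Path("dataset"),
                        help="dataset folder, relative to the working directory")
    parser.add_argument("--epochs", type=int, default=50)
    parser.add_argument("--lr", type=float, default=1e-3)
    return parser


# Example: parsing an empty argument list falls back to the defaults.
args = build_parser().parse_args([])
```

With this pattern the script runs with no arguments at all, and graders can still override individual values, for example `--epochs 5`, without editing any file.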

<p style="background-color:#ddd; padding:5px">
<b>NOTE:</b> 
The size of your submission file must <b>NOT</b> exceed <b>100MB</b>.
</p>

The file tree of your submission <b>SHOULD</b> resemble this:

```
GroupGROUPNUM_p2.zip
├── assets
├── part1
|   └── code files (do not submit environment folder)
├── part2
|   ├── network.py
|   ├── train.py
|   ├── turing.sh
|   ├── loadParam.py
|   ├── dataloader.py
|   ├── utils.py
|   └── Any other code files
├── part3
|   └── code files
├── Report.pdf 
├── main_notebook.ipynb
├── part2.mp4
├── part3.mp4
└── README.md
```


## 6. Tips, tricks and hints

Generating data using Blender can be fun and exciting, and equally frustrating. To ease this process, we are adding some tips, tricks and hints as a series of questions and answers: <br><br>

<b> 1. How do I obtain a segmentation mask of the object in the scene? Can I do this for multiple objects together? </b><br>
To obtain the segmentation mask, you will use something called the `Pass Index` in Blender. For each object, go to `Object Properties > Relations > Pass Index` and set this to a unique number. Now, in your compositing tab, you can use the `ID Mask` node and look for this `Index` and write that into a file using the `File Output` node. You can do this for as many objects as you want as long as they all have a unique `Pass Index`. This can give you the closest window mask even if you have multiple windows in the scene. Note that this will only work if you have enabled `Object Index` in `View Layer Properties > Passes > Data > Indexes` tab. More details are in this [YouTube video](https://www.youtube.com/watch?v=o2JKviMX9rE).<br><br>



<!-- <b> 2. How do I obtain the locations of the corners if I want to train my network to predict corners? </b><br>
One way is to mark the corners in 3D on the window you crafted. Now, to obtain the image locations on the image, you will use the projection equation to know where each corner lies. This is going to be the most accurate way to obtain corners and you can easily do this for as many points as you want. If you want an easier but less accurate method, here it is: you can create transparent objects (such as a circle or square or anything your heart desires and setting the `Material` to `Principles BSDF` shader and `Alpha` parameter to 0) at the corner locations and have a rigid relation with the window (if you changing window pose). Now assign a unique `Pass Index` to each of these transparent objects and obtain the mask. However, transparent objects are not rendered in the `Pass Index` by default. To enable this, you will need to enable `Object Properties > Visibility > Holdout` for the particular transparent object.<br><br> -->

<!-- <b> 3. What are we trying to predict? Can you talk about the input and output of the network? </b><br>
The overall goal of the project is to obtain 3D pose of the window from 2D image(s). You are free to use one or more neural networks for any part of the perception stack. You input could be one or more images (temporally) and the output could be either a segmentation mask of the window or some corner points or the 3D pose. These are all design choices you are free to make. Your architecture will vary based on the choice of input and output. Generally, breaking down the problem into sub-problems makes it easier for the network to be trained and generalized from sim2real better.<br><br> -->
<!-- 
<b> 4. Is it advisable to predict segmentations or the corners for the windows?  </b><br>
Both have pros and cons. Predictions of segmentations might generalize better and you get more data to work with, but this means that you need a better post-processing step. Corner predictions might be harder to train but are easier to interpret at the end. You might want to think about predicting corners more carefully, predicting directly the pixel co-ordinates is generally not advisable due to large numbers, you might want to normalize this with respect to the center of the image (here, image varies from -1 to 1 in each X and Y directions, with (0,0) being in the center). Also, be cognizant of the loss function you are using in both cases. Furthermore, think carefully about how you'll find the closest window if there are multiple windows in the frame. Also think about how you will find the window if one of the corners or parts of the window are occluded or not detected.<br><br> -->

<b> 2. How many images do I need to train on? </b><br>
This depends heavily on the number of parameters in your network. The larger your network, the more data you will need. For example, a 10MB ResNet-inspired model needs a minimum of 10K to 100K images with large variations for good generalization. The rule of thumb is that if you increase the model size by a factor of $N$, then you need to increase the amount of data by a factor of $N^2$. A trick used for better generalization is to train the network in simulation and then fine-tune on real data (this can be as little as 1/10th the amount of simulation data).<br><br>
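
The quadratic rule of thumb is easy to turn into a quick estimate. The helper below is only a back-of-the-envelope sketch with a hypothetical name, and the baseline numbers in the example are illustrative.

```python
def images_needed(base_images: int, size_factor: float) -> int:
    """Scale the dataset size with the square of the model-size factor."""
    return round(base_images * size_factor ** 2)


# Example: a model twice as large as one that needed 10,000 images.
estimate = images_needed(10_000, 2)   # 10,000 * 2^2 = 40,000 images
```

Treat the result as a starting point for planning render time, not as a guarantee of good generalization.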


<b> 3. The data generation is super slow. What can I do? </b><br>
You can trade photorealism for speed and render in Material Preview mode. A network trained on these images will not generalize as well, but you can train it first on Material Preview images and then fine-tune on more photorealistic images. This is generally called curriculum learning.<br><br>


<b> 4. Is there a trick/hack to get photorealistic data faster? Do you have any tips for better sim2real transfer? </b><br>
One way is to reconstruct the real scene/window in a dense manner using photogrammetry tools like <a href="http://ccwu.me/vsfm/index.html">Visual SfM</a> or newer tools like <a href="https://github.com/kakaobrain/nerf-factory">NeRF</a> or <a href="https://poly.cam/gaussian-splatting">Gaussian Splats</a>. If you have an iPhone or an iPad with a LiDAR sensor, you can use apps like <a href="https://poly.cam/">Polycam</a> or <a href="https://lumalabs.ai/">LumaAI</a> or <a href="https://www.kiriengine.com/">Kiri Engine</a> (also works on Android without a LiDAR) to reconstruct the scene. You can import these <a href="https://www.youtube.com/watch?v=kwpj7ZUtnac">point clouds</a> or meshes into Blender to obtain "photo-realistic data" for "free". This approach, coupled with Blender simulation data, would most likely lead to great sim2real generalization. <br><br>

You can also download an example Blender file to play with the above things from <a href="./assets/ObjectIndex.blend">here</a>.

<a name='coll'></a>
