Config is not fully applied #6

Closed
kwea123 opened this issue Apr 18, 2020 · 12 comments

@kwea123 commented Apr 18, 2020

For example, here:

nerf-pytorch/train_nerf.py

Lines 117 to 123 in a14357d

model_coarse = getattr(models, cfg.models.coarse.type)(
    num_encoding_fn_xyz=cfg.models.coarse.num_encoding_fn_xyz,
    num_encoding_fn_dir=cfg.models.coarse.num_encoding_fn_dir,
    include_input_xyz=cfg.models.coarse.include_input_xyz,
    include_input_dir=cfg.models.coarse.include_input_dir,
    use_viewdirs=cfg.models.coarse.use_viewdirs,
)

The hidden_size in the config is never applied, which means it is always set to the default value of 128 no matter what you put in the config file. I suspect there are other places where the config is not fully applied, which means your implementation might differ significantly from the original repo.
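
For reference, a minimal sketch of what applying the value might look like (assuming the constructor accepts a hidden_size argument, as its 128 default suggests, and that the config defines it):

model_coarse = getattr(models, cfg.models.coarse.type)(
    num_encoding_fn_xyz=cfg.models.coarse.num_encoding_fn_xyz,
    num_encoding_fn_dir=cfg.models.coarse.num_encoding_fn_dir,
    include_input_xyz=cfg.models.coarse.include_input_xyz,
    include_input_dir=cfg.models.coarse.include_input_dir,
    use_viewdirs=cfg.models.coarse.use_viewdirs,
    # Sketch: forward the config value instead of falling back to the
    # constructor's default of 128.
    hidden_size=cfg.models.coarse.hidden_size,
)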

I was astonished at first by the speed you mention in the readme, but now it seems that your model is not the same as the original repo's: yours is smaller (hidden_size=128 in contrast to the original 256), so no wonder your speed is higher!

Could you please carefully check again, or mention in the readme that you use a smaller model? Otherwise it is misleading.

For reference, the readme claims:

The code is thoroughly tested (to the best of my abilities) to match the original implementation (and be much faster)! In particular, I have ensured that

Every individual module exactly (numerically) matches that of the TensorFlow implementation. This Colab notebook has all the tests, matching op for op (but is very scratchy to look at)!
Training works as expected (for Lego and LLFF scenes).

@krrish94 (Owner)

Hi @kwea123,

Great catch! What you say is true for the FlexibleNeRFModel (whose default hidden_size is 128). You could replace it with PaperNeRFModel, which initializes the NeRF model with a default hidden_size of 256.

On my tests with the PaperNeRFModel (see this), I get an average speed of 14.2 it/s (on an RTX2080Ti, PyTorch 1.4.0, CUDA 9.2).

In comparison, under the same setup, this (awesome!) PyTorch port runs at 5.7 it/s, and the TensorFlow release (in a different environment) runs at 3.44 it/s, both with three fewer layers than my experiment.

As for not applying hidden_size in the config, I'm unsure when this changed. As you can probably see, I applied quite a few changes over the last 72 hours, which might've broken a few things along the way :)

But yes, many thanks for flagging this! I know benchmarking can be quite arbitrary and depends on several underlying factors (software, hardware, and more). I'll leave this issue open for that reason.

(Also, the test cases that compare the TF impl to mine function-by-function are all here :) )

@kwea123 (Author) commented Apr 18, 2020

OK, I changed the config to use PaperNeRFModel, but there is currently a bug.

nerf-pytorch/nerf/models.py

Lines 164 to 165 in a14357d

xyz, dirs = x[..., : self.dim_xyz], x[..., self.dim_xyz :]
for i in range(8):

You forgot to add x = xyz between these lines, so the x in the following for loop is not the correct input. After adding that line I can run the code successfully, but on my setup (also an RTX2080Ti, PyTorch 1.4.0, CUDA 10.2) I only get 8.5 it/s (versus 6.8 it/s for the other PyTorch impl and 5.64 it/s for the original repo). Anyway, thanks for the contribution; I hope you can fix these scattered bugs in the future.

kwea123 closed this as completed Apr 18, 2020
kwea123 reopened this Apr 18, 2020
@krrish94 (Owner) commented Apr 18, 2020

@kwea123 I thought I fixed this, my bad :)

It should in fact be

x = self.relu(self.layers_xyz[0](xyz))
for i in range(1, 8):
    if i == 4:
        x = self.layers_xyz[i](torch.cat((xyz, x), -1))
    else:
        x = self.layers_xyz[i](x)

I'll merge this in a couple of days, but if it's urgent, feel free to open a PR.

@krrish94 (Owner)

Also, I'd be curious to help debug further. Did you run cache_dataset.py to precompute and cache rays?

@kwea123 (Author) commented Apr 18, 2020

I didn't run that. I just wanted to see the speed.

@krrish94 (Owner)

In my experiments, caching contributed to a bump of about +4 it/s.

@kwea123 (Author) commented Apr 18, 2020

As I understand it, caching is only useful for the ray_bundle, is that correct? The other values are just scalars and don't need much computation, so I think caching them gives only a small speedup.

Moreover, do we even need to store them on disk? Why not just keep them in RAM before training starts? For example, the lego half-resolution data takes only 100 x 400 x 400 x 8 x 4 bytes = 0.47 GB, which is feasible.
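
As a back-of-the-envelope check of that figure (assuming 100 views at 400x400 with 8 float32 values per ray, e.g. origin, direction, and near/far bounds):

num_views, height, width = 100, 400, 400
floats_per_ray, bytes_per_float = 8, 4
total_bytes = num_views * height * width * floats_per_ray * bytes_per_float
print(total_bytes / 1024 ** 3)  # ~0.477 GiB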

@krrish94 (Owner)

I had the same assumption, which is why I refrained from implementing caching when I initially put this repo out :)

You can load it all into RAM, but random sampling at each iteration seems to take longer than "sampling beforehand and caching to disk". In my case, I sample about 500 random variants of 1024 samples each per view, and that becomes hard to store in RAM, even more so for LLFF data.
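
A hypothetical sketch of that "pre-sample and cache to disk" idea (this is not the repo's actual cache_dataset.py; ray_bundle stands in for whatever per-view ray tensor the codebase builds):

import os
import torch

def cache_random_ray_batches(ray_bundle, view_idx, num_variants=500,
                             rays_per_batch=1024, out_dir="cache"):
    # ray_bundle: (H*W, C) tensor of rays for a single view.
    os.makedirs(out_dir, exist_ok=True)
    num_rays = ray_bundle.shape[0]
    for variant in range(num_variants):
        # Pre-draw one random subset of rays and save it as its own file,
        # so the training loop only has to torch.load a ready-made batch.
        select = torch.randperm(num_rays)[:rays_per_batch]
        torch.save(ray_bundle[select],
                   os.path.join(out_dir, f"view{view_idx:03d}_{variant:04d}.pt"))

This trades disk space for skipping the per-iteration random gather over the full ray set, which is the part reported above as slower.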

@kwea123 (Author) commented Apr 18, 2020

Do you have a profiling breakdown? How much time do data loading, the model's forward pass, the backward pass, etc. take?

@krrish94 (Owner)

I did, when I profiled. But I must admit I didn't religiously log most of it. However, I jotted down a few timing details in my journal about how long particular modules took before vs. after optimization.

Before I began profiling, I did a few obvious optimization steps (like ensuring all tensors are on CUDA, vectorizing ops, etc.). After that, one iteration took 0.03 seconds.

(The notion of an iteration is the same as in the original NeRF code release, i.e., up to the point where the learning rate is updated for the next iteration. Also note that the time reported by the tqdm progress bar includes logging to TensorBoard, etc.)

It was very hard to decipher running times from most of my notes (sorry!), but I've been able to recover a few.

Time to run one iteration (before opt): 0.06 seconds

Fetch one batch of data (randomly sample, no caching): 0.010634 sec
Fetch one batch of data (randomly sample, load from cache): 0.00413 sec

Positional encoding (with a for loop): 0.00157 sec
Positional encoding (with list comprehension): 0.00150 sec (no improvement)

Run network (with a for loop): 0.002 sec
Run network (with list comprehension): 0.002 sec (no improvement)

Predict volume density and render (initial): 0.05 sec
Predict volume density and render (optimized impl): 0.035 sec

Ensure tensors don't move across CPU and GPU (see the sketch after these timings):
sample_pdf (before optimizing): 0.09 sec
sample_pdf (after optimizing): 0.0002 sec!

run_one_iter (after optimizing): 0.036 sec
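
(As an aside, a generic illustration of the "don't move tensors across CPU and GPU" point; this is my sketch, not the repo's actual sample_pdf:)

import torch

def draw_uniform_samples(cdf: torch.Tensor, num_samples: int) -> torch.Tensor:
    # Bad:  torch.rand(num_samples) lands on the CPU and forces implicit copies.
    # Good: inherit device and dtype from an existing GPU tensor so nothing
    # silently bounces between host and device inside the hot loop.
    return torch.rand(list(cdf.shape[:-1]) + [num_samples],
                      device=cdf.device, dtype=cdf.dtype)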

One big caveat: I haven't been able to recall the exact config params that I used.

Posting this here in case it helps people see where further optimization is possible :)
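
A minimal sketch of how such per-module timings can be collected (an illustration rather than the exact code I used); CUDA kernels run asynchronously, so you have to synchronize before reading the clock:

import time
import torch

def timed(label, fn, *args, **kwargs):
    # Synchronize before and after so asynchronous CUDA work is included in
    # the measurement; otherwise the numbers can look misleadingly small.
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.perf_counter()
    out = fn(*args, **kwargs)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    print(f"{label}: {time.perf_counter() - start:.5f} sec")
    return out

# Example usage inside the training loop (function names are placeholders):
#   encoded = timed("positional encoding", encode_position, pts)
#   raw = timed("run network", model_coarse, encoded)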

@yenchenlin commented Apr 21, 2020

@krrish94 num_fine: 64 is also wrong; it should be num_fine: 128. I get benchmark results very similar to @kwea123's on my machine. There might be other hyperparameter differences that cause the runtime gap, but I don't have time to check.

To compare runtime fairly, what I did was make sure the forward/backward results in the training loop exactly match the TF implementation for a few iterations. I think that's probably the only way to rule out hyperparameter mismatches.
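
A hypothetical sketch of that parity check (file and variable names below are placeholders): dump a few tensors from the TF training loop with np.save, then compare this repo's tensors against them at the same iterations.

import numpy as np
import torch

def assert_matches_reference(tensor: torch.Tensor, reference_path: str,
                             atol: float = 1e-5) -> None:
    # Compare a PyTorch tensor against a NumPy array saved from the TF run.
    reference = np.load(reference_path)
    ours = tensor.detach().cpu().numpy()
    if not np.allclose(ours, reference, atol=atol):
        max_err = np.abs(ours - reference).max()
        raise AssertionError(f"{reference_path}: max abs diff {max_err:.2e}")

# e.g. at iteration 0 of the training loop:
#   assert_matches_reference(rgb_coarse, "tf_rgb_coarse_iter0.npy")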

Thanks for the work though!

@kwea123 (Author) commented Aug 2, 2021

Bug not fixed, speed is questionable.
Closing.

kwea123 closed this as completed Aug 2, 2021