Some questions about the reconstruction from pictures of real scene #72

Closed
sujuyu opened this issue Apr 28, 2021 · 8 comments

Comments

@sujuyu

sujuyu commented Apr 28, 2021

Dear Dr. Xie:
 Your work is amazing, and I have recently been reproducing it and trying to do some simple visualization and testing.
 When I tried to reconstruct from images of real scenes, some things happened that I could not explain, so I can only bother you in the hope of getting a reasonable explanation.
 First of all, your paper only discusses reconstruction from a single real picture. At first, I also tested with single images on a pure background and got satisfactory results in some chair scenes.
[images: input chair photos and their reconstructions]
 For some chairs, however, the results differ considerably from what I expected.
[images: a chair photo and its less accurate reconstruction]
 But in other scenarios the results are less than satisfactory. For example, reconstructing a cup:
[images: a cup photo and its reconstruction]
 Could this be related to the training dataset not being comprehensive enough?
 I then tried testing with multiple input images of real chairs:
 I fixed the camera position and used a rotatable chair: for each shot I only rotated the chair without moving the camera, so the distance between the chair and the camera never changed. In addition, I took a picture roughly every 30 degrees and processed the pictures to ensure a pure background, as follows:
[the processed input photos of the chair, taken roughly every 30 degrees]

 However, the output of the model is not satisfactory:
[image: reconstruction from all input photos]
 I then tried reducing the number of input images and found that, roughly speaking, the reconstruction results got better as the number of input images decreased. For example:
[images: reconstructions obtained with fewer input photos]
 My guess is that the context-aware fusion module is not working properly, which causes this problem. Therefore, I would like to ask Dr. Xie: if I want the context fusion module to work as intended with multiple input images, what requirements must the input images meet?
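For concreteness, here is a minimal sketch of how several real photos can be stacked into one multi-view sample. It assumes an encoder, decoder, and merger have already been built and loaded from the pretrained weights, together with a `test_transforms` pipeline, as in the single-view testing snippet later in this thread; the file names are placeholders.

    import cv2
    import numpy as np
    import torch

    import utils.network_utils

    # Hypothetical file names for the processed photos of the rotating chair
    view_paths = ['chair_000.png', 'chair_030.png', 'chair_060.png']
    views = [cv2.imread(p, cv2.IMREAD_UNCHANGED).astype(np.float32) / 255. for p in view_paths]

    # All views of one object form a single sample: [n_views, H, W, C]
    sample = np.asarray(views)
    rendering_images = test_transforms(rendering_images=sample)
    rendering_images = rendering_images.unsqueeze(0)  # [1, n_views, C, H, W]

    with torch.no_grad():
        rendering_images = utils.network_utils.var_or_cuda(rendering_images)
        image_features = encoder(rendering_images)
        raw_features, coarse_volumes = decoder(image_features)
        # The context-aware fusion (merger) weights the per-view coarse volumes;
        # without it, a plain mean over the view dimension would be used instead.
        fused_volume = merger(raw_features, coarse_volumes)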
These are my questions, which can be summarized as follows:

  1. How can the deterioration of the reconstruction results as the number of input images increases be explained?
  2. What requirements must the input images meet (such as shooting angle and distance) to obtain the expected good results from multiple real images and let the context fusion module play its role?
  3. Finally, it would be great if I could get a more convenient way to contact you, such as WeChat.


@sirish-gambhira

Hello @sujuyu, @hzxie

I am trying to reproduce the results of the Pix2Vox paper. I am using the following configuration to test the data.

    # Imports needed to run this snippet standalone; module paths follow the Pix2Vox repository layout
    import os
    from datetime import datetime as dt

    import cv2
    import numpy as np
    import torch

    import utils.binvox_visualization
    import utils.data_transforms
    import utils.network_utils
    from config import cfg
    from models.encoder import Encoder
    from models.decoder import Decoder
    from models.refiner import Refiner
    from models.merger import Merger

    epoch_idx = 1
    # Enable the inbuilt cudnn auto-tuner to find the best algorithm to use
    torch.backends.cudnn.benchmark = True
    encoder = Encoder(cfg)
    decoder = Decoder(cfg)
    refiner = Refiner(cfg)
    merger = Merger(cfg)

    if torch.cuda.is_available():
        encoder = torch.nn.DataParallel(encoder).cuda()
        decoder = torch.nn.DataParallel(decoder).cuda()
        refiner = torch.nn.DataParallel(refiner).cuda()
        merger = torch.nn.DataParallel(merger).cuda()

    print('[INFO] %s Loading weights from %s ...' % (dt.now(), cfg.CONST.WEIGHTS))
    checkpoint = torch.load(cfg.CONST.WEIGHTS)
    epoch_idx = checkpoint['epoch_idx']
    encoder.load_state_dict(checkpoint['encoder_state_dict'])
    decoder.load_state_dict(checkpoint['decoder_state_dict'])
    if cfg.NETWORK.USE_REFINER:
        refiner.load_state_dict(checkpoint['refiner_state_dict'])
    if cfg.NETWORK.USE_MERGER:
        merger.load_state_dict(checkpoint['merger_state_dict'])

    # Switch all sub-networks to evaluation mode
    encoder.eval()
    decoder.eval()
    refiner.eval()
    merger.eval()

    # Load a single test image and wrap it as a one-view sample
    img1_path = '/content/Pix2Vox/cup2.png'
    img1_np = cv2.imread(img1_path, cv2.IMREAD_UNCHANGED).astype(np.float32) / 255.
    sample = np.array([img1_np])

    IMG_SIZE = cfg.CONST.IMG_H, cfg.CONST.IMG_W
    CROP_SIZE = cfg.CONST.CROP_IMG_H, cfg.CONST.CROP_IMG_W
    test_transforms = utils.data_transforms.Compose([
        utils.data_transforms.CenterCrop(IMG_SIZE, CROP_SIZE),
        utils.data_transforms.RandomBackground(cfg.TEST.RANDOM_BG_COLOR_RANGE),
        utils.data_transforms.Normalize(mean=cfg.DATASET.MEAN, std=cfg.DATASET.STD),
        utils.data_transforms.ToTensor(),
    ])

    rendering_images = test_transforms(rendering_images=sample)
    rendering_images = rendering_images.unsqueeze(0)

    with torch.no_grad():
        # Get data from data loader
        rendering_images = utils.network_utils.var_or_cuda(rendering_images)

        # Test the encoder, decoder, refiner and merger
        image_features = encoder(rendering_images)
        raw_features, generated_volume = decoder(image_features)

        if cfg.NETWORK.USE_MERGER and epoch_idx >= cfg.TRAIN.EPOCH_START_USE_MERGER:
            print("Using Merger and Refiner")
            generated_volume = merger(raw_features, generated_volume)
        else:
            generated_volume = torch.mean(generated_volume, dim=1)

        if cfg.NETWORK.USE_REFINER and epoch_idx >= cfg.TRAIN.EPOCH_START_USE_REFINER:
            generated_volume = refiner(generated_volume)

        generated_volume = generated_volume.squeeze(0)

        # Save three orthogonal views of the predicted voxel grid
        img_dir = './sample_images'
        gv = generated_volume.cpu().numpy()
        rendering_views = utils.binvox_visualization.get_volume_views(gv, os.path.join(img_dir),
                                                                      epoch_idx)
and the following settings in config.py

        __C.NETWORK.USE_REFINER                     = True
        __C.NETWORK.USE_MERGER                      = True
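
One hedged sanity check worth adding (not in the original snippet): the merger and refiner are only applied when the epoch index stored in the checkpoint has passed the corresponding EPOCH_START_* thresholds, so printing the gate values confirms which branch actually runs.

    print('checkpoint epoch_idx:', epoch_idx)
    print('merger branch active:',
          cfg.NETWORK.USE_MERGER and epoch_idx >= cfg.TRAIN.EPOCH_START_USE_MERGER)
    print('refiner branch active:',
          cfg.NETWORK.USE_REFINER and epoch_idx >= cfg.TRAIN.EPOCH_START_USE_REFINER)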

I am able to reconstruct aeroplane images as given in #28, but I am not able to reconstruct chairs as well as you did. The following are the outputs.
Input
aero2
Output
aeroplaneVoxel

Input
chair4
Output
Chair4Voxel with aero config

However, I am able to reconstruct the above image in a better way, by changing both __C.NETWORK.USE_REFINER and __C.NETWORK.USE_MERGER to False.
Chair2Voxel

I would be grateful if you could kindly let me know how you reconstructed the above results. Kindly let me know of any changes you made in the network or configuration. Thank you for your time and consideration. I am available at sirishgam001@gmail.com.
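
As a hedged follow-up to the snippet above (not part of the original post): if a ground-truth .binvox of the object is available, the predicted volume `gv` can be scored with the usual voxel IoU. This assumes the repository's utils/binvox_rw.py reader is importable; the file path and threshold are placeholders.

    import numpy as np
    import utils.binvox_rw

    with open('./ground_truth.binvox', 'rb') as f:   # placeholder path
        gt_volume = utils.binvox_rw.read_as_3d_array(f).data.astype(np.float32)

    threshold = 0.4                                  # placeholder occupancy threshold
    pred = (gv >= threshold).astype(np.float32)
    intersection = np.sum(pred * gt_volume)
    union = np.sum(np.clip(pred + gt_volume, 0, 1))
    print('IoU@%.1f = %.4f' % (threshold, intersection / union))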

@sujuyu
Author

sujuyu commented Aug 30, 2021


Hello @sirish-gambhira,
When testing the chair images above, I didn't make any changes to the network parameters.
In my tests, the quality of this model's reconstructions of real-scene photos varies greatly. In the vast majority of cases the results are unsatisfactory, and the few notable reconstructions were all of chairs.
I suspect this has something to do with the complex textures and fuzzy edges of real objects.

@sirish-gambhira

Hey @sujuyu,

Thank you for your prompt response. I want to know if I am missing anything in reproducing your above results (for single-view images). Your results are much better compared to mine. Do I have to change anything in config.py (or elsewhere) or are the results differing only because of randomness? Thank you for your time.

@sujuyu
Author

sujuyu commented Aug 31, 2021


I didn't change any parameters in config.py. The reconstruction result for a given image is deterministic. Maybe you need to retrain the network model yourself; training takes about 28 hours on an NVIDIA 1080 Ti.

@sirish-gambhira

Hello @sujuyu

Could you kindly let me know which dataset you used to train Pix2Vox to generate the above results? Is it ShapeNet or Pix3D? Thank you for your time.

@sujuyu
Author

sujuyu commented Sep 9, 2021


Hello, I only used ShapeNet.

@xphn

xphn commented Oct 24, 2022

Hi, thank you so much for the great work. I am having some difficulty testing your work on the Pix3D dataset. The instructions mention that we need to set the path to the binvox files; however, no binvox files come with the Pix3D dataset. The following is the change I made in the config file:

__C.DATASETS.PIX3D.VOXEL_PATH = 'C:/Users/peng/Desktop/pix2vox/datasets/pix3d/model/%s/%s/%s.binvox'

Thank you very much

@hzxie
Owner

hzxie commented Nov 2, 2023

@xphn
You can voxelize the 3D meshes downloaded from http://pix3d.csail.mit.edu/data/pix3d_full.zip
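
A minimal voxelization sketch, not the authors' pipeline: it assumes the `trimesh` package is installed and that the repository's utils/binvox_rw.py helper (or any other binvox writer) is available; the paths are placeholders and the 32³ resolution follows the usual Pix2Vox setup, so adjust both to your config.

    import numpy as np
    import trimesh
    from utils import binvox_rw

    N = 32  # voxel resolution used by Pix2Vox

    # Placeholder paths; loop over the Pix3D model directories in practice
    mesh = trimesh.load('datasets/pix3d/model/chair/SOME_MODEL/model.obj', force='mesh')

    # Pick a pitch so the longest side of the mesh spans roughly N voxels,
    # then pad/crop the resulting grid to exactly N x N x N.
    pitch = mesh.extents.max() / N
    grid = mesh.voxelized(pitch).matrix.astype(bool)
    fixed = np.zeros((N, N, N), dtype=bool)
    crop = tuple(slice(0, min(s, N)) for s in grid.shape)
    fixed[crop] = grid[crop]

    with open('datasets/pix3d/model/chair/SOME_MODEL/model.binvox', 'wb') as f:
        binvox_rw.Voxels(fixed, (N, N, N), (0, 0, 0), 1.0, 'xyz').write(f)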

@hzxie hzxie closed this as completed Nov 2, 2023