Some questions about the reconstruction from pictures of real scene #72

Closed
sujuyu opened this issue Apr 28, 2021 · 8 comments

Comments

@sujuyu

sujuyu commented Apr 28, 2021

Dear Dr. Xie:
 Your work is amazing, and I have recently been reproducing it and trying to do some simple visualization and testing.
 When I tried to reconstruct from images of real scenes, some things happened that I could not explain, so I can only bother you in the hope of getting a reasonable explanation.
 First of all, your paper only discusses reconstruction from a single real picture. At first, I also tested with single images on a pure background and got satisfactory results in some chair scenes.
[images: input chair photos and their reconstructions]
 For some chairs, however, the results differ considerably from what I expected.
[images: a chair photo and its less accurate reconstruction]
 But in other scenarios the results are less than satisfactory. For example, reconstructing a cup:
[images: a cup photo and its reconstruction]
 Could this be related to the training dataset not being comprehensive enough?
 I then tried testing with multiple input images of real chairs:
 I fixed the camera position and used a rotatable chair: for each shot I only rotated the chair without moving the camera, so the distance between the chair and the camera never changed. In addition, I took a picture roughly every 30 degrees and processed the pictures to ensure a pure background, as follows:
[the processed input photos of the chair, taken roughly every 30 degrees]

 However, the output of the model is not satisfactory:
[image: reconstruction from all input photos]
 I then tried reducing the number of input images and found that, roughly speaking, the reconstruction results got better as the number of input images decreased. For example:
[images: reconstructions obtained with fewer input photos]
 My guess is that the context-aware fusion module is not working properly, which causes this problem. Therefore, I would like to ask Dr. Xie: if I want the context fusion module to work as intended with multiple input images, what requirements must the input images meet?
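For concreteness, here is a minimal sketch of how several real photos can be stacked into one multi-view sample. It assumes an encoder, decoder, and merger have already been built and loaded from the pretrained weights, together with a `test_transforms` pipeline, as in the single-view testing snippet later in this thread; the file names are placeholders.

    import cv2
    import numpy as np
    import torch

    import utils.network_utils

    # Hypothetical file names for the processed photos of the rotating chair
    view_paths = ['chair_000.png', 'chair_030.png', 'chair_060.png']
    views = [cv2.imread(p, cv2.IMREAD_UNCHANGED).astype(np.float32) / 255. for p in view_paths]

    # All views of one object form a single sample: [n_views, H, W, C]
    sample = np.asarray(views)
    rendering_images = test_transforms(rendering_images=sample)
    rendering_images = rendering_images.unsqueeze(0)  # [1, n_views, C, H, W]

    with torch.no_grad():
        rendering_images = utils.network_utils.var_or_cuda(rendering_images)
        image_features = encoder(rendering_images)
        raw_features, coarse_volumes = decoder(image_features)
        # The context-aware fusion (merger) weights the per-view coarse volumes;
        # without it, a plain mean over the view dimension would be used instead.
        fused_volume = merger(raw_features, coarse_volumes)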
These are my questions, which can be summarized as follows:

  1. How can the deterioration of the reconstruction results as the number of input images increases be explained?
  2. What requirements must the input images meet (such as shooting angle and distance) to obtain the expected good results from multiple real images and let the context fusion module play its role?
  3. Finally, it would be great if I could get a more convenient way to contact you, such as WeChat.


@sirish-gambhira

Hello @sujuyu, @hzxie

I am trying to reproduce the results of the Pix2Vox paper. I am using the following configuration to test the data.

    # Imports needed to run this snippet standalone; module paths follow the Pix2Vox repository layout
    import os
    from datetime import datetime as dt

    import cv2
    import numpy as np
    import torch

    import utils.binvox_visualization
    import utils.data_transforms
    import utils.network_utils
    from config import cfg
    from models.encoder import Encoder
    from models.decoder import Decoder
    from models.refiner import Refiner
    from models.merger import Merger

    epoch_idx = 1
    # Enable the inbuilt cudnn auto-tuner to find the best algorithm to use
    torch.backends.cudnn.benchmark = True
    encoder = Encoder(cfg)
    decoder = Decoder(cfg)
    refiner = Refiner(cfg)
    merger = Merger(cfg)

    if torch.cuda.is_available():
        encoder = torch.nn.DataParallel(encoder).cuda()
        decoder = torch.nn.DataParallel(decoder).cuda()
        refiner = torch.nn.DataParallel(refiner).cuda()
        merger = torch.nn.DataParallel(merger).cuda()

    print('[INFO] %s Loading weights from %s ...' % (dt.now(), cfg.CONST.WEIGHTS))
    checkpoint = torch.load(cfg.CONST.WEIGHTS)
    epoch_idx = checkpoint['epoch_idx']
    encoder.load_state_dict(checkpoint['encoder_state_dict'])
    decoder.load_state_dict(checkpoint['decoder_state_dict'])
    if cfg.NETWORK.USE_REFINER:
        refiner.load_state_dict(checkpoint['refiner_state_dict'])
    if cfg.NETWORK.USE_MERGER:
        merger.load_state_dict(checkpoint['merger_state_dict'])

    # Switch all sub-networks to evaluation mode
    encoder.eval()
    decoder.eval()
    refiner.eval()
    merger.eval()

    # Load a single test image and wrap it as a one-view sample
    img1_path = '/content/Pix2Vox/cup2.png'
    img1_np = cv2.imread(img1_path, cv2.IMREAD_UNCHANGED).astype(np.float32) / 255.
    sample = np.array([img1_np])

    IMG_SIZE = cfg.CONST.IMG_H, cfg.CONST.IMG_W
    CROP_SIZE = cfg.CONST.CROP_IMG_H, cfg.CONST.CROP_IMG_W
    test_transforms = utils.data_transforms.Compose([
        utils.data_transforms.CenterCrop(IMG_SIZE, CROP_SIZE),
        utils.data_transforms.RandomBackground(cfg.TEST.RANDOM_BG_COLOR_RANGE),
        utils.data_transforms.Normalize(mean=cfg.DATASET.MEAN, std=cfg.DATASET.STD),
        utils.data_transforms.ToTensor(),
    ])

    rendering_images = test_transforms(rendering_images=sample)
    rendering_images = rendering_images.unsqueeze(0)

    with torch.no_grad():
        # Get data from data loader
        rendering_images = utils.network_utils.var_or_cuda(rendering_images)

        # Test the encoder, decoder, refiner and merger
        image_features = encoder(rendering_images)
        raw_features, generated_volume = decoder(image_features)

        if cfg.NETWORK.USE_MERGER and epoch_idx >= cfg.TRAIN.EPOCH_START_USE_MERGER:
            print("Using Merger and Refiner")
            generated_volume = merger(raw_features, generated_volume)
        else:
            generated_volume = torch.mean(generated_volume, dim=1)

        if cfg.NETWORK.USE_REFINER and epoch_idx >= cfg.TRAIN.EPOCH_START_USE_REFINER:
            generated_volume = refiner(generated_volume)

        generated_volume = generated_volume.squeeze(0)

        # Save three orthogonal views of the predicted voxel grid
        img_dir = './sample_images'
        gv = generated_volume.cpu().numpy()
        rendering_views = utils.binvox_visualization.get_volume_views(gv, os.path.join(img_dir),
                                                                      epoch_idx)
and the following settings in config.py

        __C.NETWORK.USE_REFINER                     = True
        __C.NETWORK.USE_MERGER                      = True
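
One hedged sanity check worth adding (not in the original snippet): the merger and refiner are only applied when the epoch index stored in the checkpoint has passed the corresponding EPOCH_START_* thresholds, so printing the gate values confirms which branch actually runs.

    print('checkpoint epoch_idx:', epoch_idx)
    print('merger branch active:',
          cfg.NETWORK.USE_MERGER and epoch_idx >= cfg.TRAIN.EPOCH_START_USE_MERGER)
    print('refiner branch active:',
          cfg.NETWORK.USE_REFINER and epoch_idx >= cfg.TRAIN.EPOCH_START_USE_REFINER)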

I am able to reconstruct aeroplane images as given in #28, but I am not able to reconstruct chairs as well as you did. The following are the outputs.
Input
aero2
Output
aeroplaneVoxel

Input
chair4
Output
Chair4Voxel with aero config

However, I am able to reconstruct the above image in a better way, by changing both __C.NETWORK.USE_REFINER and __C.NETWORK.USE_MERGER to False.
Chair2Voxel

I would be grateful if you could kindly let me know how you reconstructed the above results. Kindly let me know of any changes you made in the network or configuration. Thank you for your time and consideration. I am available at sirishgam001@gmail.com.
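
As a hedged follow-up to the snippet above (not part of the original post): if a ground-truth .binvox of the object is available, the predicted volume `gv` can be scored with the usual voxel IoU. This assumes the repository's utils/binvox_rw.py reader is importable; the file path and threshold are placeholders.

    import numpy as np
    import utils.binvox_rw

    with open('./ground_truth.binvox', 'rb') as f:   # placeholder path
        gt_volume = utils.binvox_rw.read_as_3d_array(f).data.astype(np.float32)

    threshold = 0.4                                  # placeholder occupancy threshold
    pred = (gv >= threshold).astype(np.float32)
    intersection = np.sum(pred * gt_volume)
    union = np.sum(np.clip(pred + gt_volume, 0, 1))
    print('IoU@%.1f = %.4f' % (threshold, intersection / union))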

@sujuyu
Author

sujuyu commented Aug 30, 2021


Hello @sirish-gambhira,
When testing the chair images above, I didn't make any changes to the network parameters.
In my tests, the quality of this model's reconstructions of real-scene photos varies greatly. In the vast majority of cases the results are unsatisfactory, and the few notable reconstructions were all of chairs.
I suspect this has something to do with the complex textures and fuzzy edges of real objects.

@sirish-gambhira

Hey @sujuyu,

Thank you for your prompt response. I want to know if I am missing anything in reproducing your above results (for single-view images). Your results are much better compared to mine. Do I have to change anything in config.py (or elsewhere) or are the results differing only because of randomness? Thank you for your time.

@sujuyu
Author

sujuyu commented Aug 31, 2021


I didn't change any parameters in config.py. The reconstruction result for a given image is deterministic. Maybe you need to retrain the network model yourself; training takes about 28 hours on an NVIDIA 1080 Ti.

@sirish-gambhira

Hello @sujuyu

Could you kindly let me know which dataset you used to train Pix2Vox to generate the above results? Is it ShapeNet or Pix3D? Thank you for your time.

@sujuyu
Author

sujuyu commented Sep 9, 2021


Hello, I only used ShapeNet.

@xphn

xphn commented Oct 24, 2022

Hi, thank you so much for the great work. I am having some difficulty testing your work on the Pix3D dataset. The instructions mention that we need to set the path to the binvox files; however, no binvox files come with the Pix3D dataset. The following is the change I made in the config file:

__C.DATASETS.PIX3D.VOXEL_PATH = 'C:/Users/peng/Desktop/pix2vox/datasets/pix3d/model/%s/%s/%s.binvox'

Thank you very much

@hzxie
Owner

hzxie commented Nov 2, 2023

@xphn
You can voxelize the 3D meshes downloaded from http://pix3d.csail.mit.edu/data/pix3d_full.zip
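
A minimal voxelization sketch, not the authors' pipeline: it assumes the `trimesh` package is installed and that the repository's utils/binvox_rw.py helper (or any other binvox writer) is available; the paths are placeholders and the 32³ resolution follows the usual Pix2Vox setup, so adjust both to your config.

    import numpy as np
    import trimesh
    from utils import binvox_rw

    N = 32  # voxel resolution used by Pix2Vox

    # Placeholder paths; loop over the Pix3D model directories in practice
    mesh = trimesh.load('datasets/pix3d/model/chair/SOME_MODEL/model.obj', force='mesh')

    # Pick a pitch so the longest side of the mesh spans roughly N voxels,
    # then pad/crop the resulting grid to exactly N x N x N.
    pitch = mesh.extents.max() / N
    grid = mesh.voxelized(pitch).matrix.astype(bool)
    fixed = np.zeros((N, N, N), dtype=bool)
    crop = tuple(slice(0, min(s, N)) for s in grid.shape)
    fixed[crop] = grid[crop]

    with open('datasets/pix3d/model/chair/SOME_MODEL/model.binvox', 'wb') as f:
        binvox_rw.Voxels(fixed, (N, N, N), (0, 0, 0), 1.0, 'xyz').write(f)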

@hzxie hzxie closed this as completed Nov 2, 2023