
Trying to use for metric depth - ZoeDepth sanity check complains #122

Open
michelodu opened this issue Jul 9, 2024 · 21 comments

@michelodu

Hello,

I am trying to use ZoeDepth for metric depth. When I run the ZoeDepth sanity check, it reports "xformers not available" and then fails with:
RuntimeError: Error(s) in loading state_dict for ZoeDepth:
Missing key(s) in state_dict: "core.core.pretrained.cls_token" and various other missing keys.

What am I missing, to fully enable the transformer code?

Best wishes,
Michel

@dung2603

Try using timm==0.6.7

@michelodu
Author

michelodu commented Jul 10, 2024

Hello,
Thanks for your kind reply.
I've just done that, and while there is progress, there still seems to be a mismatch when I run the ZoeDepth sanity check as well as python train_mono.py -m zoedepth -d kitti --pretrained_resource="local::./checkpoints/depth_anything_vitl14.pth":
RuntimeError: Error(s) in loading state_dict for ZoeDepth:
Missing key(s) in state_dict: "core.core.pretrained.cls_token", "core.core.pretrained.pos_embed", "core.core.pretrained.mask_token", "core.core.pretrained.patch_embed.proj.weight",

name: zoe
channels:
  - pytorch
  - nvidia
  - conda-forge
dependencies:
  - cuda=11.7.1
  - h5py=3.7.0
  - hdf5=1.12.2
  - matplotlib=3.6.2
  - matplotlib-base=3.6.2
  - mkl==2024.0
  - numpy=1.24.1
  - opencv=4.6.0
  - pip=22.3.1
  - python=3.9.7
  - pytorch=1.13.1
  - pytorch-cuda=11.7
  - pytorch-mutex=1.0
  - scipy=1.10.0
  - torchaudio=0.13.1
  - torchvision=0.14.1
  - pip:
    - huggingface-hub==0.11.1
    - timm==0.6.7
    - tqdm==4.64.1
    - wandb==0.13.9
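
As a minimal sketch for narrowing this down, the checkpoint can be inspected directly to see what it contains versus what the sanity check expects (the path is just my local one; key names may differ in your copy):

import torch

# Load the checkpoint on the CPU and look at which keys it actually contains.
ckpt = torch.load("./checkpoints/depth_anything_vitl14.pth", map_location="cpu")
state = ckpt.get("model", ckpt)  # some checkpoints nest the weights under a "model" entry
print(len(state), "keys in checkpoint")
for k in list(state)[:10]:
    print(k)
# The load error complains about "core.core.pretrained.*" keys, so check for that prefix.
print(any(k.startswith("core.core.pretrained.") for k in state))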

Cheers,
Michel

P.S.: I don't know if this is relevant or not, but I also see the message: "xFormers not available" as this runs.

@michelodu
Author

michelodu commented Jul 10, 2024

I hard-coded a few parameters to make it easier to run in the debugger, and I'm further along. No idea why they were not being picked up properly.
parser.add_argument("-m", "--model", type=str, default="zoedepth")  # was "synunet"
parser.add_argument("-d", "--dataset", type=str, default='kitti')  # was 'nyu'
parser.add_argument("--trainer", type=str, default=None)
parser.add_argument("--pretrained_resource", type=str, default="local::./checkpoints/depth_anything_vitl14.pth")

ZoeDepth seems better initialized now. I'll follow up on this bug later, but for now I'll keep it hardcoded.

ZoeDepth(
  (core): DepthAnythingCore(
    (core): DPT_DINOv2(
      (pretrained): DinoVisionTransformer(
        (patch_embed): PatchEmbed(
          (proj): Conv2d(3, 1024, kernel_size=(14, 14), stride=(14, 14))
          (norm): Identity()
        )
        (blocks): ModuleList(
          (0): NestedTensorBlock(
            (norm1): LayerNorm((1024,), eps=1e-06, elementwise_affine=True)
            (attn): MemEffAttention(
              (qkv): Linear(in_features=1024, out_features=3072, bias=True)
              (attn_drop): Dropout(p=0.0, inplace=False)
              (proj): Linear(in_features=1024, out_features=1024, bias=True)
              (proj_drop): Dropout(p=0.0, inplace=False)
...

I now get as far as this: ERROR api_key not configured (no-tty). call wandb.login(key=[your_api_key]) .

@dung2603

Try installing xformers: pip install xformers

@michelodu
Author

Thanks. I will add it to the environment.yml file; I'm afraid of breaking my conda build if I proceed any other way.

@dung2603

To fix your error "ERROR api_key not configured (no-tty). call wandb.login(key=[your_api_key])": go to wandb.ai and create an account, then run the command wandb login with the API key that was created for you, before training.
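
The same thing can also be done from Python rather than the terminal, for example (a minimal sketch; substitute your own key):

import os
import wandb

# Either export WANDB_API_KEY in the shell before launching training,
# or set it here and log in explicitly; both avoid the "no-tty" prompt.
os.environ["WANDB_API_KEY"] = "<your_api_key>"
wandb.login(key=os.environ["WANDB_API_KEY"])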

@michelodu
Author

michelodu commented Jul 11, 2024

Thanks for the info. I'll work on it now.
Progress. It fails a bit further on.

...mpr/mpr/DepthAnything/metric_depth/train_mono.py", line 173, in
mp.spawn(main_worker, nprocs=ngpus_per_node
socket.cpp:464] [c10d] The server socket has failed to bind to [::]:15024
torch.distributed.DistNetworkError: The server socket has failed to listen on any local network address

Whoops, I found something else that could be related: Failed to load image Python extension: libtorch_cuda_cu...

@michelodu
Author

I think there are some limitations in the current environment.yml file. I noticed that pytorch=1.13.1 does not overwrite an existing pytorch library in the conda environment, say 2.2.1, if the latter is already present. As a result, the environment ends up with incompatible torchvision and torchaudio libraries. I'm playing with competing options for this environment file.

I'll make it available to the community if there is interest.
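
A quick way to check whether the torch/torchvision pairing is the culprit (a minimal sketch; the intended pins are the ones in environment.yml, i.e. pytorch 1.13.1, torchvision 0.14.1, CUDA 11.7):

import torch
import torchvision

# "Failed to load image Python extension" usually means these two disagree.
print("torch:", torch.__version__, "cuda:", torch.version.cuda)
print("torchvision:", torchvision.__version__)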

@dung2603

Send me messages through instagram. My id is dung26032000

@michelodu
Author

I'm going to use this post to track some of the other problems I've encountered so far. I've been getting a socket error as well. Adding this option seems to address it: --master_port=(some number), e.g. --master_port=25678.
Lightning-AI/pytorch-lightning#13264
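
If the port still collides, another general option is to grab a free port at runtime and pass it to torch.distributed through the standard MASTER_ADDR/MASTER_PORT environment variables (a sketch of the general approach only, not something train_mono.py does out of the box):

import os
import socket

# Ask the OS for any free TCP port, then hand it to torch.distributed.
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    s.bind(("", 0))
    free_port = s.getsockname()[1]

os.environ["MASTER_ADDR"] = "127.0.0.1"
os.environ["MASTER_PORT"] = str(free_port)
print("using MASTER_PORT =", os.environ["MASTER_PORT"])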

@dung2603

Screenshot your problem, then send it to me.

@michelodu
Author

Hi Dung. I've solved that one, actually, but I'm using this thread as a document to track what I'm seeing, in case anyone in the community wants to reuse the code, and also for my colleagues who will build on metric Depth Anything. The thing I'm dealing with now is that the code imposes a path for finding the training data, and I have to work out where that is and override those instructions. I'll let you know if I need to interact with you. Cheers, Michel

@michelodu
Author

My last obstacle is making sure that I train on the files of my choosing. This is established in DepthAnything/metric_depth/zoedepth/utils/config.py.
This config file includes the following code for kitti, which assumes a different file structure than I have...
DATASETS_CONFIG = {
    "kitti": {
        "dataset": "kitti",
        "min_depth": 0.001,
        "max_depth": 80,
        "data_path": os.path.join(HOME_DIR, "Kitti/raw_data"),
        "gt_path": os.path.join(HOME_DIR, "Kitti/data_depth_annotated_zoedepth"),
        "filenames_file": "./train_test_inputs/kitti_eigen_train_files_with_gt.txt",
        "input_height": 352,
        "input_width": 1216,  # 704
        "data_path_eval": os.path.join(HOME_DIR, "Kitti/raw_data"),
        "gt_path_eval": os.path.join(HOME_DIR, "Kitti/data_depth_annotated_zoedepth"),
        "filenames_file_eval": "./train_test_inputs/kitti_eigen_test_files_with_gt.txt",

        "min_depth_eval": 1e-3,
        "max_depth_eval": 80,

        "do_random_rotate": True,
        "degree": 1.0,
        "do_kb_crop": True,
        "garg_crop": True,
        "eigen_crop": False,
        "use_right": False
    },

The filenames_file is particularly relevant, as are data_path and gt_path. I need to overwrite that filenames_file with something else. The structure of that txt file is like this:

2011_09_26/2011_09_26_drive_0051_sync/image_02/data/0000000093.png 2011_09_26_drive_0051_sync/proj_depth/groundtruth/image_02/0000000093.png 721.5377

2011_09_30/2011_09_30_drive_0028_sync/image_02/data/0000002714.png 2011_09_30_drive_0028_sync/proj_depth/groundtruth/image_02/0000002714.png 707.0912

2011_09_26/2011_09_26_drive_0061_sync/image_02/data/0000000045.png 2011_09_26_drive_0061_sync/proj_depth/groundtruth/image_02/0000000045.png 721.5377

The first file is probably the raw image, while the second is obviously the ground truth, but there is also a third entry, a number of some kind. Any idea what that represents?
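
In case it helps anyone with a different directory layout, here is a small sketch of how such a filenames_file could be generated (the directory names, the file-matching rule, and the reuse of 721.5377 as the third column are all assumptions about my own data, not something prescribed by the repo):

import os

IMG_DIR = "./my_data/images"        # assumption: where my RGB frames live
GT_DIR = "./my_data/groundtruth"    # assumption: where my depth ground truth lives
THIRD_COL = "721.5377"              # copied from the sample lines above

with open("./train_test_inputs/my_train_files_with_gt.txt", "w") as f:
    for name in sorted(os.listdir(IMG_DIR)):
        if not name.endswith(".png"):
            continue
        # assumption: the ground-truth file shares the image's file name
        f.write(f"{os.path.join(IMG_DIR, name)} {os.path.join(GT_DIR, name)} {THIRD_COL}\n")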

@michelodu
Author

I am progressing... I managed to format my datafile as ZoeDepth expects to see it, but CUDA complains about memory...

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.03 GiB (GPU 0; 15.70 GiB total capacity; 11.81 GiB already allocated; 167.81 MiB free; 12.17 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Will look into reducing my training set...

@michelodu
Author

michelodu commented Jul 18, 2024

For the community...
Mr Dung and I have been interacting on Instagram. He suggested that I play with the batch size in config_zoedepth.json. I tried values of bs = 16 (the original value), 4, 2 and 1. A value of 1 is the only setting that allowed train_mono.py to run properly with my Kitti data. Now my GPU is humming as it trains. I'll report back on test studies.

Edit: The GPU hummed for a while, but eventually the training process still ran out of memory, even at the lowest setting of bs=1...

Following up with export 'PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512'
https://stackoverflow.com/questions/73747731/runtimeerror-cuda-out-of-memory-how-can-i-set-max-split-size-mb

@michaeltan53

Send me messages through instagram. My id is dung26032000

Hi Dung, I've run into some problems with model training. Can you help me take a look? I can send my problem to you through Instagram.

@dung2603

Send me now

@michaeltan53


Thanks, I've just sent the request to you on Instagram.

@michaeltan53


#125 (comment) is my current problem, and I've also sent it to you on Instagram.

@michelodu
Author

michelodu commented Jul 23, 2024

I am still seeing CUDA out-of-memory errors that mention PYTORCH_CUDA_ALLOC_CONF.
I'm now trying this:
export 'PYTORCH_CUDA_ALLOC_CONF=garbage_collection_threshold:0.6,max_split_size_mb:32'
It had already failed with export 'PYTORCH_CUDA_ALLOC_CONF=garbage_collection_threshold:0.6,max_split_size_mb:128',
which I obtained from CompVis/stable-diffusion#39 .
A second thread mentioned a value of 32: https://blog.gopenai.com/how-to-resolve-runtimeerror-cuda-out-of-memory-d48995452a0
Also this: https://medium.com/@soumensardarintmain/manage-cuda-cores-ultimate-memory-management-strategy-with-pytorch-2bed30cab1
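
Since this variable is read when the CUDA allocator starts up, another option is to set it at the very top of train_mono.py, before torch touches the GPU (a minimal sketch of that approach; the values are just the ones I'm experimenting with):

import os

# Set before the first CUDA allocation so the allocator picks it up.
os.environ.setdefault(
    "PYTORCH_CUDA_ALLOC_CONF",
    "garbage_collection_threshold:0.6,max_split_size_mb:32",
)

import torch  # imported after the env var on purpose

print(torch.cuda.is_available())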

@michelodu
Author

Is it possible to use gradient accumulation to deal with this? Can I invoke train_mono with extra parameters like --gradient_accumulation_steps and expect it to address my memory error?
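
For reference, this is the general pattern gradient accumulation follows in a plain PyTorch loop. I don't know whether train_mono.py exposes a flag for it, so the model, loader, and criterion below are toy stand-ins, not ZoeDepth's trainer:

import torch
from torch import nn

# Toy stand-ins so the pattern runs end to end.
model = nn.Linear(16, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
criterion = nn.MSELoss()
loader = [(torch.randn(2, 16), torch.randn(2, 1)) for _ in range(8)]

accum_steps = 4  # effective batch size = per-step batch size * accum_steps

model.train()
optimizer.zero_grad()
for step, (x, y) in enumerate(loader):
    loss = criterion(model(x), y)
    (loss / accum_steps).backward()   # scale so accumulated gradients average over the window
    if (step + 1) % accum_steps == 0:
        optimizer.step()              # one weight update per accumulation window
        optimizer.zero_grad()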
