KeyError(f"None of [{key}] are in the [{axis_name}]" #50

Closed
lmohit95 opened this issue Nov 18, 2022 · 13 comments · Fixed by #52
Labels: bug (Something isn't working), HET (Issues about HET)

Comments

@lmohit95

I am getting KeyError(f"None of [{key}] are in the [{axis_name}]" while running python models/load_data.py.
The error occurs in this line.

image

I have set the appropriate path for criteo_dataset in load_data.py. The download, extraction, and local file creation all succeed. I downloaded the dataset from this website: https://www.kaggle.com/competitions/criteo-display-ad-challenge/data.
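
For context, pandas raises this message when none of the labels requested from an axis exist in it, for example when a file was read with different column names than the code expects. The screenshot is not available here, so the actual cause is unknown. A minimal sketch of a pre-check that lists missing columns from the header row, assuming hypothetical column names (the ones load_data.py uses are not shown in this thread):

```python
import csv
import io


def missing_columns(csv_text, wanted, delimiter=","):
    """Return the wanted column names that are absent from the CSV header."""
    reader = csv.reader(io.StringIO(csv_text), delimiter=delimiter)
    header = next(reader, [])  # empty list if the text has no rows
    present = set(header)
    return [name for name in wanted if name not in present]


# Hypothetical sample: the text has no header row, so the first data row
# is treated as the header and none of the wanted names are found.
sample = "0,1,5\n1,2,7\n"
wanted = ["label", "I1", "I2"]  # illustrative names only
print(missing_columns(sample, wanted))
```

If the list comes back non-empty, selecting those columns from a DataFrame built from the same file would fail.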

@Hankpipi
Contributor

Hi, @lmohit95, this has been solved in #52 and sorry for the delay.

@lmohit95
Copy link
Author

Thank you. load_data.py works perfectly now. The dataset is downloaded and processed.
However, when I run bash tests/local_dcn_criteo.sh, I get the following error. I tried changing the batch size at line 62 of run_hetu.py and rerunning the command, but I still get the same error.

image

I am able to run other DLRM implementations, such as Facebook's open-source DLRM, on the GPU, so I believe the CUDA setup is correct.

@Hankpipi
Contributor

Hankpipi commented Nov 20, 2022

@lmohit95, the Hetu main branch has been updated to enable dynamic memory allocation. Please pull the new code and try again.

@lmohit95
Author

@lmohit95, Hetu main branch has been updated, please try to pull the new code and try again.

Thank you. It works now. Sorry for asking a lot of questions. I am now facing this issue while training on the Criteo dataset.

#50 (comment)

@Hankpipi
Contributor

Hankpipi commented Nov 20, 2022

@lmohit95, Hetu main branch has been updated, please try to pull the new code and try again.

Thank you. It works now. Sorry for asking a lot of questions. I am now facing this issue while training on the Criteo dataset.

#50 (comment)

I mean that this problem was solved by #47, which was merged recently. Does it still happen after you pull these changes?

@lmohit95
Author

I get the following error when I run bash tests/local_dcn_criteo.sh. I created a hetu_config.yaml file in the tmp folder and copied in the contents provided in README.MD. I am trying to run HET on a single GPU.

image

To avoid this error, I deliberately set file = None in the __init__ function of distribute.py. With that change, I get the out-of-memory error instead.

@Hankpipi
Contributor

Hankpipi commented Nov 20, 2022

@lmohit95, you can also replace line 120 with the following code to avoid the first error:

if args.comm is None:
    # No communication mode given: run locally on GPU 0.
    executor = ht.Executor(eval_nodes, ctx=ht.gpu(0), cstable_policy=args.cache,
                           bsp=args.bsp, cache_bound=args.bound, seed=123, log_path=executor_log_path)
else:
    # Otherwise run data-parallel with the requested aggregation mode.
    strategy = ht.dist.DataParallel(aggregate=args.comm)
    executor = ht.Executor(eval_nodes, dist_strategy=strategy, cstable_policy=args.cache,
                           bsp=args.bsp, cache_bound=args.bound, seed=123, log_path=executor_log_path)
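
The branch above amounts to choosing between two sets of keyword arguments. A minimal sketch that isolates that choice so it can be checked without a GPU or a Hetu install. Here `executor_kwargs` is a hypothetical helper that is not part of Hetu, the device and strategy are represented by plain Python values rather than ht.gpu(0) and ht.dist.DataParallel, and the argument values are made up:

```python
from types import SimpleNamespace


def executor_kwargs(args, log_path):
    """Build the keyword arguments that differ between the two branches.

    Hypothetical helper for illustration; it never imports or calls Hetu.
    """
    common = dict(cstable_policy=args.cache, bsp=args.bsp,
                  cache_bound=args.bound, seed=123, log_path=log_path)
    if args.comm is None:
        # Local run: pin execution to one device (stand-in for ht.gpu(0)).
        return dict(ctx="gpu:0", **common)
    # Distributed run: record the aggregation mode for a data-parallel strategy.
    return dict(dist_strategy=("DataParallel", args.comm), **common)


# Made-up argument values, standing in for the parsed command line.
local_args = SimpleNamespace(comm=None, cache=None, bsp=0, bound=100)
dist_args = SimpleNamespace(comm="Hybrid", cache=None, bsp=0, bound=100)

local = executor_kwargs(local_args, "executor.log")
dist = executor_kwargs(dist_args, "executor.log")
print("ctx" in local, "dist_strategy" in dist)
```

Only one of ctx and dist_strategy is ever set, which mirrors the structure of the suggested replacement code.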

For the OOM error, #47 implements dynamic memory allocation, and the peak GPU memory will be halved when you run bash tests/local_dcn_criteo.sh.

Maybe you haven't pulled the latest code yet?

@lmohit95
Author

lmohit95 commented Nov 21, 2022

Thanks a lot for everything. The tests are working perfectly now. I was accessing the forked repo mentioned in the HET paper.
While running the command python run_hetu.py --model dcn_criteo --all --val, I get the following error:

image

I pulled the latest code and downloaded the Criteo dataset by running load_data.py.

@Hankpipi
Contributor

Hankpipi commented Nov 22, 2022

@lmohit95, thanks for your feedback and sorry for my mistake.
It is true that there were still some errors in dataset processing, and I have fixed them in #54.
Please pull my code and run python load_data.py again before running run_hetu.py --model dcn_criteo --all --val.

@lmohit95
Author

lmohit95 commented Nov 22, 2022

Thank you. Everything works now. I just wanted to clarify something regarding run_hetu.py --model dcn_criteo --all --val. This command trains and tests HET on the Criteo dataset, right?

The paper mentions that the training process can take hours (Fig 6),

image

but in my case the training runs for a total of 10 epochs with far less overall runtime.

image

@Hsword
Owner

Hsword commented Nov 22, 2022

It seems like you are running in local execution mode rather than distributed training. That's why it's much faster.
Besides, note that test_auc is reported every 1/10 epoch, as described in this argument help text:

help="num of epochs, each train 1/10 data")
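
Under that help text, each reported epoch covers one tenth of the data, so ten of them amount to a single full pass. A small arithmetic sketch, assuming the 1/10 fraction from the help string and a hypothetical sample count:

```python
from fractions import Fraction

# Fraction of the data trained per reported epoch, taken from the help string.
FRACTION_PER_EPOCH = Fraction(1, 10)


def full_passes(num_epochs):
    """Complete passes over the data after num_epochs reported epochs."""
    return num_epochs * FRACTION_PER_EPOCH


def samples_seen(num_epochs, dataset_size):
    """Samples processed so far, for a hypothetical dataset_size."""
    return int(full_passes(num_epochs) * dataset_size)


print(full_passes(10))             # 1
print(samples_seen(3, 1_000_000))  # 300000
```

Exact fractions are used so that ten tenths come out as exactly one pass rather than a floating-point approximation.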

@lmohit95
Author

Got it. Thanks for all the help!!!

@lmohit95
Author

lmohit95 commented Dec 9, 2022

Hello,
Thanks for all the help so far.
I am running HET on the Criteo dataset on a single GPU node in HYBRID mode, with HETU_VERSION = 'gpu'. I ran bash examples/ctr/tests/hybrid_wdl_criteo.sh, but I am getting the following error:

image

This is my configuration file:

shared :
  DMLC_PS_ROOT_URI : 127.0.0.1
  DMLC_PS_ROOT_PORT : 13100
  DMLC_NUM_WORKER : 2
  DMLC_NUM_SERVER : 1
  DMLC_PS_VAN_TYPE : p3
launch :
  worker : 2
  server : 1
  scheduler : true
nodes:
  - host: lmohit95
    servers: 1
    workers: 2
    chief: true
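
One sanity check on a configuration like this is whether the worker and server counts agree across the shared, launch, and nodes sections. A minimal sketch, assuming the three sections are meant to agree and with the values transcribed into a plain dictionary to avoid a YAML dependency. Here `count_mismatches` is a hypothetical helper, and an empty result does not rule out other causes of the error:

```python
# Values transcribed by hand from the configuration above.
config = {
    "shared": {"DMLC_NUM_WORKER": 2, "DMLC_NUM_SERVER": 1},
    "launch": {"worker": 2, "server": 1, "scheduler": True},
    "nodes": [{"host": "lmohit95", "servers": 1, "workers": 2, "chief": True}],
}

# (role name in launch, key in shared, key in each node entry)
ROLES = (
    ("worker", "DMLC_NUM_WORKER", "workers"),
    ("server", "DMLC_NUM_SERVER", "servers"),
)


def count_mismatches(cfg):
    """Return a message for each role whose counts differ between sections."""
    problems = []
    for role, env_key, node_key in ROLES:
        counts = {
            "shared": cfg["shared"][env_key],
            "launch": cfg["launch"][role],
            "nodes": sum(node.get(node_key, 0) for node in cfg["nodes"]),
        }
        if len(set(counts.values())) != 1:
            problems.append(f"{role} counts differ: {counts}")
    return problems


print(count_mismatches(config))
```

For the configuration posted here the counts are consistent (2 workers and 1 server in every section), so this check reports nothing.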

@Hsword added the bug and HET labels on Dec 30, 2022