Closed

Description
Describe the bug
Training a 2B-parameter model on a dataset of size 8192x5000 fails during TPU compilation: XLA reports that the program requires 1.31T of memory, but only 14.71G is available.
To Reproduce
This is my notebook.
The dataset I actually use is scraped and may be subject to copyright, so the notebook uses np.zeros as a placeholder instead.
Kaggle Notebook
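For scale, here is a rough sketch of how much host memory a placeholder array of that size would occupy. The shape comes from the report; reading it as (samples, tokens) and the choice of dtypes are assumptions, since the notebook's actual dtype is not stated here. float64 is listed because it is what np.zeros returns when no dtype is given.

```python
# Shape taken from the report; treating it as (samples, tokens) is an assumption.
ROWS, COLS = 8192, 5000

# Bytes per element for a few dtypes the placeholder might use (assumed).
# float64 is the default dtype of np.zeros when none is passed.
DTYPE_BYTES = {"int32": 4, "int64": 8, "float64": 8}


def array_bytes(rows, cols, itemsize):
    """Return the size in bytes of a dense rows x cols array."""
    return rows * cols * itemsize


for dtype, itemsize in DTYPE_BYTES.items():
    size = array_bytes(ROWS, COLS, itemsize)
    print(f"{dtype}: {size / 1024**2:.2f} MiB")
```

Under these assumptions the array is a few hundred MiB at most, so the input data alone is unlikely to account for a requirement above 1TB.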
Additional context
When I previously trained a 1B-parameter model without LoRA or 8-bit quantization, training proceeded without any problems on a TPU v3-8.
Epoch 1/5
2024-07-26 07:46:35.652807: E tensorflow/core/grappler/optimizers/meta_optimizer.cc:961] model_pruner failed: INVALID_ARGUMENT: Graph does not contain terminal node StatefulPartitionedCall.
I0000 00:00:1721979998.389673 952 tpu_compilation_cache_interface.cc:441] TPU host compilation cache miss: cache_key(f921002fa5b937d4:0:0), session_name()
I0000 00:00:1721980017.336070 952 tpu_compile_op_common.cc:507] Found 0 programs. Skip fingerprint registration.
I0000 00:00:1721980017.349929 952 tpu_compile_op_common.cc:245] Compilation of f921002fa5b937d4:0:0 with session name took 18.9601839s and failed
E0000 00:00:1721980017.350017 952 tpu_compilation_cache_external.cc:112] Compilation failure: Aborting compilation early because it's unlikely to have enough memory. Requires 1.31T, has 14.71G available. If more detailed logging is desired, set --xla_tpu_impure_oom_fast_exit_threshold=-1
2024-07-26 07:46:57.350036: F tensorflow/core/tpu/kernels/tpu_program_group.cc:90] Check failed: xla_tpu_programs.size() > 0 (0 vs. 0)
https://symbolize.stripped_domain/r/?trace=7b1cc34bfe2c,7b1cc347104f,5c7a23dc696f,5c7a23dc696f&map=
*** SIGABRT received by PID 13 (TID 952) on cpu 56 from PID 13; stack trace: ***
PC: @ 0x7b1cc34bfe2c (unknown) (unknown)
@ 0x7b1bdc090387 928 (unknown)
@ 0x7b1cc3471050 13648 (unknown)
@ 0x5c7a23dc6970 (unknown) (unknown)
https://symbolize.stripped_domain/r/?trace=7b1cc34bfe2c,7b1bdc090386,7b1cc347104f,5c7a23dc696f&map=
E0726 07:46:57.363404 952 coredump_hook.cc:442] RAW: Remote crash data gathering hook invoked.
E0726 07:46:57.363415 952 client.cc:269] RAW: Coroner client retries enabled (b/136286901), will retry for up to 30 sec.
E0726 07:46:57.363419 952 coredump_hook.cc:537] RAW: Sending fingerprint to remote end.
E0726 07:46:57.363442 952 coredump_hook.cc:546] RAW: Cannot send fingerprint to Coroner: [NOT_FOUND] stat failed on crash reporting socket /var/google/services/logmanagerd/remote_coredump.socket (Is the listener running?): No such file or directory
E0726 07:46:57.363446 952 coredump_hook.cc:598] RAW: Dumping core locally.
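To put the compilation failure in the log above into numbers, the following sketch parses the "Requires ..., has ... available" message and compares it with the raw size of 2B parameters at a few precisions. Treating the "T" and "G" suffixes as powers of 1024 is an assumption, the helper names are hypothetical, and the weight sizes ignore activations, gradients, and optimizer state.

```python
import re

# Assumption: the size suffixes in the XLA message are binary units.
UNIT_BYTES = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}

# Message copied from the log in this report.
LOG_LINE = (
    "Compilation failure: Aborting compilation early because it's unlikely "
    "to have enough memory. Requires 1.31T, has 14.71G available."
)


def parse_size(text):
    """Convert a string such as '1.31T' into a byte count."""
    value, unit = re.fullmatch(r"([\d.]+)([KMGT])", text).groups()
    return float(value) * UNIT_BYTES[unit]


def parse_oom(line):
    """Return (required_bytes, available_bytes) from the early-OOM message."""
    match = re.search(
        r"Requires ([\d.]+[KMGT]), has ([\d.]+[KMGT]) available", line
    )
    if match is None:
        raise ValueError("line does not contain the expected memory message")
    return parse_size(match.group(1)), parse_size(match.group(2))


required, available = parse_oom(LOG_LINE)
ratio = required / available
print(f"required / available: {ratio:.1f}x")

# Raw weight storage for 2B parameters at a few precisions.
PARAMS = 2_000_000_000
weight_gib = {
    name: PARAMS * itemsize / 1024**3
    for name, itemsize in [("float32", 4), ("bfloat16", 2), ("int8", 1)]
}
for name, gib in weight_gib.items():
    print(f"{name} weights: {gib:.2f} GiB")
```

Under these assumptions the compiled program asks for roughly 90 times the available memory, while the weights alone would fit even in float32, so the bulk of the requirement must come from something other than parameter storage.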
Would you like to help us fix it?
steveepreston