
1TB of memory required for model training. #1711

@ghost

Description

Describe the bug

When training a model with 2B parameters on a dataset of shape 8192x5000, TPU compilation fails because more than 1 TB of memory would be required (1.31 TB per the log below, versus the 14.71 GB the compiler reports as available).
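For scale, a back-of-envelope estimate (my assumptions: fp32 weights and a plain Adam optimizer) puts the weight/gradient/optimizer footprint of a 2B-parameter model far below what the compiler asks for:

```python
# Rough memory estimate for a 2B-parameter model, assuming fp32 weights and Adam.
params = 2e9
bytes_per_param = 4  # fp32

weights_gb   = params * bytes_per_param / 1e9  # ~8 GB
gradients_gb = weights_gb                      # ~8 GB
adam_gb      = 2 * weights_gb                  # ~16 GB (first and second moments)

print(weights_gb + gradients_gb + adam_gb)     # ~32 GB in total
# A TPU v3-8 has 8 cores x 16 GB HBM = 128 GB, so even an unsharded ~32 GB
# footprint is nowhere near the 1.31 TB the compiler requests.
```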

To Reproduce

This is my notebook.
The dataset I'm using is scraped and may raise copyright concerns, so the notebook substitutes np.zeros arrays of the same shape (a rough sketch of the placeholder follows the link below).
Kaggle Notebook
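Roughly, the placeholder stands in for the real data like this (dtype and the separate label array are illustrative; the notebook has the exact arrays):

```python
import numpy as np

# Placeholder standing in for the scraped dataset: 8192 examples, 5000 tokens each.
# The dtype and the presence of a separate label array are illustrative.
x_train = np.zeros((8192, 5000), dtype="int32")
y_train = np.zeros((8192, 5000), dtype="int32")
```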

Additional context

When I previously trained a 1B-parameter model without LoRA or 8-bit quantization, training proceeded without any problems on a TPU v3-8.
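For context, the 2B run does enable LoRA and 8-bit quantization. A hypothetical sketch of that setup is below (the model choice, Gemma 2B, and the exact API calls are assumptions on my part; the linked notebook has the real code):

```python
# Hypothetical sketch of the memory-saving setup for the 2B run; the model
# choice (Gemma 2B) and the call order are assumptions, not the notebook's code.
import keras_nlp

model = keras_nlp.models.GemmaCausalLM.from_preset("gemma_2b_en")
model.quantize("int8")              # 8-bit quantize the frozen base weights
model.backbone.enable_lora(rank=4)  # train only low-rank adapter weights
```

The crash log from the failing 2B run follows.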

Epoch 1/5
2024-07-26 07:46:35.652807: E tensorflow/core/grappler/optimizers/meta_optimizer.cc:961] model_pruner failed: INVALID_ARGUMENT: Graph does not contain terminal node StatefulPartitionedCall.
I0000 00:00:1721979998.389673     952 tpu_compilation_cache_interface.cc:441] TPU host compilation cache miss: cache_key(f921002fa5b937d4:0:0), session_name()
I0000 00:00:1721980017.336070     952 tpu_compile_op_common.cc:507] Found 0 programs. Skip fingerprint registration.
I0000 00:00:1721980017.349929     952 tpu_compile_op_common.cc:245] Compilation of f921002fa5b937d4:0:0 with session name  took 18.9601839s and failed
E0000 00:00:1721980017.350017     952 tpu_compilation_cache_external.cc:112] Compilation failure: Aborting compilation early because it's unlikely to have enough memory. Requires 1.31T, has 14.71G available. If more detailed logging is desired, set --xla_tpu_impure_oom_fast_exit_threshold=-1
2024-07-26 07:46:57.350036: F tensorflow/core/tpu/kernels/tpu_program_group.cc:90] Check failed: xla_tpu_programs.size() > 0 (0 vs. 0)
https://symbolize.stripped_domain/r/?trace=7b1cc34bfe2c,7b1cc347104f,5c7a23dc696f,5c7a23dc696f&map= 
*** SIGABRT received by PID 13 (TID 952) on cpu 56 from PID 13; stack trace: ***
PC: @     0x7b1cc34bfe2c  (unknown)  (unknown)
    @     0x7b1bdc090387        928  (unknown)
    @     0x7b1cc3471050      13648  (unknown)
    @     0x5c7a23dc6970  (unknown)  (unknown)
https://symbolize.stripped_domain/r/?trace=7b1cc34bfe2c,7b1bdc090386,7b1cc347104f,5c7a23dc696f&map= 
E0726 07:46:57.363404     952 coredump_hook.cc:442] RAW: Remote crash data gathering hook invoked.
E0726 07:46:57.363415     952 client.cc:269] RAW: Coroner client retries enabled (b/136286901), will retry for up to 30 sec.
E0726 07:46:57.363419     952 coredump_hook.cc:537] RAW: Sending fingerprint to remote end.
E0726 07:46:57.363442     952 coredump_hook.cc:546] RAW: Cannot send fingerprint to Coroner: [NOT_FOUND] stat failed on crash reporting socket /var/google/services/logmanagerd/remote_coredump.socket (Is the listener running?): No such file or directory
E0726 07:46:57.363446     952 coredump_hook.cc:598] RAW: Dumping core locally.

Would you like to help us fix it?
