ResourceExhaustedError (see above for traceback): OOM when allocating tensor #118
The memory usage has almost nothing to do with the size of the input file. The example code assumes your GPU has ~12GB of memory; if it has less, you'll need to use a smaller batch size (the `--train_batch_size` flag).
Hi @jacobdevlin-google,
Surprisingly, it does work with the small batch size. However, I have a large amount of GPU memory on my server.
Each 8GB card is a separate device, right? This code doesn't support multi-GPU, so you would have to modify it or look for a fork that does.
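For reference, one way to confirm how TensorFlow 1.x sees the cards, including each device's memory limit. This is just a diagnostic sketch; `device_lib` is an internal module, not a stable public API:

```python
# Diagnostic sketch (TF 1.x): list the GPUs TensorFlow sees and their memory limits.
from tensorflow.python.client import device_lib

for d in device_lib.list_local_devices():
    if d.device_type == 'GPU':
        # memory_limit is in bytes; each 8GB card shows up as its own /device:GPU:N
        print(d.name, '%.1f GB' % (d.memory_limit / 1024**3))
```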
Yes, they are separate GPU cards with 8GB each. I'll look for a way to use multiple GPUs for this task, and I'll fork it if I find a solution soon. I'll close this issue for now.
So, did you find a solution to use multiple GPUs?
Same here, also looking for a multi-GPU version.
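For anyone experimenting: a minimal sketch of one possible direction, assuming TensorFlow >= 1.13. The idea is to swap the `TPUEstimator` in `run_pretraining.py` for a plain `Estimator` wrapped in `MirroredStrategy`. Here `model_fn` and `train_input_fn` are placeholder names standing in for the functions this repo builds via `model_fn_builder` and `input_fn_builder`; this is not the repo's supported path, just an outline:

```python
# Hedged sketch: data-parallel training on multiple GPUs with MirroredStrategy.
# Assumes TF >= 1.13; model_fn / train_input_fn are placeholders for the
# functions returned by model_fn_builder / input_fn_builder in run_pretraining.py.
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()  # replicates the model on every visible GPU
run_config = tf.estimator.RunConfig(
    model_dir='/tmp/pretraining_output',
    train_distribute=strategy)

estimator = tf.estimator.Estimator(model_fn=model_fn, config=run_config)
estimator.train(input_fn=train_input_fn, max_steps=20)
```

One caveat: how `MirroredStrategy` interprets the batch size from `input_fn` (per-replica vs. global) changed across TF 1.x versions, so check the docs for your version before assuming two 8GB cards double your effective batch.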
I created the pre-training tf_record from the sample data (`sample_text.txt`) and tried to run `run_pretraining.py` to train on that small dataset, and I got this error.
My script to create the tf_record data is:
```
python create_pretraining_data.py \
  --input_file=sample_text.txt \
  --output_file=/tmp/tf.examples.tfrecord \
  --vocab_file=/home/maybe/bert/model/uncased_L-12_H-768_A-12/vocab.txt \
  --do_lower_case=True \
  --max_seq_length=128 \
  --max_predictions_per_seq=20 \
  --masked_lm_prob=0.15 \
  --random_seed=12345 \
  --dupe_factor=5
```
And the script to pre-train on the small sample data is:
```
python run_pretraining.py \
  --input_file=/tmp/tf.examples.tfrecord \
  --output_dir=/tmp/pretraining_output \
  --do_train=true \
  --do_eval=true \
  --bert_config_file=/home/maybe/bert/model/uncased_L-12_H-768_A-12/bert_config.json \
  --init_checkpoint=/home/maybe/bert/model/uncased_L-12_H-768_A-12/bert_model.ckpt \
  --train_batch_size=32 \
  --max_seq_length=128 \
  --max_predictions_per_seq=20 \
  --num_train_steps=20 \
  --num_warmup_steps=10 \
  --learning_rate=2e-5
```
I know this error; I used to encounter it when another program was using the same GPU, or when a TensorFlow graph was being initialized. But at the moment nothing else is running and all the resources are available/free.
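If the failure really is caused by another process holding GPU memory (rather than the model itself being too big for the card), one standard TF 1.x mitigation is to let TensorFlow allocate memory on demand instead of reserving it all when the graph starts. A minimal sketch; it won't help if a batch genuinely exceeds the card's memory:

```python
# TF 1.x: allocate GPU memory incrementally instead of reserving it all upfront.
import tensorflow as tf

config = tf.ConfigProto()
config.gpu_options.allow_growth = True  # grow the allocation as tensors are created
sess = tf.Session(config=config)
```

With Estimator-based scripts such as `run_pretraining.py`, the equivalent would be passing the proto through `tf.estimator.RunConfig(session_config=config)`.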