
Shape mismatch error while loading the pretrained model #9

Closed · nasrinm opened this issue Nov 10, 2019 · 2 comments

nasrinm commented Nov 10, 2019

I get a shape mismatch error while running t5_mesh_transformer, whether for training or fine-tuning.
Here is an example fine-tuning run, using a sample WMT TSV file:

$ t5_mesh_transformer \
  --tpu="${TPU_NAME}" \
  --gcp_project="${PROJECT}" \
  --tpu_zone="${ZONE}" \
  --model_dir="${MODEL_DIR}" \
  --t5_tfds_data_dir=${DATA_DIR} \
  --gin_file="gs://t5-data/pretrained_models/11B/operative_config.gin" \
  --gin_file="models/bi_v1.gin" \
  --gin_param="utils.tpu_mesh_shape.model_parallelism = 1" \
  --gin_param="utils.tpu_mesh_shape.tpu_topology = '2x2'" \
  --gin_param="utils.run.train_dataset_fn = @t5.models.mesh_transformer.tsv_dataset_fn" \
  --gin_param="tsv_dataset_fn.filename = 'gs://XYZbucket/t5/misc/news-commentary-v14.ar-it.tsv'"

Then I get the following error:

ERROR:tensorflow:Error recorded from training_loop: Shape of variable decoder/block_000/layer_000/SelfAttention/k:0 ((768, 768)) doesn't match with shape of tensor decoder/block_000/layer_000/SelfAttention/k ([1024, 16384]) from checkpoint reader.
E1110 22:05:58.563133 140034595272448 error_handling.py:75] Error recorded from training_loop: Shape of variable decoder/block_000/layer_000/SelfAttention/k:0 ((768, 768)) doesn't match with shape of tensor decoder/block_000/layer_000/SelfAttention/k ([1024, 16384]) from checkpoint reader.
INFO:tensorflow:training_loop marked as finished
I1110 22:05:58.563437 140034595272448 error_handling.py:101] training_loop marked as finished
WARNING:tensorflow:Reraising captured error
W1110 22:05:58.563559 140034595272448 error_handling.py:135] Reraising captured error
  File "/home/nasrinm/anaconda3/envs/t5/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1161, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "/home/nasrinm/anaconda3/envs/t5/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1191, in _train_model_default
    features, labels, ModeKeys.TRAIN, self.config)
  File "/home/nasrinm/anaconda3/envs/t5/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 2857, in _call_model_fn
    config)
  File "/home/nasrinm/anaconda3/envs/t5/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1149, in _call_model_fn
    model_fn_results = self._model_fn(features=features, **kwargs)
  File "/home/nasrinm/anaconda3/envs/t5/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3159, in _model_fn
    _train_on_tpu_system(ctx, model_fn_wrapper, dequeue_fn))
  File "/home/nasrinm/anaconda3/envs/t5/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3604, in _train_on_tpu_system
    device_assignment=ctx.device_assignment)
  File "/home/nasrinm/anaconda3/envs/t5/lib/python3.6/site-packages/tensorflow_core/python/tpu/tpu.py", line 1277, in split_compile_and_shard
    name=name)
  File "/home/nasrinm/anaconda3/envs/t5/lib/python3.6/site-packages/tensorflow_core/python/tpu/tpu.py", line 992, in split_compile_and_replicate
    outputs = computation(*computation_inputs)
  File "/home/nasrinm/anaconda3/envs/t5/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3589, in multi_tpu_train_steps_on_single_shard
    inputs=[0, _INITIAL_LOSS])
  File "/home/nasrinm/anaconda3/envs/t5/lib/python3.6/site-packages/tensorflow_core/python/tpu/training_loop.py", line 178, in while_loop
    condition_wrapper, body_wrapper, inputs, name="", parallel_iterations=1)
  File "/home/nasrinm/anaconda3/envs/t5/lib/python3.6/site-packages/tensorflow_core/python/ops/control_flow_ops.py", line 2753, in while_loop
    return_same_structure)
  File "/home/nasrinm/anaconda3/envs/t5/lib/python3.6/site-packages/tensorflow_core/python/ops/control_flow_ops.py", line 2245, in BuildLoop
    pred, body, original_loop_vars, loop_vars, shape_invariants)
  File "/home/nasrinm/anaconda3/envs/t5/lib/python3.6/site-packages/tensorflow_core/python/ops/control_flow_ops.py", line 2170, in _BuildLoop
    body_result = body(*packed_vars_for_body)
  File "/home/nasrinm/anaconda3/envs/t5/lib/python3.6/site-packages/tensorflow_core/python/tpu/training_loop.py", line 121, in body_wrapper
    outputs = body(*(inputs + dequeue_ops))
  File "/home/nasrinm/anaconda3/envs/t5/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3588, in <lambda>
    lambda i, loss: [i + 1, single_tpu_train_step(i)],
  File "/home/nasrinm/anaconda3/envs/t5/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 1715, in train_step
    self._call_model_fn(features, labels))
  File "/home/nasrinm/anaconda3/envs/t5/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 1994, in _call_model_fn
    estimator_spec = self._model_fn(features=features, **kwargs)
  File "/home/nasrinm/anaconda3/envs/t5/lib/python3.6/site-packages/mesh_tensorflow/transformer/utils.py", line 567, in my_model_fn
    init_checkpoint, {v: v for v in restore_vars}
  File "/home/nasrinm/anaconda3/envs/t5/lib/python3.6/site-packages/tensorflow_core/python/training/checkpoint_utils.py", line 291, in init_from_checkpoint
    init_from_checkpoint_fn)
  File "/home/nasrinm/anaconda3/envs/t5/lib/python3.6/site-packages/tensorflow_core/python/distribute/distribute_lib.py", line 1940, in merge_call
    return self._merge_call(merge_fn, args, kwargs)
  File "/home/nasrinm/anaconda3/envs/t5/lib/python3.6/site-packages/tensorflow_core/python/distribute/distribute_lib.py", line 1947, in _merge_call
    return merge_fn(self._strategy, *args, **kwargs)
  File "/home/nasrinm/anaconda3/envs/t5/lib/python3.6/site-packages/tensorflow_core/python/training/checkpoint_utils.py", line 286, in <lambda>
    ckpt_dir_or_file, assignment_map)
  File "/home/nasrinm/anaconda3/envs/t5/lib/python3.6/site-packages/tensorflow_core/python/training/checkpoint_utils.py", line 329, in _init_from_checkpoint
    tensor_name_in_ckpt, str(variable_map[tensor_name_in_ckpt])
ValueError: Shape of variable decoder/block_000/layer_000/SelfAttention/k:0 ((768, 768)) doesn't match with shape of tensor decoder/block_000/layer_000/SelfAttention/k ([1024, 16384]) from checkpoint reader.

Note that the error is reproducible with any of the pretrained models, not just the 11B-parameter one.

craffel (Collaborator) commented Nov 11, 2019

Hi, I think the issue is that your command includes both the
--gin_file="gs://t5-data/pretrained_models/11B/operative_config.gin"
and
--gin_file="models/bi_v1.gin"
flags. The latter flag overwrites values from the pretrained operative config and changes the model hparams so that they no longer match the pretrained checkpoint. Looking at the readme, I can see that this is not explained at all, sorry about that. Can you give that a try to confirm that's the issue you're facing, and then we can update the readme? Thanks.
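Based on this suggestion, the fix is to drop the second --gin_file flag so that only the pretrained operative config defines the model hparams. A sketch of the corrected invocation, reusing the environment variables and the illustrative bucket path from the original report:

```shell
# Fine-tune from the pretrained checkpoint; pass ONLY the pretrained
# operative_config.gin so the model hparams match the checkpoint shapes.
t5_mesh_transformer \
  --tpu="${TPU_NAME}" \
  --gcp_project="${PROJECT}" \
  --tpu_zone="${ZONE}" \
  --model_dir="${MODEL_DIR}" \
  --t5_tfds_data_dir="${DATA_DIR}" \
  --gin_file="gs://t5-data/pretrained_models/11B/operative_config.gin" \
  --gin_param="utils.tpu_mesh_shape.model_parallelism = 1" \
  --gin_param="utils.tpu_mesh_shape.tpu_topology = '2x2'" \
  --gin_param="utils.run.train_dataset_fn = @t5.models.mesh_transformer.tsv_dataset_fn" \
  --gin_param="tsv_dataset_fn.filename = 'gs://XYZbucket/t5/misc/news-commentary-v14.ar-it.tsv'"
```

Any --gin_param flags still apply on top of the operative config, which is fine as long as they don't change architecture-defining hparams (layer sizes, number of heads, etc.) away from what the checkpoint was trained with.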

nasrinm (Author) commented Nov 12, 2019

Thanks, yes, that solved it!

nasrinm closed this as completed Nov 12, 2019