
Shape mismatch error while loading the pretrained model #9

Closed · nasrinm opened this issue Nov 10, 2019 · 2 comments

nasrinm commented Nov 10, 2019

I get a shape mismatch error while running t5_mesh_transformer, whether for training or fine-tuning.
Here is an example fine-tuning run, using a sample WMT TSV file:

$ t5_mesh_transformer \
  --tpu="${TPU_NAME}" \
  --gcp_project="${PROJECT}" \
  --tpu_zone="${ZONE}" \
  --model_dir="${MODEL_DIR}" \
  --t5_tfds_data_dir=${DATA_DIR} \
  --gin_file="gs://t5-data/pretrained_models/11B/operative_config.gin" \
  --gin_file="models/bi_v1.gin" \
  --gin_param="utils.tpu_mesh_shape.model_parallelism = 1" \
  --gin_param="utils.tpu_mesh_shape.tpu_topology = '2x2'" \
  --gin_param="utils.run.train_dataset_fn = @t5.models.mesh_transformer.tsv_dataset_fn" \
  --gin_param="tsv_dataset_fn.filename = 'gs://XYZbucket/t5/misc/news-commentary-v14.ar-it.tsv'"

Then I get the following error:

ERROR:tensorflow:Error recorded from training_loop: Shape of variable decoder/block_000/layer_000/SelfAttention/k:0 ((768, 768)) doesn't match with shape of tensor decoder/block_000/layer_000/SelfAttention/k ([1024, 16384]) from checkpoint reader.
E1110 22:05:58.563133 140034595272448 error_handling.py:75] Error recorded from training_loop: Shape of variable decoder/block_000/layer_000/SelfAttention/k:0 ((768, 768)) doesn't match with shape of tensor decoder/block_000/layer_000/SelfAttention/k ([1024, 16384]) from checkpoint reader.
INFO:tensorflow:training_loop marked as finished
I1110 22:05:58.563437 140034595272448 error_handling.py:101] training_loop marked as finished
WARNING:tensorflow:Reraising captured error
W1110 22:05:58.563559 140034595272448 error_handling.py:135] Reraising captured error
  File "/home/nasrinm/anaconda3/envs/t5/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1161, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "/home/nasrinm/anaconda3/envs/t5/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1191, in _train_model_default
    features, labels, ModeKeys.TRAIN, self.config)
  File "/home/nasrinm/anaconda3/envs/t5/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 2857, in _call_model_fn
    config)
  File "/home/nasrinm/anaconda3/envs/t5/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1149, in _call_model_fn
    model_fn_results = self._model_fn(features=features, **kwargs)
  File "/home/nasrinm/anaconda3/envs/t5/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3159, in _model_fn
    _train_on_tpu_system(ctx, model_fn_wrapper, dequeue_fn))
  File "/home/nasrinm/anaconda3/envs/t5/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3604, in _train_on_tpu_system
    device_assignment=ctx.device_assignment)
  File "/home/nasrinm/anaconda3/envs/t5/lib/python3.6/site-packages/tensorflow_core/python/tpu/tpu.py", line 1277, in split_compile_and_shard
    name=name)
  File "/home/nasrinm/anaconda3/envs/t5/lib/python3.6/site-packages/tensorflow_core/python/tpu/tpu.py", line 992, in split_compile_and_replicate
    outputs = computation(*computation_inputs)
  File "/home/nasrinm/anaconda3/envs/t5/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3589, in multi_tpu_train_steps_on_single_shard
    inputs=[0, _INITIAL_LOSS])
  File "/home/nasrinm/anaconda3/envs/t5/lib/python3.6/site-packages/tensorflow_core/python/tpu/training_loop.py", line 178, in while_loop
    condition_wrapper, body_wrapper, inputs, name="", parallel_iterations=1)
  File "/home/nasrinm/anaconda3/envs/t5/lib/python3.6/site-packages/tensorflow_core/python/ops/control_flow_ops.py", line 2753, in while_loop
    return_same_structure)
  File "/home/nasrinm/anaconda3/envs/t5/lib/python3.6/site-packages/tensorflow_core/python/ops/control_flow_ops.py", line 2245, in BuildLoop
    pred, body, original_loop_vars, loop_vars, shape_invariants)
  File "/home/nasrinm/anaconda3/envs/t5/lib/python3.6/site-packages/tensorflow_core/python/ops/control_flow_ops.py", line 2170, in _BuildLoop
    body_result = body(*packed_vars_for_body)
  File "/home/nasrinm/anaconda3/envs/t5/lib/python3.6/site-packages/tensorflow_core/python/tpu/training_loop.py", line 121, in body_wrapper
    outputs = body(*(inputs + dequeue_ops))
  File "/home/nasrinm/anaconda3/envs/t5/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3588, in <lambda>
    lambda i, loss: [i + 1, single_tpu_train_step(i)],
  File "/home/nasrinm/anaconda3/envs/t5/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 1715, in train_step
    self._call_model_fn(features, labels))
  File "/home/nasrinm/anaconda3/envs/t5/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 1994, in _call_model_fn
    estimator_spec = self._model_fn(features=features, **kwargs)
  File "/home/nasrinm/anaconda3/envs/t5/lib/python3.6/site-packages/mesh_tensorflow/transformer/utils.py", line 567, in my_model_fn
    init_checkpoint, {v: v for v in restore_vars}
  File "/home/nasrinm/anaconda3/envs/t5/lib/python3.6/site-packages/tensorflow_core/python/training/checkpoint_utils.py", line 291, in init_from_checkpoint
    init_from_checkpoint_fn)
  File "/home/nasrinm/anaconda3/envs/t5/lib/python3.6/site-packages/tensorflow_core/python/distribute/distribute_lib.py", line 1940, in merge_call
    return self._merge_call(merge_fn, args, kwargs)
  File "/home/nasrinm/anaconda3/envs/t5/lib/python3.6/site-packages/tensorflow_core/python/distribute/distribute_lib.py", line 1947, in _merge_call
    return merge_fn(self._strategy, *args, **kwargs)
  File "/home/nasrinm/anaconda3/envs/t5/lib/python3.6/site-packages/tensorflow_core/python/training/checkpoint_utils.py", line 286, in <lambda>
    ckpt_dir_or_file, assignment_map)
  File "/home/nasrinm/anaconda3/envs/t5/lib/python3.6/site-packages/tensorflow_core/python/training/checkpoint_utils.py", line 329, in _init_from_checkpoint
    tensor_name_in_ckpt, str(variable_map[tensor_name_in_ckpt])
ValueError: Shape of variable decoder/block_000/layer_000/SelfAttention/k:0 ((768, 768)) doesn't match with shape of tensor decoder/block_000/layer_000/SelfAttention/k ([1024, 16384]) from checkpoint reader.

Note that the error is reproducible with any of the pretrained models, not just the 11B-parameter one.

craffel (Collaborator) commented Nov 11, 2019

Hi, I think the issue is that your command includes both the
--gin_file="gs://t5-data/pretrained_models/11B/operative_config.gin"
and
--gin_file="models/bi_v1.gin"
flags. The latter flag overwrites values from the pretrained operative config and changes the model hparams so that they no longer match the pretrained checkpoint. Looking at the readme, I can see that this is not explained at all, sorry about that. Can you give that a try to confirm that's the issue you're facing, and then we can update the readme? Thanks.
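Based on this suggestion, the fix is to drop the second --gin_file flag so that only the pretrained operative config defines the model hparams. A sketch of the corrected invocation, reusing the environment variables and the illustrative bucket path from the original report:

```shell
# Fine-tune from the pretrained checkpoint; pass ONLY the pretrained
# operative_config.gin so the model hparams match the checkpoint shapes.
t5_mesh_transformer \
  --tpu="${TPU_NAME}" \
  --gcp_project="${PROJECT}" \
  --tpu_zone="${ZONE}" \
  --model_dir="${MODEL_DIR}" \
  --t5_tfds_data_dir="${DATA_DIR}" \
  --gin_file="gs://t5-data/pretrained_models/11B/operative_config.gin" \
  --gin_param="utils.tpu_mesh_shape.model_parallelism = 1" \
  --gin_param="utils.tpu_mesh_shape.tpu_topology = '2x2'" \
  --gin_param="utils.run.train_dataset_fn = @t5.models.mesh_transformer.tsv_dataset_fn" \
  --gin_param="tsv_dataset_fn.filename = 'gs://XYZbucket/t5/misc/news-commentary-v14.ar-it.tsv'"
```

Any --gin_param flags still apply on top of the operative config, which is fine as long as they don't change architecture-defining hparams (layer sizes, number of heads, etc.) away from what the checkpoint was trained with.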

nasrinm (Author) commented Nov 12, 2019

Thanks, yes, that solved it!

nasrinm closed this as completed Nov 12, 2019