Issue training existing deepspeech model using checkpoints #2297

Closed
Taco-Network opened this issue Aug 13, 2019 · 9 comments

@Taco-Network

Taco-Network commented Aug 13, 2019

Linux Ubuntu 18.04
Tensorflow-gpu 1.14.0
Python 3
Cuda 10
CuDNN 7.5
GTX 1080

We have run into an issue using the checkpoint of an existing model for fine-tuning, and we are unsure what this error means:

python3 DeepSpeech.py --checkpoint_dir /home/shop/Downloads/deepspeech-0.5.1-checkpoint/ --trie /home/shop/deepspeech-0.5.1-models/trie --lm_binary_dir /home/shop/deepspeech-0.5.1-models/lm.binary --train_files /home/shop/Downloads/en/clips/train.csv --test_files /home/shop/Downloads/en/clips/test.csv --dev_files /home/shop/Downloads/en/clips/dev.csv --summary_dir /home/shop/DeepSpeech/summary/ --train_batch_size 24 --dev_batch_size 48 --test_batch_size 48 --n_hidden 2048 --learning_rate .0001 --dropout_rate 0.15 --epoch -1 --lm_alpha 0.75 --lm_beta 1.85 --export_dir /home/shop/DeepSpeech/new/new

W0813 16:48:33.340879 140668025968448 deprecation_wrapper.py:119] From /home/shop/DeepSpeech/util/config.py:60: The name tf.ConfigProto is deprecated. Please use tf.compat.v1.ConfigProto instead.

W0813 16:48:34.819179 140668025968448 deprecation.py:323] From /home/shop/.local/lib/python3.6/site-packages/tensorflow/python/data/ops/dataset_ops.py:494: py_func (from tensorflow.python.ops.script_ops) is deprecated and will be removed in a future version.
Instructions for updating:
tf.py_func is deprecated in TF V2. Instead, there are two
    options available in V2.
    - tf.py_function takes a python function which manipulates tf eager
    tensors instead of numpy arrays. It's easy to convert a tf eager tensor to
    an ndarray (just call tensor.numpy()) but having access to eager tensors
    means `tf.py_function`s can use accelerators such as GPUs as well as
    being differentiable using a gradient tape.
    - tf.numpy_function maintains the semantics of the deprecated tf.py_func
    (it is not differentiable, and manipulates numpy arrays). It drops the
    stateful argument making all functions stateful.
    
W0813 16:48:34.920831 140668025968448 deprecation.py:323] From /home/shop/.local/lib/python3.6/site-packages/tensorflow/python/data/ops/iterator_ops.py:348: Iterator.output_types (from tensorflow.python.data.ops.iterator_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.compat.v1.data.get_output_types(iterator)`.
W0813 16:48:34.921059 140668025968448 deprecation.py:323] From /home/shop/.local/lib/python3.6/site-packages/tensorflow/python/data/ops/iterator_ops.py:349: Iterator.output_shapes (from tensorflow.python.data.ops.iterator_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.compat.v1.data.get_output_shapes(iterator)`.
W0813 16:48:34.921202 140668025968448 deprecation.py:323] From /home/shop/.local/lib/python3.6/site-packages/tensorflow/python/data/ops/iterator_ops.py:351: Iterator.output_classes (from tensorflow.python.data.ops.iterator_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.compat.v1.data.get_output_classes(iterator)`.
W0813 16:48:35.401269 140668025968448 deprecation.py:506] From /home/shop/.local/lib/python3.6/site-packages/tensorflow/python/ops/init_ops.py:1251: calling VarianceScaling.__init__ (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version.
Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor
W0813 16:48:36.486336 140668025968448 deprecation.py:323] From /home/shop/.local/lib/python3.6/site-packages/tensorflow/python/ops/math_grad.py:1250: add_dispatch_support.<locals>.wrapper (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
W0813 16:48:37.247787 140668025968448 deprecation.py:323] From /home/shop/.local/lib/python3.6/site-packages/tensorflow/python/training/saver.py:1276: checkpoint_exists (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating:
Use standard file APIs to check for files with this prefix.
I0813 16:48:37.249122 140668025968448 saver.py:1280] Restoring parameters from /home/shop/Downloads/deepspeech-0.5.1-checkpoint/model.v0.5.1
Traceback (most recent call last):
  File "/home/shop/.local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1356, in _do_call
    return fn(*args)
  File "/home/shop/.local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1341, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/home/shop/.local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1429, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.NotFoundError: Key cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/bias not found in checkpoint
	 [[{{node save/RestoreV2}}]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/shop/.local/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 1286, in restore
    {self.saver_def.filename_tensor_name: save_path})
  File "/home/shop/.local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 950, in run
    run_metadata_ptr)
  File "/home/shop/.local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1173, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/shop/.local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1350, in _do_run
    run_metadata)
  File "/home/shop/.local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1370, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.NotFoundError: Key cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/bias not found in checkpoint
	 [[node save/RestoreV2 (defined at DeepSpeech.py:457) ]]

Original stack trace for 'save/RestoreV2':
  File "DeepSpeech.py", line 844, in <module>
    tfv1.app.run(main)
  File "/home/shop/.local/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "/home/shop/.local/lib/python3.6/site-packages/absl/app.py", line 300, in run
    _run_main(main, args)
  File "/home/shop/.local/lib/python3.6/site-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "DeepSpeech.py", line 828, in main
    train()
  File "DeepSpeech.py", line 457, in train
    checkpoint_saver = tfv1.train.Saver(max_to_keep=FLAGS.max_to_keep)
  File "/home/shop/.local/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 825, in __init__
    self.build()
  File "/home/shop/.local/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 837, in build
    self._build(self._filename, build_save=True, build_restore=True)
  File "/home/shop/.local/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 875, in _build
    build_restore=build_restore)
  File "/home/shop/.local/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 508, in _build_internal
    restore_sequentially, reshape)
  File "/home/shop/.local/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 328, in _AddRestoreOps
    restore_sequentially)
  File "/home/shop/.local/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 575, in bulk_restore
    return io_ops.restore_v2(filename_tensor, names, slices, dtypes)
  File "/home/shop/.local/lib/python3.6/site-packages/tensorflow/python/ops/gen_io_ops.py", line 1696, in restore_v2
    name=name)
  File "/home/shop/.local/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
    op_def=op_def)
  File "/home/shop/.local/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/home/shop/.local/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3616, in create_op
    op_def=op_def)
  File "/home/shop/.local/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 2005, in __init__
    self._traceback = tf_stack.extract_stack()


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/shop/.local/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 1296, in restore
    names_to_keys = object_graph_key_mapping(save_path)
  File "/home/shop/.local/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 1614, in object_graph_key_mapping
    object_graph_string = reader.get_tensor(trackable.OBJECT_GRAPH_PROTO_KEY)
  File "/home/shop/.local/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 678, in get_tensor
    return CheckpointReader_GetTensor(self, compat.as_bytes(tensor_str))
tensorflow.python.framework.errors_impl.NotFoundError: Key _CHECKPOINTABLE_OBJECT_GRAPH not found in checkpoint

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "DeepSpeech.py", line 844, in <module>
    tfv1.app.run(main)
  File "/home/shop/.local/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "/home/shop/.local/lib/python3.6/site-packages/absl/app.py", line 300, in run
    _run_main(main, args)
  File "/home/shop/.local/lib/python3.6/site-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "DeepSpeech.py", line 828, in main
    train()
  File "DeepSpeech.py", line 475, in train
    loaded = try_loading(session, checkpoint_saver, checkpoint_filename, 'most recent')
  File "DeepSpeech.py", line 392, in try_loading
    saver.restore(session, checkpoint_path)
  File "/home/shop/.local/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 1302, in restore
    err, "a Variable name or other graph key that is missing")
tensorflow.python.framework.errors_impl.NotFoundError: Restoring from checkpoint failed. This is most likely due to a Variable name or other graph key that is missing from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:

Key cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/bias not found in checkpoint
	 [[node save/RestoreV2 (defined at DeepSpeech.py:457) ]]

Original stack trace for 'save/RestoreV2':
  File "DeepSpeech.py", line 844, in <module>
    tfv1.app.run(main)
  File "/home/shop/.local/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "/home/shop/.local/lib/python3.6/site-packages/absl/app.py", line 300, in run
    _run_main(main, args)
  File "/home/shop/.local/lib/python3.6/site-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "DeepSpeech.py", line 828, in main
    train()
  File "DeepSpeech.py", line 457, in train
    checkpoint_saver = tfv1.train.Saver(max_to_keep=FLAGS.max_to_keep)
  File "/home/shop/.local/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 825, in __init__
    self.build()
  File "/home/shop/.local/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 837, in build
    self._build(self._filename, build_save=True, build_restore=True)
  File "/home/shop/.local/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 875, in _build
    build_restore=build_restore)
  File "/home/shop/.local/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 508, in _build_internal
    restore_sequentially, reshape)
  File "/home/shop/.local/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 328, in _AddRestoreOps
    restore_sequentially)
  File "/home/shop/.local/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 575, in bulk_restore
    return io_ops.restore_v2(filename_tensor, names, slices, dtypes)
  File "/home/shop/.local/lib/python3.6/site-packages/tensorflow/python/ops/gen_io_ops.py", line 1696, in restore_v2
    name=name)
  File "/home/shop/.local/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
    op_def=op_def)
  File "/home/shop/.local/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/home/shop/.local/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3616, in create_op
    op_def=op_def)
  File "/home/shop/.local/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 2005, in __init__
    self._traceback = tf_stack.extract_stack()
@kdavis-mozilla
Contributor

It appears you are training an existing model using DeepSpeech 0.5.1.

Training with DeepSpeech 0.5.1 requires TensorFlow 1.13.1 (see the 0.5.1 README), not TensorFlow 1.14.0, which you appear to be using.
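
For anyone hitting the same mismatch, here is a minimal sketch of a version check that could be run before training. The required version comes from the reply above. The function names are hypothetical and are not part of DeepSpeech.

```python
import re

# From the reply above: training with DeepSpeech 0.5.1 needs TensorFlow 1.13.1.
REQUIRED_TF = "1.13.1"


def parse_version(text):
    """Return the leading numeric (major, minor, patch) of a version string."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError("unrecognised version string: %r" % text)
    return tuple(int(group) for group in match.groups())


def tensorflow_matches(installed, required=REQUIRED_TF):
    """True when the installed version equals the required one exactly."""
    return parse_version(installed) == parse_version(required)


def installed_tensorflow_version():
    """Read the version of whichever TensorFlow is importable (hypothetical helper)."""
    import tensorflow as tf  # imported lazily so the checks above work without it
    return tf.__version__
```

Calling `tensorflow_matches(installed_tensorflow_version())` before starting a long training run would flag a 1.14.0 install right away.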

@Taco-Network
Author

Taco-Network commented Aug 14, 2019

Missed that detail about TensorFlow 1.13.1. I installed it and removed the old version, and now I get this error:

python3 DeepSpeech.py --checkpoint_dir /home/shop/Downloads/deepspeech-0.5.1-checkpoint/ --trie /home/shop/deepspeech-0.5.1-models/trie --lm_binary_dir /home/shop/deepspeech-0.5.1-models/lm.binary --train_files /home/shop/Downloads/en/clips/train.csv --test_files /home/shop/Downloads/en/clips/test.csv --dev_files /home/shop/Downloads/en/clips/dev.csv --summary_dir /home/shop/DeepSpeech/summary/ --train_batch_size 24 --dev_batch_size 48 --test_batch_size 48 --n_hidden 2048 --learning_rate .0001 --dropout_rate 0.15 --epoch 1 --lm_alpha 0.75 --lm_beta 1.85 --export_dir /home/shop/DeepSpeech/new/new

WARNING:tensorflow:From /home/shop/.local/lib/python3.6/site-packages/tensorflow/python/data/ops/dataset_ops.py:429: py_func (from tensorflow.python.ops.script_ops) is deprecated and will be removed in a future version.
Instructions for updating:
tf.py_func is deprecated in TF V2. Instead, use
    tf.py_function, which takes a python function which manipulates tf eager
    tensors instead of numpy arrays. It's easy to convert a tf eager tensor to
    an ndarray (just call tensor.numpy()) but having access to eager tensors
    means `tf.py_function`s can use accelerators such as GPUs as well as
    being differentiable using a gradient tape.
    
Traceback (most recent call last):
  File "DeepSpeech.py", line 844, in <module>
    tfv1.app.run(main)
  File "/home/shop/.local/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "DeepSpeech.py", line 828, in main
    train()
  File "DeepSpeech.py", line 411, in train
    iterator = tfv1.data.Iterator.from_structure(tfv1.data.get_output_types(train_set),
AttributeError: module 'tensorflow._api.v1.compat.v1.data' has no attribute 'get_output_types'

Also tried different numpy versions 1.13.3 and 1.15.4. Any ideas? Thanks!

@lissyx
Collaborator

lissyx commented Aug 14, 2019

@superdutyf3 Are you sure you checked out the v0.5.1 tag?

@Taco-Network
Author

we are using deepspeech 0.5.1.

pip3 install 'deepspeech == 0.5.1'

and removed alpha build

@lissyx
Collaborator

lissyx commented Aug 14, 2019

> we are using deepspeech 0.5.1.
>
> pip3 install 'deepspeech == 0.5.1'
>
> and removed alpha build

You are mixing things up. That pip package is for inference, not for training. Please make sure you run git checkout v0.5.1 in your git clone before running python DeepSpeech.py.
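
To confirm that the training code in the clone really is at the release tag, a small sketch like the following could help. It assumes `git` is on the PATH and relies on `git describe --tags --exact-match`, which prints a tag name only when HEAD sits exactly on a tag. The helper names are hypothetical.

```python
import subprocess


def current_tag(repo_dir="."):
    """Return the tag HEAD points at exactly, or None if there is none."""
    try:
        result = subprocess.run(
            ["git", "describe", "--tags", "--exact-match"],
            cwd=repo_dir,
            stdout=subprocess.PIPE,
            stderr=subprocess.DEVNULL,
            universal_newlines=True,
        )
    except OSError:
        # git is not installed, or repo_dir does not exist
        return None
    if result.returncode != 0:
        # HEAD is not exactly on a tag, or this is not a git repository
        return None
    return result.stdout.strip()


def on_expected_tag(tag, expected="v0.5.1"):
    """True when the reported tag is the release the checkpoint was made with."""
    return tag == expected
```

Running `on_expected_tag(current_tag("/path/to/DeepSpeech"))` from a script returns `False` when the clone is still on master (the path here is a placeholder).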

@Taco-Network
Author

That worked!
git checkout -b v0.5.1
Successfully restored the checkpoint and am now continuing to train the model.
Sorry for the misunderstanding and thank you very much for your help!

@JRMeyer
Contributor

JRMeyer commented Aug 15, 2019

EDIT: @reuben addresses this problem on Discourse here: https://discourse.mozilla.org/t/error-on-loading-0-5-1-checkpoints-with-current-master-deepspeech-codebase/43585


Original post:

@kdavis-mozilla, the solution discussed here is to downgrade DeepSpeech to v0.5.1.

However, is there a solution to re-export checkpoints from a trained v0.5.1 model using v0.6.0 code, resulting in a v0.6.0 model?

We're trying this now, but exporting breaks with the same error as above, specifically:

tensorflow.python.framework.errors_impl.NotFoundError: Key cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/bias not found in checkpoint

Also, downgrading TensorFlow to 1.13.1 with DeepSpeech v0.6.0 breaks as well, with the same error.

So, v0.6.0 seems to be completely breaking compatibility, yes?
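
The error means the graph asks for a variable name that the checkpoint does not contain. A rough sketch for seeing which names differ is shown below. It assumes `tf.train.list_variables`, which returns `(name, shape)` pairs for a checkpoint directory or file. The helper names are hypothetical, and the list of graph variable names has to come from your own training graph.

```python
def missing_from_checkpoint(graph_keys, checkpoint_keys):
    """Return the names the graph expects but the checkpoint lacks, sorted."""
    return sorted(set(graph_keys) - set(checkpoint_keys))


def checkpoint_variable_names(checkpoint_path):
    """List the variable names stored in a checkpoint (hypothetical helper)."""
    import tensorflow as tf  # imported lazily so the comparison works without it
    return [name for name, _shape in tf.train.list_variables(checkpoint_path)]
```

Passing the graph's variable names together with `checkpoint_variable_names(checkpoint_dir)` to `missing_from_checkpoint` lists every key that would trigger the NotFoundError, not just the first one reported.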

@lissyx
Collaborator

lissyx commented Aug 16, 2019

> So, v0.6.0 seems to be completely breaking compatibility, yes?

Yes, that's why it's a 0.6 and not 0.5.2 :)

@lock

lock bot commented Sep 15, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked and limited conversation to collaborators Sep 15, 2019