
Error: DefaultVocabulary file is expected to contain an entry for (...) #297

Open

sdlmw opened this issue Nov 26, 2019 · 7 comments

sdlmw commented Nov 26, 2019

```
[2019-11-25 21:28:58] [config] tied-embeddings-all: false
[2019-11-25 21:28:58] [config] tied-embeddings-src: false
[2019-11-25 21:28:58] [config] train-sets:
[2019-11-25 21:28:58] [config] - /191030/1_ForTrain/train.tok.en
[2019-11-25 21:28:58] [config] - /191030/1_ForTrain/train.tok.zh
[2019-11-25 21:28:58] [config] transformer-aan-activation: swish
[2019-11-25 21:28:58] [config] transformer-aan-depth: 2
[2019-11-25 21:28:58] [config] transformer-aan-nogate: false
[2019-11-25 21:28:58] [config] transformer-decoder-autoreg: self-attention
[2019-11-25 21:28:58] [config] transformer-dim-aan: 2048
[2019-11-25 21:28:58] [config] transformer-dim-ffn: 2048
[2019-11-25 21:28:58] [config] transformer-dropout: 0
[2019-11-25 21:28:58] [config] transformer-dropout-attention: 0
[2019-11-25 21:28:58] [config] transformer-dropout-ffn: 0
[2019-11-25 21:28:58] [config] transformer-ffn-activation: swish
[2019-11-25 21:28:58] [config] transformer-ffn-depth: 2
[2019-11-25 21:28:58] [config] transformer-guided-alignment-layer: last
[2019-11-25 21:28:58] [config] transformer-heads: 8
[2019-11-25 21:28:58] [config] transformer-no-projection: false
[2019-11-25 21:28:58] [config] transformer-postprocess: dan
[2019-11-25 21:28:58] [config] transformer-postprocess-emb: d
[2019-11-25 21:28:58] [config] transformer-preprocess: ""
[2019-11-25 21:28:58] [config] transformer-tied-layers:
[2019-11-25 21:28:58] [config] []
[2019-11-25 21:28:58] [config] type: amun
[2019-11-25 21:28:58] [config] ulr: false
[2019-11-25 21:28:58] [config] ulr-dim-emb: 0
[2019-11-25 21:28:58] [config] ulr-dropout: 0
[2019-11-25 21:28:58] [config] ulr-keys-vectors: ""
[2019-11-25 21:28:58] [config] ulr-query-vectors: ""
[2019-11-25 21:28:58] [config] ulr-softmax-temperature: 1
[2019-11-25 21:28:58] [config] ulr-trainable-transformation: false
[2019-11-25 21:28:58] [config] valid-freq: 10000u
[2019-11-25 21:28:58] [config] valid-max-length: 1000
[2019-11-25 21:28:58] [config] valid-metrics:
[2019-11-25 21:28:58] [config] - cross-entropy
[2019-11-25 21:28:58] [config] valid-mini-batch: 32
[2019-11-25 21:28:58] [config] vocabs:
[2019-11-25 21:28:58] [config] - /191030/2_ForTune/Tune.en
[2019-11-25 21:28:58] [config] - /191030/2_ForTune/Tune.zh
[2019-11-25 21:28:58] [config] word-penalty: 0
[2019-11-25 21:28:58] [config] workspace: 2048
[2019-11-25 21:28:58] [config] Model is being created with Marian v1.7.6 1d4ba73 2019-05-11 17:16:31 +0100
[2019-11-25 21:28:58] Using single-device training
[2019-11-25 21:28:58] [data] Loading vocabulary from text file /191030/2_ForTune/Tune.en
[2019-11-25 21:28:58] Error: DefaultVocabulary file /191030/2_ForTune/Tune.en is expected to contain an entry for
[2019-11-25 21:28:58] Error: Aborted from marian::DefaultVocab::load(const string&, size_t)::<lambda(const string&, const string&, marian::Word)> in /marian/src/data/default_vocab.cpp:154

[CALL STACK]
[0x592ded]
[0x594c0c]
[0x586ed5]
[0x587c7b]
[0x59cccc]
[0x5a8771]
[0x4d071d]
[0x4f9392]
[0x42e213]
[0x40c0da]
[0x7f8724bb9830] __libc_start_main + 0xf0
[0x42b7f9]
```

I got this error while training the engine. How can I solve this problem?

emjotde (Member) commented Nov 26, 2019

Hi,
did you create the vocabulary somehow by hand? It seems to be missing some symbols.

sdlmw (Author) commented Nov 26, 2019

Hi emjotde,
I just extracted part of the training set to use as the vocabulary.

emjotde (Member) commented Nov 26, 2019

You would need to use the marian_vocab binary for this. Marian requires a number of special symbols; if those are missing, you will get errors like this.
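
For example, something along these lines should work (marian-vocab reads the tokenized training text on stdin and writes the vocabulary to stdout; the output file names are only illustrative, and the input paths are the ones from the config above):

```
# Sketch: build one vocabulary per language with marian-vocab.
# Output names (vocab.en.yml, vocab.zh.yml) are only illustrative.
cat /191030/1_ForTrain/train.tok.en | ./marian/build/marian-vocab > vocab.en.yml
cat /191030/1_ForTrain/train.tok.zh | ./marian/build/marian-vocab > vocab.zh.yml
# Then point the vocabs: entries of the training config at these files.
```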

sdlmw (Author) commented Nov 26, 2019

OK, I will retry with it:

usage: ./marian/build/marian-vocab [OPTIONS]

snukky changed the title from "Aborted (core dumped)" to "Error: DefaultVocabulary file is expected to contain an entry for (...)" on Dec 9, 2019
snukky (Member) commented Dec 9, 2019

Closing; the question has been answered.

snukky closed this as completed on Dec 9, 2019
snukky added the faq label on Dec 9, 2019
frankseide reopened this on Dec 9, 2019
frankseide (Contributor) commented:

Actually, there is no need to use marian_vocab. You just need these three entries, by convention at the start of the vocab:

<unk>
<s>
</s>

I create my vocabs with something like this:

echo -e '<unk>\n<s>\n</s>' > VOCAB
cat CORPUS \
  | tr ' \r' '\n' \
  | sort -u | grep . \
  >> VOCAB
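
If a vocabulary was built some other way, a quick check along these lines (illustrative, using the VOCAB name from the snippet above) shows whether the special entries are present:

```
# Sanity check (illustrative): the special entries must appear as whole
# lines in the vocabulary, or Marian aborts with the error above.
grep -x -c -e '<unk>' -e '<s>' -e '</s>' VOCAB   # should print 3
```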

I believe marian_vocab creates a JSON file. JSON is not a suitable format for representing vocabularies, because JSON implies UTF-8 encoding. That is problematic because

- I found corpora do contain invalid UTF-8, causing malformed JSON files;
- Marian is encoding-agnostic.

sappho192 commented:

> Actually, there is no need to use marian_vocab. You just need these three entries, by convention at the start of the vocab:
>
> <unk>
> <s>
> </s>

@frankseide Thank you so much! You saved my day!
