Changes concerning the handling of vocab file #64
There are a few minor changes proposed in this PR:
- The vocab file is named `vocab.ende.32768` here. It would be better to have the actual size of the vocab file in the suffix. Since `_TARGET_VOCAB_SIZE` is only the target number of subtokens for the vocab file, it would be better to perform a binary search while enforcing the `min_count` (see the sketch after this list).
- `vocab_size` is set in model_params.py, and this value of `vocab_size` is used for setting the dimensions of `shared_weights` in embedding_layer.py. Since `embedding_lookup` is performed on this `shared_weights` variable by index, the model would crash on a non-GPU system. The reason it works on a GPU-enabled system is that `tf.gather` returns 0, rather than raising an error, when an out-of-bound index is found (a small repro is sketched after this list). Returning zero for an out-of-bound index might cause problems for training runs and can also increase variance in convergence. For the current dataset and implementation of the tokenizer, we get a `vocab_size` of `33945`, so it would be safe to go with this value. A better way would be to parse the vocab size out of the vocab file's suffix at runtime (also sketched below).
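A rough sketch of the binary search suggested in the first point, under the assumption that the tokenizer exposes some way to build a vocab for a given `min_count`; `build_vocab_with_min_count` below is a hypothetical stand-in for that step, not an existing function in the repo:

```python
def search_min_count(build_vocab_with_min_count, target_vocab_size,
                     low=1, high=1000):
    """Pick the largest min_count whose vocab still reaches the target size.

    build_vocab_with_min_count: hypothetical callable, min_count -> vocab list.
    """
    best_vocab = None
    while low <= high:
        mid = (low + high) // 2
        vocab = build_vocab_with_min_count(mid)
        if len(vocab) >= target_vocab_size:
            # Vocab is still large enough; try a stricter (larger) min_count.
            best_vocab, low = vocab, mid + 1
        else:
            # Vocab fell below the target; relax min_count.
            high = mid - 1
    return best_vocab
```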
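A minimal repro of the CPU/GPU discrepancy described in the second point, written in the TF 1.x style used by the model at the time (the sizes are made up for illustration):

```python
import tensorflow as tf

vocab_size = 4      # stands in for shared_weights.shape[0]
hidden_size = 8
shared_weights = tf.get_variable("weights", [vocab_size, hidden_size])

# Token id 5 is out of range for a [4, hidden_size] embedding table.
ids = tf.constant([1, 5])
embeddings = tf.gather(shared_weights, ids)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # On GPU: the out-of-range row silently comes back as zeros.
    # On CPU: this raises InvalidArgumentError ("indices[1] = 5 is not in [0, 4)").
    print(sess.run(embeddings))
```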
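And a sketch of the runtime alternative mentioned at the end: deriving `vocab_size` from the vocab file instead of hard-coding it in model_params.py. The filename pattern and the one-subtoken-per-line layout are assumptions here; `infer_vocab_size` is a hypothetical helper, not part of the current code:

```python
import os
import re

def infer_vocab_size(vocab_file):
    """Prefer the numeric suffix (e.g. vocab.ende.33945); fall back to counting lines."""
    match = re.search(r"\.(\d+)$", os.path.basename(vocab_file))
    if match:
        return int(match.group(1))
    with open(vocab_file) as f:
        return sum(1 for _ in f)
```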