Changes concerning the handling of vocab file #64
There are a few minor changes proposed in this PR:
- The vocab file is named `vocab.ende.32768` here. It would be better to have the actual size of the vocab file in the suffix. Since `_TARGET_VOCAB_SIZE` is only the target number of subtokens for the vocab file, it would be better to perform a binary search while enforcing the `min_count` (see the sketch after this list).
- `vocab_size` is set in model_params.py, and this value of `vocab_size` is used for setting the dimensions of `shared_weights` in embedding_layer.py. Since `embedding_lookup` is performed on this `shared_weights` variable by index, the model would crash on a non-GPU system. The reason it works on a GPU-enabled system is that `tf.gather` returns 0, rather than raising an error, when an out-of-bound index is found (a small repro is sketched after this list). Returning zero for an out-of-bound index might cause problems for training runs and can also increase variance in convergence. For the current dataset and implementation of the tokenizer, we get a `vocab_size` of `33945`, so it would be safe to go with this value. A better way would be to parse the vocab size out of the vocab file's suffix at runtime (also sketched below).
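A rough sketch of the binary search suggested in the first point, under the assumption that the tokenizer exposes some way to build a vocab for a given `min_count`; `build_vocab_with_min_count` below is a hypothetical stand-in for that step, not an existing function in the repo:

```python
def search_min_count(build_vocab_with_min_count, target_vocab_size,
                     low=1, high=1000):
    """Pick the largest min_count whose vocab still reaches the target size.

    build_vocab_with_min_count: hypothetical callable, min_count -> vocab list.
    """
    best_vocab = None
    while low <= high:
        mid = (low + high) // 2
        vocab = build_vocab_with_min_count(mid)
        if len(vocab) >= target_vocab_size:
            # Vocab is still large enough; try a stricter (larger) min_count.
            best_vocab, low = vocab, mid + 1
        else:
            # Vocab fell below the target; relax min_count.
            high = mid - 1
    return best_vocab
```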
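A minimal repro of the CPU/GPU discrepancy described in the second point, written in the TF 1.x style used by the model at the time (the sizes are made up for illustration):

```python
import tensorflow as tf

vocab_size = 4      # stands in for shared_weights.shape[0]
hidden_size = 8
shared_weights = tf.get_variable("weights", [vocab_size, hidden_size])

# Token id 5 is out of range for a [4, hidden_size] embedding table.
ids = tf.constant([1, 5])
embeddings = tf.gather(shared_weights, ids)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # On GPU: the out-of-range row silently comes back as zeros.
    # On CPU: this raises InvalidArgumentError ("indices[1] = 5 is not in [0, 4)").
    print(sess.run(embeddings))
```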
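And a sketch of the runtime alternative mentioned at the end: deriving `vocab_size` from the vocab file instead of hard-coding it in model_params.py. The filename pattern and the one-subtoken-per-line layout are assumptions here; `infer_vocab_size` is a hypothetical helper, not part of the current code:

```python
import os
import re

def infer_vocab_size(vocab_file):
    """Prefer the numeric suffix (e.g. vocab.ende.33945); fall back to counting lines."""
    match = re.search(r"\.(\d+)$", os.path.basename(vocab_file))
    if match:
        return int(match.group(1))
    with open(vocab_file) as f:
        return sum(1 for _ in f)
```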