
Conversation

@tremblerz
Contributor

There are a few minor changes proposed in this PR:

  1. Currently, the vocab filename is hardcoded as vocab.ende.32768 here. It would be better to include the actual size of the vocab file in the suffix.
  2. Since the tokenizer searches for _TARGET_VOCAB_SIZE subtokens when building the vocab file, it would be better to perform the binary search while enforcing min_count.
  3. Currently, we maintain vocab_size in model_params.py, and this value is used to set the dimensions of shared_weights in embedding_layer.py. Since embedding_lookup is performed on this shared_weights variable by index, the model crashes on a non-GPU system. It works on a GPU-enabled system only because tf.gather returns 0, rather than raising an error, when it encounters an out-of-bounds index. Returning zero for out-of-bounds indices can cause problems for training runs and can also increase variance in convergence. For the current dataset and tokenizer implementation, the vocab size comes out to 33945, so it would be safe to go with that value. A better approach would be to parse the vocab size out of the vocab file suffix at runtime, as sketched below.
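A minimal sketch of the runtime parsing suggested in item 3 (the helper name and the exact filename pattern are assumptions for illustration, not part of the existing repo):

```python
import os
import re


def vocab_size_from_filename(vocab_path):
  """Parse the subtoken vocab size from a filename like 'vocab.ende.33945'.

  Hypothetical helper: assumes the vocab size is the numeric suffix the
  tokenizer appends when it writes the vocab file.
  """
  match = re.search(r"\.(\d+)$", os.path.basename(vocab_path))
  if match is None:
    raise ValueError("Cannot parse vocab size from filename: %s" % vocab_path)
  return int(match.group(1))


# The parsed value would then replace the hardcoded vocab_size in
# model_params.py, so the shared_weights table in embedding_layer.py is
# always large enough for every subtoken index (avoiding tf.gather's
# silent zero return for out-of-bounds indices on GPU and the crash on CPU).
print(vocab_size_from_filename("vocab.ende.33945"))  # -> 33945
```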

@ddkang
Contributor

ddkang commented Jun 12, 2018

@bitfort

@bitfort

bitfort commented Jun 13, 2018

Hi tremblerz, thanks for reaching out! We're really excited to see your interest in MLPerf.

I think you bring up an interesting point, and indeed we haven't tested this on CPUs. Your PR is helpful for us to dig into this issue, and we are actively looking at CPU support. Prior to making any changes to the code, I'll have to speak with the working group and discuss this issue.

In the meantime, I'll inquire with the author to get more feedback on this.

Best,
Victor

@tremblerz
Contributor Author

Hi @bitfort, any updates?

@bitfort

bitfort commented Jul 2, 2018

We looked over the change, and it makes sense. I think we will generally move in this direction. I believe the author had some ideas for dynamically determining the size. I think you hit on an important point.

I'm adding this as a topic to discuss in an upcoming meeting. Unfortunately we are moving a little slowly right now :)

Victor

@tremblerz
Contributor Author

Hi @bitfort, any update on this one?

@jiyonghe

Done, many thanks.

@linrio

linrio commented Jun 17, 2019

Fixed it in vocab_size

@matthew-frank
Contributor

In an effort to do a better job maintaining this repo, we're closing PRs for retired benchmarks. The old benchmark code still exists, but has been moved to https://github.com/mlcommons/training/tree/master/retired_benchmarks/transformer.

If you think there is useful cleanup to be done to the retired_benchmarks subtree, please submit a new PR.

github-actions bot locked and limited conversation to collaborators Dec 2, 2022
