BPE code used in source and target data #12

Closed
xixiddd opened this issue Nov 6, 2018 · 1 comment
xixiddd commented Nov 6, 2018

Hi, Shamil Chollampatt.
In the Model and Training Details section of your paper, you state that "Each of the source and target vocabularies consists of 30K most frequent BPE tokens from the source and target side of the parallel data, respectively." However, according to this line in the preprocessing scripts (i.e., training/preprocess.sh), it appears that you learn the BPE codes from the target-side data only and then apply them to both the source and target data.
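For reference, here is a minimal sketch of the pattern being described, assuming the subword-nmt toolkit (which the preprocessing script invokes); the file names are illustrative placeholders, not the actual paths used in the repo:

```bash
# Learn BPE merge operations from the TARGET side only, then apply the
# same codes to BOTH sides (file names are hypothetical).
NUM_OPS=30000

# 1) Learn 30K merge operations from the target (corrected) text.
subword-nmt learn-bpe -s $NUM_OPS < train.tgt > bpe.codes

# 2) Segment both source and target with the same learned codes.
subword-nmt apply-bpe -c bpe.codes < train.src > train.src.bpe
subword-nmt apply-bpe -c bpe.codes < train.tgt > train.tgt.bpe
```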


shamilcm commented Nov 9, 2018

The BPE model is trained with 30,000 merge operations on the target side of the training data, as in the line you pointed to. The source and target vocabularies for the encoder-decoder model, however, consist of the 30,000 most frequent subwords (i.e., BPE-segmented tokens) from the source and target sides of the parallel data, respectively (see line). In other words, the merge operations are learned once from the target side, while the two vocabularies are extracted separately from each side after segmentation.
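To illustrate the distinction between the single set of merge operations and the two per-side vocabularies: the repo builds its dictionaries in its own preprocessing step (the linked line), but an equivalent vocabulary extraction could be sketched in plain shell as follows, assuming the BPE-segmented files from the sketch above:

```bash
# Count subword frequencies on each side of the BPE-segmented parallel
# data and keep the 30K most frequent types as that side's vocabulary.
# This mirrors the described behavior; it is not the repo's actual code.
VOCAB_SIZE=30000

tr ' ' '\n' < train.src.bpe | sort | uniq -c | sort -rn \
  | head -n $VOCAB_SIZE | awk '{print $2}' > vocab.src

tr ' ' '\n' < train.tgt.bpe | sort | uniq -c | sort -rn \
  | head -n $VOCAB_SIZE | awk '{print $2}' > vocab.tgt
```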

shamilcm closed this as completed Nov 9, 2018