How is your WMT16 EN-Ro Dataset Preprocessed? #10

Closed
ictnlp-wshugen opened this issue Nov 26, 2019 · 1 comment

ictnlp-wshugen commented Nov 26, 2019

Thank you for providing the preprocessed dataset.
Could you please tell me how your WMT16 EN-Ro dataset was preprocessed?
How did you get from the raw 612422 sentence pairs down to 608319 sentence pairs?
Also, it seems that the dataset (En-Ro) has been shuffled or reorganized?

jaseleephd (Collaborator) commented

We used the preprocessing scripts provided by Rico Sennrich, which filter out sentences that are too long or too short.
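
For illustration, here is a minimal Python sketch of that kind of length-based filtering (in the spirit of Moses-style corpus cleaning). The thresholds (max length 80, ratio 9) and the toy data are assumptions for the example, not the exact settings used for this dataset:

```python
# Minimal sketch of length-based parallel corpus filtering.
# Thresholds below are illustrative assumptions, not the actual script settings.

def filter_parallel_corpus(src_lines, tgt_lines, min_len=1, max_len=80, max_ratio=9.0):
    """Keep sentence pairs whose token counts lie in [min_len, max_len]
    and whose source/target length ratio does not exceed max_ratio."""
    kept = []
    for src, tgt in zip(src_lines, tgt_lines):
        ls, lt = len(src.split()), len(tgt.split())
        if not (min_len <= ls <= max_len and min_len <= lt <= max_len):
            continue  # drop pairs where a side is too short or too long
        if max(ls, lt) / max(min(ls, lt), 1) > max_ratio:
            continue  # drop pairs whose lengths are wildly mismatched
        kept.append((src, tgt))
    return kept

# Toy usage: the second pair is dropped because the target side is empty.
pairs = filter_parallel_corpus(
    ["this is a sentence .", "hello"],
    ["aceasta este o propozitie .", ""],
)
print(len(pairs))  # -> 1
```

Filtering of this kind is what accounts for the drop from the raw pair count to the smaller released count.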
