Confusion regarding data #164

Open
kr-sundaram opened this issue Aug 7, 2020 · 1 comment


@kr-sundaram

Thank you very much for making the repository public!

I have one point of confusion about the train and valid files in the mono and para directories, which are used for pre-training and fine-tuning the model on the NMT task.

As stated in the README, I understand that dict.en.txt and dict.zh.txt should be exactly the same in both the mono and para directories, and that the para directory should contain bilingual data for fine-tuning the model. My confusion is about the mono directory: how many examples should the train and valid files contain for each of the two languages?

In the mono directory, both the number of sentences and the sentences themselves can differ between the two languages, right? I mean, it should not matter if one uses, let's say, 100 sentences for en and 200 sentences for zh, since they are just collections of monolingual data.

The only point to note is that the mono and para directories should share the same dictionary files, right?

@StillKeepTry
Contributor

Yes. The number of sentences in the monolingual data does not matter. And for each language, the mono and para directories share the same dictionary.
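For anyone who wants to double-check the shared-dictionary requirement before training, here is a minimal sketch that compares the dictionary files byte for byte. It is not part of this repository: the helper names `file_digest` and `dictionaries_match` are made up for illustration, and the only assumption about the layout is that both directories contain `dict.en.txt` and `dict.zh.txt`, as described above.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's raw bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def dictionaries_match(mono_dir, para_dir, langs=("en", "zh")):
    """Map each language to True if dict.<lang>.txt is byte-identical
    in both directories (hypothetical helper, not a repository API)."""
    result = {}
    for lang in langs:
        name = f"dict.{lang}.txt"
        mono_digest = file_digest(Path(mono_dir) / name)
        para_digest = file_digest(Path(para_dir) / name)
        result[lang] = mono_digest == para_digest
    return result
```

Calling `dictionaries_match("mono", "para")` with your own directory paths returns a per-language mapping; any `False` entry means the two copies of that dictionary differ. The sentence counts of the monolingual files are deliberately not compared, since they are allowed to differ.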
