Confusion regarding data #164

kr-sundaram · 2020-08-07T20:00:04Z

Thank you very much for making the repository public!

I have a question about the train and valid files in the mono and para directories used for model pre-training and fine-tuning on the NMT task.

As stated in the README, I understand that dict.en.txt and dict.zh.txt should be exactly the same in both the mono and para directories, and that the para directory should contain the bilingual data used to fine-tune the model. My confusion is about the mono directory: how many examples should the train and valid files contain for each language?

The number of sentences, and the sentences themselves, can differ between the two languages in the mono directory, right? I mean, it should not matter if one uses, say, 100 sentences for en and 200 sentences for zh, since they are just monolingual data.

The only point to note is that the mono and para directories should share the same dictionary files, right?

StillKeepTry · 2020-08-10T12:01:29Z

Yes. The number of sentences in the monolingual data does not matter, so it can differ between languages. And each language shares the same dictionary across the mono and para directories.
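
For anyone who wants to verify the shared-dictionary point before training, here is a minimal sketch of one way to do it. It assumes only the file names mentioned above (dict.en.txt and dict.zh.txt) inside a mono and a para directory; the function name `dictionaries_match` and the directory paths are hypothetical and not part of the repository.

```python
import filecmp
from pathlib import Path


def dictionaries_match(mono_dir, para_dir, langs=("en", "zh")):
    """Map each language to True if dict.<lang>.txt is byte-identical in both directories.

    Hypothetical helper for illustration; not part of the repository.
    """
    result = {}
    for lang in langs:
        name = f"dict.{lang}.txt"
        mono_dict = Path(mono_dir) / name
        para_dict = Path(para_dir) / name
        # shallow=False compares file contents instead of only os.stat() metadata.
        result[lang] = filecmp.cmp(mono_dict, para_dict, shallow=False)
    return result
```

Calling `dictionaries_match("mono", "para")` returns one boolean per language. Sentence counts are deliberately not compared, since the monolingual files may differ in size between languages.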
