Where is dataset='/ichec/work/dl4mt_data/nec_files/wiki.tok.txt.gz'? #62

Open
JunjieHu opened this issue Oct 6, 2016 · 2 comments

@JunjieHu

JunjieHu commented Oct 6, 2016

Hi,

When I try to run the demo in session0, it expects a wiki dataset that is missing. I went through the data folder and ran the .sh file, which downloads the europarl-v7.fr-en and newstest datasets, but I cannot find the wiki data. Could you point out how to obtain the wiki.tok.txt.gz file?

Thanks

@orhanf
Collaborator

orhanf commented Oct 6, 2016

Hi @JunjieHu. Unfortunately, the scripts provided under the data directory do not download the Wikipedia dumps because of their size (~3.5G), but you can download them manually from here.

If you just want to play with the code, you can use either the fr or the en side of europarl-v7.fr-en instead.
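As a rough illustration of that suggestion, here is a minimal sketch that turns one side of the parallel corpus into a gzipped, space-tokenized text file. The function names are hypothetical, and the regex tokenizer is only a stand-in: this thread does not say which tokenization the demo expects, so treat the output format as an assumption.

```python
import gzip
import re


def simple_tokenize(line):
    """Naive stand-in tokenizer: words and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", line, flags=re.UNICODE)


def write_tokenized_gz(src_path, dst_path):
    """Tokenize a plain-text file line by line and write it gzip-compressed.

    Both paths are supplied by the caller; empty lines are skipped.
    Returns the number of lines written.
    """
    written = 0
    with open(src_path, encoding="utf-8") as src, \
            gzip.open(dst_path, "wt", encoding="utf-8") as dst:
        for line in src:
            tokens = simple_tokenize(line)
            if tokens:
                dst.write(" ".join(tokens) + "\n")
                written += 1
    return written
```

For example, one might call `write_tokenized_gz` on the English file from the europarl download and point the demo's `dataset` argument at the resulting `.gz` file. The exact input file name depends on what the download script produces locally.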

@KevinYuk

Hi @orhanf. Thanks for your earlier comments.

We found many links on the Wikimedia Downloads page (e.g. Database backup dumps, Mirror Sites of the XML dumps provided above, Static HTML dumps, DVD distributions, Analytics data files, Other files, Kiwix files).
Could you please clarify which one we should use? We would appreciate it.

Thanks a lot.
