Create script to automatically download Universal Dependencies datasets #1

Closed
Hyperparticle opened this issue Aug 15, 2018 · 5 comments

@Hyperparticle
Owner

Hyperparticle commented Aug 15, 2018

A nice convenience would be to automatically download any UD dataset and preprocess it for training. Ideally, the user would select a language and dataset_name, and the script would locate the corresponding GitHub repo and download the data with urllib. This may require some web scraping to find the GitHub page, and a regex to match the train, dev, and test files.
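
As a rough illustration of the regex step, here is a minimal sketch that picks the train, dev, and test files out of a repo file listing. It assumes the `<prefix>-ud-<split>.conllu` naming scheme (for example `en_ewt-ud-train.conllu`); the names `SPLIT_PATTERN` and `find_split_files` are made up for this example and are not part of any existing script.

```python
import re

# Assumption: UD data files are named <prefix>-ud-<split>.conllu,
# e.g. en_ewt-ud-train.conllu. Other files (README, .txt) are ignored.
SPLIT_PATTERN = re.compile(
    r"^(?P<prefix>[a-z_]+)-ud-(?P<split>train|dev|test)\.conllu$"
)


def find_split_files(filenames):
    """Map each split name (train/dev/test) to its matching file name."""
    splits = {}
    for name in filenames:
        match = SPLIT_PATTERN.match(name)
        if match:
            splits[match.group("split")] = name
    return splits


# Hypothetical listing of a treebank repo's top-level files.
listing = [
    "README.md",
    "LICENSE.txt",
    "en_ewt-ud-train.conllu",
    "en_ewt-ud-dev.conllu",
    "en_ewt-ud-test.conllu",
    "en_ewt-ud-test.txt",
]
print(find_split_files(listing))
```

Not every treebank ships all three splits, so a caller would probably want to check which keys are present before training.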

@foxik
Collaborator

foxik commented Aug 16, 2018

If I recall correctly, the only official source of UD data is the LINDAT release -- the GitHub repos are usually used only for development (i.e., they are not required to contain a branch or tag with the latest release).

@dan-zeman Am I right, or is it possible to get the stable releases from GitHub?

@Hyperparticle
Owner Author

@foxik I was not aware that UD is hosted on LINDAT. Going through the UD website, I could not find any links to LINDAT datasets; there were just the GitHub repos.

@dan-zeman

Hmm, maybe we should think of making this more explicit and visible on the UD website. I can see how you can overlook it if you are looking for just one language and never go further once you click on the language... But in fact, the information is quite explicit on the title page below the flags. If you scroll far enough, or if you hit CTRL+F and type "download", you will end up at the Download section and see the link to LINDAT. Note that you get all languages in one big package; you cannot download just one selected language.
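
Since the release comes as a single package, a script would have to pull one treebank out of it after downloading. The following is only a sketch of that step, assuming the package is a tar archive already saved locally and that each treebank sits in its own directory (such as `UD_English-EWT`) somewhere inside it. The function name `extract_treebank` and its parameters are hypothetical.

```python
import os
import shutil
import tarfile


def extract_treebank(archive_path, treebank, dest_dir):
    """Copy the files of one treebank directory out of a release archive.

    Files are written flat into dest_dir/<treebank>/, and the paths of the
    written files are returned in sorted order.
    """
    written = []
    with tarfile.open(archive_path, "r:*") as archive:
        for member in archive:
            parts = member.name.split("/")
            # Keep only regular files that live under the treebank directory.
            if not member.isfile() or treebank not in parts[:-1]:
                continue
            target = os.path.join(dest_dir, treebank, parts[-1])
            os.makedirs(os.path.dirname(target), exist_ok=True)
            with archive.extractfile(member) as src, open(target, "wb") as dst:
                shutil.copyfileobj(src, dst)
            written.append(target)
    return sorted(written)
```

A call might look like `extract_treebank("ud-treebanks.tgz", "UD_English-EWT", "data")`, where the archive file name is a placeholder for whatever the downloaded package is called.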

Otherwise, it is actually possible to get stable releases from GitHub, although it is not the preferred way (because we want download statistics in one place, i.e., LINDAT). Since we first learned that some people just take their data from GitHub and write papers about it, we reversed the branch logic, and now we try to make sure that the contents of the master branch of each repo always correspond to the most recent official release, while all fixes in the meantime happen in the dev branch. You still cannot be 100% certain that you get the right data if a treebank was released in the past, then became invalid due to stricter validation rules, was not fixed, and was not included in the last release.
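
For the GitHub route, a download script could pin itself to the master branch so that it does not pick up in-progress fixes from dev. The sketch below only builds the URLs and defines a download helper. It assumes that treebank repos live under the `UniversalDependencies` organization, that raw files are served from `raw.githubusercontent.com`, and that files follow the `<prefix>-ud-<split>.conllu` naming. The function names are hypothetical.

```python
import os
from urllib.parse import quote
from urllib.request import urlretrieve

# Assumption: treebank repos are hosted under this GitHub organization.
RAW_BASE = "https://raw.githubusercontent.com/UniversalDependencies"


def raw_file_url(treebank, filename, branch="master"):
    """Build the raw-content URL of one file in a treebank repo."""
    return f"{RAW_BASE}/{quote(treebank)}/{quote(branch)}/{quote(filename)}"


def split_urls(treebank, prefix, branch="master"):
    """Return URLs for the train, dev, and test files of a treebank."""
    return {
        split: raw_file_url(treebank, f"{prefix}-ud-{split}.conllu", branch)
        for split in ("train", "dev", "test")
    }


def download_splits(treebank, prefix, dest_dir, branch="master"):
    """Download the split files into dest_dir (requires network access)."""
    os.makedirs(dest_dir, exist_ok=True)
    paths = {}
    for split, url in split_urls(treebank, prefix, branch).items():
        path = os.path.join(dest_dir, url.rsplit("/", 1)[-1])
        urlretrieve(url, path)  # raises an error if the file is missing
        paths[split] = path
    return paths


# Building the URLs does not touch the network.
for split, url in split_urls("UD_English-EWT", "en_ewt").items():
    print(split, url)
```

Given the caveat about treebanks that dropped out of a release, a missing file or an outdated master branch is still possible, so the LINDAT package remains the safer source.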

@Hyperparticle
Owner Author

@dan-zeman Ah, I never noticed the download section at the bottom of the page, thanks! This should simplify things. And I agree, the download section should be more prominent (perhaps mentioned at or moved to the top of the page).

@dan-zeman

I have added a link to the Download section from each treebank's section. I hope this helps people find it in the future.


3 participants