List of datasets | Monolingual raw files #13

Closed
jchwenger opened this issue May 12, 2020 · 5 comments

@jchwenger

Hi,

I just stumbled across this tool, which looks promising!

I've been wondering:

  • is there an option to display the list of available datasets (and their available languages) on OPUS? That'd be great for browsing & programmatically downloading a bunch/all of them.
  • also: is there an option to download the raw monolingual files? I'm looking into unsupervised language modeling, so those are the files I'd be using.

Thanks a lot in advance!

@miau1
Member

miau1 commented May 13, 2020

Yes, you can use opus_get for that. To list available raw datasets for a language:
opus_get --source <lang_id> --preprocess raw --list
And to download them, just remove the --list flag:
opus_get --source <lang_id> --preprocess raw
You can find more examples here: https://github.com/Helsinki-NLP/OpusTools/blob/master/opustools_pkg/README.md#opus_get
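
For example, if you want to drive this from Python, you could loop over a few languages and call the command via subprocess (a rough sketch; the language codes below are just examples, and opus_get may still ask for confirmation before actually downloading):

# sketch: list (and optionally fetch) raw datasets for several languages
import subprocess

langs = ["fr", "de", "fi"]  # example language ids, pick your own

for lang in langs:
    # list the available raw datasets for this language
    subprocess.run(["opus_get", "--source", lang, "--preprocess", "raw", "--list"], check=True)
    # drop --list to download instead (opus_get may prompt before downloading):
    # subprocess.run(["opus_get", "--source", lang, "--preprocess", "raw"], check=True)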

@jchwenger
Author

Amazing! Thank you so much!! Silly me for not reading properly. That's just great, it works nicely.

One last question, however. I notice that the 'raw' category gives me XML files; you can try:

opus_get --source fr --preprocess raw -d bible-uedin -dl test

I'm after the untokenized raw text, if at all possible. I can get it through the API by doing something like:

corpus="EUconst"
curl "http://opus.nlpl.eu/opusapi/?corpus=$corpus&source=fr&preprocessing=mono&version=latest" \
  | jq '.corpora[1].url' \
  | xargs wget

(The API returns an object with a "corpora" key, and the second entry in that list is the raw, untokenized monolingual text file.)

Cf. here.
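
For reference, here is a rough Python equivalent of that curl/jq/wget pipeline (the "corpora" key and the [1] index are simply taken from the example above, and deriving the output file name from the URL is my own choice, so treat this as a sketch rather than a guaranteed API contract):

import requests

corpus = "EUconst"
params = {"corpus": corpus, "source": "fr", "preprocessing": "mono", "version": "latest"}
resp = requests.get("http://opus.nlpl.eu/opusapi/", params=params)
resp.raise_for_status()

# in this example response, the second entry is the raw, untokenized monolingual file
url = resp.json()["corpora"][1]["url"]

# stream the file to disk under its original name
filename = url.rsplit("/", 1)[-1]
with requests.get(url, stream=True) as dl:
    dl.raise_for_status()
    with open(filename, "wb") as out:
        for chunk in dl.iter_content(chunk_size=1 << 16):
            out.write(chunk)
print("downloaded", filename)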

Any chance you might integrate that option in the future? The list functionality is great; with the sizes displayed as well, I'll be able to use it programmatically (to sort the download queue by size).

@jchwenger
Author

In case it's useful, I made this little script that uses the Python utility to download the mono raw files.

@miau1
Member

miau1 commented May 14, 2020

Ok, I see. Yes, the raw files are XML files that contain untokenized text. We could probably make the raw non-XML files downloadable with opus_get as well; good suggestion!

There is also the opus_cat script that can be used to get plain monolingual text from a given corpus:
opus_cat --directory <corpus_name> --language <lang_id> --plain --no_ids
opus_cat was originally designed for taking a quick look at a corpus, so its output always includes file names and is always tokenized. In the future, we will probably also add options to remove the file names and to output untokenized text.
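
For example, reusing the corpus and language from earlier in this thread, something like the following captures the output into a file (a sketch; the output file name is arbitrary, and opus_cat may ask to download the corpus archive if it isn't already available locally):

import subprocess

# capture opus_cat's stdout into a plain-text file
with open("bible-uedin.fr.txt", "w", encoding="utf-8") as out:
    subprocess.run(
        ["opus_cat", "--directory", "bible-uedin", "--language", "fr", "--plain", "--no_ids"],
        stdout=out,
        check=True,
    )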

@jchwenger
Author

Sounds good, thanks for your answer! It's awesome that you developed this utility, very useful. As you can see, I could get what I wanted with a bit of Python scripting, but thanks for pointing out opus_cat too; I'm sure it'll come in handy for me in the future.
