List of datasets | Monolingual raw files #13

Closed
jchwenger opened this issue May 12, 2020 · 5 comments

@jchwenger

Hi,

I just stumbled across this tool, which looks promising!

I've been wondering:

  • is there an option to display the list of available datasets (and their available languages) on OPUS? That'd be great for browsing & programmatically downloading a bunch/all of them.
  • also: is there an option to download the raw monolingual files? I'm looking into unsupervised language modeling, so those are the files I'd be using.

Thanks a lot in advance!

@miau1
Member

miau1 commented May 13, 2020

Yes, you can use opus_get for that. To list available raw datasets for a language:
opus_get --source <lang_id> --preprocess raw --list
And to download them, just remove the --list flag:
opus_get --source <lang_id> --preprocess raw
You can find more examples here: https://github.com/Helsinki-NLP/OpusTools/blob/master/opustools_pkg/README.md#opus_get
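
For example, if you want to drive this from Python, you could loop over a few languages and call the command via subprocess (a rough sketch; the language codes below are just examples, and opus_get may still ask for confirmation before actually downloading):

# sketch: list (and optionally fetch) raw datasets for several languages
import subprocess

langs = ["fr", "de", "fi"]  # example language ids, pick your own

for lang in langs:
    # list the available raw datasets for this language
    subprocess.run(["opus_get", "--source", lang, "--preprocess", "raw", "--list"], check=True)
    # drop --list to download instead (opus_get may prompt before downloading):
    # subprocess.run(["opus_get", "--source", lang, "--preprocess", "raw"], check=True)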

@jchwenger
Author

Amazing! Thank you so much!! Silly me for not reading properly. That's just great, it works nicely.

One last question, however. I notice that the 'raw' category gives me XML files; you can try:

opus_get --source fr --preprocess raw -d bible-uedin -dl test

I'm after the untokenized raw text, if at all possible. I can get it through the API by doing something like:

corpus="EUconst"
curl "http://opus.nlpl.eu/opusapi/?corpus=$corpus&source=fr&preprocessing=mono&version=latest" \
  | jq '.corpora[1].url' \
  | xargs wget

(The API returns an object with a "corpora" key, and the second entry in that list is the raw, untokenized monolingual text file.)

Cf. here.
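
For reference, here is a rough Python equivalent of that curl/jq/wget pipeline (the "corpora" key and the [1] index are simply taken from the example above, and deriving the output file name from the URL is my own choice, so treat this as a sketch rather than a guaranteed API contract):

import requests

corpus = "EUconst"
params = {"corpus": corpus, "source": "fr", "preprocessing": "mono", "version": "latest"}
resp = requests.get("http://opus.nlpl.eu/opusapi/", params=params)
resp.raise_for_status()

# in this example response, the second entry is the raw, untokenized monolingual file
url = resp.json()["corpora"][1]["url"]

# stream the file to disk under its original name
filename = url.rsplit("/", 1)[-1]
with requests.get(url, stream=True) as dl:
    dl.raise_for_status()
    with open(filename, "wb") as out:
        for chunk in dl.iter_content(chunk_size=1 << 16):
            out.write(chunk)
print("downloaded", filename)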

Any chance you might integrate that option in the future? The list functionality is great; with the sizes displayed as well, I'll be able to use it programmatically (to sort the download queue by size).

@jchwenger
Author

In case it's useful, I made this little script that uses the Python utility to download the mono raw files.

@miau1
Member

miau1 commented May 14, 2020

Ok, I see. Yes, the raw files are XML files that contain untokenized text. We could probably make the raw non-XML files downloadable with opus_get as well; good suggestion!

There is also the opus_cat script that can be used to get plain monolingual text from a given corpus:
opus_cat --directory <corpus_name> --language <lang_id> --plain --no_ids
opus_cat was originally designed for taking a quick look at a corpus, so its output always includes file names and is always tokenized. In the future, we will probably also add options to remove the file names and to output untokenized text.
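
For example, reusing the corpus and language from earlier in this thread, something like the following captures the output into a file (a sketch; the output file name is arbitrary, and opus_cat may ask to download the corpus archive if it isn't already available locally):

import subprocess

# capture opus_cat's stdout into a plain-text file
with open("bible-uedin.fr.txt", "w", encoding="utf-8") as out:
    subprocess.run(
        ["opus_cat", "--directory", "bible-uedin", "--language", "fr", "--plain", "--no_ids"],
        stdout=out,
        check=True,
    )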

@jchwenger
Author

Sounds good, thanks for your answer! It's awesome that you developed this utility, very useful. As you can see, I could get what I wanted with a bit of Python scripting, but thanks for pointing out opus_cat too; I'm sure it'll come in handy for me in the future.
