Inconsistent IDs #1
Comments
Also, note that there is a bit of inconsistency inside zipping of item (3) as well.
$ unzip oneindia_20210320_en_ml.zip
Archive: oneindia_20210320_en_ml.zip
creating: en-ml/
inflating: en-ml/oneindia_train.ml
inflating: en-ml/oneindia_train.en
$ unzip pibarchives_2014_2016_en_ml.zip
Archive: pibarchives_2014_2016_en_ml.zip
inflating: en-ml/.DS_Store
inflating: __MACOSX/en-ml/._.DS_Store
inflating: en-ml/pib_arch_train.en
inflating: en-ml/pib_arch_train.ml
$ unzip wikipedia-en-ml-20210201.zip
Archive: wikipedia-en-ml-20210201.zip
inflating: en-ml/ml.txt
inflating: en-ml/en.txt
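One way a downloader could cope with these differences is to filter archive members before extraction. The following is only a sketch in Python: the helper name data_members is hypothetical (it is not part of mtdata), and it assumes that directory entries, anything under __MACOSX/, and .DS_Store files are never wanted.

```python
import zipfile
from pathlib import PurePosixPath


def data_members(zip_path):
    """List data files in an archive, skipping directories and macOS metadata.

    Hypothetical helper for illustration; not an mtdata API.
    """
    with zipfile.ZipFile(zip_path) as archive:
        return [
            name
            for name in archive.namelist()
            if not name.endswith("/")              # directory entries such as en-ml/
            and not name.startswith("__MACOSX/")   # macOS resource-fork folder
            and PurePosixPath(name).name != ".DS_Store"
        ]
```

The returned names could then be passed to ZipFile.extract one by one. This does not address the differing file names inside each archive (for example ml.txt versus oneindia_train.ml), which would still need a per-dataset mapping.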
I added these datasets to v0.3.2
See mappings
P.S.
Hi,
Thanks for your efforts in creating/curating these datasets! These are priceless and greatly advance NLP for Indian languages.
I tried adding them into mtdata (thammegowda/mtdata#81).
Since the README says your datasets are still growing, I am wondering what the best long-term strategy is for keeping in sync.
For now, I can run grep -i -o 'http[^ ]*zip' README.md, but the immediate concern is about consistency in determining the name, version, and languages of datasets from the URL.
The way the files are currently named (the names act as IDs for the corpora) is a bit inconsistent. For example, consider these:
1. Here we can split by _ and get (name, version, lang1, lang2), so this is great. We can see oneindia is the name, 20210320 is the version, and en_ml are the langs.
2. This has 2014_2016 as version, though it would have been nice to have 2014to2016v1 as version, so that splitting by _ would give exactly (name, version, lang1, lang2) as in item 1.
3. ... (name, version, lang1, lang2). There are more datasets matching the item (1) pattern than the item (3) pattern, so I am inclined to call this abnormal.
Could you please consider having a consistent format in dataset IDs? It would greatly help automated downloaders such as mtdata.
Otherwise, do you really want your users to manually download 196 zip files via browser, and extract and merge them? :)
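To make the request concrete, here is a rough sketch in Python of the kind of parsing a downloader could do if every file followed the item (1) scheme. The names DatasetId, ID_PATTERN, and parse_dataset_id are made up for this illustration and are not mtdata's actual code; the pattern also assumes two- or three-letter lowercase language codes.

```python
import re
from typing import NamedTuple, Optional


class DatasetId(NamedTuple):
    name: str
    version: str
    lang1: str
    lang2: str


# Assumed scheme (item 1): name_version_lang1_lang2.zip, with no extra underscores.
ID_PATTERN = re.compile(r"^([^_]+)_([^_]+)_([a-z]{2,3})_([a-z]{2,3})\.zip$")


def parse_dataset_id(filename: str) -> Optional[DatasetId]:
    """Return the parsed ID, or None when the file name does not fit the scheme."""
    match = ID_PATTERN.match(filename)
    return DatasetId(*match.groups()) if match else None


for fname in (
    "oneindia_20210320_en_ml.zip",
    "pibarchives_2014_2016_en_ml.zip",
    "wikipedia-en-ml-20210201.zip",
):
    print(fname, "->", parse_dataset_id(fname))
```

Under these assumptions only the first file name parses. The second fails because the underscore inside 2014_2016 yields five fields, and the third because it uses hyphens and a different field order.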
Thanks.
P.S. https://github.com/thammegowda/mtdata#dataset-id