Bbaw egyptian #2290
Conversation
Hi @phiwi, thanks for contributing this nice dataset. If you have any blocking problem or question, do not hesitate to ask here; we are pleased to help you. Could you please first synchronize with our master branch? From your branch:
You should also remove the file `datasets/dummy/0.0.0/dummy_data.zip`, because you have already attached the dummy data in `datasets/bbaw_egyptian/dummy/0.0.0/dummy_data.zip`.
Thanks for adding this one :) I left a few comments. Also, could you please remove the file at `datasets/dummy/0.0.0/dummy_data.zip`?
`datasets/bbaw_egyptian/README.md` (outdated):
```yaml
annotations_creators:
- specialized egyptologists
language_creators:
- found
languages:
- de, en, eg
licenses:
- cc-by-4.0
multilinguality:
- multilingual
size_categories:
- 100K<n<1000K
source_datasets:
- extended|wikipedia
task_categories:
- translation
```
`specialized egyptologists` is not a valid `annotations_creators` tag. You can use this instead:

```yaml
annotations_creators:
- expert-generated
```

There is a tool to create those tags here. For the languages, there should be one language per line:

```yaml
- de
- en
- eg
```

Finally, the `task_ids` tags are missing:

```yaml
task_categories:
- conditional-text-generation
task_ids:
- machine-translation
```
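Putting these review suggestions together, the corrected metadata header would look roughly like the following sketch (the exact tag for the Egyptian language code is settled later in this thread):

```yaml
# Sketch of the corrected card header, combining the suggestions above.
annotations_creators:
- expert-generated
language_creators:
- found
languages:   # one language per line; the Egyptian code is resolved later in the thread
- de
- en
- eg
licenses:
- cc-by-4.0
multilinguality:
- multilingual
size_categories:
- 100K<n<1000K
source_datasets:
- extended|wikipedia
task_categories:
- conditional-text-generation
task_ids:
- machine-translation
```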
In `datasets/bbaw_egyptian/README.md`, a suggested change to the `### Contributions` section:

```markdown
### Contributions

Thanks to [@phiwi](https://github.com/phiwi) for adding this dataset.
```
```python
def _split_generators(self, dl_manager):
    """Returns SplitGenerators."""
    my_urls = self._URLS
    data_dir = dl_manager.download_and_extract(my_urls)
```
There's no extraction. Suggested change:

```python
data_dir = dl_manager.download(my_urls)
```
Thanks! Can you check that you have
Reformatted with black. |
Hi @phiwi, there are still some minor problems with the tags you used in the dataset card (README.md). Here is the output of the metadata validator:
@albertvillanova corrected :-)
Thanks, @phiwi. Now all tests should pass green. However, I think there is still an issue with the language code:

I am not sure what to do in this case... Maybe @lhoestq has an idea? Maybe adding the code to the list? https://github.com/huggingface/datasets/blob/master/src/datasets/utils/resources/languages.json
I have just checked that the list of valid codes already contains ISO 639-2 codes. Therefore, I would suggest adding it to the list and changing it in the dataset card.
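The validation step being discussed can be sketched as follows. This is a hypothetical stand-in, not the real validator: the mapping mimics a tiny slice of `languages.json` (note that ISO 639-2 assigns `egy` to Ancient Egyptian), and the function flags any card tag missing from it:

```python
# Hypothetical sketch of the metadata validator's language check:
# each tag in the card must appear in the known-codes mapping.
# This mapping is a tiny stand-in for the repo's languages.json.
KNOWN_CODES = {"de": "German", "en": "English", "egy": "Egyptian (Ancient)"}

def invalid_languages(codes):
    """Return the tags that are not present in the known-codes mapping."""
    return [code for code in codes if code not in KNOWN_CODES]

print(invalid_languages(["de", "en", "eg"]))  # ['eg']
```

Adding the missing code to the mapping (the fix suggested above) makes the same check pass.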
Done.

Hope everything is okay now.
It looks good to me. Let's see if @lhoestq has any other suggestions before merging it to master.
Looks all good now, thanks!
This is the "hieroglyph corpus" that I unfortunately could not contribute during the marathon. I have now re-extracted it, so it is in the state used in my paper (see documentation). I hope it satisfies your requirements, and I wish every scientist out there loads of fun deciphering a 5,000-year-old language :-)