Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bbaw egyptian #2290

Merged
merged 23 commits into from May 6, 2021
Merged

Bbaw egyptian #2290

merged 23 commits into from May 6, 2021

Conversation

phiwi
Copy link
Contributor

@phiwi phiwi commented Apr 29, 2021

This is the "hieroglyph corpus" that I could unfortunately not contribute during the marathon. I re-extracted it again now, so that it is in the state as used in my paper (seee documentation). I hope it satiesfies your requirements and wish every scientist out their loads of fun deciphering a 5.000 years old language :-)

@albertvillanova
Copy link
Member

albertvillanova commented May 3, 2021

Hi @phiwi,

Thanks for contributing this nice dataset. If you have any blocking problem or question, do not hesitate to ask here. We are pleased to help you.

Could you please first synchronize with our master branch? From your branch bbaw_egyptian, type:

git fetch upstream master
git merge upstream/master

Copy link
Member

@albertvillanova albertvillanova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should also remove the file datasets/dummy/0.0.0/dummy_data.zip, because you have already attached the dummy data in datasets/bbaw_egyptian/dummy/0.0.0/dummy_data.zip.

datasets/bbaw_egyptian/bbaw_egyptian.py Outdated Show resolved Hide resolved
Copy link
Member

@lhoestq lhoestq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding this one :)

I left a few comments
Also could you remove the file at datasets/dummy/0.0.0/dummy_data.zip please ?

Comment on lines 2 to 17
annotations_creators:
- specialized egyptologists
language_creators:
- found
languages:
- de, en, eg
licenses:
- cc-by-4.0
multilinguality:
- multilingual
size_categories:
- 100K<n<1000K
source_datasets:
- extended|wikipedia
task_categories:
- translation
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

specialized egyptologists is not a valid annotations_creators tag. You can use this instead:

annotations_creators:
- expert-generated

There is a tool to create those tags here.

For the languages, there should be one language per line:

- de
- en
- eg

Finally the task_ids tags are missing:

task_categories:
- conditional-text-generation
task_ids:
- machine-translation

}
```

### Contributions
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
### Contributions
### Contributions
Thanks to [@phiwi](https://github.com/phiwi) for adding this dataset.

def _split_generators(self, dl_manager):
"""Returns SplitGenerators."""
my_urls = self._URLS
data_dir = dl_manager.download_and_extract(my_urls)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's no extraction

Suggested change
data_dir = dl_manager.download_and_extract(my_urls)
data_dir = dl_manager.download(my_urls)

@lhoestq
Copy link
Member

lhoestq commented May 3, 2021

Thanks ! Can you check that you have black==21.4b0 and run make style again ? This should fix the "check_code_quality" CI issue

@phiwi
Copy link
Contributor Author

phiwi commented May 3, 2021

Reformatted with black.

@albertvillanova
Copy link
Member

Hi @phiwi, there are still some minor problems in relation with the tags you used in the dataset card (README.md).

Here you can find the output of the metadata validator:

WARNING:root:❌ Failed to validate 'datasets/bbaw_egyptian/README.md':
Could not validate the metada, found the following errors:
* field 'size_categories':
	['100K<n<1000K'] are not registered tags for 'size_categories', reference at https://github.com/huggingface/datasets/tree/master/src/datasets/utils/resources/size_categories.json
* field 'task_ids':
	['machine translation'] are not registered tags for 'task_ids', reference at https://github.com/huggingface/datasets/tree/master/src/datasets/utils/resources/tasks.json
* field 'languages':
	['eg'] are not registered tags for 'languages', reference at https://github.com/huggingface/datasets/tree/master/src/datasets/utils/resources/languages.json

@phiwi
Copy link
Contributor Author

phiwi commented May 5, 2021

@albertvillanova corrected :-)

@albertvillanova
Copy link
Member

albertvillanova commented May 5, 2021

Thanks, @phiwi. Now all tests should pass green.

However, I think there is still an issue with the language code:

  • the code for the Ancient Egyptian is not ar-EG
  • there is no ISO 639-1 code for the Ancient Egyptian
  • there is an ISO 639-2 code: egy; but this code will not pass the validation test because it is not in the list of valid codes

I am not sure what to do in this case... Maybe @lhoestq has an idea? Maybe adding the code to the list? https://github.com/huggingface/datasets/blob/master/src/datasets/utils/resources/languages.json

@albertvillanova
Copy link
Member

albertvillanova commented May 5, 2021

I have just checked that in the list of valid codes there are already ISO 639-2 codes. Therefore, I would suggest you to add it to the list:

"egy": "Egyptian (Ancient)",

and change it in the dataset card.

@phiwi
Copy link
Contributor Author

phiwi commented May 5, 2021

Done.

@phiwi
Copy link
Contributor Author

phiwi commented May 6, 2021

Hope, everything is okay right now.

Copy link
Member

@albertvillanova albertvillanova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks good to me. Let's see if @lhoestq has any other suggestions before merging it to master.

Copy link
Member

@lhoestq lhoestq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks all good now thanks !

@lhoestq lhoestq merged commit 056b432 into huggingface:master May 6, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants