add swedish_medical_ner dataset #2940
Cool, thank you!
I just added a few comments:
language_creators:
- found
languages:
- sv-SE
- sv-SE
- sv-SE
Done.
})

In: data['train'][0]['sentence']
Out: '{kropp} beskrivs i till exempel människokroppen, anatomi och f'
Maybe we can remove the parentheses, brackets and curly brackets in _generate_examples?
This way people can start training models without further preprocessing.
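For reference, here is a minimal sketch of what that stripping could look like. It is not the code in this PR: `strip_markers` and the label names in `MARKER_LABELS` are hypothetical, and it assumes the markers wrap the entity text directly, as in the `{kropp}` example above. It removes the three marker types and returns spans indexed into the cleaned text.

```python
import re

# Hypothetical label names: the thread only mentions parentheses, brackets
# and curly brackets, not what each marker type stands for.
MARKER_LABELS = {"(": "PAREN", "[": "BRACKET", "{": "CURLY"}

# One alternative per marker pair; the inner text may not contain markers
# of the same kind, so nested markers are not handled.
_ENTITY_RE = re.compile(r"\(([^()]*)\)|\[([^\[\]]*)\]|\{([^{}]*)\}")


def strip_markers(sentence):
    """Return (clean_text, spans), with spans indexed into clean_text."""
    parts = []
    spans = []
    cursor = 0     # position in the original sentence
    clean_len = 0  # length of the cleaned text built so far
    for match in _ENTITY_RE.finditer(sentence):
        # Copy the unmarked text that precedes this entity.
        before = sentence[cursor:match.start()]
        parts.append(before)
        clean_len += len(before)
        # Exactly one of the three groups is set for any given match.
        inner = next(g for g in match.groups() if g is not None)
        spans.append({
            "start": clean_len,
            "end": clean_len + len(inner),
            "text": inner,
            "type": MARKER_LABELS[match.group(0)[0]],
        })
        parts.append(inner)
        clean_len += len(inner)
        cursor = match.end()
    # Copy whatever follows the last entity.
    parts.append(sentence[cursor:])
    return "".join(parts), spans


clean, spans = strip_markers(
    "{kropp} beskrivs i till exempel människokroppen, anatomi och f"
)
print(clean)
print(spans)
```

One caveat with this approach: any ordinary parentheses in the source text would be treated as entity markers too, so the existing index fields would need to be recomputed and checked against the cleaned text.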
I will leave this to the end users. I am not sure whether I should edit the original data much, given the data licence. As of now, the index fields match the brackets in the sentences.
OK, since the indices match the text with the brackets, it's fine.
Note that the data license shouldn't be an issue if we want to add further data processing in the script.
Thank you! It all looks good now :)
I just made some minor changes.
Adding the Swedish Medical NER dataset, listed in "Biomedical Datasets - BigScience Workshop 2021"