add swedish_medical_ner dataset #2940

Merged
merged 4 commits into from
Oct 5, 2021

Conversation

bwang482
Contributor

Adding the Swedish Medical NER dataset, listed in "Biomedical Datasets - BigScience Workshop 2021"

Member

lhoestq left a comment

Cool, thank you!

I just added a few comments:

language_creators:
- found
languages:
- sv-SE
Member

Suggested change
- sv-SE
- sv-SE

Contributor Author

Done.

})

In: data['train'][0]['sentence']
Out: '{kropp} beskrivs i till exempel människokroppen, anatomi och f'
Member

Maybe we can remove the parentheses, brackets, and curly brackets in _generate_examples?
This way people can start training models without further preprocessing.

Contributor Author

I will leave this to the end users. I am not sure I should edit the original data much, given the data licence. As of now, the index fields match the brackets in the sentences.

Member

OK, since the indices match the text with the brackets, it's fine.
Note that the data license shouldn't be an issue if we want to add further data processing in the script.
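
For readers who want plain text, the marker removal discussed in this thread could be done on the user side along these lines. This is a minimal sketch and not code from the PR: `strip_entity_markers` and `MARKED_SPAN` are hypothetical names, and it assumes that markers are never nested and that parentheses, brackets, and curly brackets appear in a sentence only as entity markers. It returns new character offsets, since the dataset's own index fields refer to the text with the brackets still in place.

```python
import re

# Matches one marked span: (...), [...] or {...}.
# Assumption: markers are not nested inside one another.
MARKED_SPAN = re.compile(r"\(([^()]*)\)|\[([^\[\]]*)\]|\{([^{}]*)\}")


def strip_entity_markers(sentence):
    """Return (clean_text, spans).

    spans is a list of (start, end, marker) tuples, where start/end are
    character offsets into clean_text and marker is the opening bracket
    character that surrounded the span in the original sentence.
    """
    pieces = []
    spans = []
    cursor = 0     # current position in the original sentence
    clean_len = 0  # length of the cleaned text built so far
    for match in MARKED_SPAN.finditer(sentence):
        # Copy the unmarked text that precedes this span.
        before = sentence[cursor:match.start()]
        pieces.append(before)
        clean_len += len(before)
        # Exactly one of the three groups matched; take its content.
        inner = next(g for g in match.groups() if g is not None)
        marker = sentence[match.start()]
        spans.append((clean_len, clean_len + len(inner), marker))
        pieces.append(inner)
        clean_len += len(inner)
        cursor = match.end()
    pieces.append(sentence[cursor:])
    return "".join(pieces), spans


text, spans = strip_entity_markers(
    "{kropp} beskrivs i till exempel människokroppen, anatomi och f"
)
print(text)
print(spans)
```

Keeping the marker character in each span lets a user map it to an entity label afterwards, using whatever mapping the dataset card documents.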

datasets/swedish_medical_ner/swedish_medical_ner.py (4 outdated review threads, resolved)
Member

lhoestq left a comment

Thank you! It all looks good now :)

I just made some minor changes.

datasets/swedish_medical_ner/README.md (3 review threads, 2 outdated, resolved)
lhoestq merged commit fdc02f3 into huggingface:master on Oct 5, 2021