Danish Europarl dataset #2730
6448 sentences extracted from europarl-v7.da-en.da
Corrected 2 errors in my extraction script.
Thanks for this work. Once QA is finished, please update the first comment to include all the information that we requested for the German version here. Thanks!
Hi, I'd like to contribute to getting Danish data into the dataset, and found this PR. Am I understanding correctly that this needs proofreading? If so, do let me know what I can do to help. I'm a native Danish speaker and software developer, so I should be able to assist with QA and some coding/scripting if needed. I looked at the guidelines here: https://github.com/common-voice/common-voice/blob/main/docs/SENTENCES.md#bulk-submission and would be happy to set up a Google spreadsheet as suggested, and begin proofreading. But let me know what would be helpful, if anything :)
That would be awesome.
Great, I've started manually reviewing. I don't use Google Sheets, so I'm using a local copy of the QA spreadsheet, which I will submit when done.
Hey folks, just a heads up that because this PR was first created so long ago, the target is
Thanks for the info! Will do when my review is ready.
@Pertel any update on this?
Ok, thanks!
I have extracted 5136 sentences from the Europarl dataset.
The file contains sentences with
The Europarl dataset is filtered in the following way.
The first file, 'europarl-filter-da.txt', contains 293800 sentences; after shuffling and piping through the 'uniq_stem' script, there are typically 5000-6000 sentences left.
For the filter.sh script, I started with the Dutch script and worked from there. The first line
grep -v '[^a-zæøåA-ZÆØÅ,.?! ]'
is a whitelist that specifies which symbols are allowed in this collection of sentences. It is very effective at dropping every line that contains a strange symbol. It also drops lines with numbers, since the file has a lot of case numbers and similar. I also removed abbreviations and words related to war.
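To make the effect of the whitelist concrete, here is a small Ruby sketch of the same check. It is only an illustration of what the grep line does, not part of filter.sh; the method name `allowed_sentence?` and the sample sentences are made up for this example.

```ruby
# Hypothetical Ruby equivalent of: grep -v '[^a-zæøåA-ZÆØÅ,.?! ]'
# A line is kept only if every character is in the whitelist.
DISALLOWED = /[^a-zæøåA-ZÆØÅ,.?! ]/

def allowed_sentence?(line)
  # chomp mirrors grep, which matches against the line without its newline
  !line.chomp.match?(DISALLOWED)
end

samples = [
  "Jeg vil gerne takke ordføreren.",  # only whitelisted characters
  "Sag nr. 1234 blev afvist.",        # contains digits
  "Det er vigtigt; meget vigtigt."    # contains a semicolon
]

kept = samples.select { |s| allowed_sentence?(s) }
puts kept
```

Note that the whole line is dropped as soon as one character falls outside the whitelist; nothing is stripped from inside a line.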
Next, I made a Ruby script called 'uniq_stem.rb'.
It lets a sentence pass through if it contains a stem (the first 4 characters of a word) that has not been seen before. This gives a lot of variation in the sentence collection.
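As a rough illustration of that idea, here is a minimal sketch. It is not the original 'uniq_stem.rb'; the method name `uniq_stem`, the lowercasing, the word-splitting rule, and the handling of words shorter than 4 characters (treated as their own stem) are all assumptions made for this example.

```ruby
require "set"

STEM_LENGTH = 4 # "first 4 characters of a word", per the description above

# Returns the sentences that contribute at least one stem not seen before.
def uniq_stem(sentences)
  seen = Set.new
  sentences.select do |sentence|
    # Assumption: words are runs of Danish letters, compared case-insensitively.
    stems = sentence.downcase.scan(/[a-zæøå]+/).map { |w| w[0, STEM_LENGTH] }
    has_new_stem = stems.any? { |s| !seen.include?(s) }
    seen.merge(stems) # record every stem of this sentence
    has_new_stem
  end
end

sentences = [
  "Parlamentet vedtog forslaget.",
  "Forslaget vedtog Parlamentet.", # only stems already seen: parl, vedt, fors
  "Kommissionen svarede hurtigt."
]

puts uniq_stem(sentences)
```

With this approach the result depends on input order, which may be why the file is shuffled before being piped through the script: each run then yields a different selection of sentences.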
Hope it is useful.