Danish Europarl dataset #2730

Isomorph70 · 2020-05-18T16:24:28Z

I have extracted 5136 sentences from the Europarl dataset.

The file contains sentences with

no more than 14 words
an uppercase letter at the start
no uppercase letters inside the sentence
no abbreviations
one of the symbols . ? or ! at the end
no numbers, (square) brackets, ...
no foreign letters or alphabets
no misspelled words according to Hunspell

The Europarl dataset is filtered in the following way.

cat europarl-v7.da-en.da | source filter.sh >europarl-filter-da.txt
shuf europarl-filter-da.txt | ruby uniq_stem.rb >europarl-v7-da.txt

The first file, 'europarl-filter-da.txt', contains 293800 sentences. After shuffling and piping through the 'uniq_stem' script, there are typically 5000-6000 sentences left.

The filter.sh script is

grep -v '[^a-zæøåA-ZÆØÅ,.?! ]' | # whitelist of allowed letters and symbols
grep -v '.[A-ZÆØÅ]' | # remove lines with names (uppercase letter) inside the text
awk 'length<102' | # drop lines of length 102 or more
awk 'length>25' | # drop lines of length 25 or less
awk 'NF<=14' | # keep lines with at most 14 words
grep -P '^[A-ZÆØÅ]' | # only lines that start with an uppercase letter
grep -P '[a-zæøå0-9][.?!…]*$' | # only lines that end with a word and an end symbol
grep -P '[a-zæøå][a-zæøå][a-zæøå]' | # require at least three lowercase letters in a row
grep -v '[a-zæøå]\..' | # remove abbreviations like hr. and stk.
grep -v -P ' (hr|nr).$' | # remove abbreviations at the end
sed -r 's/  +/ /g' | # collapse runs of spaces into one
sed -r 's/ ?'"'"' s /'"'"'s /g' | # rejoin a detached 's
grep -v -P '(bomb|dræb|drab|tortur|terror|myrd)' | # remove words related to war and violence
grep -v ' dør[ ,.]' | # remove lines with the word dør
grep -v ' kl\.' | # remove the abbreviation kl.
hunspell -d da_DK -L -G | # spell check to filter out misspellings
sort |
uniq

I started with the Dutch script and worked from there. The first line, grep -v '[^a-zæøåA-ZÆØÅ,.?! ]', is a whitelist that specifies which symbols are allowed in this collection of sentences. It is very effective at removing all strange symbols from the lines. It also removes numbers, since the file has a lot of case numbers and similar.
I also removed abbreviations and words related to war.
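
To illustrate the whitelist idea outside the shell pipeline, here is a minimal Ruby sketch. The constant ALLOWED and the method allowed_line? are hypothetical names used only for this example; the character set is copied from the first grep in filter.sh, and the sample sentences are made up.

```ruby
# coding: utf-8

# Hypothetical helper, not part of filter.sh: mirrors the first grep,
# which drops every line containing a character outside the whitelist.
ALLOWED = /\A[a-zæøåA-ZÆØÅ,.?! ]*\z/

def allowed_line?(line)
  line.chomp.match?(ALLOWED)
end

samples = [
  "Det er en vigtig sag.",   # only whitelisted characters
  "Artikel 12 er vigtig.",   # contains digits
  "Det er en god idé.",      # contains the non-whitelisted letter é
]
samples.each { |s| puts "#{allowed_line?(s)}\t#{s}" }
```

Only the first sample passes. Because anything not listed is rejected, digits, brackets and foreign letters all disappear in a single step without having to be named.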

Next, I made a Ruby script called 'uniq_stem.rb' that looks like this:

# coding: utf-8

stemcount = Hash.new(0)               # how often each stem has been seen
while gets
  t = $_
  s = t.gsub(/[^\s\w\dæøåÆØÅ]/, ' ')  # make unusual symbols into spaces
       .gsub(/\s+/, ' ')              # collapse multiple spaces (gsub! would return nil if nothing changed)
       .downcase                      # make all letters lowercase
       .split(' ')
       .delete_if { |w| w.to_i > 0 }  # remove numbers
       .collect { |w| w[0..3] }       # only keep stem part
  next if s.empty?                    # skip blank lines, where min would be nil

  mc = s.collect { |w| stemcount[w] }.min
  #  puts [mc,s].inspect
  if mc < 1                           # at least one stem has not been seen before
    puts t
    s.each { |w| stemcount[w] += 1 }
  end
end

# puts [stemcount.length,stemcount].inspect

It lets a sentence pass through if it contains a stem (the first 4 characters of a word) that has not been seen before. This gives a lot of variation in the sentence collection.
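
As a small worked example of that rule, here is a sketch that applies the same stem logic to an in-memory list instead of standard input. The method names stems and select_by_new_stem are hypothetical, and the three sample sentences are made up for illustration.

```ruby
# coding: utf-8

# Reduce a sentence to its list of 4-character "stems", as uniq_stem.rb does.
def stems(sentence)
  sentence.gsub(/[^\s\w\dæøåÆØÅ]/, ' ')  # punctuation becomes spaces
          .downcase
          .split
          .reject { |w| w.to_i > 0 }      # drop numbers
          .map { |w| w[0..3] }            # keep the first 4 characters
end

# Hypothetical wrapper: keep a sentence only if it has an unseen stem.
def select_by_new_stem(sentences)
  seen = Hash.new(0)
  sentences.select do |sentence|
    s = stems(sentence)
    keep = s.any? { |w| seen[w].zero? }
    s.each { |w| seen[w] += 1 } if keep
    keep
  end
end

input = [
  "Vi stemmer i dag.",
  "Vi stemmer i dag.",   # no new stem: dropped
  "Vi stemte i går.",    # "går" is new: kept
]
puts select_by_new_stem(input)
```

The second sentence is dropped because all of its stems were already counted. The third is kept only because of "går", since "stemte" and "stemmer" share the stem "stem". This also shows why the input is shuffled first: which sentence wins depends on the order in which they arrive.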

Hope it is useful.

6448 sentences extracted from europarl-v7.da-en.da

Corrected 2 error in my extraction script.

nukeador · 2020-05-22T14:38:24Z

Thanks for this work. Once QA is finished, please update the first comment to include all the information that we requested for the German version here: #2539 (comment)

Thanks!

Pertel · 2021-08-02T20:46:22Z

Hi, I'd like to help get Danish data into the dataset, and found this PR. Am I understanding correctly that this needs proofreading? If so, do let me know what I can do to help.

I'm a native Danish speaker and software developer, so I should be able to assist with QA and some coding/scripting if needed.

I looked at the guidelines here: https://github.com/common-voice/common-voice/blob/main/docs/SENTENCES.md#bulk-submission and would be happy to set up a Google spreadsheet as suggested, and begin proofreading. But let me know what would be helpful, if anything :)

Isomorph70 · 2021-08-03T14:34:07Z

That would be awesome.

Pertel · 2021-08-05T21:23:50Z

Great, I've started manually reviewing.
The sample size calculator told me 2299 sentences were enough, so I randomly picked 2399 sentences, just to be on the safe side.
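
For reference, the 2299 figure is consistent with the usual finite-population sample size formula. The sketch below is an assumption about what such a calculator does, not a description of the one used here: it applies Cochran's formula with a finite population correction, a z-score of 2.58 (roughly 99% confidence), a 2% margin of error, and a population of 5136 sentences (the count given in the first comment). The method name sample_size is hypothetical.

```ruby
# Hypothetical helper: Cochran's sample size with finite population correction.
# All defaults are assumptions, not values confirmed in this thread.
def sample_size(population, z: 2.58, margin: 0.02, proportion: 0.5)
  n0 = (z**2) * proportion * (1 - proportion) / (margin**2)  # size for an unlimited population
  (n0 / (1 + (n0 - 1) / population)).ceil                    # corrected for population, rounded up
end

puts sample_size(5136)  # 2299 with the assumed inputs
```

With the same assumed settings, a larger population would require a larger sample, so the number of sentences fed to the calculator matters.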

I don't use Google Sheets, so I'm using a local copy of the QA spreadsheet, which I will submit when done.
If anyone wants a copy for any reason, I'll happily upload it.

phirework · 2021-08-24T16:32:46Z

Hey folks, just a heads up: because this PR was created so long ago, its target is master, which is no longer the default branch on Common Voice, so I won't be able to accept the PR as-is. Whenever the sentence review is complete, you'll need to re-create this PR and request to merge into main.

Pertel · 2021-08-24T21:06:56Z

Thanks for the info! Will do when my review is ready.

ftyers · 2023-09-13T16:34:45Z

@Pertel any update on this?

Pertel · 2023-09-13T17:20:13Z

Yes, sorry I never did come back and update this! My new PR #3259 was merged way back in 2021, so this should just be closed.

ftyers · 2023-09-13T17:55:40Z

Ok, thanks!

Isomorph70 added 2 commits May 16, 2020 17:14

Add files via upload

9956904

6448 sentences extracted from europarl-v7.da-en.da

add Europarl Danish sentences

11e476d

Corrected 2 error in my extraction script.

Isomorph70 mentioned this pull request May 29, 2020

Update with Danish numbers. JRMeyer/common-voice-stats#7

Merged

Isomorph70 referenced this pull request in fnielsen/awesome-danish May 29, 2020

Add Common Voice

8bb305a

Pertel mentioned this pull request Sep 13, 2021

Add danish europarl sentences #3259

Merged

zcolleyz added the Bulk sentence submission label Mar 23, 2022

ftyers closed this Sep 13, 2023
