Danish Europarl dataset #2730

Closed
Isomorph70 wants to merge 2 commits

@Isomorph70
Contributor

Isomorph70 commented May 18, 2020

I have extracted 5136 sentences from the Europarl dataset.

The file contains sentences with

  • no more than 14 words
  • an uppercase letter at the start
  • no uppercase letters inside the sentence
  • no abbreviations
  • one of the symbols . ? or ! at the end
  • no numbers, (square) brackets, ...
  • no letters from foreign alphabets
  • no misspelled words according to Hunspell
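
The structural criteria in this list can be sketched as a single predicate. This is a minimal Ruby sketch, not the script used for the PR: the function name `structurally_ok?` is hypothetical, and the character whitelist, abbreviation check and Hunspell spell check are left out on purpose.

```ruby
# Hypothetical helper, not part of the PR: checks only the structural criteria.
def structurally_ok?(line)
  s = line.strip
  return false unless s.match?(/\A[A-ZÆØÅ]/)    # uppercase letter at the start
  return false if s[1..-1].match?(/[A-ZÆØÅ]/)   # no uppercase letters inside
  return false unless s.match?(/[.?!]\z/)       # ends with . ? or !
  s.split.length <= 14                          # no more than 14 words
end

puts structurally_ok?("Det er en meget vigtig sag.")   # true
puts structurally_ok?("Jeg takker Kommissionen.")      # false (inner uppercase letter)
```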

The Europarl dataset is filtered in the following way.

cat europarl-v7.da-en.da | source filter.sh >europarl-filter-da.txt
shuf europarl-filter-da.txt | ruby uniq_stem.rb >europarl-v7-da.txt

The first file, 'europarl-filter-da.txt', contains 293800 sentences. After shuffling and piping through the 'uniq_stem' script, there are typically 5000-6000 sentences left.

The filter.sh script is

grep -v '[^a-zæøåA-ZÆØÅ,.?! ]' | # whitelist of allowed letters and symbols
grep -v '.[A-ZÆØÅ]' | # remove lines with names (uppercase letters) in the inner text
awk 'length<102' | # drop long lines
awk 'length>25' | # drop very short lines
awk 'NF<=14' | # keep lines with at most 14 words
grep -P '^[A-ZÆØÅ]' | # only lines that start with an uppercase letter
grep -P '[a-zæøå0-9][.?!…]*$' | # only lines that end with a word and an end symbol
grep -P '[a-zæøå][a-zæøå][a-zæøå]' | # require at least three lowercase letters in a row
grep -v '[a-zæøå]\..' | # remove abbreviations like hr. and stk.
grep -v -P ' (hr|nr).$' | # remove abbreviations at the end
sed -r 's/  +/ /g' | # collapse repeated spaces
sed -r 's/ ?'"'"' s /'"'"'s /g' | # reattach a detached 's to the preceding word
grep -v -P '(bomb|dræb|drab|tortur|terror|myrd)' | # remove words related to war and violence
grep -v ' dør[ ,.]' |
grep -v ' kl\.' | # remove the abbreviation kl.
hunspell -d da_DK -L -G | # spell check to filter out misspellings
sort |
uniq

I started with the Dutch script and worked from there. The first line, grep -v '[^a-zæøåA-ZÆØÅ,.?! ]', is a whitelist that specifies which symbols are allowed in this collection of sentences. It is very effective at removing all strange symbols from the lines. It also removes numbers, since the file has a lot of case numbers and similar.
I also removed abbreviations and words related to war.
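
To illustrate how the whitelist behaves, here is a minimal Ruby sketch that reuses the character class from the first grep. The helper names `whitelisted?` and `first_offender` are hypothetical and only serve the illustration. Because the class is negated, a single character outside the list (a digit, a bracket, an accented letter) is enough to reject the whole line.

```ruby
# Same character class as the first grep in filter.sh, negated:
# it matches any character that is NOT on the whitelist.
DISALLOWED = /[^a-zæøåA-ZÆØÅ,.?! ]/

# Hypothetical helper: true when the line contains only whitelisted characters.
def whitelisted?(line)
  !line.chomp.match?(DISALLOWED)
end

# Hypothetical helper: the first character that disqualifies the line, or nil.
def first_offender(line)
  line.chomp[DISALLOWED]
end

puts whitelisted?("Jeg er enig i dette.")              # true
puts first_offender("Sag nummer 1234 er afsluttet.")   # 1
```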

Next I wrote a Ruby script called 'uniq_stem.rb' that looks like this:

# coding: utf-8

stemcount = Hash.new(0)              # how many kept sentences contain each stem
while (t = gets)
  s = t.gsub(/[^\s\w\dæøåÆØÅ]/, ' ') # turn unusual symbols into spaces
       .gsub(/\s+/, ' ')             # collapse repeated whitespace (gsub! could return nil)
       .downcase                     # make all letters lowercase
       .split(' ')
       .delete_if { |w| w.to_i > 0 } # remove numbers
       .collect { |w| w[0..3] }      # keep only the stem part
  next if s.empty?                   # skip blank lines, where min would be nil

  mc = s.collect { |w| stemcount[w] }.min
  # puts [mc, s].inspect
  if mc < 1                          # at least one stem has not been seen before
    puts t
    s.each { |w| stemcount[w] += 1 }
  end
end

# puts [stemcount.length, stemcount].inspect

It lets a sentence pass through if it contains a stem (the first 4 characters of a word) that has not been seen before. This gives a lot of variation in the sentence collection.
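
A small worked example may make the stem rule easier to follow. The sketch below wraps the same idea in two hypothetical functions, `stems` and `uniq_by_stem`, that work on an in-memory array instead of standard input. The three sample sentences are made up for illustration and are not taken from the dataset.

```ruby
# Hypothetical helper: the lowercase 4-character stems of a sentence.
def stems(sentence)
  sentence.gsub(/[^\s\w\dæøåÆØÅ]/, ' ')   # turn unusual symbols into spaces
          .downcase
          .split
          .reject { |w| w.to_i > 0 }      # remove numbers
          .map { |w| w[0..3] }            # keep only the first 4 characters
end

# Hypothetical helper: keep a sentence only if it adds at least one unseen stem.
def uniq_by_stem(sentences)
  seen = Hash.new(0)
  sentences.select do |sentence|
    st = stems(sentence)
    keep = st.any? { |w| seen[w].zero? }
    st.each { |w| seen[w] += 1 } if keep  # only kept sentences update the counts
    keep
  end
end

input = [
  "Vi skal stemme nu.",        # stems: vi, skal, stem, nu
  "Vi stemmer nu.",            # stems: vi, stem, nu (all seen, dropped)
  "Vi skal arbejde sammen."    # stems: vi, skal, arbe, samm
]
puts uniq_by_stem(input)
# Vi skal stemme nu.
# Vi skal arbejde sammen.
```

Which sentences survive depends on the input order: with the array reversed, "Vi stemmer nu." is kept and "Vi skal stemme nu." is dropped. This is consistent with the count varying between runs after shuffling.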

Hope it is useful.

6448 sentences extracted from europarl-v7.da-en.da
Corrected 2 errors in my extraction script.
@nukeador
Contributor

Thanks for this work. Once QA is finished please update the first comment to include all the information that we requested for the German version here

#2539 (comment)

Thanks!

Isomorph70 referenced this pull request in fnielsen/awesome-danish May 29, 2020
@Pertel
Contributor

Pertel commented Aug 2, 2021

Hi, I'd like to contribute getting Danish data in the dataset, and found this PR. Am I understanding correctly that this needs proofreading? If so, do let me know what I can do to help.

I'm a native Danish speaker and software developer, so I should be able to assist with QA and some coding/scripting if needed.

I looked at the guidelines here: https://github.com/common-voice/common-voice/blob/main/docs/SENTENCES.md#bulk-submission and would be happy to set up a Google spreadsheet as suggested and begin proofreading. But let me know what would be helpful, if anything :)

@Isomorph70
Contributor Author

That would be awesome.

@Pertel
Contributor

Pertel commented Aug 5, 2021

Great, I've started manually reviewing.
The sample size calculator told me 2299 sentences were enough, so I randomly picked 2399 sentences, just to be on the safe side.

I don't use Google Sheets, so I'm using a local copy of the QA spreadsheet, which I will submit when done.
If anyone wants a copy for any reason, I'll happily upload it.
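
For reference, a figure like 2299 can be reproduced with the usual sample size formula for a proportion with finite population correction. The comment does not say which calculator or settings were used, so everything in this minimal Ruby sketch is an assumption: a population of 5136 sentences (the count in the PR description), a 99% confidence level with the z-score rounded to 2.58, a 2% margin of error, and the hypothetical function name `sample_size`.

```ruby
# Hypothetical helper: sample size for a proportion, with finite population correction.
#   z      - z-score for the confidence level (2.58 is a common rounding for 99%; assumed)
#   margin - margin of error as a fraction (assumed 2%)
#   p      - assumed proportion; 0.5 is the most conservative choice
def sample_size(population, z: 2.58, margin: 0.02, p: 0.5)
  n0 = (z**2 * p * (1 - p)) / margin**2        # size for an unlimited population
  (n0 / (1 + (n0 - 1) / population)).round     # corrected for the finite population
end

puts sample_size(5136)   # 2299
```

Under these assumed settings the result matches the 2299 mentioned above. Other settings or another population size would give a different number.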

@phirework
Contributor

Hey folks, just a heads up: because this PR was created so long ago, its target is master, which is no longer the default branch on Common Voice, so I won't be able to accept the PR as-is. Whenever the sentence review is complete, you'll need to re-create this PR and request to merge into main.

@Pertel
Contributor

Pertel commented Aug 24, 2021

Thanks for the info! Will do when my review is ready.

@ftyers
Collaborator

ftyers commented Sep 13, 2023

@Pertel any update on this?

@Pertel
Contributor

Pertel commented Sep 13, 2023

Yes, sorry I never did come back and update this! My new PR #3259 was merged way back in 2021, so this should just be closed.

@ftyers
Collaborator

ftyers commented Sep 13, 2023

Ok, thanks!

ftyers closed this Sep 13, 2023