A curated list of awesome resources for Danish language technology

mbenezra/awesome-danish

Awesome Danish

A curated list of awesome resources for Danish language technology

Corpora

  • NST
    • NST-ngrams - An n-gram frequency list compiled by Nordisk Språkteknologi from newspaper text and made available by the Norwegian Library Service. Can be compiled to an n-gram LM with SRILM.
    • NST-speech-22khz - A 22kHz speech corpus compiled by Nordisk Språkteknologi and made available by the Norwegian Library Service. The speech genre is dictation.
    • NST-speech-16kHz - A 16kHz speech corpus compiled by Nordisk Språkteknologi and made available by the Norwegian Library Service. The speech genre is read-aloud and the text is phonetically balanced. Designed for ASR training and testing.
    • NST-speech-44kHz - A 44kHz speech corpus compiled by Nordisk Språkteknologi and made available by the Norwegian Library Service. Designed for speech synthesis.
  • CLARIN-DK-UCPH
  • SemDaX - For educational, teaching or research purposes only. POS-tagged (adjectives, nouns and verbs only), supersense-tagged and BIO-tagged sentences.
  • NOMCO is "an annotated multimodal collection of conversational Danish". Apparently not directly available for download. [ Scholia ]
  • Danish Propbank, a commercial resource with 87,000 tokens annotated with morphosyntactic information, VerbNet classes and semantic roles.
  • DKhate, a corpus of 3600 hate speech samples from Twitter and Reddit as well as news comments (to appear in 2019).
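
The NST-ngrams entry above notes that the frequency list can be compiled into an n-gram LM with SRILM. As a rough illustration of the underlying idea only, the sketch below turns bigram counts into unsmoothed maximum-likelihood probabilities. The `count<TAB>w1 w2` line format and the toy counts are assumptions made for this example; the actual NST file layout may differ, and SRILM additionally applies smoothing and back-off, which this sketch does not.

```python
from collections import defaultdict


def parse_counts(lines):
    """Parse 'count<TAB>w1 w2 ...' lines into {ngram tuple: count}.

    The line format is an assumption for this sketch, not the
    documented NST-ngrams format.
    """
    counts = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        count, _, ngram = line.partition("\t")
        counts[tuple(ngram.split())] = int(count)
    return counts


def bigram_probabilities(counts):
    """Return maximum-likelihood P(w2 | w1) for every bigram (no smoothing)."""
    bigrams = {ng: c for ng, c in counts.items() if len(ng) == 2}
    history_totals = defaultdict(int)
    for (w1, _w2), c in bigrams.items():
        history_totals[w1] += c
    return {(w1, w2): c / history_totals[w1] for (w1, w2), c in bigrams.items()}


# Toy counts, invented for illustration (not taken from the NST list).
sample = ["3\tdet er", "1\tdet var", "2\ter det"]
probs = bigram_probabilities(parse_counts(sample))
```

A real language model would need smoothing for unseen n-grams, which is the part a toolkit such as SRILM takes care of.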

Parallel corpora:

  • Europarl, parallel sentences between Danish and English from the European Parliament.
  • WikiMatrix, parallel sentences from Wikipedias. 1620 language pairs, including Danish

Dictionaries and ontologies

  • NST-lexical-database A pronunciation dictionary compiled by Nordisk Språkteknologi and made available by the Norwegian Library Service.
  • DanNet - Danish wordnet (v 2.2) in OWL format, distributed under a three-clause BSD-like license.
  • Retskrivningsordbogen. The official Danish spelling dictionary, digitally available under its own special license.
    • Headwords and word classes (Opslagsord og ordklasser) in CSV format.
    • Lexemes, word classes and inflections. Excerpt in the CSF format available. Full list presumably available upon request.
    • Lexemes, word classes, inflections, grammatical information, hyphenation and usage examples in XML. Full list presumably available upon request.
  • The Comprehensive Danish Dictionary/Den Store Danske Ordliste (DSDO), word list created by Skåne Sjælland Linux User Group and distributed under a GPL license
    • Primary distribution site at http://da.speling.org/ seems no longer available
    • In Debian-based distributions the word list may be installed with sudo aptitude install aspell-da and extracted with aspell -d da dump master.
  • The Danish FrameNet Lexicon, a 40,267-line resource containing 5,300 verbs and 6,490 verbal nouns.
  • Wikidata lexemes, a structured database with metadata about lexemes, their forms and their senses. Over 50,000 lexemes, including 1,800 Danish ones, as of June 2019.
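
The DSDO entry above describes extracting the word list from the aspell-da package with aspell's dump command. A minimal Python sketch of that step might look as follows. It assumes `aspell` and the Danish dictionary are installed, and that any affix flags appear after a `/` on each line; the `/S` flag in the sample is invented for illustration. Only the parsing helper is exercised here, on a hard-coded sample, so the block runs without aspell present.

```python
import shutil
import subprocess


def parse_dump(text):
    """Turn dump output into a sorted list of unique words.

    Assumption: one entry per line, with optional affix flags
    after a '/' that are dropped here.
    """
    words = set()
    for line in text.splitlines():
        word = line.split("/", 1)[0].strip()
        if word:
            words.add(word)
    return sorted(words)


def dump_word_list(lang="da"):
    """Run 'aspell -d <lang> dump master' and parse its output.

    Output is decoded with the locale's default encoding.
    """
    if shutil.which("aspell") is None:
        raise RuntimeError("aspell is not installed or not on PATH")
    result = subprocess.run(
        ["aspell", "-d", lang, "dump", "master"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_dump(result.stdout)


# Hard-coded sample standing in for real aspell output.
sample_output = "hus\nbil/S\n\nhus\n"
words = parse_dump(sample_output)
```

On a machine with the package installed, `dump_word_list()` would return the full list instead of the sample.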

Automatic Speech Recognition

  • kaldi-sprakbanken - A recipe for training a state-of-the-art (as of 2017) speech recogniser for Danish based on the 16kHz NST database.

Speech Synthesis (text-to-speech)

  • espeak - An open-source speech synthesis program for ~56 languages including Danish. eSpeak can also be used as a grapheme-to-phoneme converter and was used to create the Danish Kaldi recipe.
  • ResponsiveVoice - Web-based (JavaScript-based) text-to-speech synthesis for a number of languages, including Danish. The commercial service is currently free for limited and non-commercial use.
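
The espeak entry above mentions its use as a grapheme-to-phoneme converter. One way to drive that from Python is sketched below, assuming the `espeak` binary is installed and that its `-v` (voice), `-q` (no audio) and `-x` (print phoneme mnemonics) options behave as in the standard command-line tool. The helper names are hypothetical, and the exact phoneme output depends on the installed eSpeak version, so none is shown.

```python
import shutil
import subprocess


def g2p_command(text, voice="da"):
    """Build the argument list for a quiet phoneme-mnemonic run of espeak."""
    return ["espeak", "-v", voice, "-q", "-x", text]


def phonemes(text, voice="da"):
    """Return espeak's phoneme mnemonics for the given text."""
    if shutil.which("espeak") is None:
        raise RuntimeError("espeak is not installed or not on PATH")
    result = subprocess.run(
        g2p_command(text, voice),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


# Only the command is built here; nothing is executed.
cmd = g2p_command("hej verden")
```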

Sentiment analysis

  • AFINN - Danish lexicons annotated for sentiment.
  • afinn - Python package with AFINN Danish lexicon annotated for sentiment, also installable with pip install afinn.
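
To make the idea of a sentiment-annotated lexicon concrete, the sketch below scores a text by summing per-word valences. It is a simplified stand-in for what a lexicon-based scorer does, not the afinn package's implementation. The four-word lexicon and its scores are invented for illustration and are not copied from AFINN.

```python
import re

# Toy valence lexicon: illustrative values only, not taken from AFINN.
TOY_LEXICON = {"god": 3, "dårlig": -3, "elsker": 3, "hader": -3}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def score(text, lexicon):
    """Sum the valence of every token found in the lexicon."""
    return sum(lexicon.get(token, 0) for token in tokenize(text))


positive = score("Jeg elsker god mad", TOY_LEXICON)
negative = score("Det er en dårlig film", TOY_LEXICON)
```

Words missing from the lexicon contribute zero, so coverage of the word list largely determines how useful the score is.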

Lexical norms

Embeddings

Fundamental processing

  • cstlemma - lemmatiser
  • Lemmy - Lemmatizer for Danish in Python
  • daner - Named entity extraction.
  • DaNLP - "a repository for Natural Language Processing resources for the Danish Language."
  • dapipe - Danish UD-pipe: tokenisation, lemmatisation, PoS tagging, morphology, dependencies.
  • DKIE - GATE pipeline including wrapped Danish models for Stanford CoreNLP.
  • StanfordNLP - Python software package for dependency parsing, including tokenization, lemmatization and part-of-speech tagging. A pre-trained model for Danish is available.
  • bornholmsk - Datasets and embeddings for the Bornholmsk dialect.

Resources about resources
