isoos/awesome-hungarian-nlp

Awesome NLP Resources for Hungarian

A curated list of free resources dedicated to Hungarian Natural Language Processing

Maintainers - György Orosz

Table of contents

  1. Tools
    1. Word tokenization, sentence splitting
    2. Morphology
    3. PoS / Morphological taggers
    4. Taggers / Chunkers
    5. Pipelines with Hungarian NLP components
    6. Syntactic parsers
    7. Semantic analysis
    8. Other
  2. Language models
    1. Word embeddings
    2. Transformer models
  3. Datasets
    1. Corpora
      1. Raw corpora
      2. Annotated corpora
      3. Parallel corpora
    2. Linguistic resources
    3. Linked Open Data
    4. Geo data
    5. Speech related data
  4. Academy
    1. Journals
    2. Conferences
    3. Institutes
  5. Learning resources
    1. Books
    2. Courses
    3. Tutorials
  6. Communities
  7. Other Hungarian related resource collections

Tools

Notations:

  • 👌 Easy to install and use
  • 🚀 Commercial-friendly license
  • 💯 Pretrained models are available or not needed

Word tokenization, sentence splitting

  • huntoken 👌🚀💯 Hungarian word and sentence splitter
  • quntoken 👌🚀💯 New Hungarian tokenizer based on quex, huntoken

Morphology

  • emMorph (Humor) 💯 Hungarian morphological analyzer based on Humor
  • emMorphPy 👌💯 A wrapper, a lemmatizer and REST API implemented in Python for the emMorph (Humor) Hungarian morphological analyzer
  • hunmorph 🚀💯 is an open source tool and programming library for spell-checking, stemming and morphological analysis of agglutinative, German and other languages.
  • hunmorph-foma 🚀💯 Hungarian morphological analyzer and generator based on hunmorph.
  • hunspell 👌🚀💯 is an open-source spell-checker, stemmer and morphological analyzer
  • lara-hungarian-nlp 👌🚀💯 LARA is a lightweight Python NLP library for ChatBots in Hungarian.
  • Lemmagen 👌🚀💯 project aims at providing a standardized open source multilingual platform for lemmatisation. (Python package for v3 | C# project for v3)
  • Simplemma 👌🚀💯 is a simple multilingual lemmatizer for Python

PoS / Morphological taggers

  • hunpos 👌🚀💯 Hunpos is an open source reimplementation of TnT, the well known part-of-speech tagger by Thorsten Brants.
  • PurePos 👌🚀 Open source morphological tagger based on HunPos
  • purepos.py 👌🚀 Python wrapper for PurePos

Taggers / Chunkers

  • HunTag 👌🚀 A sequential tagger for NLP using Maximum Entropy Learning and Hidden Markov Models
  • HunTag3 👌🚀 Improved version of the original HunTag
  • SzegedNER 👌🚀💯 Named Entity Recognition tool for Hungarian and English
  • DBpedia Spotlight 👌🚀💯 DBpedia Spotlight is a tool for automatically annotating mentions of DBpedia resources in text. Docker image
  • emBERT 👌🚀💯 is an emtsv module for pre-trained Transformer-based models. It provides tagging models based on Huggingface's transformers package.

Pipelines with Hungarian NLP components

  • magyarlanc 👌💯 A toolkit for the basic linguistic processing of Hungarian
  • magyarlanc_spark 👌💯 Spark wrapper for magyarlanc
  • HuSpaCy 👌🚀💯 Industrial-strength Hungarian Natural Language Processing
  • huNLP 👌💯 An experimental unified Java and REST API for magyarlanc and szegedNER
  • hunlp-GATE 💯 GATE plugin containing Hungarian NLP tools as GATE processing resources
  • Trendminer Hungarian Processing Pipeline 🚀 Hungarian NLP pipeline for social media text analysis (TrendMiner project)
  • Google Syntaxnet 🚀💯 Neural Models of Syntax
  • UDPipe 👌🚀💯 is a trainable pipeline for tokenization, tagging, lemmatization and dependency parsing of CoNLL-U files
  • polyglot 👌🚀💯 is a natural language pipeline that supports massive multilingual applications.
  • emtsv 👌💯 is a text processing system with inter-module communication via tsv + REST API
  • Stanza 👌🚀💯 is a Python NLP Library for Many Human Languages
  • spaCy StanfordNLP 👌🚀💯 wraps the StanfordNLP library, so you can use Stanford's models as a spaCy pipeline
  • trankit 👌🚀💯 A Light-Weight Transformer-based Python Toolkit for Multilingual Natural Language Processing

Syntactic parsers

  • hunpars 🚀💯 A rule based Hungarian syntactical analyzer
  • HunParse 🚀💯 An NLTK-based parser using KR-style morphological annotation
  • Anagramma Parser A parser based on psycholinguistics principles
  • benepar 👌🚀💯 A high-accuracy parser with models for 11 languages, implemented in Python. Based on Constituency Parsing with a Self-Attentive Encoder from ACL 2018.

Semantic analysis

Other

  • emLam 👌🚀💯 Preprocessing scripts for Hungarian Language Modeling
  • pywnxml 👌🚀💯 Python3 API for WordNet XML (Hungarian WordNet / BalkaNet / VisDic format)
  • Hun-appointment-chatbot 👌🚀💯 A simple Hungarian chatbot for booking an appointment using the Rasa framework.
  • neural-punctuator 👌🚀💯 Automatic punctuation restoration with BERT models for English and Hungarian
  • hunaccent 👌🚀💯 Small Footprint Diacritic Restoration for Hungarian
  • Diacritics_restoration 🚀💯 Lightweight Diacritics Restoration with Dilated Convolutional Neural Networks
  • NYTK MT 👌🚀💯 NYTK Machine translation models
  • syntax-augmentation-nmt 🚀💯 Syntax-based data augmentation for Hungarian-English machine translation
  • anonymizer_hu 🚀💯 The Hungarian anonymization tool for CURLICAT

Language models

Word embeddings

Transformer models

  • huBERT Hungarian BERT base models trained on Webcorpus 2.0 and the Hungarian Wikipedia
  • HIL* Transformer models Pretrained transformer models provided by HILANCO
  • PULI-BERT-Large is a Hungarian BERT large model based on MegatronBERT
  • PULI-GPT-2 is a Hungarian GPT-2 model
  • PULI-GPT-3SX is a Hungarian GPT-NeoX model (6.7 billion parameters)

Datasets

Corpora

Raw corpora

  • Hungarian Webcorpus With over 1.48 billion words unfiltered (589 million words fully filtered), this is by far the largest Hungarian language corpus, and unlike the Hungarian National Corpus (125 million words), it is available in its entirety under a permissive Open Content license.
  • Hungarian Webcorpus 2.0 The new version of the Hungarian Webcorpus was built from Common Crawl and includes a little over 9 billion words.
  • OSCAR is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture. (2339 million unique words)
  • emLam A Language Modeling Benchmark Corpus for Hungarian, similar to the One Billion Word corpus (Chelba, 2014) for English.
  • Leipzig corpora contain randomly selected sentences in the language of each corpus and are available in sizes from 10,000 sentences up to 1 million sentences. The sources are either newspaper texts or texts randomly collected from the web.
  • web2corpus Automatically created multilingual web corpus
  • CC-100 Monolingual Datasets from Web Crawl Data

Annotated corpora

  • CoNLL 2017: Automatically Annotated Raw Texts and Word Embeddings Automatic segmentation, tokenization and morphological and syntactic annotations of raw texts in 45 languages, generated by UDPipe, together with word embeddings of dimension 100 computed from lowercased texts by word2vec
  • OpinHuBank OpinHuBank is a human-annotated corpus to aid the research of opinion mining and sentiment analysis in Hungarian
  • HunEmPoli corpus was built from pre-agenda speeches of the Hungarian National Assembly (2014-2018) and consists of 764008 tokens in 36475 sentences. It carries aspect-level emotion annotation with 39840 identified emotions and also marks the keywords that evoked each emotion.
  • The Hungarian forum corpus for Opinion Mining This database is the first one dedicated to Opinion Mining in Hungarian. The data for further processing were gathered from the posts of the forum topic of the Hungarian government portal dealing with the referendum about dual citizenship.
  • Hungarian sentiment corpus (HuSent) is a deeply annotated Hungarian sentiment corpus. It is composed of Hungarian opinion texts written about different types of products, published on the homepage [http://divany.hu/]
  • Szeged Treebank The Szeged Treebank is the largest fully manually annotated treebank of the Hungarian language
  • Szeged Dependency Treebank The Szeged Dependency Treebank is a dependency-tree format version of the Szeged Treebank.
  • Universal Dependencies
  • Hungarian Named Entity Corpora The Named Entity Corpus for Hungarian is a subcorpus of the Szeged Treebank, which contains full syntactic annotations done manually by linguist experts.
  • KorKor Pilotcorpus is a gold standard corpus consisting of multiple layers such as dependency parse and coreference annotations
  • NerKor is a gold standard named entity annotated corpus containing 1 million tokens.
  • NerKor 1.41e A 1M+-token Hungarian named entity dataset with ~30 entity types derived from NYTK-NerKor
  • hunNERwiki a silver standard corpus for Hungarian Named Entity Recognition
  • Mazsola database contains 28M sentences from the MNSZ1 corpus annotated with shallow syntactic analysis
  • PrevCons is a database of 21K hapaxes of verbs with verbal prefixes
  • Hungarian word sense disambiguated corpus containing 39 suitable word form samples for the purpose of word sense disambiguation
  • HunLearner is a learners' corpus of Hungarian containing written data from 35 students majoring in Hungarian studies at the University of Zagreb, Croatia. Texts were morphologically and syntactically analyzed by the magyarlanc tool.
  • HuLU Hungarian Language Understanding Benchmark Kit
    • HuCOLA Hungarian Corpus of Linguistic Acceptability
    • HuCoPA Hungarian Choice of Plausible Alternatives Corpus
    • HuSST Hungarian version of the Sentiment Treebank
    • HuWNLI Anaphora resolution datasets for Hungarian as an inference task
    • HuWS is the Hungarian set of the Winograd schemas
  • HuRC Hungarian Corpus for Reading Comprehension with Commonsense Reasoning
  • ELTE Poetry Corpus is a database of complete poems of 50 Hungarian canonical poets together with the sound devices of the poems and the grammatical features of words in XML format
  • ELTE Novel Corpus is a database of 400 Hungarian novels (with the annotation of structural units and the grammatical features of words in TEI XML format)
  • ELTE Drama Corpus is a database of 58 dramas (with the annotation of structural units and the grammatical features of words in TEI XML format)
  • HunSum-1 is a dataset containing over 1.1M unique news articles with lead and other metadata
  • HAPP is the Hungarian translation of the Definite Pronoun Resolution Dataset.

Parallel corpora

  • Hunglish Corpus The Hunglish Corpus is a free sentence-aligned Hungarian-English parallel corpus of about 120 million words in 4 million sentence pairs.
  • SzegedParallel The English-Hungarian parallel corpus contains texts selected on the basis of grammatical and translational criteria.
  • HunOr A Hungarian-Russian parallel corpus comprising approximately 800 thousand words.
  • CoNLL 2017 Shared Task Hungarian data Automatic segmentation, tokenization and morphological and syntactic annotations of raw texts from the Common Crawl
  • CSS10 A Collection of Single Speaker Speech Datasets for 10 Languages including Hungarian
  • Hungarian-Russian Prisoner of War Database
  • TED talks transcripts parallel corpus Sentence-aligned TED talks, including Hungarian.
  • TaPaCo Corpus is a paraphrase corpus for 73 languages, including Hungarian, extracted from the Tatoeba database
  • Duolingo STAPLE is a dataset of comprehensive accepted translations from English to 5 different languages, including Hungarian
  • PPDB is an automatically extracted database containing millions of paraphrases in 16 different languages, including Hungarian
  • OpenSubtitles Corpus contains movie subtitles and alignments for 62 languages, including Hungarian
  • OPUS Corpus (https://opus.nlpl.eu) is a growing collection of translated texts from the web
  • MASSIVE dataset is a parallel dataset of > 1M utterances across 51 languages with annotations for the Natural Language Understanding tasks of intent prediction and slot annotation.
  • PWS is a parallel collection of the Winograd schemas in seven languages (including Hungarian)

Linguistic resources

Linked Open Data

Geo data

Speech related data

Academy

Journals

Conferences

  • MSZNY Conference on Hungarian Computational Linguistics (since 2003)

Institutes

Learning resources

Books

Courses

Tutorials

Communities

Other Hungarian related resource collections
