lmorgadodacosta/CantoneseWN

The Cantonese Wordnet

This repository contains the data for the Cantonese Wordnet project.

This project was created and is continuously updated by Joanna Ut-Seong Sio (Palacký University, Czech Republic) and Luis Morgado da Costa (Vrije Universiteit Amsterdam, the Netherlands).

Our wordnet contains data both in traditional characters and in Jyutping (a romanisation system for Cantonese developed by the Linguistic Society of Hong Kong in 1993). The Cantonese wordnet is currently supported in two formats:

  • the Lexical Markup Framework (LMF) compatible XML, released and maintained by the Global Wordnet Association;
  • a legacy TSV format adopted by the original version of the Open Multilingual Wordnet (due to format constraints, not all data are available in the legacy format, namely the Jyutping forms).
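
As a rough illustration of the LMF format, the sketch below reads lexical entries from a WN-LMF style XML fragment using only the Python standard library. The element and attribute names (`LexicalEntry`, `Lemma`, `Form`, `Sense`, `writtenForm`, `partOfSpeech`, `synset`) follow the Global Wordnet Association schema. Everything else is an assumption made for illustration: the fragment is hand-written rather than copied from the release, the ids are placeholders, and storing Jyutping as an additional `Form` element is only one plausible encoding, so check the actual file before relying on it.

```python
import xml.etree.ElementTree as ET

# Hypothetical WN-LMF fragment: ids are placeholders, not real release data.
# Assumption: the Jyutping romanisation is stored as an extra <Form>.
SAMPLE_LMF = """<LexicalResource>
  <Lexicon id="example-yue" label="Example" language="yue" version="0.0">
    <LexicalEntry id="example-yue-entry-1">
      <Lemma writtenForm="狗" partOfSpeech="n"/>
      <Form writtenForm="gau2"/>
      <Sense id="example-yue-sense-1" synset="example-yue-00000001-n"/>
    </LexicalEntry>
  </Lexicon>
</LexicalResource>"""


def read_entries(xml_text):
    """Return one dict per <LexicalEntry> with lemma, POS, forms and synsets."""
    root = ET.fromstring(xml_text)
    entries = []
    for entry in root.iter("LexicalEntry"):
        lemma = entry.find("Lemma")
        entries.append({
            "lemma": lemma.get("writtenForm"),
            "pos": lemma.get("partOfSpeech"),
            # Extra written forms attached to the entry (here: Jyutping).
            "forms": [f.get("writtenForm") for f in entry.findall("Form")],
            # Synset ids referenced by the entry's senses.
            "synsets": [s.get("synset") for s in entry.findall("Sense")],
        })
    return entries


if __name__ == "__main__":
    for item in read_entries(SAMPLE_LMF):
        print(item["lemma"], item["pos"], item["forms"], item["synsets"])
```

To read the released file instead of the inline sample, `ET.parse(path).getroot()` can replace `ET.fromstring`, with `path` pointing at the XML file in this repository.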

Currently, the Cantonese Wordnet contains over 16,500 hand-checked lemmas with their romanisations, distributed across all major parts of speech. More descriptive statistics and details of the methodology can be found in its canonical citation (see below).
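
A per-part-of-speech lemma count can be sketched from the legacy TSV format as follows. The sketch assumes the usual Open Multilingual Wordnet layout: a `#` header line, then tab-separated rows of synset id (ending in `-n`, `-v`, `-a` or `-r`), a `lang:lemma` type column, and the lemma. The language code `yue`, the placeholder synset ids, and the sample rows are all assumptions for illustration, not data from the release.

```python
from collections import Counter

# Hypothetical rows in the legacy OMW TSV layout; synset ids are placeholders.
SAMPLE_TSV = (
    "# Example Wordnet\tyue\thttps://example.org\tCC BY 4.0\n"
    "00000001-n\tyue:lemma\t狗\n"
    "00000002-v\tyue:lemma\t飲\n"
    "00000003-v\tyue:lemma\t食\n"
)


def count_lemmas_by_pos(tsv_text):
    """Count lemma rows per part of speech, read from the synset id suffix."""
    counts = Counter()
    for line in tsv_text.splitlines():
        # Skip blank lines and the '#' header/comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 3 or not fields[1].endswith(":lemma"):
            continue  # ignore definitions, examples, or malformed rows
        pos = fields[0].rsplit("-", 1)[-1]
        counts[pos] += 1
    return counts


if __name__ == "__main__":
    print(dict(count_lemmas_by_pos(SAMPLE_TSV)))
```

Note that this counts rows rather than distinct lemmas; a lemma that appears under several synsets is counted once per synset.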

Demo

In the future, the Cantonese Wordnet will be included in the Open Multilingual Wordnet (OMW). However, as OMW is currently undergoing restructuring, we are hosting it here in the meantime.

Notable features

  • Our wordnet is fully hand-checked by trained linguists;
  • For each lemma, both its Jyutping and character representations are included. For Jyutping, we include as much variation in pronunciation as possible (including bin3jam1 變音 ‘changed tone’ and laan5jam1 懶音 ‘lazy pronunciation’). For character representations, we also include as much variation as possible, given that there is no official standardisation;
  • Following recent trends, our wordnet is not limited to open-class words: it also includes function words (e.g., classifiers and post-verbal particles);
  • Our wordnet is being developed alongside a companion corpus (The Cantonese Wordnet Corpus), which is also being sense-tagged. This corpus is used to attest senses and to provide example sentences for individual senses.

License

The Cantonese wordnet is released under a Creative Commons Attribution 4.0 International License (CC BY 4.0) and its canonical citation is:

Sio, Joanna Ut-Seong & Morgado da Costa, Luis. (2019). Building the Cantonese Wordnet. In Proceedings of the Tenth Global Wordnet Conference (GWC 2019), pp. 206-215. Wroclaw, Poland.

If you use any data from the Cantonese Wordnet Corpus, please also cite the following paper:

Sio, Joanna Ut-Seong & Morgado da Costa, Luis. (2022). Enriching Linguistic Representation in the Cantonese Wordnet and Building the New Cantonese Wordnet Corpus. Proceedings of the 13th Conference on Language Resources and Evaluation. European Language Resources Association (ELRA). Marseille, France.

References:

Sio, Joanna Ut-Seong & Morgado da Costa, Luís. (2023). The Open Cantonese Sense-Tagged Corpus. Proceedings of the 12th International Global Wordnet Conference. San Sebastian, Spain.

Sio, Joanna Ut-Seong & Morgado da Costa, Luis. (2022). Enriching Linguistic Representation in the Cantonese Wordnet and Building the New Cantonese Wordnet Corpus. Proceedings of the 13th Conference on Language Resources and Evaluation. European Language Resources Association (ELRA). Marseille, France.

Sio, Joanna Ut-Seong & Morgado da Costa, Luis. (2019). Building the Cantonese Wordnet. In Proceedings of the Tenth Global Wordnet Conference (GWC 2019), pp. 206-215. Wroclaw, Poland.