jacksonllee/multi-tiered-cantonese-word-segmentation

Multi-Tiered Cantonese Word Segmentation

This repository contains the subset of the Hong Kong Cantonese Corpus (HKCanCor) that has been re-segmented according to the multi-tiered word segmentation scheme described in the following paper:

  • Charles Lam, Chaak-ming Lau, and Jackson L. Lee. 2024. Multi-Tiered Cantonese Word Segmentation. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 11993–12002, Torino, Italy. ELRA and ICCL.

@inproceedings{lam-etal-2024-multi-tiered,
    title = "Multi-Tiered {C}antonese Word Segmentation",
    author = "Lam, Charles  and
      Lau, Chaak-ming  and
      Lee, Jackson L.",
    editor = "Calzolari, Nicoletta  and
      Kan, Min-Yen  and
      Hoste, Veronique  and
      Lenci, Alessandro  and
      Sakti, Sakriani  and
      Xue, Nianwen",
    booktitle = "Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)",
    month = may,
    year = "2024",
    address = "Torino, Italy",
    publisher = "ELRA and ICCL",
    url = "https://aclanthology.org/2024.lrec-main.1047",
    pages = "11993--12002",
    abstract = "Word segmentation for Chinese text data is essential for compiling corpora and any other tasks where the notion of {``}word{''} is assumed, since Chinese orthography does not have conventional word boundaries as languages such as English do. A perennial issue, however, is that there is no consensus about the definition of {``}word{''} in Chinese, which makes word segmentation challenging. Recent work in Chinese word segmentation has begun to embrace the idea of multiple word segmentation possibilities. In a similar spirit, this paper focuses on Cantonese, another major Chinese variety. We propose a linguistically motivated, multi-tiered word segmentation system for Cantonese, and release a Cantonese corpus of 150,000 characters word-segmented by this proposal. Our work will be of interest to researchers whose work involves Cantonese corpus data.",
}

This data is released under the CC BY 4.0 license, the same license as the source HKCanCor data.
