Can it analyse Chinese? #68

cdmikechen · 2016-06-15T12:29:49Z

I recently came across Mallet. I am an engineer in China, and my boss asked me to find out whether it can analyse Chinese. I found test cases for English, Japanese and so on, but none for Chinese, so I wanted to ask.
If it can, what should I do? Do I need to change the base code or add some plugins?
If anyone knows, please tell me.
Thanks!

mimno · 2016-06-15T12:49:32Z

Mallet is not language-specific, but does not support Chinese tokenization. You would need to run something like the Stanford Chinese segmenter on input text before importing.
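
As a rough illustration of that pre-segmentation step, here is a minimal Python sketch. It does not use the Stanford segmenter; it swaps in a toy greedy forward-maximum-matching segmenter with a hypothetical four-word dictionary (`VOCAB`), purely to show the shape of the preprocessing. The helper names `segment` and `to_mallet_line` are made up for this example. The output is one document per line in a "name label text" layout, with segmented words separated by spaces so that a letter-based token regex can split them.

```python
def segment(text, vocab, max_len=4):
    """Toy forward maximum matching: at each position take the longest
    dictionary word, falling back to a single character."""
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted as a last resort.
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


def to_mallet_line(doc_id, label, text, vocab):
    """Build one 'name label text' line with space-separated words."""
    return f"{doc_id} {label} {' '.join(segment(text, vocab))}"


# Hypothetical toy dictionary (an assumption for this sketch);
# a real segmenter ships its own model or lexicon.
VOCAB = {"五花肉", "切块", "红烧肉", "冰糖"}

line = to_mallet_line("id1", "id2", "五花肉切块", VOCAB)
print(line)  # id1 id2 五花肉 切块
```

In practice the `segment` function would be replaced by a call to a real segmenter; the rest of the pipeline stays the same.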

chandrasg · 2017-09-24T14:04:59Z

Hi @mimno, I applied text segmentation to the Chinese text that I am using as input to Mallet.

My input looks like this:
id1 id2 【食尚玩家冰糖红烧肉】）五花肉切块，葱

I am trying to capture both English and Chinese tokens. Is this the right way to do it?
../mallet-2.0.8/bin/mallet import-file --input inputfile.txt --token-regex '([\p{L}\p{M}]+|[\p{IsHan}]+)' --output inputfile.mallet --keep-sequence --remove-stopwords --stoplist-file stopwords-zh.json

Thanks!
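
One way to sanity-check the token regex in the command above is to look at the Unicode general categories of the sample line. Python's `re` module does not support `\p{...}`, so this sketch only emulates the `[\p{L}\p{M}]+` branch with `unicodedata`; it is an approximation for illustration, not Mallet's own tokenizer. Han characters have general category `Lo`, so the `\p{L}` branch already matches them and the `[\p{IsHan}]+` alternative should be redundant. The sketch also suggests that when words are not separated by spaces, each unbroken run of Chinese characters comes out as a single token.

```python
import unicodedata

sample = "【食尚玩家冰糖红烧肉】）五花肉切块，葱"


def is_letter_or_mark(ch):
    # Emulates the character class [\p{L}\p{M}]: general category L* or M*.
    return unicodedata.category(ch)[0] in ("L", "M")


# Collect maximal runs of letter/mark characters, as a "+" quantifier would.
tokens, current = [], ""
for ch in sample:
    if is_letter_or_mark(ch):
        current += ch
    elif current:
        tokens.append(current)
        current = ""
if current:
    tokens.append(current)

print(unicodedata.category("食"))  # Lo
print(tokens)  # ['食尚玩家冰糖红烧肉', '五花肉切块', '葱']
```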

cdmikechen closed this as completed Nov 4, 2021
