Document properties #2

Closed
craigpfeifer opened this issue Oct 14, 2015 · 4 comments
@craigpfeifer

The properties file contains the following settings:

stopWordListName = data/CN.nw.wordlist.txt
endWordListName = data/CN.endlist.txt
forbiddenCharListName = data/CN.charlist.txt
stopThreshold = 50
forbiddenThreshold = 800
minAV = 5
minCount = 3
minDocumentCount = 5
terminologyThreshold = 0.6

Could you document what each parameter is? I think the first 3 are obvious, but the others could use some explanation.

Thanks!
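For readers who want to inspect these settings while the documentation is pending, here is a minimal sketch that parses the listing above into typed values. It assumes the file uses plain `key = value` lines; the helper names `parse_properties` and `coerce` are hypothetical and not part of the project, and the int/float typing is inferred only from how the values look, not from the project's code. It does not say what any parameter means.

```python
# Hypothetical sketch: not the project's own loader.
# Assumes plain "key = value" lines as shown in the issue.
PROPERTIES_TEXT = """\
stopWordListName = data/CN.nw.wordlist.txt
endWordListName = data/CN.endlist.txt
forbiddenCharListName = data/CN.charlist.txt
stopThreshold = 50
forbiddenThreshold = 800
minAV = 5
minCount = 3
minDocumentCount = 5
terminologyThreshold = 0.6
"""


def parse_properties(text):
    """Return a dict of raw string values, skipping blanks and comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", "!")):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            continue  # ignore lines without a key/value separator
        props[key.strip()] = value.strip()
    return props


def coerce(value):
    """Guess a type from the text: int first, then float, else string."""
    for cast in (int, float):
        try:
            return cast(value)
        except ValueError:
            pass
    return value


props = {k: coerce(v) for k, v in parse_properties(PROPERTIES_TEXT).items()}

for name, value in props.items():
    print(f"{name} = {value!r} ({type(value).__name__})")
```

Printing the parsed values next to their guessed types makes it easy to see which settings are file paths and which are numeric thresholds or counts.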

@ivanhe ivanhe self-assigned this Oct 14, 2015
@craigpfeifer
Author

Also, if there are other properties squirreled away somewhere that can affect the results, that would be useful as well!

@ivanhe
Owner

ivanhe commented Oct 14, 2015

I will document the parameters, but what matters most is using a Chinese word segmenter that works well on your data.

@craigpfeifer
Author

Agreed. Bad segmentation is impossible to recover from.

@ivanhe
Owner

ivanhe commented Oct 14, 2015

Updated README.md. Thank you!

@ivanhe ivanhe closed this as completed Oct 14, 2015