Latest commit 58149d8 Jan 23, 2017 @mjhsieh mjhsieh hotfix - accidental sorting
Permalink
..
Failed to load latest commit information.
bin
memo
.gitignore
BPMFBase.txt
BPMFMappings.txt
BPMFPunctuations.txt
Makefile
README
Symbols.txt
exclusion.txt
heterophony1.list
heterophony2.list
heterophony3.list
phrase.occ

README

├── BPMFBase.txt		Bopomofo Table for chars
├── BPMFMappings.txt		Bopomofo Table for 2-char words to 6-char words
|                                  Originally simplified from tsi.src of libtabe
|                                  (BSD Licensed) with my own and zonble's
|                                  modification.
├── BPMFPunctuations.txt	Punctuation table
├── heterophony1.list		table for char heterophony, first order
├── heterophony2.list		table for char heterophony, second order
├── heterophony3.list		table for char heterophony, third order
├── phrase.occ			table for appearance of a word in a text pool
├── Makefile			for conveniences for developers
└── bin
    ├── cook.py			The tool to build the data.txt for McBopomofo
    ├── buildFreq.bash		The awk script for counting occurance
    ├── bpmfmap.py		mapping bpmf automatically and primitively
    ├── cook-plain-bpmf.py	build data file for tranditional bpmf ime
    ├── count.bash		a wrapper for a C-based counting facility
    ├── count.occurrence.c	a C-based counting facility for command line
    └── count.occurrence.py	a python-based counter for the whole list

----- Editorial Rule -----
* when in doubt, use google/yahoo search to confirm the rarity of the phrases
  for example search: "一瞻丰采" site:.tw
  - if the amount of the results is under 1000, it's likely okay to remove this
    phrase
  - if tsi.src variants showed up in the first page of the results, you should
    remove this phrase. Tsi.src variants include SEO sites, this phenomenom is
    quite amazing itself.
  - for idioms, it's possible the first hit is the idiom database, it's up to 
    the editor to decide.
  - if this candidate is apparent part of a long phrase and the segmentation is
    apparently incorrect, find a way to remove the phrase candidate and replace
    it with the complete one, but forget about it if the length is longer than
    6.
* We need to constantly remind ourself, IME is not an idiom database.