Adds a new setting: Chinese conversion style #227
Conversation
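お疲れ様 (roughly, "thank you for your hard work"). With this, the overall direction now lines up.
A teachable moment: in the Chinese-language context in Taiwan, a simple "thank you" is enough between peers. 「辛苦了」 ("you've worked hard") is something a superior says to a subordinate.
@zonble Thanks for letting me know.
Within East Asia as a whole, Taiwan is probably already one of the places with the least sense of hierarchy, and I personally don't care much about these things. Still, some subtle points can make people uncomfortable if they aren't handled well. Your Traditional Chinese is already good enough that readers can't intuitively tell whether you are Taiwanese, so it may provoke reactions like "why does this person talk like that?"
@zonble Thanks again for sharing your experience.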
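Basically, I will still implement this conversion method as an option that has to be turned on explicitly. The reasons are roughly as follows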
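@zonble Converting the entire model may actually be overkill.
I had forgotten about this sentence for a moment. It looks like this may already be the limit. If so, let's leave it at that.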
LGTM.
m_externalConverter = externalConverter;
}

const vector<Unigram> McBopomofoLM::filterAndTransformUnigrams(const vector<Unigram> unigrams, const unordered_set<string>& excludedValues, unordered_set<string>& insertedValues)
Perhaps comment that insertedValues is an in/out parameter and that this method has the side effect of updating that unordered set.
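To make the in/out behavior concrete, here is a minimal sketch of a filter-and-transform step of this shape. It is not the code from this PR: the `Unigram` struct, the `Converter` alias, the free function `filterAndTransform`, and the choice to check exclusions before conversion are all assumptions made for illustration.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical stand-in for the project's Unigram type.
struct Unigram {
    std::string key;
    std::string value;
    double score;
};

// Hypothetical converter type: maps one candidate string to another.
using Converter = std::function<std::string(const std::string&)>;

// Filters out excluded values, optionally converts each value, and skips
// values that were already emitted.
//
// insertedValues is an in/out parameter: it is read to detect duplicates
// and updated with every value this call emits, so the caller can pass the
// same set to a later call (for example user phrases first, then the main
// model) and get deduplication across both calls.
std::vector<Unigram> filterAndTransform(
    const std::vector<Unigram>& unigrams,
    const std::unordered_set<std::string>& excludedValues,
    std::unordered_set<std::string>& insertedValues,
    const Converter& converter)
{
    std::vector<Unigram> results;
    for (const Unigram& unigram : unigrams) {
        // Assumption: exclusion is checked against the unconverted value.
        if (excludedValues.count(unigram.value) != 0) {
            continue;
        }
        Unigram converted = unigram;
        if (converter) {
            converted.value = converter(unigram.value);
        }
        // insert().second is false when the value was already in the set.
        if (insertedValues.insert(converted.value).second) {
            results.push_back(converted);
        }
    }
    return results;
}
```

Calling the function twice with the same `insertedValues` set is what makes the side effect useful: the second call silently drops anything the first call already produced.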
Option 0: converts the output. Option 1: converts the models.
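The difference between the two options can be sketched as follows. This is only an illustration under assumptions: the enum `ConversionStyle`, the `Converter` alias, and the helpers `candidatesForDisplay` and `textToCommit` are hypothetical names, not identifiers from the PR.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Hypothetical names for the two options described above.
enum class ConversionStyle {
    Output = 0,  // convert only the text that is finally committed
    Models = 1   // convert the values coming out of the language models
};

// Hypothetical converter type: maps one string to another.
using Converter = std::function<std::string(const std::string&)>;

// With the Models style the user already sees converted candidates.
// With the Output style the candidates are shown unconverted.
std::vector<std::string> candidatesForDisplay(
    const std::vector<std::string>& modelValues,
    ConversionStyle style,
    const Converter& convert)
{
    std::vector<std::string> shown;
    for (const std::string& value : modelValues) {
        shown.push_back(style == ConversionStyle::Models ? convert(value) : value);
    }
    return shown;
}

// With the Output style conversion happens once, at commit time.
// With the Models style the composed text is assumed to be converted already.
std::string textToCommit(
    const std::string& composed,
    ConversionStyle style,
    const Converter& convert)
{
    return style == ConversionStyle::Output ? convert(composed) : composed;
}
```

Under these assumptions both styles commit the same text; what changes is whether the candidate list shows converted or unconverted strings.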
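This feature may still need improvement.
It looks like the conversion target should be the candidate controller. Continuing down this path will only get more complicated, so perhaps it is better to shelve it.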
A note to future forkers who don't want this feature: Zonble solved a problem in af7afe1: user phrases, even when not duplicated, could be read twice instead of once. If you only want to stop a user phrase from being read multiple times, without implementing the entire feature of this commit, simply find this line: vector<Unigram> filterredUserUnigrams = m_userPhrases.unigramsForKey(key); and change it to: vector<Unigram> filterredUserUnigrams; Credit: Hiraku.
Your comment confuses me. The variable |
If I'm not mistaken, the purpose of this commit goes far beyond removing filterredUserUnigrams, doesn't it?
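@zonble I see what the misunderstanding is. Here is what I originally said: if some future forker doesn't want PR 277 but still wants to fix the "user phrases are read repeatedly" bug introduced by lukhnos, then they can just change … I also tested again just now and confirmed my guess from last night: you are not removing duplicate entries from the user phrase file in place, but only excluding duplicates in memory. (This new behavior also happens to fit the deduplicator design introduced in this PR, which is required for OpenCC pre-conversion):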
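I just noticed that af7afe1 has another hidden benefit: … As for whether af7afe1 causes the candidate list to reflect (or not reflect) duplicated entries in the user phrase table, that no longer matters given what is at stake here. If anyone needs that, it would be better to add an automatic deduplication step to UserPhrasesLM. P.S.: The range of 全字庫 that McBopomofo supported back then is far smaller than the range of the current 全字庫 2020 edition.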
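The automatic deduplication step suggested above could look roughly like the sketch below. It works on the raw lines of a user phrase file rather than in memory at lookup time. The function name `deduplicateLines` is hypothetical, and treating each non-empty line as one entry is an assumption about the file format, not something the discussion states.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>
#include <unordered_set>
#include <vector>

// Reads entries line by line and keeps only the first occurrence of each
// non-empty line, preserving the original order. Empty lines are dropped.
// Assumption: one user phrase entry per line.
std::vector<std::string> deduplicateLines(std::istream& input)
{
    std::vector<std::string> kept;
    std::unordered_set<std::string> seen;
    std::string line;
    while (std::getline(input, line)) {
        if (line.empty()) {
            continue;
        }
        // insert().second is true only the first time a line is seen.
        if (seen.insert(line).second) {
            kept.push_back(line);
        }
    }
    return kept;
}
```

A caller could write the returned lines back to disk to clean the file in place, or feed them to the in-memory model directly.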
A new setting is added. Users can now choose one of the following options for Chinese conversion
This is how McBopomofoLM processes a unigram