Adds a new setting: Chinese conversion style #227

zonble · 2022-01-15T12:25:33Z

A new setting is added. Users can now choose one of the following options for Chinese conversion:

Convert the output: the current behavior. The Chinese converter is applied only when we send the text to the client app.
Convert the models: when the node walker asks McBopomofoLM for unigrams, the Chinese converter is applied to what the language model returns.
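
To make the difference between the two options concrete, here is a minimal sketch. The enum, the function names, and the `Converter` alias are hypothetical and are not taken from the McBopomofo source. The sketch only models where the converter runs under each option.

```cpp
#include <functional>
#include <string>
#include <vector>

// Hypothetical names; the PR itself only speaks of "Option 0" and "Option 1".
enum class ChineseConversionStyle {
    Output = 0,  // convert only the text that is sent to the client app
    Model = 1,   // convert values as the language model returns them
};

using Converter = std::function<std::string(const std::string&)>;

// Values as they leave the language model under each style.
std::vector<std::string> visibleCandidates(const std::vector<std::string>& modelValues,
                                           ChineseConversionStyle style,
                                           const Converter& convert) {
    if (style == ChineseConversionStyle::Output) {
        return modelValues;  // left untouched until commit time
    }
    std::vector<std::string> converted;
    for (const auto& value : modelValues) {
        converted.push_back(convert(value));
    }
    return converted;
}

// Text that is finally sent to the client app.
std::string committedText(const std::string& chosen,
                          ChineseConversionStyle style,
                          const Converter& convert) {
    // Under the Model style the value was already converted when it left the model.
    return style == ChineseConversionStyle::Output ? convert(chosen) : chosen;
}
```

Either way the committed text is converted. What changes is whether the values seen before commit are converted too.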

This is how McBopomofoLM processes a unigram:

Check whether the original value of the unigram is in the excluded phrases.
Convert the value with the phrase replacer, a custom user table.
Convert the value with an external converter (a C++ lambda) that may call OpenCC or VXHanConvert, if Chinese conversion is on.
Check whether the converted value has already been inserted.
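
The four steps can be sketched roughly as follows. This is a simplified illustration and not the code in this PR: the `Unigram` struct is a stand-in, the phrase replacer is modeled as a plain map, and the function name and parameter list are assumptions.

```cpp
#include <functional>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// Simplified stand-in for the engine's Unigram type (assumption).
struct Unigram {
    std::string value;
    double score;
};

std::vector<Unigram> filterAndTransform(
    const std::vector<Unigram>& unigrams,
    const std::unordered_set<std::string>& excludedValues,
    const std::unordered_map<std::string, std::string>& replacements,
    const std::function<std::string(const std::string&)>& externalConverter,
    std::unordered_set<std::string>& insertedValues) {
    std::vector<Unigram> results;
    for (const auto& unigram : unigrams) {
        // 1. Skip excluded phrases (checked against the original value).
        if (excludedValues.count(unigram.value)) continue;

        // 2. Apply the user's phrase replacement table.
        std::string value = unigram.value;
        auto it = replacements.find(value);
        if (it != replacements.end()) value = it->second;

        // 3. Apply the external converter when one is set.
        if (externalConverter) value = externalConverter(value);

        // 4. Keep only the first unigram for each converted value.
        if (insertedValues.insert(value).second) {
            results.push_back({value, unigram.score});
        }
    }
    return results;
}
```

Step 4 matters once a converter is involved, because two distinct original values can map to the same converted value.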

ShikiSuen · 2022-01-15T13:31:34Z

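お疲れ様 (thank you for your hard work). With this, the overall direction is now on track.
(Edit: replaced [辛苦了] with [お疲れ様] to avoid any misunderstanding about rank.)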

zonble · 2022-01-15T14:31:31Z

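A teachable moment: in Mandarin as used in Taiwan, a simple "thank you" is enough between peers. 「辛苦了」 is something a superior says to a subordinate.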

ShikiSuen · 2022-01-15T14:34:13Z

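@zonble Thanks for letting me know.
It seems that fifty years of Japanese colonial rule left quite a mark. I had been using the phrase entirely as something a subordinate says to a superior.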

zonble · 2022-01-15T14:40:41Z

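Within East Asia as a whole, Taiwan is probably already one of the places with less of a sense of hierarchy, and I personally don't care much about these things. Still, some subtle points can make people uncomfortable if they are not handled well. In particular, your Traditional Chinese is already good enough that people cannot intuitively tell whether you are Taiwanese, so it may lead to reactions like "why does this person talk like that?"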

ShikiSuen · 2022-01-15T14:42:27Z

@zonble Thank you again for sharing your experience.

zonble · 2022-01-15T14:46:34Z

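For now I have still made this conversion style an option that has to be turned on explicitly. My reasons are roughly as follows:

Compared with converting only the output, converting the whole model also converts other phrases that are never used, so it costs more computation.
From the Yahoo input method until now, few people have actually complained about converting only the output. I think an input method product should avoid disturbing existing user habits as much as possible, so this is added as an additional conversion mode rather than a direct replacement for the original one.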

ShikiSuen · 2022-01-15T14:53:08Z

@zonble Converting the whole model may actually be overkill.
What about converting only the content that is about to be sent to the input buffer and the Voltaire candidate controller?

ShikiSuen · 2022-01-15T15:03:23Z

Convert the value with an external converter (a C++ lambda) that may call OpenCC or VXHanConvert, if Chinese conversion is on.

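I forgot about this sentence for a moment. It looks like this may already be the limit of what can be done. If so, let's leave it at that.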

lukhnos

LGTM.

lukhnos · 2022-01-16T00:35:13Z

Source/Engine/McBopomofoLM.cpp

+    m_externalConverter = externalConverter;
+}
+
+const vector<Unigram> McBopomofoLM::filterAndTransformUnigrams(const vector<Unigram> unigrams, const unordered_set<string>& excludedValues, unordered_set<string>& insertedValues)


Perhaps comment that insertedValues is an in/out parameter and this method has the side-effect of updating that unordered set.
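
A sketch of the kind of comment being suggested, on a simplified function. The name `filterValues` and the string-only signature are hypothetical and are used here only to keep the example self-contained.

```cpp
#include <string>
#include <unordered_set>
#include <vector>

/// Returns the values that are neither excluded nor already inserted.
///
/// @param values          Candidate values from one source (e.g. user phrases).
/// @param excludedValues  Values to drop unconditionally.
/// @param insertedValues  In/out parameter. On entry: values already emitted by
///                        earlier calls. On return: also contains every value
///                        this call emitted (side effect).
std::vector<std::string> filterValues(const std::vector<std::string>& values,
                                      const std::unordered_set<std::string>& excludedValues,
                                      std::unordered_set<std::string>& insertedValues) {
    std::vector<std::string> results;
    for (const auto& value : values) {
        if (excludedValues.count(value)) continue;
        // insert() reports whether the value was new; this is the side effect.
        if (insertedValues.insert(value).second) results.push_back(value);
    }
    return results;
}
```

Passing the same set to successive calls, for example once for user phrases and once for the main model, is what lets the second call drop values the first call already emitted.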

Option 0: converts the output. Option 1: converts the models.

ShikiSuen · 2022-01-17T01:30:39Z

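This feature may still need improvement.
Currently, when this pre-conversion feature is turned on, all newly added user phrases are automatically converted to Simplified Chinese.
(I don't know how https://github.com/rime/squirrel solves this problem.)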

ShikiSuen · 2022-01-17T01:50:20Z

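It seems the thing to convert should be the candidate controller.
The array the candidate controller uses when handling candidates would have three columns: Index, Converted Candidate, Actual Candidate.
If and only if the converted characters involve the case where several Traditional characters with different meanings map to a single Simplified character, such candidates should not be deduplicated away, but displayed in the form Displayed Candidate = Converted Candidate (Actual Candidate).
For example, the entries for ㄑㄧㄢˋ would show 纤(縴). This is how RIME / Squirrel displays it.
On this basis, both custom user phrases and the half-life (decay) memory module operate on the Actual Candidate, not the Converted Candidate.
Once the user decides which candidate to input, what is sent to the final buffer is the Converted Candidate.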
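
A rough sketch of that three-column idea. All names are hypothetical. It also simplifies the collision rule: it treats "two or more candidates in the same list convert to the same string" as the trigger for the Converted (Actual) display form, and it assumes the input list contains no duplicates. The vector position plays the role of the Index column.

```cpp
#include <functional>
#include <string>
#include <unordered_map>
#include <vector>

struct CandidateEntry {
    std::string converted;  // sent to the final buffer when chosen
    std::string actual;     // key for user phrases and the decay memory
    std::string displayed;  // text shown in the candidate window
};

std::vector<CandidateEntry> buildCandidates(
    const std::vector<std::string>& actualValues,
    const std::function<std::string(const std::string&)>& convert) {
    // Count how many actual values map to each converted value.
    std::unordered_map<std::string, int> uses;
    for (const auto& actual : actualValues) ++uses[convert(actual)];

    std::vector<CandidateEntry> entries;
    for (const auto& actual : actualValues) {
        std::string converted = convert(actual);
        const bool collides = uses[converted] > 1;
        // Colliding entries are kept and disambiguated instead of deduplicated.
        entries.push_back({converted, actual,
                           collides ? converted + "(" + actual + ")" : converted});
    }
    return entries;
}
```

With 縴 and 纖 both converting to 纤, this would yield the displayed forms 纤(縴) and 纤(纖), while learning and user-phrase lookups would still use the `actual` field.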

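This will only get more and more complicated, so perhaps it should be shelved.
When the Simplified and Traditional lexicons are not independent of each other, you get this kind of situation, where near-perfection can only be reached at a very high technical cost.
RIME / Squirrel is already nearly perfect (it pushes OpenCC's capabilities to the limit), but it is still constrained by what OpenCC can do.
That is why I designed my version with independent Simplified and Traditional lexicons.
That said, independent lexicons do mean that users have to maintain two separate sets of custom phrases.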

ShikiSuen · 2022-01-21T14:29:16Z

A note to certain future forkers who don't want this feature:

Zonble solved a problem in af7afe1: user phrases, even if not duplicated, could be read twice instead of once.
However, Zonble's purpose in that commit was to remove duplicates dynamically in the RAM before they get pushed to the candidate controller.

If anyone only wants to stop multiple readings of a user phrase without implementing the entire feature of this commit, simply find this line in McBopomofoLM.cpp:

vector<Unigram> filterredUserUnigrams = m_userPhrases.unigramsForKey(key);

and change it to:

vector<Unigram> filterredUserUnigrams;

Credit: Hiraku.

zonble · 2022-01-21T17:03:41Z

Your comment confuses me. The variable filterredUserUnigrams was already removed in af7afe1

ShikiSuen · 2022-01-22T00:36:09Z

If I'm not mistaken, the purpose of this commit goes far beyond removing filterredUserUnigrams, doesn't it?
Otherwise I can hardly imagine what filterAndTransformUnigrams is for.

ShikiSuen · 2022-01-22T01:04:25Z

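@zonble I see what the misunderstanding is now.

Here is what I meant. My original point was: if some future forker does not want PR #227 but still wants to fix the "user phrases are read twice" bug introduced by lukhnos, then replacing vector<Unigram> filterredUserUnigrams = m_userPhrases.unigramsForKey(key); with vector<Unigram> filterredUserUnigrams; should be enough.

Also, I just tested again and confirmed my guess from last night: you do not remove duplicate entries from the user phrase file in place; you only exclude duplicates in memory. (As it happens, the new feature you introduced ties in with the deduplicator design that this PR brings in, which OpenCC pre-conversion requires.)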

ShikiSuen · 2022-01-23T06:47:36Z

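I just noticed that af7afe1 has another hidden benefit:
If a separate language model were introduced to handle 全字庫 (the full character set), then on top of af7afe1 it could cleanly exclude entries in the full character set that are already included in the input method's main lexicon. Without af7afe1, that would be troublesome to handle.

As for whether af7afe1 leads to the candidate list no longer reflecting duplicate entries in the user phrase table, that no longer matters given the bigger picture. If anyone needs that, it would be better to add automatic deduplication to UserPhrasesLM.

P.S.: The range of the full character set that McBopomofo supported back then is far smaller than that of the current 2020 edition.
But stuffing all of it into BPMFBase is probably not realistic either.
In my humble opinion, if this is really to be done, a separate language model dedicated to the full character set may be more suitable.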
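
A minimal sketch of what such automatic deduplication could look like. Representing each user phrase row as a (phrase, reading) pair is an assumption, and `removeDuplicateRows` is a hypothetical helper, not part of UserPhrasesLM.

```cpp
#include <set>
#include <string>
#include <utility>
#include <vector>

// One row of the user phrase table: (phrase, reading). Assumed layout.
using PhraseRow = std::pair<std::string, std::string>;

// Returns the rows with exact duplicates removed, keeping first occurrences
// and preserving the original order.
std::vector<PhraseRow> removeDuplicateRows(const std::vector<PhraseRow>& rows) {
    std::set<PhraseRow> seen;
    std::vector<PhraseRow> unique;
    for (const auto& row : rows) {
        if (seen.insert(row).second) unique.push_back(row);
    }
    return unique;
}
```

A row counts as a duplicate only when both the phrase and the reading match, so the same phrase under two readings is kept.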

zonble requested a review from lukhnos January 15, 2022 12:29

lukhnos approved these changes Jan 16, 2022


zonble added 5 commits January 16, 2022 15:04

Filters duplicated unigram values properly.

af7afe1

Adds an option to let users to choose Chinse conversion style.

96eca18

Option 0: converts the output. Option 1: converts the models.

Refactors the function to filter and transform unigrams in McBopomofoLM.

84fa341

Fixes a wrong API call.

812f59b

Updates comments and fixes a typo.

d99b883

zonble force-pushed the master branch from 756a33d to d99b883 January 16, 2022 07:04

zonble merged commit fa8f508 into openvanilla:master Jan 16, 2022

openvanilla locked as resolved and limited conversation to collaborators Jan 28, 2022
