fix(search): Handle word searches with char codes below 0x3000 #1607

bryanjenningz · 2023-06-06T20:23:45Z

Words like ＤＶＤ-ＲＯＭ were not matching due to the hyphen (represented by a char code below 0x3000) in the middle. In order to include, entries like ＋α, logic was removed from rikaicontent.ts that skipped lookups when the letter the mouse hovered over was not a normal Japanese char code. (This included other various ranges which didn't show additional impacts as of this PR.)

These changes are thought to have been added for performance reasons which are no longer relevant but if some problem does crop up we can solve it more narrowly and with better documentation at that time.

See word entries affected by this change (122 unique words)

[
  "〇×",
  "〇×テスト",
  "＋α",
  "○",
  "○×",
  "○×テスト",
  "○○",
  "◎",
  "△",
  "※印",
  "２×４工法",
  "２進-５進",
  "４°",
  "ＣＤ-ＤＡ",
  "ＣＤ-Ｒ",
  "ＣＤ-ＲＯＭ",
  "ＣＯＶＩＤ-１９",
  "ＤＶＤ-Ｒ",
  "ＤＶＤ-ＲＡＭ",
  "ＤＶＤ-ＲＯＭ",
  "ＨＡＮＡ-ＢＩ",
  "Ｈα線",
  "Ｊ-ＰＯＰ",
  "Ｋ-１",
  "Ｋ-ＰＯＰ",
  "Ｋ-Ｔ境界",
  "ＲＦ-ＩＤ",
  "ＵＴＦ-８",
  "Ｗｉ-Ｆｉ",
  "ウラン-２３８",
  "プラスα",
  "マイクロRNA",
  "マッチョ♀ウーマン",
  "Α",
  "Β",
  "Γ",
  "Γ分布",
  "Δ",
  "Ε",
  "Ζ",
  "Η",
  "Θ",
  "Θ関数",
  "Ι",
  "Κ",
  "Λ",
  "Μ",
  "Ν",
  "Ξ",
  "Ο",
  "Π",
  "Ρ",
  "Σ",
  "Τ",
  "Υ",
  "Φ",
  "Χ",
  "Ψ",
  "Ω",
  "α",
  "α-ｈｅｌｉｘ",
  "α-ヘリックス",
  "α-メチルトリプタミン",
  "α-リノレン酸",
  "αらせん",
  "αメチルトリプタミン",
  "αリノレン酸",
  "α化",
  "α化米",
  "α線",
  "α波",
  "α分類学",
  "α米",
  "α崩壊",
  "α螺旋",
  "α粒子",
  "β",
  "β-カロチン",
  "β-カロテン",
  "βカロチン",
  "βカロテン",
  "βテスト",
  "βバージョン",
  "β線",
  "β波",
  "β版",
  "β崩壊",
  "β粒子",
  "γ",
  "γ-オリザノール",
  "γアミノ酪酸",
  "γリノレン酸",
  "γ線",
  "γ分布",
  "γ崩壊",
  "δ",
  "ε",
  "ζ",
  "η",
  "θ",
  "θ理論",
  "ι",
  "κ",
  "λ",
  "μ",
  "μ粒子",
  "ν",
  "ξ",
  "ο",
  "π",
  "π中間子",
  "π電子",
  "ρ",
  "σ",
  "σ電子",
  "τ",
  "τ粒子",
  "υ",
  "φ",
  "χ",
  "ψ",
  "ω"
]

This list of words was collected by running this code:

const uniqueWords = [
  ...new Set(
    rcxDict.wordDict
      .trim()
      .split("\n")
      .map((line) => line.split(" ")[0])
  ),
];
const wordsWithCharCodesBelow0x3000 = uniqueWords.filter((word) =>
  [...word].some((char) => char.charCodeAt(0) < 0x3000)
);
console.log(
  "wordsWithCharCodesBelow0x3000",
  JSON.stringify(wordsWithCharCodesBelow0x3000, null, 2)
);

I wrote snapshot tests to make sure none of the words without character codes below 0x3000 were affected by this change. I didn't include those tests in this pull request because they take a while to run (since there are over 200,000 entries in the dictionary) and they make this pull request bigger. I can add those tests in a separate pull request if we think that's a good idea so we can verify diffs of word searches for all words in the dictionary. This would give us more confidence whenever we change the wordSearch function because it shows all the diffs for each search result in the dictionary and it would allow us to drastically simplify the wordSearch function in the future with confidence that we aren't breaking of the previous behavior.

I also wrote code to do a check to make sure all words in the dictionary are searchable and found there is only a 1 unsearchable word in the dictionary now:

const uniqueWords = [
  ...new Set(
    rcxDict.wordDict
      .trim()
      .split("\n")
      .map((line) => line.split(" ")[0])
  ),
];
const unsearchableWords = uniqueWords.filter(
  (word) =>
    !rcxDict
      .wordSearch(word, false)
      ?.data.find((data) => data.entry.startsWith(word + " "))
);
console.log("unsearchableWords", unsearchableWords); // Logs ['～']

So this means after this pull request gets merged, the only unsearchable word entry in the dictionary is "～ [にょろ] /(n) tilde/wave dash/".

This word is a single character with a character code of 65374, which is 0xff5e in hexadecimal (which is above 0x3000, which explains why this pull request doesn't handle this word). After searching through the code for this number 0xff5e, it looks like there's some code that handles this character code that needs to be modified slightly to account for this word in the dictionary. That can be handled in a subsequent pull request to keep this pull request short and simple.

Fixes: #1453

- Fixes melink14#1453

- Tests the word from melink14#1453

codecov · 2023-06-06T20:28:27Z

Codecov Report

Merging #1607 (d28b446) into main (7023f15) will increase coverage by 0.17%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##             main    #1607      +/-   ##
==========================================
+ Coverage   79.66%   79.84%   +0.17%     
==========================================
  Files           7        7              
  Lines        3010     2992      -18     
  Branches      189      185       -4     
==========================================
- Hits         2398     2389       -9     
+ Misses        607      599       -8     
+ Partials        5        4       -1

Impacted Files	Coverage Δ
extension/data.ts	`82.48% <ø> (+0.12%)`	⬆️
extension/rikaicontent.ts	`76.13% <ø> (+0.25%)`	⬆️

extension/data.ts

extension/test/data_test.ts

melink14

Just a quick review before work; I'll try to look again afterwards! (It looks reasonable though!)

extension/data.ts

extension/test/data_test.ts

extension/data.ts

melink14 · 2023-06-06T22:04:27Z

#190 is the issue for handling ～

As far other tests, I think we don't need to test every single word; it should be enough to test a representative word and be confident that similar words behave the same. I'd be interested more in the doubts you had while developing since as you say the goal of tests is to allow making changes without worrying!

- This if-check was there for performance, but there's no difference in performance so it can be removed.

…s://github.com/bryanjenningz/rikaikun into fix-word-searches-with-char-codes-below-0x3000

bryanjenningz · 2023-06-06T22:59:12Z

I'd be interested more in the doubts you had while developing since as you say the goal of tests is to allow making changes without worrying!

I didn't have any specific doubts while developing. I'm just not familiar with the code in wordSearch so I just wrote some tests to see the diffs before and after my change to be 100% sure the wordSearch output is changed how I expected it to change.

extension/data.ts

- This range check was preventing words like "＋α" from being searched in the dictionary.

extension/rikaicontent.ts

melink14 · 2023-06-07T13:38:24Z

I changed the PR body a bit to make it read better (imo) once it gets merged and becomes the commit message body. Just for your information, I had two goals:

Remove information that was already in the PR title.
Make the initial paragraph give additional context around why the changes were done (including the final rikaicontent change)

An astute reader would be able to figure that stuff out but since code sleuthing months/years later is hard, I tend to try to make it as easy as possible for my future self. :)

- Before, non-Japanese text would not trigger xsearch for performance reasons, which caused words starting with non-Japanese characters to not trigger xsearch. - These performance reasons don't appear to be relevant anymore, but if there are any performance problems in the future, we can filter in a more granular way in the future to fix any performance problems.

bryanjenningz · 2023-06-07T18:29:31Z

I changed the PR body a bit to make it read better (imo) once it gets merged and becomes the commit message body. Just for your information, I had two goals:
* Remove information that was already in the PR title.

* Make the initial paragraph give additional context around why the changes were done (including the final rikaicontent change)
An astute reader would be able to figure that stuff out but since code sleuthing months/years later is hard, I tend to try to make it as easy as possible for my future self. :)

Thanks so much! I appreciate that. The PR has a nice initial paragraph now. I'll be sure to add more context like this in future PRs so that it's easier to understand for my future self too.

melink14 · 2023-06-11T05:52:20Z

@all-contributors Add @bryanjenningz for code.

allcontributors · 2023-06-11T05:52:29Z

@melink14

I've put up a pull request to add @bryanjenningz! 🎉

@bryanjenningz

Adds @bryanjenningz as a contributor for code. This was requested by melink14 [in this comment](#1607 (comment))

## [2.5.13](v2.5.12...v2.5.13) (2023-06-12) ### Bug Fixes * **dict:** Update dictionaries to latest versions ([#1616](#1616)) ([b665d3b](b665d3b)) * **search:** Handle word searches with char codes below 0x3000 ([#1607](#1607)) ([f94efb7](f94efb7))

melink14 · 2023-06-12T10:43:08Z

🎉 This PR is included in version 2.5.13 🎉

The release is available on:

Your semantic-release bot 📦🚀

bryanjenningz added 3 commits June 6, 2023 12:45

fix(search): Handle words with char codes below 0x3000

61db0bb

- Fixes melink14#1453

test(search): Test word search with punctuation

2ed8fe0

- Tests the word from melink14#1453

Merge branch 'main' into fix-word-searches-with-char-codes-below-0x3000

7d6ddec

bryanjenningz commented Jun 6, 2023

View reviewed changes

extension/data.ts Outdated Show resolved Hide resolved

bryanjenningz commented Jun 6, 2023

View reviewed changes

extension/data.ts Outdated Show resolved Hide resolved

bryanjenningz commented Jun 6, 2023

View reviewed changes

extension/test/data_test.ts Show resolved Hide resolved

melink14 requested changes Jun 6, 2023

View reviewed changes

extension/data.ts Outdated Show resolved Hide resolved

extension/test/data_test.ts Show resolved Hide resolved

extension/data.ts Outdated Show resolved Hide resolved

bryanjenningz added 2 commits June 6, 2023 15:40

refactor(search): Remove if-check for char codes below 0x3000

0429eba

- This if-check was there for performance, but there's no difference in performance so it can be removed.

Merge branch 'fix-word-searches-with-char-codes-below-0x3000' of http…

0ec176d

…s://github.com/bryanjenningz/rikaikun into fix-word-searches-with-char-codes-below-0x3000

bryanjenningz commented Jun 6, 2023

View reviewed changes

extension/data.ts Show resolved Hide resolved

bryanjenningz requested a review from melink14 June 7, 2023 00:01

fix(search): Remove char code range check in rikaicontent

b44b0ef

- This range check was preventing words like "＋α" from being searched in the dictionary.

bryanjenningz commented Jun 7, 2023

View reviewed changes

extension/rikaicontent.ts Show resolved Hide resolved

melink14 previously approved these changes Jun 7, 2023

View reviewed changes

bryanjenningz dismissed melink14’s stale review via d28b446 June 7, 2023 18:28

bryanjenningz requested a review from melink14 June 7, 2023 18:32

melink14 approved these changes Jun 7, 2023

View reviewed changes

mergify bot merged commit f94efb7 into melink14:main Jun 7, 2023

bryanjenningz deleted the fix-word-searches-with-char-codes-below-0x3000 branch June 9, 2023 03:25

allcontributors bot mentioned this pull request Jun 11, 2023

docs: Add bryanjenningz as a contributor for code #1608

Merged

mergify bot pushed a commit that referenced this pull request Jun 11, 2023

docs: Add bryanjenningz as a contributor for code (#1608)

d76ef2d

Adds @bryanjenningz as a contributor for code. This was requested by melink14 [in this comment](#1607 (comment))

melink14 added the released label Jun 12, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(search): Handle word searches with char codes below 0x3000 #1607

fix(search): Handle word searches with char codes below 0x3000 #1607

bryanjenningz commented Jun 6, 2023 •

edited by melink14

Loading

codecov bot commented Jun 6, 2023 •

edited

Loading

melink14 left a comment

melink14 commented Jun 6, 2023

bryanjenningz commented Jun 6, 2023 •

edited

Loading

melink14 commented Jun 7, 2023

bryanjenningz commented Jun 7, 2023

melink14 commented Jun 11, 2023

allcontributors bot commented Jun 11, 2023

melink14 commented Jun 12, 2023

fix(search): Handle word searches with char codes below 0x3000 #1607

fix(search): Handle word searches with char codes below 0x3000 #1607

Conversation

bryanjenningz commented Jun 6, 2023 • edited by melink14 Loading

codecov bot commented Jun 6, 2023 • edited Loading

Codecov Report

melink14 left a comment

Choose a reason for hiding this comment

melink14 commented Jun 6, 2023

bryanjenningz commented Jun 6, 2023 • edited Loading

melink14 commented Jun 7, 2023

bryanjenningz commented Jun 7, 2023

melink14 commented Jun 11, 2023

allcontributors bot commented Jun 11, 2023

melink14 commented Jun 12, 2023

bryanjenningz commented Jun 6, 2023 •

edited by melink14

Loading

codecov bot commented Jun 6, 2023 •

edited

Loading

bryanjenningz commented Jun 6, 2023 •

edited

Loading