fix(search): Handle word searches with char codes below 0x3000 #1607

Conversation

@bryanjenningz (Contributor) commented Jun 6, 2023

Words like DVD-ROM were not matching due to the hyphen (represented by a char code below 0x3000) in the middle. In order to also include entries like +α, logic was removed from rikaicontent.ts that skipped lookups when the character the mouse hovered over was not a normal Japanese char code. (That check also covered various other ranges, which showed no additional impact as of this PR.)

These checks are thought to have been added for performance reasons that are no longer relevant; if some problem does crop up, we can solve it more narrowly and with better documentation at that time.
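
For context, the removed logic was roughly a character-range guard like the sketch below (the name and exact bound are illustrative assumptions, not the actual rikaicontent.ts code):

// Illustrative sketch only; the real check in rikaicontent.ts covered
// several ranges and used different names.
function looksLikeJapaneseChar(char: string): boolean {
  // Kana, kanji, and CJK punctuation all sit at or above 0x3000.
  return char.charCodeAt(0) >= 0x3000;
}

// Before this PR, hovering a character below 0x3000 (e.g. the '-' in
// "DVD-ROM" or the '+' in "+α") skipped the dictionary lookup entirely.
console.log(looksLikeJapaneseChar("-")); // false -> lookup was skipped
console.log(looksLikeJapaneseChar("語")); // true  -> lookup proceeded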

See word entries affected by this change (122 unique words)
[
  "〇×",
  "〇×テスト",
  "+α",
  "○",
  "○×",
  "○×テスト",
  "○○",
  "◎",
  "△",
  "※印",
  "2×4工法",
  "2進-5進",
  "4°",
  "CD-DA",
  "CD-R",
  "CD-ROM",
  "COVID-19",
  "DVD-R",
  "DVD-RAM",
  "DVD-ROM",
  "HANA-BI",
  "Hα線",
  "J-POP",
  "K-1",
  "K-POP",
  "K-T境界",
  "RF-ID",
  "UTF-8",
  "Wi-Fi",
  "ウラン-238",
  "プラスα",
  "マイクロRNA",
  "マッチョ♀ウーマン",
  "Α",
  "Β",
  "Γ",
  "Γ分布",
  "Δ",
  "Ε",
  "Ζ",
  "Η",
  "Θ",
  "Θ関数",
  "Ι",
  "Κ",
  "Λ",
  "Μ",
  "Ν",
  "Ξ",
  "Ο",
  "Π",
  "Ρ",
  "Σ",
  "Τ",
  "Υ",
  "Φ",
  "Χ",
  "Ψ",
  "Ω",
  "α",
  "α-helix",
  "α-ヘリックス",
  "α-メチルトリプタミン",
  "α-リノレン酸",
  "αらせん",
  "αメチルトリプタミン",
  "αリノレン酸",
  "α化",
  "α化米",
  "α線",
  "α波",
  "α分類学",
  "α米",
  "α崩壊",
  "α螺旋",
  "α粒子",
  "β",
  "β-カロチン",
  "β-カロテン",
  "βカロチン",
  "βカロテン",
  "βテスト",
  "βバージョン",
  "β線",
  "β波",
  "β版",
  "β崩壊",
  "β粒子",
  "γ",
  "γ-オリザノール",
  "γアミノ酪酸",
  "γリノレン酸",
  "γ線",
  "γ分布",
  "γ崩壊",
  "δ",
  "ε",
  "ζ",
  "η",
  "θ",
  "θ理論",
  "ι",
  "κ",
  "λ",
  "μ",
  "μ粒子",
  "ν",
  "ξ",
  "ο",
  "π",
  "π中間子",
  "π電子",
  "ρ",
  "σ",
  "σ電子",
  "τ",
  "τ粒子",
  "υ",
  "φ",
  "χ",
  "ψ",
  "ω"
]

This list of words was collected by running this code:

// Collect every unique headword from the word dictionary.
const uniqueWords = [
  ...new Set(
    rcxDict.wordDict
      .trim()
      .split("\n")
      .map((line) => line.split(" ")[0])
  ),
];
// Keep only the headwords that contain at least one character
// below 0x3000 (i.e. outside the normal Japanese ranges).
const wordsWithCharCodesBelow0x3000 = uniqueWords.filter((word) =>
  [...word].some((char) => char.charCodeAt(0) < 0x3000)
);
console.log(
  "wordsWithCharCodesBelow0x3000",
  JSON.stringify(wordsWithCharCodesBelow0x3000, null, 2)
);

I wrote snapshot tests to make sure none of the words without character codes below 0x3000 were affected by this change. I didn't include those tests in this pull request because they take a while to run (there are over 200,000 entries in the dictionary) and they would make this pull request bigger. I can add those tests in a separate pull request if we think that's a good idea, so we can verify diffs of word searches for all words in the dictionary. This would give us more confidence whenever we change the wordSearch function, because it shows the diff for each search result in the dictionary, and it would allow us to drastically simplify the wordSearch function in the future with confidence that we aren't breaking any of the previous behavior.
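
For illustration, generating such a snapshot could look roughly like this, reusing rcxDict and uniqueWords from the snippet above (the snapshot shape and the log-and-diff workflow are just one possible approach, not the exact tests described here):

// Record the entries returned for each headword; comparing this output
// before and after a change shows exactly which searches are affected.
const wordSearchSnapshot = Object.fromEntries(
  uniqueWords.map((word) => [
    word,
    rcxDict.wordSearch(word, false)?.data.map((data) => data.entry) ?? [],
  ])
);
console.log(JSON.stringify(wordSearchSnapshot, null, 2));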

I also wrote code to check that all words in the dictionary are searchable and found that there is now only one unsearchable word in the dictionary:

// Collect every unique headword from the word dictionary.
const uniqueWords = [
  ...new Set(
    rcxDict.wordDict
      .trim()
      .split("\n")
      .map((line) => line.split(" ")[0])
  ),
];
// A headword is "unsearchable" if looking it up doesn't return an
// entry that starts with that exact headword.
const unsearchableWords = uniqueWords.filter(
  (word) =>
    !rcxDict
      .wordSearch(word, false)
      ?.data.find((data) => data.entry.startsWith(word + " "))
);
console.log("unsearchableWords", unsearchableWords); // Logs ['~']

This means that after this pull request gets merged, the only unsearchable word entry in the dictionary is "~ [にょろ] /(n) tilde/wave dash/".

This word is a single character with character code 65374, which is 0xff5e in hexadecimal. That is above 0x3000, which explains why this pull request doesn't handle it. After searching through the code for 0xff5e, it looks like there's some code handling this character code that needs to be modified slightly to account for this word in the dictionary. That can be handled in a subsequent pull request to keep this one short and simple.
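
For reference, the code point can be checked directly:

// 0xff5e is the fullwidth tilde character used in this entry.
const tilde = String.fromCharCode(0xff5e);
console.log(tilde.charCodeAt(0)); // 65374
console.log(tilde.charCodeAt(0) >= 0x3000); // true, so it's outside the range this PR changes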

Fixes: #1453

codecov bot commented Jun 6, 2023

Codecov Report

Merging #1607 (d28b446) into main (7023f15) will increase coverage by 0.17%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##             main    #1607      +/-   ##
==========================================
+ Coverage   79.66%   79.84%   +0.17%     
==========================================
  Files           7        7              
  Lines        3010     2992      -18     
  Branches      189      185       -4     
==========================================
- Hits         2398     2389       -9     
+ Misses        607      599       -8     
+ Partials        5        4       -1     
| Impacted Files | Coverage Δ |
| --- | --- |
| extension/data.ts | 82.48% <ø> (+0.12%) ⬆️ |
| extension/rikaicontent.ts | 76.13% <ø> (+0.25%) ⬆️ |

@melink14 (Owner) left a comment

Just a quick review before work; I'll try to look again afterwards! (It looks reasonable though!)

@melink14 (Owner) commented Jun 6, 2023

#190 is the issue for handling ~

As for the other tests, I think we don't need to test every single word; it should be enough to test a representative word and be confident that similar words behave the same. I'd be more interested in the doubts you had while developing since, as you say, the goal of tests is to allow making changes without worrying!
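
(As an illustration, a representative-word test could look roughly like the sketch below; the mocha/chai style and the rcxDict setup are assumptions modeled on the existing tests, not the actual code in this PR.)

// Sketch only: assumes a mocha/chai setup and an rcxDict instance
// prepared elsewhere in the test file.
import { expect } from "chai";

describe("wordSearch", () => {
  it("matches an entry containing a character below 0x3000", () => {
    const result = rcxDict.wordSearch("DVD-ROM", false);
    expect(
      result?.data.some((data) => data.entry.startsWith("DVD-ROM "))
    ).to.be.true;
  });
});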

- This if-check was there for performance, but there's no difference in
performance so it can be removed.
@bryanjenningz (Contributor, Author) commented Jun 6, 2023

> I'd be more interested in the doubts you had while developing since, as you say, the goal of tests is to allow making changes without worrying!

I didn't have any specific doubts while developing. I'm just not familiar with the code in wordSearch, so I wrote some tests to see the diffs before and after my change to be 100% sure the wordSearch output changed how I expected it to.

- This range check was preventing words like "+α" from being searched in the dictionary.
@melink14 (Owner) commented Jun 7, 2023

I changed the PR body a bit to make it read better (imo) once it gets merged and becomes the commit message body. Just for your information, I had two goals:

  • Remove information that was already in the PR title.
  • Make the initial paragraph give additional context around why the changes were done (including the final rikaicontent change)

An astute reader would be able to figure that stuff out but since code sleuthing months/years later is hard, I tend to try to make it as easy as possible for my future self. :)

melink14 previously approved these changes Jun 7, 2023
- Before, non-Japanese text would not trigger xsearch for performance reasons,
which caused words starting with non-Japanese characters to not
trigger xsearch.
- These performance reasons don't appear to be relevant anymore,
but if there are any performance problems in the future, we can filter
in a more granular way to fix them.
@bryanjenningz (Contributor, Author) commented

> I changed the PR body a bit to make it read better (imo) once it gets merged and becomes the commit message body. Just for your information, I had two goals:
>
> * Remove information that was already in the PR title.
> * Make the initial paragraph give additional context around why the changes were done (including the final rikaicontent change)
>
> An astute reader would be able to figure that stuff out but since code sleuthing months/years later is hard, I tend to try to make it as easy as possible for my future self. :)

Thanks so much! I appreciate that. The PR has a nice initial paragraph now. I'll be sure to add more context like this in future PRs so that it's easier to understand for my future self too.

mergify bot merged commit f94efb7 into melink14:main Jun 7, 2023
13 checks passed
bryanjenningz deleted the fix-word-searches-with-char-codes-below-0x3000 branch June 9, 2023 03:25
@melink14 (Owner) commented
@all-contributors Add @bryanjenningz for code.

@allcontributors (Contributor) commented
@melink14

I've put up a pull request to add @bryanjenningz! 🎉

mergify bot pushed a commit that referenced this pull request Jun 11, 2023
Adds @bryanjenningz as a contributor for code.

This was requested by melink14 [in this comment](#1607 (comment))
melink14 pushed a commit that referenced this pull request Jun 12, 2023
## [2.5.13](v2.5.12...v2.5.13) (2023-06-12)

### Bug Fixes

* **dict:** Update dictionaries to latest versions ([#1616](#1616)) ([b665d3b](b665d3b))
* **search:** Handle word searches with char codes below 0x3000 ([#1607](#1607)) ([f94efb7](f94efb7))
@melink14 (Owner) commented

🎉 This PR is included in version 2.5.13 🎉

The release is available on:

Your semantic-release bot 📦🚀

Successfully merging this pull request may close these issues.

DVD-ROM isn't recognized