fix(dict): Make update-db script resilient to multiple spaces separating kanji entry fields #1917

Merged Jan 2, 2024 (1 commit)
2 changes: 1 addition & 1 deletion extension/data/kanji.dat
@@ -9187,7 +9187,7 @@
觚|B148 S13 N4303 V5544 P1-7-6 I2n11.1 Ygu1|コ さかずき|||cup
觜|B148 S13 N4304 V5546 P2-6-7 I2n11.2 Yzui3 Yzi1|シ スイ くちばし はし|||beak, bill
觝|B148 S12 V5545 H1498 P1-7-5 I2n10.2 Ydi3|テイ ふ.れる|||touch, feel, collide with, conflict with
-解|B148 G5 S13 F176 N4306 V5548 H1517 DK1017 DL1375 L1814:unravel DN1955:unravel E632 IN474 P1-7-6 I4g9.1 Yjie3 Yjie4 Yxie4|カイ ゲ と.く と.かす と.ける ほど.く ほど.ける わか.る さと.る|さとる とけ||unravel, notes, key, explanation, understanding, untie, undo, solve, answer, cancel, absolve, explain, minute
+解|B148 G5 S13 F176 N4306 V5548 H1517 DK1017 DL1375 L1814:unravel DN1955:unravel E632 IN474 P1-7-6 I4g9.1 Yjie3 Yjie4 Yxie4|カイ ゲ と.く と.かす と.ける ほど.く ほぐ.す わか.る さと.る|さとる とけ||unravel, notes, key, explanation, understanding, untie, undo, solve, answer, cancel, absolve, explain, minute
觥|B148 S13 P1-7-6 Ygong1|コウ つのさかずき|||cup made of horn, obstinate
触|B148 G8 S13 F904 N4305 V5547 H1518 DK1018 DL1376 L1813:contact DN1954:contact E1428 IN874 P1-7-6 I6d7.10 Ychu4|ショク ふ.れる さわ.る さわ|||contact, touch, feel, hit, proclaim, announce, conflict
觧|B148 S13 V5549 P1-7-6 I4g9.1 Yjie3 Yjie4 Yxie4|カイ ゲ と.く と.かす と.ける ほど.く ほど.ける わか.る さと.る|||notes, key, explanation, understanding
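
For orientation, each kanji.dat record above appears to be a `|`-delimited line: the character, followed by the fields that update-db.ts joins with `|` (references, readings, the T1 and T2 groups, meanings). A minimal sketch of reading one record, assuming that layout; the `KanjiRecord` type and the `words` and `parseRecord` helpers are hypothetical names for illustration, not part of the extension:

```typescript
// Hypothetical shape for one kanji.dat line; the field names are illustrative.
interface KanjiRecord {
  kanji: string;
  refs: string[];
  readings: string[];
  t1: string[];
  t2: string[];
  meanings: string[];
}

// Split a space-separated field, treating an empty field as an empty list.
function words(field: string): string[] {
  return field === '' ? [] : field.split(' ');
}

// Assumes exactly the six '|'-delimited fields seen in the records above.
function parseRecord(line: string): KanjiRecord {
  const [kanji = '', refs = '', readings = '', t1 = '', t2 = '', meanings = ''] =
    line.split('|');
  return {
    kanji,
    refs: words(refs),
    readings: words(readings),
    t1: words(t1),
    t2: words(t2),
    // Meanings are comma separated, per the parser's own comment.
    meanings: meanings === '' ? [] : meanings.split(', '),
  };
}

const record = parseRecord(
  '觜|B148 S13 N4304 V5546 P2-6-7 I2n11.2 Yzui3 Yzi1|シ スイ くちばし はし|||beak, bill'
);
console.log(record.readings); // ['シ', 'スイ', 'くちばし', 'はし']
console.log(record.meanings); // ['beak', 'bill']
```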
11 changes: 7 additions & 4 deletions utils/update-db.ts
@@ -271,6 +271,8 @@ class KanjiDictParser extends Writable {
// confuse the output.
// (Also a comma could confuse it too. Would mean we should
// switch to ; as a separator in that case.)
+ // In general, a single space is used as a separator, but sometimes more spaces appear and the spec is unclear so it's best to assume
+ // one or more spaces separate fields.
//
// e.g. 士|3B4E U58eb B33 G4 S3 F526 J1 N1160 V1117 H3405 DP4213 DK2129 DL2877 L319 DN341 K301 O41 DO59 MN5638 MP3.0279 E494 IN572 DA581 DS410 DF1173 DH521 DT441 DC386 DJ755 DG393 DM325 P4-3-2 I3p0.1 Q4010.0 DR1472 Yshi4 Wsa シ さむらい T1 お ま T2 さむらい {gentleman} {scholar} {samurai} {samurai radical (no. 33)}
//
@@ -295,7 +297,7 @@ class KanjiDictParser extends Writable {
// - Meanings, comma separated
// (All | delimited)
const matches = line.match(
- /^(\S+) (?:.=.=== )?((?:[\x21-\x7a]+ )+)((?:[\x80-\uffff.-]+ )+)?(?:T1 ((?:[\x80-\uffff.-]+ )+))?(?:T2 ((?:[\x80-\uffff.-]+ )+))?((?:\{[^}]+\} ?)*)?$/
+ /^(\S+) +(?:.=.=== )?((?:[\x21-\x7a]+ +)+)((?:[\x80-\uffff.-]+ +)+)?(?:T1 ((?:[\x80-\uffff.-]+ +)+))?(?:T2 ((?:[\x80-\uffff.-]+ +)+))?((?:\{[^}]+\} *)*)?$/
);
if (matches === null) {
console.log(`Failed to parse line: ${line}`);
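
To see what the regex change buys, the sketch below runs both patterns from this hunk against a single line. The input is a made-up, shortened variant of the 士 example from the comment, with a doubled space inserted between two reference codes; it is not taken from the real KANJIDIC file.

```typescript
// Patterns copied from the diff: before and after the change.
const oldPattern =
  /^(\S+) (?:.=.=== )?((?:[\x21-\x7a]+ )+)((?:[\x80-\uffff.-]+ )+)?(?:T1 ((?:[\x80-\uffff.-]+ )+))?(?:T2 ((?:[\x80-\uffff.-]+ )+))?((?:\{[^}]+\} ?)*)?$/;
const newPattern =
  /^(\S+) +(?:.=.=== )?((?:[\x21-\x7a]+ +)+)((?:[\x80-\uffff.-]+ +)+)?(?:T1 ((?:[\x80-\uffff.-]+ +)+))?(?:T2 ((?:[\x80-\uffff.-]+ +)+))?((?:\{[^}]+\} *)*)?$/;

// Made-up input: note the two spaces between "U58eb" and "B33".
const sampleLine = '士 3B4E U58eb  B33 シ さむらい {gentleman} {scholar}';

const oldMatch = sampleLine.match(oldPattern);
const newMatch = sampleLine.match(newPattern);

console.log(oldMatch === null); // true: the old pattern rejects the line
console.log(newMatch?.[2]); // "3B4E U58eb  B33 " (doubled space still present)
console.log(newMatch?.[6]); // "{gentleman} {scholar}"
```

The captured groups still carry the extra spaces, which is why the later hunks also change how the captures are split and normalized.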
@@ -304,7 +306,7 @@ class KanjiDictParser extends Writable {
}

// Trim references
- const refs = matches[2].trim().split(' ');
+ const refs = matches[2].trim().split(/ +/);
const refsToKeep = [];
let hasB = false;
for (const ref of refs) {
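
The difference between the two `split` calls in this hunk can be seen on a small input. The reference string below is hypothetical, with a doubled space inserted on purpose:

```typescript
// Hypothetical reference field with a doubled space after "S13".
const refsField = 'B148 S13  N4303';

// Splitting on a single space leaves an empty string where the extra space was.
const naive = refsField.trim().split(' ');
// Splitting on one-or-more spaces yields only the real reference codes.
const tolerant = refsField.trim().split(/ +/);

console.log(naive); // ['B148', 'S13', '', 'N4303']
console.log(tolerant); // ['B148', 'S13', 'N4303']
```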
@@ -341,7 +343,7 @@ class KanjiDictParser extends Writable {

// Prepare meanings
if (matches[6]) {
- const meanings = matches[6].trim().split('} {');
+ const meanings = matches[6].trim().split(/} *{/);
if (meanings.length) {
meanings[0] = meanings[0].slice(1);
const end = meanings.length - 1;
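
A short sketch of why the literal `'} {'` separator is fragile. The meanings string is made up, and the brace-stripping step at the end is only an illustration, not necessarily how update-db.ts handles the first and last entries:

```typescript
// Hypothetical meanings capture with two spaces between the groups.
const meaningsField = '{cup}  {beak, bill}';

// The literal separator is never found, so the whole string comes back as one item.
const before = meaningsField.trim().split('} {');
// Same pattern as the diff's /} *{/, written with escaped braces.
const after = meaningsField.trim().split(/\} *\{/);

console.log(before.length); // 1
console.log(after); // ['{cup', 'beak, bill}']

// Illustrative clean-up: drop the leading '{' and trailing '}' left on the ends.
const cleaned = after.map((m) => m.replace(/^\{|\}$/g, ''));
console.log(cleaned); // ['cup', 'beak, bill']
```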
@@ -362,7 +364,8 @@ class KanjiDictParser extends Writable {

this.#index[matches[1]] = matches
.slice(2)
- .map((part) => (part ? part.trim() : ''))
+ // Replace any instances of 2 or more spaces with one space.
+ .map((part) => (part ? part.trim().replace(/ {2,}/, ' ') : ''))
.join('|');

callback();
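
One detail worth double-checking in this last hunk: `String.prototype.replace` with a regex that lacks the `g` flag replaces only the first match, so a field containing more than one run of doubled spaces would keep the later runs. A small sketch of the difference, using a made-up field value; whether real KANJIDIC lines ever contain more than one such run is not something this diff shows:

```typescript
// Hypothetical field with two separate runs of doubled spaces.
const part = 'B148  S13  N4303';

// Without the g flag only the first run is collapsed.
const firstOnly = part.replace(/ {2,}/, ' ');
// With the g flag every run is collapsed.
const everyRun = part.replace(/ {2,}/g, ' ');

console.log(firstOnly); // "B148 S13  N4303"
console.log(everyRun); // "B148 S13 N4303"
```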