Japanese medical (1 of 3): append "+" or "-" to the preceding Concept if followed by a space #138

makorin0315 · 2021-05-27T22:44:59Z

NOTE: This is the first of the three items reported by @Rei-hub in issue #137, to enable analysis of EMR's daily progress notes that are less formal than discharge summaries.

Extract a word followed by +/- without parentheses as a single entity

The previous improvement (#31) enabled Katakana or numbers enclosed in parentheses to be concatenated with the preceding Concept as a single entity. This works in many cases, especially in stylized documents, and is useful for identifying the relation of negation. (e.g. heart murmur(-) → no heart murmur)
However, informal text such as daily progress notes poses a problem: some entities are followed by +/- without parentheses. Even in these cases, the +/- symbol should be concatenated with the preceding Concept as a single entity, because doctors write it with the same intention, and doing so lets us clarify the relation of negation. Is this improvement technically possible?
Importantly, in many of these cases there is no space between the entity and +/-, whereas a half-width or full-width space often follows +/- to separate it from the next entity.
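
To make the request concrete, here is a minimal Python sketch of the pattern described above. It is an illustration only, not the engine's implementation: the name `find_signed_entities` is hypothetical, a run of non-space characters stands in for a Concept, and only the half- and full-width forms of "+" and "-" are treated as signs.

```python
import re

# Assumption for illustration: a "Concept" is approximated by a run of
# characters that are neither whitespace nor a sign. The sign must touch the
# word and be followed by a half-width space or a full-width space (U+3000).
SIGNED_ENTITY = re.compile(r"([^\s+\-＋－]+)([+\-＋－])(?=[ \u3000])")

def find_signed_entities(text):
    """Return (word, sign) pairs for words directly followed by +/- and a space."""
    return [(m.group(1), m.group(2)) for m in SIGNED_ENTITY.finditer(text)]

if __name__ == "__main__":
    # Half-width space after "-", full-width space after "+".
    print(find_signed_entities("心雑音- 浮腫+\u3000食欲あり"))
```

A sign that stands alone between two spaces is not attached to anything in this sketch, which matches the observation that the entity and +/- are usually written without a space between them.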

makorin0315 · 2021-05-27T23:22:14Z

In a separate e-mail, I've asked @Rei-hub to check his data to find out exactly which characters are used as the minus sign. By default, the following characters will be considered to be the minus sign for this issue:

single-width hyphen minus (-): U+002D
double-width hyphen minus (－): U+FF0D

There are other characters that "look like" the minus sign, and depending on the author's preference (or typo), it's more than possible that such characters are used instead, especially in informal notes. For example:

minus (−): U+2212
hyphen (‐): U+2010
non-breaking hyphen (‑): U+2011
hyphen bullet (⁃): U+2043
en dash (–): U+2013
em dash (—): U+2014
cho-on (ー): U+30FC
half-width cho-on (ｰ): U+FF70
figure dash (‒): U+2012
horizontal bar (―): U+2015
... and there may be others

If we consider only the single- and double-width hyphen minus characters as the minus sign, the requested change is quite simple and relatively innocuous to other types of text. We already suspect that the cho-on characters (half- and full-width) may sometimes be used unintentionally. Once I hear from Dr. Noguchi on the character usage within his data, I'll look into any possible ramifications of including such characters in the change.
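
As a rough illustration of the two tiers of characters listed above, the following Python sketch keeps the default minus signs separate from the look-alikes so that the wider set can be switched on explicitly. The names `DEFAULT_MINUS`, `LOOKALIKE_MINUS`, and `is_minus` are hypothetical and not part of the engine.

```python
# Default minus signs for this issue: the two hyphen-minus forms.
DEFAULT_MINUS = {
    "\u002D",  # single-width hyphen minus
    "\uFF0D",  # double-width hyphen minus
}

# Characters that merely look like a minus sign (from the list above).
LOOKALIKE_MINUS = {
    "\u2212",  # minus
    "\u2010",  # hyphen
    "\u2011",  # non-breaking hyphen
    "\u2043",  # hyphen bullet
    "\u2013",  # en dash
    "\u2014",  # em dash
    "\u30FC",  # cho-on
    "\uFF70",  # half-width cho-on
    "\u2012",  # figure dash
    "\u2015",  # horizontal bar
}

def is_minus(ch, include_lookalikes=False):
    """Return True if ch counts as a minus sign under the chosen policy."""
    if ch in DEFAULT_MINUS:
        return True
    return include_lookalikes and ch in LOOKALIKE_MINUS
```

Keeping the look-alikes behind a flag reflects the concern above: cho-on is an ordinary part of Katakana words, so treating it as a minus sign everywhere could affect other text.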

Rei-hub · 2021-06-19T14:20:58Z

Thank you for your valuable feedback, and sorry for the late response.
As I mentioned in an earlier mail, I have carefully checked whether the various types of “minus-like” symbols appear in the actual medical progress notes, and found that the following four symbols were used in addition to the “U+002D” and “U+FF0D” you mentioned.

Katakana-Hiragana Prolonged Sound Mark (ー): U+30FC
Halfwidth Katakana-Hiragana Prolonged Sound Mark (ｰ): U+FF70
HORIZONTAL BAR (―): U+2015
Hyphen (‐): U+2010

Considering how they are used in the actual sentences, there seems to be no problem with concatenating them with the preceding Concept as a single entity in all cases. I would like to get your feedback about the possibility of an unintended effect on other expressions.
Thank you for your continued support.

makorin0315 · 2022-02-27T01:26:53Z

The implementation will involve the following:

1. append "+" or "-" (and its variations) to the preceding Concept if followed by a space
2. append "+" or "-" (and its variations) to the preceding Concept if followed by end-of-sentence

There is no issue with item 1, and there is no issue with "+" for item 2. However, applying item 2 to "-" would have a negative impact when processing news articles in the test corpus, where hyphens are used at the beginning and end of a sentence to mark a section title of sorts. For example, article authors often use the format:

－公道での自動運転実施における技術・ノウハウ、ガイドラインを共有－

In this case, the "－" at the end of the line should not be concatenated with 共有. To overcome this, we need a rule for cases where both the first and last characters of a sentence are hyphens. For such sentences, the hyphen will not be appended to the word before it. I do not know whether this solution still interferes with clinical text, but it is the best solution for now.
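
The rule just described can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the actual change: the function name `should_append_trailing_hyphen` is hypothetical, the input is assumed to be a single sentence, and only the two hyphen-minus forms are checked.

```python
# Assumption: only the half- and full-width hyphen minus count as hyphens here.
HYPHENS = {"-", "\uFF0D"}

def should_append_trailing_hyphen(sentence):
    """Decide whether a sentence-final hyphen is appended to the preceding word.

    Returns False when the sentence does not end with a hyphen, or when it
    both starts and ends with one (the title-like news format).
    """
    s = sentence.strip()
    if not s or s[-1] not in HYPHENS:
        return False
    return s[0] not in HYPHENS

if __name__ == "__main__":
    title = "－公道での自動運転実施における技術・ノウハウ、ガイドラインを共有－"
    print(should_append_trailing_hyphen(title))       # title-like: not appended
    print(should_append_trailing_hyphen("心雑音-"))    # clinical shorthand: appended
```

The check looks only at the first and last characters, so a sentence whose opening hyphen sits somewhere in the middle is still treated as an ordinary sentence ending in a hyphen.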

Also, there are cases where the first hyphen is somewhere in the middle of a sentence:

海洋細菌で見つけた新しい光エネルギー利用機構－塩化物イオンを輸送するポンプの発見－

Such cases can potentially be handled, but the hyphen within a sentence often serves another purpose, for example:

HCCを疑いますが、経胆管的に肝門部-総肝管まで進展し、4cm長の腫瘤を形成し、両側肝内胆管は拡張しています。(FromTo)
－CX－５の生産開始以来、２年４か月で到達－ (part of a word)

For this issue, this particular case, i.e., a hyphen in the middle and at the end of a sentence, will not be dealt with. The hyphen at the end of the sentence will be appended to the last character for the time being, although it is not ideal as an output.

In addition, there are cases that are beyond our control, depending on how the author wants to use a hyphen:

２０代、スマホからの予約が急増－

The hyphen at the end of the sentence will be appended to the last character for the time being, although it is not ideal as an output.

makorin0315 · 2022-02-28T17:32:04Z

@Rei-hub - this issue has been addressed as described above. There are some cases that cannot be addressed, but I believe the general output is as expected or better, even in non-medical/clinical text. Please have a look when you get a chance. Thanks for your feedback as always.

makorin0315 self-assigned this May 27, 2021

makorin0315 mentioned this issue May 27, 2021

Japanese: Request for some improvements of entity extraction algorithm in terms of more accurate analysis of medical colloquial text #137

Closed

makorin0315 closed this as completed in cff2a7e Feb 28, 2022
