Japanese medical (1 of 3): append "+" or "-" to the preceding Concept if followed by a space #138

Closed
makorin0315 opened this issue May 27, 2021 · 4 comments
@makorin0315
Collaborator

NOTE: This is the first of the three items reported by @Rei-hub in issue #137, to enable analysis of EMR's daily progress notes that are less formal than discharge summaries.

Extract a word followed by +/- without parentheses as a single entity

The previous improvement (#31) enabled Katakana or numbers enclosed in parentheses to be concatenated with the preceding Concept as a single entity. This works in many cases, especially in stylized documents, and is useful for identifying the relation of negation. (e.g. heart murmur(-) → no heart murmur)
However, informal text such as daily progress notes poses a problem: some entities are followed by +/- without parentheses. Even in these cases, the +/- symbol should be concatenated with the preceding Concept as a single entity, because doctors write it with the same intention, and doing so lets us identify negation. Is this improvement technically possible?
Importantly, in most of these cases there is no space between the entity and +/-, whereas a half-width or full-width space usually follows +/- to separate it from the next entity.


@makorin0315
Collaborator Author

In a separate e-mail, I've asked @Rei-hub to check his data to find out exactly which characters are used as the minus sign. By default, the following characters will be considered to be the minus sign for this issue:

single-width hyphen minus (-): U+002D
double-width hyphen minus (-): U+FF0D

There are other characters that "look like" the minus sign, and depending on the author's preference (or typo), it's more than possible that such characters are used instead, especially in informal notes. For example:

minus (−): U+2212
hyphen (‐): U+2010
non-breaking hyphen (‑): U+2011
hyphen bullet(⁃): U+2043
en dash (–): U+2013
em dash (—): U+2014
cho-on (ー): U+30FC
half-width cho-on (ー): U+FF70
figure dash (‒): U+2012
horizontal bar (―): U+2015
... and there may be others

If we are to consider only the single- and double-width hyphen minus characters as the minus sign, the requested change is quite simple and relatively innocuous to other types of text. We already suspect that the cho-on characters (half- and full-width) may be used sometimes unintentionally. Once I hear from Dr. Noguchi on the character usage within his data, I'll look into any possible ramification of including such characters in the change.
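The two policies discussed above (default set only, or default set plus look-alikes) can be sketched as follows. This is a minimal illustration, not the actual tokenizer code; the names `DEFAULT_MINUS`, `LOOKALIKE_MINUS`, and `is_minus_sign` are hypothetical.

```python
# Default set from this comment: hyphen-minus in single and double width.
DEFAULT_MINUS = {"\u002D", "\uFF0D"}

# Look-alike candidates listed above; whether to include them is still open.
LOOKALIKE_MINUS = {
    "\u2212",  # minus
    "\u2010",  # hyphen
    "\u2011",  # non-breaking hyphen
    "\u2043",  # hyphen bullet
    "\u2013",  # en dash
    "\u2014",  # em dash
    "\u30FC",  # cho-on
    "\uFF70",  # half-width cho-on
    "\u2012",  # figure dash
    "\u2015",  # horizontal bar
}


def is_minus_sign(ch: str, include_lookalikes: bool = False) -> bool:
    """Return True if ch counts as a minus sign under the chosen policy."""
    if ch in DEFAULT_MINUS:
        return True
    return include_lookalikes and ch in LOOKALIKE_MINUS
```

Keeping the look-alikes behind a flag makes it easy to compare output with and without them, which is useful while the effect of the cho-on characters on other text is still unknown.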

@Rei-hub

Rei-hub commented Jun 19, 2021

Thank you for your valuable feedback, and sorry for the late response.
As I mentioned in an earlier mail, I have carefully checked whether the various “minus-like” symbols appear in the actual medical progress notes, and found that the following four symbols were used in addition to the “U+002D” and “U+FF0D” you mentioned.

  1. Katakana-Hiragana Prolonged Sound Mark (ー): U+30FC
  2. Halfwidth Katakana-Hiragana Prolonged Sound Mark (ー): U+FF70
  3. HORIZONTAL BAR (―): U+2015
  4. Hyphen (‐): U+2010
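
A check of this kind could be sketched roughly as below. This is only an illustration of one way to survey a corpus, not the procedure that was actually used; `count_trailing_signs` and `report` are hypothetical helpers, and "followed by a space or end of line" is an assumed definition of the position of interest.

```python
from collections import Counter
import unicodedata

# Candidate minus-like characters from the earlier comment.
CANDIDATES = set(
    "\u002D\uFF0D\u2212\u2010\u2011\u2043\u2013\u2014\u30FC\uFF70\u2012\u2015"
)


def count_trailing_signs(lines):
    """Count candidates that are followed by a space or the end of the line."""
    counts = Counter()
    for line in lines:
        text = line.rstrip("\n")
        for i, ch in enumerate(text):
            if ch not in CANDIDATES:
                continue
            nxt = text[i + 1] if i + 1 < len(text) else ""
            # Half-width space, full-width space, or nothing (end of line).
            if nxt in ("", " ", "\u3000"):
                counts[ch] += 1
    return counts


def report(counts):
    """Format the counts as 'U+XXXX NAME: n', most frequent first."""
    return [
        f"U+{ord(ch):04X} {unicodedata.name(ch)}: {n}"
        for ch, n in counts.most_common()
    ]
```

Note that a count like this cannot tell a minus sign from an ordinary prolonged sound mark at the end of a Katakana word, so the hits for U+30FC and U+FF70 still need to be read in context.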

Considering how they are used in the actual sentences, there seems to be no problem in concatenation with the preceding Concept as a single entity for all cases. I would like to get your feedback about the possibility of an unintended effect on other expressions.
Thank you for your continued support.

@makorin0315
Collaborator Author

The implementation will involve the following:

  1. append "+" or "-" (and its variations) to the preceding Concept if followed by a space

  2. append "+" or "-" (and its variations) to the preceding Concept if followed by end-of-sentence

  3. There is no issue with item 1, nor with "+" in item 2. However, applying item 2 to "-" would have a negative impact when processing news articles in the test corpus, where hyphens are used at the beginning and end of a sentence to mark a section title of sorts. For example, article authors often use the format:

-公道での自動運転実施における技術・ノウハウ、ガイドラインを共有-

In this case, the "-" at the end of the line should not be concatenated with 共有. To overcome this, we need a rule for sentences whose first and last characters are both hyphens: for such sentences, the hyphen will not be appended to the word before it. I do not know whether this rule still interferes with clinical text, but it is the best solution for now.

  4. Also, there are cases where the first hyphen is somewhere in the middle of a sentence:

海洋細菌で見つけた新しい光エネルギー利用機構-塩化物イオンを輸送するポンプの発見-

Such cases can potentially be handled, but the hyphen within a sentence often serves another purpose, for example:

HCCを疑いますが、経胆管的に肝門部-総肝管まで進展し、4cm長の腫瘤を形成し、両側肝内胆管は拡張しています。(FromTo)
-CX-5の生産開始以来、2年4か月で到達- (part of a word)

For this issue, this particular case (a hyphen in the middle and at the end of a sentence) will not be dealt with. The hyphen at the end of the sentence will be appended to the last character for the time being, although that is not ideal as an output.

  5. In addition, there are cases that are beyond our control, depending on how the author chooses to use a hyphen:

20代、スマホからの予約が急増-

The hyphen at the end of the sentence will be appended to the last character for the time being, although it is not ideal as an output.
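
The rules above can be sketched on plain strings as follows. This is a simplified illustration under several assumptions, not the actual implementation: it works on raw text rather than on Concepts, it treats the end of the string as the end of the sentence, it includes the full-width plus (U+FF0B) as a variation of "+", and the name `appendable_signs` is hypothetical.

```python
import re

PLUS = "+\uFF0B"  # full-width plus is an assumption, not confirmed in this thread
MINUS = "\u002D\uFF0D\u30FC\uFF70\u2015\u2010"  # the six characters found in the notes
SIGNS = PLUS + MINUS

_CLASS = "".join(re.escape(c) for c in SIGNS)
# A sign that follows a non-space, non-sign character and is itself followed
# by a half-width space, a full-width space, or the end of the sentence.
_SIGN_RE = re.compile(rf"(?<=[^\s{_CLASS}])[{_CLASS}](?=[ \u3000]|$)")


def appendable_signs(sentence: str) -> list[tuple[int, str]]:
    """Return (index, sign) pairs to append to the preceding word.

    Indexes refer to the sentence after surrounding whitespace is stripped.
    """
    text = sentence.strip()
    hits = [(m.start(), m.group()) for m in _SIGN_RE.finditer(text)]
    # Item 3: a sentence wrapped in hyphens is treated as a title,
    # so its closing hyphen is not appended.
    if len(text) >= 2 and text[0] in MINUS and text[-1] in MINUS:
        hits = [(i, c) for i, c in hits if i != len(text) - 1]
    return hits


# Made-up progress-note style input, for illustration only.
print(appendable_signs("嘔気- 嘔吐+ 発熱-"))  # [(2, '-'), (6, '+'), (10, '-')]
```

In this sketch the title-style examples from item 3 yield no hits, while the sentences from items 4 and 5 still get their final hyphen appended, matching the behavior described above. A Katakana word ending in a prolonged sound mark before a space would also match, which is the kind of side effect to watch for.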

@makorin0315
Collaborator Author

@Rei-hub - this issue has been addressed as described above. There are some cases that cannot be addressed, but I believe the general output is as expected or better, even in non-medical/clinical text. Please have a look when you get a chance. Thanks for your feedback as always.
