Simstring matcher could produce span with corrected term #38

Open
nourG22 opened this issue Apr 15, 2024 · 1 comment

Comments

nourG22 commented Apr 15, 2024

Issue

When used for typo detection, the SimString matcher correctly matches misspelled words such as "mxnidipine", but it has no way to produce the corrected form of the word, such as "manidipine".

Reproduction Steps

  1. Provide input text containing a misspelled word, such as "mxnidipine".
  2. Run the SimString matcher on that text so that the misspelled word is matched despite the typo.

Current Behavior

SimString accurately identifies misspelled terms but does not provide the corrected version.
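
To make the behavior concrete, here is a minimal sketch that does not use the actual matcher. It assumes a plain character-trigram Jaccard similarity as a stand-in for SimString's approximate matching; the names `char_ngrams`, `jaccard`, `find_matches` and the 0.5 threshold are hypothetical and chosen only for illustration.

```python
def char_ngrams(word, n=3):
    """Return the set of character n-grams of `word`, padded with spaces."""
    padded = " " * (n - 1) + word.lower() + " " * (n - 1)
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity between the n-gram sets of two strings."""
    x, y = char_ngrams(a), char_ngrams(b)
    return len(x & y) / len(x | y)


def find_matches(text, terms, threshold=0.5):
    """Return (start, end, surface_text) for tokens similar to a known term.

    Only the surface text is returned: the dictionary term that triggered
    the match is dropped, which mirrors the behavior reported above.
    """
    results, pos = [], 0
    for token in text.split():
        start = text.index(token, pos)
        end = start + len(token)
        pos = end
        if any(jaccard(token, term) >= threshold for term in terms):
            results.append((start, end, token))
    return results


matches = find_matches("Patient on mxnidipine daily", ["manidipine"])
print(matches)
```

The misspelled token is found, but nothing in the result says that it corresponds to "manidipine".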

Cause

The SimString matching logic currently lives in a parent class, and the SimString matcher has no specialized run() method of its own. This prevents it from generating corrected terms.

Suggested Solution

Implement a run() specialization in the SimString matcher class so that it can generate corrected terms. The same specialization should be provided for both the SimString matcher and the regular expression matcher.

@ghisvail
Contributor

Thanks @nourG22 for the very detailed report.

I have discussed this issue with the rest of the team, and we were thinking of an alternative solution which would provide more flexibility. Your report was very useful to kickstart the discussion with a realistic use case.

The proposed alternative is the following: rather than replacing the text within the span produced by the SimstringMatcher, we could enhance the span with an additional normalization (name up for discussion; let's call it typo correction for now). That normalization could then be applied to the text in a follow-up operation (also up for discussion) or simply carried along through the rest of the processing.

This way, users can still choose to keep or correct the typo, or use an alternative normalization (e.g. UMLS), while the information about the match is carried along.
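
As a rough illustration of this idea, the sketch below keeps the span text untouched and attaches the corrected term as a separate attribute. Everything in it is an assumption made for the example: the `Span` and `Attribute` dataclasses, the `typo_correction` label, and the `attach_typo_correction` and `apply_corrections` helpers are hypothetical and not part of the library's API.

```python
from dataclasses import dataclass, field


@dataclass
class Attribute:
    label: str
    value: str


@dataclass
class Span:
    start: int
    end: int
    text: str  # original surface text, typo included
    attrs: list = field(default_factory=list)


def attach_typo_correction(span, matched_term):
    """Record the matched dictionary term on the span without altering its text."""
    if span.text.lower() != matched_term.lower():
        span.attrs.append(Attribute("typo_correction", matched_term))
    return span


def apply_corrections(text, spans):
    """Optional follow-up step: rewrite `text` using typo_correction attributes."""
    out, cursor = [], 0
    for span in sorted(spans, key=lambda s: s.start):
        fix = next((a.value for a in span.attrs if a.label == "typo_correction"), None)
        out.append(text[cursor:span.start])
        out.append(fix if fix is not None else span.text)
        cursor = span.end
    out.append(text[cursor:])
    return "".join(out)


text = "Patient on mxnidipine daily"
span = attach_typo_correction(Span(11, 21, "mxnidipine"), "manidipine")
print(span.text)                        # original text is preserved
print(apply_corrections(text, [span]))  # correction applied only on request
```

Keeping the correction as an attribute means the follow-up step stays optional, and other normalizations could be attached to the same span alongside it.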
