Question re mixed alpha and numeric needles #9

Closed
HamptonNorth opened this issue Oct 27, 2022 · 3 comments
Labels
question Further information is requested

Comments

@HamptonNorth

HamptonNorth commented Oct 27, 2022

I have the interLft option set to 2 (strict), so matching starts at the beginning of needles/words.

An example UK postcode (think ZIP code) is CW3 5BQ. uFuzzy treats 'CW3' as two needles, 'CW' and '3'. It fails to find the complete postcode string 'CW3 5BQ' and also fails to find '5bq'.

Is there any option to stop splitting mixed alpha and numeric words into multiple needles?

Test string:

[
  "Line with UK postcodes. A typical UK post code is CW3 5BQ. Some numbers 3 5 53 and 7. Some letters C, w, wW, W, WC and CW.  I should be able to match CW3 but not CW5 and I should be able to match 5BQ "
]

[Image: test_UK_postcodes]

@leeoniya
Owner

> Is there any option to stop splitting mixed alpha and numeric words into multiple needles?

there is an undocumented option for how terms can be split, though setting it to null or '' likely wouldn't work. you can probably provide some non-regex punct char so it never matches, like ~. i'll push a fix in a bit that allows it to be an empty string or null to skip this.

intraSplit?: PartialRegExp; // '[A-Za-z][0-9]|[0-9][A-Za-z]|[a-z][A-Z]'
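As a rough illustration of what that default pattern does to a term like CW3, here is a standalone sketch. It is not uFuzzy's actual code: `splitTerm` is a hypothetical helper, and the assumption is simply that a space gets inserted wherever the pattern matches an alpha/numeric or lower/upper boundary. It also shows why a pattern that never matches (such as `~`), or an empty/null pattern, leaves the term whole.

```typescript
// Default pattern quoted in the comment above.
const DEFAULT_INTRA_SPLIT = '[A-Za-z][0-9]|[0-9][A-Za-z]|[a-z][A-Z]';

// Hypothetical helper (not part of uFuzzy): splits one term at every place
// the pattern matches a two-character boundary. An empty or null pattern
// means "do not split".
function splitTerm(term: string, intraSplit: string | null): string[] {
  if (!intraSplit) return [term];
  const re = new RegExp(intraSplit, 'g');
  // Each match is two characters long; put a space between them.
  const spaced = term.replace(re, (m) => m.charAt(0) + ' ' + m.charAt(1));
  return spaced.split(' ');
}

console.log(splitTerm('CW3', DEFAULT_INTRA_SPLIT)); // [ 'CW', '3' ]
console.log(splitTerm('5BQ', DEFAULT_INTRA_SPLIT)); // [ '5', 'BQ' ]
console.log(splitTerm('CW3', '~'));                 // [ 'CW3' ]
console.log(splitTerm('CW3', ''));                  // [ 'CW3' ]
```

In this sketch the `~` workaround only works because `~` does not occur in the term itself.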

it's a good question whether setting interLft and/or interRgt to strict should automatically skip term splitting. i don't think so because the inter/intra terminology is relative to the supplied terms, so they would have to be intraLft and intraRgt.

side note:
in your example i don't think you need to set interLft to strict though. even if it internally splits the term into two, they can still be immediately adjacent in the match. the limit on that adjacency is interIns, so if you set that to 0, it should match cw3, even if internally it's represented as cw 3, though the splitting could have additional undesirable effects on rank order and match strictness of non-postal-code terms.

@leeoniya
Owner

leeoniya commented Oct 27, 2022

f67efb2 should allow intraSplit to be '' or null to prevent term splitting.

it also adds a new intraBound option that's used for the "boosting" aspects of matching any terms as substrings at those case-change and alpha-num boundaries. this way it can be opted out of separately from splitting.

leeoniya added the question (Further information is requested) label Oct 27, 2022
@HamptonNorth
Author

Added intraSplit to my options, set to ''

My UK postcode search all works - thank you
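For anyone landing here later, the options described in this thread might look roughly like the sketch below. `ThreadOpts` is a hypothetical local type that mirrors only the two option names mentioned above, and `postcodeOpts` is an illustrative name. The real option types ship with the library, and an empty `intraSplit` requires a version that includes f67efb2.

```typescript
// Hypothetical local type: covers only the options discussed in this thread,
// not the library's full options interface.
interface ThreadOpts {
  interLft?: 0 | 1 | 2;        // 2 = strict, as in the original post
  intraSplit?: string | null;  // '' or null skips term splitting (per f67efb2)
}

const postcodeOpts: ThreadOpts = {
  interLft: 2,     // matches must start at the beginning of a word
  intraSplit: '',  // keep mixed terms such as 'CW3' and '5BQ' whole
};

// Assumption: with the library installed, this object would be passed to the
// uFuzzy constructor, e.g. new uFuzzy(postcodeOpts).
console.log(postcodeOpts);
```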
