Enable fuzzy search as default search mode #69

Closed
torrin47 opened this issue Feb 27, 2017 · 1 comment
Comments

@torrin47
Contributor

Below is the email chain for context. Asking Esri for their thoughts before we get started on this. Will work to formulate more specific requirements.

From: Greene, Ana
Sent: Wednesday, February 22, 2017 8:59 AM
To: Hultgren, Torrin <Hultgren.Torrin@epa.gov>
Cc: Pierson, Suzanne <Pierson.Suzanne@epa.gov>; Harness, Catherine <Harness.Catherine@epa.gov>; Suma Malothu <smalothu@innovateteam.com>
Subject: RE: Full text search thoughts

Hi guys,
Did I ever respond to this? Just catching up…only 2 weeks behind on email…

I totally agree that the wildcard and fuzzy searches should be the default. And I like the advanced search dialog. I’d like to go ahead and put all of this on our list of near-term development projects.

Thanks,

Ana Greene, M.S., PMP
Environmental Dataset Gateway (EDG) Program Manager
Office of Environmental Information (OEI)
Office of Information Management (OIM)
U.S. Environmental Protection Agency
(o): 202-566-2132
(c): 571-232-7860
Greene.Ana@epa.gov
https://edg.epa.gov/

From: Hultgren, Torrin
Sent: Tuesday, February 07, 2017 7:26 PM
To: Greene, Ana <Greene.ana@epa.gov>
Cc: Pierson, Suzanne <Pierson.Suzanne@epa.gov>; Harness, Catherine <Harness.Catherine@epa.gov>; Suma Malothu <smalothu@innovateteam.com>
Subject: Full text search thoughts

Hi Ana,

I believe I’ve figured out the source of our continuing confusion about full text search. It was legitimately disabled years ago, but has been working for some time, though perhaps not in the way we might expect, so I think there’s still some room for improvement, or at least adjustment. A lot of our confusion revolves around partial search terms and whether or not they’re considered a match. I think we can all remember a time when we had to be very careful about our search terms and couldn’t assume that search engines would appropriately match partial words or misspellings, yet these days we take it for granted. Lucene is quite capable of handling any match type we want it to, but the default is the old strict way. If we do a search for the first part of your email address, by default it will come up blank, even though there are records containing your email address:

https://edg.epa.gov/metadata/rest/find/document?f=searchpage&searchText=greene.ana
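For anyone who wants to reproduce these searches from a script, here is a minimal sketch of how the link above could be assembled. The endpoint and the `f` and `searchText` parameters are copied from that link; the helper name `build_search_url` is hypothetical, and nothing here is taken from the EDG code itself.

```python
from urllib.parse import urlencode

# Endpoint copied from the link above; the helper below is illustrative only.
EDG_FIND_URL = "https://edg.epa.gov/metadata/rest/find/document"


def build_search_url(search_text: str) -> str:
    """Return a search-page URL for the given search text (hypothetical helper)."""
    # safe="*~" leaves characters used by Lucene query syntax unescaped,
    # so the generated link stays readable.
    query = urlencode({"f": "searchpage", "searchText": search_text}, safe="*~")
    return f"{EDG_FIND_URL}?{query}"


# The strict, exact-term search described above.
print(build_search_url("greene.ana"))
```

Any search text can be passed in, so the same helper also covers queries that use Lucene syntax.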

EDG supports “advanced Lucene syntax”, so anyone who chose to read the help could apply a wildcard to their search, which just means that indexed terms that aren’t exact matches but contain the string are also returned:

https://edg.epa.gov/metadata/rest/find/document?f=searchpage&searchText=*greene.ana*

Which gives us all 6 records that contain your email address. In theory this slows performance, but we’d need orders of magnitude more records in our index before we’d notice any difference. There’s a last option that’s kind of fun, though it doesn’t seem to work with the direct link, so you’ll have to try it manually. If you search for greene.ana~ it will conduct a “fuzzy search”, which also returns “misspellings” or words that are very similar to the search term. It should return a bunch of records with “Greenspace” in the title.
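To make the “Greenspace” result less mysterious, here is a rough sketch of the idea behind fuzzy matching: count the single-character edits needed to turn one term into another. It is a simplified illustration, not Lucene’s implementation. The similarity formula (one minus the edit distance divided by the shorter term’s length) and the lowercased terms are assumptions made for this example, and the Lucene version behind the EDG may score matches differently.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum inserts, deletes and substitutions."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep a match
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Assumed similarity measure: 1 - distance / length of the shorter term."""
    shortest = min(len(a), len(b))
    if shortest == 0:
        return 0.0
    return 1.0 - edit_distance(a, b) / shortest


# Both terms are 10 characters long and share the prefix "green".
print(edit_distance("greene.ana", "greenspace"))  # 4
```

Under these assumptions the two terms are four substitutions apart, so they look fairly similar to a fuzzy matcher even though a person would never confuse them.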

I’m not sure about you, but I think my own expectation these days is that wildcards and fuzzy searches would be the default. I’d prefer a search to return too many results that I could filter through or refine than too few. But that may also be because of an assumption that the search engine would do a good job of ranking/sorting those results so the most relevant ones appear first, and I don’t know how valid that assumption is with the EDG. I think we could figure out how to adjust the scoring/ranking algorithm under the hood of the EDG, but I’m not at all sure how we’d measure whether our tweaks were making search results more or less relevant. And if we were to make fuzzy searches the default, I wonder how we’d allow someone to opt out if they wanted a stricter match? Perhaps we could show an “advanced search” dialog if they wished:

http://www.lucenetutorial.com/lucene-query-builder.html
https://www.google.com/advanced_search
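One possible shape for “fuzzy by default, with an opt-out” is to rewrite the search text before it reaches Lucene. The sketch below is only a starting point for discussion: the function name `relax_query`, the `strict` flag, and the rule of leaving alone any query that already contains Lucene syntax are all assumptions, not existing EDG behavior. It also does nothing about ranking, which would still need separate attention.

```python
# Characters and keywords suggesting the user is already writing Lucene syntax.
SYNTAX_CHARS = set('*?~":^()')
BOOLEAN_KEYWORDS = {"AND", "OR", "NOT"}


def relax_query(search_text: str, strict: bool = False) -> str:
    """Turn plain terms into wildcard-or-fuzzy clauses (hypothetical helper).

    With strict=True, or when the text already uses Lucene syntax,
    the search text is returned unchanged.
    """
    terms = search_text.split()
    uses_syntax = (
        any(ch in SYNTAX_CHARS for ch in search_text)
        or any(term in BOOLEAN_KEYWORDS for term in terms)
    )
    if strict or uses_syntax:
        return search_text
    # Each plain term may match as a substring (wildcard) or a near miss (fuzzy).
    return " ".join(f"(*{term}* OR {term}~)" for term in terms)


print(relax_query("greene.ana"))               # (*greene.ana* OR greene.ana~)
print(relax_query("greene.ana", strict=True))  # greene.ana
```

A “strict match” checkbox in an advanced search dialog could map directly onto the `strict` flag.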

Anyway, curious to know your thoughts. Definitely been on the brain today.
Torrin Hultgren
EPA National Geospatial Support Team
Innovate!, Inc. | hultgren.torrin@epa.gov | 703-922-9090 x737

@torrin47
Contributor Author

This issue was moved to USEPA/EPA_Environmental_Dataset_Gateway#4
