False positives for tag search #2153

Closed
gergely-ujvari opened this Issue Apr 9, 2015 · 17 comments

7 participants
@gergely-ujvari
Contributor

commented Apr 9, 2015

  1. Search for the tag piga-1.
  2. The search returns 8 annotations, but some of them don't have this tag.
    For example, annotation wGLvONWRTHilnSeyOXQ5Ew has only two tags: unidad 1 and unidad 4.

This happens because tags are indexed with the icu_normalizer analyzer, which tokenizes each tag into words: piga-1 becomes piga and 1, and unidad 1 becomes unidad and 1.
The shared 1 token is what produces the match.

One possible solution is to add a new tag analyzer, something like this:

            'tags': {
                'tokenizer': 'keyword',
                'filter': ['icu_folding', 'lowercase']
            }
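
To make the difference concrete, here is a minimal Python sketch that only simulates the two behaviours; it does not call Elasticsearch. The helper names (`fold`, `word_tokens`, `keyword_tokens`, `matches`) are hypothetical, `fold` is a rough stand-in for `icu_folding` plus `lowercase`, and the "any shared token is a hit" rule is an assumption about how the tag query is evaluated.

```python
import re
import unicodedata


def fold(text):
    """Rough stand-in for icu_folding + lowercase: strip accents, fold case."""
    decomposed = unicodedata.normalize("NFKD", text)
    without_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return without_marks.casefold()


def word_tokens(tag):
    """Approximate a word tokenizer: split on anything that is not a letter or digit."""
    return {tok for tok in re.split(r"[\W_]+", fold(tag)) if tok}


def keyword_tokens(tag):
    """Approximate the keyword tokenizer: the whole tag is a single token."""
    return {fold(tag)}


def matches(query, tags, tokenize):
    """Assumption: a tag matches when it shares at least one token with the query."""
    query_tokens = tokenize(query)
    return any(query_tokens & tokenize(tag) for tag in tags)


annotation_tags = ["unidad 1", "unidad 4"]
word_hit = matches("piga-1", annotation_tags, word_tokens)        # shared token "1"
keyword_hit = matches("piga-1", annotation_tags, keyword_tokens)  # no shared token
exact_hit = matches("PIGA-1", ["piga-1"], keyword_tokens)         # case is folded
```

Under these assumptions, word tokenization reports the unrelated annotation as a hit, while keyword-style analysis only matches the whole tag and still ignores case.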

@gergely-ujvari gergely-ujvari added the Bug label Apr 9, 2015

@judell

Contributor

commented Apr 9, 2015

The context here is that bioscience is a huge use case for H. Bioscience users are all about controlled vocabularies, and terms in those vocabularies are wont to include all sorts of special characters.

@tilgovi

Contributor

commented Apr 9, 2015

👍 I think it makes sense to do the Unicode normalization and case folding but not to tokenize.

@tilgovi

Contributor

commented Apr 9, 2015

Although, there may be some tokenizations that do make sense, actually. Tokenizing on whitespace makes sense to me.

In the world of average people using tag folksonomies (not highly controlled vocabularies for bioscience, for instance) it's likely that some people will write two tags where others would write one, "sf housing" vs "sf" and "housing".

Stated in the most general way I can think of: as a user I would rather wade through false positives than not get the results I need.
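
As a rough illustration of the whitespace-only option, the Python sketch below simulates splitting tags on whitespace and lowercasing them. It is not the project's analyzer; `whitespace_tokens` and `shares_token` are hypothetical helpers, and matching on any shared token is an assumption.

```python
def whitespace_tokens(tag):
    """Approximate a whitespace tokenizer followed by a lowercase filter."""
    return set(tag.casefold().split())


def shares_token(query, tag):
    """Assumption: a hit means the query and the tag share at least one token."""
    return bool(whitespace_tokens(query) & whitespace_tokens(tag))


folksonomy_hit = shares_token("housing", "sf housing")  # "housing" is its own token
short_hit = shares_token("SF", "sf housing")            # case is folded before comparing
punctuation_hit = shares_token("piga-1", "unidad 1")    # "piga-1" stays in one piece
```

In this simulation "sf housing" is still found by "sf" or "housing", while punctuation no longer splits a tag, so "piga-1" does not collide with "unidad 1".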

@judell

Contributor

commented Apr 9, 2015

Social tagging has sadly never converged on a convention for handling tags
meant as singletons that include whitespace and punctuation. But whatever
the user winds up perceiving as a single tag had better behave like one in
all search and navigation contexts.




@tilgovi

Contributor

commented Apr 28, 2015

This is why semantic tags and plain text tags are different in the OA model. The needs and expectations for each are incompatible, I think.

What's needed for this issue to progress is a decision about what the reasonable expectations are for the behavior of our tag field and how it is expected to be queried.

I would favor closing this issue as "won't fix" and addressing semantic tags as a separate feature later. For folksonomic usage, I think matching any part of a tag makes sense, since some users may choose to include delimiters in tags where others would use multiple tags.

@judell

Contributor

commented Apr 29, 2015

I agree that what we have here are folksonomic tags, and that semantic tags are an OA thing out of scope for this issue.

"As a user I would rather wade through false positives than not get the results I need."

You might; I would not. More importantly, I've observed users in two of our domains -- scholcomm, bioscience -- who would hate the piga-1 behavior above.

We can revisit when we have more data but for now I'm sure it's better to make the character strings typed into our tag fields be atomic and unbreakable.

@tilgovi

Contributor

commented Apr 29, 2015

I'm unconvinced.

I would very much rather not break the existing and obvious use case for tags to support users that are relying on tags as a fallback for the absence of semantic tags.

To me this is a perfect example of where an innocuous suggestion results in us not seeing and prioritizing and accepting the costs associated with making production-ready features.

I can easily see the report that comes after we "fix" this. "I had tagged something 'open-source' and then I couldn't find it when I searched for 'open source' and I was really confused."

Please think about it again and tell me whether you really want to do that.

@csillag

Contributor

commented Apr 29, 2015

"it's better to make the character strings typed into our tag fields be atomic and unbreakable."

That's the behavior I would accept as a user, without giving it too much thought.

@judell

Contributor

commented Apr 29, 2015

This isn't on the critical path for the current sprint, so let's leave it open while I gather some data.

@judell

Contributor

commented Apr 29, 2015

Here's a concordance of tags in use so far:

http://jonudell.net/h/tag_concordance.html

It does not bear directly on the question of whether users are depending on partial matching, but is a useful window onto how the population we have so far thinks about tags.

A matter for a separate issue, for example, is that the current UI enables some to create as single tags what they clearly meant as multiples, e.g.:

#YouthVoices #KQEDEdSpace #DoNowGrad
#fasting #nutritionalketosis

@judell

Contributor

commented Apr 30, 2015

Here is a screenshot that may help advance the discussion.

At http://www.theatlantic.com/technology/archive/2014/08/advertising-is-the-internets-original-sin/376041/ I searched for sin and the answer was:

Found 5 results

We don't highlight matches so it wasn't obvious at first. Eyeballing the results suggested that only jallred used the word sin (twice) in his annotation.

Then I ran an in-page search and the screenshot tells the tale. We matched:

bu_sin_ess
_sin_ister
adverti_sin_g
promi_sin_g

There may be exceptions, but in general nobody expects that.
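
The gap between the two expectations can be sketched in a few lines of Python. This is only an illustration of substring versus whole-word matching, not the sidebar's actual search code; the function names and the sample word list are hypothetical.

```python
import re


def substring_match(term, text):
    """Hit whenever the term appears anywhere, even inside a longer word."""
    return term.casefold() in text.casefold()


def whole_word_match(term, text):
    """Hit only when the term stands alone between word boundaries."""
    pattern = r"\b" + re.escape(term) + r"\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


samples = ["business", "sinister", "advertising", "promising", "original sin"]
substring_hits = [s for s in samples if substring_match("sin", s)]
whole_word_hits = [s for s in samples if whole_word_match("sin", s)]
```

Here `substring_hits` contains every sample, while `whole_word_hits` keeps only the phrase where "sin" is a word of its own.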

@tilgovi

Contributor

commented Apr 30, 2015

So we've got to be specific when discussing issues. In-page search in the sidebar is a different beast from the rest of the API and stream search. It runs entirely client-side and I expect you're only seeing the tip of the surprise iceberg.

A comprehensive solution for this needs to be discussed. I would even consider removing the feature from the sidebar in the meantime, since the number of places where there are enough annotations to make searching the sidebar important is low right now.

@judell

Contributor

commented Apr 30, 2015

OK, I'll file a separate issue that refers here. They're connected w/respect to expectations about partial-match search results, but sure, let's isolate the use cases.

@nichtich


commented Jan 12, 2016

@tilgovi wrote:

I would favor closing this issue as "won't fix" and addressing semantic tags as a separate feature later

Is there an issue for semantic tags as an additional feature request yet? I guess some discussion is needed to find consensus on what "semantic tags" (selecting tags from a controlled vocabulary, in my opinion) are meant to be.

@robertknight

Member

commented Jan 12, 2016

I'm not sure about the status of everything that might be called "semantic tagging", but controlled-vocabulary tagging specifically is definitely on our roadmap as an important need in certain domains.

@nickstenning nickstenning removed the Bug label Feb 10, 2016

@nickstenning

Contributor

commented Feb 10, 2016

The bug as described in the original message no longer exists:

$ curl 'https://hypothes.is/api/search?tags=piga-1'
{
  total: 1,
  ...
}

Closing.
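
For anyone wanting to repeat this check on other tags, here is a hedged Python sketch. It builds the same search URL as the curl command above and scans a list of result rows for annotations that lack the requested tag. The assumption that each result row carries a `tags` list, along with the helper names and the sample rows, is illustrative and not taken from this thread; no network request is made.

```python
from urllib.parse import urlencode

SEARCH_URL = "https://hypothes.is/api/search"  # endpoint from the curl command above


def build_search_url(tag):
    """Build the tag search URL with the tag safely encoded."""
    return SEARCH_URL + "?" + urlencode({"tags": tag})


def false_positives(rows, tag):
    """Return rows that do not actually carry the tag (case-insensitive).

    Assumption: each row is a dict with an optional "tags" list of strings.
    """
    wanted = tag.casefold()
    return [
        row for row in rows
        if wanted not in {t.casefold() for t in row.get("tags", [])}
    ]


# Hypothetical rows standing in for a decoded search response.
sample_rows = [
    {"id": "a1", "tags": ["piga-1"]},
    {"id": "a2", "tags": ["unidad 1", "unidad 4"]},
]
url = build_search_url("piga-1")
suspects = false_positives(sample_rows, "piga-1")
```

An empty `suspects` list for a real response would indicate that every returned annotation carries the tag.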

@judell

Contributor

commented Feb 10, 2016

"The bug as described in the original message no longer exists"

I did not realize that was fixed. Excellent!

I notice we still have:

https://hypothes.is/stream?q=tags:piga-1 -> many results

However I think I see what's gone wrong there, will file separately.
