Can I use only click data? #288

Closed
laxmimerit opened this issue Feb 24, 2022 · 3 comments
Labels
question Further information is requested

Comments

@laxmimerit

Hi,
Thanks for stepping in to solve one of the major issues in personalized content serving.
I have a question about ranking events. I believe they are like impression events, where all impressions are grouped together for a user.

image

Can I use only click data? That is, I want to use only metadata and interaction events for personalization. Please suggest what to change in this config so that I can skip the ranking events.

interactions:
  - name: click
    weight: 1.0
features:
  - name: popularity
    type: number
    scope: item
    source: metadata.popularity

  - name: vote_avg
    type: number
    scope: item
    source: metadata.vote_avg

  - name: vote_cnt
    type: number
    scope: item
    source: metadata.vote_cnt

  - name: budget
    type: number
    scope: item
    source: metadata.budget

  - name: release_date
    type: number
    scope: item
    source: metadata.release_date

  - name: runtime
    type: number
    scope: item
    source: metadata.runtime

  - name: title_length
    type: word_count
    source: metadata.title
    scope: item

  - name: genre
    type: string
    scope: item
    source: metadata.genres
    values:
      - drama
      - comedy
      - thriller
      - action
      - adventure
      - romance
      - crime
      - science fiction
      - fantasy
      - family
      - horror
      - mystery
      - animation
      - history
      - music

  - name: ctr
    type: rate
    top: click
    bottom: impression
    scope: item
    bucket: 24h
    periods: [7,30]

  - name: liked_genre
    type: interacted_with
    interaction: click
    field: metadata.genres
    scope: session
    count: 10
    duration: 24h

  - name: liked_actors
    type: interacted_with
    interaction: click
    field: metadata.actors
    scope: session
    count: 10
    duration: 24h

  - name: liked_tags
    type: interacted_with
    interaction: click
    field: metadata.tags
    scope: session
    count: 10
    duration: 24h

  - name: liked_director
    type: interacted_with
    interaction: click
    field: metadata.director
    scope: session
    count: 10
    duration: 24h
@vgoloviznin vgoloviznin added the question Further information is requested label Feb 24, 2022
@shuttie
Collaborator

shuttie commented Feb 24, 2022

The reason we require ranking/impression data is that we use it to teach the underlying ML model which items are relevant and which are not. Generally speaking, the Learn-to-Rank approach is about training a kind of binary classifier, which is then asked the question "given items A and B, which one of them should be ranked higher?"

If you have impression data showing that items A-B-C-D were displayed and item C was clicked afterwards, then you can assume that items A, B and C were examined by the visitor, but only C is actually relevant from this group. So you teach the model that C should be ranked higher than A and B.
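As a rough illustration of that idea (not Metarank's actual implementation), the sketch below derives "ranked higher than" pairs from one displayed list and its clicks. The function name `pairwise_preferences` and the rule "everything down to the lowest click was examined" are assumptions made for this example.

```python
def pairwise_preferences(shown, clicked):
    """Return (preferred, other) pairs implied by one ranking and its clicks.

    Assumption for this sketch: every item down to the lowest clicked
    position was examined; items below it are skipped because we cannot
    tell whether the visitor saw them.
    """
    clicked = set(clicked)
    positions = [i for i, item in enumerate(shown) if item in clicked]
    if not positions:
        return []  # no clicks, so nothing can be inferred
    examined = shown[: max(positions) + 1]
    return [
        (winner, loser)
        for winner in examined if winner in clicked
        for loser in examined if loser not in clicked
    ]


pairs = pairwise_preferences(["A", "B", "C", "D"], ["C"])
print(pairs)  # [('C', 'A'), ('C', 'B')]
```

Item D produces no pair here, because it sits below the click and may never have been seen.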

The other important reason why ranking data is essential is position information: people tend to click on the first items in a list much more frequently. In ecommerce, around 50% of clicks go to the top-5 results. So a click on position 1 does not tell you much about whether the item was relevant at all. But if a visitor scrolled to the bottom of the list and only there found something relevant enough to click, then this click is pure gold from the value perspective.

And the last point: we have quite a few important feature extractors that use impression information, such as the rate extractor, which computes per-item CTR/conversion rates. Which item is more likely to be clicked: the one with 100 clicks and 100k impressions, or the one with 50 clicks and 100 impressions? If you only count clicks, the answer is not clear anymore, as everything looks relevant.
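To make the arithmetic concrete, here is a minimal sketch of the click-through rate for the two items in that example. The helper `ctr` is hypothetical and only illustrates the ratio; it is not the code behind the rate extractor.

```python
def ctr(clicks, impressions):
    """Click-through rate: clicks divided by impressions (0.0 if never shown)."""
    return clicks / impressions if impressions else 0.0


item_a = ctr(100, 100_000)  # many clicks, but shown very often
item_b = ctr(50, 100)       # fewer clicks, but shown rarely

print(item_a, item_b)  # 0.001 0.5
```

By click count alone item A looks twice as good as item B, while by CTR item B is 500 times more likely to be clicked when shown.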

In recommender systems theory, there is an approach for dealing with this type of problem, where there is no negative feedback: you sample some random non-clicked items from the inventory and treat them as your negative samples. But it usually gives sub-par quality: the data is still synthetic, and saying that this particular pair of socks is less relevant than a teapot will probably lead to bad ranking results.
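A minimal sketch of that random negative sampling idea, assuming the inventory and the clicked items are plain lists of ids. The function `sample_negatives` and the example data are hypothetical, and this is a generic technique rather than something Metarank does for you.

```python
import random


def sample_negatives(inventory, clicked, k, seed=0):
    """Pick up to k random non-clicked items to act as synthetic negatives."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    clicked_set = set(clicked)
    candidates = [item for item in inventory if item not in clicked_set]
    return rng.sample(candidates, min(k, len(candidates)))


inventory = ["socks", "teapot", "kettle", "scarf", "gloves"]
negatives = sample_negatives(inventory, clicked=["socks"], k=2)
print(negatives)  # two random items, never "socks"
```

Nothing here checks whether the sampled items were ever shown to the visitor, which is exactly why such negatives are weaker evidence than real impressions.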

I'm wondering why you have this problem. Is it more about collecting historical data to train the model? In our demo (the one on demo.metarank.ai and in RanklensTest) we use only around 5k user sessions, and that is more than enough to get a real impact on the ranking, so I guess you don't need to wait years to collect enough data for initial training. From the JSON format perspective, ranking events are the actual request bodies you send to the Metarank API to do the reranking itself, so you only need to log them and wait for some time.

@laxmimerit
Author

Thank you so much for such a detailed explanation!
I do have impression data, with the relative position where each item was shown. I can't find a position or any other related variable in your movie dataset. In the ranking event, there is a list of items with zero relevancy for all items. In the interaction event, there is just item_id and other session-related info.

How will the algorithm know the position of the click?

If it takes the index position from the ranking event, then I don't think that is the correct way to do it. For one session it is okay, but when data is shown over multiple sessions (which is the ideal case in e-commerce), it is quite possible that the impression list sequence will vary a lot. In that case, it would be nearly impossible to relate click and impression position. I would also like to extend my question to relevancy: what is it used for, and why is it zero?

@shuttie
Collaborator

shuttie commented Mar 2, 2022

There is an event schema doc describing the format of the events, and there are a couple of important points from there:

  • every event has a unique identifier, the id field
  • each interaction event has an explicit impression field, pointing to the identifier of the corresponding parent ranking event, so the two can be joined together later. We actually join each ranking with all the interactions that happened afterwards into a single click-through. So for each ranking we can clearly see which items were clicked, and which were examined and ignored.
  • in the ranking event there is an items field with the ordering of the items that were actually displayed to the visitor. This ordering is important, as the position is taken from there.

So Metarank does not need a single constant ranking of items. Each time you present a listing (for example, search results) to a visitor, you should emit it as an event downstream. And each time the visitor clicks on an item in this particular ranking, you should also send another event with the clicked item id AND the parent ranking that resulted in this click.
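The join described above can be sketched roughly as follows. The event dictionaries are simplified and hypothetical: the keys `event`, `id`, `items`, `impression` and `item` follow the description in this comment, but check the event schema doc for the exact field names. `click_positions` is a name made up for this example, not part of Metarank.

```python
def click_positions(events):
    """Join interactions to their parent ranking and return
    (ranking_id, item_id, zero_based_position) for every click."""
    rankings = {e["id"]: e for e in events if e["event"] == "ranking"}
    result = []
    for e in events:
        if e["event"] != "interaction":
            continue
        parent = rankings.get(e["impression"])
        if parent is None:
            continue  # click without a logged parent ranking
        item_ids = [item["id"] for item in parent["items"]]
        if e["item"] in item_ids:
            result.append((parent["id"], e["item"], item_ids.index(e["item"])))
    return result


# The same item "C" is shown at different positions in two rankings.
events = [
    {"event": "ranking", "id": "r1", "items": [{"id": "A"}, {"id": "B"}, {"id": "C"}]},
    {"event": "ranking", "id": "r2", "items": [{"id": "C"}, {"id": "A"}, {"id": "B"}]},
    {"event": "interaction", "id": "i1", "impression": "r1", "item": "C"},
    {"event": "interaction", "id": "i2", "impression": "r2", "item": "C"},
]

positions = click_positions(events)
print(positions)  # [('r1', 'C', 2), ('r2', 'C', 0)]
```

Because every click carries the id of its parent ranking, the position is resolved per listing, so it does not matter that the item ordering varies between sessions.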
