This repository has been archived by the owner on Nov 10, 2023. It is now read-only.

user_item and item_item recommender tables #12

Open
koaning opened this issue Dec 1, 2021 · 4 comments

Comments

@koaning
Collaborator

koaning commented Dec 1, 2021

Given a log of weighted user-item interactions, can we generate an item-item recommendation table and a user-item recommendation table?

Kind of! We can calculate p(item_a | item_b) and p(item_a), which can be reweighted into a table with recommendations. We can also do something similar for users. After all, a user that interacted with items a, b and c will have a score for item x defined via:

p(item_x | user) = p(item_x | item_a, item_b, item_c)
                 \propto p(item_x | item_a) p(item_x| item_b) p(item_x|item_c)
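A minimal pure-Python sketch of that reweighting, using made-up conditional probabilities purely for illustration (the numbers and names are not from any real data):

```python
# Hypothetical conditional probabilities p(item_x | item_i); the values
# are invented just to illustrate the independence-style product above.
p_x_given = {"item_a": 0.30, "item_b": 0.10, "item_c": 0.20}

def user_score(conditionals):
    """Unnormalized score for item_x given a user's interacted items,
    under the naive assumption: multiply the conditionals together."""
    score = 1.0
    for p in conditionals.values():
        score *= p
    return score

score = user_score(p_x_given)  # 0.30 * 0.10 * 0.20
```

Note the result is only proportional to p(item_x | user); for ranking items per user that is all we need.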
@ritchie46
Member

Interesting... Would every cell in one table need to be computed against all the others?

@koaning
Collaborator Author

koaning commented Dec 1, 2021

I don't think so unless every user has interacted with every item.

I've started with an item-item count table though.

import polars as pl

def item_item_counts(dataf, user_col="user", item_col="item"):
    """
    Computes item-item overlap counts from user-item interactions, useful for recommendations.

    This function is meant to be used in a `.pipe()`-line.

    Arguments:
        - dataf: polars dataframe
        - user_col: name of the column containing the user id
        - item_col: name of the column containing the item id
    """
    return (dataf
        .with_columns([
            # For each user, pair every item with every other item they interacted with.
            pl.col(item_col).list().over(user_col).explode().alias("item_rec"),
        ])
        .filter(pl.col(item_col) != pl.col("item_rec"))
        .with_columns([
            pl.col(user_col).count().over(item_col).alias("n_item"),
            pl.col(user_col).count().over("item_rec").alias("n_item_rec"),
            pl.col(user_col).count().over([item_col, "item_rec"]).alias("n_both")
        ])
        .select([item_col, "item_rec", "n_item", "n_item_rec", "n_both"])
        .drop_duplicates()
    )

Something is telling me these kinds of queries are gonna benchmark reaaaal well.

@koaning
Collaborator Author

koaning commented Dec 1, 2021

Got it.

It's something like this:

result = (df
  .pipe(remove_outliers)  # assumes a user-defined cleaning step upstream
  .with_column(
      # For each user, pair every item with every other item they interacted with.
      pl.col('item').list().over('user').explode().alias("item_rec")
  )
  .filter(pl.col("item") != pl.col("item_rec"))
  .with_columns([
    pl.col('user').count().over('item').alias("n_item"),
    pl.col('user').count().over('item_rec').alias("n_item_rec"),
    pl.col('user').count().over(['item', 'item_rec']).alias("n_both")
  ])
)

(result
  # Estimate p(item_rec | item) as n_both / n_item and use it as a rating.
  .with_column((pl.col('n_both')/pl.col('n_item')).alias('rating'))
  .filter(pl.col('n_both') > 10)
  .sort(['item', 'rating'], reverse=True))

@koaning
Collaborator Author

koaning commented Dec 1, 2021

@ritchie46 does polars support log?
