Question on indices mapping - pure collaborative-filtering example #12

Closed
micheledemeo opened this issue Oct 13, 2020 · 3 comments

@micheledemeo

import numpy as np
import pandas as pd
from libreco.data import random_split, DatasetPure
from libreco.algorithms import SVDpp  # pure data, algorithm SVD++

# a multi-character separator such as "::" needs the python parser engine
data = pd.read_csv("examples/sample_data/sample_movielens_rating.dat", sep="::",
                   names=["user", "item", "label", "time"], engine="python")

# split whole data into three folds for training, evaluating and testing
train_data, eval_data, test_data = random_split(data, multi_ratios=[0.8, 0.1, 0.1])

train_data, data_info = DatasetPure.build_trainset(train_data)
eval_data = DatasetPure.build_testset(eval_data)
test_data = DatasetPure.build_testset(test_data)

train_data.item_indices[np.where(train_data.user_indices==2124)]
# => array([ 990, 2207, 2125, 2051, 2534, 2452,  950, 1219, 1680, 1110])

data[(data['user']==2124) & (data['item']==990)]
# => no record for user 2124 & item 990

Can you clarify how the index mapping works?

@Shadz13

Shadz13 commented Oct 13, 2020

I have also encountered this with the YouTube model, and I am also trying to trace the indices back to the original user and item ids. The mapped user index differs from the original user id.

Is the item id remapped as well, and how can we map both back?

@massquantity - Could you please clarify this for the YouTube model as well, so that we don't have to open a new issue?

@massquantity
Owner

Sorry, guys. I think this whole id-mapping thing needs a more thorough design.
The mapping is implemented in libreco/data/data_info.py. When you call DatasetPure.build_trainset, all users and items are mapped. Here is the code:

    @property
    def user2id(self):
        unique = np.unique(self.interaction_data["user"])
        u2id = dict(zip(unique, range(self.n_users)))
        u2id[-1] = len(unique)   # -1 represents a new (unseen) user
        return u2id

Basically, this operation maps the original ids to the range [0, n_users - 1], with the extra index n_users reserved for new users (key -1). The library maps ids to indices first because contiguous indices are much more convenient to work with internally.

To get the original ids:
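For illustration, here is a minimal standalone sketch of the same idea using only NumPy. The array `raw_users` and its values are made up for this example and are not from the dataset in the question; the point is that an index such as 2124 inside the library is generally not the original user id 2124.

```python
import numpy as np

# Toy stand-in for the "user" column of the raw ratings (original ids).
raw_users = np.array([2124, 7, 990, 7, 2124, 15])

# Same idea as the user2id property quoted above:
# sorted unique original ids are numbered 0 .. n_users - 1.
unique_users = np.unique(raw_users)
n_users = len(unique_users)
user2id = {int(u): i for i, u in enumerate(unique_users)}
user2id[-1] = n_users  # extra slot reserved for an unseen user

# Inverse mapping: index used internally -> original id.
id2user = {i: u for u, i in user2id.items()}

print(user2id[2124])  # 3
print(id2user[3])     # 2124
```

Because `np.unique` returns the ids in sorted order, the index of an id is simply its rank among all ids seen in the training data.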

>>> mapping_user = data_info.user2id        # dict: original id -> index used in the library
>>> mapping_user[2124]                      # => ... (index used in the library for original user 2124)
>>> mapping_id = data_info.id2user          # dict: index used in the library -> original id

The same works for items via item2id and id2item.
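Applied to the lookup in the original question, the translation could look roughly like the sketch below. The helper `original_items_for_user` is hypothetical (it is not part of libreco), and the dicts and arrays are toy stand-ins for `data_info.user2id`, `data_info.id2item`, `train_data.user_indices` and `train_data.item_indices`.

```python
import numpy as np

def original_items_for_user(original_user, user_indices, item_indices,
                            user2id, id2item):
    """Hypothetical helper: original item ids rated by an original user id."""
    internal_user = user2id[original_user]                    # original id -> index
    internal_items = item_indices[user_indices == internal_user]
    return [id2item[int(i)] for i in internal_items]          # index -> original id

# Toy stand-ins for the mappings and the index arrays of the training set.
user2id = {7: 0, 2124: 1}
id2item = {0: 50, 1: 990, 2: 1200}
user_indices = np.array([0, 1, 1, 0])
item_indices = np.array([2, 0, 1, 1])

items = original_items_for_user(2124, user_indices, item_indices,
                                user2id, id2item)
print(items)  # [50, 990]
```

The key step is converting the original user id to its index before filtering, and converting the resulting item indices back before comparing them with the raw DataFrame.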

@micheledemeo
Author

Thanks 👍
