Building datasets #318

ghost · 2018-06-22T13:33:06Z

Hello !

Thank you for this open source package, it helps a lot and your work is amazing.

I just have a silly question about dataset construction. I followed the example for my data:
user (160,000 x 300) and item (4000 x 4).

dataset = Dataset()
dataset.fit(users=(x['id_user'] for x in user),
            items=(x['id_item'] for x in item),
            user_features=((x['id_user'], [[x[col] for col in list_columns_user]]) for x in user),
            item_features=((x['id_item'], [[x[col] for col in list_columns_item]]) for x in item))

But when I try dataset.user_features_shape() I get (160000, 160000). Shouldn't I get (160000, 300) instead?

Indeed, we can read in the documentation :

Returns
-------
(num user ids, num user features): tuple of ints

and my num user features is 300. So is there an error in what I did?

Sorry for the stupid question!


maciejkula · 2018-06-22T13:45:44Z

This is the expected result. By default, LightFM adds one feature for every user and item. You can disable that in the Dataset constructor (user_identity_features=False, item_identity_features=False).

maciejkula · 2018-06-22T13:46:40Z

(Well, you should get 160000 x 160300 or something like that. Are your feature names the same as some of your user ids?)
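
The shape arithmetic behind this can be sketched in plain Python, scaled down to 5 users and 3 features. This is a simplified model and not LightFM's own code: `build_feature_mapping` is a hypothetical helper that assumes every user ID and every feature name is assigned one column, and that identical names share a column.

```python
def build_feature_mapping(user_ids, feature_names, identity_features=True):
    """Assign one column index per distinct name (hypothetical helper)."""
    mapping = {}
    if identity_features:
        # One extra "identity" column per user.
        for user_id in user_ids:
            mapping.setdefault(user_id, len(mapping))
    for name in feature_names:
        # A name that is already present keeps its existing column.
        mapping.setdefault(name, len(mapping))
    return mapping


user_ids = [f"user_{i}" for i in range(5)]
feature_names = ["age", "country", "plan"]

with_identity = build_feature_mapping(user_ids, feature_names)
without_identity = build_feature_mapping(
    user_ids, feature_names, identity_features=False
)

# (number of users, number of feature columns)
shape_with = (len(user_ids), len(with_identity))
shape_without = (len(user_ids), len(without_identity))

# If a feature name equals a user ID, the two share a column,
# so the total comes out smaller than users + features.
colliding = build_feature_mapping(["age", "bob"], ["age", "country"])

print(shape_with, shape_without, len(colliding))
```

Under these assumptions the column count is users plus features when identity features are on, and drops below that only when names collide, which is why the question about feature names matching user IDs matters.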

ghost · 2018-06-22T13:53:03Z

Oh ok.

No, none of my feature names are the same as user ids.

maciejkula · 2018-06-22T13:56:16Z

What data are you passing as user_features and item_features? It should just be a list (or other iterable) of the names of user/item features.

(So if you have 300 user features it should be 300 long).

ghost · 2018-06-22T14:09:26Z

I misunderstood this line in the example: dataset.fit_partial(items=(x['ISBN'] for x in get_book_features()), item_features=(x['Book-Author'] for x in get_book_features()))

I was trying to do the same thing, obviously the wrong way.

After changing it to pass just the feature names, I get the right result. Thank you!

ghost · 2018-06-22T17:02:07Z

Sorry to re-open this, but after that I continued by building the users/items features (always following the example and the documentation):

# Create user matrix
user_features = dataset.build_user_features(((x['id_user'],list_columns_user) for x in user), True)
print(repr(user_features))

Like in the documentation :

data (iterable of the form) – (user id, [list of feature names])

x['id_user'] is my user id and list_columns_user is my list of feature names. But when I look at one row of user_features, I get 0 everywhere except at the index of that row. In other words, user_features is just the identity matrix.

Example:

user_features[1, :].todense()
Out[31]: matrix([[0., 1., 0., ..., 0., 0., 0.]], dtype=float32)

Is this the expected result? If yes, I think I don't really understand how the user features matrix is built and how it differs from plain collaborative filtering.

maciejkula · 2018-06-22T17:10:07Z

You need to pass an iterable of tuples of (id, [list of features for that id]) into build_features. It looks like at the moment you're passing the same features for every user?
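
A minimal sketch of that input shape, using only the standard library. The CSV content and the column names `age` and `income` are made up for illustration; `id_user` and `list_columns_user` follow the names used in the question. The second form relies on the documented alternative input to `build_user_features`, `(user id, {feature name: feature weight})`.

```python
import csv
import io

# Hypothetical stand-in for the user CSV file.
raw = io.StringIO("id_user,age,income\nu1,34,52000\nu2,27,48000\n")
list_columns_user = ["age", "income"]

# Materialise the rows so they can be iterated more than once.
rows = list(csv.DictReader(raw))

# What the question passes: the same list of column names for every user,
# so every user ends up with identical features.
same_for_everyone = [(row["id_user"], list_columns_user) for row in rows]

# Per-user data: each user's own values, keyed by feature name.
per_user = [
    (row["id_user"], {col: float(row[col]) for col in list_columns_user})
    for row in rows
]

for entry in per_user:
    print(entry)
```

Printing a few entries like this is also a quick way to check that the iterator yields what you expect before handing it to the library.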

ghost · 2018-06-22T17:36:08Z

At the moment I pass the user feature names, in fact; that's what I read in the docs.
I tried with the actual list of features for each id, like this:

user_features = dataset.build_user_features(((x['id_user'], x[list_columns_user]) for x in user), True)
(user is a csv.DictReader like in your example)

but it didn't change a thing :(

maciejkula · 2018-06-22T17:39:37Z

Are you passing features for the second user? Is the resulting matrix an identity matrix? Can you post a short gist that reproduces this?

It may be useful to print some of the elements in your iterator and make sure that they are what you think they are. Is x[list_columns_user] an iterable of things?

maciejkula · 2018-06-22T17:40:45Z

(If you think the docs are unclear on this point please make a PR with improvements.)

maciejkula · 2018-06-22T17:55:03Z

One more pointer: if you are using generators, you can only iterate over a generator once: subsequent iterations will yield zero elements. Maybe you are creating one csv reader, using it for fit, then trying to use it again for build_features?
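
The effect is easy to reproduce with the standard library alone. In this sketch the CSV text and the helper `read_users` are hypothetical; the point is only that a second pass over the same `csv.DictReader` yields nothing.

```python
import csv
import io

raw = "id_user,age\nu1,34\nu2,27\n"  # hypothetical CSV content

# One reader, iterated twice: the second pass is empty.
reader = csv.DictReader(io.StringIO(raw))
first_pass = [row["id_user"] for row in reader]
second_pass = [row["id_user"] for row in reader]

# Option 1: materialise the rows once and reuse the list.
rows = list(csv.DictReader(io.StringIO(raw)))
ids_for_fit = [row["id_user"] for row in rows]
ids_for_build = [row["id_user"] for row in rows]


# Option 2: create a fresh reader for every pass.
def read_users():
    return csv.DictReader(io.StringIO(raw))


ids_fresh_a = [row["id_user"] for row in read_users()]
ids_fresh_b = [row["id_user"] for row in read_users()]

print(first_pass, second_pass)
```

Option 2 mirrors the example line quoted earlier in the thread, which calls `get_book_features()` separately for each argument instead of sharing one reader.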

ghost · 2018-06-25T14:23:38Z

Sorry for the late answer, I no longer had access to the data.
You nailed it! It was the generator problem; I didn't know that we can't iterate over it more than once. I don't really know how to deal with this type, so I redid all the processing with pandas (I know it better), and I think it works!
I built a new user_features and it is a diagonal matrix with 0.0454 everywhere on the diagonal. Is it expected that the value is the same across the diagonal?

I didn't normalize my values; I just used the parameter normalize=True while building the user/item features (with build_item_features and build_user_features). Could this lead to a mistake?

maciejkula · 2018-06-25T15:17:12Z

If all users have the same number of features the value on the diagonal will be the same for all of them.
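
A small sketch of why that happens, assuming that normalization divides each row by its sum and that every feature, including the per-user identity feature, starts with weight 1. The helpers `make_row` and `normalize_row` are hypothetical, and the count of 21 features is made up so that the identity entry lands near 0.045; the thread does not say how many features were actually used.

```python
def make_row(user_id, n_features):
    """One identity feature plus n_features named features, all with weight 1."""
    row = {user_id: 1.0}
    row.update({f"f{i}": 1.0 for i in range(n_features)})
    return row


def normalize_row(weights):
    """Scale the weights so that they sum to 1."""
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}


row_a = normalize_row(make_row("u1", 21))
row_b = normalize_row(make_row("u2", 21))
row_c = normalize_row(make_row("u3", 9))

# Same feature count gives the same identity value (1/22 for both).
print(row_a["u1"], row_b["u2"])
# A different feature count gives a different value (1/10).
print(row_c["u3"])
```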

ghost · 2018-06-25T20:12:15Z

Got it. thanks a lot, I really really appreciated your help.

kewlcoder · 2018-10-03T18:49:15Z

> This is the expected result. By default, LightFM adds a feature per every user and item. You can disable that in the constructor.

Hi Maciej,
I couldn't find anything in the constructor that can disable the addition of a feature per every user & item. Can you please help.
Documentation link - [https://lyst.github.io/lightfm/docs/lightfm.html]

class lightfm.LightFM(no_components=10, k=5, n=10, learning_schedule='adagrad', loss='logistic', learning_rate=0.05, rho=0.95, epsilon=1e-06, item_alpha=0.0, user_alpha=0.0, max_sampled=10, random_state=None)

igorkf · 2020-11-15T02:39:08Z

Is it possible to pass more than one user feature for each user?
Like:

[
  (0, {'category': 'horror', 'sex': 'male'}),
  (1, {'category': 'romantic', 'sex': 'female'}),
   ...
]
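
The thread does not answer this question. One possible approach, sketched here as an assumption and not as the maintainers' recommendation, is to flatten each attribute into a single `key:value` feature name, since the documented input forms are a list of feature names or a `{feature name: feature weight}` dict. The helper `to_tokens` is hypothetical.

```python
users = [
    (0, {"category": "horror", "sex": "male"}),
    (1, {"category": "romantic", "sex": "female"}),
]


def to_tokens(attributes):
    """Turn {'category': 'horror'} into ['category:horror'] (hypothetical helper)."""
    return [f"{key}:{value}" for key, value in attributes.items()]


# All distinct feature names, suitable as the user_features argument of fit.
feature_names = sorted(
    {token for _, attributes in users for token in to_tokens(attributes)}
)

# Per-user (id, [feature names]) tuples, the shape build_user_features expects.
per_user = [(user_id, to_tokens(attributes)) for user_id, attributes in users]

print(feature_names)
print(per_user)
```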

ghost closed this as completed Jun 22, 2018

ghost reopened this Jun 22, 2018

ghost closed this as completed Jun 25, 2018

This issue was closed.
