
danb27/sar_single_node_improvements: allow for getting most frequent … #1666

Merged · 16 commits merged into recommenders-team:staging on Apr 11, 2022

Conversation

danb27 (Contributor) commented Mar 9, 2022

Description

  1. Bug fix: The documentation says that self.item_frequencies should contain the frequency of every item, but it currently holds a different value, taken from the diagonal of the matrix returned by self.compute_cooccurrence_matrix, which is not equal to the item frequencies when there are duplicates in the dataframe. The test we had written before (test_get_popularity_based_topk) was only passing because its test data has no duplicates. If you add any duplicates to the test, it fails, and nothing in the fit() method stops you or warns you that you are using the object incorrectly. Many of the other tests we have written do include duplicates. (See the reproduction sketch after this list.)
  2. Improvement: Allow users to get similar users via the new method self.get_user_based_topk, analogous to how they can already get similar items with self.get_item_based_topk.
  3. Improvement: Allow users to call self.get_popularity_based_topk with items=False to get the most popular users rather than the most popular items. The parameter defaults to True (return items, not users) for backwards compatibility.
  4. Add tests for 1-3.
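A minimal reproduction sketch of the bug in item 1, independent of the SAR class; the column names and raw-id indexing are illustrative, not the library's internals:

import numpy as np
import pandas as pd
from scipy import sparse

# Duplicate (user, item) pair: user 1 interacted with item 10 twice.
df = pd.DataFrame({"UserId": [1, 1, 2], "MovieId": [10, 10, 10]})

# User-item matrix from raw counts (scipy sums duplicate entries).
user_item = sparse.csr_matrix((np.ones(len(df)), (df["UserId"], df["MovieId"])))

# The cooccurrence diagonal squares the per-user counts ...
cooccurrence = user_item.T @ user_item
print(cooccurrence.diagonal()[10])       # 5.0 -- not the item frequency
# ... while the true frequency is simply the row count.
print(df["MovieId"].value_counts()[10])  # 3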

Checklist:

  • I have followed the contribution guidelines and code style for this project.
  • I have added tests covering my contributions.
  • I have updated the documentation accordingly.
  • This PR is being made to staging branch and not to main branch.

Looking forward to your feedback!

Commit: …users, similar users, and fix self.item_frequencies to actually use frequencies as stated in documentation
danb27 (Contributor, Author) commented Mar 9, 2022

The issue (I'm no longer sure that it is a bug) with the frequencies was introduced here: #1588

The problem happens when there are duplicate user-item pairs in the dataset. A note was added to the docstring telling the user not to provide duplicates, but there is no warning or error thrown if duplicates are present. If there were an error thrown, then a lot of our current tests would fail also.

Without a warning and without dropping duplicates, you can get weird results like the following when the user DOES have duplicates:

[screenshot: example of the unexpected item_frequencies values]

I added a warning if duplicates are found in the dataset, but I think we can still improve by making the frequency calculation independent of the cooccurrence matrix (just count occurrences in the user-provided dataframe).
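For reference, a minimal sketch of such a duplicate warning (not the PR's actual code; the column names are illustrative, the real class uses self.col_user / self.col_item):

import warnings
import pandas as pd

def warn_on_duplicates(df: pd.DataFrame, col_user: str = "UserId", col_item: str = "MovieId") -> None:
    # Repeated (user, item) pairs distort the cooccurrence diagonal.
    if df[[col_user, col_item]].duplicated().any():
        warnings.warn(
            "Duplicate (user, item) pairs found; item frequencies derived "
            "from the cooccurrence matrix will not match the true counts."
        )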

Any thoughts?

codecov-commenter commented Mar 10, 2022

Codecov Report

Merging #1666 (7038a14) into staging (b7d79ab) will increase coverage by 23.33%.
The diff coverage is 100.00%.

@@             Coverage Diff              @@
##           staging    #1666       +/-   ##
============================================
+ Coverage     0.00%   23.33%   +23.33%     
============================================
  Files           88       87        -1     
  Lines         9121     9132       +11     
============================================
+ Hits             0     2131     +2131     
- Misses           0     7001     +7001     
Flag      Coverage Δ
nightly   ?
pr-gate   23.33% <100.00%> (?)

Flags with carried forward coverage won't be shown.

Impacted Files                                     Coverage Δ
recommenders/models/sar/sar_singlenode.py          97.22% <100.00%> (+97.22%) ⬆️
recommenders/datasets/mind.py                      0.00% <0.00%> (ø)
recommenders/datasets/movielens.py                 66.37% <0.00%> (+66.37%) ⬆️
recommenders/utils/python_utils.py                 97.50% <0.00%> (+97.50%) ⬆️
recommenders/datasets/download_utils.py            90.00% <0.00%> (+90.00%) ⬆️
recommenders/models/newsrec/io/mind_iterator.py    0.00% <0.00%> (ø)
recommenders/models/newsrec/models/base_model.py   0.00% <0.00%> (ø)
recommenders/__init__.py

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update b7d79ab...7038a14.

miguelgfierro (Collaborator) left a comment

@danb27 this is fantastic work; I added some questions and suggestions. I'll add more people to review because the work you are doing is super good.

One thing I would ask, Dan: please be patient with the review process. SAR is a key algorithm for us with a lot of implications, so it will take us some time to review.

Review thread on recommenders/models/sar/sar_singlenode.py:
"""
if sum(df[[self.col_user, self.col_item]].duplicated()) > 0:
miguelgfierro (Collaborator) commented:
@danb27 @anargyri @simonzhaoms @loomlike @gramhagen hey can we have a discussion on whether we want to have a warning or throw an error when there are duplicates? What are your perspectives?

danb27 (Contributor, Author) replied:

I think either option is better than just having a docstring. If there is a use case for running SAR with duplicates, then I think warning is best. But, if we believe that the algorithm should NEVER see duplicates, then I would choose error.

If we do choose to go with error, we will need to fix multiple tests. If we go with warning, we don't need to fix the tests (but still probably should IMO). I was just hesitant to change a bunch of tests without first having this discussion.

Let me know what you all think.

miguelgfierro (Collaborator) replied:

If we have duplicates, will we be artificially modifying the cooccurrence matrix? If that's the case, then we can throw an error.
FYI @anargyri

danb27 (Contributor, Author) replied on Mar 21, 2022:

So currently, yes, duplicates will give us weird results in the cooccurrence matrix. We can fix this by simply removing duplicates from a copy of the dataframe before calculating the cooccurrence matrix, like this:

df = df.drop_duplicates(subset=[self.col_user, self.col_item])  # returns a deduplicated copy

That way the user can provide duplicates: it won't harm our cooccurrence matrix, our user and item frequencies will still be accurate (including duplicates, if there are any) because of my new way of retrieving the frequencies, and all the unit tests will still pass.

If we do not do this, then the problem is that all unit tests that use this dataset: https://github.com/microsoft/recommenders/blob/main/tests/conftest.py#L96 will fail, which seems very suspicious to me.

miguelgfierro (Collaborator) replied:

@danb27 for this I would say that it is better to check whether there are duplicates and, if so, throw an error.
In my experience df.duplicated().any() is faster than any(df.duplicated()), but I'm not sure if your implementation is faster. In sum(df[[self.col_user, self.col_item]].duplicated()) I would also consider the ratings column: I can see a case where the ratings are related to clicks and purchases, where both signals are valid, so we want to keep both.
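A sketch of the check under discussion, assuming the column names used in the unit tests; a hypothetical helper, not the final implementation:

import pandas as pd

def has_duplicates(df: pd.DataFrame) -> bool:
    # The vectorized .any() avoids both the Python-level iteration of the
    # builtin any() and summing the entire boolean mask with sum().
    # Including the rating column keeps rows where the same (user, item)
    # pair carries distinct valid signals, e.g. a click and a purchase.
    return df.duplicated(subset=["UserId", "MovieId", "Rating"]).any()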

danb27 (Contributor, Author) replied on Apr 5, 2022:

@miguelgfierro The csv file with duplicates is located at https://recodatasets.z20.web.core.windows.net/sarunittest/demoUsage.csv

I tested locally after deduping with this code:

df = df.sort_values(
    "Timestamp", ascending=False
)
df = df.drop_duplicates(
    ["UserId", "MovieId", "Timestamp"],
    keep="first"
)

which is enough to pass all of the unit tests with the code I just pushed. The tests will fail for now, but once you get the deduped version up in storage, I can swap the link and everything should be ready to go.

Alternatively, I could add the above code when we load the data here: https://github.com/microsoft/recommenders/blob/main/tests/conftest.py#L139 or to the tests that use the data, but I think those options would be worse; in my mind, the test dataset should be ready to go. Do you have a preference?

miguelgfierro (Collaborator) replied:

Here you go: https://recodatasets.z20.web.core.windows.net/sarunittest/demoUsageNoDups.csv

Agree with you; it's probably better to use the data directly.

danb27 (Contributor, Author) replied:

I'm getting this error with the above link: urllib.error.HTTPError: HTTP Error 404: The requested content does not exist.

miguelgfierro (Collaborator) replied:

Sorry, I didn't set the right permissions. Now it should work

danb27 (Contributor, Author) replied:

Tests passed locally! The PR is hopefully good to go now after the latest commit.

Review thread on recommenders/models/sar/sar_singlenode.py:
Comment on lines 493 to 505
def get_user_based_topk(self, users, top_k=10, sort_top_k=True):
    """Get top K similar users to provided seed users based on similarity metric defined.
    This method will take a set of users and use them to recommend the most similar users to that set
    based on the similarity matrix fit during training.

    Args:
        users (pandas.DataFrame): DataFrame with user, item (optional), and rating (optional) columns
        top_k (int): number of top users to recommend
        sort_top_k (bool): flag to sort top k results

    Returns:
        pandas.DataFrame: similar users to the users provided in users
    """
miguelgfierro (Collaborator) commented:

One question @danb27: is the computation above of the user cooccurrence only used when we want to use the get_user_based_topk method?

danb27 (Contributor, Author) replied:

Correct. In fit(), if we set compute_user_similarity=True, user_cooccurrence will be calculated and used to create self.user_similarity, which is required to run get_user_based_topk (see the usage sketch below).

I went back and forth on whether to always compute user similarity, but decided to create the compute_user_similarity parameter so that users who do not require the added functionality keep the same performance as before the PR.

Happy to have my mind changed if you would all prefer this computation to occur every time, so that any of the methods can be called regardless of how you fit.
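A hypothetical usage sketch of the opt-in design described here (an intermediate design, reworked later in this thread; whether the flag lives on the constructor or on fit() is an assumption):

from recommenders.models.sar.sar_singlenode import SARSingleNode

model = SARSingleNode(col_user="UserId", col_item="MovieId", col_rating="Rating")
model.fit(train_df, compute_user_similarity=True)  # train_df: your interactions; opt in to user similarity
similar_users = model.get_user_based_topk(seed_users_df, top_k=10)  # seed_users_df: seed-users DataFrame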

miguelgfierro (Collaborator) replied:

What would be the change required, and the performance difference, if we did all the computation of user_cooccurrence inside the get_user_based_topk function?

danb27 (Contributor, Author) replied:

In theory, we would calculate both user_cooccurrence and self.user_similarity (they will be different unless 'cooccurrence' is the similarity type) the first time get_user_based_topk is used, by checking whether self.user_similarity is None. But there are a couple of problems with that (a self-contained sketch of this lazy pattern follows the list):

  • The first time you call the method, it would take a lot longer than subsequent calls (with a non-trivial dataset it can take some time to calculate user similarity).
  • The first call to the method would be significantly slower than calling get_item_based_topk; this might be perceived as a bug.
  • I am envisioning the use case (which is my current use case) where someone wants to store similar users for every user in their dataset, so it is necessary to call this method n_users times.
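For illustration, a self-contained sketch of the lazy pattern under discussion, outside the SAR class (cooccurrence as the similarity and a dense matrix, so suitable only for small data):

import numpy as np
from scipy import sparse

class LazyUserSimilarity:
    def __init__(self, user_item: sparse.csr_matrix):
        self.user_item = user_item   # binary users x items interaction matrix
        self._cooccurrence = None    # filled in on first use

    def topk_similar_users(self, user_idx: int, k: int = 10) -> np.ndarray:
        if self._cooccurrence is None:
            # The first call pays the full cost described above.
            self._cooccurrence = (self.user_item @ self.user_item.T).toarray()
        scores = self._cooccurrence[user_idx].astype(float)
        scores[user_idx] = -np.inf   # exclude the seed user itself
        return np.argsort(scores)[::-1][:k]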

miguelgfierro (Collaborator) replied:

> I am envisioning the use case (which is my current use case) where someone wants to store similar users for every user in their dataset, so it is necessary to call this method n_users times.

Ideally, given a list of user_ids (or a DataFrame), the system would return a DataFrame (or list of lists, whatever) with the top k most similar users for each input user_id in the list. It looks like that is not happening, right?

Some questions:

  1. How can we create this result?
  2. Right now we can input a list of user_ids; what is the meaning of the output?

danb27 (Contributor, Author) replied:

Currently, I call get_item_based_topk() like this:

for item_id in item_ids:
    temp_df = pd.DataFrame({"item_id": [item_id]}, dtype=DTYPE)
    similar_items_df = model.get_item_based_topk(temp_df, top_k=K)
    ...

With my PR, I would be able to do something very similar for get_user_based_topk().

We could implement this as a different method, because what currently happens if you give this method a list of user ids is the following:

  1. From the docstring of get_item_based_topk(): "This method will take a set of items and use them to recommend the most similar items to that set based on the similarity matrix fit during training. This allows recommendations for cold-users (unseen during training), note - the model is not updated." The same is true for the new get_user_based_topk() method: it returns the most similar users given a set of other users.

miguelgfierro (Collaborator) replied:

I think it would be valuable to have both item and user similarity: given a list of items, I would like to get a list of the most similar ones for each item, and given a list of users, a list of the most similar users for each user.

danb27 (Contributor, Author) commented Mar 15, 2022

@miguelgfierro Thanks for the initial review! I agree with all of your comments and will push a commit once we resolve some of the open questions with yourself and the other engineers.

re: needing time to review - No problem at all! I am a big fan of the package and want to make sure we get this right!

"""
if sum(df[[self.col_user, self.col_item]].duplicated()) > 0:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we have duplicates, will we be modifying artificially the coocurrence matrix? if that's the case, then we can throw an error.
FYI @anargyri

danb27 (Contributor, Author) commented Mar 21, 2022

@miguelgfierro Sorry for the back and forth, but your comments made me realize a much simpler and better way to achieve what I was trying to do. Here is a summary of what the PR does now:

  • Change how we calculate user and item frequencies so they come directly from the user's data.
  • Remove the need for the user to drop duplicates before calling .fit(). The issue was that item cooccurrences were wrong when there were duplicates, but we can instead just temporarily drop any duplicates from the user dataframe before calculating cooccurrences. This ensures that our unit tests still work AND our cooccurrences are accurate.
  • Allow someone to retrieve the most frequent users, the same way we can already retrieve the most popular items.
  • Allow someone to retrieve the k most similar users based on our user affinity matrix. This lets ratings be factored into the similar-user results and also avoids the weird situation I had created where we needed to calculate user cooccurrences in order to call one specific method. (A usage sketch of the new methods follows this list.)
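A hypothetical usage sketch of the reworked API, assuming a fitted SARSingleNode model and the column names used in the unit tests:

import pandas as pd

popular_items = model.get_popularity_based_topk(top_k=10)                 # default: items=True
frequent_users = model.get_popularity_based_topk(top_k=10, items=False)  # most frequent users

seed_users = pd.DataFrame({"UserId": [1, 2]})
similar_users = model.get_user_based_topk(seed_users, top_k=10, sort_top_k=True)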

I think these changes are a lot better and hopefully make the PR more palatable.

What do you think?

Is there any reason we should still tell users not to include duplicates? I am assuming that if our unit test dataset has duplicates, then we should also be accepting duplicates, but I don't know where this test data comes from.

anargyri (Collaborator) commented:

Hi @danb27 and thanks for your contributions! I think there is some more background info about the issues you and @miguelgfierro are discussing.

  1. About the item frequencies: there are two cases we provide for, data without timestamps (just users, items, and ratings) and data with timestamps. In the latter case, i.e. when the data scientist enables the time decay flag, we take the time decay into account when computing item frequencies; see https://github.com/microsoft/recommenders/blob/c4435a9af5836f3d472cfa44b312841a8121923c/recommenders/models/sar/sar_singlenode.py#L231 and afterwards. (A sketch of the decay idea follows.)
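A sketch of the half-life decay idea, not the exact implementation (see the linked source), assuming numeric timestamps in the same units as half_life; each event's weight is halved every half_life time units, so frequencies and affinities favor recent interactions:

import numpy as np
import pandas as pd

def decayed_weights(df: pd.DataFrame, t_now: float, half_life: float) -> pd.Series:
    # Weight of an event at Timestamp t: Rating * 0.5 ** ((t_now - t) / half_life)
    return df["Rating"] * np.power(0.5, (t_now - df["Timestamp"]) / half_life)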

Firstly, your code uses value_counts() for computing the item frequencies, which makes sense in the former case but not in the latter.

Secondly, we chose to require that the data frame has no duplicates. In the former case this means that (user, item) pairs are unique; in the latter, that (user, item, timestamp) triples are unique.

There are a few reasons for this. One is that having duplicates in the DF is almost certainly a sign of bad practice during data preparation. Another is that it is more efficient to incur the cost of aggregation / deduplication just once (during data prep) and not every time the SAR code is run. Yet another is flexibility in defining the ratings (also called weights): in the case of duplicates, how these should be aggregated for each user-item pair should be left to the discretion of the data scientist, since there is no single correct way of doing this -- it depends on the recommendation scenario.
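For example, one of several defensible aggregations during data prep, with illustrative column names (summing vs. taking the max or the most recent rating is exactly the scenario-dependent choice described above):

import pandas as pd

# Collapse duplicate (user, item) pairs by summing their ratings.
df = df.groupby(["UserId", "MovieId"], as_index=False)["Rating"].sum()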

So there is a trade-off between people missing the requirement in the doc string vs. computational overhead for aggregation / checking for duplicates vs. flexibility. I think we can live with the former, but probably we should emphasize the requirement more, especially in the notebooks.

  2. I am curious, is there a recommendations use case for this feature? Similar items is a very common scenario that can be found on many e-commerce websites, but I have never seen a scenario with similar users. Have you encountered this in practice? Also bear in mind that in almost all recommendation scenarios the users are anonymized, and there should not be any user information leaking to other users.

  3. Similarly to the above, why would users want to see "popular users", and is this allowed anyway?

danb27 (Contributor, Author) commented Mar 22, 2022

@anargyri Thank you for your clarification on point 1. This is very helpful. I will think about if I can adapt my code or if I should just change those parts back to how they were before.

  2. Of course this could be misused if you are leaking information to other users, but I believe there are legitimate use cases too. For example, you may want to make sure that your model has some overlap in the recommendations for similar users, but that similar users are not getting exactly the same recommendations. Please note that this sort of functionality is already featured within recommenders for LightFM; see the LightFM notebook and the code.

  3. You might want to look at how your model performs for users with a lot of data points vs. very few, to understand your model's biases and at what number of data points you know enough about a user to make good recommendations.

danb27 (Contributor, Author) commented Mar 24, 2022

So, given Andreas' comment on the frequencies, I went back to the old way of calculating item_frequencies (without using value_counts) and also rewrote the user_frequencies calculation to not use the dataframe directly (this variable is only needed if the answer to question 2 below is yes).

I think there are two remaining questions:

  1. do we want to have some sort of warning / error if a data scientist does provide duplicates?
  2. are we okay with providing these methods for grabbing the most frequent users and grabbing similar users, similar to the functionality provided in the LightFM code? Of course these could be misused, but I think there are legitimate uses for them as well.

If the answer to both is no, then we can probably just close this PR.

@miguelgfierro @anargyri Please let me know how to proceed when you get a chance.

miguelgfierro (Collaborator) commented:

@danb27 sorry for not answering earlier. Answering your questions:

> do we want to have some sort of warning / error if a data scientist does provide duplicates?

I think we should throw an error if there are duplicates; see the details of faster implementations in this comment.

> are we okay with providing these methods for grabbing the most frequent users and grabbing similar users, similar to the functionality provided in the LightFM code? Of course these could be misused, but I think there are legitimate uses for them as well.

I think this is ok.

Any other perspective @anargyri?

miguelgfierro (Collaborator) left a comment:

I think the PR is good to go; one minor addition for Dan.

@anargyri is there anything else?

anargyri (Collaborator) left a comment:

See the comment above about avoiding the dense matrix.

@miguelgfierro miguelgfierro merged commit 4106333 into recommenders-team:staging Apr 11, 2022