User similarity #1329

mayhem · 2021-03-09T11:19:22Z

This PR contains the user similarity feature that lucifer and I have been working on, building on Param's earlier work.


admin/sql/create_foreign_keys.sql

paramsingh · 2021-03-10T13:42:33Z

listenbrainz_spark/query_map.py

@@ -35,6 +36,7 @@
    'cf.recommendations.recording.recommendations': listenbrainz_spark.recommendations.recording.recommend.main,
    'import.mapping': listenbrainz_spark.request_consumer.jobs.import_dump.import_mapping_to_hdfs,
    'import.artist_relation': listenbrainz_spark.request_consumer.jobs.import_dump.import_artist_relation_to_hdfs,
+    'similarity.similar_users': listenbrainz_spark.recommendations.recording.user_similarity.main


Why is user_similarity in recommendations.recording? It should live outside, in a separate user_similarity package, imo.

I think it's because we're still using recording data to calculate this... But I'll let lucifer chime in.

Yes, I intend to move user_similarity to a different package. I initially put it there to allow easy reuse of code as the implementation was not final or tested. Now that we have a working implementation, I'll do the cleanup (add comments, type annotations, refactor code) and add tests.
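For context on why the move is cheap: in a query map like the one in the diff above, the query name is only a dictionary key, so relocating the handler to another package changes the value and nothing else. A minimal sketch, assuming hypothetical stand-in handlers and a hypothetical `dispatch` helper (neither is the actual listenbrainz_spark code):

```python
from typing import Callable, Dict, List


# Hypothetical handlers standing in for the real Spark job entry points.
def recommend_main(**params) -> List[dict]:
    return [{"type": "recommendations", "params": params}]


def similar_users_main(**params) -> List[dict]:
    return [{"type": "similar_users", "params": params}]


# Query name -> handler. Moving a handler to a new package only changes
# the function referenced here, not the query name that callers send.
QUERY_MAP: Dict[str, Callable[..., List[dict]]] = {
    "cf.recommendations.recording.recommendations": recommend_main,
    "similarity.similar_users": similar_users_main,
}


def dispatch(query: str, **params) -> List[dict]:
    """Look up the handler registered for `query` and call it with `params`."""
    try:
        handler = QUERY_MAP[query]
    except KeyError:
        raise ValueError(f"unknown query: {query!r}") from None
    return handler(**params)
```

Calling `dispatch("similarity.similar_users", max_num_users=25)` would route to `similar_users_main` regardless of which module defines it.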

paramsingh · 2021-03-10T13:43:28Z

listenbrainz_spark/recommendations/recording/create_dataframes.py

-
-
-def save_dataframe_metadata_to_hdfs(metadata):
+def save_dataframe_metadata_to_hdfs(metadata, df_metadata_path):


is it possible to add type annotations to this function?


listenbrainz/db/stats.py

paramsingh · 2021-03-10T17:47:31Z

listenbrainz_spark/recommendations/recording/user_similarity.py

+    }
+
+
+def threshold_similar_users(matrix, threshold):


types and docstrings on these methods would really help readability.
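As an illustration of what is being asked for, here is a typed and documented version of a function with the same signature. The body of the real `threshold_similar_users` is not shown in this thread, so the semantics below (a square similarity matrix in, a per-user list of sufficiently similar users out) are an assumption, written in plain Python rather than Spark:

```python
from typing import Dict, List, Sequence, Tuple


def threshold_similar_users(
    matrix: Sequence[Sequence[float]], threshold: float
) -> Dict[int, List[Tuple[int, float]]]:
    """Keep, for each user, the other users whose similarity meets the threshold.

    Args:
        matrix: square matrix where matrix[i][j] is the similarity of users i and j.
        threshold: minimum similarity for a pair to be kept.

    Returns:
        Mapping from user index to (other user index, similarity) pairs,
        sorted by descending similarity. Users with no match are omitted.
    """
    similar: Dict[int, List[Tuple[int, float]]] = {}
    for i, row in enumerate(matrix):
        # Skip the diagonal: a user is trivially similar to themselves.
        pairs = [(j, s) for j, s in enumerate(row) if j != i and s >= threshold]
        if pairs:
            similar[i] = sorted(pairs, key=lambda p: p[1], reverse=True)
    return similar
```

With annotations and a docstring like these, a reader can tell the input shape and the return type without tracing the callers.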


paramsingh · 2021-03-10T17:49:57Z

listenbrainz_spark/recommendations/recording/user_similarity.py

+        current_app.logger.error(str(err), exc_info=True)
+        raise
+
+    tuple_mapped_rdd = playcounts_df.rdd.map(lambda x: MatrixEntry(x["recording_id"], x["user_id"], x["count"]))


Style thing again, but comments on this and the next few lines would greatly enhance readability, in my opinion. It's not obvious at the outset what a coordinate matrix, an indexed row matrix, or vectors_mapped_rdd is.

I have added some description in the docstring.
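For readers unfamiliar with the terms: a coordinate matrix is a sparse matrix stored as (row, column, value) entries, which is what the `MatrixEntry(recording_id, user_id, count)` line above builds, and an indexed row matrix stores one indexed vector per row. The plain-Python sketch below mimics that flow without Spark. It uses cosine similarity between user columns as an assumed stand-in; the similarity measure the PR actually uses is not shown in this excerpt, and all function names here are hypothetical.

```python
from collections import defaultdict
from math import sqrt
from typing import Dict, Iterable, Tuple

# (recording_id, user_id, count): same field order as the MatrixEntry call.
Entry = Tuple[int, int, float]
SparseVector = Dict[int, float]


def user_vectors(entries: Iterable[Entry]) -> Dict[int, SparseVector]:
    """Group coordinate entries into one sparse play-count vector per user."""
    vectors: Dict[int, SparseVector] = defaultdict(dict)
    for recording_id, user_id, count in entries:
        vectors[user_id][recording_id] = count
    return dict(vectors)


def cosine(a: SparseVector, b: SparseVector) -> float:
    """Cosine similarity of two sparse vectors (0.0 if either is all zeros)."""
    dot = sum(value * b[key] for key, value in a.items() if key in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def user_similarities(entries: Iterable[Entry]) -> Dict[Tuple[int, int], float]:
    """Similarity for every unordered pair of users, keyed by (user_a, user_b)."""
    vectors = user_vectors(entries)
    users = sorted(vectors)
    return {
        (u, v): cosine(vectors[u], vectors[v])
        for i, u in enumerate(users)
        for v in users[i + 1:]
    }
```

Two users who listened to the same recordings with the same counts score close to 1, and users with no recordings in common score 0.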


paramsingh

looks good to me, thanks for making those changes!

paramsingh and others added 30 commits July 25, 2020 13:40

Cleanup stats module

04ed04e

Proof of concept of user similarity based on collaborative filtering

ff6de6c

Bring back an index I removed by mistake

6576a9a

Revert "Cleanup stats module"

0a69e8a

This reverts commit 7adb05b. Reverting because it isn't relevant to this PR

put similar users table in the recommendation schema

c12d54c

Use ujson

9cdc1e2

Harden against prod data and print shape of csr_matrix for debugging

c1d8e5c

Fix table name in get query

dce4508

Fix conflict

e1bc13e

Merge branch 'master' into user-similarity to bring it up to date.

a98bffb

Remove the stats subdir since spark will do the calculation.

f4be9a6

Peppy

76a62ea

Merge branch 'master' into user-similarity

971fc89

Make train_model_window parameter compulsory

7246fa6

We default to None but do not handle it anywhere. None will cause an error while calculating the from_date offset which is expecting an int.
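The failure mode described in this commit message can be sketched as follows. `get_from_date` is a hypothetical helper, not the function in the PR; it assumes the window is a number of days subtracted from an end date. Making the parameter required means a missing value fails at the call site instead of deep inside the date arithmetic.

```python
from datetime import datetime, timedelta


def get_from_date(to_date: datetime, train_model_window: int) -> datetime:
    """Return the start of the training window.

    train_model_window has no default: timedelta(days=None) raises a
    TypeError, so a None default would only defer the error to this line.
    """
    if not isinstance(train_model_window, int):
        raise TypeError("train_model_window must be an int number of days")
    return to_date - timedelta(days=train_model_window)
```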

Refactor create_dataframes.py to allow reuse in other jobs

e0e9b3d

Fix tests after refactoring

a63f80d

Add request for user similarity and fix up dataframes test

8c213f0

Merge branch 'user-similarity' of github.com:metabrainz/listenbrainz-server into user-similarity

1b343b5

Remove the old code to write user similarity and add skeleton similarity data handler

cbfa3d4

Add initial implementation of user similarity job

b1e025a

Interim check-in

c94b556

Merge branch 'user-similarity' of github.com:metabrainz/listenbrainz-server into user-similarity

944da6b

First cut at data importing code

fa11b68

Update integration with spark request_consumer

0db0d09

Change job_type name to similar_users

447d87e

Fix missing comma, dump sanity check and correct spark request job type names

0304f28

Merge branch 'user-similarity' of github.com:metabrainz/listenbrainz-server into user-similarity

1207ae6

Rename the dataframes similar_users job type. Fixes to the similar user data import.

5723fe8

Try to fix the spark user similarity entry point

072475d

Users should be columns not rows

772d810

mayhem and others added 13 commits March 10, 2021 13:05

Undo the spark-reader startup for dev. Not needed.

f57c4cb

Add similar users to cron. Add docstrings for some functions.

810e1ce

Specify number of days in cron job

e7af4d1

Fix up the endpoints.

5bc21d0

Merge branch 'master' into user-similarity

a4560f6

Improve error handling

4eef3ea

Add docstrings and fix up response of similar-to endpoint to be consistent with the similar-users endpoint

7f6ffa7

Make user_name consistent

17b0244

Add test for get_similar_users

379fd95

Add tests for two endpoints. Set dataframe duration to 2 years.

da8776d

Merge branch 'user-similarity' of github.com:metabrainz/listenbrainz-server into user-similarity

d801157

pepmeister

f3f12e8

Add test for get_user_by_id

277f575

paramsingh reviewed Mar 10, 2021


mayhem and others added 8 commits March 10, 2021 19:19

Update admin/sql/create_foreign_keys.sql

ac875da

Co-authored-by: Param Singh <iliekcomputers@gmail.com>

Add some comments to the thresholding function, pep8

2d2f851

Add typing hints and comments to user_similarity.py

5541df5

Document spark bug

40a2a20

Move user_similarity.py to its own package

838188c

Delete unused method

fc0baf3

Rename parameter

febc928

Add typing hints

a0ad6f5

amCap1712 force-pushed the user-similarity branch from 6caffa2 to a0ad6f5 Compare March 11, 2021 14:08

amCap1712 marked this pull request as ready for review March 11, 2021 14:09

mayhem requested a review from paramsingh March 11, 2021 14:09

amCap1712 requested review from vansika and alastair March 11, 2021 14:09

paramsingh approved these changes Mar 11, 2021


mayhem merged commit 3cf1d79 into master Mar 11, 2021

mayhem deleted the user-similarity branch March 11, 2021 15:44
