Better genre detection and track recommendation #ML #data #83

Closed
adrienjoly opened this issue Aug 11, 2017 · 15 comments

Comments

@adrienjoly
Member

adrienjoly commented Aug 11, 2017

For music lovers, discovering new music is essential.

Spotify is well known for the quality of their "Discover Weekly" playlist, containing a personalised selection of tracks based on your listening history.

On Openwhyd, current ways to discover music are:

  1. listening to your stream, after having followed users with similar musical taste;
  2. or listening to hot tracks, which are classified by genres.

The first way is purely relying on humans and luck.

The second way relies on a list of 16 genres (a quite limited and vague selection of genres), in which popular tracks are classified, based on the names of the playlists that hold them. This kinda works but it's far from perfect. For example, we had to create a hard-coded rule to prevent Daft Punk songs from being recognised as Punk Rock music!
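To illustrate the kind of rule described above, here is a minimal sketch of keyword-based genre detection on playlist names, with an exception list for artist names that contain a genre keyword. It is not the actual plTags.js logic: the `GENRE_KEYWORDS` table, the `FALSE_POSITIVES` list and the `detect_genres` function are all hypothetical names chosen for this example.

```python
import re

# Hypothetical keyword table; the real list of 16 genres is not reproduced here.
GENRE_KEYWORDS = {
    "punk": "Punk Rock",
    "jazz": "Jazz",
    "techno": "Electro",
}

# Phrases that contain a genre keyword but must not count as one.
FALSE_POSITIVES = ["daft punk"]


def detect_genres(playlist_name):
    """Return the set of genres whose keyword appears in a playlist name."""
    text = playlist_name.lower()
    # Blank out known false positives before looking for keywords.
    for phrase in FALSE_POSITIVES:
        text = text.replace(phrase, " ")
    words = set(re.findall(r"[a-z]+", text))
    return {genre for keyword, genre in GENRE_KEYWORDS.items() if keyword in words}
```

Every new false positive needs its own entry in the exception list, which is why this approach does not scale well.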

To help users discover new music by finding relevant people to follow, we had also experimented with showing a measure of profile similarity, but it was only based on the number of artists that had been added by both users.
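As a rough sketch of that measure, and of one possible alternative: the first function counts shared artists, as described above; the second normalises the count by the size of both libraries (Jaccard index), so that users with huge libraries do not look similar to everyone. The Jaccard variant is an assumption made for this example, not something Openwhyd implemented, and both function names are hypothetical.

```python
def shared_artist_count(artists_a, artists_b):
    """Number of artists that both users have added."""
    return len(set(artists_a) & set(artists_b))


def jaccard_similarity(artists_a, artists_b):
    """Shared artists divided by all distinct artists of both users (0.0 to 1.0)."""
    a, b = set(artists_a), set(artists_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0
```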

=> Anyone interested in exploring new ways to discover music on Openwhyd?

@adrienjoly
Member Author

@Marinlemaignan

This is a rather exciting feature to add!
I was thinking that, maybe, we could replace plTags.js with a DB-backed service that would first ask the Discogs API for a track/album/artist's metadata, and then store it for later use, instead of the genres being hard-coded as they are now. That way, we could build something that is able to evolve bit by bit. MongoDB would also be a good fit for this kind of job.
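A minimal sketch of that "look up once, then store" idea could look like the following. The `fetch` callable stands in for a real Discogs API call and the `store` dict stands in for a MongoDB collection; both, as well as the `MetadataCache` name, are assumptions made for this example.

```python
class MetadataCache:
    """Read-through cache: ask the remote source only for keys not stored yet."""

    def __init__(self, fetch, store=None):
        self._fetch = fetch  # hypothetical callable: key -> metadata dict
        self._store = {} if store is None else store  # stand-in for a database

    def get(self, key):
        if key not in self._store:
            # First request for this key: fetch it and keep the result.
            self._store[key] = self._fetch(key)
        return self._store[key]
```

With this shape, each track, album or artist costs at most one remote request, and the stored metadata can be corrected or enriched later without touching the code.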

@adrienjoly
Member Author

Hi @Marinlemaignan !

I'd be happy to replace plTags.js when we have a fully-functional solution that is better than the current one, while maintaining:

  • user satisfaction on the hot tracks page (which is also openwhyd's landing page)
  • and relevance of the users that are recommended based on selected genres, during user onboarding (also fed by plTags)

One way we could transition gently to a new system:

  • the new system (e.g. based on discogs) is developed outside of openwhyd's repo,
  • inspired by the use cases provided above, automated tests are written to ensure that the new system works as expected (or better),
  • when hosted online, the new system's database could be populated at the same time as plTags's system, by plugging in a web-hook (or something like that).

What do you think?
Are you interested in working on this?

@adrienjoly
Member Author

/cc @florentpietot

@SkinyMonkey
Contributor

I have experimented extensively with the Discogs API.

It's very complete and extremely promising, but... the rate limit is 60 requests per minute.

https://www.discogs.com/developers/#page:home,header:home-rate-limiting

There is no way to go around this. A partnership would be the only solution and I doubt that they would be attracted by a partnership that does not bring them anything.
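To give an idea of what staying under such a limit looks like on the client side, here is a minimal sliding-window limiter. It is a sketch, not Discogs client code: the `RateLimiter` class is hypothetical, and the `clock` and `sleep` parameters only exist so that it can be tested without waiting.

```python
import time
from collections import deque


class RateLimiter:
    """Allow at most max_calls calls per period seconds (sliding window)."""

    def __init__(self, max_calls=60, period=60.0, clock=time.monotonic, sleep=time.sleep):
        self.max_calls = max_calls
        self.period = period
        self._clock = clock
        self._sleep = sleep
        self._calls = deque()  # timestamps of the most recent calls

    def wait(self):
        """Block until another call is allowed, then record it."""
        now = self._clock()
        # Forget calls that have left the window.
        while self._calls and now - self._calls[0] >= self.period:
            self._calls.popleft()
        if len(self._calls) >= self.max_calls:
            # Sleep until the oldest recorded call leaves the window.
            self._sleep(self.period - (now - self._calls[0]))
            now = self._clock()
            self._calls.popleft()
        self._calls.append(now)
```

Calling `wait()` before each request keeps the client within the limit. As an illustration of the cost, 100,000 lookups at 60 requests per minute would take roughly 28 hours.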

What we could do is identify albums and link to their product/sale pages. They would probably not turn down such a big showcase as Openwhyd.

Another solution is to host their database ourselves. There are Docker images that download their monthly dump and index it in MongoDB.
A mongo-connector to an Elasticsearch database then gives us the extra performance needed for efficient lookups.
I tried it, and it worked well.

But even then, a few other problems arise:

  • YouTube/SoundCloud track names are not always the right ones (misspelled, etc.);
  • identifying the right album on Discogs is sometimes difficult; I spent a lot of time on this;
  • the album images are not available in the dump; it would be cool to display them instead of the YouTube artwork, for example;
  • the links to YouTube videos associated with the album:
    yes... yes, they have that, and it would be an amazing feature!
    Imagine: you post a track and bam! You get other tracks from the same album.
    But they are not available in the dump.
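For the first two problems, one possible starting point is fuzzy string matching between a cleaned-up video title and candidate titles from the dump. The sketch below uses `difflib` from Python's standard library; the `normalise` and `best_match` helpers and the 0.6 cutoff are assumptions made for this example, not what I actually used.

```python
import difflib
import re


def normalise(title):
    """Lower-case a title and drop bracketed noise such as '(Official Video)'."""
    title = re.sub(r"[\(\[].*?[\)\]]", " ", title.lower())
    # Keep ASCII letters and digits only; accented characters would need extra care.
    return " ".join(re.findall(r"[a-z0-9]+", title))


def best_match(video_title, candidates, cutoff=0.6):
    """Return the candidate closest to video_title, or None if none is close enough."""
    by_normalised = {normalise(c): c for c in candidates}
    hits = difflib.get_close_matches(
        normalise(video_title), list(by_normalised), n=1, cutoff=cutoff
    )
    return by_normalised[hits[0]] if hits else None
```

This only handles small spelling differences; picking the right release among several editions of the same album would still need extra signals.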

A solution that I studied would be scraping ... but they wouldn't like it and what a dirty solution.

I'm not saying it's impossible, just that it's not a bulletproof approach.

@adrienjoly
Member Author

Thanks for sharing your ideas and notes with us, @SkinyMonkey !

@adrienjoly
Member Author

WIP:

Florent Piétot is currently analysing Openwhyd's data set, and thinking of ways to leverage it (e.g. use clustering and/or machine learning techniques for better genre detection and music recommendation).

@adrienjoly
Member Author

adrienjoly commented Nov 3, 2018

During a "Hackergarten" meetup in Paris, Mihangy, Damien and I wrote a Python script that turns playlog.json.log into an anonymised CSV file in which each line associates a user (identified by a number) with the ID of a YouTube track that the user listened to.

👉 2c095d2

The goal was to provide a starting point for the development of a music recommendation algorithm based on Openwhyd's playback logs, while preserving the privacy of its users. (i.e. data anonymisation)
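For readers who want a feel for the approach without opening the commit, here is a minimal sketch of the idea. It is not the script from the commit above: it assumes one JSON object per log line, and the field names `user` and `track` are hypothetical placeholders for whatever the real log uses.

```python
import csv
import io
import json


def anonymise(log_lines, out_file, user_field="user", track_field="track"):
    """Write one (anonymous user number, track id) CSV row per valid log line."""
    user_numbers = {}  # real user id -> sequential number, kept in memory only
    writer = csv.writer(out_file)
    for line in log_lines:
        try:
            entry = json.loads(line)
            user, track = entry[user_field], entry[track_field]
            # The same user always gets the same number within one run.
            number = user_numbers.setdefault(user, len(user_numbers))
        except (ValueError, KeyError, TypeError):
            continue  # skip malformed or incomplete entries
        writer.writerow([number, track])
    return len(user_numbers)


# Tiny demo on made-up data.
demo_out = io.StringIO()
demo_users = anonymise(
    ['{"user": "alice", "track": "yt1"}', '{"user": "bob", "track": "yt1"}'],
    demo_out,
)
```

Replacing identifiers with numbers hides who the users are at first glance, but a long listening history can still be distinctive, so this should be seen as a first step rather than a privacy guarantee.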

Next steps:

  • cluster similar songs, e.g. groups of tracks that were listened to by more than one user
  • build a mini website that would recommend tracks, based on a user-given youtube video
  • build a mini website that would recommend tracks and openwhyd users to subscribe to, based on the user's openwhyd profile
  • integrate it as a discovery feature on openwhyd.org
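As a sketch of the first two steps, the anonymised rows can be turned into a simple "people who played this also played" ranking. This is a co-occurrence count, a deliberately naive stand-in for real clustering, and the function names are hypothetical.

```python
from collections import Counter, defaultdict


def build_listeners(rows):
    """Map each track id to the set of anonymous users who played it."""
    listeners = defaultdict(set)
    for user, track in rows:
        listeners[track].add(user)
    return listeners


def recommend(seed_track, rows, top_n=5):
    """Rank other tracks by how many listeners they share with seed_track."""
    listeners = build_listeners(rows)
    seed_users = listeners.get(seed_track, set())
    shared = Counter()
    for track, users in listeners.items():
        if track != seed_track:
            overlap = len(users & seed_users)
            if overlap:
                shared[track] = overlap
    return [track for track, _ in shared.most_common(top_n)]
```

Here `rows` is an iterable of `(user number, track id)` pairs, i.e. the lines of the anonymised CSV file. Very popular tracks will dominate such a ranking, so some normalisation would be needed before using it for real.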

@adrienjoly
Member Author

The data science cheatsheets provided on this repo may help :-) https://github.com/FavioVazquez/ds-cheatsheets

@adrienjoly
Member Author

This also may help: https://github.com/trekhleb/homemade-machine-learning (examples of machine learning techniques in Python, based on Andrew Ng's MOOC)

@adrienjoly
Member Author

During a Hackergarten meetup, Sébastien Treguer suggested the following next step:

As a first easy step, implement collaborative filtering for recommendation, with a matrix of users (in rows) and content, like videos (in columns).

http://surpriselib.com/
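To make the suggestion concrete, here is a minimal item-based collaborative filtering sketch in plain Python. It does not use the Surprise library linked above; it only illustrates the users × videos matrix, with binary "played / not played" values and cosine similarity between video columns as assumptions made for this example.

```python
import math


def build_matrix(plays):
    """Return (users, videos, matrix), where matrix[i][j] is 1 if user i played video j."""
    users = sorted({u for u, _ in plays})
    videos = sorted({v for _, v in plays})
    row = {u: i for i, u in enumerate(users)}
    col = {v: j for j, v in enumerate(videos)}
    matrix = [[0] * len(videos) for _ in users]
    for u, v in plays:
        matrix[row[u]][col[v]] = 1
    return users, videos, matrix


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def recommend_for(user, plays, top_n=3):
    """Score each unseen video by its similarity to the videos the user played."""
    users, videos, matrix = build_matrix(plays)
    columns = list(zip(*matrix))  # one tuple of 0/1 values per video
    seen = matrix[users.index(user)]  # raises ValueError for an unknown user
    scores = {}
    for j, video in enumerate(videos):
        if seen[j]:
            continue
        score = sum(cosine(columns[j], columns[k]) for k in range(len(videos)) if seen[k])
        if score > 0:
            scores[video] = score
    # Highest score first; ties broken by video id for a stable order.
    return sorted(scores, key=lambda v: (-scores[v], v))[:top_n]
```

A dense matrix like this only works for small samples; on the full data set, a sparse representation or a dedicated library would be needed.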

@adrienjoly
Member Author

adrienjoly commented Mar 2, 2019

For reference: Mihangy is experimenting with Jupyter Notebooks and SurpriseLib. He opened a google group to discuss data analysis tasks on openwhyd's data using those tools.

Aidan O'Donnell and Patrick Allain also showed interest in these initiatives, during this week's Hackergarten.

adrienjoly pushed a commit that referenced this issue Apr 3, 2019
# [1.5.0](v1.4.9...v1.5.0) (2019-04-03)

### Features

* add timestamp to anonymised playlog entries ([45caf45](45caf45)), closes [#83](#83)
* can anonymise playlog with timestamps or ObjectIDs ([461623b](461623b)), closes [#83](#83)

@adrienjoly
Member Author

adrienjoly commented Apr 3, 2019

For reference, I published a 700MB history/playlog file in https://github.com/openwhyd/openwhyd-data

At some point, it may be worth picking a license and publishing the data on open data listings like awesomedata/awesome-public-datasets. Suggestions are welcome!

@adrienjoly
Member Author

This list of best practices could help: https://github.com/microsoft/recommenders

@adrienjoly
Member Author

Music genre detection and genre-based streams were removed in #399. => Closing.


3 participants