<h1>RYM for Discogs</h1>

As an avid music lover and record collector, I use both <a href="https://www.rateyourmusic.com/">RateYourMusic</a> (RYM) and <a href="https://www.discogs.com/">Discogs</a>. The two services overlap in a few ways: both let you keep track of the music you own, both let you catalog exactly which physical (or sometimes digital) copy of a musical work you have, and both let you give it a rating. However, in my opinion (and that of many other people), Discogs has proven to be much more effective for cataloging a collection, and RateYourMusic is much more effective for ratings.

I keep track of my catalog of over 700 records (as of May 2020) on Discogs. If I want to see their RYM ratings, I have to look them up on the RYM website one by one, which takes a lot of time. This project attempts to take the collection of a specified user from Discogs and gather the corresponding data from RYM.

<h1>First steps: Python imports</h1>

We will need a few libraries, which must be installed with pip if you would like to run the code on your own computer. The commands for installing these libraries are:

pip install pandas<br>
pip install beautifulsoup4<br>
pip install google-api-python-client<br>
pip install fuzzywuzzy<br>
pip install python-Levenshtein (for fuzzywuzzy)

In [1]:
import json
import time
import urllib.request
import pandas as pd
from bs4 import BeautifulSoup
from googleapiclient.discovery import build
from fuzzywuzzy import fuzz

We are fortunate that Discogs has an API, and furthermore that the API is completely free and does not require an API key (at least for the data that we want to get). The first thing we should do is retrieve the user's collection and store it as a dataframe.

In [2]:
username = input('Enter the username to get that user\'s collection: ')

# The number of records to get per request
# 500 is the maximum
per_page = 500

columns = ['id', 'master_id', 'year', 'artists', 'title',
           'labels', 'styles', 'master_url', 'formats',
           'rym_url', 'rating_value', 'rating_count']
# Collect one dict per release and build the dataframe once at the end
# (DataFrame.append was removed in pandas 2.0)
rows = []

try:
    collection = json.loads(urllib.request.urlopen(
        'https://api.discogs.com/users/' + username +
        '/collection/folders/0/releases?per_page=' + str(per_page)).read())

    while True:
        print('Getting page ' + str(collection['pagination']['page']) + ' of results')
        for release in collection['releases']:
            info = release['basic_information']

            # Start with a fresh dict for every release so that no
            # values carry over from the previous one
            data_row = {}
            data_row['id'] = info['id']
            data_row['master_id'] = info['master_id']
            # .get() leaves the year empty if the field is missing
            data_row['year'] = info.get('year')
            data_row['artists'] = [(artist['name'], artist['resource_url'])
                                   for artist in info['artists']]
            data_row['title'] = info['title']
            data_row['labels'] = [(label['catno'], label['name'], label['resource_url'])
                                  for label in info['labels']]
            data_row['styles'] = info['styles']
            data_row['master_url'] = info['master_url']
            data_row['formats'] = []
            for f in info['formats']:
                if 'descriptions' in f:
                    data_row['formats'] += f['descriptions']
            # The RYM fields are filled in later
            data_row['rym_url'] = None
            data_row['rating_value'] = None
            data_row['rating_count'] = None
            rows.append(data_row)

        # If this was the last page then end the loop
        if collection['pagination']['page'] == collection['pagination']['pages']:
            break
        # Very important to be responsible with the privilege of using
        # a web API
        time.sleep(2)
        # Get the next page
        next_page = collection['pagination']['urls']['next']
        collection = json.loads(urllib.request.urlopen(next_page).read())
    print('Done!')
except urllib.error.HTTPError:
    print('The user was not found!!! Try again with a different username.')

df_discogs = pd.DataFrame(rows, columns = columns)

Enter the username to get that user's collection: BethesdaMD
Getting page 1 of results
Getting page 2 of results
Done!
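
The request URL in the cell above is built by string concatenation, which works for simple usernames. As a minimal sketch (the helper name build_collection_url is hypothetical and not part of the notebook), the same URL can be assembled with urllib.parse so that unusual characters in a username are escaped:

```python
from urllib.parse import quote, urlencode

def build_collection_url(username, per_page=500, folder_id=0):
    """Return the Discogs collection URL for a user (hypothetical helper)."""
    # quote() escapes characters that are not safe inside a URL path segment
    path = ('https://api.discogs.com/users/' + quote(username, safe='') +
            '/collection/folders/' + str(folder_id) + '/releases')
    # urlencode() builds the query string from a dict
    return path + '?' + urlencode({'per_page': per_page})

print(build_collection_url('BethesdaMD'))
# https://api.discogs.com/users/BethesdaMD/collection/folders/0/releases?per_page=500
```

The result can be passed to urllib.request.urlopen() exactly like the concatenated string.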


Unfortunately, RateYourMusic does not have an API, and running a GET request against any part of their website from Python will result in an instant ban. Rather than try to circumvent this and violate their Terms of Service, we will limit ourselves to a workaround that doesn't: the Google Custom Search API. We will make a search query that returns the relevant page on RYM and, fortunately, the data we need too (which may be slightly outdated, because it comes from the last indexing of the page).

First we will need a Google API key to make this work:

In [None]:
gapi_key = input('Please enter your Google API key: ')
service = build("customsearch", "v1", developerKey = gapi_key)

We need a query that gives us the results we want. To construct such a query, we have to include the artists' names and the name of the album or musical work. But we also have to be aware of one other thing: the name of an album can also be the name of a single. To tell whether we are looking for the single or the album, we use the format descriptions we retrieved from the Discogs collection.

This program uses four different search engines set up on the Google Custom Search API. When we want the album version, we will search rateyourmusic.com/release/album/ \*, when we want the single version we will search rateyourmusic.com/release/single/ \*, for the EP version rateyourmusic.com/release/ep/ \*, and for a compilation album rateyourmusic.com/release/comp/ \*. We also need to be aware that Discogs doesn't <u>always</u> label releases with 'album' and 'single'. We can still test whether those terms are present in the format list, but in addition a single can be represented as '7"' or '45 RPM' in the array, and an album can be represented as 'LP' (however, it might also be a compilation, in which case 'Compilation' will be present). An EP is usually a '7"' record as well; however, we will assume that an EP always has the 'EP' tag.

Note that the order in which we check these things matters; for example, if a release is an EP but we first check whether it's a single, the check may see the '7"' tag and treat the release as if it were a single. We can do two things to combat this. We can either check whether it's an album first, then an EP, then a single, or we can check in any order we want and, when we see a '7"' tag, make sure 'EP' is not present before treating the release as a single. We do the latter in this code, to make it easier to maintain.
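
The checks described above can be sketched as a small standalone function. The name classify_release and the returned labels (taken from the RYM URL segments album, comp, single, ep) are illustrative assumptions; the conditions themselves mirror the ones used in this notebook:

```python
ALBUM_FORMATS = ['LP', 'Album']
SINGLE_FORMATS = ['Single', '7"', '45 RPM']

def classify_release(formats):
    """Map a list of Discogs format descriptions to a release type.

    Returns 'album', 'comp', 'single' or 'ep', or None if no tag matches.
    (Hypothetical helper, for illustration only.)
    """
    if any(f in formats for f in ALBUM_FORMATS) and 'Compilation' not in formats:
        return 'album'
    if 'Compilation' in formats:
        return 'comp'
    # A '7"' record only counts as a single when the 'EP' tag is absent
    if any(f in formats for f in SINGLE_FORMATS) and 'EP' not in formats:
        return 'single'
    if 'EP' in formats:
        return 'ep'
    return None

print(classify_release(['7"', '45 RPM', 'EP']))      # ep
print(classify_release(['7"', '45 RPM', 'Single']))  # single
print(classify_release(['LP', 'Compilation']))       # comp
```

Because the single check excludes 'EP' explicitly, a 7" EP is never mistaken for a single, whichever of the two checks comes first.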

Every time we do the Google search, we look at the results on the first page. It is very possible that a search result is not what we are looking for. In some cases (when the entry just doesn't exist on RYM), we might get an album with an almost completely different name, but in other cases (most notably the case of Led Zeppelin) we might get an album with an almost identical name that isn't what we are looking for. This is because many of the Led Zeppelin albums have almost the same name (e.g. Led Zeppelin, Led Zeppelin II, Led Zeppelin III).

In the code below, we call fuzz.ratio() on the two names (the one from Discogs and the one from RYM) to see how similar they are before we modify the dataframe. This poses an interesting problem: how strict do we want to be? If we are willing to tolerate some false positives for the sake of comprehensiveness, we can set the required similarity below 100 and go through manually to clean them out later. If we want no false positives, we set the required similarity of the two album names to 100 (which we have decided to do for now), at the expense of some true positives. In theory, a band could have two different albums with the same name, but seeing as this is extremely unlikely, we can ignore this case.
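
To illustrate the threshold trade-off, here is a minimal sketch that uses difflib from the standard library as a stand-in for fuzz.ratio(), so that it runs without extra packages. The exact scores may differ slightly from those of fuzzywuzzy, and the helper names title_similarity and is_match are hypothetical:

```python
from difflib import SequenceMatcher

def title_similarity(a, b):
    """Return a case-insensitive similarity score from 0 to 100.

    Stand-in for fuzz.ratio(); not the fuzzywuzzy implementation itself.
    """
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())

def is_match(discogs_title, rym_title, threshold=100):
    """Accept the RYM title only if it is similar enough to the Discogs one."""
    return title_similarity(discogs_title, rym_title) >= threshold

# With the strict threshold only identical titles (ignoring case) are accepted
print(is_match('Led Zeppelin', 'led zeppelin'))                   # True
print(is_match('Led Zeppelin', 'Led Zeppelin II'))                # False
# A looser threshold lets the near-identical title through (a false positive)
print(is_match('Led Zeppelin', 'Led Zeppelin II', threshold=85))  # True
```

The last call shows why a threshold below 100 requires a manual cleanup pass afterwards.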

Also note that we iterate through the search results and, among those that meet the required similarity, keep the one with the largest number of ratings. This is because RYM has a lot of different pages for a single album (one for each edition that was released), and the ratings for all of the different editions are added together on the master release page. However, there is no way of guaranteeing that the master release will be the first Google result, or even that it appears on the first page of results. We could download the second page as well, but this would incur twice the API cost for minimal improvement, because the master release usually appears on the first page of results.
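
The selection step can be sketched on its own, without calling the API. The sample results below are made up and only mimic the shape of the fields this notebook reads from a Custom Search response (link, pagemap, musicalbum, aggregaterating); the helper name pick_most_rated is hypothetical:

```python
def pick_most_rated(items, title):
    """Return the result whose album name equals `title` (ignoring case)
    and that has the most ratings, or None if nothing matches."""
    best = None
    best_count = -1
    for item in items:
        pagemap = item.get('pagemap', {})
        albums = pagemap.get('musicalbum', [])
        ratings = pagemap.get('aggregaterating', [])
        # Skip results without album or rating metadata
        if not albums or not ratings:
            continue
        if albums[0].get('name', '').lower() != title.lower():
            continue
        count = int(ratings[0].get('ratingcount', 0))
        if count > best_count:
            best, best_count = item, count
    return best

# Made-up data: two editions of the same album plus a similarly named one
sample_items = [
    {'link': 'https://example.com/edition',
     'pagemap': {'musicalbum': [{'name': 'Led Zeppelin'}],
                 'aggregaterating': [{'ratingvalue': '3.9', 'ratingcount': '120'}]}},
    {'link': 'https://example.com/master',
     'pagemap': {'musicalbum': [{'name': 'Led Zeppelin'}],
                 'aggregaterating': [{'ratingvalue': '4.0', 'ratingcount': '25000'}]}},
    {'link': 'https://example.com/other',
     'pagemap': {'musicalbum': [{'name': 'Led Zeppelin II'}],
                 'aggregaterating': [{'ratingvalue': '4.1', 'ratingcount': '30000'}]}},
]

print(pick_most_rated(sample_items, 'Led Zeppelin')['link'])
# https://example.com/master
```

The similarly named album has more ratings, but it is rejected by the title check before the counts are compared.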

In [None]:
album_formats = ['LP', 'Album']
single_formats = ['Single', '7"', '45 RPM']

for i, row in df_discogs.iterrows():
    formats = row['formats']
    artists = ' '.join([tup[0] for tup in row['artists']])
    title = row['title']
    # master = urllib.request.urlopen(row['master_url']).read()
    # master = json.loads(master)

    # Album
    if any(f in formats for f in album_formats) and 'Compilation' not in formats:
        search_engine = '006887234305358427609:33s1kree9nz'
    # Compilation
    elif 'Compilation' in formats:
        search_engine = '006887234305358427609:q5xnhjecylp'
    # Single
    elif any(f in formats for f in single_formats) and 'EP' not in formats:
        search_engine = '006887234305358427609:trpuqbqv0bz'
    # EP
    elif 'EP' in formats:
        search_engine = '006887234305358427609:gsvyzpgpmwj'
    # Unknown format: skip the row instead of reusing the search
    # engine chosen for the previous row
    else:
        print('Unrecognized format for "' + title + '" ---- Row ' + str(i))
        continue

    search = service.cse().list(q = artists + ' ' + title, cx = search_engine).execute()

    num_results = int(search['searchInformation']['totalResults'])

    if num_results:
        max_result = None
        max_ratings = 0
        for result in search.get('items', []):
            pagemap = result.get('pagemap', {})
            # Skip results that lack the album or rating metadata
            if 'musicalbum' not in pagemap or 'aggregaterating' not in pagemap:
                continue
            title2 = pagemap['musicalbum'][0].get('name', '')
            similarity = fuzz.ratio(title.lower(), title2.lower())

            if similarity == 100:
                rating_count = int(pagemap['aggregaterating'][0]['ratingcount'])

                # Keep the matching result with the most ratings
                if rating_count >= max_ratings:
                    max_result = result
                    max_ratings = rating_count

        if max_result:
            link = max_result['link']
            rating_info = max_result['pagemap']['aggregaterating'][0]
            rating_value = rating_info['ratingvalue']
            rating_count = int(rating_info['ratingcount'])

            # iterrows() yields copies of the rows, so write to the
            # dataframe itself rather than to `row`
            df_discogs.at[i, 'rym_url'] = link
            df_discogs.at[i, 'rating_value'] = rating_value
            df_discogs.at[i, 'rating_count'] = rating_count

            print(artists + ' - ' + '"' + title + '" ---- Rating: ' + str(rating_value) +
                  ' Rating Count: ' + str(rating_count) + ' ---- Row ' + str(i))
        else:
            print('Titles do not match! Discogs title: ' + title)
    else:
        print('No results for ' + artists + ' - "' + title + '" ---- Row ' + str(i))

    # Don't use up quota
    time.sleep(1)

Finally, we write the data to a CSV file so that we have it saved.

In [None]:
df_discogs.to_csv('collection_data.csv')

More to come soon!