<h1>RYM and Discogs</h1>

As an avid music lover and record collector, I use both <a href="https://www.rateyourmusic.com/">RateYourMusic</a> (RYM) and <a href="https://www.discogs.com/">Discogs</a>. The two services overlap in a few respects: both let you keep track of the music you own, including exactly which physical (or sometimes digital) copy of a musical work you have, and both let you give it a rating. However, in my opinion (and that of many other people), Discogs has proven to be much more effective for cataloging, and RateYourMusic much more effective for ratings.

I keep my catalog of over 700 records (as of May 2020) on Discogs, so if I want to see their RYM ratings I have to look them up on the RYM website one by one, which takes a lot of time. This project attempts to take the collection of a specified user from Discogs and gather the corresponding data from RYM.

<h1>First steps: Python imports</h1>

We will need a few libraries, which must be installed with pip if you would like to run the code on your own computer. The commands for installing these libraries are:

pip install pandas<br>
pip install beautifulsoup4<br>
pip install google-api-python-client<br>
pip install fuzzywuzzy<br>
pip install python-Levenshtein (for fuzzywuzzy)

In [1]:
import json
import time
import urllib.request
import pandas as pd
from bs4 import BeautifulSoup
from googleapiclient.discovery import build
from fuzzywuzzy import fuzz

Fortunately, Discogs has an API, and it is completely free to use without an API key (at least for the data that we want to get). The first thing we should do is retrieve the user's collection and store it as a dataframe.

In [2]:
username = input('Enter the username to get that user\'s collection: ')

# The number of records to get per request
# 500 is the maximum
per_page = 500

# Column order for the final dataframe; the last three are filled in later from RYM
columns = ['id', 'master_id', 'year', 'artists', 'title',
           'labels', 'styles', 'master_url', 'formats',
           'rym_url', 'rating_value', 'rating_count']
# One dict per release; DataFrame.append() was removed in pandas 2.0
rows = []

try:
    collection = json.loads(urllib.request.urlopen('https://api.discogs.com/users/' + username +
                              '/collection/folders/0/releases?per_page=' + str(per_page)).read())

    while True:
        print("Getting page " + str(collection['pagination']['page']) + ' of results')
        for release in collection['releases']:
            info = release['basic_information']
            # Start from a fresh dict so values never leak between releases
            data_row = {}

            data_row['id'] = info['id']
            data_row['master_id'] = info['master_id']
            # 'year' is declared as a column but was never filled in before
            data_row['year'] = info.get('year')
            data_row['artists'] = [(artist['name'], artist['resource_url'])
                                   for artist in info['artists']]
            data_row['title'] = info['title']
            data_row['labels'] = [(label['catno'], label['name'], label['resource_url'])
                                  for label in info['labels']]
            data_row['styles'] = info['styles']
            data_row['master_url'] = info['master_url']
            data_row['formats'] = []
            for f in info['formats']:
                if 'descriptions' in f:
                    data_row['formats'] += f['descriptions']
            rows.append(data_row)

        # If this was the last page then end the loop
        if collection['pagination']['page'] >= collection['pagination']['pages']:
            break
        # Very important to be responsible with the privilege of using
        # a web API
        time.sleep(2)
        # Get the next page
        next_page = collection['pagination']['urls']['next']
        collection = json.loads(urllib.request.urlopen(next_page).read())
except urllib.error.HTTPError as e:
    # Typically a 404 for an unknown username
    print("The user was not found!!! Try again with a different username. (HTTP " + str(e.code) + ")")
else:
    print("Done!")

# Object dtype lets the RYM columns hold strings and numbers later
df_discogs = pd.DataFrame(rows, columns = columns).astype(object)

Enter the username to get that user's collection: BethesdaMD
Getting page 1 of results
Getting page 2 of results
Done!
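
The request URL in the cell above is built by plain string concatenation, which assumes the username contains nothing that needs escaping. As a minimal sketch (not part of the original notebook), the same URL could be assembled with the standard library's quoting helpers. The function name `collection_url` is made up for illustration; the endpoint and the `per_page` parameter are the ones already used above.

```python
from urllib.parse import quote, urlencode

def collection_url(username, per_page=500):
    """Build the URL for the first page of a user's Discogs collection."""
    # Percent-encode the username so spaces or slashes cannot break the path
    base = 'https://api.discogs.com/users/' + quote(username, safe='')
    query = urlencode({'per_page': per_page})
    return base + '/collection/folders/0/releases?' + query

print(collection_url('BethesdaMD'))
```

For an ordinary username this produces exactly the URL used above; the only difference shows up when the input contains characters that are not valid in a URL path.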


Unfortunately, RateYourMusic does not have an API, and running a GET request against any part of their website from Python will result in an instant ban. Rather than try to circumvent this and violate their Terms of Service, we are limited to a workaround that doesn't: the Google Custom Search API. We will make a search query that returns the relevant page on RYM and, fortunately, the data we need as well (which may be slightly outdated, because it dates from the last time the page was indexed).

First we will need a Google API key to make this work:

In [None]:
gapi_key = input('Please enter your Google API key: ')
service = build("customsearch", "v1", developerKey = gapi_key)

We need a query that gives us the results we want. To construct it, we include the artists' names and the name of the album or musical work (the code below also appends the title of the first track from the Discogs master release, when there is one). We also have to be aware of one other thing: the name of an album can also be the name of a single. Whether we are looking for the single or for the album, this is where the format descriptions we retrieved from the Discogs collection come in handy.

On the Google Custom Search API, there are four different search engines for this program. When we want the album version, we search rateyourmusic.com/release/album/ \*; for the single version, rateyourmusic.com/release/single/ \*; for the EP version, rateyourmusic.com/release/ep/ \*; and for a compilation album, rateyourmusic.com/release/comp/ \*. We also need to be aware that Discogs doesn't <u>always</u> label releases with 'Album' and 'Single'. We can still check whether those terms are present in the format list, but in addition a single can be represented as '7"' or '45 RPM', and an album can be represented as 'LP' (although it might instead be a compilation, in which case 'Compilation' will be present). An EP is usually a '7"' record as well, so we will have to assume that an EP always has the 'EP' tag.

Note that the order in which we check these things matters. For example, if a release is an EP but we first check whether it's a single, the check may see the '7"' tag and treat the release as a single. We can do two things to combat this: either check whether it's an album first, then an EP, then a single, or check in any order we want and, when we see a '7"' tag, treat the release as a single only if 'EP' is not also present. We shall do the latter in this code, to make it more easily maintainable.
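
The checks described above can be sketched as a small standalone function. This is only an illustration: the name `release_type` and its return labels are made up here, and the tag lists are the ones used in the search code further down.

```python
ALBUM_FORMATS = ['LP', 'Album']
SINGLE_FORMATS = ['Single', '7"', '45 RPM']

def release_type(formats):
    """Classify a list of Discogs format descriptions.

    Returns 'album', 'comp', 'single', 'ep', or None when no known tag is present.
    """
    is_comp = 'Compilation' in formats
    is_ep = 'EP' in formats
    # An 'LP' or 'Album' counts as an album unless it is also a compilation
    if any(f in formats for f in ALBUM_FORMATS) and not is_comp:
        return 'album'
    if is_comp:
        return 'comp'
    # A '7"' only means "single" when the release is not tagged as an EP
    if any(f in formats for f in SINGLE_FORMATS) and not is_ep:
        return 'single'
    if is_ep:
        return 'ep'
    return None

print(release_type(['7"', '45 RPM', 'Single']))
print(release_type(['7"', 'EP']))
print(release_type(['LP', 'Compilation']))
print(release_type(['Cassette']))
```

Because each test excludes the tag that would make it ambiguous, the single and EP branches could be swapped without changing the result. Returning None for an unrecognized format makes it explicit that such a release has no matching search engine.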

Every time we do the Google search, we cannot simply trust the first result, because it is very possible that it is not what we are looking for. In some cases (when the entry just doesn't exist on RYM), we might get an album with an almost completely different name, but in other cases (most notably Led Zeppelin) we might get an album with an almost identical name that isn't the one we are looking for. This is because many of the Led Zeppelin albums have almost the same name (e.g. Led Zeppelin, Led Zeppelin II, Led Zeppelin III).

In the code below, we call fuzz.ratio() on the two names (the one from Discogs and the one from RYM) to see how similar they are before we modify the dataframe. This poses an interesting problem: how accurate do we want to be? If we're willing to tolerate some false positives for the sake of comprehensiveness, we can set the required similarity lower than 100 and go through manually to clean them out later; if we want no false positives, we can require a similarity of 100, at the expense of some true positives. The code below uses a threshold of 90. In theory, a band could give two different albums the same name, but seeing as this is extremely unlikely, we can ignore this case.
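
To get a feel for the trade-off, here is a rough sketch that scores a few Led Zeppelin titles against each other. It deliberately uses the standard library's `difflib.SequenceMatcher` as a stand-in so that it runs without extra packages; the helper name `similarity` is made up, and the scores from fuzz.ratio() may differ slightly from the ones this stand-in produces.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Case-insensitive similarity score from 0 to 100 (stand-in for fuzz.ratio)."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())

wanted = 'Led Zeppelin II'
for candidate in ['Led Zeppelin II', 'Led Zeppelin III', 'Led Zeppelin']:
    print(candidate, similarity(wanted, candidate))
```

With this stand-in, 'Led Zeppelin III' scores above 90 but below 100 against 'Led Zeppelin II', so a threshold of 90 would accept it as a match, while 'Led Zeppelin' falls just under 90. Only a threshold of 100 rules out both near-misses.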

Also note that we iterate through the search results and, among those that meet the similarity threshold, keep the one with the most ratings. This is because RYM has a lot of different pages for a single album (one for each edition that was released), and the ratings for all of the different editions are added together on the master release page. However, there is no way of guaranteeing that the master release will be the first Google result, or even appear on the first page of results. We could download the second page as well, but this would incur twice the API cost for minimal improvement, because the master release usually appears on the first page.
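
The selection step on its own can be sketched as follows. The helper name `most_rated` and the sample results are hypothetical; the nested `pagemap` / `aggregaterating` / `ratingcount` layout is the one the search code below reads from each result.

```python
def most_rated(results):
    """Return the result with the highest rating count, or None if none carry rating data."""
    best, best_count = None, -1
    for result in results:
        ratings = result.get('pagemap', {}).get('aggregaterating')
        if not ratings:
            # No rating metadata was indexed for this page
            continue
        count = int(ratings[0].get('ratingcount', 0))
        if count > best_count:
            best, best_count = result, count
    return best

# Made-up results: one edition page, one master page, one page without ratings
results = [
    {'link': 'https://example.com/edition',
     'pagemap': {'aggregaterating': [{'ratingvalue': '3.90', 'ratingcount': '120'}]}},
    {'link': 'https://example.com/master',
     'pagemap': {'aggregaterating': [{'ratingvalue': '3.96', 'ratingcount': '24405'}]}},
    {'link': 'https://example.com/no-ratings', 'pagemap': {}},
]
print(most_rated(results)['link'])
```

In the real loop this selection only considers results whose title already passed the similarity check, so the most-rated page is chosen among plausible matches rather than among everything Google returned.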

In [15]:
album_formats = ['LP', 'Album']
single_formats = ['Single', '7"', '45 RPM']

for i, row in df_discogs.iterrows():
    formats = row['formats']
    artists = ' '.join([tup[0] for tup in row['artists']])
    title = row['title']
    
    if row['master_url']:
        master = urllib.request.urlopen(row['master_url']).read()
        master = json.loads(master)
        
        tracks = [track['title'] for track in master['tracklist']]
        
        if tracks:
            track = tracks[0]
        else:
            print('No tracks!')
            track = ''
    else:
        track = ''
    
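    # Album (an 'LP' that is also tagged 'Compilation' is handled next)
    if any(f in formats for f in album_formats) and 'Compilation' not in formats:
        search_engine = '006887234305358427609:33s1kree9nz'
    # Compilation
    elif 'Compilation' in formats:
        search_engine = '006887234305358427609:q5xnhjecylp'
    # Single
    elif any(f in formats for f in single_formats) and 'EP' not in formats:
        search_engine = '006887234305358427609:trpuqbqv0bz'
    # EP
    elif 'EP' in formats:
        search_engine = '006887234305358427609:gsvyzpgpmwj'
    # Unrecognized format: skip rather than reuse the previous row's search engine
    else:
        print('Unknown format for ' + artists + ' - "' + title + '" ---- Row ' + str(i))
        continue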

    search = service.cse().list(q = artists + ' ' + title + ' ' + track, cx = search_engine).execute()
    
    num_results = int(search['searchInformation']['totalResults'])
    
    if num_results:
        max_result = None
        max_ratings = 0
        for result in search['items']:
            pagemap = result.get('pagemap', {})
            # Skip results that lack the album or rating metadata
            if 'musicalbum' not in pagemap or 'aggregaterating' not in pagemap:
                continue

            title2 = pagemap['musicalbum'][0]['name']
            similarity = fuzz.ratio(title.lower(), title2.lower())

            if similarity >= 90:
                rating_count = int(pagemap['aggregaterating'][0]['ratingcount'])

                # Keep the matching result with the most ratings
                if rating_count >= max_ratings:
                    max_result = result
                    max_ratings = rating_count
                
        if max_result:
            link = max_result['link']
            rating_info = max_result['pagemap']['aggregaterating'][0]
            rating_value = rating_info['ratingvalue']
            rating_count = int(rating_info['ratingcount'])
            
            # Write to the dataframe itself: the row yielded by iterrows()
            # may be a copy, so assigning to it may not update df_discogs
            df_discogs.at[i, 'rym_url'] = link
            df_discogs.at[i, 'rating_value'] = rating_value
            df_discogs.at[i, 'rating_count'] = rating_count
            
            print(artists + ' - ' + '"' + title + '" ---- Rating: ' + str(rating_value) +
                      ' Rating Count: ' + str(rating_count) + ' ---- Row ' + str(i))
        else:
            print('Titles do not match! Discogs title: ' + title)
    else:
        print('No results for ' + artists + ' - "' + title + '" ---- Row ' + str(i))
    
    # Don't use up quota
    time.sleep(2)

Billy Preston - "Will It Go Round In Circles / Blackbird" ---- Rating: 3.60 Rating Count: 128 ---- Row 0
Titles do not match! Discogs title: Rise
Procol Harum - "A Salty Dog" ---- Rating: 3.69 Rating Count: 1657 ---- Row 2
Chuck Mangione - "Fun And Games" ---- Rating: 2.90 Rating Count: 52 ---- Row 3
Procol Harum - "Live - In Concert With The Edmonton Symphony Orchestra" ---- Rating: 3.87 Rating Count: 618 ---- Row 4
Chuck Mangione - "Feels So Good" ---- Rating: 3.36 Rating Count: 277 ---- Row 5
Chuck Mangione - "Tarantella" ---- Rating: 3.65 Rating Count: 15 ---- Row 6
Chuck Mangione - "Children Of Sanchez" ---- Rating: 3.61 Rating Count: 165 ---- Row 7
Titles do not match! Discogs title: An Evening Of Magic - Live At The Hollywood Bowl
Titles do not match! Discogs title: A Salty Dog / Conquistador
Procol Harum - "Home" ---- Rating: 3.62 Rating Count: 789 ---- Row 10
The Crusaders - "Free As The Wind" ---- Rating: 3.52 Rating Count: 122 ---- Row 11
Steely Dan - "Pretzel Logic" ---- Ra

Wilson Pickett - "Hey Jude" ---- Rating: 3.58 Rating Count: 164 ---- Row 96
Roberta Flack - "First Take" ---- Rating: 3.73 Rating Count: 676 ---- Row 97
Boz Scaggs - "Boz Scaggs" ---- Rating: 3.49 Rating Count: 510 ---- Row 98
Titles do not match! Discogs title: In The Court Of The Crimson King (An Observation By King Crimson)
Titles do not match! Discogs title: In Philadelphia
Bette Midler - "The Divine Miss M" ---- Rating: 3.53 Rating Count: 180 ---- Row 101
Aretha Franklin - "I Never Loved A Man The Way I Love You" ---- Rating: 3.91 Rating Count: 4176 ---- Row 102
Led Zeppelin - "Led Zeppelin II" ---- Rating: 3.96 Rating Count: 24405 ---- Row 103
John Coltrane - "Giant Steps" ---- Rating: 4.11 Rating Count: 11422 ---- Row 104
Wilson Pickett - "The Wicked Pickett" ---- Rating: 3.79 Rating Count: 310 ---- Row 105
John Coltrane - "My Favorite Things" ---- Rating: 4.11 Rating Count: 9680 ---- Row 106
Titles do not match! Discogs title: The Complete Recordings Vol. 1
Titles do not match!

Titles do not match! Discogs title: The Benny Goodman Band
The Beach Boys - "The Smile Sessions" ---- Rating: 4.24 Rating Count: 6687 ---- Row 194
The Allman Brothers Band - "Eat A Peach" ---- Rating: 3.89 Rating Count: 3775 ---- Row 195
The Allman Brothers Band - "Brothers And Sisters" ---- Rating: 3.78 Rating Count: 2523 ---- Row 196
No results for Kent Harian Orchestra - "Echoes Of Joy" ---- Row 197
Titles do not match! Discogs title: What Can I Say
Ludwig Van Beethoven Bruno Walter Columbia Symphony Orchestra - "Symphony No. 6 "Pastorale"" ---- Rating: 3.99 Rating Count: 45 ---- Row 199
No results for Igor Stravinsky Leonard Bernstein The New York Philharmonic Orchestra - "Firebird Suite / Petrushka (Complete)" ---- Row 200
Gustav Holst Leonard Bernstein The New York Philharmonic Orchestra - "The Planets" ---- Rating: 3.92 Rating Count: 249 ---- Row 201
No results for Gustav Mahler Bruno Walter Columbia Symphony Orchestra - "Mahler's 1st Symphony In D Major "Titan"" ---- Row 202
Ti

Chicago (2) - "Chicago 13" ---- Rating: 2.57 Rating Count: 221 ---- Row 282
Chicago (2) - "Chicago XIV" ---- Rating: 2.70 Rating Count: 201 ---- Row 283
Chuck Mangione - "Journey To A Rainbow" ---- Rating: 3.30 Rating Count: 12 ---- Row 284
Chuck Mangione - "Disguise" ---- Rating: 3.31 Rating Count: 13 ---- Row 285
Wynton Marsalis - "Hot House Flowers" ---- Rating: 3.47 Rating Count: 77 ---- Row 286
Bill Withers - "Watching You Watching Me" ---- Rating: 2.73 Rating Count: 79 ---- Row 287
Taj Mahal - "Giant Step / De Ole Folks At Home" ---- Rating: 3.70 Rating Count: 353 ---- Row 288
Chicago (2) - "Chicago Transit Authority" ---- Rating: 3.77 Rating Count: 2451 ---- Row 289
Bruce Springsteen - "Born To Run" ---- Rating: 3.97 Rating Count: 12238 ---- Row 290
Ramsey Lewis - "Tequila Mockingbird" ---- Rating: 3.18 Rating Count: 36 ---- Row 291
Bill Withers - "'Bout Love" ---- Rating: 2.97 Rating Count: 72 ---- Row 292
Titles do not match! Discogs title: In The Tradition
Blood, Sweat And Te

The Moody Blues - "Days Of Future Passed" ---- Rating: 3.82 Rating Count: 6153 ---- Row 373
Titles do not match! Discogs title: Baby May
King Crimson - "In The Court Of The Crimson King" ---- Rating: 4.31 Rating Count: 38326 ---- Row 375
King Crimson - "Red" ---- Rating: 4.22 Rating Count: 20116 ---- Row 376
Jim Lowe (2) The High Fives (2) - "The Green Door / (The Story Of) The Little Man In Chinatown" ---- Rating: 3.43 Rating Count: 77 ---- Row 377
Joanna Newsom - "Ys" ---- Rating: 3.97 Rating Count: 14011 ---- Row 378
No results for Plague Of Locusts - "Utah, Gateway To Nevada!" ---- Row 379
No results for Audubon Society Of Rhode Island - "Birds On A May Morning" ---- Row 380
Bobby Bland - "I'll Take Care Of You / That's Why" ---- Rating: 3.81 Rating Count: 138 ---- Row 381
Titles do not match! Discogs title: Mama's Big Ones: Her Greatest Hits
Four Tops - "Keeper Of The Castle" ---- Rating: 3.26 Rating Count: 77 ---- Row 383
Steppenwolf - "Steppenwolf" ---- Rating: 3.66 Rating Count

Grover Washington, Jr. - "All The King's Horses" ---- Rating: 3.27 Rating Count: 51 ---- Row 471
Grover Washington, Jr. - "Mister Magic" ---- Rating: 3.63 Rating Count: 494 ---- Row 472
Grover Washington, Jr. - "Inner City Blues" ---- Rating: 3.37 Rating Count: 95 ---- Row 473
Grover Washington, Jr. - "A Secret Place" ---- Rating: 3.33 Rating Count: 74 ---- Row 474
Grover Washington, Jr. - "Live At The Bijou" ---- Rating: 3.56 Rating Count: 54 ---- Row 475
The Jimi Hendrix Experience - "Are You Experienced" ---- Rating: 4.14 Rating Count: 21362 ---- Row 476
Bud Shank Clare Fischer - "Brasamba" ---- Rating: 3.10 Rating Count: 9 ---- Row 477
Bobby Womack - "The Womack "Live"" ---- Rating: 3.53 Rating Count: 41 ---- Row 478
Dean Friedman - ""Well, Well," Said The Rocking Chair." ---- Rating: 3.44 Rating Count: 44 ---- Row 479
Dean Friedman - "Dean Friedman" ---- Rating: 3.40 Rating Count: 22 ---- Row 480
Jim Croce - "The Faces I've Been" ---- Rating: 3.51 Rating Count: 68 ---- Row 481
Ors

Carole King - "Rhymes & Reasons" ---- Rating: 3.37 Rating Count: 237 ---- Row 570
Eric Dolphy - "Out There" ---- Rating: 3.85 Rating Count: 2041 ---- Row 571
Shelly Manne & His Friends - "Modern Jazz Performances Of Songs From My Fair Lady" ---- Rating: 3.75 Rating Count: 80 ---- Row 572
John Coltrane - "Soultrane" ---- Rating: 3.58 Rating Count: 1674 ---- Row 573
Sonny Rollins Quartet - "Tenor Madness" ---- Rating: 3.72 Rating Count: 707 ---- Row 574
Miles Davis All Stars - "Walkin'" ---- Rating: 3.88 Rating Count: 8 ---- Row 575
Etta Jones - "Don't Go To Strangers" ---- Rating: 3.74 Rating Count: 69 ---- Row 576
Art Blakey & The Jazz Messengers - "Caravan" ---- Rating: 3.74 Rating Count: 430 ---- Row 577
Thelonious Monk Septet - "Monk's Music" ---- Rating: 3.93 Rating Count: 2564 ---- Row 578
Mount Eerie - "Now Only" ---- Rating: 3.57 Rating Count: 4814 ---- Row 579
The Beatles - "Sgt. Pepper's Lonely Hearts Club Band" ---- Rating: 4.14 Rating Count: 39067 ---- Row 580
The Beatles - 

Dionne Warwick - "Anyone Who Had A Heart" ---- Rating: 3.53 Rating Count: 108 ---- Row 668
Talking Heads - "The Name Of This Band Is Talking Heads" ---- Rating: 4.22 Rating Count: 4392 ---- Row 669
Focus (2) - "Focus 3" ---- Rating: 3.80 Rating Count: 1419 ---- Row 670
Talking Heads - "Talking Heads: 77" ---- Rating: 3.81 Rating Count: 14797 ---- Row 671
Talking Heads - "More Songs About Buildings And Food" ---- Rating: 3.87 Rating Count: 12403 ---- Row 672
Talking Heads - "Fear Of Music" ---- Rating: 4.01 Rating Count: 16190 ---- Row 673
Talking Heads - "Remain In Light" ---- Rating: 4.25 Rating Count: 30419 ---- Row 674
Talking Heads - "Speaking In Tongues" ---- Rating: 3.84 Rating Count: 11259 ---- Row 675
Soft Cell - "Tainted Love / Where Did Our Love Go" ---- Rating: 3.98 Rating Count: 1469 ---- Row 676
Titles do not match! Discogs title: Presenting Thad Jones • Mel Lewis & "The Jazz Orchestra"
Gladys Knight And The Pips - "Neither One Of Us" ---- Rating: 3.76 Rating Count: 188 --

Titles do not match! Discogs title: The Best Of Bill Evans
Titles do not match! Discogs title: The Cole Porter Songbook
Ella Fitzgerald - "The Rodgers And Hart Songbook" ---- Rating: 4.25 Rating Count: 10 ---- Row 767
Stan Getz - "Communications '72" ---- Rating: 3.66 Rating Count: 25 ---- Row 768
The Velvet Underground Nico (3) - "The Velvet Underground & Nico" ---- Rating: 4.25 Rating Count: 40215 ---- Row 769
Gerry Mulligan & The Concert Jazz Band - "At The Village Vanguard" ---- Rating: 3.95 Rating Count: 89 ---- Row 770
Titles do not match! Discogs title: Sings The George And Ira Gershwin Song Book
Ann Gilbert - "In A Swingin' Mood" ---- Rating: 0.00 Rating Count: 0 ---- Row 772
Mike Oldfield - "Tubular Bells" ---- Rating: 3.75 Rating Count: 7350 ---- Row 773
The Dramatics - "Whatcha See Is Whatcha Get / Thankful For Your Love" ---- Rating: 3.84 Rating Count: 144 ---- Row 774
Charles Wright & The Watts 103rd St Rhythm Band - "Do Your Thing / A Dance, A Kiss And A Song" ---- Rating

Finally, we write the data to a CSV file so that we have it saved.

In [16]:
df_discogs.to_csv('collection_data.csv')

More to come soon!