# Data downloading

This is the first part of the project: finding free and openly usable sources of music data and gathering as much data as possible for later use. After some web research, three main data sources were chosen: the MusicBrainz database, the Spotify API, and the "Million Song Dataset". More than one source was used because none of the available sources was complete, that is, none contained every type of information considered necessary. For example, album information in MusicBrainz was very messy and in many cases seemed incomplete. In Spotify, by contrast, album information was correct, but gender information was missing. Thanks to this approach, the resulting dataset is more complete than any of the existing datasets. It is also expandable and can easily be updated in the future. All downloaded data is stored in a local MongoDB instance.

Remark: This part was done over many iterations and with a lot of repetition, both because of service limitations (request limits on servers, limited bandwidth, and connection problems) and because better approaches to the problems that emerged were found along the way. This is why some parts of the code may be unclear or hard to follow. Some other parts were not used in the end, mostly because downloading with them was too slow.

In [2]:
import musicbrainzngs
import sys
from pymongo import MongoClient
from pymongo import errors
from datapackage import Package

## Get list of all country codes

To download any data from Spotify, a request must include at least some information about the artist. Browsing and downloading all the data part by part is not possible.
The first idea was to use country codes, which are one of the pieces of information in the Spotify API. Unfortunately, this approach didn't work because of limitations on the request parameters 'offset' and 'limit', which meant that only part of the data could be downloaded. (This part of the code was deleted.)

However, the list of country codes turned out to be very helpful when downloading data from MusicBrainz.
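
The 'offset'/'limit' paging described above can be sketched as a generic loop. This is only an illustration: `fetch_page`, `max_offset` and the in-memory `fake_fetch` are hypothetical names, and the cap merely stands in for whatever ceiling a service places on 'offset'; the value used here is not taken from the Spotify documentation.

```python
from typing import Callable, Iterator, List, Optional


def paginate(fetch_page: Callable[[int, int], List[dict]],
             limit: int = 50,
             max_offset: Optional[int] = None) -> Iterator[dict]:
    """Yield items page by page until a short page or the offset cap is reached."""
    offset = 0
    while max_offset is None or offset <= max_offset:
        page = fetch_page(offset, limit)
        yield from page
        if len(page) < limit:
            break  # a short page means there is nothing left to fetch
        offset += len(page)


# Usage with a fake in-memory "service" holding 120 items (made-up data).
ITEMS = [{"n": i} for i in range(120)]


def fake_fetch(offset: int, limit: int) -> List[dict]:
    return ITEMS[offset:offset + limit]


everything = list(paginate(fake_fetch, limit=50))
capped = list(paginate(fake_fetch, limit=50, max_offset=50))
print(len(everything), len(capped))
```

With the cap in place, the items beyond it are never reached, which is the situation that made the country-code approach unusable for Spotify.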


In [3]:
package = Package('https://datahub.io/core/country-list/datapackage.json')

countries = []

# Get the names of all resources in the package
resources = package.descriptor['resources']
resourceList = [resource['name'] for resource in resources]
print(resourceList)

# Collect the country code (second column) from every tabular resource, if any
resources = package.resources
for resource in resources:
    if resource.tabular:
        country_table = resource.read()
        for elem in country_table:
            countries.append(elem[1])

print(countries)

['validation_report', 'data_csv', 'data_json', 'country-list_zip', 'data']
['AF', 'AX', 'AL', 'DZ', 'AS', 'AD', 'AO', 'AI', 'AQ', 'AG', 'AR', 'AM', 'AW', 'AU', 'AT', 'AZ', 'BS', 'BH', 'BD', 'BB', 'BY', 'BE', 'BZ', 'BJ', 'BM', 'BT', 'BO', 'BQ', 'BA', 'BW', 'BV', 'BR', 'IO', 'BN', 'BG', 'BF', 'BI', 'KH', 'CM', 'CA', 'CV', 'KY', 'CF', 'TD', 'CL', 'CN', 'CX', 'CC', 'CO', 'KM', 'CG', 'CD', 'CK', 'CR', 'CI', 'HR', 'CU', 'CW', 'CY', 'CZ', 'DK', 'DJ', 'DM', 'DO', 'EC', 'EG', 'SV', 'GQ', 'ER', 'EE', 'ET', 'FK', 'FO', 'FJ', 'FI', 'FR', 'GF', 'PF', 'TF', 'GA', 'GM', 'GE', 'DE', 'GH', 'GI', 'GR', 'GL', 'GD', 'GP', 'GU', 'GT', 'GG', 'GN', 'GW', 'GY', 'HT', 'HM', 'VA', 'HN', 'HK', 'HU', 'IS', 'IN', 'ID', 'IR', 'IQ', 'IE', 'IM', 'IL', 'IT', 'JM', 'JP', 'JE', 'JO', 'KZ', 'KE', 'KI', 'KP', 'KR', 'KW', 'KG', 'LA', 'LV', 'LB', 'LS', 'LR', 'LY', 'LI', 'LT', 'LU', 'MO', 'MK', 'MG', 'MW', 'MY', 'MV', 'ML', 'MT', 'MH', 'MQ', 'MR', 'MU', 'YT', 'MX', 'FM', 'MD', 'MC', 'MN', 'ME', 'MS', 'MA', 'MZ', 'MM', 'NA', 

## Downloading data from the MusicBrainz server (local instance)

MusicBrainz makes it possible to download a snapshot of their database and set up a local instance of the same server that is publicly available. At first this seemed a much better approach than using the public server (no request limits, etc.), but it also turned out to be very slow.

In [4]:
# This can be run against a local MusicBrainz server or, after some adjustments, against the public server
musicbrainzngs.set_hostname('localhost:5000')
musicbrainzngs.set_useragent(app="MusicTravel", version="0.1")
musicbrainzngs.set_rate_limit(False)


In [5]:
db_client = MongoClient('localhost', 27017)
db = db_client.musicdata
collection = db.get_collection("artists")

In [20]:
limit = 100  # max API limit
offset = 23780  # starting offset for the first country in this run; reset to 0 afterwards
next_country = False

keys = ['type', 'id', 'name', 'gender', 'country', 'disambiguation', 'lifespan', 'tag-list', 'url-relation-list']
url_keys = ['type', 'target']

artist_count = 0
#countries = ['unknown']
countries = countries[107:]

for country in countries:

    next_country = False
    print(artist_count)
    while not next_country:
        try:
            result = musicbrainzngs.search_artists(country=country, limit=limit, offset=offset)
            res_num = len(result['artist-list'])
            offset += res_num
        except Exception:
            # 'except Exception' (not a bare 'except') so that the cell can still be interrupted
            print("Unexpected error:", sys.exc_info()[0])
            break

        # A page shorter than the limit is the last page for this country
        if res_num < limit:
            next_country = True

        for i in range(0, res_num):
            try:
                artist_id = result['artist-list'][i]['id']
                artist = musicbrainzngs.get_artist_by_id(artist_id, includes=['url-rels', 'tags'])

                # Keep only the selected fields and use the MusicBrainz ID as the MongoDB key
                new_art = {k: v for k, v in artist['artist'].items() if k in keys}
                new_art['_id'] = new_art.pop('id')
                if 'url-relation-list' in new_art:
                    for j in range(0, len(new_art['url-relation-list'])):
                        new_art['url-relation-list'][j] = {k: v for k, v in new_art['url-relation-list'][j].items() if k in url_keys}

                    # Note: only artists that have URL relations are stored
                    collection.insert_one(new_art)
                    artist_count += 1
            except errors.DuplicateKeyError as err:
                # The artist is already in the collection
                continue
            except musicbrainzngs.MusicBrainzError as err:
                print("Exception: ", err)
                continue
            except Exception:
                print("Unexpected error:", sys.exc_info()[0])
                continue
    offset = 0

0
0
Exception:  caused by: HTTP Error 404: Not Found
[... the 404 line above repeats many times and the captured output is truncated in places; the distinct running artist counts that were printed follow ...]
42
56855
59872
59898
128061
128131
128555
128574
128576
128580
128610
128899
129060
129061
129093
129282
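
The field filtering done inside the download loop can be sketched as a small pure function, which makes it easy to check without a running server. The function name `slim_artist` and the sample record are made up for illustration; only the key lists come from the cell above.

```python
KEYS = ['type', 'id', 'name', 'gender', 'country', 'disambiguation',
        'lifespan', 'tag-list', 'url-relation-list']
URL_KEYS = ['type', 'target']


def slim_artist(artist: dict, keys=KEYS, url_keys=URL_KEYS) -> dict:
    """Return a copy with only the wanted fields, keyed by the MusicBrainz ID."""
    doc = {k: v for k, v in artist.items() if k in keys}
    doc['_id'] = doc.pop('id')  # reuse the MusicBrainz ID as the MongoDB key
    if 'url-relation-list' in doc:
        doc['url-relation-list'] = [
            {k: v for k, v in rel.items() if k in url_keys}
            for rel in doc['url-relation-list']
        ]
    return doc


# Made-up record shaped like the artist dictionaries handled in the loop.
sample = {
    'id': 'abc-123',
    'name': 'Example Artist',
    'sort-name': 'Artist, Example',
    'country': 'PL',
    'url-relation-list': [
        {'type': 'wikidata',
         'target': 'https://www.wikidata.org/wiki/Q1',
         'direction': 'forward'},
    ],
}
slimmed = slim_artist(sample)
print(slimmed)
```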

# Find Spotify IDs

In [6]:
# First use direct Spotify links, then search for Spotify IDs in Wikidata;
# for the remaining artists, use the Spotify API search to find artists with the same name.
import os

import helpers
import spotipy
import spotipy.oauth2 as oauth2
import spotipy.util as util

# Credentials are read from the environment instead of being hard-coded in the notebook
SPOTIPY_CLIENT_ID = os.environ['SPOTIPY_CLIENT_ID']
SPOTIPY_CLIENT_SECRET = os.environ['SPOTIPY_CLIENT_SECRET']
SPOTIPY_REDIRECT_URI = 'http://localhost/?code=...'

credentials = oauth2.SpotifyClientCredentials(
    client_id=SPOTIPY_CLIENT_ID,
    client_secret=SPOTIPY_CLIENT_SECRET)

# The access token is a secret as well, so it is not printed
token = credentials.get_access_token()
spotify = spotipy.Spotify(auth=token)


In [7]:
# count_documents() replaces the deprecated Cursor.count()
all_artists_number = collection.count_documents({})
print(all_artists_number)

# Find all artists whose 'url-relation-list' has a 'streaming music' element, which usually contains the Spotify ID
streaming_query = {'url-relation-list': {'$elemMatch': {'type': 'streaming music'}}}
print("Streaming music links: " + str(collection.count_documents(streaming_query)))

wikidata_query = {'url-relation-list': {'$elemMatch': {'type': 'wikidata'}}}
artists_with_wikidata = collection.find(wikidata_query)
print("Wikidata links: " + str(collection.count_documents(wikidata_query)))


spotify_ids_number = 0
spotify_ids = []
artists_with_links = collection.find(streaming_query)
for artist in artists_with_links:
    url_list = artist['url-relation-list']
    for item in url_list:
        if item['type'] == 'streaming music' and 'spotify' in item['target']:
            spotify_id = helpers.get_id(item['target'])
            # A Spotify ID is 22 characters long; some links carry extra information, so cut it off
            spotify_id = spotify_id[:22]
            spotify_ids_number += 1
            spotify_ids.append(spotify_id)
print('Spotify links: ' + str(spotify_ids_number))

421433
Streaming music links: 12988
Wikidata links: 125749
Spotify links: 8788
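
The ID extraction relies on `helpers.get_id`, whose implementation is not shown here. A minimal stand-in, assuming artist links of the form `https://open.spotify.com/artist/<id>` as seen in the stored URL lists, might look like the sketch below. The function name `spotify_artist_id` is hypothetical and this is not the author's helper.

```python
from typing import Optional
from urllib.parse import urlparse

SPOTIFY_ID_LENGTH = 22  # same length the notebook truncates to


def spotify_artist_id(url: str) -> Optional[str]:
    """Return the artist ID from a Spotify artist URL, or None for other URLs."""
    parsed = urlparse(url)
    if 'spotify' not in parsed.netloc:
        return None
    segments = [s for s in parsed.path.split('/') if s]
    if len(segments) < 2 or segments[-2] != 'artist':
        return None
    # Cut off anything appended after the ID, mirroring the [:22] slice
    candidate = segments[-1][:SPOTIFY_ID_LENGTH]
    return candidate if len(candidate) == SPOTIFY_ID_LENGTH else None


print(spotify_artist_id('https://open.spotify.com/artist/3rTwpDHftjGIiIcpMYhgtq'))
print(spotify_artist_id('https://www.deezer.com/artist/7979540'))
```

Parsing the URL first means query strings are dropped before the truncation, and links to other streaming services are rejected instead of yielding a bogus ID.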


In [7]:
import time

exactly_same_name = 0
different_name = 0
not_found = 0
all_artists = collection.find()
# For the first 100 artists, check whether the top Spotify search hit has exactly the same name
for artist in all_artists[:100]:
    artist_spotify = spotify.search(q=artist['name'], type='artist')
    items = artist_spotify['artists']['items']
    if len(items) == 0:
        not_found += 1
        continue
    if artist['name'] == items[0]['name']:
        exactly_same_name += 1
    else:
        different_name += 1

print("same names: " + str(exactly_same_name))
print("different names: " + str(different_name))
print("not found: " + str(not_found))

same names: 53
different names: 9
not found: 38
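
The three-way classification used in this experiment can be written as a pure function over the search response, which makes it testable without calling the API. The sketch assumes only the response shape used in the cell above (`['artists']['items'][i]['name']`); `classify_match` and the sample responses are made up for illustration.

```python
from collections import Counter


def classify_match(name: str, search_result: dict) -> str:
    """Compare an artist name with the top hit of an artist search response."""
    items = search_result.get('artists', {}).get('items', [])
    if not items:
        return 'not found'
    return 'same name' if items[0]['name'] == name else 'different name'


# Made-up responses keyed by the name that was searched for.
fake_results = {
    'Artist A': {'artists': {'items': [{'name': 'Artist A'}]}},
    'Artist B': {'artists': {'items': [{'name': 'Artist B and Friends'}]}},
    'Artist C': {'artists': {'items': []}},
}
counts = Counter(classify_match(n, r) for n, r in fake_results.items())
print(counts)
```

An identical name in the top hit does not prove that both entries describe the same artist, so a match found this way is weaker than a direct link.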


In [20]:
for artist in artists_with_links:
    print(artist['url-relation-list'])

[{'type': 'official homepage', 'target': 'http://www.karimanayt.com/'}, {'type': 'streaming music', 'target': 'http://www.staimusic.com/en/bands/nayt_6170.html'}]
a
[{'type': 'allmusic', 'target': 'http://www.allmusic.com/artist/mn0000103057'}, {'type': 'last.fm', 'target': "http://www.last.fm/music/Kouideche+(El'eve+De+Saoud)"}, {'type': 'last.fm', 'target': 'http://www.last.fm/music/Kouideche'}, {'type': 'purchase for download', 'target': 'https://itunes.apple.com/tr/artist/id285863586'}, {'type': 'streaming music', 'target': 'https://open.spotify.com/artist/3rTwpDHftjGIiIcpMYhgtq'}, {'type': 'other databases', 'target': 'https://rateyourmusic.com/artist/kouideche'}, {'type': 'discogs', 'target': 'https://www.discogs.com/artist/1121069'}, {'type': 'youtube', 'target': 'https://www.youtube.com/channel/UCK4a-1Iwj9WZUtWNoBHlC2w'}, {'type': 'youtube', 'target': 'https://www.youtube.com/channel/UCahQ1XkbSCjW_GmbnMp9xLA'}]
a
[{'type': 'social network', 'target': 'http://instagram.com/zahoo

[{'type': 'lyrics', 'target': 'http://decoda.com/massacre-lyrics'}, {'type': 'other databases', 'target': 'http://rateyourmusic.com/artist/massacre_f2'}, {'type': 'allmusic', 'target': 'http://www.allmusic.com/artist/mn0000723396'}, {'type': 'official homepage', 'target': 'http://www.massacre.com.ar/'}, {'type': 'other databases', 'target': 'http://www.rock.com.ar/artistas/massacre'}, {'type': 'setlistfm', 'target': 'http://www.setlist.fm/setlists/massacre-33d4c475.html'}, {'type': 'myspace', 'target': 'https://myspace.com/massacrepalestina'}, {'type': 'streaming music', 'target': 'https://open.spotify.com/artist/0UAAJKwQZz8jVDoVtly8NA'}, {'type': 'social network', 'target': 'https://twitter.com/massacreoficial'}, {'type': 'discogs', 'target': 'https://www.discogs.com/artist/3058310'}, {'type': 'social network', 'target': 'https://www.facebook.com/MassacreOficial'}, {'type': 'wikidata', 'target': 'https://www.wikidata.org/wiki/Q6005496'}]
a
[{'type': 'other databases', 'target': 'http:

[{'type': 'official homepage', 'target': 'http://www.jessiuribe.com/'}, {'type': 'purchase for download', 'target': 'https://itunes.apple.com/co/artist/id998331667'}, {'type': 'streaming music', 'target': 'https://open.spotify.com/artist/3SN7I8KV2qBwTCZ4aNDcbS'}, {'type': 'social network', 'target': 'https://twitter.com/jessiuribemusic'}, {'type': 'streaming music', 'target': 'https://www.deezer.com/artist/7979540'}, {'type': 'discogs', 'target': 'https://www.discogs.com/artist/5398554'}, {'type': 'social network', 'target': 'https://www.facebook.com/jessiuribemusic'}]
a
[{'type': 'allmusic', 'target': 'http://www.allmusic.com/artist/mn0000883863'}, {'type': 'wikipedia', 'target': 'https://es.wikipedia.org/wiki/Noel_Petro'}, {'type': 'streaming music', 'target': 'https://open.spotify.com/artist/391Vew4S1ezIPjVZiygMDI'}, {'type': 'discogs', 'target': 'https://www.discogs.com/artist/1160017'}]
a
[{'type': 'purchase for download', 'target': 'https://itunes.apple.com/co/artist/id597689734'

a
[{'type': 'vgmdb', 'target': 'http://vgmdb.net/artist/22754'}, {'type': 'VIAF', 'target': 'http://viaf.org/viaf/29717780'}, {'type': 'allmusic', 'target': 'http://www.allmusic.com/artist/mn0001428305'}, {'type': 'BBC Music page', 'target': 'http://www.bbc.co.uk/music/artists/833c8d5e-9524-41a6-bf58-95bb1b1b29e5'}, {'type': 'IMDb', 'target': 'http://www.imdb.com/name/nm5412188/'}, {'type': 'last.fm', 'target': 'http://www.last.fm/music/Maurice+Durufl%C3%A9'}, {'type': 'image', 'target': 'https://commons.wikimedia.org/wiki/File:Durufle.jpg'}, {'type': 'wikipedia', 'target': 'https://en.wikipedia.org/wiki/Maurice_Durufl%C3%A9'}, {'type': 'purchase for download', 'target': 'https://itunes.apple.com/gb/artist/id579902'}, {'type': 'streaming music', 'target': 'https://open.spotify.com/artist/7Fph7U6qidZ2E97xKKsD4m'}, {'type': 'other databases', 'target': 'https://rateyourmusic.com/artist/maurice_durufle'}, {'type': 'discogs', 'target': 'https://www.discogs.com/artist/319439'}, {'type': 'wi



## Direct data downloading from the MusicBrainz database

As pointed out before, even the local server instance was very slow. The solution was to connect directly to the downloaded MusicBrainz database. This solution was not perfect either: the database is not intended to be used this way, so the documentation of its structure is very general and limited, while the structure itself is very complicated. After some reverse engineering, all the required tables and fields were found. The data from the different tables was then merged and saved in MongoDB, following the convention of one entry per artist containing all the information about that artist. (The exact form of the gathered data with all fields will be explained here: TODO - add link to explanation)

In [40]:
import psycopg2

# Connect to an existing database
conn = psycopg2.connect(dbname="musicbrainz_db",user='musicbrainz', host="localhost",password="musicbrainz")
   

# Open a cursor to perform database operations
cur = conn.cursor()

# Create lookup dictionaries for artist type, gender, ISO codes and areas
artist_types = {}
artist_gender = {}
iso_3661_1 = {}
iso_3661_2 = {}
areas = {}

# Artist type ID -> type name
cur.execute("SELECT id, name FROM artist_type")
types = cur.fetchall()
for t in types:
    artist_types[t[0]] = t[1]

# Gender ID -> gender name
cur.execute("SELECT id, name FROM gender")
genders = cur.fetchall()
for g in genders:
    artist_gender[g[0]] = g[1]

# Area ID -> ISO 3166-1 code
cur.execute("SELECT area, code FROM iso_3166_1")
is1 = cur.fetchall()
for area in is1:
    iso_3661_1[area[0]] = area[1]

# Area ID -> ISO 3166-2 code
cur.execute("SELECT area, code FROM iso_3166_2")
is2 = cur.fetchall()
for area in is2:
    iso_3661_2[area[0]] = area[1]

# Areas of type 1 get their ISO 3166-1 code and areas of type 2 their ISO 3166-2 code;
# for all other areas only the type ID is kept
cur.execute("SELECT id, name, type FROM area")
for area in cur:
    if area[2] == 1:
        areas[area[0]] = {'name': area[1], 'iso_366_1': iso_3661_1.get(area[0], None)}
    elif area[2] == 2:
        areas[area[0]] = {'name': area[1], 'iso_366_2': iso_3661_2.get(area[0], None)}
    else:
        areas[area[0]] = {'name': area[1], 'area_type': area[2]}

print(len(areas))

117173


In [8]:
collection = db.get_collection("artists_mb")

In [44]:
cur.execute("SELECT id,gid,name,begin_date_year,type,area,gender,begin_area FROM artist;")
for art in cur:
    artist = {
        '_id': art[1],
        'db_id': art[0],
        'name': art[2],
        'begin_date_year': art[3],
        'type': artist_types.get(art[4], 'unknown'),
        'area': areas.get(art[5], 'unknown'),
        'gender': artist_gender.get(art[6], 'unknown')
    }
    collection.insert_one(artist)
    

In [75]:
for artist in collection.find():
    # Find the IDs of all URL rows linked to this artist (parameterized instead of string formatting)
    cur.execute("SELECT entity1 FROM l_artist_url WHERE entity0 = %s", (artist['db_id'],))
    ids = [row[0] for row in cur.fetchall()]
    if len(ids) == 0:
        continue

    # Resolve the URL IDs to the actual URLs
    sql = "SELECT url FROM url WHERE id IN %s"
    cur.execute(sql, (tuple(ids),))
    urls = [u[0] for u in cur]

    # Store only the new field instead of rewriting the whole document
    collection.update_one({'_id': artist['_id']}, {"$set": {'urls': urls}}, upsert=False)
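
The loop above issues two queries per artist. A possible alternative, sketched here under the assumption that `l_artist_url.entity1` references `url.id` (as the cell above implies), is to fetch all (artist ID, URL) pairs with one JOIN and group them in Python. The query text and the `group_urls` helper are illustrative and were not run against the real database; the rows below are made up.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical single query replacing the per-artist lookups.
URLS_QUERY = """
    SELECT l.entity0, u.url
    FROM l_artist_url AS l
    JOIN url AS u ON u.id = l.entity1
"""


def group_urls(rows: Iterable[Tuple[int, str]]) -> Dict[int, List[str]]:
    """Group (artist_db_id, url) rows into one URL list per artist."""
    grouped = defaultdict(list)
    for artist_id, url in rows:
        grouped[artist_id].append(url)
    return dict(grouped)


# Made-up rows shaped like the result of URLS_QUERY.
rows = [
    (1, 'https://example.org/a'),
    (2, 'https://example.org/b'),
    (1, 'https://example.org/c'),
]
urls_by_artist = group_urls(rows)
print(urls_by_artist)
```

Each entry of `urls_by_artist` could then be written with one `update_one` call per artist, keyed on `db_id`.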

In [73]:
conn.rollback()

In [9]:
spotify_ids = 0
wiki_and_spotify = 0
wikidata = 0

spotify_collection = db.get_collection('artist_spotify')

for artist in collection.find({'urls': {'$exists': True}}):
    for i in range(0, len(artist['urls'])):
        if 'spotify' in artist['urls'][i]:
            #add field with direct id from spotify and update entry in database
            artist['spotify_id'] = helpers.get_id(artist['urls'][i])[:22]
            #collection.update_one({'_id':artist['_id']}, {"$set": artist}, upsert=False)
            spotify_ids +=1
            
        if 'wikidata' in artist['urls'][i]:
            artist['wiki_id'] = helpers.get_id(artist['urls'][i])
            #collection.update_one({'_id':artist['_id']}, {"$set": artist}, upsert=False)
            wikidata += 1

print(spotify_ids)
print(wikidata)

KeyboardInterrupt: 

## Find Spotify IDs in Wikidata

One of the pieces of information in MusicBrainz was the Spotify ID, which made it possible to link MusicBrainz and Spotify entries directly. Some of the artists also had a Wikidata ID, so the idea was to search for Spotify IDs in Wikidata as well.
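
Reading a property from a Wikidata entity means walking the nested `claims` structure (`claims -> property -> statement -> mainsnak -> datavalue -> value`). The sketch below wraps that walk in two small helpers and skips statements that carry no `datavalue`. The helper names and the sample entity are made up for illustration; `P1902` (Spotify ID), `P21` (gender) and `P136` (genre) are the property IDs used in this notebook.

```python
from typing import Any, List, Optional


def claim_values(claims: dict, prop: str) -> List[Any]:
    """Return the values of all statements for one property, skipping empty ones."""
    values = []
    for statement in claims.get(prop, []):
        datavalue = statement.get('mainsnak', {}).get('datavalue')
        if datavalue is not None:  # statements without a value have no 'datavalue'
            values.append(datavalue.get('value'))
    return values


def first_claim(claims: dict, prop: str) -> Optional[Any]:
    """Return the value of the first usable statement, or None if there is none."""
    values = claim_values(claims, prop)
    return values[0] if values else None


# Made-up claims shaped like the 'claims' part of a Wikidata entity.
claims = {
    'P1902': [{'mainsnak': {'datavalue': {'value': '3rTwpDHftjGIiIcpMYhgtq'}}}],
    'P21': [{'mainsnak': {'snaktype': 'somevalue'}}],
    'P136': [
        {'mainsnak': {'datavalue': {'value': {'id': 'Q100'}}}},
        {'mainsnak': {'datavalue': {'value': {'id': 'Q200'}}}},
    ],
}
spotify_id = first_claim(claims, 'P1902')
gender = first_claim(claims, 'P21')
genre_ids = [value['id'] for value in claim_values(claims, 'P136')]
print(spotify_id, gender, genre_ids)
```

Returning `None` for a missing value keeps one incomplete statement from discarding the other fields of the same artist.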

In [11]:
from wikidata.client import Client
from IPython.display import clear_output

In [None]:
client = Client()

wikidata_collection = db.get_collection("artist_wikidata")
for ids, artist in enumerate(collection.find({'wiki_id': {'$exists': True}})):
    print(ids)

In [16]:
client = Client()

wikidata_collection = db.get_collection("artist_wikidata")

for ids, artist in enumerate(collection.find({'wiki_id': {'$exists': True}})[130500:]):
    clear_output(wait=True)
    print(ids)
    try:
        entity = client.get(artist['wiki_id'], load=True)
    except Exception as err:
        print(err)
        continue
    #print(entity.attributes)
    artist_wiki = {'_id':artist['wiki_id']}
    artist_dict = entity.attributes
    try:
        claims = artist_dict.get('claims')
        lookup_keys = claims.keys()
    except Exception as err:
        print(err)
        continue

    try:
        # P21: sex or gender
        if 'P21' in lookup_keys:
            artist_wiki['gender_id'] = claims['P21'][0].get('mainsnak').get('datavalue').get('value').get('id')
        # P569: date of birth
        if 'P569' in lookup_keys:
            artist_wiki['date_of_birth'] = claims['P569'][0].get('mainsnak').get('datavalue').get('value').get('time')
            #artist_wiki['calendar_model_birth'] = claims['P569'][0].get('mainsnak').get('datavalue').get('value').get('calendarmodel')
        # P19: place of birth
        if 'P19' in lookup_keys:
            artist_wiki['place_of_birth_id'] = claims['P19'][0].get('mainsnak').get('datavalue').get('value').get('id')
        # P136: genre (an artist can have several)
        if 'P136' in lookup_keys:
            artist_wiki['genres_ids'] = [
                claim.get('mainsnak').get('datavalue').get('value').get('id')
                for claim in claims['P136']
            ]
        # P434: MusicBrainz artist ID
        if 'P434' in lookup_keys:
            artist_wiki['musicbrainz_id'] = claims['P434'][0].get('mainsnak').get('datavalue').get('value')
        # P1902: Spotify artist ID
        if 'P1902' in lookup_keys:
            artist_wiki['spotify_id'] = claims['P1902'][0].get('mainsnak').get('datavalue').get('value')

        wikipedia_keys = artist_dict.get('sitelinks').keys()
        url_list=[]
        for k in wikipedia_keys:
            url_list.append(artist_dict['sitelinks'][k]['url'])
        artist_wiki['wikipedia_urls'] = url_list
            
    except Exception as err:
        print(err)
        wikidata_collection.update_one({'_id':artist_wiki['_id']}, {"$set": artist_wiki}, upsert=True)
        continue
        
    wikidata_collection.update_one({'_id':artist_wiki['_id']}, {"$set": artist_wiki}, upsert=True) 
    #print(ids)
    #print(artist_wiki)
    

13905


## Create collection of artists with Spotify ID

To make it easier to merge data from all sources, a new collection for Spotify artists was created. It consists of two main fields: the Spotify ID (`_id`) and a reference to the Musicbrainz entry (`mb_name` in the cells below).
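
A minimal sketch of how such a collection can be assembled, using a plain dictionary in place of MongoDB. `spotify_id_from_url` and `upsert` are hypothetical names introduced here (the notebook itself relies on `helpers.get_id` and `update_one(..., upsert=True)`), and the URLs and IDs are invented sample values. The sketch assumes artist URLs of the form `https://open.spotify.com/artist/<id>` with a 22-character ID.

```python
from urllib.parse import urlparse


def spotify_id_from_url(url):
    """Return the 22-character artist ID from a Spotify artist URL, or None."""
    parsed = urlparse(url)
    if 'spotify' not in parsed.netloc:
        return None
    parts = [p for p in parsed.path.split('/') if p]
    # Expect a path such as /artist/<id>
    if len(parts) < 2 or parts[0] != 'artist':
        return None
    candidate = parts[1][:22]
    return candidate if len(candidate) == 22 else None


def upsert(store, key, fields):
    """Mimic update_one({'_id': key}, {'$set': fields}, upsert=True) on a dict."""
    store.setdefault(key, {'_id': key}).update(fields)


spotify_artists = {}

# Source 1: URLs attached to Musicbrainz artists (invented sample data).
mb_artists = [
    {'name': 'Example Artist',
     'urls': ['https://open.spotify.com/artist/0123456789abcdefghijkl',
              'https://www.wikidata.org/wiki/Q1']},
]
for artist in mb_artists:
    for url in artist['urls']:
        sid = spotify_id_from_url(url)
        if sid is not None:
            upsert(spotify_artists, sid, {'mb_name': artist['name']})

# Source 2: Spotify IDs found in Wikidata (invented sample data).
for sid in ['0123456789abcdefghijkl', 'ZYXWVUTSRQPONMLKJIHGFE']:
    upsert(spotify_artists, sid, {})

print(len(spotify_artists))
```

Because both sources write to the same key, an artist found in both ends up as a single entry, and an ID that appears only in Wikidata is stored without a Musicbrainz name.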
 

In [17]:
for artist in collection.find({'urls': {'$exists': True}}):
    for url in artist['urls']:
        if 'spotify' in url:
            # Key the entry by the Spotify ID and keep the Musicbrainz name alongside it
            spotify_artist = {
                '_id': helpers.get_id(url)[:22],
                'mb_name': artist['name'],
            }
            #print(spotify_artist)
            spotify_collection.update_one({'_id':spotify_artist['_id']}, {"$set": spotify_artist}, upsert=True)

In [18]:
for artist in wikidata_collection.find({'spotify_id': {'$exists': True}}):
    spotify_artist = {'_id': artist['spotify_id']}
    spotify_collection.update_one({'_id':spotify_artist['_id']}, {"$set": spotify_artist}, upsert=True)

## Add geolocation data to artists

This part of the code was used only to check how many entries already had geolocation data available (without additional searching and downloading). The 'true' geolocation data was added here: TODO - add link

In [33]:
import pandas as pd

In [43]:
unique_artists = pd.read_csv('unique_artists.txt', sep=",", header=None)
unique_artists.columns = ["artist_id", "artist_mbid", "track", "name"]

artist_location = pd.read_csv('artist_location.txt', sep="\v", header=None)
artist_location.columns = ["artist_id", "latitude", "longtitude", "name","place"]


joined = pd.merge(unique_artists, artist_location, on='artist_id')
print(joined.shape)
print(unique_artists.shape)
print(artist_location.shape)

(13850, 8)
(44745, 4)
(13850, 5)
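
The same check can also be expressed as a left join that keeps every artist and marks the ones with a location. The sketch below uses small invented frames in place of `unique_artists.txt` and `artist_location.txt`, with only a subset of the columns; `indicator=True` is a standard `pd.merge` option that records whether each row found a match.

```python
import pandas as pd

# Invented stand-ins for the two files loaded above.
unique_artists = pd.DataFrame({
    'artist_id': ['AR1', 'AR2', 'AR3', 'AR4'],
    'name': ['A', 'B', 'C', 'D'],
})
artist_location = pd.DataFrame({
    'artist_id': ['AR2', 'AR4'],
    'latitude': [52.23, 40.71],
    'longitude': [21.01, -74.01],
})

# A left join keeps every artist; the '_merge' column marks rows with a location.
merged = pd.merge(unique_artists, artist_location, on='artist_id',
                  how='left', indicator=True)
with_location = int((merged['_merge'] == 'both').sum())
coverage = with_location / len(unique_artists)
print(f"{with_location} of {len(unique_artists)} artists have coordinates ({coverage:.0%})")
```

Applied to the shapes printed above, 13850 matched rows out of 44745 artists would correspond to roughly 31% coverage, assuming `artist_id` is unique in both files.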
