# PROJECT - *My Way* of seeing music covers
#### Pierre-Antoine Desplaces, Anaïs Ladoy, Lou Richard

In [1]:
# Import libraries
import pandas as pd
import numpy as np
import json
import matplotlib.pyplot as plt
import seaborn as sns
import folium
from io import StringIO
import sys
import requests
from bs4 import BeautifulSoup
import pickle
import os
import glob
from pathlib import Path

## Notebook plan
1. Data importation
2. Clique organisation (Multi-level indexing)
3. Addition of the language and the year of each track (SHS website web-scraping)
4. Addition of the tempo and song hotness of each track (Access to track files through the cluster)
5. Determine artist location for spatial analysis
6. Addition of the genre for each track (Use of LastFM dataset and external website for genre listing)

### 1.Data importation
Download available additional files containing metadata about our dataset from the cluster (dataset/million-songs_untar/)
- tracks_per_year.txt
- unique_tracks.txt
- unique_artists.txt
- artist_location.txt

Use the Second Hand Songs (SHS) dataset that was created through a collaboration between the Million Songs team and the Second Hand Songs website (https://secondhandsongs.com/). These data are splitted into two datasets to allowed machine learnings algorithms (a train and a test set).
- SHS_testset.txt
- SHS_trainset.txt

The use of external dataset (LastFM) for the genres and the use of the track files (.h5) available through the cluster are commented in part 4 and 5.

- All the additional files were downloaded from the cluster giving all the metadata of the Million Songs dataset. They will help to elaborate a plan and a script will then search more information about a specific track (h5 files in the cluster) maybe using cluster cpu. The path to access to a track in the cluster is for example million-songs/data/A/A/A (with the 3 letters at the end being the 3rd, 4th and 5th letter on the track id).
- The music covers will be detected using another dataset (SecondHandSongs), we have the choice to use the downloadable dataset containing 18,196 tracks (all with a connection to the MSD dataset), or to web-scrapp the SHS website (https://secondhandsongs.com/) where we have much more information (522 436 covers) but not necessarly connected to our MSD dataset. The SHS API is RESTful (return a JSON object) and we are limited to 100 requests per minute and 1000 requestion per hour but we can contact them to remove limitation.
- Some artist are geolocalised (30% of the MSD total artists) on the artist_location dataframe.

In [3]:
#Load Additional files
tracks_per_year=pd.read_csv('data/AdditionalFiles/tracks_per_year.txt',delimiter='<SEP>',engine='python',header=None,index_col=1,names=['year','trackID','artist','title'])
unique_tracks=pd.read_csv('data/AdditionalFiles/unique_tracks.txt',delimiter='<SEP>',engine='python',header=None,index_col=0,names=['trackID','songID','artist','title'])
unique_artists=pd.read_csv('data/AdditionalFiles/unique_artists.txt',delimiter='<SEP>',engine='python',header=None,index_col=0,names=['artistID','artistMID','randomTrack','name'])
artist_location=pd.read_csv('data/AdditionalFiles/artist_location.txt',delimiter='<SEP>',engine='python',header=None,index_col=0,names=['artistID','lat','long','name','location'])

In [4]:
#Check if indexes is unique and print the number of elements for each dataframe
print('Dataframe (Unique index, Number of elements)')
print('tracks_per_year ',(tracks_per_year.index.is_unique,tracks_per_year.shape[0]))
print('unique_tracks ',(unique_tracks.index.is_unique,unique_tracks.shape[0]))
print('unique_artists ',(unique_artists.index.is_unique,unique_artists.shape[0]))
print('artist_location ',(artist_location.index.is_unique,artist_location.shape[0]))

Dataframe (Unique index, Number of elements)
tracks_per_year  (True, 515576)
unique_tracks  (True, 1000000)
unique_artists  (True, 44745)
artist_location  (True, 13850)


The covers dataset (SHS_testset.txt and SHS_trainset.txt) were organised in a very special way where group (named "cliques") list some tracks that are interrelated (music covers and original track). The function **read_shs_files** is used to import the files keeping the "clique" configuration.

In [5]:
def read_shs_files(pathToFile):
    f = open(pathToFile)
    s = StringIO()
    cur_ID = None
    for ln in f:
        if not ln.strip():
                continue
        if ln.startswith('%'):
                cur_ID = ln.replace('\n','<SEP>',1)
                continue
        if cur_ID is None:
                print ('NO ID found')
                sys.exit(1)
        s.write(cur_ID + ln)
    s.seek(0)
    df = pd.read_csv(s,delimiter='<SEP>',engine='python',header=None,names=['shsID','trackID','artistID','shsPerf'])
    return df

In [6]:
#Import the two SHS datasets and concatenate them
SHS_testset=read_shs_files('data/SHS_testset.txt')
SHS_trainset=read_shs_files('data/SHS_trainset.txt')
covers=pd.concat([SHS_testset,SHS_trainset])
covers.shsID=covers.shsID.str.strip('%')
covers.head()

Unnamed: 0,shsID,trackID,artistID,shsPerf
0,"115402,74782, Putty (In Your Hands)",TRJVDMI128F4281B99,AR46LG01187B98DB5D,74784
1,"115402,74782, Putty (In Your Hands)",TRNJXCO128F92E1930,ARQD13K1187B98E441,138584
2,"24350, I.G.Y. (Album Version)",TRIBOIS128F9340B19,ARUVZYG1187B9B2809,24350
3,"24350, I.G.Y. (Album Version)",TRGXZDU128F9301E53,AR4LE591187FB3FCFB,24363
4,"79178, When The Catfish Is In Bloom",TRQSIOY128F92FACA7,ARU75JD1187FB38B79,79178


### 2. Clique organisation (Multi-level indexing)
As said before, the cover dataset is organised as cliques that group some specific tracks that are interrelated. 
In order to keep this structure, we decide to use a multilevel index with cliques (need to transform shsID category in int) and then use the ranking according the released date of the track (year attribute) for the second index (thus, 0 will be the original song).

In [6]:
#Convert shsID to clique id (first convert to category and get a code)
covers=covers.assign(clique_id=(covers.shsID.astype('category')).cat.codes)
#Remove the shsID column (useless since we have the clique_id now)
covers.drop('shsID',axis=1,inplace=True)

In order to use the informations contained in the metadata files first, we merged some necessary attributes (name of the artist, title of the track, released date) from the MSD dataframes named before.

In [7]:
#Merge with unique_artists dataframe to find the artist name for each track (no taking consideration of featuring since we take only the name of the artist assigned with the track)
covers=covers.merge(unique_artists[['name']],how='left',left_on='artistID',right_index=True)
#Merge with unique_tracks dataframe to find the track name
covers=covers.merge(unique_tracks[['title']],how='left',left_on='trackID',right_index=True)
#Merge with tracks_per_year dataframe to find the year of each track
covers=covers.merge(tracks_per_year[['year']],how='left',left_on='trackID',right_index=True)

In [8]:
covers.head()

Unnamed: 0,shsID,trackID,artistID,shsPerf,name,title,year
0,"115402,74782, Putty (In Your Hands)",TRJVDMI128F4281B99,AR46LG01187B98DB5D,74784,The Detroit Cobras,Putty (In Your Hands),1998.0
1,"115402,74782, Putty (In Your Hands)",TRNJXCO128F92E1930,ARQD13K1187B98E441,138584,Sylvie Vartan,Ne Le Déçois Pas,1962.0
2,"24350, I.G.Y. (Album Version)",TRIBOIS128F9340B19,ARUVZYG1187B9B2809,24350,Donald Fagen,I.G.Y. (Album Version),1982.0
3,"24350, I.G.Y. (Album Version)",TRGXZDU128F9301E53,AR4LE591187FB3FCFB,24363,Take 6,Beautiful World (Album Version),
4,"79178, When The Catfish Is In Bloom",TRQSIOY128F92FACA7,ARU75JD1187FB38B79,79178,John Fahey,When The Catfish Is In Bloom,1968.0


Here is printed some useful informations about the cover dataset (the basis of our work) :

In [9]:
print('Number of tracks :', covers.shape[0])
print('Number of cliques :', max(covers.index)+1) #Number of cliques (+1 because id starts at 0)
print('Number of unique tracks :', len(covers.trackID.unique())) 
print('Number of unique artists :', len(covers.artistID.unique()))
print('Number of missing trackID :', len(covers[covers.trackID.isnull()]))
print('Number of missing artistID :', len(covers[covers.artistID.isnull()]))
print('Number of missing years :', len(covers[covers.year.isnull()]))

Number of tracks : 18196
Number of cliques : 12960
Number of unique tracks : 18196
Number of unique artists : 5578
Number of missing trackID : 0
Number of missing artistID : 0
Number of missing years : 4796


In [9]:
covers=covers.sort_values(['clique_id', 'year'], ascending=[True, True]).reset_index() #Reset index according clique_id and year
covers.drop('index',axis=1,inplace=True) #Drop the previous index

### 3. Addition of the language and the year of each track (SHS website web-scraping)
With the results printed above, we noticed that 4796 years were missing (26%) and since we thought to detect the original song by ranking the tracks year in each clique, we needed to find a way to get them.
Furthermore, year isn't necessarly sufficient informations to discriminate the tracks (cover appears sometimes in the same year than the original one), thus it will be better to have the released date for ALL the tracks if the information is available in the SHS website.

Indeed, the Second Hand Songs website allows API request on their database (limited to 1000 requests per hour).
Each track has a performance page where we can have access additional informations about the track as the language, the released date and the original song of the specific cover. In the SHS website, the performance id (that is used in the URL to access to the performance page) is available in our cover dataframe (shsPerf).
Nevertheless, we have negative values of shsPerf and it corresponds to missing values.

Thus, we have two ways to access extract the language/year/original song via web-scrapping :
- For valid SHS performance ID, access to the performance page (e.g. 'https://secondhandsongs.com/performance/1983') and web-scrapping of the Language and Released date informations using the perfInfo() function.
- For invalid SHS performance ID, API request to the search page (e.g. 'https://secondhandsongs.com/search/performance?title=blackbird&performer=beatles'), extract the perf ID with the find_PerfID() and then use the perfInfo() function.

In [10]:
print('Number of missing years with valid shsPerf (API request on the performance page) :',len(covers[(covers.year.isnull()) & (covers.shsPerf != -1)]))
print('Number of missing years with invalid shsPerf (API request on the search page to find shsPerf) :',len(covers[(covers.year.isnull())])-len(covers[(covers.year.isnull()) & (covers.shsPerf != -1)]))

Number of missing years with valid shsPerf (API request on the performance page) : 4128
Number of missing years with invalid shsPerf (API request on the search page to find shsPerf) : 668


In [11]:
pickle.dump(covers,open('covers.p','wb'))

In order to test these algorithms, we have worked for now with a part of the cover dataframe (part dataframe), containing only 962 tracks but being representative of the covers dataframe.

In [12]:
#Work with a subset a the dataframe to create the algorithms
part=covers[2055:3017]

#Merge with the unique_tracks dataframe to get the name of the artist for the track (take featuring as well), it will be useful for the find_shsPerf function 
part=part.merge(unique_tracks[['artist']],how='left',left_on='trackID',right_index=True)
part.head()

Unnamed: 0,trackID,artistID,shsPerf,clique_id,name,title,year,artist
2055,TRWJCBY128F4261C0F,AR11YQ81187FB3C654,100068,940,Dixie Chicks,Tonight The Heartache's On Me,1998.0,Dixie Chicks
2056,TRSSNRG12903CC8518,ARHAL3V1187B9AA462,100066,940,Joy Lynn White,Tonight The Heartache's On Me,,Joy Lynn White
2057,TRBWTHA128E0791A0B,ARLAUED1187B9ACEAF,30050,941,Eric Clapton,Willie And The Hand Jive,1974.0,Eric Clapton
2058,TRHFTUX128F93010D0,ARB5U6G1187B9A994C,10009,941,Johnny Otis Show,Willie And The Hand Jive,1984.0,The Johnny Otis Show
2059,TRGDJUM12903CC5CD9,ARRHVVL1187B991E41,-1,941,Johnny Otis,Willie And The Hand Jive,1991.0,Johnny Otis


In [13]:
print('Number of cliques in the subset :', len(part.clique_id.unique()))
print('Number of tracks in the subset :', part.shape[0])
print('Number of missing years in the subset :', len(part[part.year.isnull()]))
print('Number of invalid shsPerf in the subset :', len(part[part.shsPerf<0]))

Number of cliques in the subset : 298
Number of tracks in the subset : 962
Number of missing years in the subset : 269
Number of invalid shsPerf in the subset : 29


In [14]:
#API request to find the SHS perf for the unvalid ones (negative values)
def find_shsPerf(x):
    title=part.iloc[x]['title']
    artist=part.iloc[x]['artist']
    shsPerf=part.iloc[x]['shsPerf']
    
    if shsPerf<0:
        title=title.replace('.', '').replace('_', '').replace('/', '').lower().replace(' ','+')
        artist=artist.replace('.', '').replace('_', '').replace('/', '').lower().replace(' ','+')
        r=requests.get('https://secondhandsongs.com/search/performance?title='+title+'&op_title=contains&performer='+artist+'&op_performer=contains')
        soup = BeautifulSoup(r.text, 'html.parser')
        results=soup.find('tbody')

        if results is None :
            new_shsPerf=0
        else:
            new_shsPerf=int(results.find('a',attrs={'class':'link-performance'})['href'].split('/')[2])
    else :
        new_shsPerf=shsPerf
        
    return new_shsPerf

In [15]:
#Find the shsPerf for the tracks which doesn't have valid ones (substract 2055 to part dataframe index to start with index=0)
#part.shsPerf=part.index.map(lambda x: find_shsPerf(x-2055)) 

In [16]:
#pickle.dump(part,open('data/part_shsPerf.p','wb'))
part = pickle.load(open("data/part_shsPerf.p","rb"))

In [17]:
part[part.shsPerf==0]

Unnamed: 0,trackID,artistID,shsPerf,clique_id,name,title,year,artist
2120,TRUSHGG128F92FB357,ARQCKT31187B98906B,0,956,Hurl,Understand,1999.0,Hurl
2475,TRXHGOY128F428F943,ARWIA8D1187B990A0C,0,1080,STRATOVARIUS,I surrender,2001.0,STRATOVARIUS
2493,TRITIQV12903CC9D01,ARSF0K11187B9AF319,0,1087,James Taylor,Don't Let Me Be Lonely Tonight,,John Sawyer
2513,TRVJRTA128F1466888,AR6001N1187B9A8632,0,1091,Tina Turner,Ball Of Confusion (That's What The World Is To...,,Tina Turner
2581,TRDQMYR128F92F9DF3,ARMKGXT11F4C8428AD,0,1115,Nat King Cole Trio,Route 66,,Nat King Cole Trio


In [18]:
#Complete if possible for the missing informations by searching manually on the website (and let the missing ones to 0)
part.loc[part.index==2493,'shsPerf']=10717
part.loc[part.index==2513,'shsPerf']=46614
part.loc[part.index==2581,'shsPerf']=10838

In [19]:
#API request to SHS website for the page of a specific performance (defined as shsPerf) to extract Language and Date
def perfInfo_SHS(shsPerf):
    if shsPerf==0:
        perfLanguage='Unavailable'
        perfDate='Unavailable'
        original_shsPerf='Unavailable'
        
    else :
        r = requests.get('https://secondhandsongs.com/performance/'+str(shsPerf))
        soup = BeautifulSoup(r.text, 'html.parser')
        perfMeta=soup.find('dl',attrs={'class':'dl-horizontal'})
        if perfMeta is None:
            perfLanguage='Missing'
            perfDate='Missing'
            original_shsPerf='Missing'
        else :
            perfLanguage=perfMeta.find('dd',attrs={'itemprop':'inLanguage'})
            if perfLanguage is None :
                perfLanguage='Missing'
            else :
                perfLanguage=perfLanguage.text

            perfDate=perfMeta.find('div',attrs={'class':'media-body'})
            if perfDate is None :
                perfDate='Missing'
            else :
                perfDate=perfDate.find('p').text.split('\n')[2].strip(' ')

            original_shsPerf=soup.find('section',attrs={'class':'work-originals'})
            if original_shsPerf is None :
                original_shsPerf='Missing'
            else :
                original_shsPerf=original_shsPerf.find('a',attrs={'class':'link-work'})['href'].split('/')[2]

    return perfLanguage,perfDate,original_shsPerf

In [20]:
#part['language'], \
#part['date'], \
#part['original_shsPerf']= zip(*part.shsPerf.map(perfInfo_SHS))

In [21]:
#pickle.dump(part,open('data/part_withLangYear.p','wb'))

In [22]:
part = pickle.load(open("data/part_withLangYear.p","rb"))

In [23]:
part

Unnamed: 0,trackID,artistID,shsPerf,clique_id,name,title,year,artist,language,date,original_shsPerf
2055,TRWJCBY128F4261C0F,AR11YQ81187FB3C654,100068,940,Dixie Chicks,Tonight The Heartache's On Me,1998.0,Dixie Chicks,English,"January 27, 1998",100066
2056,TRSSNRG12903CC8518,ARHAL3V1187B9AA462,100066,940,Joy Lynn White,Tonight The Heartache's On Me,,Joy Lynn White,English,September 1994,Missing
2057,TRBWTHA128E0791A0B,ARLAUED1187B9ACEAF,30050,941,Eric Clapton,Willie And The Hand Jive,1974.0,Eric Clapton,English,1974,10009
2058,TRHFTUX128F93010D0,ARB5U6G1187B9A994C,10009,941,Johnny Otis Show,Willie And The Hand Jive,1984.0,The Johnny Otis Show,English,April 1958,Missing
2059,TRGDJUM12903CC5CD9,ARRHVVL1187B991E41,10009,941,Johnny Otis,Willie And The Hand Jive,1991.0,Johnny Otis,English,April 1958,Missing
2060,TRMXFLN128F4270672,ARSRSPK1187B995ECD,183425,941,New Riders of The Purple Sage,Willie And The Hand Jive,2004.0,New Riders of The Purple Sage,English,1972,10009
2061,TROZPHF128F9326F37,AR7ICFK1187B9955FF,57959,941,Levon Helm,Willie And The Hand Jive,,Levon Helm,English,1982,10009
2062,TRIDQVH128F92D46F9,ARZNC3M1187FB392CC,83018,941,Cliff Richard & The Shadows,Willie And The Hand Jive (2008 Digital Remaster),,Cliff Richard & The Shadows,English,"March 18, 1960",10009
2063,TRYJQMN128F930DE44,AREEUH01187B9B7F71,49456,942,Loggins & Messina,Danny's Song,1972.0,Loggins & Messina,English,November 1971,10017
2064,TRCMRBA128F424B553,AR6XWV21187FB4ACC5,10020,942,Anne Murray,Danny's Song,1973.0,Anne Murray,English,1973,10017


Here is the information we obtained for now. We noticed several problems that needs to be fixed later :
- The track year (year) is sometimes different form the released date (date) we've extracted from the SHS website, we will prefer the data information found in the SHS website.
- Some languages and some dates are missing (we will consider droping these covers or restrict our analyse concerning these parameters to only a subset of cliques).
- Some original performance that were found in the SHS website don't appear in the clique so need to add these tracks to our dataframe (find them in the cluster or webscrap directly to the SHS website)
- We may still have a problem to find the ranking (just discriminate the original track and do not sort the music covers).

The work for this part will be to extend the analysis to the entire cover dataframe, resolve problems cited above and finish the multilevel indexing using ranking according the date in each clique. The API request is limited for the Second Hand Songs (SHS) website to 1000 requests per hour. Due to the large number of requests needed (668 to resolve the missing SHS problem and 18196 to find the Language/Year/Original Song), we'll maybe ask to the SHS team an exception to remove this limitation.

Then, the goal is to add to this dataframe the informations we've extracted through the cluster (tempo, song hotness), through the LastFM dataset (genre) and through the artist_location table (country of the artist) with the methods we describe in the next section. 

Thus, we'll have all the informations required to start the analysis of our music covers.

In [10]:
#Take the clique id we defined as id of the dataframe (not unique index for now)
#covers.set_index('clique_id',inplace=True)
#covers.sort_index(inplace=True)

#Compute the order of songs for each clique
#covers['rank']=covers.groupby('clique_id')['year'].rank(method='dense',ascending=True).astype('int')

### 4. Access to files (tempo / dancability)

We open the first file of the subset, to check what the HDF5 keys are and then we read each of them.

In [7]:
with pd.HDFStore("data/MillionSongSubset/data/A/A/A/TRAAAAW128F429D538.h5") as hdf:
    print(hdf.keys())

['/analysis/songs', '/metadata/songs', '/musicbrainz/songs']


In [40]:
pd.read_hdf("data/MillionSongSubset/data/A/A/A/TRAAAAW128F429D538.h5","/analysis/songs")

Unnamed: 0,analysis_sample_rate,audio_md5,danceability,duration,end_of_fade_in,energy,idx_bars_confidence,idx_bars_start,idx_beats_confidence,idx_beats_start,...,key,key_confidence,loudness,mode,mode_confidence,start_of_fade_out,tempo,time_signature,time_signature_confidence,track_id
0,22050,a222795e07cd65b7a530f1346f520649,0.0,218.93179,0.247,0.0,0,0,0,0,...,1,0.736,-11.197,0,0.636,218.932,92.198,4,0.778,TRAAAAW128F429D538


In [21]:
pd.read_hdf("data/MillionSongSubset/data/A/A/A/TRAAAAW128F429D538.h5","/metadata/songs")

Unnamed: 0,analyzer_version,artist_7digitalid,artist_familiarity,artist_hotttnesss,artist_id,artist_latitude,artist_location,artist_longitude,artist_mbid,artist_name,artist_playmeid,genre,idx_artist_terms,idx_similar_artists,release,release_7digitalid,song_hotttnesss,song_id,title,track_7digitalid
0,,165270,0.581794,0.401998,ARD7TVE1187B99BFB1,,California - LA,,e77e51a5-4761-45b3-9847-2051f811e366,Casual,4479,,0,0,Fear Itself,300848,0.60212,SOMZWCG12A8C13C480,I Didn't Mean To,3401791


In [22]:
pd.read_hdf("data/MillionSongSubset/data/A/A/A/TRAAAAW128F429D538.h5","/musicbrainz/songs")

Unnamed: 0,idx_artist_mbtags,year
0,0,0


We only need to extract <tt>tempo</tt> and <tt>song_hotttnesss</tt>, here is an example of how to do that on the subset :

In [66]:
tempo = []
hotness = []

files = glob.glob("data/MillionSongSubset/data/A" + "/[A-Z]/[A-Z]/*")
for f in files:
    tempo.append(pd.read_hdf(f,"/analysis/songs")["tempo"][0])
    hotness.append(pd.read_hdf(f,"/metadata/songs")["song_hotttnesss"][0])

In [67]:
tempo = np.asarray(tempo)
hotness = np.asarray(hotness)
print(tempo)
print(hotness)

[ 124.059   80.084   54.874 ...,   96.064  149.012  162.653]
[ 0.54795294  0.47563847         nan ...,         nan         nan  0.270776  ]


In [71]:
print("Number of tracks =", len(files))
print("with missing hotness values =", np.sum(np.isnan(hotness)))

Number of files = 7620
with missing hotness values = 3268


We already got 3268 unknown hotness values and we only tested on a subset of 7620 song, so we can expect to have that information for only a little over half of our final dataset. Maybe we won't use it.

Once all the files are accessible on the cluster, we will have to go through our SHS dataset and get those attributes for each track_id.
We will do so in the following way : (the paths are just examples on the subset)

In [72]:
tempo = []
hotness = []

my_file = Path("/path/to/file")
for track in covers["trackID"]:
    folder1 = track[2]
    folder2 = track[3]
    folder3 = track[4]
    folder_path = "data/MillionSongSubset/data/" + folder1 + "/" + folder2 + "/" + folder3 + "/"
    track_path = folder_path + track + ".h5"
    if Path(track_path).exists(): #to delete later
        tempo.append(pd.read_hdf(track_path,"/analysis/songs")["tempo"][0])
        hotness.append(pd.read_hdf(track_path,"/metadata/songs")["song_hotttnesss"][0])

In [73]:
print(len(tempo))
print(np.sum(~np.isnan(hotness)))

204
128


Unsurprisingly, we only found 204 of those tracks in the subset and 128 of them have a hotness value.

### 5. Determine artist location for spatial analysis

In [3]:
#Load Additional files
#unique_artists=pd.read_csv('data/AdditionalFiles/unique_artists.txt',delimiter='<SEP>',engine='python',header=None,index_col=0,names=['artistID','artistMID','randomTrack','name'])
unique_artists=pd.read_csv('data/AdditionalFiles/unique_artists.txt',delimiter='<SEP>',engine='python',header=None,index_col=0,names=['artistID','artistMID','randomTrack','name'])
artist_location=pd.read_csv('data/AdditionalFiles/artist_location.txt',delimiter='<SEP>',engine='python',header=None,index_col=0,names=['artistID','lat','long','name','location'])
artist_location.head()

Unnamed: 0_level_0,lat,long,name,location
artistID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
ARZGXZG1187B9B56B6,-16.96595,-61.14804,Endless Blue,Santa Cruz
AR8K6F31187B99C2BC,46.44231,-93.36586,Go Fish,"Twin Cities, MN"
ARHJJ771187FB5B581,51.59678,-0.33556,Screaming Lord Sutch,"Harrow, Middlesex, England"
ARJ8YLL1187FB3CA93,40.69626,-73.83301,Morton Gould,"Richmond Hill, NY"
ARYBAGV11ECC836DAC,43.58828,-79.64372,Crash Parallel,Mississauga


We now load a subset of Second Hand Song

In [4]:
def read_shs_files(pathToFile):
    f = open(pathToFile)
    s = StringIO()
    cur_ID = None
    for ln in f:
        if not ln.strip():
                continue
        if ln.startswith('%'):
                cur_ID = ln.replace('\n','<SEP>',1)
                continue
        if cur_ID is None:
                print ('NO ID found')
                sys.exit(1)
        s.write(cur_ID + ln)
    s.seek(0)
    df = pd.read_csv(s,delimiter='<SEP>',engine='python',header=None,names=['shsID','trackID','artistID','shsPerf'])
    return df[['trackID', 'artistID', 'shsPerf']]

We retrieve the artists' names using the unique_artists.txt file and we assign a location for each track using the artist_location.txt.

In [5]:
def get_location(x) : 
    if x in artist_location.index:
        return artist_location.get_value(x, 'location')
    else : 
        return np.nan
    
data=read_shs_files('data/SHS_testset.txt')
data['artist'] = data['artistID'].map(lambda x : unique_artists.get_value(x, 'name'))
data['location'] = data['artistID'].map(lambda x : get_location(x))
data.head()

Unnamed: 0,trackID,artistID,shsPerf,artist,location
0,TRJVDMI128F4281B99,AR46LG01187B98DB5D,74784,The Detroit Cobras,
1,TRNJXCO128F92E1930,ARQD13K1187B98E441,138584,Sylvie Vartan,"Iskretz, Bulgaria"
2,TRIBOIS128F9340B19,ARUVZYG1187B9B2809,24350,Donald Fagen,"Passaic, NJ"
3,TRGXZDU128F9301E53,AR4LE591187FB3FCFB,24363,Take 6,
4,TRQSIOY128F92FACA7,ARU75JD1187FB38B79,79178,John Fahey,


We now create the function finding the country for each location. In order to do that we wille use three different python packages : pycountry, us, and geopy, as geopy.geocoders does not support too much requests. 

- First, we will use the pycountry package to extract countries if location contains one. 


- If we didn't match any country in pycountry, we will use the us package to check if a us state is present in the location. From the data, we have observed that if the location refer to a us state, the location is either only defined by the state, or the state is the last element of the location.


- If the two precedent methods does not succeed, we will use the geopy.geocoders package, using Nominatim( ).


- We will manually define countries for some location as they are sometimes mispelled, troncated or refer to a website link.

In [6]:
geolocator = Nominatim()

def get_country(x):
    if x == np.nan:
        return x
    x = x.replace("-", ",")
    for c in pycountry.countries:
        if "England" in x or "UK" in x: 
            return "United Kingdom"
        elif c.name.lower() in x.lower():
            return c.name
    refactorlast = x.split(",")[-1].replace(" ", "")
    refactorfirst = x.split(",")[0]
    usstatelast = us.states.lookup(refactorlast)
    usstatefirst = us.states.lookup(refactorfirst)
    if usstatelast != None or usstatefirst != None:
        return "United State of America"
    elif x == "Swingtown":
        return "United State of America"
    elif x == "<a href=\"http://billyidol.net\" onmousedown='UntrustedLink.bootstrap($(this), \"fc44f8f60d13ab68c56b3c6709c6d670\", event)' target=\"_blank\" rel=\"nofollow\">http://billyidol.net</a>":
        return "United Kingdom"
    elif x == "Lennox Castle, Glasgow" or x == "Knowle West, Bristol, Avon, Engla"\
        or x == "Goldsmith's College, Lewisham, Lo" or x == "Julian Lennon&#039;s Official Facebook Music Page"\
        or x == "Sydney, Moscow, Pressburg" or x == "Penarth, Wales to Los Angeles" or x == "Leicester, Leicestershire, Englan":
        return "United Kingdom"
    elif x == "Vancouver, British Columbia, Cana":
        return "Canada"
    elif x == "Washington DC" or x == "Philladelphia" or "New Jersey" in x:
        return "United State of America"
    elif "Czechoslovakia" in x :
        return "Česko"
    elif x == "Jaded Heart Town":
        return "Germany"
    elif x == "RU" or x == "Russia":
        return "Russia"
    else :
        location = geolocator.geocode(x, timeout=None)
        return location.address.split(",")[-1]

In [7]:
#data['country'] = data['location'].map(lambda x : get_country(x))

The only problem with geopy is that it returns a country in its native language. To uniform our data, we create a function that translates manually the countries in English.

In [8]:
def rename(x):
    if "België - Belgique - Belgien" in x:
        return "Belgium"
    elif "Brasil" in x:
        return "Brazil"
    elif "United State" in x:
        return "United States of America"
    elif "Italia" in x:
        return "Italy"
    elif "Norge" in x:
        return "Norway"
    elif "España" in x:
        return "Spain"
    elif "Nederland" in x :
        return "Netherlands"
    elif "Suomi" in x :
        return "Finland"
    elif "Sverige" in x :
        return "Sweden"
    elif "UK" in x :
        return "United Kingdom"
    elif x[0] == " ":
        return x[1:]
    else : 
        return x

In [9]:
#data['country'] = data['country'].map(lambda x : rename(x))
#pickle.dump(data, open( "data.p", "wb" ) )
data_country = pickle.load(open("data/data_country.p", "rb"))

In [10]:
data_country.head()

Unnamed: 0,shsID,trackID,artistID,shsPerf,artist,location,country
1,"115402,74782, Putty (In Your Hands)",TRNJXCO128F92E1930,ARQD13K1187B98E441,138584,Sylvie Vartan,"Iskretz, Bulgaria",Bulgaria
2,"24350, I.G.Y. (Album Version)",TRIBOIS128F9340B19,ARUVZYG1187B9B2809,24350,Donald Fagen,"Passaic, NJ",United States of America
7,"11012, Sheer Heart Attack",TRABVTG128F934AB80,AR9BVRM1187FB51139,97131,Hallows Eve,Georgia,Georgia
8,"11012, Sheer Heart Attack",TRRZZZZ128F422F784,ARNFBNR1187B9A25C2,-1,Helloween,"Hamburg, Germany",Germany
11,"10974, Standing At The Crossroads",TRNTRUC128F4234EB5,AR49MOS1187B991A8B,139493,Smokey Wilson,"Glen Allen, MS",United States of America


### 6. Addition of the genre for each track (Use of LastFM dataset and external website for genre listing)

To find the genre of a song, we will use the LastFM dataset that contains a list a tags for each song.
Since the dataset is from the MillionSongDataset, we will not use all of the available tracks from LastFM but, but only the ones contained in the SecondHandSong dataset.

In [32]:
# Loading the files if they are in the SecondHandSong dataset and create the dataframe
covers_df = pickle.load(open("data/covers.p","rb"))
list_tracks = covers_df.trackID
test_path = "../../lastfm_test"
train_path = "../../lastfm_train"

genre_df = pd.DataFrame()
def create_dataFrame(genre_df):
    for track in list_tracks:
        folder1 = track[2]
        folder2 = track[3]
        folder3 = track[4]
        folder_path = "/" + folder1 + "/" + folder2 + "/" + folder3 + "/"
        track_path = folder_path + track + ".json"
        if glob.glob(train_path + track_path) != []:
                genre_df = genre_df.append(pd.DataFrame.from_dict(json.load(open(train_path + track_path)), orient="index").transpose())
        elif glob.glob(test_path + folder_path + track) != []:
                genre_df = genre_df.append(pd.DataFrame.from_dict(json.load(open(test_path + track_path)), orient="index").transpose())
    genre_df = genre_df.reset_index()
    return genre_df

#tracks_with_tags = create_dataFrame(genre_df)
tracks_with_tags = pickle.load(open("tracks_with_tags", "rb"))

We now list the unique tags in the resulting dataframe. Due to a time limit for the computation of the matching, we will first test on a subset.

In [19]:
tags = list()
for i in range (0,1000):
    tags = tags + tracks_with_tags.tags[i]
    
tags = np.unique(tags).tolist()

A lot of tags contains useless information, thus we first proceed to a pre-cleaning.

In [20]:
clean_tags = {}
def clean_tag(x):
    clean = x.replace("ooo", "")
    clean = clean.replace("-o", "")
    clean = clean.replace("o-", "")
    clean = clean.replace("- ", "")
    clean = clean.replace("-", "")
    clean_tags[x] = clean
for t in tags:
    clean_tag(t)

In order assign a genre to each song, we will use their different tags and try to match it with a list of genre obtained by webscrapping the http://www.musicgenreslist.com website. For more details on the webscrapping see the notebook Genre Webscrapping.ipynb.

In [26]:
map_genres = pickle.load(open("data/map_genres", "rb"))
all_genres = pickle.load(open("data/all_genres.p", "rb"))

We then use the Sequence Matcher package to match tags to the web-scrapped genres.

In [27]:
threshold = 0.80
def match_genres():
    i = 0
    genre_map = {}
    no_match = list()
    for ind in range(0,len(tags)):
        name1 = tags[ind]
        if i%1000 == 0:
            print(i)
        if clean_tags[name1] == "":
            genre_map[name1] = np.nan
        best_ratio = 0
        match = ""
        for name2 in map_genres.keys():
            if name2.lower() in name1.lower():
                for subgenre in map_genres[name2]:
                    ratio = SequenceMatcher(None,name1.lower(),name2.lower()).ratio()
                    if ratio > best_ratio:       # we find the maximum similarity
                        best_ratio = ratio
                        match = name2
                if (best_ratio > threshold):     # if it's superior to our threshold we add that couple to the mapping
                    genre_map[name1] = match
                else:
                    genre_map[name1] = name2
        if match == "":
            for subgenre in all_genres:
                ratio = SequenceMatcher(None,name1.lower(),name2.lower()).ratio()
                if ratio > best_ratio:       # we find the maximum similarity
                    best_ratio = ratio
                    match = name2
            if (best_ratio > threshold):     # if it's superior to our threshold we add that couple to the mapping
                genre_map[name1] = match
            else :
                genre_map[name1] = np.nan
        i = i+1
    return (genre_map, no_match)