# PROJECT - *My Way* of seeing music covers
#### Pierre-Antoine Desplaces, Anaïs Ladoy, Lou Richard

In [161]:
# Import libraries
import pandas as pd
import numpy as np
import json
import matplotlib.pyplot as plt
import seaborn as sns
import folium
from io import StringIO
import sys

## Data organisation 
- All the additional files giving the metadata of the Million Song Dataset (MSD) were downloaded from the cluster. They will help us draw up a plan; a script will then fetch more detailed information about specific tracks (h5 files on the cluster), possibly using the cluster's CPUs. The path to a track on the cluster looks like million-songs/data/A/A/A, where the three trailing letters are the 3rd, 4th and 5th characters of the track ID.
- Music covers will be detected using another dataset, SecondHandSongs (SHS). We can either use the downloadable dataset of 18,196 tracks (all linked to the MSD) or scrape the SHS website (https://secondhandsongs.com/), which holds far more information (522,436 covers) that is not necessarily linked to the MSD. The SHS API is RESTful (it returns JSON objects) and is limited to 100 requests per minute and 1000 requests per hour, but we can contact them to have the limit lifted.
- Some artists (30% of all MSD artists) are geolocated in the artist_location dataframe.
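
The path rule described in the first point can be sketched as a small helper. The function name `track_h5_path`, the `root` default, and the `.h5` file name built from the track ID are assumptions made for illustration; only the `million-songs/data/<3rd>/<4th>/<5th>` directory layout comes from the description above.

```python
import posixpath

def track_h5_path(track_id, root="million-songs/data"):
    """Build the assumed cluster path of a track's h5 file from its track ID."""
    if len(track_id) < 5:
        raise ValueError("track ID too short: %r" % track_id)
    # The three directory levels are the 3rd, 4th and 5th characters of the track ID.
    a, b, c = track_id[2], track_id[3], track_id[4]
    # Assumption: the file is named after the track ID with an .h5 extension.
    return posixpath.join(root, a, b, c, track_id + ".h5")

print(track_h5_path("TRMMMYQ128F932D901"))
```

For the track ID `TRMMMYQ128F932D901` this points into the `M/M/M` directory.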

In [162]:
#Load Additional files
tracks_per_year=pd.read_csv('data/AdditionalFiles/tracks_per_year.txt',delimiter='<SEP>',engine='python',header=None,index_col=1,names=['year','trackID','artist','title'])
unique_tracks=pd.read_csv('data/AdditionalFiles/unique_tracks.txt',delimiter='<SEP>',engine='python',header=None,index_col=0,names=['trackID','songID','artist','title'])
unique_artists=pd.read_csv('data/AdditionalFiles/unique_artists.txt',delimiter='<SEP>',engine='python',header=None,index_col=0,names=['artistID','artistMID','randomTrack','name'])
artist_location=pd.read_csv('data/AdditionalFiles/artist_location.txt',delimiter='<SEP>',engine='python',header=None,index_col=0,names=['artistID','lat','long','name','location'])

In [163]:
#Check whether each index is unique and print the number of rows of each dataframe
print('Dataframe (Unique index, Number of elements)')
print('tracks_per_year ',(tracks_per_year.index.is_unique,tracks_per_year.shape[0]))
print('unique_tracks ',(unique_tracks.index.is_unique,unique_tracks.shape[0]))
print('unique_artists ',(unique_artists.index.is_unique,unique_artists.shape[0]))
print('artist_location ',(artist_location.index.is_unique,artist_location.shape[0]))

Dataframe (Unique index, Number of elements)
tracks_per_year  (True, 515576)
unique_tracks  (True, 1000000)
unique_artists  (True, 44745)
artist_location  (True, 13850)


In [164]:
tracks_per_year.head()

trackID,year,artist,title
TRSGHLU128F421DF83,1922,Alberta Hunter,Don't Pan Me
TRMYDFV128F42511FC,1922,Barrington Levy,Warm And Sunny Day
TRRAHXQ128F42511FF,1922,Barrington Levy,Looking My Love
TRFAFTK12903CC77B8,1922,Barrington Levy,Warm And Sunny Day
TRSTBUY128F4251203,1922,Barrington Levy,Mandela You're Free


In [165]:
unique_tracks.head()

trackID,songID,artist,title
TRMMMYQ128F932D901,SOQMMHC12AB0180CB8,Faster Pussy cat,Silent Night
TRMMMKD128F425225D,SOVFVAK12A8C1350D9,Karkkiautomaatti,Tanssi vaan
TRMMMRX128F93187D9,SOGTUKN12AB017F4F1,Hudson Mohawke,No One Could Ever
TRMMMCH128F425532C,SOBNYVR12A8C13558C,Yerba Brava,Si Vos Querés
TRMMMWA128F426B589,SOHSBXH12A8C13B0DF,Der Mystic,Tangle Of Aspens


In [166]:
unique_tracks.artist.unique().shape

(72665,)

In [167]:
unique_artists.head()

artistID,artistMID,randomTrack,name
AR002UA1187B9A637D,7752a11c-9d8b-4220-ac44-e4a04cc8471d,TRMUOZE12903CDF721,The Bristols
AR003FB1187B994355,1dbd2d7b-64c8-46aa-9f47-ff589096d672,TRWDPFR128F93594A6,The Feds
AR006821187FB5192B,94fc1228-7032-4fe6-a485-e122e5fbee65,TRMZLJF128F4269EAC,Stephen Varcoe/Choir of King's College_ Cambri...
AR009211187B989185,9dfe78a6-6d91-454e-9b95-9d7722cbc476,TRMGURO12903CAE2F0,Carroll Thompson
AR009SZ1187B9A73F4,8cd574c0-b9f7-4998-94f4-654dffaecdf2,TRGWWFP12903CE7E79,Gorodisch


In [169]:
artist_location.head()

artistID,lat,long,name,location
ARZGXZG1187B9B56B6,-16.96595,-61.14804,Endless Blue,Santa Cruz
AR8K6F31187B99C2BC,46.44231,-93.36586,Go Fish,"Twin Cities, MN"
ARHJJ771187FB5B581,51.59678,-0.33556,Screaming Lord Sutch,"Harrow, Middlesex, England"
ARJ8YLL1187FB3CA93,40.69626,-73.83301,Morton Gould,"Richmond Hill, NY"
ARYBAGV11ECC836DAC,43.58828,-79.64372,Crash Parallel,Mississauga


In [170]:
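def read_shs_files(pathToFile):
    """Read an SHS clique file into a dataframe with one row per track."""
    s = StringIO()
    cur_ID = None
    # The context manager closes the file even if parsing fails
    with open(pathToFile) as f:
        for ln in f:
            # Skip blank lines
            if not ln.strip():
                continue
            # A line starting with '%' opens a new clique: keep it as a prefix for the rows that follow
            if ln.startswith('%'):
                cur_ID = ln.replace('\n','<SEP>',1)
                continue
            # A data row must always come after a clique header
            if cur_ID is None:
                raise ValueError('NO ID found before the first data row')
            s.write(cur_ID + ln)
    s.seek(0)
    df = pd.read_csv(s,delimiter='<SEP>',engine='python',header=None,names=['shsID','trackID','artistID','shsPerf'])
    return df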
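
To make the expected input layout concrete, here is a minimal sketch of the same grouping logic on an in-memory sample, using only the standard library. The sample lines are rebuilt from the rows shown in the `covers.head()` output; the helper name `parse_shs_lines` is hypothetical, and the layout of the real SHS files is assumed from what `read_shs_files` expects (a `%`-prefixed clique header followed by `trackID<SEP>artistID<SEP>shsPerf` rows).

```python
def parse_shs_lines(lines):
    """Attach the most recent '%' header to every data row that follows it."""
    rows = []
    current_header = None
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith('%'):
            # A header line opens a new clique
            current_header = line.lstrip('%')
            continue
        if current_header is None:
            raise ValueError('data row found before any clique header')
        track_id, artist_id, perf = line.split('<SEP>')
        rows.append({'shsID': current_header, 'trackID': track_id,
                     'artistID': artist_id, 'shsPerf': int(perf)})
    return rows

sample = [
    '%115402,74782, Putty (In Your Hands)',
    'TRJVDMI128F4281B99<SEP>AR46LG01187B98DB5D<SEP>74784',
    'TRNJXCO128F92E1930<SEP>ARQD13K1187B98E441<SEP>138584',
    '',
    '%24350, I.G.Y. (Album Version)',
    'TRIBOIS128F9340B19<SEP>ARUVZYG1187B9B2809<SEP>24350',
]
rows = parse_shs_lines(sample)
for row in rows:
    print(row['shsID'], '|', row['trackID'])
```

Every track inherits the header of the clique it belongs to, which is what later allows the header to be turned into a clique id.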

In [331]:
#Import the two SHS datasets (the SHS data is split into a train and a test set, usable for ML if wanted)
SHS_testset=read_shs_files('data/SHS_testset.txt')
SHS_trainset=read_shs_files('data/SHS_trainset.txt')
covers=pd.concat([SHS_testset,SHS_trainset])
covers.shape

(18196, 4)

In [332]:
covers.shsID=covers.shsID.str.strip('%')

In [333]:
covers.head()

,shsID,trackID,artistID,shsPerf
0,"115402,74782, Putty (In Your Hands)",TRJVDMI128F4281B99,AR46LG01187B98DB5D,74784
1,"115402,74782, Putty (In Your Hands)",TRNJXCO128F92E1930,ARQD13K1187B98E441,138584
2,"24350, I.G.Y. (Album Version)",TRIBOIS128F9340B19,ARUVZYG1187B9B2809,24350
3,"24350, I.G.Y. (Album Version)",TRGXZDU128F9301E53,AR4LE591187FB3FCFB,24363
4,"79178, When The Catfish Is In Bloom",TRQSIOY128F92FACA7,ARU75JD1187FB38B79,79178


In [334]:
#Convert shsID to clique id (first convert to category and get a code)
covers=covers.assign(clique_id=(covers.shsID.astype('category')).cat.codes)
#Remove the shsID and the shsPerf columns (useless)
covers.drop('shsID',axis=1,inplace=True)
covers.drop('shsPerf',axis=1,inplace=True)
#Merge with unique_artists dataframe to find the artist name for each track (featured artists are ignored since we only take the name of the artist assigned to the track)
covers=covers.merge(unique_artists[['name']],how='left',left_on='artistID',right_index=True)
#Take the clique id we defined as id of the dataframe (not unique index for now)
#covers.set_index('id',inplace=True)
#covers.sort_index(inplace=True)
#Merge with unique_tracks dataframe to find the track name
covers=covers.merge(unique_tracks[['title']],how='left',left_on='trackID',right_index=True)
#Merge with tracks_per_year dataframe to find the year of each track
covers=covers.merge(tracks_per_year[['year']],how='left',left_on='trackID',right_index=True)

In [335]:
covers.head()

,trackID,artistID,clique_id,name,title,year
0,TRJVDMI128F4281B99,AR46LG01187B98DB5D,1433,The Detroit Cobras,Putty (In Your Hands),1998.0
1,TRNJXCO128F92E1930,ARQD13K1187B98E441,1433,Sylvie Vartan,Ne Le Déçois Pas,1962.0
2,TRIBOIS128F9340B19,ARUVZYG1187B9B2809,2543,Donald Fagen,I.G.Y. (Album Version),1982.0
3,TRGXZDU128F9301E53,AR4LE591187FB3FCFB,2543,Take 6,Beautiful World (Album Version),
4,TRQSIOY128F92FACA7,ARU75JD1187FB38B79,5240,John Fahey,When The Catfish Is In Bloom,1968.0


In [336]:
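#Count the distinct clique ids; max(covers.index)+1 only gives the row count of the larger input file (the 12960 printed below), because concat keeps the row labels of each file
print('Number of cliques :', covers.clique_id.nunique())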
print('Number of unique tracks :', len(covers.trackID.unique())) 
print('Number of unique artists :', len(covers.artistID.unique()))
print('Number of missing trackID :', len(covers[covers.trackID.isnull()]))
print('Number of missing artistID :', len(covers[covers.artistID.isnull()]))
print('Number of missing years :', len(covers[covers.year.isnull()]))

Number of cliques : 12960
Number of unique tracks : 18196
Number of unique artists : 5578
Number of missing trackID : 0
Number of missing artistID : 0
Number of missing years : 4796
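
One point worth checking when reading these counts: `pd.concat` keeps the row labels of each input unless `ignore_index=True` is passed, so the largest row label reflects the size of the bigger input and not the number of cliques. A minimal sketch on toy frames (the data below is made up for illustration):

```python
import pandas as pd

# Two toy inputs standing in for the test and train sets
test_part = pd.DataFrame({'clique_id': [0, 0]})
train_part = pd.DataFrame({'clique_id': [1, 1, 1, 2]})
both = pd.concat([test_part, train_part])

row_labels = list(both.index)             # labels restart at 0 for the second frame
largest_label_plus_one = max(both.index) + 1
distinct_cliques = both.clique_id.nunique()

print(row_labels)
print(largest_label_plus_one, distinct_cliques)
```

Here the largest label plus one equals the length of the longer frame, while `nunique()` counts the cliques themselves.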


In [337]:
#Drop the rows with missing values, in particular missing years, for now, in order to try an algorithm that detects the original song (and the order of the covers)
covers.dropna(axis=0,inplace=True)

In [338]:
covers=covers.sort_values(['clique_id', 'year'], ascending=[True, True]).reset_index()

In [339]:
covers.head()

,index,trackID,artistID,clique_id,name,title,year
0,11502,TRGDMZP128F42BC52B,ARB1DDF1187FB4FCFB,0,Louis Armstrong,Stardust,1988.0
1,11503,TRCATYW12903D038FE,ARGJEEO1271F573FD6,0,Artie Shaw and his orchestra,Stardust,1988.0
2,11501,TRVMZJZ128F4270CE4,ARY0HTV1187FB4A1B1,0,Hoagy Carmichael,Star Dust,1999.0
3,12947,TROJZTF128F428B546,ARJN76O1187FB43C99,1,Ana Belén,Yo Vengo A Ofrecer Mi Corazon,2001.0
4,5163,TRCKNGE128F92DA3F3,AR1CB5G1187B9AFB8E,2,Electric Light Orchestra,Mr. Blue Sky,1977.0


In [340]:
#Compute the order of songs for each clique
covers['rank']=covers.groupby('clique_id')['year'].rank(method='dense',ascending=True).astype('int')

In [341]:
covers.set_index('clique_id',inplace=True)
covers.drop('index',axis=1,inplace=True)

In [342]:
covers.head()

clique_id,trackID,artistID,name,title,year,rank
0,TRGDMZP128F42BC52B,ARB1DDF1187FB4FCFB,Louis Armstrong,Stardust,1988.0,1
0,TRCATYW12903D038FE,ARGJEEO1271F573FD6,Artie Shaw and his orchestra,Stardust,1988.0,1
0,TRVMZJZ128F4270CE4,ARY0HTV1187FB4A1B1,Hoagy Carmichael,Star Dust,1999.0,2
1,TROJZTF128F428B546,ARJN76O1187FB43C99,Ana Belén,Yo Vengo A Ofrecer Mi Corazon,2001.0,1
2,TRCKNGE128F92DA3F3,AR1CB5G1187B9AFB8E,Electric Light Orchestra,Mr. Blue Sky,1977.0,1
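
The table above shows a limitation of `method='dense'`: two tracks of a clique released in the same year share a rank. As a sketch of one possible workaround, `method='first'` breaks ties by row order and yields a unique rank within each clique. This tie-break is arbitrary and does not identify the true original; the toy frame reuses the names and years of the first rows shown above.

```python
import pandas as pd

toy = pd.DataFrame({
    'clique_id': [0, 0, 0, 1],
    'name': ['Louis Armstrong', 'Artie Shaw and his orchestra',
             'Hoagy Carmichael', 'Ana Belén'],
    'year': [1988.0, 1988.0, 1999.0, 2001.0],
})

years_by_clique = toy.groupby('clique_id')['year']
# Dense ranking: tracks released in the same year share a rank
toy['dense_rank'] = years_by_clique.rank(method='dense').astype(int)
# First-come ranking: ties are broken by row order, so ranks are unique per clique
toy['unique_rank'] = years_by_clique.rank(method='first').astype(int)

print(toy)
```

With unique ranks, the pair (`clique_id`, `unique_rank`) can serve as a unique multi-index.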


To handle:
- Find the missing years through API requests to the SHS website (if the website has no information about the year, find another solution).
- Find a way to detect the original song when the first cover was released in the same year. The same problem arises when two covers are released in the same year and therefore share a rank, which would make a multi-index on the clique and the rank non-unique.
- Use API requests to the SHS website to get the location and the language.
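
Since this information would come from the SHS API, any script has to stay within the limits noted in the data organisation section (100 requests per minute and 1000 requests per hour). A minimal sketch of a sliding-window limiter follows. The class name `SlidingWindowLimiter` and its methods are hypothetical, no real SHS endpoint is called, and the clock and sleep functions are injectable so the logic can be exercised without actually waiting.

```python
import time
from collections import deque

class SlidingWindowLimiter:
    """Block until a new call fits within every (max_calls, period_in_seconds) limit."""

    def __init__(self, limits=((100, 60.0), (1000, 3600.0)),
                 clock=time.monotonic, sleep=time.sleep):
        self.limits = limits
        self.clock = clock
        self.sleep = sleep
        self.calls = deque()  # timestamps of past calls, oldest first
        self.longest = max(period for _, period in limits)

    def wait_time(self):
        """Return how many seconds to wait before the next call is allowed."""
        now = self.clock()
        # Forget calls that are older than the longest window
        while self.calls and now - self.calls[0] >= self.longest:
            self.calls.popleft()
        delay = 0.0
        for max_calls, period in self.limits:
            recent = [t for t in self.calls if now - t < period]
            if len(recent) >= max_calls:
                # Wait until enough of the oldest calls have left this window
                blocking = recent[len(recent) - max_calls]
                delay = max(delay, blocking + period - now)
        return delay

    def acquire(self):
        """Sleep if needed, then record the call."""
        delay = self.wait_time()
        if delay > 0:
            self.sleep(delay)
        self.calls.append(self.clock())

limiter = SlidingWindowLimiter()
limiter.acquire()  # call this right before each API request
```

Keeping both windows in one object means a burst of 100 quick requests is throttled by the minute limit while the hourly budget is still tracked.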