# PROJECT - *My Way* of seeing music covers
#### Pierre-Antoine Desplaces, Anaïs Ladoy, Lou Richard

In [1]:
# Import libraries
import pandas as pd
import numpy as np
import json
import matplotlib.pyplot as plt
import seaborn as sns
import folium
from io import StringIO
import sys
import requests
from bs4 import BeautifulSoup
import pickle

## Notebook plan
1. Data importation
2. Clique organisation (Multi-level indexing)
3. Addition of the language and the year of each track (SHS website web-scraping)
4. Addition of the tempo and song hotness of each track (Access to track files through the cluster)
5. Addition of the genre for each track (Use of LastFM dataset and external website for genre listing)
6. Determine artist location for spatial analysis

### Data importation
Download available additional files containing metadata about our dataset from the cluster (dataset/million-songs_untar/)
- tracks_per_year.txt
- unique_tracks.txt
- unique_artists.txt
- artist_location.txt

Use the Second Hand Songs (SHS) dataset that was created 


- All the additional files, which hold the metadata of the Million Song Dataset (MSD), were downloaded from the cluster. They help us build a plan; a script will then retrieve more information about specific tracks from the h5 files in the cluster, possibly using the cluster CPUs. The path to a track in the cluster is, for example, million-songs/data/A/A/A (the 3 letters at the end being the 3rd, 4th and 5th letters of the track ID).
- The music covers will be detected using another dataset (SecondHandSongs). We can either use the downloadable dataset containing 18,196 tracks (all linked to the MSD), or web-scrape the SHS website (https://secondhandsongs.com/), which holds much more information (522,436 covers) that is not necessarily linked to the MSD. The SHS API is RESTful (it returns a JSON object) and is limited to 100 requests per minute and 1000 requests per hour, but we can contact them to lift the limitation.
- Some artists are geolocated (30% of all MSD artists) in the artist_location dataframe.
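
The path rule described above can be sketched as a small helper. This is only an illustration: the function name `msd_track_path`, the default root and the `<trackID>.h5` file name are assumptions, not something confirmed by the cluster layout.

```python
import posixpath

def msd_track_path(track_id, root='million-songs/data'):
    """Build the assumed cluster path of the h5 file for a given MSD track ID.

    The three sub-directories are the 3rd, 4th and 5th letters of the track ID.
    The '<trackID>.h5' file name is an assumption.
    """
    if len(track_id) < 5:
        raise ValueError('track ID too short: %r' % track_id)
    third, fourth, fifth = track_id[2], track_id[3], track_id[4]
    return posixpath.join(root, third, fourth, fifth, track_id + '.h5')

# Track ID taken from the SHS dataset
print(msd_track_path('TRJVDMI128F4281B99'))
# million-songs/data/J/V/D/TRJVDMI128F4281B99.h5
```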

In [2]:
#Load Additional files
tracks_per_year=pd.read_csv('data/AdditionalFiles/tracks_per_year.txt',delimiter='<SEP>',engine='python',header=None,index_col=1,names=['year','trackID','artist','title'])
unique_tracks=pd.read_csv('data/AdditionalFiles/unique_tracks.txt',delimiter='<SEP>',engine='python',header=None,index_col=0,names=['trackID','songID','artist','title'])
unique_artists=pd.read_csv('data/AdditionalFiles/unique_artists.txt',delimiter='<SEP>',engine='python',header=None,index_col=0,names=['artistID','artistMID','randomTrack','name'])
artist_location=pd.read_csv('data/AdditionalFiles/artist_location.txt',delimiter='<SEP>',engine='python',header=None,index_col=0,names=['artistID','lat','long','name','location'])

In [3]:
#Check that each index is unique and print the number of rows of each dataframe
print('Dataframe (Unique index, Number of elements)')
print('tracks_per_year ',(tracks_per_year.index.is_unique,tracks_per_year.shape[0]))
print('unique_tracks ',(unique_tracks.index.is_unique,unique_tracks.shape[0]))
print('unique_artists ',(unique_artists.index.is_unique,unique_artists.shape[0]))
print('artist_location ',(artist_location.index.is_unique,artist_location.shape[0]))

Dataframe (Unique index, Number of elements)
tracks_per_year  (True, 515576)
unique_tracks  (True, 1000000)
unique_artists  (True, 44745)
artist_location  (True, 13850)


In [4]:
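def read_shs_files(pathToFile):
    #Flatten an SHS file: prefix every track line with the '%' header line of its clique
    s = StringIO()
    cur_ID = None
    with open(pathToFile) as f: #The file is closed automatically at the end of the block
        for ln in f:
            if not ln.strip(): #Skip empty lines
                continue
            if ln.startswith('%'): #Clique header: becomes the first field of the following rows
                cur_ID = ln.replace('\n','<SEP>',1)
                continue
            if cur_ID is None:
                raise ValueError('NO ID found before the first track line')
            s.write(cur_ID + ln)
    s.seek(0)
    df = pd.read_csv(s,delimiter='<SEP>',engine='python',header=None,names=['shsID','trackID','artistID','shsPerf'])
    return df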

In [5]:
#Import the two SHS datasets (SHS data split into a train set and a test set, usable for ML if wanted)
SHS_testset=read_shs_files('data/SHS_testset.txt')
SHS_trainset=read_shs_files('data/SHS_trainset.txt')
covers=pd.concat([SHS_testset,SHS_trainset])
covers.shsID=covers.shsID.str.strip('%')
covers.head()

Unnamed: 0,shsID,trackID,artistID,shsPerf
0,"115402,74782, Putty (In Your Hands)",TRJVDMI128F4281B99,AR46LG01187B98DB5D,74784
1,"115402,74782, Putty (In Your Hands)",TRNJXCO128F92E1930,ARQD13K1187B98E441,138584
2,"24350, I.G.Y. (Album Version)",TRIBOIS128F9340B19,ARUVZYG1187B9B2809,24350
3,"24350, I.G.Y. (Album Version)",TRGXZDU128F9301E53,AR4LE591187FB3FCFB,24363
4,"79178, When The Catfish Is In Bloom",TRQSIOY128F92FACA7,ARU75JD1187FB38B79,79178


In [6]:
#Convert shsID to clique id (first convert to category and get a code)
covers=covers.assign(clique_id=(covers.shsID.astype('category')).cat.codes)
#Remove the shsID column (replaced by clique_id); shsPerf is kept because it is needed later to query the SHS website
covers.drop('shsID',axis=1,inplace=True)
#Merge with unique_artists dataframe to find the artist name for each track (featured artists are ignored since we only take the name of the artist assigned to the track)
covers=covers.merge(unique_artists[['name']],how='left',left_on='artistID',right_index=True)
#Merge with unique_tracks dataframe to find the track name
covers=covers.merge(unique_tracks[['title']],how='left',left_on='trackID',right_index=True)
#Merge with tracks_per_year dataframe to find the year of each track
covers=covers.merge(tracks_per_year[['year']],how='left',left_on='trackID',right_index=True)

In [7]:
covers.head()

Unnamed: 0,trackID,artistID,shsPerf,clique_id,name,title,year
0,TRJVDMI128F4281B99,AR46LG01187B98DB5D,74784,1433,The Detroit Cobras,Putty (In Your Hands),1998.0
1,TRNJXCO128F92E1930,ARQD13K1187B98E441,138584,1433,Sylvie Vartan,Ne Le Déçois Pas,1962.0
2,TRIBOIS128F9340B19,ARUVZYG1187B9B2809,24350,2543,Donald Fagen,I.G.Y. (Album Version),1982.0
3,TRGXZDU128F9301E53,AR4LE591187FB3FCFB,24363,2543,Take 6,Beautiful World (Album Version),
4,TRQSIOY128F92FACA7,ARU75JD1187FB38B79,79178,5240,John Fahey,When The Catfish Is In Bloom,1968.0


In [8]:
print('Number of tracks :', covers.shape[0])
print('Number of cliques :', max(covers.index)+1) #Caution: covers.index is the row index inherited from the concatenated files, not clique_id; covers.clique_id.nunique() would count the cliques
print('Number of unique tracks :', len(covers.trackID.unique())) 
print('Number of unique artists :', len(covers.artistID.unique()))
print('Number of missing trackID :', len(covers[covers.trackID.isnull()]))
print('Number of missing artistID :', len(covers[covers.artistID.isnull()]))
print('Number of missing years :', len(covers[covers.year.isnull()]))

Number of tracks : 18196
Number of cliques : 12960
Number of unique tracks : 18196
Number of unique artists : 5578
Number of missing trackID : 0
Number of missing artistID : 0
Number of missing years : 4796


In [9]:
covers=covers.sort_values(['clique_id', 'year'], ascending=[True, True]).reset_index(drop=True) #Sort by clique_id and year, then renumber the rows (the previous index is dropped)

In [10]:
print('Number of missing years with valid shsPerf (API request on the performance page) :',len(covers[(covers.year.isnull()) & (covers.shsPerf != -1)]))
print('Number of missing years with invalid shsPerf (API request on the search page to find shsPerf) :',len(covers[(covers.year.isnull())])-len(covers[(covers.year.isnull()) & (covers.shsPerf != -1)]))

Number of missing years with valid shsPerf (API request on the performance page) : 4128
Number of missing years with invalid shsPerf (API request on the search page to find shsPerf) : 668


We need to fill in the missing years in order to rank the songs within each clique and thus identify the original song and the covers that followed. Since the year alone is not necessarily sufficient to discriminate the songs (a cover sometimes appears in the same year as the original), it is better to retrieve the full release date for ALL the tracks when that information is available on the SHS website.

We need the shsPerf to access the song page on the SHS website, where we can find the language and the release date of the song. In the dataset, negative values of shsPerf are considered missing values.

Two ways of doing it:
- For a valid SHS performance ID, access the performance page (e.g. 'https://secondhandsongs.com/performance/1983') and web-scrape the language and release date using the perfInfo_SHS() function.
- For an invalid SHS performance ID, send a request to the search page (e.g. 'https://secondhandsongs.com/search/performance?title=blackbird&performer=beatles'), extract the performance ID with the find_shsPerf() function and then use the perfInfo_SHS() function.

In [11]:
pickle.dump(covers,open('covers.p','wb'))

In [12]:
#Work with a subset of the dataframe to develop the algorithms
part=covers[2055:3017]

#Merge with the unique_tracks dataframe to get the artist name attached to the track (featured artists included); it is used by the find_shsPerf function
part=part.merge(unique_tracks[['artist']],how='left',left_on='trackID',right_index=True)
part.head()

Unnamed: 0,trackID,artistID,shsPerf,clique_id,name,title,year,artist
2055,TRWJCBY128F4261C0F,AR11YQ81187FB3C654,100068,940,Dixie Chicks,Tonight The Heartache's On Me,1998.0,Dixie Chicks
2056,TRSSNRG12903CC8518,ARHAL3V1187B9AA462,100066,940,Joy Lynn White,Tonight The Heartache's On Me,,Joy Lynn White
2057,TRBWTHA128E0791A0B,ARLAUED1187B9ACEAF,30050,941,Eric Clapton,Willie And The Hand Jive,1974.0,Eric Clapton
2058,TRHFTUX128F93010D0,ARB5U6G1187B9A994C,10009,941,Johnny Otis Show,Willie And The Hand Jive,1984.0,The Johnny Otis Show
2059,TRGDJUM12903CC5CD9,ARRHVVL1187B991E41,-1,941,Johnny Otis,Willie And The Hand Jive,1991.0,Johnny Otis


In [13]:
print('Number of cliques in the subset :', len(part.clique_id.unique()))
print('Number of tracks in the subset :', part.shape[0])
print('Number of missing years in the subset :', len(part[part.year.isnull()]))
print('Number of invalid shsPerf in the subset :', len(part[part.shsPerf<0]))

Number of cliques in the subset : 298
Number of tracks in the subset : 962
Number of missing years in the subset : 269
Number of invalid shsPerf in the subset : 29


In [14]:
#Search request to find the SHS perf for the invalid ones (negative values)
def find_shsPerf(x):
    #x is the row position in the part dataframe
    title=part.iloc[x]['title']
    artist=part.iloc[x]['artist']
    shsPerf=part.iloc[x]['shsPerf']
    
    if shsPerf<0:
        #Normalise the search terms before building the query string
        title=title.replace('.', '').replace('_', '').replace('/', '').lower().replace(' ','+')
        artist=artist.replace('.', '').replace('_', '').replace('/', '').lower().replace(' ','+')
        r=requests.get('https://secondhandsongs.com/search/performance?title='+title+'&op_title=contains&performer='+artist+'&op_performer=contains')
        soup = BeautifulSoup(r.text, 'html.parser')
        results=soup.find('tbody')
        #Take the first performance link of the result table, if there is one
        link=None if results is None else results.find('a',attrs={'class':'link-performance'})

        if link is None :
            new_shsPerf=0 #0 marks a performance that could not be found
        else:
            new_shsPerf=int(link['href'].split('/')[2])
    else :
        new_shsPerf=shsPerf
        
    return new_shsPerf

In [15]:
#Find the shsPerf for the tracks which don't have a valid one (subtract 2055 from the part dataframe index to start at position 0)
#part.shsPerf=part.index.map(lambda x: find_shsPerf(x-2055)) 

In [16]:
#pickle.dump(part,open('data/part_shsPerf.p','wb'))
part = pickle.load(open("data/part_shsPerf.p","rb"))

In [17]:
part[part.shsPerf==0]

Unnamed: 0,trackID,artistID,shsPerf,clique_id,name,title,year,artist
2120,TRUSHGG128F92FB357,ARQCKT31187B98906B,0,956,Hurl,Understand,1999.0,Hurl
2475,TRXHGOY128F428F943,ARWIA8D1187B990A0C,0,1080,STRATOVARIUS,I surrender,2001.0,STRATOVARIUS
2493,TRITIQV12903CC9D01,ARSF0K11187B9AF319,0,1087,James Taylor,Don't Let Me Be Lonely Tonight,,John Sawyer
2513,TRVJRTA128F1466888,AR6001N1187B9A8632,0,1091,Tina Turner,Ball Of Confusion (That's What The World Is To...,,Tina Turner
2581,TRDQMYR128F92F9DF3,ARMKGXT11F4C8428AD,0,1115,Nat King Cole Trio,Route 66,,Nat King Cole Trio


In [18]:
#Fill in the missing information where possible by searching manually on the website (the remaining ones stay at 0)
part.loc[part.index==2493,'shsPerf']=10717
part.loc[part.index==2513,'shsPerf']=46614
part.loc[part.index==2581,'shsPerf']=10838

In [19]:
#Request to the SHS website for the page of a specific performance (defined by shsPerf) to extract the language, the release date and the ID of the original
def perfInfo_SHS(shsPerf):
    if shsPerf==0:
        perfLanguage='Unavailable'
        perfDate='Unavailable'
        original_shsPerf='Unavailable'
        
    else :
        r = requests.get('https://secondhandsongs.com/performance/'+str(shsPerf))
        soup = BeautifulSoup(r.text, 'html.parser')
        perfMeta=soup.find('dl',attrs={'class':'dl-horizontal'})
        if perfMeta is None:
            perfLanguage='Missing'
            perfDate='Missing'
            original_shsPerf='Missing'
        else :
            perfLanguage=perfMeta.find('dd',attrs={'itemprop':'inLanguage'})
            if perfLanguage is None :
                perfLanguage='Missing'
            else :
                perfLanguage=perfLanguage.text

            perfDate=perfMeta.find('div',attrs={'class':'media-body'})
            if perfDate is None :
                perfDate='Missing'
            else :
                perfDate=perfDate.find('p').text.split('\n')[2].strip(' ')

            original_shsPerf=soup.find('section',attrs={'class':'work-originals'})
            if original_shsPerf is None :
                original_shsPerf='Missing'
            else :
                original_shsPerf=original_shsPerf.find('a',attrs={'class':'link-work'})['href'].split('/')[2]

    return perfLanguage,perfDate,original_shsPerf

In [20]:
#part['language'], \
#part['date'], \
#part['original_shsPerf']= zip(*part.shsPerf.map(perfInfo_SHS))

In [21]:
#pickle.dump(part,open('data/part_withLangYear.p','wb'))

In [22]:
part = pickle.load(open("data/part_withLangYear.p","rb"))

In [23]:
part

Unnamed: 0,trackID,artistID,shsPerf,clique_id,name,title,year,artist,language,date,original_shsPerf
2055,TRWJCBY128F4261C0F,AR11YQ81187FB3C654,100068,940,Dixie Chicks,Tonight The Heartache's On Me,1998.0,Dixie Chicks,English,"January 27, 1998",100066
2056,TRSSNRG12903CC8518,ARHAL3V1187B9AA462,100066,940,Joy Lynn White,Tonight The Heartache's On Me,,Joy Lynn White,English,September 1994,Missing
2057,TRBWTHA128E0791A0B,ARLAUED1187B9ACEAF,30050,941,Eric Clapton,Willie And The Hand Jive,1974.0,Eric Clapton,English,1974,10009
2058,TRHFTUX128F93010D0,ARB5U6G1187B9A994C,10009,941,Johnny Otis Show,Willie And The Hand Jive,1984.0,The Johnny Otis Show,English,April 1958,Missing
2059,TRGDJUM12903CC5CD9,ARRHVVL1187B991E41,10009,941,Johnny Otis,Willie And The Hand Jive,1991.0,Johnny Otis,English,April 1958,Missing
2060,TRMXFLN128F4270672,ARSRSPK1187B995ECD,183425,941,New Riders of The Purple Sage,Willie And The Hand Jive,2004.0,New Riders of The Purple Sage,English,1972,10009
2061,TROZPHF128F9326F37,AR7ICFK1187B9955FF,57959,941,Levon Helm,Willie And The Hand Jive,,Levon Helm,English,1982,10009
2062,TRIDQVH128F92D46F9,ARZNC3M1187FB392CC,83018,941,Cliff Richard & The Shadows,Willie And The Hand Jive (2008 Digital Remaster),,Cliff Richard & The Shadows,English,"March 18, 1960",10009
2063,TRYJQMN128F930DE44,AREEUH01187B9B7F71,49456,942,Loggins & Messina,Danny's Song,1972.0,Loggins & Messina,English,November 1971,10017
2064,TRCMRBA128F424B553,AR6XWV21187FB4ACC5,10020,942,Anne Murray,Danny's Song,1973.0,Anne Murray,English,1973,10017


The SHS website limits API requests to 1000 per hour. Given the large number of requests needed (668 to resolve the invalid shsPerf values and 18196 to find the language, date and original song), we may ask the SHS team for an exception to lift this limitation.
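
Until such an exception is granted, the requests could be throttled on our side. The class below is a minimal sketch of a sliding-window limiter enforcing the two limits mentioned earlier in this notebook (100 requests per minute, 1000 per hour). The name `SlidingWindowLimiter` and its parameters are hypothetical, and how SHS actually counts requests is an assumption.

```python
import time
from collections import deque

class SlidingWindowLimiter:
    """Block until a new call fits within every (max_calls, window_seconds) limit."""

    def __init__(self, limits, clock=time.monotonic, sleep=time.sleep):
        self.limits = list(limits)
        self.longest = max(window for _, window in self.limits)
        self.calls = deque()  # timestamps of the accepted calls, oldest first
        self.clock = clock
        self.sleep = sleep

    def wait(self):
        while True:
            now = self.clock()
            # Forget the calls that are older than the longest window
            while self.calls and now - self.calls[0] >= self.longest:
                self.calls.popleft()
            delay = 0.0
            for max_calls, window in self.limits:
                recent = [t for t in self.calls if now - t < window]
                if len(recent) >= max_calls:
                    # Wait until enough calls have left this window
                    delay = max(delay, recent[-max_calls] + window - now)
            if delay <= 0:
                self.calls.append(now)
                return
            self.sleep(delay)

# Limits quoted in this notebook: 100 requests/minute and 1000 requests/hour
shs_limiter = SlidingWindowLimiter([(100, 60), (1000, 3600)])
```

Calling `shs_limiter.wait()` right before each `requests.get(...)` in find_shsPerf and perfInfo_SHS would keep the scraping within both limits, at the cost of a much longer run time.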

In [24]:
#Take the clique id we defined as id of the dataframe (not unique index for now)
#covers.set_index('id',inplace=True)
#covers.sort_index(inplace=True)

#covers.set_index('clique_id',inplace=True)
#Compute the order of songs for each clique
#covers['rank']=covers.groupby('clique_id')['year'].rank(method='dense',ascending=True).astype('int')
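
The dates scraped from SHS come in several layouts ("January 27, 1998", "September 1994", "1974"), so they have to be parsed before any ranking. The sketch below shows one possible way to do it with the standard library only. The helper names `parse_shs_date` and `rank_clique` are hypothetical, the list of layouts is an assumption based on the rows displayed above, and month names are parsed with the default (English) locale. Partial dates fall back to the first day of the month or year, which can affect ties.

```python
from datetime import datetime

# Date layouts seen in the scraped `date` column (assumed, not exhaustive)
SHS_DATE_FORMATS = ('%B %d, %Y', '%B %Y', '%Y')

def parse_shs_date(text):
    """Return a datetime for an SHS date string, or None when it cannot be parsed."""
    if not isinstance(text, str):
        return None
    for fmt in SHS_DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt)
        except ValueError:
            continue
    return None  # e.g. 'Missing' or 'Unavailable'

def rank_clique(tracks):
    """Rank (name, date_text) pairs chronologically; unparsed dates go last."""
    parsed = [(name, parse_shs_date(date_text)) for name, date_text in tracks]
    ordered = sorted(parsed, key=lambda item: (item[1] is None, item[1] or datetime.max))
    return [(rank, name) for rank, (name, _) in enumerate(ordered, start=1)]

# A few rows of clique 941 taken from the table above
clique_941 = [
    ('Eric Clapton', '1974'),
    ('Johnny Otis Show', 'April 1958'),
    ('New Riders of The Purple Sage', '1972'),
    ('Levon Helm', '1982'),
    ('Cliff Richard & The Shadows', 'March 18, 1960'),
]
for rank, name in rank_clique(clique_941):
    print(rank, name)
```

Unlike the dense rank of the commented cell, this sketch breaks ties by input order; it only illustrates how the parsed date, rather than the MSD year, could drive the ordering.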

### Access to files (tempo / danceability)

### Find artists location (spatial analysis)

### Find genre