# PROJECT - *My Way* of seeing music covers
#### Pierre-Antoine Desplaces, Anaïs Ladoy, Lou Richard

In [1]:
# Import libraries
import pandas as pd
import numpy as np
import json
import matplotlib.pyplot as plt
import seaborn as sns
import folium
from io import StringIO
import sys
import requests
from bs4 import BeautifulSoup
import pickle
import os
import glob
from pathlib import Path

## Notebook plan
1. Data importation
2. Clique organisation (Multi-level indexing)
3. Addition of the language and the year of each track (SHS website web-scraping)
4. Addition of the tempo and song hotness of each track (Access to track files through the cluster)
5. Determine artist location for spatial analysis
6. Addition of the genre for each track (Use of LastFM dataset and external website for genre listing)
7. Distinction of the original song and the music covers inside each clique

### 1. Data importation
Download available additional files containing metadata about our dataset from the cluster (dataset/million-songs_untar/)
- tracks_per_year.txt
- unique_tracks.txt
- unique_artists.txt
- artist_location.txt

We use the Second Hand Songs (SHS) dataset, which was created through a collaboration between the Million Songs team and the Second Hand Songs website (https://secondhandsongs.com/). The data are split into two sets (a train set and a test set) so that they can be used by machine learning algorithms.
- SHS_testset.txt
- SHS_trainset.txt
Since we don't need this distinction for our data analysis, we merged these two datasets.

The use of an external dataset (LastFM) for the genres and of the track files (.h5) available through the cluster is discussed in parts 4 and 5.

Some general information about our data:
- All the additional files, which give the metadata of the Million Songs dataset, were downloaded from the cluster. They will help us to elaborate a plan; a script will then look up more information about a specific track (h5 files in the cluster), possibly using the cluster CPUs. The path to a track in the cluster is for example million-songs/data/A/A/A (the 3 letters at the end being the 3rd, 4th and 5th letters of the track ID).
- The music covers will be detected using another dataset (SecondHandSongs). We can either use the downloadable dataset containing 18,196 tracks (all connected to the MSD dataset), or web-scrape the SHS website (https://secondhandsongs.com/), which holds much more information (522,436 covers) that is not necessarily connected to our MSD dataset. The SHS API is RESTful (it returns a JSON object) and will be used to provide additional or missing information (location, language of the song, ...) in our dataset.
- Some artists (30% of all MSD artists) are geolocated in the artist_location dataframe.
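
The path rule described in the first point can be sketched as follows. This is only an illustration: the function name `track_h5_path`, the default root and the `<trackID>.h5` file name are assumptions, not something stated above.

```python
from pathlib import PurePosixPath

def track_h5_path(track_id, root="million-songs/data"):
    """Build the cluster path of a track file from its track ID.

    The sub-directories are the 3rd, 4th and 5th characters of the ID.
    The '<trackID>.h5' file name is an assumption made for this sketch.
    """
    if len(track_id) < 5:
        raise ValueError("track ID too short: %r" % track_id)
    third, fourth, fifth = track_id[2], track_id[3], track_id[4]
    return PurePosixPath(root) / third / fourth / fifth / (track_id + ".h5")

print(track_h5_path("TRJVDMI128F4281B99"))
```

With the track ID used here, the sub-directories are J, V and D.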

In [2]:
#Load Additional files
tracks_per_year=pd.read_csv('data/AdditionalFiles/tracks_per_year.txt',delimiter='<SEP>',engine='python',header=None,index_col=1,names=['year','trackID','artist','title'])
unique_tracks=pd.read_csv('data/AdditionalFiles/unique_tracks.txt',delimiter='<SEP>',engine='python',header=None,index_col=0,names=['trackID','songID','artist','title'])
unique_artists=pd.read_csv('data/AdditionalFiles/unique_artists.txt',delimiter='<SEP>',engine='python',header=None,index_col=0,names=['artistID','artistMID','randomTrack','name'])
artist_location=pd.read_csv('data/AdditionalFiles/artist_location.txt',delimiter='<SEP>',engine='python',header=None,index_col=0,names=['artistID','lat','long','name','location'])

In [3]:
#Check whether each index is unique and print the number of elements of each dataframe
print('Dataframe (Unique index, Number of elements)')
print('tracks_per_year ',(tracks_per_year.index.is_unique,tracks_per_year.shape[0]))
print('unique_tracks ',(unique_tracks.index.is_unique,unique_tracks.shape[0]))
print('unique_artists ',(unique_artists.index.is_unique,unique_artists.shape[0]))
print('artist_location ',(artist_location.index.is_unique,artist_location.shape[0]))

Dataframe (Unique index, Number of elements)
tracks_per_year  (True, 515576)
unique_tracks  (True, 1000000)
unique_artists  (True, 44745)
artist_location  (True, 13850)


The covers datasets (SHS_testset.txt and SHS_trainset.txt) are organised in a particular way: groups (named "cliques") list tracks that are interrelated (the music covers and the original track). The function **read_shs_files** imports the files while keeping the "clique" configuration.
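
To make the clique layout concrete, here is a minimal standard-library sketch that parses a small sample. The sample text is reconstructed from the rows displayed in this notebook (the blank separator line is an assumption), and `parse_cliques` is a hypothetical helper, not the function used in this notebook.

```python
SAMPLE = """\
%115402,74782, Putty (In Your Hands)
TRJVDMI128F4281B99<SEP>AR46LG01187B98DB5D<SEP>74784
TRNJXCO128F92E1930<SEP>ARQD13K1187B98E441<SEP>138584

%24350, I.G.Y. (Album Version)
TRIBOIS128F9340B19<SEP>ARUVZYG1187B9B2809<SEP>24350
"""

def parse_cliques(text):
    """Return one dict per track, each tagged with the header of its clique."""
    rows = []
    current_clique = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith("%"):
            current_clique = line[1:]  # a '%' line opens a new clique
            continue
        if current_clique is None:
            raise ValueError("track line found before any clique header")
        track_id, artist_id, shs_perf = line.split("<SEP>")
        rows.append({"shsID": current_clique, "trackID": track_id,
                     "artistID": artist_id, "shsPerf": int(shs_perf)})
    return rows

rows = parse_cliques(SAMPLE)
print(len(rows), rows[0]["shsID"])
```

Each track line inherits the header of the last `%` line seen, which is how the clique membership is kept once the file is flattened.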

In [4]:
def read_shs_files(pathToFile):
    # Prefix every track line with the header of its clique so that the file can be read as a flat table
    s = StringIO()
    cur_ID = None
    with open(pathToFile) as f:
        for ln in f:
            # Skip blank lines
            if not ln.strip():
                continue
            # A line starting with '%' opens a new clique
            if ln.startswith('%'):
                cur_ID = ln.replace('\n','<SEP>',1)
                continue
            if cur_ID is None:
                raise ValueError('NO ID found')
            s.write(cur_ID + ln)
    s.seek(0)
    df = pd.read_csv(s,delimiter='<SEP>',engine='python',header=None,names=['shsID','trackID','artistID','shsPerf'])
    return df

In [5]:
#Import the two SHS datasets and concatenate them
SHS_testset=read_shs_files('data/SHS_testset.txt')
SHS_trainset=read_shs_files('data/SHS_trainset.txt')
covers=pd.concat([SHS_testset,SHS_trainset])
covers.shsID=covers.shsID.str.strip('%')
covers.head()

| | shsID | trackID | artistID | shsPerf |
|---|---|---|---|---|
| 0 | 115402,74782, Putty (In Your Hands) | TRJVDMI128F4281B99 | AR46LG01187B98DB5D | 74784 |
| 1 | 115402,74782, Putty (In Your Hands) | TRNJXCO128F92E1930 | ARQD13K1187B98E441 | 138584 |
| 2 | 24350, I.G.Y. (Album Version) | TRIBOIS128F9340B19 | ARUVZYG1187B9B2809 | 24350 |
| 3 | 24350, I.G.Y. (Album Version) | TRGXZDU128F9301E53 | AR4LE591187FB3FCFB | 24363 |
| 4 | 79178, When The Catfish Is In Bloom | TRQSIOY128F92FACA7 | ARU75JD1187FB38B79 | 79178 |


As said before, our dataset comes from a collaboration between the Million Songs Dataset (MSD) and the Second Hand Songs dataset (SHS).

The MSD consists of almost all the information available through the Echo Nest API for one million popular tracks, and the *trackID* and *artistID* are directly based on the Echo Nest structure.

More specifically, the *trackID* is the unique identifier of a track (link with unique_tracks.txt) and the *artistID* is the unique identifier of an artist (link with unique_artists.txt). The *trackID* also gives the path used to navigate through the MSD directory and access a specific song (through the cluster).

The *shsPerf* corresponds to the SHS identifier of a song; this information is sometimes missing. Exploring the SHS website, we found that the page of a specific song can be reached through different paths using the *shsPerf* information:
- https://secondhandsongs.com/performance/ [shsPerf]
- https://secondhandsongs.com/work/ [shsPerf]

We deduced that only an original song has both a work page and a performance page, covers having only a performance page. In some cases, the *shsPerf* of an original song differs between its performance page and its work page; in our dataset, we assumed that the *shsPerf* is always a performance ID.

The structure of the first column was impossible to decode, and the information it contains, such as the name of the song, can be found elsewhere (unique_tracks.txt). Thus, the column is only used to define a clique_id for each track.
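
As a small illustration of how an opaque header column can be turned into integer clique identifiers, the sketch below uses pandas category codes on a toy series (the three rows are taken from the table shown earlier, in a made-up order). Categories are sorted, so the code of a row is the position of its header in that sorted list.

```python
import pandas as pd

headers = pd.Series([
    "24350, I.G.Y. (Album Version)",
    "115402,74782, Putty (In Your Hands)",
    "24350, I.G.Y. (Album Version)",
])
# Identical headers share one category, hence one integer code
clique_id = headers.astype("category").cat.codes
print(clique_id.tolist())
```

Two rows with the same header always receive the same code, which is the only property needed from a clique identifier.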

In [6]:
#Convert shsID to clique id (first convert to category and get a code)
covers=covers.assign(clique_id=(covers.shsID.astype('category')).cat.codes)
#Remove the shsID column (useless since we have the clique_id now)
covers.drop('shsID',axis=1,inplace=True)

In order to first use the information contained in the metadata files, we merged some necessary attributes (name of the artist, title of the track, release year) from the MSD dataframes loaded above.

In [7]:
#Merge with the unique_artists dataframe to find the artist name of each track (featurings are not taken into account, since we only keep the name of the artist assigned to the track)
covers=covers.merge(unique_artists[['name']],how='left',left_on='artistID',right_index=True)
#Merge with unique_tracks dataframe to find the track name
covers=covers.merge(unique_tracks[['title']],how='left',left_on='trackID',right_index=True)
#Merge with tracks_per_year dataframe to find the year of each track
covers=covers.merge(tracks_per_year[['year']],how='left',left_on='trackID',right_index=True)

In [11]:
covers=covers.sort_values(['clique_id', 'year'], ascending=[True, True]).reset_index(drop=True) #Sort by clique_id and year, then rebuild the index (the previous one is dropped)

In [12]:
covers.head()

| | trackID | artistID | shsPerf | clique_id | name | title | year |
|---|---|---|---|---|---|---|---|
| 0 | TRGDMZP128F42BC52B | ARB1DDF1187FB4FCFB | -1 | 0 | Louis Armstrong | Stardust | 1988.0 |
| 1 | TRCATYW12903D038FE | ARGJEEO1271F573FD6 | -1 | 0 | Artie Shaw and his orchestra | Stardust | 1988.0 |
| 2 | TRVMZJZ128F4270CE4 | ARY0HTV1187FB4A1B1 | -1 | 0 | Hoagy Carmichael | Star Dust | 1999.0 |
| 3 | TRKOINL128F42926C3 | ARQ5FSZ1187B98AD74 | -1 | 0 | Connee Boswell & Sy Oliver Orchestra | Star Dust | |
| 4 | TROJZTF128F428B546 | ARJN76O1187FB43C99 | -1 | 1 | Ana Belén | Yo Vengo A Ofrecer Mi Corazon | 2001.0 |


Here is some useful information about the cover dataset (the basis of our work):

In [14]:
print('Number of tracks :', covers.shape[0])
print('Number of cliques :', len(covers.clique_id.unique()))
print('Number of unique tracks :', len(covers.trackID.unique())) 
print('Number of unique artists :', len(covers.artistID.unique()))
print('Number of missing trackID :', len(covers[covers.trackID.isnull()]))
print('Number of missing artistID :', len(covers[covers.artistID.isnull()]))
print('Number of missing years :', len(covers[covers.year.isnull()]))
print('Number of invalid shsPerf :', len(covers[covers.shsPerf<0]))

Number of tracks : 18196
Number of cliques : 5854
Number of unique tracks : 18196
Number of unique artists : 5578
Number of missing trackID : 0
Number of missing artistID : 0
Number of missing years : 4796
Number of invalid shsPerf : 3075


An important step of the data wrangling process will be to differentiate the original song from the music covers inside each clique.  
We thought of using the release year provided in the tracks_per_year.txt file, but with 4796 tracks (26%) lacking a year, we need another way to get it. Furthermore, the year alone is not necessarily sufficient to discriminate the tracks (a cover sometimes appears in the same year as the original), so it is better to have the full release date for ALL the tracks whenever the information is available.
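
The limitation of the year can be illustrated with a short standard-library sketch. The records are toy values and `earliest_per_clique` is a hypothetical helper: it keeps, for each clique, every track released in the earliest known year, so a clique with more than one candidate cannot be resolved from the year alone.

```python
from collections import defaultdict

def earliest_per_clique(tracks):
    """Map each clique to the tracks sharing its earliest known year."""
    by_clique = defaultdict(list)
    for clique_id, track_id, year in tracks:
        if year is not None:  # tracks without a year cannot be ranked
            by_clique[clique_id].append((year, track_id))
    result = {}
    for clique_id, dated in by_clique.items():
        first_year = min(year for year, _ in dated)
        result[clique_id] = [t for year, t in dated if year == first_year]
    return result

# Hypothetical toy records: (clique_id, trackID, year)
tracks = [(0, "A", 1988), (0, "B", 1988), (0, "C", 1999), (0, "D", None),
          (1, "E", 2001)]
candidates = earliest_per_clique(tracks)
ambiguous = [c for c, ids in candidates.items() if len(ids) > 1]
print(candidates, ambiguous)
```

In this toy input, clique 0 has two tracks from 1988 and one track without a year, which are exactly the two situations a full release date is expected to help with.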

### 2. Addition of the language and the year of each track (SHS website web-scraping)

Some useful information can be found on the page of a music cover.
On the Second Hand Songs website (https://secondhandsongs.com/), each song page can be accessed through a specific ID (shsPerf in our dataset) and provides useful information about the song, such as its language and its release date (not only the year). Furthermore, the page of a music cover also links to the page of the original song.

Thus, we decided to improve our dataset by web-scraping the SHS website.

As said before, the performance ID (used in the URL of the song page) is available in our cover dataframe (shsPerf), but 3075 tracks (17%) have an invalid shsPerf (shsPerf=-1), which makes their page impossible to reach directly.  
Thus, we have two ways to extract the language/year/original song via web-scraping:
- For a valid SHS performance ID, access the performance page (e.g. 'https://secondhandsongs.com/performance/1983') and scrape the language and release date with the perfInfo_SHS() function.
- For an invalid SHS performance ID, send a request to the search page (e.g. 'https://secondhandsongs.com/search/performance?title=blackbird&performer=beatles'), extract the performance ID with the find_shsPerf() function, and then use the perfInfo_SHS() function.
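
A search URL of this kind can be assembled without any network access, as sketched below. The query parameters (`title`, `op_title`, `performer`, `op_performer`) are the ones used in this notebook; the helper `build_search_url` is hypothetical, and letting `urlencode` escape the values is a variation on the manual clean-up done in the notebook, not the same treatment.

```python
from urllib.parse import urlencode

SEARCH_URL = "https://secondhandsongs.com/search/performance"

def build_search_url(title, performer):
    """Build the detailed-search URL with 'contains' conditions on both fields."""
    params = {
        "title": title.lower(),
        "op_title": "contains",
        "performer": performer.lower(),
        "op_performer": "contains",
    }
    # urlencode escapes special characters and turns spaces into '+'
    return SEARCH_URL + "?" + urlencode(params)

print(build_search_url("Blackbird", "Beatles"))
```

Escaping the values keeps characters such as `&` in an artist name from being read as parameter separators.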


In [23]:
#pickle.dump(covers,open('covers.p','wb'))
covers=pickle.load(open('covers.p','rb'))

In [24]:
covers.shape

(18196, 7)

The name of the artist differs depending on whether we use the information provided in unique_artists.txt or in unique_tracks.txt. Indeed, in order to assign a single *artistID* to each track, the MSD team split the featuring tracks and arbitrarily kept one of the artists.  
Both pieces of information will be useful depending on the precision we want in our algorithms, which is why two fields hold artist names (*name* and *artist*) in our cover dataframe.

In [25]:
#Merge with the unique_tracks dataframe to get the artist name given for the track (featurings included)
covers=covers.merge(unique_tracks[['artist']],how='left',left_on='trackID',right_index=True)
covers.head()

| | trackID | artistID | shsPerf | clique_id | name | title | year | artist |
|---|---|---|---|---|---|---|---|---|
| 0 | TRGDMZP128F42BC52B | ARB1DDF1187FB4FCFB | -1 | 0 | Louis Armstrong | Stardust | 1988.0 | Louis Armstrong & His Orchestra |
| 1 | TRCATYW12903D038FE | ARGJEEO1271F573FD6 | -1 | 0 | Artie Shaw and his orchestra | Stardust | 1988.0 | Artie Shaw and his orchestra |
| 2 | TRVMZJZ128F4270CE4 | ARY0HTV1187FB4A1B1 | -1 | 0 | Hoagy Carmichael | Star Dust | 1999.0 | Hoagy Carmichael |
| 3 | TRKOINL128F42926C3 | ARQ5FSZ1187B98AD74 | -1 | 0 | Connee Boswell & Sy Oliver Orchestra | Star Dust | | Connee Boswell & Sy Oliver Orchestra |
| 4 | TROJZTF128F428B546 | ARJN76O1187FB43C99 | -1 | 1 | Ana Belén | Yo Vengo A Ofrecer Mi Corazon | 2001.0 | Ana Belén |


In [27]:
# Request to find the shsPerf of the invalid entries (shsPerf=-1)
# Corresponds to a detailed search on the SHS website with a 'contains' condition on the title and on the artist

def find_shsPerf(x):
    
    # x corresponds to the index position of each track (in order to execute the function on the entire dataframe)
    title=covers.iloc[x]['title'] 
    artist=covers.iloc[x]['name']
    shsPerf=covers.iloc[x]['shsPerf']
    
    # In case of invalid (missing) shsPerf
    if shsPerf<0: 
        title=title.replace('.', '').replace('_', '').replace('/', '').lower().replace(' ','+')
        artist=artist.replace('.', '').replace('_', '').replace('/', '').lower().replace(' ','+')
        r=requests.get('https://secondhandsongs.com/search/performance?title='+title+'&op_title=contains&performer='+artist+'&op_performer=contains')
        soup = BeautifulSoup(r.text, 'html.parser')
        results=soup.find('tbody')

        if results is None :
            # Assign default value of 0 if no results found in the detailed search request
            new_shsPerf=0
        else:
            # Take the first result of the detailed search (since it is sorted according relevance)
            new_shsPerf=int(results.find('a',attrs={'class':'link-performance'})['href'].split('/')[2])
    
    # Keep as it is if the shsPerf is valid
    else :
        new_shsPerf=shsPerf
        
        
    return new_shsPerf

In [28]:
#Find the shsPerf for the tracks which don't have a valid one
#covers_withSHS=covers_p.shsPerf.index.map(lambda x: find_shsPerf(x)) 

In [29]:
#pickle.dump(covers_withSHS,open('data/covers_withSHS_new.p','wb'))
covers['shsPerf']=pickle.load(open("data/covers_withSHS_new.p","rb"))

In [30]:
covers.head()

| | trackID | artistID | shsPerf | clique_id | name | title | year | artist |
|---|---|---|---|---|---|---|---|---|
| 0 | TRGDMZP128F42BC52B | ARB1DDF1187FB4FCFB | 0 | 0 | Louis Armstrong | Stardust | 1988.0 | Louis Armstrong & His Orchestra |
| 1 | TRCATYW12903D038FE | ARGJEEO1271F573FD6 | 0 | 0 | Artie Shaw and his orchestra | Stardust | 1988.0 | Artie Shaw and his orchestra |
| 2 | TRVMZJZ128F4270CE4 | ARY0HTV1187FB4A1B1 | 412972 | 0 | Hoagy Carmichael | Star Dust | 1999.0 | Hoagy Carmichael |
| 3 | TRKOINL128F42926C3 | ARQ5FSZ1187B98AD74 | 0 | 0 | Connee Boswell & Sy Oliver Orchestra | Star Dust | | Connee Boswell & Sy Oliver Orchestra |
| 4 | TROJZTF128F428B546 | ARJN76O1187FB43C99 | 0 | 1 | Ana Belén | Yo Vengo A Ofrecer Mi Corazon | 2001.0 | Ana Belén |


In [32]:
# Number of missing shsPerf in our dataset
len(covers[covers.shsPerf==0])

1088

We still can't access the SHS pages of **1088** music covers (6%); their missing *shsPerf* is set to a default value of 0. We will need to decide whether to remove them, since it will be impossible to find their missing release date and/or language.
For now, we apply perfInfo_SHS to the whole dataset.

In [78]:
# Request to extract the language, the release date and the shsPerf of the original song
# Computed for all the tracks in our dataset

def perfInfo_SHS(shsPerf):
        
    # If the shsPerf is missing in our dataset assign default values "Unavailable" for the three fields
    if shsPerf==0:
        perfLanguage='Unavailable'
        perfDate='Unavailable'
        original_shsPerf='Unavailable'
     
    # If we have shsPerf information in our dataset
    else :
        r = requests.get('https://secondhandsongs.com/performance/'+str(shsPerf)) # Access to the song page on SHS
        soup = BeautifulSoup(r.text, 'html.parser')
        perfMeta=soup.find('dl',attrs={'class':'dl-horizontal'})
        
        # If no metadata of the song is found, assign default values "Missing" to differentiate unreachable informations
        # and missing ones.
        if perfMeta is None:
            perfLanguage='Missing'
            perfDate='Missing'
            original_shsPerf='Missing'
        else :
            # Extract language
            perfLanguage=perfMeta.find('dd',attrs={'itemprop':'inLanguage'})
            if perfLanguage is None :
                perfLanguage='Missing'
            else :
                perfLanguage=perfLanguage.text

            # Extract released date    
            perfDate=perfMeta.find('div',attrs={'class':'media-body'})
            if perfDate is None :
                perfDate='Missing'
            else :
                perfDate=perfDate.find('p').text.split('\n')[2].strip(' ')

            #Find the original shsPerf (work or performance ID) 
            original_section=soup.find('section',attrs={'class':'work-originals'})
            versions_section=soup.find('section',attrs={'id':'entity-section'})
            
            if original_section is None :
                if versions_section is None :
                    original_shsPerf='Missing'
                else :
                    
                    # Check that a link exists before reading its href (find() returns None when there is none)
                    original_link=versions_section.find('a')
                    if original_link is None :
                        original_shsPerf='Missing'
                    else :
                        original_shsPerf=original_link['href'].split('/')[2]

            else :
                original_shsPerf=original_section.find('div',attrs={'class':'media-body'})
                original_shsWork=original_section.find('a',attrs={'class':'link-work'})['href']
                
                if original_shsPerf is None :
                    original_shsPerf=original_shsWork.split('/')[2]
                else :
                    original_shsPerf=original_shsPerf.find('a')['href'].split('/')[2]

    return perfLanguage,perfDate,original_shsPerf

In [None]:
#covers['language'], \
#covers['date'], \
#covers['original_shsPerf']= zip(*covers.shsPerf.map(perfInfo_SHS))

In [None]:
covers.tail()

In [None]:
#pickle.dump(covers,open('data/covers_new.p','wb'))
covers=pickle.load(open("data/covers_new.p","rb"))

In [None]:
covers.shape

In [None]:
print('Number of unavailable language : ', len(covers[covers.language=='Unavailable']) )
print('Number of missing language : ', len(covers[covers.language=='Missing']) )
print('Total number of tracks with no language information : ', len(covers[(covers.language=='Missing') | (covers.language=='Unavailable')])) 
print('Number of unavailable original shsPerf :' ,len(covers[covers.original_shsPerf=='Unavailable']))
print('Number of missing original shsPerf :' ,len(covers[covers.original_shsPerf=='Missing']))
print('Total number of tracks with no original shsPerf information : ', len(covers[(covers.original_shsPerf=='Missing') | (covers.original_shsPerf=='Unavailable')])) 

Title impossible to find

We noticed several problems with this intermediate result:

1/ The track year (year) sometimes differs from the release date (date) extracted from the SHS website; we prefer the date found on the SHS website.
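
The preference rule of point 1/ can be written as a tiny standalone function. This is a sketch only: `best_date` is a hypothetical helper, and the `"March 1977"` date format is an assumed example. The two placeholder strings are the ones returned by the scraping function of this notebook.

```python
PLACEHOLDERS = {"Missing", "Unavailable"}

def best_date(shs_date, msd_year):
    """Prefer the SHS release date; fall back to the MSD year when it is a placeholder."""
    if shs_date in PLACEHOLDERS:
        return msd_year  # may itself be None when the year is unknown
    return shs_date

print(best_date("March 1977", 1978.0))
print(best_date("Missing", 1988.0))
print(best_date("Unavailable", None))
```

The third call shows the remaining gap: when both sources are empty, the track keeps no date at all.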

In [None]:
covers['date'] = covers.apply(lambda row: row['year'] if ((row['date']=='Missing') | (row['date']=='Unavailable')) else row['date'],axis=1)

In [None]:
covers.drop('year',axis=1,inplace=True)

In [None]:
# Number of tracks where we don't have any information about the release year
len(covers[covers.date.isnull()])

In [None]:
covers.shape

In [None]:
# Are all the web-scraped original shsPerf in the dataset?
covers[(covers.original_shsPerf != 'Missing') & (covers.original_shsPerf != 'Unavailable')].original_shsPerf.astype('int').isin(covers.shsPerf.unique()).value_counts()

In [None]:
covers.shsPerf.unique()

In [None]:
covers[(covers.original_shsPerf != 'Missing') & (covers.original_shsPerf != 'Unavailable')].original_shsPerf.head()

In [None]:
# Replace both 'Unavailable' and 'Missing' values by 0
covers['original_shsPerf'] = np.where(((covers['original_shsPerf']=='Unavailable') | (covers['original_shsPerf']=='Missing')), 0, covers['original_shsPerf'])
covers['original_shsPerf']=covers['original_shsPerf'].astype(int)

In [None]:
covers.head()

In [None]:
# Compute the frequency for each original shsPerf and sort values according frequency
freq_original=covers.groupby(['clique_id','original_shsPerf'],as_index=False)['clique_id'].agg({'freq':'count'})

In [None]:
freq_original.sort_values(['clique_id', 'freq'], ascending=[True, False],inplace=True)
freq_original.set_index(['clique_id', 'freq'],drop=False,inplace=True)
freq_original.drop('clique_id',axis=1,inplace=True)

freq_original.head()

In [None]:
len(covers.shsPerf.unique()[1:])

In [None]:
freq_original.dtypes

In [None]:
freq_original_agg=freq_original[['freq']].groupby(by='clique_id').agg([{'nbrow':'count','nbfreq':'sum'}])

In [None]:
freq_original_agg.columns=freq_original_agg.columns.droplevel()
freq_original_agg.columns=freq_original_agg.columns.droplevel()
freq_original_agg['clique_id']=freq_original_agg.index

In [None]:
freq_original_agg.head()

In [None]:
#Keep only clique where there are 2 elements with two different original_shsPerf
freq_original_agg=freq_original_agg[(freq_original_agg.nbrow==2) & (freq_original_agg.nbfreq==2)]

In [None]:
freq_original_agg.head()

In [None]:
covers_clique=covers.copy()
covers_clique.set_index(['clique_id','trackID'],inplace=True)

In [None]:
covers_clique.head()

In [None]:
def attribute_original(clique_id) :

    first_elem=covers_clique.loc[clique_id].iloc[0]
    second_elem=covers_clique.loc[clique_id].iloc[1]
    
    if (first_elem.original_shsPerf==0) | (second_elem.original_shsPerf==0) :
        if (first_elem.original_shsPerf==0) & (second_elem.original_shsPerf!=second_elem.shsPerf):
            elemTrack=first_elem.name
            elemSHS=second_elem.original_shsPerf
            
        elif (second_elem.original_shsPerf==0) & (first_elem.original_shsPerf!=first_elem.shsPerf):
            elemTrack=second_elem.name
            elemSHS=first_elem.original_shsPerf
        else :
            elemTrack=np.nan
            elemSHS=np.nan
            
    else :
        elemTrack=np.nan
        elemSHS=np.nan
        
    return elemTrack, elemSHS

In [None]:
replace_shs=pd.DataFrame(columns=['elemTrack','elemSHS'])
replace_shs['elemTrack'], \
replace_shs['elemSHS'] = zip(*freq_original_agg.clique_id.map(attribute_original))

In [None]:
replace_shs.dropna(axis=0,inplace=True)
replace_shs.elemSHS=replace_shs.elemSHS.astype(int)
replace_shs.head()

In [None]:
covers=covers.merge(replace_shs,how='left',left_on='trackID',right_on='elemTrack')
covers['shsPerf'] = np.where(covers['elemSHS'].notnull(), covers['elemSHS'], covers['shsPerf'])

In [None]:
covers.drop(['elemTrack','elemSHS'],axis=1,inplace=True)
covers.shsPerf=covers.shsPerf.astype(int)

In [None]:
#Save pickle file
#pickle.dump(covers,open('data/covers_final.p','wb'))
covers=pickle.load(open("data/covers_final.p","rb"))

In [None]:
len(covers[covers.shsPerf==0])

In [None]:
def find_original_track(clique_id,max_freq_row):
    
    # shsPerf present in the dataset (the first unique value is skipped)
    unique_shsPerf=covers.shsPerf.unique()[1:]
    # Candidate originals of the clique, most frequent first (at most three are considered)
    candidates=[max_freq_row.original_shsPerf]+list(freq_original.loc[clique_id].original_shsPerf.values[1:3])

    original_song='Unknown'
    for position,candidate in enumerate(candidates):
        # The most frequent candidate is ignored when it is the default value 0
        if (position==0) and (candidate==0):
            continue
        # Keep the first candidate that is one of the tracks of our dataset
        if candidate in unique_shsPerf:
            original_song=candidate
            break

    return clique_id, original_song

In [None]:
#Create empty DataFrame to contain the outputs of find_original_track()
original_song_df=pd.DataFrame(columns=['clique_id','original_id'])

In [None]:
original_song_df['clique_id'], \
original_song_df['original_id'] = zip(*pd.Series(freq_original.index.get_level_values('clique_id').unique()).map(lambda x : find_original_track(x,freq_original.loc[x].iloc[0])))

In [None]:
original_song_df.shape

In [None]:
#Number of cliques where no original song was found via web-scraping
len(original_song_df[original_song_df.original_id=='Unknown'])

In [None]:
original_song_df[original_song_df.original_id=='Unknown'].head()

In [None]:
#Number of original_id duplicates between two cliques
len(original_song_df[(original_song_df.duplicated(subset='original_id',keep=False)) & (original_song_df.original_id!='Unknown')])

In [None]:
original_song_df[(original_song_df.duplicated(subset='original_id',keep=False)) & (original_song_df.original_id!='Unknown')].sort_values('original_id').head()

In [None]:
#Merge the original of each clique with cover dataframe
covers=covers.merge(original_song_df[['clique_id','original_id']],how='left',left_on='clique_id',right_on='clique_id')

In [None]:
# Parentheses are required: '&' binds more tightly than '!='
dup_shsPerf=covers[covers.duplicated('shsPerf') & (covers.shsPerf!=0)].sort_values('shsPerf').shsPerf
all_duplicates=covers[covers.duplicated('shsPerf',keep=False) & (covers.shsPerf!=0)].sort_values('shsPerf')

In [None]:
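def duplicate_to_keep(dup_shsPerf) :
    # Among the tracks sharing this shsPerf, keep the one with the fewest empty fields
    dup=covers[covers.shsPerf==dup_shsPerf]
    # replace() returns a new dataframe, so its result must be assigned
    dup=dup.replace(['Unknown','Unavailable'],0)
    count=(dup[['language','date','original_shsPerf','original_id']].iloc[:,1:] == 0).sum(axis=1).sort_values()
    # count.index holds index labels, hence .loc
    trackID_keep=covers.loc[count.index[0]].trackID
    
    return trackID_keep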
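
The idea of keeping, among duplicated records, the one that carries the most information can be sketched in plain Python. This is a variation rather than the exact rule above: treating `'Missing'` and `None` as empty and counting the `language` field are assumptions of this sketch, and the records are toy values.

```python
EMPTY = {"Unknown", "Unavailable", "Missing", 0, None}
FIELDS = ("language", "date", "original_shsPerf", "original_id")

def completeness(record):
    """Count how many of the descriptive fields hold a real value."""
    return sum(record[field] not in EMPTY for field in FIELDS)

def pick_duplicate(records):
    """Keep the record with the most filled fields (the first one wins on ties)."""
    return max(records, key=completeness)

# Hypothetical duplicates sharing the same shsPerf
records = [
    {"trackID": "T1", "language": "Missing", "date": "1977",
     "original_shsPerf": 0, "original_id": "Unknown"},
    {"trackID": "T2", "language": "English", "date": "1977",
     "original_shsPerf": 16660, "original_id": 16660},
]
kept = pick_duplicate(records)
print(kept["trackID"])
```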

In [None]:
to_keep=pd.DataFrame(data=dup_shsPerf.map(duplicate_to_keep))

In [None]:
to_remove=all_duplicates.merge(to_keep,how='left',left_on='trackID',right_on='shsPerf')
to_remove=to_remove[to_remove.shsPerf_y.isnull()]

In [None]:
to_remove.shape

In [None]:
covers=covers[~covers.trackID.isin(to_remove.trackID)]

In [None]:
covers.shape

In [None]:
pickle.dump(covers,open('data/covers_withoutduplicates.p','wb'))
#covers=pickle.load(open("data/covers_withoutduplicates.p","rb"))

In [None]:
def merge_cliques(original_id, covers, cliques):
    # Move every track of the cliques sharing this original into the first of these cliques
    cliques_list = cliques[cliques['original_id'] == original_id].index
    c1 = cliques_list[0]
    for c in cliques_list[1:]:
        # set_value() was removed from recent pandas versions; .loc performs the same assignment
        covers.loc[covers['clique_id'] == c, 'clique_id'] = c1

def merge(covers):
    # One row per clique, restricted to the cliques whose original is known
    cliques = covers.groupby('clique_id').max()
    cliques = cliques[cliques['original_id'] != "Unknown"]
    # Originals that appear in more than one clique
    dup = (cliques[cliques['original_id'].duplicated()].sort_values("original_id")['original_id']).tolist()
    for d in dup:
        merge_cliques(d, covers, cliques)
        
    covers.sort_values("clique_id", inplace=True)
    covers.reset_index(inplace=True)
    covers.drop('index', axis=1, inplace=True)
    return covers
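
The merging rule (cliques that share the same known original collapse into the first of them) can be checked on a toy mapping without pandas. `merge_clique_ids` is a hypothetical helper and the input values are made up for the illustration.

```python
def merge_clique_ids(clique_original):
    """Map every clique ID to the first clique seen with the same original.

    clique_original: dict {clique_id: original_id}.
    Cliques whose original is 'Unknown' are never merged.
    """
    first_seen = {}
    mapping = {}
    for clique_id in sorted(clique_original):
        original_id = clique_original[clique_id]
        if original_id == "Unknown":
            mapping[clique_id] = clique_id
        else:
            # setdefault stores the first clique met for this original and returns it afterwards
            mapping[clique_id] = first_seen.setdefault(original_id, clique_id)
    return mapping

# Hypothetical toy input
mapping = merge_clique_ids({0: 16660, 1: "Unknown", 2: 16660, 3: "Unknown", 4: 24350})
print(mapping)
```

Cliques 0 and 2 share an original and end up together, while the two cliques with an unknown original stay separate.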

In [None]:
#merge(covers)

In [None]:
#pickle.dump(covers,open('data/covers_FINAL.p','wb'))
covers=pickle.load(open("data/covers_FINAL.p","rb"))

In [None]:
#Load the artist location
artist_location=pickle.load(open("data/artists_location.p","rb"))

In [None]:
artist_location.head()

In [None]:
covers=covers.merge(artist_location,how='left',left_on='name',right_on='name')

In [None]:
genres=pickle.load(open("data/genres_shs.p","rb"))

In [None]:
genres.head()

In [None]:
covers=covers.merge(genres[['genre']],how='left',left_on='trackID',right_index=True)

In [None]:
covers.head(15)

In [None]:
#pickle.dump(covers,open('data/covers_COMPLETE.p','wb'))
covers=pickle.load(open("data/covers_COMPLETE.p","rb"))

In [None]:
# The years recovered from the dataset are floats, so we need to strip their trailing ".0",
# and then we can take the last 4 characters of any date string to get the year.
covers["year"] = covers["date"].astype(str).replace(r"\.0$","",regex=True).str[-4:]
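
The same extraction can be tried on single values with the standard library. `extract_year` is a hypothetical helper written for this illustration, and the `"March 1977"` format is an assumed example of a scraped date.

```python
import re

def extract_year(date):
    """Return the last four characters of a date once a trailing '.0' is removed."""
    text = re.sub(r"\.0$", "", str(date))
    return text[-4:]

for value in (1988.0, "March 1977", "1977", float("nan")):
    print(repr(value), "->", extract_year(value))
```

A missing value becomes the string `'nan'`, which is why the following cells compare the year with `'nan'` rather than testing for nulls.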

In [None]:
covers

In [None]:
covers[covers.original_id=='Unknown']

In [None]:
# Find cliques where original_id='Unknown' and where there is at least one NaN year
nan_years=covers[covers.original_id=='Unknown'].groupby(['clique_id','year'],as_index=False)['clique_id'].agg({'freq':'count'})

In [None]:
nan_years[nan_years.year=='nan'].head()

In [None]:
#Number of cliques where nan present 
len(nan_years[nan_years.year=='nan'])

In [None]:
nan_years[nan_years.year=='nan'].clique_id.head()

In [None]:
covers=covers[~covers.clique_id.isin(nan_years[nan_years.year=='nan'].clique_id)]

In [None]:
covers[covers.year=='nan'].shape

In [None]:
original_song=pd.DataFrame(covers.original_id.unique())

In [None]:
original_song.shape

In [None]:
original_song['first_algo_result']='original'
original_song.columns=['original_id','first_algo_result']
original_song.set_index('original_id',inplace=True)

In [None]:
original_song.drop('Unknown',inplace=True)

In [None]:
original_song.shape

In [None]:
original_song.head()

In [None]:
covers.shape

In [None]:
covers=covers.merge(original_song,how='left',left_on='shsPerf',right_index=True)

In [None]:
covers.head()

In [None]:
year_nan=covers.groupby(['clique_id','year'],as_index=False)['clique_id'].agg({'freq':'count'})

In [None]:
year_nan[year_nan.year=='nan'].clique_id.head()

In [None]:
second_algo=covers[~covers.clique_id.isin(year_nan[year_nan.year=='nan'].clique_id)]

In [None]:
second_algo.head()

In [None]:
covers['second_algo_result']= np.where(((covers['rank']==1) & (covers.clique_id.isin(second_algo.clique_id))) , 'original', np.nan)

In [None]:
covers

In [None]:
pickle.dump(covers,open('data/covers_algo_results.p','wb'))
#covers=pickle.load(open("data/covers_COMPLETE.p","rb"))

In [None]:
second_algo_fail=covers.groupby(['clique_id','second_algo_result'],as_index=False)['clique_id'].agg({'freq':'count'})

In [None]:
second_algo_fail.head()

In [None]:
len(second_algo_fail[(second_algo_fail.second_algo_result=='original') & (second_algo_fail.freq>1)])

In [None]:
covers[covers.original_id=='Unknown']

In [None]:
set_original(2)

In [None]:
covers_clique=covers.copy()
covers_clique.set_index(['clique_id','trackID'], inplace=True)
covers_clique

In [None]:
covers_clique.loc[2].original_id.iloc[0]

In [None]:
covers[covers.shsPerf==covers_clique.loc[2].original_id.iloc[0]]

In [None]:
# In each clique, we rank the versions by year to find the original
covers["rank"] = covers.groupby("clique_id")["year"].rank(method="dense",ascending=True).astype(int)
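
As an illustration of the ranking step, the sketch below applies the same dense ranking to a toy dataframe. The values are made up and the year is assumed to be numeric. It also shows how ties at rank 1 reveal cliques whose original cannot be decided by year alone.

```python
import pandas as pd

# Toy data (hypothetical): clique 4 has two versions sharing the earliest year
toy = pd.DataFrame({
    "clique_id": [2, 2, 4, 4, 4],
    "year": [1977, 2009, 1976, 1993, 1976],
})

# Dense ranking inside each clique: tied years share the same rank
toy["rank"] = (
    toy.groupby("clique_id")["year"]
    .rank(method="dense", ascending=True)
    .astype(int)
)

# Cliques where several versions are ranked first are ambiguous
first = toy[toy["rank"] == 1]
ambiguous_cliques = (
    first[first.duplicated(subset="clique_id", keep=False)]["clique_id"]
    .unique()
    .tolist()
)
print(toy)
print(ambiguous_cliques)
```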

In [None]:
covers

In [None]:
first_ranked = covers[covers["rank"] == 1]
unknown_originals = first_ranked[first_ranked.duplicated(subset="clique_id",keep=False)]

In [None]:
merge_covers.head(10)

In [None]:
#Save cover dataFrame in pickle file
pickle.dump(original_song_df,open('data/original_song_df.p','wb'))
pickle.dump(covers,open('data/covers_merged_new.p','wb'))

In [2]:
covers=pickle.load(open("data/covers_merge_algo.p","rb"))

In [3]:
covers[covers.clique_id==1190].shape

(14, 18)

In [4]:
covers.drop(['original_shsPerf','original_id','first_algo_result','second_algo_result','artist'],axis=1,inplace=True)

In [5]:
covers.rename(columns = {'name':'artist','final_original':'status'}, inplace = True)

In [6]:
covers.set_index(['clique_id','trackID'],inplace=True)

In [7]:
covers

clique_id,trackID,artistID,shsPerf,artist,title,language,date,country,genre,year,rank,status
2,TRCKNGE128F92DA3F3,AR1CB5G1187B9AFB8E,16660,Electric Light Orchestra,Mr. Blue Sky,English,1977,United Kingdom,Rock,1977,1,original
2,TRIOPLY128F423CFF3,ARKZJ301187FB521B2,551633,Lily Allen,Mr Blue Sky,English,2009,United Kingdom,Pop,2009,2,cover
3,TRWNDEU128F9329BF7,ARVZWQ31187B9B8946,354066,Liars,Mr Your On Fire Mr,English,"October 1, 2002",United States,Alternative,2002,1,original
3,TRYOPHS128F146DEFD,AR6NYHH1187B9BA128,354067,Yeah Yeah Yeahs,Mr. You're On Fire Mr.,English,"June 23, 2003",United States,Alternative,2003,2,cover
4,TRQNZCE128E078A9C0,ARWILYB1187FB37DFE,52011,Bananarama,More_ More_ More,English,"March 20, 1993",United Kingdom,Pop,1993,2,cover
4,TRMBSQR128F92DF66E,ARPI2DX1187FB4CED4,52010,Andrea True Connection,More More More,English,1976,United States,R&B/Soul,1976,1,original
6,TREVWUZ128F4263A9B,AR9UYPT1187B9AE833,45769,Hear'Say,Monday Monday,English,2001,United Kingdom,,2001,1,cover
6,TRGYREY128E0791913,ARQ294N1187FB53D2A,9133,The Mamas & The Papas,Monday_ Monday,Unavailable,,United States,,,2,original
7,TRBLBSR128F425EBFE,AR36FFP1187B9926D7,59663,Silicon Teens,Memphis Tennessee,English,September 1980,United Kingdom,Electronic,1980,5,cover
7,TRLRWJK128F427D602,AR6NBDC1187FB4D96D,575,Chuck Berry,Memphis_ Tennessee,English,May 1959,United States,Rock,1959,1,original


In [15]:
#pickle.dump(covers,open('covers_FINAL_NEW.p','wb'))
covers=pickle.load(open("covers_FINAL.p","rb"))

In [16]:
covers

clique_id,trackID,artistID,shsPerf,artist,title,language,date,country,genre,year,rank,status
2,TRCKNGE128F92DA3F3,AR1CB5G1187B9AFB8E,16660,Electric Light Orchestra,Mr. Blue Sky,English,1977,United Kingdom,Rock,1977,1,original
2,TRIOPLY128F423CFF3,ARKZJ301187FB521B2,551633,Lily Allen,Mr Blue Sky,English,2009,United Kingdom,Pop,2009,2,cover
3,TRWNDEU128F9329BF7,ARVZWQ31187B9B8946,354066,Liars,Mr Your On Fire Mr,English,"October 1, 2002",United States,Alternative,2002,1,original
3,TRYOPHS128F146DEFD,AR6NYHH1187B9BA128,354067,Yeah Yeah Yeahs,Mr. You're On Fire Mr.,English,"June 23, 2003",United States,Alternative,2003,2,cover
4,TRQNZCE128E078A9C0,ARWILYB1187FB37DFE,52011,Bananarama,More_ More_ More,English,"March 20, 1993",United Kingdom,Pop,1993,2,cover
4,TRMBSQR128F92DF66E,ARPI2DX1187FB4CED4,52010,Andrea True Connection,More More More,English,1976,United States,R&B/Soul,1976,1,original
6,TREVWUZ128F4263A9B,AR9UYPT1187B9AE833,45769,Hear'Say,Monday Monday,English,2001,United Kingdom,,2001,1,cover
6,TRGYREY128E0791913,ARQ294N1187FB53D2A,9133,The Mamas & The Papas,Monday_ Monday,Unavailable,,United States,,,2,original
7,TRBLBSR128F425EBFE,AR36FFP1187B9926D7,59663,Silicon Teens,Memphis Tennessee,English,September 1980,United Kingdom,Electronic,1980,5,cover
7,TRLRWJK128F427D602,AR6NBDC1187FB4D96D,575,Chuck Berry,Memphis_ Tennessee,English,May 1959,United States,Rock,1959,1,original


In [17]:
covers.loc[(covers.year.str.contains("nan")) | (covers.year==""),"year"]=np.nan

In [18]:
covers.loc[(covers.date.str.contains("nan")) | (covers.date==""),"year"]=np.nan

In [19]:
covers

clique_id,trackID,artistID,shsPerf,artist,title,language,date,country,genre,year,rank,status
2,TRCKNGE128F92DA3F3,AR1CB5G1187B9AFB8E,16660,Electric Light Orchestra,Mr. Blue Sky,English,1977,United Kingdom,Rock,1977,1,original
2,TRIOPLY128F423CFF3,ARKZJ301187FB521B2,551633,Lily Allen,Mr Blue Sky,English,2009,United Kingdom,Pop,2009,2,cover
3,TRWNDEU128F9329BF7,ARVZWQ31187B9B8946,354066,Liars,Mr Your On Fire Mr,English,"October 1, 2002",United States,Alternative,2002,1,original
3,TRYOPHS128F146DEFD,AR6NYHH1187B9BA128,354067,Yeah Yeah Yeahs,Mr. You're On Fire Mr.,English,"June 23, 2003",United States,Alternative,2003,2,cover
4,TRQNZCE128E078A9C0,ARWILYB1187FB37DFE,52011,Bananarama,More_ More_ More,English,"March 20, 1993",United Kingdom,Pop,1993,2,cover
4,TRMBSQR128F92DF66E,ARPI2DX1187FB4CED4,52010,Andrea True Connection,More More More,English,1976,United States,R&B/Soul,1976,1,original
6,TREVWUZ128F4263A9B,AR9UYPT1187B9AE833,45769,Hear'Say,Monday Monday,English,2001,United Kingdom,,2001,1,cover
6,TRGYREY128E0791913,ARQ294N1187FB53D2A,9133,The Mamas & The Papas,Monday_ Monday,Unavailable,,United States,,,2,original
7,TRBLBSR128F425EBFE,AR36FFP1187B9926D7,59663,Silicon Teens,Memphis Tennessee,English,September 1980,United Kingdom,Electronic,1980,5,cover
7,TRLRWJK128F427D602,AR6NBDC1187FB4D96D,575,Chuck Berry,Memphis_ Tennessee,English,May 1959,United States,Rock,1959,1,original


In [21]:
pickle.dump(covers,open('covers_FINAL67.p','wb'))

In [None]:
original_song_df[(original_song_df.duplicated(subset='original_id')==True) & (original_song_df.original_id!='Unknown')].head()

In [None]:
covers[covers.clique_id==5158]

In [None]:
covers[covers.clique_id==1228]

In [None]:
covers[covers.clique_id==3584]

In [None]:
covers[covers.clique_id==1754]

In [None]:
covers[covers.original_shsPerf==623]

In [None]:
covers

In [None]:
original_song.shape

In [None]:
cover_new=covers.copy()

In [None]:
cover_new[]

In [None]:
covers['original_shsPerf'] = np.where(((covers['original_shsPerf']=='Unavailable') | (covers['original_shsPerf']=='Missing')), 0, covers['original_shsPerf'])

We have only 345 tracks with no information about the year.

2/ Some original performances found on the SHS website do not appear in their clique, so we need to add these tracks to our dataframe.
- Check that each clique has at least one original song defined
- Check that the original song is unique within each clique
- Check whether we already have information about the original song in our dataframe or whether we need to scrape it from the SHS website
- Remove duplicate tracks (same shsPerf)

In [None]:
merge_covers

In [None]:
covers

In [None]:
#At least one original song in each clique ?
#Replace missing values by NaN in the original_shsPerf column (so that they are not counted as unique values)
covers['original_shsPerf']=covers['original_shsPerf'].replace(['Missing','Unavailable'],np.nan)
count_unique=covers.groupby('clique_id').original_shsPerf.nunique(dropna=True)

In [None]:
#Is the original song unique in each clique ?
count_unique[count_unique>1].head()

In [None]:
count_unique[count_unique==0].index

In [None]:
count_tracks_clique=covers.groupby('clique_id').size()

In [None]:
count_tracks_clique[count_tracks_clique==1]

In [None]:
#All the cliques having more than 1 Original Performance
covers[covers.clique_id.isin(count_unique[count_unique>1].index)==True]

In [None]:
#All the cliques having only NaN values for Original Performance
covers[covers.clique_id.isin(count_unique[count_unique==0].index)==True]

In [None]:
#Take the clique id we defined as id of the dataframe (not unique index for now)
covers.set_index('clique_id',inplace=True)
covers.sort_index(inplace=True)

In [None]:
covers.head()

In [None]:
covers.loc[covers[(covers.year.isnull()) & ((covers.date!='Missing') & (covers.date!='Unavailable'))].loc,

In [None]:
covers[(covers.year=='Missing') | (covers.year=='Unavailable')].shape

In [None]:
#Number of cliques with no original song found
len(count_unique[count_unique==0])

In [None]:
#
shsPerfUnique=pd.DataFrame(covers.shsPerf.unique())
originalPerfUnique=pd.DataFrame(covers.originalPerfUnique.unique())
merge=originalPerfUnique.merge(shsPerfUnique,how='left')
merge

In [None]:
covers

Here is the information we have obtained so far. We noticed several problems that need to be fixed later:
- The track year (year) sometimes differs from the release date (date) we extracted from the SHS website; we will prefer the date information found on the SHS website.
- Some languages and some dates are missing (we will consider dropping these covers or restricting the analysis of these parameters to a subset of cliques).
- Some original performances found on the SHS website do not appear in their clique, so we need to add these tracks to our dataframe (by finding them in the cluster or by scraping the SHS website directly).
- We may still have trouble establishing the ranking (we might only discriminate the original track without sorting the music covers).
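
As a sketch of the first point, preferring the SHS date over the track year could look like the following. The column names `shs_year` and `msd_year` are hypothetical, and the values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical columns: `shs_year` comes from SHS, `msd_year` from the track metadata
toy = pd.DataFrame({
    "shs_year": [1977.0, np.nan, 2003.0],
    "msd_year": [1978.0, 1993.0, np.nan],
})

# Prefer the SHS value and fall back on the other year only when SHS has none
toy["year"] = toy["shs_year"].fillna(toy["msd_year"])
print(toy)
```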

The remaining work for this part is to extend the analysis to the entire cover dataframe, resolve the problems listed above and finish the multi-level indexing by ranking the tracks by date within each clique. The Second Hand Songs (SHS) API is limited to 1000 requests per hour. Given the large number of requests needed (668 to resolve the missing SHS problem and 18196 to find the language, year and original song), we may ask the SHS team for an exception to this limit.
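
To give an order of magnitude, the short computation below estimates how long the scraping would take under the quoted limit. It assumes one request per item and a constant request rate, which is a simplification.

```python
import math

REQUESTS_PER_HOUR = 1000       # SHS API limit quoted above
missing_shs_requests = 668     # requests to resolve the missing SHS problem
metadata_requests = 18196      # requests for language / year / original song

total_requests = missing_shs_requests + metadata_requests
hours_needed = math.ceil(total_requests / REQUESTS_PER_HOUR)
seconds_between_requests = 3600 / REQUESTS_PER_HOUR

print(total_requests, hours_needed, seconds_between_requests)
```

Under these assumptions, the whole scraping spans about 19 hours with one request every 3.6 seconds.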

Then, the goal is to add to this dataframe the information we extracted from the cluster (tempo, song hotness), from the LastFM dataset (genre) and from the artist_location table (country of the artist), using the methods described in the next sections.

We will then have all the information required to start the analysis of our music covers.

In [None]:


#Compute the order of songs for each clique
#covers['rank']=covers.groupby('clique_id')['year'].rank(method='dense',ascending=True).astype('int')

### 4. Access to files (tempo / danceability)

We open the first file of the subset, to check what the HDF5 keys are and then we read each of them.

In [None]:
with pd.HDFStore("data/MillionSongSubset/data/A/A/A/TRAAAAW128F429D538.h5") as hdf:
    print(hdf.keys())

In [None]:
pd.read_hdf("data/MillionSongSubset/data/A/A/A/TRAAAAW128F429D538.h5","/analysis/songs")

In [None]:
pd.read_hdf("data/MillionSongSubset/data/A/A/A/TRAAAAW128F429D538.h5","/metadata/songs")

In [None]:
pd.read_hdf("data/MillionSongSubset/data/A/A/A/TRAAAAW128F429D538.h5","/musicbrainz/songs")

We only need to extract <tt>tempo</tt> and <tt>song_hotttnesss</tt>, here is an example of how to do that on the subset :

In [None]:
tempo = []
hotness = []

files = glob.glob("data/MillionSongSubset/data/A" + "/[A-Z]/[A-Z]/*")
for f in files:
    tempo.append(pd.read_hdf(f,"/analysis/songs")["tempo"][0])
    hotness.append(pd.read_hdf(f,"/metadata/songs")["song_hotttnesss"][0])

In [None]:
tempo = np.asarray(tempo)
hotness = np.asarray(hotness)
print(tempo)
print(hotness)

In [None]:
print("Number of tracks =", len(files))
print("with missing hotness values =", np.sum(np.isnan(hotness)))

We already have 3268 unknown hotness values on a subset of only 7620 songs, so we can expect to have that information for only a little over half of our final dataset. We may end up not using it.
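
The share behind "a little over half" can be checked with the two counts quoted above:

```python
n_tracks = 7620            # tracks scanned in the subset
n_missing_hotness = 3268   # tracks without a hotness value

missing_share = n_missing_hotness / n_tracks
available_share = 1 - missing_share
print(f"missing: {missing_share:.1%}, available: {available_share:.1%}")
```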

Once all the files are accessible on the cluster, we will have to go through our SHS dataset and get those attributes for each track_id.
We will do so in the following way : (the paths are just examples on the subset)

In [None]:
tempo = []
hotness = []

my_file = Path("/path/to/file")
for track in covers["trackID"]:
    folder1 = track[2]
    folder2 = track[3]
    folder3 = track[4]
    folder_path = "data/MillionSongSubset/data/" + folder1 + "/" + folder2 + "/" + folder3 + "/"
    track_path = folder_path + track + ".h5"
    if Path(track_path).exists(): #to delete later
        tempo.append(pd.read_hdf(track_path,"/analysis/songs")["tempo"][0])
        hotness.append(pd.read_hdf(track_path,"/metadata/songs")["song_hotttnesss"][0])

In [None]:
print(len(tempo))
print(np.sum(~np.isnan(hotness)))

Unsurprisingly, we only found 204 of those tracks in the subset and 128 of them have a hotness value.
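
The path construction used in the loop above can be isolated in a small helper. This is only a sketch: the function name `track_to_path` and its `root` default are assumptions, and the root folder would have to be adapted to the cluster layout.

```python
from pathlib import Path

def track_to_path(track_id, root="data/MillionSongSubset/data"):
    """Build the .h5 path of a track from the 3rd, 4th and 5th letters of its ID."""
    return Path(root) / track_id[2] / track_id[3] / track_id[4] / (track_id + ".h5")

example_path = track_to_path("TRAAAAW128F429D538")
print(example_path.as_posix())
```

Keeping the rule in one place avoids repeating the string concatenation in every cell that reads track files.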

### 5. Determine artist location for spatial analysis

In [None]:
#Load Additional files
unique_artists=pd.read_csv('data/AdditionalFiles/unique_artists.txt',delimiter='<SEP>',engine='python',header=None,index_col=0,names=['artistID','artistMID','randomTrack','name'])
artist_location=pd.read_csv('data/AdditionalFiles/artist_location.txt',delimiter='<SEP>',engine='python',header=None,index_col=0,names=['artistID','lat','long','name','location'])
artist_location.head()

We now define a function that loads a subset of the Second Hand Songs dataset.

In [None]:
def read_shs_files(pathToFile):
    # Flatten an SHS file: each track line is prefixed with the header line of its clique
    s = StringIO()
    cur_ID = None
    with open(pathToFile) as f:       # the file is closed automatically
        for ln in f:
            if not ln.strip():        # skip blank lines
                continue
            if ln.startswith('%'):    # clique header: keep it for the following tracks
                cur_ID = ln.replace('\n','<SEP>',1)
                continue
            if cur_ID is None:
                print('NO ID found')
                sys.exit(1)
            s.write(cur_ID + ln)
    s.seek(0)
    df = pd.read_csv(s,delimiter='<SEP>',engine='python',header=None,names=['shsID','trackID','artistID','shsPerf'])
    return df[['trackID', 'artistID', 'shsPerf']]

We retrieve the artists' names using the unique_artists.txt file and we assign a location for each track using the artist_location.txt.

In [None]:
def get_location(x) : 
    # Return the location of an artist, or NaN if the artist is not in the table
    if x in artist_location.index:
        return artist_location.at[x, 'location']   # .at replaces get_value, removed in pandas 1.0
    else : 
        return np.nan
    
data=read_shs_files('data/SHS_testset.txt')
data['artist'] = data['artistID'].map(lambda x : unique_artists.at[x, 'name'])
data['location'] = data['artistID'].map(get_location)
data.head()
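
As an alternative sketch, the same lookup can be expressed by mapping the IDs with a Series, which yields NaN for unknown artists without an explicit test. The toy tables below are made up, and the approach assumes that the artist IDs in the lookup table are unique.

```python
import pandas as pd

# Toy stand-ins for the real tables (IDs and values are made up)
toy_location = pd.DataFrame(
    {"location": ["London, England", "Memphis, TN"]},
    index=pd.Index(["AR001", "AR002"], name="artistID"),
)
toy_data = pd.DataFrame({"artistID": ["AR002", "AR999", "AR001"]})

# Mapping with a Series aligns on its index; unknown IDs give NaN
toy_data["location"] = toy_data["artistID"].map(toy_location["location"])
print(toy_data)
```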

We now create the function that finds the country for each location. To do so, we use three different Python packages: pycountry, us and geopy. The first two limit the number of calls to geopy.geocoders, which does not support too many requests.

- First, we use the pycountry package to extract the country if the location contains one.

- If pycountry matches no country, we use the us package to check whether the location contains a US state. From the data, we observed that when a location refers to a US state, it is either the state alone or the state is its last element.

- If the two previous methods fail, we use the geopy.geocoders package with Nominatim().

- We manually define the country for some locations, as they are sometimes misspelled, truncated or consist of a website link.

In [None]:
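# Packages used by get_country
import pycountry
import us
from geopy.geocoders import Nominatim

# Recent geopy versions require an explicit user_agent; the name below is a placeholder
geolocator = Nominatim(user_agent="my_way_covers")

def get_country(x):
    if pd.isnull(x):   # `x == np.nan` is always False, so we test for missing values explicitly
        return x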
    x = x.replace("-", ",")
    for c in pycountry.countries:
        if "England" in x or "UK" in x: 
            return "United Kingdom"
        elif c.name.lower() in x.lower():
            return c.name
    refactorlast = x.split(",")[-1].replace(" ", "")
    refactorfirst = x.split(",")[0]
    usstatelast = us.states.lookup(refactorlast)
    usstatefirst = us.states.lookup(refactorfirst)
    if usstatelast != None or usstatefirst != None:
        return "United State of America"
    elif x == "Swingtown":
        return "United State of America"
    elif x == "<a href=\"http://billyidol.net\" onmousedown='UntrustedLink.bootstrap($(this), \"fc44f8f60d13ab68c56b3c6709c6d670\", event)' target=\"_blank\" rel=\"nofollow\">http://billyidol.net</a>":
        return "United Kingdom"
    elif x == "Lennox Castle, Glasgow" or x == "Knowle West, Bristol, Avon, Engla"\
        or x == "Goldsmith's College, Lewisham, Lo" or x == "Julian Lennon&#039;s Official Facebook Music Page"\
        or x == "Sydney, Moscow, Pressburg" or x == "Penarth, Wales to Los Angeles" or x == "Leicester, Leicestershire, Englan":
        return "United Kingdom"
    elif x == "Vancouver, British Columbia, Cana":
        return "Canada"
    elif x == "Washington DC" or x == "Philladelphia" or "New Jersey" in x:
        return "United State of America"
    elif "Czechoslovakia" in x :
        return "Česko"
    elif x == "Jaded Heart Town":
        return "Germany"
    elif x == "RU" or x == "Russia":
        return "Russia"
    else :
        location = geolocator.geocode(x, timeout=None)
        if location is None:   # the geocoder found nothing for this location
            return np.nan
        return location.address.split(",")[-1]

In [None]:
#data['country'] = data['location'].map(lambda x : get_country(x))

The only problem with geopy is that it returns the country name in its native language. To make our data uniform, we create a function that manually translates the country names into English.

In [None]:
def rename(x):
    if "België - Belgique - Belgien" in x:
        return "Belgium"
    elif "Brasil" in x:
        return "Brazil"
    elif "United State" in x:
        return "United States of America"
    elif "Italia" in x:
        return "Italy"
    elif "Norge" in x:
        return "Norway"
    elif "España" in x:
        return "Spain"
    elif "Nederland" in x :
        return "Netherlands"
    elif "Suomi" in x :
        return "Finland"
    elif "Sverige" in x :
        return "Sweden"
    elif "UK" in x :
        return "United Kingdom"
    elif x[0] == " ":
        return x[1:]
    else : 
        return x
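
The same translation can also be sketched as a lookup table, which is easier to extend when new native names show up. The names `NATIVE_TO_ENGLISH` and `rename_with_table` are hypothetical. Unlike `rename`, this version strips all surrounding whitespace instead of a single leading space.

```python
# Native-language names as returned by the geocoder, mapped to English
NATIVE_TO_ENGLISH = {
    "België - Belgique - Belgien": "Belgium",
    "Brasil": "Brazil",
    "United State": "United States of America",
    "Italia": "Italy",
    "Norge": "Norway",
    "España": "Spain",
    "Nederland": "Netherlands",
    "Suomi": "Finland",
    "Sverige": "Sweden",
    "UK": "United Kingdom",
}

def rename_with_table(country):
    """Translate a country name with the table, otherwise strip surrounding spaces."""
    for native, english in NATIVE_TO_ENGLISH.items():
        if native in country:
            return english
    return country.strip()

print(rename_with_table(" Sverige"))
print(rename_with_table(" France"))
```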

In [None]:
#data['country'] = data['country'].map(lambda x : rename(x))
#pickle.dump(data, open( "data.p", "wb" ) )
data_country = pickle.load(open("data/data_country.p", "rb"))

In [None]:
data_country.head()

### 6. Addition of the genre for each track (Use of LastFM dataset and external website for genre listing)

To find the genre of a song, we use the LastFM dataset, which contains a list of tags for each song.
Since that dataset is derived from the MillionSongDataset, we do not use all the tracks available in LastFM but only those contained in the SecondHandSong dataset.

In [None]:
# Loading the files if they are in the SecondHandSong dataset and create the dataframe
covers_df = pickle.load(open("data/covers.p","rb"))
list_tracks = covers_df.trackID
test_path = "../../lastfm_test"
train_path = "../../lastfm_train"

genre_df = pd.DataFrame()
def create_dataFrame(genre_df):
    frames = [genre_df]
    for track in list_tracks:
        # LastFM files are stored under the 3rd, 4th and 5th letters of the track ID
        folder_path = "/" + track[2] + "/" + track[3] + "/" + track[4] + "/"
        track_path = folder_path + track + ".json"
        # Look in the train set first, then in the test set (the ".json" suffix was missing there)
        for base_path in (train_path, test_path):
            if os.path.exists(base_path + track_path):
                with open(base_path + track_path) as f:
                    frames.append(pd.DataFrame.from_dict(json.load(f), orient="index").transpose())
                break
    # DataFrame.append was removed in pandas 2.0, so we concatenate once at the end
    genre_df = pd.concat(frames).reset_index()
    return genre_df

#tracks_with_tags = create_dataFrame(genre_df)
tracks_with_tags = pickle.load(open("tracks_with_tags", "rb"))

We now list the unique tags in the resulting dataframe. Due to a time limit for the computation of the matching, we will first test on a subset.

In [None]:
tags = list()
for i in range (0,1000):
    tags = tags + tracks_with_tags.tags[i]
    
tags = np.unique(tags).tolist()

Many tags contain useless information, so we first perform a pre-cleaning.

In [None]:
clean_tags = {}
def clean_tag(x):
    clean = x.replace("ooo", "")
    clean = clean.replace("-o", "")
    clean = clean.replace("o-", "")
    clean = clean.replace("- ", "")
    clean = clean.replace("-", "")
    clean_tags[x] = clean
for t in tags:
    clean_tag(t)
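
To see what these replacements do, the sketch below applies the same patterns, in the same order, to a few made-up tags. The helper name `clean_tag_value` is hypothetical and returns the cleaned string instead of filling a dictionary.

```python
def clean_tag_value(tag):
    """Apply the same replacements as clean_tag and return the cleaned tag."""
    for pattern in ("ooo", "-o", "o-", "- ", "-"):
        tag = tag.replace(pattern, "")
    return tag

# Made-up tags, for illustration only
examples = ["ooo-rock-ooo", "hip-hop", "-ooo-", "lo-fi"]
cleaned = {t: clean_tag_value(t) for t in examples}
print(cleaned)
```

Note that the `o-` rule also alters legitimate tags: `lo-fi` becomes `lfi`, so the cleaned values should probably be reviewed before they are used for matching.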

In order to assign a genre to each song, we use its tags and try to match them with a list of genres obtained by scraping the http://www.musicgenreslist.com website. For more details on the scraping, see the notebook Genre Webscrapping.ipynb.

In [None]:
map_genres = pickle.load(open("data/map_genres", "rb"))
all_genres = pickle.load(open("data/all_genres.p", "rb"))

We then use <tt>SequenceMatcher</tt>, from the <tt>difflib</tt> module of the standard library, to match tags to the scraped genres.

In [None]:
from difflib import SequenceMatcher

threshold = 0.80
def match_genres():
    genre_map = {}
    no_match = list()
    for i, name1 in enumerate(tags):
        if i%1000 == 0:
            print(i)                             # progress indicator
        if clean_tags[name1] == "":              # nothing left after cleaning: no genre
            genre_map[name1] = np.nan
            continue
        best_ratio = 0
        match = ""
        for name2 in map_genres.keys():
            if name2.lower() in name1.lower():   # the tag contains the name of a main genre
                for subgenre in map_genres[name2]:
                    # compare the tag with each subgenre (the loop variable was previously unused)
                    ratio = SequenceMatcher(None,name1.lower(),subgenre.lower()).ratio()
                    if ratio > best_ratio:       # we find the maximum similarity
                        best_ratio = ratio
                        match = name2
                if (best_ratio > threshold):     # if it's superior to our threshold we add that couple to the mapping
                    genre_map[name1] = match
                else:
                    genre_map[name1] = name2
        if match == "":
            for subgenre in all_genres:
                # fall back on the full genre list (name2 was a stale variable here)
                ratio = SequenceMatcher(None,name1.lower(),subgenre.lower()).ratio()
                if ratio > best_ratio:       # we find the maximum similarity
                    best_ratio = ratio
                    match = subgenre
            if (best_ratio > threshold):     # if it's superior to our threshold we add that couple to the mapping
                genre_map[name1] = match
            else :
                genre_map[name1] = np.nan
                no_match.append(name1)       # keep track of the tags left without a genre
    return (genre_map, no_match)
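
To get a feel for the similarity ratio and the 0.80 threshold used in `match_genres`, here is a small sketch on made-up tag and genre pairs. The helper name `similarity` is hypothetical.

```python
from difflib import SequenceMatcher

threshold = 0.80  # same threshold as in match_genres

def similarity(a, b):
    """Case-insensitive similarity ratio between two strings, in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Made-up tag / genre pairs
identical = similarity("Rock", "rock")
close = similarity("hip hop", "hip-hop")
unrelated = similarity("jazz", "rock")

print(identical, close, unrelated)
```

The ratio is twice the number of matching characters divided by the total length of both strings, so a single differing character in a short tag is usually still accepted by the threshold.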

### 4. Access to files (tempo / danceability)