# Finding similar tracks via hashtags

In this notebook, a new dataframe is built from `df_global`: all the hashtags of a given track (`track_id`) are grouped into a single column, which is then cleaned of stopwords.

Rows whose `hashtag_cleaned` column is NaN after cleaning are dropped (`dropna`).

This column is then vectorized with term frequency-inverse document frequency (tf-idf), which makes it possible to run similarity searches with `linear_kernel`.

__After a few tests and a look at the results, the similarity search is wrapped in a function at the end of the notebook.__

However, since we have no way of measuring performance, it is hard to tell whether this approach is much more effective than a plain substring search in the dataframe's hashtag column (`str.contains(...)`).

In addition, we do not know the track titles, so we cannot judge the relevance of the results directly. We have to rely on the hashtags supplied by users, which in no way guarantees that the ranking, and therefore the suggestions drawn from it, is consistent.
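
For reference, the substring baseline mentioned above could look like the following minimal sketch. The `toy` dataframe is made-up data, not taken from `df_global`. It shows that a substring search only returns an unranked yes/no match, and that it also matches inside longer tags, unlike a whole-token search.

```python
import pandas as pd

# Toy stand-in for the grouped hashtag column (hypothetical data).
toy = pd.DataFrame({
    "track_id": ["a", "b", "c", "d"],
    "hashtag": ["rock pops 70s", "rock hardrock", "jazz", "poprock live"],
})

# Substring search: regex=False treats the query as a literal string.
substring_hits = toy[toy["hashtag"].str.contains("rock", regex=False)]

# Whole-token search: split each row into tags and test membership.
token_hits = toy[toy["hashtag"].str.split().apply(lambda tags: "rock" in tags)]

print(substring_hits["track_id"].tolist())  # ['a', 'b', 'd']
print(token_hits["track_id"].tolist())      # ['a', 'b']
```

Neither variant orders its matches, which is the main thing the tf-idf similarity adds.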

In [1]:
import pandas as pd
import numpy as np


In [2]:
df_global = pd.read_csv("df_global.csv")
print(df_global['track_id'].nunique(), "distinct tracks")

161456 distinct tracks


In [3]:
# before merging the hashtags (for the report)
df_global[df_global['track_id']=='0000e47c1207e2c637a44753a713456f'][['track_id', 'hashtag']]


Unnamed: 0,track_id,hashtag
3461919,0000e47c1207e2c637a44753a713456f,rock
3461920,0000e47c1207e2c637a44753a713456f,pops
3461921,0000e47c1207e2c637a44753a713456f,70s
4768742,0000e47c1207e2c637a44753a713456f,music
4863387,0000e47c1207e2c637a44753a713456f,music


In [4]:
# build a dataframe with one row per track and all of its hashtags (duplicate id-hashtag pairs removed)
df_vector = df_global[['track_id','hashtag']].drop_duplicates().groupby(['track_id'], as_index = False).agg({'hashtag': ' '.join})

In [5]:
# for the report
df_vector[df_vector['track_id']=='0000e47c1207e2c637a44753a713456f']

Unnamed: 0,track_id,hashtag
1,0000e47c1207e2c637a44753a713456f,rock pops 70s music


In [6]:
df_vector

Unnamed: 0,track_id,hashtag
0,00003213fb3d4959f42e9157b0eda0a5,newchristianmusic
1,0000e47c1207e2c637a44753a713456f,rock pops 70s music
2,0001dc79946a42fbc837c044be0bdbbc,apt
3,000248c97c5991b9900360aca97d9879,musicislife disturbed tbfmonline vkscrobbler
4,00027df4d0bd64108624757fe4cbfe76,radio rock music hardrock
...,...,...
161451,fffd293d9450783348d5f9d169ed8abe,hopes
161452,fffd8d636ba5082a01d9f4127ac17b89,listenlive
161453,fffd997ff184ba9929358a77644fff6c,musicislife
161454,fffdf00857154771de0d8479d8341e1e,stonerrock doommetal


In [7]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\A454131\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [8]:
from nltk.corpus import stopwords
stop = stopwords.words('english')

# add custom stopwords (radio-station and platform tags that carry no musical information)
useless_words = ['listenlive','class95','kiss92','music','6music','234radio',
                 'jammin105','radio2','audio5cafe','radio','bbc6music','tweetlink',
                 'myplaylist','tbfmonline', 'vkscrobbler', 'wzbt', 'nowlistening',
                 'spotify', 'onlineradio', 'tunein'
                ]
stop.extend(useless_words)

In [9]:
# hashtags without the stopwords
df_vector['hashtag_cleaned'] = df_vector['hashtag'].apply(lambda x: ' '.join([word for word in x.split() if word not in stop]))

# replace empty strings (or whitespace only) with NaN
# (np.nan rather than np.NaN: the latter alias was removed in NumPy 2.0)
df_vector['hashtag_cleaned'] = df_vector['hashtag_cleaned'].replace(r'^\s*$', np.nan, regex=True)

# rows where the two columns differ
df_vector[df_vector['hashtag']!=df_vector['hashtag_cleaned']]


Unnamed: 0,track_id,hashtag,hashtag_cleaned
1,0000e47c1207e2c637a44753a713456f,rock pops 70s music,rock pops 70s
3,000248c97c5991b9900360aca97d9879,musicislife disturbed tbfmonline vkscrobbler,musicislife disturbed
4,00027df4d0bd64108624757fe4cbfe76,radio rock music hardrock,rock hardrock
5,000359abe73f500143335921d078a7b0,listenlive class95 kiss92 music 6music radiopo...,radiopowermix nowonair tsgonair planet90 niuro...
7,0003f7b02133154299b9c65215b3ac1e,wzbt,
...,...,...,...
161444,fffba65dae0f6f9f27d0f3f128017e6b,listenlive,
161445,fffc7d4fca481dccf7606c2066677a78,listenlive streetstyleradio 6music bbc6music,streetstyleradio
161449,fffd0d291f4addb6c4640a0f50d22f3f,hitmusic chicagomusic listenlive hitparty,hitmusic chicagomusic hitparty
161450,fffd0e6aa371d5d9f325092b4b176a5b,listenlive,


In [10]:
# for the report
df_vector[df_vector['track_id']=='0000e47c1207e2c637a44753a713456f']

Unnamed: 0,track_id,hashtag,hashtag_cleaned
1,0000e47c1207e2c637a44753a713456f,rock pops 70s music,rock pops 70s


In [11]:
df_vector.dropna(subset=['hashtag_cleaned'], inplace=True)
df_vector.drop(columns=['hashtag'], inplace=True)

df_vector.reset_index(drop=True, inplace=True)

In [12]:
df_vector

Unnamed: 0,track_id,hashtag_cleaned
0,00003213fb3d4959f42e9157b0eda0a5,newchristianmusic
1,0000e47c1207e2c637a44753a713456f,rock pops 70s
2,0001dc79946a42fbc837c044be0bdbbc,apt
3,000248c97c5991b9900360aca97d9879,musicislife disturbed
4,00027df4d0bd64108624757fe4cbfe76,rock hardrock
...,...,...
111196,fffd0d291f4addb6c4640a0f50d22f3f,hitmusic chicagomusic hitparty
111197,fffd293d9450783348d5f9d169ed8abe,hopes
111198,fffd997ff184ba9929358a77644fff6c,musicislife
111199,fffdf00857154771de0d8479d8341e1e,stonerrock doommetal


## Vectorization

In [13]:
# tf-idf matrix (term frequency, inverse document frequency): one row per track, one column per hashtag
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer().fit_transform(df_vector['hashtag_cleaned'])
tfidf.shape


(111201, 42281)
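
As a side note on why `linear_kernel` is enough here: with its default settings, `TfidfVectorizer` L2-normalizes each row, so a plain dot product between two rows is already their cosine similarity. The sketch below checks this on four made-up hashtag strings (`docs` is hypothetical data, not the notebook's matrix).

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Toy hashtag strings (hypothetical data).
docs = ["rock pops 70s", "rock 70s pops", "rock pops 60s", "jazz"]
toy_tfidf = TfidfVectorizer().fit_transform(docs)

# Each row has unit L2 norm under the default norm='l2'.
norms = np.linalg.norm(toy_tfidf.toarray(), axis=1)

# Dot product and cosine similarity of the first row against all rows.
dot = linear_kernel(toy_tfidf[0:1], toy_tfidf).flatten()
cos = cosine_similarity(toy_tfidf[0:1], toy_tfidf).flatten()

print(norms)
print(dot)
print(np.allclose(dot, cos))
```

Because tf-idf ignores word order, the first two strings get a similarity of 1, while `jazz` shares no term with the first string and scores 0.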

### Preliminary tests and searches

In [14]:
# similarity between the 2nd entry, tfidf[1], and every row of the vectorized matrix
from sklearn.metrics.pairwise import linear_kernel
hashtag_similarities = linear_kernel(tfidf[1:2], tfidf).flatten()
hashtag_similarities

array([0., 1., 0., ..., 0., 0., 0.])

In [15]:
# the 4 closest results: the slice [:-5:-1] keeps four indices, the query itself (index 1) included
related_docs_indices = hashtag_similarities.argsort()[:-5:-1]
related_docs_indices

array([    1, 83847, 30054, 44206], dtype=int64)

In [16]:
# check against the hashtags
df_vector.iloc[1]

track_id           0000e47c1207e2c637a44753a713456f
hashtag_cleaned                       rock pops 70s
Name: 1, dtype: object

In [17]:
# check against the hashtags (continued)
df_vector.iloc[83847]

track_id           c0a3db38bd9f8f8c44ed0c2d0e1b60af
hashtag_cleaned                       rock pops 70s
Name: 83847, dtype: object

In [18]:
# check against the hashtags (continued)
df_vector.iloc[30054]

track_id           45138db5d3e84a423b12650a0f65d4a6
hashtag_cleaned                       rock 70s pops
Name: 30054, dtype: object

In [19]:
# check against the hashtags (continued)
df_vector.iloc[44206]

track_id           65929a3541868bf88b85bdc16e86e1d0
hashtag_cleaned                       rock pops 60s
Name: 44206, dtype: object
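
Two details of the `argsort()[:-5:-1]` slice are easy to miss: it returns four indices rather than five, and the query itself is among them. One possible alternative is sketched below. The helper name `top_k_excluding_query` and the toy scores are illustrative assumptions, not part of the notebook.

```python
import numpy as np

def top_k_excluding_query(similarities, query_idx, k):
    """Return the indices of the k highest scores, skipping query_idx."""
    scores = np.asarray(similarities, dtype=float).copy()
    scores[query_idx] = -np.inf           # the query always matches itself
    k = min(k, scores.size - 1)           # never ask for more rows than exist
    # Sort on negated scores for descending order; stable sort keeps ties in row order.
    order = np.argsort(-scores, kind="stable")
    return order[:k]

# Toy similarity vector (hypothetical values); row 1 is the query.
toy_scores = np.array([0.0, 1.0, 0.2, 1.0, 0.7])
print(top_k_excluding_query(toy_scores, query_idx=1, k=3))  # [3 4 2]
```

With this variant, asking for `k` suggestions yields `k` other tracks, and exact duplicates of the query are still reported.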

## "Automation"

In [27]:
'''
    helper functions
'''

# returns the hashtags at the given index
# (if the index exists in the dataframe; labels and positions
# coincide here because the index was reset after dropna)
def get_hashtags_by_index(df_index):
    if df_index in df_vector.index:
        return df_vector.iloc[df_index]['hashtag_cleaned']
    
# returns the hashtags of the given track_id
# (if the track_id exists in the dataframe)
def get_hashtags_by_track_id(track_id):
    # exact match: str.contains would treat track_id as a regex and match substrings
    if (df_vector['track_id'] == track_id).any():
        return df_vector[df_vector['track_id']==track_id]['hashtag_cleaned'].values[0]

    
# returns the track_id at the given index
# (if the index exists in the dataframe)
def get_trackid_by_index(df_index):
    if df_index in df_vector.index:
        return df_vector.iloc[df_index]['track_id']


# returns the index of the given track_id
# (if the track_id exists in the dataframe)
def get_index_by_trackid(track_id):
    # exact match, for the same reason as above
    if (df_vector['track_id'] == track_id).any():
        return df_vector.index[df_vector['track_id']==track_id].tolist()[0]

# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 


'''
    find the nb tracks most similar to track_id
    using the column that groups the hashtags
'''
def similar_tune(track_id, nb):
    # look up the index of the track_id in the dataframe
    df_index = get_index_by_trackid(track_id)
    if df_index is None:
        raise ValueError(f"unknown track_id: {track_id}")
    
    # similarity of this track to every row of the tf-idf matrix
    hashtag_similarities = linear_kernel(tfidf[df_index:df_index+1], tfidf).flatten()
    
    # indices of the nb highest scores
    # (the reference track scores 1 and normally appears among them)
    similar_tunes_idx = hashtag_similarities.argsort()[:-nb-1:-1]
    
    # convert to a dataframe (index, track_id, hashtags)
    idx_fnd = []
    trk_fnd = []
    hsh_fnd = []
    
    for res in similar_tunes_idx:
        idx_fnd.append(res)
        trk_fnd.append(get_trackid_by_index(res))
        hsh_fnd.append(get_hashtags_by_index(res))
    return pd.DataFrame({'idx' : idx_fnd, 'track_id' : trk_fnd,'hashtags' : hsh_fnd})


In [26]:
# reference track
tune = 'c0a3db38bd9f8f8c44ed0c2d0e1b60af'

# number of suggestions wanted
nb = 10

print('track : [',get_index_by_trackid(tune), "] : track_id =", tune)
print('hashtags:',get_hashtags_by_track_id(tune))
print('-----------------------------------------------------------------------')
print('results:')

similar_tune(tune,nb)


track : [ 83847 ] : track_id = c0a3db38bd9f8f8c44ed0c2d0e1b60af
hashtags: rock pops 70s
-----------------------------------------------------------------------
results:


Unnamed: 0,idx,track_id,hashtags
0,1,0000e47c1207e2c637a44753a713456f,rock pops 70s
1,83847,c0a3db38bd9f8f8c44ed0c2d0e1b60af,rock pops 70s
2,30054,45138db5d3e84a423b12650a0f65d4a6,rock 70s pops
3,44206,65929a3541868bf88b85bdc16e86e1d0,rock pops 60s
4,24288,37bc24b5c16cc1f91a1049d6ebed410b,pops 00s
5,71372,a4196978d57dd77ce85628bc86cb27d1,rock 70s
6,60588,8b4146285e3ca27693c3b1cf81851e1c,rock hardrock 70s
7,43678,646a1e0d1b8ceb54ce87f017d4357071,rock hardrock 70s
8,1786,03e508ca8989f648f305283100388b6c,rock hardrock 70s
9,102600,ec3ac9eb8e3df884700b76c70c252b5f,rock hardrock 70s


In [28]:
# further tests

# reference track
# tune = '00c838d53a639dcd4987fd0f92e5a674' # jazz
# tune = '00909a7d394b9aeb5c921c30d61f886f' # punk
# tune = '3e9e93051bbb2bc8979e111caa5c044c' # classical

# failure case
tune ='d801089f2dea97bb0ea5b95c875465e0'

# number of suggestions wanted
nb = 5

print('track : [',get_index_by_trackid(tune), "] : track_id =", tune)
print('hashtags:',get_hashtags_by_track_id(tune))
print('-----------------------------------------------------------------------')
print('results:')

similar_tune(tune,nb)


track : [ 93933 ] : track_id = d801089f2dea97bb0ea5b95c875465e0
hashtags: juevesmeloso
-----------------------------------------------------------------------
results:


Unnamed: 0,idx,track_id,hashtags
0,93933,d801089f2dea97bb0ea5b95c875465e0,juevesmeloso
1,37069,5545075b7f3a0bae8d14c9468b79abe4,streetstyleradio homegrownradio ryanlocker sav...
2,37058,553e1e757b2d176c973b353e075a3b1d,metalcore
3,37059,553e4cb281ad399347db86d2bde112db,bristol listeningnow 90smusic bbr bass
4,37060,553e4f0e9dde8a757e21ce87f7ae8902,love funklove 1rockhour nocturnalclassicswithl...
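
One plausible reading of this failure: if the query's only hashtag appears in no other track, every other similarity is 0, and the order that `argsort` gives to tied zeros says nothing about relevance. A possible guard, sketched below under that assumption, is to keep only strictly positive scores and accept that fewer than `nb` rows (possibly none) come back. The helper name `rank_nonzero` and the toy vectors are hypothetical.

```python
import numpy as np

def rank_nonzero(similarities, query_idx, k):
    """Rank rows with a strictly positive score, excluding the query itself."""
    scores = np.asarray(similarities, dtype=float)
    candidates = np.flatnonzero(scores > 0)             # rows sharing at least one term
    candidates = candidates[candidates != query_idx]    # drop the query
    # Descending order by score; stable sort keeps ties in row order.
    order = candidates[np.argsort(-scores[candidates], kind="stable")]
    return order[:k]

# Query (row 2) shares no hashtag with any other row: nothing to suggest.
isolated = np.array([0.0, 0.0, 1.0, 0.0, 0.0])
print(rank_nonzero(isolated, query_idx=2, k=5))   # []

# Query (row 2) shares hashtags with rows 0 and 3.
shared = np.array([0.5, 0.0, 1.0, 0.3])
print(rank_nonzero(shared, query_idx=2, k=5))     # [0 3]
```

Returning an empty result makes the absence of related tracks explicit instead of filling the table with unrelated rows.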


In [34]:
# further tests

# failure case
tune ='f16e9029afe5e148d2dffcf88a55603b'
# get_trackid_by_index(104841)

# number of suggestions wanted
nb = 5

print('track : [',get_index_by_trackid(tune), "] : track_id =", tune)
print('hashtags:',get_hashtags_by_track_id(tune))
print('-----------------------------------------------------------------------')
print('results:')

similar_tune(tune,nb)


track : [ 104841 ] : track_id = f16e9029afe5e148d2dffcf88a55603b
hashtags: hiltonhead savannah progressiveradio toulouse sanjose phoenix word beatbc ontheair winnipeg belfast brunswick freshlysqueezed yourmusic aracityradio rock itunes metal bringinitback manchester rheims orleans kamloops pittsburgh anchorage takeovertext 2002年の音楽 radiocidadeoficial tocandonacidade fuerzamartes cancióndeviernes cancióndeldía musicislife nanaimo liverpool köln lille tacoma fragradio gaming foofighters turnitup turnup rockon tocando radio98rock brazil atacalaradio allmylife tsgonair rcbs
-----------------------------------------------------------------------
results:


Unnamed: 0,idx,track_id,hashtags
0,104841,f16e9029afe5e148d2dffcf88a55603b,hiltonhead savannah progressiveradio toulouse ...
1,35674,51f31c2eb4aa9fbfcb9f6ddc6f716416,savannah hiltonhead ontheair belfast brunswick...
2,65829,9774a311af1b0e64e092e50f4a35ec2a,worldclassmusic savannah hiltonhead auckland b...
3,108216,f92726d6d470a7f911257132863bdf5c,progressiveradio savannah hiltonhead orleans l...
4,83168,bf1468972cc9e5360fb681d9c6bceda5,progressiveradio savannah hiltonhead buffalo o...


In [32]:
# random test
# (the upper bound of np.random.randint is exclusive, so no +1: len(df_vector) itself would be out of range)
tune = df_vector.iloc[np.random.randint(0,len(df_vector.index))].track_id

nb = 5

print('track : [',get_index_by_trackid(tune), "] : track_id =", tune)
print('hashtags:',get_hashtags_by_track_id(tune))
print('-----------------------------------------------------------------------')
print('results:')

similar_tune(tune,nb)



track : [ 98176 ] : track_id = e1f7ca59eafa1f8c0844b50c2760526c
hashtags: listeningto mondaymosh live
-----------------------------------------------------------------------
results:


Unnamed: 0,idx,track_id,hashtags
0,3757,087f3c0fd9b349dce3175c82fe20ac3b,listeningto mondaymosh live
1,19323,2c5f5215db24b5fb08fc46424551c2c3,listeningto mondaymosh live
2,34592,4f8ccf0758a8d99cb059de7d9c7c7e81,mondaymosh live listeningto
3,79041,b581d408305017b123775b8ff19a79b1,listeningto mondaymosh live
4,98176,e1f7ca59eafa1f8c0844b50c2760526c,listeningto mondaymosh live
