# Mentorship: 'How to classify songs into genres'
## Assignment III: Introduction to Machine Learning

**Before you start:**
- [Install spaCy and the language model you are going to work with](https://spacy.io/models#quickstart)

**Considerations:**
- Code style and tidiness are graded.
- Extra work is allowed, as long as the basic activities are completed.

**Recommendation:**
- There are plenty of code examples online; do not feel obliged to implement everything from scratch.

### Libraries

In [1]:
!pip3 install spotipy
!pip3 install pandas
!pip3 install spacy
!pip3 install pymusixmatch
!pip3 install nltk
!pip3 install scikit-learn
!pip3 install lyricsgenius

# Add any extra libraries you use to this cell and the next one



In [2]:
#-------------General helpers----------------------------
    
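def loadTrackFeatures(features_track, tracks_id):
    """Append the audio features of `tracks_id` to `features_track` (column -> list)."""
    features = sp.audio_features(tracks_id)
    for feature in features:
        # Spotify can return None for a track without audio features; skip it.
        if feature is None:
            continue
        for column_name in features_track.keys():
            features_track[column_name].append(feature.get(column_name))
    return pd.DataFrame(features_track)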
            

def loadTracksFromPlayListItems(tracks_dict, playlist_items):
    """Append the id, name and artist data of each playlist item to `tracks_dict`."""
    for item in playlist_items:
        track = item.get('track')
        for column_name in tracks_dict.keys():
            if column_name == 'artists_name':
                # Artist names and ids are taken from the track's album entry.
                artists_name, artists_id = getArtistsNameAndArtistIDFromArtistObjectList(track.get('album').get('artists'))
                tracks_dict['artists_name'].append(artists_name)
                tracks_dict['artists_id'].append(artists_id)
            elif column_name != 'artists_id':
                tracks_dict.get(column_name).append(track.get(column_name))
    return pd.DataFrame(tracks_dict)

def getArtistsNameAndArtistIDFromArtistObjectList( artistObjectList ):
    artist_name_list = []
    artist_id_list = []
    for artist in artistObjectList:
        artist_name_list.append(artist.get('name'))
        artist_id_list.append(artist.get('id'))
    return artist_name_list, artist_id_list

def getPlaylistDataFrame(playlist_id):
    """Download every track of the playlist, page by page, together with its audio features."""
    offset = 0
    total_tracks = None
    features_tracks_dict = {'id': [], 'key':[], 'mode':[], 'time_signature':[], 'tempo':[],
                       'acousticness':[], 'danceability':[], 'energy':[], 'instrumentalness':[], 'valence':[] }
    tracks_dict = {'id': [],'name': [], 'artists_name':[] , 'artists_id':[]}

    while True:
        tracks = sp.playlist_tracks(playlist_id, offset=offset, additional_types=('track',))
        playlist_items = tracks['items']
        limit = tracks.get('limit')

        loadTracksFromPlayListItems(tracks_dict, playlist_items)
        # Request the audio features only for the tracks added by this page.
        loadTrackFeatures(features_tracks_dict, tracks_id=tracks_dict.get('id')[offset:offset + limit])

        total_tracks = tracks.get('total') if (total_tracks is None) else total_tracks
        offset = tracks.get('offset') + limit

        # Stop once every track has been requested; >= avoids fetching an extra empty page.
        if offset >= total_tracks:
            break
    df_tracks = pd.DataFrame(tracks_dict)
    df_features_tracks = pd.DataFrame(features_tracks_dict)
    data_set = pd.merge(df_tracks, df_features_tracks, how='inner', on='id')
    return data_set

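from IPython.display import Markdown, display

def display_markdown(*args, **kwargs):
    """Render a Markdown string in the notebook output."""
    return display(Markdown(*args, **kwargs))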




### Dependencies and API access

In [3]:
import os

import pandas as pd
import numpy as np
import spotipy
import spacy
from spotipy.oauth2 import SpotifyClientCredentials
from musixmatch import Musixmatch

# Credentials are read from environment variables rather than hard-coded in the notebook.
# MUSIXMATCH_API_KEY is an assumed variable name; set all three before running this cell.
client_id = os.environ['SPOTIPY_CLIENT_ID']
client_secret = os.environ['SPOTIPY_CLIENT_SECRET']
client_credentials_manager = SpotifyClientCredentials(client_id=client_id, client_secret=client_secret)

#nlp = spacy.load("en_core_web_sm") # fill in the model you are going to use

sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager)
musixmatch = Musixmatch(os.environ['MUSIXMATCH_API_KEY']) # can be replaced by another API if it gives better results

In [4]:
mentoria_playlist_id = '2IuD0qZb14cji5y52crdsO'
df_mentoria =  getPlaylistDataFrame(mentoria_playlist_id)
df_mentoria

Unnamed: 0,id,name,artists_name,artists_id,key,mode,time_signature,tempo,acousticness,danceability,energy,instrumentalness,valence
0,7j9DYPyCuvSAtPcevpAkzb,Desafío,[Arca],[4SQdUpG4f7UbkJG3cJ2Iyj],1,0,5,161.092,0.486,0.161,0.482,0.409000,0.0926
1,1cwTMSQeMaA9fVKEF1iWeD,Anoche,[Arca],[4SQdUpG4f7UbkJG3cJ2Iyj],10,0,5,80.793,0.570,0.230,0.434,0.000000,0.0834
2,0aL27vskbMpwsMGUkHm3Zf,Sin Rumbo,[Arca],[4SQdUpG4f7UbkJG3cJ2Iyj],9,1,3,124.835,0.874,0.289,0.280,0.004430,0.0391
3,2kfSFdq2h0xLXq01em1zc7,La Gata Bajo la Lluvia,[Rocío Dúrcal],[2uyweLa0mvPZH6eRzDddeB],7,1,4,88.140,0.723,0.499,0.648,0.000000,0.4640
4,5ySxlyvySBhIEvoO2xx7uT,Querida,[Juan Gabriel],[2MRBDr0crHWE5JwPceFncq],2,1,4,89.089,0.376,0.528,0.383,0.000000,0.4600
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1225,1o7e4Tadn4o6nPvycwIP0C,En Libertad,[Rodrigo],[235Vf4hkmwvxjVEMuCbRxm],5,0,4,150.167,0.236,0.684,0.802,0.000012,0.8390
1226,5AZwGCyE9tIzFYXpqyrKHm,"Amante Tu, Amante El",[Rodrigo],[235Vf4hkmwvxjVEMuCbRxm],4,0,4,160.612,0.541,0.541,0.734,0.000000,0.7040
1227,6Lkh9lukUEPBTZ1gewNT0m,Todo Me Lleva a Ti,[Rodrigo],[235Vf4hkmwvxjVEMuCbRxm],7,1,4,96.312,0.157,0.600,0.897,0.000610,0.7560
1228,4sUEtsaCvQVIkzMksupERW,La Marcha De La Bronca,[Pedro Y Pablo],[5YDpwWFLxk3wmHBKqAcfiI],9,0,4,119.200,0.696,0.423,0.593,0.000000,0.5210


In [6]:
import json
import os

import lyricsgenius

# The Genius token is read from an environment variable (the variable name is an assumption).
genius_token = os.environ['GENIUS_ACCESS_TOKEN']
csv_folder_path = './TP_3_CSV_Files/{}'
genius = lyricsgenius.Genius(genius_token)

#song = genius.search_song("Desafío", "Arca")
#print(song)
#jsonStr = json.dumps(song.__dict__)
#test_dict = song.__dict__
#print(song.get("_body").get("primary_artist").get("name"))
#print(song.artist)
#print(song.title)

file_name = 'Test_track' + '_2.csv'
test_path = csv_folder_path.format(file_name)
df_test = pd.read_csv(test_path, sep='#')
df_test


Unnamed: 0,id,name,lyric
0,7j9DYPyCuvSAtPcevpAkzb,Desafío,"[Letra de ""Desafío""]\n\n[Verso 1]\nTócame de p..."
1,1cwTMSQeMaA9fVKEF1iWeD,Anoche,
2,0aL27vskbMpwsMGUkHm3Zf,Sin Rumbo,"[Letra de ""Sin Rumbo""]\n\n[Intro]\nGirando en ..."
3,2kfSFdq2h0xLXq01em1zc7,La Gata Bajo la Lluvia,
4,5ySxlyvySBhIEvoO2xx7uT,Querida,"[Letra de ""Querida""]\n\n[Verso 1]\nQuerida, ca..."
...,...,...,...
1225,1o7e4Tadn4o6nPvycwIP0C,En Libertad,
1226,5AZwGCyE9tIzFYXpqyrKHm,"Amante Tu, Amante El",
1227,6Lkh9lukUEPBTZ1gewNT0m,Todo Me Lleva a Ti,
1228,4sUEtsaCvQVIkzMksupERW,La Marcha De La Bronca,


In [74]:
# Resume from the lyrics already saved on disk and query Genius only for the missing tracks.
file_name = 'Test_track' + '_2.csv'
test_path = csv_folder_path.format(file_name)
df_test = pd.read_csv(test_path, sep='#')
tracks_lyric_dict = {'id': [], 'name': [], 'lyric': []}
try:
    for index, row in df_mentoria.iterrows():
        track_name = row.get("name")
        artists = row.get("artists_name")

        # Skip tracks that were already looked up in a previous run.
        if not df_test[df_test.id == row.get('id')].empty:
            continue
        song = genius.search_song(track_name, artists[0])
        lyric = None
        # Keep the lyrics only when both artist and title match exactly.
        if song and song.artist == artists[0] and song.title == track_name:
            lyric = song.lyrics
        tracks_lyric_dict["lyric"].append(lyric)
        tracks_lyric_dict["id"].append(row.get("id"))
        tracks_lyric_dict["name"].append(track_name)
        print(index)
except (Exception, KeyboardInterrupt) as error:
    # Whatever was fetched before the failure or interruption is still saved below.
    print(f"Lyrics download stopped early: {error!r}")

df_temporal = pd.DataFrame(tracks_lyric_dict)
# DataFrame.append was removed in pandas 2.0; pd.concat is its replacement.
df_append = pd.concat([df_test, df_temporal])
df_append.to_csv(test_path, encoding='utf-8', index=False, sep='#')
df_append



Unnamed: 0,id,name,lyric
0,7j9DYPyCuvSAtPcevpAkzb,Desafío,"[Letra de ""Desafío""]\n\n[Verso 1]\nTócame de p..."
1,1cwTMSQeMaA9fVKEF1iWeD,Anoche,
2,0aL27vskbMpwsMGUkHm3Zf,Sin Rumbo,"[Letra de ""Sin Rumbo""]\n\n[Intro]\nGirando en ..."
3,2kfSFdq2h0xLXq01em1zc7,La Gata Bajo la Lluvia,
4,5ySxlyvySBhIEvoO2xx7uT,Querida,"[Letra de ""Querida""]\n\n[Verso 1]\nQuerida, ca..."
...,...,...,...
1225,1o7e4Tadn4o6nPvycwIP0C,En Libertad,
1226,5AZwGCyE9tIzFYXpqyrKHm,"Amante Tu, Amante El",
1227,6Lkh9lukUEPBTZ1gewNT0m,Todo Me Lleva a Ti,
1228,4sUEtsaCvQVIkzMksupERW,La Marcha De La Bronca,


In [78]:
df_append['lyric'].isnull().sum()

845

In [8]:
import os
from os import path


artist_id_list = df_mentoria['artists_id'].to_list()
columns = ['artist_id', 'artist_name','artist_genre']
artist_genre_df = pd.DataFrame(columns=columns)

artists_genre_file_path = csv_folder_path.format('artists_genres.csv')
# Query Spotify only when the genres file has not been cached on disk yet.
if not path.exists(artists_genre_file_path):
    artist_rows = {}
    for id_list in artist_id_list:
        for artist_id in id_list:
            # Request each artist once, even if it appears in many tracks.
            if artist_id not in artist_rows:
                artist = sp.artist(artist_id)
                artist_rows[artist_id] = { 'artist_id':artist_id , 'artist_name':artist.get('name'), 'artist_genre': artist.get('genres')}
    # Build the frame in one step; DataFrame.append was removed in pandas 2.0.
    artist_genre_df = pd.DataFrame(list(artist_rows.values()), columns=columns)
    artist_genre_df.to_csv(artists_genre_file_path, encoding='utf-8', index=False, sep='#')

  


In [18]:
playlist_file_path = csv_folder_path.format('spotify_playlist_tracks.csv')
track_lyrics_file_path = csv_folder_path.format('Track_lyrics.csv')
artists_genre_file_path = csv_folder_path.format('artists_genres.csv')
playlist_df = pd.read_csv(playlist_file_path, sep='#')
track_lyrics_df = pd.read_csv(track_lyrics_file_path, sep='#')
artists_genre_df = pd.read_csv(artists_genre_file_path, sep='#')
playlist_df

Unnamed: 0,id,name,artists_name,artists_id,key,mode,time_signature,tempo,acousticness,danceability,energy,instrumentalness,valence
0,7j9DYPyCuvSAtPcevpAkzb,Desafío,['Arca'],['4SQdUpG4f7UbkJG3cJ2Iyj'],1,0,5,161.092,0.486,0.161,0.482,0.409000,0.0926
1,1cwTMSQeMaA9fVKEF1iWeD,Anoche,['Arca'],['4SQdUpG4f7UbkJG3cJ2Iyj'],10,0,5,80.793,0.570,0.230,0.434,0.000000,0.0834
2,0aL27vskbMpwsMGUkHm3Zf,Sin Rumbo,['Arca'],['4SQdUpG4f7UbkJG3cJ2Iyj'],9,1,3,124.835,0.874,0.289,0.280,0.004430,0.0391
3,2kfSFdq2h0xLXq01em1zc7,La Gata Bajo la Lluvia,['Rocío Dúrcal'],['2uyweLa0mvPZH6eRzDddeB'],7,1,4,88.140,0.723,0.499,0.648,0.000000,0.4640
4,5ySxlyvySBhIEvoO2xx7uT,Querida,['Juan Gabriel'],['2MRBDr0crHWE5JwPceFncq'],2,1,4,89.089,0.376,0.528,0.383,0.000000,0.4600
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1225,1o7e4Tadn4o6nPvycwIP0C,En Libertad,['Rodrigo'],['235Vf4hkmwvxjVEMuCbRxm'],5,0,4,150.167,0.236,0.684,0.802,0.000012,0.8390
1226,5AZwGCyE9tIzFYXpqyrKHm,"Amante Tu, Amante El",['Rodrigo'],['235Vf4hkmwvxjVEMuCbRxm'],4,0,4,160.612,0.541,0.541,0.734,0.000000,0.7040
1227,6Lkh9lukUEPBTZ1gewNT0m,Todo Me Lleva a Ti,['Rodrigo'],['235Vf4hkmwvxjVEMuCbRxm'],7,1,4,96.312,0.157,0.600,0.897,0.000610,0.7560
1228,4sUEtsaCvQVIkzMksupERW,La Marcha De La Bronca,['Pedro Y Pablo'],['5YDpwWFLxk3wmHBKqAcfiI'],9,0,4,119.200,0.696,0.423,0.593,0.000000,0.5210


In [32]:
data_frame = playlist_df
df_duplicated = data_frame[data_frame.duplicated()]
len_no_duplicated = len(data_frame) - len(df_duplicated)
print(len(data_frame))
print(len_no_duplicated)
#data_frame[data_frame.id == '1bQR9zy4vw1RuTOgOBitDB']

1230
1170


In [49]:
# Drop fully duplicated playlist rows and lyrics rows that repeat a track id.
df_no_duplicated = data_frame.drop_duplicates(keep='first', inplace=False, ignore_index=True)
df_lyric_no_duplicated = track_lyrics_df.drop_duplicates(subset='id', keep='first', inplace=False, ignore_index=True)

# Join audio features and lyrics on the track id.
df = pd.merge(df_no_duplicated, df_lyric_no_duplicated, how='inner', on='id')


In [51]:
file_path = csv_folder_path.format('spotify_playlist_lyrics.csv')
df.to_csv(file_path, encoding='utf-8', index=False, sep='#')

### 1) Gather the data obtained in the previous assignments

For this part we will use [the mentorship's collaborative playlist](https://open.spotify.com/playlist/2IuD0qZb14cji5y52crdsO?si=nfHRPDquQRyotEcXc4tG7Q), from which we will obtain:
- The audio features of the songs
- The textual features of their lyrics

The same preprocessing used in the previous assignments must also be applied to both kinds of features (the preprocessing from assignment 1 for the audio features and the one from assignment 2 for the textual features), and the genre of each song must be obtained. When a song has more than one genre, the team should agree on a strategy for those cases and describe it in the report.
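One possible strategy for songs with several genres, shown only as a minimal sketch: keep the first genre listed for the first artist that has any genre, and drop songs with no genre at all. The column names follow the CSV files built above; the toy rows and the `pick_genre` helper are hypothetical. Lists written to CSV come back as strings, so they are parsed with `ast.literal_eval`.

```python
import ast

import pandas as pd

# Toy stand-ins for the playlist and artist-genre CSV files (hypothetical rows).
# Lists are stored as strings, e.g. "['rock', 'pop']", once they go through CSV.
tracks = pd.DataFrame({
    "id": ["t1", "t2", "t3"],
    "artists_id": ["['a1']", "['a2']", "['a3']"],
})
artists = pd.DataFrame({
    "artist_id": ["a1", "a2", "a3"],
    "artist_genre": ["['rock', 'pop']", "['cuarteto']", "[]"],
})

# Map every artist id to its parsed list of genres.
genres_by_artist = {
    row.artist_id: ast.literal_eval(row.artist_genre)
    for row in artists.itertuples()
}


def pick_genre(artists_id_repr):
    """Hypothetical rule: first genre of the first artist that has any genre."""
    for artist_id in ast.literal_eval(artists_id_repr):
        genres = genres_by_artist.get(artist_id, [])
        if genres:
            return genres[0]
    return None


tracks["genre"] = tracks["artists_id"].map(pick_genre)
# Songs without any genre cannot be used as labelled examples.
labelled = tracks.dropna(subset=["genre"]).reset_index(drop=True)
print(labelled)
```

Other rules (the most frequent genre in the playlist, or a manual mapping of fine-grained genres to a few broad ones) fit the same `pick_genre` slot; whichever one the team chooses should be explained in the report.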

Next, the resulting dataset will be split into **X** and **y**, where:
- X is the set of features
- y is the label, in this case the song's genre, which must be encoded as **int** values
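The X/y separation and the integer encoding of the label could look like the sketch below. It assumes a `genre` column already exists; the small data frame is made up for illustration, and `LabelEncoder` is one of several ways to obtain integer labels.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up rows standing in for the merged dataset.
df = pd.DataFrame({
    "tempo": [161.1, 80.8, 124.8, 88.1],
    "energy": [0.48, 0.43, 0.28, 0.65],
    "genre": ["rock", "cuarteto", "rock", "ranchera"],
})

# X keeps every feature column; y is the genre encoded as integers.
X = df.drop(columns=["genre"])
encoder = LabelEncoder()
y = encoder.fit_transform(df["genre"])

print(y)                             # integer codes, one per song
print(encoder.classes_)              # genre name behind each code
print(encoder.inverse_transform(y))  # back to the original names
```

Keeping the fitted `encoder` around makes it possible to translate predictions back into genre names when writing the report.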

Finally, both sets will be divided into the **train** and **test** splits.
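A minimal sketch of the train/test division with `train_test_split`. The random matrix stands in for the real features, and the 80/20 ratio and the `stratify` option are assumptions rather than requirements of the assignment; stratifying keeps the genre proportions similar in both splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Random stand-in data: 100 songs, 5 features, 4 equally sized genres.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = np.repeat([0, 1, 2, 3], 25)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print(X_train.shape, X_test.shape)
print(np.bincount(y_test))  # songs per genre in the test split
```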

**Recommendations:**
- Obtain the features separately and join the datasets.
- Pay attention to the [sklearn documentation](https://scikit-learn.org/stable/)
- If you use categorical features, ENCODE THEM!
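As an illustration of the last recommendation, the sketch below one-hot encodes `key` and `time_signature`, treating them as categorical (an assumption about how the team decides to model them). The encoder is fitted on the train rows only, and `handle_unknown="ignore"` turns categories unseen during training into all-zero columns instead of raising an error.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Made-up train and test rows with two categorical audio features.
train = pd.DataFrame({"key": [1, 10, 9, 7], "time_signature": [5, 5, 3, 4]})
test = pd.DataFrame({"key": [2, 9], "time_signature": [4, 1]})

encoder = OneHotEncoder(handle_unknown="ignore")
# Fit on train only, then reuse the same categories for test.
train_encoded = encoder.fit_transform(train).toarray()
test_encoded = encoder.transform(test).toarray()

print(train_encoded.shape)  # one column per category seen in train
print(test_encoded)
```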

### 2) Choose three multiclass classifier models

Here we will pick three different models and then compare their performance on this task. The procedure is:
- Initialize the models
- Train them using the **train** split of the data
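The two steps above can be sketched as follows. The three models (logistic regression, random forest and k-nearest neighbours) are only an example selection, and `make_classification` generates synthetic data in place of the real features. The scale-sensitive models are wrapped in a pipeline with `StandardScaler`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 3-class dataset standing in for the song features.
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=6, n_classes=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Initialize the models.
models = {
    "logistic_regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "knn": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
}

# Train each model on the train split only.
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, "train accuracy:", round(model.score(X_train, y_train), 3))
```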

**Recommendation:**
- Pay attention to the [sklearn documentation](https://scikit-learn.org/stable/)

### 3) Report: Compare the performance of the models

Once the three models are trained, we will compare their performance:
- Run the models on the **test** split
- Obtain the classification report and the confusion matrix for each model
- Plot the **test** split projected to 2 dimensions, coloring each point according to its label.
- Plot the predictions of each classifier in the same way, overlaying the decision function that was learned.
- Save the models using **pickle**
- Discuss the results obtained
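The evaluation and saving steps of this list might look like the sketch below for a single model. The data is synthetic and the file name `random_forest.pkl` is hypothetical; a temporary directory is used so the example leaves no files behind.

```python
import os
import pickle
import tempfile

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic 3-class dataset standing in for the song features.
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=6, n_classes=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
y_pred = model.predict(X_test)

# Per-class precision, recall and F1, plus the confusion matrix
# (rows are true labels, columns are predicted labels).
print(classification_report(y_test, y_pred))
matrix = confusion_matrix(y_test, y_pred)
print(matrix)

# Save the fitted model with pickle and load it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "random_forest.pkl")
    with open(model_path, "wb") as model_file:
        pickle.dump(model, model_file)
    with open(model_path, "rb") as model_file:
        restored = pickle.load(model_file)
```

Repeating the same steps inside a loop over the three models gives comparable reports for the discussion.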

**Recommendation:**
- Pay attention to the [sklearn documentation](https://scikit-learn.org/stable/)
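For the 2-dimensional plots asked for in this section, one option is to project the test split with PCA, as in the sketch below (t-SNE or another reduction would fit equally well; PCA is just an assumption here). The data is synthetic and `plot_projection` is a hypothetical helper that needs matplotlib only when it is called.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the test split.
X_test, y_test = make_classification(
    n_samples=100, n_features=10, n_informative=6, n_classes=3, random_state=0
)

# Scale first so that no single feature dominates the principal components.
X_scaled = StandardScaler().fit_transform(X_test)
coords = PCA(n_components=2, random_state=0).fit_transform(X_scaled)


def plot_projection(coords, labels, title):
    """Scatter the 2-D coordinates, one color per label (hypothetical helper)."""
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    scatter = ax.scatter(coords[:, 0], coords[:, 1], c=labels, cmap="tab10", s=15)
    ax.legend(*scatter.legend_elements(), title="label")
    ax.set_xlabel("PC 1")
    ax.set_ylabel("PC 2")
    ax.set_title(title)
    return fig
```

Calling `plot_projection(coords, y_test, "True labels")` draws the reference plot, and passing a classifier's predictions instead of `y_test` gives the per-model version. Drawing decision regions on top would additionally require a classifier trained on the 2-D coordinates and evaluated over a grid of points.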

### 4) Additional tasks:

These tasks extend the basic work a little further and also earn extra points. Choose one or more of the following:
- Analysis of the dataset's class balance, balancing with **subsampling** or **oversampling**, and comparison of the results against the basic model
- Hyperparameter optimization and comparison of the results against the basic model
- Plot feature importance
- Plot feature correlation
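For the class-balance task, a minimal sketch of random oversampling with `sklearn.utils.resample` follows. The toy frame and its class sizes are made up, and the resampling should be applied to the train split only so that the test split keeps the real distribution.

```python
import numpy as np
import pandas as pd
from sklearn.utils import resample

# Made-up, imbalanced train split: 40, 15 and 5 songs per genre.
rng = np.random.default_rng(0)
train = pd.DataFrame({
    "energy": rng.random(60),
    "genre": [0] * 40 + [1] * 15 + [2] * 5,
})
print(train["genre"].value_counts())

# Oversample every genre (with replacement) up to the size of the largest one.
target_size = train["genre"].value_counts().max()
balanced = pd.concat(
    [
        resample(group, replace=True, n_samples=target_size, random_state=0)
        for _, group in train.groupby("genre")
    ],
    ignore_index=True,
)
print(balanced["genre"].value_counts())
```

Subsampling works the same way with `replace=False` and `n_samples` set to the size of the smallest class.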

**Recommendation:**
- Doing several of these now may save you time later
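The remaining extra tasks share most of their setup, so they can be sketched together: a small grid search, the feature importances of the best model, and the feature correlation matrix. The data is synthetic, the feature names are placeholders, and the parameter grid and the `f1_macro` scoring are example choices, not recommendations.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic 3-class dataset with placeholder feature names.
feature_names = [f"feature_{i}" for i in range(6)]
X_array, y = make_classification(
    n_samples=200, n_features=6, n_informative=4, n_redundant=2,
    n_classes=3, random_state=0,
)
X = pd.DataFrame(X_array, columns=feature_names)

# Hyperparameter optimization: try every combination with 3-fold cross-validation.
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [3, None]},
    cv=3,
    scoring="f1_macro",
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))

# Feature importance of the best model, sorted for plotting.
importances = pd.Series(
    search.best_estimator_.feature_importances_, index=feature_names
).sort_values(ascending=False)
print(importances)

# Pairwise Pearson correlation between features.
correlation = X.corr()
print(correlation.round(2))
```

`importances.plot.bar()` and a heatmap of `correlation` are natural next steps for the plots, and the score of `search.best_estimator_` on the test split can be compared against the basic model.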