# Mentorship: 'How to classify songs into genres'
## Practical III: Introduction to machine learning

**Before you start:**
- [Install spaCy and the language model you will work with](https://spacy.io/models#quickstart)

**Considerations:**
- Code style and neatness are graded.
- Extra work is allowed, as long as the basic activities are completed.

**Recommendation:**
- There are many code examples on the internet; don't feel obliged to implement everything from scratch.

### Libraries

In [1]:
!pip3 install spotipy
!pip3 install pandas
!pip3 install spacy
!pip3 install pymusixmatch
!pip3 install nltk
!pip3 install scikit-learn
!pip3 install stanza

# Add any extra libraries you use to this cell and the next one

Collecting spotipy
  Downloading spotipy-2.15.0-py3-none-any.whl (24 kB)
Installing collected packages: spotipy
Successfully installed spotipy-2.15.0
Processing /home/kunan/.cache/pip/wheels/cd/22/66/fcfe16c783269151e68dfa0a25411b21a2d5d2106cda7dac1e/pymusixmatch-0.3-py3-none-any.whl
Installing collected packages: pymusixmatch
Successfully installed pymusixmatch-0.3
Collecting stanza
  Using cached stanza-1.1.1-py3-none-any.whl (227 kB)
Installing collected packages: stanza
Successfully installed stanza-1.1.1


### Dependencies and API access

In [7]:
import pandas as pd
import numpy as np
import spotipy
import spacy
from spotipy.oauth2 import SpotifyClientCredentials
from musixmatch import Musixmatch
import seaborn as sns
import tqdm
import plotly.express as px
import plotly.graph_objects as go
import requests
from collections import Counter
from nltk import ngrams, bigrams
import nltk
import itertools
import matplotlib.pyplot as plt
import json
import stanza  # used further down for stanza.download and stanza.Pipeline
from sentiment_analysis_spanish import sentiment_analysis

client_id = '46b333d567314a89a6254b6c6b054be6'
client_secret = '9d922c3613e441518349dcf55f7d5853'
client_credentials_manager = SpotifyClientCredentials(client_id=client_id, client_secret=client_secret)

# "es" is a shortcut link to the es_core_news_sm model (spaCy 2.x), created with:
# python -m spacy link es_core_news_sm es
nlp = spacy.load("es")

sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager)
musixmatch = Musixmatch('1aa5272f4402bf2f082ad2f3958c2c62') # can be replaced by another API if it gives better results

In [9]:
sns.set_context(context='paper')

### 1) Gather the data obtained in the previous practicals

For this part we will use [the mentorship's collaborative playlist](https://open.spotify.com/playlist/2IuD0qZb14cji5y52crdsO?si=nfHRPDquQRyotEcXc4tG7Q), from which we will obtain:
- The audio features of the songs
- The textual features of their lyrics

We also need to apply the same preprocessing used in the previous practicals to both kinds of features (the preprocessing from P1 for the audio features and the one from P2 for the textual features) and to obtain the genre of each song. When a song has more than one genre, the team must agree on a strategy for handling it and describe that strategy in the report.

The resulting dataset will then be split into **X** and **y**, where:
- X is the set of features
- y is the label, in this case the song's genre, which must be encoded as **int** values

Finally, both sets will be divided into the **train** and **test** splits.

**Recommendations:**
- Obtain the features separately and join the datasets.
- Pay attention to the [sklearn documentation](https://scikit-learn.org/stable/)
- If you use categorical features, ENCODE THEM!
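
The steps described in this section (join, X/y separation, integer encoding, train/test split) could look roughly like the sketch below. The two small tables and their column names (`n_tokens`, `genre`) are hypothetical stand-ins, not the real dataset, and the split parameters are only an illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Toy stand-ins for the audio and text feature tables (hypothetical values).
audio_df = pd.DataFrame({
    "song_id": ["a", "b", "c", "d", "e", "f", "g", "h"],
    "danceability": [0.49, 0.67, 0.68, 0.50, 0.57, 0.55, 0.54, 0.53],
    "energy": [0.54, 0.62, 0.48, 0.65, 0.94, 0.78, 0.65, 0.57],
})
text_df = pd.DataFrame({
    "song_id": ["a", "b", "c", "d", "e", "f", "g", "h"],
    "n_tokens": [120, 95, 80, 150, 60, 110, 75, 130],
    "genre": ["tango", "cumbia", "tango", "rock", "rock", "cumbia", "rock", "tango"],
})

# Join both feature sets on the song id.
dataset = audio_df.merge(text_df, on="song_id", how="inner")

# X holds the features, y the genre encoded as integers.
X = dataset.drop(columns=["song_id", "genre"])
encoder = LabelEncoder()
y = encoder.fit_transform(dataset["genre"])

# Hold out 25% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
print(X_train.shape, X_test.shape, list(encoder.classes_))
```

With the real data, which has far more rows per class, passing `stratify=y` to `train_test_split` is worth considering so that both splits keep similar genre proportions. `encoder.inverse_transform` recovers the genre names from the integer labels.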

# Dataset

## Music Features

In [10]:
#Aux funcs
def genres_by_artist_id(id): #id: str
    artist = sp.artist(id)
    genres = artist['genres']
    return genres    #genres: List[str]

def songs_from_album_id(album_id):
    songs = []
    album = sp.album(album_id)
    artist = album['artists'][0]['name']
    for item in album['tracks']['items']:
        track = {}
        track["song_name"] = item['name']
        track["song_id"] = item['id']
        track["album_name"] = album['name']
        track["album_id"] = album["id"]
        audio_features = sp.audio_features(track["song_id"])
        track["audio_features"] = audio_features[0]
        track["artist"] = artist
        songs.append(track)
    return songs    #songs:List[dict]

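def get_genres(artists_id): #artists_id: List[str]
    # Return the genres of the first artist that has any, or [] if none do.
    # Each artist is queried only once to avoid duplicate API calls.
    for artist_id in artists_id:
        genres = genres_by_artist_id(artist_id)
        if genres:
            return genres
    return []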

def add_track(track_id, songs): #track_id:str, songs:List[dict]
    track = sp.track(track_id)
    audio_features = sp.audio_features(track_id)
    row = {}
    row["song_name"] = track['name']
    row["song_id"] = track['id']
    row["artists"] = [x["name"] for x in track["artists"]]
    row["artists_id"] = [x["id"] for x in track["artists"]]
    row["album_name"] = track['album']['name']
    row["album_id"] = track['album']['id']
    row["audio_features"] = audio_features[0]
    row["genres"] = get_genres(row["artists_id"])
    songs.append(row)
    return songs


In [11]:
PLAYLIST_ID = "2IuD0qZb14cji5y52crdsO"
TEST_PLAYLIST_ID = "3gLmPh92AyeYDKYLaNC8uv"
songs = []

def add_songs_of_playlist(playlist_id,songs_array):

    print("This may take a while...")
    offset = 0
    playlist = sp.playlist_tracks(playlist_id,offset=offset,limit=100)
    batches = playlist["total"] // 100 + 1
    print("...downloading "+ str(playlist["total"]) + " songs")
    print("in "+str(batches)+ " batches")
    for j in tqdm.tqdm(range(batches)):
        for i in range(len(playlist["items"])):
            add_track(playlist["items"][i]["track"]["id"],songs_array)
        offset += len(playlist["items"])
        playlist = sp.playlist_tracks(playlist_id, offset=offset,limit=100)
    return

add_songs_of_playlist(PLAYLIST_ID,songs)

This may take a while...


  0%|          | 0/12 [00:00<?, ?it/s]

...downloading 1190 songs
in 12 batches


100%|██████████| 12/12 [22:48<00:00, 114.08s/it]
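
The download above took almost 23 minutes for 1190 songs, largely because each track triggers its own requests. The spotipy client's `audio_features` method accepts a list of track ids (up to 100 per call), so one option is to collect the ids first and request the features in batches. The sketch below shows only the batching idea; `chunked`, `fetch_audio_features` and `FakeClient` are hypothetical names, and the fake client exists just so the example runs without network access.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def fetch_audio_features(client, track_ids, batch_size=100):
    """Fetch audio features in batches instead of one request per track.

    `client` is assumed to expose `audio_features(list_of_ids)`,
    as the spotipy client `sp` does.
    """
    features = []
    for batch in chunked(track_ids, batch_size):
        features.extend(client.audio_features(batch))
    return features


class FakeClient:
    """Hypothetical stand-in for `sp` that counts how many requests are made."""

    def __init__(self):
        self.calls = 0

    def audio_features(self, ids):
        self.calls += 1
        return [{"id": track_id} for track_id in ids]


fake = FakeClient()
result = fetch_audio_features(fake, [str(i) for i in range(1190)])
print(len(result), "tracks fetched in", fake.calls, "requests")
```

In the notebook, `sp` would take the place of `fake`. The playlist items returned by `sp.playlist_tracks` already contain the track objects, so the extra `sp.track` call per song could probably be dropped as well.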


In [12]:
songs_original_df = pd.DataFrame(songs)
songs_original_df.count()

song_name         1190
song_id           1190
artists           1190
artists_id        1190
album_name        1190
album_id          1190
audio_features    1190
genres            1190
dtype: int64

Checking dataframe's consistency.

In [13]:
songs_original_df[[x == [] for x in songs_original_df["genres"]]]

Unnamed: 0,song_name,song_id,artists,artists_id,album_name,album_id,audio_features,genres
48,Cambalache,3PI0FE7JUmEmEyN5YgKPZA,[Enrique Santos Discépolo],[0aPYs7yoiP2NtS5xNZXKjg],"Enrique Santos Discepolo ""El poeta del tango"" ...",59tn7tvd1M5XNWwV3TaVWC,"{'danceability': 0.492, 'energy': 0.541, 'key'...",[]
478,Sigue Feliz,1s0ndZpf2KeKEA08CsIFia,[Alonso y Bernardo],[5sskVxLnToHrnwTAICyVF5],"Narcos, Vol. 2 (More Music from the Netflix Or...",0EJRlYjvVcym9K4wrww9vB,"{'danceability': 0.672, 'energy': 0.625, 'key'...",[]
526,Celia,1rzFbkSvxQv6r3PSGjn7Ub,[Incas de Oro],[58wFXtpJxfvtigDaRWTNcj],Los Mejores Tinkus,4xDelBtEq3aJCU8hU6gFLB,"{'danceability': 0.675, 'energy': 0.477, 'key'...",[]
600,Arteria Ulnar,7kGsNBECFyCQ0fBJn2KB6o,[Té de Brujas],[39BvzssARgDTZ1Kf0uqNfj],Arteria Ulnar,5g5rgxGPlCnRTrYyf173fp,"{'danceability': 0.497, 'energy': 0.651, 'key'...",[]
604,No Le Ganamos a Nadie,0F0I189uNvQBdgy1SFNOec,[Literal],[0Ec1MqHP5MENR7rK3DtO3G],No Le Ganamos a Nadie,43moEeCjsTjk6N25XRin0S,"{'danceability': 0.568, 'energy': 0.938, 'key'...",[]
606,Contratiempos,50GbEo3clyzJRzuAjIFWdz,[Parientes],[76lUSSvc6Z83CLrIVB7YrE],Contratiempos,3F6da9yP7HMGwl88egAqZ5,"{'danceability': 0.549, 'energy': 0.778, 'key'...",[]
613,Si Tú No Estas (Nashville),3TCpMjVi4DVzbc5dXLpEeX,[Stokoff],[03wfTeoZex93T5TPxWo3B9],Si Tú No Estas (Nashville),2fNd57gzWCMwsNVG0K5YQy,"{'danceability': 0.545, 'energy': 0.652, 'key'...",[]
617,Una Nueva Realidad,3CQinOLvOg1vMvP9a060xV,[Scones de la Chola],[1n0013t3w2RbIqSYarnPGS],Una Nueva Realidad,6NCW2haZteRywEWZSzc7in,"{'danceability': 0.536, 'energy': 0.571, 'key'...",[]
628,Dale!,48EI8HkseqMBYQw8yl4WL3,[La Extrema Vanguardia],[3p1OOKD3Rs8JsT9I76mACt],Epe,4Bgue5pbIMGEZ61SoULBMr,"{'danceability': 0.464, 'energy': 0.864, 'key'...",[]
637,Si Me Dijeras,0NhFqADNG4OABBBEtxW0WM,[Vozenoff],[0hASTHk8Lmdj2zAHvkfsfW],Si Me Dijeras,6gyIUgOHK85AQswoDcLDDw,"{'danceability': 0.534, 'energy': 0.747, 'key'...",[]


In [14]:
songs_original_df[[x == [] for x in songs_original_df["genres"]]].count()

song_name         13
song_id           13
artists           13
artists_id        13
album_name        13
album_id          13
audio_features    13
genres            13
dtype: int64

These rows don't have a genre because Spotify hasn't assigned one to their artists.
These edge cases will be corrected by hand: we listen to each song and, where possible, look for a similar artist that does have an assigned genre in Spotify.

In [15]:
sp.artist("4Apvih9OZt9ghebGFIVcXI")
get_genres(["1AvkrI2S7knrbaZxydvc9B"])

['cumbia villera']

In [16]:
sanitized_df = songs_original_df.copy()
# .at writes into the dataframe itself; chained indexing (df.iloc[i]["genres"] = ...) may write to a temporary copy.
# Row labels equal row positions here because the dataframe has a default RangeIndex.
sanitized_df.at[48, "genres"] = get_genres(["7Cg2eqV6oHNE0P54WfajIX"]) #like julio sosa
sanitized_df.at[478, "genres"] = ['folklore'] #a list, like every other entry in the column
#sanitized_df.at[498, "genres"] = get_genres(["4Apvih9OZt9ghebGFIVcXI"]) #like los del suquia
sanitized_df.at[526, "genres"] = get_genres(["0iutktJLkNNtErs8c3EoF6"]) #like los tekis
sanitized_df.at[600, "genres"] = get_genres(["7okwEbXzyT2VffBmyQBWLz"]) #like maná
sanitized_df.at[604, "genres"] = get_genres(["3HrbmsYpKjWH1lzhad7alj"]) #like los autenticos decadentes
sanitized_df.at[606, "genres"] = get_genres(["1vumST4jmwQla7lkbLKDUw"]) #like foxley
sanitized_df.at[613, "genres"] = get_genres(["5f5Wlazt9jmI75fK5nPpd6"]) #like 8 segundos
sanitized_df.at[617, "genres"] = get_genres(["0SnyKkoyBaB2fG8IJH4xmU"]) #like los piojos
sanitized_df.at[628, "genres"] = get_genres(["0SnyKkoyBaB2fG8IJH4xmU"]) #like los piojos
sanitized_df.at[637, "genres"] = get_genres(["2wjmZuSHtRx96Qxb8HiP2o"]) #like los guasones
sanitized_df.at[647, "genres"] = get_genres(["0SnyKkoyBaB2fG8IJH4xmU"]) #like los piojos
sanitized_df.at[652, "genres"] = get_genres(["54YdJC33Ztc1CNIuodmyUb"]) #like leo garcia
sanitized_df.at[663, "genres"] = get_genres(["1AvkrI2S7knrbaZxydvc9B"]) #like mala fama
sanitized_df[[x == [] for x in sanitized_df["genres"]]]

Unnamed: 0,song_name,song_id,artists,artists_id,album_name,album_id,audio_features,genres


In [17]:
sanitized_df.count()

song_name         1190
song_id           1190
artists           1190
artists_id        1190
album_name        1190
album_id          1190
audio_features    1190
genres            1190
dtype: int64

No nulls in any other column

In [18]:
columns = sanitized_df.columns

for column in columns:
    if sanitized_df[[x == [] for x in sanitized_df[column]]].empty:
        print("No nulls in column ", column)
    else:
        print(">>>>>>> Found nulls in column", column)

No nulls in column  song_name
No nulls in column  song_id
No nulls in column  artists
No nulls in column  artists_id
No nulls in column  album_name
No nulls in column  album_id
No nulls in column  audio_features
No nulls in column  genres


Let's remove possible duplicated songs

In [19]:
# Count the rows whose song_id already appeared earlier in the dataframe
sanitized_df.duplicated(subset=["song_id"]).sum()

20

We find 20 rows whose song_id is duplicated in the dataset.
Let's remove them.

In [20]:
songs_df = sanitized_df.drop_duplicates(subset=["song_id"], keep="last")
if not songs_df.duplicated(subset=["song_id"]).any():
    print(f"no duplicated records found in {len(songs_df)} records")

no duplicated records found in 1170 records


Let's save a backup file so we don't have to run all that code again

In [22]:
songs_df.to_csv("/home/kunan/Documentos/famaf/diplodatos/mentoria-canciones-DiploDatos/songs_df.csv")

In [23]:
audio_features_base = pd.DataFrame(list(songs_original_df["audio_features"].values))
audio_features_description = audio_features_base.describe()
audio_features_description

Unnamed: 0,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,duration_ms,time_signature
count,1190.0,1190.0,1190.0,1190.0,1190.0,1190.0,1190.0,1190.0,1190.0,1190.0,1190.0,1190.0,1190.0
mean,0.584804,0.718866,5.342017,-6.442758,0.608403,0.069282,0.221394,0.019798,0.245614,0.603148,124.777358,236615.7,3.942857
std,0.152737,0.182243,3.51589,2.779049,0.488312,0.066547,0.233478,0.104262,0.235393,0.228425,30.077972,70930.69,0.301544
min,0.148,0.108,0.0,-19.575,0.0,0.0226,2e-06,0.0,0.0277,0.0377,62.85,38933.0,1.0
25%,0.488,0.604,2.0,-7.734,0.0,0.0325,0.0342,0.0,0.0967,0.42125,97.1005,198660.5,4.0
50%,0.602,0.7505,6.0,-6.0355,1.0,0.0436,0.137,2e-06,0.142,0.6205,125.0025,228546.5,4.0
75%,0.696,0.86575,9.0,-4.5415,1.0,0.074875,0.341,0.000224,0.30775,0.7935,147.32225,263706.8,4.0
max,0.945,0.995,11.0,-0.767,1.0,0.514,0.982,0.944,0.991,0.976,204.498,1129160.0,5.0


Preliminary analysis of the dataset

In [24]:
#more aux functions
def track_by_feature(feature, value):
    track_id = audio_features_base[audio_features_base[feature]==value]['id']
    track_id = track_id.values.item(0)
    return songs_original_df[songs_original_df['song_id']== track_id]
#example use: 
#print(track_by_feature("valence",0.039100))
#track_by_feature("speechiness",0.492000)

def songs_of_description(statistic):
    row = audio_features_description.loc[statistic]
    keys = row.keys()
    tracks_of_row = []
    for key in keys:
        track = track_by_feature(key,row[key]).to_dict()['song_name']
        track = list(track.values())[0]
        tracks_of_row.append({key: track})
    return tracks_of_row
#example use
#songs_of_description("min")

songs_of_description("max")

[{'danceability': 'Salud y Vida'},
 {'energy': 'Boom Boom'},
 {'key': 'Robo un Auto'},
 {'loudness': 'Quién Se Tomó Todo el Vino - En Vivo'},
 {'mode': 'Sin Rumbo'},
 {'speechiness': "Pa'l Norte (feat. Orishas)"},
 {'acousticness': 'Que Cruz La Que Lleva El Viento'},
 {'instrumentalness': 'Caballo negro'},
 {'liveness': 'Amor de mañana'},
 {'valence': 'Rock del Gato'},
 {'tempo': 'Como Vas a Hacer'},
 {'duration_ms': 'Popurrí - En Vivo'},
 {'time_signature': 'Desafío'}]

In [25]:
songs_of_description("min")

[{'danceability': 'Piel'},
 {'energy': 'Te Quiero, Te Espero'},
 {'key': 'El Pibe Tigre'},
 {'loudness': 'Outropía'},
 {'mode': 'Desafío'},
 {'speechiness': 'No Tengo'},
 {'acousticness': 'Porque Hoy Nací'},
 {'instrumentalness': 'Anoche'},
 {'liveness': 'Armas Gemelas'},
 {'valence': 'Piel'},
 {'tempo': 'Moraleja'},
 {'duration_ms': 'Luli'},
 {'time_signature': 'El Enrosque'}]

## Lyrics

In [26]:
songs_df.count()

song_name         1170
song_id           1170
artists           1170
artists_id        1170
album_name        1170
album_id          1170
audio_features    1170
genres            1170
dtype: int64

In [68]:
stanza.download('es')

Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/master/resources_1.1.0.json: 122kB [00:00, 7.45MB/s]                    
2020-09-08 22:43:54 INFO: Downloading default packages for language: es (Spanish)...
Downloading http://nlp.stanford.edu/software/stanza/1.1.0/es/default.zip: 100%|██████████| 583M/583M [09:43<00:00, 999kB/s]    
2020-09-08 22:53:51 INFO: Finished downloading models and saved to /home/kunan/stanza_resources.


In [None]:
# Use a separate name so the spaCy pipeline stored in `nlp` is not overwritten.
stanza_nlp = stanza.Pipeline(lang='es', processors='tokenize,sentiment')
doc = stanza_nlp('I hate that they banned Mox Opal')
for i, sentence in enumerate(doc.sentences):
    print(i, sentence.sentiment)

Preliminary analysis:
distribution of the genres in the dataset.
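
One simple way to look at the genre distribution is to count how often each genre appears across all songs. The sketch below uses a small hypothetical sample in place of the real `genres` column and assumes every entry is a list of strings.

```python
from collections import Counter

# Hypothetical sample of the `genres` column: one list of genres per song.
genres_column = [
    ["argentine rock", "rock nacional"],
    ["cumbia villera"],
    ["argentine rock", "latin rock"],
    ["latin", "latin pop"],
]

# Flatten the per-song lists and count each genre.
genre_counts = Counter(genre for genres in genres_column for genre in genres)
for genre, count in genre_counts.most_common():
    print(f"{genre}: {count}")
```

With the real data, `songs_df["genres"]` could be passed in place of `genres_column`.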

In [71]:
# aux methods
from urllib.parse import quote

def song_url_for_request(artist, song_title):
    # example use:
    # requests.get(song_url_for_request("Death Grips", "Hacker"))
    # Percent-encode both parts so spaces, accents and slashes do not break the URL.
    return "https://api.lyrics.ovh/v1/" + quote(artist, safe="") + "/" + quote(song_title, safe="") #str


def lemmafy(doc):
    # Keep the lemma of every alphabetic token that is not a stop word.
    lemmas = []
    for token in doc:
        if not token.is_stop and token.is_alpha:
            lemmas.append(token.lemma_)
    return lemmas

def add_lyrics_to_data(all_songs):
    without_lyrics = 0
    for song in tqdm.tqdm(all_songs):
        try:
            response = requests.get(song_url_for_request(song["artists"][0], song["song_name"]))
            lyrics_raw = response.json()["lyrics"]
            doc = nlp(lyrics_raw)
            # Store lowercased lemmas; iterating over lyrics_raw directly would yield single characters.
            song["lyrics"] = [lemma.lower() for lemma in lemmafy(doc)]
        except Exception: # the request failed or the API returned no lyrics
            without_lyrics += 1
            song["lyrics"] = None
    print(f"There are {without_lyrics} songs without lyrics")
    return all_songs

In [72]:
all_songs = songs_df.to_dict("records")  # one dict per song
all_songs = add_lyrics_to_data(all_songs)

100%|██████████| 1170/1170 [27:03<00:00,  1.39s/it]

There are 377 songs without lyrics





In [73]:
songs_base = pd.DataFrame(all_songs)
print(f"Before sanitization {len(all_songs)}")


Before sanitization 1170


In [74]:
# sanitize
bool_series = pd.notnull(songs_base["lyrics"])
songs_base = songs_base[bool_series]
print(f"After sanitization {len(songs_base)}")


After sanitization 793


In [75]:
songs_base.sample(3)

Unnamed: 0,album_id,album_name,artists,artists_id,audio_features,genres,lyrics,song_id,song_name
591,3Pr88tREtqKF9srgl21jGT,Presión,[Callejeros],[2osoVujXgV0PA8lhqDKYFw],"{'danceability': 0.411, 'energy': 0.974, 'key'...","[argentine rock, cumbia pop, cumbia villera, l...","[jugar, jugar, imaginar, gente, mierda, morir,...",5Vrw8Br1PSD8OcjblU5BEV,Imposible
733,6hmHpM6GIfKiaDWxzyh1J8,Volver A Nacer,[Chayanne],[1JbemQ1fPt2YmSLjAFhPBv],"{'danceability': 0.726, 'energy': 0.992, 'key'...","[latin, latin pop, puerto rican pop, tropical]","[baila, bailar, moreno, mover, cintura, baila,...",5IgpCARxKnZlE9AE7bFwAT,Baila Baila
461,5aiz5TGgMyaJDnjbM8zpqR,Intérpretes,[Hermetica],[6j6Ld5h0aFgH0VQWQNazS7],"{'danceability': 0.245, 'energy': 0.845, 'key'...","[argentine heavy metal, argentine metal, argen...","[muerte, esquivando, patrullar, noche, enfermo...",3ACU4Frw3Ditx6gPKRfiMw,Cráneo Candente


#### Instructor's notes

First, the lyrics themselves are not used as features, and neither is the song name. N-grams are not used either.

What can be used is data derived from the text of the songs; the prime example is sentiment analysis.

See
https://elitedatascience.com/python-nlp-libraries
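
Besides sentiment, other numeric values can be derived from the lyrics without using the words themselves as features. The sketch below computes a few such values from a list of tokens; the function name `lyric_features` and the choice of features are assumptions for illustration, not something prescribed by the practical.

```python
def lyric_features(tokens):
    """Compute simple numeric features from a list of lyric tokens."""
    n_tokens = len(tokens)
    n_unique = len(set(tokens))
    return {
        "n_tokens": n_tokens,
        "n_unique": n_unique,
        # Share of distinct words; 0.0 for empty lyrics to avoid dividing by zero.
        "lexical_diversity": n_unique / n_tokens if n_tokens else 0.0,
        # Mean number of characters per token.
        "avg_token_length": sum(len(t) for t in tokens) / n_tokens if n_tokens else 0.0,
    }


example = lyric_features(["baila", "bailar", "moreno", "mover", "cintura", "baila"])
print(example)
```

Applied to each row of the `lyrics` column, a function like this yields numeric columns that can be joined with the audio features.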


### 2) Choose three multiclass classifier models

Here we will choose three different models and then compare their performance on this task. The procedure is:
- Initialize the models
- Train them using the **train** split of the data

**Recommendation:**
- Pay attention to the [sklearn documentation](https://scikit-learn.org/stable/)
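
As an illustration of this step, the sketch below initializes and trains three scikit-learn classifiers. The choice of models (logistic regression, random forest, k-nearest neighbors) is only one possible selection, and the data is synthetic, generated with `make_classification` as a stand-in for the real features.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the real data: 300 songs, 8 features, 3 genres.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_redundant=0,
    n_classes=3, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Initialize the three models.
models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "knn": KNeighborsClassifier(n_neighbors=5),
}

# Train each one on the train split only.
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, "train accuracy:", round(model.score(X_train, y_train), 3))
```

Distance-based models such as k-nearest neighbors are sensitive to feature scale, so with the real audio features a scaling step may be needed before training.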

### 3) Report: compare the performance of the models

Once the three models are trained, we will compare their performance:
- Run the models on the **test** split
- Obtain the classification report and the confusion matrix for each model
- Plot the **test** split projected onto 2 dimensions, coloring each point according to its label.
- Plot the results obtained with each classifier in a similar way and, on top of that, the learned classification function.
- Save the models using **pickle**
- Discuss the results obtained

**Recommendation:**
- Pay attention to the [sklearn documentation](https://scikit-learn.org/stable/)
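
The evaluation steps listed in this section could be sketched as follows for a single model. The data is synthetic and the random forest is an arbitrary choice. PCA is used here as one possible way of projecting to 2 dimensions; it is an assumption, not the required method.

```python
import pickle

from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 300 songs, 8 features, 3 genres.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_redundant=0,
    n_classes=3, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Per-class precision, recall and F1, plus the confusion matrix.
print(classification_report(y_test, y_pred))
matrix = confusion_matrix(y_test, y_pred)
print(matrix)

# Project the test split to 2 dimensions for plotting.
X_test_2d = PCA(n_components=2).fit_transform(X_test)

# Serialize the trained model and load it back.
restored = pickle.loads(pickle.dumps(model))
print((restored.predict(X_test) == y_pred).all())
```

To draw the scatter plot, `X_test_2d[:, 0]` and `X_test_2d[:, 1]` can be passed to `plt.scatter` with `c=y_test` (true labels) or `c=y_pred` (predictions). To store the model on disk, `pickle.dump` with a file opened in `"wb"` mode replaces the in-memory round trip shown here.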

### 4) Additional tasks:

These tasks extend the basic work a little and also earn extra points. Choose one or more of the following:
- Analysis of the dataset's class balance, balancing with **subsampling** or **oversampling**, and comparison of the results against the basic model
- Hyperparameter optimization and comparison of the results against the basic model
- Plot feature importance
- Plot feature correlation

**Recommendation:**
- Doing several of them now can save you time in the future
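
As an example of the first task, the sketch below performs naive random oversampling: rows of the minority classes are duplicated at random until every class has as many rows as the largest one. The function name `random_oversample` and the toy data are hypothetical; dedicated libraries offer more elaborate strategies.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate random minority-class rows until every class matches the largest one."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    new_rows, new_labels = list(rows), list(labels)
    for label, count in counts.items():
        candidates = [row for row, lab in zip(rows, labels) if lab == label]
        # Draw with replacement the number of rows this class is missing.
        extra = rng.choices(candidates, k=target - count)
        new_rows.extend(extra)
        new_labels.extend([label] * len(extra))
    return new_rows, new_labels


rows = [[0.1], [0.2], [0.3], [0.4], [0.9], [0.8]]
labels = ["rock", "rock", "rock", "rock", "tango", "cumbia"]
balanced_rows, balanced_labels = random_oversample(rows, labels)
print(Counter(balanced_labels))
```

Oversampling is normally applied to the **train** split only, so that duplicated rows do not leak into the **test** split and inflate the scores.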

In [37]:
# issues with Spotify's genres
get_genres(["0iutktJLkNNtErs8c3EoF6"])

['cumbia andina mexicana', 'folclore jujeno', 'folklore argentino']
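
The output above shows why a strategy is needed when an artist has several fine-grained genres, some of them misleading. One possible approach, sketched below, maps each Spotify genre to a broad label by keyword and keeps the label with the most matches. The keyword map `BROAD_GENRES` and the function `broad_genre` are hypothetical; the actual mapping is a decision for the team.

```python
from collections import Counter

# Hypothetical keyword-to-label map; the real one is a decision for the team.
BROAD_GENRES = {
    "cumbia": "cumbia",
    "folklore": "folklore",
    "folclore": "folklore",
    "rock": "rock",
    "tango": "tango",
    "pop": "pop",
}


def broad_genre(genres, default="other"):
    """Pick the broad label whose keywords match the most Spotify genres."""
    votes = Counter()
    for genre in genres:
        for keyword, label in BROAD_GENRES.items():
            if keyword in genre:
                votes[label] += 1
    if not votes:
        return default
    # On a tie, the label that received its first vote earliest wins.
    return votes.most_common(1)[0][0]


print(broad_genre(['cumbia andina mexicana', 'folclore jujeno', 'folklore argentino']))
```

For the list returned above, two of the three genres match the folklore keywords, so the song would be labeled `folklore` rather than `cumbia`.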