# Mentorship: 'How to classify songs into genres'
## Practical III: Introduction to machine learning

**Before starting:**
- [Install spaCy and the language model you are going to work with](https://spacy.io/models#quickstart)

**Considerations:**
- Code style and tidiness are graded.
- Extra work is allowed, as long as the basic activities are completed.

**Recommendation:**
- There are many code examples on the internet; do not feel obliged to implement everything from scratch.

### Libraries

In [1]:
%%capture
!pip3 install spotipy
!pip3 install pandas
!pip3 install spacy
!pip3 install pymusixmatch
!pip3 install nltk
!pip3 install seaborn
!pip3 install requests
!pip3 install tqdm
!pip3 install plotly
!pip3 install scikit-learn
!pip3 install sentiment_analysis_spanish

# Add any extra libraries you use to this cell and the next one



### Dependencies and API access

In [1]:
import pandas as pd
import numpy as np
import spotipy
import spacy
from spotipy.oauth2 import SpotifyClientCredentials
from musixmatch import Musixmatch
import seaborn as sns
import tqdm
import plotly.express as px
import plotly.graph_objects as go
import requests
from collections import Counter
from nltk import ngrams, bigrams
import nltk
import itertools
import matplotlib.pyplot as plt
import json
from sentiment_analysis_spanish import sentiment_analysis
from statistics import median, mean


client_id = '46b333d567314a89a6254b6c6b054be6'
client_secret = '9d922c3613e441518349dcf55f7d5853'
client_credentials_manager = SpotifyClientCredentials(client_id=client_id, client_secret=client_secret)

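# "es" is a shortcut link to the Spanish model es_core_news_sm, created with:
# python -m spacy link es_core_news_sm es
# (the link command only exists in spaCy v2; in v3 load "es_core_news_sm" directly)
nlp = spacy.load("es")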

sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager)
musixmatch = Musixmatch('1aa5272f4402bf2f082ad2f3958c2c62') # can be replaced with another API if it gives better results

sentiment = sentiment_analysis.SentimentAnalysisSpanish()


In [2]:
sns.set_context(context='paper')

### 1) Collect the data obtained in the previous practicals

For this part we will use [the mentorship's collaborative playlist](https://open.spotify.com/playlist/2IuD0qZb14cji5y52crdsO?si=nfHRPDquQRyotEcXc4tG7Q), from which we will obtain:
- The audio features of the songs
- The textual features of their lyrics

In addition, the same preprocessing used in the previous practicals must be applied to both kinds of features: the preprocessing from practical 1 for the audio features and the one from practical 2 for the textual features. The genre of each song must also be obtained. When a song has more than one genre, the team must agree on a strategy for handling those cases and describe it in the report.

Then the resulting dataset will be separated into **X** and **y**, where:
- X is the set of features
- y is the label, in this case the song's genre, which must be encoded as **int** values

Finally, both sets will be divided into the **train** and **test** splits.

**Recommendations:**
- Obtain the features separately and join the datasets.
- Pay attention to the [sklearn documentation](https://scikit-learn.org/stable/)
- If you use categorical features, ENCODE THEM!
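
The X / y separation, the integer encoding of the genre and the train/test split described above can be sketched as follows. This is a minimal standard-library illustration of the mechanics on made-up rows: the feature values, the helper names `encode_labels` and `split_train_test`, and the 25% test fraction are all assumptions for the example. In practice scikit-learn's `LabelEncoder` and `train_test_split` cover the same steps.

```python
import random

# Toy rows: two made-up audio features plus a genre label (illustrative values only).
rows = [
    {"danceability": 0.53, "energy": 0.92, "genre": "cuarteto"},
    {"danceability": 0.37, "energy": 0.83, "genre": "argentine rock"},
    {"danceability": 0.60, "energy": 0.91, "genre": "cumbia villera"},
    {"danceability": 0.59, "energy": 0.72, "genre": "argentine rock"},
    {"danceability": 0.61, "energy": 0.51, "genre": "latin"},
    {"danceability": 0.50, "energy": 0.65, "genre": "latin"},
    {"danceability": 0.67, "energy": 0.63, "genre": "cuarteto"},
    {"danceability": 0.49, "energy": 0.54, "genre": "latin"},
]

FEATURES = ["danceability", "energy"]


def encode_labels(labels):
    """Map each distinct label to an int (sorted order); return the codes and the mapping."""
    mapping = {label: code for code, label in enumerate(sorted(set(labels)))}
    return [mapping[label] for label in labels], mapping


def split_train_test(X, y, test_fraction=0.25, seed=0):
    """Shuffle the row indices with a fixed seed and cut them into train and test parts."""
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)
    n_test = max(1, round(len(indices) * test_fraction))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    X_train = [X[i] for i in train_idx]
    X_test = [X[i] for i in test_idx]
    y_train = [y[i] for i in train_idx]
    y_test = [y[i] for i in test_idx]
    return X_train, X_test, y_train, y_test


X = [[row[name] for name in FEATURES] for row in rows]
y, label_to_int = encode_labels([row["genre"] for row in rows])

X_train, X_test, y_train, y_test = split_train_test(X, y)
print(label_to_int)
print(len(X_train), "train rows,", len(X_test), "test rows")
```

With few songs per genre, a stratified split (for example the `stratify` argument of `train_test_split`) helps keep rare genres represented in both splits.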

# Dataset

## Music Features

In [3]:
#Aux funcs
def genres_by_artist_id(id): #id: str
    artist = sp.artist(id)
    genres = artist['genres']
    return genres    #genres: List[str]

def songs_from_album_id(album_id):
    songs = []
    album = sp.album(album_id)
    artist = album['artists'][0]['name']
    for item in album['tracks']['items']:
        track = {}
        track["song_name"] = item['name']
        track["song_id"] = item['id']
        track["album_name"] = album['name']
        track["album_id"] = album["id"]
        audio_features = sp.audio_features(track["song_id"])
        track["audio_features"] = audio_features[0]
        track["artist"] = artist
        songs.append(track)
    return songs    #songs:List[dict]

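def get_genres(artists_id):
    # Return the genres of the first artist that has any; [] if none of them do.
    # Each artist is requested only once from the API.
    for artist_id in artists_id:
        genres = genres_by_artist_id(artist_id)
        if genres:
            return genres
    return []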

def add_track(track_id, songs): #track_id:str, songs:List[dict]
    track = sp.track(track_id)
    audio_features = sp.audio_features(track_id)
    row = {}
    row["song_name"] = track['name']
    row["song_id"] = track['id']
    row["artists"] = [x["name"] for x in track["artists"]]
    row["artists_id"] = [x["id"] for x in track["artists"]]
    row["album_name"] = track['album']['name']
    row["album_id"] = track['album']['id']
    row["audio_features"] = audio_features[0]
    row["genres"] = get_genres(row["artists_id"])
    songs.append(row)
    return songs


In [11]:
PLAYLIST_ID = "2IuD0qZb14cji5y52crdsO"
TEST_PLAYLIST_ID = "3gLmPh92AyeYDKYLaNC8uv"
songs = []

def add_songs_of_playlist(playlist_id,songs_array):

    print("This may take a while...")
    offset = 0
    playlist = sp.playlist_tracks(playlist_id,offset=offset,limit=100)
    batches = playlist["total"] // 100 + 1
    print("...downloading "+ str(playlist["total"]) + " songs")
    print("in "+str(batches)+ " batches")
    for j in tqdm.tqdm(range(batches)):
        for i in range(len(playlist["items"])):
            add_track(playlist["items"][i]["track"]["id"],songs_array)
        offset += len(playlist["items"])
        playlist = sp.playlist_tracks(playlist_id, offset=offset,limit=100)
    return

add_songs_of_playlist(PLAYLIST_ID,songs)

This may take a while...


  0%|          | 0/12 [00:00<?, ?it/s]

...downloading 1190 songs
in 12 batches


100%|██████████| 12/12 [22:48<00:00, 114.08s/it]


In [12]:
songs_original_df = pd.DataFrame(songs)
songs_original_df.count()

song_name         1190
song_id           1190
artists           1190
artists_id        1190
album_name        1190
album_id          1190
audio_features    1190
genres            1190
dtype: int64

Checking dataframe's consistency.

In [13]:
songs_original_df[[x == [] for x in songs_original_df["genres"]]]

Unnamed: 0,song_name,song_id,artists,artists_id,album_name,album_id,audio_features,genres
48,Cambalache,3PI0FE7JUmEmEyN5YgKPZA,[Enrique Santos Discépolo],[0aPYs7yoiP2NtS5xNZXKjg],"Enrique Santos Discepolo ""El poeta del tango"" ...",59tn7tvd1M5XNWwV3TaVWC,"{'danceability': 0.492, 'energy': 0.541, 'key'...",[]
478,Sigue Feliz,1s0ndZpf2KeKEA08CsIFia,[Alonso y Bernardo],[5sskVxLnToHrnwTAICyVF5],"Narcos, Vol. 2 (More Music from the Netflix Or...",0EJRlYjvVcym9K4wrww9vB,"{'danceability': 0.672, 'energy': 0.625, 'key'...",[]
526,Celia,1rzFbkSvxQv6r3PSGjn7Ub,[Incas de Oro],[58wFXtpJxfvtigDaRWTNcj],Los Mejores Tinkus,4xDelBtEq3aJCU8hU6gFLB,"{'danceability': 0.675, 'energy': 0.477, 'key'...",[]
600,Arteria Ulnar,7kGsNBECFyCQ0fBJn2KB6o,[Té de Brujas],[39BvzssARgDTZ1Kf0uqNfj],Arteria Ulnar,5g5rgxGPlCnRTrYyf173fp,"{'danceability': 0.497, 'energy': 0.651, 'key'...",[]
604,No Le Ganamos a Nadie,0F0I189uNvQBdgy1SFNOec,[Literal],[0Ec1MqHP5MENR7rK3DtO3G],No Le Ganamos a Nadie,43moEeCjsTjk6N25XRin0S,"{'danceability': 0.568, 'energy': 0.938, 'key'...",[]
606,Contratiempos,50GbEo3clyzJRzuAjIFWdz,[Parientes],[76lUSSvc6Z83CLrIVB7YrE],Contratiempos,3F6da9yP7HMGwl88egAqZ5,"{'danceability': 0.549, 'energy': 0.778, 'key'...",[]
613,Si Tú No Estas (Nashville),3TCpMjVi4DVzbc5dXLpEeX,[Stokoff],[03wfTeoZex93T5TPxWo3B9],Si Tú No Estas (Nashville),2fNd57gzWCMwsNVG0K5YQy,"{'danceability': 0.545, 'energy': 0.652, 'key'...",[]
617,Una Nueva Realidad,3CQinOLvOg1vMvP9a060xV,[Scones de la Chola],[1n0013t3w2RbIqSYarnPGS],Una Nueva Realidad,6NCW2haZteRywEWZSzc7in,"{'danceability': 0.536, 'energy': 0.571, 'key'...",[]
628,Dale!,48EI8HkseqMBYQw8yl4WL3,[La Extrema Vanguardia],[3p1OOKD3Rs8JsT9I76mACt],Epe,4Bgue5pbIMGEZ61SoULBMr,"{'danceability': 0.464, 'energy': 0.864, 'key'...",[]
637,Si Me Dijeras,0NhFqADNG4OABBBEtxW0WM,[Vozenoff],[0hASTHk8Lmdj2zAHvkfsfW],Si Me Dijeras,6gyIUgOHK85AQswoDcLDDw,"{'danceability': 0.534, 'energy': 0.747, 'key'...",[]


In [14]:
songs_original_df[[x == [] for x in songs_original_df["genres"]]].count()

song_name         13
song_id           13
artists           13
artists_id        13
album_name        13
album_id          13
audio_features    13
genres            13
dtype: int64

These rows don't have a genre because Spotify hasn't assigned one to the corresponding artists.
These edge cases are corrected by hand: we listen to each song and, where possible, assign the genres of a similar artist that does have genres in Spotify.

In [15]:
sp.artist("4Apvih9OZt9ghebGFIVcXI")
get_genres(["1AvkrI2S7knrbaZxydvc9B"])

['cumbia villera']

In [16]:
sanitized_df = songs_original_df.copy()
# Assign with .at: chained indexing (df.iloc[i]["col"] = ...) may write to a temporary copy.
# The dataframe has a default RangeIndex, so labels and positions coincide.
sanitized_df.at[48, "genres"] = get_genres(["7Cg2eqV6oHNE0P54WfajIX"]) #like julio sosa
sanitized_df.at[478, "genres"] = ['folklore'] #kept as a list, like every other row
#sanitized_df.at[498, "genres"] = get_genres(["4Apvih9OZt9ghebGFIVcXI"]) #like los del suquia
sanitized_df.at[526, "genres"] = get_genres(["0iutktJLkNNtErs8c3EoF6"]) #like los tekis
sanitized_df.at[600, "genres"] = get_genres(["7okwEbXzyT2VffBmyQBWLz"]) #like maná
sanitized_df.at[604, "genres"] = get_genres(["3HrbmsYpKjWH1lzhad7alj"]) #like los autenticos decadentes
sanitized_df.at[606, "genres"] = get_genres(["1vumST4jmwQla7lkbLKDUw"]) #like foxley
sanitized_df.at[613, "genres"] = get_genres(["5f5Wlazt9jmI75fK5nPpd6"]) #like 8 segundos
sanitized_df.at[617, "genres"] = get_genres(["0SnyKkoyBaB2fG8IJH4xmU"]) #like los piojos
sanitized_df.at[628, "genres"] = get_genres(["0SnyKkoyBaB2fG8IJH4xmU"]) #like los piojos
sanitized_df.at[637, "genres"] = get_genres(["2wjmZuSHtRx96Qxb8HiP2o"]) #like los guasones
sanitized_df.at[647, "genres"] = get_genres(["0SnyKkoyBaB2fG8IJH4xmU"]) #like los piojos
sanitized_df.at[652, "genres"] = get_genres(["54YdJC33Ztc1CNIuodmyUb"]) #like leo garcia
sanitized_df.at[663, "genres"] = get_genres(["1AvkrI2S7knrbaZxydvc9B"]) #like mala fama
sanitized_df[[x == [] for x in sanitized_df["genres"]]]

Unnamed: 0,song_name,song_id,artists,artists_id,album_name,album_id,audio_features,genres


In [17]:
sanitized_df.count()

song_name         1190
song_id           1190
artists           1190
artists_id        1190
album_name        1190
album_id          1190
audio_features    1190
genres            1190
dtype: int64

No nulls in any other column

In [18]:
columns = sanitized_df.columns

for column in columns:
    if sanitized_df[[x == [] for x in sanitized_df[column]]].empty:
        print("No nulls in column ", column)
    else:
        print(">>>>>>> Found nulls in column", column)

No nulls in column  song_name
No nulls in column  song_id
No nulls in column  artists
No nulls in column  artists_id
No nulls in column  album_name
No nulls in column  album_id
No nulls in column  audio_features
No nulls in column  genres


Let's remove possible duplicated songs

In [19]:
# Count the rows whose song_id already appeared earlier in the dataframe
sanitized_df.duplicated(subset=["song_id"]).sum()

20

We find 20 duplicated rows (tracks whose song_id appears more than once) in the dataset.
Let's remove them.

In [20]:
songs_df = sanitized_df.drop_duplicates(subset=["song_id"], keep="last")
if not songs_df.duplicated(subset=["song_id"]).any():
    print(f"no duplicated records found in {len(songs_df)} records")

no duplicated records found in 1170 records


Let's save a backup file so we don't have to run all that code again

In [22]:
songs_df.to_csv("./songs_df.csv")

In [23]:
audio_features_base = pd.DataFrame(list(songs_original_df["audio_features"].values))
audio_features_description = audio_features_base.describe()
audio_features_description

Unnamed: 0,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,duration_ms,time_signature
count,1190.0,1190.0,1190.0,1190.0,1190.0,1190.0,1190.0,1190.0,1190.0,1190.0,1190.0,1190.0,1190.0
mean,0.584804,0.718866,5.342017,-6.442758,0.608403,0.069282,0.221394,0.019798,0.245614,0.603148,124.777358,236615.7,3.942857
std,0.152737,0.182243,3.51589,2.779049,0.488312,0.066547,0.233478,0.104262,0.235393,0.228425,30.077972,70930.69,0.301544
min,0.148,0.108,0.0,-19.575,0.0,0.0226,2e-06,0.0,0.0277,0.0377,62.85,38933.0,1.0
25%,0.488,0.604,2.0,-7.734,0.0,0.0325,0.0342,0.0,0.0967,0.42125,97.1005,198660.5,4.0
50%,0.602,0.7505,6.0,-6.0355,1.0,0.0436,0.137,2e-06,0.142,0.6205,125.0025,228546.5,4.0
75%,0.696,0.86575,9.0,-4.5415,1.0,0.074875,0.341,0.000224,0.30775,0.7935,147.32225,263706.8,4.0
max,0.945,0.995,11.0,-0.767,1.0,0.514,0.982,0.944,0.991,0.976,204.498,1129160.0,5.0


Preliminary analysis of the dataset

In [4]:
#more aux functions
def track_by_feature(feature, value):
    track_id = audio_features_base[audio_features_base[feature]==value]['id']
    track_id = track_id.values.item(0)
    return songs_original_df[songs_original_df['song_id']== track_id]
#example use: 
#print(track_by_feature("valence",0.039100))
#track_by_feature("speechiness",0.492000)

def songs_of_description(statistic):
    row = audio_features_description.loc[statistic]
    keys = row.keys()
    tracks_of_row = []
    for key in keys:
        track = track_by_feature(key,row[key]).to_dict()['song_name']
        track = list(track.values())[0]
        tracks_of_row.append({key: track})
    return tracks_of_row
#example use
#songs_of_description("min")

#songs_of_description("max")

In [25]:
songs_of_description("min")

[{'danceability': 'Piel'},
 {'energy': 'Te Quiero, Te Espero'},
 {'key': 'El Pibe Tigre'},
 {'loudness': 'Outropía'},
 {'mode': 'Desafío'},
 {'speechiness': 'No Tengo'},
 {'acousticness': 'Porque Hoy Nací'},
 {'instrumentalness': 'Anoche'},
 {'liveness': 'Armas Gemelas'},
 {'valence': 'Piel'},
 {'tempo': 'Moraleja'},
 {'duration_ms': 'Luli'},
 {'time_signature': 'El Enrosque'}]

## Lyrics

In [26]:
songs_df.count()

song_name         1170
song_id           1170
artists           1170
artists_id        1170
album_name        1170
album_id          1170
audio_features    1170
genres            1170
dtype: int64

Preliminary analysis: distribution of the genres in the dataset.

Considerations: we compared the performance of using the mean and the median of the sentiment of the individual lines of each song's lyrics against using the whole song as one long sentence.

The following function can be used to make this comparison:

In [5]:

def compare_performance_mean_vs_median_vs_whole_song(songs_df):
    all_songs = songs_df.copy()
    all_songs = all_songs.head(5).T.to_dict().values()
    songs_sentiments = []
    for song in all_songs:
        print(song["song_name"])
        response = requests.get(song_url_for_request(song["artists"][0],song["song_name"])).json()
        json_data = response 
        lyrics_raw = json_data["lyrics"]
        whole_lyrics_way = lyrics_raw.replace("\n", " ") #for whole sentence
        lyrics_raw = lyrics_raw.split("\n") #for each sentence
        # the following is in case a word-by-word analysis is needed
        #lyrics_raw = [sentence.split(" ") for sentence in lyrics_raw]
        sentiments_tmp = []
        for sentence in lyrics_raw:
            thing = sentiment.sentiment(sentence.lower())
            sentiments_tmp.append(thing)
        whole_song_sent = sentiment.sentiment(whole_lyrics_way)
        sent_mean = mean(sentiments_tmp)
        sent_median = median(sentiments_tmp)
        print("mean ",sent_mean, " median ",sent_median)
        print("the other way: ", whole_song_sent)
        print("\n\n")
        song_sent_statistics = {"song":song["song_name"],"mean":sent_mean,"median":sent_median,"whole_song":whole_song_sent}
        songs_sentiments.append(song_sent_statistics)
    return songs_sentiments


In [6]:
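# aux methods
from urllib.parse import quote

def song_url_for_request(artist, song_title):
    # Percent-encode both parts so characters such as "?" or "/" in a name don't break the URL
    return "https://api.lyrics.ovh/v1/" + quote(artist, safe="") + '/' + quote(song_title, safe="") #str
    # example use:
    # requests.get(song_url_for_request("Death Grips", "Hacker"))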
    

def add_lyrics_to_data(all_songs):
    songs_without_lyrics = []
    without_lyrics = 0
    for song in tqdm.tqdm(all_songs):
        try:
            response = requests.get(song_url_for_request(song["artists"][0],song["song_name"])).json()
            json_data = response #json.loads(response.content)
            lyrics_raw = json_data["lyrics"]
            lyrics_raw = lyrics_raw.replace("\n", " ") #for whole sentence
            song_sentiment = sentiment.sentiment(lyrics_raw)
            song["lyrics_sentiment"]= round(song_sentiment,4)
        except Exception: #no lyrics found for this song (or the request failed)
            without_lyrics +=1
            song["lyrics_sentiment"]=None
            songs_without_lyrics.append(song)
    print(f"There are {without_lyrics} songs without lyrics")
    return all_songs

In [123]:
all_songs = songs_df.copy()
all_songs = all_songs.T.to_dict().values()
all_songs = add_lyrics_to_data(all_songs)

There are 377 songs without lyrics

In [133]:
songs_base = pd.DataFrame(all_songs)
#del songs_base["lyrics"]
songs_base.to_csv("./songs_base_with_lyrics.csv")

In [134]:
print(f"Before sanitization {len(all_songs)}")
songs_base.sample(5)

Before sanitization 1170


Unnamed: 0,song_name,song_id,artists,artists_id,album_name,album_id,audio_features,genres,lyrics_sentiment
1117,Me Extrañarás,1e4PRJWfSorO7RLgqDRvYR,"[Rodrigo, Ulises Bueno]","[235Vf4hkmwvxjVEMuCbRxm, 2UqRkW2wfEkZmyvKyTTv2W]",Me Extrañarás,366XQK2TfrjoWv17w2B9eD,"{'danceability': 0.53, 'energy': 0.918, 'key':...","[argentine rock, cuarteto, cumbia pop, cumbia ...",
773,Deseos de Cosas Imposibles (with Abel Pintos) ...,42rtLutZjrQGmzxZhN5YFA,"[La Oreja de Van Gogh, Abel Pintos]","[4U7lXyKdSf1JbM1aXvsodC, 6HTUcOExehqydqa7C3usAa]",Primera Fila,7pC1BMjl8x5Yr60xX2tyZh,"{'danceability': 0.608, 'energy': 0.509, 'key'...","[latin, latin arena pop, latin pop, mexican po...",0.0001
436,Para Que Sigamos siendo,1eqa7UxRqR59lR5SQ79lZg,[Eruca Sativa],[2RPNbhguRnI9uqahGYcUc6],La Carne,4irdwHtE3LGlYkPzxRQGln,"{'danceability': 0.593, 'energy': 0.715, 'key'...","[argentine indie, argentine metal, argentine r...",0.0
1075,Soy de la Esquina,6q8EF1C7ZzNOpgNye7ycvh,[Hermetica],[6j6Ld5h0aFgH0VQWQNazS7],Victimas del Vaciamiento,70Bq990gBdpLDzx7M8r28i,"{'danceability': 0.372, 'energy': 0.83, 'key':...","[argentine heavy metal, argentine metal, argen...",0.0
1108,Dramas Gratis,31kX39lcYsiGEaVPRs5Pfb,"[Damas Gratis, Andrés Calamaro]","[3YeBTR1Q1rUxKguz4jP6UV, 3tAICgiSR5PfYY4B8qsoAU]",Esquivando el Exito,0OkF4frqfEXWFZDwgSWfIN,"{'danceability': 0.602, 'energy': 0.914, 'key'...",[cumbia villera],0.0


In [141]:
# sanitize
bool_series = pd.notnull(songs_base["lyrics_sentiment"])
songs_base = songs_base[bool_series]
print(f"After sanitization {len(songs_base)}")
songs_base.head(3)

After sanitization 793


Unnamed: 0,song_name,song_id,artists,artists_id,album_name,album_id,audio_features,genres,lyrics_sentiment
0,Desafío,7j9DYPyCuvSAtPcevpAkzb,[Arca],[4SQdUpG4f7UbkJG3cJ2Iyj],Arca,1MQO4j8QExVgmnplbIodEU,"{'danceability': 0.161, 'energy': 0.482, 'key'...","[art pop, dance pop, deconstructed club, elect...",0.0055
1,Anoche,1cwTMSQeMaA9fVKEF1iWeD,[Arca],[4SQdUpG4f7UbkJG3cJ2Iyj],Arca,1MQO4j8QExVgmnplbIodEU,"{'danceability': 0.23, 'energy': 0.434, 'key':...","[art pop, dance pop, deconstructed club, elect...",0.3901
2,Sin Rumbo,0aL27vskbMpwsMGUkHm3Zf,[Arca],[4SQdUpG4f7UbkJG3cJ2Iyj],Arca,1MQO4j8QExVgmnplbIodEU,"{'danceability': 0.289, 'energy': 0.28, 'key':...","[art pop, dance pop, deconstructed club, elect...",0.0


In [57]:
#songs_base.to_csv("./songs_base_with_lyrics_sanitized_final.csv",sep="#",index=False)
songs_base = pd.read_csv("./songs_base_with_lyrics_sanitized_final.csv",sep="#")
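# The CSV stores each genre list as a string; turn it back into a list.
# Note: split(",") keeps the space after each comma, so every genre after the first starts with a space.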
songs_base["genres"]=songs_base["genres"].apply(lambda x: x.replace("[","").replace("]","").replace("'","").split(","))
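
Since `to_csv` writes each list as its Python repr, an alternative way to recover the genres is `ast.literal_eval`, which avoids the leading spaces that `split(",")` leaves on every genre after the first. A small sketch on a hand-written string (the sample value is an assumption modeled on the rows shown in this notebook):

```python
import ast

# A genres cell as it comes back from the CSV: the repr of a Python list.
raw = "['art pop', 'dance pop', 'deconstructed club']"

# Approach used in the notebook: strip brackets and quotes, then split on commas.
by_split = raw.replace("[", "").replace("]", "").replace("'", "").split(",")

# Alternative: let Python parse the literal back into a real list.
by_literal_eval = ast.literal_eval(raw)

print(by_split)         # the 2nd and 3rd genres keep a leading space
print(by_literal_eval)
```

On the dataframe this would look like `songs_base["genres"].apply(ast.literal_eval)`, assuming every cell is a valid list literal; cells that are not (for example a bare word) raise `ValueError` and would need separate handling.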


In [10]:
songs_base.head(5)

Unnamed: 0,song_name,song_id,artists,artists_id,album_name,album_id,audio_features,genres,lyrics_sentiment
0,Desafío,7j9DYPyCuvSAtPcevpAkzb,['Arca'],['4SQdUpG4f7UbkJG3cJ2Iyj'],Arca,1MQO4j8QExVgmnplbIodEU,"{'danceability': 0.161, 'energy': 0.482, 'key'...","['art pop', 'dance pop', 'deconstructed club',...",0.0055
1,Anoche,1cwTMSQeMaA9fVKEF1iWeD,['Arca'],['4SQdUpG4f7UbkJG3cJ2Iyj'],Arca,1MQO4j8QExVgmnplbIodEU,"{'danceability': 0.23, 'energy': 0.434, 'key':...","['art pop', 'dance pop', 'deconstructed club',...",0.3901
2,Sin Rumbo,0aL27vskbMpwsMGUkHm3Zf,['Arca'],['4SQdUpG4f7UbkJG3cJ2Iyj'],Arca,1MQO4j8QExVgmnplbIodEU,"{'danceability': 0.289, 'energy': 0.28, 'key':...","['art pop', 'dance pop', 'deconstructed club',...",0.0
3,La Gata Bajo la Lluvia,2kfSFdq2h0xLXq01em1zc7,['Rocío Dúrcal'],['2uyweLa0mvPZH6eRzDddeB'],Sus 16 Grandes Exitos,1QXxmsxolhkqiFtI1mpX4i,"{'danceability': 0.499, 'energy': 0.648, 'key'...","['bolero', 'cancion melodica', 'grupera', 'lat...",0.0057
4,Querida,5ySxlyvySBhIEvoO2xx7uT,['Juan Gabriel'],['2MRBDr0crHWE5JwPceFncq'],Recuerdos II,1xrQ48Vvnvm3SmAbnIukGt,"{'danceability': 0.528, 'energy': 0.383, 'key'...","['cancion melodica', 'latin', 'latin pop']",0.0


Let's try to simplify the process by keeping a single genre for each song

In [13]:
from sklearn.neighbors import KNeighborsClassifier

In [65]:
simple_df = songs_base.copy()

# using a random genre
import random
random_simple_df = simple_df.copy()
random_simple_df["genres"] = simple_df["genres"].apply(lambda x: random.choice(x))
random_simple_df["genres"].value_counts()

cuarteto            76
 rock en espanol    62
 latin pop          46
 latin rock         46
 rock nacional      43
                    ..
reggaeton            1
latin pop            1
mexican indie        1
 cantautor           1
 mexican rock        1
Name: genres, Length: 87, dtype: int64

In [66]:
simple_df["genres"] = simple_df["genres"].apply(lambda x: x[0])
simple_df["genres"].value_counts()

latin                         213
argentine rock                208
argentine heavy metal          79
cuarteto                       76
bolero                         58
argentine indie                38
argentine alternative rock     25
cumbia villera                 14
grupera                        10
argentine punk                  8
latin alternative               7
argentine metal                 6
argentine reggae                6
cantautor                       4
cumbia chilena                  4
art pop                         4
cumbia paraguaya                3
cancion melodica                3
flamenco                        2
dance pop                       2
colombian pop                   2
celtic metal                    2
folclore salteno                2
reggaeton                       1
latintronica                    1
rumba                           1
electro latino                  1
tango                           1
panamanian pop                  1
aggrotech     

In [63]:
simple_df["genres"].value_counts()

cuarteto              76
 rock en espanol      55
latin                 48
argentine rock        44
 latin alternative    43
                      ..
mexican indie          1
cumbia paraguaya       1
 pop venezolano        1
 escape room           1
downtempo              1
Name: genres, Length: 82, dtype: int64
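
Picking a random genre or the first genre are two possible strategies. Another option, sketched here only as an assumption and not as the strategy this notebook settled on, is to keep for each song the genre from its list that is most frequent across the whole dataset, which should produce fewer and larger classes. The sample lists and the helper name `most_common_genre` are made up for the example.

```python
from collections import Counter

# Made-up genre lists, one per song (already stripped of stray whitespace).
songs_genres = [
    ["argentine rock", "latin rock", "rock en espanol"],
    ["cuarteto"],
    ["latin", "latin pop", "rock en espanol"],
    ["argentine rock", "rock nacional"],
    ["latin rock", "rock en espanol"],
]

# How many songs mention each genre across the dataset.
genre_counts = Counter(genre for genres in songs_genres for genre in genres)


def most_common_genre(genres, counts):
    """Return the genre of this song with the highest dataset-wide count.

    Ties are broken alphabetically so the result is deterministic.
    """
    return min(genres, key=lambda genre: (-counts[genre], genre))


single_genre = [most_common_genre(genres, genre_counts) for genres in songs_genres]
print(single_genre)
print(Counter(single_genre))
```

Whatever strategy is chosen, it is worth checking `value_counts()` afterwards, since classes with a single song cannot appear in both the train and test splits.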

#### Instructor's notes

First, the lyrics themselves are not used, and neither is the song name. N-grams are not used either.

What can be used is data derived from the lyrics text; the prime example is sentiment analysis.

See
https://elitedatascience.com/python-nlp-libraries


### 2) Choose three multiclass classifier models

Here we will choose three different models and then compare their performance on this task. The procedure is as follows:
- Initialize the models
- Train them using the **train** split of the data

**Recommendation:**
- Pay attention to the [sklearn documentation](https://scikit-learn.org/stable/)

### 3) Report: Compare the performance of the models

Once the three models are trained, we will compare their performance:
- Run the models on the **test** split
- Obtain the classification report and the confusion matrix for each model
- Plot our **test** split projected to 2 dimensions, coloring each point according to its label.
- Plot the results obtained with each classifier in a similar way and, on top of them, the decision function that was learned.
- Save the models using **pickle**
- Discuss the results obtained

**Recommendation:**
- Pay attention to the [sklearn documentation](https://scikit-learn.org/stable/)
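
To make the evaluation steps concrete, here is a standard-library sketch of what a confusion matrix and per-class precision/recall contain, followed by a `pickle` round trip. The label lists and the `fitted_params` dictionary are invented stand-ins for real predictions and a real trained model; with scikit-learn, `sklearn.metrics.confusion_matrix` and `classification_report` compute the same quantities.

```python
import pickle

# Invented integer labels for a 3-class problem (true vs predicted on a test split).
y_true = [0, 0, 1, 1, 1, 2, 2, 2, 2]
y_pred = [0, 1, 1, 1, 2, 2, 2, 0, 2]
n_classes = 3

# confusion[i][j] = number of samples of true class i predicted as class j.
confusion = [[0] * n_classes for _ in range(n_classes)]
for true_label, predicted_label in zip(y_true, y_pred):
    confusion[true_label][predicted_label] += 1

for label in range(n_classes):
    true_positives = confusion[label][label]
    predicted_as_label = sum(confusion[row][label] for row in range(n_classes))
    actually_label = sum(confusion[label])
    precision = true_positives / predicted_as_label if predicted_as_label else 0.0
    recall = true_positives / actually_label if actually_label else 0.0
    print(f"class {label}: precision={precision:.2f} recall={recall:.2f}")

# pickle round trip: serialize an object to bytes and restore it.
# A plain dict stands in for a trained model here.
fitted_params = {"model": "stand-in", "n_neighbors": 5}
restored = pickle.loads(pickle.dumps(fitted_params))
print(restored == fitted_params)
```

To write to disk instead, `pickle.dump(obj, f)` with a file opened in `"wb"` mode does the same job, and `pickle.load(f)` reads it back.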

### 4) Additional tasks:

These tasks extend the basic work a little and also earn extra points. You must choose one or more of the following:
- Analysis of the class balance of the dataset, balancing using **subsampling** or **oversampling**, and comparison of the results against the basic model
- Hyperparameter optimization and comparison of the results against the basic model
- Plot feature importance
- Plot feature correlation

**Recommendation:**
- Doing several of them now can save you time in the future
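
As an illustration of the class-balance task, the sketch below counts the samples per class and applies naive random oversampling, that is, duplicating random samples of the smaller classes until every class matches the largest one. The data and the helper `oversample` are assumptions for the example. Oversampling is normally applied only to the train split, so that duplicated rows do not leak into test.

```python
import random
from collections import Counter, defaultdict


def oversample(X, y, seed=0):
    """Randomly duplicate minority-class samples until all classes have equal size."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for features, label in zip(X, y):
        by_class[label].append(features)

    target = max(len(samples) for samples in by_class.values())
    X_balanced, y_balanced = [], []
    for label, samples in by_class.items():
        extra = [rng.choice(samples) for _ in range(target - len(samples))]
        for features in samples + extra:
            X_balanced.append(features)
            y_balanced.append(label)
    return X_balanced, y_balanced


# Made-up train split: class 0 is much more frequent than classes 1 and 2.
X_train = [[0.1], [0.2], [0.3], [0.4], [0.5], [0.6], [0.7], [0.8]]
y_train = [0, 0, 0, 0, 0, 1, 1, 2]

print("before:", Counter(y_train))
X_bal, y_bal = oversample(X_train, y_train)
print("after: ", Counter(y_bal))
```

Subsampling is the mirror image: instead of growing the small classes, each class is randomly cut down to the size of the smallest one, at the cost of discarding data.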

In [37]:
#problems with Spotify's genres
get_genres(["0iutktJLkNNtErs8c3EoF6"])

['cumbia andina mexicana', 'folclore jujeno', 'folklore argentino']