<a href="https://colab.research.google.com/github/pelardillo/aid/blob/main/AID2_Lyrics_Analysis_Complete.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction
Music (from the Greek mousikē, "art of the Muses") is an artistic and cultural expression that serves many purposes in society and in human cognitive development: it contributes to psychomotor development and language learning, and strengthens emotional intelligence. In its most basic definition, it is an ordered composition of sounds and silences that maintains a rhythm and a melody. As such, music is a fundamental component in building the culture and identity of humanity.

Songs, in turn, contain verses and choruses (the lyrics), which complement the sensory experience and enrich the message conveyed to the listener through the musical tones, the words, and the way the performers interpret them.

Within this context, this notebook carries out an analytical study of song lyrics to model and learn how the artists of each genre write songs.

## Objectives
The study aims to model and classify song lyrics by artist and genre, and to identify lyrics the model has not yet learned based on their content, thereby inferring their artist and genre.

To that end, we propose a deep learning model that applies NLP (Natural Language Processing) to analyze the content of the lyrics and learn to recognize the patterns of each artist and genre.

The goal of the model is to classify the lyrics of a previously unseen song (one that is not part of the training dataset) by analyzing its content. The classification consists of determining the genre of the song.
Once trained, the model should be able to infer the genre from the lyrics, or from a fragment of them.

### Hyperparameters

In [None]:
# minimum length, in words, of a song considered valid for the model
HP_MIN_SONG_LEN = 250

# songs to load per genre
HP_SONGS_SAMPLES = 15000

# maximum vocabulary size
HP_MAX_VOCAB_SIZE = 25000

# batch size
HP_BATCH_SIZE = 128

# embedding dimension
HP_EMBEDDING_DIM = 100

# filter sizes (n-gram widths) the CNN uses to process the text
HP_FILTERS = [2, 3, 4]

# number of filters (output channels) per filter size
HP_N_FILTERS = 200

# dropout rate
HP_DROPOUT = 0.5

# number of training epochs
HP_N_EPOCHS = 10

### Imports

In [None]:
user = !whoami
if user[0] == "root":
  # enables the Google Colab data table plugin
  %load_ext google.colab.data_table

  # install API plugin
  %pip install lyricsgenius
  %pip install music_story

  # !pip install torchtext --upgrade


In [None]:
# misc
import time
import requests
import random
import re
import math
import numpy as np
from pathlib import Path
import pandas as pd

# music APIs
import lyricsgenius
import music_story

# torch
import torch
import torch.optim as optim
import torch.nn as nn
import torch.nn.functional as F

# torchtext
from torchtext import data
from torchtext import datasets
from torchtext.data import Field, Dataset, Example

# prettytable
from prettytable import PrettyTable

# sklearn
from sklearn.feature_extraction.text import TfidfVectorizer

# plots
import matplotlib.pyplot as plt
plt.rcParams["figure.figsize"] = (10,8)

# spacy/nlp
import spacy 
import nltk
from nltk.corpus import stopwords
nltk.download("popular")
stops = stopwords.words("english")
nlp = spacy.load("en", disable=['parser', 'tagger', 'ner'])


### Helper Functions

In [None]:
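def elapsed_time(start_time):
    # print the wall-clock time elapsed since start_time as minutes, seconds and milliseconds
    elapsed = time.time() - start_time
    elapsed_mins = int(elapsed / 60)
    elapsed_secs = int(elapsed - (elapsed_mins * 60))
    # fractional part of the seconds, converted to whole milliseconds
    elapsed_ms = int((elapsed - (elapsed_mins * 60) - elapsed_secs) * 1000)
    print(f'\nelapsed time: {elapsed_mins}m {elapsed_secs}s {elapsed_ms}ms')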

In [None]:
def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

In [None]:
class DataFrameDataset(Dataset):
  """Class for using pandas DataFrames as a datasource"""
  def __init__(self, examples, fields, filter_pred=None):
      """
      Create a dataset from a pandas dataframe of examples and Fields
      Arguments:
          examples pd.DataFrame: DataFrame of examples
          fields {str: Field}: The Fields to use in this tuple. The
              string is a field name, and the Field is the associated field.
          filter_pred (callable or None): use only examples for which
              filter_pred(example) is true, or use all examples if None.
              Default is None
      """
      self.examples = examples.apply(SeriesExample.fromSeries, args=(fields,), axis=1).tolist()
      if filter_pred is not None:
          # materialize the filter so len() and indexing keep working
          self.examples = list(filter(filter_pred, self.examples))
      self.fields = dict(fields)
      # Unpack field tuples
      for n, f in list(self.fields.items()):
          if isinstance(n, tuple):
              self.fields.update(zip(n, f))
              del self.fields[n]

class SeriesExample(Example):
  """Class to convert a pandas Series to an Example"""

  @classmethod
  def fromSeries(cls, data, fields):
      return cls.fromdict(data.to_dict(), fields)

  @classmethod
  def fromdict(cls, data, fields):
      ex = cls()
      
      for key, field in fields.items():
          
          if key not in data:
              raise ValueError("Specified key {} was not found in "
              "the input data".format(key))
          if field is not None:
              setattr(ex, key, field.preprocess(data[key]))
          else:
              setattr(ex, key, data[key])
      return ex

In [None]:
def print_batch(batch):
  # print every example of a batch as a table row:
  # token ids, decoded tokens and numeric genre label

  dat_dtype = { 
    'names'   : ('Lyric',   'Lyrics', 'Genre'), 
    'formats' : ('|S1000',  '|S1000', 'i')
  }

  dat = np.zeros(len(batch), dat_dtype)

  table = PrettyTable(dat.dtype.names)
  table.align['Lyric'] = 'r'
  table.align['Genre'] = 'r'

  lyric_ts = batch.Lyric.to(torch.device('cpu'))
  genre_ts = batch.Genre.to(torch.device('cpu'))

  lyrics_dat = []
  lyrics = []

  # batch.Lyric is [sent len, batch size]; transpose to iterate song by song
  for l in lyric_ts.t():
    lyrics_dat.append(', '.join([str(i) for i in l.numpy()]))
    lyrics.append(' '.join([TEXT.vocab.itos[i] for i in l.numpy()]).encode('utf-8'))

  # only Lyric and Genre are fields of the dataset, so only those are shown
  dat['Lyric'] = lyrics_dat
  dat['Lyrics'] = lyrics
  dat['Genre'] = genre_ts.numpy()

  for row in dat:
      table.add_row(row)

  print(table)

In [None]:
def calculate_tf_idf(data):
  # mean TF-IDF score of every word across the songs in `data`

  tfidf_vectorizer = TfidfVectorizer(use_idf=True)
  tfidf_vectorizer_vectors = tfidf_vectorizer.fit_transform(data['Lyric'])

  tfidf = tfidf_vectorizer_vectors.todense()
  
  # TFIDF of words not in the doc will be 0, so replace them with nan
  tfidf[tfidf == 0] = np.nan
  
  # Use nanmean of numpy which will ignore nan while calculating the mean
  means = np.nanmean(tfidf, axis=0)
  
  # convert it into a dictionary for later lookup
  means = dict(zip(tfidf_vectorizer.get_feature_names(), means.tolist()[0]))

  return means

# Data
The dataset was built from the following [data](https://www.kaggle.com/neisse/scrapped-lyrics-from-6-genres), which consists of two CSV files: one lists songs with their title, lyrics, artist and language, and the other contains information about those artists.

The author of the original dataset used R to obtain the information from the site https://www.vagalume.com.br

### Download Data
Both CSV files are available at the following public Google Drive link: https://drive.google.com/file/d/1d9s2_Y3d502iFvnR2-n_lP0GS7EbNplN/view?usp=sharing

In [None]:
def download_file_from_google_drive(id, destination):
    URL = "https://docs.google.com/uc?export=download"

    session = requests.Session()

    response = session.get(URL, params = { 'id' : id }, stream = True)
    token = get_confirm_token(response)

    if token:
        params = { 'id' : id, 'confirm' : token }
        response = session.get(URL, params = params, stream = True)

    save_response_content(response, destination)    

def get_confirm_token(response):
    for key, value in response.cookies.items():
        if key.startswith('download_warning'):
            return value

    return None

def save_response_content(response, destination):
    CHUNK_SIZE = 32768

    with open(destination, "wb") as f:
        for chunk in response.iter_content(CHUNK_SIZE):
            if chunk: # filter out keep-alive new chunks
                f.write(chunk)

In [None]:
start_time = time.time()

my_file = Path("artists-data.csv")
if not my_file.is_file():
    print("downloading data...")

    # download the zip archive from Google Drive
    download_file_from_google_drive('1d9s2_Y3d502iFvnR2-n_lP0GS7EbNplN', 'archive.zip')

    # extract the archive and delete the zip file
    !unzip -o archive.zip
    !rm archive.zip
else:
  print("data available")

# load the CSV files for processing
artists = pd.read_csv('artists-data.csv')
lyrics = pd.read_csv('lyrics-data.csv')

elapsed_time(start_time)

downloading data...
Archive:  archive.zip
  inflating: artists-data.csv        
  inflating: lyrics-data.csv         

elapsed time: 0m 6s 0.4975142478942871ms


### Data Normalization
The raw lyrics contain elements that do not contribute to their meaning; these should be removed to improve the performance of the model and reduce computation time. We use spaCy for this step.

In [None]:
# inner join of lyrics and artists, filtered by language and genre
def get_songs(left, right, lng, gnrs):
  s = pd.merge(left=left, right=right, how='inner', left_on='ALink', right_on='Link')
  return s[(s.Idiom.isin(lng)) & (s.Genre.isin(gnrs))].drop_duplicates(subset=['SName'], keep='first').copy()

# normalize the lyrics using spaCy
def normalize(comment, lowercase, remove_stopwords):
  # replace every run of non-alphanumeric characters with a single space
  comment = re.sub(r"[^a-zA-Z0-9]+", ' ', comment)
  if lowercase:
      comment = comment.lower()
  comment = nlp(comment)
  # keep the non-empty lemma of each token, optionally dropping stopwords
  lemmatized = list()
  for word in comment:
      lemma = word.lemma_.strip()
      if lemma:
          if not remove_stopwords or lemma not in stops:
              lemmatized.append(lemma)
  return " ".join(lemmatized)

In [None]:
start_time = time.time()

# keep only English songs from the selected genres
songs = get_songs(lyrics, artists, ['ENGLISH'], ['Rock', 'Pop', 'Hip Hop'])
print(f'all songs: {len(songs)}')

# pre-process the songs: drop (parenthesised) and [bracketed] annotations, then normalize
songs['Lyric'] = songs['Lyric'].apply(lambda x: re.sub(r'\([^)]*\)', '', x))
songs['Lyric'] = songs['Lyric'].apply(lambda x: re.sub(r'\[[^\]]*\]', '', x))
songs['Lyric'] = songs['Lyric'].apply(normalize, lowercase=True, remove_stopwords=False)

# keep songs longer than the minimum length and discard tablatures
songs['Len'] = songs['Lyric'].apply(lambda x: len(x.split(' ')))
songs = songs[(songs['Len'] > HP_MIN_SONG_LEN) & (~songs['SName'].str.contains('tablatura'))]
# cap the number of songs per genre
songs = songs.groupby('Genre').head(HP_SONGS_SAMPLES)

print(f'filtered songs: {len(songs)}')

# remove extra columns
songs = songs[['Artist', 'SName', 'Lyric', 'Genre', 'Len']].copy()

elapsed_time(start_time)

songs.head(10)

all songs: 67226
filtered songs: 31333

elapsed time: 1m 12s 0.03745675086975098ms


| | Artist | SName | Lyric | Genre | Len |
|---|---|---|---|---|---|
| 6 | 10000 Maniacs | A Campfire Song | a lie to say o my mountain have coal vein and ... | Rock | 273 |
| 10 | 10000 Maniacs | Don't Talk | don t talk i will listen don t talk you keep y... | Rock | 284 |
| 22 | 10000 Maniacs | Back O' The Moon | jenny jenny you don t know the night i hide be... | Rock | 366 |
| 26 | 10000 Maniacs | Like The Weather | the color of the sky a far a i can see be coal... | Rock | 273 |
| 28 | 10000 Maniacs | Eat For Two | oh baby blanket and baby shoe baby slipper bab... | Rock | 253 |
| 30 | 10000 Maniacs | Maddox Table | the leg of maddox kitchen table my whole life ... | Rock | 263 |
| 38 | 10000 Maniacs | Poison In The Well | tell me what s go wrong i tilt my head there u... | Rock | 268 |
| 40 | 10000 Maniacs | Gold Rush Brides | while the young folk be have their good time s... | Rock | 266 |
| 98 | 10000 Maniacs | Dreadlock Holiday | i be walkin down the street concentratin on tr... | Rock | 346 |
| 100 | 10000 Maniacs | Dust Bowl | i should know to leave them home they follow m... | Rock | 291 |
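
The text-cleaning part of the pipeline can be tried in isolation. The following is a minimal sketch that applies only the three regular expressions used in this notebook, followed by lowercasing; it leaves out the spaCy lemmatization step. `strip_annotations` and the sample line are hypothetical, chosen for illustration.

```python
import re

def strip_annotations(lyric):
    """Drop (parenthesised) and [bracketed] annotations, then non-alphanumerics."""
    lyric = re.sub(r'\([^)]*\)', '', lyric)       # e.g. "(x2)"
    lyric = re.sub(r'\[[^\]]*\]', '', lyric)      # e.g. "[Chorus]"
    lyric = re.sub(r"[^a-zA-Z0-9]+", ' ', lyric)  # punctuation runs -> one space
    return lyric.lower().strip()

# hypothetical sample line, not taken from the dataset
sample = "[Chorus] Don't talk, I will listen (x2)"
cleaned = strip_annotations(sample)

print(cleaned)                  # don t talk i will listen
print(len(cleaned.split(' ')))  # 6, the word count stored in the Len column
```

Apostrophes are replaced by spaces, which is why contractions such as "don't" show up as "don t" in the table above.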


### Data Analysis

In [None]:
# print(songs.groupby('Genre').describe())
# songs.groupby('Genre').describe().plot.bar()

In [None]:
# from sklearn.feature_extraction.text import TfidfVectorizer

# def draw_plot(genre, genre_songs):
#   tfidf_vectorizer = TfidfVectorizer(use_idf=True)
#   tfidf_vectorizer_vectors = tfidf_vectorizer.fit_transform(genre_songs['Lyric'])

#   tfidf = tfidf_vectorizer_vectors.todense()

#   # TFIDF of words not in the doc will be 0, so replace them with nan
#   tfidf[tfidf == 0] = np.nan

#   # Use nanmean of numpy which will ignore nan while calculating the mean
#   means = np.nanmean(tfidf, axis=0)

#   # convert it into a dictionary for later lookup
#   means = dict(zip(tfidf_vectorizer.get_feature_names(), means.tolist()[0]))

#   tfidf = tfidf_vectorizer_vectors.todense()

#   # Argsort the full TFIDF dense vector
#   ordered = np.argsort(tfidf*-1)
#   words = tfidf_vectorizer.get_feature_names()

#   dff = pd.DataFrame({ 'word': words, 'tf-idf': means.values() })
#   dff = dff.sort_values('tf-idf', ascending=True)
#   dff[dff['tf-idf'] >= 0.6].head(10).plot('word', 'tf-idf', 'bar', title=genre)

#   return dff

# # plot word counts globally vs. per genre
# start_time = time.time()
# draw_plot('@all', songs)
# for genre, genre_songs in songs.groupby('Genre'):
#   draw_plot(genre, genre_songs)
# elapsed_time(start_time)

In [None]:
# from collections import Counter

# def plot(genre, genre_songs, color, all=None):
#   count = Counter()
#   for song in genre_songs['Lyric']:
#     count += Counter(song.split())

#   if all != None:
#     for w, c in all.items():
#       if count[w] < c - count[w]:
#         count[w] = 0

#   plt.bar(*zip(*count.most_common(10)), width=.5, color=color)
#   plt.title(genre)
#   plt.xlabel('word')
#   plt.ylabel('freq')
#   plt.show()

#   return count

# a = plot('@all', songs, 'b')
# for g, s in songs.groupby('Genre'):
#   plot(g, s, 'g', all=a)

# Build Datasets
We build the train, validation and test datasets from the processed songs.

In [None]:
# fixed integer seed (arbitrary value) so the splits are reproducible across runs
SEED = 1234

random.seed(SEED)
torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

start_time = time.time()

TEXT = data.Field(tokenize = 'spacy')
LABEL = data.LabelField()
# SNAME = data.RawField(is_target=False)

dataset = DataFrameDataset(songs, { 
                              'Lyric': TEXT
                            , 'Genre': LABEL
                            #, 'SName': SNAME 
})

# random.seed() returns None, so seed once and pass the generator state explicitly
train_data, valid_data = dataset.split(split_ratio=0.7, random_state = random.getstate())
train_data, test_data = train_data.split(random_state = random.getstate())

elapsed_time(start_time)

print(f'Number of all examples: {len(dataset)}')
print(f'Number of training examples: {len(train_data)}')
print(f'Number of validation examples: {len(valid_data)}')
print(f'Number of testing examples: {len(test_data)}')

assert len(train_data) + len(valid_data) + len(test_data) == len(dataset)


elapsed time: 0m 25s 0.6894738674163818ms
Number of all examples: 31333
Number of training examples: 15353
Number of validation examples: 9400
Number of testing examples: 6580
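
The split sizes follow from the two `split` calls: `valid_data` keeps the 30% held out by the first call, and `test_data` keeps the 30% held out from the remainder by the second call. Below is a minimal sketch of that arithmetic. It assumes the first part of each split is rounded to the nearest integer, an assumption that reproduces the printed sizes; `split_sizes` is a hypothetical helper, not a torchtext function.

```python
def split_sizes(n_examples, ratio=0.7):
    """Sizes of the two parts of a split, rounding the first part (assumed rule)."""
    first = int(round(ratio * n_examples))
    return first, n_examples - first

total = 31333                                # songs left after filtering
train_plus_test, valid = split_sizes(total)  # first split: 70% / 30%
train, test = split_sizes(train_plus_test)   # second split of the 70% part

print(train, valid, test)  # 15353 9400 6580
```

The model is therefore trained on roughly 49% of the songs, validated on 30% and tested on 21%.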


### Build Vocab
This step builds the vocabulary from the training dataset so that the corresponding word embeddings can be computed.

The model is trained starting from pre-trained weights ([GloVe](https://nlp.stanford.edu/projects/glove/)) to shorten training time, since they provide an initial, already trained representation for common words and terms.

In [None]:
MAX_VOCAB_SIZE = HP_MAX_VOCAB_SIZE

start_time = time.time()

# build the vocabulary and download the pre-trained GloVe vectors
TEXT.build_vocab(train_data, max_size = MAX_VOCAB_SIZE, vectors = "glove.6B.100d", unk_init = torch.Tensor.normal_)

# build the label (genre) vocabulary
LABEL.build_vocab(train_data)

elapsed_time(start_time)

print(f"Unique tokens in TEXT vocabulary: {len(TEXT.vocab)} {TEXT.vocab.freqs.most_common(10)}")
print(f"Unique tokens in LABEL vocabulary: {len(LABEL.vocab)} {LABEL.vocab.itos}")

.vector_cache/glove.6B.zip: 862MB [06:28, 2.22MB/s]                           
100%|█████████▉| 398962/400000 [00:15<00:00, 25554.78it/s]


elapsed time: 7m 17s 0.14974021911621094ms
Unique tokens in TEXT vocabulary: 25002 [('i', 305360), ('you', 247094), ('the', 223369), ('be', 206749), ('to', 145967), ('a', 138827), ('and', 124245), ('it', 123205), ('me', 100122), ('t', 94206)]
Unique tokens in LABEL vocabulary: 3 ['Rock', 'Pop', 'Hip Hop']


The batches are built on the GPU when one is available.
Songs do not have a fixed length, so the shorter songs in a batch are padded up to the length of the longest one. To keep that padding small, `BucketIterator` groups songs of similar length into the same batch.
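
To see why grouping songs of similar length matters, the sketch below counts how many pad tokens are needed when each batch is padded to its longest song, first with songs in arbitrary order and then with songs sorted by length. The lengths are randomly generated and hypothetical, and full sorting is a simplification of what a bucketing iterator does.

```python
import random

def padding_needed(lengths, batch_size):
    """Total pad tokens when every batch is padded to its longest sequence."""
    total = 0
    for i in range(0, len(lengths), batch_size):
        batch = lengths[i:i + batch_size]
        total += sum(max(batch) - n for n in batch)
    return total

rng = random.Random(0)
# hypothetical song lengths in tokens, all above the 250-word minimum
lengths = [rng.randint(251, 1500) for _ in range(1024)]

unsorted_pad = padding_needed(lengths, batch_size=128)
bucketed_pad = padding_needed(sorted(lengths), batch_size=128)

print(f'padding with arbitrary batches: {unsorted_pad}')
print(f'padding with length-sorted batches: {bucketed_pad}')
```

Less padding means fewer wasted convolution windows per batch, which shortens each epoch.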

In [None]:
BATCH_SIZE = HP_BATCH_SIZE

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

start_time = time.time()

train_iterator, valid_iterator, test_iterator = data.BucketIterator.splits(
    (train_data, valid_data, test_data), 
    batch_size = BATCH_SIZE,
    # sort = True, # sort the batches so songs of similar length are processed together
    # sort_within_batch = True, # sort the examples within each batch
    sort_key = lambda x: len(x.Lyric), # length used to group songs of similar size
    device = device)

# for batch in train_iterator:
#     print_batch(batch)

# for batch in valid_iterator:
#     print_batch(batch)

# for batch in test_iterator:
#     print_batch(batch)    

print(f'train batches: {len(train_iterator)}')
    
elapsed_time(start_time)

train batches: 120

elapsed time: 0m 0s 0.00034809112548828125ms


# Build Model
The model uses filters of different heights (2, 3 and 4 tokens, each spanning the full embedding width) to compute features over different n-grams. The pooled outputs are then concatenated and passed to a dense layer. The architecture is based on the paper [Convolutional Neural Networks for Sentence Classification](https://arxiv.org/abs/1408.5882) by Yoon Kim. As a variation and added challenge, the goal here is multi-class classification by artist and genre; the model trained in this notebook predicts the genre.

To that end, the network is composed of the following layers:
1.   **Embeddings**
2.   **Convolutional**
3.   **Max Pool**
4.   **Dropout**
5.   **Dense**
6.   **Softmax** (not a layer of the `CNN` class; `CrossEntropyLoss` applies it internally during training)

In [None]:
class CNN(nn.Module):
    def __init__(self, vocab_size, embedding_dim, n_filters, filter_sizes, output_dim, dropout, pad_idx):

        super().__init__()
        
        # note: pad_idx is not passed to nn.Embedding, so the <pad> vector is trained like any other
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        
        # one conv layer per filter size (2, 3 and 4), each with a kernel of fs x N, where N is the embedding dimension
        self.convs = nn.ModuleList([
          nn.Conv2d(in_channels = 1, out_channels = n_filters, kernel_size = (fs, embedding_dim)) for fs in filter_sizes
        ])
        
        # dense/FC layer
        self.fc = nn.Linear(len(filter_sizes) * n_filters, output_dim)
        
        # dropout regularization
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, text):
        
        #text = [sent len, batch size]
        text = text.permute(1, 0)
        #text = [batch size, sent len]
        
        embedded = self.embedding(text)
        #embedded = [batch size, sent len, emb dim]
        
        embedded = embedded.unsqueeze(1)
        #embedded = [batch size, 1, sent len, emb dim]
        
        # apply the N convolutions in parallel, one per filter size
        conved = [F.relu(conv(embedded)).squeeze(3) for conv in self.convs]
        #conv_n = [batch size, n_filters, sent len - filter_sizes[n] + 1]
        
        # max-pool each convolution output over time and collect the results in a list
        pooled = [F.max_pool1d(conv, conv.shape[2]).squeeze(2) for conv in conved]
        #pooled_n = [batch size, n_filters]
        
        # apply dropout to the concatenation of all pooled outputs
        cat = self.dropout(torch.cat(pooled, dim = 1))
        #cat = [batch size, n_filters * len(filter_sizes)]

        # return the output of the dense/FC layer (unnormalized class scores)
        return self.fc(cat)
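
What a single filter does in the `forward` pass can be traced by hand on a toy input. The following pure-Python sketch slides one bigram filter over a five-token sentence and then applies max-over-time pooling. The embedding values and kernel weights are made up, and `conv_over_time` is a hypothetical helper that leaves out the bias term.

```python
def conv_over_time(embedded, kernel):
    """Slide a (fs x emb_dim) kernel over a (sent_len x emb_dim) matrix, then ReLU."""
    fs = len(kernel)
    feature_map = []
    for start in range(len(embedded) - fs + 1):
        window = embedded[start:start + fs]
        score = sum(w * x
                    for k_row, e_row in zip(kernel, window)
                    for w, x in zip(k_row, e_row))
        feature_map.append(max(0.0, score))  # ReLU
    return feature_map

# toy sentence: 5 tokens with embedding dimension 2 (made-up values)
embedded = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0], [2.0, 1.0]]
bigram_kernel = [[1.0, -1.0], [0.5, 0.5]]  # filter size fs = 2

feature_map = conv_over_time(embedded, bigram_kernel)
pooled = max(feature_map)  # max-over-time pooling: one value per filter

print(len(feature_map))  # 4 = sent_len - fs + 1
print(feature_map)       # [1.5, 0.0, 0.0, 1.5]
print(pooled)            # 1.5
```

Because pooling keeps only the strongest response of each filter, the pooled vector has a fixed size regardless of the length of the song.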

### Create Model

We instantiate the model with the corresponding parameters and hyperparameters. 
`INPUT_DIM` is the size of the vocabulary. 
`EMBEDDING_DIM` is the length of the word embedding, that is, the length of the dense vector that represents each token in place of its full one-hot vector. 
`OUTPUT_DIM` is the dimension of the output, which equals the number of classes $C$ the model has to learn (`len(LABEL.vocab)`). In this case, the number of music genres.

In [None]:
INPUT_DIM = len(TEXT.vocab)
EMBEDDING_DIM = HP_EMBEDDING_DIM
N_FILTERS = HP_N_FILTERS
FILTER_SIZES = HP_FILTERS
OUTPUT_DIM = len(LABEL.vocab)
DROPOUT = HP_DROPOUT
PAD_IDX = TEXT.vocab.stoi[TEXT.pad_token] # <pad>

print('CNN(', INPUT_DIM, EMBEDDING_DIM, N_FILTERS, FILTER_SIZES, OUTPUT_DIM, DROPOUT, PAD_IDX, ')\n')
model = CNN(INPUT_DIM, EMBEDDING_DIM, N_FILTERS, FILTER_SIZES, OUTPUT_DIM, DROPOUT, PAD_IDX)

def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

for layer in model.children():
  num_params = sum(p.numel() for p in layer.parameters())
  print(f"Layer: {layer}, Parameters: {num_params}") 

print(f'\nThe model has {count_parameters(model):,} trainable parameters')

CNN( 25002 100 200 [2, 3, 4] 3 0.5 1 )

Layer: Embedding(25002, 100), Parameters: 2500200
Layer: ModuleList(
  (0): Conv2d(1, 200, kernel_size=(2, 100), stride=(1, 1))
  (1): Conv2d(1, 200, kernel_size=(3, 100), stride=(1, 1))
  (2): Conv2d(1, 200, kernel_size=(4, 100), stride=(1, 1))
), Parameters: 180600
Layer: Linear(in_features=600, out_features=3, bias=True), Parameters: 1803
Layer: Dropout(p=0.5, inplace=False), Parameters: 0

The model has 2,682,603 trainable parameters
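
The parameter counts printed above can be checked by hand from the hyperparameters. A short sketch of the arithmetic, using the values shown in the `CNN( ... )` line of the output:

```python
vocab_size, emb_dim = 25002, 100
n_filters, filter_sizes, n_classes = 200, [2, 3, 4], 3

# one emb_dim-sized vector per vocabulary entry
embedding_params = vocab_size * emb_dim

# each filter has fs * emb_dim weights plus one bias
conv_params = sum(n_filters * (fs * emb_dim) + n_filters for fs in filter_sizes)

# dense layer: (len(filter_sizes) * n_filters) inputs per class, plus one bias per class
fc_params = len(filter_sizes) * n_filters * n_classes + n_classes

total_params = embedding_params + conv_params + fc_params
print(embedding_params, conv_params, fc_params, total_params)
# 2500200 180600 1803 2682603
```

About 93% of the trainable parameters sit in the embedding layer, which is why starting from pre-trained vectors has such a large effect on training.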


### Adjust Embeddings
We initialize the weights of the embedding layer with the downloaded pre-trained vectors.

In [None]:
pretrained_embeddings = TEXT.vocab.vectors
if pretrained_embeddings is not None:
  model.embedding.weight.data.copy_(pretrained_embeddings)

We set to zero the weights of the `UNK` (unknown word) and `PAD` (filler added when a song is shorter than the longest song in its batch) tokens.

In [None]:
UNK_IDX = TEXT.vocab.stoi[TEXT.unk_token]
PAD_IDX = TEXT.vocab.stoi[TEXT.pad_token]

model.embedding.weight.data[UNK_IDX] = torch.zeros(EMBEDDING_DIM)
model.embedding.weight.data[PAD_IDX] = torch.zeros(EMBEDDING_DIM)

# Train Model

## Fitting Methods

The model outputs a vector of dimension $C$ (y_hat), where $C$ is the number of classes (genres) and the value at each position is the score (logit) the model assigns to the input for that class; a higher score means a more likely class.

For example, in our case 'Rock' = 0, 'Pop' = 1 and 'Hip Hop' = 2, and the output of the model could look like: [4.2, 1.1, 0.5]

We compute the accuracy of the model by taking the `argmax` of that vector, which returns the index of the element with the highest value (0 in this case), and comparing it with the label / Y to check whether they match. The accuracy is computed over the whole batch: getting 8 out of 10 right in a batch returns 0.8.

In [None]:
def categorical_accuracy(preds, y):
    """
    Returns accuracy per batch, i.e. if you get 8/10 right, this returns 0.8, NOT 8
    """
    max_preds = preds.argmax(dim = 1, keepdim = True) # get the index of the max probability
    correct = max_preds.squeeze(1).eq(y)
    return correct.sum() / torch.FloatTensor([y.shape[0]]).to(device)
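
The same computation can be followed without tensors. The sketch below uses the example output vector [4.2, 1.1, 0.5], converts it to probabilities with a softmax, takes the `argmax`, and then computes a batch accuracy. The batch of ten predictions and labels is hypothetical.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    z = sum(exps)
    return [e / z for e in exps]

def argmax(values):
    return max(range(len(values)), key=lambda i: values[i])

logits = [4.2, 1.1, 0.5]
probs = softmax(logits)
predicted = argmax(logits)  # softmax preserves order, so argmax(probs) is the same

# hypothetical batch of 10 predicted vs. true class indices
preds  = [0, 1, 2, 0, 0, 1, 2, 2, 1, 0]
labels = [0, 1, 2, 0, 1, 1, 2, 0, 1, 0]
accuracy = sum(p == y for p, y in zip(preds, labels)) / len(labels)

print(predicted)  # 0
print(accuracy)   # 0.8
```

Since the softmax does not change which entry is largest, accuracy can be computed directly on the raw model output, as `categorical_accuracy` does.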

In [None]:
def train(model, iterator, optimizer, criterion):
    
  epoch_loss = 0
  epoch_acc = 0
  
  model.train()
  
  for batch in iterator:
      
    optimizer.zero_grad()
    
    predictions = model(batch.Lyric)
    
    loss = criterion(predictions, batch.Genre)
    
    acc = categorical_accuracy(predictions, batch.Genre)
    
    loss.backward()
    
    optimizer.step()
    
    epoch_loss += loss.item()
    epoch_acc += acc.item()
      
  return epoch_loss / len(iterator), epoch_acc / len(iterator)

In [None]:
def evaluate(model, iterator, criterion):
  epoch_loss = 0
  epoch_acc = 0
    
  model.eval()
  
  with torch.no_grad():
  
    for batch in iterator:

      predictions = model(batch.Lyric)
            
      loss = criterion(predictions, batch.Genre)
        
      acc = categorical_accuracy(predictions, batch.Genre)

      epoch_loss += loss.item()
      epoch_acc += acc.item()
    
  return epoch_loss / len(iterator), epoch_acc / len(iterator)

## Run Train (WIP)

We use [Adam](https://pytorch.org/docs/stable/optim.html?highlight=adam#torch.optim.Adam) as the optimization algorithm and [CrossEntropyLoss](https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html) as the loss function for multi-class classification.
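
For a single example, the cross-entropy loss is the negative log of the softmax probability assigned to the true class. A minimal pure-Python sketch follows; `cross_entropy` is a hypothetical helper for one example, whereas `nn.CrossEntropyLoss` averages this quantity over the batch by default.

```python
import math

def cross_entropy(logits, target):
    """-log(softmax(logits)[target]), computed with log-sum-exp for stability."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum_exp - logits[target]

logits = [4.2, 1.1, 0.5]
loss_if_rock = cross_entropy(logits, target=0)     # true class has the highest score
loss_if_hip_hop = cross_entropy(logits, target=2)  # true class has the lowest score

# a model with no preference among 3 classes scores ln(3)
uniform_loss = cross_entropy([0.0, 0.0, 0.0], target=1)

print(f'{loss_if_rock:.3f} {loss_if_hip_hop:.3f} {uniform_loss:.3f}')
```

The uniform case gives a useful baseline: with three genres, a loss near ln(3), about 1.10, means the model is doing no better than guessing.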

In [None]:
with torch.no_grad():
    torch.cuda.empty_cache()

In [None]:
# create the optimizer, Adam in this case
optimizer = optim.Adam(model.parameters())

# create the loss function, cross-entropy for multi-class classification
criterion = nn.CrossEntropyLoss()

# move the model and the loss function to the GPU when available
model = model.to(device)
criterion = criterion.to(device)

# number of epochs
N_EPOCHS = HP_N_EPOCHS

# best validation loss seen so far
best_valid_loss = float('inf')

training_start_time = time.time()

for epoch in range(N_EPOCHS):

    start_time = time.time()
    
    train_loss, train_acc = train(model, train_iterator, optimizer, criterion)
    valid_loss, valid_acc = evaluate(model, valid_iterator, criterion)
    
    end_time = time.time()

    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'lyrics-model.pt')
    
    print(f'Epoch: {epoch+1:02} | Epoch Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. Acc: {valid_acc*100:.2f}%')

training_epoch_mins, training_epoch_secs = epoch_time(training_start_time, time.time())
print(f'Training Duration: {training_epoch_mins}m {training_epoch_secs}s')


Epoch: 01 | Epoch Time: 0m 19s
	Train Loss: 0.840 | Train Acc: 59.47%
	 Val. Loss: 0.705 |  Val. Acc: 68.73%
Epoch: 02 | Epoch Time: 0m 18s
	Train Loss: 0.695 | Train Acc: 68.29%
	 Val. Loss: 0.661 |  Val. Acc: 70.77%
Epoch: 03 | Epoch Time: 0m 19s
	Train Loss: 0.617 | Train Acc: 73.08%
	 Val. Loss: 0.654 |  Val. Acc: 71.19%
Epoch: 04 | Epoch Time: 0m 19s
	Train Loss: 0.552 | Train Acc: 76.41%
	 Val. Loss: 0.634 |  Val. Acc: 72.27%
Epoch: 05 | Epoch Time: 0m 20s
	Train Loss: 0.483 | Train Acc: 79.99%
	 Val. Loss: 0.640 |  Val. Acc: 72.31%
Epoch: 06 | Epoch Time: 0m 20s
	Train Loss: 0.416 | Train Acc: 83.81%
	 Val. Loss: 0.656 |  Val. Acc: 72.25%
Epoch: 07 | Epoch Time: 0m 20s
	Train Loss: 0.338 | Train Acc: 87.56%
	 Val. Loss: 0.695 |  Val. Acc: 72.84%
Epoch: 08 | Epoch Time: 0m 19s
	Train Loss: 0.281 | Train Acc: 89.95%
	 Val. Loss: 0.703 |  Val. Acc: 72.26%
Epoch: 09 | Epoch Time: 0m 20s
	Train Loss: 0.228 | Train Acc: 92.19%
	 Val. Loss: 0.751 |  Val. Acc: 72.68%
Epoch: 10 | Epoch T
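
The training loop saves a checkpoint every time the validation loss improves. The sketch below replays that rule on the validation losses visible in the log above. Epoch 10 is cut off in the output, so it is left out, and the result only holds for epochs 1 to 9.

```python
# validation losses for epochs 1 to 9, copied from the log above
valid_losses = [0.705, 0.661, 0.654, 0.634, 0.640, 0.656, 0.695, 0.703, 0.751]

best_valid_loss = float('inf')
best_epoch = None
for epoch, valid_loss in enumerate(valid_losses, start=1):
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        best_epoch = epoch  # the point where torch.save(...) runs in the loop

print(best_epoch, best_valid_loss)  # 4 0.634
```

After epoch 4 the training loss keeps falling while the validation loss rises, which suggests the model starts to overfit; keeping the checkpoint with the lowest validation loss guards against that.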

Finally, let's run our model on the test set!

# Test Model (WIP)

In [None]:
# load the best saved model for testing
model.load_state_dict(torch.load('lyrics-model.pt'))

# evaluate on the test dataset
test_loss, test_acc = evaluate(model, test_iterator, criterion)

print(f'Test Loss: {test_loss:.3f} | Test Acc: {test_acc*100:.2f}%')

Test Loss: 0.631 | Test Acc: 71.57%


We can now write a function that predicts the genre of any given lyric, or fragment of one.

We take the `argmax` of the model output to get the index of the highest-scoring class, and then look that index up in the label vocabulary to get the human-readable genre.

In [None]:
def predict_class(model, sentence, min_len = 4):
  # min_len matches the largest filter size, so every convolution has at least one window
  model.eval()
  tokenized = [tok.text for tok in nlp.tokenizer(sentence)]
  if len(tokenized) < min_len:
    tokenized += [TEXT.pad_token] * (min_len - len(tokenized))
  indexed = [TEXT.vocab.stoi[t] for t in tokenized]
  tensor = torch.LongTensor(indexed).to(device)
  # add the batch dimension: [sent len] -> [sent len, 1]
  tensor = tensor.unsqueeze(1)
  # inference only, so no gradients are needed
  with torch.no_grad():
    preds = model(tensor)
  max_preds = preds.argmax(dim = 1)
  return max_preds.item()
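
The `min_len` argument exists because a convolution with filter size 4 needs at least four tokens to produce one window. A small sketch of the padding step on its own, with a hypothetical `pad_tokens` helper and made-up token lists:

```python
def pad_tokens(tokens, min_len=4, pad_token='<pad>'):
    """Right-pad a token list so the widest filter still fits."""
    if len(tokens) < min_len:
        tokens = tokens + [pad_token] * (min_len - len(tokens))
    return tokens

short = pad_tokens(['hello', 'world'])
long_enough = pad_tokens(['a', 'b', 'c', 'd', 'e'])

# number of windows a size-4 filter sees: sent_len - 4 + 1
print(short, len(short) - 4 + 1)              # ['hello', 'world', '<pad>', '<pad>'] 1
print(long_enough, len(long_enough) - 4 + 1)  # ['a', 'b', 'c', 'd', 'e'] 2
```

Inputs that already have four or more tokens pass through unchanged.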

Random examples from the test set. The model did not see this data during training.

In [None]:
corrects = 0

for i in range(10):
  random_song = test_iterator.dataset.examples[random.randint(0, len(test_iterator.dataset.examples) - 1)]

  # input_title = random_song.SName
  input_text = ' '.join(random_song.Lyric).strip()
  input_label = random_song.Genre

  print(f'\ninput: ({input_label}) => {input_text[:50]}...')

  pred_class = predict_class(model, input_text)
  
  correct = (LABEL.vocab.itos[pred_class] == input_label)
  result = '✓' if correct else '✗'

  print(f'Predicted class is: {pred_class} = {LABEL.vocab.itos[pred_class]} AND song is: {input_label} {result}')

  if correct:
    corrects += 1

print(f'{corrects}/10')


input: (Hip Hop) => also feature on sway and king tech s this or that ...
Predicted class is: 2 = Hip Hop AND song is: Hip Hop ✓

input: (Rock) => we be force to live in silence eat dust and breath...
Predicted class is: 0 = Rock AND song is: Rock ✓

input: (Pop) => be you ready it s time for me to take it i be the ...
Predicted class is: 1 = Pop AND song is: Pop ✓

input: (Rock) => read some kerouac and it put me on the track to bu...
Predicted class is: 0 = Rock AND song is: Rock ✓

input: (Pop) => i come into this world without a single idea that ...
Predicted class is: 2 = Hip Hop AND song is: Pop ✗

input: (Rock) => pop drone if i ever get back home again if i ever ...
Predicted class is: 1 = Pop AND song is: Rock ✗

input: (Pop) => x4 if i can escape i would but ﻿1 of all let me sa...
Predicted class is: 1 = Pop AND song is: Pop ✓

input: (Hip Hop) => when you say it be over you shoot right through my...
Predicted class is: 2 = Hip Hop AND song is: Hip Hop ✓

input: (Pop) => ima

In [None]:
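# read the Genius API token from the environment instead of hard-coding it
# (GENIUS_ACCESS_TOKEN is an assumed variable name; set it before running this cell)
import os
genius = lyricsgenius.Genius(os.environ["GENIUS_ACCESS_TOKEN"])
genius.verbose = True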

artist = genius.search_artist('Eminem', max_songs=1, sort="popularity")
custom_title = artist.songs[0].title
custom_text = artist.songs[0].lyrics
custom_label = "unk"

song_found = False
# for td in train_data:
#   if td.SName == custom_title.lower().strip():
#     song_found = True
#     break

if song_found:
  print(f'\ncancion entrenada')
else:
  print(f'\ncancion nunca antes vista')

genius_songs = pd.DataFrame({ 'Lyric': [custom_text], 'Genre': [custom_label], 'Artist': [artist.name], 'SName': [custom_title] })

# same processing as applied to the training dataset
genius_songs['Lyric'] = genius_songs['Lyric'].apply(lambda x: re.sub(r'\([^)]*\)', '', x))
genius_songs['Lyric'] = genius_songs['Lyric'].apply(lambda x: re.sub(r'\[[^\]]*\]', '', x))
genius_songs['Lyric'] = genius_songs['Lyric'].apply(normalize, lowercase=True, remove_stopwords=False)
genius_songs['Len'] = genius_songs['Lyric'].apply(lambda x: len(x.split(' ')))

genius_songs.head(1)

Searching for songs by Eminem...

Song 1: "Rap God"

Reached user-specified song limit (1).
Done. Found 1 songs.

cancion nunca antes vista


| | Lyric | Genre | Artist | SName | Len |
|---|---|---|---|---|---|
| 0 | look i be go to go easy on you not to hurt you... | unk | Eminem | Rap God | 1633 |


In [None]:
pred_class = predict_class(model, genius_songs.iloc[0].Lyric)
print(f'Predicted class is: {pred_class} => {LABEL.vocab.itos[pred_class]}')

Predicted class is: 2 => Hip Hop
