The aim of this notebook is to use the word2vec model to find similar songs.

### Building the corpus from the book Don Quijote de La Mancha

In [3]:
import pandas as pd
import numpy as np
import gensim.models.word2vec as w2v
import multiprocessing
import os
import re
import pprint
import sklearn.manifold
import matplotlib.pyplot as plt

Although non-English artists were removed, the dataset contained Hindi lyrics of Lata Mangeshkar written in English. Therefore, I decided to remove all songs sung by her.

In [4]:
songs = pd.read_csv("C:/Users/Marino/Documents/UNIVERSIDAD/MasterBD/AnalisisDatosNoEstructurados/Practica/4_TEXT/data/hhgroups_merge_28_05.csv", header=0)
songs.head()

Unnamed: 0,id,artista,cancion,album,letra,anyo,visitas
0,0,Denom,Machete (con Jarfaiter y Gente jodida),Medicina,"Para su nuevo disco ""Medicina"", Denom ha vuelt...",2019,126
1,1,Denom,Vacío (con Ivo Incuerdo),Medicina,"[Denom]\nYo que quería, yo que pedía vida,\nSe...",2019,361
2,2,Denom,El orgullo es fiel (con Juancho Marqués y Elio...,Medicina,"""El orgullo es fiel"" es uno de los cortes incl...",2019,262
3,3,Denom,Mueve mueve (con Fernandocosta),Medicina,"[Estribillo: Denom] (x2)\nMueve, mueve, mueve,...",2019,578
4,4,Jaro Desperdizio,Insomnia,"Sin álbum, es un vídeo suelto","[Estribillo]\nY en esta noche, ¿Quién me arrop...",2019,219


To train the word2vec model, we first need to build its vocabulary. To do that, I iterated over each song and added it to an array that can later be fed to the model.

In [5]:
songs["letra"] = songs["letra"].replace(r"[^\w+]", " ", regex=True)
songs.head()

Unnamed: 0,id,artista,cancion,album,letra,anyo,visitas
0,0,Denom,Machete (con Jarfaiter y Gente jodida),Medicina,Para su nuevo disco Medicina Denom ha vuelt...,2019,126
1,1,Denom,Vacío (con Ivo Incuerdo),Medicina,Denom Yo que quería yo que pedía vida Se p...,2019,361
2,2,Denom,El orgullo es fiel (con Juancho Marqués y Elio...,Medicina,El orgullo es fiel es uno de los cortes incl...,2019,262
3,3,Denom,Mueve mueve (con Fernandocosta),Medicina,Estribillo Denom x2 Mueve mueve mueve ...,2019,578
4,4,Jaro Desperdizio,Insomnia,"Sin álbum, es un vídeo suelto",Estribillo Y en esta noche Quién me arropa...,2019,219
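The runs of spaces in the cleaned `letra` column come from the pattern itself: it replaces one character at a time. A minimal sketch of that behaviour using only the standard `re` module; the sample string is a hypothetical snippet modelled on the rows above, and the `\W+` variant is an assumed alternative, not what the cell above uses.

```python
import re

# Hypothetical lyric snippet, modelled on the rows shown above.
sample = "[Estribillo: Denom] (x2)\nMueve, mueve, mueve"

# Pattern used above: the '+' inside the brackets is a literal plus sign,
# so each character that is neither a word character nor '+' becomes one space.
per_char = re.sub(r"[^\w+]", " ", sample)

# Assumed alternative: collapse each run of non-word characters into one space.
collapsed = re.sub(r"\W+", " ", sample).strip()

print(repr(per_char))
print(repr(collapsed))
```

Either form works for a later `split()`, since splitting on whitespace ignores repeated spaces; the difference only matters if the cleaned string is displayed or stored.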


In [8]:
quijote = open("C:/Users/Marino/Documents/UNIVERSIDAD/MasterBD/AnalisisDatosNoEstructurados/Practica/4_TEXT/data/quijote.txt", "r", encoding='utf-8')
text_corpus = []
for w in quijote:
    words = re.sub("[^\w+]", ' ', str(w))
    words = w.lower().split()
    text_corpus.append(words)


# Dimensionality of the resulting word vectors.
# More dimensions are more expensive to train; they can capture finer
# distinctions, but need more text to be estimated well.
num_features = 50
# Minimum word count threshold: words seen fewer times are discarded.
# With 1, every word is kept.
min_word_count = 1

# Number of threads to run in parallel.
# More workers train faster.
num_workers = multiprocessing.cpu_count()

# Context window length.
context_size = 7

# Downsampling threshold for very frequent words.
downsampling = 1e-1

# Seed for the random number generator, to make the results more reproducible.
# gensim documents that a fully deterministic run also needs a single worker thread.
seed = 1

songs2vec = w2v.Word2Vec(
    sg=1,
    seed=seed,
    workers=num_workers,
    size=num_features,  # gensim 3.x name; gensim 4 renamed it to vector_size
    min_count=min_word_count,
    window=context_size,
    sample=downsampling
)

songs2vec.build_vocab(text_corpus)
print(len(text_corpus))

37861
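The `sample` argument controls how often occurrences of very frequent words are randomly skipped during training. As a rough sketch, assuming the keep-probability formula of the original word2vec C code (which, as far as I know, gensim follows), the function below shows what different thresholds do to a common word. The name `keep_probability` and the 5% word share are hypothetical, chosen only for illustration.

```python
import math

def keep_probability(word_fraction, sample):
    """Probability of keeping one occurrence of a word under subsampling.

    word_fraction: the word's share of all tokens in the corpus (0..1).
    sample: the downsampling threshold (the `sample` argument).
    Assumed formula: (sqrt(f / sample) + 1) * (sample / f), capped at 1.
    """
    if sample <= 0 or word_fraction <= 0:
        return 1.0  # subsampling disabled, or nothing to subsample
    ratio = word_fraction / sample
    return min(1.0, (math.sqrt(ratio) + 1.0) / ratio)

# Hypothetical word that makes up 5% of all tokens.
for sample in (1e-1, 1e-3, 1e-5):
    print(sample, round(keep_probability(0.05, sample), 4))
```

Under this assumption, a threshold of `1e-1` leaves such a word untouched, so the setting above would amount to almost no downsampling.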


In [9]:
text_corpus[0]

['\ufeffthe',
 'project',
 'gutenberg',
 'ebook',
 'of',
 'don',
 'quijote,',
 'by',
 'miguel',
 'de',
 'cervantes',
 'saavedra']
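The first token above still carries a byte-order mark (`'\ufeff'`) and `'quijote,'` keeps its comma, because the loop splits the raw line `w` instead of the string returned by `re.sub`. A minimal sketch of a tokenizer that splits the cleaned string; `tokenize_line` is a hypothetical helper, and the sample line is reconstructed from the tokens printed above.

```python
import re

def tokenize_line(line):
    """Lowercase a line, drop a leading BOM and punctuation, split on whitespace."""
    line = line.lstrip("\ufeff")         # byte-order mark at the start of some UTF-8 files
    cleaned = re.sub(r"\W+", " ", line)  # \w is Unicode-aware, so accents and ñ survive
    return cleaned.lower().split()

# Sample line reconstructed from the output above (original casing is assumed).
first_line = "\ufeffThe Project Gutenberg EBook of Don Quijote, by Miguel de Cervantes Saavedra\n"
tokens = tokenize_line(first_line)
print(tokens)
```

Opening the file with `encoding="utf-8-sig"` is another way to drop the byte-order mark. Retraining on tokens cleaned this way would merge entries such as `'vida,'` and `'vida'` into one vocabulary word, so the results further down would change.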

In [10]:
import time
start_time = time.time()



songs2vec.train(text_corpus, total_examples=songs2vec.corpus_count, epochs=2)

os.makedirs("trained", exist_ok=True)

songs2vec.save(os.path.join("trained", "songs2vectors.w2v"))

print("--- %s seconds ---" % (time.time() - start_time))

--- 3.1560516357421875 seconds ---


In [11]:
songs2vec = w2v.Word2Vec.load(os.path.join("trained", "songs2vectors.w2v"))

#### Let's explore our model

Find similar words

In [12]:
songs2vec.wv.most_similar("amor")

[('libro', 0.9935914278030396),
 ('viene', 0.9932799935340881),
 ('alma', 0.9928247928619385),
 ('loco', 0.9928109049797058),
 ('trabajo', 0.9927223324775696),
 ('cuento', 0.9924731254577637),
 ('pobre', 0.9916315078735352),
 ('rey', 0.9915354251861572),
 ('temor', 0.9912034273147583),
 ('muerto', 0.9910915493965149)]

In [13]:
songs2vec.wv.most_similar("persona")

[('voluntad', 0.9935448169708252),
 ('lengua', 0.9926726818084717),
 ('fama', 0.9924412369728088),
 ('vida,', 0.9921514391899109),
 ('historia,', 0.9913180470466614),
 ('solo', 0.9910567998886108),
 ('ocasión', 0.9901459813117981),
 ('sola', 0.9893699884414673),
 ('gusto,', 0.9891254901885986),
 ('fortuna', 0.9885876178741455)]
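The scores returned by `most_similar` are cosine similarities between word vectors. The sketch below reproduces that ranking in plain Python on hypothetical three-dimensional toy vectors (not the trained model), just to make the computation explicit.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar(word, vectors, topn=10):
    """Rank every other word by cosine similarity to `word`, highest first."""
    scored = [
        (other, cosine_similarity(vectors[word], vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

# Hypothetical toy vectors, for illustration only.
toy = {
    "amor": [0.9, 0.1, 0.3],
    "alma": [0.8, 0.2, 0.4],
    "mesa": [-0.2, 0.9, 0.1],
}
print(most_similar("amor", toy))
```

In the real outputs above every neighbour scores between roughly 0.988 and 0.994, so the ranking rests on very small differences. That may be a sign that two epochs on a corpus of this size have not yet pulled the vectors apart.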

Words out of context

In [14]:
songs2vec.wv.doesnt_match("feliz amor mesa odio".split())

'feliz'

Curiously, the model is now worse at detecting the out-of-context word. This may be because songs usually talk about emotions, whereas the novel deals less with these topics.
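To see why a word can be picked as the odd one out, here is a sketch of the idea behind `doesnt_match` as I understand it: normalize the vectors, average them, and return the word least aligned with that average. The two-dimensional toy vectors are hypothetical and chosen so that `mesa` is the outlier; they are not taken from the trained model.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def doesnt_match(words, vectors):
    """Return the word whose unit vector is least similar to the mean direction."""
    normed = [unit(vectors[w]) for w in words]
    dim = len(normed[0])
    mean = unit([sum(v[i] for v in normed) / len(normed) for i in range(dim)])
    scores = [sum(a * b for a, b in zip(v, mean)) for v in normed]
    return min(zip(scores, words))[1]

# Hypothetical toy vectors: three "emotion" words and one unrelated word.
toy = {
    "feliz": [1.0, 0.1],
    "amor": [0.9, 0.2],
    "odio": [0.8, 0.3],
    "mesa": [0.1, 1.0],
}
print(doesnt_match("feliz amor mesa odio".split(), toy))
```

When all vectors point in nearly the same direction, as the very high similarities in this model suggest, the differences to the mean are tiny and the choice of outlier becomes close to arbitrary.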

In [15]:
songs2vec.wv.most_similar(positive=['mujer', 'rey'], negative=['hombre'])
# Expected answer: reina

[('dueña', 0.9866008758544922),
 ('presencia', 0.9818426966667175),
 ('duquesa', 0.9804731607437134),
 ('carta', 0.9803775548934937),
 ('plática', 0.9792066812515259),
 ('peregrina', 0.9787508845329285),
 ('teresa', 0.9786129593849182),
 ('campaña', 0.9784932136535645),
 ('sobrina', 0.9783490896224976),
 ('hija,', 0.9781205654144287)]

However, the suggested words are more appropriate than those obtained with the song corpus.
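The analogy query adds the vectors of the positive words, subtracts those of the negative words, and ranks the remaining vocabulary by cosine similarity to the result. A minimal sketch of that additive scheme on hypothetical toy vectors (one "person" axis, one "royalty" axis, one "female" axis), not on the trained model:

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def analogy(positive, negative, vectors, topn=3):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for word in positive:
        query = [q + x for q, x in zip(query, unit(vectors[word]))]
    for word in negative:
        query = [q - x for q, x in zip(query, unit(vectors[word]))]
    query = unit(query)
    exclude = set(positive) | set(negative)  # the input words are never returned
    scored = [(w, dot(query, unit(v))) for w, v in vectors.items() if w not in exclude]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

# Hypothetical toy vectors, for illustration only.
toy = {
    "hombre": [1.0, 0.0, 0.0],
    "mujer": [1.0, 0.0, 1.0],
    "rey": [1.0, 1.0, 0.0],
    "reina": [1.0, 1.0, 1.0],
    "mesa": [-1.0, 0.2, 0.1],
}
result = analogy(["mujer", "rey"], ["hombre"], toy)
print(result)
```

On the trained model the vocabulary is much larger and the axes are not this clean, so the top answer only approximates the expected word.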

Semantic distance between words

In [16]:
def nearest_similarity_cosmul(start1, end1, end2):
    similarities = songs2vec.wv.most_similar_cosmul(
        positive=[end2, start1],
        negative=[end1]
    )
    start2 = similarities[0][0]
    print("{0} es a {1}, lo que {2} es a {3}".format(start1, end1, start2, end2))

In [42]:
nearest_similarity_cosmul("caballero", "escudero", "sancho")

caballero es a escudero, lo que quijote es a sancho
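`most_similar_cosmul` combines the similarities multiplicatively instead of additively. The sketch below assumes the usual 3CosMul formulation: each cosine is shifted into the range 0 to 1, the positive terms are multiplied together, and the product is divided by the product of the negative terms plus a small constant. The toy vectors and the `cosmul` helper are hypothetical.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def cosmul(positive, negative, vectors, topn=3, eps=1e-6):
    """Rank words by prod((1 + cos) / 2 over positives) / (prod over negatives + eps)."""
    exclude = set(positive) | set(negative)
    scored = []
    for word, vec in vectors.items():
        if word in exclude:
            continue
        numerator = 1.0
        for p in positive:
            numerator *= (1.0 + cosine(vec, vectors[p])) / 2.0
        denominator = 1.0
        for n in negative:
            denominator *= (1.0 + cosine(vec, vectors[n])) / 2.0
        scored.append((word, numerator / (denominator + eps)))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

# Hypothetical toy vectors, for illustration only.
toy = {
    "escudero": [1.0, 0.0, 0.0],
    "caballero": [1.0, 1.0, 0.0],
    "sancho": [1.0, 0.0, 1.0],
    "quijote": [1.0, 1.0, 1.0],
    "molino": [1.0, -1.0, -1.0],
}
result = cosmul(["sancho", "caballero"], ["escudero"], toy)
print(result)
```

One property worth keeping in mind: a candidate that points almost opposite to the negative word gets a very small denominator and can score high even if it is unrelated to the positive words.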
