The aim of this notebook is to use the word2vec model to find similar songs.

In [1]:
import pandas as pd
import numpy as np
import gensim.models.word2vec as w2v
import multiprocessing
import os
import re
import pprint
import sklearn.manifold
import matplotlib.pyplot as plt

# Cleaning and building the corpora

Although non-English artists were removed, the dataset still contained Hindi lyrics by Lata Mangeshkar written in English, so I decided to remove all of her songs.

In [2]:
import re

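def normalize(s):
    # Lowercase the text and replace accented Spanish letters with plain ASCII.
    # "ü" is not mapped, so clean() drops it ("vergüenza" becomes "vergenza").
    replacements = (
        ("á", "a"),
        ("é", "e"),
        ("í", "i"),
        ("ó", "o"),
        ("ú", "u"),
        ("ñ", "n"),
    )
    s = s.lower()
    for a, b in replacements:
        s = s.replace(a, b)
    return s

def clean(text, remove_digits=True):
    # Keep only letters and whitespace (and digits if requested).
    if remove_digits:
        pattern = r'[^a-zA-Z\s]'
    else:
        pattern = r'[^a-zA-Z0-9\s]'
    text = re.sub(pattern, '', text).replace('\n', ' ')
    return text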
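
When writing a letters-only character class like the one in `clean`, note that the range `A-z` is wider than `A-Z`: in ASCII it also covers the six characters between `Z` and `a` (`[`, `\`, `]`, `^`, `_` and the backtick). A minimal check on a made-up lyric line (the sample text is an assumption, not taken from the dataset):

```python
import re

line = "[estribillo] (x4) hola_mundo"  # made-up example line

# "A-z" also matches [ \ ] ^ _ ` because they sit between "Z" and "a" in ASCII.
wide = re.sub(r"[^a-zA-z\s]", "", line)
# "A-Z" keeps letters and whitespace only.
strict = re.sub(r"[^a-zA-Z\s]", "", line)

print(repr(wide))
print(repr(strict))
```

With the wide class, tags such as `[estribillo]` keep their brackets and end up as separate tokens from the plain word.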

## Quijote

In [3]:
quijote = ['']
previous_line = ''
with open("data/quijote.txt", encoding='UTF-8') as f:
    for line in f:
        if line == '\n' and previous_line!='\n':
            quijote.append('')
        else:
            quijote[-1] = quijote[-1] + clean(normalize(line))
        previous_line = line

# Tokenize each paragraph on whitespace.
quijote_corpus = [paragraph.split() for paragraph in quijote]

print(len(quijote_corpus))

5176
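
The loop above groups the lines of the book into paragraphs, opening a new paragraph at the first blank line after some text. Below is a minimal sketch of the same idea on an in-memory list of lines. The `split_paragraphs` name and the sample lines are illustrative assumptions, and this variant skips empty paragraphs by construction.

```python
def split_paragraphs(lines):
    """Group consecutive non-blank lines into paragraphs."""
    paragraphs = []
    current = []
    for line in lines:
        if line.strip():
            current.append(line.strip())
        elif current:
            # A blank line closes the paragraph in progress.
            paragraphs.append(" ".join(current))
            current = []
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs

# Hypothetical sample: two paragraphs separated by two blank lines.
sample = ["en un lugar\n", "de la mancha\n", "\n", "\n", "de cuyo nombre\n"]
corpus = [paragraph.split() for paragraph in split_paragraphs(sample)]
print(corpus)
```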


## Rap
These are rap songs in Spanish. It is not the best corpus, because some songs are empty and others are in English or Catalan. Some artists mix Spanish with English, and spelling mistakes are common.

In [4]:
songs = pd.read_csv("data/hhgroups_merge_28_05.csv", header=0)
rap_corpus = [[word for word in clean(normalize(line)).split()] for line in songs['letra']]
print(len(rap_corpus))

9325


## Noticias
A corpus of more than 500 news articles scraped from the newspaper El Mundo with the get_news.py script. The articles come from 8 different sections.
This corpus is too small to give reasonable results in the similar-words analysis, but it will be useful for clustering.

In [5]:
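# The raw string avoids an invalid-escape warning; a regex separator needs the python engine.
news = pd.read_csv("data/noticias.csv", sep=r'\|', engine='python', encoding='UTF-8').groupby('NEW_TITLE', as_index=False).max()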
news_corpus = [[word for word in clean(normalize(str(line))).split()] for line in news['NEW']]
print(len(news_corpus))

542


## All three together

In [6]:
join_corpus = quijote_corpus + rap_corpus + news_corpus
print(len(join_corpus))

15043


# Training the models

### Training parameters

In [15]:

# Dimensionality of the resulting word vectors.
# More dimensions can capture more detail, but they are more expensive
# to train and need more data to fit well.
num_features = 100

# Minimum word count threshold: rarer words are dropped from the vocabulary.
min_word_count = 10

# Number of worker threads; more workers train faster.
num_workers = multiprocessing.cpu_count()

# Context window length.
context_size = 7

# Downsampling threshold for frequent words (the `sample` argument).
downsampling = 1e-1

# Seed for the random number generator, to make the results reproducible.
# Note: fully deterministic runs also require a single worker thread.
seed = 1
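
To get a feel for the `sample` value, the sketch below reproduces the keep-probability formula from the original word2vec C implementation, which gensim follows as far as I know. Treat it as an approximation under that assumption, not as gensim's exact code. The `keep_probability` helper and the example frequencies are illustrative.

```python
import math

def keep_probability(word_freq, sample):
    """Probability of keeping one occurrence of a word.

    word_freq: share of the corpus taken up by the word (between 0 and 1).
    sample:    the downsampling threshold.
    """
    ratio = word_freq / sample
    return min(1.0, (math.sqrt(ratio) + 1) / ratio)

# A word that makes up 5% of the corpus, roughly a very common stop word.
for sample in (1e-1, 1e-3, 1e-5):
    print(sample, round(keep_probability(0.05, sample), 4))
```

Under this formula, `sample = 1e-1` only drops occurrences of words that account for more than about 26% of the corpus, so downsampling is effectively switched off with the value used here.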

## Rap

In [108]:
songs2vec = w2v.Word2Vec(
    sg=1,
    seed=seed,
    workers=num_workers,
    size=num_features,
    min_count=min_word_count,
    window=context_size,
    sample=downsampling
)

songs2vec.build_vocab(rap_corpus)
print (len(rap_corpus))

9325


In [109]:
import time
start_time = time.time()

songs2vec.train(rap_corpus, total_examples=songs2vec.corpus_count, epochs=20)

if not os.path.exists("trained"):
    os.makedirs("trained")

songs2vec.save(os.path.join("trained", "songs2vectors.w2v"))

print("--- %s seconds ---" % (time.time() - start_time))

--- 308.18955993652344 seconds ---


## Quijote

In [110]:
quijote2vec = w2v.Word2Vec(
    sg=1,
    seed=seed,
    workers=num_workers,
    size=num_features,
    min_count=min_word_count,
    window=context_size,
    sample=downsampling
)

quijote2vec.build_vocab(quijote_corpus)
print (len(quijote_corpus))

5176


In [111]:
import time
start_time = time.time()

quijote2vec.train(quijote_corpus, total_examples=quijote2vec.corpus_count, epochs=20)

if not os.path.exists("trained"):
    os.makedirs("trained")

quijote2vec.save(os.path.join("trained", "quijote2vectors.w2v"))

print("--- %s seconds ---" % (time.time() - start_time))

--- 28.54300093650818 seconds ---


## Noticias
In this case we use a min_count of 2 because the corpus is too small: discarding more words would leave too little vocabulary to compare against the other corpora.
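
As a rough illustration of what `min_count` does, word2vec drops every word whose total count in the corpus is below the threshold. The sketch below applies that rule to a hand-made toy corpus; the `vocabulary` helper and the sentences are assumptions for illustration only.

```python
from collections import Counter

def vocabulary(corpus, min_count):
    """Words that appear at least `min_count` times across all documents."""
    counts = Counter(word for document in corpus for word in document)
    return {word for word, n in counts.items() if n >= min_count}

# Hand-made toy corpus: "el" appears 4 times, "gobierno" twice, the rest once.
toy_corpus = [
    ["el", "gobierno", "aprueba", "el", "decreto"],
    ["el", "gobierno", "responde"],
    ["el", "equipo", "gana"],
]
print(sorted(vocabulary(toy_corpus, min_count=2)))
print(sorted(vocabulary(toy_corpus, min_count=4)))
```

On a small corpus most words are rare, so a high threshold removes most of the vocabulary.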

In [112]:
news2vec = w2v.Word2Vec(
    sg=1,
    seed=seed,
    workers=num_workers,
    size=num_features,
    min_count=2,
    window=context_size,
    sample=downsampling
)

news2vec.build_vocab(news_corpus)
print (len(news_corpus))

542


In [113]:
import time
start_time = time.time()

news2vec.train(news_corpus, total_examples=news2vec.corpus_count, epochs=20)

if not os.path.exists("trained"):
    os.makedirs("trained")

news2vec.save(os.path.join("trained", "news2vectors.w2v"))

print("--- %s seconds ---" % (time.time() - start_time))

--- 19.259528875350952 seconds ---


## All three together

In [114]:
join2vec = w2v.Word2Vec(
    sg=1,
    seed=seed,
    workers=num_workers,
    size=num_features,
    min_count=min_word_count,
    window=context_size,
    sample=downsampling
)

join2vec.build_vocab(join_corpus)
print (len(join_corpus))

15043


In [115]:
import time
start_time = time.time()

join2vec.train(join_corpus, total_examples=join2vec.corpus_count, epochs=20)

if not os.path.exists("trained"):
    os.makedirs("trained")

join2vec.save(os.path.join("trained", "join2vectors.w2v"))

print("--- %s seconds ---" % (time.time() - start_time))

--- 355.69666147232056 seconds ---


## Loading the models

In [7]:
songs2vec = w2v.Word2Vec.load(os.path.join("trained", "songs2vectors.w2v"))

In [8]:
quijote2vec = w2v.Word2Vec.load(os.path.join("trained", "quijote2vectors.w2v"))

In [9]:
join2vec = w2v.Word2Vec.load(os.path.join("trained", "join2vectors.w2v"))

In [10]:
news2vec = w2v.Word2Vec.load(os.path.join("trained", "news2vectors.w2v"))

# Comparing the results

## Similar words

### Amor

In [120]:
rap = [word[0] for word in songs2vec.wv.most_similar("amor")]
quijote = [word[0] for word in quijote2vec.wv.most_similar("amor")]
join = [word[0] for word in join2vec.wv.most_similar("amor")]
news = [word[0] for word in news2vec.wv.most_similar("amor")]
pd.DataFrame(list(zip(rap,quijote,join,news)), columns =['Rap', 'Quijote', 'Juntos', 'Noticias'])

Unnamed: 0,Rap,Quijote,Juntos,Noticias
0,dolor,premio,dolor,intentaba
1,odio,vergenza,odio,vivio
2,rencor,misericordia,rencor,espejo
3,carino,recato,desamor,tactica
4,verdadero,solicitud,platonico,homosexual
5,platonico,trato,carino,huido
6,desamor,alabanza,sexo,marcel
7,honor,nace,olvido,llaga
8,placer,quitarme,calor,amante
9,amistad,muestre,verdadero,novio
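
`most_similar` ranks the vocabulary by cosine similarity to the query word's vector. The sketch below shows that ranking in plain Python on hand-made 2-D vectors, so it runs without the trained models. The vectors and the local `most_similar` helper are illustrative assumptions, not values from the models above.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar(word, vectors, topn=2):
    """Rank every other word by cosine similarity to `word`."""
    scores = [(other, cosine(vectors[word], vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Hand-made 2-D vectors, purely illustrative.
toy_vectors = {
    "amor": [1.0, 0.2],
    "carino": [0.9, 0.3],
    "odio": [0.8, -0.4],
    "mesa": [-0.1, 1.0],
}
print(most_similar("amor", toy_vectors))
```

Cosine similarity measures shared context rather than shared meaning, which is consistent with antonyms such as "odio" ranking close to "amor" in the tables.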


### Iglesia

In [121]:
rap = [word[0] for word in songs2vec.wv.most_similar("iglesia")]
quijote = [word[0] for word in quijote2vec.wv.most_similar("iglesia")]
join = [word[0] for word in join2vec.wv.most_similar("iglesia")]
news = [word[0] for word in news2vec.wv.most_similar("iglesia")]
pd.DataFrame(list(zip(rap,quijote,join,news)), columns =['Rap', 'Quijote', 'Juntos', 'Noticias'])

Unnamed: 0,Rap,Quijote,Juntos,Noticias
0,biblia,santa,catolica,unanime
1,politica,hermandad,biblia,alimentacion
2,madre,riqueza,romana,ajo
3,hambruna,misericordia,justicia,contemporaneo
4,injusticia,honra,zarzuela,finca
5,romana,corte,politica,aplauso
6,impuesta,madre,inquisicion,conejo
7,economia,miseria,hacienda,cuadro
8,pobreza,dada,obediencia,juvenil
9,poblacion,hacienda,monarquia,luces


### Bella

In [122]:
rap = [word[0] for word in songs2vec.wv.most_similar("bella")]
quijote = [word[0] for word in quijote2vec.wv.most_similar("bella")]
join = [word[0] for word in join2vec.wv.most_similar("bella")]
news = [word[0] for word in news2vec.wv.most_similar("bella")]
pd.DataFrame(list(zip(rap,quijote,join,news)), columns =['Rap', 'Quijote', 'Juntos', 'Noticias'])

Unnamed: 0,Rap,Quijote,Juntos,Noticias
0,hermosa,criatura,hermosa,clasica
1,bonita,enemiga,bonita,claudia
2,preciosa,hermosa,linda,misteriosa
3,sencilla,amada,preciosa,turco
4,princesa,dichoso,dama,reflectante
5,discreta,belerma,pequena,esencialmente
6,linda,ingrata,fea,ternura
7,oscura,labradora,princesa,romantica
8,pequena,querida,mustia,traida
9,ella,cautiva,diosa,fisura


### Gobierno

In [123]:
rap = [word[0] for word in songs2vec.wv.most_similar("gobierno")]
quijote = [word[0] for word in quijote2vec.wv.most_similar("gobierno")]
join = [word[0] for word in join2vec.wv.most_similar("gobierno")]
news = [word[0] for word in news2vec.wv.most_similar("gobierno")]
pd.DataFrame(list(zip(rap,quijote,join,news)), columns =['Rap', 'Quijote', 'Juntos', 'Noticias'])

Unnamed: 0,Rap,Quijote,Juntos,Noticias
0,pueblo,insula,presidente,ejecutivo
1,sistema,gobernador,ejecutivo,sanchez
2,dictador,barataria,ministro,empeno
3,policial,prometida,cgpj,anticipado
4,ebola,condado,ministerio,racionamiento
5,politico,dia,sanidad,gobernador
6,capitalista,aviso,erc,socialcomunista
7,organizado,prometido,comunicado,pp
8,congreso,traves,pnv,hansjoachim
9,presidente,posesion,portavoz,criticado


The results of this analysis could be better, but for every query the models return words related to the target word, except for the news corpus, which is too small. The news corpus does work well for the word gobierno, because that word appears very often in current-affairs news.

## Odd word out

### perro gato pajaro armario

In [124]:
palabras = "perro gato pajaro armario".split()

rap = songs2vec.wv.doesnt_match(palabras)
quijote = quijote2vec.wv.doesnt_match(palabras)
join = join2vec.wv.doesnt_match(palabras)
news = news2vec.wv.doesnt_match(palabras)

print("Con el corpus de rap sobra: {}".format(rap))
print("Con el corpus de quijote sobra: {}".format(quijote))
print("Con el corpus de join sobra: {}".format(join))
print("Con el corpus de noticias sobra: {}".format(news))

Con el corpus de rap sobra: armario
Con el corpus de quijote sobra: gato
Con el corpus de join sobra: armario
Con el corpus de noticias sobra: pajaro


### cojin sofa mesa arbol

In [125]:
palabras = "cojin sofa mesa arbol".split()

rap = songs2vec.wv.doesnt_match(palabras)
quijote = quijote2vec.wv.doesnt_match(palabras)
join = join2vec.wv.doesnt_match(palabras)
news = news2vec.wv.doesnt_match(palabras)

print("Con el corpus de rap sobra: {}".format(rap))
print("Con el corpus de quijote sobra: {}".format(quijote))
print("Con el corpus de join sobra: {}".format(join))
print("Con el corpus de noticias sobra: {}".format(news))

Con el corpus de rap sobra: arbol
Con el corpus de quijote sobra: mesa
Con el corpus de join sobra: cojin
Con el corpus de noticias sobra: sofa
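
As I understand it, `doesnt_match` averages the unit vectors of the given words and returns the word least similar to that mean. The sketch below illustrates the technique in plain Python. The local `doesnt_match` helper and the 2-D vectors are hand-made assumptions, not gensim's code or the trained embeddings.

```python
import math

def unit(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def doesnt_match(words, vectors):
    """Return the word whose unit vector is least similar to the group mean."""
    units = {w: unit(vectors[w]) for w in words}
    dims = len(next(iter(units.values())))
    mean = unit([sum(units[w][i] for w in words) / len(words) for i in range(dims)])
    # For unit vectors the dot product equals the cosine similarity.
    return min(words, key=lambda w: sum(a * b for a, b in zip(units[w], mean)))

# Hand-made vectors: three "animals" point one way, the "furniture" another.
toy_vectors = {
    "perro": [1.0, 0.1],
    "gato": [0.9, 0.2],
    "pajaro": [0.8, 0.3],
    "armario": [0.1, 1.0],
}
print(doesnt_match(["perro", "gato", "pajaro", "armario"], toy_vectors))
```

In gensim, words missing from a model's vocabulary are, to my knowledge, skipped with a warning, which may partly explain the odd answers from the smaller corpora.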


## Rule of three (analogies)

### Hombre, Reina | Mujer --> ?

In [126]:
positive = ['hombre', 'reina']
negative = ['mujer']

rap = [word[0] for word in songs2vec.wv.most_similar(positive=positive, negative=negative)]
quijote = [word[0] for word in quijote2vec.wv.most_similar(positive=positive, negative=negative)]
join = [word[0] for word in join2vec.wv.most_similar(positive=positive, negative=negative)]
news = [word[0] for word in news2vec.wv.most_similar(positive=positive, negative=negative)]
pd.DataFrame(list(zip(rap,quijote,join,news)), columns =['Rap', 'Quijote', 'Juntos', 'Noticias'])
# Expected answer: "rey" (king)

Unnamed: 0,Rap,Quijote,Juntos,Noticias
0,rey,principe,rey,tuit
1,reino,roldan,imperio,colaborador
2,reyes,caballero,faraon,estomago
3,lobo,encantado,apodado,diputada
4,creador,feo,artifice,sello
5,tsunami,pintor,zar,difundir
6,estandarte,maestro,gladiador,acabar
7,relata,pierres,pacifico,anunciaban
8,yeaah,viene,relata,implacable
9,gladiador,soldado,poseidon,ayudo
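
The rule of three adds the positive word vectors, subtracts the negative ones, and ranks the remaining words by cosine similarity to the result. Below is a minimal plain-Python sketch of that technique. The `analogy` helper and the 2-D vectors (axis 0 roughly "royalty", axis 1 roughly "gender") are hand-made assumptions.

```python
import math

def unit(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def analogy(positive, negative, vectors):
    """Rank candidates by cosine to sum(positive) - sum(negative) of unit vectors."""
    dims = len(next(iter(vectors.values())))
    target = [0.0] * dims
    signed = [(w, 1.0) for w in positive] + [(w, -1.0) for w in negative]
    for word, sign in signed:
        for i, x in enumerate(unit(vectors[word])):
            target[i] += sign * x
    target = unit(target)

    def score(word):
        return sum(a * b for a, b in zip(unit(vectors[word]), target))

    # The query words themselves are excluded from the answer.
    candidates = [w for w in vectors if w not in positive and w not in negative]
    return sorted(candidates, key=score, reverse=True)

toy_vectors = {
    "hombre": [0.1, 1.0],
    "mujer": [0.1, -1.0],
    "rey": [1.0, 1.0],
    "reina": [1.0, -1.0],
    "mesa": [-1.0, 0.0],
}
print(analogy(["hombre", "reina"], ["mujer"], toy_vectors)[0])
```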


## Distance between words

In [127]:
def nearest_similarity_cosmul(vec, start1, end1, end2, prefix=''):
    similarities = vec.wv.most_similar_cosmul(
        positive=[end2, start1],
        negative=[end1]
    )
    start2 = similarities[0][0]
    print("{4}{0} es a {1}, lo que {2} es a {3}".format(start1, end1, start2, end2, prefix))

### "noche", "dia", "arriba"

In [128]:
start1 = "noche"
end1 = "dia"
end2 = "arriba"

nearest_similarity_cosmul(songs2vec, start1, end1, end2, 'Rap: ')
nearest_similarity_cosmul(quijote2vec, start1, end1, end2, 'Quijote: ')
nearest_similarity_cosmul(join2vec, start1, end1, end2, 'Juntos: ')
nearest_similarity_cosmul(news2vec, start1, end1, end2, 'Noticias: ')

Rap: noche es a dia, lo que cima es a arriba
Quijote: noche es a dia, lo que abajo es a arriba
Juntos: noche es a dia, lo que abajo es a arriba
Noticias: noche es a dia, lo que sara es a arriba
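
`most_similar_cosmul` uses the multiplicative combination objective (3CosMul) of Levy and Goldberg: cosines are shifted into the range 0 to 1, and the product over the positive words is divided by the product over the negative words. The sketch below is a plain-Python illustration on hand-made 3-D vectors. The `cosmul_score` helper, the vectors and the `eps` value are assumptions for illustration.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def cosmul_score(candidate, positive, negative, vectors, eps=1e-6):
    """3CosMul: product of shifted cosines to positives over product to negatives."""
    def shifted(word):
        # Map the cosine from [-1, 1] to [0, 1] so the products stay non-negative.
        return (1 + cosine(vectors[candidate], vectors[word])) / 2

    numerator = math.prod(shifted(w) for w in positive)
    denominator = math.prod(shifted(w) for w in negative) + eps
    return numerator / denominator

# Hand-made vectors; the axes are roughly (time, vertical, polarity).
toy_vectors = {
    "dia": [1.0, 0.0, 1.0],
    "noche": [1.0, 0.0, -1.0],
    "arriba": [0.0, 1.0, 1.0],
    "abajo": [0.0, 1.0, -1.0],
    "mesa": [1.0, 1.0, 0.0],
}
positive, negative = ["arriba", "noche"], ["dia"]
candidates = [w for w in toy_vectors if w not in positive + negative]
best = max(candidates, key=lambda w: cosmul_score(w, positive, negative, toy_vectors))
print(best)
```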


# Clustering

In [17]:
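import sklearn.preprocessing

def create_vector(row):
    # Sum the rap-model vectors of every in-vocabulary word in the text.
    vector_sum = 0
    words = row.lower().split()
    for word in words:
        try:
            vector_sum = vector_sum + songs2vec.wv[word]
        except KeyError:
            # Word not in the model's vocabulary: skip it.
            pass
    if isinstance(vector_sum, int):
        # No word was found, so the sum is still the initial integer 0.
        return "No words in vocab"
    else:
        vector_sum = vector_sum.reshape(1,-1)
    # Scale to unit L2 norm so that text length does not affect the vector.
    normalised_vector_sum = sklearn.preprocessing.normalize(vector_sum)

    return normalised_vector_sum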
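
The function above turns a text into a single vector by summing its word vectors and scaling the sum to unit length. Below is a minimal plain-Python sketch of the same idea on a toy lookup table. The `document_vector` helper and the vectors are illustrative assumptions, and returning `None` for texts with no known words is a variation on the string sentinel used above.

```python
import math

def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def document_vector(words, vectors):
    """Sum the vectors of known words and L2-normalize; None if no word is known."""
    known = [vectors[w] for w in words if w in vectors]
    if not known:
        return None
    summed = [sum(column) for column in zip(*known)]
    return l2_normalize(summed)

toy_vectors = {"rap": [3.0, 0.0], "calle": [0.0, 4.0]}  # hand-made, illustrative
print(document_vector("rap calle desconocida".split(), toy_vectors))
print(document_vector(["desconocida"], toy_vectors))
```

Because the result is normalized, summing and averaging give the same direction, so a long song and a short one with the same word mix map to the same point.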

## 2D song clustering

In [18]:
import time
start_time = time.time()

songs['song_vector'] = [clean(normalize(letra)) for letra in songs['letra']]
songs['song_vector'] = songs['song_vector'].apply(create_vector)
songs = songs[songs['song_vector'] != "No words in vocab"]

print("--- %s seconds ---" % (time.time() - start_time))

--- 35.560264348983765 seconds ---


### t-SNE and random song selection

In [247]:
song_vectors = []
from sklearn.model_selection import train_test_split

train, test = train_test_split(songs, test_size = 0.2)


for song_vector in train['song_vector']:
    song_vectors.append(song_vector)

train.head(10)

Unnamed: 0,id,artista,cancion,album,letra,anyo,visitas,song_vector
7731,1,Ceerre y Pablo Gareta,Concepto,Grey theory,[Ceerre]\nCon esta presión dentro desde pequeñ...,2015,1229,"[[-0.034195747, 0.09055399, 0.1346869, -0.1285..."
6922,17,Hard GZ,Monster,Nictofilia,"Cada día me acuerdo de cuando tenía 6 años,\nf...",2015,6032,"[[-0.033740018, 0.10971237, 0.14973374, -0.174..."
5278,23,Dante,Calma,"Sin álbum, es un vídeo suelto","Ahora voy a hablar, pa' dejar las cosas claras...",2017,8173,"[[-0.05179523, 0.103807546, 0.16787839, -0.209..."
6959,54,SFDK,Lo intento,Tesoros y caras B,"Tú di, tú di, jefe de la M, ?ok?,\nespecial de...",2010,1037,"[[-0.025074981, 0.09681175, 0.1696142, -0.1441..."
413,28,SFDK,Seguimos fuertes (con Jefe de la M),Siempre fuertes 2,[Estribillo]\nSuenan Jefe y SFDK\nsi no siente...,2009,3408,"[[-0.04357646, 0.09102699, 0.1567576, -0.14719..."
6333,33,Omar El Hachemi,Breaking news,Apolo 8,"El banco es el trono de ese borracho,\nLa barr...",2016,313,"[[-0.04542916, 0.1002035, 0.15871415, -0.16521..."
5579,49,Gordo Master,El paraiso nos rodea (con Raule),Freshkush,"[Gordo Master]\nCon mi música me siento libre,...",2017,531,"[[-0.020846864, 0.08475931, 0.106602, -0.13565..."
8286,6,Momo,Viento,Viento,[Estribillo] (x4)\nV-I-E-N-T-O.\nViento.\n\nEm...,2013,947,"[[-0.070750564, 0.0592482, 0.14344029, -0.2086..."
8326,46,Iván Nieto,Hambre,Mirlo blanco,"Cambio de chaqueta porque ahora sea moda,\nsi ...",2014,1885,"[[-0.042250436, 0.08705383, 0.14074238, -0.196..."
5981,11,Riot propaganda,Cambiarlo todo,Agenda oculta,"[Nega]\nA lomos de un Airbus, La Mazorca, Habe...",2017,4105,"[[-0.054960012, 0.10092814, 0.20124944, -0.160..."


In [283]:
# Stack the (1, num_features) song vectors into one (n_songs, num_features) matrix.
X = np.array(song_vectors).reshape(len(song_vectors), num_features)

start_time = time.time()
tsne = sklearn.manifold.TSNE(n_components=2, n_iter=10000, random_state=0, verbose=10)

all_word_vectors_matrix_2d = tsne.fit_transform(X)

print("--- %s seconds ---" % (time.time() - start_time))

[t-SNE] Computing 91 nearest neighbors...
[t-SNE] Indexed 7459 samples in 0.286s...
[t-SNE] Computed neighbors for 7459 samples in 20.559s...
[t-SNE] Computed conditional probabilities for sample 1000 / 7459
[t-SNE] Computed conditional probabilities for sample 2000 / 7459
[t-SNE] Computed conditional probabilities for sample 3000 / 7459
[t-SNE] Computed conditional probabilities for sample 4000 / 7459
[t-SNE] Computed conditional probabilities for sample 5000 / 7459
[t-SNE] Computed conditional probabilities for sample 6000 / 7459
[t-SNE] Computed conditional probabilities for sample 7000 / 7459
[t-SNE] Computed conditional probabilities for sample 7459 / 7459
[t-SNE] Mean sigma: 0.053211
[t-SNE] Computed conditional probabilities in 0.490s
[t-SNE] Iteration 50: error = 89.6782990, gradient norm = 0.0762581 (50 iterations in 6.507s)
[t-SNE] Iteration 100: error = 89.0150452, gradient norm = 0.0359127 (50 iterations in 4.103s)
[t-SNE] Iteration 150: error = 88.9644775, gradient norm = 

In [284]:
df=pd.DataFrame(all_word_vectors_matrix_2d,columns=['X','Y'])

df.head(10)

train.head()

df.reset_index(drop=True, inplace=True)
train.reset_index(drop=True, inplace=True)

Joining two dataframes to obtain each song's corresponding X,Y co-ordinate.

In [285]:
two_dimensional_songs = pd.concat([train, df], axis=1)

two_dimensional_songs.head()

Unnamed: 0,id,artista,cancion,album,letra,anyo,visitas,song_vector,X,Y
0,1,Ceerre y Pablo Gareta,Concepto,Grey theory,[Ceerre]\nCon esta presión dentro desde pequeñ...,2015,1229,"[[-0.034195747, 0.09055399, 0.1346869, -0.1285...",9.781878,-61.608582
1,17,Hard GZ,Monster,Nictofilia,"Cada día me acuerdo de cuando tenía 6 años,\nf...",2015,6032,"[[-0.033740018, 0.10971237, 0.14973374, -0.174...",40.164669,-14.228997
2,23,Dante,Calma,"Sin álbum, es un vídeo suelto","Ahora voy a hablar, pa' dejar las cosas claras...",2017,8173,"[[-0.05179523, 0.103807546, 0.16787839, -0.209...",-25.296196,65.582489
3,54,SFDK,Lo intento,Tesoros y caras B,"Tú di, tú di, jefe de la M, ?ok?,\nespecial de...",2010,1037,"[[-0.025074981, 0.09681175, 0.1696142, -0.1441...",-15.371625,-3.844587
4,28,SFDK,Seguimos fuertes (con Jefe de la M),Siempre fuertes 2,[Estribillo]\nSuenan Jefe y SFDK\nsi no siente...,2009,3408,"[[-0.04357646, 0.09102699, 0.1567576, -0.14719...",1.411984,-57.916382


**Plotting the results**

Using plotly, I plotted the results so that it becomes easier to explore similar songs based on their colors and clusters.

In [286]:
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot

init_notebook_mode(connected=True)

import plotly.graph_objs as go

trace1 = go.Scatter(
    y = two_dimensional_songs['Y'],
    x = two_dimensional_songs['X'],
    text = two_dimensional_songs['artista'],
    mode='markers',
    marker=dict(
        size= 5,#'7',
        color = two_dimensional_songs['artista'].astype('category').cat.codes, #set color equal to a variable
        colorscale='Viridis',
        showscale=True
    )
)
data = [trace1]

iplot(data)

### Since there are too many artists, plotly reuses colors and it is impossible to tell whether the clustering works, so we reduce the dataset to 5 artists

In [287]:
artists = ["SFDK", "Rayden", "Dendro", "Nach", "Dante"]
parcial_artists = two_dimensional_songs[two_dimensional_songs['artista'].isin(artists)]

from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot

init_notebook_mode(connected=True)

import plotly.graph_objs as go

trace1 = go.Scatter(
    y = parcial_artists['Y'],
    x = parcial_artists['X'],
    text = parcial_artists['artista'],
    mode='markers',
    marker=dict(
        size= 5,#'7',
        color = parcial_artists['artista'].astype('category').cat.codes, #set color equal to a variable
        colorscale='Viridis',
        showscale=True
    )
)
data = [trace1]

iplot(data)


#### As the figure shows, we cannot cleanly separate the 5 artists, but each one clearly has a completely different style, and the songs by Rayden and Nach are more varied than those by Dante or SFDK.

## Songs in 3D

### For the three-dimensional plot, the 5 artists above are plotted directly.

In [19]:
song_vectors = []
from sklearn.model_selection import train_test_split

artists = ["SFDK", "Rayden", "Dendro", "Nach", "Dante"]
parcial_artists = songs[songs['artista'].isin(artists)]

train, test = train_test_split(parcial_artists, test_size = 0.2)


for song_vector in train['song_vector']:
    song_vectors.append(song_vector)

train.head(10)

Unnamed: 0,id,artista,cancion,album,letra,anyo,visitas,song_vector
139,29,Nach,Cuando ya no esté (con Klau),Almanauta,[Nach]\n¿Cuánto tiempo me queda? Los años vuel...,2018,3823,"[[-0.045673825, 0.11685847, 0.18566772, -0.196..."
2667,27,SFDK,El perro anda suelto,Desde los chiqueros,"Hoy ya somos la pareja con compatibilidad, y v...",2000,3841,"[[-0.039533764, 0.07019628, 0.14477897, -0.155..."
7241,6,Rayden,El violinista del titanic,En alma y hueso,Tras el duro invier(_) siempre sale el sol de ...,2014,2447,"[[-0.033285853, 0.13485606, 0.1683339, -0.1798..."
6017,47,Rayden,Imperdible (con Sidecars),Antónimo,"Podré perder el norte,\nPodré perderme en dond...",2017,21931,"[[-0.030984491, 0.14549348, 0.19831865, -0.185..."
2613,28,Nach,Manifiesto,Un día en suburbia,"Mi padre es el sol, mi madre la luna\nmi herma...",2008,21397,"[[-0.053237442, 0.097117715, 0.16198772, -0.17..."
7244,9,Rayden,Controversia,En alma y hueso,"Embajadores del rap, puristas anclados atrás,\...",2014,4153,"[[-0.045818407, 0.099605605, 0.17925186, -0.18..."
7017,2,SFDK,"Digan lo que quieran (con Little Pepe, Jefe de...",Lista de invitados,[Estribillo: Little Pepe] (x2)\nDeja que hable...,2011,1251,"[[-0.039689604, 0.060751934, 0.18773203, -0.19..."
947,12,Nach,Los años luz (con Diana Feria),Un día en suburbia,"[Nach]\nYeah, un día en suburbia, esto esta de...",2008,5147,"[[-0.035146303, 0.10290157, 0.1688699, -0.2011..."
3641,11,Nach,Repaso mis pasos,Ars Magna / Miradas,"Todo empezó de manera sencilla,\nquería tener ...",2005,4724,"[[-0.010917659, 0.13445596, 0.15909608, -0.144..."
7797,12,Nach,El Hip Hop que sé,A través de mí,"El Hip Hop que sé no calla, habla fuerte y alt...",2015,6738,"[[-0.041176964, 0.13488114, 0.14056477, -0.167..."


In [20]:
# Stack the (1, num_features) song vectors into one (n_songs, num_features) matrix.
X = np.array(song_vectors).reshape(len(song_vectors), num_features)

start_time = time.time()
tsne = sklearn.manifold.TSNE(n_components=3, n_iter=10000, random_state=0, verbose=10)

all_word_vectors_matrix_3d = tsne.fit_transform(X)

print("--- %s seconds ---" % (time.time() - start_time))

[t-SNE] Computing 91 nearest neighbors...
[t-SNE] Indexed 292 samples in 0.006s...
[t-SNE] Computed neighbors for 292 samples in 0.028s...
[t-SNE] Computed conditional probabilities for sample 292 / 292
[t-SNE] Mean sigma: 0.081829
[t-SNE] Computed conditional probabilities in 0.017s
[t-SNE] Iteration 50: error = 108.0790558, gradient norm = 0.1826579 (50 iterations in 0.265s)
[t-SNE] Iteration 100: error = 129.8995514, gradient norm = 0.1182012 (50 iterations in 0.489s)
[t-SNE] Iteration 150: error = 139.3829651, gradient norm = 0.1434451 (50 iterations in 0.221s)
[t-SNE] Iteration 200: error = 148.7807617, gradient norm = 0.0894067 (50 iterations in 0.163s)
[t-SNE] Iteration 250: error = 150.9445496, gradient norm = 0.0879795 (50 iterations in 0.216s)
[t-SNE] KL divergence after 250 iterations with early exaggeration: 150.944550
[t-SNE] Iteration 300: error = 4.3100562, gradient norm = 0.0004980 (50 iterations in 0.167s)
[t-SNE] Iteration 350: error = 3.6914742, gradient norm = 0.000

In [21]:
df=pd.DataFrame(all_word_vectors_matrix_3d,columns=['X','Y', 'Z'])

df.head(10)

train.head()

df.reset_index(drop=True, inplace=True)
train.reset_index(drop=True, inplace=True)

In [22]:
three_dimensional_songs = pd.concat([train, df], axis=1)

three_dimensional_songs.head()

Unnamed: 0,id,artista,cancion,album,letra,anyo,visitas,song_vector,X,Y,Z
0,29,Nach,Cuando ya no esté (con Klau),Almanauta,[Nach]\n¿Cuánto tiempo me queda? Los años vuel...,2018,3823,"[[-0.045673825, 0.11685847, 0.18566772, -0.196...",-242.748611,-149.885117,-417.295929
1,27,SFDK,El perro anda suelto,Desde los chiqueros,"Hoy ya somos la pareja con compatibilidad, y v...",2000,3841,"[[-0.039533764, 0.07019628, 0.14477897, -0.155...",290.997314,-92.063995,320.812408
2,6,Rayden,El violinista del titanic,En alma y hueso,Tras el duro invier(_) siempre sale el sol de ...,2014,2447,"[[-0.033285853, 0.13485606, 0.1683339, -0.1798...",-199.720444,-297.669434,-229.7285
3,47,Rayden,Imperdible (con Sidecars),Antónimo,"Podré perder el norte,\nPodré perderme en dond...",2017,21931,"[[-0.030984491, 0.14549348, 0.19831865, -0.185...",-314.887756,-401.349518,-55.033806
4,28,Nach,Manifiesto,Un día en suburbia,"Mi padre es el sol, mi madre la luna\nmi herma...",2008,21397,"[[-0.053237442, 0.097117715, 0.16198772, -0.17...",-213.315598,207.021698,71.046303


In [23]:
import plotly.express as px

fig = px.scatter_3d(three_dimensional_songs, x='X', y='Y', z='Z',
              color='artista')
fig.show()

### In this case the algorithm separates the rappers better

## News in 2D
Here the goal of the clustering is to check whether we can tell which section a news article belongs to.

In [24]:
import time
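# The raw string avoids an invalid-escape warning; a regex separator needs the python engine.
news = pd.read_csv("songlyrics/noticias.csv", sep=r'\|', engine='python', encoding='UTF-8').groupby('NEW_TITLE', as_index=False).max()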

start_time = time.time()

news['news_vector'] = [clean(normalize(str(new))) for new in news['NEW']]
news['news_vector'] = news['news_vector'].apply(create_vector)
news = news[news['news_vector'] != "No words in vocab"]

print("--- %s seconds ---" % (time.time() - start_time))



In [26]:
news_vectors = []
from sklearn.model_selection import train_test_split

train, test = train_test_split(news, test_size = 0.2)


for news_vector in train['news_vector']:
    news_vectors.append(news_vector)

train.head(5)

Unnamed: 0,NEW_TITLE,SECTION,SECTION_URL,NEW_URL,NEW,news_vector
526,"¡Pedro, Pedro, encadénanos!",opinion,https://www.elmundo.es/opinion.html?intcmp=MEN...,https://www.elmundo.es/opinion/2020/04/15/5e97...,Al final el pobre Sánchez no va a tener más re...,"[[-0.018448656, 0.101849996, 0.18082964, -0.16..."
375,MasterChef 8 y la rebelión que acabó en una cr...,television,https://www.elmundo.es/television.html?intcmp=...,https://www.elmundo.es/television/momentvs/202...,"""Cuenta la leyenda que allá por el 2020, en pl...","[[-0.037350487, 0.107858576, 0.19142526, -0.17..."
184,El Gobierno gana poder en el deporte y lleva a...,deportes,https://www.elmundo.es/deportes.html?intcmp=ME...,https://www.elmundo.es/deportes/futbol/primera...,"""Los pactos alcanzados en la reunión entre Ire...","[[-0.040540792, 0.10750827, 0.17678349, -0.155..."
391,"Muere José María Castillo, productor ejecutivo...",television,https://www.elmundo.es/television.html?intcmp=...,https://www.elmundo.es/television/2020/04/12/5...,"""Este domingo ha fallecido José María Castillo...","[[0.0018487575, 0.11295477, 0.14102325, -0.147..."
188,El Gobierno quita el límite de suspensos que f...,espana,https://www.elmundo.es/espana.html?intcmp=MENU...,https://www.elmundo.es/espana/2020/04/24/5ea2b...,"""El Gobierno ha quitado """"de forma excepcional...","[[-0.062308613, 0.13607141, 0.18420309, -0.187..."


In [33]:
# Stack the (1, num_features) news vectors into one (n_articles, num_features) matrix.
X = np.array(news_vectors).reshape(len(news_vectors), num_features)

start_time = time.time()
tsne = sklearn.manifold.TSNE(n_components=2, n_iter=30000, random_state=0, verbose=10)

all_word_vectors_matrix_2d = tsne.fit_transform(X)

print("--- %s seconds ---" % (time.time() - start_time))

[t-SNE] Computing 91 nearest neighbors...
[t-SNE] Indexed 423 samples in 0.006s...
[t-SNE] Computed neighbors for 423 samples in 0.051s...
[t-SNE] Computed conditional probabilities for sample 423 / 423
[t-SNE] Mean sigma: 0.059898
[t-SNE] Computed conditional probabilities in 0.024s
[t-SNE] Iteration 50: error = 82.0115204, gradient norm = 0.4252600 (50 iterations in 0.139s)
[t-SNE] Iteration 100: error = 86.1990051, gradient norm = 0.4118958 (50 iterations in 0.150s)
[t-SNE] Iteration 150: error = 87.2386246, gradient norm = 0.3672943 (50 iterations in 0.146s)
[t-SNE] Iteration 200: error = 85.1166763, gradient norm = 0.4023973 (50 iterations in 0.139s)
[t-SNE] Iteration 250: error = 84.7281036, gradient norm = 0.3776489 (50 iterations in 0.150s)
[t-SNE] KL divergence after 250 iterations with early exaggeration: 84.728104
[t-SNE] Iteration 300: error = 1.6459476, gradient norm = 0.0054623 (50 iterations in 0.118s)
[t-SNE] Iteration 350: error = 1.4961488, gradient norm = 0.0020255 (

In [34]:
df=pd.DataFrame(all_word_vectors_matrix_2d,columns=['X','Y'])

df.head(10)

train.head()

df.reset_index(drop=True, inplace=True)
train.reset_index(drop=True, inplace=True)

In [35]:
two_dimensional_songs = pd.concat([train, df], axis=1)

two_dimensional_songs.head()

Unnamed: 0,NEW_TITLE,SECTION,SECTION_URL,NEW_URL,NEW,news_vector,X,Y
0,"¡Pedro, Pedro, encadénanos!",opinion,https://www.elmundo.es/opinion.html?intcmp=MEN...,https://www.elmundo.es/opinion/2020/04/15/5e97...,Al final el pobre Sánchez no va a tener más re...,"[[-0.018448656, 0.101849996, 0.18082964, -0.16...",-15.398737,-1.115017
1,MasterChef 8 y la rebelión que acabó en una cr...,television,https://www.elmundo.es/television.html?intcmp=...,https://www.elmundo.es/television/momentvs/202...,"""Cuenta la leyenda que allá por el 2020, en pl...","[[-0.037350487, 0.107858576, 0.19142526, -0.17...",10.991641,-5.949742
2,El Gobierno gana poder en el deporte y lleva a...,deportes,https://www.elmundo.es/deportes.html?intcmp=ME...,https://www.elmundo.es/deportes/futbol/primera...,"""Los pactos alcanzados en la reunión entre Ire...","[[-0.040540792, 0.10750827, 0.17678349, -0.155...",-6.96582,-3.349588
3,"Muere José María Castillo, productor ejecutivo...",television,https://www.elmundo.es/television.html?intcmp=...,https://www.elmundo.es/television/2020/04/12/5...,"""Este domingo ha fallecido José María Castillo...","[[0.0018487575, 0.11295477, 0.14102325, -0.147...",9.949397,6.432118
4,El Gobierno quita el límite de suspensos que f...,espana,https://www.elmundo.es/espana.html?intcmp=MENU...,https://www.elmundo.es/espana/2020/04/24/5ea2b...,"""El Gobierno ha quitado """"de forma excepcional...","[[-0.062308613, 0.13607141, 0.18420309, -0.187...",-2.611839,-2.950437


In [42]:
import plotly.express as px

# Color each article by its section to see whether t-SNE places sections together.
fig = px.scatter(two_dimensional_songs, x="X", y="Y", color="SECTION",
                 title="Press sections")

fig.show()

#### In this case the algorithm cannot cleanly separate the 8 press sections, but it is clear that articles within many sections use very similar language. All technology articles fall in the lower part of the plot, and culture takes up almost the entire upper-right quadrant. Other sections, such as science and health, are scattered across all quadrants, while television splits into two groups.
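
The quadrant reading above is done by eye. A minimal sketch of how it could be checked numerically follows, assuming a DataFrame with the same `X`, `Y` and `SECTION` columns as `two_dimensional_songs`. The helper `quadrant_shares` and the toy coordinates are hypothetical and are not part of the original notebook.

```python
import numpy as np
import pandas as pd

def quadrant_shares(frame, x="X", y="Y", label="SECTION"):
    """Return, for each label, the fraction of its points in each quadrant."""
    # Name each quadrant from the signs of the two coordinates.
    vertical = np.where(frame[y] >= 0, "upper", "lower")
    horizontal = np.where(frame[x] >= 0, "right", "left")
    quadrant = pd.Series(
        [f"{v}-{h}" for v, h in zip(vertical, horizontal)],
        index=frame.index,
        name="quadrant",
    )
    # normalize="index" makes every row (one section) sum to 1.
    return pd.crosstab(frame[label], quadrant, normalize="index")

# Toy stand-in for two_dimensional_songs (made-up coordinates, not real results).
toy = pd.DataFrame({
    "X": [1.0, 2.0, -1.0, -2.0, 3.0, -3.0],
    "Y": [-1.0, -2.0, -3.0, -4.0, 5.0, 6.0],
    "SECTION": ["tecnologia"] * 4 + ["cultura"] * 2,
})
shares = quadrant_shares(toy)
print(shares)
```

In the notebook itself the call would presumably be `quadrant_shares(two_dimensional_songs)`. A section whose row is dominated by one or two quadrants matches the visual impression of a compact group, while a row with similar values in every column corresponds to a dispersed section.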

In [43]:
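import time

start_time = time.time()
# Project the document vectors onto 3 t-SNE components with a generous iteration budget
# (recent scikit-learn releases call this parameter max_iter instead of n_iter).
tsne = sklearn.manifold.TSNE(n_components=3, n_iter=30000, random_state=0, verbose=10)

all_word_vectors_matrix_3d = tsne.fit_transform(X)

print("--- %s seconds ---" % (time.time() - start_time))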

[t-SNE] Iteration 19700: error = 1.6363192, gradient norm = 0.0000002 (50 iterations in 0.304s)
[t-SNE] Iteration 19750: error = 1.6359529, gradient norm = 0.0000002 (50 iterations in 0.275s)
[t-SNE] Iteration 19800: error = 1.6355671, gradient norm = 0.0000002 (50 iterations in 0.320s)
[t-SNE] Iteration 19850: error = 1.6352118, gradient norm = 0.0000002 (50 iterations in 0.382s)
[t-SNE] Iteration 19900: error = 1.6348723, gradient norm = 0.0000002 (50 iterations in 0.309s)
[t-SNE] Iteration 19950: error = 1.6345243, gradient norm = 0.0000002 (50 iterations in 0.292s)
[t-SNE] Iteration 20000: error = 1.6342288, gradient norm = 0.0000002 (50 iterations in 0.279s)
[t-SNE] Iteration 20050: error = 1.6338726, gradient norm = 0.0000002 (50 iterations in 0.362s)
[t-SNE] Iteration 20100: error = 1.6334019, gradient norm = 0.0000002 (50 iterations in 0.350s)
[t-SNE] Iteration 20150: error = 1.6330438, gradient norm = 0.0000002 (50 iterations in 0.327s)

In [47]:
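# Wrap the 3D t-SNE coordinates in a DataFrame so they can be joined to the news metadata.
df = pd.DataFrame(all_word_vectors_matrix_3d, columns=['X', 'Y', 'Z'])

# Reset both indexes so that pd.concat(axis=1) lines the rows up by position.
df.reset_index(drop=True, inplace=True)
train.reset_index(drop=True, inplace=True)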

In [48]:
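# Join metadata and 3D coordinates column-wise. Despite the variable name, each row is a news article.
three_dimensional_songs = pd.concat([train, df], axis=1)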

three_dimensional_songs.head()

Unnamed: 0,NEW_TITLE,SECTION,SECTION_URL,NEW_URL,NEW,news_vector,X,Y,Z
0,"¡Pedro, Pedro, encadénanos!",opinion,https://www.elmundo.es/opinion.html?intcmp=MEN...,https://www.elmundo.es/opinion/2020/04/15/5e97...,Al final el pobre Sánchez no va a tener más re...,"[[-0.018448656, 0.101849996, 0.18082964, -0.16...",-83.147209,1204.362061,-931.738831
1,MasterChef 8 y la rebelión que acabó en una cr...,television,https://www.elmundo.es/television.html?intcmp=...,https://www.elmundo.es/television/momentvs/202...,"""Cuenta la leyenda que allá por el 2020, en pl...","[[-0.037350487, 0.107858576, 0.19142526, -0.17...",1095.219971,-173.9431,440.944824
2,El Gobierno gana poder en el deporte y lleva a...,deportes,https://www.elmundo.es/deportes.html?intcmp=ME...,https://www.elmundo.es/deportes/futbol/primera...,"""Los pactos alcanzados en la reunión entre Ire...","[[-0.040540792, 0.10750827, 0.17678349, -0.155...",118.84864,613.045654,-519.76825
3,"Muere José María Castillo, productor ejecutivo...",television,https://www.elmundo.es/television.html?intcmp=...,https://www.elmundo.es/television/2020/04/12/5...,"""Este domingo ha fallecido José María Castillo...","[[0.0018487575, 0.11295477, 0.14102325, -0.147...",1025.276245,-439.361176,-928.734009
4,El Gobierno quita el límite de suspensos que f...,espana,https://www.elmundo.es/espana.html?intcmp=MENU...,https://www.elmundo.es/espana/2020/04/24/5ea2b...,"""El Gobierno ha quitado """"de forma excepcional...","[[-0.062308613, 0.13607141, 0.18420309, -0.187...",-136.912872,350.793976,99.948891


In [49]:
import plotly.express as px

# 3D scatter plot of the articles, colored by section.
fig = px.scatter_3d(three_dimensional_songs, x='X', y='Y', z='Z',
                    color='SECTION')
fig.show()

#### The 3D analysis has paid off: more sections now form groups, such as international, culture, Spain, sports and technology. As in the previous cases, they are not cleanly separated, but articles belonging to the same section lie close to one another.
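
One way to put a number on "articles of the same section lie close to one another" is to check how often the nearest neighbours of an article share its section. The sketch below is an assumption-laden illustration and not part of the original analysis: `neighbor_purity` is a hypothetical helper, the toy points are made up, and the brute-force distance matrix is only reasonable for a few hundred articles.

```python
import numpy as np

def neighbor_purity(points, labels, k=3):
    """Mean fraction of each point's k nearest neighbours that share its label."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    # Pairwise Euclidean distances between all points.
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    # A point must not count as its own neighbour.
    np.fill_diagonal(dists, np.inf)
    nearest = np.argsort(dists, axis=1)[:, :k]
    same_label = labels[nearest] == labels[:, None]
    return same_label.mean()

# Toy stand-in for the X, Y, Z columns: two tight, well-separated groups.
toy_points = [
    [0, 0, 0], [1, 0, 0], [0, 1, 0],
    [100, 100, 100], [101, 100, 100], [100, 101, 100],
]
toy_labels = ["deportes"] * 3 + ["cultura"] * 3
purity = neighbor_purity(toy_points, toy_labels, k=2)
print(purity)
```

On the notebook data the inputs would presumably be `three_dimensional_songs[['X', 'Y', 'Z']].to_numpy()` and `three_dimensional_songs['SECTION'].to_numpy()`. Values near 1 indicate that sections occupy their own neighbourhoods, and running the same function on the 2D coordinates would give a rough comparison between the two projections. Since t-SNE mainly preserves local structure, a neighbour-based measure is arguably a better fit here than one based on distances between far-apart groups.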