<img src="images/header.png" alt="Logo UCLM-ESII" align="right">

<br><br><br><br>
<h2><font color="#92002A" size=4>Trabajo Fin de Grado</font></h2>

<h1><font color="#6B001F" size=5>Generación automática de playlist de canciones <br> mediante técnicas de minería de datos</font></h1>
<h2><font color="#92002A" size=3>Parte 8 - Preprocesamiento 1</font></h2>

<br>
<div style="text-align: right">
    <font color="#B20033" size=3><strong>Autor</strong>: <em>Miguel Ángel Cantero Víllora</em></font><br>
    <br>
    <font color="#B20033" size=3><strong>Directores</strong>: <em>José Antonio Gámez Martín</em></font><br>
    <font color="#B20033" size=3><em>Juan Ángel Aledo Sánchez</em></font><br>
    <br>
<font color="#B20033" size=3>Grado en Ingeniería Informática</font><br>
<font color="#B20033" size=2>Escuela Superior de Ingeniería Informática | Universidad de Castilla-La Mancha</font>

</div>

---

<br>


<a id="indice"></a>
<h2><font color="#92002A" size=5>Índice</font></h2>

<br>

* [1. Introducción](#section1)
* [2. Cambio de formato del DataSet](#section2)
* [3. Modificaciones adicionales en DataSet](#section3)
* [4. Creación de características de playlist](#section4)

<br>

---

In [1]:
# Permite establecer la anchura de la celda
from IPython.core.display import display, HTML
display(HTML("<style>.container { width:95% !important; }</style>"))

In [2]:
#!pip install nltk

In [3]:
import csv
import emoji
import json
import os
import pandas as pd
import re
import string

from collections import defaultdict
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
from tqdm.notebook import tqdm as tqdm_nb
from zipfile import ZipFile

In [4]:
# Ejecutar si es la primera vez que se emplea nltk

#import nltk
#nltk.download('punkt') 
#nltk.download('stopwords')

In [5]:
MPD_PATH = 'MPD'
MPD_TEST_PATH = 'MPD_TEST'
MPD_CSV_PATH = 'MPD_CSV'
MPD_SLICE_PREFIX = 'mpd.slice.'

ALBUMS_FILE = os.path.join(MPD_CSV_PATH,'mpd.albums.csv')
ARTISTS_FILE = os.path.join(MPD_CSV_PATH,'mpd.artists.csv')
TRACKS_FILE = os.path.join(MPD_CSV_PATH,'mpd.tracks.csv')
PLSTRS_FILE = os.path.join(MPD_CSV_PATH,'mpd.pls-tracks.csv')
PLSTARTS_FILE = os.path.join(MPD_CSV_PATH,'mpd.pls-artists.csv')
PLSINFO_FILE = os.path.join(MPD_CSV_PATH,'mpd.playlists-info.csv')
PLSTESTINFO_FILE = os.path.join(MPD_CSV_PATH,'mpd.playlists-info-test.csv')
NAMES_FILE = os.path.join(MPD_CSV_PATH, "mpd.names.csv")
PLSTRS_PREFIX = 'mpd.playlists-tracks.'
EMOJI_DICT_FILE = os.path.join(MPD_CSV_PATH, "emojis_translation_dict.json")

---

<br>


<a id="section1"></a>
## <font color="#92002A">1 - Introducción</font>
<br>

En esta libreta vamos a convertir el *dataset* que hemos obtenido previamente, en formato *JSON*, a formato *CSV*. También vamos a generar las características de las playlists para usar en el modelo de *LightFM* que crearemos más adelante.

<div style="text-align: right">
<a href="#indice"><font size=5><i class="fa fa-arrow-circle-up" aria-hidden="true" style="color:#92002A"></i></font></a>
</div>

---

<a id="section2"></a>
## <font color="#92002A">2 - Cambio de formato del DataSet</font>
<br>

Cuando creamos los ficheros que contienen el dataset con el conjunto de playlists, empleamos el formato JSON y lo almacenamos en varios archivos comprimidos debido a su elevado tamaño: 7,5 GB (43,3 GB sin comprimir). También nos encontramos con el caso de que tenemos la información repetida de las pistas, puesto que cada playlist contiene toda la información completa de éstas.


Para solucionar estos problemas y poder trabajar mejor con los datos, vamos a convertirlos al formato CSV, con lo que tendremos la información repartida en varias tablas y evitaremos tener información repetida de forma innecesaria. Se ha creado un fichero/tabla para cada tipo de item del conjunto:

*	Álbumes.
*	Artistas.
*	Pistas.
*	Información sobre las playlists.
*	Información sobre las playlists del conjunto de prueba.
*	Lista de pistas de las playlists (para ambos conjuntos).

Adicionalmente, hemos creado una tabla que contiene la lista de artistas de las playlists, junto al número de veces que aparece cada artista en ellas.


<br>

---

<a id="section21"></a>
### <font color="#B20033">2.1 - Definición de funciones adicionales</font>
<br>
<br>

In [6]:
# Función que nos permite leer un archivo .json comprimido o sin comprimir
# y devuelve un diccionario con su contenido
def json_file_reader(file_path):
    """
    :param file_path: Ruta del fichero a leer.
    :return results: Diccionario con los datos leidos del fichero JSON.
    """
    _ , file_extension = os.path.splitext(file_path)

    # Fichero comprimido
    if file_extension == '.zip':
        with ZipFile(file_path,'r') as zip_file:
            with zip_file.open(zip_file.namelist()[0]) as json_file:
                json_data = json.load(json_file)
    # Fichero sin comprimir
    elif file_extension == '.json':
        with open(file_path, "r") as json_file:
            json_data = json.load(json_file)            
    # En caso de que sea otra extensión, devolvemos un diccionario vacío
    else:
        json_data = {}            
    
    return json_data

In [7]:
# Función que se encarga de convertir el dataset que contiene 
# 1 millón de playlists a formato CSV
def jsonds_to_csvds(json_ds_path,csv_ds_path):
    """
    :param json_ds_path: Ruta donde se encuentra el conjunto de datos en formato JSON.
    :param csv_ds_path: Ruta donde almacenaremos el nuevo conjunto de datos en formato CSV.
    """
    if not os.path.isdir(csv_ds_path):
        os.mkdir(csv_ds_path)    
    
    files = []
    tracks_dict = defaultdict(dict)

    for file in os.listdir(json_ds_path):
        if file.startswith(MPD_SLICE_PREFIX):
            files.append(os.path.join(json_ds_path,file))

    plstrs_fieldnames = ['pid','pos','track_uri']
    tracks_fieldnames = ['track_name', 'track_uri', 'duration_ms', 'artist_name', 
                         'artist_uri', 'album_name', 'album_uri']
    plsinfo_fieldnames = ['pid','name','collaborative','modified_at',
                          'num_albums','num_tracks', 'num_followers',
                          'num_edits','duration_ms','num_artists']

    with open(PLSINFO_FILE,'w',newline='') as csv_file:
        writer = csv.DictWriter(csv_file, fieldnames=plsinfo_fieldnames,delimiter=',')
        writer.writeheader()

    for file in tqdm_nb(files):
        file_name , _ = os.path.splitext(file)
        portion = file_name.split('.')[-1]
        csv_pltrs_file = os.path.join(csv_ds_path, f"{PLSTRS_PREFIX}{portion}.csv")
        row_list = []

        with open(PLSINFO_FILE,'a',encoding='utf8',newline='') as csv_file:
            writer = csv.DictWriter(csv_file, fieldnames=plsinfo_fieldnames,
                                    delimiter=',',quoting=csv.QUOTE_MINIMAL)
            for pl in json_file_reader(file)['playlists']:
                tracks_list = pl.pop('tracks')
                writer.writerow(pl)
                for track in tracks_list:
                    pos = track.pop('pos')
                    row = {'pid': pl['pid'], 'pos': pos,
                           'track_uri' : track['track_uri']}
                    row_list.append(row)
                    tracks_dict[track['track_uri']] = track

        with open(csv_pltrs_file,'w',newline='') as csv_tracks_file:
            writer_tracks = csv.DictWriter(csv_tracks_file, fieldnames=plstrs_fieldnames)
            writer_tracks.writeheader()
            for row in row_list:
                writer_tracks.writerow(row)
    
    print("Volcado de información sobre pistas:")
    with open(TRACKS_FILE,'w',newline='', encoding='utf8') as csv_tracks_file:
        writer_tracks = csv.DictWriter(csv_tracks_file, fieldnames=tracks_fieldnames)
        writer_tracks.writeheader()
        pbar = tqdm_nb(total=len(tracks_dict))
        for track in tracks_dict.values():
            writer_tracks.writerow(track)
            pbar.update(1)

In [8]:
# Función que se encarga de convertir el conjunto de datos (test)
# a formato CSV
def jsonds_to_csvds_test(json_ds_path,csv_ds_path):
    """
    :param json_ds_path: Ruta donde se encuentra el conjunto de datos en formato JSON.
    :param csv_ds_path: Ruta donde almacenaremos el nuevo conjunto de datos en formato CSV.
    """
    if not os.path.isdir(csv_ds_path):
        os.mkdir(csv_ds_path)

    file_name = "mpd.test.zip"
    
    if file_name in os.listdir(json_ds_path):
        plstrs_fieldnames = ['pid','pos','track_uri']
        plsinfo_fieldnames = ['pid','name','num_holdouts','num_samples',
                              'num_tracks']
        row_list = []
        
        with open(PLSTESTINFO_FILE, 'w',newline='',encoding='utf8') as csv_file:
            writer = csv.DictWriter(csv_file, fieldnames=plsinfo_fieldnames,
                                    delimiter=',',quoting=csv.QUOTE_MINIMAL)
            writer.writeheader()
            test_playlists = json_file_reader(os.path.join(json_ds_path,file_name))['playlists']
            for pl in test_playlists:
                tracks_list = pl.pop('tracks')
                writer.writerow(pl)
                
                for track in tracks_list:
                    pos = track.pop('pos')
                    row = {'pid': pl['pid'], 'pos': pos,
                           'track_uri' : track['track_uri']}
                    row_list.append(row)
        
        plstrs_test_path = os.path.join(MPD_CSV_PATH,"{}test.csv".format(PLSTRS_PREFIX))
        with open(plstrs_test_path, 'w', newline='', encoding='utf8') as csv_tracks_file:
            writer_tracks = csv.DictWriter(csv_tracks_file, fieldnames=plstrs_fieldnames)
            writer_tracks.writeheader()
            for row in row_list:
                writer_tracks.writerow(row)    
    else:
        print("ERROR: Fichero del conjunto de prueba no encontrado")

In [9]:
jsonds_to_csvds(MPD_PATH,MPD_CSV_PATH)

In [10]:
jsonds_to_csvds_test(MPD_TEST_PATH,MPD_CSV_PATH)

Tras este proceso el conjunto queda convertido a formato *CSV*.

<div style="text-align: right">
<a href="#indice"><font size=5><i class="fa fa-arrow-circle-up" aria-hidden="true" style="color:#92002A"></i></font></a>
</div>

---

<a id="section3"></a>
## <font color="#92002A">3 - Modificaciones adicionales en DataSet</font>
<br>

Como los identificadores Spotify contienen unos “prefijos” que resultan únicos en el caso de los álbumes, artistas y pistas, los eliminamos (ya que en caso de ser necesario podemos volver a añadirlos). Ejemplo:

_spotify:album:**0MlTOiC5ZYKFGeZ8h3D4rd**_ => *0MlTOiC5ZYKFGeZ8h3D4rd*


Para que las tablas sean más fáciles de entender y ocupen menos tamaño, hemos establecido nuestro propio identificador, al que llamaremos PID, para álbumes, pistas y artistas. De esta manera las relaciones entre los elementos de las tablas, que hemos almacenado en ficheros CSV, se hacen empleando estos identificadores, y no por el código alfanumérico de Spotify


<br>

In [11]:
df_tracks = pd.read_csv(TRACKS_FILE)
df_tracks.head(3)

Unnamed: 0,track_name,track_uri,duration_ms,artist_name,artist_uri,album_name,album_uri
0,No Control,spotify:track:3G0MlTNw2AX3Xdbhrj33OS,237306,Holland,spotify:artist:21KpdYIquYJUEiEcPO2cnI,No Control,spotify:album:5F9BLbIkEvApFC8flNVKYR
1,City Lights,spotify:track:3R9H16eUSv5vJ9DEgMG2Lu,369693,Holland,spotify:artist:21KpdYIquYJUEiEcPO2cnI,No Control,spotify:album:5F9BLbIkEvApFC8flNVKYR
2,Sunshine,spotify:track:1YgaIKDgm7IQpq9MVAmExQ,252066,Keane,spotify:artist:53A0W3U0s8diEn9RhXQhVz,Hopes And Fears,spotify:album:0MlTOiC5ZYKFGeZ8h3D4rd


In [12]:
df_artists = df_tracks[['artist_name', 'artist_uri']].copy()
df_artists.drop_duplicates(subset=['artist_uri'],inplace=True)
df_artists.reset_index(drop=True, inplace=True)
df_artists.index.name = 'artist_pid'
df_artists['artist_id'] = df_artists['artist_uri'].apply(lambda x: x.split(':')[-1])

In [13]:
df_albums = df_tracks[['album_name', 'album_uri', 'artist_uri']].copy()
df_albums.drop_duplicates(subset=['album_uri'],inplace=True)
df_albums.reset_index(drop=True, inplace=True)
df_albums.index.name = 'album_pid'

In [14]:
df_tracks.drop(columns=['artist_name','album_name'],inplace=True)
df_tracks.drop_duplicates(subset='track_uri',inplace=True)
df_tracks.reset_index(drop=True, inplace=True)

In [15]:
df_tracks = pd.merge(df_tracks, df_artists.reset_index()[['artist_pid','artist_uri']], on=['artist_uri']) \
                .drop(columns=['artist_uri'])
df_tracks = pd.merge(df_tracks, df_albums.reset_index()[['album_pid','album_uri']], on=['album_uri']) \
                .drop(columns=['album_uri'])

df_tracks.index.name = 'track_pid'

df_tracks['track_id'] = df_tracks['track_uri'].apply(lambda x: x.split(':')[-1])

In [16]:
df_albums = pd.merge(df_albums, df_artists.reset_index()[['artist_pid','artist_uri']], on=['artist_uri']) \
                .drop(columns=['artist_uri'])
df_albums.index.name = 'album_pid'

df_albums['album_id'] = df_albums['album_uri'].apply(lambda x: x.split(':')[-1])
df_albums.drop(columns=['album_uri'], inplace = True)
df_artists.drop(columns=['artist_uri'], inplace = True)

In [17]:
# Función empleada para leer un dataframe que ha sido almacenado
# en varios ficheros
def read_dataset_multifile(ds_prefix, folder=os.curdir):
    """
    :param ds_prefix: Prefijo de los ficheros a leer.
    :param folder: Directorio donde se encuentran los ficheros.
    :return: Dataframe resultante de leer los ficheros.
    """
    list_df = []
    
    for file_name in os.listdir(folder):
        if file_name.startswith(ds_prefix):
            file_path = os.path.join(folder, file_name)
            df_temp = pd.read_csv(file_path)
            list_df.append(df_temp)
            
    return pd.concat(list_df, axis=0, ignore_index=True)

In [18]:
df_plstrs = read_dataset_multifile(PLSTRS_PREFIX,MPD_CSV_PATH)

trackid_map_dict = df_tracks[['track_uri']].to_dict()['track_uri']
trackid_map_dict = {v: k for k, v in trackid_map_dict.items()}

df_plstrs["track_pid"] = df_plstrs["track_uri"].map(trackid_map_dict)
df_plstrs.drop(columns=['track_uri'], inplace=True)
df_tracks.drop(columns=['track_uri'], inplace=True)
df_plstrs.sort_values(['pid', 'pos'],inplace=True)
df_plstrs.set_index('pid',inplace=True)
df_plstrs.index.name = 'pl_pid'

In [19]:
for file_name in os.listdir(MPD_CSV_PATH):
    if file_name.startswith(PLSTRS_PREFIX):
        os.remove(os.path.join(MPD_CSV_PATH,file_name))

In [20]:
df_plsarts = pd.merge(df_plstrs.reset_index(), 
                      df_tracks.reset_index()[['track_pid', 'artist_pid']], on=['track_pid'])
df_plsarts.drop(columns=['track_pid','pos'], inplace=True)
df_plsarts.sort_values(by=['pl_pid','artist_pid'], inplace=True)
df_plsarts = df_plsarts.groupby(['pl_pid','artist_pid']).size().to_frame(name = 'artist_count').reset_index()
df_plsarts.set_index('pl_pid',inplace=True)

In [21]:
plinfo_dtypes = {'modified_at' : int, 'num_albums': int, 'num_tracks': int, 'num_followers' : int,
                 'num_edits': int, 'duration_ms' : int, 'num_artists': int}

df_plsinfo = pd.read_csv(PLSINFO_FILE, dtype=plinfo_dtypes,index_col=0)
df_plsinfo.index.name = 'pl_pid'

In [22]:
plinfo_test_dtypes = {'num_holdouts' : int, 'num_samples': int, 'num_tracks': int}

df_plsinfo_test = pd.read_csv(PLSTESTINFO_FILE, dtype=plinfo_test_dtypes, index_col=0)
df_plsinfo_test.index.name = 'pl_pid'

In [23]:
df_artists.to_csv(ARTISTS_FILE)
df_albums.to_csv(ALBUMS_FILE)
df_tracks.to_csv(TRACKS_FILE)
df_plsinfo.to_csv(PLSINFO_FILE)
df_plsinfo_test.to_csv(PLSTESTINFO_FILE)
df_plstrs.to_csv(PLSTRS_FILE)
df_plsarts.to_csv(PLSTARTS_FILE)

<div style="text-align: right">
<a href="#indice"><font size=5><i class="fa fa-arrow-circle-up" aria-hidden="true" style="color:#92002A"></i></font></a>
</div>

---

<a id="section4"></a>
## <font color="#92002A">4 - Creación de características de playlist</font>
<br>

A continuación, vamos a convertir los títulos en etiquetas, en las que cada una corresponderá a una característica.

Como podemos recordar, algunas playlists contienen emoticonos. Para este caso creamos un diccionario con el cual cambiar dichos elementos por palabras.

<br>

---

<a id="section21"></a>
### <font color="#B20033">Emoticonos a texto</font>

<br>

In [24]:
# Comprueba si un carácter es un emoticono
def is_emoji(char):
    return char in emoji.UNICODE_EMOJI['en']

# Obtiene la lista de emoticonos que aparecen en un texto dado
def get_emojis_list(text):
    text_chars = list(text)
    emojis_set = set()    
    for char in text_chars:
        if is_emoji(char):
            emojis_set.add(char)            
    return list(emojis_set)

# Devuelve un string con los emoticonos que aparecen en un texto dado
def get_emojis_string(text):
    emojis_list = get_emojis_list(text)
    return "".join(emojis_list)

# Comprueba si un texto tiene emoticonos
def has_emojis(text):
    return len(get_emojis_list(text)) > 0

# Elimina los emoticonos de un texto
def remove_emojis(text):
    return emoji.get_emoji_regexp().sub(u'', text)

# De una lista de cadenas, devuelve aquellas que contienen
# el emoticono indicado
def get_names_with_emoji(emj, names):
    names_list = []
    
    if is_emoji(emj):
        for name in names:
            if emj in name:
                names_list.append(name)
    else:
        raise Exception("'{}' no es un emoji.".format(emj))
        
    return names_list

In [25]:
def most_frequent_words(names):
    words_count_dict = defaultdict(int)
    
    for name in names:
        words = name.split(' ')
        for word in words:
            # Comprobamos si la palabra no es vacia y no contiene un dígito
            if word != '' and not any(str.isdigit(c) for c in word):
                words_count_dict[word.lower()] += 1
            
    # Ordenamos las tuplas de palabras por su número de apariciones
    words_count_dict = sorted(words_count_dict.items(), 
                              key=lambda k_v: k_v[1], 
                              reverse=True)
    
    # Extraemos en una lista la palabra de cada tupla
    words_list = [word for (word,count) in words_count_dict]
            
    return words_list

In [26]:
def emoji_most_frequent_words(emoj, names_list, num_words=-1):
    lemmatizer = WordNetLemmatizer()
    names = get_names_with_emoji(emoj, names_list)
    for i, name in enumerate(names):
        clean_name = re.sub(r'[^\w\s]','',remove_emojis(name).lower()).strip()
        clean_name_tokenized = list()
        for token in word_tokenize(clean_name):
            if lemmatizer.lemmatize(token) not in set(stopwords.words("english")):
                clean_name_tokenized.append(lemmatizer.lemmatize(token))
        names[i] = ' '.join(clean_name_tokenized)
    word_list = most_frequent_words(names)[0:num_words]
    
    return word_list

In [27]:
def create_emoji_dict(names_list, removable_words=[], num_tags=5):
    names_with_emoji_list = [name for name in names_list if has_emojis(name)]
    
    emojis_set = set()    
    for name in names_with_emoji_list:
        emojis_set.update(get_emojis_list(name))    
    
    emoji_dict = dict()
    for emoj in tqdm_nb(emojis_set):
        emoji_dict[emoj] = emoji_most_frequent_words(emoj, names_with_emoji_list)
        
    words_count_dict = defaultdict(int)
    
    for _,value in emoji_dict.items():
        for element in value:
            words_count_dict[element] +=1
    
    words_count_dict = sorted(words_count_dict.items(), 
                              key=lambda k_v: k_v[1], 
                              reverse=True)
            
    for k,v in emoji_dict.items():
        emoji_dict[k] = [word for word in v if word not in removable_words][0:num_tags]
        
    emoji_dict = {k: v for k, v in emoji_dict.items() if len(v) > 0}
    
    return emoji_dict

---

<br>

<a id="section31"></a>
### <font color="#B20033">Limpieza de texto</font>


<br>

In [28]:
def remove_punctuation_chars(text):
    unwanted_chars = string.punctuation.replace("'", "")
    if any(x in text for x in string.punctuation):
        for c in unwanted_chars :
            text = text.replace(c, ' ')
            
    return text

In [29]:
def remove_multiple_whitespaces(text):
    return re.sub(' +', ' ', text).strip()

In [30]:
def clean_text(text):
    text = remove_emojis(text.lower())
    text = remove_punctuation_chars(text)
    text = remove_multiple_whitespaces(text)
    
    return text

---

<br>

<a id="section31"></a>
### <font color="#B20033">Creación de etiquetas</font>

<br>


In [31]:
def get_tags(cleaned_text, emojis_list=[], emojis_translation_dict=dict(), max_emoji_translations=3):
    stop_words = set(stopwords.words("english"))
    tokenized_text = set(word_tokenize(cleaned_text))
    
    stemmer = PorterStemmer()
    text_tags_set = set(stemmer.stem(word) for word in tokenized_text 
                        if word not in stop_words)

    if len(emojis_list) > 0 and len(emojis_translation_dict) > 0:
        emojis_tags_set = set()
        for emoj in emojis_list:
            if emoj in emoji_dict:
                emojis_tags_set.update(emojis_translation_dict[emoj][0:max_emoji_translations])
        
        text_tags_set = text_tags_set.union(emojis_tags_set)
    
    return "|".join(list(text_tags_set))

In [32]:
def get_unusual_tags(name_tags_list,min_appearances=2):
    tags_count = defaultdict(int)
    for name_tags in name_tags_list:
        for tag in name_tags.split('|'):
            tags_count[tag] += 1

    removable_set = set()
    for tag,count in tags_count.items():
        if count < min_appearances:
            removable_set.add(tag)
            
    return removable_set

In [33]:
def remove_unusual_tags(tags, unusual_tags_set):
    tags_list = tags.split('|')
    new_tags_list = []
    for tag in tags_list:
        if tag not in unusual_tags_set:
            new_tags_list.append(tag)
            
    return ('|').join(new_tags_list)

---

<br>

<a id="section31"></a>
### <font color="#B20033">Proceso final</font>

<br>


In [34]:
def create_names_df(df_names, emojis_translation_dict=dict(), remove_unusualtags=False):
    total_steps = 3
    if remove_unusualtags:
        total_steps += 1
    
    tqdm_nb.pandas()
    print("(1/{}) Cleaning playlists names".format(total_steps))
    df_names['clean_name'] = df_names.name.progress_apply(clean_text)
    print("(2/{}) Emojis translation".format(total_steps))
    df_names['emojis'] = df_names.name.progress_apply(get_emojis_string)
    print("(3/{}) Generating tags".format(total_steps))
    df_names['tags'] = df_names.progress_apply(lambda df: get_tags(df['clean_name'], df['emojis'],
                                                                   emojis_translation_dict,
                                                                   max_emoji_translations=3),axis=1)
    
    df_names.drop(columns=['emojis'], inplace=True)
    df_names.drop(columns=['clean_name'], inplace=True)
    
    if remove_unusualtags:
        print("(4/{}) Removing unusual tags".format(total_steps))
        removable_tags_set = get_unusual_tags(df_names.tags.to_list()) 
        df_names['tags'] = df_names.progress_apply(lambda df: remove_unusual_tags(df['tags'], 
                                                                                  removable_tags_set),
                                                   axis=1)    
    
    return df_names

In [35]:
# Diccionario de emoticonos
if os.path.isfile(EMOJI_DICT_FILE):
    with open(EMOJI_DICT_FILE) as json_file:
        emoji_dict = json.load(json_file)
else:
    names = pd.read_csv(PLSINFO_FILE, index_col=0)['name'].astype(str).to_list()
    names = names + pd.read_csv(PLSTESTINFO_FILE, index_col=0)['name'].astype(str).to_list()
    emoji_dict = create_emoji_dict(names, removable_words= ["music", "playlist", "song"])
    with open(EMOJI_DICT_FILE, 'w') as json_file:
        json.dump(emoji_dict, json_file, indent=4)

  0%|          | 0/1081 [00:00<?, ?it/s]

In [36]:
list(emoji_dict.items())[:8]

[('🎐', ['bit', 'everything', 'swavé', 'mi', 'watercolour']),
 ('🙃', ['sad', 'mood', 'love', 'feel', 'good']),
 ('♑', ['capricorn', 'gamzee', 'makara', 'ive', 'hellbent']),
 ('🌤', ['morning', 'cloud', 'day', 'summer', 'spring']),
 ('♣', ['attack', 'team', 'nigga', 'explosion', 'later']),
 ('🤶', ['christmas', 'justin', 'bieber', 'drummer']),
 ('🚂', ['train', 'soul', 'express', 'toot', 'gain']),
 ('🤰', ['pregnant', 'word'])]

In [37]:
# DataFrame nombres
if os.path.isfile(NAMES_FILE):
    df_names = pd.read_csv(NAMES_FILE, index_col=0)
else:
    df_names = pd.read_csv(PLSINFO_FILE, index_col=0)[['name']].astype(str)
    df_names_test = pd.read_csv(PLSTESTINFO_FILE, index_col=0)[['name']].astype(str)
    df_names = create_names_df(pd.concat([df_names,df_names_test]), emoji_dict, remove_unusualtags=False)
    df_names.to_csv(NAMES_FILE)

(1/3) Cleaning playlists names


  0%|          | 0/1010000 [00:00<?, ?it/s]

(2/3) Emojis translation


  0%|          | 0/1010000 [00:00<?, ?it/s]

(3/3) Generating tags


  0%|          | 0/1010000 [00:00<?, ?it/s]

In [38]:
df_names.head()

Unnamed: 0_level_0,name,tags
pl_pid,Unnamed: 1_level_1,Unnamed: 2_level_1
0,Low viscosity vibes,viscos|low|vibe
1,dalanda 🐉,dalanda|imagine|game|dragon
2,freeze pops,pop|freez
3,Golden Oldies,oldi|golden
4,I d◻n't g◻v◻ a d◻mn,dmn|n't|dnt|gv


<div style="text-align: right">
<a href="#indice"><font size=5><i class="fa fa-arrow-circle-up" aria-hidden="true" style="color:#92002A"></i></font></a>
</div>

---

<div style="text-align: right"> <font size=6><i class="fa fa-graduation-cap" aria-hidden="true" style="color:#92002A"></i> </font></div>