# **Data Exploration:** Audio features and lyrics of Spotify songs

## **1. Libraries**

In [22]:
import pandas as pd

---
## **2. Dataset Analysis**
### 2.1 Loading the data

In [33]:
dataPath = "./spotify_songs.csv"
dataset = pd.read_csv(dataPath)

dataset.head(1)

| | track_id | track_name | track_artist | lyrics | track_popularity | track_album_id | track_album_name | track_album_release_date | playlist_name | playlist_id | ... | loudness | mode | speechiness | acousticness | instrumentalness | liveness | valence | tempo | duration_ms | language |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0017A6SJgTbfQVU2EtsPNo | Pangarap | Barbie's Cradle | Minsan pa Nang ako'y napalingon Hindi ko alam ... | 41 | 1srJQ0njEQgd8w4XSqI4JQ | Trip | 2001-01-01 | Pinoy Classic Rock | 37i9dQZF1DWYDQ8wBxd7xt | ... | -10.068 | 1 | 0.0236 | 0.279 | 0.0117 | 0.0887 | 0.566 | 97.091 | 235440 | tl |


### 2.2 Available features

In [24]:
print(f"Rows:{dataset.shape[0]}\nColumns:{dataset.shape[1]}\n")
print(dataset.columns.values)

Rows:18454
Columns:25

['track_id' 'track_name' 'track_artist' 'lyrics' 'track_popularity'
 'track_album_id' 'track_album_name' 'track_album_release_date'
 'playlist_name' 'playlist_id' 'playlist_genre' 'playlist_subgenre'
 'danceability' 'energy' 'key' 'loudness' 'mode' 'speechiness'
 'acousticness' 'instrumentalness' 'liveness' 'valence' 'tempo'
 'duration_ms' 'language']


### 2.3 Categorical features

In [25]:
print(dataset.select_dtypes(include = ["object"]).columns.values)

['track_id' 'track_name' 'track_artist' 'lyrics' 'track_album_id'
 'track_album_name' 'track_album_release_date' 'playlist_name'
 'playlist_id' 'playlist_genre' 'playlist_subgenre' 'language']


### 2.4 Numerical features

In [26]:
print(dataset.select_dtypes(exclude = ["object"]).columns.values)

['track_popularity' 'danceability' 'energy' 'key' 'loudness' 'mode'
 'speechiness' 'acousticness' 'instrumentalness' 'liveness' 'valence'
 'tempo' 'duration_ms']


### 2.5 Null values

In [29]:
# Record the shape before cleaning so the number of dropped rows can be reported.
print(f"Dataset shape before dropping nulls: {dataset.shape}")
prevDatasetRows = dataset.shape[0]

# Drop every row that has a missing value in any column.
dataset = dataset.dropna()

print(f"Dataset shape after dropping nulls: {dataset.shape}")
print(f"Removed {prevDatasetRows - dataset.shape[0]} rows")

Dataset shape before dropping nulls: (18454, 25)
Dataset shape after dropping nulls: (18194, 25)
Removed 260 rows
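Before dropping rows, it can help to see which columns actually contain the missing values. The following is a minimal sketch on a made-up toy DataFrame (the `toy` frame and its values are assumptions for illustration, not rows from the Spotify dataset); it counts nulls per column and the number of rows that `dropna()` would remove.

```python
import pandas as pd

# Toy stand-in for the Spotify dataset; the values are invented for illustration.
toy = pd.DataFrame({
    "track_name": ["Song A", "Song B", None, "Song D"],
    "lyrics": ["la la", None, "na na", "do re mi"],
    "language": ["en", "es", "en", None],
})

# Number of missing values in each column.
null_counts = toy.isna().sum()
print(null_counts)

# Number of rows with at least one missing value; these are the rows dropna() removes.
rows_with_nulls = int(toy.isna().any(axis=1).sum())

cleaned = toy.dropna()
print(f"Removed {len(toy) - len(cleaned)} of {len(toy)} rows")
```

If most nulls turn out to sit in a column that is not needed later, `dropna(subset=[...])` restricted to the required columns would discard fewer rows.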


## **3. Analysis of key columns**

The categorical features include a `lyrics` column that holds the lyrics of each song. This makes it a good candidate for testing our inverted index, which ranks results by cosine similarity against a natural-language query.
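To make the idea concrete, here is a toy sketch of an inverted index scored with cosine similarity. It is not the project's implementation: the whitespace tokenizer, the log-scaled TF-IDF weighting, and the names `tokenize`, `build_index`, and `search` are all assumptions chosen for illustration.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Naive lowercase/whitespace tokenizer (assumption); a real pipeline would
    # also strip punctuation, remove stopwords, and stem.
    return text.lower().split()


def build_index(docs):
    """Build (inverted index, document norms, idf) from a {doc_id: text} dict."""
    n_docs = len(docs)
    term_freqs = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}
    doc_freq = Counter(term for tf in term_freqs.values() for term in tf)
    idf = {term: math.log(n_docs / df) for term, df in doc_freq.items()}

    index = defaultdict(dict)  # term -> {doc_id: tf-idf weight}
    norms = {}
    for doc_id, tf in term_freqs.items():
        weights = {t: (1 + math.log(c)) * idf[t] for t, c in tf.items()}
        norms[doc_id] = math.sqrt(sum(w * w for w in weights.values()))
        for t, w in weights.items():
            index[t][doc_id] = w
    return index, norms, idf


def search(query, index, norms, idf, top_k=3):
    """Rank documents by cosine similarity between query and document vectors."""
    q_tf = Counter(tokenize(query))
    # Query terms that never appear in the collection are ignored.
    q_weights = {t: (1 + math.log(c)) * idf[t] for t, c in q_tf.items() if t in idf}
    q_norm = math.sqrt(sum(w * w for w in q_weights.values()))

    # Accumulate dot products by visiting only the postings of the query terms.
    scores = defaultdict(float)
    for t, qw in q_weights.items():
        for doc_id, dw in index[t].items():
            scores[doc_id] += qw * dw

    ranked = [
        (doc_id, s / (q_norm * norms[doc_id]))
        for doc_id, s in scores.items()
        if q_norm > 0 and norms[doc_id] > 0
    ]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)[:top_k]


# Made-up miniature "lyrics" collection.
docs = {
    "a": "the trees are singing in the wind",
    "b": "poison running through my veins",
    "c": "singing in the rain",
}
index, norms, idf = build_index(docs)
results = search("singing wind", index, norms, idf)
print(results)
```

Only documents that share at least one term with the query are scored, which is the main benefit of the inverted index over comparing the query against every document vector.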

### 3.1 New dataset

In [50]:
newDataset = dataset[[
    "track_id",
    "track_name",
    "track_artist",
    "lyrics",
    "track_album_name",
    "playlist_name",
    "playlist_genre",
    "playlist_subgenre",
    "language"
]]

newDataset.head(3)

| | track_id | track_name | track_artist | lyrics | track_album_name | playlist_name | playlist_genre | playlist_subgenre | language |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0017A6SJgTbfQVU2EtsPNo | Pangarap | Barbie's Cradle | Minsan pa Nang ako'y napalingon Hindi ko alam ... | Trip | Pinoy Classic Rock | rock | classic rock | tl |
| 1 | 004s3t0ONYlzxII9PLgU6z | I Feel Alive | Steady Rollin | The trees, are singing in the wind The sky blu... | Love & Loss | Hard Rock Workout | rock | hard rock | en |
| 2 | 00chLpzhgVjxs1zKC9UScL | Poison | Bell Biv DeVoe | NA Yeah, Spyderman and Freeze in full effect U... | Gold | Back in the day - R&B, New Jack Swing, Swingbe... | r&b | new jack swing | en |


### 3.2 Feature information

A column containing all the text fields concatenated is added, as required by the project specification.

In [51]:
def concatenateTextFields(row):
    return ' '.join([
        str(row['track_name']),
        str(row['track_artist']),
        str(row['lyrics']),
        str(row['track_album_name']),
        str(row['playlist_name']),
        str(row['playlist_genre']),
        str(row['playlist_subgenre'])
    ])
    
# assign() returns a new DataFrame, which avoids the SettingWithCopyWarning
# raised when a column is added in place to a slice of `dataset`.
newDataset = newDataset.assign(
    texto_concatenado=newDataset.apply(concatenateTextFields, axis=1)
)
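The concatenation step can also be written without a row-wise Python function. The sketch below uses a made-up toy DataFrame (the `toy` and `subset` names and their values are assumptions for illustration). It takes an explicit `.copy()` of the column subset, so that adding a column does not trigger pandas' `SettingWithCopyWarning`, and joins the text columns with `agg`.

```python
import pandas as pd

# Toy stand-in with a few of the text columns; values are illustrative only.
toy = pd.DataFrame({
    "track_name": ["Pangarap", "Poison"],
    "track_artist": ["Barbie's Cradle", "Bell Biv DeVoe"],
    "lyrics": ["Minsan pa", None],
})

text_columns = ["track_name", "track_artist", "lyrics"]

# .copy() makes the subset an independent DataFrame, so assigning a new
# column to it is unambiguous and raises no SettingWithCopyWarning.
subset = toy[text_columns].copy()

# Replace missing values with empty strings, join the columns row by row
# with a space, then trim any leading/trailing space left by empty fields.
subset["texto_concatenado"] = (
    subset[text_columns]
    .fillna("")
    .astype(str)
    .agg(" ".join, axis=1)
    .str.strip()
)

print(subset["texto_concatenado"].tolist())
```

The `fillna("")` guard only matters if null rows are still present; after a prior `dropna()` it is a no-op.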


In [52]:
newDataset.head(3)

| | track_id | track_name | track_artist | lyrics | track_album_name | playlist_name | playlist_genre | playlist_subgenre | language | texto_concatenado |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0017A6SJgTbfQVU2EtsPNo | Pangarap | Barbie's Cradle | Minsan pa Nang ako'y napalingon Hindi ko alam ... | Trip | Pinoy Classic Rock | rock | classic rock | tl | Pangarap Barbie's Cradle Minsan pa Nang ako'y ... |
| 1 | 004s3t0ONYlzxII9PLgU6z | I Feel Alive | Steady Rollin | The trees, are singing in the wind The sky blu... | Love & Loss | Hard Rock Workout | rock | hard rock | en | I Feel Alive Steady Rollin The trees, are sing... |
| 2 | 00chLpzhgVjxs1zKC9UScL | Poison | Bell Biv DeVoe | NA Yeah, Spyderman and Freeze in full effect U... | Gold | Back in the day - R&B, New Jack Swing, Swingbe... | r&b | new jack swing | en | Poison Bell Biv DeVoe NA Yeah, Spyderman and F... |


In [55]:
newDataset.to_csv("spotifySongsTextConcatenated.csv", index=False)
print("Dataset with concatenated text saved as .csv successfully")

Dataset with concatenated text saved as .csv successfully
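A quick way to confirm that an export like this is usable is to read the file back and compare it with the in-memory DataFrame. The following sketch does so on a made-up toy DataFrame written to a temporary directory (the `toy` frame and the `toy_export.csv` file name are assumptions for illustration).

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for the exported dataset; values are illustrative only.
toy = pd.DataFrame({
    "track_id": ["id1", "id2"],
    "texto_concatenado": ["Song A Artist A la la", "Song B Artist B na na"],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_export.csv")
    # index=False keeps the row index out of the file, so reloading does not
    # produce an extra "Unnamed: 0" column.
    toy.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

same_shape = reloaded.shape == toy.shape
same_columns = list(reloaded.columns) == list(toy.columns)
same_text = reloaded["texto_concatenado"].tolist() == toy["texto_concatenado"].tolist()
print(same_shape, same_columns, same_text)
```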
