# Analysis of new hits and historical hits
### Billboard number-one songs compared with current hits

Across the history of the Billboard "Hot 100", only 3% of songs have spent more than 10 weeks at #1.
We will compare the characteristics of the 50 songs that spent the longest at #1 on Billboard with those of Spotify's current (2020) top 50.

## Importing libraries

In [1]:
import requests
import base64
import re
from urllib.parse import urlencode
import pandas as pd
from bs4 import BeautifulSoup
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials
import numpy as np

## Scraping the Billboard ranking
#### We start by scraping this web article, which lists the songs that have spent the most weeks at number one on Billboard's "Hot 100" chart, and keep the fifty songs with the longest runs
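The string handling in the scraping cell can be tried offline before running the requests. The sketch below assumes each slide heading looks like `#3. 'Title' by Artist`, which is inferred from the cleaning steps in this notebook rather than confirmed against the page; `parse_slide_title` is a hypothetical helper name.

```python
import re

def parse_slide_title(raw):
    """Split a heading like "#3. 'Title' by Artist" into (title, artist)."""
    # Drop a leading "#<number>. " rank marker, if present
    text = re.sub(r'^#\d+\.\s*', '', raw.strip())
    # Preferred case (assumed format): the title is wrapped in single quotes
    quoted = re.match(r"^'(.+)' by (.+)$", text)
    if quoted:
        return quoted.group(1), quoted.group(2).strip()
    # Fallback: split on the first " by " only
    title, _, artist = text.partition(' by ')
    return title.strip(), artist.strip()

print(parse_slide_title("#1. 'Old Town Road' by Lil Nas X featuring Billy Ray Cyrus"))
```

Matching the quoted title first keeps titles that themselves contain " by " in one piece, which a plain split would cut in two.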

In [2]:
#Empty list that will collect one [rank, title, artist] entry per song
songs_list = []
#Counter used to number the ranking
rank=51
#Loop over pages 6 to 10 of the article to collect the top 50 songs
for i in range(6,11):
    #Build the URL for the current page number
    url_billb = f'https://stacker.com/stories/3384/songs-dominated-billboard-charts-longest?page={i}'
    #Request the page content
    billboard = requests.get(url_billb).content
    #Create a BeautifulSoup object
    music_soup = BeautifulSoup(billboard,'html5')
    #After inspecting the "soup", select the elements we are interested in
    songs=music_soup.select('div[class="slideshow-slide__title"]>h2')
    #Keep only the text of each element
    songs_text = [song.text for song in songs]
    #Iterate over the strings to split each one into ranking, title and artist
    for song in songs_text:
        #Update the rank
        rank=rank-1
        #Clean each string
        #Remove the page's own ranking (not useful because ties repeat numbers)
        song=re.sub(r'\#\d*. ' ,'',song)
        #Remove single quotes, which are not useful
        song=re.sub(r'\'','',song)
        #Split once on ' by ' so title and artist come from the same split
        title, artist = song.split(' by ', 1)
        #Append [rank, title, artist] to the master list
        songs_list.append([rank, title, artist])
#Sort once by the new ranking, after all pages have been collected
songs_list.sort()
#Create a DF from our list of lists and name the columns
billboard_scrapped=pd.DataFrame(songs_list,columns=['rank','title','artist'])
#Preview the DF (first rows only, to keep the output short)
billboard_scrapped.head(3)

,rank,title,artist
0,1,Old Town Road,Lil Nas X featuring Billy Ray Cyrus
1,2,Despacito,Luis Fonsi & Daddy Yankee featuring Justin Bieber
2,3,One Sweet Day,Mariah Carey & Boyz II Men


# Using the Spotify API
### We start by creating a token and fetching the global top 50 playlist
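The pieces of the token request can be assembled and inspected without touching the network. This is a minimal sketch of the client-credentials request used in the following cells; `build_token_request` is a hypothetical helper and the credentials are placeholders.

```python
import base64

TOKEN_URL = 'https://accounts.spotify.com/api/token'

def build_token_request(client_id, client_secret):
    """Return (url, form_data, headers) for a client-credentials token request."""
    # Strip whitespace: values read with readline() usually end in a newline
    creds = f"{client_id.strip()}:{client_secret.strip()}"
    # Base64 is an encoding, not encryption; it only makes the pair header-safe
    encoded = base64.b64encode(creds.encode()).decode()
    headers = {'Authorization': f'Basic {encoded}'}
    data = {'grant_type': 'client_credentials'}
    return TOKEN_URL, data, headers

# Placeholder credentials, for illustration only
url, data, headers = build_token_request('my-client-id\n', 'my-client-secret\n')
print(headers['Authorization'])
```

The three returned values map directly onto `requests.post(url, data=data, headers=headers)`.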

In [3]:
#These values come from the API dashboard; strip() removes the trailing newline
with open('../../spoti_client.txt','r') as f:
    client_ID = f.readline().strip()
with open('../../spoti_secret.txt','r') as f:
    client_secret = f.readline().strip()

#Build the "ID:secret" string the API expects
client_creds = f"{client_ID}:{client_secret}"

#Encode it in base64, as the API requires (this is encoding, not encryption)
client_creds_base64 = base64.b64encode(client_creds.encode())

In [4]:
#URL of the token endpoint (the grant type goes in the request body, not in the URL)
token_url = 'https://accounts.spotify.com/api/token'
#Grant type we will use
token_data = {'grant_type':'client_credentials'}
#Pass our base64-encoded ID and secret
token_headers = {'Authorization':f'Basic {client_creds_base64.decode()}'}
#Make the request with the parameters defined above
r = requests.post(token_url, data=token_data, headers=token_headers)
#Parse the response that contains the token
data=r.json()
#Left commented out to keep the output short
#data

In [5]:
#Build the headers with the bearer token the API expects
auth = {'authorization':f'Bearer {data["access_token"]}'}

## Spotipy

#### Spotipy is a wrapper around the Spotify API that makes connecting and searching simpler.
#### It is useful when we do not have the song's ID, as is the case for the Billboard chart

In [6]:
#Authorization
#Read the credentials from txt files outside the repo; strip() removes the trailing newline
with open('../../spoti_client.txt','r') as f:
    client_ID = f.readline().strip()
with open('../../spoti_secret.txt','r') as f:
    client_secret = f.readline().strip()
#Create a client credentials object
client_credentials_manager = SpotifyClientCredentials(client_id=client_ID, client_secret=client_secret)
#Authorize our queries
sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager)

## Finding the top 50 playlist and extracting its songs and their features
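The search URL used in the next cells can be built and checked on its own. This is a small sketch; `build_search_url` is a hypothetical helper wrapping the same `urlencode` call, with the `offset` parameter spelled out.

```python
from urllib.parse import urlencode

API_BASE = 'https://api.spotify.com/v1/'

def build_search_url(text, item_type='playlist', limit=5, offset=0):
    """Build a search URL for the given text and item type."""
    params = {'q': text, 'type': item_type, 'limit': limit, 'offset': offset}
    # urlencode escapes the values; spaces become '+'
    return f'{API_BASE}search?{urlencode(params)}'

print(build_search_url('top 50 global'))
# https://api.spotify.com/v1/search?q=top+50+global&type=playlist&limit=5&offset=0
```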

In [7]:
# Build a query in the format the API expects
"""q= what we want to search for
   type=album, artist, playlist, track, show, episode (several can be given, separated by commas)
   limit= from 1 to 50, default=20
   offset=index of the first result to return, default=0
   """
#Build the query; urlencode converts it to the format the endpoint expects
#Search for the global top 50 playlist to compare it with historical hits
query = urlencode({'q':'top 50 global','type':'playlist','limit':5})
query


#Add the "base" URL and build the full URL
#In the base URL, "search" changes depending on what we are looking for:
# playlists, shows, albums, artists are options; search is a general search.
endpoint = "https://api.spotify.com/v1/"
lookup = f'{endpoint}search?{query}'

In [8]:
#Make the GET request with the URL (lookup) defined above and the headers that carry our token
busca = requests.get(lookup, headers=auth)

#Inspect the returned JSON
#Left commented out to keep the output short
#busca.json()

In [9]:
#Create a DF with the playlists returned
prueba_listas = pd.json_normalize(busca.json()['playlists']['items'])
#Keep only the playlists owned by the spotifycharts account
prueba_listas2 = prueba_listas.loc[prueba_listas['owner.id']=='spotifycharts']
#Take only the first one
top50spotify = prueba_listas2.iloc[0]
#Inspect the playlist
#Left commented out to keep the output short
#top50spotify

In [10]:
#Look up the playlist by its id to fetch its songs
busca_playlist = requests.get(f'https://api.spotify.com/v1/playlists/{top50spotify["id"]}/tracks', headers=auth)
#Create a DF of the playlist's songs, sorted by popularity
playlist_tracks = pd.json_normalize(busca_playlist.json()['items']).sort_values(['track.popularity'],ascending=False)
#Create another DF with only the useful columns and use the song id as the index
spotify_songs = playlist_tracks[['track.id', 'track.name', 'track.album.artists','track.popularity']].set_index('track.id')
#Check the contents
spotify_songs.head(3)

track.id,track.name,track.album.artists,track.popularity
3tjFYV6RSFtuktYl3ZtYcq,Mood (feat. iann dior),[{'external_urls': {'spotify': 'https://open.s...,100
47EiUVwUp4C9fGccaPuUCS,Dakiti,[{'external_urls': {'spotify': 'https://open.s...,99
0t1kP63rueHleOhQkYSXFY,Dynamite,[{'external_urls': {'spotify': 'https://open.s...,97


In [11]:
#Function that fetches the audio features of a list of songs (this process is used twice)
def get_audio_features (songs,headers):
    #Empty dictionary to fill with the features of each song
    songs_features={}
    #Iterate over the song ids, fetch the features of each one and store them in the dictionary
    for track in songs:
        busca_audio_analysis = requests.get(f'https://api.spotify.com/v1/audio-features/{track}', headers=headers)
        songs_features[track] = busca_audio_analysis.json()
    #Convert the dictionary to a dataframe and use the id as the index
    audio_features = pd.DataFrame(songs_features).T.set_index('id')
    #Return a DF with the features of the songs
    return audio_features

In [12]:
#Fetch the features of the Spotify top 50
spotify_features = get_audio_features(spotify_songs.index,auth)

In [13]:
#Check the data
spotify_features.head(3)

id,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,type,uri,track_href,analysis_url,duration_ms,time_signature
3tjFYV6RSFtuktYl3ZtYcq,0.7,0.722,7,-3.558,0,0.0369,0.221,0.0,0.272,0.756,90.989,audio_features,spotify:track:3tjFYV6RSFtuktYl3ZtYcq,https://api.spotify.com/v1/tracks/3tjFYV6RSFtu...,https://api.spotify.com/v1/audio-analysis/3tjF...,140526,4
47EiUVwUp4C9fGccaPuUCS,0.731,0.573,4,-10.059,0,0.0544,0.401,5.22e-05,0.113,0.145,109.928,audio_features,spotify:track:47EiUVwUp4C9fGccaPuUCS,https://api.spotify.com/v1/tracks/47EiUVwUp4C9...,https://api.spotify.com/v1/audio-analysis/47Ei...,205090,4
0t1kP63rueHleOhQkYSXFY,0.746,0.765,6,-4.41,0,0.0993,0.0112,0.0,0.0936,0.737,114.044,audio_features,spotify:track:0t1kP63rueHleOhQkYSXFY,https://api.spotify.com/v1/tracks/0t1kP63rueHl...,https://api.spotify.com/v1/audio-analysis/0t1k...,199054,4


In [14]:
#Join the song-info DF with the song-features DF
spotify_songs_feat = spotify_songs.join(spotify_features)
#Add a column identifying rows that come from the Spotify top 50
spotify_songs_feat['Chart_Type']='Spotify top'
#Check
spotify_songs_feat.head(3)
#Ready for cleaning

track.id,track.name,track.album.artists,track.popularity,danceability,energy,key,loudness,mode,speechiness,acousticness,...,liveness,valence,tempo,type,uri,track_href,analysis_url,duration_ms,time_signature,Chart_Type
3tjFYV6RSFtuktYl3ZtYcq,Mood (feat. iann dior),[{'external_urls': {'spotify': 'https://open.s...,100,0.7,0.722,7,-3.558,0,0.0369,0.221,...,0.272,0.756,90.989,audio_features,spotify:track:3tjFYV6RSFtuktYl3ZtYcq,https://api.spotify.com/v1/tracks/3tjFYV6RSFtu...,https://api.spotify.com/v1/audio-analysis/3tjF...,140526,4,Spotify top
47EiUVwUp4C9fGccaPuUCS,Dakiti,[{'external_urls': {'spotify': 'https://open.s...,99,0.731,0.573,4,-10.059,0,0.0544,0.401,...,0.113,0.145,109.928,audio_features,spotify:track:47EiUVwUp4C9fGccaPuUCS,https://api.spotify.com/v1/tracks/47EiUVwUp4C9...,https://api.spotify.com/v1/audio-analysis/47Ei...,205090,4,Spotify top
0t1kP63rueHleOhQkYSXFY,Dynamite,[{'external_urls': {'spotify': 'https://open.s...,97,0.746,0.765,6,-4.41,0,0.0993,0.0112,...,0.0936,0.737,114.044,audio_features,spotify:track:0t1kP63rueHleOhQkYSXFY,https://api.spotify.com/v1/tracks/0t1kP63rueHl...,https://api.spotify.com/v1/audio-analysis/0t1k...,199054,4,Spotify top


## Searching Spotify for the Billboard songs
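The lookup in the next cell combines each title with the first word of the artist credit and then takes the first hit. The sketch below isolates those two steps so they can be tried without credentials; `build_track_query` and `first_track` are hypothetical helpers, and the response shape (`tracks` then `items`) is taken from how the notebook indexes the search result.

```python
def build_track_query(title, artist):
    """Combine the title with the first word of the artist credit."""
    words = artist.split()
    first = words[0] if words else ''
    return f'{title} {first}'.strip()

def first_track(search_response):
    """Return the first track of a search response, or None when there are no hits."""
    items = search_response.get('tracks', {}).get('items', [])
    return items[0] if items else None

# Hand-made response standing in for a real search result
fake_response = {'tracks': {'items': [{'id': 'abc', 'name': 'Despacito'}]}}
print(build_track_query('Despacito', 'Luis Fonsi & Daddy Yankee featuring Justin Bieber'))
print(first_track(fake_response))
print(first_track({'tracks': {'items': []}}))
```

Returning `None` for an empty result lets the caller skip or log a song that Spotify does not find, instead of stopping the loop with an `IndexError`.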

In [15]:
#Empty list that will hold our DFs
df_list = []
#Iterate over the song titles in the Billboard DF
for i in billboard_scrapped['title'].index:
    #For each entry in the Billboard chart, search with the spotipy wrapper, which accepts free text with artist and song
    #Use its .search() method with the title and the first word of the artist
    #This reduces the risk of fetching a song with the right title but a different artist
    busca=sp.search(billboard_scrapped['title'][i]+' '+(billboard_scrapped['artist'][i].split(' ')[0]))
    #Build a list of DFs, keeping the first result of each search
    df_list.append(pd.json_normalize(busca['tracks']['items'][0]))
#Concatenate the DFs, use the ID as the index and keep only the important columns
billboard_songs = pd.concat(df_list).set_index('id')[['name', 'artists', 'popularity']]

In [16]:
#Fetch the features of the Billboard songs with the function defined above
billboard_features = get_audio_features(billboard_songs.index,auth)

In [17]:
billboard_features.head(3)

id,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,type,uri,track_href,analysis_url,duration_ms,time_signature
2YpeDb67231RjR0MgVLzsG,0.878,0.619,6,-5.56,1,0.102,0.0533,0,0.113,0.639,136.041,audio_features,spotify:track:2YpeDb67231RjR0MgVLzsG,https://api.spotify.com/v1/tracks/2YpeDb67231R...,https://api.spotify.com/v1/audio-analysis/2Ype...,157067,4
6habFhsOp2NvshLv26DqMb,0.655,0.797,2,-4.787,1,0.153,0.198,0,0.067,0.839,177.928,audio_features,spotify:track:6habFhsOp2NvshLv26DqMb,https://api.spotify.com/v1/tracks/6habFhsOp2Nv...,https://api.spotify.com/v1/audio-analysis/6hab...,229360,4
7ySbfLwdCwl1EM0zNCJZ38,0.568,0.495,1,-8.964,1,0.0299,0.353,0,0.0839,0.303,128.234,audio_features,spotify:track:7ySbfLwdCwl1EM0zNCJZ38,https://api.spotify.com/v1/tracks/7ySbfLwdCwl1...,https://api.spotify.com/v1/audio-analysis/7ySb...,281067,4


### We now have two Billboard dataframes, one with song info and one with audio features; it is time to join them

In [18]:
#Join the song-info DF with the song-features DF
billboard_songs_feat = billboard_songs.join(billboard_features)
#Add a column identifying rows that come from the Billboard top 50
billboard_songs_feat['Chart_Type']='Billboard top'
#Check
billboard_songs_feat.head(3)
#Ready for cleaning

id,name,artists,popularity,danceability,energy,key,loudness,mode,speechiness,acousticness,...,liveness,valence,tempo,type,uri,track_href,analysis_url,duration_ms,time_signature,Chart_Type
2YpeDb67231RjR0MgVLzsG,Old Town Road - Remix,[{'external_urls': {'spotify': 'https://open.s...,83,0.878,0.619,6,-5.56,1,0.102,0.0533,...,0.113,0.639,136.041,audio_features,spotify:track:2YpeDb67231RjR0MgVLzsG,https://api.spotify.com/v1/tracks/2YpeDb67231R...,https://api.spotify.com/v1/audio-analysis/2Ype...,157067,4,Billboard top
6habFhsOp2NvshLv26DqMb,Despacito,[{'external_urls': {'spotify': 'https://open.s...,79,0.655,0.797,2,-4.787,1,0.153,0.198,...,0.067,0.839,177.928,audio_features,spotify:track:6habFhsOp2NvshLv26DqMb,https://api.spotify.com/v1/tracks/6habFhsOp2Nv...,https://api.spotify.com/v1/audio-analysis/6hab...,229360,4,Billboard top
7ySbfLwdCwl1EM0zNCJZ38,One Sweet Day,[{'external_urls': {'spotify': 'https://open.s...,69,0.568,0.495,1,-8.964,1,0.0299,0.353,...,0.0839,0.303,128.234,audio_features,spotify:track:7ySbfLwdCwl1EM0zNCJZ38,https://api.spotify.com/v1/tracks/7ySbfLwdCwl1...,https://api.spotify.com/v1/audio-analysis/7ySb...,281067,4,Billboard top


## Finally we have two dataframes: the 2020 Spotify top 50 and the Billboard top 50 since the Hot 100 chart began in 1958

In [19]:
print('Spotify df preview')
print(spotify_songs_feat.shape)
spotify_songs_feat.to_csv('Spotify_output')
spotify_songs_feat.head(3)

Spotify df preview
(50, 21)


track.id,track.name,track.album.artists,track.popularity,danceability,energy,key,loudness,mode,speechiness,acousticness,...,liveness,valence,tempo,type,uri,track_href,analysis_url,duration_ms,time_signature,Chart_Type
3tjFYV6RSFtuktYl3ZtYcq,Mood (feat. iann dior),[{'external_urls': {'spotify': 'https://open.s...,100,0.7,0.722,7,-3.558,0,0.0369,0.221,...,0.272,0.756,90.989,audio_features,spotify:track:3tjFYV6RSFtuktYl3ZtYcq,https://api.spotify.com/v1/tracks/3tjFYV6RSFtu...,https://api.spotify.com/v1/audio-analysis/3tjF...,140526,4,Spotify top
47EiUVwUp4C9fGccaPuUCS,Dakiti,[{'external_urls': {'spotify': 'https://open.s...,99,0.731,0.573,4,-10.059,0,0.0544,0.401,...,0.113,0.145,109.928,audio_features,spotify:track:47EiUVwUp4C9fGccaPuUCS,https://api.spotify.com/v1/tracks/47EiUVwUp4C9...,https://api.spotify.com/v1/audio-analysis/47Ei...,205090,4,Spotify top
0t1kP63rueHleOhQkYSXFY,Dynamite,[{'external_urls': {'spotify': 'https://open.s...,97,0.746,0.765,6,-4.41,0,0.0993,0.0112,...,0.0936,0.737,114.044,audio_features,spotify:track:0t1kP63rueHleOhQkYSXFY,https://api.spotify.com/v1/tracks/0t1kP63rueHl...,https://api.spotify.com/v1/audio-analysis/0t1k...,199054,4,Spotify top


In [20]:
print('Billboard df preview')
print(billboard_songs_feat.shape)
billboard_songs_feat.to_csv('Billboard_output')
billboard_songs_feat.head(3)

Billboard df preview
(50, 21)


id,name,artists,popularity,danceability,energy,key,loudness,mode,speechiness,acousticness,...,liveness,valence,tempo,type,uri,track_href,analysis_url,duration_ms,time_signature,Chart_Type
2YpeDb67231RjR0MgVLzsG,Old Town Road - Remix,[{'external_urls': {'spotify': 'https://open.s...,83,0.878,0.619,6,-5.56,1,0.102,0.0533,...,0.113,0.639,136.041,audio_features,spotify:track:2YpeDb67231RjR0MgVLzsG,https://api.spotify.com/v1/tracks/2YpeDb67231R...,https://api.spotify.com/v1/audio-analysis/2Ype...,157067,4,Billboard top
6habFhsOp2NvshLv26DqMb,Despacito,[{'external_urls': {'spotify': 'https://open.s...,79,0.655,0.797,2,-4.787,1,0.153,0.198,...,0.067,0.839,177.928,audio_features,spotify:track:6habFhsOp2NvshLv26DqMb,https://api.spotify.com/v1/tracks/6habFhsOp2Nv...,https://api.spotify.com/v1/audio-analysis/6hab...,229360,4,Billboard top
7ySbfLwdCwl1EM0zNCJZ38,One Sweet Day,[{'external_urls': {'spotify': 'https://open.s...,69,0.568,0.495,1,-8.964,1,0.0299,0.353,...,0.0839,0.303,128.234,audio_features,spotify:track:7ySbfLwdCwl1EM0zNCJZ38,https://api.spotify.com/v1/tracks/7ySbfLwdCwl1...,https://api.spotify.com/v1/audio-analysis/7ySb...,281067,4,Billboard top


# Time to clean our data frames

In [21]:
# Start by checking the column names
print(spotify_songs_feat.columns)
print(billboard_songs_feat.columns)

Index(['track.name', 'track.album.artists', 'track.popularity', 'danceability',
       'energy', 'key', 'loudness', 'mode', 'speechiness', 'acousticness',
       'instrumentalness', 'liveness', 'valence', 'tempo', 'type', 'uri',
       'track_href', 'analysis_url', 'duration_ms', 'time_signature',
       'Chart_Type'],
      dtype='object')
Index(['name', 'artists', 'popularity', 'danceability', 'energy', 'key',
       'loudness', 'mode', 'speechiness', 'acousticness', 'instrumentalness',
       'liveness', 'valence', 'tempo', 'type', 'uri', 'track_href',
       'analysis_url', 'duration_ms', 'time_signature', 'Chart_Type'],
      dtype='object')


In [22]:
#List of columns to drop because their data is not relevant
drop_cols = ['type', 'uri','track_href', 'analysis_url', 'time_signature']

In [23]:
#Drop them from both DFs
spotify_songs_feat.drop(drop_cols,axis=1, inplace=True)
billboard_songs_feat.drop(drop_cols,axis=1, inplace=True)

In [24]:
# Next, give both DFs the same column names
columns = ['Track_Name', 'Artists', 'Popularity', 'Danceability',
       'Energy', 'Key', 'Loudness', 'Mode', 'Speechiness', 'Acousticness',
       'Instrumentalness', 'Liveness', 'Valence', 'Tempo', 'Duration', 'Chart_Type']
spotify_songs_feat.columns=columns
billboard_songs_feat.columns=columns
#Check that they match
spotify_songs_feat.columns==billboard_songs_feat.columns

array([ True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True])

In [25]:
def artist_extract(list_dict):
    #Join the names of all artists in the list of dicts into a single string
    return ' - '.join([artist['name'] for artist in list_dict])
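To illustrate what `artist_extract` does, here it is applied to a hand-made value shaped like one entry of the `Artists` column. Real entries carry more keys per artist, such as `external_urls`; the sample below is an illustrative stand-in, not API output.

```python
def artist_extract(list_dict):
    # Join the 'name' of every artist dict into one ' - ' separated string
    return ' - '.join(artist['name'] for artist in list_dict)

# Simplified stand-in for one cell of the Artists column
sample = [{'name': 'Bad Bunny', 'type': 'artist'},
          {'name': 'Jhay Cortez', 'type': 'artist'}]
print(artist_extract(sample))  # Bad Bunny - Jhay Cortez
```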

In [26]:
billboard_songs_feat['Artists']=billboard_songs_feat['Artists'].apply(artist_extract)
spotify_songs_feat['Artists']=spotify_songs_feat['Artists'].apply(artist_extract)

In [27]:
#Check whether any columns contain nulls
null_cols=spotify_songs_feat.isnull().sum()
null_cols[null_cols>0]

Series([], dtype: int64)

In [28]:
#Check the data types
print(spotify_songs_feat.dtypes)
print(billboard_songs_feat.dtypes)

Track_Name          object
Artists             object
Popularity           int64
Danceability        object
Energy              object
Key                 object
Loudness            object
Mode                object
Speechiness         object
Acousticness        object
Instrumentalness    object
Liveness            object
Valence             object
Tempo               object
Duration            object
Chart_Type          object
dtype: object
Track_Name          object
Artists             object
Popularity           int64
Danceability        object
Energy              object
Key                 object
Loudness            object
Mode                object
Speechiness         object
Acousticness        object
Instrumentalness    object
Liveness            object
Valence             object
Tempo               object
Duration            object
Chart_Type          object
dtype: object


In [29]:
#Convert the fields we need to numeric so we can analyse them
numeric_cols = ['Danceability',
       'Energy', 'Key', 'Loudness', 'Mode', 'Speechiness', 'Acousticness',
       'Instrumentalness', 'Liveness', 'Valence', 'Tempo', 'Duration']
for col in numeric_cols:
    spotify_songs_feat[col]=spotify_songs_feat[col].astype('float64')
for col in numeric_cols:
    billboard_songs_feat[col]=billboard_songs_feat[col].astype('float64')

In [30]:
#Convert the duration from milliseconds to seconds
spotify_songs_feat['Duration']=spotify_songs_feat['Duration']/1000
billboard_songs_feat['Duration']=billboard_songs_feat['Duration']/1000

In [31]:
#Look for columns with low variance, since they add little value
low_variance_spotify=[]
for col in spotify_songs_feat.select_dtypes(include='number'):#Iterate over the numeric columns only
    minimun = min(spotify_songs_feat[col]) #Smallest value in the column
    ninety_pc = np.percentile(spotify_songs_feat[col],90) #Compute the 90th percentile
    if ninety_pc == minimun: #If the 90th percentile equals the minimum, the column is not useful
        low_variance_spotify.append(col) #Collect the names of the columns that meet this condition
low_variance_spotify

[]

In [32]:
#Look for columns with low variance, since they add little value
low_variance_billboard=[]
for col in billboard_songs_feat.select_dtypes(include='number'):#Iterate over the numeric columns only
    minimun = min(billboard_songs_feat[col]) #Smallest value in the column
    ninety_pc = np.percentile(billboard_songs_feat[col],90) #Compute the 90th percentile
    if ninety_pc == minimun: #If the 90th percentile equals the minimum, the column is not useful
        low_variance_billboard.append(col) #Collect the names of the columns that meet this condition
low_variance_billboard

[]

### There are no low-variance columns to remove; our data frames are ready
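The low-variance rule used above flags a column when its 90th percentile equals its minimum, that is, when at least roughly 90% of its values sit at the minimum. The sketch below restates that rule in plain Python on made-up numbers so it can be checked by hand; `percentile` and `low_variance_columns` are hypothetical helpers, and the percentile uses linear interpolation, which is also NumPy's default method.

```python
def percentile(values, q):
    """Linear-interpolation percentile, with q between 0 and 100."""
    data = sorted(values)
    pos = (len(data) - 1) * q / 100
    lower = int(pos)
    upper = min(lower + 1, len(data) - 1)
    return data[lower] + (data[upper] - data[lower]) * (pos - lower)

def low_variance_columns(table, q=90):
    """Return the names of columns whose q-th percentile equals their minimum."""
    return [name for name, values in table.items()
            if percentile(values, q) == min(values)]

# Made-up data: one almost-constant column and one that varies
table = {
    'Instrumentalness': [0.0] * 19 + [0.5],
    'Tempo': [90.0, 110.0, 114.0, 128.0],
}
print(low_variance_columns(table))  # ['Instrumentalness']
```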

In [33]:
print('Spotify df preview')
print(spotify_songs_feat.shape)
spotify_songs_feat.to_csv('Spotify_output')
spotify_songs_feat.head(3)

Spotify df preview
(50, 16)


track.id,Track_Name,Artists,Popularity,Danceability,Energy,Key,Loudness,Mode,Speechiness,Acousticness,Instrumentalness,Liveness,Valence,Tempo,Duration,Chart_Type
3tjFYV6RSFtuktYl3ZtYcq,Mood (feat. iann dior),24kGoldn - iann dior,100,0.7,0.722,7.0,-3.558,0.0,0.0369,0.221,0.0,0.272,0.756,90.989,140.526,Spotify top
47EiUVwUp4C9fGccaPuUCS,Dakiti,Bad Bunny - Jhay Cortez,99,0.731,0.573,4.0,-10.059,0.0,0.0544,0.401,5.2e-05,0.113,0.145,109.928,205.09,Spotify top
0t1kP63rueHleOhQkYSXFY,Dynamite,BTS,97,0.746,0.765,6.0,-4.41,0.0,0.0993,0.0112,0.0,0.0936,0.737,114.044,199.054,Spotify top


In [34]:
print('Billboard df preview')
print(billboard_songs_feat.shape)
billboard_songs_feat.to_csv('Billboard_output')
billboard_songs_feat.head(3)

Billboard df preview
(50, 16)


id,Track_Name,Artists,Popularity,Danceability,Energy,Key,Loudness,Mode,Speechiness,Acousticness,Instrumentalness,Liveness,Valence,Tempo,Duration,Chart_Type
2YpeDb67231RjR0MgVLzsG,Old Town Road - Remix,Lil Nas X - Billy Ray Cyrus,83,0.878,0.619,6.0,-5.56,1.0,0.102,0.0533,0.0,0.113,0.639,136.041,157.067,Billboard top
6habFhsOp2NvshLv26DqMb,Despacito,Luis Fonsi - Daddy Yankee,79,0.655,0.797,2.0,-4.787,1.0,0.153,0.198,0.0,0.067,0.839,177.928,229.36,Billboard top
7ySbfLwdCwl1EM0zNCJZ38,One Sweet Day,Mariah Carey - Boyz II Men,69,0.568,0.495,1.0,-8.964,1.0,0.0299,0.353,0.0,0.0839,0.303,128.234,281.067,Billboard top
