# Building a dataset of the audio features of the 200 most-played songs on Spotify

The goal of this notebook was to collect the audio features of the 200 most-played songs on Spotify in Brazil during the first week of November 2019, according to the list published by [Spotify Charts](https://spotifycharts.com/regional/br/weekly/2019-11-01--2019-11-08).

# Collecting data from the Spotify API with the Spotipy library

## About the Spotipy library:

According to the [official Spotipy documentation](https://spotipy.readthedocs.io/en/latest/):
>"Spotipy is a lightweight Python library for the Spotify Web API. With Spotipy you get full access to all of the music data provided by the Spotify platform."


## About using the Spotify API:

Spotify offers several [API endpoints](https://beta.developer.spotify.com/documentation/web-api/reference/) for accessing its data. In this notebook, I used the [audio features endpoint](https://beta.developer.spotify.com/documentation/web-api/reference/tracks/get-several-audio-features/) to fetch the audio features of the 200 most-played songs on Spotify.


In [1]:
# loading and previewing the songs whose features will be collected

import pandas as pd

top200 = pd.read_csv('top200-nov.csv')
top200.head()

| | Position | Track | Artist | Streams | ID |
| --- | --- | --- | --- | --- | --- |
| 0 | 1 | Surtada | Dadá Boladão | 4767587 | 5F8ffc8KWKNawllr5WsW0r |
| 1 | 2 | Gaiola É o Troco | MC Du Black | 4066848 | 3Uq45ipGutypFPmETfaoaH |
| 2 | 3 | Some que ele vem atrás | Anitta | 4053084 | 2qD7VoDYcrAPY0cVEfpsR5 |
| 3 | 4 | Supera | Marília Mendonça | 4047123 | 3GmJxfnUDrIs1iCfKUELFz |
| 4 | 5 | Hit Contagiante | Felipe Original | 3600154 | 5dKesZwp6deuhEeW8F1UEi |


In [2]:
# connecting to the Spotify API through Spotipy

# pip install spotipy
import os

import spotipy
from spotipy.oauth2 import SpotifyClientCredentials

# credentials are read from environment variables instead of being hardcoded in the notebook
cid = os.environ['SPOTIPY_CLIENT_ID']
secret = os.environ['SPOTIPY_CLIENT_SECRET']

client_credentials_manager = SpotifyClientCredentials(client_id=cid, client_secret=secret)
sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager)

One limitation of the [endpoint](https://beta.developer.spotify.com/documentation/web-api/reference/tracks/get-several-audio-features/) is that it returns at most 100 songs per request. I therefore used nested for loops: the outer loop takes the track IDs in batches of 100 and queries the endpoint, and the inner loop appends each result to the `linhas` list.

I also added a check for track IDs that return no audio features, so that missing results do not cause problems later on.
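
The batching and the missing-result check can also be sketched without network access. In the minimal example below, `fetch_in_batches` and `fake_audio_features` are hypothetical names that are not part of Spotipy; the fake function only stands in for `sp.audio_features`. The sketch assumes the endpoint returns one entry per requested ID, in request order, with `None` where no features exist.

```python
def fetch_in_batches(ids, fetch, batch_size=100):
    """Call `fetch` on consecutive slices of `ids` (at most `batch_size` each).

    Returns the feature dicts that were found and the IDs that came back empty.
    """
    ids = list(ids)
    found, missing = [], []
    for start in range(0, len(ids), batch_size):
        batch = ids[start:start + batch_size]
        # Assumption: results arrive in the same order as the requested IDs
        for track_id, features in zip(batch, fetch(batch)):
            if features is None:
                missing.append(track_id)
            else:
                found.append(features)
    return found, missing


def fake_audio_features(batch):
    """Hypothetical offline stand-in for sp.audio_features."""
    assert len(batch) <= 100  # mimic the 100-ID limit per request
    return [None if tid.endswith("x") else {"id": tid, "tempo": 120.0}
            for tid in batch]


ids = [f"track{n}" for n in range(249)] + ["trackx"]
found, missing = fetch_in_batches(ids, fake_audio_features)
print(len(found), missing)
```

Recording the missing IDs, rather than only counting them, makes it possible to see afterwards which songs have no features.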

In [3]:
linhas = []          # one audio-features dict per track
batchsize = 100      # the endpoint accepts at most 100 IDs per request
indisp_count = 0     # tracks for which no features were returned

for start in range(0, len(top200['ID']), batchsize):
    # slice the next batch of IDs and request their audio features
    batch = top200['ID'][start:start + batchsize].tolist()
    feature_res = sp.audio_features(batch)
    for t in feature_res:
        if t is None:
            indisp_count += 1
        else:
            linhas.append(t)

print('Number of songs with no audio features available:', indisp_count)

Number of songs with no audio features available: 0


In [4]:
# joining the audio features to the dataset with the song title and artist

caract = pd.DataFrame.from_dict(linhas, orient='columns')
df_caract = top200.join(caract)
df_caract.head()

| | Position | Track | Artist | Streams | ID | acousticness | analysis_url | danceability | duration_ms | energy | ... | liveness | loudness | mode | speechiness | tempo | time_signature | track_href | type | uri | valence |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | Surtada | Dadá Boladão | 4767587 | 5F8ffc8KWKNawllr5WsW0r | 0.249 | https://api.spotify.com/v1/audio-analysis/5F8f... | 0.832 | 152784 | 0.55 | ... | 0.182 | -7.026 | 0 | 0.0587 | 154.064 | 4 | https://api.spotify.com/v1/tracks/5F8ffc8KWKNa... | audio_features | spotify:track:5F8ffc8KWKNawllr5WsW0r | 0.881 |
| 1 | 2 | Gaiola É o Troco | MC Du Black | 4066848 | 3Uq45ipGutypFPmETfaoaH | 0.42 | https://api.spotify.com/v1/audio-analysis/3Uq4... | 0.722 | 187246 | 0.84 | ... | 0.112 | -3.24 | 0 | 0.0785 | 150.108 | 4 | https://api.spotify.com/v1/tracks/3Uq45ipGutyp... | audio_features | spotify:track:3Uq45ipGutypFPmETfaoaH | 0.851 |
| 2 | 3 | Some que ele vem atrás | Anitta | 4053084 | 2qD7VoDYcrAPY0cVEfpsR5 | 0.0748 | https://api.spotify.com/v1/audio-analysis/2qD7... | 0.648 | 194771 | 0.795 | ... | 0.38 | -5.536 | 0 | 0.17 | 180.043 | 4 | https://api.spotify.com/v1/tracks/2qD7VoDYcrAP... | audio_features | spotify:track:2qD7VoDYcrAPY0cVEfpsR5 | 0.598 |
| 3 | 4 | Supera | Marília Mendonça | 4047123 | 3GmJxfnUDrIs1iCfKUELFz | 0.0604 | https://api.spotify.com/v1/audio-analysis/3GmJ... | 0.665 | 147748 | 0.743 | ... | 0.959 | -4.434 | 1 | 0.0567 | 131.573 | 4 | https://api.spotify.com/v1/tracks/3GmJxfnUDrIs... | audio_features | spotify:track:3GmJxfnUDrIs1iCfKUELFz | 0.658 |
| 4 | 5 | Hit Contagiante | Felipe Original | 3600154 | 5dKesZwp6deuhEeW8F1UEi | 0.132 | https://api.spotify.com/v1/audio-analysis/5dKe... | 0.819 | 137125 | 0.684 | ... | 0.0942 | -7.169 | 0 | 0.119 | 170.187 | 4 | https://api.spotify.com/v1/tracks/5dKesZwp6deu... | audio_features | spotify:track:5dKesZwp6deuhEeW8F1UEi | 0.959 |
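
The join above matches rows by position, which is safe here because no song was missing its features. If some results had been skipped, the two tables would no longer line up. The sketch below uses small made-up frames (`top200_demo` and `caract_demo` are hypothetical, not the notebook's data) to show the difference between joining by position and merging on the track ID; it is one possible safeguard, not the approach the notebook itself used.

```python
import pandas as pd

# Hypothetical toy data: three chart entries, but features for "id2" are missing
top200_demo = pd.DataFrame({"Track": ["A", "B", "C"],
                            "ID": ["id1", "id2", "id3"]})
caract_demo = pd.DataFrame([{"id": "id1", "tempo": 100.0},
                            {"id": "id3", "tempo": 140.0}])

# Positional join: the second chart row receives the features of "id3"
by_position = top200_demo.join(caract_demo)

# Key-based merge: each row is matched on its track ID, unmatched rows get NaN
by_id = top200_demo.merge(caract_demo, left_on="ID", right_on="id", how="left")

print(by_position)
print(by_id)
```

With `how="left"`, every chart entry is kept and a song without features shows up as a row of missing values instead of silently shifting the rows below it.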


In [5]:
# dropping columns that will not be needed

df_caract.drop(['ID', 'id', 'analysis_url','track_href','type','uri'], axis=1,inplace=True)

In [6]:
# checking the column types and confirming there are no null values

df_caract.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 17 columns):
Position            200 non-null int64
Track               200 non-null object
Artist              200 non-null object
Streams             200 non-null int64
acousticness        200 non-null float64
danceability        200 non-null float64
duration_ms         200 non-null int64
energy              200 non-null float64
instrumentalness    200 non-null float64
key                 200 non-null int64
liveness            200 non-null float64
loudness            200 non-null float64
mode                200 non-null int64
speechiness         200 non-null float64
tempo               200 non-null float64
time_signature      200 non-null int64
valence             200 non-null float64
dtypes: float64(9), int64(6), object(2)
memory usage: 25.0+ KB


In [7]:
# saving to a csv file

df_caract.to_csv('caract_top200.csv', sep = '|', encoding='utf-8', index_label = False)
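
Because the file uses `|` as its separator, it has to be read back with the same `sep`. The sketch below round-trips a small made-up frame through an in-memory buffer; `demo` is hypothetical data, and `index=False` is an assumption made here to leave the row index out of the file (the cell above uses `index_label=False` instead).

```python
import io

import pandas as pd

# Hypothetical two-row sample with a few of the dataset's columns
demo = pd.DataFrame({"Position": [1, 2],
                     "Track": ["Surtada", "Supera"],
                     "tempo": [154.064, 131.573]})

buffer = io.StringIO()
demo.to_csv(buffer, sep="|", index=False)  # write with a pipe separator, no index column

buffer.seek(0)
restored = pd.read_csv(buffer, sep="|")    # the same separator is needed to parse it

print(restored.columns.tolist())
```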