## EXPLORING MY SPOTIFY DATA

An exercise using my 2023 Spotify data, based on the methodology suggested in the article "Creating a Bar Chart Race in Python using Spotify Streaming History"

Link: https://medium.com/@karolinastawicka91/creating-a-bar-chart-race-in-python-using-spotify-streaming-history-249d3eb269ab

## Installing and importing Python libraries

In [3]:
# Install missing libraries
!pip install bar_chart_race
!pip install matplotlib

Collecting bar_chart_race
  Downloading bar_chart_race-0.1.0-py3-none-any.whl (156 kB)
Installing collected packages: bar_chart_race
Successfully installed bar_chart_race-0.1.0


In [4]:
# Import the Python libraries used in the project
import pandas as pd

import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import seaborn as sns

import bar_chart_race as bcr

## Collecting and preparing the original data
* Request the download of my data from Spotify
* Save the data (sent by e-mail) to Google Drive: "/content/drive/MyDrive/Colab Repository/Spotify_Data/StreamingHistory0.json"
* Read the available .json files
* Combine the 4 "Streaming History" files into a single file
* Remove duplicate rows, if any
* Save the unified original version as .csv
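
The steps above can also be sketched with a loop instead of one `read_json` call per file. This is a minimal sketch, assuming the export files follow the `StreamingHistory<N>.json` naming used above. The helper name `load_streaming_history` and the toy records are hypothetical, and a temporary folder stands in for the Google Drive path.

```python
import json
import tempfile
from pathlib import Path

import pandas as pd


def load_streaming_history(folder):
    """Read every StreamingHistory*.json in `folder` into one DataFrame."""
    files = sorted(Path(folder).glob("StreamingHistory*.json"))
    frames = [pd.read_json(f) for f in files]
    combined = pd.concat(frames, axis=0, ignore_index=True)
    # Drop exact duplicate rows and renumber the index
    return combined.drop_duplicates().reset_index(drop=True)


# Toy records standing in for the real export (hypothetical values)
part0 = [
    {"endTime": "2023-01-04 16:45", "artistName": "A", "trackName": "x", "msPlayed": 890},
]
part1 = [
    {"endTime": "2023-01-04 16:45", "artistName": "A", "trackName": "x", "msPlayed": 890},
    {"endTime": "2023-01-04 16:49", "artistName": "B", "trackName": "y", "msPlayed": 222901},
]

with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "StreamingHistory0.json").write_text(json.dumps(part0))
    Path(tmp, "StreamingHistory1.json").write_text(json.dumps(part1))
    history = load_streaming_history(tmp)
    # Save the unified version as .csv next to the source files
    history.to_csv(Path(tmp, "my_spotify_data.csv"), index=False)

print(len(history))
```

Note that `sorted` orders the file names lexicographically, which is fine for a handful of files but would place `StreamingHistory10.json` before `StreamingHistory2.json`.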

In [4]:
# Read the .json files saved in Google Drive
sh0 = pd.read_json('/content/drive/MyDrive/Colab Repository/Spotify_Data/StreamingHistory0.json')
sh1 = pd.read_json('/content/drive/MyDrive/Colab Repository/Spotify_Data/StreamingHistory1.json')
sh2 = pd.read_json('/content/drive/MyDrive/Colab Repository/Spotify_Data/StreamingHistory2.json')
sh3 = pd.read_json('/content/drive/MyDrive/Colab Repository/Spotify_Data/StreamingHistory3.json')

In [5]:
# Check the data in each file
sh0.head()
#sh1.head()
#sh2.head()
#sh3.head()

Unnamed: 0,endTime,artistName,trackName,msPlayed
0,2023-01-04 16:45,Mustache e os Apaches,Harry Nilsson Blues,890
1,2023-01-04 16:49,Durand Jones & The Indications,Witchoo,222901
2,2023-01-04 16:53,Durand Jones & The Indications,Is It Any Wonder?,276600
3,2023-01-04 16:57,Durand Jones & The Indications,Sea Gets Hotter,193378
4,2023-01-04 18:54,Nomade Orquestra,Plena Magia,3433


In [6]:
# Concatenate the files into a single DataFrame
sh_original = pd.concat([sh0, sh1, sh2, sh3], axis=0, ignore_index=True)
sh_original #35756 rows

Unnamed: 0,endTime,artistName,trackName,msPlayed
0,2023-01-04 16:45,Mustache e os Apaches,Harry Nilsson Blues,890
1,2023-01-04 16:49,Durand Jones & The Indications,Witchoo,222901
2,2023-01-04 16:53,Durand Jones & The Indications,Is It Any Wonder?,276600
3,2023-01-04 16:57,Durand Jones & The Indications,Sea Gets Hotter,193378
4,2023-01-04 18:54,Nomade Orquestra,Plena Magia,3433
...,...,...,...,...
35751,2024-01-04 15:50,Uyara Torrente,A Temperança,264362
35752,2024-01-04 15:54,Jorge Alabe,Ijexa for Oxum,246600
35753,2024-01-04 15:56,Duina Del Mar,Gotica,104855
35754,2024-01-04 15:59,Jovem Dionisio,Amigos Até Certa Instância,191654


In [None]:
# Check for duplicate rows
print(sh_original.duplicated().sum())

# Remove duplicates
sh_original.drop_duplicates(inplace=True)

Understanding the data:
* endTime - Date and time, in UTC (Coordinated Universal Time), at which the stream ended.
* artistName - Name of the “creator” of each stream (for example, the artist's name for a song).
* trackName - Name of the item played (for example, the title of the song or video).
* msPlayed - Number of milliseconds the track was played.
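
As a quick illustration of these fields, the sketch below parses one record copied from the table above and converts `msPlayed` to seconds and minutes. Marking `endTime` as UTC follows the field description; the variable names are illustrative only.

```python
import pandas as pd

record = {
    "endTime": "2023-01-04 16:49",
    "artistName": "Durand Jones & The Indications",
    "trackName": "Witchoo",
    "msPlayed": 222901,
}

# Parse the timestamp and mark it as UTC
end_time = pd.to_datetime(record["endTime"], format="%Y-%m-%d %H:%M").tz_localize("UTC")

seconds_played = record["msPlayed"] / 1000  # milliseconds -> seconds
minutes_played = seconds_played / 60        # seconds -> minutes

print(end_time.isoformat())
print(round(seconds_played, 1), round(minutes_played, 2))
```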

In [None]:
# Save the unified "sh_original" DataFrame as .csv in the same Google Drive folder

#sh_original.to_csv('/content/drive/MyDrive/Colab Repository/Spotify_Data/my_spotify_data.csv', index=False)

## Cleaning the original data for analysis
* Read the .csv file into a DataFrame
* Remove unwanted entries (podcasts and ambient music, for example)
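
One way to do the removal step is boolean indexing with `Series.isin`, followed by `.copy()` so that later column assignments act on an independent DataFrame rather than on a slice of the original (assigning to a slice is what triggers pandas' `SettingWithCopyWarning`). This is a minimal sketch on toy data; `excluded_artists` and the rows are hypothetical.

```python
import pandas as pd

plays = pd.DataFrame({
    "endTime": ["2023-01-04 16:45", "2023-01-04 17:30", "2023-01-04 18:54"],
    "artistName": ["Mustache e os Apaches", "Mamilos", "Nomade Orquestra"],
    "msPlayed": [890, 1800000, 3433],
})

excluded_artists = ["Mamilos", "Café da Manhã"]  # podcasts to drop (illustrative)

# Keep rows whose artistName is NOT in the exclusion list;
# .copy() returns an independent DataFrame
music = plays[~plays["artistName"].isin(excluded_artists)].copy()

# Adding or overwriting columns on the copy is now safe
music["endTime"] = pd.to_datetime(music["endTime"], format="%Y-%m-%d %H:%M")
print(music["artistName"].tolist())
```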

In [5]:
# Read the unified version
sh = pd.read_csv('/content/drive/MyDrive/Colab Repository/Spotify_Data/my_spotify_data.csv')
sh

Unnamed: 0,endTime,artistName,trackName,msPlayed
0,2023-01-04 16:45,Mustache e os Apaches,Harry Nilsson Blues,890
1,2023-01-04 16:49,Durand Jones & The Indications,Witchoo,222901
2,2023-01-04 16:53,Durand Jones & The Indications,Is It Any Wonder?,276600
3,2023-01-04 16:57,Durand Jones & The Indications,Sea Gets Hotter,193378
4,2023-01-04 18:54,Nomade Orquestra,Plena Magia,3433
...,...,...,...,...
35751,2024-01-04 15:50,Uyara Torrente,A Temperança,264362
35752,2024-01-04 15:54,Jorge Alabe,Ijexa for Oxum,246600
35753,2024-01-04 15:56,Duina Del Mar,Gotica,104855
35754,2024-01-04 15:59,Jovem Dionisio,Amigos Até Certa Instância,191654


In [6]:
# artistName values to exclude (podcasts, meditation and ambient audio)
valores1 = ['The Remarkable Leadership Podcast',
            'Café com Bioeconomia',
            'Petit Journal',
            'Filosofia Vermelha',
            'The MapScaping Podcast - GIS, Geospatial, Remote Sensing, earth observation and digital geography',
            'Tape by Rob',
            'Pico dos Marins: O Caso do Escoteiro Marco Aurélio',
            '451 MHz',
            'Canal Quântico',
            'Mano a Mano',
            'MEDITATION SELF',
            'AYAOn Podcast',
            'Mamilos',
            'The Mindset Meditation Podcast',
            'Despertar Zen',
            'Bom dia, Obvious',
            'Juliana Goes Podcast',
            'Corvo Seco',
            'Alexandre',
            'Café da Manhã',
            'ESG de A a Z',
            'c.e.m - Rádio Corpo de Corpos',
            'Thais Galassi ',
            'InnerFrench',
            'Pausa de Arte ',
            'CBN Professional',
            'La Historia de España',
            'Foro de Teresina',
            'Meditação Guiada',
            'Essência Consciente',
            'Roda Viva ',
            'Priorize Você',
            'Malte Marten Konstantin Rössler',
            'Malte Marten',
            'Respondendo em Voz Alta',
            'Dj bm mix',
            'Durma com essa',
            'Productive Jazz'
            ]
# Drop the excluded artists; .copy() makes sh an independent DataFrame
sh = sh[~sh['artistName'].isin(valores1)].copy()
sh

Unnamed: 0,endTime,artistName,trackName,msPlayed
0,2023-01-04 16:45,Mustache e os Apaches,Harry Nilsson Blues,890
1,2023-01-04 16:49,Durand Jones & The Indications,Witchoo,222901
2,2023-01-04 16:53,Durand Jones & The Indications,Is It Any Wonder?,276600
3,2023-01-04 16:57,Durand Jones & The Indications,Sea Gets Hotter,193378
4,2023-01-04 18:54,Nomade Orquestra,Plena Magia,3433
...,...,...,...,...
35751,2024-01-04 15:50,Uyara Torrente,A Temperança,264362
35752,2024-01-04 15:54,Jorge Alabe,Ijexa for Oxum,246600
35753,2024-01-04 15:56,Duina Del Mar,Gotica,104855
35754,2024-01-04 15:59,Jovem Dionisio,Amigos Até Certa Instância,191654


In [7]:
# Check the data type of each column
print(sh.dtypes)

endTime       object
artistName    object
trackName     object
msPlayed       int64
dtype: object


In [8]:
# Convert 'endTime' to the 'datetime' type
sh['endTime'] = pd.to_datetime(sh['endTime'], format='%Y-%m-%d %H:%M')

print(sh.dtypes)

endTime       datetime64[ns]
artistName            object
trackName             object
msPlayed               int64
dtype: object




In [9]:
# Derive hour, day, week and month columns from endTime
sh['hour'] = sh['endTime'].dt.hour
sh['date'] = sh['endTime'].dt.to_period('D').dt.start_time
sh['week'] = sh['endTime'].dt.to_period('W').dt.start_time
sh['month'] = sh['endTime'].dt.to_period('M').dt.start_time


In [10]:
# Convert msPlayed to seconds, minutes and hours for readability
sh['sPlayed'] = sh['msPlayed']/(1000)
sh['mPlayed'] = sh['sPlayed']/(60)
sh['hPlayed'] = sh['sPlayed']/(60*60)



In [11]:
# Remove skipped tracks (played for 10 s or less) for better results
sh_no_skips = sh.loc[sh['sPlayed']>10]

# Count plays per artist per week, then accumulate the counts over time
weekly_artist = sh_no_skips.groupby([pd.Grouper(key='endTime', freq='W'),'artistName'])['trackName'].size().reset_index()
weekly_artist['no_csum'] = weekly_artist.groupby(['artistName'])['trackName'].cumsum()

# Keep only the 10 artists with the highest cumulative play count in each week
weekly_artist_top_10 = weekly_artist.set_index(['endTime', 'artistName']).groupby(level=0, group_keys=False)['no_csum'].nlargest(10)
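
To make the weekly ranking easier to follow, here is the same pattern on a tiny, made-up play history (the artists `A`, `B`, `C` and the dates are hypothetical). It counts plays per artist per week, accumulates them, and keeps the top 2 per week instead of the top 10.

```python
import pandas as pd

plays = pd.DataFrame({
    "endTime": pd.to_datetime([
        "2023-01-02 10:00", "2023-01-03 11:00", "2023-01-04 12:00",
        "2023-01-10 10:00", "2023-01-11 11:00", "2023-01-12 12:00",
    ]),
    "artistName": ["A", "A", "B", "B", "B", "C"],
    "trackName": ["a1", "a2", "b1", "b2", "b3", "c1"],
})

# Plays per artist per week (freq="W" labels each week by its closing Sunday)
weekly = (
    plays.groupby([pd.Grouper(key="endTime", freq="W"), "artistName"])["trackName"]
    .size()
    .reset_index(name="plays")
)

# Running total of plays for each artist
weekly["cum_plays"] = weekly.groupby("artistName")["plays"].cumsum()

# Top 2 artists by cumulative plays within each week
top_n = (
    weekly.set_index(["endTime", "artistName"])
    .groupby(level=0, group_keys=False)["cum_plays"]
    .nlargest(2)
)
print(top_n)
```

An artist only gets a row in the weeks where it was actually played, so `A` has no entry for the second week here. That gap is why the wide table needs forward-filling before plotting.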



## Visualizing the most played artists by week
* Format the data
* Create the bar chart visualization
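
The formatting step turns the long (week, artist) series into a wide table with one row per week and one column per artist, which is the layout `bar_chart_race` expects. A minimal sketch on hypothetical values:

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [
        (pd.Timestamp("2023-01-08"), "A"),
        (pd.Timestamp("2023-01-08"), "B"),
        (pd.Timestamp("2023-01-15"), "B"),
        (pd.Timestamp("2023-01-15"), "C"),
    ],
    names=["endTime", "artistName"],
)
top_n = pd.Series([2, 1, 3, 1], index=index, name="cum_plays")

wide = top_n.unstack()  # rows: weeks, columns: artists
wide = wide.ffill()     # carry the last known cumulative count forward
wide = wide.fillna(0)   # artists not yet seen start at zero
print(wide)
```

Forward-filling keeps an artist's last cumulative count in weeks where it has no row, and the final `fillna(0)` covers the weeks before an artist first appears.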

In [None]:
# Format the data for the bar_chart_race library
weekly_artist_top_10 = weekly_artist_top_10.unstack()
# Forward-fill gaps, then set remaining missing values to 0
weekly_artist_top_10 = weekly_artist_top_10.ffill().fillna(0)

In [12]:
# Create the animated bar chart
bcr.bar_chart_race(df=weekly_artist_top_10,
                   n_bars=10)

