# Most Streamed Spotify Songs 2023


## Importing the libraries

In [57]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

In [48]:
sns.set_style("darkgrid")

## Loading the .csv dataset

In [49]:
df = pd.read_csv(r"C:\Users\Jao\Documents\datasets\spotify_data.csv", encoding='latin-1')
df.head()

Unnamed: 0,track_name,artist(s)_name,artist_count,released_year,released_month,released_day,in_spotify_playlists,in_spotify_charts,streams,in_apple_playlists,...,bpm,key,mode,danceability_%,valence_%,energy_%,acousticness_%,instrumentalness_%,liveness_%,speechiness_%
0,Seven (feat. Latto) (Explicit Ver.),"Latto, Jung Kook",2,2023,7,14,553,147,141381703,43,...,125,B,Major,80,89,83,31,0,8,4
1,LALA,Myke Towers,1,2023,3,23,1474,48,133716286,48,...,92,C#,Major,71,61,74,7,0,10,4
2,vampire,Olivia Rodrigo,1,2023,6,30,1397,113,140003974,94,...,138,F,Major,51,32,53,17,0,31,6
3,Cruel Summer,Taylor Swift,1,2019,8,23,7858,100,800840817,116,...,170,A,Major,55,58,72,11,0,11,15
4,WHERE SHE GOES,Bad Bunny,1,2023,5,18,3133,50,303236322,84,...,144,A,Minor,65,23,80,14,63,11,6


## Cleaning the data


In [50]:
df.info()  # getting to know the dataset: column names, non-null count per column, and each column's dtype

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 953 entries, 0 to 952
Data columns (total 24 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   track_name            953 non-null    object
 1   artist(s)_name        953 non-null    object
 2   artist_count          953 non-null    int64 
 3   released_year         953 non-null    int64 
 4   released_month        953 non-null    int64 
 5   released_day          953 non-null    int64 
 6   in_spotify_playlists  953 non-null    int64 
 7   in_spotify_charts     953 non-null    int64 
 8   streams               953 non-null    object
 9   in_apple_playlists    953 non-null    int64 
 10  in_apple_charts       953 non-null    int64 
 11  in_deezer_playlists   953 non-null    object
 12  in_deezer_charts      953 non-null    int64 
 13  in_shazam_charts      903 non-null    object
 14  bpm                   953 non-null    int64 
 15  key                   858 non-null    ob

In [51]:
df.isna().sum()  # counting the null records in each column

track_name               0
artist(s)_name           0
artist_count             0
released_year            0
released_month           0
released_day             0
in_spotify_playlists     0
in_spotify_charts        0
streams                  0
in_apple_playlists       0
in_apple_charts          0
in_deezer_playlists      0
in_deezer_charts         0
in_shazam_charts        50
bpm                      0
key                     95
mode                     0
danceability_%           0
valence_%                0
energy_%                 0
acousticness_%           0
instrumentalness_%       0
liveness_%               0
speechiness_%            0
dtype: int64

### The 'in_shazam_charts' column has 50 missing records. Since our goal is to analyze Spotify data only, not data from other platforms, we will drop this column.

In [53]:
df = df.drop('in_shazam_charts', axis=1)


In [55]:
print(df.columns)


Index(['track_name', 'artist(s)_name', 'artist_count', 'released_year',
       'released_month', 'released_day', 'in_spotify_playlists',
       'in_spotify_charts', 'streams', 'in_apple_playlists', 'in_apple_charts',
       'in_deezer_playlists', 'in_deezer_charts', 'bpm', 'key', 'mode',
       'danceability_%', 'valence_%', 'energy_%', 'acousticness_%',
       'instrumentalness_%', 'liveness_%', 'speechiness_%'],
      dtype='object')


### There are also 95 null values in the 'key' field, the musical key of the track. This field is important and should not be ignored.

### We will use a machine learning model to fill these null values, taking the columns 'danceability_%', 'valence_%', 'energy_%', 'acousticness_%', 'instrumentalness_%', 'liveness_%', and 'speechiness_%' as features.

### The algorithm used is K-Nearest Neighbors (kNN), which classifies a new data point by the majority category among its K nearest neighbors. It operates on the principle that similar items lie close to one another in feature space.
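The majority-vote idea described above can be sketched in plain Python. This is only an illustration of the principle, not the scikit-learn implementation: the helper names `standardize` and `knn_predict` and the toy data are assumptions made for this example. Features are z-scored first so that no single column dominates the Euclidean distance.

```python
import math
from collections import Counter

def standardize(rows):
    """Z-score each column; return the scaled rows plus the means and stds used."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    # Fall back to 1.0 when a column is constant, to avoid dividing by zero.
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]
    scaled = [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]
    return scaled, means, stds

def knn_predict(train_X, train_y, query, k):
    """Return the most common label among the k training points closest to query."""
    distances = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

# Hypothetical toy data: (danceability, energy) pairs with made-up key labels.
train_X = [[80, 83], [78, 85], [82, 80], [30, 25], [28, 30], [35, 22]]
train_y = ["B", "B", "C#", "F", "F", "A"]

scaled_X, means, stds = standardize(train_X)

# Scale the query with the training means and stds, then vote among 3 neighbors.
query = [(v - m) / s for v, m, s in zip([79, 82], means, stds)]
predicted = knn_predict(scaled_X, train_y, query, k=3)
print(predicted)  # B
```

The three nearest neighbors of the query carry the labels B, B, and C#, so the vote returns B.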



In [59]:
# Split the data into two sets: rows where 'key' is present and rows where it is missing.
# .copy() makes each subset an independent DataFrame, so assigning to df_predict['key']
# further down does not trigger a SettingWithCopyWarning.
df_train = df[df['key'].notna()].copy()
df_predict = df[df['key'].isna()].copy()

# Define the features (X) and the target (y)
features = ['danceability_%', 'valence_%', 'energy_%', 'acousticness_%', 'instrumentalness_%', 'liveness_%', 'speechiness_%']
X_train = df_train[features]
y_train = df_train['key']
X_predict = df_predict[features]

# Standardize the features: fit the scaler on the training rows only, then reuse it
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_predict_scaled = scaler.transform(X_predict)

# Create and train the KNN model, using 10 neighbors
knn = KNeighborsClassifier(n_neighbors=10)
knn.fit(X_train_scaled, y_train)

# Predict 'key' for the rows where it is missing
predicted_keys = knn.predict(X_predict_scaled)

# Fill the missing values with the predictions
df_predict['key'] = predicted_keys

# Join the two sets back together
df_filled = pd.concat([df_train, df_predict])

# df_filled is now the DataFrame with the 'key' values filled in
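Predictions can also be written back into the original frame with a boolean mask and `.loc`, which avoids working on filtered slices and keeps the rows in their original order. A minimal sketch on a hypothetical toy frame (the values and the `predicted_keys` list are invented for illustration and stand in for the model output):

```python
import pandas as pd

# Hypothetical toy frame; the 'key' values are made up for this example.
toy = pd.DataFrame({
    "energy_%": [83, 74, 53, 72],
    "key": ["B", None, "F", None],
})

# Boolean mask marking the rows whose 'key' is missing.
missing = toy["key"].isna()

# Stand-in for model predictions: one value per missing row, in row order.
predicted_keys = ["C#", "A"]

# Assign through .loc on the original frame, so no intermediate slice is modified.
toy.loc[missing, "key"] = predicted_keys

print(toy["key"].tolist())  # ['B', 'C#', 'F', 'A']
```

Because the frame is never split, no `pd.concat` step is needed afterwards.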
