# Spotify Top 100 Songs

### Overview:
In this study, we combine top-song data [*Insert year range once confirmed*] from **Spotify's Top 100 Songs** list with other data sources containing awards and artist information to build a machine learning model. The goal of the project is to predict how new songs will perform on Spotify's top song charts.
</br></br>

---
</br>

*Maybe explain here the model, data sources used, details/overview, etc.. once established*

</br>

---

### Group Members
* Jaquan Jones
* Marie Karibyan
* Hagop (Christian) Arabian
* Nicol Barrios
* Ani Movsesian
</br></br>



* Link to the Spotify top songs dataset:

https://www.kaggle.com/leonardopena/top-spotify-songs-from-20102019-by-year?select=top10s.csv

In [59]:
import os
import re
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

from numpy import random
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.metrics import accuracy_score 
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier


pd.set_option('display.max_rows', 500)
df = pd.read_csv("./datasets/tracks.csv")

In [32]:
print(df)

                            id                                 name  \
0       35iwgR4jXetI318WEWsa1Q                                Carve   
1       021ht4sdgPcrDgSk7JTbKY  Capítulo 2.16 - Banquero Anarquista   
2       07A5yehtSnoedViJAZkNnc   Vivo para Quererte - Remasterizado   
3       08FmqUhxtyLTn6pAh6bk45        El Prisionero - Remasterizado   
4       08y9GfoqCWfOGsKdwojr5e                  Lady of the Evening   
...                        ...                                  ...   
586667  5rgu12WBIHQtvej2MdHSH0                                  云与海   
586668  0NuWgxEp51CutD2pJoF4OM                                blind   
586669  27Y1N4Q4U3EfDU5Ubw8ws2            What They'll Say About Us   
586670  45XJsGpFTyzbzeWK8VzR8S                      A Day At A Time   
586671  5Ocn6dZ3BJFPWh4ylwFXtn                     Mar de Emociones   

        popularity  duration_ms  explicit                          artists  \
0                6       126903         0                          ['

In [62]:
print(f'df.columns => {list(df)}\n')
print(f'df.shape => {df.shape}\n')

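# Parse release_date (e.g. 2020-09-26) so it can be compared as a date
df['release_date'] = pd.to_datetime(df['release_date'])
# Keep only tracks released on or after 2000-01-01
# Note: release_date is treated as a non-numeric feature and is dropped before modeling
filtered_df = df.loc[df['release_date'] >= '2000-01-01']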
print(f'filtered_df["release_date"] => {filtered_df["release_date"]}\n')
# print(f'filtered_df => {filtered_df["release_date"].head(500)}')
print(f'filtered_df.shape => {filtered_df.shape}\n')


df.columns => ['id', 'name', 'popularity', 'duration_ms', 'explicit', 'artists', 'id_artists', 'release_date', 'danceability', 'energy', 'key', 'loudness', 'mode', 'speechiness', 'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo', 'time_signature']

df.shape => (586672, 20)

filtered_df["release_date"] => 39501    2008-02-11
39511    2020-03-13
39517    2008-02-11
39521    2008-02-11
39529    2018-05-04
            ...    
586667   2020-09-26
586668   2020-10-21
586669   2020-09-02
586670   2021-03-05
586671   2015-07-01
Name: release_date, Length: 212304, dtype: datetime64[ns]

filtered_df.shape => (212304, 20)



In [68]:
# Remove non-numeric (object-dtype) features AND release_date,
# or else error: The DType <class 'numpy.dtype[datetime64]'> could not be promoted by <class 'numpy.dtype[float64]'>
# Note: select_dtypes(include=[object]) selects only the object-dtype (string) columns
numerical_features = filtered_df.columns.drop(['release_date'] + list(filtered_df.select_dtypes(include=[object])))
print(f'numerical_features => {list(numerical_features)}\n')

numerical_features => ['popularity', 'duration_ms', 'explicit', 'danceability', 'energy', 'key', 'loudness', 'mode', 'speechiness', 'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo', 'time_signature']



In [69]:
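# NOTE: numerical_features still contains 'popularity', which is also the target y below.
# This leaks the label into X and likely explains the near-perfect decision tree accuracy further down.
X = filtered_df[list(numerical_features)]
print(f'X.columns => {X.columns}')
print(f'X.shape => {X.shape}')
y = filtered_df['popularity']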

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=4)


X.columns => Index(['popularity', 'duration_ms', 'explicit', 'danceability', 'energy',
       'key', 'loudness', 'mode', 'speechiness', 'acousticness',
       'instrumentalness', 'liveness', 'valence', 'tempo', 'time_signature'],
      dtype='object')
X.shape => (212304, 15)
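
The printed columns show that `popularity` is part of `X` while also being used as `y`. One way to avoid this is to drop the target from the feature list before splitting. The following is a minimal sketch on a small synthetic frame, not the project's actual pipeline; `toy_df`, `feature_cols`, and all values in it are made up for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Tiny synthetic stand-in for filtered_df; all values are invented for illustration.
toy_df = pd.DataFrame({
    "popularity": [10, 20, 30, 40, 50, 60, 70, 80],
    "danceability": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8],
    "energy": [0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1],
    "name": list("abcdefgh"),
})

target = "popularity"
# Keep numeric columns only, then remove the target so it cannot leak into X.
feature_cols = toy_df.select_dtypes(include="number").columns.drop(target)
X_toy = toy_df[feature_cols]
y_toy = toy_df[target]

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=4
)
print(f'X_toy.columns => {list(X_toy.columns)}')
```

With the target removed, any accuracy reported afterwards reflects what the audio features alone can predict.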


In [75]:
k = 50
knn = KNeighborsClassifier(n_neighbors=k)
knn.fit(X_train, y_train)
y_predict_knn = knn.predict(X_test)
print(f'knn accuracy => {accuracy_score(y_test, y_predict_knn)}')
print(f'knn accurately predicted records => {accuracy_score(y_test, y_predict_knn, normalize=False)}')



knn accuracy => 0.04435149596804582
knn accurately predicted records => 2354
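
Exact-match accuracy is a strict metric here, because `popularity` is a numeric score and a prediction that is off by one point counts as wrong. A possible alternative, which is an assumption and not something the notebook does, is to treat the task as regression with `KNeighborsRegressor` (already imported above) and report the mean absolute error. The sketch below uses synthetic data; `X_demo`, `y_demo`, and the choice of 10 neighbors are illustrative only.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(4)
# Synthetic features in [0, 1] and a noisy score loosely tied to the first feature.
X_demo = rng.uniform(0, 1, size=(300, 3))
y_demo = 100 * X_demo[:, 0] + rng.normal(0, 5, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=4
)

# Predict the score as the average of the 10 nearest training points.
knn_reg = KNeighborsRegressor(n_neighbors=10)
knn_reg.fit(X_tr, y_tr)
mae = mean_absolute_error(y_te, knn_reg.predict(X_te))
print(f'knn regressor MAE => {mae:.2f}')
```

The mean absolute error is expressed in popularity points, so it is easier to interpret than the share of exact matches.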


In [76]:
dt = DecisionTreeClassifier(random_state=5)
dt.fit(X_train, y_train)
y_predict_dt = dt.predict(X_test)
print(f'dt accuracy => {accuracy_score(y_test, y_predict_dt)}')
print(f'dt accurately predicted records => {accuracy_score(y_test, y_predict_dt, normalize=False)}')

dt accuracy => 0.9999623181852438
dt accurately predicted records => 53074


In [78]:
lr = LogisticRegression(max_iter=100) # 100 is the scikit-learn default; the solver failed to converge (see warning below)
lr.fit(X_train, y_train)
y_predict_lr = lr.predict(X_test)
print(f'lr accuracy => {accuracy_score(y_test, y_predict_lr)}')

lr accuracy => 0.042316677971211095


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
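
The warning above suggests either raising `max_iter` or scaling the data. A minimal sketch of the scaling route, assuming a `StandardScaler` placed in front of `LogisticRegression` with `make_pipeline`, is shown below. The data is synthetic, and `X_demo`, `y_demo`, and `max_iter=1000` are illustrative choices rather than values from the notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Two synthetic features on very different scales, loosely mimicking
# duration_ms (hundreds of thousands) and danceability (0 to 1).
X_demo = np.column_stack([
    rng.normal(200_000, 50_000, size=200),
    rng.uniform(0, 1, size=200),
])
# Hypothetical binary label derived from the second feature.
y_demo = (X_demo[:, 1] > 0.5).astype(int)

# Standardize each feature to zero mean and unit variance before fitting.
scaled_lr = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_lr.fit(X_demo, y_demo)
pred = scaled_lr.predict(X_demo)
print(f'pred.shape => {pred.shape}')
```

Putting the scaler inside the pipeline means it is fitted only on the data passed to `fit`, so the same object can later be applied to a held-out test set without refitting.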
