# Spotify Classification Problem
### Question: How is a song’s genre categorized?
The goal is to find which predictors matter most when classifying the genre a song belongs to.

### Importing Data

In [42]:
import numpy as np
import pandas as pd
from matplotlib.pyplot import subplots 
import plotly.express as px
import statsmodels.api as sm
from ISLP import load_data
from ISLP.models import (ModelSpec as MS,
                         summarize)
from ISLP import confusion_table
from ISLP.models import contrast
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis as LDA, 
                                           QuadraticDiscriminantAnalysis as QDA)
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier 
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split 
from sklearn.linear_model import LogisticRegression

In [43]:
spotify = pd.read_csv("spotify_songs.csv") 
spotify.head()

Unnamed: 0,track_id,track_name,track_artist,track_popularity,track_album_id,track_album_name,track_album_release_date,playlist_name,playlist_id,playlist_genre,...,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,duration_ms
0,6f807x0ima9a1j3VPbc7VN,I Don't Care (with Justin Bieber) - Loud Luxur...,Ed Sheeran,66,2oCs0DGTsRO98Gh5ZSl2Cx,I Don't Care (with Justin Bieber) [Loud Luxury...,2019-06-14,Pop Remix,37i9dQZF1DXcZDD7cfEKhW,pop,...,6,-2.634,1,0.0583,0.102,0.0,0.0653,0.518,122.036,194754
1,0r7CVbZTWZgbTCYdfa2P31,Memories - Dillon Francis Remix,Maroon 5,67,63rPSO264uRjW1X5E6cWv6,Memories (Dillon Francis Remix),2019-12-13,Pop Remix,37i9dQZF1DXcZDD7cfEKhW,pop,...,11,-4.969,1,0.0373,0.0724,0.00421,0.357,0.693,99.972,162600
2,1z1Hg7Vb0AhHDiEmnDE79l,All the Time - Don Diablo Remix,Zara Larsson,70,1HoSmj2eLcsrR0vE9gThr4,All the Time (Don Diablo Remix),2019-07-05,Pop Remix,37i9dQZF1DXcZDD7cfEKhW,pop,...,1,-3.432,0,0.0742,0.0794,2.3e-05,0.11,0.613,124.008,176616
3,75FpbthrwQmzHlBJLuGdC7,Call You Mine - Keanu Silva Remix,The Chainsmokers,60,1nqYsOef1yKKuGOVchbsk6,Call You Mine - The Remixes,2019-07-19,Pop Remix,37i9dQZF1DXcZDD7cfEKhW,pop,...,7,-3.778,1,0.102,0.0287,9e-06,0.204,0.277,121.956,169093
4,1e8PAfcKUYoKkxPhrHqw4x,Someone You Loved - Future Humans Remix,Lewis Capaldi,69,7m7vv9wlQ4i0LFuJiE2zsQ,Someone You Loved (Future Humans Remix),2019-03-05,Pop Remix,37i9dQZF1DXcZDD7cfEKhW,pop,...,1,-4.672,1,0.0359,0.0803,0.0,0.0833,0.725,123.976,189052


### Cleaning Data
Remove all rows with null values and all duplicated rows.

In [44]:
spotify.columns

Index(['track_id', 'track_name', 'track_artist', 'track_popularity',
       'track_album_id', 'track_album_name', 'track_album_release_date',
       'playlist_name', 'playlist_id', 'playlist_genre', 'playlist_subgenre',
       'danceability', 'energy', 'key', 'loudness', 'mode', 'speechiness',
       'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo',
       'duration_ms'],
      dtype='object')

In [45]:
spotify.isnull().sum(), spotify.shape

(track_id                    0
 track_name                  5
 track_artist                5
 track_popularity            0
 track_album_id              0
 track_album_name            5
 track_album_release_date    0
 playlist_name               0
 playlist_id                 0
 playlist_genre              0
 playlist_subgenre           0
 danceability                0
 energy                      0
 key                         0
 loudness                    0
 mode                        0
 speechiness                 0
 acousticness                0
 instrumentalness            0
 liveness                    0
 valence                     0
 tempo                       0
 duration_ms                 0
 dtype: int64,
 (32833, 23))

In [46]:
# Drop rows with missing (NA) values from the spotify data
spotify = spotify.dropna()
spotify.isnull().sum(), spotify.shape

(track_id                    0
 track_name                  0
 track_artist                0
 track_popularity            0
 track_album_id              0
 track_album_name            0
 track_album_release_date    0
 playlist_name               0
 playlist_id                 0
 playlist_genre              0
 playlist_subgenre           0
 danceability                0
 energy                      0
 key                         0
 loudness                    0
 mode                        0
 speechiness                 0
 acousticness                0
 instrumentalness            0
 liveness                    0
 valence                     0
 tempo                       0
 duration_ms                 0
 dtype: int64,
 (32828, 23))

In [47]:
# Look for any duplicate rows
spotify.duplicated().sum()

0

In [48]:
# Drop duplicate rows anyway, just in case
spotify = spotify.drop_duplicates()
spotify.duplicated().sum(), spotify.shape

(0, (32828, 23))
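
One caveat about the duplicate check above: `duplicated()` with no arguments only flags rows that match in every column. A minimal sketch on a hypothetical toy frame (not the real data) shows how a key-based check with `subset` can give a different answer. Whether the same `track_id` actually repeats in spotify_songs.csv is not checked here, and the choice of `track_id` as the key is an assumption.

```python
import pandas as pd

# Hypothetical toy rows: the same track listed in two different playlists
toy = pd.DataFrame({
    "track_id": ["a1", "a1", "b2"],
    "playlist_id": ["p1", "p2", "p1"],
    "energy": [0.7, 0.7, 0.5],
})

# Full-row comparison: no two rows are identical in every column
full_row_dupes = int(toy.duplicated().sum())

# Key-based comparison: the second "a1" row repeats an earlier track_id
same_track_dupes = int(toy.duplicated(subset="track_id").sum())

# Keep the first occurrence of each track_id (assumed key, for illustration)
deduped = toy.drop_duplicates(subset="track_id", keep="first")
```

Whether to dedupe on a key depends on the question: if each song should count once regardless of playlist, a key-based drop may be appropriate, but it also discards the extra genre labels attached to the dropped rows.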

### Data Check:
Make sure that all types are standard dtypes

In [49]:
spotify.dtypes

track_id                     object
track_name                   object
track_artist                 object
track_popularity              int64
track_album_id               object
track_album_name             object
track_album_release_date     object
playlist_name                object
playlist_id                  object
playlist_genre               object
playlist_subgenre            object
danceability                float64
energy                      float64
key                           int64
loudness                    float64
mode                          int64
speechiness                 float64
acousticness                float64
instrumentalness            float64
liveness                    float64
valence                     float64
tempo                       float64
duration_ms                   int64
dtype: object

In [50]:
spotify.describe()

Unnamed: 0,track_popularity,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,duration_ms
count,32828.0,32828.0,32828.0,32828.0,32828.0,32828.0,32828.0,32828.0,32828.0,32828.0,32828.0,32828.0,32828.0
mean,42.483551,0.65485,0.698603,5.373949,-6.719529,0.565737,0.107053,0.175352,0.08476,0.190175,0.510556,120.883642,225796.829779
std,24.980476,0.145092,0.180916,3.611572,2.988641,0.495667,0.101307,0.219644,0.224245,0.154313,0.233152,26.903632,59836.492346
min,0.0,0.0,0.000175,0.0,-46.448,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4000.0
25%,24.0,0.563,0.581,2.0,-8.17125,0.0,0.041,0.0151,0.0,0.0927,0.331,99.961,187804.5
50%,45.0,0.672,0.721,6.0,-6.166,1.0,0.0625,0.0804,1.6e-05,0.127,0.512,121.984,216000.0
75%,62.0,0.761,0.84,9.0,-4.645,1.0,0.132,0.255,0.00483,0.248,0.693,133.91825,253581.25
max,100.0,0.983,1.0,11.0,1.275,1.0,0.918,0.994,0.994,0.996,0.991,239.44,517810.0


In [51]:
spotify.playlist_genre.unique()

array(['pop', 'rap', 'rock', 'latin', 'r&b', 'edm'], dtype=object)

In [52]:
fig = px.bar(spotify.playlist_genre.value_counts(),
             color=['red', 'blue', 'green', 'orange', 'teal', 'black'],
             text_auto=True)
fig.update_layout(width=800)
fig.show()

In [53]:
# Check to see if all songs have a genre
spotify.playlist_genre.isnull().sum() 

0

In [57]:
spotify.playlist_genre.value_counts().sum() == spotify.shape[0]

True

There are no null values in the genre column, so nothing needs to be removed. The value counts of playlist_genre also sum to the number of rows in the dataset, confirming that every song has a genre.
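
The row-count comparison works because `value_counts()` skips missing values by default, so the total only matches the number of rows when no label is missing. A small sketch on a hypothetical toy frame (the rows are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing genre label
toy = pd.DataFrame({"playlist_genre": ["pop", "rap", np.nan, "rock", "pop"]})

# value_counts() ignores NaN by default, so the total falls short of the row count
counted = int(toy.playlist_genre.value_counts().sum())
n_rows = toy.shape[0]
all_labelled = counted == n_rows

# Once the unlabelled row is dropped, the two numbers agree
toy_clean = toy.dropna(subset=["playlist_genre"])
all_labelled_clean = (
    int(toy_clean.playlist_genre.value_counts().sum()) == toy_clean.shape[0]
)
```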

In [62]:
spotify.max()

track_id                                               7zzZmpw8L66ZPjH1M6qmOs
track_name                                                    하드캐리 Hard Carry
track_artist                                                             香取慎吾
track_popularity                                                          100
track_album_id                                         7zygyMUltFYOvHoT3NOTsj
track_album_name            화양연화 The Most Beautiful Moment In Life: Young ...
track_album_release_date                                           2020-01-29
playlist_name                                               🤩🤪Post Teen Pop🤪🤩
playlist_id                                            7xWuNevFBmwnFEg6wzdCc7
playlist_genre                                                           rock
playlist_subgenre                                          urban contemporary
danceability                                                            0.983
energy                                                          

In [63]:
spotify.min()

track_id                                    0017A6SJgTbfQVU2EtsPNo
track_name                  "I TRIED FOR YEARS... NOBODY LISTENED"
track_artist                                                   !!!
track_popularity                                                 0
track_album_id                              000YOrgQoB5IiiH95Yb8vY
track_album_name                                                 !
track_album_release_date                                1957-01-01
playlist_name                                     "Permanent Wave"
playlist_id                                 0275i1VNfBnsNbPl0QIBpG
playlist_genre                                                 edm
playlist_subgenre                                       album rock
danceability                                                   0.0
energy                                                    0.000175
key                                                              0
loudness                                                   -46

<strong>Data Check Conclusion:</strong> All of the values fall within an acceptable range. Now that the data is cleaned and every observation is usable, we can explore it further and split it into training and test sets.
The minimum loudness is negative, which makes sense because loudness is measured in decibels.
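
The range check can also be done programmatically instead of by reading `min()` and `max()`. The sketch below assumes the seven listed audio features are meant to lie in [0, 1]. That bound is an assumption, not something stated in this notebook, and the helper `out_of_range_counts` and the toy rows are hypothetical.

```python
import pandas as pd

# Assumed bounds: these audio features are taken to lie in [0, 1]
UNIT_INTERVAL_COLS = ["danceability", "energy", "speechiness", "acousticness",
                      "instrumentalness", "liveness", "valence"]


def out_of_range_counts(df, cols, low=0.0, high=1.0):
    """Count, per column, the rows whose value falls outside [low, high]."""
    present = [c for c in cols if c in df.columns]  # skip columns the frame lacks
    outside = (df[present] < low) | (df[present] > high)
    return outside.sum()


# Hypothetical rows for illustration; two values are deliberately out of range
toy = pd.DataFrame({
    "danceability": [0.65, 0.98, 1.20],
    "energy": [0.70, 0.000175, 0.84],
    "valence": [0.51, -0.10, 0.99],
})
counts = out_of_range_counts(toy, UNIT_INTERVAL_COLS)
```

Calling `out_of_range_counts(spotify, UNIT_INTERVAL_COLS)` on the cleaned frame would return one count per feature, with all zeros meaning nothing falls outside the assumed bounds.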

## Training/Test Split
We will use the train_test_split function from sklearn.model_selection to split our data set into a training set and a test set, so we can see how well our model predicts genre from a set of predictors.
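
As a rough illustration of such a split, here is a minimal sketch on synthetic data. The toy frame, the two predictors, the 25% test size, the seed, and the use of `stratify` are all assumptions made for the example, not choices taken from this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned data: 20 songs per genre, two predictors
rng = np.random.default_rng(0)
genres = ["pop", "rap", "rock", "latin", "r&b", "edm"]
toy = pd.DataFrame({
    "danceability": rng.uniform(0, 1, 120),
    "energy": rng.uniform(0, 1, 120),
    "playlist_genre": np.repeat(genres, 20),
})

X = toy[["danceability", "energy"]]  # predictors
y = toy["playlist_genre"]            # response: the genre label

# Hold out 25% of rows for testing; stratify=y keeps genre proportions
# the same in both sets, and random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)
```

Stratifying on the response is one reasonable option for a multi-class problem like this one, since it keeps any genre from being under-represented in the test set by chance.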