
# Spotify Data

In [0]:
import os
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import tree
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

## Downloading data

In [2]:
os.environ['KAGGLE_USERNAME'] = "ce889group3" # username from the json file

os.environ['KAGGLE_KEY'] = "<your-kaggle-key>" # key from the json file; keep the real value out of shared notebooks

!kaggle datasets download -d geomack/spotifyclassification

Downloading spotifyclassification.zip to /content
  0% 0.00/98.4k [00:00<?, ?B/s]
100% 98.4k/98.4k [00:00<00:00, 28.6MB/s]
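
Instead of typing the credentials into a cell, they can be read from the `kaggle.json` token file directly. The sketch below assumes the file is a JSON object with `username` and `key` entries; the helper name `load_kaggle_credentials` and the default path are hypothetical and not part of the original notebook.

```python
import json
import os


def load_kaggle_credentials(path="kaggle.json"):
    """Read a Kaggle API token file and export it as environment variables.

    Hypothetical helper: assumes the file is a JSON object with
    "username" and "key" entries.
    """
    with open(path, "r", encoding="utf-8") as handle:
        token = json.load(handle)
    os.environ["KAGGLE_USERNAME"] = token["username"]
    os.environ["KAGGLE_KEY"] = token["key"]
    return token["username"]
```

Calling `load_kaggle_credentials()` before the `!kaggle datasets download` command would set the same two environment variables without exposing the key in the notebook.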


In [0]:
! unzip -q "spotifyclassification.zip"

In [0]:
df = pd.read_csv("data.csv")

In [5]:
df.columns

Index(['Unnamed: 0', 'acousticness', 'danceability', 'duration_ms', 'energy',
       'instrumentalness', 'key', 'liveness', 'loudness', 'mode',
       'speechiness', 'tempo', 'time_signature', 'valence', 'target',
       'song_title', 'artist'],
      dtype='object')

In [6]:
print("Total count: ", df['target'].count())

Total count:  2017


In [7]:
print("Liked: ", df['target'].value_counts()[1])
print("Disliked: ", df['target'].value_counts()[0])

Liked:  1020
Disliked:  997


In [8]:
print('Like', round(df['target'].value_counts()[1]/len(df) * 100,2), '% of the dataset')
print('Dislike', round(df['target'].value_counts()[0]/len(df) * 100,2), '% of the dataset')

Like 50.57 % of the dataset
Dislike 49.43 % of the dataset
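
The counts and percentages above can also be produced in one step. A minimal sketch on a toy `target` column (the values are made up for illustration), using `value_counts(normalize=True)` to get class fractions:

```python
import pandas as pd

# Toy stand-in for df['target']: 1 = liked, 0 = disliked (illustrative values only).
target = pd.Series([1, 1, 1, 0, 0], name="target")

counts = target.value_counts()                 # absolute count per class
shares = target.value_counts(normalize=True)   # fraction of rows per class

# Combine both views into one small summary table.
summary = pd.DataFrame({"count": counts, "percent": (shares * 100).round(2)})
print(summary)
```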


## Creating the imbalance

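To create an imbalance of roughly 60 : 40 in favour of liked songs, we reduce the number of disliked songs from 997 to 612. With all 1020 liked songs kept, the resulting split is 62.5 : 37.5.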

First the two categories are split:

In [0]:
like = df.loc[df['target'] == 1]
dislike = df.loc[df['target'] == 0]

612 rows are then sampled at random from the disliked songs, with a fixed `random_state` so the sample is reproducible:

In [0]:
newdis = dislike.sample(n = 612, random_state = 1)

In [0]:
# Stack all liked songs and the sampled disliked songs into one imbalanced frame.
imb = pd.concat([like, newdis])

In [12]:
print('Like', round(imb['target'].value_counts()[1]/len(imb) * 100,2), '% of the dataset')
print('Dislike', round(imb['target'].value_counts()[0]/len(imb) * 100,2), '% of the dataset')

Like 62.5 % of the dataset
Dislike 37.5 % of the dataset
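
The arithmetic behind these percentages is easy to check. With all 1020 liked songs kept and n disliked songs sampled, the share of likes is 1020 / (1020 + n). The sketch below computes that share and, conversely, the n needed for a given target share; both helper names are hypothetical.

```python
def majority_share(n_majority, n_minority):
    """Fraction of the dataset made up by the majority class."""
    return n_majority / (n_majority + n_minority)


def minority_count_for(n_majority, target_share):
    """Minority rows needed so the majority class makes up `target_share`."""
    return round(n_majority * (1 - target_share) / target_share)


print(majority_share(1020, 612))        # share of likes when 612 dislikes are kept
print(minority_count_for(1020, 0.60))   # dislikes needed for an exact 60 : 40 split
```

Keeping 612 dislikes gives the 62.5 : 37.5 split reported above; an exact 60 : 40 split would need 680 dislikes.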


## Preprocessing

In [13]:
imb.head()

| | Unnamed: 0 | acousticness | danceability | duration_ms | energy | instrumentalness | key | liveness | loudness | mode | speechiness | tempo | time_signature | valence | target | song_title | artist |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0.0102 | 0.833 | 204600 | 0.434 | 0.0219 | 2 | 0.165 | -8.795 | 1 | 0.431 | 150.062 | 4.0 | 0.286 | 1 | Mask Off | Future |
| 1 | 1 | 0.199 | 0.743 | 326933 | 0.359 | 0.00611 | 1 | 0.137 | -10.401 | 1 | 0.0794 | 160.083 | 4.0 | 0.588 | 1 | Redbone | Childish Gambino |
| 2 | 2 | 0.0344 | 0.838 | 185707 | 0.412 | 0.000234 | 2 | 0.159 | -7.148 | 1 | 0.289 | 75.044 | 4.0 | 0.173 | 1 | Xanny Family | Future |
| 3 | 3 | 0.604 | 0.494 | 199413 | 0.338 | 0.51 | 5 | 0.0922 | -15.236 | 1 | 0.0261 | 86.468 | 4.0 | 0.23 | 1 | Master Of None | Beach House |
| 4 | 4 | 0.18 | 0.678 | 392893 | 0.561 | 0.512 | 5 | 0.439 | -11.648 | 0 | 0.0694 | 174.004 | 4.0 | 0.904 | 1 | Parallel Lines | Junior Boys |


The song title and artist name columns are removed, as they are text rather than numeric features:

In [0]:
imb = imb.drop(['song_title', 'artist'], axis = 1)

In [15]:
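# Min-max normalisation: rescale every column to the range [0, 1].
# 'target' is already 0/1, so it is left unchanged.
norm = (imb - imb.min()) / (imb.max() - imb.min())
norm.columns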

Index(['Unnamed: 0', 'acousticness', 'danceability', 'duration_ms', 'energy',
       'instrumentalness', 'key', 'liveness', 'loudness', 'mode',
       'speechiness', 'tempo', 'time_signature', 'valence', 'target'],
      dtype='object')
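
The expression above is min-max scaling: each column is mapped to [0, 1] via `(x - min) / (max - min)`. A minimal sketch on a toy frame follows. The helper name `min_max_scale` and its handling of constant columns (which would otherwise produce NaN from a 0/0 division) are assumptions added for illustration, not part of the original notebook.

```python
import pandas as pd


def min_max_scale(frame):
    """Scale every numeric column of `frame` to the range [0, 1].

    Constant columns (max == min) are mapped to 0.0 instead of NaN;
    this choice is an assumption made for the sketch.
    """
    col_min = frame.min()
    col_range = frame.max() - col_min
    safe_range = col_range.replace(0, 1)   # avoid 0/0 for constant columns
    return (frame - col_min) / safe_range


# Toy data: one varying column and one constant column (illustrative values).
toy = pd.DataFrame({"tempo": [60.0, 120.0, 180.0], "mode": [1, 1, 1]})
scaled = min_max_scale(toy)
print(scaled)
```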

## Cross-validation

In [16]:
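# Columns 1-13 are the audio features; column 0 is the leftover
# 'Unnamed: 0' row index and column 14 is 'target'.
features = list(norm.columns[1:14])
print(features)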

['acousticness', 'danceability', 'duration_ms', 'energy', 'instrumentalness', 'key', 'liveness', 'loudness', 'mode', 'speechiness', 'tempo', 'time_signature', 'valence']
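
The slice `norm.columns[1:14]` depends on column positions. As a sketch of an alternative, the same list can be built by excluding columns by name, which keeps working if the column order changes. The excluded names are taken from the `norm.columns` output; the toy frame and its values are illustrative only.

```python
import pandas as pd

# Toy frame with the same first and last columns as `norm` (values are made up).
toy = pd.DataFrame(
    {"Unnamed: 0": [0, 1], "acousticness": [0.1, 0.2], "valence": [0.3, 0.9], "target": [1, 0]}
)

# Drop the leftover row-index column and the label; everything else is a feature.
excluded = {"Unnamed: 0", "target"}
features = [col for col in toy.columns if col not in excluded]
print(features)
```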


In [0]:
x = norm[features]
y = norm['target']

In [18]:
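# Shallow decision tree. cross_validate uses its default fold count here
# (3 in scikit-learn < 0.22, 5 from 0.22 onwards) and scores by accuracy.
dt = DecisionTreeClassifier(min_samples_split = 30, max_depth = 4, random_state = 32)
dt_result = cross_validate(dt, x, y)
print("Result: ", dt_result['test_score'].mean())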

Result:  0.7377553892047054


In [19]:
# Random forest of 100 depth-limited trees, evaluated the same way as the single tree.
rf = RandomForestClassifier(n_estimators = 100, max_depth = 4, random_state = 32)
rf_result = cross_validate(rf, x, y)
print("Result: ", rf_result['test_score'].mean())

Result:  0.7445001031875574
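
Because the classes are imbalanced (62.5 % likes), mean accuracy is easier to interpret next to a majority-class baseline and a metric such as balanced accuracy. The sketch below is not the notebook's own evaluation. It uses synthetic data from `make_classification` as a stand-in for `x` and `y`, fixes the folds explicitly with `StratifiedKFold`, and compares the decision tree against a `DummyClassifier`; the names `X_demo`, `y_demo`, `folds` and `results` are introduced here for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the notebook's x and y, with roughly the same class ratio.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=13, n_informative=5,
    weights=[0.375, 0.625], random_state=32,
)

# Explicit, shuffled, stratified folds so each fold keeps the class ratio.
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=32)
metrics = ["accuracy", "balanced_accuracy"]

models = {
    "majority baseline": DummyClassifier(strategy="most_frequent"),
    "decision tree": DecisionTreeClassifier(min_samples_split=30, max_depth=4, random_state=32),
}

results = {}
for name, model in models.items():
    scores = cross_validate(model, X_demo, y_demo, cv=folds, scoring=metrics)
    # With several metrics, the result keys are 'test_<metric name>'.
    results[name] = {m: scores[f"test_{m}"].mean() for m in metrics}
    print(name, results[name])
```

On the notebook's data, a classifier that always predicts "liked" would already reach about 62.5 % accuracy, so that figure, rather than 50 %, is the natural reference point for the scores above.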
