[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/lyeskhalil/mlbootcamp2022/blob/main/Spotify.ipynb)

In [1]:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

In [2]:
from pathlib import Path
from os import system
# Download the dataset only if it is not already in the working directory
if not Path('songs_normalize.csv').exists():
    system('wget --no-check-certificate --content-disposition https://github.com/lyeskhalil/mlbootcamp2022/raw/main/songs_normalize.csv')

We are going to use data from Spotify to predict which genres a song is labelled with. Load the data below and take a look:

In [4]:
df = pd.read_csv('songs_normalize.csv')
df.head()

Unnamed: 0,artist,song,duration_ms,explicit,year,popularity,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,genre
0,Britney Spears,Oops!...I Did It Again,211160,False,2000,77,0.751,0.834,1,-5.444,0,0.0437,0.3,1.8e-05,0.355,0.894,95.053,pop
1,blink-182,All The Small Things,167066,False,1999,79,0.434,0.897,0,-4.918,1,0.0488,0.0103,0.0,0.612,0.684,148.726,"rock, pop"
2,Faith Hill,Breathe,250546,False,1999,66,0.529,0.496,7,-9.007,1,0.029,0.173,0.0,0.251,0.278,136.859,"pop, country"
3,Bon Jovi,It's My Life,224493,False,2000,78,0.551,0.913,0,-4.063,0,0.0466,0.0263,1.3e-05,0.347,0.544,119.992,"rock, metal"
4,*NSYNC,Bye Bye Bye,200560,False,2000,65,0.614,0.928,8,-4.806,0,0.0516,0.0408,0.00104,0.0845,0.879,172.656,pop


This is some of the metadata that Spotify uses to make music recommendations. Most of these features are calculated from the music itself using deep learning. At the moment, the genre column stores all of a song's genres in a single comma-separated string, so let's convert it into a table with one column per genre:

In [6]:
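y = []
for idx, row in df.iterrows():
    y_temp = {'idx': idx}
    # Genres are separated by commas, e.g. "rock, pop"
    for genre in row.genre.split(','):
        # Some entries combine two genres with a slash; treat each part as its own genre
        for g in genre.split('/'):
            y_temp[g.strip().lower()] = 1
    y.append(y_temp)
# One row per song and one 0/1 column per genre; 'set()' is a placeholder for "no genre", so drop it
y = pd.DataFrame(y).set_index('idx').fillna(0).drop(columns='set()')
y.head()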

idx,pop,rock,country,metal,hip hop,r&b,dance,electronic,folk,acoustic,easy listening,latin,blues,world,traditional,jazz,classical
0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
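For comparison, the same kind of genre table can probably be built without an explicit loop. The snippet below is only a sketch on a tiny made-up `sample` series (not the real dataset), and it assumes that genres are separated by commas or slashes and that `set()` marks a song with no genre, as in the loop-based cell. `Series.str.get_dummies` returns integer 0/1 columns in alphabetical order, so the column order differs from the table above.

```python
import pandas as pd

# Hypothetical stand-in for df['genre']; the real column has 2000 entries
sample = pd.Series(["pop", "rock, pop", "Folk/Acoustic, pop", "set()"], name="genre")

cleaned = (
    sample.str.lower()
    .str.replace("/", ",", regex=False)        # treat "a/b" like "a, b"
    .str.replace(r"\s*,\s*", ",", regex=True)  # remove spaces around commas
    .str.strip()
)

# One 0/1 column per distinct genre
multi_hot = cleaned.str.get_dummies(sep=",")

# Drop the "no genre" placeholder column if it is present
multi_hot = multi_hot.drop(columns="set()", errors="ignore")
print(multi_hot)
```

On the real data, replacing `sample` with `df['genre']` should give a table equivalent to `y` up to column order and dtype, but that has not been checked here.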


Much better than a single string column! Now we will split the data into training and test sets (don't change this!) and fit a very basic decision tree model:

In [7]:
X = df[[col for col in df.columns if col not in ['artist','song','genre']]]

In [8]:
X_train, X_test, y_train, y_test = train_test_split(X,y,train_size=0.9,random_state=0)

In [9]:
tree = DecisionTreeClassifier().fit(X_train,y_train)
tree.score(X_test,y_test)

0.265

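This performance is... not great! But it could be worse: random performance would be around 0.05. Keep in mind that with several genre columns as the target, `score` reports subset accuracy, so a song only counts as correct when every one of its genre labels is predicted correctly. What is the best performance you can achieve?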
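To make the scoring rule concrete, here is a small sketch on synthetic multi-label data. `make_multilabel_classification` and all the `demo`/`d_` names are stand-ins chosen for illustration, not part of the Spotify exercise. It checks that `score` matches the fraction of rows where every label is right, and also computes a micro-averaged F1, a more forgiving metric that gives credit for partially correct rows.

```python
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for (X, y): 500 samples, 12 features, 5 binary labels
X_demo, y_demo = make_multilabel_classification(
    n_samples=500, n_features=12, n_classes=5, random_state=0
)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, train_size=0.9, random_state=0
)

demo_tree = DecisionTreeClassifier(random_state=0).fit(Xd_train, yd_train)
pred = demo_tree.predict(Xd_test)

# Subset accuracy: a row is correct only if all of its labels match
subset_acc = accuracy_score(yd_test, pred)
exact_rows = np.mean((pred == yd_test).all(axis=1))

# Micro-averaged F1 counts every individual label decision
micro_f1 = f1_score(yd_test, pred, average="micro", zero_division=0)

print(f"score():         {demo_tree.score(Xd_test, yd_test):.3f}")
print(f"subset accuracy: {subset_acc:.3f}")
print(f"exact-match rows:{exact_rows:.3f}")
print(f"micro F1:        {micro_f1:.3f}")
```

The first three numbers should agree. Tracking a per-label metric such as micro F1 next to the subset accuracy can help show whether a new model is getting closer on individual genres even when the exact-match score barely moves.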

Note that you can also take advantage of the artist and song name columns:

In [21]:
df[['artist','song']]

Unnamed: 0,artist,song
0,Britney Spears,Oops!...I Did It Again
1,blink-182,All The Small Things
2,Faith Hill,Breathe
3,Bon Jovi,It's My Life
4,*NSYNC,Bye Bye Bye
...,...,...
1995,Jonas Brothers,Sucker
1996,Taylor Swift,Cruel Summer
1997,Blanco Brown,The Git Up
1998,Sam Smith,Dancing With A Stranger (with Normani)
