# Data Classification: Crafting BeatWave's Genre Tapestry

## Business Understanding

Classifying songs into different genres can aid in organizing and categorizing a large music library. This can be useful for music streaming platforms that need to categorize and tag their music content effectively. 

We are working for a fictional startup called 'BeatWave', an up-and-coming music streaming platform specializing in electronic music. Through the iterative process of data classification, we can begin to sift through popular electronic songs and give BeatWave a better understanding of which sonic metrics predict which genres. This will allow for a streamlined approach to classifying titles into their appropriate categories.

## Data Understanding

Data source: https://www.kaggle.com/datasets/mrmorj/dataset-of-songs-in-spotify

We are using an open-source dataset of sample track data from Spotify, the popular streaming platform. Working with this data will help us gauge where we stand relative to the world's leading streaming service.

Let's begin by importing the necessary libraries and superficially inspecting the dataset.

In [2]:
import pandas as pd
import numpy as np

df = pd.read_csv('Data/genres.csv')

df.head()


Unnamed: 0.1,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,...,id,uri,track_href,analysis_url,duration_ms,time_signature,genre,song_name,Unnamed: 0,title
0,0.831,0.814,2,-7.364,1,0.42,0.0598,0.0134,0.0556,0.389,...,2Vc6NJ9PW9gD9q343XFRKx,spotify:track:2Vc6NJ9PW9gD9q343XFRKx,https://api.spotify.com/v1/tracks/2Vc6NJ9PW9gD...,https://api.spotify.com/v1/audio-analysis/2Vc6...,124539,4,Dark Trap,Mercury: Retrograde,,
1,0.719,0.493,8,-7.23,1,0.0794,0.401,0.0,0.118,0.124,...,7pgJBLVz5VmnL7uGHmRj6p,spotify:track:7pgJBLVz5VmnL7uGHmRj6p,https://api.spotify.com/v1/tracks/7pgJBLVz5Vmn...,https://api.spotify.com/v1/audio-analysis/7pgJ...,224427,4,Dark Trap,Pathology,,
2,0.85,0.893,5,-4.783,1,0.0623,0.0138,4e-06,0.372,0.0391,...,0vSWgAlfpye0WCGeNmuNhy,spotify:track:0vSWgAlfpye0WCGeNmuNhy,https://api.spotify.com/v1/tracks/0vSWgAlfpye0...,https://api.spotify.com/v1/audio-analysis/0vSW...,98821,4,Dark Trap,Symbiote,,
3,0.476,0.781,0,-4.71,1,0.103,0.0237,0.0,0.114,0.175,...,0VSXnJqQkwuH2ei1nOQ1nu,spotify:track:0VSXnJqQkwuH2ei1nOQ1nu,https://api.spotify.com/v1/tracks/0VSXnJqQkwuH...,https://api.spotify.com/v1/audio-analysis/0VSX...,123661,3,Dark Trap,ProductOfDrugs (Prod. The Virus and Antidote),,
4,0.798,0.624,2,-7.668,1,0.293,0.217,0.0,0.166,0.591,...,4jCeguq9rMTlbMmPHuO7S3,spotify:track:4jCeguq9rMTlbMmPHuO7S3,https://api.spotify.com/v1/tracks/4jCeguq9rMTl...,https://api.spotify.com/v1/audio-analysis/4jCe...,123298,4,Dark Trap,Venom,,


In [3]:
df.shape

(42305, 22)

In [4]:
df.columns

Index(['danceability', 'energy', 'key', 'loudness', 'mode', 'speechiness',
       'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo',
       'type', 'id', 'uri', 'track_href', 'analysis_url', 'duration_ms',
       'time_signature', 'genre', 'song_name', 'Unnamed: 0', 'title'],
      dtype='object')

In [5]:
# Assessing the number of genres in our target variable
len(df['genre'].value_counts())

15

In [6]:
df['genre'].value_counts()

Underground Rap    5875
Dark Trap          4578
Hiphop             3028
trance             2999
trap               2987
techhouse          2975
dnb                2966
psytrance          2961
techno             2956
hardstyle          2936
RnB                2099
Trap Metal         1956
Rap                1848
Emo                1680
Pop                 461
Name: genre, dtype: int64

In [7]:
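# Note: `describe` is referenced here without parentheses, so the output below is
# the repr of the bound method (effectively the DataFrame itself), not summary
# statistics. Calling df.describe() would return the summary statistics instead.
df.describe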

<bound method NDFrame.describe of        danceability  energy  key  loudness  mode  speechiness  acousticness  \
0             0.831   0.814    2    -7.364     1       0.4200      0.059800   
1             0.719   0.493    8    -7.230     1       0.0794      0.401000   
2             0.850   0.893    5    -4.783     1       0.0623      0.013800   
3             0.476   0.781    0    -4.710     1       0.1030      0.023700   
4             0.798   0.624    2    -7.668     1       0.2930      0.217000   
...             ...     ...  ...       ...   ...          ...           ...   
42300         0.528   0.693    4    -5.148     1       0.0304      0.031500   
42301         0.517   0.768    0    -7.922     0       0.0479      0.022500   
42302         0.361   0.821    8    -3.102     1       0.0505      0.026000   
42303         0.477   0.921    6    -4.777     0       0.0392      0.000551   
42304         0.529   0.945    9    -5.862     1       0.0615      0.001890   

       instrument

## Data Preparation

In [8]:
df.isnull().sum()

danceability            0
energy                  0
key                     0
loudness                0
mode                    0
speechiness             0
acousticness            0
instrumentalness        0
liveness                0
valence                 0
tempo                   0
type                    0
id                      0
uri                     0
track_href              0
analysis_url            0
duration_ms             0
time_signature          0
genre                   0
song_name           20786
Unnamed: 0          21525
title               21525
dtype: int64

In [9]:
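# Dropping every column that contains at least one missing value
# (here: song_name, Unnamed: 0, and title)
df = df.dropna(axis=1)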

In [10]:
df.isnull().sum()

danceability        0
energy              0
key                 0
loudness            0
mode                0
speechiness         0
acousticness        0
instrumentalness    0
liveness            0
valence             0
tempo               0
type                0
id                  0
uri                 0
track_href          0
analysis_url        0
duration_ms         0
time_signature      0
genre               0
dtype: int64

In [11]:
# Getting the column names containing strings
string_columns = df.select_dtypes(include=['object']).columns

# Printing the column names containing strings
print("Columns containing strings:")
print(string_columns)

Columns containing strings:
Index(['type', 'id', 'uri', 'track_href', 'analysis_url', 'genre'], dtype='object')


## Modeling

### First Model: Decision Tree

In [14]:
# Importing necessary libraries
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Step 1: Split the data into features (X) and target variable (y)
X = df.drop(['genre', 'id', 'type', 'uri', 'track_href', 'analysis_url'], axis=1)  
y = df['genre']

# Step 2: Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


In [15]:
# Step 3: Create a decision tree model and fit it to the training data
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)

# Step 4: Evaluate the model's performance on the testing data
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Initial Model Accuracy:", accuracy)

Initial Model Accuracy: 0.5765275972107315


Not bad! Our initial model reaches about 57.7% accuracy across 15 genres, well above the roughly 14% we would get by always predicting the most common genre ("Underground Rap", 5,875 of 42,305 songs). However, accuracy alone may not give a comprehensive picture of the model's performance, especially when the dataset is imbalanced or when some classes matter more than others.

We might want to consider other evaluation metrics like precision, recall, F1-score, or area under the ROC curve (AUC-ROC) to gain a more complete understanding of the model's performance. First, let's check the class distributions to see whether we want to use macro or micro averaging in our performance metrics.
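To illustrate why accuracy alone can mislead, here is a minimal sketch on made-up labels (the `toy_true` and `toy_pred` lists are hypothetical and not drawn from the BeatWave data). A "model" that always predicts the majority genre still scores high accuracy, while per-class recall shows that the minority genre is never found.

```python
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical toy labels (not from the BeatWave data): 9 "techno" tracks, 1 "Pop" track
toy_true = ["techno"] * 9 + ["Pop"]

# A degenerate "model" that always predicts the majority genre
toy_pred = ["techno"] * 10

# Accuracy looks strong because the majority class dominates
toy_accuracy = accuracy_score(toy_true, toy_pred)

# Per-class recall exposes that the minority class is never identified
toy_recall = recall_score(toy_true, toy_pred, labels=["Pop", "techno"], average=None)

print("Accuracy:", toy_accuracy)
print("Recall [Pop, techno]:", toy_recall)
```

In this toy case accuracy is 0.9 even though recall for "Pop" is 0, which is the kind of gap that per-class metrics are meant to surface.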

In [22]:
import pandas as pd

# Creating a DataFrame to analyze the class distribution
class_distribution_df = pd.DataFrame({'genre': y_train})
class_counts = class_distribution_df['genre'].value_counts()

# Calculating class proportions
class_proportions = class_counts / len(y_train)

# Calculating the imbalance ratio
imbalance_ratio = class_counts.max() / class_counts.min()

# Printing the class counts, proportions, and imbalance ratio
print("Class Counts:")
print(class_counts)

print("\nClass Proportions:")
print(class_proportions)

print("\nImbalance Ratio:")
print(imbalance_ratio)


Class Counts:
Underground Rap    4683
Dark Trap          3608
trance             2437
Hiphop             2407
techhouse          2407
trap               2405
dnb                2367
techno             2366
psytrance          2363
hardstyle          2317
RnB                1703
Trap Metal         1572
Rap                1507
Emo                1339
Pop                 363
Name: genre, dtype: int64

Class Proportions:
Underground Rap    0.138370
Dark Trap          0.106607
trance             0.072007
Hiphop             0.071120
techhouse          0.071120
trap               0.071061
dnb                0.069939
techno             0.069909
psytrance          0.069820
hardstyle          0.068461
RnB                0.050319
Trap Metal         0.046448
Rap                0.044528
Emo                0.039564
Pop                0.010726
Name: genre, dtype: float64

Imbalance Ratio:
12.900826446280991


Class imbalance? Check. The class distribution shows that some genres have much higher sample counts than others. 

**Class Counts:**

The class counts indicate the number of samples available for each genre in our training set.
Some genres have a substantial number of samples, such as "Underground Rap" (4683 samples) and "Dark Trap" (3608 samples).
On the other hand, "Pop" has very few samples (363), making it the smallest minority class.

**Class Proportions:**

The class proportions represent the percentage of each genre's samples relative to the total number of samples in the training set.
Genres like "Underground Rap" and "Dark Trap" account for a significant portion of the training set (around 13.84% and 10.66% respectively).
"Pop" represents a very small proportion of the training set (around 1.07%).

**Imbalance Ratio:**

The imbalance ratio is the ratio between the count of samples in the majority class ("Underground Rap") and the count of samples in the minority class ("Pop"), here 4683 / 363.
In this case, the imbalance ratio is approximately 12.9, indicating a substantial class imbalance.

Given the high class imbalance in the dataset, ***using macro-averaging for precision, recall, and F1 score calculation is appropriate.*** Macro-averaging will give equal weight to each class, allowing us to evaluate the model's performance across all genres without bias towards the majority classes.
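The effect of the averaging choice can be sketched on a small, made-up example (the `demo_true` and `demo_pred` labels are hypothetical). Macro-averaging takes the unweighted mean of the per-class F1 scores, weighted averaging weights each class by its support, and micro-averaging pools all decisions, which for single-label multiclass problems equals accuracy.

```python
from sklearn.metrics import f1_score

# Hypothetical imbalanced labels: 8 "techno" tracks and 2 "Pop" tracks
demo_true = ["techno"] * 8 + ["Pop"] * 2

# The model gets every "techno" right but misses one of the two "Pop" tracks
demo_pred = ["techno"] * 8 + ["techno", "Pop"]

# Unweighted mean of per-class F1: the weak minority class counts as much as the majority
macro_f1 = f1_score(demo_true, demo_pred, average="macro")

# Support-weighted mean of per-class F1: dominated by the majority class
weighted_f1 = f1_score(demo_true, demo_pred, average="weighted")

# Pooled over all samples: equals accuracy in the single-label multiclass case
micro_f1 = f1_score(demo_true, demo_pred, average="micro")

print("Macro F1:   ", round(macro_f1, 4))
print("Weighted F1:", round(weighted_f1, 4))
print("Micro F1:   ", round(micro_f1, 4))
```

Here the macro score is the lowest of the three because the poorly recalled minority class pulls it down, which is why it is the more revealing choice when every genre should matter equally.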

In [23]:
from sklearn.metrics import precision_score, recall_score, f1_score

# Calculating precision
precision = precision_score(y_test, y_pred, average=None)

# Calculating recall
recall = recall_score(y_test, y_pred, average=None)

# Creating a new DataFrame to store the results
result_df = pd.DataFrame({
    'Class': range(len(precision)),
    'Precision': precision,
    'Recall': recall
})

# Calculating F1 score
f1 = f1_score(y_test, y_pred, average='macro')

# Appending the F1 score to the DataFrame
result_df['F1 Score'] = [f1] * len(precision)

# Mapping class indices to genre names and creating a new 'Genre' column in the DataFrame.
# With average=None, scikit-learn returns the per-class scores in sorted label order,
# so the genre names must follow that same order (not the value_counts order).
genre_labels = sorted(set(y_test) | set(y_pred))
result_df['Genre'] = [genre_labels[class_idx] for class_idx in result_df['Class']]

# Printing the results DataFrame
print(result_df)


    Class  Precision    Recall  F1 Score            Genre
0       0   0.363296  0.400000  0.556652        Dark Trap
1       1   0.547009  0.563050  0.556652              Emo
2       2   0.294207  0.310789  0.556652           Hiphop
3       3   0.118421  0.091837  0.556652              Pop
4       4   0.331731  0.404692  0.556652              Rap
5       5   0.287129  0.292929  0.556652              RnB
6       6   0.208845  0.221354  0.556652       Trap Metal
7       7   0.322981  0.261745  0.556652  Underground Rap
8       8   0.938843  0.948247  0.556652              dnb
9       9   0.846527  0.846527  0.556652        hardstyle
10     10   0.904514  0.871237  0.556652        psytrance
11     11   0.802698  0.838028  0.556652        techhouse
12     12   0.800341  0.794915  0.556652           techno
13     13   0.758803  0.766904  0.556652           trance
14     14   0.800000  0.783505  0.556652             trap


Here's how we interpret the above results dataframe:

**Precision**

- An indication of how many of the predicted positive instances are actually true positive instances.
- Higher precision values indicate that the model has ***fewer false positives for that particular class.***
- In the case of "Pop" (Class 3), the precision is 0.118421. This means that, out of all the instances the model predicted as "Pop," only around 11.84% are **correct**, while the rest are false positives.

**Recall**

- Sensitivity or true positive rate; measures how many of the actual positive instances the model correctly identified.
- Higher recall values indicate that the model has ***fewer false negatives for that particular class.***
- The recall for "Emo" (Class 1) is approximately 0.563050. This means that the model correctly identified around 56.31% of all instances of "Emo," while the rest were **missed** as false negatives.
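The two definitions above can be checked by hand from a confusion matrix. The sketch below uses a small set of made-up labels (`cm_true` and `cm_pred` are hypothetical) and compares the manual calculation with scikit-learn's `precision_score` and `recall_score`.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical three-genre example (not drawn from the BeatWave data)
cm_labels = ["Emo", "Pop", "techno"]
cm_true = ["Emo", "Emo", "Emo", "Pop", "Pop", "techno", "techno", "techno"]
cm_pred = ["Emo", "Emo", "Pop", "Emo", "Pop", "techno", "techno", "Emo"]

# Rows are the true genres, columns are the predicted genres
cm = confusion_matrix(cm_true, cm_pred, labels=cm_labels)

# True positives sit on the diagonal
true_positives = np.diag(cm)

# Precision: true positives divided by everything predicted as that genre (column sum)
precision_manual = true_positives / cm.sum(axis=0)

# Recall: true positives divided by everything that truly is that genre (row sum)
recall_manual = true_positives / cm.sum(axis=1)

# scikit-learn's per-class scores, requested in the same label order
precision_sklearn = precision_score(cm_true, cm_pred, labels=cm_labels, average=None)
recall_sklearn = recall_score(cm_true, cm_pred, labels=cm_labels, average=None)

print(cm)
print("Precision:", precision_manual)
print("Recall:   ", recall_manual)
```

Passing `labels` explicitly keeps the row and column order of the matrix and the order of the per-class scores aligned with the genre names.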

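**F1 Score**

- Harmonic mean of precision and recall.
- Provides a balanced measure of the model's performance, taking both false positives and false negatives into account.
- Especially useful when there is an ***imbalance between the number of samples in different classes.***
- Note that the `F1 Score` column above holds a single macro-averaged value (0.556652) repeated on every row, not a separate F1 score for each class.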
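As a worked example of the harmonic mean, the sketch below computes a per-class F1 score from the precision and recall in the Class 3 row of the results table (0.118421 and 0.091837). The helper `f1_from` is a hypothetical name introduced only for illustration.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Precision and recall copied from the Class 3 row of the results table
f1_class_3 = f1_from(0.118421, 0.091837)

# The harmonic mean is pulled toward the smaller of the two values,
# unlike the arithmetic mean
f1_lopsided = f1_from(0.9, 0.1)
arithmetic_mean = (0.9 + 0.1) / 2

print("F1 for Class 3:", round(f1_class_3, 4))
print("F1 for precision=0.9, recall=0.1:", round(f1_lopsided, 4))
print("Arithmetic mean of 0.9 and 0.1:", arithmetic_mean)
```

The Class 3 F1 comes out at roughly 0.103, far below the macro-averaged 0.556652, which shows how much per-class detail a single averaged figure can hide.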