# Data Classification: Crafting BeatWave's Genre Tapestry

## Business Understanding

Classifying songs by genre helps organize a large music library. This is especially useful for music streaming platforms, which need to categorize and tag their content effectively.

We are working for a fictional startup called 'BeatWave', an up-and-coming music streaming platform specializing in electronic music. Through the iterative process of data classification, we can begin to sift through popular electronic songs and give BeatWave a better understanding of which sonic metrics predict which genres. This will allow for a streamlined approach to classifying titles into their appropriate categories.

## Data Understanding

Data source: https://www.kaggle.com/datasets/mrmorj/dataset-of-songs-in-spotify 

We are using a publicly available Kaggle dataset of tracks sampled from the popular streaming platform Spotify, with per-track audio features and a genre label. Using this data will help us gauge our place in competition with the world's leading streaming service.

Let's begin by importing the necessary libraries and superficially inspecting the dataset.

In [1]:
import pandas as pd
import numpy as np

df = pd.read_csv('Data/genres.csv')

df.head()

Unnamed: 0.1,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,...,id,uri,track_href,analysis_url,duration_ms,time_signature,genre,song_name,Unnamed: 0,title
0,0.831,0.814,2,-7.364,1,0.42,0.0598,0.0134,0.0556,0.389,...,2Vc6NJ9PW9gD9q343XFRKx,spotify:track:2Vc6NJ9PW9gD9q343XFRKx,https://api.spotify.com/v1/tracks/2Vc6NJ9PW9gD...,https://api.spotify.com/v1/audio-analysis/2Vc6...,124539,4,Dark Trap,Mercury: Retrograde,,
1,0.719,0.493,8,-7.23,1,0.0794,0.401,0.0,0.118,0.124,...,7pgJBLVz5VmnL7uGHmRj6p,spotify:track:7pgJBLVz5VmnL7uGHmRj6p,https://api.spotify.com/v1/tracks/7pgJBLVz5Vmn...,https://api.spotify.com/v1/audio-analysis/7pgJ...,224427,4,Dark Trap,Pathology,,
2,0.85,0.893,5,-4.783,1,0.0623,0.0138,4e-06,0.372,0.0391,...,0vSWgAlfpye0WCGeNmuNhy,spotify:track:0vSWgAlfpye0WCGeNmuNhy,https://api.spotify.com/v1/tracks/0vSWgAlfpye0...,https://api.spotify.com/v1/audio-analysis/0vSW...,98821,4,Dark Trap,Symbiote,,
3,0.476,0.781,0,-4.71,1,0.103,0.0237,0.0,0.114,0.175,...,0VSXnJqQkwuH2ei1nOQ1nu,spotify:track:0VSXnJqQkwuH2ei1nOQ1nu,https://api.spotify.com/v1/tracks/0VSXnJqQkwuH...,https://api.spotify.com/v1/audio-analysis/0VSX...,123661,3,Dark Trap,ProductOfDrugs (Prod. The Virus and Antidote),,
4,0.798,0.624,2,-7.668,1,0.293,0.217,0.0,0.166,0.591,...,4jCeguq9rMTlbMmPHuO7S3,spotify:track:4jCeguq9rMTlbMmPHuO7S3,https://api.spotify.com/v1/tracks/4jCeguq9rMTl...,https://api.spotify.com/v1/audio-analysis/4jCe...,123298,4,Dark Trap,Venom,,


In [2]:
df.columns

Index(['danceability', 'energy', 'key', 'loudness', 'mode', 'speechiness',
       'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo',
       'type', 'id', 'uri', 'track_href', 'analysis_url', 'duration_ms',
       'time_signature', 'genre', 'song_name', 'Unnamed: 0', 'title'],
      dtype='object')

In [3]:
df['Unnamed: 0'].value_counts()

0.0        1
13955.0    1
13973.0    1
13972.0    1
13971.0    1
          ..
6994.0     1
6993.0     1
6992.0     1
6991.0     1
20999.0    1
Name: Unnamed: 0, Length: 20780, dtype: int64

In [4]:
len(df['genre'].value_counts())

15

**Things to consider**
- Low metrics: specify in the data/business understanding what a false positive or false negative implies, i.e. a song gets miscategorized, which affects the user experience (limitations, next steps).
- Reframe the business understanding to incorporate this.
- Outline what the startup would need to do to improve its model.
- Set proper expectations that metrics will be weak.
- Decision tree
- Random forest
- K-nearest neighbors
- Evaluate which is the strongest-performing algorithm.
- Final model: recommend gathering additional data, e.g. 400,000-500,000 songs/records.
- Run the final model with the additional data to see how metrics improve.
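
The modeling plan in the notes above (decision tree, random forest, k-nearest neighbors, then pick the strongest) could be sketched roughly as follows. This is a minimal illustration on synthetic data from `make_classification`, not the BeatWave dataset; the helper name `compare_models`, the hyperparameters, and the choice to standardize features for KNN are all assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier


def compare_models(X, y, random_state=42):
    """Fit each candidate on one train/test split; return test accuracy per model."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=random_state, stratify=y
    )
    candidates = {
        "decision_tree": DecisionTreeClassifier(random_state=random_state),
        "random_forest": RandomForestClassifier(
            n_estimators=100, random_state=random_state
        ),
        # KNN is distance-based, so features are standardized first
        "knn": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    }
    scores = {}
    for name, model in candidates.items():
        model.fit(X_train, y_train)
        scores[name] = accuracy_score(y_test, model.predict(X_test))
    return scores


# Synthetic stand-in for the audio features (assumption: 4 classes, 10 features)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, n_informative=6, n_classes=4, random_state=0
)
scores = compare_models(X_demo, y_demo)
best = max(scores, key=scores.get)
print(scores, "-> strongest:", best)
```

A single split is the simplest comparison; cross-validation would give a steadier ranking if the scores turn out to be close.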


In [5]:
df.shape

(42305, 22)

In [6]:
df.describe()  # call the method; without the parentheses pandas only echoes the bound-method repr

In [7]:
df['key'].head()

0    2
1    8
2    5
3    0
4    2
Name: key, dtype: int64

In [8]:
df.value_counts()

Series([], dtype: int64)
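
The empty result is most likely explained by `DataFrame.value_counts()` dropping any row that contains a missing value by default (`dropna=True`). The null counts in the data preparation step suggest that every row is missing at least one of `song_name`, `Unnamed: 0`, or `title`. A small sketch on a hypothetical toy frame (the values are made up, and `dropna=False` assumes pandas 1.3 or newer) shows the behavior:

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values): each row is missing either song_name or title
toy = pd.DataFrame({
    "genre": ["techno", "techno", "Pop"],
    "song_name": ["A", np.nan, np.nan],
    "title": [np.nan, "Mix 1", "Mix 2"],
})

default_counts = toy.value_counts()           # rows with any NaN are dropped
kept_counts = toy.value_counts(dropna=False)  # NaN is treated as a value, rows are kept
genre_counts = toy["genre"].value_counts()    # counting one complete column works

print(len(default_counts), int(kept_counts.sum()))
print(genre_counts)
```

Counting a single complete column, as the notebook does with `df['genre'].value_counts()`, sidesteps the issue.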

# DATA UNDERSTANDING

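The dataset is made up of 42,305 tracks and 22 columns: numeric audio features (such as `danceability`, `energy`, and `tempo`), identifier and URL fields, and a `genre` label with 15 distinct values.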

# DATA PREPARATION

In [9]:
df.isnull().sum()

danceability            0
energy                  0
key                     0
loudness                0
mode                    0
speechiness             0
acousticness            0
instrumentalness        0
liveness                0
valence                 0
tempo                   0
type                    0
id                      0
uri                     0
track_href              0
analysis_url            0
duration_ms             0
time_signature          0
genre                   0
song_name           20786
Unnamed: 0          21525
title               21525
dtype: int64

In [10]:
df.shape

(42305, 22)

In [11]:
df = df.dropna(axis=1)

In [12]:
df.isnull().sum()

danceability        0
energy              0
key                 0
loudness            0
mode                0
speechiness         0
acousticness        0
instrumentalness    0
liveness            0
valence             0
tempo               0
type                0
id                  0
uri                 0
track_href          0
analysis_url        0
duration_ms         0
time_signature      0
genre               0
dtype: int64

In [13]:
df.shape

(42305, 19)
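
`dropna(axis=1)` removes every column that contains at least one missing value, which is why the shape goes from 22 to 19 columns while all 42,305 rows survive. The sketch below contrasts this with the row-wise default on a hypothetical toy frame (column values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values): only 'title' contains missing entries
toy = pd.DataFrame({
    "energy": [0.8, 0.5, 0.9],
    "genre": ["techno", "Pop", "dnb"],
    "title": [np.nan, "Mix 1", np.nan],
})

cols_dropped = toy.dropna(axis=1)  # keeps all 3 rows, loses the 'title' column
rows_dropped = toy.dropna(axis=0)  # keeps all columns, loses the 2 rows with NaN

print(cols_dropped.shape, rows_dropped.shape)
```

Dropping by column keeps every track for modeling, at the cost of losing `song_name` and `title`, which would otherwise help when inspecting misclassified songs.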

In [14]:
df['genre'].value_counts()

Underground Rap    5875
Dark Trap          4578
Hiphop             3028
trance             2999
trap               2987
techhouse          2975
dnb                2966
psytrance          2961
techno             2956
hardstyle          2936
RnB                2099
Trap Metal         1956
Rap                1848
Emo                1680
Pop                 461
Name: genre, dtype: int64
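
The genre counts are noticeably imbalanced, so it helps to know what a trivial "always predict the largest genre" rule would score before judging any trained model. The sketch below uses the counts copied from the output above; the variable names are illustrative, and the share is computed on the full dataset rather than on a test split, so it is only an approximate floor.

```python
# Genre counts copied from the value_counts() output above
genre_counts = {
    "Underground Rap": 5875, "Dark Trap": 4578, "Hiphop": 3028,
    "trance": 2999, "trap": 2987, "techhouse": 2975, "dnb": 2966,
    "psytrance": 2961, "techno": 2956, "hardstyle": 2936, "RnB": 2099,
    "Trap Metal": 1956, "Rap": 1848, "Emo": 1680, "Pop": 461,
}

total = sum(genre_counts.values())
majority_genre = max(genre_counts, key=genre_counts.get)
majority_share = genre_counts[majority_genre] / total
imbalance_ratio = max(genre_counts.values()) / min(genre_counts.values())

print(f"{total} tracks; always guessing '{majority_genre}' scores {majority_share:.3f}")
print(f"largest/smallest class ratio: {imbalance_ratio:.1f}")
```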

# MODELING

## Baseline Model

In [18]:
df.columns

Index(['danceability', 'energy', 'key', 'loudness', 'mode', 'speechiness',
       'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo',
       'type', 'id', 'uri', 'track_href', 'analysis_url', 'duration_ms',
       'time_signature', 'genre'],
      dtype='object')

In [20]:
# Drop the analysis URL; reassigning instead of using inplace=True avoids pandas' SettingWithCopyWarning
df = df.drop(columns='analysis_url')

In [22]:
# Find every row in which any text column equals 'audio_features'.
# Restricting the comparison to text columns avoids comparing numbers with a string.
text_cols = df.select_dtypes(include='object')
rows_with_string = text_cols.index[text_cols.eq('audio_features').any(axis=1)]

# Print the rows containing the string 'audio_features'
print(df.loc[rows_with_string])

       danceability  energy  key  loudness  mode  speechiness  acousticness  \
0             0.831   0.814    2    -7.364     1       0.4200      0.059800   
1             0.719   0.493    8    -7.230     1       0.0794      0.401000   
2             0.850   0.893    5    -4.783     1       0.0623      0.013800   
3             0.476   0.781    0    -4.710     1       0.1030      0.023700   
4             0.798   0.624    2    -7.668     1       0.2930      0.217000   
...             ...     ...  ...       ...   ...          ...           ...   
42300         0.528   0.693    4    -5.148     1       0.0304      0.031500   
42301         0.517   0.768    0    -7.922     0       0.0479      0.022500   
42302         0.361   0.821    8    -3.102     1       0.0505      0.026000   
42303         0.477   0.921    6    -4.777     0       0.0392      0.000551   
42304         0.529   0.945    9    -5.862     1       0.0615      0.001890   

       instrumentalness  liveness  valence    tempo

In [23]:
df['type'].value_counts()

audio_features    42305
Name: type, dtype: int64
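
Because `type` holds the same value in every row, it carries no information for telling genres apart. A general way to spot such columns is sketched below on a hypothetical toy frame; the names `toy`, `constant_cols`, and `trimmed` are illustrative.

```python
import pandas as pd

# Toy frame (hypothetical values) with one constant column
toy = pd.DataFrame({
    "energy": [0.8, 0.5, 0.9],
    "type": ["audio_features"] * 3,
    "genre": ["techno", "Pop", "dnb"],
})

# A column with a single distinct value cannot help a classifier
constant_cols = [col for col in toy.columns if toy[col].nunique(dropna=False) <= 1]
trimmed = toy.drop(columns=constant_cols)

print(constant_cols, list(trimmed.columns))
```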

An earlier run of the fitting cell failed with `ValueError: could not convert string to float: 'audio_features'` because the constant `type` column was still among the features. The split below therefore drops `type` along with the identifier columns.

In [26]:
# Importing necessary libraries
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Step 1: Split the data into features (X) and target variable (y)
X = df.drop(['genre', 'id', 'type', 'uri', 'track_href'], axis=1)  
y = df['genre']

# Step 2: Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


In [27]:
# Step 3: Create a decision tree model and fit it to the training data
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)

# Step 4: Evaluate the model's performance on the testing data
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Initial Model Accuracy:", accuracy)

Initial Model Accuracy: 0.5765275972107315
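
A single accuracy figure does not show which genres get confused with each other, and that is what determines how a miscategorized song affects listeners. Per-class precision and recall, together with a confusion matrix, make this visible. The sketch below is a minimal illustration on synthetic three-class data from `make_classification`, not the BeatWave results; all names and settings are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the audio features (assumption: 3 classes, 8 features)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, n_informative=5, n_classes=3, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

tree = DecisionTreeClassifier(random_state=42).fit(X_tr, y_tr)
y_hat = tree.predict(X_te)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_te, y_hat)
report = classification_report(y_te, y_hat, output_dict=True, zero_division=0)

print(cm)
for label in ("0", "1", "2"):
    # Low precision: many false positives; low recall: many false negatives
    print(label, round(report[label]["precision"], 2), round(report[label]["recall"], 2))
```

For a genre, low precision means tracks from other genres are being tagged with it, while low recall means its own tracks are being filed elsewhere.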


In [None]:
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

# Plot only the top levels of the fitted tree; the full tree is too large to read.
# feature_names must match the columns the model was trained on (X), not df.
plt.figure(figsize=(10, 8))
plot_tree(model, max_depth=3, filled=True, rounded=True,
          feature_names=list(X.columns), class_names=list(model.classes_))
plt.show()

## Second Model

## Tuned Model

## Final Model

# EVALUATION

# CONCLUSION AND NEXT STEPS