## Genre Classification Model
 

**Objective.** Construct a classification model that can accurately predict the genre of a song using a dataset containing Spotify track information, including artist details and genres. The model will utilize the audio features and potentially the lyrics of each track to make its predictions. Additionally, it will be trained using the artist genre list as well as a separate dataset of tracks that have been labeled with their respective genres.


Building a genre classifier involves data preprocessing, feature engineering, model selection, training, and evaluation. The steps below walk through each in turn.




In [1]:
import pandas as pd
import warnings 
warnings.simplefilter("ignore")
#from genre_seed import track_features

genres_df = pd.read_csv('../assets/data/genres_v2.csv')
genres_v2 = pd.read_csv('../assets/data/genre_seeds.csv')
#genres_v2=genres_v2.drop_duplicates(subset=['track_id']).reset_index()

In [2]:
genres_v2['genre'].value_counts()

pop         142
acoustic    139
emo         139
chill       139
grunge      134
punk        134
romance     133
rock        132
sad         132
happy       132
piano       131
hip-hop     130
indie       129
dance       127
techno       95
r-n-b        91
edm          84
Name: genre, dtype: int64

### Step 1: Data Collection
Gather the necessary datasets:
- **Spotify API**: Use the Spotify API to collect track information, including audio features (e.g., danceability, energy, key, loudness, mode, speechiness, acousticness, instrumentalness, liveness, valence, tempo).
- **Artist Genre Data**: Gather genres associated with each artist from Spotify's API.
- **Lyrics Data**: Collect the lyrics of the tracks (optional, but may improve the model's performance).
- **Genre Labels**: Obtain a labeled dataset of tracks with their respective genres for training.


We want to predict a song's genre from these audio features. Spotify exposes a set of "genre seeds": genre names that the API's recommendations endpoint accepts as input. For each genre seed we request recommended tracks, pull the audio features for each track, and attach the seed genre as the label (see the collection code at the end of this notebook).


In [3]:
df = genres_v2[['name', 'track_id', 'artist', 'popularity', 'artist_genres', 
                'acousticness', 'danceability', 'energy', 'instrumentalness', 'liveness', 
                'loudness', 'speechiness', 'tempo', 'valence', 'key', 'mode','time_signature', 'genre']].copy()  # copy so later column assignments do not write to a view of genres_v2

# Group by genre and describe
grouped_description = df.groupby('genre').describe()
df.describe()

| | popularity | acousticness | danceability | energy | instrumentalness | liveness | loudness | speechiness | tempo | valence | key | mode | time_signature |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 2143.0 | 2143.0 | 2143.0 | 2143.0 | 2143.0 | 2143.0 | 2143.0 | 2143.0 | 2143.0 | 2143.0 | 2143.0 | 2143.0 | 2143.0 |
| mean | 28.085394 | 0.229999 | 0.559915 | 0.6547 | 0.114476 | 0.18818 | -7.523677 | 0.078464 | 122.833333 | 0.438788 | 5.211386 | 0.645824 | 3.929071 |
| std | 27.611174 | 0.312395 | 0.165951 | 0.245871 | 0.265948 | 0.151159 | 4.443536 | 0.078789 | 28.301679 | 0.231271 | 3.584855 | 0.478375 | 0.377482 |
| min | 0.0 | 1e-06 | 0.0621 | 0.0015 | 0.0 | 0.0215 | -41.446 | 0.0226 | 42.646 | 0.0275 | 0.0 | 0.0 | 1.0 |
| 25% | 0.0 | 0.00525 | 0.446 | 0.4885 | 0.0 | 0.0959 | -8.931 | 0.0355 | 100.3045 | 0.255 | 2.0 | 0.0 | 4.0 |
| 50% | 26.0 | 0.0582 | 0.559 | 0.71 | 4.4e-05 | 0.126 | -6.34 | 0.0482 | 123.279 | 0.415 | 5.0 | 1.0 | 4.0 |
| 75% | 53.0 | 0.362 | 0.6785 | 0.856 | 0.01205 | 0.234 | -4.6995 | 0.0836 | 139.987 | 0.61 | 8.0 | 1.0 | 4.0 |
| max | 85.0 | 0.996 | 0.969 | 0.999 | 0.978 | 0.988 | -1.264 | 0.578 | 214.008 | 0.98 | 11.0 | 1.0 | 5.0 |


---------------------------------


### Step 2: Data Preprocessing
Clean and preprocess the data to prepare it for model training.
- **Handle Missing Values**: Remove or impute missing data.
- **Normalize/Standardize Features**: Normalize or standardize the audio features to ensure they are on a similar scale.
- **Text Preprocessing for Lyrics**: Tokenize, remove stopwords, and potentially use techniques like TF-IDF or word embeddings for the lyrics.
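As a minimal sketch of the missing-value step, the example below counts gaps and then either drops or imputes them. The toy frame and its values are made up for illustration and are not taken from the dataset; median imputation is one reasonable choice among several.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame with gaps; values are illustrative only
toy = pd.DataFrame({
    "danceability": [0.50, np.nan, 0.70, 0.60],
    "energy": [0.80, 0.40, np.nan, 0.60],
    "tempo": [120.0, 98.0, 140.0, np.nan],
})

# Count missing values per column before deciding how to handle them
missing_counts = toy.isna().sum()

# Option 1: drop every row that has at least one missing feature
toy_dropped = toy.dropna()

# Option 2: replace each gap with the median of its column
imputer = SimpleImputer(strategy="median")
toy_imputed = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)

print(missing_counts)
print(toy_imputed)
```

Dropping rows is simpler but discards data; imputation keeps every row at the cost of introducing estimated values.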





In [4]:
# Normalize audio features
from sklearn.preprocessing import StandardScaler

# Define audio features
audio_features = ['danceability', 'energy', 'key', 'loudness', 'speechiness', 'acousticness',
                  'instrumentalness', 'liveness', 'valence', 'tempo']

# Fit and transform audio features
scaler = StandardScaler()
df[audio_features] = scaler.fit_transform(df[audio_features])

------

### Step 3: Feature Engineering
Create features from the available data.
- **Audio Features**: Use the provided audio features.
- **Artist Genre Encoding**: Encode categorical features (i.e. genres) using one-hot encoding or other suitable methods.
- **Lyrics Features**: Convert lyrics into numerical features using techniques like TF-IDF, Word2Vec, or BERT embeddings.
    - Extract additional features from lyrics (e.g., sentiment analysis, topic modeling).
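The artist-genre and lyrics encodings listed above are not used in the cells that follow, so here is a hedged sketch of how they might look. It assumes that `artist_genres` is stored as the string form of a list (as it would be after a CSV round trip) and that a `lyrics` column exists; both the column name `lyrics` and the toy rows are hypothetical.

```python
import ast

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MultiLabelBinarizer

# Toy rows; genre lists are strings, as assumed for data read back from CSV
toy = pd.DataFrame({
    "artist_genres": ["['pop', 'dance pop']", "['grunge', 'rock']", "['pop']"],
    "lyrics": ["dance all night long", "loud guitars and rain", "love you all night"],
})

# Parse the string representation back into Python lists
genre_lists = toy["artist_genres"].apply(ast.literal_eval)

# Multi-hot encode: one column per artist genre, 1 where the artist carries that tag
mlb = MultiLabelBinarizer()
genre_matrix = pd.DataFrame(mlb.fit_transform(genre_lists), columns=mlb.classes_)

# TF-IDF turns each lyric into a sparse vector of weighted term frequencies
tfidf = TfidfVectorizer(stop_words="english")
lyrics_matrix = tfidf.fit_transform(toy["lyrics"])

print(genre_matrix)
print(lyrics_matrix.shape)
```

Because an artist can carry several genre tags, a multi-hot encoding fits better here than a single one-hot column.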


#### Label Encoder

In [5]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()  # Encode the target variable
df['genre_encoded'] = le.fit_transform(df['genre'])

Combine and arrange the data to create a final dataset for training the model.


In [6]:
X = df[['danceability', 'energy', 'key', 'loudness', 'speechiness',
        'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo']]
y = df['genre_encoded']  # 1-D Series; scikit-learn estimators expect a 1-D target

-----------------------

### Step 4: Model Selection and Training
Choose an appropriate classification algorithm and train the model.
- **Algorithms**: Consider using algorithms like Random Forest, Gradient Boosting, Support Vector Machine (SVM), or Neural Networks.
- **Cross-Validation**: Use cross-validation to tune hyperparameters and avoid overfitting.
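The cross-validation step can be sketched as follows. This is an illustration on synthetic data rather than the notebook's own tuning procedure: the grid values for `C` are arbitrary, and `make_classification` stands in for the ten audio features. Placing the scaler inside a pipeline means it is refit on each training fold, so the held-out fold does not influence the scaling statistics.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for 10 audio features and a multi-class genre label
X_demo, y_demo = make_classification(n_samples=300, n_features=10, n_informative=6,
                                     n_classes=4, random_state=42)

# Scaler + classifier in one estimator, refit together on every training fold
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Stratified folds keep class proportions similar across folds
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Search over the regularization strength C (grid values are arbitrary)
search = GridSearchCV(pipe, {"logisticregression__C": [0.1, 0.5, 1.0]},
                      cv=cv, scoring="accuracy")
search.fit(X_demo, y_demo)

print(search.best_params_, round(search.best_score_, 3))
```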


#### Model Training 

The following code splits the dataset into training and test subsets at random. Because `test_size` is not passed, scikit-learn's default applies: 75% of the rows go to training and 25% to testing.

In [7]:
from sklearn.model_selection import train_test_split

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)  # default test_size=0.25
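To illustrate how the split sizes come about, the sketch below uses synthetic data (an assumption, not the notebook's data) and also shows the optional `stratify` argument, which keeps each class's share roughly equal in both subsets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 200 rows, 10 features, 4 balanced classes
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 10))
y_demo = np.repeat(np.arange(4), 50)

# With no test_size given, scikit-learn holds out 25% of the rows
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

# stratify=y_demo preserves the class proportions in both subsets
X_tr_s, X_te_s, y_tr_s, y_te_s = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=42)

print(len(X_tr), len(X_te))   # sizes of the default split
print(np.bincount(y_te_s))    # per-class counts in the stratified test set
```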

#### Model Fitting

In [8]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier
from sklearn.metrics import classification_report, accuracy_score, roc_auc_score
from prettytable import PrettyTable

In [9]:
# Define models
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000, C=0.5, random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=100, max_depth=7, min_samples_split=5, random_state=42),
    "SVM": SVC(probability=True, random_state=42),
    "Neural Network": MLPClassifier(hidden_layer_sizes=(100,), max_iter=500, random_state=42),
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "Decision Tree": DecisionTreeClassifier(max_depth=7, min_samples_split=5, random_state=42),
    "Gradient Boost": GradientBoostingClassifier(n_estimators=100, learning_rate=0.05, random_state=42),
    "XGB": XGBClassifier(n_estimators=300, random_state=42)
}

-------

### Step 6: Model Evaluation
Evaluate the model's performance using appropriate metrics.
- **Metrics**: Use accuracy, precision, recall, F1-score, and confusion matrix to assess the model.
- **Validation Set**: Use a separate validation set to test the model's generalization ability.
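The evaluation cells in this notebook report aggregate scores only. As a hedged sketch on synthetic data, a confusion matrix and per-class report could be produced like this; the data and model settings are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic 4-class problem standing in for the genre data
X_demo, y_demo = make_classification(n_samples=400, n_features=10, n_informative=6,
                                     n_classes=4, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, stratify=y_demo,
                                          random_state=42)

clf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)
y_hat = clf.predict(X_te)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_te, y_hat)
print(cm)

# Per-class precision, recall, and F1 alongside the averages
print(classification_report(y_te, y_hat, zero_division=0))
```

Off-diagonal cells show which classes are confused with one another, which the aggregate scores hide.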


We compare the models on five criteria: accuracy, ROC-AUC (one-vs-rest), and weighted-average precision, recall, and F1-score.

In [10]:
# Initialize an empty list to store the results
results = []

# Train models and evaluate
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    y_prob = model.predict_proba(X_test) if hasattr(model, "predict_proba") else None

    report = classification_report(y_test, y_pred, output_dict=True, zero_division=0)

    results.append({
        "Model": name, 
        "Accuracy": accuracy_score(y_test, y_pred), 
        "ROC-AUC": roc_auc_score(y_test, y_prob, multi_class='ovr') if y_prob is not None else None,
        "Precision": report['weighted avg']['precision'], 
        "Recall": report['weighted avg']['recall'], 
        "F1-Score": report['weighted avg']['f1-score']
    })

In [11]:
# Convert results to a DataFrame
results_df = pd.DataFrame(results, columns=["Model", "Accuracy", "ROC-AUC", "Precision", "Recall", "F1-Score"])  # separate name so the training DataFrame `df` is not overwritten

# Apply styling
props = 'font-family: "Roboto"; color: #e83e8c; font-size:13px; font-weight: 800;text-transform:uppercase;'
props2 = 'font-family: "Roboto"; color: black; font-size:12px; font-weight: 400;'

styled_df = (results_df.style
             .set_table_styles([{'selector': 'td.col0', 'props': props},
                                {'selector': 'td', 'props': props2}])
             .highlight_max(props='color:white !important;background-color:#d23d75;font-weight:700 !important;font-size:13px !important;',
                            axis=0, subset=['Accuracy', 'ROC-AUC', 'Precision', 'Recall', 'F1-Score']))

# Display the styled DataFrame
styled_df

| | Model | Accuracy | ROC-AUC | Precision | Recall | F1-Score |
|---|---|---|---|---|---|---|
| 0 | Logistic Regression | 0.29291 | 0.823484 | 0.32234 | 0.29291 | 0.275275 |
| 1 | Random Forest | 0.266791 | 0.8269 | 0.257971 | 0.266791 | 0.2378 |
| 2 | SVM | 0.270522 | 0.81635 | 0.275629 | 0.270522 | 0.252128 |
| 3 | Neural Network | 0.268657 | 0.815905 | 0.270748 | 0.268657 | 0.258513 |
| 4 | KNN | 0.203358 | 0.662278 | 0.201823 | 0.203358 | 0.191504 |
| 5 | Decision Tree | 0.238806 | 0.72993 | 0.262212 | 0.238806 | 0.223554 |
| 6 | Gradient Boost | 0.270522 | 0.808805 | 0.269965 | 0.270522 | 0.263241 |
| 7 | XGB | 0.248134 | 0.778676 | 0.249968 | 0.248134 | 0.241814 |


#### Conclusion

**Logistic Regression** scores highest on accuracy (0.293), precision (0.322), recall (0.293), and F1-score (0.275), while **Random Forest** has the highest ROC-AUC (0.827), narrowly ahead of Logistic Regression (0.823). Random Forest, SVM, Neural Network, and Gradient Boost cluster closely behind on accuracy (0.267 to 0.271), and KNN is lowest on every metric.

No model exceeds 30% accuracy across the 17 genres, so the audio features alone separate these labels only weakly. The deployment step below proceeds with the Random Forest model.
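To put accuracy figures like those in the table in context, it helps to compare against a baseline that ignores the features entirely. The sketch below uses synthetic labels with 17 roughly balanced classes (an assumption that mirrors the number of genres, not the actual label distribution) and scikit-learn's `DummyClassifier`.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Synthetic labels: 17 roughly balanced classes
rng = np.random.default_rng(42)
y_demo = rng.integers(0, 17, size=2000)
X_demo = np.zeros((2000, 1))  # placeholder features; the baseline ignores them

# Always predict the most frequent class seen during fitting
baseline = DummyClassifier(strategy="most_frequent").fit(X_demo, y_demo)
baseline_acc = accuracy_score(y_demo, baseline.predict(X_demo))

print(round(baseline_acc, 3))
```

A trained model is only informative to the extent that it beats this kind of reference score.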



-------------

### Step 7: Model Deployment
Deploy the model for practical use.
- **Save the Model**: Save the trained model using libraries like joblib or pickle.
- **API Creation**: Create an API using Flask or FastAPI to make predictions on new data.
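One way to prepare for the API step is to bundle the scaler, model, and label encoder into a single file and wrap prediction in a plain function that a Flask or FastAPI handler could call. The sketch below is an assumption-laden illustration: the data is synthetic, and `genre_bundle.pkl` and `predict_genre` are hypothetical names that do not appear elsewhere in this notebook.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import LabelEncoder, StandardScaler

FEATURES = ["danceability", "energy", "key", "loudness", "speechiness",
            "acousticness", "instrumentalness", "liveness", "valence", "tempo"]

# Synthetic training data standing in for the real feature matrix
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(120, len(FEATURES)))
labels = np.array(["pop", "rock", "techno"])[rng.integers(0, 3, size=120)]

# Encode labels, then fit scaler and model together as one pipeline
encoder = LabelEncoder()
y_demo = encoder.fit_transform(labels)
pipeline = make_pipeline(StandardScaler(),
                         RandomForestClassifier(n_estimators=50, random_state=42))
pipeline.fit(X_demo, y_demo)

# Save everything needed for inference in one file (hypothetical file name)
bundle_path = os.path.join(tempfile.mkdtemp(), "genre_bundle.pkl")
joblib.dump({"pipeline": pipeline, "encoder": encoder}, bundle_path)


def predict_genre(payload, path=bundle_path):
    """Return the predicted genre label for one track given as a feature dict."""
    bundle = joblib.load(path)
    # Order the incoming values to match the training feature order
    row = np.array([[payload[name] for name in FEATURES]], dtype=float)
    code = bundle["pipeline"].predict(row)
    return bundle["encoder"].inverse_transform(code)[0]


example = dict(zip(FEATURES, X_demo[0]))
print(predict_genre(example))
```

Saving the encoder alongside the model means a fresh process can turn numeric predictions back into genre names without refitting anything.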


In [12]:
import joblib

# Train the Random Forest model
random_forest_model = RandomForestClassifier(n_estimators=100, max_depth=7, min_samples_split=5, random_state=42)
random_forest_model.fit(X_train, y_train)

# Save scaler + model for future use
joblib.dump(scaler, 'scaler.pkl')
joblib.dump(random_forest_model, 'random_forest_model.pkl')

['random_forest_model.pkl']

In [13]:
# Evaluate the model
y_pred = random_forest_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

Accuracy: 0.2667910447761194


#### Applying Model to New Data

In [14]:
# Load the trained models + scaler
random_forest_model = joblib.load('random_forest_model.pkl')
scaler = joblib.load('scaler.pkl')

In [15]:
# Load new data
df_new = pd.read_csv("../assets/data/all_tracks+lyrics.csv")

# Extract relevant columns
new_data = df_new[['name', 'danceability', 'energy', 'key', 'loudness', 'speechiness', 
                   'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo']].copy()  # copy so prediction columns can be added safely

# Preprocess new data
X_new_data = new_data.drop(columns=['name'])
X_new_data_scaled = scaler.transform(X_new_data)

In [16]:
# Make predictions
predictions = random_forest_model.predict(X_new_data_scaled)
probabilities = random_forest_model.predict_proba(X_new_data_scaled)

# Decode the predicted genre labels
predictions_label = le.inverse_transform(predictions)

# Display predictions
new_data['Predicted Genre'] = predictions
new_data['Predicted Genre Label'] = predictions_label
new_data['Prediction Probabilities'] = probabilities.tolist()

new_data[['name', 'Predicted Genre Label', 'Predicted Genre', 'Prediction Probabilities']]

| | name | Predicted Genre Label | Predicted Genre | Prediction Probabilities |
|---|---|---|---|---|
| 0 | Please Please Please | pop | 10 | [0.04088154652446443, 0.07714395919805248, 0.0... |
| 1 | Si Antes Te Hubiera Conocido | pop | 10 | [0.052278933050807844, 0.06228149646432559, 0.... |
| 2 | BIRDS OF A FEATHER | dance | 2 | [0.07264102912485536, 0.10432647001182196, 0.1... |
| 3 | Good Luck, Babe! | pop | 10 | [0.05870991975676431, 0.09131179743471457, 0.0... |
| 4 | A Bar Song (Tipsy) | pop | 10 | [0.04773552198684028, 0.0836338792955852, 0.04... |
| 5 | Not Like Us | hip-hop | 7 | [0.02365012599849557, 0.07161269506638383, 0.0... |
| 6 | MILLION DOLLAR BABY | pop | 10 | [0.02922418578908816, 0.0647413931106139, 0.08... |
| 7 | Too Sweet | pop | 10 | [0.025137350612849513, 0.07401057278171215, 0.... |
| 8 | Beautiful Things | romance | 14 | [0.05636255764656002, 0.06888807365798695, 0.0... |
| 9 | I Had Some Help (Feat. Morgan Wallen) | happy | 6 | [0.0324812151873648, 0.07718756832982102, 0.09... |


-----

In [17]:
import pandas as pd
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials
import warnings
warnings.simplefilter("ignore")

# import track_data
genres_v2 = pd.read_csv("../assets/data/genre_seeds.csv")

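# Read credentials from the environment instead of hard-coding secrets in the notebook
import os
client_id = os.environ["SPOTIPY_CLIENT_ID"]
client_secret = os.environ["SPOTIPY_CLIENT_SECRET"]
sp = spotipy.Spotify(auth_manager=SpotifyClientCredentials(client_id=client_id, client_secret=client_secret))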


def track_features(id, artist_id, note):
    meta = sp.track(id)
    audio_features = sp.audio_features(id)
    artist_info = sp.artist(artist_id)

    if audio_features[0] is None:
        return None

    name = meta['name']
    track_id = meta['id']
    album = meta['album']['name']
    artist = meta['album']['artists'][0]['name']
    artist_id = meta['album']['artists'][0]['id']
    release_date = meta['album']['release_date']
    length = meta['duration_ms']
    popularity = meta['popularity']

    artist_pop = artist_info["popularity"]
    artist_genres = artist_info["genres"]
    artist_followers = artist_info["followers"]['total']

    acousticness = audio_features[0]['acousticness']
    danceability = audio_features[0]['danceability']
    energy = audio_features[0]['energy']
    instrumentalness = audio_features[0]['instrumentalness']
    liveness = audio_features[0]['liveness']
    loudness = audio_features[0]['loudness']
    speechiness = audio_features[0]['speechiness']
    tempo = audio_features[0]['tempo']
    valence = audio_features[0]['valence']
    key = audio_features[0]['key']
    mode = audio_features[0]['mode']
    time_signature = audio_features[0]['time_signature']

    return [name, track_id, album, artist, artist_id, release_date, length, popularity,
            artist_pop, artist_genres, artist_followers, acousticness, danceability,
            energy, instrumentalness, liveness, loudness, speechiness,
            tempo, valence, key, mode, time_signature, note]


In [18]:


# sp.recommendation_genre_seeds() "trip-hop", "trance"
genre_seeds = ["acoustic", "chill", "dance", "edm", "emo", "grunge", "happy", "hip-hop", "indie",
               "piano", "pop", "punk", "rock", "romance", "sad", "techno", "r-n-b"]

all_genre_seed_tracks = []

for genre in genre_seeds:
    genre_rec = sp.recommendations(seed_genres=[genre])['tracks']

    for song in genre_rec:
        song_id = song['id']
        song_artist_id = song['artists'][0]['id']
        song_audio = track_features(
            id=song_id, artist_id=song_artist_id, note=genre)
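        # track_features returns None when Spotify has no audio features; skip those tracks
        if song_audio is not None:
            all_genre_seed_tracks.append(song_audio)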


df = pd.DataFrame(all_genre_seed_tracks,
                  columns=['name', 'track_id', 'album', 'artist', 'artist_id', 'release_date', 'length', 'popularity',
                           'artist_pop', 'artist_genres', 'artist_followers', 'acousticness', 'danceability',
                           'energy', 'instrumentalness', 'liveness', 'loudness', 'speechiness',
                           'tempo', 'valence', 'key', 'mode', 'time_signature', 'genre'])



In [19]:

# DataFrame.append was removed in pandas 2.0; pd.concat is the supported equivalent
df_add = pd.concat([df, genres_v2], ignore_index=True)
df_add = df_add.drop_duplicates(subset=['track_id', 'genre'])
#df_add = df_add.drop(columns=['Unnamed: 0.1', 'Unnamed: 0'])

df_add.to_csv("../assets/data/genre_seeds.csv", index=False)
