# AnaLyrics Engine: Predicting Music Genres with NLP

- This project leverages song lyrics and metadata to classify songs into distinct music genres.
- The dataset consists of ~2,000 songs distributed across four genres: pop, rock, hip-hop, and country.

**Data Collection**

The first step in building the dataset was identifying the prediction target: music genres. [Every Noise At Once](https://everynoise.com/engenremap.html) is "an ongoing attempt at an algorithmically-generated, readability-adjusted scatter-plot of the musical genre-space." The project was created in 2013 by Glenn McDonald, a former Spotify developer, and is based on data tracked and analyzed for ~6,000 Spotify genres (last updated November 2023).

The songs for each genre were collected from the playlists provided by the Every Noise project, and all of the song metadata and audio features were gathered using [Spotipy](https://spotipy.readthedocs.io/en/2.22.1/), a Python wrapper for the [Spotify API](https://developer.spotify.com/documentation/web-api). Next, lyrics for each song were scraped from [Genius](https://genius.com) using [LyricsGenius](https://lyricsgenius.readthedocs.io/en/master/reference/genius.html).

After combining all the data and dropping bad records, we were left with 2024 songs.

From Spotify we use audio features such as acousticness, instrumentalness, and valence. From Genius we use the full lyric text of each song. Together, the audio features and the lyrics describe each song both musically and textually.

Preprocessing:
To prepare the data for analysis, we perform preprocessing steps such as text cleaning, tokenization, and vectorization of lyrics. This allows us to convert the textual content into a format suitable for NLP models.

Feature Engineering:
We engineer features from both Spotify and Genius data, creating a feature-rich dataset that encapsulates both the musical and lyrical aspects of each song. This combined feature set serves as the input for our genre classification model.
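
One possible way to combine lyric text with numeric audio features is scikit-learn's `ColumnTransformer`. The sketch below is an assumption about how such a combination could look, not the project's actual code: the frame `toy`, its values, and the choice of `StandardScaler` are hypothetical, and only the column names echo the features mentioned above.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler

# Hypothetical frame: column names mirror the features named above,
# values are made up for illustration
toy = pd.DataFrame({
    "tokens": ["truck road beer", "money street night", "guitar road heart"],
    "acousticness": [0.80, 0.10, 0.45],
    "valence": [0.60, 0.30, 0.50],
})

combine = ColumnTransformer([
    # a single column name (string) passes 1-D text to the vectorizer
    ("lyrics", TfidfVectorizer(), "tokens"),
    # a list of names passes a 2-D numeric block to the scaler
    ("audio", StandardScaler(), ["acousticness", "valence"]),
])

features = combine.fit_transform(toy)
print(features.shape)  # rows = songs, columns = vocabulary size + 2 audio features
```

Standardized values can be negative, so this pairing would suit a model such as a random forest better than `MultinomialNB`, which expects non-negative inputs.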

Model Building:
For genre classification, we employ machine learning models, such as Naive Bayes or Random Forests, trained on the labeled dataset. These models learn patterns from the features extracted from Spotify data and lyrics, enabling them to make predictions about the genre of a song.

Hyperparameter Tuning:
To improve model performance, we tune hyperparameters with grid search and cross-validation. Scoring each candidate on folds it was not trained on helps select settings that generalize to unseen data and reduces the risk of overfitting.

Evaluation:
We evaluate the performance of our genre classification models using metrics like accuracy, precision, recall, and F1-score. This provides insights into how well the models are able to distinguish between different genres.
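
To make the metrics concrete, here is a small sketch on six made-up labels (the labels and variable names are illustrative only), using `accuracy_score` and `precision_recall_fscore_support` from scikit-learn:

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Made-up labels for six songs (illustrative only)
y_true = ["pop", "pop", "rock", "rock", "country", "country"]
y_pred = ["pop", "rock", "rock", "rock", "country", "pop"]

acc = accuracy_score(y_true, y_pred)
labels = ["country", "pop", "rock"]
precision, recall, f1, support = precision_recall_fscore_support(
    y_true, y_pred, labels=labels, zero_division=0
)

# one line per genre: how trustworthy its predictions are (precision)
# and how many of its songs were found (recall)
for name, p, r, f in zip(labels, precision, recall, f1):
    print(f"{name:8s} precision={p:.2f} recall={r:.2f} f1={f:.2f}")
print(f"accuracy={acc:.2f}")
```

Here rock has perfect recall (both rock songs are found) but lower precision (one pop song is also labeled rock), which accuracy alone would not reveal.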

Results and Insights:
After training and evaluating the models, we analyze which features matter most for genre classification and explore how the genres differ in both musical attributes and lyrical content.

Through this project, we aim to contribute to the understanding of genre classification in the music domain, showcasing the potential of combining audio features and lyrics for accurate and insightful genre predictions.

# Modeling

In [2]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_palette("dark")

import nltk
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.metrics import classification_report, confusion_matrix, ConfusionMatrixDisplay, accuracy_score, precision_recall_fscore_support
from sklearn.dummy import DummyClassifier
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier  # used by the Random Forest pipeline


In [11]:
df = pd.read_csv('data/lyrics.csv')

In [3]:
# custom stopwords list
sw = set(stopwords.words('english'))
custom_sw = ["i'd", "i'm",
             'yeah', 'ah', 'oh']
sw.update(custom_sw)

In [4]:
def preprocess_text(lyrics):
    # remove numbers and special characters
    lyrics = re.sub(r'[^a-zA-Z\s]', '', lyrics)
    # collapse runs of whitespace (including newlines) into single spaces
    lyrics = re.sub(r'\s+', ' ', lyrics)
    # lowercase all
    lyrics = lyrics.lower()
    # tokenize, lemmatize, remove stopwords
    tokens = word_tokenize(lyrics)
    lemmer = WordNetLemmatizer()
    tokens = [lemmer.lemmatize(word) for word in tokens]
    tokens = ' '.join([word for word in tokens if word not in sw])
    return tokens
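
One detail in `preprocess_text` is worth checking: the first regex deletes apostrophes before the stopword filter runs, so entries such as "i'm" and "i'd" in the custom list can no longer match (the text then contains `im` and `id`). The sketch below illustrates the effect with a tiny stand-in stopword set and plain `str.split` in place of NLTK's tokenizer. The names `toy_sw`, `strip_then_filter`, and `filter_then_strip` are hypothetical and used only here.

```python
import re

# Tiny stand-in stopword set; the real list comes from NLTK plus custom entries
toy_sw = {"i'm", "i'd", "oh"}

def strip_then_filter(text):
    # same order as preprocess_text: drop non-letters first, then filter stopwords
    text = re.sub(r"[^a-zA-Z\s]", "", text).lower()
    return [w for w in text.split() if w not in toy_sw]

def filter_then_strip(text):
    # alternative order: filter stopwords while apostrophes are still present
    kept = (w for w in text.lower().split() if w not in toy_sw)
    cleaned = (re.sub(r"[^a-z]", "", w) for w in kept)
    return [w for w in cleaned if w]

line = "Oh I'm sure I'd stay"
print(strip_then_filter(line))  # ['im', 'sure', 'id', 'stay']
print(filter_then_strip(line))  # ['sure', 'stay']
```

Either reordering the steps or adding the apostrophe-free forms (`im`, `id`, `dont`) to the custom list would be one way to make the filter take effect.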

In [5]:
df['tokens'] = df['lyrics_text'].apply(preprocess_text)
df['tokens']

0       puttin defense cause dont wan na fall love eve...
1       thats love thats love dont need time make mind...
2       wake morning feelin like p diddy hey girl grab...
3       good gold kinda dream cant sold right til were...
4       found heart wa broke filled cup overflowed too...
                              ...                        
2019    saturday night six pack girl big star shining ...
2020    oclock friday night im still home girl keep bl...
2021    hard find perfect time say something know gon ...
2022    tried get sober didnt get far im gon na pour c...
2023    little outside elizabethtown little bar id sit...
Name: tokens, Length: 2024, dtype: object

In [26]:
def analyze_model(model, X, y):
    """Fit on an 80/20 split, then report 5-fold CV accuracy and test-split accuracy."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=95)

    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)

    # print("Classification Report:")
    # print(classification_report(y_te, pred))

    # cross_val_score clones and refits the model on each fold;
    # it takes no return_train_score argument (cross_validate does)
    cv_scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
    print(f"Cross-Validation Scores: {cv_scores}")
    print(f"Mean Accuracy: {cv_scores.mean():.2f}")

    # compare predictions with the labels of the same test split
    accuracy = accuracy_score(y_te, pred)
    print(f"Overall Accuracy: {accuracy:.2f}")

    # cm = confusion_matrix(y_te, pred, normalize='true')
    # genres = sorted(y.unique())  # confusion_matrix orders labels alphabetically
    # sns.heatmap(cm, xticklabels=genres, yticklabels=genres,
    #             annot=True, fmt='.2f', cmap='Blues')
    # plt.xlabel('Predicted')
    # plt.ylabel('True')
    # plt.show()
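
If train scores are wanted alongside validation scores (for example, to spot overfitting), scikit-learn's `cross_validate` accepts `return_train_score=True`, whereas `cross_val_score` returns validation scores only. A minimal sketch on synthetic data, where `X_toy`, `y_toy`, and the `GaussianNB` model are stand-ins rather than the project's inputs:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in data (the real inputs are the lyric tokens and genres)
X_toy, y_toy = make_classification(n_samples=200, n_features=10, random_state=95)

results = cross_validate(
    GaussianNB(), X_toy, y_toy,
    cv=5, scoring="accuracy", return_train_score=True,
)

# results is a dict of arrays with one entry per fold
print(f"Mean train accuracy: {results['train_score'].mean():.2f}")
print(f"Mean validation accuracy: {results['test_score'].mean():.2f}")
```

A large gap between the two means would suggest the model is memorizing its training folds.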

In [22]:
# set variables for modeling
X = df['tokens']
y = df['genre']

# split holdout set for final model validation
X_df, X_hold, y_df, y_hold = train_test_split(X, y, test_size=0.2, random_state=95)

# genre shares (%) in the modeling split
y_df.value_counts(normalize=True) * 100

rock       30.574429
pop        26.065473
country    24.336010
hip hop    19.024089
Name: genre, dtype: float64

## Baseline

In [77]:
X_tr, X_te, y_tr, y_te = train_test_split(X_df, y_df, test_size=0.2, random_state=95)

# use dummy classifier as baseline model
baseline = DummyClassifier(strategy='uniform')
baseline.fit(X_tr, y_tr)
base_pred = baseline.predict(X_te)
base_score = accuracy_score(y_te, base_pred)
print(f'Baseline Accuracy: {base_score:.2f}')

Baseline Accuracy: 0.31
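
For context on that number, the expected accuracy of each `DummyClassifier` strategy can be worked out from the class shares alone. The sketch below assumes the test split has the same genre shares as the distribution printed above; the variable names are illustrative.

```python
# Class shares (as fractions) taken from the genre distribution printed above
shares = {
    "rock": 0.30574429,
    "pop": 0.26065473,
    "country": 0.24336010,
    "hip hop": 0.19024089,
}

# Expected accuracy per strategy, assuming the test split has these same shares
expected_uniform = 1 / len(shares)                          # each class guessed equally often
expected_most_frequent = max(shares.values())               # always predict the largest class
expected_stratified = sum(p * p for p in shares.values())   # guess in proportion to shares

print(f"uniform:       {expected_uniform:.2f}")        # 0.25
print(f"most_frequent: {expected_most_frequent:.2f}")  # 0.31
print(f"stratified:    {expected_stratified:.2f}")     # 0.26
```

A `uniform` baseline is therefore expected to land near 0.25 on average. A single run on a few hundred songs can deviate from that, and passing `random_state` to `DummyClassifier` would make the figure reproducible.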


## Naive Bayes

In [73]:
nb_pipe = Pipeline([
    ('vect', TfidfVectorizer()),
    ('clf', MultinomialNB())
]) 

analyze_model(nb_pipe, X_df, y_df)

Cross-Validation Scores: [0.58641975 0.61111111 0.61111111 0.58641975 0.6006192 ]
Mean Accuracy: 0.60
Overall Accuracy: 0.60


In [74]:
nb_params = {
    'vect': [CountVectorizer(), TfidfVectorizer()],
    'vect__max_features': [None, 1000, 2000],
    'vect__min_df': [2, 10, 25,50],
    'vect__max_df': [0.5, 0.75, 1.0],

    'clf__alpha': [0.1, 0.5, 1.0],
}

nb_grid = GridSearchCV(nb_pipe, nb_params, cv=5, scoring='accuracy')
nb_grid.fit(X_df, y_df)
print('Naive Bayes Best Params: ', nb_grid.best_params_)
nb_tuned = nb_grid.best_estimator_
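# caution: X_te is a subset of X_df, which the grid search was fit on,
# so this score is optimistic; X_hold / y_hold were not seen by the search
nb_pred = nb_tuned.predict(X_te)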
nb_score = accuracy_score(y_te, nb_pred)
print(f'Naive Bayes Accuracy: {nb_score:.2f}')

{'clf__alpha': 0.1,
 'vect': CountVectorizer(),
 'vect__max_df': 0.75,
 'vect__max_features': None,
 'vect__min_df': 2}

Naive Bayes Accuracy: 0.91
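
The jump from about 0.60 in cross-validation to 0.91 here is likely inflated, because `X_te` was carved out of `X_df` and the grid search refits its best estimator on all of `X_df`. A sketch of scoring a tuned pipeline only on data the search never saw, using a tiny made-up corpus (all names and texts below are illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Tiny made-up corpus (illustrative only): 2 genres, 10 snippets each
texts = ["truck road beer"] * 10 + ["money street night"] * 10
genres = ["country"] * 10 + ["hip hop"] * 10

# Keep a holdout split out of the search entirely
X_fit, X_holdout, y_fit, y_holdout = train_test_split(
    texts, genres, test_size=0.2, random_state=95, stratify=genres
)

pipe = Pipeline([("vect", CountVectorizer()), ("clf", MultinomialNB())])
grid = GridSearchCV(pipe, {"clf__alpha": [0.1, 0.5, 1.0]}, cv=2, scoring="accuracy")
grid.fit(X_fit, y_fit)  # the search only ever sees X_fit

holdout_score = grid.score(X_holdout, y_holdout)  # accuracy on unseen snippets
print(grid.best_params_)
print(f"Holdout accuracy: {holdout_score:.2f}")
```

In this notebook's terms, the equivalent check would be scoring `nb_tuned` on `X_hold` and `y_hold`, which were split off before any tuning.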


## Random Forest

In [79]:
rf_pipe = Pipeline([
    ('vect', TfidfVectorizer()),
    ('clf', RandomForestClassifier())
])

analyze_model(rf_pipe, X_df, y_df)

Cross-Validation Scores: [0.62654321 0.64506173 0.66358025 0.61111111 0.66873065]
Mean Accuracy: 0.64
Overall Accuracy: 0.64


In [80]:
rf_params = {
    'vect': [CountVectorizer(), TfidfVectorizer()],
    'vect__max_features': [None, 1000, 2000],
    'vect__min_df': [2, 10, 25,50],
    'vect__max_df': [0.5, 0.75, 1.0],

    'clf__n_estimators': [50, 100, 150],
    'clf__max_depth': [None, 10, 20, 30],
    'clf__min_samples_split': [2, 5, 10],
    'clf__min_samples_leaf': [1, 2, 4]
}

rf_grid = GridSearchCV(rf_pipe, rf_params, cv=5, scoring='accuracy')
rf_grid.fit(X_df, y_df)
rf_grid.best_params_
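
Before running a grid of this size it can help to count how many models it trains. The arithmetic below uses the list lengths from `rf_params` and the `cv=5` setting; `rf_grid_sizes` is just a hypothetical helper dict for the calculation.

```python
from math import prod

# Number of candidate values per hyperparameter in rf_params
rf_grid_sizes = {
    "vect": 2,
    "vect__max_features": 3,
    "vect__min_df": 4,
    "vect__max_df": 3,
    "clf__n_estimators": 3,
    "clf__max_depth": 4,
    "clf__min_samples_split": 3,
    "clf__min_samples_leaf": 3,
}

n_candidates = prod(rf_grid_sizes.values())  # grid search tries every combination
n_fits = n_candidates * 5                    # cv=5 folds per candidate
print(n_candidates, n_fits)  # 7776 38880
```

With roughly 39,000 forest fits, passing `n_jobs=-1` to `GridSearchCV` or sampling a subset of combinations with `RandomizedSearchCV` may be worth considering.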

In [None]:
print('Random Forest Best Params: ', rf_grid.best_params_)
rf_tuned = rf_grid.best_estimator_
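# caution: X_te is a subset of X_df, which the grid search was fit on,
# so this score is optimistic; X_hold / y_hold were not seen by the search
rf_pred = rf_tuned.predict(X_te)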
rf_score = accuracy_score(y_te, rf_pred)
print(f'Random Forest Accuracy: {rf_score:.2f}')