# AnaLyrics Engine: Predicting Music Genres with NLP

## Business Understanding

This project leverages song lyrics and metadata to classify songs into distinct music genres. Genre classification is not an exact science, and given the vast diversity of music today, there are innumerable categories a song could belong to.

As a passionate music lover and avid concertgoer, I tend to have a curated playlist for every possible mood and occasion. Building these manually is tedious, and I have often thought how nice it would be to input my current mood and desired music style (e.g., sad bangers) and have a playlist made personally for me.

Spotify has attempted to do this with their [daylist](https://newsroom.spotify.com/2023-09-12/ever-changing-playlist-daylist-music-for-all-day/), a personalized playlist that "ebbs and flows with unique vibes, bringing together the niche music and microgenres you usually listen to during particular moments in the day or on specific days of the week."

The daylist updates organically throughout the day and is based on your past listening habits. This project attempts to build on that concept by curating a playlist based on a user's desired vibe. In order to get to an end goal of song recommendations based on lyrics and audio features, the first step was to build a supervised model trained to predict a song's genre based purely on the words it contains.

## Data Collection

The first step in building the dataset was identifying the prediction target: music genres. As with any field that is an art and not a (data) science, the ground truth for genre classification is fairly subjective.

[Every Noise At Once](https://everynoise.com/engenremap.html) is "an ongoing attempt at an algorithmically-generated, readability-adjusted scatter-plot of the musical genre-space." The project was created in 2013 by Glenn McDonald, a former Spotify developer, based on data tracked and analyzed for ~6,000 Spotify genres.

![noise](images/noise.png)

For the purposes of lyric classification, I chose the most popular distinct genre groupings (pop, rock, hip hop, and country) and collected the songs for each genre from the playlists provided by the Every Noise project.

Song metadata and audio features were gathered using [Spotipy](https://spotipy.readthedocs.io/en/2.22.1/), a Python wrapper for the [Spotify API](https://developer.spotify.com/documentation/web-api). Lyrics for each song were scraped from [Genius](https://genius.com) using [LyricsGenius](https://lyricsgenius.readthedocs.io/en/master/reference/genius.html). 

The scraped lyrics required preliminary text cleaning to remove extraneous content and bad records. After combining and cleaning the data, we were left with 2,024 songs for modeling.

The Python files for collecting and cleaning the lyrics are available in the code folder:
- [Spotify metadata](code/1_spotify.py)
- [Genius lyrics](code/2_genius.py)
- [Text cleaning](code/3_data_cleaning.ipynb)
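
The exact cleaning rules live in the notebook linked above. As a rough illustration only, the sketch below assumes that the extraneous content consists of bracketed section tags such as `[Chorus]` plus blank lines, and that a bad record is one with too few words. The function names, the `min_words` threshold, and the sample text are all hypothetical.

```python
import re

def strip_section_tags(raw_lyrics):
    """Remove bracketed section markers such as [Chorus] and drop blank lines."""
    # assumption: section markers appear as text inside square brackets
    text = re.sub(r'\[[^\]]*\]', '', raw_lyrics)
    # keep only non-empty lines, trimmed of surrounding whitespace
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    return '\n'.join(lines)

def is_bad_record(lyrics, min_words=20):
    """Flag records too short to be usable lyrics (hypothetical threshold)."""
    return len(lyrics.split()) < min_words

# invented sample text, not taken from the dataset
sample = "[Verse 1]\nDriving down the road\n\n[Chorus]\nLa la la\n"
cleaned = strip_section_tags(sample)
print(cleaned)
print(is_bad_record(cleaned))
```

In a pandas workflow, functions like these would typically be applied with `Series.apply` before filtering out the flagged rows.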

## Modeling

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_palette("dark")

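import nltk
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

# one-time NLTK resource downloads, if not already installed
# (resource names can vary by NLTK version, e.g. 'punkt_tab' instead of 'punkt'):
# nltk.download('punkt'); nltk.download('stopwords'); nltk.download('wordnet')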

from sklearn.model_selection import train_test_split, KFold, cross_val_score, cross_validate, GridSearchCV
from sklearn.metrics import classification_report, confusion_matrix, ConfusionMatrixDisplay, accuracy_score, precision_recall_fscore_support
from sklearn.dummy import DummyClassifier
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier


In [None]:
# import cleaned lyrics
df = pd.read_csv('data/lyrics2.csv')
df['genre'] = df['genre'].str.title()
df['genre'].value_counts(normalize=True)*100

To prepare the data, we preprocessed the lyrics by removing special characters and stop words, tokenizing, and lemmatizing.

In [None]:
# function to standardize text and lemmatize to get root terminology
def preprocess_text(lyrics):
    # remove numbers and special characters (apostrophes too, so "don't" becomes "dont")
    lyrics = re.sub(r'[^a-zA-Z\s]', '', lyrics)
    # collapse runs of whitespace, including new lines, into single spaces
    lyrics = re.sub(r'\s+', ' ', lyrics)
    # lowercase all
    lyrics = lyrics.lower()
    # tokenize and lemmatize
    tokens = word_tokenize(lyrics)
    lemmer = WordNetLemmatizer()
    tokens = [lemmer.lemmatize(word) for word in tokens]
    # remove stopwords (set membership is faster than scanning a list);
    # note: the filter runs on lemmatized forms, so a stop word whose lemma
    # differs from it (e.g. 'was' -> 'wa') can slip through
    sw = set(stopwords.words('english'))
    return ' '.join(word for word in tokens if word not in sw)

In [None]:
df['tokens'] = df['lyrics_text'].apply(preprocess_text)
df['tokens']

After exploring some simple modeling techniques, we found heavy overfitting across the board. Because this is a fairly small dataset, averaging roughly 500 songs per genre, the models tended to perform much better on the training data than on unseen lyrics.

Along with tuning the chosen models, we tested different parameters for vectorization and resampling, which only slightly reduced the overfitting.
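
As a small illustration of the vectorization parameters involved, the sketch below fits `CountVectorizer` on a toy corpus (invented here, not project data) to show how `min_df` and `max_df` prune the vocabulary. Shrinking the vocabulary this way is one common lever against overfitting on small text datasets.

```python
from sklearn.feature_extraction.text import CountVectorizer

# toy corpus, invented for illustration only
docs = [
    "love truck road",
    "love money city",
    "love guitar road",
    "love heart night",
]

# no pruning: every distinct term is kept
full = CountVectorizer().fit(docs)

# min_df=2 drops terms that appear in fewer than 2 documents;
# max_df=0.75 drops terms that appear in more than 75% of documents
pruned = CountVectorizer(min_df=2, max_df=0.75).fit(docs)

print(sorted(full.vocabulary_))    # all 8 distinct terms
print(sorted(pruned.vocabulary_))  # only 'road' survives both filters
```

Here `love` is removed for being too common and the single-document terms for being too rare, which leaves far fewer features for a model to memorize.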

In [None]:
# function to cross validate model followed by train test split
# generate classification report and confusion matrix

def analyze_model(model, X, y):
    # k-fold validation
    kfold = KFold(n_splits=5, shuffle=True, random_state=95)
    cv_scores = cross_validate(model, X, y, cv=kfold, scoring='accuracy', return_train_score=True)
    print('Cross-Validation Results:')
    for fold, (train_score, test_score) in enumerate(zip(cv_scores['train_score'], cv_scores['test_score'])):
        print(f"Fold {fold + 1}: Train = {train_score:.3f}, Test = {test_score:.3f}")
    print()
    print(f"Avg Train Accuracy: {cv_scores['train_score'].mean():.3f}")
    print(f"Avg Test Accuracy: {cv_scores['test_score'].mean():.3f}")
    print()

    # train test split for predictions
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=95)
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)
    # use the fitted class order so the heatmap tick labels match the matrix rows/columns
    genres = model.classes_

    # classification report
    print("Classification Report:")
    print(classification_report(y_te, pred))

    # confusion matrix, row-normalized so each actual genre sums to 1
    cm = confusion_matrix(y_te, pred, labels=genres, normalize='true')
    sns.heatmap(cm, xticklabels=genres, yticklabels=genres,
                annot=True, fmt='.2f', cmap='Blues')
    plt.xlabel('Predicted')
    plt.ylabel('Actual')
    plt.show()

In [None]:
# split holdout set for final model validation

X = df['tokens']
y = df['genre']

X_df, X_hold, y_df, y_hold = train_test_split(X, y, test_size=0.2, random_state=95)

## Baseline

In [None]:
X_tr, X_te, y_tr, y_te = train_test_split(X_df, y_df, test_size=0.2, random_state=95)

# use dummy classifier as baseline model
baseline = DummyClassifier(strategy='uniform')
baseline.fit(X_tr, y_tr)
base_pred = baseline.predict(X_te)
base_score = accuracy_score(y_te, base_pred)
print(f'Baseline Accuracy: {base_score:.2f}')
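
For context on what this baseline should score: a `uniform` dummy classifier guesses each class with equal probability, so its expected accuracy is 1/k for k classes regardless of class balance, or about 0.25 for four genres. A `most_frequent` baseline would instead score the share of the largest class. The sketch below works through that arithmetic; the class counts are hypothetical placeholders, not the project's actual distribution.

```python
def expected_uniform_accuracy(class_counts):
    """Expected accuracy when guessing uniformly at random among the classes."""
    # each sample is guessed correctly with probability 1/k, whatever the balance
    return 1 / len(class_counts)

def majority_accuracy(class_counts):
    """Accuracy of always predicting the most frequent class."""
    total = sum(class_counts.values())
    return max(class_counts.values()) / total

# hypothetical counts for illustration only
counts = {'Country': 400, 'Hip Hop': 500, 'Pop': 600, 'Rock': 500}
print(f"uniform: {expected_uniform_accuracy(counts):.2f}")
print(f"most frequent: {majority_accuracy(counts):.2f}")
```

Any trained model should clear both numbers comfortably to be considered useful.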

## Naive Bayes

In [None]:
# simple naive bayes 

nb_pipe = Pipeline([
    ('vect', TfidfVectorizer()),
    ('clf', MultinomialNB())
]) 

analyze_model(nb_pipe, X_df, y_df)

In [None]:
# grid search for best parameters to address overfitting

nb_params = {
    'vect': [CountVectorizer(), TfidfVectorizer()],
    'vect__max_features': [None, 200, 500],
    'vect__min_df': [5, 10, 25],
    'vect__max_df': [0.5, 0.75, 1.0],

    'clf__alpha': [0.1, 0.5, 1.0],
}

nb_grid = GridSearchCV(nb_pipe, nb_params, cv=5, scoring='accuracy')
nb_grid.fit(X_df, y_df)
nb_grid.best_params_

In [None]:
# tuned naive bayes

nb_tuned = Pipeline([
    ('vect', CountVectorizer(min_df=20, max_df=.75)),
    ('clf', MultinomialNB(alpha=0.1))
])

analyze_model(nb_tuned, X_df, y_df)

## Random Forest

In [None]:
# simple random forest

rf_pipe = Pipeline([
    ('vect', TfidfVectorizer()),
    ('clf', RandomForestClassifier())
])

analyze_model(rf_pipe, X_df, y_df)

In [None]:
# grid search for best parameters to address overfitting

rf_params = {
    'vect': [CountVectorizer(), TfidfVectorizer()],
    'vect__max_features': [None, 150, 600],
    'vect__min_df': [2, 10, 25, 50],
    'vect__max_df': [0.5, 0.75, 1.0],

    'clf__n_estimators': [500, 600, 800],
    'clf__max_depth': [20, 30, 50],
    'clf__min_samples_split': [4, 6, 12],
    'clf__min_samples_leaf': [2, 4, 7]
}

rf_grid = GridSearchCV(rf_pipe, rf_params, cv=5, scoring='accuracy')
rf_grid.fit(X_df, y_df)
rf_grid.best_params_

In [None]:
# tuned random forest

rf_tuned = Pipeline([
    ('vect', CountVectorizer(min_df=20, max_df=.75)),
    ('clf', RandomForestClassifier(n_estimators=500,
                                   max_depth=70,
                                   min_samples_split=3,
                                   min_samples_leaf=3))
])

analyze_model(rf_tuned, X_df, y_df)

## Evaluation & Next Steps

In evaluating the final models, we found that the random forest provided the highest overall accuracy at 67%. However, this may not be the preferred option, as its classification accuracy was heavily skewed in certain genres (rock and hip hop) compared to others.
Naive Bayes had a lower overall accuracy of 56%, but it was slightly more balanced across genres.
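
One way to make "balanced across genres" concrete is to report per-genre recall alongside overall accuracy. The sketch below does this for an invented confusion matrix (not the project's results); the helper name `per_class_recall` is hypothetical.

```python
import numpy as np

def per_class_recall(cm):
    """Recall per class from a confusion matrix with actual classes on rows."""
    cm = np.asarray(cm, dtype=float)
    # correct predictions for each class divided by that class's true count
    return np.diag(cm) / cm.sum(axis=1)

# invented counts for illustration; rows = actual, columns = predicted
genres = ['Country', 'Hip Hop', 'Pop', 'Rock']
cm = [
    [10,  0,  5, 35],
    [ 0, 45,  5,  0],
    [ 5,  5, 15, 25],
    [ 0,  0,  5, 45],
]

recall = per_class_recall(cm)
accuracy = np.trace(cm) / np.sum(cm)
for genre, r in zip(genres, recall):
    print(f"{genre}: recall = {r:.2f}")
print(f"Overall accuracy: {accuracy:.3f}")
print(f"Recall spread (max - min): {recall.max() - recall.min():.2f}")
```

A model with a slightly lower accuracy but a much smaller recall spread may be the better choice when every genre matters equally.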

To improve this model, we would need to collect a larger and more varied set of lyrics to address sample size issues. We could also consider limiting the time frame, as language evolution may be a factor in prediction accuracy. For expanded analysis, we could also incorporate audio features such as acoustics and tempo to capture the song structure.
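
As a sketch of how lyrics and audio features might be combined in one model, the example below uses scikit-learn's `ColumnTransformer` to vectorize a text column and pass numeric columns through unchanged. This is one possible design, not something built in this project: the toy data and the `tempo` and `acousticness` column names are assumptions for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

# tiny invented frame; the audio column names are assumptions
songs = pd.DataFrame({
    'tokens': ['truck road beer', 'money city club',
               'truck dirt road', 'club money night'],
    'tempo': [92.0, 140.0, 88.0, 128.0],
    'acousticness': [0.80, 0.05, 0.70, 0.10],
    'genre': ['Country', 'Hip Hop', 'Country', 'Hip Hop'],
})

features = ColumnTransformer([
    # a text column is selected by a single name (string), not a list,
    # so the vectorizer receives a 1-D sequence of documents
    ('lyrics', TfidfVectorizer(), 'tokens'),
    # numeric audio columns are kept as-is; tree models do not need scaling
    ('audio', 'passthrough', ['tempo', 'acousticness']),
])

combined = Pipeline([
    ('features', features),
    ('clf', RandomForestClassifier(random_state=95)),
])

X_demo = songs[['tokens', 'tempo', 'acousticness']]
combined.fit(X_demo, songs['genre'])
preds = combined.predict(X_demo)
print(preds)
```

Because the result is still a single pipeline, it could be cross-validated and grid-searched the same way as the lyrics-only models.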

In [None]:
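# note: analyze_model re-fits the pipeline it receives, so this call
# cross-validates and re-splits within the holdout set only
analyze_model(nb_tuned, X_hold, y_hold)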

In [None]:
analyze_model(rf_tuned, X_hold, y_hold)
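
Because `analyze_model` re-fits whatever pipeline it receives, the holdout calls above cross-validate and re-split within the 20% holdout set. An alternative final check, sketched below as one option rather than the method behind the reported numbers, is to fit once on the development split and score once on the untouched holdout. The helper name `evaluate_on_holdout` and the toy data are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score, classification_report
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

def evaluate_on_holdout(model, X_dev, y_dev, X_hold, y_hold):
    """Fit on the development split only, then score the untouched holdout."""
    model.fit(X_dev, y_dev)
    pred = model.predict(X_hold)
    print(classification_report(y_hold, pred, zero_division=0))
    return accuracy_score(y_hold, pred)

# toy stand-ins for X_df / y_df and X_hold / y_hold
X_dev = ['truck road beer', 'dirt road truck', 'money club city', 'club night money']
y_dev = ['Country', 'Country', 'Hip Hop', 'Hip Hop']
X_hold_demo = ['truck beer', 'city money']
y_hold_demo = ['Country', 'Hip Hop']

demo = Pipeline([('vect', CountVectorizer()), ('clf', MultinomialNB())])
holdout_acc = evaluate_on_holdout(demo, X_dev, y_dev, X_hold_demo, y_hold_demo)
print(f"Holdout accuracy: {holdout_acc:.2f}")
```

With the notebook's variables, the equivalent call would be `evaluate_on_holdout(nb_tuned, X_df, y_df, X_hold, y_hold)`, and likewise for `rf_tuned`.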