## Natural Language Processing

Using the track data gathered from the Spotify API, we next retrieve the lyrics for each song by scraping the **Genius Lyrics** website. The lyrics are then cleaned and pre-processed before undergoing sentiment analysis.





In [1]:
import pandas as pd
import re
import contractions
import string
from better_profanity import profanity
from nltk.tokenize import word_tokenize
from deep_translator import GoogleTranslator
from langdetect import detect
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tag import pos_tag
from nltk.corpus import wordnet
import lyricsgenius
from transformers import pipeline
from spellchecker import SpellChecker
from tkinter import *


all_tracks = pd.read_csv("../assets/data/all_tracks.csv")







### Scraping the Web

To get started, the script below imports `lyricsgenius`, a Python client for the Genius API that also scrapes the Genius Lyrics website to retrieve song lyrics. Once the `genius` client is initialized with an access token, it can look up the lyrics of any given song, such as "Too Many Nights" by Metro Boomin.


In [2]:
import os
import lyricsgenius

# Read the Genius access token from an environment variable instead of hard-coding it
genius = lyricsgenius.Genius(os.environ["GENIUS_ACCESS_TOKEN"])
song = genius.search_song("Too Many Nights", "Metro Boomin")

Searching for "Too Many Nights" by Metro Boomin...


Done.


First, we define a function that retrieves the lyrics for any song and artist from the Genius database. As shown below, it first searches for the track using the provided name and artist and then extracts the lyrics from the search results.

In [3]:
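def get_song_lyrics(song_name, song_artist):
    song_genius = genius.search_song(song_name, song_artist)
    if song_genius is None:
        raise ValueError(f"No Genius result for {song_name!r} by {song_artist!r}")
    # Keep only the text after the "Lyrics" header
    song_lyrics = song_genius.lyrics.partition("Lyrics")[2]
    # Remove text between square brackets and any number followed by 'Embed'
    song_lyrics = re.sub(r"[\[].*?[\]]|\d+Embed", "", song_lyrics)
    # Remove words wrapped in hyphens, such as "-word-"
    song_lyrics = re.sub(r"(\-[A-Za-z]+\-)", "", song_lyrics)

    return song_lyrics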
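
To make the substitutions concrete, the sketch below applies the same steps to a made-up string. The sample text and the helper name `strip_genius_markup` are assumptions for illustration only; real Genius output may be shaped differently.

```python
import re

def strip_genius_markup(raw_lyrics):
    """Apply the same clean-up steps as get_song_lyrics to a raw string."""
    # Keep only what follows the first "Lyrics" header
    body = raw_lyrics.partition("Lyrics")[2]
    # Drop [Section] markers and the trailing "<digits>Embed" footer
    body = re.sub(r"[\[].*?[\]]|\d+Embed", "", body)
    # Drop hyphen-wrapped words such as "-word-"
    return re.sub(r"(\-[A-Za-z]+\-)", "", body)

# Hypothetical sample, not real Genius output
sample = "Some Song Lyrics[Intro]\nhello world\n[Chorus]\ngoodbye12Embed"
cleaned = strip_genius_markup(sample)
print(repr(cleaned))  # '\nhello world\n\ngoodbye'

# If the "Lyrics" header is missing, partition leaves nothing to keep
print(repr(strip_genius_markup("no header here")))  # ''
```

The second call shows a side effect worth keeping in mind: because everything before the first "Lyrics" is discarded, a page without that header yields an empty string rather than an error.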

### Pre-Processing Text Data

The functions below clean the song lyrics, which is a crucial step towards building a sentiment classifier. The text pre-processing procedure involves the following main steps.

1. Language Detection
2. Expanding Contractions
3. Converting Text to Lowercase
4. Spell Checking + Censoring
5. Removing Punctuation
6. Tokenization

The function `detect_and_translate` below detects the language of the lyrics and compares it to the target language (English by default). If the two match, the text is returned unchanged; otherwise the function uses `GoogleTranslator` to translate the lyrics into the target language.


In [4]:
# Function to detect and translate text
def detect_and_translate(track_lyrics, target_lang='en'):
    if detect(track_lyrics) == target_lang:
        return track_lyrics
    translator = GoogleTranslator(source='auto', target=target_lang)
    return translator.translate(track_lyrics)

We also define helper functions to support the pre-processing of textual data. Among these are a function that removes punctuation from a given string of lyrics and a spell-checker that attempts to correct misspelled words.


In [5]:
def remove_punctuation(text):
    no_punct = ""
    for char in text:
        if char not in string.punctuation:
            no_punct = no_punct + char
    return no_punct  # return unpunctuated string
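
For reference, the same filtering can be written with `str.translate`, which avoids building the string one character at a time. This is an alternative sketch rather than the notebook's code, and the name `remove_punctuation_fast` is hypothetical. It also shows that `string.punctuation` covers ASCII punctuation only, so typographic quotes pass through either version.

```python
import string

# Translation table that deletes every ASCII punctuation character
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def remove_punctuation_fast(text):
    """Return text with ASCII punctuation removed (hypothetical helper)."""
    return text.translate(_PUNCT_TABLE)

print(remove_punctuation_fast("hey, you! what's up?"))  # hey you whats up

# The curly apostrophe (U+2019) is not in string.punctuation, so it survives
print(remove_punctuation_fast("don\u2019t stop"))
```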

In [6]:
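# Spell check (profanity is censored later, in clean_song_lyrics)
spell = SpellChecker()

def spell_check(word_list_str):
    """Return the words of a string as a list, spell-corrected where possible."""
    word_corrected_list = []
    for word in word_list_str.split():
        word_corrected = spell.correction(word)
        # correction() may return None when it has no candidate; keep the original
        word_corrected_list.append(word if word_corrected is None else word_corrected)
    return word_corrected_list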

The `clean_song_lyrics` function chains these steps for a specific song and artist. It retrieves the lyrics from Genius, translates them to English if needed, expands contractions, converts the text to lowercase, and strips filler syllables such as "nana" and "lala". It then runs the spell-checker, censors profanity, and removes punctuation. The end result is a cleaned set of lyrics, tokenized and returned as a list of ASCII-only words.

In [7]:
def clean_song_lyrics(song_name, artist_name):
    genius_lyrics = get_song_lyrics(song_name, artist_name)  # Fetch lyrics from Genius
    lyrics_en = detect_and_translate(genius_lyrics, "en")  # Translate to English if needed

    no_contract = [contractions.fix(word) for word in lyrics_en.split()]  # Expand contractions
    no_contract_str = " ".join(no_contract).lower()  # Lowercase
    no_contract_str = re.sub(r"nana|lala", "", no_contract_str)  # Strip filler syllables

    corrected = spell_check(no_contract_str)  # Spell check
    censored = profanity.censor(" ".join(corrected), censor_char="")  # Censor profanity
    no_punct = remove_punctuation(censored)  # Remove punctuation

    tokenized = word_tokenize(no_punct)  # Tokenize
    # Drop non-ASCII characters by encoding to ASCII and decoding back
    strencode = [i.encode("ascii", "ignore") for i in tokenized]
    return [i.decode() for i in strencode]
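
The final encode/decode step discards every non-ASCII character, which matters for accented words and symbols. The sketch below reproduces that behavior on a few sample tokens and contrasts it with an alternative that decomposes accents first. Both helper names (`to_ascii_drop`, `to_ascii_fold`) are hypothetical, and the folding variant is a suggestion, not part of the notebook.

```python
import unicodedata

def to_ascii_drop(tokens):
    """Mirror the notebook's step: discard non-ASCII characters outright."""
    return [t.encode("ascii", "ignore").decode() for t in tokens]

def to_ascii_fold(tokens):
    """Alternative: decompose accents first so the base letter is kept."""
    folded = []
    for t in tokens:
        decomposed = unicodedata.normalize("NFKD", t)
        ascii_token = decomposed.encode("ascii", "ignore").decode()
        if ascii_token:  # skip tokens that end up empty
            folded.append(ascii_token)
    return folded

# "puntería", "ángel", "love", and a music-note symbol (escaped for clarity)
tokens = ["punter\u00eda", "\u00e1ngel", "love", "\u266a"]
print(to_ascii_drop(tokens))  # ['puntera', 'ngel', 'love', '']
print(to_ascii_fold(tokens))  # ['punteria', 'angel', 'love']
```

Note that the dropping variant can leave empty strings in the token list, which later steps would need to tolerate or filter out.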

#### Removing Stop Words

The code below removes stop words from the lyrics using the `stopwords` corpus from the **Natural Language Toolkit** (NLTK) library, extended with a few filler words common in lyrics such as "yeah" and "oh." By eliminating frequently occurring words like "the," "and," or "of," the resulting text becomes more compact, and later analysis can focus on the words that carry the song's message.


In [8]:
def remove_stopwords_lyrics(clean_lyrics_decode):
    stopword = stopwords.words("english")
    stopword.extend(["yeah", "nanana", "nana", "oh", "la"])
    return [word for word in clean_lyrics_decode if word not in stopword]
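
As a small illustration of the filtering step, the sketch below uses a hand-written stop list in place of NLTK's English list (an assumption made so the example runs without downloading any corpus). It stores the stop words in a set, so each membership test does not have to scan a list.

```python
# Tiny illustrative stop list; the notebook uses NLTK's English list instead
STOP_WORDS = {"the", "and", "of", "i", "you", "yeah", "oh", "la"}

def remove_stopwords(tokens, stop_words=STOP_WORDS):
    """Keep tokens that are not in stop_words, preserving their order."""
    return [t for t in tokens if t not in stop_words]

tokens = ["oh", "i", "love", "the", "sound", "of", "rain", "yeah"]
print(remove_stopwords(tokens))  # ['love', 'sound', 'rain']
```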

#### Lemmatization



Next, we define a function to perform lemmatization on a set of words using the `WordNetLemmatizer` class from the NLTK library. Lemmatization standardizes words by reducing them to their base or dictionary form. Each token is first tagged with its part of speech, and the tag is mapped to the matching WordNet category (adjective, verb, noun, or adverb, defaulting to noun), so that different inflections of the same word are reduced to one form.




In [9]:
from nltk.corpus import stopwords, wordnet

def get_wordnet_pos(tag):
    if tag.startswith("J"):
        return wordnet.ADJ
    elif tag.startswith("V"):
        return wordnet.VERB
    elif tag.startswith("N"):
        return wordnet.NOUN
    elif tag.startswith("R"):
        return wordnet.ADV
    else:
        return wordnet.NOUN

In [10]:
from nltk import pos_tag

def word_lemmatize(lyrics_cleaned):
    # Tag each token, then map the Treebank tag to a WordNet part of speech
    pos_tags = pos_tag(lyrics_cleaned)
    wordnet_pos = [(word, get_wordnet_pos(tag)) for (word, tag) in pos_tags]

    wnl = WordNetLemmatizer()  # Lemmatize Lyrics
    return [wnl.lemmatize(word, tag) for word, tag in wordnet_pos]

In summary, the code above tags each word with its part of speech and then uses the `WordNetLemmatizer` class from the NLTK library to convert adjectives, verbs, nouns, and adverbs to their most basic form.
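
The tag-mapping step can be illustrated without NLTK installed. In the sketch below, the single-letter codes mirror NLTK's `wordnet.ADJ`, `wordnet.VERB`, `wordnet.NOUN`, and `wordnet.ADV` constants, while the tagged words are hand-written examples (an assumption, not output from `pos_tag`). The helper name `treebank_to_wordnet` is hypothetical.

```python
# WordNet POS codes; these mirror nltk's wordnet.ADJ, VERB, NOUN and ADV
ADJ, VERB, NOUN, ADV = "a", "v", "n", "r"

# First letter of a Penn Treebank tag -> WordNet POS code
_TAG_PREFIX_TO_POS = {"J": ADJ, "V": VERB, "N": NOUN, "R": ADV}

def treebank_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS code, defaulting to noun."""
    return _TAG_PREFIX_TO_POS.get(tag[:1], NOUN)

# Hand-written (word, tag) pairs for illustration
tagged = [("running", "VBG"), ("brighter", "JJR"), ("nights", "NNS"),
          ("slowly", "RB"), ("the", "DT")]
mapped = [(word, treebank_to_wordnet(tag)) for word, tag in tagged]
print(mapped)
# [('running', 'v'), ('brighter', 'a'), ('nights', 'n'), ('slowly', 'r'), ('the', 'n')]
```

Tags outside the four categories, such as the determiner tag `DT`, fall back to noun, which matches the `else` branch of `get_wordnet_pos`.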


--------------


## Sentiment Analysis


Next, we use the `pipeline` class from the `transformers` library to run predictions with models hosted on the Hugging Face Hub. The code below creates three distinct text-classification pipelines, each backed by a different model: one detects emotions, and the other two assess sentiment polarity.


In [11]:
import warnings
warnings.filterwarnings('ignore')
# python -m pip install "tensorflow<2.11"
# python -m pip install "protobuf<3.2"


In [12]:
from transformers import pipeline

# Initialize the sentiment classifiers
classifiers = [
    pipeline("text-classification", model='bhadresh-savani/distilbert-base-uncased-emotion', return_all_scores=True),
    pipeline("text-classification", model='cardiffnlp/twitter-roberta-base-sentiment', return_all_scores=True),
    pipeline("sentiment-analysis", return_all_scores=True)
]

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


The `get_lyric_sentiment` function takes pre-processed lyrics as input and produces a dictionary of sentiment scores. It runs the three classifiers in turn and merges every label's score into a single dictionary keyed by label. For instance, one of these classifiers is the *distilbert-base-uncased-emotion* model, specifically trained to detect "emotions in texts such as sadness, joy, love, anger, fear, and surprise".


In [13]:
# Function to perform sentiment analysis
def get_lyric_sentiment(lyrics, classifiers):
    text = ' '.join(lyrics)
    scores = {}
    for classifier in classifiers:
        try:
            predictions = classifier(text, truncation=True)
            for prediction in predictions[0]:
                scores[prediction['label']] = prediction['score']
        except Exception as e:
            print(f"Error during sentiment analysis: {e}")
    return scores

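If the lyric sequence contains more than 512 tokens, the model raises an exception in its embeddings layer. Passing `truncation=True` in the function above truncates the input for models whose tokenizer defines a maximum length. As the run log later in this section shows, truncation is not applied when a model declares no maximum length, so a few long songs still fail with a tensor-size error for that classifier. The `try`/`except` block catches the error, and those tracks simply lack that classifier's scores.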
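
One alternative to truncation is to score long lyrics in chunks and average the results, so that no part of the song is discarded. The sketch below illustrates the idea with a stand-in classifier; `chunk_words`, `average_scores`, and `fake_classifier` are all hypothetical names, and the stand-in only imitates the nested output shape that `get_lyric_sentiment` reads. Keep in mind that words are not tokens: a tokenizer may split one word into several tokens, so a real chunk size would need to sit well below the model's limit.

```python
def chunk_words(words, size):
    """Split a list of words into consecutive chunks of at most `size` words."""
    return [words[i:i + size] for i in range(0, len(words), size)]

def average_scores(score_dicts):
    """Average each label's score across a list of {label: score} dicts."""
    totals = {}
    for scores in score_dicts:
        for label, score in scores.items():
            totals[label] = totals.get(label, 0.0) + score
    return {label: total / len(score_dicts) for label, total in totals.items()}

def fake_classifier(text):
    """Stand-in for a real pipeline: 'joy' is the share of 'love' words."""
    words = text.split()
    joy = words.count("love") / len(words)
    return [[{"label": "joy", "score": joy},
             {"label": "sadness", "score": 1 - joy}]]

words = ["love"] * 3 + ["rain"] * 5  # 8 words, scored 4 at a time
per_chunk = []
for chunk in chunk_words(words, size=4):
    predictions = fake_classifier(" ".join(chunk))
    per_chunk.append({p["label"]: p["score"] for p in predictions[0]})

averaged = average_scores(per_chunk)
print(averaged)  # {'joy': 0.375, 'sadness': 0.625}
```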

---------------------


## Putting it All Together


To summarize, the code below loops over every track in the playlist, cleans its lyrics, removes stop words, and lemmatizes the remaining tokens. The cleaning process removes nonessential characters so that the tokens give a more precise depiction of the song's content. The resulting token lists are stored for each track, ready for downstream analysis such as word frequencies.

Additionally, the program computes sentiment and emotion scores for each song based on its lyrics, indicating how positive, negative, or neutral the lyrics are. These scores are added to the information already available for each track, such as the release date, length, popularity, and genre. Finally, the program compiles all this information into a dataframe for further analysis.






In [14]:
track_data = []
for _, track in all_tracks.iterrows():

    # Strip collaboration and soundtrack suffixes so the Genius search matches
    song_name = track["name"].partition(" (with")[0]
    song_name = song_name.partition(" - From")[0]

    artist_name = track["artist"]

    try:
        track_lyrics = clean_song_lyrics(song_name, artist_name)
        stopwords_removed = remove_stopwords_lyrics(track_lyrics)
        lemmatized = word_lemmatize(stopwords_removed)

        sentiment_scores = get_lyric_sentiment(stopwords_removed, classifiers)

        track_info = track.to_dict()
        track_info.update(sentiment_scores)

        track_info["lyrics"] = track_lyrics
        track_info["stopwords_removed"] = stopwords_removed
        track_info["lemmatized"] = lemmatized

        track_data.append(track_info)

    except Exception as e:
        print(f"Error processing track {track['name']} by {track['artist']}: {e}")

df_tracks = pd.DataFrame(track_data)

Searching for "Beautiful Things" by Benson Boone...


Done.


Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Searching for "obsessed" by Olivia Rodrigo...


Done.


Searching for "we can't be friends (wait for your love)" by Ariana Grande...


Done.


Searching for "Lose Control" by Teddy Swims...


Done.


Searching for "greedy" by Tate McRae...


Done.


Searching for "TEXAS HOLD 'EM" by Beyoncé...


Done.


Searching for "End of Beginning" by Djo...


Done.


Searching for "Stick Season" by Noah Kahan...


Done.


Searching for "Saturn" by SZA...


Done.


Searching for "redrum" by 21 Savage...


Done.


Searching for "Training Season" by Dua Lipa...


Done.


Searching for "Water" by Tyla...


Done.


Searching for "Feather" by Sabrina Carpenter...


Done.


Searching for "Lovin On Me" by Jack Harlow...


Done.


Searching for "One Of The Girls" by The Weeknd...


Done.


Searching for "Paint The Town Red" by Doja Cat...


Done.


Searching for "Cruel Summer" by Taylor Swift...


Done.


Searching for "Scared To Start" by Michael Marcagi...


Done.


Searching for "yes, and?" by Ariana Grande...


Done.


Searching for "Strangers" by Kenya Grace...


Done.


Searching for "My Love Mine All Mine" by Mitski...


Done.


Searching for "I Remember Everything (feat. Kacey Musgraves)" by Zach Bryan...


Done.


Error during sentiment analysis: The expanded size of the tensor (528) must match the existing size (514) at non-singleton dimension 1.  Target sizes: [1, 528].  Tensor sizes: [1, 514]
Searching for "exes" by Tate McRae...


Done.


Searching for "Puntería" by Shakira...


Done.


Searching for "Whatever She Wants" by Bryson Tiller...


Done.


Searching for "Houdini" by Dua Lipa...


Done.


Searching for "Snooze" by SZA...


Done.


Searching for "Agora Hills" by Doja Cat...


Done.


Searching for "Type Shit" by Future...


Done.


Searching for "Never Lose Me (feat. SZA & Cardi B)" by Flo Milli...


Done.


Error during sentiment analysis: The expanded size of the tensor (525) must match the existing size (514) at non-singleton dimension 1.  Target sizes: [1, 525].  Tensor sizes: [1, 514]
Searching for "Made For Me" by Muni Long...


Done.


Searching for "Too Sweet" by Hozier...


Done.


Searching for "Home" by Good Neighbours...


Done.


Searching for "Austin" by Dasha...


Done.


Searching for "Slow It Down" by Benson Boone...


Done.


Searching for "vampire" by Olivia Rodrigo...


Done.


Searching for "Igual Que Un Ángel" by Kali Uchis...


Done.


Searching for "Popular" by The Weeknd...


Done.


Searching for "Rich Baby Daddy (feat. Sexyy Red & SZA)" by Drake...


Done.


Searching for "i like the way you kiss me" by Artemas...


Done.


Searching for "Make You Mine" by Madison Beer...


Done.


Searching for "CONTIGO" by KAROL G...


Done.


Searching for "Seven (feat. Latto)" by Jung Kook...


Done.


Searching for "Whatever" by Kygo...


Done.


Searching for "What Was I Made For? [From The Motion Picture "Barbie"]" by Billie Eilish...


Done.


Error during sentiment analysis: The expanded size of the tensor (584) must match the existing size (514) at non-singleton dimension 1.  Target sizes: [1, 584].  Tensor sizes: [1, 514]
Searching for "Flowers" by Miley Cyrus...


Done.


Searching for "Daylight" by David Kushner...


Done.


Searching for "LA FALDA" by Myke Towers...


Done.


Searching for "Is It Over Now? (Taylor's Version) (From The Vault)" by Taylor Swift...


Done.


Searching for "Standing Next to You" by Jung Kook...


Done.


In [15]:
df_tracks.to_csv("../assets/data/all_tracks+lyrics.csv", index=False)
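
The title clean-up performed at the top of the loop can be tried in isolation. The sketch below wraps the two `partition` calls in a hypothetical helper, `clean_title`, and runs it on two made-up titles plus one title from the run log. It shows that only the " (with" and " - From" suffixes are trimmed, so a "(feat. ...)" suffix is passed to the Genius search unchanged.

```python
def clean_title(name):
    """Trim the suffixes the loop removes before searching Genius."""
    name = name.partition(" (with")[0]
    return name.partition(" - From")[0]

# The first two titles are made up; the third appears in the run log
print(clean_title("Some Song (with Another Artist)"))      # Some Song
print(clean_title("Some Song - From the Motion Picture"))  # Some Song
print(clean_title("Seven (feat. Latto)"))                  # Seven (feat. Latto)
```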

-----------------------------------