## Natural Language Processing

Using the data gathered from the Spotify API, we now want to extract and process the lyrics for each song. This is accomplished through scraping textual information, namely lyrical data, from the **Genius Lyrics** website. Following extraction, the lyrics are thoroughly processed and cleaned before undergoing sentiment analysis. 





In [1]:
import pandas as pd
import re
import contractions
import string
from better_profanity import profanity
from nltk.tokenize import word_tokenize
from deep_translator import GoogleTranslator
from langdetect import detect
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tag import pos_tag
from nltk.corpus import wordnet
import lyricsgenius
from transformers import pipeline
from spellchecker import SpellChecker


all_tracks = pd.read_csv("../assets/data/all_tracks.csv")







### Scraping the Web

To get started, the script below imports `lyricsgenius`, a Python client for the Genius API that also scrapes the Genius Lyrics website to retrieve the lyrics of a given song. Once the `genius` object is initialized with an API access token, it can search for a track and return its lyrics, such as "Too Many Nights" by Metro Boomin.


In [2]:
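import os
import lyricsgenius

# Read the Genius API token from the environment rather than hard-coding it.
# GENIUS_ACCESS_TOKEN is an assumed variable name; set it before running this cell.
genius = lyricsgenius.Genius(os.environ["GENIUS_ACCESS_TOKEN"])
song = genius.search_song("Too Many Nights", "Metro Boomin")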

Searching for "Too Many Nights" by Metro Boomin...


Done.


First, we define a function that retrieves the lyrics for any song and artist from the Genius database. As shown below, it first searches for the track using the provided name and artist and then extracts the lyrics from the search results.

In [3]:
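def get_song_lyrics(song_name, song_artist):
    song_genius = genius.search_song(song_name, song_artist)
    if song_genius is None:
        # search_song returns None when Genius has no match
        raise ValueError(f"No lyrics found for {song_name!r} by {song_artist!r}")
    # Keep only the text after the first occurrence of "Lyrics" (the page header)
    song_lyrics = song_genius.lyrics.partition("Lyrics")[2]
    # Remove section tags in square brackets (e.g. [Chorus]) and the trailing "<number>Embed" marker
    song_lyrics = re.sub(r"\[.*?\]|\d+Embed", "", song_lyrics)
    # Remove hyphen-delimited words such as -word-
    song_lyrics = re.sub(r"(\-[A-Za-z]+\-)", "", song_lyrics)
    # Remove any remaining digits
    song_lyrics = re.sub(r'\d+', '', song_lyrics)

    return song_lyrics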
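
To make the effect of these substitutions concrete, the sketch below applies the same regular expressions to a hand-written string. The helper name `strip_genius_markup` and the sample text are made up for illustration; real Genius pages may be laid out differently.

```python
import re

def strip_genius_markup(raw_lyrics):
    """Apply the same substitutions as get_song_lyrics to a raw string."""
    text = raw_lyrics.partition("Lyrics")[2]       # drop everything up to the header
    text = re.sub(r"\[.*?\]|\d+Embed", "", text)   # section tags and "<n>Embed"
    text = re.sub(r"(\-[A-Za-z]+\-)", "", text)    # hyphen-delimited words
    return re.sub(r"\d+", "", text)                # remaining digits

# Hypothetical raw text, made up for illustration only
sample = "Song Title Lyrics[Chorus]\nKeep it 100 tonight\n[Outro]\nGoodbye12Embed"
cleaned = strip_genius_markup(sample)
print(repr(cleaned))  # '\nKeep it  tonight\n\nGoodbye'
```

Two side effects are worth noting: digits that belong to the lyrics ("100") are removed along with the page residue, and `partition` splits at the first occurrence of "Lyrics", even when that word is part of the song title.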

### Pre-Processing Text Data

With `get_song_lyrics` in place, the next step is to pre-process the text. This cleaning removes profanity and patterns that hinder readability. The Python script performs the following steps:

1. Language Detection
2. Expanding Contractions
3. Converting Text to Lowercase
4. Spell Checking + Censoring
5. Removing Punctuation
6. Tokenization

The function `detect_and_translate` below detects the language of the lyrics and compares it to the target language (English by default). If the two match, the text is returned unchanged; otherwise the function uses `GoogleTranslator` to translate the lyrics into the target language.


In [4]:
# Function to detect and translate text
def detect_and_translate(track_lyrics, target_lang='en'):
    if detect(track_lyrics) == target_lang:
        return track_lyrics
    translator = GoogleTranslator(source='auto', target=target_lang)
    return translator.translate(track_lyrics)
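
Language detection has little to work with when the lyrics are empty, for example for an instrumental track. The sketch below shows one way to guard that case. It is not the notebook's implementation: `translate_if_needed` is a hypothetical helper, and the detector and translator are passed in as plain functions (stand-ins for `detect` and `GoogleTranslator(...).translate`) so the example runs offline.

```python
def translate_if_needed(text, detect_fn, translate_fn, target_lang="en"):
    """Return text unchanged if it is blank or already in target_lang."""
    if not text.strip():
        return text  # nothing to detect or translate
    if detect_fn(text) == target_lang:
        return text
    return translate_fn(text)

# Stand-in callables for illustration only
def fake_detect(text):
    return "es" if "hola" in text else "en"

def fake_translate(text):
    return text.replace("hola", "hello")

print(translate_if_needed("hola amigo", fake_detect, fake_translate))    # hello amigo
print(translate_if_needed("hello friend", fake_detect, fake_translate))  # hello friend
print(repr(translate_if_needed("   ", fake_detect, fake_translate)))     # '   '
```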

We also develop various functions to support the preprocessing of textual data, streamlining the process and improving the accuracy of the final output. Among these functions are a method for removing punctuation from a given string of lyrics and a spell-checker that automatically finds and corrects any spelling errors.


In [5]:
def remove_punctuation(text):
    no_punct = ""
    for char in text:
        if char not in string.punctuation:
            no_punct = no_punct + char
    return no_punct  # return unpunctuated string
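
As an aside, the same result can be obtained without a character-by-character loop. The sketch below uses `str.translate` with a deletion table; `remove_punctuation_fast` is a hypothetical name and is not used elsewhere in this notebook.

```python
import string

# Translation table that deletes every ASCII punctuation character
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def remove_punctuation_fast(text):
    return text.translate(_PUNCT_TABLE)

print(remove_punctuation_fast("don't stop, believin'!"))  # dont stop believin
```

Both versions rely on `string.punctuation`, which covers ASCII punctuation only, so typographic characters such as curly quotes pass through unchanged.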

In [6]:
# Spell check (censoring happens later, in clean_song_lyrics)
spell = SpellChecker()

def spell_check(word_list_str):
    word_corrected_list = []
    for word in word_list_str.split():
        word_corrected = spell.correction(word)
        if word_corrected is not None:
            word_corrected_list.append(word_corrected)
        else:
            word_corrected_list.append(word)
    return word_corrected_list

The `clean_song_lyrics` function wraps the full cleaning pipeline for a given song and artist. It retrieves the lyrics from Genius, translates them to English if needed, expands contractions, converts the text to lowercase, and strips the filler syllables "nana" and "lala". It then corrects spelling, censors profanity, removes punctuation, and tokenizes the result. Finally, non-ASCII characters are dropped from each token, and the function returns the cleaned lyrics as a list of words.

In [7]:
def clean_song_lyrics(song_name, artist_name):
    genius_lyrics = get_song_lyrics(song_name, artist_name)  # 1. Fetch lyrics from Genius
    lyrics_en = detect_and_translate(genius_lyrics, "en")  # 2. Translate to English if needed

    no_contract = [contractions.fix(word) for word in lyrics_en.split()]  # 3. Expand contractions
    no_contract_str = " ".join(no_contract).lower()  # 4. Lowercase
    no_contract_str = re.sub(r"nana|lala", "", no_contract_str)  # 4. Strip filler syllables

    corrected = spell_check(no_contract_str)  # 5. Spell check
    censored = profanity.censor(" ".join(corrected), censor_char="")  # 5. Censor profanity
    no_punct = remove_punctuation(censored)  # 6. Remove punctuation

    tokenized = word_tokenize(no_punct)  # 7. Tokenize
    # 8. Drop non-ASCII characters from each token
    strencode = [i.encode("ascii", "ignore") for i in tokenized]
    return [i.decode() for i in strencode]
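
Two of these steps have side effects that are easy to miss, so the sketch below isolates them with the standard library only. `strip_fillers` and `to_ascii` mirror the corresponding lines of `clean_song_lyrics`; `strip_fillers_whole_words` is a hypothetical variant shown for comparison, not something the notebook uses.

```python
import re

def strip_fillers(text):
    """Same substitution as clean_song_lyrics: delete 'nana' and 'lala' anywhere."""
    return re.sub(r"nana|lala", "", text)

def strip_fillers_whole_words(text):
    """Hypothetical variant: only delete standalone runs such as 'nana' or 'lalala'."""
    return re.sub(r"\b(?:na|la){2,}\b", "", text)

def to_ascii(tokens):
    """Same encode/decode round trip as clean_song_lyrics."""
    return [t.encode("ascii", "ignore").decode() for t in tokens]

print(repr(strip_fillers("banana nana")))              # 'ba '
print(repr(strip_fillers_whole_words("banana nana")))  # 'banana '
print(to_ascii(["caf\u00e9", "se\u00f1orita", "love"]))  # ['caf', 'seorita', 'love']
```

The substring pattern also trims ordinary words that happen to contain a filler, and the ASCII round trip silently drops accented letters instead of transliterating them. Whether either matters depends on how the tokens are used later.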

#### Removing Stop Words

We use the English stopword list from the Natural Language Toolkit (*NLTK*) `stopwords` corpus, extended with a few lyric-specific fillers such as "yeah" and "oh", to filter out stopwords. By removing frequently used words like "the," "and," or "of," the resulting text becomes more concise, which makes the content words of the lyrics easier to examine.


In [8]:
def remove_stopwords_lyrics(clean_lyrics_decode):
    stopword = stopwords.words("english")
    stopword.extend(["yeah", "nanana", "nana", "oh", "la"])
    return [word for word in clean_lyrics_decode if word not in stopword]
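
The filtering logic can be tried out without downloading the NLTK corpus. In the sketch below, `BASE_STOPWORDS` is a tiny stand-in for `stopwords.words("english")`, and the names are illustrative. Building a set once, rather than a list on every call, keeps each membership test cheap when the function runs over many songs.

```python
# Tiny stand-in list; the notebook uses nltk's stopwords.words("english")
BASE_STOPWORDS = ["the", "and", "of", "i", "you"]
LYRIC_FILLERS = ["yeah", "nanana", "nana", "oh", "la"]

# Build the set once so each membership test is O(1) on average
STOPWORD_SET = frozenset(BASE_STOPWORDS + LYRIC_FILLERS)

def remove_stopwords(tokens):
    return [word for word in tokens if word not in STOPWORD_SET]

print(remove_stopwords(["oh", "i", "love", "the", "rain", "yeah"]))  # ['love', 'rain']
```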

#### Lemmatization



Next, we define a function to perform lemmatization on a set of words using the `WordNetLemmatizer` class from the NLTK library. Lemmatization standardizes words by reducing them to their root or base form. Because the lemmatizer needs to know each word's part of speech, we first tag every token with `pos_tag` and map the resulting tags to the adjective, verb, noun, and adverb categories that WordNet uses; any other tag falls back to noun.




In [9]:
from nltk.corpus import stopwords, wordnet

def get_wordnet_pos(tag):
    if tag.startswith("J"):
        return wordnet.ADJ
    elif tag.startswith("V"):
        return wordnet.VERB
    elif tag.startswith("N"):
        return wordnet.NOUN
    elif tag.startswith("R"):
        return wordnet.ADV
    else:
        return wordnet.NOUN

In [10]:
from nltk import pos_tag

def word_lemmatize(lyrics_cleaned):
    # Tag each token with its Penn Treebank part of speech
    pos_tags = pos_tag(lyrics_cleaned)
    # Map each tag to the WordNet POS constant the lemmatizer expects
    wordnet_pos = [(word, get_wordnet_pos(tag)) for (word, tag) in pos_tags]

    wnl = WordNetLemmatizer()  # Lemmatize Lyrics
    return [wnl.lemmatize(word, tag) for word, tag in wordnet_pos]

In summary, the code above tags each word with its part of speech and passes that tag to the `WordNetLemmatizer` class from the NLTK library, so that adjectives, verbs, nouns, and adverbs are each reduced to their base form.
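
The tag mapping itself can be checked without NLTK installed. The sketch below restates `get_wordnet_pos` as a dictionary lookup, using the single-letter codes that sit behind `wordnet.ADJ`, `wordnet.VERB`, `wordnet.NOUN`, and `wordnet.ADV`; the name `treebank_to_wordnet` is illustrative.

```python
# WordNet's single-letter POS codes (the values behind wordnet.ADJ, VERB, NOUN, ADV)
ADJ, VERB, NOUN, ADV = "a", "v", "n", "r"

def treebank_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS code, defaulting to noun."""
    prefix_map = {"J": ADJ, "V": VERB, "N": NOUN, "R": ADV}
    return prefix_map.get(tag[:1], NOUN)

for tag in ["JJ", "VBD", "NNS", "RB", "IN"]:
    print(tag, "->", treebank_to_wordnet(tag))
```

Only the first letter of the tag matters, so every verb tag (`VB`, `VBD`, `VBG`, and so on) maps to the verb category, while tags outside the four groups, such as the preposition tag `IN`, fall back to noun.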


--------------


## Sentiment Analysis


Next, we use the `pipeline` class from the Hugging Face `transformers` library to run predictions with pretrained models hosted on the Hub. The code below creates three pipelines, each backed by a different model for text classification: one scores emotions, and the other two score sentiment.


In [11]:
import warnings
warnings.filterwarnings('ignore')
# python -m pip install "tensorflow<2.11"
# python -m pip install "protobuf<3.2"


In [12]:
import transformers
from transformers import pipeline

# Initialize the emotion and sentiment classifiers
classifiers = [
    pipeline("text-classification", model='bhadresh-savani/distilbert-base-uncased-emotion', return_all_scores=True),
    pipeline("text-classification", model='cardiffnlp/twitter-roberta-base-sentiment', return_all_scores=True),
    pipeline("sentiment-analysis", return_all_scores=True)
]

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


The `get_lyric_sentiment` function takes pre-processed lyrics as input and returns a dictionary of scores. It runs each of the three classifiers on the joined text and merges their label-to-score outputs into a single dictionary keyed by label. For instance, one of these classifiers is the *distilbert-base-uncased-emotion* model, specifically trained to detect "emotions in texts such as sadness, joy, love, anger, fear, and surprise".


In [13]:
# Function to perform sentiment analysis
def get_lyric_sentiment(lyrics, classifiers):
    text = ' '.join(lyrics)
    scores = {}
    for classifier in classifiers:
        try:
            predictions = classifier(text, truncation=True)
            for prediction in predictions[0]:
                scores[prediction['label']] = prediction['score']
        except Exception as e:
            print(f"Error during sentiment analysis: {e}")
    return scores
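
The merging step can be illustrated without loading any model. In the sketch below, the two stand-in classifiers return made-up scores in the same nested shape that the loop above iterates over (`predictions[0]` is a list of label/score dictionaries); none of the numbers come from a real model.

```python
def merge_scores(text, classifiers):
    """Collect label -> score pairs from every classifier into one dict."""
    scores = {}
    for classifier in classifiers:
        predictions = classifier(text)
        for prediction in predictions[0]:
            scores[prediction["label"]] = prediction["score"]
    return scores

# Stand-in classifiers with made-up scores, for illustration only
def fake_emotion(text):
    return [[{"label": "joy", "score": 0.7}, {"label": "fear", "score": 0.3}]]

def fake_sentiment(text):
    return [[{"label": "NEGATIVE", "score": 0.4}, {"label": "POSITIVE", "score": 0.6}]]

merged = merge_scores("some lyrics", [fake_emotion, fake_sentiment])
print(merged)  # {'joy': 0.7, 'fear': 0.3, 'NEGATIVE': 0.4, 'POSITIVE': 0.6}
```

Because the result is keyed by label alone, two classifiers that emit the same label name would overwrite each other, with the later one winning.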

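If the lyric sequence contains more than 512 tokens, the model raises an exception in its 'embeddings' layer. The function above guards against this in two ways: it passes `truncation=True` to each classifier, and it catches any exception so that one failing model does not abort the whole song. Truncation only takes effect when the tokenizer declares a maximum length. Otherwise the classifier is skipped with an error message and its labels are missing for that song, as happens for "Not Like Us" in the output further down.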
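
One simple way to stay under the limit regardless of tokenizer settings is to cap the number of words before the text reaches a classifier. The sketch below is an approximation rather than the notebook's approach: `cap_words` is a hypothetical helper, and the default of 300 words is an arbitrary assumption that leaves headroom because a subword tokenizer can split one word into several tokens.

```python
def cap_words(tokens, max_words=300):
    """Keep at most max_words tokens, then join them into one string."""
    return " ".join(tokens[:max_words])

tokens = ["word"] * 1000
text = cap_words(tokens)
print(len(text.split()))  # 300
```

The trade-off is that everything after the cap is ignored, so long songs are scored on their opening verses only.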

---------------------


## Putting it All Together


To summarize, the code below collects data and performs text analysis on every song in a playlist. It iterates over the list of tracks and their artists, cleaning each song's lyrics along the way. The cleaning process removes nonessential characters, resulting in a more precise depiction of the song's content. The outcome is three token lists per song (cleaned, stopword-filtered, and lemmatized) that can support a word-frequency analysis of the lyrics.

Additionally, the program computes a sentiment score for each song based on the lyrics, indicating whether the lyrics are positive, negative, or neutral. It also collects information about the song and artist, such as the release date, length, popularity, and genre. Finally, the program compiles all this information into a dataframe for further analysis.
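
Before searching Genius, the loop trims two kinds of suffix from the Spotify track name. The sketch below isolates that step. `clean_title` is an illustrative name, and the first two titles are made up; the third appears in the playlist.

```python
def clean_title(name):
    """Drop ' (with ...' and ' - From ...' suffixes from a track name."""
    name = name.partition(" (with")[0]
    return name.partition(" - From")[0]

print(clean_title("Some Song (with Another Artist)"))        # Some Song
print(clean_title("Theme Tune - From the Motion Picture"))   # Theme Tune
print(clean_title("I Had Some Help (Feat. Morgan Wallen)"))  # unchanged
```

The match is literal and case-sensitive, so a "(Feat. ...)" or "(feat. ...)" suffix is left in place and sent to the search as part of the title.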






In [14]:
track_data = []
for i, track in all_tracks.iterrows():

    # Strip featured-artist and soundtrack suffixes before searching Genius
    song_name = track["name"].partition(" (with")[0]
    song_name = song_name.partition(" - From")[0]
    
    artist_name = track["artist"]

    try:
        track_lyrics = clean_song_lyrics(song_name, artist_name)
        stopwords_removed = remove_stopwords_lyrics(track_lyrics)
        lemmatized = word_lemmatize(stopwords_removed)

        sentiment_scores = get_lyric_sentiment(stopwords_removed, classifiers)

        track_info = track.to_dict()
        track_info.update(sentiment_scores)

        track_info["lyrics"] = track_lyrics
        track_info["stopwords_removed"] = stopwords_removed
        track_info["lemmatized"] = lemmatized

        track_data.append(track_info)

    except Exception as e:
        print(f"Error processing track {track['name']} by {track['artist']}: {e}")

df_tracks = pd.DataFrame(track_data)

Searching for "Please Please Please" by Sabrina Carpenter...


Done.


Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Searching for "Si Antes Te Hubiera Conocido" by KAROL G...


Done.


Searching for "BIRDS OF A FEATHER" by Billie Eilish...


Done.


Searching for "Good Luck, Babe!" by Chappell Roan...


Done.


Searching for "A Bar Song (Tipsy)" by Shaboozey...


Done.


Searching for "Not Like Us" by Kendrick Lamar...


Done.


Error during sentiment analysis: The expanded size of the tensor (519) must match the existing size (514) at non-singleton dimension 1.  Target sizes: [1, 519].  Tensor sizes: [1, 514]
Searching for "MILLION DOLLAR BABY" by Tommy Richman...


Done.


Searching for "Too Sweet" by Hozier...


Done.


Searching for "Beautiful Things" by Benson Boone...


Done.


Searching for "I Had Some Help (Feat. Morgan Wallen)" by Post Malone...


Done.


Searching for "Espresso" by Sabrina Carpenter...


Done.


Searching for "i like the way you kiss me" by Artemas...


Done.


Searching for "Stargazing" by Myles Smith...


Done.


Searching for "LUNCH" by Billie Eilish...


Done.


Searching for "End of Beginning" by Djo...


Done.


Searching for "we can't be friends (wait for your love)" by Ariana Grande...


Done.


Searching for "Lose Control" by Teddy Swims...


Done.


Searching for "Tough" by Quavo...


Done.


Searching for "Austin" by Dasha...


Done.


Searching for "I Can Do It With a Broken Heart" by Taylor Swift...


Done.


Searching for "Houdini" by Eminem...


Done.


Searching for "Nasty" by Tinashe...


Done.


Searching for "Belong Together" by Mark Ambor...


Done.


Searching for "Slow It Down" by Benson Boone...


Done.


Searching for "HOT TO GO!" by Chappell Roan...


Done.


Searching for "GIRLS" by The Kid LAROI...


Done.


Searching for "greedy" by Tate McRae...


Done.


Searching for "Move" by Adam Port...


Done.


Searching for "Fortnight (feat. Post Malone)" by Taylor Swift...


Done.


Searching for "Saturn" by SZA...


Done.


Searching for "28" by Zach Bryan...


Done.


Searching for "Close To You" by Gracie Abrams...


Done.


Searching for "the boy is mine" by Ariana Grande...


Done.


Searching for "Stick Season" by Noah Kahan...


Done.


Searching for "I Don't Wanna Wait" by David Guetta...


Done.


Searching for "Smeraldo Garden Marching Band (feat. Loco)" by Jimin...


Done.


Searching for "Stumblin' In" by CYRIL...


Done.


Searching for "360" by Charli xcx...


Done.


Searching for "Rockstar" by LISA...


Done.


Searching for "One Of The Girls" by The Weeknd...


Done.


Searching for "Scared To Start" by Michael Marcagi...


Done.


Searching for "Lies Lies Lies" by Morgan Wallen...


Done.


Searching for "feelslikeimfallinginlove" by Coldplay...


Done.


Searching for "Parking Lot" by Mustard...


Done.


Searching for "Gata Only" by FloyyMenor...


Done.


Searching for "BAND4BAND (feat. Lil Baby)" by Central Cee...


Done.


Searching for "Santa" by Rvssian...


Done.


Searching for "Magnetic" by ILLIT...


Done.


Searching for "Water" by Tyla...


Done.


Searching for "Illusion" by Dua Lipa...


Done.


In [15]:
#df_tracks = pd.DataFrame(track_data)
df_tracks.to_csv("../assets/data/all_tracks+lyrics.csv", index=False)

In [16]:
df_tracks

Unnamed: 0,name,track_id,album,artist,artist_id,release_date,length,popularity,artist_pop,artist_genres,...,fear,surprise,LABEL_0,LABEL_1,LABEL_2,NEGATIVE,POSITIVE,lyrics,stopwords_removed,lemmatized
0,Please Please Please,5N3hjp1WNayUPZrA8kJmJP,Please Please Please,Sabrina Carpenter,74KM79TiuVKeVCqs8QtB0B,2024-06-06,186365,98,91,['pop'],...,0.000637,0.000841,0.251058,0.542962,0.20598,0.857851,0.142149,"[i, know, i, have, good, judgment, i, know, i,...","[know, good, judgment, know, good, taste, funn...","[know, good, judgment, know, good, taste, funn..."
1,Si Antes Te Hubiera Conocido,6WatFBLVB0x077xWeoVc2k,Si Antes Te Hubiera Conocido,KAROL G,790FomKkXshlbRYZFtlgla,2024-06-21,195824,91,89,"['reggaeton', 'reggaeton colombiano', 'trap la...",...,0.001239,0.000147,0.030611,0.523545,0.445844,0.96625,0.03375,"[what, what, we, are, ready, to, rule, summer,...","[ready, rule, summer, started, fire, would, me...","[ready, rule, summer, start, fire, would, meet..."
2,BIRDS OF A FEATHER,6dOtVTDdiauQNBQEDOtlAB,HIT ME HARD AND SOFT,Billie Eilish,6qqNVTkY8uBg9cP3Jd7DAH,2024-05-17,210373,98,94,"['art pop', 'pop']",...,0.434426,0.034063,0.122799,0.504202,0.372999,0.959893,0.040107,"[i, want, you, to, stay, til, i, am, in, the, ...","[want, stay, til, grave, til, rot, away, dead,...","[want, stay, til, grave, til, rot, away, dead,..."
3,"Good Luck, Babe!",0WbMK4wrZ1wFSty9F7FCgu,"Good Luck, Babe!",Chappell Roan,7GlBOeep6PqTfFi59PTUUN,2024-04-05,218423,94,86,"['indie pop', 'pov: indie']",...,0.000427,0.000471,0.343443,0.55466,0.101897,0.981645,0.018355,"[it, is, fine, it, is, cool, you, can, say, th...","[fine, cool, say, nothing, know, truth, guess,...","[fine, cool, say, nothing, know, truth, guess,..."
4,A Bar Song (Tipsy),2FQrifJ1N335Ljm3TjTVVf,A Bar Song (Tipsy),Shaboozey,3y2cIKLjiOlp1Np37WiUdH,2024-04-12,171291,93,81,['pop rap'],...,0.009928,0.003751,0.057218,0.763994,0.178788,0.994249,0.005751,"[my, baby, want, a, barking, she, is, been, te...","[baby, want, barking, telling, night, long, ga...","[baby, want, bark, tell, night, long, gasoline..."
5,Not Like Us,6AI3ezQ4o3HUoP6Dhudph3,Not Like Us,Kendrick Lamar,2YZyLoL8N0Wb9xBt1NhZWg,2024-05-04,274192,96,92,"['conscious hip hop', 'hip hop', 'rap', 'west ...",...,0.007257,0.004582,,,,0.997331,0.002669,"[psst, i, see, dead, people, mustard, on, the,...","[psst, see, dead, people, mustard, beat, musta...","[psst, see, dead, people, mustard, beat, musta..."
6,MILLION DOLLAR BABY,7fzHQizxTqy8wTXwlrgPQQ,MILLION DOLLAR BABY,Tommy Richman,1WaFQSHVGZQJTbf0BdxdNo,2024-04-26,155151,86,83,['chill abstract hip hop'],...,0.032734,0.004189,0.280416,0.655696,0.063888,0.992852,0.007148,"[do, it, baby, do, what, i, should, think, do,...","[baby, think, baby, could, think, baby, think,...","[baby, think, baby, could, think, baby, think,..."
7,Too Sweet,4IadxL6BUymXlh8RCJJu7T,Unheard,Hozier,2FXC3k01G6Gw61bmprjgqS,2024-03-22,251424,83,85,"['irish singer-songwriter', 'modern rock', 'po...",...,0.002293,0.001093,0.12037,0.648945,0.230685,0.984314,0.015686,"[it, can, not, be, said, i, am, an, early, bir...","[said, early, bird, clock, say, word, baby, ne...","[say, early, bird, clock, say, word, baby, nev..."
8,Beautiful Things,6tNQ70jh4OwmPGpYy6R2o9,Beautiful Things,Benson Boone,22wbnEMDvgVIAGdFeek6ET,2024-01-18,180304,91,85,['singer-songwriter pop'],...,0.996488,0.000843,0.033857,0.318134,0.648009,0.012351,0.987649,"[for, a, while, there, it, was, rough, but, la...","[rough, lately, better, last, four, cold, reme...","[rough, lately, well, last, four, cold, rememb..."
9,I Had Some Help (Feat. Morgan Wallen),7221xIgOnuakPdLqT0F3nP,I Had Some Help,Post Malone,246dkjvS1zLTtiykXe5h60,2024-05-10,178205,95,90,"['dfw rap', 'melodic rap', 'pop', 'rap']",...,0.000706,0.000557,0.210037,0.71791,0.072052,0.998456,0.001544,"[you, got, a, got, ta, nerve, do, not, you, ba...","[got, got, ta, nerve, baby, hit, curb, made, t...","[get, get, ta, nerve, baby, hit, curb, make, t..."


-----------------------------------