# **Has Pop Music Gotten Sadder over the Past 10 Years?**

## Will Young and Rainier Hardjanto

**Research Question**

Americans are getting sadder. Ten years ago, a yearly Gallup poll ranked the US as the 11th happiest country. This year, it was ranked 23rd. Senator Chris Murphy recently spoke out on this, saying "over the last 10 years the rate of happiness, of contentment, of fulfillment, self-reported by Americans, is dropping". This raises the question of whether the music we listen to has gotten sadder with us. The answer is not obvious: one could speculate that we turn to music to escape from our reality, which would mean that our music gets happier as we get sadder, but it is also plausible that we want to relate to music more than we want to use it to escape, which would lead to our music getting sadder with us. To investigate this question, we perform sentiment analysis on the lyrics of the most popular songs from 2013 to 2023.

**Methodology**

The data that we used to address our research question is the lyrics from every song on the year-end Billboard Hot 100 chart for every year from 2013 through 2023. This data was pulled from the Genius website (https://genius.com/), an online music encyclopedia that contains lyrics to millions of songs.

To scrape this data, we first imported our libraries: pandas to organize the scraped data, and the LyricsGenius library to simplify requests to Genius's API. We also imported our helper function file and our API key.

In [8]:
import pandas as pd
from lyricsgenius import Genius
import helper_function
import api_key  # You'll have to make one of these yourself on the Genius website

genius = Genius(api_key.client_access_token)



We then loaded and formatted our initial data. In this case, our initial data was a CSV file listing the title, artist, year, and chart position for every song we were planning to scrape.

In [5]:
# Load and format billboard data
billboard_df = pd.read_csv("billboard_100.csv")
artists = helper_function.generate(billboard_df)
artists_series = pd.Series(artists)
lyrics = []

Once our data was loaded, we were ready to begin scraping songs. This process took quite a long time (multiple hours). Once the process was complete, the lyrics were added to our original data and the completed dataset was saved to a file. 

In [None]:
# Number of songs to be scraped. If scraping all songs, should be 1100
NUM_SONGS = 1100

# Scrape the songs
for i in range(NUM_SONGS):
    while True:
        try:
            song = genius.search_song(
                billboard_df["Title"][i], artists_series[i]
            )
            break
        except Exception:
            # The request often times out, so request again. Catching
            # Exception (not a bare except) keeps Ctrl+C working.
            pass
    lyrics.append(helper_function.format_genius_lyrics(song.lyrics))

# Save the scraped data into a DataFrame
df = pd.DataFrame(
    {
        "No.": billboard_df["No."][:NUM_SONGS],
        "Title": billboard_df["Title"][:NUM_SONGS],
        "Artists": artists_series[:NUM_SONGS],
        "Year": billboard_df["Year"][:NUM_SONGS],
        "Lyrics": pd.Series(lyrics),
    }
)

# Save the data to a file
df.to_csv("billboard_data_with_lyrics.csv", encoding="utf-8", index=False)
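
The retry-on-timeout pattern used in the scraping loop can also be written with a cap on the number of attempts, so that a song that never succeeds does not stall the run forever. The following is a minimal sketch, not the code we ran: `call_with_retries` and `flaky_request` are hypothetical names introduced for illustration, and the simulated timeout stands in for a real Genius request.

```python
import time


def call_with_retries(request, max_attempts=5, base_delay=0.0):
    """Call `request()` until it succeeds or `max_attempts` is reached.

    Hypothetical helper for illustration; not part of LyricsGenius.
    """
    for attempt in range(max_attempts):
        try:
            return request()
        except Exception:
            # Re-raise the error once the final attempt has failed
            if attempt == max_attempts - 1:
                raise
            # Wait a little longer after each failure (exponential backoff)
            time.sleep(base_delay * 2**attempt)


# Stand-in for a network call that times out twice before succeeding
attempts_made = []


def flaky_request():
    attempts_made.append(1)
    if len(attempts_made) < 3:
        raise TimeoutError("simulated timeout")
    return "lyrics"


result = call_with_retries(flaky_request)
print(result, "after", len(attempts_made), "attempts")
```

With a real API, a nonzero `base_delay` would space the requests out instead of retrying immediately.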

To use this data to answer the research question, we need to understand the emotions present in the lyrics. Quantifying these emotions lets us track them over time and see whether lyrics are trending sadder. To do this, we processed the scraped lyrics using NLTK's VADER. VADER, which stands for "Valence Aware Dictionary and sEntiment Reasoner", is a lexicon- and rule-based sentiment analysis tool that rates the sentiment of a passage of text from -1 to 1, where -1 is extremely negative sentiment and 1 is extremely positive sentiment. We selected VADER because, in addition to being a highly regarded rule-based sentiment analysis tool, it is designed to handle slang, which is extremely common in pop music.
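
To give a sense of where the -1 to 1 range comes from, here is a minimal sketch of the normalization step, assuming the formula used in the reference VADER implementation: the summed word valences are divided by the square root of their square plus a constant `alpha` (15 by default). The function name `normalize` and the sample raw values are ours, chosen for illustration; this is not a substitute for calling the analyzer itself.

```python
import math


def normalize(raw_score, alpha=15):
    """Squash an unbounded summed valence into the open interval (-1, 1)."""
    return raw_score / math.sqrt(raw_score * raw_score + alpha)


# Larger raw sums move the compound score toward the ends of the range
for raw in (-20.0, -2.0, 0.0, 2.0, 20.0):
    print(f"raw={raw:6.1f} -> compound={normalize(raw):+.4f}")
```

One consequence of this shape is that inputs with many sentiment-bearing words are pushed toward the ends of the range, which is worth keeping in mind when scoring entire sets of lyrics at once.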

The primary way that we processed this data was by finding the sentiment score for every song we scraped. We then added this data to the dataframe that we previously saved to the file, and used this dataframe to build our first visualization.

In [9]:
%matplotlib
import csv
import re
import nltk
import mplcursors  # separate package must be installed
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from wordcloud import WordCloud
from matplotlib.patheffects import withSimplePatchShadow
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download("vader_lexicon")  # Run this line the first time you run this code

# Initialize the sentiment analyzer
analyzer = SentimentIntensityAnalyzer()

# Import data
loaded_data = pd.read_csv("billboard_data_with_lyrics.csv")
lyrics_data = loaded_data["Lyrics"]

# FIRST VISUALIZATION - Polarity of Every Top Song with Trendline
# Compute the VADER compound score for every song's lyrics; these scores are
# added to the Billboard dataframe as a Positivity column below
positivity_scores = [
    analyzer.polarity_scores(lyrics)["compound"] for lyrics in lyrics_data
]

all_data = pd.concat(
    [loaded_data, pd.DataFrame({"Positivity": positivity_scores})], axis=1
)

Using matplotlib backend: TkAgg


[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /home/wyoung/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


The second way we processed our data was by finding the top 5 artists for every year based on a ranking algorithm we created. This algorithm gave a relative weight to each position on the chart, and summed the weights of each artist's songs to determine the top artists. We then took those artists and found the average polarity of all their Top 100 songs from that year, which we used to create our second visualization.

In [11]:
top_artist_polarityscore = []

with open("billboard_data_with_lyrics.csv", mode="r", encoding="utf-8") as file:
    # Create a CSV reader
    csv_reader = csv.reader(file)

    # Convert it to a list to get the ability to use len() and indexing,
    # skipping the header row so that row 0 is the first song
    rows = list(csv_reader)[1:]

    # Iterate using indices
    for j in range(11):
        scores_by_artist = {}
        for i in range(100):
            score = 101 - i  # Calculate the score based on ranking
            artist = rows[j * 100 + i][2]
            if artist in scores_by_artist:
                scores_by_artist[artist] += score
            else:
                scores_by_artist[artist] = score
        # Step 2 & 3: Sort Artists by Total Score
        sorted_artists = sorted(
            scores_by_artist.items(), key=lambda x: x[1], reverse=True
        )

        # Step 4: Select the Top 5 Artists
        top_5_artists = sorted_artists[:5]
        top_5_artist_names = [artist[0] for artist in top_5_artists]

        polarity_score = {}
        polarity_count = {}
        for i in range(100):
            artist = rows[j * 100 + i][2]

            if artist not in top_5_artist_names:
                continue

            if artist in polarity_score:
                polarity_score[artist] += helper_function.polarity(
                    rows[j * 100 + i][4]
                )["compound"]
                polarity_count[artist] += 1
            else:
                polarity_score[artist] = helper_function.polarity(
                    rows[j * 100 + i][4]
                )["compound"]
                polarity_count[artist] = 1

        # Convert each artist's summed polarity into an average
        for artist in polarity_score:
            polarity_score[artist] /= polarity_count[artist]
        top_artist_polarityscore.append(polarity_score)

extended_data = top_artist_polarityscore
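
The ranking step above can be illustrated on a toy chart. This is a sketch under stated assumptions: the five-song chart, the artist names, and the `rank_artists` helper are all hypothetical, and the weight `chart_size + 1 - position` is just one simple linear weighting of the kind described, not necessarily the exact weights used in the cell above.

```python
from collections import defaultdict

# Hypothetical mini-chart: (chart position, artist) pairs for one year
toy_chart = [
    (1, "Artist A"),
    (2, "Artist B"),
    (3, "Artist B"),
    (4, "Artist C"),
    (5, "Artist A"),
]
CHART_SIZE = 5


def rank_artists(chart, chart_size, top_n):
    """Sum position-based weights per artist and return the top_n artists."""
    totals = defaultdict(int)
    for position, artist in chart:
        # Higher chart positions earn larger weights (No. 1 earns the most)
        totals[artist] += chart_size + 1 - position
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_n]


top_two = rank_artists(toy_chart, CHART_SIZE, top_n=2)
print(top_two)
```

Here Artist B outranks Artist A despite never reaching No. 1, because two mid-chart songs together outweigh one top song and one bottom song. Summing weights rewards this kind of chart presence.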

The third way we processed our data was by creating dictionaries of all the negative words (based on their VADER sentiment score) and the number of times they appear in the lyrics as a whole. We created three dictionaries (one for 2013 to 2016, one for 2017 to 2020, and one for 2021 to 2023) to investigate how the language used to describe sadness and negativity changed over time, and used these to create our third visualization.

In [12]:
CSV_FILE_PATH = "billboard_data_with_lyrics.csv"

# List of words that are not very interesting, don't have interesting changes,
# and reduce the effectiveness of the word cloud visual
irrelevant_words = [
    "fuck",
    "bitch",
    "bitches",
    "dick",
    "niggas",
    "shit",
    "fucked",
    "bad",
    "damn",
    "ass",
]

with open(CSV_FILE_PATH, encoding="utf8", newline="") as csvfile:
    csvreader = csv.reader(csvfile)

    words_dictionary = {}
    words_dictionary1 = {}
    words_dictionary2 = {}
    list_of_dictionary = [
        words_dictionary,
        words_dictionary1,
        words_dictionary2,
    ]

    next(csvreader)  # Skip the header row so that count 0 is the first song
    for count, row in enumerate(csvreader, start=0):
        words = helper_function.split_text_into_words(row[4])
        for word in words:
            # Ensure upper/lower case does not affect visual
            word = word.lower()
            if word in irrelevant_words:
                continue
            if helper_function.polarity(word)["compound"] < -0.3:
                # Choose the correct dictionary based on the count
                current_dict = list_of_dictionary[count // 400]
                # Use get to avoid KeyError, defaults to 0 if the key doesn't
                # exist
                current_dict[word] = current_dict.get(word, 0) + 1
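
The bucketing logic in the cell above (integer-dividing the row index to pick a time period, then counting words below a negativity threshold) can be sketched on toy data. Everything here is an assumption made for illustration: `TOY_NEGATIVE_SCORES` is a hypothetical stand-in for the VADER lexicon, the lyrics are made up, and the bucket size is 2 rows rather than 400.

```python
from collections import Counter

# Hypothetical stand-in for a sentiment lexicon; the notebook uses VADER
TOY_NEGATIVE_SCORES = {"lonely": -0.6, "cry": -0.5, "hurt": -0.5}
THRESHOLD = -0.3
ROWS_PER_BUCKET = 2  # the notebook groups 400 rows (four years) per bucket

toy_lyrics = [
    "Lonely nights and I cry",
    "You hurt me",
    "I cry I cry",
    "Happy days",
    "So lonely",
]

# One Counter per time period; round up so a partial last bucket is kept
num_buckets = (len(toy_lyrics) + ROWS_PER_BUCKET - 1) // ROWS_PER_BUCKET
buckets = [Counter() for _ in range(num_buckets)]

for row_index, text in enumerate(toy_lyrics):
    for word in text.lower().split():
        # Words missing from the lexicon are treated as neutral (0.0)
        if TOY_NEGATIVE_SCORES.get(word, 0.0) < THRESHOLD:
            buckets[row_index // ROWS_PER_BUCKET][word] += 1

for period, counts in enumerate(buckets):
    print(period, dict(counts))
```

Using `Counter` removes the need for the `get(word, 0) + 1` idiom, and the resulting word-to-count mappings are the kind of input a word cloud can be built from.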


**Results**


In [3]:
# FIRST VISUALIZATION - Polarity of Every Top Song with Trendline

# Plot average positivity scores for each year from the past 10 years.
# Each year occupies 100 consecutive rows, so slice in steps of 100.
scores = [
    sum(positivity_scores[start : start + 100]) / 100
    for start in range(0, 1100, 100)
]
# Add .5 to all years to make visualization more effective
years = [year + 0.5 for year in range(2013, 2024)]

plt.figure()
plt.plot(years, scores, "r--")  # plot general trend line
plt.xlabel("Year")
plt.ylabel("Polarity Score")
plt.title("Polarity Score of Every Top 100 Song, 2013-2023")
# Plot every song and its score
indiv_scores = plt.scatter(
    np.linspace(2013, 2023.99, num=1100), positivity_scores
)


# Using the mplcursors library, display information about each data point
# when you hover over it.
def show_hover_panel(get_text_func=None):
    """
    Displays specified content whenever the cursor is hovering over a data
    point.

    Args:
        get_text_func: A function that takes the index of a data point and
        returns the text to be displayed. If it is None, the default
        mplcursors annotation text is used.

    Returns:
        a cursor object that displays a box containing specified content
        whenever it hovers over a data point.
    """
    cursor = mplcursors.cursor(
        hover=2,  # Transient
        annotation_kwargs=dict(
            bbox=dict(
                boxstyle="square,pad=0.5",
                facecolor="white",
                edgecolor="#ddd",
                linewidth=0.5,
                path_effects=[withSimplePatchShadow(offset=(1.5, -1.5))],
            ),
            linespacing=1.5,
            arrowprops=None,
        ),
        highlight=True,
        highlight_kwargs={"linewidth": 2},
    )

    if get_text_func:
        cursor.connect(
            event="add",
            func=lambda sel: sel.annotation.set_text(get_text_func(sel.index)),
        )
    return cursor


def on_add(index):
    """
    Returns the text that should be displayed when the cursor is hovering over a
    specific data point.

    Args:
        index: an integer representing the index of the current data point in
        the all_data dataframe.

    Returns:
        A string containing the text to be displayed when the specific data
        point is hovered over, or None if the index is not in the dataframe.
    """
    try:
        parts = [
            f"Song: {all_data['Title'][index]}",
            f"Artist: {all_data['Artists'][index]}",
            f"Chart Position: {all_data['No.'][index]}",
            f"Score: {all_data['Positivity'][index]}",
        ]
        return "\n".join(parts)
    except KeyError:  # don't hover when the cursor is over the trendline
        return None


show_hover_panel(on_add)  # add hover labels
plt.show()
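
The yearly averages plotted above show the trend visually. One way to put a single number on it is the slope of a least-squares line through those averages. The following is a minimal sketch on made-up scores, not a result from our dataset: `yearly_means` and `least_squares_slope` are hypothetical helpers, and `yearly_means` assumes the number of scores is an exact multiple of `per_year`.

```python
def yearly_means(values, per_year):
    """Average consecutive, non-overlapping groups of `per_year` values."""
    return [
        sum(values[start : start + per_year]) / per_year
        for start in range(0, len(values), per_year)
    ]


def least_squares_slope(xs, ys):
    """Slope of the ordinary least-squares line through (xs, ys)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = sum((x - mean_x) ** 2 for x in xs)
    return numerator / denominator


# Made-up scores for three "years" of two songs each (not real data)
toy_scores = [0.5, 0.25, 0.25, 0.0, 0.0, -0.25]
toy_means = yearly_means(toy_scores, per_year=2)
toy_slope = least_squares_slope([2013, 2014, 2015], toy_means)
print(toy_means, toy_slope)
```

A negative slope would indicate that average polarity falls from year to year, and a positive slope the opposite.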
