# INFO 498 B | Spotify Lyric Analysis

### Group 6 | Max Bennett, Sydney Castillo, Justin Tham, Ian Wang | 12th December 2023

---

In [9]:
import pandas as pd
import spacy
from collections import Counter
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))

### Introduction and Motivation

### Corpus
The data corpus contains songs and their lyrics from 1960 to 2017. We work from a curated dataset assembled by another project, which we found online at this [repo](https://github.com/zhao1701/spotify-song-lyric-analysis/tree/master#building-the-dataset). Below is a small sample of the data, including its column names.

In [2]:
df_sptfy = pd.read_csv('data/billboard-lyrics-spotify.csv')
# Keep songs with more than one word of lyrics and drop rows without lyrics
df_sptfy = df_sptfy[df_sptfy['num_words'] > 1].dropna(subset=['lyrics'])
df_sptfy.head(3)

|  | artist_all | artist_base | rank | song | year | artist_featured | song_clean | artist_clean | lyrics | acousticness | ... | speechiness | tempo | time_signature | valence | duration_min | num_words | words_per_sec | num_uniq_words | decade | uniq_ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | percy faith | percy faith | 1 | theme from a summer place | 1960 |  | theme from a summer place | percy faith | theres a summer place where it may rain or sto... | 0.631 | ... | 0.0253 | 92.631 | 4.0 | 0.749 | 2.414883 | 104.0 | 0.717771 | 58.0 | 1960 | 1.793103 |
| 1 | jim reeves | jim reeves | 2 | he'll have to go | 1960 |  | hell have to go | jim reeves | put your sweet lips a little closer to the pho... | 0.909 | ... | 0.0379 | 81.181 | 3.0 | 0.2 | 2.310667 | 152.0 | 1.096365 | 69.0 | 1960 | 2.202899 |
| 2 | the everly brothers | the everly brothers | 3 | cathy's clown | 1960 |  | cathys clown | the everly brothers | dont want your love any more dont want your k... | 0.412 | ... | 0.0339 | 119.809 | 4.0 | 0.866 | 2.400217 | 121.0 | 0.840202 | 64.0 | 1960 | 1.890625 |


The following are some basic statistics of the corpus:

In [3]:
print('There are', len(df_sptfy), 'songs in the dataset')
print('The earliest year in the dataset is from', df_sptfy['year'].min())
print('The latest year in the dataset is from', df_sptfy['year'].max())
print('The shortest lyrics are', df_sptfy['num_words'].min(), 'words in length')
print('The longest lyrics are', df_sptfy['num_words'].max(), 'words in length')
print('The lyrics are', df_sptfy['num_words'].mean(), 'words in length on average')
print('The fastest tempo in the dataset is', df_sptfy['tempo'].max())
print('The slowest tempo in the dataset is', df_sptfy['tempo'].min())
print('The average tempo in the dataset is', df_sptfy['tempo'].mean())
print('There are', df_sptfy['artist_clean'].nunique(), 'unique artists')

There are 5491 songs in the dataset
The earliest year in the dataset is from 1960
The latest year in the dataset is from 2017
The shortest lyrics are 5.0 words in length
The longest lyrics are 1143.0 words in length
The lyrics are 305.39555636496084 words in length on average
The fastest tempo in the dataset is 233.429
The slowest tempo in the dataset is 50.975
The average tempo in the dataset is 119.34795571401764
There are 2229 unique artists


In [4]:
value_counts = df_sptfy['decade'].value_counts()

grouped_data = df_sptfy.groupby('decade')

for value, group in grouped_data:
    group_name = f"df_year_{value}"  # Creating a unique name for each DataFrame
    globals()[group_name] = group.copy()

We split the corpus by decade; each decade has a different number of lyric entries.

In [5]:
for value in value_counts.index:
    global_variable_name = f"df_year_{value}"
    print(f"{global_variable_name} has {len(globals()[global_variable_name])} entries")

df_year_1980 has 991 entries
df_year_1970 has 981 entries
df_year_1960 has 960 entries
df_year_1990 has 931 entries
df_year_2000 has 919 entries
df_year_2010 has 709 entries
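
The split above stores each decade's DataFrame under a dynamically generated global name. As a minimal sketch of an alternative, assuming a small made-up frame with only `lyrics` and `decade` columns, the groups could instead be kept in a dictionary keyed by decade. The names `toy` and `frames_by_decade` are hypothetical and do not appear elsewhere in this notebook.

```python
import pandas as pd

# Made-up stand-in for the real corpus; only the two columns used for the split.
toy = pd.DataFrame({
    "lyrics": ["la la la", "baby baby", "oh yeah", "hey now"],
    "decade": [1960, 1960, 1970, 1980],
})

# One independent DataFrame per decade, keyed by the decade as a plain int.
frames_by_decade = {
    int(decade): group.copy() for decade, group in toy.groupby("decade")
}

for decade, frame in frames_by_decade.items():
    print(f"{decade} has {len(frame)} entries")
```

A dictionary keeps the per-decade frames in one place and avoids looking names up through `globals()`, at the cost of changing how later cells would access them.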


### Modeling

### Training Data

For our model, we used spaCy's off-the-shelf pretrained NER model. We ingested only the `lyrics` and `decade` columns from the Spotify song/lyric data as inputs. The purpose of our model was to extract different types of entities from song lyrics by decade to see whether there were any noticeable trends. The dataset comes from another song lyric analysis project, which had filtered and cleaned data on the Billboard Year-End Hot 100 Singles from 1960 to 2017.


### Model Architecture 1: NER

The first model uses Named Entity Recognition to identify how the subjects discussed in our corpus change across decades. We primarily tracked the entity types location (LOC), nationality/religious/political group (NORP), product (PRODUCT), event (EVENT), person (PERSON), organization (ORG), and geopolitical entity (GPE). We found that these entity types tend to offer a combination of enough data and results that are both interesting and distinctive.

We tried multiple spaCy models when testing our code, but ultimately settled on `en_core_web_lg`, which we determined found the most relevant entities. We filtered our corpus by decade and then applied this model to each set of lyrics in our corpus. Next, we filtered out the entity types that had no relevance to our project, such as ordinals and times. Using the remaining entities, we created word clouds for each (decade, entity type) pair.

From our corpus, we only used the `lyrics` and `decade` labels as inputs. The outputs are a set of entities linked to each set of lyrics, which is in turn grouped into dicts by the `decade` label.
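
The filtering and grouping steps described above can be sketched as follows. This is an illustration under stated assumptions, not the notebook's actual code: each song is represented as a list of `(text, label)` pairs, as would be obtained from `ent.text` and `ent.label_` on a spaCy document, and the exclusion set contains only the two labels named in the text (`ORDINAL` and `TIME`). The helper name `group_entities_by_decade` and the toy data are hypothetical.

```python
from collections import defaultdict

# Labels treated as irrelevant. Limiting the set to these two is an assumption
# based on the "ordinals and times" example in the text.
EXCLUDED_LABELS = {"ORDINAL", "TIME"}

def group_entities_by_decade(songs):
    """Group entity strings as decade -> label -> list of texts.

    `songs` is an iterable of (decade, [(entity_text, entity_label), ...]).
    """
    grouped = defaultdict(lambda: defaultdict(list))
    for decade, entities in songs:
        for text, label in entities:
            if label in EXCLUDED_LABELS:
                continue  # skip entity types that are not analysed
            grouped[decade][label].append(text)
    return grouped

# Made-up entity lists standing in for spaCy output.
toy_songs = [
    (1970, [("new york", "GPE"), ("tonight", "TIME"), ("american", "NORP")]),
    (1970, [("new york", "GPE"), ("first", "ORDINAL")]),
    (2010, [("dj", "PERSON"), ("katrina", "EVENT")]),
]

grouped = group_entities_by_decade(toy_songs)
print(dict(grouped[1970]))
```

Each `grouped[decade][label]` list then holds the raw strings for one (decade, entity type) pair, which is the unit the word clouds are built from.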


### Model Architecture 2

### Describing and Visualizing Results

Our initial question was: what noticeable trends in entities can we observe by decade in the corpus of song lyrics? To answer it, we created word clouds for each entity type in each decade. For a good portion of the extracted entities, there were no interesting findings. Below we've included summary statistics and some of the interesting trends and findings we did have.

In [6]:
nlp = spacy.load('en_core_web_lg')
for value in value_counts.index:
    df_name = f"df_year_{value}"
    list_name = f"lyrics_{value}"
    # Run the spaCy pipeline over every song in the decade; nlp.pipe processes the texts in batches
    globals()[list_name] = list(nlp.pipe(globals()[df_name]['lyrics']))

#### Average entities per decade

In [19]:
stats_lists_per_decade = {}

for value in value_counts.index:
    c = Counter()
    sum_entities = 0

    for song in globals()[f'lyrics_{value}']:
        sum_entities += len(song.ents)
        c.update([e.label_ for e in song.ents])

    average_entities = sum_entities / len(globals()[f'df_year_{value}'])
    stats_for_current_decade = []

    for item in c:
        average_entity_type = c[item] / len(globals()[f'df_year_{value}'])
        stats_for_current_decade.append({'Entity Type': item, 'Average Entity Type per Document': average_entity_type})

    stats_for_current_decade.append({'Entity Type': 'Average Entities per Document', 'Average Entity Type per Document': average_entities})
    stats_lists_per_decade[value] = stats_for_current_decade

stats_dfs_per_decade = {decade: pd.DataFrame(stats_list) for decade, stats_list in stats_lists_per_decade.items()}

for decade, stats_df in stats_dfs_per_decade.items():
    print(f'Decade: {decade}')
    display(stats_df)
    print()

Decade: 1980


|  | Entity Type | Average Entity Type per Document |
|---|---|---|
| 0 | DATE | 0.526741 |
| 1 | PERSON | 0.801211 |
| 2 | TIME | 1.099899 |
| 3 | CARDINAL | 0.688194 |
| 4 | GPE | 0.265388 |
| 5 | ORDINAL | 0.184662 |
| 6 | NORP | 0.058527 |
| 7 | ORG | 0.18668 |
| 8 | QUANTITY | 0.019173 |
| 9 | MONEY | 0.007064 |



Decade: 1970


|  | Entity Type | Average Entity Type per Document |
|---|---|---|
| 0 | DATE | 0.703364 |
| 1 | NORP | 0.082569 |
| 2 | PERSON | 0.928644 |
| 3 | CARDINAL | 0.601427 |
| 4 | TIME | 0.714577 |
| 5 | ORG | 0.215087 |
| 6 | GPE | 0.377166 |
| 7 | ORDINAL | 0.136595 |
| 8 | PRODUCT | 0.018349 |
| 9 | LOC | 0.0316 |



Decade: 1960


|  | Entity Type | Average Entity Type per Document |
|---|---|---|
| 0 | DATE | 0.651042 |
| 1 | CARDINAL | 0.59375 |
| 2 | GPE | 0.282292 |
| 3 | NORP | 0.075 |
| 4 | TIME | 0.527083 |
| 5 | LOC | 0.042708 |
| 6 | ORG | 0.226042 |
| 7 | PERSON | 1.138542 |
| 8 | ORDINAL | 0.082292 |
| 9 | QUANTITY | 0.051042 |



Decade: 1990


|  | Entity Type | Average Entity Type per Document |
|---|---|---|
| 0 | DATE | 0.954887 |
| 1 | TIME | 0.862513 |
| 2 | PERSON | 1.418904 |
| 3 | CARDINAL | 1.179377 |
| 4 | ORDINAL | 0.154672 |
| 5 | LOC | 0.030075 |
| 6 | QUANTITY | 0.062299 |
| 7 | LAW | 0.002148 |
| 8 | ORG | 0.515575 |
| 9 | NORP | 0.137487 |



Decade: 2000


|  | Entity Type | Average Entity Type per Document |
|---|---|---|
| 0 | PERSON | 1.995647 |
| 1 | NORP | 0.254625 |
| 2 | GPE | 0.57889 |
| 3 | ORG | 0.817193 |
| 4 | DATE | 0.921654 |
| 5 | ORDINAL | 0.322089 |
| 6 | CARDINAL | 1.603917 |
| 7 | TIME | 0.88901 |
| 8 | QUANTITY | 0.104461 |
| 9 | LOC | 0.039173 |



Decade: 2010


|  | Entity Type | Average Entity Type per Document |
|---|---|---|
| 0 | PERSON | 1.609309 |
| 1 | TIME | 1.406206 |
| 2 | DATE | 0.767278 |
| 3 | CARDINAL | 1.454161 |
| 4 | ORG | 0.634697 |
| 5 | LOC | 0.06347 |
| 6 | GPE | 0.562764 |
| 7 | ORDINAL | 0.234133 |
| 8 | NORP | 0.177715 |
| 9 | EVENT | 0.00141 |





#### Most common entities per decade

In [15]:
from collections import defaultdict

dfs_per_decade = {}

for value in value_counts.index:
    # Collect every entity string under its label for this decade
    texts_by_label = defaultdict(list)

    for song in globals()[f'lyrics_{value}']:
        for ent in song.ents:
            texts_by_label[ent.label_].append(ent.text)

    data_for_current_decade = []

    for label, texts in texts_by_label.items():
        # most_common(1) returns [(text, count)]; keep only the text
        most_common_text = Counter(texts).most_common(1)[0][0]
        data_for_current_decade.append({'Entity Type': label, 'Most Common Text': most_common_text})

    dfs_per_decade[value] = pd.DataFrame(data_for_current_decade)

for decade, df in dfs_per_decade.items():
    print(f'Decade: {decade}')
    display(df)
    print()

Decade: 1980


|  | Entity Type | Most Common Text |
|---|---|---|
| 0 | DATE | today |
| 1 | PERSON | mickey |
| 2 | TIME | tonight |
| 3 | CARDINAL | one |
| 4 | GPE | america |
| 5 | ORDINAL | second |
| 6 | NORP | sans |
| 7 | ORG | muzik |
| 8 | QUANTITY | a million miles |
| 9 | MONEY | a million bucks |



Decade: 1970


|  | Entity Type | Most Common Text |
|---|---|---|
| 0 | DATE | today |
| 1 | NORP | american |
| 2 | PERSON | louie louie |
| 3 | CARDINAL | one |
| 4 | TIME | tonight |
| 5 | ORG | chevy |
| 6 | GPE | new york |
| 7 | ORDINAL | first |
| 8 | PRODUCT | cherokee |
| 9 | LOC | memphis |



Decade: 1960


|  | Entity Type | Most Common Text |
|---|---|---|
| 0 | DATE | today |
| 1 | CARDINAL | one |
| 2 | GPE | california |
| 3 | NORP | indian |
| 4 | TIME | tonight |
| 5 | LOC | nova |
| 6 | ORG | bristol |
| 7 | PERSON | rhonda |
| 8 | ORDINAL | first |
| 9 | QUANTITY | twenty miles |



Decade: 1990


|  | Entity Type | Most Common Text |
|---|---|---|
| 0 | DATE | today |
| 1 | TIME | tonight |
| 2 | PERSON | joe |
| 3 | CARDINAL | one |
| 4 | ORDINAL | first |
| 5 | LOC | east coast |
| 6 | QUANTITY | a thousand miles |
| 7 | LAW | the y y |
| 8 | ORG | cmon |
| 9 | NORP | dem |



Decade: 2000


|  | Entity Type | Most Common Text |
|---|---|---|
| 0 | PERSON | dj |
| 1 | NORP | dem |
| 2 | GPE | london |
| 3 | ORG | flo |
| 4 | DATE | today |
| 5 | ORDINAL | first |
| 6 | CARDINAL | one |
| 7 | TIME | tonight |
| 8 | QUANTITY | six feet |
| 9 | LOC | bay bay |



Decade: 2010


|  | Entity Type | Most Common Text |
|---|---|---|
| 0 | PERSON | dj |
| 1 | TIME | tonight |
| 2 | DATE | today |
| 3 | CARDINAL | one |
| 4 | ORG | ayy |
| 5 | LOC | nina |
| 6 | GPE | le le |
| 7 | ORDINAL | first |
| 8 | NORP | american |
| 9 | EVENT | katrina |





DISCLAIMER: It is important to note that the observations and findings in this paper are correlational, not causal. We did not establish causality for any of the following findings. Additionally, due to some vulgar language, we had to edit the wordclouds to make them presentable.
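
The notebook does not show how the wordclouds were built or edited, so the following is only a sketch of one possible approach: count how often each entity string occurs for a (decade, entity type) pair and drop any blocklisted terms before rendering. The function name, the toy entity list, and the blocklist entry are all hypothetical.

```python
from collections import Counter

def entity_frequencies(entity_texts, blocklist=frozenset()):
    """Count entity strings, skipping any that appear in the blocklist."""
    return Counter(text for text in entity_texts if text not in blocklist)

# Made-up entity strings for a single (decade, entity type) pair.
entity_texts = ["american", "french", "dem", "american", "badword", "american", "dem"]

# Placeholder blocklist; the terms actually removed are not listed in this report.
blocklist = frozenset({"badword"})

freqs = entity_frequencies(entity_texts, blocklist)
print(freqs.most_common(2))
```

A frequency mapping like `freqs` is the kind of input that the `wordcloud` package accepts through `WordCloud.generate_from_frequencies`, although the authors may have used a different tool.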

#### NORP Entity wordclouds for the 1970s and 2010s

NORP stands for 'Nationalities or religious or political groups'. It is no surprise that the most frequent entity is `american`: we used the English `en_core_web_lg` spaCy model, and the song lyric dataset is predominantly English. It is interesting to note that `krushchev` appears in the 1970 word cloud. Nikita Khrushchev was the former First Secretary of the Communist Party of the Soviet Union, and his memoirs were smuggled to the West and published in 1970, at the start of this decade.

The 2010 word cloud tells a different story, with `american`, `french` and `dem` being the entities that occur the most. It's also interesting to note that `percocets` and `xans` are listed as NORP entities, which is a misclassification since both are drug names. It does, however, show the prevalence of drug references in music within the 2010 decade.

![Word cloud for NORP in 1970](wordclouds/image3.png)
![Word cloud for NORP in 2010](wordclouds/image4.png)

#### PRODUCT Entity wordclouds for the 1990s and 2000s

PRODUCT stands for 'Vehicles, weapons, foods, etc. (Not services)'. In the 1990s, `bmw` is the most frequent entity, followed by `maserati` and `cherokee`, whereas in the 2000s `jeeps` is the most frequent entity, followed by `thejayz`, `hondas`, and `broncos`. However, we cannot be sure whether this reflects the popularity of these vehicle brands in their respective decades. Jeep Cherokees were popular within this time period, which may explain their occurrence in the wordclouds.

![Wordcloud for PRODUCT in 1990](wordclouds/image.png)
![Wordcloud for PRODUCT in 2000](wordclouds/image2.png)

#### PERSON Entity wordclouds for the 1960s, 2000s, and 2010s

PERSON stands for 'Person, including fictional'. We chose three time periods: the 1960s, 2000s, and 2010s. The most frequent entity in the 1960s was `rhonda`, alongside references to `louielouie` and names such as `hazel`, `simon`, and `jimmy`. For both the 2000s and 2010s, `dj` is the most frequent entity. Other interesting entities to note are `jackson` in the 2000s, and `nicki`, `keisha`, and `timmyturner` in the 2010s. Our hypothesis for why `dj` is the most frequent entity is that `dj` is a prefix in many artists' names, following the rise of rap music. Our hypothesis for why `rhonda` is the most frequent entity is that it coincides with the popularity of the TV star Rhonda Fleming.

It's interesting to note that `jesus` occurs in all three decades.

![Wordclouds for PERSON in 1960](wordclouds/image7.png)
![Wordclouds for PERSON in 2000](wordclouds/image6.png)
![Wordclouds for PERSON in 2010](wordclouds/image5.png)


### Discussion

Our findings seem fairly intuitive based on our initial understanding of the problem. However, there are clear limitations to our approach of using NER to identify trends. The spaCy model did not perform as well as we had expected, with some entities being incorrectly tagged. Additionally, we had expected more explicit references tied to each time period, such as `vietnam` as a GPE or EVENT in the 1960s/1970s data due to the Vietnam War; however, there were no outstanding trends to note. For more pertinent data around world events, we considered using a different data source such as newspaper headlines, which may give better results. We also hope to explore different modeling techniques, which may be better at identifying themes.
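
An expectation like the `vietnam` one can also be checked directly against the lyrics, independently of how the NER model labels the term. The sketch below counts, per decade, how many songs contain a given word; it is not part of the analysis above. It assumes lyrics are plain whitespace-separated strings, and both the helper name `songs_mentioning` and the toy lyrics are made up for illustration.

```python
from collections import Counter

def songs_mentioning(term, songs):
    """Count, per decade, how many lyrics contain `term` as a whole word."""
    counts = Counter()
    for decade, lyrics in songs:
        if term in lyrics.lower().split():
            counts[decade] += 1
    return counts

# Made-up (decade, lyrics) pairs standing in for the real corpus.
toy_songs = [
    (1960, "sweet home far from vietnam"),
    (1960, "dancing all night long"),
    (1970, "letters from vietnam tonight"),
    (1980, "tonight tonight tonight"),
]

mentions = songs_mentioning("vietnam", toy_songs)
totals = Counter(decade for decade, _ in toy_songs)

for decade in sorted(totals):
    print(decade, mentions[decade], "of", totals[decade])
```

Comparing these raw counts with the entity counts would help separate terms that are genuinely rare in the lyrics from terms the model fails to tag.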