# INFO 498 B | Spotify Lyric Analysis

### Group 6 | Max Bennett, Sydney Castello, Justin Tham, Ian Wang | 12th December 2023

---

### Introduction and Motivation

### Corpus

### Modeling

#### Training Data

Our corpus is a curated dataset built by another project, which we found online in this [repo](https://github.com/zhao1701/spotify-song-lyric-analysis/tree/master#building-the-dataset).

In [10]:
import pandas as pd

df_sptfy = pd.read_csv('data/billboard-lyrics-spotify.csv')
# Keep songs with more than one word and with lyrics present
df_sptfy = df_sptfy[df_sptfy['num_words'] > 1].dropna(subset=['lyrics'])
df_sptfy.head(3)

| Unnamed: 0 | artist_all | artist_base | rank | song | year | artist_featured | song_clean | artist_clean | lyrics | acousticness | ... | speechiness | tempo | time_signature | valence | duration_min | num_words | words_per_sec | num_uniq_words | decade | uniq_ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | percy faith | percy faith | 1 | theme from a summer place | 1960 | | theme from a summer place | percy faith | theres a summer place where it may rain or sto... | 0.631 | ... | 0.0253 | 92.631 | 4.0 | 0.749 | 2.414883 | 104.0 | 0.717771 | 58.0 | 1960 | 1.793103 |
| 1 | jim reeves | jim reeves | 2 | he'll have to go | 1960 | | hell have to go | jim reeves | put your sweet lips a little closer to the pho... | 0.909 | ... | 0.0379 | 81.181 | 3.0 | 0.2 | 2.310667 | 152.0 | 1.096365 | 69.0 | 1960 | 2.202899 |
| 2 | the everly brothers | the everly brothers | 3 | cathy's clown | 1960 | | cathys clown | the everly brothers | dont want your love any more dont want your k... | 0.412 | ... | 0.0339 | 119.809 | 4.0 | 0.866 | 2.400217 | 121.0 | 0.840202 | 64.0 | 1960 | 1.890625 |


The following are some basic statistics of the corpus:

In [9]:
print('There are', len(df_sptfy), 'songs in the dataset')
print('The earliest year in the dataset is from', df_sptfy['year'].min())
print('The latest year in the dataset is from', df_sptfy['year'].max())
print('The shortest lyrics are', df_sptfy['num_words'].min(), 'words in length')
print('The longest lyrics are', df_sptfy['num_words'].max(), 'words in length')
print('The fastest tempo in the dataset is', df_sptfy['tempo'].max())
print('The slowest tempo in the dataset is', df_sptfy['tempo'].min())
print('The average tempo in the dataset is', df_sptfy['tempo'].mean())
print('There are', df_sptfy['artist_clean'].nunique(), 'unique artists')

There are 5491 songs in the dataset
The earliest year in the dataset is from 1960
The latest year in the dataset is from 2017
The shortest lyrics are 5.0 words in length
The longest lyrics are 1143.0 words in length
The fastest tempo in the dataset is 233.429
The slowest tempo in the dataset is 50.975
The average tempo in the dataset is 119.34795571401764
There are 2229 unique artists


In [12]:
value_counts = df_sptfy['decade'].value_counts()

grouped_data = df_sptfy.groupby('decade')

for value, group in grouped_data:
    group_name = f"df_year_{value}"  # Creating a unique name for each DataFrame
    globals()[group_name] = group.copy()

We split the corpus by decade; each decade has a different number of lyric entries.

In [13]:
for value in value_counts.index:
    global_variable_name = f"df_year_{value}"
    print(f"{global_variable_name} has {len(globals()[global_variable_name])} entries")

df_year_1980 has 991 entries
df_year_1970 has 981 entries
df_year_1960 has 960 entries
df_year_1990 has 931 entries
df_year_2000 has 919 entries
df_year_2010 has 709 entries
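
The `globals()` pattern used above works, but a plain dictionary keyed by decade may be easier to inspect and pass around. The following is a minimal sketch of that alternative, not the code we ran: `df_demo` and `frames_by_decade` are hypothetical names, and the tiny frame holds made-up rows rather than songs from the dataset.

```python
import pandas as pd

# Tiny stand-in frame; the rows are illustrative, not taken from the dataset.
df_demo = pd.DataFrame({
    "song": ["a", "b", "c", "d"],
    "decade": [1960, 1960, 1970, 1980],
})

# One DataFrame per decade, keyed by the decade value (hypothetical name).
frames_by_decade = {
    decade: group.copy() for decade, group in df_demo.groupby("decade")
}

# Entry counts per decade, printed largest first.
entries = {decade: len(frame) for decade, frame in frames_by_decade.items()}
for decade, n in sorted(entries.items(), key=lambda kv: -kv[1]):
    print(f"{decade} has {n} entries")
```

With this layout, a single decade is looked up as `frames_by_decade[1960]` instead of building a variable name from a string.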


#### Model Architecture 1: NER

The first model uses Named Entity Recognition (NER) to identify how the subjects discussed in our corpus change across decades. We primarily tracked seven entity types: location, nationality/religious group, product, event, person, organization, and geopolitical entity. We found that these types tend to combine enough data with interesting and sufficiently distinctive results.

We tried multiple spaCy models when testing our code but ultimately settled on `en_core_web_lg`, which we judged to find the most relevant entities. We split our corpus by decade and then applied this model to each set of lyrics. Next, we filtered out the entity types that had no relevance to our project, such as ordinals and times. Using the remaining entities, we created word clouds for each (decade, entity type) pair.

From our corpus, we used only the `lyrics` and `decade` columns as inputs. The output is a set of entities linked to each set of lyrics, which is in turn grouped into dicts by `decade`.
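
The filtering and grouping steps described above can be sketched roughly as follows. This is an illustration under stated assumptions, not our exact code: `KEPT_LABELS`, `count_entities_by_decade`, and the `records` sample are hypothetical, and the sample entities are hand-written rather than real model output. With spaCy, the `(text, label)` pairs for one song would come from something like `[(e.text, e.label_) for e in nlp(lyrics).ents]`.

```python
from collections import Counter, defaultdict

# spaCy labels for the seven entity types we tracked (hypothetical constant).
KEPT_LABELS = {"LOC", "NORP", "PRODUCT", "EVENT", "PERSON", "ORG", "GPE"}


def count_entities_by_decade(records, kept_labels=KEPT_LABELS):
    """Count entity mentions per (decade, label) pair.

    records: iterable of (decade, [(entity_text, entity_label), ...]),
    one item per song.
    """
    counts = defaultdict(Counter)
    for decade, ents in records:
        for text, label in ents:
            # Drop entity types we do not track, such as ORDINAL and TIME.
            if label in kept_labels:
                counts[(decade, label)][text.lower()] += 1
    return counts


# Hand-written sample input; NOT real output from the model.
records = [
    (1960, [("Memphis", "GPE"), ("first", "ORDINAL")]),
    (1960, [("memphis", "GPE"), ("Johnny", "PERSON")]),
    (1970, [("tonight", "TIME"), ("California", "GPE")]),
]
counts = count_entities_by_decade(records)
print(counts[(1960, "GPE")].most_common(1))
```

Each `Counter` in `counts` is a frequency table for one (decade, entity type) pair, which is the kind of input a word cloud can be built from.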

*How was spacy trained and why did we use it? Provide justification for why you believe the off-the-shelf model is appropriate for your use case.*

#### Model Architecture 2

### Describing and Visualizing Results

Our initial question was: what noticeable trends in named entities can we observe by decade in the corpus of song lyrics?

In [None]:
import spacy

# Load the large English pipeline and parse every song, one list per decade
nlp = spacy.load('en_core_web_lg')
for value in value_counts.index:
    df = f"df_year_{value}"
    list_name = f"lyrics_{value}"
    globals()[list_name] = []
    for song in globals()[df]['lyrics']:
        globals()[list_name].append(nlp(song))

In [20]:
from collections import Counter

for value in value_counts.index:
    c = Counter()
    sum_entities = 0

    for song in globals()[f'lyrics_{value}']:
        sum_entities += len(song.ents)
        c.update([e.label_ for e in song.ents])

    average_entities = sum_entities / len(globals()[f'df_year_{value}'])

    print(f'Decade: {value}')
    print('Average entities per document:', average_entities)
    print('\nAverage entity type per document:')
    for item in c:
        print(item, c[item] / len(globals()[f'df_year_{value}']))
    print()

Decade: 1980
Average entities per document: 3.8708375378405653

Average entity type per document:
DATE 0.5267406659939455
PERSON 0.8012108980827447
TIME 1.099899091826438
CARDINAL 0.6881937436932392
GPE 0.26538849646821394
ORDINAL 0.1846619576185671
NORP 0.05852674066599395
ORG 0.18668012108980828
QUANTITY 0.01917255297679112
MONEY 0.007063572149344097
LOC 0.02119071644803229
PRODUCT 0.007063572149344097
LAW 0.0010090817356205853
FAC 0.0020181634712411706
EVENT 0.0020181634712411706

Decade: 1970
Average entities per document: 3.8868501529051986

Average entity type per document:
DATE 0.7033639143730887
NORP 0.08256880733944955
PERSON 0.928644240570846
CARDINAL 0.601427115188583
TIME 0.7145769622833843
ORG 0.21508664627930682
GPE 0.37716615698267075
ORDINAL 0.1365953109072375
PRODUCT 0.01834862385321101
LOC 0.03160040774719674
QUANTITY 0.05198776758409786
FAC 0.011213047910295617
LANGUAGE 0.0010193679918450561
MONEY 0.011213047910295617
LAW 0.0020387359836901123

Decade: 1960
Average e

### Discussion