# Final Project Notebook

DS 5001 Exploratory Text Analytics | Spring 2024

# Metadata

- Full Name: Michael Vaden
- Userid: mtv2eva
- GitHub Repo URL: https://github.com/mtvaden1/Text_Analytics_Final_Project
- UVA Box URL: https://virginia.box.com/s/x0gai3x6ulfcrp13bkb9cklt81sij13k

# Overview

The goal of the final project is for you to create a **digital analytical edition** of a corpus using the tools, practices, and perspectives you've learned in this course. You will select a corpus that has already been digitized and transcribed, parse that into an F-compliant set of tables, and then generate and visualize the results of a series of fitted models. You will also draw some tentative conclusions regarding the linguistic, cultural, psychological, or historical features represented by your corpus. The point of the exercise is to have you work with a corpus through the entire pipeline from ingestion to interpretation.

Specifically, you will acquire a collection of long-form texts and perform the following operations:

- **Convert** the collection from their source formats (F0) into a set of tables that conform to the Standard Text Analytic Data Model (F2).
- **Annotate** these tables with statistical and linguistic features using NLP libraries such as NLTK (F3).
- **Produce** a vector representation of the corpus to generate TFIDF values to add to the TOKEN (aka CORPUS) and VOCAB tables (F4).
- **Model** the annotated and vectorized model with tables and features derived from the application of unsupervised methods, including PCA, LDA, and word2vec (F5).
- **Explore** your results using statistical and visual methods.
- **Present** conclusions about patterns observed in the corpus by means of these operations.

When you are finished, you will make the results of your work available in GitHub (for code) and UVA Box (for data). You will submit to Gradescope (via Canvas) a PDF version of a Jupyter notebook that contains the information listed below.

# Some Details

- Please fill out your answers in each task below by editing the markdown cell. 
- Replace text that asks you to insert something with the thing, i.e. replace `(INSERT IMAGE HERE)` with an image element, e.g. `![](image.png)`.
- For URLs, just paste the raw URL directly into the text area. Don't worry about providing link labels using `[label](link)`.
- Please do not alter the structure of the document or cell, i.e. the bulleted lists. 
- You may add explanatory paragraphs below the bulleted lists.
- Please name your tables as they are named in each task below.
- Tasks are indicated by headers with point values in parentheses.

# Raw Data

## Source Description (1)

Provide a brief description of your source material, including its provenance and content. Tell us where you found it and what kind of content it contains.

*For this project, I wanted to use various song lyrics as my corpus. Specifically, I found 6 playlists on Spotify that corresponded to each decade ranging from the 70s to 2020s. These playlists are authored by Spotify, are tailored to my individual listening preferences, and contain the top 150 songs from each decade which Spotify recommends that I listen to. Using the Spotify API, I was able to query each of these playlists and songs (900 songs in total) to get the song title, artist, genre, and Spotify-specific metadata ranging from features such as key and time signature to popularity and danceability. After getting the song title and artist from each decade playlist, I then leveraged the Genius API to query for the song's lyrics. Although the Genius API is somewhat unreliable, occasionally returning the wrong lyrics, lyrics in another language, or no lyrics at all, it performs the intended task with a high enough success rate to get proper lyrics for most of the songs from each decade playlist. Finally, once I had all of the song features, metadata, and lyrics, I aggregated all of these to be able to compute my core tables and begin my advanced analysis.*

## Source Features (1)

Add values for the following items. (Do this for all following bulleted lists.)

- Source URL: https://developer.spotify.com/documentation/web-api
- UVA Box URL: https://virginia.box.com/s/ceqzkn1v5tps305w5pud7m2ndmvis51j
- Number of raw documents: **6**
- Total size of raw documents (e.g. in MB): **1.029**
- File format(s), e.g. XML, plaintext, etc.: **txt**

## Source Document Structure (1)

Provide a brief description of the internal structure of each document. That is, describe the typical elements found in each document and their relation to each other. For example, a corpus of letters might be described as having a date, an addressee, a salutation, a set of content paragraphs, and a closing. If there are various structures, state that.

**Each decade .txt file, built from the song lyrics returned by the Genius API, has the same structure. Each song begins with a header of the form [Trackname: x] [Artist: y], which I inserted manually. After the header, the song lyrics appear in stanzas (similar to paragraphs) but without periods. Each stanza consists of lines with sentence-like structure, and each line ends with a line break. There is an empty line before and after each header, and a header marks both the end of the previous song's lyrics and the beginning of the next song's lyrics.**
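The structure described above can be parsed with a regular expression on the header line. The following is a minimal sketch, not the project's actual parser: the sample text, the `HEADER` pattern, and the `parse_songs` helper are all assumptions made for illustration, and the `song_regex` stored in the LIB table may differ.

```python
import re

# Hypothetical sample in the layout described above (not taken from the real files).
raw_text = """
[Trackname: Song A] [Artist: Artist One]

first line of stanza one
second line of stanza one

first line of stanza two

[Trackname: Song B] [Artist: Artist Two]

only line
"""

# Assumed header pattern; the project's own song_regex may be different.
HEADER = re.compile(
    r"^\[Trackname: (?P<track>.+?)\] \[Artist: (?P<artist>.+?)\]$",
    re.MULTILINE,
)

def parse_songs(text):
    """Split one decade file into songs, each with a list of stanzas (lists of lines)."""
    headers = list(HEADER.finditer(text))
    songs = []
    for i, match in enumerate(headers):
        # A song's lyrics run from the end of its header to the start of the next header.
        end = headers[i + 1].start() if i + 1 < len(headers) else len(text)
        body = text[match.end():end].strip()
        # Stanzas are separated by blank lines; lines by single line breaks.
        stanzas = [
            [line.strip() for line in block.splitlines() if line.strip()]
            for block in re.split(r"\n\s*\n", body)
            if block.strip()
        ]
        songs.append(
            {"track": match["track"], "artist": match["artist"], "stanzas": stanzas}
        )
    return songs

songs = parse_songs(raw_text)
```

Each returned dictionary keeps the song, stanza, and line nesting, which maps directly onto the `song_num`, `stanza_num`, and `line_num` levels of the OHCO.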

# Parsed and Annotated Data

Parse the raw data into the three core tables of your edition: the `LIB`, `CORPUS`, and `VOCAB` tables.

These tables will be stored as CSV files with header rows.

You may consider using `|` as a delimiter.

Provide the following information for each.

## LIB (2)

The source documents the corpus comprises. These may be books, plays, newspaper articles, abstracts, blog posts, etc. 

Note that these are *not* documents in the sense used to describe a bag-of-words representation of a text, e.g. chapter.

- UVA Box URL: https://virginia.box.com/s/gyo9hh8fuyagoj9wqjuxgvu55d46bgl6
- GitHub URL for notebook used to create: https://github.com/mtvaden1/Text_Analytics_Final_Project/blob/main/Load_Song_Corpus.ipynb
- Delimiter: **,**
- Number of observations: **6**
- List of features, including at least three that may be used for model summarization (e.g. date, author, etc.):

**['Decade', 'danceability', 'energy', 'loudness', 'speechiness',
       'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo',
       'source_file_path', 'song_regex', 'document_length']**

- Average length of each document in characters: 

**(Decade, document_length): (70s, 127933), (80s, 151152), (90s, 174393), (2000s, 216931), (2010s, 195291), (2020s, 212348)**

## CORPUS (2)

The sequence of word tokens in the corpus, indexed by their location in the corpus and document structures.

- UVA Box URL: https://virginia.box.com/s/dcb5rsj5ikteuch0bmu79gnp3p0x7o6u
- GitHub URL for notebook used to create: https://github.com/mtvaden1/Text_Analytics_Final_Project/blob/main/Load_Song_Corpus.ipynb
- Delimiter: **,**
- Number of observations (should be between 500,000 and 2,000,000): **207176** (approved by the professor)
- OHCO Structure (as delimited column names):

**['decade_id', 'song_num', 'stanza_num', 'line_num', 'token_num']**

- Columns (as delimited column names, including `token_str`, `term_str`, `pos`, and `pos_group`):

**['pos_tuple', 'pos', 'token_str', 'term_str', 'pos_group']**
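As an illustration of how the OHCO levels index the token table, here is a small sketch that builds a CORPUS-like frame from nested lists. The `parsed` input, the whitespace tokenizer, and the `term_str` normalization are assumptions for this example; part-of-speech columns are left out because they depend on the tagger used.

```python
import re
import pandas as pd

OHCO = ["decade_id", "song_num", "stanza_num", "line_num", "token_num"]

# Hypothetical nested input: decade -> songs -> stanzas -> lines.
parsed = {"70s": [[["Hello darkness", "my old friend"], ["I've come"]]]}

rows = []
for decade_id, songs in parsed.items():
    for song_num, stanzas in enumerate(songs):
        for stanza_num, lines in enumerate(stanzas):
            for line_num, line in enumerate(lines):
                for token_num, token_str in enumerate(line.split()):
                    # Assumed normalization: lowercase and drop non-word characters.
                    term_str = re.sub(r"[\W_]+", "", token_str.lower())
                    rows.append(
                        (decade_id, song_num, stanza_num, line_num,
                         token_num, token_str, term_str)
                    )

CORPUS = pd.DataFrame(rows, columns=OHCO + ["token_str", "term_str"]).set_index(OHCO)
```

With this normalization a token such as `I've` becomes the term `ive`, which is consistent with terms like `im` and `dont` that appear in the topic lists later in this notebook.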

## VOCAB (2)

The unique word types (terms) in the corpus.

- UVA Box URL: https://virginia.box.com/s/14hmeklg9j5q5719k9gs8trj90nnr1mr
- GitHub URL for notebook used to create: https://github.com/mtvaden1/Text_Analytics_Final_Project/blob/main/Load_Song_Corpus.ipynb and https://github.com/mtvaden1/Text_Analytics_Final_Project/blob/main/Compute_Derived_Tables.ipynb (to add dfidf)
- Delimiter: **,**
- Number of observations: **8975**
- Columns (as delimited names, including `n`, `p`, `i`, `dfidf`, `porter_stem`, `max_pos` and `max_pos_group`, `stop`):

**['term_str', 'n', 'n_chars', 'p', 'i', 'max_pos', 'max_pos_group', 'stop', 'dfidf']**

- Note: Your VOCAB may contain ngrams. If so, add a feature for `ngram_length`.
- List the top 20 significant words in the corpus by DFIDF.

**['1',
 'ridin',
 'extra',
 'ey',
 'roamin',
 'buzz',
 'buying',
 'timeless',
 'roadside',
 'rivers',
 'butter',
 'rises',
 'facts',
 'ting',
 'rips',
 'faded',
 'fags',
 'fail',
 'bursting',
 'rights']**
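The VOCAB statistics can be derived from the token table as sketched below. This is a minimal example under stated assumptions: the toy `tokens` frame is invented, `i` is taken to be the self-information $-\log_2 p$, and DFIDF is assumed to be $df \cdot \log_2(N/df)$ with $N$ the number of decade documents. Under that formula, a corpus of only six documents allows at most six distinct DFIDF values, so many terms tie and the order of a top-20 list depends largely on how ties are broken.

```python
import numpy as np
import pandas as pd

# Hypothetical token table: one row per token with its decade and normalized term.
tokens = pd.DataFrame({
    "decade_id": ["70s", "70s", "70s", "80s", "80s", "90s"],
    "term_str":  ["love", "love", "road", "love", "road", "love"],
})

# Term counts, lengths, relative frequencies, and self-information.
VOCAB = tokens["term_str"].value_counts().to_frame("n")
VOCAB.index.name = "term_str"
VOCAB["n_chars"] = VOCAB.index.str.len()
VOCAB["p"] = VOCAB["n"] / VOCAB["n"].sum()
VOCAB["i"] = -np.log2(VOCAB["p"])

# Document frequency and DFIDF (assumed formula: df * log2(N / df)).
n_docs = tokens["decade_id"].nunique()
VOCAB["df"] = tokens.groupby("term_str")["decade_id"].nunique()
VOCAB["dfidf"] = VOCAB["df"] * np.log2(n_docs / VOCAB["df"])
```

A term that appears in every document gets a DFIDF of zero, while terms that appear in roughly a third of the documents score highest.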

# Derived Tables

## BOW (3)

A bag-of-words representation of the CORPUS.

- UVA Box URL: https://virginia.box.com/s/itgetr7qm86iwe78knkqokytyi5fpwec
- GitHub URL for notebook used to create: https://github.com/mtvaden1/Text_Analytics_Final_Project/blob/main/Compute_Derived_Tables.ipynb
- Delimiter: **,**
- Bag (expressed in terms of OHCO levels): **['decade_id']**
- Number of observations: **17678**
- Columns (as delimited names, including `n`, `tfidf`):

**['decade_id', 'term_str', 'n', 'tfidf'] where 'decade_id' and 'term_str' are indices**

## DTM (3)

A representation of the BOW as a sparse count matrix.

- UVA Box URL: https://virginia.box.com/s/6fr36v9z0wotx9o87quv3lzfd746eqz9
- UVA Box URL of BOW used to generate (if applicable): https://virginia.box.com/s/itgetr7qm86iwe78knkqokytyi5fpwec
- GitHub URL for notebook used to create: https://github.com/mtvaden1/Text_Analytics_Final_Project/blob/main/Compute_Derived_Tables.ipynb
- Delimiter: **,**
- Bag (expressed in terms of OHCO levels): **['decade_id']**
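The relationship between the BOW and the DTM can be shown in a few lines of pandas. This sketch uses an invented `tokens` frame; only the bag level (`decade_id`) and the column names come from the tables described above.

```python
import pandas as pd

# Hypothetical token table with the bag level and the normalized term.
tokens = pd.DataFrame({
    "decade_id": ["70s", "70s", "70s", "80s", "80s", "90s"],
    "term_str":  ["love", "love", "road", "love", "road", "love"],
})

BAG = ["decade_id"]

# BOW: one row per (bag, term) pair with its count n.
BOW = tokens.groupby(BAG + ["term_str"]).size().to_frame("n")

# DTM: the same counts pivoted so that terms become columns.
DTM = BOW["n"].unstack(fill_value=0)
```

The BOW is the long form (only pairs that occur), and the DTM is the wide form in which absent pairs are filled with zero.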

## TFIDF (3)

A Document-Term matrix with TFIDF values.

- UVA Box URL: https://virginia.box.com/s/zxk8sd35ny6e861h8dyqcgn8ao7daxlr
- UVA Box URL of DTM or BOW used to create: https://virginia.box.com/s/itgetr7qm86iwe78knkqokytyi5fpwec
- GitHub URL for notebook used to create: https://github.com/mtvaden1/Text_Analytics_Final_Project/blob/main/Compute_Derived_Tables.ipynb
- Delimiter: **,**
- Description of TFIDF formula ($\LaTeX$ OK): 

$\text{TF-IDF}(t, d, D) = \text{TF}(t, d) \times \text{IDF}(t, D)$

where $t$ represents a word in the document (in this case a given decade playlist), $d$ represents a document (decade playlist), and $D$ is a collection of documents (decade playlists).

$TF(t,d)$ is the term frequency of word $t$ in document (decade playlist) $d$, calculated using the **max** $TF$ method. $IDF(t,D)$ is the inverse document frequency of term $t$ in the collection $D$, calculated using the **standard** $IDF$ method.
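To make the formula concrete, here is a small sketch of one common reading of these two methods. It assumes that "max" TF divides each count by the largest count in the same document and that "standard" IDF is $\log_2(N / df_t)$; the log base and the toy `DTM` are assumptions, not values from the project.

```python
import numpy as np
import pandas as pd

# Hypothetical document-term count matrix (rows: decades, columns: terms).
DTM = pd.DataFrame(
    [[2, 1, 0], [1, 1, 0], [1, 0, 4]],
    index=pd.Index(["70s", "80s", "90s"], name="decade_id"),
    columns=pd.Index(["love", "road", "beep"], name="term_str"),
)

# Max TF: each count divided by the largest count in its document.
TF = DTM.div(DTM.max(axis=1), axis=0)

# Standard IDF: log of (number of documents / document frequency), base 2 assumed.
DF = (DTM > 0).sum(axis=0)
IDF = np.log2(len(DTM) / DF)

# TF-IDF: multiply each term column by that term's IDF.
TFIDF = TF * IDF
```

In this toy matrix `love` appears in every document, so its IDF and all of its TF-IDF values are zero, while `beep` appears in one document only and gets the highest weight there.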

## Reduced and Normalized TFIDF_L2 (3)

A Document-Term matrix with L2 normalized TFIDF values.

- UVA Box URL: https://virginia.box.com/s/bcg6i25y0hf00np7vfk5s2tbj7do2uen
- UVA Box URL of source TFIDF table: https://virginia.box.com/s/zxk8sd35ny6e861h8dyqcgn8ao7daxlr
- GitHub URL for notebook used to create: https://github.com/mtvaden1/Text_Analytics_Final_Project/blob/main/Compute_Derived_Tables.ipynb
- Delimiter: **,**
- Number of features (i.e. significant words): **5000**
- Principle of significant word selection: **Here, the principle of significant word selection meant only including nouns, verbs, and adjectives, with the assumption that these parts of speech are more relevant and informative for the task. I also excluded proper nouns, since they are specific entity names and are not likely to contribute to the analysis. Finally, I took the top 5000 words as sorted by DFIDF to make a more manageable dataset.**
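The selection and normalization steps described above might look like the following sketch. The toy tables, the Penn Treebank tag prefixes used for the part-of-speech filter, and the value of `n_features` are assumptions for illustration (the project uses 5000 features).

```python
import numpy as np
import pandas as pd

# Hypothetical TFIDF matrix and VOCAB table.
TFIDF = pd.DataFrame(
    [[3.0, 4.0, 1.0, 2.0], [6.0, 8.0, 2.0, 1.0]],
    index=["70s", "80s"],
    columns=["run", "blue", "the", "paris"],
)
VOCAB = pd.DataFrame(
    {"max_pos": ["VB", "JJ", "DT", "NNP"], "dfidf": [1.0, 2.0, 3.0, 5.0]},
    index=["run", "blue", "the", "paris"],
)

n_features = 2  # 5000 in the project

# Keep nouns, verbs, and adjectives; drop proper nouns (assumed Penn tags).
is_content = VOCAB["max_pos"].str.match(r"^(NN|VB|JJ)")
is_proper = VOCAB["max_pos"].isin(["NNP", "NNPS"])
SIGS = (
    VOCAB[is_content & ~is_proper]
    .sort_values("dfidf", ascending=False)
    .head(n_features)
    .index
)

# Reduce to the significant terms, then scale each document row to unit L2 length.
reduced = TFIDF[SIGS]
TFIDF_L2 = reduced.div(np.sqrt((reduced ** 2).sum(axis=1)), axis=0)
```

After normalization every document vector has length one, so distances between decades reflect the mix of terms rather than the amount of text in each playlist.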

# Models

## PCA Components (4)

- UVA Box URL: https://virginia.box.com/s/6p28lg9wodtllsvvxh21jt52nlrcem9j
- UVA Box URL of the source TFIDF_L2 table: https://virginia.box.com/s/bcg6i25y0hf00np7vfk5s2tbj7do2uen
- GitHub URL for notebook used to create: https://github.com/mtvaden1/Text_Analytics_Final_Project/blob/main/Models.ipynb
- Delimiter: **,**
- Number of components: **6**
- Library used to generate: **sklearn PCA**
- Top 5 positive terms for first component:

**['eheu', 'para', 'que', 'nanananana', 'ahah']**

- Top 5 negative terms for second component:

**['dun', 'earn', 'tonights', 'dit', 'nyu']**
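A minimal sketch of how the document-component matrix and the term weights can be obtained with scikit-learn's `PCA` is shown below. The random input matrix and the table names built here are stand-ins; the term weights are the raw `components_` rows, which may be scaled differently from the loadings in the project's notebook. One point worth noting: `PCA` centers the data, so with six documents at most five components can carry variance, and a sixth component has an explained variance of essentially zero.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Hypothetical stand-in for the TFIDF_L2 table: 6 decades by 10 terms.
rng = np.random.default_rng(0)
docs = ["70s", "80s", "90s", "2000s", "2010s", "2020s"]
terms = [f"term{i}" for i in range(10)]
TFIDF_L2 = pd.DataFrame(rng.random((6, 10)), index=docs, columns=terms)

n_comps = 6
comp_names = [f"PC{i}" for i in range(n_comps)]
pca = PCA(n_components=n_comps)

# Document-component matrix: one row per decade, one column per component.
DCM = pd.DataFrame(pca.fit_transform(TFIDF_L2.values), index=docs, columns=comp_names)

# Component-term matrix, transposed so that terms are rows.
LOADINGS = pd.DataFrame(pca.components_.T, index=terms, columns=comp_names)

# Terms with the largest positive weight on PC0 and most negative weight on PC1.
top_pos_pc0 = LOADINGS["PC0"].sort_values(ascending=False).head(5).index.tolist()
top_neg_pc1 = LOADINGS["PC1"].sort_values().head(5).index.tolist()
```

Inspecting `pca.explained_variance_ratio_` shows how much each component contributes, which helps decide how many components are worth interpreting.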



## PCA DCM (4)

The document-component matrix generated.

- UVA Box URL: https://virginia.box.com/s/gncxsnj9a294np1xiig8hthny6iotu79
- GitHub URL for notebook used to create: https://github.com/mtvaden1/Text_Analytics_Final_Project/blob/main/Models.ipynb
- Delimiter: **,**

## PCA Loadings (4)

The component-term matrix generated.

- UVA Box URL: https://virginia.box.com/s/t8xhl7fu8xxzzalj0r00xuutpmyjbe41
- GitHub URL for notebook used to create: https://github.com/mtvaden1/Text_Analytics_Final_Project/blob/main/Models.ipynb
- Delimiter: **,**

## PCA Visualization 1 (4)

Include a scatterplot of documents in the space created by the first two components.

Color the points based on a metadata feature associated with the documents.

Also include a scatterplot of the loadings for the same two components. (This does not need a feature mapped onto color.)

![](DS5001_Plots/PCA_Vis1a1.png)

![](DS5001_Plots/PCA_Vis1b.png)

Briefly describe the nature of the polarity you see in the first component:

**In the plot of the documents in the space created by the first two components, what strikes me most is how effectively the first component separates the decades over time. Specifically, the 70s, 80s, and 90s are negative on the first component, while the 2000s, 2010s, and 2020s are all positive. Interestingly, the first component also tracks loudness, which shows a natural trend over the decades. The 70s and 80s have the lowest loudness, with an increase in the 90s. The 2000s have the highest loudness, followed by the 2010s and 2020s. The loadings also carry some information: many of the negative words (or sounds, which are captured in the lyrics) are more rhythmic, while the positive words appear louder or more energetic, with some onomatopoetic words representing louder (or screaming) noises.**

## PCA Visualization 2 (4)

Include a scatterplot of documents in the space created by the second two components.

Color the points based on a metadata feature associated with the documents.

Also include a scatterplot of the loadings for the same two components. (This does not need a feature mapped onto color.)

![](DS5001_Plots/PCA_Vis2a2.png)

![](DS5001_Plots/PCA_Vis2b.png)

**The second component pairs well with the acousticness of each decade, as shown in the plot of the document space for the second two components. The documents that are positive on the second component appear to be much more acoustic than the negative documents, with the 2020s and 70s being the most acoustic. The 90s and 2000s are the least acoustic and are the most negative documents on the second component. The loadings also show that words such as jam and jamming, which can relate to acoustic guitars and similar instruments, are positive. In contrast, the negative loadings for the second component, such as the onomatopoetic words woohoo, nyu, and dun, relate more to electronic or synthetic sounds.**



## LDA TOPIC (4)

- UVA Box URL: https://virginia.box.com/s/orbt496o1qchrfydrqdjzpwk35wh0huh
- UVA Box URL of count matrix used to create: https://virginia.box.com/s/r3o925jentm48xq4k89zyv6srxgq3ex4
- GitHub URL for notebook used to create: https://github.com/mtvaden1/Text_Analytics_Final_Project/blob/main/Models.ipynb
- Delimiter: **,**
- Library used to compute: **sklearn decomposition LDA**
- A description of any filtering, e.g. POS (Nouns and Verbs only): **Only the length of strings of song lyrics > 1, no parts of speech. Results were similar regardless**
- Number of components: **20**
- Any other parameters used: 

ngram_range = (1, 2),
n_terms = 4000,
n_topics = 20,
max_iter = 20,
n_top_terms = 8

- Top 5 words and best-guess labels for top five topics by mean document weight:
  - T00: **"yeah yeah yeah just na dont im love na na"** -> sing-along topic
  - T01: **"love know im dont oh ill like got"** -> love and uncertainty
  - T02: **"hey oh hey hey im dont oh oh baby yeah"** -> pursuing love
  - T03: **"way got just dont youre want light hold"** -> more romantic and wholesome
  - T04: **"im dont yeah got ill feel oh just"** -> more uncertainty

## LDA THETA (4)

- UVA Box URL: https://virginia.box.com/s/2xi2k0hir5gwr9wmzvgxdtvxozx82so3
- GitHub URL for notebook used to create: https://github.com/mtvaden1/Text_Analytics_Final_Project/blob/main/Models.ipynb
- Delimiter: **,**

## LDA PHI (4)

- UVA Box URL: https://virginia.box.com/s/5zphu4855l8kfwsbzvqipyvr3sb2nktu
- GitHub URL for notebook used to create: https://github.com/mtvaden1/Text_Analytics_Final_Project/blob/main/Models.ipynb
- Delimiter: **,**

## LDA + PCA Visualization (4)

Apply PCA to the PHI table and plot the topics in the space opened by the first two components.

Size the points based on the mean document weight of each topic (using the THETA table).

Provide a brief interpretation of what you see.

![](DS5001_Plots/LDA_PCA_labeled.png)

**Looking at the topics plotted in the space of the first two components, the first component seems to separate topics that express uncertainty from topics built on sounds. Love and dancing are the overarching themes here (as is mostly true in music), but the topics on the negative side of the first component use words that show much more uncertainty and a clear want of intimacy. These topics have the highest document weights and frequency compared to the topics on the positive side of the first component, which are sparse and either onomatopoetic, representing electronic dance music (beep, dun, nyu, la), or more focused on pleasure.**

## Sentiment VOCAB_SENT (4)

Sentiment values associated with a subset of the VOCAB from a curated sentiment lexicon.

- UVA Box URL: https://virginia.box.com/s/6rny0du981ke0283jn0o8kb566i3axtg
- UVA Box URL for source lexicon: https://virginia.box.com/s/482z365s0o71ga9sgfwx1lkffrixlsoq
- GitHub URL for notebook used to create: https://github.com/mtvaden1/Text_Analytics_Final_Project/blob/main/Models.ipynb
- Delimiter: **,**

## Sentiment BOW_SENT (4)

Sentiment values from VOCAB_SENT mapped onto BOW.

- UVA Box URL: https://virginia.box.com/s/g7jkjr16diiz4x7d2lkvb30c8ge9x1fu
- GitHub URL for notebook used to create: https://github.com/mtvaden1/Text_Analytics_Final_Project/blob/main/Models.ipynb
- Delimiter: **,**

## Sentiment DOC_SENT (4)

Computed sentiment per bag computed from BOW_SENT.

- UVA Box URL: https://virginia.box.com/s/bjfvug1sd3oszs4qdpzxyhv8bm55s7nl
- GitHub URL for notebook used to create: https://github.com/mtvaden1/Text_Analytics_Final_Project/blob/main/Models.ipynb
- Delimiter: **,**
- Document bag expressed in terms of OHCO levels: **['decade_id']**
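The three sentiment tables relate to each other roughly as in the sketch below. The tiny `LEXICON`, the single `sentiment` column, and the choice to weight by term count `n` are assumptions for illustration; the project's lexicon has its own columns and the weighting may differ.

```python
import pandas as pd

# Hypothetical sentiment lexicon: one polarity value per term.
LEXICON = pd.DataFrame(
    {"term_str": ["love", "cry", "lonely"], "sentiment": [1, -1, -1]}
)

# Hypothetical BOW in long form (decade, term, count).
BOW = pd.DataFrame({
    "decade_id": ["70s", "70s", "70s", "2020s", "2020s"],
    "term_str":  ["love", "cry", "road", "love", "lonely"],
    "n":         [3, 1, 2, 2, 5],
})

# BOW_SENT: keep only the BOW rows whose term appears in the lexicon.
BOW_SENT = BOW.merge(LEXICON, on="term_str", how="inner")
BOW_SENT["weighted"] = BOW_SENT["n"] * BOW_SENT["sentiment"]

# DOC_SENT: count-weighted mean sentiment per bag (decade).
grouped = BOW_SENT.groupby("decade_id")
DOC_SENT = (grouped["weighted"].sum() / grouped["n"].sum()).to_frame("sentiment")
```

Terms that are missing from the lexicon (such as `road` here) drop out in the inner merge, so each decade's score reflects only its lexicon-covered vocabulary.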

## Sentiment Plot (4)

Plot sentiment over some metric space, such as time.

If you don't have a metric metadata features, plot sentiment over a feature of your choice.

You may use a bar chart or a line graph.

![](DS5001_Plots/emo_sentiments.png)

![](DS5001_Plots/emo_general_sentiment.png)

## VOCAB_W2V (4)

A table of word2vec features associated with terms in the VOCAB table.

- UVA Box URL: https://virginia.box.com/s/zdjw4egbbmptd28xqa673i9d4owfbz4q
- GitHub URL for notebook used to create: https://github.com/mtvaden1/Text_Analytics_Final_Project/blob/main/Models.ipynb
- Delimiter: **,**
- Document bag expressed in terms of OHCO levels: **['term_str']**
- Number of features generated: **246**
- The library used to generate the embeddings: **gensim word2vec**

## Word2vec tSNE Plot (4)

Plot word embedding features in two dimensions using t-SNE.

Describe a cluster in the plot that captures your attention.

![](DS5001_plots/tSNE.png)

![](DS5001_plots/w2v_cluster.png)

**Looking at the t-SNE plot as a whole, I thought that its shape was very interesting and non-linear. However, one cluster that really caught my attention when zoomed in (second image) is at the bottom right of the plot, with a high positive x value and a negative y value. Some of the words we see here are sky, bright, lord, heart, dream, moonlight, star, hallelujah, music, and dancing. This was a very interesting cluster, as it is an intersection of music, nature, mystery, and religion. Taken as a whole, this cluster feels very ethereal, optimistic, and surreal.**

# Riffs

Provide at least three visualizations that combine the preceding model data in interesting ways.

These should provide insight into how features in the LIB table are related. 

The nature of this relationship is left open to you -- it may be correlation, or mutual information, or something less well defined. 

In doing so, consider the following visualization types:

- Hierarchical cluster diagrams
- Heatmaps
- Scatter plots
- KDE plots
- Dispersion plots
- t-SNE plots
- etc.

## Riff 1 (5)

### LDA + PCA Visualization combined with LIB table to see decade with highest weight

![](DS5001_plots/Riff1_updated.png)

**This visualization was my first Riff, based on the initial instructions to combine the LDA and PCA visualization with metadata from the LIB table, which felt very difficult because of how the data are indexed. However, I was able to use the THETA table to discover which decade from the LIB table had the highest document weight for each of the topics, which I then joined with the DCM and TOPICS tables. Here we can see that the 2020s have the highest document-weighted topics, corresponding to the topics of uncertainty and intimacy shown in the earlier LDA + PCA visualization. We can also see that the topics most strongly influenced by the 90s and 2000s are grouped more tightly along the first component on the plot, while the topics most heavily influenced by the 2010s are the most scattered.**

## Riff 2 (5)

### Euclidean - Average Hierarchical Cluster

![](DS5001_plots/Riff2_hc1.png)


**Looking at the hierarchical clustering using Euclidean distance and the average linkage method, we can see that the decades are separated in roughly chronological order. Euclidean distance separates the 70s and the 90s lyrics recommended by Spotify from the rest of the decades, followed by the 80s, then the 2020s, then the 2000s and 2010s. It makes intuitive sense that each decade is clustered with decades that precede or follow it, although it is interesting that the 2000s and 2010s are considered the most similar and that the 80s are colored along with the 2000s, 2010s, and 2020s rather than with the 70s and 90s. This could reflect my personalized taste in the alternative and rock genres from the 80s, as well as my taste for alternative rock from current artists.**
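For reference, a tree like the one discussed above can be computed with SciPy as sketched here. The random matrix stands in for the document vectors, and the plotting options are assumptions; `no_plot=True` is used so that the example returns the leaf order without needing a plotting backend.

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import pdist

# Hypothetical stand-in for the document vectors: 6 decades by 10 terms.
rng = np.random.default_rng(0)
docs = ["70s", "80s", "90s", "2000s", "2010s", "2020s"]
vectors = pd.DataFrame(rng.random((6, 10)), index=docs)

# Pairwise Euclidean distances between decades (condensed form).
distances = pdist(vectors.values, metric="euclidean")

# Agglomerative clustering with average linkage.
tree = linkage(distances, method="average")

# Leaf order and layout of the dendrogram; set no_plot=False to draw it.
layout = dendrogram(tree, labels=docs, orientation="left", no_plot=True)
```

Each row of `tree` records one merge (the two clusters joined, their distance, and the size of the new cluster), so the first row identifies the two most similar decades.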

## Riff 3 (5)

### Heatmap of Probabilistic TFIDF over the Decades for Sample Words

![](DS5001_plots/Riff3a.png)

### TFIDF per Decade for Example Words

*Apologies for the explicit words here, but I actually thought the conclusions were pretty meaningful*

![](DS5001_plots/Riff3b.png)
![](DS5001_plots/Riff3c.png)
![](DS5001_plots/Riff3d.png)
![](DS5001_plots/Riff3e.png)

**When examining the heatmap of probabilistic TFIDF over the decades for this list of sample words, a few things stood out to me. The biggest is the change in TFIDF of these words over time, which we can examine by looking at the gradient per decade. Some familial and traditional themes such as husband and wife are more present in the 70s and 80s, as are folk and country music words such as country, guitar, and beer. The most interesting of these sample words to me is violence, which appears to increase steadily in TFIDF over each decade until the 2020s, when it suddenly disappears completely. Additionally, both peace and stress become more prevalent as violence increases, representing potential political and ideological extremes which may have grown over time in modern music that is catered to my taste.**

**I also included these plots of TFIDF per decade for some specific words because the conclusions were so interesting. The themes of alcohol, violence, sex, and swearing all appear to have generally grown over time in the music data here. The data show these controversial topics becoming more prominent in the 21st century, especially with the use of more expletives and references to alcohol and substance abuse in the 2020s. One explanation for this trend could be that I listen to more current rap music and tend to listen to older classic rock, as rap music can have more extreme and controversial lyrics.**

# Interpretation (4)

Describe something interesting about your corpus that you discovered during the process of completing this assignment.

At a minimum, use 250 words, but you may use more. You may also add images if you'd like.

### Hierarchical Agglomerative Cluster Tree of Vector Space

![](DS5001_plots/Riff4.png)

**I wanted to start off my final interpretation with this HAC because I think it relates to a theme that I found a lot in the PCA and LDA sections of my project. After talking about the separation of onomatopoetic words that describe rhythm or dance music lyrics from other English words in the principal components earlier in this project, we can see that these onomatopoetic and shouting-style words (lo, hey, lala, li) are the first to be separated in our HAC. Additionally, we can see that the next split also concerns a previous observation, where words that are related to desire and uncertainty (can, can't, want, give) are separated next. This trend reflects the themes that I noticed in the principal component analysis based on the loadings that were separated by the first and second principal components.**

**Beyond the onomatopoetic/shouting and uncertainty themes that were very prevalent in my analysis, I was surprised by how all of my topics were about love in some way. I know that most music is about love, but this still exceeded my expectations.**

**Finally, I thought that all of the metadata and lyrical trends that evolved over time in each decade were very interesting. It made sense to me that loudness increased in music over time and that acousticness generally decreased (except for my love of 2020s alternative and indie music). However, I thought it was very surprising that all of my music recommendations in each decade had relatively negative sentiment. I was not surprised to see that the 70s and 80s had the most positive sentiment for me, but this project has helped me become more aware of how sad the music I listen to may be, as that can often have an effect on my mood. Lastly, the increase in expletives and controversial lyrics about violence, love, and substance abuse over time does not surprise me, as I feel like mental health concerns and addictions are more public and openly addressed than ever in both music and society as a whole. I really enjoyed doing this project and feel like I learned a lot about the challenges of working with lyrics that do not form full sentences or are very repetitive, but I also feel like I learned something about myself. Part of me wishes that there were Spotify playlists available for each decade that did not take my current listening preferences into account so that I could better generalize my findings, but I am extremely satisfied to have a lot to think about each time I listen to a song going forward.**