# Final Project Notebook

DS 5001 Text as Data | Spring 2025

# Metadata

- Full Name: James Siegener
- Userid: gnq2mr
- GitHub Repo URL: https://github.com/jamessiegener/DS-5001-Final-Project
- UVA Box URL: https://virginia.box.com/s/0deakjbcfh6l0l0i7rthv2592xdzt9fd

# Overview

The goal of the final project is for you to create a **digital analytical edition** of a corpus using the tools, practices, and perspectives you’ve learned in this course. You will select a corpus that has already been digitized and transcribed, parse that into an F-compliant set of tables, and then generate and visualize the results of a series of fitted models. You will also draw some tentative conclusions regarding the linguistic, cultural, psychological, or historical features represented by your corpus. The point of the exercise is to have you work with a corpus through the entire pipeline from ingestion to interpretation.

Specifically, you will acquire a collection of long-form texts and perform the following operations:

- **Convert** the collection from their source formats (F0) into a set of tables that conform to the Standard Text Analytic Data Model (F2).
- **Annotate** these tables with statistical and linguistic features using NLP libraries such as NLTK (F3).
- **Produce** a vector representation of the corpus to generate TFIDF values to add to the TOKEN (aka CORPUS) and VOCAB tables (F4).
- **Model** the annotated and vectorized model with tables and features derived from the application of unsupervised methods, including PCA, LDA, and word2vec (F5).
- **Explore** your results using statistical and visual methods.
- **Present** conclusions about patterns observed in the corpus by means of these operations.

When you are finished, you will make the results of your work available in GitHub (for code) and UVA Box (for data). You will submit to Gradescope (via Canvas) a PDF version of a Jupyter notebook that contains the information listed below.

# Some Details

- Please fill out your answers in each task below by editing the markdown cell. 
- Replace text that asks you to insert something with the thing, i.e. replace `(INSERT IMAGE HERE)` with an image element, e.g. `![](image.png)`.
- For URLs, just paste the raw URL directly into the text area. Don't worry about providing link labels using `[label](link)`.
- Please do not alter the structure of the document or cell, i.e. the bulleted lists. 
- You may add explanatory paragraphs below the bulleted lists.
- Please name your tables as they are named in each task below.
- Tasks are indicated by headers with point values in parentheses.

# Raw Data

## Source Description (1)

Provide a brief description of your source material, including its provenance and content. Tell us where you found it and what kind of content it contains.

The source data consists of lyrics from 1,992 songs across 127 albums and 19 artists. The Spotify ID for each album was gathered manually. Song titles were then scraped from Spotify and used to query Genius, from which the corresponding lyric pages were scraped and compiled into a DataFrame.

## Source Features (1)

Add values for the following items. (Do this for all following bulleted lists.)

- Source URL: https://genius.com and https://spotify.com
- UVA Box URL: https://virginia.box.com/s/7kswaar7yixjokbmo55lby1kn7nl9gwv
- Number of raw documents: 1974
- Total size of raw documents (e.g. in MB): 5.3 MB
- File format(s), e.g. XML, plaintext, etc.: CSV

## Source Document Structure (1)

Provide a brief description of the internal structure of each document. That is, describe the typical elements found in each document and their relation to each other. For example, a corpus of letters might be described as having a date, an addressee, a salutation, a set of content paragraphs, and a closing. If there are various structures, state that.

Each document in the corpus represents a single song. The internal structure of each song typically includes a sequence of labeled sections such as verses, choruses, pre-choruses, bridges, and outros. These sections are marked explicitly in the text (e.g., “[Verse 1]”, “[Chorus]”) and may repeat throughout the song.  For the purposes of analysis, the corpus organizes the content hierarchically by artist, album, song, and verse, with each verse treated as a meaningful unit of context. While most documents follow a common verse-chorus format, there is some variation in structure across songs.
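The section markers described above lend themselves to a simple line-by-line parse. The following is a minimal sketch, assuming headers always sit alone on a line in square brackets; `split_sections` and the example lyrics are hypothetical and are not taken from the project notebook.

```python
import re

# Matches section headers such as "[Verse 1]" or "[Chorus]" on their own line.
SECTION_TAG = re.compile(r"^\[([^\]]+)\]\s*$")


def split_sections(lyrics):
    """Split raw lyrics into (label, text) pairs, one per labeled section."""
    sections = []
    label, lines = None, []
    for raw_line in lyrics.splitlines():
        line = raw_line.strip()
        match = SECTION_TAG.match(line)
        if match:
            # A new header closes the section collected so far.
            if lines:
                sections.append((label, "\n".join(lines)))
            label, lines = match.group(1), []
        elif line:
            lines.append(line)
    # Flush the final section once the input is exhausted.
    if lines:
        sections.append((label, "\n".join(lines)))
    return sections


example = """[Verse 1]
First line of the verse
Second line of the verse

[Chorus]
The hook goes here

[Verse 2]
Another verse line
"""

# verse_num is the position of the section within the song.
for verse_num, (label, text) in enumerate(split_sections(example)):
    print(verse_num, label, len(text.split()))
```

Any text that appears before the first header is kept with a label of `None`, so songs without section markers still yield one unit.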



# Parsed and Annotated Data

Parse the raw data into the three core tables of your edition: the `LIB`, `CORPUS`, and `VOCAB` tables.

These tables will be stored as CSV files with header rows.

You may consider using `|` as a delimiter.

Provide the following information for each.

## LIB (2)

The source documents the corpus comprises. These may be books, plays, newspaper articles, abstracts, blog posts, etc. 

Note that these are *not* documents in the sense used to describe a bag-of-words representation of a text, e.g. chapter.

- UVA Box URL: https://virginia.box.com/s/x13cw57aa3bn2h618m3pgj0i4740wgha
- GitHub URL for notebook used to create: https://github.com/jamessiegener/DS-5001-Final-Project/blob/main/textasdata_report.ipynb
- Delimiter: "|"
- Number of observations: 1974
- List of features, including at least three that may be used for model summarization (e.g. date, author, etc.): `album_title`, `artist_name`, `song_title`, `song_id`, `char_len`, `release_year`, `label`, `genius_url`
- Average length of each document in characters: 2520

## CORPUS (2)

The sequence of word tokens in the corpus, indexed by their location in the corpus and document structures.

- UVA Box URL: https://virginia.box.com/s/g5tb58v4xha5rogq3gbw604jlselfftv
- GitHub URL for notebook used to create:
- Delimiter: "|"
- Number of observations (should be >= 500,000 and <= 2,000,000): 912,702
- OHCO Structure (as delimited column names): `album_title`,`song_name`,`verse_num`,`token_num`
- Columns (as delimited column names, including `token_str`, `term_str`, `pos`, and `pos_group`): `token_str`, `term_str`, `pos`, and `pos_group`

## VOCAB (2)

The unique word types (terms) in the corpus.

- UVA Box URL: https://virginia.box.com/s/8wz6k6bj5jpzuwcw6l3kmkuoe0k8q80a
- GitHub URL for notebook used to create: https://github.com/jamessiegener/DS-5001-Final-Project/blob/main/textasdata_report.ipynb
- Delimiter: "|"
- Number of observations: 25064
- Columns (as delimited names, including `n`, `p`, `i`, `dfidf`, `porter_stem`, `max_pos`, `max_pos_group`, and `stop`): `n`, `p`, `i`, `dfidf`, `max_pos`, `max_pos_group`, `stop`, `porter_stem`
- Note: Your VOCAB may contain ngrams. If so, add a feature for `ngram_length`.
- List the top 20 significant words in the corpus by DFIDF.

| term_str | dfidf |
|---|---|
| too | 1032.282762 |
| ill | 1032.278672 |
| feel | 1032.272558 |
| come | 1032.264967 |
| tell | 1032.264967 |
| want | 1032.242976 |
| then | 1032.172870 |
| been | 1031.781146 |
| need | 1031.768128 |
| right | 1031.453660 |
| are | 1031.268117 |
| let | 1031.237665 |
| way | 1030.663884 |
| wanna | 1030.583547 |
| thats | 1030.061592 |
| have | 1030.061592 |
| take | 1029.872454 |
| could | 1028.223764 |
| off | 1028.090738 |
| think | 1028.090738 |
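For readers unfamiliar with DFIDF, the sketch below shows one way it can be computed, assuming the score is document frequency times a base-2 inverse document frequency, $\text{DF}_t \cdot \log_2(N / \text{DF}_t)$. The function name and toy corpus are hypothetical. Under that definition the score peaks for terms found in roughly $N/e$ of the documents, which would explain why the top of the list is common but not universal vocabulary.

```python
import math
from collections import Counter


def dfidf_scores(docs):
    """Return {term: DF * log2(N / DF)} for a list of tokenized documents."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each term at least once.
    df = Counter(term for doc in docs for term in set(doc))
    return {term: count * math.log2(n_docs / count) for term, count in df.items()}


# Toy corpus (hypothetical), one token list per song.
docs = [
    ["love", "baby", "love"],
    ["love", "night"],
    ["love", "baby"],
    ["love", "time"],
]

scores = dfidf_scores(docs)
# Terms ordered from most to least significant under this measure.
top_terms = sorted(scores, key=scores.get, reverse=True)
```

In the toy corpus, "love" occurs in every document, so its score is zero even though it is the most frequent word.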

# Derived Tables

## BOW (3)

A bag-of-words representation of the CORPUS.

- UVA Box URL: https://virginia.box.com/s/e9wyf2ue33v3tz3h0q80v49pwf7e4g0x
- GitHub URL for notebook used to create: https://github.com/jamessiegener/DS-5001-Final-Project/blob/main/textasdata_report.ipynb
- Delimiter: "|"
- Bag (expressed in terms of OHCO levels): [`album_title`, `song_id`]
- Number of observations: 331342
- Columns (as delimited names, including `n`, `tfidf`): `term_str`,`n`,`tfidf`

## DTM (3)

A representation of the BOW as a sparse count matrix.

- UVA Box URL: https://virginia.box.com/s/2ynvltqc32u3mdi2zlhimvk4um4hgw1d
- UVA Box URL of BOW used to generate (if applicable): https://virginia.box.com/s/e9wyf2ue33v3tz3h0q80v49pwf7e4g0x
- GitHub URL for notebook used to create: https://github.com/jamessiegener/DS-5001-Final-Project/blob/main/textasdata_report.ipynb
- Delimiter: "|"
- Bag (expressed in terms of OHCO levels): [`album_title`, `song_id`]

## TFIDF (3)

A Document-Term matrix with TFIDF values.

- UVA Box URL: https://virginia.box.com/s/1e4mugger7w53ngqwjhplozjvopqj7s9
- UVA Box URL of DTM or BOW used to create: https://virginia.box.com/s/e9wyf2ue33v3tz3h0q80v49pwf7e4g0x
- GitHub URL for notebook used to create: https://github.com/jamessiegener/DS-5001-Final-Project/blob/main/textasdata_report.ipynb
- Delimiter: "|"
- Description of TFIDF formula ($\LaTeX$ OK): 
**Term Frequency (TF):**

$$
\text{TF}_{t,d} = \frac{f_{t,d}}{\max_{t'} f_{t',d}}
$$

**Document Frequency (DF):**

$$
\text{DF}_t = \sum_{d=1}^{N} \mathbf{1}_{t \in d}
$$

**Inverse Document Frequency (IDF):**

$$
\text{IDF}_t = \log_2\left(\frac{N}{\text{DF}_t}\right)
$$

**TF-IDF:**

$$
\text{TF-IDF}_{t,d} = \text{TF}_{t,d} \times \text{IDF}_t
$$
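The four formulas can be traced step by step on a small count matrix. This is a sketch under one assumption that the notation leaves open: the maximum in the TF denominator is taken over the terms within each document. The toy matrix and its labels are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy document-term count matrix (hypothetical): rows are bags, columns are terms.
DTM = pd.DataFrame(
    [[2, 1, 0], [0, 1, 4], [1, 0, 0]],
    index=["song_a", "song_b", "song_c"],
    columns=["love", "night", "time"],
)

# TF: each count divided by the largest count in the same document
# (assumed reading of the max in the TF formula).
TF = DTM.div(DTM.max(axis=1), axis=0)

# DF: number of documents in which each term occurs at least once.
DF = (DTM > 0).sum(axis=0)

# IDF: log base 2 of the number of documents over the document frequency.
IDF = np.log2(len(DTM) / DF)

# TFIDF: term frequency weighted by inverse document frequency.
TFIDF = TF * IDF
```

Multiplying the `TF` frame by the `IDF` series aligns on term names, so each column is scaled by its own IDF value.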


## Reduced and Normalized TFIDF_L2 (3)

A Document-Term matrix with L2 normalized TFIDF values.

- UVA Box URL: https://virginia.box.com/s/cnaikr1rasxl20kvqutbuynkugr0q0il
- UVA Box URL of source TFIDF table: https://virginia.box.com/s/1e4mugger7w53ngqwjhplozjvopqj7s9
- GitHub URL for notebook used to create: https://github.com/jamessiegener/DS-5001-Final-Project/blob/main/textasdata_report.ipynb
- Delimiter: "|"
- Number of features (i.e. significant words): 1000
- Principle of significant word selection: dfidf
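The reduction and normalization step can be sketched as follows, assuming the matrix is first cut down to the highest-DFIDF terms and each document row is then scaled to unit Euclidean length. `reduce_and_normalize` and the toy arrays are hypothetical stand-ins for the real tables.

```python
import numpy as np


def reduce_and_normalize(tfidf, dfidf, n_terms):
    """Keep the n_terms columns with the highest DFIDF, then L2-normalize each row."""
    # Column indices of the highest-DFIDF terms, best first.
    keep = np.argsort(dfidf)[::-1][:n_terms]
    reduced = tfidf[:, keep]
    # Row-wise Euclidean length; rows of all zeros are left as zeros.
    norms = np.linalg.norm(reduced, axis=1, keepdims=True)
    safe_norms = np.where(norms == 0, 1.0, norms)
    return reduced / safe_norms, keep


# Toy inputs (hypothetical): 3 documents x 3 terms, plus one DFIDF score per term.
tfidf = np.array([[3.0, 4.0, 1.0], [0.0, 0.0, 2.0], [1.0, 0.0, 0.0]])
dfidf = np.array([2.0, 1.5, 0.5])

tfidf_l2, kept = reduce_and_normalize(tfidf, dfidf, n_terms=2)
```

Normalizing after the reduction means each row has unit length over the retained terms only, which keeps long and short songs on a comparable scale for PCA.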

# Models

## PCA Components (4)

- UVA Box URL: https://virginia.box.com/s/0jf0k539qufkci95m2hl2kjsmeh1mh41
- UVA Box URL of the source TFIDF_L2 table: https://virginia.box.com/s/cnaikr1rasxl20kvqutbuynkugr0q0il
- GitHub URL for notebook used to create: https://github.com/jamessiegener/DS-5001-Final-Project/blob/main/textasdata_report.ipynb
- Delimiter: "|"
- Number of components: 10
- Library used to generate: Scratch implementation from class
- Top 5 positive terms for first component: love baby heart were babe 
- Top 5 negative terms for second component: was didnt never looks youd
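As an illustration of what a from-scratch PCA can look like, the sketch below uses an eigendecomposition of the term covariance matrix. It is one common route and not necessarily the class implementation; the function name and the random input matrix are hypothetical.

```python
import numpy as np


def pca_from_scratch(X, n_components):
    """PCA via eigendecomposition of the term covariance matrix."""
    # Center each term column so the covariance is computed around the mean.
    centered = X - X.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    # eigh is meant for symmetric matrices and returns eigenvalues in ascending order.
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1][:n_components]
    loadings = eigvecs[:, order]   # terms x components
    dcm = centered @ loadings      # documents x components
    explained = eigvals[order] / eigvals.sum()
    return dcm, loadings, explained


rng = np.random.default_rng(0)
X = rng.random((30, 8))  # stand-in for a documents x terms TFIDF_L2 matrix

dcm, loadings, explained = pca_from_scratch(X, n_components=2)
```

The sign of each eigenvector is arbitrary, so which end of a component counts as "positive" carries no meaning by itself; only the contrast between the two ends does.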

## PCA DCM (4)

The document-component matrix generated.

- UVA Box URL: https://virginia.box.com/s/1q7jq8ki0zu3gwj6wtdna9cwg0ql2un5
- GitHub URL for notebook used to create: https://github.com/jamessiegener/DS-5001-Final-Project/blob/main/textasdata_report.ipynb
- Delimiter: "|"

## PCA Loadings (4)

The component-term matrix generated.

- UVA Box URL: https://virginia.box.com/s/hg5xs9d615i3ghsntk411bvq8009u3gf
- GitHub URL for notebook used to create: https://github.com/jamessiegener/DS-5001-Final-Project/blob/main/textasdata_report.ipynb
- Delimiter: "|"

## PCA Visualization 1 (4)

Include a scatterplot of documents in the space created by the first two components.

Color the points based on a metadata feature associated with the documents.

Also include a scatterplot of the loadings for the same two components. (This does not need a feature mapped onto color.)

![PCA1.png](attachment:8d17aa20-cef7-441e-967f-9c2f4faa5362.png)

![loadings1.png](attachment:b21976b5-c136-467f-87ff-55b63a3a2c17.png)

Briefly describe the nature of the polarity you see in the first component:

In the first component, there is a very clear separation between albums by artists generally considered rappers, like Kanye West and Eminem, and albums by pop artists like Ariana Grande and Taylor Swift. Interestingly, no artist crosses into the positive range of the first component; every artist sits on the negative side.

## PCA Visualization 2 (4)

Include a scatterplot of documents in the space created by the second two components.

Color the points based on a metadata feature associated with the documents.

Also include a scatterplot of the loadings for the same two components. (This does not need a feature mapped onto color.)

![pca2.png](attachment:5987fe41-dcaa-49ed-9707-d81f08454506.png)

![loadings2.png](attachment:d3f9cb9a-ce0c-4400-87d9-ae5890d4fce3.png)

Briefly describe the nature of the polarity you see in the second component:

For this component (the fourth), I do not see a strong polarity, but it did place all of Eminem's, Drake's, and Ariana Grande's albums on the negative side, while the positive side is much more mixed. Although the polarity is weak, the PCA grouped each artist's albums together fairly effectively.

## LDA TOPIC (4)

- UVA Box URL: https://virginia.box.com/s/9yu44vm1q69nr3we5r9t7em0mx2aqbey
- UVA Box URL of count matrix used to create: https://virginia.box.com/s/1q7jq8ki0zu3gwj6wtdna9cwg0ql2un5 
- GitHub URL for notebook used to create: https://github.com/jamessiegener/DS-5001-Final-Project/blob/main/textasdata_report.ipynb
- Delimiter: "|"
- Library used to compute: scikit-learn
- A description of any filtering, e.g. POS (Nouns and Verbs only): Only Nouns
- Number of components: 20
- Any other parameters used: max_df=0.9, min_df=2
- Top 5 words and best-guess labels for top five topics by mean document weight:
  - T00: Time Love Thing Gonna Heart - Love
  - T01: Baby Time Mind Boy Youre - Romantic partners
  - T02: Love Yeah Woah Baby Time - Declarations of Love
  - T03: Shit Hands Bitch Fuck People - Profanity/Aggression
  - T04: Night Lights Yeah Time Love - Love in the City

## LDA THETA (4)

- UVA Box URL: https://virginia.box.com/s/p6zgozyov2o07cwrn1amyyeq7awq9nln
- GitHub URL for notebook used to create:
- Delimiter: "|"

## LDA PHI (4)

- UVA Box URL: https://virginia.box.com/s/r49fp9amc6rmq8gsmmx29yuh8h7iwo9r
- GitHub URL for notebook used to create: https://github.com/jamessiegener/DS-5001-Final-Project/blob/main/textasdata_report.ipynb
- Delimiter: "|"

## LDA + PCA Visualization (4)

Apply PCA to the THETA table and plot the topics in the space opened by the first two components.

Size the points based on the mean document weight of each topic (using the THETA table).

Color the points based on a metadata feature from the LIB table.

Provide a brief interpretation of what you see.

![lda_pca.png](attachment:a5338327-b5ff-411c-98b0-4f0df4a36a12.png)

The two largest dots, representing the two topics with the highest mean document weight, sit at the extremes of the scatterplot: T17 is at the far positive end of component 0, and T04 is at the far positive end of component 1. Both topics are colored blue, meaning that they are most prevalent in the work of Taylor Swift. To me, this indicates either that Taylor Swift's work is distinct from the other artists' in a way the topics capture easily, or that her work is overwhelming the algorithm, since she is one of the most heavily represented artists. The topics are also dominated by only five artists (Taylor Swift, Drake, Eminem, Kanye West, and Beyonce), which could indicate that their work is the most influential within the dataset.

## Sentiment VOCAB_SENT (4)

Sentiment values associated with a subset of the VOCAB from a curated sentiment lexicon.

- UVA Box URL: https://virginia.box.com/s/swug0pvpyvo8dghe0m55stir3usxdx7u
- UVA Box URL for source lexicon: https://virginia.box.com/s/elvk3paqb9csjrylluo44st7wepwj3nk
- GitHub URL for notebook used to create: https://github.com/jamessiegener/DS-5001-Final-Project/blob/main/textasdata_report.ipynb
- Delimiter: "|"

## Sentiment BOW_SENT (4)

Sentiment values from VOCAB_SENT mapped onto BOW.

- UVA Box URL: https://virginia.box.com/s/4c4nvdn4xwfmiwoxfqgp494agieupxn6
- GitHub URL for notebook used to create: https://github.com/jamessiegener/DS-5001-Final-Project/blob/main/textasdata_report.ipynb
- Delimiter: "|"

## Sentiment DOC_SENT (4)

Computed sentiment per bag computed from BOW_SENT.

- UVA Box URL: https://virginia.box.com/s/upargjtbcv1n0682py9m2ckvlunyp6lw
- GitHub URL for notebook used to create: https://github.com/jamessiegener/DS-5001-Final-Project/blob/main/textasdata_report.ipynb
- Delimiter: "|"
- Document bag expressed in terms of OHCO levels: [`album_title`, `song_id`]
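The path from VOCAB_SENT to DOC_SENT can be sketched with a join followed by a grouped aggregate. The example assumes a lexicon with one column per emotion and a count-weighted mean per bag; the `joy` and `anger` values, the toy BOW rows, and the weighting scheme are all hypothetical rather than taken from the project notebook.

```python
import pandas as pd

bag = ["album_title", "song_id"]
emo_cols = ["joy", "anger"]

# Hypothetical lexicon rows: one flag per emotion for each term.
VOCAB_SENT = pd.DataFrame(
    {"joy": [1, 0, 1], "anger": [0, 1, 0]},
    index=pd.Index(["love", "hate", "party"], name="term_str"),
)

# Hypothetical BOW rows: term counts per bag.
BOW = pd.DataFrame(
    {
        "album_title": ["A", "A", "A", "B", "B"],
        "song_id": [1, 1, 1, 2, 2],
        "term_str": ["love", "hate", "night", "party", "love"],
        "n": [3, 1, 5, 2, 2],
    }
)

# Map lexicon values onto the BOW; terms missing from the lexicon are dropped.
BOW_SENT = BOW.join(VOCAB_SENT, on="term_str", how="inner")

# Weight each emotion flag by how often the term occurs in the bag.
BOW_SENT[emo_cols] = BOW_SENT[emo_cols].mul(BOW_SENT["n"], axis=0)

# Count-weighted mean per bag: weighted sums divided by total lexicon-term counts.
sums = BOW_SENT.groupby(bag)[emo_cols + ["n"]].sum()
DOC_SENT = sums[emo_cols].div(sums["n"], axis=0)
```

Because the inner join discards terms that are absent from the lexicon, each score is a proportion of the sentiment-bearing words in a bag, not of all its words.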

## Sentiment Plot (4)

Plot sentiment over some metric space, such as time.

If you don't have a metric metadata feature, plot sentiment over a feature of your choice.

You may use a bar chart or a line graph.

![sentimentovertime.png](attachment:5ebe721a-1743-414c-9658-92c1bb46aefd.png)


## VOCAB_W2V (4)

A table of word2vec features associated with terms in the VOCAB table.

- UVA Box URL: https://virginia.box.com/s/u921fniv9i3crvcq780p115rzjy75wnk
- GitHub URL for notebook used to create: https://github.com/jamessiegener/DS-5001-Final-Project/blob/main/textasdata_report.ipynb
- Delimiter: "|"
- Document bag expressed in terms of OHCO levels: [`album_title`,`song_id`]
- Number of features generated: 200
- The library used to generate the embeddings: gensim

## Word2vec tSNE Plot (4)

Plot word embedding features in two dimensions using t-SNE.

Describe a cluster in the plot that captures your attention.

![tsne.png](attachment:c84705f2-3285-4c03-8daf-3804321b8c9b.png)

There is a cluster located at around (10, 20) that contains many words related to time. This cluster is interesting to me because it seems to float above the main group of words in the center. It contains the words "minute," "hour," "day," "year," "friday," "week," "next," and "same," along with other words that either describe time or are frequently used with time words.

# Riffs

Provide at least three visualizations that combine the preceding model data in interesting ways.

These should provide insight into how features in the LIB table are related. 

The nature of this relationship is left open to you -- it may be correlation, or mutual information, or something less well defined. 

In doing so, consider the following visualization types:

- Hierarchical cluster diagrams
- Heatmaps
- Scatter plots
- KDE plots
- Dispersion plots
- t-SNE plots
- etc.

## Riff 1 (5)


![joyscatter.png](attachment:fd3f12fa-a83f-4281-8d03-4032e0ec6dbb.png)
![angerscatter.png](attachment:597142d0-f3d9-4d0e-9a3e-ee2e47834b4b.png)


When sentiment was plotted against the average song length per album (in characters), there was a very distinct trend: albums with longer average song lengths tended to be less positive. This shows most clearly in the anger and joy sentiments. As song length increases, anger increases and joy decreases.

## Riff 2 (5)

![drakesentiment.png](attachment:d2a22a42-c8b6-48b8-b3c5-9028324a6bb0.png)


I wanted to investigate how an individual artist's sentiment could change over time. I chose Drake because he was the most prominent artist in the dataset. It is very clear that Drake's music has gotten more negative over time. The shift began around 2020, when his work goes from fluctuating to consistently getting more negative. 2020 is the year the COVID-19 pandemic began, which could have been an influence on Drake's work.

## Riff 3 (5)

![Heatmap.png](attachment:d1fb3a42-1683-4f58-84ea-f24c920a413c.png)

This heatmap shows the relative importance of different keywords across artists' lyrics, based on normalized TF-IDF scores. What stands out to me is how some words are strongly associated with a single artist, while being barely used by others. For example, Katy Perry clearly dominates the use of "party", while Lady Gaga and Lana Del Rey stand out with their emphasis on the word "cry."

# Interpretation (4)

Describe something interesting about your corpus that you discovered during the process of completing this assignment.

At a minimum, use 250 words, but you may use more. You may also add images if you'd like.

What I found most interesting during this project was discovering how clearly artists differ from each other in ways that can be detected just by analyzing their lyrics. This became immediately apparent when I ran the first PCA on albums and colored them by artist. The results showed distinct zones for each artist. While some overlap existed, the albums of each individual artist were overwhelmingly clustered in their own region. It was fascinating how well the PCA captured these stylistic differences. When PCA was conducted in conjunction with LDA, it also showed a clear separation of Taylor Swift from other artists, with her topics dominating the first 2 components. 

Seeing the clear PCA results, I also experimented with hierarchical clustering in an attempt to group artists by genre. However, I was not successful in getting meaningful clusters. This may suggest that while individual artists have distinct lyrical styles, the vocabulary differences between broader genres aren’t strong or consistent enough to drive clear separations using clustering. 

Most of my deeper exploration involved sentiment analysis. One of the most surprising discoveries came when I plotted sentiment against the average song length per album. I wasn’t expecting any particular pattern, but a strong trend emerged: albums with longer songs tended to be less joyful and more angry. Reflecting on this, it makes sense. These albums might act as emotional outlets or “rants,” where artists use extended tracks to express frustration or sadness. I also found it fascinating to see how sentiment shifted across an artist’s career, potentially mirroring personal growth, changes in creative direction, or responses to life events.

I would be very interested in seeing how these patterns evolve with a more robust dataset. Mine was produced by manually grabbing Spotify IDs for albums and using the Spotify and Genius APIs to scrape for the lyrics. This became quite time-consuming, and I was unable to get as much data as I would have liked. I also would have liked to explore "genre" as a feature, but I could not find any readily available source with that data, and it would have been quite tedious to gather genre info for over 1,000 songs manually. It would have been interesting to focus on artists from a specific genre to see how they compare and differ.