# <center>Hip-Hop Lyrics: An Exploratory Data Analysis</center>

<center> Levi Davis <br> ljd3frf@virginia.edu <br> DS 5001: Exploratory Text Analysis, Spring 2023</center>

## 1. Introduction

Hip-Hop, a genre of music and cultural movement that emerged in the late 1970s and early 1980s, has had a significant impact on popular culture since its inception. Born as a counterculture movement in basements and on the streets, Hip-Hop has grown into a global phenomenon that dominates the music industry. The genre has often faced criticism from certain conservative and older communities for its explicit language, crude insinuations, and depictions of immoral or illegal behavior. Even so, its popularity continues to soar, and it has become one of the most influential genres of music in the world. In this project, I delve into the lyrics of Hip-Hop music to better understand the cultural significance of this ever-evolving art form. Several questions guide my exploration: Which artists have the longest and most unique lyrics? Which artists are the most similar? How has lyrical sentiment changed over time? What are the most popular topics in hip-hop songs?

## 2. Data Source and Preprocessing

To gather data for this project, I used the [lyricsgenius](https://pypi.org/project/lyricsgenius/) package to download song information for hip-hop artists from the [Genius.com API](https://docs.genius.com/). There is a separate JSON file for each artist, and these are combined to make a complete corpus. I selected 116 artists spanning the 1980s to the present day, each with more than 50 and no more than 250 songs. The artists were selected based on popularity rankings from various websites, while making sure to include at least 20 artists from each decade except the 1980s, when hip-hop was just beginning to emerge as a genre. I manually created a dictionary of each artist's hometown and state, and created an artist year column by taking the median song year for each artist.
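
As a rough illustration of the combining step and the median-year column, here is a minimal sketch using only the standard library. The JSON keys (`name`, `songs`, `title`, `year`, `lyrics`) and the function names are assumptions made for this example; the files saved for this project may be laid out differently.

```python
import json
import statistics
from pathlib import Path


def flatten_artist(artist):
    """Turn one artist dict into a list of flat song records.

    The keys used here ("name", "songs", "title", "year", "lyrics") are
    assumed for illustration and may differ from the real files.
    """
    return [
        {
            "artist": artist.get("name"),
            "title": song.get("title"),
            "year": song.get("year"),  # None when the year is missing
            "lyrics": song.get("lyrics", ""),
        }
        for song in artist.get("songs", [])
    ]


def load_corpus(data_dir):
    """Combine every *.json file in data_dir into one list of song records."""
    records = []
    for path in sorted(Path(data_dir).glob("*.json")):
        with open(path, encoding="utf-8") as f:
            records.extend(flatten_artist(json.load(f)))
    return records


def median_year_by_artist(records):
    """Map each artist to the median year of their songs with a known year."""
    years = {}
    for rec in records:
        if rec["year"] is not None:
            years.setdefault(rec["artist"], []).append(rec["year"])
    return {artist: statistics.median(ys) for artist, ys in years.items()}
```

Songs without a year are skipped when computing the median, so an artist with no dated songs simply does not appear in the result.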

## 3. Data Model

Here is an overview of the main initial dataframes that are used in the analysis. 

- LIB: A dataframe containing file metadata: artist, file path, number of songs, Genius site link, hometown, and median year.
- FULL_CORPUS: A dataframe containing song lyrics, year, artist region, artist, album, and song.
- CORPUS: The filtered version of FULL_CORPUS used in the analysis.
- TOKENS: Song lyrics tokenized by word with the NLTK package, with part-of-speech tags.
- VOCAB: A dataframe containing word counts for every word in CORPUS and song-level TF-IDF metrics.

The OHCO level of this dataset is artist, album, song, implemented as a MultiIndex in CORPUS, DF, and TOKENS.
Additional dataframes for specific modeling techniques are included in the data files.
  
Song lyrics can be messy or incoherent, and it took much deliberation to decide how to process the corpus. I needed to keep enough words for each song to retain its meaning, but not so many that over half the content was ad-libs, nonsense words, and the like. After preprocessing the lyrics, I filtered the corpus to keep only songs with more than 1,000 characters (\~200 words) but fewer than 10,000 characters (most lyrics longer than this are not traditional songs). This resulted in 19,983 songs, about 89% of the original total. The resulting corpus has an average lyric length of 2,614 characters (\~500 words) and a standard deviation of 947 characters (\~190 words).
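
The length filter described above can be sketched as follows. The thresholds come from the text; the function names and the `lyrics` key are assumptions for this example.

```python
MIN_CHARS = 1_000   # songs must have more than this many characters
MAX_CHARS = 10_000  # and fewer than this many


def keep_song(lyrics, min_chars=MIN_CHARS, max_chars=MAX_CHARS):
    """Return True when the lyric length falls strictly between the bounds."""
    return min_chars < len(lyrics) < max_chars


def filter_corpus(songs):
    """Keep only song records whose "lyrics" field passes the length filter."""
    return [song for song in songs if keep_song(song["lyrics"])]
```

Both bounds are exclusive here, matching "more than 1,000" and "fewer than 10,000" characters.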

## 4. Exploration

Before diving into modeling techniques, I explore the dataset with some simple plots to better understand the corpus. First, I look at the distribution of artists by their state of origin.

![My Image](images/state_dist.png)

This plot delineates the four prominent hip-hop regions of the United States (the East Coast, West Coast, South, and Midwest) and shows their representation in the dataset. New York has the highest number of artists, as it was the birthplace of hip-hop and produced a majority of artists in the genre's early days. The West Coast established its own hip-hop culture, distinguished by 'gangsta rap', towards the end of the 1980s and subsequently emerged as the second most prominent region. In the 1990s, however, the ascendancy of southern hip-hop in cities like Atlanta (GA), New Orleans (LA), Miami (FL), Houston (TX), and Memphis (TN) marked a turning point, leading to a more diverse soundscape. Finally, the two major midwestern hip-hop cities, Chicago (IL) and Detroit (MI), bring their respective states near the top of the rankings.

Note: While being located in a major city was critical to gain recognition as a hip-hop artist in the early days, now this is much less important.

Below I select only the artists whose median release year was before or during the year 2000, and it is evident that East coast hip-hop dominated the scene, with the West coast in close pursuit.

![My Image](images/state_old_dist.png)

Next I plot the distribution of median artist release year.

![My Image](images/Artist_median_release_year.png)

### Term Frequency - Inverse Document Frequency (TF-IDF)

I calculated TF-IDF scores at the song level to determine the significance of words across the corpus of documents. By averaging the TF-IDF scores per song and artist, I obtained a final metric representing distinctness for each artist, or how unique an artist's lyrical choices are compared to their peers. 

![My Image](images/Agg_artist_tfidf.png)
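
To make the distinctness metric concrete, here is a minimal sketch of song-level TF-IDF averaged up to the artist level. TF-IDF has many variants; this one assumes relative term frequency and a base-2 logarithm for IDF, and it averages over the distinct terms in each song. These choices and the function names are assumptions for this example and may not match the exact formula used in the analysis.

```python
import math
from collections import Counter, defaultdict


def song_tfidf(docs):
    """Compute TF-IDF per song.

    docs is a list of token lists (one per song). Returns a list of
    {term: tfidf} dicts, using tf = count / song length and
    idf = log2(number of songs / number of songs containing the term).
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))
    weights = []
    for tokens in docs:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append(
            {t: (c / total) * math.log2(n_docs / doc_freq[t]) for t, c in counts.items()}
        )
    return weights


def artist_distinctness(songs):
    """Average TF-IDF per song, then per artist.

    songs is a list of (artist, tokens) pairs. Empty songs are ignored.
    """
    weights = song_tfidf([tokens for _, tokens in songs])
    per_artist = defaultdict(list)
    for (artist, _), w in zip(songs, weights):
        if w:
            per_artist[artist].append(sum(w.values()) / len(w))
    return {artist: sum(v) / len(v) for artist, v in per_artist.items()}
```

A term that appears in every song gets an IDF of zero, so artists who rely mostly on corpus-wide common words end up with lower scores.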

Next, I plot median artist year against average unique words per song and average song length for each artist. Both plots show a noticeable decline over time, matching the TF-IDF plot.

![My Image](images/Avg_unq_words.png)

![My Image](images/Avg_song_length.png)

It's important to note that the average unique words per song metric is influenced by song length. Therefore, I divide unique words per song by song length and calculate the average for each artist, and the resulting trendline becomes nearly flattened.

![My Image](images/Avg_unq_word_ratio.png)
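
The length-normalized metric can be sketched as below. It assumes song length is measured in word tokens, which the text does not state explicitly; the function names are hypothetical.

```python
from collections import defaultdict


def unique_word_ratio(tokens):
    """Number of distinct words divided by total words in one song."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def artist_mean_ratio(songs):
    """Average the per-song ratio for each artist.

    songs is a list of (artist, tokens) pairs.
    """
    ratios = defaultdict(list)
    for artist, tokens in songs:
        ratios[artist].append(unique_word_ratio(tokens))
    return {artist: sum(r) / len(r) for artist, r in ratios.items()}
```

Dividing by length removes the advantage that longer songs have in raw unique-word counts, which is why the trendline flattens.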

This suggests the downward trends across time observed in the previous two plots are artifacts of newer songs having shorter lyrics on average rather than newer artists using a less extensive vocabulary. However, the downward trend across time shown in the initial TF-IDF plot implies that older artists may have more unique lyrics compared to newer artists. As a final examination of this phenomenon, I calculate TF-IDF using artists as the bag level instead of individual songs as done previously. The results are very similar to the song-level TF-IDF; artist positions are extremely similar and the trendline slope appears the same.

![My Image](images/Artist_level_tfidf.png)

In conclusion, this exploration indicates that older hip-hop artists tend to use more words per song and thus have more unique words per song on average, but the proportion of unique words per song is about the same. Both levels of TF-IDF analysis show downward trends, which suggests there are differences in lyrical content between older and newer artists. Therefore, while the proportion of unique words per song is consistent across time, older artists used a higher percentage of meaningful, distinct words compared to newer artists.

### Principal Component Analysis (PCA)

Next, I use principal component analysis (PCA) to cluster artists based on their lyrical similarity. To reduce the dimensionality of the feature space, I selected the 8,000 most significant nouns, verbs, and adjectives based on the TF-IDF metric, and then kept the top 20 principal components.

![My Image](images/PCA_dendrogram.png)

The PCA results provide valuable insights into the clustering of hip-hop artists and align well with my existing knowledge of the hip-hop landscape. The artists were clustered into five main groups based on a color threshold of 2.5.

The first cluster, represented by the color brown, mostly contains lyrically proficient artists from the 2000s era. The bottom subgroup contains artists exclusively from the NYC area with a heavy representation of the 'Boom-Bap' subgenre, while the other contains slightly newer artists.

The second cluster, colored purple, comprises old-school East Coast artists, with all but four hailing from New York, and the most recent artist represented dating back to 2003.

The third cluster, marked in red, consists of many artists representing gangsta rap. It has two main subgroups: the first is a mix of artists from the late 2000s and early 2010s, and the second is mostly West Coast artists mixed with some older southern artists.

The fourth and fifth clusters, colored green and orange, respectively, consist of newer artists. Interestingly, the green cluster comprises slightly older and more lyrical artists. In contrast, the orange cluster consists of the youngest artists, who largely represent the hip-hop subgenres of trap, mumble rap, and drill.

Overall, the PCA provides an insightful view of the landscape of hip-hop artists and their lyrical similarities. The findings have implications for understanding the evolution of hip-hop over time and the impact of geographic regions on its development.

### Topic Modeling

I extract topics from the corpus with Latent Dirichlet Allocation (LDA) using the top 10,000 terms, 20 topics with 15 terms each, and 50 iterations. The LDA analysis revealed the prevalence of certain topics in Hip-Hop lyrics, with money, jewelry, music/partying, life, drugs/street life, and sex being the most prominent. It is interesting to note that these topics reflect the cultural and socio-economic contexts in which Hip-Hop emerged and evolved.

![My Image](images/pretty_topics_table.png)

To explore topics over time, I assign category names to the following top eight most distinct topics: Spiritual/Family - T01, Club - T02, Violence - T05, Jewelry/Bling - T07, Sex - T11, Music - T15, Party - T14, Money - T19. Below is a table showing the artists with the highest and lowest score for each selected topic.

![My Image](images/pretty_top_artist_topics.png)

Next, I create a plot to showcase the selected topics over time. To account for using a subset of the LDA topics and to plot all topics on one graph, I normalized each row to make all topic scores for each year sum to one. Additionally, I limited the plot to years after 1985, as earlier years resulted in sharp topic changes due to a scarcity of songs from 1980-1985.

![My Image](images/Topics_over_time.svg)
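
The row normalization and year cutoff used for this plot can be sketched as follows. The input layout (a dict of per-year topic scores) and the function name are assumptions for this example.

```python
def normalize_topics_by_year(topic_by_year, min_year=1985):
    """Rescale each year's topic scores so they sum to one.

    topic_by_year maps year -> {topic: score}. Only years strictly after
    min_year are kept, and years whose scores sum to zero are skipped.
    """
    normalized = {}
    for year, scores in sorted(topic_by_year.items()):
        if year <= min_year:
            continue
        total = sum(scores.values())
        if total == 0:
            continue
        normalized[year] = {topic: s / total for topic, s in scores.items()}
    return normalized
```

Because only eight of the 20 topics are plotted, the normalized values are shares among the selected topics, not shares of all lyrical content.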

Although a large proportion of early hip-hop song lyrics were about music, the popularity of the topic has continuously decreased over time. Violence rose throughout the 90s and was the leading topic in the 2000s, in step with the rise of gangsta rap. The money topic began rising in the 2000s, and the jewelry/bling topic quickly rose in popularity throughout the 2010s. Sex has been a steady topic throughout the years but has peaked in the last three years, possibly due in part to the rise of sex-positive female rappers such as Megan Thee Stallion (#1 in the sex topic), Lizzo, and Cardi B.

### Word Embeddings 

I use the word2vec algorithm (window=2, vector_size=256, min_count=10) to generate a vectorized representation of all the nouns and verbs found in the corpus of lyrics, which allows us to explore the semantic similarity between words in the corpus. Word2vec produces a high-dimensional vector space, so I use t-Distributed Stochastic Neighbor Embedding (t-SNE) (learning rate=200, perplexity=20, n_components=2, n_iter=1000) to reduce the space to two dimensions. I make an interactive Plotly scatter plot to visually explore prominent word clusters related to hip-hop culture.

One cluster that stands out is drug-related words which are specifically grouped according to the type of drug. Alcohol and cough syrup-related words are located in the top left, while pill-form drugs are clustered in the bottom left. Smoking drugs, such as marijuana and tobacco, are located on the right side of the plot.

![My Image](images/ETA_WE_DRUGS1000.png)

Another cluster I identify is related to jewelry and clothing. Words associated with jewelry and bling form a subgroup near the top left, while designer brands and clothing items are located on the bottom right. An interesting finding is that the term 'baguette', which is slang for jewelry in hip-hop culture, appears next to 'necklace' and 'chain'.

![My Image](images/ETA_WE_BLING.png)

Finally, I show a cluster of words related to vehicles. High-end luxury cars like Bugatti, Lamborghini, and Porsche are grouped closely together in the top left of the center cluster, while more generic car terms are located below.

![My Image](images/ETA_WE_VEHICLES.png)

### Sentiment Analysis

I perform sentiment analysis using the NRC emotion lexicon to discover which artists have the most extreme sentiment scores and to explore changes in sentiment over time. To calculate artist sentiment scores, I first use TF-IDF to weight the emotion lexicon word scores and then average these scores at the artist level.

![My Image](images/pretty_sents_table.png)
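
The weighting scheme described above can be sketched roughly as follows. The tiny lexicon is made up for illustration and does not reproduce actual NRC entries; the function names and the assumption that each song arrives as a `{term: tfidf}` dict are also hypothetical.

```python
from collections import Counter, defaultdict

# Toy lexicon for illustration only; these are NOT real NRC entries.
TOY_LEXICON = {
    "fight": {"anger": 1, "fear": 1},
    "party": {"joy": 1},
}


def song_emotions(tfidf, lexicon):
    """Sum lexicon emotion flags for one song, weighted by each term's TF-IDF."""
    scores = defaultdict(float)
    for term, weight in tfidf.items():
        for emotion, flag in lexicon.get(term, {}).items():
            scores[emotion] += weight * flag
    return dict(scores)


def artist_emotions(songs, lexicon):
    """Average the per-song emotion scores for each artist.

    songs is a list of (artist, {term: tfidf}) pairs. A song with no
    lexicon hits still counts toward the artist's average as zero.
    """
    sums = defaultdict(lambda: defaultdict(float))
    n_songs = Counter()
    for artist, tfidf in songs:
        n_songs[artist] += 1
        for emotion, value in song_emotions(tfidf, lexicon).items():
            sums[artist][emotion] += value
    return {
        artist: {e: v / n_songs[artist] for e, v in sums[artist].items()}
        for artist in n_songs
    }
```

Grouping by album instead of artist would only require changing the first element of each pair.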

To analyze sentiment across time, I apply the same methods described above but instead group the sentiment scores by album, providing a more granular analysis by year compared to using artist median year. The results of plotting sentiment across time are meager; the fluctuations seem to be more stochastic than a function of time-related processes or events. However, regardless of year, fear and anger seem to be the most popular sentiments while surprise is the least popular. Additionally, the average sentiment score is always slightly negative, although this is not too surprising given the often gritty nature of Hip-Hop.

![My Image](images/Sent_over_time.svg)

Considering the results of sentiment analysis and topic modeling, we can conclude that the tone of hip-hop lyrics has not changed much over time yet the content of lyrics has changed somewhat significantly. 

## 5. Conclusion

This exploratory data analysis project has provided valuable insights into the world of hip-hop music lyrics. The analysis focused on the lyrics of 116 hip-hop artists from the 1980s to the present day, investigating factors such as the artists' state of origin and median year, lyrical sentiment, and most popular topics. The results suggest that older hip-hop artists tend to use more words per song, and thus have more unique words per song on average, but the proportion of unique words per song is about the same. Both levels of TF-IDF analysis show downward trends, which may suggest there are differences in lyrical content between older and newer artists. PCA provides insights into the clustering of hip-hop artists and LDA topic modeling generated 20 topics; the results of both align well with existing knowledge of the genre. Overall, this project sheds light on the historical evolution and current state of hip-hop music, a genre that has left an indelible mark on popular culture.