
# Recurring Themes in Kanye West’s Lyrics: Wealth, Religion, and Relationships

This notebook implements the full analysis pipeline for your **Data 620 final project**, using the Kaggle dataset:

> `convolutionalnn/kanye-west-lyrics-dataset`

It includes:

1. Data download and loading (via `kagglehub`)
2. Text preprocessing (tokenization, stopword removal, lemmatization)
3. Word frequency analysis
4. Sentiment analysis (VADER)
5. Word co-occurrence network (NetworkX)
6. Topic modeling (LDA)
7. Word clouds (overall and per album)
8. Theme keyword analysis (wealth, religion, relationships)



## 1. Setup and Library Installation

Run this cell **once** per session to import the libraries and download the NLTK resources.  
The `pip install` lines are commented out by default; uncomment them only if a library is missing from your environment.


In [None]:

# If needed, uncomment these lines to install dependencies in your environment.

# !pip install kagglehub
# !pip install pandas matplotlib seaborn nltk networkx gensim wordcloud

import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.sentiment import SentimentIntensityAnalyzer

import string
import networkx as nx

from wordcloud import WordCloud
from gensim import corpora
from gensim.models import LdaModel

import kagglehub

# Make plots a bit nicer
sns.set(style="whitegrid")
plt.rcParams["figure.figsize"] = (8, 5)

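# Download NLTK resources (safe to run multiple times)
nltk.download('punkt', quiet=True)
# Newer NLTK releases load the tokenizer data from 'punkt_tab' rather than 'punkt';
# requesting both keeps word_tokenize working across versions.
nltk.download('punkt_tab', quiet=True)
nltk.download('stopwords', quiet=True)
nltk.download('wordnet', quiet=True)
nltk.download('vader_lexicon', quiet=True)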



## 2. Download the Kanye West Lyrics Dataset (Kaggle)

We will use `kagglehub` to automatically download the latest version of the dataset from Kaggle.


In [None]:

# Download latest version of the dataset from Kaggle
dataset_dir = kagglehub.dataset_download("convolutionalnn/kanye-west-lyrics-dataset")
print("Dataset directory:", dataset_dir)

# Inspect files in the dataset directory (sorted so the selection below is deterministic)
files = sorted(os.listdir(dataset_dir))
print("Files found in dataset directory:", files)

# Automatically pick the first .txt or .csv file (in alphabetical order)
data_file = None
for f in files:
    if f.lower().endswith(".txt") or f.lower().endswith(".csv"):
        data_file = os.path.join(dataset_dir, f)
        break

if data_file is None:
    raise FileNotFoundError("No .txt or .csv file found in the dataset directory. Check Kaggle dataset contents.")

print("Using data file:", data_file)



## 3. Load and Inspect the Dataset

The dataset file may be in `.txt` or `.csv` format.  
We try to read it as a regular CSV first, and if that fails, we try tab-separated text.

We also normalize column names to lowercase and ensure that there is a `lyrics` column.


In [None]:

def load_dataset(file_path: str) -> pd.DataFrame:
    """Load the Kanye West lyrics dataset from a text/CSV file.

    - Tries standard CSV reading first.
    - If that fails, tries tab-separated format.
    - Normalizes column names to lowercase.
    - Ensures there is a 'lyrics' column (renaming common alternatives if needed).
    """
    print("Loading dataset from:", file_path)
    
    # Try reading as standard CSV
    try:
        df = pd.read_csv(file_path)
    except Exception as e:
        print("Standard CSV read failed:", e)
        print("Trying tab-separated format (sep='\\t')...")
        df = pd.read_csv(file_path, sep="\t", engine="python")
    
    print("Initial shape:", df.shape)
    print("Original columns:", df.columns.tolist())
    
    # Normalize column names to lowercase (and strip stray whitespace around them)
    df.columns = [str(col).strip().lower() for col in df.columns]
    print("Normalized columns:", df.columns.tolist())
    
    # Try to ensure there is a 'lyrics' column
    if "lyrics" not in df.columns:
        for alt in ["lyric", "text", "content"]:
            if alt in df.columns:
                df.rename(columns={alt: "lyrics"}, inplace=True)
                break
    
    if "lyrics" not in df.columns:
        raise KeyError("Could not find a 'lyrics' column. Please check the dataset structure.")
    
    # Optional: standardize 'album' column name if needed
    if "album" not in df.columns:
        for alt in ["albums", "record", "release"]:
            if alt in df.columns:
                df.rename(columns={alt: "album"}, inplace=True)
                break
    
    # Drop rows with missing lyrics and remove duplicates
    df = df.dropna(subset=["lyrics"])
    df = df.drop_duplicates(subset=["lyrics"])
    
    print("Shape after cleaning lyrics:", df.shape)
    
    return df


df = load_dataset(data_file)

# Quick peek
df.head()



## 4. Text Preprocessing: Tokenization, Stopwords, Lemmatization

To prepare the lyrics for analysis, we:

- Convert text to lowercase  
- Remove punctuation  
- Tokenize into words  
- Remove English stopwords (e.g., 'the', 'and')  
- Lemmatize words to their base form (e.g., 'chains' → 'chain'; with the lemmatizer's default noun setting, verb forms such as 'running' are left unchanged)  

This results in a normalized `tokens` column we can use for all downstream analysis.
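
As a rough illustration of the steps above, the following sketch runs the same stages on one line of text using only the standard library. The tiny `DEMO_STOPWORDS` set and the `demo_clean` helper are hypothetical stand-ins, not NLTK's stopword list or tokenizer, and lemmatization is left out.

```python
import string

# Hypothetical, deliberately tiny stopword set for illustration only;
# the notebook itself uses NLTK's English stopword list.
DEMO_STOPWORDS = {"the", "and", "all", "for"}

def demo_clean(text):
    """Lowercase, strip punctuation, split on whitespace, drop stopwords and short tokens."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = text.split()
    return [t for t in tokens if t not in DEMO_STOPWORDS and len(t) > 2]

demo_tokens_out = demo_clean("All the lights, and the MONEY!")
print(demo_tokens_out)
```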


In [None]:

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))

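# Translation table that deletes ASCII punctuation characters
PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def clean_text(text: str):
    # Lowercase and normalize curly apostrophes so contractions tokenize consistently
    text = str(text).lower().replace("’", "'")
    # Tokenize before stripping punctuation: word_tokenize splits "don't" into
    # "do" + "n't", whereas stripping first would create "dont", which the
    # stopword list does not catch
    tokens = [t.translate(PUNCT_TABLE) for t in word_tokenize(text)]
    # Remove stopwords and very short tokens (including emptied punctuation tokens)
    tokens = [t for t in tokens if t not in stop_words and len(t) > 2]
    # Lemmatize (WordNetLemmatizer treats every token as a noun by default)
    tokens = [lemmatizer.lemmatize(t) for t in tokens]
    return tokens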

# Apply to the lyrics column
df["tokens"] = df["lyrics"].apply(clean_text)

# Show a few examples
df[["lyrics", "tokens"]].head(3)



## 5. Word Frequency Analysis

As a first exploratory step, we examine the most frequently used words across Kanye's lyrics.

This helps us see which concepts are dominant in his vocabulary.
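
Raw counts can be dominated by a word repeated many times in a single chorus. One possible complement, sketched here on made-up token lists, is to also count in how many songs each word appears (its document frequency). The `songs` data and the names `total` and `doc_freq` are illustrative assumptions, not part of the dataset.

```python
from collections import Counter

# Made-up token lists standing in for three songs
songs = [
    ["money", "money", "money", "love"],
    ["love", "god"],
    ["god", "love", "money"],
]

# Total occurrences across all songs
total = Counter(t for s in songs for t in s)
# Number of songs in which each word appears at least once
doc_freq = Counter(t for s in songs for t in set(s))

print("total count of 'money':", total["money"])
print("songs containing 'money':", doc_freq["money"])
```

Here "money" has the highest total count, yet "love" appears in more songs, which is the kind of difference a single ranking would hide.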


In [None]:

from collections import Counter

def analyze_word_frequencies(df: pd.DataFrame, top_n: int = 30):
    all_tokens = [token for sublist in df["tokens"] for token in sublist]
    freq = Counter(all_tokens)
    common = freq.most_common(top_n)
    
    print(f"Top {top_n} most common words:")
    for word, count in common:
        print(f"{word}: {count}")
    
    words, counts = zip(*common)
    plt.figure(figsize=(10, 6))
    sns.barplot(x=list(counts), y=list(words))
    plt.title(f"Top {top_n} Words in Kanye West Lyrics")
    plt.xlabel("Count")
    plt.ylabel("Word")
    plt.tight_layout()
    plt.show()

analyze_word_frequencies(df, top_n=30)



## 6. Sentiment Analysis (VADER)

We use NLTK's VADER sentiment analyzer to measure:

- Negative (`neg`)  
- Neutral (`neu`)  
- Positive (`pos`)  
- Overall compound score (`compound`)  

for each lyrics entry. This gives us a sense of the emotional tone of Kanye's songs.
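
One caveat when scoring whole songs: to my understanding of the reference VADER implementation, the compound score is the summed word valence `x` squashed as `x / sqrt(x^2 + alpha)` with `alpha = 15`, so long texts with many scored words tend to land near ±1. The sketch below reproduces only that normalization step; `normalize_compound` is a hypothetical name and the input sums are made up.

```python
import math

def normalize_compound(valence_sum, alpha=15.0):
    """Squash an unbounded valence sum into (-1, 1); alpha=15 is assumed from VADER."""
    return valence_sum / math.sqrt(valence_sum * valence_sum + alpha)

# Larger sums (as produced by longer texts) approach 1 quickly
for s in (2.0, 10.0, 50.0):
    print(s, round(normalize_compound(s), 3))
```

If the histogram of compound scores piles up at the extremes, scoring line by line and averaging is one option worth considering.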


In [None]:

def run_sentiment_analysis(df: pd.DataFrame) -> pd.DataFrame:
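    sia = SentimentIntensityAnalyzer()
    # Reset the index (rows were dropped during cleaning) and discard scores from an
    # earlier run so that re-executing this cell does not duplicate columns
    score_cols = ["neg", "neu", "pos", "compound"]
    df = df.reset_index(drop=True).drop(columns=score_cols, errors="ignore")
    # Build the score table from a plain list so its 0..n-1 index matches df row for row
    sentiment_df = pd.DataFrame(df["lyrics"].apply(sia.polarity_scores).tolist())
    df_sent = pd.concat([df, sentiment_df], axis=1)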
    
    print("Example sentiment scores:")
    display(df_sent[["lyrics", "neg", "neu", "pos", "compound"]].head(3))
    
    # Plot distribution of compound scores
    plt.figure(figsize=(8, 5))
    sns.histplot(df_sent["compound"], bins=50, kde=True)
    plt.title("Distribution of Compound Sentiment Scores")
    plt.xlabel("Compound Score")
    plt.ylabel("Number of Songs")
    plt.tight_layout()
    plt.show()
    
    # If album info exists, show sentiment by album
    if "album" in df_sent.columns:
        plt.figure(figsize=(12, 6))
        sns.boxplot(data=df_sent, x="album", y="compound")
        plt.xticks(rotation=45, ha="right")
        plt.title("Sentiment (Compound) by Album")
        plt.tight_layout()
        plt.show()
    
    return df_sent

df = run_sentiment_analysis(df)



## 7. Word Co-occurrence Network (Network Analysis)

To satisfy the **network analysis** component, we build a graph where:

- Nodes = words  
- Edges = words that appear together in the same song  
- Edge weight = how many songs the word pair co-occurs in  

We apply a threshold to filter out weak connections so the network is more interpretable.
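
The counting and thresholding described above can be sketched on a toy corpus with the standard library alone. The `songs` lists and the threshold of 2 are made up for illustration. Sorting each song's unique tokens before pairing is one way to make sure a pair always gets the same key regardless of word order.

```python
from collections import Counter
from itertools import combinations

# Made-up token lists standing in for three songs
songs = [
    ["money", "love", "god"],
    ["love", "money"],
    ["god", "love", "love"],
]

pair_counts = Counter()
for tokens in songs:
    # set() counts each word once per song; sorted() gives every pair a stable key
    for pair in combinations(sorted(set(tokens)), 2):
        pair_counts[pair] += 1

# Keep only pairs that co-occur in at least `threshold` songs
threshold = 2
edges = {pair: n for pair, n in pair_counts.items() if n >= threshold}
print(edges)
```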


In [None]:

from itertools import combinations

COOCCURRENCE_THRESHOLD = 8  # adjust if too dense or too sparse

def build_cooccurrence_network(df: pd.DataFrame, threshold: int = COOCCURRENCE_THRESHOLD) -> nx.Graph:
    cooccurrence = Counter()
    
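    for tokens in df["tokens"]:
        # Sort the unique tokens so each pair always has the same key;
        # otherwise (a, b) and (b, a) could be counted as different pairs
        unique_tokens = sorted(set(tokens))
        for pair in combinations(unique_tokens, 2):
            cooccurrence[pair] += 1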
    
    G = nx.Graph()
    for (w1, w2), weight in cooccurrence.items():
        if weight >= threshold:
            G.add_edge(w1, w2, weight=weight)
    
    print("Number of nodes:", G.number_of_nodes())
    print("Number of edges:", G.number_of_edges())
    return G

def visualize_cooccurrence_network(G: nx.Graph):
    if G.number_of_nodes() == 0:
        print("Graph is empty; nothing to visualize.")
        return
    
    plt.figure(figsize=(14, 12))
    pos = nx.spring_layout(G, k=0.55, seed=42)
    weights = [G[u][v]["weight"] for u, v in G.edges()]
    
    nx.draw_networkx_nodes(G, pos, node_size=60, node_color="red", alpha=0.8)
    nx.draw_networkx_edges(G, pos, width=np.array(weights) / 2.0, alpha=0.4)
    nx.draw_networkx_labels(G, pos, font_size=8)
    
    plt.title("Kanye West Lyrics – Word Co-occurrence Network")
    plt.axis("off")
    plt.tight_layout()
    plt.show()

G = build_cooccurrence_network(df, threshold=COOCCURRENCE_THRESHOLD)
visualize_cooccurrence_network(G)



## 8. Topic Modeling with LDA

We apply Latent Dirichlet Allocation (LDA) to discover **latent topics** in the lyrics.

This helps us see whether themes like **wealth**, **religion**, and **relationships** emerge naturally from the data.


In [None]:

NUM_TOPICS = 5

def run_lda_topic_modeling(df: pd.DataFrame, num_topics: int = NUM_TOPICS):
    dictionary = corpora.Dictionary(df["tokens"])
    corpus = [dictionary.doc2bow(tokens) for tokens in df["tokens"]]
    
    lda_model = LdaModel(
        corpus=corpus,
        id2word=dictionary,
        num_topics=num_topics,
        passes=10,
        random_state=42,
    )
    
    print("Topics discovered by LDA:\n")
    for i, topic in lda_model.print_topics():
        print(f"Topic {i}: {topic}\n")
    
    return lda_model, corpus, dictionary

lda_model, corpus, dictionary = run_lda_topic_modeling(df, num_topics=NUM_TOPICS)



## 9. Word Clouds (Overall and Per Album)

Word clouds provide an intuitive visual summary of the most frequent words.

- The **global** word cloud shows overall vocabulary.  
- **Per-album** word clouds highlight how themes shift across Kanye's discography.


In [None]:

def generate_global_wordcloud(df: pd.DataFrame):
    all_text = " ".join(df["lyrics"].tolist())
    wc = WordCloud(width=1200, height=800, background_color="white").generate(all_text)
    
    plt.figure(figsize=(12, 8))
    plt.imshow(wc, interpolation="bilinear")
    plt.axis("off")
    plt.title("Kanye West Lyrics – Global WordCloud")
    plt.tight_layout()
    plt.show()

def generate_wordclouds_by_album(df: pd.DataFrame):
    if "album" not in df.columns:
        print("No 'album' column found; skipping per-album word clouds.")
        return
    
    for album in df["album"].dropna().unique():
        subset = df[df["album"] == album]
        text = " ".join(subset["lyrics"].tolist())
        if not text.strip():
            continue
        
        wc = WordCloud(width=1000, height=600, background_color="black").generate(text)
        
        plt.figure(figsize=(12, 7))
        plt.imshow(wc, interpolation="bilinear")
        plt.axis("off")
        plt.title(f"WordCloud – {album}")
        plt.tight_layout()
        plt.show()

generate_global_wordcloud(df)
generate_wordclouds_by_album(df)



## 10. Theme Keyword Analysis: Wealth, Religion, Relationships

To directly test the hypothesis, we:

1. Define keyword lists for each theme.  
2. Count how many times theme-related words appear in each song.  
3. Aggregate counts by album to see which albums emphasize which themes.
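
Raw keyword counts grow with the amount of text, so longer songs and albums will tend to score higher on every theme. One possible adjustment, sketched here as an assumption rather than part of the original pipeline, is to express hits per 1,000 tokens. The helper name `theme_rate_per_1000` and the demo tokens are hypothetical.

```python
def theme_rate_per_1000(tokens, keywords):
    """Keyword hits per 1,000 tokens; returns 0.0 for an empty token list."""
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t in keywords)
    return 1000.0 * hits / len(tokens)

# 20 made-up tokens, 10 of which are wealth keywords
demo_tokens = ["money", "love", "cash", "night"] * 5
demo_rate = theme_rate_per_1000(demo_tokens, {"money", "cash"})
print(demo_rate)
```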


In [None]:

# Keywords are matched against lemmatized tokens, so they are listed in singular form.
THEMES = {
    "wealth": ["money", "cash", "gold", "rich", "bank", "chain", "ballin", "bread", "dollar"],
    "religion": ["god", "pray", "church", "heaven", "faith", "jesus", "bible", "saint"],
    "relationships": ["love", "heart", "baby", "girl", "boy", "relationship", "wife", "husband"],
}

def theme_count(tokens, keywords):
    return sum(1 for t in tokens if t in keywords)

def add_theme_counts(df: pd.DataFrame) -> pd.DataFrame:
    for theme_name, keywords in THEMES.items():
        col_name = f"{theme_name}_count"
        df[col_name] = df["tokens"].apply(lambda tokens: theme_count(tokens, keywords))
    return df

df = add_theme_counts(df)

df[[col for col in df.columns if col.endswith("_count")]].head()


In [None]:

def plot_theme_trends_by_album(df: pd.DataFrame):
    if "album" not in df.columns:
        print("No 'album' column found; skipping theme trends by album.")
        return
    
    theme_cols = [f"{t}_count" for t in THEMES.keys()]
    theme_summary = df.groupby("album")[theme_cols].sum()
    
    print("Theme counts by album:")
    display(theme_summary)
    
    theme_summary.plot(kind="bar", figsize=(12, 6))
    plt.title("Theme Frequency by Album (Wealth, Religion, Relationships)")
    plt.xlabel("Album")
    plt.ylabel("Keyword Count")
    plt.xticks(rotation=45, ha="right")
    plt.tight_layout()
    plt.show()

plot_theme_trends_by_album(df)



## 11. Save Processed Data (Optional)

We can save the processed DataFrame (with tokens, sentiment scores, and theme counts) for later use.
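
Note that a CSV file stores each token list as its string representation, so the lists need to be parsed again after reloading. A minimal sketch of that round trip, using a made-up token list:

```python
import ast

# How a token list looks once it has been written to a CSV cell
stored = str(["money", "love"])

# ast.literal_eval safely parses the string back into a Python list
restored = ast.literal_eval(stored)
print(restored)
```

When reading the file back with pandas, passing `converters={"tokens": ast.literal_eval}` to `pd.read_csv` is one way to apply the same parsing to the whole column.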


In [None]:

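# Pickle preserves the token lists as Python lists.
df.to_pickle("kanye_lyrics_processed.pkl")
# In the CSV, each token list is written as a string and must be parsed when reloaded.
df.to_csv("kanye_lyrics_processed.csv", index=False)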
print("Saved processed data to 'kanye_lyrics_processed.pkl' and 'kanye_lyrics_processed.csv'.")



## 12. Summary

In this notebook, we:

- Downloaded Kanye West's lyrics dataset from Kaggle
- Cleaned and preprocessed the lyrics
- Explored word frequencies
- Performed sentiment analysis
- Built and visualized a word co-occurrence network (network analysis)
- Applied LDA topic modeling
- Generated word clouds (overall and per album)
- Quantified theme frequencies for **wealth**, **religion**, and **relationships**

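Whether the hypothesis holds depends on the outputs above: recurring wealth, religion, and relationship keywords across albums in the theme counts, ideally echoed by the LDA topics and the co-occurrence network, would support the claim that these themes appear consistently across Kanye's albums, with some variation. Because the keyword lists were chosen in advance and raw counts scale with album length, the per-album comparison should be read as indicative rather than conclusive.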
