
# 📚 Zipf’s Law in Nicki Minaj’s Lyrics

Zipf's Law is an empirical regularity of natural language: in a large corpus, the frequency of a word is roughly inversely proportional to its frequency rank.  
That is, the second most common word appears about half as often as the most common one, the third about one-third as often, and so on.

In this notebook, we analyze Nicki Minaj's lyrics to check whether their word frequencies follow this pattern. This lets us explore language structure in an artistic domain such as music.
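
The inverse-rank relationship described above can be sketched in a few lines. This is only an illustration: the helper name `zipf_expected`, the `exponent` parameter, and the top frequency of 1200 are assumptions made for the example, not values taken from the lyrics data.

```python
def zipf_expected(top_freq, n_ranks, exponent=1.0):
    """Return the frequencies an ideal Zipf distribution predicts for ranks 1..n_ranks."""
    return [top_freq / rank ** exponent for rank in range(1, n_ranks + 1)]

# Hypothetical corpus whose most common word occurs 1200 times.
expected = zipf_expected(top_freq=1200, n_ranks=5)
for rank, freq in enumerate(expected, start=1):
    print(f"rank {rank}: ~{freq:.0f} occurrences")
```

With `exponent=1.0`, the rank-2 word gets half the top frequency and the rank-3 word one third, which is the pattern the rest of the notebook looks for.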


In [None]:

import pandas as pd
import re
from collections import Counter
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

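# Tokenizer models and stopword lists must be downloaded once per environment.
nltk.download('punkt')
nltk.download('punkt_tab')  # needed by newer NLTK releases
nltk.download('stopwords')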

sns.set(style="whitegrid")


In [None]:

from google.colab import files
uploaded = files.upload()  # Upload NickiMinaj (1).csv here

df = pd.read_csv(next(iter(uploaded)))
df.head()


In [None]:

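# Join all lyrics into one lowercase string, skipping missing rows.
text = " ".join(df['Lyrics'].dropna().astype(str)).lower()
# Keep letters and whitespace only; this also turns "don't" into "dont".
words = word_tokenize(re.sub(r'[^a-zA-Z\s]', '', text))
# Build the stopword set once; calling stopwords.words() per token is very slow.
stop_words = set(stopwords.words('english'))
filtered_words = [w for w in words if w not in stop_words and len(w) > 1]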

word_freq = Counter(filtered_words)
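
To see what the cleaning and counting step does, here is a minimal stand-alone sketch on a made-up sentence. It is not the notebook's pipeline: it splits on whitespace instead of calling `word_tokenize`, and `TOY_STOPWORDS` and `count_words` are hypothetical stand-ins for NLTK's English stopword list and the cell above.

```python
import re
from collections import Counter

# Hypothetical mini stopword list; the notebook uses NLTK's full English list.
TOY_STOPWORDS = {"i", "the", "a", "and", "to", "it", "on", "in"}

def count_words(text, stop_words):
    """Lowercase, keep letters only, split on whitespace, drop stopwords and 1-letter tokens."""
    cleaned = re.sub(r"[^a-z\s]", "", text.lower())
    tokens = cleaned.split()
    return Counter(t for t in tokens if t not in stop_words and len(t) > 1)

# Made-up sentence, not taken from the lyrics dataset.
sample = "The queen runs the city, and the queen runs it again!"
counts = count_words(sample, TOY_STOPWORDS)
print(counts.most_common(3))
```

Punctuation disappears before counting, and the stopwords never reach the `Counter`, so only content words such as "queen" and "runs" are ranked.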


In [None]:

top_words = word_freq.most_common(20)
words_, freqs = zip(*top_words)

plt.figure(figsize=(12,6))
sns.barplot(x=list(words_), y=list(freqs), palette="viridis")
plt.title("Top 20 Most Frequent Words in Nicki Minaj's Lyrics")
plt.xticks(rotation=45)
plt.ylabel("Frequency")
plt.show()


In [None]:

plt.figure(figsize=(10,6))
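# Most words are rare, so a log-scaled count axis keeps the long tail visible.
plt.hist(list(word_freq.values()), bins=50, color='skyblue', edgecolor='black', log=True)
plt.title("Histogram of Word Frequencies")
plt.xlabel("Frequency")
plt.ylabel("Number of Words (log scale)")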
plt.show()
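
The histogram bins how many words occur at each frequency. The same information can be tabulated exactly as a "frequency of frequencies" table, sketched below. The helper `frequency_spectrum` and the toy counts are assumptions for illustration; in the notebook, the function would be applied to `word_freq`.

```python
from collections import Counter

def frequency_spectrum(word_counts):
    """Map each frequency value to the number of distinct words that occur that often."""
    return Counter(word_counts.values())

# Hypothetical toy counts standing in for the notebook's word_freq.
toy_counts = Counter({"queen": 5, "money": 3, "crown": 1, "throne": 1, "pink": 1})
spectrum = frequency_spectrum(toy_counts)
once = spectrum[1]
print(f"{once} of {len(toy_counts)} words occur exactly once")
```

In a Zipf-like vocabulary, words that occur only once typically make up a large share of the distinct words, which is why the leftmost histogram bar dominates.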


In [None]:

sorted_freqs = sorted(word_freq.values(), reverse=True)
ranks = range(1, len(sorted_freqs)+1)

plt.figure(figsize=(10,6))
plt.plot(ranks, sorted_freqs)
plt.xscale('log')
plt.yscale('log')
plt.title("Zipf’s Law: Rank vs Frequency (Log-Log)")
plt.xlabel("Rank (log scale)")
plt.ylabel("Frequency (log scale)")
plt.grid(True)
plt.show()


In [None]:

log_ranks = np.log10(ranks)
log_freqs = np.log10(sorted_freqs)
slope, intercept = np.polyfit(log_ranks, log_freqs, 1)

plt.figure(figsize=(10,6))
plt.plot(log_ranks, log_freqs, label='Data')
plt.plot(log_ranks, slope*log_ranks + intercept, label=f'Fit: y={slope:.2f}x + {intercept:.2f}', color='red')
plt.title("Log-Log Plot with Regression Line")
plt.xlabel("log10(Rank)")
plt.ylabel("log10(Frequency)")
plt.legend()
plt.grid(True)
plt.show()
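
An unweighted fit over every rank can be pulled by the long tail, where many words share the same low count. One way to check how sensitive the slope is would be to refit on the top ranks only and report the goodness of fit. The sketch below uses only the standard library so it runs on its own. The function `fit_log_log`, its `max_rank` parameter, and the synthetic data are assumptions for illustration, not part of the original analysis.

```python
import math

def fit_log_log(freqs, max_rank=None):
    """Least-squares fit of log10(freq) against log10(rank).

    Returns (slope, intercept, r_squared), using only the top max_rank
    frequencies when max_rank is given.
    """
    ordered = sorted(freqs, reverse=True)[:max_rank]
    if len(ordered) < 2:
        raise ValueError("need at least two frequencies to fit a line")
    xs = [math.log10(r) for r in range(1, len(ordered) + 1)]
    ys = [math.log10(f) for f in ordered]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    syy = sum((y - mean_y) ** 2 for y in ys)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # For a simple linear fit, R^2 equals the squared correlation coefficient.
    r_squared = (sxy * sxy) / (sxx * syy) if syy else 1.0
    return slope, intercept, r_squared

# Synthetic, ideal Zipf frequencies (not the lyrics data) as a sanity check.
ideal = [1000 / rank for rank in range(1, 201)]
slope, intercept, r2 = fit_log_log(ideal, max_rank=100)
print(f"slope={slope:.3f}, intercept={intercept:.3f}, R^2={r2:.3f}")
```

In the notebook, the same function could be called as `fit_log_log(word_freq.values(), max_rank=500)`, where the cut-off of 500 is an arbitrary example value, and the result compared with the slope from `np.polyfit` over all ranks.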



## ✅ Conclusion

Our analysis of Nicki Minaj's lyrics shows that word frequencies, after stopword removal, follow Zipf’s Law quite well.  
The log-log plot is close to linear, and the regression fit has a slope close to -1, which is the signature of Zipf’s Law.

This suggests that natural-language patterns persist even in artistic expression such as rap lyrics.
