Applying Shannon entropy from information theory to 40,949 YouTube trending video titles to test whether lexical diversity in a title is a statistically meaningful predictor of audience engagement.
Content optimization on platforms like YouTube is dominated by heuristics: "use power words," "keep titles under 60 characters," "front-load keywords." These rules are based on practitioner intuition and A/B test folklore, rarely on a formal quantitative framework.
This project takes a different approach. Shannon entropy, a measure developed by Claude Shannon in 1948 to quantify information density in communication systems, can be applied directly to the word distribution of a title. The question becomes: does the information richness of a title, measured rigorously, correlate with how many people watch the video?
The hypothesis draws from the psychological concept of optimal arousal (Berlyne, 1960): humans are most engaged by stimuli that are neither too predictable nor too complex. In information-theoretic terms, this predicts an inverted-U relationship between entropy and engagement. The bucket-level average view counts follow that shape, though the median view counts do not.
| Property | Value |
|---|---|
| Source | YouTube Trending Videos (Kaggle) |
| Region | United States |
| Raw rows | 40,949 |
| Columns used | title, views, likes |
| Null rows dropped | 0 |
| Date range | Late 2017 to mid 2018 |
Each row represents a video on a specific trending day. The same video can appear multiple times as it re-enters the trending list, with a cumulative view count per appearance.
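Because view counts are cumulative, one way to collapse repeat appearances is to keep only the highest-view row per video. The sketch below is illustrative and is not part of the reported analysis, which uses all 40,949 rows. The function name `keep_latest_snapshot` is hypothetical, and keying on the title is an assumption; if the source file carries a stable video identifier, that would be a safer key.

```python
def keep_latest_snapshot(rows):
    """Collapse repeat trending appearances to one row per title.

    rows: iterable of (title, views, likes) tuples. View counts are
    cumulative, so the row with the highest view count is treated as
    the most recent snapshot of that video.
    """
    latest = {}
    for title, views, likes in rows:
        if title not in latest or views > latest[title][1]:
            latest[title] = (title, views, likes)
    return list(latest.values())


# Toy rows, not taken from the dataset
rows = [
    ("Video A", 100, 10),
    ("Video A", 250, 30),
    ("Video B", 80, 5),
]
deduped = keep_latest_snapshot(rows)
```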
Shannon entropy is computed per title as:
H(X) = -Σ p(x) * log₂(p(x))
where p(x) is the relative frequency of each unique word token in the title
(lowercased, whitespace-split). A title of n words that are all distinct reaches
the maximum for its length, log₂(n) bits; a one-word title, or a title that
repeats a single word, scores 0.
Implementation uses scipy.stats.entropy with base=2, giving scores in bits.
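As a minimal sketch of the scoring step, the function below computes the same quantity with the standard library only, so it runs without scipy. It is not the project's code: the name `title_entropy` is hypothetical, and it uses the equivalent form Σ p(x) · log₂(1/p(x)).

```python
import math
from collections import Counter


def title_entropy(title: str) -> float:
    """Shannon entropy, in bits, of the word distribution in a title."""
    tokens = title.lower().split()  # lowercased, whitespace-split
    if not tokens:
        return 0.0
    total = len(tokens)
    counts = Counter(tokens)
    # sum of p * log2(1/p) equals -sum of p * log2(p)
    return sum((c / total) * math.log2(total / c) for c in counts.values())


print(title_entropy("a b c d"))          # four distinct words
print(title_entropy("Wow wow WOW"))      # one word repeated
print(title_entropy("the cat the dog"))  # one word repeated once
```

Four distinct words give log₂(4) = 2 bits, a single repeated word gives 0, and "the cat the dog" gives 1.5 bits because "the" carries half of the probability mass.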
| Statistic | Value (bits) |
|---|---|
| Mean | 2.92 |
| Median | 3.00 |
| Std dev | 0.63 |
| Min | 0.00 |
| Max | 4.32 |
143 titles scored exactly 0 (single-word or fully repeated word titles). These are retained as valid data points.
Titles were classified into four equal-width entropy buckets using pd.cut:
| Bucket | Entropy Range | Count |
|---|---|---|
| Very Low | 0.00 – 1.08 | 619 |
| Low | 1.08 – 2.16 | 3,923 |
| High | 2.16 – 3.24 | 23,239 |
| Very High | 3.24 – 4.32 | 13,168 |
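The bucketing rule can be sketched without pandas as follows. This approximates what pd.cut does by default with an integer number of bins (equal-width, right-closed intervals over the observed range); the exact call used in the project is not shown in this README, and the helper name `equal_width_bucket` is hypothetical.

```python
import math

LABELS = ["Very Low", "Low", "High", "Very High"]


def equal_width_bucket(score, lo, hi, labels=LABELS):
    """Assign a score to one of len(labels) equal-width bins over [lo, hi].

    Bins are right-closed, so a score exactly on an edge falls in the
    lower bin. Scores at or below lo go to the first bin.
    """
    width = (hi - lo) / len(labels)
    if score <= lo:
        return labels[0]
    idx = math.ceil((score - lo) / width) - 1
    return labels[min(idx, len(labels) - 1)]


# Range taken from the summary statistics: min 0.00, max 4.32 bits
for s in (0.0, 1.0, 2.0, 3.0, 4.32):
    print(s, equal_width_bucket(s, 0.0, 4.32))
```

Equal-width bins keep the entropy ranges interpretable, at the cost of very uneven bucket sizes (619 titles in Very Low against 23,239 in High).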
Average and median view counts were compared across buckets. Median is included alongside mean to account for the heavy right-skew in view counts; viral outliers inflate bucket averages significantly.
| Bucket | Avg Views | Median Views | Avg Likes |
|---|---|---|---|
| Very Low | 1,513,906 | 842,207 | 67,709 |
| Low | 2,272,910 | 745,849 | 88,998 |
| High | 2,773,282 | 742,377 | 86,703 |
| Very High | 1,698,795 | 575,302 | 48,238 |
The High entropy bucket has the highest average view count at 2,773,282, exceeding Very High entropy titles by 63% and Very Low entropy titles by 83%. This is the inverted-U pattern predicted by optimal arousal theory: average views peak below maximum entropy, then decline. Average likes peak one bucket earlier, in Low (88,998), with High close behind (86,703).
The median view count tells a complementary story. While average views peak in the High bucket, median views are highest in Very Low (842,207), suggesting that low-entropy titles have fewer viral outliers but a stronger floor. Very High entropy titles have both the lowest median (575,302) and a sharply reduced average, indicating they underperform across the distribution, not just in the tail.
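The percentage gaps quoted for average views can be reproduced directly from the table. The short check below uses only the bucket averages reported above; the helper name `uplift_pct` is hypothetical.

```python
avg_views = {
    "Very Low": 1_513_906,
    "Low": 2_272_910,
    "High": 2_773_282,
    "Very High": 1_698_795,
}


def uplift_pct(a, b):
    """Percentage by which a exceeds b."""
    return (a / b - 1) * 100


high = avg_views["High"]
vs_very_high = uplift_pct(high, avg_views["Very High"])
vs_very_low = uplift_pct(high, avg_views["Very Low"])
print(f"High vs Very High: +{vs_very_high:.0f}%")  # +63%
print(f"High vs Very Low:  +{vs_very_low:.0f}%")   # +83%
```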
images/entropy_vs_views.png: bar chart of average views per entropy bucket
with an overall mean reference line (2,508,753 views). Only the High bucket
sits above the mean; Very Low, Low, and Very High all fall below it.
images/entropy_scatter.png: scatter plot of entropy score vs. log₁₀(views)
for a random 5,000-video sample, with an OLS trend line overlaid. The scatter
is wide, confirming entropy is not a strong individual predictor, but the
aggregate pattern is consistent with the bucket-level result.
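For readers unfamiliar with the trend line, an ordinary least-squares fit of log₁₀(views) on entropy can be sketched with the standard library alone. This is an illustration on toy numbers, not the plotting code from the pipeline, and `ols_fit` is a hypothetical helper name.

```python
import math


def ols_fit(xs, ys):
    """Return (slope, intercept) of the least-squares line y = slope*x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Toy (entropy, views) pairs, not taken from the dataset
entropy_scores = [1.0, 2.0, 3.0, 4.0]
views = [1_000, 10_000, 100_000, 1_000_000]
log_views = [math.log10(v) for v in views]

slope, intercept = ols_fit(entropy_scores, log_views)
print(slope, intercept)
```

Taking the logarithm of views before fitting keeps a handful of viral videos from dominating the line. A straight line cannot capture an inverted-U, so the trend line is best read as a rough summary of direction only.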
The findings suggest a practical title-writing heuristic: aim for word variety without exhaustiveness. In this dataset, titles in the High entropy range (2.16 to 3.24 bits) averaged more views than both very short or repetitive titles and the longest, most word-dense ones.
This aligns with what platform guidelines recommend anecdotally, but here it is grounded in a measurable quantity. A content team could score title candidates with the entropy function before publishing and use it as a lightweight signal alongside CTR testing.
YouTube's recommendation and trending algorithms are known to incorporate engagement signals (CTR, watch time, likes) in a feedback loop. If higher-entropy titles attract more clicks, and more clicks drive more recommendations, the causal chain may run through the algorithm rather than directly through viewer psychology. Disentangling these effects requires pre-trend data that this dataset does not provide.
The same entropy-over-engagement framework can be applied to:
- Email subject lines: open rate as the engagement proxy
- Ad headline testing: CTR across entropy-bucketed copy variants
- News headline analysis: correlating entropy with article shares
- Product naming: entropy of product name tokens vs. search volume
- Social media copy: post reach or engagement rate by caption entropy
These results are directional, not causal. A skeptical analyst should account for:
- Title length correlation. Entropy is mechanically tied to word count. Longer titles produce higher entropy simply by having more unique tokens. A partial correlation controlling for word count is needed to isolate the diversity effect from the length effect.
- Trending algorithm selection bias. This dataset captures videos after they trended. The algorithm may already filter toward certain title styles, meaning the distribution reflects algorithmic curation, not organic audience preference.
- Genre and category confounds. Music, news, and comedy dominate trending and each has distinct titling conventions. The bucket pattern may be a genre composition artifact rather than an entropy effect.
- Repeat trending inflation. The same video appears across multiple trending days with an accumulated view count. High-view titles in a single entropy bucket (e.g. "Childish Gambino - This Is America" appearing 10 times) can artificially inflate that bucket's average.
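The partial correlation mentioned in the first caveat can be sketched as below. This check has not been run on the dataset; the code uses the standard first-order formula, r_xy·z = (r_xy − r_xz·r_yz) / √((1 − r_xz²)(1 − r_yz²)), and the function names and toy values are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def partial_corr(x, y, z):
    """Correlation of x and y after removing the linear effect of z."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz**2) * (1 - r_yz**2))


# Toy data constructed so that x and y are related only through z
word_count = [1, 2, 3, 4, 5]  # z
entropy = [2, 0, 3, 6, 4]     # x = z + noise
log_views = [2, 2, 1, 4, 6]   # y = z + independent noise

print(pearson(entropy, log_views))                   # clearly positive
print(partial_corr(entropy, log_views, word_count))  # approximately zero
```

In the toy data the raw correlation is sizeable but vanishes once word count is controlled for, which is exactly the scenario the title-length caveat warns about.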
Requirements: Python 3.8+, pandas, numpy, scipy, matplotlib
Install dependencies:
pip install pandas numpy scipy matplotlib
Run the full pipeline:
python run_pipeline.py
The pipeline loads the data, scores entropy, runs the engagement analysis, and saves both charts to images/. Total runtime is approximately 35–40 seconds.
EntropyVsEngagement/
├── run_pipeline.py                  # Single entry point
├── README.md
├── src/
│   ├── 01_data_setup.py             # Load, audit, and clean dataset
│   ├── 02_entropy_engine.py         # Compute entropy scores and buckets
│   └── 03_engagement_analysis.py    # Aggregate stats and charts
├── images/
│   ├── entropy_vs_views.png         # Bar chart: avg views by bucket
│   └── entropy_scatter.png          # Scatter: entropy vs. log(views)
└── data/
    └── USvideos.csv                 # Source data (download from Kaggle)
- Shannon, C.E. (1948). A Mathematical Theory of Communication. Bell System Technical Journal.
- Berlyne, D.E. (1960). Conflict, Arousal, and Curiosity. McGraw-Hill.
- YouTube Trending Dataset (datasnaek via Kaggle)