Entropy vs. Engagement: Does Title Unpredictability Predict YouTube Views?

Applying Shannon entropy from information theory to 40,949 YouTube trending video titles to test whether lexical diversity in a title is a statistically meaningful predictor of audience engagement.

Background and Motivation

Content optimization on platforms like YouTube is dominated by heuristics: "use power words," "keep titles under 60 characters," "front-load keywords." These rules are based on practitioner intuition and A/B test folklore, rarely on a formal quantitative framework.

This project takes a different approach. Shannon entropy, a measure developed by Claude Shannon in 1948 to quantify information density in communication systems, can be applied directly to the word distribution of a title. The question becomes: does the information richness of a title, measured rigorously, correlate with how many people watch the video?

The hypothesis draws from the psychological concept of optimal arousal (Berlyne, 1960): humans are most engaged by stimuli that are neither too predictable nor too complex. In information-theoretic terms, this predicts an inverted-U relationship between entropy and engagement. Average view counts across entropy buckets follow that pattern, although median view counts do not.

Dataset

Property	Value
Source	YouTube Trending Videos (Kaggle)
Region	United States
Raw rows	40,949
Columns used	`title`, `views`, `likes`
Null rows dropped	0
Date range	Late 2017 to mid 2018

Each row represents a video on a specific trending day. The same video can appear multiple times as it re-enters the trending list, with a cumulative view count per appearance.

Methodology

1. Entropy Scoring

Shannon entropy is computed per title as:

H(X) = -Σ p(x) * log₂(p(x))

Where p(x) is the relative frequency of each unique word token in the title (lowercased, whitespace-split). A title of n words in which every word is unique scores the maximum for its length, log₂(n) bits; a one-word title or a title that repeats a single word scores 0.

Implementation uses scipy.stats.entropy with base=2, giving scores in bits.
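
A minimal sketch of the scoring step, assuming the tokenization described above (lowercase, whitespace split). The function name title_entropy is hypothetical and not taken from the project source; scipy.stats.entropy normalizes the raw word counts into probabilities before computing H.

```python
from collections import Counter

from scipy.stats import entropy


def title_entropy(title: str) -> float:
    """Shannon entropy, in bits, of the word distribution in a title."""
    tokens = title.lower().split()
    if not tokens:
        # An empty title carries no information; treat it as 0 bits.
        return 0.0
    counts = list(Counter(tokens).values())
    # scipy rescales the counts so they sum to 1, then applies -sum(p * log2(p)).
    return float(entropy(counts, base=2))


if __name__ == "__main__":
    for t in ["Trailer", "go go go", "a b c d", "new new official official"]:
        print(f"{t!r}: {title_entropy(t):.2f} bits")
```

Four distinct words give log₂(4) = 2 bits, while a single repeated word gives 0, matching the bounds stated above.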

2. Entropy Distribution

Statistic	Value (bits)
Mean	2.92
Median	3.00
Std dev	0.63
Min	0.00
Max	4.32

143 titles scored exactly 0 (single-word or fully repeated word titles). These are retained as valid data points.

3. Bucket Classification

Titles were classified into four equal-width entropy buckets using pd.cut:

Bucket	Entropy Range	Count
Very Low	0.00 – 1.08	619
Low	1.08 – 2.16	3,923
High	2.16 – 3.24	23,239
Very High	3.24 – 4.32	13,168
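
The bucketing can be sketched as follows. This assumes pd.cut is called with an integer bin count, which splits the observed score range into equal-width intervals; the exact arguments used in the project are not shown here, and the name bucket_entropy is hypothetical.

```python
import pandas as pd

BUCKET_LABELS = ["Very Low", "Low", "High", "Very High"]


def bucket_entropy(scores: pd.Series) -> pd.Series:
    """Assign each entropy score to one of four equal-width buckets."""
    # bins=4 divides [min, max] of the scores into four equal-width,
    # right-closed intervals; pandas widens the lowest edge slightly
    # so the minimum score is included.
    return pd.cut(scores, bins=4, labels=BUCKET_LABELS)


if __name__ == "__main__":
    demo = pd.Series([0.0, 1.0, 2.0, 3.0, 4.32])
    print(bucket_entropy(demo).tolist())
```

With scores spanning 0.00 to 4.32, each interval is 1.08 bits wide, which reproduces the ranges in the table.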

4. Engagement Comparison

Average and median view counts were compared across buckets. Median is included alongside mean to account for the heavy right-skew in view counts; viral outliers inflate bucket averages significantly.
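
A sketch of the per-bucket comparison using pandas named aggregation. The column name bucket and the function name engagement_by_bucket are assumptions for illustration; views and likes are the dataset columns listed earlier.

```python
import pandas as pd


def engagement_by_bucket(df: pd.DataFrame) -> pd.DataFrame:
    """Mean and median views, mean likes, and row count per entropy bucket."""
    return df.groupby("bucket", observed=True).agg(
        avg_views=("views", "mean"),
        median_views=("views", "median"),
        avg_likes=("likes", "mean"),
        videos=("views", "size"),
    )


if __name__ == "__main__":
    # Toy data: one outlier in bucket A pulls the mean far above the median.
    toy = pd.DataFrame(
        {
            "bucket": ["A", "A", "A", "B"],
            "views": [1, 2, 100, 10],
            "likes": [1, 1, 1, 5],
        }
    )
    print(engagement_by_bucket(toy))
```

Reporting the two statistics side by side makes the effect of skew visible: in the toy data a single large value moves the mean of bucket A but leaves its median unchanged.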

Results

Engagement by Entropy Bucket

Bucket	Avg Views	Median Views	Avg Likes
Very Low	1,513,906	842,207	67,709
Low	2,272,910	745,849	88,998
High	2,773,282	742,377	86,703
Very High	1,698,795	575,302	48,238

Key Finding

The High entropy bucket has the highest average engagement at 2,773,282 views, outperforming Very High entropy titles by 63% and Very Low entropy titles by 83%. This is the inverted-U pattern predicted by the optimal arousal hypothesis: engagement peaks before maximum entropy, then declines.

The median view count tells a different story. It is highest in Very Low (842,207) and falls in every subsequent bucket, so the inverted-U appears in the mean but not in the median. This suggests that the High bucket's peak average is driven largely by viral outliers, while low-entropy titles have fewer such outliers but a stronger floor. Very High entropy titles have both the lowest median (575,302) and a sharply reduced average, indicating they underperform across the distribution, not just in the tail.
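
The percentages above can be checked directly from the results table. The snippet below is only a worked calculation on the published bucket figures; the helper name uplift_pct is hypothetical.

```python
# Bucket statistics copied from the "Engagement by Entropy Bucket" table.
avg_views = {
    "Very Low": 1_513_906,
    "Low": 2_272_910,
    "High": 2_773_282,
    "Very High": 1_698_795,
}
median_views = {
    "Very Low": 842_207,
    "Low": 745_849,
    "High": 742_377,
    "Very High": 575_302,
}


def uplift_pct(a: float, b: float) -> float:
    """Percentage by which a exceeds b."""
    return (a / b - 1.0) * 100.0


high = avg_views["High"]
for name in ("Very High", "Very Low"):
    print(f"High vs {name}: +{uplift_pct(high, avg_views[name]):.0f}%")

# Compare each bucket's median with the next one, in order of rising entropy.
medians = list(median_views.values())
median_declines = all(a > b for a, b in zip(medians, medians[1:]))
print("Median falls in every successive bucket:", median_declines)
```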

Visual Outputs

images/entropy_vs_views.png: bar chart of average views per entropy bucket with an overall mean reference line (2,508,753 views). The High bucket is the only one above the mean; Very Low, Low, and Very High all fall below it.

images/entropy_scatter.png: scatter plot of entropy score vs. log₁₀(views) for a random 5,000-video sample, with an OLS trend line overlaid. The scatter is wide, indicating that entropy is a weak predictor for any individual video; the pattern only emerges in aggregate, at the bucket level.
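
A sketch of how the trend line for the scatter plot could be fitted, assuming a DataFrame with hypothetical columns entropy and views. It uses numpy.polyfit for the least-squares line and leaves out the plotting itself; the function name and the random seed are illustrative choices, not the project's.

```python
import numpy as np
import pandas as pd


def log_view_trend(df: pd.DataFrame, n: int = 5000, seed: int = 0):
    """Fit log10(views) = slope * entropy + intercept on a random sample.

    Returns (slope, intercept, pearson_r).
    """
    sample = df.sample(n=min(n, len(df)), random_state=seed)
    # log10 is undefined for zero views, so drop such rows defensively.
    sample = sample[sample["views"] > 0]
    x = sample["entropy"].to_numpy(dtype=float)
    y = np.log10(sample["views"].to_numpy(dtype=float))
    slope, intercept = np.polyfit(x, y, deg=1)
    r = np.corrcoef(x, y)[0, 1]
    return float(slope), float(intercept), float(r)


if __name__ == "__main__":
    # Synthetic, noise-free data just to exercise the function.
    ent = np.linspace(0.0, 4.0, 50)
    demo = pd.DataFrame({"entropy": ent, "views": 10 ** (0.5 * ent + 3.0)})
    print(log_view_trend(demo))
```

A straight line cannot represent an inverted-U, so a low correlation coefficient here is compatible with the bucket-level pattern rather than evidence against it.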

Industry Relevance

Content Strategy

The findings suggest a practical title-writing heuristic: aim for word variety without exhaustiveness. Titles that pack in unique, descriptive words, without crossing the length where information value gets diluted, tend to outperform both minimalist titles and keyword-stuffed ones.

This aligns with what platform guidelines recommend anecdotally, but here it is grounded in a measurable quantity. A content team could score title candidates with the entropy function before publishing and use it as a lightweight signal alongside CTR testing.

Algorithmic Implications

YouTube's recommendation and trending algorithms are known to incorporate engagement signals (CTR, watch time, likes) in a feedback loop. If higher-entropy titles attract more clicks, and more clicks drive more recommendations, the causal chain may run through the algorithm rather than directly through viewer psychology. Disentangling these effects requires pre-trend data that this dataset does not provide.

Broader Applications

The same entropy-over-engagement framework can be applied to:

Email subject lines: open rate as the engagement proxy
Ad headline testing: CTR across entropy-bucketed copy variants
News headline analysis: correlating entropy with article shares
Product naming: entropy of product name tokens vs. search volume
Social media copy: post reach or engagement rate by caption entropy

Limitations and Confounds

These results are directional, not causal. A skeptical analyst should account for:

Title length correlation. Entropy is mechanically tied to word count. Longer titles produce higher entropy simply by having more unique tokens. A partial correlation controlling for word count is needed to isolate the diversity effect from the length effect.
Trending algorithm selection bias. This dataset captures videos after they trended. The algorithm may already filter toward certain title styles, meaning the distribution reflects algorithmic curation, not organic audience preference.
Genre and category confounds. Music, news, and comedy dominate trending and each has distinct titling conventions. The bucket pattern may be a genre composition artifact rather than an entropy effect.
Repeat trending inflation. The same video appears across multiple trending days with an accumulated view count. High-view titles in a single entropy bucket (e.g. "Childish Gambino - This Is America" appearing 10 times) can artificially inflate that bucket's average.
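
Two of these checks can be sketched in a few lines. The first function computes a linear partial correlation by regressing the control variable (for example, word count) out of both series and correlating the residuals. The second keeps one row per video, the one with the highest view count. Both function names are hypothetical, and the video_id column is an assumption about the Kaggle file, since it is not among the columns this analysis uses.

```python
import numpy as np
import pandas as pd


def partial_corr(x, y, control) -> float:
    """Correlation of x and y after removing the linear effect of control."""
    x, y, z = (np.asarray(v, dtype=float) for v in (x, y, control))
    design = np.column_stack([np.ones_like(z), z])  # intercept + control
    # Residuals of x and y after a least-squares fit on the control.
    res_x = x - design @ np.linalg.lstsq(design, x, rcond=None)[0]
    res_y = y - design @ np.linalg.lstsq(design, y, rcond=None)[0]
    return float(np.corrcoef(res_x, res_y)[0, 1])


def dedupe_latest(df: pd.DataFrame, id_col: str = "video_id") -> pd.DataFrame:
    """Keep one row per video: the appearance with the highest view count."""
    return df.sort_values("views").drop_duplicates(subset=id_col, keep="last")


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    words = rng.integers(1, 21, size=2000)         # stand-in for word count
    ent = words + rng.normal(size=2000)            # driven by length + noise
    log_views = 2 * words + rng.normal(size=2000)  # also driven by length
    print("raw r:    ", np.corrcoef(ent, log_views)[0, 1])
    print("partial r:", partial_corr(ent, log_views, words))
```

In the synthetic example both variables depend only on length, so the raw correlation is high while the partial correlation is close to zero. That is the situation the title-length confound describes.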

Reproduce This Analysis

Requirements: Python 3.8+, pandas, numpy, scipy, matplotlib

Install dependencies:

pip install pandas numpy scipy matplotlib

Run the full pipeline:

python run_pipeline.py

The pipeline loads the data, scores entropy, runs the engagement analysis, and saves both charts to images/. Total runtime is approximately 35–40 seconds.

Project Structure

EntropyVsEngagement/
├── run_pipeline.py                  # Single entry point
├── README.md
├── src/
│   ├── 01_data_setup.py             # Load, audit, and clean dataset
│   ├── 02_entropy_engine.py         # Compute entropy scores and buckets
│   └── 03_engagement_analysis.py   # Aggregate stats and charts
├── images/
│   ├── entropy_vs_views.png         # Bar chart: avg views by bucket
│   └── entropy_scatter.png          # Scatter: entropy vs. log(views)
└── data/
    └── USvideos.csv                 # Source data (download from Kaggle)

References

Shannon, C.E. (1948). A Mathematical Theory of Communication. Bell System Technical Journal.
Berlyne, D.E. (1960). Conflict, Arousal, and Curiosity. McGraw-Hill.
YouTube Trending Dataset (datasnaek via Kaggle)
