ike10/EntropyVsEngagement

Entropy vs. Engagement: Does Title Unpredictability Predict YouTube Views?

Applying Shannon entropy from information theory to 40,949 YouTube trending video titles to test whether lexical diversity in a title is a statistically meaningful predictor of audience engagement.


Background and Motivation

Content optimization on platforms like YouTube is dominated by heuristics: "use power words," "keep titles under 60 characters," "front-load keywords." These rules are based on practitioner intuition and A/B test folklore, rarely on a formal quantitative framework.

This project takes a different approach. Shannon entropy, a measure developed by Claude Shannon in 1948 to quantify information density in communication systems, can be applied directly to the word distribution of a title. The question becomes: does the information richness of a title, measured rigorously, correlate with how many people watch the video?

The hypothesis draws from the psychological concept of optimal arousal (Berlyne, 1960): humans are most engaged by stimuli that are neither too predictable nor too complex. In information-theoretic terms, this predicts an inverted-U relationship between entropy and engagement. The bucket-level average views reported in the Results section are consistent with that pattern.


Dataset

| Property | Value |
| --- | --- |
| Source | YouTube Trending Videos (Kaggle) |
| Region | United States |
| Raw rows | 40,949 |
| Columns used | title, views, likes |
| Null rows dropped | 0 |
| Date range | Late 2017 to mid 2018 |

Each row represents a video on a specific trending day. The same video can appear multiple times as it re-enters the trending list, with a cumulative view count per appearance.


Methodology

1. Entropy Scoring

Shannon entropy is computed per title as:

H(X) = -Σ p(x) * log₂(p(x))

Where p(x) is the relative frequency of each unique word token in the title (lowercased, whitespace-split). A title of n words that are all distinct reaches the maximum for its length, log₂(n) bits; a one-word title or a title that repeats a single word scores 0.

Implementation uses scipy.stats.entropy with base=2, giving scores in bits.
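As a rough illustration of the scoring step, the sketch below computes the same quantity with the standard library only. The function name `title_entropy` is hypothetical and not taken from the repository; the project itself is described as using scipy.stats.entropy with base=2, which should give the same value when passed the per-word counts.

```python
import math
from collections import Counter


def title_entropy(title: str) -> float:
    """Shannon entropy in bits of a title's word distribution (illustrative helper)."""
    tokens = title.lower().split()  # lowercase, whitespace-split, as described above
    if not tokens:
        return 0.0
    n = len(tokens)
    counts = Counter(tokens)
    # H = sum p * log2(1/p), algebraically equal to -sum p * log2(p)
    return sum((c / n) * math.log2(n / c) for c in counts.values())


print(title_entropy("a b c d"))   # four distinct words
print(title_entropy("go go go"))  # one word repeated
```

Under this definition a title with 20 distinct words scores log₂(20) ≈ 4.32 bits, which is consistent with the maximum reported in the distribution table.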

2. Entropy Distribution

| Statistic | Value (bits) |
| --- | --- |
| Mean | 2.92 |
| Median | 3.00 |
| Std dev | 0.63 |
| Min | 0.00 |
| Max | 4.32 |

143 titles scored exactly 0 (single-word or fully repeated word titles). These are retained as valid data points.

3. Bucket Classification

Titles were classified into four equal-width entropy buckets using pd.cut:

| Bucket | Entropy Range | Count |
| --- | --- | --- |
| Very Low | 0.00 – 1.08 | 619 |
| Low | 1.08 – 2.16 | 3,923 |
| High | 2.16 – 3.24 | 23,239 |
| Very High | 3.24 – 4.32 | 13,168 |
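The equal-width binning can be sketched without pandas as follows. This is an approximation of what pd.cut does with an integer bin count (equal-width, right-inclusive intervals, with the lowest value included), not the repository's code; the function name `entropy_bucket` and its default range of 0.00 to 4.32 are assumptions taken from the table above.

```python
LABELS = ["Very Low", "Low", "High", "Very High"]


def entropy_bucket(score, lo=0.0, hi=4.32, labels=LABELS):
    """Map an entropy score to one of len(labels) equal-width, right-inclusive bins."""
    if not lo <= score <= hi:
        raise ValueError(f"score {score} is outside [{lo}, {hi}]")
    width = (hi - lo) / len(labels)  # 1.08 bits per bucket for the defaults
    for i, label in enumerate(labels[:-1]):
        if score <= lo + (i + 1) * width:  # upper edge belongs to this bin
            return label
    return labels[-1]


for s in (0.0, 2.0, 3.0, 4.32):
    print(s, entropy_bucket(s))
```

Because the bins are equal in width rather than equal in count, most titles land in the two upper buckets, as the counts in the table show.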

4. Engagement Comparison

Average and median view counts were compared across buckets. Median is included alongside mean to account for the heavy right-skew in view counts; viral outliers inflate bucket averages significantly.
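To illustrate why both statistics are reported, the sketch below aggregates a handful of made-up rows by bucket. The view counts are hypothetical and chosen only to show how a single viral outlier moves the mean while leaving the median almost untouched.

```python
from collections import defaultdict
from statistics import mean, median

# Hypothetical (bucket, views) rows; the 25M entry plays the viral outlier.
rows = [
    ("Low", 400_000),
    ("Low", 600_000),
    ("Low", 25_000_000),
    ("High", 700_000),
    ("High", 900_000),
]

views_by_bucket = defaultdict(list)
for bucket, views in rows:
    views_by_bucket[bucket].append(views)

summary = {
    bucket: {"mean": mean(v), "median": median(v)}
    for bucket, v in views_by_bucket.items()
}

for bucket, stats in summary.items():
    print(bucket, stats)
```

In this toy data the "Low" bucket has the higher mean but the lower median, so the ranking of buckets depends on which statistic is used.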


Results

Engagement by Entropy Bucket

| Bucket | Avg Views | Median Views | Avg Likes |
| --- | --- | --- | --- |
| Very Low | 1,513,906 | 842,207 | 67,709 |
| Low | 2,272,910 | 745,849 | 88,998 |
| High | 2,773,282 | 742,377 | 86,703 |
| Very High | 1,698,795 | 575,302 | 48,238 |

Key Finding

The High entropy bucket has the highest average view count at 2,773,282, exceeding Very High entropy titles by 63% and Very Low entropy titles by 83%. This is the inverted-U pattern predicted by the optimal arousal hypothesis (Berlyne, 1960): average views peak before maximum entropy, then decline.
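The two percentages follow directly from the average-views column of the results table. A short check, using only the figures reported there (the helper name `pct_lift` is illustrative):

```python
avg_views = {
    "Very Low": 1_513_906,
    "Low": 2_272_910,
    "High": 2_773_282,
    "Very High": 1_698_795,
}


def pct_lift(a, b):
    """Percentage by which a exceeds b."""
    return (a / b - 1) * 100


lift_vs_very_high = pct_lift(avg_views["High"], avg_views["Very High"])
lift_vs_very_low = pct_lift(avg_views["High"], avg_views["Very Low"])

print(f"High vs Very High: {lift_vs_very_high:.0f}%")  # 63%
print(f"High vs Very Low:  {lift_vs_very_low:.0f}%")   # 83%
```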

The median view count tells a complementary story. While average views peak in the High bucket, median views decline steadily from Very Low (842,207) to Very High (575,302), suggesting that low-entropy titles have fewer viral outliers but a stronger floor. Very High entropy titles have both the lowest median and a sharply reduced average, indicating they underperform across the distribution, not just in the tail.

Visual Outputs

images/entropy_vs_views.png: bar chart of average views per entropy bucket with an overall mean reference line (2,508,753 views). The High bucket sits clearly above the mean; Very Low and Very High both fall below it.

images/entropy_scatter.png: scatter plot of entropy score vs. log₁₀(views) for a random 5,000-video sample, with an OLS trend line overlaid. The scatter is wide, indicating that entropy is not a strong predictor for any individual video, but the aggregate pattern is consistent with the bucket-level result.


Industry Relevance

Content Strategy

The findings suggest a practical title-writing heuristic: aim for word variety without exhaustiveness. Titles that pack in unique, descriptive words, without crossing the length where information value gets diluted, tend to outperform both minimalist titles and keyword-stuffed ones.

This aligns with what platform guidelines recommend anecdotally, but here it is grounded in a measurable quantity. A content team could score title candidates with the entropy function before publishing and use it as a lightweight signal alongside CTR testing.
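A pre-publication screen of this kind might look like the sketch below. The candidate titles, the helper names, and the choice to flag titles inside the High bucket range (2.16 to 3.24 bits, from the Bucket Classification table) are all illustrative assumptions rather than part of the project.

```python
import math
from collections import Counter

HIGH_BAND = (2.16, 3.24)  # "High" bucket range reported in Bucket Classification


def title_entropy(title):
    """Word-level Shannon entropy in bits (lowercased, whitespace-split)."""
    tokens = title.lower().split()
    n = len(tokens)
    return sum((c / n) * math.log2(n / c) for c in Counter(tokens).values())


def screen_titles(candidates, band=HIGH_BAND):
    """Return (title, entropy rounded to 2 dp, falls_in_band) for each candidate."""
    lo, hi = band
    results = []
    for title in candidates:
        h = title_entropy(title)
        results.append((title, round(h, 2), lo < h <= hi))
    return results


# Hypothetical candidates for a single video.
candidates = [
    "Cat",
    "My cat learns to open the fridge",
    "Watch my clever cat learn how to open our fridge door alone tonight",
]
results = screen_titles(candidates)
for title, h, in_band in results:
    print(f"{h:>5} {'keep' if in_band else 'skip'}  {title}")
```

Given the limitations listed later, a flag like this is best read as one weak signal among several, not as a pass/fail rule.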

Algorithmic Implications

YouTube's recommendation and trending algorithms are known to incorporate engagement signals (CTR, watch time, likes) in a feedback loop. If higher-entropy titles attract more clicks, and more clicks drive more recommendations, the causal chain may run through the algorithm rather than directly through viewer psychology. Disentangling these effects requires pre-trend data that this dataset does not provide.

Broader Applications

The same entropy-over-engagement framework can be applied to:

  • Email subject lines: open rate as the engagement proxy
  • Ad headline testing: CTR across entropy-bucketed copy variants
  • News headline analysis: correlating entropy with article shares
  • Product naming: entropy of product name tokens vs. search volume
  • Social media copy: post reach or engagement rate by caption entropy

Limitations and Confounds

These results are directional, not causal. A skeptical analyst should account for:

  1. Title length correlation. Entropy is mechanically tied to word count. Longer titles produce higher entropy simply by having more unique tokens. A partial correlation controlling for word count is needed to isolate the diversity effect from the length effect.

  2. Trending algorithm selection bias. This dataset captures videos after they trended. The algorithm may already filter toward certain title styles, meaning the distribution reflects algorithmic curation, not organic audience preference.

  3. Genre and category confounds. Music, news, and comedy dominate trending and each has distinct titling conventions. The bucket pattern may be a genre composition artifact rather than an entropy effect.

  4. Repeat trending inflation. The same video appears across multiple trending days with an accumulated view count. High-view titles in a single entropy bucket (e.g. "Childish Gambino - This Is America" appearing 10 times) can artificially inflate that bucket's average.
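The partial correlation mentioned in the first limitation can be sketched with the residual method: regress both entropy and log views on word count, then correlate what is left over. The data below is made up for illustration, and the function names are hypothetical; this is one possible approach under those assumptions, not an analysis the project has run.

```python
import math
from statistics import mean


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def residuals(y, x):
    """Residuals of a simple least-squares regression of y on x."""
    mx, my = mean(x), mean(y)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum(
        (a - mx) ** 2 for a in x
    )
    intercept = my - slope * mx
    return [b - (intercept + slope * a) for a, b in zip(x, y)]


def partial_corr(x, y, control):
    """Correlation of x and y after removing the linear effect of control."""
    return pearson(residuals(x, control), residuals(y, control))


# Hypothetical per-title values: word count, entropy (bits), log10(views).
word_count = [2, 4, 6, 8, 10, 12]
entropy = [1.0, 1.9, 2.6, 2.9, 3.3, 3.5]
log_views = [5.2, 5.9, 6.1, 6.4, 6.0, 5.7]

raw = pearson(entropy, log_views)
controlled = partial_corr(entropy, log_views, word_count)
print(f"raw r = {raw:.3f}, partial r given word count = {controlled:.3f}")
```

If the partial correlation on the real data shrinks toward zero relative to the raw correlation, that would suggest the bucket pattern is mostly a title-length effect.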


Reproduce This Analysis

Requirements: Python 3.8+, pandas, numpy, scipy, matplotlib

Install dependencies:

pip install pandas numpy scipy matplotlib

Run the full pipeline:

python run_pipeline.py

The pipeline loads the data, scores entropy, runs the engagement analysis, and saves both charts to images/. Total runtime is approximately 35–40 seconds.


Project Structure

EntropyVsEngagement/
├── run_pipeline.py                  # Single entry point
├── README.md
├── src/
│   ├── 01_data_setup.py             # Load, audit, and clean dataset
│   ├── 02_entropy_engine.py         # Compute entropy scores and buckets
│   └── 03_engagement_analysis.py    # Aggregate stats and charts
├── images/
│   ├── entropy_vs_views.png         # Bar chart: avg views by bucket
│   └── entropy_scatter.png          # Scatter: entropy vs. log(views)
└── data/
    └── USvideos.csv                 # Source data (download from Kaggle)

References

  • Shannon, C.E. (1948). A Mathematical Theory of Communication. Bell System Technical Journal.
  • Berlyne, D.E. (1960). Conflict, Arousal, and Curiosity. McGraw-Hill.
  • YouTube Trending Dataset (datasnaek via Kaggle)
