# Posts Context Evaluation

**Objective:** Determine if post-title context improves player attribution for comments.

From comments EDA, only 22.4% of comments explicitly mention player names. The remaining 77.6% may be attributable via their parent post's title.

**Kill Question:** What % of post titles contain player names?

| Threshold | Decision |
|-----------|----------|
| >50% | Post-context is valuable, continue evaluation |
| <30% | Not worth the complexity |
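
The thresholds above can be written as a small decision helper. This is a minimal sketch: the function name `kill_decision` is hypothetical, and the table does not say what happens between 30% and 50%, so that band is returned here as an explicit "undecided" outcome rather than guessed.

```python
def kill_decision(title_mention_pct: float) -> str:
    """Map a title mention rate (in percent) to the decision table above."""
    if title_mention_pct > 50:
        return "continue"  # post-context is valuable, continue evaluation
    if title_mention_pct < 30:
        return "stop"  # not worth the complexity
    # Assumption: the table leaves the 30-50% band unspecified, so flag it for review.
    return "undecided"


for pct in (60.0, 40.0, 20.0):
    print(f"{pct:.1f}% -> {kill_decision(pct)}")
```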

In [1]:
import json
import yaml
from pathlib import Path
from datetime import datetime

DATA_DIR = Path("data")
POSTS_PATH = DATA_DIR / "raw" / "r_nba_posts.jsonl"
PLAYERS_CONFIG = Path("config/players.yaml")

## Load Data

In [2]:
# Use a context manager so the file handle is closed after loading
with POSTS_PATH.open() as f:
    posts = [json.loads(line) for line in f]
print(f"Posts loaded: {len(posts):,}")

Posts loaded: 92,531


In [3]:
with open(PLAYERS_CONFIG) as f:
    player_config = yaml.safe_load(f)

PLAYER_PATTERNS = player_config["players"]
print(f"Players tracked: {len(PLAYER_PATTERNS)}")

Players tracked: 85


## Data Quality Check

In [4]:
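# Skip posts lacking a timestamp so a missing value cannot surface as 1970-01-01
timestamps = [p["created_utc"] for p in posts if p.get("created_utc") is not None]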
min_date = datetime.fromtimestamp(min(timestamps))
max_date = datetime.fromtimestamp(max(timestamps))
print(f"Date range: {min_date.date()} to {max_date.date()}")

print("\nField completeness:")
fields = [
    "id",
    "title",
    "link_flair_text",
    "author_flair_text",
    "score",
    "num_comments",
]
for field in fields:
    present = sum(1 for p in posts if p.get(field) is not None)
    print(f"  {field}: {present:,} ({present / len(posts) * 100:.1f}%)")

Date range: 2024-10-01 to 2025-06-29

Field completeness:
  id: 92,531 (100.0%)
  title: 92,531 (100.0%)
  link_flair_text: 15,609 (16.9%)
  author_flair_text: 42,071 (45.5%)
  score: 92,531 (100.0%)
  num_comments: 92,531 (100.0%)


## Kill Question: Player Mentions in Titles

If <30% of titles mention players, post-context adds complexity without sufficient value.

In [5]:
def mentions_player(text: str) -> tuple[bool, list[str]]:
    """Check if text mentions any tracked player."""
    if not text:
        return False, []
    text_lower = text.lower()
    found = []
    for player, aliases in PLAYER_PATTERNS.items():
        # Plain substring match; lowercase each alias so it compares against text_lower
        if any(alias.lower() in text_lower for alias in aliases):
            found.append(player)
    return len(found) > 0, found


titles_with_players = 0
for post in posts:
    has_mention, _ = mentions_player(post.get("title", ""))
    if has_mention:
        titles_with_players += 1

pct = titles_with_players / len(posts) * 100
print(
    f"Titles mentioning players: {titles_with_players:,} / {len(posts):,} ({pct:.1f}%)"
)

Titles mentioning players: 62,789 / 92,531 (67.9%)


**Result:** 67.9% of post titles mention at least one player.

This exceeds the 50% threshold. Post-context is potentially valuable — continue evaluation.

## Player Mention Comparison: Titles vs Comments

Rerun player detection on comments using the same comprehensive player config to enable direct comparison.

In [6]:
COMMENTS_PATH = DATA_DIR / "filtered" / "r_nba_cleaned.jsonl"

comment_count = 0
comments_with_players = 0

with open(COMMENTS_PATH) as f:
    for line in f:
        comment = json.loads(line)
        comment_count += 1
        has_mention, _ = mentions_player(comment.get("body", ""))
        if has_mention:
            comments_with_players += 1

comment_pct = comments_with_players / comment_count * 100
title_pct = titles_with_players / len(posts) * 100

print(
    f"Comments mentioning players: {comments_with_players:,} / {comment_count:,} ({comment_pct:.1f}%)"
)
print(
    f"Titles mentioning players:   {titles_with_players:,} / {len(posts):,} ({title_pct:.1f}%)"
)

Comments mentioning players: 3,754,033 / 6,891,163 (54.5%)
Titles mentioning players:   62,789 / 92,531 (67.9%)


## False Positive Audit

Test impact of word boundary matching for problematic aliases.
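
To see why substring matching can over-count, the sketch below compares the two strategies on a few made-up titles. The sample strings and the helper names `substring_hit` and `boundary_hit` are illustrative only and are not taken from the dataset.

```python
import re


def substring_hit(alias: str, text: str) -> bool:
    """Plain substring test: the alias may sit inside a longer word."""
    return alias in text.lower()


def boundary_hit(alias: str, text: str) -> bool:
    """Match the alias only when it appears as a whole word."""
    return re.search(r"\b" + re.escape(alias) + r"\b", text.lower()) is not None


samples = [
    ("curry", "Steph Curry drops 40"),          # genuine mention
    ("curry", "McCurry's photo of the arena"),  # alias buried inside another word
    ("paul", "Paulson joins the broadcast"),    # alias is a prefix of another word
]
for alias, text in samples:
    print(
        f"{alias!r:<8} substring={substring_hit(alias, text)!s:<5} "
        f"boundary={boundary_hit(alias, text)!s:<5} {text}"
    )
```

Word boundaries do not help when the alias is itself an ordinary word (for example a color), which is the case for aliases that have to be dropped entirely.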

In [7]:
import re

# Aliases that are too ambiguous even with word boundaries (remove entirely)
REMOVE_ENTIRELY = {
    "ball",  # "ball movement", "ball handler" — ubiquitous in NBA
    "green",  # color
    "brown",  # color
    "rose",  # common word
    "young",  # "young player", "young team"
    "wall",  # "hit a wall"
    "smart",  # "smart play"
    "bridges",  # common word
    "butler",  # common word
    "turner",  # common surname
    "gordon",  # common name
    "murray",  # common name
    "hart",  # common word
    "tre",  # common word fragment
}

# Aliases that need word boundary matching (from YAML + additional)
WORD_BOUNDARY_REQUIRED = set(player_config.get("short_aliases", [])) | {
    "curry",  # food references
    "booker",  # common surname
    "brooks",  # common surname
    "leonard",  # common name
    "mitchell",  # common name
    "edwards",  # common name
    "thomas",  # common name
    "johnson",  # common name
    "davis",  # common name
    "paul",  # common first name
    "james",  # common first name
    "george",  # common first name
}


def mentions_player_strict(text: str) -> tuple[bool, list[str]]:
    """Stricter player detection with word boundaries for problematic aliases."""
    if not text:
        return False, []
    text_lower = text.lower()
    found = []

    for player, aliases in PLAYER_PATTERNS.items():
        for alias in aliases:
            alias_lower = alias.lower()

            # Skip removed aliases
            if alias_lower in REMOVE_ENTIRELY:
                continue

            # Use word boundary for problematic aliases
            if alias_lower in WORD_BOUNDARY_REQUIRED:
                pattern = r"\b" + re.escape(alias_lower) + r"\b"
                if re.search(pattern, text_lower):
                    found.append(player)
                    break
            else:
                # Standard substring matching
                if alias_lower in text_lower:
                    found.append(player)
                    break

    return len(found) > 0, found
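
The strict matcher assembles its boundary patterns inside the inner loop, once per alias per text. For scans over millions of comments, one possible optimization is to compile a single pattern per player up front. The sketch below is an illustration under assumptions, not the notebook's implementation: `compile_player_patterns`, `find_players`, and the toy alias table are hypothetical, and results should be checked against `mentions_player_strict` before swapping it in.

```python
import re


def compile_player_patterns(patterns, remove, boundary):
    """Build one compiled regex per player from its alias list."""
    compiled = {}
    for player, aliases in patterns.items():
        parts = []
        for alias in aliases:
            a = alias.lower()
            if a in remove:
                continue  # alias dropped entirely
            escaped = re.escape(a)
            # Ambiguous aliases get word boundaries; the rest stay substring matches.
            parts.append(rf"\b{escaped}\b" if a in boundary else escaped)
        if parts:
            compiled[player] = re.compile("|".join(parts))
    return compiled


def find_players(text, compiled):
    """Return the players whose compiled pattern matches the text."""
    if not text:
        return []
    text_lower = text.lower()
    return [player for player, rx in compiled.items() if rx.search(text_lower)]


# Toy alias table for illustration only (not the real players.yaml).
toy_patterns = {
    "Stephen Curry": ["stephen curry", "curry"],
    "Lonzo Ball": ["lonzo ball", "ball"],
}
compiled = compile_player_patterns(toy_patterns, remove={"ball"}, boundary={"curry"})
print(find_players("Curry with great ball movement", compiled))
```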

In [8]:
titles_original = sum(1 for p in posts if mentions_player(p.get("title", ""))[0])
titles_strict = sum(1 for p in posts if mentions_player_strict(p.get("title", ""))[0])

print("Title player mentions:")
print(f"  Original: {titles_original:,} ({titles_original / len(posts) * 100:.1f}%)")
print(f"  Strict:   {titles_strict:,} ({titles_strict / len(posts) * 100:.1f}%)")
print(
    f"  Diff:     {titles_original - titles_strict:,} ({(titles_original - titles_strict) / len(posts) * 100:.1f}%)"
)

Title player mentions:
  Original: 62,789 (67.9%)
  Strict:   49,346 (53.3%)
  Diff:     13,443 (14.5%)


In [9]:
comments_original = 0
comments_strict = 0

with open(COMMENTS_PATH) as f:
    for line in f:
        body = json.loads(line).get("body", "")
        if mentions_player(body)[0]:
            comments_original += 1
        if mentions_player_strict(body)[0]:
            comments_strict += 1

print("\nComment player mentions:")
print(
    f"  Original: {comments_original:,} ({comments_original / comment_count * 100:.1f}%)"
)
print(f"  Strict:   {comments_strict:,} ({comments_strict / comment_count * 100:.1f}%)")
print(
    f"  Diff:     {comments_original - comments_strict:,} ({(comments_original - comments_strict) / comment_count * 100:.1f}%)"
)


Comment player mentions:
  Original: 3,754,033 (54.5%)
  Strict:   2,841,666 (41.2%)
  Diff:     912,367 (13.2%)


## Flair Distribution Analysis

Only 16.9% of posts have `link_flair_text`. Examine what's tagged and whether it's useful for filtering.

In [10]:
from collections import Counter

flair_counts = Counter(p.get("link_flair_text") for p in posts)

# Separate None from actual flairs
no_flair = flair_counts.pop(None, 0)
print(f"Posts without flair: {no_flair:,} ({no_flair / len(posts) * 100:.1f}%)\n")

print("Flair distribution:")
for flair, count in flair_counts.most_common(20):
    pct = count / len(posts) * 100
    print(f"  {flair}: {count:,} ({pct:.1f}%)")

Posts without flair: 76,922 (83.1%)

Flair distribution:
  Highlight: 11,544 (12.5%)
  Post Game Thread: 1,565 (1.7%)
  Game Thread: 1,420 (1.5%)
  Index Thread: 250 (0.3%)
  Discussion: 243 (0.3%)
  News: 225 (0.2%)
  All-Access: 206 (0.2%)
  Original Content: 101 (0.1%)
  Self-Promo and Fan Art Thread: 39 (0.0%)
  Misleading: 8 (0.0%)
  TRASH TALK THREAD: 3 (0.0%)
  Announcement: 2 (0.0%)
  AMA: 1 (0.0%)
  Poll: 1 (0.0%)
  Mod Post: 1 (0.0%)


In [11]:
# build post lookup for joining
posts_lookup = {p["id"]: p for p in posts}
print(f"Post lookup built: {len(posts_lookup):,} posts")

Post lookup built: 92,531 posts


In [12]:
# comment distribution by flair
flair_comment_counts = Counter()
join_failures = 0

with open(COMMENTS_PATH) as f:
    for line in f:
        comment = json.loads(line)
        link_id = comment.get("link_id", "")
        post_id = link_id.replace("t3_", "")

        post = posts_lookup.get(post_id)
        if post:
            flair = post.get("link_flair_text") or "No Flair"
            flair_comment_counts[flair] += 1
        else:
            join_failures += 1

total_joined = sum(flair_comment_counts.values())
print(f"Comments joined: {total_joined:,}")
print(
    f"Join failures: {join_failures:,} ({join_failures / comment_count * 100:.1f}%)\n"
)

print("Comments by post flair:")
for flair, count in flair_comment_counts.most_common(20):
    pct = count / total_joined * 100
    print(f"  {flair}: {count:,} ({pct:.1f}%)")

Comments joined: 6,883,715
Join failures: 7,448 (0.1%)

Comments by post flair:
  No Flair: 4,073,826 (59.2%)
  Game Thread: 1,546,398 (22.5%)
  Highlight: 835,279 (12.1%)
  Post Game Thread: 345,563 (5.0%)
  News: 39,326 (0.6%)
  All-Access: 10,608 (0.2%)
  Index Thread: 10,373 (0.2%)
  Discussion: 6,887 (0.1%)
  Announcement: 5,513 (0.1%)
  Original Content: 4,888 (0.1%)
  Misleading: 3,055 (0.0%)
  TRASH TALK THREAD: 1,583 (0.0%)
  Poll: 306 (0.0%)
  Self-Promo and Fan Art Thread: 104 (0.0%)
  Mod Post: 6 (0.0%)
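
The join above derives the post ID by stripping Reddit's `t3_` type prefix from `link_id`. `str.replace` removes that substring wherever it occurs, so a slightly stricter variant is sketched below using `str.removeprefix` (Python 3.9+). The helper name `post_id_from_link_id` and the sample IDs are hypothetical.

```python
def post_id_from_link_id(link_id):
    """Strip a leading 't3_' prefix; return None for missing values."""
    if not link_id:
        return None
    return link_id.removeprefix("t3_")


print(post_id_from_link_id("t3_1abcde"))  # prefix stripped
print(post_id_from_link_id("1abcde"))     # already bare, returned unchanged
print(post_id_from_link_id(""))           # missing link_id
```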


## Player Mention Rate by Flair

Determine which post types have highest signal for player attribution.

In [13]:
flair_mention_counts = Counter()  # comments with player mentions
flair_total_counts = Counter()  # all comments

with open(COMMENTS_PATH) as f:
    for line in f:
        comment = json.loads(line)
        link_id = comment.get("link_id", "")
        post_id = link_id.replace("t3_", "")

        post = posts_lookup.get(post_id)
        if not post:
            continue

        flair = post.get("link_flair_text") or "No Flair"
        flair_total_counts[flair] += 1

        has_mention, _ = mentions_player_strict(comment.get("body", ""))
        if has_mention:
            flair_mention_counts[flair] += 1

print("Player mention rate by flair:\n")
print(f"{'Flair':<25} {'Mentions':>12} {'Total':>12} {'Rate':>8}")
print("-" * 60)

for flair, total in flair_total_counts.most_common(10):
    mentions = flair_mention_counts[flair]
    rate = mentions / total * 100
    print(f"{flair:<25} {mentions:>12,} {total:>12,} {rate:>7.1f}%")

Player mention rate by flair:

Flair                         Mentions        Total     Rate
------------------------------------------------------------
No Flair                     1,794,315    4,073,826    44.0%
Game Thread                    548,199    1,546,398    35.5%
Highlight                      321,969      835,279    38.5%
Post Game Thread               143,271      345,563    41.5%
News                            15,290       39,326    38.9%
All-Access                       3,512       10,608    33.1%
Index Thread                     3,372       10,373    32.5%
Discussion                       3,226        6,887    46.8%
Announcement                     1,271        5,513    23.1%
Original Content                 2,323        4,888    47.5%
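
To put the per-flair rates in context, the sketch below recomputes each rate from the counts printed above and expresses it as a difference from the pooled rate across those ten flairs. The counts are copied from the output table; the variable names are illustrative.

```python
# (mentions, total) per flair, copied from the printed table
flair_table = {
    "No Flair": (1_794_315, 4_073_826),
    "Game Thread": (548_199, 1_546_398),
    "Highlight": (321_969, 835_279),
    "Post Game Thread": (143_271, 345_563),
    "News": (15_290, 39_326),
    "All-Access": (3_512, 10_608),
    "Index Thread": (3_372, 10_373),
    "Discussion": (3_226, 6_887),
    "Announcement": (1_271, 5_513),
    "Original Content": (2_323, 4_888),
}

pooled_mentions = sum(m for m, _ in flair_table.values())
pooled_total = sum(t for _, t in flair_table.values())
pooled_rate = pooled_mentions / pooled_total * 100

print(f"Pooled rate: {pooled_rate:.1f}%\n")
for flair, (m, t) in flair_table.items():
    rate = m / t * 100
    share = t / pooled_total * 100
    print(f"{flair:<18} {rate:5.1f}%  {rate - pooled_rate:+5.1f}pp  share={share:4.1f}%")
```

The flairs that sit furthest above the pooled rate (Discussion, Original Content) are also among the smallest by comment volume.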


## Combined Coverage Analysis

For comments without explicit player mentions, can post titles provide attribution?

In [14]:
# Track: comment has mention, title has mention, both, neither

comment_only = 0
title_only = 0
both = 0
neither = 0

with open(COMMENTS_PATH) as f:
    for line in f:
        comment = json.loads(line)
        link_id = comment.get("link_id", "")
        post_id = link_id.replace("t3_", "")

        post = posts_lookup.get(post_id)
        if not post:
            continue

        comment_mention, _ = mentions_player_strict(comment.get("body", ""))
        title_mention, _ = mentions_player_strict(post.get("title", ""))

        if comment_mention and title_mention:
            both += 1
        elif comment_mention:
            comment_only += 1
        elif title_mention:
            title_only += 1
        else:
            neither += 1

total = comment_only + title_only + both + neither

print("Combined coverage:\n")
print(f"  Comment + Title mention: {both:,} ({both / total * 100:.1f}%)")
print(
    f"  Comment only:            {comment_only:,} ({comment_only / total * 100:.1f}%)"
)
print(f"  Title only:              {title_only:,} ({title_only / total * 100:.1f}%)")
print(f"  Neither:                 {neither:,} ({neither / total * 100:.1f}%)")
print(
    f"\n  Total attributable:      {both + comment_only + title_only:,} ({(both + comment_only + title_only) / total * 100:.1f}%)"
)

Combined coverage:

  Comment + Title mention: 1,726,086 (25.1%)
  Comment only:            1,112,655 (16.2%)
  Title only:              2,120,662 (30.8%)
  Neither:                 1,924,312 (28.0%)

  Total attributable:      4,959,403 (72.0%)
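
One way to read these four buckets is to ask what share of the comments without an explicit mention sit under a title that does mention a player. The sketch below derives that from the counts printed above; the variable names are illustrative.

```python
# Bucket counts copied from the combined coverage output
buckets = {
    "both": 1_726_086,
    "comment_only": 1_112_655,
    "title_only": 2_120_662,
    "neither": 1_924_312,
}

n_joined = sum(buckets.values())
no_comment_mention = buckets["title_only"] + buckets["neither"]

# Share of joined comments with an explicit mention in the body
explicit_rate = (buckets["both"] + buckets["comment_only"]) / n_joined * 100
# Among comments with no explicit mention, share whose post title names a player
rescue_rate = buckets["title_only"] / no_comment_mention * 100
# Share attributable by either signal
combined_rate = (n_joined - buckets["neither"]) / n_joined * 100

print(f"Explicit mention in comment: {explicit_rate:.1f}%")
print(f"Rescuable via title:         {rescue_rate:.1f}% of comments without a mention")
print(f"Attributable by either:      {combined_rate:.1f}%")
```

The rescue figure is an upper bound on usefulness: it counts title matches, not whether the comment is actually about the player named in the title.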


## Summary & Recommendation

### Key Findings

| Metric | Value |
|--------|-------|
| Posts downloaded | 92,531 |
| Date range | 2024-10-01 to 2025-06-29 |
| Join success rate | 99.9% |

**Player Mention Rates (Strict Matching):**

| Source | Mentions | Rate |
|--------|----------|------|
| Comment body | 2,841,666 | 41.2% |
| Post title | 49,346 | 53.3% |

**Combined Coverage:**

| Category | Comments | % |
|----------|----------|---|
| Comment + Title | 1,726,086 | 25.1% |
| Comment only | 1,112,655 | 16.2% |
| Title only (would be "rescued") | 2,120,662 | 30.8% |
| Neither | 1,924,312 | 28.0% |

**Flair Analysis:**

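- 83.1% of posts have no flair, so flair is unreliable for filtering
- Game Threads hold 22.5% of comments but have only a 35.5% player mention rate
- Flair-based filtering provides minimal signal improvement (~6pp): the highest-rate flairs, Discussion (46.8%) and Original Content (47.5%), sit about 6pp above the 41.2% overall rate and together cover under 0.2% of joined comments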

### Decision: Do Not Implement Post-Context Attribution

**Rationale:**

1. **2.84M comments already mention a player explicitly**: the sample is large enough that no rescue is needed
2. **Title-only attribution is noisy**: a comment on a "LeBron" post may not be about LeBron at all
3. **Flair filtering is not worth it**: the marginal signal improvement doesn't justify the join complexity
4. **Budget fit**: 2.84M comments × ~$0.00005/comment ≈ $142, well within the $200 ceiling
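
The budget figure in item 4 is a simple product. The sketch below reproduces it and, for comparison, applies the same approximate per-comment cost to the larger pool that title-based rescue would produce. The $0.00005 rate and $200 ceiling come from the rationale above; the helper name `estimated_cost` is hypothetical, and the per-comment rate is itself only an approximation.

```python
COST_PER_COMMENT = 0.00005  # approximate USD per comment, from the rationale above
BUDGET_CEILING = 200.0  # USD


def estimated_cost(n_comments: int) -> float:
    """Estimated processing cost in USD for a given number of comments."""
    return n_comments * COST_PER_COMMENT


scenarios = {
    "comment-body mentions only": 2_841_666,
    "comment + title attribution": 4_959_403,
}
for label, n in scenarios.items():
    cost = estimated_cost(n)
    status = "within" if cost <= BUDGET_CEILING else "over"
    print(f"{label:<28} {n:>10,} comments  ~${cost:,.0f}  ({status} ${BUDGET_CEILING:,.0f} ceiling)")
```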

### Pipeline Implication

The `CommentPipeline` filters on **comment body player mentions only**. Posts data is not required.