# NBA Hate Tracker - EDA Exploration

**Goal:** Validate whether Reddit data contains a usable "hate signal" for NBA players.

**This notebook answers:**
1. Can VADER detect sentiment in NBA Reddit comments?
2. Where does it fail? (sarcasm, slang, context)
3. What is the go/no-go recommendation for V1?


In [1]:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

## Sample Comments

Manually curated to test edge cases before we get real data.

In [2]:
sample_comments = [
    # Clear sentiment
    {"text": "LeBron is the GOAT, no question", "expected": "positive"},
    {"text": "Westbrook is absolutely trash", "expected": "negative"},
    {"text": "Lakers play tomorrow at 7pm", "expected": "neutral"},
    # Sarcasm (VADER will likely fail)
    {"text": "Great defense from Harden as usual", "expected": "negative (sarcastic)"},
    {
        "text": "Wow Westbrook only 5 turnovers, he's really improving",
        "expected": "negative (sarcastic)",
    },
    # NBA slang (inverted sentiment)
    {"text": "That dunk was NASTY", "expected": "positive"},
    {"text": "Curry is absolutely disgusting from three", "expected": "positive"},
    {"text": "Another brick from Westbrook", "expected": "negative"},
    # Mixed/complex
    {"text": "LeBron played like garbage but hit the game winner", "expected": "mixed"},
    {"text": "I hate how good Jokic is", "expected": "positive (grudging respect)"},
]

In [3]:
def analyze_sample(comments):
    """Run VADER on samples and compare to expected."""
    results = []
    for item in comments:
        scores = analyzer.polarity_scores(item["text"])
        results.append(
            {
                "text": item["text"],
                "expected": item["expected"],
                "compound": scores["compound"],
                # VADER's documented cutoffs: >= 0.05 is positive, <= -0.05 is negative
                "vader_label": "positive"
                if scores["compound"] >= 0.05
                else "negative"
                if scores["compound"] <= -0.05
                else "neutral",
            }
        )
    return results


results = analyze_sample(sample_comments)
for r in results:
    match = "✓" if r["expected"].startswith(r["vader_label"]) else "✗"
    print(
        f"{match} [{r['compound']:+.2f}] {r['vader_label']:8} | expected: {r['expected']:20} | {r['text']}"
    )

✗ [-0.30] negative | expected: positive             | LeBron is the GOAT, no question
✗ [+0.00] neutral  | expected: negative             | Westbrook is absolutely trash
✗ [+0.34] positive | expected: neutral              | Lakers play tomorrow at 7pm
✗ [+0.68] positive | expected: negative (sarcastic) | Great defense from Harden as usual
✗ [+0.78] positive | expected: negative (sarcastic) | Wow Westbrook only 5 turnovers, he's really improving
✗ [-0.65] negative | expected: positive             | That dunk was NASTY
✗ [-0.57] negative | expected: positive             | Curry is absolutely disgusting from three
✗ [+0.00] neutral  | expected: negative             | Another brick from Westbrook
✗ [+0.82] positive | expected: mixed                | LeBron played like garbage but hit the game winner
✗ [-0.20] negative | expected: positive (grudging respect) | I hate how good Jokic is


## Experiment: Custom Lexicon Patch
 
### VADER allows adding/updating word scores. Scale: -4 (most negative) to +4 (most positive)

In [4]:
# NBA-specific lexicon additions
nba_lexicon = {
    # Positive slang (words VADER thinks are negative)
    "nasty": 3.0,
    "disgusting": 2.5,
    "filthy": 2.5,
    "sick": 2.0,
    "insane": 2.0,
    "crazy": 1.5,
    "killed": 2.0,  # "he killed it"
    "murdered": 2.0,  # "murdered that dunk"
    "goat": 4.0,  # Greatest of all time
    # Negative terms (words VADER misses)
    "trash": -3.0,
    "brick": -2.5,
    "bricks": -2.5,
    "choke": -3.0,
    "choked": -3.0,
    "washed": -2.5,
    "cooked": -2.5,  # "he got cooked on defense"
    "fraud": -3.0,
    "overrated": -2.5,
    "poverty": -2.0,  # "poverty franchise"
}

# Create patched analyzer
patched_analyzer = SentimentIntensityAnalyzer()
patched_analyzer.lexicon.update(nba_lexicon)

In [5]:
def compare_analyzers(comments):
    """Compare vanilla VADER vs patched."""
    print(f"{'Text':<55} | {'Vanilla':^8} | {'Patched':^8} | Expected")
    print("-" * 100)

    for item in comments:
        vanilla = analyzer.polarity_scores(item["text"])["compound"]
        patched = patched_analyzer.polarity_scores(item["text"])["compound"]

        # Highlight improvements
        marker = "→" if abs(patched - vanilla) > 0.1 else " "
        print(
            f"{item['text']:<55} | {vanilla:+.2f}    | {patched:+.2f} {marker}  | {item['expected']}"
        )


compare_analyzers(sample_comments)

Text                                                    | Vanilla  | Patched  | Expected
----------------------------------------------------------------------------------------------------
LeBron is the GOAT, no question                         | -0.30    | +0.67 →  | positive
Westbrook is absolutely trash                           | +0.00    | -0.65 →  | negative
Lakers play tomorrow at 7pm                             | +0.34    | +0.34    | neutral
Great defense from Harden as usual                      | +0.68    | +0.68    | negative (sarcastic)
Wow Westbrook only 5 turnovers, he's really improving   | +0.78    | +0.78    | negative (sarcastic)
That dunk was NASTY                                     | -0.65    | +0.69 →  | positive
Curry is absolutely disgusting from three               | -0.57    | +0.58 →  | positive
Another brick from Westbrook                            | +0.00    | -0.54 →  | negative
LeBron played like garbage but hit the game winner      | +0.82    | +0.82 

## Finding: VADER Limitations

| Category | Vanilla VADER | Patched VADER | Fixable? |
|----------|---------------|---------------|----------|
| Sports slang (nasty, disgusting) | ❌ | ✅ | Yes, but manual |
| Domain terms (trash, brick, GOAT) | ❌ | ✅ | Yes, but manual |
| Sarcasm | ❌ | ❌ | No - needs context |
| Complex sentiment | ❌ | ❌ | No - needs reasoning |

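**Conclusion:** On this 10-comment sample, the lexicon patch fixes the five slang and domain-term cases but not the sarcasm or mixed-sentiment ones, so lexicon-based approaches appear to hit a ceiling of about 50% accuracy for NBA content.
Sarcasm and nuanced sentiment require one of: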
- Transformer models (RoBERTa) - heavier, but trained for social media
- LLM API (Claude Haiku) - handles context natively, ~$0.25/1M tokens
- `/s` tag detection - cheap sarcasm filter, Reddit-specific

**Recommendation for V1:** Evaluate RoBERTa or Haiku once real data is available.
VADER could serve as a fast baseline for comparison.
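
The `/s` tag option listed above can be sketched with a single regular expression. This is a minimal illustration, not part of the notebook's pipeline: the names `SARCASM_TAG`, `has_sarcasm_tag`, and `strip_sarcasm_tag` are hypothetical, and the pattern only catches a tag at the very end of a comment.

```python
import re

# Matches a standalone "/s" at the end of a comment, optionally followed by
# whitespace or closing punctuation. Requiring whitespace (or start of string)
# before the slash avoids matching paths such as "r/nba/s".
SARCASM_TAG = re.compile(r"(?:^|\s)/s[\s.!?)]*$", re.IGNORECASE)


def has_sarcasm_tag(comment: str) -> bool:
    """Return True if the comment ends with an explicit "/s" marker."""
    return bool(SARCASM_TAG.search(comment))


def strip_sarcasm_tag(comment: str) -> str:
    """Remove a trailing "/s" marker so it does not reach the classifier."""
    return SARCASM_TAG.sub("", comment).rstrip()
```

A filter like this only helps when the author flagged the sarcasm. Neither sarcastic sample comment in this notebook carries a tag, so it would complement a context-aware classifier rather than replace one.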

## Haiku Sentiment Classifier

### Testing Claude Haiku as a context-aware sentiment classifier.

In [None]:
import json
import os
from anthropic import Anthropic
from dotenv import load_dotenv

load_dotenv()

client = Anthropic(api_key=os.getenv("ANTHROPIC_API_KEY"))

SYSTEM_PROMPT = """You analyze NBA basketball comments from Reddit and classify sentiment.

You must understand:
- Sports slang inverts meaning: "nasty", "disgusting", "filthy", "sick" = POSITIVE (impressive play)
- "brick" = missed shot (negative)
- "cooked", "washed", "fraud" = negative
- "GOAT" = greatest of all time (positive)
- Sarcasm is extremely common in r/NBA
- Nicknames: "The King", "Bron" = LeBron James; "Chef Curry", "Steph" = Stephen Curry

Respond ONLY with valid JSON:
{"sentiment": "positive|negative|neutral", "confidence": 0.0-1.0, "target_player": "Full Name or null", "reasoning": "brief explanation"}"""


def classify_sentiment(comment: str) -> dict:
    """Classify sentiment of an NBA Reddit comment using Claude Haiku."""
    response = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=256,
        temperature=0.0,
        system=SYSTEM_PROMPT,
        messages=[{"role": "user", "content": comment}],
    )

    raw_text = response.content[0].text

    # Strip markdown code blocks if present
    cleaned = raw_text.strip()
    if cleaned.startswith("```json"):
        cleaned = cleaned[7:]
    if cleaned.startswith("```"):
        cleaned = cleaned[3:]
    if cleaned.endswith("```"):
        cleaned = cleaned[:-3]
    cleaned = cleaned.strip()

    try:
        return json.loads(cleaned)
    except json.JSONDecodeError:
        return {"error": raw_text}
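
`classify_sentiment` returns whatever JSON the model produced, so a parsed response can still break the schema requested in `SYSTEM_PROMPT`. A hedged sketch of a schema check follows. `validate_classification` and `ALLOWED_SENTIMENTS` are hypothetical names introduced here, and the rules are an assumption based on the prompt's JSON template.

```python
ALLOWED_SENTIMENTS = {"positive", "negative", "neutral"}


def validate_classification(result: dict) -> list[str]:
    """Return a list of problems found in a parsed classifier response."""
    if not isinstance(result, dict):
        return ["response is not a JSON object"]
    if "error" in result:
        # classify_sentiment uses this key when json.loads fails
        return ["response was not valid JSON"]

    problems = []
    if result.get("sentiment") not in ALLOWED_SENTIMENTS:
        problems.append(f"unexpected sentiment: {result.get('sentiment')!r}")

    confidence = result.get("confidence")
    # bool is a subclass of int, so reject it explicitly
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        problems.append("confidence is not a number")
    elif not 0.0 <= confidence <= 1.0:
        problems.append(f"confidence out of range: {confidence}")

    player = result.get("target_player")
    if player is not None and not isinstance(player, str):
        problems.append("target_player is neither a string nor null")
    return problems
```

An empty list means the response matches the expected shape. Anything else could be logged and counted as a miss, as the test suite already does for parse errors.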

## Test Suite

In [8]:
test_comments = [
    # Obvious sentiment
    {
        "text": "LeBron is the GOAT, no question",
        "expected_sentiment": "positive",
        "expected_player": "LeBron James",
    },
    {
        "text": "Westbrook is absolutely trash",
        "expected_sentiment": "negative",
        "expected_player": "Russell Westbrook",
    },
    {
        "text": "Lakers play tomorrow at 7pm",
        "expected_sentiment": "neutral",
        "expected_player": None,
    },
    # Sarcasm (VADER failed these)
    {
        "text": "Great defense from Harden as usual",
        "expected_sentiment": "negative",
        "expected_player": "James Harden",
    },
    {
        "text": "Wow Westbrook only 5 turnovers, he's really improving",
        "expected_sentiment": "negative",
        "expected_player": "Russell Westbrook",
    },
    # Sports slang - positive
    {
        "text": "That dunk was NASTY",
        "expected_sentiment": "positive",
        "expected_player": None,
    },
    {
        "text": "Curry is absolutely disgusting from three",
        "expected_sentiment": "positive",
        "expected_player": "Stephen Curry",
    },
    {
        "text": "Ant is so filthy with the handles",
        "expected_sentiment": "positive",
        "expected_player": "Anthony Edwards",
    },
    # Sports slang - negative
    {
        "text": "Another brick from Westbrook",
        "expected_sentiment": "negative",
        "expected_player": "Russell Westbrook",
    },
    {
        "text": "Harden got cooked on defense again",
        "expected_sentiment": "negative",
        "expected_player": "James Harden",
    },
    # Nicknames
    {
        "text": "The King can't guard anyone anymore",
        "expected_sentiment": "negative",
        "expected_player": "LeBron James",
    },
    {
        "text": "Chef Curry cooking tonight",
        "expected_sentiment": "positive",
        "expected_player": "Stephen Curry",
    },
    {
        "text": "Bron haters are so quiet right now",
        "expected_sentiment": "positive",
        "expected_player": "LeBron James",
    },
    # Mixed/complex
    {
        "text": "LeBron played like garbage but hit the game winner",
        "expected_sentiment": "positive",
        "expected_player": "LeBron James",
    },
    {
        "text": "I hate how good Jokic is",
        "expected_sentiment": "positive",
        "expected_player": "Nikola Jokic",
    },
    {
        "text": "Love Curry but he choked tonight",
        "expected_sentiment": "negative",
        "expected_player": "Stephen Curry",
    },
    # No player target
    {
        "text": "These refs are garbage",
        "expected_sentiment": "negative",
        "expected_player": None,
    },
    {"text": "What a game!", "expected_sentiment": "positive", "expected_player": None},
    # Negative nicknames
    {
        "text": "LeGM trading away all our picks",
        "expected_sentiment": "negative",
        "expected_player": "LeBron James",
    },
    {
        "text": "Westbrick with another airball",
        "expected_sentiment": "negative",
        "expected_player": "Russell Westbrook",
    },
]

In [12]:
def run_test_suite(comments):
    """Run Haiku on test suite and report accuracy."""
    results = []

    for item in comments:
        result = classify_sentiment(item["text"])

        if "error" in result:
            sentiment_match = False
            player_match = False
        else:
            sentiment_match = result.get("sentiment") == item["expected_sentiment"]

            # Player matching (flexible - check if expected is in result or vice versa)
            result_player = result.get("target_player")
            expected_player = item["expected_player"]

            if expected_player is None and result_player is None:
                player_match = True
            elif expected_player is None or result_player is None:
                player_match = False
            else:
                player_match = (
                    expected_player.lower() in result_player.lower()
                    or result_player.lower() in expected_player.lower()
                )

        results.append(
            {
                **item,
                "result": result,
                "sentiment_match": sentiment_match,
                "player_match": player_match,
            }
        )

    return results


print("Running test suite (this will make API calls)...")
results = run_test_suite(test_comments)

Running test suite (this will make API calls)...


In [14]:
# Display results
sentiment_correct = sum(1 for r in results if r["sentiment_match"])
player_correct = sum(1 for r in results if r["player_match"])
total = len(results)

print(f"\n{'=' * 80}")
print(
    f"SENTIMENT ACCURACY: {sentiment_correct}/{total} ({100 * sentiment_correct / total:.0f}%)"
)
print(
    f"PLAYER EXTRACTION:  {player_correct}/{total} ({100 * player_correct / total:.0f}%)"
)
print(f"{'=' * 80}\n")

for r in results:
    sent_mark = "✓" if r["sentiment_match"] else "✗"
    player_mark = "✓" if r["player_match"] else "✗"

    result = r["result"]
    if "error" in result:
        print(f"{sent_mark}{player_mark} ERROR: {result['error'][:50]}")
    else:
        player = result.get("target_player") or "None"
        print(
            f"{sent_mark}{player_mark} [{result['sentiment']:8}] {player:20} | {r['text'][:50]}"
        )
        if not r["sentiment_match"]:
            print(
                f"      Expected: {r['expected_sentiment']}, Got: {result['sentiment']}"
            )
            print(f"      Reasoning: {result.get('reasoning', 'N/A')}")


SENTIMENT ACCURACY: 19/20 (95%)
PLAYER EXTRACTION:  20/20 (100%)

✓✓ [positive] LeBron James         | LeBron is the GOAT, no question
✓✓ [negative] Russell Westbrook    | Westbrook is absolutely trash
✓✓ [neutral ] None                 | Lakers play tomorrow at 7pm
✓✓ [negative] James Harden         | Great defense from Harden as usual
✓✓ [negative] Russell Westbrook    | Wow Westbrook only 5 turnovers, he's really improv
✓✓ [positive] None                 | That dunk was NASTY
✓✓ [positive] Stephen Curry        | Curry is absolutely disgusting from three
✓✓ [positive] Anthony Edwards      | Ant is so filthy with the handles
✓✓ [negative] Russell Westbrook    | Another brick from Westbrook
✓✓ [negative] James Harden         | Harden got cooked on defense again
✓✓ [negative] LeBron James         | The King can't guard anyone anymore
✓✓ [positive] Stephen Curry        | Chef Curry cooking tonight
✓✓ [positive] LeBron James         | Bron haters are so quiet right now
✓✓ [positive] LeBr

## Finding: Claude Haiku as Sentiment Classifier

**Model:** `claude-haiku-4-5-20251001`

| Metric | Result |
|--------|--------|
| Sentiment Accuracy | 95% (19/20) |
| Player Extraction | 100% (20/20) |

**Haiku successfully handled:**
- ✅ Sarcasm ("Great defense from Harden as usual" → negative)
- ✅ Sports slang inversion ("disgusting from three" → positive)
- ✅ Nicknames and variants (Westbrick, LeGM, The King, Chef Curry)
- ✅ Complex/mixed sentiment ("hate how good Jokic is" → positive)

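**Single miss:** "What a game!" was classified as neutral (expected positive).
The model's reasoning was that the comment is ambiguous without context, which is defensible.
These figures come from a single run. A re-run of the same suite later in this notebook scored 18/20 (90%) and missed the sarcastic Westbrook turnovers comment, so treat the accuracy as approximate.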

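**Cost estimate:** ~100-150 tokens per classification → ~$0.03-0.05 per 1,000 comments, using the ~$0.25/1M-token rate quoted earlier. Re-check that rate against current pricing for `claude-haiku-4-5-20251001`, and note that the system prompt is sent with every call and adds to the per-comment token count.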
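
The arithmetic behind the cost estimate can be written out as below. This is a rough sketch: `cost_per_comments` is a hypothetical helper, and it assumes a single blended per-token rate, whereas real API pricing distinguishes input and output tokens.

```python
def cost_per_comments(
    tokens_per_comment: float,
    usd_per_million_tokens: float,
    n_comments: int = 1_000,
) -> float:
    """Estimated spend in USD for classifying n_comments."""
    total_tokens = tokens_per_comment * n_comments
    return total_tokens / 1_000_000 * usd_per_million_tokens


# Token counts and rate taken from the estimate in this notebook
low = cost_per_comments(100, 0.25)
high = cost_per_comments(150, 0.25)
print(f"${low:.4f} to ${high:.4f} per 1,000 comments")
# prints: $0.0250 to $0.0375 per 1,000 comments
```

At these inputs the result sits at the low end of the range quoted above. Swapping in the actual rate and a token count that includes the system prompt gives a figure better suited to budgeting.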

**Recommendation:** Green light. Haiku is the primary classifier for V1.
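
Results can differ between runs of the same suite, so it may be worth checking which comments flip before relying on a single accuracy number. The sketch below is one possible approach. `label_stability` and `unstable_comments` are hypothetical helpers, and `classify` is any callable that maps a comment to a label.

```python
from collections import Counter
from typing import Callable


def label_stability(
    classify: Callable[[str], str],
    comments: list[str],
    runs: int = 5,
) -> dict[str, Counter]:
    """Classify each comment `runs` times and count the labels returned."""
    tallies = {}
    for text in comments:
        tallies[text] = Counter(classify(text) for _ in range(runs))
    return tallies


def unstable_comments(tallies: dict[str, Counter]) -> list[str]:
    """Comments that received more than one distinct label across runs."""
    return [text for text, counts in tallies.items() if len(counts) > 1]
```

With the functions in this notebook, `classify` could be `lambda t: classify_sentiment(t).get("sentiment", "error")` applied to `[c["text"] for c in test_comments]`. Each extra run multiplies the number of API calls.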

In [16]:
print("Running test suite (this will make API calls)...")
results = run_test_suite(test_comments)

# Display results
sentiment_correct = sum(1 for r in results if r["sentiment_match"])
player_correct = sum(1 for r in results if r["player_match"])
total = len(results)

print(f"\n{'=' * 80}")
print(
    f"SENTIMENT ACCURACY: {sentiment_correct}/{total} ({100 * sentiment_correct / total:.0f}%)"
)
print(
    f"PLAYER EXTRACTION:  {player_correct}/{total} ({100 * player_correct / total:.0f}%)"
)
print(f"{'=' * 80}\n")

for r in results:
    sent_mark = "✓" if r["sentiment_match"] else "✗"
    player_mark = "✓" if r["player_match"] else "✗"

    result = r["result"]
    if "error" in result:
        print(f"{sent_mark}{player_mark} ERROR: {result['error'][:50]}")
    else:
        player = result.get("target_player") or "None"
        print(
            f"{sent_mark}{player_mark} [{result['sentiment']:8}] {player:20} | {r['text'][:50]}"
        )
        if not r["sentiment_match"]:
            print(
                f"      Expected: {r['expected_sentiment']}, Got: {result['sentiment']}"
            )
            print(f"      Reasoning: {result.get('reasoning', 'N/A')}")

Running test suite (this will make API calls)...

SENTIMENT ACCURACY: 18/20 (90%)
PLAYER EXTRACTION:  20/20 (100%)

✓✓ [positive] LeBron James         | LeBron is the GOAT, no question
✓✓ [negative] Russell Westbrook    | Westbrook is absolutely trash
✓✓ [neutral ] None                 | Lakers play tomorrow at 7pm
✓✓ [negative] James Harden         | Great defense from Harden as usual
✗✓ [positive] Russell Westbrook    | Wow Westbrook only 5 turnovers, he's really improv
      Expected: negative, Got: positive
      Reasoning: Sarcastic comment highlighting historically high turnover rate, implies Westbrook's ball security is typically poor
✓✓ [positive] None                 | That dunk was NASTY
✓✓ [positive] Stephen Curry        | Curry is absolutely disgusting from three
✓✓ [positive] Anthony Edwards      | Ant is so filthy with the handles
✓✓ [negative] Russell Westbrook    | Another brick from Westbrook
✓✓ [negative] James Harden         | Harden got cooked on defense again
✓✓ [n