# Notebook 1: Data Collection & Exploratory Data Analysis

This notebook walks through:
1. How Reddit data is collected via PRAW
2. Loading and inspecting the raw Parquet files
3. EDA: post volume, score distributions, subreddit breakdown
4. A preview of the URLs and domains linked in posts and comments

> **Before running:** ensure `REDDIT_CLIENT_ID` and `REDDIT_CLIENT_SECRET` are set in `.env`,  
> then run `make collect` (or uncomment the collection cell in Section 1) to fetch live data.
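
A quick way to confirm the credentials are visible before collecting is sketched below. It assumes the values from `.env` have already been loaded into the process environment (how the project loads them is not shown here), and `missing_credentials` is a hypothetical helper, not part of the project.

```python
import os

# The two variable names come from the note above.
REQUIRED_VARS = ("REDDIT_CLIENT_ID", "REDDIT_CLIENT_SECRET")


def missing_credentials(env=None):
    """Return the names of required variables that are unset or empty.

    Hypothetical helper for illustration; `env` defaults to os.environ.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


missing = missing_credentials()
if missing:
    print(f"Missing credentials: {', '.join(missing)}")
else:
    print("Reddit credentials found.")
```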

In [None]:
import sys
sys.path.insert(0, '../src')

import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
from pathlib import Path

## 1. Collection Architecture

The `SubredditCollector` fetches posts and comments from 9 sneaker subreddits using three sort methods (`hot`, `top`, `new`) and writes the result to Parquet with a crash-recovery checkpoint.

```
SubredditCollector
  ├── sort: hot  → up to 167 posts
  ├── sort: top  → up to 167 posts (last month)
  └── sort: new  → up to 167 posts
       └── per post: up to 50 comments
```
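
The limits in the diagram imply an upper bound on the size of one collection run. The arithmetic below is a rough sketch that assumes the per-sort and per-post caps apply to each of the 9 subreddits independently; the constant names are illustrative and not taken from the collector.

```python
# Caps taken from the diagram above; names are illustrative only.
N_SUBREDDITS = 9
SORTS = ("hot", "top", "new")
MAX_POSTS_PER_SORT = 167
MAX_COMMENTS_PER_POST = 50

# Upper bounds, assuming every listing and every post hits its cap.
max_posts = N_SUBREDDITS * len(SORTS) * MAX_POSTS_PER_SORT
max_comments = max_posts * MAX_COMMENTS_PER_POST
max_records = max_posts + max_comments

print(f"Posts:    up to {max_posts:,}")     # Posts:    up to 4,509
print(f"Comments: up to {max_comments:,}")  # Comments: up to 225,450
print(f"Records:  up to {max_records:,}")   # Records:  up to 229,959
```

In practice the totals are likely to be lower: the same post can appear under more than one sort, and many posts have fewer than 50 comments.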

In [None]:
# Optionally run collection (takes 10-30 min for full dataset)
# Uncomment if you want to fetch fresh data:

# from reddit_sentiment.collection.collector import SubredditCollector
# collector = SubredditCollector()
# raw_path = collector.collect()
# print(f'Collected data saved to: {raw_path}')

In [None]:
# Load the most recent raw data file
from reddit_sentiment.collection.collector import SubredditCollector

raw_dir = Path('../data/raw')
try:
    df = SubredditCollector.load_latest(raw_dir)
    print(f'Loaded {len(df):,} records')
    print(f'Columns: {list(df.columns)}')
except FileNotFoundError:
    print('No raw data found. Run `make collect` first.')
    # Create a small synthetic demo dataset
    df = pd.DataFrame({
        'id': [f'post_{i}' for i in range(100)],
        'subreddit': ['Sneakers'] * 50 + ['Nike'] * 30 + ['Adidas'] * 20,
        'record_type': ['post'] * 60 + ['comment'] * 40,
        'score': [100, 50, 200, 10, 30] * 20,
        'num_comments': [5, 10, 2, 20, 8] * 20,
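        # Lowercase 'h' avoids the deprecation warning that 'H' raises in pandas >= 2.2
        'created_utc': pd.date_range('2024-01-01', periods=100, freq='6h', tz='UTC'),
        'full_text': ['Nike Air Max review'] * 100,
        # One independent empty list per row (avoids 100 references to the same list)
        'extracted_urls': [[] for _ in range(100)],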
    })
    print('Using synthetic demo data.')
    
df.head()

## 2. Basic Stats

In [None]:
print(f"Total records: {len(df):,}")
if 'record_type' in df.columns:
    print(df['record_type'].value_counts().to_string())
print(f"\nSubreddits: {df['subreddit'].nunique()}")
print(df['subreddit'].value_counts().head(10).to_string())

In [None]:
# Post score distribution
if 'score' in df.columns:
    posts = df[df['record_type'] == 'post'] if 'record_type' in df.columns else df
    fig = px.histogram(
        posts, x='score', nbins=50, log_y=True,
        title='Distribution of Post Scores (log scale)',
        labels={'score': 'Reddit Score', 'count': 'Number of Posts'},
    )
    fig.show()

In [None]:
# Records per subreddit
sub_counts = df['subreddit'].value_counts().reset_index()
sub_counts.columns = ['subreddit', 'count']

fig = px.bar(
    sub_counts.head(10),
    x='subreddit', y='count',
    title='Records per Subreddit',
    color='count',
    color_continuous_scale='Blues'
)
fig.show()

In [None]:
# Daily volume of posts and comments over time
if 'created_utc' in df.columns:
    df['created_utc'] = pd.to_datetime(df['created_utc'], utc=True, errors='coerce')
    df['date'] = df['created_utc'].dt.date
    daily = df.groupby('date').size().reset_index(name='count')
    
    fig = px.line(daily, x='date', y='count', title='Daily Post/Comment Volume')
    fig.show()

## 3. URL & Channel Preview
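
Domain counts depend on normalizing hostnames consistently. The sketch below uses made-up example URLs and a hypothetical `normalize_domain` helper to show one way to drop a leading `www.`; it assumes Python 3.9+ for `str.removeprefix`. It also shows why `str.lstrip` is a poor fit here: it strips a set of characters, not a prefix.

```python
from urllib.parse import urlparse


def normalize_domain(url):
    """Lowercase the hostname and drop one leading 'www.' prefix.

    Hypothetical helper for illustration only.
    """
    return urlparse(url).netloc.lower().removeprefix("www.")


# Made-up example URLs, not taken from the dataset.
examples = [
    "https://www.nike.com/launch",
    "https://WWW.Walmart.com/shoes",
    "https://stockx.com/sneakers",
]
for url in examples:
    print(normalize_domain(url))
# nike.com
# walmart.com
# stockx.com

# Pitfall: lstrip('www.') removes any leading 'w' or '.' characters,
# so it also eats the first letter of a domain that starts with 'w'.
print("www.walmart.com".lstrip("www."))  # almart.com
```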

In [None]:
if 'extracted_urls' in df.columns:
    from collections import Counter
    from urllib.parse import urlparse

    # Flatten the per-record URL lists, skipping values that are not lists
    all_urls = [url for urls in df['extracted_urls'] if isinstance(urls, list) for url in urls]
    # removeprefix (Python 3.9+) drops only a literal 'www.' prefix;
    # lstrip('www.') would strip any leading 'w' or '.' characters
    domains = [urlparse(u).netloc.lower().removeprefix('www.') for u in all_urls if u]

    top_domains = pd.DataFrame(Counter(domains).most_common(15), columns=['domain', 'count'])
    print(f'Total URLs found: {len(all_urls):,}')
    print(f'Unique domains: {len(set(domains)):,}')
    display(top_domains)