# Notebook 1: Data Collection & Exploratory Data Analysis

This notebook walks through:
1. How Reddit data is collected via PRAW
2. Loading and inspecting the raw Parquet files
3. EDA: post volume, score distributions, subreddit breakdown

> **Before running:** ensure `REDDIT_CLIENT_ID` and `REDDIT_CLIENT_SECRET` are set in `.env`,  
> then run `make collect` (or the cell below) to fetch live data.
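
A quick pre-flight check can confirm that both variables are visible to the Python process before any PRAW call is attempted. This is a minimal sketch, not part of the project: `missing_credentials` is a hypothetical helper, and it assumes the values in `.env` have already been exported into the environment (for example by the shell or a dotenv loader).

```python
import os

# Variable names taken from the note above.
REQUIRED_VARS = ("REDDIT_CLIENT_ID", "REDDIT_CLIENT_SECRET")


def missing_credentials(env=None):
    """Return the required variable names that are unset or empty.

    `env` defaults to os.environ; pass a plain dict to test the check.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


missing = missing_credentials()
if missing:
    print(f"Missing credentials: {', '.join(missing)}")
else:
    print("Reddit credentials found.")
```

If either name is reported missing, the credential-free public collector remains an option.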

In [1]:
import sys
sys.path.insert(0, '../src')

import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
from pathlib import Path

## 1. Collection Architecture

Two collectors are available — both write the same Parquet schema:

| Collector | Credentials | Speed | Comments |
|-----------|-------------|-------|----------|
| `SubredditCollector` (PRAW) | Reddit API key required | Faster, full metadata | ✅ Yes (50/post) |
| `PublicSubredditCollector` | None needed | Slightly slower (~10 min) | ✅ Yes (top 20 posts/subreddit) |

```
PublicSubredditCollector
  ├── r/Sneakers, r/Nike, r/Adidas, r/Jordans, …  (9 subreddits)
  ├── sort: hot / top / new  → up to ~167 posts each
  └── comments: top 20 posts × 50 comments = up to 1,000 comments/subreddit
```

Run `reddit-sentiment collect --public` (default, no credentials) or `reddit-sentiment collect` (PRAW).
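
The comment ceiling in the diagram follows from simple multiplication. The sketch below reproduces that arithmetic and extends it across all nine subreddits. `max_comments` is an illustrative helper, not a project function, and the result is an upper bound because a post may have fewer than 50 comments.

```python
def max_comments(n_subreddits, posts_per_subreddit=20, comments_per_post=50):
    """Upper bound on collected comments, using the limits from the diagram."""
    return n_subreddits * posts_per_subreddit * comments_per_post


print(max_comments(1))  # 1000 per subreddit
print(max_comments(9))  # 9000 across nine subreddits
```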

In [2]:
# Optionally run collection (takes 10-30 min for full dataset)
# Uncomment if you want to fetch fresh data:

# from reddit_sentiment.collection.collector import SubredditCollector
# collector = SubredditCollector()
# raw_path = collector.collect()
# print(f'Collected data saved to: {raw_path}')

In [3]:
# Load the most recent raw data file — tries PRAW collector glob first,
# then falls back to synthetic demo data
from reddit_sentiment.collection.collector import SubredditCollector

raw_dir = Path('../data/raw')
try:
    df = SubredditCollector.load_latest(raw_dir)
    print(f'Loaded {len(df):,} records from {raw_dir}')
    print(f'Columns: {list(df.columns)}')
except FileNotFoundError:
    print('No raw data found. Run: reddit-sentiment collect --public')
    df = pd.DataFrame({
        'id': [f'post_{i}' for i in range(200)],
        'subreddit': ['Sneakers'] * 80 + ['Nike'] * 50 + ['Adidas'] * 40 + ['Jordans'] * 30,
        'record_type': ['post'] * 120 + ['comment'] * 80,
        'score': [100, 50, 200, 10, 30] * 40,
        'num_comments': [5, 10, 2, 20, 8] * 40,
        'created_utc': pd.date_range('2025-01-01', periods=200, freq='6h', tz='UTC'),  # lowercase 'h': 'H' is deprecated in recent pandas
        'full_text': ['Nike Air Jordan review — great quality and comfortable fit'] * 200,
        'extracted_urls': [[]] * 200,
    })
    print(f'Using {len(df):,}-row synthetic demo data.')

df.head(3)

No raw data found. Run: reddit-sentiment collect --public
Using 200-row synthetic demo data.




|   | id | subreddit | record_type | score | num_comments | created_utc | full_text | extracted_urls |
|---|----|-----------|-------------|-------|--------------|-------------|-----------|----------------|
| 0 | post_0 | Sneakers | post | 100 | 5 | 2025-01-01 00:00:00+00:00 | Nike Air Jordan review — great quality and com... | [] |
| 1 | post_1 | Sneakers | post | 50 | 10 | 2025-01-01 06:00:00+00:00 | Nike Air Jordan review — great quality and com... | [] |
| 2 | post_2 | Sneakers | post | 200 | 2 | 2025-01-01 12:00:00+00:00 | Nike Air Jordan review — great quality and com... | [] |
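
The loading cell relies on picking the most recent raw file. How `SubredditCollector.load_latest` does this is not shown in this notebook. One plausible approach, sketched here purely as an assumption, is to glob for `*.parquet` and take the file with the newest modification time. `latest_parquet` is a hypothetical name.

```python
from pathlib import Path


def latest_parquet(raw_dir):
    """Return the most recently modified *.parquet file in raw_dir.

    Raises FileNotFoundError when the directory holds no Parquet files,
    mirroring the exception handled in the loading cell.
    """
    candidates = list(Path(raw_dir).glob("*.parquet"))
    if not candidates:
        raise FileNotFoundError(f"No Parquet files found in {raw_dir}")
    # Newest file by modification time wins.
    return max(candidates, key=lambda p: p.stat().st_mtime)
```

The returned path could then be passed to `pd.read_parquet`.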


## 2. Basic Stats

In [4]:
print(f"Total records: {len(df):,}")
if 'record_type' in df.columns:
    print(df['record_type'].value_counts().to_string())
print(f"\nSubreddits: {df['subreddit'].nunique()}")
print(df['subreddit'].value_counts().head(10).to_string())

Total records: 200
record_type
post       120
comment     80

Subreddits: 4
subreddit
Sneakers    80
Nike        50
Adidas      40
Jordans     30


In [5]:
# Posts vs Comments breakdown
if 'record_type' in df.columns:
    type_counts = df['record_type'].value_counts()
    posts = type_counts.get('post', 0)
    comments = type_counts.get('comment', 0)
    total = len(df)
    print(f"Posts:    {posts:,}  ({posts/total*100:.1f}%)")
    print(f"Comments: {comments:,}  ({comments/total*100:.1f}%)")
    print(f"Total:    {total:,}")
    print()
    # Comment-to-post ratio
    if posts > 0:
        print(f"Avg comments fetched per post: {comments/posts:.1f}")

    fig = px.pie(
        values=type_counts.values,
        names=type_counts.index,
        color=type_counts.index,  # color_discrete_map only applies when `color` is set
        title='Posts vs Comments',
        color_discrete_map={'post': '#4f46e5', 'comment': '#06b6d4'},
        hole=0.4,
    )
    fig.show()

Posts:    120  (60.0%)
Comments: 80  (40.0%)
Total:    200

Avg comments fetched per post: 0.7


In [6]:
# Post score distribution
if 'score' in df.columns:
    # Restrict to posts when the record type is known; otherwise use all rows
    posts = df[df['record_type'] == 'post'] if 'record_type' in df.columns else df
    fig = px.histogram(
        posts, x='score', nbins=50, log_y=True,
        title='Distribution of Post Scores (log scale)',
        labels={'score': 'Reddit Score', 'count': 'Number of Posts'},
    )
    fig.show()
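
A log-scale histogram shows the shape of the distribution; comparing the mean with the median gives a quick numeric check for right skew. The sketch below uses the 120 post scores from the synthetic fallback data and only the standard library. On real data, `posts['score'].describe()` gives a similar summary.

```python
import statistics

# The first 120 synthetic rows are posts; their scores repeat this 5-value pattern.
post_scores = [100, 50, 200, 10, 30] * 24

mean_score = statistics.mean(post_scores)
median_score = statistics.median(post_scores)

print(f"mean={mean_score:.1f}  median={median_score:.1f}")
# A mean well above the median indicates a few high scores pulling the average up.
print("right-skewed" if mean_score > median_score else "not right-skewed")
```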

In [7]:
# Records per subreddit
sub_counts = df['subreddit'].value_counts().reset_index()
sub_counts.columns = ['subreddit', 'count']

fig = px.bar(
    sub_counts.head(10),
    x='subreddit', y='count',
    title='Records per Subreddit',
    color='count',
    color_continuous_scale='Blues'
)
fig.show()

In [8]:
# Records (posts and comments) per day
if 'created_utc' in df.columns:
    df['created_utc'] = pd.to_datetime(df['created_utc'], utc=True, errors='coerce')
    df['date'] = df['created_utc'].dt.date
    daily = df.groupby('date').size().reset_index(name='count')

    fig = px.line(daily, x='date', y='count', title='Daily Post/Comment Volume')
    fig.show()

## 3. Text Length Analysis

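Posts and comments typically have different length profiles: posts tend to be longer and more deliberate, while comments are shorter and more reactive, so the two contribute differently to the sentiment signal. The synthetic fallback data uses one identical string for every row, so its summary shows 58 characters for both types.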

In [9]:
# Text length distribution: posts vs comments
if 'full_text' in df.columns:
    df['text_len'] = df['full_text'].str.len().fillna(0).astype(int)

    if 'record_type' in df.columns:
        fig = px.histogram(
            df, x='text_len', color='record_type', nbins=60, barmode='overlay',
            title='Text Length Distribution: Posts vs Comments',
            labels={'text_len': 'Character count', 'count': 'Records'},
            color_discrete_map={'post': '#4f46e5', 'comment': '#06b6d4'},
            opacity=0.7,
        )
    else:
        fig = px.histogram(
            df, x='text_len', nbins=60,
            title='Text Length Distribution',
            labels={'text_len': 'Character count'},
        )
    fig.update_layout(xaxis_range=[0, 2000])
    fig.show()

    # Summary stats by type (needs record_type, so guard it like the plot above)
    if 'record_type' in df.columns:
        len_stats = df.groupby('record_type')['text_len'].describe()[['mean','50%','max']].round(0)
        len_stats.columns = ['Mean chars', 'Median chars', 'Max chars']
        display(len_stats)

| record_type | Mean chars | Median chars | Max chars |
|-------------|------------|--------------|-----------|
| comment | 58.0 | 58.0 | 58.0 |
| post | 58.0 | 58.0 | 58.0 |


## 4. URL & Channel Preview

In [10]:
if 'extracted_urls' in df.columns:
    from collections import Counter
    from urllib.parse import urlparse

    # Flatten the per-record URL lists into one list
    all_urls = [url for urls in df['extracted_urls'] if isinstance(urls, list) for url in urls]
    # removeprefix (Python 3.9+) drops only a literal leading 'www.';
    # lstrip('www.') strips any leading 'w' or '.' and mangles hosts like 'walmart.com'
    domains = [urlparse(u).netloc.lower().removeprefix('www.') for u in all_urls if u]

    top_domains = pd.DataFrame(Counter(domains).most_common(15), columns=['domain', 'count'])
    print(f'Total URLs found: {len(all_urls):,}')
    print(f'Unique domains: {len(set(domains)):,}')
    display(top_domains)

Total URLs found: 0
Unique domains: 0


|   | domain | count |
|---|--------|-------|
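
Hostname normalization has a well-known pitfall: `str.lstrip('www.')` removes any leading run of the characters `w` and `.`, not the literal prefix. The sketch below contrasts it with `str.removeprefix` (Python 3.9+). `domain_of` is an illustrative helper and the URLs are made-up examples.

```python
from urllib.parse import urlparse


def domain_of(url):
    """Lower-cased host of a URL with a single leading 'www.' removed."""
    return urlparse(url).netloc.lower().removeprefix("www.")


print("walmart.com".lstrip("www."))                 # almart.com (first letter lost)
print(domain_of("https://www.walmart.com/ip/123"))  # walmart.com
print(domain_of("https://WWW.Nike.com/t/shoe"))     # nike.com
```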
