In [1]:
%%capture
!pip install -r requirements.txt

## Data Export

This notebook exports Reddit data from r/movies for social network analysis using a **hybrid approach**:
1. **Pushshift API** - Get post IDs from a specific date range (not subject to the 1000-post limit)
2. **Reddit API (PRAW)** - Fetch the current data for those posts, with up-to-date scores and comments

This allows us to collect historical data while getting the most current engagement metrics.
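
The second step can be sketched as follows. Reddit identifies a submission by its fullname, which is the post ID prefixed with `t3_`, and PRAW's `reddit.info(fullnames=...)` looks items up by fullname. The helper names `to_fullnames` and `chunked` are hypothetical and only illustrate the idea; the actual `fetch_posts_by_ids` in `utils.reddit_utils` may work differently.

```python
from typing import Iterable, Iterator, List


def to_fullnames(post_ids: Iterable[str]) -> List[str]:
    """Prefix bare submission IDs with 't3_' (Reddit's type prefix for posts)."""
    return [pid if pid.startswith("t3_") else f"t3_{pid}" for pid in post_ids]


def chunked(items: List[str], size: int = 100) -> Iterator[List[str]]:
    """Yield successive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Hypothetical usage with a PRAW instance (not executed here):
# for batch in chunked(to_fullnames(post_ids)):
#     for submission in reddit.info(fullnames=batch):
#         print(submission.id, submission.score, submission.num_comments)
```

Processing the IDs in small batches keeps each lookup bounded and makes it easy to report progress or checkpoint partial results during a long export.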

### Imports

In [2]:
import pandas as pd
from datetime import datetime, timedelta

from utils.reddit_utils import get_reddit_instance, fetch_post_ids_by_daterange, fetch_posts_by_ids
from utils.file_utils import save_to_csv, read_from_csv, save_to_json, read_from_json, file_exists, get_date_range

In [3]:
reddit = get_reddit_instance()

print(f"Read-only mode: {reddit.read_only}")
print(f"User agent: {reddit.config.user_agent}")

Read-only mode: True
User agent: Miserable_Seaweed691


### Data Extraction

In [4]:
start_date, end_date = get_date_range(
    '2024-07-01', '2024-08-01'
)

subreddit_name = 'movies'
date_format = "%Y%m%d"

post_ids_filename = f'id_{subreddit_name}_{start_date.strftime(date_format)}_{end_date.strftime(date_format)}.json'
posts_filename = f'{subreddit_name}_{start_date.strftime(date_format)}_{end_date.strftime(date_format)}.csv'

In [None]:
# Step 1: load cached post IDs if present, otherwise query Pushshift for them.
if file_exists(post_ids_filename):
    post_ids = read_from_json(post_ids_filename)
else:
    post_ids = fetch_post_ids_by_daterange(
        subreddit_name,
        start_date, end_date,
        limit=2000,
        reddit=reddit
    )
    save_to_json(post_ids, post_ids_filename)

# Step 2: load cached posts if present, otherwise fetch current data via PRAW.
if file_exists(posts_filename):
    posts = read_from_csv(posts_filename)
else:
    posts = fetch_posts_by_ids(reddit, post_ids)
    save_to_csv(posts, posts_filename)

Fetching post IDs (2024-07-01 to 2024-08-01)...


Not all PushShift shards are active. Query results may be incomplete.


Fetching 0 posts...


Pushshift ([official Pushshift website](https://pushshift.io/signup)) now rejects unauthorized requests with:

```
{
    "detail": "User is not an authorized moderator."
}
```

Pushshift now grants access only to Reddit-authorized moderators, and only for moderation-related purposes (moderation, safety, guideline enforcement). A regular Reddit user, or a script running without moderator authorization, is rejected with exactly this message, which is consistent with the ID query above returning no posts.
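
Because the ID fetch can come back empty, as in the run above, the caching cell would write an empty ID list to disk and then reuse it on every later run. A minimal sketch of a guard against that is shown below. `load_or_fetch_ids` is a hypothetical helper written with the standard library only; it does not use the notebook's `save_to_json` or `read_from_json`, whose exact behavior is not shown here.

```python
import json
from pathlib import Path
from typing import Callable, List


def load_or_fetch_ids(path: Path, fetch: Callable[[], List[str]]) -> List[str]:
    """Return cached IDs if a non-empty cache exists; otherwise fetch.

    An empty result is never written to disk, so a failed fetch
    is retried on the next run instead of being cached.
    """
    if path.exists():
        cached = json.loads(path.read_text())
        if cached:  # ignore an empty cache left by an earlier failed run
            return cached

    ids = fetch()
    if ids:
        path.write_text(json.dumps(ids))
    return ids
```

In the notebook, `fetch` could be a `lambda` wrapping `fetch_post_ids_by_daterange` with the arguments used above, and the same pattern would apply to the posts CSV.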