# Scrape Reddit with Bright Data

## Configure a Scraper

1. In the Bright Data user dashboard, hover over `Data` and click on `Scrapers Library`.
2. In the search box, enter `reddit` and click on the `reddit.com` result.
3. Click on `reddit.com` under `Results for Scrapers`.
4. Under `Available Endpoints`, click on `Discover by subreddit url`.
5. Under `Configuration`, click on `Set record limit`.
6. Turn on the `Set record limit` toggle and enter `100` for `Maximum records per input`.
7. Click `Back to scraper settings`.
8. Under `Inputs`, change the first row's `sort_by` to `Top`, ensure that `sort_by_time` is set to `Today`, set `num_of_posts` to `10`, and remove the `keyword`.
9. Select `Asynchronous` under `Choose scraper mode`.
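
The dashboard settings above appear to correspond to query parameters on the trigger endpoint that `perform_scrape_snapshot` calls later in this notebook. The sketch below rebuilds that URL from a dictionary so the individual settings are easier to read and change. `build_trigger_url` is a hypothetical helper, not part of any Bright Data library, and the mapping of dashboard labels to parameter names is an assumption based on the URL used in this notebook.

```python
from urllib.parse import urlencode

# Endpoint and dataset ID are copied from the trigger URL used in this notebook.
TRIGGER_ENDPOINT = 'https://api.brightdata.com/datasets/v3/trigger'


def build_trigger_url(limit_per_input: int = 100) -> str:
    """Build the trigger URL from its query parameters (hypothetical helper)."""
    params = {
        'dataset_id': 'gd_lvz8ah06191smkebj4',
        'notify': 'false',
        'include_errors': 'true',
        'type': 'discover_new',
        # Assumed to correspond to the "Discover by subreddit url" endpoint.
        'discover_by': 'subreddit_url',
        # Assumed to correspond to "Maximum records per input".
        'limit_per_input': limit_per_input,
    }

    return f"{TRIGGER_ENDPOINT}?{urlencode(params)}"


print(build_trigger_url())
```

Keeping the parameters in a dictionary makes it possible to vary the record limit per call without editing a long URL string by hand.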

## Monitor a Snapshot

1. Click on `Snapshots`.
2. Click on a `Snapshot ID` to copy the value.
3. Click on `Manage snapshots API`.
4. Click on `Monitor progress API`.
5. Select an API token.
6. Paste in the snapshot ID.

## Download a Snapshot

1. Click on `Snapshots`.
2. Click on a `Snapshot ID` to copy the value.
3. Click on `Manage snapshots API`.
4. Click on `Download snapshot`.
5. Select an API token.
6. Paste in the snapshot ID.

In [None]:
import json
import os

import requests
from dotenv import load_dotenv

load_dotenv()

In [None]:
BRIGHT_DATA_API_KEY = os.environ.get('BRIGHT_DATA_API_KEY')

assert BRIGHT_DATA_API_KEY is not None, 'Set BRIGHT_DATA_API_KEY in the environment or a .env file'

In [None]:
def get_crawl_headers() -> dict[str, str]:
    return {
        'Authorization': f"Bearer {BRIGHT_DATA_API_KEY}",
        'Content-Type': 'application/json',
    }

In [None]:
def perform_scrape_snapshot(subreddit_url: str, num_of_posts: int = 20) -> str:
    url = 'https://api.brightdata.com/datasets/v3/trigger?dataset_id=gd_lvz8ah06191smkebj4&notify=false&include_errors=true&type=discover_new&discover_by=subreddit_url&limit_per_input=100'
    headers = get_crawl_headers()
    data = json.dumps({
        'input': [
            {
                'url': subreddit_url,
                'sort_by': 'Top',
                'sort_by_time': 'Today',
                'num_of_posts': num_of_posts,
            },
        ],
    })

    # Set a timeout so the call cannot hang indefinitely.
    response = requests.post(
        url=url,
        headers=headers,
        data=data,
        timeout=30,
    )

    response.raise_for_status()

    scrape_data = response.json()

    # Index directly so a missing ID raises KeyError instead of returning None.
    return scrape_data['snapshot_id']

In [None]:
scrape_snapshot_result = perform_scrape_snapshot('https://www.reddit.com/r/django')

print(scrape_snapshot_result)

In [None]:
def get_snapshot_progress(snapshot_id: str) -> bool | None:
    """Return True if the snapshot is ready, False if not, or None if the request fails."""
    url = f"https://api.brightdata.com/datasets/v3/progress/{snapshot_id}"
    headers = get_crawl_headers()

    try:
        response = requests.get(url=url, headers=headers, timeout=30)

        response.raise_for_status()

        result = response.json()

        return result['status'] == 'ready'

    except requests.exceptions.RequestException as e:
        print(f"Error: {e}")

        return None

In [None]:
get_snapshot_progress(scrape_snapshot_result)
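
A single progress check rarely catches the snapshot at the moment it becomes ready, so in practice the check is repeated. The following is a minimal polling sketch, assuming any zero-argument callable that returns `True` once the snapshot is ready. `wait_until_ready` and its default interval and timeout are hypothetical choices for illustration, not values recommended by Bright Data.

```python
import time
from typing import Callable, Optional


def wait_until_ready(
    check: Callable[[], Optional[bool]],
    interval_seconds: float = 10.0,
    timeout_seconds: float = 600.0,
) -> bool:
    """Call `check` repeatedly until it returns True or the timeout elapses."""
    deadline = time.monotonic() + timeout_seconds

    while True:
        # False and None (a failed request) are both treated as "not ready yet".
        if check():
            return True

        # Stop if the next sleep would carry us past the deadline.
        if time.monotonic() + interval_seconds > deadline:
            return False

        time.sleep(interval_seconds)


# Usage with the function defined above (commented out so this cell runs offline):
# wait_until_ready(lambda: get_snapshot_progress(scrape_snapshot_result))
```

Passing the check in as a callable keeps the loop independent of the HTTP call, which also makes it easy to exercise with a stub.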

In [None]:
def download_snapshot(snapshot_id: str) -> list[dict] | None:
    """Return the snapshot's records as a list of dicts, or None if the request fails."""
    url = f"https://api.brightdata.com/datasets/v3/snapshot/{snapshot_id}"
    headers = get_crawl_headers()
    params = {
        'format': 'json',
    }

    try:
        response = requests.get(url=url, headers=headers, params=params, timeout=30)

        response.raise_for_status()

        return response.json()

    except requests.exceptions.RequestException as e:
        print(f"Error: {e}")

        return None

In [None]:
reddit_results = download_snapshot(scrape_snapshot_result)

In [None]:
# download_snapshot returns None on a failed request, so fall back to an empty list.
for thread in reddit_results or []:
    print(thread.get('title'), thread.get('num_upvotes'))
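
Once the records are downloaded, it can help to rank them before printing. The sketch below sorts threads by the `num_upvotes` field used above and keeps the top few. `top_threads` is a hypothetical helper, and `sample_threads` is made-up data standing in for `reddit_results` so the cell runs without an API call.

```python
def top_threads(threads: list[dict], limit: int = 5) -> list[tuple[str, int]]:
    """Return (title, upvotes) pairs for the most upvoted threads."""
    # Treat a missing or null upvote count as 0 so sorting never fails.
    ranked = sorted(threads, key=lambda t: t.get('num_upvotes') or 0, reverse=True)

    return [(t.get('title', ''), t.get('num_upvotes') or 0) for t in ranked[:limit]]


# Made-up records for illustration only; real records come from download_snapshot.
sample_threads = [
    {'title': 'First example thread', 'num_upvotes': 12},
    {'title': 'Second example thread', 'num_upvotes': 40},
    {'title': 'Third example thread'},
]

for title, upvotes in top_threads(sample_threads, limit=2):
    print(title, upvotes)
```

With real data, the call would be `top_threads(reddit_results or [])`.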