# Hacker News Story Downloader

This notebook downloads Hacker News stories about LLMs and AI coding tools. It searches the Hacker News Search API (provided by Algolia) for a predefined list of keywords (such as “openai”, “chatgpt”, and “copilot”), pages through the results for each keyword until the API returns no more hits, retries failed requests up to a fixed number of attempts, and saves the results as one JSON file per keyword in a data directory. It pauses between requests to stay polite to the API and skips any keyword whose output file already exists.

## Keywords

The cell below defines the keywords, covering LLMs and AI coding assistants, that are used to search for and download relevant Hacker News stories. Because a single story can match more than one keyword, the per-keyword results may contain duplicates, so deduplication is performed later.

In [None]:
KEYWORDS_TO_SEARCH = [
    "openai",
    "chatgpt",
    "gpt",
    "gemini",
    "anthropic",
    "claude",
    "deepseek",
    "grok",
    "llama",
    "mistral",
    "copilot",
    "cursor",
    "cline",
    "tabnine",
    "JetBrains AI",
    "Codeium",
    "Windsurf",
    "aider",
    "zed",
]
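
Since a story can match several keywords (for example both "openai" and "chatgpt"), the deduplication mentioned above could be sketched roughly as follows. This is a minimal sketch, not the notebook's actual deduplication step: it assumes each hit carries an `objectID` field identifying the story, and `deduplicate_stories` is a hypothetical helper name.

```python
from typing import Any, Dict, Iterable, List


def deduplicate_stories(stories: Iterable[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep the first occurrence of each story, keyed by its objectID.

    Assumption: every hit has an "objectID" field identifying the story.
    Hits without one are kept as-is because they cannot be compared.
    """
    seen_ids = set()
    unique: List[Dict[str, Any]] = []
    for story in stories:
        story_id = story.get("objectID")
        if story_id is not None:
            if story_id in seen_ids:
                continue  # Already collected under another keyword
            seen_ids.add(story_id)
        unique.append(story)
    return unique
```

Keeping the first occurrence preserves the order in which keywords were processed, which may or may not matter for later analysis.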

## Configurations

In [None]:
BASE_URL = "https://hn.algolia.com/api/v1/search_by_date"

HITS_PER_PAGE = 50  # Number of results per API request page
TAGS = "story"  # We are interested in stories
NUMERIC_FILTERS = "num_comments>0"  # Only stories with comments

REQUEST_DELAY_SECONDS = 1  # Politeness delay between successful page fetches
REQUEST_TIMEOUT_SECONDS = 5  # Timeout for each API request attempt

MAX_FETCH_RETRIES = 3  # Max attempts for a single page fetch
RETRY_DELAY_SECONDS = 5  # Delay between retries for a failed page fetch

OUTPUT_DIR = "data/stories"  # Directory to save the downloaded files
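
To make these settings concrete, the sketch below shows the query string they would produce for one page of one keyword. It uses the standard library's `urlencode` purely for illustration; the notebook itself passes the same parameters to `requests.get`, which is expected to build an equivalent query string. `build_query_string` is a hypothetical helper, and the constants are repeated so the block runs on its own.

```python
from urllib.parse import urlencode

# Values repeated from the configuration cell so this block runs on its own.
HITS_PER_PAGE = 50
TAGS = "story"
NUMERIC_FILTERS = "num_comments>0"


def build_query_string(keyword: str, page_num: int) -> str:
    """Return the URL-encoded query string for one page of results (hypothetical helper)."""
    params = {
        "query": keyword,
        "tags": TAGS,
        "numericFilters": NUMERIC_FILTERS,
        "hitsPerPage": HITS_PER_PAGE,
        "page": page_num,  # The fetch loop starts counting pages at 0
    }
    return urlencode(params)


print(build_query_string("openai", 0))
# query=openai&tags=story&numericFilters=num_comments%3E0&hitsPerPage=50&page=0
```

Note that the `>` in the numeric filter is percent-encoded as `%3E`, and a keyword containing a space (such as "JetBrains AI") is encoded with a `+`.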

## Functions

This section defines all helper functions needed for the story download workflow, including API fetching with retry logic, file I/O operations, and orchestration functions that coordinate the entire process from keyword processing to JSON file creation.

In [None]:
import json
import os
import time
from typing import List, Dict, Any, Optional

import requests


def get_json_filepath(keyword: str) -> str:
    """Get the JSON filepath for a given keyword."""
    return os.path.join(OUTPUT_DIR, f"{keyword}.json")


def process_single_keyword(keyword: str) -> bool:
    """Process a single keyword: check skip conditions, fetch stories, and save."""
    print(f'\n--- Processing keyword: "{keyword}" ---')

    if should_skip_keyword(keyword):
        return True

    all_stories = fetch_all_stories_for_keyword(keyword)
    save_stories_to_json(keyword, all_stories)

    print(f'--- Finished processing keyword: "{keyword}" ---')
    return True


def should_skip_keyword(keyword: str) -> bool:
    """Check if keyword should be skipped (file already exists)."""
    json_filepath = get_json_filepath(keyword)
    if os.path.exists(json_filepath):
        print(
            f'  JSON file for "{keyword}" already exists at {json_filepath}. Skipping.'
        )
        return True
    return False


def fetch_all_stories_for_keyword(keyword: str) -> List[Dict[str, Any]]:
    """Fetch all pages of stories for a given keyword."""
    all_hits: List[Dict[str, Any]] = []
    current_page = 0

    while True:
        try:
            page_data = fetch_page_data(keyword, current_page)
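            if not page_data:
                # The page could not be fetched; keep whatever was collected so far.
                break

            hits = page_data.get("hits", [])
            if not hits:
                # An empty page means there are no further results for this keyword.
                break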

            all_hits.extend(hits)
            current_page += 1
            time.sleep(REQUEST_DELAY_SECONDS)

        except Exception as e:
            print(
                f"  Warning: Error occurred while fetching stories: {type(e).__name__} - {e}"
            )
            break

    return all_hits


def fetch_page_data(keyword: str, page_num: int) -> Optional[Dict[str, Any]]:
    """
    Fetches a single page of search results for a given keyword with retry logic.
    """
    params = {
        "query": keyword,
        "tags": TAGS,
        "numericFilters": NUMERIC_FILTERS,
        "hitsPerPage": HITS_PER_PAGE,
        "page": page_num,
    }

    for attempt in range(MAX_FETCH_RETRIES):
        print(
            f'  Fetching page {page_num + 1} for "{keyword}" (Attempt {attempt + 1}/{MAX_FETCH_RETRIES})...'
        )
        try:
            response = requests.get(
                BASE_URL, params=params, timeout=REQUEST_TIMEOUT_SECONDS
            )
            response.raise_for_status()
            return response.json()
        except json.JSONDecodeError as e:
            # A response that is not valid JSON is not retried; give up on this page.
            print(f"    Attempt {attempt + 1} FAILED: {type(e).__name__} - {e}")
            return None
        except requests.exceptions.RequestException as e:
            print(f"    Attempt {attempt + 1} FAILED: {type(e).__name__} - {e}")
            if attempt < MAX_FETCH_RETRIES - 1:
                print(f"    Retrying in {RETRY_DELAY_SECONDS} seconds...")
                time.sleep(RETRY_DELAY_SECONDS)
            else:
                print(
                    f'    All {MAX_FETCH_RETRIES} retries failed for page {page_num + 1} of keyword "{keyword}".'
                )
    return None


def save_stories_to_json(keyword: str, stories: List[Dict[str, Any]]) -> None:
    """
    Saves a list of story dictionaries to a single JSON file for the given keyword.
    """
    json_filepath = get_json_filepath(keyword)

    if not stories:
        print(f'  No stories were accumulated for keyword "{keyword}" to save.')
        if not os.path.exists(json_filepath):
            print(
                f'  JSON file for "{keyword}" will not be created as no data was fetched successfully.'
            )
        return

    try:
        with open(json_filepath, "w", encoding="utf-8") as f:
            json.dump(stories, f, indent=4, ensure_ascii=False)
        print(
            f'SUCCESS: Saved {len(stories)} stories for "{keyword}" to {json_filepath}'
        )
    except Exception as e:
        print(
            f'ERROR: An unexpected error occurred while saving JSON for "{keyword}". Error: {e}'
        )
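
The fetch loop above stops only when a page comes back empty, which typically costs one extra request per keyword. As an alternative sketch, the loop could consult pagination metadata in the response instead. This assumes the response includes zero-based `page` and total `nbPages` fields, as Algolia search responses generally do; `is_last_page` is a hypothetical helper and is not part of the notebook.

```python
from typing import Any, Dict


def is_last_page(page_data: Dict[str, Any]) -> bool:
    """Return True when the response says there are no further pages.

    Assumption: the response carries a zero-based "page" and a total "nbPages".
    If either field is missing, fall back to the notebook's own rule and
    treat an empty "hits" list as the end of the results.
    """
    page = page_data.get("page")
    nb_pages = page_data.get("nbPages")
    if page is None or nb_pages is None:
        return not page_data.get("hits")
    return page + 1 >= nb_pages
```

Inside `fetch_all_stories_for_keyword`, such a check would go right after `all_hits.extend(hits)`, breaking out of the loop when it returns `True`.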

## Main Processing Logic

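Iterate through each keyword, fetch all pages of stories, and save them to a JSON file specific to that keyword. Keywords whose JSON file already exists are skipped but still counted as processed in the final summary.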

In [None]:
if not os.path.exists(OUTPUT_DIR):
    os.makedirs(OUTPUT_DIR)
    print(f"Created output directory: {OUTPUT_DIR}")

print("--- Starting Hacker News Story Downloader ---")
total_keywords_processed = 0

for keyword in KEYWORDS_TO_SEARCH:
    if process_single_keyword(keyword):
        total_keywords_processed += 1

print(
    f"\n--- All {total_keywords_processed}/{len(KEYWORDS_TO_SEARCH)} keywords processed. "
    f"Check the '{OUTPUT_DIR}' directory for JSON files. ---"
)
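
Once the run completes, the per-keyword files can be read back for inspection or for a later deduplication step. The following is a minimal sketch, assuming each file is a JSON array as written by `save_stories_to_json`; `load_stories_by_keyword` is a hypothetical helper name.

```python
import json
import os
from typing import Any, Dict, List


def load_stories_by_keyword(output_dir: str) -> Dict[str, List[Dict[str, Any]]]:
    """Read every <keyword>.json file in output_dir into a dict keyed by keyword."""
    stories_by_keyword: Dict[str, List[Dict[str, Any]]] = {}
    for filename in sorted(os.listdir(output_dir)):
        keyword, extension = os.path.splitext(filename)
        if extension != ".json":
            continue  # Ignore anything that is not a story file
        with open(os.path.join(output_dir, filename), encoding="utf-8") as f:
            stories_by_keyword[keyword] = json.load(f)
    return stories_by_keyword
```

Calling `load_stories_by_keyword(OUTPUT_DIR)` after the main loop would give a quick way to print the number of stories saved per keyword.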