UPDATE August 7th, 2017: All Hacker News submissions are now available on BigQuery, and the dataset is updated daily. If you are scraping Hacker News data at scale, it may be more efficient to use BigQuery instead.
An example query to get the top 2,000 Hacker News submissions:
#standardSQL
SELECT title, score
FROM `bigquery-public-data.hacker_news.full`
WHERE type = 'story'
ORDER BY score DESC
LIMIT 2000
The web interface can only download up to 10,000 titles; you'll need to use an API to get more.
This repository contains simple Python scripts to download all Hacker News submissions and comments and store them in a PostgreSQL database, for use in ad-hoc data analysis. These scripts are optimized from the scripts used to gather data for my October 2014 blog post The Quality, Popularity, and Negativity of 5.6 Million Hacker News Comments. Parameters for connecting to the appropriate PostgreSQL database are set at the beginning of each file.
These scripts use the older Algolia API for Hacker News (instead of the official HN API) because it supports bulk requests and provides comment scores for most comments. Downloading and processing all Hacker News submissions takes about 2 hours; downloading and processing all Hacker News comments takes about 11 hours.
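As a rough illustration of what a bulk request against the Algolia API might look like, the sketch below builds the URL for one page of results and derives the cursor for the next page. It is not the code from this repository: the endpoint, the `hitsPerPage` value of 1000, the `numericFilters` syntax, and the helper names `build_page_url` and `next_cursor` are assumptions to verify against the Algolia HN Search API documentation.

```python
from urllib.parse import urlencode

# Assumed endpoint of the Algolia HN Search API; check the API docs before relying on it.
ALGOLIA_SEARCH_BY_DATE = "https://hn.algolia.com/api/v1/search_by_date"


def build_page_url(tag, before_ts, hits_per_page=1000):
    """Build the URL for one bulk page of items created strictly before `before_ts`.

    `tag` is assumed to be "story" or "comment"; `before_ts` is a Unix timestamp.
    """
    params = {
        "tags": tag,
        "hitsPerPage": hits_per_page,  # assumed maximum page size
        "numericFilters": f"created_at_i<{before_ts}",
    }
    return f"{ALGOLIA_SEARCH_BY_DATE}?{urlencode(params)}"


def next_cursor(hits):
    """Return the oldest timestamp in a page of hits, or None when the page is empty."""
    if not hits:
        return None
    return min(hit["created_at_i"] for hit in hits)
```

Paging backwards by timestamp like this avoids deep page offsets, but the strict `<` filter can skip items that share a timestamp with the last item of a page, so a real downloader would need to handle that boundary.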
Average point score for HN submissions, by hour (EST) of submission:
SELECT EXTRACT(hour from created_at) AS hour,
       AVG(num_points) AS avg_points
FROM hn_submissions
WHERE num_points IS NOT NULL
GROUP BY hour
Number of users who have made at least n comments, and the average point score for the nth comment a user makes:
SELECT nth_comment,
       COUNT(num_points) AS users_who_made_num_comments,
       AVG(num_points) AS avg_points
FROM (
    SELECT num_points,
           ROW_NUMBER() OVER (PARTITION BY author ORDER BY created_at ASC) AS nth_comment
    FROM hn_comments
    WHERE num_points IS NOT NULL
) AS foo
WHERE nth_comment <= 25
GROUP BY nth_comment
ORDER BY nth_comment
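To make the window-function logic concrete, here is a minimal pure-Python sketch of the same computation. It numbers each author's scored comments in chronological order and then aggregates by that position. The function name `nth_comment_stats` and the tuple layout are hypothetical, chosen only for illustration.

```python
from collections import defaultdict


def nth_comment_stats(comments, max_n=25):
    """Mirror the nth-comment query in plain Python.

    `comments` is an iterable of (author, created_at, num_points) tuples.
    Returns {n: (user_count, avg_points)} for each n up to `max_n`.
    """
    by_author = defaultdict(list)
    for author, created_at, num_points in comments:
        if num_points is not None:  # WHERE num_points IS NOT NULL
            by_author[author].append((created_at, num_points))

    totals = defaultdict(lambda: [0, 0])  # n -> [user count, sum of points]
    for rows in by_author.values():
        rows.sort(key=lambda row: row[0])  # ORDER BY created_at ASC
        # ROW_NUMBER() per author, limited to nth_comment <= max_n
        for n, (_, points) in enumerate(rows[:max_n], start=1):
            totals[n][0] += 1
            totals[n][1] += points

    return {n: (count, total / count) for n, (count, total) in sorted(totals.items())}
```

As in the SQL, comments without a score are dropped before numbering, so "nth comment" means the nth comment that has a recorded score.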
Create the Hacker News leaderboard of users with the most karma, the hard way (note that aggregated karma values will differ from true values due to vote obfuscation, among other things):
SELECT author,
       SUM(num_points) - COUNT(num_points) AS karma
FROM (
    SELECT author, num_points FROM hn_submissions
    UNION ALL
    SELECT author, num_points FROM hn_comments
) AS foo
WHERE num_points IS NOT NULL
GROUP BY author
ORDER BY karma DESC
LIMIT 25
Known Data Fidelity Caveats
Unfortunately, there are a few issues with the source data, which the scripts attempt to mitigate:
- Hacker News automatically converts certain punctuation in submissions and comments into stylistic Unicode (e.g. "smart quotes"), which cannot be stored in the database; the scripts convert the punctuation back to UTF-8.
- Comments contain style and link HTML; the scripts attempt to strip it.
- On the server side, there are gaps of missing submission and comment data before 2010.
- Comment scores are hidden server-side for comments after October 2014; this is coincidentally the month my blog post and the official API were published.
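The two text-cleaning mitigations above could be sketched roughly as follows. This is not the repository's actual code: it assumes "converting back" means mapping smart punctuation to plain equivalents, the character table is an illustrative subset, and `clean_comment` is a hypothetical helper name.

```python
import html
import re

# Illustrative subset of "smart" punctuation mapped to plain equivalents (an assumption).
SMART_PUNCTUATION = str.maketrans({
    "\u2018": "'",   # left single quote
    "\u2019": "'",   # right single quote
    "\u201c": '"',   # left double quote
    "\u201d": '"',   # right double quote
    "\u2013": "-",   # en dash
    "\u2026": "...", # ellipsis
})

# Naive tag matcher; adequate for simple style and link markup, not arbitrary HTML.
TAG_RE = re.compile(r"<[^>]+>")


def clean_comment(text):
    """Strip HTML tags, decode HTML entities, and normalize smart punctuation."""
    text = TAG_RE.sub("", text)   # remove tags first so escaped "&lt;" text survives
    text = html.unescape(text)    # "&amp;" -> "&", "&#x27;" -> "'"
    return text.translate(SMART_PUNCTUATION)
```

For example, `clean_comment('<i>\u201chello\u201d</i> &amp; bye')` yields `"hello" & bye`. A regex-based stripper is a simplification; a proper HTML parser would be more robust for malformed markup.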