# Exploring Hacker News Posts

Once more, this notebook works through a guided project from DataQuest, but uses Pandas instead of plain Python.

In this project, we'll work with a dataset of submissions to the popular technology site Hacker News.

Hacker News is a site started by the startup incubator Y Combinator, where user-submitted stories (known as "posts") receive votes and comments, similar to Reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of the Hacker News listings can get hundreds of thousands of visitors as a result.

You can find the data set here, but note that it has been reduced from almost 300,000 rows to approximately 20,000 rows by removing all submissions that didn't receive any comments and then randomly sampling from the remaining submissions.

Below are descriptions of the columns:

* `id`: the unique identifier from Hacker News for the post
* `title`: the title of the post
* `url`: the URL that the post links to, if the post has a URL
* `num_points`: the number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
* `num_comments`: the number of comments on the post
* `author`: the username of the person who submitted the post
* `created_at`: the date and time of the post's submission. The format is "%m/%d/%Y %H:%M"

In [1]:
%matplotlib inline

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from pathlib import Path

In [2]:
data_path = Path.home() / "datasets" / "tabular_practice"
df_hn = pd.read_csv(data_path / "hacker_news.csv")
df_hn.head()

Unnamed: 0,id,title,url,num_points,num_comments,author,created_at
0,12224879,Interactive Dynamic Video,http://www.interactivedynamicvideo.com/,386,52,ne0phyte,8/4/2016 11:52
1,10975351,How to Use Open Source and Shut the Fuck Up at...,http://hueniverse.com/2016/01/26/how-to-use-op...,39,10,josep2,1/26/2016 19:30
2,11964716,Florida DJs May Face Felony for April Fools' W...,http://www.thewire.com/entertainment/2013/04/f...,2,1,vezycash,6/23/2016 22:20
3,11919867,Technology ventures: From Idea to Enterprise,https://www.amazon.com/Technology-Ventures-Ent...,3,1,hswarna,6/17/2016 0:01
4,10301696,Note by Note: The Making of Steinway L1037 (2007),http://www.nytimes.com/2007/11/07/movies/07ste...,8,2,walterbell,9/30/2015 4:12


In [3]:
df_hn.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20100 entries, 0 to 20099
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   id            20100 non-null  int64 
 1   title         20100 non-null  object
 2   url           17660 non-null  object
 3   num_points    20100 non-null  int64 
 4   num_comments  20100 non-null  int64 
 5   author        20100 non-null  object
 6   created_at    20100 non-null  object
dtypes: int64(3), object(4)
memory usage: 1.1+ MB


Let us convert the "created_at" values into `datetime`, so we can work with them later.

In [4]:
created_at_format = "%m/%d/%Y %H:%M"
df_hn["created_at"] = pd.to_datetime(df_hn["created_at"], format=created_at_format)
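
As a minimal sketch of what the conversion does, the snippet below parses a few toy strings in the same layout as "created_at". The series `raw` and the deliberately broken entry are made up for illustration, and `errors="coerce"` is an optional extra that the cell above does not use; it turns unparseable values into `NaT` instead of raising an error.

```python
import pandas as pd

# Toy values (hypothetical) in the same "%m/%d/%Y %H:%M" layout as created_at.
raw = pd.Series(["8/4/2016 11:52", "1/26/2016 19:30", "not a date"])

# An explicit format avoids day/month ambiguity; errors="coerce" maps
# entries that do not match the format to NaT instead of raising.
parsed = pd.to_datetime(raw, format="%m/%d/%Y %H:%M", errors="coerce")

print(parsed)
print("unparseable rows:", parsed.isna().sum())
```

Counting the `NaT` values afterwards is a quick way to see whether any rows failed to match the expected format.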

In [5]:
df_hn.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20100 entries, 0 to 20099
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   id            20100 non-null  int64         
 1   title         20100 non-null  object        
 2   url           17660 non-null  object        
 3   num_points    20100 non-null  int64         
 4   num_comments  20100 non-null  int64         
 5   author        20100 non-null  object        
 6   created_at    20100 non-null  datetime64[ns]
dtypes: datetime64[ns](1), int64(3), object(3)
memory usage: 1.1+ MB


In [6]:
df_hn.head()

Unnamed: 0,id,title,url,num_points,num_comments,author,created_at
0,12224879,Interactive Dynamic Video,http://www.interactivedynamicvideo.com/,386,52,ne0phyte,2016-08-04 11:52:00
1,10975351,How to Use Open Source and Shut the Fuck Up at...,http://hueniverse.com/2016/01/26/how-to-use-op...,39,10,josep2,2016-01-26 19:30:00
2,11964716,Florida DJs May Face Felony for April Fools' W...,http://www.thewire.com/entertainment/2013/04/f...,2,1,vezycash,2016-06-23 22:20:00
3,11919867,Technology ventures: From Idea to Enterprise,https://www.amazon.com/Technology-Ventures-Ent...,3,1,hswarna,2016-06-17 00:01:00
4,10301696,Note by Note: The Making of Steinway L1037 (2007),http://www.nytimes.com/2007/11/07/movies/07ste...,8,2,walterbell,2015-09-30 04:12:00


Next, we would like to analyze ask posts, show posts, and all other posts separately. The first two types are identified by a title that starts with "ask hn" or "show hn", ignoring case.

In [7]:
titles = df_hn["title"]
indicator_ask = titles.str.lower().str.startswith("ask hn")
ask_posts = df_hn[indicator_ask]
indicator_show = titles.str.lower().str.startswith("show hn")
show_posts = df_hn[indicator_show]
indicator_other = ~(indicator_ask | indicator_show)
other_posts = df_hn[indicator_other]
len(ask_posts), len(show_posts), len(other_posts)

(1744, 1162, 17194)
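
The masking logic can be checked on a handful of made-up titles. This is only a sketch on a hypothetical series `toy_titles`; it illustrates that lowercasing makes the prefix test case-insensitive, and that a title merely containing "ask hn" somewhere in the middle is not counted as an ask post.

```python
import pandas as pd

# Hypothetical titles, chosen to exercise the edge cases.
toy_titles = pd.Series([
    "Ask HN: How to improve my personal website?",
    "ASK HN: upper-case prefix",
    "Show HN: Something pointless I made",
    "Why I ask HN readers first",  # contains "ask hn" but does not start with it
    "Interactive Dynamic Video",
])

lowered = toy_titles.str.lower()
is_ask = lowered.str.startswith("ask hn")
is_show = lowered.str.startswith("show hn")
is_other = ~(is_ask | is_show)

# Every title falls into exactly one of the three groups.
print(is_ask.sum(), is_show.sum(), is_other.sum())
```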

In [8]:
ask_posts.head(10)

Unnamed: 0,id,title,url,num_points,num_comments,author,created_at
7,12296411,Ask HN: How to improve my personal website?,,2,6,ahmedbaracat,2016-08-16 09:55:00
17,10610020,Ask HN: Am I the only one outraged by Twitter ...,,28,29,tkfx,2015-11-22 13:43:00
22,11610310,Ask HN: Aby recent changes to CSS that broke m...,,1,1,polskibus,2016-05-02 10:14:00
30,12210105,Ask HN: Looking for Employee #3 How do I do it?,,1,3,sph130,2016-08-02 14:20:00
31,10394168,Ask HN: Someone offered to buy my browser exte...,,28,17,roykolak,2015-10-15 16:38:00
49,10284812,"Ask HN: Limiting CPU, memory, and I/O usage on...",,2,1,zatkin,2015-09-26 23:23:00
51,11548576,Ask HN: Which framework for a CRUD app in 2016?,,4,4,deafcalculus,2016-04-22 12:24:00
65,10573430,Ask HN: Enter market with a well-funded compet...,,2,1,sparkling,2015-11-16 09:22:00
70,11168708,Ask HN: Do you use any realtime PaaS/framework...,,2,1,stemuk,2016-02-24 17:57:00
118,11837056,Ask HN: Is there a home Dropbox-style solution...,,3,2,coreyp_1,2016-06-04 17:17:00


In [9]:
show_posts.head(10)

Unnamed: 0,id,title,url,num_points,num_comments,author,created_at
13,10627194,Show HN: Wio Link ESP8266 Based Web of Things...,https://iot.seeed.cc,26,22,kfihihc,2015-11-25 14:03:00
39,10646440,Show HN: Something pointless I made,http://dn.ht/picklecat/,747,102,dhotson,2015-11-29 22:46:00
46,11590768,"Show HN: Shanhu.io, a programming playground p...",https://shanhu.io,1,1,h8liu,2016-04-28 18:05:00
84,12178806,Show HN: Webscope Easy way for web developers...,http://webscopeapp.com,3,3,fastbrick,2016-07-28 07:11:00
97,10872799,Show HN: GeoScreenshot Easily test Geo-IP bas...,https://www.geoscreenshot.com/,1,9,kpsychwave,2016-01-09 20:45:00
114,11237259,Show HN: Run with Mark (Runkeeper only),http://runwithmark.github.io/#/,3,3,ecesena,2016-03-07 05:17:00
120,10603601,Show HN: Send an email from your shell to your...,https://ping.registryd.com,4,1,ybrs,2015-11-20 20:23:00
125,11370446,Show HN: Underline.js is like underscore.js bu...,http://ankurp.github.io/underline/?hn,8,1,agp2572,2016-03-27 16:19:00
127,10284074,Show HN: Real-Time Stats for an iOS MMORPG Gam...,http://aftermath.io/this-is-not-a-blog-wordpre...,6,1,ZaneClaes,2015-09-26 19:02:00
129,12255593,Show HN: Bild A collection of image processin...,https://github.com/anthonynsimon/bild,2,2,amzans,2016-08-09 16:11:00


Instead of creating three dataframes, it is simpler to add a categorical column `post_type` with the values "ask", "show", and "other".

In [11]:
df_hn["post_type"] = "other"
new_column = df_hn["post_type"].copy()
new_column.where(~indicator_ask, "ask", inplace=True)
new_column.where(~indicator_show, "show", inplace=True)
df_hn["post_type"] = new_column.astype("category")
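
An alternative way to build such a column, shown here only as a sketch on a made-up frame `toy`, is `np.select`, which picks a label per row from a list of conditions and falls back to a default. This is not the approach used in the cell above, just one possible equivalent.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df_hn with only the title column.
toy = pd.DataFrame({
    "title": ["Ask HN: A?", "Show HN: B", "Plain story", "ask hn: lower case"],
})
lowered = toy["title"].str.lower()

# Conditions are checked in order; rows matching none get the default label.
conditions = [
    lowered.str.startswith("ask hn").to_numpy(),
    lowered.str.startswith("show hn").to_numpy(),
]
labels = np.select(conditions, ["ask", "show"], default="other")

toy["post_type"] = pd.Categorical(labels)
print(toy)
```

With more than two post types, this form only needs one more entry in each list rather than another `where` call.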

In [12]:
df_hn.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20100 entries, 0 to 20099
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   id            20100 non-null  int64         
 1   title         20100 non-null  object        
 2   url           17660 non-null  object        
 3   num_points    20100 non-null  int64         
 4   num_comments  20100 non-null  int64         
 5   author        20100 non-null  object        
 6   created_at    20100 non-null  datetime64[ns]
 7   post_type     20100 non-null  category      
dtypes: category(1), datetime64[ns](1), int64(3), object(3)
memory usage: 1.1+ MB


In [13]:
df_hn.head()

Unnamed: 0,id,title,url,num_points,num_comments,author,created_at,post_type
0,12224879,Interactive Dynamic Video,http://www.interactivedynamicvideo.com/,386,52,ne0phyte,2016-08-04 11:52:00,other
1,10975351,How to Use Open Source and Shut the Fuck Up at...,http://hueniverse.com/2016/01/26/how-to-use-op...,39,10,josep2,2016-01-26 19:30:00,other
2,11964716,Florida DJs May Face Felony for April Fools' W...,http://www.thewire.com/entertainment/2013/04/f...,2,1,vezycash,2016-06-23 22:20:00,other
3,11919867,Technology ventures: From Idea to Enterprise,https://www.amazon.com/Technology-Ventures-Ent...,3,1,hswarna,2016-06-17 00:01:00,other
4,10301696,Note by Note: The Making of Steinway L1037 (2007),http://www.nytimes.com/2007/11/07/movies/07ste...,8,2,walterbell,2015-09-30 04:12:00,other


In [18]:
df_hn.groupby("post_type", observed=True)["id"].count()

post_type
ask       1744
other    17194
show      1162
Name: id, dtype: int64

Let us analyze the average number of comments for each post type.

In [19]:
df_hn.groupby("post_type", observed=True)["num_comments"].mean()

post_type
ask      14.038417
other    26.873037
show     10.316695
Name: num_comments, dtype: float64
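
The count and the mean can also be computed in one pass with named aggregation. The following is a sketch on a small invented frame `toy`; the output column names `n_posts` and `mean_comments` are our own choice, not names from the dataset.

```python
import pandas as pd

# Hypothetical miniature of df_hn with only the columns needed here.
toy = pd.DataFrame({
    "post_type": pd.Categorical(["ask", "ask", "show", "other", "other", "other"]),
    "num_comments": [6, 10, 3, 52, 10, 1],
})

# Named aggregation: each keyword becomes an output column.
summary = (
    toy.groupby("post_type", observed=True)["num_comments"]
       .agg(n_posts="count", mean_comments="mean")
)
print(summary)
```

Seeing the group size next to the mean helps to judge how much weight each average deserves.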

In what follows, we will look at ask posts first, because they receive more comments on average than show posts. Let us look at the average number of comments and the number of posts, depending on the hour at which the post was created.

Just as the `str` accessor offers functions on strings, the `dt` accessor offers functions on datetime values. Also note that `groupby` accepts a series as its argument, not only a column name. Instead of creating a new column for the hour of "created_at", we can create the series for grouping on the fly.

In [33]:
ask_posts.groupby(ask_posts["created_at"].dt.hour).agg({"num_comments": "mean", "id": "count"}).sort_values("num_comments", ascending=False)

Unnamed: 0_level_0,num_comments,id
created_at,Unnamed: 1_level_1,Unnamed: 2_level_1
15,38.594828,116
2,23.810345,58
20,21.525,80
16,16.796296,108
21,16.009174,109
13,14.741176,85
10,13.440678,59
14,13.233645,107
18,13.201835,109
17,11.46,100


There are more posts during the afternoon and evening (100 or more per hour between 14 and 19, and at 21), with the largest number at 15. Posts created at 15 also receive the most comments on average (38.6), followed by posts created at 2 in the morning (23.8).

According to these results, ask posts created around 3 p.m., 2 a.m., and 8 p.m. receive the most comments on average. It is not clear whether this is a causal relationship, though.
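
The hourly table relies on grouping by a derived series rather than by an existing column. A minimal sketch of that technique, on a made-up frame `toy` with invented values, could look as follows. The names `hour`, `mean_comments`, and `n_posts` are ours.

```python
import pandas as pd

# Hypothetical posts; only the columns used by the aggregation.
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "num_comments": [6, 29, 1, 3],
    "created_at": pd.to_datetime([
        "2016-08-16 09:55", "2015-11-22 13:43",
        "2016-05-02 13:14", "2016-08-02 09:20",
    ]),
})

# The grouping key is a series aligned with toy's index, not a column of toy.
hour = toy["created_at"].dt.hour.rename("hour")

by_hour = (
    toy.groupby(hour)
       .agg(mean_comments=("num_comments", "mean"), n_posts=("id", "count"))
       .sort_values("mean_comments", ascending=False)
)
print(by_hour)
```

Renaming the key series gives the resulting index a readable name instead of reusing "created_at".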

What about the other types of posts?

In [35]:
show_posts.groupby(show_posts["created_at"].dt.hour).agg({"num_comments": "mean", "id": "count"}).sort_values("num_comments", ascending=False)

Unnamed: 0_level_0,num_comments,id
created_at,Unnamed: 1_level_1,Unnamed: 2_level_1
18,15.770492,61
0,15.709677,31
14,13.44186,86
23,12.416667,36
22,12.391304,46
12,11.803279,61
16,11.655914,93
7,11.5,26
11,11.159091,44
3,10.62963,27


In [36]:
other_posts.groupby(other_posts["created_at"].dt.hour).agg({"num_comments": "mean", "id": "count"}).sort_values("num_comments", ascending=False)

Unnamed: 0_level_0,num_comments,id
created_at,Unnamed: 1_level_1,Unnamed: 2_level_1
14,32.330898,958
13,30.896514,918
12,30.347275,789
11,29.593939,660
15,29.519231,1040
17,27.995723,1169
2,27.786848,441
9,27.588015,534
0,27.076923,611
8,27.02621,496
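
Instead of repeating the aggregation once per dataframe, the post type and the hour can be combined into a single grouping. The sketch below uses an invented frame `toy` and assumes a `post_type` column like the one added earlier. The final step, which keeps the hour with the highest mean for each type, is our own addition.

```python
import pandas as pd

# Hypothetical posts with a post_type column.
toy = pd.DataFrame({
    "post_type": ["ask", "ask", "show", "other", "other"],
    "num_comments": [6, 2, 22, 52, 10],
    "created_at": pd.to_datetime([
        "2016-08-16 09:55", "2016-08-17 09:10", "2015-11-25 14:03",
        "2016-08-04 11:52", "2016-01-26 11:30",
    ]),
})

hour = toy["created_at"].dt.hour.rename("hour")

# A column label and a derived series can be mixed in one grouping list.
per_type_hour = (
    toy.groupby(["post_type", hour])["num_comments"]
       .agg(mean_comments="mean", n_posts="count")
)

# For each post type, keep the hour with the highest mean number of comments.
top = (
    per_type_hour.sort_values("mean_comments", ascending=False)
                 .groupby(level="post_type")
                 .head(1)
)
print(top)
```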
