# Hacker News Posts

Do posts on the Hacker News website with the tags "Ask HN" and "Show HN" receive more comments on average than posts without these tags?

We will try to answer this question by analyzing data from about 20,000 recent posts on the site.

In [1]:
import csv
# Read every row of the CSV into a list; the context manager closes the file.
with open("hacker_news.csv", newline="") as data_set:
    hn = list(csv.reader(data_set))
print(hn[:5])

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]


In [2]:
headers = hn[0]
hn = hn[1:]
hn[0:5]

[['12224879',
  'Interactive Dynamic Video',
  'http://www.interactivedynamicvideo.com/',
  '386',
  '52',
  'ne0phyte',
  '8/4/2016 11:52'],
 ['10975351',
  'How to Use Open Source and Shut the Fuck Up at the Same Time',
  'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
  '39',
  '10',
  'josep2',
  '1/26/2016 19:30'],
 ['11964716',
  "Florida DJs May Face Felony for April Fools' Water Joke",
  'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/',
  '2',
  '1',
  'vezycash',
  '6/23/2016 22:20'],
 ['11919867',
  'Technology ventures: From Idea to Enterprise',
  'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
  '3',
  '1',
  'hswarna',
  '6/17/2016 0:01'],
 ['10301696',
  'Note by Note: The Making of Steinway L1037 (2007)',
  'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0',
  '8',
  '2',
  'walterbell',
  '9/30/2015 4:12']]

In [3]:
import re
pattern1 = r"^Ask HN"
pattern2 = r"^Show HN"
ask_posts = []
show_posts = []
other_posts = []
for row in hn:
    title = row[1]
    match1 = re.search(pattern1, title, flags=re.I)
    match2 = re.search(pattern2, title, flags=re.I)
    if match1:
        ask_posts.append(row)
    elif match2:
        show_posts.append(row)
    else:
        other_posts.append(row)
print(len(ask_posts), len(show_posts), len(other_posts))

1744 1162 17194


In [32]:
ask_comments = [int(row[4]) for row in ask_posts]
show_comments = [int(row[4]) for row in show_posts]
other_comments = [int(row[4]) for row in other_posts]
avg_ask_comments = sum(ask_comments) / len(ask_comments)
avg_show_comments = sum(show_comments) / len(show_comments)
avg_other_comments = sum(other_comments) / len(other_comments)
print("Average 'Ask HN' comments: {:.1f}\nAverage 'Show HN' comments: {:.1f}\nAverage 'Other' comments: {:.1f}".format(avg_ask_comments, avg_show_comments, avg_other_comments))

Average 'Ask HN' comments: 14.0
Average 'Show HN' comments: 10.3
Average 'Other' comments: 26.9


As shown above, the average 'Ask HN' post receives roughly 36% more comments (14.0) than the average 'Show HN' post (10.3).  Intuitively, this makes sense: an 'Ask HN' post actively solicits feedback (i.e., answers), whereas a 'Show HN' post does not directly ask for responses.  Interestingly, the average 'Other' post (neither 'Ask HN' nor 'Show HN') receives far more comments (26.9) than either tagged group, so in this data set the tags are not associated with more comments on average.  Next, we'll explore how the hour a post is created affects the number of comments on 'Ask HN' posts.
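The relative gaps between these averages can be recomputed from the rounded figures printed above. The snippet below is a minimal sketch for checking the arithmetic; the helper name `pct_more` is hypothetical and not part of the original notebook.

```python
# Rounded averages copied from the printed output above.
avg_ask, avg_show, avg_other = 14.0, 10.3, 26.9

def pct_more(a, b):
    """Return how much larger a is than b, as a percentage of b."""
    return (a - b) / b * 100

ask_vs_show = pct_more(avg_ask, avg_show)
other_vs_ask = pct_more(avg_other, avg_ask)
print("'Ask HN' vs 'Show HN': {:.0f}% more comments".format(ask_vs_show))
print("'Other' vs 'Ask HN': {:.0f}% more comments".format(other_vs_ask))
```

Because the inputs are rounded to one decimal place, the percentages are approximate.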

In [19]:
import datetime as dt
created_date = [row[6] for row in ask_posts]
counts_by_hour = {}    # number of 'Ask HN' posts created in each hour
comments_by_hour = {}  # total comments on 'Ask HN' posts created in each hour
for date_str, comment in zip(created_date, ask_comments):
    # Timestamps look like '8/4/2016 11:52'; keep the zero-padded hour, e.g. '11'.
    hour = dt.datetime.strptime(date_str, '%m/%d/%Y %H:%M').strftime('%H')
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = comment
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += comment

In [25]:
avg_by_hour = [[i, comments_by_hour[i] / counts_by_hour[i]] for i in comments_by_hour]

In [26]:
print(avg_by_hour)

[['08', 10.25], ['23', 7.985294117647059], ['03', 7.796296296296297], ['10', 13.440677966101696], ['12', 9.41095890410959], ['01', 11.383333333333333], ['16', 16.796296296296298], ['15', 38.5948275862069], ['22', 6.746478873239437], ['11', 11.051724137931034], ['14', 13.233644859813085], ['09', 5.5777777777777775], ['17', 11.46], ['06', 9.022727272727273], ['19', 10.8], ['18', 13.20183486238532], ['21', 16.009174311926607], ['05', 10.08695652173913], ['04', 7.170212765957447], ['07', 7.852941176470588], ['00', 8.127272727272727], ['20', 21.525], ['13', 14.741176470588234], ['02', 23.810344827586206]]


In [31]:
# Sort hours by average comments, highest first, and show the top five.
avg_by_hour = sorted(avg_by_hour, key=lambda x: x[1], reverse=True)
for hour, avg in avg_by_hour[:5]:
    hour_label = dt.datetime.strptime(hour, '%H').strftime('%H:%M')
    print("{}: {:.2f} average comments per post".format(hour_label, avg))

15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post


As seen above, 'Ask HN' posts created in the afternoon and evening hours, Eastern Time (ET), tend to receive more comments on average: 4 of the top 5 hours fall between 3:00 PM and 9:00 PM ET.  Not surprisingly, these are prime hours of activity for major East Coast population centers such as Manhattan, Philadelphia and Boston.  Additionally, most people in West Coast hubs such as LA and San Francisco are awake during these hours.  And in the later hours of this period (e.g., 7:00 PM to 9:00 PM), many people in China are awake.  Thus, to maximize comments, users should consider posting during these hours.  Interestingly, the hour with the second-highest average is 2:00 AM ET.  Perhaps this is driven by populations outside of the US.  We should explore the data set further to gain more insight.
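One way to examine the time-zone reasoning above is to translate an Eastern Time hour into local clock times elsewhere. The following is a minimal sketch and not part of the original analysis: the helper `local_times`, the choice of cities, and the fixed UTC offsets (daylight-saving offsets for the US zones, UTC+8 for China) are assumptions made for illustration, so the US results shift by an hour for dates outside daylight saving time.

```python
import datetime as dt

# Assumed fixed UTC offsets (a simplification): US daylight saving time
# for the Eastern and Pacific zones, and UTC+8 for China.
EASTERN = dt.timezone(dt.timedelta(hours=-4))
OTHER_ZONES = {
    "San Francisco": dt.timezone(dt.timedelta(hours=-7)),
    "Beijing": dt.timezone(dt.timedelta(hours=8)),
}

def local_times(et_hour):
    """Map an Eastern Time hour (0-23) to the clock time in each other zone."""
    # The calendar date is arbitrary because the offsets are fixed.
    stamp = dt.datetime(2016, 6, 15, et_hour, tzinfo=EASTERN)
    return {
        city: stamp.astimezone(zone).strftime("%H:%M")
        for city, zone in OTHER_ZONES.items()
    }

# Hours taken from the top-5 list above.
for et_hour in (15, 2, 20):
    print("{:02d}:00 ET ->".format(et_hour), local_times(et_hour))
```

Under these assumptions, 2:00 AM ET corresponds to 2:00 PM in Beijing, which is consistent with the idea that activity outside the US may contribute to that hour, although the data set itself does not record where commenters are located.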