# What type of post performs best on Hacker News?

We will analyze which type of Hacker News post performs best, measured by the number of comments received.

The analysis focuses on two types of post: Ask HN (in which the poster asks the community a question) and Show HN (in which the poster shares something with the community). We will explore whether the two types perform differently, and how performance is influenced by the time of posting.

We will use a dataset (`hacker_news.csv`) that is a random sample of a larger Hacker News posts dataset and does not include posts with no comments.

We read the file and transform it into a list of lists to start exploring it.

In [35]:
import csv

# Read the CSV into a list of lists; the context manager closes the file.
with open('hacker_news.csv') as opened_file:
    read_file = csv.reader(opened_file)
    hn = list(read_file)

for row in hn[:5]:
    print(row)

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']
['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']
['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']
['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']


We separate the headers from the content.

In [3]:
headers = hn[0]
hn = hn[1:]

print(headers)
print(hn[:5])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]


Since we're only interested in posts whose titles begin with Ask HN or Show HN, we create new lists of lists containing just the data for those titles.

In [4]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    title = title.lower()
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)

print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))

1744
1162
17194


Now we find the average number of comments for each type of post, beginning with Ask HN posts.

In [5]:
total_ask_comments = 0
for row in ask_posts:
    num_comments = int(row[4])
    total_ask_comments += num_comments

avg_ask_comments = total_ask_comments / len(ask_posts)
print(avg_ask_comments)


14.038417431192661


And we do the same for Show HN posts.

In [6]:
total_show_comments = 0
for row in show_posts:
    num_comments = int(row[4])
    total_show_comments += num_comments

avg_show_comments = total_show_comments / len(show_posts)
print(avg_show_comments)

10.31669535283993


We see that Ask HN posts receive, on average, about 36% more comments than Show HN posts (14.04 versus 10.32).
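The two averaging cells differ only in the list they iterate over, so the calculation can be factored out. The following is a minimal sketch, not part of the original notebook: `average_comments` and `percent_more` are hypothetical helper names, and the sample rows are made up purely to exercise the function. The two averages are the values printed by the cells above.

```python
def average_comments(posts, comments_index=4):
    """Return the mean of the comment counts stored as strings in each row."""
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)


def percent_more(a, b):
    """Return how much larger a is than b, as a percentage of b."""
    return (a / b - 1) * 100


# Made-up rows in the same column layout as the dataset (illustration only).
sample_posts = [
    ['1', 'Ask HN: example one', '', '5', '10', 'user_a', '1/1/2016 10:00'],
    ['2', 'Ask HN: example two', '', '7', '20', 'user_b', '1/2/2016 11:00'],
]
print(average_comments(sample_posts))  # 15.0

# Averages printed by the two cells above.
avg_ask_comments = 14.038417431192661
avg_show_comments = 10.31669535283993
print(round(percent_more(avg_ask_comments, avg_show_comments)))  # 36
```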

Given that, we will analyze which time of day is optimal for publishing an Ask HN post. First we extract the creation time and the number of comments for each Ask HN post.

In [7]:
import datetime as dt

ask_posts_date_comments = []

for post in ask_posts:
    post_date_comments = [post[6], int(post[4])]
    ask_posts_date_comments.append(post_date_comments)

Next, we count the number of posts and the number of comments by hour.

In [18]:
counts_by_hour = {}
comments_by_hour = {}

for row in ask_posts_date_comments:
    # Parse the creation timestamp and keep only the hour, as a two-digit string.
    created_at = dt.datetime.strptime(row[0], '%m/%d/%Y %H:%M')
    hour_string = created_at.strftime('%H')
    if hour_string not in counts_by_hour:
        counts_by_hour[hour_string] = 1
        comments_by_hour[hour_string] = row[1]
    else:
        counts_by_hour[hour_string] += 1
        comments_by_hour[hour_string] += row[1]

print(counts_by_hour)
print(comments_by_hour)

{'17': 100, '04': 47, '18': 109, '05': 46, '00': 55, '21': 109, '23': 68, '15': 116, '11': 58, '08': 48, '01': 60, '06': 44, '13': 85, '03': 54, '02': 58, '12': 73, '19': 110, '16': 108, '20': 80, '14': 107, '09': 45, '07': 34, '10': 59, '22': 71}
{'17': 1146, '04': 337, '18': 1439, '05': 464, '00': 447, '21': 1745, '23': 543, '15': 4477, '11': 641, '08': 492, '01': 683, '06': 397, '13': 1253, '03': 421, '02': 1381, '12': 687, '19': 1188, '16': 1814, '20': 1722, '14': 1416, '09': 251, '07': 267, '10': 793, '22': 479}


... and we calculate the average number of comments per post for each hour.

In [19]:
avg_posts_by_hour = []
    
for hour in counts_by_hour:
    avg_posts_by_hour.append([hour, comments_by_hour[hour] / counts_by_hour[hour]])

print(avg_posts_by_hour)

[['17', 11.46], ['04', 7.170212765957447], ['18', 13.20183486238532], ['05', 10.08695652173913], ['00', 8.127272727272727], ['21', 16.009174311926607], ['23', 7.985294117647059], ['15', 38.5948275862069], ['11', 11.051724137931034], ['08', 10.25], ['01', 11.383333333333333], ['06', 9.022727272727273], ['13', 14.741176470588234], ['03', 7.796296296296297], ['02', 23.810344827586206], ['12', 9.41095890410959], ['19', 10.8], ['16', 16.796296296296298], ['20', 21.525], ['14', 13.233644859813085], ['09', 5.5777777777777775], ['07', 7.852941176470588], ['10', 13.440677966101696], ['22', 6.746478873239437]]
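The counting and averaging steps can also be written as one function. This is a sketch under assumptions, not the notebook's own code: `average_comments_by_hour` is a hypothetical name, it uses `collections.defaultdict` in place of the `if`/`else` initialization, and the three sample rows are made up (their timestamps follow the dataset's `created_at` format).

```python
from collections import defaultdict
import datetime as dt


def average_comments_by_hour(date_comment_pairs):
    """Map each two-digit hour string to the average comments per post."""
    counts = defaultdict(int)
    comments = defaultdict(int)
    for created_at, num_comments in date_comment_pairs:
        hour = dt.datetime.strptime(created_at, '%m/%d/%Y %H:%M').strftime('%H')
        counts[hour] += 1
        comments[hour] += num_comments
    return {hour: comments[hour] / counts[hour] for hour in counts}


# Made-up [created_at, num_comments] pairs (illustration only).
sample = [
    ['8/4/2016 11:52', 52],
    ['8/5/2016 11:10', 8],
    ['1/26/2016 19:30', 10],
]
print(average_comments_by_hour(sample))  # {'11': 30.0, '19': 10.0}
```

Returning a dictionary keeps the hour as the lookup key; it can be turned into a list of pairs with `list(result.items())` when an ordered view is needed.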


Just by looking at the results we can see there are differences, but sorting the list makes them clearer. We will sort it by average number of comments, in descending order.

In [47]:
swap_avg_posts_by_hour = []

for post in avg_posts_by_hour:
    swap_avg_posts_by_hour.append([post[1], post[0]])

sorted_avg_by_hour = sorted(swap_avg_posts_by_hour, reverse = True)

for avg in sorted_avg_by_hour:
    print(avg)

[38.5948275862069, '15']
[23.810344827586206, '02']
[21.525, '20']
[16.796296296296298, '16']
[16.009174311926607, '21']
[14.741176470588234, '13']
[13.440677966101696, '10']
[13.233644859813085, '14']
[13.20183486238532, '18']
[11.46, '17']
[11.383333333333333, '01']
[11.051724137931034, '11']
[10.8, '19']
[10.25, '08']
[10.08695652173913, '05']
[9.41095890410959, '12']
[9.022727272727273, '06']
[8.127272727272727, '00']
[7.985294117647059, '23']
[7.852941176470588, '07']
[7.796296296296297, '03']
[7.170212765957447, '04']
[6.746478873239437, '22']
[5.5777777777777775, '09']
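Swapping the columns works because lists compare element by element, so the average ends up as the primary sort field. An alternative, shown here as a sketch, is to leave the pairs as they are and pass a `key` function to `sorted`. The three pairs below are copied from the averages computed earlier; `by_average` is a name introduced for this example.

```python
from operator import itemgetter

# Three [hour, average] pairs taken from the averages computed earlier.
avg_posts_by_hour = [
    ['17', 11.46],
    ['15', 38.5948275862069],
    ['02', 23.810344827586206],
]

# Sort on the second element (the average) without swapping the columns.
by_average = sorted(avg_posts_by_hour, key=itemgetter(1), reverse=True)

for hour, average in by_average:
    print(hour, average)
```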


So, in conclusion, the best hours to publish Ask HN posts, according to this dataset, are... *drumroll*

In [53]:
print('Top 5 Hours for Ask HN comments')

for item in sorted_avg_by_hour[:5]:
    hour = dt.datetime.strptime(item[1], '%H')
    hour = hour.strftime('%H:%M')
    average_comments = item[0]
    print('{hr}: {avg:.2f} average comments per post'.format(hr = hour, avg = average_comments))

Top 5 Hours for Ask HN comments
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post


Great! Now we know the best times to publish an Ask HN post if you want to receive the most comments.

Note that these times are in the US Eastern Time zone, so if you want to take advantage of the best posting times you have to convert them to your own time zone.
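The conversion can be sketched with the standard library. This example makes a simplifying assumption that is not in the original analysis: it treats Eastern Time as a fixed UTC-5 offset and the target zone as a fixed offset too, so daylight saving time is ignored. `convert_hour` and `EASTERN_STANDARD` are hypothetical names.

```python
from datetime import datetime, timedelta, timezone

# Assumption: Eastern Standard Time as a fixed UTC-5 offset (no daylight saving).
EASTERN_STANDARD = timezone(timedelta(hours=-5))


def convert_hour(hour, target_utc_offset, source_tz=EASTERN_STANDARD):
    """Convert an hour of the day from source_tz to a fixed UTC offset."""
    target_tz = timezone(timedelta(hours=target_utc_offset))
    # The date is arbitrary; only the time of day matters with fixed offsets.
    source_time = datetime(2016, 1, 15, hour, 0, tzinfo=source_tz)
    return source_time.astimezone(target_tz).strftime('%H:%M')


# Three of the top hours, converted to UTC+1.
for hour in (15, 2, 20):
    print(hour, '->', convert_hour(hour, 1))
```

When daylight saving time matters, the standard library's `zoneinfo` module (Python 3.9 and later) provides named time zones that account for it.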

Further exploration:
- It would be interesting to graph all the data and see whether there's a function that explains the variation.
- In the top 5 results, mid-afternoon (15:00-16:00) and early evening (20:00-21:00) are peak times, but it is peculiar to see 2 am as the second-best time. One hypothesis that could explain this is that these posts come from outside the US, from another time zone.
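As a first step toward graphing the data without adding a plotting dependency, the hourly averages could be printed as a text bar chart ordered by hour. This is only a rough sketch: `text_bar_chart` is a hypothetical helper, and the dictionary holds just three of the averages from the results above rather than the full table.

```python
def text_bar_chart(averages, scale=1.0):
    """Return one line per hour, with a bar proportional to the average."""
    lines = []
    for hour in sorted(averages):
        bar = '#' * round(averages[hour] * scale)
        lines.append('{} {} {:.1f}'.format(hour, bar, averages[hour]))
    return lines


# A few hourly averages copied from the results above (not the full table).
sample_averages = {
    '15': 38.5948275862069,
    '02': 23.810344827586206,
    '09': 5.5777777777777775,
}

for line in text_bar_chart(sample_averages, scale=0.5):
    print(line)
```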