# An Analysis of Hacker News Posts

Hacker News is a site started by the startup incubator Y Combinator, where user-submitted stories (known as "posts") receive votes and comments, similar to Reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of the Hacker News listings can get hundreds of thousands of visitors as a result.

In our analysis, we are specifically interested in posts with titles that begin with either **Ask HN** (Users submit posts to ask the Hacker News community a specific question) or **Show HN** (Users submit posts to show the Hacker News community a project, product or just something interesting).

We will compare these two types of posts to determine whether **Ask HN** or **Show HN** posts receive more comments on average, and whether posts created at a certain time of day receive more comments.

In [1]:
# Import the dataset and display the first five rows

from csv import reader
import datetime as dt

open_file = open('hacker_news.csv')
read_file = reader(open_file)
hn = list(read_file)
header = hn[0]
hn = hn[1:]

for row in hn[:5]:
    print(row)
    print('\n')

['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']


['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']


['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']


['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']


['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']




Below are the descriptions of the columns, followed by the printed header:

| index | columns | description |
| :-: | -: | :- |
| 0 | id | the unique identifier from Hacker News for the post |
| 1 | title | the title of the post |
| 2 | url | the URL that the post links to, if the post has a URL |
| 3 | num_points | the number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes |
| 4 | num_comments | the number of comments on the post |
| 5 | author | the username of the person who submitted the post |
| 6 | created_at | the date and time of the post's submission |

In [2]:
print(header)

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
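
The loading cell above never closes the file handle. As a minimal sketch, the same header/rows split can be wrapped in a helper and used inside a `with` block. The helper name `load_rows` and the in-memory sample are illustrative assumptions, not part of the original notebook.

```python
from csv import reader
from io import StringIO

def load_rows(file_obj):
    """Return (header, rows) from an open CSV file object."""
    rows = list(reader(file_obj))
    return rows[0], rows[1:]

# Hypothetical two-line sample standing in for hacker_news.csv
sample = StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "1,Ask HN: Example?,,3,1,someone,8/4/2016 11:52\n"
)
header, rows = load_rows(sample)
print(header)
print(rows)

# With the real file, the context manager closes the handle automatically:
# with open('hacker_news.csv', newline='') as f:
#     header, hn = load_rows(f)
```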


In our first step, we search the dataset for posts whose titles begin with **Ask HN** or **Show HN**.
For this we lowercase each title and use the string method `startswith` to check which keyword it begins with. The matching posts are then appended to their respective lists.

In [3]:
ask_posts = []
show_posts = []
other_posts = []

# to filter all posts by the start of their title,
# we iterate through each post, lowercase the title, and check
# whether it begins with one of the keywords; if so, we append
# the full row to the matching list above

for post in hn:
    title = post[1].lower()
    if title.startswith('ask hn'):
        ask_posts.append(post)
    elif title.startswith('show hn'):
        show_posts.append(post)
    else:
        other_posts.append(post)
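
To see the case-insensitive prefix check in isolation, here is a small sketch that labels individual titles. The function name `classify_title` and the sample list are assumptions made for illustration only.

```python
def classify_title(title):
    """Label a title as 'ask', 'show', or 'other' (case-insensitive)."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'

# Sample titles; the second one is deliberately upper-cased
samples = [
    'Ask HN: How to improve my personal website?',
    'SHOW HN: Something pointless I made',
    'Interactive Dynamic Video',
]
labels = [classify_title(t) for t in samples]
print(labels)
```

Because the title is lowercased first, variants such as `ASK HN` or `ask hn` land in the same list as `Ask HN`.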

In [4]:
# now we check the exact number of posts in each list

print('Number of all posts: ' + str(len(hn)))
print('')
print('Number of ask_posts: ' + str(len(ask_posts)))
print('Number of show_posts: ' + str(len(show_posts)))
print('Number of other_posts: ' + str(len(other_posts)))

Number of all posts: 20100

Number of ask_posts: 1744
Number of show_posts: 1162
Number of other_posts: 17194


To verify that the lists were built correctly, we look at the first two and last two entries of each list.

In [5]:
for entry in range(0, 2):
    print(ask_posts[entry][1])
    print(ask_posts[-entry-1][1])
    
print('\n')
    
for entry in range(0, 2):
    print(show_posts[entry][1])
    print(show_posts[-entry-1][1])

Ask HN: How to improve my personal website?
Ask HN: Why are papers still published as PDFs?
Ask HN: Am I the only one outraged by Twitter shutting down share counts?
Ask HN: How do you balance a serious relationship with starting a company?


Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform
Show HN: Parse recipe ingredients using JavaScript
Show HN: Something pointless I made
Show HN: PhantomJsCloud, Headless Browser SaaS


Now let's move on to answering the first basic question: do **ask posts** or **show posts** get more comments on average?

To check this, we read the number of comments for each post at index 4, add it to a running total, and divide that total by the number of posts in the list.

In [6]:
# find the total comments for the ask section
total_ask_comments = 0

for post in ask_posts:
    comments = post[4]
    total_ask_comments += int(comments)

# calculate the average number of comments on ask posts
avg_ask_comments = total_ask_comments / len(ask_posts)
print('The average number of comments on ask posts is: ' + '{:.2f}'.format(avg_ask_comments))

The average number of comments on ask posts is: 14.04


In [7]:
# find the total comments for the show section
total_show_comments = 0

for post in show_posts:
    comments = post[4]
    total_show_comments += int(comments)
    
# calculate the average number of comments on show posts
avg_show_comments = total_show_comments / len(show_posts)
print('The average number of comments on show post is: ' + '{:.2f}'.format(avg_show_comments))

The average number of comments on show post is: 10.32
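
The two cells above repeat the same loop. One possible way to avoid the duplication is a small helper, sketched below. The name `average_comments`, its `comments_index` parameter, and the sample rows are hypothetical.

```python
def average_comments(posts, comments_index=4):
    """Mean of the comment counts stored as strings at comments_index."""
    if not posts:
        return 0.0  # avoid dividing by zero on an empty list
    total = sum(int(post[comments_index]) for post in posts)
    return total / len(posts)

# Made-up rows in the same column layout as the dataset
sample_posts = [
    ['1', 'Ask HN: A?', '', '3', '10', 'user_a', '8/4/2016 11:52'],
    ['2', 'Ask HN: B?', '', '5', '4', 'user_b', '8/4/2016 12:10'],
]
print('{:.2f}'.format(average_comments(sample_posts)))  # prints 7.00
```

In the notebook this would be called as `average_comments(ask_posts)` and `average_comments(show_posts)`.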


In [8]:
# to draw a conclusion, we compare the two averages and
# report the absolute difference, i.e. how many more comments
# per post the better-performing section receives on average

avg_comments = abs(avg_ask_comments - avg_show_comments)

if avg_ask_comments > avg_show_comments:
    print('The ask section gets ' + '{:.2f}'.format(avg_comments) + ' more comments.')
elif avg_show_comments > avg_ask_comments:
    print('The show section gets ' + '{:.2f}'.format(avg_comments) + ' more comments.')
else:
    print('Both sections generate on average the same number of comments')


The ask section gets 3.72 more comments.


**Ask posts** receive on average about **4 comments** more per post.

This could be explained, for example, by the fact that in the Ask-Section, users are asked in particular for solution approaches and are therefore more involved than in the Show-Section.

Since ask posts are more likely to receive comments, we will focus the remaining analysis on these posts only.

As a next step, we will investigate whether posts created at a certain time receive more comments. For this we will calculate the number of ask posts created in each hour of the day, along with the number of comments received and calculate the average number of comments ask posts receive by hour.

In [9]:
result_list = []

# we iterate through each post in ask_posts, then assign
# the timestamp to created_at and the number of comments
# to num_comments. At the end we append a list with that
# data to the result_list

for post in ask_posts:
    created_at = post[6] 
    num_comments = int(post[4])
    result_list.append([created_at, num_comments])

In [10]:
counts_by_hour = {} # contains the number of ask posts created during each hour of the day
comments_by_hour = {} # contains the corresponding number of comments ask posts created at each hour received

for row in result_list:
    date_str = row[0]
    num_comments = row[1]
    dt_object = dt.datetime.strptime(date_str, '%m/%d/%Y %H:%M') # we convert the date string to a datetime object
    hour = dt_object.strftime('%H') # we extract the hour (%H) from the datetime object
    
    # now we build two frequency tables keyed by hour: one counts
    # the posts created in each hour, the other sums the comments
    # that those posts received.
    
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = num_comments
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += num_comments
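
The `if`/`else` bookkeeping above can also be expressed with `collections.defaultdict`, which starts every missing key at zero. The following is a sketch under that assumption; the function name `tally_by_hour` and the sample rows are illustrative.

```python
import datetime as dt
from collections import defaultdict

def tally_by_hour(rows, fmt='%m/%d/%Y %H:%M'):
    """Return (post counts, comment totals) keyed by zero-padded hour."""
    counts = defaultdict(int)
    comments = defaultdict(int)
    for date_str, num_comments in rows:
        hour = dt.datetime.strptime(date_str, fmt).strftime('%H')
        counts[hour] += 1
        comments[hour] += num_comments
    return dict(counts), dict(comments)

# Made-up [created_at, num_comments] pairs in the shape of result_list
sample = [['8/4/2016 11:52', 6], ['9/30/2015 4:12', 2], ['6/17/2016 11:01', 4]]
counts, comments = tally_by_hour(sample)
print(counts)
print(comments)
```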
        

In [11]:
# with sorted(counts_by_hour) we sorted each row from 00 to 23

for row in sorted(counts_by_hour):
    print('Hour: ' + str(row) + ' Posts: ' + str(counts_by_hour[row]))

Hour: 00 Posts: 55
Hour: 01 Posts: 60
Hour: 02 Posts: 58
Hour: 03 Posts: 54
Hour: 04 Posts: 47
Hour: 05 Posts: 46
Hour: 06 Posts: 44
Hour: 07 Posts: 34
Hour: 08 Posts: 48
Hour: 09 Posts: 45
Hour: 10 Posts: 59
Hour: 11 Posts: 58
Hour: 12 Posts: 73
Hour: 13 Posts: 85
Hour: 14 Posts: 107
Hour: 15 Posts: 116
Hour: 16 Posts: 108
Hour: 17 Posts: 100
Hour: 18 Posts: 109
Hour: 19 Posts: 110
Hour: 20 Posts: 80
Hour: 21 Posts: 109
Hour: 22 Posts: 71
Hour: 23 Posts: 68


In [12]:
# To get the hour with the most posts, we iterate through
# the keys of counts_by_hour; if the count for an hour is
# bigger than max_by_hour, we assign the hour and count to max_hour

max_by_hour = 0
max_hour = []

for row in counts_by_hour:
    if counts_by_hour[row] > max_by_hour:
        max_by_hour = counts_by_hour[row]
        max_hour = [row, max_by_hour]

print('With ' + str(max_hour[1]) + ' most of the posts were written around ' + str(max_hour[0]) + " o'clock.") 

With 116 most of the posts were written around 15 o'clock.
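
The manual search for the maximum can be shortened with the built-in `max` and a `key` function. This is a minimal sketch using three of the counts printed above; the variable names are assumptions.

```python
# Subset of the hourly post counts shown in the output above
counts_sample = {'14': 107, '15': 116, '16': 108}

# max() iterates over the keys and compares them by their dictionary value
busiest_hour = max(counts_sample, key=counts_sample.get)
print(busiest_hour, counts_sample[busiest_hour])
```

If two hours tie, `max` returns the first one it encounters, which matches the strict `>` comparison in the loop above.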


In [13]:
# with sorted(comments_by_hour) we sorted each row from 00 to 23

for row in sorted(comments_by_hour):
    print('Hour: ' + str(row) +  ' Comments: ' + str(comments_by_hour[row]))

Hour: 00 Comments: 447
Hour: 01 Comments: 683
Hour: 02 Comments: 1381
Hour: 03 Comments: 421
Hour: 04 Comments: 337
Hour: 05 Comments: 464
Hour: 06 Comments: 397
Hour: 07 Comments: 267
Hour: 08 Comments: 492
Hour: 09 Comments: 251
Hour: 10 Comments: 793
Hour: 11 Comments: 641
Hour: 12 Comments: 687
Hour: 13 Comments: 1253
Hour: 14 Comments: 1416
Hour: 15 Comments: 4477
Hour: 16 Comments: 1814
Hour: 17 Comments: 1146
Hour: 18 Comments: 1439
Hour: 19 Comments: 1188
Hour: 20 Comments: 1722
Hour: 21 Comments: 1745
Hour: 22 Comments: 479
Hour: 23 Comments: 543


In [14]:
# To get the hour with the most comments, we iterate through
# the keys of comments_by_hour; if the total for an hour is
# bigger than max_by_comments, we assign the hour and total to max_comments

max_by_comments = 0
max_comments = []

for row in comments_by_hour:
    if comments_by_hour[row] > max_by_comments:
        max_by_comments = comments_by_hour[row]
        max_comments = [row, comments_by_hour[row]]

print('With ' + str(max_comments[1]) + ' most of the comments were written around ' + str(max_comments[0]) + " o'clock.")

With 4477 most of the comments were written around 15 o'clock.


Now that we have checked counts_by_hour and comments_by_hour, we can see that most ask posts were created around **3 pm**, and that the posts created in that hour also received the most comments in total.

Because that hour also has the most posts, the total alone is not conclusive. To check the finding, we calculate the average number of comments per post for posts created during each hour of the day.

In [15]:
avg_by_hour = []
for hour in counts_by_hour:
    avg = comments_by_hour[hour] / counts_by_hour[hour]
    avg_by_hour.append([hour, avg])

for row in sorted(avg_by_hour):
    print('Hour: ' + str(row[0]) + ' Comments (avg): ' + str(row[1])) 

Hour: 00 Comments (avg): 8.127272727272727
Hour: 01 Comments (avg): 11.383333333333333
Hour: 02 Comments (avg): 23.810344827586206
Hour: 03 Comments (avg): 7.796296296296297
Hour: 04 Comments (avg): 7.170212765957447
Hour: 05 Comments (avg): 10.08695652173913
Hour: 06 Comments (avg): 9.022727272727273
Hour: 07 Comments (avg): 7.852941176470588
Hour: 08 Comments (avg): 10.25
Hour: 09 Comments (avg): 5.5777777777777775
Hour: 10 Comments (avg): 13.440677966101696
Hour: 11 Comments (avg): 11.051724137931034
Hour: 12 Comments (avg): 9.41095890410959
Hour: 13 Comments (avg): 14.741176470588234
Hour: 14 Comments (avg): 13.233644859813085
Hour: 15 Comments (avg): 38.5948275862069
Hour: 16 Comments (avg): 16.796296296296298
Hour: 17 Comments (avg): 11.46
Hour: 18 Comments (avg): 13.20183486238532
Hour: 19 Comments (avg): 10.8
Hour: 20 Comments (avg): 21.525
Hour: 21 Comments (avg): 16.009174311926607
Hour: 22 Comments (avg): 6.746478873239437
Hour: 23 Comments (avg): 7.985294117647059
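
As an alternative sketch, the hourly averages can be kept in a dictionary and rounded when printed. The two sample hours below reuse the counts and comment totals shown earlier; the variable names are assumptions.

```python
# Post counts and comment totals for two hours, taken from the outputs above
counts_sample = {'09': 45, '15': 116}
comments_sample = {'09': 251, '15': 4477}

# Average comments per post for each hour
avg_sample = {
    hour: comments_sample[hour] / counts_sample[hour]
    for hour in counts_sample
}

for hour in sorted(avg_sample):
    print('Hour: {} Comments (avg): {:.2f}'.format(hour, avg_sample[hour]))
```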


To get a better overview, we swap the corresponding entries and sort them by size.

In [16]:
swap_avg_by_hour = []

for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]]) # swap the two elements of the row and append the result to the new list
    
print(swap_avg_by_hour)

[[5.5777777777777775, '09'], [14.741176470588234, '13'], [13.440677966101696, '10'], [13.233644859813085, '14'], [16.796296296296298, '16'], [7.985294117647059, '23'], [9.41095890410959, '12'], [11.46, '17'], [38.5948275862069, '15'], [16.009174311926607, '21'], [21.525, '20'], [23.810344827586206, '02'], [13.20183486238532, '18'], [7.796296296296297, '03'], [10.08695652173913, '05'], [10.8, '19'], [11.383333333333333, '01'], [6.746478873239437, '22'], [10.25, '08'], [7.170212765957447, '04'], [8.127272727272727, '00'], [9.022727272727273, '06'], [7.852941176470588, '07'], [11.051724137931034, '11']]


In [17]:
sorted_swap = sorted(swap_avg_by_hour, reverse=True) # sort the new list by average, highest first

print('Top 5 Hours for Ask Posts Comments')

for average, hour in sorted_swap[:5]:
    hour_object = dt.datetime.strptime(hour, '%H') # convert the string to datetime object
    time = hour_object.strftime('%H:%M') # format the datetime object 
    print('{time}: {average:.2f} comments per post'.format(time=time, average=average) )

Top 5 Hours for Ask Posts Comments
15:00: 38.59 comments per post
02:00: 23.81 comments per post
20:00: 21.52 comments per post
16:00: 16.80 comments per post
21:00: 16.01 comments per post
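
The swap step is only needed because `sorted` compares the first element of each row by default. A possible shortcut is to pass a `key` function instead, as sketched here with a few of the averages from above (rounded for brevity, except 21.525).

```python
# [hour, average] pairs in the same layout as avg_by_hour
avg_sample = [['15', 38.59], ['02', 23.81], ['09', 5.58], ['20', 21.525]]

# Sort by the second element (the average), highest first, and keep the top 3
top = sorted(avg_sample, key=lambda row: row[1], reverse=True)[:3]

for hour, average in top:
    print('{}:00: {:.2f} comments per post'.format(hour, average))
```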


We should create a post at 15:00 to have a higher chance of receiving comments.

The timestamps in this dataset are in Eastern Time. Because we live in Berlin, we need to add 6 hours to each time to get the corresponding local time for posting on Hacker News from Germany.

In [18]:
print('Top 5 Hours in CET (Berlin) for Ask Posts Comments')

for average, hour in sorted_swap[:5]:
    hour_object = dt.datetime.strptime(hour, '%H') # convert the string to datetime object
    cet = hour_object + dt.timedelta(hours=6)
    time = cet.strftime('%H:%M') # format the datetime object
    print('{time}: {average:.2f} comments per post'.format(time=time, average=average) )

Top 5 Hours in CET (Berlin) for Ask Posts Comments
21:00: 38.59 comments per post
08:00: 23.81 comments per post
02:00: 21.52 comments per post
22:00: 16.80 comments per post
03:00: 16.01 comments per post
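
The six-hour shift can also be written as plain modular arithmetic, which makes the wrap past midnight explicit. This is a sketch only: the helper name `shift_hour` is hypothetical, and the fixed six-hour offset is an assumption that does not hold for the few weeks each year when the US and Europe switch daylight saving time on different dates.

```python
def shift_hour(hour_str, offset_hours=6):
    """Shift a zero-padded 'HH' string by offset_hours, wrapping at midnight."""
    return '{:02d}:00'.format((int(hour_str) + offset_hours) % 24)

# Three of the top hours from the Eastern Time ranking above
for hour in ['15', '02', '20']:
    print(hour + ':00 ET ->', shift_hour(hour), 'Berlin')
```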


In our analysis we found that ask posts receive more comments on average than show posts. To give our posts a higher chance of receiving as many comments as possible, we should publish around **3pm** Eastern Time. Since we are located in Berlin, this corresponds to around **9pm** CET.

Even though this analysis gives us an estimate of when it is worth posting, it does not mean that we will always get the most comments if we publish at 3pm. The content of the post still plays a key role.