# Exploring Hacker News Posts

Hacker News is a site where users submit stories that other users vote and comment on. It is extremely popular in technology and startup circles, and posts that reach the top of its listings can get hundreds of thousands of visitors as a result.

We will be working with a data set of these posts that has been reduced from around 300,000 rows to approximately 20,000 by removing the submissions that did not receive any comments. For each post, the data includes the URL, the title, the number of points (upvotes minus downvotes), the number of comments, the author, and the time the post was created.

We will focus on posts whose titles begin with either `Ask HN` or `Show HN`. Users submit `Ask HN` posts to ask the community a question, and `Show HN` posts to share a project or other interesting information with the community.

We want to determine which of these two types of posts receives more comments on average, and whether posts created at certain times receive more comments on average.

In [5]:
import csv

In [6]:
with open('hacker_news.csv', 'r') as f:
    reader = csv.reader(f)
    hn = list(reader)
    
hn[:5]

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'],
 ['12224879',
  'Interactive Dynamic Video',
  'http://www.interactivedynamicvideo.com/',
  '386',
  '52',
  'ne0phyte',
  '8/4/2016 11:52'],
 ['10975351',
  'How to Use Open Source and Shut the Fuck Up at the Same Time',
  'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
  '39',
  '10',
  'josep2',
  '1/26/2016 19:30'],
 ['11964716',
  "Florida DJs May Face Felony for April Fools' Water Joke",
  'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/',
  '2',
  '1',
  'vezycash',
  '6/23/2016 22:20'],
 ['11919867',
  'Technology ventures: From Idea to Enterprise',
  'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
  '3',
  '1',
  'hswarna',
  '6/17/2016 0:01']]

In [7]:
headers = hn[0]
hn = hn[1:]
print(headers)
print(hn[:5])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]
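
The cells that follow refer to columns by position (`row[1]`, `row[4]`, `row[6]`). As an optional aid, here is a minimal sketch of a name-to-index lookup built from the header row. The name `col` is our own and is not used elsewhere in this notebook; the header and the sample row are copied from the output above.

```python
# Header row copied from the output above.
headers = ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

# Map each column name to its position so code can avoid bare indices.
# `col` is a hypothetical helper name, not part of the original analysis.
col = {name: index for index, name in enumerate(headers)}

# One data row copied from the output above.
row = ['12224879', 'Interactive Dynamic Video',
       'http://www.interactivedynamicvideo.com/', '386', '52',
       'ne0phyte', '8/4/2016 11:52']

print(row[col['title']])              # same value as row[1]
print(int(row[col['num_comments']]))  # same value as int(row[4])
```

With a lookup like this, `row[col['num_comments']]` could stand in for `row[4]` wherever the index feels too opaque.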


Let's separate the rows we are interested in from the rest. We will split the data into three lists: posts whose titles start with `Ask HN`, posts whose titles start with `Show HN`, and all other posts. The comparison ignores case.

In [8]:
ask_posts = []
show_posts = []
other_posts = []
for row in hn:
    if row[1].lower().startswith('ask hn'):
        ask_posts.append(row)
    elif row[1].lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
        
print('Number of "Ask HN" posts: {}'.format(len(ask_posts)))
print('Number of "Show HN" posts: {}'.format(len(show_posts)))
print('Number of other posts: {}'.format(len(other_posts)))

Number of "Ask HN" posts: 1744
Number of "Show HN" posts: 1162
Number of other posts: 17194


Let's find the total number of comments in the ask posts and then determine the average number of comments for ask posts.

In [9]:
total_ask_comments = 0
for row in ask_posts:
    total_ask_comments += int(row[4])
    
avg_ask_comments = total_ask_comments / len(ask_posts)
print(avg_ask_comments)

14.038417431192661


Let's also find the total number of comments in the show posts and then determine the average number of comments in show posts.

In [11]:
total_show_comments = 0

for row in show_posts:
    total_show_comments += int(row[4])
    
avg_show_comments = total_show_comments / len(show_posts)
print(avg_show_comments)

10.31669535283993


Ask posts receive more comments on average than show posts (about 14 versus about 10). Based on this finding, we'll focus on only the ask posts for the remainder of our analysis.
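
The two averaging cells above repeat the same logic, so it could be wrapped in a small function. The following is only a sketch: `average_comments` and `sample_posts` are hypothetical names, and the sample rows are made up to match the data set's column order rather than taken from the file.

```python
def average_comments(posts, comments_index=4):
    """Return the mean comment count of `posts`, or 0.0 for an empty list."""
    if not posts:
        return 0.0
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)


# Made-up rows in the data set's column order (not real posts).
sample_posts = [
    ['1', 'Ask HN: Example question one', '', '5', '10', 'user_a', '1/1/2016 10:00'],
    ['2', 'Ask HN: Example question two', '', '3', '4', 'user_b', '1/2/2016 11:00'],
]

print(average_comments(sample_posts))  # 7.0
```

Called as `average_comments(ask_posts)` and `average_comments(show_posts)`, it should give the same two averages as the cells above.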

# Ask Posts and Comments by Hour

We'll group the ask posts by the hour of the day in which they were created, and total the comments received by the posts in each hour. We'll then use this to determine the best time to create a post if we want to maximize the comments we get.

In [14]:
import datetime as dt

# Keep only the creation time and the comment count of each ask post
result_list = []

for post in ask_posts:
    result_list.append([post[6], int(post[4])])
    
# Number of posts and total number of comments, keyed by hour ('00' to '23')
counts_by_hour = {}
comments_by_hour = {}
for row in result_list:
    hour = dt.datetime.strptime(row[0], '%m/%d/%Y %H:%M').strftime('%H')
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = row[1]
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += row[1]
        
print(counts_by_hour)
print(comments_by_hour)

{'09': 45, '22': 71, '04': 47, '07': 34, '11': 58, '21': 109, '12': 73, '19': 110, '14': 107, '06': 44, '10': 59, '15': 116, '05': 46, '18': 109, '16': 108, '13': 85, '23': 68, '08': 48, '20': 80, '17': 100, '03': 54, '02': 58, '00': 55, '01': 60}
{'09': 251, '22': 479, '04': 337, '07': 267, '11': 641, '21': 1745, '12': 687, '19': 1188, '14': 1416, '06': 397, '10': 793, '15': 4477, '05': 464, '18': 1439, '16': 1814, '13': 1253, '23': 543, '08': 492, '20': 1722, '17': 1146, '03': 421, '02': 1381, '00': 447, '01': 683}
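
As an aside, the same per-hour tallies can be built without the membership check by using `collections.defaultdict`. This is an alternative sketch, not the approach used in the cell above; the `sample_` names are hypothetical, and the comment counts are paired with dates only for illustration.

```python
import datetime as dt
from collections import defaultdict

# Illustrative [created_at, num_comments] pairs in the data set's date format.
sample_results = [
    ['8/4/2016 11:52', 52],
    ['1/26/2016 19:30', 10],
    ['6/23/2016 11:05', 1],
]

# Missing keys start at 0, so no "if hour not in ..." branch is needed.
sample_counts = defaultdict(int)
sample_comments = defaultdict(int)

for created_at, num_comments in sample_results:
    hour = dt.datetime.strptime(created_at, '%m/%d/%Y %H:%M').strftime('%H')
    sample_counts[hour] += 1
    sample_comments[hour] += num_comments

print(dict(sample_counts))    # {'11': 2, '19': 1}
print(dict(sample_comments))  # {'11': 53, '19': 10}
```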


We now have the number of posts created in each hour and the total number of comments those posts received. Now, we'll combine these to get the average number of comments per post for each hour of the day.

In [15]:
avg_by_hour = []


for hour in comments_by_hour:
    average = comments_by_hour[hour] / counts_by_hour[hour]
    avg_by_hour.append([hour, average])
    
avg_by_hour

[['09', 5.5777777777777775],
 ['22', 6.746478873239437],
 ['04', 7.170212765957447],
 ['07', 7.852941176470588],
 ['11', 11.051724137931034],
 ['21', 16.009174311926607],
 ['12', 9.41095890410959],
 ['19', 10.8],
 ['14', 13.233644859813085],
 ['06', 9.022727272727273],
 ['10', 13.440677966101696],
 ['15', 38.5948275862069],
 ['05', 10.08695652173913],
 ['18', 13.20183486238532],
 ['16', 16.796296296296298],
 ['13', 14.741176470588234],
 ['23', 7.985294117647059],
 ['08', 10.25],
 ['20', 21.525],
 ['17', 11.46],
 ['03', 7.796296296296297],
 ['02', 23.810344827586206],
 ['00', 8.127272727272727],
 ['01', 11.383333333333333]]

We'll finish by sorting our list to place the hours with the highest average comments per post toward the top.

In [16]:
swap_avg_by_hour = []

for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])
    
print(swap_avg_by_hour)

[[5.5777777777777775, '09'], [6.746478873239437, '22'], [7.170212765957447, '04'], [7.852941176470588, '07'], [11.051724137931034, '11'], [16.009174311926607, '21'], [9.41095890410959, '12'], [10.8, '19'], [13.233644859813085, '14'], [9.022727272727273, '06'], [13.440677966101696, '10'], [38.5948275862069, '15'], [10.08695652173913, '05'], [13.20183486238532, '18'], [16.796296296296298, '16'], [14.741176470588234, '13'], [7.985294117647059, '23'], [10.25, '08'], [21.525, '20'], [11.46, '17'], [7.796296296296297, '03'], [23.810344827586206, '02'], [8.127272727272727, '00'], [11.383333333333333, '01']]


In [17]:
sorted_swap = sorted(swap_avg_by_hour, reverse=True)

print('Top 5 Hours for Ask Post Comments')

for row in sorted_swap[:5]:
    avg = row[0]
    hour = dt.datetime.strptime(row[1], '%H').strftime('%H:%M')
    print('{}: {:.2f} average comments per post'.format(hour, avg))

Top 5 Hours for Ask Post Comments
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post
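
The swap step exists only so that `sorted` compares the averages first. A possible alternative, sketched here with a few pairs copied from `avg_by_hour`, is to pass a `key` function and sort the `[hour, average]` pairs directly. The names `sample_avg_by_hour` and `ranked` are our own.

```python
# A few [hour, average] pairs copied from the avg_by_hour output above.
sample_avg_by_hour = [
    ['09', 5.5777777777777775],
    ['15', 38.5948275862069],
    ['02', 23.810344827586206],
    ['20', 21.525],
]

# Sort on the second element (the average), highest first.
ranked = sorted(sample_avg_by_hour, key=lambda row: row[1], reverse=True)

for hour, avg in ranked[:3]:
    print('{}:00: {:.2f} average comments per post'.format(hour, avg))
```

One difference to keep in mind: when two hours share the same average, the swapped version breaks the tie on the hour string, while the `key` version keeps the pairs in their original order.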


Based on these results, it appears that we should create an `Ask HN` post around 3:00 PM EST if we want to get the most comments on our post. Because we originally excluded posts without any comments, the conclusion is limited: among posts that receive at least one comment, `Ask HN` posts created around this time of day receive the most comments on average.
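
If we are in a different time zone, we may want to convert that hour. The sketch below assumes the timestamps really are EST and treats EST as a fixed UTC-5 offset, ignoring daylight saving time; the date is arbitrary and the variable names are our own.

```python
import datetime as dt

# Assumption: timestamps are in EST, modelled as a fixed UTC-5 offset.
EST = dt.timezone(dt.timedelta(hours=-5), 'EST')

# The date is arbitrary; only the hour matters here.
best_hour_est = dt.datetime(2016, 1, 1, 15, 0, tzinfo=EST)
best_hour_utc = best_hour_est.astimezone(dt.timezone.utc)

print(best_hour_utc.strftime('%H:%M'))  # 20:00
```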