# Hacker News: Ask HN or Show HN Posts

<i>Created by the startup incubator Y Combinator in 2007, [Hacker News](https://news.ycombinator.com/) is a social news site where *posts* (user-submitted content) are voted and commented on, much like on Reddit. Unlike Reddit, however, users can only downvote once they've accumulated enough karma (user points), a rule meant to discourage [trolling](https://unlcms.unl.edu/engineering/james-hanson/trolls-and-their-impact-social-media) and encourage intelligent, respectful discourse. Because Hacker News is popular in technology and startup circles, its top posts can get hundreds of thousands of user engagements.</i>

## Dataset

The source dataset for this project contains 300,000 rows of Hacker News submissions. This project focuses on two kinds of posts, `Ask HN` and `Show HN`:

- `Ask HN` posts are community questions, such as "What's the best online course you've taken?"
- `Show HN` posts are about projects, products, or interesting things for the Hacker News community

The goal is to learn more about the comments these types of posts receive. Specifically:

- Do `Ask HN` or `Show HN` posts receive more comments on average?
- Do posts created at a certain time receive more comments on average?

Because comments are the primary focus (and to streamline resources), the dataset has been reduced to about 20,000 rows by removing all submissions that did not receive any comments, and then randomly sampling from the remaining submissions.
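The exact procedure used to produce the reduced file is not shown here, so the following is only a minimal sketch of that kind of reduction. The function name `reduce_dataset`, the fixed seed, and the toy rows are all hypothetical; only the column order matches the dataset.

```python
import random

def reduce_dataset(rows, sample_size, comments_index=4, seed=0):
    """Drop rows without comments, then draw a random sample of the rest."""
    commented = [row for row in rows if int(row[comments_index]) > 0]
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    # Never ask for more rows than are available.
    return rng.sample(commented, min(sample_size, len(commented)))

# Made-up rows in the same column order as the dataset.
toy_rows = [
    ['1', 'Ask HN: A?', '', '3', '0', 'user_a', '1/1/2016 0:00'],
    ['2', 'Show HN: B', 'http://example.com', '5', '2', 'user_b', '1/1/2016 1:00'],
    ['3', 'Some story', 'http://example.org', '9', '7', 'user_c', '1/1/2016 2:00'],
]
reduced = reduce_dataset(toy_rows, sample_size=2)
print(len(reduced))
```

Filtering before sampling means the requested sample size refers only to posts that received at least one comment.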

Here is a sample of the smaller dataset:

In [1]:
from csv import reader

# Use a context manager so the file is closed once it has been read.
with open('datasets/hacker_news.csv') as hn_csv:
    hn = list(reader(hn_csv))
hn[:6]

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'],
 ['12224879',
  'Interactive Dynamic Video',
  'http://www.interactivedynamicvideo.com/',
  '386',
  '52',
  'ne0phyte',
  '8/4/2016 11:52'],
 ['10975351',
  'How to Use Open Source and Shut the Fuck Up at the Same Time',
  'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
  '39',
  '10',
  'josep2',
  '1/26/2016 19:30'],
 ['11964716',
  "Florida DJs May Face Felony for April Fools' Water Joke",
  'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/',
  '2',
  '1',
  'vezycash',
  '6/23/2016 22:20'],
 ['11919867',
  'Technology ventures: From Idea to Enterprise',
  'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
  '3',
  '1',
  'hswarna',
  '6/17/2016 0:01'],
 ['10301696',
  'Note by Note: The Making of Steinway L1037 (2007)',
  'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0',
  '8',
  '2',
  'walterbell',
  '9/30/2015 4:12']]

## Headers

The header row is removed so that the subsequent analysis only looks at rows of raw values.

In [2]:
headers = hn[0]
print(headers)

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


In [3]:
hn = hn[1:]

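def sample(dataset, start=0, end=5):
    """Print rows dataset[start:end], separated by blank lines."""
    row_count = start
    for row in dataset[start:end]:
        row_count += 1
        print(row)
        # Skip the separator after the last requested row.
        if row_count != end:
            print('\n')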

sample(hn)

['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']


['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']


['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']


['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']


['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']


## `Ask HN` and `Show HN` Lists

To make the later analysis easier, `Ask HN` and `Show HN` posts are extracted from the current dataset and split into their own lists. All remaining posts go into a third list.

In [4]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    title_lower = title.lower()

    if title_lower.startswith('ask hn'):
        ask_posts.append(row)
    elif title_lower.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
        
print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))

1744
1162
17194
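The split rule can also be isolated in a small helper, which makes it easy to check against individual titles. This is only a sketch: `classify_post` is a hypothetical name, and the rule is the same case-insensitive prefix test used in the cell above.

```python
def classify_post(title):
    """Return 'ask', 'show', or 'other' based on the title prefix."""
    title_lower = title.lower()
    if title_lower.startswith('ask hn'):
        return 'ask'
    if title_lower.startswith('show hn'):
        return 'show'
    return 'other'

# Titles taken from rows of this dataset.
print(classify_post('Ask HN: How to improve my personal website?'))  # ask
print(classify_post('Show HN: Something pointless I made'))          # show
print(classify_post('Interactive Dynamic Video'))                    # other
```

Because the test only looks at the start of the title, a post that mentions "Ask HN" later in its title is still counted as `other`.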


In [5]:
sample(ask_posts)

['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55']


['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43']


['11610310', 'Ask HN: Aby recent changes to CSS that broke mobile?', '', '1', '1', 'polskibus', '5/2/2016 10:14']


['12210105', 'Ask HN: Looking for Employee #3 How do I do it?', '', '1', '3', 'sph130', '8/2/2016 14:20']


['10394168', 'Ask HN: Someone offered to buy my browser extension from me. What now?', '', '28', '17', 'roykolak', '10/15/2015 16:38']


In [6]:
sample(show_posts)

['10627194', 'Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform', 'https://iot.seeed.cc', '26', '22', 'kfihihc', '11/25/2015 14:03']


['10646440', 'Show HN: Something pointless I made', 'http://dn.ht/picklecat/', '747', '102', 'dhotson', '11/29/2015 22:46']


['11590768', 'Show HN: Shanhu.io, a programming playground powered by e8vm', 'https://shanhu.io', '1', '1', 'h8liu', '4/28/2016 18:05']


['12178806', 'Show HN: Webscope  Easy way for web developers to communicate with Clients', 'http://webscopeapp.com', '3', '3', 'fastbrick', '7/28/2016 7:11']


['10872799', 'Show HN: GeoScreenshot  Easily test Geo-IP based web pages', 'https://www.geoscreenshot.com/', '1', '9', 'kpsychwave', '1/9/2016 20:45']


In [7]:
sample(other_posts)

['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']


['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']


['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']


['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']


['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']


## Avg Comments

Now that ask posts and show posts have been separated into different lists, the next step is to calculate the average number of comments each type of post receives.

In [8]:
total_ask_comments = 0

for post in ask_posts:
    num_comments = int(post[4])
    total_ask_comments += num_comments
    
avg_ask_comments = total_ask_comments / len(ask_posts)
print("avg_ask_comments = " + str(round(avg_ask_comments,2)))

avg_ask_comments = 14.04


In [9]:
total_show_comments = 0

for post in show_posts:
    num_comments = int(post[4])
    total_show_comments += num_comments
    
avg_show_comments = total_show_comments / len(show_posts)
print("avg_show_comments = " + str(round(avg_show_comments,2)))

avg_show_comments = 10.32


The data suggests that, on average, `Ask HN` posts receive more comments than `Show HN` posts (14.04 versus 10.32). This is consistent with how these threads usually work: Ask posts invite clarification and additional information from the community, while Show posts tend to be more straightforward and may not require as much community engagement.
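Since the two averaging cells differ only in the list they read, the calculation could be factored into one function. The following is a sketch under that assumption; `average_comments` is a hypothetical helper, and the two rows are `Ask HN` posts copied from the dataset.

```python
def average_comments(posts, comments_index=4):
    """Mean of the comment counts stored as strings in each row."""
    if not posts:
        raise ValueError('posts must not be empty')
    total = sum(int(post[comments_index]) for post in posts)
    return total / len(posts)

# Two Ask HN rows from the dataset.
toy_posts = [
    ['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6',
     'ahmedbaracat', '8/16/2016 9:55'],
    ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?',
     '', '28', '29', 'tkfx', '11/22/2015 13:43'],
]
print(average_comments(toy_posts))  # (6 + 29) / 2 = 17.5
```

Calling `average_comments(ask_posts)` and `average_comments(show_posts)` would then replace both loops.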

## Ask HN Posts and Comments by Hour

Calculate the number of `Ask HN` posts created during each hour of the day and the total number of comments those posts received.

In [10]:
import datetime as dt

result_list = []

for post in ask_posts:
    created_at = post[6]
    num_comments = int(post[4])
    result_list.append([created_at, num_comments])
    
sample(result_list)

['8/16/2016 9:55', 6]


['11/22/2015 13:43', 29]


['5/2/2016 10:14', 1]


['8/2/2016 14:20', 3]


['10/15/2015 16:38', 17]


In [11]:
counts_by_hour = {}
comments_by_hour = {}

for row in result_list:
    dt_format = dt.datetime.strptime(row[0], '%m/%d/%Y %H:%M')
    post_hour = dt_format.strftime('%H')
    num_comments = row[1]
    
    if post_hour not in counts_by_hour:
        counts_by_hour[post_hour] = 1
        comments_by_hour[post_hour] = num_comments
    else:
        counts_by_hour[post_hour] += 1
        comments_by_hour[post_hour] += num_comments
        
print(counts_by_hour)
print(comments_by_hour)

{'09': 45, '13': 85, '10': 59, '14': 107, '16': 108, '23': 68, '12': 73, '17': 100, '15': 116, '21': 109, '20': 80, '02': 58, '18': 109, '03': 54, '05': 46, '19': 110, '01': 60, '22': 71, '08': 48, '04': 47, '00': 55, '06': 44, '07': 34, '11': 58}
{'09': 251, '13': 1253, '10': 793, '14': 1416, '16': 1814, '23': 543, '12': 687, '17': 1146, '15': 4477, '21': 1745, '20': 1722, '02': 1381, '18': 1439, '03': 421, '05': 464, '19': 1188, '01': 683, '22': 479, '08': 492, '04': 337, '00': 447, '06': 397, '07': 267, '11': 641}
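The same tally can be written without the explicit "first time seen" branch by using `collections.defaultdict`. This is a sketch of an alternative, not the method used above; `tally_by_hour` is a hypothetical name, and the last input row is made up so that one hour contains two posts.

```python
import datetime as dt
from collections import defaultdict

def tally_by_hour(pairs, fmt='%m/%d/%Y %H:%M'):
    """Return (post counts, comment totals), each keyed by two-digit hour."""
    counts = defaultdict(int)
    comments = defaultdict(int)
    for created_at, num_comments in pairs:
        hour = dt.datetime.strptime(created_at, fmt).strftime('%H')
        counts[hour] += 1
        comments[hour] += num_comments
    return dict(counts), dict(comments)

pairs = [
    ['8/16/2016 9:55', 6],     # from result_list
    ['11/22/2015 13:43', 29],  # from result_list
    ['5/2/2016 10:14', 1],     # from result_list
    ['1/1/2016 10:30', 4],     # hypothetical row
]
counts, comments = tally_by_hour(pairs)
print(counts)    # {'09': 1, '13': 1, '10': 2}
print(comments)  # {'09': 6, '13': 29, '10': 5}
```

A missing key in a `defaultdict(int)` starts at 0, so both dictionaries can be updated with `+=` on every row.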


## Avg Ask HN Comments by Hour

Calculate the average number of comments received by `Ask HN` posts created at each hour of the day.

In [12]:
avg_by_hour = []

for h in counts_by_hour:
    avg_count = round(comments_by_hour[h] / counts_by_hour[h],2)
    avg_by_hour.append([h, avg_count])
    
for l in sorted(avg_by_hour):
    print(l)

['00', 8.13]
['01', 11.38]
['02', 23.81]
['03', 7.8]
['04', 7.17]
['05', 10.09]
['06', 9.02]
['07', 7.85]
['08', 10.25]
['09', 5.58]
['10', 13.44]
['11', 11.05]
['12', 9.41]
['13', 14.74]
['14', 13.23]
['15', 38.59]
['16', 16.8]
['17', 11.46]
['18', 13.2]
['19', 10.8]
['20', 21.52]
['21', 16.01]
['22', 6.75]
['23', 7.99]


In [13]:
swap_avg_by_hour = []
for hours in avg_by_hour:
    swap_avg_by_hour.append([hours[1],hours[0]])
    
sorted_swap = sorted(swap_avg_by_hour, reverse=True)    

for l in sorted_swap:
    print(l)

[38.59, '15']
[23.81, '02']
[21.52, '20']
[16.8, '16']
[16.01, '21']
[14.74, '13']
[13.44, '10']
[13.23, '14']
[13.2, '18']
[11.46, '17']
[11.38, '01']
[11.05, '11']
[10.8, '19']
[10.25, '08']
[10.09, '05']
[9.41, '12']
[9.02, '06']
[8.13, '00']
[7.99, '23']
[7.85, '07']
[7.8, '03']
[7.17, '04']
[6.75, '22']
[5.58, '09']
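Swapping the columns is one way to sort by the average. Another is to pass a `key` function to `sorted`, which leaves each pair in its original `[hour, average]` order. The sketch below uses a few pairs copied from the output above; `avg_excerpt` and `ranked` are illustrative names.

```python
# A few [hour, average] pairs copied from the avg_by_hour output.
avg_excerpt = [['02', 23.81], ['15', 38.59], ['16', 16.8], ['20', 21.52], ['21', 16.01]]

# Sort by the average (second element), highest first, without swapping columns.
ranked = sorted(avg_excerpt, key=lambda pair: pair[1], reverse=True)

for hour, avg in ranked:
    print(f'{hour}:00: {avg:.2f} average comments per post')
```

One small difference: with a `key`, hours that share the same average keep their original relative order, whereas sorting the swapped lists breaks ties by the hour string.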


In [14]:
print("Top 5 Hours for Ask Posts Comments")
for hours in sorted_swap[:5]:
    print('{1}:00: {0:.2f} average comments per post'.format(hours[0],hours[1]))

Top 5 Hours for Ask Posts Comments
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post


## Conclusion

With an average of 38.59 comments per post (about 62%, or roughly 15 comments, more than the next highest average), the busiest hour for comments on `Ask HN` posts seems to be 3:00 pm EST (the dataset [documentation](https://www.kaggle.com/hacker-news/hacker-news-posts/home) states that timestamps are in EST). To maximize the number of comments, posting an `Ask HN` between 3:00 and 4:00 pm EST seems like a good strategy, since that is the hour in which user engagement appears to be at its daily peak.
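The gap between the top hour and the runner-up can be checked with a few lines of arithmetic. The two averages are taken from the ranking above; the variable names are illustrative.

```python
top_avg = 38.59     # 15:00, the highest average
second_avg = 23.81  # 02:00, the next highest average

difference = top_avg - second_avg
relative_gain = difference / second_avg

print(f'{difference:.2f} more comments per post')       # 14.78 more comments per post
print(f'{relative_gain:.0%} higher than the next hour')  # 62% higher than the next hour
```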