# Comparison of Hacker News Posts
[Hacker News](https://news.ycombinator.com/) is a site started by the startup incubator [Y Combinator](https://ycombinator.com/), where user-submitted stories (known as "posts") are voted and commented upon, similar to [reddit](https://reddit.com/). Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of Hacker News' listings can get hundreds of thousands of visitors as a result.

You can find the original data set [here](https://www.kaggle.com/hacker-news/hacker-news-posts), but for this project we have reduced the data set from almost 300,000 rows to approximately 20,000 rows by removing all submissions that did not receive any comments, and then randomly sampling from the remaining submissions.
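
The reduction described above could be sketched roughly as follows. This is only an illustration on a few made-up rows, not the procedure actually used to build `hacker_news.csv`; the helper name `reduce_rows`, the sample size, and the fixed seed are all assumptions.

```python
import random

def reduce_rows(rows, sample_size, seed=0):
    """Drop rows with zero comments, then randomly sample from the rest.

    Hypothetical helper: `rows` are lists shaped like the data set's rows,
    with the comment count stored as a string at index 4.
    """
    commented = [row for row in rows if int(row[4]) > 0]
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    return rng.sample(commented, min(sample_size, len(commented)))

# Made-up rows: id, title, url, num_points, num_comments, author, created_at
rows = [
    ['1', 'Post A', '', '10', '0', 'alice', '1/1/2016 10:00'],
    ['2', 'Post B', '', '5', '3', 'bob', '1/2/2016 11:00'],
    ['3', 'Post C', '', '7', '12', 'carol', '1/3/2016 12:00'],
    ['4', 'Post D', '', '1', '0', 'dave', '1/4/2016 13:00'],
]
reduced = reduce_rows(rows, sample_size=2)
```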

We're specifically interested in posts whose titles begin with either `Ask HN` or `Show HN`. Users submit `Ask HN` posts to ask the Hacker News community a specific question. Likewise, users submit `Show HN` posts to show the Hacker News community a project, product, or just generally something interesting.

Our aim in this project is to find out:
* Do `Ask HN` or `Show HN` posts receive more comments on average?
* Do posts created at a certain time receive more comments on average?

Let's start by opening the data set and exploring the first few rows:

In [1]:
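from csv import reader

# Read the whole file into a list of lists; the with-block closes the file for us.
with open('hacker_news.csv', newline='') as opened_file:
    hn = list(reader(opened_file))

# Display the first 5 rows (the header row plus 4 posts):
for i in range(5):
    print(hn[i])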

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']
['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']
['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']
['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']


The first row contains the column headers, so we'll separate it from the data rows:

In [2]:
headers = hn[0]
hn = hn[1:]
print(headers)

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


Now that we've removed the headers from `hn`, we're ready to filter our data. Since we're only concerned with post titles beginning with `Ask HN` or `Show HN`, we'll create new lists of lists containing just the data for those titles.

In [3]:
ask_posts, show_posts, other_posts = [], [], []
for row in hn:
    title = row[1]
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
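
To see how the case-insensitive prefix check behaves, here is a small sketch on made-up titles. The `classify_title` helper is hypothetical and not part of the notebook. Note that `startswith` only tests the beginning of the string, so a title that merely mentions `Ask HN` later on is treated as an ordinary post.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' using the same prefix test as above."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'

# Made-up titles for illustration only.
samples = [
    'Ask HN: How do you learn a new language?',
    'ASK HN: Best laptop for development?',
    'Show HN: My weekend project',
    'Why I stopped reading Ask HN threads',
]
labels = [classify_title(t) for t in samples]
print(labels)
```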

Now we'll check the number of posts in `ask_posts`, `show_posts`, and `other_posts`:

In [4]:
print('ask_posts: {}'.format(len(ask_posts)))
print('show_posts: {}'.format(len(show_posts)))
print('other_posts: {}'.format(len(other_posts)))

ask_posts: 1744
show_posts: 1162
other_posts: 17194


Next, let's determine if ask posts or show posts receive more comments on average.

In [5]:
total_ask_comments = 0
for post in ask_posts:
    total_ask_comments += int(post[4])
avg_ask_comments = total_ask_comments / len(ask_posts)

total_show_comments = 0
for post in show_posts:
    total_show_comments += int(post[4])
avg_show_comments = total_show_comments / len(show_posts)

print('Average number of comments on ask posts: {}'.format(avg_ask_comments))
print('Average number of comments on show posts: {}'.format(avg_show_comments))

Average number of comments on ask posts: 14.038417431192661
Average number of comments on show posts: 10.31669535283993


On average, ask posts receive more comments than show posts: about 14 comments per post versus about 10.
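
To put the gap in relative terms, the two printed averages can be compared directly. This sketch hard-codes the values from the output above; the variable names are illustrative.

```python
# Averages copied from the printed output above.
avg_ask = 14.038417431192661
avg_show = 10.31669535283993

ratio = avg_ask / avg_show
extra_percent = (ratio - 1) * 100
print('Ask posts receive about {:.0f}% more comments per post than show posts.'.format(extra_percent))
```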

Since ask posts receive more comments on average, we'll focus our remaining analysis on these posts.

Next, we'll determine whether ask posts created at a certain time attract more comments on average. We'll use the following steps to perform this analysis:

* Calculate the number of ask posts created in each hour of the day, along with the number of comments received.
* Calculate the average number of comments ask posts receive by hour created.

In [6]:
import datetime as dt

# Pair each ask post's creation timestamp with its comment count.
result_list = [[post[6], int(post[4])] for post in ask_posts]

counts_by_hour, comments_by_hour = {}, {}
for res in result_list:
    # Timestamps look like '8/4/2016 11:52'; keep only the hour (0-23).
    date = dt.datetime.strptime(res[0], '%m/%d/%Y %H:%M')
    hour = date.hour
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = res[1]
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += res[1]

In the last block, we created two dictionaries:

* `counts_by_hour`: contains the number of ask posts created during each hour of the day.
* `comments_by_hour`: contains the total number of comments received by ask posts created during each hour.
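
The same two tallies can be built a little more compactly with `collections.defaultdict`, which removes the need for the explicit membership check. This is an alternative sketch on made-up `(created_at, num_comments)` pairs, not the code used in this notebook.

```python
import datetime as dt
from collections import defaultdict

# Made-up (created_at, num_comments) pairs for illustration only.
sample = [
    ('8/4/2016 11:52', 6),
    ('1/26/2016 19:30', 10),
    ('6/23/2016 11:05', 4),
]

counts = defaultdict(int)    # hour -> number of posts
comments = defaultdict(int)  # hour -> total comments

for created_at, num_comments in sample:
    hour = dt.datetime.strptime(created_at, '%m/%d/%Y %H:%M').hour
    counts[hour] += 1
    comments[hour] += num_comments

print(dict(counts))    # {11: 2, 19: 1}
print(dict(comments))  # {11: 10, 19: 10}
```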

Next, we'll use these two dictionaries to calculate the average number of comments for posts created during each hour of the day.

In [7]:
avg_by_hour = [[hour, comments_by_hour[hour]/counts_by_hour[hour]] for hour in counts_by_hour.keys()]
avg_by_hour = sorted(avg_by_hour, key=lambda x: x[1], reverse=True)
print('Top 5 Hours for Ask Posts Comments:')
for row in avg_by_hour[:5]:
    hour = dt.time(hour=row[0])
    print('{}: {:.2f} comments per post'.format(hour.strftime('%H:%M'), row[1]))

Top 5 Hours for Ask Posts Comments:
15:00: 38.59 comments per post
02:00: 23.81 comments per post
20:00: 21.52 comments per post
16:00: 16.80 comments per post
21:00: 16.01 comments per post


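Ask posts created during the 15:00 hour received the most comments on average (38.59 per post), well ahead of the second-best hour, 02:00 (23.81 per post). To attract the most comments, the best time to create an ask post would therefore be around 3 PM EST.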
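
Readers in other time zones may want to translate that hour. The sketch below converts 15:00 to UTC under the simplifying assumption that the timestamps are in Eastern Standard Time with a fixed UTC-5 offset, ignoring daylight saving time; the date used is arbitrary.

```python
import datetime as dt

# Assumption: timestamps are EST with a fixed UTC-5 offset (no daylight saving).
EST = dt.timezone(dt.timedelta(hours=-5), 'EST')

# Arbitrary date; only the hour matters for this illustration.
best_hour_est = dt.datetime(2016, 1, 15, 15, 0, tzinfo=EST)
best_hour_utc = best_hour_est.astimezone(dt.timezone.utc)

print(best_hour_utc.strftime('%H:%M %Z'))  # 20:00 UTC
```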