# Exploring Hacker News Posts
Hacker News is a site started by the startup incubator Y Combinator, where user-submitted stories (known as "posts") are voted and commented on, similar to Reddit. Hacker News (HN) is extremely popular in technology and startup circles, and posts that make it to the top of the Hacker News listings can get hundreds of thousands of visitors as a result.

You can find the data set [here](https://www.kaggle.com/hacker-news/hacker-news-posts), but note that it has been reduced from almost 300,000 rows to approximately 20,000 rows by removing all submissions that did not receive any comments, and then randomly sampling from the remaining submissions. Below are descriptions of the columns:

- id: The unique identifier from Hacker News for the post
- title: The title of the post
- url: The URL that the post links to, if the post has a URL
- num_points: The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
- num_comments: The number of comments that were made on the post
- author: The username of the person who submitted the post
- created_at: The date and time at which the post was submitted

Let's look at two interesting sub-types of posts on HN: Ask HN and Show HN. Ask HN posts are made by users to ask the HN community a specific question. For example:
```
Ask HN: How to improve my personal website?
Ask HN: Am I the only one outraged by Twitter shutting down share counts?
Ask HN: Aby recent changes to CSS that broke mobile?
```
Conversely, Show HN posts are made by users to showcase a project, product, or something else of interest. Here are a few examples of Show HN posts:
```
Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform
Show HN: Something pointless I made
Show HN: Shanhu.io, a programming playground powered by e8vm
```

Our initial objective will be to examine these two types of posts and compare them. We'll try to answer questions like:
- Which type of post receives more comments on average?
- Which type of post receives more points on average?
- Do posts created within a specific time window receive more comments on average?

In [1]:
import pandas as pd

In [2]:
# read_csv accepts a file path directly, so no explicit open() is needed
raw_dframe = pd.read_csv('hacker_news.csv')

headers = list(raw_dframe.columns)
raw_dframe.head()

Unnamed: 0,id,title,url,num_points,num_comments,author,created_at
0,12224879,Interactive Dynamic Video,http://www.interactivedynamicvideo.com/,386,52,ne0phyte,8/4/2016 11:52
1,10975351,How to Use Open Source and Shut the Fuck Up at...,http://hueniverse.com/2016/01/26/how-to-use-op...,39,10,josep2,1/26/2016 19:30
2,11964716,Florida DJs May Face Felony for April Fools' W...,http://www.thewire.com/entertainment/2013/04/f...,2,1,vezycash,6/23/2016 22:20
3,11919867,Technology ventures: From Idea to Enterprise,https://www.amazon.com/Technology-Ventures-Ent...,3,1,hswarna,6/17/2016 0:01
4,10301696,Note by Note: The Making of Steinway L1037 (2007),http://www.nytimes.com/2007/11/07/movies/07ste...,8,2,walterbell,9/30/2015 4:12


In [3]:
# Filter and separate Ask HN and Show HN posts

criteria_ask = raw_dframe['title'].map(lambda s: s.lower().startswith('ask hn'))
askhn_dframe = raw_dframe[criteria_ask].copy()

criteria_show = raw_dframe['title'].map(lambda s: s.lower().startswith('show hn'))
showhn_dframe = raw_dframe[criteria_show].copy()

criteria_other = (~criteria_ask) & (~criteria_show)
other_dframe = raw_dframe[criteria_other].copy()

# Dictionary of the different dframes for convenience
dframe_dict = {"askhn": askhn_dframe, "showhn":showhn_dframe, "other":other_dframe}
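The prefix filter above can also be written with pandas' vectorized `.str` accessor. This is a minimal sketch on a made-up toy frame (`toy` and its rows are assumptions for illustration, not the real data set). One practical difference: a lambda that calls `s.lower()` raises an error on a missing title, whereas `na=False` here simply treats a missing title as a non-match.

```python
import pandas as pd

# Hypothetical toy frame standing in for raw_dframe; only 'title' matters here.
toy = pd.DataFrame({
    "title": [
        "Ask HN: How to improve my personal website?",
        "Show HN: Something pointless I made",
        "Interactive Dynamic Video",
        None,  # a missing title, to show how na=False behaves
    ]
})

# Lower-case once, then test prefixes with the vectorized .str accessor.
lowered = toy["title"].str.lower()
is_ask = lowered.str.startswith("ask hn", na=False)
is_show = lowered.str.startswith("show hn", na=False)
is_other = ~(is_ask | is_show)

print(toy[is_ask]["title"].tolist())
print(int(is_other.sum()))
```

The three boolean masks play the same role as `criteria_ask`, `criteria_show`, and `criteria_other` in the cell above.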

Now let's look at some basic information about these types of posts:

In [4]:
# Count number of posts
s = "There are {count} posts in {name}."
print(s.format(count=len(dframe_dict["askhn"].index), name='Ask HN'))
print(s.format(count=len(dframe_dict["showhn"].index), name='Show HN'))
print(s.format(count=len(dframe_dict["other"].index), name='Other'))

# Average number of comments
total_ask_comments = dframe_dict["askhn"]["num_comments"].sum()
total_show_comments = dframe_dict["showhn"]["num_comments"].sum()
total_other_comments = dframe_dict["other"]["num_comments"].sum()

avg_ask_comments = total_ask_comments/len(dframe_dict["askhn"].index)
avg_show_comments = total_show_comments/len(dframe_dict["showhn"].index)
avg_other_comments = total_other_comments/len(dframe_dict["other"].index)

print("***")
s = "Average number of comments on {name} posts: {avg:.2f}"
print(s.format(name='Ask HN', avg=avg_ask_comments))
print(s.format(name='Show HN', avg=avg_show_comments))
print(s.format(name='Other', avg=avg_other_comments))


There are 1744 posts in Ask HN.
There are 1162 posts in Show HN.
There are 17194 posts in Other.
***
Average number of comments on Ask HN posts: 14.04
Average number of comments on Show HN posts: 10.32
Average number of comments on Other posts: 26.87


From this early investigation, Ask HN posts receive more comments on average than Show HN posts: 14.04 versus 10.32, or about 36% more in this dataset. Note that this is only an exploratory analysis. To confirm that Ask HN posts really do receive more comments, we would need to perform a more rigorous statistical test of significance.
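One way such a significance check might look is a permutation test on the difference in means. The sketch below uses only the standard library. The helper name `permutation_test` and the two sample lists are made up for illustration and are not taken from the data set, and this is just one of several reasonable tests, not necessarily the one best suited to comment counts.

```python
import random
from statistics import mean


def permutation_test(a, b, n_resamples=10_000, seed=0):
    """Two-sided permutation test for a difference in means.

    Returns the observed difference mean(a) - mean(b) and an
    approximate p-value.
    """
    rng = random.Random(seed)
    observed = mean(a) - mean(b)
    pooled = list(a) + list(b)
    n_a = len(a)
    extreme = 0
    for _ in range(n_resamples):
        # Reassign group labels at random by shuffling the pooled values.
        rng.shuffle(pooled)
        diff = mean(pooled[:n_a]) - mean(pooled[n_a:])
        if abs(diff) >= abs(observed):
            extreme += 1
    # Add-one correction keeps the estimate away from an impossible p = 0.
    p_value = (extreme + 1) / (n_resamples + 1)
    return observed, p_value


# Made-up comment counts, for illustration only.
ask_sample = [3, 10, 1, 22, 7, 15, 2, 9]
show_sample = [2, 4, 1, 8, 3, 6, 1, 5]
diff, p = permutation_test(ask_sample, show_sample, n_resamples=2_000)
print(f"difference in means: {diff:.2f}, p-value: {p:.3f}")
```

On the real data, the two inputs would be `dframe_dict["askhn"]["num_comments"].tolist()` and `dframe_dict["showhn"]["num_comments"].tolist()`. A small p-value would suggest the gap is unlikely to come from random labelling alone.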

Next, we'll look at the times at which posts are made and see if there's any difference in how many comments they attract. We'll do this by:
1. Binning `Ask HN` posts by the hour in which they were created
2. Calculating the average number of comments for posts made within each hourly bin
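The two steps above can be sketched with a single `groupby`. The toy rows below are made up for illustration (they only reuse the data set's `created_at` format), and `toy` and `by_hour` are hypothetical names.

```python
import pandas as pd

# Made-up rows in the same created_at format as the data set.
toy = pd.DataFrame({
    "created_at": ["8/4/2016 11:52", "8/5/2016 11:05", "1/26/2016 19:30",
                   "6/23/2016 19:45", "6/17/2016 0:01"],
    "num_comments": [52, 10, 1, 3, 2],
})

# Parse the timestamps, then group by the hour component.
toy["created_at"] = pd.to_datetime(toy["created_at"], format="%m/%d/%Y %H:%M")
by_hour = (
    toy.groupby(toy["created_at"].dt.hour)["num_comments"]
       .agg(["count", "mean"])  # posts per hour and average comments per post
)
print(by_hour)
```

Unlike an explicit loop over `range(0, 24)`, `groupby` only returns hours that actually contain posts, so empty hours would need to be added back (for example with `reindex`) if a full 24-row table is wanted.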

In [26]:
# Convert the created_at column to datetime
format_str = "%m/%d/%Y %H:%M"
dframe_dict['askhn']['created_at'] = pd.to_datetime(dframe_dict['askhn']['created_at'], format=format_str)

dframe = dframe_dict['askhn']
counts_by_hour = {}
comments_by_hour = {}

for hour in range(0,24):
    criteria = dframe['created_at'].dt.hour == hour
    result = dframe[criteria]
    
    counts_by_hour[hour] = len(result.index)
    comments_by_hour[hour] = result['num_comments'].mean()

s = "For hour {hour:2}, there were {count:3} posts in Ask HN with an average of {mean:.2f} comments."
for hour in sorted(counts_by_hour.keys()):
    print(s.format(hour=hour, count=counts_by_hour[hour], mean=comments_by_hour[hour]))

For hour  0, there were  55 posts in Ask HN with an average of 8.13 comments.
For hour  1, there were  60 posts in Ask HN with an average of 11.38 comments.
For hour  2, there were  58 posts in Ask HN with an average of 23.81 comments.
For hour  3, there were  54 posts in Ask HN with an average of 7.80 comments.
For hour  4, there were  47 posts in Ask HN with an average of 7.17 comments.
For hour  5, there were  46 posts in Ask HN with an average of 10.09 comments.
For hour  6, there were  44 posts in Ask HN with an average of 9.02 comments.
For hour  7, there were  34 posts in Ask HN with an average of 7.85 comments.
For hour  8, there were  48 posts in Ask HN with an average of 10.25 comments.
For hour  9, there were  45 posts in Ask HN with an average of 5.58 comments.
For hour 10, there were  59 posts in Ask HN with an average of 13.44 comments.
For hour 11, there were  58 posts in Ask HN with an average of 11.05 comments.
For hour 12, there were  73 posts in Ask HN with an averag

In [28]:
avg_by_hour = [ [k,v] for k,v in sorted(comments_by_hour.items(), key=lambda item: item[1], reverse=True)]
s = "{hr}:00: {mean:.2f} average comments per post"
for hr,mean in avg_by_hour[:5]:
    print(s.format(hr=hr, mean=mean))

15:00: 38.59 average comments per post
2:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post


The hour receiving the most comm