# Ask and Show Hacker News: What Makes a Popular Post?
Hacker News is a site started by the startup incubator Y Combinator, where user-submitted stories (known as "posts") are voted and commented upon, similar to reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of Hacker News' listings can get hundreds of thousands of visitors as a result.

Aside from linking to outside articles and websites for discussion, Hacker News (HN) has two other types of posts: *Ask HN*, where users ask the HN community a short, specific question, and *Show HN*, where users show a project, product, or something else interesting.

We are interested in comparing these two types of posts to determine the following:
- Do *Ask HN* or *Show HN* posts receive more comments on average?
- Do posts created at a certain time receive more comments on average?

## Dataset
You can find the data set [here](https://www.kaggle.com/hacker-news/hacker-news-posts). A reduced version of approximately 20,000 rows also exists, made by removing all submissions that did not receive any comments and then randomly sampling from the remaining submissions. The file loaded below is the full set of almost 300,000 rows, so it still includes submissions with no comments. Below are descriptions of the columns:

- **id**: The unique identifier from Hacker News for the post
- **title**: The title of the post
- **url**: The URL that the post links to, if the post has a URL
- **num_points**: The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
- **num_comments**: The number of comments that were made on the post
- **author**: The username of the person who submitted the post
- **created_at**: The date and time at which the post was submitted

Let's begin by importing the data set and looking at the first few rows.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [9]:
hn = pd.read_csv('HN_posts_year_to_Sep_26_2016.csv')
print('''DataFrame contains {} Hacker News posts and {} columns'''.format(*hn.shape))
hn.head(5)

DataFrame contains 293119 Hacker News posts and 7 columns


| | id | title | url | num_points | num_comments | author | created_at |
|---|---|---|---|---|---|---|---|
| 0 | 12579008 | You have two days to comment if you want stem ... | http://www.regulations.gov/document?D=FDA-2015... | 1 | 0 | altstar | 9/26/2016 3:26 |
| 1 | 12579005 | SQLAR the SQLite Archiver | https://www.sqlite.org/sqlar/doc/trunk/README.md | 1 | 0 | blacksqr | 9/26/2016 3:24 |
| 2 | 12578997 | What if we just printed a flatscreen televisio... | https://medium.com/vanmoof/our-secrets-out-f21... | 1 | 0 | pavel_lishin | 9/26/2016 3:19 |
| 3 | 12578989 | algorithmic music | http://cacm.acm.org/magazines/2011/7/109891-al... | 1 | 0 | poindontcare | 9/26/2016 3:16 |
| 4 | 12578979 | How the Data Vault Enables the Next-Gen Data W... | https://www.talend.com/blog/2016/05/12/talend-... | 1 | 0 | markgainor1 | 9/26/2016 3:14 |
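
The `created_at` values in the preview are plain strings such as `9/26/2016 3:26`, so the question about posting time will need real timestamps at some point. Here is a minimal sketch, assuming the month/day/year hour:minute layout seen in the preview holds for every row and that the hour is on a 24-hour clock. The names `sample`, `parsed`, and `hours` are illustrative only.

```python
import pandas as pd

# Two created_at strings copied from the preview
sample = pd.Series(['9/26/2016 3:26', '9/26/2016 3:14'])

# Assumed layout: month/day/year hour:minute (24-hour clock)
parsed = pd.to_datetime(sample, format='%m/%d/%Y %H:%M')

# The .dt accessor exposes components such as the hour of submission
hours = parsed.dt.hour
print(hours.tolist())
```

If the real column contains other layouts, `pd.to_datetime` will raise an error with this explicit `format`, which is usually preferable to silently guessing.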


Since we're only interested in *Ask HN* and *Show HN* posts, let's separate them from the rest.

In [7]:
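# Boolean masks: lowercase each title first so the prefix match is case-insensitive
starts_with_ask = hn['title'].str.lower().str.startswith('ask hn')
starts_with_show = hn['title'].str.lower().str.startswith('show hn')

# Split the posts into three non-overlapping groups
ask_posts = hn[starts_with_ask]
show_posts = hn[starts_with_show]
other_posts = hn[~starts_with_ask & ~starts_with_show]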
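
To see how the masks behave, here is a small sketch of the same filtering on a toy DataFrame. The titles are made up for illustration, apart from one taken from the preview, and `toy` is a hypothetical name. The missing title and the `na=False` argument are an added assumption: they show one way to keep the masks strictly boolean if a title were ever absent, which the code above does not need as long as every post has a title.

```python
import pandas as pd

# Toy data: mixed-case prefixes, an ordinary link post, and a missing title
toy = pd.DataFrame({'title': ['Ask HN: How do you learn?',
                              'Show HN: My project',
                              'ask hn: lowercase question',
                              'SQLAR the SQLite Archiver',
                              None]})

# Lowercase first so the prefix match ignores case;
# na=False treats a missing title as "no match" instead of NaN
lowered = toy['title'].str.lower()
is_ask = lowered.str.startswith('ask hn', na=False)
is_show = lowered.str.startswith('show hn', na=False)

ask = toy[is_ask]
show = toy[is_show]
other = toy[~is_ask & ~is_show]

# The three groups are disjoint and together cover every row
print(len(ask), len(show), len(other), len(toy))
```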

In [11]:
print('''Ask posts: {}
Show posts: {}
Other posts: {}
Total: {}'''.format(len(ask_posts),
                    len(show_posts),
                    len(other_posts),
                    sum([len(ask_posts), len(show_posts), len(other_posts)])))

Ask posts: 9139
Show posts: 10158
Other posts: 273822
Total: 293119
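
The raw counts are easier to compare as shares of the whole data set. The short sketch below redoes that arithmetic using only the counts printed above; `counts` and `shares` are illustrative names.

```python
# Counts copied from the output above
counts = {'Ask posts': 9139, 'Show posts': 10158, 'Other posts': 273822}
total = sum(counts.values())

# Share of each post type as a percentage of all posts
shares = {name: 100 * n / total for name, n in counts.items()}

for name, pct in shares.items():
    print('{}: {:.2f}%'.format(name, pct))
```

Together, the two groups of interest account for 19,297 of the 293,119 posts, so most of the data set is set aside for this analysis.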
