# Ask and Show Hacker News, What Makes a Popular Post?
Hacker News is a site started by the startup incubator Y Combinator, where user-submitted stories (known as "posts") are voted and commented upon, similar to Reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of Hacker News' listings can get hundreds of thousands of visitors as a result.

Aside from linking to outside articles and websites for discussion, Hacker News (HN) has two other types of posts: *Ask HN*, where users ask the HN community a short, specific question, and *Show HN*, where users show a project, product, or something else interesting.

We are interested in comparing these two types of posts to determine the following:
- Do 'Ask HN' or 'Show HN' posts receive more comments on average?
- Do posts created at a certain time receive more comments on average?

## Dataset
You can find the data set [here](https://www.kaggle.com/hacker-news/hacker-news-posts). The file loaded below is the full data set of almost 300,000 rows, including submissions that did not receive any comments. Below are descriptions of the columns:

- **id**: The unique identifier from Hacker News for the post
- **title**: The title of the post
- **url**: The URL that the post links to, if the post has a URL
- **num_points**: The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
- **num_comments**: The number of comments that were made on the post
- **author**: The username of the person who submitted the post
- **created_at**: The date and time at which the post was submitted

Let's begin by importing the data set and looking at the first few rows.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [2]:
hn = pd.read_csv('HN_posts_year_to_Sep_26_2016.csv')
print('''DataFrame contains {} Hacker News posts and {} columns'''.format(*hn.shape))
hn.head(5)

DataFrame contains 293119 Hacker News posts and 7 columns


| | id | title | url | num_points | num_comments | author | created_at |
|---|---|---|---|---|---|---|---|
| 0 | 12579008 | You have two days to comment if you want stem ... | http://www.regulations.gov/document?D=FDA-2015... | 1 | 0 | altstar | 9/26/2016 3:26 |
| 1 | 12579005 | SQLAR the SQLite Archiver | https://www.sqlite.org/sqlar/doc/trunk/README.md | 1 | 0 | blacksqr | 9/26/2016 3:24 |
| 2 | 12578997 | What if we just printed a flatscreen televisio... | https://medium.com/vanmoof/our-secrets-out-f21... | 1 | 0 | pavel_lishin | 9/26/2016 3:19 |
| 3 | 12578989 | algorithmic music | http://cacm.acm.org/magazines/2011/7/109891-al... | 1 | 0 | poindontcare | 9/26/2016 3:16 |
| 4 | 12578979 | How the Data Vault Enables the Next-Gen Data W... | https://www.talend.com/blog/2016/05/12/talend-... | 1 | 0 | markgainor1 | 9/26/2016 3:14 |


Since we're only interested in *Ask HN* and *Show HN* posts, let's separate these types of posts from the rest.

In [3]:
starts_with_ask = hn['title'].str.lower().str.startswith('ask hn')
starts_with_show = hn['title'].str.lower().str.startswith('show hn')

ask_posts = hn[starts_with_ask]
show_posts = hn[starts_with_show]
other_posts = hn[~starts_with_ask & ~starts_with_show]

print('''Ask posts: {}
Show posts: {}
Other posts: {}
Total: {}'''.format(len(ask_posts),
                    len(show_posts),
                    len(other_posts),
                    sum([len(ask_posts), len(show_posts), len(other_posts)])))

Ask posts: 9139
Show posts: 10158
Other posts: 273822
Total: 293119


## Analysis
With the Ask and Show posts separated, we can begin looking at differences in engagement and what may factor into those differences.

Let's first look to see which of the two post types receives more comments and points on average.

In [32]:
def avg_comments_and_points(df, df_name):
    avg_comments = df['num_comments'].mean()
    avg_points = df['num_points'].mean()
    
    print('''Average comments on {0} post: {1:.1f}
Average points on {0} post: {2:.1f}
    '''.format(df_name, avg_comments, avg_points))

avg_comments_and_points(ask_posts, 'Ask')
avg_comments_and_points(show_posts, 'Show')

Average comments on Ask post: 10.4
Average points on Ask post: 11.3
    
Average comments on Show post: 4.9
Average points on Show post: 14.8
    


This shows that Ask posts receive more than double the number of comments on average: 10.4 vs. 4.9 for Show posts. However, Ask posts average 11.3 points per post, whereas Show posts average 14.8. So Ask posts generally get more comment engagement, but somewhat fewer points on average than Show posts.
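The same comparison can also be produced in a single table by labelling each post with its type and grouping on that label. The following is only a minimal sketch: the `posts` DataFrame, its values, and the `label_post_type` helper are hypothetical stand-ins, not part of the original notebook.

```python
import pandas as pd

# Hypothetical stand-in rows with the same columns as `hn` (values are made up).
posts = pd.DataFrame({
    'title': ['Ask HN: How do you learn?', 'Show HN: My side project',
              'ask hn: Favourite editor?', 'A regular link post'],
    'num_comments': [10, 4, 6, 2],
    'num_points': [8, 20, 4, 30],
})


def label_post_type(title):
    """Return 'Ask', 'Show' or 'Other' based on the title prefix (case-insensitive)."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'Ask'
    if lowered.startswith('show hn'):
        return 'Show'
    return 'Other'


posts['post_type'] = posts['title'].map(label_post_type)

# One row per post type, one column per averaged metric.
summary = posts.groupby('post_type')[['num_comments', 'num_points']].mean()
print(summary.round(1))
```

Keeping the post type as a column rather than as separate DataFrames makes it straightforward to add further metrics to the same table later.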

What about time of day? Does a post get more engagement depending on the time of day that it is posted? Let's take a look.

In [37]:
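# Parse the timestamps, then average only the numeric columns by hour of creation
# (calling mean() on text columns such as title raises an error in recent pandas).
ask_times = pd.to_datetime(ask_posts['created_at'])
hour_table = ask_posts[['num_points', 'num_comments']].groupby(ask_times.dt.hour).mean()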

hour_table

| created_at | num_points | num_comments |
|---|---|---|
| 0 | 9.418605 | 7.564784 |
| 1 | 9.439716 | 7.407801 |
| 2 | 10.944238 | 11.137546 |
| 3 | 9.369004 | 7.948339 |
| 4 | 10.90535 | 9.711934 |
| 5 | 9.789474 | 8.794258 |
| 6 | 8.675214 | 6.782051 |
| 7 | 9.026549 | 7.013274 |
| 8 | 10.677043 | 9.190661 |
| 9 | 7.941441 | 6.653153 |


This shows us the average number of points and comments for Ask posts, broken down by the hour they were posted. Note that the hours are on a 24-hour scale, so, for example, 16 is 4pm and 0 is 12am.
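As a small illustration of how the hour is derived from `created_at`, the sketch below parses a few sample strings and extracts the 24-hour value. The explicit format string is an assumption based on the sample rows shown earlier; the data set itself does not document it.

```python
import pandas as pd

# Sample strings in the same shape as the `created_at` values shown earlier.
created_at = pd.Series(['9/26/2016 3:26', '9/26/2016 16:05', '9/25/2016 0:47'])

# An explicit format avoids day/month guessing. This format string is an
# assumption inferred from the sample rows.
timestamps = pd.to_datetime(created_at, format='%m/%d/%Y %H:%M')

# .dt.hour returns the hour on a 24-hour scale (0 to 23).
hours = timestamps.dt.hour
print(hours.tolist())
```

Here 16 corresponds to 4pm and 0 to 12am, matching the index of the hour table.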

To get a better look at the hours with the most engagement, we'll look at only the top 5 hours for both points and comments.

In [35]:
print('Top 5 Hours of Day with Most Average Points')
hour_table.sort_values('num_points', ascending=False).head(5)

| created_at | num_points | num_comments |
|---|---|---|
| 15 | 13978 | 18525 |
| 13 | 7962 | 7245 |
| 17 | 7155 | 5547 |
| 18 | 6850 | 4877 |
| 16 | 5970 | 4466 |


In [36]:
print('Top 5 Hours of Day with Most Average Comments')
hour_table.sort_values('num_comments', ascending=False).head(5)

Top 5 Hours of Day with Most Average Comments


| created_at | num_points | num_comments |
|---|---|---|
| 15 | 13978 | 18525 |
| 13 | 7962 | 7245 |
| 17 | 7155 | 5547 |
| 14 | 5390 | 4972 |
| 18 | 6850 | 4877 |


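In these two tables, posts created at 3pm lead by a wide margin, followed by 1pm and 5pm, for both points and comments. Note, however, that the values shown here (in the thousands) are far larger than the per-hour averages in `hour_table` above (roughly 7 to 11), which suggests these outputs are hourly totals from an earlier run rather than averages. Totals also depend on how many posts are created in each hour, so on their own they do not show that an individual 3pm post receives more engagement on average.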
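One way to check whether an hour's lead reflects per-post engagement rather than posting volume is to compute the post count, total, and mean for each hour side by side. The sketch below uses a hypothetical `ask` DataFrame with made-up values chosen to show how the two rankings can disagree; the format string is likewise an assumption based on the sample rows.

```python
import pandas as pd

# Hypothetical Ask HN rows; the values are made up to illustrate the effect.
ask = pd.DataFrame({
    'created_at': ['9/26/2016 15:01', '9/26/2016 15:20', '9/26/2016 15:45',
                   '9/26/2016 15:59', '9/26/2016 2:10'],
    'num_comments': [5, 5, 5, 5, 12],
})

# Hour of creation on a 24-hour scale (format string is an assumption).
hour = pd.to_datetime(ask['created_at'], format='%m/%d/%Y %H:%M').dt.hour.rename('hour')

# Count, total, and mean comments per hour in one table.
by_hour = ask.groupby(hour)['num_comments'].agg(['count', 'sum', 'mean'])
print(by_hour)

busiest_by_total = by_hour['sum'].idxmax()   # hour with the most comments overall
busiest_by_mean = by_hour['mean'].idxmax()   # hour with the most comments per post
print(busiest_by_total, busiest_by_mean)
```

In this toy data, hour 15 has the highest total only because it has more posts, while hour 2 has the highest average, which is why the count column is worth reading alongside the other two.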

## Conclusion