# Exploring Hacker News Posts

In this project, I'll work with a dataset of submissions to the popular technology site Hacker News.
Hacker News is a site where user-submitted stories (known as "posts") receive votes and comments, similar to Reddit. It is extremely popular in technology and startup circles, and posts that reach the top of the Hacker News listings can get hundreds of thousands of visitors as a result.

I am specifically interested in posts with titles that begin with either Ask HN or Show HN. Users submit Ask HN posts to ask the Hacker News community a specific question. Below are a few examples:
- Ask HN: How to improve my personal website?
- Ask HN: Am I the only one outraged by Twitter shutting down share counts?

Likewise, users submit Show HN posts to show the Hacker News community a project, product, or just something interesting. Below are a few examples: 
- Show HN: Wio Link ESP8266 Based Web of Things Hardware Development Platform
- Show HN: Shanhu.io, a programming playground powered by e8vm


## Goal of the project

I'll compare these two types of posts to determine the following:
1. Do Ask HN or Show HN posts receive more comments on average?
2. Do posts created at a certain time receive more comments on average?

# Exploring the dataset

The dataset contains the following columns:

-   `id`: The unique identifier from Hacker News for the post
-   `title`: The title of the post
-   `url`: The URL that the post links to, if the post has a URL
-   `num_points`: The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
-   `num_comments`: The number of comments that were made on the post
-   `author`: The username of the person who submitted the post
-   `created_at`: The date and time at which the post was submitted

In [102]:
import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt
import seaborn as sns

In [103]:
path = '../../../../08_Zadania_baza/DataScience/DataQuest/Guided Projects/Beginner/Exploring Hacker News Posts'
hn = pd.read_csv(f'{path}/hacker_news.csv')
hn

Unnamed: 0,id,title,url,num_points,num_comments,author,created_at
0,12579008,You have two days to comment if you want stem ...,http://www.regulations.gov/document?D=FDA-2015...,1,0,altstar,9/26/2016 3:26
1,12579005,SQLAR the SQLite Archiver,https://www.sqlite.org/sqlar/doc/trunk/README.md,1,0,blacksqr,9/26/2016 3:24
2,12578997,What if we just printed a flatscreen televisio...,https://medium.com/vanmoof/our-secrets-out-f21...,1,0,pavel_lishin,9/26/2016 3:19
3,12578989,algorithmic music,http://cacm.acm.org/magazines/2011/7/109891-al...,1,0,poindontcare,9/26/2016 3:16
4,12578979,How the Data Vault Enables the Next-Gen Data W...,https://www.talend.com/blog/2016/05/12/talend-...,1,0,markgainor1,9/26/2016 3:14
...,...,...,...,...,...,...,...
293114,10176919,Ask HN: What is/are your favorite quote(s)?,,15,20,kumarski,9/6/2015 6:02
293115,10176917,Attention and awareness in stage magic: turnin...,http://people.cs.uchicago.edu/~luitien/nrn2473...,14,0,stakent,9/6/2015 6:01
293116,10176908,Dying vets fuck you letter (2013),http://dangerousminds.net/comments/dying_vets_...,10,2,mycodebreaks,9/6/2015 5:56
293117,10176907,"PHP 7 Coolest Features: Space Ships, Type Hint...",https://www.zend.com/en/resources/php-7,2,0,Garbage,9/6/2015 5:55


In [104]:
hn.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 293119 entries, 0 to 293118
Data columns (total 7 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   id            293119 non-null  int64 
 1   title         293119 non-null  object
 2   url           279256 non-null  object
 3   num_points    293119 non-null  int64 
 4   num_comments  293119 non-null  int64 
 5   author        293119 non-null  object
 6   created_at    293119 non-null  object
dtypes: int64(3), object(4)
memory usage: 15.7+ MB


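Looking at the column types, we can see that `created_at` is stored as `object` (string) and should be converted to datetime. The `url` column also has missing values (279,256 non-null out of 293,119 rows).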
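
A minimal sketch of such a conversion, assuming the `month/day/year hour:minute` layout seen in the table above. The two-row `sample` frame is hypothetical and only stands in for `hn`:

```python
import pandas as pd

# Hypothetical sample mimicking the dataset's `created_at` strings.
sample = pd.DataFrame({'created_at': ['9/26/2016 3:26', '9/6/2015 5:55']})

# An explicit format avoids any day/month ambiguity during parsing.
sample['created_at'] = pd.to_datetime(sample['created_at'], format='%m/%d/%Y %H:%M')

# Once the column is datetime, the `.dt` accessor exposes parts such as the hour.
print(sample['created_at'].dt.hour.tolist())  # [3, 5]
```

Passing `format` explicitly is a design choice here: it makes the parse fail loudly if a row does not match the expected layout, rather than silently guessing.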

In [105]:
hn.describe().round(2)

Unnamed: 0,id,num_points,num_comments
count,293119.0,293119.0,293119.0
mean,11330462.68,15.03,6.53
std,696105.48,58.5,30.38
min,10176903.0,1.0,0.0
25%,10716358.0,1.0,0.0
50%,11303026.0,2.0,0.0
75%,11931517.0,4.0,1.0
max,12579008.0,5771.0,2531.0


# Extracting Ask HN and Show HN Posts

In [106]:
ask_hn_df = hn[hn['title'].str.startswith('Ask HN')].copy()
show_hn_df = hn[hn['title'].str.startswith('Show HN')].copy()

ask_hn_df

Unnamed: 0,id,title,url,num_points,num_comments,author,created_at
10,12578908,Ask HN: What TLD do you use for local developm...,,4,7,Sevrene,9/26/2016 2:53
42,12578522,Ask HN: How do you pass on your work when you ...,,6,3,PascLeRasc,9/26/2016 1:17
76,12577908,Ask HN: How a DNS problem can be limited to a ...,,1,0,kuon,9/25/2016 22:57
80,12577870,Ask HN: Why join a fund when you can be an angel?,,1,3,anthony_james,9/25/2016 22:48
102,12577647,Ask HN: Someone uses stock trading as passive ...,,5,2,00taffe,9/25/2016 21:50
...,...,...,...,...,...,...,...
293047,10177359,Ask HN: Is coursera specialization in product ...,,1,0,pipipzz,9/6/2015 11:27
293052,10177317,Ask HN: Any meteor devs out there who could sp...,,2,1,louisswiss,9/6/2015 10:52
293055,10177309,Ask HN: Any recommendations for books about ra...,,2,4,rationalthrowa,9/6/2015 10:46
293073,10177200,Ask HN: Where do you look for work if you need...,,14,20,coroutines,9/6/2015 9:36


In [107]:
print(f'Number of posts with title "Ask HN": {ask_hn_df.shape[0]}')

Number of posts with title "Ask HN": 9122
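
Note that `str.startswith` is case-sensitive, so titles such as `ask hn: ...` would not be counted by the filter above. Whether the dataset contains such titles is not checked here; the sketch below only illustrates the difference on hypothetical titles:

```python
import pandas as pd

# Hypothetical titles; only the first three begin with some casing of "Ask HN".
titles = pd.Series([
    'Ask HN: How to improve my personal website?',
    'ask hn: lowercase prefix',
    'ASK HN: uppercase prefix',
    'Show HN: Finding puns computationally',
    'Why I never Ask HN anything',
])

# Case-sensitive prefix match, as in the filter above.
strict_mask = titles.str.startswith('Ask HN')

# Case-insensitive variant: lowercase the titles before matching the prefix.
loose_mask = titles.str.lower().str.startswith('ask hn')

print(strict_mask.sum(), loose_mask.sum())  # 1 3
```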


In [108]:
show_hn_df

Unnamed: 0,id,title,url,num_points,num_comments,author,created_at
52,12578335,Show HN: Finding puns computationally,http://puns.samueltaylor.org/,2,0,saamm,9/26/2016 0:36
58,12578182,Show HN: A simple library for complicated anim...,https://christinecha.github.io/choreographer-js/,1,0,christinecha,9/26/2016 0:01
64,12578098,Show HN: WebGL visualization of DNA sequences,http://grondilu.github.io/dna.html,1,0,grondilu,9/25/2016 23:44
70,12577991,"Show HN: Pomodoro-centric, heirarchical projec...",https://github.com/jakebian/zeal,2,0,dbranes,9/25/2016 23:17
140,12577142,Show HN: Jumble Essays on the go #PaulInYourP...,https://itunes.apple.com/us/app/jumble-find-st...,1,1,ryderj,9/25/2016 20:06
...,...,...,...,...,...,...,...
292995,10177714,Show HN: Repartee The SMS Messaging Stack for...,https://github.com/markgreenall/Repartee,2,0,Nuratu,9/6/2015 14:21
293002,10177631,Show HN: Immutable and type-checked state and ...,https://github.com/gcanti/redux-tcomb,20,2,gcanti,9/6/2015 13:50
293019,10177511,Show HN: MockTheClock A tiny JavaScript libra...,https://github.com/zb3/MockTheClock,18,6,zb3,9/6/2015 13:02
293028,10177459,Show HN: AppyPaper Gift wrap with app icons p...,http://www.appypaper.com/,6,4,submitstartup,9/6/2015 12:38


In [109]:
print(f'Number of posts with title "Show HN": {show_hn_df.shape[0]}')

Number of posts with title "Show HN": 10150


# Calculating the Average Number of Comments for Ask HN and Show HN Posts

In [110]:
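# Compute the means first: reusing the same quote type inside an f-string is a syntax error before Python 3.12
show_avg = show_hn_df['num_comments'].mean()
ask_avg = ask_hn_df['num_comments'].mean()
print(f'Average Number of comments for "Show HN": {show_avg:.0f}')
print(f'Average Number of comments for "Ask HN": {ask_avg:.0f}')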

Average Number of comments for "Show HN": 5
Average Number of comments for "Ask HN": 10
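
The summary statistics earlier show a median of 0 comments and a maximum of 2,531 across all posts, so the means are likely pulled upward by a small number of heavily discussed posts. A small sketch on hypothetical comment counts shows how the mean and median can diverge in that situation:

```python
import pandas as pd

# Hypothetical comment counts: mostly small, with one very popular post.
comments = pd.Series([0, 0, 1, 2, 3, 4, 250])

mean_comments = comments.mean()      # pulled upward by the single large value
median_comments = comments.median()  # the middle value, unaffected by the outlier

print(f'mean={mean_comments:.2f}, median={median_comments:.0f}')  # mean=37.14, median=2
```

Reporting `show_hn_df['num_comments'].median()` and `ask_hn_df['num_comments'].median()` alongside the means would be one way to check how much the comparison depends on outliers.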


# Finding the Number of Ask Posts and Comments by Hour Created

In [111]:
ask_hn_df['created_at'].describe()

count               9122
unique              9011
top       9/1/2016 15:00
freq                   4
Name: created_at, dtype: object

In [112]:
ask_hn_df['created_at'].unique()[:5].tolist()

['9/26/2016 2:53',
 '9/26/2016 1:17',
 '9/25/2016 22:57',
 '9/25/2016 22:48',
 '9/25/2016 21:50']

In [113]:
ask_hn_df['created_at'] = pd.to_datetime(ask_hn_df['created_at'], format='%m/%d/%Y %H:%M')
show_hn_df['created_at'] = pd.to_datetime(show_hn_df['created_at'], format='%m/%d/%Y %H:%M')
ask_hn_df

Unnamed: 0,id,title,url,num_points,num_comments,author,created_at
10,12578908,Ask HN: What TLD do you use for local developm...,,4,7,Sevrene,2016-09-26 02:53:00
42,12578522,Ask HN: How do you pass on your work when you ...,,6,3,PascLeRasc,2016-09-26 01:17:00
76,12577908,Ask HN: How a DNS problem can be limited to a ...,,1,0,kuon,2016-09-25 22:57:00
80,12577870,Ask HN: Why join a fund when you can be an angel?,,1,3,anthony_james,2016-09-25 22:48:00
102,12577647,Ask HN: Someone uses stock trading as passive ...,,5,2,00taffe,2016-09-25 21:50:00
...,...,...,...,...,...,...,...
293047,10177359,Ask HN: Is coursera specialization in product ...,,1,0,pipipzz,2015-09-06 11:27:00
293052,10177317,Ask HN: Any meteor devs out there who could sp...,,2,1,louisswiss,2015-09-06 10:52:00
293055,10177309,Ask HN: Any recommendations for books about ra...,,2,4,rationalthrowa,2015-09-06 10:46:00
293073,10177200,Ask HN: Where do you look for work if you need...,,14,20,coroutines,2015-09-06 09:36:00


In [114]:
ask_hn_df.dtypes

id                       int64
title                   object
url                     object
num_points               int64
num_comments             int64
author                  object
created_at      datetime64[ns]
dtype: object

In [115]:
# Extract the hour of day with the pandas `.dt` accessor (no `datetime` import is needed)
ask_hn_df['created_hour'] = ask_hn_df['created_at'].dt.hour
show_hn_df['created_hour'] = show_hn_df['created_at'].dt.hour


In [116]:
ask_hn_df.head()

Unnamed: 0,id,title,url,num_points,num_comments,author,created_at,created_hour
10,12578908,Ask HN: What TLD do you use for local developm...,,4,7,Sevrene,2016-09-26 02:53:00,2
42,12578522,Ask HN: How do you pass on your work when you ...,,6,3,PascLeRasc,2016-09-26 01:17:00,1
76,12577908,Ask HN: How a DNS problem can be limited to a ...,,1,0,kuon,2016-09-25 22:57:00,22
80,12577870,Ask HN: Why join a fund when you can be an angel?,,1,3,anthony_james,2016-09-25 22:48:00,22
102,12577647,Ask HN: Someone uses stock trading as passive ...,,5,2,00taffe,2016-09-25 21:50:00,21


In [120]:
ask_hn_hour = ask_hn_df.groupby(by='created_hour')['num_comments'].agg('mean').round(2).reset_index()
show_hn_hour = show_hn_df.groupby(by='created_hour')['num_comments'].agg('mean').round(2).reset_index()

# Rename columns for better readability
ask_hn_hour.rename(columns={'num_comments': 'avg_num_comments_ask_hn'}, inplace=True)
show_hn_hour.rename(columns={'num_comments': 'avg_num_comments_show_hn'}, inplace=True)

compare_avg_df = pd.merge(ask_hn_hour, show_hn_hour, on='created_hour', how='outer')
compare_avg_df.sort_values('created_hour', inplace=True)
compare_avg_df.head(10)

Unnamed: 0,created_hour,avg_num_comments_ask_hn,avg_num_comments_show_hn
0,0,7.58,4.65
1,1,7.41,4.07
2,2,11.14,5.15
3,3,7.97,4.53
4,4,9.74,5.04
5,5,8.79,3.46
6,6,6.78,4.71
7,7,7.04,6.69
8,8,9.19,5.62
9,9,6.65,4.67
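
Hourly means can rest on very different numbers of posts, so it may help to report the post count next to each average. A sketch using named aggregation on hypothetical data (the `posts` frame and its values are made up for illustration):

```python
import pandas as pd

# Hypothetical posts; `created_hour` and `num_comments` mirror the columns above.
posts = pd.DataFrame({
    'created_hour': [15, 15, 15, 2, 2, 7],
    'num_comments': [30, 10, 20, 4, 8, 6],
})

# Named aggregation: the mean plus the number of posts behind each mean.
by_hour = (
    posts.groupby('created_hour')['num_comments']
    .agg(avg_num_comments='mean', num_posts='count')
    .round(2)
    .reset_index()
)
print(by_hour)
```

An hour whose average is based on only a handful of posts deserves less weight than one backed by hundreds.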


In [125]:
compare_avg_df.sort_values('avg_num_comments_ask_hn', ascending=False).head(10)

Unnamed: 0,created_hour,avg_num_comments_ask_hn,avg_num_comments_show_hn
15,15,28.68,4.58
13,13,16.35,5.43
12,12,12.38,6.99
2,2,11.14,5.15
10,10,10.68,3.8
4,4,9.74,5.04
14,14,9.71,5.52
17,17,9.45,4.25
8,8,9.19,5.62
11,11,9.01,6.0


In [126]:
compare_avg_df.sort_values('avg_num_comments_show_hn', ascending=False).head(10)

Unnamed: 0,created_hour,avg_num_comments_ask_hn,avg_num_comments_show_hn
12,12,12.38,6.99
7,7,7.04,6.69
11,11,9.01,6.0
8,8,9.19,5.62
14,14,9.71,5.52
13,13,16.35,5.43
2,2,11.14,5.15
4,4,9.74,5.04
19,19,7.18,5.02
18,18,7.95,4.94
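
An alternative to sorting the whole frame is `DataFrame.nlargest`, which returns the top rows for a column and leaves the original order untouched. The sketch below uses a hypothetical `averages` frame filled with a subset of the hourly values shown above:

```python
import pandas as pd

# Subset of the hourly averages shown above, with the same column names.
averages = pd.DataFrame({
    'created_hour': [2, 7, 11, 12, 13, 15],
    'avg_num_comments_ask_hn': [11.14, 7.04, 9.01, 12.38, 16.35, 28.68],
    'avg_num_comments_show_hn': [5.15, 6.69, 6.0, 6.99, 5.43, 4.58],
})

# nlargest picks the top-n rows by one column without modifying `averages`.
top_ask = averages.nlargest(3, 'avg_num_comments_ask_hn')
top_show = averages.nlargest(3, 'avg_num_comments_show_hn')

print(top_ask['created_hour'].tolist())   # [15, 13, 12]
print(top_show['created_hour'].tolist())  # [12, 7, 11]
```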


# Conclusion

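In this project, I analyzed Ask HN and Show HN posts to determine which type of post and which posting time receive the most comments on average. Ask HN posts receive about 10 comments on average, roughly twice as many as Show HN posts (about 5). Based on the hourly averages, to maximize the number of comments a post receives, I'd recommend creating:
 - Ask HN posts between 12:00 and 16:00; the 15:00 hour has the highest average (28.68 comments), followed by 13:00 (16.35) and 12:00 (12.38)
 - Show HN posts between 7:00 and 13:00; the 12:00 (6.99), 7:00 (6.69), and 11:00 (6.0) hours have the highest averages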
