# Exploring Hacker News Posts

Hacker News is a site started by the startup incubator Y Combinator, where user-submitted stories (known as "posts") are voted and commented upon, similar to reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of Hacker News' listings can get hundreds of thousands of visitors as a result.

You can find the data set [here](https://www.kaggle.com/hacker-news/hacker-news-posts). Our first step is to reduce the data from almost 300,000 rows to approximately 70,000 rows by removing all submissions that did not receive any comments. Below are descriptions of the columns:

- id: The unique identifier from Hacker News for the post
- title: The title of the post
- url: The URL that the post links to, if the post has a URL
- num_points: The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
- num_comments: The number of comments that were made on the post
- author: The username of the person who submitted the post
- created_at: The date and time at which the post was submitted

Here are the first few rows of the data set:

In [1]:
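import pandas as pd

# Raw string so the backslashes in the Windows path are not read as escape sequences
data = pd.read_csv(r'Data\hacker-news-posts\HN_posts_year_to_Sep_26_2016.csv')
# Drop rows with any missing value (this includes posts that have no URL)
data = data.dropna()
# Keep only posts that received at least one comment
data = data[data['num_comments'] > 0]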
data.head()

Unnamed: 0,id,title,url,num_points,num_comments,author,created_at
5,12578975,Saving the Hassle of Shopping,https://blog.menswr.com/2016/09/07/whats-new-w...,1,1,bdoux,9/26/2016 3:13
17,12578822,Amazons Algorithms Dont Find You the Best Deals,https://www.technologyreview.com/s/602442/amaz...,1,1,yarapavan,9/26/2016 2:26
28,12578694,Emergency dose of epinephrine that does not co...,http://m.imgur.com/gallery/th6Ua,2,1,dredmorbius,9/26/2016 1:54
34,12578624,Phone Makers Could Cut Off Drivers. So Why Don...,http://www.nytimes.com/2016/09/25/technology/p...,4,1,danso,9/26/2016 1:37
37,12578556,"OpenMW, Open Source Elderscrolls III: Morrowin...",https://openmw.org/en/,32,3,rocky1138,9/26/2016 1:24


In [2]:
data.dtypes

id               int64
title           object
url             object
num_points       int64
num_comments     int64
author          object
created_at      object
dtype: object

Note that the DataQuest tutorial this exercise is taken from analyses the data as a plain CSV file. So, even though I could analyse this data with the pandas library, I shall use the long-hand approach of iterating over a list of lists.

In [3]:
# Convert the DataFrame to a list of rows, i.e. a list of lists
hn = data.to_numpy().tolist()
for row in hn[:5]:
    print(row)

[12578975, 'Saving the Hassle of Shopping', 'https://blog.menswr.com/2016/09/07/whats-new-with-your-style-feed/', 1, 1, 'bdoux', '9/26/2016 3:13']
[12578822, 'Amazons Algorithms Dont Find You the Best Deals', 'https://www.technologyreview.com/s/602442/amazons-algorithms-dont-find-you-the-best-deals/', 1, 1, 'yarapavan', '9/26/2016 2:26']
[12578694, 'Emergency dose of epinephrine that does not cost an arm and a leg', 'http://m.imgur.com/gallery/th6Ua', 2, 1, 'dredmorbius', '9/26/2016 1:54']
[12578624, 'Phone Makers Could Cut Off Drivers. So Why Dont They?', 'http://www.nytimes.com/2016/09/25/technology/phone-makers-could-cut-off-drivers-so-why-dont-they.html', 4, 1, 'danso', '9/26/2016 1:37']
[12578556, 'OpenMW, Open Source Elderscrolls III: Morrowind Reimplementation', 'https://openmw.org/en/', 32, 3, 'rocky1138', '9/26/2016 1:24']
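
For comparison, here is a minimal sketch of how the same list-of-lists structure could be built without pandas, using `csv.reader` from the standard library. It is not the code used in this notebook: `io.StringIO` stands in for the open file, the two data rows are copied from the output above, and the names `sample_csv`, `sample_headers` and `sample_hn` are made up for the illustration.

```python
import csv
import io

# Two rows copied from the output above; io.StringIO stands in for an open file.
sample_csv = io.StringIO(
    'id,title,url,num_points,num_comments,author,created_at\n'
    '12578975,Saving the Hassle of Shopping,'
    'https://blog.menswr.com/2016/09/07/whats-new-with-your-style-feed/,'
    '1,1,bdoux,9/26/2016 3:13\n'
    '12578556,"OpenMW, Open Source Elderscrolls III: Morrowind Reimplementation",'
    'https://openmw.org/en/,32,3,rocky1138,9/26/2016 1:24\n'
)

rows = list(csv.reader(sample_csv))
sample_headers = rows[0]   # first row holds the column names
sample_hn = rows[1:]       # remaining rows hold the posts

# csv.reader yields every field as a string, so numeric columns must be converted.
# Keep only posts with at least one comment, mirroring the pandas filter.
sample_hn = [row for row in sample_hn if int(row[4]) > 0]

for row in sample_hn:
    print(row)
```

Unlike the pandas version, every value here stays a string, which is why the later cells wrap `row[4]` in `int()`.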


In [4]:
# Extract headers of the dataset as a list
headers = list(data)
print(headers)

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


In [5]:
# Group the different posts by type
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)

print(f'Number of ask posts: {len(ask_posts)}')
print(f'Number of show posts: {len(show_posts)}')
print(f'Number of other posts: {len(other_posts)}')

Number of ask posts: 31
Number of show posts: 4845
Number of other posts: 65788
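
The grouping rule above can also be written as a small function, which makes it easy to check on individual titles. This is only a sketch: `post_type` is a name I made up, and the first two titles below are invented examples rather than rows from the data set (the third is taken from the sample output).

```python
from collections import Counter

def post_type(title):
    """Classify a title as 'ask', 'show' or 'other' using the same prefix rule as above."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'

sample_titles = [
    'Ask HN: What are you reading?',    # invented example title
    'Show HN: My weekend project',      # invented example title
    'Saving the Hassle of Shopping',    # title from the sample output
]

type_counts = Counter(post_type(title) for title in sample_titles)
print(type_counts)
```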


In [6]:
# Determine whether ask or show posts receive more comments on average
total_ask_comments = 0
for row in ask_posts:
    num_comments = int(row[4])
    total_ask_comments += num_comments

avg_ask_comments = total_ask_comments/len(ask_posts)
print(f'Average comments per ask post: {avg_ask_comments}')

total_show_comments = 0
for row in show_posts:
    num_comments = int(row[4])
    total_show_comments += num_comments

avg_show_comments = total_show_comments/len(show_posts)
print(f'Average comments per show post: {avg_show_comments}')

Average comments per ask post: 2.3225806451612905
Average comments per show post: 10.067285861713106


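On average, show posts receive more than 4 times as many comments as ask posts (about 10.07 versus 2.32). There are also far more show posts than ask posts in this filtered data. This is probably a side effect of `data.dropna()`, which drops every row with a missing value, including posts that have no URL, as Ask HN posts usually do.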
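
The two loops in the cell above do the same work, so they could be folded into one helper. The sketch below assumes the same row layout (comment count at index 4); `average_comments` and `sample_rows` are names I made up, and the two rows are copied from the sample output rather than from the ask or show groups.

```python
def average_comments(rows, comments_index=4):
    """Return the mean comment count over rows, or 0.0 if there are no rows."""
    if not rows:
        return 0.0
    total = sum(int(row[comments_index]) for row in rows)
    return total / len(rows)

# Two rows from the sample output above, with 1 and 3 comments respectively.
sample_rows = [
    [12578694, 'Emergency dose of epinephrine that does not cost an arm and a leg',
     'http://m.imgur.com/gallery/th6Ua', 2, 1, 'dredmorbius', '9/26/2016 1:54'],
    [12578556, 'OpenMW, Open Source Elderscrolls III: Morrowind Reimplementation',
     'https://openmw.org/en/', 32, 3, 'rocky1138', '9/26/2016 1:24'],
]

print(f'Average comments per sample post: {average_comments(sample_rows)}')
```

With the real data, `average_comments(ask_posts)` and `average_comments(show_posts)` would replace the two loops. The empty-list guard avoids a `ZeroDivisionError` if a group turns out to have no posts.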

Now let's calculate the number of ask posts created per hour, along with the number of comments they received.

In [7]:
import datetime as dt

In [8]:
result_list = []
for row in ask_posts:
    # datetime format mm/dd/yyyy H:M
    created_at = row[6]
    num_comments = int(row[4])
    result_list.append([created_at, num_comments])

counts_by_hour = {}
comments_by_hour = {}

for row in result_list:
    time = dt.datetime.strptime(row[0], "%m/%d/%Y %H:%M")
    hour = dt.datetime.strftime(time, "%H")
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = row[1]
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += row[1]

print(f"Posts by hour:\n{counts_by_hour}")
print(f"Comments by hour:\n{comments_by_hour}")

Posts by hour:
{'07': 2, '18': 4, '14': 3, '13': 3, '15': 2, '02': 1, '21': 3, '20': 1, '17': 2, '16': 2, '06': 2, '11': 1, '09': 1, '04': 1, '22': 2, '00': 1}
Comments by hour:
{'07': 3, '18': 7, '14': 8, '13': 7, '15': 4, '02': 2, '21': 8, '20': 2, '17': 4, '16': 5, '06': 5, '11': 3, '09': 1, '04': 6, '22': 5, '00': 2}


Now we can calculate the average number of comments per post for posts created during each hour of the day.

In [9]:
avg_by_hour = []
for hour in comments_by_hour:
    num_posts = counts_by_hour[hour]
    num_comments = comments_by_hour[hour]
    avg = num_comments / num_posts
    avg_by_hour.append([hour, avg])

print(avg_by_hour)

[['07', 1.5], ['18', 1.75], ['14', 2.6666666666666665], ['13', 2.3333333333333335], ['15', 2.0], ['02', 2.0], ['21', 2.6666666666666665], ['20', 2.0], ['17', 2.0], ['16', 2.5], ['06', 2.5], ['11', 3.0], ['09', 1.0], ['04', 6.0], ['22', 2.5], ['00', 2.0]]
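
As an alternative to the `if hour not in ...` bookkeeping, the counting and averaging steps can be sketched with `collections.defaultdict`. The function name `average_comments_by_hour` is my own, and the three `[created_at, num_comments]` pairs are taken from the sample rows at the top of the notebook purely for illustration (they are not ask posts).

```python
from collections import defaultdict
import datetime as dt

def average_comments_by_hour(pairs):
    """Map each hour ('00'-'23') to the average comments per post created in that hour.

    pairs is an iterable of [created_at, num_comments] with created_at as 'm/d/yyyy H:M'.
    """
    counts = defaultdict(int)     # number of posts per hour
    comments = defaultdict(int)   # total comments per hour
    for created_at, num_comments in pairs:
        hour = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M").strftime("%H")
        counts[hour] += 1
        comments[hour] += num_comments
    return {hour: comments[hour] / counts[hour] for hour in counts}

sample_pairs = [
    ["9/26/2016 3:13", 1],
    ["9/26/2016 1:54", 1],
    ["9/26/2016 1:24", 3],
]

sample_avg = average_comments_by_hour(sample_pairs)
print(sample_avg)
```

A `defaultdict(int)` starts every missing key at 0, so the first post in an hour needs no special case.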


In [10]:
# Sort the list of lists by average comments per hour
swap_avg_by_hour = []
for hour in avg_by_hour:
    swap_avg_by_hour.append([hour[1], hour[0]])

sorted_swap = sorted(swap_avg_by_hour, reverse=True)

print("Top 5 Hours for Ask Posts Comments")

for avg in sorted_swap[:5]:
    print("{hour}: {avg:.2f} average comments per post".format(
        hour=dt.datetime.strftime(dt.datetime.strptime(avg[1], "%H"), "%H:%M"), 
        avg=avg[0]
        )
    )

Top 5 Hours for Ask Posts Comments
04:00: 6.00 average comments per post
11:00: 3.00 average comments per post
21:00: 2.67 average comments per post
14:00: 2.67 average comments per post
22:00: 2.50 average comments per post
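
The swap-then-sort step can be avoided by passing a `key` function to `sorted`. The sketch below uses a handful of `[hour, average]` pairs copied from the `avg_by_hour` output above; `sample_avg_by_hour` and `top` are names I chose for the illustration.

```python
# A few [hour, average] pairs copied from the avg_by_hour output above.
sample_avg_by_hour = [['07', 1.5], ['11', 3.0], ['04', 6.0], ['22', 2.5]]

# Sort on the average (second element) without swapping the columns first.
top = sorted(sample_avg_by_hour, key=lambda pair: pair[1], reverse=True)

for hour, avg in top[:3]:
    print(f"{hour}:00: {avg:.2f} average comments per post")
```

One difference to be aware of: sorting the swapped lists breaks ties on the hour string, whereas a `key` on the average alone keeps tied hours in their original order, so hours with equal averages may be listed in a different order.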


So if you want to create an ask post on Hacker News and for that post to receive as many comments as possible, the dataset suggests creating it at 4am! Bear in mind, though, that the 04:00 average comes from a single post, so the evidence for this is thin.