# Analyzing 'Hacker News' Posts
by Nicholas Archambault

Hacker News is a website created by the startup incubator Y Combinator that allows users to submit posts related to technology and startups. These posts can be voted and commented on, in a format similar to Reddit's. Posts with the most engagement can reach hundreds of thousands of visitors.

This project analyzes roughly 20,000 Hacker News posts. The dataset has been abridged from its full 300,000-row version and represents a sample of the posts that received comments.

Two types of popular Hacker News posts are `Ask HN`, where users submit a question to the Hacker News community, and `Show HN`, where users post their projects, products, or interesting stories and facts.

This project seeks to understand the metrics of these posts' popularity.  We examine whether `Ask HN` or `Show HN` posts receive higher levels of engagement, and whether posts created at a certain time generally garner more interactions from the community.

## Introduction

First, we'll read in the data and remove the headers.

In [1]:
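# Read the CSV into a list of rows (the header is still the first row)
from csv import reader

# The with-block closes the file once the rows have been read
with open("hacker_news.csv", newline="") as file:
    hn = list(reader(file))
hn[:5]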

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'],
 ['12224879',
  'Interactive Dynamic Video',
  'http://www.interactivedynamicvideo.com/',
  '386',
  '52',
  'ne0phyte',
  '8/4/2016 11:52'],
 ['10975351',
  'How to Use Open Source and Shut the Fuck Up at the Same Time',
  'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
  '39',
  '10',
  'josep2',
  '1/26/2016 19:30'],
 ['11964716',
  "Florida DJs May Face Felony for April Fools' Water Joke",
  'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/',
  '2',
  '1',
  'vezycash',
  '6/23/2016 22:20'],
 ['11919867',
  'Technology ventures: From Idea to Enterprise',
  'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
  '3',
  '1',
  'hswarna',
  '6/17/2016 0:01']]

In [2]:
# Separate the header row from the data rows
headers = hn[0]
hn = hn[1:]
print(headers)
hn[:5]

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


[['12224879',
  'Interactive Dynamic Video',
  'http://www.interactivedynamicvideo.com/',
  '386',
  '52',
  'ne0phyte',
  '8/4/2016 11:52'],
 ['10975351',
  'How to Use Open Source and Shut the Fuck Up at the Same Time',
  'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
  '39',
  '10',
  'josep2',
  '1/26/2016 19:30'],
 ['11964716',
  "Florida DJs May Face Felony for April Fools' Water Joke",
  'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/',
  '2',
  '1',
  'vezycash',
  '6/23/2016 22:20'],
 ['11919867',
  'Technology ventures: From Idea to Enterprise',
  'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
  '3',
  '1',
  'hswarna',
  '6/17/2016 0:01'],
 ['10301696',
  'Note by Note: The Making of Steinway L1037 (2007)',
  'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0',
  '8',
  '2',
  'walterbell',
  '9/30/2015 4:12']]

## Extracting Ask HN and Show HN Posts

We can see above that the data set contains the title of the posts, the number of comments for each post, and the date the post was created. Let's start by exploring the number of comments for each type of post. 

First, we'll identify posts that begin with either Ask HN or Show HN and separate the data for those two types of posts into different lists. Separating the data makes it easier to analyze in the following steps.

In [3]:
# Create empty lists for each category
ask_posts = []
show_posts = []
other = []

# Append each row to the list that matches its title prefix (case-insensitive)
for row in hn:
    title = row[1].lower()
    if title.startswith("ask hn"):
        ask_posts.append(row)
    elif title.startswith("show hn"):
        show_posts.append(row)
    else:
        other.append(row)
print(len(ask_posts))
print(len(show_posts))
print(len(other))

1744
1162
17194


## Average Number of Comments for Ask and Show Posts

Now that we've separated ask posts and show posts into different lists, we'll calculate the average number of comments each type of post receives.

In [4]:
# Increment to count total number of ask comments
total_ask_comments = 0

for post in ask_posts:
    comments = int(post[4])
    total_ask_comments += comments

# Find average per post
avg_ask_comments = total_ask_comments/len(ask_posts)
avg_ask_comments

14.038417431192661

In [5]:
# Repeat for show comments
total_show_comments = 0

for post in show_posts:
    comments = int(post[4])
    total_show_comments += comments

avg_show_comments = total_show_comments/len(show_posts)
avg_show_comments

10.31669535283993

These figures reveal that Ask posts receive more comments, on average, than Show posts.
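
The two cells above run the same loop twice, so the calculation could be folded into one helper. The following is a minimal sketch: `average_comments` and the sample rows are hypothetical names and data introduced here for illustration, and the column index 4 (`num_comments`) is taken from the header printed earlier.

```python
def average_comments(posts, comments_index=4):
    """Return the mean of the comment counts stored as strings in each row."""
    if not posts:
        return 0.0  # avoid dividing by zero on an empty list
    total = sum(int(post[comments_index]) for post in posts)
    return total / len(posts)


# Hypothetical rows in the same column layout as hacker_news.csv
sample_posts = [
    ["1", "Ask HN: Example A", "", "5", "10", "user_a", "8/4/2016 11:52"],
    ["2", "Ask HN: Example B", "", "3", "4", "user_b", "8/4/2016 12:10"],
]

sample_average = average_comments(sample_posts)
print(sample_average)  # 7.0
```

With the notebook's own lists, `average_comments(ask_posts)` and `average_comments(show_posts)` would be expected to reproduce the two averages shown above.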

## Finding the Number of Ask Posts and Comments by Hour Created

Next, we'll determine whether an Ask post is likely to receive more comments if it is created at a certain time. First, we'll find the number of Ask posts created during each hour of the day, along with the number of comments those posts received. Then, we'll calculate the average number of comments that Ask posts created in each hour receive.
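
As a quick illustration of the hour extraction this step relies on, the sketch below parses a single timestamp copied from the first data row. The variable names `sample_timestamp`, `parsed`, and `hour_label` are introduced here for illustration only.

```python
import datetime as dt

date_format = "%m/%d/%Y %H:%M"
sample_timestamp = "8/4/2016 11:52"  # copied from the first data row

# Parse the string into a datetime object, then format just the hour
parsed = dt.datetime.strptime(sample_timestamp, date_format)
hour_label = parsed.strftime("%H")  # zero-padded hour as a string

print(parsed)      # 2016-08-04 11:52:00
print(hour_label)  # 11
```

Because `%H` is zero-padded on output, a timestamp such as `6/17/2016 0:01` yields the key `'00'`, which is why the hourly dictionaries use two-character string keys.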

In [6]:
# Import the datetime module
import datetime as dt
result_list = []

# Pull time and comment numbers from each post
for post in ask_posts:
    time = post[6]
    comments = int(post[4])
    result_list.append([time, comments])

# Initialize dictionaries to store hourly totals
counts_by_hour = {}
comments_by_hour = {}
date_format = "%m/%d/%Y %H:%M"

# Extract the hour from each timestamp string; update both dictionaries
for i in result_list:
    date = i[0]
    comments = int(i[1])
    time = dt.datetime.strptime(date, date_format).strftime("%H")
    
    if time not in counts_by_hour:
        counts_by_hour[time] = 1
        comments_by_hour[time] = comments
    else:
        counts_by_hour[time] += 1
        comments_by_hour[time] += comments

In [7]:
# Calculate average comments per hour
avg_by_hour = []

for i in counts_by_hour:
    avg_by_hour.append([i, comments_by_hour[i]/counts_by_hour[i]])

avg_by_hour

[['09', 5.5777777777777775],
 ['13', 14.741176470588234],
 ['10', 13.440677966101696],
 ['14', 13.233644859813085],
 ['16', 16.796296296296298],
 ['23', 7.985294117647059],
 ['12', 9.41095890410959],
 ['17', 11.46],
 ['15', 38.5948275862069],
 ['21', 16.009174311926607],
 ['20', 21.525],
 ['02', 23.810344827586206],
 ['18', 13.20183486238532],
 ['03', 7.796296296296297],
 ['05', 10.08695652173913],
 ['19', 10.8],
 ['01', 11.383333333333333],
 ['22', 6.746478873239437],
 ['08', 10.25],
 ['04', 7.170212765957447],
 ['00', 8.127272727272727],
 ['06', 9.022727272727273],
 ['07', 7.852941176470588],
 ['11', 11.051724137931034]]

To sort the results by average comments rather than by hour, one option is to swap the positions of the two values: `sorted` compares lists element by element, so the average must come first.

In [8]:
swap = []
for i in avg_by_hour:
    swap.append([i[1], i[0]])
swap = sorted(swap, reverse = True)
swap

[[38.5948275862069, '15'],
 [23.810344827586206, '02'],
 [21.525, '20'],
 [16.796296296296298, '16'],
 [16.009174311926607, '21'],
 [14.741176470588234, '13'],
 [13.440677966101696, '10'],
 [13.233644859813085, '14'],
 [13.20183486238532, '18'],
 [11.46, '17'],
 [11.383333333333333, '01'],
 [11.051724137931034, '11'],
 [10.8, '19'],
 [10.25, '08'],
 [10.08695652173913, '05'],
 [9.41095890410959, '12'],
 [9.022727272727273, '06'],
 [8.127272727272727, '00'],
 [7.985294117647059, '23'],
 [7.852941176470588, '07'],
 [7.796296296296297, '03'],
 [7.170212765957447, '04'],
 [6.746478873239437, '22'],
 [5.5777777777777775, '09']]
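
Swapping is one way to get this ordering. An alternative that leaves each pair in `[hour, average]` order is to pass a `key` function to `sorted`. The sketch below uses three of the pairs from the averages listed earlier; the names `sample_avg_by_hour` and `sorted_by_avg` are introduced here for illustration.

```python
# Three [hour, average] pairs taken from the hourly averages listed earlier
sample_avg_by_hour = [
    ["09", 5.5777777777777775],
    ["15", 38.5948275862069],
    ["02", 23.810344827586206],
]

# Sort by the second element (the average), largest first
sorted_by_avg = sorted(sample_avg_by_hour, key=lambda pair: pair[1], reverse=True)

for hour, avg in sorted_by_avg:
    print(hour, round(avg, 2))
```

This produces the same ranking as the swapped list without building a second copy of the data.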

In [9]:
# Display top five hours of comment engagement
for i in swap[:5]:
    time = dt.datetime.strptime(i[1], "%H").strftime("%H:%M")
    comments = i[0]
    print("{}: {:.2f} average comments per post".format(time,comments))

15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post


We can conclude that the 3:00pm EST hour averages about 60% more comments than the next-best hour (2:00am), with ~39 comments per Ask post. These figures suggest that an Ask post is most likely to attract comments if it is submitted between 3:00pm and 4:00pm EST.
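
The "about 60%" figure can be checked directly from the two highest averages in the sorted list. This is a small worked calculation; the variable names are introduced here for illustration.

```python
top_hour_avg = 38.5948275862069       # 15:00, from the sorted list above
second_hour_avg = 23.810344827586206  # 02:00, the next-highest hour

# Relative difference between the top hour and the runner-up
percent_more = (top_hour_avg - second_hour_avg) / second_hour_avg * 100
print("{:.1f}% more comments".format(percent_more))  # 62.1% more comments
```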

## Finding the Number of Show Posts and Points by Hour Created

We can repeat these same steps for Show posts, this time tracking points (`num_points`) rather than comments, and then compare the best posting hours for the two post types.
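
Since the Ask and Show analyses share the same steps, they could be expressed as a single function that takes the column to average. The following is a sketch under stated assumptions, not the notebook's own method: `average_by_hour` and the sample rows are hypothetical, and the column indices (3 for `num_points`, 4 for `num_comments`, 6 for `created_at`) come from the header printed earlier.

```python
import datetime as dt


def average_by_hour(posts, value_index, date_index=6,
                    date_format="%m/%d/%Y %H:%M"):
    """Return {hour: average of the chosen column} for the given rows."""
    counts = {}
    totals = {}
    for post in posts:
        hour = dt.datetime.strptime(post[date_index], date_format).strftime("%H")
        counts[hour] = counts.get(hour, 0) + 1
        totals[hour] = totals.get(hour, 0) + int(post[value_index])
    # Counts and totals are built from the same rows, so their keys always match
    return {hour: totals[hour] / counts[hour] for hour in counts}


# Hypothetical rows in the hacker_news.csv column layout
sample_show_posts = [
    ["1", "Show HN: Example A", "", "10", "2", "user_a", "8/4/2016 11:52"],
    ["2", "Show HN: Example B", "", "20", "4", "user_b", "8/5/2016 11:05"],
    ["3", "Show HN: Example C", "", "6", "1", "user_c", "8/5/2016 0:30"],
]

sample_points_by_hour = average_by_hour(sample_show_posts, value_index=3)
print(sample_points_by_hour)  # {'11': 15.0, '00': 6.0}
```

Keeping the counts and totals inside one function means the averages for one post type cannot accidentally be divided by the counts of another.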

In [10]:
# Pull creation time and points (column 3, num_points) from each Show post
result = []
for post in show_posts:
    time = post[6]
    points = int(post[3])
    result.append([time, points])


In [11]:
# Hourly totals for Show posts (count_by_hour is separate from the Ask dictionary counts_by_hour)
count_by_hour = {}
points_by_hour = {}
date_format = "%m/%d/%Y %H:%M"

for i in result:
    date = i[0]
    points = int(i[1])
    time = dt.datetime.strptime(date, date_format).strftime("%H")
    
    if time not in count_by_hour:
        count_by_hour[time] = 1
        points_by_hour[time] = points
    else:
        count_by_hour[time] += 1
        points_by_hour[time] += points

In [12]:
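# Average points per Show post for each hour.
# Divide by the Show-post counts (count_by_hour), not the Ask-post counts (counts_by_hour).
# The output recorded below came from a run that divided by counts_by_hour;
# rerun this cell to refresh the figures.
avg_by_hour_2 = []

for i in count_by_hour:
    avg_by_hour_2.append([i, points_by_hour[i]/count_by_hour[i]])

avg_by_hour_2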

[['09', 12.28888888888889],
 ['13', 28.68235294117647],
 ['10', 11.542372881355933],
 ['14', 20.439252336448597],
 ['16', 24.38888888888889],
 ['23', 22.441176470588236],
 ['12', 34.83561643835616],
 ['17', 25.21],
 ['15', 19.20689655172414],
 ['21', 7.944954128440367],
 ['20', 22.7375],
 ['02', 5.862068965517241],
 ['18', 20.321100917431192],
 ['03', 12.574074074074074],
 ['05', 2.260869565217391],
 ['19', 15.472727272727273],
 ['01', 11.666666666666666],
 ['22', 26.140845070422536],
 ['08', 10.8125],
 ['04', 8.212765957446809],
 ['00', 21.327272727272728],
 ['06', 8.522727272727273],
 ['07', 14.529411764705882],
 ['11', 25.517241379310345]]

In [13]:
swap_2 = []
for i in avg_by_hour_2:
    swap_2.append([i[1], i[0]])

swap_2 = sorted(swap_2, reverse = True)
swap_2

[[34.83561643835616, '12'],
 [28.68235294117647, '13'],
 [26.140845070422536, '22'],
 [25.517241379310345, '11'],
 [25.21, '17'],
 [24.38888888888889, '16'],
 [22.7375, '20'],
 [22.441176470588236, '23'],
 [21.327272727272728, '00'],
 [20.439252336448597, '14'],
 [20.321100917431192, '18'],
 [19.20689655172414, '15'],
 [15.472727272727273, '19'],
 [14.529411764705882, '07'],
 [12.574074074074074, '03'],
 [12.28888888888889, '09'],
 [11.666666666666666, '01'],
 [11.542372881355933, '10'],
 [10.8125, '08'],
 [8.522727272727273, '06'],
 [8.212765957446809, '04'],
 [7.944954128440367, '21'],
 [5.862068965517241, '02'],
 [2.260869565217391, '05']]

In [14]:
for i in swap_2[:5]:
    time = dt.datetime.strptime(i[1], "%H").strftime("%H:%M")
    points = i[0]
    print("{}: {:.2f} average points per post".format(time,points))

12:00: 34.84 average points per post
13:00: 28.68 average points per post
22:00: 26.14 average points per post
11:00: 25.52 average points per post
17:00: 25.21 average points per post


## Conclusion

After examining each post type at every hour of the day, we find that the best hour for Ask posts is 3:00pm EST, when they average around 60% more comments than in the next-best hour. For Show posts, which were measured by points rather than comments, the recorded output puts the best hour at 12:00pm EST. At their best hours the two figures look similar (~39 comments per Ask post versus ~35 points per Show post), but because they measure different things they are not directly comparable.

Ask posts tend to perform best in the afternoon and evening: four of the top five hours fall between 3:00pm and 9:00pm. Show posts, meanwhile, score the most points in the middle of the day: three of their top five hours are 11:00am, 12:00pm, and 1:00pm.

The spread across the top five hours for each post type is a final intriguing factor. The difference between the top and fifth-ranked hours is ~23 comments for Ask posts, but just ~10 points for Show posts. This could indicate that the best-performing Ask posts tend to attract a concentration of comments, whereas points are distributed across Show posts somewhat more evenly. One caveat: the Show averages recorded above were produced by dividing hourly Show points by the hourly Ask-post counts (`counts_by_hour`) rather than the Show-post counts (`count_by_hour`), so they should be recomputed before being relied on.
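
The two spreads can be reproduced from the top-five figures printed in cells In [9] and In [14]. This is a small worked calculation using the rounded values from the recorded output; the variable names are introduced here for illustration.

```python
ask_top_five = [38.59, 23.81, 21.52, 16.80, 16.01]   # comments, from In [9]
show_top_five = [34.84, 28.68, 26.14, 25.52, 25.21]  # points, from In [14]

# Spread = top-ranked hour minus fifth-ranked hour
ask_spread = ask_top_five[0] - ask_top_five[-1]
show_spread = show_top_five[0] - show_top_five[-1]

print("{:.2f} {:.2f}".format(ask_spread, show_spread))  # 22.58 9.63
```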