# Hacker News Post Interest Analysis

Hacker News is a site started by the startup incubator Y Combinator, where user-submitted stories (known as "posts") receive votes and comments, similar to Reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of the Hacker News listings can get hundreds of thousands of visitors as a result.

We're specifically interested in posts with titles that begin with either Ask HN or Show HN. Users submit Ask HN posts to ask the Hacker News community a specific question. Below are a few examples:

* Ask HN: How to improve my personal website?
* Ask HN: Am I the only one outraged by Twitter shutting down share counts?
* Ask HN: Any recent changes to CSS that broke mobile?

Likewise, users submit Show HN posts to show the Hacker News community a project, product, or just something interesting. Below are a few examples:

* Show HN: Wio Link ESP8266 Based Web of Things Hardware Development Platform
* Show HN: Something pointless I made
* Show HN: Shanhu.io, a programming playground powered by e8vm

We'll compare these two types of posts to determine the following:

1. Do Ask HN or Show HN posts receive more comments on average?
2. Do posts created at a certain time receive more comments on average?

Let's start by importing the libraries we need and reading the dataset into a list of lists.

In [14]:
from csv import reader

#read the CSV into a list of lists; the with block closes the file afterwards
with open('D:/Library/datasci/datasets/HN_posts_year_to_Sep_26_2016.csv', encoding = "utf-8") as opened_file:
    hn = list(reader(opened_file))

print(hn[:5])


[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26'], ['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24'], ['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19'], ['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16']]


In [17]:
#extract first row of data and assign it to the headers variable.
headers = hn[0]
hn = hn[1:]
print(headers)
print(hn[:5])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
[['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26'], ['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24'], ['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19'], ['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16'], ['12578979', 'How the Data Vault Enables the Next-Gen Data Warehouse and Data Lake', 'https://www.talend.com/blog/2016/05/12/talend-and-Â\x93the-data-vaultÂ\x94', '1', '0', 'markgainor1', '9/26/2016 3:14']]


#### Extracting Posts of Interest

We're only interested in the posts whose titles start with Ask HN or Show HN. Let's pull these from the dataset and put them into their own lists.


In [22]:
#initialize empty lists
ask_posts = []
show_posts = []
other_posts = []


#loop over the rows in the dataset.
for row in hn:
    #assign the title of the post to a variable
    title = row[1]
    #append the entire row to the list that matches the start of the title
    if title.lower().startswith("ask hn"):
        ask_posts.append(row)
    elif title.lower().startswith("show hn"):
        show_posts.append(row)
    else:
        other_posts.append(row)

print(len(ask_posts), len(show_posts), len(other_posts))

9139 10158 273822
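
One way to keep the three categories mutually exclusive is to move the prefix test into a small helper that returns exactly one label per title. The sketch below is illustrative only: `classify_title` and `buckets` are hypothetical names that do not appear in the notebook, and the sample titles are copied from the examples shown earlier rather than read from the CSV file.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the start of the title."""
    lowered = title.lower()
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"

# Sample titles copied from the examples earlier in this notebook
sample_titles = [
    "Ask HN: How to improve my personal website?",
    "Show HN: Something pointless I made",
    "algorithmic music",
]

# Each title lands in exactly one bucket because the helper returns one label
buckets = {"ask": [], "show": [], "other": []}
for title in sample_titles:
    buckets[classify_title(title)].append(title)

print({kind: len(titles) for kind, titles in buckets.items()})
```

Because every title is appended to a single bucket, the three counts always add up to the number of titles processed, which is a quick sanity check for the split.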


#### Average Number of Comments by Posting Type

Now we check whether Show HN or Ask HN posts receive more comments on average.


In [30]:
#initialize empty count
total_ask_comments = 0
total_show_comments = 0

#start adding
for row in ask_posts:
    num = int(row[4])
    total_ask_comments = total_ask_comments + num

for row in show_posts:
    num = int(row[4])
    total_show_comments = total_show_comments + num

#calculate the average
avg_ask_comments = total_ask_comments/len(ask_posts)  
avg_show_comments = total_show_comments/len(show_posts)

#print averages
print(("Average ask comments: {} Average show comments: {}".format(round(avg_ask_comments,2),round(avg_show_comments,2))))


Average ask comments: 10.39 Average show comments: 4.89


On average, Ask HN posts receive about twice as many comments as Show HN posts (10.39 versus 4.89).
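
The two summing loops differ only in the list they iterate over, so the same calculation could be written once as a function. This is a minimal sketch under stated assumptions: `average_column` is a hypothetical helper, and `sample_rows` contains made-up rows that only follow the dataset's column order (`num_points` at index 3, `num_comments` at index 4).

```python
def average_column(rows, index):
    """Mean of the values at `index`, which are stored as strings; 0.0 if rows is empty."""
    if not rows:
        return 0.0
    return sum(int(row[index]) for row in rows) / len(rows)

# Hypothetical rows in the dataset's column order, not real data
sample_rows = [
    ["1", "Ask HN: a", "", "3", "10", "user_a", "9/26/2016 3:26"],
    ["2", "Ask HN: b", "", "5", "4", "user_b", "9/26/2016 15:02"],
]

print(round(average_column(sample_rows, 4), 2))
```

The guard for an empty list avoids a `ZeroDivisionError` if a category happens to contain no posts.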

#### Average Number of Comments by Hour

Because Ask HN posts receive more comments on average, we will focus only on the Ask HN posts. Next, we investigate whether posts created at a certain time are more likely to attract comments. To do so we will perform the following steps:

1. Calculate the number of ask posts created in each hour of the day, along with the number of comments received.
2. Calculate the average number of comments ask posts receive by hour created.

We start with the first step, using the datetime module to parse the values in the created_at column.
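
As a small illustration of the parsing step, the sketch below converts a single created_at string into a datetime object and extracts the hour in two forms. The sample value is copied from the first data row printed earlier; the variable names are chosen for this example only.

```python
import datetime as dt

# A created_at value copied from the first data row shown earlier
created_at = "9/26/2016 3:26"

# %m, %d and %H also accept values without a leading zero, such as "9" or "3"
parsed = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M")

print(parsed.hour)            # the hour as an integer
print(parsed.strftime("%H"))  # the hour as a zero-padded string
```

The zero-padded string form is convenient as a dictionary key because the keys then sort in chronological order as plain strings.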

In [41]:
#First we create a list of the relevant data, then we will populate two dictionaries to hold the count and comment data by hour.

#import datetime
import datetime as dt

result_list = []

#pull the created_at column (index 6) and the num_comments column (index 4)
for row in ask_posts:
    temp = [row[6], int(row[4])]
    result_list.append(temp)

#empty dictionaries to populate
counts_by_hour = {}
comments_by_hour = {}


for row in result_list:
    #store time data and comments
    hour = row[0]
    comments = row[1]
    dt_time = dt.datetime.strptime(hour, "%m/%d/%Y %H:%M")    #convert the time to a datetime object
    dt_hour = dt.datetime.strftime(dt_time, "%H")     #output just the hour
    
    #assign to dictionaries and count
    if dt_hour not in counts_by_hour:
        counts_by_hour[dt_hour] = 1
        comments_by_hour[dt_hour] = comments
    else:
        counts_by_hour[dt_hour] +=1
        comments_by_hour[dt_hour] += comments
        

In [46]:
#calculate the averages
avg_by_hour = []

for hour in comments_by_hour:
    avg_by_hour.append([hour, comments_by_hour[hour]/counts_by_hour[hour]])
    
print(avg_by_hour)

[['02', 11.137546468401487], ['01', 7.407801418439717], ['22', 8.804177545691905], ['21', 8.687258687258687], ['19', 7.163043478260869], ['17', 9.449744463373083], ['15', 28.676470588235293], ['14', 9.692007797270955], ['13', 16.31756756756757], ['11', 8.96474358974359], ['10', 10.684397163120567], ['09', 6.653153153153153], ['07', 7.013274336283186], ['03', 7.948339483394834], ['23', 6.696793002915452], ['20', 8.749019607843136], ['16', 7.713298791018998], ['08', 9.190661478599221], ['00', 7.5647840531561465], ['18', 7.94299674267101], ['12', 12.380116959064328], ['04', 9.7119341563786], ['06', 6.782051282051282], ['05', 8.794258373205741]]


We will now sort the results to make them more readable and print the top 5 hours by average number of comments per post.

In [53]:
#Create a list with swapped columns equivalent to avg_by_hour
swap_avg_by_hour = []

for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])

sorted_swap = sorted(swap_avg_by_hour, reverse = True)

for row in sorted_swap[:5]:
    count = row[0]
    dtime = dt.datetime.strptime(row[1], "%H") #convert to datetime object
    hour = dt.datetime.strftime(dtime, "%H:00") #pull the hour and format
    print("{}: {:.2f} average comments per post".format(hour, count))

15:00: 28.68 average comments per post
13:00: 16.32 average comments per post
12:00: 12.38 average comments per post
02:00: 11.14 average comments per post
10:00: 10.68 average comments per post
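
The column swap is one way to make `sorted` order the rows by average. An alternative sketch, assuming the same `[hour, average]` layout, passes a `key` function so the original column order can stay as it is. The four pairs below are copied from the averages printed earlier; `sample_avg_by_hour` and `top_hours` are names used only in this example.

```python
# A few [hour, average] pairs copied from the averages printed earlier
sample_avg_by_hour = [
    ["02", 11.137546468401487],
    ["15", 28.676470588235293],
    ["13", 16.31756756756757],
    ["09", 6.653153153153153],
]

# Sort on the second element (the average), largest first, without swapping columns
top_hours = sorted(sample_avg_by_hour, key=lambda pair: pair[1], reverse=True)

for hour, average in top_hours[:3]:
    print("{}:00: {:.2f} average comments per post".format(hour, average))
```

Since the hour is already a zero-padded string, it can be formatted directly without a round trip through `strptime` and `strftime`.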


We see that the hour with the highest average number of comments per post is 15:00 Eastern Time, at 28.68 comments per post. A poster who wants comments, and potentially traction on the site, should therefore post in the mid-afternoon Eastern Time, converting that hour to their own time zone.
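
Converting the 15:00 Eastern hour to another time zone can be sketched with fixed UTC offsets from the datetime module. Everything specific here is an assumption made for illustration: the offset of UTC-4 applies only while daylight saving time is in effect in the Eastern zone, the reader's zone (UTC+2) is hypothetical, and the date is an arbitrary day within the dataset's range.

```python
import datetime as dt

# Assumption: Eastern Daylight Time is UTC-4 (Eastern Standard Time in winter is UTC-5)
eastern = dt.timezone(dt.timedelta(hours=-4), name="EDT")

# Hypothetical reader time zone: Central European Summer Time, UTC+2
reader_tz = dt.timezone(dt.timedelta(hours=2), name="CEST")

# 15:00 Eastern on an arbitrary date within the dataset's range
peak_eastern = dt.datetime(2016, 9, 26, 15, 0, tzinfo=eastern)

# astimezone keeps the same instant and re-expresses it in the target zone
peak_local = peak_eastern.astimezone(reader_tz)

print(peak_local.strftime("%H:%M %Z"))
```

Fixed offsets ignore daylight saving transitions, so a real schedule would need the offsets that apply on the actual posting date.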


#### Gauging Reader Value

The dataset also contains the number of points each post receives. A higher point count can indicate how many readers a post reached and how much they valued it. We can repeat the analysis above for points, by post type and by time, and compare the results with the average comments per post by hour that we have already produced. To do so we will perform the following steps:

1. Determine if show or ask posts receive more points on average.
2. Determine if posts created at a certain time are more likely to receive more points.
3. Compare the results to the average number of comments and points other posts receive.

In [55]:
#initialize empty count
total_ask_points = 0
total_show_points = 0

#start adding
for row in ask_posts:
    num = int(row[3])
    total_ask_points = total_ask_points + num

for row in show_posts:
    num = int(row[3])
    total_show_points = total_show_points + num

#calculate the average
avg_ask_points = total_ask_points/len(ask_posts)  
avg_show_points = total_show_points/len(show_posts)

#print averages
print(("Average ask points: {} Average show points: {}".format(round(avg_ask_points,2),round(avg_show_points,2))))

Average ask points: 11.31 Average show points: 14.84


The average number of points that a post receives is higher for the Show HN posts. This is in contrast with the results for number of comments but it is not unreasonable to assume that asking will result in more people posting comments to essentially answer the poster. Show HN posts might be more interesting or valuable to a larger base of readers. 

However, as we have done an analysis already on the Ask HN posts, I will continue working on that dataset. A more comprehensive analysis may be done on both datasets if desired.

Let's run some very similar code and see what we get.


In [57]:
result_list2 = []

#pull the created_at column (index 6) and the num_points column (index 3)
for row in ask_posts:
    temp = [row[6], int(row[3])]
    result_list2.append(temp)

#empty dictionaries to populate
counts_by_hour2 = {}
points_by_hour = {}


for row in result_list2:
    #store time data and points
    hour = row[0]
    points = row[1]
    dt_time = dt.datetime.strptime(hour, "%m/%d/%Y %H:%M")    #convert the time to a datetime object
    dt_hour = dt.datetime.strftime(dt_time, "%H")     #output just the hour
    
    #assign to dictionaries and count
    if dt_hour not in counts_by_hour2:
        counts_by_hour2[dt_hour] = 1
        points_by_hour[dt_hour] = points
    else:
        counts_by_hour2[dt_hour] +=1
        points_by_hour[dt_hour] += points
        
#calculate the averages
avg_by_hour2 = []

for hour in points_by_hour:
    avg_by_hour2.append([hour, points_by_hour[hour]/counts_by_hour2[hour]])
    
print(avg_by_hour2)
        

[['02', 10.944237918215613], ['01', 9.439716312056738], ['22', 9.402088772845953], ['21', 9.733590733590734], ['19', 8.66304347826087], ['17', 12.189097103918229], ['15', 21.637770897832816], ['14', 10.50682261208577], ['13', 17.93243243243243], ['11', 9.153846153846153], ['10', 13.436170212765957], ['09', 7.941441441441442], ['07', 9.026548672566372], ['03', 9.3690036900369], ['23', 7.626822157434402], ['20', 8.805882352941177], ['16', 10.310880829015543], ['08', 10.67704280155642], ['00', 9.418604651162791], ['18', 11.156351791530945], ['12', 13.576023391812866], ['04', 10.905349794238683], ['06', 8.675213675213675], ['05', 9.789473684210526]]


In [58]:
#Create a list with swapped columns equivalent to avg_by_hour2
swap_avg_by_hour2 = []

for row in avg_by_hour2:
    swap_avg_by_hour2.append([row[1], row[0]])

sorted_swap2 = sorted(swap_avg_by_hour2, reverse = True)

for row in sorted_swap2[:5]:
    count = row[0]
    dtime = dt.datetime.strptime(row[1], "%H") #convert to datetime object
    hour = dt.datetime.strftime(dtime, "%H:00") #pull the hour and format
    print("{}: {:.2f} average points per post".format(hour, count))

15:00: 21.64 average points per post
13:00: 17.93 average points per post
12:00: 13.58 average points per post
10:00: 13.44 average points per post
17:00: 12.19 average points per post
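
The comment and point calculations differ only in the column they read, so both could be served by one function. The following is a minimal sketch, not the notebook's own code: `average_by_hour` is a hypothetical helper, its `time_index` default of 6 assumes the dataset's column order, and `sample_rows` contains made-up rows.

```python
import datetime as dt

def average_by_hour(rows, value_index, time_index=6):
    """Return {hour: mean of int(row[value_index])}, keyed by zero-padded hour string."""
    totals = {}
    counts = {}
    for row in rows:
        created = dt.datetime.strptime(row[time_index], "%m/%d/%Y %H:%M")
        hour = created.strftime("%H")
        # dict.get supplies 0 the first time an hour is seen
        totals[hour] = totals.get(hour, 0) + int(row[value_index])
        counts[hour] = counts.get(hour, 0) + 1
    return {hour: totals[hour] / counts[hour] for hour in totals}

# Hypothetical rows in the dataset's column order, not real data
sample_rows = [
    ["1", "Ask HN: a", "", "3", "10", "user_a", "9/26/2016 15:26"],
    ["2", "Ask HN: b", "", "5", "4", "user_b", "9/25/2016 15:02"],
    ["3", "Ask HN: c", "", "8", "2", "user_c", "9/25/2016 3:10"],
]

print(average_by_hour(sample_rows, 4))  # average comments by hour
print(average_by_hour(sample_rows, 3))  # average points by hour
```

Keeping the totals and counts inside one function also removes the risk of dividing by a count dictionary that belongs to a different calculation.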


#### Conclusions

We find that the average number of points per post is once again highest at 15:00 Eastern Time. This hour likely corresponds to the highest traffic that the website receives. If we want an Ask HN post to get high reader engagement, 15:00 Eastern Time is the best time to submit it.

Further analysis could be done to check if there is any difference in results based on the Show HN posts or the Other category of posts. 