# Hacker News - Post Performance Analysis

In this project we analyse two types of posts from the site Hacker News: "Ask HN", where users ask the community about something, and "Show HN", where users share a project or a piece of news for other people to see.

We will be analysing:

1. Which kind of post receives more comments on average
2. Whether posts created at a specific time of the day receive more comments on average

We will work with the dataset 'hacker_news.csv', which has approximately 20,000 rows, each of which is a Hacker News post that received comments.

We start by opening and reading this dataset as a list of lists.

In [1]:
from csv import reader

# Open the file in a context manager so it is closed once the rows are read
with open('hacker_news.csv') as open_file:
    hn = list(reader(open_file))

We look at the column headers and a sample of rows to see what each row looks like and what kind of data we are working with:

In [2]:
print(hn[:1])

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']]


In [3]:
print(hn[1:5])

[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]


To work with the data and obtain proper results we do not need the headers. However, it will be handy to keep them in a variable in case we need a refresher. We assign the headers to this variable and remove them from the rest of the dataset.

In [4]:
headers = hn[:1]
hn = hn[1:]
print(headers)
print(hn[:5])

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']]
[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]


Now that we have a cleaner dataset, we look for posts starting with "Ask HN" and "Show HN". Titles may mix capital and lowercase letters, so to be safe we convert each title to lowercase with the string method lower() before checking its prefix.

In [5]:
ask_posts = []
show_posts = []
other_posts = []

for rows in hn:
    title = rows[1]
    
    if title.lower().startswith('ask hn'):
        ask_posts.append(rows)
        
    elif title.lower().startswith('show hn'):
        show_posts.append(rows)
    
    else:
        other_posts.append(rows)
        
print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))


1744
1162
17194


It is interesting to see that there are many more "Other" posts than Ask HN or Show HN posts. Helpfully, the Ask HN and Show HN counts are of the same order of magnitude (1,744 and 1,162), so we can compare their averages without having to account for a very large difference in the number of posts.
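
The prefix check used above can also be wrapped in a small helper, which keeps the classification rule in one place. The following is only a sketch: the function name `classify_title` and the sample titles are hypothetical and do not come from the dataset.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the title prefix (case-insensitive)."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'


# Hypothetical titles, used only to illustrate the three outcomes
sample_titles = [
    'Ask HN: How do you learn a new language?',
    'SHOW HN: My weekend project',
    'A post about something else',
]

buckets = {'ask': [], 'show': [], 'other': []}
for title in sample_titles:
    buckets[classify_title(title)].append(title)

for kind, titles in buckets.items():
    print(kind, len(titles))
```

Because the title is lowercased before the comparison, 'SHOW HN' and 'Show HN' end up in the same bucket.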

### Finding out the average number of comments for each kind of post

The next step, now that we have split Ask HN and Show HN into 2 clean lists, is to find out how many comments each of them has on average.

In [6]:
total_ask_comments = 0

for posts in ask_posts:
    num_comments = posts[4]
    num_comments = int(num_comments)
    total_ask_comments += num_comments
    
avg_ask_comments = round(total_ask_comments / len(ask_posts), 2)
print("Average comments per Ask HN post:", avg_ask_comments)

total_show_comments = 0

for posts in show_posts:
    num_comments = posts[4]
    num_comments = int(num_comments)
    total_show_comments += num_comments

avg_show_comments = round(total_show_comments / len(show_posts), 2)
print("Average comments per Show HN post: ", avg_show_comments)

Average comments per Ask HN post: 14.04
Average comments per Show HN post:  10.32


This answers our question 1: Which posts receive more comments on average?

#### It seems that Ask HN posts receive about 36% more comments on average than Show HN posts (14.04 versus 10.32).

In the most practical scenario, Hacker News can use this information to assign more moderators to the Ask HN category, given that they have both more posts and more comments.
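
The relative difference between the two averages can be checked directly from the values printed above. This is a short arithmetic sketch; the variable names `pct_more` and `pct_fewer` are introduced here for illustration only.

```python
avg_ask_comments = 14.04   # average printed above for Ask HN
avg_show_comments = 10.32  # average printed above for Show HN

difference = avg_ask_comments - avg_show_comments

# How many more comments Ask HN gets, relative to Show HN
pct_more = difference / avg_show_comments * 100

# How many fewer comments Show HN gets, relative to Ask HN
pct_fewer = difference / avg_ask_comments * 100

print(f"Ask HN receives {pct_more:.1f}% more comments than Show HN")
print(f"Show HN receives {pct_fewer:.1f}% fewer comments than Ask HN")
```

The two percentages differ because they use different baselines: roughly 36% when measured against the Show HN average, and roughly 26% when measured against the Ask HN average.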

### Calculating the posts and comments across time

After seeing that Ask HN posts receive the most comments, we will focus the analysis on this category. We will start by counting the Ask HN posts created in each hour of the day, along with the number of comments they received.

In [7]:
import datetime as dt

result_list = []

#We create a list only with the post creation times and the number of comments received on that post

for posts in ask_posts:
    time_and_comments = [posts[6], int(posts[4])]
    result_list.append(time_and_comments)

#Now, for each row (1 post) in this new list, we count how many posts we had per hour and how many comments we had per hour

counts_by_hour = {}
comments_by_hour = {}

for rows in result_list:
    hour = rows[0]
    # Parse the timestamp string into a datetime object, then keep only the hour as a string
    hour = dt.datetime.strptime(hour, "%m/%d/%Y %H:%M")
    hour = hour.strftime("%H")
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour]= rows[1]
    
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += rows[1]

print(counts_by_hour)
print(comments_by_hour)
    


{'09': 45, '13': 85, '10': 59, '14': 107, '16': 108, '23': 68, '12': 73, '17': 100, '15': 116, '21': 109, '20': 80, '02': 58, '18': 109, '03': 54, '05': 46, '19': 110, '01': 60, '22': 71, '08': 48, '04': 47, '00': 55, '06': 44, '07': 34, '11': 58}
{'09': 251, '13': 1253, '10': 793, '14': 1416, '16': 1814, '23': 543, '12': 687, '17': 1146, '15': 4477, '21': 1745, '20': 1722, '02': 1381, '18': 1439, '03': 421, '05': 464, '19': 1188, '01': 683, '22': 479, '08': 492, '04': 337, '00': 447, '06': 397, '07': 267, '11': 641}


Adding all the values in the first dictionary we confirm that all 1744 posts have been accounted for. Our next step is to find out the average number of comments per post for each hour.
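
That check can be reproduced in one line. The dictionary below is copied from the output printed above; the variable name `total_posts` is introduced here for illustration.

```python
# Post counts per hour, copied from the output of the previous cell
counts_by_hour = {
    '09': 45, '13': 85, '10': 59, '14': 107, '16': 108, '23': 68,
    '12': 73, '17': 100, '15': 116, '21': 109, '20': 80, '02': 58,
    '18': 109, '03': 54, '05': 46, '19': 110, '01': 60, '22': 71,
    '08': 48, '04': 47, '00': 55, '06': 44, '07': 34, '11': 58,
}

# Every post falls in exactly one hour, so the values must add up to len(ask_posts)
total_posts = sum(counts_by_hour.values())
print(total_posts)
print(len(counts_by_hour))
```

The total matches the 1744 Ask HN posts found earlier, and the dictionary has one entry for each of the 24 hours.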

In [8]:
avg_by_hour = []

for posts in counts_by_hour:
    avg_by_hour.append([posts, round(comments_by_hour[posts]/counts_by_hour[posts],2)])

print(avg_by_hour)

[['09', 5.58], ['13', 14.74], ['10', 13.44], ['14', 13.23], ['16', 16.8], ['23', 7.99], ['12', 9.41], ['17', 11.46], ['15', 38.59], ['21', 16.01], ['20', 21.52], ['02', 23.81], ['18', 13.2], ['03', 7.8], ['05', 10.09], ['19', 10.8], ['01', 11.38], ['22', 6.75], ['08', 10.25], ['04', 7.17], ['00', 8.13], ['06', 9.02], ['07', 7.85], ['11', 11.05]]


Right now the data is hard to read. Our next step is to sort this list of lists by average. Sorting a list of lists compares the first element of each inner list, so we swap the two columns to put the average (a float) first.

In [9]:
swap_avg_by_hour = []

for rows in avg_by_hour:
    swap_avg_by_hour.append([rows[1], rows[0]])

print(swap_avg_by_hour)

[[5.58, '09'], [14.74, '13'], [13.44, '10'], [13.23, '14'], [16.8, '16'], [7.99, '23'], [9.41, '12'], [11.46, '17'], [38.59, '15'], [16.01, '21'], [21.52, '20'], [23.81, '02'], [13.2, '18'], [7.8, '03'], [10.09, '05'], [10.8, '19'], [11.38, '01'], [6.75, '22'], [10.25, '08'], [7.17, '04'], [8.13, '00'], [9.02, '06'], [7.85, '07'], [11.05, '11']]


In [10]:
sorted_swap = sorted(swap_avg_by_hour, reverse = True)
print(sorted_swap[:5])

[[38.59, '15'], [23.81, '02'], [21.52, '20'], [16.8, '16'], [16.01, '21']]
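
The swap is one way to get the sort order we want. An alternative sketch, shown here only for comparison, passes a `key` function to `sorted` so the `[hour, average]` pairs can be ordered by their second element without building a swapped list. The name `avg_by_hour_sample` is hypothetical; its values are a few pairs copied from the output above.

```python
# A few [hour, average] pairs copied from avg_by_hour above
avg_by_hour_sample = [['09', 5.58], ['15', 38.59], ['02', 23.81], ['20', 21.52]]

# Sort by the average (index 1) in descending order, leaving each pair as-is
top_hours = sorted(avg_by_hour_sample, key=lambda pair: pair[1], reverse=True)
print(top_hours[:3])
```

With this approach the hour stays in the first position, so no second pass is needed to restore the original column order.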


Now we want to show the hour in a more readable format, so we reformat it as we did earlier with the datetime.strptime constructor and the strftime method.

In [11]:
for values in sorted_swap:
    values[1] = dt.datetime.strptime(values[1], "%H")
    values[1] = values[1].strftime("%H:%M")
    print("{0}: {1} average comments per post".format(values[1], values[0]))  
    

15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.8 average comments per post
21:00: 16.01 average comments per post
13:00: 14.74 average comments per post
10:00: 13.44 average comments per post
14:00: 13.23 average comments per post
18:00: 13.2 average comments per post
17:00: 11.46 average comments per post
01:00: 11.38 average comments per post
11:00: 11.05 average comments per post
19:00: 10.8 average comments per post
08:00: 10.25 average comments per post
05:00: 10.09 average comments per post
12:00: 9.41 average comments per post
06:00: 9.02 average comments per post
00:00: 8.13 average comments per post
23:00: 7.99 average comments per post
07:00: 7.85 average comments per post
03:00: 7.8 average comments per post
04:00: 7.17 average comments per post
22:00: 6.75 average comments per post
09:00: 5.58 average comments per post


## Conclusions on the correlation between posting time and engagement

As we can see, the best hour to post and get comments is 15:00, with 38.59 comments per post on average.

Looking at the first several rows, we can see that the most engaging times, and therefore the times when moderators are most needed, are mid-afternoon (15:00, 16:00), the evening (20:00, 21:00), and the middle of the night (02:00).

This would make sense if many participants have full-time jobs, since most of these hours fall either shortly before the end of the working day or after it.

## Now, what about the "Other posts"?

### What kind of posts are these and how much engagement do they bring?

We have concluded that for Ask HN, the most standardized kind of post, the best time to get engagement is 15:00, but we set aside a considerable pool of "Other posts", roughly 10 times bigger than the Ask HN pool. What are these?

To find out, we will calculate how many comments they receive on average, analyze the comments per hour for this kind of post, and print a sample of the posts with the most comments.

In [12]:
total_comments_other = 0

for rows in other_posts:
    comments = int(rows[4])
    total_comments_other += comments

avg_comments_other = total_comments_other / len(other_posts)

print(avg_comments_other)

26.8730371059672


This confirms the suspicion that "Other posts" have an even higher level of activity than the other two categories, and that we had been overlooking them.

To get a better picture, we will also compare the average number of comments per post for each hour for this kind of post, and then we will take a sample of the posts with the most engagement.

In [13]:
time_comment_list = []

for rows in other_posts:
    time_comments_other = [rows[6], int(rows[4])]
    time_comment_list.append(time_comments_other)

othercounts_hour = {}
othercomments_hour = {}

for rows in time_comment_list:
    hour = rows[0]
    hour = dt.datetime.strptime(hour, "%m/%d/%Y %H:%M")
    hour = hour.strftime("%H")
    if hour not in othercounts_hour:
        othercounts_hour[hour] = 1
        othercomments_hour[hour] = rows[1]
    else:
        othercounts_hour[hour] += 1
        othercomments_hour[hour] += rows[1]

#Now we get the average comments per post per hour

otheraverage_comments_hour = []

for rows in othercounts_hour:
    otheraverage_comments_hour.append([rows, round(othercomments_hour[rows]/othercounts_hour[rows],2)])

print("Number of comments per post, per hour: ", otheraverage_comments_hour)

Number of comments per post, per hour:  [['11', 29.59], ['19', 26.7], ['22', 23.27], ['00', 27.08], ['04', 24.13], ['09', 27.59], ['16', 25.39], ['18', 26.92], ['10', 26.61], ['12', 30.35], ['20', 23.14], ['03', 26.83], ['17', 28.0], ['14', 32.33], ['13', 30.9], ['01', 23.07], ['23', 24.62], ['08', 27.03], ['02', 27.79], ['21', 23.61], ['15', 29.52], ['06', 21.36], ['05', 25.18], ['07', 26.81]]
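
The hourly tally for "Other posts" repeats the logic used earlier for Ask HN. One possible way to avoid the duplication is to move it into a function. The sketch below is an assumption about how that could look, not the code used in this analysis: the function name `average_comments_by_hour`, its parameters, and the sample rows are all hypothetical.

```python
import datetime as dt


def average_comments_by_hour(posts, date_index=6, comments_index=4):
    """Return {hour: average comments per post} for rows shaped like the hn rows."""
    counts = {}
    totals = {}
    for row in posts:
        created = dt.datetime.strptime(row[date_index], "%m/%d/%Y %H:%M")
        hour = created.strftime("%H")
        counts[hour] = counts.get(hour, 0) + 1
        totals[hour] = totals.get(hour, 0) + int(row[comments_index])
    return {hour: round(totals[hour] / counts[hour], 2) for hour in counts}


# Hypothetical rows with the same seven columns as the dataset
sample_posts = [
    ['1', 'Post A', '', '10', '4', 'user_a', '8/4/2016 11:52'],
    ['2', 'Post B', '', '5', '8', 'user_b', '9/1/2016 11:10'],
    ['3', 'Post C', '', '7', '3', 'user_c', '1/26/2016 19:30'],
]

print(average_comments_by_hour(sample_posts))
```

The same function could then be called once with `ask_posts` and once with `other_posts`.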


This information tells us that there is much more engagement in the Other Posts category than in the Ask HN posts. We can see as well which hours are the ones with the highest engagement:

In [18]:
for rows in otheraverage_comments_hour:
    number = float(rows[1])
    if number > 30:
        print(rows)

['12', 30.35]
['14', 32.33]
['13', 30.9]


We see that there are 3 hours in which posts receive more than 30 comments on average. Now, let's take a sample of posts to see what we are dealing with. We will find out which posts have the most comments and then print their titles:

In [28]:
comments_other = []

for rows in other_posts:
    comments_other.append(float(rows[4]))

print(max(comments_other))
print(" ")

#We sort the list to find out the top range

comments_sorted = sorted(comments_other, reverse = True)
print(comments_sorted[0:20])
print(" ")

#Now we print the title and number of comments of every post with more than 500 comments

for rows in other_posts:
    comment_and_title = [rows[1], int(rows[4])]
    if int(rows[4]) > 500:
        print(comment_and_title)

1733.0
 
[1733.0, 809.0, 781.0, 760.0, 705.0, 677.0, 644.0, 624.0, 599.0, 569.0, 552.0, 552.0, 547.0, 521.0, 519.0, 516.0, 515.0, 514.0, 513.0, 503.0]
 
['Master Plan, Part Deux', 677]
['I switched to Android after 7 years of iOS', 502]
['Soaring Student Debt Prompts Calls for Relief', 516]
['July was the hottest month ever recorded, according to Nasa', 502]
['Tech workers are increasingly looking to leave Silicon Valley', 569]
['Massachusetts Bans Employers from Asking Applicants About Previous Pay', 760]
["It's The Future", 521]
['A letter to our daughter', 519]
['New Windows 10 Devices From Microsoft', 644]
['Elon Musk on How to Build the Future', 513]
["Instagram's Million Dollar Bug", 514]
['iPhone 7', 1733]
['Pardon Snowden', 781]
['Paris Shootings and Explosions Kill Over 100, Police Say', 624]
['How the Sugar Industry Shifted Blame to Fat', 599]
['Economic Inequality', 547]
['VLC contributor living in Aleppo writing about the Paris attacks', 705]
['Apple introduces the iPhone S
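
Sorting the comment counts on their own loses the link to the titles, which is why a second loop is needed above. A possible alternative, sketched here under the assumption that the rows keep the dataset's seven-column layout, is to sort the rows themselves by comment count. The sample rows and the name `top_posts` are hypothetical.

```python
# Hypothetical rows with the same seven columns as the dataset
sample_other_posts = [
    ['1', 'Post A', '', '10', '120', 'user_a', '8/4/2016 11:52'],
    ['2', 'Post B', '', '5', '730', 'user_b', '9/1/2016 11:10'],
    ['3', 'Post C', '', '7', '45', 'user_c', '1/26/2016 19:30'],
]

# Sort whole rows by num_comments (index 4) so each title stays attached to its count
top_posts = sorted(sample_other_posts, key=lambda row: int(row[4]), reverse=True)[:2]

for row in top_posts:
    print([row[1], int(row[4])])
```

This prints the most commented posts in descending order, rather than in the order they appear in the file.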

## Conclusion - Other posts that gain significant engagement in HN

As we can see, many of these posts with very high activity are news articles (we do not know whether each is a copy-paste or an opinion article), and most of them are about technology. Hacker News might want to set up a category for news articles in its post creation interface to properly track these and assign more moderation resources to them.

If these are indeed opinion articles, then commenting on technology news might be a very good path for HN "Influencers" to explore! 