# Hacker News Research Project

In this project, we will be analyzing Hacker News exports and evaluating "Ask HN" vs "Show HN" posts. We will look to answer:

- Which posts are more popular
- If there is a certain time that gets more comments on average

In [1]:
from csv import reader
file = open('hacker_news.csv')
readfile = reader(file)
hn = list(readfile)
headers = hn[:1]
hn = hn[1:]

print(hn[:1])

[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']]


In [2]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
        
print("Ask Posts: ", len(ask_posts))
print("Show Posts: ", len(show_posts))
print("Other Posts: ", len(other_posts))

Ask Posts:  1744
Show Posts:  1162
Other Posts:  17194


In [3]:
total_ask_comments = 0
total_show_comments = 0

for row in ask_posts:
    num_comments = row[4]
    num_comments = int(num_comments)
    total_ask_comments += num_comments
    
avg_ask_comments = (total_ask_comments)/len(ask_posts)
print("Average Ask Comments: ",avg_ask_comments)

for row in show_posts:
    num_comments = row[4]
    num_comments = int(num_comments)
    total_show_comments += num_comments
    
avg_show_comments = (total_show_comments) / len(show_posts)
print("Average Show Comments: ", avg_show_comments)
    


Average Ask Comments:  14.038417431192661
Average Show Comments:  10.31669535283993


# Conclusion:

Ask posts are far more popular then Show posts (50% more than Show) and also have a higher average of comments per post. Ask posts have 14 comments per post on average while Show posts have 10.

In [4]:
print(headers)

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']]


In [5]:
import datetime as dt

result_list = []

for row in ask_posts:
    created_at = row[6]
    comments_by_hour = int(row[4])
    each_list = [created_at, comments_by_hour]
    result_list.append(each_list)

print(result_list[0])
    
counts_by_hour = {}
comments_by_hour = {}

for row in result_list:
    date_and_time = row[0]
    date_and_time = dt.datetime.strptime(date_and_time, "%m/%d/%Y %H:%M")
    hour = date_and_time.strftime("%H")
    date = date_and_time.strftime("%m/%d")
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = 1
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += row[1]
    
print(counts_by_hour)
print(comments_by_hour)

['8/16/2016 9:55', 6]
{'17': 100, '02': 58, '09': 45, '20': 80, '08': 48, '18': 109, '05': 46, '19': 110, '15': 116, '13': 85, '23': 68, '01': 60, '04': 47, '22': 71, '03': 54, '14': 107, '07': 34, '12': 73, '10': 59, '00': 55, '16': 108, '06': 44, '11': 58, '21': 109}
{'17': 1146, '02': 1379, '09': 246, '20': 1721, '08': 488, '18': 1438, '05': 436, '19': 1186, '15': 4477, '13': 1225, '23': 543, '01': 651, '04': 335, '22': 478, '03': 421, '14': 1414, '07': 266, '12': 684, '10': 793, '00': 438, '16': 1798, '06': 397, '11': 640, '21': 1742}


In [6]:
avg_by_hour = []

for row in counts_by_hour:
    avg_by_hour.append([row,(comments_by_hour[row]/counts_by_hour[row])])
    print([row,(comments_by_hour[row]/counts_by_hour[row])])
    


['17', 11.46]
['02', 23.775862068965516]
['09', 5.466666666666667]
['20', 21.5125]
['08', 10.166666666666666]
['18', 13.192660550458715]
['05', 9.478260869565217]
['19', 10.781818181818181]
['15', 38.5948275862069]
['13', 14.411764705882353]
['23', 7.985294117647059]
['01', 10.85]
['04', 7.127659574468085]
['22', 6.732394366197183]
['03', 7.796296296296297]
['14', 13.214953271028037]
['07', 7.823529411764706]
['12', 9.36986301369863]
['10', 13.440677966101696]
['00', 7.963636363636364]
['16', 16.64814814814815]
['06', 9.022727272727273]
['11', 11.03448275862069]
['21', 15.98165137614679]


In [38]:
swap_avg_by_hour = []
for row in avg_by_hour:
    hour = row[0]
    comments = row[1]
    swap_avg_by_hour.append([comments, hour])

sorted_swap = sorted(swap_avg_by_hour, reverse=True)
for row in sorted_swap:
    print(row)

for row in sorted_swap[:5]:
    comments = row[0]
    hour = row[1]
    hour = dt.datetime.strptime(hour, "%H")
    hour = hour.strftime('%H:%M')
    result = "{}: {:.2f} average comments per post."
    print(hour,": ", comments, " average comments per post.")


[38.5948275862069, '15']
[23.775862068965516, '02']
[21.5125, '20']
[16.64814814814815, '16']
[15.98165137614679, '21']
[14.411764705882353, '13']
[13.440677966101696, '10']
[13.214953271028037, '14']
[13.192660550458715, '18']
[11.46, '17']
[11.03448275862069, '11']
[10.85, '01']
[10.781818181818181, '19']
[10.166666666666666, '08']
[9.478260869565217, '05']
[9.36986301369863, '12']
[9.022727272727273, '06']
[7.985294117647059, '23']
[7.963636363636364, '00']
[7.823529411764706, '07']
[7.796296296296297, '03']
[7.127659574468085, '04']
[6.732394366197183, '22']
[5.466666666666667, '09']
15:00 :  38.5948275862069  average comments per post.
02:00 :  23.775862068965516  average comments per post.
20:00 :  21.5125  average comments per post.
16:00 :  16.64814814814815  average comments per post.
21:00 :  15.98165137614679  average comments per post.


# Conclusion

Postings from 3pm are the most likely to get comments. This is the highest tier by far, beating the 2nd most popular time (2am) by 65%.

4 of the 5 most popular times to post are in the late afternoon to early evening (3pm - 9pm).

2am appears to be an anomaly, as this is the 2nd most popular time for comments.