# Exploring Hacker News Posts

A quick look at which posts perform best on Hacker News.

In [1]:
from csv import reader

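# Read the whole file into a list of rows; the context manager closes the file afterwards.
with open('Hacker_News.csv', encoding="utf8") as dataset:
    hn = list(reader(dataset))

headers = hn[0]  # the first row holds the column names
hn = hn[1:]      # the remaining rows are the posts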
print('Length of dataset: ' + str(len(hn)))

print(headers)
print(hn[:5])

Length of dataset: 293119
['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
[['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26'], ['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24'], ['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19'], ['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16'], ['12578979', 'How the Data Vault Enables the Next-Gen Data Warehouse and Data Lake', 'https://www.talend.com/blog/2016/05/12/talend-and-Â\x93the-data-vaultÂ\x94', '1', '0', 'markgainor1', '9/26/201

Next, we split the posts into three lists by title prefix (Ask HN posts, Show HN posts, and everything else) so that we can compare them.

In [2]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    title = title.lower()
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
        
print('There are {} in the ask hn list'.format(len(ask_posts)))
print('There are {} in the show hn list'.format(len(show_posts)))
print('There are {} in the other list'.format(len(other_posts)))
print('Total number of posts: {}'.format((len(ask_posts) + len(show_posts) + len(other_posts))))

There are 9139 in the ask hn list
There are 10158 in the show hn list
There are 273822 in the other list
Total number of posts: 293119
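
The prefix check above can also be wrapped in a small helper, which makes the rule easy to test on its own. This is only a sketch: `classify_title` and the sample titles are hypothetical names and values introduced for illustration, not part of the original notebook.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the title prefix (case-insensitive)."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'

# Hypothetical sample titles, not taken from the dataset.
sample_titles = ['Ask HN: How do you learn?', 'SHOW HN: My side project', 'Some other story']
labels = [classify_title(title) for title in sample_titles]
print(labels)
```

Lowercasing first means that titles such as `SHOW HN: ...` land in the same group as `Show HN: ...`, which matches the behavior of the loop above.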


Now we will analyze the number of comments left on ask posts and compare it to the number of comments left on show posts.

In [3]:
total_ask_comments = 0 

for row in ask_posts:
    num_comments = int(row[4])
    total_ask_comments += num_comments
    
avg_ask_comments = total_ask_comments / len(ask_posts)
print('Average number of Ask post comments: ' + str(avg_ask_comments))
print('Total number of Ask post comments: ' + str(total_ask_comments))

Average number of Ask post comments: 10.393478498741656
Total number of Ask post comments: 94986


In [4]:
total_show_comments = 0 

for row in show_posts:
    num_comments = int(row[4])
    total_show_comments += num_comments
    
avg_show_comments = total_show_comments / len(show_posts)
print('Average number of Show post comments: ' + str(avg_show_comments))
print('Total number of Show post comments: ' + str(total_show_comments))

Average number of Show post comments: 4.886099625910612
Total number of Show post comments: 49633


Based on these averages, Show posts get fewer comments than Ask posts: about 4.9 comments per Show post versus about 10.4 per Ask post. On average, Ask posts receive more user interaction.
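
To make the comparison concrete, the two loops above can be folded into one helper and the ratio of the averages computed from the totals printed earlier. This is a minimal sketch: `average_comments` is a hypothetical helper, and it assumes the comment count sits at index 4 of each row, as in the cells above.

```python
def average_comments(posts, comments_index=4):
    """Average the comment counts stored as strings at comments_index."""
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)

# Tiny made-up rows to show the helper in isolation (only index 4 matters here).
demo_rows = [['', '', '', '', '4'], ['', '', '', '', '6']]
demo_avg = average_comments(demo_rows)

# Totals and post counts copied from the outputs above.
ask_avg = 94986 / 9139
show_avg = 49633 / 10158
ratio = ask_avg / show_avg
print(round(ratio, 1))
```

By this measure an Ask post draws roughly twice as many comments as a Show post.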

Since Ask posts receive more comments than Show posts on average, we will focus the remaining analysis on Ask posts only.
We will determine whether Ask posts created at a certain time of day are more likely to attract comments by calculating:
- The number of Ask posts created in each hour of the day, along with the number of comments they received
- The average number of comments Ask posts receive, by the hour in which they were created
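
Both steps rely on extracting the hour from the `created_at` string. The sketch below shows the idea on a few rows: the format string matches the timestamps shown in the dataset preview, while `sample_rows` and its comment counts are made up for illustration. It uses `collections.defaultdict` as one possible way to accumulate the per-hour totals.

```python
import datetime as dt
from collections import defaultdict

# Timestamp in the same format as the created_at column.
parsed = dt.datetime.strptime('9/26/2016 3:26', '%m/%d/%Y %H:%M')
print(parsed.hour)

# Made-up [created_at, num_comments] rows, not taken from the dataset.
sample_rows = [['9/26/2016 3:26', 2], ['9/25/2016 3:05', 4], ['9/25/2016 15:40', 10]]

counts = defaultdict(int)    # number of posts per hour
comments = defaultdict(int)  # number of comments per hour
for created_at, num_comments in sample_rows:
    hour = dt.datetime.strptime(created_at, '%m/%d/%Y %H:%M').hour
    counts[hour] += 1
    comments[hour] += num_comments

averages = {hour: comments[hour] / counts[hour] for hour in counts}
print(averages)
```

A `defaultdict(int)` starts every missing key at 0, so no explicit "first time we see this hour" branch is needed.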

In [5]:
import datetime as dt

result_list = []
for row in ask_posts:
    result_list.append([row[6], int(row[4])])

counts_by_hour = {}
comments_by_hour = {}

for row in result_list:
    time = row[0]
    comments = int(row[1])
    time_dt = dt.datetime.strptime(time, "%m/%d/%Y %H:%M")
    hour = time_dt.hour
    
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = comments
    else: 
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += comments

print(counts_by_hour)
print(comments_by_hour)

print("There are {} lines in counts by hours".format(len(counts_by_hour)))
print("There are {} lines in comments by hours".format(len(comments_by_hour)))

{2: 269, 1: 282, 22: 383, 21: 518, 19: 552, 17: 587, 15: 646, 14: 513, 13: 444, 11: 312, 10: 282, 9: 222, 7: 226, 3: 271, 23: 343, 20: 510, 16: 579, 8: 257, 0: 301, 18: 614, 12: 342, 4: 243, 6: 234, 5: 209}
{2: 2996, 1: 2089, 22: 3372, 21: 4500, 19: 3954, 17: 5547, 15: 18525, 14: 4972, 13: 7245, 11: 2797, 10: 3013, 9: 1477, 7: 1585, 3: 2154, 23: 2297, 20: 4462, 16: 4466, 8: 2362, 0: 2277, 18: 4877, 12: 4234, 4: 2360, 6: 1587, 5: 1838}
There are 24 lines in counts by hours
There are 24 lines in comments by hours


In [6]:
avg_by_hour = []

for hour in counts_by_hour:
    avg_by_hour.append([hour, (comments_by_hour[hour]/counts_by_hour[hour])])


print("There are {} lines in average comments by hours".format(len(avg_by_hour)))
for row in avg_by_hour:
    print(row)

There are 24 lines in average comments by hours
[2, 11.137546468401487]
[1, 7.407801418439717]
[22, 8.804177545691905]
[21, 8.687258687258687]
[19, 7.163043478260869]
[17, 9.449744463373083]
[15, 28.676470588235293]
[14, 9.692007797270955]
[13, 16.31756756756757]
[11, 8.96474358974359]
[10, 10.684397163120567]
[9, 6.653153153153153]
[7, 7.013274336283186]
[3, 7.948339483394834]
[23, 6.696793002915452]
[20, 8.749019607843136]
[16, 7.713298791018998]
[8, 9.190661478599221]
[0, 7.5647840531561465]
[18, 7.94299674267101]
[12, 12.380116959064328]
[4, 9.7119341563786]
[6, 6.782051282051282]
[5, 8.794258373205741]


This output is hard to read, so we will sort it by the average number of comments, highest first.

In [7]:
swap_avg_by_hour = []
for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])
                             
sorted_swap = sorted(swap_avg_by_hour, reverse=True)
for row in sorted_swap:
    print(row)

[28.676470588235293, 15]
[16.31756756756757, 13]
[12.380116959064328, 12]
[11.137546468401487, 2]
[10.684397163120567, 10]
[9.7119341563786, 4]
[9.692007797270955, 14]
[9.449744463373083, 17]
[9.190661478599221, 8]
[8.96474358974359, 11]
[8.804177545691905, 22]
[8.794258373205741, 5]
[8.749019607843136, 20]
[8.687258687258687, 21]
[7.948339483394834, 3]
[7.94299674267101, 18]
[7.713298791018998, 16]
[7.5647840531561465, 0]
[7.407801418439717, 1]
[7.163043478260869, 19]
[7.013274336283186, 7]
[6.782051282051282, 6]
[6.696793002915452, 23]
[6.653153153153153, 9]
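
Swapping the columns works because lists are compared element by element, so the average becomes the sort key. An alternative that keeps the `[hour, average]` order is to pass a `key` function to `sorted`. The sketch below uses a hypothetical four-row sample with averages rounded from the output above.

```python
# [hour, average comments] pairs, rounded from the output above.
sample_avg_by_hour = [[2, 11.14], [15, 28.68], [9, 6.65], [13, 16.32]]

# Sort on the second element (the average), highest first, without swapping columns.
ranked = sorted(sample_avg_by_hour, key=lambda row: row[1], reverse=True)
for hour, avg in ranked:
    print(hour, avg)
```

Either approach produces the same ranking; the `key` version simply avoids building a second list.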


In [8]:
print('Top 5 Hours for Ask Posts Comments')
for avg, hr in sorted_swap[0:5]:
    print('{}: {avg:,.2f} average comments per post'.format(dt.datetime.strptime(str(hr), '%H').strftime("%H"), avg=avg))

Top 5 Hours for Ask Posts Comments
15: 28.68 average comments per post
13: 16.32 average comments per post
12: 12.38 average comments per post
02: 11.14 average comments per post
10: 10.68 average comments per post


# Conclusions

- Ask HN posts receive more comments on average than Show HN posts (about 10.4 versus 4.9 comments per post)
- Ask posts created at 3PM receive the most comments on average (28.68 per post), followed by 1PM, 12PM, 2AM, and 10AM (EST)
- Ask posts created at 9AM receive the fewest comments on average (6.65 per post)
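
Readers in other time zones may want these hours in UTC. The sketch below is a rough conversion that takes the EST label literally and assumes a fixed UTC-5 offset with no daylight saving adjustment. `eastern_hour_to_utc` is a hypothetical helper, and the date used inside it is arbitrary.

```python
import datetime as dt

# Assumption: timestamps are in EST, treated as a fixed UTC-5 offset (no daylight saving).
EASTERN_STANDARD = dt.timezone(dt.timedelta(hours=-5), 'EST')

def eastern_hour_to_utc(hour):
    """Convert an hour of the day in EST to the corresponding hour in UTC."""
    local = dt.datetime(2016, 1, 1, hour, tzinfo=EASTERN_STANDARD)  # arbitrary date
    return local.astimezone(dt.timezone.utc).hour

top_hours_est = [15, 13, 12, 2, 10]  # top five hours from the analysis above
top_hours_utc = [eastern_hour_to_utc(hour) for hour in top_hours_est]
print(top_hours_utc)
```

If the timestamps actually follow US Eastern time with daylight saving, the offset would be UTC-4 for part of the year, so the result is only an approximation.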
