## Hacker News Analytics

Hacker News is a site created by Y Combinator where user-submitted posts are voted on and commented on, similar to Reddit. We are particularly interested in posts whose titles begin with "Ask HN" or "Show HN". Below are some examples.

--------------------------------------------
- Ask HN: How to improve my personal website?

- Ask HN: Am I the only one outraged by Twitter shutting down share counts?

- Ask HN: Aby recent changes to CSS that broke mobile?

- Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform

- Show HN: Something pointless I made

--------------------------------------------

The aim of this project is to determine whether "Ask HN" or "Show HN" posts receive more comments on average, and whether posts created at certain times of the day receive more comments on average.

The dataset used for this project comes from Hacker News and is available [here](https://www.kaggle.com/hacker-news/hacker-news-posts). For convenience, it has been reduced from about 300,000 rows to around 20,000. See the accompanying CSV file (`hacker_news.csv`) for the exact data used in this project.

### Removing headers

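In [47]:
from csv import reader

# Read the whole CSV into a list of rows; the with-block closes the file.
with open('hacker_news.csv') as opened_file:
    hn = list(reader(opened_file))

# Separate the header row from the data rows.
hn_header = hn[0]
hn = hn[1:]
print(hn[:5])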

[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]
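Because each row is a plain list, the rest of the notebook addresses columns by position (`row[1]`, `row[4]`, `row[6]`). As a minimal sketch, a name-to-index lookup can make those positions easier to read. The column names below are an assumption inferred from the sample rows, since the header itself is not printed above; when running against the real file, build the lookup from `hn_header` instead.

```python
# Assumed column names (hypothetical); the real ones live in hn_header.
assumed_header = ['id', 'title', 'url', 'num_points',
                  'num_comments', 'author', 'created_at']

# Map each column name to its position within a row.
col = {name: idx for idx, name in enumerate(assumed_header)}

# First data row from the printed output above.
sample_row = ['12224879', 'Interactive Dynamic Video',
              'http://www.interactivedynamicvideo.com/',
              '386', '52', 'ne0phyte', '8/4/2016 11:52']

title = sample_row[col['title']]
n_comments = int(sample_row[col['num_comments']])  # stored as a string in the CSV
created_at = sample_row[col['created_at']]
print(title, n_comments, created_at)
```

Under this assumed header, the title sits at index 1, the comment count at index 4, and the creation timestamp at index 6, which matches the indices used in the cells that follow.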


In [50]:
ask_posts = []
show_posts = []
other_posts = []

# Sort each post into a category by its title prefix (case-insensitive).
for row in hn:
    title = row[1].lower()

    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)

print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))

1744
1162
17194
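The prefix test used above can be tried on a few titles in isolation. The sketch below wraps it in a helper, `classify_title`, which is a hypothetical name introduced here for illustration and is not part of the notebook. The third title is made up to show that lowercasing makes the match case-insensitive.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on a case-insensitive prefix."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'

examples = [
    'Ask HN: How to improve my personal website?',
    'Show HN: Something pointless I made',
    'ASK HN: an upper-case prefix still matches',  # hypothetical title
    'Interactive Dynamic Video',
]
labels = [classify_title(t) for t in examples]
print(labels)
```

As a sanity check on the counts printed above, 1744 + 1162 + 17194 = 20100, so every row lands in exactly one category.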


### Calculating the Average Number of Comments for Ask HN and Show HN Posts

Now that the posts have been categorised into ask, show and others, we can calculate the average number of comments each category of post receives.

In [53]:
total_ask_comments = 0

# Column 4 holds the comment count as a string.
for row in ask_posts:
    total_ask_comments += int(row[4])

avg_ask_comments = total_ask_comments / len(ask_posts)
print(avg_ask_comments)

14.038417431192661


In [54]:
total_show_comments = 0

for row in show_posts:
    total_show_comments += int(row[4])

avg_show_comments = total_show_comments / len(show_posts)
print(avg_show_comments)

10.31669535283993


Ask posts receive an average of about 14 comments, while show posts receive an average of about 10. Since ask posts generate more comments, the rest of the analysis focuses on them.
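The two cells above repeat the same sum-then-divide pattern. A minimal sketch of that pattern as a reusable function follows; `average_comments` is a hypothetical helper, and the rows are toy data that only mimic the column layout (comment count at index 4).

```python
def average_comments(posts, comments_idx=4):
    """Mean of the comment counts stored as strings at comments_idx."""
    if not posts:
        return 0.0  # avoid ZeroDivisionError for an empty category
    total = sum(int(row[comments_idx]) for row in posts)
    return total / len(posts)

# Toy rows (hypothetical): only the comment column matters here.
toy_posts = [
    ['1', 'Ask HN: first',  '', '5', '52', 'user_a', '8/4/2016 11:52'],
    ['2', 'Ask HN: second', '', '3', '10', 'user_b', '1/26/2016 19:30'],
    ['3', 'Ask HN: third',  '', '1', '1',  'user_c', '6/23/2016 22:20'],
]
print(average_comments(toy_posts))  # (52 + 10 + 1) / 3
```

With the real lists, `average_comments(ask_posts)` and `average_comments(show_posts)` would be expected to reproduce the two averages printed above.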

### Find the quantity of Ask Posts and Comments by the Hour Posted

To do this, we first categorise ask posts by the hour of the day in which they were created. Then we count the posts and total the comments for each hour.

In [63]:
import datetime as dt

# Pair each ask post's creation timestamp with its comment count.
result_list = []

for post in ask_posts:
    result_list.append([post[6], int(post[4])])

counts_by_hour = {}    # hour -> number of ask posts created in that hour
comments_by_hour = {}  # hour -> total comments on those posts
d_format = "%m/%d/%Y %H:%M"

for result in result_list:
    date = result[0]
    n_comment = result[1]
    # Extract the zero-padded hour ('00' to '23') from the timestamp.
    hour = dt.datetime.strptime(date, d_format).strftime('%H')

    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = n_comment
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += n_comment
comments_by_hour

{'00': 447,
 '01': 683,
 '02': 1381,
 '03': 421,
 '04': 337,
 '05': 464,
 '06': 397,
 '07': 267,
 '08': 492,
 '09': 251,
 '10': 793,
 '11': 641,
 '12': 687,
 '13': 1253,
 '14': 1416,
 '15': 4477,
 '16': 1814,
 '17': 1146,
 '18': 1439,
 '19': 1188,
 '20': 1722,
 '21': 1745,
 '22': 479,
 '23': 543}
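The aggregation above has two parts: extracting the hour from a timestamp string, and accumulating a count and a comment total per hour. The sketch below shows both on a few toy pairs. It swaps the explicit "is the key already present" check for `collections.defaultdict`, which is an alternative technique and not what the notebook cell uses. The first and third timestamps are copied from the sample rows; the second is made up so that two posts share an hour.

```python
import datetime as dt
from collections import defaultdict

D_FORMAT = "%m/%d/%Y %H:%M"

def hour_of(timestamp):
    """Return the zero-padded hour ('00' to '23') of a timestamp string."""
    return dt.datetime.strptime(timestamp, D_FORMAT).strftime('%H')

# Toy (timestamp, comment count) pairs.
pairs = [
    ('8/4/2016 11:52', 52),
    ('8/5/2016 11:05', 8),   # hypothetical post in the same hour
    ('6/17/2016 0:01', 1),
]

counts = defaultdict(int)    # hour -> number of posts
comments = defaultdict(int)  # hour -> total comments

for timestamp, n_comments in pairs:
    hour = hour_of(timestamp)
    counts[hour] += 1        # missing keys start at 0
    comments[hour] += n_comments

print(dict(counts))
print(dict(comments))
```

Note that `strptime` accepts the unpadded month, day, and hour in strings such as `'6/17/2016 0:01'`, while `strftime('%H')` always returns two digits, so the dictionary keys are consistent.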

In [68]:
avg_by_hour = []

# Average comments per ask post for each hour.
for hour in comments_by_hour:
    avg_by_hour.append([hour, comments_by_hour[hour] / counts_by_hour[hour]])

avg_by_hour

[['06', 9.022727272727273],
 ['07', 7.852941176470588],
 ['21', 16.009174311926607],
 ['08', 10.25],
 ['11', 11.051724137931034],
 ['18', 13.20183486238532],
 ['00', 8.127272727272727],
 ['04', 7.170212765957447],
 ['16', 16.796296296296298],
 ['03', 7.796296296296297],
 ['15', 38.5948275862069],
 ['01', 11.383333333333333],
 ['22', 6.746478873239437],
 ['14', 13.233644859813085],
 ['02', 23.810344827586206],
 ['09', 5.5777777777777775],
 ['12', 9.41095890410959],
 ['20', 21.525],
 ['17', 11.46],
 ['19', 10.8],
 ['23', 7.985294117647059],
 ['10', 13.440677966101696],
 ['13', 14.741176470588234],
 ['05', 10.08695652173913]]

### Sorting and Printing

In [74]:
swap_avg_by_hour = []

for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])

sorted_swap = sorted(swap_avg_by_hour, reverse = True)
sorted_swap

[[38.5948275862069, '15'],
 [23.810344827586206, '02'],
 [21.525, '20'],
 [16.796296296296298, '16'],
 [16.009174311926607, '21'],
 [14.741176470588234, '13'],
 [13.440677966101696, '10'],
 [13.233644859813085, '14'],
 [13.20183486238532, '18'],
 [11.46, '17'],
 [11.383333333333333, '01'],
 [11.051724137931034, '11'],
 [10.8, '19'],
 [10.25, '08'],
 [10.08695652173913, '05'],
 [9.41095890410959, '12'],
 [9.022727272727273, '06'],
 [8.127272727272727, '00'],
 [7.985294117647059, '23'],
 [7.852941176470588, '07'],
 [7.796296296296297, '03'],
 [7.170212765957447, '04'],
 [6.746478873239437, '22'],
 [5.5777777777777775, '09']]
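Swapping the columns works because lists are compared element by element, so the average is compared first. An alternative sketch, not the approach used above, keeps the `[hour, average]` layout and passes a `key` function to `sorted`. The four pairs are copied from the averages computed earlier.

```python
# A subset of the [hour, average] pairs from the output above.
avg_subset = [
    ['09', 5.5777777777777775],
    ['15', 38.5948275862069],
    ['20', 21.525],
    ['02', 23.810344827586206],
]

# Sort by the second element (the average), highest first.
by_avg_desc = sorted(avg_subset, key=lambda pair: pair[1], reverse=True)

hours_in_order = [hour for hour, _ in by_avg_desc]
print(hours_in_order)
```

This avoids building the intermediate swapped list, at the cost of a small lambda.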

### Top 5 Hours for Ask Post Comments

In [78]:
for avg, hr in sorted_swap[:5]:
    hour_label = dt.datetime.strptime(hr, "%H").strftime("%H:%M")
    print("{}: {:.2f} average comments per post".format(hour_label, avg))

15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post


Ask posts created during the 15:00 hour (3:00 pm, U.S. timezone) receive the most comments on average, about 38.6 per post. That is roughly 62% more than the next-highest hour, 02:00, which averages about 23.8 comments per post.
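The gap between the top hour and the runner-up can be checked directly from the averages printed above. The snippet below is a small worked calculation; the variable names are introduced here for illustration only.

```python
top_avg = 38.5948275862069       # 15:00, from the sorted output
second_avg = 23.810344827586206  # 02:00, the next-highest hour

# Relative difference of the top hour over the runner-up, in percent.
pct_more = (top_avg / second_avg - 1) * 100
print("{:.1f}% more comments per post".format(pct_more))
```

The result is a little over 60%, which describes the lead of 15:00 over 02:00 specifically; the lead over the lower-ranked hours is larger.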

### Conclusion

Ask HN posts receive more comments on average than Show HN posts (about 14 versus 10). Among Ask HN posts, those created during the 15:00 hour (U.S. timezone) receive the most comments on average.

One possible explanation is that Ask HN posts explicitly invite readers to give their opinions or to reply to the opinions of others. In addition, 3:00 pm to 4:00 pm is a time of day when people may be wrapping up at work and have time to read and reply.