# Hacker News Posts Analysis

Hacker News is a site started by the startup incubator Y Combinator, where user-submitted stories (known as "posts") are voted on and commented on, similar to Reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of the Hacker News listings can get hundreds of thousands of visitors as a result.

We analyse these posts to answer the following questions:

1. Do Ask HN or Show HN posts receive more comments on average?

2. Do posts created at a certain time receive more comments on average?

In [3]:
from csv import reader

In [4]:
with open('hacker_news.csv') as opened_file:  # read the CSV file and close it when done
    hn = list(reader(opened_file))
print(hn[:5])

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]


In [5]:
headers = hn[0] #making a list of the header row
print(headers)

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


In [6]:
hn = hn[1:] #removing the header from the dataset list hn and checking the same
print(hn[:5])

[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]


In [7]:
ask_posts, show_posts, other_posts = [],[],[] #making lists of ask hn, show hn, and other posts

In [8]:
#populating above mentioned list

for row in hn: 
    title = row[1]
    cc_title = title.lower()
    if cc_title.startswith('ask hn'):
        ask_posts.append(row)
    elif cc_title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
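
The same prefix check can be wrapped in a small function so it is easy to try on a few titles. This is only a minimal sketch: `classify_title` is a hypothetical helper name, not part of the notebook, and one of the sample titles is made up for illustration.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the title prefix (hypothetical helper)."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'

samples = [
    'Ask HN: How to improve my personal website?',  # title from the dataset
    'Show HN: An example project',                  # made-up title for illustration
    'Interactive Dynamic Video',                    # title from the dataset
]
labels = [classify_title(title) for title in samples]
print(labels)
```

Lower-casing the title first means that variants such as "ASK HN" or "ask hn" fall into the same group.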

In [9]:
print(len(ask_posts), len(show_posts), len(other_posts)) #checking the length of these posts

1744 1162 17194


In [10]:
print(ask_posts[:5])

[['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55'], ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43'], ['11610310', 'Ask HN: Aby recent changes to CSS that broke mobile?', '', '1', '1', 'polskibus', '5/2/2016 10:14'], ['12210105', 'Ask HN: Looking for Employee #3 How do I do it?', '', '1', '3', 'sph130', '8/2/2016 14:20'], ['10394168', 'Ask HN: Someone offered to buy my browser extension from me. What now?', '', '28', '17', 'roykolak', '10/15/2015 16:38']]


### Calculating and comparing the average number of comments for ask_posts and show_posts

In [11]:

total_ask_comments = 0       #checking total comments in ask_posts

for row in ask_posts:
    total_ask_comments += int(row[4])  #adding number of comments of row to total.

avg_ask_comments = total_ask_comments/len(ask_posts)

#Doing the Same for show_posts below

total_show_comments = 0       #checking total comments in show_posts

for row in show_posts:
    total_show_comments += int(row[4])  #adding number of comments of row to total.

avg_show_comments = total_show_comments/len(show_posts)

print(' The average number of comments for an Ask HN post is', avg_ask_comments)
print(' The average number of comments for a Show HN post is', avg_show_comments)

 The average number of comments for an Ask HN post is 14.038417431192661
 The average number of comments for a Show HN post is 10.31669535283993
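
Since the two averages are computed in the same way, the loop could be factored into one function. The sketch below assumes a hypothetical helper, `average_comments`, and assumes that an empty list should give an average of 0. The sample rows are three of the Ask HN rows printed earlier.

```python
def average_comments(posts, comments_index=4):
    """Mean of the comment counts stored as strings at comments_index (hypothetical helper)."""
    if not posts:
        return 0.0  # assumption: treat an empty list as an average of 0
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)

# Three rows from the ask_posts output (6, 29 and 1 comments)
sample_posts = [
    ['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55'],
    ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43'],
    ['11610310', 'Ask HN: Aby recent changes to CSS that broke mobile?', '', '1', '1', 'polskibus', '5/2/2016 10:14'],
]
sample_avg = average_comments(sample_posts)
print(sample_avg)  # (6 + 29 + 1) / 3 = 12.0
```

With the full lists, `average_comments(ask_posts)` and `average_comments(show_posts)` would replace the two separate loops.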


### On average, Ask HN posts receive more comments (about 14 per post) than Show HN posts (about 10 per post).

### Next, we'll determine whether Ask HN posts created at a certain time of day are more likely to attract comments.

#### We'll use the following steps to perform this analysis:

1. Calculate the number of ask posts created in each hour of the day, along with the number of comments received.

2. Calculate the average number of comments ask posts receive by hour created.
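
The two steps above can be sketched on a handful of rows before running them on the full dataset. This is an illustrative sketch, not the notebook's own cell: the variable names are assumptions, most of the sample pairs come from the Ask HN rows printed earlier, and one pair is made up so that an hour contains more than one post.

```python
import datetime as dt

# (created_at, num_comments) pairs; the last one is made up for illustration
sample = [
    ['8/16/2016 9:55', 6],
    ['11/22/2015 13:43', 29],
    ['5/2/2016 10:14', 1],
    ['8/2/2016 14:20', 3],
    ['10/15/2015 16:38', 17],
    ['1/1/2016 9:05', 4],  # made-up row sharing hour 09 with the first one
]

counts, comments = {}, {}
for created_at, num_comments in sample:
    # Step 1: parse the timestamp and keep only the zero-padded hour
    hour = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M").strftime("%H")
    counts[hour] = counts.get(hour, 0) + 1
    comments[hour] = comments.get(hour, 0) + num_comments

# Step 2: average comments per post for each hour
averages = {hour: comments[hour] / counts[hour] for hour in counts}
print(averages['09'])  # (6 + 4) / 2 = 5.0
```

Using `dict.get` with a default of 0 avoids the separate branch for hours that have not been seen yet.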

In [12]:
import datetime as dt

In [13]:
#making a list which contains information of time of creation and num of comments

result_list = []
for row in ask_posts: 
    to_app = [row[6], int(row[4])]
    result_list.append(to_app)


In [14]:
counts_by_hour, comments_by_hour = {}, {}

In [15]:
# Making Two Dictionaries
#counts_by_hour: contains the number of ask posts created during each hour of the day.
#comments_by_hour: contains the corresponding number of comments ask posts created at each hour received.

for row in result_list:
    created_at_time = dt.datetime.strptime(row[0], "%m/%d/%Y %H:%M").time()
    created_at_hour = created_at_time.strftime("%H")
    
    if created_at_hour not in counts_by_hour:
        counts_by_hour[created_at_hour] = 1
        comments_by_hour[created_at_hour] = row[1]
    else: 
        counts_by_hour[created_at_hour] += 1
        comments_by_hour[created_at_hour] += row[1]
    
print(counts_by_hour)

{'14': 107, '15': 116, '05': 46, '06': 44, '17': 100, '22': 71, '23': 68, '09': 45, '10': 59, '08': 48, '20': 80, '03': 54, '16': 108, '07': 34, '13': 85, '04': 47, '01': 60, '00': 55, '19': 110, '18': 109, '02': 58, '21': 109, '12': 73, '11': 58}


In [16]:
print(comments_by_hour)

{'14': 1416, '15': 4477, '05': 464, '06': 397, '17': 1146, '22': 479, '23': 543, '09': 251, '10': 793, '08': 492, '20': 1722, '03': 421, '16': 1814, '07': 267, '13': 1253, '04': 337, '01': 683, '00': 447, '19': 1188, '18': 1439, '02': 1381, '21': 1745, '12': 687, '11': 641}


In [19]:
#making a list of average number of comments per post for posts created during each hour of the day.

acbh = [] #average comments per post by hour.

for hour in counts_by_hour:
    av = comments_by_hour[hour]/counts_by_hour[hour]
    acbh.append([hour, av])

avg_by_hour = acbh

In [21]:
print(avg_by_hour)

[['14', 13.233644859813085], ['15', 38.5948275862069], ['05', 10.08695652173913], ['06', 9.022727272727273], ['17', 11.46], ['22', 6.746478873239437], ['23', 7.985294117647059], ['09', 5.5777777777777775], ['10', 13.440677966101696], ['08', 10.25], ['20', 21.525], ['03', 7.796296296296297], ['16', 16.796296296296298], ['07', 7.852941176470588], ['13', 14.741176470588234], ['04', 7.170212765957447], ['01', 11.383333333333333], ['00', 8.127272727272727], ['19', 10.8], ['18', 13.20183486238532], ['02', 23.810344827586206], ['21', 16.009174311926607], ['12', 9.41095890410959], ['11', 11.051724137931034]]


#### Now we will be sorting the list of lists and printing the five highest values in a format that's easier to read.

In [22]:
#Creating a list that equals avg_by_hour but with swapped columns.

swap_avg_by_hour=[]
for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])
    
print(swap_avg_by_hour)

[[13.233644859813085, '14'], [38.5948275862069, '15'], [10.08695652173913, '05'], [9.022727272727273, '06'], [11.46, '17'], [6.746478873239437, '22'], [7.985294117647059, '23'], [5.5777777777777775, '09'], [13.440677966101696, '10'], [10.25, '08'], [21.525, '20'], [7.796296296296297, '03'], [16.796296296296298, '16'], [7.852941176470588, '07'], [14.741176470588234, '13'], [7.170212765957447, '04'], [11.383333333333333, '01'], [8.127272727272727, '00'], [10.8, '19'], [13.20183486238532, '18'], [23.810344827586206, '02'], [16.009174311926607, '21'], [9.41095890410959, '12'], [11.051724137931034, '11']]


In [27]:
#Sorting the above list in descending order.

sorted_swap = sorted(swap_avg_by_hour, reverse = True)

for element in sorted_swap:
    print(element)

[38.5948275862069, '15']
[23.810344827586206, '02']
[21.525, '20']
[16.796296296296298, '16']
[16.009174311926607, '21']
[14.741176470588234, '13']
[13.440677966101696, '10']
[13.233644859813085, '14']
[13.20183486238532, '18']
[11.46, '17']
[11.383333333333333, '01']
[11.051724137931034, '11']
[10.8, '19']
[10.25, '08']
[10.08695652173913, '05']
[9.41095890410959, '12']
[9.022727272727273, '06']
[8.127272727272727, '00']
[7.985294117647059, '23']
[7.852941176470588, '07']
[7.796296296296297, '03']
[7.170212765957447, '04']
[6.746478873239437, '22']
[5.5777777777777775, '09']
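
Swapping the columns is one way to sort by the average. An alternative is to pass a `key` function to `sorted`, which leaves the `[hour, average]` order untouched. The sketch below uses a few pairs copied from `avg_by_hour`; the names `avg_by_hour_sample`, `sorted_sample`, and `top_hours` are illustrative.

```python
# A few [hour, average] pairs copied from avg_by_hour
avg_by_hour_sample = [
    ['14', 13.233644859813085],
    ['15', 38.5948275862069],
    ['02', 23.810344827586206],
    ['09', 5.5777777777777775],
    ['20', 21.525],
]

# Sort by the second element (the average), largest first
sorted_sample = sorted(avg_by_hour_sample, key=lambda pair: pair[1], reverse=True)
top_hours = [pair[0] for pair in sorted_sample]
print(top_hours)  # ['15', '02', '20', '14', '09']
```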


In [31]:
print('Top 5 Hours for Ask Posts Comments: ', '\n')
for element in sorted_swap[:5]:
    print(' {}:00: {} average comments per post.'.format(element[1], element[0]))

Top 5 Hours for Ask Posts Comments:  

 15:00: 38.5948275862069 average comments per post.
 02:00: 23.810344827586206 average comments per post.
 20:00: 21.525 average comments per post.
 16:00: 16.796296296296298 average comments per post.
 21:00: 16.009174311926607 average comments per post.
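
If shorter numbers are preferred, the averages could be rounded in the format string. The sketch below is one possible variant, using the top three `[average, hour]` pairs from the sorted list; the `{:.2f}` format specifier limits the output to two decimal places.

```python
# Top three [average, hour] pairs from the sorted list
top_three = [[38.5948275862069, '15'], [23.810344827586206, '02'], [21.525, '20']]

lines = []
for avg, hour in top_three:
    # {:.2f} rounds the average to two decimal places for display only
    lines.append('{}:00: {:.2f} average comments per post.'.format(hour, avg))

print('\n'.join(lines))
```

The rounding only affects what is printed; the values stored in the list keep their full precision.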


## Conclusions

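Based on the averages above, Ask HN posts created between 3pm and 4pm attract the most comments (about 38.6 per post), followed by posts created between 2am and 3am (about 23.8) and between 8pm and 9pm (about 21.5). These hours are read directly from the created_at column of the dataset, and the figures are averages per hour, so they indicate when comments are more likely rather than guarantee them.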