# Exploring Posts on the Website Hacker News

Hacker News is a popular technology website started by the startup incubator Y Combinator. Users submit stories, known as "posts", which other users vote and comment on. Posts that reach the top of the Hacker News listings can get hundreds of thousands of visitors.

In this project, we explore a data set of about 300,000 rows, each describing a post on Hacker News. We then narrow it down to roughly 20,000 posts of a type that is of particular interest because it tends to attract comments. Our goal is to calculate the average number of comments these posts receive and, more importantly, to determine which hours of the day attract the most comments for posts created during them.

First, we open the data set as a list of lists and display its first 5 rows (the header followed by four posts).

In [2]:
from csv import reader
open_file=open('hacker_news.csv', encoding ='utf8')
read_file=reader(open_file)
hn=list(read_file)
hn[0:5]

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'],
 ['12579008',
  'You have two days to comment if you want stem cells to be classified as your own',
  'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018',
  '1',
  '0',
  'altstar',
  '9/26/2016 3:26'],
 ['12579005',
  'SQLAR  the SQLite Archiver',
  'https://www.sqlite.org/sqlar/doc/trunk/README.md',
  '1',
  '0',
  'blacksqr',
  '9/26/2016 3:24'],
 ['12578997',
  'What if we just printed a flatscreen television on the side of our boxes?',
  'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43',
  '1',
  '0',
  'pavel_lishin',
  '9/26/2016 3:19'],
 ['12578989',
  'algorithmic music',
  'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext',
  '1',
  '0',
  'poindontcare',
  '9/26/2016 3:16']]
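The loading step can also be sketched with a `with` block, which closes the file as soon as the rows have been read. The snippet below is only an illustration: the `load_rows` helper is a name introduced here, and it reads a small in-memory sample (built from the header shown above) instead of `hacker_news.csv`.

```python
import csv
import io


def load_rows(file_obj):
    """Read every row of an open CSV file object into a list of lists."""
    return list(csv.reader(file_obj))


# In-memory stand-in for the real file, using the same column layout.
sample = io.StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12578989,algorithmic music,,1,0,poindontcare,9/26/2016 3:16\n"
)

rows = load_rows(sample)
print(rows[0])    # header row
print(rows[1:])   # data rows
```

With the real file, the same helper could be called as `with open('hacker_news.csv', encoding='utf8', newline='') as f: hn = load_rows(f)`.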

We need to separate the header from the data before analyzing it. Below, we print the header and then the first 5 rows of the data set after excluding it. The most important columns for our analysis are 'title', 'num_comments' and 'created_at'.

In [3]:
header=hn[0]
print(header)
hn=hn[1:]
hn[0:5]

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


[['12579008',
  'You have two days to comment if you want stem cells to be classified as your own',
  'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018',
  '1',
  '0',
  'altstar',
  '9/26/2016 3:26'],
 ['12579005',
  'SQLAR  the SQLite Archiver',
  'https://www.sqlite.org/sqlar/doc/trunk/README.md',
  '1',
  '0',
  'blacksqr',
  '9/26/2016 3:24'],
 ['12578997',
  'What if we just printed a flatscreen television on the side of our boxes?',
  'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43',
  '1',
  '0',
  'pavel_lishin',
  '9/26/2016 3:19'],
 ['12578989',
  'algorithmic music',
  'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext',
  '1',
  '0',
  'poindontcare',
  '9/26/2016 3:16'],
 ['12578979',
  'How the Data Vault Enables the Next-Gen Data Warehouse and Data Lake',
  'https://www.talend.com/blog/2016/05/12/talend-and-Â\x93the-data-vaultÂ\x94',
  '1',
  '0',
  'markgainor1',
  '9/26/2016 3:14']]

Our aim is to work with posts whose titles start with 'Ask HN' or 'Show HN', since they receive more comments than the rest. For this reason, we create sub-lists of hn that contain only those titles.

In [4]:
ask_posts=[]
show_posts=[]
other_posts=[]
for row in hn:
    title=row[1]# title column
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))
        

9139
10158
273822


We have thus split the data set into sub-lists. The comparison ignores case: 9139 posts have titles starting with 'ask hn' and 10158 have titles starting with 'show hn'. Let us print the first few rows of these two new lists.

In [5]:
print(ask_posts[0:5])
print('\n')
print(show_posts[0:5])

[['12578908', 'Ask HN: What TLD do you use for local development?', '', '4', '7', 'Sevrene', '9/26/2016 2:53'], ['12578522', 'Ask HN: How do you pass on your work when you die?', '', '6', '3', 'PascLeRasc', '9/26/2016 1:17'], ['12577908', 'Ask HN: How a DNS problem can be limited to a geographic region?', '', '1', '0', 'kuon', '9/25/2016 22:57'], ['12577870', 'Ask HN: Why join a fund when you can be an angel?', '', '1', '3', 'anthony_james', '9/25/2016 22:48'], ['12577647', 'Ask HN: Someone uses stock trading as passive income?', '', '5', '2', '00taffe', '9/25/2016 21:50']]


[['12578335', 'Show HN: Finding puns computationally', 'http://puns.samueltaylor.org/', '2', '0', 'saamm', '9/26/2016 0:36'], ['12578182', 'Show HN: A simple library for complicated animations', 'https://christinecha.github.io/choreographer-js/', '1', '0', 'christinecha', '9/26/2016 0:01'], ['12578098', 'Show HN: WebGL visualization of DNA sequences', 'http://grondilu.github.io/dna.html', '1', '0', 'grondilu', '9/
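The prefix test used to split the posts can be isolated in a small function, which makes the case-insensitive matching easy to check on individual titles. This is a minimal sketch; `classify_title` and its return labels are names chosen here for illustration. Note that it matches on the raw prefix, so a title such as 'Ask HNx' would also count as an ask post.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' depending on the title prefix."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'


print(classify_title('Ask HN: How do you pass on your work when you die?'))
print(classify_title('SHOW HN: Finding puns computationally'))
print(classify_title('algorithmic music'))
```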

The next step is to check whether ask posts or show posts receive more comments. For each group, we calculate the total number of comments and divide it by the number of posts to obtain the average number of comments per post.

In [6]:
total_ask_comments=0
for row in ask_posts:
    num_comments=int(row[4]) # comments column
    total_ask_comments+=num_comments
avg_ask_comments=total_ask_comments/len(ask_posts)
print(avg_ask_comments)

total_show_comments=0
for row in show_posts:
    num_comments=int(row[4]) # comments column
    total_show_comments+=num_comments
avg_show_comments=total_show_comments/len(show_posts)
print(avg_show_comments)

10.393478498741656
4.886099625910612


We can see that, on average, ask posts receive more comments than show posts (about 10.4 versus 4.9 per post). For this reason, the rest of the analysis focuses on ask posts only.
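Since the ask and show averages are computed with the same pattern, the calculation could be folded into one function. The sketch below assumes the column layout shown in the header (comments at index 4); `average_comments` is a name introduced here, and the two sample rows are copied from the ask posts printed earlier.

```python
def average_comments(posts, comments_index=4):
    """Mean of the num_comments column over a list of post rows."""
    if not posts:
        # Avoid dividing by zero when the list is empty.
        return 0.0
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)


sample_posts = [
    ['12578908', 'Ask HN: What TLD do you use for local development?', '', '4', '7', 'Sevrene', '9/26/2016 2:53'],
    ['12578522', 'Ask HN: How do you pass on your work when you die?', '', '6', '3', 'PascLeRasc', '9/26/2016 1:17'],
]

print(average_comments(sample_posts))  # (7 + 3) / 2
```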

The goal is to check which creation times are most likely to attract comments. To do this, we calculate the average number of comments received by ask posts created in each hour of the day. For convenience, we import the datetime module to work with date and time objects.

In [7]:
import datetime as dt

# Create a list of lists; each inner list holds a post's creation time and its number of comments.
result_list=[]
for row in ask_posts:
    created_at=row[6] # created time of a post column
    num_comments=int(row[4])
    add_list=[created_at, num_comments]
    result_list.append(add_list)

# Total number of posts and comments at a certain time.   
counts_by_hour={}
comments_by_hour={}
for row in result_list:
    date_time=row[0]
    num_comments=row[1]
    date_time_object = dt.datetime.strptime(date_time, "%m/%d/%Y %H:%M")
    hour = dt.datetime.strftime(date_time_object, "%H")
    if hour not in counts_by_hour:
        counts_by_hour[hour]=1
        comments_by_hour[hour]=num_comments
    else:
        counts_by_hour[hour]+=1
        comments_by_hour[hour]+=num_comments
print(counts_by_hour)        
print('\n')
print(comments_by_hour)

{'02': 269, '01': 282, '22': 383, '21': 518, '19': 552, '17': 587, '15': 646, '14': 513, '13': 444, '11': 312, '10': 282, '09': 222, '07': 226, '03': 271, '23': 343, '20': 510, '16': 579, '08': 257, '00': 301, '18': 614, '12': 342, '04': 243, '06': 234, '05': 209}


{'02': 2996, '01': 2089, '22': 3372, '21': 4500, '19': 3954, '17': 5547, '15': 18525, '14': 4972, '13': 7245, '11': 2797, '10': 3013, '09': 1477, '07': 1585, '03': 2154, '23': 2297, '20': 4462, '16': 4466, '08': 2362, '00': 2277, '18': 4877, '12': 4234, '04': 2360, '06': 1587, '05': 1838}
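The hour bucketing above can also be written with `collections.defaultdict`, which removes the need to test whether an hour has been seen before. This is a sketch under the same assumptions about the column layout (comments at index 4, creation time at index 6); `comments_per_hour` is a hypothetical helper name and the three sample rows are made up for the demonstration.

```python
import datetime as dt
from collections import defaultdict


def comments_per_hour(posts, created_index=6, comments_index=4):
    """Return (post counts, comment totals), both keyed by two-digit hour."""
    counts = defaultdict(int)
    comments = defaultdict(int)
    for row in posts:
        created = dt.datetime.strptime(row[created_index], "%m/%d/%Y %H:%M")
        hour = created.strftime("%H")  # e.g. '02' for 2:53
        counts[hour] += 1
        comments[hour] += int(row[comments_index])
    return dict(counts), dict(comments)


# Illustrative rows: two posts created in hour 02 and one in hour 01.
sample_posts = [
    ['1', 'Ask HN: first', '', '4', '7', 'user_a', '9/26/2016 2:53'],
    ['2', 'Ask HN: second', '', '1', '5', 'user_b', '9/25/2016 2:10'],
    ['3', 'Ask HN: third', '', '6', '3', 'user_c', '9/26/2016 1:17'],
]

counts, comments = comments_per_hour(sample_posts)
print(counts)
print(comments)
```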


With these two dictionaries, we can calculate the average number of comments for posts created in each hour of the day. The following code produces a list of lists, each holding an hour and its average.

In [8]:
avg_by_hour=[]
for hour in counts_by_hour:
    avg_by_hour.append([hour, comments_by_hour[hour]/counts_by_hour[hour]])
print(avg_by_hour)    

[['02', 11.137546468401487], ['01', 7.407801418439717], ['22', 8.804177545691905], ['21', 8.687258687258687], ['19', 7.163043478260869], ['17', 9.449744463373083], ['15', 28.676470588235293], ['14', 9.692007797270955], ['13', 16.31756756756757], ['11', 8.96474358974359], ['10', 10.684397163120567], ['09', 6.653153153153153], ['07', 7.013274336283186], ['03', 7.948339483394834], ['23', 6.696793002915452], ['20', 8.749019607843136], ['16', 7.713298791018998], ['08', 9.190661478599221], ['00', 7.5647840531561465], ['18', 7.94299674267101], ['12', 12.380116959064328], ['04', 9.7119341563786], ['06', 6.782051282051282], ['05', 8.794258373205741]]


We have obtained the average number of comments per ask post for each hour, but the result is difficult to read. To fix this, we sort the result in descending order. Because sorted compares lists element by element and each inner list of avg_by_hour starts with the hour, we first swap the columns so that the average comes first and the sort ranks the hours by average number of comments.

In [9]:

swap_avg_by_hour=[]
for row in avg_by_hour:
    swap_avg_by_hour.append([row[1],row[0]])
print(swap_avg_by_hour)
print('\n')

sorted_wap=sorted(swap_avg_by_hour, reverse = True)
print('Top 5 Hours for Ask Posts Comments:')
print(sorted_wap[0:5])
print('\n')
for row in sorted_wap[0:5]:
    avg=row[0]
    hour=row[1]
    hour_object=dt.datetime.strptime(hour, '%H')
    hour_format=dt.datetime.strftime(hour_object, '%H:%M')
    result='{}: {:.2f} average comments per post'.format(hour_format, avg)
    print(result)

[[11.137546468401487, '02'], [7.407801418439717, '01'], [8.804177545691905, '22'], [8.687258687258687, '21'], [7.163043478260869, '19'], [9.449744463373083, '17'], [28.676470588235293, '15'], [9.692007797270955, '14'], [16.31756756756757, '13'], [8.96474358974359, '11'], [10.684397163120567, '10'], [6.653153153153153, '09'], [7.013274336283186, '07'], [7.948339483394834, '03'], [6.696793002915452, '23'], [8.749019607843136, '20'], [7.713298791018998, '16'], [9.190661478599221, '08'], [7.5647840531561465, '00'], [7.94299674267101, '18'], [12.380116959064328, '12'], [9.7119341563786, '04'], [6.782051282051282, '06'], [8.794258373205741, '05']]


Top 5 Hours for Ask Posts Comments:
[[28.676470588235293, '15'], [16.31756756756757, '13'], [12.380116959064328, '12'], [11.137546468401487, '02'], [10.684397163120567, '10']]


15:00: 28.68 average comments per post
13:00: 16.32 average comments per post
12:00: 12.38 average comments per post
02:00: 11.14 average comments per post
10:00: 10.68 a

We conclude that, to have a better chance of receiving comments, one should create an ask post during the hours starting at 15:00, 13:00, 12:00, 02:00 or 10:00.
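As an alternative to swapping the columns, `sorted` accepts a `key` function that selects the value to sort on. The sketch below applies it to four of the hourly averages computed above; `avg_by_hour_sample` is a name used only in this example. One difference from the swap approach: hours with exactly equal averages keep their original relative order instead of being ordered by hour.

```python
# A few [hour, average] pairs taken from avg_by_hour above.
avg_by_hour_sample = [
    ['02', 11.137546468401487],
    ['15', 28.676470588235293],
    ['13', 16.31756756756757],
    ['09', 6.653153153153153],
]

# Sort on the second element (the average), highest first.
top = sorted(avg_by_hour_sample, key=lambda pair: pair[1], reverse=True)

for hour, avg in top:
    print('{}:00: {:.2f} average comments per post'.format(hour, avg))
```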

# Conclusion

In this project, we explored a data set of posts on the website Hacker News. We found that posts starting with 'Ask HN' attract more comments on average than posts starting with 'Show HN'. Moreover, by calculating the average number of comments per ask post for each hour of the day, we conclude that, to have a better chance of receiving comments, one should create a post during the hours starting at 15:00, 13:00, 12:00, 02:00 or 10:00.