
# Exploring Hacker News Posts

In this project, we'll compare two types of Hacker News posts: those whose titles begin with Ask HN and those whose titles begin with Show HN.

Our focus is on answering the two questions below:

    Do Ask HN or Show HN posts receive more comments on average?
    Do posts created at a certain time receive more comments on average?

In [1]:
from csv import reader

In [2]:
opened_file = open("HN_posts_year_to_Sep_26_2016.csv", encoding="utf8")
hn_r = reader(opened_file)
hn = list(hn_r)
opened_file.close()  # every row is now in memory, so the file can be closed

In [3]:
hn[:5]

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'],
 ['12579008',
  'You have two days to comment if you want stem cells to be classified as your own',
  'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018',
  '1',
  '0',
  'altstar',
  '9/26/2016 3:26'],
 ['12579005',
  'SQLAR  the SQLite Archiver',
  'https://www.sqlite.org/sqlar/doc/trunk/README.md',
  '1',
  '0',
  'blacksqr',
  '9/26/2016 3:24'],
 ['12578997',
  'What if we just printed a flatscreen television on the side of our boxes?',
  'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43',
  '1',
  '0',
  'pavel_lishin',
  '9/26/2016 3:19'],
 ['12578989',
  'algorithmic music',
  'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext',
  '1',
  '0',
  'poindontcare',
  '9/26/2016 3:16']]

To analyze the data, we first need to remove the header row, which holds the column names rather than a post.

In [4]:
headers = hn[0]
del hn[0]  # drop the header row so that hn contains only posts
hn[:5]

[['12579008',
  'You have two days to comment if you want stem cells to be classified as your own',
  'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018',
  '1',
  '0',
  'altstar',
  '9/26/2016 3:26'],
 ['12579005',
  'SQLAR  the SQLite Archiver',
  'https://www.sqlite.org/sqlar/doc/trunk/README.md',
  '1',
  '0',
  'blacksqr',
  '9/26/2016 3:24'],
 ['12578997',
  'What if we just printed a flatscreen television on the side of our boxes?',
  'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43',
  '1',
  '0',
  'pavel_lishin',
  '9/26/2016 3:19'],
 ['12578989',
  'algorithmic music',
  'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext',
  '1',
  '0',
  'poindontcare',
  '9/26/2016 3:16'],
 ['12578979',
  'How the Data Vault Enables the Next-Gen Data Warehouse and Data Lake',
  'https://www.talend.com/blog/2016/05/12/talend-and-Â\x93the-data-vaultÂ\x94',
  '1',
  '0',
  'markgainor1',
  '9/26/2016 3:14']]
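
As an alternative sketch, the file can be read inside a `with` block, which closes it automatically, and the header can be split from the data rows in one step. The example below does not touch the real CSV file: it uses an in-memory `io.StringIO` holding two rows copied from the output above. The names `sample_csv`, `header`, and `rows` are illustrative and not part of the original notebook.

```python
import io
from csv import reader

# Stand-in for the real CSV file: the header plus two rows taken from the output above.
sample_csv = io.StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12579005,SQLAR  the SQLite Archiver,"
    "https://www.sqlite.org/sqlar/doc/trunk/README.md,1,0,blacksqr,9/26/2016 3:24\n"
    "12578989,algorithmic music,"
    "http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext,"
    "1,0,poindontcare,9/26/2016 3:16\n"
)

# The with block closes the file-like object once reading is done.
with sample_csv as f:
    # Starred unpacking: the first row goes to header, the remaining rows to a list.
    header, *rows = reader(f)

print(header)
print(len(rows))
```

With the real dataset, `open("HN_posts_year_to_Sep_26_2016.csv", encoding="utf8")` would presumably take the place of `sample_csv`.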

We are only interested in Ask HN and Show HN posts. The code below separates them from all other posts by checking how each title starts, ignoring case.

In [5]:
ask_posts = []
show_posts = []
other_posts = []

for item in hn:
    title = item[1]
    if title.lower().startswith("ask hn"):
        ask_posts.append(item)
    elif title.lower().startswith("show hn"):
        show_posts.append(item)
    else:
        other_posts.append(item)


In [6]:
print("Number of posts with ASK HN is " + str(len(ask_posts)))
print("\n")
print("Number of posts with SHOW HN is " + str(len(show_posts)))
print("\n")
print("Number of OTHER posts " + str(len(other_posts)))


Number of posts with ASK HN is 9139


Number of posts with SHOW HN is 10158


Number of OTHER posts 273822
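
The prefix check used for the split can also be sketched as a small function, which makes the rule easy to try on individual titles. This is only an illustration: `classify_title` and the sample titles are hypothetical and do not come from the dataset.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the title prefix (case-insensitive)."""
    lowered = title.lower()
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"


# Made-up titles, used only to illustrate the three outcomes.
sample_titles = [
    "Ask HN: How do you learn?",
    "SHOW HN: My side project",
    "algorithmic music",
]
labels = [classify_title(t) for t in sample_titles]
print(labels)
```

Lowercasing the title first means that "Ask HN", "ASK HN", and "ask hn" all land in the same group.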


Let's check whether Ask HN and Show HN posts get more comments on average than other posts. We start with the average across all posts as a baseline.

In [12]:
total__comments = 0
for item in hn:
    total__comments += int(item[4])
    
avg__comments = total__comments /len(hn)
print("Average Posts' comments are " + str(avg__comments))

Average Posts' comments are 6.5255442328883495


In [9]:
total_ask_comments = 0
for item in ask_posts:
    total_ask_comments += int(item[4])
    
avg_ask_comments = total_ask_comments /len(ask_posts)
print("Average ASK HN Posts' comments are " + str(avg_ask_comments))

Average ASK HN Posts' comments are 10.393478498741656


In [10]:
total_show_comments = 0
for item in show_posts:
    total_show_comments += int(item[4])
    
avg_show_comments = total_show_comments /len(show_posts)
print("Average SHOW HN Posts' comments are " + str(avg_show_comments))

Average SHOW HN Posts' comments are 4.886099625910612


In [11]:
total_other_comments = 0
for item in other_posts:
    total_other_comments += int(item[4])
    
avg_other_comments = total_other_comments /len(other_posts)
print("Average OTHER Posts' comments are " + str(avg_other_comments))

Average OTHER Posts' comments are 6.4572678601427205
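
The four cells above repeat the same loop on different lists. As a possible simplification, the calculation could be wrapped in one helper. The function name `average_comments`, its `comments_index` parameter, and the sample rows are assumptions made for this sketch; the rows are made up and only follow the dataset's column order.

```python
def average_comments(posts, comments_index=4):
    """Mean of the num_comments column; returns 0.0 for an empty list."""
    if not posts:
        return 0.0
    return sum(int(row[comments_index]) for row in posts) / len(posts)


# Tiny made-up rows in the dataset's column order; only index 4 matters here.
sample_posts = [
    ["1", "Ask HN: a", "", "1", "6", "user_a", "9/26/2016 2:53"],
    ["2", "Ask HN: b", "", "1", "0", "user_b", "9/26/2016 1:17"],
    ["3", "Ask HN: c", "", "1", "3", "user_c", "9/25/2016 22:57"],
]
print(average_comments(sample_posts))
```

In the notebook, calls such as `average_comments(ask_posts)` and `average_comments(show_posts)` would then replace the separate loops.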


Ask HN posts receive more comments on average (about 10.39) than posts overall (about 6.53), while Show HN posts receive fewer (about 4.89). This seems plausible, since readers may tend to pass over posts in which others advertise or show off their work.

Since Ask HN posts are the ones more likely to attract comments, we now want to determine whether they receive more comments when created at a particular hour of the day.

In [13]:
from datetime import datetime as dt

In [48]:
result_list = []
for item in ask_posts:
    result_list.append([dt.strptime(item[6], "%m/%d/%Y %H:%M"), int(item[4])])
result_list[:5]

[[datetime.datetime(2016, 9, 26, 2, 53), 7],
 [datetime.datetime(2016, 9, 26, 1, 17), 3],
 [datetime.datetime(2016, 9, 25, 22, 57), 0],
 [datetime.datetime(2016, 9, 25, 22, 48), 3],
 [datetime.datetime(2016, 9, 25, 21, 50), 2]]
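
To make the format string easier to follow, here is a minimal sketch that parses a single `created_at` value of the shape seen in the dataset and reads the hour back out. The variable names are illustrative.

```python
from datetime import datetime

# Format codes: %m month, %d day, %Y four-digit year, %H hour (24-hour clock), %M minute.
created_at = "9/26/2016 2:53"
parsed = datetime.strptime(created_at, "%m/%d/%Y %H:%M")

print(parsed.hour)            # the hour as an integer
print(parsed.strftime("%H"))  # the hour as a zero-padded string
```

`strptime` accepts values without leading zeros, such as `9` for the month, while `strftime("%H")` always returns two digits.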

__Next, we build two frequency tables keyed by the hour of posting: the number of posts and the total number of comments.__

In [54]:
counts_by_hour = {}
comments_by_hour = {}

for item in result_list:
    hour = item[0].strftime("%H")
    
    if hour in counts_by_hour:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += item[1]
    else:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = item[1]

In [57]:
avg_by_hour = []
for item in counts_by_hour:
    avg_by_hour.append([item, comments_by_hour[item]/counts_by_hour[item]])

avg_by_hour
    

[['02', 11.137546468401487],
 ['01', 7.407801418439717],
 ['22', 8.804177545691905],
 ['21', 8.687258687258687],
 ['19', 7.163043478260869],
 ['17', 9.449744463373083],
 ['15', 28.676470588235293],
 ['14', 9.692007797270955],
 ['13', 16.31756756756757],
 ['11', 8.96474358974359],
 ['10', 10.684397163120567],
 ['09', 6.653153153153153],
 ['07', 7.013274336283186],
 ['03', 7.948339483394834],
 ['23', 6.696793002915452],
 ['20', 8.749019607843136],
 ['16', 7.713298791018998],
 ['08', 9.190661478599221],
 ['00', 7.5647840531561465],
 ['18', 7.94299674267101],
 ['12', 12.380116959064328],
 ['04', 9.7119341563786],
 ['06', 6.782051282051282],
 ['05', 8.794258373205741]]
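
The same per-hour tally can be sketched with `collections.defaultdict`, which removes the need for the `if`/`else` branch because missing keys start at zero. The pairs in `sample` are illustrative rather than the full dataset, and the variable names are assumptions made for this example.

```python
from collections import defaultdict
from datetime import datetime

# Illustrative (created_at, num_comments) pairs, not the full dataset.
sample = [
    (datetime(2016, 9, 26, 2, 53), 7),
    (datetime(2016, 9, 26, 2, 10), 3),
    (datetime(2016, 9, 25, 22, 57), 0),
]

counts = defaultdict(int)    # posts per hour
comments = defaultdict(int)  # total comments per hour

for created, num_comments in sample:
    hour = created.strftime("%H")
    counts[hour] += 1
    comments[hour] += num_comments

# Average comments per post for each hour that appears in the sample.
averages = {hour: comments[hour] / counts[hour] for hour in counts}
print(averages)
```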

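The list is easier to read when ordered by average comments. To sort it, we first swap the two columns so that the average comes first, then sort in descending order.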

In [59]:
swap_avg_by_hour = []
for item in avg_by_hour:
    swap_avg_by_hour.append([item[1], item[0]])
print(swap_avg_by_hour)

[[11.137546468401487, '02'], [7.407801418439717, '01'], [8.804177545691905, '22'], [8.687258687258687, '21'], [7.163043478260869, '19'], [9.449744463373083, '17'], [28.676470588235293, '15'], [9.692007797270955, '14'], [16.31756756756757, '13'], [8.96474358974359, '11'], [10.684397163120567, '10'], [6.653153153153153, '09'], [7.013274336283186, '07'], [7.948339483394834, '03'], [6.696793002915452, '23'], [8.749019607843136, '20'], [7.713298791018998, '16'], [9.190661478599221, '08'], [7.5647840531561465, '00'], [7.94299674267101, '18'], [12.380116959064328, '12'], [9.7119341563786, '04'], [6.782051282051282, '06'], [8.794258373205741, '05']]


In [62]:
sorted_swap = sorted(swap_avg_by_hour, reverse=True)
print(sorted_swap[:5])

[[28.676470588235293, '15'], [16.31756756756757, '13'], [12.380116959064328, '12'], [11.137546468401487, '02'], [10.684397163120567, '10']]


In [63]:
for item in sorted_swap[:5]:
    hour = dt.strptime(item[1],"%H")
    hour = dt.strftime(hour,"%H:%M")
    print(hour + ": {:.2f} average comments per post".format(item[0]))

15:00: 28.68 average comments per post
13:00: 16.32 average comments per post
12:00: 12.38 average comments per post
02:00: 11.14 average comments per post
10:00: 10.68 average comments per post
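
As an alternative to swapping the columns, `sorted` accepts a `key` function, so the `[hour, average]` pairs can be ordered by their second element directly. The sketch below uses four pairs copied from the `avg_by_hour` output above; `avg_by_hour_sample` and `top` are illustrative names.

```python
# A few [hour, average] pairs copied from the avg_by_hour output above.
avg_by_hour_sample = [
    ["02", 11.137546468401487],
    ["15", 28.676470588235293],
    ["13", 16.31756756756757],
    ["09", 6.653153153153153],
]

# Sort by the average (second element), highest first, without swapping columns.
top = sorted(avg_by_hour_sample, key=lambda pair: pair[1], reverse=True)

for hour, avg in top[:3]:
    print("{}:00: {:.2f} average comments per post".format(hour, avg))
```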


Ask HN posts created at 15:00 receive the most comments on average (28.68 per post), well ahead of 13:00 (16.32) and 12:00 (12.38). __This suggests that posts created during the afternoon hours of several highly populated countries are more likely to receive comments.__